class: inverse, center, middle # A History Lesson --- # Cholera - In 1854 there was a breakout of the cholera disease in London killing 616 people. - At the time it was speculated that the disease was carried in the air. - A physician called John Snow was sceptical and began to collect data. --- # Snow's map <img src="img/Snow-cholera-map.jpg" width="3120" style="display: block; margin: auto;" /> --- # Snow's map <img src="img/Snow_Cholera.svg" style="display: block; margin: auto;" /> --- # Consequences - The map showed that cholera was more prevalent around a water pump on Broad Street. - The pump was closed down. - Eventually it was established that cholera is a water-borne disease. - Data visualisation saves lives! --- # Crimean War - At the same time Great Britain was at war against Russia in the Crimean peninsula. - Florence Nightingale is famous as a nurse who treated the wounded soldiers. - She also advocated British Parliament for more sanitary conditions in military hospitals. - She knew the power of using data visualisation. --- # Nightingale's Rose chart <img src="img/florose.jpg" width="2880" style="display: block; margin: auto;" /> --- # Nightingale's Rose chart <img src="img/nightingale_rose_chart.svg" style="display: block; margin: auto;" /> - Blue areas: Preventible deaths. - Red areas: Deaths from battle wounds. - Black areas: Other causes. --- # Aftermath - The improved sanitation at military hospitals was eventually implemented in civilian hospitals. - Data visualisation saves lives. - Florence Nightingale became the first female member of the Royal Statistical Society. --- # Napoleon - In 1812 Napoleon thought it was a good idea to invade Russia. - This campaign was a disaster for the French. - Engineer Charles Joseph Minard captured the extent of this catastrophe using visualisation. --- # Minard's plot <img src="img/Minard_Map.svg" style="display: block; margin: auto;" /> --- # Minard's plot - This visualisation provides information on 6 variables in one chart. - Number of troops. - Whether troops advance or retreat. - Temperature and time. - Longitude and latitude. - Despite the clear message that invading Russia in winter is a bad idea, some people did not learn this lesson. --- # Why visualisation? - Gain insights from data. - Overview of large datasets. - Search for: + Trends + Relationships + Irregularites - In business data visualisation is a crucial tool to support decision making. --- # Tesla Motors - Tesla vehicles collect a large number of data from sensors. -- - The plot on the next slide shows tyre pressure over time -- - This visualisation was used to -- + Check pressure when vehicles left factory, -- + See how long customers take to respond to a low pressure alert, -- + Do predictive modelling on when tyres go flat. --- # Tesla Motors <img src="img/tesla.png" width="1600" style="display: block; margin: auto;" /> You can read more about the case study [here](https://hbr.org/2016/06/visualizations-that-really-work). --- class: inverse, center, middle # Plotting Principles --- # Tufte's principles - Principles of good practice in data visualisation are outlined in *The Visual Display of Quantitative Information* by **Edward Tufte**. These include: - Avoid distorting what the data have to say - Present many numbers in a small space - Make large data sets coherent - Encourage the eye to compare different pieces of data --- # Bad plots - Tufte also provides a catalog of bad plots. - What makes these plots *bad* can be put into three categories. - Taste (Aesthetic) - Perceptual - Data --- class: inverse, center, middle # Bad Taste --- # An ugly plot <img src="img/monster.jpg" width="1387" style="display: block; margin: auto;" /> --- # Chartjunk - Chartjunk is the inclusion of elements that are not necessary to communicate the information. - The inclusion of the following can be considered chartjunk: + Heavy gridlines. + Unnecessary text. + Pictures within the chart. --- # Another example <img src="img/Chartjunk-example.svg" style="display: block; margin: auto;" /> --- # Guidelines Aim for a high Data Ink Ratio `$$\mbox{Data Ink ratio}=\frac{\mbox{Ink used to display data}}{\mbox{Ink used in graphic}}$$` Also aim for a high Data Density `$$\mbox{Data Density}=\frac{\mbox{Number of data points}}{\mbox{Area of graphic}}$$` If data density is small, perhaps use a table --- #Low data density <img src="img/sword.jpg" width="428" style="display: block; margin: auto;" /> --- # Bad... but not misleading - Note that although the previous plots look bad, strictly speaking they do not mislead. - Also maximising data-ink ratio should be seen as a guideline rather than a strict rule. - For instance the default background for `ggplot2` is arguably chartjunk - There are good reasons for using it. --- # ggplot2 <img src="GoodBadPlot_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> --- # Wickham on the grey background *"We can still see the gridlines to aid in the judgement of position (Cleveland, 1993b), but they have little visual impact and we can easily "tune" them out... Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity"* ggplot2: Elegant Graphics for Data Analyis --- class: inverse, center, middle # Bad Perception --- # What can we perceive? - Human perception is a broad field that takes in ideas from psychology and philosophy. - For data visualisation we can perceive: - Length - Area - Volume - Shape - Position - Color - Angle --- #Errors of perception - Data visualisation is all about mapping data to things we can perceive. - This should not be done in a way that is innacurate or misleading. - The following plots provide some examples of what can go wrong. --- # Confusing length and area <img src="img/mac.jpg" width="652" style="display: block; margin: auto;" /> --- # Confusing length and area - On the previous plot the number of customers is represented by length (height of computer) - However the *area* of the 2D pictures of computers scale up more than their *heights*. - Also the picture leads us to imagine a 3D computer making this effect worse. - The value for *Mac* is only about 3 to 4 times more than for *None* but we perceive the difference to be much more. --- # Beware 3D <img src="img/seo-traffic-columns.gif" style="display: block; margin: auto;" /> --- # Beware 3D <img src="img/seo-traffic-pie-chart.gif" style="display: block; margin: auto;" /> --- # Beware 3D - Difficult to line up the heights of bars with the actual values - Closer green bar (MSN) looks bigger. - On the pie chart rendering in 3D makes the blue segment (Google) look the biggest. - Do not use three dimensions when two will work well. ??? https://infolific.com/technology/internet/seo-lie-factor/ --- # Lie Factor The lie factor is given by `$$\mbox{Lie factor}=\frac{\mbox{Size of effect in graph}}{\mbox{Size of effect in data}}$$` - The lie factor should be 1. --- # Road miles (from Tufte) <img src="img/Lie_factor_example1_image.jpg" width="731" style="display: block; margin: auto;" /> --- #Effects - The data says that mileage rose from 18 to 27.5 which is a 53% increase. - The line on the graph increases from 0.6 inches to 5.3 inches which is a 783% increase! - The lie factor is `\(783/53\approx 14\)` --- class: inverse, middle, center # Bad Data --- # Bad Data - Sometimes there is nothing wrong with the plot but with the data. - On the following slide is a plot comparing the cost of going to college in the US against the salaries of college graduates. - Can you find problems with this graph? --- # College cost <img src="img/diminishing-return.jpg" width="824" style="display: block; margin: auto;" /> --- # Problems - There is nothing *incorrect* about this graph. - However the message is misleading. - The income is a yearly income while the cost of college is over four years (and only paid once). - Also it does not show the income of people who are not college graduates. - Think carefully about comparisons on a plot. - Make sure your conclusions align with what is in the plot. --- class: inverse, center, middle # The x and y axis --- # The y-axis - Watch this [video](https://www.youtube.com/watch?v=14VYnFhBKcY). - Are we interested in the size of the variable rather than changes in the variable? - Is zero a reasonable value for the variable to take? - Are we using a bar chart? - Answering yes to these questions means we should give more consideration to including zero on the y-axis. --- # Stock Prices <img src="img/twitter_short.png" width="468" style="display: block; margin: auto;" /> From this graph we conclude that Twitter stock prices increased **dramatically** on April 26. --- # A longer term view <img src="img/twitter_long.png" width="468" style="display: block; margin: auto;" /> Not that dramatic anymore. --- class: inverse, middle center # More bad plots --- # Electrolux <img src="img/Consultantsmind-Electrolux-Profit-then-Growth.png" width="548" style="display: block; margin: auto;" /> ??? Data-to-ink ratio is zero --- # McKesson <img src="img/Consultantsmind-McKesson-Graph-Forecast.png" width="605" style="display: block; margin: auto;" /> ??? Forecast not in different colour --- # McKinsey <img src="img/Consultantsmind-McKinsey-Gig-Economy-Messy-Slide.jpg" width="1381" style="display: block; margin: auto;" /> --- # Pizza <img src="img/Its-a-pizza-Chart-Not-a-Pie-chart-1.png" width="399" style="display: block; margin: auto;" /> ??? Chart junk numbers do not add up --- # Pie chart <img src="img/Misleading_Pie_Chart.png" width="552" style="display: block; margin: auto;" /> ??? 3d makes one slice look small --- # Climate Change <img src="img/Nature-climate-change.png" width="690" style="display: block; margin: auto;" /> --- #Season <img src="img/season.png" width="1204" style="display: block; margin: auto;" /> ??? lie factor --- # Narcotics <img src="img/display.jpeg" width="853" style="display: block; margin: auto;" /> --- # Summary - Graphs can be misleading - The default options in ggplot are chosen to protect the user from errors of taste and errors of perception. - Nothing protects you from using bad or misleading data... - ...except for your own common sense