A History Lesson

1

Cholera

  • In 1854 there was an outbreak of cholera in London that killed 616 people.
  • At the time it was widely believed that the disease was carried in the air.
  • A physician called John Snow was sceptical and began to collect data.
2

Snow's map

3

Snow's map

4

Consequences

  • The map showed that cholera was more prevalent around a water pump on Broad Street.
  • The pump was closed down.
  • Eventually it was established that cholera is a water-borne disease.
  • Data visualisation saves lives!
5

Crimean War

  • Around the same time Great Britain was at war with Russia on the Crimean peninsula.
  • Florence Nightingale is famous as a nurse who treated the wounded soldiers.
  • She also lobbied the British Parliament for more sanitary conditions in military hospitals.
  • She knew the power of using data visualisation.
6

Nightingale's Rose chart

7

Nightingale's Rose chart

  • Blue areas: Preventable deaths.
  • Red areas: Deaths from battle wounds.
  • Black areas: Other causes.
8

Aftermath

  • The improved sanitation at military hospitals was eventually implemented in civilian hospitals.
  • Data visualisation saves lives.
  • Florence Nightingale became the first female member of the Royal Statistical Society.
9

Napoleon

  • In 1812 Napoleon thought it was a good idea to invade Russia.
  • This campaign was a disaster for the French.
  • Engineer Charles Joseph Minard captured the extent of this catastrophe using visualisation.
10

Minard's plot

11

Minard's plot

  • This visualisation provides information on 6 variables in one chart.
    • Number of troops.
    • Whether troops advance or retreat.
    • Temperature and time.
    • Longitude and latitude.
  • Despite the clear message that invading Russia in winter is a bad idea, some people did not learn this lesson.
12

Why visualisation?

  • Gain insights from data.
  • Overview of large datasets.
  • Search for:
    • Trends
    • Relationships
    • Irregularities
  • In business, data visualisation is a crucial tool to support decision making.
13

Tesla Motors

  • Tesla vehicles collect a large amount of data from sensors.
  • The plot on the next slide shows tyre pressure over time.
  • This visualisation was used to:
    • Check pressure when vehicles left the factory,
    • See how long customers take to respond to a low-pressure alert,
    • Do predictive modelling of when tyres go flat.
14

Tesla Motors

You can read more about the case study here.

15

Plotting Principles

16

Tufte's principles

  • Principles of good practice in data visualisation are outlined in The Visual Display of Quantitative Information by Edward Tufte. These include:
    • Avoid distorting what the data have to say
    • Present many numbers in a small space
    • Make large data sets coherent
    • Encourage the eye to compare different pieces of data
17

Bad plots

  • Tufte also provides a catalog of bad plots.
  • What makes these plots bad can be put into three categories.
    • Taste (Aesthetic)
    • Perceptual
    • Data
18

Bad Taste

19

An ugly plot

20

Chartjunk

  • Chartjunk is the inclusion of elements that are not necessary to communicate the information.
  • The inclusion of the following can be considered chartjunk:
    • Heavy gridlines.
    • Unnecessary text.
    • Pictures within the chart.
21

Another example

22

Guidelines

Aim for a high Data Ink Ratio

$$\text{Data-ink ratio} = \frac{\text{Ink used to display data}}{\text{Ink used in graphic}}$$

Also aim for a high Data Density

$$\text{Data density} = \frac{\text{Number of data points}}{\text{Area of graphic}}$$

If data density is low, perhaps use a table instead.
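As a rough illustration of the two guidelines, here is a minimal Python sketch. The function names and all the numbers (ink measured in pixels, a graphic of 10 cm by 8 cm showing 4 data points) are hypothetical and chosen only to show the arithmetic; they do not come from any plot in these slides.

```python
def data_ink_ratio(data_ink: float, total_ink: float) -> float:
    """Share of the ink in a graphic that displays data (between 0 and 1)."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    return data_ink / total_ink


def data_density(n_points: int, area: float) -> float:
    """Number of data points per unit area of the graphic."""
    if area <= 0:
        raise ValueError("area must be positive")
    return n_points / area


# Hypothetical graphic: 30 pixels of ink show data out of 120 pixels of ink in total.
ratio = data_ink_ratio(data_ink=30, total_ink=120)

# Hypothetical graphic: 4 data points drawn on a 10 cm x 8 cm area.
density = data_density(n_points=4, area=10 * 8)

print(f"Data-ink ratio: {ratio:.2f}")
print(f"Data density: {density:.3f} points per square cm")
```

A graphic with so few points per square centimetre is the kind of case where a small table may communicate the same numbers more directly.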

23

Low data density

24

Bad... but not misleading

  • Note that although the previous plots look bad, strictly speaking they do not mislead.
  • Also maximising data-ink ratio should be seen as a guideline rather than a strict rule.
  • For instance, the default grey background for ggplot2 is arguably chartjunk.
  • However, there are good reasons for using it.
25

ggplot2

26

Wickham on the grey background

"We can still see the gridlines to aid in the judgement of position (Cleveland, 1993b), but they have little visual impact and we can easily "tune" them out... Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity"

ggplot2: Elegant Graphics for Data Analysis

27

Bad Perception

28

What can we perceive?

  • Human perception is a broad field that takes in ideas from psychology and philosophy.
  • For data visualisation we can perceive:
    • Length
    • Area
    • Volume
    • Shape
    • Position
    • Colour
    • Angle
29

Errors of perception

  • Data visualisation is all about mapping data to things we can perceive.
  • This should not be done in a way that is inaccurate or misleading.
  • The following plots provide some examples of what can go wrong.
30

Confusing length and area

31

Confusing length and area

  • On the previous plot the number of customers is represented by length (the height of each computer).
  • However, the areas of the 2D pictures of computers scale up faster than their heights.
  • Also, the picture leads us to imagine a 3D computer, making this effect worse.
  • The value for Mac is only about 3 to 4 times that for None, but we perceive the difference to be much greater.
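The effect described above can be sketched in a few lines of Python. This assumes the picture is scaled uniformly, so that its width (and imagined depth) grow by the same factor as its height; the function name `perceived_scale` is hypothetical.

```python
def perceived_scale(height_ratio: float) -> dict:
    """How much bigger a uniformly scaled picture looks, depending on
    whether the viewer reads its length, its area, or its implied volume."""
    return {
        "length": height_ratio,       # what the data actually say
        "area": height_ratio ** 2,    # what a 2D picture shows
        "volume": height_ratio ** 3,  # what an imagined 3D object suggests
    }


# The value for Mac is roughly 3 to 4 times the value for None.
for k in (3.0, 4.0):
    scale = perceived_scale(k)
    print(f"height x{k:.0f}: area x{scale['area']:.0f}, volume x{scale['volume']:.0f}")
```

A height ratio of 3 to 4 therefore becomes an area ratio of 9 to 16, and a volume ratio of 27 to 64 if the viewer imagines a solid object.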
32

Beware 3D

33

Beware 3D

34

Beware 3D

  • Difficult to line up the heights of bars with the actual values
  • Closer green bar (MSN) looks bigger.
  • On the pie chart, rendering in 3D makes the blue segment (Google) look the biggest.
  • Do not use three dimensions when two will work well.
35

Lie Factor

The lie factor is given by

$$\text{Lie factor} = \frac{\text{Size of effect in graph}}{\text{Size of effect in data}}$$

  • The lie factor should be 1.
36

Road miles (from Tufte)

37

Effects

  • The data says that mileage rose from 18 to 27.5 which is a 53% increase.
  • The line on the graph increases from 0.6 inches to 5.3 inches which is a 783% increase!
  • The lie factor is 783/53 ≈ 14.8.
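The arithmetic on this slide can be checked with a short Python sketch. The function names are hypothetical; the four input numbers are the ones quoted above.

```python
def percent_change(start: float, end: float) -> float:
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100.0


def lie_factor(graph_start: float, graph_end: float,
               data_start: float, data_end: float) -> float:
    """Size of the effect shown in the graph divided by the size of the effect in the data."""
    return percent_change(graph_start, graph_end) / percent_change(data_start, data_end)


data_effect = percent_change(18.0, 27.5)   # mileage standard
graph_effect = percent_change(0.6, 5.3)    # line length in inches
factor = lie_factor(0.6, 5.3, 18.0, 27.5)

print(f"Effect in data:  {data_effect:.0f}%")
print(f"Effect in graph: {graph_effect:.0f}%")
print(f"Lie factor:      {factor:.1f}")
```

The graph overstates the change in the data by a factor of roughly 15, far from the ideal value of 1.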
38

Bad Data

39

Bad Data

  • Sometimes there is nothing wrong with the plot but with the data.
  • On the following slide is a plot comparing the cost of going to college in the US against the salaries of college graduates.
  • Can you find problems with this graph?
40

College cost

41

Problems

  • There is nothing incorrect about this graph.
  • However the message is misleading.
  • The income is a yearly income while the cost of college is over four years (and only paid once).
  • Also it does not show the income of people who are not college graduates.
  • Think carefully about comparisons on a plot.
  • Make sure your conclusions align with what is in the plot.
42

The x and y axes

43

The y-axis

  • Watch this video.
  • Are we interested in the size of the variable rather than changes in the variable?
  • Is zero a reasonable value for the variable to take?
  • Are we using a bar chart?
  • Answering yes to these questions means we should give more consideration to including zero on the y-axis.
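To see why the choice of baseline matters for bar charts, consider this minimal Python sketch. The values 50 and 52 and the truncated baseline of 49 are hypothetical, and `apparent_ratio` is an illustrative helper rather than part of any plotting library.

```python
def apparent_ratio(small: float, large: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn bar heights when the y-axis starts at `baseline`."""
    if baseline >= small:
        raise ValueError("baseline must be below the smaller value")
    return (large - baseline) / (small - baseline)


# Hypothetical values for two bars.
small, large = 50.0, 52.0

with_zero = apparent_ratio(small, large, baseline=0.0)    # bars drawn from zero
truncated = apparent_ratio(small, large, baseline=49.0)   # y-axis starts at 49

print(f"Bars from zero:  larger bar is {with_zero:.2f} times as tall")
print(f"Axis cut at 49:  larger bar is {truncated:.2f} times as tall")
```

With the axis starting at zero, the bars differ by 4%, which matches the data. With the axis cut at 49, the larger bar is drawn three times as tall as the smaller one, so a reader comparing sizes is misled.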
44

Stock Prices

From this graph we conclude that Twitter stock prices increased dramatically on April 26.

45

A longer term view

Not that dramatic anymore.

46

More bad plots

47

Electrolux

48

Data-to-ink ratio is zero

McKesson

49

Forecast not in different colour

McKinsey

50

Pizza

51

Chartjunk; numbers do not add up

Pie chart

52

3D makes one slice look small

Climate Change

53

Season

54

Lie factor

Narcotics

55

Summary

  • Graphs can be misleading
  • The default options in ggplot2 are chosen to protect the user from errors of taste and errors of perception.
  • Nothing protects you from using bad or misleading data...
  • ...except for your own common sense
56
