class: center, middle, inverse, title-slide

.title[
# Week 1: Introduction
]
.author[
### Visual Data Analytics
]
.institute[
### University of Sydney
]

---

# Outline

- Why visualisation?
- Good and bad visualisation.
- Building a narrative with visualisation.
- Tools for visualisation.

---

class: center, middle, inverse

# Why Visualisation

---

# Visual Data Analytics (VDA)

- The entire process leading to data graphing.

--

- Encompasses the preparation of data for graphing and exploratory data analysis methods.

--

- Not simply about making ‘pretty pictures’ but about comprehending features of the data that are otherwise hidden by summary statistics.

--

- VDA is an invaluable business intelligence tool that uncovers hidden opportunities and informs clear decision making.

---

# How is VDA used?

- To report data using visual means rather than tables, enabling faster comparisons.

--

- For exploratory analysis
--
 to uncover new questions,
--
 discover previously unknown patterns,
--
 identify extreme behaviour and
--
 understand relationships between variables.

--

- As a diagnostic tool following statistical estimation.

---

# Why do we visualise?

- More than 50% of the brain's neurons are dedicated to vision.

--

- Nearly 10 million bits of information are processed per second through our eyes.

--

- Pre-attentive processing decodes information with high accuracy within 250 milliseconds (Healey and Enns, 2012).

--

- We have evolved to better decode information through visualisation.

---

# Why not summary stats?

- Anscombe's quartet is a collection of four synthetic datasets, each made up of pairs of variables.

--

- For each dataset, the means, variances and correlation between `\(x\)` and `\(y\)` are nearly identical.
--

- However, a simple scatterplot shows how different the datasets are:

---

# Anscombe's Quartet

<img src="01Intro_files/figure-html/unnamed-chunk-2-1.png" width="2400" style="display: block; margin: auto;" />

---

# Example

- Consider data on the number of trips made for the purpose of holidays in two Australian regions:

--

  - Alice Springs in the Northern Territory,
  - The Wilderness West in Tasmania.

--

- We will look at these data in two different ways.

--

- We may want to understand how demand evolves over the year to plan resourcing.

---

# As raw data

<table class="table table-striped" style="font-size: 16px; margin-left: auto; margin-right: auto;">
<thead> <tr> <th style="text-align:left;"> Quarter </th> <th style="text-align:right;"> Alice Springs </th> <th style="text-align:right;"> Wilderness West </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:left;"> 1998-Q1 </td> <td style="text-align:right;"> 8.15 </td> <td style="text-align:right;"> 58.33 </td> </tr>
<tr> <td style="text-align:left;"> 1998-Q2 </td> <td style="text-align:right;"> 34.66 </td> <td style="text-align:right;"> 37.46 </td> </tr>
<tr> <td style="text-align:left;"> 1998-Q3 </td> <td style="text-align:right;"> 76.54 </td> <td style="text-align:right;"> 9.25 </td> </tr>
<tr> <td style="text-align:left;"> 1998-Q4 </td> <td style="text-align:right;"> 27.22 </td> <td style="text-align:right;"> 42.46 </td> </tr>
<tr> <td style="text-align:left;"> 1999-Q1 </td> <td style="text-align:right;"> 12.50 </td> <td style="text-align:right;"> 45.88 </td> </tr>
<tr> <td style="text-align:left;"> 1999-Q2 </td> <td style="text-align:right;"> 47.38 </td> <td style="text-align:right;"> 28.17 </td> </tr>
<tr> <td style="text-align:left;"> 1999-Q3 </td> <td style="text-align:right;"> 60.83 </td> <td style="text-align:right;"> 7.23 </td> </tr>
<tr> <td style="text-align:left;"> 1999-Q4 </td> <td style="text-align:right;"> 37.81 </td> <td style="text-align:right;"> 38.62 </td> </tr>
<tr> <td style="text-align:left;">
2000-Q1 </td> <td style="text-align:right;"> 24.29 </td> <td style="text-align:right;"> 45.79 </td> </tr> <tr> <td style="text-align:left;"> 2000-Q2 </td> <td style="text-align:right;"> 24.62 </td> <td style="text-align:right;"> 19.62 </td> </tr> <tr> <td style="text-align:left;"> 2000-Q3 </td> <td style="text-align:right;"> 62.61 </td> <td style="text-align:right;"> 23.72 </td> </tr> <tr> <td style="text-align:left;"> 2000-Q4 </td> <td style="text-align:right;"> 36.12 </td> <td style="text-align:right;"> 41.84 </td> </tr> <tr> <td style="text-align:left;"> 2001-Q1 </td> <td style="text-align:right;"> 14.66 </td> <td style="text-align:right;"> 45.18 </td> </tr> <tr> <td style="text-align:left;"> 2001-Q2 </td> <td style="text-align:right;"> 22.87 </td> <td style="text-align:right;"> 56.04 </td> </tr> <tr> <td style="text-align:left;"> 2001-Q3 </td> <td style="text-align:right;"> 63.55 </td> <td style="text-align:right;"> 11.65 </td> </tr> <tr> <td style="text-align:left;"> 2001-Q4 </td> <td style="text-align:right;"> 22.84 </td> <td style="text-align:right;"> 10.53 </td> </tr> <tr> <td style="text-align:left;"> 2002-Q1 </td> <td style="text-align:right;"> 7.59 </td> <td style="text-align:right;"> 49.72 </td> </tr> <tr> <td style="text-align:left;"> 2002-Q2 </td> <td style="text-align:right;"> 34.10 </td> <td style="text-align:right;"> 29.50 </td> </tr> <tr> <td style="text-align:left;"> 2002-Q3 </td> <td style="text-align:right;"> 52.75 </td> <td style="text-align:right;"> 14.96 </td> </tr> <tr> <td style="text-align:left;"> 2002-Q4 </td> <td style="text-align:right;"> 34.90 </td> <td style="text-align:right;"> 39.17 </td> </tr> <tr> <td style="text-align:left;"> 2003-Q1 </td> <td style="text-align:right;"> 11.84 </td> <td style="text-align:right;"> 41.50 </td> </tr> <tr> <td style="text-align:left;"> 2003-Q2 </td> <td style="text-align:right;"> 36.68 </td> <td style="text-align:right;"> 44.04 </td> </tr> <tr> <td style="text-align:left;"> 2003-Q3 </td> <td 
style="text-align:right;"> 45.49 </td> <td style="text-align:right;"> 20.32 </td> </tr> <tr> <td style="text-align:left;"> 2003-Q4 </td> <td style="text-align:right;"> 21.77 </td> <td style="text-align:right;"> 38.31 </td> </tr> <tr> <td style="text-align:left;"> 2004-Q1 </td> <td style="text-align:right;"> 11.44 </td> <td style="text-align:right;"> 84.88 </td> </tr> <tr> <td style="text-align:left;"> 2004-Q2 </td> <td style="text-align:right;"> 26.35 </td> <td style="text-align:right;"> 39.33 </td> </tr> <tr> <td style="text-align:left;"> 2004-Q3 </td> <td style="text-align:right;"> 74.38 </td> <td style="text-align:right;"> 8.15 </td> </tr> <tr> <td style="text-align:left;"> 2004-Q4 </td> <td style="text-align:right;"> 17.80 </td> <td style="text-align:right;"> 46.76 </td> </tr> <tr> <td style="text-align:left;"> 2005-Q1 </td> <td style="text-align:right;"> 13.87 </td> <td style="text-align:right;"> 56.23 </td> </tr> <tr> <td style="text-align:left;"> 2005-Q2 </td> <td style="text-align:right;"> 40.32 </td> <td style="text-align:right;"> 30.53 </td> </tr> <tr> <td style="text-align:left;"> 2005-Q3 </td> <td style="text-align:right;"> 64.10 </td> <td style="text-align:right;"> 21.49 </td> </tr> <tr> <td style="text-align:left;"> 2005-Q4 </td> <td style="text-align:right;"> 31.55 </td> <td style="text-align:right;"> 29.67 </td> </tr> <tr> <td style="text-align:left;"> 2006-Q1 </td> <td style="text-align:right;"> 16.19 </td> <td style="text-align:right;"> 49.20 </td> </tr> <tr> <td style="text-align:left;"> 2006-Q2 </td> <td style="text-align:right;"> 37.44 </td> <td style="text-align:right;"> 44.54 </td> </tr> <tr> <td style="text-align:left;"> 2006-Q3 </td> <td style="text-align:right;"> 49.69 </td> <td style="text-align:right;"> 4.27 </td> </tr> <tr> <td style="text-align:left;"> 2006-Q4 </td> <td style="text-align:right;"> 41.30 </td> <td style="text-align:right;"> 23.26 </td> </tr> <tr> <td style="text-align:left;"> 2007-Q1 </td> <td style="text-align:right;"> 
13.14 </td> <td style="text-align:right;"> 50.56 </td> </tr> <tr> <td style="text-align:left;"> 2007-Q2 </td> <td style="text-align:right;"> 38.75 </td> <td style="text-align:right;"> 40.03 </td> </tr> <tr> <td style="text-align:left;"> 2007-Q3 </td> <td style="text-align:right;"> 48.00 </td> <td style="text-align:right;"> 19.07 </td> </tr> <tr> <td style="text-align:left;"> 2007-Q4 </td> <td style="text-align:right;"> 29.05 </td> <td style="text-align:right;"> 44.28 </td> </tr> <tr> <td style="text-align:left;"> 2008-Q1 </td> <td style="text-align:right;"> 18.19 </td> <td style="text-align:right;"> 64.15 </td> </tr> <tr> <td style="text-align:left;"> 2008-Q2 </td> <td style="text-align:right;"> 35.90 </td> <td style="text-align:right;"> 37.69 </td> </tr> <tr> <td style="text-align:left;"> 2008-Q3 </td> <td style="text-align:right;"> 36.35 </td> <td style="text-align:right;"> 11.38 </td> </tr> <tr> <td style="text-align:left;"> 2008-Q4 </td> <td style="text-align:right;"> 23.76 </td> <td style="text-align:right;"> 14.74 </td> </tr> <tr> <td style="text-align:left;"> 2009-Q1 </td> <td style="text-align:right;"> 3.56 </td> <td style="text-align:right;"> 61.61 </td> </tr> <tr> <td style="text-align:left;"> 2009-Q2 </td> <td style="text-align:right;"> 44.70 </td> <td style="text-align:right;"> 26.50 </td> </tr> <tr> <td style="text-align:left;"> 2009-Q3 </td> <td style="text-align:right;"> 53.51 </td> <td style="text-align:right;"> 18.40 </td> </tr> <tr> <td style="text-align:left;"> 2009-Q4 </td> <td style="text-align:right;"> 31.50 </td> <td style="text-align:right;"> 35.93 </td> </tr> <tr> <td style="text-align:left;"> 2010-Q1 </td> <td style="text-align:right;"> 10.23 </td> <td style="text-align:right;"> 58.08 </td> </tr> <tr> <td style="text-align:left;"> 2010-Q2 </td> <td style="text-align:right;"> 35.65 </td> <td style="text-align:right;"> 44.61 </td> </tr> <tr> <td style="text-align:left;"> 2010-Q3 </td> <td style="text-align:right;"> 46.48 </td> <td 
style="text-align:right;"> 16.15 </td> </tr> <tr> <td style="text-align:left;"> 2010-Q4 </td> <td style="text-align:right;"> 45.72 </td> <td style="text-align:right;"> 28.39 </td> </tr> <tr> <td style="text-align:left;"> 2011-Q1 </td> <td style="text-align:right;"> 11.84 </td> <td style="text-align:right;"> 36.05 </td> </tr> <tr> <td style="text-align:left;"> 2011-Q2 </td> <td style="text-align:right;"> 18.69 </td> <td style="text-align:right;"> 15.78 </td> </tr> <tr> <td style="text-align:left;"> 2011-Q3 </td> <td style="text-align:right;"> 37.91 </td> <td style="text-align:right;"> 4.54 </td> </tr> <tr> <td style="text-align:left;"> 2011-Q4 </td> <td style="text-align:right;"> 14.21 </td> <td style="text-align:right;"> 14.57 </td> </tr> <tr> <td style="text-align:left;"> 2012-Q1 </td> <td style="text-align:right;"> 15.17 </td> <td style="text-align:right;"> 44.54 </td> </tr> <tr> <td style="text-align:left;"> 2012-Q2 </td> <td style="text-align:right;"> 21.64 </td> <td style="text-align:right;"> 17.07 </td> </tr> <tr> <td style="text-align:left;"> 2012-Q3 </td> <td style="text-align:right;"> 47.64 </td> <td style="text-align:right;"> 3.81 </td> </tr> <tr> <td style="text-align:left;"> 2012-Q4 </td> <td style="text-align:right;"> 7.28 </td> <td style="text-align:right;"> 22.51 </td> </tr> <tr> <td style="text-align:left;"> 2013-Q1 </td> <td style="text-align:right;"> 6.21 </td> <td style="text-align:right;"> 47.25 </td> </tr> <tr> <td style="text-align:left;"> 2013-Q2 </td> <td style="text-align:right;"> 26.73 </td> <td style="text-align:right;"> 25.84 </td> </tr> <tr> <td style="text-align:left;"> 2013-Q3 </td> <td style="text-align:right;"> 37.05 </td> <td style="text-align:right;"> 13.40 </td> </tr> <tr> <td style="text-align:left;"> 2013-Q4 </td> <td style="text-align:right;"> 17.07 </td> <td style="text-align:right;"> 26.28 </td> </tr> <tr> <td style="text-align:left;"> 2014-Q1 </td> <td style="text-align:right;"> 2.81 </td> <td style="text-align:right;"> 
58.30 </td> </tr> <tr> <td style="text-align:left;"> 2014-Q2 </td> <td style="text-align:right;"> 61.38 </td> <td style="text-align:right;"> 35.79 </td> </tr> <tr> <td style="text-align:left;"> 2014-Q3 </td> <td style="text-align:right;"> 46.95 </td> <td style="text-align:right;"> 7.20 </td> </tr> <tr> <td style="text-align:left;"> 2014-Q4 </td> <td style="text-align:right;"> 30.36 </td> <td style="text-align:right;"> 25.14 </td> </tr> <tr> <td style="text-align:left;"> 2015-Q1 </td> <td style="text-align:right;"> 10.52 </td> <td style="text-align:right;"> 65.12 </td> </tr> <tr> <td style="text-align:left;"> 2015-Q2 </td> <td style="text-align:right;"> 38.67 </td> <td style="text-align:right;"> 8.41 </td> </tr> <tr> <td style="text-align:left;"> 2015-Q3 </td> <td style="text-align:right;"> 69.74 </td> <td style="text-align:right;"> 11.30 </td> </tr> <tr> <td style="text-align:left;"> 2015-Q4 </td> <td style="text-align:right;"> 20.55 </td> <td style="text-align:right;"> 21.34 </td> </tr> <tr> <td style="text-align:left;"> 2016-Q1 </td> <td style="text-align:right;"> 3.40 </td> <td style="text-align:right;"> 42.18 </td> </tr> <tr> <td style="text-align:left;"> 2016-Q2 </td> <td style="text-align:right;"> 45.05 </td> <td style="text-align:right;"> 23.49 </td> </tr> <tr> <td style="text-align:left;"> 2016-Q3 </td> <td style="text-align:right;"> 65.05 </td> <td style="text-align:right;"> 9.61 </td> </tr> <tr> <td style="text-align:left;"> 2016-Q4 </td> <td style="text-align:right;"> 15.23 </td> <td style="text-align:right;"> 29.78 </td> </tr> <tr> <td style="text-align:left;"> 2017-Q1 </td> <td style="text-align:right;"> 21.15 </td> <td style="text-align:right;"> 57.03 </td> </tr> <tr> <td style="text-align:left;"> 2017-Q2 </td> <td style="text-align:right;"> 39.03 </td> <td style="text-align:right;"> 18.95 </td> </tr> <tr> <td style="text-align:left;"> 2017-Q3 </td> <td style="text-align:right;"> 35.96 </td> <td style="text-align:right;"> 11.52 </td> </tr> <tr> <td 
style="text-align:left;"> 2017-Q4 </td> <td style="text-align:right;"> 23.04 </td> <td style="text-align:right;"> 32.94 </td> </tr>
</tbody>
</table>

---

# As a plot (Alice Springs)

<img src="01Intro_files/figure-html/unnamed-chunk-4-3.png" width="2400" style="display: block; margin: auto;" />

---

# As a plot (Wilderness West)

<img src="01Intro_files/figure-html/unnamed-chunk-5-5.png" width="2400" style="display: block; margin: auto;" />

---

# As a plot (Both)

<img src="01Intro_files/figure-html/unnamed-chunk-6-7.png" width="2400" style="display: block; margin: auto;" />

---

# Insights

- Both time series have a seasonal pattern.
- Some times of the year are more popular for holidays.

--

- The seasonal patterns are different.
- Alice Springs and Tasmania's Wilderness West have very different climates.

--

- It is easier to gain these insights from a visualisation than from the raw numbers.

---

class: center, middle, inverse

# The good, the bad and the ugly

---

# Tufte's Principles

- Principles of good practice in data visualisation are outlined in *The Visual Display of Quantitative Information* by Edward Tufte. These include:

--

  - Avoid distorting what the data have to say,
  - Present many numbers in a small space,
  - Make large data sets coherent,
  - Encourage the eye to compare different pieces of data.

---

# Iliinsky's four pillars

A visualisation is not just a picture.

- Purpose (the why): have clear focus.
- Content (the what): contain correct and useful information.
- Structure (the how): what graph to choose.
- Formatting (everything else): bring focus.

For more see the video [here](https://methods.sagepub.com/video/four-pillars-of-effective-visualization-design#:~:text=Noah%20Iliinsky%20discusses%20the%20four,and%20design%20types%20to%20avoid.)
---

# Bad plots

- According to Healy's [*Data Visualization*](https://socviz.co/index.html) plots may be bad due to:
  - Bad taste,
  - Bad perception,
  - Bad data.

--

- In the following examples, think about:
  - How the data are *encoded* into a visualisation, i.e. data `\(\rightarrow\)` visuals,
  - How the data are *decoded* by the person interpreting the plot, i.e. visuals `\(\rightarrow\)` insight.

---

# Poor taste

![Monstrous costs](img/monster.jpg)

---

# Chartjunk

- Chartjunk is the inclusion of elements that are not necessary to communicate the information.

--

- The inclusion of the following can be considered chartjunk:
  - Heavy gridlines,
  - Unnecessary text,
  - Pictures within the chart.

--

- These are not incorrect but can be misleading due to a lack of objectivity.

--

- There are also examples of chartjunk that do not mislead.

---

# More chartjunk

![LowDI](img/Chartjunk-example.svg)

---

# Data-ink ratio

- One way to think about the design of a visualisation is using the *data-ink* ratio.

--

- The idea is to show the most data with the least amount of 'ink'.

--

- In the previous plot, the stripes on the bars and the colour in the background do not convey any information about the data.

--

- This is an example of chartjunk that is not misleading.

---

# Data density

![](img/sword.jpg)

---

# Data density

- In the previous plot there is only one data point.

--

- The visualisation is not misleading.

--

- However, is a visualisation necessary here?

--

- Visualisations that convey more data are said to have a high *data density*.

--

- In general, try to avoid low data density.

---

# Perceptually misleading

- Human perception is a broad field that takes in ideas from psychology and philosophy.
- For data visualisation we can perceive:
  - Length/Area/Volume,
  - Shape,
  - Position,
  - Colour,
  - Angle.

--

- Now some examples where things go wrong.
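---

# Length, area and volume: a sketch

The first perceptual channel listed above mixes three quantities that scale very differently. As a rough illustration (not taken from any particular chart), the snippet below shows how a ratio between two data values looks when a symbol is scaled uniformly in one, two or three dimensions. The function name `perceived_ratios` and the example ratio of 3 are assumptions made for this sketch.

```python
def perceived_ratios(value_ratio: float) -> dict:
    """Return how a ratio between two data values appears when the
    symbol representing them is scaled uniformly in 1, 2 or 3 dimensions."""
    return {
        "length": value_ratio,       # only the height is scaled
        "area": value_ratio ** 2,    # height and width are both scaled
        "volume": value_ratio ** 3,  # height, width and depth are all scaled
    }


# Hypothetical example: one value is 3 times another.
ratios = perceived_ratios(3.0)
for channel, ratio in ratios.items():
    print(f"{channel:>6}: looks {ratio:.0f} times bigger")
```

If the data are mapped to height but the reader judges area or volume, a threefold difference can be read as a ninefold or twenty-sevenfold one.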
---

# Confusing length and area

![mac](img/mac.jpg)

---

# Confusing Length and Area

- On the previous plot, the number of customers is mapped to length (height of computer).

--

- The area of the 2D pictures of computers scales up more than their heights.

--

- The picture leads us to imagine a 3D computer, making this effect worse.

--

- The value for Mac is only about 3 to 4 times more than for None but we perceive the difference to be much more.

---

# Beware 3D

.pull-left[
![](img/seo-traffic-columns.gif)
]

.pull-right[
![](img/Misleading_Pie_Chart.png)
]

---

# Beware 3D

- Difficult to line up heights of bars with actual values.

--

- Closer green bar (MSN) looks bigger.
- Do not use 3 dimensions when 2 work well.

--

- On the pie chart the green segment looks bigger.

--

- In general angles are difficult to perceive.
- Experts in visualisation prefer not to use pie charts, but they are popular in practice.

--

- Argumentum ad populum / Three men make a tiger.

---

# Lie factor

![](img/Lie_factor_example1_image.jpg)

---

# Lie factor

- The data indicate that mileage rose from 18 to 27.5, which is a 53% increase.

--

- The line on the graph increases from 0.6 inches to 5.3 inches, which is a 783% increase!

--

- Tufte formalises this into a lie factor of 783/53≈14.8.

--

- It should be 1!

--

- Note that in contrast to previous examples where it was difficult to *decode* insights from the visualisation, here there is an error in how the data are *encoded* into a visualisation.

---

# Wrong data or plot

.pull-left[
![](img/display.jpeg)
]

.pull-right[
![](img/Its-a-pizza-Chart-Not-a-Pie-chart-1.png)
]

---

# Issues

- The percentages do not add up to 100.
- In the plot on the left, the data are a time series.
- Dates not given ('today' is 2016).
- In the plot on the right, respondents can like more than one topping.
- In both cases the pie chart is a poor choice of visualisation for the data at hand.

--

- Also pineapple should never be on a pizza!
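---

# Lie factor: a worked sketch

Returning to the lie factor slide, the arithmetic can be checked in a few lines of Python. This is a minimal sketch that assumes the lie factor is the relative change shown in the graphic divided by the relative change in the data, using the four numbers quoted on that slide. The helper names `percent_change` and `lie_factor` are made up for this example.

```python
def percent_change(start: float, end: float) -> float:
    """Percentage change going from start to end."""
    return (end - start) / start * 100


def lie_factor(data_start: float, data_end: float,
               graphic_start: float, graphic_end: float) -> float:
    """Size of the effect shown in the graphic divided by
    the size of the effect in the data."""
    return (percent_change(graphic_start, graphic_end)
            / percent_change(data_start, data_end))


# Numbers quoted on the lie factor slide.
data_effect = percent_change(18, 27.5)     # change in mileage
graphic_effect = percent_change(0.6, 5.3)  # change in line length (inches)
factor = lie_factor(18, 27.5, 0.6, 5.3)

print(f"Effect in data:    {data_effect:.0f}%")
print(f"Effect in graphic: {graphic_effect:.0f}%")
print(f"Lie factor:        {factor:.1f}")
```

A faithful graphic would give a value close to 1. Here the unrounded ratio comes out a little under 15.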
---

# Bad (but not wrong) data

![](img/diminishing-return.jpg)

---

# Problems

- There is nothing incorrect about this graph.
- However the message is misleading.
- The income is a yearly income while the cost of college is over four years (and only paid once).
- Also it does not show the income of people who are not college graduates.
- Think carefully about comparisons on a plot.
- Make sure conclusions align with what is in the plot.

---

class: center, middle, inverse

# Storytelling with data

---

# Data storytelling

- Data storytelling is not about generating pretty charts and data presentations.
- It is about communicating insights that deliver real value.
- Good data stories have data, visualisations and narrative.

---

# Guide to building narrative

The following discussion is based on some Harvard Business Review articles:

--

- How to make sure you're not using data just to justify decisions you've already made [(link)](https://hbr.org/2018/10/how-to-make-sure-youre-not-using-data-just-to-justify-decisions-youve-already-made).
- Use data to answer your key business questions [(link)](https://hbr.org/2020/02/use-data-to-answer-your-key-business-questions).
- Visualizations that really work [(link)](https://hbr.org/2016/06/visualizations-that-really-work).

---

# Key Business Questions (KBQs)

- What problem am I trying to solve?
  - Focus on something *actionable*!
- Immerse yourself in data
  - Including visualisation
- Generate KBQs
  - Make purpose specific
- Prioritise KBQs
  - Focus on easily activated, high impact KBQs
- Iterate!

---

# Example: Tesla

- Purpose: Improve customer satisfaction and operations with tyres?

--

- Visualisation:

![](img/tesla.png)

---

# Tesla: KBQs

- Identify KBQs
  - Is there sufficient quality control on tyres leaving the factory?
  - How long do customers take to respond to a low pressure alert?
  - Can we predict when tyres go flat?

--

- Prioritise KBQs
  - Will depend on context
  - Predictive model may not be easily activated.
---

# Narrative

- All stories consist of

--

  - Setup (current reality)
  - Conflict (change)
  - Resolution (new reality)

--

- For an example (with house prices) see [here](https://www.youtube.com/watch?v=r5_34YnCmMY&t=13s).

--

- We will now work through an example with the tourism data.

---

# Tourism narrative

<img src="01Intro_files/figure-html/unnamed-chunk-7-9.png" width="2400" style="display: block; margin: auto;" />

---

class: inverse, middle, center

# The tools

---

# Python

.pull-left[
- General purpose programming language.
- One of the most widely used languages for data science.
- Open source libraries for visualisation.
]

.pull-right[
![](img/python.jpeg)
]

---

# Matplotlib

.pull-left[
- Most popular Python package for visualisation.
- Highly customisable.
- Works with other packages.
]

.pull-right[
![](img/matplotlib.png)
]

---

# Seaborn

.pull-left[
- Builds on top of Matplotlib.
- Easier integration with Pandas dataframes.
- Nice themes.
]

.pull-right[
![](img/seaborn.png)
]

---

# Plotly

.pull-left[
- Good for interactive plots.
- Suited to web-based interface.
- Also implemented in other languages.
]

.pull-right[
![](img/plotly.png)
]

---

# Bokeh

.pull-left[
- Alternative for interactive plots.
- Good for interactive dashboards.
]

.pull-right[
![](img/bokeh.svg)
]

---

# Why not Tableau?

- Tableau is a popular commercial tool for visualisation.
- It arguably has an easier interface (no coding).
- Python is more customisable.
- Python can be used everywhere and anywhere.
- If you know Python, it is easier to learn Tableau (compared to the other way around).
- You will need to learn coding, but this is not a coding course.

---

class: inverse, center, middle

# Questions?