class: center, middle, inverse, title-slide

.title[
# Week 3: Visualising One Variable
]
.author[
### Visual Data Analytics
]
.institute[
### University of Sydney
]

---

# Outline

- Nominal/Ordinal Data
  - Bar
  - Lollipop
  - Pie/donut
- Numeric data
  - Box plot
  - Histograms
  - Kernel density

---

# Motivation

- Understand the distribution of a variable
  - Find outliers
  - Find multi-modality
  - Find skew
- Understanding the distribution is about generating interesting questions for further analysis.
- Thinking probabilistically is about thinking about distributions and not just the mean.

---

# Examples

- We will use two datasets that can be directly loaded from the `seaborn` package.
  - The `taxis` dataset with data on pickup and drop off locations, fares, payment type, etc., in New York City.
  - The `diamonds` dataset with information on size, cut, clarity, price, etc. of diamonds.
- These contain categorical (nominal and ordinal) and numeric variables.

---

class: center, middle, inverse

# Categorical variables

---

# The bar chart

- Categories displayed on one axis (usually x).
- The *frequency* of each category is displayed on the other axis (usually y).
- The frequency is mapped to the *length* of each bar.
- For this reason always include zero on the y axis.

---

# Taxis data

```python
import seaborn as sns
taxisdat = sns.load_dataset('taxis')
taxisdat
```

```
##                    pickup              dropoff  ... pickup_borough dropoff_borough
## 0     2019-03-23 20:21:09  2019-03-23 20:27:24  ...      Manhattan       Manhattan
## 1     2019-03-04 16:11:55  2019-03-04 16:19:00  ...      Manhattan       Manhattan
## 2     2019-03-27 17:53:01  2019-03-27 18:00:25  ...      Manhattan       Manhattan
## 3     2019-03-10 01:23:59  2019-03-10 01:49:51  ...      Manhattan       Manhattan
## 4     2019-03-30 13:27:42  2019-03-30 13:37:14  ...      Manhattan       Manhattan
## ...                   ...                  ...  ...            ...             ...
## 6428  2019-03-31 09:51:53  2019-03-31 09:55:27  ...      Manhattan       Manhattan
## 6429  2019-03-31 17:38:00  2019-03-31 18:34:23  ...         Queens           Bronx
## 6430  2019-03-23 22:55:18  2019-03-23 23:14:25  ...
Brooklyn        Brooklyn
## 6431  2019-03-04 10:09:25  2019-03-04 10:14:29  ...       Brooklyn        Brooklyn
## 6432  2019-03-13 19:31:22  2019-03-13 19:48:02  ...       Brooklyn        Brooklyn
##
## [6433 rows x 14 columns]
```

---

# Bar plot of pick up borough

```python
sns.countplot(data = taxisdat, x='pickup_borough')
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-3-1.png" width="50%" style="display: block; margin: auto;" />

---

# Change orientation

```python
sns.countplot(data = taxisdat, y='pickup_borough')
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-4-3.png" width="50%" style="display: block; margin: auto;" />

---

# Order by frequency

```python
# value_counts() sorts categories from most to least frequent
sns.countplot(data = taxisdat, x='pickup_borough',
              order = taxisdat['pickup_borough'].value_counts().index)
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-5-5.png" width="50%" style="display: block; margin: auto;" />

Data are nominal - this is fine.

---

# Ordinal data

- For nominal data it is suitable to order according to frequency.
- This is not the case for ordinal data.
  - Always order according to the categories of the variable.
- The diamonds dataset has clarity as an ordinal variable.
  - Categories ordered as IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1.

---

# Diamonds data

```python
diam = sns.load_dataset('diamonds')
diam
```

```
##        carat        cut color clarity  depth  table  price     x     y     z
## 0       0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1       0.21    Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
## 2       0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
## 3       0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
## 4       0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
## ...      ...        ...   ...     ...    ...    ...    ...   ...   ...   ...
## 53935   0.72      Ideal     D     SI1   60.8   57.0   2757  5.75  5.76  3.50
## 53936   0.72       Good     D     SI1   63.1   55.0   2757  5.69  5.75  3.61
## 53937   0.70  Very Good     D     SI1   62.8   60.0   2757  5.66  5.68  3.56
## 53938   0.86    Premium     H     SI2   61.0   58.0   2757  6.15  6.12  3.74
## 53939   0.75      Ideal     D     SI2   62.2   55.0   2757  5.83  5.87  3.64
##
## [53940 rows x 10 columns]
```

---

# Ordinal

```python
diam = sns.load_dataset('diamonds')
sns.countplot(data=diam,x='clarity')
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-7-7.png" width="45%" style="display: block; margin: auto;" />

Categories ordered by levels of variable - this is fine.

---

# Incorrect plot

<img src="03OneVariable_files/figure-html/unnamed-chunk-8-9.png" width="50%" style="display: block; margin: auto;" />

Incorrect. Categories in alphabetical order.

---

# Incorrect plot

<img src="03OneVariable_files/figure-html/unnamed-chunk-9-11.png" width="50%" style="display: block; margin: auto;" />

Incorrect. Ordered by frequency.

---

# Single color

```python
diam = sns.load_dataset('diamonds')
sns.countplot(data=diam,x='clarity',color='tab:blue')
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-10-13.png" width="50%" style="display: block; margin: auto;" />

---

# Coloring

- Although by default categories have different colors, this is not strictly necessary.
  - Arguably it is distracting, especially when there are more categories.
- Later on we will use color to display data.
  - For example grouping by a second variable and mapping that to color.

---

# Lollipop charts

- If
  - there are a large number of categories,
  - the categories all have similar frequencies,
- then consider using a lollipop chart.
- This can be done with some data munging using `value_counts` and the `stem` function in `matplotlib`.
---

# Data preparation

For a simpler graph, we will only consider pickups in the Bronx.

```python
freq = taxisdat[taxisdat['pickup_borough']=='Bronx'].value_counts('pickup_zone')
freq
```

```
## pickup_zone
## East Concourse/Concourse Village       9
## Parkchester                            8
## Mott Haven/Port Morris                 8
## Co-Op City                             5
## East Tremont                           5
## Mount Hope                             5
## Claremont/Bathgate                     5
## Soundview/Castle Hill                  4
## Morrisania/Melrose                     4
## Van Nest/Morris Park                   3
## University Heights/Morris Heights      3
## Highbridge                             3
## Belmont                                3
## Bronxdale                              2
## Fordham South                          2
## Kingsbridge Heights                    2
## Melrose South                          2
## Allerton/Pelham Gardens                2
## Williamsbridge/Olinville               2
## Van Cortlandt Village                  2
## Westchester Village/Unionport          2
## Schuylerville/Edgewater Park           2
## Soundview/Bruckner                     2
## Norwood                                2
## West Concourse                         2
## Bedford Park                           1
## Bronx Park                             1
## West Farms/Bronx River                 1
## Spuyten Duyvil/Kingsbridge             1
## Crotona Park East                      1
## Riverdale/North Riverdale/Fieldston    1
## Hunts Point                            1
## Pelham Parkway                         1
## Pelham Bay                             1
## Woodlawn/Wakefield                     1
## dtype: int64
```

---

# Lollipop plot (code)

```python
import matplotlib.pyplot as plt
plt.stem(freq)
# stem draws stem k at x = k (starting from 0), so the ticks must start at 0 too
plt.xticks(range(len(freq.index)), freq.index, rotation='vertical')
plt.show()
```

---

# Lollipop plot (output)

<img src="03OneVariable_files/figure-html/unnamed-chunk-13-15.png" width="60%" style="display: block; margin: auto;" />

---

# Pie charts

- Pie charts are considered to be poor practice by visualisation experts since
  - It is difficult to compare sizes of angles.
  - It is difficult to make comparisons unless categories are close.
  - They do not handle large numbers of categories.
- The following examples come from a [Business Insider article](https://www.businessinsider.com/pie-charts-are-the-worst-2013-6) by Walt Hickey.
---

# Pie chart

![](img/piechart.webp)

---

# Bar chart

![](img/barchart.webp)

---

# Pie chart

![](img/eupie.webp)

---

# Bar chart

![](img/eubar.webp)

---

# How to do pie charts

.pull-left[

- If you absolutely MUST do a pie chart, a guide can be found at this [link](https://www.geeksforgeeks.org/how-to-create-a-pie-chart-in-seaborn/).
- A donut chart is a pie chart with a hole. It is even worse than a pie chart.

]

.pull-right[

![](img/donut.png)

]

---

# Treemaps

- Even bar charts can struggle when the number of categories is truly huge.
- One way to handle this is using a *treemap*.
  - See [this example](https://www.datarevelations.com/in-praise-of-treemaps/)
- These are particularly well suited when categories follow a hierarchy.
- The following example considers Swiss exports that are classified into 12 categories and 48 subcategories.

---

# Swiss Exports

![](img/Swiss.png)

---

class: middle, center, inverse

# Numerical Data

---

# Histogram

- The equivalent of a bar chart for numerical data is a histogram.
- The area of each bar represents the frequency within a certain interval.
- If all bars have equal width then frequency is mapped to the length of the bars too.
- Zero should always be included on the y axis (but not necessarily on the x axis).

---

# Histogram

```python
sns.histplot(taxisdat['fare'])
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-14-17.png" width="75%" style="display: block; margin: auto;" />

---

# What do we see?

- Right skew
  - Should we use the mean or the median as a measure of central tendency?
- A few big outliers.
- A spike (second 'mode') at around $50
  - Could represent a fixed fee (e.g. from the airport).
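---

# Mean v median under right skew

The question of mean versus median can be made concrete with a small sketch. The fares below are made up for illustration (they are not taken from the `taxis` data) and only the standard library is assumed: a handful of short trips, one fare near $50 and one large outlier.

```python
import statistics

# Hypothetical fares in dollars (made up, NOT from the taxis dataset):
# mostly short trips, one fare near $50 and one large outlier.
fares = [6.5, 7.0, 8.0, 9.5, 10.0, 12.0, 13.5, 52.0, 150.0]

mean_fare = statistics.mean(fares)      # pulled upwards by the two large fares
median_fare = statistics.median(fares)  # middle value, unaffected by their size

print(f"mean   = {mean_fare:.2f}")
print(f"median = {median_fare:.2f}")
```

Here the mean (about 29.83) is almost three times the median (10.00), even though seven of the nine fares are below $14. With right-skewed data the median is usually closer to a 'typical' value.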
---

# Change number of bins

```python
sns.histplot(taxisdat['fare'], bins=10)
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-15-19.png" width="75%" style="display: block; margin: auto;" />

---

# Change number of bins

```python
sns.histplot(taxisdat['fare'], bins=2000)
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-16-21.png" width="75%" style="display: block; margin: auto;" />

---

# Lessons

- By having too many (or too few) bins we can miss out on important features of the data.
- In the above example the spike of fares around $50 is not seen when the number of bins is set to 10 or to 2000.
- In general the default choice of bin number is good; however, it is always a good idea to experiment.

---

# Kernel density estimate (KDE)

- A kernel density estimate (KDE) is an estimate of the *probability density function (pdf)* of the data.
- For data `\(x_1,x_2,\dots,x_n\)` the KDE is given by

`$$\hat{f}(x)=\frac{1}{n}\sum\limits_{i=1}^{n} K_h(x-x_i)$$`

- The function `\(K_h(.)\)` is called the *kernel*.
  - Can take many forms.
  - The function depends on a bandwidth `\(h\)` (to be explained soon).

---

# Intuition behind KDE

- If I observe a point at some location, that is evidence that other points are likely to come from a nearby region.
- Imagine I drop a mountain of sand at the location where I observe the data point.
- The shape of the sand is the kernel function.
- If I repeat this for `\(n\)` observations the result is the KDE.
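---

# KDE by hand (sketch)

The formula and the sand intuition can be sketched in a few lines. This is an illustration only: it assumes a Gaussian kernel (the slides leave the kernel open), uses made-up observations and a hand-picked bandwidth, and the names `gaussian_kernel` and `kde` are hypothetical helpers, not seaborn functions.

```python
import math

def gaussian_kernel(u, h):
    """K_h(u): one 'mountain of sand', here a Gaussian with standard deviation h (an assumption)."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2 * math.pi))

def kde(x, data, h):
    """f_hat(x) = (1/n) * sum over i of K_h(x - x_i)."""
    return sum(gaussian_kernel(x - xi, h) for xi in data) / len(data)

# Hypothetical observations, made up for illustration.
data = [1.0, 1.5, 4.0]
h = 0.5  # bandwidth, chosen by hand here

# Evaluate the estimate on a grid covering [-5, 10] and check the total area.
step = 0.01
grid = [-5 + step * k for k in range(1501)]
density = [kde(x, data, h) for x in grid]
area = sum(density) * step

print(f"approximate area under the KDE: {area:.3f}")
```

Each observation contributes one pile of sand with area `\(1/n\)`, so the piles add up to a curve with total area close to 1, as a density should. The estimate is higher near 1.25, where two observations sit close together, than near 3, where there are none.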
---

# KDE (n=1)

<img src="03OneVariable_files/figure-html/unnamed-chunk-17-23.png" style="display: block; margin: auto;" />

---

# KDE (n=2)

<img src="03OneVariable_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

# KDE (n=3)

<img src="03OneVariable_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

# KDE (n=4)

<img src="03OneVariable_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

# KDE (n=20)

<img src="03OneVariable_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

# The bandwidth

- The bandwidth `\(h\)` controls whether the mountain of sand is 'peaked' or 'flat'.
- For a small bandwidth the mountain of sand is more peaked and the KDE is more wiggly.
- For a large bandwidth the mountain of sand is flatter and the KDE is smoother.
- This is similar to the role of the number of bins in the histogram.
- There are sensible defaults used by visualisation packages.

---

# KDE plot

```python
sns.kdeplot(taxisdat['fare'])
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-22-1.png" width="50%" style="display: block; margin: auto;" />

---

# KDE plot (double default BW)

```python
sns.kdeplot(taxisdat['fare'], bw_adjust = 2)
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-23-3.png" width="75%" style="display: block; margin: auto;" />

---

# KDE plot (half default BW)

```python
sns.kdeplot(taxisdat['fare'], bw_adjust = 0.5)
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-24-5.png" width="75%" style="display: block; margin: auto;" />

---

# Violin plot

- A violin plot mirrors a KDE and fills it in.
- It is particularly useful for making comparisons of density according to a grouping variable.
- We will cover this next week.
---

# Violin plot

```python
sns.violinplot(taxisdat['fare'])
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-25-7.png" width="75%" style="display: block; margin: auto;" />

---

# Boxplot

- Inside the violin plot is a boxplot.
- The boxplot is a summary of five statistics
  - Median
  - First Quartile
  - Third Quartile
  - Minimum
  - Maximum

---

# Box plot

```python
sns.boxplot(taxisdat['fare'])
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-26-9.png" width="75%" style="display: block; margin: auto;" />

---

# Fences

- For most implementations, a boxplot actually shows an upper and lower fence rather than the maximum and minimum.
- The upper (lower) fence is given by the third (first) quartile plus (minus) 1.5 times the interquartile range (IQR).
- The maximum (minimum) is shown instead if it is less (greater) than the upper (lower) fence.
- Observations beyond the fences are drawn as individual points (outliers).

---

# Boxplots v KDE

- Note that in this example the spike at around $50 is lost in the boxplot.
- However, it is clearer that there are four outliers above $110.
- There is no right or wrong answer; it all depends on what you are trying to visualise.

---

# Rug plot

- The final plot we will consider is a rug plot.
- The rug plot can highlight outliers.
- It is harder to understand the shape of the distribution using a rug plot, especially for large sample sizes.
- As a univariate plot, a jittered rug plot (strip plot) works better.

---

# Rug plot

```python
sns.rugplot(taxisdat['fare'])
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-27-11.png" width="75%" style="display: block; margin: auto;" />

---

# Rug plot (jittered)

```python
sns.stripplot(y=taxisdat['fare'])
```

<img src="03OneVariable_files/figure-html/unnamed-chunk-28-13.png" width="75%" style="display: block; margin: auto;" />

---

class: middle, center, inverse

# Wrap-up

---

# Conclusions

- Univariate plots are useful for
  - Understanding the distribution of a variable
  - Finding outliers
  - Finding frequent values
  - Seeing whether data are skewed.
- Always remember that univariate plots generate questions. Answering these questions requires domain knowledge and further analysis.

---

class: middle, center, inverse

# Questions