class: center, middle, inverse, title-slide .title[ # Week 4:VisualisingTwo Variables ] .author[ ### Visual Data Analytics ] .institute[ ### University of Sydney ] --- # Outline - Categorical v Catgorical - Side by side bar - Categorical v Numerical - Side by side boxplots - Numeric v Numerical - Scatter plot --- # Motivation - Understand the distributions of two variables - Find outliers - Find multi-modality - Find skew - Also understand the relationship between the variables --- class: center, middle, inverse # Categorical v Categorical variables --- # Bar charts - To visualise two nominal/ordinal variables bar charts can still be used to plot the frequency of each category. - There are two options - Grouped bar charts - Stacked bar charts - We will investigate using the taxis and diamonds data also used last week. --- #Grouped bar chart ```python sns.countplot(data=diam, x='clarity', hue='cut') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-3-1.png" width="50%" style="display: block; margin: auto;" /> --- # Grouped bar chart - Can easily make comparisons such as - Fair diamonds are relatively more common for worse levels of clarity (e.g. SI2) - For IF diamonds, these overwhelmingly have an ideal cut. - Harder to make a comparison for - The most common category of clarity. --- # Stacked bar chart First must prepare data by creating cross tab ```python diamct = pd.crosstab(diam['clarity'],diam['cut']) diamct ``` ``` ## cut Ideal Premium Very Good Good Fair ## clarity ## IF 1212 230 268 71 9 ## VVS1 2047 616 789 186 17 ## VVS2 2606 870 1235 286 69 ## VS1 3589 1989 1775 648 170 ## VS2 5071 3357 2591 978 261 ## SI1 4282 3575 3240 1560 408 ## SI2 2598 2949 2100 1081 466 ## I1 146 205 84 96 210 ``` --- # Stacked bar chart ```python diamct.plot(kind = 'bar', stacked = True) ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-5-3.png" width="50%" style="display: block; margin: auto;" /> --- # Stacked bar chart - Can easily see - Frequencies of categories for clarity - Relative frequency of cut within a single category of clarity - Harder to make a comparison for - Frequency of cut across categories of clarity (e.g. hard to quickly tell if there are more Good/VS2 or Good/SI2) --- # Percentage stacked bar chart First must prepare data ```python diamct = pd.crosstab(diam['clarity'],diam['cut'], normalize='index') diamct ``` ``` ## cut Ideal Premium Very Good Good Fair ## clarity ## IF 0.677095 0.128492 0.149721 0.039665 0.005028 ## VVS1 0.560055 0.168536 0.215869 0.050889 0.004651 ## VVS2 0.514410 0.171733 0.243782 0.056455 0.013620 ## VS1 0.439236 0.243422 0.217232 0.079305 0.020805 ## VS2 0.413689 0.273862 0.211372 0.079785 0.021292 ## SI1 0.327746 0.273632 0.247991 0.119403 0.031228 ## SI2 0.282576 0.320753 0.228410 0.117577 0.050685 ## I1 0.197031 0.276653 0.113360 0.129555 0.283401 ``` --- # Stacked bar chart ```python diamct.plot(kind = 'bar', stacked = True) ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-7-5.png" width="50%" style="display: block; margin: auto;" /> --- # Percentage stacked bar chart - Can easily see - The relationship between cut and clarity - Harder to see - Anything about the distribution of the categories --- #Bar charts - Three options - Grouped - Stacked - Percentage stacked - No one is right or wrong, it all depends on what you are trying to communicate. --- class: middle, center, inverse # Categorical v Numeric --- # Side-by-side - Will consider side-by-side - Boxplots - Violin plots - (Jittered) rug plots - Seaborn syntax well suited to these --- # Side by side boxplots ```python sns.boxplot(data = taxisdat, y = 'fare', x = 'pickup_borough') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-8-7.png" width="50%" style="display: block; margin: auto;" /> --- # Side by side violin plots ```python sns.violinplot(data = taxisdat, y = 'fare', x = 'pickup_borough') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-9-9.png" width="50%" style="display: block; margin: auto;" /> --- # Side by side strip plots ```python sns.stripplot(data = taxisdat, y = 'fare', x = 'pickup_borough') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-10-11.png" width="50%" style="display: block; margin: auto;" /> --- # Side by side swarm plots (slow) ```python sns.swarmplot(data = taxisdat[taxisdat['dropoff_borough']=='Bronx'], y = 'fare', x = 'pickup_borough') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-11-13.png" width="50%" style="display: block; margin: auto;" /> --- # Different plots - Boxplots clearly show that Manhattan has the lowest median fare and less volatility in the fare. - Violin plots show three modes in the Queens distribution (cannot be seen using boxplots). - Strip plots show that frequent value at around $50 occurs for Manhattan and Queens. - No single correct plot, use the plot that tells your story. --- class: middle, center, inverse # Numerical v Numerical --- # Scatter plot - The scatter plot plots one numerical variable on the x axis and the other on the y axis. - It is a very common plot and useful for - Finding outliers - Understanding the relationship between two variables. - We will also look at - Log transformations - Strategies for dealing with big data. --- # Scatterplot ```python sns.scatterplot(data = taxisdat, y = 'fare', x = 'distance') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-12-15.png" width="50%" style="display: block; margin: auto;" /> --- # What do we see? - Can still see certain distributional characteristics - Frequent values around $50 - Outliers - Relationship is positive and linear - There are some observations with low distance but high fares. - This could not be seen by only looking at each distribution individually. --- # Adding rug plot ```python sns.scatterplot(data = taxisdat, y = 'fare', x = 'distance') sns.rugplot(data = taxisdat, y = 'fare', x = 'distance') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-13-17.png" width="50%" style="display: block; margin: auto;" /> --- # Adding regression line ```python sns.lmplot(data = taxisdat, y = 'fare', x = 'distance') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-14-19.png" width="50%" style="display: block; margin: auto;" /> --- # Regression - We usually think of x as an independent variable and y as a dependent variable. - Even if you do not include regression lines, if there are natural choices of independent and dependent variable then use them on the x and y axes respectively. - For instance, in this example, distance determines price rather than the other way around. - Always remember that correlation does not imply causality. --- # Example of negative correlation <img src="04TwoVariables_files/figure-html/unnamed-chunk-15-21.png" width="50%" style="display: block; margin: auto;" /> --- # Example of non-linear relationship ```python mpg = sns.load_dataset('mpg') sns.lmplot(data = mpg, y = 'mpg', x = 'displacement', lowess=True) ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-16-23.png" width="50%" style="display: block; margin: auto;" /> --- # Smooth fit - The fitted line comes from the LOcal Weighted Estimated Scatterplot Smoothing (LOWESS) method. - This technique combines the idea of nearest neighbours with regression. - Nearest neighbours are found (along the x-axis) - A constant, linear or quadratic regression is fit to the nearest neighbours. --- # Nearest Neighbours <img src="04TwoVariables_files/figure-html/unnamed-chunk-18-25.png" style="display: block; margin: auto;" /> --- #Nearest Neighbours <img src="04TwoVariables_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- # Local fit <img src="04TwoVariables_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- #Nearest Neighbours <img src="04TwoVariables_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> --- #Details - In the LOWESS algorithm, the smoothing parameter we estimate a regression function ( `\(\hat{y}=f(x)\)` ) rather than a regression line. - The smoothness of this function can be tuned by changing the number of nearest neighbours. - Local regressions estimated by weighted least squares so that closer neighbours are given more influence. - LOWESS is not ideal for large datasets. --- # Log transformations - For positively skewed data it is often worth looking a a log transformation - It particularly makes sense for relationships best understood in terms of percentage changes - For example prices change by 1%, demand changes by 5%. - This transformation only works for positive data. --- # Without transformation ```python sns.scatterplot(data = taxisdat, y = 'fare', x = 'distance') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-22-1.png" width="50%" style="display: block; margin: auto;" /> --- # With transformation ``` ## [None, None] ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-23-3.png" width="50%" style="display: block; margin: auto;" /> --- # Using log scales - Before using the log scale the values in the bottom left corner were too close to one another. - By using the log-log scale these values can be seen more clearly. - One thing to notice is that the fare has clusters around certain values. - You will cover using log scales in a tutorial. --- # Overplotting - With datasets with a large number of observations, points are plotted on top of one another. - This is known as *overplotting* - There are a few strategies to avoid this - Smaller points. - Transparent points. - Contour plot (of 2D KDE). - Binning. - Can demonstrate using diamonds data (53940 observations). --- # Scatterplot ```python sns.scatterplot(data = diam, y = 'price', x = 'carat') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-24-5.png" width="50%" style="display: block; margin: auto;" /> --- # Overplotting - There are so many points in the region towards the bottom left of the plot that points are on top of one another. - This causes issues with how the plot is rendered. - It is difficult to tell where any modes are. - We will construct the same plot but with different strategies to deal with overplotting. --- # Smaller points ```python sns.scatterplot(data = diam, y = 'price', x = 'carat', s=1) ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-25-7.png" width="50%" style="display: block; margin: auto;" /> --- # Transparent points ```python sns.scatterplot(data = diam, y = 'price', x = 'carat', alpha=0.1) ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-26-9.png" width="50%" style="display: block; margin: auto;" /> --- # Add KDE contours ```python sns.scatterplot(data = diam, y = 'price', x = 'carat', s=1) sns.kdeplot(data = diam, y = 'price', x = 'carat') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-27-11.png" width="50%" style="display: block; margin: auto;" /> --- # Two dimensional KDE - Kernel function now takes a vector as an input and returns a single number as an output. - Think of each kernel as a pile of sand on a table, and the value of the function is the height of the sand. - The contour lines give all the values of price and carat that evaluate to the same density value. - If any of you are familiar with elevation maps used in geography (e.g for mountain climbers), this is the same idea. --- # With shading ```python sns.kdeplot(data = diam, y = 'price', x = 'carat', shade = True) ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-28-13.png" width="50%" style="display: block; margin: auto;" /> --- # Binning - Recall a KDE is a smooth version of a histogram. - We can plot a "2D histogram" by binning the data into rectangles and using color to represent the frequncy of observations in each rectangle. - This can be useful for locating modes in bivariate data. --- #Binning ```python sns.displot(data = diam, y = 'price', x = 'carat') ``` <img src="04TwoVariables_files/figure-html/unnamed-chunk-29-15.png" width="50%" style="display: block; margin: auto;" /> --- # Discussion - The advantage of binning (compared to a KDE) is that you can still see outliers - However the KDE is a "smoother" plot that can still be overlaid with a scatterplot. - Again there is no right or wrong plot, it depends on the story you want to tell. --- class: middle, center, inverse # Wrap-up --- # Conclusions - Visualising two variables helps us to understand about the distribution of the variables including - Modes - Outliers - In addition we can understand more about the relationship between the two variables - Direction of relationship. - Whether relationship is linear - Log transformations can be useful to help visualise variables. - With big data, important to avoid overplotting. --- class: middle, center, inverse # Questions