Introduction and Motivation

Data Visualisation and Analytics

Anastasios Panagiotelis and Lauren Kennedy

Lecture 1

1

Telling stories with data

2

What is DVA?

  • Data visualisation and analytics
    • Gain insights from data,
    • Communicate,
    • Make informed decisions.
  • This unit is about telling stories with data.
  • For a good example see this talk by Hans Rosling.
3

Good workflow

From Grolemund and Wickham, R for Data Science

4

A way to do data analysis

  • In the past you may have done data analysis using the following steps.
    • Download data from a website.
    • Manipulate data in Excel, save as a new file.
    • Open this file in Stata, create some plots and fit some models.
    • Cut and paste figures and tables into a Word file.
5

An alternative

  • Using R brings that whole pipeline into one place.
    • Easier to update analysis,
    • Easier to reproduce analysis,
    • Easier to collaborate with others,
    • Easier to automate analysis.
6

Intro to R

7

The R language

  • The R programming language can be downloaded here.
  • Exact details of installing R will depend on whether you use Windows, Mac or Linux.
  • A great tool for both new and experienced users of R is RStudio which can be downloaded here.
8

Using R

  • To keep track of your workflow use a script:
    • You can open a new script by typing Ctrl+Shift+N
    • You can run a single line of code by pressing Ctrl+Enter
    • You can run a whole script by pressing Ctrl+Shift+S or Ctrl+Shift+Enter
  • You can save scripts to run them anytime.
9

R Markdown

  • If you want to embed your analysis within a written document then R Markdown is an excellent tool.
  • Enclose any R code in a chunk delimited by three backticks.
  • Set echo=TRUE to show the code in the output, or echo=FALSE to hide it.
  • If you are completely new to R, then use scripts at first.
10

Variables in R

  • In R everything is stored in a variable. Here the word variable has a slightly different meaning to the usual statistical meaning.
  • In R, think of variables as little boxes or envelopes with names on them.
  • We can put a number into these boxes, or words or matrices or entire blocks of data or even other boxes.
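
As a rough illustration of the "boxes" idea, the sketch below puts a few different kinds of object into variables. The variable names (a_number, a_word, a_matrix, a_box_of_boxes) are made up for this example, and the assignment arrow <- is explained on the next slide.

```r
a_number <- 42                                      # a single number
a_word <- "hello"                                   # a word (character string)
a_matrix <- matrix(1:4, nrow = 2)                   # a 2 x 2 matrix
a_box_of_boxes <- list(a_number, a_word, a_matrix)  # a box holding other boxes

class(a_number)          # "numeric"
class(a_word)            # "character"
dim(a_matrix)            # 2 2
length(a_box_of_boxes)   # 3
```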
11

Assigning Variables

  • How to store the number 1 in a variable a and the number 2 in a variable b?
a<-1
b=2
  • Note that you can use either <- or = to assign variables.
12

Seeing results

There are a few options for looking at what is stored in a variable:

print(b)
## [1] 2
b
## [1] 2
str(b)
## num 2
13

Character variables

  • We can store more than just numbers in a variable.
  • Try to store your own name in a variable called name.
name<-'Anastasios'
str(name)
## chr "Anastasios"
  • You must enclose the text in quotation marks (single or double); otherwise R will look for a variable called Anastasios.
14

Variable Names

  • Variable names can include letters, digits, the full stop . and the underscore _
  • The variable name cannot begin with a number or underscore.
  • They can begin with a full stop, but only if the second character is not a digit.
  • For more details type ?make.names into your R console
15

Valid and Invalid

  • Valid:
    • FirstName
    • First.Name
    • First_Name
    • .FirstName
  • Invalid:
    • 1stName
    • .1stName
    • _First.Name
    • First Name
    • FirstName?
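
One way to check these examples yourself is with make.names, which converts any string into a syntactically valid name. The sketch below treats a candidate as valid when make.names leaves it unchanged. This is only a rough check (make.names also alters reserved words such as if), and the variable names candidates and is_valid are made up for this example.

```r
candidates <- c("FirstName", "First.Name", "First_Name", ".FirstName",
                "1stName", ".1stName", "_First.Name", "First Name", "FirstName?")

# A name is treated as valid here if make.names() does not need to change it
is_valid <- candidates == make.names(candidates)

# Show each candidate next to the verdict and the repaired name
data.frame(candidates, is_valid, repaired = make.names(candidates))
```

The first four candidates come back as valid and the last five as invalid, matching the lists above.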
16

Foreign Languages

  • R supports variable names written in other languages and scripts, but the same rules apply.
  • Valid:
    • Όνομα
    • название
    • 名字
    • 이름
  • Invalid:
    • 1Όνομα
    • .1название
    • 名 字
17

Case Sensitivity

  • R is case sensitive.
  • This means that the following are all different:
    • Name
    • name
    • NAME
    • nAMe
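
A minimal sketch of what this means in practice: the three names below refer to three separate variables, each holding its own value. The values are arbitrary and chosen only for illustration.

```r
Name <- 1
name <- 2
NAME <- 3

Name == name          # FALSE: these are different variables
Name + name + NAME    # 6
```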
18

Workspace

  • All variables are kept in the workspace. You can see what is in your workspace by using the command
ls()
## [1] "a" "b" "name"
19

Clear Workspace

  • You can clear the workspace using
rm(list=ls())
  • If you try ls() again the workspace will be empty.
  • In RStudio you can also see all the variables in the Environment tab.
  • It is worth clearing the workspace at the beginning of every script.
20

Working directory

  • If you try to read data from your hard drive, or save plots or data then the concept of a working directory is important. To check your working directory type
getwd()
## [1] "/home/anastasios/Documents/Teaching/DataVizA2019/Lectures/01Intro"
  • To change the working directory use setwd
setwd("/home/anastasios/Documents")
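
The sketch below shows why the working directory matters: a file written with a relative path ends up in whatever the working directory is at the time. It uses the temporary folder returned by tempdir() so that it runs on any computer. The file name example.txt and the variable old_wd are made up for this example.

```r
old_wd <- getwd()                     # remember where we started
setwd(tempdir())                      # move to a temporary folder

writeLines("hello", "example.txt")    # relative path: saved in the working directory
file.exists("example.txt")            # TRUE
file.remove("example.txt")            # tidy up

setwd(old_wd)                         # go back to the original working directory
```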
21

Basic arithmetic in R

  • Basic arithmetic is fairly simple. Try a+b. We can also store the result in a new variable called z.
z<-a+b
str(z)
## num 3
  • To subtract use -, to multiply use *, to divide / and to take powers use ^.
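
The remaining operators can be tried in the same way. The sketch below repeats the values of a and b from earlier so that it runs on its own.

```r
a <- 1
b <- 2

a - b          # -1
a * b          # 2
a / b          # 0.5
b ^ 3          # 8
(a + b) * b    # 6, parentheses control the order of operations
```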
22

Functions in R

  • Apart from very simple arithmetic, variables in R are manipulated using functions.
  • The input (also called an argument) goes in parentheses, and the output can be assigned to a new variable.
  • Some functions take more than one input. In this case, separate the inputs with commas.
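
As a small sketch of functions with more than one input, the example below uses the built-in functions round and max. The variable names on the left-hand side are made up for this example.

```r
# round() takes the number to round and the number of decimal places
rounded <- round(3.14159, 2)
rounded                          # 3.14

# Arguments can also be given by name, which is often easier to read
rounded_named <- round(3.14159, digits = 2)

# max() accepts any number of inputs, separated by commas
biggest <- max(4, 9, 2)
biggest                          # 9
```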
23

Example

  • The function sqrt takes the square root.
rootb<-sqrt(b)
str(rootb)
## num 1.41

What happens when you take a square root of something that is not a number?

rootname<-sqrt(name)
## Error in sqrt(name): non-numeric argument to mathematical function
24

Getting Help

  • If you aren't sure what a function does, use R's help. The easiest way is to type ? followed by the name of the function.
?sqrt
  • If you want to do something and do not know the name of the relevant function you can search using ??. Try to find a function to do logarithms using
??logarithms
25

Comments

Anything after a # will not be executed by R.

a<-1 # Set the variable a to 1
#x<-4 This line is not executed
str(a)
## num 1
str(x)
## Error in str(x): object 'x' not found

Comment multiple lines using Ctrl+Shift+C

26

Vectors

We can create a variable with multiple numbers or strings using the c function.

Consumption<-c(50,40,25,0)
str(Consumption)
## num [1:4] 50 40 25 0
Drink<-c('Coke','Pepsi','Coke','Homebrand')
str(Drink)
## chr [1:4] "Coke" "Pepsi" "Coke" "Homebrand"
27

Vector

These variables are examples of vectors. Many functions are vectorised: when we apply them to a vector, the function is applied to each element in turn.

logcons<-log(Consumption)
str(logcons)
## num [1:4] 3.91 3.69 3.22 -Inf
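
Arithmetic and comparisons work element by element in the same way. The sketch below repeats the Consumption vector so that it runs on its own; the names halves and over_30 are made up for this example.

```r
Consumption <- c(50, 40, 25, 0)

halves <- Consumption / 2     # divide every element by 2
halves                        # 25.0 20.0 12.5 0.0

over_30 <- Consumption > 30   # compare every element with 30
over_30                       # TRUE TRUE FALSE FALSE

sqrt(Consumption)             # square root of every element
```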
28

Vectors

Other functions take a vector as an input and return a single number as the output

meancons<-mean(Consumption)
str(meancons)
## num 28.8
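
A few other functions of this kind are sketched below. Note that str only shows about three significant digits, so the mean displayed above as 28.8 is actually 28.75.

```r
Consumption <- c(50, 40, 25, 0)

sum(Consumption)                          # 115
length(Consumption)                       # 4
max(Consumption)                          # 50
min(Consumption)                          # 0
sum(Consumption) / length(Consumption)    # 28.75, the same as mean(Consumption)
```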
29

Inf and NaN

There are special values that numeric variables can take: Inf and -Inf for positive and negative infinity, and NaN for "not a number". A NaN usually signals an invalid calculation; R reports it with a warning rather than stopping with an error.

log(-1)
## Warning in log(-1): NaNs produced
## [1] NaN

It is important to distinguish NaN from NA. The latter is used for missing data.
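
The sketch below shows the difference in practice. The vector scores is made up for this example and has one missing value.

```r
0 / 0                         # NaN: the result is undefined
1 / 0                         # Inf

scores <- c(1, NA, 3)         # NA marks a missing observation
mean(scores)                  # NA: the mean is unknown if a value is missing
mean(scores, na.rm = TRUE)    # 2: remove missing values first

is.na(scores)                 # FALSE TRUE FALSE
is.nan(scores)                # FALSE FALSE FALSE: NA is not NaN
is.na(NaN)                    # TRUE: is.na() also flags NaN
```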

30

Lists

Another object common in R is known as a list. A list can contain completely different types of variables.

alist<-list(w=name, x=Drink, y=Consumption)

Elements of lists are accessed using [[]] or $.

alist[[1]]
## [1] "Anastasios"
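
The sketch below shows the other ways of getting at list elements. It rebuilds alist from scratch so that it runs on its own.

```r
name <- 'Anastasios'
Drink <- c('Coke', 'Pepsi', 'Coke', 'Homebrand')
Consumption <- c(50, 40, 25, 0)
alist <- list(w = name, x = Drink, y = Consumption)

names(alist)      # "w" "x" "y"
alist$x           # the whole Drink vector, accessed by name
alist[["y"]]      # the whole Consumption vector, accessed by name
alist$x[2]        # "Pepsi": the second element of the vector stored in x
```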
31

Packages

32

R Packages

  • A big advantage of R is the use of add-on packages, easily downloaded from an online repository called CRAN.
  • Using a package involves two steps:
    • Download and install the package using the function install.packages (do once).
    • Load the package using the library function (include at the beginning of your script).
  • Both these steps can also be done in RStudio through the Packages tab.
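
Because installation only needs to happen once, one possible pattern is to install a package only when it is missing. The helper install_if_missing below is hypothetical (it is not part of R or of this unit); it relies on the built-in functions requireNamespace and install.packages.

```r
# Hypothetical helper: install a package only if it is not already installed
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# The stats package ships with R, so nothing is downloaded here
has_stats <- install_if_missing("stats")
has_stats    # TRUE

# For an add-on package you would write, for example:
# install_if_missing("ggplot2")
# library(ggplot2)
```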
33

Options in installing packages

  • If you have not already done so, download, install and load the R package ggplot2
install.packages('ggplot2')

To load the package

library(ggplot2)
  • By downloading the package you also download all of the help documentation.
34

The tidyverse

  • When you have time, download the tidyverse package
  • This is a question in your tutorial exercises but please do this before next week.
  • The tidyverse is a collection of packages.
    • readr is used for reading in data.
    • dplyr and tidyr are used for manipulating data into an easy to use format.
    • ggplot2 is used for visualisation.
35

Anscombe's quartet

36

Plotting data

  • Anscombe's quartet is a synthetic dataset used to demonstrate the importance of data visualisation.
  • We will also use it to learn some basic R.
  • The data comes built into R.
  • There are 4 pairs of x and y variables.
37

Anscombe's quartet

str(anscombe)
## 'data.frame': 11 obs. of 8 variables:
## $ x1: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x2: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x3: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x4: num 8 8 8 8 8 8 8 19 8 8 ...
## $ y1: num 8.04 6.95 7.58 8.81 8.33 ...
## $ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
## $ y3: num 7.46 6.77 12.74 7.11 7.81 ...
## $ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
38

Summary stats

  • We can find the mean of the final pair using the mean function.
xbar<-mean(anscombe$x4)
ybar<-mean(anscombe$y4)
str(xbar)
## num 9
str(ybar)
## num 7.5
39

Summary stats

  • We can find the variance of the final pair using the var function.
vx<-var(anscombe$x4)
vy<-var(anscombe$y4)
str(vx)
## num 11
str(vy)
## num 4.12
40

Summary stats

  • We can find the correlations between x and y using the cor function.
rxy<-cor(anscombe$x4,anscombe$y4)
str(rxy)
## num 0.817
  • There are two inputs or arguments to the function. Separate these using a comma.
41

Your turn

  • If your birthday is from January to April:
    • Find the mean and variance of x1 and y1 and their correlation
  • If your birthday is from May to August:
    • Find the mean and variance of x2 and y2 and their correlation
  • If your birthday is from September to December:
    • Find the mean and variance of x3 and y3 and their correlation
42

Conclusions

  • Results
    • The means of all x variables are 9
    • The means of all y variables are 7.5
    • The variances of all x variables are 11
    • The variances of all y variables are 4.12
    • The correlation between x and y is 0.82
  • Does this mean all datasets are equal?
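
One way to check all of these results at once is sketched below, using sapply to apply a function to each column of the built-in anscombe data. This goes slightly beyond what has been covered so far, and the variable names are made up for this example.

```r
x_cols <- anscombe[, c("x1", "x2", "x3", "x4")]
y_cols <- anscombe[, c("y1", "y2", "y3", "y4")]

x_means <- sapply(x_cols, mean)   # mean of each x variable
y_means <- sapply(y_cols, mean)   # mean of each y variable
x_vars  <- sapply(x_cols, var)    # variance of each x variable
y_vars  <- sapply(y_cols, var)    # variance of each y variable

# Correlation within each pair (x1 with y1, x2 with y2, and so on)
cors <- sapply(1:4, function(i) cor(x_cols[[i]], y_cols[[i]]))

round(x_means, 2)
round(y_means, 2)
round(x_vars, 2)
round(y_vars, 2)
round(cors, 2)
```

The results agree across the four pairs to about two decimal places, not exactly.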
43

Let's visualise

  • Later on we will use the ggplot function to create figures.
  • For now we can use a simple function within the ggplot2 package called qplot.
  • Simply tell qplot the variable(s) that you want to plot and the dataset.
  • The qplot function tries to guess what type of plot you want.
44

Histogram

qplot(x4,data = anscombe)

45

Scatterplot

qplot(x4,y4,data = anscombe)

46

Your turn

  • If your birthday is from January to April:
    • Plot histograms of x1 and y1 and a scatterplot.
  • If your birthday is from May to August:
    • Plot histograms of x2 and y2 and a scatterplot.
  • If your birthday is from September to December:
    • Plot histograms of x3 and y3 and a scatterplot.
47

All Results
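
The figure for this slide is not reproduced in these notes. As a rough sketch, the four scatterplots can be drawn side by side as follows. This uses base R graphics rather than the ggplot2 functions used elsewhere in the unit, and it is not necessarily how the original figure was made.

```r
old_par <- par(mfrow = c(2, 2))    # arrange plots in a 2 x 2 grid

for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]  # x1, x2, x3, x4 in turn
  y <- anscombe[[paste0("y", i)]]  # y1, y2, y3, y4 in turn
  plot(x, y,
       main = paste("Dataset", i),
       xlab = paste0("x", i),
       ylab = paste0("y", i))
}

par(old_par)                       # restore the previous plot layout
```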

48

Visualisation

  • Although all four datasets have the same summary stats, they are vastly different.
  • These differences can easily be seen using visualisation.
  • Always look at your data as part of an analysis.
49

Where now

  • Clearly qplot is quite limited in what it is able to do.
  • Over the next period we will consider:
    • More ways to plot a single variable.
    • More ways to plot relationships between two or more variables.
    • Visualising variables that are categorical.
  • Before getting into those details we cover some general principles of good plotting.
50
