Getting Started with R

High Dimensional Data Analysis

Anastasios Panagiotelis & Ruben Loaiza-Maya

Lecture 2

Basics

The R project

  • In this unit we use the software package R.
  • R is one of the most popular software tools for professionals in Business Analytics and Data Science.

History of R

  • R is based on earlier statistical software called 'S' which was developed at Bell Labs in the 1970s.
  • R was initially developed in the early 1990s by two academics at the University of Auckland, Ross Ihaka and Robert Gentleman.
  • R has grown substantially since then and is now supported by the not-for-profit R Foundation.

R is Free Software

  • R is free in two ways
    • It doesn't cost any money.
    • All of the source code of R is available meaning R can be customised, modified, and most importantly extended.
  • R is part of a big Free Software project known as the GNU Project

Downloading R and R Studio

  • R can be downloaded here
  • Exact details of installing R will depend on whether you use Windows, Mac or Linux.
  • A great tool for both new and experienced users of R is RStudio.
  • It can be downloaded here free of cost.
  • If you haven't done so already, download and install R and RStudio.

Ways to use R

  • To keep track of your workflow use a script:
    • Open a new script by typing Ctrl+Shift+N
    • Run a single line of code by pressing Ctrl+Enter
    • Run a whole script by pressing Ctrl+Shift+S or Ctrl+Shift+Enter
  • You can save scripts to run them anytime.
  • Scripts keep your analysis replicable, which is important in both research and business.

Variables in R

  • In R everything is stored in a variable.
  • Here the word variable has a slightly different meaning to the usual statistical meaning.
  • In R, think of these as little boxes with names on them.
  • These boxes can hold a number, text, a matrix, an entire block of data, or even other boxes.

Assigning Variables

  • How to store the number 1 in a variable a and the number 2 in a variable b?
a<-1
b=2
  • Note that you can use either <- or = to assign variables.

Seeing results

To see what is stored in a variable

print(b)
## [1] 2
str(b)
## num 2

Character variables

  • Text can also be stored in a variable.
  • Try to store your own name in a variable called name.
name<-'Anastasios'
str(name)
## chr "Anastasios"
  • Use quotation marks (single or double) so that R does not look for a variable called Anastasios.

Variable Names

  • Variable names can include letters, digits, the full stop . and the underscore _
  • A variable name cannot begin with a digit or an underscore.
  • A name can begin with a full stop, but only if the second character is not a digit.
  • For more details type ?make.names into your R console.

Valid and Invalid

  • Valid:
    • FirstName
    • First.Name
    • First_Name
    • .FirstName
  • Invalid:
    • 1stName
    • .1stName
    • _First.Name
    • First Name
    • FirstName?
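
One way to check these rules yourself is to compare a candidate name with what make.names returns, since make.names leaves a valid name unchanged. The helper is_valid_name below is a hypothetical name introduced for illustration, not a built-in function. This is a minimal sketch.

```r
# Hypothetical helper: a name is syntactically valid if make.names()
# does not need to change it.
is_valid_name <- function(x) make.names(x) == x

valid   <- c("FirstName", "First.Name", "First_Name", ".FirstName")
invalid <- c("1stName", ".1stName", "_First.Name", "First Name", "FirstName?")

is_valid_name(valid)     # TRUE for every name in the Valid list
is_valid_name(invalid)   # FALSE for every name in the Invalid list

make.names(invalid)      # shows how R would repair each invalid name
```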

Foreign Languages

  • R supports variable names in languages other than English, but the same rules apply.
  • Valid:
    • Όνομα
    • название
    • 名字
    • 이름
  • Invalid:
    • 1Όνομα
    • .название
    • 名 字

Case Sensitivity

  • R is case sensitive.
  • This means that the following are all different:
    • Name
    • name
    • NAME
    • nAMe

Workspace

  • All variables are kept in the workspace. You can see what is in your workspace by using the command
ls()
## [1] "a" "b" "name"

Clear Workspace

  • You can clear the workspace using
rm(list=ls())
  • If you try ls() again the workspace will be empty. In RStudio you can also see all the variables in the Environment tab.
  • It is good practice to start every script with this command so that you do not accidentally use data from a different project.

Working directory

  • If you read data stored on your computer, or if you save plots or data then the concept of a working directory is important.
  • To check your working directory type
getwd()
  • To change the working directory use setwd
setwd("/home/anastasios/Documents")

Basic arithmetic in R

  • Basic arithmetic is straightforward. Try a+b. We can also store the result in a new variable called z.
z<-a+b
str(z)
## num 3
  • To subtract use -, to multiply use *, to divide / and to take powers use ^.
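
The remaining operators can be tried in the same way. A short sketch, reusing the values of a and b assigned earlier:

```r
a <- 1
b <- 2

a - b    # subtraction
a * b    # multiplication
a / b    # division
b ^ 3    # powers: b raised to the power 3

# The usual order of operations applies; use parentheses to override it
a + b * 3
(a + b) * 3
```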

Functions in R

  • Apart from very simple arithmetic, variables in R are manipulated using functions.
  • The input goes in parentheses, and the output can be assigned to a new variable.
  • For example the function sqrt is used to take the square root.
rootb<-sqrt(b)
str(rootb)
## num 1.41

Garbage in/Garbage out

What happens when you take a square root of something that is not a number?

rootname<-sqrt(name)
## Error in sqrt(name): non-numeric argument to mathematical function

Many, if not most, of the mistakes you make in R occur because you pass the wrong type of input to a function.
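
A quick way to diagnose this kind of mistake is to check the type of a variable before passing it to a function. The sketch below uses the base functions class and is.numeric, with the same values as the earlier slides.

```r
name <- 'Anastasios'
b <- 2

class(name)        # the type of name
class(b)           # the type of b
is.numeric(name)   # FALSE, so sqrt(name) would fail
is.numeric(b)      # TRUE, so sqrt(b) is fine

# Only take the square root when the input is numeric
if (is.numeric(name)) {
  rootname <- sqrt(name)
} else {
  message("name is not numeric, so sqrt() is skipped")
}
```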

Getting Help

  • If you aren't sure what a function does, use R help. The easiest way is to type ? followed by the function name.
?sqrt
  • If you want to do something and do not know the name of the relevant function you can search using ??.
  • Find a function to do logarithms using
??logarithms

Comments

Anything after a # will not be executed by R.

a<-1 # Set the variable a to 1
#x<-4 This line is not executed
str(a)
## num 1
str(x)
## Error in str(x): object 'x' not found

In RStudio, comment or uncomment multiple lines using Ctrl+Shift+C.

Vectors

In statistics we usually have many observations for each variable. The function c() stores these in a vector. Suppose we have the following data:

Names    Drink   Consumption   Satisfaction
Andrew   Coke    50            5
Boris    Pepsi   40            4
Cathy    Coke    25            4
Diana    7Up     0             3

Manually inputting data

First let's create the variable Consumption

Consumption<-c(50,40,25,0)
print(Consumption)
## [1] 50 40 25 0
str(Consumption)
## num [1:4] 50 40 25 0

Exercise: put the values of Drink into a variable.

Solution

The solution is

Drink<-c('Coke','Pepsi','Coke','7Up')
print(Drink)
## [1] "Coke" "Pepsi" "Coke" "7Up"
str(Drink)
## chr [1:4] "Coke" "Pepsi" "Coke" "7Up"

Vector

These variables are examples of vectors. When we apply some functions to a vector, the function is applied to each element.

logcons<-log(Consumption)
str(logcons)
## num [1:4] 3.91 3.69 3.22 -Inf

Vectors

Other functions take a vector as an input and return a single number as the output

meancons<-mean(Consumption)
str(meancons)
## num 28.8
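
To make the distinction between the two kinds of function concrete, here is a small sketch using only base R and the Consumption vector from the earlier slide.

```r
Consumption <- c(50, 40, 25, 0)

# Element-by-element: one output value per input value
Consumption / 10
sqrt(Consumption)

# Summaries: one output value for the whole vector
length(Consumption)
sum(Consumption)
max(Consumption)
sd(Consumption)
```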

Inf and NaN

The values Inf and -Inf refer to positive and negative infinity. The value NaN stands for 'not a number' and indicates an undefined result.

log(-1)
## Warning in log(-1): NaNs produced
## [1] NaN

It is important to distinguish NaN from NA. The latter is used for missing data.
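
The difference shows up in practice as follows. This is a sketch with made-up data; the vector x is hypothetical and not part of the lecture data.

```r
x <- c(50, NA, 25)   # hypothetical data with one missing value

is.na(x)                # flags the missing observation
mean(x)                 # NA, because the missing value propagates
mean(x, na.rm = TRUE)   # 37.5, the mean of the non-missing values

is.nan(0 / 0)   # TRUE: 0/0 is undefined
is.nan(NA)      # FALSE: NA is missing, not undefined
is.na(NaN)      # TRUE: is.na() also flags NaN
```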

Data Frame

  • It is tedious to manually enter large datasets in this way. You will usually import data from an external file.
  • There are many ways to import data. For files with the .rds extension it is easy:
Beer<-readRDS("Beer.rds")
  • Make sure the location of the file is right. In RStudio you can also open a file through the Files tab.
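
If you do not have an .rds file to hand, you can create one yourself with saveRDS and read it back. In this sketch the data frame drinks and the file name drinks.rds are hypothetical, and the file is written to a temporary folder rather than your working directory.

```r
# Hypothetical data frame standing in for a real dataset
drinks <- data.frame(Drink = c('Coke', 'Pepsi'), Consumption = c(50, 40))

# tempdir() is used here so the example does not touch your own folders
path <- file.path(tempdir(), 'drinks.rds')

saveRDS(drinks, path)      # write the object to disk
file.exists(path)          # check the file is where we expect
drinks2 <- readRDS(path)   # read it back into a new variable

identical(drinks, drinks2) # the copy matches the original
```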

Data Frame

  • There is only one variable here, the variable Beer. However, this is a special kind of variable known as a data frame.
  • A data frame contains other variables. For example, alcohol content can be accessed using the $ operator:
str(Beer$alcohol)
## num [1:35] 3.7 4.1 4.2 4.3 2.9 2.3 4.2 4.7 5.5 4.7 ...
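
A data frame can also be built by hand from vectors of equal length. As a sketch, the soft drink data from the Vectors slide can be combined with data.frame; the name SoftDrinks is chosen here for illustration.

```r
Names <- c('Andrew', 'Boris', 'Cathy', 'Diana')
Drink <- c('Coke', 'Pepsi', 'Coke', '7Up')
Consumption <- c(50, 40, 25, 0)
Satisfaction <- c(5, 4, 4, 3)

# Each vector becomes one column of the data frame
SoftDrinks <- data.frame(Names, Drink, Consumption, Satisfaction)

str(SoftDrinks)
SoftDrinks$Consumption          # one column, returned as a vector
mean(SoftDrinks$Satisfaction)   # functions work on columns as on any vector
nrow(SoftDrinks)                # number of observations
```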

Lists

Another object common in R is known as a list. A list can contain completely different variables.

alist<-list(w=name, x=Drink, y=Beer)

Elements of a list are accessed using [[ ]] or $:

alist[[1]]
## [1] "Anastasios"
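
Access by name works as well as access by position. The sketch below builds a smaller list than the one above (Beer is left out so that no data file is needed) and shows both styles.

```r
alist <- list(w = 'Anastasios', x = c('Coke', 'Pepsi', 'Coke', '7Up'))

alist$w          # access by name with $
alist[['x']]     # access by name with [[ ]]
alist[[2]][1]    # first element of the second component
names(alist)     # the names of all components
```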

Packages

R Packages

  • One of the biggest advantages of R is the use of add-on packages, which are easily downloaded from an online repository called CRAN. Using a package involves two steps:
    • Download and install the package using the function install.packages.
    • Load the package using the library function.
  • Both these steps can also be done in RStudio through the Packages tab.
  • Try to install and load the R package ggplot2.

Package Documentation

  • By downloading the package you also download all of the help documentation.
install.packages('ggplot2')

To load the package

library(ggplot2)

Graphics in R

  • There are three different ways to draw graphs in R:
    • Base graphics (no packages required)
    • Trellis graphics (using the package lattice)
    • ggplot2
  • In this unit you will mostly be given instruction on using ggplot2. However, if you have learnt base graphics in another unit and prefer them, you can use them instead.
  • There are many resources for learning ggplot2, including some that are free online.

The mpg dataset

To demonstrate ggplot2 we use the dataset mpg, which contains information on the fuel efficiency of different cars. This can be loaded into R using the command

data(mpg)

This dataset comes with ggplot2, so load the package first.

ggplot object

  • To make a plot, the first task is to create a ggplot object.
  • You need to specify the data frame and the aesthetic.
  • In a 2D plot we can compare two variables
  • To start, think of the aesthetic as the x and y variable.
  • Consider a plot to compare the fuel efficiency of a car in the city and on the highway.
38

City v Highway MPG

ggplot(mpg,aes(x=cty,y=hwy))

Geometry

  • This should produce some axes and labels but there is no plot yet.
  • To produce a plot we need to tell R what type of plot we want.
  • In the language of ggplot this is called a geometry
  • For a scatter plot the geometry is geom_point().
40

Scatterplot in R

ggplot(mpg,aes(x=cty,y=hwy))+geom_point()

Other aesthetics

We can think of colour as a third aesthetic

ggplot(mpg,aes(x=cty,y=hwy,col=cyl))+geom_point()

With a categorical variable

ggplot(mpg,aes(x=cty,y=hwy,col=drv))+geom_point()
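
The two colour examples behave differently because cyl is stored as a number while drv is stored as text. As a sketch, wrapping cyl in factor() makes ggplot2 treat it as categorical too. The axis and legend labels added with labs are illustrative choices, and the plot is stored in a variable p (a name chosen here) before being drawn.

```r
library(ggplot2)

# cyl is numeric, so colour = cyl gives a continuous colour scale.
# factor(cyl) gives one distinct colour per number of cylinders instead.
p <- ggplot(mpg, aes(x = cty, y = hwy, colour = factor(cyl))) +
  geom_point() +
  labs(x = 'City (mpg)', y = 'Highway (mpg)', colour = 'Cylinders')

print(p)   # draw the plot
```

Here colour is the full name of the aesthetic that the slides abbreviate to col.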

Quicker plots

A useful function in ggplot2 is qplot, which tries to guess the plot you want. Note that qplot is deprecated in recent versions of ggplot2, so it may produce a warning. Try these examples

qplot(x=cty,y=hwy,data=mpg)
qplot(x=cty,data=mpg)

Exporting graphics

  • You can export graphics using the Export button in the Plots tab in RStudio.
  • Many different file formats are available. As an alternative you can do the following:
pdf('myplot.pdf')
qplot(x=cty,data=mpg)
dev.off()
  • Other file formats such as png or jpeg can be used instead of pdf.
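
For instance, a png version follows the same open, plot, close pattern. This sketch swaps in a base graphics histogram of the built-in mtcars data so that it runs without ggplot2, and writes to a temporary folder. The file name and the pixel dimensions are illustrative assumptions.

```r
# tempdir() keeps the example from writing into your working directory
path <- file.path(tempdir(), 'myplot.png')

png(path, width = 800, height = 600)   # open a png device (size in pixels)
hist(mtcars$mpg)                       # any plotting command goes here
dev.off()                              # close the device to finish the file

file.exists(path)                      # confirm the file was written
```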

Using R Markdown

  • If you want to avoid exporting graphics and then importing them into your document then R Markdown is extremely useful.
  • Enclose code chunks within three backticks and R Markdown runs the code and inserts the output into the document.
  • You can set echo=TRUE to present the code or echo=FALSE to hide it.

Advantages of R markdown

  • Easy to update analysis.
    • Suppose new data is obtained and all reports need to reflect the new data.
  • Easy to reproduce/audit analyses.
    • Suppose that some time after an analysis has been completed it is necessary to check what has gone wrong.
  • Easy to automate analyses.
    • Some tasks require the generation of reports that are tedious to do manually every time.

Data Manipulation

  • There are several ways to manipulate data, but a particularly useful and easy-to-use package is dplyr.
  • We can exclude observations using the filter function.
  • To really understand how to use this function it helps to know about logical operators (try ?Logic) and relational operators (try ?Comparison).
  • We will do a few simple examples here
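
Before turning to filter, it may help to see the logical and relational operators on their own. The sketch below uses two short made-up vectors (the values are hypothetical, not taken from mpg) and only base R.

```r
cty <- c(18, 21, 14, 11)        # hypothetical city mpg values
drv <- c('f', 'f', '4', 'r')    # hypothetical drive types

cty < 15                        # relational: is each value below 15?
drv == '4'                      # relational: equal to
drv != '4'                      # relational: not equal to

(drv == '4') & (cty < 15)       # logical and: both conditions hold
(drv == '4') | (cty < 15)       # logical or: at least one holds
!(drv == '4')                   # logical not

cty[cty < 15]                   # a logical vector picks out elements
```

Each comparison returns one TRUE or FALSE per element, which is what filter uses to decide which rows to keep.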

Using dplyr

To create a new data frame that only includes 4 wheel drives

library(dplyr)
mpg_4wd<-filter(mpg,drv=='4')

To exclude all 4 wheel drives

mpg_no4wd<-filter(mpg,drv!='4')

Two conditions

Suppose we only want to consider cars that are 4 wheel drives and can drive more than 15 miles per gallon on the highway

mpg_4wd_hwyg15<-filter(mpg,(drv=='4')&(hwy>15))

Or those that are either 4 wheel drives or can drive less than 15 miles per gallon in the city

mpg_4wd_ctyl15<-filter(mpg,(drv=='4')|(cty<15))

Without dplyr

  • This sort of data manipulation can be done without dplyr but is more verbose.
  • For example the last line would be
mpg_4wd_ctyl15<-mpg[((mpg$drv=='4')|(mpg$cty<15)),]
  • Both give the same result, so use whichever you prefer.

Summarise Function

  • Suppose we want the mean and standard deviation of highway fuel efficiency in the (filtered) data.
mpg_4wd_ctyl15<-filter(mpg,(drv=='4')|(cty<15))
mean_sd_hwy<-summarise(mpg_4wd_ctyl15,mean(hwy),sd(hwy))
mean_sd_hwy
## # A tibble: 1 x 2
## `mean(hwy)` `sd(hwy)`
## <dbl> <dbl>
## 1 19 3.91
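
The column names `mean(hwy)` and `sd(hwy)` are awkward to work with. A possible refinement, sketched below, is to name the summaries inside summarise; the names mean_hwy and sd_hwy are chosen here for illustration.

```r
library(dplyr)
library(ggplot2)   # provides the mpg data

mpg_4wd_ctyl15 <- filter(mpg, (drv == '4') | (cty < 15))

# name = expression gives each summary column a readable name
mean_sd_hwy <- summarise(mpg_4wd_ctyl15,
                         mean_hwy = mean(hwy),
                         sd_hwy = sd(hwy))

mean_sd_hwy$mean_hwy   # columns can now be accessed by name
```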

Pipes

  • Pipes from the magrittr package make this easier.
filter(mpg,(drv=='4')|(cty<15))%>%
summarise(mean(hwy),sd(hwy))%>%
print
## # A tibble: 1 x 2
## `mean(hwy)` `sd(hwy)`
## <dbl> <dbl>
## 1 19 3.91
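
R version 4.1.0 and later also has a built-in pipe, written |>, which needs no package. As a sketch, the example below uses it with base functions and the built-in mtcars data (not the mpg data used above) to count the cars with 4 cylinders. The variable name n_4cyl is chosen for illustration.

```r
# mtcars ships with R; cyl is the number of cylinders
n_4cyl <- mtcars |>
  subset(cyl == 4) |>   # keep only the 4-cylinder cars
  nrow()                # count the remaining rows

n_4cyl
```

Each step passes its result to the first argument of the next function, just as %>% does.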

Conclusion

  • This lecture has given you a foundation in R
  • You can use R to do much more including collecting data off the web, cleaning the data, fitting models to the data, creating web applications, or even creating documents (these slides were created in RStudio).
  • This can be daunting, but remember the best thing about R is that there are lots of ways to teach yourself R.