What is R?

R is a free, open-source environment for data analysis.

Currently, R and Python are the dominant computing environments for data science.

Both R and Python are scripting languages. You can type commands interactively at the “command line”, and you can maintain a file containing a script of commands that can be run in chunks or all at once.

Installing R and RStudio

First install R and then RStudio.

To install R go to https://www.r-project.org/

then:

CRAN/pick a mirror/download and install (Mac, Windows, or Linux).

For RStudio, go to https://rstudio.com/

then:

Download/RStudio Desktop (free).

This YouTube video walks you through it:

https://www.youtube.com/watch?v=orjLGFmx6l4

The site https://swirlstats.com/students.html also explains how to install R and RStudio and is a very nice place to learn R.

First R, Entering Commands in the Console Pane

After installation, you should be able to start RStudio by clicking (or double-clicking) its icon.

When you first open RStudio, you will see three panes, as in the picture below.
Note that by moving your cursor to the boundaries between the panes you can adjust their sizes.
You can also enlarge or shrink a pane by clicking the icons near its top right.

The pane on the left is the console pane.
For now, let’s focus on the console pane.

Figure: the initial RStudio window.

The simplest way to use R is to type commands into the console and then hit return.
R will print out the results.

2+3
## [1] 5

The “2+3” is what you type into the console (followed by return); the line below it is the result R prints out.

You can do just about any kind of numerical calculation in R:

2^2
## [1] 4
2**3
## [1] 8
2+3*5
## [1] 17
10/2
## [1] 5
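
The result of 2+3*5 above follows the usual rules of arithmetic precedence. As a small sketch (the particular numbers are illustrative and not from the examples above), parentheses change the order of evaluation, and R also has operators for integer division and remainders:

```r
(2+3)*5    # parentheses force the addition to happen first
## [1] 25
-2^2       # exponentiation is done before the minus sign is applied
## [1] -4
(-2)^2
## [1] 4
7 %/% 2    # integer division
## [1] 3
7 %% 2     # remainder
## [1] 1
```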

You can quit R by typing q() in the console or using the menu /File/Quit Session.

Second R, Writing R Scripts

Clearly, typing commands into the console will quickly become tedious.

The way you really work with R is to maintain a file of R commands that do what you want.
This is called an R script.

To start an R script, use /File/New File/R Script to get an (empty) R script.
This will create a new pane (top left) where you can write and edit your script.
You can run commands directly from your R script, rather than having to enter them in the console.
When you execute the commands, you see the results in the Console tab of the bottom-left pane (the console pane).

To run the R code on a single line of your script, put the cursor on that line and then click Run near the top right of the pane.
To run a chunk of code, select the chunk and then click Run.

Once you have a script that does a lot of stuff you like, you will want to save it.
Note that the file name is in red and there is a star when you have edits to your script that have not been saved.
To save, you can just click the disk icon or go to /File/Save.
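
A saved script can also be run all at once from the console with the source function. The sketch below is only an illustration: it writes a tiny two-line script to a temporary file (standing in for a script you saved from RStudio) and then runs it. The file contents and the variable names a and b are made up for the example.

```r
# Write a two-line script to a temporary file (a stand-in for your own saved script)
script_file = tempfile(fileext = ".R")
writeLines(c("a = 2 + 3", "b = a^2"), script_file)

# Run every command in the file, top to bottom
source(script_file)
b
## [1] 25
```

With your own script you would pass its file name instead, for example source("myscript.R"), where myscript.R is a hypothetical name.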

When you save a file, it will be saved to your working directory (or working folder).
You can set the working directory with /Session/Set Working Directory.
You can see the files in your working directory in the files tab of the bottom-right notebook pane.
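
The working directory can also be checked and changed with commands. This is a minimal sketch; it uses R's temporary folder only as a safe stand-in for a folder of your own.

```r
old_dir = getwd()     # the current working directory
setwd(tempdir())      # switch to R's temporary folder (just for illustration)
new_dir = getwd()
list.files()          # files in the new working directory
setwd(old_dir)        # switch back to where we started
```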

Figure: RStudio with the R script pane.

Variables and Functions in R

Just about everything you do in R involves storing some kind of information in a variable and doing something with that information using a function. Let’s see a simple example.

x=2

We have created the variable with name x.

The information “2” is stored and we can access this information by the name x. For example we can print out x just by typing its name.

x 
## [1] 2

Note that the variable x now shows up in the Environment tab of the top-right pane.

We can then do things with the information in x:

x^2
## [1] 4

We can apply functions to the information in variables:

y = x^2
z = sqrt(y)
z
## [1] 2

sqrt is the function which calculates the square root. We have stored the results from calling the function sqrt with argument y in the variable z.

Now we have three variables in our R session. To see what variables are around use

ls()
## [1] "x" "y" "z"

The arguments to a function are the variables or quantities you put in. Sometimes we will have lots of arguments. Above, the function ls is called with no arguments.

Here we use the sum function with two arguments.

sum(x,y)
## [1] 6

To remove a variable from your workspace, use the rm function:

ls()
## [1] "x" "y" "z"
rm(z)
ls()
## [1] "x" "y"

You can get rid of all the variables using the top-right pane. You just click on the little brush beside the drop down menu labeled “Import Dataset”.
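
The same clean-up can be done with commands. In this sketch the variables a and b are made up for illustration; the last line is commented out because it would clear your whole workspace.

```r
a = 1
b = 2
rm(a,b)          # remove several variables in one call
exists("a")      # is there still something called a?
## [1] FALSE

# rm(list=ls())  # would remove every variable, like clicking the brush
```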

Note that R will often talk about “objects”. For example, when you click the “little brush” just mentioned, you get a message asking you if you want to clear all objects. For us, an object and a variable will almost always mean the same thing.

Note that R is case sensitive. x is not the same as X.

xvar = 5
Xvar = 20
ls()
## [1] "x"    "xvar" "Xvar" "y"

Note that many R purists use <- to assign values to variables rather than =.

XX <- 55
XX
## [1] 55
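
One practical difference between the two shows up inside a function call. There, = names an argument of the function, while <- still creates a variable. A small sketch (the variable names m1, m2, and v are made up for illustration):

```r
m1 = mean(x = c(1,2,3))    # "x =" names the argument of mean; no variable is created or changed
m2 = mean(v <- c(4,5,6))   # "<-" creates the variable v and passes its value to mean
m1
## [1] 2
m2
## [1] 5
v
## [1] 4 5 6
```

For ordinary assignments on a line of their own, as in the rest of these notes, the two behave the same.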

Note that variables in R can represent text strings.
We often use text strings as labels for things.

thename = "Rob"
thename
## [1] "Rob"
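
Functions work on text strings too. As a brief sketch (the variable names first and greeting are made up for illustration):

```r
first = "Rob"
greeting = paste("Hello,",first)   # paste joins strings, putting a space between them by default
greeting
## [1] "Hello, Rob"
nchar(first)                       # number of characters in the string
## [1] 3
toupper(first)                     # convert to upper case
## [1] "ROB"
```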

Help in R

If you want help for the function sum,

help(sum)

Help results will show up under the Help tab in the lower-right pane.

You can also type ?sum in the console to get help on sum.
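
A few other commands can help when you do not remember a function's exact name or arguments. This is only a sketch of some options; the search pattern and the variable name found are arbitrary examples.

```r
args(sd)                          # print the arguments a function accepts
found = apropos("^read\\.csv")    # names of available functions matching a pattern
found
# help.search("standard deviation")   # search the help pages by topic
# ??"standard deviation"              # shorthand for the same search
```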

Under the Help tab of the lower-right pane, try clicking on the “little house” (Show R help). This will get you to several links to R help. The RStudio help is generally pretty good. Check out the cheat sheets at /Help/Cheatsheets; they are great.

Note that googling works amazingly well for R. Any time you don’t know how to do something, just google it. For example, try googling “how do I list variables in R”.

Working With a Vector and Data Frame, A Simple Plot

Let’s get some data in R and have a look at it.

To work with data we need to use variables that point to collections of information.
In R we use vectors, data frames, and lists a lot to store information.
Let’s start with an example of using vectors.

We want to see how the size of a house relates to its price.
We will get data on 4 houses and store the size in one vector and the price in another.

Note that the following commands might be maintained in an R script.

Size = c(.8,1.5,2.4,3.5)
Price = c(70,85,138,172)

We can then use functions to summarize our data.

We can summarize the information in the variable Size.

mean(Size)
## [1] 2.05
sd(Size)
## [1] 1.167619
summary(Size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   1.325   1.950   2.050   2.675   3.500
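
Arithmetic on vectors works element by element, so we can combine the two vectors directly. The sketch below computes a price per unit of size for each house; the name PricePerSize is made up for illustration.

```r
Size = c(.8,1.5,2.4,3.5)
Price = c(70,85,138,172)
length(Size)                # number of houses
## [1] 4
PricePerSize = Price/Size   # divides each price by the matching size
round(PricePerSize,2)
## [1] 87.50 56.67 57.50 49.14
```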

To see the relationship between Size and Price we can use a scatter plot.

plot(Size,Price)

To work with the two variables together we might combine them into a data frame.

hd = data.frame(Size,Price)
summary(hd)
##       Size           Price       
##  Min.   :0.800   Min.   : 70.00  
##  1st Qu.:1.325   1st Qu.: 81.25  
##  Median :1.950   Median :111.50  
##  Mean   :2.050   Mean   :116.25  
##  3rd Qu.:2.675   3rd Qu.:146.50  
##  Max.   :3.500   Max.   :172.00

Data Frames are probably the most common way to work with a data set in R.
We can get an idea of what is in a data frame using a few simple functions.

# each row corresponds to one observation (a specific house), each column is a variable
print(dim(hd))  #number of rows and columns
## [1] 4 2
print(names(hd))  # names of columns (variables)
## [1] "Size"  "Price"
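
A few other functions give a similar quick overview. This sketch rebuilds the same small data frame so that it runs on its own:

```r
Size = c(.8,1.5,2.4,3.5)
Price = c(70,85,138,172)
hd = data.frame(Size,Price)
nrow(hd)    # number of rows (observations)
## [1] 4
ncol(hd)    # number of columns (variables)
## [1] 2
str(hd)     # the type and first few values of each column
```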

You can access a variable (column) in a data frame by name using the $ notation:

print(hd$Size)
## [1] 0.8 1.5 2.4 3.5
mean(hd$Size)
## [1] 2.05

You can also access a row or column by the row or column number:

print(hd[,1]) # first variable
## [1] 0.8 1.5 2.4 3.5
mean(hd[,1])
## [1] 2.05
print(hd[1,]) # first observation
##   Size Price
## 1  0.8    70
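
Row and column indexing can be combined, and rows can be picked with a condition instead of a number. A short sketch using the same four houses (the name big is made up for illustration):

```r
hd = data.frame(Size=c(.8,1.5,2.4,3.5),Price=c(70,85,138,172))
hd[2,"Price"]           # second row of the column named Price
## [1] 85
big = hd[hd$Size > 2,]  # keep only the rows where Size is greater than 2
big
##   Size Price
## 3  2.4   138
## 4  3.5   172
```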

We can change the names of the variables in our data frame:

names(hd) = c("size","price") # a vector of text strings.
head(hd)

Try > View(hd).

We can index into vectors:

nms = names(hd)
cat("the variable names in hd are:\n")
## the variable names in hd are:
print(nms)
## [1] "size"  "price"
cat("the first name is\n")
## the first name is
print(nms[1])
## [1] "size"
cat("the second name is\n")
## the second name is
print(nms[2])
## [1] "price"
cat("the Size vector is\n")
## the Size vector is
print(Size)
## [1] 0.8 1.5 2.4 3.5
cat("the first two elements of Size are\n")
## the first two elements of Size are
print(Size[1:2])
## [1] 0.8 1.5
cat("the first and fourth elements of Size are:\n")
## the first and fourth elements of Size are:
print(Size[c(1,4)])
## [1] 0.8 3.5
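
Two other common ways to index a vector are negative indices, which drop elements, and logical conditions, which keep the elements where the condition is true. A minimal sketch with the same Size vector:

```r
Size = c(.8,1.5,2.4,3.5)
Size[-1]         # everything except the first element
## [1] 1.5 2.4 3.5
Size > 2         # a TRUE/FALSE value for each element
## [1] FALSE FALSE  TRUE  TRUE
Size[Size > 2]   # only the elements greater than 2
## [1] 2.4 3.5
```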

You can assign to an element of a vector:

names(hd)[2] = "theprice"
hd

Arguments to Functions

The basic way R works is to have data contained in vectors, data frames, or some other data structure, and then apply functions to the data.

The “arguments” to a function are what go in. For example:

seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10

Here the function seq is taking two arguments, the first gives the start and the second gives the end for a sequence of integers.

There are a few ways to provide arguments to functions. The basic way is to give the arguments in a specific order. Above, the function seq knows that the first argument is the start and the second argument is the end.

The other basic way to provide arguments is by their name:

x = seq(from=1,to=10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10

If I give the arguments by name, the order does not matter:

x = seq(to=10,from=1)
x
##  [1]  1  2  3  4  5  6  7  8  9 10

Functions can have a lot of arguments and it is often helpful to call by name so you know what you are talking about.

For example:

x = seq(from=1,to=10,length.out = 20)
x
##  [1]  1.000000  1.473684  1.947368  2.421053  2.894737  3.368421  3.842105
##  [8]  4.315789  4.789474  5.263158  5.736842  6.210526  6.684211  7.157895
## [15]  7.631579  8.105263  8.578947  9.052632  9.526316 10.000000

You can figure out what happened just from the names of the arguments.

Try > help(seq) and see how the arguments are discussed.
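
Positional and named arguments can be mixed, and many arguments have default values that are used when you leave them out. A small sketch (the numbers are arbitrary examples):

```r
seq(1,10,by=3)             # first two arguments by position, the third by name
## [1]  1  4  7 10
round(3.14159)             # digits is left out, so its default of 0 is used
## [1] 3
round(3.14159,digits=2)    # override the default by naming the argument
## [1] 3.14
```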

Reading Data in From a File and a Simple Data Analysis

We don’t want to have to type our data into vectors in R scripts.

Let’s see how to read our data into R from a csv file.

#mhd for midcity house data, you may like a longer name
mhd = read.csv("http://www.rob-mcculloch.org/data/midcity.csv")

The variable mhd is a data frame.
We can quickly get an idea about the data in mhd.

print(dim(mhd))
## [1] 128   8
print(names(mhd))
## [1] "Home"      "Nbhd"      "Offers"    "SqFt"      "Brick"     "Bedrooms" 
## [7] "Bathrooms" "Price"
head(mhd)
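
To see what read.csv does without needing a file or an internet connection, you can hand it the text of a small csv directly. The three rows below are made up for illustration and are not taken from the midcity data; the names csv_text and toy are also just for this sketch.

```r
# Three made-up rows laid out the way they would appear in a csv file
csv_text = "Home,SqFt,Price
1,1800,115000
2,2100,130000
3,1500,90000"
toy = read.csv(text=csv_text)   # the text argument reads from a string instead of a file
dim(toy)
## [1] 3 3
names(toy)
## [1] "Home"  "SqFt"  "Price"
```

The first line of the text supplies the column names, just as the first line of the csv file does.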

We can pull off columns (or rows) of a data frame by indexing.
We only want to work with the variables Nbhd, SqFt, and Price, which are columns 2, 4, and 8.

mhd = mhd[,c(2,4,8)]
head(mhd)
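
Columns can also be selected by name, which keeps working even if the column order changes. The sketch below uses a small made-up data frame (toy, with invented values) rather than the real data:

```r
# A made-up data frame with a few of the same column names
toy = data.frame(Home=1:2,Nbhd=c(2,1),Offers=c(2,3),SqFt=c(1800,2100),Price=c(115000,130000))
byname = toy[,c("Nbhd","SqFt","Price")]   # select columns by name
bynumber = toy[,c(2,4,5)]                 # select the same columns by position
identical(byname,bynumber)
## [1] TRUE
```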

Let’s change Price to thousands of dollars; it is currently in dollars.
Let’s also change SqFt to thousands of square feet.

mhd$Price = mhd$Price/1000 #replace Price in dollars with Price in thousands of dollars
mhd$Size = mhd$SqFt/1000 #new variable Size
summary(mhd)
##       Nbhd            SqFt          Price            Size      
##  Min.   :1.000   Min.   :1450   Min.   : 69.1   Min.   :1.450  
##  1st Qu.:1.000   1st Qu.:1880   1st Qu.:111.3   1st Qu.:1.880  
##  Median :2.000   Median :2000   Median :126.0   Median :2.000  
##  Mean   :1.961   Mean   :2001   Mean   :130.4   Mean   :2.001  
##  3rd Qu.:3.000   3rd Qu.:2140   3rd Qu.:148.2   3rd Qu.:2.140  
##  Max.   :3.000   Max.   :2590   Max.   :211.2   Max.   :2.590

Is the price of a house related to its size?

plot(mhd$Size,mhd$Price,xlab="house size",ylab="house price",col='blue')

Larger houses sell for more!
We can summarize this with the correlation:

cor(mhd$SqFt,mhd$Price)
## [1] 0.5529822
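
The plot used Size while the correlation used SqFt. That makes no difference, because correlation does not change when a variable is rescaled. A quick check with the four-house vectors from earlier (the names r1 and r2 are made up for illustration):

```r
Size = c(.8,1.5,2.4,3.5)
Price = c(70,85,138,172)
r1 = cor(Size,Price)
r2 = cor(1000*Size,Price)   # same sizes expressed in units a thousand times smaller
all.equal(r1,r2)
## [1] TRUE
```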

Let’s take a closer look at the prices using a histogram.

hist(mhd$Price,nclass=20,xlab="prices of houses",main="Histogram of house prices")

Categorical Variables

The variable Nbhd in our data frame is categorical.
Even though the variable values are the numbers 1, 2, or 3, these just refer to three different neighborhoods rather than a house having “2” of Nbhd.

So, for example, the summary provided above is not very meaningful.

We can tell R to interpret a variable as categorical by making it a factor.

mhd$Nbhd = as.factor(mhd$Nbhd)
summary(mhd)
##  Nbhd        SqFt          Price            Size      
##  1:44   Min.   :1450   Min.   : 69.1   Min.   :1.450  
##  2:45   1st Qu.:1880   1st Qu.:111.3   1st Qu.:1.880  
##  3:39   Median :2000   Median :126.0   Median :2.000  
##         Mean   :2001   Mean   :130.4   Mean   :2.001  
##         3rd Qu.:2140   3rd Qu.:148.2   3rd Qu.:2.140  
##         Max.   :2590   Max.   :211.2   Max.   :2.590

Now the variable Nbhd is summarized more sensibly.
We just see how many houses are in each of the three neighborhoods.
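
The same idea on a tiny example: the sketch below uses a made-up vector of neighborhood codes (nbhd and f are names chosen for illustration) to show what a factor stores and how to count its categories.

```r
nbhd = c(1,3,2,2,1,3,3)   # made-up neighborhood codes
f = as.factor(nbhd)
levels(f)                 # the distinct categories, stored as labels
## [1] "1" "2" "3"
table(f)                  # counts per category: two 1s, two 2s, three 3s
```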

ggplot, tidyverse, and R packages

So far we have just used “base R”.
We have only used the R functions and data structures that are automatically included in R.

A major strength of R is that it has been widely adopted by the statistics community and people write new functions which may be used in R.
The new functions (and possibly data structures) are bundled in R packages.
See the Packages tab of the lower-right pane in RStudio.

To use an R package you first have to install it.
Then, every time you want to use it in an R session you have to load it.

You can install a package at the Packages tab of the lower right pane.
Or you can enter the command > install.packages(packagename) at the command line, where packagename is a text string giving the package name. To install the package BART, you would type
> install.packages("BART").
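
Before loading a package, you can check whether it is already installed. This is a minimal sketch using ggplot2 as the example; the variable name have_ggplot2 is made up for illustration.

```r
have_ggplot2 = requireNamespace("ggplot2",quietly=TRUE)   # TRUE if the package is installed
if (!have_ggplot2) {
  message('ggplot2 is not installed; run install.packages("ggplot2") first')
}
```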

A very popular set of packages is Hadley Wickham’s “tidyverse”.
In particular, the R package ggplot2 provides a set of tools for graphics.
Let’s have a quick look.

# I have already installed the package ggplot2
# now I have to load it
library(ggplot2)

Now let’s use ggplot2 to plot price vs. size.

plt = ggplot(data=mhd,mapping = aes(x=Size,y=Price)) + geom_point()
plt

How about a histogram with ggplot2:

plt = ggplot(data=mhd,mapping=aes(x=Price)) + geom_histogram(color="white",fill="blue",binwidth=20)
plt

So, the ggplot function allows us to specify the data frame we are working with and which variables in the data frame are “x” and “y”. The “geom” tells ggplot how to geometrically represent the data in a plot.

We can build up plot features by adding in layers of information:

plt = ggplot(data=mhd,mapping = aes(x=Size,y=Price,col=Nbhd)) + geom_point(size=.8)
plt = plt + xlab("Size in thousands of square feet") + ylab("Price in thousands of dollars")
plt
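
Other layers are added in the same way. As a sketch, and only one of many possibilities, the code below adds a fitted straight line and a title to a scatter plot of the four houses from earlier. It assumes ggplot2 is installed and skips the plot otherwise; the title text is made up for illustration.

```r
if (requireNamespace("ggplot2",quietly=TRUE)) {
  library(ggplot2)
  hd = data.frame(Size=c(.8,1.5,2.4,3.5),Price=c(70,85,138,172))
  plt = ggplot(data=hd,mapping=aes(x=Size,y=Price)) + geom_point()
  plt = plt + geom_smooth(method="lm",se=FALSE)    # add a fitted straight line as another layer
  plt = plt + ggtitle("Price vs. size for four houses")
  print(plt)
}
```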