R provides several basic structures for working with data.
In these notes we will review some of the most important ones: vectors, lists, and data frames.
We will also learn how to work with subsets of our data by indexing into rows and/or columns.
Our simplest kind of data is just a bunch of numbers. We can use a vector to hold numbers:
leafsG = c(47,31,26,21,16,13)
summary(leafsG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 17.25 23.50 25.67 29.75 47.00
length(leafsG)
## [1] 6
The function summary gives us some statistical summaries of the numbers in the vector leafsG.
The function length tells us the length of the vector.
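The individual numbers reported by summary can also be computed one at a time. The sketch below uses the base R functions mean, median, range, and quantile; the name q is made up for illustration.

```r
leafsG = c(47,31,26,21,16,13)
mean(leafsG)     # the average, about 25.67
median(leafsG)   # the middle value, 23.5
range(leafsG)    # smallest and largest values: 13 47
q = quantile(leafsG, c(.25,.75))  # first and third quartiles
q                # 17.25 and 29.75, matching the summary output
```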
There are a bunch of ways to make vectors in R:
ind = 1:10 # 1 to 10
print("ind is:")
## [1] "ind is:"
ind
## [1] 1 2 3 4 5 6 7 8 9 10
x = seq(1,20,2) # 1 to 20 by 2
print("x is :")
## [1] "x is :"
x
## [1] 1 3 5 7 9 11 13 15 17 19
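Two other base R helpers that often come in handy are rep, which repeats values, and seq with the length.out argument, which asks for a given number of evenly spaced values. A small sketch (the names r1, r2, and s are made up for illustration):

```r
r1 = rep(0, 5)                    # five zeros
r2 = rep(c("a","b"), times = 3)   # the pair "a","b" repeated three times
s = seq(0, 1, length.out = 5)     # 5 evenly spaced values from 0 to 1
r1
r2
s   # 0.00 0.25 0.50 0.75 1.00
```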
Very often we need access to a subset of values in a vector.
We append square brackets to the vector name and put the indices of the elements we want inside the brackets.
leafsG[2] #the second number
## [1] 31
leafsG[2:4] # the second through 4th
## [1] 31 26 21
leafsG[c(1,3,5)] # the first, third, and fifth
## [1] 47 26 16
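Negative indices work the other way around: they say which elements to leave out. A short sketch using the same goals vector:

```r
leafsG = c(47,31,26,21,16,13)
leafsG[-1]               # everything except the first: 31 26 21 16 13
leafsG[-c(1,2)]          # drop the first two: 26 21 16 13
leafsG[length(leafsG)]   # the last element: 13
```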
We can also assign to elements of a vector using indices:
x = 1:10
x[2]=20
x
## [1] 1 20 3 4 5 6 7 8 9 10
x[c(5,7,9)] = c(50,70,90)
x
## [1] 1 20 3 4 50 6 70 8 90 10
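One behavior worth knowing about (standard base R, though not shown in the notes above): assigning to an index past the end of a vector makes the vector longer, and any skipped positions are filled with NA. The name v is made up for this sketch.

```r
v = 1:3
v[5] = 10   # position 4 was never assigned, so R fills it with NA
v           # 1 2 3 NA 10
length(v)   # 5
```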
We can use arithmetic expressions with vectors.
The arithmetic is applied to all elements of the vector in a (usually) intuitive way.
x = c(2,5,7,9)
x^2
## [1] 4 25 49 81
1 + 2* x
## [1] 5 11 15 19
y = c(4,6,8,9)
x+y
## [1] 6 11 15 18
z = log(y)
z
## [1] 1.386294 1.791759 2.079442 2.197225
What happens when we add two vectors?
x = c(1,2,3)
y = c(20,30,40)
z = x + y
z
## [1] 21 32 43
The first x is added to the first y, the second x to the second y, and so on.
This is very powerful.
1 + 2*x + y
## [1] 23 35 47
Here each element of x is doubled, then we add the corresponding element of y, and then we add 1 to every element.
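The "usually" in "usually intuitive" matters when the two vectors have different lengths. In that case R recycles the shorter vector, which is also how adding the single number 1 to every element works. A minimal sketch, with made-up names a, b, and d:

```r
a = c(1,2,3,4)
b = c(10,20)
a + b    # b is reused as 10,20,10,20, giving 11 22 13 24
d = c(1,2,3)
# a + d still returns a result, but R warns that the longer
# length (4) is not a multiple of the shorter length (3)
```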
So far we have used vectors to contain a bunch of numbers.
Besides numbers, we often work with character strings and the logical values TRUE and FALSE.
x = "Rob" # x refers to the character string "Rob"
x
## [1] "Rob"
x = TRUE
x
## [1] TRUE
Vectors have to be “all one thing”, where the “one thing” could be numbers, character strings, or logical values.
pnms = c("Matthews","Nylander","Tavares", "Hyman", "Marner" ,"Kapenan")
length(pnms)
## [1] 6
pnms
## [1] "Matthews" "Nylander" "Tavares" "Hyman" "Marner" "Kapenan"
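If we mix types inside c, R does not complain. It quietly converts everything to a single type so the "all one thing" rule still holds. A small sketch (the names mixed and nums are made up):

```r
mixed = c(1, "two", TRUE)
mixed          # "1" "two" "TRUE", everything became a character string
class(mixed)   # "character"
nums = c(1, TRUE, FALSE)
nums           # 1 1 0, the logical values became numbers
class(nums)    # "numeric"
```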
We can label all the elements of a vector:
names(leafsG) = pnms
leafsG
## Matthews Nylander Tavares Hyman Marner Kapenan
## 47 31 26 21 16 13
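Once a vector has names, the names can be used inside the square brackets in place of numerical indices. A short sketch that rebuilds the named vector so it runs on its own:

```r
leafsG = c(47,31,26,21,16,13)
names(leafsG) = c("Matthews","Nylander","Tavares","Hyman","Marner","Kapenan")
leafsG["Tavares"]                 # the element named Tavares: 26
leafsG[c("Matthews","Marner")]    # two elements picked by name
unname(leafsG["Tavares"])         # the same value with the label dropped
```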
A logical value tells us if something is TRUE or FALSE:
leafsG[1] == 47 # is the first value equal to 47, note the double =
## Matthews
## TRUE
leafsG[1] >= 50
## Matthews
## FALSE
g20 = leafsG >=20
g20
## Matthews Nylander Tavares Hyman Marner Kapenan
## TRUE TRUE TRUE TRUE FALSE FALSE
g20 is a vector of logical values indicating whether or not each value of leafsG is greater than or equal to 20.
You can select observations using a logical vector:
leafsG[g20]
## Matthews Nylander Tavares Hyman
## 47 31 26 21
# or more succinctly (using >= to match the definition of g20)
leafsG[leafsG >= 20]
## Matthews Nylander Tavares Hyman
## 47 31 26 21
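Logical vectors can also be counted, located, and combined. The sketch below uses the base R functions sum, which, and mean, along with the & operator for "and". In arithmetic, TRUE counts as 1 and FALSE as 0.

```r
leafsG = c(47,31,26,21,16,13)
sum(leafsG >= 20)      # how many players have at least 20 goals: 4
which(leafsG >= 20)    # their positions in the vector: 1 2 3 4
mean(leafsG >= 20)     # the proportion with at least 20 goals
leafsG[leafsG >= 20 & leafsG < 40]   # both conditions must hold: 31 26 21
```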
You can add points, lines, and text to an existing plot.
This allows us to build up interesting plots.
leafsG = c(47,31,26,21,16,13) # goals
leafsA = c(33,28,34,16,51,23) # assists
pnms = c("Matthews","Nylander","Tavares", "Hyman", "Marner" ,"Kapenan") #player names
plot(leafsG,leafsA,xlab="goals",ylab = "assists",col="blue")
# now I can add text to the plot
text(leafsG,leafsA,pnms,pos=3)
That did not quite work: labels near the edge of the plot get cut off. We need to change the plot limits.
plot(leafsG,leafsA,xlab="goals",ylab = "assists",col="blue",ylim=c(14,55),xlim=c(10,50))
text(leafsG,leafsA,pnms,pos=3)
You can also add points and lines.
I make an initial plot with the argument type="n". This will make the plot axes but do no plotting.
Then I add the points and text.
plot(leafsG,leafsA,xlab="goals",ylab = "assists",col="blue",
ylim=c(14,55),xlim=c(10,50),type="n")
text(leafsG,leafsA,pnms,pos=3,cex=.8)
points(leafsG,leafsA,col="red",pch=2)
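Notes on the arguments: type="n" sets up the axes without drawing the points, pos=3 places each label above its point, cex=.8 shrinks the label text to 80% of the default size, and pch=2 plots the points as triangles.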
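The examples above add text and points but not lines. As a rough sketch of how lines could be added, the base R functions abline and lines both draw on the current plot. The choice of what to draw here (a line at the average number of assists, and a line where assists equal goals) and the name avgA are just for illustration.

```r
leafsG = c(47,31,26,21,16,13) # goals
leafsA = c(33,28,34,16,51,23) # assists
avgA = mean(leafsA)           # average number of assists
plot(leafsG,leafsA,xlab="goals",ylab="assists",col="blue")
abline(h=avgA,lty=2,col="gray")        # dashed horizontal line at the average
lines(c(10,50),c(10,50),col="red")     # line where assists equal goals
```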
Sometimes we need to store a variety of types of information in one place. We can use a list to do this.
player = list(first = "Auston", last = "Mathews", pts = c(47,33))
player
## $first
## [1] "Auston"
##
## $last
## [1] "Mathews"
##
## $pts
## [1] 47 33
The list player has the first and last name as well as a vector containing the goals and assists scored by the player.
You can access a component of a list using the $ notation or a numerical index:
player$pts # the (goal, assists) vector component
## [1] 47 33
player[1:2] # a list with the first two components
## $first
## [1] "Auston"
##
## $last
## [1] "Mathews"
player[[3]] # same as player$pts, note the double [[]]
## [1] 47 33
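The difference between single and double brackets is easy to miss: single brackets always give back a list, while double brackets give back the component itself. The sketch below checks this with class, and also shows that a new component can be added with the $ notation. The component name team and its value are made up for illustration.

```r
player = list(first = "Auston", last = "Mathews", pts = c(47,33))
names(player)         # the component names: "first" "last" "pts"
class(player[3])      # "list", single brackets keep the list wrapper
class(player[[3]])    # "numeric", double brackets return the vector itself
player$pts[1]         # the goals, 47
player$team = "Leafs" # add a new component (hypothetical example)
length(player)        # now 4 components
```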
Very often data is represented as a rectangular array where rows correspond to different observations and columns correspond to different variables. In R, we use a data.frame to represent this.
Let’s say each observation corresponds to a Leafs player (as in our vector example) and the variables we want to think about are number of goals and number of assists.
First let’s make a vector of the assists:
leafsA = c(33,28,34,16,51,23)
Now we can make a data.frame holding the goals and assists:
leafsTop = data.frame(leafsG,leafsA)
leafsTop
When you make the data frame, you can choose new names for the variables:
leafsTop = data.frame(G = leafsG, A = leafsA)
leafsTop
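A few base R functions help to check what a data.frame contains, and a new variable can be added with the $ notation. This is a sketch only: the column name P (for points, goals plus assists) and the name d are made up for illustration.

```r
leafsG = c(47,31,26,21,16,13)
leafsA = c(33,28,34,16,51,23)
leafsTop = data.frame(G = leafsG, A = leafsA)
d = dim(leafsTop)   # number of rows and columns
d                   # 6 2
names(leafsTop)     # the variable names: "G" "A"
str(leafsTop)       # a compact description of each variable
leafsTop$P = leafsTop$G + leafsTop$A   # add a new column (hypothetical name P)
leafsTop$P          # 80 59 60 37 67 36
```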
You can pull off variables by name or index and you can pull off rows by index:
leafsTop$G
## [1] 47 31 26 21 16 13
leafsTop[1:3,1] # first three rows of first column
## [1] 47 31 26
leafsTop[,1] # the first column, note the [,1] says we want all the rows
## [1] 47 31 26 21 16 13
leafsTop[1:3,2] # first three rows of second column
## [1] 33 28 34
tp3 = leafsTop[1:3,] # first three rows, all the columns
tp3
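Columns can also be selected by name inside the brackets. Note that selecting a single column gives back a plain vector unless drop = FALSE is used. A sketch that rebuilds the data.frame so it runs on its own (the name oneCol is made up):

```r
leafsTop = data.frame(G = c(47,31,26,21,16,13), A = c(33,28,34,16,51,23))
leafsTop[,"A"]                  # the assists column as a vector
leafsTop[c(1,3),c("G","A")]     # rows 1 and 3, columns picked by name
oneCol = leafsTop[,1,drop=FALSE] # one column, kept as a data.frame
class(oneCol)                   # "data.frame"
leafsTop[["G"]]                 # same as leafsTop$G
```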
You can get a summary of each variable in a data.frame:
summary(leafsTop)
## G A
## Min. :13.00 Min. :16.00
## 1st Qu.:17.25 1st Qu.:24.25
## Median :23.50 Median :30.50
## Mean :25.67 Mean :30.83
## 3rd Qu.:29.75 3rd Qu.:33.75
## Max. :47.00 Max. :51.00
You can apply a function to each column of a data.frame:
apply(leafsTop,2,sd)
## G A
## 12.32342 11.92337
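For a data.frame, sapply is a common alternative to apply with the second argument set to 2: it also applies a function to each column. For column means there is a dedicated function, colMeans. A sketch assuming the same goals and assists data (the names sds and mns are made up):

```r
leafsTop = data.frame(G = c(47,31,26,21,16,13), A = c(33,28,34,16,51,23))
sds = sapply(leafsTop, sd)   # standard deviation of each column
sds                          # about 12.32 for G and 11.92 for A
mns = colMeans(leafsTop)     # mean of each column
mns                          # about 25.67 for G and 30.83 for A
```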
You can assign using indexing:
leafsTop[1,1]=50 #Auston should have got to 50!!
leafsTop
As with vectors, we can select using logic:
leafsTop[leafsTop$G >20,]
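Conditions can be combined with &, and the base R function order can be used to sort the rows. The sketch below rebuilds the data.frame with the first goal total already set to 50, as in the assignment above. The names both and byA are made up for illustration.

```r
leafsTop = data.frame(G = c(50,31,26,21,16,13), A = c(33,28,34,16,51,23))
both = leafsTop[leafsTop$G > 20 & leafsTop$A > 30,]   # more than 20 goals and more than 30 assists
both                                                  # rows 1 and 3
byA = leafsTop[order(leafsTop$A,decreasing=TRUE),]    # rows sorted by assists, largest first
byA
subset(leafsTop, G > 20)   # another way to write leafsTop[leafsTop$G > 20,]
```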