Basic data structures
The basic data structures used to organize data within the R environment include vectors, lists, data frames, tables, and matrices. Here, we provide details for each of these data structures and demonstrate how to create them. This chapter does not include information about how to read data from a file, and the focus is on the data structures themselves. More information about reading from a file can be found in Chapter 3, Saving Data and Printing Results.
Vectors
The default data structure in R is the vector. For example, if you define a variable as a single number, R will treat it as a vector of length one:
> a <- 5
> a[1]
[1] 5
Vectors are a convenient and straightforward way to store a long sequence of numbers. The c command combines a set of arguments to form a single vector:
> v <- c(1,3,5,7,-10)
> v
[1]   1   3   5   7 -10
> v[4]
[1] 7
> v[2] <- v[1]-v[5]
> v
[1]   1  11   5   7 -10
Two other methods to generate vectors make use of the : notation and the seq command. The : notation creates a vector of sequentially numbered values between given start and end values. The seq command does the same thing, but it provides more options to determine the increment between values in the vector:
> 1:5
[1] 1 2 3 4 5
> 10:14
[1] 10 11 12 13 14
> a <- 3:7
> a
[1] 3 4 5 6 7
> b <- seq(3,5)
> b
[1] 3 4 5
> b <- seq(3,10,by=3)
> b
[1] 3 6 9
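As a further illustration of the options mentioned above, the following sketch uses the by and length.out arguments of seq, along with a descending : sequence. The variable names (evens, grid, down, countdown) are chosen here for illustration only:

```r
# An explicit increment: 2, 4, 6, 8, 10
evens <- seq(2, 10, by = 2)

# An explicit number of evenly spaced points between 0 and 1
grid <- seq(0, 1, length.out = 5)

# A negative increment counts downward: 10, 7, 4, 1
down <- seq(10, 1, by = -3)

# The : notation also counts downward when start > end
countdown <- 5:1

print(evens)
print(grid)
print(down)
print(countdown)
```

Note that seq stops at the last value that does not pass the end point, so the end value itself appears only when the increment divides the range evenly.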
Lists
Another important type is the list. Lists are a flexible, unstructured way of organizing information. A list can be thought of as a collection of named objects. A list is created using the list command, and a variable can be tested or coerced using the is.list and as.list commands. A component within a list is accessed using the $ character to denote which object within the list to use. As an example, suppose that we want to create a list to keep track of the parameters for an experiment. The first component, called means, will be a vector of the assumed means. The second component, called alpha, will be the confidence level, and the third component will be the value of a parameter for the experiment called maxRealEigen:
> assumedMeans <- c(1.0,1.5,2.1)
> alpha <- 0.05
> eigenValue <- 3+2i
> l <- list(means=assumedMeans,alpha=alpha,maxRealEigen=eigenValue)
> l
$means
[1] 1.0 1.5 2.1

$alpha
[1] 0.05

$maxRealEigen
[1] 3+2i

> l$means
[1] 1.0 1.5 2.1
> l$means[2]
[1] 1.5
The names and attributes commands can be used to determine the components within a list. The attributes command is a more generic command that can be used to list the components of classes and a wider range of objects. Note that the names command can also be used to rename the components of a list. In the following example, we take the list from the previous example and change the names of its components:
> l <- list(means=c(1.0,1.5,2.1),alpha=0.05,maxRealEigen=3+2i)
> names(l)
[1] "means"        "alpha"        "maxRealEigen"
> names(l) <- c("assumedMeans","confidenceLevels","maximumRealValue")
> l
$assumedMeans
[1] 1.0 1.5 2.1

$confidenceLevels
[1] 0.05

$maximumRealValue
[1] 3+2i

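The is.list and as.list commands mentioned earlier can be sketched as follows. The list params and the other variable names are illustrative. The double-bracket form [[ ]] is a standard alternative to $ that is not used elsewhere in this chapter:

```r
params <- list(means = c(1.0, 1.5, 2.1), alpha = 0.05)

# The list itself is a list, but a component extracted with $ is not
wholeIsList <- is.list(params)           # TRUE
partIsList  <- is.list(params$means)     # FALSE: it is a numeric vector

# [[ ]] extracts a component by name or by position, like $
alphaValue  <- params[["alpha"]]         # 0.05
secondMean  <- params[[1]][2]            # 1.5

# Single brackets keep the list wrapper and return a shorter list
subList <- params["alpha"]
subIsList <- is.list(subList)            # TRUE

# as.list turns each element of a vector into its own component
pieces <- as.list(1:3)
print(length(pieces))                    # 3 components
```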
Data frames
A data frame is similar to a list, and many of the operations are similar. The primary difference is that all of the components of a data frame must have the same number of elements. This is one of the most common ways to store information, and many of the functions available to read data from a file return a data frame by default. For example, suppose we ask five people two questions. The first question is, "Do you have a pet cat?" The second question is, "How many rooms in your house need new carpet?":
> d <- data.frame(Q1=as.factor(c("y","n","y","y","n")),
+                 Q2=c(2,0,1,2,0))
> d
  Q1 Q2
1  y  2
2  n  0
3  y  1
4  y  2
5  n  0
> d$Q1
[1] y n y y n
Levels: n y
> summary(d)
 Q1          Q2
 n:2   Min.   :0
 y:3   1st Qu.:0
       Median :1
       Mean   :1
       3rd Qu.:2
       Max.   :2
The names and attributes commands behave the same way with data frames as they do with lists. In the following example, we take the data frame defined in the previous example and rename the fields to something more descriptive:
> d <- data.frame(Q1=as.factor(c("y","n","y","y","n")),
+                 Q2=c(2,0,1,2,0))
> names(d) <- c("HaveCat","NumberRooms")
> d
  HaveCat NumberRooms
1       y           2
2       n           0
3       y           1
4       y           2
5       n           0
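The requirement that all components of a data frame have the same number of elements can be checked directly. The following sketch reuses the cat survey data. The variable names and the use of tryCatch to capture the error are choices made here for illustration:

```r
d <- data.frame(HaveCat = factor(c("y", "n", "y", "y", "n")),
                NumberRooms = c(2, 0, 1, 2, 0))

# Every column has one entry per respondent
dims <- dim(d)                       # 5 rows, 2 columns

# Columns are accessed with $, as with a list; rows can be
# selected with a logical comparison on a column
catOwners <- d[d$HaveCat == "y", ]
ownerMean <- mean(catOwners$NumberRooms)

# Columns of length 3 and 2 cannot be combined, so this call fails;
# tryCatch converts the error into the string "error"
mismatch <- tryCatch(data.frame(a = 1:3, b = 1:2),
                     error = function(e) "error")

print(dims)
print(ownerMean)
print(mismatch)
```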
Tables
Tables can be easily constructed, and R will automatically generate frequency tables from categorical data. The table command has a number of options, but we focus on basic examples here. More details can be found using the help(table) command. In the next example, we take the data from the preceding cat questions and create a table from the answers to the first question:
> d <- data.frame(Q1=as.factor(c("y","n","y","y","n")),
+                 Q2=c(2,0,1,2,0))
> q1Results <- table(d$Q1)
> q1Results

n y
2 3
> summary(q1Results)
Number of cases in table: 5
Number of factors: 1
If you wish to create a two-way table, then simply provide two vectors to the table command to get the cross tabulation. Again, we look at the data from the cat questions. Note that we explicitly convert the second question to a factor, although the table command will also coerce a numeric vector on its own:
> d <- data.frame(Q1=as.factor(c("y","n","y","y","n")),
+                 Q2=c(2,0,1,2,0))
> results <- table(d$Q1,as.factor(d$Q2))
> results

    0 1 2
  n 2 0 0
  y 0 1 2
> summary(results)
Number of cases in table: 5
Number of factors: 2
Test for independence of all factors:
        Chisq = 5, df = 2, p-value = 0.08208
        Chi-squared approximation may be incorrect
The rows and columns of the table have names associated with them, and the rownames and colnames commands can be used to assign the names. These commands are similar to the names command. In the preceding example, the names in the table are not descriptive. In the following example, we build the table and rename the rows and columns:
> d <- data.frame(Q1=as.factor(c("y","n","y","y","n")),
+                 Q2=c(2,0,1,2,0))
> results <- table(d$Q1,as.factor(d$Q2))
> rownames(results) <- c("No Cat","Has Cat")
> colnames(results) <- c("No room","One room","Two rooms")
> results

          No room One room Two rooms
  No Cat        2        0         0
  Has Cat       0        1         2
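Once a two-way table exists, its margins can be summarized. The sketch below labels the two dimensions by passing named arguments to table and then uses the margin.table and prop.table commands. Neither of those commands is covered in this chapter, and the names HaveCat, Rooms, rowTotals, and rowShares are illustrative:

```r
q1 <- factor(c("y", "n", "y", "y", "n"))
q2 <- c(2, 0, 1, 2, 0)

# Named arguments become the labels of the table's dimensions
results <- table(HaveCat = q1, Rooms = q2)

# Totals for each row (margin 1): how many said "n" and how many said "y"
rowTotals <- margin.table(results, 1)

# Proportions within each row, so every row sums to 1
rowShares <- prop.table(results, 1)

print(results)
print(rowTotals)
print(rowShares)
```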
One last note: the table command expects categorical data. If you have numeric data, it can be quickly transformed to encode which interval contains each number. The cut command takes the numeric data and a vector of break points that indicate the cutoff points between each interval, as follows:
> a <- c(-0.8,-0.7,0.9,-1.4,-0.3,1.2)
> b <- cut(a,breaks=c(-1.5,-1,-0.5,0,0.5,1.0,1.5))
> b
[1] (-1,-0.5] (-1,-0.5] (0.5,1]   (-1.5,-1] (-0.5,0]  (1,1.5]
Levels: (-1.5,-1] (-1,-0.5] (-0.5,0] (0,0.5] (0.5,1] (1,1.5]
> summary(b)
(-1.5,-1] (-1,-0.5]  (-0.5,0]   (0,0.5]   (0.5,1]   (1,1.5]
        1         2         1         0         1         1
> table(b)
b
(-1.5,-1] (-1,-0.5]  (-0.5,0]   (0,0.5]   (0.5,1]   (1,1.5]
        1         2         1         0         1         1
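The interval names that cut generates can be replaced with more readable ones through its labels argument. The following sketch bins the same data into three wider intervals. The break points and the labels "low", "middle", and "high" are assumptions made for this example:

```r
a <- c(-0.8, -0.7, 0.9, -1.4, -0.3, 1.2)

# Three intervals: (-1.5,-0.5], (-0.5,0.5], (0.5,1.5]
# labels must supply one name per interval
group <- cut(a, breaks = c(-1.5, -0.5, 0.5, 1.5),
             labels = c("low", "middle", "high"))

# The result is a factor, so it can be passed straight to table
counts <- table(group)

print(group)
print(counts)
```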
Matrices and arrays
Tables are a special case of an array. An array or a matrix can be constructed directly using either the array or matrix commands. The array command takes a vector and dimensions, and it constructs an array using column major order. If you need the data in row major order, the t command transposes the result:
> a <- c(1,2,3,4,5,6)
> A <- array(a,c(2,3))
> A
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> t(A)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
You are not limited to two-dimensional arrays. The dim option can include any number of dimensions. In the following example, a three-dimensional array is created by giving three numbers, one for the extent of each dimension:
> A <- array(1:24,c(2,3,4))
> A
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

, , 3

     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18

, , 4

     [,1] [,2] [,3]
[1,]   19   21   23
[2,]   20   22   24

> A[2,3,4]
[1] 24
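The array command also accepts a dimnames argument, which must be a list containing one character vector per dimension (a plain character vector is rejected with an error). The following sketch labels every index of a three-dimensional array. The labels r1, c1, d1, and so on are made up for illustration:

```r
A <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(c("r1", "r2"),
                           c("c1", "c2", "c3"),
                           c("d1", "d2", "d3", "d4")))

# Elements can be addressed by label or by position
byLabel    <- A["r2", "c3", "d4"]
byPosition <- A[1, 2, 3]

# Leaving an index empty selects everything along that dimension,
# so this is the 2 x 3 matrix that makes up the first layer
firstLayer <- A[, , "d1"]

print(byLabel)
print(byPosition)
print(firstLayer)
```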
A matrix is a two-dimensional array and is a special case that can be created using the matrix command. Rather than using the dimensions, the matrix command requires that you specify the number of rows or columns. The command has an additional option to specify whether the data is in row major or column major order:
> B <- matrix(1:12,nrow=3)
> B
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> B <- matrix(1:12,nrow=3,byrow=TRUE)
> B
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
Both matrices and arrays can be manipulated to determine or change their dimensions. The dim command can be used to get or set this information:
> C <- matrix(1:12,ncol=3)
> C
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> dim(C)
[1] 4 3
> dim(C) <- c(3,4)
> C
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
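Setting dim refills the same values in column major order, which is not the same as transposing. The new dimensions must also account for every element. The sketch below illustrates both points. The variable names and the use of tryCatch to capture the error are illustrative choices:

```r
C <- matrix(1:12, ncol = 3)

# Reshape a copy to 3 x 4 by setting its dimensions
reshaped <- C
dim(reshaped) <- c(3, 4)

# Transposing also gives a 3 x 4 matrix, but with different contents
transposed <- t(C)
sameShape   <- identical(dim(reshaped), dim(transposed))   # TRUE
sameContent <- identical(reshaped, transposed)             # FALSE

# 5 * 2 = 10 does not match the 12 elements, so this assignment fails;
# tryCatch converts the error into the string "error"
badReshape <- tryCatch({
  bad <- C
  dim(bad) <- c(5, 2)
  "ok"
}, error = function(e) "error")

print(reshaped)
print(transposed)
print(badReshape)
```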
Censoring data
Using a logical vector as an index is a useful way to restrict which data is examined. For example, to examine only the positive values in a data set, a logical comparison can be used as the index into the vector:
> u <- 1:6
> v <- c(-1,1,-2,2,-3,3)
> u
[1] 1 2 3 4 5 6
> v
[1] -1  1 -2  2 -3  3
> u[v > 0]
[1] 2 4 6
> u[v < 0] = -2*u[v < 0]
> u
[1]  -2   2  -6   4 -10   6
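It can help to look at the logical vector itself before using it as an index. The following sketch stores the comparison, counts and locates the TRUE entries with sum and which, and combines two conditions with the & operator. The variable names are illustrative:

```r
u <- 1:6
v <- c(-1, 1, -2, 2, -3, 3)

# The comparison produces one TRUE or FALSE per element
positive <- v > 0

# TRUE counts as 1, so sum gives the number of matches
howMany <- sum(positive)

# which returns the positions of the TRUE entries
positions <- which(positive)

# Conditions can be combined element by element with &
both <- u[v > 0 & u > 2]

print(positive)
print(howMany)
print(positions)
print(both)
```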
Another useful aspect of a logical index into a vector is the handling of the NA value, which marks missing data. The is.na function and the logical NOT operator (!) can be useful tools when a vector contains data that is not defined:
> v <- c(1,2,3,NA,4,NA)
> v
[1]  1  2  3 NA  4 NA
> v[is.na(v)]
[1] NA NA
> v[!is.na(v)]
[1] 1 2 3 4
Note that many functions have optional arguments to specify how R should react to data that contains NA values. Unfortunately, the way this is done is not consistent, and you should use the help command to check the behavior of any particular function:
> v <- c(1,2,3,NA,4,NA)
> v
[1]  1  2  3 NA  4 NA
> mean(v)
[1] NA
> mean(v,na.rm=TRUE)
[1] 2.5
In this last example, the na.rm option in the mean function is set to TRUE to specify that R should ignore the entries in the vector that are NA.
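To illustrate how the handling of NA differs between functions, the following sketch compares three base R functions: sum uses na.rm like mean, table uses a useNA argument, and cor uses a use argument. The vectors x and y are made-up data for this example:

```r
v <- c(1, 2, 3, NA, 4, NA)

# sum follows the same na.rm convention as mean
total <- sum(v, na.rm = TRUE)

# table silently drops NA unless useNA asks for it
withoutNA <- table(v)
withNA    <- table(v, useNA = "ifany")

# cor has no na.rm argument; the use argument picks the strategy
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 6, 8, NA)
r <- cor(x, y, use = "complete.obs")

print(total)
print(withoutNA)
print(withNA)
print(r)
```

Only the first three pairs of x and y are complete, and they lie exactly on a line, so the correlation computed from them is 1.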
Appending rows and columns
The cbind and rbind commands can be used to append data to existing objects. These commands can be used on vectors, matrices, and arrays, and they are extended to also act on data frames. The following examples use data frames, as that is a common operation. You should be careful and try the commands on arrays to make sure that the operation behaves in the way you expect.
The cbind command is used to combine the columns of the data given as arguments:
> d <- data.frame(one=c(1,2,3),two=as.factor(c("one","two","three")))
> e <- c("ein","zwei","drei")
> newDataFrame <- cbind(d,third=e)
> newDataFrame
  one   two third
1   1   one   ein
2   2   two  zwei
3   3 three  drei
> newDataFrame$third
[1] ein  zwei drei
Levels: drei ein zwei
If the arguments to the cbind command are two data frames (or two arrays), then the command combines all of the columns from all of the data frames (arrays):
> d <- data.frame(one=c(1,2,3),two=as.factor(c("one","two","three")))
> e <- data.frame(three=c(4,5,6),four=as.factor(c("vier","funf","sechs")))
> newDataFrame <- cbind(d,e)
> newDataFrame
  one   two three  four
1   1   one     4  vier
2   2   two     5  funf
3   3 three     6 sechs
The rbind command concatenates the rows of the objects passed to it. The command uses the names of the columns to determine how to append the data. The number and names of the columns must be identical:
> d <- data.frame(one=c(1,2,3),two=as.factor(c("one","two","three")))
> e <- data.frame(one=c(4,5,6),two=as.factor(c("vier","funf","sechs")))
> newDataFrame <- rbind(d,e)
> newDataFrame
  one   two
1   1   one
2   2   two
3   3 three
4   4  vier
5   5  funf
6   6 sechs
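Since the examples in this section use data frames, the following sketch tries the same commands on plain vectors and a matrix, and also shows what happens when rbind is given data frames whose column names differ. The variable names and the use of tryCatch to capture the error are illustrative:

```r
x <- 1:3
y <- 4:6

# With vectors, cbind makes each one a column, rbind makes each one a row
byCol <- cbind(x, y)        # 3 x 2 matrix with columns "x" and "y"
byRow <- rbind(x, y)        # 2 x 3 matrix with rows "x" and "y"

# Appending a vector to a matrix adds one more row
M <- matrix(1:6, nrow = 2)
extended <- rbind(M, 7:9)   # 3 x 3 matrix

# Data frames with different column names cannot be stacked;
# tryCatch converts the error into the string "error"
mismatch <- tryCatch(rbind(data.frame(one = 1), data.frame(uno = 2)),
                     error = function(e) "error")

print(byCol)
print(byRow)
print(extended)
print(mismatch)
```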