Basic data structures

The basic data structures used to organize data within the R environment include vectors, lists, data frames, tables, and matrices. Here, we provide details for each of these data structures and demonstrate how to create them. This chapter does not include information about how to read data from a file, and the focus is on the data structures themselves. More information about reading from a file can be found in Chapter 3, Saving Data and Printing Results.

Vectors

The default data structure in R is the vector. For example, if you define a variable as a single number, R will treat it as a vector of length one:

> a <- 5
> a[1]
[1] 5

Vectors represent a convenient and straightforward way to store a long list of numbers. The c command concatenates a set of arguments to form a single vector:

> v <- c(1,3,5,7,-10)
> v
[1] 1 3 5 7 -10
> v[4]
[1] 7
> v[2] <- v[1]-v[5]
> v
[1] 1 11 5 7 -10

Two other methods to generate vectors make use of the : notation and the seq command. The : notation creates a vector of sequentially numbered values between given start and end values. The seq command does the same thing, but it provides more options to determine the increment between values in the vector:

> 1:5
[1] 1 2 3 4 5
> 10:14
[1] 10 11 12 13 14
> a <- 3:7
> a
[1] 3 4 5 6 7
> b <- seq(3,5)
> b
[1] 3 4 5
> b <- seq(3,10,by=3)
> b
[1] 3 6 9
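
The seq command accepts more than a start, an end, and an increment. The following is a brief sketch, assuming only base R, of two further options; the variable names grid and countdown are illustrative and not part of the original examples:

```r
# Fix the number of points instead of the increment.
grid <- seq(0, 1, length.out = 5)
print(grid)                    # 0.00 0.25 0.50 0.75 1.00

# A negative increment counts downward.
countdown <- seq(10, 1, by = -3)
print(countdown)               # 10 7 4 1

# seq_len(n) gives 1..n and is empty when n is 0, whereas 1:0 counts down.
print(length(seq_len(0)))      # 0
print(1:0)                     # 1 0
```

The seq_len form is often the safer choice when the upper limit may be zero, since the : notation would then produce a two-element vector.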

Lists

Another important type is the list. Lists are flexible and are an unstructured way of organizing information. A list can be thought of as a collection of named objects. A list is created using the list command, and a variable can be tested or coerced using the is.list and as.list commands. A component within a list is accessed using the $ character to denote which object within the list to use. As an example, suppose that we want to create a list to keep track of the parameters for an experiment. The first component, called means, will be a vector of the assumed means. The second component will be the confidence level, and the third component will be the value of a parameter for the experiment called maxRealEigen:

> assumedMeans <- c(1.0,1.5,2.1)
> alpha <- 0.05
> eigenValue <- 3+2i
> l <- list(means=assumedMeans,alpha=alpha,maxRealEigen=eigenValue)
> l
$means
[1] 1.0 1.5 2.1
$alpha
[1] 0.05
$maxRealEigen
[1] 3+2i

> l$means
[1] 1.0 1.5 2.1
> l$means[2]
[1] 1.5
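
The is.list and as.list commands mentioned previously, along with the bracket forms of access, can be sketched as follows. This is a minimal illustration in base R; the names params, sub, and asList are made up for the example:

```r
params <- list(means = c(1.0, 1.5, 2.1), alpha = 0.05, maxRealEigen = 3+2i)

# Double brackets extract one component, by name or by position.
print(params[["alpha"]])        # 0.05
print(params[[1]])              # 1.0 1.5 2.1

# Single brackets return a shorter list rather than the component itself.
sub <- params["alpha"]
print(is.list(sub))             # TRUE
print(is.list(params$alpha))    # FALSE

# as.list turns each element of a vector into its own component.
asList <- as.list(1:3)
print(length(asList))           # 3
```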

The names and attributes commands can be used to determine the components within a list. The attributes command is a more generic command that can be used to list the components of classes and a wider range of objects. Note that the names command can also be used to rename the components of a list. In the following example, we use the previous example but change the names of the elements:

> l <- list(means=c(1.0,1.5,2.1),alpha=0.05,maxRealEigen=3+2i)
> names(l)
[1] "means" "alpha" "maxRealEigen"
> names(l) <- c("assumedMeans","confidenceLevels","maximumRealValue")
> l
$assumedMeans
[1] 1.0 1.5 2.1
$confidenceLevels
[1] 0.05
$maximumRealValue
[1] 3+2i

Data frames

A data frame is similar to a list, and many of the operations are similar. The primary difference is that all of the components of a data frame must have the same number of elements. This is one of the most common ways to store information, and many of the functions available to read data from a file return a data frame by default. For example, suppose we ask five people two questions. The first question is, "Do you have a pet cat?" The second question is, "How many rooms in your house need new carpet?":

> d <- data.frame(Q1=as.factor(c("y","n","y","y","n")),
+                 Q2=c(2,0,1,2,0))
> d
  Q1 Q2
1  y  2
2  n  0
3  y  1
4  y  2
5  n  0
> d$Q1
[1] y n y y n
Levels: n y
> summary(d)
 Q1          Q2   
 n:2   Min.   :0  
 y:3   1st Qu.:0  
       Median :1  
       Mean   :1  
       3rd Qu.:2  
       Max.   :2  

The names and attributes commands behave the same way with data frames as they do with lists. In the following example, we take the data frame defined in the previous example and rename the fields to something more descriptive:

> d <- data.frame(Q1=as.factor(c("y","n","y","y","n")),
+                 Q2=c(2,0,1,2,0))
> names(d) <- c("HaveCat","NumberRooms")
> d
  HaveCat NumberRooms
1       y           2
2       n           0
3       y           1
4       y           2
5       n           0
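
Because every column has one entry per row, a data frame can be indexed by row and column much like a matrix. The following sketch, which assumes only base R and uses the illustrative names survey, catOwners, and bad, shows row selection and what happens when column lengths cannot be reconciled:

```r
survey <- data.frame(HaveCat = as.factor(c("y","n","y","y","n")),
                     NumberRooms = c(2, 0, 1, 2, 0))

# Every column has the same number of elements, one per respondent.
print(nrow(survey))                       # 5
print(ncol(survey))                       # 2

# Rows and columns can be selected with [row, column] indexing.
catOwners <- survey[survey$HaveCat == "y", ]
print(catOwners$NumberRooms)              # 2 1 2

# A column of length 2 cannot be combined with one of length 3;
# the resulting error is caught here and replaced by a marker string.
bad <- tryCatch(data.frame(a = 1:3, b = 1:2),
                error = function(e) "error")
print(bad)                                # "error"
```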

Tables

Tables can be easily constructed and R will automatically generate frequency tables from categorical data. The table command has a number of options, but we focus on basic examples here. More details can be found using the help(table) command. In the next example, we take the data from the preceding cat questions and create a table from the answers from the first question:

> d <- data.frame(Q1=as.factor(c("y","n","y","y","n")),
+                 Q2=c(2,0,1,2,0))
> q1Results <- table(d$Q1)
> q1Results
n y 
2 3 
> summary(q1Results)
Number of cases in table: 5 
Number of factors: 1

If you wish to create a two-way table, then simply provide two vectors to the table command to get the cross tabulation. Again, we look at the data from the cat questions. Note that we convert the second question to a factor explicitly, although the table command will also coerce a numeric vector for you:

> d <- data.frame(Q1=as.factor(c("y","n","y","y","n")),
+                 Q2=c(2,0,1,2,0))
> results <- table(d$Q1,as.factor(d$Q2))
> results
    0 1 2
  n 2 0 0
  y 0 1 2
> summary(results)
Number of cases in table: 5 
Number of factors: 2 
Test for independence of all factors:
 Chisq = 5, df = 2, p-value = 0.08208
 Chi-squared approximation may be incorrect

The rows and columns of the table have names associated with them, and the rownames and colnames commands can be used to assign the names. These commands are similar to the names command. In the preceding example, the names in the table are not descriptive. In the following example, we build the table and rename the rows and columns:

> d <- data.frame(Q1=as.factor(c("y","n","y","y","n")),
+                 Q2=c(2,0,1,2,0))
> results <- table(d$Q1,as.factor(d$Q2))
> rownames(results) <- c("No Cat","Has Cat")
> colnames(results) <- c("No room","One room","Two rooms")
> results
          No room One room Two rooms
  No Cat        2        0         0
  Has Cat       0        1         2
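
The dimensions themselves can also be labeled when the table is built. As a hedged sketch in base R, named arguments to the table command become the names of the dimensions; the labels HaveCat and Rooms are chosen here for illustration:

```r
haveCat <- factor(c("y","n","y","y","n"))
rooms   <- c(2, 0, 1, 2, 0)

# Named arguments become the names of the table's dimensions.
results <- table(HaveCat = haveCat, Rooms = rooms)
print(names(dimnames(results)))     # "HaveCat" "Rooms"

# Cells are addressed by their row and column labels.
print(results["y", "2"])            # 2

# margin.table sums over the other dimension.
catTotals <- margin.table(results, 1)
print(catTotals)                    # n: 2, y: 3
```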

One last note: the table command treats its arguments as categorical data. If you have numeric data, it can be quickly transformed to encode which interval contains each number. The cut command takes the numeric data and a vector of break points that indicate the cutoff points between each interval, as follows:

> a <- c(-0.8,-0.7,0.9,-1.4,-0.3,1.2)
> b <- cut(a,breaks=c(-1.5,-1,-0.5,0,0.5,1.0,1.5))
> b
[1] (-1,-0.5] (-1,-0.5] (0.5,1] (-1.5,-1] (-0.5,0] (1,1.5] 
Levels: (-1.5,-1] (-1,-0.5] (-0.5,0] (0,0.5] (0.5,1] (1,1.5]
> summary(b)
(-1.5,-1] (-1,-0.5] (-0.5,0] (0,0.5] (0.5,1] (1,1.5] 
 1 2 1 0 1 1 
> table(b)
b
(-1.5,-1] (-1,-0.5] (-0.5,0] (0,0.5] (0.5,1] (1,1.5] 
 1 2 1 0 1 1
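
The cut command can also attach descriptive labels to the intervals. The sketch below assumes only base R; the break points, the labels, and the names bands and outside are chosen for illustration:

```r
a <- c(-0.8, -0.7, 0.9, -1.4, -0.3, 1.2)

# labels replaces the interval notation with descriptive names.
bands <- cut(a, breaks = c(-1.5, -0.5, 0.5, 1.5),
             labels = c("low", "middle", "high"))
print(table(bands))                 # low 3, middle 1, high 2

# A value outside the break points is encoded as NA.
outside <- cut(2, breaks = c(0, 1))
print(is.na(outside))               # TRUE
```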

Matrices and arrays

Tables are a special case of an array. An array or a matrix can be constructed directly using either the array or matrix commands. The array command takes a vector and dimensions, and it constructs an array using column major order. If you wish to provide the data in row major order, then the command to transpose the result is t:

> a <- c(1,2,3,4,5,6)
> A <- array(a,c(2,3))
> A
 [,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
> t(A)
 [,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6

You are not limited to two-dimensional arrays. The dim option can include any number of dimensions. In the following example, a three-dimensional array is created by giving three numbers, one extent for each dimension:

> A <- array(1:24,c(2,3,4))
> A
, , 1

 [,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6

, , 2

 [,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12

, , 3

 [,1] [,2] [,3]
[1,] 13 15 17
[2,] 14 16 18

, , 4

 [,1] [,2] [,3]
[1,] 19 21 23
[2,] 20 22 24
> A[2,3,4]
[1] 24
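
If labeled dimensions are wanted, the dimnames argument of the array command expects a list with one character vector per dimension. The following is a minimal sketch in base R; the labels r1, c1, d1, and so on are hypothetical:

```r
# dimnames is a list holding one character vector per dimension.
A <- array(1:24, c(2, 3, 4),
           dimnames = list(row = c("r1", "r2"),
                           col = c("c1", "c2", "c3"),
                           dep = c("d1", "d2", "d3", "d4")))

# Elements can then be addressed by label as well as by position.
print(A["r2", "c3", "d4"])     # 24

# Leaving an index empty selects everything along that dimension.
print(dim(A[, , 1]))           # 2 3
print(names(dimnames(A)))      # "row" "col" "dep"
```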

A matrix is a two-dimensional array and is a special case that can be created using the matrix command. Rather than using the dimensions, the matrix command requires that you specify the number of rows or columns. The command has an additional option to specify whether or not the data is in row major or column major order:

> B <- matrix(1:12,nrow=3)
> B
 [,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
> B <- matrix(1:12,nrow=3,byrow=TRUE)
> B
 [,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12

Both matrices and arrays can be manipulated to determine or change their dimensions. The dim command can be used to get or set this information:

> C <- matrix(1:12,ncol=3)
> C
 [,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> dim(C)
[1] 4 3
> dim(C) <- c(3,4)
> C
 [,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
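
Setting the dimensions is not the same as transposing. The sketch below, in base R with the illustrative names reshaped and transposed, compares the two: dim keeps the underlying column major data and only changes the shape, while t swaps rows and columns:

```r
C <- matrix(1:12, ncol = 3)      # 4 x 3, filled column by column

reshaped <- C
dim(reshaped) <- c(3, 4)         # same data, new shape
transposed <- t(C)               # rows and columns swapped

print(reshaped[1, ])             # 1 4 7 10
print(transposed[1, ])           # 1 2 3 4
print(identical(reshaped, transposed))   # FALSE
```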

Censoring data

Using a logical vector as an index is useful for limiting data that is examined. For example, to limit a vector to examine only the positive values in the data set, a logical comparison can be used for the index into the vector:

> u <- 1:6
> v <- c(-1,1,-2,2,-3,3)
> u
[1] 1 2 3 4 5 6
> v
[1] -1 1 -2 2 -3 3
> u[v > 0]
[1] 2 4 6
> u[v < 0] <- -2*u[v < 0]
> u
[1] -2 2 -6 4 -10 6
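
It can help to look at the logical vector itself before using it as an index. The following sketch assumes only base R, and the name positive is illustrative:

```r
v <- c(-1, 1, -2, 2, -3, 3)

# The comparison produces one TRUE or FALSE per element.
positive <- v > 0
print(positive)             # FALSE TRUE FALSE TRUE FALSE TRUE

# which gives the positions of the TRUE entries; sum counts them.
print(which(positive))      # 2 4 6
print(sum(positive))        # 3

# Conditions combine element-wise with & and |.
print(v[v > 0 & v < 3])     # 1 2
```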

Another useful aspect of a logical index into a vector is the handling of missing values, which R represents as NA. The is.na function and the logical NOT operator (!) are useful tools when a vector contains data that is not defined:

> v <- c(1,2,3,NA,4,NA)
> v
[1] 1 2 3 NA 4 NA
> v[is.na(v)]
[1] NA NA
> v[!is.na(v)]
[1] 1 2 3 4

Note that many functions have optional arguments to specify how R should react to data that contains a value with the NA type. Unfortunately, the way this is done is not consistent, and you should use the help command with respect to any particular function:

> v <- c(1,2,3,NA,4,NA)
> v
[1] 1 2 3 NA 4 NA
> mean(v)
[1] NA
> mean(v,na.rm=TRUE)
[1] 2.5

In this last example, the na.rm option in the mean function is set to TRUE to specify that R should ignore the entries in the vector that are NA.
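
As a brief illustration of how the handling of NA differs between functions, the following base R sketch compares a few of them. The summary functions here take na.rm, while the table command uses a differently named option, useNA; the name counts is illustrative:

```r
v <- c(1, 2, 3, NA, 4, NA)

# Count the missing entries, then summarize while ignoring them.
print(sum(is.na(v)))            # 2
print(sum(v, na.rm = TRUE))     # 10
print(max(v, na.rm = TRUE))     # 4

# Comparisons with NA yield NA, so test with is.na rather than ==.
print(NA == 1)                  # NA

# table drops NA by default; useNA = "ifany" adds a count for it.
counts <- table(v, useNA = "ifany")
print(counts)
```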

Appending rows and columns

The cbind and rbind commands can be used to append data to existing objects. These commands can be used on vectors, matrices, arrays, and they are extended to also act on data frames. The following examples use data frames, as that is a common operation. You should be careful and try the commands on arrays to make sure that the operation behaves in the way you expect.

The cbind command is used to combine the columns of the data given as arguments:

> d <- data.frame(one=c(1,2,3),two=as.factor(c("one","two","three")))
> e <- as.factor(c("ein","zwei","drei"))
> newDataFrame <- cbind(d,third=e)
> newDataFrame
  one   two third
1   1   one   ein
2   2   two  zwei
3   3 three  drei
> newDataFrame$third
[1] ein zwei drei
Levels: drei ein zwei

If the arguments to the cbind command are two data frames (or two arrays), then the command combines all of the columns from all of the data frames (arrays):

> d <- data.frame(one=c(1,2,3),two=as.factor(c("one","two","three")))
> e <- data.frame(three=c(4,5,6),four=as.factor(c("vier","funf","sechs")))
> newDataFrame <- cbind(d,e)
> newDataFrame
  one   two three  four
1   1   one     4  vier
2   2   two     5  funf
3   3 three     6 sechs

The rbind command concatenates the rows of the objects passed to it. The command uses the names of the columns to determine how to append the data. The number and names of the columns must be identical:

> d <- data.frame(one=c(1,2,3),two=as.factor(c("one","two","three")))
> e <- data.frame(one=c(4,5,6),two=as.factor(c("vier","funf","sechs")))
> newDataFrame <- rbind(d,e)
> newDataFrame
  one   two
1   1   one
2   2   two
3   3 three
4   4  vier
5   5  funf
6   6 sechs
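
The behavior on plain vectors, and the way rbind matches data frame columns by name, can be sketched as follows. This is a minimal example in base R; the data and the names m, f, and result are made up for illustration:

```r
# On plain vectors, rbind and cbind build a matrix.
m <- rbind(1:3, 4:6)
print(dim(m))                      # 2 3
m <- cbind(m, c(7, 8))
print(dim(m))                      # 2 4

# For data frames, rbind matches columns by name; a mismatch is an error,
# which is caught here and replaced by a marker string.
d <- data.frame(one = c(1, 2, 3), two = c("a", "b", "c"))
e <- data.frame(one = 4, three = "d")
result <- tryCatch(rbind(d, e), error = function(err) "names do not match")
print(result)                      # "names do not match"

# Matching names in a different order are realigned.
f <- data.frame(two = "d", one = 4)
print(rbind(d, f)$one)             # 1 2 3 4
```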
