https://DevOpsCloud.io -- Cloud Monk Losang Jinpa

R Cookbook Chapter 5. Data Structures

Return to R Cookbook, R Bibliography, R DevOps, R Data Science, R Statistics, R Machine Learning, R Deep Learning, Data Science Bibliography, Statistics Bibliography

“ (RCook 2019)

Chapter 5. R Data Structures

“You can get pretty far in R just using R vectors. That’s what Chapter 2 is all about. This chapter moves beyond R vectors to R recipes for R matrices, R lists, R factors, R data frames, and R tribbles (which are a special kind of R data frame). If you have preconceptions about data structures, we suggest you put them aside. [[R does R data structures differently than many other programming languages. Before we get to the R recipes in this chapter, we’ll take a quick look at different data structures in R.” (RCook 2019)

“If you want to R study the technical aspects of R’s data structures, we suggest reading R in a Nutshell and the R Language Definition. The notes here are more informal. These are things we wish we’d known when we started using R.” (RCook 2019)

R Vectors

R Vectors: Here are some key properties of R vectors:

R Vectors are homogeneous “All R elements of a R vector must have the same R type or, in R terminology, the same R mode.”

R Vectors can be R indexed by position.

So v[2] refers to the second element of v.

Vectors can be indexed by multiple positions, returning a subvector. So v[c(2,3)] is a subvector of v that consists of the second and third elements.

Vector elements can have names. Vectors have a names property, the same length as the vector itself, that gives names to the elements:

v ← c(10, 20, 30) names(v) ← c(“Moe”, “Larry”, “Curly”) print(v)

> Moe Larry Curly
> 10 20 30

If vector elements have names, then you can select them by name. Continuing the previous example:

v"Larry"

> [1] 20

R Lists

R Lists: Here are some key properties of R lists:

Lists are heterogeneous.

Lists can contain elements of different types—in R terminology, list elements may have different modes. Lists can even contain other structured objects, such as lists and data frames; this allows you to create recursive data structures.

Lists can be indexed by position. So lst2 refers to the second element of lst. Note the double square brackets. Double brackets means that R will return the element as whatever type of element it is.

Lists let you extract sublists. So lst[c(2,3)] is a sublist of lst that consists of the second and third elements. Note the single square brackets. Single brackets means that R will return the items in a list. If you pull a single element with single brackets, like lst[2], R will return a list of length 1 with the first item being the desired item.

List elements can have names. Both lst"Moe" and lst$Moe refer to the element named “Moe.”

Since lists are heterogeneous and since their elements can be retrieved by name, a list is like a dictionary or hash or lookup table in other programming languages (discussed in Recipe 5.9).

What’s surprising (and cool) is that in R, unlike most of those other programming languages, lists can also be indexed by position.

Mode: Physical Type In R, every object has a mode, which indicates how it is stored in memory: as a number, as a character string, as a list of pointers to other objects, as a function, and so forth (see Table 5-1).

Table 5-1. R object-mode mapping Object Example Mode Number

3.1415

Numeric

Vector of numbers

c(2.7.182, 3.1415)

Numeric

Character string

“Moe”

Character

Vector of character strings

c(“Moe”, “Larry”, “Curly”)

Character

Factor

factor(c(“NY”, “CA”, “IL”))

Numeric

List

list(“Moe”, “Larry”, “Curly”)

List

Data frame

data.frame(x=1:3, y=c(“NY”, “CA”, “IL”))

List

Function

print

Function

The mode function gives us this information:

mode(3.1415) # Mode of a number

> [1] “numeric”

mode(c(2.7182, 3.1415)) # Mode of a vector of numbers

> [1] “numeric”

mode(“Moe”) # Mode of a character string

> [1] “character”

mode(list(“Moe”, “Larry”, “Curly”)) # Mode of a list

> [1] “list”

A critical difference between a vectors and lists can be summed up this way:

In a vector, all elements must have the same mode.

In a list, the elements can have different modes.

Class: Abstract Type In R, every object also has a class, which defines its abstract type. The terminology is borrowed from object-oriented programming. A single number could represent many different things: a distance, a point in time, or a weight, for example. All those objects have a mode of “numeric” because they are stored as a number, but they could have different classes to indicate their interpretation.

For example, a Date object consists of a single number:

d ← as.Date(“2010-03-15”) mode(d)

> [1] “numeric”

length(d)

> [1] 1

But it has a class of Date, telling us how to interpret that number—namely, as the number of days since January 1, 1970:

class(d)

> [1] “Date”

R uses an object’s class to decide how to process the object. For example, the generic function print has specialized versions (called methods) for printing objects according to their class: data.frame, Date, lm, and so forth. When you print an object, R calls the appropriate print function according to the object’s class.

Scalars The quirky thing about scalars is their relationship to vectors. In some software, scalars and vectors are two different things. In R, they are the same thing: a scalar is simply a vector that contains exactly one element. In this book we often use the term “scalar,” but that’s just shorthand for “vector with one element.”

Consider the built-in constant pi. It is a scalar:

pi

> [1] 3.14

Since a scalar is a one-element vector, you can use vector functions on pi:

length(pi)

> [1] 1

You can index it. The first (and only) element is π, of course:

pi[1]

> [1] 3.14

If you ask for the second element, there is none:

pi[2]

> [1] NA

Matrices In R, a matrix is just a vector that has dimensions. It may seem strange at first, but you can transform a vector into a matrix simply by giving it dimensions.

A vector has an attribute called dim, which is initially NULL, as shown here:

A ← 1:6 dim(A)

> NULL

print(A)

> [1] 1 2 3 4 5 6

We give dimensions to the vector when we set its dim attribute. Watch what happens when we set our vector dimensions to 2 × 3 and print it:

dim(A) ← c(2, 3) print(A)

> [,1] [,2] [,3]
> [1,] 1 3 5
> [2,] 2 4 6

Voilà! The vector was reshaped into a 2 × 3 matrix.

A matrix can be created from a list, too. Like a vector, a list has a dim attribute, which is initially NULL:

B ← list(1, 2, 3, 4, 5, 6) dim(B)

> NULL

If we set the dim attribute, it gives the list a shape:

dim(B) ← c(2, 3) print(B)

> [,1] [,2] [,3]
> [1,] 1 3 5
> [2,] 2 4 6

Voilà! We have turned this list into a 2 × 3 matrix.

Arrays The discussion of matrices can be generalized to three-dimensional or even n-dimensional structures: just assign more dimensions to the underlying vector (or list). The following example creates a three-dimensional array with dimensions 2 × 3 × 2:

D ← 1:12 dim(D) ← c(2, 3, 2) print(D)

> , , 1
>
> [,1] [,2] [,3]
> [1,] 1 3 5
> [2,] 2 4 6
>
> , , 2
>
> [,1] [,2] [,3]
> [1,] 7 9 11
> [2,] 8 10 12

Note that R prints one “slice” of the structure at a time, since it’s not possible to print a three-dimensional structure on a two-dimensional medium.

It strikes us as very odd that we can turn a list into a matrix just by giving the list a dim attribute. But wait: it gets stranger.

Recall that a list can be heterogeneous (mixed modes). We can start with a heterogeneous list, give it dimensions, and thus create a heterogeneous matrix. This code snippet creates a matrix that is a mix of numeric and character data:

C ← list(1, 2, 3, “X”, “Y”, “Z”) dim(C) ← c(2, 3) print(C)

> [,1] [,2] [,3]
> [1,] 1 3 “Y”
> [2,] 2 “X” “Z”

To us, this is strange because we ordinarily assume a matrix is purely numeric, not mixed. R is not that restrictive.

The possibility of a heterogeneous matrix may seem powerful and strangely fascinating. However, it creates problems when you are doing normal, day-to-day stuff with matrices. For example, what happens when the matrix C (from the previous example) is used in matrix multiplication? What happens if it is converted to a data frame? The answer is that odd things happen.

In this book, we generally ignore the pathological case of a heterogeneous matrix. We assume you’ve got simple, vanilla matrices. Some recipes involving matrices may work oddly (or not at all) if your matrix contains mixed data. Converting such a matrix to a vector or data frame, for instance, can be problematic (see Recipe 5.29).

Factors A factor looks like a character vector, but it has special properties. R keeps track of the unique values in a vector, and each unique value is called a level of the associated factor. R uses a compact representation for factors, which makes them efficient for storage in data frames. In other programming languages, a factor would be represented by a vector of enumerated values.

There are two key uses for factors:

Categorical variables A factor can represent a categorical variable. Categorical variables are used in contingency tables, linear regression, analysis of variance (ANOVA), logistic regression, and many other areas.

Grouping This is a technique for labeling or tagging your data items according to their group. See Chapter 6.

Data Frames A data frame is a powerful and flexible structure. Most serious R applications involve data frames. A data frame is intended to mimic a dataset, such as one you might encounter in SAS or SPSS, or a table in an SQL database.

A data frame is a tabular (rectangular) data structure, which means that it has rows and columns. It is not implemented by a matrix, however. Rather, a data frame is a list with the following characteristics:

The elements of the list are vectors and/or factors.1

Those vectors and factors are the columns of the data frame.

The vectors and factors must all have the same length; in other words, all columns must have the same height.

The equal-height columns give a rectangular shape to the data frame.

The columns must have names.

Because a data frame is both a list and a rectangular structure, R provides two different paradigms for accessing its contents:

You can use list operators to extract columns from a data frame, such as df[i], dfi, or df$name.

You can use matrix-like notation, such as df[i,j], df[i,], or df[,j].

Your perception of a data frame likely depends on your background:

To a statistician A data frame is a table of observations. Each row contains one observation. Each observation must contain the same variables. These variables are called columns, and you can refer to them by name. You can also refer to the contents by row number and column number, just as with a matrix.

To a SQL programmer A data frame is a table. The table resides entirely in memory, but you can save it to a flat file and restore it later. You needn’t declare the column types because R figures that out for you.

To an Excel user A data frame is like a worksheet, or perhaps a range within a worksheet. It is more restrictive, however, in that each column has a type.

To an SAS user A data frame is like an SAS dataset for which all the data resides in memory. R can read and write the data frame on disk, but the data frame must be in memory while R is processing it.

To an R programmer A data frame is a hybrid data structure, part matrix and part list. A column can contain numbers, character strings, or factors, but not a mix of them. You can index the data frame just like you index a matrix. The data frame is also a list, where the list elements are the columns, so you can access columns by using list operators.

To a computer scientist A data frame is a rectangular data structure. The columns are typed, and each column must contain numeric values, character strings, or factors. Columns must have labels; rows may have labels. The table can be indexed by position, column name, and/or row name. It can also be accessed by list operators, in which case R treats the data frame as a list whose elements are the columns of the data frame.

To a corporate executive You can put names and numbers into a data frame. A data frame is like a little database. Your staff will enjoy using data frames.

Tibbles A tibble is a modern reimagining of the data frame, introduced by Hadley Wickham in the tibble package, which is a core package in the tidyverse. Most of the common functions you would use with data frames also work with tibbles. However, tibbles typically do less than data frames and complain more. This idea of complaining and doing less may remind you of your least favorite coworker; however, we think tibbles will be one of your favorite data structures. Doing less and complaining more can be a feature, not a bug.

Unlike data frames, tibbles:

Do not give you row numbers by default.

Do not give you strange, unexpected column names.

Don’t coerce your data into factors (unless you explicitly ask for that).

Recycle vectors of length 1 but not other lengths.

In addition to basic data frame functionality, tibbles:

Print only the top four rows and a bit of metadata by default.

Always return a tibble when subsetting.

Never do partial matching: if you want a column from a tibble, you have to ask for it using its full name.

Complain more by giving you more warnings and chatty messages to make sure you understand what the software is doing.

All these extras are designed to give you fewer surprises and help you make fewer mistakes.

5.1 Appending Data to a Vector Problem You want to append additional data items to a vector.

Solution Use the vector constructor © to construct a vector with the additional data items:

v ← c(1, 2, 3) newItems ← c(6, 7, 8) c(v, newItems)

> [1] 1 2 3 6 7 8

For a single item, you can also assign the new item to the next vector element. R will automatically extend the vector:

v ← c(1, 2, 3) v[length(v) + 1] ← 42 v

> [1] 1 2 3 42

Discussion If you ask us about appending a data item to a vector, we will likely suggest that maybe you shouldn’t.

TIP R works best when you think about entire vectors, not single data items. Are you repeatedly appending items to a vector? If so, then you are probably working inside a loop. That’s OK for small vectors, but for large vectors your program will run slowly. The memory management in R works poorly when you repeatedly extend a vector by one element. Try to replace that loop with vector-level operations. You’ll write less code, and R will run much faster.

Nonetheless, one does occasionally need to append data to vectors. Our experiments show that the most efficient way of doing so is to create a new vector using the vector constructor © to join the old and new data. This works for appending single elements or multiple elements:

v ← c(1, 2, 3) v ← c(v, 4) # Append a single value to v v

> [1] 1 2 3 4

w ← c(5, 6, 7, 8) v ← c(v, w) # Append an entire vector to v v

> [1] 1 2 3 4 5 6 7 8

You can also append an item by assigning it to the position past the end of the vector, as shown in the Solution. In fact, R is very liberal about extending vectors. You can assign to any element and R will expand the vector to accommodate your request:

v ← c(1, 2, 3) # Create a vector of three elements v[10] ← 10 # Assign to the 10th element v # R extends the vector automatically

> [1] 1 2 3 NA NA NA NA NA NA 10

Note that R did not complain about the out-of-bounds subscript. It just extended the vector to the needed length, filling it with NA.

R includes an append function that creates a new vector by appending items to an existing vector. However, our experiments show that this function runs more slowly than both the vector constructor and the element assignment.

5.2 Inserting Data into a Vector Problem You want to insert one or more data items into a vector.

Solution Despite its name, the append function inserts data into a vector by using the after parameter, which gives the insertion point for the new item or items:

append(vec, newvalues, after = n) Discussion The new items will be inserted at the position given by after. This example inserts 99 into the middle of a sequence:

append(1:10, 99, after = 5)

> [1] 1 2 3 4 5 99 6 7 8 9 10

The special value of after=0 means insert the new items at the head of the vector:

append(1:10, 99, after = 0)

> [1] 99 1 2 3 4 5 6 7 8 9 10

The comments in Recipe 5.1 apply here, too. If you are inserting single items into a vector, you might be working at the element level when working at the vector level would be easier to code and faster to run.

5.3 Understanding the Recycling Rule Problem You want to understand the mysterious Recycling Rule that governs how R handles vectors of unequal length.

Discussion When you do vector arithmetic, R performs element-by-element operations. That works well when both vectors have the same length: R pairs the elements of the vectors and applies the operation to those pairs.

But what happens when the vectors have unequal lengths?

In that case, R invokes the Recycling Rule. It processes the vector elements in pairs, starting at the first elements of both vectors. At a certain point, the shorter vector is exhausted while the longer vector still has unprocessed elements. R then returns to the beginning of the shorter vector, “recycling” its elements, while it continues taking elements from the longer vector until it completes the operation. It will recycle the shorter vector’s elements as often as necessary until the operation is complete.

It’s useful to visualize the Recycling Rule. Here is a diagram of two vectors, 1:6 and 1:3:

  1:6   1:3
 ----- -----
   1     1
   2     2
   3     3
   4
   5
   6

Obviously, the 1:6 vector is longer than the 1:3 vector. If we try to add the vectors using (1:6) + (1:3), it appears that 1:3 has too few elements. However, R recycles the elements of 1:3, pairing the two vectors like this and producing a six-element vector:

  1:6   1:3   (1:6) + (1:3)
 ----- ----- ---------------
   1     1         2
   2     2         4
   3     3         6
   4               5
   5               7
   6               9

Here is what you see in the R console:

(1:6) + (1:3)

> [1] 2 4 6 5 7 9

It’s not only vector operations that invoke the Recycling Rule; functions can, too. The cbind function can create column vectors, such as the following column vectors of 1:6 and 1:3. The two columns have different heights, of course:

cbind(1:6)

cbind(1:3) If we try binding these column vectors together into a two-column matrix, the lengths are mismatched. The 1:3 vector is too short, so cbind invokes the Recycling Rule and recycles the elements of 1:3:

cbind(1:6, 1:3)

> [,1] [,2]
> [1,] 1 1
> [2,] 2 2
> [3,] 3 3
> [4,] 4 1
> [5,] 5 2
> [6,] 6 3

If the longer vector’s length is not a multiple of the shorter vector’s length, R gives a warning. That’s good, since the operation is highly suspect and there is likely a bug in your logic:

(1:6) + (1:5) # Oops! 1:5 is one element too short

> Warning in (1:6) + (1:5): longer object length is not a multiple of shorter
> object length
> [1] 2 4 6 8 10 7

Once you understand the Recycling Rule, you will realize that operations between a vector and a scalar are simply applications of that rule. In this example, the 10 is recycled repeatedly until the vector addition is complete:

(1:6) + 10

> [1] 11 12 13 14 15 16

5.4 Creating a Factor (Categorical Variable) Problem You have a vector of character strings or integers. You want R to treat them as a factor, which is R’s term for a categorical variable.

Solution The factor function encodes your vector of discrete values into a factor:

f ← factor(v) # v can be a vector of strings or integers If your vector contains only a subset of possible values and not the entire universe, then include a second argument that gives the possible levels of the factor:

f ← factor(v, levels) Discussion In R, each possible value of a categorical variable is called a level. A vector of levels is called a factor. Factors fit very cleanly into the vector orientation of R, and they are used in powerful ways for processing data and building statistical models.

Most of the time, converting your categorical data into a factor is a simple matter of calling the factor function, which identifies the distinct levels of the categorical data and packs them into a factor:

f ← factor(c(“Win”, “Win”, “Lose”, “Tie”, “Win”, “Lose”)) f

> [1] Win Win Lose Tie Win Lose
> Levels: Lose Tie Win

Notice that when we printed the factor, f, R did not put quotes around the values. They are levels, not strings. Also notice that when we printed the factor, R displayed the distinct levels below the factor.

If your vector contains only a subset of all the possible levels, then R will have an incomplete picture of the possible levels. Suppose you have a string-valued variable wday that gives the day of the week on which your data was observed:

wday ← c(“Wed”, “Thu”, “Mon”, “Wed”, “Thu”,

         "Thu", "Thu", "Tue", "Thu", "Tue")

f ← factor(wday) f

> [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
> Levels: Mon Thu Tue Wed

R thinks that Monday, Thursday, Tuesday, and Wednesday are the only possible levels. Friday is not listed. Apparently, the lab staff never made observations on a Friday, so R does not know that Friday is a possible value. Hence, you need to list the possible levels of wday explicitly:

f ← factor(wday, levels=c(“Mon”, “Tue”, “Wed”, “Thu”, “Fri”)) f

> [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
> Levels: Mon Tue Wed Thu Fri

Now R understands that f is a factor with five possible levels. It knows their correct order, too. It originally put Thursday before Tuesday because it assumes alphabetical order by default. The explicit levels argument defines the correct order.

In many situations it is not necessary to call factor explicitly. When an R function requires a factor, it usually converts your data to a factor automatically. The table function, for instance, works only on factors, so it routinely converts its inputs to factors without asking. You must explicitly create a factor variable when you want to specify the full set of levels or when you want to control the ordering of levels.

See Also See Recipe 12.5 to create a factor from continuous data.

5.5 Combining Multiple Vectors into One Vector and a Factor Problem You have several groups of data, with one vector for each group. You want to combine the vectors into one large vector and simultaneously create a parallel factor that identifies each value’s original group.

Solution Create a list that contains the vectors. Use the stack function to combine the list into a two-column data frame:

comb ← stack(list(v1 = v1, v2 = v2, v3 = v3)) # Combine 3 vectors The data frame’s columns are called values and ind. The first column contains the data, and the second column contains the parallel factor.

Discussion Why in the world would you want to mash all your data into one big vector and a parallel factor? The reason is that many important statistical functions require the data in that format.

Suppose you survey freshmen, sophomores, and juniors regarding their confidence level (“What percentage of the time do you feel confident in school?”). Now you have three vectors, called freshmen, sophomores, and juniors. You want to perform an ANOVA of the differences between the groups. The ANOVA function, aov, requires one vector with the survey results as well as a parallel factor that identifies the group. You can combine the groups using the stack function:

freshmen ← c(1, 2, 1, 1, 5) sophomores ← c(3, 2, 3, 3, 5) juniors ← c(5, 3, 4, 3, 3)

comb ← stack(list(fresh = freshmen, soph = sophomores, jrs = juniors)) print(comb)

> values ind
> 1 1 fresh
> 2 2 fresh
> 3 1 fresh
> 4 1 fresh
> 5 5 fresh
> 6 3 soph
> 7 2 soph
> 8 3 soph
> 9 3 soph
> 10 5 soph
> 11 5 jrs
> 12 3 jrs
> 13 4 jrs
> 14 3 jrs
> 15 3 jrs

Now you can perform the ANOVA on the two columns:

aov(values ~ ind, data = comb) When building the list we must provide tags for the list elements. (The tags are fresh, soph, and jrs in this example.) Those tags are required because stack uses them as the levels of the parallel factor.

5.6 Creating a List Problem You want to create and populate a list.

Solution To create a list from individual data items, use the list function:

lst ← list(x, y, z) Discussion Lists can be quite simple, such as this list of three numbers:

lst ← list(0.5, 0.841, 0.977) lst

> 1
> [1] 0.5
>
> 2
> [1] 0.841
>
> 3
> [1] 0.977

When R prints the list, it identifies each list element by its position (1, 2, 3) and prints the element’s value (e.g., [1] 0.5) under its position.

More usefully, lists can, unlike vectors, contain elements of different modes (types). Here is an extreme example of a mongrel created from a scalar, a character string, a vector, and a function:

lst ← list(3.14, “Moe”, c(1, 1, 2, 3), mean) lst

> 1
> [1] 3.14
>
> 2
> [1] “Moe”
>
> 3
> [1] 1 1 2 3
>
> 4
> function (x, …)
> UseMethod(“mean”)
> <bytecode: 0x7ff04b0bc900>
> <environment: namespace:base>

You can also build a list by creating an empty list and populating it. Here is our “mongrel” example built in that way:

lst ← list() lst1 ← 3.14 lst2 ← “Moe” lst3 ← c(1, 1, 2, 3) lst4 ← mean lst

> 1
> [1] 3.14
>
> 2
> [1] “Moe”
>
> 3
> [1] 1 1 2 3
>
> 4
> function (x, …)
> UseMethod(“mean”)
> <bytecode: 0x7ff04b0bc900>
> <environment: namespace:base>

List elements can be named. The list function lets you supply a name for every element:

lst ← list(mid = 0.5, right = 0.841, far.right = 0.977) lst

> $mid
> [1] 0.5
>
> $right
> [1] 0.841
>
> $far.right
> [1] 0.977

See Also See the introduction to this chapter for more about lists; see Recipe 5.9 for more about building and using lists with named elements.

5.7 Selecting List Elements by Position Problem You want to access list elements by position.

Solution Use one of these ways. Here, lst is a list variable:

lstn Selects the nth element from the list

lst[c(n1, n2, …, nk)] Returns a list of elements, selected by their positions

Note that the first form returns a single element and the second form returns a list.

Discussion Suppose we have a list of four integers, called years:

years ← list(1960, 1964, 1976, 1994) years

> 1
> [1] 1960
>
> 2
> [1] 1964
>
> 3
> [1] 1976
>
> 4
> [1] 1994

We can access single elements using the double-square-bracket syntax:

years1

> [1] 1960

We can extract sublists using the single-square-bracket syntax:

years[c(1, 2)]

> 1
> [1] 1960
>
> 2
> [1] 1964

This syntax can be confusing because of a subtlety: there is an important difference between lstn and lst[n]. They are not the same thing:

lstn This is an element, not a list. It is the nth element of lst.

lst[n] This is a list, not an element. The list contains one element, taken from the nth element of lst.

TIP The second form is a special case of lst[c(n1, n2, …, nk)] in which we eliminated the c(…) construct because there is only one n.

The difference becomes apparent when we inspect the structure of the result—one is a number and the other is a list:

class(years1)

> [1] “numeric”

class(years[1])

> [1] “list”

The difference becomes annoyingly apparent when we cat the value. Recall that cat can print atomic values or vectors but complains about printing structured objects:

cat(years1, “\n”)

> 1960

cat(years[1], “\n”)

> Error in cat(years[1], “\n”): argument 1 (type 'list')
> cannot be handled by 'cat'

We got lucky here because R alerted us to the problem. In other contexts, you might work long and hard to figure out that you accessed a sublist when you wanted an element, or vice versa.

5.8 Selecting List Elements by Name Problem You want to access list elements by their names.

Solution Use one of these forms. Here, lst is a list variable:

lst"name" Selects the element called name. Returns NULL if no element has that name.

lst$name Same as previous, just different syntax.

lst[c(name1, name2, …, namek)] Returns a list built from the indicated elements of lst.

Note that the first two forms return an element, whereas the third form returns a list.

Discussion Each element of a list can have a name. If named, the element can be selected by its name. This assignment creates a list of four named integers:

years ← list(Kennedy = 1960, Johnson = 1964,

             Carter = 1976, Clinton = 1994)

These next two expressions return the same value—namely, the element that is named “Kennedy”:

years"Kennedy"

> [1] 1960

years$Kennedy

> [1] 1960

The following two expressions return sublists extracted from years:

years[c(“Kennedy”, “Johnson”)]

> $Kennedy
> [1] 1960
>
> $Johnson
> [1] 1964

years[“Carter”]

> $Carter
> [1] 1976

Just as with selecting list elements by position (see Recipe 5.7), there is an important difference between lst"name" and lst[“name”]. They are not the same:

lst"name" This is an element, not a list.

lst[“name”] This is a list, not an element.

TIP The second form is a special case of lst[c(name1, name2, …, namek)] in which we don’t need the c(…) construct because there is only one name.

See Also See Recipe 5.7 to access elements by position rather than by name.

5.9 Building a Name/Value Association List Problem You want to create a list that associates names and values, like a dictionary, hash, or lookup table would in another programming language.

Solution The list function lets you give names to elements, creating an association between each name and its value:

lst ← list(mid = 0.5, right = 0.841, far.right = 0.977) If you have parallel vectors of names and values, you can create an empty list and then populate the list by using a vectorized assignment statement:

values ← c(1, 2, 3) names ← c(“a”, “b”, “c”) lst ← list() lst[names] ← values Discussion Each element of a list can be named, and you can retrieve list elements by name. This gives you a basic programming tool: the ability to associate names with values.

You can assign element names when you build the list. The list function allows arguments of the form name=value:

lst ← list(

 far.left = 0.023,
 left = 0.159,
 mid = 0.500,
 right = 0.841,
 far.right = 0.977

) lst

> $far.left
> [1] 0.023
>
> $left
> [1] 0.159
>
> $mid
> [1] 0.5
>
> $right
> [1] 0.841
>
> $far.right
> [1] 0.977

One way to name the elements is to create an empty list and then populate it via assignment statements:

lst ← list() lst$far.left ← 0.023 lst$left ← 0.159 lst$mid ← 0.500 lst$right ← 0.841 lst$far.right ← 0.977 Sometimes you have a vector of names and a vector of corresponding values:

values ← -2:2 names ← c(“far.left”, “left”, “mid”, “right”, “far.right”) You can associate the names and the values by creating an empty list and then populating it with a vectorized assignment statement:

lst ← list() lst[names] ← values lst

> $far.left
> [1] -2
>
> $left
> [1] -1
>
> $mid
> [1] 0
>
> $right
> [1] 1
>
> $far.right
> [1] 2

Once the association is made, the list can “translate” names into values through a simple list lookup:

cat(“The left limit is”, lst"left", “\n”)

> The left limit is -1

cat(“The right limit is”, lst"right", “\n”)

> The right limit is 1

for (nm in names(lst)) cat(“The”, nm, “limit is”, lstnm, “\n”)

> The far.left limit is -2
> The left limit is -1
> The mid limit is 0
> The right limit is 1
> The far.right limit is 2

5.10 Removing an Element from a List Problem You want to remove an element from a list.

Solution Assign NULL to the element. R will remove it from the list.

Discussion To remove a list element, select it by position or by name, and then assign NULL to the selected element:

years ← list(Kennedy = 1960, Johnson = 1964,

             Carter = 1976, Clinton = 1994)

years

> $Kennedy
> [1] 1960
>
> $Johnson
> [1] 1964
>
> $Carter
> [1] 1976
>
> $Clinton
> [1] 1994

years"Johnson" ← NULL # Remove the element labeled “Johnson” years

> $Kennedy
> [1] 1960
>
> $Carter
> [1] 1976
>
> $Clinton
> [1] 1994

You can remove multiple elements this way, too:

years[c(“Carter”, “Clinton”)] ← NULL # Remove two elements years

> $Kennedy
> [1] 1960

5.11 Flattening a List into a Vector Problem You want to flatten all the elements of a list into a vector.

Solution Use the unlist function.

Discussion There are many contexts that require a vector. Basic statistical functions work on vectors but not on lists, for example. If iq.scores is a list of numbers, then we cannot directly compute their mean:

iq.scores ← list(100, 120, 103, 80, 99) mean(iq.scores)

> Warning in mean.default(iq.scores): argument is not numeric or logical:
> returning NA
> [1] NA

Instead, we must flatten the list into a vector using unlist and then compute the mean of the result:

mean(unlist(iq.scores))

> [1] 100

Here is another example. We can cat scalars and vectors, but we cannot cat a list:

cat(iq.scores, “\n”)

> Error in cat(iq.scores, “\n”): argument 1 (type 'list') cannot be
> handled by 'cat'

One solution is to flatten the list into a vector before printing:

cat(“IQ Scores:”, unlist(iq.scores), “\n”)

> IQ Scores: 100 120 103 80 99

See Also Conversions such as this are discussed more fully in Recipe 5.29.

5.12 Removing NULL Elements from a List Problem Your list contains NULL values. You want to remove them.

Solution The compact function from the purrr package will remove the NULL elements.

Discussion The curious reader may be wondering how a list can contain NULL elements, given that we remove elements by setting them to NULL (see Recipe 5.10). The answer is that we can create a list containing NULL elements:

library(purrr) # or library(tidyverse)

lst ← list(“Moe”, NULL, “Curly”) lst

> 1
> [1] “Moe”
>
> 2
> NULL
>
> 3
> [1] “Curly”

compact(lst) # Remove NULL element

> 1
> [1] “Moe”
>
> 2
> [1] “Curly”

In practice, we might also end up with NULL items in a list after applying some transformation.

Note that in R, NA and NULL are not the same thing. The compact function will remove NULL from a list but not NA. To remove NA values, see Recipe 5.13.

See Also See Recipe 5.10 for how to remove list elements and Recipe 5.13 for how to remove list elements conditionally.

5.13 Removing List Elements Using a Condition Problem You want to remove elements from a list according to a conditional test, such as removing elements that are undefined, negative, or smaller than some threshold.

Solution Start with a function that returns TRUE when your criteria are met and FALSE otherwise. Then use the discard function from purrr to remove values that match your criteria. This code snippet, for example, uses the is.na function to remove NA values from lst:

lst ← list(NA, 0, NA, 1, 2)

lst %>%

 discard(is.na)

> 1
> [1] 0
>
> 2
> [1] 1
>
> 3
> [1] 2

Discussion The discard function removes elements from a list using a predicate, which is a function that returns either TRUE or FALSE. The predicate is applied to each element of the list. If the predicate returns TRUE, the element is discarded; otherwise, it is kept.

Suppose we want to remove character strings from lst. The function is.character is a predicate that returns TRUE if its argument is a character string, so we can use it with discard:

lst ← list(3, “dog”, 2, “cat”, 1)

lst %>%

 discard(is.character)

> 1
> [1] 3
>
> 2
> [1] 2
>
> 3
> [1] 1

You can define your own predicate and use it with discard. This example removes both NA and NULL values from a list by defining the predicate is_na_or_null:

is_na_or_null ← function(x) {

 is.na(x) || is.null(x)

}

lst ← list(1, NA, 2, NULL, 3)

lst %>%

 discard(is_na_or_null)

> 1
> [1] 1
>
> 2
> [1] 2
>
> 3
> [1] 3

Lists can hold complex objects, too, not just atomic values. Suppose that mods is a list of linear models created by the lm function:

mods ← list(lm(x ~ y1),

            lm(x ~ y2),
            lm(x ~ y3))

We can define a predicate, filter_r2, to identify models whose R2 values are less than 0.70, then use the predicate to remove those models from mods:

filter_r2 ← function(model) {

 summary(model)$r.squared < 0.7

}

mods %>%

 discard(filter_r2)

The inverse of discard is the keep function, which uses a predicate to retain list elements instead of discarding them.

See Also See Recipe 5.7, Recipe 5.10, and Recipe 15.3.

5.14 Initializing a Matrix Problem You want to create a matrix and initialize it from given values.

Solution Capture the data in a vector or list, and then use the matrix function to shape the data into a matrix. This example shapes a vector into a 2 × 3 matrix (i.e., two rows and three columns):

vec ← 1:6 matrix(vec, 2, 3)

> [,1] [,2] [,3]
> [1,] 1 3 5
> [2,] 2 4 6

Discussion The first argument of matrix is the data, the second argument is the number of rows, and the third argument is the number of columns. Note that the matrix in the Solution was filled column by column, not row by row.

It’s common to initialize an entire matrix to one value, such as 0 or NA. If the first argument of matrix is a single value, then R will apply the Recycling Rule and automatically replicate the value to fill the entire matrix:

matrix(0, 2, 3) # Create an all-zeros matrix

> [,1] [,2] [,3]
> [1,] 0 0 0
> [2,] 0 0 0

matrix(NA, 2, 3) # Create a matrix populated with NAs

> [,1] [,2] [,3]
> [1,] NA NA NA
> [2,] NA NA NA

You can create a matrix with a one-liner, of course, but it becomes difficult to read:

mat ← matrix(c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 2, 3) mat

> [,1] [,2] [,3]
> [1,] 1.1 1.3 2.2
> [2,] 1.2 2.1 2.3

A common idiom in R is typing the data itself in a rectangular shape that reveals the matrix structure:

theData ← c(

 1.1, 1.2, 1.3,
 2.1, 2.2, 2.3

) mat ← matrix(theData, 2, 3, byrow = TRUE) mat

> [,1] [,2] [,3]
> [1,] 1.1 1.2 1.3
> [2,] 2.1 2.2 2.3

Setting byrow=TRUE tells matrix that the data is row-by-row and not column-by-column (which is the default). In condensed form, that becomes:

mat ← matrix(c(1.1, 1.2, 1.3,

               2.1, 2.2, 2.3),
             2, 3,
             byrow = TRUE)

Expressed this way, it’s easy to see the two rows and three columns of data.

There is a quick-and-dirty way to turn a vector into a matrix: just assign dimensions to the vector. This was discussed in the introduction to this chapter. The following example creates a vanilla vector and then shapes it into a 2 × 3 matrix:

v ← c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3) dim(v) ← c(2, 3) v

> [,1] [,2] [,3]
> [1,] 1.1 1.3 2.2
> [2,] 1.2 2.1 2.3

We find this more opaque than using matrix, especially since there is no byrow option here.

See Also See Recipe 5.3.

5.15 Performing Matrix Operations Problem You want to perform matrix operations such as transposition, inversion, multiplication, or constructing an identity matrix.

Solution Perform these operations with the following functions:

t(A) Matrix transposition of A

solve(A) Matrix inverse of A

A %*% B Matrix multiplication of A and B

diag(n) Constructs an n×n diagonal (identity) matrix

Discussion Recall that A*B is element-wise multiplication, whereas A %*% B is matrix multiplication (see Recipe 2.11).

All these functions return a matrix. Their arguments can be either matrices or data frames. If they are data frames, then R will first convert them to matrices (although this is useful only if the data frame contains exclusively numeric values).

5.16 Giving Descriptive Names to the Rows and Columns of a Matrix Problem You want to assign descriptive names to the rows or columns of a matrix.

Solution Every matrix has a rownames attribute and a colnames attribute. Assign a vector of character strings to the appropriate attribute:

rownames(mat) ← c(“rowname1”, “rowname2”, …, “rownameN”) colnames(mat) ← c(“colname1”, “colname2”, …, “colnameN”) Discussion R lets you assign names to the rows and columns of a matrix, which is useful for printing the matrix. R will display the names if they are defined, enhancing the readability of your output. Consider this matrix of correlations between the stock prices of IBM, Microsoft, and Google:

print(corr_mat)

> [,1] [,2] [,3]
> [1,] 1.000 0.556 0.390
> [2,] 0.556 1.000 0.444
> [3,] 0.390 0.444 1.000

In this form, the interpretation of the matrix is not self-evident. We can give names to the rows and columns, clarifying its meaning:

colnames(corr_mat) ← c(“AAPL”, “MSFT”, “GOOG”) rownames(corr_mat) ← c(“AAPL”, “MSFT”, “GOOG”) corr_mat

> AAPL MSFT GOOG
> AAPL 1.000 0.556 0.390
> MSFT 0.556 1.000 0.444
> GOOG 0.390 0.444 1.000

Now you can see at a glance which rows and columns apply to which stocks.

Another advantage of naming rows and columns is that you can refer to matrix elements by those names:

What is the correlation between MSFT and GOOG?

corr_mat[“MSFT”, “GOOG”]

> [1] 0.444

5.17 Selecting One Row or Column from a Matrix Problem You want to select a single row or a single column from a matrix.

Solution The solution depends on what you want. If you want the result to be a simple vector, just use normal indexing:

mat[1, ] # First row mat[, 3] # Third column If you want the result to be a one-row matrix or a one-column matrix, then include the drop=FALSE argument:

mat[1, , drop=FALSE] # First row, one-row matrix mat[, 3, drop=FALSE] # Third column, one-column matrix Discussion Normally, when you select one row or column from a matrix, R strips off the dimensions. The result is a dimensionless vector:

mat[1, ]

> [1] 1.1 1.2 1.3

mat[, 3]

> [1] 1.3 2.3

When you include the drop=FALSE argument, however, R retains the dimensions. In that case, selecting a row returns a row vector (a 1 × n matrix):

mat[1, , drop=FALSE]

> [,1] [,2] [,3]
> [1,] 1.1 1.2 1.3

Likewise, selecting a column with drop=FALSE returns a column vector (an n × 1 matrix):

mat[, 3, drop=FALSE]

> [,1]
> [1,] 1.3
> [2,] 2.3

5.18 Initializing a Data Frame from Column Data Problem Your data is organized by columns, and you want to assemble it into a data frame.

Solution If your data is captured in several vectors and/or factors, use the data.frame function to assemble them into a data frame:

df ← data.frame(v1, v2, v3, f1) If your data is captured in a list that contains vectors and/or factors, use as.data.frame instead:

df ← as.data.frame(list.of.vectors) Discussion A data frame is a collection of columns, each of which corresponds to an observed variable (in the statistical sense, not the programming sense). If your data is already organized into columns, then it’s easy to build a data frame.

The data.frame function can construct a data frame from vectors, where each vector is one observed variable. Suppose you have two numeric variables, one character variable, and one response variable. The data.frame function can create a data frame from your vectors:

data.frame(pred1, pred2, pred3, resp)

> pred1 pred2 pred3 resp
> 1 1.75 11.8 AM 13.2
> 2 4.01 10.7 PM 12.9
> 3 2.64 12.2 AM 13.9
> 4 6.03 12.2 PM 14.9
> 5 2.78 15.0 PM 16.4

Notice that data.frame takes the column names from your program variables. You can override that default by supplying explicit column names:

data.frame(p1 = pred1, p2 = pred2, p3 = pred3, r = resp)

> p1 p2 p3 r
> 1 1.75 11.8 AM 13.2
> 2 4.01 10.7 PM 12.9
> 3 2.64 12.2 AM 13.9
> 4 6.03 12.2 PM 14.9
> 5 2.78 15.0 PM 16.4

If you’d rather have a tibble than a data frame, use the tibble function from the tidyverse:

tibble(p1 = pred1, p2 = pred2, p3 = pred3, r = resp)

> # A tibble: 5 x 4
> p1 p2 p3 r
> <dbl> <dbl> <fct> <dbl>
> 1 1.75 11.8 AM 13.2
> 2 4.01 10.7 PM 12.9
> 3 2.64 12.2 AM 13.9
> 4 6.03 12.2 PM 14.9
> 5 2.78 15.0 PM 16.4

Sometimes, your data may indeed be organized into vectors, but those vectors are held in a list, not individual program variables:

list.of.vectors ← list(p1=pred1, p2=pred2, p3=pred3, r=resp) In that case, use the as.data.frame function to create a data frame from the list:

as.data.frame(list.of.vectors)

> p1 p2 p3 r
> 1 1.75 11.8 AM 13.2
> 2 4.01 10.7 PM 12.9
> 3 2.64 12.2 AM 13.9
> 4 6.03 12.2 PM 14.9
> 5 2.78 15.0 PM 16.4

or use as_tibble to create a tibble:

as_tibble(list.of.vectors)

> # A tibble: 5 x 4
> p1 p2 p3 r
> <dbl> <dbl> <fct> <dbl>
> 1 1.75 11.8 AM 13.2
> 2 4.01 10.7 PM 12.9
> 3 2.64 12.2 AM 13.9
> 4 6.03 12.2 PM 14.9
> 5 2.78 15.0 PM 16.4

Factors in data frames There is an important difference between creating a data frame and creating a tibble. When you use the data.frame function to create a data frame, R will convert character values into factors by default. The pred3 value in the preceding data.frame example was converted to a factor, but that is not evident from the output.

The tibble and as_tibble functions, however, do not change character data. If you look at the tibble example, you’ll see column p3 has type chr, meaning character.

This difference is something you should be aware of. It can be maddeningly frustrating to debug an issue caused by this subtle difference.

5.19 Initializing a Data Frame from Row Data Problem Your data is organized by rows, and you want to assemble it into a data frame.

Solution Store each row in a one-row data frame. Use rbind to bind the rows into one large data frame:

rbind(row1, row2, … , rowN) Discussion Data often arrives as a collection of observations. Each observation is a record or tuple that contains several values, one for each observed variable. The lines of a flat file are usually like that: each line is one record, each record contains several columns, and each column is a different variable (see Recipe 4.15). Such data is organized by observation, not by variable. In other words, you are given rows one at a time rather than columns one at a time.

Each such row might be stored in several ways. One obvious way is as a vector. If you have purely numerical data, use a vector.

Many datasets, however, are a mixture of numeric, character, and categorical data, in which case a vector won’t work. We recommend storing each such heterogeneous row in a one-row data frame. (You could store each row in a list, but this recipe gets a little more complicated.)

We need to bind together those rows into a data frame. That’s what the rbind function does. It binds its arguments in such a way that each argument becomes one row in the result. If we rbind these three observations, for example, we get a three-row data frame:

r1 ← data.frame(a = 1, b = 2, c = “X”) r2 ← data.frame(a = 3, b = 4, c = “Y”) r3 ← data.frame(a = 5, b = 6, c = “Z”) rbind(r1, r2, r3)

> a b c
> 1 1 2 X
> 2 3 4 Y
> 3 5 6 Z

When you’re working with a large number of rows, they will likely be stored in a list; that is, you will have a list of rows. The bind_rows function, from the tidyverse package dplyr, handles that case, as shown in this toy example:

list.of.rows ← list(r1, r2, r3) bind_rows(list.of.rows)

> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
> Warning in bind_rows_(x, .id): binding character and factor vector,
> coercing into character vector

> Warning in bind_rows_(x, .id): binding character and factor vector,
> coercing into character vector

> Warning in bind_rows_(x, .id): binding character and factor vector,
> coercing into character vector
> a b c
> 1 1 2 X
> 2 3 4 Y
> 3 5 6 Z

Sometimes, for reasons beyond your control, each row of data is stored in a list rather than one-row data frames. You may be dealing with rows returned by a function or a database package, for example. bind_rows can handle that situation as well:

Same toy data, but rows stored in lists

l1 ← list(a = 1, b = 2, c = “X”) l2 ← list(a = 3, b = 4, c = “Y”) l3 ← list(a = 5, b = 6, c = “Z”) list.of.lists ← list(l1, l2, l3)

bind_rows(list.of.lists)

> # A tibble: 3 x 3
> a b c
> <dbl> <dbl> <chr>
> 1 1 2 X
> 2 3 4 Y
> 3 5 6 Z

Factors in data frames If you would rather get characters instead of factors, you have a couple of options. One is to set the stringsAsFactors parameter to FALSE when data.frame is called:

data.frame(a = 1, b = 2, c = “a”, stringsAsFactors = FALSE)

> a b c
> 1 1 2 a

Of course, if you inherited your data and it’s already in a data frame with factors, you can convert all the factors to characters using this bonus recipe:

same setup as in the previous examples

l1 ← list( a=1, b=2, c='X' ) l2 ← list( a=3, b=4, c='Y' ) l3 ← list( a=5, b=6, c='Z' ) obs ← list(l1, l2, l3) df ← do.call(rbind, Map(as.data.frame, obs))

Yes, you could use stringsAsFactors=FALSE above,
but we're assuming the data.frame
came to you with factors already

i ← sapply(df, is.factor) # determine which columns are factors df[i] ← lapply(df[i], as.character) # turn only the factors to characters Keep in mind that if you use a tibble instead of a data frame, then characters will not be forced into factors by default.

See Also See Recipe 5.18 if your data is organized by columns, not rows.

5.20 Appending Rows to a Data Frame Problem You want to append one or more new rows to a data frame.

Solution Create a second, temporary data frame containing the new rows. Then use the rbind function to append the temporary data frame to the original data frame.

Discussion Suppose we have a data frame of Chicago-area suburbs:

suburbs ← read_csv(”./data/suburbs.txt“)

> Parsed with column specification:
> cols(
> city = col_character(),
> county = col_character(),
> state = col_character(),
> pop = col_double()
> )

Further suppose we want to append a new row. First, we create a one-row data frame with the new data:

newRow ← data.frame(city = “West Dundee”, county = “Kane”,

                    state = "IL", pop = 7352)

Next, we use the rbind function to append that one-row data frame to our existing data frame:

rbind(suburbs, newRow)

> # A tibble: 18 x 4
> city county state pop
> <chr> <chr> <chr> <dbl>
> 1 Chicago Cook IL 2853114
> 2 Kenosha Kenosha WI 90352
> 3 Aurora Kane IL 171782
> 4 Elgin Kane IL 94487
> 5 Gary Lake(IN) IN 102746
> 6 Joliet Kendall IL 106221
> # … with 12 more rows

The rbind function tells R that we are appending a new row to suburbs, not a new column. It may be obvious to you that newRow is a row and not a column, but it is not obvious to R. (Use the cbind function to append a column.)

WARNING The new row must use the same column names as the data frame. Otherwise, rbind will fail.

We can combine these two steps into one, of course:

rbind(suburbs,

     data.frame(city = "West Dundee", county = "Kane",
                state = "IL", pop = 7352))

> # A tibble: 18 x 4
> city county state pop
> <chr> <chr> <chr> <dbl>
> 1 Chicago Cook IL 2853114
> 2 Kenosha Kenosha WI 90352
> 3 Aurora Kane IL 171782
> 4 Elgin Kane IL 94487
> 5 Gary Lake(IN) IN 102746
> 6 Joliet Kendall IL 106221
> # … with 12 more rows

We can even extend this technique to multiple new rows because rbind allows multiple arguments:

rbind(suburbs,

     data.frame(city = "West Dundee", county = "Kane",
                state = "IL", pop = 7352),
     data.frame(city = "East Dundee", county = "Kane",
                state = "IL", pop = 3192)

)

> # A tibble: 19 x 4
> city county state pop
> <chr> <chr> <chr> <dbl>
> 1 Chicago Cook IL 2853114
> 2 Kenosha Kenosha WI 90352
> 3 Aurora Kane IL 171782
> 4 Elgin Kane IL 94487
> 5 Gary Lake(IN) IN 102746
> 6 Joliet Kendall IL 106221
> # … with 13 more rows

It’s worth noting that in the previous examples we seamlessly commingled tibbles and data frames. suburbs is a tibble because we used the tidy function read_csv, which produces tibbles, while newRow was created using data.frame, which returns a traditional R data frame. And note that the data frames contain factors while the tibbles do not:

str(suburbs) # a tibble

> Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 17 obs. of
> 4 variables:
> $ city : chr “Chicago” “Kenosha” “Aurora” “Elgin” …
> $ county: chr “Cook” “Kenosha” “Kane” “Kane” …
> $ state : chr “IL” “WI” “IL” “IL” …
> $ pop : num 2853114 90352 171782 94487 102746 …
> - attr(*, “spec”)=
> .. cols(
> .. city = col_character(),
> .. county = col_character(),
> .. state = col_character(),
> .. pop = col_double()
> .. )

str(newRow) # a data.frame

> 'data.frame': 1 obs. of 4 variables:
> $ city : Factor w/ 1 level “West Dundee”: 1
> $ county: Factor w/ 1 level “Kane”: 1
> $ state : Factor w/ 1 level “IL”: 1
> $ pop : num 7352

When the inputs to rbind are a mix of data.frame objects and tibble objects, the result will have the same type as the first argument of rbind. So this would produce a tibble:

rbind(some_tibble, some_data.frame) while this would produce a data frame:

rbind(some_data.frame, some_tibble) 5.21 Selecting Data Frame Columns by Position Problem You want to select columns from a data frame according to their position.

Solution Use the select function:

df %>% select(n1, n2, …, nk) where df is a data frame and n1, n2, …, nk are integers with values between 1 and the number of columns.

Discussion Let’s use the first three rows of the dataset of population data for the 16 largest cities in the Chicago metropolitan area:

suburbs ← read_csv(“data/suburbs.txt”) %>% head(3)

> Parsed with column specification:
> cols(
> city = col_character(),
> county = col_character(),
> state = col_character(),
> pop = col_double()
> )

suburbs

> # A tibble: 3 x 4
> city county state pop
> <chr> <chr> <chr> <dbl>
> 1 Chicago Cook IL 2853114
> 2 Kenosha Kenosha WI 90352
> 3 Aurora Kane IL 171782

Right off the bat, we can see this is a tibble. This will extract the first column (and only the first column):

suburbs %>%

 dplyr::select(1)

> # A tibble: 3 x 1
> city
> <chr>
> 1 Chicago
> 2 Kenosha
> 3 Aurora

These will extract multiple columns:

suburbs %>%

 dplyr::select(1, 3, 4)

> # A tibble: 3 x 3
> city state pop
> <chr> <chr> <dbl>
> 1 Chicago IL 2853114
> 2 Kenosha WI 90352
> 3 Aurora IL 171782

suburbs %>%

 dplyr::select(2:4)

> # A tibble: 3 x 3
> county state pop
> <chr> <chr> <dbl>
> 1 Cook IL 2853114
> 2 Kenosha WI 90352
> 3 Kane IL 171782

List expressions The select verb is part of the tidyverse package dplyr. Base R also has its own rich functionality for selecting columns, at the cost of some additional syntax. The choices can be confusing until you understand the logic behind the alternatives.

One alternative uses list expressions. This might seem odd until you recall that a data frame is a list of columns. The list expression selects columns from that list. As you read this explanation, notice how the change in syntax—double brackets versus single brackets—changes the meaning of the expression.

We can select exactly one column by using double brackets (and):

dfn Returns a vector—specifically, the vector in the nth column of df

We can select one or more columns by using single brackets ([ and ]).

df[n] Returns a data frame consisting solely of the nth column of df

df[c(n1, n2, …, nk)] Returns a data frame built from the columns in positions n1, n2, …, nk of df

For example, we can use list notation to select the first column from suburbs, the city column:

suburbs1

> [1] “Chicago” “Kenosha” “Aurora”

That column is a character vector, so that’s what suburbs1 returns: a vector.

The result changes when we use the single-bracket notation, as in suburbs[1] or suburbs[c(1,3)]. We still get the requested columns, but R leaves them in a data frame. This example returns the first column as a one-column data frame:

suburbs[1]

> # A tibble: 3 x 1
> city
> <chr>
> 1 Chicago
> 2 Kenosha
> 3 Aurora

And this example returns the first and third columns as a data frame:

suburbs[c(1, 3)]

> # A tibble: 3 x 2
> city state
> <chr> <chr>
> 1 Chicago IL
> 2 Kenosha WI
> 3 Aurora IL

TIP The expression suburbs[1] is actually a shortened form of suburbs[c(1)]. We don’t need the c(…) wrapper because there is only one n.

A major source of confusion is that suburbs1 and suburbs[1] look similar but produce very different results:

suburbs1 Returns one column

suburbs[1] Returns a data frame that contains exactly one column

The point here is that “one column” is different from “a data frame that contains one column.” The first expression returns a vector. The second expression returns a data frame, which is a different data structure.

Matrix-style subscripting You can use matrix-style subscripting to select columns from a data frame:

df[, n] Returns a vector taken from the nth column (assuming that n contains exactly one value)

df[, c(n1, n2, …, nk)] Returns a data frame built from the columns in positions n1, n2, …, nk

An odd quirk can bite you here: you might get a column vector or you might get a data frame, depending upon how many subscripts you use and whether you are operating on a tibble or a data.frame. Tibbles will always return tibbles when you index. However, a data.frame may return a vector if you use one index.

In the simple case of one index on a data.frame you get a vector, like this:

suburbs is a tibble so we convert for this example

suburbs_df <← as.data.frame(suburbs) suburbs_df[, 1]

> [1] “Chicago” “Kenosha” “Aurora”

But using the same matrix-style syntax with multiple indexes returns a data frame:

suburbs_df[, c(1, 4)]

> city pop
> 1 Chicago 2853114
> 2 Kenosha 90352
> 3 Aurora 171782

This creates a problem. Suppose you see this expression in some old R code:

df[, vec] Quick, does that return a column or a data frame? Well, it depends. If vec contains one value, then you get a column; otherwise, you get a data frame. You cannot tell from the syntax alone.

To avoid this problem, you can include drop=FALSE in the subscripts, forcing R to return a data frame:

df[, vec, drop = FALSE] Now there is no ambiguity about the returned data structure. It’s a data frame.

When all is said and done, using matrix notation to select columns from data frames can be tricky. Use select when you can.

See Also See Recipe 5.17 for more about using drop=FALSE.

5.22 Selecting Data Frame Columns by Name Problem You want to select columns from a data frame according to their name.

Solution Use select and give it the column names.

df %>% select(name1, name2, …, namek) Discussion All columns in a data frame must have names. If you know the name, it’s usually more convenient and readable to select by name, not by position. Note that you don’t put the column names in quotes when using select.

The solutions described here are similar to those for Recipe 5.21, where we selected columns by position. The only difference is that here we use column names instead of column numbers. All the observations made in that recipe apply here.

List expressions The select verb is part of the tidyverse. Base R itself also has several rich methods for selecting columns by name, at the cost of some additional syntax.

To select a single column, use one of these list expressions. Note that they use double brackets (and):

df"name" Returns one column, the column called name

df$name Same as previous, just different syntax

To select one or more columns, use these list expressions. Note that they use single brackets ([ and ]):

df[“name”] Selects one column from a data frame

df[c(“name1”, “name2”, …, “namek”)] Selects several columns

Matrix-style subscripting Base R also allows matrix-style subscripting for selecting one or more columns from a data frame by name:

df[, “name”] Returns the named column

df[, c(“name1”, “name2”, …, “namek”)] Selects several columns in a data frame

The matrix-style subscripting can return either a column or a data frame, so be careful how many names you supply. See the comments in Recipe 5.21 for a discussion of this “gotcha” and using drop=FALSE.

See Also See Recipe 5.21 to select by position instead of name.

5.23 Changing the Names of Data Frame Columns Problem You want to change the names of a data frame’s columns.

Solution The rename function from the dplyr package makes renaming pretty easy:

df %>% rename(newname1 = oldname1, … , newnamen = oldnamen) where df is a data frame, oldnamei are names of columns in df, and newnamei are the desired new names.

Note that the argument order is newname = oldname.

Discussion The columns of data frames must have names. You can change them using rename:

df ← data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9) df %>% rename(tom = V1, dick = V2)

> tom dick V3
> 1 1 4 7
> 2 2 5 8
> 3 3 6 9

The column names are stored in an attribute called colnames, so another way to rename columns is to change that attribute:

colnames(df) ← c(“tom”, “dick”, “V2”) df

> tom dick V2
> 1 1 4 7
> 2 2 5 8
> 3 3 6 9

If you happen to be using select to select individual columns, you can rename those columns at the same time:

df ← data.frame(V1 = 1:3, V2 = 4:6, V3 = 7:9) df %>% select(tom = V1, V2)

> tom V2
> 1 1 4
> 2 2 5
> 3 3 6

The difference between renaming with select versus renaming with rename is that rename will rename what you specify, leaving all other columns intact and unchanged, whereas select keeps only the columns you select. In the preceding example, V3 is dropped because it’s not in the select statement. Both select and rename use the same argument order: newname = oldname.

Special considerations for matrices

“The matrix conversions detailed here assume that your matrix is homogeneous—that is, all elements have the same mode (e.g., all numeric or all character). A matrix can be heterogeneous, too, when the matrix is built from a list. If so, conversions become messy. For example, when you convert a mixed-mode matrix to a data frame, the data frame’s columns are actually lists (to accommodate the mixed data).” (RCook 2019)

See Also See Recipe 5.28 for converting atomic data types; see the introduction to this chapter for remarks on problematic conversions.

1 “A data frame can be built from a mixture of vectors, factors, and matrices. The columns of the matrices become columns in the data frame. The number of rows in each matrix must match the length of the vectors and factors. In other words, all elements of a data frame must have the same height.” (RCook 2019)

R: R Fundamentals, R Inventor - R Language Designer: Ross Ihaka and Robert Gentleman in August 1993; R Core Team, R Language Definition on R-Project.org, R reserved words (R keywords), R data structures - R algorithms, R syntax, R input and Output, R data transformations, R probability, R statistics, R linear regression (ANOVA), R time series analysis, R graphics, R markdown, R OOP, R on Linux, R on macOS, R on Windows, R installation, R containerization, R configuration, R compiler - R interpreter (R REPL), R IDEs (RStudio, Jupyter Notebook), R development tools, R DevOps - R SRE, R data science - R DataOps, R machine learning, R deep learning, Functional R, R concurrency, R history, R bibliography, R glossary, R topics, R courses, R Standard Library, R libraries, R packages (tidyverse package), R frameworks, RDocumentation.org / CRAN, R research, R GitHub, Written in R, R popularity, R Awesome list, R Versions, Python. (navbar_r)

Fair Use Sources

Fair Use Sources:

B07TDVNC15 (RCook 2019)

SYI LU SENG E MU CHYWE YE. NAN. WEI LA YE. WEI LA YE. SA WA HE.

Table of Contents

R Cookbook Chapter 5. Data Structures

R Vectors

R Lists

Special considerations for matrices

Fair Use Sources