https://DevOpsCloud.io -- Cloud Monk Losang Jinpa

R Cookbook Chapter 6. Data Transformations

Return to R Cookbook, R Bibliography, R DevOps, R Data Science, R Statistics, R Machine Learning, R Deep Learning, Data Science Bibliography, Statistics Bibliography

“ (RCook 2019)

Chapter 6. Data Transformations

“While traditional programming languages use loops, R has traditionally encouraged using R vectorized operations and the R apply family of functions to R crunch data in R batches, greatly streamlining the R calculations. There is nothing to prevent you from writing loops in R (R loops) that break your data into whatever chunks you want and then doing an R operation on each chunk. However, using R vectorized functions can, in many cases, increase the R speed, R readability, and R maintainability of your R code.” (RCook 2019)

In recent R history, though, the R tidyverse — specifically the R purrr and R dplyr R packages — has introduced new R idioms into R that make these R concepts easier to learn R and slightly more R consistent. The name purrr comes from a play on the phrase “Pure R.” A “R pure function” is a R function whose R result is determined only by its R inputs, and which does not produce any R side effects. This is not a R functional programming concept you need to understand in order to get great value from purrr, however. All most users need to know is that purrr contains R functions to help us R operate “chunk by chunk” on our R data in a way that meshes well with other R tidyverse packages such as dplyr.” (RCook 2019)

“Base R has many R apply functions — R apply, R lapply, R sapply, R tapply, and R mapply — as well as their cousins, R by and R split. These are solid R functions that have been workhorses in Base R for years. We struggled a bit with how much to focus on the Base R apply functions and how much to focus on the newer “tidy” approach. After much debate, we’ve chosen to try to illustrate the purrr approach and to acknowledge Base R approaches and, in a few places, to illustrate both. The R interface to purrr and dplyr is very R clean and, we believe, in most cases, more R intuitive.” (RCook 2019)

6.1 Applying a Function to Each List Element

R Problem: You have a R list, and you want to apply a R function to each R element of the list.

R Solution: Use R map to apply a R function to every R element of a list:

R library(tidyverse)

R lst %>%

 map(fun)

Discussion: Let’s look at a specific example of taking the R average of all the R numbers in each R element of a list:

library(tidyverse)

lst ← list(

 a = c(1,2,3),
 b = c(4,5,6)

) lst %>%

 map(mean)

> $a
> [1] 2
>
> $b
> [1] 5

The R map function will call your R function once for every element in your list. Your function should expect one R argument, an element from the list. The R map functions will collect the R returned values and return them in a list.

The purrr package contains a whole family of R map functions that take a R list or a R vector and then return an object with the same number of elements as the input. The type of object they return varies based on which map function is used. See the help file for map for a complete list, but a few of the most common are as follows:

R map: Always returns a list, and the elements of the list may be of different types. This is quite similar to the Base R function lapply.

R map_chr: Returns a R character vector.

R map_int: Returns an R integer vector.

R map_dbl: Returns a R floating-point numeric vector.

Let’s take a quick look at a contrived situation where we have a R function that could result in a R character or an R integer result:

fun ← function(x) {

 if (x > 1) {
   1
 } else {
   "Less Than 1"
 }

}

fun(5)

> [1] 1

fun(0.5)

> [1] “Less Than 1”

Let’s create a list of elements that we can map fun to and look at how some of the map variants behave:

lst ← list(.5, 1.5, .9, 2)

map(lst, fun)

> 1
> [1] “Less Than 1”
>
> 2
> [1] 1
>
> 3
> [1] “Less Than 1”
>
> 4
> [1] 1

You can see that map produced a list and it is of mixed data types.

map_chr will produce a character vector and coerce the numbers into characters:

map_chr(lst, fun)

> [1] “Less Than 1” “1.000000” “Less Than 1” “1.000000”

or using pipes

lst %>%

 map_chr(fun)

> [1] “Less Than 1” “1.000000” “Less Than 1” “1.000000”

while map_dbl will try to coerce a character string into a double and die trying:

map_dbl(lst, fun)

> Error: Can't coerce element 1 from a character to a double

As mentioned earlier, the Base R lapply function acts very much like map. The Base R sapply function is more like the other map functions we discussed previously, in that the function tries to simplify the results into a vector or matrix.

See Also See Recipe 15.3.

6.2 Applying a Function to Every Row of a Data Frame Problem You have a function and you want to apply it to every row in a data frame.

Solution The mutate function will create a new variable based on a vector of values. But if we are using a function that can’t take in a vector and output a vector, then we have to do a row-by-row operation using rowwise.

We can use rowwise in a pipe chain to tell dplyr to do all following commands row by row:

df %>%

 rowwise() %>%
 row_by_row_function()

Discussion Let’s create a function and apply it row by row to a data frame. Our function will simply calculate the sum of a sequence from a to b by c:

fun ← function(a, b, c) {

 sum(seq(a, b, c))

} Let’s create some data to apply this function to, then use rowwise to apply our function, fun, to it:

df ← data.frame(mn = c(1, 2, 3),

                mx = c(8, 13, 18),
                rng = c(1, 2, 3))

df %>%

 rowwise %>%
 mutate(output = fun(a = mn, b = mx, c = rng))

> Source: local data frame [3 x 4]
> Groups: <by row>
>
> # A tibble: 3 x 4
> mn mx rng output
> <dbl> <dbl> <dbl> <dbl>
> 1 1 8 1 36
> 2 2 13 2 42
> 3 3 18 3 63

Had we tried to run this function without rowwise, it would have thrown an error because the seq function cannot process an entire vector:

df %>%

 mutate(output = fun(a = mn, b = mx, c = rng))

> Error in seq.default(a, b, c): 'from' must be of length 1

6.3 Applying a Function to Every Row of a Matrix Problem You have a matrix. You want to apply a function to every row, calculating the function result for each row.

Solution Use the apply function. Set the second argument to 1 to indicate row-by-row application of the function:

results ← apply(mat, 1, fun) # mat is a matrix, fun is a function The apply function will call fun once for each row of the matrix, assemble the returned values into a vector, and then return that vector.

Discussion You may notice that we show only the use of the Base R apply function here, while other recipes illustrate purrr alternatives. As of this writing, matrix operations are out of scope for purrr, so we use the very solid Base R apply function. If you really like the purrr syntax, you can use those functions if you first convert your matrix to a data frame or tibble. But if your matrix is large, you will notice a meaningful runtime slowdown using purrr.

Suppose we have a matrix long containing longitudinal data, so each row has data for one subject and the columns contain the repeated observations over time:

long ← matrix(1:15, 3, 5) long

> [,1] [,2] [,3] [,4] [,5]
> [1,] 1 4 7 10 13
> [2,] 2 5 8 11 14
> [3,] 3 6 9 12 15

We could calculate the average observation for each subject by applying the mean function to each row. The result is a vector:

apply(long, 1, mean)

> [1] 7 8 9

If our matrix has row names, apply uses them to identify the elements of the resulting vector, which is handy:

rownames(long) ← c(“Moe”, “Larry”, “Curly”) apply(long, 1, mean)

> Moe Larry Curly
> 7 8 9

The function being called should expect one argument, a vector, which will be one row from the matrix. The function can return a scalar or a vector. In the vector case, apply assembles the results into a matrix. The range function returns a vector of two elements, the minimum and the maximum, so applying it to long produces a matrix:

apply(long, 1, range)

> Moe Larry Curly
> [1,] 1 2 3
> [2,] 13 14 15

You can employ this recipe on data frames as well. It works if the data frame is homogeneous—that is, either all numbers or all character strings. When the data frame has columns of different types, extracting vectors from the rows isn’t sensible because vectors must be homogeneous.

6.4 Applying a Function to Every Column Problem You have a matrix or data frame, and you want to apply a function to every column.

Solution For a matrix, use the apply function. Set the second argument to 2, which indicates column-by-column application of the function. So, if our matrix or data frame was named mat and we wanted to apply a function named fun to every column, it would look like this:

apply(mat, 2, fun) For a data frame, use the map_df function from purrr:

df2 ← map_df(df, fun) Discussion Let’s look at an example with real numbers and apply the mean function to every column of a matrix:

mat ← matrix(c(1, 3, 2, 5, 4, 6), 2, 3) colnames(mat) ← c(“t1”, “t2”, “t3”) mat

> t1 t2 t3
> [1,] 1 2 4
> [2,] 3 5 6

apply(mat, 2, mean) # Compute the mean of every column

> t1 t2 t3
> 2.0 3.5 5.0

In Base R, the apply function is intended for processing a matrix or data frame. The second argument of apply determines the direction:

1 means process row by row.

2 means process column by column.

This is more mnemonic than it looks. We speak of matrices in “rows and columns,” so rows are first and columns second: 1 and 2, respectively.

A data frame is a more complicated data structure than a matrix, so there are more options. You can simply use apply, in which case R will convert your data frame to a matrix and then apply your function. That will work if your data frame contains only one type of data but will probably not do what you want if some columns are numeric and some are character. In that case, R will force all columns to have identical types, likely performing an unwanted conversion as a result.

Fortunately, there are multiple alternatives. Recall that a data frame is a kind of list: it is a list of the columns of the data frame. purrr has a whole family of map functions that return different types of objects. Of particular interest here is map_df, which returns a data.frame (thus the df in the name):

df2 ← map_df(df, fun) # Returns a data.frame The function fun should expect one argument: a column from the data frame.

Here is a common recipe to check the types of columns in data frames. In this example, the batch column of this data frame, at a quick glance, seems to contain numbers:

load(“./data/batches.rdata”) head(batches)

> batch clinic dosage shrinkage
> 1 3 KY IL -0.307
> 2 3 IL IL -1.781
> 3 1 KY IL -0.172
> 4 3 KY IL 1.215
> 5 2 IL IL 1.895
> 6 2 NJ IL -0.430

But using map_df to print out the class of each column reveals the column batch to be a factor instead:

map_df(batches, class)

> # A tibble: 1 x 4
> batch clinic dosage shrinkage
> <chr> <chr> <chr> <chr>
> 1 factor factor factor numeric

NOTE Notice how the third line of the output says <chr> repeatedly. This is because the output of class is being put in a data frame and then printed. The intermediate data frame is all character fields. It’s the last row that tells us our original data frame has three factor columns and one numeric field.

See Also See Recipe 5.21, Recipe 6.1, and Recipe 6.3.

6.5 Applying a Function to Parallel Vectors or Lists Problem You have a function that takes multiple arguments. You want to apply the function element-wise to vectors and obtain a vector result. Unfortunately, the function is not vectorized; that is, it works on scalars but not on vectors.

Solution Use one of the map or pmap functions from the tidyverse core package purrr. The most general solution is to put your vectors in a list, then use pmap:

lst ← list(v1, v2, v3) pmap(lst, fun) pmap will take the elements of lst and pass them as the inputs to fun.

If you only have two vectors you are passing as inputs to your function, the map2 family of functions is convenient and saves you the step of putting your vectors in a list first. map2 will return a list:

map2(v1, v2, fun) while the typed variants (map2_chr, map2_dbl, etc.) return vectors of the type their name implies. So, if fun returns only a double, use the typed variant of map2 instead:

map2_dbl(v1, v2, fun) The typed variants in purrr functions refer to the output type expected from the function. All the typed variants return vectors of their respective type, while the untyped variants return lists, which allow mixing of types.

Discussion The basic operators of R, such as x + y, are vectorized; this means that they compute their result element by element and return a vector of results. Also, many R functions are vectorized.

Not all functions are vectorized, however, and those that are not typed work only on scalars. Using vector arguments produces errors at best and meaningless results at worst. In such cases, the map functions from purrr can effectively vectorize the function for you.

Consider the gcd function from Recipe 15.3, which takes two arguments:

gcd ← function(a, b) {

 if (b == 0) {
   return(a)
 } else {
   return(gcd(b, a %% b))
 }

} If we apply gcd to two vectors, the result is wrong answers and a pile of error messages:

gcd(c(1, 2, 3), c(9, 6, 3))

> Warning in if (b == 0) {: the condition has length > 1 and only the first
> element will be used

> Warning in if (b == 0) {: the condition has length > 1 and only the first
> element will be used

> Warning in if (b == 0) {: the condition has length > 1 and only the first
> element will be used
> [1] 1 2 0

The function is not vectorized, but we can use map to “vectorize” it. In this case, since we have two inputs we’re mapping over, we should use the map2 function. This gives the element-wise greatest common divisors (GCDs) between two vectors:

a ← c(1, 2, 3) b ← c(9, 6, 3) my_gcds ← map2(a, b, gcd) my_gcds

> 1
> [1] 1
>
> 2
> [1] 2
>
> 3
> [1] 3

Notice that map2 returns a list of lists. If we wanted the output in a vector, we could use unlist on the result:

unlist(my_gcds)

> [1] 1 2 3

or use one of the typed variants, such as map2_dbl.

The map family of purrr functions give you a series of variations that return specific types of output. The suffixes on the function names communicate the type of vector they will return. While map and map2 return lists, since the type-specific variants are returning objects guaranteed to be the same type, they can be put in atomic vectors. For example, we could use the map_chr function to ask R to coerce the results into character output or map2_dbl to ensure the results are doubles:

map2_chr(a, b, gcd)

> [1] “1.000000” “2.000000” “3.000000”

map2_dbl(a, b, gcd)

> [1] 1 2 3

If our data has more than two vectors, or the data is already in a list, we can use the pmap family of functions, which take a list as an input:

lst ← list(a,b) pmap(lst, gcd)

> 1
> [1] 1
>
> 2
> [1] 2
>
> 3
> [1] 3

Or if we want a typed vector as output:

lst ← list(a,b) pmap_dbl(lst, gcd)

> [1] 1 2 3

With the purrr functions, remember that the pmap family are parallel mappers that take in a list as inputs, while map2 functions take two, and only two, vectors as inputs.

See Also This is really just a special case of our very first recipe in this chapter, Recipe 6.1. See that recipe for more discussion of map variants. In addition, Jenny Bryan has a great collection of purrr tutorials on her GitHub site.

6.6 Applying a Function to Groups of Data Problem Your data elements occur in groups. You want to process the data by groups—for example, summing by group or averaging by group.

Solution The easiest way to do grouping is with the dplyr function group_by in conjunction with summarize. If our data frame is df and has a variable we want to group by named grouping_var, and we want to apply the function fun to all the combinations of v1 and v2, we can do that with group_by:

df %>%

 group_by(v1, v2) %>%
 summarize(
   result_var = fun(value_var)
 )

Discussion Let’s look at a specific example where our input data frame, df, contains a variable, my_group, which we want to group by, and a field named values which we would like to calculate some statistics on:

df ← tibble(

 my_group = c("A", "B","A", "B","A", "B"),
 values = 1:6

)

df %>%

 group_by(my_group) %>%
 summarize(
   avg_values = mean(values),
   tot_values = sum(values),
   count_values = n()
 )

> # A tibble: 2 x 4
> my_group avg_values tot_values count_values
> <chr> <dbl> <int> <int>
> 1 A 3 9 3
> 2 B 4 12 3

The output has one record per grouping along with calculated values for the three summary fields we defined.

WARNING If you are grouping by several variables, please be aware that summarize will change your grouping. Each grouping becomes a single row; at the same time, it also removes the last grouping variable. In other words, if you group your data by A, B, and C and then summarize it, the resulting data frame is grouped only by A and B. This is surprising but necessary. If summarize kept the C grouping, each “group” would contain exactly one row, which would be pointless.

6.7 Creating a New Column Based on Some Condition Problem You want to create a new column in a data frame based on some condition.

Solution Using the dplyr tidyverse package, we can create new data frame columns with mutate and then use case_when to implement conditional logic.

df %>%

 mutate(
   new_field = case_when(my_field == "something" ~ "result",
                   my_field != "something else" ~ "other result",
                   TRUE ~ "all other results")
 )

Discussion The case_when function from dplyr is analogous to CASE WHEN in SQL or nested IF statements in Excel. The function tests every element and, when it finds a condition that is true, returns the value on the righthand side of the ~ (tilde).

Let’s look at an example where we want to add a text field that describes a value. First let’s set up some simple example data in a data frame with one column named vals:

df ← data.frame(vals = 1:5) Now let’s implement logic that creates a field called new_vals. If vals is less than or equal to 2, we’ll return 2 or less; if the value is greater than 2 and less than or equal to 4, we’ll return 2 to 4, and otherwise we’ll return over 4:

df %>%

 mutate(new_vals = case_when(vals <= 2 ~ "2 or less",
                             vals > 2 & vals <= 4 ~ "2 to 4",
                             TRUE ~ "over 4"))

> vals new_vals
> 1 1 2 or less
> 2 2 2 or less
> 3 3 2 to 4
> 4 4 2 to 4
> 5 5 over 4

You can see in the example that the condition goes on the left of the ~, while the resulting return value goes on the right. Each condition is separated by commas. case_when will evaluate each condition sequentially and stop evaluating as soon as one of the criteria returns TRUE. Our last line is our “or else” statement. Setting the condition to TRUE ensures that, no matter what, this condition will be met if no condition above it has returned TRUE.

See Also See Recipe 6.2 for more examples of using mutate.

R: R Fundamentals, R Inventor - R Language Designer: Ross Ihaka and Robert Gentleman in August 1993; R Core Team, R Language Definition on R-Project.org, R reserved words (R keywords), R data structures - R algorithms, R syntax, R input and Output, R data transformations, R probability, R statistics, R linear regression (ANOVA), R time series analysis, R graphics, R markdown, R OOP, R on Linux, R on macOS, R on Windows, R installation, R containerization, R configuration, R compiler - R interpreter (R REPL), R IDEs (RStudio, Jupyter Notebook), R development tools, R DevOps - R SRE, R data science - R DataOps, R machine learning, R deep learning, Functional R, R concurrency, R history, R bibliography, R glossary, R topics, R courses, R Standard Library, R libraries, R packages (tidyverse package), R frameworks, RDocumentation.org / CRAN, R research, R GitHub, Written in R, R popularity, R Awesome list, R Versions, Python. (navbar_r)

Fair Use Sources

Fair Use Sources:

B07TDVNC15 (RCook 2019)

SYI LU SENG E MU CHYWE YE. NAN. WEI LA YE. WEI LA YE. SA WA HE.

Table of Contents

R Cookbook Chapter 6. Data Transformations

Fair Use Sources