Aggregate Data with Nested Groups
A Short General Primer on Loops
In my daily programming I barely use any base for loops. I used to rely on lapply
and friends a lot, then switched to the plyr family of apply functions and eventually
ended up using the purr::map_* family of functions.
There is a bit of overhead involved in mapand [lvs]apply can be faster / more efficient,
but I do like the verbosity of map and friends as well as the possibility to use
lambda functions via ~1.
Especially the predictability of the result as opposed to sapply for instance is a clear
plus. Also, if you want the apply function to return something else than a list the
map_* variants become rather handy.
Compare:
## using the R 4.1. anon function syntax
(res <- lapply(1:3, \(x) x + 1))
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
## is typesafe but will always return a list, thus this will throw an error
sum(res)
## Error in sum(res): invalid 'type' (list) of argument
## sapply can be a remedy
(res <- sapply(1:3, \(x) x + 1))
## [1] 2 3 4
sum(res)
## [1] 9
## however output format of sapply is not always the same
sapply(1:3, \(x) c(1, x)) ## a matrix
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 2 3
sapply(1:3, seq_len) ## a list
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 1 2 3
sapply(1:3, \(x) x + 1) ## a vector
## [1] 2 3 4
to:
library(purrr)
## map_* enforces you to return the right type
map_int(1:3, ~ .x + 1L)
## [1] 2 3 4
map(1:3, seq_len)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 1 2 3
map_dfr(1:3, ~ data.frame(x = .x))
## x
## 1 1
## 2 2
## 3 3
## type safety is taken really seriously:
## 1 is a numeric, thus . + 1 is a numeric and __not__ an int
map_int(1:3, ~ .x + 1)
## Error: Can't coerce element 1 from a double to a integer
To conclude: map adds verbosity and type safety and allows for tidyverse anonymous
functions. Enough reasons for me to completely rely on map_ and friends2.
Revisit Sunburst Graphs
In my post on how to color sunbursts graphs, I used a rather tedious and lengthy way of
bringing the data into the right format. We can use purrr::accumulate to generalize the
idea while saving a lot of copy-pasted code:
library(tidyr)
library(forcats)
library(dplyr)
titanic <- as_tibble(Titanic) %>%
mutate(across(where(is.character), fct_inorder))
prepare_sunburst_data <- function(data, ..., n = n(), total_name = "Total") {
n <- enquo(n)
grps <- enquos(...)
data <- data %>% group_by(!!!grps) # [1]
calc_margins <- function(.data) { # [3]
grps <- .data %>%
groups()
agg <- .data %>%
summarise(margin = sum(!!n), .groups = "drop")
if (length(grps)) {
agg <- agg %>%
unite("id", !!!grps, remove = FALSE) %>%
mutate(label = !!grps[[1L]] %>% as.character())
par_grps <- tail(grps, -1L)
if (length(par_grps)) {
agg <- agg %>%
unite("parent", !!!par_grps)
} else {
agg <- agg %>%
mutate(parent = total_name)
}
} else {
agg <- agg %>%
mutate(label = total_name, id = total_name, parent = "")
}
agg %>%
select(id, parent, label, margin)
}
accumulate(grps, ~ .x %>% ungroup(!!.y), .init = data) %>% # [2]
map_dfr(calc_margins)
}
(sunburst_data <- prepare_sunburst_data(titanic, Survived, Sex, Age, Class, n = n))
## # A tibble: 61 x 4
## id parent label margin
## <chr> <chr> <chr> <dbl>
## 1 No_Male_Child_1st Male_Child_1st No 0
## 2 No_Male_Child_2nd Male_Child_2nd No 0
## 3 No_Male_Child_3rd Male_Child_3rd No 35
## 4 No_Male_Child_Crew Male_Child_Crew No 0
## 5 No_Male_Adult_1st Male_Adult_1st No 118
## 6 No_Male_Adult_2nd Male_Adult_2nd No 154
## 7 No_Male_Adult_3rd Male_Adult_3rd No 387
## 8 No_Male_Adult_Crew Male_Adult_Crew No 670
## 9 No_Female_Child_1st Female_Child_1st No 0
## 10 No_Female_Child_2nd Female_Child_2nd No 0
## # ... with 51 more rows
How does the function work?
- We use
enquosto enquote all passed (via...) grouping variables and add these grouping variables to the data (# [1]). - Then, we use
accumulateto create a list of data frames where each element uses one grouping level less. That is in the above example, the first element groups according toSurvived,Sex,AgeandClass, the second one dropsSurvivedthe third one dropsSexand so on (# [2]). - Now, we can simply apply our
calc_margins(# [3]) function, which uses the given summary function and creates meaningful ids and labels and sets the parent reference accordingly.
And as we are better in juding visuals rather than tables, here is the sunburst graph
again to show that the data created by prepare_sunburst_data is indeed the same as the
approach by hand:
library(plotly)
sunburst_data %>%
plot_ly() %>%
add_trace(ids = ~ id,
labels = ~ label,
parents = ~ parent,
values = ~ margin,
type = "sunburst",
marker = list(line = list(color = "#FFF")),
branchvalues = "total") %>%
layout(paper_bgcolor = "#00000000")