Aggregate Data with Nested Groups
A Short General Primer on Loops
In my daily programming I barely use any base for
loops. I used to rely on lapply
and friends a lot, then switched to the plyr
family of apply functions and eventually
ended up using the purr::map_*
family of functions.
There is a bit of overhead involved in map
and [lvs]apply
can be faster / more efficient,
but I do like the verbosity of map
and friends as well as the possibility to use
lambda functions via ~
1.
Especially the predictability of the result as opposed to sapply
for instance is a clear
plus. Also, if you want the apply function to return something else than a list the
map_*
variants become rather handy.
Compare:
## using the R 4.1. anon function syntax
(res <- lapply(1:3, \(x) x + 1))
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
## is typesafe but will always return a list, thus this will throw an error
sum(res)
## Error in sum(res): invalid 'type' (list) of argument
## sapply can be a remedy
(res <- sapply(1:3, \(x) x + 1))
## [1] 2 3 4
sum(res)
## [1] 9
## however output format of sapply is not always the same
sapply(1:3, \(x) c(1, x)) ## a matrix
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 2 3
sapply(1:3, seq_len) ## a list
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 1 2 3
sapply(1:3, \(x) x + 1) ## a vector
## [1] 2 3 4
to:
library(purrr)
## map_* enforces you to return the right type
map_int(1:3, ~ .x + 1L)
## [1] 2 3 4
map(1:3, seq_len)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 1 2 3
map_dfr(1:3, ~ data.frame(x = .x))
## x
## 1 1
## 2 2
## 3 3
## type safety is taken really seriously:
## 1 is a numeric, thus . + 1 is a numeric and __not__ an int
map_int(1:3, ~ .x + 1)
## Error: Can't coerce element 1 from a double to a integer
To conclude: map
adds verbosity and type safety and allows for tidyverse
anonymous
functions. Enough reasons for me to completely rely on map_
and friends2.
Revisit Sunburst Graphs
In my post on how to color sunbursts graphs, I used a rather tedious and lengthy way of
bringing the data into the right format. We can use purrr::accumulate
to generalize the
idea while saving a lot of copy-pasted code:
library(tidyr)
library(forcats)
library(dplyr)
titanic <- as_tibble(Titanic) %>%
mutate(across(where(is.character), fct_inorder))
prepare_sunburst_data <- function(data, ..., n = n(), total_name = "Total") {
n <- enquo(n)
grps <- enquos(...)
data <- data %>% group_by(!!!grps) # [1]
calc_margins <- function(.data) { # [3]
grps <- .data %>%
groups()
agg <- .data %>%
summarise(margin = sum(!!n), .groups = "drop")
if (length(grps)) {
agg <- agg %>%
unite("id", !!!grps, remove = FALSE) %>%
mutate(label = !!grps[[1L]] %>% as.character())
par_grps <- tail(grps, -1L)
if (length(par_grps)) {
agg <- agg %>%
unite("parent", !!!par_grps)
} else {
agg <- agg %>%
mutate(parent = total_name)
}
} else {
agg <- agg %>%
mutate(label = total_name, id = total_name, parent = "")
}
agg %>%
select(id, parent, label, margin)
}
accumulate(grps, ~ .x %>% ungroup(!!.y), .init = data) %>% # [2]
map_dfr(calc_margins)
}
(sunburst_data <- prepare_sunburst_data(titanic, Survived, Sex, Age, Class, n = n))
## # A tibble: 61 x 4
## id parent label margin
## <chr> <chr> <chr> <dbl>
## 1 No_Male_Child_1st Male_Child_1st No 0
## 2 No_Male_Child_2nd Male_Child_2nd No 0
## 3 No_Male_Child_3rd Male_Child_3rd No 35
## 4 No_Male_Child_Crew Male_Child_Crew No 0
## 5 No_Male_Adult_1st Male_Adult_1st No 118
## 6 No_Male_Adult_2nd Male_Adult_2nd No 154
## 7 No_Male_Adult_3rd Male_Adult_3rd No 387
## 8 No_Male_Adult_Crew Male_Adult_Crew No 670
## 9 No_Female_Child_1st Female_Child_1st No 0
## 10 No_Female_Child_2nd Female_Child_2nd No 0
## # ... with 51 more rows
How does the function work?
- We use
enquos
to enquote all passed (via...
) grouping variables and add these grouping variables to the data (# [1]
). - Then, we use
accumulate
to create a list of data frames where each element uses one grouping level less. That is in the above example, the first element groups according toSurvived
,Sex
,Age
andClass
, the second one dropsSurvived
the third one dropsSex
and so on (# [2]
). - Now, we can simply apply our
calc_margins
(# [3]
) function, which uses the given summary function and creates meaningful ids and labels and sets the parent reference accordingly.
And as we are better in juding visuals rather than tables, here is the sunburst graph
again to show that the data created by prepare_sunburst_data
is indeed the same as the
approach by hand:
library(plotly)
sunburst_data %>%
plot_ly() %>%
add_trace(ids = ~ id,
labels = ~ label,
parents = ~ parent,
values = ~ margin,
type = "sunburst",
marker = list(line = list(color = "#FFF")),
branchvalues = "total") %>%
layout(paper_bgcolor = "#00000000")