# Less Volume, More Creativity in R

by Nicholas Horton

In 2016, the GAISE report stressed the importance of multivariate thinking and technology in teaching introductory statistics and data science courses. A big challenge is how to do this using R and RStudio without cognitively overloading our students.

The mosaic package was created by Randall Pruim, Danny Kaplan, and Nicholas Horton with the goal of introducing a Less Volume, More Creativity approach to introductory statistics that could simplify the use of technology. In their 2017 R Journal paper, “The mosaic Package: Helping Students to ‘Think with Data’ Using R,” they write:

The mosaic package provides a simplified and systematic introduction to the core functionality related to descriptive statistics, visualization, modeling, and simulation-based inference required in first and second courses in statistics. This introduction to the package describes some of the guiding principles behind the design of the package and provides illustrative examples of several of the most important functions it implements. These can be combined to help students “think with data” using R in their early course work, starting with simple, yet powerful, declarative commands.

The package builds on the formula object in R, which allows the specification of models in a compact symbolic form. As an example, consider drugrisk, a measure of drug-related risk behaviors, in the mosaicData::HELPmiss dataset from the Health Evaluation and Linkage to Primary Care (HELP) study.

library(mosaic)
favstats(~ drugrisk, data = HELPmiss)
##  min Q1 median Q3 max     mean       sd   n missing
##    0  0      0  1  21 1.869658 4.313479 468       2

The favstats() function takes a formula as an argument to display the summary statistics for the drugrisk variable.
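For readers new to formulas, here is a minimal base R sketch (no mosaic required) of what a formula object is. The variable names are only symbols at this point; nothing is looked up until a function evaluates the formula against a data frame.

```r
# A formula is an ordinary R object that stores an unevaluated expression.
one_sided <- ~ drugrisk             # no response: describe a single variable
two_sided <- drugrisk ~ homeless    # response ~ explanatory

class(two_sided)      # "formula"
all.vars(one_sided)   # variable names mentioned in the formula
all.vars(two_sided)
length(one_sided)     # 2 components: `~` and the right-hand side
length(two_sided)     # 3 components: `~`, left-hand side, right-hand side
```

Because the formula is just an object, the same one can be stored and handed to several functions.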

Alternatively, the df_stats() function provides a similar display (but returns a data frame).

df_stats(~ drugrisk, data = HELPmiss)
##   min Q1 median Q3 max     mean       sd   n missing
## 1   0  0      0  1  21 1.869658 4.313479 468       2

As Chris Wild has noted, comparisons are more interesting than descriptions of a single group. So we can modify the command to calculate the summary values by homeless status (housed or homeless). Note that favstats() takes advantage of R's formula syntax, which has the form response variable ~ explanatory variable.

favstats(drugrisk ~ homeless, data = HELPmiss)
##   homeless min Q1 median   Q3 max     mean       sd   n missing
## 1   housed   0  0      0 0.75  21 1.712000 3.948089 250       1
## 2 homeless   0  0      0 1.00  21 2.050459 4.700449 218       1

If only the means are needed, they can be calculated instead.

mean(drugrisk ~ homeless, data = HELPmiss, na.rm = TRUE)
##   housed homeless
## 1.712000 2.050459

The mosaic modeling language approach is attractive because the same syntax used above for the calculation of the mean can be used to fit a linear model (in this case, equivalent to an equal-variance t-test).

msummary(lm(drugrisk ~ homeless, data = HELPmiss))
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        1.7120     0.2729   6.274 8.07e-10 ***
## homelesshomeless   0.3385     0.3998   0.846    0.398
##
## Residual standard error: 4.315 on 466 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.001535,   Adjusted R-squared:  -0.0006073
## F-statistic: 0.7165 on 1 and 466 DF,  p-value: 0.3977
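The equivalence between a two-group linear model and the equal-variance t-test can be checked directly. The sketch below is an illustration only: it uses the built-in mtcars data as a stand-in (an assumption, chosen so the code runs without extra packages) rather than HELPmiss, and the object names are hypothetical.

```r
# Stand-in data: built-in mtcars, comparing mpg across transmission type (am).
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

fit <- lm(mpg ~ am, data = cars)
tt  <- t.test(mpg ~ am, data = cars, var.equal = TRUE)

# The slope's t statistic from lm() and the pooled two-sample t statistic
# agree up to sign (the two functions subtract the groups in opposite order).
lm_t <- coef(summary(fit))["ammanual", "t value"]
c(lm = lm_t, t.test = unname(tt$statistic))

# The p-values agree as well.
lm_p <- coef(summary(fit))["ammanual", "Pr(>|t|)"]
c(lm = lm_p, t.test = tt$p.value)
```

The same comparison should carry over to drugrisk ~ homeless when the HELPmiss data are available.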

Finally, the ggformula package (automatically loaded with the mosaic package) can be used to create ggplot2-style side-by-side boxplots, again using the same response variable ~ explanatory variable syntax.

gf_boxplot(drugrisk ~ homeless, data = HELPmiss)

The key insight is the dramatic skew of the risk score: describing the distribution with means (as done with the lm() call) may be misleading.

### A more complicated example

What if we were also interested in a third variable? The formula interface accommodates multivariate thinking.

gf_point(pcs ~ drugrisk, data = HELPmiss, color = ~ homeless) %>%
  gf_lm()

lm(pcs ~ drugrisk + homeless, data = HELPmiss)
##
## Call:
## lm(formula = pcs ~ drugrisk + homeless, data = HELPmiss)
##
## Coefficients:
##      (Intercept)          drugrisk  homelesshomeless
##          49.5654           -0.3484           -1.8132

Time spent learning the formula interface is worthwhile in the long run, as it is the basis of the lm() and glm() functions. When students move into upper-level courses, it’s not hard for them to move from the scaffolding of ggformula to full-blown ggplot2.
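As a small illustration of that carry-over, the response ~ explanatory pattern moves unchanged from lm() to glm(). This sketch uses the built-in mtcars data as a stand-in (an assumption, so that it runs in a plain R session); the model choices are illustrative rather than substantive.

```r
# Same formula shape, different modeling function.
linear_fit   <- lm(mpg ~ wt, data = mtcars)
logistic_fit <- glm(am ~ wt, data = mtcars, family = binomial)

coef(linear_fit)
coef(logistic_fit)

# The formula stored in each fitted model can be recovered and inspected.
formula(linear_fit)
formula(logistic_fit)
```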

### Replicating the above analyses without mosaic

It’s certainly possible to undertake the same analysis using base R or using ggplot2 for graphics. But it’s more complicated to have beginning students run commands like:

tapply(HELPmiss$drugrisk, HELPmiss$homeless, mean)
##   housed homeless
##       NA       NA

or

tapply(HELPmiss$drugrisk, HELPmiss$homeless, mean, na.rm = TRUE)
##   housed homeless
## 1.712000 2.050459

when they realize that there are missing values for the drugrisk variable.
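For comparison, base R's aggregate() also accepts a formula, and its formula method drops rows with missing values by default (na.action = na.omit). The sketch below uses the built-in airquality data as a stand-in (an assumption, chosen because Ozone contains missing values) to show that it matches tapply() with na.rm = TRUE.

```r
# Stand-in data: built-in airquality, where Ozone has missing values.
by_tapply  <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
by_formula <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)

by_tapply
by_formula

# Both approaches give the same group means.
all.equal(as.numeric(by_tapply), by_formula$Ozone)
```

Students still have to learn a separate function with its own argument conventions, which is the added complexity at issue here.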

The dplyr functions group_by() and summarize() can be used, but some people find the suite of tidyverse functions too challenging for a first course in statistics:

HELPmiss %>%
  group_by(homeless) %>%
  summarize(meanval = mean(drugrisk, na.rm = TRUE))
## summarise() ungrouping output (override with .groups argument)
## # A tibble: 2 x 2
##   homeless meanval
##   <fct>      <dbl>
## 1 housed      1.71
## 2 homeless    2.05

It’s also not a huge stretch to get students to use ggplot2 to generate the boxplots comparing drugrisk by homeless status.

ggplot(data = HELPmiss, aes(x = homeless, y = drugrisk)) + geom_boxplot() 

However, all of the alternative approaches require new idioms and learning outcomes. They add complexity. The mosaic authors argue that teaching just one approach to modeling can help reduce the cognitive load of the course and allow the instructor to focus on getting students to explore data.

### Notes

• Different instructors will make different pedagogical choices depending on their courses and students. For instructors hesitant to use modern tools, the mosaic package may simplify the process of getting students started while minimizing cognitive load.

• If R were rewritten from scratch, functions such as mean() would probably support a formula interface as well as a data= option. Unfortunately, the core R interface is (wisely) not being modified, so additional packages such as mosaic or the tidyverse are needed to provide a more coherent and consistent experience.

• The mosaic package masks certain functions from other packages when it loads (for example, the mosaic::mean() function has augmented functionality related to formulas and the data= option). The find() function can be used to identify which function is referenced in the user’s environment.

find("mean")
## [1] "package:mosaic" "package:Matrix" "package:base"
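Related to the last note on masking, the following base R sketch shows two ways to be explicit about which mean() is in use. It runs in a plain session without mosaic; with mosaic attached, the unqualified lookups would be expected to report mosaic first.

```r
x <- c(1, 2, NA)

# The :: operator names the package explicitly, so masking does not matter.
base::mean(x, na.rm = TRUE)

# Which package does an unqualified call to mean() resolve to right now?
environmentName(environment(mean))

# All attached packages that provide an object called "mean", in search order.
find("mean")
```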