In 2016, GAISE enunciated the importance of multivariate thinking and technology when teaching introductory statistics and data science courses. A big challenge is how to do this using R and RStudio without running into cognitive overload with our students.
The mosaic package was created by Randall Pruim, Danny Kaplan, and Nicholas Horton with the goal of introducing a Less Volume, More Creativity approach to introductory statistics that could simplify the use of technology. In their 2017 RJournal paper entitled “The mosaic Package: Helping Students to ‘Think with Data’ Using R.” they write:
The mosaic package provides a simplified and systematic introduction to the core functionality related to descriptive statistics, visualization, modeling, and simulation-based inference required in first and second courses in statistics. This introduction to the package describes some of the guiding principles behind the design of the package and provides illustrative examples of several of the most important functions it implements. These can be combined to help students “think with data” using R in their early course work, starting with simple, yet powerful, declarative commands.
The package builds on the formula object in R, which allows the specification of models in a compact symbolic form. As an example, consider the
drugrisk measure of drug related risk behaviors from the
mosaicData::HELPmiss Health Evaluation and Linkage to Primary Care (HELP) study.
library(mosaic) favstats(~ drugrisk, data = HELPmiss) ## min Q1 median Q3 max mean sd n missing ## 0 0 0 1 21 1.869658 4.313479 468 2
favstats() function takes a formula as an argument to display the summary statistics for the
df_stats() function provides a similar display (but returns a dataframe).
df_stats(~ drugrisk, data = HELPmiss) ## min Q1 median Q3 max mean sd n missing ## 1 0 0 0 1 21 1.869658 4.313479 468 2
As Chris Wild has noted, comparisons are more interesting than descriptions of a single group. So we can modify the command to calculate the summary values by homeless status (
homeless). Note that
favstats has taken advantage of the formula syntax in R which is given by
response variable ~ explanatory variable.
favstats(drugrisk ~ homeless, data = HELPmiss) ## homeless min Q1 median Q3 max mean sd n missing ## 1 housed 0 0 0 0.75 21 1.712000 3.948089 250 1 ## 2 homeless 0 0 0 1.00 21 2.050459 4.700449 218 1
If only the means are needed, they can be calculated instead.
mean(drugrisk ~ homeless, data = HELPmiss, na.rm = TRUE) ## housed homeless ## 1.712000 2.050459
The mosaic modeling language approach is attractive because the same syntax used above for the calculation of the mean can be used to fit a linear model (in this case, equivalent to an equal-variance t-test).
msummary(lm(drugrisk ~ homeless, data = HELPmiss)) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.7120 0.2729 6.274 8.07e-10 *** ## homelesshomeless 0.3385 0.3998 0.846 0.398 ## ## Residual standard error: 4.315 on 466 degrees of freedom ## (2 observations deleted due to missingness) ## Multiple R-squared: 0.001535, Adjusted R-squared: -0.0006073 ## F-statistic: 0.7165 on 1 and 466 DF, p-value: 0.3977
ggformula package (automatically loaded with the
mosaic package) can be used to create ggplot style side by side boxplots, again, using the same
response variable ~ explanatory variable syntax.
gf_boxplot(drugrisk ~ homeless, data = HELPmiss)
The key insight is the dramatic skew for the risk score: describing the distribution with means (as done with the
lm() call) may be misleading.
A more complicated example
What if we were also interested in a third variable? The formula interface accommodates multivariate thinking.
gf_point(pcs ~ drugrisk, data = HELPmiss, color = ~ homeless) %>% gf_lm()
lm(pcs ~ drugrisk + homeless, data = HELPmiss) ## ## Call: ## lm(formula = pcs ~ drugrisk + homeless, data = HELPmiss) ## ## Coefficients: ## (Intercept) drugrisk homelesshomeless ## 49.5654 -0.3484 -1.8132
Time spent learning the formula interface is worthwhile in the long-run as it is the basis of the
glm() functions. When students move into upper level courses, it’s not hard for them to move from the scaffolding of
ggformula to full-blown
Replicating the above analyses without mosaic
It’s certainly possible to undertake the same analysis using base R or using
ggplot2 for graphics. But it’s more complicated to have beginning students run commands like:
tapply(HELPmiss$drugrisk, HELPmiss$homeless, mean) ## housed homeless ## NA NA
tapply(HELPmiss$drugrisk, HELPmiss$homeless, mean, na.rm = TRUE) ## housed homeless ## 1.712000 2.050459
when they realize that there are missing values for the
summarize() can be used, but some people find the suite of
tidyverse functions too challenging for a first course in statistics:
HELPmiss %>% group_by(homeless) %>% summarize(meanval = mean(drugrisk, na.rm=TRUE)) ## # A tibble: 2 x 2 ## homeless meanval ## <fct> <dbl> ## 1 housed 1.71 ## 2 homeless 2.05
It’s also not a huge stretch to get students to use
ggplot2 to generate the boxplots comparing
ggplot(data = HELPmiss, aes(x = homeless, y = drugrisk)) + geom_boxplot()
However, all of the alternative approaches require new idioms and learning outcomes. They add complexity. The mosaic authors argue that teaching just one approach to modeling can help reduce the cognitive load of the course and allow the instructor to focus on getting students to explore data.
Different instructors will make different pedagogical choices depending on the courses and their students. For instructors hesitant to use modern tools, the
mosaicpackage may simplify the process of getting students started while minimizing cognitive load.
If R was rewritten from scratch, functions such as
mean()would probably support a formula interface as well as a
data=option. Unfortunately, the core R interface is (wisely) not being modified so additional packages such as
mosaicor the tidyverse are needed to provide a more coherent and consistent experience.
mosaicpackage masks certain functions from other packages when it loads (for example, the
mosaic::mean()function has augmented functionality related to formulas and the
find()function can be used to identify which function is referenced in the user’s environment.
find("mean") ##  "package:mosaic" "package:Matrix" "package:base"
- https://journal.r-project.org/archive/2017/RJ-2017-024/index.html (“The mosaic Package: Helping Students to ‘Think with Data’ Using R.” RJournal paper)
- https://cran.r-project.org/web/packages/mosaic/vignettes/mosaic-resources.html (mosaic teaching resources)
- https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf (minimal R commands)
- https://cran.r-project.org/web/packages/ggformula/vignettes/ggformula-blog.html (ggformula: another option for teaching graphics in R to beginners)
- https://nhorton.people.amherst.edu/is5/about.html (Intro Stats 5th edition in R with mosaic)
About this blog
Each day during the summer of 2019 we intend to add a new entry to this blog on a given topic of interest to educators teaching data science and statistics courses. Each entry is intended to provide a short overview of why it is interesting and how it can be applied to teaching. We anticipate that these introductory pieces can be digested daily in 20 or 30 minute chunks that will leave you in a position to decide whether to explore more or integrate the material into your own classes. By following along for the summer, we hope that you will develop a clearer sense for the fast moving landscape of data science. Sign up for emails at https://groups.google.com/forum/#!forum/teach-data-science (you must be logged into Google to sign up).
We always welcome comments on entries and suggestions for new ones.