In 2016, the GAISE (Guidelines for Assessment and Instruction in Statistics Education) College Report emphasized the importance of *multivariate thinking* and *technology* in teaching introductory statistics and data science courses. A big challenge is how to do this using R and RStudio without cognitively overloading our students.

The mosaic package was created by Randall Pruim, Danny Kaplan, and Nicholas Horton with the goal of introducing a *Less Volume, More Creativity* approach to introductory statistics that could simplify the use of technology. In their 2017 R Journal paper, “The mosaic Package: Helping Students to ‘Think with Data’ Using R”, they write:

> The mosaic package provides a simplified and systematic introduction to the core functionality related to descriptive statistics, visualization, modeling, and simulation-based inference required in first and second courses in statistics. This introduction to the package describes some of the guiding principles behind the design of the package and provides illustrative examples of several of the most important functions it implements. These can be combined to help students “think with data” using R in their early course work, starting with simple, yet powerful, declarative commands.

The package builds on the formula object in R, which allows the specification of models in a compact symbolic form. As an example, consider the `drugrisk` measure of drug-related risk behaviors in the `mosaicData::HELPmiss` dataset from the Health Evaluation and Linkage to Primary Care (HELP) study.

```
library(mosaic)
favstats(~ drugrisk, data = HELPmiss)
## min Q1 median Q3 max mean sd n missing
## 0 0 0 1 21 1.869658 4.313479 468 2
```

The `favstats()` function takes a formula as an argument to display the summary statistics for the `drugrisk` variable.

Alternatively, the `df_stats()` function provides a similar display (but returns a data frame).

```
df_stats(~ drugrisk, data = HELPmiss)
## min Q1 median Q3 max mean sd n missing
## 1 0 0 0 1 21 1.869658 4.313479 468 2
```

As Chris Wild has noted, comparisons are more interesting than descriptions of a single group. So we can modify the command to calculate the summary values by homeless status (`housed` or `homeless`). Note that `favstats()` takes advantage of the formula syntax in R, which has the form `response variable ~ explanatory variable`.
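To make the formula idea concrete, here is a minimal base R sketch (no packages needed) showing that a formula is an ordinary object that stores variable names without evaluating them. The object names `two_sided` and `one_sided` are ours, chosen for illustration.

```r
# A formula is an unevaluated R object: the variables need not exist yet.
two_sided <- drugrisk ~ homeless   # response ~ explanatory
one_sided <- ~ drugrisk            # no response, as in favstats(~ drugrisk, ...)

class(two_sided)      # "formula"
all.vars(two_sided)   # "drugrisk" "homeless"
two_sided[[2]]        # left-hand side: drugrisk
two_sided[[3]]        # right-hand side: homeless
length(one_sided)     # 2 (a one-sided formula has no left-hand side)
```

Functions such as `favstats()` and `lm()` decide what to do with these names only when they are called with a `data =` argument.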

```
favstats(drugrisk ~ homeless, data = HELPmiss)
## homeless min Q1 median Q3 max mean sd n missing
## 1 housed 0 0 0 0.75 21 1.712000 3.948089 250 1
## 2 homeless 0 0 0 1.00 21 2.050459 4.700449 218 1
```

If only the group means are needed, `mean()` accepts the same formula (with `na.rm = TRUE` to skip the missing values).

```
mean(drugrisk ~ homeless, data = HELPmiss, na.rm = TRUE)
## housed homeless
## 1.712000 2.050459
```

The mosaic modeling language approach is attractive because the same syntax used above for the calculation of the mean can be used to fit a linear model (in this case, equivalent to an equal-variance t-test).

```
msummary(lm(drugrisk ~ homeless, data = HELPmiss))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7120 0.2729 6.274 8.07e-10 ***
## homelesshomeless 0.3385 0.3998 0.846 0.398
##
## Residual standard error: 4.315 on 466 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.001535, Adjusted R-squared: -0.0006073
## F-statistic: 0.7165 on 1 and 466 DF, p-value: 0.3977
```
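The equivalence with an equal-variance t-test can be checked directly. This is a sketch, assuming the `mosaicData` package is installed (it supplies `HELPmiss`); `t.test()` with `var.equal = TRUE` is base R's pooled-variance two-sample test, and the object names `fit`, `tt`, `slope_t`, and `slope_p` are ours.

```r
library(mosaicData)  # assumed installed; supplies HELPmiss

# Linear model with a two-level explanatory variable
fit <- lm(drugrisk ~ homeless, data = HELPmiss)
slope_t <- coef(summary(fit))["homelesshomeless", "t value"]
slope_p <- coef(summary(fit))["homelesshomeless", "Pr(>|t|)"]

# Pooled-variance two-sample t-test using the same formula
tt <- t.test(drugrisk ~ homeless, data = HELPmiss, var.equal = TRUE)

# t.test() reports housed minus homeless, so the sign is flipped
# relative to the lm() slope; compare absolute values.
all.equal(abs(slope_t), abs(unname(tt$statistic)))
all.equal(slope_p, tt$p.value)
```

Both comparisons should return `TRUE`, since the two procedures use the same pooled standard error and the same 466 degrees of freedom.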

Finally, the `ggformula` package (automatically loaded with the `mosaic` package) can be used to create ggplot2-style side-by-side boxplots, again using the same `response variable ~ explanatory variable` syntax.

`gf_boxplot(drugrisk ~ homeless, data = HELPmiss)`

The key insight is the dramatic skew for the risk score: describing the distribution with means (as done with the `lm()` call) may be misleading.
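Because of that skew, a resistant summary can be a useful companion to the mean. A small sketch, assuming `mosaic` is installed and loaded as above: mosaic's `median()` accepts the same formula as `mean()`.

```r
library(mosaic)

# Same formula as before, with a summary that is resistant to skew.
median(drugrisk ~ homeless, data = HELPmiss, na.rm = TRUE)
# Both group medians are 0 (matching the median column of the
# favstats() table), whereas the group means were 1.71 and 2.05.
```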

### A more complicated example

What if we were also interested in a third variable? The formula interface accommodates *multivariate thinking*.

```
gf_point(pcs ~ drugrisk, data = HELPmiss, color = ~ homeless) %>%
  gf_lm()
```

```
lm(pcs ~ drugrisk + homeless, data = HELPmiss)
##
## Call:
## lm(formula = pcs ~ drugrisk + homeless, data = HELPmiss)
##
## Coefficients:
## (Intercept) drugrisk homelesshomeless
## 49.5654 -0.3484 -1.8132
```
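To read the fitted coefficients, it can help to compute one prediction by hand. The sketch below simply copies the rounded coefficients printed above; the subject (a participant with a `drugrisk` score of 5) is hypothetical and chosen only for illustration.

```r
# Rounded coefficients copied from the lm() output above.
b <- c(intercept = 49.5654, drugrisk = -0.3484, homeless = -1.8132)

# Hypothetical subject: drugrisk = 5, homeless (indicator = 1).
pred_homeless <- b[["intercept"]] + b[["drugrisk"]] * 5 + b[["homeless"]] * 1
# Same drugrisk score, but housed (indicator = 0).
pred_housed <- b[["intercept"]] + b[["drugrisk"]] * 5

pred_homeless                 # 46.0102
pred_housed                   # 47.8234
pred_housed - pred_homeless   # 1.8132, the same gap at every drugrisk value
```

In this additive model the two groups share a single slope for `drugrisk`, so the fitted lines are parallel and differ only by the `homelesshomeless` coefficient.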

Time spent learning the formula interface is worthwhile in the long run, as it is the basis of the `lm()` and `glm()` functions. When students move into upper-level courses, it’s not hard for them to move from the scaffolding of `ggformula` to full-blown `ggplot2`.

### Replicating the above analyses without mosaic

It’s certainly possible to undertake the same analysis using base R or using `ggplot2` for graphics. But it’s more complicated to have beginning students run commands like:

```
tapply(HELPmiss$drugrisk, HELPmiss$homeless, mean)
## housed homeless
## NA NA
```

or

```
tapply(HELPmiss$drugrisk, HELPmiss$homeless, mean, na.rm = TRUE)
## housed homeless
## 1.712000 2.050459
```

when they realize that there are missing values for the `drugrisk` variable.
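Base R does offer one formula-based route to the same group means: `aggregate()`. A sketch, assuming `mosaicData` is installed so that `HELPmiss` is available; with a formula, `aggregate()` drops rows with missing values by default (`na.action = na.omit`), so no `na.rm` argument is needed. The name `group_means` is ours.

```r
library(mosaicData)  # assumed installed; supplies HELPmiss

# Group means via base R's formula method for aggregate().
# Rows with a missing drugrisk are dropped by the default na.action.
group_means <- aggregate(drugrisk ~ homeless, data = HELPmiss, FUN = mean)
group_means
# The means should match the tapply(..., na.rm = TRUE) result:
# housed 1.712, homeless 2.050459
```

The result is a data frame rather than a named vector, which is one more idiom for a beginner to absorb.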

The `dplyr` functions `group_by()` and `summarize()` can be used, but some people find the suite of `tidyverse` functions too challenging for a first course in statistics:

```
HELPmiss %>%
  group_by(homeless) %>%
  summarize(meanval = mean(drugrisk, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## homeless meanval
## <fct> <dbl>
## 1 housed 1.71
## 2 homeless 2.05
```

It’s also not a huge stretch to get students to use `ggplot2` to generate the boxplots comparing `homeless` and `drugrisk`.

`ggplot(data = HELPmiss, aes(x = homeless, y = drugrisk)) + geom_boxplot()`

However, all of the alternative approaches require new idioms and learning outcomes. They add complexity. The mosaic authors argue that teaching just *one* approach to modeling can help reduce the cognitive load of the course and allow the instructor to focus on getting students to explore data.

### Notes

- Different instructors will make different pedagogical choices depending on their courses and their students. For instructors hesitant to use modern tools, the `mosaic` package may simplify the process of getting students started while minimizing cognitive load.
- If R were rewritten from scratch, functions such as `mean()` would probably support a formula interface as well as a `data=` option. Unfortunately, the core R interface is (wisely) not being modified, so additional packages such as `mosaic` or the tidyverse are needed to provide a more coherent and consistent experience.
- The `mosaic` package *masks* certain functions from other packages when it loads (for example, the `mosaic::mean()` function has augmented functionality related to formulas and the `data=` option). The `find()` function can be used to identify which function is referenced in the user’s environment.

```
find("mean")
## [1] "package:mosaic" "package:Matrix" "package:base"
```
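When masking matters, the `::` operator makes the choice of function explicit. A brief sketch, assuming `mosaic` (and with it `mosaicData`) is installed; the names `base_mean` and `mosaic_mean` are ours.

```r
library(mosaic)

# Explicitly call the base version on a plain vector...
base_mean <- base::mean(HELPmiss$drugrisk, na.rm = TRUE)

# ...or the mosaic version, which also understands formulas and data =.
mosaic_mean <- mosaic::mean(~ drugrisk, data = HELPmiss, na.rm = TRUE)

# Both compute the same number (the mean of 1.869658 reported by favstats()).
all.equal(unname(base_mean), unname(mosaic_mean))
```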

### Learn more

- https://journal.r-project.org/archive/2017/RJ-2017-024/index.html (“The mosaic Package: Helping Students to ‘Think with Data’ Using R”, R Journal paper)
- https://cran.r-project.org/web/packages/mosaic/vignettes/mosaic-resources.html (mosaic teaching resources)
- https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf (minimal R commands)
- https://cran.r-project.org/web/packages/ggformula/vignettes/ggformula-blog.html (ggformula: another option for teaching graphics in R to beginners)
- https://nhorton.people.amherst.edu/is5/about.html (Intro Stats 5th edition in R with mosaic)

### About this blog

Each day during the summer of 2019 we intend to add a new entry to this blog on a given topic of interest to educators teaching data science and statistics courses. Each entry is intended to provide a short overview of why the topic is interesting and how it can be applied to teaching. We anticipate that these introductory pieces can be digested daily in 20- or 30-minute chunks that will leave you in a position to decide whether to explore more or integrate the material into your own classes. By following along for the summer, we hope that you will develop a clearer sense of the fast-moving landscape of data science. Sign up for emails at https://groups.google.com/forum/#!forum/teach-data-science (you must be logged into Google to sign up).

We always welcome comments on entries and suggestions for new ones.