Practical Data Science: an introduction to the PeerJ collection

In 2017, Jenny Bryan and Hadley Wickham published the “Practical Data Science for Stats” PeerJ collection. (The papers were also published in a special issue of The American Statistician.)

The “Practical Data Science for Stats” collection contains a series of short papers focused on the practical side of data science workflows and statistical analysis.

There are many aspects of day-to-day data analytical work that are almost absent from the conventional statistics literature and curriculum. And yet these activities account for a considerable share of the time and effort of data analysts and applied statisticians.

The goal of the collection is to increase the visibility and adoption of modern data analytical workflows and facilitate the transfer of tools and frameworks between industry and academia, between software engineering and Stats/CS, and across different domains.

We think that this set of papers is an invaluable contribution to the pedagogy of data science, particularly for those whose work and training have primarily been in statistics.

Data organization in spreadsheets: Karl W Broman, Kara H Woo (viewed more than 90,000 times on the TAS website!)
Infrastructure and tools for teaching computing throughout the statistical curriculum: Mine Cetinkaya-Rundel, Colin W Rundel
Opinionated analysis development: Hilary Parker
Wrangling categorical data in R: Amelia McNamara, Nicholas J Horton
Lessons from between the white lines for isolated data scientists: Benjamin S Baumer
Teaching stats for data science: Daniel T Kaplan
Documenting and evaluating Data Science contributions in academic promotion in Departments of Statistics and Biostatistics: Lance A Waller
Modeling offensive player movement in professional basketball: Steven Wu, Luke Bornn
Excuse me, do you have a moment to talk about version control?: Jennifer Bryan
How to share data for collaboration: Shannon E Ellis, Jeffrey T Leek
The democratization of data science education: Sean Kross, Roger D Peng, Brian S Caffo, Ira Gooding, Jeffrey T Leek
Packaging data analytical work reproducibly using R (and friends): Ben Marwick, Carl Boettiger, Lincoln Mullen
Forecasting at Scale: Sean J Taylor, Benjamin Letham
Extending R with C++: A Brief Introduction to Rcpp: Dirk Eddelbuettel, James Joseph Balamuta
How R helps Airbnb make the most of its data: Ricardo Bion, Robert Chang, Jason Goodman
Declutter your R workflow with tidy tools: Zev Ross, Hadley Wickham, David Robinson

There are many ways to integrate the ideas into your data science classes.

One approach, which I have used with the COPSS book Past, Present, and Future of Statistical Science, is to have students pick a chapter and give a lightning talk (no more than three minutes and three slides) that describes why they picked the entry, one thing they learned, and one question that they still have. My rubric also includes having a “compelling opening line” and pushing their slides to the class GitHub repository by a given deadline. I could imagine doing the same type of activity using the PeerJ papers instead of the COPSS book.
Past blog entries have discussed incorporating some of these ideas directly into the classroom. For example, see the three GitHub entries: GitHub, GitHub in RStudio, and GitHub Classroom.
Stay tuned for future entries to the Teaching Data Science blog which are informed by the excellent series of PeerJ articles and ideas!

The papers in the PeerJ collection may also make great additions to your summer reading list.

Learn more

https://peerj.com/collections/50-practicaldatascistats (PeerJ collection)
https://magazine.amstat.org/blog/2017/10/01/peerj_collection (Amstat News)
https://www.tandfonline.com/toc/utas20/72/1 (TAS)

About this blog

Each day during the summer of 2019 we intend to add a new entry to this blog on a given topic of interest to educators teaching data science and statistics courses. Each entry is intended to provide a short overview of why the topic is interesting and how it can be applied to teaching. We anticipate that these introductory pieces can be digested daily in 20- or 30-minute chunks that will leave you in a position to decide whether to explore more or integrate the material into your own classes. By following along for the summer, we hope that you will develop a clearer sense of the fast-moving landscape of data science. Sign up for emails at https://groups.google.com/forum/#!forum/teach-data-science (you must be logged into Google to sign up).

We always welcome comments on entries and suggestions for new ones.