To finish out the summer, we leave you with one last blog entry. The links below provide information about upcoming endeavors related to data science education. As we become aware of other projects, we are likely to add to the list. Feel free to check back to see what is new on the horizon. Thanks for all the great feedback that we’ve gotten over the summer. Here’s to many future discussions on data science education.

Read more →

We don’t know about you, but at the end of this project we find ourselves rejuvenated, empowered, and somewhat exhausted. In writing ten weeks of daily blog entries, we have learned a tremendous amount in terms of technical skills, pedagogical ideas to try in our fall courses, and ways to connect to the large and amazing data science community. One of the main goals we set for ourselves this summer was to create a roadmap for faculty development to “ease the learning curve and help busy people incorporate new tools and approaches into their teaching.”

Read more →

Previous blog entries have discussed cloud-based servers (RStudio Server and JupyterHub) and parallel/grid/cluster computing. Today we expand upon these ideas to discuss, at a high level, how data science students can leverage cloud-based tools to undertake their analyses in a flexible manner. Our discussion is motivated by several recent papers and blog posts that describe how complex, real-world data science computation can be structured in ways that would not have been feasible in past years without herculean efforts.

Read more →

As we near the end of our summer posts, we’ve started to think more broadly about statistics as well as data science courses. Today’s post considers a broad question relevant for many courses: how can we teach statistical thinking without having to resort to introducing a profusion of tests? Jonas Kristoffer Lindeløv proposed an elegant approach using the idea that common statistical tests are linear models.

Read more →

Today’s guest entry by Amelia McNamara (University of St. Thomas) describes a creative way that she tackled a problem in one of her upper-level courses. One note: The JSM is underway. Looking for interesting talks? Try Mine’s excellent Shiny for JSM 2019 app if you are in Denver. This past semester, I taught two sections of a course called Advanced Statistical Software (yes, I’m aware of the acronym. We’re changing the course title soon…).

Read more →

Reproducibility and Replicability: On May 7, 2019, the National Academies of Sciences, Engineering, and Medicine published the article “New report examines reproducibility and replicability in science.” The report recommends “ways that researchers, academic institutions, journals, and funders should help strengthen rigor and transparency in order to improve the reproducibility and replicability of scientific research.” Reproducibility is at the core of data acumen and needs to be stressed at all levels of the data science curriculum.

Read more →

Many of you are likely to have been following the recent news on facial recognition software, its use in the criminal justice system, and the systematic racial biases associated with it. You may also be aware of other algorithms that are systematically biased against a particular group of people. However, you may not have a plan for how to bring these ideas into the data science classroom.

Read more →

Eighteen years ago, Leo Breiman published an important paper entitled “Statistical Modeling: The Two Cultures” in Statistical Science. In today’s blog entry we discuss the implications of the paper for data science education. Breiman argued that there are two cultures: one that assumes a stochastic data model, and one that uses an algorithmic model (and treats the data mechanism as unknown). Breiman asserted that the statistics community had almost exclusively focused on the former, with interpretation of parameters at the core.

Read more →

Today’s guest entry by Kelly McConville (Reed College) describes the creation of data packages in R by instructors and students. Sharing data with students: The beginning of the data analysis cycle involves ingesting data. For novice students in the first week or two of an introductory course, this can be a tricky step. If I am new to R and I can’t even load the data, then my first impressions of R are not going to be too great.

Read more →

Last week’s entries, which focused on Python, included a description of the innovative and popular data8. Today we describe the follow-up course, data100 (Principles and Techniques of Data Science, http://www.ds100.org/), offered by the University of California, Berkeley Division of Data Sciences. Course Goals: The goals of data100 are listed on the data100 website and reproduced here. The goals are lofty indeed, but they also address an incredibly important shortcoming in many undergraduate curricula: a student who is successful in data100 will hit the ground running doing data science after graduation.

Read more →