Data8: The Foundations of Data Science at Berkeley

· by Nicholas Horton · Read in about 4 min · (797 words) ·

As part of our week of Python, we wanted to focus on innovative pedagogical approaches that have been used to scale outreach efforts. A great example is the http://Data8.org (Foundations of Data Science) course that has been offered by the University of California/Berkeley Division of Data Sciences.

The course combines three perspectives: inferential thinking, computational thinking, and real-world relevance. Students are asked to use real data to understand relationships and patterns while teaching critical concepts and skills in computer programming and statistical inference.

The course objectives of Data 8 are wide-ranging. Upon completion of the class, students should be able to:

  • Write correct small programs that manipulate and combine data sets and carry out iterative procedures.
  • Extend a program with multiple functions so that it runs correctly with additional functionality.
  • Calculate specified statistics of a given dataset.
  • Identify the sources of randomness in an experiment.
  • Formulate a null hypothesis that relates to a given question, which can be assessed using a statistical test.
  • Carry out statistical analyses including computing confidence intervals and performing hypothesis tests in a variety of data settings.
  • Given the result of a statistical analysis from the course, form correct conclusions about a question.
  • Given a question and an analysis, explain whether the analysis addresses the question and how the analysis could change and still address the question.
  • Articulate the benefits and limits of computing technology for analyzing data and answering questions.
  • Correctly generate and interpret histograms, bar charts, and box plots.
  • Correctly make predictions using regression and classification techniques.
  • Assess the accuracy and variability of a prediction.

The creators have developed a simplified approach to teaching Python with the goal of reducing the cognitive load of standard packages such as pandas (see a parallel in R in our “Less Volume, More Creativity” entry). They write:

The course uses a module for table manipulation, charts, and maps that provides an interface appropriate for an introductory course. The Table class is similar to a DataFrame in Pandas, but explicitly does not support row indexes, hierarchical indexes, time series data, missing values, slicing, and many other advanced features that can complicate table manipulation for novices. The charting features use Matplotlib, but customize the output to match the pedagogical goals of the course. The mapping features are implemented by Folium, but aim to simplify working with tables and geojson files. While the datascience module can certainly be used outside the context of the course, it was specifically designed to support the Data 8 curriculum, while setting up students to transition to more standard tools such as Pandas.

An example of the code to bootstrap a regression coefficient can be found here. The textbook, available as a website, includes the other examples from the class.

Other resources

  • There’s lots to explore here and we encourage you to do so. A recent workshop on Data Science Education led by Ani Adhikari and David Wagner took place at Berkeley in June with scads of lecture notes and pedagogical materials.
  • Perusing the textbook, assignments, and lecture materials (all available for free under a Creative Commons license) is likely to yield useful insights, even if an instructor doesn’t use Python. See here for final exam review materials and related resources.
  • Don’t have a local Python 3 installation? The Berkeley folks have worked with EdX to develop a set of three courses that cover the Data8 material and provide access to cloud-based JupyterHub servers (see https://www.edx.org/course/foundations-of-data-science-computational-thinking-with-python-0).

Other courses

There are other innovative approaches that bear mentioning.

About this blog

Each day during the summer of 2019 we intend to add a new entry to this blog on a given topic of interest to educators teaching data science and statistics courses. Each entry is intended to provide a short overview of why it is interesting and how it can be applied to teaching. We anticipate that these introductory pieces can be digested daily in 20 or 30 minute chunks that will leave you in a position to decide whether to explore more or integrate the material into your own classes. By following along for the summer, we hope that you will develop a clearer sense for the fast moving landscape of data science. Sign up for emails at https://groups.google.com/forum/#!forum/teach-data-science (you must be logged into Google to sign up).

We always welcome comments on entries and suggestions for new ones.