Many of you have likely been following the recent news on facial recognition software (see here, here, here, and here for recent articles), its use in the criminal justice system, and the systematic racial biases associated with it. You may also be aware of other algorithms that are systematically biased against a particular group of people. However, you may not have a plan for how to bring these ideas into the data science classroom.


Eighteen years ago, Leo Breiman published an important paper entitled “Statistical Modeling: The Two Cultures” in Statistical Science. In today’s blog entry we discuss the implications of the paper for data science education. Breiman described two cultures of statistical modeling: one that assumes a stochastic data model, and one that uses an algorithmic model and treats the data mechanism as unknown. He asserted that the statistics community had focused almost exclusively on the former, with interpretation of parameters at the core.
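To make the contrast concrete for students, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from Breiman's paper: the data are synthetic (a sine curve plus noise), a straight-line fit stands in for the data-model culture, and k-nearest neighbors stands in for the algorithmic culture. The function names are hypothetical.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Data-model culture: assume y = a + b*x + noise, estimate (a, b) by least squares."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def knn_predict(train_x, train_y, x0, k=5):
    """Algorithmic culture: average the k nearest training responses, with no model of the mechanism."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return statistics.fmean(train_y[i] for i in nearest)


def mse(preds, ys):
    """Mean squared prediction error."""
    return statistics.fmean((p - y) ** 2 for p, y in zip(preds, ys))


# Synthetic data (an assumption for illustration): the true mechanism is nonlinear.
rng = random.Random(0)
xs = [rng.uniform(0, 6) for _ in range(200)]
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

# Data model: the fitted slope is an interpretable parameter.
a, b = fit_line(train_x, train_y)
line_mse = mse([a + b * x for x in test_x], test_y)

# Algorithmic model: judged only by held-out predictive accuracy.
knn_mse = mse([knn_predict(train_x, train_y, x) for x in test_x], test_y)

print(f"fitted line: intercept={a:.3f}, slope={b:.3f}, held-out MSE={line_mse:.3f}")
print(f"5-nearest neighbors: held-out MSE={knn_mse:.3f}")
```

The line gives a slope students can interpret but predicts poorly here because its assumed mechanism is wrong; the nearest-neighbor predictor offers no parameters to interpret and is evaluated purely on held-out error.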


Today’s guest entry by Kelly McConville (Reed College) describes the creation of data packages in R by instructors and students.

Sharing data with students

The beginning of the data analysis cycle involves ingesting data. For novice students in the first week or two of an introductory course, this can be a tricky step. If I am new to R and I can’t even load the data, then my first impressions of R are not going to be too great.


Last week’s entries on Python included a description of the innovative and popular data8. Today we describe the follow-up course, data100 (Principles and Techniques of Data Science, http://www.ds100.org/), offered by the University of California/Berkeley Division of Data Sciences.

Course Goals

The goals of data100 are listed on the data100 website and reproduced here. The goals are lofty indeed, but they also address an incredibly important shortcoming in many undergraduate curricula: a student who is successful in data100 will hit the ground running doing data science after graduation.


All week we’ve been celebrating the use of Python in data science. There is no question that Python is a fantastic and very powerful language. It is also widely regarded as the most used language for doing data science: the Kaggle 2017 survey reports that more than three-quarters of data scientists use Python (although it also notes that most statisticians use R). Knowing how to use Python is an important first step to engaging with the software.


As part of our week of Python, we wanted to focus on innovative pedagogical approaches that have been used to scale outreach efforts. A great example is the http://Data8.org (Foundations of Data Science) course offered by the University of California/Berkeley Division of Data Sciences. The course combines three perspectives: inferential thinking, computational thinking, and real-world relevance. Students use real data to understand relationships and patterns while the course teaches critical concepts and skills in computer programming and statistical inference.


For many statisticians, their go-to software language is R. However, there is no doubt that Python is an equally important language in data science. Indeed, the Jupyter blog entry from earlier this week described the capabilities of writing Python code (as well as R, Julia, and other languages) in interactive Jupyter notebooks.

    knitr::opts_chunk$set(collapse = TRUE)
    library(reticulate)
    use_virtualenv("r-reticulate")
    use_python("F:/Anaconda3", required = TRUE)
    py_config()

Teaching Python and R

A quick Google search brings up many arguments on both sides of the heated Python vs R debate.


About pandas

pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Straight from the library’s homepage, “pandas helps fill Python’s long-standing gap in tools for data analysis and modeling.” In short, pandas offers some new and some improved Python tools for doing the following:

- Reading data into data frame-type structures
- Viewing and selecting data
- Handling missing data
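As a small illustration of the three tasks listed above, here is a minimal sketch. The data set is made up for this example (a tiny in-memory CSV of hypothetical student scores), and it assumes pandas is installed; the calls used (`read_csv`, `head`, boolean indexing, `loc`, `isna`, `fillna`, `dropna`) are standard pandas functions.

```python
import io

import pandas as pd

# Hypothetical data for illustration: one score is missing on purpose.
csv_text = """student,section,score
Ana,A,91
Ben,B,
Chloe,A,78
Dev,B,85
"""

# Reading data into a data frame (a file path or URL would work the same way).
df = pd.read_csv(io.StringIO(csv_text))

# Viewing and selecting data.
print(df.head())                              # first rows of the table
section_a = df[df["section"] == "A"]          # rows matching a condition
names_scores = df.loc[:, ["student", "score"]]  # a subset of columns

# Handling missing data.
n_missing = int(df["score"].isna().sum())          # count missing scores
filled = df["score"].fillna(df["score"].mean())    # impute with the column mean
complete = df.dropna(subset=["score"])             # or drop incomplete rows

print(f"missing scores: {n_missing}")
print(complete)
```

Whether to impute or drop missing values is a modeling decision worth discussing with students; the sketch shows both options side by side without recommending either.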


For the entire week, we’re going to be celebrating using Python for data science education. Stay tuned for topics on specific Python functionality, using Python inside RStudio, Python in the curriculum, and the larger Python community. But before we get to any of those topics, we’re going to start by introducing the go-to interface for Python programming, Jupyter Notebooks.

What is Project Jupyter?

Project Jupyter is a non-profit, open-source project that grew out of the IPython Project in 2014 and is designed to support interactive data science and scientific computing across multiple programming languages.


Today’s blog entry is on parallel and grid computing. As a data science education blog, our focus is less on parallel computing for particular research projects (for a recent example see “Ambitious data science can be painless”) and more on ways to help students learn about high performance computing in the classroom. Early in data science education, it’s important to build a foundation for future work.
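One possible first classroom exercise, sketched here in Python as an assumption about what such a foundation might look like, is an "embarrassingly parallel" simulation: independent chunks of a Monte Carlo estimate of pi are handed to separate worker processes and then combined. The function names are hypothetical; `ProcessPoolExecutor` is from the standard library.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def count_hits(task):
    """Count random points in the unit square that land inside the quarter circle."""
    seed, n = task
    rng = random.Random(seed)  # one seed per chunk keeps the chunks independent and reproducible
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_chunks, n_per_chunk, parallel=True):
    """Estimate pi from n_chunks independent chunks, run serially or across processes."""
    tasks = [(seed, n_per_chunk) for seed in range(n_chunks)]
    if parallel:
        # Each chunk runs in a separate worker process.
        with ProcessPoolExecutor() as pool:
            hits = list(pool.map(count_hits, tasks))
    else:
        # Serial baseline, useful for comparing timings and checking results.
        hits = [count_hits(task) for task in tasks]
    return 4.0 * sum(hits) / (n_chunks * n_per_chunk)


if __name__ == "__main__":
    # The guard is needed so worker processes can import this file safely.
    print("serial:  ", estimate_pi(8, 200_000, parallel=False))
    print("parallel:", estimate_pi(8, 200_000, parallel=True))
```

Because each chunk has its own fixed seed, the serial and parallel runs return the same estimate, which lets students confirm that parallelizing changed the running time and not the answer.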
