Today’s blog is a compilation of datasets and data sources for data science classrooms that aim to bring relevant and timely information to bear on the issues of the day. We hope that the datasets below can be used in conjunction with some of this summer’s previous blogs, for example, those considering
- health implications when describing COVID data,
- language around describing social justice data, and
- learning outcomes for getting the most out of ethical data science discussions.
Collecting Data
Before linking to the data, we encourage you to reflect on how data are collected and what impact poor data collection can have on any ensuing conclusions.
- In criminal justice datasets, race designation is often the reporting officer’s guess. Consider California Assembly Bill No. 953 from 2015.
- In their Data Equity Framework, We All Count details seven stages of looking at data projects, including data collection & sourcing.
They say:
The requirements for equitable data collection are complex. It’s not as simple as trying to ask everyone and not leave people out. Sample selection is important of course, but so is survey design, collector behaviour, scope and scale, cultural translation, collection mediums, data corruption, compatibility and fidelity and much more. It’s super worth doing, if for no other reason than your data will be more useful.
Christina Abraham discusses the impact of over-simplifying racial categories.
The Schusterman Family Foundation has written “How we collect data determines whose voice is heard” and has provided guidance in “More Than Numbers: A Guide Toward Diversity, Equity and Inclusion in Data Collection.”
Researchers at the Urban Institute have put together a report on The Alarming Lack of Data on Latinos in the Criminal Justice System.
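One way to start this reflection with students is to tabulate how often a demographic field is simply not recorded, and to see what happens when categories are merged. The sketch below is a minimal illustration, not a method taken from any of the sources above: the records are invented toy data, and the field name `race_ethnicity`, the helper `category_shares`, and the recoding scheme are all hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical toy records, invented purely for illustration; not real data.
# None marks a record where race/ethnicity was not recorded.
records = [
    {"id": 1, "race_ethnicity": "Black"},
    {"id": 2, "race_ethnicity": "White"},
    {"id": 3, "race_ethnicity": None},
    {"id": 4, "race_ethnicity": "Latino"},
    {"id": 5, "race_ethnicity": "White"},
    {"id": 6, "race_ethnicity": None},
    {"id": 7, "race_ethnicity": "Asian"},
    {"id": 8, "race_ethnicity": "White"},
]


def category_shares(rows, field, recode=None):
    """Return {category: share of rows}, counting missing values explicitly."""
    recode = recode or {}
    counts = Counter()
    for row in rows:
        value = row.get(field)
        # Keep missing values visible instead of silently dropping them.
        label = "Not recorded" if value is None else recode.get(value, value)
        counts[label] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Shares using the categories as they appear in the toy records.
detailed = category_shares(records, "race_ethnicity")

# Shares under a hypothetical recoding that folds one group into another.
collapsed = category_shares(
    records, "race_ethnicity", recode={"Latino": "White"}
)

for name, shares in [("detailed", detailed), ("collapsed", collapsed)]:
    print(name)
    for label, share in sorted(shares.items()):
        print(f"  {label}: {share:.1%}")
```

Comparing the two tables gives students something concrete to discuss: in the collapsed version one group no longer appears at all, and in both versions a sizable share of records carries no designation.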
Datasets
Criminal Justice
Campaign Zero has created the Police Scorecard Data to evaluate how police departments interact with the communities they serve.
The Stanford Open Policing Project provides information on over 200 million traffic stops across 42 states.
The Citizens Police Data Project collects and publishes information about police misconduct in Chicago.
The Police Data Initiative promotes responsible policing through the use of open data.
The National Archive of Criminal Justice Data curates data on criminal justice, with close to 3,000 studies / datasets.
ProPublica has compiled datasets related to criminal justice on a wealth of issues:
The data from FiveThirtyEight includes many studies related to criminal justice:
Environment
- ProPublica has compiled datasets related to the environment including:
- The Pulitzer-winning Washington Post series “Dangerous new hot zones are spreading around the world” comes with data sources and an explanation of the data.
Health
- ProPublica has compiled datasets related to health including:
Race & gender
Tidy Tuesday dataset on African American Achievements.
Tidy Tuesday dataset on the Slave trade.
Racial bias in red cards given in Major League Soccer, with data provided.
An experiment that used names on resumes to measure racial discrimination in the labor market, with data summarized nicely by OpenIntro.
Center for American Women and Politics posted the entire Women Elected Officials Database.
Measuring stereotypical bias in language models with an application and reproducible code.
Southern Poverty Law Center mapping information on Confederate monuments.
Mitigating gender bias in student evaluations of teaching including data.
Elections
Issues with mail-in ballots including GA data and NC data.
Washington Post article on Postal Service warns 46 states their voters could be disenfranchised by delayed mail-in ballots. Jacob Bogage is collecting data and will likely post it publicly.
- Voter registration during the pandemic, including data on new voter registration.
Large Data Archives
DrivenData crowd-sources solving data science problems with positive social impact.
Data is Plural has compiled over a thousand datasets on every topic imaginable.
The Markup uses data-driven approaches to investigate how powerful institutions use technology, often against our best interest. All Markup data is freely available.
FiveThirtyEight is a data journalism website which started by doing political analyses but now uses data to cover politics, science, economics, and lifestyle. They provide access to many of their datasets.
ProPublica does investigative journalism and provides many of their datasets for free.
About this blog
Last summer we wrote a series of blog entries designed to start conversations around teaching data science, Teach Data Science. We covered topics such as data science software, data ingestion, data technologies, data wrangling, visualization & exploration, communication, and key reports and findings on data science.
One key element missing from our 2019 blog was a discussion of, and a commitment to, teaching the ethical aspects of data science. We now find ourselves in the summer of 2020, overwhelmed by the state of the world and re-committed to taking on the ethical challenges that must be met for data science to be a positive force for change.
Although none of us are experts in ethics, we have all included ethics discussions in our classrooms for many years. In the weeks to come, we will share some of the ways we engage our students in these important topics. We will provide resources for readings, examples, datasets, and exercises. We believe that data ethics are part of every data science analysis and classroom experience, and we hope that this summer’s blog will entice you into presenting ethical dilemmas and related conversations to your students early and often.
During the summer of 2020, we wrote a dozen or so blog entries. We hope that you bookmark the site and check in regularly. Want a reminder? Sign up for emails at https://groups.google.com/forum/#!forum/teach-data-science (you must be logged into Google to sign up).