Completing your first project is a major milestone on the road to becoming a data scientist. It’s also an intimidating process. The first step is to find an appropriate, interesting data set. You should decide how large and how messy a data set you want to work with; while cleaning data is an integral part of data science, you may want to start with a clean data set for your first project so that you can focus on the analysis rather than on cleaning the data.
Drawing on what we’ve learned from our Introduction to Data Science Course and the Data Science Career Track, we’ve selected data sets of varying types and complexity that we think work well for first projects (some of them work for research projects as well!). These data sets span several categories: demographic data, economic data, text data, and corporate data.
Need more? Check out our list of free data mining tools.
- United States Census Data: The U.S. Census Bureau publishes reams of demographic data at the state, city, and even zip code level. The data set is fantastic for creating geographic data visualizations and can be accessed on the Census Bureau website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr package for R. In general, this data is very clean and very comprehensive.
- FBI Crime Data: The FBI crime data set is fascinating. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.
- CDC Cause of Death: The Centers for Disease Control and Prevention maintains a database on cause of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.
- Medicare Hospital Quality: The Centers for Medicare & Medicaid Services maintains a database on quality of care at more than 4,000 Medicare-certified hospitals across the U.S., providing for interesting comparisons.
- SEER Cancer Incidence: The U.S. government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors. It comes from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program.
- Bureau of Labor Statistics: Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.
- Bureau of Economic Analysis: The Bureau of Economic Analysis also has national and regional economic data, including gross domestic product and exchange rates.
- IMF Economic Data: For access to global financial statistics and other data, check out the International Monetary Fund’s website.
- Dow Jones Weekly Returns: Predicting stock prices is a major application of data analysis and machine learning. One relevant data set to explore is the weekly returns of the Dow Jones Index from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.
- Data.gov.uk: The British government’s official data portal offers access to tens of thousands of data sets on topics such as crime, education, transportation, and health.
- Enron Emails: After the collapse of Enron, a data set of roughly 500,000 emails, including message text and metadata, was released. The data set is now famous and provides an excellent testing ground for text-related analysis. You can also explore other research uses of this data set through the page.
- Google Books Ngrams: If you’re interested in truly massive data, the Ngram viewer data set counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.
- UNICEF: If you’re interested in data about the lives of children around the world, UNICEF is a highly credible source. The organization’s public data sets touch upon nutrition, immunization, and education, among others.
- Reddit Comments: Reddit released a data set of every comment that has ever been made on the site. That’s over a terabyte of data uncompressed, so if you want a smaller data set to work with, Kaggle has hosted the comments from May 2015 on its site.
- Wikipedia: Wikipedia provides instructions for downloading the text of English-language articles, in addition to other projects from the Wikimedia Foundation.
- Lending Club: Lending Club provides data about loan applications it has rejected as well as the performance of loans that it issued. The data set lends itself both to classification techniques (will a given loan default?) and to regression (how much will be paid back on a given loan?).
- Walmart: Walmart has released historical sales data for 45 stores located in different regions across the United States.
- Airbnb: Inside Airbnb offers different data sets related to Airbnb listings in dozens of cities around the world.
- Yelp: Yelp maintains a data set for personal, educational, and academic use. It includes 6 million reviews spanning 189,000 businesses in 10 metropolitan areas. Students are welcome to participate in Yelp’s dataset challenge.
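To make the two framings mentioned for the Lending Club data more concrete, here is a minimal Python sketch of how the same loan records could yield either a yes/no target (did the loan default?) or a numeric target (how much was paid back?). The `Loan` class, its field names, the status labels, and the three sample records are all made up for illustration; the real Lending Club files use their own column names and values.

```python
from dataclasses import dataclass


@dataclass
class Loan:
    # Hypothetical fields, not the actual Lending Club column names.
    amount_funded: float
    total_repaid: float
    status: str  # assumed labels: "Fully Paid" or "Charged Off"


def classification_target(loan: Loan) -> int:
    """Binary label: 1 if the loan defaulted, 0 otherwise."""
    return 1 if loan.status == "Charged Off" else 0


def regression_target(loan: Loan) -> float:
    """Continuous label: fraction of the funded amount that was repaid."""
    return loan.total_repaid / loan.amount_funded


# Made-up records, for illustration only.
loans = [
    Loan(amount_funded=10000.0, total_repaid=11200.0, status="Fully Paid"),
    Loan(amount_funded=5000.0, total_repaid=1250.0, status="Charged Off"),
    Loan(amount_funded=8000.0, total_repaid=8000.0, status="Fully Paid"),
]

default_labels = [classification_target(loan) for loan in loans]
repaid_fractions = [regression_target(loan) for loan in loans]
default_rate = sum(default_labels) / len(loans)

print("Default labels:", default_labels)
print("Repaid fractions:", repaid_fractions)
print("Share of loans that defaulted:", default_rate)
```

A classification model would be trained to predict the values in `default_labels`, while a regression model would be trained to predict the values in `repaid_fractions`. Deciding which question you want to answer is a useful first step before loading the full data set.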
(This post was originally published October 13, 2015. It was last updated August 21, 2018.)
Now it’s time to get cracking! If you want to jumpstart your data science career, check out Springboard’s Introduction to Data Science course, or, for those with more experience, the Data Science Career Track.