Completing your first project is a major milestone on the road to becoming a data scientist. It’s also an intimidating process. The first step is to find an appropriate, interesting dataset. You should decide how large and how messy a dataset you want to work with; while cleaning data is an integral part of data science, you may want to start with a clean dataset for your first project so that you can focus on the analysis rather than on cleaning the data.
Based on what we have learned from our Foundations of Data Science Workshop, we’ve selected datasets of varying types and complexity that we think work well for first projects (some of them work for research projects as well!). These datasets cover a variety of sources: demographic data, economic data, text data, and corporate data.
- United States Census Data: The United States Census publishes reams of demographic data at the state, city, and even zip code level. The dataset is fantastic for creating geographic data visualizations and can be accessed on the Census website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr R package. In general, this data is very clean and very comprehensive.
- FBI Crime Data: The FBI crime dataset is fascinating. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.
- CDC Cause of Death: The Centers for Disease Control and Prevention (CDC) maintains a database on cause of death. The data can be segmented in almost every way imaginable: age, race, year, and so on.
- Medicare Hospital Quality: Medicare maintains a database on complication rates by hospital that provides for interesting comparisons.
- SEER Cancer Incidence: The US government also has data about cancer incidence, again segmented by age, race, gender, year, and other factors.
- Bureau of Labor Statistics: Many important economic indicators for the United States (like unemployment and inflation) can be found on the Bureau of Labor Statistics website. Most of the data can be segmented both by time and by geography.
- The Bureau of Economic Analysis: The Bureau of Economic Analysis also has national and regional economic data, like GDP and exchange rates.
- IMF Economic Data: If you want a view of international data, you can find it on the IMF website.
- Dow Jones Weekly Returns: Predicting stock prices is a major application of data analysis and machine learning. One dataset to explore is the weekly returns of the Dow Jones Index.
- Boston Housing Data: The Boston Housing Data Set contains median housing prices in Boston suburbs as well as 13 attributes that contribute to those prices. It’s an excellent set for experimenting with various types of regressions.
- Enron Emails: After the collapse of Enron, a dataset of roughly 500,000 emails with message text and metadata was released. The dataset is now famous and provides an excellent testing ground for text-related analysis. It has the messiness of real-world data.
- Google N-Grams: If you’re interested in truly massive data, the Google n-grams dataset counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.
- Sentence Sentiments: Researchers have labeled 3,000 sentences as expressing positive or negative sentiments. If you’re interested in classifying text, this is a great place to start.
- Reddit Comments: Reddit released a dataset of every comment that has ever been made on the site. That’s over a terabyte of data uncompressed, so if you want a smaller dataset to work with, Kaggle has hosted the comments from May 2015 on their site.
- Wikipedia: Wikipedia provides instructions for downloading the text of English-language articles.
- Lending Club: Lending Club provides data about loan applications it has rejected as well as the performance of loans that it issued. The dataset lends itself both to classification techniques (will a given loan default?) and to regressions (how much will be paid back on a given loan?).
- Walmart: Walmart has released store-level sales data for 98 items across 45 stores. This is an excellent dataset for time series analysis and has interesting seasonal components as well.
- Airbnb: Airbnb released user session data as part of a contest to create analyses and visualizations.
- Yelp: Yelp releases an academic dataset that contains information for the areas around 30 universities.
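For the cleaner tabular datasets above, a first pass often amounts to loading a CSV file and fitting a simple model. The sketch below shows one way that might look using only the Python standard library. The sample data is made up, and the column names (`rooms`, `median_price`) and helper functions (`load_rows`, `fit_line`) are hypothetical; none of them come from the datasets listed here.

```python
import csv
import io
from statistics import mean

# Made-up sample standing in for a small, clean CSV file.
# The column names are illustrative assumptions, not taken from
# any of the datasets in the list above.
SAMPLE_CSV = """rooms,median_price
4,18.0
5,21.0
6,24.0
7,27.0
8,30.0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with float values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


rows = load_rows(SAMPLE_CSV)
xs = [row["rooms"] for row in rows]
ys = [row["median_price"] for row in rows]
slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With a real file, you would replace `io.StringIO(text)` with an opened file handle and would likely need to handle missing or non-numeric values, which this sketch does not attempt.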
Well, now it’s time to get cracking! If you want to jumpstart your data science career today, I’d recommend checking out our 12-Week Online Workshop, Foundations of Data Science. Head here for more on that. If you want even more resources, check out the Springboard home page.