If you’re early in your career as a data scientist, you might want to consider taking on some personal projects. There are two reasons why.
Firstly, it’s a way for you to test yourself. You’ve probably spent many months working through data science theory and studying different approaches to analyzing data. But how do you know you’ve actually gained useful real-world skills? You can find out by choosing a problem that interests you and applying your new analytical skills to solve it.
Another important reason to build projects is so that you have something to put in your data science portfolio. Recruiters often prefer looking at a candidate’s portfolio to reading long statements of purpose or lists of data science classes. A portfolio shows that you have practical skills and can take a project from conception to completion.
Now if you want to work on a data science project, then you can’t do that without the data. If you’re wondering where you can source data from, we’ve got you covered. We’re going to c
But before we get into those resources, let’s take a look at what a data set is.
What Is a Data Set?
You can probably guess part of what a data set is from its name: it is a set of data points. But a data set also has a few other characteristics that are worth remembering.
Firstly, a data set is made up of related data. Let’s say you have a data set about a housing subsidy program. In that case, it would include data points on the prices of houses over time, the demographics of the buyers, the areas where the program runs, and so on. All of these data points are related, and together they constitute a data set.
Secondly, the data in a data set is discrete. Each record is a separate, self-contained entry, and each field in it holds a single, definite value.
Data sets are most commonly stored in a tabular format. Every column in the table corresponds to a specific category of information. Every row is one record, holding one value for each of those columns.
For example, assume that you have a data set on stock prices during a certain period. Some of the columns in this data set would be company name, stock price, year-on-year change in stock price, and so on. As you can see, the values entered in this table would be related, discrete, and structured.
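To make the column-and-row idea concrete, here is a minimal Python sketch of the stock price example. The company names, prices, and column names are made up for illustration, and the data is read from an in-memory string rather than a real file.

```python
import csv
import io

# Hypothetical sample data: the companies and numbers are invented.
RAW_CSV = """company,price_usd,yoy_change_pct
Acme Corp,120.50,4.2
Globex,87.10,-1.5
Initech,45.00,10.0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per row (record)."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(RAW_CSV)

# The header line defines the columns (categories of information).
columns = list(rows[0].keys())

# CSV values arrive as strings, so convert them before doing arithmetic.
prices = [float(row["price_usd"]) for row in rows]

print(columns)
print(len(rows), "records")
print("highest price:", max(prices))
```

Each dictionary in `rows` is one record, and each key is one column, which mirrors the table layout described above.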
Free Data Sets To Analyze
Now that we know what a data set is, we can move on to some of the best public data sets out there. These data sets have been sourced from government agencies, private companies, and public institutions. The data in them is structured, so much of the heavy cleaning work has already been done, though it is still worth checking for missing or inconsistent values before you start your analysis.
Free General Data Sets
Kaggle is a community built specifically for data scientists and machine learning engineers. The goal is to give members a place where they can work on data problems together and access data sets so they can regularly practice data analysis.
Kaggle has something to offer data scientists at every level, whether that’s a simple data set for students or something more advanced for a data scientist looking to sharpen their artificial intelligence chops. The platform is also known for hosting regular competitions where you can go up against other data scientists to solve real-world problems posted by companies.
The Google Cloud Marketplace offers data sets that have been sourced from various Google products. So if you want an excellent data set from services like Google Trends, Google Patents Research, and Community Mobility Reports, this is where you can find it.
Google also offers a collection of commercial and public data sets. You can conduct your analyses on Google Cloud or download the data sets and use your own tools for the job.
You probably think of GitHub as a place to host code under version control, but did you know that it also hosts a wide variety of data sets that you can use for your personal projects? Many of these are available for free, and you can quickly pull the data into your project when you need it.
Let’s take this glacier mass balance data set, for example. This fantastic data set provides information on the mass of reference glaciers across the world. You can use this and similar data sets to conduct analyses on a wide range of topics.
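If a data set lives in a public GitHub repository as a CSV file, one way to pull it into a project is through the file's raw URL. The sketch below assumes the usual `raw.githubusercontent.com/<owner>/<repo>/<branch>/<path>` pattern; the owner, repository, and file path are placeholders, not the location of the glacier data set.

```python
import csv
import io
import urllib.request


def raw_github_url(owner, repo, branch, path):
    """Build the raw-file URL for a file in a public GitHub repository."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"


def parse_csv(text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_csv(url, timeout=30):
    """Download a CSV file and parse it. Requires network access."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return parse_csv(text)


# Placeholder values: replace them with a real repository and file path.
url = raw_github_url("example-owner", "example-repo", "main", "data/example.csv")
print(url)

# rows = fetch_csv(url)  # uncomment once the URL points at a real file
```

Keeping the parsing step separate from the download step makes it easy to test your analysis code on a small local sample before touching the network.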
Free Government Data Sets
Data.gov is the home of the U.S. federal government’s public data sets. You can access all kinds of data that is a matter of public record in the country. The main categories of data available are agriculture, climate, energy, local government, maritime, ocean, and older adult health.
Along with giving access to this collection of repositories for free, the website also has various resources for data scientists. You can use it to learn more about data analysis tools, data management frameworks, and case studies of projects taken up by data scientists who work in government.
What Data.gov does at the federal level, NYC Open Data does for New York City. This website is a collection of repositories that offer data sourced from various public institutions that govern the city.
The main categories of data available here are business, city government, education, environment, and health. You can also browse data sets compiled by different agencies, such as the Financial Information Services Agency (FISA) or the Mayor’s Office of Climate Policy and Programs (CPP).
This is the official portal for all of the public data that is offered by the European Union. The scope of the available data is broken down into national data, European data, and international data. You can find a detailed data set for just about any aspect of European life here, covering economic indicators, law enforcement agencies, health care institutions, and more.
Free Health Data Sets
Healthdata.gov is a repository of freely available healthcare data from the US government. It is managed by the U.S. Department of Health and Human Services.
This website is a treasure trove for anyone interested in healthcare data. You can find public data sets on everything ranging from cancer incidence to COVID-19 prevalence and impact. Working on these data sets can be especially helpful if you plan on getting a data science job in healthcare.
This is a federal website managed by the U.S. Centers for Medicare & Medicaid Services. The data sets available on this website are specifically geared towards medical and dental plans for groups and individuals. There’s also an API with clear documentation in case you want to pull the data directly into a web application.
The Berkeley Library Health Statistics and Data website provides free access to a large variety of data sets. These include both nationwide statistics and data specific to the state of California.
Free Environment Data Sets
The National Centers for Environmental Information offers its climate data for free through these public data sets. The goal of the undertaking is to make global climate data available for analysis and study.
The public data sets available on this website constitute a cross-section of data across months, seasons, and years. You can get information on things like temperature, wind, precipitation, and other climate data here. The site also offers specialized tools that you can use to access this climate data.
If you want to do a data science project on climate data, then this website offers just about every kind of data set that you could possibly need. This website by Tutiempo Network contains public data sets with climate data for every country on the planet. Some of this data goes back to the first half of the 20th century.
The data on this website is sourced from over 9,000 weather stations. It is easy to break the available data sets down by continent or country if you want to focus your analyses on one particular region.
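Breaking a climate data set down by continent or country usually comes down to filtering and grouping records. Here is a minimal sketch of both steps; the records and field names are invented for illustration and do not reflect the website's actual export format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: the values and field names are made up.
records = [
    {"continent": "Europe", "country": "Spain", "year": 2020, "avg_temp_c": 15.0},
    {"continent": "Europe", "country": "Spain", "year": 2021, "avg_temp_c": 17.0},
    {"continent": "Europe", "country": "Norway", "year": 2020, "avg_temp_c": 2.0},
    {"continent": "Asia", "country": "Japan", "year": 2020, "avg_temp_c": 12.0},
]


def filter_by(rows, field, value):
    """Keep only the rows whose `field` equals `value`."""
    return [row for row in rows if row[field] == value]


def mean_by(rows, group_field, value_field):
    """Average `value_field` within each distinct value of `group_field`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    return {key: mean(values) for key, values in groups.items()}


europe = filter_by(records, "continent", "Europe")
by_country = mean_by(europe, "country", "avg_temp_c")
print(by_country)
```

The same two helpers work for any region: swap `"Europe"` for another continent, or group by `"year"` instead of `"country"` to look at change over time.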
FiveThirtyEight, the website known for its data journalism stories, used this US Weather History data repository to produce its 2015 story “What 12 Months of Record-Setting Temperatures Looks Like Across the US.” Analyzing this data set is a good way to understand how data science connects with storytelling. You can use the story as inspiration to work on your data visualization skills.
Free Economic Data Sets
This is a public website with data offered by the World Bank. Because the World Bank works with countries around the globe, you can expect economic data from every continent.
Each data page allows you to download data in bulk as a CSV file or in other file formats. There is also an API that you can use to pull this data into your own tools for analysis or display.
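As a rough sketch of API access, the Python below builds a request URL and extracts a yearly series from the response. It assumes the World Bank Indicators API's v2 JSON layout, in which the response is a two-element list (paging metadata, then the records), and it assumes annual data. The sample payload is made up so the parsing logic can be checked offline; consult the official API documentation before relying on any of these details.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the World Bank Indicators API (v2).
BASE_URL = "https://api.worldbank.org/v2"


def indicator_url(country, indicator, per_page=100):
    """Build a request URL for one country and one indicator code."""
    query = urllib.parse.urlencode({"format": "json", "per_page": per_page})
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def extract_series(payload):
    """Return (year, value) pairs sorted by year, skipping missing values."""
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    series = [
        (int(item["date"]), item["value"])
        for item in payload[1]
        if item.get("value") is not None
    ]
    return sorted(series)


def fetch_series(country, indicator, timeout=30):
    """Download and parse one indicator series. Requires network access."""
    url = indicator_url(country, indicator)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return extract_series(payload)


# Made-up payload in the assumed response shape, for offline testing.
sample_payload = [
    {"page": 1, "pages": 1, "per_page": 100, "total": 3},
    [
        {"date": "2021", "value": 300.0},
        {"date": "2020", "value": None},
        {"date": "2019", "value": 250.0},
    ],
]

print(indicator_url("US", "NY.GDP.MKTP.CD"))
print(extract_series(sample_payload))
```

Here `NY.GDP.MKTP.CD` is the indicator code for GDP in current US dollars. Calling `fetch_series("US", "NY.GDP.MKTP.CD")` would run the same parsing against live data.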
This is a website with public data that pertains to employment levels and labor information for the United States. You can access data that covers things like inflation and prices, workplace injuries, productivity, employment benefits, etc.
These are business-related data sets that are made available by the Carnegie Mellon library. You can peruse data that pertains to all kinds of national and international economic information. Some examples include economic data from the Federal Reserve, data from the International Labor Organization, and data from the World DataBank.
Data Set FAQs
We’ve got answers to your most frequently asked questions about data sets.
How Do I Know if a Free Data Set Is Complete?
You can improve your chances of getting complete data by choosing reliable sources for your data sets. Go with data that has been made available by governments, reputable private companies, and public institutions. Even then, it is worth checking the data yourself for missing values and gaps in coverage before you rely on it.
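A quick first check for completeness is to count the missing values in each column. The sketch below does this for a small, made-up weather sample. The set of markers treated as "missing" is an assumption, so adjust it to match the conventions of the data set you are working with.

```python
import csv
import io

# Hypothetical sample with some gaps: the stations and readings are invented.
SAMPLE = """station,date,temp_c,precip_mm
A1,2020-01-01,3.5,0.0
A1,2020-01-02,,1.2
B2,2020-01-01,NA,
"""

# Assumed placeholders for missing data; real data sets may use others.
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}


def missing_counts(rows):
    """Count missing values per column across all rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if column is None:
                continue  # skip overflow fields from rows that are too long
            is_missing = value is None or value.strip().lower() in MISSING_MARKERS
            counts[column] = counts.get(column, 0) + (1 if is_missing else 0)
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = missing_counts(rows)
print(report)
```

A column with many missing values is not necessarily unusable, but it tells you where to look more closely before drawing conclusions.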
Are All Data Sets Free?
Not all data sets are free. Some require users to pay for access to download the data or to use an API that gives access to the data.
Can You Make Your Own Data Set?
Yes, you can build your own data set by collecting data from sources such as social media sites, online directories, and so on.
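Once you have collected records, turning them into a data set mostly means fixing a set of columns, removing duplicates, and saving the result in a tabular format. Here is a minimal sketch, assuming the records have already been gathered into Python dictionaries. The businesses and field names are invented, and the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io

# Hypothetical column names for a small directory-style data set.
FIELDS = ["name", "city", "category"]

# Made-up records standing in for data collected from an online directory.
collected = [
    {"name": "Example Cafe", "city": "Austin", "category": "restaurant"},
    {"name": "Sample Books", "city": "Denver", "category": "retail"},
    {"name": "Example Cafe", "city": "Austin", "category": "restaurant"},
]


def deduplicate(records, fields):
    """Drop exact duplicate records while preserving their original order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def to_csv_text(records, fields):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


unique_records = deduplicate(collected, FIELDS)
csv_text = to_csv_text(unique_records, FIELDS)
print(csv_text)
```

To save the result to disk, you could write `csv_text` to a file opened with `newline=""`. Before collecting data from any website, check its terms of use.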