Many newcomers to data science spend a significant amount of time on theory and not enough on practical application. To make real progress along the path toward becoming a data scientist, it’s important to start building data science projects as soon as possible.
If you’re thinking about putting together your own data science projects and don’t know where to begin, it’s a good idea to seek inspiration from others. At Springboard, we offer mentored bootcamps that culminate in capstone projects focused on solving a real-world problem using the skills acquired throughout the course.
In this post, we’ll share data science project examples from both Springboard students and outside data scientists that will help you understand what a completed project should look like. We’ll also provide some tips for creating your own interesting data science projects.
Data Science Projects
“Eat, Rate, Love” — An Exploration of R, Yelp, and the Search for Good Indian Food (Beginner)
When it comes time to choose a restaurant, many people turn to Yelp to determine which is the best option for the type of food they’re in search of. But what happens if you’re looking for a specific type of cuisine and there are many restaurants rated the same within a small radius? Which one do you choose? Robert Chen took Springboard’s Introduction to Data Science course and chose as his capstone project a way to further evaluate Yelp reviewers to determine if their reviews led to the best Indian restaurants.
Chen discovered while searching Yelp that there were many recommended Indian restaurants with close to the same scores. Certainly not all the reviewers had the same knowledge of this cuisine, right? With this in mind, he took into consideration the following:
- The number of restaurant reviews by a single person of a particular cuisine (in this case, Indian food). He was able to justify this parameter by looking at reviewers of other cuisines, such as Chinese food.
- The apparent ethnicity of the reviewer in question. If the reviewer had an Indian name, he could infer that they might be of Indian ethnicity, and therefore more familiar with what constituted good Indian food.
His modification to the data and the variables showed that those with Indian names tended to give good reviews to only one restaurant per city out of the 11 cities he analyzed, thus providing a clear choice per city for restaurant patrons.
Third and Goal (Intermediate)
The intersection of sports and data is full of opportunities for aspiring data scientists. A lover of both, Divya Parmar decided to focus on the NFL for his capstone project during Springboard’s Introduction to Data Science course.
Divya’s goal: to determine the efficiency of various offensive plays in different tactical situations. Here’s a sample from Divya’s project write-up:
To investigate 3rd down behavior, I obtained play-by-play data from Armchair Analysis; the dataset was every play from the first eight weeks of this NFL season. Since the dataset was clean, and we know that 80 percent of the data analysis process is cleaning, I was able to focus on the essential data manipulation to create the data frames and graphs for my analysis. I used R as my programming language of choice for analysis, as it is open source and has thousands of libraries that allow for vast functionality.
I loaded in my csv file into RStudio (my software for the analysis). First, I wanted to look at offensive drives themselves, so I generated a drive number for each drive and attached it to individual plays dataset. With that, I could see the length of each drive based on the count of each drive number.
Then, I moved on to my main analysis of 3rd down plays. I created a new data frame, which only included 3rd down plays which were a run or pass (excluding field goals, penalties, etc). I added a new categorical column named “Distance,” which signified how many yards a team had to go to convert the first down. Using conventional NFL definitions, I decided on this:
This hands-on project work was the most challenging part of the course for Divya, he said, but it allowed him to practice the different steps in the data science process: assessing the problem, manipulating the data, and delivering actionable insights to stakeholders.
You can access the data set Divya used here.
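To make the filtering step from the write-up concrete, here is a minimal Python sketch (the original analysis was done in R). The field names (`down`, `play_type`, `yards_to_go`), the sample plays, and the distance cut-offs are all hypothetical placeholders; they are not Divya's definitions or the Armchair Analysis schema.

```python
from collections import Counter

# Hypothetical play records; field names are illustrative only.
plays = [
    {"down": 1, "play_type": "run", "yards_to_go": 10},
    {"down": 3, "play_type": "pass", "yards_to_go": 2},
    {"down": 3, "play_type": "field_goal", "yards_to_go": 7},
    {"down": 3, "play_type": "run", "yards_to_go": 5},
    {"down": 3, "play_type": "pass", "yards_to_go": 12},
]

def distance_bucket(yards_to_go, short_max=3, medium_max=7):
    """Label a play by yards needed; the cut-offs are placeholders."""
    if yards_to_go <= short_max:
        return "short"
    if yards_to_go <= medium_max:
        return "medium"
    return "long"

# Keep only 3rd-down runs and passes, then add the categorical column.
third_downs = [
    {**p, "distance": distance_bucket(p["yards_to_go"])}
    for p in plays
    if p["down"] == 3 and p["play_type"] in {"run", "pass"}
]

# Count how many 3rd-down plays fall into each distance category.
counts = Counter(p["distance"] for p in third_downs)
print(counts)
```

With real play-by-play data, you would swap in the actual column names and whichever distance definitions you settle on.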
Who’s a Good Dog? (Intermediate)
Garrick Chu, another Springboard alum, chose to work on an image classification project, identifying dog breeds using neural networks. This project primarily leveraged Keras through Jupyter notebooks and tested the wide variety of skills commonly associated with neural networks and image data:
- Working with large data sets
- Effective processing of images (rather than traditional data structures)
- Network design and tuning
- Avoiding over-fitting
- Transfer learning (combining neural nets trained on different data sets)
- Performing exploratory data analysis to understand model outputs that people can’t directly interpret
One of Garrick’s goals was to determine whether he could build a model that would be better than humans at identifying a dog’s breed from an image. Because this was a learning task with no benchmark for human accuracy, once Garrick optimized the network to his satisfaction, he went on to conduct original survey research in order to make a meaningful comparison.
Amazon vs. eBay (Advanced)
Ever pulled the trigger on a purchase only to discover shortly afterward that the item was significantly cheaper at another outlet?
In support of a Chrome extension he was building, Chase Roberts decided to compare the prices of 3,500 products on eBay and Amazon. With his biases acknowledged, Chase walks readers of this blog post through his project, starting with how he gathered the data and documenting the challenges he faced during this process.
The results showed potential for substantial savings: “Our shopping cart has 3,520 unique items and if you chose the wrong platform to buy each of these items (by always shopping at whichever site has a more expensive price), this cart would cost you $193,498.45. Or you could pay off your mortgage. This is the worst case scenario for our shopping cart. The best case scenario for our shopping cart, assuming you found the lowest price between eBay and Amazon on every item, is $149,650.94. This is a $44,000 difference, or 23%!”
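The arithmetic behind the quoted figures is easy to check. The short sketch below uses only the two totals from the quote and assumes the percentage is measured against the worst-case total, which is the reading that matches the rounded 23%.

```python
worst_case = 193_498.45  # always buying from the more expensive site
best_case = 149_650.94   # always buying from the cheaper site

# Absolute savings and savings as a share of the worst-case total.
savings = round(worst_case - best_case, 2)
savings_pct = round(100 * savings / worst_case, 1)
print(savings, savings_pct)
```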
Find out more about the project here.
Fake News! (Advanced)
These days, it’s hard enough for the average social media user to determine when an article is made up with an intention to deceive. So is it possible to build a model that can discern whether a news piece is credible? That’s the question a four-person team from the University of California at Berkeley attempted to answer with this project.
First, the team identified two common forms of fake news to focus on: clickbait (“shocking headlines meant to generate clicks to increase ad revenue”) and propaganda (“intentionally misleading or deceptive articles meant to promote the author’s agenda”).
To develop a classifier that would be able to detect clickbait and propaganda articles, the foursome scraped data from news sources listed on OpenSources, preprocessed articles for content-based classification using natural language processing, trained different machine learning models to classify the news articles, and created a web application to serve as the front end for their classifier.
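To give a feel for content-based classification, here is a toy Python sketch. It is not the team's method: the post does not say which models or features they used, so this substitutes a small multinomial naive Bayes classifier written with the standard library, and the headlines and labels are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split into word tokens (very simple preprocessing)."""
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """Count words per label for a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up headlines, for illustration only.
examples = [
    ("you won't believe this shocking trick", "clickbait"),
    ("shocking secret doctors hate", "clickbait"),
    ("city council approves annual budget", "credible"),
    ("senate committee publishes budget report", "credible"),
]
model = train(examples)
print(predict(model, "shocking trick you must see"))
```

A real project would need far more training data, a held-out test set, and a comparison of several models, as the Berkeley team did.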
Find out more and try it out here.
Audio Snowflake (Advanced)
When you think about data science projects, chances are you think about how to solve a particular problem, as seen in the examples above. But what about creating a project for the sheer beauty of the data? That’s exactly what Wendy Dherin did.
The purpose of her Hackbright Academy project was to create a stunning visual representation of music as it played, capturing a number of components, such as tempo, duration, key, and mood. The web application Wendy created uses an embedded Spotify web player, an API to scrape detailed song data, and trigonometry to move a series of colorful shapes around the screen. Audio Snowflake maps both quantitative and qualitative characteristics of songs to visual traits such as color, saturation, rotation speed, and the shapes of figures it generates.
She explains a bit about how it works:
Each line forms a geometric shape called a hypotrochoid (pronounced hai-po-tro-koid).
Hypotrochoids are mathematical roulettes traced by a point P that is attached to a circle which rolls around the interior of a larger circle. If you have played with Spirograph, you may be familiar with the concept.
The shape of any hypotrochoid is determined by the radius a of the large circle, the radius b of the small circle, and the distance h between the center of the smaller circle and point P.
For Audio Snowflake, these values are determined as follows:
- a: song duration
- b: section duration
- h: song duration minus section duration
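The description above can be sketched in a few lines of Python using the standard parametric equations for a hypotrochoid. This is only an illustration, not Wendy's code: the function name, the sample durations, and the number of sampled points are assumptions.

```python
import math

def hypotrochoid_points(a, b, h, steps=360, turns=1):
    """Sample (x, y) points on a hypotrochoid.

    a: radius of the large fixed circle
    b: radius of the small rolling circle
    h: distance from the small circle's center to the tracing point P
    """
    points = []
    for i in range(steps * turns + 1):
        t = 2 * math.pi * i / steps
        # Standard parametric form:
        #   x = (a - b) cos t + h cos(((a - b) / b) t)
        #   y = (a - b) sin t - h sin(((a - b) / b) t)
        x = (a - b) * math.cos(t) + h * math.cos((a - b) / b * t)
        y = (a - b) * math.sin(t) - h * math.sin((a - b) / b * t)
        points.append((x, y))
    return points

# Hypothetical durations in seconds, following the mapping above.
song_duration = 240.0
section_duration = 30.0
points = hypotrochoid_points(
    a=song_duration,
    b=section_duration,
    h=song_duration - section_duration,
)
print(points[0])
```

Feeding the resulting points to any plotting or canvas library draws the curve; changing the two durations changes the number and shape of the loops.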
Find out more here.
Bonus Data Sets for Data Science Projects
Here are a few more data sets to consider as you ponder data science project ideas:
- VoxCeleb: an audio-visual data set consisting of short clips of human speech, extracted from interviews uploaded to YouTube.
- Titanic: a classic data set appropriate for data science projects for beginners.
- Boston Housing Data: a fairly small data set based on U.S. Census Bureau data that’s focused on a regression problem.
- Big Mart Sales: a retail industry data set that can be used to predict store sales.
- FiveThirtyEight: Nate Silver’s publication shares the data and code behind some of its articles and graphics so admirers can create stories and visualizations of their own.
You can also find a wide range of free public data sets in this blog post.
Tips for Creating Cool Data Science Projects
Getting started on your own data science project may seem daunting at first, which is why at Springboard, we pair students with one-on-one mentors and student advisors who help guide them through the process.
When you start your data science project, you need to come up with a problem that you can use data to help solve. It could be a simple problem or a complex one, depending on how much data you have, how many variables you must consider, and how complicated the programming is.
Choose the Right Problem
If you’re a data science beginner, it’s best to consider problems that have limited data and variables. Otherwise, your project may get too complex too quickly, potentially deterring you from moving forward. Choose one of the data sets in this post, or look for something in real life that has a limited data set. Data wrangling can be tedious work, so it’s key, especially when starting out, to make sure the data you’re manipulating and the larger topic are interesting to you. These are often challenging projects, but they should be fun!
Breaking Up the Project Into Manageable Pieces
Your next task is to outline the steps you’ll need to take to create your data science project. Once you have your outline, you can tackle the problem and come up with a model that may prove your hypothesis. You can do this in six steps:
- Generate your hypotheses
- Study the data
- Clean the data
- Engineer the features
- Create predictive models
- Communicate your results
Generate Your Hypotheses
After you have your problem, you need to create at least one hypothesis that will help solve the problem. The hypothesis is your belief about how the data reacts to certain variables. For example, if you are working with the Big Mart data set that we included among the bonus options above, you may make the hypothesis that stores located in affluent neighborhoods are more likely to see higher sales of expensive coffee than those stores in less affluent neighborhoods.
This is, of course, dependent on you obtaining general demographics of certain neighborhoods. You will need to create as many hypotheses as you need to solve the problem.
Study the Data
Your hypotheses need data that will allow you to prove or disprove them. This is where you look in the data set for variables that affect the problem. For the coffee hypothesis in the Big Mart example, you need to be able to identify brands of coffee, prices, sales, and the surrounding neighborhood demographics of each store. If you do not have the data, you either have to dig deeper or change your hypothesis.
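A quick way to check whether a hypothesis is testable is to compare the fields it needs against the fields you actually have. The sketch below uses made-up column names; they are assumptions for illustration, not the real Big Mart schema.

```python
# Fields the coffee hypothesis needs; names are hypothetical, not the
# actual Big Mart column names.
required = {"store_id", "product_brand", "price", "units_sold", "median_income"}

# A made-up header row standing in for the columns of a real data set.
available = {"store_id", "product_brand", "price", "units_sold", "store_size"}

# Anything required but not available blocks the hypothesis for now.
missing = sorted(required - available)
if missing:
    print("Cannot test the hypothesis yet; missing:", missing)
else:
    print("All required fields are present.")
```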
Clean the Data
As much as data scientists prefer to have clean, ready-to-go data, the reality is seldom neat or orderly. You may have outlier data that you can’t readily explain, like a sudden large, one-time purchase of expensive coffee in a store that is in a lower income neighborhood or a dip in coffee purchases that you wouldn’t expect during a random two-week period (using the Big Mart scenario). Or maybe one store didn’t report data for a week.
These are all problems with the data that aren’t the norm. In these cases, it’s up to you as a data scientist to remove those outliers and add missing data so that the data is more or less consistent. Without these changes, outliers and gaps can skew your results, sometimes drastically.
With the problem you’re trying to solve, you aren’t looking for exceptions, but rather you’re looking for trends. Those trends are what will help predict profits at the Big Mart stores.
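As a rough illustration of the cleaning described above, the Python sketch below drops outliers and fills an unreported week. The sales figures are made up, and both choices (the 1.5 × IQR rule for outliers and the median for filling gaps) are common conventions assumed here, not the only reasonable options.

```python
import statistics

# Made-up weekly coffee sales for one store; None marks an unreported week.
weekly_sales = [120, 115, 130, None, 125, 900, 118, 122]

# Compute outlier bounds from the reported values only.
reported = [v for v in weekly_sales if v is not None]
q1, _, q3 = statistics.quantiles(reported, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Drop outliers, then fill gaps with the median of what remains.
typical = [v for v in reported if low <= v <= high]
fill_value = statistics.median(typical)
cleaned = [
    fill_value if v is None else v
    for v in weekly_sales
    if v is None or low <= v <= high
]
print(cleaned)
```

Before deleting an outlier in a real project, it is worth checking whether it is a data error or a genuine event you want the model to know about.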
Engineer the Features
At this stage, you need to start assigning variables to your data. You need to factor in what will affect your data. Does a heat wave during the summer cause coffee sales to drop? Does the holiday season affect sales of high-end coffee in all stores and not just middle-to-high-income neighborhoods? Things like seasonal purchases become variables you need to account for.
You may have to modify certain variables you created in order to have a better prediction of sales. For example, maybe sales of high-end coffee aren’t an indicator of profits, but whether the store sells a lot of holiday merchandise is. You’d have to examine and tweak the variables that make the most sense to solve your problem.
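Seasonal effects like these usually become derived columns. Here is a minimal Python sketch; the function name, the feature names, and the month ranges (Northern Hemisphere summer, November and December as the holiday season) are assumptions for illustration.

```python
from datetime import date

def engineer_features(sale_date, units_sold, unit_price):
    """Derive simple seasonal and revenue features from one sales record."""
    return {
        "month": sale_date.month,
        "is_summer": sale_date.month in (6, 7, 8),
        "is_holiday_season": sale_date.month in (11, 12),
        "revenue": round(units_sold * unit_price, 2),
    }

# A made-up sales record from mid-December.
features = engineer_features(date(2023, 12, 15), units_sold=40, unit_price=12.5)
print(features)
```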
Create Your Predictive Models
At some point, you’ll have to come up with predictive models to support your hypotheses. For example, you’ll have to write code that shows how sales fluctuate when certain variables change. For Big Mart, your predictive models might include holidays and other times of the year when retail sales spike. You may explore whether an after-Christmas sale increases profits, and if so, by how much. You may find that a certain percentage of sales earn more money than other sales, given the volume and overall profit.
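One of the simplest predictive models is a straight-line fit. The Python sketch below uses ordinary least squares with a single holiday-week indicator; the data is made up, and a real Big Mart model would use many more variables and a proper train/test split.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data: 1 marks a holiday week, 0 a regular week.
is_holiday_week = [0, 0, 0, 1, 1, 0, 1, 0]
weekly_sales = [100, 104, 98, 150, 146, 102, 154, 96]

intercept, slope = fit_line(is_holiday_week, weekly_sales)
predicted_holiday = intercept + slope * 1
print(round(intercept, 1), round(slope, 1), round(predicted_holiday, 1))
```

With a 0/1 indicator, the intercept is the average regular-week sales and the slope is the estimated lift during holiday weeks, which answers the "by how much" question directly.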
Communicate Your Results
In the real world, all the analysis and technical results that you come up with are of little value unless you can explain to your stakeholders what they mean in a way that’s comprehensible and compelling. Data storytelling is a critical and underrated skill that you must develop. To finish your project, you’ll want to create a data visualization or a presentation that explains your results to non-technical folks.
Bonus: How Many Projects Should Be in a Data Science Portfolio?
Data scientist and Springboard mentor David Yakobovitch recently shared expertise on how to optimize a data science portfolio with our data science student community. Among the advice he shared were these tips:
For the Data Science Career Track, we have two capstones that students work on, so I like to say a minimum of two projects in your portfolio. Often when I work with students and they’ve finished the capstones and they’re starting the job search, I say, “Why not start a third project?” That could be using data sets on popular sites such as Kaggle or using a passion project you’re interested in or partnering with a non-profit.
When you’re doing these interviews, you want to have multiple projects you can talk about. If you’re just talking about one project for a 30- to 60-minute interview, it doesn’t give you enough material. So that’s why it’s great to have two or three, because you could talk about the whole workflow—and ideally, these projects work on different components of data science.
Learning the theory behind data science is an important part of the process. But project-based learning is the key to fully understanding the data science process. Springboard emphasizes data science projects in all three data science courses. The Data Science Career Track features 14 real-world projects, including two industry-worthy capstone projects.
Interested in a project-based learning program that comes with the support of a mentor? Check out our Data Science Career Track—you’ll learn the skills and get the personalized guidance you need to land the job you want.