In data science, one of the most important ways you can set yourself apart from other job seekers is by creating a top-notch portfolio. When executed well, your portfolio allows you to demonstrate not just your technical skills, but also your independent thinking, your ability to concisely present your work, and your overall preparedness to succeed in the role.

As we built our Data Science Career Track curriculum, we talked to many hiring managers to ensure that the course taught the in-demand skills companies value. We also made sure the curriculum gave our students an opportunity to showcase those skills. That’s why a key component of the course is working on not one, but two industry-grade capstone projects.

To complete these intensive projects (each takes at least 80 hours), students search for a public data set that excites them, develop a hypothesis, and build a project from the ground up. This provides essential experience not just with technical skills, like cleaning datasets, but also with soft skills, like staying humble and recognizing that your first hypothesis might not be correct.

Below are a few of the capstone projects that our mentors have flagged as standing out:

Who’s a good dog?

Garrick Chu chose to work on an image classification project, identifying dog breeds using neural networks. This project primarily leveraged Keras through Jupyter notebooks and tested the wide variety of skills commonly associated with neural networks and image data:

  • Dealing with large datasets
  • Effective processing of images (rather than traditional data structures)
  • Network design and tuning
  • Avoiding overfitting
  • Transfer learning (combining neural nets trained on different datasets)
  • Performing EDA to understand model outputs that people can’t directly interpret

Paul, the mentor for this project, called out: “Garrick nearly burnt out his computer’s GPU and still wasn’t getting the results he wanted, so he had to do more research, expand his toolkit, and learn how to implement transfer learning in order to make it work. In addition to this stretch, he went the extra mile in validation: because this is a learning task with no benchmark for human accuracy (the gold standard for AI performance measurement), once he had optimized the network to his satisfaction he went on to conduct original survey research in order to make a meaningful comparison. This shows creativity and drive (and showed that his network beat humans).”
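Garrick’s pipeline used Keras; as a library-agnostic sketch of the transfer-learning idea (learn a representation on one task, reuse it to bootstrap another), here is a toy scikit-learn version. The datasets, model, and feature choices below are illustrative assumptions, not his actual code.

```python
# Toy sketch of transfer learning (illustrative only, not Garrick's code):
# learn a feature representation on one task, reuse it on another.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# "Source" task: learn a compact representation from digits 0-4 only.
source = y < 5
pca = PCA(n_components=16).fit(X[source])

# "Target" task: classify digits 5-9 using the transferred representation.
target = ~source
X_tr, X_te, y_tr, y_te = train_test_split(
    pca.transform(X[target]), y[target], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

In the Keras setting, the analogue is freezing the convolutional base of a network pretrained on a large image dataset and training only a new classification head on the dog-breed images.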

See more of Garrick’s work here.

What does a 4-star rating on Yelp actually mean?

Diving deeper into Yelp reviews

Sam Gutentag demonstrated a sophisticated understanding of natural language processing by using an LDA topic model to infer sub-topics from written Yelp reviews. The goal: to help Yelp users make better-informed decisions, and to give restaurant owners and managers insight into where their business is leading or trailing customer expectations so they can make improvements that drive sales.

Simon, Sam’s mentor, noted: “Beyond the sophisticated understanding of natural language processing, this project stood out for me because it had such a strong use case. I could definitely see restaurant owners paying for a service that would tell them that their current, undifferentiated, 4-star rating is because they have 5-star food but 3-star service. For any business, this kind of model would allow them to identify and improve their deficiencies.”

See more of Sam’s work here.

What is my heart failure risk?

Modeling heart failure risk

Ginny (Jie) Zhu leveraged machine learning to predict heart failure onset risk based on electronic health records (EHRs). She noticed that many care management and population health analytics software solutions were based on “tip-of-the-iceberg data,” i.e., what is obtained from claims. But the depth and breadth of data extracted from EHRs were potentially far more impactful.

Amir, Ginny’s mentor, was impressed that his mentee “took the initiative to learn and apply a Bayesian method for hyper-parameter tuning in this case. She also picked up the PyTorch library, which was not covered in the course material, as an extra challenge in learning a new tool, and gained some perspective on the differences between deep learning libraries.”
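Bayesian hyper-parameter tuning of the sort described fits a cheap surrogate model to past trials and uses it to pick the next configuration to try. Below is a minimal, hand-rolled sketch using a Gaussian-process surrogate and an upper-confidence-bound rule; the dataset, model, and search range are assumptions, not Ginny’s EHR setup, and in practice one would typically reach for a library such as Optuna or scikit-optimize.

```python
# Tiny Bayesian-optimization loop over one hyper-parameter (illustrative).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

def objective(log_c):
    # Cross-validated accuracy for a given regularization strength.
    clf = LogisticRegression(C=10.0 ** log_c, max_iter=1000)
    return cross_val_score(clf, X, y, cv=3).mean()

rng = np.random.default_rng(0)
tried = list(rng.uniform(-4, 4, size=3))      # a few random warm-up trials
scores = [objective(c) for c in tried]

for _ in range(5):
    # Fit the surrogate to all (hyper-parameter, score) pairs seen so far.
    gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True)
    gp.fit(np.array(tried)[:, None], scores)

    # Try the candidate with the best optimistic estimate (mean + std).
    cand = np.linspace(-4, 4, 81)
    mu, sd = gp.predict(cand[:, None], return_std=True)
    nxt = float(cand[np.argmax(mu + sd)])
    tried.append(nxt)
    scores.append(objective(nxt))

best_log_c = tried[int(np.argmax(scores))]
```

The payoff over grid or random search is that each new trial is placed where the surrogate expects either a good score or high uncertainty, which matters when every trial means retraining an expensive model.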

See more of Ginny’s work here.

Who sings that song?

George Mendoza focused on applying natural language processing and machine learning to song lyrics from some of his favorite artists (everyone from Bob Dylan to Nas). The goal was to build a classifier that could accurately identify an artist from their lyrics. All data were pulled from the Genius API using R, and George used regex processing to clean up quirks in the data.
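George worked in R; a comparable sketch in Python looks like the following. The regex pattern, sample lyric snippets, and model choice are illustrative assumptions, not his actual pipeline.

```python
# Sketch: regex clean-up of lyric text, then a bag-of-words classifier.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def clean(lyric):
    # Drop section markers like [Chorus], collapse whitespace, lowercase.
    lyric = re.sub(r"\[.*?\]", " ", lyric)
    return re.sub(r"\s+", " ", lyric).strip().lower()

lyrics = [
    "[Verse 1] The answer is blowin' in the wind",
    "[Chorus] How many roads must a man walk down",
    "[Verse 1] I never sleep, 'cause sleep is the cousin of death",
    "[Hook] The world is yours, the world is yours",
]
artists = ["Dylan", "Dylan", "Nas", "Nas"]

vec = TfidfVectorizer()
X = vec.fit_transform(clean(l) for l in lyrics)
clf = MultinomialNB().fit(X, artists)
```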

Danny, who mentored George, said, “George’s capstones stand out for how interesting they are and for his insightful analysis of the results.”

See more of George’s work here.

How important is an NFL combine score?

The NFL combine

For the first of his two capstones, Paul Kim analyzed whether prospective football players would be successful in the NFL draft based on combine statistics. He broke the project into two major phases: an exploratory data analysis phase during which he identified performance benchmarks “so that both stakeholders have an empirical reference point for judging performance, rather than a heuristic one,” and a predictive phase in which he used a multiclass classifier to forecast which draft group a player would land in.
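The multiclass setup can be sketched on synthetic combine-style numbers; the features, the rule generating the draft groups, and the model below are all invented for illustration, not Paul’s actual data.

```python
# Toy multiclass classifier: predict a draft group from combine-style stats.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
forty = rng.normal(4.7, 0.3, n)        # 40-yard dash time (seconds)
vertical = rng.normal(33.0, 4.0, n)    # vertical jump (inches)
X = np.column_stack([forty, vertical])

# Synthetic rule: faster, more explosive players land in earlier groups.
athleticism = -3.0 * forty + 0.1 * vertical
group = np.digitize(athleticism, np.quantile(athleticism, [1 / 3, 2 / 3]))

X_tr, X_te, y_tr, y_te = train_test_split(X, group, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```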

Jeff, who worked with Paul, said: “He really understood the domain and ended up building multiple models based on position. Given this was his first real work with machine learning, this was a great first step.”

See more of Paul’s work here.

What’s the hold up?

Predicting pothole frequency in Chicago

A (nearly) lifelong Chicagoan, Melanie Hanna picked a topic close to home for her first capstone project: predicting pothole frequency in the city, along with the Department of Transportation’s response time. She wanted to find out whether any specific factor, such as neighborhood income, correlated with pothole creation or city response time. To analyze the data set, Melanie worked in Python, primarily using the scikit-learn and pandas packages.
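The income-correlation question reduces to a one-liner in pandas; the numbers below are made up for illustration, not Melanie’s actual Chicago data.

```python
# Sketch: does neighborhood income correlate with pothole frequency?
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["A", "B", "C", "D", "E"],
    "median_income": [38_000, 52_000, 67_000, 81_000, 95_000],
    "potholes_per_mile": [9.1, 7.4, 6.2, 4.8, 3.9],
})

# Pearson correlation between income and pothole density.
r = df["median_income"].corr(df["potholes_per_mile"])
```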

Rajiv, her mentor, called Melanie’s capstones “the best set of projects I have seen as a mentor.”

See more of Melanie’s work here. And check out this profile for more.

For a detailed annotation of an exemplary data science capstone project, check out this post.

Compelled to complete a capstone of your own? Interested in a career in data science? Check out Springboard’s Data Science Career Track today!