While data analysis is always technical (and sometimes a little repetitive), that doesn’t mean you can’t have some fun with it. The internet is a treasure trove of valuable information for aspiring data scientists. Playing around with existing online datasets is excellent practice: it’s risk-free, and it lets you learn by doing while breathing new life into your analytics experience. You’ll find various data-driven projects put together by experts and aficionados, many of them available in open-source communities like GitHub.
What’s more, you can easily find one that relates to your non-data-related hobbies and interests, from your favorite TV show to the 2020 election.
To spice things up a bit, we’ve turned to today’s pop culture hot topics. In this blog, you’ll find a list of free and public datasets that span from entertainment to animals to sports.
For a more tailored approach to your learning journey, we’ve also organized the datasets by four top skills that any data analyst would want to master: data cleaning, data visualization, machine learning, and data analysis.
Get started below!
University of Rochester’s Human-Computer Interaction lab, along with the Language Technologies Institute, created the first dataset for multimodal humor detection. Built on language, visual, and acoustic features, the UR-FUNNY dataset is a great jumping-off point for data cleaning. There is an original version and an updated version with noisy data instances removed, so a great exercise is to clean the original yourself, then compare your work to the available update.
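One way to run that comparison is to line up the record IDs you kept against the IDs that survive in the updated release. The sketch below is a minimal illustration, not the dataset authors’ method: the function name `compare_cleaning` and the integer IDs are hypothetical, and the real files may identify records differently.

```python
def compare_cleaning(my_kept, reference_kept):
    """Compare the record IDs you kept with those kept in a reference version."""
    mine, ref = set(my_kept), set(reference_kept)
    union = mine | ref
    return {
        "agreed": sorted(mine & ref),          # kept by both
        "only_mine": sorted(mine - ref),       # you kept it, the reference dropped it
        "only_reference": sorted(ref - mine),  # you dropped it, the reference kept it
        # Jaccard similarity: 1.0 means the two cleaned sets are identical
        "jaccard": len(mine & ref) / len(union) if union else 1.0,
    }


# Hypothetical IDs, for illustration only
my_ids = [1, 2, 3, 5]
reference_ids = [1, 2, 4, 5]

report = compare_cleaning(my_ids, reference_ids)
print(report)
```

The `only_mine` and `only_reference` lists are the interesting part: each disagreement is a record worth inspecting by hand to see whose cleaning rule was stricter.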
Video Game Culture Wars
Practice data cleaning by taking an existing dataset and setting your own rules for what belongs in it. Following the Gamergate controversy of a few years ago, 72 hours of tweets using the #gamergate hashtag were compiled in this spreadsheet.
Choose a path to take when working through the data, and get started on training yourself to automatically identify any irrelevant data and remove or replace it.
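As a starting point, here is a minimal sketch of one possible path: drop empty tweets, retweets, and repeated text. The column names (`tweet_id`, `user`, `text`) and the sample rows are assumptions made for illustration; check the real spreadsheet’s headers before adapting it.

```python
import csv
import io

# Hypothetical sample; the real spreadsheet's columns may be named differently.
SAMPLE = """tweet_id,user,text
1,alice,Thoughts on games journalism #gamergate
2,bob,RT @alice: Thoughts on games journalism #gamergate
3,carol,
4,dave,Thoughts on games journalism #gamergate
5,erin,  New   post about the controversy   #GamerGate
"""


def clean_tweets(rows):
    """Drop empty tweets, retweets, and duplicate texts; normalize whitespace."""
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace and trim the ends
        text = " ".join((row.get("text") or "").split())
        if not text:
            continue  # nothing to analyze
        if text.startswith("RT @"):
            continue  # retweets repeat someone else's content
        key = text.lower()
        if key in seen:
            continue  # same text already kept once
        seen.add(key)
        cleaned.append({**row, "text": text})
    return cleaned


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = clean_tweets(rows)
for row in cleaned:
    print(row["tweet_id"], row["text"])
```

Whether retweets count as irrelevant is a judgment call. If you want to measure how far a message spread, you would keep them and count them instead.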
Clever Weather Patterns
Brazil is the largest country in South America, with balmy temperatures and plenty of rain. Using this large dataset of hourly weather data from over 100 stations, strengthen your data cleaning abilities by reading through the data and deciding what to keep and what to delete.
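Deleting is not the only option: a bad reading can also be replaced. The sketch below assumes missing values are stored as a sentinel such as -9999 and that anything outside a plausible temperature range is a sensor error. The sentinel, the range limits, and the field names are all assumptions for illustration, so confirm them against the dataset’s documentation.

```python
from statistics import median

# Assumption: missing readings are encoded with a sentinel value.
MISSING = -9999.0

# Hypothetical hourly readings from one station
readings = [
    {"station": "A001", "hour": 0, "temp_c": 24.1},
    {"station": "A001", "hour": 1, "temp_c": MISSING},
    {"station": "A001", "hour": 2, "temp_c": 23.5},
    {"station": "A001", "hour": 3, "temp_c": 71.0},  # implausible for air temperature
    {"station": "A001", "hour": 4, "temp_c": 22.9},
]


def clean_temperatures(rows, low=-30.0, high=55.0):
    """Replace out-of-range readings with the median of the valid ones."""
    valid = [r["temp_c"] for r in rows if low <= r["temp_c"] <= high]
    fill = median(valid)
    cleaned = []
    for r in rows:
        ok = low <= r["temp_c"] <= high
        cleaned.append({
            **r,
            "temp_c": r["temp_c"] if ok else fill,
            "imputed": not ok,  # flag replaced values so they stay traceable
        })
    return cleaned


cleaned_readings = clean_temperatures(readings)
for r in cleaned_readings:
    print(r)
```

Keeping an `imputed` flag means later analysis can still exclude the replaced values if median filling turns out to distort the results.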
Global Warming Trends
Climate change is a hot-button topic these days, and there are many resources out there for you to actively explore. This dataset reports land and ocean temperatures broken down by country, state, and major city, along with weather observations.
This data is already cleaned and packaged, making it a great starting point for analysis. For data that dives deeper into global surface temperature anomalies, you can visit here. For practice, create a line graph that shows temperature changes over time.
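A line graph of temperature over time usually starts with collapsing the raw records into one value per year. The sketch below does that with made-up sample records, then offers an optional plotting helper. It assumes matplotlib is installed if you call `plot_yearly`; the record layout and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (date, average temperature in degrees C) records
records = [
    ("1990-01-01", 14.0), ("1990-07-01", 15.0),
    ("1991-01-01", 14.2), ("1991-07-01", 15.2),
    ("1992-01-01", 14.5), ("1992-07-01", 15.5),
]


def yearly_means(rows):
    """Average all readings that fall in the same calendar year."""
    by_year = defaultdict(list)
    for date, temp in rows:
        by_year[int(date[:4])].append(temp)
    return {year: sum(v) / len(v) for year, v in sorted(by_year.items())}


def plot_yearly(means):
    """Draw the line graph. Requires matplotlib, imported only when called."""
    import matplotlib.pyplot as plt

    years = list(means)
    plt.plot(years, [means[y] for y in years], marker="o")
    plt.xlabel("Year")
    plt.ylabel("Mean temperature (degrees C)")
    plt.title("Temperature change over time")
    plt.show()


means = yearly_means(records)
print(means)
# plot_yearly(means)  # uncomment to draw the chart
```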
Did you know that you can use data analytics to win all your Bachelor pools next season? Just take a cue from this guy. Break down the data to identify the attributes that past winners share, and look for trends that can pinpoint from the start who will find love.
This article also shows how the avid viewer who created the dataset utilized data visualization to communicate his findings. Continue his work to enhance your abilities—and maybe even outsmart your friends during Bachelor wine night.
A Smarter Way to Play Fantasy Football
Exercise your data visualization skills while keeping tabs on your favorite fantasy football team. You can discover patterns in The Football Database that can help you decide whom to include in your starting lineup.
From there, create graphs to plot relevant data points to present to the rest of your league to boost everyone’s experience. Refer to the graphical representations you’ve created to improve your performance each season.
POTUS’s Twitter Account
Try to create a graphical representation of Donald Trump’s Twitter activity based on this dataset. Analyze the data to discover patterns in sentiment, word priority, active hours and days of the week, and more.
Once you have the answers you’re looking for, you can play around by creating graphics that display what you’ve gathered.
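Active hours and days of the week are the easiest patterns to extract, since they only need the timestamp. Here is a minimal sketch using hypothetical timestamps; the timestamp format is an assumption and will likely need adjusting to match the dataset.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

# Hypothetical timestamps, for illustration only
timestamps = [
    "2020-01-06 06:15:00",  # a Monday
    "2020-01-06 06:45:00",
    "2020-01-07 22:10:00",  # a Tuesday
    "2020-01-11 06:05:00",  # a Saturday
]


def activity_profile(stamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count tweets per hour of day and per weekday."""
    parsed = [datetime.strptime(s, fmt) for s in stamps]
    by_hour = Counter(dt.hour for dt in parsed)
    # weekday() returns 0 for Monday through 6 for Sunday
    by_day = Counter(DAY_NAMES[dt.weekday()] for dt in parsed)
    return by_hour, by_day


by_hour, by_day = activity_profile(timestamps)
print("Busiest hour:", by_hour.most_common(1))
print("Busiest day:", by_day.most_common(1))
```

The two counters map directly onto bar charts, one with 24 bars and one with seven.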
Who Rules the Kardashians?
If you’re a fan of reality TV’s most powerful family, build up your data visualization prowess by sharing who the most famous Kardashian actually is—with data! The data is already out there to explore tendencies within the family and their relationship with the media.
You can study and organize this data to create visual graphics that can communicate who really takes the cake amongst the Calabasas queens.
Grocery Shopping: 2020 Edition
Instacart is a popular grocery delivery service in the United States and Canada. If you’re looking to practice machine learning with a fun topic, this website provides data on more than 3 million grocery orders.
This dataset is excellent for testing models that predict future orders, repeat purchases, and user habits.
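Before training anything, it helps to have a simple baseline to beat. The sketch below is a frequency baseline, not a machine learning model: it predicts that a shopper will reorder any product that appeared in at least half of their past orders. The tuple layout, the function names, and the 0.5 threshold are hypothetical choices for illustration.

```python
from collections import defaultdict

# Hypothetical order history: (user_id, order_number, product)
orders = [
    (1, 1, "bananas"), (1, 1, "milk"),
    (1, 2, "bananas"), (1, 2, "bread"),
    (1, 3, "bananas"), (1, 3, "milk"),
    (2, 1, "milk"), (2, 2, "eggs"),
]


def reorder_rates(rows):
    """Fraction of each user's orders that contained each product."""
    orders_per_user = defaultdict(set)
    orders_with_product = defaultdict(set)
    for user, order_no, product in rows:
        orders_per_user[user].add(order_no)
        orders_with_product[(user, product)].add(order_no)
    return {
        (user, product): len(order_nos) / len(orders_per_user[user])
        for (user, product), order_nos in orders_with_product.items()
    }


def predict_next_basket(rows, user, threshold=0.5):
    """Baseline: predict products bought in at least `threshold` of past orders."""
    rates = reorder_rates(rows)
    return sorted(
        product
        for (u, product), rate in rates.items()
        if u == user and rate >= threshold
    )


print(predict_next_basket(orders, user=1))
```

Any trained model should be compared against a baseline like this one. If it cannot beat "people buy what they usually buy," the extra complexity is not paying off.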
Demystify the TikTok Algorithm
TikTok is slowly taking over the world. Active users have discovered the many communities within TikTok, including “Alt TikTok,” “Basket Weaving TikTok,” “Boomer TikTok,” “Frog TikTok”…the list goes on!
You can use datasets to study the algorithm and see how different interactions affect what is delivered to the user, which is a good way to understand how machine learning works in practice.
Here are a few datasets that can supply useful data about TikTok:
- TikTok Video Comments
- OTT Consumption Profile
- TikTok Revenue and Usage Statistics
- Chinese Social Media Trends
- TikTok Statistics
Million Song Dataset
For any pop or contemporary music fans out there, this dataset was created to encourage research on algorithms that scale to commercial sizes. It contains derived features from a million contemporary popular music tracks, which can serve as the foundation for your predictive analysis of what will, or won’t, be a hit.
New York City Squirrel Census
Yep, you read that right. A data enthusiast and native New Yorker, along with over 300 volunteers, counted and observed the squirrels living in the city, gathering an immense amount of data that can be found here.
A key skill in data analysis is asking the right questions, and this dataset is a great tool for practicing it: study the data and come up with questions the squirrel census can answer. Some might include their most frequented bodega trash cans, most popular coat patterns, or where they summer.
Data from all seven generations of Pokémon has been scraped here, including base stats, height, weight, abilities, and more.
The dataset was built to answer questions such as which Pokémon types are weakest and strongest, and to identify legendary Pokémon. You can easily come up with a few questions of your own that the given information can answer, and practice your analytics skills along the way.
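The weakest-and-strongest question can be answered with a simple group-and-average. The sketch below uses a handful of illustrative rows rather than the full dataset, and the field names (`name`, `type1`, `base_total`) are assumptions, so check them against the actual file.

```python
from collections import defaultdict
from statistics import mean

# Small illustrative sample; field names are assumptions about the real file.
pokemon = [
    {"name": "Bulbasaur", "type1": "grass", "base_total": 318},
    {"name": "Venusaur", "type1": "grass", "base_total": 525},
    {"name": "Charmander", "type1": "fire", "base_total": 309},
    {"name": "Charmeleon", "type1": "fire", "base_total": 405},
    {"name": "Squirtle", "type1": "water", "base_total": 314},
    {"name": "Blastoise", "type1": "water", "base_total": 530},
    {"name": "Mewtwo", "type1": "psychic", "base_total": 680},
]


def rank_types(rows):
    """Return (type, mean base total) pairs, strongest first."""
    totals = defaultdict(list)
    for row in rows:
        totals[row["type1"]].append(row["base_total"])
    averages = {ptype: mean(values) for ptype, values in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranking = rank_types(pokemon)
for ptype, avg in ranking:
    print(f"{ptype:8s} {avg:.1f}")
```

With a sample this small, one very strong Pokémon decides the ranking for its whole type, which is itself a useful lesson: check how many rows sit behind each average before trusting it.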
Ever wonder which Hogwarts House you’d be sorted into? Trying to decide your favorite character? Use these Harry Potter datasets to extract a definitive answer. Here are some favorites:
- This dataset provides a detailed list of each movie’s characters and their demographic information
- This dataset dives deep into language processing and sentiment analysis within the movies
- If you want to go beyond the books, use this dataset of 111,963 Potter fanfiction titles, authors, and summaries
Datasets for Dog Lovers
Becoming a dog owner requires extensive research and preparation. Use this data gathered in Germany to practice your analysis skills and answer frequent dog-related questions, such as which climate different breeds thrive in and which dogs are best with children.
Any of the above datasets can be a perfect way to find new inspiration within the data science world. In such a dynamic industry, it’s important to stay sharp. Practicing without pressure is a surefire way to boost your skills on your own.