{"id":9744,"date":"2022-04-12T03:39:01","date_gmt":"2022-04-12T10:39:01","guid":{"rendered":"https:\/\/www.springboard.com\/?p=9744"},"modified":"2023-08-25T08:28:59","modified_gmt":"2023-08-25T15:28:59","slug":"15-fun-datasets-to-analyze","status":"publish","type":"post","link":"https:\/\/www.springboard.com\/blog\/data-science\/15-fun-datasets-to-analyze\/","title":{"rendered":"19 Fun Data Sets to Analyze and Level Up Your Portfolio"},"content":{"rendered":"\n<p>While data analysis is always technical (and sometimes even a little bit repetitive), you can still have fun with it. Playing around with existing online datasets is great practice, and you\u2019ll find various data-driven projects put together by experts and aficionados, many of them available in open-source communities like GitHub.<\/p>\n\n\n\n<p>What\u2019s more, you can easily find data sets that relate to your non-data-related hobbies and interests, from your favorite TV show to tracking the 2020 election.<\/p>\n\n\n\n<p>In this blog post, we\u2019ll cover some of the fun datasets you can use to hone your skills. All of them are free and publicly available, and they range from entertainment to animals to sports. For a more tailored approach to your learning journey, we\u2019ve also organized the data sets into four top skills that all data analysts should master: data cleaning, data visualization, machine learning, and data analysis.<\/p>\n\n\n\n<p>Get started below!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What Is a \u201cFun\u201d Data Set?<\/h2>\n\n\n\n<p>\u201cFun\u201d datasets concern topics that are of personal interest, and can be used to answer unexpected questions and explore relationships that aren\u2019t immediately intuitive. Perhaps you start with a question or hypothesis, and then <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/15-fun-datasets-to-analyze\/\" target=\"_blank\" rel=\"noreferrer noopener\">find a dataset<\/a> to prove (or disprove) your theory. 
Or, you might even generate your own dataset using web scraping techniques or an open API. In fact, creating your own dataset gives you hands-on practice collecting, labeling, and cleaning data.&nbsp;<\/p>\n\n\n\n<p>Working with fun datasets will make your <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/data-science-portfolio\/\" target=\"_blank\" rel=\"noreferrer noopener\">data science portfolio<\/a> more eye-catching to employers, who have probably seen their fair share of Netflix-inspired recommendation engines and Twitter sentiment analysis projects.&nbsp;<\/p>\n\n\n\n<p>Play with your data as much as you can before you begin your analysis. See if you can cross-reference two different datasets to compare different variables. For example: how do rising gas prices affect hotel occupancy in different parts of the country?<\/p>\n\n\n\n<p>The dataset should be rich enough to let you play with it and derive patterns. As a rule of thumb, look for at least a few thousand rows, 20-25 columns, and a reasonable mix of continuous and categorical variables.&nbsp;<\/p>\n\n\n\n<p>These datasets can be a perfect way to find new inspiration within the <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/data-science-definition\/\">data science<\/a> world. In such a dynamic industry, it\u2019s important to stay sharp. 
Practicing without pressure is a surefire way to boost your skills on your own.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Fun Data Sets To Analyze<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Data Cleaning<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Funny Data<\/strong><\/h4>\n\n\n\n<p>University of Rochester\u2019s Human-Computer Interaction lab, along with the Language Technologies Institute, has created the <a href=\"https:\/\/github.com\/ROC-HCI\/UR-FUNNY\" target=\"_blank\" rel=\"noreferrer noopener\">first dataset for multimodal humor detection<\/a>. Combining language, visual, and acoustic features, the UR-FUNNY data set is a great jumping-off point for data cleaning. An updated version removed noisy data instances, so a great exercise would be to clean the original version, then compare your work to the available updates.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Video Game Culture Wars<\/strong><\/h4>\n\n\n\n<p>Practice data cleaning by using an existing dataset and implementing your own limits. After the Gamergate controversy erupted in 2014, tweets from a 72-hour window were compiled into <a href=\"http:\/\/waxy.org\/random\/misc\/gamergate_tweets.csv\" target=\"_blank\" rel=\"noreferrer noopener\">this spreadsheet<\/a>. Choose a path when working through the data, and get started on training yourself to automatically identify any irrelevant data.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Clever Weather Patterns<\/strong><\/h4>\n\n\n\n<p>Brazil, the largest country in South America, has balmy temperatures and plenty of rain. 
Using <a href=\"https:\/\/www.kaggle.com\/PROPPG-PPG\/hourly-weather-surface-brazil-southeast-region\" target=\"_blank\" rel=\"noreferrer noopener\">this large dataset<\/a> on hourly weather data from over 100 stations, strengthen your data cleaning abilities by reading through the data, and understanding what to keep and what to delete.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Trending Shows on Streaming Platforms<\/strong><\/h4>\n\n\n\n<figure class=\"wp-block-image size-full is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"803\" height=\"521\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/streaming-platforms.png\" alt=\"data sets to analyze\" class=\"wp-image-17401\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/streaming-platforms.png 803w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/streaming-platforms-380x247.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/streaming-platforms-380x247.png 420w\" sizes=\"(max-width: 803px) 100vw, 803px\" \/><\/figure>\n\n\n\n<p>With so many streaming platforms to choose from, viewers have plenty of choices. From new releases to enduring favorites, the most-streamed shows make for an ever-changing dataset, and are often reflective of the current cultural zeitgeist (remember when Tiger King inspired all those <a href=\"https:\/\/time.com\/5810608\/tiger-king-memes\/\" target=\"_blank\" rel=\"noreferrer noopener\">pandemic-related memes?<\/a>). 
Using <a href=\"https:\/\/www.kaggle.com\/prasertk\/netflix-daily-top-10-in-us\" target=\"_blank\" rel=\"noreferrer noopener\">this dataset<\/a> on Netflix&#8217;s top 10 shows from March 2020 to March 2022, you can analyze what people were binge-watching throughout the COVID-19 pandemic.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Data Visualization<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">LEGO Bricks Data<\/h4>\n\n\n\n<figure class=\"wp-block-image size-full is-style-rounded\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"667\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/lego-bricks.jpeg\" alt=\"data sets to analyze\" class=\"wp-image-17409\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/lego-bricks.jpeg 1000w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/lego-bricks-380x253.jpeg 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/lego-bricks-380x253.jpeg 420w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p><a href=\"https:\/\/www.kaggle.com\/rtatman\/lego-database\" target=\"_blank\" rel=\"noreferrer noopener\">This dataset<\/a> was originally compiled to help people figure out how to repurpose the <a href=\"https:\/\/www.kaggle.com\/rtatman\/lego-database\" target=\"_blank\" rel=\"noreferrer noopener\">LEGO sets<\/a> they already own. The data contains the LEGO parts, sets, colors, and inventories of every official LEGO set in the Rebrickable database. While the data is current as of July 2017, you can use the <a href=\"https:\/\/rebrickable.com\/api\/\" target=\"_blank\" rel=\"noreferrer noopener\">Rebrickable API<\/a> to find more recent data. Using this dataset, you can explore questions such as: What sets have the most used pieces in them? What are the rarest LEGO pieces? 
How have the sizes of LEGO sets changed over time?&nbsp;<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">The Nutritional Value of Starbucks Drinks<\/h4>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"1730\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/the-nutritional-value-of-starbucks-drinks.jpeg\" alt=\"data sets to analyze\" class=\"wp-image-17410\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/the-nutritional-value-of-starbucks-drinks.jpeg 1200w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/the-nutritional-value-of-starbucks-drinks-380x548.jpeg 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/the-nutritional-value-of-starbucks-drinks-380x548.jpeg 420w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><\/figure>\n\n\n\n<p>Ever wondered how much sugar and fat goes into <a href=\"https:\/\/www.kaggle.com\/starbucks\/starbucks-menu\" target=\"_blank\" rel=\"noreferrer noopener\">your favorite coffee drinks<\/a>? Because of branding, it\u2019s easy to assume that food items from Starbucks are healthier than McDonald\u2019s, but you can\u2019t know that for sure without digging into the data. <a href=\"https:\/\/www.kaggle.com\/starbucks\/starbucks-menu\" target=\"_blank\" rel=\"noreferrer noopener\">This dataset<\/a> from Kaggle contains nutrition facts for menu items from both Starbucks and McDonald\u2019s. 
You can use one or both sets of data to compare the nutritional values of similar food and drink items and visualize your findings.&nbsp;<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Global Warming Trends<\/strong><\/h4>\n\n\n\n<figure class=\"wp-block-image size-full is-style-rounded\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"496\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/global-warming-trends.png\" alt=\"data sets to analyze: Global Warming Trends\" class=\"wp-image-16920\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/global-warming-trends.png 800w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/global-warming-trends-380x236.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/global-warming-trends-380x236.png 420w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p><a href=\"https:\/\/www.kaggle.com\/berkeleyearth\/climate-change-earth-surface-temperature-data\" target=\"_blank\" rel=\"noreferrer noopener\">This dataset<\/a> by data science nonprofit Berkeley Earth reports on how land and ocean temperatures vary by location. This data is already cleaned and packaged, making it a great starting point for data analysis. For data that dives deeper into global surface temperature anomalies, you can visit <a href=\"https:\/\/www.globalchange.gov\/browse\/datasets\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. Try creating a line graph to show temperature changes over time.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Bachelor Winners<\/strong><\/h4>\n\n\n\n<p>Did you know that you can use <a href=\"https:\/\/www.springboard.com\/courses\/data-analytics-career-track\/\" target=\"_blank\" rel=\"noreferrer noopener\">data analytics<\/a> to figure out who will win The Bachelor next season? 
<a href=\"https:\/\/www.vice.com\/en\/article\/qvdbem\/using-data-to-predict-this-seasons-winner-of-the-bachelor\" target=\"_blank\" rel=\"noreferrer noopener\">This article<\/a> also shows how an avid viewer created a dataset on the demographic data of Bachelor contestants, and utilized data visualization to communicate his findings. Break down the data to take note of the winners\u2019 shared attributes and find any trends that can pinpoint from the start who will find love. Maybe you\u2019ll even outsmart your friends during your next Bachelor wine night.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>A Smarter Way to Play Fantasy Football<\/strong><\/h4>\n\n\n\n<p>Exercise your data visualization skills while keeping tabs on your favorite fantasy football team. You can discover patterns in <a href=\"https:\/\/www.footballdb.com\/fantasy-football\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">The Football Database<\/a> that can help decide your starting lineup. From there, create graphs to plot relevant data points to present to the rest of your league to boost everyone\u2019s experience. 
Refer to the graphical representations you\u2019ve created to improve your performance each season.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>POTUS\u2019s Twitter Account<\/strong><\/h4>\n\n\n\n<p>Try and create a graphical representation of Donald Trump\u2019s Twitter account based on <a href=\"https:\/\/www.kaggle.com\/rahulanand0070\/trump-tweet-analysis\" target=\"_blank\" rel=\"noreferrer noopener\">this dataset<\/a>. Analyze the data to discover patterns within sentiment, word priority, active hours and days of the week, and more. Once you have the answers you\u2019re looking for, you can play around by creating graphics that display what you\u2019ve gathered.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Who Rules the Kardashians?<\/strong><\/h4>\n\n\n\n<figure class=\"wp-block-image size-full is-style-rounded\"><img loading=\"lazy\" decoding=\"async\" width=\"1910\" height=\"1246\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/screenshot-2022-04-08-at-4.22.45-pm.png\" alt=\"data sets to analyze\" class=\"wp-image-17411\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/screenshot-2022-04-08-at-4.22.45-pm.png 1910w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/screenshot-2022-04-08-at-4.22.45-pm-380x248.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/screenshot-2022-04-08-at-4.22.45-pm-380x248.png 420w\" sizes=\"(max-width: 1910px) 100vw, 1910px\" \/><\/figure>\n\n\n\n<p>If you\u2019re a fan of reality TV\u2019s most powerful family, build up your data visualization prowess by figuring out who the most famous Kardashian actually is. 
The data, <a href=\"https:\/\/www.datacamp.com\/projects\/538\" target=\"_blank\" rel=\"noreferrer noopener\">contained in this tutorial<\/a>, lets you explore tendencies within the family and their relationship with the media.<\/p>\n\n\n\n<p>You can study and organize this data to create visual graphics that communicate who takes the cake amongst the Calabasas queens.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Machine Learning<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Fake Job Posts<\/strong><\/h4>\n\n\n\n<figure class=\"wp-block-image size-full is-style-rounded\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"496\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/machine-learning.png\" alt=\"data sets to analyze: Machine Learning\" class=\"wp-image-16921\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/machine-learning.png 800w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/machine-learning-380x236.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/machine-learning-380x236.png 420w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p>Scammers use fake job posts to steal people\u2019s identities by posting unusually enticing job descriptions, and then requiring applicants to provide their Social Security numbers and personal details upfront, ostensibly so they can be considered for an interview. <a href=\"https:\/\/www.kaggle.com\/shivamb\/real-or-fake-fake-jobposting-prediction\" target=\"_blank\" rel=\"noreferrer noopener\">This Kaggle dataset<\/a> compiled by data scientist Shivam Bansal contains 18,000 job descriptions, of which about 800 are fake. The data consists of both textual information and meta-information about the job posts. 
You can use the data to create classification models that predict whether a job post is fraudulent or real.&nbsp;<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Jeopardy! Questions<\/strong><\/h4>\n\n\n\n<p>If you\u2019re ready to take on an advanced machine learning project, <a href=\"https:\/\/www.kaggle.com\/tunguz\/200000-jeopardy-questions\" target=\"_blank\" rel=\"noreferrer noopener\">this Kaggle dataset by data scientist Bojan Tunguz<\/a> contains over 200,000 questions from the popular game show Jeopardy!, and can be used for multiple purposes. For example, you can run classification algorithms to predict the category or dollar value of the question. Or, you can take things up a notch and train a BERT model, a language model for natural language processing (NLP).&nbsp;<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Million Song Dataset<\/strong><\/h4>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"620\" height=\"300\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/musicscience1.jpeg\" alt=\"data sets to analyze\" class=\"wp-image-17405\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/musicscience1.jpeg 620w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/musicscience1-380x184.jpeg 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/musicscience1-380x184.jpeg 420w\" sizes=\"(max-width: 620px) 100vw, 620px\" \/><\/figure>\n\n\n\n<p>Pop and contemporary music fans will enjoy <a href=\"http:\/\/millionsongdataset.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">this dataset<\/a>, which was created under a grant from the National Science Foundation to encourage research on algorithms that scale to commercial sizes. 
Derived features are taken from a million contemporary popular music tracks that can serve as the foundation for your predictive analysis of what will\u2014or won\u2019t\u2014be a hit.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Data Analysis<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>New York City Squirrel Census<\/strong><\/h4>\n\n\n\n<p>A native New Yorker data enthusiast, with the help of over 300 volunteers, counted and observed the squirrels living in the city\u2014all to gather an immense amount of data that can be found <a href=\"https:\/\/github.com\/rfordatascience\/tidytuesday\/tree\/master\/data\/2019\/2019-10-29\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/p>\n\n\n\n<p>Knowing how to ask the right questions is an important data analytics skill, and this dataset can be a great tool to study and come up with questions that can be answered with this squirrel census. Some might include their most frequented bodega trash cans, most popular coat patterns, or where they summer.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Bigfoot Sightings<\/strong><\/h4>\n\n\n\n<p>Despite limited physical evidence attesting to the existence of Bigfoot, about <a href=\"https:\/\/civicscience.com\/bigfoot-is-real-for-11-of-u-s-adults\/\" target=\"_blank\" rel=\"noreferrer noopener\">11%<\/a> of US adults believe the eight-foot-tall, ape-like creature is real. <a href=\"https:\/\/data.world\/timothyrenner\/bfro-sightings-data\" target=\"_blank\" rel=\"noreferrer noopener\">This dataset<\/a> from the Bigfoot Field Researchers Organization (BFRO), an organization dedicated to investigating the Bigfoot mystery, contains publicly available sighting data in a digestible form. 
You can use the data to analyze geographical and meteorological trends associated with Bigfoot sightings and the types of evidence compiled (e.g., direct sightings, noises, and tracks).&nbsp;<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Pok\u00e9mon<\/strong><\/h4>\n\n\n\n<p>Data pulled from all seven generations of Pok\u00e9mon has been compiled <a href=\"https:\/\/www.kaggle.com\/rounakbanik\/pokemon\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>, including base stats, height, weight, abilities, and more. You can use the dataset to find the weakest and strongest types of Pok\u00e9mon, and to pick out legendary Pok\u00e9mon. You can easily come up with a few questions that can be answered from the given information and practice your analytics skills.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Harry Potter<\/strong><\/h4>\n\n\n\n<figure class=\"wp-block-image size-full is-style-rounded\"><img loading=\"lazy\" decoding=\"async\" width=\"1668\" height=\"752\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/harry-potter1-disneyscreencaps-com-5582.jpeg\" alt=\"\" class=\"wp-image-17408\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/harry-potter1-disneyscreencaps-com-5582.jpeg 1668w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/harry-potter1-disneyscreencaps-com-5582-380x171.jpeg 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/04\/harry-potter1-disneyscreencaps-com-5582-380x171.jpeg 420w\" sizes=\"(max-width: 1668px) 100vw, 1668px\" \/><\/figure>\n\n\n\n<p>Ever wonder which Hogwarts House you\u2019d be sorted into? Trying to decide your favorite character? Use these Harry Potter datasets to extract a definitive answer. 
Here are our favorites:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.kaggle.com\/gulsahdemiryurek\/harry-potter-dataset\" target=\"_blank\" rel=\"noreferrer noopener\">This dataset<\/a> provides a detailed list of each movie\u2019s characters and their demographic information.<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/mhl343\/HarryPotterAnalysis\" target=\"_blank\" rel=\"noreferrer noopener\">This dataset<\/a> dives deep into language processing and sentiment analysis within the movies.<\/li>\n\n\n\n<li>If you want to go beyond the books, use <a href=\"https:\/\/github.com\/janelleshane\/harry-potter-fanfic-dataset\" target=\"_blank\" rel=\"noreferrer noopener\">this data set<\/a> of 111,963 Potter fanfiction titles, authors, and summaries.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Datasets for Dog Lovers<\/strong><\/h4>\n\n\n\n<p>Becoming a dog owner requires extensive research and preparation. Use <a href=\"https:\/\/www.kaggle.com\/kmader\/dogs-of-zurich\" target=\"_blank\" rel=\"noreferrer noopener\">this data gathered in Zurich, Switzerland<\/a> to practice your analysis skills and answer frequent dog-related questions. Some examples include: What breeds thrive in which climates? And what dogs are best with children?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Top 6 Sources To Find Data Sets<\/h2>\n\n\n\n<p>Even if you\u2019ve never worked on a paid data science project before, the internet has plenty of publicly available data that you can use for your personal projects. And with those projects, you can build a stellar portfolio. Here is a list of sources where you can find free, publicly available datasets on everything from crime to science, politics, and more.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. 
<a href=\"https:\/\/careerfoundry.com\/en\/blog\/data-analytics\/where-to-find-free-datasets\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>UCI Machine Learning Repository<\/strong><\/a><\/h3>\n\n\n\n<p>The UCI Machine Learning Repository by the University of California Irvine contains over 600 datasets on everything from <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Bone+marrow+transplant%3A+children\" target=\"_blank\" rel=\"noreferrer noopener\">bone marrow transplants in children<\/a> to data on <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Auto+MPG\" target=\"_blank\" rel=\"noreferrer noopener\">automobile fuel efficiency<\/a>. Best of all, the datasets are categorized by task (e.g., classification, regression, or clustering), data type, and area of interest.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2. <a href=\"https:\/\/github.com\/awesomedata\/awesome-public-datasets\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>GitHub\u2019s Awesome-Public-Datasets<\/strong><\/a><\/h3>\n\n\n\n<p>This GitHub repository contains a long list of high-quality datasets, from agriculture, to entertainment, to social networks and neuroscience. Working with these datasets is a good way to build your skills as an aspiring <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/what-does-a-data-scientist-do\/\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/what-does-a-data-scientist-do\/\">data scientist<\/a>. You can join the associated AwesomeData <a href=\"https:\/\/awesomedataworld.slack.com\/join\/shared_invite\/zt-dllew5xy-PJYi~mWUdY3hupohbmVZsA#\/shared-invite\/email\" target=\"_blank\" rel=\"noreferrer noopener\">Slack channel<\/a> to ask questions about the data or contribute your own dataset.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3. 
<a href=\"https:\/\/www.pewresearch.org\/internet\/datasets\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Pew Research Center<\/strong><\/a><\/h3>\n\n\n\n<p>If your area of interest is culture, sociology, and current events, visit the Pew Research Center\u2019s data repository, which contains datasets and surveys covering media consumption, social media use, and demographic trends. Each dataset comes with the reports that were based on it, which can be a good starting point for your own analysis.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. <a href=\"https:\/\/github.com\/orgs\/BuzzFeedNews\/repositories?type=all\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>BuzzFeed News Github<\/strong><\/a><strong>&nbsp;<\/strong><\/h3>\n\n\n\n<p>BuzzFeed News has emerged as a credible news source with its hard-hitting investigative journalism. Here, you can access the data repositories used in some of the top investigative stories published on BuzzFeed News, including data on firearm background checks, political campaign donors, gentrification, and more.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5. <a href=\"https:\/\/data.fivethirtyeight.com\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>FiveThirtyEight<\/strong><\/a><\/h3>\n\n\n\n<p>An award-winning data journalism website, FiveThirtyEight makes its datasets publicly available. The datasets are highly curated and some of them come with the code associated with the visualizations and graphics used in the original news article. If you\u2019re interested in analyzing data about current events, FiveThirtyEight datasets are added several times a day and are meant to answer some of the most pressing questions of the day.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6. 
<a href=\"https:\/\/data.world\/search\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Data.world<\/strong><\/a><strong>&nbsp;<\/strong><\/h3>\n\n\n\n<p>Data.world is a data catalog service (like a search engine for datasets) and is home to the world\u2019s largest collaborative data community, which is free and open to the public. Anyone can use data.world to create a workspace or project that hosts a dataset, and you can share your analysis with the community to get feedback on your work.&nbsp;<\/p>\n\n\n\n<p><em>Related Read: <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/free-public-data-sets-data-science-project\/\" target=\"_blank\" rel=\"noreferrer noopener\">15 Free Data Sets for Your Next Project or Portfolio<\/a><\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">FAQs About Analyzing Data Sets<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>How Big Is a Data Set?<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image size-full is-style-rounded\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"496\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/faqs-about-analyzing-data-sets.png\" alt=\"data sets to analyze: FAQs About Analyzing Data Sets\" class=\"wp-image-16924\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/faqs-about-analyzing-data-sets.png 800w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/faqs-about-analyzing-data-sets-380x236.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2022\/03\/faqs-about-analyzing-data-sets-380x236.png 420w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p>Datasets used for analytics vary in size. 
A <a href=\"https:\/\/www.kdnuggets.com\/2015\/11\/big-ram-big-data-size-datasets.html#:~:text=The%20dataset%20sizes%20vary%20over,in%20the%20many%20Petabytes%20range.\" target=\"_blank\" rel=\"noreferrer noopener\">2015 poll<\/a> by KDnuggets found that most users worked with datasets between 10 megabytes and 10 terabytes in size, with a minority of users tackling petabyte-sized datasets. Generally speaking, a larger dataset is more likely to be representative, provided it is sampled without bias, which matters especially when training machine learning models.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>What Is the Process To Analyze a Data Set?<\/strong><\/h3>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<ol class=\"wp-block-list\">\n<li><strong>Define the Problem<br><\/strong>Start by defining a question you want to answer. Business problems can be quite open-ended. The question \u201cWhy are we losing customers?\u201d can have multiple answers, so it helps to narrow the problem using contextual information. For example, you might decide to use data to investigate which factors are negatively impacting the customer experience.&nbsp;<\/li>\n<\/ol>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li><strong>Collect Your Data<\/strong><br>Once you\u2019ve established your objective, you\u2019ll need to create a strategy for aggregating the appropriate data. This might include quantitative (numeric) data, such as sales figures, or qualitative (descriptive) data, such as customer reviews. 
Then, you\u2019ll use a <a href=\"https:\/\/www.springboard.com\/blog\/data-analytics\/data-analytics-tools\/\" target=\"_blank\" rel=\"noreferrer noopener\">data management platform<\/a> to collect and analyze data from numerous sources, such as your organization\u2019s CRM (Customer Relationship Management) tool.&nbsp;<\/li>\n<\/ol>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\">\n<li><strong>Clean Your Data<\/strong><br>Cleaning your data, also called wrangling, involves removing duplicates, handling missing values, and eliminating redundancies that create noise in your data. Doing this transforms raw data into a usable format for analysis. The amount of <a href=\"https:\/\/www.springboard.com\/blog\/data-analytics\/data-cleaning\/\" target=\"_blank\" rel=\"noreferrer noopener\">data cleaning<\/a> you must do depends on several factors, such as whether you\u2019re using first-party data (data your organization collects directly from customers), second-party data (first-party data from other organizations), or third-party data (data aggregated by an outside organization). Unstructured data requires more cleaning because it may lack standardized naming conventions and formatting rules. This is also the stage where you\u2019ll perform exploratory data analysis (EDA) to identify trends and characteristics in the data.&nbsp;<\/li>\n<\/ol>\n\n\n\n<ol class=\"wp-block-list\" start=\"4\">\n<li><strong>Analyze the Data<\/strong><br>Before you analyze your data, it may be useful to segment it. For example, if you\u2019re analyzing sales data, you may wish to break it down by region or product category. From there, you can glean insights about specific groups or make comparisons between them. 
The type of data analysis technique you use depends on the question you\u2019re trying to answer.&nbsp;<\/li>\n<\/ol>\n<\/div>\n\n\n\n<div class=\"wp-block-group is-layout-flow wp-block-group-is-layout-flow\">\n<ul class=\"wp-block-list\">\n<li><strong>Bivariate and Multivariate Analysis<\/strong><br>One of the simplest forms of statistical analysis, bivariate analysis is the process of determining a relationship between an independent variable and a dependent variable. This relationship is usually expressed as a linear equation or summarized by a correlation coefficient (a value from -1 to 1 whose sign indicates a positive or negative relationship and whose magnitude indicates its strength). Multivariate analysis extends the same idea to three or more variables.&nbsp;<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cohort Analysis<\/strong><br>Cohort analysis involves examining groups or segments of your data to determine their common characteristics. For example, you might want to understand what product categories are most popular in a specific region, or the demographic makeup of your top buyers.&nbsp;<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Time Series Analysis<\/strong><br><a href=\"https:\/\/www.springboard.com\/blog\/data-science\/time-series-forecasting\/\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/time-series-forecasting\/\">Time series analysis<\/a> is a statistical technique used to identify trends and patterns over time. Using this technique, you measure the same variable at different points in time. Time-related trends can help you understand what factors might cause the variable to change (e.g., cyclic patterns or seasonality) and forecast how it may fluctuate in the future.&nbsp;<\/li>\n<\/ul>\n<\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Factor Analysis<\/strong><br>Factor analysis is a technique used to reduce a large number of variables into a smaller number of factors. 
This technique works by finding variables that are strongly correlated, meaning they vary together (they have high covariance). For example, if there is a strong relationship between a customer\u2019s region and household income, you can group the two into a single factor such as \u201cconsumer purchasing power.\u201d This leaves you with a smaller number of factors rather than hundreds of seemingly unrelated variables. You can then explore these factors for further analysis.&nbsp;<\/li>\n<\/ul>\n\n\n\n<ol class=\"wp-block-list\" start=\"5\">\n<li><strong>Visualize Your Data<\/strong><br>Data visualizations are often the most effective way to communicate your findings to non-technical stakeholders. Visuals should be based on the following questions:<\/li>\n<\/ol>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Who is my audience?<\/li>\n\n\n\n<li>What questions do they have?<\/li>\n\n\n\n<li>What answers do I have for them?&nbsp;<\/li>\n\n\n\n<li>What other questions will my visualizations inspire?&nbsp;<\/li>\n<\/ul>\n\n\n\n<p><em>Related Read: <a href=\"https:\/\/www.springboard.com\/blog\/data-analytics\/best-data-visualization-courses\/\" target=\"_blank\" rel=\"noreferrer noopener\">Top 13 Best Data Visualization Courses<\/a><\/em><\/p>\n\n\n\n<p>Where possible, <a href=\"https:\/\/www.springboard.com\/blog\/data-analytics\/31-free-data-visualization-tools\/\" target=\"_blank\" rel=\"noreferrer noopener\">use a range of formats<\/a>, from dashboards to interactive graphs, to communicate your findings and help viewers understand the issue from different angles.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Are Some Data Sets Better Than Others?<\/strong><\/h3>\n\n\n\n<p>First and foremost, a good dataset contains the elements and variables you need for your specific analysis. For example, a time series analysis is a great way to visualize changes over time, but it requires data that contains a date or timestamp. 
You may also need to contextualize your data by using a third-party data source. For example, say you\u2019re analyzing the education outcomes of a specific demographic group. How does this cohort compare with the rest of the population?<\/p>\n\n\n\n<p>A good dataset is disaggregated. An example of this would be reporting test scores separately for students with various learning abilities, instead of aggregating data on the entire student population. You should also look for datasets that have metadata or a data dictionary if the fields aren\u2019t already well-labeled. A data dictionary describes each column, including its name, its meaning, and the values it can contain. The data should also be relatively easy to manipulate. If the data requires an outsized amount of effort to clean up, it might be incomplete or filled with inaccuracies.<\/p>\n\n\n\n<p class=\"rm has-background\" style=\"background-color:#efeff6\"><strong>Since you\u2019re here\u2026<br><\/strong>Curious about a career in data science? Experiment with our <a rel=\"noreferrer noopener\" href=\"https:\/\/www.springboard.com\/resources\/guides\/data-science-process\/\" target=\"_blank\">free data science learning path<\/a>, or join our <a rel=\"noreferrer noopener\" href=\"https:\/\/www.springboard.com\/courses\/data-science-career-track\/\" target=\"_blank\">Data Science Bootcamp<\/a>, where you\u2019ll get your tuition back if you don&#8217;t land a job after graduating. We\u2019re confident because our courses work \u2013 check out our <a rel=\"noreferrer noopener\" href=\"https:\/\/www.springboard.com\/success\/\" target=\"_blank\">student success stories<\/a> to get inspired.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While data analysis is always technical (and sometimes even a little bit repetitive), you can still have fun with it. 
Playing around with existing online datasets is great practice, and you\u2019ll find various data-driven projects put together by experts and aficionados, many of them available in open-source communities like Github. What\u2019s more, you can easily [&hellip;]<\/p>\n","protected":false},"author":85,"featured_media":17415,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_eb_attr":"","_eb_data_table":"","footnotes":""},"categories":[67],"tags":[],"marketing_tags":[],"class_list":{"0":"post-9744","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-science"},"acf":[],"_links":{"self":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/9744"}],"collection":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/users\/85"}],"replies":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/comments?post=9744"}],"version-history":[{"count":4,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/9744\/revisions"}],"predecessor-version":[{"id":49373,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/9744\/revisions\/49373"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media\/17415"}],"wp:attachment":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media?parent=9744"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/categories?post=9744"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/tags?post=9744"},{"taxonomy":"marketing_tags","embeddable":true,"href":"https:\/\/www.springboa
rd.com\/blog\/wp-json\/wp\/v2\/marketing_tags?post=9744"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}