You may have heard the term data wrangling before. This example-filled guide will help you understand what exactly it is, and how you can start doing some data wrangling yourself, with plenty of code examples for you to follow along. We are going to be using the open source tool Python and the Pandas library within, but the examples and logic can be applied across multiple tools and programs such as R, Hadoop, or SAS. Think of it as an introduction to the main tasks you should expect to do with data wrangling — and if you’re following along in Pandas and Python, a hands-on tutorial.
What Is Data Wrangling?
Suppose you are working on Kaggle’s Titanic: Machine Learning from Disaster challenge. You decide to use your favourite classification algorithm only to realise that the training data set contains a mixture of continuous and categorical variables and you’ll need to transform some of the variables into a suitable format. You realize that the raw data you have can’t be used for your analysis without some manipulation — what you’ll soon know as data wrangling. You’ll need to clean this messy data to get anywhere with it.
Data Wrangling Definition
It is often the case with data science projects that you’ll have to deal with messy or incomplete data. The raw data we obtain from different data sources is often unusable at the beginning. All the activity that you do on the raw data to make it “clean” enough to input to your analytical algorithm is called data wrangling or data munging. If you want to create an efficient ETL pipeline (extract, transform and load) or create beautiful data visualizations, you should be prepared to do a lot of data wrangling.
As most statisticians, data analysts and data scientists will admit,most of the time spent implementing an analysis is devoted to cleaning or wrangling the data itself, rather than to coding or running a particular model that uses the data. According to O’Reilly’s 2016 Data Science Salary Survey, 69% of data scientists will spend a significant amount of time in their day-to-day dealing with basic exploratory data analysis, while 53% spend time cleaning their data. Data wrangling is an essential part of the data science role — and if you gain data wrangling skills and become proficient at it, you’ll quickly be recognized as somebody who can contribute to cutting-edge data science work and who can hold their own as a data professional.
To sum it all up: take messy, incomplete data or data that is too complex and simplify and/or clean it so that it’s useable for analysis — and you’ll have done data wrangling.
In this guide, we’ll illustrate with the help of examples some popular Pandas techniques that you may use to make the data wrangling process easier. We’ll ensure that you feel comfortable enough with the tools and techniques involved in data wrangling such that you become an expert data wrangler yourself!
Data Wrangling With Pandas
Pandas is one of the most popular Python library for data wrangling. In this example we’ll use Pandas to learn data wrangling techniques to deal with some of the most common data formats and their transformations. We’ll be playing with Pandas dataframes which are structured as tables where you can use Python code to easily manipulate the rows and columns.
You can get started with Pandas by reading Julia Evans’ Pandas Cookbook (which also has examples of how to use Pandas to solve data problems) and by understanding the basics of Python. You can use Anaconda and this video tutorial to get started with an interactive and intuitive Python environment that will allow you to interact with Pandas easily, even if you’ve never coded before!
We’ll use the ‘train.csv’ provided for Kaggle’s Titanic: Machine Learning from Disaster as our experimental dataset. You can download it and follow along if you’d like.
Let’s read the data as a Pandas dataframe and explore it.
Dropping missing or null values in the dataset.
Often data that you’ll work with will have some missing data points. It is important to understand how to deal with missing data. As you learn more data science/statistics, you’ll learn about data imputation. Here, we’ll learn to find missing data points and then we’ll drop those points from the dataset so as not to affect our analysis with bias: an important part of data wrangling and data cleaning. We’ll try to find which columns in ‘train.csv’ contain missing values and drop those missing values so you’ll have tidy data. You’ll want to type out the input in In , or use the version saved in the repository to follow along.
As we can see from the output, the columns ‘Age’, ‘Cabin’ and ‘Embarked’ contain missing or null values as they’ve indicated that the statement if there are any null values in the column is TRUE. We’d like to drop all the rows with missing values.
As we can see with a quick printout of the first five values in the dataframe, there are some null values (marked “NaN”) in the “Cabin” column. We’ll have to drop those values before doing any serious analysis!
As you can see with this next command, we’ll have dropped PassengerIds 1, 3 and 5 — the null rows we were looking to drop since we had missing information about their cabins!
This sort of data wrangling is very useful for sets of data that are incomplete. You don’t want to bias or make mistakes in your analysis because you’re dealing with incomplete data. For example, you may receive a list of stores and the sales they’ve generated. Perhaps some stores have given you blank or null values, and you want to only calculate averages for those stores with valid entries. You’d have to drop null values in order to do any reliable data analysis. Knowing how to drop values that don’t make any sense only helps strengthen your data analysis skills for situations like this.
During the exploratory data analysis phase you may be interested in working with a subset of data or filtering out parts of data using some criteria. Let’s say you only want to see data points where the ‘Age’ of the person is greater than 30 or the ‘Sex’ is ‘female’.
Let’s start with isolating those rows which have passengers above 30 years of age.
Now, let’s begin with isolating those passengers who are female.
We can also filter data with multiple conditions. Let’s now look at how we’d isolate passengers who were both female AND above the age of 30.
We can use this kind of filtering to do exploratory data analysis on a segment of a population. It can be important to isolate certain traits of a particular group.
In this case, we could find out that your chances of survival (the mean of the Survived column) were above 80% if you were a woman above 30. You should wrangle your data and filter through until you get a representative look at particular segments you’re looking at — perhaps you’ll want to use this technique the next time you compare the difference between website or store visitors who become customers and those who do not.
Let’s move to something more challenging than just being able to filter data. What if we want to understand if the gender of the person had any correlation with the survival of the person? Pandas groupby returns a DataFrameGroupBy object which has a variety of methods. Calling value_count() returns the count of values for each of the unique values for the column.
At a rough glance, it appears that if you’re a male, you would have had a much lower chance of survival than if you were a female.
You’ll want to group data by a larger segment if you want to discover larger trends in the data. For example, you might have individual sales from different stores, but you might want to group them by the geographical region they belong so that you can see trends city-by-city instead of store-by-store
Time series data handling
Time series data is one of the most important forms of data you will find while working with financial data, weather data etc. You’ll want to look at certain periods of time to see climate patterns or the differences between financial growth and stagnation. Here we will use Python’s Pandas to understand timeseries data. Depending on the analysis you may want to work on change the timezone, resample the data from seconds to minutes etc. Let’s try to perform these operations using Pandas and Python’s datetime library.
You have a set of values by the second they appear in. This is not in the original Titanic dataset (and it’s a somewhat morbid thought), but with the power of this time series analysis tool, you could analyze how many deaths occured by the hour or even by the minute as the Titanic slowly met its fate.
Let’s assume you cleaned and filtered the data. And now, you want to share this data with a colleague. She isn’t familiar with Pandas and uses Excel to analyze data. Let’s try to export the result of one of our previous queries to an Excel sheet.
This is an important step to consider whenever you’re done with data analysis: how to best share your results. You’ll want to think carefully about the different data formats your colleagues work in. Make sure that data is presented to them in the best way possible.
Data Wrangling: Conclusion
Data wrangling is an important part of any data analysis. You’ll want to make sure your data is in tip-top shape and ready for convenient consumption before you apply any algorithms to it. Data preparation is a key part of a great data analysis. By dropping null values, filtering and selecting the right data, and working with timeseries, you can ensure that any machine learning or treatment you apply to your cleaned-up data is fully effective.
By using Python and Pandas, you’ll have explored a suite of the most powerful data wrangling tools out there. We hope you use this knowledge to enhance your data science projects and advance to a possible data science career!
Had a hard time data wrangling? Have something you want to add to the discussion? Comment below!
This article was originally written by Shubham Singh Tomar. Shubham is a Data Scientist. He lives and works in Bangalore, India. On weekends he volunteers to work on Data Science projects for NGOs and Social organizations.