{"id":8353,"date":"2019-08-22T12:24:10","date_gmt":"2019-08-22T19:24:10","guid":{"rendered":"https:\/\/www.springboard.com\/?p=8353"},"modified":"2023-07-07T15:46:52","modified_gmt":"2023-07-07T22:46:52","slug":"data-cleaning","status":"publish","type":"post","link":"https:\/\/www.springboard.com\/blog\/data-analytics\/data-cleaning\/","title":{"rendered":"Data Cleaning: The Why and the How"},"content":{"rendered":"\n<p><span style=\"font-weight: 400;\">In my posts on data analysis methods and machine learning, I briefly mentioned data cleaning and its importance. Those two articles were not focused on data cleansing, so they were indeed brief mentions. However, the importance of using (relatively) clean data is paramount in machine learning and statistics.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Do We Really Need to Clean the Data?<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Yes. Bad data will lead to bad results, plain and simple. The saying \u201cgarbage in, garbage out\u201d is well-known in the computer science world for a reason. Computers are not magical gods, they\u2019re machines performing calculations very rapidly. They have no insight or intuition, nor do they have intellect or sentience (as far as we know!) to determine when something \u201cjust doesn\u2019t make sense.\u201d<\/span> So, it is very important for any <a href=\"https:\/\/www.springboard.com\/blog\/data-analytics\/what-does-data-analyst-do\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-analytics\/what-does-data-analyst-do\/\" rel=\"noreferrer noopener\">data analysts<\/a> to ensure the quality of a data.<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">To accurately reflect reality, our input data must remove errors and issues that trip up our algorithms. Data cleaning (or pre-processing, if you prefer) is how we do this. Data cleansing is a <\/span><a href=\"https:\/\/visit.figure-eight.com\/rs\/416-ZBE-142\/images\/CrowdFlower_DataScienceReport.pdf\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">time-consuming and unpopular aspect of data analysis<\/span><\/a><span style=\"font-weight: 400;\"> (PDF, p5), but it must be done.<\/span><\/p>\n\n\n\n<p><i><span style=\"font-weight: 400;\">Note 1: In this article, <\/span><\/i><b>rows<\/b><i><span style=\"font-weight: 400;\"> will be instances of datapoints while <\/span><\/i><b>columns<\/b><i><span style=\"font-weight: 400;\"> will be variable\/field names. Row 1 may be Jane, row 2 may be John. Column 1 may be age, column 2 may be income. <\/span><\/i><\/p>\n\n\n\n<p><i><span style=\"font-weight: 400;\">Note 2: Most links to explain programming terms will link to <\/span><\/i><a href=\"https:\/\/www.python.org\/\" target=\"_blank\" rel=\"noreferrer noopener\"><i><span style=\"font-weight: 400;\">Python<\/span><\/i><\/a><i><span style=\"font-weight: 400;\"> resources, one of the most widely used data science languages. Common libraries are <\/span><\/i><a href=\"https:\/\/numpy.org\/\" target=\"_blank\" rel=\"noreferrer noopener\"><i><span style=\"font-weight: 400;\">NumPy<\/span><\/i><\/a><i><span style=\"font-weight: 400;\">, <\/span><\/i><a href=\"https:\/\/www.scipy.org\/index.html\" target=\"_blank\" rel=\"noopener\"><i><span style=\"font-weight: 400;\">SciPy<\/span><\/i><\/a><i><span style=\"font-weight: 400;\">, and <\/span><\/i><a href=\"https:\/\/pandas.pydata.org\/\" target=\"_blank\" rel=\"noreferrer noopener\"><i><span style=\"font-weight: 400;\">pandas<\/span><\/i><\/a><i><span style=\"font-weight: 400;\">. If you haven\u2019t programmed before, Python is a high-level, relatively intuitive language with many code libraries already created for you. I strongly recommend it for beginners.<\/span><\/i><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What Causes Data to Be Unclean?<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Data from real-world applications may be unstructured and therefore its capture automated. This leads to all kinds of consistency and quality issues in<a href=\"https:\/\/www.springboard.com\/blog\/data-analytics\/what-is-data-analytics\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-analytics\/what-is-data-analytics\/\" rel=\"noreferrer noopener\"> data analytics<\/a> process. Sometimes data sets are aggregated from multiple sources, leading to other issues. Sometimes inputs are sourced from bad sensors. There are many reasons our data can be unusable for algorithms.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Input Errors<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">There are plenty of ways a human can enter the wrong information. They may mistype, miscalculate, or misread. For open-ended fields, like unverified text, there could be typos that a computer will not interpret as identical (e.g., \u201cyes,\u201d \u201cyeah,\u201d \u201cY\u201d). For numeric fields, there could be confusion about magnitude (000s or 000,000s?). I\u2019m sure you can think of myriad other ways humans can errantly enter data.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Malfunctioning Sensors<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Machines can err, too. A washing machine sensor shouldn\u2019t register 534 C for the water temperature. An elevator\u2019s current load cannot equal -150 kilograms. Humans easily catch these errors, but computers will absolutely miss them (unless the code specifically checks for these errors).<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Mangled Data<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">When sensors malfunction, they are likely to generate values outside the acceptable range (like a negative weight). Sometimes the sensors are generating valid data, but that data is mangled on its way to the collection site. Proprietary data formats may not be readable by different programs. Interference during public-internet transmission might cause dropped packets and thus partial input.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Duplicates<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">If the initial data set is an amalgamation of multiple sources, there is a high probability of duplicates. Sometimes these are not identified by a simple key check, because one data source might use the <\/span><i><span style=\"font-weight: 400;\">Name<\/span><\/i> <b><i>Sensor AA<\/i><\/b><span style=\"font-weight: 400;\"> while another data source may use <\/span><b><i>Device 01<\/i><\/b><span style=\"font-weight: 400;\">. Both entries are identical, but your algorithm relies on the <\/span><i><span style=\"font-weight: 400;\">Name<\/span><\/i><span style=\"font-weight: 400;\"> field, thus double-counting the double-agent AA\/01.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Lack of Standardization<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">When using multiple data sources, lack of standardization is common. To achieve true results, all data that is similar in reality must be represented similarly in the input. This is obvious, but it is not always obvious how to achieve this.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Even within single data source projects, standardization problems can still arise, particularly with open-ended human input. Some people might have different spelling or capitalization habits, and people on different teams within an organization may even use different names for the same topic or product!<\/span><\/p>\n\n\n<div class=\"bg-leaf-50 p-4 my-3\"><h4 class=\"fw-bold text-center\">Get To Know Other\tData Analytics Students<\/h4><div class=\"row row-cols-1 row-cols-lg-3\"><div class=\"col\"><div class=\"card success-story-card h-100 d-flex justify-content-between mb-0\"><div class=\"flex-grow-1 text-center\"><a class=\"d-inline-block rounded-circle\" href=\"\/success\/cana-curtis\" style=\"width:125px;height:125px;overflow:hidden\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/res.cloudinary.com\/springboard-images\/image\/upload\/v1682910063\/Cana_Curtis_profile.jpg\" alt=\"Cana Curtis\" style=\"object-fit:contain;max-width:170px;height:125px\" \/><\/a><p class=\"fw-bold mb-0\">Cana Curtis<\/p><p class=\"text-muted lh-1\">Research Analyst at Virginia Commonwealth University<\/p><\/div><div class=\"w-100 d-block d-md-none mt-3\"><\/div><p class=\"mb-0 mx-auto text-center\"><a class=\"btn btn-primary mx-auto\" href=\"\/success\/cana-curtis\">Read Story<\/a><\/p><\/div><\/div><div class=\"col d-none d-md-block\"><div class=\"card success-story-card h-100 d-flex justify-content-between mb-0\"><div class=\"flex-grow-1 text-center\"><a class=\"d-inline-block rounded-circle\" href=\"\/success\/yogita-nesargi\" style=\"width:125px;height:125px;overflow:hidden\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/res.cloudinary.com\/springboard-images\/image\/upload\/v1648222893\/Yogita_Nesargi_1.jpg\" alt=\"Yogita Nesargi\" style=\"object-fit:contain;max-width:170px;height:125px\" \/><\/a><p class=\"fw-bold mb-0\">Yogita Nesargi<\/p><p class=\"text-muted lh-1\">Data Engineer at Deloitte<\/p><\/div><p class=\"mb-0 mx-auto text-center\"><a class=\"btn btn-primary mx-auto\" href=\"\/success\/yogita-nesargi\">Read Story<\/a><\/p><\/div><\/div><div class=\"col d-none d-md-block\"><div class=\"card success-story-card h-100 d-flex justify-content-between mb-0\"><div class=\"flex-grow-1 text-center\"><a class=\"d-inline-block rounded-circle\" href=\"\/success\/joel-antolijao\" style=\"width:125px;height:125px;overflow:hidden\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/res.cloudinary.com\/springboard-images\/image\/upload\/v1648514795\/Student%20Success\/Joel_Antolijao.jpg\" alt=\"Joel Antolijao\" style=\"object-fit:contain;max-width:170px;height:125px\" \/><\/a><p class=\"fw-bold mb-0\">Joel Antolijao<\/p><p class=\"text-muted lh-1\">Data Analyst at FanDuel<\/p><\/div><p class=\"mb-0 mx-auto text-center\"><a class=\"btn btn-primary mx-auto\" href=\"\/success\/joel-antolijao\">Read Story<\/a><\/p><\/div><\/div><\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Identifying Problematic Data<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">In this section, let\u2019s explore some data constraints and how to approach them.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Range Constraints<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">One of the easiest restraints to check, and one of the easiest to understand, is the range<\/span> <span style=\"font-weight: 400;\">constraint. Water is only physically water between 0 and 100 degrees Celsius. If your data says the water is 435 C, it isn\u2019t water anymore. In fact, it\u2019s steam\u2014it\u2019s steam at a very high pressure. Your washing machine is going to explode\u2026 If the reading is -30 C, your washer is probably frozen solid.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">There is likely a \u201cmax\u201d and a \u201cmin\u201d function in your chosen language. You can very quickly check a range constraint this way. These are generally optimized as fundamental functions, too, so they will run quickly even on huge data sets. Another way is to use a <\/span><a href=\"https:\/\/jakevdp.github.io\/PythonDataScienceHandbook\/02.06-boolean-arrays-and-masks.html#Boolean-Arrays-as-Masks\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">Boolean mask<\/span><\/a> <span style=\"font-weight: 400;\">and check for any values that violate the constraint, or graph your data and look for abnormal <\/span><a href=\"https:\/\/www.itl.nist.gov\/div898\/handbook\/eda\/section3\/eda366.htm\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">distributions<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Categorical Constraints<\/span><\/h3>\n\n\n\n<p><a href=\"http:\/\/sites.utexas.edu\/sos\/variables\/\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">Categorical variable<\/span><\/a> <span style=\"font-weight: 400;\">values can only be chosen from a specific set of options. A device can only be <\/span><i><span style=\"font-weight: 400;\">on<\/span><\/i><span style=\"font-weight: 400;\"> or <\/span><i><span style=\"font-weight: 400;\">off<\/span><\/i><span style=\"font-weight: 400;\">. A marital status may only be <\/span><i><span style=\"font-weight: 400;\">single<\/span><\/i><span style=\"font-weight: 400;\">, <\/span><i><span style=\"font-weight: 400;\">engaged, married<\/span><\/i><span style=\"font-weight: 400;\">, <\/span><i><span style=\"font-weight: 400;\">divorced<\/span><\/i><span style=\"font-weight: 400;\">, or <\/span><i><span style=\"font-weight: 400;\">widowed<\/span><\/i><span style=\"font-weight: 400;\">. An animal may only be one species, and it must be a valid species.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">You can check categorical constraints by running some kind of unique search function (in <\/span><a href=\"https:\/\/docs.scipy.org\/doc\/numpy\/reference\/generated\/numpy.unique.html\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">Python<\/span><\/a><span style=\"font-weight: 400;\">, in <\/span><a href=\"https:\/\/www.springboard.com\/blog\/data-science\/sql-what-you-need-to-know\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/sql-what-you-need-to-know\/\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">SQL<\/span><\/a><span style=\"font-weight: 400;\">) and viewing the output. If there are eight possible categories, your unique search should produce eight or fewer results. You can easily check that all the identified categories are legitimate.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Categories that contain many classes may require another method. DNA sequences come to mind, as there could be hundreds of unique and valid sequences in a single data set. If you have a list of all possible combinations, you can use a function that cross-references the input data with a set of known sequences. If you don\u2019t, you <\/span><i><span style=\"font-weight: 400;\">might<\/span><\/i><span style=\"font-weight: 400;\"> be able to use something like a <\/span><a href=\"https:\/\/www.w3schools.com\/python\/python_regex.asp\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">Regex<\/span><\/a> <span style=\"font-weight: 400;\">(regular expression), but that\u2019s up to you to decide.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Consistency<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">The strings \u201cMarried,\u201d \u201cmarried,\u201d and \u201cmar.\u201d will not be recognized as identical. If some columns are in 000s but a specific data source is quoting raw numbers, some of your rows will be off by three orders of magnitude. Percentages that are entered in decimal form will not be comparable to percentages entered in basis-point form.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">You can find these, at least for categorical data with few classes, by manually reviewing the list of classes (use a unique function to find them all). Finding these errors for numeric data may require some ingenuity, though range constraints are a start.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Cross Field Constraints<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">This one is hardest to check because it doesn\u2019t concern a single field. This constraint pertains to validity across columns or rows. For example, $100,000 may be a valid income, and 3 may be a valid age, but a datapoint of Alex with an age of 3 and an income of $100,000 is very unlikely. It can be even more complicated, like this example:<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1304\" height=\"234\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/08\/Screen-Shot-2019-08-05-at-11.19.14-AM.png\" alt=\"\" class=\"wp-image-8354\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/08\/Screen-Shot-2019-08-05-at-11.19.14-AM.png 1304w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/08\/Screen-Shot-2019-08-05-at-11.19.14-AM-400x72.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/08\/Screen-Shot-2019-08-05-at-11.19.14-AM-1200x215.png 1200w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/08\/Screen-Shot-2019-08-05-at-11.19.14-AM-768x138.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/08\/Screen-Shot-2019-08-05-at-11.19.14-AM-380x68.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/08\/Screen-Shot-2019-08-05-at-11.19.14-AM-700x126.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2019\/08\/Screen-Shot-2019-08-05-at-11.19.14-AM-380x68.png 420w\" sizes=\"(max-width: 1304px) 100vw, 1304px\" \/><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">Obviously, one of these entries is incorrect. Finding out which one<\/span><span style=\"font-weight: 400;\">\u2014<\/span><span style=\"font-weight: 400;\">and indeed just finding this cross-reference error<\/span><span style=\"font-weight: 400;\">\u2014<\/span><span style=\"font-weight: 400;\">can be extremely difficult. If your project relies critically on the veracity of cross-column (age, income) or cross-row (1234 \/ 9876) information, you\u2019ll need to develop code to catch these kinds of errors. Perhaps here you can simply iterate through every ID and ensure the relationship is reciprocal. 1234 should be married to 9876 and 9876 should be married to 1234. The same should appear for 4233 \/ 9324 and 9324 \/ 4233. Any deviations can be collected in a list and manually reviewed.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Visualizations<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Visualizations are a good way to easily find outliers, strange distributions, and other problems. If you believe it is a normal distribution, but in fact you have a bimodal distribution, you will need to adjust your starting assumptions. Visualization techniques, like box-and-whisker plots, histograms, and scatterplots, can be immensely helpful in immediately catching some issues.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Visualizations will also assist you in identifying trends, which can be helpful in scaling. Excel has some of the most powerful <\/span><a href=\"https:\/\/support.office.com\/en-us\/article\/charts-and-other-visualizations-in-power-view-141bd462-9853-4973-ac37-842e8345f51e\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">visualization capabilities<\/span><\/a><span style=\"font-weight: 400;\">, but there are plenty of libraries in other languages <\/span><a href=\"https:\/\/www.anaconda.com\/python-data-visualization-2018-why-so-many-libraries\/\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">like Python<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Counting the Errors<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">If you find constraint violations, you\u2019ll want to count them. Just one may be a fluke. If 50% of the datapoints are in violation, there may be an inapplicable constraint or bad input data. Even if you use visualizations, make sure you also count the errors. If a sensor spits out 500 identical, invalid instances, they will all be hidden behind the same visual datapoint on the screen.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Missing Values<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Missing values is probably the most common type of data issue that must be addressed. Values may be missing because you combined two data sets from different sources, the entry process somehow skipped a row, or the value was inadvertently deleted. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">One or two missing values probably isn\u2019t a problem, but if you notice there is a high density of missing values, you should investigate the reason. Is a sensor malfunctioning? Is a certain group conspicuously missing? If surveys are missing from all tenants in a particular building, perhaps the building management is avoiding distributing the surveys. This could be important in itself.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Missing values can often impart and imply a lot of information, so don\u2019t just ignore them if they\u2019re occurring frequently. <\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data Cleaning Techniques<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Once you\u2019ve identified data to be cleaned, there are a few main ways to actually go about that data cleanup. <\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Removal<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">This is the most frowned-upon method. For missing values, it is better to investigate the reason instead of simply eliminating the rows or columns that contain the missing values. This is not always avoidable, though. If an entire column is 85% missing and you cannot find another data source, you may not be able to use that column.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Additionally, it\u2019s not optimal to remove outliers, as this is a kind of results doctoring. If you do remove datapoints, explain the reasoning for doing so (such as 85% of the data is irrecoverable) in the results and report.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Before removing many datapoints, it\u2019s important to get input from experts in the field. This is critically true if you want to remove an entire column. Removing the Tactile Strength column from an engineering study is probably a bad idea, since tactile strength may heavily influence the decision that comes out of the analysis.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">The only datapoints regularly eliminated are duplicates that are clearly duplicates, and only when the presence of duplicates does not affect your assumptions or results.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Direct Correction<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">This one is a little more interesting, and if possible, infinitely better than removal. Consistency issues are often simple to fix, though they must be identified first.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">For string consistency correction in smaller categorical sets, it can be trivial to run a unique values search and then write a couple of if-statements to replace errors. If you have something like city names, it may be difficult to go with explicit if-statements. You may want to use a <\/span><a href=\"https:\/\/medium.com\/@categitau\/fuzzy-string-matching-in-python-68f240d910fe\" target=\"_blank\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">fuzzy search<\/span><\/a><span style=\"font-weight: 400;\"> and make corrections that way.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Numerical consistency errors, such as order of magnitude mismatches, are simple to fix by multiplication or division. Binary consistency issues can be corrected if you can accurately assign the non-binary input to one of the binary categories. In the set {on, off, broken}, you can probably safely map <\/span><i><span style=\"font-weight: 400;\">broken<\/span><\/i><span style=\"font-weight: 400;\"> to <\/span><i><span style=\"font-weight: 400;\">off<\/span><\/i><span style=\"font-weight: 400;\">. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Errors that arise from malfunctioning sensors or human input errors should also be corrected from the source, if possible. If you are using publicly available or large-scale, one-time-collection data sets, though, this won\u2019t be possible. In those cases, you may want to impute the values.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Scaling<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">Scaling changes the ranges of data so some features do not dominate solely because they naturally produce larger values. For example, temperature for a city tends to have a much smaller range than the population for a city. Distance-based algorithms will assign much greater importance to the population variable, possibly entirely ignoring the temperature variable. Scaling brings variables in line with each other while retaining the proportional relationships within the variable. This is seen when you convert to percentages or baseline to 100.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Imputation<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">This technique is most closely associated with filling in missing values, but it can be used for incorrect values, too, especially when a direct correction cannot be made. <\/span><\/p>\n\n\n\n<p><i><span style=\"font-weight: 400;\">Imputation<\/span><\/i><span style=\"font-weight: 400;\"> is a fancy way to say <\/span><i><span style=\"font-weight: 400;\">guess<\/span><\/i><span style=\"font-weight: 400;\">. However, since we are in the field of data science, this will be a data-driven guess, not just a random guess. You can <a href=\"https:\/\/en.wikipedia.org\/wiki\/Imputation_(statistics)\" target=\"_blank\" rel=\"noopener\">impute<\/a> values with statistical indicators (like mean, median, mode), hot-decking, stratification, and others.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">One approach is to replace every missing value with a statistical indicator. However, be very careful here. You may reinforce the pattern already gleaned from the known data, potentially masking a critical pattern belied by the fact that the data itself is missing. In our missing building survey example above, if you just used the mean score for all missing data, you may overlook a strong negative sentiment in that building (which was why the building manager \u201cforgot\u201d to distribute the survey).<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Hot-decking fills in missing values by randomly selecting a value from the set of already-known values. Again, this can cause you to overlook important information belied by \u201cmissingness.\u201d However, hot-decking won\u2019t drag your whole data set toward a statistical indicator like repeatedly filling in statistical indicators does.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Finally, stratification is beneficial if you already know some patterns in your data. The heights of women are, on average, shorter than the heights of men. You could split your data set into men and women, then use those sub-indicators for replacement or hot-deck from the subsets of men and women. Is it perfect? No, but it\u2019s better than using the indicators or hot-decking from the entire population.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">If you have a mission-critical project and it\u2019s absolutely impossible to retrieve the correct data, it is strongly recommended to look for known patterns in the data set and impute values in line with those known patterns. You want to avoid introducing your own or latent biases as much as possible when imputing data, so get a second opinion if you can.<\/span><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Flagging<\/span><\/h3>\n\n\n\n<p><span style=\"font-weight: 400;\">This is particularly useful for missing values when you don\u2019t want to drop all of them. For numeric data, you can add another column to your data set and flag any missing values there. This will inform your algorithm of missing values, which may turn out to be influential. For categorical variables, simply create a \u201cMissing\u201d or \u201cUnknown\u201d class.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">If your flagged variable becomes important, you will probably need to correct or impute values to reflect reality or gather any useful insights. If you cannot do this, ensure the report mentions this shortcoming.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Data cleansing is not glamorous, and most data scientists actually do not seem to enjoy this part of the job. However, it is absolutely vital to feed clean, qualify data into your machine learning algorithms if you want useful, truthful, and actionable insights.<\/span><\/p>\n\n\n\n<p><strong>Ready to learn more? Consider Springboard\u2019s <a href=\"https:\/\/www.springboard.com\/courses\/data-analytics-career-track\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/courses\/data-analytics-career-track\/\" rel=\"noreferrer noopener\">Data Analytics Career Track<\/a>. You\u2019ll master both the technical and business thinking skills to get hired\u2014job guaranteed!<\/strong><\/p>\n\n\n\n<p class=\"has-background\" style=\"background-color:#efeff6\"><strong>Since you&#8217;re here&#8230;<br><\/strong>Switching to a career in data analytics is possible, no matter your background. We\u2019ve helped <a href=\"https:\/\/www.springboard.com\/success\/\" target=\"_blank\" rel=\"noreferrer noopener\">over 10,000 students <\/a>make it happen. Check out our <a href=\"https:\/\/www.springboard.com\/resources\/learning-paths\/data-analysis\/\" target=\"_blank\" rel=\"noreferrer noopener\">free data analytics curriculum<\/a> to gauge your interest, or go all-in with our <a href=\"https:\/\/www.springboard.com\/courses\/data-analytics-career-track\/\" target=\"_blank\" rel=\"noreferrer noopener\">Data Analytics Bootcamp<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my posts on data analysis methods and machine learning, I briefly mentioned data cleaning and its importance. Those two articles were not focused on data cleansing, so they were indeed brief mentions. However, the importance of using (relatively) clean data is paramount in machine learning and statistics. Do We Really Need to Clean the [&hellip;]<\/p>\n","protected":false},"author":73,"featured_media":8413,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_eb_attr":"","_eb_data_table":"","footnotes":""},"categories":[134],"tags":[],"marketing_tags":[],"class_list":{"0":"post-8353","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-analytics"},"acf":[],"_links":{"self":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/8353"}],"collection":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/users\/73"}],"replies":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/comments?post=8353"}],"version-history":[{"count":3,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/8353\/revisions"}],"predecessor-version":[{"id":47548,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/8353\/revisions\/47548"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media\/8413"}],"wp:attachment":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media?parent=8353"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/categories?post=8353"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/tags?post=8353"},{"taxonomy":"marketing_tags","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/marketing_tags?post=8353"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}