{"id":3957,"date":"2018-08-08T12:49:07","date_gmt":"2018-08-08T19:49:07","guid":{"rendered":"https:\/\/www.springboard.com\/?p=3957"},"modified":"2025-04-23T01:29:01","modified_gmt":"2025-04-23T08:29:01","slug":"data-quality-management-tips","status":"publish","type":"post","link":"https:\/\/www.springboard.com\/blog\/data-science\/data-quality-management-tips\/","title":{"rendered":"Data Quality Management: What You Should Know"},"content":{"rendered":"\n<p><span style=\"font-weight: 400;\">The heuristics we learn in the classroom are often just the tip of the iceberg. Sooner or later, if we want to deepen our study, it becomes necessary to bend or even break the rules that got us so far in the first place.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">This is certainly the case with data quality management. Data quality, defined as \u201ca perception or an assessment of data&#8217;s fitness to serve its purpose in a given context,\u201d might <\/span><i><span style=\"font-weight: 400;\">seem <\/span><\/i><span style=\"font-weight: 400;\">straightforward at first but is actually rather difficult to evaluate and maintain. 
This is in part because the threshold dividing high-quality data from low-quality data depends on a number of variables, including the data\u2019s intended use (or uses) and context (which may change over time), as well as a subjective understanding of the data\u2019s accuracy, completeness, and reliability.<\/span><\/p>\n\n\n\n<p>This is where data quality management (DQM) comes in.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What Is Data Quality Management?<\/h2>\n\n\n\n<p>If you&#8217;ve ever heard a variation of the phrase &#8220;a system is only as good as the data it contains,&#8221; it&#8217;s likely the speaker was talking about data quality management. At its core, DQM is simply a set of protocols and processes that help a business collect, clean, analyze, store, and distribute data as consistently as possible.<\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">While the clean, uncomplicated datasets we encounter in <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/free-public-data-sets-data-science-project\/\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/free-public-data-sets-data-science-project\/\" target=\"_blank\" rel=\"noreferrer noopener\">examples<\/a> and <a href=\"https:\/\/www.springboard.com\/blog\/data-analytics\/free-data-analytics-courses\/\" target=\"_blank\" rel=\"noreferrer noopener\">courses<\/a> may help us grasp data quality fundamentals, they are unlikely to prepare us for the more nuanced scenarios we find in the workplace. Let\u2019s look at how basic quality control practices mature in these more complex data environments.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Stage 1: Referential Integrity<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">When you first start working with relational databases, you\u2019ll come across something called a referential integrity check. 
It\u2019s a data quality control feature built right into the database that <\/span><a href=\"https:\/\/www.techopedia.com\/definition\/1233\/referential-integrity-ri\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">enforces the relationships between tables<\/span><\/a><span style=\"font-weight: 400;\">: \u201cany foreign key field must agree with the primary key that is referenced by the foreign key.\u201d If you\u2019re unfamiliar with these terms and missed <\/span><a href=\"https:\/\/www.springboard.com\/blog\/data-science\/joining-data-tables\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/joining-data-tables\/\" rel=\"noreferrer noopener\"><span style=\"font-weight: 400;\">our post on SQL table joining<\/span><\/a><span style=\"font-weight: 400;\">, here\u2019s a classroom-style example to illustrate the concept. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Let\u2019s imagine we\u2019re a medical office, and we have patient records stored in a number of database tables, among which are the PATIENT table and the ADDRESS table. 
Because our PATIENT table references the ADDRESS table, linking each patient to his or her address, our database will require that we supply an address key in order to add a new patient to the system.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Given the address table below, the database would bar us from adding the highlighted patient record because its address id does not correspond to an address id in the ADDRESS table.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"680\" height=\"305\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/chrome_2018-07-16_16-46-18.png\" alt=\"\" class=\"wp-image-3990\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/chrome_2018-07-16_16-46-18.png 680w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/chrome_2018-07-16_16-46-18-400x179.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/chrome_2018-07-16_16-46-18-380x170.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/chrome_2018-07-16_16-46-18-380x170.png 420w\" sizes=\"(max-width: 680px) 100vw, 680px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"852\" height=\"244\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table2.png\" alt=\"integrity\" class=\"wp-image-4727\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table2.png 852w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table2-400x115.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table2-768x220.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table2-380x109.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table2-700x200.png 700w, 
https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table2-380x109.png 420w\" sizes=\"(max-width: 852px) 100vw, 852px\" \/><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">The referential integrity constraint is predicated on the idea that quality data <\/span><i><span style=\"font-weight: 400;\">exists where it is supposed to exist<\/span><\/i><span style=\"font-weight: 400;\">. Because address #22222 isn\u2019t where it\u2019s supposed to be (in the ADDRESS table), the integrity of the entire Jess Keating record is questioned.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">But what if our fictitious medical office is actually an emergency room at a hospital? Perhaps Jess Keating came in on a gurney in critical condition, and although we know her name from the student ID card on her person, we don\u2019t know her address and cannot ask her for it. We need to be able to add her to the system without an address.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">This might seem like an extreme case, but unknowns pop up all the time in the real world. Jess could just as easily have shown up as an outpatient and simply neglected to give her address before being called in to see the doctor. Or perhaps she was giving her address over the phone but got cut off. <\/span><\/p>\n\n\n\n<p>Why not just leave her address blank or NULL, then? The absence of a value isn&#8217;t the same thing as a conflicting value, is it?<\/p>\n\n\n\n<p>Well, this depends on context, and for the purposes of our example, a non-existent address is a value. Moreover, it&#8217;s an incorrect value, as Keating does live somewhere.<\/p>\n\n\n\n<p>It is not in our best interest to populate tables with false information, and yet the practice is <a href=\"https:\/\/www.urban.org\/urban-wire\/misleading-data-and-visualizations\" target=\"_blank\" rel=\"noreferrer noopener\">pretty common in the real world<\/a>. 
This is a simplistic example compared to the kinds of unknowns that routinely surface in high-volatility industries like finance and insurance, where the repercussions of entering a false empty or NULL value can be costly indeed. In these cases, insufficient data is preferable to the alternative\u2014even though it lacks completeness\u2014because it is more accurate and therefore more appropriate to the business context. We have progressed from a referential integrity issue to a data integrity issue.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Stage 2: Data Integrity<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">Where in Stage 1 we were most concerned with the integrity of our table references, we are now going to loosen those requirements in order to prioritize <\/span><a href=\"https:\/\/searchdatacenter.techtarget.com\/definition\/integrity\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">data integrity<\/span><\/a><span style=\"font-weight: 400;\">\u2014that is, the internal consistency and lack of corruption in the data itself. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">Let\u2019s refer back to our hospital example for a moment. 
If Jess Keating is currently undergoing a life-saving procedure in our emergency room, it is important that we have a record of her, <\/span><i><span style=\"font-weight: 400;\">even<\/span><\/i><span style=\"font-weight: 400;\"> if that record is incomplete. Failure to record her visit as accurately as possible could have catastrophic consequences, so in order to bypass our database\u2019s referential integrity check, we need to implement an override.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In our case, the override would take the form of a stub record, a temporary stand-in for the real address we don\u2019t yet have.<\/span><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"841\" height=\"297\" src=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table3.png\" alt=\"data integrity table 3\" class=\"wp-image-4728\" srcset=\"https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table3.png 841w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table3-400x141.png 400w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table3-768x271.png 768w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table3-380x134.png 380w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table3-700x247.png 700w, https:\/\/www.springboard.com\/blog\/wp-content\/uploads\/2018\/08\/table3-380x134.png 420w\" sizes=\"(max-width: 841px) 100vw, 841px\" \/><\/figure>\n\n\n\n<p><span style=\"font-weight: 400;\">This dummy record satisfies the database\u2019s referential integrity requirements while simultaneously accounting for the fact that we have a patient by the name of Jess Keating with no known address. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">But now we have another data quality problem on our hands. Jess Keating\u2019s address, by being blank, is effectively incorrect. 
Ordinarily, <\/span><i><span style=\"font-weight: 400;\">no <\/span><\/i><span style=\"font-weight: 400;\">data entry would be preferable to <\/span><i><span style=\"font-weight: 400;\">false <\/span><\/i><span style=\"font-weight: 400;\">data entry, but in this case, circumstances and the referential integrity constraint force our hand. The stub record must eventually be corrected, not only to restore the record\u2019s integrity but also to keep from triggering a series of patient processing issues. The hospital needs to institute a system that ensures the record gets filled in. We now progress from data integrity to data quality.<\/span><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span style=\"font-weight: 400;\">Stage 3: Data Quality<\/span><\/h2>\n\n\n\n<p><span style=\"font-weight: 400;\">So far, we\u2019ve gone from determining that our tables are accurate to verifying that our data is accurate. But to assess <a href=\"http:\/\/www.dataversity.net\/what-is-data-quality\/\" target=\"_blank\" rel=\"noreferrer noopener\">all the factors<\/a> contributing to data quality\u2014accuracy, completeness, validity, and relevance among them\u2014we must actually understand what the information is <\/span><i><span style=\"font-weight: 400;\">for<\/span><\/i><span style=\"font-weight: 400;\">. This, in the data world, is referred to as \u201csemantics.\u201d Whoever designs the processes that will ensure Jess Keating\u2019s address is collected and added to the database will have to understand who needs that information, when, and why. We are all familiar with the utility of an address, but not all data points are as transparent to the layperson. 
Data management teams need to understand what the data means and how it&#8217;s being used in order to establish data quality metrics and safeguard its quality.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In our example, the data quality check might be that the receptionist at the ER\u2019s discharge desk is required to fill in any vacant fields before the patient is released. The hospital <\/span><i><span style=\"font-weight: 400;\">could<\/span><\/i><span style=\"font-weight: 400;\"> rely on receptionists to scan all the necessary paperwork and identify where information is missing, but this leaves a lot of room for human error. An application programmed to perform this scan on the receptionist\u2019s behalf is better equipped to catch all the missing information, and it is this program that a data quality analyst might design based on the hospital\u2019s processing needs.<\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">The need for automated quality checks grows when we introduce more complex data quality concerns, such as duplicate records. Let\u2019s say this isn\u2019t Jess Keating\u2019s first visit to the hospital; she\u2019s already in the system under <\/span><i><span style=\"font-weight: 400;\">Jessica <\/span><\/i><span style=\"font-weight: 400;\">Keating. We don\u2019t realize this at first because all we have is her student ID, which identifies her as Jess. When Jess is discharged from the ER and we collect her address, it\u2019s possible we\u2019ll get something different from what we have on file for Jessica; she may have moved recently. <\/span><\/p>\n\n\n\n<p><span style=\"font-weight: 400;\">In this scenario, the patient\u2019s social security number would lead us to consolidate the duplicate records, but not all real-world scenarios will have such a convenient unique identifier! 
In less certain cases, data stewards may need to utilize data quality measures and run a program checking for patterns in multiple fields (e.g., records with the same address, same last name, and same first letter of first name), flagging the potential duplicates for investigation. <\/span><\/p>\n\n\n\n<p>Data quality management is a key concept in the field of <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/data-science-definition\/\" target=\"_blank\" rel=\"noreferrer noopener\">data science<\/a>. <span style=\"font-weight: 400;\">Subject matter experts and <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/data-scientist-job-description\/\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.springboard.com\/blog\/data-science\/data-scientist-job-description\/\" rel=\"noreferrer noopener\">data specialists<\/a> or <a href=\"https:\/\/www.springboard.com\/blog\/data-science\/what-does-a-data-scientist-do\/\" target=\"_blank\" data-type=\"post\" data-id=\"24427\" rel=\"noreferrer noopener\">data scientists<\/a> need to work <\/span><i><span style=\"font-weight: 400;\">together <\/span><\/i><span style=\"font-weight: 400;\">to ensure data quality at the highest level. Referential integrity and data integrity are merely stepping stones on the path to a top-quality data system that can accommodate real-world information requirements.<\/span><\/p>\n\n\n\n<p><em>Nicole Hitner is a content strategist at Exago, Inc., producer of&nbsp;<a href=\"http:\/\/exagoinc.com\/?utm_source=springboard&amp;utm_medium=blog\" target=\"_blank\" rel=\"noreferrer noopener\">embedded business intelligence for software companies<\/a>. 
She manages the company\u2019s content marketing, writes for&nbsp;<a href=\"http:\/\/exagoinc.com\/blog\/?utm_source=springboard&amp;utm_medium=blog\" target=\"_blank\" rel=\"noreferrer noopener\">their blog<\/a>, and assists the product design team in continuing to enhance Exago BI.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The heuristics we learn in the classroom are often just the tip of the iceberg. Sooner or later, if we want to deepen our study, it becomes necessary to bend or even break the rules that got us so far in the first place. This is certainly the case with data quality management. 
Data quality, [&hellip;]<\/p>\n","protected":false},"author":45,"featured_media":3986,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_eb_attr":"","_eb_data_table":"","footnotes":""},"categories":[67],"tags":[],"marketing_tags":[],"class_list":{"0":"post-3957","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-data-science"},"acf":[],"_links":{"self":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/3957"}],"collection":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/users\/45"}],"replies":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/comments?post=3957"}],"version-history":[{"count":4,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/3957\/revisions"}],"predecessor-version":[{"id":56567,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/posts\/3957\/revisions\/56567"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media\/3986"}],"wp:attachment":[{"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/media?parent=3957"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/categories?post=3957"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/tags?post=3957"},{"taxonomy":"marketing_tags","embeddable":true,"href":"https:\/\/www.springboard.com\/blog\/wp-json\/wp\/v2\/marketing_tags?post=3957"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}