Data Quality Management: What You Should Know
The heuristics we learn in the classroom are often just the tip of the iceberg. Sooner or later, if we want to deepen our study, it becomes necessary to bend or even break the rules that got us this far in the first place.
This is certainly the case with data quality management. Data quality, defined as “a perception or an assessment of data’s fitness to serve its purpose in a given context,” might seem straightforward at first but is actually rather difficult to evaluate and maintain. This is in part because the threshold dividing high-quality data from low-quality data depends on a number of variables, including the data’s intended use (or uses) and context (which may change over time), as well as a subjective understanding of the data’s accuracy, completeness, and reliability.
This is where data quality management (DQM) comes in.
What Is Data Quality Management?
If you’ve ever heard a variation of the phrase “a system is only as good as the data it contains,” it’s likely the speaker was talking about data quality management. At its core, DQM is simply a set of protocols and processes that help a business collect, clean, analyze, store, and distribute data as consistently as possible.
While the clean, uncomplicated datasets we encounter in examples and courses may help us grasp data quality fundamentals, they are unlikely to prepare us for the more nuanced scenarios we find in the workplace. Let’s look at how basic quality control practices mature in these more complex data environments.
Stage 1: Referential Integrity
When you first start working with relational databases, you’ll come across something called a referential integrity check. It’s a data quality control feature built right into the database that enforces the relationships between tables: “any foreign key field must agree with the primary key that is referenced by the foreign key.” If you’re unfamiliar with these terms and missed our post on SQL table joining, here’s a classroom-style example to illustrate the concept.
Let’s imagine we’re a medical office, and we have patient records stored in a number of database tables, among which are the PATIENT table and the ADDRESS table. Because our PATIENT table references the ADDRESS table, linking each patient to his or her address, our database will require that we supply an address key in order to add a new patient to the system.
Suppose we try to add Jess Keating’s patient record with address ID #22222, but no row with that ID exists in the ADDRESS table. The database would bar the insertion because the foreign key does not correspond to any address on file.
The referential integrity constraint is predicated on the idea that quality data exists where it is supposed to exist. Because address #22222 isn’t where it’s supposed to be (in the ADDRESS table), the integrity of the entire Jess Keating record is questioned.
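As a rough illustration of how a database enforces this constraint, here is a sketch using SQLite through Python’s sqlite3 module. The table layouts, column names, and the sample patient “Sam Rivera” are simplified assumptions for demonstration, not an actual hospital schema.

```python
import sqlite3

# In-memory database; table names mirror the article's example.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("CREATE TABLE address (address_id INTEGER PRIMARY KEY, street TEXT)")
conn.execute("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        address_id INTEGER NOT NULL REFERENCES address(address_id)
    )
""")

conn.execute("INSERT INTO address VALUES (11111, '12 Elm St')")

# Succeeds: address 11111 exists in the ADDRESS table.
conn.execute("INSERT INTO patient VALUES (1, 'Sam Rivera', 11111)")

# Fails: address 22222 is not in the ADDRESS table, so the
# referential integrity check rejects the whole record.
try:
    conn.execute("INSERT INTO patient VALUES (2, 'Jess Keating', 22222)")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)  # FOREIGN KEY constraint failed
```

Note that SQLite requires the `PRAGMA foreign_keys = ON` statement per connection; most other relational databases enforce foreign keys by default.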
But what if our fictitious medical office is actually an emergency room at a hospital? Perhaps Jess Keating came in on a gurney in critical condition, and although we know her name from the student ID card on her person, we don’t know her address and cannot ask her for it. We need to be able to add her to the system without an address.
This might seem like an extreme case, but unknowns pop up all the time in the real world. Jess could just as easily have shown up as an outpatient and simply neglected to give her address before being called in to see the doctor. Or perhaps she was giving her address over the phone but got cut off.
Why not just leave her address as blank or NULL, then? The absence of a value isn’t the same thing as a conflicting value, is it?
Well, this depends on context, and for the purposes of our example, a non-existent address is a value. Moreover, it’s an incorrect value, as Keating does live somewhere.
It is not in our best interest to populate tables with false information, and yet the practice is fairly common in the real world. Ours is a simplistic example compared to the kinds of unknowns that routinely surface in high-volatility industries like finance and insurance, where the repercussions of entering a falsely empty or NULL value can be costly indeed. In these cases, incomplete data is preferable to false data because, though it lacks completeness, it is more accurate and therefore more appropriate to the business context. We have progressed from a referential integrity issue to a data integrity issue.
Stage 2: Data Integrity
Where in Stage 1 we were most concerned with the integrity of our table references, we are now going to loosen those requirements in order to prioritize data integrity—that is, the internal consistency and lack of corruption in the data itself.
Let’s refer back to our hospital example for a moment. If Jess Keating is currently undergoing a life-saving procedure in our emergency room, it is important that we have a record of her, even if that record is incomplete. Failure to record her visit as accurately as possible could have catastrophic consequences, so in order to bypass our database’s referential integrity check, we need to implement an override.
In our case, the override would take the form of a stub record, a temporary stand-in for the real address we don’t yet have.
This dummy record satisfies the database’s referential integrity requirements while simultaneously accounting for the fact that we have a patient by the name of Jess Keating with no known address.
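Continuing the SQLite sketch under the same simplified assumptions, the stub record might look like the following. The placeholder ID, the “UNKNOWN” marker, and the street “48 Birch Ave” are all hypothetical choices, not values from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE address (address_id INTEGER PRIMARY KEY, street TEXT)")
conn.execute("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        address_id INTEGER NOT NULL REFERENCES address(address_id)
    )
""")

# Stub record: a well-known placeholder that satisfies the foreign key
# while flagging the address as not yet collected.
UNKNOWN_ADDRESS_ID = 0
conn.execute("INSERT INTO address VALUES (?, 'UNKNOWN')", (UNKNOWN_ADDRESS_ID,))

# The patient can now be admitted without a real address.
conn.execute("INSERT INTO patient VALUES (2, 'Jess Keating', ?)",
             (UNKNOWN_ADDRESS_ID,))

# Later, when the real address arrives, the record is corrected.
conn.execute("INSERT INTO address VALUES (22222, '48 Birch Ave')")
conn.execute("UPDATE patient SET address_id = 22222 WHERE patient_id = 2")
```

Using a single, agreed-upon placeholder ID (rather than ad hoc fake addresses) keeps stub records easy to find and correct later.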
But now we have another data quality problem on our hands. Jess Keating’s address, by being a placeholder, is effectively incorrect. Ordinarily, no data entry would be preferable to false data entry, but in this case, circumstances and the referential integrity constraint force our hand. The stub record must eventually be corrected, not only to restore the record’s integrity but also to keep from triggering a series of patient processing issues. The hospital needs to put a system in place to ensure that the record gets filled. We now progress from data integrity to data quality.
Stage 3: Data Quality
So far, we’ve gone from determining that our tables are accurate to verifying that our data is accurate. But to assess all the factors contributing to data quality—accuracy, completeness, validity, and relevance, among them—we must actually understand what the information is for. This, in the data world, is referred to as “semantics.” Whoever designs the processes that will ensure Jess Keating’s address is collected and added to the database will have to understand who needs that information and when and why. We are all familiar with the utility of an address, but not all data points are as transparent to the layperson. Data management teams need to understand what the data means and how it’s being used in order to establish data quality metrics and safeguard its quality.
In our example, the data quality check might be that the receptionist at the ER’s discharge desk be required to fill in any vacant fields before the patient is released. The hospital could rely on receptionists to scan all the necessary paperwork and identify where information is missing, but this leaves a lot of room for human error. An application programmed to perform this scan on the receptionist’s behalf is better equipped to catch all the missing information, and it is this program that a data quality analyst might design based on the hospital’s processing needs.
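A minimal sketch of such a scan, assuming records arrive as simple field-to-value mappings and that missing information shows up as empty or stub values (the field names and stub markers here are hypothetical):

```python
# Hypothetical discharge-time check: flag any fields still holding
# stub or empty values before a patient record is released.
STUB_VALUES = {None, "", "UNKNOWN"}

def missing_fields(record: dict) -> list:
    """Return the names of fields that still need to be filled in."""
    return [field for field, value in record.items() if value in STUB_VALUES]

patient = {"name": "Jess Keating", "address": "UNKNOWN", "phone": ""}
print(missing_fields(patient))  # ['address', 'phone']
```

In practice this check would run against the database itself, but the idea is the same: the program, not the receptionist, decides which fields are incomplete.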
The need for an automated quality check compounds when we introduce more complex data quality concerns, such as duplicate records. Let’s say this isn’t Jess Keating’s first visit to the hospital; she’s already in the system under Jessica Keating. We don’t realize this at first because all we have is her student ID, which identifies her as Jess. When Jess is discharged from the ER and we collect her address, it’s possible we’ll get something different from what we have on file for Jessica; she may have moved recently.
In this scenario, the patient’s social security number would lead us to consolidate the duplicate records, but not all real-world scenarios will have such a convenient unique identifier! In less certain cases, data stewards may need to utilize data quality measures and run a program checking for patterns in multiple fields (e.g., records with the same address, same last name, and same first letter of first name), flagging the potential duplicates for investigation.
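The pattern check described above (same address, same last name, same first initial) could be sketched as follows; the sample records are hypothetical, and a real implementation would also normalize addresses and handle typos.

```python
from collections import defaultdict

def flag_potential_duplicates(records):
    """Group records sharing the same address, last name, and first
    initial, and return any group containing more than one record."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["address"].lower(), rec["last"].lower(),
               rec["first"][0].lower())
        groups[key].append(rec)
    return [group for group in groups.values() if len(group) > 1]

records = [
    {"first": "Jessica", "last": "Keating", "address": "48 Birch Ave"},
    {"first": "Jess",    "last": "Keating", "address": "48 Birch Ave"},
    {"first": "Sam",     "last": "Rivera",  "address": "12 Elm St"},
]
for group in flag_potential_duplicates(records):
    print("Possible duplicates:", [r["first"] for r in group])
```

Note that the program only flags candidates; a data steward still investigates each group before merging records, since two different people can legitimately share an address and a surname.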
Data quality management is a key concept in the field of data science. Subject matter experts and data specialists or data scientists need to work together to ensure data quality at the highest level. Referential integrity and data integrity are merely stepping stones on the path to a top-quality data system that can accommodate real-world information requirements.
Nicole Hitner is a content strategist at Exago, Inc., producers of embedded business intelligence for software companies. She manages the company’s content marketing, writes for their blog, and assists the product design team in continuing to enhance Exago BI.
Since you’re here…
Curious about a career in data science? Experiment with our free data science learning path, or join our Data Science Bootcamp, where you’ll only pay tuition after getting a job in the field. We’re confident because our courses work – check out our student success stories to get inspired.