Data is constantly changing: from business addresses and names to contact phone numbers and email addresses. Data that was useful weeks or months ago quickly becomes outdated and new data needs to be incorporated into decision-making.
The purpose of data analysis is to remove bias and use historical data to create actionable recommendations and predictions for the future. But this only works if the data is of high quality, to begin with.
This continuous maintenance of changing data is what we refer to as “data quality” management. One definition of data quality is: “the planning, implementation, and control of activities that apply quality management techniques to data, in order to assure it is fit for consumption and meet the needs of data consumers.” In other words, ensuring that data can serve its intended purpose within an organization.
With quintillions of data bytes generated daily, data quality is a top priority in order to stay competitive in an increasingly digital landscape.
Why Does Data Quality Matter?
Poor data quality can cost an organization $9.7 million annually. As of 2016, it cost the United States $13 trillion per year. Data quality problems result in a 20% decrease in worker productivity and explain why 40% of business initiatives fail to achieve set goals. Incorrect data can harm a reputation, misdirect resources, slow down the retrieval of information, and lead to false insights and missed opportunities.
For example, if an organization has the incorrect name or mailing address of a prospective client, their marketing materials could go to the wrong recipient. If sales data is attributed to the wrong SKU or brand, the company might invest in a product line with less than stellar customer demand.
Historically, errors with data reporting have even led to global catastrophes. The Enron scandal of 2001, which resulted from the non-disclosure of billions of dollars of liabilities and led to the energy firm’s bankruptcy, could have been prevented through better ethical auditing that would have detected the fictitious nature of presented data. This incident gives a clear overview about how poor data quality can jeopardize the whole data analytics process and cause unbearable damage to the businesses.
Data quality always matters, but there are certain business contexts that require extra special attention paid to data quality. When engaging in a merger and acquisition, companies need to unify disparate data sources under common data standards, processes, strategies, technologies, and cultures. Data quality is also important for any enterprise resource planning or customer relationship management function.
How Can You Preserve Data Quality?
One of the primary responsibilities of data analysts is to guarantee data quality. Data problems can be caused by employee or customer data entry mistakes (the most prevalent cause, according to The Data Warehousing Institute), system changes, software errors, or erroneous data integration/migration.
The procedure of examining data for accuracy and completeness is called data profiling. Data quality assurance involves removing outliers and irregularities so that the data is representative of the larger picture.
The first step in data profiling is making sure there are no missing data fields and that information has been inputted correctly.
Some of the most common issues affecting data quality are inconsistent formatting of dates and numbers, unusual character sets and symbols, duplicate entries, and different languages and measurement units. For example, a date can be written out or represented numerically in a few different formats—dd/mm/yy, mm/dd/yy, or “day, month, year”—which would prevent a computer system from properly aggregating and synthesizing data related to time. Many organizations use unicode (universal code standards) for data processing, but sometimes foreign characters come through in an unreadable format and must be converted during the data cleansing process.
After importing the data and identifying a problem, data analysts can either accept the error if it doesn’t disrupt the interpretation, remove the error, fix the error, or add a default such as “N/A” or “unknown” in place of the error.
When profiling large volumes of data, data analysts will need to construct data hierarchies, rules, and term definitions to understand the interrelationships between types of data. Rules can be simple, such as: “Customer full name must be capitalized and consist only of letters.” Data profiling verifies what percentage of entries meet the rules and that this percentage is above the threshold required by the organization.
Another important check is ensuring referential integrity, that all the table relationships are in agreement. Techopedia gives us a good example:
When a CUSTOMER_MASTER table contains data like name, social security number, address, and birthdate, and an ACCOUNTS_MASTER table bears bank account information like account type, account creation date, account holder, and withdrawal limits, a Customer_ID field serves as the primary key, linking the two tables. Referential integrity means that a change in Customer_ID in the CUSTOMER_MATER table must be reflected in the ACCOUNTS_MASTER table.
What Factors Determine Quality of Data?
A Gartner study lists several key factors to examine data quality:
- Is there data to work with?
- Example: Did the organization actually collect data on sales performance in China?
- If a data point appears in multiple locations, does it bear the same meaning?
- Example: In data sets that contain revenue by store for a given week, is the same number associated with a particular store in all data sets?
- Does the data represent real facts and properties?
- Example: Are reported sales representative of what actually happened in the store?
- Does the data depict genuine relationships?
- Example: In a report of customers and billing addresses, is each customer linked to the right billing address?
- Do the data entries make sense?
- Example: If data in a column “location” is linked to data “price,” are the related values consistent with allowable values in the data set and when compared with external benchmarks?
The DAMA UK Working Group on “Data Quality Dimensions” defines a few other criteria to measure data quality completeness (do we have all the recorded information?), uniqueness (is every data entry unique?), and timeliness (does the data represent the right date and time?). Data needs to be refreshed continuously to prevent staleness. In many cases, real-time data collection and analysis can help with data timeliness.
At times, data problems can be fixed pretty easily. For example, inserting a drop-down menu into a survey instead of relying on free-form responses can improve data consistency. Similarly, making fields mandatory reduces occurrences of incomplete data, and requiring picture capture or GPS location and time stamp can increase data accuracy.
Organizations with good data quality practices will have a process for automating data collection and entry (since many mistakes are caused by human error), user profiles defining who should be able to access different data types, and a dashboard to monitor data quality changes over time.
Related: Data Analysis Methods: An Overview
Get To Know Other Data Analytics Students
What Tools Are Needed for Data Quality Management?
With advances in technology, there are many tools that organizations can use to improve data quality, depending on their needs and preferences (cloud-based versus on-premise, compatibility with different sources, integrations with other platforms, complexity of data sets).
These tools often perform three main functions: data cleansing, data auditing, and data migration. Data auditing has more advanced capabilities than data cleansing and checks for fraud and other compliance vulnerabilities. Data migration involves moving various data sets to a data warehouse or centralized data set for storage and data quality analysis.
Some popular software services include:
- Informatica – Informatica is one of the most popular data management software options. It comes with a set of prebuilt data rules, a rule builder for customization, and artificial intelligence (AI) capabilities for diagnosing problems.
- Talend – Talend has a metadata management solution and a popular tool for the ETL (extract, transform, and load) function. The basic package is free and open source and provides a graphical depiction of performance on compliance matters.
- SAS – The SAS Data Management Tool handles large data volumes. Data quality technology is all integrated within the same architecture and can connect to other SAS tools for data visualization and business analytics.
- Oracle – Oracle offers a collection of data quality programs, including Oracle Big Data Cloud, Oracle Big Data Cloud Service, Oracle Big Data SQL Cloud Service, and Oracle NoSQL Database.
- SAP – SAP HANA is an in-memory platform and database that retrieves and stores data for applications.
- IBM – IBM has a few different products, such as the InfoSphere Information Server for Data Quality, to monitor and cleanse data, analyze information for consistency, and create a holistic view of entities and relationships.
To succeed in a data quality role, you will need to learn the company’s software of choice and also basic technical skills for the data analyst position, which may include Excel, SQL/CQL, Python, and R.
What Is the Future of Data Quality?
Data analysis is changing and data quality standards must adjust. Increasingly, governments are regulating data to ensure ethics and privacy through legislation like the General Data Protection Regulation in the European Union. With the introduction of natural language processing, machine learning, and artificial intelligence, the stakes for poor data quality are higher. When using past X-ray images to train machines to detect diseases, it is vital that the machines are “learning” on clean data records—or it could have life-threatening consequences. Since 60% of companies cite data quality as a deterrent to AI adoption, investment in data quality can foster a more AI-friendly environment.
Advances in artificial intelligence can also improve data quality by automating data capture, identifying anomalies, and eliminating duplicates more quickly. This will save human time and allow for more efficient processing of huge data sets.
Whether pursuing a career as a data analyst, data scientist, business analyst, or data engineer, it is critical to understand what constitutes good data. Business results can only be as helpful as their data foundation.
Since you’re here…
Switching to a career in data analytics is possible, no matter your background. We’ve helped over 10,000 students make it happen. Check out our free data analytics curriculum to gauge your interest, or go all-in with our Data Analytics Bootcamp.