Duplicates
In an earlier blog post, we mentioned duplicates as one of the types of unclean or “dirty” data: records that describe the same entity appearing multiple times in a database. Now we will discuss this topic in more detail.
The Drawbacks of Duplicates
The disadvantages of duplicates hardly need explaining: if the same person or company appears multiple times in a database, they may receive the same customer notification letter several times or be called repeatedly about the same issue, and customer satisfaction suffers as a result. Duplicate letters, emails, and phone calls also generate unnecessary costs and administrative burden.
Deduplication in the Data Quality Assurance Process
For this reason, one of the key steps in the data quality assurance process is usually the identification and merging of duplicates in the system. (Strictly speaking, duplication is not limited to pairs of records: an entity can appear in the database in any number of copies. The more precise term would be multiplication, but for simplicity and in line with common usage, we refer to it as duplication.)
The success of duplicate detection depends on the quality of the fields best suited for identification. For individuals, the key identifiers are name, place of birth, and date of birth; for legal entities, the tax number and company registration number are typically the most reliable. If these fields are of poor quality, it is highly recommended to perform data cleansing before running duplicate detection.
Of course, data is never perfect, so duplicate detection must ultimately be performed on more or less incomplete and erroneous records. Exact matching therefore cannot be relied on; instead, similarity criteria must be defined. With good duplicate detection algorithms, probable duplicate groups can be identified even among moderately erroneous data, and human review often confirms these groups to be accurate.
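To make this concrete, here is a minimal sketch of similarity-based matching for individual records, using only the Python standard library. The field names, weights, and the 0.85 threshold are illustrative assumptions, not a description of any specific product's algorithm:

```python
# A minimal sketch of similarity-based duplicate matching for individuals.
# The field names, weights, and threshold are illustrative assumptions.
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    """Basic cleansing: lowercase and collapse whitespace before comparing."""
    return " ".join(value.lower().split())

def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1 for two field values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_probable_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Combine fuzzy scores on the key identifying fields with weights.

    The date of birth acts as a hard filter; name and place of birth
    tolerate typos and formatting differences via fuzzy similarity.
    """
    if rec_a["birth_date"] != rec_b["birth_date"]:
        return False
    score = (0.7 * similarity(rec_a["name"], rec_b["name"])
             + 0.3 * similarity(rec_a["birth_place"], rec_b["birth_place"]))
    return score >= threshold

a = {"name": "Kovacs  Janos", "birth_place": "Budapest", "birth_date": "1980-05-12"}
b = {"name": "Kovács János", "birth_place": "budapest", "birth_date": "1980-05-12"}
print(is_probable_duplicate(a, b))  # True: only accents and formatting differ
```

In practice, candidate pairs are usually narrowed down first with a blocking key (for example, comparing only records that share a date of birth), so that not every possible pair of records has to be scored.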
Duplicate Search
During duplicate detection, algorithmic methods are used to identify and list records that describe the same entity, such as a customer or a product. The result is a set of duplicate groups: sets of two or more records that all refer to the same entity. The goal is to place every record representing the same entity into the same group.
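One common way to turn pairwise matches into such groups is a union-find (disjoint set) structure, which ensures that chains of matches (A matches B, B matches C) end up in a single group even when A and C were never compared directly. The sketch below assumes records are identified simply by their index:

```python
# Turning pairwise matches into duplicate groups via union-find.
from collections import defaultdict

class DisjointSet:
    """Tracks which record indices belong to the same group."""
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        self.parent[self.find(a)] = self.find(b)

def build_duplicate_groups(n_records: int, matched_pairs: list) -> list:
    """Merge pairwise matches; only groups with more than one record count."""
    ds = DisjointSet(n_records)
    for a, b in matched_pairs:
        ds.union(a, b)
    groups = defaultdict(list)
    for i in range(n_records):
        groups[ds.find(i)].append(i)
    return [g for g in groups.values() if len(g) > 1]

# Records 0, 1 and 3 describe the same customer; record 2 is unique.
print(build_duplicate_groups(4, [(0, 1), (1, 3)]))  # [[0, 1, 3]]
```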
Master Record Creation
If full deduplication is not feasible or not desired (e.g., because records from duplicate groups must remain in different systems), it is advisable to create a master record for each group. The master record is typically based on the record from the system with the highest priority, which can be updated or supplemented with data from other systems as needed. A master record generally contains only the most important customer (identifier) information and, if necessary, a few additional data points relevant to the business.
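A simple merge strategy might look like the following sketch. The system priority list and the field names are purely hypothetical; in a real project, the priority order and the supplementing rules come from the business:

```python
SYSTEM_PRIORITY = ["crm", "billing", "legacy"]  # highest priority first (assumed)

def build_master_record(group: list) -> dict:
    """group: list of dicts, each with a 'system' key plus data fields."""
    ordered = sorted(group, key=lambda r: SYSTEM_PRIORITY.index(r["system"]))
    master = dict(ordered[0])      # start from the top-priority system's record
    for record in ordered[1:]:     # supplement missing fields from the others
        for field, value in record.items():
            if not master.get(field) and value:
                master[field] = value
    return master

group = [
    {"system": "legacy", "name": "Smith J.", "tax_id": "12345678"},
    {"system": "crm", "name": "John Smith", "tax_id": ""},
]
print(build_master_record(group))
# {'system': 'crm', 'name': 'John Smith', 'tax_id': '12345678'}
```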
Resolving Duplicates (Deduplication)
Once duplicates have been identified, the next step is deduplication, that is, the removal of the duplicate records. This involves selecting the record to be retained in each duplicate group and eliminating the others. The surviving record is usually the one with the highest data quality, although the selection logic can be more complex.
Entities linked to the duplicate records, such as products, must be transferred from the records being deleted to the retained record. Technical limitations may arise here: certain products may not be transferable, or reassigning them may not be cost-effective. In such cases, the record the product is attached to must be retained as well.
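The sketch below illustrates survivor selection and reassignment in this spirit. The quality score and the transferable flag are assumptions standing in for real data-quality metrics and product-specific technical constraints:

```python
def deduplicate(group: list, products: list) -> tuple:
    """group: records with 'id' and 'quality' fields;
    products: dicts with 'record_id' and a 'transferable' flag.
    Returns the set of record ids that must be kept, plus the products."""
    survivor = max(group, key=lambda r: r["quality"])
    keep_ids = {survivor["id"]}
    for product in products:
        if product["record_id"] != survivor["id"]:
            if product["transferable"]:
                product["record_id"] = survivor["id"]  # move to the survivor
            else:
                keep_ids.add(product["record_id"])     # this record must stay
    return keep_ids, products

group = [{"id": 1, "quality": 0.6}, {"id": 2, "quality": 0.9}, {"id": 3, "quality": 0.4}]
products = [{"record_id": 1, "transferable": True},
            {"record_id": 3, "transferable": False}]
kept, products = deduplicate(group, products)
print(kept)  # {2, 3}: record 3 survives because its product cannot be moved
```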
DSS Consulting and Deduplication
In our previous post, we briefly touched on DSS’s experience with data cleansing. Now let’s look at some practical examples of our deduplication work:
- At a leading bank, as part of a comprehensive data quality assurance project, we developed an algorithm for detecting duplicates within and across different systems.
- For an Italian-founded, globally active medtech company, we performed data cleansing and duplicate detection in their ERP and CRM systems using our Quality Monitor software package.
- For a leading telecommunications provider, we carried out data quality assurance tasks related to the implementation of their new CRM system, including duplicate detection and deduplication.
- At a major insurance company, we conducted a central customer database quality assessment, including duplicate detection among customer records.
This brings us to the end of our data quality assurance series.
Please find a brief description of our data quality assurance solutions here.
And if your company has ever considered improving the quality of its data assets, why not discuss your challenges over a cup of coffee?