Data Cleansing in Practice

In our previous blog post, we discussed the importance of data cleansing. Now, we will focus on its practical aspects, along with some examples from past projects.

What Is Data Cleansing in Practice?

What types of data are affected by data cleansing, and how is the process carried out? In other words, how does data cleansing work in the real world? Below, we answer this question by covering the most common areas of data cleansing.

Name Cleansing

Name cleansing involves identifying and correcting errors in name lists. Full names are broken down into individual components (e.g., prefix, surname, given name, suffix) and corrected element by element. The corrected elements are then recombined to form the cleaned full name, but they can also be used separately.

For example, a corrected given name can help determine gender or name day, while a prefix or suffix might indicate a title or academic degree, which can be valuable information. In modern IT systems, structured name components are often required rather than unstructured full names.

Prefixes, suffixes, and given names can be corrected using reference dictionaries. (For Hungarian names, we rely on the official list of given names approved by the Hungarian Academy of Sciences. For foreign names, we use similarly reliable reference sources.)
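
To make the split-and-correct idea more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not our production logic: the tiny reference lists, the function names, the surname-first (Hungarian) name order, and the use of fuzzy matching from Python's standard difflib module are all choices made for this example only.

```python
import difflib

# Hypothetical, tiny reference lists; a real project would load full dictionaries.
PREFIXES = {"dr", "prof", "ifj", "id"}
SUFFIXES = {"phd", "jr", "md"}
GIVEN_NAMES = ["Anna", "Katalin", "László", "István", "Zsuzsanna"]


def split_name(full_name, surname_first=True):
    """Break a full name into prefix, surname, given name, and suffix."""
    tokens = [t.strip(".,") for t in full_name.split()]
    tokens = [t for t in tokens if t]

    # Peel known prefixes off the front and known suffixes off the end.
    prefix = []
    while tokens and tokens[0].lower() in PREFIXES:
        prefix.append(tokens.pop(0))
    suffix = []
    while tokens and tokens[-1].lower() in SUFFIXES:
        suffix.insert(0, tokens.pop())

    # Assumption: a single-token surname; the remaining tokens are given names.
    if surname_first:
        surname, given = tokens[:1], tokens[1:]
    else:
        surname, given = tokens[-1:], tokens[:-1]

    return {
        "prefix": " ".join(prefix),
        "surname": " ".join(surname),
        "given_name": " ".join(given),
        "suffix": " ".join(suffix),
    }


def correct_given_name(given_name, cutoff=0.8):
    """Return the closest dictionary entry, or the input if nothing is close enough."""
    matches = difflib.get_close_matches(given_name, GIVEN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else given_name
```

With these sample lists, split_name("dr. Kovács Katalni PhD") separates the prefix and suffix from the surname "Kovács" and the given name "Katalni", and correct_given_name("Katalni") returns "Katalin". A name with no close match in the dictionary is left unchanged, so it can be reviewed manually.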

Address Cleansing

Address cleansing involves standardizing address formats, correcting typos and errors, filling in missing information, and updating old street names to their current versions. This process eliminates inconsistencies and outdated information in address lists.

Addresses are typically broken down into elements (e.g., postal code, city, street name) and corrected individually before being recombined into a structured format. Modern IT systems increasingly store address components in separate fields rather than in a single unstructured format. This allows for location-based segmentation of customer groups.

To verify and correct addresses, we use reference databases containing accurate address data.
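
The element-by-element approach can be sketched in a few lines of Python. This is a simplified illustration: it assumes a "postal code, city, street, house number" layout with a four-digit postal code, and both lookup tables (the street-type abbreviations and the renamed-streets mapping) are hypothetical stand-ins for a real reference database.

```python
import re

# Hypothetical excerpt of street-type abbreviations and their standard forms.
STREET_TYPES = {"u": "utca", "u.": "utca", "utca": "utca", "krt": "körút", "krt.": "körút"}

# Hypothetical mapping of old street names to current ones, keyed by (city, street).
RENAMED_STREETS = {("Budapest", "Old Example utca"): "New Example utca"}

# Assumed input layout: "<4-digit postal code> <city>, <street> <house number>".
ADDRESS_RE = re.compile(
    r"^\s*(?P<postal_code>\d{4})\s+(?P<city>[^,]+?)\s*,\s*"
    r"(?P<street>.+?)\s+(?P<number>\d+\S*)\s*$"
)


def parse_address(raw):
    """Split a raw address into elements and standardize the street name."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None  # layout not recognized; leave for manual review

    parts = match.groupdict()

    # Expand an abbreviated street type at the end of the street name.
    words = parts["street"].split()
    words[-1] = STREET_TYPES.get(words[-1].lower(), words[-1])
    street = " ".join(words)

    # Replace an outdated street name with its current version, if known.
    parts["street"] = RENAMED_STREETS.get((parts["city"], street), street)
    return parts
```

Because the result is a dictionary of separate fields, the cleaned elements can be stored individually or recombined into whatever output format the target system requires.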

Email Cleansing

Many email addresses in a database contain typos or formatting errors that make them invalid. Some of these errors, especially the common ones, can be corrected, turning invalid email addresses into usable ones.

Both algorithmic processing and reference database matching can be used to clean email addresses.
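
As a rough illustration of how the two techniques combine, the Python sketch below first applies simple algorithmic fixes (removing whitespace, lowercasing, replacing a comma in the domain with a dot) and then looks the domain up in a small table of known misspellings. The table entries, the function name, and the deliberately simple validation pattern are assumptions for this example; the pattern is far stricter than what the email standards actually allow.

```python
import re

# Hypothetical reference table of frequently mistyped domains.
DOMAIN_FIXES = {
    "gmial.com": "gmail.com",
    "gmai.com": "gmail.com",
    "gmail.con": "gmail.com",
    "hotmial.com": "hotmail.com",
}

# Deliberately simple format check, not a full implementation of the email standards.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def clean_email(raw):
    """Return a corrected email address, or None if it cannot be repaired."""
    # Algorithmic step: drop all whitespace and lowercase the address.
    # (Lowercasing the local part is an assumption that holds for most providers.)
    email = "".join(raw.split()).lower()
    if email.count("@") != 1:
        return None

    local, domain = email.split("@")

    # Algorithmic step: a comma typed instead of a dot, stray dots at the ends.
    domain = domain.replace(",", ".").strip(".")

    # Reference step: replace known misspellings of common domains.
    domain = DOMAIN_FIXES.get(domain, domain)

    candidate = f"{local}@{domain}"
    return candidate if EMAIL_RE.match(candidate) else None
```

An address such as " John.Doe@Gmial,com " comes out as "john.doe@gmail.com", while an entry without exactly one "@" is rejected instead of being guessed at.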

Phone Number Cleansing

Phone number cleansing includes format validation and content verification. We check whether phone numbers conform to the standard formats, distinguishing between city, regional, mobile, and special service numbers.

In addition, phone numbers are validated against an official area code reference database and converted into standardized formats.
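
The following Python sketch shows one way format validation, area code lookup, and standardization might fit together for Hungarian numbers. The area code table is a small illustrative excerpt written for this example, not the official reference database, and the subscriber-number lengths in it should be treated as assumptions to verify against that database.

```python
import re

# Illustrative excerpt only: area code -> (number type, subscriber number length).
AREA_CODES = {
    "1": ("city", 7),
    "62": ("regional", 6),
    "20": ("mobile", 7),
    "30": ("mobile", 7),
    "70": ("mobile", 7),
    "80": ("special", 6),
}


def clean_phone(raw):
    """Classify a Hungarian phone number and convert it to a '+36 ...' form."""
    text = raw.strip()
    digits = re.sub(r"\D", "", text)  # keep digits only

    # Strip the international (+36, 0036) or domestic (06) prefix.
    if text.startswith("+36"):
        national = digits[2:]
    elif digits.startswith("0036"):
        national = digits[4:]
    elif digits.startswith("06"):
        national = digits[2:]
    else:
        return None

    # Content check: the area code must be known and the length must fit.
    for code, (kind, length) in AREA_CODES.items():
        if national.startswith(code) and len(national) == len(code) + length:
            subscriber = national[len(code):]
            return {"type": kind, "number": f"+36 {code} {subscriber}"}
    return None
```

Numbers written in different styles, such as "06-30/123 4567" and "+36 30 1234567", end up in the same standardized form, which also makes later duplicate detection easier.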

Document and Identifier Cleansing

Various identification numbers can be algorithmically verified or checked against reference databases. These include personal ID numbers, passport numbers, tax identification numbers, and health insurance numbers, as well as business identifiers such as tax numbers and company registration numbers.
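
Algorithmic verification usually relies on a check digit built into the identifier. The Python sketch below shows a generic weighted-sum check of that kind. The weights, the modulus, and the identifier length are hypothetical and do not describe the rule of any specific document type; the actual rule for each identifier has to come from the issuing authority's specification.

```python
def weighted_check_digit(digits, weights, modulus):
    """Compute a check value as the weighted digit sum modulo `modulus`."""
    total = sum(int(d) * w for d, w in zip(digits, weights))
    return total % modulus


def is_valid_identifier(value, weights=(1, 2, 3, 4, 5, 6, 7, 8, 9), modulus=11):
    """Check that the last digit matches the check value of the preceding digits.

    The default weights and modulus are hypothetical example parameters.
    """
    # Format check: ASCII digits only, one more digit than there are weights.
    if not (value.isascii() and value.isdigit()) or len(value) != len(weights) + 1:
        return False

    expected = weighted_check_digit(value[:-1], weights, modulus)

    # A check value of 10 cannot be written as a single digit, so it is invalid.
    return expected < 10 and expected == int(value[-1])
```

With these example parameters, "8123456786" passes the check and "8123456785" fails it. A check like this catches most single-digit typos without any database lookup, which is why it works well as a first filter before matching against reference data.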

DSS Consulting and Data Cleansing

In the early 2000s, DSS Consulting was one of the first companies in Hungary to focus on data quality assurance. We developed the comprehensive DSS Quality Monitor solution and later its iQualidator module, which offers proactive, real-time data validation at input points, ensuring cleaner data from the start.

Our data cleansing experience includes:

  • A leading insurance company, where we conducted data migration and cleansing for a new portfolio management system. The company continues to use DSS Quality Monitor for ongoing data quality monitoring.
  • A leading bank, where we conducted a data quality assessment and implemented DSS Quality Monitor in its branch network for continuous data cleansing.
  • Another bank, where we cleaned and deduplicated customer data during a core banking system migration.
  • A major publishing company, where we cleaned customer records before implementing a new CRM system.
  • A leading food retail chain, where we cleansed and deduplicated data from three different source systems.

Data cleansing is closely related to duplicate detection. But what exactly are duplicates? We’ll cover this topic in our next post. In the meantime, you can find a brief description of our Data Quality Assurance services here.

Interested in improving your company’s data quality? Why not discuss it over a cup of coffee?