Demystifying Fuzzy Duplicates in Data Analytics

Beta Software consultants have been working in the ‘All things Data’ world for more than 20 years. This article covers our Top 5 Tips on how to analyse and harmonise data using Fuzzy Duplicates.

When independent analysts are asked to perform Fuzzy Duplicate analytics, the same questions commonly arise:

  • What exactly is Fuzzy Duplicate testing?
  • Does it make sense to do Fuzzy Duplicate testing?
  • What best practices should I be using?
  • What data prep should I be doing before the analysis?
  • How do I interpret the results?

What are Fuzzy Duplicates, you ask? An emerging and increasingly important area of data analytics is fuzzy matching: the art of looking for values in character strings that are not exactly the same but are close enough to be deemed similar, such as individual or company names, or addresses. “Close” is often interpreted as “visually” close; in other words, discernibly similar when scrutinised by eye.

With our 20 years of knowledge in this area, these Top 5 Tips will help you analyse fuzzy duplicates in no time at all!

Tip 1: Master Tables Make Sense

Generally speaking, it’s good practice to perform Fuzzy Duplicate testing on Master Tables, for a simple reason: most Master Tables should have unique entries, so no duplicates should be detected in them. The Master Tables I am referring to include the Vendor Master (unique vendor names and addresses), the Payments/Bank Account Master, Stock Tables (unique stock names) and the Employee Master (unique employee names).

Again, most Master Tables should have unique entries, but this does not hold true for all of them; check your data to confirm.

Tip 2: Clean Up On Aisle 5

Basic data cleansing or scrubbing should be done first. Depending on the data, all that may be required is to remove leading and trailing spaces and to make the data uniform, for example by converting all alpha characters to upper, lower or proper case.

Our analytics software of choice is Arbutus Analyzer, which can do this and much, much more. An example of the Upper and AllTrim functions can be seen below, cleaning up the Company_Name field by creating a Computed Field:
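If you are not working in Arbutus, the same clean-up is easy to sketch in any language. Here is a rough Python equivalent of combining the Upper and AllTrim functions; the sample company names are made up purely for illustration:

```python
# Hypothetical sample data -- these Company_Name values are illustrative only.
company_names = ["  acme ltd ", "Acme Ltd", "ACME LTD  "]

# Equivalent of an Arbutus computed field applying AllTrim then Upper:
# strip leading/trailing spaces, then uppercase.
cleaned = [name.strip().upper() for name in company_names]

print(cleaned)  # all three rows now read "ACME LTD"
```

Three visually different rows collapse to a single value, which is exactly why this step comes before any duplicate test.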

Tip 3: Harmony, Not Just For The Birds

Data Harmonisation/Normalisation is swift and easy to achieve in Arbutus Analyzer. Arbutus provides a Normalize function, a powerful function that cleans up your data in preparation for Fuzzy Duplicate testing.

Specifically, it:

  • Removes all non-alphanumeric characters except blanks (commas, dashes, periods, etc.), replacing contiguous removed characters with a single blank.
  • Replaces all international characters (those containing accents, etc.) with the English base character (è becomes E, for example); in German, the “ß” in “Straße” is replaced with “SS” to yield “STRASSE”.
  • Trims leading blanks.
  • Compacts contiguous blanks into a single blank.
  • Uppercases the remaining data.
  • Optionally, allows specification of non-alphanumeric and foreign characters to be retained.
  • Optionally, allows specification of one or more substitution files to standardize alternate valid representations of values within the data (e.g. “Rd” for “Road”, “St” for “Street”).
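The steps above can be sketched in Python as follows. This is a rough approximation of the behaviour described, not Arbutus’s actual implementation (the exact order of operations may differ), and the substitution table is a made-up example:

```python
import re
import unicodedata

# Hypothetical substitution table, mirroring Arbutus's optional
# substitution files ("RD" -> "ROAD", "ST" -> "STREET").
SUBSTITUTIONS = {"RD": "ROAD", "ST": "STREET"}

def normalize(value: str) -> str:
    """Sketch of the Normalize steps listed above (not Arbutus's code)."""
    # German sharp s maps to SS, as in Strasse -> STRASSE.
    value = value.replace("ß", "ss")
    # Strip accents: decompose, then drop combining marks (è -> e).
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # Uppercase the remaining data.
    value = value.upper()
    # Replace each run of non-alphanumeric characters with a single blank
    # (this also compacts contiguous blanks).
    value = re.sub(r"[^A-Z0-9]+", " ", value)
    # Trim leading (and trailing) blanks.
    value = value.strip()
    # Apply the substitution table word by word.
    return " ".join(SUBSTITUTIONS.get(word, word) for word in value.split())

print(normalize("Straße"))           # STRASSE
print(normalize("Main Rd., Apt 5"))  # MAIN ROAD APT 5
```

Different spellings, punctuation and accents now map to one canonical form, ready for duplicate testing.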

As seen below, Company_Name (Highlighted in Blue) was recreated using the Normalize function.

Tip 4: Not So Fuzzy Now?

With data harmonisation and uniformity now achieved, we can use the powerful Duplicate command in Arbutus to run whichever duplicate tests we wish; in this example, Exact and Similar Duplicate testing was performed.

Exact Duplicate testing does what the name implies: it looks for exact matches in our data. Since the data has been normalised, we can expect more accurate and concise Exact Duplicate results.
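Conceptually, an exact-duplicate test on normalised values is just a frequency count. A minimal Python sketch, using made-up vendor names:

```python
from collections import Counter

# Hypothetical normalised vendor names; the duplicates are illustrative.
vendors = ["ACME LTD", "BETA CO", "ACME LTD", "GAMMA INC", "BETA CO"]

# Exact-duplicate test: flag any normalised value appearing more than once.
counts = Counter(vendors)
duplicates = {name: n for name, n in counts.items() if n > 1}

print(duplicates)  # {'ACME LTD': 2, 'BETA CO': 2}
```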

Similar Duplicate testing uses the Damerau-Levenshtein distance. Briefly, the Damerau-Levenshtein distance is a measure of how many single-character edits (insertions, deletions, substitutions and transpositions of adjacent characters) would have to be made to one string to make it match another.
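For readers curious how the metric works under the hood, here is a short Python sketch of the optimal-string-alignment variant of Damerau-Levenshtein, a common formulation of the distance; Arbutus’s internal implementation may differ:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal-string-alignment variant of the Damerau-Levenshtein distance.
    Counts insertions, deletions, substitutions and adjacent transpositions.
    (A sketch of the concept, not Arbutus's internal implementation.)"""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("SMITH", "SMIHT"))  # 1 (one transposition)
```

Two names one edit apart, like the swapped letters above, are exactly the kind of pair Similar Duplicate testing surfaces.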

Similar Duplicate Testing was run and the output can be seen below compared side by side.

Tip 5: Results, I Want Results!

As stated, the Tips above showed Fuzzy Duplicate testing on a character field that could contain alpha, numeric and special characters. We can do more comprehensive testing, such as Same-same-same, Same-same-different, Same-same-near (numeric and date) and Same-same-similar.
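As an illustration of the Same-same-different idea, here is a hedged Python sketch that flags payments where vendor and amount match but the invoice number differs, a classic duplicate-payment pattern; all records are hypothetical:

```python
from collections import defaultdict

# Hypothetical payment records: (vendor, amount, invoice_no).
payments = [
    ("ACME LTD", 500.00, "INV-001"),
    ("ACME LTD", 500.00, "INV-002"),  # same vendor, same amount, different invoice
    ("BETA CO", 250.00, "INV-100"),
]

# Same-same-different: group on (vendor, amount), then flag groups
# whose third field (invoice number) takes more than one value.
groups = defaultdict(set)
for vendor, amount, invoice in payments:
    groups[(vendor, amount)].add(invoice)

flagged = {key: invoices for key, invoices in groups.items() if len(invoices) > 1}
print(flagged)
```

The same grouping pattern extends to the Near and Similar variants by replacing the equality test on the third field with a numeric/date tolerance or a distance threshold.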

As for interpreting the results, that depends on the data used in the test. For Company Name, we can determine whether there are duplicate customers with different bank account numbers, or duplicate employee names with multiple payroll transactions. Stock items are another example: a great way to test whether the same stock item appears under different vendors at different prices.

A note worth mentioning: there could be false positives in your data, and there could be legitimate reasons for the duplicates. However, there could be more sinister reasons as well.

Fuzzy Duplicate Testing, a test worth Duplicating! 

Author Bio

Terence Olivier has been in the BI/Data Analytics world for 15 years. He has worked with a vast array of BI and data analysis software and has enjoyed being involved in projects in the mining, logistics and retail industries, among many others. His specialties are Data Analytics, fraud detection and Data Analytics training, as well as visually turning data into trends that tell a story about your data and, ultimately, your company.

Please feel free to contact us if you have any questions, challenges or obstacles you would like to overcome. Thank you for reading!