Your Analytic is WRONG

In a recent LinkedIn post, Brent Dykes, a data story teller, talks about the dull but essential “hidden” work that happens in data analytics.

Brent uses this insightful graphic to illustrate the point.  

This picture also clearly shows why communication breakdowns and expectation gaps so often exist between data analysts and data users. It made me realise that, just like the audience or listeners of data stories know very little of what happens beneath the data lake, so do data analysts sometimes know very little of what is happening in the business, what is really expected of them and what responsibilities they carry.

This brings me to a very serious and dangerous risk that I believe is not receiving the attention it deserves. Your analytic may be wrong and you don’t even know it.

There are generally two types of errors in data analytic exception reports. False Positives and False Negatives. A false positive is the inclusion of a record or transaction in a list of exceptions when it is in fact not an exception. A false negative is the exact opposite, not including a real exception in a list of exceptions.

False positives are less dangerous since users are very quick to point them out to the analysts. It’s the false negatives that are the real problems.

At Beta Software we use Arbutus Analytics software to develop continuous auditing & risk monitoring systems to consulting clients and to provide data analytics on external audit assignments . Exceptions reported to auditors or line managers must be investigated as soon as they are reported by the analytic. A false positive is a pain in the neck and the user will complain that he or she had to spend time investigating something that is not a real exception.

In my experience, if something is not reported (false negative), everyone involved is happy. The analytic tells them there are no exceptions or control failures. But what if the underlying analytic is wrong? It is similar to driving your vehicle and not knowing your oil pressure gauge is not working.

Dykes’ post talks about some of the “hidden” tasks performed in data analytics. I want to add and highlight two tasks that are crucial to minimise the risk of false negatives.

Firstly, Data Access & Preparation is probably one of the most technical and important tasks in data analytics. If we don’t access ALL the CORRECT data, this task can and will result in incorrect analytics.

It is essential for data analysts to:

  • Know and understand all databases, tables and fields in the data lake.
  • Be able to connect to all databases using odbc, jdbc, api, rfc etc.
  • Know how to join or relate tables within and across all databases. This requires intimate knowledge of key fields or how to create key fields.

The second task I want to highlight, understanding the analytic objective, is far less technical but is the main reason why miscommunication and expectation gaps exist. It’s all about understanding the question. The problem is that business and IT speak different languages. It is the reason why a Business Analyst can be a strategic member of any development team.

Let’s look at a classic example of a control measuring analytic – management wants to know if employees are also supplying goods or services to the organisation. It sounds like a reasonably simple question without much room for miscommunication or expectation gaps.

In reality it can be much more complex. It’s easy to go down the wrong road.

I heard the question and now start translating the question into “data speak”. My logic goes something like this:

I must know which system, table and fields contain supplier master data. I must also know which system, table and fields contain employee master data. In many cases the data comes from separate systems and databases (eg. SAP on Oracle and VIP on MS SQL Server). Once the correct data has been accessed, the question of key fields must be answered. Do you match on bank account numbers, email addresses, physical addresses, telephone numbers or a combination of fields? I then need to harmonise those fields to ensure correct matches take place.

I must now step back and ask the question. Am I answering my audience’s question – Are employees also suppliers?

Well, the answer is no. I’m telling the audience which employees are “probably or potentially” also suppliers.

To really answer the question and allow the data storytellers to answer the audience correctly and completely, I need to check if the employee actually supplied goods or services to the organisation. I need to analyse transactional procurement data in another table with many fields. I also need to join this information to the supplier and employee master data.

OK, I’ve done it! Right? I’ve shown that the employee and supplier bank accounts are the same in the master tables. I’ve also shown that the supplier actually supplied goods or services to the organisation.

Wrong

Like the famous actor said. Show me the money! To which bank account did the payment for the goods or services (money) flow? This information is contained in tables from your banking system logs. Access those tables, do the joins and required matches. Now I know that the payment was in fact made to a bank account owned by both an employee and a supplier.

Did I answer the question? I’ve answered that “the following suppliers, with bank account numbers matching bank account numbers in the employee master table, supplied goods or services to the organisation and the payment went to that (matched) bank account” I’ve answered the audience completely and accurately. Right?

Wrong again.

Without getting too technical here, it’s worth mentioning that the sequence of your joins influence your final matches. Remember we first joined the supplier master data to the employee master data and extracted matches. All other employees have been “cleared”.

Forensic analysts will tell you that the person capturing payment data often change the bank account number on the banking system before payment is made. That changed bank account number is also often their own. But this employee has already been “cleared” in the first join…

In addition, many so called “one-time” payments are physically made to bank accounts that are not on the supplier master table. The impact is that you will get no match between the bank account number on the banking system log and the supplier master data.

You need to join and match the banking system log, for payments other than salaries, with the complete employee master table to find all payments to bank accounts owned by employees. Matches represent employees that were paid for supplying goods or service.

That is it. It’s done. I didn’t need the supplier master data. I didn’t need the transactional data. I only needed to understand the question properly.

Many life gurus tell us that a mistake is only a bad one if we learn nothing from it. So why did I make this mistake?

Well, firstly, the question wasn’t well formulated. The question should have been “Did we pay for goods and services into employee bank accounts?”. I should have asked management to describe an exception to me.

Secondly, I translated the question directly into data language. I saw employees and I saw supply in the question and immediately assumed the employee master tables and supplier master tables are the main players.

Thirdly, my professors in our masters classes drilled the “access path principle” into us constantly. I should have drawn the complete payment transaction from initiation to payment before designing the analytic. However, I only took data tables inside our own data lake into account. I failed to realise that the banking system, a third party system, is a very important but segregated part of the environment. It’s worth mentioning here that I believe data analysts, and specifically auditors, don’t look at banking systems and its data nearly enough.

In conclusion, I realise experienced data analysts might not make these mistakes. Large IT environments have people whose only job is to look for such mistakes. However, outside these IT environments lives many data analysts, like myself, who most often is the only analyst on an audit or in their department. These people are the designers, developers, and executioners of important analytics on a daily basis. This example shows that data analytics can become complex, messy and dangerous if we don’t make sure we understand the question properly or don’t plan the analytic properly and don’t consider false negatives.

I use an analytic design checklist that includes “consider false negatives” at the beginning and “verify false negatives” at the end.

As a footnote, in the example I referred to the employee master table. This table typically contains good data but not necessarily complete data. The employee can own other bank accounts or their family members can. This must be explained to the audience. Some analytics like fuzzy matches on payee names etc. can be used to get a more complete match. But that is a story for another time. 

Author Bio

Anton has been working as an auditor and data analyst for the past 30 years. A large number of his clients are on SAP ERP or SAP/4HANA and he has extensive experience extracting and analyzing SAP data.

He holds a Master’s degree in Computer Auditing and is Regional Director: Data Analytics at RSM South Africa and CEO of RSM subsidiary, Beta Software, an authorised partner of Arbutus Software in Southern Africa.