Synthetic data for data augmentation

Improving fraud detection models with synthetic data

STARTING POINT
37% of mortgage requests were manually reviewed
110k€ potential median loss per fraud (ACFE report)
RESULT
+15% Recall

Fraud detection is a critical task in the financial domain, and machine learning is often regarded as a promising way to automate it. It is typically used to flag suspicious requests that a domain expert must then check manually. Reducing the number of flags greatly reduces the expert's workload, but at the same time it is paramount that no fraudulent case goes unflagged.

Unfortunately, the datasets used to train and evaluate these algorithms are often affected by strong class imbalance, which occurs when the number of examples of one class (e.g. fraudulent transactions) is significantly lower than the number of examples of the other class (e.g. non-fraudulent transactions). This issue makes it difficult to train robust, high-performing models, and it usually translates into a higher workload for the domain experts, who have to check each flagged request manually.

For this use case, the most important metrics for assessing the efficiency of the automation process are the model’s Precision and Recall. Precision is the fraction of flagged requests that are actually fraudulent, while Recall measures the model's ability to identify all fraudulent requests.
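To make the two metrics concrete, here is a minimal sketch in plain Python that computes them from true and predicted labels. The function name `precision_recall` and the toy labels are illustrative assumptions, not data from the BearingPoint project.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 = fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # frauds correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate requests flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # frauds that went unflagged
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, made up for illustration: 1 = fraud, 0 = legitimate
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the four frauds are caught (recall 0.75) and three of the five flags are real frauds (precision 0.60). Higher recall means fewer missed frauds; higher precision means fewer flags that waste the expert's time.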

BearingPoint is an independent management and technology consultancy with European roots and a global reach. The company is one of the leading providers of ML-based fraud detection models and was exposed to these issues on a daily basis. Thanks to the field test we carried out with BearingPoint, we demonstrated that generating synthetic data makes it possible to increase model performance while decreasing the number of transactions that must be manually checked.

In particular, Clearbox AI's Enterprise Solution helped them tackle class imbalance while working with one of their clients, who had commissioned a fraud detection model from them.

Challenge
How to mitigate the class imbalance affecting fraud detection datasets when dealing with complex data pipelines?
Solution
Our product was used to generate synthetic examples of fraudulent cases.
Result
BearingPoint was able to train high-performing models on the augmented data, translating to higher recall and lower fraud detection workloads.

The challenge

Several techniques can be adopted to mitigate class imbalance. Oversampling, for example, consists of creating synthetic minority examples to re-balance the original dataset. SMOTE, one of the most popular oversampling techniques, has proven useful in many applications: it creates new minority examples by interpolating between existing ones and their nearest neighbors. The problem arises when the dimensionality of the dataset increases, which is often the case in fraud detection, where we want to make use of as much information as possible. In that setting, the synthetic examples generated by SMOTE become more and more unrealistic. It is therefore necessary to use alternative methods, for example ones based on generative models.
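The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the reference SMOTE implementation and not the generative approach used in Clearbox AI's product: the function name `smote_like`, its parameters, and the toy points are all hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority points.

    Simplified sketch: pick a random minority point, pick one of its k nearest
    minority neighbours, and sample a point on the segment joining them.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with two numeric features (made-up values)
fraud_points = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
synthetic = smote_like(fraud_points, n_new=5)
for point in synthetic:
    print(point)
```

Every synthetic point lies on a straight line between two real minority points. With two features this tends to look plausible; with many features, or with categorical ones, points on such segments may fall in regions where no real fraud would ever occur, which is the limitation described above.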

The solution

BearingPoint installed our Enterprise Solution on the infrastructure of one of their clients, a retail bank. They connected it to a relational database containing transaction histories and used our tool to quantify class imbalance and find the best data augmentation strategy. Finally, they generated an enriched dataset containing the original client data plus several synthetic fraudulent examples, and used this dataset to train a machine learning model based on boosted trees.

The result

Access to the augmented dataset allowed BearingPoint to considerably reduce the number of false flags. As we demonstrated, the model trained on augmented data showed a Recall improvement of 15% (+12% with respect to the best combination of under/oversampling). This translates into more efficient and cost-effective fraud detection workflows and lighter workloads. These results can be extended to other use cases as well, both in the financial sector and beyond. The technology can be applied to any sector that needs a lot of data to improve its processes, for example insurance, energy, telco, urban mobility, retail, and healthcare.

Talk with us

Drop us a line if you want to learn more about how we can help you or to figure out the best option for your project. We will reach out to you ASAP.