Vehicle Insurance Fraud Analytics


The insurance sector has traditionally analysed fraud data in silos and largely ignored unstructured data points, but this is changing. According to Morgan Stanley, a more advanced analytics approach helps insurers improve fraud detection rates by 30%.

Many insurers now analyse their internal data, such as call centre notes and voice recordings, alongside social media data and third-party details on people’s bills, wages, bankruptcies, criminal records, and address changes to gain insight into potentially fraudulent claims.

For example, a claimant may declare that their car was damaged by floods, while their social media feed indicates that the weather was sunny on the day of the incident. Insurers can supplement this data with text analytics technology that detects minor discrepancies hidden in a claimant’s case report. Fraudsters tend to alter their story over time, which makes this a powerful tool for detecting criminal activity.

In addition to cost savings, advanced analytics is helping businesses improve the customer experience and protect their brand reputation. For insurers, fraud-related losses not only damage their finances but can also lead to price increases for customers and longer review times for legitimate claims. Honest customers have little patience for this, so keeping fraud to a minimum is crucial to keeping customer turnover down.

In this article we walk through the process of analysing the available data, from exploratory data analysis and statistical analysis to machine learning models, to detect fraud in claims.

Exploratory Data Analysis

We collected a fictitious data set from the internet in which 25% of the claims are fraudulent. It has around 40 attributes. Based on our understanding of the insurance domain, we plotted various graphs relating different attributes to fraud. The graphs and associated observations are provided below.

  1. No clear upward or downward trend is observed over time
  2. The highest number of frauds (15) was reported in 2002

South Carolina (SC) reported the highest number of fraud incidents (73), while Ohio (OH) reported the highest share of frauds as a percentage of the state’s total claims (47.6%).


The Major Damage category has both the highest count (167) and the highest percentage of fraud (61.17%).


Unknown cases have both the highest count (103) and the highest percentage of fraud (29.1%).


The highest number of incidents involves a single vehicle, but the highest percentage of fraudulent cases occurs when four vehicles are involved.


While exploratory data analysis gives some insight into how each attribute relates to fraudulent claims, it does not reveal the relative influence of the attributes, i.e. which attributes influence fraud significantly and which have little or no influence. It also does not provide a model for assessing the probability that a new claim is fraudulent.

Below are some of the insights observed from the charts above:

  1. Ohio has the highest share of fraudulent claims (47.6%), and its capital loss is also the highest (29739 USD)
  2. 61.17% of the reported fraudulent claims are identified when the incident severity is registered as major
  3. More fraudulent claims (31.82%) are identified when the collision happened at the rear of the car
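
The counts and percentages quoted in the observations above can be reproduced with a simple group-by. The sketch below is a minimal illustration using pandas; the column names incident_severity and fraud_reported and the six sample rows are hypothetical stand-ins, not the actual data set.

```python
import pandas as pd

# Tiny hypothetical sample; the real data set has around 40 attributes.
claims = pd.DataFrame({
    "incident_severity": ["Major Damage", "Major Damage", "Minor Damage",
                          "Total Loss", "Minor Damage", "Major Damage"],
    "fraud_reported": ["Y", "Y", "N", "N", "N", "N"],
})

# Encode fraud as 1/0 so that sum() counts frauds and mean() gives the fraud rate.
claims["is_fraud"] = (claims["fraud_reported"] == "Y").astype(int)

summary = (
    claims.groupby("incident_severity")["is_fraud"]
    .agg(claims="count", frauds="sum", fraud_rate="mean")
    .sort_values("fraud_rate", ascending=False)
)
summary["fraud_pct"] = (summary["fraud_rate"] * 100).round(1)
print(summary)
```

The same pattern applies to any categorical attribute (state, collision type, number of vehicles involved) by changing the group-by column.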

Developing a Machine Learning Algorithm to detect fraud

Listed below are the key data analytics techniques used for this purpose. All of these techniques together will help create a robust fraud prediction solution.

Predictive Analytics

Machine learning algorithms (usually supervised) use historical data in which the value of the outcome variable is known (labelled) to build a model that can predict the value of the outcome variable in new data where it is not known. A good machine learning model predicts the outcome accurately and thus supports quick decisions in the process workflow. In fraud prediction, the outcome variable is the fraud indicator.

As a first step, it is important to understand what data is useful and available for building the fraud model. Normally, historical claim, policy, customer profile and investigation data are used for this purpose, along with classified and reviewed non-investigated data. Allied data points, such as driving license and ticket-related data, can also be useful. The model building process goes through several steps: assessing the quality of the data, understanding the variables and the relationships between them, selecting the best predictor variables, and building and validating the model.

Challenges in creating predictive models

There are two main challenges in creating machine learning models for fraud detection:

  1. The main challenge in identifying fraud through supervised learning is obtaining labelled data. Most organizations do not have their fraud history properly labelled and stored, and labelling such data costs both time and money.
  2. Fraudsters constantly find new ways to commit fraud, whereas supervised algorithms learn fraud patterns only from historical data. New cases that do not follow the historical patterns may go undetected.

Solution: Hybrid Machine Learning Methodology

Since labelled data is not always available, we first use unsupervised methods, such as clustering and anomaly ranking algorithms, to identify fraudulent patterns. Claims that the unsupervised models flag as suspicious are reported and sent to the investigation team for enquiry.

The investigation team works on the reported cases and checks whether they are actually fraudulent. The cases are then marked accordingly and stored in the data stores. This is the basic way to build up a database of labelled data.

This labelled data can be used for further data analysis and for building supervised machine learning models, which enrich the solution’s ability to identify fraudulent cases.

Unsupervised Learning: Anomaly Detection

Anomaly detection, which is mostly an unsupervised methodology, can be used to identify fraud. It encompasses a large collection of data mining problems such as disease detection, credit card fraud detection, and the detection of any new pattern amongst existing patterns. Compared with a simple binary classification, a ranking that represents the degree of relative abnormality is advantageous in cost-benefit analysis and in turning analysis into action. Moreover, these methodologies do not require labelled data to identify trends.

Many anomaly detection algorithms can be used. For the scope of this work, we have used the following:

  1. Density Based Spatial Clustering of Applications with Noise (DBSCAN)
  2. Local Outlier Factor (LOF)
  3. Spectral Ranking for Anomaly (SRA)

Supervised Learning

Once the claims are labelled as fraudulent or not, supervised models can be used to improve predictive power. Several supervised statistical and machine learning classification algorithms can be used to predict insurance fraud. The following algorithms are used in our model building process:

  1. Logistic Regression
  2. Random Forests
  3. Gradient Boosting
  4. Neural Networks
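
As a rough illustration of how these four model families can be trained side by side, the sketch below uses scikit-learn on synthetic data with roughly 25% positive cases. The library choice, the synthetic data and all hyperparameters are assumptions made for the example; they are not the settings used for the results in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for labelled claims: about 25% of rows are "fraud" (class 1).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(32, 16),
                                    max_iter=500, random_state=0),
}

# Fit each model on the same training data and record recall on held-out data.
recalls = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    recalls[name] = recall_score(y_test, model.predict(X_test))
    print(f"{name}: recall = {recalls[name]:.2f}")
```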

Social Network Analysis

Several people may collude to defraud an insurance company. They may know each other or be part of a fraud ring. Most traditional fraud detection methods miss this aspect: frauds committed by such a group may not look like fraud when the claims are examined individually.

When the claims are looked at as a group, however, common patterns between them can become visible. Social network analysis brings out such patterns in the data and helps to discover the people and networks committing these frauds. Techniques such as network analysis, Google’s PageRank and association algorithms are used to bring out these insights from the data. The networks then need to be investigated to confirm the presence of fraud rings.
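
One simple way to look at claims as a group is to link any two claims that share an identifying detail and then extract the connected groups. The sketch below does this in plain Python with a union-find structure. The claim records, the choice of phone number and address as linking attributes, and the rule that any group of two or more claims is a candidate ring are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical claims: (claim id, contact phone, address).
claims = [
    ("C1", "555-0101", "12 Oak St"),
    ("C2", "555-0102", "12 Oak St"),
    ("C3", "555-0102", "9 Elm Rd"),
    ("C4", "555-0199", "77 Pine Ave"),
]

# Union-find: every claim starts as its own group.
parent = {claim_id: claim_id for claim_id, _, _ in claims}

def find(x):
    """Return the representative of x's group, shortening paths on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    """Merge the groups containing a and b."""
    parent[find(a)] = find(b)

# Link each claim to the first claim seen with the same phone or address.
first_seen = {}
for claim_id, phone, address in claims:
    for key in (("phone", phone), ("address", address)):
        if key in first_seen:
            union(claim_id, first_seen[key])
        else:
            first_seen[key] = claim_id

groups = defaultdict(list)
for claim_id in parent:
    groups[find(claim_id)].append(claim_id)

# Groups with more than one claim are candidates for investigation.
rings = sorted(sorted(group) for group in groups.values() if len(group) > 1)
print(rings)  # prints [['C1', 'C2', 'C3']]
```

Here C1 and C2 share an address and C2 and C3 share a phone number, so the three claims form one group even though C1 and C3 have nothing directly in common.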

Text Analytics

Text analytics is another technique available for fraud prediction. The claim settlement process creates several documents. Some of them, such as police reports, medical reports and adjuster notes, may provide an indication of fraudulent transactions. Most of the time these reports are paper based and stored in a document management system as scanned PDFs or image files. They can be a rich source of data for predicting fraud. Using an optical character recognition tool, one can convert the scanned PDFs/images into structured and unstructured text data, and text analytics techniques then turn the unstructured text into insights. Text analytics and sentiment analysis can be applied to unstructured text to identify feelings, attitudes and opinions, so opinions can be mined from doctors’ notes, adjusters’ notes, police reports and similar documents and used for analysis. Text analytics can also be used to index the unstructured notes and derive additional variables for building predictive models.

A Case Study of Fraud Detection using Predictive Analytics

Data definition

In this section we will develop a predictive analytics model on the sample data we have. Below are the features involved during the model building process.

Feature Name            Description
age                     Customer's age
authorities_contacted   Which government authority was contacted
auto_make               Car manufacturer
auto_model              Car model
auto_year               Year of production
bodily_injuries         Number of bodily injuries
capital-gains           Capital gain from the claim
capitalloss             Capital loss from the claim

Feature Engineering and Feature Selection

Feature Engineering

Success in machine learning algorithms is dependent on how the data is represented. Feature engineering is a process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model performance on unseen data. Domain knowledge is critical in identifying which features might be relevant and the exercise calls for close interaction between a loss claims specialist and a data scientist.

Importance of Feature Engineering:

  1. Better features result in better performance: The features used in the model depict the classification structure and result in better performance
  2. Better features reduce complexity: Even a bad model with better features tends to perform well because the features expose the structure of the data classification
  3. Better features and better models yield high performance: If all the engineered features are used in a model that performs reasonably well, there is a greater chance of a highly valuable outcome

Feature Selection:

Out of all the attributes in the dataset, those that are relevant to the domain and boost model performance are picked and used, while the attributes that degrade model performance are removed. This process is called Feature Selection or Feature Elimination.

There are several methods used for selecting appropriate features for optimal model performance. Following are some of the most commonly used methods.

Trial & Error: Start with the features known from domain knowledge, add other features one at a time, and observe the model performance. Keep the features that improve performance and drop those that leave it unchanged or degrade it. This approach is called Forward Selection. The other approach is to start with all features, eliminate one feature at a time, and observe the performance, again keeping the features whose removal would degrade it. This approach is called Backward Elimination, and it has proven to be the better method when working with trees.
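
Both directions can be automated rather than run by hand. The sketch below uses scikit-learn's SequentialFeatureSelector on synthetic data; the estimator, the target of three features, the recall scoring and the three-fold cross-validation are illustrative assumptions, not the procedure used in this article.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 candidate features, of which 3 carry signal.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
estimator = LogisticRegression(max_iter=1000)

# Forward Selection: start with no features and add the best one at each step.
forward = SequentialFeatureSelector(estimator, n_features_to_select=3,
                                    direction="forward", scoring="recall", cv=3)
# Backward Elimination: start with all features and drop the least useful one.
backward = SequentialFeatureSelector(estimator, n_features_to_select=3,
                                     direction="backward", scoring="recall", cv=3)

forward_idx = forward.fit(X, y).get_support(indices=True)
backward_idx = backward.fit(X, y).get_support(indices=True)
print("forward selection kept columns:", forward_idx)
print("backward elimination kept columns:", backward_idx)
```

The two directions do not necessarily agree on the final feature set, which is one reason to compare them on a validation metric.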

Dimensionality Reduction (PCA): PCA (Principal Component Analysis) translates the given higher-dimensional data into lower-dimensional data. It is used to reduce the number of dimensions by selecting the dimensions that explain most of the data set’s variance (in this case, 99% of the variance). The best way to see how many dimensions explain the maximum variance is a two-dimensional plot of the cumulative explained variance against the number of dimensions.
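
A minimal sketch of keeping enough components to explain 99% of the variance is shown below, using scikit-learn. The data is synthetic (ten columns driven by three hidden factors plus a little noise), and standardising the columns first is an assumption of this example rather than a step described in the article.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 10 observed columns generated from 3 hidden factors.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(500, 10))

# Standardise so that no column dominates purely because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.99, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components kept:", pca.n_components_)
print("cumulative explained variance:", cumulative.round(4))
```

Plotting cumulative against the component index gives the two-dimensional view of how many dimensions are needed.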

Statistical Significance: Univariate feature selection works by selecting the best features based on univariate statistical tests. Our data set has both numerical and categorical data. To calculate statistical significance we use the chi-square test for categorical variables and ANOVA for numerical variables. These tests give the p-value of each independent variable with respect to the dependent variable. The table below lists the features and their p-values (chi-square for categorical features, ANOVA otherwise).

Feature Name              P value
umbrella_limit_Y_N        95%
age                       70%
insured_education_level   75%
property_damage           65%
policy_annual_premium     65%
policy_deductable         64%
capitalloss               64%
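
The two tests can be sketched with SciPy as follows. The twelve rows are invented for illustration, and the column names incident_severity and fraud_reported are assumed; only policy_annual_premium appears in the feature list above.

```python
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

# Hypothetical sample of 12 claims.
claims = pd.DataFrame({
    "incident_severity": ["Major Damage"] * 4 + ["Minor Damage"] * 4 + ["Total Loss"] * 4,
    "policy_annual_premium": [1400.0, 1250.5, 1320.0, 1100.0, 980.0, 1210.0,
                              1050.0, 1380.0, 1150.0, 1290.0, 1010.0, 1175.0],
    "fraud_reported": ["Y", "Y", "Y", "N", "N", "N", "N", "Y", "N", "Y", "N", "N"],
})

# Categorical feature vs. fraud flag: chi-square test on the contingency table.
table = pd.crosstab(claims["incident_severity"], claims["fraud_reported"])
chi2, chi_p, dof, expected = chi2_contingency(table)

# Numerical feature vs. fraud flag: one-way ANOVA across the fraud / non-fraud groups.
fraud = claims.loc[claims["fraud_reported"] == "Y", "policy_annual_premium"]
legit = claims.loc[claims["fraud_reported"] == "N", "policy_annual_premium"]
f_stat, anova_p = f_oneway(fraud, legit)

print(f"chi-square p-value (incident_severity): {chi_p:.3f}")
print(f"ANOVA p-value (policy_annual_premium): {anova_p:.3f}")
```

With so few rows the p-values themselves are not meaningful; the point is only the mechanics of obtaining one p-value per feature.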

Building Models, Comparison and Improvement

Unsupervised Modelling

Three unsupervised models are built on the given data, each with its own outlier scoring methodology. The output format of each algorithm for sample users is shown below.


Outputs of each algorithm can be interpreted as follows:

  1. DBSCAN: Since DBSCAN is basically a clustering algorithm, it reports the number of the cluster to which each data point belongs. Data points that are outliers are marked as -1
  2. LOF: The algorithm marks outliers as -1 and inliers as 1
  3. SRA: The algorithm gives high scores to abnormal data points; the higher the score, the greater the abnormality. Since SRA does not mark outliers directly, we can choose the threshold value based on the business requirement
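
The DBSCAN and LOF output conventions described above match those of scikit-learn, which the sketch below uses on invented two-column data (claim amount and days since policy start). SRA is not part of scikit-learn and is not implemented here; to illustrate the thresholding idea, the continuous LOF score is used as a stand-in, with a hypothetical cutoff at the top 2% of scores.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical claims: 200 ordinary ones plus 3 with large amounts filed
# within days of the policy start.
ordinary = rng.normal(loc=[5000, 400], scale=[800, 60], size=(200, 2))
unusual = np.array([[30000.0, 5.0], [28000.0, 10.0], [45000.0, 2.0]])
X = StandardScaler().fit_transform(np.vstack([ordinary, unusual]))

# DBSCAN: cluster numbers 0, 1, ... for clustered points, -1 for outliers.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# LOF: -1 for outliers, 1 for inliers.
lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X)

# Score-based ranking: higher score means more abnormal. The cutoff is a
# business decision; here the top 2% of scores are sent for review.
scores = -lof.negative_outlier_factor_
threshold = np.quantile(scores, 0.98)
flagged = np.where(scores > threshold)[0]

print("DBSCAN labels of the unusual claims:", dbscan_labels[-3:])
print("LOF labels of the unusual claims:", lof_labels[-3:])
print("row indices flagged by the score threshold:", flagged)
```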

Supervised Modelling

The model building activity involves constructing machine learning algorithms that can learn from historical data and make predictions or decisions on unseen data. The detailed modelling process is as follows.

Once the models are built, they are trained on the training dataset, validated on the validation dataset while the hyperparameters are fine-tuned, and finally tested on the test dataset. At each stage, the chosen performance metric is observed until the desired performance level is reached.


The following steps summarize the process of the model development:

  1. Once the dataset is obtained, it is processed to improve its quality and then divided into training, validation and test sets in a 70:20:10 ratio. This ratio can vary depending on the overall size of the data set available
  2. A particular algorithm is then chosen, features are engineered for it, and the model is trained on the training data until we get the desired performance
  3. The model is then evaluated on the validation data. If the performance is not good enough, we go back to the training step, tune some of the hyperparameters and evaluate again on the validation set. This process is repeated until we are satisfied with both the training and validation set accuracies
  4. Finally, the model is tested on the test data set. If we get the desired result here, the model is deployed for production use
  5. If we don’t get the desired results with this algorithm, we try other suitable algorithms and repeat the process
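
The steps above can be sketched as follows with scikit-learn. The synthetic data, the logistic regression model, the single tuned hyperparameter C and its candidate values are assumptions made to keep the example short; the 70:20:10 split follows step 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed claims data (about 25% positives).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.75, 0.25], random_state=0)

# Step 1: 70% training, then split the remaining 30% into 20% validation
# and 10% test (one third of the remainder).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=1 / 3, stratify=y_rest, random_state=0)

# Steps 2-3: train on the training set, tune a hyperparameter on the validation set.
best_c, best_val_score = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):
    candidate = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    val_score = fbeta_score(y_val, candidate.predict(X_val), beta=2)
    if val_score > best_val_score:
        best_c, best_val_score = c, val_score

# Step 4: evaluate the chosen configuration once on the untouched test set.
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
test_score = fbeta_score(y_test, final_model.predict(X_test), beta=2)

print(len(X_train), len(X_val), len(X_test))
print(f"best C = {best_c}, validation F2 = {best_val_score:.2f}, test F2 = {test_score:.2f}")
```

Keeping the test set out of the tuning loop is what makes the final score an honest estimate of performance on unseen claims.
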

Metric      Logistic Regression   Random Forest   Gradient Boosting   Feed Forward Network
Recall      72%                   76%             69%                 83%
Precision   61%                   71%             65%                 72%
F           67%                   75%             73%                 79%
F2          69%                   76%             79%                 83%

The evaluation results show that the Feed Forward Network gives the best predictions on this data set.
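
For reference, the sketch below shows how such metrics are commonly computed from confusion counts, assuming that F and F2 denote the F-beta score with beta = 1 and beta = 2 respectively. The counts are hypothetical and do not come from the models above.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta = 1 gives the usual F1 score.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: true positives, false positives, false negatives.
tp, fp, fn = 60, 30, 20

precision = tp / (tp + fp)  # share of flagged claims that are really fraudulent
recall = tp / (tp + fn)     # share of fraudulent claims that were flagged

f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
```

Because recall (0.75) is higher than precision (about 0.67) in this example, F2 comes out higher than F1.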

Model Improvement

We can improve the base model’s performance in various ways. Here we include only the features that have a significant relationship with the target variable. To show the impact of feature selection, we built six Feed Forward Network models, adding five more top-ranked features from the table in the feature selection section to each successive model. F2 is used as the validation criterion because more weight should be given to recall. The chart below shows the F2 values against the number of features included in the model.


It is evident that including all features does not give the best model performance; the best performance comes from using the top 15 features. This model also takes less time to train and run, and it is less complex than the model using all 40 or so features.
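
An experiment of this kind can be sketched as below with scikit-learn. Everything here is an assumption for illustration: the data is synthetic with 30 features, features are ranked by the ANOVA F-test, the network is a small multi-layer perceptron, and F2 is estimated by three-fold cross-validation. The numbers it prints are unrelated to the results reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 candidate features, 8 of them informative.
X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           weights=[0.75, 0.25], random_state=0)
f2_scorer = make_scorer(fbeta_score, beta=2)

# Six models, each using five more top-ranked features than the previous one.
f2_by_k = {}
for k in (5, 10, 15, 20, 25, 30):
    pipeline = make_pipeline(
        StandardScaler(),
        SelectKBest(f_classif, k=k),  # keep the k features with the best F-test score
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0),
    )
    f2_by_k[k] = cross_val_score(pipeline, X, y, cv=3, scoring=f2_scorer).mean()

best_k = max(f2_by_k, key=f2_by_k.get)
for k, score in f2_by_k.items():
    print(f"top {k:>2} features: mean F2 = {score:.3f}")
print("best number of features:", best_k)
```

Placing the feature selection inside the pipeline means it is refitted on each training fold, so the validation folds do not leak into the ranking.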

Importance of Features

Based on the Feed Forward Network algorithm, the factors below are the ten most important, as they play a significant role in predicting fraudulent claims.

Variable Name                 Relative Importance
Incident Severity             100%
Insured Hobbies               88%
Auto Model                    42%
Insured Occupation            37%
Incident City                 33%
Incident State                33%
Authorities Contacted (Y/N)   32%
Insured Relationship          31%
Collision Type                30%
Insured Education Level       29%
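
The article does not state how the relative importance values were derived from the network. One model-agnostic way to obtain a ranking of this shape is permutation importance, scaled so that the top feature scores 100%. The sketch below uses scikit-learn on synthetic data; the technique, the recall-based scoring and the feature names are assumptions and may differ from the method behind the table above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data with hypothetical feature names.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_redundant=0, weights=[0.75, 0.25], random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                      random_state=0).fit(X_train, y_train)

# Importance = average drop in recall when one feature's values are shuffled.
result = permutation_importance(model, X_test, y_test, scoring="recall",
                                n_repeats=10, random_state=0)
importance = np.clip(result.importances_mean, 0, None)  # treat negative drops as zero
relative = 100 * importance / importance.max()          # top feature = 100%

for idx in np.argsort(relative)[::-1]:
    print(f"{feature_names[idx]:<12} {relative[idx]:5.0f}%")
```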

Below are visualizations of the top factors, together with their statistical significance results, which can help the insurance company take decisions based on these insights.


This chart shows that more frauds are found in the Major Damage category of Incident Severity. The chi-square test shows that this result is statistically significant (p = 001).


The chart above is quite interesting: 82.6% of the claims from customers whose hobby is chess are fraudulent. This result is significant at the 10% level (p = 0.087).


32.6% of the insurance claims for the RAM model are fraudulent, but there is no statistical evidence to support this trend, as the p-value is 0.162.


36.8% of the claims where the customer’s occupation is executive/managerial tend to be fraudulent, and this result is statistically significant (p = 0.013).


28.9% of the claims are fraudulent when the incident happened in Arlington, but there is no statistical evidence to support this trend, as the p-value is 0.11.


Interestingly, 30.9% of the incidents that took place in North Carolina are found to be fraudulent claims. This is statistically significant (p = 0.02).


31.8% of the claims are fraudulent when the authorities contacted are recorded as Others. This is statistically significant (p = 0.00).


29.4% of the claims are fraudulent where the insured relationship is recorded as “Other Relative”, but there is no statistical evidence to support this trend, as the p-value is 0.33.


Claims tend to be fraudulent when the collision happened at the rear of the car; this is significant at the 10% level (p = 0.07).


The fraud rate is high when the insured’s education level is PhD, MD, College or JD. However, the insured’s education level is not statistically significant, as the p-value is 0.96.


Below are the parameter settings used while building the models.

Random Forest:

Parameter              Value
ntrees                 1000
max_depth              50
stopping_metric        logloss
nbins                  5
minrows                10
categorical_encoding   label_encoder
stopping_rounds        5

Gradient Boosting:

Parameter              Value
ntrees                 1000
max_depth              50
stopping_metric        logloss
nbins                  10
categorical_encoding   label_encoder
stopping_rounds        5

Neural Network:

Parameter                     Value
Hidden Layers                 500,500,300,200
epochs                        1000
epsilon                       0.00000001
initial_weight_distribution   UniformAdaptive
adaptive_rate                 TRUE
rho                           0.99
stopping_rounds               5
rate_annealing                0.000001
huber_alpha                   0.9
score_interval                5