## Medical Insurance Fraud Analytics

### Introduction

Fraud is one of the most significant problems facing the insurance industry,
causing substantial losses. Fraud in insurance can be defined as **“knowingly
making a fictitious claim, inflating a claim or adding extra items to a claim,
or being in any way dishonest with the intention of gaining more than
legitimate entitlement”**. Fraud in property and casualty insurance is
estimated to cost the Canadian insurance industry 1.3 billion Canadian dollars
every year, which translates to about 10-15% of the claims paid out in
Canada (Gill, 2009). Fraud detection is reported to be difficult and not
cost-effective: done incorrectly, it may irritate legitimate customers and
delay claims adjudication, and investigations themselves are expensive. As a
result, many insurance companies prefer to pay the claim without investigation
because, ultimately, this is cheaper for them.

**Some common fraud types in health insurance:**

- Charging excessive prices for a treatment or medicine at a health centre.
- An unusually high number of invoices for a particular insured in a short time frame (3-4 days).
- Transactions where the insured has received treatment or medicine but has paid no instalments, or only the first instalment.
- Cases where the insured buys medicine without a medical examination.
- Claiming medical invoices dated before or after the insurance period (this is permitted in some cases).
- An excessive number of medicine claims in a specific period.
- Bank account number changes for a business partner such as an agency, health centre or pharmacy.
- An excessive number of manual invoice requests whose amounts fall just below the usual inspection limit.
- Claims whose payable amounts are greater than the invoice amounts the insurance company will pay.

### Data Set

We shaped the final medical claim fraud dataset by collecting and merging the datasets below.

- The Medicare Part D prescriber data set from https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber.html. For our analysis we considered only 2013 data.
- The NPI exclusions data from https://oig.hhs.gov/exclusions/exclusions_list.asp#instruct. This data set covers practitioners and doctors who have committed some form of fraud and been excluded from Federal healthcare programs. NPIs (National Provider Identifiers) appearing in this data set are treated as fraudulent.
- FDA data sets from https://www.fda.gov/Drugs/InformationOnDrugs/ucm079750.htm, which help to join drugs to their active ingredients.

All data sources are joined together using the NPI. The final dataset is described below.
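As a minimal sketch of this join, labelling excluded NPIs as fraud might look like the following (the column names, NPI values, and use of pandas here are illustrative assumptions, not the actual pipeline):

```python
import pandas as pd

# Hypothetical miniature versions of two of the sources, keyed by NPI.
part_d = pd.DataFrame({
    "npi": [1003000126, 1003000134],
    "total_drug_cost": [1250.50, 980.00],
})
exclusions = pd.DataFrame({"npi": [1003000126], "is_fraud": [1]})

# Left-join the exclusion list onto the claims: excluded NPIs are labelled fraud.
merged = part_d.merge(exclusions, on="npi", how="left")
merged["is_fraud"] = merged["is_fraud"].fillna(0).astype(int)
print(merged[["npi", "is_fraud"]])
```

A left join keeps every prescriber from the Part D side; only NPIs present in the exclusion list end up labelled as fraud.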

### Predictive Modelling

We have developed an AI platform with the following functionality:

- An unsupervised model to detect anomalies where labelled data do not exist
- A supervised model for identification of historical fraud
- Graph networks for identifying collusion and investigation paths

Within the scope of this work, we explain the second part: creating supervised models.

### Supervised Modelling

Supervised algorithms build analytical models from historical data in which the
value of the outcome variable is known (labelled data); the resulting model can
then predict the value of the outcome variable in new data where that value is
not known. A good machine learning model predicts the outcome variable
accurately and thus supports quick decisions in the process workflow.
Predictive analytics uses an outcome variable for building the predictive
model, which in the medical claim fraud detection case is the **fraud**
variable.

Building a machine learning model typically involves the following steps:

- Gathering data
- Data preparation
- Choosing a model
- Training
- Evaluation

**Gathering Data**

As discussed in part B, the final dataset was gathered from different federal data sources. We used PostgreSQL to store the data, chosen for its ability to handle large volumes of data and process them quickly.

**Data Preparation**

Data preparation involves tasks such as:

#### Selecting correct sample data

Once the data is collected, assess its condition, looking for outliers, exceptions, incorrect, inconsistent, missing, or skewed information. This is important because the source data informs all of the model’s findings, so it is critical to ensure it does not contain unseen biases. For example, if we are looking at practitioner behaviour nationally but only pulling in data from a limited sample, we might miss important geographic regions. This is the time to catch any issues that could incorrectly skew the model’s findings, on the entire data set and not just on partial or sample data sets.
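A quick assessment of this kind can be sketched with pandas on made-up claim amounts (the column name and values are hypothetical; the IQR rule is one common choice among several for flagging outliers):

```python
import pandas as pd

# Hypothetical claim amounts with one missing entry and one extreme value.
claims = pd.DataFrame({"claim_amount": [120.0, 95.0, 110.0, None, 25000.0]})

# Count missing values in the column.
n_missing = int(claims["claim_amount"].isna().sum())

# Flag outliers with the interquartile-range (IQR) rule.
amounts = claims["claim_amount"].dropna()
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)
outliers = amounts[amounts > upper]
print(n_missing, outliers.tolist())  # 1 [25000.0]
```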

#### Formatting data to make it consistent

The next step in good data preparation is to ensure the data is formatted in a way that best fits the machine learning model. If data is aggregated from different sources, or if the data set has been manually updated by more than one stakeholder, it is likely to contain anomalies in how the data is formatted (e.g. USD5.50 versus $5.50), and values within a column may need standardizing. Consistent data formatting removes these errors so that the entire data set uses the same input formatting protocols.
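For instance, normalizing the differently-formatted currency strings mentioned above might look like this (the raw strings are made up; a real pipeline would also need to handle currency codes other than USD):

```python
import re

# Hypothetical raw amount strings pulled from differently-formatted sources.
raw_amounts = ["USD5.50", "$5.50", "5.50", "USD 1,200.00"]

def normalize_amount(value: str) -> float:
    """Strip currency markers and thousands separators, return a float."""
    cleaned = re.sub(r"[^\d.]", "", value)
    return float(cleaned)

amounts = [normalize_amount(v) for v in raw_amounts]
print(amounts)  # [5.5, 5.5, 5.5, 1200.0]
```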

#### Improving data quality

Here, start with a strategy for dealing with erroneous data, missing values, extreme values, and outliers. Self-service data preparation tools can help if they have intelligent facilities built in to match data attributes from disparate datasets and combine them. For instance, if one dataset has columns for FIRST NAME and LAST NAME and another has a column called PHYSICIAN NAME that seems to hold the first and last name combined, intelligent algorithms should be able to determine a way to match these and join the datasets into a single view of the physician.
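A simple version of that name-matching join can be sketched in pandas by building a comparable key in each frame (the column names, rows, and upper-casing rule are illustrative assumptions; real matching would need fuzzier logic):

```python
import pandas as pd

# Hypothetical datasets: one with split name columns, one with a combined column.
a = pd.DataFrame({"FIRST NAME": ["Jane"], "LAST NAME": ["Smith"], "npi": [100]})
b = pd.DataFrame({"PHYSICIAN NAME": ["JANE SMITH"], "specialty": ["Family Practice"]})

# Build a comparable key in each frame: upper-case "FIRST LAST".
a["name_key"] = (a["FIRST NAME"] + " " + a["LAST NAME"]).str.upper()
b["name_key"] = b["PHYSICIAN NAME"].str.upper()

joined = a.merge(b, on="name_key", how="inner")
print(joined[["npi", "specialty"]])
```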

#### Feature Engineering

Success in machine learning algorithms is dependent on how the data is represented. Feature engineering is a process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model performance on unseen data. Domain knowledge is critical in identifying which features might be relevant and the exercise calls for close interaction between a domain specialist and a data scientist.

We derived several features from the intuition that high cost, a greater number of prescriptions, and costly drugs may indicate fraud. Another assumption is that outliers in cost and prescription counts may have a significant effect on fraud claims. We followed a scoring mechanism to score each physician.
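A sketch of such derived features on hypothetical per-physician aggregates might look like the following (the columns, values, and z-score threshold are illustrative, not the scoring mechanism actually used):

```python
import pandas as pd

# Hypothetical per-physician aggregates from the Part D data.
df = pd.DataFrame({
    "npi": [1, 2, 3],
    "total_drug_cost": [500.0, 2000.0, 12000.0],
    "total_claims": [40, 150, 600],
})

# Derived feature reflecting the "high cost / high volume" intuition.
df["cost_per_claim"] = df["total_drug_cost"] / df["total_claims"]

# Simple outlier-style score: z-score of total cost across physicians.
mean, std = df["total_drug_cost"].mean(), df["total_drug_cost"].std()
df["cost_z"] = (df["total_drug_cost"] - mean) / std
df["high_cost_flag"] = (df["cost_z"] > 1).astype(int)
print(df[["npi", "cost_per_claim", "high_cost_flag"]])
```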

#### Feature Selection

Of all the attributes in the dataset, those that are relevant to the domain and boost model performance are kept, while attributes that degrade model performance are removed. This process is called feature selection (or feature elimination).

There are several methods used for selecting appropriate features for optimal model performance. Following are some of the most commonly used methods.

**Trial & Error:** Start with features identified through domain knowledge,
keep adding other features one at a time, and observe model performance. Keep
the features that improve performance and discard those that do not improve it
or degrade it. This approach is called forward selection. The alternative is to
start with all features and eliminate one feature at a time, again keeping the
features whose removal would degrade performance. This approach, called
backward elimination, tends to work better with tree-based models.
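The greedy forward-selection loop can be sketched as follows; the feature names and the scoring function (standing in for validation accuracy) are made up for illustration:

```python
# Greedy forward selection: at each step, add the feature that most
# improves the score; stop when no remaining feature helps.
def forward_select(features, score):
    selected = []
    best = score(selected)
    improved = True
    while improved:
        improved = False
        for f in set(features) - set(selected):
            s = score(selected + [f])
            if s > best:
                best, chosen, improved = s, f, True
        if improved:
            selected.append(chosen)
    return selected

# Toy scoring function standing in for validation accuracy: it rewards two
# hypothetical useful features and penalizes a noisy one.
useful = {"total_cost": 0.10, "claim_count": 0.05, "noise": -0.02}
score = lambda feats: 0.70 + sum(useful[f] for f in feats)

print(forward_select(list(useful), score))  # ['total_cost', 'claim_count']
```

Backward elimination is the mirror image: start from the full set and drop the feature whose removal hurts the score least, stopping when every removal degrades it.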

**Dimensionality Reduction (PCA):** PCA (Principal Component Analysis)
translates higher-dimensional data into a lower-dimensional representation. It
reduces the number of dimensions by selecting the components that explain most
of the dataset’s variance (in this case, 99% of the variance). A good way to
see how many components are needed is to plot the cumulative explained
variance against the number of components.
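Assuming scikit-learn, the 99%-variance selection can be sketched on synthetic data where only two latent factors drive five observed features (the data here is made up purely to show the mechanism):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical 200 x 5 feature matrix driven by two latent factors plus a
# little noise, so two principal components carry almost all of the variance.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.0, 1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0, 1.0, 0.0]])
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# n_components=0.99 keeps the fewest components explaining 99% of the variance.
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1])  # 2
```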

**Statistical Significance:** Univariate feature selection works by selecting
the best features based on univariate statistical tests. Our data set contains
both numerical and categorical data. To measure statistical significance, we
use the chi-square test for categorical features and ANOVA for numerical
features. These tests give the p-value of each independent variable with
respect to the dependent variable. Below is the table containing the list of
features and their p-values (chi-square is performed for categorical features,
ANOVA otherwise).
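These univariate tests can be sketched with scikit-learn’s `f_classif` (ANOVA) and `chi2` helpers on synthetic data; the features below are invented solely to show the two code paths:

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=300)  # binary fraud label

# A numerical feature shifted by the label (informative) for the ANOVA F-test.
X_num = rng.normal(loc=y * 2.0, scale=1.0).reshape(-1, 1)
f_stat, p_num = f_classif(X_num, y)

# A non-negative count feature (chi-square requires non-negative inputs).
X_cat = (y + rng.integers(0, 2, size=300)).reshape(-1, 1)
chi_stat, p_cat = chi2(X_cat, y)
print(p_num[0], p_cat[0])
```

Both p-values come out far below 0.05 here because both features were constructed to depend on the label; an uninformative feature would yield a large p-value and be dropped.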

Later, we will see the importance of including variables that have a significant relationship with the target variable.

**Choosing the model**

There are many supervised algorithms available. Each differs in nature and produces different results on a given data set, so we have to choose appropriate algorithms according to the problem to be solved and the nature of the data. The following algorithms are used in our model-building process:

- Logistic Regression
- Random Forests
- Gradient Boosting
- Neural Networks
- Support Vector Machines
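A minimal sketch of fitting a few of these candidates side by side, assuming scikit-learn and using synthetic data as a stand-in for the engineered fraud features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered fraud features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
# Accuracy on the held-out split for each candidate algorithm.
scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te)
          for name, model in models.items()}
print(scores)
```

The same loop extends naturally to the neural network and support vector machine candidates.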

**Training Models and Evaluation**

The model-building activity involves constructing machine learning models that can learn from historical data and make predictions or decisions on unseen data. The detailed modelling process follows.

Once the models are built, they are trained on the training dataset, then validated on the validation dataset while fine-tuning hyperparameters, and finally tested on the test dataset. At each stage, the chosen performance metric is observed until the desired performance level is reached.

**Training the models**

The following steps summarize the process of the model development:

- Once the dataset is obtained, it is processed for better quality, then divided into training, validation and test sets in a 70:20:10 ratio. This ratio can vary depending on the overall size of the available data.
- A particular algorithm is chosen, features are engineered for it, and the model is trained on the training data until we achieve the desired performance.
- The model is then tested on the validation data; if performance is not good enough, we go back to the training step, tune some of the hyperparameters, and test again on the validation set. This process is repeated until we are satisfied with both training and validation accuracy.
- Finally, the model is tested on the test data set. If we get the desired result, the model is deployed for production use.
- If we do not get the desired results with this algorithm, we try other suitable algorithms and repeat the process.
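The 70:20:10 split in the first step can be sketched as follows (the dataset size is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000  # hypothetical number of records
indices = rng.permutation(n)

# 70:20:10 split into training, validation and test index sets.
n_train, n_val = int(0.7 * n), int(0.2 * n)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))  # 700 200 100
```

Shuffling before splitting matters: without it, any ordering in the source data (e.g. by state or by year) would leak into the splits.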

**Model Evaluation**

A classification model’s performance can be evaluated using a confusion matrix, which counts correct and incorrect predictions broken down by class. It shows where the classification model is “confused” when it makes predictions.
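For a binary fraud classifier, the confusion-matrix cells and the metrics derived from them can be computed directly (the labels and predictions below are made up):

```python
# Made-up ground-truth labels and model predictions (1 = fraud, 0 = legitimate).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# The four cells of the binary confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of flagged claims, how many were fraud
recall = tp / (tp + fn)     # of fraudulent claims, how many were caught
print(tp, tn, fp, fn, accuracy, precision, recall)
```

In fraud detection, where fraud is rare, precision and recall are usually more informative than raw accuracy.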

The performance of the base models on the test data set is given below:

Evaluation results show that Distributed Random Forest and Gradient Boosting make better predictions on this data set.

**Importance of Features**

Based on the Distributed Random Forest algorithm, the factors below are the top five in importance, as they play a significant role in predicting fraud.

Below are visualizations of the top factors and their statistical significance, which will help the medical insurance company take decisions based on the insights provided.

The trend shows that the proportion of fraudulent claims is high where a physician’s total cost exceeds 1,500 USD, whereas legitimate claims have a higher proportion in the low-cost range. This trend is statistically significant, with a p-value of 0.001.

The trend shows that the proportion of fraudulent claims is high where the maximum number of days prescribed by a physician exceeds 600, whereas legitimate claims involve smaller numbers of days. This trend is statistically significant, with a p-value of 0.001.

A high number of frauds occurred when Warfarin Sodium was prescribed. This result is statistically significant, with a p-value of 0.03.

A high number of frauds involved physicians specializing in Family Practice. This result is statistically significant, with a p-value of 0.02.

The trend shows that the proportion of fraudulent claims is high where a physician has written more than 400 prescriptions, whereas legitimate claims occur where physicians have written fewer. This trend has a p-value of 0.07, however, which falls short of significance at the conventional 0.05 level.