Fixing an Imbalanced Dataset Using Rapidminer

What is Unbalanced Data?

Imbalanced data means that one class of a response variable vastly outnumbers the other. Consider credit risk as an example: it is said that only around 2% of credit cards are defrauded each year.


Here are some other examples:
  1. The number of machines in a factory that fail versus those that do not
  2. Discrimination between earthquakes and nuclear explosions
  3. Detection of fraudulent phone calls
  4. People who clicked on a digital ad vs those who did not

The Problem It Poses

The problem with class imbalance arises when we try to build a classification model. The model tends to be biased towards the majority class, so on new, unseen data it will miss most fraud cases, because there are very few fraud data points for it to learn from.
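To make the bias concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (980 legit rows and 20 fraud rows, mirroring the roughly 2% figure above), and the "model" simply predicts the majority class every time. It is not a real classifier, just the extreme case of majority-class bias.

```python
# Hypothetical labels: 980 legit (0) and 20 fraud (1) transactions.
labels = [0] * 980 + [1] * 20

# A "model" that always predicts the majority class.
predictions = [0] * len(labels)

# Accuracy: share of rows where the prediction matches the label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall on the fraud class: frauds caught / frauds that actually occurred.
caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = caught / sum(labels)

print(f"accuracy = {accuracy:.2f}")
print(f"fraud recall = {recall:.2f}")
```

The accuracy looks excellent while not a single fraud is detected, which is why accuracy alone is a poor yardstick on imbalanced data.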


What Dataset Are We Going to Use?

The Credit Card Fraud Dataset

The data was downloaded from Kaggle. It contains two days of credit card transactions made by European cardholders in September 2013. The original downloaded file contained 31 columns and 284,807 rows. Rapidminer downsampled the data to 50,000 rows due to software license limitations.

The column "Class" is the response variable and it takes two values:
0 - Legit
1 - Fraud

Credit Card Fraud Data Set
We only need the "Class" column for this example.


The plot below shows how imbalanced the Class variable is:
Datapoints classified as "0" far outnumber those classified as "1"
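Outside Rapidminer, the same check can be sketched in a few lines of Python. The counts below are hypothetical and are not the real figures from the Kaggle file. Only the "Class" column name and the 0/1 coding come from the dataset description above.

```python
from collections import Counter

# Hypothetical "Class" column: made-up counts, for illustration only.
class_column = [0] * 49_900 + [1] * 100

counts = Counter(class_column)
total = len(class_column)

# Print each class with its row count and share of the whole dataset.
for label, name in ((0, "Legit"), (1, "Fraud")):
    share = counts[label] / total
    print(f"{label} - {name}: {counts[label]} rows ({share:.2%})")
```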

The challenge now is to make the two classes proportionate to each other.

The Process

The Main Operator

There are a couple of Rapidminer operators for the task but we're only going to use the Sample operator for now:

Rapidminer Sample Operators

After retrieving the data, we add two more operators to satisfy the Sample operator's requirements and tweak some parameters:
Rapidminer Resampling Process

Result:

Credit Card Fraud Balanced Dataset
A nicely balanced 50:50 dataset.
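For readers without Rapidminer, the idea can be approximated in plain Python. This is only a sketch of one common approach, random undersampling of the majority class, and it is an assumption that this matches what the Sample operator does here. The `undersample` helper, the seed, and the toy rows are all hypothetical.

```python
import random

def undersample(rows, label_key="Class", seed=42):
    """Randomly keep as many rows per class as the smallest class has."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    # Every class is cut down to the size of the rarest class.
    target = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))  # sampling without replacement
    rng.shuffle(balanced)
    return balanced

# Toy data (made up): 1,000 legit rows and 20 fraud rows.
rows = [{"Amount": float(i), "Class": 0} for i in range(1000)]
rows += [{"Amount": float(i), "Class": 1} for i in range(20)]

balanced = undersample(rows)
n_fraud = sum(r["Class"] for r in balanced)
print(len(balanced), n_fraud)
```

The trade-off is that undersampling discards most of the majority-class rows, so the balanced set is only twice the size of the minority class.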
For more about handling imbalanced datasets, visit this website.



