Fixing an Imbalanced Dataset Using RapidMiner

What Is Imbalanced Data?

Imbalanced data means that one class of a response variable heavily outnumbers the other. Consider the field of credit risk, where it is said that only around 2% of credit cards are defrauded each year.

Here are some other examples:

The number of machines in a factory that fail versus those that do not
Discrimination between earthquakes and nuclear explosions
Detection of fraudulent phone calls
People who clicked on a digital ad vs those who did not

The Problem It Poses

The problem with class imbalance arises when we try to build a classification model. The model tends to be biased towards the majority class and will miss most fraud cases in new, unseen data, because it has very few fraud examples to learn from.
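To make the bias concrete, here is a minimal Python sketch (not part of the RapidMiner process) of a "model" that always predicts the majority class. The labels are hypothetical: 980 legitimate and 20 fraudulent transactions, roughly matching the 2% figure above. The function name is made up for illustration.

```python
from collections import Counter


def majority_baseline_report(labels, positive=1):
    """Score a 'model' that always predicts the most common class."""
    majority_class, _ = Counter(labels).most_common(1)[0]
    predictions = [majority_class] * len(labels)

    correct = sum(p == y for p, y in zip(predictions, labels))
    true_pos = sum(p == positive and y == positive
                   for p, y in zip(predictions, labels))
    actual_pos = sum(y == positive for y in labels)

    accuracy = correct / len(labels)
    # Recall: share of actual fraud cases the model caught.
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Hypothetical labels: 0 = legit, 1 = fraud (about 2% fraud).
labels = [0] * 980 + [1] * 20
accuracy, recall = majority_baseline_report(labels)
print(f"accuracy={accuracy:.2%}, fraud recall={recall:.2%}")
```

The accuracy looks excellent while not a single fraud case is caught, which is why accuracy alone is a misleading score on imbalanced data.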

What Dataset Are We Going to Use?

The Credit Card Fraud Dataset

The data was downloaded from Kaggle. It contains two days of credit card transactions made by European cardholders in September 2013. The original downloaded file contained 31 columns and 284,807 rows. RapidMiner downsampled the data to 50,000 rows due to software license limitations.

The column "Class" is the response variable and it takes two values:
0 - Legit
1 - Fraud

We only need the "Class" column for this example.

The plot below shows how imbalanced the Class variable is:

Datapoints classified as "0" far outnumber those classified as "1"
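The numbers behind a plot like this are just per-class counts. As a rough sketch outside RapidMiner, they could be tallied in Python as follows. The label values here are hypothetical placeholders, not the actual counts from the Kaggle file, and the function name is made up for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: (row count, share of all rows)} for each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in sorted(counts.items())}


# Hypothetical "Class" column: 0 = legit, 1 = fraud.
# These counts are placeholders, not the real dataset's numbers.
labels = [0] * 49_900 + [1] * 100

for cls, (n, share) in class_distribution(labels).items():
    print(f"Class {cls}: {n} rows ({share:.2%})")
```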

The challenge now is to make the two classes proportionate to each other.

The Process

The Main Operator

There are a couple of RapidMiner operators for the task, but we're only going to use the Sample operator for now:

After retrieving the data, we add two more operators to satisfy the Sample operator's requirements and tweak some parameters:

Result:

A nicely balanced 50:50 dataset.
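For readers without RapidMiner, the same idea can be sketched in plain Python as random undersampling: keep every class at the size of the smallest one. This is an assumed analogue, not a description of how the Sample operator works internally. The function name, the "Amount" field, and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample_balanced(rows, label_key="Class", seed=42):
    """Randomly keep the same number of rows from every class."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable

    # Group rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    # The smallest class decides how many rows each class keeps.
    n_per_class = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_per_class))
    rng.shuffle(balanced)  # avoid all rows of one class sitting together
    return balanced


# Hypothetical toy data: 990 legit rows and 10 fraud rows.
rows = [{"Amount": float(i), "Class": 0} for i in range(990)]
rows += [{"Amount": float(i), "Class": 1} for i in range(10)]

balanced = undersample_balanced(rows)
counts = Counter(row["Class"] for row in balanced)
print(counts[0], counts[1])
```

Undersampling discards most of the majority class, so it trades data volume for balance. That is acceptable for a demonstration but worth weighing on a real project.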

For more about handling imbalanced datasets, visit this website.

Juan Antonio Pajarillo's Data Analytics Projects
