What is Unbalanced Data?
Imbalanced data means that one class of a response variable is far more common than the other class. Consider the field of credit risk as an example, where it is said that only around 2% of credit cards are defrauded each year. Here are some other examples:
- Machines in a factory that fail versus those that do not
- Discrimination between earthquakes and nuclear explosions
- Detection of fraudulent phone calls
- People who clicked on a digital ad vs those who did not
The Problem It Poses
Now, the problem with class imbalance arises when we try to build a classification model. The model will tend to be biased towards the majority class and will fail to detect fraud in most cases on new, unseen data, because there are very few fraud data points for it to learn from.
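To make the bias concrete, here is a minimal sketch in Python. The labels are hypothetical toy data (1,000 transactions with the roughly 2% fraud rate mentioned earlier), not the Kaggle dataset, and the "model" is just a stand-in that always predicts the majority class.

```python
# Hypothetical toy labels: 1,000 transactions, 2% fraud (1 = fraud, 0 = legit).
labels = [1] * 20 + [0] * 980

# A stand-in "model" that always predicts the majority class (legit).
predictions = [0] * len(labels)

# Accuracy: share of all transactions labelled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on fraud: share of actual fraud cases the model caught.
fraud_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = fraud_caught / sum(labels)

print(f"accuracy: {accuracy:.2%}")    # accuracy: 98.00%
print(f"fraud recall: {recall:.2%}")  # fraud recall: 0.00%
```

The model looks excellent on accuracy while catching no fraud at all, which is why accuracy alone is misleading on imbalanced data.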
What Dataset Are We Going to Use?
The Credit Card Fraud Dataset
The data was downloaded from Kaggle. It contains two days of credit card transactions made by European cardholders in September 2013. The original downloaded file contained 31 columns and 284,807 rows. RapidMiner downsampled the data to 50,000 rows due to software license limitations. The column "Class" is the response variable and it takes two values:
0 - Legit
1 - Fraud
We only need the "Class" column for this example.
The plot below shows how imbalanced the Class variable is:
Datapoints classified as "0" far outnumber those classified as "1".
The challenge now is to make the two classes proportionate to each other.
The Process
The Main Operator
There are a couple of RapidMiner operators for the task, but we're only going to use the Sample operator for now:
After retrieving the data, we then include 2 more operators to satisfy the Sample operator's requirements and tweak some parameters:
Result:
A nicely balanced 50:50 dataset.
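For readers without RapidMiner, the same idea can be sketched in plain Python. This is not how the Sample operator is implemented; it is an assumed equivalent that randomly keeps the same number of rows from each class. The function name `balance_by_downsampling`, the seed, and the toy rows are all hypothetical; only the "Class" column name comes from the dataset.

```python
import random
from collections import Counter


def balance_by_downsampling(rows, label_key="Class", seed=42):
    """Randomly keep an equal number of rows from every class."""
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    # The smallest class decides how many rows each class may keep.
    n_per_class = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_per_class))

    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced


# Hypothetical toy data: 990 legit rows (0) and 10 fraud rows (1).
rows = [{"Class": 0} for _ in range(990)] + [{"Class": 1} for _ in range(10)]

balanced = balance_by_downsampling(rows)
counts = Counter(row["Class"] for row in balanced)
print(sorted(counts.items()))  # [(0, 10), (1, 10)]
```

Downsampling discards most of the majority class, so the balanced set is only as large as twice the minority class. That trade-off is worth keeping in mind when the minority class is very small.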