k-NN: Classify Who Will Most Likely Default on their Credit Card

"If it walks like a duck and it quacks like a duck, then it's probably a duck."
Image: http://bit.ly/2qJJ0mA

k-Nearest Neighbors (kNN) is a classification (or regression) algorithm that classifies a point by combining the known classes of its k nearest points, where k is a number chosen in advance. It is SUPERVISED because each new point is classified using points whose classes are already known. For example, we already have two bins in our historical data:

a) Default - includes the names and attributes of people who we already know have defaulted on their credit card
b) Did Not Default - for those who did not default

We then use those bins as a basis, or guide, for classifying new data.

Here's how it works:

Image: KDNuggets
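
The idea can also be sketched in a few lines of code. This is a minimal Python illustration, not the R code used in this post. It assumes Euclidean distance and a simple majority vote. The function name knn_classify and the toy points (credit limit in thousands, months of payment delay) are made up for the example and are not taken from the data set.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest known points.

    train: list of (features, label) pairs; query: a feature tuple.
    Euclidean distance is an assumption made for this sketch.
    """
    # Sort the known points by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy history: (credit limit in thousands, months of delay).
history = [
    ((20, 3), "Default"),
    ((30, 2), "Default"),
    ((25, 4), "Default"),
    ((200, 0), "Did Not Default"),
    ((180, 0), "Did Not Default"),
    ((220, 1), "Did Not Default"),
]

print(knn_classify(history, (28, 3), k=3))
```

With an odd k and two classes the vote cannot tie, which is one reason odd values of k are a common choice for two-class problems.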

What data are we going to use?

The Default of Credit Card Clients data set comes from the UCI Machine Learning Repository. Here's the description from the page:

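The data set covers customers' default payments in Taiwan, with payment records from April to September 2005. It has 30,000 rows and 24 columns: the 23 explanatory variables listed below plus the binary response variable, default payment (1 = yes, 0 = no). The variables are as follows: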

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:
X6 = the repayment status in September, 2005;
X7 = the repayment status in August, 2005;
...;
X11 = the repayment status in April, 2005.
The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12 - X17: Amount of bill statement (NT dollar).
X12 = amount of bill statement in September, 2005;
X13 = amount of bill statement in August, 2005;
...;
X17 = amount of bill statement in April, 2005.
X18 - X23: Amount of previous payment (NT dollar).
X18 = amount paid in September, 2005;
X19 = amount paid in August, 2005;
...;
X23 = amount paid in April, 2005.
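
To make the coded variables easier to read, the codes above can be turned into labels. The following is a small Python sketch, separate from the R code of this post. The lookup tables are copied from the description above. The names decode and describe_repayment, and the sample row, are hypothetical and only serve the illustration.

```python
# Code-to-label lookups copied from the variable description above.
GENDER = {1: "male", 2: "female"}                       # X2
EDUCATION = {1: "graduate school", 2: "university",
             3: "high school", 4: "others"}             # X3
MARRIAGE = {1: "married", 2: "single", 3: "others"}     # X4

def describe_repayment(code):
    """Translate a repayment status code (X6 - X11) into words."""
    if code == -1:
        return "pay duly"
    if 1 <= code <= 8:
        unit = "month" if code == 1 else "months"
        return f"payment delay for {code} {unit}"
    if code == 9:
        return "payment delay for nine months and above"
    # Anything outside the documented scale is flagged rather than guessed.
    return f"undocumented code {code}"

def decode(row):
    """Return a copy of `row` with X2, X3, X4 and X6 replaced by labels."""
    out = dict(row)
    out["X2"] = GENDER.get(row["X2"], "unknown")
    out["X3"] = EDUCATION.get(row["X3"], "unknown")
    out["X4"] = MARRIAGE.get(row["X4"], "unknown")
    out["X6"] = describe_repayment(row["X6"])
    return out

# Hypothetical sample row, not taken from the real data set.
sample = {"X1": 50000, "X2": 2, "X3": 2, "X4": 1, "X5": 35, "X6": 2}
print(decode(sample))
```

Codes that fall outside the documented scale are labeled as undocumented instead of being mapped to a guess, so they remain visible for inspection.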


Goal:

We want to predict whether new customers will default on their credit card accounts. We use historical data containing the attributes of people who did and did not default, and classify each new data point based on its "similarity" to the old ones.

R Code:
