What's Naive Bayes?
- A classification tool
- Fast and easy to interpret
- Good for large data sets
- It's "naive" because it assumes that all predictors (attributes) are independent of one another within each class
- Based on Bayes' Rule of Posterior Probability: P(c|x) = P(x|c) × P(c) / P(x)
- P(c|x) is the posterior probability of the class (c, the target variable) given the predictor (x, the attributes).
- P(x|c) is the likelihood, which is the probability of the predictor given the class.
- P(c) is the prior probability of the class, that is, the probability of the class before seeing the data.
- P(x) is the prior probability of the predictor.
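With several predictors, the independence assumption lets the likelihood split into one term per predictor. A sketch of the resulting rule in standard notation (the subscripted symbols x_1, ..., x_n are introduced here for illustration and are not used elsewhere in this post):

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The denominator P(x) is the same for every class, so it can be left out when we only need to know which class is most probable.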
What data are we going to use?
We look into a fictional set of data collected from 300 respondents. It contains:
- Age
- Gender
- Income
- No. of children
- Own/Rent their homes
- Whether they are cable TV subscribers
The respondents have each already been assigned to one consumer segment:
- Suburb Mix
- Urban Hip
- Travelers
- Moving Up
Here's a snippet:
You may also download the complete data here.
Goal
Use Naive Bayes (NB) to predict which consumer segment a person (or a new observation) belongs to based on historical data.
How to Use NB Manually?
Analytics Vidhya provides a clear outline of the steps:
- Identify the target variable (in this case, the Consumer Segment)
- Convert the data set into a frequency table for each attribute against the target
- Create a likelihood table by finding the probabilities
- Use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction
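The manual steps can be sketched in a few lines of base R. This is only an illustration: the eight-row `toy` data frame below is invented for the example (it is not the survey data), and it uses a single attribute to keep the tables small.

```r
# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  ownHome = c("ownYes", "ownNo", "ownYes", "ownNo",
              "ownNo", "ownYes", "ownNo", "ownYes"),
  Segment = c("Travelers", "Urban hip", "Travelers", "Urban hip",
              "Travelers", "Travelers", "Urban hip", "Urban hip")
)

# Frequency table of the attribute against the target
freq <- table(toy$Segment, toy$ownHome)

# Likelihood table: P(ownHome | Segment), each row sums to 1
lik <- prop.table(freq, margin = 1)

# Prior: share of each segment in the data
prior <- rowSums(freq) / sum(freq)

# Posterior for a new respondent who owns a home
joint <- prior * lik[, "ownYes"]
posterior <- joint / sum(joint)

posterior
names(which.max(posterior))  # predicted segment
```

With more than one attribute, the likelihoods for each attribute would simply be multiplied together before normalising.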
The model performs well, assigning 81.7% of the test respondents to the correct segment. It also estimates how likely each attribute value is within a segment. For example, the training data tells us that there's a 68% chance that a member of the Moving Up segment is female.
R Code:
> set.seed(04444)
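#In R 4.0 and later, stringsAsFactors=TRUE is needed for summary() to show category counts as below
> seg.raw <- read.csv("https://1drv.ms/u/s!AqpWLKfaiJ83tHxNn3ClDFDZ1By_", stringsAsFactors=TRUE)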
> summary (seg.raw)
age gender income
Min. :19.26 Female:157 Min. : -5183
1st Qu.:33.01 Male :143 1st Qu.: 39656
Median :39.49 Median : 52014
Mean :41.20 Mean : 50937
3rd Qu.:47.90 3rd Qu.: 61403
Max. :80.49 Max. :114278
kids ownHome subscribe
Min. :0.00 ownNo :159 subNo :260
1st Qu.:0.00 ownYes:141 subYes: 40
Median :1.00
Mean :1.27
3rd Qu.:2.00
Max. :7.00
Segment
Moving up : 70
Suburb mix:100
Travelers : 80
Urban hip : 50
> head(seg.raw)
age gender income kids ownHome subscribe
1 47.31613 Male 49482.81 2 ownNo subNo
2 31.38684 Male 35546.29 1 ownYes subNo
3 43.20034 Male 44169.19 0 ownYes subNo
4 37.31700 Female 81041.99 1 ownNo subNo
5 40.95439 Female 79353.01 3 ownYes subNo
6 43.03387 Male 58143.36 4 ownYes subNo
Segment
1 Suburb mix
2 Suburb mix
3 Suburb mix
4 Suburb mix
5 Suburb mix
6 Suburb mix
> library(e1071)
#First, split the data into Training and Test (60-40)
> train.prop <-0.60
> train.cases <- sample(nrow(seg.raw), nrow(seg.raw)*train.prop)
> seg.df.train <-seg.raw[train.cases, ]
#The "-" drops every row index used for training, so the remaining rows become the test data
> seg.df.test <-seg.raw[-train.cases, ]
#Check both new data sets using summary()
> summary(seg.df.test)
age gender income
Min. :21.00 Female:63 Min. : -5183
1st Qu.:32.98 Male :57 1st Qu.: 45176
Median :40.13 Median : 53474
Mean :42.17 Mean : 53759
3rd Qu.:51.54 3rd Qu.: 63696
Max. :80.49 Max. :114278
kids ownHome subscribe
Min. :0.000 ownNo :60 subNo :101
1st Qu.:0.000 ownYes:60 subYes: 19
Median :1.000
Mean :1.225
3rd Qu.:2.000
Max. :6.000
Segment
Moving up :26
Suburb mix:37
Travelers :39
Urban hip :18
> summary(seg.df.train)
age gender income
Min. :19.26 Female:94 Min. : -694
1st Qu.:33.10 Male :86 1st Qu.: 37594
Median :39.19 Median : 49496
Mean :40.55 Mean : 49055
3rd Qu.:47.12 3rd Qu.: 61077
Max. :78.20 Max. :105538
kids ownHome subscribe Segment
Min. :0.0 ownNo :99 subNo :159 Moving up :44
1st Qu.:0.0 ownYes:81 subYes: 21 Suburb mix:63
Median :1.0 Travelers :41
Mean :1.3 Urban hip :32
3rd Qu.:2.0
Max. :7.0
#Train an NB classifier on the training data
> seg.nb <- naiveBayes(Segment ~ ., data=seg.df.train)
> seg.nb
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
#The a-priori probabilities are the segment shares in the training data, BEFORE any predictor is taken into account.
A-priori probabilities:
Y
Moving up Suburb mix Travelers Urban hip
0.2444444 0.3500000 0.2277778 0.1777778
#Example: There's a 68% chance that a respondent is female if she is part of the Moving Up segment
Conditional probabilities:
age
Y [,1] [,2]
Moving up 36.80072 3.857355
Suburb mix 40.01748 5.159565
Travelers 58.57831 8.242138
Urban hip 23.67364 1.886160
gender
Y Female Male
Moving up 0.6818182 0.3181818
Suburb mix 0.4603175 0.5396825
Travelers 0.5121951 0.4878049
Urban hip 0.4375000 0.5625000
income
Y [,1] [,2]
Moving up 53036.44 10545.406
Suburb mix 53801.84 13056.033
Travelers 59219.21 23336.043
Urban hip 21211.03 4611.308
kids
Y [,1] [,2]
Moving up 1.931818 1.5157733
Suburb mix 1.809524 1.2681857
Travelers 0.000000 0.0000000
Urban hip 1.093750 0.9954534
ownHome
Y ownNo ownYes
Moving up 0.6818182 0.3181818
Suburb mix 0.5079365 0.4920635
Travelers 0.3170732 0.6829268
Urban hip 0.7500000 0.2500000
subscribe
Y subNo subYes
Moving up 0.81818182 0.18181818
Suburb mix 0.93650794 0.06349206
Travelers 0.90243902 0.09756098
Urban hip 0.84375000 0.15625000
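To see how these tables turn into a prediction, here is a rough sketch that combines just two of them by hand. The numbers are copied from the output above; for numeric predictors the two columns are the class mean and standard deviation, which feed a normal density. The 45-year-old female respondent is hypothetical, and using only age and gender is a simplification (the fitted model uses all six predictors).

```r
# Values copied from the printed seg.nb tables
segs     <- c("Moving up", "Suburb mix", "Travelers", "Urban hip")
prior    <- setNames(c(0.2444444, 0.3500000, 0.2277778, 0.1777778), segs)
p_female <- setNames(c(0.6818182, 0.4603175, 0.5121951, 0.4375000), segs)
age_mean <- setNames(c(36.80072, 40.01748, 58.57831, 23.67364), segs)
age_sd   <- setNames(c(3.857355, 5.159565, 8.242138, 1.886160), segs)

# Hypothetical new respondent (not in the data): a 45-year-old woman
new_age <- 45

# Numeric predictor: normal density using each segment's mean and sd
lik_age <- dnorm(new_age, mean = age_mean, sd = age_sd)

# Naive Bayes: prior times the product of the per-predictor likelihoods
joint <- prior * p_female * lik_age
posterior <- joint / sum(joint)

round(posterior, 3)
names(which.max(posterior))  # most probable segment from these two predictors
```

Note that the Travelers row for kids has a standard deviation of 0 in the output above, so a plain normal density would not work for that predictor without some adjustment.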
#We call on R to predict segment membership in the Test data
> seg.nb.class <- predict(seg.nb, seg.df.test)
> seg.nb.class
[1] Suburb mix Suburb mix Suburb mix Travelers
[5] Travelers Suburb mix Suburb mix Suburb mix
[9] Suburb mix Suburb mix Travelers Suburb mix
[13] Suburb mix Moving up Travelers Suburb mix
[17] Moving up Suburb mix Suburb mix Suburb mix
[21] Moving up Suburb mix Suburb mix Suburb mix
[25] Suburb mix Moving up Moving up Moving up
[29] Suburb mix Suburb mix Suburb mix Suburb mix
[33] Moving up Suburb mix Suburb mix Suburb mix
[37] Suburb mix Urban hip Urban hip Urban hip
[41] Urban hip Urban hip Urban hip Urban hip
[45] Urban hip Urban hip Urban hip Urban hip
[49] Urban hip Urban hip Urban hip Urban hip
[53] Urban hip Urban hip Urban hip Travelers
[57] Travelers Travelers Travelers Travelers
[61] Travelers Travelers Travelers Travelers
[65] Travelers Travelers Travelers Travelers
[69] Travelers Travelers Travelers Travelers
[73] Travelers Travelers Travelers Travelers
[77] Travelers Travelers Travelers Travelers
[81] Travelers Travelers Travelers Travelers
[85] Travelers Travelers Travelers Travelers
[89] Travelers Travelers Travelers Travelers
[93] Travelers Travelers Moving up Moving up
[97] Moving up Moving up Travelers Travelers
[101] Moving up Moving up Travelers Moving up
[105] Moving up Moving up Moving up Suburb mix
[109] Travelers Moving up Moving up Suburb mix
[113] Travelers Moving up Moving up Suburb mix
[117] Suburb mix Suburb mix Travelers Moving up
Levels: Moving up Suburb mix Travelers Urban hip
#The proportions of predicted membership
> prop.table(table(seg.nb.class))
seg.nb.class
Moving up Suburb mix Travelers Urban hip
0.1833333 0.2583333 0.4083333 0.1500000
#The model has a raw agreement rate of 81.7% between predicted and actual. Good stuff!
> mean(seg.df.test$Segment==seg.nb.class)
[1] 0.8166667
#Compare performance across categories using a Confusion Matrix (rows = predicted, columns = actual)
#NB was correct for all 18 of the Urban Hip
#However, it also misclassified 11 of the 37 actual Suburb Mix respondents
> table(seg.nb.class, seg.df.test$Segment)
seg.nb.class Moving up Suburb mix Travelers Urban hip
Moving up 15 7 0 0
Suburb mix 5 26 0 0
Travelers 6 4 39 0
Urban hip 0 0 0 18
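The confusion matrix can also be summarised per segment. The sketch below retypes the counts from the table above (rows are predicted, columns are actual) and derives overall accuracy plus recall and precision for each segment; the variable names are chosen here for illustration.

```r
segs <- c("Moving up", "Suburb mix", "Travelers", "Urban hip")

# Counts copied from the confusion matrix: rows = predicted, columns = actual
cm <- matrix(c(15,  7,  0,  0,
                5, 26,  0,  0,
                6,  4, 39,  0,
                0,  0,  0, 18),
             nrow = 4, byrow = TRUE,
             dimnames = list(predicted = segs, actual = segs))

# Overall agreement: correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Recall: share of each actual segment that was predicted correctly
recall <- diag(cm) / colSums(cm)

# Precision: share of each predicted segment that was actually correct
precision <- diag(cm) / rowSums(cm)

accuracy
round(rbind(recall, precision), 3)
```

Recall is the column-wise view (how many true members were found), while precision is the row-wise view (how trustworthy a predicted label is).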
#Now we use predict() with type="raw" to get the estimated probability of each segment for every Test observation
> predict(seg.nb, seg.df.test, type="raw")
Moving up Suburb mix Travelers
[1,] 1.440881e-01 8.552193e-01 6.926154e-04
[2,] 1.456777e-01 8.536193e-01 7.029732e-04
[3,] 2.709197e-01 7.289515e-01 1.288325e-04
[4,] 4.184773e-02 2.944421e-02 9.287081e-01
[5,] 2.671615e-03 6.115050e-03 9.912133e-01
[6,] 4.186620e-01 5.812942e-01 4.379586e-05
[7,] 4.731223e-01 5.268739e-01 3.701474e-06
[8,] 4.374748e-01 5.624934e-01 3.180412e-05
[9,] 3.474860e-01 6.524404e-01 7.357777e-05
[10,] 3.656449e-01 6.343508e-01 4.358883e-06
[11,] 1.012914e-03 6.497213e-03 9.924899e-01
[12,] 3.323157e-01 6.676041e-01 8.020473e-05
[13,] 1.223884e-03 9.808062e-01 1.796993e-02
[14,] 5.601595e-01 4.398083e-01 3.220151e-05
[15,] 6.742957e-04 2.880094e-03 9.964456e-01
[16,] 6.010685e-02 9.396161e-01 2.770600e-04
[17,] 7.380343e-01 2.619626e-01 3.115947e-06
[18,] 1.420365e-01 8.578766e-01 8.685087e-05
[19,] 7.583260e-02 9.239193e-01 2.481195e-04
[20,] 3.794170e-01 6.204883e-01 9.465629e-05
[21,] 9.615440e-01 3.810100e-02 3.549975e-04
[22,] 3.765093e-01 6.230697e-01 4.210493e-04
[23,] 2.812958e-01 7.186759e-01 2.835803e-05
[24,] 4.565263e-01 5.434132e-01 6.050987e-05
[25,] 1.296606e-01 8.695552e-01 7.842591e-04
[26,] 5.055131e-01 4.944639e-01 2.307682e-05
[27,] 5.462156e-01 4.537640e-01 2.037672e-05
[28,] 6.884615e-01 3.115325e-01 5.950060e-06
[29,] 2.508885e-01 7.490686e-01 4.290961e-05
[30,] 4.675615e-01 5.323982e-01 4.024894e-05
[31,] 3.678550e-01 6.321157e-01 2.933847e-05
[32,] 2.019428e-01 7.979988e-01 5.835591e-05
[33,] 6.890461e-01 3.109439e-01 9.998462e-06
[34,] 3.812665e-01 6.187179e-01 1.558092e-05
[35,] 4.963026e-01 5.017074e-01 1.990014e-03
[36,] 8.874581e-02 9.110617e-01 1.924935e-04
[37,] 8.625658e-02 8.989502e-01 1.479327e-02
[38,] 4.685754e-06 1.931791e-05 2.213979e-09
[39,] 9.304912e-07 6.334395e-06 9.529707e-10
[40,] 4.967475e-07 1.171192e-05 4.496936e-09
[41,] 1.331001e-05 1.478564e-04 4.489671e-09
[42,] 5.507304e-04 2.097786e-03 5.443669e-02
[43,] 2.861508e-05 3.604446e-04 1.235494e-07
[44,] 1.596150e-03 7.068018e-03 1.208915e-07
[45,] 8.223918e-06 2.654345e-05 8.822192e-09
[46,] 3.340850e-06 3.397923e-05 1.539063e-03
[47,] 2.130488e-06 2.523107e-05 1.402425e-03
[48,] 8.354152e-05 2.839538e-04 1.271094e-08
[49,] 3.850316e-05 1.475367e-04 1.263822e-02
[50,] 1.619232e-05 1.177392e-04 5.914589e-09
[51,] 5.995930e-04 3.215583e-03 1.282580e-07
[52,] 7.269372e-03 3.707811e-02 8.244888e-07
[53,] 3.837801e-05 9.611870e-05 4.944857e-09
[54,] 5.253787e-05 1.372892e-04 5.650724e-03
[55,] 1.594063e-04 7.351829e-04 5.441922e-02
[56,] 8.793436e-10 2.906219e-06 9.999971e-01
[57,] 8.810414e-19 1.750207e-10 1.000000e+00
[58,] 4.197303e-17 2.253424e-09 1.000000e+00
[59,] 1.076254e-09 2.875665e-06 9.999971e-01
[60,] 2.943310e-10 3.190429e-06 9.999968e-01
[61,] 2.077226e-07 8.669579e-05 9.999131e-01
[62,] 4.783358e-16 4.082836e-09 1.000000e+00
[63,] 2.608279e-16 2.771068e-09 1.000000e+00
[64,] 2.382554e-04 1.734936e-03 9.980268e-01
[65,] 1.014131e-06 1.930618e-04 9.998059e-01
[66,] 2.945811e-16 2.531311e-09 1.000000e+00
[67,] 2.037384e-14 1.969520e-08 1.000000e+00
[68,] 4.277916e-07 1.257500e-04 9.998738e-01
[69,] 8.105189e-12 2.110480e-07 9.999998e-01
[70,] 4.925874e-09 8.490373e-06 9.999915e-01
[71,] 7.634588e-11 1.132205e-06 9.999989e-01
[72,] 1.158982e-09 4.021510e-06 9.999960e-01
[73,] 1.067817e-03 3.553700e-03 9.953785e-01
[74,] 2.046742e-13 1.039585e-07 9.999999e-01
[75,] 3.475463e-04 3.747890e-03 9.959046e-01
[76,] 6.513202e-06 4.263900e-04 9.995671e-01
[77,] 2.034014e-06 1.541788e-04 9.998438e-01
[78,] 3.733899e-08 3.832453e-05 9.999616e-01
[79,] 2.903394e-11 4.367194e-07 9.999996e-01
[80,] 1.357737e-20 1.498671e-11 1.000000e+00
[81,] 7.015985e-12 1.298615e-07 9.999999e-01
[82,] 9.672542e-13 2.447063e-07 9.999998e-01
[83,] 8.831723e-08 2.212918e-05 9.999778e-01
[84,] 1.159770e-06 1.169970e-04 9.998818e-01
[85,] 1.578677e-07 7.959836e-05 9.999202e-01
[86,] 5.337042e-21 8.857918e-12 1.000000e+00
[87,] 4.468045e-09 6.361052e-06 9.999936e-01
[88,] 1.896246e-13 5.900718e-08 9.999999e-01
[89,] 1.476904e-31 1.503828e-16 1.000000e+00
[90,] 2.491062e-17 4.662340e-10 1.000000e+00
[91,] 1.777054e-08 1.658310e-05 9.999834e-01
[92,] 3.309277e-13 3.370546e-08 1.000000e+00
[93,] 2.022654e-12 1.925338e-07 9.999998e-01
[94,] 8.992757e-16 1.022816e-08 1.000000e+00
[95,] 5.594756e-01 4.405157e-01 8.692925e-06
[96,] 5.093043e-01 4.902429e-01 4.527934e-04
[97,] 5.038497e-01 4.961470e-01 3.278535e-06
[98,] 5.395342e-01 4.604591e-01 6.695716e-06
[99,] 1.134885e-01 5.054778e-02 8.359637e-01
[100,] 1.045384e-01 4.268873e-02 8.527729e-01
[101,] 6.574224e-01 3.425641e-01 1.350755e-05
[102,] 5.269006e-01 4.730843e-01 1.511052e-05
[103,] 4.010783e-02 2.769830e-02 9.321939e-01
[104,] 7.596165e-01 2.403801e-01 3.435641e-06
[105,] 8.266192e-01 1.733733e-01 7.554315e-06
[106,] 7.158419e-01 2.841322e-01 2.592139e-05
[107,] 8.072156e-01 1.927769e-01 7.477022e-06
[108,] 4.715239e-01 5.282529e-01 2.232633e-04
[109,] 2.344395e-01 8.290417e-02 6.826564e-01
[110,] 6.363906e-01 3.635995e-01 9.927859e-06
[111,] 7.524809e-01 2.475157e-01 3.364222e-06
[112,] 3.981777e-01 6.017763e-01 4.596976e-05
[113,] 6.383044e-03 9.784916e-03 9.838320e-01
[114,] 7.421359e-01 2.578578e-01 6.374946e-06
[115,] 8.662959e-01 1.337004e-01 3.724162e-06
[116,] 4.136607e-01 5.863277e-01 1.159949e-05
[117,] 2.079833e-01 7.919563e-01 6.032739e-05
[118,] 4.307428e-01 5.692184e-01 3.881736e-05
[119,] 6.851292e-02 2.617547e-02 9.053116e-01
[120,] 5.941645e-01 4.056357e-01 1.997297e-04
Urban hip
[1,] 5.701669e-53
[2,] 2.500649e-38
[3,] 4.974154e-33
[4,] 1.259713e-19
[5,] 3.307442e-33
[6,] 1.583309e-30
[7,] 1.122160e-07
[8,] 2.511723e-26
[9,] 8.699924e-27
[10,] 2.365510e-14
[11,] 1.540653e-41
[12,] 2.426069e-46
[13,] 7.193706e-73
[14,] 2.707334e-27
[15,] 4.130228e-50
[16,] 4.594774e-37
[17,] 6.932017e-14
[18,] 4.715025e-28
[19,] 3.135981e-39
[20,] 1.499935e-36
[21,] 1.103694e-22
[22,] 1.237063e-32
[23,] 4.815927e-22
[24,] 6.357093e-24
[25,] 8.544212e-53
[26,] 2.943143e-33
[27,] 1.504438e-17
[28,] 1.933925e-20
[29,] 1.066315e-28
[30,] 5.949756e-17
[31,] 3.637818e-31
[32,] 3.838050e-24
[33,] 8.675468e-23
[34,] 2.070329e-30
[35,] 3.357981e-32
[36,] 2.463502e-46
[37,] 2.257196e-53
[38,] 9.999760e-01
[39,] 9.999927e-01
[40,] 9.999878e-01
[41,] 9.998388e-01
[42,] 9.429148e-01
[43,] 9.996108e-01
[44,] 9.913357e-01
[45,] 9.999652e-01
[46,] 9.984236e-01
[47,] 9.985702e-01
[48,] 9.996325e-01
[49,] 9.871757e-01
[50,] 9.998661e-01
[51,] 9.961847e-01
[52,] 9.556517e-01
[53,] 9.998655e-01
[54,] 9.941594e-01
[55,] 9.446862e-01
[56,] 2.528127e-38
[57,] 5.246860e-123
[58,] 5.646726e-128
[59,] 1.988537e-100
[60,] 3.119177e-96
[61,] 1.627676e-52
[62,] 7.431677e-119
[63,] 5.068258e-148
[64,] 1.776856e-40
[65,] 1.548777e-62
[66,] 1.034270e-139
[67,] 1.025226e-110
[68,] 1.039411e-65
[69,] 4.500244e-121
[70,] 5.254512e-78
[71,] 2.846292e-93
[72,] 4.766651e-88
[73,] 1.031865e-48
[74,] 9.656782e-116
[75,] 4.794541e-37
[76,] 1.824557e-65
[77,] 1.230530e-63
[78,] 4.992639e-73
[79,] 2.151835e-107
[80,] 2.371682e-190
[81,] 2.437758e-113
[82,] 3.820872e-104
[83,] 9.914883e-90
[84,] 4.805351e-73
[85,] 2.506946e-61
[86,] 5.451476e-160
[87,] 1.682258e-70
[88,] 1.974787e-91
[89,] 5.118333e-236
[90,] 5.815997e-166
[91,] 8.437300e-80
[92,] 6.148129e-136
[93,] 8.060332e-115
[94,] 8.960795e-124
[95,] 1.171934e-13
[96,] 6.583983e-24
[97,] 5.619707e-13
[98,] 1.945742e-20
[99,] 5.801518e-08
[100,] 1.124058e-22
[101,] 7.626077e-23
[102,] 6.651534e-20
[103,] 1.847591e-35
[104,] 2.979964e-16
[105,] 1.526116e-12
[106,] 3.883986e-22
[107,] 7.467901e-15
[108,] 1.013706e-23
[109,] 4.116193e-11
[110,] 1.048310e-26
[111,] 8.348068e-17
[112,] 1.178320e-29
[113,] 1.393098e-30
[114,] 8.724017e-18
[115,] 1.543938e-13
[116,] 1.942577e-26
[117,] 6.192493e-32
[118,] 2.587944e-37
[119,] 8.274105e-21
[120,] 3.253257e-20
> eval <- predict(seg.nb, seg.df.test, type="raw")
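One way to use these probabilities is to look at how confident each prediction is. The sketch below retypes four rows from the output above (rows 4, 14, 35 and 38) rather than calling predict() again; the 0.6 cut-off for flagging an uncertain prediction is an arbitrary choice made for this example.

```r
segs <- c("Moving up", "Suburb mix", "Travelers", "Urban hip")

# Four rows copied from the type="raw" output
post <- rbind(
  "4"  = c(4.184773e-02, 2.944421e-02, 9.287081e-01, 1.259713e-19),
  "14" = c(5.601595e-01, 4.398083e-01, 3.220151e-05, 2.707334e-27),
  "35" = c(4.963026e-01, 5.017074e-01, 1.990014e-03, 3.357981e-32),
  "38" = c(4.685754e-06, 1.931791e-05, 2.213979e-09, 9.999760e-01)
)
colnames(post) <- segs

# Most probable segment in each row
best <- segs[max.col(post, ties.method = "first")]

# Probability of that segment, i.e. how confident the model is
confidence <- apply(post, 1, max)

data.frame(row = rownames(post),
           best = best,
           confidence = round(confidence, 3),
           uncertain = confidence < 0.6)  # 0.6 is an arbitrary threshold
```

Rows where the top two segments have similar probabilities are the ones most likely to end up in the off-diagonal cells of the confusion matrix.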