Using Naive Bayes to Classify Consumer Segments for Better Targeting


What's Naive Bayes?

  • A classification tool
  • Fast and easy to interpret 
  • Good for large data sets
  • It's "naive" because we are supposed to assume that all predictors or attributes are independent from one another
  • Based on Bayes' Rule of Posterior Probability: P(c|x) = P(x|c) × P(c) / P(x). A small worked example in R follows this list.
  • P(c|x) is the posterior probability of class c (the target variable) given predictor x (the attributes).
  • P(x|c) is the likelihood: the probability of the predictor given the class.
  • P(c) is the prior probability of the class, i.e. the probability of the class before seeing the data.
  • P(x) is the prior probability of the predictor.
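
To make the rule concrete, here is a minimal sketch in R that plugs the pieces together. The numbers are rounded from the model output shown later (the prior for the Suburb Mix segment, the share of home owners within that segment, and the overall home-ownership rate); treat them as illustrative rather than exact.

#Bayes' Rule with rounded, illustrative numbers
p.class         <- 0.35   #P(c): prior probability of the Suburb Mix segment
p.x.given.class <- 0.49   #P(x|c): probability of owning a home, given Suburb Mix
p.x             <- 0.47   #P(x): overall probability of owning a home
p.class.given.x <- p.x.given.class * p.class / p.x   #P(c|x): the posterior
p.class.given.x   #about 0.365 -- knowing the respondent owns a home nudges the Suburb Mix probability upward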


What data are we going to use?

We look into a fictional set of data collected from 300 respondents. It contains:
  1. Age
  2. Gender
  3. Income
  4. No. of children
  5. Home ownership (own or rent)
  6. Cable TV subscription (yes or no)
The respondents have each already been assigned to one of four consumer segments:
  1. Suburb Mix
  2. Urban Hip
  3. Travelers
  4. Moving Up
Here's a snippet (see the head(seg.raw) output in the R Code section below):


You may also download the complete data here.


Goal

Use Naive Bayes (NB) to predict which consumer segment a person (or a new observation) belongs to based on historical data.


How to Use NB Manually?

Analytics Vidhya provides a clear and short explanation (albeit for Python):
  1. Identify the target variable (in this case, the Consumer Segment)
  2. Convert the data set into a frequency table for each attribute against the target
  3. Create a likelihood table by converting the frequencies into conditional probabilities
  4. Use the Naive Bayes equation to calculate the posterior probability for each class; the class with the highest posterior probability is the prediction (a short R sketch of these steps follows this list)
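
Steps 2-4 can be done by hand in R. The sketch below works through a single categorical predictor (home ownership) and assumes the seg.raw data frame loaded in the R Code section further down; with several predictors, Naive Bayes simply multiplies the individual likelihoods before normalising.

#Step 2: frequency table of segment vs. home ownership (assumes seg.raw is loaded as below)
freq <- table(seg.raw$Segment, seg.raw$ownHome)
#Step 3: likelihood table, P(ownHome | Segment), by converting each row to proportions
lik <- prop.table(freq, margin = 1)
#Priors: P(Segment), the share of each segment before seeing any predictor
prior <- prop.table(table(seg.raw$Segment))
#Step 4: posterior P(Segment | ownYes) = P(ownYes | Segment) * P(Segment) / P(ownYes)
post <- lik[, "ownYes"] * prior / mean(seg.raw$ownHome == "ownYes")
post / sum(post)   #the segment with the highest posterior is the prediction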
Short Answer:
The model assigns respondents to the correct segment 81.7% of the time on the test data. It also gives individual-level estimates of attribute likelihoods; for example, the training data tells us there's a 68% chance that a member of the Moving Up segment is female.
 

R Code:



> set.seed(04444)
> seg.raw <- read.csv("https://1drv.ms/u/s!AqpWLKfaiJ83tHxNn3ClDFDZ1By_")
> summary(seg.raw)
      age            gender        income      
 Min.   :19.26   Female:157   Min.   : -5183  
 1st Qu.:33.01   Male  :143   1st Qu.: 39656  
 Median :39.49                Median : 52014  
 Mean   :41.20                Mean   : 50937  
 3rd Qu.:47.90                3rd Qu.: 61403  
 Max.   :80.49                Max.   :114278  
      kids        ownHome      subscribe  
 Min.   :0.00   ownNo :159   subNo :260  
 1st Qu.:0.00   ownYes:141   subYes: 40  
 Median :1.00                            
 Mean   :1.27                            
 3rd Qu.:2.00                            
 Max.   :7.00                            
        Segment   
 Moving up : 70   
 Suburb mix:100   
 Travelers : 80   
 Urban hip : 50   
> head(seg.raw)
       age gender   income kids ownHome subscribe
1 47.31613   Male 49482.81    2   ownNo     subNo
2 31.38684   Male 35546.29    1  ownYes     subNo
3 43.20034   Male 44169.19    0  ownYes     subNo
4 37.31700 Female 81041.99    1   ownNo     subNo
5 40.95439 Female 79353.01    3  ownYes     subNo
6 43.03387   Male 58143.36    4  ownYes     subNo
     Segment
1 Suburb mix
2 Suburb mix
3 Suburb mix
4 Suburb mix
5 Suburb mix
6 Suburb mix
> library(e1071)
#First, split the data into Training and Test sets (60-40)
> train.prop <- 0.60
> train.cases <- sample(nrow(seg.raw), nrow(seg.raw)*train.prop)
> seg.df.train <- seg.raw[train.cases, ]
#The "-" drops the training-set rows, so the remaining 40% become the test data
> seg.df.test <- seg.raw[-train.cases, ]
#Check out both new data sets using summary()
> summary(seg.df.test)
      age            gender       income      
 Min.   :21.00   Female:63   Min.   : -5183  
 1st Qu.:32.98   Male  :57   1st Qu.: 45176  
 Median :40.13               Median : 53474  
 Mean   :42.17               Mean   : 53759  
 3rd Qu.:51.54               3rd Qu.: 63696  
 Max.   :80.49               Max.   :114278  
      kids         ownHome     subscribe  
 Min.   :0.000   ownNo :60   subNo :101  
 1st Qu.:0.000   ownYes:60   subYes: 19  
 Median :1.000                           
 Mean   :1.225                           
 3rd Qu.:2.000                           
 Max.   :6.000                           
        Segment  
 Moving up :26   
 Suburb mix:37   
 Travelers :39   
 Urban hip :18   
> summary(seg.df.train)
      age            gender       income      
 Min.   :19.26   Female:94   Min.   :  -694  
 1st Qu.:33.10   Male  :86   1st Qu.: 37594  
 Median :39.19               Median : 49496  
 Mean   :40.55               Mean   : 49055  
 3rd Qu.:47.12               3rd Qu.: 61077  
 Max.   :78.20               Max.   :105538  
      kids       ownHome     subscribe        Segment  
 Min.   :0.0   ownNo :99   subNo :159   Moving up :44  
 1st Qu.:0.0   ownYes:81   subYes: 21   Suburb mix:63  
 Median :1.0                            Travelers :41  
 Mean   :1.3                            Urban hip :32  
 3rd Qu.:2.0                                           
 Max.   :7.0                                           
#Train an NB classifier on the training data
> seg.nb <- naiveBayes(Segment ~ ., data=seg.df.train)
> seg.nb

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

#See how the NB model describes each segment BEFORE any new info is added.
A-priori probabilities:
Y
 Moving up Suburb mix  Travelers  Urban hip 
 0.2444444  0.3500000  0.2277778  0.1777778 

#Example: there's a 68% chance a respondent is female if she belongs to the Moving Up segment
Conditional probabilities:
            age
Y                [,1]     [,2]
  Moving up  36.80072 3.857355
  Suburb mix 40.01748 5.159565
  Travelers  58.57831 8.242138
  Urban hip  23.67364 1.886160

            gender
Y               Female      Male
  Moving up  0.6818182 0.3181818
  Suburb mix 0.4603175 0.5396825
  Travelers  0.5121951 0.4878049
  Urban hip  0.4375000 0.5625000

            income
Y                [,1]      [,2]
  Moving up  53036.44 10545.406
  Suburb mix 53801.84 13056.033
  Travelers  59219.21 23336.043
  Urban hip  21211.03  4611.308

            kids
Y                [,1]      [,2]
  Moving up  1.931818 1.5157733
  Suburb mix 1.809524 1.2681857
  Travelers  0.000000 0.0000000
  Urban hip  1.093750 0.9954534

            ownHome
Y                ownNo    ownYes
  Moving up  0.6818182 0.3181818
  Suburb mix 0.5079365 0.4920635
  Travelers  0.3170732 0.6829268
  Urban hip  0.7500000 0.2500000

            subscribe
Y                 subNo     subYes
  Moving up  0.81818182 0.18181818
  Suburb mix 0.93650794 0.06349206
  Travelers  0.90243902 0.09756098
  Urban hip  0.84375000 0.15625000

#We call on R to predict segment membership in the Test data
> seg.nb.class <- predict(seg.nb, seg.df.test)
> seg.nb.class
  [1] Suburb mix Suburb mix Suburb mix Travelers 
  [5] Travelers  Suburb mix Suburb mix Suburb mix
  [9] Suburb mix Suburb mix Travelers  Suburb mix
 [13] Suburb mix Moving up  Travelers  Suburb mix
 [17] Moving up  Suburb mix Suburb mix Suburb mix
 [21] Moving up  Suburb mix Suburb mix Suburb mix
 [25] Suburb mix Moving up  Moving up  Moving up 
 [29] Suburb mix Suburb mix Suburb mix Suburb mix
 [33] Moving up  Suburb mix Suburb mix Suburb mix
 [37] Suburb mix Urban hip  Urban hip  Urban hip 
 [41] Urban hip  Urban hip  Urban hip  Urban hip 
 [45] Urban hip  Urban hip  Urban hip  Urban hip 
 [49] Urban hip  Urban hip  Urban hip  Urban hip 
 [53] Urban hip  Urban hip  Urban hip  Travelers 
 [57] Travelers  Travelers  Travelers  Travelers 
 [61] Travelers  Travelers  Travelers  Travelers 
 [65] Travelers  Travelers  Travelers  Travelers 
 [69] Travelers  Travelers  Travelers  Travelers 
 [73] Travelers  Travelers  Travelers  Travelers 
 [77] Travelers  Travelers  Travelers  Travelers 
 [81] Travelers  Travelers  Travelers  Travelers 
 [85] Travelers  Travelers  Travelers  Travelers 
 [89] Travelers  Travelers  Travelers  Travelers 
 [93] Travelers  Travelers  Moving up  Moving up 
 [97] Moving up  Moving up  Travelers  Travelers 
[101] Moving up  Moving up  Travelers  Moving up 
[105] Moving up  Moving up  Moving up  Suburb mix
[109] Travelers  Moving up  Moving up  Suburb mix
[113] Travelers  Moving up  Moving up  Suburb mix
[117] Suburb mix Suburb mix Travelers  Moving up 
Levels: Moving up Suburb mix Travelers Urban hip

#The frequencies of predicted membership
> prop.table(table(seg.nb.class))
seg.nb.class
 Moving up Suburb mix  Travelers  Urban hip 
 0.1833333  0.2583333  0.4083333  0.1500000 

#The model has a raw agreement rate of 81.7% between predicted and actual. Good stuff!
> mean(seg.df.test$Segment==seg.nb.class)
[1] 0.8166667
#Compare performance across categories using a confusion matrix
#NB was correct for all 18 Urban Hip respondents
#However, it misclassified 11 of the 37 Suburb Mix respondents
> table(seg.nb.class, seg.df.test$Segment)
seg.nb.class Moving up Suburb mix Travelers Urban hip
  Moving up         15          7         0         0
  Suburb mix         5         26         0         0
  Travelers          6          4        39         0
  Urban hip          0          0         0        18
#Now ask predict() for the raw posterior probabilities on the Test data
> predict(seg.nb, seg.df.test, type="raw")
          Moving up   Suburb mix    Travelers    Urban hip
  [1,] 1.440881e-01 8.552193e-01 6.926154e-04 5.701669e-53
  [2,] 1.456777e-01 8.536193e-01 7.029732e-04 2.500649e-38
  [3,] 2.709197e-01 7.289515e-01 1.288325e-04 4.974154e-33
  [4,] 4.184773e-02 2.944421e-02 9.287081e-01 1.259713e-19
  [5,] 2.671615e-03 6.115050e-03 9.912133e-01 3.307442e-33
  [6,] 4.186620e-01 5.812942e-01 4.379586e-05 1.583309e-30
  #(rows 7 to 120 omitted)
> eval <- predict(seg.nb, seg.df.test, type="raw")
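
Finally, the same model can score a brand-new respondent, which is the goal stated above. Below is a minimal sketch; the respondent's values are invented for illustration, and the factor levels are assumed to match those in the training data.

#A hypothetical new respondent (values invented for illustration)
new.obs <- data.frame(age       = 30,
                      gender    = factor("Female", levels = c("Female", "Male")),
                      income    = 45000,
                      kids      = 2,
                      ownHome   = factor("ownNo",  levels = c("ownNo", "ownYes")),
                      subscribe = factor("subYes", levels = c("subNo", "subYes")))
predict(seg.nb, new.obs)                 #most likely segment for this respondent
predict(seg.nb, new.obs, type = "raw")   #posterior probability of each segment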
