Using Decision Trees to Predict If A Person Makes >50K

What's a Decision Tree?

  • builds classification or regression models that, when visualized as a chart, resemble a tree
  • specifies sequences of decisions and consequences
  • simple and fast algorithm
  • two kinds of trees: classification (usually categorical output variables) and regression (numeric output variables)
  • parts (from top to bottom): root node (best predictor), branch (outcome of a decision; visualized as a line connecting two nodes), decision node (an input variable or attribute), and a leaf node or terminal node (at the end of the last branches of the tree; represents a classification or a decision)
  • can be converted into a set of decision rules
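
To illustrate the last point, here is a minimal Python sketch of turning a tree into decision rules. The nested-dict layout, the function name tree_to_rules, and the toy tree itself are assumptions made up for illustration; the tree is not fitted to the real data.

```python
def tree_to_rules(node, conditions=()):
    """Walk a nested-dict tree and return one IF ... THEN rule per leaf."""
    if not isinstance(node, dict):
        # A non-dict node is a leaf: it holds the predicted class.
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node}"]
    rules = []
    for outcome, child in node["branches"].items():
        # Each branch adds one "attribute = outcome" test to the path.
        test = f"{node['attribute']} = {outcome}"
        rules += tree_to_rules(child, conditions + (test,))
    return rules


# Hypothetical toy tree (made up, NOT learned from the Adult data).
toy_tree = {
    "attribute": "marital-status",
    "branches": {
        "Never-married": "<=50K",
        "Married-civ-spouse": {
            "attribute": "education",
            "branches": {"Bachelors": ">50K", "HS-grad": "<=50K"},
        },
    },
}

for rule in tree_to_rules(toy_tree):
    print(rule)
```

Each path from the root node to a leaf node becomes one rule, so a tree with three leaves yields three rules.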

How to build a tree?

Constructing a decision tree is all about finding the attributes that return the highest information gain (i.e. the most homogeneous branches):
  1. Calculate the entropy of the parent class (target value)
  2. Calculate the information gain for all attributes
  3. From these attributes, choose the one with the largest information gain as the decision node
  4. Label a branch whose entropy = 0 as a leaf node
  5. Run recursively on the non-leaf branches until all data is classified
  6. Form classification rules
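
Steps 1 to 3 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (entropy, information_gain, best_attribute) and the four toy rows are made up for this example and are not taken from the Adult data, and only categorical attributes are handled.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of the child branches."""
    parent = entropy([row[target] for row in rows])
    branches = {}
    for row in rows:
        # Group the target labels by the value of the candidate attribute.
        branches.setdefault(row[attribute], []).append(row[target])
    children = sum(len(b) / len(rows) * entropy(b) for b in branches.values())
    return parent - children


def best_attribute(rows, attributes, target):
    """Step 3: pick the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy rows (made up for illustration, not real observations).
toy_rows = [
    {"workclass": "Private", "education": "Bachelors", "income": ">50K"},
    {"workclass": "State-gov", "education": "Bachelors", "income": ">50K"},
    {"workclass": "Private", "education": "HS-grad", "income": "<=50K"},
    {"workclass": "State-gov", "education": "HS-grad", "income": "<=50K"},
]

print(best_attribute(toy_rows, ["workclass", "education"], "income"))
```

In this toy data, splitting on education produces two pure branches (entropy 0, so both become leaf nodes under step 4), while splitting on workclass leaves both branches as mixed as the parent, so education is chosen as the decision node.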


What data are we going to use?

The Adult Data Set (AKA Census Income) is composed of 32,561 observations and 15 variables. The variables are as follows:
  1. age: continuous.
  2. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  3. fnlwgt: continuous.
  4. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  5. education-num: continuous.
  6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  7. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  8. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  10. sex: Female, Male.
  11. capital-gain: continuous (capital gains).
  12. capital-loss: continuous (capital losses).
  13. hours-per-week: continuous (hours worked per week).
  14. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  15. income: >50K, <=50K (whether the annual income is greater than 50K or not).
Here's a snippet:

Goal:

Predict whether a person (new observation) makes more than 50k a year.

Short Answer:


R Code



Tree w/ 8 Leaf Nodes

Resize the tree to between 5 and 8 leaf nodes


Pruned Tree
