What's a Decision Tree?
- builds classification or regression models that, when visualized as a chart, resemble a tree
- specifies sequences of decisions and consequences
- simple and fast algorithm
- two kinds of trees: classification (usually categorical output variables) and regression (numeric output variables)
- parts (from top to bottom): root node (best predictor), branch (outcome of a decision; visualized as a line connecting two nodes), decision node (an input variable or attribute), and a leaf node or terminal node (at the end of the last branches of the tree; represents a classification or a decision)
- can be converted into a set of decision rules
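To make the last point concrete, here is a minimal Python sketch of turning a tree into decision rules. The tree below is hand-written and hypothetical (it is not learned from any data), and the nested-dict layout and the `tree_to_rules` helper are assumptions made for this illustration only. Each root-to-leaf path becomes one IF ... THEN rule.

```python
# A tiny hand-written tree (hypothetical, not learned from data).
# Decision nodes are dicts: {"attribute": name, "branches": {value: subtree}}.
# Leaf nodes are plain strings holding the predicted class.
toy_tree = {
    "attribute": "marital-status",
    "branches": {
        "Married-civ-spouse": {
            "attribute": "education",
            "branches": {"Bachelors": ">50K", "HS-grad": "<=50K"},
        },
        "Never-married": "<=50K",
    },
}


def tree_to_rules(node, conditions=()):
    """Return one IF ... THEN ... rule per root-to-leaf path."""
    if not isinstance(node, dict):  # reached a leaf: emit the finished rule
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {premise} THEN income {node}"]
    rules = []
    for value, subtree in node["branches"].items():
        # Extend the path with the test taken on this branch.
        path = conditions + ((node["attribute"], value),)
        rules.extend(tree_to_rules(subtree, path))
    return rules


rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

The toy tree has three leaves, so the sketch yields three rules, one per leaf.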
How to build a tree?
Constructing a decision tree is all about finding the attributes that return the highest information gain (i.e., the most homogeneous branches):
- Calculate entropy of the parent class (target value)
- Calculate the information gain for all attributes
- From these attributes, choose the one with the largest information gain as the decision node
- Label any branch whose entropy = 0 as a leaf node
- Run recursively on the non-leaf branches until all data is classified
- Form classification rules
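The steps above can be sketched in a few lines of Python. This is a simplified, ID3-style illustration under stated assumptions, not the procedure used by any particular R package: it handles categorical attributes only, the function names (`entropy`, `information_gain`, `build_tree`) are hypothetical, and the four rows are made up for the example rather than taken from the Adult data. Entropy is computed as the negative sum of p * log2(p) over the class proportions, and information gain is the parent entropy minus the size-weighted entropy of the branches.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of each branch."""
    parent = entropy([row[target] for row in rows])
    branches = {}
    for row in rows:
        branches.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(b) / len(rows) * entropy(b) for b in branches.values())
    return parent - weighted


def build_tree(rows, attributes, target):
    """Recursively split on the attribute with the largest information gain."""
    labels = [row[target] for row in rows]
    # A branch with entropy 0 (or no attributes left) becomes a leaf node.
    if entropy(labels) == 0 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {"attribute": best, "branches": branches}


# Made-up toy rows, for illustration only (not taken from the Adult data).
rows = [
    {"sex": "Male", "marital-status": "Married-civ-spouse", "income": ">50K"},
    {"sex": "Female", "marital-status": "Married-civ-spouse", "income": ">50K"},
    {"sex": "Male", "marital-status": "Never-married", "income": "<=50K"},
    {"sex": "Female", "marital-status": "Never-married", "income": "<=50K"},
]

gains = {a: information_gain(rows, a, "income") for a in ("sex", "marital-status")}
tree = build_tree(rows, ["sex", "marital-status"], "income")
print(gains)
print(tree)
```

In this toy data, splitting on sex leaves both branches as mixed as the parent, while splitting on marital-status produces two pure branches, so marital-status is chosen as the root and both of its branches become leaves immediately.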
What data are we going to use?
The Adult Data Set (AKA Census Income) is composed of 32,561 observations and 15 variables. The variables are as follows:
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous (capital gains).
- capital-loss: continuous (capital losses).
- hours-per-week: continuous (hours worked per week).
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- income: whether the annual income is greater than 50K (>50K) or not (<=50K).
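As a rough illustration of how one observation maps onto these 15 variables, the Python sketch below parses a single comma-separated record into a dictionary. The comma-separated layout, the `parse_record` helper, and the sample row are assumptions made for this example: the row is hypothetical, built from the category values listed above, and is not an actual observation from the data set.

```python
# Column names in the order listed above.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Columns treated as whole numbers in this sketch (an assumption).
NUMERIC = {
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
}


def parse_record(line):
    """Split one comma-separated line into a {column: value} dictionary."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    record = dict(zip(COLUMNS, values))
    for name in NUMERIC:
        record[name] = int(record[name])
    return record


# Hypothetical row for illustration; not a real observation.
sample = (
    "45, Private, 120000, Bachelors, 13, Married-civ-spouse, Sales, "
    "Husband, White, Male, 0, 0, 40, United-States, >50K"
)
record = parse_record(sample)
print(record["age"], record["education"], record["income"])
```

Keeping the numeric columns as integers and the rest as strings mirrors the continuous versus categorical split in the variable list.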
Here's a snippet:
Goal:
Predict whether a person (a new observation) makes more than 50K a year.
Short Answer:
R Code
[Figure: Tree w/ 8 Leaf Nodes]
Resize tree between 5-8
[Figure: Pruned Tree]