OneR

What's One Rule?

a classification algorithm
simple but accurate
works best with categorical data
generates one rule for each predictor in the data, then selects the rule with the smallest total error as its "one rule".

What data set are we going to use?

We're going to use the famous Iris flower data set. It consists of 50 samples from each of three species of Iris (setosa, virginica, and versicolor). Four features were measured from each sample: sepal length, sepal width, petal length, and petal width. Here's a snippet of the table:

Fisher's *Iris* Data
Sepal length (cm)	Sepal width (cm)	Petal length (cm)	Petal width (cm)	Species
5.1	3.5	1.4	0.2	I. setosa
4.9	3.0	1.4	0.2	I. setosa
4.7	3.2	1.3	0.2	I. setosa
4.6	3.1	1.5	0.2	I. setosa
5.0	3.6	1.4	0.2	I. setosa
5.4	3.9	1.7	0.4	I. setosa
4.6	3.4	1.4	0.3	I. setosa
5.0	3.4	1.5	0.2	I. setosa
4.4	2.9	1.4	0.2	I. setosa
4.9	3.1	1.5	0.1	I. setosa
5.4	3.7	1.5	0.2	I. setosa
4.8	3.4	1.6	0.2	I. setosa

The Species column is the target variable. The rest to its left are the predictors.

Learn more about its history and see the complete table here.

Ready? Before we proceed, here's a picture of an Iris versicolor:

[Image: Iris versicolor, from Wikipedia]

Goal:

Train a one rule classifier which we can use to predict the species of a new flower we've never seen before.

In other words, if we're given a new flower, which one of the features above (the predictors: sepal length, petal width, etc.) can best tell whether it's a setosa, a virginica, or a versicolor?

How to find the One Rule manually:

Construct a frequency table for each predictor against the target
Count how often each value of the target (class) appears for each value of the predictor; for this example we first put the numeric values into bins (see below)
Find the most frequent class
Make the rule assign that class to this value of the predictor
Calculate the total error of the rules of each predictor
Choose the predictor with the smallest total error
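
The steps above can be sketched in Python roughly as follows. This is only an illustration, not the R code behind this post: the function name one_r and the toy rows (with made-up, already-binned values) are hypothetical and are not taken from the Iris data.

```python
from collections import Counter, defaultdict


def one_r(rows, predictors, target):
    """Return (best_predictor, rule, total_error) for a list of dict rows."""
    best = None
    for predictor in predictors:
        # Steps 1-2: frequency table of predictor value vs. target class.
        table = defaultdict(Counter)
        for row in rows:
            table[row[predictor]][row[target]] += 1
        # Steps 3-4: each predictor value predicts its most frequent class.
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in table.items()}
        # Step 5: total error = rows whose class differs from the prediction.
        errors = sum(sum(counts.values()) - counts[rule[value]]
                     for value, counts in table.items())
        # Step 6: keep the predictor with the smallest total error.
        if best is None or errors < best[2]:
            best = (predictor, rule, errors)
    return best


# Hypothetical toy rows with pre-binned values (NOT the real Iris data).
toy_rows = [
    {"petal_width": "small", "sepal_width": "wide", "species": "setosa"},
    {"petal_width": "small", "sepal_width": "narrow", "species": "setosa"},
    {"petal_width": "medium", "sepal_width": "narrow", "species": "versicolor"},
    {"petal_width": "medium", "sepal_width": "wide", "species": "versicolor"},
    {"petal_width": "large", "sepal_width": "narrow", "species": "virginica"},
    {"petal_width": "large", "sepal_width": "wide", "species": "virginica"},
]

predictor, rule, errors = one_r(toy_rows, ["sepal_width", "petal_width"], "species")
print(predictor, errors)
```

In the toy rows, petal_width separates the three classes perfectly while sepal_width does not, so the sketch picks petal_width as its "one rule".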

Short Answer:

If you'd like to determine the classification of a new flower, there's one rule that corresponds to each type:

If Petal.Width = (0.0976,0.791] then Species = setosa

If Petal.Width = (0.791,1.63] then Species = versicolor

If Petal.Width = (1.63,2.5] then Species = virginica

The accuracy for this method is very high: 144 of the 150 instances were classified correctly, or 96%.
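
To make the rule concrete, here is a small Python sketch that applies the three intervals to a petal width. The name predict_species is hypothetical, and treating each interval as open on the left and closed on the right is an assumption based on the (a,b] notation in the rules.

```python
# Right-closed intervals (low, high] copied from the three rules above.
RULES = [
    (0.0976, 0.791, "setosa"),
    (0.791, 1.63, "versicolor"),
    (1.63, 2.5, "virginica"),
]


def predict_species(petal_width):
    """Return the species whose interval contains petal_width, or None."""
    for low, high, species in RULES:
        if low < petal_width <= high:
            return species
    return None  # outside every bin covered by the rule


# Petal widths 0.2 and 0.4 both appear in the table snippet (setosa rows).
print(predict_species(0.2))
print(predict_species(0.4))

# Accuracy reported in the post: 144 correct out of 150.
accuracy = 144 / 150
print(f"{accuracy:.0%}")
```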

Finding our First Rule

Let's try it out for setosa:

So out of the 50 setosa samples that have a petal length of 0.994-2.46 and a petal width of 0.0976-0.791, there are

1 with a Sepal Length (SL) between 4.3-5.41 and a Sepal Width (SW) of 2.87 or less,
11 with the same SL and a SW between 2.87-3.19,
33 with the same SL and a SW between 3.19-4.4,
5 with a SL between 5.41-6.25 and a SW between 3.19-4.4.

If you add up the counts above, you'll see that they already account for all 50 setosas. But which of the predictors will clinch it for us? Take a look at the rest of the frequency tables below and you'll notice that R also tried the other petal length and petal width bins, all of which contained 0 setosas.
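
As a rough illustration of how such bin counts come about, the Python sketch below bins the 12 setosa rows from the table snippet (not all 50) using the sepal bin edges quoted above. The helper bin_index is hypothetical, and treating the bins as closed on the right is an assumption.

```python
from bisect import bisect_left
from collections import Counter

# Upper edges of the right-closed sepal bins quoted in the text.
SL_EDGES = [5.41, 6.25]
SW_EDGES = [2.87, 3.19, 4.4]

# (sepal length, sepal width) of the 12 setosa rows in the table snippet.
SAMPLES = [
    (5.1, 3.5), (4.9, 3.0), (4.7, 3.2), (4.6, 3.1), (5.0, 3.6), (5.4, 3.9),
    (4.6, 3.4), (5.0, 3.4), (4.4, 2.9), (4.9, 3.1), (5.4, 3.7), (4.8, 3.4),
]


def bin_index(value, upper_edges):
    """Index i of the bin with upper_edges[i-1] < value <= upper_edges[i]."""
    return bisect_left(upper_edges, value)


# Count samples per (SL bin, SW bin) pair.
counts = Counter(
    (bin_index(sl, SL_EDGES), bin_index(sw, SW_EDGES)) for sl, sw in SAMPLES
)
for (sl_bin, sw_bin), n in sorted(counts.items()):
    print(f"SL bin {sl_bin}, SW bin {sw_bin}: {n}")
```

For these 12 rows, every sample falls in the lowest SL bin: 4 have a SW in the 2.87-3.19 bin and 8 have a SW in the 3.19-4.4 bin.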

Juan Antonio Pajarillo's Data Analytics Projects
