OneR

What's One Rule?

  • a classification algorithm
  • simple but accurate
  • works best with categorical data
  • generates one rule for each predictor in the data, then selects the rule with the smallest total error as its "one rule".

What data set are we going to use? 

We're going to use the famous Iris flower data set. It consists of 50 samples from each of three species of Iris (setosa, virginica, and versicolor). Four features were measured from each sample: sepal length, sepal width, petal length, and petal width. Here's a snippet of the table:


Fisher's Iris Data (measurements in cm)
Sepal length  Sepal width  Petal length  Petal width  Species
5.1           3.5          1.4           0.2          I. setosa
4.9           3.0          1.4           0.2          I. setosa
4.7           3.2          1.3           0.2          I. setosa
4.6           3.1          1.5           0.2          I. setosa
5.0           3.6          1.4           0.2          I. setosa
5.4           3.9          1.7           0.4          I. setosa
4.6           3.4          1.4           0.3          I. setosa
5.0           3.4          1.5           0.2          I. setosa
4.4           2.9          1.4           0.2          I. setosa
4.9           3.1          1.5           0.1          I. setosa
5.4           3.7          1.5           0.2          I. setosa
4.8           3.4          1.6           0.2          I. setosa

The Species column is the target variable. The four columns to its left are the predictors.

Learn more about its history and see the complete table here


Ready? Before we proceed, here's a picture of an Iris versicolor:
[Image: Iris versicolor, from Wikipedia]

Goal:

Train a one rule classifier which we can use to predict the species of a new flower we've never seen before.

In other words, if we're given a new flower, which one of the features above (the predictors: sepal length, sepal width, petal length, and petal width) can best tell whether it's a Setosa, a Virginica, or a Versicolor?

How to find the One Rule manually:

  1. Construct a frequency table for each predictor against the target (for this example, the numeric predictor values are first put into bins; see below)
  2. For each value of the predictor, count how often each value of the target (class) appears
  3. Find the most frequent class for that value
  4. Make the rule assign that class to this value of the predictor
  5. Calculate the total error of the rules of each predictor
  6. Choose the predictor with the smallest total error
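
The six steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation of any particular OneR package. The function name `one_r`, the bin labels ("small", "wide", and so on), and the six toy rows are made up for the example and are not taken from the Iris data.

```python
from collections import Counter, defaultdict


def one_r(rows, predictors, target):
    """Return (best_predictor, rule, total_errors) for categorical rows (a list of dicts)."""
    best = None
    for predictor in predictors:
        # Steps 1-2: frequency table of target classes for each predictor value.
        table = defaultdict(Counter)
        for row in rows:
            table[row[predictor]][row[target]] += 1
        # Steps 3-4: each predictor value is assigned its most frequent class.
        rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        # Step 5: total error = rows that are not in the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in table.values())
        # Step 6: keep the predictor with the smallest total error (first one wins ties).
        if best is None or errors < best[2]:
            best = (predictor, rule, errors)
    return best


# Hypothetical, already-binned toy rows (NOT real Iris measurements).
toy_rows = [
    {"petal_width": "small", "sepal_width": "wide", "species": "setosa"},
    {"petal_width": "small", "sepal_width": "narrow", "species": "setosa"},
    {"petal_width": "medium", "sepal_width": "narrow", "species": "versicolor"},
    {"petal_width": "medium", "sepal_width": "wide", "species": "versicolor"},
    {"petal_width": "large", "sepal_width": "narrow", "species": "virginica"},
    {"petal_width": "large", "sepal_width": "wide", "species": "virginica"},
]

predictor, rule, errors = one_r(toy_rows, ["sepal_width", "petal_width"], "species")
print(predictor, rule, errors)
```

In this toy set, `petal_width` separates the three classes without error while `sepal_width` misclassifies four of the six rows, so `petal_width` is chosen. Ties are broken by taking the first predictor seen, which is an arbitrary choice made for this sketch.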

Short Answer:

If you'd like to determine the classification of a new flower, the one rule uses petal width alone, with one condition for each species:

If Petal.Width = (0.0976,0.791] then Species = setosa

If Petal.Width = (0.791,1.63]   then Species = versicolor

If Petal.Width = (1.63,2.5]     then Species = virginica

The accuracy of this rule is high: 144 of the 150 instances were classified correctly, which is 96%.
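
As a rough sketch, the three petal-width conditions can be written as a small Python function. The interval bounds are copied from the rule above and read as half-open intervals (lower bound excluded, upper bound included). The function name `predict_species` and the choice to return `None` for widths outside all three intervals are assumptions made for this example.

```python
def predict_species(petal_width):
    """Apply the petal-width rule; returns None if the width falls outside every interval."""
    if 0.0976 < petal_width <= 0.791:
        return "setosa"
    if 0.791 < petal_width <= 1.63:
        return "versicolor"
    if 1.63 < petal_width <= 2.5:
        return "virginica"
    return None  # assumption: widths outside the bins are left unclassified


# Accuracy reported above: 144 correct out of 150 instances.
correct, total = 144, 150
accuracy = correct / total

print(predict_species(0.2))   # a petal width from the table snippet
print(f"{accuracy:.0%}")
```

The first row of the table snippet has a petal width of 0.2, which falls in the first interval, so the sketch labels it setosa, matching the Species column.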

Finding our First Rule

Let's try it out for Setosa:


So out of the 50 Setosa samples, all of which have a petal length of 0.994-2.46 and a petal width of 0.0976-0.791, there are
  • 1 with a Sepal Length (SL) between 4.3-5.41 and a Sepal Width (SW) between 2-2.87
  • 11 with the same SL and a SW between 2.87-3.19
  • 33 with the same SL and a SW between 3.19-4.4
  • 5 with a SL between 5.41-6.25 and a SW between 3.19-4.4
If you add up the counts listed above, you'll see that we've already reached all 50 Setosas. But which of the predictors will clinch it for us? Take a look at the rest of the frequency tables below and you'll notice that R tried other petal length and petal width combinations, all of which yielded 0 Setosas.
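
To make the counting concrete, here is a small Python sketch that assigns sepal measurements to bins and tallies the combinations. It uses only the twelve Setosa rows shown in the table snippet near the top of this post, not the full 150-row data set, so the counts are smaller than the ones listed above. The cut points are the bin edges quoted in the text; the helper name `bin_index` is made up for the example, and bins are treated as closed on the right.

```python
from bisect import bisect_left
from collections import Counter

# (sepal length, sepal width) for the twelve Setosa rows in the table snippet.
rows = [
    (5.1, 3.5), (4.9, 3.0), (4.7, 3.2), (4.6, 3.1),
    (5.0, 3.6), (5.4, 3.9), (4.6, 3.4), (5.0, 3.4),
    (4.4, 2.9), (4.9, 3.1), (5.4, 3.7), (4.8, 3.4),
]

# Interior bin edges quoted in the text above.
SL_CUTS = [5.41, 6.25]   # sepal length
SW_CUTS = [2.87, 3.19]   # sepal width


def bin_index(value, cuts):
    """Index of the right-closed bin (lower, upper] that contains value."""
    return bisect_left(cuts, value)


# Tally how many rows fall into each (SL bin, SW bin) combination.
counts = Counter((bin_index(sl, SL_CUTS), bin_index(sw, SW_CUTS)) for sl, sw in rows)
print(counts)  # Counter({(0, 2): 8, (0, 1): 4})
```

All twelve rows land in the lowest sepal-length bin; four of them have a sepal width in the 2.87-3.19 bin and eight in the 3.19-4.4 bin.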
