RapidMiner Tutorial: Correlation and Dependence

While this is not a tutorial on correlation, let's still do a quick rundown as a refresher.

What is Correlation?

Correlation describes the strength of the linear relationship between two numeric attributes in a data set.

Correlation can have a value between -1 and 1.

Figure: Correlation Values (image from www.mathsisfun.com)

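Correlation can also tell us the direction of the relationship: a positive value means the two attributes tend to increase together, while a negative value means one tends to decrease as the other increases.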
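As a quick illustration of the range and direction described above, here is a minimal Python sketch of Pearson's correlation coefficient, which is the usual meaning of "correlation" and is assumed here to be the measure in question. The function name `pearson` and the toy lists are made up for this example and are not part of the tutorial's data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator of the coefficient).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (used in the denominator).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values, chosen only to show the two extremes.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases with hours: perfect positive relationship
falling = [10, 8, 6, 4, 2]   # decreases with hours: perfect negative relationship

r_up = pearson(hours, rising)
r_down = pearson(hours, falling)
print(r_up, r_down)
```

Real data rarely sits at the extremes; values closer to 0 indicate a weaker linear relationship.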

What dataset are we going to use?

We're going to use fictitious data that contains 100 rows and 4 variables that are all self-explanatory.

Here's how it looks:


Goal

Find which of the numerical attributes are correlated. For example, describe the relationship between annual income and the number of years of education.

We will be using RapidMiner Studio 8.2.

TL;DR

Rapidminer Tutorial: Correlation Process

RapidMiner Steps:

  1. Look for the Read Excel operator in the Operators window
  2. Drag the operator to the Process window
  3. Click on the Import Configuration Wizard button in the Parameters window to locate the file
    Rapidminer Tutorial: Correlation Steps 1-3
  4. After the Data Import Wizard window pops up, click on the Next button until you reach Step 4
  5. Check that all variables, except for "Person", are categorized as "Integer"
  6. Click Finish
    Rapidminer Tutorial: Correlation Steps 4-8
  7. Look for the Select Attributes operator in the Operators window. Since correlation can only take in numerical data, we need to remove the Person column
  8. Drag it to the Process window and connect the Read Excel operator's Out port to the Select Attributes operator's Exa port, and the latter's Exa port to Res
  9. In the Parameters window, select Single in the Attribute Filter Type drop-down menu
  10. Select Person in the Attribute drop-down menu
  11. Click on Invert Selection
  12. Run process 
    Rapidminer Tutorial: Correlation Steps 7-11
  13. RapidMiner will automatically switch to the Results view to show you your data minus the Person column
    Select Attributes Operator Results
  14. Switch back to Design view 
  15. In the Operators window, look for Correlation Matrix
  16. Drag the operator to the right of Select Attributes in the Process window
  17. Connect the Select Attributes and Correlation Matrix operators via their Exa ports
  18. Connect the Correlation Matrix's Mat port to the Res port
    Rapidminer Tutorial: Correlation Steps 15-19
  19. Run the process and RapidMiner will switch to the Results view to show you its findings. We find that Annual_Income has a strong positive relationship (a correlation coefficient of 0.70) with Year_of_Education, while Annual_Income has a weak positive relationship (0.14) with Age.
    Rapidminer Tutorial: Correlation Results
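
For readers who want to cross-check the result outside RapidMiner, the steps above can be roughly sketched in plain Python: load the rows, drop the non-numeric Person column, and compute a pairwise correlation matrix. This is only an approximation of the process, assuming Pearson's coefficient. The helper names `pearson` and `correlation_matrix` are hypothetical, and the five rows are invented placeholders rather than the tutorial's 100-row data set, so the numbers will not match the results above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(rows, exclude=()):
    """Pairwise correlations for every column not listed in `exclude`.

    Plays the role of Select Attributes (with Invert Selection) followed by
    Correlation Matrix in the steps above.
    """
    columns = [name for name in rows[0] if name not in exclude]
    data = {name: [row[name] for row in rows] for name in columns}
    return {a: {b: pearson(data[a], data[b]) for b in columns} for a in columns}


# Invented placeholder rows; only the column names come from the tutorial.
rows = [
    {"Person": "A", "Age": 25, "Year_of_Education": 12, "Annual_Income": 30000},
    {"Person": "B", "Age": 32, "Year_of_Education": 16, "Annual_Income": 52000},
    {"Person": "C", "Age": 41, "Year_of_Education": 14, "Annual_Income": 45000},
    {"Person": "D", "Age": 29, "Year_of_Education": 18, "Annual_Income": 60000},
    {"Person": "E", "Age": 50, "Year_of_Education": 12, "Annual_Income": 38000},
]

matrix = correlation_matrix(rows, exclude=("Person",))

# Print the matrix one row at a time, rounded for readability.
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

The matrix is symmetric with 1 on the diagonal, so only the values on one side of the diagonal need to be read.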
