RapidMiner Tutorial: K-means Clustering

What is K-means?


K-means is an unsupervised learning technique that groups similar observations in a data set into clusters.

The "K" in K-means clustering refers to the number of clusters. The user sets this number before running the algorithm, and the algorithm then produces exactly that many clusters.
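
To make the idea concrete, here is a minimal pure-Python sketch of the usual K-means loop on one-dimensional toy numbers: assign each value to its nearest centroid, move each centroid to the mean of its members, and repeat until nothing changes. The function name `kmeans_1d`, the toy values, and the starting centroids are illustrative assumptions and are not part of the RapidMiner process described in this tutorial.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Cluster 1-D values around the given starting centroids (hypothetical helper)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each value joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters


# Toy data (made up for illustration): two obvious groups, so K = 2.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(values, centroids=[1.0, 12.0])
print(centroids)
print(clusters)
```

Calling the same function with three starting centroids would force the same six values into three clusters, which is what "setting K" means in practice.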

What data are we going to use?

We're going to use a made-up data set that lists applicants and their attributes. The attribute names are all self-explanatory. It has 546 rows and 4 columns.

Goal

We would like to cluster applicants using their physical and mental attributes in order to make it easier to pick the right people.

TL;DR

RapidMiner Process

Look for the Read CSV operator in the Operator panel
Drag it to the Process panel
Connect the operator's Out to Res
Click on Import Configuration Wizard in the Parameters panel
Look for our example file, click it, and then click Next
In the Data Wizard window, pick Comma in the Column Separation section, since we are using a comma-separated values (.csv) file, and then click Next twice
Make sure the data types are all correct and then click Finish
Look for the Select Attributes operator
Drag it to the Process panel
In the Parameters panel, click on the Attribute Filter Type dropdown menu and pick Single so that we can isolate a single attribute in the data set
In the Attribute dropdown menu, pick Training Course to tell RM that it's the attribute we want to isolate
Click on the Invert Selection checkbox to indicate that we want to exclude Training Course from the actual calculation and keep everything else
Next, look for the Set Role operator and drag it to the right of Select Attributes
Connect the operators via their Exa ports
In the Parameters panel, click on the Attribute Name dropdown menu and pick Applicant
Click the Target Role dropdown menu and pick ID. This will make the Applicant variable an identifier of all our observations so we do not just get anonymous results
Look for the Normalize operator and drag it to the right of the Set Role operator. Connect them via their Exa ports. We do not need to modify any of the parameters for now. Normalization puts all attributes on a comparable scale, so that an attribute with a larger numeric range does not dominate the distance calculation and over-influence the clustering
Look for the K-means operator and drag it to the right of the Normalize operator. Connect the two via their Exa ports. Connect the K-means operator's Clu port to the Process panel's Res port to the right
Press F11 on your keyboard to Run Process. Once it is done, RapidMiner will automatically switch to Results View
You can look at the Cluster Model Description to find out the number of observations per cluster
You can check the names of the people per cluster in the Folder View. This is the reason why we designated the ID role to Applicant using the Set Role operator earlier
You can also check the centroid of each cluster, that is, the mean value of each attribute within the cluster, in the Centroid Table
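
For readers who want to see what the process above does outside the GUI, the steps can be sketched in plain Python. This is a rough analogue under stated assumptions, not RapidMiner's implementation: the six rows are made up, the column names `Physical` and `Mental` are hypothetical stand-ins for the two remaining attributes, normalization is done as a z-score, and the starting centroids are picked by simply taking evenly spaced rows.

```python
import csv
import io
import statistics

# Made-up rows for illustration only. "Physical" and "Mental" are assumed
# column names; the tutorial's real file has 546 rows.
RAW_CSV = """Applicant,Physical,Mental,Training Course
Ann,9,2,A
Ben,8,3,B
Cy,9,1,A
Dee,2,9,C
Eli,1,8,B
Fay,2,8,C
"""

# Read CSV
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# Select Attributes (inverted): drop Training Course.
# Set Role: Applicant is kept aside as the ID, not used in the calculation.
features = [c for c in rows[0] if c not in ("Applicant", "Training Course")]
ids = [r["Applicant"] for r in rows]
data = [[float(r[c]) for c in features] for r in rows]


def zscore(column):
    """Normalize: rescale one column to mean 0 and standard deviation 1."""
    mean, sd = statistics.mean(column), statistics.pstdev(column)
    return [(x - mean) / sd for x in column]


columns = [zscore(list(col)) for col in zip(*data)]
points = [list(p) for p in zip(*columns)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    # Simplistic deterministic start (an assumption): evenly spaced rows.
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


labels, centroids = kmeans(points, k=2)

# Analogue of the Folder View: applicant names per cluster.
members = {}
for name, lab in zip(ids, labels):
    members.setdefault(lab, []).append(name)

for lab in sorted(members):
    print(f"cluster_{lab}: {len(members[lab])} items -> {members[lab]}")

# Analogue of the Centroid Table (values are in normalized units).
for lab, centroid in enumerate(centroids):
    print(lab, dict(zip(features, (round(v, 2) for v in centroid))))
```

Because the applicant names are kept apart from the numeric columns, they can label the results without influencing the distances, which mirrors the purpose of giving Applicant the ID role.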


Juan Antonio Pajarillo's Data Analytics Projects
