In this blog we will try an intermediate model, K means clustering, using the Azure platform. If you are new to Azure and have not yet used simple models, then we suggest you first have a look at our earlier blog, which uses a simple linear regression.
What is K means clustering?
You can have a look at this blog if you are new to K means clustering concepts.
What is our dataset for this model?
We will use the dataset provided by the Government of India at https://data.gov.in/catalog/accidental-deaths-suicides-india-adsi-1987
These datasets cover suicides and accidental deaths and are available for download as CSV files with a header row.
Please go ahead and download the dataset locally, as shown below.
What do we do now in Azure Machine Learning Studio?
This blog will not detail the login steps for ML Studio. If you need help, refer to the blog here.
Create a new experiment, then click dataset and click the + Add icon to add the dataset. Simply browse to the path where you downloaded the CSV and upload it to ML Studio.
If the import was successful, you will see the CSV under dataset as shown below.
If you have reached this far, then you have started well. We now need to add the K means module from the left and configure the parameters as shown below.
Microsoft's guide to the K means parameters is well documented and can be found at the link below.
The trainer mode is set to Parameter Range as we are not sure of the best parameters for the dataset. K means ++ is the default initialization option; it is a way of picking the starting centroids and usually gives a better start than random initialization in plain K means. I have taken the random seed as 3 and you can experiment with it. We used Euclidean as the metric and not Cosine, as we are interested in the distance and not the angles. Also, Euclidean is more familiar to me and is the usual choice for K means.
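To see why the choice between Euclidean and Cosine matters, here is a minimal Python sketch of both metrics. It is not part of the Azure experiment; the function names and the two sample rows are made up for illustration and do not come from the ADSI dataset.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two rows of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two rows (assumes non-zero rows)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical rows: same proportions, but ten times the magnitude.
small_year = [100.0, 200.0]
large_year = [1000.0, 2000.0]

print("Euclidean:", euclidean_distance(small_year, large_year))
print("Cosine:", cosine_distance(small_year, large_year))
```

The two rows point in the same direction, so the Cosine distance is practically zero, while the Euclidean distance is large. Since we care about how big the counts are and not only their proportions, Euclidean is the better fit here.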
We then train the model
To train the model, simply add the train module from the left. The parameters we have used are the defaults, that is, all columns and features. We keep the same defaults for Assign cluster from the left.
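If you are curious what the train and assign steps do behind the scenes, the idea can be sketched in plain Python. This is a simplified illustration and not Azure's implementation: the function names, the sample points and the choice of 2 clusters are all assumptions made for the example, and the seeding step assumes the points are not all identical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """K means ++ style seeding: prefer points far from the centroids chosen so far."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids

def assign_clusters(points, centroids):
    """Give each point the index of its nearest centroid (Euclidean)."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]

def train_kmeans(points, k, seed=3, max_iterations=100):
    """Alternate between assigning points and moving centroids until nothing changes."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    for _ in range(max_iterations):
        labels = assign_clusters(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the centroid to the mean of its members, column by column.
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

# Made-up rows with two obvious groups (not taken from the ADSI dataset).
points = [[100.0, 50.0], [102.0, 51.0], [101.0, 49.0],
          [900.0, 400.0], [905.0, 398.0], [898.0, 402.0]]
centroids = train_kmeans(points, k=2)
labels = assign_clusters(points, centroids)
print(labels)
```

The first three rows should end up sharing one cluster number and the last three another. Which group gets which number depends on the random seed, just as it does in ML Studio.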
By now the model should look something like the one below.
Final step and conclusion
In the final step we need to add Select Columns in Dataset. We keep the defaults here as well. Save and run the model.
To conclude on this model, we right-click Select Columns in Dataset -> Result dataset -> Visualize.
We see the result below. Note that the Assignment column is the cluster number assigned to each row.
If we scroll down this result dataset, we see that after 1984 the cluster number is 2. So if I were an administrator, I would look at which socio-economic changes after 1984 caused the accidents and deaths to cluster differently. More analysis would certainly be needed, but this provides a base for grouping the data using K means clustering.
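If you export the result dataset, you do not have to scroll to spot such a change. Below is a small Python sketch that lists the years in which the cluster number switches. The (year, cluster) rows are made up to mimic the pattern described above and are not the actual output of the experiment; the function name is also just an illustration.

```python
def find_assignment_changes(rows):
    """Return (year, old_cluster, new_cluster) for every year where the cluster changes."""
    ordered = sorted(rows)  # sort by year
    changes = []
    for (_, prev_cluster), (year, cluster) in zip(ordered, ordered[1:]):
        if cluster != prev_cluster:
            changes.append((year, prev_cluster, cluster))
    return changes

# Hypothetical (year, assignment) pairs, for illustration only.
sample_rows = [(1982, 0), (1983, 0), (1984, 0), (1985, 2), (1986, 2), (1987, 2)]
changes = find_assignment_changes(sample_rows)
print(changes)  # [(1985, 0, 2)]
```

Each tuple tells you the first year of the new cluster, the old cluster number and the new one, which gives you a starting point for digging into what changed around that time.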
Are there any other observations that you can pull from this data set? Please comment in the section below.
About Author: Anirban describes himself as a combination of a coach, manager, leader and technologist. Anirban also runs the famous robotics site MieRobot.com, which was voted one of the top 40 robotics blog sites on this planet by Feed Spot. Anirban loves working with youth; from his numerous corporate assignments with interns and freshers, he can give you a run on being patient and cool. Anirban's technical stack includes Microsoft Azure Machine Learning, Unix, C++, ROS, Python, Microcontroller Programming, Neural Networks, TensorFlow and web services. Anirban is a keen social media engineer and product UX designer. He trains young professionals and students in machine learning and his offerings are at https://www.dneur.com/machine-learning