The support vector machine (SVM) is a very common algorithm in supervised machine learning, and one of the recommended algorithms in our machine learning roadmap.
In case you have not read our machine learning roadmap, you can click the link below.
If you wish to skip directly to the code, download the Jupyter Notebook at the GitHub link below.
The aim of this blog:
a) Understand the basics of support vector machines
b) Discuss what a hyperplane is
c) The relation between hyperplanes and SVM
d) The confusion matrix
e) Visualise the IRIS dataset using the Seaborn Python package
f) Implement SVM in Python using scikit-learn
Understand the basics of SVM:
Imagine a case where you need to sort data into two groups, such as whether an event will happen or not, given a past data set. This is, broadly speaking, a classification problem.
SVM defines a margin between the data points plotted in an N-dimension space.
SVM is widely used in medical research, for example to detect cancer, and to predict rare natural disasters such as earthquakes and floods.
What is a Hyperplane?
A plane is an abstract surface of infinite width and length, with zero thickness and zero curvature. You cannot see a plane occur naturally; it is an abstract concept.
A hyperplane in one dimension is a point, in 2D it is a line, in 3D it is a plane, and in N dimensions it is called a hyperplane.
In terms of vectors: two vectors are orthogonal (at right angles to one another) if their dot product (scalar product) is zero. The vector from a particular point p to any point q is q - p.
So if p is a point in your plane and n is a vector perpendicular to the plane (the normal vector), then every point q in the plane must satisfy (q - p) . n = 0.
In Cartesian coordinates: if the plane is represented by ax + by + cz = d, then the normal vector to the plane is <a, b, c>.
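As a quick sanity check, the orthogonality condition can be verified numerically. The plane and the points below are made-up values for illustration, not taken from any dataset:

```python
import numpy as np

# Plane ax + by + cz = d with a=1, b=2, c=3, d=6, so the normal vector is n = <1, 2, 3>
n = np.array([1.0, 2.0, 3.0])

p = np.array([6.0, 0.0, 0.0])  # a point on the plane: 1*6 + 2*0 + 3*0 = 6
q = np.array([0.0, 3.0, 0.0])  # another point on the plane: 2*3 = 6

# (q - p) lies within the plane, so its dot product with the normal is zero
print(np.dot(q - p, n))  # -> 0.0
```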
Hyperplane and SVM:
Ok guys, now that we have a fair idea of what a plane is, let us use our vector knowledge to see what a 'support vector' is.
A support vector is a data point from each class that lies closest to the 'maximum margin hyperplane'.
The image below shows the support vectors for another sample of x/y data.
So can you now pick up the points which are close to the maximum margin hyperplane?
Of course you can, and maybe you have correctly picked the ones below.
So you see, we have to maximise the margin of the hyperplane which separates the two data classes.
In Cartesian terms, for a separating line y = mx + c, we need to find weights such that the points of one class satisfy mx + c >= 1 and the points of the other satisfy mx + c <= -1.
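To make the margin constraints concrete, here is a toy numeric check using the common w.x + b form of the separating boundary (an equivalent reformulation; the weights and points below are hand-picked for illustration, not learned from data):

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0  # hypothetical weights and bias

pos = np.array([3.0, 1.0])   # a point from the +1 class
neg = np.array([0.5, 0.5])   # a point from the -1 class

print(np.dot(w, pos) + b)  # -> 1.0,  satisfies w.x + b >= 1
print(np.dot(w, neg) + b)  # -> -2.0, satisfies w.x + b <= -1
```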
SVM using Python:
We will use the famous IRIS dataset, which records the respective petal and sepal lengths and widths of three iris flowers.
We first call the Wikipedia images to see these flowers and get some idea of them in the real world.
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg'
Image(url, width=300, height=300)
You can see the other IRIS flowers as well by calling the URL of the jpg as:
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg'
Image(url, width=300, height=300)
And also the next image by:
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/9/9f/Iris_virginica.jpg'
Image(url, width=300, height=300)
Now that we have seen these IRIS flowers let us try to have a view of the dataset.
As usual, we call the libraries we may need as:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
%matplotlib inline
Pandas and matplotlib are imported. We will also use scikit-learn for the train-test split.
We then call and load the IRIS datasets over Pandas frame:
iris = datasets.load_iris()
iris_pd = pd.DataFrame(iris.data)
We then view the first rows of the dataset with: iris_pd.head()
The output makes little sense as we just see numbers, so we need to append column headers (what each column means behind the numbers).
To do this we will use the 'feature_names' attribute, which you can see if you describe the dataset.
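A minimal way to attach the headers, assuming the standard `feature_names` attribute of the scikit-learn dataset bunch, is:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_pd = pd.DataFrame(iris.data)

# Attach the column headers stored in the dataset's feature_names attribute
iris_pd.columns = iris.feature_names
print(iris_pd.head())
```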
The dataset finally makes some sense now as below:
Now we know that each column represents lengths and width of sepal and petal.
The next step is calling scikit-learn for the train-test split and also checking the confusion matrix.
Seaborn visualisation of the IRIS dataset:
Before we start using the machine learning algorithm we have to be sure that the dataset is an ideal candidate for classification.
The best way to do this is to conduct some data visualisation. We will use the Python Seaborn package for this.
import seaborn as sns
iris = sns.load_dataset('iris')
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
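The visualisation itself can be produced with a Seaborn pair plot, which scatters every pair of features against each other, coloured by species. A minimal sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')

# Scatter plots of every feature pair, coloured by species;
# a cleanly separated cluster indicates an easy class to predict
sns.pairplot(iris, hue='species')
plt.show()
```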
As you can see clearly, 'Setosa', of the 3 categories of IRIS flower, is the ideal candidate for the predicted value y, since it separates cleanly from the other two. So we will use Setosa for prediction.
Ok now we need to see why we split data in machine learning.
Train-test-split is an important approach in machine learning. In simple terms we divide the dataset into subsets for training and testing. This is done to avoid Overfitting and Underfitting the mathematical model. If you need to know more about this approach have a look at the reference section.
from sklearn.model_selection import train_test_split
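As a quick illustration of what the split produces (the shapes shown assume the 30% test fraction used later and the 150-row IRIS dataset):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the rows for testing; the rest is used for training.
# random_state is set only to make this sketch reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
print(X_train.shape, X_test.shape)  # -> (105, 4) (45, 4)
```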
Before we move to the next part we should know a little about confusion matrix.
What is confusion matrix?:
Also known as an error matrix: each column of the confusion matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
The confusion matrix in the table above reads pictures and classifies them as Cat or Rabbit. As you can see in this example, the algorithm is good for Cat, with 10/12 correct predictions, but a near-total failure at predicting Rabbit.
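The same kind of reading can be reproduced with scikit-learn on made-up labels (the counts below are illustrative, not the exact table from the image):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels: 12 actual cats, 8 actual rabbits
actual    = ['cat'] * 12 + ['rabbit'] * 8
predicted = ['cat'] * 10 + ['rabbit'] * 2 + ['cat'] * 5 + ['rabbit'] * 3

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(actual, predicted, labels=['cat', 'rabbit']))
# [[10  2]
#  [ 5  3]]
```

Here the first row shows Cat is predicted well (10 of 12 correct), while the second row shows Rabbit is mostly misclassified (only 3 of 8 correct).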
Now that we have a decent idea of the dataset and are more confident that it is ideal for a classification problem, we will apply the algorithm from scikit-learn. We also need to remove the predicted variable y from the remaining dataset. If we do not do this, the model will see the answer during training and report a meaningless 100% accuracy.
So we separate the species labels from the feature columns as below:
import seaborn as sns
iris = sns.load_dataset('iris')
X = iris.drop('species', axis=1)
y = iris['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
Note that we dropped the 'species' column to form the feature matrix X.
We then call the SVM model from scikit-learn using the SVC classifier. To know more about the available SVM kernels, check the scikit-learn documentation for SVM here.
from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train, y_train)
predictions = svc_model.predict(X_test)
Finally, we print the confusion matrix and the per-class precision, including Setosa.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
The result is pretty good as you can see in the below image.
The precision is a brilliant 98%, and we can safely say now that we have ducked the SVM bouncer.
Share us your experience as well by being our Guest blogger. Write at email@example.com