Saturday, December 2, 2017

How to classify iris species using logistic regression

Despite its name, logistic regression can actually be used as a model for classification. In this post I will show you how to build a classification system with scikit-learn and OpenCV, and apply logistic regression to classify flower species from the famous Iris dataset.

The Iris dataset

The Iris dataset is one of the best-known datasets in machine learning. It contains measurements of 150 iris flowers from three different species: setosa, versicolor, and virginica. These measurements include the length and width of the petals, and the length and width of the sepals, all measured in centimeters.

Understanding logistic regression

Logistic regression uses a logistic function (or sigmoid) to convert any real-valued input \(x\) into a predicted output value \(\hat{y}\) that takes values between 0 and 1, as shown in the following figure:

Rounding \(\hat{y}\) to the nearest integer effectively classifies the input as belonging either to class 0 or 1.
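
As a minimal sketch of this idea, the logistic function and the rounding step can be written in a few lines of NumPy. The helper name `sigmoid` and the sample inputs are illustrative choices made for this sketch, not part of the original code.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# A few illustrative inputs (chosen for this sketch)
x = np.array([-4.0, -1.0, 0.5, 3.0])
y_hat = sigmoid(x)

# Threshold at 0.5 to turn each output into a hard class label
labels = (y_hat >= 0.5).astype(int)
print(y_hat)
print(labels)
```

An explicit threshold is used instead of `np.round` because NumPy rounds an exact 0.5 down to 0, so the threshold makes the tie-breaking rule visible.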

Of course, most often, our problems have more than one input or feature value, \(x\). For example, the Iris dataset provides a total of four features. For the sake of simplicity, let’s focus here on the first two features, sepal length—which we will call feature \(f_1\)—and sepal width—which we will call \(f_2\). Using the tricks we learned when talking about linear regression, we know we can express the input \(x\) as a linear combination of the two features, \(f_1\) and \(f_2\):

\[ x = w_1 f_1 + w_2 f_2 \]

However, in contrast to linear regression, we are not done yet. From the previous section, we know that the sum of products would result in a real-valued output, but we are interested in a categorical value, zero or one. This is where the logistic function comes in: it acts as a squashing function, \(\sigma\), that compresses the range of possible output values to the range [0, 1]:

\[ \hat{y} = \sigma(x) \]

Because the output is always between 0 and 1, it can be interpreted as a probability. If we only have a single input variable \(x\), the output value \(\hat{y}\) can be interpreted as the probability of \(x\) belonging to class 1.
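
To make the two-step computation concrete, here is a small sketch that first combines two features into \(x\) and then squashes the result into a probability. The weights `w1` and `w2` are made-up values for illustration only; real weights come out of training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical weights, picked by hand for this sketch (not learned)
w1, w2 = 2.0, -3.0

# Sepal length and sepal width (in cm) of two flowers
flowers = np.array([[5.1, 3.5],
                    [7.0, 3.2]])

# x = w1 * f1 + w2 * f2, computed for every row at once
x = flowers @ np.array([w1, w2])

# Interpret the squashed value as the probability of class 1
y_hat = sigmoid(x)
print(x)
print(y_hat)
```

With these particular weights, the first flower lands below 0.5 and the second well above it, so the two would be assigned to class 0 and class 1 respectively.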

Now let’s apply this knowledge to the Iris dataset!

Loading the training data

The Iris dataset is included with scikit-learn. We first load all the necessary modules:

In [1]: import numpy as np
...     import cv2
...     from sklearn import datasets
...     from sklearn import model_selection
...     from sklearn import metrics
...     import matplotlib.pyplot as plt
...     %matplotlib inline
In [2]: plt.style.use('ggplot')

Then, loading the dataset is a one-liner:

In [3]: iris = datasets.load_iris()

This function returns a dictionary-like object we call iris, which contains a bunch of different fields:

In [4]: dir(iris)
Out[4]: ['DESCR', 'data', 'feature_names', 'target',
         'target_names']

Here, all the data points are contained in 'data'. There are 150 data points, each of which has four feature values:

In [5]: iris.data.shape
Out[5]: (150, 4)

These four features correspond to the sepal and petal dimensions mentioned earlier:

In [6]: iris.feature_names 
Out[6]: ['sepal length (cm)', 'sepal width (cm)',
         'petal length (cm)', 'petal width (cm)']

For every data point, we have a class label stored in target:

In [7]: iris.target.shape
Out[7]: (150,)

We can also inspect the class labels, and find that there is a total of three classes:

In [8]: np.unique(iris.target)
Out[8]: array([0, 1, 2])
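
As a small complement, the sketch below counts how many samples carry each label and pairs the integer labels with the species names stored in `target_names`. It only uses fields of the loaded dataset plus `np.bincount`.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count how many samples carry each integer label
counts = np.bincount(iris.target)

# Pair each label with its species name
for label, name in enumerate(iris.target_names):
    print(label, name, counts[label])
```

The dataset is balanced, with 50 flowers per species, which is why dropping one class later leaves exactly 100 data points.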

Making it a binary classification problem

For the sake of simplicity, we want to focus on a binary classification problem for now, where we only have two classes. The easiest way to do this is to discard all data points belonging to a certain class, such as class label 2, by selecting all the rows that do not belong to class 2:

In [9]: idx = iris.target != 2
...     data = iris.data[idx].astype(np.float32)
...     target = iris.target[idx].astype(np.float32)
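
A quick way to confirm that the selection did what we wanted is to look at the shape and the remaining labels. The sketch below repeats the selection in a self-contained form; the cast to `np.float32` is kept because OpenCV's machine learning module works with 32-bit floats.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Boolean mask: True for every row whose label is not 2
idx = iris.target != 2
data = iris.data[idx].astype(np.float32)
target = iris.target[idx].astype(np.float32)

print(data.shape)         # 100 rows remain, each with 4 features
print(np.unique(target))  # only labels 0 and 1 are left
```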

Inspecting the data

Before you get started with setting up a model, it is always a good idea to have a look at the data. We did this earlier for the town map example, so let’s continue our streak. Using Matplotlib, we create a scatter plot where the color of each data point corresponds to the class label:

In [10]: plt.scatter(data[:, 0], data[:, 1], c=target, s=100)
...      plt.xlabel(iris.feature_names[0])
...      plt.ylabel(iris.feature_names[1])

To make plotting easier, we limit ourselves to the first two features (iris.feature_names[0] being the sepal length and iris.feature_names[1] being the sepal width). We can see a nice separation of classes in the following figure:

Splitting the data into training and test sets

We learned in the previous chapter that it is essential to keep training and test data separate. We can easily split the data using one of scikit-learn’s many helper functions:

In [11]: X_train, X_test, y_train, y_test = model_selection.train_test_split(
...          data, target, test_size=0.1, random_state=42
...      )

Here we want to split the data into 90 percent training data and 10 percent test data, which we specify with test_size=0.1. Since 100 data points remain after dropping class 2, inspecting the returned arrays shows that we ended up with exactly 90 training data points and 10 test data points:

In [12]: X_train.shape, y_train.shape
Out[12]: ((90, 4), (90,))
In [13]: X_test.shape, y_test.shape
Out[13]: ((10, 4), (10,))
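
With only 10 test points, a purely random split can leave the two classes unevenly represented. One optional variation, not used in the rest of this post, is to pass `stratify=target` to `train_test_split`, which keeps the class proportions the same in both splits. A minimal sketch:

```python
import numpy as np
from sklearn import datasets, model_selection

iris = datasets.load_iris()
idx = iris.target != 2
data = iris.data[idx].astype(np.float32)
target = iris.target[idx].astype(np.float32)

# stratify=target keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    data, target, test_size=0.1, random_state=42, stratify=target
)

# Number of samples per class in each split
print(np.bincount(y_train.astype(int)))
print(np.bincount(y_test.astype(int)))
```

Because the two classes are equally large, the stratified split puts 45 flowers of each class into the training set and 5 of each into the test set.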

Training the classifier

Creating a logistic regression classifier involves pretty much the same steps as setting up k-NN:

In [14]: lr = cv2.ml.LogisticRegression_create()

We then have to specify the desired training method. Here, we can choose cv2.ml.LogisticRegression_BATCH or cv2.ml.LogisticRegression_MINI_BATCH. For now, all we need to know is that we want to update the model after every data point, which can be achieved with the following code:

In [15]: lr.setTrainMethod(cv2.ml.LogisticRegression_MINI_BATCH)
...      lr.setMiniBatchSize(1)

We also want to specify the number of iterations the algorithm should run before it terminates:

In [16]: lr.setIterations(100)
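
To illustrate what "updating the model after every data point" means, here is a NumPy sketch of gradient descent for logistic regression with a mini-batch size of 1. It is not OpenCV's implementation: the function name `train_logreg`, the learning rate, and the synthetic data are assumptions made for this sketch, and `n_iter` here counts full passes over the data, which may differ from how OpenCV counts iterations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_logreg(X, y, n_iter=100, batch_size=1, lr=0.1, seed=0):
    """Mini-batch gradient descent on the logistic loss (illustrative)."""
    rng = np.random.default_rng(seed)
    # Prepend a column of ones so that theta[0] acts as the bias
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            error = sigmoid(Xb[batch] @ theta) - y[batch]
            # Average gradient over the mini-batch, then one update step
            theta -= lr * Xb[batch].T @ error / len(batch)
    return theta

# Synthetic, well-separated two-class data (made up for this sketch)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

theta = train_logreg(X, y, n_iter=100, batch_size=1)
Xb = np.hstack([np.ones((len(y), 1)), X])
accuracy = np.mean((sigmoid(Xb @ theta) >= 0.5) == y)
print(theta, accuracy)
```

Setting `batch_size=len(y)` in this sketch turns it into plain batch gradient descent, where the weights change only once per pass over the data.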

We can then call the training method of the object (in the exact same way as we did earlier), which will return True upon success:

In [17]: lr.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
Out[17]: True

As we just saw, the goal of the training phase is to find a set of weights that best transform the feature values into an output label. A single data point is given by its four feature values \((f_0, f_1, f_2, f_3)\). Since we have four features, we should also get four weights, so that \(x = w_0 f_0 + w_1 f_1 + w_2 f_2 + w_3 f_3\), and \(\hat{y}=\sigma(x)\). However, as discussed previously, the algorithm adds an extra weight \(b\) that acts as an offset or bias, so that \(x = b + w_0 f_0 + w_1 f_1 + w_2 f_2 + w_3 f_3\). We can retrieve all five values as follows:

In [18]: lr.get_learnt_thetas()
Out[18]: array([[-0.04109113, -0.01968078, -0.16216497,
                  0.28704911, 0.11945518]], dtype=float32)

OpenCV implements the bias by prepending a column of ones to the data, so the first entry of this array is the bias and the remaining four are the feature weights. This means that the input to the logistic function is \(x = -0.0411 - 0.0197 f_0 - 0.162 f_1 + 0.287 f_2 + 0.119 f_3\). Then, when we feed in a new data point \((f_0, f_1, f_2, f_3)\) that belongs to class 1, the output \(\hat{y}=\sigma(x)\) should be greater than 0.5, so that it rounds to 1. But how well does that actually work?
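
How the five entries of the array map onto the formula can be checked numerically. The sketch below plugs the values from Out[18] into the logistic function under both possible readings of where the bias sits, using the first setosa row and the first versicolor row of the dataset as inputs. The variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Values copied from Out[18]
thetas = np.array([-0.04109113, -0.01968078, -0.16216497,
                   0.28704911, 0.11945518])

# First setosa (class 0) and first versicolor (class 1) rows of the dataset
samples = np.array([[5.1, 3.5, 1.4, 0.2],
                    [7.0, 3.2, 4.7, 1.4]])

# Reading 1: the bias is the first entry, the four weights follow
y_bias_first = sigmoid(thetas[0] + samples @ thetas[1:])

# Reading 2: the four weights come first, the bias is the last entry
y_bias_last = sigmoid(samples @ thetas[:4] + thetas[4])

print(y_bias_first)  # approx. [0.43, 0.69]
print(y_bias_last)   # approx. [0.42, 0.36]
```

Only the bias-first reading assigns the versicolor flower to class 1, which is the reading consistent with the accuracy scores reported in the next section.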

Testing the classifier

Let’s see for ourselves by calculating the accuracy score on the training set:

In [19]: ret, y_pred = lr.predict(X_train)
In [20]: metrics.accuracy_score(y_train, y_pred)
Out[20]: 1.0
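
The accuracy score is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels follows; the predictions are assumed here to arrive as a column vector of shape (n, 1), so they are flattened before the comparison.

```python
import numpy as np

# Hypothetical labels and predictions, made up for this sketch
y_true = np.array([0, 1, 1, 0, 1], dtype=np.float32)
# Predictions assumed to be a column vector of shape (n, 1)
y_pred = np.array([[0], [1], [0], [0], [1]], dtype=np.float32)

# Flatten the predictions, then take the fraction of matches
accuracy = np.mean(y_true == y_pred.ravel())
print(accuracy)  # 0.8
```

Four of the five made-up predictions match, hence 0.8. A score of 1.0 means every single prediction matched its label.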

Perfect score! However, this only means that the model was able to perfectly memorize the training dataset. This does not mean that the model would be able to classify a new, unseen data point. For this, we need to check the test dataset:

In [21]: ret, y_pred = lr.predict(X_test)
...      metrics.accuracy_score(y_test, y_pred)
Out[21]: 1.0

Luckily, we get another perfect score! With only 10 test points this is an encouraging sign rather than proof, but it suggests that the model generalizes to data it has not seen during training.
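
Because a 10-point test set gives a fairly coarse estimate, one way to get a steadier number is k-fold cross-validation. The sketch below uses scikit-learn's own `LogisticRegression` as a stand-in for the OpenCV model, so its weights and scores are not guaranteed to match the ones above; the choice of 5 folds is an assumption made for this example.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
idx = iris.target != 2
data, target = iris.data[idx], iris.target[idx]

# scikit-learn's classifier is used as a stand-in for the OpenCV model
clf = LogisticRegression()

# 5-fold cross-validation: every sample is used for testing exactly once
scores = cross_val_score(clf, data, target, cv=5)
print(scores, scores.mean())
```

Each fold holds out 20 flowers, so the mean of the five scores is based on all 100 data points rather than on a single small test set.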

More information can be found in the book Machine Learning for OpenCV, Packt Publishing Ltd., July 2017.
Check out Chapter 3: First Steps in Supervised Learning.