Thursday, March 22, 2018

Logistic Regression classifier

Logistic regression is a technique used to explain the relationship between input variables and an output variable.

The input variables are assumed to be independent and the output variable is referred to as the dependent variable. The dependent variable can take only a fixed set of values. These values correspond to the classes of the classification problem.

Our goal is to identify the relationship between the independent variables and the dependent variable by estimating the probabilities using a logistic function.


This logistic function is a sigmoid curve that is used to build the function with various parameters. It is closely related to generalized linear model analysis, where we try to fit a line to a set of points so as to minimize the error. Instead of linear regression, here we use logistic regression.
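To make the logistic function concrete, here is a small standalone sketch (our own illustration, not part of the classifier code in this chapter) that plots the sigmoid curve; the function name sigmoid is our own choice:

import numpy as np
import matplotlib.pyplot as plt

# The logistic (sigmoid) function maps any real number into (0, 1),
# which is what lets us interpret its output as a probability
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plot the characteristic S-shaped curve
z = np.linspace(-8, 8, 200)
plt.plot(z, sigmoid(z))
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.title('The logistic (sigmoid) function')
plt.show()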

Logistic regression by itself is not strictly a classification technique, but it is used this way to facilitate classification. It is very common in machine learning because of its simplicity.

Let's see how to build a classifier using logistic regression. Make sure you have the Tkinter package installed on your system before you proceed. If you don't, you can find it at https://docs.python.org/3/library/tkinter.html.


Create a new Python file (logistic_regression.py) and import the following packages. We will also import a function from the file utilities.py; we will look into that function shortly, but for now, let's import it:

import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

from utilities import visualize_classifier

Define sample input data with two-dimensional vectors and corresponding labels:

# Define sample input data

X = np.array([[3.1, 7.2], [4, 6.7], [2.9, 8], [5.1, 4.5], [6, 5], [5.6, 5],
              [3.3, 0.4], [3.9, 0.9], [2.8, 1], [0.5, 3.4], [1, 4], [0.6, 4.9]])

y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

We will train the classifier using this labeled data. Now create the logistic regression classifier object:

# Create the logistic regression classifier

classifier = linear_model.LogisticRegression(solver='liblinear', C=1)

Train the classifier using the data that we defined earlier:

# Train the classifier
classifier.fit(X, y)
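Before visualizing anything, you can sanity-check the trained model by predicting labels for the training points and inspecting the estimated class probabilities. This snippet is our own addition (the test point [4.5, 5.5] is hypothetical) and relies only on the packages already imported:

# Sanity check: predict labels for the training data
predictions = classifier.predict(X)
print('Predicted labels:', predictions)
print('Training accuracy:', np.mean(predictions == y))

# Estimated class probabilities for a hypothetical new point
print('Class probabilities:', classifier.predict_proba([[4.5, 5.5]]))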

Visualize the performance of the classifier by looking at the boundaries of the classes:

# Visualize the performance of the classifier
visualize_classifier(classifier, X, y)

We need to define this function before we can use it. We will be using this multiple times in this chapter, so it's better to define it in a separate file and import the function. This function is given in the utilities.py file provided to you.

Create a new Python file (utilities.py) and import the following packages:

import numpy as np
import matplotlib.pyplot as plt


Create the function definition by taking the classifier object, input data, and labels as input parameters:

def visualize_classifier(classifier, X, y):
    # Define the minimum and maximum values for X and Y
    # that will be used in the mesh grid
    min_x, max_x = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    min_y, max_y = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0

We just defined the minimum and maximum values in the X and Y directions that will be used in our mesh grid. This grid is basically a set of values that is used to evaluate the function, so that we can visualize the boundaries of the classes. Define the step size for the grid and create it using the minimum and maximum values:


    # Define the step size to use in plotting the mesh grid
    mesh_step_size = 0.01
    # Define the mesh grid of X and Y values
    x_vals, y_vals = np.meshgrid(np.arange(min_x, max_x, mesh_step_size),
                                 np.arange(min_y, max_y, mesh_step_size))
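If np.meshgrid is unfamiliar, the following tiny standalone example (our own illustration, separate from the function above) shows what it returns: two arrays that pair every X value with every Y value, forming the coordinates of a grid:

import numpy as np

# Two arrays of shape (3, 2): xs repeats the X values along the rows,
# ys repeats the Y values along the columns, so (xs[i, j], ys[i, j])
# enumerates every point of the grid
xs, ys = np.meshgrid([0, 1], [0, 1, 2])
print(xs)   # [[0 1]
            #  [0 1]
            #  [0 1]]
print(ys)   # [[0 0]
            #  [1 1]
            #  [2 2]]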

Run the classifier on all the points on the grid:

    # Run the classifier on the mesh grid
    output = classifier.predict(np.c_[x_vals.ravel(), y_vals.ravel()])
    # Reshape the output array
    output = output.reshape(x_vals.shape)

Create the figure, pick a color scheme, and overlay all the points:

    # Create a plot
    plt.figure()
    # Choose a color scheme for the plot
    plt.pcolormesh(x_vals, y_vals, output, cmap=plt.cm.gray)
    # Overlay the training points on the plot
    plt.scatter(X[:, 0], X[:, 1], c=y, s=75, edgecolors='black',
                linewidth=1, cmap=plt.cm.Paired)


Specify the boundaries of the plots using the minimum and maximum values, add the tick marks, and display the figure:

    # Specify the boundaries of the plot
    plt.xlim(x_vals.min(), x_vals.max())
    plt.ylim(y_vals.min(), y_vals.max())
    # Specify the ticks on the X and Y axes
    plt.xticks(np.arange(int(X[:, 0].min() - 1), int(X[:, 0].max() + 1), 1.0))
    plt.yticks(np.arange(int(X[:, 1].min() - 1), int(X[:, 1].max() + 1), 1.0))
    plt.show()

If you run the code, you will see a plot showing the four classes separated by the decision boundaries of the classifier.


If you change the value of C to 100 in the following line, you will see that the boundaries follow the training points more closely:

classifier = linear_model.LogisticRegression(solver='liblinear', C=100)

The reason is that C is the inverse of the regularization strength: a higher value imposes a larger penalty on misclassification of the training data, so the algorithm fits the boundaries more closely to the training points. You should be careful with this parameter, because if you increase it by a lot, the model will overfit the training data and won't generalize well.
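Rather than guessing a good value for C, one option is to compare a few candidates with cross-validation. The sketch below is our own addition, built on scikit-learn's cross_val_score and appended to logistic_regression.py so that X, y, and linear_model are already in scope; three folds are used because this toy dataset has only three samples per class:

from sklearn.model_selection import cross_val_score

# Compare a few candidate values of C using 3-fold cross-validation
for c in [1, 10, 100, 1000]:
    model = linear_model.LogisticRegression(solver='liblinear', C=c)
    scores = cross_val_score(model, X, y, cv=3)
    print('C =', c, '-> mean accuracy =', round(scores.mean(), 3))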

If you run the code with C set to 100, you will see a plot where the boundaries hug the training points more tightly.



