Logistic Regression classifier
Logistic regression
is a technique used to explain the relationship between input variables and
output variables. The input variables are assumed to be independent, and the
output variable is referred to as the dependent variable. The dependent
variable can take only a fixed set of values, and these values correspond to
the classes of the classification problem. Our goal is to identify the
relationship between the independent variables and the dependent variable by
estimating class probabilities using a logistic function.
This logistic function is a sigmoid curve that's used to build the function
with various parameters. It is very closely related to generalized linear
model analysis, where we try to fit a line to a set of points so as to
minimize the error. Instead of using linear regression, we use logistic
regression.
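To make the logistic function concrete, here is a minimal sketch (an illustrative addition, not part of the example that follows) showing how the sigmoid squashes any real-valued score into a probability between 0 and 1:
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5, the decision boundary
print(sigmoid(4.0))    # ~0.982, a strongly positive score
print(sigmoid(-4.0))   # ~0.018, a strongly negative score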
Logistic regression by itself is not actually a classification technique, but
it is used in this way to facilitate classification. It is used very commonly
in machine learning because of its simplicity.
Let's see how to build a classifier using logistic regression. Make sure you
have the Tkinter package installed on your system before you proceed
(matplotlib can use it as its plotting backend). If you don't, you can find it
at: https://docs.python.org/3/library/tkinter.html
Create a new Python file (logistic_regression.py) and import the following
packages. We will be importing a function from the file utilities.py. We will
look into that function very soon, but for now, let's import it:
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
from utilities import visualize_classifier
Define sample input data with
two-dimensional vectors and corresponding labels:
# Define sample input data
X = np.array([[3.1, 7.2], [4, 6.7], [2.9, 8], [5.1, 4.5], [6, 5], [5.6, 5],
              [3.3, 0.4], [3.9, 0.9], [2.8, 1], [0.5, 3.4], [1, 4], [0.6, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
We will train the classifier using this
labeled data. Now create the logistic regression classifier object:
# Create the logistic regression classifier
classifier = linear_model.LogisticRegression(solver='liblinear', C=1)
Train the classifier using the data that
we defined earlier:
# Train the classifier
classifier.fit(X, y)
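Before visualizing anything, it can help to sanity-check the trained model directly. The following snippet is an illustrative addition (the point [4.5, 5.5] is made up): predict returns the most likely class for a sample, and predict_proba returns the per-class probabilities computed by the logistic function:
# Predict the class of a new (hypothetical) point
print(classifier.predict([[4.5, 5.5]]))

# Look at the per-class probabilities for the same point
print(classifier.predict_proba([[4.5, 5.5]]))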
Visualize the performance of the
classifier by looking at the boundaries of the classes:
# Visualize the performance of the classifier
visualize_classifier(classifier, X, y)
We need to define this function before we can use it. We will be using it
multiple times in this chapter, so it's better to define it in a separate file
and import the function. This function is given in the utilities.py file
provided to you.
Create a new Python file and import the
following packages:
import numpy as np
import matplotlib.pyplot as plt
Create the function definition by taking
the classifier object, input data, and labels as input parameters:
def visualize_classifier(classifier, X, y):
    # Define the minimum and maximum values for X and Y
    # that will be used in the mesh grid
    min_x, max_x = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
    min_y, max_y = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0
We also defined the minimum and maximum values in the X and Y directions that
will be used in our mesh grid. This grid is basically a set of points at which
we evaluate the classifier, so that we can visualize the boundaries of the
classes. Define the step size for the grid and create it using the minimum and
maximum values:
    # Define the step size to use in plotting the mesh grid
    mesh_step_size = 0.01

    # Define the mesh grid of X and Y values
    x_vals, y_vals = np.meshgrid(np.arange(min_x, max_x, mesh_step_size),
                                 np.arange(min_y, max_y, mesh_step_size))
Run the classifier on all the points on
the grid:
    # Run the classifier on the mesh grid
    output = classifier.predict(np.c_[x_vals.ravel(), y_vals.ravel()])

    # Reshape the output array
    output = output.reshape(x_vals.shape)
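If the ravel and np.c_ idiom above looks opaque, here is a small standalone illustration (with made-up toy values, separate from the classifier code): meshgrid builds coordinate matrices, ravel flattens them, and np.c_ pairs them up column-wise into the (n_samples, 2) shape that predict expects:
import numpy as np

a, b = np.meshgrid(np.arange(0, 3), np.arange(0, 2))
print(a.shape)                      # (2, 3): one row of X coordinates per Y value
print(np.c_[a.ravel(), b.ravel()])  # six (x, y) pairs, one per grid point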
Create the figure, pick a color scheme,
and overlay all the points:
    # Create a plot
    plt.figure()

    # Choose a color scheme for the plot
    plt.pcolormesh(x_vals, y_vals, output, cmap=plt.cm.gray)

    # Overlay the training points on the plot
    plt.scatter(X[:, 0], X[:, 1], c=y, s=75, edgecolors='black',
                linewidth=1, cmap=plt.cm.Paired)
Specify the boundaries of the plots using
the minimum and maximum values, add the tick marks, and display the figure:
    # Specify the boundaries of the plot
    plt.xlim(x_vals.min(), x_vals.max())
    plt.ylim(y_vals.min(), y_vals.max())

    # Specify the ticks on the X and Y axes
    plt.xticks((np.arange(int(X[:, 0].min() - 1), int(X[:, 0].max() + 1), 1.0)))
    plt.yticks((np.arange(int(X[:, 1].min() - 1), int(X[:, 1].max() + 1), 1.0)))

    plt.show()
If you run the code, you will see the following
screenshot:
If you change the value of C to 100 in the following line, you will
see that the boundaries become more accurate:
classifier = linear_model.LogisticRegression(solver='liblinear', C=100)
The reason is that C imposes a penalty on misclassification, so a higher value
makes the algorithm fit the training data more closely. You should be careful
with this parameter, because if you increase it by a lot, the model will
overfit to the training data and won't generalize well.
If you run the code with C set to 100, you will see the following screenshot: