Preprocessing data
We deal with a lot of raw data in the real world. Machine learning algorithms expect data to be formatted in a certain way before they start the training process. In order to prepare the data for ingestion by machine learning algorithms, we have to preprocess it and convert it into the right format. Let's see how to do it.
Create a new Python file called data_preprocessor.py and import the following packages:
import numpy as np
from sklearn import preprocessing
Let's define some sample data:
input_data = np.array([[6.2, -1.5, 2.3],
                       [-2.4, 9.1, -3.1],
                       [4.9, 0.7, 2.1],
                       [5.3, -5.4, -2.6]])
We will be talking about several different preprocessing techniques:
- Binarization
- Mean removal
- Scaling
- Normalization
Let's take a look at each technique, starting with the first.
Binarization
This process is used when we want to convert our numerical values into Boolean values. Let's use an inbuilt method to binarize the input data, using 2.1 as the threshold value. Add the following lines to the same Python file:
# Binarize data
data_binarized = preprocessing.Binarizer(threshold=2.1).transform(input_data)
print("\nBinarized data:\n", data_binarized)
If you run the code, you will see the following printed on your Terminal:
Binarized data:
[[ 1. 0. 1.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 1. 0. 0.]]
As we can see here, all the values above 2.1 become 1.
The remaining values become 0.
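Under the hood, this is just a thresholding operation. As a quick sanity check, here is a minimal NumPy equivalent (a sketch, assuming input_data as defined above). Note that the comparison is strict, which is why the value 2.1 itself maps to 0:

# Manual equivalent of Binarizer: strict comparison against the threshold
manual_binarized = (input_data > 2.1).astype(float)
print("\nManual binarization:\n", manual_binarized)

This prints the same matrix as the Binarizer output above.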
Mean removal
Removing the mean is a common preprocessing technique used in machine learning. It's usually useful to remove the mean from our feature vector, so that each feature is centered on zero; we do this in order to remove bias from the features. Add the following lines to the same Python file as in the previous section:
# Print mean and standard deviation
print("\nBEFORE:")
print("Mean =", input_data.mean(axis=0))
print("Std deviation =", input_data.std(axis=0))
The preceding lines display the mean and standard deviation of the input data. Let's remove the mean:
# Remove the mean and scale to unit variance (standardization)
data_scaled = preprocessing.scale(input_data)
print("\nAFTER:")
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis=0))
If you run the code, you will see the following printed on your Terminal:
BEFORE:
Mean = [ 3.5 0.725 -0.325 ]
Std deviation = [ 3.43874977 5.30583405 2.53216804 ]
AFTER:
Mean = [ -2.77555756e-17 1.11022302e-16 5.55111512e-17]
Std deviation = [ 1. 1. 1.]
As seen from the values obtained, the mean is now very close to 0 (the tiny e-17 values are just floating-point round-off) and the standard deviation is 1.
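What preprocessing.scale actually does is standardize each column: it subtracts the column mean and then divides by the column standard deviation. Here is a minimal NumPy equivalent, a sketch assuming input_data from above:

# Manual standardization: center each column, then scale it to unit variance
manual_scaled = (input_data - input_data.mean(axis=0)) / input_data.std(axis=0)
print("\nManual standardization:\n", manual_scaled)

The result matches data_scaled from the previous snippet.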
Scaling
In our feature vector, the values of each feature can vary over a wide range. It becomes important to scale those features so that they form a level playing field for the machine learning algorithm to train on. We don't want any feature to be artificially large or small just because of the nature of its measurements. Add the following lines to the same Python file:
# Min max scaling
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)
If you run the code, you will see the following printed on your Terminal:
Min max scaled data:
[[ 1.          0.26896552  1.        ]
 [ 0.          1.          0.        ]
 [ 0.84883721  0.42068966  0.96296296]
 [ 0.89534884  0.          0.09259259]]
Each column is scaled independently so that its maximum value becomes 1 and its minimum becomes 0, with all the other values placed proportionally in between.
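The formula behind this is straightforward: for each column, subtract the column minimum and divide by the column range. A minimal NumPy equivalent, assuming input_data from above:

# Manual min-max scaling: map each column's minimum to 0 and its maximum to 1
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)
manual_minmax = (input_data - col_min) / (col_max - col_min)
print("\nManual min max scaling:\n", manual_minmax)

One advantage of keeping the fitted MinMaxScaler object instead is that you can later apply exactly the same transformation to new data by calling data_scaler_minmax.transform() on it.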
Normalization
We use the process of normalization to modify the values in the feature vector so that we can measure them on a common scale. In machine learning, we use many different forms of normalization. Some of the most common forms aim to modify the values so that they sum up to 1.
L1 normalization, which refers to Least Absolute Deviations, works by making sure that the sum of absolute values is 1 in each row. L2 normalization, which refers to least squares, works by making sure that the sum of squares is 1 in each row.
In general, the L1 normalization technique is considered more robust than the L2 technique because it is resistant to outliers in the data. Data often contains outliers and we cannot do anything about that, so we want to use techniques that can safely and effectively ignore them during the calculations. If we are solving a problem where outliers are important, then L2 normalization may be the better choice. Add the following lines to the same Python file:
# Normalize data
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("\nL1 normalized data:\n", data_normalized_l1)
print("\nL2 normalized data:\n", data_normalized_l2)
If you run the code, you will see the following printed on your Terminal:
L1 normalized data:
[[ 0.62       -0.15        0.23      ]
 [-0.16438356  0.62328767 -0.21232877]
 [ 0.63636364  0.09090909  0.27272727]
 [ 0.39849624 -0.40601504 -0.19548872]]
L2 normalized data:
[[ 0.91433892 -0.22121103  0.33919024]
 [-0.24221427  0.91839578 -0.3128601 ]
 [ 0.91132238  0.13018891  0.39056673]
 [ 0.66244825 -0.67494727 -0.32497461]]
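To make the arithmetic concrete, here is a minimal NumPy sketch of the same row-wise operation, assuming input_data from above: each row is divided by its L1 norm (the sum of absolute values) or its L2 norm (the square root of the sum of squares):

# Manual row-wise normalization
l1_norms = np.abs(input_data).sum(axis=1, keepdims=True)          # sum of absolute values per row
l2_norms = np.sqrt((input_data ** 2).sum(axis=1, keepdims=True))  # Euclidean length per row
manual_l1 = input_data / l1_norms
manual_l2 = input_data / l2_norms
print("\nManual L1 normalization:\n", manual_l1)
print("\nManual L2 normalization:\n", manual_l2)

Every row of manual_l1 has absolute values summing to 1, and every row of manual_l2 has unit Euclidean length, matching the sklearn output above.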