Tuesday, March 20, 2018

Preprocessing data

We deal with a lot of raw data in the real world. Machine learning algorithms expect data to be formatted in a certain way before they start the training process. In order to prepare the data for ingestion by machine learning algorithms, we have to preprocess it and convert it into the right format.

Let's see how to do it.

Create a new Python file named data_preprocessor.py and import the following packages:

import numpy as np
from sklearn import preprocessing

Let's define some sample data:

input_data = np.array([[ 6.2, -1.5,  2.3],
                       [-2.4,  9.1, -3.1],
                       [ 4.9,  0.7,  2.1],
                       [ 5.3, -5.4, -2.6]])

We will be talking about several different preprocessing techniques. Let's start with the most common ones:
  • Binarization
  • Mean removal
  • Scaling
  • Normalization

Let's take a look at each technique, starting with the first.


Binarization

This process is used when we want to convert our numerical values into boolean values.

Let's use a built-in method to binarize the input data, using 2.1 as the threshold value.

Add the following lines to the same Python file:

# Binarize data
data_binarized = preprocessing.Binarizer(threshold=2.1).transform(input_data)
print("\nBinarized data:\n", data_binarized)


If you run the code, you will see the following output:

Binarized data:
[[ 1.  0.  1.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 1.  0.  0.]]

As we can see here, all the values above 2.1 become 1. The remaining values become 0.
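As a quick sanity check, the same result can be reproduced with a plain NumPy comparison. This is just an illustrative sketch; the variable name below is not part of the original script:

# Manual equivalent of the binarization step: values strictly greater than
# the threshold become 1, everything else becomes 0 (illustrative sketch)
manual_binarized = (input_data > 2.1).astype(float)
print("\nManually binarized data:\n", manual_binarized)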

Mean removal

Mean removal is a common preprocessing technique in machine learning. It is usually useful to remove the mean from our feature vector so that each feature is centered on zero; this removes bias from the features.


Add the following lines to the same Python file as in the previous section:

# Print mean and standard deviation
print("\nBEFORE:")
print("Mean =", input_data.mean(axis=0))
print("Std deviation =", input_data.std(axis=0))


The preceding lines display the mean and standard deviation of the input data. Let's remove the mean:

# Remove mean and scale to unit standard deviation
data_scaled = preprocessing.scale(input_data)
print("\nAFTER:")
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis=0))


If you run the code, you will see the following printed on your Terminal:

BEFORE:
Mean = [ 3.5    0.725 -0.325]
Std deviation = [ 3.43874977  5.30583405  2.53216804]

AFTER:
Mean = [ -2.77555756e-17   1.11022302e-16   5.55111512e-17]
Std deviation = [ 1.  1.  1.]

As seen from the values obtained, the mean of each feature is now very close to 0 and the standard deviation is 1.
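Under the hood, preprocessing.scale subtracts each column's mean and divides by each column's standard deviation. Here is a minimal NumPy sketch of the same computation; the variable name is hypothetical and not part of the original script:

# Manual standardization: subtract the column mean, divide by the column std
manual_scaled = (input_data - input_data.mean(axis=0)) / input_data.std(axis=0)
print("\nManually scaled data:\n", manual_scaled)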


Scaling

In our feature vector, the values of different features can span very different ranges. So it becomes important to scale the features so that the machine learning algorithm trains on a level playing field.

We don't want any feature to be artificially large or small just because of the nature of the measurements.

Add the following lines to the same Python file:

# Min max scaling
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)


If you run the code, you will see the following printed on your Terminal:

Min max scaled data:
[[ 1.          0.26896552  1.        ]
 [ 0.          1.          0.        ]
 [ 0.84883721  0.42068966  0.96296296]
 [ 0.89534884  0.          0.09259259]]

As we can see, each column is scaled so that its maximum value becomes 1 and its minimum value becomes 0, with every other value placed proportionally in between.
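With feature_range=(0, 1), the scaler computes (x - column_min) / (column_max - column_min) for each column. Here is a minimal NumPy sketch of the same idea; the variable names are illustrative only:

# Manual min-max scaling, performed column by column
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)
manual_minmax = (input_data - col_min) / (col_max - col_min)
print("\nManual min max scaled data:\n", manual_minmax)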

Normalization

We use the process of normalization to modify the values in the feature vector so that we can measure them on a common scale.

In machine learning, we use many different forms of normalization. Some of the most common forms of normalization aim to modify the values so that they sum up to 1.

L1 normalization, which refers to Least Absolute Deviations, works by making sure that the sum of absolute values is 1 in each row.

L2 normalization, which refers to least squares, works by making sure that the sum of squares is 1 in each row.

In general, the L1 normalization technique is considered more robust than the L2 normalization technique, because it is more resistant to outliers in the data.

A lot of times, data tends to contain outliers and we cannot do anything about it. We want to use techniques that can safely and effectively ignore them during the calculations. If we are solving a problem where outliers are important, then maybe L2 normalization becomes a better choice.

Add the following lines to the same Python file:

# Normalize data
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("\nL1 normalized data:\n", data_normalized_l1)
print("\nL2 normalized data:\n", data_normalized_l2)


If you run the code, you will see the following printed on your Terminal:


L1 normalized data:
[[ 0.62       -0.15        0.23      ]
 [-0.16438356  0.62328767 -0.21232877]
 [ 0.63636364  0.09090909  0.27272727]
 [ 0.39849624 -0.40601504 -0.19548872]]

L2 normalized data:
[[ 0.91433892 -0.22121103  0.33919024]
 [-0.24221427  0.91839578 -0.3128601 ]
 [ 0.91132238  0.13018891  0.39056673]
 [ 0.66244825 -0.67494727 -0.32497461]]
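Both outputs can be reproduced by dividing each row by its L1 norm (the sum of absolute values) or its L2 norm (the square root of the sum of squares). The following is a minimal NumPy sketch; the variable names are illustrative only and not part of the original script:

# Manual L1 normalization: divide each row by the sum of its absolute values
manual_l1 = input_data / np.sum(np.abs(input_data), axis=1, keepdims=True)

# Manual L2 normalization: divide each row by the square root of the sum of its squares
manual_l2 = input_data / np.sqrt(np.sum(input_data ** 2, axis=1, keepdims=True))

print("\nManual L1 normalized data:\n", manual_l1)
print("\nManual L2 normalized data:\n", manual_l2)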