Part 8: Artificial Intelligence for Data Preparation

Preprocessing the Data

Before raw data can be fed to a machine learning algorithm, it has to be converted into a form the algorithm can work with. This conversion is called data preprocessing. In other words, we preprocess the data before providing it to the machine learning algorithm.


Data preprocessing steps

Follow these steps to preprocess the data in Python:

Step 1: Importing the useful packages: In Python, the first step of preprocessing is to import the packages we need. It can be done as follows −

import numpy as np
from sklearn import preprocessing

Here we have used the following two packages:

NumPy − A general-purpose array-processing package designed to manipulate large multi-dimensional arrays efficiently while staying fast for small ones.
sklearn.preprocessing − This package provides many common utility functions and transformer classes that change raw feature vectors into a representation more suitable for machine learning algorithms.

Step 2: Defining sample data: After importing the packages, we need some sample data to apply the preprocessing techniques to. We will define the following sample data −

# The rows must be wrapped in an outer list to form a 4 x 3 array
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

Step 3: Applying a preprocessing technique: In this step, we apply one of the preprocessing techniques described in the following section.

Techniques for Data Preprocessing

The techniques for data preprocessing are described below:

Binarization

Binarization is the preprocessing technique used when we need to convert numerical values into Boolean values. We can use a built-in class to binarize the input data, for example with 0.5 as the threshold value, in the following way −

data_binarized = preprocessing.Binarizer(threshold = 0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)

Running the above code produces the following output. Every value greater than the threshold of 0.5 is converted to 1, and every value less than or equal to 0.5 is converted to 0.

Binarized data:
[[ 1. 0. 1.]
 [ 0. 1. 1.]
 [ 0. 0. 1.]
 [ 1. 1. 0.]]
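
As a rough cross-check of what the threshold does, the same result can be sketched with a plain NumPy comparison. This is only an illustration of the rule, not a description of how Binarizer is implemented; the names threshold and manual_binarized are introduced here for the example.

```python
import numpy as np

# Same sample data as defined in Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

threshold = 0.5

# Strictly greater than the threshold becomes 1.0, everything else 0.0
manual_binarized = (input_data > threshold).astype(float)
print("Manually binarized data:\n", manual_binarized)
```

Note that the value 0.5 in the third row equals the threshold and therefore maps to 0.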

Mean Removal

Mean removal is another very common preprocessing technique in machine learning. It subtracts the mean from each feature so that every feature is centered on zero, which removes bias from the features in the feature vector. Before applying mean removal to the sample data, we can display the mean and standard deviation of the input data with the Python code shown below −

print("Mean = ", input_data.mean(axis = 0))
print("Std deviation = ", input_data.std(axis = 0))

We will get the following output after running the above lines of code −

         Mean = [ 1.75       -1.275       2.2]
Std deviation = [ 2.71431391  4.20022321  4.69414529]

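Now, the code below standardizes the input data: it subtracts each column's mean and divides by its standard deviation, so that every feature ends up with zero mean and unit standard deviation −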

data_scaled = preprocessing.scale(input_data)
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis = 0))

We will get the following output after running the above lines of code. The means are zero up to floating-point rounding error −

         Mean = [ 1.11022302e-16 0.00000000e+00 0.00000000e+00]
Std deviation = [ 1.             1.             1.]
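
The arithmetic behind this standardization can be sketched by hand with NumPy. The sketch below assumes the population standard deviation (NumPy's default, ddof=0); the names col_mean, col_std and manual_scaled are introduced here for illustration.

```python
import numpy as np

# Same sample data as defined in Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Per-column (per-feature) statistics
col_mean = input_data.mean(axis=0)
col_std = input_data.std(axis=0)  # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation
manual_scaled = (input_data - col_mean) / col_std

print("Mean =", manual_scaled.mean(axis=0))
print("Std deviation =", manual_scaled.std(axis=0))
```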

Scaling

Scaling is another data preprocessing technique applied to feature vectors. It is needed because the raw values of different features can span very different ranges, and we do not want any feature to be artificially large or small. The following Python code applies min-max scaling to the input data, mapping each feature to the range 0 to 1 −

# Min max scaling
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)

We will get the following output after running the above lines of code −

Min max scaled data:
[[ 0.48648649  0.58252427  0.99122807]
 [ 0.          1.          0.81578947]
 [ 0.27027027  0.          1.        ]
 [ 1.          0.99029126  0.        ]]
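
For a feature range of 0 to 1, min-max scaling amounts to (x - min) / (max - min) computed per column. A minimal NumPy sketch of that formula follows; the names col_min, col_max and manual_minmax are introduced here for illustration.

```python
import numpy as np

# Same sample data as defined in Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Smallest and largest value of each column (feature)
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# Map each column linearly so its minimum becomes 0 and its maximum becomes 1
manual_minmax = (input_data - col_min) / (col_max - col_min)
print("Manually min max scaled data:\n", manual_minmax)
```

The formula divides by zero if a column is constant, so real data may need a guard for that case.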

Normalization

Normalization is another data preprocessing technique. It rescales the feature vectors so that they can be measured on a common scale. The following two types of normalization can be used in machine learning −

L1 Normalization

It is sometimes referred to as Least Absolute Deviations. This kind of normalization rescales the values so that the sum of the absolute values in each row is exactly 1. It can be implemented on the input data with the help of the following Python code −

# Normalize data
data_normalized_l1 = preprocessing.normalize(input_data, norm = 'l1')
print("\nL1 normalized data:\n", data_normalized_l1)

The above lines of code generate the following output −

L1 normalized data:
[[ 0.22105263  -0.2          0.57894737]
[ -0.2027027    0.32432432   0.47297297]
[  0.03571429  -0.56428571   0.4       ]
[  0.42142857   0.16428571  -0.41428571]]
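
To see why each row sums to 1 in absolute value, the L1 normalization can be sketched by hand: divide every row by the sum of its absolute values. The names row_l1 and manual_l1 are introduced here for illustration.

```python
import numpy as np

# Same sample data as defined in Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# L1 norm of each row: sum of absolute values, kept as a column for broadcasting
row_l1 = np.abs(input_data).sum(axis=1, keepdims=True)

# Divide each row by its own L1 norm
manual_l1 = input_data / row_l1

print("Manually L1 normalized data:\n", manual_l1)
print("Row sums of absolute values:", np.abs(manual_l1).sum(axis=1))
```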

L2 Normalization

It is sometimes referred to as least squares. This kind of normalization rescales the values so that the sum of the squares in each row is exactly 1. It can be implemented on the input data with the help of the following Python code −

# Normalize data
data_normalized_l2 = preprocessing.normalize(input_data, norm = 'l2')
print("\nL2 normalized data:\n", data_normalized_l2)

The above lines of code will generate the following output −

L2 normalized data:
[[ 0.33946114  -0.30713151   0.88906489]
[ -0.33325106   0.53320169   0.7775858 ]
[  0.05156558  -0.81473612   0.57753446]
[  0.68706914   0.26784051  -0.6754239 ]]
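
The L2 case can be sketched the same way: divide every row by its Euclidean length, the square root of the sum of its squares. The names row_l2 and manual_l2 are introduced here for illustration.

```python
import numpy as np

# Same sample data as defined in Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# L2 norm of each row: square root of the sum of squares
row_l2 = np.sqrt((input_data ** 2).sum(axis=1, keepdims=True))

# Divide each row by its own L2 norm
manual_l2 = input_data / row_l2

print("Manually L2 normalized data:\n", manual_l2)
print("Row sums of squares:", (manual_l2 ** 2).sum(axis=1))
```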

Labeling the Data

We already know that machine learning algorithms need data in a certain format. Another important requirement is that the data must be labelled properly before it is sent to the algorithm as input. In classification, for example, the samples carry labels, and those labels may be words, numbers, or something else. The machine learning functions in sklearn are generally used with numeric labels, so labels in any other form should be converted to numbers. This process of transforming word labels into numerical form is called label encoding.

Label encoding steps

Follow these steps for encoding the data labels in Python −

Step 1: Importing the useful packages

In Python, the first step is to import the packages we need. It can be done as follows −

import numpy as np
from sklearn import preprocessing

Step 2: Defining sample labels

After importing the packages, we need to define some sample labels so that we can create and train the label encoder. We will now define the following sample labels −

# Sample input labels
input_labels = ['red','black','red','green','black','yellow','white']

Step 3: Creating and training the label encoder object

In this step, we create the label encoder and train it on the sample labels. The following Python code will help in doing this −

# Creating the label encoder
encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

In an interactive session, the call to fit() returns the encoder itself, which is displayed as follows −

LabelEncoder()

Step 4: Checking the encoder by encoding a randomly ordered list

This step checks that the encoder works by encoding a list of labels in random order. The following Python code can be written to do this −

# encoding a set of labels
test_labels = ['green','red','black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)

The labels would get printed as follows −

Labels = ['green', 'red', 'black']

Now, we can get the list of encoded values, i.e., the word labels converted to numbers, as follows −

print("Encoded values =", list(encoded_values))

The encoded values would get printed as follows −

Encoded values = [1, 2, 0]

Step 5: Checking the encoder by decoding a random set of numbers

This step checks that the encoder works in the other direction by decoding a random set of numbers back into labels. The following Python code can be written to do this −

# decoding a set of values
encoded_values = [3,0,4,1]
decoded_list = encoder.inverse_transform(encoded_values)
print("\nEncoded values =", encoded_values)

Now, the encoded values would get printed as follows −

Encoded values = [3, 0, 4, 1]

Next, we print the decoded labels −

print("\nDecoded labels =", list(decoded_list))

Now, the decoded labels would get printed as follows −

Decoded labels = ['white', 'black', 'yellow', 'green']
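
The numbers above follow from how the encoder stores its labels: after fitting, the classes_ attribute holds the unique labels in sorted order, and each label's position in that array is its code. A short sketch that prints the mapping follows; the dictionary name mapping is introduced here for illustration.

```python
from sklearn import preprocessing

# Same sample labels as in Step 2
input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']

encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

# classes_ holds the unique labels in sorted order; the index is the code
for code, label in enumerate(encoder.classes_):
    print(label, "->", code)

# The same information as a plain dictionary
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
```

This is why 'green' encodes to 1 and the value 3 decodes to 'white'.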

Labeled vs. Unlabeled Data

Unlabeled data mainly consists of samples of natural or human-created objects that can easily be obtained from the world. Examples include audio, video, photos, and news articles.
Labeled data, on the other hand, takes a set of unlabeled data and augments each piece of it with a meaningful tag, label, or class. For example, a photo can be labeled according to its content, i.e., whether it is a photo of a boy, a girl, an animal, or anything else. Labeling the data requires human expertise or judgment about each piece of unlabeled data.
In many scenarios unlabeled data is plentiful and easily obtained, while labeled data requires a human expert to annotate it. Semi-supervised learning attempts to combine labeled and unlabeled data to build better models.
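
As a small illustration of the semi-supervised idea, the sketch below uses scikit-learn's LabelPropagation, which is only one of several possible approaches and is not prescribed by this tutorial. The toy data, the gamma value, and the variable names are all assumptions made up for the example. Unlabeled samples are marked with -1.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Hypothetical toy data: two well-separated groups of points on a line
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Only one point per group is labeled; -1 marks the unlabeled samples
y = np.array([0, -1, -1, 1, -1, -1])

# The RBF kernel spreads each known label to nearby unlabeled points
model = LabelPropagation(kernel='rbf', gamma=20)
model.fit(X, y)

# transduction_ holds the label assigned to every training sample
inferred = model.transduction_
print("Inferred labels =", inferred.tolist())
```

With groups this far apart, each unlabeled point should take the label of the labeled point in its own group; real data is rarely this clean.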
