Think of Module 1 as gathering your ingredients. Module 2 is where you wash, chop, and season them.
You could have the most expensive “oven” (a high-end Machine Learning model), but if you put rotten ingredients inside, you’re going to get a terrible meal. In the industry, we call this GIGO: Garbage In, Garbage Out.

2.1 Why Clean Data? (The “GIGO” Rule)
In a perfect world, data is neat. In the real world, data is “noisy.” It has:
- Missing Values: People skip questions on surveys; sensors lose battery.
- Incorrect Formats: Dates written as “Jan 1st” vs “01-01-2024.”
- Outliers: A “human age” listed as 250 years old.
If you don’t fix these, your model will get confused. Cleaning your data isn’t a chore; it’s actually the most important step in the entire workflow.
2.2 Handling Missing Values: Mind the Gaps
When you see a NaN (Not a Number) in your dataset, you have two main choices:
- The “Delete” Strategy (Dropping): If a row is missing almost all its info, just delete it. But be careful: if you delete too much, you’ll have no data left to learn from!
- The “Fill-in” Strategy (Imputation): We fill the gaps with a smart guess.
  - Mean: Use the average (best for normal numbers).
  - Median: Use the middle value (best if you have wild outliers).
  - Mode: Use the most frequent value (best for categories like “Color”).
Python Tip:

```python
# To fill missing ages with the average age
df['age'] = df['age'].fillna(df['age'].mean())
```
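To make both strategies concrete, here is a minimal sketch on a tiny, made-up DataFrame (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataset with gaps (hypothetical values for illustration)
df = pd.DataFrame({
    "age": [25, 30, np.nan, 45, 22],
    "salary": [40000, np.nan, 52000, 300000, 41000],  # one wild outlier
    "color": ["red", "blue", "red", np.nan, "red"],
})

# Median is robust to the 300,000 outlier
df["salary"] = df["salary"].fillna(df["salary"].median())

# Mode for a categorical column (mode() returns a Series, so take [0])
df["color"] = df["color"].fillna(df["color"].mode()[0])

# "Delete" strategy for anything still missing
df = df.dropna()
```

Notice the order of operations: impute what you can first, then drop what's left, so you throw away as few rows as possible.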
2.3 Encoding Categorical Data: Teaching Machines to Read
Computers are essentially giant calculators. They are amazing at math, but they don’t understand what “Red” or “New York” means. We have to turn words into numbers.
- Label Encoding: Assigns a number to each category (e.g., Apple=1, Banana=2).
  - The Trap: Your model might think Banana (2) is “greater than” Apple (1). Use this only for things with a natural order (like Small, Medium, Large).
- One-Hot Encoding: Creates new columns for each category with 1s and 0s.
  - Example: Instead of a “Color” column, you get an “Is_Red” column and an “Is_Blue” column. This is the safest way to handle categories.
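Here is a small sketch of both encodings on an invented DataFrame. One practical wrinkle: scikit-learn’s LabelEncoder assigns numbers alphabetically, so for a genuinely ordered column it is often safer to map the order yourself, as below (column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["Small", "Large", "Medium", "Small"],  # has a natural order
    "color": ["Red", "Blue", "Red", "Blue"],        # no natural order
})

# Label encoding for the ordered column: spell out the order explicitly
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding for the unordered column
df = pd.get_dummies(df, columns=["color"])
```

After `get_dummies`, the original `color` column is replaced by `color_Blue` and `color_Red` indicator columns, so no fake ordering sneaks in.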
2.4 Feature Scaling: Leveling the Playing Field
Imagine a dataset with two features: Age (0–100) and Annual Salary (0–200,000).
Because the salary numbers are so much bigger, the model might treat salary as “more important” than age. Scaling fixes this by putting all features on a comparable scale.
- Normalization (Min-Max): Squashes everything between 0 and 1.
- Standardization: Centers the data so the average is 0 and the standard deviation is 1.
The Pro-Tip: Use StandardScaler from the Scikit-learn library; it’s the industry standard for making sure your features play nice together.
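A quick side-by-side of the two scalers on toy data (the values are invented), using Scikit-learn’s MinMaxScaler and StandardScaler:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "age": [20, 40, 60, 80],
    "salary": [30000, 60000, 90000, 120000],
})

# Min-Max: squashes each column into the [0, 1] range
df[["age_mm", "salary_mm"]] = MinMaxScaler().fit_transform(df[["age", "salary"]])

# Standardization: each column ends up with mean 0 and standard deviation 1
df[["age_std", "salary_std"]] = StandardScaler().fit_transform(df[["age", "salary"]])
```

Note that each column is scaled independently, so after scaling, age and salary finally live on the same footing.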
2.5 Visualization: Seeing is Believing
Before you start training a model, you need to look at your data. We use Matplotlib and Seaborn to turn rows of numbers into pictures.
- Histograms: Great for seeing the “shape” of your data (e.g., Are most of our customers young or old?).
- Scatter Plots: Perfect for seeing relationships. (e.g., Does “Study Time” actually relate to “Test Score”?).
- Box Plots: The best tool for spotting outliers. If you see a dot way outside the “whiskers” of the box, that’s a data point you need to investigate.
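Box plots flag outliers using the IQR (interquartile range) rule: any point more than 1.5 × IQR beyond the quartiles lands outside the whiskers. Here is a small sketch with made-up ages, including the impossible “250” from earlier:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs anywhere
import matplotlib.pyplot as plt

# Hypothetical ages with one impossible value
ages = pd.Series([22, 25, 31, 29, 35, 41, 38, 250])

# The IQR rule behind a box plot's whiskers
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
print(outliers.tolist())  # → [250]

# The same point shows up as the lone dot outside the whiskers
fig, ax = plt.subplots()
ax.boxplot(ages)
fig.savefig("ages_boxplot.png")
plt.close(fig)
```

The plot and the arithmetic agree: 250 is the data point you need to investigate.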
🛠 Hands-on Lab: Cleaning the Titanic (Continued)
Now, let’s take that Titanic data from Module 1 and actually prep it for a model.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# df is the Titanic DataFrame you loaded in Module 1

# 1. Handle Missing Values
# The 'Embarked' column has a few missing values. Let's fill them with the most common port.
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# 2. Encode Categorical Data
# Turn 'Sex' into 1s and 0s using One-Hot Encoding
df = pd.get_dummies(df, columns=['Sex'])

# 3. Visualize
# Let's see who survived based on their ticket class
sns.countplot(x='Survived', hue='Pclass', data=df)
plt.show()

# 4. Scaling
scaler = StandardScaler()
df['Fare_Scaled'] = scaler.fit_transform(df[['Fare']])
```
Module 2 Summary
You’ve just learned the “unsexy” but vital part of AI. You know how to fill holes in data, translate words for the computer, scale numbers so they don’t fight, and visualize the results.
You’re now ready for Module 3: where we finally start building the actual Machine Learning models!
