Practical ML · 7 min read

Feature Engineering

Turn raw data into inputs a model can actually use
Raw data: Messy · missing values · mixed formats
Engineered features: Clean · numerical · model-ready
Impact: Often matters more than the algorithm choice

Imagine you're about to cook a meal. You've got a bag of groceries from the market: muddy carrots, whole chickens, unpeeled garlic, a block of cheese. Can you throw them straight into the pan? Absolutely not. You need to wash, peel, chop, measure, and prep everything first.

Feature engineering is the data prep step in machine learning. Raw data comes in messy: dates as text, categories as words, values on wildly different scales. Your model can't eat that. You need to transform it into clean, numerical inputs the model can actually learn from.

And here's the kicker: the quality of your ingredient prep often matters more than the recipe you pick. A simple model with great features will beat a fancy model with terrible features almost every time.

Common feature engineering techniques

1. Encoding categories

Models need numbers, not words. If your data has a "color" column with values like "red", "blue", "green", you can't just plug those in. You encode them:

  • One-Hot Encoding: Create a column for each category. Red → [1, 0, 0], Blue → [0, 1, 0], Green → [0, 0, 1]
  • Label Encoding: Red → 0, Blue → 1, Green → 2 (careful: this implies an order!)
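
To make the two encodings concrete, here is a minimal sketch using pandas. The toy `colors` table and the fixed category order are assumptions made for illustration; they are not part of any real dataset.

```python
import pandas as pd

# Hypothetical toy column; the values mirror the "color" example above.
colors = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# Fix the category order so the encodings are predictable.
color_type = pd.CategoricalDtype(categories=["red", "blue", "green"])
color = colors["color"].astype(color_type)

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(color, prefix="color", dtype=int)

# Label encoding: one integer per category, in the same fixed order.
# The integers suggest red < blue < green, which is meaningless for colors.
labels = color.cat.codes

print(one_hot)
print(labels.tolist())
```

One-hot encoding avoids the fake ordering at the cost of extra columns, which is why label encoding is usually reserved for categories that really are ordered (small, medium, large).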

2. Scaling numbers

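If "age" ranges from 0-100 and "salary" ranges from 20,000-200,000, many models will effectively treat salary as more important just because its numbers are bigger. This hits distance-based models (such as k-nearest neighbors) and models trained with gradient descent hardest; tree-based models are largely unaffected. Scaling fixes this by putting every feature on a similar range.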
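
A small sketch of why this matters for distance-based models. It uses the age and salary ranges from the paragraph above; the two people `a` and `b` and the helper names `salary_share` and `min_max` are made up for illustration.

```python
import numpy as np

# Feature ranges taken from the text; the two people below are hypothetical.
age_min, age_max = 0.0, 100.0
salary_min, salary_max = 20_000.0, 200_000.0

a = np.array([25.0, 30_000.0])  # (age, salary)
b = np.array([50.0, 32_000.0])  # twice the age, nearly the same salary


def salary_share(p, q):
    """Fraction of the squared Euclidean distance that comes from salary."""
    sq = (p - q) ** 2
    return float(sq[1] / sq.sum())


def min_max(p):
    """Rescale (age, salary) to the 0-1 range using the known bounds."""
    lo = np.array([age_min, salary_min])
    hi = np.array([age_max, salary_max])
    return (p - lo) / (hi - lo)


raw_share = salary_share(a, b)                       # close to 1.0
scaled_share = salary_share(min_max(a), min_max(b))  # close to 0.0

print(f"salary share of distance, raw:    {raw_share:.4f}")
print(f"salary share of distance, scaled: {scaled_share:.4f}")
```

In raw units the 25-year age gap is invisible next to a 2,000 salary gap. After min-max scaling, the distance reflects that the age difference is the larger one relative to each feature's range.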

3. Creating new features

Sometimes the best features aren't in the raw data. From a timestamp, you can extract: day of week, hour, is_weekend, time_since_last_event. From an address, you can extract: zip code, distance to city center, neighborhood income level.
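
The timestamp features listed above can be sketched in a few lines of pandas. The `events` table and its column names are hypothetical, chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical event log; column names are illustrative.
events = pd.DataFrame({
    "user": ["a", "a", "b", "a"],
    "timestamp": pd.to_datetime([
        "2024-03-01 09:30",  # Friday
        "2024-03-02 14:00",  # Saturday
        "2024-03-03 20:15",  # Sunday
        "2024-03-04 08:00",  # Monday
    ]),
})

events["day_of_week"] = events["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
events["hour"] = events["timestamp"].dt.hour
events["is_weekend"] = events["day_of_week"] >= 5

# Hours since the same user's previous event (NaN for a user's first event).
events = events.sort_values(["user", "timestamp"])
events["hours_since_last_event"] = (
    events.groupby("user")["timestamp"].diff().dt.total_seconds() / 3600
)

print(events.to_string())
```

Address-based features such as distance to the city center or neighborhood income usually need an external lookup (geocoding, census data), so they are left out of this sketch.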

Feature Engineering in Practice

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Raw data: messy and mixed
df = pd.DataFrame({
    'age': [25, 45, 35, 50],
    'salary': [30000, 80000, 55000, 120000],
    'city': ['NYC', 'LA', 'NYC', 'Chicago'],
    'join_date': ['2020-01-15', '2019-06-20', '2021-03-10', '2018-11-05']
})

# 1. Scale numerical features to zero mean and unit variance
scaler = StandardScaler()
df[['age_scaled', 'salary_scaled']] = scaler.fit_transform(df[['age', 'salary']])

# 2. One-hot encode categories (pandas 2.x produces True/False columns)
df = pd.get_dummies(df, columns=['city'])

# 3. Engineer new features from date, measured against a fixed reference date
df['join_date'] = pd.to_datetime(df['join_date'])
df['tenure_days'] = (pd.Timestamp('2024-01-01') - df['join_date']).dt.days

# Round for display only
print(df[['age_scaled', 'salary_scaled', 'city_NYC', 'tenure_days']].round(2).to_string())
Output
   age_scaled  salary_scaled  city_NYC  tenure_days
0       -1.43          -1.24      True         1447
1        0.65           0.26     False         1656
2       -0.39          -0.49      True         1027
3        1.17           1.47     False         1883

Handling missing data

Real-world data is full of holes. A survey respondent skipped a question, a sensor glitched, a database migration dropped a column. You have options:

  • Drop the row: simple, but you lose data
  • Fill with the mean/median: works for numerical data, preserves dataset size
  • Fill with the mode: works for categorical data
  • Create a "missing" flag: sometimes the fact that data is missing is itself a useful signal
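
The four options above might look like this in pandas. The `survey` table is a made-up example, and the median is used for the numerical fill because one income value is an outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with gaps.
survey = pd.DataFrame({
    "income": [52_000.0, np.nan, 61_000.0, 1_000_000.0, np.nan],
    "city": ["NYC", "LA", None, "NYC", "NYC"],
})

# Drop the row: only rows with no missing value survive (2 of 5 here).
dropped = survey.dropna()

filled = survey.copy()

# Create a "missing" flag before filling, so the information is not lost.
filled["income_missing"] = filled["income"].isna()

# Fill numerical gaps with the median, which resists the 1,000,000 outlier.
filled["income"] = filled["income"].fillna(filled["income"].median())

# Fill categorical gaps with the most frequent value (the mode).
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(filled.to_string())
```

As with scaling, the fill values should be computed from training data only and then reused on new data.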

Feature selection: less is more

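More features aren't always better. Irrelevant features add noise and slow down training. Techniques like correlation analysis and model-based feature importance help you keep only what matters, while PCA takes a different route and compresses many features into a few combined components.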
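
As one illustration, a correlation-based filter can be sketched on synthetic data. Everything here is an assumption made for the example: the column names, the random data, and the two thresholds (0.2 for relevance, 0.95 for redundancy) are arbitrary choices, not recommended defaults.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: the target depends on "signal" only. "signal_copy" is a
# near-duplicate of "signal", and "noise" is unrelated to the target.
signal = rng.normal(size=n)
data = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + rng.normal(scale=0.01, size=n),
    "noise": rng.normal(size=n),
})
target = pd.Series(3 * signal + rng.normal(scale=0.5, size=n))

# Relevance: keep features whose absolute correlation with the target
# exceeds an (arbitrary) threshold.
relevance = data.corrwith(target).abs()
keep = relevance[relevance > 0.2].index.tolist()

# Redundancy: among the kept features, skip any that is almost perfectly
# correlated with one already selected.
corr = data[keep].corr().abs()
selected = []
for name in keep:
    if all(corr.loc[name, other] < 0.95 for other in selected):
        selected.append(name)

print(relevance.round(3))
print(selected)
```

Plain correlation only captures linear relationships with the target, so a filter like this is a first pass rather than a final answer.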

Note: "Applied machine learning is basically feature engineering," a line widely attributed to Andrew Ng. Choosing the right algorithm matters, but the features you feed it often matter more. Expect data prep and feature engineering to take up most of your time; 80% is a commonly quoted rule of thumb.

Key Metrics

  • One-Hot Encoding: O(n × c), n rows × c categories. Watch out: 1000 categories = 1000 new columns
  • Standard Scaling: O(n), one pass for mean/std. Always fit on training data only, transform both
  • Missing Value Imputation: O(n), simple fill is fast. Advanced methods (KNN imputation) are slower but better
  • New Feature Creation: Domain-dependent, manual + creative. This is where domain expertise shines
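
The "fit on training data only" rule for scaling is easy to get wrong, so here is a minimal sketch with scikit-learn's StandardScaler. The train/test split and the values in it are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical split: four training rows and two unseen test rows of "age".
X_train = np.array([[25.0], [45.0], [35.0], [50.0]])
X_test = np.array([[30.0], [70.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from training rows only.
X_train_scaled = scaler.fit_transform(X_train)

# transform reuses those training statistics; the test rows are never fitted.
X_test_scaled = scaler.transform(X_test)

print("training mean:", scaler.mean_[0])
print("scaled test rows:", X_test_scaled.ravel().round(2))
```

Fitting the scaler on the full dataset would let information about the test rows leak into training, which makes evaluation results look better than they really are.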

Quick check

Why can't you feed the string 'red' directly into most ML models?