Feature Engineering
Imagine you're about to cook a meal. You've got a bag of groceries from the market: muddy carrots, whole chickens, unpeeled garlic, a block of cheese. Can you throw them straight into the pan? Absolutely not. You need to wash, peel, chop, measure, and prep everything first.
Feature engineering is the data prep step in machine learning. Raw data comes in messy: dates as text, categories as words, values on wildly different scales. Your model can't eat that. You need to transform it into clean, numerical inputs the model can actually learn from.
And here's the kicker: the quality of your ingredient prep often matters more than the recipe you pick. A simple model with great features will often beat a fancy model with terrible features.
Common feature engineering techniques
1. Encoding categories
Models need numbers, not words. If your data has a "color" column with values like "red", "blue", "green", you can't just plug those in. You encode them:
- One-Hot Encoding: Create a column for each category. Red → [1, 0, 0], Blue → [0, 1, 0], Green → [0, 0, 1]
- Label Encoding: Red → 0, Blue → 1, Green → 2 (careful: this implies an order, Red < Blue < Green, that the colors don't actually have!)
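Here's a minimal sketch of both encodings in plain Python, just to make the mapping concrete. The `colors` data and the helper names `one_hot` and `label_encode` are made up for illustration; in a real project you'd more likely reach for a library encoder than write your own.

```python
# Hypothetical sample column; the category order is fixed up front so
# every row gets encoded the same way.
colors = ["red", "blue", "green", "blue"]
categories = ["red", "blue", "green"]


def one_hot(value, categories):
    """Return a list with a 1 in the slot for `value` and 0 everywhere else."""
    return [1 if value == cat else 0 for cat in categories]


def label_encode(value, categories):
    """Return the integer position of `value` in `categories`."""
    return categories.index(value)


one_hot_rows = [one_hot(c, categories) for c in colors]
label_rows = [label_encode(c, categories) for c in colors]

print(one_hot_rows)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
print(label_rows)    # [0, 1, 2, 1]
```

Notice that the label-encoded version makes "green" (2) look twice as big as "blue" (1). That's the fake ordering the warning above is about, and it's why one-hot is usually the safer choice for categories with no natural order.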
2. Scaling numbers
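If "age" ranges from 0-100 and "salary" ranges from 20,000-200,000, many models will treat salary as more important just because the numbers are bigger. This mostly bites models that rely on distances or gradient descent (k-nearest neighbors, linear models, neural networks); tree-based models are far less sensitive to it. Scaling fixes this by putting everything on a similar range.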
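Two common ways to scale are min-max scaling (squash everything into 0 to 1) and standardization (shift to mean 0, standard deviation 1). The sketch below shows both in plain Python; the sample `ages` and `salaries` values and the function names are assumptions for illustration only.

```python
from statistics import mean, pstdev

# Hypothetical sample values on very different scales.
ages = [20, 40, 60]
salaries = [20_000, 110_000, 200_000]


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and rescale values to mean 0 and (population) std deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scaled_ages = min_max_scale(ages)
scaled_salaries = min_max_scale(salaries)
standardized_ages = standardize(ages)

print(scaled_ages)      # [0.0, 0.5, 1.0]
print(scaled_salaries)  # [0.0, 0.5, 1.0]
```

After scaling, age and salary sit on the same 0 to 1 range, so neither dominates just because of its units. One practical caution: compute the min, max, mean, and standard deviation on your training data only, then reuse those numbers on test data, so information doesn't leak from the test set.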
3. Creating new features
Sometimes the best features aren't in the raw data. From a timestamp, you can extract: day of week, hour, is_weekend, time_since_last_event. From an address, you can extract: zip code, distance to city center, neighborhood income level.
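As a rough sketch of the timestamp case, here's how those features could be pulled out with Python's built-in `datetime` module. The function name `timestamp_features`, the ISO-formatted input strings, and the choice to measure the gap in seconds are all assumptions made for this example.

```python
from datetime import datetime


def timestamp_features(ts, previous_ts=None):
    """Derive simple features from an ISO-8601 timestamp string."""
    dt = datetime.fromisoformat(ts)
    features = {
        "day_of_week": dt.weekday(),           # Monday = 0 ... Sunday = 6
        "hour": dt.hour,
        "is_weekend": int(dt.weekday() >= 5),  # Saturday or Sunday
    }
    if previous_ts is not None:
        prev = datetime.fromisoformat(previous_ts)
        # Time since the last event, measured in seconds.
        features["seconds_since_last_event"] = (dt - prev).total_seconds()
    return features


# 2024-03-09 is a Saturday; the previous event was 30 minutes earlier.
features = timestamp_features(
    "2024-03-09T14:30:00", previous_ts="2024-03-09T14:00:00"
)
print(features)
# {'day_of_week': 5, 'hour': 14, 'is_weekend': 1, 'seconds_since_last_event': 1800.0}
```

A single text column has turned into four numeric ones, each of which a model can use directly. The address-based features would need an external lookup (geocoding, census data), so they're left out of this sketch.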
Feature Engineering in Practice
Handling missing data
Real-world data is full of holes. A survey respondent skipped a question, a sensor glitched, a database migration dropped a column. You have options:
- Drop the row: simple, but you lose data
- Fill with the mean/median: works for numerical data, preserves dataset size
- Fill with the mode: works for categorical data
- Create a "missing" flag: sometimes the fact that data is missing is itself a useful signal
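The options above can be sketched in a few lines of plain Python. The `ages` and `cities` columns are invented sample data, with `None` standing in for a missing value; which strategy is right depends on your dataset, so treat this as an illustration rather than a recipe.

```python
from statistics import median, mode

# Hypothetical columns; None marks a missing value.
ages = [25, None, 40, 31, None]
cities = ["Paris", "Lyon", None, "Paris", "Paris"]

# Missing flag: record which entries were absent *before* filling them in.
age_missing = [int(a is None) for a in ages]

# Numerical column: fill gaps with the median of the observed values.
age_median = median(a for a in ages if a is not None)
ages_filled = [age_median if a is None else a for a in ages]

# Categorical column: fill gaps with the most common observed value.
city_mode = mode(c for c in cities if c is not None)
cities_filled = [city_mode if c is None else c for c in cities]

# Alternative: drop any row that has a missing value in either column.
complete_rows = [
    (a, c) for a, c in zip(ages, cities) if a is not None and c is not None
]

print(age_missing)    # [0, 1, 0, 0, 1]
print(ages_filled)    # [25, 31, 40, 31, 31]
print(cities_filled)  # ['Paris', 'Lyon', 'Paris', 'Paris', 'Paris']
print(complete_rows)  # [(25, 'Paris'), (31, 'Paris')]
```

Dropping rows took this toy dataset from five rows to two, which shows how quickly that option eats your data. Filling keeps all five, and the `age_missing` flag preserves the information that two of the ages were never actually observed.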
Feature selection: less is more
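More features aren't always better. Irrelevant features add noise and slow down training. Techniques like correlation analysis and feature importance help you keep only what matters. PCA takes a different route: instead of picking among the original features, it compresses many of them into a few combined components.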
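To make correlation analysis concrete, here's a small sketch that scores each feature by its Pearson correlation with the target and keeps the ones above a cutoff. The housing-style data, the `pearson` helper, and the 0.5 threshold are all assumptions chosen for the example, not recommended defaults.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:  # a constant column carries no signal
        return 0.0
    return cov / (sx * sy)


# Hypothetical features and target (price in thousands).
features = {
    "square_meters": [50, 80, 120, 200],
    "num_rooms": [2, 3, 4, 6],
    "door_number": [7, 3, 9, 4],
}
price = [150, 240, 360, 600]

threshold = 0.5  # arbitrary cutoff for this illustration

scores = {name: pearson(values, price) for name, values in features.items()}
selected = [name for name, score in scores.items() if abs(score) >= threshold]

print(selected)  # ['square_meters', 'num_rooms']
```

The door number gets dropped because it barely tracks the price. Keep in mind that Pearson correlation only catches linear relationships, and that two kept features can still be redundant with each other (square meters and room count here), so this is a first filter rather than the final word.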