Features & Labels
Imagine you're looking at a recipe card. On one side, you've got the ingredients: flour, sugar, eggs, butter, vanilla extract. On the other side, you've got the dish name: chocolate cake.
In machine learning, the ingredients are called features and the dish name is called the label.
Features = the information the model uses to make a prediction (the inputs).
Label = the thing the model is trying to predict (the output).
That's it. Every supervised ML problem boils down to: "Given these features, predict this label."
A concrete example
Say you're predicting whether a student will pass or fail an exam. Here's your data:
| Hours Studied | Hours Slept | Attended Review? | Result |
|---|---|---|---|
| 6 | 8 | Yes | Pass |
| 2 | 4 | No | Fail |
| 7 | 7 | Yes | Pass |
| 1 | 5 | No | Fail |
The first three columns (Hours Studied, Hours Slept, Attended Review?) are features. They're the clues.
The last column (Result) is the label. That's the answer the model learns to predict.
In code, features are usually called X (capital, because it's a matrix of many columns), and labels are called y (lowercase, because it's a single column).
Extracting Features & Labels
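Here is a minimal sketch of that split, using the exam table above and plain Python lists. The dictionary keys and the `split_features_labels` helper are illustrative names, not part of any library. In practice a data-frame library would usually handle this step.

```python
# Each row of the exam table as a dictionary (key names are illustrative).
rows = [
    {"hours_studied": 6, "hours_slept": 8, "attended_review": "Yes", "result": "Pass"},
    {"hours_studied": 2, "hours_slept": 4, "attended_review": "No", "result": "Fail"},
    {"hours_studied": 7, "hours_slept": 7, "attended_review": "Yes", "result": "Pass"},
    {"hours_studied": 1, "hours_slept": 5, "attended_review": "No", "result": "Fail"},
]

FEATURE_COLUMNS = ["hours_studied", "hours_slept", "attended_review"]
LABEL_COLUMN = "result"


def split_features_labels(rows, feature_columns, label_column):
    """Return (X, y): X is a list of feature rows, y is a flat list of labels."""
    X = [[row[col] for col in feature_columns] for row in rows]
    y = [row[label_column] for row in rows]
    return X, y


X, y = split_features_labels(rows, FEATURE_COLUMNS, LABEL_COLUMN)

print(X[0])  # [6, 8, 'Yes']
print(y)     # ['Pass', 'Fail', 'Pass', 'Fail']
```

Note that `X` has one inner list per student and one entry per feature, while `y` is a single flat list with one label per student. The two must stay in the same row order.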
Good features vs. bad features
Not all features are created equal. A good feature is relevant to the prediction. A bad feature is noise that confuses the model.
Predicting house price?
- Good features: square footage, number of bedrooms, neighborhood, age of house
- Bad features: the color of the mailbox, the owner's favorite movie, what day you scraped the listing
The process of choosing, creating, and transforming features is called feature engineering, and experienced ML practitioners will tell you it's often more important than which algorithm you pick.
Types of features
- Numerical: numbers like age, salary, temperature (usually usable as-is, though some models work better when they are scaled)
- Categorical: categories like color, country, "yes/no" (need to be converted to numbers)
- Text: raw text like reviews or tweets (need heavy processing)
- Derived: new features you create, like "age of house" computed as "current year" minus "year built"
Feature Engineering: Creating Better Features
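The sketch below shows two of the transformations described above on the house-price example: converting a yes/no category to a number and deriving "age of house" from "year built". The listings, the `has_garage` field, the `engineer_features` helper, and the fixed `CURRENT_YEAR` are all assumptions made up for illustration.

```python
# Hypothetical house listings; the values are made up for illustration.
listings = [
    {"square_feet": 1500, "year_built": 1995, "has_garage": "yes"},
    {"square_feet": 2200, "year_built": 2010, "has_garage": "no"},
]

# Assumed fixed reference year so the example gives the same result every run.
CURRENT_YEAR = 2024


def engineer_features(listing, current_year):
    """Return a numeric feature row built from one raw listing."""
    return {
        # Numerical feature: used as-is.
        "square_feet": listing["square_feet"],
        # Categorical feature: map yes/no to 1/0.
        "has_garage": 1 if listing["has_garage"] == "yes" else 0,
        # Derived feature: age = current year - year built.
        "house_age": current_year - listing["year_built"],
    }


engineered = [engineer_features(listing, CURRENT_YEAR) for listing in listings]

print(engineered[0])
# {'square_feet': 1500, 'has_garage': 1, 'house_age': 29}
```

A two-value category maps cleanly to 1/0. A category with more values, such as neighborhood, generally needs a different encoding (one column per value is a common choice) so the model does not read an ordering into arbitrary numbers.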