Features & Labels

Ingredients are features, the dish name is the label: teach your model what to look at and what to predict
features: Input columns · what the model sees
labels: Output column · what the model predicts
feature engineering: Critical · often matters more than the algorithm

Imagine you're looking at a recipe card. On one side, you've got the ingredients: flour, sugar, eggs, butter, vanilla extract. On the other side, you've got the dish name: chocolate cake.

In machine learning, the ingredients are called features and the dish name is called the label.

Features = the information the model uses to make a prediction (the inputs).
Label = the thing the model is trying to predict (the output).

That's it. Every supervised ML problem boils down to: "Given these features, predict this label."

A concrete example

Say you're predicting whether a student will pass or fail an exam. Here's your data:

Hours Studied   Hours Slept   Attended Review?   Result
6               8             Yes                Pass
2               4             No                 Fail
7               7             Yes                Pass
1               5             No                 Fail

The first three columns (Hours Studied, Hours Slept, Attended Review) are features. They're the clues.

The last column (Result) is the label. That's the answer the model learns to predict.

In code, features are usually called X (capital, because it's a matrix of many columns), and labels are called y (lowercase, because it's a single column).

Extracting Features & Labels

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Raw data as a DataFrame
data = pd.DataFrame({
    'hours_studied': [6, 2, 7, 1, 5, 8, 3, 4],
    'hours_slept': [8, 4, 7, 5, 6, 9, 3, 7],
    'attended_review': [1, 0, 1, 0, 1, 1, 0, 0],
    'result': [1, 0, 1, 0, 1, 1, 0, 0],  # 1=pass, 0=fail
})

# Split into features (X) and label (y)
feature_cols = ['hours_studied', 'hours_slept', 'attended_review']
X = data[feature_cols]
y = data['result']

print("Features (X):")
print(X.head(3))
print("\nLabels (y):")
print(y.head(3))

# Train a model (random_state fixes the tie-breaking so runs are repeatable)
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Predict: studied 5hrs, slept 7hrs, attended review.
# Pass a DataFrame with the training column names; a bare list also works,
# but recent scikit-learn versions warn about missing feature names.
new_student = pd.DataFrame([[5, 7, 1]], columns=feature_cols)
print("\nPrediction:", model.predict(new_student))
Output
Features (X):
   hours_studied  hours_slept  attended_review
0              6            8                1
1              2            4                0
2              7            7                1

Labels (y):
0    1
1    0
2    1
Name: result, dtype: int64

Prediction: [1]

Good features vs. bad features

Not all features are created equal. A good feature is relevant to the prediction. A bad feature is noise that confuses the model.

Predicting house price?

  • Good features: square footage, number of bedrooms, neighborhood, age of house
  • Bad features: the color of the mailbox, the owner's favorite movie, what day you scraped the listing
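
One rough way to see the difference is to check how strongly each candidate feature moves with the label. The sketch below uses made-up, synthetic housing data (the column names, the price formula, and the noise level are all assumptions for illustration) in which price depends on square footage but not on mailbox color:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data (assumption): price is driven by sqft plus random noise
sqft = rng.uniform(800, 3000, size=n)
mailbox_color = rng.integers(0, 5, size=n)  # arbitrary color code, unrelated to price
price = 150 * sqft + rng.normal(0, 20000, size=n)

houses = pd.DataFrame({
    "sqft": sqft,
    "mailbox_color": mailbox_color,
    "price": price,
})

# Absolute correlation of each feature with the label
corr = houses.corr()["price"].drop("price").abs()
print(corr.round(2))
```

On data like this, sqft should show a correlation close to 1 and mailbox_color one close to 0. Correlation only captures straight-line relationships, so treat it as a first sanity check rather than a final verdict on a feature.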

The process of choosing, creating, and transforming features is called feature engineering, and experienced ML practitioners will tell you it's often more important than which algorithm you pick.

Types of features

  • Numerical: numbers like age, salary, temperature (ready to use)
  • Categorical: categories like color, country, "yes/no" (need to be converted to numbers)
  • Text: raw text like reviews or tweets (need heavy processing)
  • Derived: new features you create, like "age of house" computed as "current year" minus "year built"
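
To make "converted to numbers" concrete, here is a minimal sketch of two common conversions in pandas: mapping a yes/no column to 1/0, and one-hot encoding a multi-category column with pd.get_dummies. The toy DataFrame and its column names are assumptions for illustration, and other encodings may suit real data better.

```python
import pandas as pd

# Hypothetical toy data: one numerical, one yes/no, one multi-category column
df = pd.DataFrame({
    "age": [25, 40, 31],
    "attended_review": ["Yes", "No", "Yes"],
    "color": ["red", "blue", "red"],
})

# Yes/no column: map straight to 1/0
df["attended_review"] = df["attended_review"].map({"Yes": 1, "No": 0})

# Multi-category column: one 0/1 column per category (one-hot encoding)
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded)
```

The numerical column passes through unchanged, while color becomes the two columns color_blue and color_red.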

Feature Engineering: Creating Better Features

import pandas as pd

data = pd.DataFrame({
    'year_built': [1990, 2005, 2018, 1975],
    'sqft': [1400, 2200, 1800, 1100],
    'bedrooms': [3, 4, 3, 2],
    'bathrooms': [2, 3, 2, 1],
})

# Derived feature: age of house
# (reference year is fixed at 2026 so the printed output stays reproducible)
CURRENT_YEAR = 2026
data['age'] = CURRENT_YEAR - data['year_built']

# Derived feature: sqft per bedroom
data['sqft_per_bed'] = data['sqft'] / data['bedrooms']

# Derived feature: bathroom-to-bedroom ratio
data['bath_ratio'] = data['bathrooms'] / data['bedrooms']

print(data[['age', 'sqft_per_bed', 'bath_ratio']])
Output
   age  sqft_per_bed  bath_ratio
0   36    466.666667    0.666667
1   21    550.000000    0.750000
2    8    600.000000    0.666667
3   51    550.000000    0.500000
Note: A common beginner mistake is accidentally including the label (or information derived from the label) as a feature. If you're predicting whether someone will buy a product and you include "receipt amount" as a feature, that's cheating! A receipt only exists after a purchase, so the model will score almost perfectly during training and testing but be useless for real predictions, where that information isn't available yet. This is called data leakage.
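
As a rough illustration of the receipt example, the sketch below builds a small, made-up purchase table (all column names and values are hypothetical), shows that the leaky column gives the label away, and then drops it along with the label when building X:

```python
import pandas as pd

# Hypothetical purchase data; receipt_amount only exists after a purchase
data = pd.DataFrame({
    "pages_viewed": [3, 12, 5, 20, 1, 8],
    "minutes_on_site": [2, 15, 4, 30, 1, 9],
    "receipt_amount": [0.0, 49.9, 0.0, 120.0, 0.0, 0.0],
    "bought": [0, 1, 0, 1, 0, 0],
})

label = "bought"
leaky = ["receipt_amount"]  # columns known only once the label is known

# The leaky column gives the answer away: a nonzero receipt means a purchase
gives_away = ((data["receipt_amount"] > 0).astype(int) == data[label]).all()
print("receipt_amount reveals the label:", gives_away)

# Keep only features that would be available at prediction time
X = data.drop(columns=[label] + leaky)
y = data[label]
print(list(X.columns))
```

A useful question to ask of every column is whether you would actually know its value at the moment you need the prediction. If not, it probably does not belong in X.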

Quick check

You're building a model to predict whether it will rain tomorrow. Which of these is a FEATURE?