ML Pipeline
Think of a car assembly line. Raw materials arrive at one end: sheets of metal, rubber, glass. They pass through stations: cutting, shaping, welding, painting, wiring, quality inspection. A finished car rolls off the other end.
If the metal is rusted, the car will be weak. If the welding station is miscalibrated, doors won't close. If quality inspection is skipped, defective cars reach customers. Every station matters.
An ML pipeline works the same way. Raw data flows in at one end, passes through cleaning, transformation, training, and evaluation stations, and predictions come out the other end. Skip a step or do one poorly, and the whole thing falls apart.
The stages of an ML pipeline
1. Data Collection
You need data. Lots of it. This might come from databases, APIs, web scraping, sensors, or manual labeling. The quality of your data caps the quality of your model.
2. Data Cleaning
Real data is messy: missing values, duplicates, typos, inconsistent formats ("New York" vs "NY" vs "new york"). By a common rule of thumb, this stage takes up to 80% of the work.
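To make the "New York" vs "NY" problem concrete, here is a minimal cleaning sketch using pandas. The records, column names, and alias table are all made up for illustration; a real dataset will need its own rules.

```python
import pandas as pd

# Hypothetical raw records; columns and values are invented for this example.
raw = pd.DataFrame({
    "city": ["New York", "NY", "new york", "Boston", "Boston", None],
    "age": [34, 34, None, 29, 29, 41],
})

# Assumed mapping from spelling variants to one canonical label.
CITY_ALIASES = {"ny": "New York", "new york": "New York", "boston": "Boston"}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Normalize whitespace and case, then map variants to the canonical name.
    out["city"] = out["city"].str.strip().str.lower().map(CITY_ALIASES)
    # Drop rows missing a required field. Imputing from statistics is left
    # to a later stage so those statistics can come from training data only.
    out = out.dropna(subset=["city", "age"])
    # Remove rows that became exact duplicates after normalization.
    return out.drop_duplicates().reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Dropping incomplete rows is only one option; whether to drop or impute depends on how much data you can afford to lose.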
3. Feature Engineering
Transform raw data into features the model can use: scale numbers, encode categories, create derived features, handle text/images.
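As a rough sketch of what scaling, encoding, and deriving features can look like, the example below uses scikit-learn's ColumnTransformer. The toy table, the column names, and the derived ratio feature are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data.
df = pd.DataFrame({
    "age": [25.0, 35.0, 45.0, 55.0],
    "income": [30_000.0, 50_000.0, 70_000.0, 90_000.0],
    "city": ["Boston", "New York", "Boston", "Chicago"],
})

# Derived feature: a simple ratio of two existing columns.
df["income_per_age"] = df["income"] / df["age"]

numeric_cols = ["age", "income", "income_per_age"]
categorical_cols = ["city"]

preprocess = ColumnTransformer(
    [
        # Scale numbers to zero mean and unit variance.
        ("num", StandardScaler(), numeric_cols),
        # Encode categories as one 0/1 column per city.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# For brevity this fits on the whole toy table; on real data, fit on the
# training rows only and reuse the fitted transformer everywhere else.
features = preprocess.fit_transform(df)
print(features.shape)
```

The result has three scaled numeric columns plus one column per distinct city.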
4. Train/Test Split
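Hold out data the model won't see during training. This is your unbiased evaluation set. Although it is listed after feature engineering, make the split before any transformation learns statistics from the data (a scaler's mean, for example), so those statistics come from the training portion only.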
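A minimal sketch of holding out data with scikit-learn's train_test_split. The arrays are synthetic placeholders, and the 25% test size and random seed are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 20 samples, 2 features, balanced labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,   # hold out a quarter of the rows
    random_state=42,  # fixed seed so the split is reproducible
    stratify=y,       # keep the class balance similar in both parts
)

print(X_train.shape, X_test.shape)
```

Every row lands in exactly one of the two parts, which is what keeps the evaluation honest.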
5. Model Training
Pick an algorithm, feed it the training data, tune hyperparameters. This is what most people think ML is, but it's actually only one step in the chain.
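One common way to tune hyperparameters is a cross-validated grid search. The sketch below assumes scikit-learn, a synthetic dataset, logistic regression as the algorithm, and an arbitrary grid of values for its regularization strength C.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Try each candidate value of C with 5-fold cross-validation on the
# training data only; the test set is not touched during tuning.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
print("best C:", search.best_params_["C"])
```

By default the search refits the best configuration on all of the training data, so best_model is ready to evaluate.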
6. Evaluation
Test the model on held-out data. Check accuracy, precision, recall, F1. If it's not good enough, go back to step 2 or 3 and iterate.
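To show what these metrics measure, the sketch below computes them by hand from true positives, false positives, and false negatives, then cross-checks against scikit-learn. The labels are invented for the example.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # correctly flagged positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed positives

precision = tp / (tp + fp)  # of everything flagged, how much was right
recall = tp / (tp + fn)     # of all real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = accuracy_score(y_true, y_pred)

# The library functions should agree with the hand-computed values.
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12

print(accuracy, precision, recall, f1)
```

Accuracy alone can look fine while recall is poor, which is why it helps to check several metrics together.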
7. Deployment
Ship the model to production where it serves real predictions. This means building an API, monitoring performance, handling edge cases, and planning for retraining.
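A small piece of deployment is saving the trained model and loading it inside whatever serves predictions. The sketch below assumes joblib for persistence and a synthetic dataset; predict_one is a hypothetical handler standing in for a real API endpoint, not part of any framework.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

N_FEATURES = 4

# Train a small pipeline on synthetic stand-in data.
X, y = make_classification(n_samples=100, n_features=N_FEATURES, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X, y)

# Persist the whole pipeline so preprocessing ships together with the model.
model_path = Path(tempfile.mkdtemp()) / "model.joblib"
joblib.dump(pipe, model_path)

# In the serving process: load once, then reuse for every request.
loaded = joblib.load(model_path)


def predict_one(features: list) -> int:
    """Hypothetical request handler: validate the input, then predict."""
    if len(features) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(features)}")
    return int(loaded.predict(np.array([features], dtype=float))[0])


print(predict_one(list(X[0])))
```

Only load joblib files from sources you trust, since loading them can execute arbitrary code.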
A Complete ML Pipeline in Code
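Here is a minimal end-to-end sketch, assuming scikit-learn (suggested by the pipe.fit and pipe.predict calls this article uses). The dataset bundled with scikit-learn, the choice of logistic regression, and the split settings are stand-ins, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Data collection: a small dataset that ships with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# 2. Train/test split: hold out 20% before anything is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Preprocessing and model bundled into one object.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# 4. Training: fits the scaler on X_train, then trains the model.
pipe.fit(X_train, y_train)

# 5. Evaluation on data the pipeline has never seen.
y_pred = pipe.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

This dataset is already clean, so the cleaning stage is skipped here; with real data it would come before the split.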
Why pipelines matter
Without a pipeline, data preprocessing and model training happen in scattered scripts. This leads to:
- Data leakage: accidentally using test data during preprocessing (e.g., scaling with test set statistics)
- Inconsistency: applying different transformations during training and prediction
- Messy code: impossible to reproduce or debug
A pipeline bundles everything together: when you call pipe.fit(X_train, y_train), it fits the scaler on the training data AND trains the model. When you call pipe.predict(X_test), it applies the same scaling, using the statistics learned during fit, and then predicts. No leakage from preprocessing, no inconsistency.
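You can check the no-leakage claim directly. The sketch below assumes scikit-learn and synthetic data, with the test set deliberately shifted so that leaked statistics would be easy to spot.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(80, 3))
X_test = rng.normal(loc=5.0, scale=1.0, size=(20, 3))  # deliberately shifted
y_train = (X_train[:, 0] > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# The mean stored by the scaler comes from the training rows only.
learned_mean = pipe.named_steps["scaler"].mean_.copy()
train_mean = X_train.mean(axis=0)

# What a leaky script would compute by scaling before splitting.
leaky_mean = np.vstack([X_train, X_test]).mean(axis=0)

# Predicting transforms X_test with the stored statistics; it does not refit.
pipe.predict(X_test)
mean_after_predict = pipe.named_steps["scaler"].mean_

print("learned:", learned_mean.round(2))
print("leaky:  ", leaky_mean.round(2))
```

The learned mean matches the training mean and stays unchanged after predict, while the leaky mean is pulled toward the test set.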
The deployment gap
Getting a model working in a Jupyter notebook is 10% of the job. Deploying it to production, where it handles real traffic, monitors for drift, retrains on new data, and fails gracefully, is the other 90%. That's why tools like MLflow, Kubeflow, and Airflow exist.
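To give a feel for drift monitoring, here is a deliberately simple sketch: compare each feature's mean in live traffic against the training data, measured in training standard deviations. The function name, the synthetic data, and the 0.5 threshold are all assumptions; production systems usually rely on more robust statistical tests.

```python
import numpy as np


def mean_shift_scores(reference: np.ndarray, live: np.ndarray) -> np.ndarray:
    """Per-feature shift of the live mean, in units of the reference std."""
    ref_mean = reference.mean(axis=0)
    ref_std = reference.std(axis=0) + 1e-12  # guard against division by zero
    return np.abs(live.mean(axis=0) - ref_mean) / ref_std


rng = np.random.default_rng(0)

# Reference: what the model saw during training (synthetic here).
reference = rng.normal(0.0, 1.0, size=(1000, 2))

# Live traffic: the first feature is unchanged, the second has drifted.
live = np.column_stack([
    rng.normal(0.0, 1.0, size=500),
    rng.normal(2.0, 1.0, size=500),
])

DRIFT_THRESHOLD = 0.5  # assumed cutoff; tune for your own data
scores = mean_shift_scores(reference, live)
drifted = scores > DRIFT_THRESHOLD
print(scores.round(2), drifted)
```

A flagged feature is a prompt to investigate and possibly retrain, not proof that the model has degraded.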