Models do not learn from raw data — they learn from features. Feature engineering is the discipline of transforming raw data into representations a model can learn from effectively. It is often the single biggest determinant of model quality.
What is a Feature?
A feature is one measurable property or characteristic of an observation. For predicting house prices:
- Square footage (numerical)
- Number of bedrooms (numerical, integer)
- Neighbourhood (categorical)
- Has a pool? (boolean)
- Days since last sale (engineered from raw dates)
The feature vector for one house is the list of these values, each encoded as a number. The training set is a matrix in which each row is a feature vector, with the target (price) stored as one extra column, here the last one.
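As a rough illustration of the row-per-house layout, the sketch below turns two raw records into a feature matrix `X` and a target list `y`. The field names, values, and the `as_of` reference date are all made up for this example, and the neighbourhood is left out because categorical values need an encoding step first.

```python
from datetime import date

# Hypothetical raw records; field names and values are invented for illustration.
raw_houses = [
    {"sqft": 1500, "bedrooms": 3, "has_pool": False,
     "last_sale": date(2020, 1, 1), "price": 300_000},
    {"sqft": 2400, "bedrooms": 4, "has_pool": True,
     "last_sale": date(2023, 6, 15), "price": 520_000},
]


def to_feature_vector(house, as_of):
    """Turn one raw record into a list of numbers."""
    # Engineered feature: derived from a raw date rather than stored directly.
    days_since_last_sale = (as_of - house["last_sale"]).days
    return [
        float(house["sqft"]),
        float(house["bedrooms"]),
        1.0 if house["has_pool"] else 0.0,  # boolean encoded as 0/1
        float(days_since_last_sale),
    ]


as_of = date(2024, 1, 1)  # assumed "today" so the example is reproducible
X = [to_feature_vector(h, as_of) for h in raw_houses]  # one row per house
y = [float(h["price"]) for h in raw_houses]            # target column
```

Keeping `X` and `y` separate is only one possible layout; the same data could equally be stored as a single table with the price as its last column.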
Feature Types and Encoding
| Type | Examples | Encoding |
|---|---|---|
| Numerical (continuous) | Price, age, temperature | Often standardised (mean 0, std 1) or min-max normalised |
| Numerical (discrete) | Number of clicks, count of pets | Same as continuous — sometimes log-transformed if highly skewed |
| Categorical (low cardinality) | Country, colour, gender | One-hot encoding |
| Categorical (high cardinality) | User ID, product ID, ZIP code | Target encoding or embeddings |
| Text | Reviews, descriptions | TF-IDF, word embeddings, sentence embeddings |
| Datetime | Timestamp, date of birth | Decompose: year, month, day-of-week, hour, is-weekend |
| Image | Photos, screenshots | Pixel arrays — usually fed to convolutional networks directly |
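The datetime row of the table can be sketched in a few lines. This is a minimal version using only Python's standard library; the function name `datetime_features` and the choice of output keys are assumptions made for the example.

```python
from datetime import datetime


def datetime_features(ts):
    """Decompose a timestamp into simple numeric features."""
    day_of_week = ts.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_week": day_of_week,
        "hour": ts.hour,
        "is_weekend": int(day_of_week >= 5),  # Saturday or Sunday
    }


# 16 March 2024 was a Saturday.
features = datetime_features(datetime(2024, 3, 16, 14, 30))
```

Which components are worth keeping depends on the problem: hour of day may matter for web traffic, while year may matter more for house prices.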
One-Hot Encoding Example
A "colour" feature with values red, green, blue becomes three binary columns:
| colour | colour_red | colour_green | colour_blue |
|---|---|---|---|
| red | 1 | 0 | 0 |
| green | 0 | 1 | 0 |
| blue | 0 | 0 | 1 |
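The table above can be reproduced with a short function. This is a dependency-free sketch, and the helper name `one_hot` is an assumption; in practice a library encoder would normally handle the same job.

```python
def one_hot(values, categories):
    """Return one binary row per value, with columns ordered as in `categories`."""
    return [[1 if value == c else 0 for c in categories] for value in values]


categories = ["red", "green", "blue"]
rows = one_hot(["red", "green", "blue"], categories)
# rows == [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# A value outside the known categories becomes an all-zero row.
unseen = one_hot(["purple"], categories)
# unseen == [[0, 0, 0]]
```

Fixing the category list up front keeps the column order stable between training and prediction.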
Embeddings
For high-cardinality categorical data (millions of users, billions of product IDs) or unstructured data (text, images), one-hot encoding is impractical. Embeddings are dense vectors of, say, 128 floating-point numbers that represent the input.
Crucially, embeddings are learned — similar inputs end up close together in the embedding space. The classic example: in word embeddings, king − man + woman ≈ queen. Modern LLMs use embeddings as their input layer; vector databases (Pinecone, Weaviate, pgvector) store and search embeddings at scale.
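To make "close together in the embedding space" concrete, the sketch below runs the king/queen arithmetic on hand-written 3-dimensional toy vectors. These numbers are invented so the example works; they are not learned embeddings, and real ones have far more dimensions.

```python
import math

# Toy vectors written by hand for illustration only, NOT learned embeddings.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# king - man + woman, computed element by element.
query = [
    k - m + w
    for k, m, w in zip(
        toy_embeddings["king"], toy_embeddings["man"], toy_embeddings["woman"]
    )
]

# Search the remaining words, excluding the three used to build the query.
candidates = {
    word: vec
    for word, vec in toy_embeddings.items()
    if word not in {"king", "man", "woman"}
}
nearest = max(candidates, key=lambda word: cosine(query, candidates[word]))
```

A vector database performs the same kind of nearest-neighbour search, but over millions of vectors and typically with approximate indexes rather than a full scan.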
Feature Scaling
Features often have wildly different ranges. Without scaling, a feature ranging 0–1,000,000 will dominate a feature ranging 0–1 in any model that relies on distances or gradient magnitudes, even if the latter is more predictive. Common scalers:
- Standardisation: Subtract mean, divide by standard deviation. Output has mean 0, std 1.
- Min-max normalisation: Linearly scale to [0, 1].
- Log transform: Apply log to highly skewed, non-negative features (income, web traffic). Use log(1 + x) when zeros can occur.
Tree-based models (random forest, XGBoost) do not require scaling: their splits depend only on the ordering of values. Distance-based methods (k-NN), neural networks, and linear models that are regularised or trained by gradient descent generally need it.
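The three transforms listed above can be written out directly. This is a minimal sketch using only the standard library; the function names and the `incomes` sample are assumptions for illustration, and the standard deviation used here is the population one.

```python
import math
import statistics


def standardise(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the values are not all identical
    return [(v - mean) / std for v in values]


def min_max(values):
    """Linearly rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Compress a long right tail; log1p(v) = log(1 + v) is defined at v = 0."""
    return [math.log1p(v) for v in values]


# Made-up, heavily skewed sample: one value dwarfs the rest.
incomes = [20_000, 30_000, 40_000, 1_000_000]

z_scores = standardise(incomes)
unit_range = min_max(incomes)
logged = log_transform(incomes)
```

On this sample, min-max scaling squeezes the three smaller incomes into a narrow band near 0, while the log transform spreads them out more evenly. That is the usual reason to log-transform skewed features before scaling.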
Data Leakage
Data leakage is among the most insidious bugs in ML. It happens when your training data contains information that would not be available at prediction time, often because the target leaked into a feature.
Example: Predicting customer churn using a feature called days_until_cancellation. The model achieves 99% accuracy. In production, it fails — because at prediction time, you don't know how long until the customer cancels (that's the prediction!).
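Subtler leakage: scaling the entire dataset before splitting into train/test. The test set's mean and standard deviation now influence the scaled training data, so evaluation results can look better than they would on truly unseen data. Always fit scalers on the training set only, then apply them to validation and test.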
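The "fit on train, apply everywhere" rule can be sketched as two separate steps. The helper names `fit_scaler` and `apply_scaler` and the tiny dataset are assumptions for illustration; the point is only that the statistics come from the training split.

```python
import statistics


def fit_scaler(train_values):
    """Learn mean and standard deviation from the training split only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def apply_scaler(values, mean, std):
    """Scale any split using statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]


# Made-up data; the last value is an outlier that ends up in the test split.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
train, test = values[:4], values[4:]  # split first

mean, std = fit_scaler(train)  # statistics come from train alone
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)  # reuse the train statistics

# Leaky alternative, shown only for comparison: the test outlier shifts the mean.
leaky_mean = statistics.fmean(values)
```

Here the training mean is 2.5, while the mean over the full dataset is about 19.2, so a scaler fitted before the split would have let the test outlier reshape every training value.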
Class Imbalance
If 99% of credit card transactions are legitimate and 1% are fraud, a model that always predicts "legitimate" has 99% accuracy and zero business value. Mitigations:
- Resampling: Oversample the minority class (SMOTE generates synthetic minority examples) or undersample the majority class. Resample the training split only, so that evaluation data keeps the real class ratio.
- Class weights: Tell the loss function to penalise mistakes on the minority class more heavily.
- Different metrics: Use precision, recall, and F1 instead of accuracy.
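The gap between accuracy and the other metrics shows up clearly on the 99%/1% example. The sketch below scores an always-"legitimate" baseline and computes inverse-frequency class weights. The function names are assumptions, undefined ratios are reported as 0.0 by convention here, and the weighting formula is the common "balanced" heuristic (`n_samples / (n_classes * class_count)`), not the only option.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * n) for label, n in counts.items()}


# 1 = fraud, 0 = legitimate; 1% of 1,000 made-up transactions are fraud.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # baseline that always predicts "legitimate"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
weights = balanced_class_weights(y_true)
```

The baseline reaches 0.99 accuracy with a recall of 0.0, and the fraud class receives a weight of 50.0 against roughly 0.505 for the legitimate class, so each missed fraud case would count about a hundred times more in a weighted loss.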