6 min read·Lesson 4 of 10

Data, Features, and Training Sets

Understand the difference between data and features, learn key feature engineering techniques, and avoid common pitfalls like data leakage and class imbalance.

Models do not learn from raw data — they learn from features. Feature engineering is the discipline of transforming raw data into representations a model can learn from effectively. It is often the single biggest determinant of model quality.

What is a Feature?

A feature is one measurable property or characteristic of an observation. For predicting house prices:

  • Square footage (numerical)
  • Number of bedrooms (numerical, integer)
  • Neighbourhood (categorical)
  • Has a pool? (boolean)
  • Days since last sale (engineered from raw dates)

The feature vector for one house is the list of these values. The training set is a table in which each row holds one house's feature vector and a final column holds the target (price).
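As a rough sketch of how one raw record might become a feature vector plus a target, the snippet below uses only the Python standard library. The record, its field names, and the helper `to_feature_row` are all hypothetical and exist only for illustration.

```python
from datetime import date

# Hypothetical raw record for one house; field names and values are made up.
raw_house = {
    "square_footage": 1850.0,
    "bedrooms": 3,
    "neighbourhood": "Riverside",
    "has_pool": True,
    "last_sale_date": date(2021, 3, 15),
    "price": 420_000.0,
}


def to_feature_row(house, today):
    """Turn one raw record into (features, target)."""
    features = {
        "square_footage": house["square_footage"],
        "bedrooms": house["bedrooms"],
        # Still categorical at this point; it needs encoding before training.
        "neighbourhood": house["neighbourhood"],
        "has_pool": int(house["has_pool"]),
        # Engineered feature: derived from a raw date, not stored directly.
        "days_since_last_sale": (today - house["last_sale_date"]).days,
    }
    return features, house["price"]


features, target = to_feature_row(raw_house, today=date(2024, 3, 15))
```

Passing `today` explicitly, rather than reading the system clock, keeps the engineered feature reproducible between training and prediction.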

Feature Types and Encoding

| Type | Examples | Encoding |
| --- | --- | --- |
| Numerical (continuous) | Price, age, temperature | Often standardised (mean 0, std 1) or min-max normalised |
| Numerical (discrete) | Number of clicks, count of pets | Same as continuous; sometimes log-transformed if highly skewed |
| Categorical (low cardinality) | Country, colour, gender | One-hot encoding |
| Categorical (high cardinality) | User ID, product ID, ZIP code | Target encoding or embeddings |
| Text | Reviews, descriptions | TF-IDF, word embeddings, sentence embeddings |
| Datetime | Timestamp, date of birth | Decompose: year, month, day-of-week, hour, is-weekend |
| Image | Photos, screenshots | Pixel arrays, usually fed to convolutional networks directly |
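The datetime decomposition listed in the table can be sketched with the standard library alone. The function name `decompose_timestamp` and the sample timestamp are illustrative assumptions, not part of any particular library.

```python
from datetime import datetime


def decompose_timestamp(ts):
    """Split one timestamp into simple numeric features."""
    day_of_week = ts.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_week": day_of_week,
        "hour": ts.hour,
        "is_weekend": int(day_of_week >= 5),  # Saturday or Sunday
    }


# 15 June 2024 fell on a Saturday.
parts = decompose_timestamp(datetime(2024, 6, 15, 14, 30))
```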

One-Hot Encoding Example

A "colour" feature with values red, green, blue becomes three binary columns:

| colour | colour_red | colour_green | colour_blue |
| --- | --- | --- | --- |
| red | 1 | 0 | 0 |
| green | 0 | 1 | 0 |
| blue | 0 | 0 | 1 |
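A minimal sketch of the same encoding in plain Python follows. The helper `one_hot_encode` is hypothetical; a library encoder would normally be used in practice, with its category list learned from the training set only.

```python
def one_hot_encode(values, prefix):
    """Return one dict of 0/1 columns per input value."""
    # Unique categories in first-seen order (dicts preserve insertion order).
    categories = list(dict.fromkeys(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


encoded = one_hot_encode(["red", "green", "blue", "red"], prefix="colour")
```

Each row has exactly one column set to 1, which is why the number of columns grows with the number of distinct categories.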

Embeddings

For high-cardinality categorical data (millions of users, billions of product IDs) or unstructured data (text, images), one-hot encoding is impractical. Embeddings are dense vectors of, say, 128 floating-point numbers that represent the input.

Crucially, embeddings are learned — similar inputs end up close together in the embedding space. The classic example: in word embeddings, king − man + woman ≈ queen. Modern LLMs use embeddings as their input layer; vector databases (Pinecone, Weaviate, pgvector) store and search embeddings at scale.
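To make "close together" concrete, the sketch below measures closeness with cosine similarity and reproduces the king/queen arithmetic. The three-dimensional vectors are hand-made toy values chosen for illustration, not learned embeddings, and the helper names are hypothetical. Real embeddings have far more dimensions and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, embeddings, exclude=()):
    """Word whose vector is most similar to `vector`, skipping `exclude`."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))


# ASSUMPTION: hand-made toy vectors, NOT output from any real model.
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.05, 0.3, 0.3],
}

# king - man + woman, computed element by element.
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
result = nearest(query, toy, exclude=("king", "man", "woman"))
```

Excluding the three query words from the search is the usual convention for analogy lookups; otherwise one of the inputs can turn out to be the nearest vector. A vector database does the same kind of nearest-neighbour search, but over millions of vectors with approximate indexes.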

Feature Scaling

Features often have wildly different ranges. Without scaling, a feature ranging 0–1,000,000 can dominate a feature ranging 0–1 in distance-based and gradient-based models, even if the latter is more predictive. Common scalers:

  • Standardisation: Subtract mean, divide by standard deviation. Output has mean 0, std 1.
  • Min-max normalisation: Linearly scale to [0, 1].
  • Log transform: Apply log to highly skewed features (income, web traffic).

Tree-based models (random forest, XGBoost) do not require scaling, because they split on thresholds and only care about the ordering of values. Linear models trained with regularisation or gradient descent, neural networks, and distance-based methods (k-NN) generally do.
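The three scalers listed above can be sketched with the standard library as follows. The function names and the sample income values are illustrative assumptions. The sketch uses the population standard deviation and does not guard against constant features, which would cause a division by zero.

```python
import math
import statistics


def standardise(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max(values):
    """Linearly rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Compress a long right tail; log1p(v) = log(1 + v), so 0 stays valid."""
    return [math.log1p(v) for v in values]


# Made-up, highly skewed incomes: one value is far larger than the rest.
incomes = [20_000, 35_000, 50_000, 1_000_000]

scaled = standardise(incomes)
normalised = min_max(incomes)
logged = log_transform(incomes)
```

In this toy data the largest income is 50 times the smallest before the log transform and much closer to it afterwards, which is the point of applying it to skewed features.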

Data Leakage

Data leakage is among the most insidious bugs in ML. It happens when your training data contains information that would not be available at prediction time, usually because the target leaked into a feature.

Example: Predicting customer churn using a feature called days_until_cancellation. The model achieves 99% accuracy. In production, it fails — because at prediction time, you don't know how long until the customer cancels (that's the prediction!).

Subtler leakage: scaling the entire dataset before splitting into train/test. The test set's mean and standard deviation now influence how the training data is scaled. Always fit scalers on the training set only, then apply them to validation and test.
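The fit-on-train-only rule can be sketched as below, again with only the standard library. The helpers `fit_scaler` and `apply_scaler` and the sample values are hypothetical, and the split is a simple slice for illustration rather than a random split.

```python
import statistics


def fit_scaler(train_values):
    """Learn scaling parameters from the training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def apply_scaler(values, mean, std):
    """Apply previously learned parameters to any split."""
    return [(v - mean) / std for v in values]


# Made-up feature values; the last two rows play the role of the test set.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0, 120.0]
train, test = values[:8], values[8:]

# Correct: parameters come from the training rows alone.
mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)

# Leaky: parameters computed on all rows, so test values shift the mean.
leaky_mean, leaky_std = fit_scaler(values)
```

Here the training-only mean is 4.5 while the leaky mean is 25.6, so the leaky version would scale the training rows using information from rows the model is never supposed to see.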

Class Imbalance

If 99% of credit card transactions are legitimate and 1% are fraud, a model that always predicts "legitimate" has 99% accuracy and zero business value. Mitigations:

  • Resampling: Oversample the minority class (SMOTE generates synthetic minority examples) or undersample the majority class.
  • Class weights: Tell the loss function to penalise mistakes on the minority class more heavily.
  • Different metrics: Use precision, recall, and F1 instead of accuracy.
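The sketch below illustrates two of the points above on made-up labels. It shows why accuracy misleads for an always-"legitimate" model, and one common way to derive class weights. The weighting formula, n_samples / (n_classes × class_count), is one widely used heuristic and is assumed here rather than taken from the lesson. The function names are hypothetical, and SMOTE is not implemented.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def balanced_class_weights(y):
    """Weight each class inversely to its frequency (assumed heuristic)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a "model" that always predicts legitimate

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
weights = balanced_class_weights(y_true)
```

The baseline scores 99% accuracy but zero recall on fraud. With these weights, a mistake on a fraud example would count roughly 100 times as much as a mistake on a legitimate one.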

Key Takeaways

  • A feature is a measurable input to a model — feature engineering turns raw data into features the model can learn from.
  • Categorical features need encoding (one-hot, embeddings); numerical features often need scaling (standardisation, normalisation).
  • Data leakage happens when training data accidentally contains information from the future or the target — it inflates accuracy and breaks at deployment.
  • Imbalanced classes need careful handling: oversampling (SMOTE), undersampling, or class weights.
  • Embeddings convert text, images, or other complex inputs into dense numeric vectors that capture meaning.
