Data, Features, and Training Sets — AI and ML Fundamentals | CertQnA

Models do not learn from raw data — they learn from features. Feature engineering is the discipline of transforming raw data into representations a model can learn from effectively. It is often the single biggest determinant of model quality.

What is a Feature?

A feature is one measurable property or characteristic of an observation. For predicting house prices:

Square footage (numerical)
Number of bedrooms (numerical, integer)
Neighbourhood (categorical)
Has a pool? (boolean)
Days since last sale (engineered from raw dates)

The feature vector for one house is the list of these values. The training set is a matrix where each row is a feature vector and the last column is the target (price).
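As a rough illustration of the feature vector and training matrix described above, here is a minimal sketch in plain Python. The house records, the field names (`sqft`, `bedrooms`, `has_pool`, `price`) and the helper `to_feature_vector` are all hypothetical, and the categorical neighbourhood feature is left out because it needs encoding first. The sketch also keeps the target in a separate list `y` instead of a last column, which is a common convention.

```python
# Hypothetical houses; every value here is made up for illustration.
houses = [
    {"sqft": 1400, "bedrooms": 3, "has_pool": False, "price": 310_000},
    {"sqft": 2100, "bedrooms": 4, "has_pool": True, "price": 480_000},
    {"sqft": 900, "bedrooms": 2, "has_pool": False, "price": 205_000},
]

FEATURE_NAMES = ["sqft", "bedrooms", "has_pool"]


def to_feature_vector(house):
    """Return the numeric feature vector for one house (booleans become 0.0/1.0)."""
    return [float(house[name]) for name in FEATURE_NAMES]


# X is the feature matrix (one row per house); y holds the matching targets.
X = [to_feature_vector(h) for h in houses]
y = [h["price"] for h in houses]

print(X[1])  # [2100.0, 4.0, 1.0]
```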

Feature Types and Encoding

Type	Examples	Encoding
Numerical (continuous)	Price, age, temperature	Often standardised (mean 0, std 1) or min-max normalised
Numerical (discrete)	Number of clicks, count of pets	Same as continuous — sometimes log-transformed if highly skewed
Categorical (low cardinality)	Country, colour, gender	One-hot encoding
Categorical (high cardinality)	User ID, product ID, ZIP code	Target encoding or embeddings
Text	Reviews, descriptions	TF-IDF, word embeddings, sentence embeddings
Datetime	Timestamp, date of birth	Decompose: year, month, day-of-week, hour, is-weekend
Image	Photos, screenshots	Pixel arrays — usually fed to convolutional networks directly
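To make the datetime row of the table concrete, the sketch below decomposes a timestamp using only Python's standard `datetime` module. The function name `decompose_timestamp` and the output keys are illustrative choices, not a fixed convention.

```python
from datetime import datetime


def decompose_timestamp(ts):
    """Split a datetime into the calendar features listed in the table."""
    day_of_week = ts.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_week": day_of_week,
        "hour": ts.hour,
        "is_weekend": day_of_week >= 5,  # Saturday or Sunday
    }


# 16 March 2024 was a Saturday.
features = decompose_timestamp(datetime(2024, 3, 16, 14, 30))
print(features)
```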

One-Hot Encoding Example

A "colour" feature with values red, green, blue becomes three binary columns:

colour	colour_red	colour_green	colour_blue
red	1	0	0
green	0	1	0
blue	0	0	1
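The table above can be reproduced with a few lines of plain Python. This is a minimal sketch, with `one_hot_encode` as a hypothetical helper name. In practice a library encoder would normally be used, and it would also have to decide how to handle categories that were not seen during training.

```python
def one_hot_encode(values, column="colour"):
    """One-hot encode a list of categorical values into rows of 0/1 columns."""
    categories = list(dict.fromkeys(values))  # unique values, first-seen order
    header = [f"{column}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return header, rows


header, rows = one_hot_encode(["red", "green", "blue", "green"])
print(header)  # ['colour_red', 'colour_green', 'colour_blue']
for row in rows:
    print(row)
```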

Embeddings

For high-cardinality categorical data (millions of users, billions of product IDs) or unstructured data (text, images), one-hot encoding is impractical. Embeddings are dense vectors of, say, 128 floating-point numbers that represent the input.

Crucially, embeddings are learned — similar inputs end up close together in the embedding space. The classic example: in word embeddings, king − man + woman ≈ queen. Modern LLMs use embeddings as their input layer; vector databases (Pinecone, Weaviate, pgvector) store and search embeddings at scale.
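The analogy arithmetic can be sketched with toy vectors. The 3-dimensional values below are hand-picked for illustration (real embeddings are learned and have far more dimensions), and the word list is hypothetical. The point is only to show how vector arithmetic and cosine similarity combine into a nearest-neighbour lookup, which is roughly the kind of query a vector database answers at scale.

```python
import math

# Toy, hand-picked vectors: NOT learned embeddings. The dimensions loosely
# stand for (royalty, male, female), purely for illustration.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# king - man + woman, computed element by element.
query = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Find the stored word whose vector points in the most similar direction.
nearest = max(embeddings, key=lambda word: cosine_similarity(query, embeddings[word]))
print(nearest)  # queen
```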

Feature Scaling

Features often have wildly different ranges. In models that rely on distances or gradient-based optimisation, an unscaled feature ranging 0–1,000,000 will dominate a feature ranging 0–1, even if the latter is more predictive. Common scalers and transforms:

Standardisation: Subtract mean, divide by standard deviation. Output has mean 0, std 1.
Min-max normalisation: Linearly scale to [0, 1].
Log transform: Apply log to highly skewed features (income, web traffic).
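The three transforms above can be written out in plain Python as follows. This is a sketch under a few assumptions: the `incomes` values are made up, the population standard deviation is used for standardisation, and `math.log1p` (log of 1 + x) stands in for the plain log so that zero values do not raise an error.

```python
import math
import statistics


def standardise(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # would be 0 for a constant feature
    return [(v - mean) / std for v in values]


def min_max(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Compress a long right tail; log1p(x) = log(1 + x) tolerates zeros."""
    return [math.log1p(v) for v in values]


incomes = [20_000, 35_000, 50_000, 1_000_000]  # made-up, highly skewed

print(standardise(incomes))
print(min_max(incomes))
print(log_transform(incomes))
```

Neither helper guards against a constant feature, where the standard deviation or the range is zero. Real code would need to handle that case.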

Tree-based models (random forest, XGBoost) do not require scaling, because their splits depend only on the ordering of values. Neural networks, distance-based methods (k-NN), and linear models that are regularised or trained by gradient descent generally need it.

Data Leakage

Data leakage is one of the most insidious bugs in ML. It happens when your training data contains information that would not be available at prediction time, usually because the target leaked into a feature.

Example: Predicting customer churn using a feature called days_until_cancellation. The model achieves 99% accuracy. In production, it fails — because at prediction time, you don't know how long until the customer cancels (that's the prediction!).

Subtler leakage: scaling the entire dataset before splitting into train/test. The test set's mean and standard deviation now influence the scaled training data, so information about the test set leaks into training. Always fit scalers on the training set only, then apply them to validation and test.
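The fit-on-train-only rule can be sketched as follows. The helpers `fit_scaler` and `apply_scaler` and the feature values are hypothetical. Library scalers usually expose the same idea as separate fit and transform steps.

```python
import statistics


def fit_scaler(train_values):
    """Learn scaling parameters (mean, std) from the given values only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def apply_scaler(values, params):
    """Scale any split using parameters that were learned elsewhere."""
    mean, std = params
    return [(v - mean) / std for v in values]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up feature values
train, test = values[:4], values[4:]  # split FIRST

params = fit_scaler(train)  # correct: the test row is never seen here
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)

leaky_params = fit_scaler(values)  # the mistake: statistics over all rows

print(params[0])        # 2.5  (mean of the training rows only)
print(leaky_params[0])  # 22.0 (mean pulled up by the test row)
```

In this toy split, the single test value shifts the leaky mean from 2.5 to 22.0. That shift is exactly the test-set information the training data should never see.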

Class Imbalance

If 99% of credit card transactions are legitimate and 1% are fraud, a model that always predicts "legitimate" has 99% accuracy and zero business value. Mitigations:

Resampling: Oversample the minority class (SMOTE generates synthetic minority examples) or undersample the majority class.
Class weights: Tell the loss function to penalise mistakes on the minority class more heavily.
Different metrics: Use precision, recall, and F1 instead of accuracy.
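The fraud example and two of the mitigations above can be sketched in plain Python. The labels are synthetic (99 legitimate, 1 fraud), and `precision_recall_f1` and `balanced_class_weights` are hypothetical helper names. The weight formula is a common inverse-frequency heuristic, n_samples / (n_classes * class_count), and treating an undefined precision or recall as 0.0 is a convention chosen here, not a universal rule.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


y_true = [0] * 99 + [1]  # synthetic: 99 legitimate (0), 1 fraud (1)
y_pred = [0] * 100       # a "model" that always predicts legitimate

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                             # 0.99
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
print(balanced_class_weights(y_true)[1])    # 50.0
```

The always-legitimate model scores 99% accuracy but zero recall on fraud, which is why accuracy alone is misleading here. With these weights, a mistake on the single fraud example would count roughly a hundred times as much as a mistake on a legitimate one.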