Machine Learning: Teaching Computers to Learn

This is Post 10 in the Computer Science Series. The previous post covered distributed systems. Now we look at machine learning — the technology behind spam filters, face recognition, music recommendations, and medical diagnosis.

Traditional programming: you write rules, the computer follows them. Machine learning: you show the computer examples, and it figures out the rules itself.

This shift sounds simple but it’s profound. Writing rules for “is this email spam?” is hard and brittle. Showing the computer millions of spam and non-spam emails and letting it learn the pattern — that works remarkably well.

Machine Learning Overview

The Big Picture

╔══════════════════════════════════════════════════════════════════════════════╗
║                      Machine Learning Landscape                              ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  Supervised Learning          Unsupervised Learning      Reinforcement       ║
║  (labeled examples)           (no labels, find patterns) (reward signal)     ║
║  ──────────────────────       ────────────────────────   ─────────────────   ║
║  Classification               Clustering                 Game playing        ║
║   spam detection              k-means, DBSCAN            robot control       ║
║   image recognition                                                          ║
║                               Dimensionality reduction   Self-driving cars   ║
║  Regression                   PCA, t-SNE                                     ║
║   house price prediction                                                     ║
║   stock price forecasting     Anomaly detection                              ║
║                               fraud, failures                                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  Key Algorithms                                                              ║
║  Linear Regression · Logistic Regression · Decision Trees                    ║
║  Random Forests · Gradient Boosting (XGBoost) · SVM                          ║
║  k-NN · Naive Bayes · k-Means · PCA                                          ║
╚══════════════════════════════════════════════════════════════════════════════╝

1. The Core Idea: Learning from Examples

Consider teaching a child to recognise cats. You don’t write rules like “if has four legs AND has whiskers AND meows → cat”. Instead, you show thousands of pictures and say “cat” or “not cat” until the child builds an internal model.

Machine learning does the same:

Training data:
  Photo 1 → cat ✓
  Photo 2 → not cat ✗
  Photo 3 → cat ✓
  ... (millions of examples)

Model learns patterns: pointy ears, whiskers, fur texture...

New photo → Model predicts: cat (91% confident)

The model is not programmed with cat rules. It learned them from data.

Features

Every example is described by features — measurable properties.

Email	has_link	CAPS_ratio	word_free	word_winner	→ Spam?
Email 1	Yes	0.3	Yes	Yes	Yes
Email 2	No	0.01	No	No	No
Email 3	Yes	0.5	Yes	No	Yes

Feature engineering — choosing which features to include — is often as important as choosing the algorithm.

2. Supervised Learning — Learning from Labeled Data

In supervised learning, every training example comes with the correct answer (label).

Classification: predict a category

Is this email spam or not? (2 classes)
What digit is this handwriting? (10 classes: 0–9)
What species is this bird? (thousands of classes)

Regression: predict a number

What will this house sell for?
How many hours until this machine fails?

Training and Evaluation

All data (100,000 examples)
  ├── Training set (80,000) → model learns from these
  ├── Validation set (10,000) → tune hyperparameters
  └── Test set (10,000) → final evaluation (never seen during training)

The test set simulates real-world performance. Never train on test data — that would give overly optimistic results (like memorising exam answers).

Overfitting vs Underfitting

Underfitting:  model too simple → misses patterns → poor on training AND test
Overfitting:   model too complex → memorises training data → great on training, poor on test
Just right:    generalizes well → good on both

A model that memorizes your training data is useless — it can’t generalize to new examples. Techniques to prevent overfitting: more data, simpler models, regularization, dropout (in neural networks).

3. Linear Regression — The Simplest Model

Predict a number from features by fitting a line (or plane in higher dimensions).

House price = 200 × (square_meters) + 50,000 × (bedrooms) + 30,000

Training: find the values (200, 50,000, 30,000) that best fit the data
Predict:  new house → plug in features → get price estimate

The model is just: y = w₁x₁ + w₂x₂ + ... + b

w values are weights (how much each feature matters). b is the bias (baseline value). Training finds the w and b values that minimize the error on training data.

4. Gradient Descent — How Models Learn

How do we find the right weights? Gradient descent — move in the direction that reduces error.

Imagine you’re lost in mountains in fog, and you want to reach the lowest valley. You can’t see far, but you can feel the ground slope. The strategy: always step downhill. Eventually you reach a valley.

Error (loss function) = how wrong the model's predictions are
Gradient = direction of steepest increase in error
Step = move opposite to gradient (downhill = less error)

Repeat:
  1. Make predictions with current weights
  2. Calculate error (e.g., mean squared error)
  3. Calculate gradient of error w.r.t. weights
  4. Update: weights = weights - learning_rate × gradient
  5. Repeat until error stops decreasing

The learning rate controls step size. Too large: you overshoot and bounce around. Too small: takes forever to converge.

# Simplified gradient descent for linear regression
for epoch in range(1000):
    predictions = X @ weights + bias
    error = predictions - y_true

    # gradients
    dW = (2/n) * X.T @ error
    db = (2/n) * sum(error)

    # update
    weights -= learning_rate * dW
    bias    -= learning_rate * db

Gradient descent is the core of training almost every ML model, from linear regression to massive language models.

5. Decision Trees — Transparent Decisions

A decision tree splits data based on feature values, like a flowchart.

Is income > $50,000?
├── Yes → Is credit_score > 700?
│          ├── Yes → APPROVE loan
│          └── No  → Is employment_years > 3?
│                    ├── Yes → APPROVE
│                    └── No  → REJECT
└── No  → REJECT

Trees are:

Interpretable: you can explain every decision
No feature scaling needed: trees don’t care about units
Handle mixed data: numbers and categories in the same model

A single tree tends to overfit. The fix: use many trees together.

Random Forest

Build 100 decision trees, each on a random subset of training data and features. For prediction, take the majority vote.

trees each make a prediction:
trees say: SPAM
trees say: NOT SPAM
  → Final: SPAM (majority wins)

Random forests are more accurate and much more robust than single trees. They’re among the most reliable “out of the box” algorithms.

Gradient Boosting (XGBoost, LightGBM)

Build trees sequentially. Each new tree corrects the errors of all previous trees.

Tree 1: makes predictions, has some errors
Tree 2: trained on the residual errors of Tree 1
Tree 3: trained on the residual errors of Trees 1+2
... (100-1000 trees)
Final prediction = sum of all trees' predictions

Gradient boosting is the top algorithm for structured/tabular data competitions. XGBoost and LightGBM have won hundreds of Kaggle competitions.

6. Support Vector Machines (SVM)

SVMs find the maximum-margin hyperplane — the decision boundary that’s as far as possible from both classes.

Class A points: o o o
Class B points: x x x

Bad boundary: x x | o o o  (close to class A)
Good boundary: x x   | o o o (equal margin on both sides)

The points closest to the boundary are called "support vectors"

SVMs with the kernel trick can draw non-linear boundaries (circles, curves) by implicitly mapping data to higher dimensions.

Good for: text classification, image classification (before deep learning), when data is small and features are well-engineered.

7. Unsupervised Learning — Finding Hidden Patterns

What if you don’t have labels? Unsupervised learning finds structure in unlabeled data.

k-Means Clustering

Group data into k clusters, where each point belongs to the cluster with the nearest center.

Start: randomly place k cluster centers
Repeat:
  1. Assign each point to nearest center
  2. Move each center to the average of its assigned points
Until centers stop moving

Result: k groups of similar points

Uses: customer segmentation (group customers by behaviour), compressing images (group similar colours), finding communities in social networks.

Principal Component Analysis (PCA)

Reduce many features to a few by finding the directions of maximum variance.

10,000 features → 2 features that capture most of the information

Used for: visualizing high-dimensional data, reducing computation, removing noise.

8. Evaluation Metrics

How do you measure if a model is good?

For Classification

Accuracy: fraction of predictions that are correct.

accuracy = correct predictions / total predictions

Accuracy is misleading for imbalanced datasets: if 99% of emails are not spam, a model that always says “not spam” has 99% accuracy but is useless.

Better metrics:

Precision: of everything I predicted as spam, what fraction was actually spam? Recall: of all actual spam emails, what fraction did I catch?

Precision = True Positives / (True Positives + False Positives)
Recall    = True Positives / (True Positives + False Negatives)

There’s a trade-off: higher recall (catch more spam) → more false positives (legitimate email marked spam). You tune based on what matters for your use case.

F1 Score: harmonic mean of precision and recall — single number that balances both.

For Regression

MAE (Mean Absolute Error): average absolute difference between predicted and actual. RMSE (Root Mean Squared Error): like MAE but penalises large errors more.

9. Classical ML vs Deep Learning

Aspect	Classical ML	Deep Learning
Data required	Hundreds–thousands	Tens of thousands–millions
Feature engineering	Required (you choose)	Learned automatically
Interpretability	Often interpretable	Usually a black box
Training time	Seconds–minutes	Hours–weeks (on GPUs)
Best for	Tabular/structured data	Images, audio, text, video

Classical ML (random forests, XGBoost) still dominates for tabular/structured data (the kind in databases). Deep learning dominates for images, audio, and text — where features are hard to engineer by hand.

Summary

Machine learning is pattern recognition at scale:

Supervised:    labeled examples → model → predictions on new data
Classification: predict categories (spam/not, cat/dog)
Regression:    predict numbers (price, temperature, time)

Core technique: gradient descent → minimize prediction error
Key algorithms: linear regression, decision trees, random forests, XGBoost

Pitfalls:
  Overfitting → model memorizes, doesn't generalize
  Data leakage → test data accidentally seen during training
  Wrong metric → accuracy on imbalanced data is misleading

ML doesn’t magically create knowledge. It finds patterns in the data you give it. Bad data → bad patterns. Biased data → biased models. The data is always the foundation.

In the next post, we’ll go deeper: Deep Learning — neural networks, convolutional networks, and the transformer architecture that powers modern AI.

Back to the series: Welcome to the Computer Science Series