The Probabilistic View of Machine Learning

March 19, 2025

This post is a collection of the mathematical concepts that I have come across in the past few months. I'm hoping that it will help me understand the math of machine learning better. I won't necessarily try to motivate all of the why behind the math here, as I think it's a bit of a rabbit hole, but I'll try to explain it as best as I can. Overall, I feel the probabilistic way of thinking about machine learning is quite elegant, and it explains a lot of the seemingly arbitrary choices in machine learning.

Notation

To keep things easier to follow, let's standardize some notation first.

  • $n$ is the number of points in the dataset.
  • $d$ is the number of dimensions in the input space.
  • $x$ denotes the $n \times d$-dimensional input space, i.e. $x \in \mathbb{R}^{n \times d}$, where $x_{ij}$ is the entry in the $i^{th}$ row and $j^{th}$ column.
  • $y$ is the output space with $n$ points, i.e. $y \in \mathbb{R}^{n \times 1}$.
  • $D$ is the dataset: a set of $n$ pairs from $x$ and $y$, i.e. $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\}$.
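
To make the notation concrete, here is a tiny NumPy sketch of what the dataset looks like as arrays; the sizes and values are made up purely for illustration.

```python
import numpy as np

n, d = 5, 3                       # illustrative sizes: n points, d input dimensions
X = np.random.randn(n, d)         # X[i, j] is feature j of point i
y = np.random.randint(0, 2, n)    # n outputs (binary labels in this example)

print(X.shape, y.shape)           # (5, 3) (5,)
```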

The Problem Setup

We begin by defining the problem we are trying to solve. I like to think of machine learning as fundamentally pattern matching, where we try to find a function that maps inputs to outputs. A probabilistic interpretation generalizes this notion to a probability distribution over the output given the input.

Given a dataset $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\}$ with input space $x$ and output space $y$, we aim to find a probability distribution $p(x, y)$ that "best" describes the dataset, i.e. maximizes the likelihood of the data under some assumptions.

An additional assumption is that the pairs $(x_{i}, y_{i}) \in D$ are independent and identically distributed, i.e. the dataset we are training on is a good (random) sample from the underlying distribution. This intuitively makes sense, since we can't really "learn" anything otherwise. We also want our model to generalize and predict well on new data (though there are nuances to this) [1].

The fun part of mathematics is that there are always a number of ways to approach a problem. Here, the problem of finding the "best" probabilistic model depends on our modeling goals and our assumptions about the data, and we can classify methods along two axes:

1. Based on What You Model

  • Discriminative

    As the name suggests, this method discriminates between classes, which makes it a good tool for classification tasks. We model the conditional distribution $p(y \mid x)$. This is the most common approach to machine learning in practice, and some examples include Logistic Regression, Neural Networks, and Decision Trees.

  • Generative

    A generative approach is more nuanced, as we are trying to create a "blueprint" of our data. Building on the "blueprint" analogy, if the final result of the "blueprint" is the observed sample $(x, y)$, our blueprint is then the joint distribution $p(x, y)$. Some of the most common generative models are Naive Bayes and Gaussian Mixture Models.

2. Based on How You Estimate the Distribution

  • Parametric Likelihood‑Based Methods

    We assume that the distribution has a parametric form $p(x, y; \theta)$, where $\theta$ is the parameter vector, and fit $\theta$ via Maximum Likelihood Estimation (MLE) or Bayesian inference. I like to think of this as a "knobs" approach, where I am looking for the best $\theta$ to fit the data. Examples include Logistic Regression, Decision Trees, and Naive Bayes.

  • Mixture Models

    Combine several simple components (e.g. Gaussians) and use an Expectation-Maximization (EM) algorithm to estimate mixture weights and parameters.

    This is an interesting approach; the best analogy I can think of is this: imagine we have a "mixture of puzzles," and we are trying to separate the pieces into different puzzles based on how the pieces fit together and what the overall image looks like. Examples include Gaussian Mixture Models and Dirichlet Process Mixture Models.

  • Nonparametric Density Estimation

    Make minimal assumptions about the form of the distribution, and use methods like kernel density estimation (KDE) or $k$-NN density estimators to estimate it.

  • Neural Density Estimation & Deep Generative Models

    Leverage neural networks to model the density and generate data. Some examples include autoregressive models, normalizing flows, VAEs, and diffusion models.

  • Energy‑Based & Score‑Matching Methods

    Define an unnormalized density $p(x) \propto e^{-E_\theta(x)}$ and train via contrastive divergence or score matching.

  • Conditional Distribution Learning

    Learn $p(y \mid x)$ directly with methods like conditional VAEs, conditional flows, or deep nets trained with cross-entropy.

Here, the main focus is on methods based on Parametric Likelihood. So, with a high-level understanding of the problem setup, let's dive into the details of some of the most common machine learning algorithms.

Parametric Methods

Logistic Regression

We start with logistic regression, which is a classification model (seriously, the real hard problem in machine learning is naming stuff). Logistic regression is a discriminative model that uses a parametric approach to model the conditional distribution $p(y \mid x; \theta)$, making it a good starting point.

Given a dataset $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\}$ where $y_i \in \{0, 1\}$, the logistic regression model treats the log-odds of the output as a linear combination of the input features.

Following from the definition above, we have:

$$\log \frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta^{\top} x$$

where $\theta$ is the parameter vector that we want to learn.

Taking the exponential of both sides, we get:

$$\frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = e^{\theta^{\top} x}$$

Since the data labels are binary, we know that $p(y = 1 \mid x; \theta) = 1 - p(y = 0 \mid x; \theta)$. So, we can rewrite the above equation as:

$$\frac{p(y = 1 \mid x; \theta)}{1 - p(y = 1 \mid x; \theta)} = e^{\theta^{\top} x}$$
$$p(y = 1 \mid x; \theta) = \frac{1}{1 + e^{-\theta^{\top} x}} = \sigma(\theta^{\top}x)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, which maps any real value to a probability between 0 and 1.
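
As a quick sanity check, here is a minimal NumPy sketch of the sigmoid and the resulting class-1 probability for a single input; the parameter vector and input values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real value z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 2.0])   # hypothetical parameter vector
x = np.array([1.0, 0.3, -0.7])       # hypothetical input point

p_y1 = sigmoid(theta @ x)            # p(y = 1 | x; theta)
print(p_y1, 1.0 - p_y1)              # probabilities of class 1 and class 0
```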

Next, we want the $\theta$ that maximizes the likelihood of the data, $\hat{\theta} = \arg\max_{\theta}\, p(y \mid x; \theta)$. Under the i.i.d. assumption, this means maximizing the likelihood function, which is given by:

$$\mathcal L(\theta) = \prod_{i=1}^{n} p(y_{i} \mid x_{i}; \theta)$$

substituting the sigmoid function, we get:

$$\mathcal L(\theta) = \prod_{i=1}^{n} \sigma(\theta^{\top}x_i)^{y_{i}} \left( 1 - \sigma(\theta^{\top}x_i) \right)^{1 - y_{i}}.$$

Maximizing the likelihood function over the data is equivalent to minimizing the negative log of the likelihood function,

$$\ell(\theta) = -\log \mathcal L(\theta) = -\log \prod_{i=1}^{n} \left( \sigma(\theta^{\top}x_i) \right)^{y_{i}} \left( 1 - \sigma(\theta^{\top}x_i) \right)^{1 - y_{i}}$$
$$\ell(\theta) = -\sum_{i=1}^{n} \left( y_{i} \log \sigma(\theta^{\top}x_i) + (1 - y_{i}) \log (1 - \sigma(\theta^{\top}x_i)) \right).$$
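
Written as code, this negative log-likelihood is just the familiar binary cross-entropy. A minimal vectorized sketch, where the function name is my own and the small `eps` is only there for numerical safety (the derivation itself does not need it):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(theta, X, y, eps=1e-12):
    """l(theta) = -sum_i [ y_i log s_i + (1 - y_i) log(1 - s_i) ], with s_i = sigmoid(theta^T x_i)."""
    s = sigmoid(X @ theta)
    return -np.sum(y * np.log(s + eps) + (1 - y) * np.log(1 - s + eps))
```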

To find the parameter vector $\theta$ that minimizes the negative log-likelihood of the data, we use gradient descent because no closed-form solution exists. Intuitively, gradient descent works by finding the direction of steepest descent, i.e. the direction opposite the gradient, and taking a small step in that direction; at a minimum the gradient is zero, which is where the updates stop making progress. The gradient update rule for $\theta$ is $\theta = \theta - \eta \nabla \ell(\theta)$, where $\eta$ is the learning rate.

To make this easier to work with, we start by taking the derivative of the negative log-likelihood for a single data point,

$$\frac{\partial \ell_i(\theta)}{\partial \theta} = -\frac{\partial}{\partial \theta} \left( y_{i} \log \sigma(\theta^{\top}x_i) + (1 - y_{i}) \log (1 - \sigma(\theta^{\top}x_i)) \right).$$

A quick note on the derivative of the sigmoid function: since $\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))$, the chain rule gives

$$\frac{\partial \sigma(\theta^{\top}x_i)}{\partial \theta} = \sigma(\theta^{\top}x_i)\left(1 - \sigma(\theta^{\top}x_i)\right) x_i$$
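
For completeness, the identity for $\frac{d\sigma(z)}{dz}$ follows from differentiating the sigmoid directly:

$$\frac{d\sigma(z)}{dz} = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\left(1 - \sigma(z)\right)$$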

using this, we can take the derivative of the negative log-likelihood for a single point:

$$\frac{\partial \ell_i(\theta)}{\partial \theta} = -\left( y_{i}\left(1 - \sigma(\theta^{\top}x_i)\right) - (1 - y_{i})\, \sigma(\theta^{\top}x_i) \right) x_i = \left( \sigma(\theta^{\top}x_i) - y_{i} \right) x_i.$$

So over all the data points, we have:

$$\boxed{\nabla \ell(\theta) = \sum_{i=1}^{n} \left( \sigma(\theta^{\top}x_i) - y_{i} \right) x_i.}$$

Because $\ell(\theta)$ is convex, gradient-based optimisers converge to a global minimum. [3]
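
Putting the pieces together, here is a minimal NumPy sketch of logistic regression trained with the gradient above. The function name, learning rate, iteration count, and synthetic data are my own arbitrary choices for illustration, not a production-ready implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the negative log-likelihood.

    X: (n, d) input matrix, y: (n,) array of 0/1 labels.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        preds = sigmoid(X @ theta)      # sigma(theta^T x_i) for every point
        grad = X.T @ (preds - y)        # boxed gradient: sum_i (sigma_i - y_i) x_i
        theta -= lr * grad / n          # dividing by n just rescales the learning rate
    return theta

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=200) > 0).astype(float)
theta_hat = fit_logistic_regression(X, y)
```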

Multinomial Regression

With logistic regression we focused on binary classification; now we generalize the idea to multi-class classification.

Given a dataset $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\}$ where $y_i \in \{1, 2, \ldots, k\}$, we proceed as in the binary case and model the log-odds of the output as a linear combination of the input features; however, this time we have $k$ classes. We model the conditional probability of each class and choose a reference class $C$ such that $\theta_C = 0$. Next, we know that the probabilities of all the classes must sum to 1, so we have:

$$p(y = C \mid x; \theta) + \sum_{c' \neq C} p(y = c' \mid x; \theta) = 1 \tag{2.1}$$

Now, we can write the log-odds as a linear combination of the input features, for all $c \neq C$:

$$\log \frac{p(y = c \mid x; \theta)}{p(y = C \mid x; \theta)} = \theta_c^{\top}x$$

exponentiating both sides, we get:

$$\frac{p(y = c \mid x; \theta)}{p(y = C \mid x; \theta)} = e^{\theta_c^{\top}x} \tag{2.2}$$

Substituting equation 2.2 into equation 2.1 (writing each $p(y = c' \mid x; \theta)$ as $p(y = C \mid x; \theta)\, e^{\theta_{c'}^{\top}x}$), we get:

$$p(y = C \mid x; \theta)\left[ 1 + \sum_{c' \neq C} e^{\theta_{c'}^{\top}x} \right] = 1$$
$$p(y = C \mid x; \theta) = \frac{1}{1 + \sum_{c' \neq C} e^{\theta_{c'}^{\top}x}} \tag{2.3}$$

Substituting equation 2.3 back into equation 2.2, we get:

$$p(y = c \mid x; \theta) = p(y = C \mid x; \theta)\, e^{\theta_c^{\top}x}$$
$$p(y = c \mid x; \theta) = \frac{e^{\theta_c^{\top}x}}{1 + \sum_{c' \neq C} e^{\theta_{c'}^{\top}x}}$$

Since we have $\theta_C = 0$, so that $e^{\theta_C^{\top}x} = 1$, we can simplify the above equation to:

$$p(y = c \mid x; \theta) = \frac{e^{\theta_c^{\top}x}}{\sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x}} \tag{2.4}$$

which is the softmax function.
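
Equation 2.4 is exactly what a numerically stable softmax implementation computes. A minimal sketch (the function name is mine; subtracting the maximum is only for numerical stability and does not change the result, since softmax is shift-invariant):

```python
import numpy as np

def softmax(logits):
    """Map a vector of scores (theta_c^T x for each class c) to class probabilities."""
    shifted = logits - np.max(logits)   # avoids overflow; softmax(z) == softmax(z - const)
    exp = np.exp(shifted)
    return exp / exp.sum()

print(softmax(np.array([2.0, 0.5, -1.0])))   # three probabilities summing to 1
```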

Similar to logistic regression, we can use gradient descent to find the parameter vector $\theta$ that minimizes the negative log-likelihood of the data. However, the likelihood function is a bit more complicated,

$$\mathcal L(\theta) = \prod_{i=1}^{n} \prod_{c=1}^{k} p(y_i = c \mid x_i; \theta)^{1\{y_i = c\}}$$

where $1\{y_i = c\}$ is the indicator function that is 1 if $y_i = c$ and 0 otherwise.

Taking the negative log of the likelihood function, we get:

$$\ell(\theta) = -\log \mathcal L(\theta) = -\log \left( \prod_{i=1}^{n} \prod_{c=1}^{k} p(y_i = c \mid x_i; \theta)^{1\{y_i = c\}} \right)$$
$$\ell(\theta) = -\sum_{i=1}^{n} \sum_{c=1}^{k} 1\{y_i = c\} \log p(y_i = c \mid x_i; \theta)$$
$$\ell(\theta) = -\sum_{i=1}^{n} \sum_{c=1}^{k} 1\{y_i = c\} \log \frac{e^{\theta_c^{\top}x_i}}{\sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i}}$$
$$\ell(\theta) = -\sum_{i=1}^{n} \sum_{c=1}^{k} 1\{y_i = c\} \left( \theta_c^{\top}x_i - \log \sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i} \right) \tag{2.5}$$
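
In code, equation 2.5 is the standard cross-entropy loss for softmax regression. A minimal sketch, where I store the class weight vectors as the columns of a $d \times k$ matrix and use zero-indexed labels for convenience (both are my own choices, not something the derivation requires):

```python
import numpy as np

def cross_entropy_loss(Theta, X, y):
    """l(Theta) = -sum_i ( theta_{y_i}^T x_i - log sum_c' exp(theta_{c'}^T x_i) ).

    Theta: (d, k) matrix whose column c is theta_c; X: (n, d); y: (n,) labels in {0, ..., k-1}.
    """
    logits = X @ Theta                                  # entry (i, c) is theta_c^T x_i
    log_z = np.log(np.exp(logits).sum(axis=1))          # log sum_c' exp(theta_c'^T x_i)
    return -np.sum(logits[np.arange(len(y)), y] - log_z)
```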

To find the parameter vector $\theta$, since a closed-form solution is not possible, we use gradient descent once again, with the same update rule $\theta = \theta - \eta \nabla \ell(\theta)$, where $\eta$ is the learning rate. Again, to make this easier to work with, we take the derivative of the negative log-likelihood with respect to a single weight vector $\theta_c$ for a single example. Since only the $c = y_i$ term survives the inner sum in equation 2.5, the per-example loss is $\ell_i(\theta) = -\left( \theta_{y_i}^{\top}x_i - \log \sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i} \right)$, and its derivative is

$$\frac{\partial \ell_i(\theta)}{\partial \theta_c} = -\frac{\partial}{\partial \theta_c} \left( \theta_{y_i}^{\top}x_i - \log \sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i} \right)$$

Since we have

$$\frac{\partial e^{\theta_{c'}^{\top} x_i}}{\partial \theta_c} = \begin{cases} e^{\theta_c^{\top} x_i}\, x_i & \text{if } c' = c \\ 0 & \text{if } c' \neq c, \end{cases}$$

we can compute the per-example gradient as:

$$\frac{\partial \ell_i(\theta)}{\partial \theta_c} = \left( \frac{e^{\theta_c^{\top}x_i}}{\sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i}} - 1\{y_i = c\} \right) x_i$$

Summing over all the data points, we get:

$$\boxed{\nabla_{\theta_c} \ell(\theta) = \sum_{i=1}^{n} \left( \frac{e^{\theta_c^{\top}x_i}}{\sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i}} - 1\{y_i = c\} \right) x_i.}$$
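
As with the binary case, here is a minimal NumPy sketch of softmax regression trained with the boxed gradient. The function name, zero-indexed label encoding, and hyperparameters are my own placeholder choices for illustration.

```python
import numpy as np

def fit_softmax_regression(X, y, k, lr=0.1, n_iters=1000):
    """Batch gradient descent on the multinomial negative log-likelihood.

    X: (n, d) inputs; y: (n,) integer labels in {0, ..., k-1}.
    """
    n, d = X.shape
    Theta = np.zeros((d, k))                 # column c is theta_c
    Y = np.eye(k)[y]                         # one-hot labels: Y[i, c] = 1{y_i = c}
    for _ in range(n_iters):
        logits = X @ Theta
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability only
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)     # row-wise softmax
        grad = X.T @ (probs - Y)             # column c is sum_i (p_ic - 1{y_i = c}) x_i
        Theta -= lr * grad / n
    return Theta
```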

References