The Probabilistic View of Machine Learning

March 19, 2025

This post is a collection of the mathematical concepts that I have come across in the past few months. I'm hoping that it will help me understand the math of machine learning better. I won't necessarily try to motivate all of the why behind the math here, as I think it's a bit of a rabbit hole, but I'll try to explain it as best as I can. Overall, I feel the probabilistic way of thinking about machine learning is quite elegant, and it explains a lot of the seemingly arbitrary choices in machine learning.

Notation

To keep things easier to follow, let's standardize some notation first.

  • $n$ is the number of points in the dataset.
  • $d$ is the number of dimensions in the input space.
  • $x$ denotes the $n \times d$-dimensional input space, i.e. $x \in \mathbb{R}^{n \times d}$, where $x_{ij}$ is the entry in the $i^{th}$ row and $j^{th}$ column.
  • $y$ is the output space with $n$ points, i.e. $y \in \mathbb{R}^{n \times 1}$.
  • $D$ is the dataset: a set of $n$ pairs from $x$ and $y$, i.e. $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\}$.
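
To make the notation concrete, here is a tiny NumPy sketch of what the dataset looks like as arrays; the sizes and values are made up purely for illustration.

```python
import numpy as np

n, d = 5, 3                       # illustrative sizes: n points, d input dimensions
X = np.random.randn(n, d)         # X[i, j] is feature j of point i
y = np.random.randint(0, 2, n)    # n outputs (binary labels in this example)

print(X.shape, y.shape)           # (5, 3) (5,)
```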

The Problem Setup

We begin by defining the problem we are trying to solve. I like to think of machine learning as fundamentally pattern matching, where we try to find a function that maps inputs to outputs. A probabilistic interpretation generalizes this notion to a probability distribution over the output given the input.

Given a dataset $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\}$ with input space $x$ and output space $y$, we aim to find a probability distribution $p(x, y)$ that "best" describes the dataset, i.e. maximizes the likelihood of the data under some assumptions.

An additional assumption is that the pairs $(x_{i}, y_{i}) \in D$ are independent and identically distributed, i.e. the dataset we are training on is a good (random) sample from the underlying distribution. This intuitively makes sense, since we can't really "learn" anything otherwise. We also want our model to generalize and predict well on new data (though there are nuances to this) [1].

The fun part of mathematics is that there are always a number of ways to approach a problem. Here, the problem of finding the "best" probabilistic model depends on our modeling goals and our assumptions about the data, and we can classify methods along two axes:

1. Based on What You Model

  • Discriminative

    As the name suggests, this method discriminates between classes, which makes it a good tool for classification tasks. We model the conditional distribution $p(y \mid x)$. This is the most common approach to machine learning in practice, and some examples include Logistic Regression, Neural Networks, and Decision Trees.

  • Generative

    A generative approach is more nuanced, as we are trying to create a "blueprint" of our data. Building on the "blueprint" analogy, if the final result of the "blueprint" is the observed sample $(x, y)$, our blueprint is then the joint distribution $p(x, y)$. Some of the most common generative models are Naive Bayes and Gaussian Mixture Models.

2. Based on How You Estimate the Distribution

  • Parametric Likelihood‑Based Methods

    We assume that the distribution has a parametric form $p(x, y; \theta)$, where $\theta$ is the parameter vector, and fit $\theta$ via Maximum Likelihood Estimation (MLE) or Bayesian inference. I like to think of this as a "knobs" approach, where I am looking for the best $\theta$ to fit the data. Examples include Logistic Regression, Decision Trees, and Naive Bayes.

  • Mixture Models

    Combine several simple components (e.g. Gaussians) and use an Expectation-Maximization (EM) algorithm to estimate mixture weights and parameters.

    This is an interesting approach; the best analogy I can think of is this: imagine we have a "mixture of puzzles," and we are trying to separate the pieces into different puzzles based on how the pieces fit together and what the overall image looks like. Examples include Gaussian Mixture Models and Dirichlet Process Mixture Models.

  • Nonparametric Density Estimation

    Make minimal assumptions about the form of the distribution, and use methods like kernel density estimation (KDE) or $k$-NN density estimators to estimate it.

  • Neural Density Estimation & Deep Generative Models

    Leverage neural networks to model the density and generate data. Some examples include autoregressive models, normalizing flows, VAEs, and diffusion models.

  • Energy‑Based & Score‑Matching Methods

    Define an unnormalized density $p(x) \propto e^{-E_\theta(x)}$ and train via contrastive divergence or score matching.

  • Conditional Distribution Learning

    Learn $p(y \mid x)$ directly with methods like conditional VAEs, conditional flows, or deep nets trained with cross-entropy.

Here, the main focus is on methods based on Parametric Likelihood. So, with a high-level understanding of the problem setup, let's dive into the details of some of the most common machine learning algorithms.

Parametric Methods

Logistic Regression

We start with logistic regression, which is a classification model (seriously, the real hard problem in machine learning is naming stuff). Logistic regression is a discriminative model that uses a parametric approach to model the conditional distribution $p(y \mid x; \theta)$, making it a good starting point.

Given a dataset $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\}$ where $y_i \in \{0, 1\}$, the logistic regression model treats the log-odds of the output as a linear combination of the input features.

Following from the definition above, we have:

$$\log \frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta^{\top} x$$

where $\theta$ is the parameter vector that we want to learn.

Taking the exponential of both sides, we get:

$$\frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = e^{\theta^{\top} x}$$

Since the data labels are binary, we know that $p(y = 1 \mid x; \theta) = 1 - p(y = 0 \mid x; \theta)$. So, we can rewrite the above equation as:

$$\frac{p(y = 1 \mid x; \theta)}{1 - p(y = 1 \mid x; \theta)} = e^{\theta^{\top} x}$$
$$p(y = 1 \mid x; \theta) = \frac{1}{1 + e^{-\theta^{\top} x}} = \sigma(\theta^{\top}x)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, which maps any real value to a probability between 0 and 1.
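
As a quick sanity check, here is a minimal NumPy sketch of the sigmoid and the resulting class-1 probability for a single input; the parameter vector and input values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real value z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 2.0])   # hypothetical parameter vector
x = np.array([1.0, 0.3, -0.7])       # hypothetical input point

p_y1 = sigmoid(theta @ x)            # p(y = 1 | x; theta)
print(p_y1, 1.0 - p_y1)              # probabilities of class 1 and class 0
```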

Next, we want the $\theta$ that maximizes the likelihood of the data, $\hat{\theta} = \arg\max_{\theta}\, p(y \mid x; \theta)$. Under the i.i.d. assumption, this means maximizing the likelihood function, which is given by:

$$\mathcal L(\theta) = \prod_{i=1}^{n} p(y_{i} \mid x_{i}; \theta)$$

substituting the sigmoid function, we get:

$$\mathcal L(\theta) = \prod_{i=1}^{n} \sigma(\theta^{\top}x_i)^{y_{i}} \left( 1 - \sigma(\theta^{\top}x_i) \right)^{1 - y_{i}}.$$

Maximizing the likelihood function over the data is equivalent to minimizing the negative log of the likelihood function,

$$\ell(\theta) = -\log \mathcal L(\theta) = -\log \prod_{i=1}^{n} \left( \sigma(\theta^{\top}x_i) \right)^{y_{i}} \left( 1 - \sigma(\theta^{\top}x_i) \right)^{1 - y_{i}}$$
$$\ell(\theta) = -\sum_{i=1}^{n} \left( y_{i} \log \sigma(\theta^{\top}x_i) + (1 - y_{i}) \log (1 - \sigma(\theta^{\top}x_i)) \right).$$
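
Written as code, this negative log-likelihood is just the familiar binary cross-entropy. A minimal vectorized sketch, where the function name is my own and the small `eps` is only there for numerical safety (the derivation itself does not need it):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(theta, X, y, eps=1e-12):
    """l(theta) = -sum_i [ y_i log s_i + (1 - y_i) log(1 - s_i) ], with s_i = sigmoid(theta^T x_i)."""
    s = sigmoid(X @ theta)
    return -np.sum(y * np.log(s + eps) + (1 - y) * np.log(1 - s + eps))
```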

To find the parameter vector $\theta$ that minimizes the negative log-likelihood of the data, we use gradient descent because no closed-form solution exists. Intuitively, gradient descent works by finding the direction of steepest descent, i.e. the direction opposite the gradient, and taking a small step in that direction; at a minimum the gradient is zero, which is where the updates stop making progress. The gradient update rule for $\theta$ is $\theta = \theta - \eta \nabla \ell(\theta)$, where $\eta$ is the learning rate.

To make this easier to work with, we start by taking the derivative of the negative log-likelihood for a single data point,

$$\frac{\partial \ell_i(\theta)}{\partial \theta} = -\frac{\partial}{\partial \theta} \left( y_{i} \log \sigma(\theta^{\top}x_i) + (1 - y_{i}) \log (1 - \sigma(\theta^{\top}x_i)) \right).$$

A quick note on the derivative of the sigmoid function: since $\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))$, the chain rule gives

$$\frac{\partial \sigma(\theta^{\top}x_i)}{\partial \theta} = \sigma(\theta^{\top}x_i)\left(1 - \sigma(\theta^{\top}x_i)\right) x_i$$
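
For completeness, the identity for $\frac{d\sigma(z)}{dz}$ follows from differentiating the sigmoid directly:

$$\frac{d\sigma(z)}{dz} = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\left(1 - \sigma(z)\right)$$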

using this, we can take the derivative of the negative log-likelihood for a single point:

$$\frac{\partial \ell_i(\theta)}{\partial \theta} = -\left( y_{i}\left(1 - \sigma(\theta^{\top}x_i)\right) - (1 - y_{i})\, \sigma(\theta^{\top}x_i) \right) x_i = \left( \sigma(\theta^{\top}x_i) - y_{i} \right) x_i.$$

So over all the data points, we have:

$$\boxed{\nabla \ell(\theta) = \sum_{i=1}^{n} \left( \sigma(\theta^{\top}x_i) - y_{i} \right) x_i.}$$

Because $\ell(\theta)$ is convex, gradient-based optimisers converge to a global minimum. [3]
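
Putting the pieces together, here is a minimal NumPy sketch of logistic regression trained with the gradient above. The function name, learning rate, iteration count, and synthetic data are my own arbitrary choices for illustration, not a production-ready implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the negative log-likelihood.

    X: (n, d) input matrix, y: (n,) array of 0/1 labels.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        preds = sigmoid(X @ theta)      # sigma(theta^T x_i) for every point
        grad = X.T @ (preds - y)        # boxed gradient: sum_i (sigma_i - y_i) x_i
        theta -= lr * grad / n          # dividing by n just rescales the learning rate
    return theta

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=200) > 0).astype(float)
theta_hat = fit_logistic_regression(X, y)
```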

Multinomial Regression

With logistic regression we focused on binary classification; now we generalize the idea to multi-class classification.

Given a dataset $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \ldots, (x_{n}, y_{n})\}$ where $y_i \in \{1, 2, \ldots, k\}$, we proceed as in the binary case and model the log-odds of the output as a linear combination of the input features; however, this time we have $k$ classes. We model the conditional probability of each class and choose a reference class $C$ such that $\theta_C = 0$. Next, we know that the probabilities of all the classes must sum to 1, so we have:

$$p(y = C \mid x; \theta) + \sum_{c' \neq C} p(y = c' \mid x; \theta) = 1 \tag{2.1}$$

Now, we can write the log-odds as a linear combination of the input features, for all $c \neq C$:

$$\log \frac{p(y = c \mid x; \theta)}{p(y = C \mid x; \theta)} = \theta_c^{\top}x$$

exponentiating both sides, we get:

$$\frac{p(y = c \mid x; \theta)}{p(y = C \mid x; \theta)} = e^{\theta_c^{\top}x} \tag{2.2}$$

Substituting equation 2.2 into equation 2.1 (writing each $p(y = c' \mid x; \theta)$ as $p(y = C \mid x; \theta)\, e^{\theta_{c'}^{\top}x}$), we get:

$$p(y = C \mid x; \theta)\left[ 1 + \sum_{c' \neq C} e^{\theta_{c'}^{\top}x} \right] = 1$$
$$p(y = C \mid x; \theta) = \frac{1}{1 + \sum_{c' \neq C} e^{\theta_{c'}^{\top}x}} \tag{2.3}$$

Substituting equation 2.3 back into equation 2.2, we get:

$$p(y = c \mid x; \theta) = p(y = C \mid x; \theta)\, e^{\theta_c^{\top}x}$$
$$p(y = c \mid x; \theta) = \frac{e^{\theta_c^{\top}x}}{1 + \sum_{c' \neq C} e^{\theta_{c'}^{\top}x}}$$

Since we have $\theta_C = 0$, so that $e^{\theta_C^{\top}x} = 1$, we can simplify the above equation to:

$$p(y = c \mid x; \theta) = \frac{e^{\theta_c^{\top}x}}{\sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x}} \tag{2.4}$$

which is the softmax function.
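
Equation 2.4 is exactly what a numerically stable softmax implementation computes. A minimal sketch (the function name is mine; subtracting the maximum is only for numerical stability and does not change the result, since softmax is shift-invariant):

```python
import numpy as np

def softmax(logits):
    """Map a vector of scores (theta_c^T x for each class c) to class probabilities."""
    shifted = logits - np.max(logits)   # avoids overflow; softmax(z) == softmax(z - const)
    exp = np.exp(shifted)
    return exp / exp.sum()

print(softmax(np.array([2.0, 0.5, -1.0])))   # three probabilities summing to 1
```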

Similar to logistic regression, we can use gradient descent to find the parameter vector $\theta$ that minimizes the negative log-likelihood of the data. However, the likelihood function is a bit more complicated,

$$\mathcal L(\theta) = \prod_{i=1}^{n} \prod_{c=1}^{k} p(y_i = c \mid x_i; \theta)^{1\{y_i = c\}}$$

where $1\{y_i = c\}$ is the indicator function that is 1 if $y_i = c$ and 0 otherwise.

Taking the negative log of the likelihood function, we get:

$$\ell(\theta) = -\log \mathcal L(\theta) = -\log \left( \prod_{i=1}^{n} \prod_{c=1}^{k} p(y_i = c \mid x_i; \theta)^{1\{y_i = c\}} \right)$$
$$\ell(\theta) = -\sum_{i=1}^{n} \sum_{c=1}^{k} 1\{y_i = c\} \log p(y_i = c \mid x_i; \theta)$$
$$\ell(\theta) = -\sum_{i=1}^{n} \sum_{c=1}^{k} 1\{y_i = c\} \log \frac{e^{\theta_c^{\top}x_i}}{\sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i}}$$
$$\ell(\theta) = -\sum_{i=1}^{n} \sum_{c=1}^{k} 1\{y_i = c\} \left( \theta_c^{\top}x_i - \log \sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i} \right) \tag{2.5}$$
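
In code, equation 2.5 is the standard cross-entropy loss for softmax regression. A minimal sketch, where I store the class weight vectors as the columns of a $d \times k$ matrix and use zero-indexed labels for convenience (both are my own choices, not something the derivation requires):

```python
import numpy as np

def cross_entropy_loss(Theta, X, y):
    """l(Theta) = -sum_i ( theta_{y_i}^T x_i - log sum_c' exp(theta_{c'}^T x_i) ).

    Theta: (d, k) matrix whose column c is theta_c; X: (n, d); y: (n,) labels in {0, ..., k-1}.
    """
    logits = X @ Theta                                  # entry (i, c) is theta_c^T x_i
    log_z = np.log(np.exp(logits).sum(axis=1))          # log sum_c' exp(theta_c'^T x_i)
    return -np.sum(logits[np.arange(len(y)), y] - log_z)
```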

To find the parameter vector $\theta$, since a closed-form solution is not possible, we use gradient descent once again, with the same update rule $\theta = \theta - \eta \nabla \ell(\theta)$, where $\eta$ is the learning rate. Again, to make this easier to work with, we take the derivative of the negative log-likelihood with respect to a single weight vector $\theta_c$ for a single example. Since only the $c = y_i$ term survives the inner sum in equation 2.5, the per-example loss is $\ell_i(\theta) = -\left( \theta_{y_i}^{\top}x_i - \log \sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i} \right)$, and its derivative is

$$\frac{\partial \ell_i(\theta)}{\partial \theta_c} = -\frac{\partial}{\partial \theta_c} \left( \theta_{y_i}^{\top}x_i - \log \sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i} \right)$$

Since we have

$$\frac{\partial e^{\theta_{c'}^{\top} x_i}}{\partial \theta_c} = \begin{cases} e^{\theta_c^{\top} x_i}\, x_i & \text{if } c' = c \\ 0 & \text{if } c' \neq c, \end{cases}$$

we can compute the per-example gradient as:

$$\frac{\partial \ell_i(\theta)}{\partial \theta_c} = \left( \frac{e^{\theta_c^{\top}x_i}}{\sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i}} - 1\{y_i = c\} \right) x_i$$

Summing over all the data points, we get:

$$\boxed{\nabla_{\theta_c} \ell(\theta) = \sum_{i=1}^{n} \left( \frac{e^{\theta_c^{\top}x_i}}{\sum_{c' = 1}^{k} e^{\theta_{c'}^{\top}x_i}} - 1\{y_i = c\} \right) x_i.}$$
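
As with the binary case, here is a minimal NumPy sketch of softmax regression trained with the boxed gradient. The function name, zero-indexed label encoding, and hyperparameters are my own placeholder choices for illustration.

```python
import numpy as np

def fit_softmax_regression(X, y, k, lr=0.1, n_iters=1000):
    """Batch gradient descent on the multinomial negative log-likelihood.

    X: (n, d) inputs; y: (n,) integer labels in {0, ..., k-1}.
    """
    n, d = X.shape
    Theta = np.zeros((d, k))                 # column c is theta_c
    Y = np.eye(k)[y]                         # one-hot labels: Y[i, c] = 1{y_i = c}
    for _ in range(n_iters):
        logits = X @ Theta
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability only
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)     # row-wise softmax
        grad = X.T @ (probs - Y)             # column c is sum_i (p_ic - 1{y_i = c}) x_i
        Theta -= lr * grad / n
    return Theta
```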

References