This post is a collection of the mathematical concepts that I have come across in the past few months. I'm hoping that it will help me understand the math of machine learning better. I won't necessarily try to motivate all of the why behind the math here, as I think it's a bit of a rabbit hole, but I'll try to explain it as best as I can. Overall, I feel the probabilistic way of thinking about machine learning is quite elegant, and it explains a lot of the seemingly arbitrary choices in machine learning.
To keep things easier to follow, let's standardize some notation first.
We begin by defining the problem we are trying to solve. I like to think of machine learning as fundamentally pattern matching, where we try to find a function that maps inputs to outputs. A probabilistic interpretation generalizes this notion to a probability distribution over the output given the input.
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with input space $\mathcal{X}$ and output space $\mathcal{Y}$, we aim to find a probability distribution that "best" describes the dataset, i.e. maximizes the likelihood of the data under some assumptions.
An additional assumption is that the pairs $(x_i, y_i)$ are independent and identically distributed, i.e. the dataset we are training on is a good (random) sample from the underlying distribution. This intuitively makes sense, since we can't really "learn" anything otherwise. Also, we want our model to be able to generalize and predict on new data (however, there are nuances to this) [1].
The fun part of mathematics is that there are always a number of ways to approach a problem. Here, the approach to finding the "best" probabilistic model depends on our modeling goals and assumptions about the data, and we can classify methods along two axes:
As the name suggests, this method discriminates between classes, which makes it a good tool for modeling classification tasks. We model the conditional distribution $p(y \mid x)$. This is the most common approach to machine learning in practice, and some examples include Logistic Regression, Neural Networks, and Decision Trees.
A generative approach is more nuanced, as we are trying to create a "blueprint" of our data. Building on the "blueprint" analogy, if the final result of the "blueprint" is the observed sample $(x, y)$, our blueprint is then the joint distribution $p(x, y)$. Some of the most common generative models are Naive Bayes and Gaussian Mixture Models.
We assume that the distribution has a parametric form $p(y \mid x; \theta)$ (or $p(x, y; \theta)$ for generative models), where $\theta$ is the parameter vector, and fit $\theta$ via Maximum Likelihood Estimation (MLE) or Bayesian inference. I like to think of this as a "knobs" approach, where I am looking for the best $\theta$ to fit the data. Examples include Logistic Regression, Decision Trees, and Naive Bayes.
Combine several simple components (e.g. Gaussians) and use an Expectation-Maximization (EM) algorithm to estimate mixture weights and parameters.
This is an interesting approach; the best analogy I can think of is this: imagine we have a "mixture of puzzles", and we are trying to separate the pieces into different groups based on how the pieces fit together and what the overall image looks like. Examples include Gaussian Mixture Models and Dirichlet Process Mixture Models.
Make minimal assumptions about the form of the distribution, and use methods like kernel density estimation (KDE) and $k$-NN density estimators to estimate it.
Leverage neural networks to model the density and generate data. Some examples include autoregressive models, normalizing flows, VAEs, and diffusion models.
Define an unnormalized density via an energy function, $p_\theta(x) \propto e^{-E_\theta(x)}$, and train via contrastive divergence or score matching.
Learn $p(y \mid x)$ directly with methods like conditional VAEs, conditional flows, or deep nets trained with cross-entropy.
Here, the main focus is on methods based on Parametric Likelihood. So, with a high-level understanding of the problem setup, let's dive into the details of some of the most common machine learning algorithms.
We start by understanding logistic regression, which is a classification model (seriously, the real hard problem in machine learning is naming stuff). Logistic regression is a discriminative model that uses a parametric approach to model the conditional distribution $p(y \mid x; \theta)$, making it a good starting point.
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$, the logistic regression algorithm models the log-odds of the output as a linear combination of the input features.
Following from the definition above, we have:

$$\log \frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = \theta^T x$$
where $\theta \in \mathbb{R}^d$ is the parameter vector that we want to learn.
Taking the exponential of both sides, we get:

$$\frac{p(y = 1 \mid x; \theta)}{p(y = 0 \mid x; \theta)} = e^{\theta^T x}$$
Now, since the data labels are binary, we know that $p(y = 0 \mid x; \theta) = 1 - p(y = 1 \mid x; \theta)$. So, we can rewrite the above equation and solve for $p(y = 1 \mid x; \theta)$:

$$p(y = 1 \mid x; \theta) = \frac{e^{\theta^T x}}{1 + e^{\theta^T x}} = \frac{1}{1 + e^{-\theta^T x}} = \sigma(\theta^T x)$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, which maps any real value to a probability between 0 and 1.
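To make this concrete, here's a small NumPy sketch of the sigmoid and the resulting class probability. The function names (`sigmoid`, `predict_proba`) and the toy inputs are just illustrative, not from any particular library.

```python
import numpy as np

def sigmoid(z):
    """Maps any real value to a probability in (0, 1).
    (For very negative z, np.exp(-z) can overflow; fine for this small demo.)"""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """p(y = 1 | x; theta) = sigmoid(theta^T x)."""
    return sigmoid(x @ theta)

# Toy example with 3 features.
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.5, 0.25])
print(predict_proba(theta, x))  # a single probability in (0, 1)
```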
Next, to find the $\theta$ that maximizes the likelihood of the data, $p(\mathcal{D}; \theta)$, we can equivalently maximize the likelihood function, which (using the i.i.d. assumption) is given by:

$$L(\theta) = \prod_{i=1}^{N} p(y_i \mid x_i; \theta) = \prod_{i=1}^{N} p(y_i = 1 \mid x_i; \theta)^{y_i} \left(1 - p(y_i = 1 \mid x_i; \theta)\right)^{1 - y_i}$$
substituting the sigmoid function, we get:

$$L(\theta) = \prod_{i=1}^{N} \sigma(\theta^T x_i)^{y_i} \left(1 - \sigma(\theta^T x_i)\right)^{1 - y_i}$$
Maximizing the likelihood function over the data is equivalent to minimizing the negative log of the likelihood function,

$$\mathrm{NLL}(\theta) = -\sum_{i=1}^{N} \left[ y_i \log \sigma(\theta^T x_i) + (1 - y_i) \log\left(1 - \sigma(\theta^T x_i)\right) \right]$$
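To see this loss in code, here's a minimal sketch of the negative log-likelihood (the binary cross-entropy). The clipping with `eps` is my own guard against `log(0)`, not part of the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(theta, X, y, eps=1e-12):
    """NLL(theta) = -sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ],
    where p_i = sigmoid(theta^T x_i)."""
    p = np.clip(sigmoid(X @ theta), eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```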
To find the parameter vector $\theta$ that minimizes the negative log-likelihood of the data, we use gradient descent because no closed-form solution exists. Intuitively, gradient descent works by looking for the direction of steepest descent, i.e. the direction in which the slope decreases fastest, and taking a step in that direction. Mathematically, the direction of steepest descent is the negative of the gradient, so we represent the gradient update rule for $\theta$ as $\theta \leftarrow \theta - \alpha \nabla_\theta \mathrm{NLL}(\theta)$, where $\alpha$ is the learning rate.
To make this easier to work with, we start with the negative log-likelihood for a single data point $(x_i, y_i)$ and take its derivative,

$$\mathrm{NLL}_i(\theta) = -\left[ y_i \log \sigma(\theta^T x_i) + (1 - y_i) \log\left(1 - \sigma(\theta^T x_i)\right) \right]$$
A quick note on the derivative of the sigmoid function:

$$\frac{d\sigma(z)}{dz} = \sigma(z)\left(1 - \sigma(z)\right)$$
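If you don't want to take this identity on faith, a quick numerical check (central differences against the closed form) is easy to run; the evaluation points and step size below are arbitrary choices of mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Compare a central-difference estimate of d(sigma)/dz with sigma(z) * (1 - sigma(z)).
z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.max(np.abs(numeric - analytic)))  # tiny, so the two agree
```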
using this, we can take the derivative of the per-example negative log-likelihood:

$$\frac{\partial\, \mathrm{NLL}_i(\theta)}{\partial \theta} = -\left(y_i - \sigma(\theta^T x_i)\right) x_i = \left(\sigma(\theta^T x_i) - y_i\right) x_i$$
So over all the data points, we have:

$$\nabla_\theta \mathrm{NLL}(\theta) = \sum_{i=1}^{N} \left(\sigma(\theta^T x_i) - y_i\right) x_i$$
Because $\mathrm{NLL}(\theta)$ is convex, gradient-based optimisers converge to a global minimum.[3]
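Putting the pieces together, here's a sketch of logistic regression trained with plain gradient descent. Dividing the gradient by the number of examples (i.e. averaging) and the specific learning rate and step count are my own choices to keep the toy run stable; the derivation above uses the unaveraged sum.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_steps=2000):
    """Gradient descent on the negative log-likelihood.
    Per-dataset gradient: sum_i (sigmoid(theta^T x_i) - y_i) * x_i."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / n  # averaged for a stable step size
        theta -= lr * grad
    return theta

# Toy data: the label depends (noisily) on the first feature only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)
theta_hat = fit_logistic_regression(X, y)
print(theta_hat)  # the first coefficient should dominate
```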
With logistic regression we focused on binary classification; now we generalize the idea to multi-class classification.
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ where $y_i \in \{1, \dots, K\}$, similar to the binary case, we model the log-odds of the output as a linear combination of the input features; however, this time we have $K$ classes. We represent the conditional probability of class $k$ as $p(y = k \mid x; \theta)$, and choose class $K$ as the reference class such that $\theta_K = 0$. Next, we know that the sum of the probabilities of all the classes must be 1, so we have:

$$\sum_{k=1}^{K} p(y = k \mid x; \theta) = 1 \tag{2.1}$$
Now, we can rewrite the log-odds of the output as a linear combination of the input features, for all $k \in \{1, \dots, K-1\}$,

$$\log \frac{p(y = k \mid x; \theta)}{p(y = K \mid x; \theta)} = \theta_k^T x$$
exponentiating both sides, we get:

$$p(y = k \mid x; \theta) = p(y = K \mid x; \theta)\, e^{\theta_k^T x} \tag{2.2}$$
Substituting equation 2.2 into equation 2.1 and solving for the reference-class probability, we get:

$$p(y = K \mid x; \theta) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\theta_j^T x}} \tag{2.3}$$
Substituting equation 2.3 back into equation 2.2, we get:

$$p(y = k \mid x; \theta) = \frac{e^{\theta_k^T x}}{1 + \sum_{j=1}^{K-1} e^{\theta_j^T x}}$$
Since we have $\theta_K = 0$, i.e. $e^{\theta_K^T x} = 1$, we can simplify the above equation to:

$$p(y = k \mid x; \theta) = \frac{e^{\theta_k^T x}}{\sum_{j=1}^{K} e^{\theta_j^T x}}$$
which is the softmax function.
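In code, the softmax is usually computed with a "subtract the max" trick, which leaves the result unchanged (the constant cancels in the ratio) but avoids overflow. The parameter layout below, with the last column fixed to zero for the reference class, is one way to match the derivation above; libraries often just learn all $K$ columns.

```python
import numpy as np

def softmax(logits):
    """softmax(z)_k = exp(z_k) / sum_j exp(z_j), computed stably."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# K = 3 classes, d = 2 features; the last column is the reference class (theta_K = 0).
Theta = np.array([[1.0, -0.5, 0.0],
                  [0.3,  0.8, 0.0]])
x = np.array([2.0, -1.0])
print(softmax(x @ Theta))  # three probabilities that sum to 1
```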
Similar to logistic regression, we can use gradient descent to find the parameter vectors $\theta_1, \dots, \theta_K$ that minimize the negative log-likelihood of the data. However, the likelihood function is a bit more complicated,

$$L(\theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} p(y_i = k \mid x_i; \theta)^{\mathbb{1}[y_i = k]}$$
where $\mathbb{1}[y_i = k]$ is the indicator function that is 1 if $y_i = k$ and 0 otherwise.
Taking the negative log of the likelihood function, we get:

$$\mathrm{NLL}(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}[y_i = k] \log p(y_i = k \mid x_i; \theta)$$
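As a quick sanity check on this expression, here is a small NumPy sketch of the multi-class negative log-likelihood (i.e. the cross-entropy loss). The function names and the one-hot construction are mine, not from any particular library.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def multiclass_nll(Theta, X, y, eps=1e-12):
    """NLL(Theta) = -sum_i sum_k 1[y_i = k] * log p(y_i = k | x_i; Theta)."""
    P = softmax(X @ Theta)    # predicted probabilities, shape (n, K)
    K = Theta.shape[1]
    Y = np.eye(K)[y]          # one-hot indicators 1[y_i = k], shape (n, K)
    return -np.sum(Y * np.log(P + eps))
```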
To find the parameter vectors, since a closed-form solution is not possible, we use gradient descent once again. So we represent the gradient update rule for each $\theta_k$ as $\theta_k \leftarrow \theta_k - \alpha \nabla_{\theta_k} \mathrm{NLL}(\theta)$, where $\alpha$ is the learning rate. Again, to make this easier to work with, we take the derivative of the negative log-likelihood with respect to a single weight vector $\theta_k$ for a single example $(x_i, y_i)$,

$$\frac{\partial\, \mathrm{NLL}_i(\theta)}{\partial \theta_k} = -\sum_{j=1}^{K} \mathbb{1}[y_i = j]\, \frac{1}{p(y_i = j \mid x_i; \theta)}\, \frac{\partial\, p(y_i = j \mid x_i; \theta)}{\partial \theta_k}$$
Since, for the softmax probabilities, we have

$$\frac{\partial\, p(y_i = j \mid x_i; \theta)}{\partial \theta_k} = p(y_i = j \mid x_i; \theta)\left(\mathbb{1}[j = k] - p(y_i = k \mid x_i; \theta)\right) x_i$$
So, we can rewrite the gradient as:

$$\frac{\partial\, \mathrm{NLL}_i(\theta)}{\partial \theta_k} = \left(p(y_i = k \mid x_i; \theta) - \mathbb{1}[y_i = k]\right) x_i$$
So over all the data points, we have:

$$\nabla_{\theta_k} \mathrm{NLL}(\theta) = \sum_{i=1}^{N} \left(p(y_i = k \mid x_i; \theta) - \mathbb{1}[y_i = k]\right) x_i$$
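Finally, here's a sketch of the whole multi-class model trained with gradient descent. As in the binary sketch, averaging the gradient and the specific learning rate are my own choices; also note that this version learns all $K$ weight vectors rather than pinning the reference class to zero, which is over-parameterized but works fine in practice.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def fit_softmax_regression(X, y, K, lr=0.5, n_steps=2000):
    """Gradient descent on the multi-class negative log-likelihood.
    Gradient w.r.t. theta_k: sum_i (p(y_i = k | x_i) - 1[y_i = k]) * x_i."""
    n, d = X.shape
    Theta = np.zeros((d, K))
    Y = np.eye(K)[y]                  # one-hot labels, shape (n, K)
    for _ in range(n_steps):
        P = softmax(X @ Theta)        # predicted probabilities, shape (n, K)
        Theta -= lr * (X.T @ (P - Y) / n)
    return Theta

# Toy 3-class data: the class is whichever feature is largest.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = np.argmax(X, axis=1)
Theta_hat = fit_softmax_regression(X, y, K=3)
pred = np.argmax(softmax(X @ Theta_hat), axis=1)
print((pred == y).mean())  # training accuracy, should be high
```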