
Kernel Mean Embedding of Distributions


Definitions: Math

Vector space

A vector space over a field $F$ is a non-empty set $V$ together with a binary operation $+$ and a binary function $\cdot$ that satisfy some properties.
More precisely, $+ : V \times V \to V$ (vector addition) and $\cdot : F \times V \to V$ (scalar multiplication). Elements of $F$ are called scalars and elements of $V$ are called vectors.

Given $u, v, w \in V$ and $\alpha, \beta \in F$, the following properties must hold:
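- Associativity of vector addition: $(u + v) + w = u + (v + w)$.
- Commutativity of vector addition: $u + v = v + u$.
- Additive identity: there exists $0 \in V$ such that $v + 0 = v$.
- Additive inverse: for every $v \in V$ there exists $-v \in V$ such that $v + (-v) = 0$.
- Compatibility of scalar and field multiplication: $\alpha \cdot (\beta \cdot v) = (\alpha \beta) \cdot v$.
- Scalar multiplicative identity: $1 \cdot v = v$.
- Distributivity over vector addition: $\alpha \cdot (u + v) = \alpha \cdot u + \alpha \cdot v$.
- Distributivity over field addition: $(\alpha + \beta) \cdot v = \alpha \cdot v + \beta \cdot v$.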

Inner product space

An inner product space (or pre-Hilbert space) is a vector space $V$ over a field $F$ equipped with an inner product, which is a function $\langle \cdot, \cdot \rangle : V \times V \to F$ that satisfies the following properties:
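- Conjugate symmetry: $\langle u, v \rangle = \overline{\langle v, u \rangle}$.
- Linearity in the first argument: $\langle \alpha u + v, w \rangle = \alpha \langle u, w \rangle + \langle v, w \rangle$.
- Positive-definiteness: $\langle v, v \rangle \geq 0$, with equality if and only if $v = 0$.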

Inner products are particularly easy to use in $\mathbb{R}^n$, as the symmetry does not require complex conjugation, making them linear in both arguments.
The standard inner product is defined as $\langle u, v \rangle = u^T v$ for $u, v \in \mathbb{R}^n$.

Norm

A norm on a vector space $V$ is a function $\| \cdot \| : V \to \mathbb{R}$ that satisfies the following properties:
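- Non-negativity and definiteness: $\| v \| \geq 0$, with $\| v \| = 0$ if and only if $v = 0$.
- Absolute homogeneity: $\| \alpha v \| = |\alpha| \, \| v \|$.
- Triangle inequality: $\| u + v \| \leq \| u \| + \| v \|$.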

The norm induced by an inner product is defined as $\| v \| = \sqrt{\langle v, v \rangle}$.

Banach space

A Banach space is a complete normed vector space, meaning that every Cauchy sequence in the space converges to a limit in the space.

Hilbert space

Hilbert spaces are complete inner product spaces. Any finite-dimensional inner product space is a Hilbert space, but the converse is not true.
Furthermore, note that every Hilbert space is a Banach space, as its norm is induced by the inner product, but the converse is not true.

In a Hilbert space we can define the concepts of distance and angle. More precisely, given the inner product $\langle \cdot, \cdot \rangle$ and the norm $\| \cdot \| = \sqrt{\langle \cdot, \cdot \rangle}$, we can define:

A Hilbert space also needs to be complete, meaning that every Cauchy sequence in the space converges to a limit in the space. A Cauchy sequence is a sequence of elements $x_1, x_2, \ldots$ such that for any $\epsilon > 0$ there exists an $N \in \mathbb{N}$ such that for all $n, m > N$ we have $\| x_n - x_m \| < \epsilon$. In other words, if the sequence converges to a point, that point must be in the space.

Intuition: if a sequence of elements of the space can get arbitrarily close to a point that is not in the space, then the space is not complete. For example, $\mathbb{Q}$ is not complete, since a sequence of rationals can converge to $\sqrt{2} \notin \mathbb{Q}$.

Kernel

Given a non-empty set $\mathcal{X}$, we choose to apply a nonlinear transformation $\varphi : \mathcal{X} \to \mathcal{H}$ to map the input space into a Hilbert space $\mathcal{H}$, where we can use the inner product to compute the similarity between two points. The map $\varphi$ is called a feature map.
Therefore, we can define the kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ as $k(x, x') \coloneqq \langle \varphi(x), \varphi(x') \rangle_\mathcal{H}$.
Note that we are not imposing any condition on $\mathcal{X}$.

The same kernel can be induced by different feature maps. For instance, considering $\mathcal{X} = \mathbb{R}^p$, both $\varphi_1(x) = x$ and $\varphi_2(x) = \begin{bmatrix} x / \sqrt{2} \\ x / \sqrt{2} \end{bmatrix}$ are valid feature maps, mapping to $\mathcal{H} = \mathbb{R}^p$ and $\mathcal{H} = \mathbb{R}^{2p}$ respectively, and both define the same kernel $k(x, x') = x^T x'$.
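As a quick numerical check (my own sketch, not part of the original notes), the two feature maps above indeed induce the same kernel:

```python
import numpy as np

# Two different feature maps that induce the same kernel k(x, x') = x^T x'.
rng = np.random.default_rng(0)
x, x_prime = rng.standard_normal(3), rng.standard_normal(3)

phi_1 = lambda v: v                                    # maps R^p -> R^p
phi_2 = lambda v: np.concatenate([v, v]) / np.sqrt(2)  # maps R^p -> R^{2p}

k = x @ x_prime                                        # k(x, x') = x^T x'
print(np.isclose(phi_1(x) @ phi_1(x_prime), k))        # True
print(np.isclose(phi_2(x) @ phi_2(x_prime), k))        # True
```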

Positive definite kernel

A symmetric kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if for any $n \in \mathbb{N}$, any $x_1, \ldots, x_n \in \mathcal{X}$, and any $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ it holds that

$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) \geq 0$$

This is because

$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle \varphi(x_i), \varphi(x_j) \rangle = \left\| \sum_{i=1}^n \alpha_i \varphi(x_i) \right\|^2 \geq 0$$

Notable kernels

Kernel trick

The key observation to make is that we do not need to compute the feature map $\varphi$ explicitly: evaluating the kernel function $k$ already gives us the result of the inner product.
For instance, let us consider the polynomial kernel $k(x, x') = (x^T x')^2$ on $\mathbb{R}^2$, whose feature map is $\varphi(x) = \begin{bmatrix} x_1^2 & x_2^2 & \sqrt{2} x_1 x_2 \end{bmatrix}^T$. We could compute each feature map and then the inner product $\langle \varphi(x), \varphi(x') \rangle$. But since we know that this is equal to $k(x, x') = (x^T x')^2$, we can skip the feature map computation and directly evaluate the kernel function, i.e., the inner product.
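A small sketch of this example (my own illustration, assuming the polynomial kernel above):

```python
import numpy as np

def phi(x):
    # Explicit feature map for k(x, x') = (x^T x')^2 on R^2.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, x_prime):
    # Kernel trick: same inner product, no explicit feature map needed.
    return (x @ x_prime) ** 2

x, x_prime = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(x_prime))  # 1.0
print(k(x, x_prime))          # 1.0
```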

Note that the kernel trick is not always possible, as it depends on the kernel function and the feature map.

TODO: Mercer’s theorem

Reproducing kernel

Let $\mathcal{H}$ be a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$ defined over a non-empty set $\mathcal{X}$. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$ if, for all $x \in \mathcal{X}$ and all $f \in \mathcal{H}$, $k(\cdot, x) \in \mathcal{H}$ and $f(x) = \langle f, k(\cdot, x) \rangle_\mathcal{H}$ (the reproducing property).

If H\mathcal{H} has a kernel with these properties, it is called a reproducing kernel Hilbert space (RKHS).

Moreover, the following three statements are equivalent:

Gram matrix

Given a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and a set of $n$ points $x_1, \ldots, x_n \in \mathcal{X}$, we can define the Gram matrix as the matrix $K \in \mathbb{R}^{n \times n}$ such that $K_{ij} = k(x_i, x_j)$.

$$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{bmatrix}$$

Note that the Gram matrix is symmetric and positive semi-definite, as it inherits these properties from the kernel function.
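For concreteness, here is a short sketch (my own illustration) that builds a Gram matrix with the Gaussian RBF kernel $k(x, x') = \exp(-\| x - x' \|^2 / (2\sigma^2))$ and checks both properties:

```python
import numpy as np

def rbf(x, x_prime, sigma=1.0):
    # Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))  # n = 5 points in R^2

K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # eigenvalues >= 0, i.e. PSD
```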

TODO: Lp space

TODO: Sobolev space

Reproducing kernel Hilbert space (RKHS)

A Reproducing Kernel Hilbert Space (RKHS) is a Hilbert space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}$ with a reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(x, \cdot) \in \mathcal{H}$ and $f(x) = \langle f, k(x, \cdot) \rangle_\mathcal{H}$ for all $f \in \mathcal{H}$ and $x \in \mathcal{X}$.

What's more, the functions $k_{x}(\cdot) = k(\cdot, x)$ with $x \in \mathcal{X}$ span the RKHS $\mathcal{H}$ (their span is dense in $\mathcal{H}$). This way we can express any function $f \in \mathcal{H}$ as a linear combination (or a limit of linear combinations) of these functions, i.e., $f(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$ with $x_i \in \mathcal{X}$.
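Combining such an expansion with the reproducing property, evaluating $f$ at a point $x$ reduces to kernel evaluations:

$$f(x) = \langle f, k(\cdot, x) \rangle_\mathcal{H} = \Big\langle \sum_{i=1}^n \alpha_i k(\cdot, x_i), k(\cdot, x) \Big\rangle_\mathcal{H} = \sum_{i=1}^n \alpha_i \langle k(\cdot, x_i), k(\cdot, x) \rangle_\mathcal{H} = \sum_{i=1}^n \alpha_i k(x_i, x)$$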

Definitions: Machine Learning

Supervised vs Unsupervised learning

With Supervised learning the algorithm is given labeled data, i.e. $D = \{(x_i, y_i)\}_{i=1}^n$, where $x_i$ is the input and $y_i$ is the output. The goal is to learn a function $f : \mathcal{X} \to \mathcal{Y}$ that maps the input to the output, so that we can predict the output for new inputs.

With Unsupervised learning the algorithm is given unlabeled data, i.e. $D = \{x_i\}_{i=1}^n$, where $x_i$ is the input. The goal is to learn the underlying structure of the data, such as clustering or dimensionality reduction.

Discriminative vs Generative models

Discriminative models can make predictions, while Generative models try to explain how the data was generated.

In other words, discriminative models learn a mapping between the input space and the output space. The training usually involves using the data to minimise a loss function, and the prediction is made by applying the learned function to new inputs.

Generative models, on the other hand, construct a probabilistic model of how the data was generated. For instance, given some labeled data, we may construct a joint model $p(x, y)$ and then use Bayes' rule to compute the conditional distribution $p(Y = y \mid X = x)$. The generative approach can even extend to the parameters of the model, as in Bayesian inference.

| | Discriminative models | Generative models |
| --- | --- | --- |
| Pros | Simpler to directly solve the prediction problem | Allow incorporating problem-specific expertise |
| | Typically make few assumptions | Provide an explanation for how the data was generated |
| | | Naturally provide uncertainty estimates |
| Cons | Difficult to impart domain expertise | The problem is harder in general |
| | Typically lack interpretability | Could contain unwanted assumptions |
| | | Tend to require more fine-tuning for the problem |

Loss function

To evaluate the performance of a model, we need to define a loss function that quantifies the error between the predicted output and the true output. In general, such a function can be defined as

$$\mathcal{L} : \mathcal{Y} \times \mathcal{Y} \times \mathcal{X} \to \mathbb{R}^+$$

where $\mathcal{Y}$ is the output space, $\mathcal{X}$ is the input space, and $\mathbb{R}^+$ is the set of non-negative real numbers. The loss is then evaluated as

$$\mathcal{L}(y, f(x), x)$$

where $y$ is the true output, $f(x)$ is the predicted output, and $x$ is the input.

Since the loss usually does not depend on the input directly, we can simplify the notation to

$$\mathcal{L} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$$

so that we can write

$$\mathcal{L}(y, f(x)) .$$
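As a small illustration (my own sketch), two common instances of this simplified signature:

```python
def squared_loss(y, y_hat):
    # Typical regression loss: L(y, f(x)) = (y - f(x))^2.
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    # Typical classification loss: 1 if the prediction is wrong, 0 otherwise.
    return float(y != y_hat)

print(squared_loss(2.0, 1.5))  # 0.25
print(zero_one_loss(1, 0))     # 1.0
```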

Risk function

The most common way to aggregate the loss is the risk function, which is the expected loss over the input space $\mathcal{X}$ and the output space $\mathcal{Y}$ when the inputs and outputs are treated as random variables $X$ and $Y$.

$$R(f) = \mathbb{E}_{X, Y} \left[ \mathcal{L}(Y, f(X)) \right] .$$

Ideally the risk would be computed under the true probability distribution, but since we do not have access to it and cannot enumerate the (possibly infinite) input space, we use the empirical risk function instead.

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}(y_i, f(x_i)) .$$

For a fixed $f$ chosen independently of the data, the empirical risk is an unbiased estimator of the true risk. Our goal is to find the function $f$ that minimises the risk function.
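A minimal sketch (my own illustration) of the empirical risk as an average loss over the sample:

```python
import numpy as np

def empirical_risk(f, xs, ys, loss=lambda y, y_hat: (y - y_hat) ** 2):
    # Average loss of the candidate function f over the observed data.
    return np.mean([loss(y, f(x)) for x, y in zip(xs, ys)])

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2 * xs + 1
f = lambda x: 2 * x               # a candidate function from the hypothesis space
print(empirical_risk(f, xs, ys))  # 1.0, since every prediction is off by exactly 1
```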

Hypothesis space

The hypothesis space is the set of functions that the model can learn. Among all those, we want to find the function that minimises the risk function, i.e.,

$$f^* = \arg \min_{f \in \mathcal{H}} R(f) = \arg \min_{f \in \mathcal{H}} \mathbb{E}_{X, Y} \left[ \mathcal{L}(Y, f(X)) \right] .$$

Parameters

In practice, functions are often explicitly parameterised by some parameters $\theta \in \Theta$. In that case, what we are looking for is the function $f_{\theta^*}$, among all possible $f_\theta$, that minimises the risk function, i.e.,

$$\theta^* = \arg \min_{\theta \in \Theta} R(f_\theta) = \arg \min_{\theta \in \Theta} \mathbb{E}_{X, Y} \left[ \mathcal{L}(Y, f_\theta(X)) \right] .$$

Empirical risk minimisation

Since we cannot directly minimise the true risk function (we can't even compute it), we settle for the next best thing, which is to minimise the empirical risk function. In other words, we want to find the function $\hat{f}$ such that

$$\hat{f} = \arg \min_{f \in \mathcal{H}} \hat{R}(f) = \arg \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \mathcal{L}(y_i, f(x_i)) .$$
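A minimal sketch (my own illustration) of empirical risk minimisation for the hypothesis class of linear functions $f_\theta(x) = \theta^T x$ with squared loss, which admits a closed-form least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                    # n = 50 inputs in R^3
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(50)  # noisy labels

# argmin_theta (1/n) sum_i (y_i - theta^T x_i)^2
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                                    # close to theta_true
```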

The three choices

Given this framework, approaches differ based on the following three choices:

Example hypothesis classes

Example loss functions

Overfitting

We need to determine how complex the hypothesis class should be. In fact, if we leave it unbounded, we very quickly run into the problem of overfitting.

Take, as an example, the function

$$f(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i \in \{1, \ldots, n\} \\ 0 & \text{otherwise} \end{cases}$$

where $y_i$ is the training output associated with the input $x_i$. It's easy to see that this function has zero empirical risk, but it's completely useless for any new input.
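A tiny sketch (my own illustration) of such a memorising predictor:

```python
train = {0.0: 1.0, 1.0: 3.0, 2.0: 5.0}  # training pairs x_i -> y_i

def f(x):
    # Return the memorised label for seen inputs, 0 otherwise.
    return train.get(x, 0.0)

print([(f(x) - y) ** 2 for x, y in train.items()])  # [0.0, 0.0, 0.0]: zero empirical risk
print(f(1.5))                                       # 0.0, regardless of the true output
```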

Regularisation

To prevent overfitting, we can add a regularisation term to the empirical risk function. This term penalises the complexity of the function, so that the optimisation algorithm will prefer simpler functions. The optimisation problem becomes

$$\hat{f} = \arg \min_{f \in \mathcal{H}} \hat{R}(f) + r(f) = \arg \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \mathcal{L}(y_i, f(x_i)) + r(f)$$

where $r(f)$ is the regularisation term.

When the model is explicitly parameterised, larger parameter values usually correspond to a more complex model. Therefore, we can apply a similar regularisation that penalises the magnitude of the parameters, hence looking for the simpler model.
Something like:

$$\theta^* = \arg \min_{\theta \in \Theta} R(f_\theta) + \lambda r(\theta)$$

This is also known as shrinkage, since we are trying to shrink the parameters towards zero. We can regulate the amount of shrinkage by tuning the regularisation parameter $\lambda$, which is itself a hyperparameter.
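A minimal sketch (my own illustration) of L2 shrinkage, i.e. ridge regression with $r(\theta) = \| \theta \|^2$: as $\lambda$ grows, the estimated parameters shrink towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(50)

def ridge(X, y, lam):
    n, p = X.shape
    # Closed form of argmin_theta (1/n) ||y - X theta||^2 + lam * ||theta||^2.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

for lam in (0.0, 1.0, 100.0):
    print(lam, np.linalg.norm(ridge(X, y, lam)))  # the norm shrinks as lambda grows
```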

Examples of regularisation

Resources