Kernel Mean Embedding of Distributions

Definitions Math

Vector space

A vector space over a field $F$ is a non empty set $V$ together with a binary operation $+$ and binary function $\cdot$ that satisfy some properties.
More precisely, $+ : V \times V \to V$ (vector addition) and $\cdot : F \times V \to V$ (scalar multiplication). Elements in $F$ are called scalars and elements in $V$ are called vectors.

Given $u, v, w \in V$ and $\alpha, \beta \in F$ , the following properties must hold:

Associativity of $+$ : $u + (v + w) = (u + v) + w$
Commutativity of $+$ : $u + v = v + u$
Identity element of $+$ : There exists an element $0 \in V$ such that $v + 0 = v$
Inverse element of $+$ : For each $v \in V$ there exists an element $-v \in V$ such that $v + (-v) = 0$
Compatibility of scalar multiplication with field multiplication: $\alpha (\beta v) = (\alpha \beta) v$
Identity element of scalar multiplication: $1 v = v$
Distributivity of scalar multiplication with respect to vector addition: $\alpha (u + v) = \alpha u + \alpha v$
Distributivity of scalar multiplication with respect to field addition: $(\alpha + \beta) v = \alpha v + \beta v$

Inner product space

An inner product space (or pre-Hilbert space) is a vector space $V$ over a field $F$ equipped with an inner product, which is a function $\langle \cdot, \cdot \rangle : V \times V \to F$ that satisfies the following properties:

Conjugate symmetry: $\langle u, v \rangle = \overline{\langle v, u \rangle}$
Linearity in the first argument: $\langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v, w \rangle$
Positive-definiteness: $\langle v, v \rangle \geq 0$ and $\langle v, v \rangle = 0$ if and only if $v = 0$

Inner products are particularly easy to use in $\mathbb{R}^n$ , as the symmetry does not require complex conjugation, making them linear in both arguments.
The standard inner product is defined as $\langle u, v \rangle = u^T v$ for $u, v \in \mathbb{R}^n$ .

Norm

The norm of a vector $v$ in a vector space $V$ is a function $\| \cdot \| : V \to \mathbb{R}$ that satisfies the following properties:

Non-negativity: $\| v \| \geq 0$ and $\| v \| = 0$ if and only if $v = 0$
Homogeneity: $\| \alpha v \| = |\alpha| \| v \|$
Triangle inequality: $\| u + v \| \leq \| u \| + \| v \|$ for all $u, v \in V$

The norm induced by an inner product is defined as $\| v \| = \sqrt{\langle v, v \rangle}$ .

Banach space

A Banach space is a complete normed vector space, meaning that every Cauchy sequence in the space converges to a limit in the space.

Hilbert space

Hilbert spaces are complete inner product spaces. Any finite inner product space is a Hilbert space, but the converse is not true.
Furthermore, note that Hilbert spaces are Banach spaces, as the norm is induced by the inner product, but the converse is not true.

In an Hilbert space we can define the concept of distance and angle. More precisely, given the inner product $\langle \cdot, \cdot \rangle$ and the norm $\| \cdot \| = \sqrt{\langle \cdot, \cdot \rangle}$ , we can define:

Distance: The distance between two vectors $u, v \in V$ , obtained as $\| u - v \|$
Angle: The angle between two vectors $u, v \in V$ , obtained as $\theta = \cos^{-1} \left( \frac{\langle u, v \rangle}{\| u \| \| v \|} \right)$ or, more simply, $\left \langle u, v \right \rangle = \| u \| \| v \| \cos \theta$ .

An Hilbert space also needs to be complete, meaning that every Cauchy sequence in the space converges to a limit in the space. A Cauchy sequence is a sequence of elements $x_1, x_2, \ldots$ such that for any $\epsilon > 0$ there exists an $N \in \mathbb{N}$ such that for all $n, m > N$ we have that $\| x_n - x_m \| < \epsilon$ . In other words, is the series converges to a point, it must be in the space.

Intuition: if through a series composed of elements of the space we can reach a point that is not in the space, then the space is not complete.

Kernel

Given a vector space $\mathcal{X}$ , we choose to apply a nonlinear transformation $\varphi : \mathcal{X} \to \mathcal{H}$ to map the input space into a Hilbert space $\mathcal{H}$ , where we can perform the inner product to compute the similarity between two points. The map $\varphi$ is called a feature map.
Therefore, we can define the kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathcal{H}$ as $k(x, x') \coloneqq \langle \varphi(x), \varphi(x') \rangle _\mathcal{H}$ .
Note that we are not imposing any condition on $\mathcal{X}$ .

The same kernel could be defined by different feature maps. For instance, considering $\mathcal{X} \in \mathbb{R}^p$ , both $\varphi(x)_1 = x$ and $\varphi(x)_2 = \begin{bmatrix} \frac{x}{\sqrt{2}} \\ \frac{x}{\sqrt{2}} \\ \end{bmatrix}$ are valid feature maps, mapping to $\mathcal{H} = \mathbb{R}^p$ and $\mathcal{H} = \mathbb{R}^{2p}$ respectively that denote the same kernel $k(x, x') = x^T x'$ .

Positive definite kernel

A symmetric kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if for any $n \in \mathbb{N}$ , any $x_1, \ldots, x_n \in \mathcal{X}$ , and any $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ . That is

\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) \geq 0

This is because

\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) \geq 0 = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle \varphi(x_i), \varphi(x_j) \rangle = \left\| \sum_{i=1}^n \alpha_i \varphi(x_i) \right\|^2 \geq 0

Notable kernels

Linear kernel: $k(x, x') = \langle x, x' \rangle = x^T x'$
Polynomial kernel: $k(x, x') = (x^T x')^d$

Kernel trick

The key observation to make is that we do not need to compute the feature map $\varphi$ explicitly, but only the kernel function $k$ to obtain the result of the inner product.
For instance, let us consider the polynomial kernel: $k(x, x')$ such that $\varphi(x) = \begin{bmatrix} x_1 & x_2 & \sqrt{2}x_1x_2 \end{bmatrix}^T$ if $x \in \mathbb{R}^2$ . Now, we could compute the each feature map and then the inner product $\langle \varphi(x), \varphi(x') \rangle$ . But since we know that this is equal to $k(x, x') = (x^T x')^2$ , we can skip the feature map computation and directly compute the kernel function, i.e., the inner product.

Note that the kernel trick is not always possible, as it depends on the kernel function and the feature map.

TODO: Mercer’s theorem

Reproducing kernel

Let $\mathcal{H}$ be a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$ defined over a non empty set $\mathcal{X}$ . A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$ if

$\forall x \in \mathcal{X}, \quad k_x \coloneqq k(x, \cdot) \in \mathcal{H}$
$\forall x \in \mathcal{X}, \forall f \in \mathcal{H}, \quad f(x) = \langle f, k_x \rangle$

If $\mathcal{H}$ has a kernel with these properties, it is called a reproducing kernel Hilbert space (RKHS).

Moreover, the following three statements are equivalent:

Reproducing kernel $f(x) = \langle f, k_x \rangle$
Kernel as innerproduct between features $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$
Positive definiteness of the kernel

Gram matrix

Given a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and a set of $n$ points $x_1, \ldots, x_n \in \mathcal{X}$ , we can define the Gram matrix as the matrix $K \in \mathbb{R}^{n \times n}$ such that $K_{ij} = k(x_i, x_j)$ .

K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{bmatrix}

Note that the Gram matrix is symmetric and positive semi-definite, as it inherits these properties from the kernel function.

TODO: Lp space

TODO: Sobolev space

Reproducing kernel Hilbert space (RKHS)

A Reproducing Kernel Hilbert Space (RKHS) is a Hilbert space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}$ with a reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(x, \cdot) \in \mathcal{H}$ and $f(x) = \langle f, k(x, \cdot) \rangle$ for all $f \in \mathcal{H}$ and $x \in \mathcal{X}$ .

What’s more is that we can find a basis of the RKHS $\mathcal{H}$ composed of the functions $k_{x_i}(x) = k(x, x_i)$ with $x_i \in \mathcal{X}$ . This way we can express any other function $f \in \mathcal{H}$ as a linear combination of the basis functions, i.e., $f(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot, x_i)$ .

Definition Machine Learning

Supervised vs Unsupervised learning

With Supervised learning the algorithm is given labeled data, i.e. $D = \{x_i, y_i\}^n_{i=1}$ , where $x_i$ is the input and $y_i$ is the output. The goal is to learn a function $f : \mathcal{X} \to \mathcal{Y}$ that maps the input to the output, so that we can predict the output for new inputs.

With Unsupervised learning the algorithm is given unlabeled data, i.e. $D = \{x_i\}^n_{i=1}$ , where $x_i$ is the input. The goal is to learn the underlying structure of the data, such as clustering or dimensionality reduction.

Discriminative vs Generative models

Discriminative models can make predictions, while Generative models try to explain how the data was generated.

In other words, discriminative models learn a mapping between the input space and the output space. The training usually involves using the data to minimise a loss function, and the prediction is made by applying the learned function to new inputs.

Generative models, on the other hand, construct a probabilistic model of how the data was generated. for instance, given some labeled data, we may construct a joint model $p(x, y)$ and then use Bayes’ rule to compute the conditional distribution $p(Y = y| X = x)$ . It can even be generative about the parameters of the model, such as in Bayesian inference.

	Discriminative models	Generative models
Pros	Simpler to directly solve the prediction	Allow to incorporate problem specific expertise
	Typically makes few assumptions	Provide explanation for how the data was generated
		Naturally provide uncertainty estimate
Con	Difficult to impart domain expertise	The problem in harder in general
	Typically lacks interpretability	Could contain unwanted assumptions
		Tend to require more fine tuning for the problem

Loss function

To evaluate the performance of a model, we need to define a loss function that quantifies the error between the predicted output and the true output. In general, such a function can be defined as

\mathcal{L} : \mathcal{Y} \times \mathcal{Y} \times \mathcal{X} \to \mathbb{R}^+

where $\mathcal{Y}$ is the output space and $\mathcal{X}$ is the input space and $\mathbb{R}^+$ is the set of non-negative real numbers. The function is configured as

\mathcal{L}(y, f{x}, x)

where $y$ is the true output, $f(x)$ is the predicted output, and $x$ is the input.

Since we usually do not care about the input space, we can simplify the notation to

\mathcal{L} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+

so that we can write

\mathcal{L}(y, f(x)) .

Risk function

The most common approach to the loss function is the risk function, which is the expected loss over the input space $\mathcal{X}$ and the output space $\mathcal{Y}$ when dealing with random variables $X$ and $X$ .

R(f) = \mathbb{E}_{X, Y} \left[ \mathcal{L}(Y, f(X)) \right] .

Ideally the risk function would use the real probability distribution, but since we do not have access to it and cannot afford to sample the infinite elements of the input space, we use the empirical risk function instead.

\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}(y_i, f(x_i)) .

If $f$ is chosen at random, the empirical risk function is an unbiased estimator of the risk function. Our goal is to find the function $f$ that minimises the risk function.

Hypothesis space

The hypothesis space is the set of functions that the model can learn. Among all those, we want to find the function that minimises the risk function, i.e.,

f^* = \arg \min_{f \in \mathcal{H}} R(f) = \arg \min_{f \in \mathcal{H}} \mathbb{E}_{X, Y} \left[ \mathcal{L}(Y, f(X)) \right] .

Hyperparameters

In practice, functions are oftern explicitly parameterised by some parameters $\theta \in \Theta$ . In that case, what we are looking for is the function $f_{\theta^*}$ among all possible $f_\theta$ that minimises the risk function, i.e.,

\theta^* = \arg \min_{\theta \in \Theta} R(f_\theta) = \arg \min_{\theta \in \Theta} \mathbb{E}_{X, Y} \left[ \mathcal{L}(Y, f_\theta(X)) \right] .

Empirical risk minimisation

Since we cannot directly minimise the true risk function (we can’t even compute it), we settle for the next best thing, which is to minimise the empirical risk function. In other words, we want to find the function $\hat{f}$ such that

\hat{f} = \arg \min_{f \in \mathcal{H}} \hat{R}(f) = \arg \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \mathcal{L}(y_i, f(x_i)) .

The three choices

Given this framework, approahces differ base on the following three choices:

The hypothesis class $\mathcal{H}$
The loss function $\mathcal{L}$
The optimisation algorithm to minimise the empirical risk

Example hypothesis classes

Linear functions: $\mathcal{H} = \{ f(x) = w^T x + b \}$ parametrised by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$
Linear functions with non linear mappings: $\mathcal{H} = \{ f(x) = w^T \varphi(x) + b \}$ parametrised by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ , where $\varphi : \mathcal{X} \to \mathcal{H}$ is a non-linear feature map
Deep learning: graph of parametrised differential mappings that starts from the input space and ends in the output space

Example loss functions

Squared loss: $\mathcal{L}(y, f(x)) = (y - f(x))^2$ $L (y, f (x)) = (y - f (x))^{2}$
- The optimal $f$ is $\mathbb{E}[Y|X=x]$
Absolute loss: $\mathcal{L}(y, f(x)) = |y - f(x)|$ $L (y, f (x)) = ∣ y - f (x) ∣$
- The optimal $f$ is the median of $Y|X=x$

Overfitting

We need to determine how complex the hypothesis class should be. In fact, if we leave it unbounded, we very quickly run in the problem of overfitting.

Take, as an example, the function

f(x) = \begin{cases} y_i & \text{if } x \in \{x_1, \ldots, x_n\} \\ 0 & \text{otherwise} \end{cases}

where $y_i$ is the output for the input $x_i$ we have in the training set. It’s easy to see that this function has zero empirical risk, but it’s completely useless for any new input.

Regularisation

To prevent overfitting, we can add a regularisation term to the empirical risk function. This term penalises the complexity of the function, so that the optimisation algorithm will prefer simpler functions. The optimisation problem becomes

\hat{f} = \arg \min_{f \in \mathcal{H}} \hat{R}(f) + r(f) = \arg \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \mathcal{L}(y_i, f(x_i)) + r(f)

where $r(f)$ is the regularisation term.

When we are dealing with hyperparameters, their value being higher usually means that the model is more complex. Therefore, we can apply a similar regularisation trying to minimise the hyperparameters, hence looking for the simpler model.
Something like:

\theta^* = \arg \min_{\theta \in \Theta} R(f_\theta) + \lambda r(\theta)

This is also known as shrinkage since we are trying to shrink the parameters towards zero. We can regulate the amount of shrinkage by tuning the regularisation parameter $\lambda$ , which becomes another hyperparameter.

Examples of regularisation

Ridge regression: $\lambda r(\theta) = \lambda \| w \|^2_2$