Definitions: Math
Vector space
A vector space over a field $F$ is a non-empty set $V$ together with a binary operation and a binary function that satisfy the properties below.
More precisely, $+ : V \times V \to V$ (vector addition) and $\cdot : F \times V \to V$ (scalar multiplication).
Elements of $F$ are called scalars and elements of $V$ are called vectors.
Given $u, v, w \in V$ and $a, b \in F$, the following properties must hold:
- Associativity of $+$: $u + (v + w) = (u + v) + w$
- Commutativity of $+$: $u + v = v + u$
- Identity element of $+$: there exists an element $0 \in V$ such that $v + 0 = v$ for all $v \in V$
- Inverse element of $+$: for each $v \in V$ there exists an element $-v \in V$ such that $v + (-v) = 0$
- Compatibility of scalar multiplication with field multiplication: $a(bv) = (ab)v$
- Identity element of scalar multiplication: $1v = v$, where $1$ is the multiplicative identity of $F$
- Distributivity of scalar multiplication with respect to vector addition: $a(u + v) = au + av$
- Distributivity of scalar multiplication with respect to field addition: $(a + b)v = av + bv$
Inner product space
An inner product space (or pre-Hilbert space) is a vector space $V$ over a field $F$ (either $\mathbb{R}$ or $\mathbb{C}$) equipped with an inner product, which is a function $\langle \cdot, \cdot \rangle : V \times V \to F$ that satisfies the following properties:
- Conjugate symmetry: $\langle u, v \rangle = \overline{\langle v, u \rangle}$
- Linearity in the first argument: $\langle a u + b w, v \rangle = a \langle u, v \rangle + b \langle w, v \rangle$
- Positive-definiteness: $\langle v, v \rangle \geq 0$, and $\langle v, v \rangle = 0$ if and only if $v = 0$
Inner products are particularly easy to use over $\mathbb{R}$, as conjugate symmetry reduces to plain symmetry (no complex conjugation is required), making the inner product linear in both arguments.
The standard inner product on $\mathbb{R}^n$ is defined as $\langle u, v \rangle = \sum_{i=1}^{n} u_i v_i = u^\top v$ for $u, v \in \mathbb{R}^n$.
Norm
A norm on a vector space $V$ is a function $\|\cdot\| : V \to \mathbb{R}$ that satisfies the following properties:
- Non-negativity: $\|v\| \geq 0$, and $\|v\| = 0$ if and only if $v = 0$
- Homogeneity: $\|a v\| = |a| \, \|v\|$ for all scalars $a$
- Triangle inequality: $\|u + v\| \leq \|u\| + \|v\|$ for all $u, v \in V$
The norm induced by an inner product is defined as $\|v\| = \sqrt{\langle v, v \rangle}$.
Banach space
A Banach space is a complete normed vector space, meaning that every Cauchy sequence in the space converges to a limit in the space.
Hilbert space
Hilbert spaces are complete inner product spaces.
Any finite-dimensional inner product space is a Hilbert space, but the converse is not true.
Furthermore, note that every Hilbert space is a Banach space, as the norm is induced by the inner product, but the converse is not true.
In a Hilbert space we can define the concepts of distance and angle. More precisely, given the inner product $\langle \cdot, \cdot \rangle$ and the induced norm $\|\cdot\|$, we can define:
- Distance: the distance between two vectors $u, v$, obtained as $d(u, v) = \|u - v\|$
- Angle: the angle $\theta$ between two vectors $u, v$, obtained as $\theta = \arccos\!\left(\frac{\langle u, v \rangle}{\|u\|\,\|v\|}\right)$ or, more simply, $\cos\theta = \frac{\langle u, v \rangle}{\|u\|\,\|v\|}$.
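As a quick illustration, here is a minimal sketch (assuming the standard inner product on $\mathbb{R}^n$ and NumPy; the helper names are chosen for this example) of the induced norm, distance, and angle:

```python
import numpy as np

def inner(u, v):
    # Standard inner product on R^n: <u, v> = sum_i u_i v_i
    return float(np.dot(u, v))

def norm(v):
    # Norm induced by the inner product: ||v|| = sqrt(<v, v>)
    return np.sqrt(inner(v, v))

def distance(u, v):
    # Distance induced by the norm: d(u, v) = ||u - v||
    return norm(u - v)

def angle(u, v):
    # Angle theta such that cos(theta) = <u, v> / (||u|| ||v||)
    cos_theta = inner(u, v) / (norm(u) * norm(v))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

u, v = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(distance(u, v))           # 1.0
print(np.degrees(angle(u, v)))  # 45.0
```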
A Hilbert space also needs to be complete, meaning that every Cauchy sequence in the space converges to a limit in the space. A Cauchy sequence is a sequence of elements $(x_n)_{n \in \mathbb{N}}$ such that for any $\varepsilon > 0$ there exists an $N$ such that for all $m, n > N$ we have $\|x_m - x_n\| < \varepsilon$. In other words, if the sequence converges to a point, that point must be in the space.
Intuition: if through a sequence composed of elements of the space we can reach a point that is not in the space, then the space is not complete.
Kernel
Given a vector space $\mathcal{X}$ (the input space), we choose to apply a nonlinear transformation to map the input space into a Hilbert space $\mathcal{H}$, where we can use the inner product to compute the similarity between two points.
The map $\phi : \mathcal{X} \to \mathcal{H}$ is called a feature map.
Therefore, we can define the kernel function as $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$.
Note that we are not imposing any condition on $\phi$.
The same kernel could be defined by different feature maps. For instance, considering $\mathcal{X} = \mathbb{R}$, both $\phi_1(x) = x$ and $\phi_2(x) = \left(\tfrac{x}{\sqrt{2}}, \tfrac{x}{\sqrt{2}}\right)$ are valid feature maps, mapping to $\mathbb{R}$ and $\mathbb{R}^2$ respectively, that define the same kernel $k(x, x') = x x'$.
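A minimal numerical check of this claim, assuming the one-dimensional example above (the names `phi1` and `phi2` are just for illustration):

```python
import numpy as np

def phi1(x):
    # Feature map into R: phi1(x) = x
    return np.array([x])

def phi2(x):
    # Feature map into R^2: phi2(x) = (x / sqrt(2), x / sqrt(2))
    return np.array([x / np.sqrt(2), x / np.sqrt(2)])

x, y = 3.0, -2.0
# Both feature maps induce the same kernel k(x, y) = x * y
print(np.dot(phi1(x), phi1(y)))  # -6.0
print(np.dot(phi2(x), phi2(y)))  # -6.0 (up to floating point error)
```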
Positive definite kernel
A symmetric kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if the associated quadratic form is non-negative for any $n \in \mathbb{N}$, any $x_1, \dots, x_n \in \mathcal{X}$, and any $c_1, \dots, c_n \in \mathbb{R}$. That is
$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \, k(x_i, x_j) \geq 0$$
This is because
$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \, k(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \, \langle \phi(x_i), \phi(x_j) \rangle = \left\langle \sum_{i=1}^{n} c_i \phi(x_i), \sum_{j=1}^{n} c_j \phi(x_j) \right\rangle = \left\| \sum_{i=1}^{n} c_i \phi(x_i) \right\|^2 \geq 0$$
Notable kernels
- Linear kernel: $k(x, x') = \langle x, x' \rangle = x^\top x'$
- Polynomial kernel: $k(x, x') = (\langle x, x' \rangle + c)^d$ for some $c \geq 0$ and degree $d \in \mathbb{N}$
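A minimal sketch of these two kernels (the constant $c$ and degree $d$ of the polynomial kernel are free parameters chosen here for illustration):

```python
import numpy as np

def linear_kernel(x, y):
    # k(x, y) = <x, y>
    return float(np.dot(x, y))

def polynomial_kernel(x, y, c=1.0, d=2):
    # k(x, y) = (<x, y> + c)^d
    return (float(np.dot(x, y)) + c) ** d

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(linear_kernel(x, y))                  # 4.0
print(polynomial_kernel(x, y, c=1.0, d=2))  # 25.0
```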
Kernel trick
The key observation to make is that we do not need to compute the feature map explicitly, but only the kernel function to obtain the result of the inner product.
For instance, let us consider the degree-2 polynomial kernel on $\mathbb{R}^2$: $k(x, x') = (\langle x, x' \rangle)^2$, whose feature map is $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ if $x = (x_1, x_2)$.
Now, we could compute each feature map $\phi(x)$ and $\phi(x')$ and then the inner product $\langle \phi(x), \phi(x') \rangle$.
But since we know that this is equal to $(\langle x, x' \rangle)^2$, we can skip the feature map computation and directly compute the kernel function, i.e., the inner product.
Note that the kernel trick is not always possible, as it depends on the kernel function and the feature map.
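To make the saving concrete, here is a sketch (assuming the degree-2 polynomial kernel on $\mathbb{R}^2$ from the example above) comparing the explicit feature-map route with the direct kernel evaluation:

```python
import numpy as np

def phi(x):
    # Explicit feature map for k(x, x') = (<x, x'>)^2 on R^2
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(x, y):
    # Kernel trick: evaluate (<x, y>)^2 directly, no feature map needed
    return float(np.dot(x, y)) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
explicit = np.dot(phi(x), phi(y))  # inner product computed in feature space
direct = poly2_kernel(x, y)        # same value via the kernel function
print(explicit, direct)            # both equal 1.0
```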
TODO: Mercer’s theorem
Reproducing kernel
Let $\mathcal{H}$ be a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$ defined over a non-empty set $\mathcal{X}$. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$ if
- $k(\cdot, x) \in \mathcal{H}$ for all $x \in \mathcal{X}$
- $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$ and $x \in \mathcal{X}$ (the reproducing property)
If $\mathcal{H}$ has a kernel with these properties, it is called a reproducing kernel Hilbert space (RKHS).
Moreover, the following three statements are equivalent:
- Reproducing kernel
- Kernel as inner product between features
- Positive definiteness of the kernel
Gram matrix
Given a kernel function $k$ and a set of points $x_1, \dots, x_n \in \mathcal{X}$, we can define the Gram matrix as the $n \times n$ matrix $K$ such that $K_{ij} = k(x_i, x_j)$.
Note that the Gram matrix is symmetric and positive semi-definite, as it inherits these properties from the kernel function.
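A minimal sketch (reusing the polynomial kernel above on a few arbitrary points) that builds the Gram matrix and checks symmetry and positive semi-definiteness through its eigenvalues:

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    # k(x, y) = (<x, y> + c)^d
    return (float(np.dot(x, y)) + c) ** d

# A small, arbitrary set of points in R^2
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
n = len(X)

# Gram matrix K with K_ij = k(x_i, x_j)
K = np.array([[poly_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

print(np.allclose(K, K.T))                     # True: symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))  # True: positive semi-definite
```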
TODO: Lp space
TODO: Sobolev space
Reproducing kernel Hilbert space (RKHS)
A Reproducing Kernel Hilbert Space (RKHS) is a Hilbert space $\mathcal{H}$ of functions with a reproducing kernel $k$ such that $k(\cdot, x) \in \mathcal{H}$ and $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$ and $x \in \mathcal{X}$.
What's more, we can find a basis of the RKHS composed of the functions $k(\cdot, x)$ with $x \in \mathcal{X}$. This way we can express any other function $f \in \mathcal{H}$ as a linear combination of the basis functions, i.e., $f(\cdot) = \sum_{i} \alpha_i \, k(\cdot, x_i)$.
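The sketch below illustrates this representation, assuming the polynomial kernel from before and some arbitrary coefficients `alpha`: a function built from finitely many kernel sections is stored as pairs $(\alpha_i, x_i)$ and evaluated through the kernel.

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    # k(x, y) = (<x, y> + c)^d
    return (float(np.dot(x, y)) + c) ** d

# f = sum_i alpha_i * k(., x_i) is fully described by the
# coefficients alpha_i and the points x_i it is built on.
points = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
alpha = np.array([0.5, -1.0, 2.0])  # arbitrary coefficients for illustration

def f(x):
    # Evaluate f(x) = sum_i alpha_i * k(x, x_i)
    return sum(a * poly_kernel(x, xi) for a, xi in zip(alpha, points))

print(f(np.array([1.0, 0.0])))  # 14.5
```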
Definitions: Machine Learning
Supervised vs Unsupervised learning
With Supervised learning the algorithm is given labeled data, i.e. $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the input and $y_i$ is the output. The goal is to learn a function $f$ that maps the input to the output, so that we can predict the output for new inputs.
With Unsupervised learning the algorithm is given unlabeled data, i.e. $\{x_i\}_{i=1}^{n}$, where $x_i$ is the input. The goal is to learn the underlying structure of the data, such as clustering or dimensionality reduction.
Discriminative vs Generative models
Discriminative models can make predictions, while Generative models try to explain how the data was generated.
In other words, discriminative models learn a mapping between the input space and the output space. The training usually involves using the data to minimise a loss function, and the prediction is made by applying the learned function to new inputs.
Generative models, on the other hand, construct a probabilistic model of how the data was generated. For instance, given some labeled data, we may construct a joint model $p(x, y)$ and then use Bayes' rule to compute the conditional distribution $p(y \mid x)$. The model can even be generative about its own parameters, as in Bayesian inference.
| | Discriminative models | Generative models |
|---|---|---|
| Pros | Simpler to directly solve the prediction problem | Allow incorporating problem-specific expertise |
| | Typically make few assumptions | Provide an explanation for how the data was generated |
| | | Naturally provide uncertainty estimates |
| Cons | Difficult to impart domain expertise | The problem is harder in general |
| | Typically lack interpretability | Could contain unwanted assumptions |
| | | Tend to require more fine-tuning for the problem |
Loss function
To evaluate the performance of a model, we need to define a loss function that quantifies the error between the predicted output and the true output. In general, such a function can be defined as
$$\ell : \mathcal{Y} \times \mathcal{Y} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$$
where $\mathcal{Y}$ is the output space, $\mathcal{X}$ is the input space, and $\mathbb{R}_{\geq 0}$ is the set of non-negative real numbers. The function is applied as $\ell(y, \hat{y}, x)$, where $y$ is the true output, $\hat{y}$ is the predicted output, and $x$ is the input.
Since we usually do not care about the input space, we can simplify the notation to
$$\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$$
so that we can write $\ell(y, \hat{y})$.
Risk function
The most common way to aggregate the loss is the risk function, which is the expected loss over the input space and the output space when dealing with the random variables $X$ and $Y$:
$$R(f) = \mathbb{E}_{X, Y}\big[\ell(Y, f(X))\big]$$
Ideally the risk function would use the true probability distribution, but since we do not have access to it and cannot afford to sample the infinitely many elements of the input space, we use the empirical risk function instead:
$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i))$$
If the sample $\{(x_i, y_i)\}_{i=1}^{n}$ is chosen at random, the empirical risk function is an unbiased estimator of the risk function. Our goal is to find the function that minimises the risk function.
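A minimal sketch of the empirical risk for a fixed predictor, assuming the squared loss and a small toy dataset:

```python
import numpy as np

def squared_loss(y, y_hat):
    # l(y, y_hat) = (y - y_hat)^2
    return (y - y_hat) ** 2

def empirical_risk(f, xs, ys, loss=squared_loss):
    # R_hat(f) = (1/n) * sum_i l(y_i, f(x_i))
    return np.mean([loss(y, f(x)) for x, y in zip(xs, ys)])

# Toy data and a simple candidate predictor
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.1, 2.9])
f = lambda x: x  # the identity function as a hypothesis

print(empirical_risk(f, xs, ys))  # 0.01
```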
Hypothesis space
The hypothesis space $\mathcal{F}$ is the set of functions that the model can learn. Among all those, we want to find the function that minimises the risk function, i.e.,
$$f^* = \arg\min_{f \in \mathcal{F}} R(f)$$
Hyperparameters
In practice, functions are often explicitly parameterised by some parameters $\theta \in \Theta$. In that case, what we are looking for is the function $f_\theta$, among all possible $\theta \in \Theta$, that minimises the risk function, i.e.,
$$\theta^* = \arg\min_{\theta \in \Theta} R(f_\theta)$$
Empirical risk minimisation
Since we cannot directly minimise the true risk function (we can't even compute it), we settle for the next best thing, which is to minimise the empirical risk function. In other words, we want to find the function $\hat{f}$ such that
$$\hat{f} = \arg\min_{f \in \mathcal{F}} \hat{R}(f) = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i))$$
The three choices
Given this framework, approaches differ based on the following three choices:
- The hypothesis class
- The loss function
- The optimisation algorithm to minimise the empirical risk
Example hypothesis classes
- Linear functions: $f(x) = w^\top x + b$, parametrised by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$
- Linear functions with non-linear mappings: $f(x) = w^\top \phi(x) + b$, parametrised by $w$ and $b$, where $\phi$ is a non-linear feature map
- Deep learning: a graph of parametrised differentiable mappings that starts from the input space and ends in the output space
Example loss functions
- Squared loss: $\ell(y, \hat{y}) = (y - \hat{y})^2$
  - The optimal $\hat{y}$ is $\mathbb{E}[Y \mid X = x]$
- Absolute loss: $\ell(y, \hat{y}) = |y - \hat{y}|$
  - The optimal $\hat{y}$ is the median of $Y \mid X = x$
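A quick numerical illustration of these two facts, assuming a fixed sample of outputs and a grid of candidate constant predictions:

```python
import numpy as np

ys = np.array([1.0, 2.0, 2.0, 3.0, 10.0])  # sample of outputs
candidates = np.linspace(0.0, 12.0, 1201)  # candidate constant predictions

# The constant minimising the average squared loss is the mean
sq_best = candidates[np.argmin([np.mean((ys - c) ** 2) for c in candidates])]
# The constant minimising the average absolute loss is the median
abs_best = candidates[np.argmin([np.mean(np.abs(ys - c)) for c in candidates])]

print(sq_best, np.mean(ys))     # 3.6  3.6
print(abs_best, np.median(ys))  # 2.0  2.0
```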
Overfitting
We need to determine how complex the hypothesis class should be. In fact, if we leave it unbounded, we very quickly run into the problem of overfitting.
Take, as an example, the function
$$f(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some training pair } (x_i, y_i) \\ 0 & \text{otherwise} \end{cases}$$
where $y_i$ is the output for the input $x_i$ we have in the training set. It's easy to see that this function has zero empirical risk, but it's completely useless for any new input.
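A sketch of this memorising function in code, assuming the fallback value $0$ for unseen inputs as in the definition above:

```python
# The training set, memorised verbatim as input -> output
train = {0.0: 0.1, 1.0: 0.9, 2.0: 2.1}

def memoriser(x):
    # Returns the stored output if x was seen during training, 0 otherwise:
    # zero empirical risk on the training set, useless on new inputs.
    return train.get(x, 0.0)

print(memoriser(1.0))  # 0.9 (seen during training)
print(memoriser(1.5))  # 0.0 (arbitrary for an unseen input)
```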
Regularisation
To prevent overfitting, we can add a regularisation term to the empirical risk function. This term penalises the complexity of the function, so that the optimisation algorithm will prefer simpler functions. The optimisation problem becomes
$$\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)) + \lambda\, \Omega(f)$$
where $\Omega(f)$ is the regularisation term.
When a model is explicitly parameterised, parameters with larger values usually correspond to a more complex model.
Therefore, we can apply a similar regularisation that penalises large parameter values, hence looking for the simpler model.
Something like:
$$\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_\theta(x_i)) + \lambda \|\theta\|^2$$
This is also known as shrinkage, since we are trying to shrink the parameters towards zero. We can regulate the amount of shrinkage by tuning the regularisation parameter $\lambda$, which becomes another hyperparameter.
Examples of regularisation
- Ridge regression: $\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2$
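A minimal sketch of ridge regression via its closed-form solution, assuming the objective above (with the $\tfrac{1}{n}$ factor), a synthetic dataset, and no intercept term:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Minimises (1/n) * sum_i (y_i - w^T x_i)^2 + lam * ||w||^2,
    # whose closed-form solution is w = (X^T X + n * lam * I)^-1 X^T y
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Synthetic data: y = 2*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)

print(ridge_fit(X, y, lam=0.01))  # close to [2, -1]
print(ridge_fit(X, y, lam=10.0))  # shrunk towards zero
```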