Error analysis

Math quiz

How much is

1 + 2 = \mathord{?}

1 + 2 = 3
print(1 + 2)  # 3

Math quiz

How much is

1.0 + 2.0 = \mathord{?}

1.0 + 2.0 = 3.0
print(1.0 + 2.0)  # 3.0

Math quiz

How much is

0.1 + 0.2 = \mathord{?}

0.1 + 0.2 = 0.3
print(0.1 + 0.2)  # 0.30000000000000004

Number representation

How do we represent numbers in a computer?

Computers only store bits, usually in groups of 8, 16, 32, or 64.

It is natural to represent integers in binary.

\underbrace{0} _ {\min} \le \underbrace{10111_2 = 23_{10}} _ {\text{value}} \le \underbrace{11\underset{27}{\cdots}11_2 = 2147483647_{10}} _ {\max}
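
For instance, Python's built-in conversions make the correspondence easy to check (a quick sketch):

print(bin(23))          # 0b10111
print(int("10111", 2))  # 23
print(2**31 - 1)        # 2147483647, the largest signed 32-bit integer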

Fixed-point

What about fractions?

A natural way is to use fixed-point numbers.

\begin{gather*} \underbrace{0} _ {\min} \le \underbrace{10111.101_2 = 23.625_{10}} _ {\text{value}} \le \\ \le \underbrace{ 11\underset{11}{\cdots}11.11\underset{12}{\cdots}11_2 = 32767.99998474121 _{10}} _ {\max} \end{gather*}
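
A minimal fixed-point sketch (the 16-bit fractional part below is an arbitrary choice): a value is stored as the integer round(x * 2**FRAC_BITS), so arithmetic reduces to plain integer operations.

FRAC_BITS = 16

def to_fixed(x):
    return round(x * 2**FRAC_BITS)

def from_fixed(i):
    return i / 2**FRAC_BITS

a = to_fixed(23.625)      # 1548288
b = to_fixed(1.5)         # 98304
print(from_fixed(a + b))  # 25.125 -- addition is exact integer addition on the scaled values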

Pros and cons of fixed-point numbers

  • Pros

    • Stable error
    • Simple operations
  • Cons

    • Extremely limited range
    • Low precision

Floating-point

The IEEE 754 standard defines floating-point numbers.

\begin{gather*} \underbrace{0} _ {\min} \le \underbrace{1.101_2 \times 2^{2} = 6.5_{10}} _ {\text{value}} \le \\ \le \underbrace{ 1.11\underset{19}{\cdots}11_2 \times 2^{127} = (2 - 2^{-23}) _ {10} \times 2^{127}} _ {\max} \end{gather*}
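
A small sketch using the standard struct module to inspect the single-precision bit pattern of 6.5:

import struct

bits = struct.unpack(">I", struct.pack(">f", 6.5))[0]
print(f"{bits:032b}")  # 01000000110100000000000000000000
                       # sign 0 | exponent 10000001 (2 + 127) | mantissa 1010000...0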

Problems with floating-point

The number of bits is still finite.

Numbers with infinite decimals

Irrational numbers are not representable in a finite number of bits.

We must either round them (an approximation) or resort to symbolic computation.

Some fractions also create issues (think of 1/3): rationals whose denominator is not a power of two are called non-dyadic, and their binary expansion is infinite.
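
Python can show the rounding directly: Decimal reveals the exact double that is actually stored for 0.1, while Fraction keeps a non-dyadic rational such as 1/3 symbolic.

from decimal import Decimal
from fractions import Fraction

print(Decimal(0.1))           # 0.1000000000000000055511151231257827021181583404541015625
print(Fraction(1, 3))         # 1/3, kept exactly as a rational
print(float(Fraction(1, 3)))  # 0.3333333333333333, the nearest representable double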

Computation over floating point numbers

The standard dictates that the result of an operation is computed exactly and then rounded.

\begin{align*} a + b & \implies \textbf{fl}(a + b) = (a + b)(1 + \delta) \\ a - b & \implies \textbf{fl}(a - b) = (a - b)(1 + \delta)\\ a \times b & \implies \textbf{fl}(a \times b) = (a \times b)(1 + \delta) \\ a \div b & \implies \textbf{fl}(a \div b) = (a \div b)(1 + \delta) \end{align*}

with |\delta| \le \epsilon.

Overflows and underflows are also possible, but they are detectable.
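
A quick check of this model in double precision (a sketch assuming the usual unit roundoff 2^{-53}):

import sys

eps = 2.0 ** -53                # unit roundoff of IEEE 754 double precision
print(1.0 + 1e-16 == 1.0)       # True: the exact sum is rounded back to 1.0, |delta| <= eps
print(sys.float_info.max * 10)  # inf -- overflow is signalled by a special value
print(5e-324 / 2)               # 0.0 -- gradual underflow eventually reaches zero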

Errors in computation

Suppose the exponent has 4 bits and the mantissa 7 (+1 implicit) bits.

\underbrace{1.1111100_2 \times 2^{7}} _ {252_{10}} + \underbrace{1.1_2} _ {1.5_{10}} = \ ?

\begin{array}{r} 11111100.0\ + \\ 1.1\ = \\ \hline 11111101.1 \end{array}

The result would need 8 (+1) mantissa bits to be represented exactly.

1.\underbrace{1111101} _ {\text{stored}}1_2 \times 2^{7} = 253.0 \ne 252 + 1.5 = 253.5
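
The same absorption happens in real formats; a sketch in single precision (24-bit significand, assuming NumPy is available):

import numpy as np

a = np.float32(2 ** 24)     # 16777216, already uses all 24 significand bits
b = np.float32(1.5)
print(a + b)                # 16777218.0: the exact sum 16777217.5 cannot be stored
print(float(a) + float(b))  # 16777217.5 when redone in double precision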

Errors propagate

Errors propagate through operations.

\textbf{fl}(a + b + c) = ((a + b)(1 + \delta_1) + c)(1 + \delta_2)

The exact value depends on the order of operations, making floating point arithmetic non-associative.
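
For example, regrouping the same three values changes the computed result:

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
print((0.1 + 0.2) + 0.3)                        # 0.6000000000000001
print(0.1 + (0.2 + 0.3))                        # 0.6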

Error analysis

Error analysis is the study of how floating-point computations affect our results.

Forward error analysis vs. backward error analysis: the forward error measures how far the computed output is from the exact output, while the backward error measures how small a perturbation of the input would make the computed output the exact output of the perturbed input.

Errors in inner products

The inner product of two column vectors is

\hat{s} = \textbf{fl}(x^Ty) = \textbf{fl}\left(\sum_{i=1}^{n} x_i y_i\right)

\hat{s} = x_1y_1(1 + \delta_{1\times})\prod^n_{i = 2}(1 + \delta_i) + \sum_{i=2}^{n} x_i y_i(1 + \delta_{i\times})\prod^n_{j = i}(1 + \delta_j)

where \delta_{i\times} is the rounding error of the i-th multiplication and \delta_i that of the i-th addition.

Typical notation

\begin{align*} &\prod^n_{i = 1}(1 + \delta_i)^{p_i} = 1 + \theta_n \\ &|\theta_n| \le \frac{n\epsilon}{1 - n\epsilon} \eqqcolon \gamma_n \end{align*}

Error bound

|\textbf{fl}(x^Ty) - x^Ty| \le \gamma_n\, |x|^T|y|

Using the backward error analysis,

\textbf{fl}(x^Ty) = (x + \Delta x)^Ty = x^T(y + \Delta y)

with

|\Delta x| \le \gamma_n|x|, \quad |\Delta y| \le \gamma_n|y|
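
A numerical check of the forward bound in single precision (a sketch: the exact products and sums are accumulated with Fraction, and \gamma_n uses the single-precision unit roundoff):

import numpy as np
from fractions import Fraction

rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal(n).astype(np.float32)
y = rng.standard_normal(n).astype(np.float32)

s_hat = np.float32(0.0)
for xi, yi in zip(x, y):             # fl(x^T y): every product and sum is rounded
    s_hat = np.float32(s_hat + xi * yi)

terms = [Fraction(float(xi)) * Fraction(float(yi)) for xi, yi in zip(x, y)]
s_exact = sum(terms)
gamma_n = n * 2.0**-24 / (1 - n * 2.0**-24)
bound = gamma_n * float(sum(abs(t) for t in terms))
print(abs(float(s_hat) - float(s_exact)) <= bound)  # True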

LU decomposition

Compute the LU factorisation of

A = \begin{bmatrix} \epsilon & -1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ l_{21} & 1 \end{bmatrix} \begin{bmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{bmatrix}

u_{11} = \epsilon, \quad u_{12} = -1, \quad l_{21} = \epsilon^{-1}, \quad u_{22} = 1 + \epsilon^{-1} \approx \epsilon^{-1}

\hat{L}\hat{U} = \begin{bmatrix} \epsilon & -1 \\ 1 & 0 \end{bmatrix}

Rounding u_{22} to \epsilon^{-1} changes the (2, 2) entry of A from 1 to 0: with a tiny pivot \epsilon and no pivoting, the factorisation is not backward stable.
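
A sketch of the same phenomenon in double precision, performing the elimination step by hand without pivoting (plain NumPy, no library factorisation):

import numpy as np

eps = 1e-20
A = np.array([[eps, -1.0],
              [1.0,  1.0]])

l21 = A[1, 0] / A[0, 0]        # 1e20
u22 = A[1, 1] - l21 * A[0, 1]  # 1 + 1e20 rounds to 1e20
L_hat = np.array([[1.0, 0.0], [l21, 1.0]])
U_hat = np.array([[A[0, 0], A[0, 1]], [0.0, u22]])

print(L_hat @ U_hat)           # [[1e-20 -1.], [1. 0.]] -- the 1 in A[1, 1] has been lost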

Notable results

Simplex algorithm

The simplex method is the most widely used algorithm for solving linear programming (LP) problems.

\begin{align*} \text{maximize} \quad & c^Tx \\ \text{subject to} \quad & Ax = b \\ & x \ge 0 \end{align*}

with A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m, and c \in \mathbb{R}^n.

We define a basis B of m columns of A. It must be non-singular.

Revised simplex

The revised simplex underlies the most common implementations.

  1. Find an initial feasible solution (auxiliary LP).
  2. Primal solution x_B = B^{-1}b.
  3. Dual solution y_B = B^{-T}c_B.
  4. Entering variable e = \arg\max_{r_e > 0}(r_e \coloneqq c_e - A_e^Ty_B).
  5. Coefficient d = B^{-1}A_e.
  6. Leaving variable l = \arg\min_{d_i > 0}(x_{B_i}/d_i).
  7. Update B = \begin{bmatrix} B_1 & \cdots & B_{l-1} & B_{l+1} & \cdots & B_m & A_e \end{bmatrix}.

Exit conditions

The steps of the simplex loop until either no reduced cost r_e is positive (the current basis is optimal) or the entering column has no positive component d_i (the problem is unbounded).

Bartels–Golub decomposition

Each iteration solves three linear systems involving B. Why not use an LU decomposition?

Keep LL fixed and only update UU.

H^{(i)} = \begin{bmatrix} U^{(i-1)}_1 & \cdots & U^{(i-1)}_{l_{i-1} - 1} & U^{(i-1)}_{l_{i-1} + 1} & \cdots & U^{(i-1)}_m & L^{-1}A_e \end{bmatrix}

Bartels–Golub decomposition

We need to re-triangularise H^{(i)}.

\begin{align*} B^{(i)} &= LH^{(i)} \\ &= LC^{(1)}C^{(2)}\cdots C^{(i)}\underbrace{C^{(i)-1}C^{(i-1)-1}\cdots C^{(1)-1}} _ {\text{Gaussian elimination}}H^{(i)} \\ &= L\underbrace{C^{(1)}C^{(2)}\cdots C^{(i)}} _ {G^{(i)}}U^{(i)} \\ &= LG^{(i)}U^{(i)} \end{align*}
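
A rough sketch of a single update (assuming SciPy is available; the structured eliminations C^{(k)} are replaced by a generic re-triangularisation, so this only illustrates the shape of the computation, not the efficient update):

import numpy as np
from scipy.linalg import lu, solve_triangular

def replace_column(L, U, l, a_e):
    """Drop column l of U, append L^{-1} a_e, and re-triangularise."""
    h = solve_triangular(L, a_e, lower=True)           # L^{-1} A_e, the new last column
    H = np.column_stack([np.delete(U, l, axis=1), h])
    G, U_new = lu(H, permute_l=True)                   # H = G U_new, so B_new = L G U_new
    return G, U_new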

Stability

It can be proven that the Bartels–Golub decomposition is backward stable.

As long as the number of updates and the growth of the factors stay moderate, the error remains bounded.

At some point it becomes more convenient to recompute the decomposition.

Exact floating point simplex

\exists\, \tau, \epsilon \in \mathbb{R}, \ \tau, \epsilon > 0, such that

import math
import numpy as np

# bartels_golub and solve are assumed from the previous sections.
def simplex(A, b, c, B_idx, tolerance, epsilon):
    m, n = A.shape
    for _ in range(math.perm(n, m)):  # at most n!/(n-m)! bases can be visited
        B = A[:, B_idx]
        L, G, U = bartels_golub(B)
        x_B = solve(L, G, U, b)                  # primal solution x_B = B^{-1} b
        y_B = solve(U.T, G.T, L.T, c[B_idx])     # dual solution y_B = B^{-T} c_B
        r = c - A.T @ y_B                        # reduced costs
        if np.all(r <= tolerance): return x_B    # Optimal
        e = np.argmax(r)                         # entering variable
        d = solve(L, G, U, A[:, e])
        if np.all(d <= tolerance): return x_B    # Unbounded
        ratios = np.full(m, np.inf)
        ratios[d > tolerance] = x_B[d > tolerance] / d[d > tolerance]
        l = np.argmin(ratios)                    # leaving variable
        B_idx[l] = e

behaves exactly.

Takeaways

Resources