Error analysis

Math quiz

How much is

1 + 2 = \mathord{?}

1 + 2 = 3
print(1 + 2)  # 3

Math quiz

How much is

1.0 + 2.0 = \mathord{?}

1.0 + 2.0 = 3.0
print(1.0 + 2.0)  # 3.0

Math quiz

How much is

0.1 + 0.2 = \mathord{?}

0.1 + 0.2 = 0.3
print(0.1 + 0.2)  # 0.30000000000000004

Number representation

How do we represent numbers in a computer?

Computers only store bits, usually in groups of 8, 16, 32, or 64.

It is natural to represent integers in binary.

\underbrace{0} _ {\min} \le \underbrace{10111_2 = 23_{10}} _ {\text{value}} \le \underbrace{11\underset{27}{\cdots}11_2 = 2147483647_{10}} _ {\max}
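
For instance, Python's built-in conversions make the correspondence easy to check (a quick sketch):

print(bin(23))          # 0b10111
print(int("10111", 2))  # 23
print(2**31 - 1)        # 2147483647, the largest signed 32-bit integer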

Fixed-point

What about fractions?

A natural way is to use fixed-point numbers.

\begin{gather*} \underbrace{0} _ {\min} \le \underbrace{10111.101_2 = 23.625_{10}} _ {\text{value}} \le \\ \le \underbrace{ 11\underset{11}{\cdots}11.11\underset{12}{\cdots}11_2 = 32767.99998474121 _{10}} _ {\max} \end{gather*}
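
A minimal fixed-point sketch (the 16-bit fractional part below is an arbitrary choice): a value is stored as the integer round(x * 2**FRAC_BITS), so arithmetic reduces to plain integer operations.

FRAC_BITS = 16

def to_fixed(x):
    return round(x * 2**FRAC_BITS)

def from_fixed(i):
    return i / 2**FRAC_BITS

a = to_fixed(23.625)      # 1548288
b = to_fixed(1.5)         # 98304
print(from_fixed(a + b))  # 25.125 -- addition is exact integer addition on the scaled values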

Pros and cons of fixed-point numbers

  • Pros

    • Stable error
    • Simple operations
  • Cons

    • Extremely limited range
    • Low precision

Floating-point

The IEEE 754 standard defines floating-point numbers.

\begin{gather*} \underbrace{0} _ {\min} \le \underbrace{1.101_2 \times 2^{2} = 6.5_{10}} _ {\text{value}} \le \\ \le \underbrace{ 1.11\underset{19}{\cdots}11_2 \times 2^{127} = (2 - 2^{-23}) _ {10} \times 2^{127}} _ {\max} \end{gather*}
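
A small sketch using the standard struct module to inspect the single-precision bit pattern of 6.5:

import struct

bits = struct.unpack(">I", struct.pack(">f", 6.5))[0]
print(f"{bits:032b}")  # 01000000110100000000000000000000
                       # sign 0 | exponent 10000001 (2 + 127) | mantissa 1010000...0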

Problems with floating-point

The number of bits is still finite.

Numbers with infinite decimals

Irrational numbers are not representable in a finite number of bits.

We must either round them (an approximation) or resort to symbolic computation.

Some fractions also create issues (think of 1/3): rationals whose denominator is not a power of two are called non-dyadic, and their binary expansion is infinite.
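
Python can show the rounding directly: Decimal reveals the exact double that is actually stored for 0.1, while Fraction keeps a non-dyadic rational such as 1/3 symbolic.

from decimal import Decimal
from fractions import Fraction

print(Decimal(0.1))           # 0.1000000000000000055511151231257827021181583404541015625
print(Fraction(1, 3))         # 1/3, kept exactly as a rational
print(float(Fraction(1, 3)))  # 0.3333333333333333, the nearest representable double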

Computation over floating point numbers

The standard dictates that the result of an operation is computed exactly and then rounded.

\begin{align*} a + b & \implies \textbf{fl}(a + b) = (a + b)(1 + \delta) \\ a - b & \implies \textbf{fl}(a - b) = (a - b)(1 + \delta)\\ a \times b & \implies \textbf{fl}(a \times b) = (a \times b)(1 + \delta) \\ a \div b & \implies \textbf{fl}(a \div b) = (a \div b)(1 + \delta) \end{align*}

with |\delta| \le \epsilon.

Overflows and underflows are also possible, but they are detectable.
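
A quick check of this model in double precision (a sketch assuming the usual unit roundoff 2^{-53}):

import sys

eps = 2.0 ** -53                # unit roundoff of IEEE 754 double precision
print(1.0 + 1e-16 == 1.0)       # True: the exact sum is rounded back to 1.0, |delta| <= eps
print(sys.float_info.max * 10)  # inf -- overflow is signalled by a special value
print(5e-324 / 2)               # 0.0 -- gradual underflow eventually reaches zero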

Errors in computation

Suppose the exponent has 4 bits and the mantissa 7 (+1 implicit) bits.

\underbrace{1.1111100_2 \times 2^{7}} _ {252_{10}} + \underbrace{1.1_2} _ {1.5_{10}} = \ ?

\begin{array}{r} 11111100.0\ + \\ 1.1\ = \\ \hline 11111101.1 \end{array}

The result would need 8 (+1) mantissa bits to be represented exactly.

1.\underbrace{1111101} _ {\text{stored}}1_2 \times 2^{7} = 253.0 \ne 252 + 1.5 = 253.5
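
The same absorption happens in real formats; a sketch in single precision (24-bit significand, assuming NumPy is available):

import numpy as np

a = np.float32(2 ** 24)     # 16777216, already uses all 24 significand bits
b = np.float32(1.5)
print(a + b)                # 16777218.0: the exact sum 16777217.5 cannot be stored
print(float(a) + float(b))  # 16777217.5 when redone in double precision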

Errors propagate

Errors propagate through operations.

\textbf{fl}(a + b + c) = ((a + b)(1 + \delta_1) + c)(1 + \delta_2)

The exact value depends on the order of operations, making floating point arithmetic non-associative.
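
For example, regrouping the same three values changes the computed result:

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
print((0.1 + 0.2) + 0.3)                        # 0.6000000000000001
print(0.1 + (0.2 + 0.3))                        # 0.6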

Error analysis

Error analysis is the study of how floating-point computations affect our results.

Forward error analysis vs. backward error analysis: the forward error measures how far the computed output is from the exact output, while the backward error measures how small a perturbation of the input would make the computed output the exact output of the perturbed input.

Errors in inner products

The inner product of two column vectors is

\hat{s} = \textbf{fl}(x^Ty) = \textbf{fl}\left(\sum_{i=1}^{n} x_i y_i\right)

\hat{s} = x_1y_1(1 + \delta_{1\times})\prod^n_{i = 2}(1 + \delta_i) + \sum_{i=2}^{n} x_i y_i(1 + \delta_{i\times})\prod^n_{j = i}(1 + \delta_j)

where \delta_{i\times} is the rounding error of the i-th multiplication and \delta_i that of the i-th addition.

Typical notation

\begin{align*} &\prod^n_{i = 1}(1 + \delta_i)^{p_i} = 1 + \theta_n \\ &|\theta_n| \le \frac{n\epsilon}{1 - n\epsilon} \eqqcolon \gamma_n \end{align*}

Error bound

|\textbf{fl}(x^Ty) - x^Ty| \le \gamma_n\, |x|^T|y|

Using the backward error analysis,

\textbf{fl}(x^Ty) = (x + \Delta x)^Ty = x^T(y + \Delta y)

with

|\Delta x| \le \gamma_n|x|, \quad |\Delta y| \le \gamma_n|y|
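
A numerical check of the forward bound in single precision (a sketch: the exact products and sums are accumulated with Fraction, and \gamma_n uses the single-precision unit roundoff):

import numpy as np
from fractions import Fraction

rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal(n).astype(np.float32)
y = rng.standard_normal(n).astype(np.float32)

s_hat = np.float32(0.0)
for xi, yi in zip(x, y):             # fl(x^T y): every product and sum is rounded
    s_hat = np.float32(s_hat + xi * yi)

terms = [Fraction(float(xi)) * Fraction(float(yi)) for xi, yi in zip(x, y)]
s_exact = sum(terms)
gamma_n = n * 2.0**-24 / (1 - n * 2.0**-24)
bound = gamma_n * float(sum(abs(t) for t in terms))
print(abs(float(s_hat) - float(s_exact)) <= bound)  # True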

LU decomposition

Compute the LU factorisation of

A = \begin{bmatrix} \epsilon & -1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ l_{21} & 1 \end{bmatrix} \begin{bmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{bmatrix}

u_{11} = \epsilon, \quad u_{12} = -1, \quad l_{21} = \epsilon^{-1}, \quad u_{22} = 1 + \epsilon^{-1} \approx \epsilon^{-1}

\hat{L}\hat{U} = \begin{bmatrix} \epsilon & -1 \\ 1 & 0 \end{bmatrix}

Rounding u_{22} to \epsilon^{-1} changes the (2, 2) entry of A from 1 to 0: with a tiny pivot \epsilon and no pivoting, the factorisation is not backward stable.
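
A sketch of the same phenomenon in double precision, performing the elimination step by hand without pivoting (plain NumPy, no library factorisation):

import numpy as np

eps = 1e-20
A = np.array([[eps, -1.0],
              [1.0,  1.0]])

l21 = A[1, 0] / A[0, 0]        # 1e20
u22 = A[1, 1] - l21 * A[0, 1]  # 1 + 1e20 rounds to 1e20
L_hat = np.array([[1.0, 0.0], [l21, 1.0]])
U_hat = np.array([[A[0, 0], A[0, 1]], [0.0, u22]])

print(L_hat @ U_hat)           # [[1e-20 -1.], [1. 0.]] -- the 1 in A[1, 1] has been lost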

Notable results

Simplex algorithm

The simplex method is the most widely used algorithm for solving linear programming (LP) problems.

\begin{align*} \text{maximize} \quad & c^Tx \\ \text{subject to} \quad & Ax = b \\ & x \ge 0 \end{align*}

with A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m, and c \in \mathbb{R}^n.

We define a basis B of m columns of A. It must be non-singular.

Revised simplex

The revised simplex underlies the most common implementations.

  1. Find an initial feasible solution (auxiliary LP).
  2. Primal solution x_B = B^{-1}b.
  3. Dual solution y_B = B^{-T}c_B.
  4. Entering variable e = \arg\max_{r_e > 0}(r_e \coloneqq c_e - A_e^Ty_B).
  5. Coefficient d = B^{-1}A_e.
  6. Leaving variable l = \arg\min_{d_i > 0}(x_{B_i}/d_i).
  7. Update B = \begin{bmatrix} B_1 & \cdots & B_{l-1} & B_{l+1} & \cdots & B_m & A_e \end{bmatrix}.

Exit conditions

The steps of the simplex loop until either no reduced cost r_e is positive (the current basis is optimal) or the entering column has no positive component d_i (the problem is unbounded).

Bartels–Golub decomposition

Each iteration solves three linear systems involving B. Why not use an LU decomposition?

Keep LL fixed and only update UU.

H^{(i)} = \begin{bmatrix} U^{(i-1)}_1 & \cdots & U^{(i-1)}_{l_{i-1} - 1} & U^{(i-1)}_{l_{i-1} + 1} & \cdots & U^{(i-1)}_m & L^{-1}A_e \end{bmatrix}

Bartels–Golub decomposition

We need to re-triangularise H^{(i)}.

\begin{align*} B^{(i)} &= LH^{(i)} \\ &= LC^{(1)}C^{(2)}\cdots C^{(i)}\underbrace{C^{(i)-1}C^{(i-1)-1}\cdots C^{(1)-1}} _ {\text{Gaussian elimination}}H^{(i)} \\ &= L\underbrace{C^{(1)}C^{(2)}\cdots C^{(i)}} _ {G^{(i)}}U^{(i)} \\ &= LG^{(i)}U^{(i)} \end{align*}
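
A rough sketch of a single update (assuming SciPy is available; the structured eliminations C^{(k)} are replaced by a generic re-triangularisation, so this only illustrates the shape of the computation, not the efficient update):

import numpy as np
from scipy.linalg import lu, solve_triangular

def replace_column(L, U, l, a_e):
    """Drop column l of U, append L^{-1} a_e, and re-triangularise."""
    h = solve_triangular(L, a_e, lower=True)           # L^{-1} A_e, the new last column
    H = np.column_stack([np.delete(U, l, axis=1), h])
    G, U_new = lu(H, permute_l=True)                   # H = G U_new, so B_new = L G U_new
    return G, U_new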

Stability

It can be proven that the Bartels–Golub decomposition is backward stable.

As long as the number of updates and the growth of the factors stay moderate, the error remains bounded.

At some point it becomes more convenient to recompute the decomposition.

Exact floating point simplex

\exists\, \tau, \epsilon \in \mathbb{R}, \ \tau, \epsilon > 0, such that

import math
import numpy as np

# bartels_golub and solve are assumed from the previous sections.
def simplex(A, b, c, B_idx, tolerance, epsilon):
    m, n = A.shape
    for _ in range(math.perm(n, m)):  # at most n!/(n-m)! bases can be visited
        B = A[:, B_idx]
        L, G, U = bartels_golub(B)
        x_B = solve(L, G, U, b)                  # primal solution x_B = B^{-1} b
        y_B = solve(U.T, G.T, L.T, c[B_idx])     # dual solution y_B = B^{-T} c_B
        r = c - A.T @ y_B                        # reduced costs
        if np.all(r <= tolerance): return x_B    # Optimal
        e = np.argmax(r)                         # entering variable
        d = solve(L, G, U, A[:, e])
        if np.all(d <= tolerance): return x_B    # Unbounded
        ratios = np.full(m, np.inf)
        ratios[d > tolerance] = x_B[d > tolerance] / d[d > tolerance]
        l = np.argmin(ratios)                    # leaving variable
        B_idx[l] = e

behaves exactly.

Takeaways

Resources