Math quiz
How much is
$$1 + 2 = \,?$$
$$1 + 2 = 3$$
print(1 + 2)  # 3
Math quiz
How much is
$$1.0 + 2.0 = \,?$$
$$1.0 + 2.0 = 3.0$$
print(1.0 + 2.0)  # 3.0
Math quiz
How much is
$$0.1 + 0.2 = \,?$$
$$0.1 + 0.2 = 0.3$$
print(0.1 + 0.2)  # 0.30000000000000004
Number representation
How do we represent numbers in a computer?
Computers only store bits, usually in groups of 8, 16, 32, or 64.
It is natural to represent integers in binary.
$$\underbrace{0}_{\min} \le \underbrace{10111_2 = 23_{10}}_{\text{value}} \le \underbrace{11\underset{27}{\cdots}11_2 = 2147483647_{10}}_{\max}$$
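A quick check in Python (illustrative; numpy is used only to query the 32-bit limits):

import numpy as np

print(int('10111', 2))         # 23, the value above
print(np.iinfo(np.int32).max)  # 2147483647 == 2**31 - 1, the max above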
Fixed-point
What about fractions?
A natural way is to use fixed-point numbers.
$$\begin{gather*}
\underbrace{0}_{\min} \le \underbrace{10111.101_2 = 23.625_{10}}_{\text{value}} \le \\
\le \underbrace{11\underset{11}{\cdots}11.11\underset{12}{\cdots}11_2 = 32767.99998474121_{10}}_{\max}
\end{gather*}$$
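A minimal sketch of this idea in Python, storing each value as an integer scaled by $2^{16}$ (the names to_fixed/from_fixed and the 16/16 bit split are illustrative assumptions):

SCALE = 2 ** 16  # 16 fractional bits; 15 integer bits + sign fit in 32 bits

def to_fixed(x):
    return round(x * SCALE)

def from_fixed(f):
    return f / SCALE

f = to_fixed(23.625)
print(bin(f), from_fixed(f))  # 0b101111010000000000000 23.625
print((2 ** 31 - 1) / SCALE)  # 32767.999984741211, the max above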
Pros and cons of fixed-point numbers
Pros
Stable error
Simple operations
Cons
Extremely limited range
Low precision
Floating-point
The IEEE 754 standard defines floating-point numbers.
$$\begin{gather*}
\underbrace{0}_{\min} \le \underbrace{1.101_2 \times 2^{2} = 6.5_{10}}_{\text{value}} \le \\
\le \underbrace{1.11\underset{19}{\cdots}11_2 \times 2^{127} = (2 - 2^{-23})_{10} \times 2^{127}}_{\max}
\end{gather*}$$
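The same limits can be inspected in Python (illustrative, via numpy's finfo):

import numpy as np

print(np.finfo(np.float32).max)  # 3.4028235e+38
print((2 - 2**-23) * 2.0**127)   # 3.4028234663852886e+38, the same value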
Problems with floating-point
The number of bits is still finite.
Range: limited exponent.
Error: rounding errors.
Unit roundoff: $\epsilon$ is the smallest positive number such that $\textbf{fl}(1 + \epsilon) > 1$.
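A classic way to probe $\epsilon$ empirically (a sketch; it only tests powers of two, in double precision):

import sys

eps = 1.0
while 1.0 + eps / 2 > 1.0:  # halve until 1 + eps/2 rounds back to 1
    eps /= 2
print(eps)                     # 2.220446049250313e-16 == 2**-52
print(sys.float_info.epsilon)  # the same value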
Numbers with infinite decimals
Irrational numbers are not representable in a finite number of bits.
We always have to round them (approximation) or use symbolic computation.
Some fractions can create issues too (think of $1/3$).
Fractions whose denominator is not a power of two are called non-dyadic numbers, and they have no finite binary expansion.
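A short illustration with Python's built-in fractions module:

from fractions import Fraction

print(1 / 3)           # 0.3333333333333333, already rounded
print(Fraction(1, 3))  # 1/3, kept exact by rational arithmetic
print(Fraction(0.1))   # 3602879701896397/36028797018963968, what 0.1 stores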
Computation over floating point numbers
The standard dictates that the result of an operation is computed exactly and then rounded.
$$\begin{align*}
a + b &\implies \textbf{fl}(a + b) = (a + b)(1 + \delta) \\
a - b &\implies \textbf{fl}(a - b) = (a - b)(1 + \delta) \\
a \times b &\implies \textbf{fl}(a \times b) = (a \times b)(1 + \delta) \\
a \div b &\implies \textbf{fl}(a \div b) = (a \div b)(1 + \delta)
\end{align*}$$
with $|\delta| \le \epsilon$.
Overflows and underflows are also possible, but they are detectable.
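This exactly-then-round model can be verified with rational arithmetic (a sketch using fractions; Fraction(a) recovers the exact value a float stores):

from fractions import Fraction

a, b = 0.1, 0.2
exact = Fraction(a) + Fraction(b)  # exact sum of the stored values
print(float(exact) == a + b)       # True: compute exactly, then round once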
Errors in computation
Let the exponent and mantissa be 4 and 7 (+1) bits.
$$\underbrace{1.1111100_2 \times 2^{7}}_{252_{10}} + \underbrace{1.1_2}_{1.5_{10}} = \,?$$
$$\begin{array}{crl}
& 111111000 & + \\
& 11 & = \\
\hline
& 111111011 &
\end{array}$$
The result would need 8 bits (+1) to be represented.
$$1.\underbrace{1111101}_{\text{Stored}}1_2 \times 2^{7} = 253.0 \ne 252 + 1.5 = 253.5$$
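A sketch of this toy format in Python (the helper fl is hypothetical; it truncates to a p-bit significand, mirroring the slide's example):

import math

def fl(x, p=8):
    """Truncate x to a p-bit significand (toy float format)."""
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(math.trunc(m * 2**p) / 2**p, e)

print(fl(252.0 + 1.5))  # 253.0: the ninth significand bit is lost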
Errors propagate
Errors propagate through operations.
$$\textbf{fl}(a + b + c) = ((a + b)(1 + \delta_1) + c)(1 + \delta_2)$$
The exact value depends on the order of operations, making floating-point arithmetic non-associative.
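For example, in Python:

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False: addition is not associative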
Error analysis
Error analysis is the study of how floating-point computations affect our results.
Forward error analysis vs. backward error analysis.
Forward error: how far the computed output is from the exact output.
Backward error: the smallest perturbation of the input that makes the computed output exact.
[Diagram: input/output map annotated with the forward and backward errors]
Errors in inner products
The inner product of two column vectors is
$$\hat{s} = \textbf{fl}(x^Ty) = \textbf{fl}\left(\sum_{i=1}^{n} x_i y_i\right)$$
$$\hat{s} = x_1y_1(1 + \delta_{1\times})\prod^n_{i = 2}(1 + \delta_i) + \sum_{i=2}^{n} x_i y_i(1 + \delta_{i\times})\prod^n_{j = i}(1 + \delta_j)$$
Typical notation
$$\begin{align*}
&\prod^n_{i = 1}(1 + \delta_i)^{p_i} = 1 + \theta_n, \qquad p_i = \pm 1 \\
&|\theta_n| \le \frac{n\epsilon}{1 - n\epsilon} \eqqcolon \gamma_n
\end{align*}$$
Error bound
$$\textbf{fl}(x^Ty) \le x^Ty(1 + \gamma_n)$$
Using the backward error analysis,
$$\textbf{fl}(x^Ty) = (x + \Delta x)^Ty = x^T(y + \Delta y)$$
with
$$|\Delta x| \le \gamma_n|x|, \quad |\Delta y| \le \gamma_n|y|$$
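A numerical spot-check of the bound (a sketch: float32 accumulation against a float64 reference, taking the unit roundoff as half of numpy's eps):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)
y = rng.standard_normal(10_000).astype(np.float32)

s_hat = np.dot(x, y)  # float32 accumulation
exact = np.dot(x.astype(np.float64), y.astype(np.float64))

u = np.finfo(np.float32).eps / 2  # unit roundoff
gamma_n = x.size * u / (1 - x.size * u)
bound = gamma_n * np.abs(x).astype(np.float64) @ np.abs(y).astype(np.float64)
print(abs(s_hat - exact) <= bound)  # True: error within the gamma_n bound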
LU decomposition
Compute the $LU$ factorisation of
$$A = \begin{bmatrix}
\epsilon & -1 \\
1 & 1
\end{bmatrix}
= \begin{bmatrix}
1 & 0 \\
l_{21} & 1
\end{bmatrix}
\begin{bmatrix}
u_{11} & u_{12} \\
0 & u_{22}
\end{bmatrix}$$
$$u_{11} = \epsilon, \quad u_{12} = -1, \quad l_{21} = \epsilon^{-1}, \quad u_{22} = 1 + \epsilon^{-1} \approx \epsilon^{-1}$$
$$\hat{L}\hat{U} = \begin{bmatrix}
\epsilon & -1 \\
1 & 0
\end{bmatrix}$$
Rounding $u_{22}$ to $\epsilon^{-1}$ wipes out the $(2,2)$ entry of $A$: with a tiny pivot, the factors blow up and the backward error becomes as large as the entries of $A$ itself.
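The same effect reproduced in double precision (illustrative; $\epsilon = 10^{-20}$ stands in for the tiny pivot):

import numpy as np

eps = 1e-20
A = np.array([[eps, -1.0], [1.0, 1.0]])

l21 = A[1, 0] / A[0, 0]        # 1e20
u22 = A[1, 1] - l21 * A[0, 1]  # 1 + 1e20 rounds to 1e20
L = np.array([[1.0, 0.0], [l21, 1.0]])
U = np.array([[eps, -1.0], [0.0, u22]])
print(L @ U)  # [[1e-20, -1.], [1., 0.]] -- the (2,2) entry of A is lost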
Notable results
Triangular systems: $(T + \Delta T)\hat{x} = b$, with $|\Delta T| \le \gamma_n|T|$.
LU decomposition: $(A + \Delta A) = \hat{L}\hat{U}$, with $|\Delta A| \le \gamma_n|\hat{L}||\hat{U}|$.
LU system: $(A + \Delta A)\hat{x} = b$, with $|\Delta A| \le \gamma_{3n}|\hat{L}||\hat{U}|$.
Simplex algorithm
The simplex method is the most widely used algorithm for solving LP problems.
$$\begin{align*}
\text{maximize} \quad & c^Tx \\
\text{subject to} \quad & Ax = b \\
& x \ge 0
\end{align*}$$
with $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}^n$.
We define a basis $B$ of $m$ columns of $A$.
It must be non-singular.
Revised simplex
The revised simplex is the basis of the most common implementations.
Find an initial feasible solution (auxiliary LP).
Primal solution $x_B = B^{-1}b$.
Dual solution $y_B = B^{-T}c_B$.
Entering variable $e = \arg\max_{r_e > 0}(r_e \coloneqq c_e - A_e^Ty_B)$.
Coefficients $d = B^{-1}A_e$.
Leaving variable $l = \arg\min_{d_i > 0}(x_{B_i}/d_i)$.
Update $B = \begin{bmatrix} B_1 & \cdots & B_{l-1} & B_{l+1} & \cdots & A_e \end{bmatrix}$.
Exit conditions
The simplex iterates these steps until:
No initial feasible solution $\implies$ infeasible problem.
No entering variable $\implies$ optimal solution.
No leaving variable $\implies$ unbounded problem.
Bartels–Golub decomposition
Three linear systems involving $B$ must be solved at each iteration.
Why not use an $LU$ decomposition?
Keep $L$ fixed and only update $U$.
$$H^{(i)} = \begin{bmatrix} U^{(i-1)}_1 & \cdots & U^{(i-1)}_{l_{i-1} - 1} & U^{(i-1)}_{l_{i-1} + 1} & \cdots & U^{(i-1)}_m & L^{-1}A_e \end{bmatrix}$$
Bartels–Golub decomposition
We need to re-triangularise $H^{(i)}$.
$$\begin{align*}
B^{(i)} &= LH^{(i)} \\
&= LC^{(1)}C^{(2)}\cdots C^{(i)}\underbrace{{C^{(i)}}^{-1}{C^{(i-1)}}^{-1}\cdots {C^{(1)}}^{-1}}_{\text{Gaussian elimination}}H^{(i)} \\
&= L\underbrace{C^{(1)}C^{(2)}\cdots C^{(i)}}_{G^{(i)}}U^{(i)} \\
&= LG^{(i)}U^{(i)}
\end{align*}$$
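A minimal sketch of one such update in Python (assumptions: U is the current upper-triangular factor, Linv_ae is $L^{-1}A_e$, l is the leaving column; no pivoting, which a real implementation would add for stability):

import numpy as np

def bg_update(U, Linv_ae, l):
    """Drop column l of U, append L^{-1} A_e, re-triangularise the result.

    From column l on, H is upper Hessenberg, so one Gaussian-elimination
    sweep along the subdiagonal restores triangularity.
    Returns C and the new U, with B_new = L @ C @ U_new.
    """
    m = U.shape[0]
    H = np.column_stack([U[:, :l], U[:, l + 1:], Linv_ae])
    C = np.eye(m)
    for k in range(l, m - 1):
        mult = H[k + 1, k] / H[k, k]  # multiplier that zeroes H[k+1, k]
        H[k + 1, :] -= mult * H[k, :]
        C[k + 1, k] = mult            # accumulate the inverted eliminations
    return C, H

# Quick check on a random 5x5 example with leaving column l = 1
rng = np.random.default_rng(1)
m, l = 5, 1
L = np.tril(rng.standard_normal((m, m)), -1) + np.eye(m)
U0 = np.triu(rng.standard_normal((m, m))) + np.eye(m)
a_e = rng.standard_normal(m)
C, U1 = bg_update(U0, np.linalg.solve(L, a_e), l)
B0 = L @ U0
B_new = np.column_stack([B0[:, :l], B0[:, l + 1:], a_e])
print(np.allclose(L @ C @ U1, B_new))  # True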
Stability
It can be proven that the Bartels–Golub decomposition is backward stable.
As long as the factors do not grow too much across iterations, the error stays bounded.
At some point it becomes more convenient to recompute the decomposition from scratch.
Exact floating point simplex
$\exists\, \tau, \epsilon \in \mathbb{R}$, $\tau, \epsilon > 0$, such that
import math
import numpy as np

def simplex(A, b, c, B_idx, tolerance, epsilon):
    # bartels_golub and solve are the routines from the previous slides;
    # epsilon is the working precision the statement above refers to.
    m, n = A.shape
    for _ in range(math.factorial(n) // math.factorial(n - m)):  # bound on the bases visited
        B = A[:, B_idx]
        L, G, U = bartels_golub(B)            # B = L G U
        x_B = solve(L, G, U, b)               # primal solution
        y_B = solve(L.T, G.T, U.T, c[B_idx])  # dual solution
        r = c - A.T @ y_B                     # reduced costs
        if (r <= tolerance).all():
            return x_B                        # optimal
        e = np.argmax(r)                      # entering variable
        d = solve(L, G, U, A[:, e])
        if (d <= tolerance).all():
            return x_B                        # unbounded
        mask = d > tolerance
        ratios = np.full(m, np.inf)
        ratios[mask] = x_B[mask] / d[mask]
        l = np.argmin(ratios)                 # leaving variable
        B_idx[l] = e                          # update the basis
behaves exactly.
Takeaways
Working with integers
Increasing the precision and tolerance
Using rational arithmetic
Using symbolic computation
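Two of these strategies are available directly in Python's standard library (decimal for higher precision, fractions for rational arithmetic):

from decimal import Decimal, getcontext
from fractions import Fraction

getcontext().prec = 50                    # 50 significant digits
print(Decimal(1) / Decimal(3))            # 0.3333...3 (fifty 3s)
print(Fraction(1, 10) + Fraction(2, 10))  # 3/10, exact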