[Computational Statistics] 1. Computer Arithmetic
Computational Statistics
Table of Contents
- Computer Arithmetic
- LU decomposition
- Cholesky Decomposition
- QR Decomposition
- Iterative Methods
- Non-linear Equations
- Optimization
- Numerical Integration
- Random Number Generation
- Monte Carlo Integration
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] fansi_0.5.0 digest_0.6.27 utf8_1.2.1 crayon_1.4.1
[5] IRdisplay_1.0 repr_1.1.3 lifecycle_1.0.0 jsonlite_1.7.2
[9] evaluate_0.14 pillar_1.6.1 rlang_0.4.11 uuid_0.1-4
[13] vctrs_0.3.8 ellipsis_0.3.2 IRkernel_1.1 tools_3.6.3
[17] compiler_3.6.3 base64enc_0.1-3 pbdZMQ_0.3-5 htmltools_0.5.1.1
Required R packages
# Dig into the internal representation of R objects
library(lobstr) # if not installed, use install.packages('lobstr')
# For unsigned integers
library(uint8) # devtools::install_github('coolbutuseless/uint8')
# For bitstrings
library(pryr)
# For big integers
library(gmp)
# For single precision floating point numbers
library(float)
library(Rcpp)
Attaching package: ‘uint8’
The following objects are masked from ‘package:base’:
:, order
Registered S3 method overwritten by 'pryr':
method from
print.bytes Rcpp
Attaching package: ‘pryr’
The following objects are masked from ‘package:lobstr’:
ast, mem_used
Attaching package: ‘gmp’
The following objects are masked from ‘package:base’:
%*%, apply, crossprod, matrix, tcrossprod
Computer arithmetic
Units of computer storage
Humans use decimal digits (why?)
Computers use binary digits (why?)
- Bit = binary digit (coined by statistician John Tukey).
- byte = 8 bits.
- KB = kilobyte = $10^3$ bytes; KiB = kibibyte = $2^{10}$ bytes.
- MB = megabyte = $10^6$ bytes; MiB = mebibyte = $2^{20}$ bytes.
- GB = gigabyte = $10^9$ bytes. Typical RAM size.
- TB = terabyte = $10^{12}$ bytes. Typical hard drive size. Roughly the amount of data generated by the NYSE each trading session.
- PB = petabyte = $10^{15}$ bytes.
- EB = exabyte = $10^{18}$ bytes. All healthcare data in 2011 amounted to ~150 EB.
- ZB = zettabyte = $10^{21}$ bytes.
The R function lobstr::obj_size() shows the amount of memory (in bytes) used by an object. (This is a better version of the built-in utils::object.size().)
x <- 100
lobstr::obj_size(x)
y <- c(20, 30)
lobstr::obj_size(y)
z <- matrix(runif(100 * 100), nrow=100) # 100 x 100 random matrix
lobstr::obj_size(z)
56 B
64 B
80,216 B
Print all variables in workspace and their sizes:
sapply(ls(), function(z, env=parent.env(environment())) obj_size(get(z, envir=env)))
    x     y     z 
   56    64 80216 
Storage of Characters
- Plain text files are stored in the form of characters: `.jl`, `.r`, `.c`, `.cpp`, `.ipynb`, `.html`, `.tex`, …
- ASCII (American Standard Code for Information Interchange): 7 bits, only $2^7=128$ characters.
# integers 0, 1, ..., 127 and corresponding ascii character
sapply(0:127, intToUtf8)
(output: all 128 ASCII characters, from NUL and the control characters '\001'…'\037' through the printable characters ' ' to '~' and DEL '\177')
- Extended ASCII: 8 bits, $2^8=256$ characters.
# integers 128, 129, ..., 255 and corresponding extended ascii character
sapply(128:255, intToUtf8)
(output: the 128 extended ASCII characters, from '\u0080' through 'ÿ')
- Unicode: UTF-8, UTF-16 and UTF-32 support many more characters, including foreign characters; the first 128 code points coincide with ASCII.
- UTF-8 is the current dominant character encoding on the internet.
Source: Google Blog
st <- "\uD1B5\uACC4\uACC4\uC0B0"
st
‘통계계산’
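To see the encoding at work, here is a small check (an added sketch, not from the original output): `charToRaw` exposes the raw bytes of a string in a UTF-8 locale, and each Hangul syllable occupies 3 bytes.
charToRaw(st) # 12 bytes: each of the 4 Hangul syllables takes 3 bytes in UTF-8
charToRaw("a") # a single byte, 0x61 = 97, the ASCII code of 'a'
utf8ToInt("A") # 65: the first 128 Unicode code points coincide with ASCII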
Integers: fixed-point number system
- Fixed-point number system is a computer model for integers $\mathbb{Z}$.
- Remember that computer memory is finite whereas the cardinality of $\mathbb{Z}$ is (countably) infinite.
- Any representation of numbers in a computer has to be an approximation.
- The number $M$ of bits and method of representing negative numbers vary from system to system.
- The `integer` type in R has $M=32$ (packages such as `bit64` support 64-bit integers).
- C has (`unsigned`) `char`, `int`, `short`, `long` (and `long long`), whose sizes depend on the machine.
- Matlab has `(u)int8`, `(u)int16`, `(u)int32`, `(u)int64`.
Unsigned integers
- Model for $\mathbb{N} \cup \{0\}$.
- For unsigned integers, the range is $[0,2^M-1]$.
- R does not support unsigned integers natively (we will see the `uint8` package later). In most other languages:
Type | Min | Max |
---|---|---|
UInt8 | 0 | 255 |
UInt16 | 0 | 65535 |
UInt32 | 0 | 4294967295 |
UInt64 | 0 | 18446744073709551615 |
UInt128 | 0 | 340282366920938463463374607431768211455 |
Signed integers
- Model of $\mathbb{Z}$. Can do subtraction.
- The first bit ("most significant bit", or MSB) is the sign bit:
  - `0` for nonnegative numbers
  - `1` for negative numbers
- Two's complement representation for negative numbers: starting from the bits of the corresponding positive number,
  - set the sign bit to `1`,
  - negate (`0` -> `1`, `1` -> `0`) the remaining bits,
  - add `1` to the result.
- The two's complement representation of a negative integer $x$ is the same as the unsigned integer $2^M - |x|$ (verified in the check after the code below).
class(5L) # postfix `L` means integer in R
pryr::bits(5L)
pryr::bits(-5L)
pryr::bits(2L * 5L) # shift bits of 5 to the left (why?)
pryr::bits(2L * -5L); # shift bits of -5 to left
‘integer’
‘00000000 00000000 00000000 00000101’
‘11111111 11111111 11111111 11111011’
‘00000000 00000000 00000000 00001010’
‘11111111 11111111 11111111 11110110’
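As a sanity check (an added sketch, not in the original notes), we can decode the bit pattern of `-5L` as an unsigned integer and verify that it equals $2^{32} - 5$:
# read the bit string printed by pryr::bits() back as an unsigned integer
ubits <- as.integer(strsplit(gsub(" ", "", pryr::bits(-5L)), "")[[1]])
sum(ubits * 2^(31:0)) # 4294967291 = 2^32 - 5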
- Two's complement representation respects modular arithmetic nicely: addition of any two signed integers is just bitwise addition, possibly modulo $2^M$. The $M=4$ case:
Source: Signed Binary Numbers, Subtraction and Overflow by Grant Braught
- The range of integers representable by an $M$-bit signed integer is $[-2^{M-1},2^{M-1}-1]$:
.Machine$integer.max # R uses 32-bit integer
2147483647
In most other languages,
Type | Min | Max |
---|---|---|
Int8 | -128 | 127 |
Int16 | -32768 | 32767 |
Int32 | -2147483648 | 2147483647 |
Int64 | -9223372036854775808 | 9223372036854775807 |
Int128 | -170141183460469231731687303715884105728 | 170141183460469231731687303715884105727 |
Overflow and underflow for integer arithmetic
R reports `NA` for integer overflow and underflow.
# The largest integer R can hold
.Machine$integer.max
2147483647
M <- 32
big <- 2^(M-1) - 1
as.integer(big)
2147483647
.Machine$integer.max + 1L
Warning message in .Machine$integer.max + 1L:
“NAs produced by integer overflow”
<NA>
`uint8` outputs the result according to modular arithmetic. So do C and Julia.
uint8::as.uint8(255) + uint8::as.uint8(1)
uint8::as.uint8(250) + uint8::as.uint8(15)
[1] 0
[1] 9
Package `gmp` supports big integers with arbitrary precision.
gmp::as.bigz(.Machine$integer.max ) + gmp::as.bigz(1L)
Big Integer ('bigz') :
[1] 2147483648
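Beyond avoiding overflow, arbitrary precision allows exact computation far outside the double range. A small added illustration (`factorialZ` is a `gmp` function):
gmp::as.bigz(2)^200 # 2^200, computed exactly
gmp::factorialZ(50) # 50!, an exact 65-digit integer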
Real numbers: floating-point number system
The floating-point number system is a computer model for the real line $\mathbb{R}$.
- Most computer systems adopt the IEEE 754 standard, established in 1985, for floating-point arithmetic. For the history, see an interview with William Kahan.
- In scientific notation, a real number is represented as \(\pm d_1.d_2d_3 \cdots d_p \times b^e, \quad 0 \le d_i < b.\) Humans use the base $b=10$ and digits $d_i=0, 1, \dotsc, 9$. In computers, the base is $b=2$ and the digits $d_i$ are 0 or 1.
- Normalized vs denormalized numbers. For example, the decimal number 18 is \(+1.0010 \times 2^4 \quad (\text{normalized})\) or, equivalently, \(+0.1001 \times 2^5 \quad (\text{denormalized}).\)
- In the floating-point number system, the computer stores
  - the sign bit,
  - the fraction (or mantissa, or significand) of the normalized representation,
  - the actual exponent plus a bias.
- R supports double precision floating point numbers (see below) via `double`.
- C supports floating point types `float` and `double`, where on most systems `float` corresponds to single precision while `double` corresponds to double precision.
- Julia provides `Float16` (half precision, implemented in software using `Float32`), `Float32` (single precision), `Float64` (double precision), and `BigFloat` (arbitrary precision).
R has no single precision data type. All real numbers are stored in double precision format. The functions `as.single` and `single` are identical to `as.double` and `double` except they set the attribute `Csingle` that is used in the `.C` and `.Fortran` interface, and they are intended only to be used in that context. (R Documentation)
- For ease of exposition, we begin with half precision.
Half precision (Float16)
Source: https://en.wikipedia.org/wiki/Half-precision_floating-point_format
- In Julia, `Float16` is the type for half precision numbers.
- MSB is the sign bit.
- 10 significand bits (fraction = mantissa), hence $p=11$ (why?)
- 5 exponent bits: $e_{\max}=15$, $e_{\min}=-14$, bias $= 15 = 01111_2$ for encoding:
  - $e_{\min} = 00001_2 - 01111_2 = -14_{10}$
  - $e_{\max} = 11110_2 - 01111_2 = 15_{10}$
- $e_{\min}-1$ and $e_{\max}+1$ are reserved for special numbers.
- Range of magnitude: $10^{\pm 4}$ in decimal because $\log_{10} (2^{15}) \approx 4.5$.
- Precision: $\log_{10}(2^{11}) \approx 3.311$ decimal digits.
# This is Julia
println("Half precision:")
@show bitstring(Float16(5)) # 5 in half precision
@show bitstring(Float16(-5)); # -5 in half precision
Half precision:
bitstring(Float16(5)) = "0100010100000000"
bitstring(Float16(-5)) = "1100010100000000"
Single precision (Float32, or `float`)
Source: https://en.wikipedia.org/wiki/Single-precision_floating-point_format
- In C, `float` is the type for single precision numbers on most systems.
- In Julia, `Float32` is the type for single precision numbers.
- In R, single precision is not supported natively. We use the following workaround:
# Homework: figure out how this C++ code works
Rcpp::cppFunction('int float32bin(double x) {
float flx = (float) x;
unsigned int binx = *((unsigned int*)&flx);
return binx;
}')
- MSB is the sign bit.
- 23 significand bits ($p=24$).
- 8 exponent bits: $e_{\max}=127$, $e_{\min}=-126$, bias=127.
- $e_{\min}-1$ and $e_{\max}+1$ are reserved for special numbers.
- Range of magnitude: $10^{\pm 38}$ in decimal because $\log_{10} (2^{127}) \approx 38$.
- Precision: $\log_{10}(2^{24}) \approx 7.225$ decimal digits.
message("Single precision:")
pryr::bits(float32bin(5)) # 5 in single precision
pryr::bits(float32bin(-5)) # -5 in single precision
Single precision:
‘01000000 10100000 00000000 00000000’
‘11000000 10100000 00000000 00000000’
Double precision (Float64, or `double`)
Source: https://en.wikipedia.org/wiki/Double-precision_floating-point_format
- Double precision (64 bits = 8 bytes) numbers are the dominant data type in scientific computing.
- In C, `double` is the type for double precision numbers on most systems. It is the default type for numeric values.
- In Julia, `Float64` is the type for double precision numbers.
- In R, `double` is the type for double precision numbers.
- MSB is the sign bit.
- 52 significand bits ($p=53$).
- 11 exponent bits: $e_{\max}=1023$, $e_{\min}=-1022$, bias=1023.
- $e_{\min}-1$ and $e_{\max}+1$ are reserved for special numbers.
- Range of magnitude: $10^{\pm 308}$ in decimal because $\log_{10} (2^{1023}) \approx 308$.
- Precision: $\log_{10}(2^{53}) \approx 15.95$ decimal digits.
message("Double precision:")
pryr::bits(5) # 5 in double precision
pryr::bits(-5) # -5 in double precision
Double precision:
‘01000000 00010100 00000000 00000000 00000000 00000000 00000000 00000000’
‘11000000 00010100 00000000 00000000 00000000 00000000 00000000 00000000’
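To connect the bit pattern above with the format (an added sketch): $5 = 1.01_2 \times 2^2$, so the stored exponent is the actual exponent plus the bias, $2 + 1023 = 1025 = 10000000001_2$, which matches the 11 exponent bits above.
e <- floor(log2(5)) # actual exponent: 2
e + 1023 # stored (biased) exponent: 1025
paste(rev(as.integer(intToBits(1025))[1:11]), collapse="") # "10000000001"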
Special floating-point numbers
- Exponent $e_{\max}+1$ plus a zero mantissa means $\pm \infty$.
pryr::bits(Inf) # Inf in double precision
pryr::bits(-Inf) # -Inf in double precision
‘01111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000’
‘11111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000’
- Exponent $e_{\max}+1$ plus a nonzero mantissa means `NaN`. `NaN` could be produced from `0 / 0`, `0 * Inf`, …
- In general `NaN` does not compare equal to `NaN`, even when the bit patterns agree. Test whether a number is `NaN` with the `is.nan` function.
pryr::bits(0 / 0) # NaN
pryr::bits(0 * Inf) # NaN
‘11111111 11111000 00000000 00000000 00000000 00000000 00000000 00000000’
‘11111111 11111000 00000000 00000000 00000000 00000000 00000000 00000000’
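A short added illustration: comparison with `NaN` never yields `TRUE` in R, which is why `is.nan` is the right test.
NaN == NaN # NA, not TRUE
is.nan(0 / 0) # TRUE
is.nan(0 * Inf) # TRUE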
- Exponent $e_{\min}-1$ with a zero mantissa represents the real number 0 (“exact zero”).
- Why do we need an exact zero?
pryr::bits(0.0) # 0 in double precision
‘00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000’
- Exponent $e_{\min}-1$ with a nonzero mantissa is used for numbers of magnitude less than $b^{e_{\min}}$. Numbers are denormalized in the range $(0,b^{e_{\min}})$ – gradual underflow.
- For example, in half precision, $e_{\min}=-14$ but $2^{-24}$ is represented by $0.0000000001_2 \times 2^{-14}$.
- In single precision, $e_{\min}=-126$ but $2^{-149}$ is represented by $0.00000000000000000000001_2 \times 2^{-126}$.
2^(-126) # emin=-126
pryr::bits(float32bin(2^(-126)))
2^(-149) # denormalized
pryr::bits(float32bin(2^(-149)))
1.17549435082229e-38
‘00000000 10000000 00000000 00000000’
1.40129846432482e-45
‘00000000 00000000 00000000 00000001’
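The same gradual underflow occurs in double precision (an added sketch): $2^{-1074}$ is the smallest positive subnormal double, and halving once more underflows to exact zero.
2^(-1074) # smallest positive subnormal double
pryr::bits(2^(-1074)) # zero exponent field, mantissa 00...01
2^(-1075) == 0 # TRUE: underflows to exact zero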
Rounding
- Rounding is necessary whenever a number has more than $p$ significand bits. Most computer systems use the default IEEE 754 round to nearest mode (also called ties to even mode). For example, the number 1/10 cannot be represented accurately as a (binary) floating point number: \(0.1 = 1.10011001\dotsc_2 \times 2^{-4}\)
pryr::bits(0.1) # double precision
pryr::bits(float32bin(0.1)) # single precision, 1001 gets rounded to 101(0)
‘00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011010’
‘00111101 11001100 11001100 11001101’
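A familiar consequence (added for illustration): since 0.1, 0.2 and 0.3 are all rounded, exact decimal identities can fail, while comparison up to a tolerance succeeds.
0.1 + 0.2 == 0.3 # FALSE
print(0.1 + 0.2, digits=17) # 0.30000000000000004
all.equal(0.1 + 0.2, 0.3) # TRUE: equal up to a tolerance of about 1.5e-8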
Errors
Rounding (more fundamentally, finite precision) incurs errors in floating point computation. If a real number $x$ is represented by a floating point number $[x]$, then
- Absolute error: \(| [x] - x |\)
- Relative error: \(\frac{| [x] - x |}{|x|}\) (if $x \neq 0$).
Of course, we want to ensure that the error after a computation is small.
Machine epsilons
- Floating-point numbers do not occur uniformly over the real number line.
Source: What you never wanted to know about floating point but will be forced to find out
- There are the same number of representable numbers in $(2^i, 2^{i+1}]$ as in $(2^{i+1}, 2^{i+2}]$. Within an interval, they are uniformly distributed.
- Machine epsilons are the spacings of numbers around 1:
  - $\epsilon_{\max}$ = (smallest positive floating point number that, added to 1, gives a result different from 1) = $\frac{1}{2^p} + \frac{1}{2^{2p-1}}$
  - $\epsilon_{\min}$ = (smallest positive floating point number that, subtracted from 1, gives a result different from 1) = $\frac{1}{2^{p+1}} + \frac{1}{2^{2p}}$.
Source: Computational Statistics, James Gentle, Springer, New York, 2009.
- "Round to nearest, ties to even" rule: round to the nearest representable value; if the number falls midway, round to the nearest value with an even least significant digit (the default for IEEE 754 binary floating point numbers).
- Any real number in the interval $\left[1 - \frac{1}{2^{p+1}}, 1 + \frac{1}{2^p}\right]$ is represented by the floating point number $1 = 1.00\dotsc 0_2 \times 2^0$.
- Adding $\frac{1}{2^p}$ to 1 won't reach the next representable floating point number $1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}$. Hence $\epsilon_{\max} > \frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}$.
- Adding the floating point number next to $\frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}$ to 1 WILL result in $1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}$, hence $\epsilon_{\max} = 1.00\dotsc 01_2 \times 2^{-p} = \frac{1}{2^p} + \frac{1}{2^{2p-1}}$.
- Subtracting $\frac{1}{2^{p+1}}$ from 1 results in $1-\frac{1}{2^{p+1}} = \frac{1}{2} + \frac{1}{2^2} + \dotsb + \frac{1}{2^p} + \frac{1}{2^{p+1}}$, which is represented by the floating point number $1.00\dotsc 0_2 \times 2^{0} = 1$ by the "ties to even" rule. Hence $\epsilon_{\min} > \frac{1}{2^{p+1}}$.
- The smallest positive floating point number larger than $\frac{1}{2^{p+1}}$ is $\frac{1}{2^{p+1}} + \frac{1}{2^{2p}} = 1.00\dotsc 1_2 \times 2^{-p-1}$. Thus $\epsilon_{\min} = \frac{1}{2^{p+1}} + \frac{1}{2^{2p}}$.
Machine precision
- Machine epsilon is often called the machine precision.
- If a positive real number $x \in \mathbb{R}$ is represented by $[x]$ in the floating point arithmetic, then \([x] = \left( 1 + \sum_{i=1}^{p-1}\frac{b_{i+1}}{2^i}\right) \times 2^e.\) Thus $x - \frac{2^e}{2^p} < [x] \le x + \frac{2^e}{2^p}$, and \(\begin{split} \frac{| x - [x] |}{|x|} &\le \frac{2^e}{2^p|x|} \le \frac{2^e}{2^p}\frac{1}{[x]-2^e/2^p} \\ &\le \frac{2^e}{2^p}\frac{1}{2^e(1-1/2^p)} \quad (\because [x] \ge 2^e) \\ &\le \frac{2^e}{2^p}\frac{1}{2^e}\left(1 + \frac{1}{2^{p-1}}\right) \\ &= \frac{1}{2^p} + \frac{1}{2^{2p-1}} = \epsilon_{\max}. \end{split}\) That is, the relative error of the floating point representation $[x]$ of a real number $x$ is bounded by $\epsilon_{\max}$.
options(digits=20)
print(2^(-53) + 2^(-105)) # epsilon_max for double
print(1.0 + 2^(-53))
print(1.0 + (2^(-53) + 2^(-105)))
print(1.0 + 2^(-53) + 2^(-105)) # why this result? See "Catastrophic cancellation"
print(as.double(float::fl(2^(-24) + 2^(-47)))) # epsilon_max for float
print(as.double(float::fl(1.0) + float::fl(2^(-24))))
print(as.double(float::fl(1.0) + float::fl(2^(-24) + 2^(-47))))
[1] 1.1102230246251567869e-16
[1] 1
[1] 1.000000000000000222
[1] 1
[1] 5.9604651880817982601e-08
[1] 1
[1] 1.0000001192092895508
print(1/2^54 + 1/2^106) # epsilon_min for double
print(1 - (1/2^54 + 1/2^106))
pryr::bits(1.0)
pryr::bits(1 - (1/2^54 + 1/2^106))
[1] 5.5511151231257839347e-17
[1] 0.99999999999999988898
‘00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000’
‘00111111 11101111 11111111 11111111 11111111 11111111 11111111 11111111’
- In R, the variable `.Machine` contains numerical characteristics of the machine. `double.neg.eps` is our $\epsilon_{\max}$.
.Machine
$double.eps: 2.22044604925031e-16
$double.neg.eps: 1.11022302462516e-16
$double.xmin: 2.2250738585072e-308
$double.xmax: 1.79769313486232e+308
$double.base: 2
$double.digits: 53
$double.rounding: 5
$double.guard: 0
$double.ulp.digits: -52
$double.neg.ulp.digits: -53
$double.exponent: 11
$double.min.exp: -1022
$double.max.exp: 1024
$integer.max: 2147483647
$sizeof.long: 8
$sizeof.longlong: 8
$sizeof.longdouble: 16
$sizeof.pointer: 8
Overflow and underflow of floating-point number
- For double precision, the range is $\pm 10^{\pm 308}$. In most situations, underflow (the magnitude of the result is less than $10^{-308}$) is preferred over overflow (the magnitude of the result is larger than $10^{+308}$). Overflow produces $\pm \infty$, while underflow yields zeros or subnormal numbers.
- E.g., under the logit link, the success probability is \(p = \frac{\exp (x^T \beta)}{1 + \exp (x^T \beta)} = \frac{1}{1+\exp(- x^T \beta)}.\) The former expression can easily lead to `Inf / Inf = NaN`, while the latter leads to gradual underflow; see the sketch after this list.
- `.Machine$double.xmax` and `.Machine$double.xmin` give the largest and the smallest positive normalized numbers representable in double precision.
Catastrophic cancellation
- Scenario 1: Addition or subtraction of two numbers of widely different magnitudes: $a+b$ or $a-b$ where $a \gg b$ or $a \ll b$. We lose precision in the number of smaller magnitude. Consider \(\begin{eqnarray*} a &=& x.xxx ... \times 2^{30} \\ b &=& y.yyy... \times 2^{-30} \end{eqnarray*}\) What happens when the computer calculates $a+b$? We get $a+b=a$!
(a <- 2.0^30)
(b <- 2.0^-30)
(a + b == a)
pryr::bits(a)
pryr::bits(a + b)
1073741824
9.31322574615479e-10
TRUE
‘01000001 11010000 00000000 00000000 00000000 00000000 00000000 00000000’
‘01000001 11010000 00000000 00000000 00000000 00000000 00000000 00000000’
Analysis: suppose we want to compute $x + y$, $x, y > 0$. Let the relative error of $x$ and $y$ be \(\delta_x = \frac{[x] - x}{x}, \quad \delta_y = \frac{[y] - y}{y} .\) What the computer actually calculates is $[x] + [y]$, which in turn is represented by $[ [x] + [y] ]$. The relative error of this representation is \(\delta_{\text{sum}} = \frac{[[x]+[y]] - ([x]+[y])}{[x]+[y]} .\) Recall that $|\delta_x|, |\delta_y|, |\delta_{\text{sum}}| \le \epsilon_{\max}$.
We want to find a bound on the relative error of $[[x]+[y]]$ with respect to $x+y$. Since $|[x]+[y]| = |x(1+\delta_x) + y(1+\delta_y)| \le |x+y|(1+\epsilon_{\max})$, we have \(\begin{split} | [[x]+[y]] - (x+y) | &= | [[x]+[y]] - [x] - [y] + [x] - x + [y] - y | \\ &\le | [[x]+[y]] - [x] - [y] | + | [x] - x | + | [y] - y | \\ &\le |\delta_{\text{sum}}([x]+[y])| + |\delta_x x| + |\delta_y y| \\ &\le \epsilon_{\max}(x+y)(1+\epsilon_{\max}) + \epsilon_{\max}x + \epsilon_{\max}y \\ &\approx 2\epsilon_{\max}(x+y) \end{split}\) because $\epsilon_{\max}^2 \approx 0$. Thus \(\frac{| [[x]+[y]] - (x+y) |}{|x+y|} \le 2\epsilon_{\max}\) approximately.
- Scenario 2: Subtraction of two nearly equal numbers eliminates significant digits: $a-b$ where $a \approx b$. Consider \(\begin{eqnarray*} a &=& x.xxxxxxxxxx1ssss \\ b &=& x.xxxxxxxxxx0tttt \end{eqnarray*}\) The result is $1.vvvvu \dotsc u$ where the $u$ are unassigned digits.
olddigits <- options('digits')$digits
options(digits=20)
a <- float::fl(1.2345678) # rounding
pryr::bits(float32bin(as.double(a))) # rounding
b <- float::fl(1.2345677)
pryr::bits(float32bin(as.double(b)))
print(as.double(a - b)) # the correct result would be 1e-7
pryr::bits(float32bin(as.double(a - b))) # must be 1.0000...0 x 2^(-23)
print(1/2^23)
options(digits=olddigits)
‘00111111 10011110 00000110 01010001’
‘00111111 10011110 00000110 01010000’
[1] 1.1920928955078125e-07
‘00110100 00000000 00000000 00000000’
[1] 1.1920928955078125e-07
Analysis: Let \([x] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{1}{2^{p-1}}, \quad [y] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{0}{2^{p-1}} .\)
- $[x]-[y] = \frac{1}{2^{p-1}} = [[x]-[y]]$.
- The true difference $x-y$ may lie anywhere in $(0, \frac{1}{2^{p-2}}+\frac{1}{2^{2p}}]$.
- The relative error \(\frac{|x-y -[[x]-[y]]|}{|x-y|}\) is unbounded – no guarantee of any significant digit!
- Implications for numerical computation:
  - Rule 1: add small numbers together before adding larger ones (demonstrated in the sketch after this list).
  - Rule 2: add numbers of like magnitude together (pairing). When all numbers are of the same sign and similar magnitude, add them in pairs so that at each stage the summands are of similar magnitude.
  - Rule 3: avoid subtraction of two numbers that are nearly equal.
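A minimal demonstration of Rule 1 (an added sketch): the spacing of doubles near $10^{16}$ is 2, so adding 1.0 to $10^{16}$ one at a time loses every addition, while summing the small numbers first preserves them.
big <- 1e16
small <- rep(1.0, 1000)
s1 <- big
for (u in small) s1 <- s1 + u # each addition rounds back to 1e16
s1 - big # 0: the thousand ones vanished
(big + sum(small)) - big # 1000: small numbers added together first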
Algebraic laws
Floating-point numbers may violate many algebraic laws we are familiar with, such as the associative and distributive laws. See the example in the Machine epsilons section and HW1.
Conditioning
Consider solving a linear system $Ax=b$:
\[\begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.333 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 1.000 \end{bmatrix} ,\]whose solution is $(x_1, x_2) = (1.000, 1.000)$.
A <- matrix(c(1.0, 0.667, 0.5, 0.333), nrow=2)
b <- c(1.5, 1.0)
solve(A, b)
[1] 0.999999999999833 1.00000000000033
If we perturb $b$ by 0.001, i.e., solve
\[\begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.333 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 0.999 \end{bmatrix} ,\]then the solution changes to $(x_1, x_2) = (0.000, 3.000)$.
b1 <- c(1.5, 0.999)
solve(A, b1)
[1] -1.66533453693773e-13 3.00000000000033
If we instead perturb $A$ by 0.001, i.e., solve
\[\begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.334 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 1.000 \end{bmatrix} ,\]then this time the solution changes to $(x_1, x_2) = (2.000, -1.000)$.
A1 <- matrix(c(1.0, 0.667, 0.5, 0.334), nrow=2)
solve(A1, b)
[1] 2.00000000000017 -1.00000000000033
In other words, an input perturbation of order $10^{-3}$ yields an output perturbation of order $10^0$. That's 3 orders of magnitude of relative change!
The floating point representation $[x]$ of a real number $x$ in a digital computer can easily introduce such an input perturbation. The sensitivity of the output of a problem with respect to perturbations of its input is called conditioning.
- Abstractly, a problem can be viewed as function $f: X \to Y$ where $X$ is a normed vector space of data and $Y$ is a normed vector space of solutions.
- The problem of solving $Ax=b$ for fixed $b$ is $f: A \mapsto A^{-1}b$ with $X=\{M\in\mathbb{R}^{n\times n}: M \text{ is invertible}\}$ and $Y = \mathbb{R}^n$.
- The combination of a problem $f$ with given data $x$ is called a problem instance, or simply a problem when no confusion occurs.
- A well-conditioned problem (instance) is one such that all small perturbations of $x$ lead to only small changes in $f(x)$.
- An ill-conditioned problem is one such that some small perturbation of $x$ leads to a large change in $f(x)$.
- The (relative) condition number $\kappa=\kappa(x)$ of a problem is defined by \(\kappa = \lim_{\delta\to 0}\sup_{\|\delta x\|\le \delta}\frac{\|\delta f\|/\|f(x)\|}{\|\delta x\|/\|x\|} = \sup_{\delta x} \frac{\|\delta f\|/\|f(x)\|}{\|\delta x\|/\|x\|}\)
- For the problem of solving $Ax=b$ for fixed $b$, $f: A \mapsto A^{-1}b$, it can be shown that the condition number of $f$ is \(\kappa = \|A\|\|A^{-1}\| =: \kappa(A),\) where $\|A\|$ is the matrix norm induced by the vector norm $\|\cdot\|$, i.e., \(\|A\| = \sup_{x\neq 0} \frac{\|Ax\|}{\|x\|}.\) If the 2-norm is used, then \(\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A),\) the ratio of the largest and smallest singular values of $A$.
In the above problem, the condition number of the matrix $A$ (w.r.t. the Euclidean norm) is
kappa(A)
4333.00046136224
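For comparison (an added sketch), the 2-norm condition number equals the ratio of the largest to the smallest singular value; `kappa()` with the default `exact=FALSE` returns only an estimate.
sv <- svd(A)$d # singular values of A
max(sv) / min(sv) # sigma_max / sigma_min
kappa(A, exact=TRUE) # the same quantity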
Further readings
- Section II.2, Computational Statistics by James Gentle (2009).
- Sections 1.5 and 2.2, Applied Numerical Linear Algebra by James W. Demmel (1997).
- What every computer scientist should know about floating-point arithmetic by David Goldberg (1991).