[Computational Statistics] 1. Computer Arithmetic

Computational Statistics

Table of Contents

  1. Computer Arithmetic
  2. LU decomposition
  3. Cholesky Decomposition
  4. QR Decomposition
  5. Iterative Methods
  6. Non-linear Equations
  7. Optimization
  8. Numerical Integration
  9. Random Number Generation
  10. Monte Carlo Integration


sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] fansi_0.5.0       digest_0.6.27     utf8_1.2.1        crayon_1.4.1     
 [5] IRdisplay_1.0     repr_1.1.3        lifecycle_1.0.0   jsonlite_1.7.2   
 [9] evaluate_0.14     pillar_1.6.1      rlang_0.4.11      uuid_0.1-4       
[13] vctrs_0.3.8       ellipsis_0.3.2    IRkernel_1.1      tools_3.6.3      
[17] compiler_3.6.3    base64enc_0.1-3   pbdZMQ_0.3-5      htmltools_0.5.1.1

Required R packages

# Dig into the internal representation of R objects
library(lobstr)  # if not installed, use install.packages('lobstr')
# For unsigned integers
library(uint8) # devtools::install_github('coolbutuseless/uint8')
# For bitstrings
library(pryr)
# For big integers
library(gmp)
# For single precision floating point numbers
library(float)
library(Rcpp)
Attaching package: ‘uint8’


The following objects are masked from ‘package:base’:

    :, order


Registered S3 method overwritten by 'pryr':
  method      from
  print.bytes Rcpp


Attaching package: ‘pryr’


The following objects are masked from ‘package:lobstr’:

    ast, mem_used



Attaching package: ‘gmp’


The following objects are masked from ‘package:base’:

    %*%, apply, crossprod, matrix, tcrossprod

Computer arithmetic

Units of computer storage

Humans use decimal digits (why?)
Computers use binary digits (why?)
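For instance, every integer has a binary expansion, which is easy to inspect in R (a small illustrative sketch; dec2bin is a hypothetical helper, not part of these notes):

# The 32 bits of the integer 100L, least significant bit first
intToBits(100L)
# Hypothetical helper: assemble the low 8 bits into a readable bitstring
dec2bin <- function(n) paste(rev(as.integer(intToBits(n))[1:8]), collapse = "")
dec2bin(100L)  # "01100100" = 64 + 32 + 4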

The R function lobstr::obj_size() shows the amount of memory (in bytes) used by an object. (This is a better version of the built-in utils::object.size().)

x <- 100
lobstr::obj_size(x)
y <- c(20, 30)
lobstr::obj_size(y)
z <- matrix(runif(100 * 100), nrow=100)  # 100 x 100 random matrix
lobstr::obj_size(z)
56 B

64 B

80,216 B

Print all variables in workspace and their sizes:

sapply(ls(), function(z, env=parent.env(environment())) obj_size(get(z, envir=env)))

    x     y     z 
   56    64 80216 

Storage of Characters

# integers 0, 1, ..., 127 and corresponding ascii character
sapply(0:127, intToUtf8)
(output: the 128 ASCII characters, from the control codes '\001' ... '\037' through ' ', punctuation, digits '0'-'9', letters 'A'-'Z' and 'a'-'z', up to '~' and '\177')
# integers 128, 129, ..., 255 and corresponding extended ascii character
sapply(128:255, intToUtf8)

(output: the extended characters U+0080 ... U+00FF, i.e. the C1 control codes followed by the Latin-1 symbols and accented letters '¡', '¢', ..., 'þ', 'ÿ')

Source: Google Blog

st <-  "\uD1B5\uACC4\uACC4\uC0B0"
st

‘통계계산’
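Each of these four Hangul characters occupies 3 bytes in UTF-8, which nchar() can confirm:

nchar(st)                  # 4 characters
nchar(st, type = "bytes")  # 12 bytes: 3 bytes per Hangul character in UTF-8
charToRaw(st)              # the raw UTF-8 byte sequence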

Integers: fixed-point number system

Unsigned integers

Type     Min  Max
UInt8    0    255
UInt16   0    65535
UInt32   0    4294967295
UInt64   0    18446744073709551615
UInt128  0    340282366920938463463374607431768211455
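The maxima in this table are $2^M - 1$ for $M$ bits; a quick sanity check using the exact big-integer arithmetic of gmp loaded above (assuming pow.bigz vectorizes over the exponent, as gmp's arithmetic operators do):

# 2^M - 1 for M = 8, 16, 32, 64, 128; doubles would lose precision past 2^53
gmp::pow.bigz(2, c(8, 16, 32, 64, 128)) - 1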

Signed integers

class(5L)  # postfix `L` means integer in R
pryr::bits(5L)
pryr::bits(-5L)
pryr::bits(2L * 5L)   # shift bits of 5 to the left (why?)
pryr::bits(2L * -5L)  # shift bits of -5 to the left

‘integer’

‘00000000 00000000 00000000 00000101’

‘11111111 11111111 11111111 11111011’

‘00000000 00000000 00000000 00001010’

‘11111111 11111111 11111111 11110110’

Source: Signed Binary Numbers, Subtraction and Overflow by Grant Braught
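Negative integers are stored in two's complement: flip every bit and add 1. A quick check with base R's bit operations (an illustrative sketch):

# Two's complement negation: -x = NOT(x) + 1
pryr::bits(bitwNot(5L) + 1L)   # same bit pattern as pryr::bits(-5L)
# Multiplication by 2 is a left shift by one bit position
pryr::bits(bitwShiftL(5L, 1L)) # same bit pattern as pryr::bits(10L)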

.Machine$integer.max  # R uses 32-bit integer

2147483647

In most other languages,

Type    Min                                       Max
Int8    -128                                      127
Int16   -32768                                    32767
Int32   -2147483648                               2147483647
Int64   -9223372036854775808                      9223372036854775807
Int128  -170141183460469231731687303715884105728  170141183460469231731687303715884105727

Overflow and underflow for integer arithmetic

R reports NA for integer overflow and underflow.

# The largest integer R can hold
.Machine$integer.max 

2147483647

M <- 32
big <- 2^(M-1) - 1
as.integer(big)

2147483647

.Machine$integer.max + 1L
Warning message in .Machine$integer.max + 1L:
“NAs produced by integer overflow”

<NA>

uint8 outputs the result according to modular arithmetic (mod $2^8$), as do C and Julia.

uint8::as.uint8(255) + uint8::as.uint8(1)
uint8::as.uint8(250) + uint8::as.uint8(15)
[1] 0

[1] 9
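These wrap-around results are just reduction modulo $2^8 = 256$, which we can mimic directly:

(255 + 1) %% 256   # 0, same as uint8 addition
(250 + 15) %% 256  # 9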

Package gmp supports big integers with arbitrary precision.

gmp::as.bigz(.Machine$integer.max ) + gmp::as.bigz(1L)
Big Integer ('bigz') :
[1] 2147483648
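Arbitrary precision also sidesteps overflow in larger computations; for example, $30!$ exceeds the exactly representable integer range of a double but not of a bigz (factorialZ is a gmp function):

factorial(30)        # double: correct only to ~15-16 significant digits
gmp::factorialZ(30)  # exact: all 33 digits of 265252859812191058636308480000000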

Real numbers: floating-point number system

Floating-point number system is a computer model for the real line $\mathbb{R}$.

R has no single precision data type. All real numbers are stored in double precision format. The functions as.single and single are identical to as.double and double except they set the attribute Csingle that is used in the .C and .Fortran interface, and they are intended only to be used in that context. (R Documentation)
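Although R itself computes in double precision, the float package can at least store vectors in single precision; a minimal sketch (assuming the float and lobstr packages loaded above):

xd <- runif(1000)        # double precision, 8 bytes per element
xs <- float::fl(xd)      # single precision, 4 bytes per element
lobstr::obj_size(xd)     # about 8 KB
lobstr::obj_size(xs)     # about half of that
as.double(xs[1]) - xd[1] # conversion error on the order of 1e-8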

Half precision (Float16)

Source: https://en.wikipedia.org/wiki/Half-precision_floating-point_format

\[(value) = (-1)^{b_{15}}\times 2^{(\sum_{j=1}^5 b_{15-j}2^{5-j}) - 15} \times \left( 1 + \sum_{i=1}^{10}\frac{b_{10-i}}{2^i}\right)\]
# This is Julia
println("Half precision:")
@show bitstring(Float16(5)) # 5 in half precision
@show bitstring(Float16(-5)); # -5 in half precision
Half precision:
bitstring(Float16(5)) = "0100010100000000"
bitstring(Float16(-5)) = "1100010100000000"
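We can verify the formula by decoding the bit pattern of Float16(5) by hand; an illustrative R sketch (the bitstring is copied from the Julia output above):

b <- as.integer(strsplit("0100010100000000", "")[[1]])  # b15 (sign) down to b0
sign <- (-1)^b[1]                    # sign bit b15
expo <- sum(b[2:6] * 2^(4:0)) - 15   # 5-bit exponent, bias 15
frac <- 1 + sum(b[7:16] / 2^(1:10))  # 10-bit fraction with implicit leading 1
sign * 2^expo * frac                 # recovers 5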

Single precision (Float32, or float)

Source: https://en.wikipedia.org/wiki/Single-precision_floating-point_format

# Homework: figure out how this C++ code works
Rcpp::cppFunction('int float32bin(double x) {
    float flx = (float) x; 
    unsigned int binx = *((unsigned int*)&flx); 
    return binx; 
}')
message("Single precision:")
pryr::bits(float32bin(5)) # 5 in single precision
pryr::bits(float32bin(-5)) # -5 in single precision
Single precision:

‘01000000 10100000 00000000 00000000’

‘11000000 10100000 00000000 00000000’

Double precision (Float64, or double)

Source: https://en.wikipedia.org/wiki/Double-precision_floating-point_format

message("Double precision:")
pryr::bits(5)    # 5 in double precision
pryr::bits(-5)   # -5 in double precision
Double precision:

‘01000000 00010100 00000000 00000000 00000000 00000000 00000000 00000000’

‘11000000 00010100 00000000 00000000 00000000 00000000 00000000 00000000’

Special floating-point numbers

pryr::bits(Inf)    # Inf in double precision
pryr::bits(-Inf)   # -Inf in double precision

‘01111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000’

‘11111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000’

pryr::bits(0 / 0)  # NaN
pryr::bits(0 * Inf) # NaN

‘11111111 11111000 00000000 00000000 00000000 00000000 00000000 00000000’

‘11111111 11111000 00000000 00000000 00000000 00000000 00000000 00000000’
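A few properties of NaN worth remembering: it compares unequal to everything, including itself, so test for it with is.nan():

NaN == NaN     # NA: comparisons involving NaN never yield TRUE
is.nan(0 / 0)  # TRUE
is.na(NaN)     # TRUE: NaN also counts as missing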

pryr::bits(0.0)  # 0 in double precision 

‘00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000’

2^(-126)  # emin=-126
pryr::bits(float32bin(2^(-126)))
2^(-149)  # denormalized
pryr::bits(float32bin(2^(-149)))

1.17549435082229e-38

‘00000000 10000000 00000000 00000000’

1.40129846432482e-45

‘00000000 00000000 00000000 00000001’
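The same gradual underflow happens in double precision: the smallest positive (denormalized) double is $2^{-1074}$, and halving it rounds to zero.

print(2^(-1022))       # smallest normalized double (= .Machine$double.xmin)
print(2^(-1074))       # smallest denormalized double
pryr::bits(2^(-1074))  # only the last significand bit is set
print(2^(-1074) / 2)   # 0: underflow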

Rounding

pryr::bits(0.1)  # double precision 
pryr::bits(float32bin(0.1)) # single precision, 1001 gets rounded to 101(0)

‘00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011010’

‘00111101 11001100 11001100 11001101’
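A familiar consequence of this rounding (a quick illustration):

0.1 + 0.2 == 0.3              # FALSE: neither side is represented exactly
print(0.1 + 0.2, digits = 17) # 0.30000000000000004
all.equal(0.1 + 0.2, 0.3)     # TRUE: comparison up to a numeric tolerance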

Errors

Rounding (more fundamentally, finite precision) incurs errors in floating point computation. If a real number $x$ is represented by a floating point number $[x]$, then the relative error $\frac{|[x] - x|}{|x|}$ is bounded by the machine epsilon $\epsilon_{\max}$ introduced below.

Of course, we want to ensure that the error after a computation is small.
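For example, the exact relative error of representing 0.1 can be computed with the rational arithmetic of gmp (a small sketch; as.bigq converts a double to its exact rational value, asNumeric converts back):

x   <- gmp::as.bigq(0.1)                               # exact value of the double [0.1]
err <- (x - gmp::as.bigq(1, 10)) / gmp::as.bigq(1, 10) # exact relative error
gmp::asNumeric(err)   # about 5.55e-17, within the machine epsilon bound below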

Machine epsilons

Source: Computational Statistics, James Gentle, Springer, New York, 2009.

Machine precision

options(digits=20)

print(2^(-53) + 2^(-105))   # epsilon_max for double
print(1.0 + 2^(-53))
print(1.0 + (2^(-53) + 2^(-105)))
print(1.0 + 2^(-53) + 2^(-105))  # why this result? See "Catastrophic cancellation"


print(as.double(float::fl(2^(-24) + 2^(-47))))  # epsilon_max for float
print(as.double(float::fl(1.0) + float::fl(2^(-24))))
print(as.double(float::fl(1.0) + float::fl(2^(-24) + 2^(-47))))

[1] 1.1102230246251567869e-16
[1] 1
[1] 1.000000000000000222
[1] 1
[1] 5.9604651880817982601e-08
[1] 1
[1] 1.0000001192092895508
print(1/2^54 + 1/2^106)  # epsilon_min for double
print(1 - (1/2^54 + 1/2^106))
pryr::bits(1.0)
pryr::bits(1 - (1/2^54 + 1/2^106))
[1] 5.5511151231257839347e-17
[1] 0.99999999999999988898

‘00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000’

‘00111111 11101111 11111111 11111111 11111111 11111111 11111111 11111111’

.Machine
$double.eps
2.22044604925031e-16
$double.neg.eps
1.11022302462516e-16
$double.xmin
2.2250738585072e-308
$double.xmax
1.79769313486232e+308
$double.base
2
$double.digits
53
$double.rounding
5
$double.guard
0
$double.ulp.digits
-52
$double.neg.ulp.digits
-53
$double.exponent
11
$double.min.exp
-1022
$double.max.exp
1024
$integer.max
2147483647
$sizeof.long
8
$sizeof.longlong
8
$sizeof.longdouble
16
$sizeof.pointer
8

Overflow and underflow of floating-point numbers

Catastrophic cancellation

(a <- 2.0^30)
(b <- 2.0^-30)
(a + b == a)
pryr::bits(a)
pryr::bits(a + b)

1073741824

9.31322574615479e-10

TRUE

‘01000001 11010000 00000000 00000000 00000000 00000000 00000000 00000000’

‘01000001 11010000 00000000 00000000 00000000 00000000 00000000 00000000’

Analysis: suppose we want to compute $x + y$, $x, y > 0$. Let the relative error of $x$ and $y$ be \(\delta_x = \frac{[x] - x}{x}, \quad \delta_y = \frac{[y] - y}{y} .\) What the computer actually calculates is $[x] + [y]$, which in turn is represented by $[ [x] + [y] ]$. The relative error of this representation is \(\delta_{\text{sum}} = \frac{[[x]+[y]] - ([x]+[y])}{[x]+[y]} .\) Recall that $|\delta_x|, |\delta_y|, |\delta_{\text{sum}}| \le \epsilon_{\max}$.

We want to find a bound on the relative error of $[[x]+[y]]$ with respect to $x+y$. Since $|[x]+[y]| = |x(1+\delta_x) + y(1+\delta_y)| \le |x+y|(1+\epsilon_{\max})$, we have \(\begin{split} | [[x]+[y]] - (x+y) | &= | [[x]+[y]] - [x] - [y] + [x] - x + [y] - y | \\ &\le | [[x]+[y]] - [x] - [y] | + | [x] - x | + | [y] - y | \\ &\le |\delta_{\text{sum}}([x]+[y])| + |\delta_x x| + |\delta_y y| \\ &\le \epsilon_{\max}(x+y)(1+\epsilon_{\max}) + \epsilon_{\max}x + \epsilon_{\max}y \\ &\approx 2\epsilon_{\max}(x+y) \end{split}\) because $\epsilon_{\max}^2 \approx 0$. Thus \(\frac{| [[x]+[y]] - (x+y) |}{|x+y|} \le 2\epsilon_{\max}\) approximately.

olddigits <- options('digits')$digits
options(digits=20)

a <- float::fl(1.2345678) # rounding
pryr::bits(float32bin(as.double(a))) # rounding
b <- float::fl(1.2345677)
pryr::bits(float32bin(as.double(b)))
print(as.double(a - b)) # exact result should be 1e-7
pryr::bits(float32bin(as.double(a - b)))   # must be 1.0000...0 x 2^(-23)
print(1/2^23)

options(digits=olddigits)

‘00111111 10011110 00000110 01010001’

‘00111111 10011110 00000110 01010000’

[1] 1.1920928955078125e-07

‘00110100 00000000 00000000 00000000’

[1] 1.1920928955078125e-07

Analysis: Let \([x] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{1}{2^{p-1}}, \quad [y] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{0}{2^{p-1}} .\) The two floating point numbers agree in all but the last significand bit, so $[x] - [y] = 2^{-(p-1)}$ exactly ($2^{-23}$ in single precision, where $p = 24$): all the common leading digits cancel, and the one surviving bit carries essentially no information about the true difference $x - y$.

Algebraic laws

Floating-point numbers may violate many algebraic laws we are familiar with, such as the associative and distributive laws. See the example in the Machine epsilons section and HW1.
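For example, addition is not associative (a quick check based on the machine epsilon):

x <- 2^(-53)
(1 + x) + x == 1 + (x + x)      # FALSE
print((1 + x) + x, digits = 17) # 1: each addition of x is rounded away
print(1 + (x + x), digits = 17) # 1.0000000000000002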

Conditioning

Consider solving a linear system $Ax=b$:

\[\begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.333 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 1.000 \end{bmatrix} ,\]

whose solution is $(x_1, x_2) = (1.000, 1.000)$.

A <- matrix(c(1.0, 0.667, 0.5, 0.333), nrow=2)
b <- c(1.5, 1.0)
solve(A, b)

[1] 0.999999999999833 1.00000000000033

If we perturb $b$ by 0.001, i.e., solve

\[\begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.333 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 0.999 \end{bmatrix} ,\]

then the solution changes to $(x_1, x_2) = (0.000, 3.000)$.

b1 <- c(1.5, 0.999)
solve(A, b1)

[1] -1.66533453693773e-13 3.00000000000033

If we instead perturb $A$ by 0.001, i.e., solve

\[\begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.334 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 1.000 \end{bmatrix} ,\]

then this time the solution changes to $(x_1, x_2) = (2.000, -1.000)$.

A1 <- matrix(c(1.0, 0.667, 0.5, 0.334), nrow=2)
solve(A1, b)

[1] 2.00000000000017 -1.00000000000033

In other words, an input perturbation of order $10^{-3}$ yields an output perturbation of order $10^0$: three orders of magnitude of amplification in relative change!

The floating point representation $[x]$ of a real number $x$ in a digital computer easily introduces such input perturbations. The sensitivity of the output of a problem to perturbations of its input is called conditioning.

In the above problem, the condition number of the matrix $A$ (w.r.t. the Euclidean norm) is

kappa(A)

4333.00046136224
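As a rough check, the amplification factor of the relative perturbation is indeed on the order of this condition number (vnorm is a small hypothetical helper for the Euclidean norm, using A, b, b1 from above):

vnorm   <- function(v) sqrt(sum(v^2))  # Euclidean norm of a vector
rel_in  <- vnorm(b1 - b) / vnorm(b)    # ~ 5.5e-4 relative input change
rel_out <- vnorm(solve(A, b1) - solve(A, b)) / vnorm(solve(A, b))  # ~ 1.6
rel_out / rel_in                       # ~ 2850, same order as kappa(A)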

Further readings