[Computational Statistics] 1. Computer Arithmetic
Computational Statistics
Table of Contents
- Computer Arithmetic
- LU decomposition
- Cholesky Decomposition
- QR Decomposition
- Iterative Methods
- Non-linear Equations
- Optimization
- Numerical Integration
- Random Number Generation
- Monte Carlo Integration
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] fansi_0.5.0 digest_0.6.27 utf8_1.2.1 crayon_1.4.1
[5] IRdisplay_1.0 repr_1.1.3 lifecycle_1.0.0 jsonlite_1.7.2
[9] evaluate_0.14 pillar_1.6.1 rlang_0.4.11 uuid_0.1-4
[13] vctrs_0.3.8 ellipsis_0.3.2 IRkernel_1.1 tools_3.6.3
[17] compiler_3.6.3 base64enc_0.1-3 pbdZMQ_0.3-5 htmltools_0.5.1.1
Required R packages
# Dig into the internal representation of R objects
library(lobstr) # if not installed, use install.packages('lobstr')
# For unsigned integers
library(uint8) # devtools::install_github('coolbutuseless/uint8')
# For bitstrings
library(pryr)
# For big integers
library(gmp)
# For single precision floating point numbers
library(float)
library(Rcpp)
Attaching package: ‘uint8’
The following objects are masked from ‘package:base’:
:, order
Registered S3 method overwritten by 'pryr':
method from
print.bytes Rcpp
Attaching package: ‘pryr’
The following objects are masked from ‘package:lobstr’:
ast, mem_used
Attaching package: ‘gmp’
The following objects are masked from ‘package:base’:
%*%, apply, crossprod, matrix, tcrossprod
Computer arithmetic
Units of computer storage
Humans use decimal digits (why?)
Computers use binary digits (why?)
- Bit = binary digit (coined by statistician John Tukey).
- byte = 8 bits.
- KB = kilobyte = $10^3$ bytes; KiB = kibibyte = $2^{10}$ bytes.
- MB = megabyte = $10^6$ bytes; MiB = mebibyte = $2^{20}$ bytes.
- GB = gigabyte = $10^9$ bytes. Typical RAM size.
- TB = terabyte = $10^{12}$ bytes. Typical hard drive size. Roughly the amount of data generated by the NYSE each trading session.
- PB = petabyte = $10^{15}$ bytes.
- EB = exabyte = $10^{18}$ bytes. All healthcare data in 2011 amounted to ~150 EB.
- ZB = zettabyte = $10^{21}$ bytes.
The R function lobstr::obj_size() shows the amount of memory (in bytes) used by an object. (This is a better version of the built-in utils::object.size().)
x <- 100
lobstr::obj_size(x)
y <- c(20, 30)
lobstr::obj_size(y)
z <- matrix(runif(100 * 100), nrow=100) # 100 x 100 random matrix
lobstr::obj_size(z)
56 B
64 B
80,216 B
Print all variables in workspace and their sizes:
sapply(ls(), function(z, env=parent.env(environment())) obj_size(get(z, envir=env)))
    x     y     z 
   56    64 80216 
Storage of Characters
- Plain text files are stored in the form of characters: `.jl`, `.r`, `.c`, `.cpp`, `.ipynb`, `.html`, `.tex`, …
- ASCII (American Standard Code for Information Interchange): 7 bits, only $2^7=128$ characters.
# integers 0, 1, ..., 127 and corresponding ascii character
sapply(0:127, intToUtf8)
(output: all 128 ASCII characters, from NUL and the control characters '\001'…'\037' through the printable characters ' ' to '~' and DEL '\177')
- Extended ASCII: 8 bits, $2^8=256$ characters.
# integers 128, 129, ..., 255 and corresponding extended ascii character
sapply(128:255, intToUtf8)
(output: the 128 extended ASCII characters, from '\u0080' through 'ÿ')
- Unicode: UTF-8, UTF-16 and UTF-32 support many more characters, including foreign characters; the first 128 code points coincide with ASCII.
- UTF-8 is the current dominant character encoding on the internet.
Source: Google Blog
st <- "\uD1B5\uACC4\uACC4\uC0B0"
st
‘통계계산’
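To see the encoding at work, here is a small check (an added sketch, not from the original output): `charToRaw` exposes the raw bytes of a string in a UTF-8 locale, and each Hangul syllable occupies 3 bytes.
charToRaw(st) # 12 bytes: each of the 4 Hangul syllables takes 3 bytes in UTF-8
charToRaw("a") # a single byte, 0x61 = 97, the ASCII code of 'a'
utf8ToInt("A") # 65: the first 128 Unicode code points coincide with ASCII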
Integers: fixed-point number system
- Fixed-point number system is a computer model for integers $\mathbb{Z}$.
- Remember that computer memory is finite whereas the cardinality of $\mathbb{Z}$ is (countably) infinite.
- Any representation of numbers in a computer has to be an approximation.
- The number $M$ of bits and method of representing negative numbers vary from system to system.
- The `integer` type in R has $M=32$ (packages such as `bit64` support 64-bit integers).
- C has (`unsigned`) `char`, `int`, `short`, `long` (and `long long`), whose sizes depend on the machine.
- Matlab has `(u)int8`, `(u)int16`, `(u)int32`, `(u)int64`.
Unsigned integers
- Model for $\mathbb{N} \cup \{0\}$.
- For unsigned integers, the range is $[0,2^M-1]$.
- R does not support unsigned integers natively (we will see the `uint8` package later). In most other languages:
Type | Min | Max |
---|---|---|
UInt8 | 0 | 255 |
UInt16 | 0 | 65535 |
UInt32 | 0 | 4294967295 |
UInt64 | 0 | 18446744073709551615 |
UInt128 | 0 | 340282366920938463463374607431768211455 |
Signed integers
- Model of $\mathbb{Z}$. Can do subtraction.
- The first bit ("most significant bit", or MSB) is the sign bit:
  - `0` for nonnegative numbers
  - `1` for negative numbers
- Two's complement representation for negative numbers: starting from the bits of the corresponding positive number,
  - set the sign bit to `1`,
  - negate (`0` -> `1`, `1` -> `0`) the remaining bits,
  - add `1` to the result.
- The two's complement representation of a negative integer $x$ is the same as the unsigned integer $2^M - |x|$ (verified in the check after the code below).
class(5L) # postfix `L` means integer in R
pryr::bits(5L)
pryr::bits(-5L)
pryr::bits(2L * 5L) # shift bits of 5 to the left (why?)
pryr::bits(2L * -5L); # shift bits of -5 to left
‘integer’
‘00000000 00000000 00000000 00000101’
‘11111111 11111111 11111111 11111011’
‘00000000 00000000 00000000 00001010’
‘11111111 11111111 11111111 11110110’
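As a sanity check (an added sketch, not in the original notes), we can decode the bit pattern of `-5L` as an unsigned integer and verify that it equals $2^{32} - 5$:
# read the bit string printed by pryr::bits() back as an unsigned integer
ubits <- as.integer(strsplit(gsub(" ", "", pryr::bits(-5L)), "")[[1]])
sum(ubits * 2^(31:0)) # 4294967291 = 2^32 - 5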
- Two's complement representation respects modular arithmetic nicely: addition of any two signed integers is just bitwise addition, possibly modulo $2^M$. The $M=4$ case:
Source: Signed Binary Numbers, Subtraction and Overflow by Grant Braught
- The range of integers representable by an $M$-bit signed integer is $[-2^{M-1},2^{M-1}-1]$:
.Machine$integer.max # R uses 32-bit integer
2147483647
In most other languages,
Type | Min | Max |
---|---|---|
Int8 | -128 | 127 |
Int16 | -32768 | 32767 |
Int32 | -2147483648 | 2147483647 |
Int64 | -9223372036854775808 | 9223372036854775807 |
Int128 | -170141183460469231731687303715884105728 | 170141183460469231731687303715884105727 |
Overflow and underflow for integer arithmetic
R reports `NA` for integer overflow and underflow.
# The largest integer R can hold
.Machine$integer.max
2147483647
M <- 32
big <- 2^(M-1) - 1
as.integer(big)
2147483647
.Machine$integer.max + 1L
Warning message in .Machine$integer.max + 1L:
“NAs produced by integer overflow”
<NA>
`uint8` outputs the result according to modular arithmetic. So do C and Julia.
uint8::as.uint8(255) + uint8::as.uint8(1)
uint8::as.uint8(250) + uint8::as.uint8(15)
[1] 0
[1] 9
Package `gmp` supports big integers with arbitrary precision.
gmp::as.bigz(.Machine$integer.max ) + gmp::as.bigz(1L)
Big Integer ('bigz') :
[1] 2147483648
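Beyond avoiding overflow, arbitrary precision allows exact computation far outside the double range. A small added illustration (`factorialZ` is a `gmp` function):
gmp::as.bigz(2)^200 # 2^200, computed exactly
gmp::factorialZ(50) # 50!, an exact 65-digit integer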
Real numbers: floating-point number system
The floating-point number system is a computer model for the real line $\mathbb{R}$.
- Most computer systems adopt the IEEE 754 standard, established in 1985, for floating-point arithmetic. For the history, see an interview with William Kahan.
- In scientific notation, a real number is represented as \(\pm d_1.d_2d_3 \cdots d_p \times b^e, \quad 0 \le d_i < b.\) Humans use the base $b=10$ and digits $d_i=0, 1, \dotsc, 9$. In computers, the base is $b=2$ and the digits $d_i$ are 0 or 1.
- Normalized vs denormalized numbers. For example, the decimal number 18 is \(+1.0010 \times 2^4 \quad (\text{normalized})\) or, equivalently, \(+0.1001 \times 2^5 \quad (\text{denormalized}).\)
- In the floating-point number system, the computer stores
  - the sign bit,
  - the fraction (or mantissa, or significand) of the normalized representation,
  - the actual exponent plus a bias.
- R supports double precision floating point numbers (see below) via `double`.
- C supports floating point types `float` and `double`, where on most systems `float` corresponds to single precision while `double` corresponds to double precision.
- Julia provides `Float16` (half precision, implemented in software using `Float32`), `Float32` (single precision), `Float64` (double precision), and `BigFloat` (arbitrary precision).
R has no single precision data type. All real numbers are stored in double precision format. The functions `as.single` and `single` are identical to `as.double` and `double` except they set the attribute `Csingle` that is used in the `.C` and `.Fortran` interface, and they are intended only to be used in that context. (R Documentation)
- For ease of exposition, we begin with half precision.
Half precision (Float16)
Source: https://en.wikipedia.org/wiki/Half-precision_floating-point_format
- In Julia, `Float16` is the type for half precision numbers.
- MSB is the sign bit.
- 10 significand bits (fraction = mantissa), hence $p=11$ (why?)
- 5 exponent bits: $e_{\max}=15$, $e_{\min}=-14$, bias $= 15 = 01111_2$ for encoding:
  - $e_{\min} = 00001_2 - 01111_2 = -14_{10}$
  - $e_{\max} = 11110_2 - 01111_2 = 15_{10}$
- $e_{\min}-1$ and $e_{\max}+1$ are reserved for special numbers.
- Range of magnitude: $10^{\pm 4}$ in decimal because $\log_{10} (2^{15}) \approx 4.5$.
- Precision: $\log_{10}(2^{11}) \approx 3.311$ decimal digits.
# This is Julia
println("Half precision:")
@show bitstring(Float16(5)) # 5 in half precision
@show bitstring(Float16(-5)); # -5 in half precision
Half precision:
bitstring(Float16(5)) = "0100010100000000"
bitstring(Float16(-5)) = "1100010100000000"
Single precision (Float32, or `float`)
Source: https://en.wikipedia.org/wiki/Single-precision_floating-point_format
- In C, `float` is the type for single precision numbers on most systems.
- In Julia, `Float32` is the type for single precision numbers.
- In R, single precision is not supported natively. We use the following workaround:
# Homework: figure out how this C++ code works
Rcpp::cppFunction('int float32bin(double x) {
float flx = (float) x;
unsigned int binx = *((unsigned int*)&flx);
return binx;
}')
- MSB is the sign bit.
- 23 significand bits ($p=24$).
- 8 exponent bits: $e_{\max}=127$, $e_{\min}=-126$, bias=127.
- $e_{\min}-1$ and $e_{\max}+1$ are reserved for special numbers.
- Range of magnitude: $10^{\pm 38}$ in decimal because $\log_{10} (2^{127}) \approx 38$.
- Precision: $\log_{10}(2^{24}) \approx 7.225$ decimal digits.
message("Single precision:")
pryr::bits(float32bin(5)) # 5 in single precision
pryr::bits(float32bin(-5)) # -5 in single precision
Single precision:
‘01000000 10100000 00000000 00000000’
‘11000000 10100000 00000000 00000000’
Double precision (Float64, or `double`)
Source: https://en.wikipedia.org/wiki/Double-precision_floating-point_format
- Double precision (64 bits = 8 bytes) numbers are the dominant data type in scientific computing.
- In C, `double` is the type for double precision numbers on most systems. It is the default type for numeric values.
- In Julia, `Float64` is the type for double precision numbers.
- In R, `double` is the type for double precision numbers.
- MSB is the sign bit.
- 52 significand bits ($p=53$).
- 11 exponent bits: $e_{\max}=1023$, $e_{\min}=-1022$, bias=1023.
- $e_{\min}-1$ and $e_{\max}+1$ are reserved for special numbers.
- Range of magnitude: $10^{\pm 308}$ in decimal because $\log_{10} (2^{1023}) \approx 308$.
- Precision: $\log_{10}(2^{53}) \approx 15.95$ decimal digits.
message("Double precision:")
pryr::bits(5) # 5 in double precision
pryr::bits(-5) # -5 in double precision
Double precision:
‘01000000 00010100 00000000 00000000 00000000 00000000 00000000 00000000’
‘11000000 00010100 00000000 00000000 00000000 00000000 00000000 00000000’
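To connect the bit pattern above with the format (an added sketch): $5 = 1.01_2 \times 2^2$, so the stored exponent is the actual exponent plus the bias, $2 + 1023 = 1025 = 10000000001_2$, which matches the 11 exponent bits above.
e <- floor(log2(5)) # actual exponent: 2
e + 1023 # stored (biased) exponent: 1025
paste(rev(as.integer(intToBits(1025))[1:11]), collapse="") # "10000000001"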
Special floating-point numbers
- Exponent $e_{\max}+1$ plus a zero mantissa means $\pm \infty$.
pryr::bits(Inf) # Inf in double precision
pryr::bits(-Inf) # -Inf in double precision
‘01111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000’
‘11111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000’
- Exponent $e_{\max}+1$ plus a nonzero mantissa means `NaN`. `NaN` could be produced from `0 / 0`, `0 * Inf`, …
- In general `NaN` does not compare equal to `NaN`, even when the bit patterns agree. Test whether a number is `NaN` with the `is.nan` function.
pryr::bits(0 / 0) # NaN
pryr::bits(0 * Inf) # NaN
‘11111111 11111000 00000000 00000000 00000000 00000000 00000000 00000000’
‘11111111 11111000 00000000 00000000 00000000 00000000 00000000 00000000’
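A short added illustration: comparison with `NaN` never yields `TRUE` in R, which is why `is.nan` is the right test.
NaN == NaN # NA, not TRUE
is.nan(0 / 0) # TRUE
is.nan(0 * Inf) # TRUE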
- Exponent $e_{\min}-1$ with a zero mantissa represents the real number 0 (“exact zero”).
- Why do we need an exact zero?
pryr::bits(0.0) # 0 in double precision
‘00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000’
- Exponent $e_{\min}-1$ with a nonzero mantissa is used for numbers of magnitude less than $b^{e_{\min}}$. Numbers are denormalized in the range $(0,b^{e_{\min}})$ – gradual underflow.
- For example, in half precision, $e_{\min}=-14$ but $2^{-24}$ is represented by $0.0000000001_2 \times 2^{-14}$.
- In single precision, $e_{\min}=-126$ but $2^{-149}$ is represented by $0.00000000000000000000001_2 \times 2^{-126}$.
2^(-126) # emin=-126
pryr::bits(float32bin(2^(-126)))
2^(-149) # denormalized
pryr::bits(float32bin(2^(-149)))
1.17549435082229e-38
‘00000000 10000000 00000000 00000000’
1.40129846432482e-45
‘00000000 00000000 00000000 00000001’
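The same gradual underflow occurs in double precision (an added sketch): $2^{-1074}$ is the smallest positive subnormal double, and halving once more underflows to exact zero.
2^(-1074) # smallest positive subnormal double
pryr::bits(2^(-1074)) # zero exponent field, mantissa 00...01
2^(-1075) == 0 # TRUE: underflows to exact zero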
Rounding
- Rounding is necessary whenever a number has more than $p$ significand bits. Most computer systems use the default IEEE 754 round to nearest mode (also called ties to even mode). For example, the number 1/10 cannot be represented accurately as a (binary) floating point number: \(0.1 = 1.10011001\dotsc_2 \times 2^{-4}\)
pryr::bits(0.1) # double precision
pryr::bits(float32bin(0.1)) # single precision, 1001 gets rounded to 101(0)
‘00111111 10111001 10011001 10011001 10011001 10011001 10011001 10011010’
‘00111101 11001100 11001100 11001101’
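A familiar consequence (added for illustration): since 0.1, 0.2 and 0.3 are all rounded, exact decimal identities can fail, while comparison up to a tolerance succeeds.
0.1 + 0.2 == 0.3 # FALSE
print(0.1 + 0.2, digits=17) # 0.30000000000000004
all.equal(0.1 + 0.2, 0.3) # TRUE: equal up to a tolerance of about 1.5e-8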
Errors
Rounding (more fundamentally, finite precision) incurs errors in floating point computation. If a real number $x$ is represented by a floating point number $[x]$, then
- Absolute error: \(| [x] - x |\)
- Relative error: \(\frac{| [x] - x |}{|x|}\) (if $x \neq 0$).
Of course, we want to ensure that the error after a computation is small.
Machine epsilons
- Floating-point numbers do not occur uniformly over the real number line.
Source: What you never wanted to know about floating point but will be forced to find out
- There are the same number of representable numbers in $(2^i, 2^{i+1}]$ as in $(2^{i+1}, 2^{i+2}]$. Within an interval, they are uniformly distributed.
- Machine epsilons are the spacings of numbers around 1:
  - $\epsilon_{\max}$ = (smallest positive floating point number that, added to 1, gives a result different from 1) = $\frac{1}{2^p} + \frac{1}{2^{2p-1}}$
  - $\epsilon_{\min}$ = (smallest positive floating point number that, subtracted from 1, gives a result different from 1) = $\frac{1}{2^{p+1}} + \frac{1}{2^{2p}}$.
Source: Computational Statistics, James Gentle, Springer, New York, 2009.
- "Round to nearest, ties to even" rule: round to the nearest representable value; if the number falls midway, round to the nearest value with an even least significant digit (the default for IEEE 754 binary floating point numbers).
- Any real number in the interval $\left[1 - \frac{1}{2^{p+1}}, 1 + \frac{1}{2^p}\right]$ is represented by the floating point number $1 = 1.00\dotsc 0_2 \times 2^0$.
- Adding $\frac{1}{2^p}$ to 1 won't reach the next representable floating point number $1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}$. Hence $\epsilon_{\max} > \frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}$.
- Adding the floating point number next to $\frac{1}{2^p} = 1.00\dotsc 0_2 \times 2^{-p}$ to 1 WILL result in $1.00\dotsc 01_2 \times 2^0 = 1 + \frac{1}{2^{p-1}}$, hence $\epsilon_{\max} = 1.00\dotsc 01_2 \times 2^{-p} = \frac{1}{2^p} + \frac{1}{2^{2p-1}}$.
- Subtracting $\frac{1}{2^{p+1}}$ from 1 results in $1-\frac{1}{2^{p+1}} = \frac{1}{2} + \frac{1}{2^2} + \dotsb + \frac{1}{2^p} + \frac{1}{2^{p+1}}$, which is represented by the floating point number $1.00\dotsc 0_2 \times 2^{0} = 1$ by the "ties to even" rule. Hence $\epsilon_{\min} > \frac{1}{2^{p+1}}$.
- The smallest positive floating point number larger than $\frac{1}{2^{p+1}}$ is $\frac{1}{2^{p+1}} + \frac{1}{2^{2p}} = 1.00\dotsc 1_2 \times 2^{-p-1}$. Thus $\epsilon_{\min} = \frac{1}{2^{p+1}} + \frac{1}{2^{2p}}$.
Machine precision
- Machine epsilon is often called the machine precision.
- If a positive real number $x \in \mathbb{R}$ is represented by $[x]$ in the floating point arithmetic, then \([x] = \left( 1 + \sum_{i=1}^{p-1}\frac{b_{i+1}}{2^i}\right) \times 2^e.\) Thus $x - \frac{2^e}{2^p} < [x] \le x + \frac{2^e}{2^p}$, and \(\begin{split} \frac{| x - [x] |}{|x|} &\le \frac{2^e}{2^p|x|} \le \frac{2^e}{2^p}\frac{1}{[x]-2^e/2^p} \\ &\le \frac{2^e}{2^p}\frac{1}{2^e(1-1/2^p)} \quad (\because [x] \ge 2^e) \\ &\le \frac{2^e}{2^p}\frac{1}{2^e}\left(1 + \frac{1}{2^{p-1}}\right) \\ &= \frac{1}{2^p} + \frac{1}{2^{2p-1}} = \epsilon_{\max}. \end{split}\) That is, the relative error of the floating point representation $[x]$ of a real number $x$ is bounded by $\epsilon_{\max}$.
options(digits=20)
print(2^(-53) + 2^(-105)) # epsilon_max for double
print(1.0 + 2^(-53))
print(1.0 + (2^(-53) + 2^(-105)))
print(1.0 + 2^(-53) + 2^(-105)) # why this result? See "Catastrophic cancellation"
print(as.double(float::fl(2^(-24) + 2^(-47)))) # epsilon_max for float
print(as.double(float::fl(1.0) + float::fl(2^(-24))))
print(as.double(float::fl(1.0) + float::fl(2^(-24) + 2^(-47))))
[1] 1.1102230246251567869e-16
[1] 1
[1] 1.000000000000000222
[1] 1
[1] 5.9604651880817982601e-08
[1] 1
[1] 1.0000001192092895508
print(1/2^54 + 1/2^106) # epsilon_min for double
print(1 - (1/2^54 + 1/2^106))
pryr::bits(1.0)
pryr::bits(1 - (1/2^54 + 1/2^106))
[1] 5.5511151231257839347e-17
[1] 0.99999999999999988898
‘00111111 11110000 00000000 00000000 00000000 00000000 00000000 00000000’
‘00111111 11101111 11111111 11111111 11111111 11111111 11111111 11111111’
- In R, the variable `.Machine` contains numerical characteristics of the machine. `double.neg.eps` is our $\epsilon_{\max}$.
.Machine
$double.eps: 2.22044604925031e-16
$double.neg.eps: 1.11022302462516e-16
$double.xmin: 2.2250738585072e-308
$double.xmax: 1.79769313486232e+308
$double.base: 2
$double.digits: 53
$double.rounding: 5
$double.guard: 0
$double.ulp.digits: -52
$double.neg.ulp.digits: -53
$double.exponent: 11
$double.min.exp: -1022
$double.max.exp: 1024
$integer.max: 2147483647
$sizeof.long: 8
$sizeof.longlong: 8
$sizeof.longdouble: 16
$sizeof.pointer: 8
Overflow and underflow of floating-point number
- For double precision, the range is $\pm 10^{\pm 308}$. In most situations, underflow (the magnitude of the result is less than $10^{-308}$) is preferred over overflow (the magnitude of the result is larger than $10^{+308}$). Overflow produces $\pm \infty$, while underflow yields zeros or subnormal numbers.
- E.g., under the logit link, the success probability is \(p = \frac{\exp (x^T \beta)}{1 + \exp (x^T \beta)} = \frac{1}{1+\exp(- x^T \beta)}.\) The former expression can easily lead to `Inf / Inf = NaN`, while the latter leads to gradual underflow; see the sketch after this list.
- `.Machine$double.xmax` and `.Machine$double.xmin` give the largest and the smallest positive normalized numbers representable in double precision.
Catastrophic cancellation
- Scenario 1: Addition or subtraction of two numbers of widely different magnitudes: $a+b$ or $a-b$ where $a \gg b$ or $a \ll b$. We lose precision in the number of smaller magnitude. Consider \(\begin{eqnarray*} a &=& x.xxx ... \times 2^{30} \\ b &=& y.yyy... \times 2^{-30} \end{eqnarray*}\) What happens when the computer calculates $a+b$? We get $a+b=a$!
(a <- 2.0^30)
(b <- 2.0^-30)
(a + b == a)
pryr::bits(a)
pryr::bits(a + b)
1073741824
9.31322574615479e-10
TRUE
‘01000001 11010000 00000000 00000000 00000000 00000000 00000000 00000000’
‘01000001 11010000 00000000 00000000 00000000 00000000 00000000 00000000’
Analysis: suppose we want to compute $x + y$, $x, y > 0$. Let the relative error of $x$ and $y$ be \(\delta_x = \frac{[x] - x}{x}, \quad \delta_y = \frac{[y] - y}{y} .\) What the computer actually calculates is $[x] + [y]$, which in turn is represented by $[ [x] + [y] ]$. The relative error of this representation is \(\delta_{\text{sum}} = \frac{[[x]+[y]] - ([x]+[y])}{[x]+[y]} .\) Recall that $|\delta_x|, |\delta_y|, |\delta_{\text{sum}}| \le \epsilon_{\max}$.
We want to find a bound on the relative error of $[[x]+[y]]$ with respect to $x+y$. Since $|[x]+[y]| = |x(1+\delta_x) + y(1+\delta_y)| \le |x+y|(1+\epsilon_{\max})$, we have \(\begin{split} | [[x]+[y]] - (x+y) | &= | [[x]+[y]] - [x] - [y] + [x] - x + [y] - y | \\ &\le | [[x]+[y]] - [x] - [y] | + | [x] - x | + | [y] - y | \\ &\le |\delta_{\text{sum}}([x]+[y])| + |\delta_x x| + |\delta_y y| \\ &\le \epsilon_{\max}(x+y)(1+\epsilon_{\max}) + \epsilon_{\max}x + \epsilon_{\max}y \\ &\approx 2\epsilon_{\max}(x+y) \end{split}\) because $\epsilon_{\max}^2 \approx 0$. Thus \(\frac{| [[x]+[y]] - (x+y) |}{|x+y|} \le 2\epsilon_{\max}\) approximately.
- Scenario 2: Subtraction of two nearly equal numbers eliminates significant digits: $a-b$ where $a \approx b$. Consider \(\begin{eqnarray*} a &=& x.xxxxxxxxxx1ssss \\ b &=& x.xxxxxxxxxx0tttt \end{eqnarray*}\) The result is $1.vvvvu \dotsc u$ where the $u$ are unassigned digits.
olddigits <- options('digits')$digits
options(digits=20)
a <- float::fl(1.2345678) # rounding
pryr::bits(float32bin(as.double(a))) # rounding
b <- float::fl(1.2345677)
pryr::bits(float32bin(as.double(b)))
print(as.double(a - b)) # the correct result would be 1e-7
pryr::bits(float32bin(as.double(a - b))) # must be 1.0000...0 x 2^(-23)
print(1/2^23)
options(digits=olddigits)
‘00111111 10011110 00000110 01010001’
‘00111111 10011110 00000110 01010000’
[1] 1.1920928955078125e-07
‘00110100 00000000 00000000 00000000’
[1] 1.1920928955078125e-07
Analysis: Let \([x] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{1}{2^{p-1}}, \quad [y] = 1 + \sum_{i=1}^{p-2}\frac{d_{i+1}}{2^i} + \frac{0}{2^{p-1}} .\)
- $[x]-[y] = \frac{1}{2^{p-1}} = [[x]-[y]]$.
- The true difference $x-y$ may lie anywhere in $(0, \frac{1}{2^{p-2}}+\frac{1}{2^{2p}}]$.
- The relative error \(\frac{|x-y -[[x]-[y]]|}{|x-y|}\) is unbounded – no guarantee of any significant digit!
- Implications for numerical computation:
  - Rule 1: add small numbers together before adding larger ones (demonstrated in the sketch after this list).
  - Rule 2: add numbers of like magnitude together (pairing). When all numbers are of the same sign and similar magnitude, add them in pairs so that at each stage the summands are of similar magnitude.
  - Rule 3: avoid subtraction of two numbers that are nearly equal.
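A minimal demonstration of Rule 1 (an added sketch): the spacing of doubles near $10^{16}$ is 2, so adding 1.0 to $10^{16}$ one at a time loses every addition, while summing the small numbers first preserves them.
big <- 1e16
small <- rep(1.0, 1000)
s1 <- big
for (u in small) s1 <- s1 + u # each addition rounds back to 1e16
s1 - big # 0: the thousand ones vanished
(big + sum(small)) - big # 1000: small numbers added together first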
Algebraic laws
Floating-point numbers may violate many algebraic laws we are familiar with, such as the associative and distributive laws. See the example in the Machine epsilons section and HW1.
Conditioning
Consider solving a linear system $Ax=b$:
\[\begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.333 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 1.000 \end{bmatrix} ,\]whose solution is $(x_1, x_2) = (1.000, 1.000)$.
A <- matrix(c(1.0, 0.667, 0.5, 0.333), nrow=2)
b <- c(1.5, 1.0)
solve(A, b)
[1] 0.999999999999833 1.00000000000033
If we perturb $b$ by 0.001, i.e., solve
\[\begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.333 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 0.999 \end{bmatrix} ,\]then the solution changes to $(x_1, x_2) = (0.000, 3.000)$.
b1 <- c(1.5, 0.999)
solve(A, b1)
[1] -1.66533453693773e-13 3.00000000000033
If we instead perturb $A$ by 0.001, i.e., solve
\[\begin{bmatrix} 1.000 & 0.500 \\ 0.667 & 0.334 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.500 \\ 1.000 \end{bmatrix} ,\]then this time the solution changes to $(x_1, x_2) = (2.000, -1.000)$.
A1 <- matrix(c(1.0, 0.667, 0.5, 0.334), nrow=2)
solve(A1, b)
[1] 2.00000000000017 -1.00000000000033
In other words, an input perturbation of order $10^{-3}$ yields an output perturbation of order $10^0$. That's 3 orders of magnitude of relative change!
The floating point representation $[x]$ of a real number $x$ in a digital computer can easily introduce such an input perturbation. The sensitivity of the output of a problem with respect to perturbations of its input is called conditioning.
- Abstractly, a problem can be viewed as function $f: X \to Y$ where $X$ is a normed vector space of data and $Y$ is a normed vector space of solutions.
- The problem of solving $Ax=b$ for fixed $b$ is $f: A \mapsto A^{-1}b$ with $X=\{M\in\mathbb{R}^{n\times n}: M \text{ is invertible}\}$ and $Y = \mathbb{R}^n$.
- The combination of a problem $f$ with given data $x$ is called a problem instance, or simply a problem when no confusion occurs.
- A well-conditioned problem (instance) is one such that all small perturbations of $x$ lead to only small changes in $f(x)$.
- An ill-conditioned problem is one such that some small perturbation of $x$ leads to a large change in $f(x)$.
- The (relative) condition number $\kappa=\kappa(x)$ of a problem is defined by \(\kappa = \lim_{\delta\to 0}\sup_{\|\delta x\|\le \delta}\frac{\|\delta f\|/\|f(x)\|}{\|\delta x\|/\|x\|} = \sup_{\delta x} \frac{\|\delta f\|/\|f(x)\|}{\|\delta x\|/\|x\|}\)
- For the problem of solving $Ax=b$ for fixed $b$, $f: A \mapsto A^{-1}b$, it can be shown that the condition number of $f$ is \(\kappa = \|A\|\|A^{-1}\| =: \kappa(A),\) where $\|A\|$ is the matrix norm induced by the vector norm $\|\cdot\|$, i.e., \(\|A\| = \sup_{x\neq 0} \frac{\|Ax\|}{\|x\|}.\) If the 2-norm is used, then \(\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A),\) the ratio of the largest and smallest singular values of $A$.
In the above problem, the condition number of the matrix $A$ (w.r.t. the Euclidean norm) is
kappa(A)
4333.00046136224
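For comparison (an added sketch), the 2-norm condition number equals the ratio of the largest to the smallest singular value; `kappa()` with the default `exact=FALSE` returns only an estimate.
sv <- svd(A)$d # singular values of A
max(sv) / min(sv) # sigma_max / sigma_min
kappa(A, exact=TRUE) # the same quantity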
Further readings
- Section II.2, Computational Statistics by James Gentle (2009).
- Sections 1.5 and 2.2, Applied Numerical Linear Algebra by James W. Demmel (1997).
- What every computer scientist should know about floating-point arithmetic by David Goldberg (1991).