Finite Digit Arithmetic

Computers do not have memory to store infinitely many digits. This can lead to errors if not taken care of properly. For example, $\sqrt{3}$ is irrational and thus cannot be stored in a computer exactly. For most cases, we chose a rational number whose square is not exactly equal to 3 but is reasonably close to 3 to pass off without any problem. This may lead to Round Off Errors.

Representing Real Numbers

64 bits are used to represent a number.

The first bit is a sign indicator, denoted using $s$
The next 11 bit exponent is called the characteristic $c$
The remaining 52 bit fraction $f$ is called the mantissa

The final value of the exponent is given by $(c-1023)$ to ensure that negative exponents are allowed as well. $\text{Number }= (-1)^s2^{c-1023}(1+f)$ $m$ has 52 bits, meaning that the precision of this method of representation is 16 digits.

The smallest positive number that can be represented by this notation would be given by $(s,c,f) = (0,1,0)$. The number itself would be $2^{-1022} \approx 0.22251\times 10^{-307}$. Note that both $(0,0,0)$ and $(1,0,0)$ correspond to $0$. Numbers smaller than this result in underflow.

Similarly, the largest number would be $2^{1023}\cdot(2-2^{-52})$. Numbers larger than this result in overflow.

Floating Point Representation

We will use numbers of form $\pm 0.d_1d_2\ldots d_k \times 10^n \qquad 1\leq d_1\leq9, 0\leq d_i\leq 9$ Converting a number $y$ which has more than $k$ decimal digits can be done in two ways;

Chopping, wherein the additional digits are simply dropped
Rounding, where we add $5\times10^{n-k-1}$ and then drop the additional digits.

Let $\rho$ be the real number and $\rho^*$ be the approximation.

Absolute Error	Relative Error
$\vert \rho - \rho^* \vert$	$\frac{\vert \rho-\rho^* \vert}{\rho}$

Significant Digits

We say $\rho^*$ approximates $\rho$ to $t$ significant digits if $t$ is the largest non-negative integer for which $\frac{\vert \rho-\rho^* \vert}{\rho} < 5\times 10^{-t}$

Finite Digit Arithmetic

Operation	Meaning
$x\oplus y$	$fl(fl(x)+fl(y))$
$x \ominus y$	$fl(fl(x)-fl(y))$
$x \otimes y$	$fl(fl(x)\times fl(y))$
$x (\div) y$	$fl(fl(x)\div fl(y))$