Bit

In the IEEE 754 standard for double-precision floating-point format (used for double in C++ on most platforms), numbers are represented using 64 bits: 1 bit for the sign (0 for positive, 1 for negative), 11 bits for the exponent, and 52 bits for the mantissa. The actual value of a floating-point number is calculated as:

(−1)^sign × 1.mantissa × 2^(exponent − bias)

The bias for double precision is 1023, and the three fields total 64 bits.

Concrete Example: Consider the number 6.5 in double-precision floating-point format.

First, convert 6.5 to binary: 6_10 = 110_2 and 0.5_10 = 0.1_2, so 6.5_10 = 110.1_2. Normalize this binary number in scientific notation: 110.1_2 = 1.101×2^2. The sign bit is 0 (since 6.5 is positive). The mantissa is the normalized fractional part, “1.101”, but without the leading 1 (which is always implied), so it’s “101” followed by zeros to make up the required 52 bits. The exponent is 2, but we need to store it in a biased format, so we add the bias (1023) to it: 2+1023=1025_10=10000000001_2.

Now, let’s put it all together in the 64-bit representation:
Sign bit: 0
Exponent: 10000000001
Mantissa: 1010000000000000000000000000000000000000000000000000

To convert an integer to binary, divide the number by 2 and keep track of the remainders, repeating until the quotient becomes 0. For 6, the process is as follows:
6 ÷ 2 = 3 with a remainder of 0
3 ÷ 2 = 1 with a remainder of 1
1 ÷ 2 = 0 with a remainder of 1
Now, take the remainders and read them in reverse order to get the binary representation. For 6, the binary representation is 110.

For 0.5, the process is as follows: 0.5×2=1.0. The integer part is 1, and there is no fractional part left, so 0.5_10 = 0.1_2.

For 0.4:

Multiply 0.4 by 2: 0.4×2=0.8. The integer part is 0, so write down 0 (to the right of the binary point). Multiply the fractional part (0.8) by 2 again: 0.8×2=1.6. Write down 1. Multiply the new fractional part (0.6) by 2: 0.6×2=1.2. Write down 1. … and so on. This process can continue indefinitely, or until a satisfactory level of precision is reached.

0.5: Multiply by 2, get 1.0. Binary: .1
0.4: Multiply by 2, get 0.8 (write 0), then 1.6 (write 1), then 1.2 (write 1), then 0.4 (write 0), and the cycle repeats. Binary: .0110011001…
0.9: Multiply by 2, get 1.8 (write 1), then 1.6 (write 1), then 1.2 (write 1), then 0.4 (write 0), and so on. Binary: .1110011001…

Biased Exponent = 2 + 1023 = 1025.
Converting the biased exponent to binary:
1025 ÷ 2 = 512 remainder 1
512 ÷ 2 = 256 remainder 0
256 ÷ 2 = 128 remainder 0
… continue this process until the quotient is zero; reading the remainders in reverse order gives 10000000001.

Let’s represent 0.125 in a double-precision floating-point format:

0.125_10 = 0.001_2. Write it in normalized form: 1.0×2^−3. Add the bias (1023): −3+1023=1020. The binary representation of 1020 is 1111111100, padded to 11 bits as 01111111100. Mantissa (excluding the leading 1): all zeros in this case.

Step 6: Assemble the Floating-Point Representation
Sign bit: 1 bit (0, since 0.125 is positive)
Exponent: 11 bits (01111111100 for 1020)
Mantissa: 52 bits (all zeros in this case)

Final Representation
Sign: 0
Exponent: 01111111100
Mantissa: 0000000000000000000000000000000000000000000000000000

Reading the remainders from the last division to the first gives us the binary representation of 1020: 1111111100. Since the exponent field in the IEEE 754 double-precision format is always 11 bits long, this is padded with a leading zero to 01111111100.

Let’s consider each number to be summed to have a high set bit and a low set bit, each with an appropriate place-value identification. Both of these bits are set, and the difference in place value cannot exceed the width of the mantissa in the representation. Between the high set bit and the low set bit of a number there is a collection of other bits with values of zero or one, as given by the mantissa.

Across the array of all numbers to be summed, determine the highest place value of any high set bit and the lowest place value of any low set bit. This could be performed in parallel using max and min reductions. Then, starting from the lowest place value of any low set bit, up to the highest place value of any high set bit, do:

• identify which values have a set bit at that place value. The result of this step is a zero or one for each thread/number (trivially parallelizable)
• perform a sum reduction on the output of the previous step (parallelizable)
• add the result, if any, from the previous iteration, shifted right by one bit
• using that total, assign a zero to the final-answer bit position corresponding to this iteration’s place value if the total is even; otherwise assign a one
• repeat for the next iteration/place-value/bit position

Although I have indicated that the above loop must proceed from the lowest set bit position to the highest set bit position, in fact it must proceed up to the highest bit position and then continue until the result of adding the previous iteration result, shifted right by 1, is zero. When the final result is thus assembled, it will consist of a potentially very long binary word. That very long binary word will need to be truncated/rounded to fit in the desired result representation.
The above arithmetic is essentially all integer arithmetic, and with appropriate choice of intermediate integer quantities (e.g. unsigned long long) it should be able to handle “arbitrary” array sizes. The above method could certainly be optimized for performance. For example, arrangement of the numbers from largest to smallest would allow entire threadblocks to retire early, or even traverse over a fixed subset of the place value range, as determined by their subset of numbers. It’s not obvious to me that any method could give a more accurate representation of the true result than this. Furthermore there should be no variability in result, from run-to-run, or from machine-to-machine.

A worked example in single precision (32 bits: 1 sign bit, 8 exponent bits with bias 127, 23 mantissa bits):

X = 0100 0010 0000 1111 0000 0000 0000 0000
Y = 0100 0001 1010 0100 0000 0000 0000 0000

Step 1: decode the fields.
X: S = 0, e = 1000 0100, 132 − 127 = 5, so X = 1.0001111 x 2^5
Y: S = 0, e = 1000 0011, 131 − 127 = 4, so Y = 1.01001 x 2^4

Step 2: the normalized values.
X ≈ 1.0001111 x 2^5
Y ≈ 1.01001 x 2^4

Step 3: re-encode the exponent by adding the bias back:
5 + 127 = 132 = (1000 0100)_2

Assembled for X: 0 | 1000 0100 | 000 1111 0000 0000 0000 0000 (32 bits)
