What Is a Floating Point Number? Understanding 32-Bit and 64-Bit

Bobby Brown

Post 2025-04-02

What Is a Floating Point Number? Understanding 32-Bit and 64-Bit

1. What Is a Floating-Point Number?
2. Floating-Point Representation
- 1 ) IEEE 754 Floating-Point Format
- 2 ) Floating-Point Precision: Float32 & Float64
3. 32-bit Floating-Point Calculation Example
4. Modbus Floating-Point Access and Conversion
- 1 ) How Modbus Stores a 32-bit Floating-Point Number
- 2 ) How to Decode Modbus Values Back into Floating-Point Numbers

1. What Is a Floating-Point Number?

A floating-point number is a way to represent decimals or extremely large numbers using a limited amount of memory. As Mitra (2017, para. 2) notes, "While integer operations are faster and free from rounding errors, integer types are not well-suited for representing fractions or values across an extreme range."

That’s where floating-point numbers come in—they allow us to handle continuous numerical data more effectively. They're especially useful in scientific calculations, such as representing physical constants or measuring distances on atomic or galactic scales. These types of values are difficult, if not impossible, to express with integers. That’s why virtually all modern programming languages support floating-point data types as a standard.

[1]

2. Floating-Point Representation

Floating-point numbers work similarly to scientific notation.

For example, the number 12345 in scientific notation is written as

1.2345 × 10⁴

Here, 1.2345 is called the fraction (or mantissa), and 4 is the exponent.

Scientific notation uses base 10 (decimal), while floating-point numbers use base 2 (binary).

Take the number 15 as an example:

Integer (Decimal)	Scientific Notation (Decimal)	Floating-point (Binary)
15	(1.5)₁₀ × 10¹	(1.111)₂ × 2³

Example: Representing the number 15 as a floating-point value

Step 1: Convert 15 to binary

Step 2: Shift all significant digits of the mantissa to the right of the decimal point to form (1.XXXX)₂ × 2ⁿ
Move the decimal point three places to the left. To keep the value the same, raise base 2 to the power of 3

✅ Final Floating-Point Representation:

1 ) IEEE 754 Floating-Point Format

To avoid inconsistent floating-point results across different platforms, the IEEE (Institute of Electrical and Electronics Engineers) introduced the IEEE 754 standard in 1985. This standard has been widely adopted since the 1990s to define how floating-point numbers are represented and calculated (Wikipedia, 2024):

IEEE 754 defines:

Storage formats for floating-point numbers (e.g., 32-bit Float32 and 64-bit Float64).
Rules for floating-point operations (e.g., rounding modes, special values like NaN and infinity).
Exceptional cases (e.g., overflow, underflow, and division by zero).

IEEE 754 specifies the structure of floating-point numbers as follows:

Value = (−1)^sign × 2^exponent × (1 + fraction)

Sign Bit: 1 bit, determines whether the number is positive or negative: 0 = positive / 1 = negative
Exponent Bias: Used to control the range of the value (as powers of 2), calculated as the exponent minus the bias.
Fraction (Mantissa): Represents "1 + fractional part" and stores the significant digits (similar to "1.XX" in scientific notation).

| IEEE754

[2]

2 ) Floating-Point Precision: Float32 & Float64

The floating-point representation of 15 is (1.111)₂ × 2³, written as

(−1)⁰ × 2³ × (1.111)₂

This follows the IEEE 754 normalized format. When computers store floating-point numbers, they encode them according to the IEEE 754 standard using either single precision (Float32) or double precision (Float64).

Single Precision

When storing a floating-point number, the leading 1 is implicit, so only the fractional part is saved instead of the full significand.

| Single Precision

IEEE 754 single precision uses 32 bits in total:

1 bit: Sign bit
8 bits: Exponent (m)
23 bits: Fraction (f)

In single precision, the exponent is biased by 127 to avoid storing a negative sign. The computer stores m + 127 instead of m. Therefore, the maximum exponent value is 127, and the minimum is −126. (Grainger College of Engineering, 2019)
* For calculation examples, please refer to the next section.

[3]

Double Precision

| Double Precision

IEEE 754 double precision uses 64 bits in total:

1 bit: Sign bit
11 bits: Exponent (m)
52 bits: Fractional part (f)

In double precision, the exponent is biased by 1023 to avoid storing a negative sign. The computer stores m + 1023 instead of m. Therefore, the maximum exponent value is 1023, and the minimum is −1022. (Grainger College of Engineering, 2019)

Float32 vs Float64 Feature Comparison


Feature	Float32 (Single Precision)	Float64 (Double Precision)
Bit Length	32 bits	64 bits
Fraction Length	23 bits (about 7–8 decimal digits)	52 bits (about 15–16 decimal digits)
Exponent Length	8 bits (bias: 127)	11 bits (bias: 1023)
Value Range	Approx. 10⁻³⁸ ~ 10³⁸	Approx. 10⁻³⁰⁸ ~ 10³⁰⁸
Memory Usage	4 bytes (32-bit)	8 bytes (64-bit)
Performance	Faster (ideal for graphics processing, AI training)	Slower but more accurate (better for scientific computing)
Common Applications	• Game development, graphics processing (GPU computing) • Machine learning (e.g., mixed precision in TensorFlow) • Large datasets with low precision requirements	• Scientific computing and simulations (e.g., weather forecasting, physics modeling) • Financial calculations (to avoid rounding errors) • High-precision numerical computing (e.g., engineering applications)

3. 32-bit Floating-Point Calculation Example

Single Precision (Float32, 32-bit) Format
Component	Bit Length	Description
Sign Bit (S)	1 bit	0 (positive)
Exponent (E)	8 bits	3 + 127 = 130 = 10000010_₂
Fraction (M)	23 bits	11100000000000000000000 (omit the leading 1., store only 111)
Storage	32 bits	0 10000010 11100000000000000000000

• Sign Bit (S): 0
(−1)⁰ = 1 (indicates a positive number)

• Exponent (E) (Bits 2–9): 10000010
Convert to decimal: 10000010₂ = 130₁₀
*IEEE 754 uses a bias value of 127
130 - 127 = 3
So, the exponent = 3, meaning 2³
(*IEEE 754 uses a bias to represent exponents. For single precision (32-bit), the exponent field uses 8 bits. According to the IEEE 754 definition, the bias is calculated as: Bias = 2⁸⁻¹ − 1 = 127)

• Fraction (M): 11100000000000000000000
According to IEEE 754, the fraction uses the normalized form “1 + fractional value,” so the actual value is: 1.11100000000000000000000₂

Convert to decimal:

IEEE754 floating-point format is defined as:

4. Modbus Floating-Point Access and Conversion

In Modbus communication, data is stored in registers, with each register being 16 bits wide. However, representing a single-precision floating-point number requires 32 bits. To handle this within the Modbus architecture, data that exceeds 16 bits must span multiple consecutive register addresses. Therefore, a 32-bit float needs to occupy two adjacent registers.

The Modbus protocol specifies the use of Big-Endian format for transmitting multi-byte values. In Big-Endian order, the most significant byte is sent first. For example, the 16-bit hexadecimal value 0x1234 would be transmitted on the bus as 0x12 followed by 0x34. (Modbus Organization, 2006)

[4]

1 ) How Modbus Stores a 32-bit Floating-Point Number

How is the floating-point number 84.0 stored in Modbus?
• Convert to IEEE 754 single-precision floating-point format:
The IEEE 754 single-precision (32-bit) binary representation of 85.625 is:

01000010101010000000000000000000₂

• Convert to hexadecimal: 0x42A80000
• Storage in Modbus (Big-Endian format):
Since each Modbus register is 16 bits, a 32-bit float is split across two registers:
High register (first 16 bits): 0x42A8
Low register (last 16 bits): 0x0000

2 ) How to Decode Modbus Values Back into Floating-Point Numbers

Example:

Device Address (Slave ID): 0x01
Function Code: 0x03 (Read Holding Registers)
Starting Register Address: 0x1000
Number of Registers to Read: 0x0002 (since a 32-bit float occupies two 16-bit registers)

➤ Modbus Transmission Request (TX - Master sends data):
01 03 10 00 00 02 C5 CD

Byte	Description	Value
01	Device Address (Slave ID)	0x01
03	Function Code (Read Holding Registers)	0x03
10 00	Starting Address (0x1000)	0x1000
00 02	Quantity of Registers (Read 2)	0x0002
C5 CD	CRC-16 Checksum	0xC5CD

➤ Modbus Device Response (RX - Slave responds):
01 03 04 42 A8 00 00 79 32

Byte	Description	Value
01	Device Address (Slave ID)	0x01
03	Function Code (Read Holding Registers)	0x03
04	Byte Count	0x04 (4 bytes = 32-bit)
42 A8	High Register Value	0x42A8
00 00	Low Register Value	0x0000
79 32	CRC-16 Checksum	0x7932

As shown, the merged register data is:

High 16-bit = 0x42A8, Low 16-bit = 0x0000
Combined into a 32-bit value (default Big-Endian): 0x42A80000
Binary representation: 0100 0010 1010 1000 0000 0000 0000 0000
IEEE 754 Decomposition:
• Sign bit (S) = 0 (positive)
• Exponent (E) = 10000101₂ = 133
• Fraction (M) = 01010000000000000000000₂

Plug into the IEEE 754 formula to get the decimal value: 84.0