An Introduction to Floating-Point Arithmetic

This tutorial introduces floating-point arithmetic, and how many modern languages, C# included, represent real numbers. It is the first in a two-part series:

At the end of this article, you will find a link to download a simple C# library that provides a new type which improves the precision of traditional float and double variables.

Those of you who have been following my work for some time might have noticed a recurring obsession with astronomy, from tools to visualize exoplanets to games about orbital mechanics. All of these share something in common: gravity. Simulating gravity is exceptionally easy on a modern computer. Simulating it well, however, is exceptionally complicated.

Part of this challenge comes from the fact that the equations that govern gravity are, for better or worse, chaotic. This means that very small deviations in the present can cause massive changes in the future. A significant part of this often comes from the measurements, which are necessarily imprecise and come with their own uncertainty. However, a more insidious source of error comes from Maths itself. Or at least, it comes from how mathematical operations are implemented in modern computers.

Floating-Point Representation

Numbers are easy to imagine, but hard to store. Each number has the potential to be unbounded in both directions, either being impossibly large, or impossibly small. And so, storing them in memory is a challenge that has to find a compromise between memory allocation and computational speed. When it comes to real numbers, the standard technique used to store them in memory uses what is known as floating-point representation.

To understand how floating-point numbers are represented in memory, let’s start with a simple example. Imagine you have a real number to write in the field of a paper form. What is the largest (and the smallest) number you could put in that field? If we include the constraint that one of the boxes in the field needs to be reserved for the decimal separator, then you might answer like this:

The floating-point representation of real numbers works in a similar way. On top of storing the number, it also stores where the separator—the point—is. And exactly because the point can “float”, it is called floating-point.
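We can peek at this layout directly. The sketch below (plain C#, assuming .NET 6 or later with top-level statements) splits a 32-bit float into its three components: 1 sign bit, 8 exponent bits, and 23 mantissa bits.

```csharp
using System;

// A minimal sketch of how a 32-bit float is laid out in memory:
// 1 sign bit, 8 exponent bits, 23 mantissa (fraction) bits.
float value = -6.25f;
int bits = BitConverter.SingleToInt32Bits(value);

int sign     = (bits >> 31) & 0x1;  // 1 means negative
int exponent = (bits >> 23) & 0xFF; // stored with a bias of 127
int mantissa = bits & 0x7FFFFF;     // the digits, with an implicit leading 1

Console.WriteLine($"sign={sign} exponent={exponent - 127} mantissa=0x{mantissa:X}");
// -6.25 = -1.5625 × 2², so the unbiased exponent is 2
```

The exponent is what decides where the point “floats”: a larger exponent shifts it to the right, a smaller one to the left, while the mantissa always stores the same fixed number of significant digits.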

Most programming languages offer floating-point representation for real numbers. C#, for instance, supports three different floating-point types: float, double and decimal.

| C# type | Approximate range | Precision | Size | .NET type |
|---------|-------------------|-----------|------|-----------|
| `float` | ±1.5 × 10⁻⁴⁵ to ±3.4 × 10³⁸ | ~6-9 digits | 4 bytes | `System.Single` |
| `double` | ±5.0 × 10⁻³²⁴ to ±1.7 × 10³⁰⁸ | ~15-17 digits | 8 bytes | `System.Double` |
| `decimal` | ±1.0 × 10⁻²⁸ to ±7.9228 × 10²⁸ | 28-29 digits | 16 bytes | `System.Decimal` |

Table from the .NET documentation page titled Floating-point numeric types.
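The difference in precision is easy to see by storing the same value, one third, in each of the three types. This is a minimal sketch; the digit counts in the comments match the table above.

```csharp
using System;

// One third, stored with each of C#'s three floating-point types.
// Only the first handful of digits of each result is actually reliable.
float   f = 1f / 3f;
double  d = 1.0 / 3.0;
decimal m = 1m / 3m;

Console.WriteLine(f.ToString("G9"));  // ~6-9 significant digits
Console.WriteLine(d.ToString("G17")); // ~15-17 significant digits
Console.WriteLine(m);                 // 28-29 significant digits
```

Printing each value with its maximum number of significant digits shows where the stored approximation starts to diverge from the true value of 1/3.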

A Problematic Representation…

Now, there is an obvious issue here. This technique works well with very large numbers and with very small numbers. But it does not work so well with large numbers that also have many decimals. Moving the point to the left increases the precision, but also reduces the maximum number that can be stored, and vice versa.

🔍 Periodic numbers
Another very well-known issue arises from periodic numbers. The fraction 1/3 is equal to 0.333…, which is often written as 0.(3). While the fractional representation is “clean”, storing this number with its digits will naturally lead to rounding errors.

As it turns out, certain numbers that are non-periodic in the decimal system are periodic when represented in a binary system. This leads to the infamous fact that, in most programming languages, 0.1 + 0.2 does not equal 0.3, but 0.30000000000000004.
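You can verify this in C# with just a few lines (a minimal sketch):

```csharp
using System;

double sum = 0.1 + 0.2;
Console.WriteLine(sum == 0.3);          // False
Console.WriteLine(sum.ToString("G17")); // 0.30000000000000004

// decimal stores base-10 digits, so it avoids this specific error:
Console.WriteLine(0.1m + 0.2m == 0.3m); // True
```

Neither 0.1 nor 0.2 can be stored exactly in binary, so their sum picks up a tiny error; decimal sidesteps this particular case by working in base 10, at the cost of being larger and slower.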

This happens in Unity as well, when using float numbers. However, the Debug.Log function rounds the result to 0.3.

The website 0.30000000000000004.com shows the result of 0.1 + 0.2 in a variety of different programming languages, including C#.

There is another, possibly more insidious problem with floating-point numbers: arithmetic operations. What happens if we try to sum the two numbers seen above? Since there are no decimal places left to represent the smaller one, it is simply discarded; we added two numbers, but effectively nothing has changed.
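This absorption effect is easy to reproduce with a concrete pair of numbers. With 32-bit floats, 2²⁴ (16,777,216) is the last value at which consecutive integers are still exactly representable; past it, adding 1 changes nothing.

```csharp
using System;

// Beyond 2^24, the gap between consecutive floats is larger than 1,
// so adding 1 falls below the precision and is simply discarded.
float big   = 16_777_216f; // 2^24
float small = 1f;

Console.WriteLine(big + small == big); // True: the addition changed nothing
```

The same happens with doubles, just much later (around 2⁵³): the problem is inherent to the representation, not to any single type.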

This is more than a hypothetical issue. In the context of game development, the further a model is from the world origin (0,0,0), the more distorted it will appear. This is because the floating-point representation is using most of its bits to store the position, leaving little space for the fine details. The animation below shows how dramatic this effect can be when a model moves further and further away.

A common solution—one that is actually adopted by many games—is to simply translate everything “back” to (0,0,0) when the player has ventured too far from the world.
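A minimal sketch of this idea is shown below, using System.Numerics.Vector3 rather than any specific engine’s types; the threshold and the object list are placeholder assumptions, not part of any real API.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

// A minimal "floating origin" sketch: once the player drifts too far
// from (0,0,0), shift every object (player included) back by the
// player's offset, so coordinates stay small and precise.
const float Threshold = 10_000f; // placeholder value

var objects = new List<Vector3> { new(9_999f, 0f, 0f), new(10_500f, 0f, 5f) };
Vector3 player = new(10_200f, 0f, 0f);

if (player.Length() > Threshold)
{
    Vector3 shift = player;  // everything moves by -shift
    player = Vector3.Zero;
    for (int i = 0; i < objects.Count; i++)
        objects[i] -= shift;
}
// Relative positions are preserved, but every coordinate is now
// close to the origin, where floats are densest.
```

The key observation is that only the positions relative to the player matter for rendering and gameplay, so the whole world can be rigidly translated without any visible effect.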

Some game engines also come with their own mechanism to mitigate the impact that floating-point precision has on rendering distant objects. Unity, for instance, offers a feature called Camera-relative rendering available in its latest rendering pipeline (HDRP). It works by bringing the camera (and everything else) to (0,0,0) before rendering. This takes place during the rendering stage of the scene, so it has no impact on the actual position of the objects.

🔍 Z-Fighting
Errors due to the rounding of floating-point numbers are also responsible for another very well-known phenomenon: z-fighting. When two triangles of a 3D model are placed too close to each other, they might appear to partially overlap each other in rapid succession, because their distance is at the limit of the floating-point precision.

The name z-fighting comes from the fact that when this glitch occurs, the two meshes seem to be “fighting” each other to appear on top.

However, such a solution—resetting the position to (0,0,0)—cannot always be used. When working with high-precision simulations—such as gravitational ones—this is a big problem. Space is impossibly vast, and the speeds of our rockets and probes are rather small in comparison. Over time, these small errors add up, leading to a constant drift from the expected outcome. What are supposed to be stable orbits, for instance, could quickly spiral out of control.

For many applications, the problem of floating-point errors can be attenuated by cleverly re-arranging the order in which certain operations are performed. Since the main culprit is adding large and small numbers together, a common technique is to add elements “in order”: summing the smaller numbers first preserves as much precision as possible. Many other techniques have been proposed to address this problem, such as pairwise summation or compensated summation (also known as the “Kahan summation algorithm”).
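Below is a sketch of compensated (Kahan) summation in plain C#. The example data, one large value followed by a thousand small ones, is an illustrative assumption chosen precisely so that a naive sum fails.

```csharp
using System;

// Compensated (Kahan) summation: a running correction term recovers
// the low-order bits that a naive sum would discard at each step.
static float KahanSum(float[] values)
{
    float sum = 0f;
    float compensation = 0f; // error carried over from previous additions

    foreach (float value in values)
    {
        float y = value - compensation; // re-inject the previously lost bits
        float t = sum + y;              // big + small: low bits of y are lost...
        compensation = (t - sum) - y;   // ...but can be measured here
        sum = t;
    }
    return sum;
}

// One large value followed by many small ones.
float[] data = new float[1001];
data[0] = 16_777_216f; // 2^24
for (int i = 1; i < data.Length; i++) data[i] = 1f;

float naive = 0f;
foreach (float v in data) naive += v;

Console.WriteLine(naive);          // 16777216: every 1 was absorbed
Console.WriteLine(KahanSum(data)); // 16778216: the 1s are accounted for
```

The naive loop discards every single 1, because each one falls below the precision of the running total; the compensated version accumulates those lost contributions and feeds them back in, recovering the correct sum.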

This post introduced the concept of floating-point arithmetic, and why it often leads to inaccurate results. The second part of this series will show how to partially overcome these limitations with the Quad library.

You can download the C# Quad library on Patreon. It is fully compatible with Unity.

💖 Support this blog

This website exists thanks to the contribution of patrons on Patreon. If you think these posts have either helped or inspired you, please consider supporting this blog.

You will be notified when a new tutorial is released!

📝 Licensing

You are free to use, adapt and build upon this tutorial for your own projects (even commercially) as long as you credit me.

You are not allowed to redistribute the content of this tutorial on other platforms, especially the parts that are only available on Patreon.

If the knowledge you have gained had a significant impact on your project, a mention in the credit would be very appreciated. ❤️🧔🏻

