This tutorial introduces floating-point arithmetic and how many modern languages, C# included, represent real numbers. It is the first part of a two-part series.

At the end of this article, you will find a link to download a simple C# library that provides a new type which improves the precision of traditional `float` and `double` variables.

Those of you who have been following my work for some time might have noticed a recurrent ~~obsession~~ theme: astronomy, from tools to visualize exoplanets to games about orbital mechanics. All of these share something in common: gravity. Simulating gravity is exceptionally easy with a modern computer. Simulating it well, however, is exceptionally complicated.

Part of this challenge comes from the fact that the equations that govern gravity are, for better or worse, chaotic. This means that very small deviations in the present can cause massive changes in the future. A significant part of this often comes from the measurements, which are necessarily imprecise and come with their own uncertainty. However, a more insidious source of error comes from Maths itself. Or at least, it comes from how mathematical operations are implemented in modern computers.

## Floating-Point Representation

Numbers are easy to imagine, but hard to store. Each number has the potential to be unbounded in *both* directions, either being impossibly large, or impossibly small. And so, storing them in memory is a challenge that has to find a compromise between memory allocation and computational speed. When it comes to **real numbers**, the standard technique used to store them in memory uses what is known as **floating-point representation**.

To understand how floating-point numbers are represented in memory, let’s start with a simple example. Imagine you have to write a real number in the field of a paper form. What is the largest (and the smallest) number that you can put in that field? If we include the constraint that one of the boxes in the field needs to be reserved for the decimal separator, the answer depends on where you place it. With eight boxes, for instance, the largest number would be 9999999. and the smallest positive one would be .0000001

The floating-point representation of real numbers works in a similar way. On top of storing the number, it also stores where the separator—the point—is. And exactly because the point can “float”, it is called floating-point.
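To make the idea of "storing the number and where the point is" more concrete, here is a minimal sketch that splits a C# `float` into the three fields of its 32-bit IEEE 754 layout: 1 sign bit, 8 exponent bits and 23 mantissa bits. The class and method names (`FloatAnatomy`, `Decompose`) are made up for this illustration and are not part of any library.

```csharp
using System;

public static class FloatAnatomy
{
    // Splits a 32-bit float into its sign, exponent and mantissa fields.
    public static (int Sign, int Exponent, int Mantissa) Decompose(float value)
    {
        // Reinterpret the four bytes of the float as a 32-bit integer.
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(value), 0);

        int sign = (bits >> 31) & 0x1;        // 1 bit: 0 is positive, 1 is negative
        int exponent = (bits >> 23) & 0xFF;   // 8 bits, stored with a bias of 127
        int mantissa = bits & 0x7FFFFF;       // 23 bits: the digits after the binary point
        return (sign, exponent, mantissa);
    }

    public static void Main()
    {
        foreach (float value in new[] { 0.15625f, 1f, -2f, 1024f })
        {
            var (sign, exponent, mantissa) = Decompose(value);
            // (exponent - 127) is the power of two that tells where the point "floats" to.
            Console.WriteLine($"{value}: sign={sign}, exponent=2^{exponent - 127}, mantissa=0x{mantissa:X6}");
        }
    }
}
```

Note how `1f`, `-2f` and `1024f` all share the same (empty) mantissa: only the sign and the exponent, that is, the position of the point, change.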

Most programming languages offer floating-point representation for real numbers. C#, for instance, supports three different types of floating-point numbers: `float`, `double` and `decimal`.

| C# type | Approximate range | Precision | Size | .NET type |
|---|---|---|---|---|
| `float` | ±1.5 × 10⁻⁴⁵ to ±3.4 × 10³⁸ | ~6-9 digits | 4 bytes | `System.Single` |
| `double` | ±5.0 × 10⁻³²⁴ to ±1.7 × 10³⁰⁸ | ~15-17 digits | 8 bytes | `System.Double` |
| `decimal` | ±1.0 × 10⁻²⁸ to ±7.9228 × 10²⁸ | 28-29 digits | 16 bytes | `System.Decimal` |

Table from the .NET documentation page titled Floating-point numeric types.

## A Problematic Representation…

Now, there is an obvious issue here. This technique works well with very large numbers and with very small numbers. But it does not work so well with large numbers that also have a lot of decimals. Moving the point to the left increases the precision, but also reduces the maximum number that can be stored. And the same is true for the opposite.

🔍 **Periodic numbers**

As it turns out, certain numbers that are non-periodic in the decimal system are periodic when represented in a binary system. This leads to the infamous fact that, in most programming languages, 0.1 + 0.2 ≠ 0.3.

This happens in Unity as well, when using `float` numbers. However, the `Debug.Log` function rounds the result to 0.3.

The website 0.30000000000000004.com shows the result of 0.1 + 0.2 in a variety of different programming languages, including C#.
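The behaviour is easy to reproduce. The following sketch uses `double` rather than `float`, since that is the case the website above refers to; the helper names (`PeriodicNumbers`, `SumEqualsThreeTenths`, `Exact`) are only illustrative. The `"G17"` format asks for 17 significant digits, which is enough to expose the value actually stored.

```csharp
using System;
using System.Globalization;

public static class PeriodicNumbers
{
    // Returns true only if 0.1 + 0.2 compares equal to 0.3 in double precision.
    public static bool SumEqualsThreeTenths()
    {
        double sum = 0.1 + 0.2;
        return sum == 0.3;
    }

    // Formats a double with 17 significant digits, using "." as the decimal separator.
    public static string Exact(double value)
    {
        return value.ToString("G17", CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        Console.WriteLine(SumEqualsThreeTenths());   // False
        Console.WriteLine(Exact(0.1 + 0.2));         // 0.30000000000000004
        Console.WriteLine(Exact(0.3));               // slightly below 0.3
    }
}
```

Neither 0.1, 0.2 nor 0.3 can be stored exactly in binary, so each is replaced by its nearest representable neighbour, and the rounding errors do not cancel out.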

There is another, possibly more insidious problem with floating-point: arithmetic operations. What happens if we try to add together the largest and the smallest numbers that fit in the form? Since the larger one leaves no boxes for decimals, the smaller one is simply discarded; we added two numbers, but effectively nothing has changed.
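The same thing happens with real `float` values. In the sketch below (the names `Absorption` and `IsAbsorbed` are invented for this example), the explicit cast to `float` is there to make sure the sum is rounded to single precision before it is compared.

```csharp
using System;

public static class Absorption
{
    // Adds a small number to a large one and reports whether the sum is unchanged.
    public static bool IsAbsorbed(float large, float small)
    {
        // The cast forces the result to be rounded to single precision.
        float sum = (float)(large + small);
        return sum == large;
    }

    public static void Main()
    {
        // 16,777,216 is 2^24: above it, a float can no longer represent every integer.
        Console.WriteLine(IsAbsorbed(16777216f, 1f));   // True: the 1 is lost
        Console.WriteLine(IsAbsorbed(1024f, 1f));       // False: 1025 is representable
    }
}
```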

This is more than a hypothetical issue. In the context of game development, the further a model is from the world origin (0,0,0), the more distorted it will appear. This is because the floating-point representation is using most of its bits to store the position, leaving little space for the fine details. The animation below shows how dramatic this effect can be when a model moves further and further away.
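One way to quantify this loss of detail is to measure the gap between a `float` and the next representable one. The sketch below does so by incrementing the underlying bit pattern, which is valid for positive, finite values only; `FloatSpacing` and `GapAbove` are hypothetical names chosen for this example.

```csharp
using System;

public static class FloatSpacing
{
    // Returns the gap between a positive, finite float and the next representable one.
    public static float GapAbove(float value)
    {
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
        // For positive finite floats, adding 1 to the bit pattern yields the next float up.
        float next = BitConverter.ToSingle(BitConverter.GetBytes(bits + 1), 0);
        return next - value;
    }

    public static void Main()
    {
        foreach (float distance in new[] { 1f, 1000f, 1000000f, 10000000f })
        {
            Console.WriteLine($"At {distance} the smallest step is {GapAbove(distance)}");
        }
    }
}
```

If one unit is one metre, a vertex that sits 1,000,000 units away from the origin can only move in steps of 0.0625 units (6.25 cm), and at 10,000,000 units the steps grow to a full metre.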

A common solution, one that is actually adopted by many games, is to simply translate everything “back” to (0,0,0) when the player has ventured too far from the world origin.
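A rough, engine-independent sketch of this idea is shown below. It uses `System.Numerics.Vector3` instead of Unity's own types, and both the `Threshold` value and the `Recenter` helper are assumptions made for the sake of the example; an actual game would also need to shift physics bodies, particles and anything else that stores world-space positions.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

public static class FloatingOrigin
{
    // Hypothetical threshold: how far the player may drift before the world is re-centred.
    public const float Threshold = 5000f;

    // Shifts every position so that the player ends up back at (0,0,0).
    // Returns true if a shift was performed.
    public static bool Recenter(ref Vector3 player, IList<Vector3> objects)
    {
        if (player.Length() < Threshold)
            return false;

        Vector3 offset = player;
        for (int i = 0; i < objects.Count; i++)
            objects[i] -= offset;   // positions relative to the player are preserved

        player = Vector3.Zero;
        return true;
    }

    public static void Main()
    {
        var player = new Vector3(6000f, 0f, 0f);
        var objects = new List<Vector3> { new Vector3(6001f, 0f, 2f) };

        bool shifted = Recenter(ref player, objects);
        Console.WriteLine($"shifted={shifted}, player={player}, object={objects[0]}");
    }
}
```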

Some game engines also come with their own mechanism to mitigate the impact that floating-point precision has on rendering distant objects. Unity, for instance, offers a feature called Camera-relative rendering available in its latest rendering pipeline (HDRP). It works by bringing the camera (and everything else) to (0,0,0) before rendering. This takes place during the rendering stage of the scene, so it has no impact on the actual position of the objects.

🔍 **Z-Fighting**

A related glitch caused by limited floating-point precision is known as **z-fighting**. When two triangles of a 3D model are placed too close to each other, they might alternately cover each other in rapid succession, because their distance is at the limit of the floating-point precision.

The name z-fighting comes from the fact that when this glitch occurs, the two meshes seem to be “fighting” each other to appear on top.

However, resetting the position to (0,0,0) is not always an option. When working with high-precision simulations, such as gravitational ones, this is a big problem. Space is impossibly vast, and the speeds of our rockets and probes are rather small in comparison. Over time, these small errors add up, leading to a constant drift from the expected outcome. What are supposed to be stable orbits, for instance, could quickly spiral out of control.

For many applications, the problem of floating-point errors can be attenuated by cleverly re-arranging the order in which certain operations are performed. Since the main culprit is adding large and small numbers together, a common technique is to add elements “in order”. Adding up the smaller numbers first allows keeping as much precision as possible. Many other techniques have also been proposed to address this problem, such as pairwise summation or compensated summation (also known as the “Kahan summation algorithm”).
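As an illustration, here is a minimal sketch of compensated (Kahan) summation next to a plain running sum. The class and method names are chosen for this example only. The explicit casts to `float` are there to make sure every intermediate result is rounded to single precision, which the algorithm relies on.

```csharp
using System;

public static class Summation
{
    // Plain left-to-right accumulation.
    public static float NaiveSum(float[] values)
    {
        float sum = 0f;
        foreach (float v in values)
            sum = (float)(sum + v);
        return sum;
    }

    // Compensated (Kahan) summation: a second variable keeps track of
    // the low-order part that is lost at each addition and feeds it back in.
    public static float KahanSum(float[] values)
    {
        float sum = 0f;
        float compensation = 0f;
        foreach (float v in values)
        {
            float y = (float)(v - compensation);            // re-inject the error lost so far
            float t = (float)(sum + y);                     // low-order digits of y are lost here
            compensation = (float)((float)(t - sum) - y);   // recover what was just lost
            sum = t;
        }
        return sum;
    }

    public static void Main()
    {
        // Ten million copies of 0.1f: the expected total is close to 1,000,000.
        var values = new float[10000000];
        for (int i = 0; i < values.Length; i++)
            values[i] = 0.1f;

        Console.WriteLine($"Naive: {NaiveSum(values)}");
        Console.WriteLine($"Kahan: {KahanSum(values)}");
    }
}
```

The naive sum drifts further and further away as the running total grows and each new 0.1 loses more of its digits, while the compensated version stays close to the expected value at the cost of a few extra operations per element.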

## What’s Next and Download

This post introduced the concept of floating-point arithmetic, and why it often leads to inaccurate results. The second part of this series will show how to partially overcome these limitations with the `Quad` library.

You can download the C# `Quad` library on Patreon. It is fully compatible with Unity.

##### 💖 Support this blog

This website exists thanks to the contribution of patrons on Patreon. If you think these posts have either helped or inspired you, please consider supporting this blog.

##### 📝 Licensing

You are free to use, adapt and build upon this tutorial for your own projects (even commercially) as long as you credit me.

You are not allowed to redistribute the content of this tutorial on other platforms, especially the parts that are only available on Patreon.

If the knowledge you have gained had a significant impact on your project, a mention in the credit would be very appreciated. ❤️🧔🏻
