Improving Floating-Point Precision in C#

This tutorial continues the journey to understand floating-point arithmetic, and how to improve the precision of modern programming language types.

Part 1. An Introduction to Floating-Point Arithmetic
Part 2. Beyond Floating-Point in C#

At the end of this article, you will find a link to download a simple C# library that provides a new type which improves the precision of traditional float and double variables.

Introduction

In the previous part of this tutorial, we introduced the concept of floating-point arithmetic. Since it is the de facto standard for representing real numbers in modern programming languages, understanding where and why floating-point arithmetic fails is very important if you want to master Computer Science.

The main issue arises from the fact that real numbers (which are potentially unbounded) need to be stored in a finite amount of memory. Floating-point representation allows storing small numbers very precisely, or large numbers with very low precision. Generally speaking, you cannot have both.
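The trade-off can be seen directly on a double. The snippet below is a minimal sketch, assuming the standard IEEE 754 double used by C#; the class and method names are made up for illustration.

```csharp
using System;

public static class PrecisionDemo
{
    // Distance between x and the next representable double above it.
    // Only meaningful for positive, finite values below double.MaxValue.
    public static double GapAbove(double x)
    {
        long bits = BitConverter.DoubleToInt64Bits(x);
        double next = BitConverter.Int64BitsToDouble(bits + 1);
        return next - x;
    }

    // True when adding 1 to x is completely lost to rounding.
    public static bool AddingOneIsLost(double x)
    {
        // The explicit cast forces the sum to be rounded to double precision.
        double sum = (double)(x + 1.0);
        return sum == x;
    }
}
```

Around 1.0 the gap between neighbouring doubles is tiny, while around 1e16 it is already larger than one, so `AddingOneIsLost(1e16)` returns true.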

Libraries that perform computations without any rounding error do exist, but they tend to be fairly slow. This is mostly because processors have built-in operations for floating-point arithmetic, which run at a speed that software-only techniques cannot match.

Beyond Floating-Points in C#…

A good compromise that is often used is to combine multiple floating-point numbers to store a single, larger number. In this article, we will see how to do exactly this, by using one number for the integer part and another for the fractional part.

The idea is simple: any number $x \in \mathbb{R}$ can be decomposed as the sum of its integer and its fractional parts, which we will call $I$ and $F$ , respectively:

(1) $\begin{equation*} x = I + F \end{equation*}$

where:

(2) $\begin{equation*} \begin{align} I & = \left [ x \right ] \\ F & = \left \{ x \right \} \end{align} \end{equation*}$

This can be easily implemented in C# using a struct:

public struct Quad
{
    public double  I;
    public decimal F;
}

In the snippet above, the integer part is defined as a double, while the fractional part is a decimal. This is because decimals are designed to have a higher precision compared to doubles (which is why they are often used for financial transactions).

We can also easily create a constructor that converts any existing double into a Quad:

public Quad (double d)
{
    double i = Math.Truncate(d);

    I = i;
    F = (decimal) (d - i);
}

The cast to decimal is necessary; decimals have higher precision compared to floats and doubles, but they cover a much smaller range of values. In this case, the cast is technically safe, since we are only converting the fractional part of the number, which always lies strictly between $-1$ and $1$.
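To see why only the fractional part is cast, the sketch below mirrors the constructor on plain values and compares it with casting the whole double. `SplitDemo` and its methods are hypothetical helpers written for this example, not part of the library.

```csharp
using System;

public static class SplitDemo
{
    // Splits d into an integer part (double) and a fractional part (decimal),
    // in the same way as the Quad constructor.
    public static void Split(double d, out double i, out decimal f)
    {
        i = Math.Truncate(d);
        f = (decimal)(d - i); // |d - i| < 1 for finite d, well inside decimal's range
    }

    // Returns true if casting the whole double to decimal throws.
    public static bool DirectCastOverflows(double d)
    {
        try
        {
            decimal converted = (decimal)d;
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }
}
```

A value such as 1e30 is far beyond `decimal.MaxValue`, so `DirectCastOverflows(1e30)` returns true, while `Split(1e30, ...)` succeeds because only the (zero) fractional part is converted. NaN and infinities are not handled by this sketch and would still throw in `Split`.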

To make this a “proper” new type, we should ensure that it can be used pretty much as other floating-point types can. This means:

Performing implicit and explicit casts to and from other floating-point types:

// Implicit casts (safe)
Quad qi = 1;  // Quad from int
Quad qf = 1f; // Quad from float
Quad qd = 1.0; // Quad from double
Quad qm = 1m; // Quad from decimal

// Explicit casts (unsafe)
Quad q = 1.5; // must be assigned before it can be cast
float   f = (float)   q; // Quad to float
double  d = (double)  q; // Quad to double
decimal m = (decimal) q; // Quad to decimal

Performing all of the basic operations (addition, subtraction, multiplication, division, …) via their respective operators (+, -, *, /, …):
```
Quad a = 1;
Quad b = 2;

Quad c = a+b;
```
Supporting comparison and equality operators:
```
Quad a = 1;
Quad b = 2;

bool isGreater = a > b;
```

Casts to and from other types

To make the Quad type more usable, it is important that it can be used pretty much as all of the other floating-point types can. This means being able to implicitly cast other types to Quad whenever it is safe to do so, and forcing an explicit cast to convert a Quad into a smaller type.

This is very easy in C#, thanks to the implicit operator keyword:

public static implicit operator Quad(double a)
{
    return new Quad(a);
}

The method above is called every time a double is assigned to a Quad. As the name suggests, the cast happens implicitly, meaning that this method is called behind the scenes to construct a Quad from a double.

We can use a slightly different variant of this method to do the opposite: constructing a double from a Quad. This can be done with the explicit operator keyword, which requires an explicit cast to the destination type.

public static explicit operator double (Quad a)
{
    return a.I + (double) a.F;
}

C# typically requires explicit casting every time an assignment could potentially cause a loss of precision (like in this case).

The implementation for the other four operators is omitted from the article, but is included in the library you can download at the end of the page.
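As a rough idea of what the omitted conversions might look like, here is one possible shape. This is an assumption written for illustration and not the code from the library; in particular, the decimal constructor and the int operator are guesses.

```csharp
using System;

public struct Quad
{
    public double  I;
    public decimal F;

    public Quad(double d)
    {
        double i = Math.Truncate(d);
        I = i;
        F = (decimal)(d - i);
    }

    public Quad(decimal m)
    {
        decimal i = Math.Truncate(m);
        I = (double)i;
        F = m - i; // the fractional digits stay in decimal and are not rounded
    }

    // Implicit casts (safe)
    public static implicit operator Quad(double a)  { return new Quad(a); }
    public static implicit operator Quad(decimal a) { return new Quad(a); }

    // int converts implicitly to both double and decimal, so without a
    // dedicated operator "Quad q = 1;" would be ambiguous.
    public static implicit operator Quad(int a)     { return new Quad((double)a); }

    // Explicit casts (may lose precision)
    public static explicit operator double(Quad a)  { return a.I + (double)a.F; }
    public static explicit operator float(Quad a)   { return (float)(a.I + (double)a.F); }

    // Throws OverflowException when I is outside the range of decimal.
    public static explicit operator decimal(Quad a) { return (decimal)a.I + a.F; }
}
```

A float needs no operator of its own in this sketch: it converts implicitly to double, which then converts to Quad.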

Operator Overloads

The next step is to make sure that Quad variables support all basic operations. Let’s start with implementing the addition. C# allows defining how the + operator should work when used with two Quad variables, through a feature called operator overloading:

public static Quad operator+ (Quad a, Quad b)
{
    ...
}

What is now needed is to write the code to add two Quad variables.

As previously discussed, adding numbers that are wildly different in magnitude can lead to floating-point errors. The safest and simplest option here is to sum the two Quads component by component:

public static Quad operator+ (Quad a, Quad b)
{
    Quad c;

    c.I = a.I + b.I;
    c.F = a.F + b.F;

    return c;
}

This works, but leads to a problem: the fractional part can potentially go above one. For instance, if $a=1.5$ and $b=0.5$ , the fractional part would be $1$ .

While summing two integers is guaranteed to produce another integer, that is not necessarily the case for fractions. When that occurs, we need to carry the integer remainder from the fractional part to the integer one.

We can take care of this with a method that re-normalises the Quad:

public void Normalise()
{
    if (Math.Abs(F) < 1m)
        return;

    decimal i = Math.Truncate(F);
    F -= i;

    I += (double) i;
}

If we ensure that Normalise is called before operator+ returns, we have successfully implemented addition between two Quad variables.
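Putting the pieces together, addition could look like the sketch below. It is a minimal, self-contained version assembled from the snippets above, not necessarily identical to the library.

```csharp
using System;

public struct Quad
{
    public double  I;
    public decimal F;

    public Quad(double d)
    {
        double i = Math.Truncate(d);
        I = i;
        F = (decimal)(d - i);
    }

    // Moves any whole units accumulated in F over to I.
    public void Normalise()
    {
        if (Math.Abs(F) < 1m)
            return;

        decimal i = Math.Truncate(F);
        F -= i;
        I += (double)i;
    }

    public static Quad operator +(Quad a, Quad b)
    {
        Quad c;
        c.I = a.I + b.I;
        c.F = a.F + b.F;

        c.Normalise(); // carry from the fractional part into the integer part
        return c;
    }
}
```

With $a=1.5$ and $b=0.5$, the component-wise sum gives $I=1$ and $F=1$, which Normalise turns into $I=2$ and $F=0$. Note that this sketch does not reconcile the case where I and F end up with opposite signs (for instance, when adding $1.5$ and $-0.75$); the value is still correct as a sum, but a stricter normalisation would be needed before comparing such numbers field by field.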

Comparisons

Another important feature is to ensure that Quad variables can be compared to each other. In C# this can be done by implementing the IComparable<Quad> interface.

The snippet below implements the CompareTo method from the IComparable<Quad> interface by relying on the CompareTo methods of the integer and fractional parts. By specification, CompareTo must return 0 if the two values are equal, a negative number if this instance precedes the other, and a positive number if it follows it.

public int CompareTo(Quad other)
{
    int c = I.CompareTo(other.I);
    if (c != 0)
        return c;

    return F.CompareTo(other.F);
}

It is good practice to also overload the >, <, >= and <= operators, as seen in the IComparable<T> interface page on the .NET documentation:

public static bool operator > (Quad a, Quad b)
{
    // CompareTo only guarantees the sign of its result, not the values +1 or -1
    return a.CompareTo(b) > 0;
}

public static bool operator < (Quad a, Quad b)
{
    return a.CompareTo(b) < 0;
}

Even though we have only implemented comparison between two Quad variables, we are able to perform comparisons with other floating-point types as well. This is because we have instructed the compiler that floats and doubles can be implicitly cast to Quads.
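For completeness, here is a sketch of all four relational operators, together with a mixed comparison against a double. It assumes the constructor and the implicit cast from double shown earlier; the operators test only the sign of CompareTo, since that is all the interface guarantees.

```csharp
using System;

public struct Quad : IComparable<Quad>
{
    public double  I;
    public decimal F;

    public Quad(double d)
    {
        double i = Math.Truncate(d);
        I = i;
        F = (decimal)(d - i);
    }

    public static implicit operator Quad(double a) { return new Quad(a); }

    public int CompareTo(Quad other)
    {
        int c = I.CompareTo(other.I);
        if (c != 0)
            return c;

        return F.CompareTo(other.F);
    }

    // C# requires < and > to be overloaded as a pair, and likewise <= and >=.
    public static bool operator >  (Quad a, Quad b) { return a.CompareTo(b) >  0; }
    public static bool operator <  (Quad a, Quad b) { return a.CompareTo(b) <  0; }
    public static bool operator >= (Quad a, Quad b) { return a.CompareTo(b) >= 0; }
    public static bool operator <= (Quad a, Quad b) { return a.CompareTo(b) <= 0; }
}
```

In an expression such as `q < 2.5`, the compiler converts the double on the right-hand side into a Quad before calling the operator.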

Equality Test

Another step that needs to be taken is to make sure we can use == to check whether or not two Quads are the same. Like before, this can be done by overloading the == operator, and by implementing the IEquatable<Quad> interface.

Implementing the interface is not strictly necessary, however. By default, the Equals method that every struct inherits from ValueType compares all of its fields. This may be done through reflection, which can be quite slow. The == operator, on the other hand, is not defined for a struct until it is overloaded. So it is always best to implement an efficient version of Equals by hand:

public bool Equals(Quad other)
{
    return
        I.Equals(other.I) &&
        F.Equals(other.F);
}

Overloading the == operator also requires overloading !=. It is also good practice to override the Equals method inherited from the Object class.

When Equals is changed, you should also override GetHashCode, which is used by many .NET collections such as Dictionary and HashSet.
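The full set of equality members could be sketched as follows. The way the two hash codes are combined (multiplying by 397) is just one common convention chosen for this example, not something prescribed by .NET or taken from the library.

```csharp
using System;

public struct Quad : IEquatable<Quad>
{
    public double  I;
    public decimal F;

    public Quad(double d)
    {
        double i = Math.Truncate(d);
        I = i;
        F = (decimal)(d - i);
    }

    // Strongly typed comparison: no boxing and no reflection.
    public bool Equals(Quad other)
    {
        return
            I.Equals(other.I) &&
            F.Equals(other.F);
    }

    // Override inherited from Object; delegates to the typed version.
    public override bool Equals(object obj)
    {
        return obj is Quad && Equals((Quad)obj);
    }

    // Equal Quads must produce equal hash codes.
    public override int GetHashCode()
    {
        unchecked
        {
            return (I.GetHashCode() * 397) ^ F.GetHashCode();
        }
    }

    public static bool operator ==(Quad a, Quad b) { return a.Equals(b); }
    public static bool operator !=(Quad a, Quad b) { return !a.Equals(b); }
}
```

With these in place, Quad values can be compared with == and used as keys in a Dictionary or stored in a HashSet.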

This leads to the rather unpleasant situation where, for something as simple as the Quad type, you end up having to add at least ten (TEN!) methods just to get comparisons and equality to work properly.

Trigonometric Functions

At the beginning of this article, we justified the need for a more precise floating-point variable with the example of a gravity simulator. If we really are to use Quad variables in a simulator of that kind, we do need a way to handle trigonometric functions.

To get the sine of a Quad, we could simply do this:

Quad q = 1.5;
Quad s = Math.Sin((double) q);

However, this basically kills the precision we spent so much time preserving. A better way would be to calculate the sine of the integer and fractional parts separately, and then to join them later. This is a good idea, but it is not as straightforward as it might sound. The reason is simple: trigonometric functions are not linear. The sine of a sum is not the sum of the sines; instead, it follows the angle addition formulas:

(3) $\begin{equation*} \begin{align} \sin\left(I+F\right) & = \sin\left(I\right) \cos\left(F\right) + \cos\left(I\right) \sin\left(F\right) \\ \cos\left(I+F\right) & = \cos\left(I\right) \cos\left(F\right) - \sin\left(I\right) \sin\left(F\right) \end{align} \end{equation*}$

The sine function can be implemented like this:

public Quad Sin()
{
    Quad sinI = new Quad(Math.Sin(I));
    Quad cosI = new Quad(Math.Cos(I));

    double sinF = Math.Sin((double) F); // no Math.Sin for decimal
    double cosF = Math.Cos((double) F); // no Math.Cos for decimal

    return sinI * cosF + cosI * sinF;
}

There are a few drawbacks, however. The first is the lack of trigonometric functions that operate on decimal variables: currently, the functions in the Math class only accept doubles.
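The cosine follows the same pattern as the sine. The sketch below works on plain doubles, so that it does not depend on the Quad multiplication operator; `TrigDemo` is a hypothetical helper that only checks the split formulas against a direct call to Math.Sin and Math.Cos.

```csharp
using System;

public static class TrigDemo
{
    // sin(i + f) = sin(i) cos(f) + cos(i) sin(f)
    public static double SinSplit(double i, double f)
    {
        return Math.Sin(i) * Math.Cos(f) + Math.Cos(i) * Math.Sin(f);
    }

    // cos(i + f) = cos(i) cos(f) - sin(i) sin(f)
    public static double CosSplit(double i, double f)
    {
        return Math.Cos(i) * Math.Cos(f) - Math.Sin(i) * Math.Sin(f);
    }
}
```

For moderate inputs, both functions agree with Math.Sin(i + f) and Math.Cos(i + f) up to rounding error. A Cos method on Quad could be built the same way as Sin, replacing the final expression with the second formula.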

🔍 Difference between Quad and double-double

The idea to mitigate the limitations of doubles by using several of them is nothing new. It is possible to use two double variables to implement what is often referred to as double-double arithmetic.

What we have created here, however, is a bit different. A true double-double type would be able to make full use of all of its bits. Our Quad, instead, cannot do that. When the number is below $1$ , a Quad simply breaks down to a decimal. In a sense, this new type shares many similarities with fixed-point numbers.
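For contrast, double-double implementations are typically built on an error-free addition: the sum of two doubles is returned together with the rounding error that was lost. The sketch below uses Knuth's TwoSum algorithm; it is not part of the Quad library, and it assumes that every operation is rounded to double precision (which is the case on modern .NET runtimes, but was not always guaranteed on older x86 ones).

```csharp
using System;

public static class DoubleDouble
{
    // Knuth's TwoSum: sum is the rounded result of a + b,
    // error is exactly what the rounding discarded, so that
    // a + b == sum + error in exact arithmetic.
    public static void TwoSum(double a, double b, out double sum, out double error)
    {
        sum = a + b;

        double bVirtual = sum - a;        // the part of b that made it into sum
        double aVirtual = sum - bVirtual; // the part of a that made it into sum

        error = (a - aVirtual) + (b - bVirtual);
    }
}
```

Adding 1 to 1e16 in a plain double simply returns 1e16. With TwoSum, the missing 1 is recovered in `error`, and a double-double type would store it as its second component.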

While true double-double types can certainly be made in C#, they are not as easy to implement. The solution presented here works relatively well and, even more importantly, has great educational value. I am confident that after reading this tutorial, you now have the basis to go even further and use not just two but three variables, should you need to.

Conclusion & Download

This post concludes the two-part tutorial on floating-point arithmetic in C#.

Part 1. An Introduction to Floating-Point Arithmetic
Part 2. Beyond Floating-Point in C#

You can download the C# Quad library on Patreon. It is fully compatible with Unity.

📝 Licensing

You are free to use, adapt and build upon this tutorial for your own projects (even commercially) as long as you credit me.

You are not allowed to redistribute the content of this tutorial on other platforms, especially the parts that are only available on Patreon.

If the knowledge you have gained had a significant impact on your project, a mention in the credit would be very appreciated. ❤️🧔🏻

Enrico Speranza

September 1, 2020

Please see this very interesting paper:

https://www.researchgate.net/publication/337759338_Healing_Cracks_in_Cyberspace_towards_best_practice

https://www.youtube.com/playlist?list=PL-FOpmjxwbdxs1VGl8aHc3FTbOEGch98S

365cent

March 19, 2021

Cool, thanks for sharing

Alessio

October 19, 2022

Hi Alan, very insightful 2-parts on floating point arithmetic! This might be a stupid question but… is there a specific reason for using a double (and not an int) for the integer part I?

- Alan Zucconi
  
  October 22, 2022
  
  Hi Alessio!
  Yes, it’s because a double can store much larger integer numbers than an int can (1.7976931348623157E+308 vs 2147483647)!
  
  - Alessio
    
    October 22, 2022
    
    Ah, I should have seen that! That makes sense, thank you very much!
    
MrDarcy

November 4, 2022

This doesn’t have any advantage over using Decimal directly. By storing the fractional component in a Decimal, the Quad class here can never do better than Decimal. Storing the integer component separately, as implemented here, is completely redundant (also, why is it a double?)

If you truly want higher precision than Decimal can provide, both `I` and `F` should be longs. Otherwise, this post is just an exercise in implementing operators.


Published

August 3, 2020

Alan Zucconi in C#, Maths, Programming, Tutorial | August 3, 2020