An Introduction to Gradient Descent

This post concludes the theoretical introduction to Inverse Kinematics, providing a programmatic solution based on gradient descent. This article does not aim to be a comprehensive guide on the topic, but a gentle introduction. The next post, Inverse Kinematics for Robotic Arms, will show an actual C# implementation of this algorithm in Unity.

The other posts in this series can be found here.

At the end of this post you can find a link to download all the assets and scenes necessary to replicate this tutorial.

Introduction

The previous post in this series, Implementing Forward Kinematics, provided a solution to the problem of Forward Kinematics. We were left with a function, called ForwardKinematics, which indicates which point in space our robotic arm is currently touching.

If we have a specific point in space that we want to reach, we can use ForwardKinematics to estimate how close we are, given the current joint configuration. The distance from the target is a function that can be implemented like this:

public float DistanceFromTarget(Vector3 target, float[] angles)
{
    // Where the end effector is, given the current joint angles
    Vector3 point = ForwardKinematics(angles);

    // How far the end effector is from the target
    return Vector3.Distance(point, target);
}

Finding a solution for the problem of Inverse Kinematics means that we want to minimise the value returned by DistanceFromTarget. Minimising a function is one of the most common problems, both in programming and in Mathematics. The approach we will use relies on a technique called gradient descent (Wikipedia). Despite not being the most efficient, it has the advantage of being problem-agnostic and of requiring only knowledge that most programmers already have.

❓ Solving Inverse Kinematics analytically

Gradient Descent

The easiest way to understand how gradient descent works is to imagine a hilly landscape. We find ourselves at a random location, and we want to reach its lowest point. We call that the minimum of the landscape. At each step, gradient descent tells us to move in the direction that lowers our altitude. If the geometry of the landscape is relatively simple, this approach will converge towards the bottom of the valley.

The diagram below shows a typical scenario in which gradient descent is successful. In this toy example, we have a function that takes a single parameter (X axis), and returns an error value (Y axis). Starting from random points on the X axis (blue and green points), gradient descent should force us to move in the direction of the minimum (blue and green arrows).

Looking at the function in its entirety, the direction in which we have to move is obvious. Unfortunately, gradient descent has no prior knowledge of where the minimum is. The best guess the algorithm can make is to move in the direction of the slope, also called the gradient of the function. If you are on a hill, drop a ball and follow it to reach the valley. The diagram below shows the gradient of the error function at two different points.

❓ What does a good landscape look like?

Gradient Estimation

If you have studied Calculus before, you might know that the gradient of a function is deeply connected to its derivative. Calculating the derivative, however, requires the function to satisfy certain mathematical properties that, in general, cannot be guaranteed for arbitrary problems. Moreover, deriving the derivative analytically requires the error function to be expressed analytically, and you do not always have access to an analytical version of the function you are trying to minimise.

In all those cases, it is impossible to compute the true derivative of the function. The solution is to settle for a rough estimate of its value. The diagram below shows how this can be done in one dimension. By sampling nearby points, you can get a feeling for the local gradient of the function. If the error function is smaller on the left, you move to the left. Likewise, if it is smaller on the right, you move to the right.

This sampling distance will be called \Delta x in the following section.
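
To make the idea concrete, here is a minimal C# sketch of this sampling strategy for a function of one variable. ErrorFunction is a hypothetical stand-in for whatever single-variable error function we want to minimise; it is not part of the code developed in this series:

float EstimateGradient1D(float p, float deltaX)
{
    // Sample the (hypothetical) error function at p and at a nearby point
    float f = ErrorFunction(p);
    float fPlusDelta = ErrorFunction(p + deltaX);

    // Positive result: the function grows to the right, so we should move left.
    // Negative result: the function grows to the left, so we should move right.
    return (fPlusDelta - f) / deltaX;
}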

❓ What’s the difference between the gradient and the derivative?

❓ How critical is this sampling?

❓ What if the function has several local minima?

The Maths

Now that we have a general understanding of how gradient descent works graphically, let’s see how this translates mathematically. The first step is to calculate the gradient of our error function f at a specific point p; what we need is the direction in which the function grows. Since the gradient of a function is deeply connected with its derivative, a good starting point for our estimation is to investigate how the derivative is calculated in the first place.

Mathematically speaking, the derivative of f is called {f}'. Its value at the point p is {f}'\left(p\right), and it indicates how fast the function grows. It holds that:

  • {f}'\left(p\right) > 0 \Rightarrow f is going up, locally;
  • {f}'\left(p\right) < 0 \Rightarrow f is going down, locally;
  • {f}'\left(p\right) = 0 \Rightarrow f is flat, locally.
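
For example, take f\left(x\right)=x^2, whose derivative is {f}'\left(x\right)=2x: at p=3 we get {f}'\left(3\right)=6 > 0, so the function is locally going up; at p=-3 we get {f}'\left(-3\right)=-6 < 0, so it is locally going down.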

The idea is to use an estimation of {f}'\left(p\right) to calculate the gradient, called \nabla f. Mathematically, {f}' is defined as:

    \[{f}'\left(p\right) = \lim_{\Delta x \rightarrow 0} \frac{f\left(p+\Delta x\right) - f\left(p\right)}{\Delta x}\]

The following diagram shows what this means:

For our purposes, estimating the derivative means sampling the error function at two different points. The small distance between them, \Delta x, is the sampling distance that we introduced in the previous section.

To recap: the actual derivative of a function requires the use of a limit. Our gradient is an estimate of the derivative, made using a sufficiently small sampling distance:

  • Derivative

        \[{f}'\left(p\right) = \lim_{\Delta x \rightarrow 0} \frac{f\left(p+\Delta x\right) - f\left(p\right)}{\Delta x}\]

  • Estimated gradient

        \[\nabla{f}\left(p\right) = \frac{f\left(p+\Delta x\right) - f\left(p\right)}{\Delta x}\]
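
As a quick worked example, with purely illustrative numbers: take f\left(x\right)=x^2 at p=3 with a sampling distance \Delta x=0.01. The estimate gives

    \[\nabla f\left(3\right) = \frac{3.01^2 - 3^2}{0.01} = \frac{0.0601}{0.01} = 6.01\]

which is very close to the true derivative {f}'\left(3\right)=6.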

We will see in the next section how these two differ for functions with multiple variables.

Once we have our estimated gradient, we need to move against it to climb down the function. This means we have to update our parameter p like this:

    \[p_{i+1} = p_{i} - L \nabla {f}\left(p_i\right)\]

The constant L is often referred to as the learning rate, and it dictates how fast we move against the gradient. Larger values approach the solution faster, but are also more likely to overshoot it.
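
Putting the two pieces together, here is a minimal sketch in C# of a single iteration of one-dimensional gradient descent, reusing the EstimateGradient1D function sketched earlier. The values of LearningRate and DeltaX below are indicative, not tuned:

public float LearningRate = 0.1f;
public float DeltaX = 0.01f;

float GradientDescentStep(float p)
{
    // Estimate the local gradient, then take a step against it
    float gradient = EstimateGradient1D(p, DeltaX);
    return p - LearningRate * gradient;
}

Calling GradientDescentStep repeatedly makes p slide towards a (local) minimum of the error function.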

❓ lim?

Multiple Variables

The solution we have found so far works in a single dimension. This means we have given the definition of the derivative for a function of the type f\left(p\right), where p is a single number. In this particular case, we could find a reasonably accurate estimate of the derivative by sampling the function f at two points: p and p+\Delta x. The result is a single number that indicates whether the function is increasing or decreasing. We have used this number as our gradient.

A function with a single parameter corresponds to a robotic arm with a single joint. If we want to perform gradient descent on more complex robotic arms, we need to define the gradient for functions of multiple variables. If our robotic arm has three joints, for instance, our function will look more like f\left(\alpha_0, \alpha_1, \alpha_2\right). In such a case, the gradient is a vector made of three numbers, which indicate the local behaviour of f in the three-dimensional space of its parameters.

We can introduce the concept of partial derivatives, which essentially are “traditional” derivatives, calculated on a single variable at a time:

    \[{f}'_{\alpha_0}\left(\alpha_0, \alpha_1, \alpha_2\right)=\lim_{\Delta x \rightarrow 0} \frac{f\left(\alpha_0 + \Delta x, \alpha_1, \alpha_2\right)-f\left(\alpha_0, \alpha_1, \alpha_2\right)}{\Delta x}\]

    \[{f}'_{\alpha_1}\left(\alpha_0, \alpha_1, \alpha_2\right)=\lim_{\Delta y \rightarrow 0} \frac{f\left(\alpha_0, \alpha_1+\Delta y, \alpha_2\right)-f\left(\alpha_0, \alpha_1, \alpha_2\right)}{\Delta y}\]

    \[{f}'_{\alpha_2}\left(\alpha_0, \alpha_1, \alpha_2\right)=\lim_{\Delta z \rightarrow 0} \frac{f\left(\alpha_0, \alpha_1, \alpha_2+\Delta z\right)-f\left(\alpha_0, \alpha_1, \alpha_2\right)}{\Delta z}\]

They represent three different scalar numbers, each indicating how the function grows along a specific direction (or axis). To calculate our total gradient, we approximate each partial derivative with a corresponding estimate, using sufficiently small sampling distances \Delta x, \Delta y and \Delta z:

    \[\nabla {f}_{\alpha_0}\left(\alpha_0, \alpha_1, \alpha_2\right)=\frac{f\left(\alpha_0 + \Delta x, \alpha_1, \alpha_2\right)-f\left(\alpha_0, \alpha_1, \alpha_2\right)}{\Delta x}\]

    \[\nabla {f}_{\alpha_1}\left(\alpha_0, \alpha_1, \alpha_2\right)=\frac{f\left(\alpha_0, \alpha_1+\Delta y, \alpha_2\right)-f\left(\alpha_0, \alpha_1, \alpha_2\right)}{\Delta y}\]

    \[\nabla {f}_{\alpha_2}\left(\alpha_0, \alpha_1, \alpha_2\right)=\frac{f\left(\alpha_0, \alpha_1, \alpha_2+\Delta z\right)-f\left(\alpha_0, \alpha_1, \alpha_2\right)}{\Delta z}\]

For our gradient descent, we will use the vector that incorporates those three results as the gradient:

    \[\nabla f \left(\alpha_0, \alpha_1, \alpha_2\right) = \left[\nabla {f}_{\alpha_0}\left(\alpha_0, \alpha_1, \alpha_2\right), \nabla {f}_{\alpha_1}\left(\alpha_0, \alpha_1, \alpha_2\right), \nabla {f}_{\alpha_2}\left(\alpha_0, \alpha_1, \alpha_2\right)\right]\]

❓ This is not a unit vector!
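
To see how this translates into code, here is a sketch of the estimated gradient for an arm with an arbitrary number of joints, built on top of the DistanceFromTarget function defined earlier. The field DeltaX, the method name EstimateGradient and the in-place nudging of the angles array are choices of this sketch, not necessarily what the next tutorial will use:

public float DeltaX = 0.01f;

public float[] EstimateGradient(Vector3 target, float[] angles)
{
    float[] gradient = new float[angles.Length];

    // Error at the current joint configuration
    float f = DistanceFromTarget(target, angles);

    for (int i = 0; i < angles.Length; i++)
    {
        // Sample the error function with the i-th angle nudged by DeltaX
        angles[i] += DeltaX;
        float fPlusDelta = DistanceFromTarget(target, angles);
        angles[i] -= DeltaX; // restore the original angle

        gradient[i] = (fPlusDelta - f) / DeltaX;
    }

    return gradient;
}

Each entry of the returned array plays the role of one of the \nabla {f}_{\alpha_i} estimates above.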

Other resources

The next part of this tutorial will finally show a working implementation of this algorithm.

You can download the Unity project for this tutorial on Patreon.

Credits for the 3D model of the robotic arm go to Petr P.

Comments


  1. Hi, thank you for this explanation.
    It is not really a big deal, but I think there are two missing “=” in the first formulas of the derivatives, before the limit.
    Thanks again

    1. Thank you for spotting that, Bob!
      I have corrected the formulas!

  2. Dioinecail:

    There is a little mistake in the code?
    As DistanceFromTarget returns float, not Vector3. Might confuse someone with very little experience in C#

  3. Andrew C:

    Very interesting and useful, thanks. I think you need ∇fα₂ once in your final vector though, rather than ∇fα₀ twice.

  4. Alan W:

    This is a lot to take in, had I not already completed higher level math courses at my university this would certainly be a much tougher read than it already is. You explained very thoroughly however, you must have a deep understanding of the topic. I appreciate the tutorial and am eager to learn more.

    1. Thank you!
      I wanted to give a more Maths-y introduction to the topic. Gradient Descent is often implemented in a very naive way. But it has solid theoretical foundations.

