Introduction

Creating an Autograd library and Neural nets in C++ from scratch has been one of the most fun and challenging side projects I’ve undertaken recently. Inspired by Andrej Karpathy’s micrograd in Python, I wanted to explore the same idea in C++. The goal was not to replicate Karpathy’s work but attempt to achieve the same outcome as best I can. This journey not only sharpened my understanding of computational graphs and automatic differentiation but also revealed the nuances of implementing them in a lower-level language. You can find the source code here.

In this blog, we will:

Walk through the process of Autograd and understand why it is important in ML.
Dive into some of the code and understand the implementation.
Explore the specific complexities and challenges faced due to C++.

Problem Statement

Training a Machine Learning model is like learning a mathematical equation from the training data so that it fits the data well. This process usually involves two main steps: Forward Pass and Backpropagation.

First we start by guessing a model and setting the weights randomly. Lets call it the base model. For example $y = w_1x_1 + w_2x_2 + b$ represents a linear regression model where $w_i$are the weights of the model and $x_i$ are the features.
Second, we use the training data to get the predicted outcome $\hat{y}$ through our base model.
Third, we measure how far away the predictions from our base model is from the actual outcome $y$. This is called the loss/cost function $J$.
Lastly, we adjust the weights of our model and repeat the process until our predicted outcome is very close to the actual outcome.

The last step in the process involves defining how the model's weights affect the final outcome. We do this by differentiating the cost function with respect to the model weights and then adjusting the weights by a certain factor of $dJ/dw$ . This means that every iteration of the training process requires us to calculate the gradients of the cost function with respect to all the model parameters. Modern deep networks can easily have thousands of weights (GPT-3 has 175 billion), so there needs to be a way to compute gradients that works efficiently and is agnostic to neural network architecture.

Solution

Let us start with a mathematical equation for which we want to calculate gradients.

$$J = xy + 3y + z$$

Now lets represent it in code.

x = 1
y = 5
z = 10
J = x*y + 3*y + z # J is set to 40
gradient(J, x) # This is what we need to calculate

The problem with this code is that J is a scalar with value 40. Now there is no way to compute gradients on it. For calculating gradients, J needs to understand its relationship with other parameters in the equation named x, y, z. How do we do that? Enter computational graphs.

A computational graph is a powerful abstraction for representing mathematical equations. Each node in the graph corresponds to an operation or a value, and edges represent dependencies. It can be constructed dynamically as mathematical operations grow. This representation allows us to define the relationship between the final equation and its parameters. With this knowledge, it is now possible to compute gradients as now we can apply the rules of calculus for summation, multiplication, and other operators.

There are few things to realize here:

The gradient with respect to the terminal node will always be 1 because the gradient of something with respect to itself is always 1.
Each node can easily compute the gradients of its child nodes by applying the rules of calculus. For example,node xy+3y will set gradient as 1 for both of its children, xy and 3y.
Following the chain rule in calculus, the gradient of each child node should be multiplied by the gradient of the parent node.
Because of the above property, it's important to compute gradients in an order where every parent is considered before its child nodes. There is an algorithm that does exactly this in graphs — Topological sort.

Implementation

Lets talk about the code now. We divide the implementation into several parts. Starting with the core building blocks to common mathematical operators, special functions, and tests.

Building blocks

The implementation revolves around a class called GradNode, which represents a node in the computational graph. This structure is fundamental because it stores not only the values and operations of the computation but also the dependencies between them. By capturing these relationships, GradNode enables backpropagation to efficiently compute gradients by systematically traversing the graph.

class GradNode {
public:
    ...
private:
    ...
  /// Private members
  std::vector<std::shared_ptr<GradNode>> children_; // Child nodes.
  std::function<void()> backward_fn_; // Backward function to compute gradients.
  double data_;                       // The value of the node.
  double grad_ = 0.0;                 // The gradient of the node.
  std::string label_;                 // The label of the node.
  bool is_scalar_ = false;            // Indicates if the node represents a scalar value.
};

While most of the parts are self explanatory. Here are some parts to focus on:

backward_fn_: This function is a callback that, when called, computes the gradients for all the children of this node. The awesome part of this design is that it lets us define differentiation rules without needing to store the actual mathematical operator.
is_scalar_: Helps distinguish real scalar values in the equation from the parameters, allowing us to skip gradient computations for these.
I have considered data_ as a double but it should be possible to templatize it to any numeric data type.

Function overloads

One key feature of python based Autograd Engines is the ability to write equations naturally like $J = xy + 3y + z$ as opposed to J = sum(multiply(x, y), multiply(3,y)). C++ provides operator overloading to enable just that.

friend std::shared_ptr<GradNode> operator+(const std::shared_ptr<GradNode> &a,
                                             double b);
...

friend std::shared_ptr<GradNode> operator*(const std::shared_ptr<GradNode> &a,
                                             double b);

Lets look into the implementation as well.

std::shared_ptr<GradNode> operator+(const std::shared_ptr<GradNode> &a,
                                    const std::shared_ptr<GradNode> &b) {

  auto output_data = a->data_ + b->data_;
  auto output_label = a->label_ + "+" + b->label_;
  auto output_children = std::vector<std::shared_ptr<GradNode>>{a, b};

  auto result = GradNode::CreateGradnode(output_data, output_label);
  result->children_ = output_children;
  result->backward_fn_ = [a, b, result]() {
    if (!a->is_scalar_) {
      a->grad_ += result->grad_;
    }
    if (!b->is_scalar_) {
      b->grad_ += result->grad_;
    }
  };

  return result;
}

std::shared_ptr<GradNode> operator*(const std::shared_ptr<GradNode> &a,
                                    const std::shared_ptr<GradNode> &b) {

  auto output_data = a->data_ * b->data_;
  auto output_label = a->label_ + "*" + b->label_;
  auto output_children = std::vector<std::shared_ptr<GradNode>>{a, b};

  auto result = GradNode::CreateGradnode(output_data, output_label);
  result->children_ = output_children;
  result->backward_fn_ = [a, b, result]() {
    if (!a->is_scalar_) {
      a->grad_ += (result->grad_ * b->data_);
    }
    if (!b->is_scalar_) {
      b->grad_ += (result->grad_ * a->data_);
    }
  };
  return result;
}

Notice how backward_fn_ is defined in both cases. We can model the differentiation rules within it.
By using is_scalar_, we skip gradient computations for scalars.
We had to add a special keyword friend to the function overloads as we are not overloading GradNode but shared_ptr<GradNode>.

Special functions

Neural net models rely on special functions referred to as the Activation Functions. This is critical to break linearity i.e. to learn a more complex non-linear model. The library should also support them out of the box.

//Header file

/// Sigmoid
friend std::shared_ptr<GradNode> sigmoid(std::shared_ptr<GradNode> &x);

/// Tanh
friend std::shared_ptr<GradNode> tanh(std::shared_ptr<GradNode> &x);

/// ReLU
friend std::shared_ptr<GradNode> relu(std::shared_ptr<GradNode> &x);

As these functions also internally can be represented by the basic mathematical operators, the implementation is straight forward.

std::shared_ptr<GradNode> sigmoid(std::shared_ptr<GradNode> &x) {
  auto e = GradNode::CreateGradnode(std::exp(1), "E");
  e->MakeScalar();
  auto minus_x = -1 * x;
  auto result = 1.0 / (1.0 + pow(e, minus_x));
  return result;
}

Lastly, I realized that log is an important operation specially for logistic regression in computing cross entropy but again, having backward_fn_ allows us to easily model the rules of gradients.

std::shared_ptr<GradNode> log(std::shared_ptr<GradNode> &x) {
  auto output_data = std::log(x->data_);
  auto output_label = "log(" + x->label_ + ")";
  auto output_children = std::vector<std::shared_ptr<GradNode>>{x};

  auto result = GradNode::CreateGradnode(output_data, output_label);
  result->children_ = output_children;
  result->backward_fn_ = [x, result]() {
    if (!x->is_scalar_) {
      x->grad_ += (result->grad_ * (1 / x->data_));
    }
  };
  return result;
}

Backpropagation

The heart of the code is the single intuitive API of computing gradients: Z.Backward(). Lets look at the code which is only a few lines.

void GradNode::Backward() {
  grad_ = 1.0;
  auto node_stack = TopologicalSort();
  while (!node_stack.empty()) {
    auto *node = node_stack.top();
    if (node->backward_fn_ == nullptr) {
      node_stack.pop();
      continue;
    }
    node->backward_fn_();
    node_stack.pop();
  }
}

The most important lesson here for me was to realize the need for Topological Sorting and why a simple Breath-First Search or Depth-First search is not the right algorithm to traverse nodes.
Secondly, it is important to set grad_ = 1.0 for the node on which Backward() is called.
Safety check to exclude computations for backward_fn_ == nullptr. This case will be triggered for all scalars.

Tests

Any software is incomplete without a set of hermetic, durable, and exhaustive unit tests. Also, for a library like this, it's even more important because there could be subtle bugs in the code for mathematical operations. So, I added several unit tests which helped me weed out bugs. I'll list a few here.

// Z = AB + B + C
TEST(Micrograd, ChainedEquation1) {
  auto a = GradNode::CreateGradnode(2.0, "a");
  auto b = GradNode::CreateGradnode(3.0, "b");
  auto c = GradNode::CreateGradnode(4.0, "c");

  auto z = (a * b) + b + c;
  z->Backward();

  EXPECT_EQ(a->GetGrad(), 3.0);
  EXPECT_EQ(b->GetGrad(), 3.0);
  EXPECT_EQ(c->GetGrad(), 1.0);
  EXPECT_EQ(z->GetGrad(), 1.0);
  EXPECT_EQ(z->GetData(), 13.0);
}

// Z = A^2 + AB + C/B + D
TEST(Micrograd, ChainedEquation2) {
  auto a = GradNode::CreateGradnode(2.0, "a");
  auto b = GradNode::CreateGradnode(3.0, "b");
  auto c = GradNode::CreateGradnode(4.0, "c");
  auto d = GradNode::CreateGradnode(5.0, "d");

  auto z = d + pow(a, 2.0) + (a * b) + (c / b);
  z->Backward();

  EXPECT_EQ(a->GetGrad(), 7.0);
  EXPECT_EQ(b->GetGrad(), 2.0 - (4.0 / 9.0));
  EXPECT_EQ(c->GetGrad(), 1.0 / 3.0);
  EXPECT_EQ(d->GetGrad(), 1.0);
  EXPECT_EQ(z->GetGrad(), 1.0);
  EXPECT_EQ(z->GetData(), 15 + 4.0 / 3);
}

TEST(Micrograd, Tanh) {
  auto a = GradNode::CreateGradnode(2.0, "a");

  auto z = tanh(a);
  z->Backward();

  EXPECT_NEAR(a->GetGrad(), 0.07065082485316443, 1e-9);
  EXPECT_NEAR(z->GetGrad(), 1.0, 1e-9);
  EXPECT_NEAR(z->GetData(), 0.9640275800758169, 1e-9);
}

Complexities and Challenges

Here are few complexities I encountered during the implementation. I will list them out in order of severity as I faced it.

Memory management in C++

The first and most important challenge for me was to support a seamless API like a + b + c instead of using functions like sum(sum(a, b), c). In C++, a + b + c first creates a temporary object for a + b and then adds it to c. The issue is that this temporary object gets deleted once the operation is complete, but we need to keep all nodes in our computation graph.

To create persistent objects, we can allocate them on the heap. However, using the new keyword is problematic, as in GradNode n = new GradNode(..). It requires manual memory management and can easily lead to memory leaks or segmentation faults. This is where shared_ptr comes to the rescue. It allows us to manage the lifecycle of an object without worrying about manual deletions. Once I figured this out, I had to design all the operators and functions to input and output a shared_ptr<GradNode>.

Complex syntax

This leads us to the second challenge. We need to overload the operators for shared_ptr instead of raw objects. A trick I discovered is to mark the functions as friend. Typically, the friend keyword is used when a class or method needs to access private variables or functions of another class. However, in this case, we use it to create a clean implementation and avoid awkward workarounds. I found this syntax frustrating and not very clean overall.

Scalars

Any mathematical equation should also support scalars. For example, a + 2b + c + 3. This means our operator overloading should automatically work with scalars as well. To handle all edge cases like this, I had to write three sets of overrides for each operator. Again, this felt like an untidy way of doing things, and I wish there was a more straightforward implementation.

  /// Addition
  friend std::shared_ptr<GradNode>
  operator+(double a, const std::shared_ptr<GradNode> &b);

  friend std::shared_ptr<GradNode> operator+(const std::shared_ptr<GradNode> &a,
                                             double b);

  friend std::shared_ptr<GradNode>
  operator+(const std::shared_ptr<GradNode> &a,
            const std::shared_ptr<GradNode> &b);

Whats Next?

We have reached the end of the discussion. Although I don't plan to add more features to micrograd-cpp, here are some ideas for future development. If you're a student or an ML enthusiast reading this blog, feel free to explore these ideas.

Optimization opportunities: There are chances to improve the code. For instance, there's no need to compute the Topological sort repeatedly in every iteration of the ML training process when Backwards() is called. This is inefficient, especially with large computational graphs. Also, consider using a more suitable data structure, like set, instead of vector to store child nodes.
New operators: Think about adding more operators, such as trigonometric functions like sin(x) or cos(x), to the library.
Template programming: The library shouldn't be limited to the double type. Users should be able to choose any numeric data type or define their own. This could help reduce the model's memory usage by selecting a more memory-efficient data type.
Building a Neural Network library: Expand this library to create a neural network library that allows users to build and train a neural network with just a few lines of code.

Conclusion

The pioneers of machine learning, like Andrej Karpathy and Andrew Ng, have consistently emphasized the importance of digging deeper into machine learning code rather than just having a superficial understanding. This exercise allowed me to do exactly that. As a C++ programmer for many years, it challenged me to test my knowledge in building such a framework, which was very exciting. It not only boosted my confidence in understanding ML but also gave me insight into the bottlenecks and optimization opportunities to watch out for. I also implemented tiny neural networks using my library. You can find the source code here.

Building an Autograd Library for Neural Net in C++ from Scratch: My Journey and Lessons

Introduction

Problem Statement

Solution

Implementation

Building blocks

Function overloads

Special functions

Backpropagation

Tests

Complexities and Challenges

Memory management in C++

Complex syntax

Scalars

Whats Next?

Conclusion

Comments

More from this blog

Agentic Deep Research is a Search & Infra problem, not an AI one

Grounding Vision-Language Agents in the DOM: The Element Reference Pattern

Unlocking Paywalls with LLMs: A Comprehensive Analysis

Behind the Pixels: The Math Driving Image Editing

Command Palette

Introduction

Problem Statement

Solution

Implementation

Building blocks

Function overloads

Special functions

Backpropagation

Tests

Complexities and Challenges

Memory management in C++

Complex syntax

Scalars

Whats Next?

Conclusion

Comments

More from this blog