Deriving forward and reverse mode AD
We will derive forward and reverse mode automatic differentiation (AD) for pure, straight-line programs by example. We assume that the reader is familiar with partial derivatives of real-valued functions, as well as matrix-matrix and matrix-vector multiplication. This post is heavily inspired by the explanation in (Pearlmutter & Siskind, 2008), which helped me first understand AD.
1 Automatic differentiation
The family of algorithms known as automatic differentiation is the foundation of the tools that automatically compute derivatives. The family can be coarsely divided into forward mode and reverse mode; multiple modes exist because their asymptotic costs depend on different features of the differentiated program. Forward mode AD was introduced in (Wengert, 1964), and reverse mode AD was introduced in Speelpenning's thesis (Speelpenning, 1980).
AD works in the presence of many programming language features, but its essence can still be understood by looking at pure, straight-line programs. These are programs consisting of a sequence of variable assignments where the value assigned is calculated from the previous variables using a pure function.
2 The multivariate derivative
The multivariate version of differentiation is given by the Jacobian, which, for a differentiable function $F \colon \R^n \to \R^m$ and a point $\vec{x} \in \R^n$, we denote by $\nabla F(\vec{x})$. The Jacobian $\nabla F(\vec{x})$ is an $m \times n$ matrix containing all the partial derivatives of $F$ at $\vec{x}$. Thus, writing $F(\vec{x}) = (f_{1}(\vec{x}),\ldots,f_{m}(\vec{x}))$ for differentiable functions $f_{j} \colon \R^n \to \R$, the Jacobian $\nabla F(\vec{x})$ is \[ \nabla F(\vec{x}) \defeq \begin{pmatrix} \partial_{1}f_1(\vec{x}) & \cdots & \partial_{n}f_1(\vec{x}) \cr \vdots & \ddots & \vdots \cr \partial_{1}f_m(\vec{x}) & \cdots & \partial_{n}f_m(\vec{x}) \cr \end{pmatrix} \] where $\partial_{i}$ is the $i^{\text{th}}$ partial derivative operator. The Jacobian satisfies the multivariate chain rule $\nabla (G \circ F)(\vec{x}) = \nabla G (F(\vec{x})) \times \nabla F(\vec{x})$ for a differentiable $G \colon \R^m \to \R^k$. This chain rule is the key behind forward and reverse mode AD.
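As a quick sanity check of the chain rule, here is a minimal Python sketch comparing a finite-difference Jacobian of a composition against the product of the Jacobians of its parts. The functions `F` and `G` below are arbitrary toy examples chosen only for this illustration, not anything used elsewhere in the post.

```python
import numpy as np

# Toy functions F: R^2 -> R^2 and G: R^2 -> R (arbitrary, for illustration only).
def F(x):
    return np.array([x[0] * x[1], np.sin(x[0])])

def G(y):
    return np.array([y[0] + y[1] ** 2])

def jacobian(f, x, eps=1e-6):
    """Approximate the Jacobian of f at x with central finite differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

x = np.array([0.7, -1.3])
lhs = jacobian(lambda v: G(F(v)), x)      # Jacobian of G ∘ F at x
rhs = jacobian(G, F(x)) @ jacobian(F, x)  # chain rule: ∇G(F(x)) × ∇F(x)
print(np.allclose(lhs, rhs, atol=1e-5))   # True, up to finite-difference error
```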
3 The straight-line program
Consider the algebraic definition \[ z = h\left( g(f(a),b), f(a)\right) \] where $a, b \in \R$, $f \colon \R \to \R$, $g, h \colon \R^2 \to \R$, and all functions are differentiable. We can rewrite this as a sequence of calculations using intermediate variables \[ \begin{aligned} x &= f(a) &(1)\cr y &= g(x, b) & (2)\cr z &= h(y, x) & (3) \end{aligned} \] and consider the sequence as a pure, straight-line program where the variables $a, b$ are inputs and the variables $x, y, z$ are initialized to $0$. We now regard the state of the program at each line as a five-tuple $(a, b, x, y, z) \in \R^5$ containing the values of our variables. Thus, each line $(i)$ gives a function $F_i \colon \R^5 \to \R^5$, i.e. \[ \begin{aligned} F_1(v_0, v_1, v_2, v_3, v_4) &= (v_0, v_1, f(v_0), v_3, v_4) \cr F_2(v_0, v_1, v_2, v_3, v_4) &= (v_0, v_1, v_2, g(v_2, v_1), v_4)\cr F_3(v_0, v_1, v_2, v_3, v_4) &= (v_0, v_1, v_2, v_3, h(v_3, v_2)). \end{aligned} \] Our program can then be rewritten to \[ \begin{aligned} \vec{x}_0 &= (a, b, 0, 0, 0) \cr \vec{x}_1 &= F_1(\vec{x}_0) \cr \vec{x}_2 &= F_2(\vec{x}_1) \cr \vec{x}_3 &= F_3(\vec{x}_2) \end{aligned} \] where $\vec{x}_3$ gives the final state.
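To make this concrete, here is a small Python sketch of the program as a composition of state-transforming functions. The particular choices of $f$, $g$, and $h$ (sine, multiplication, and $y + x^2$) are placeholders picked for illustration; any differentiable functions would do.

```python
import math

# Placeholder choices for f, g, h; any differentiable functions would work here.
def f(a):    return math.sin(a)
def g(x, b): return x * b
def h(y, x): return y + x * x

def F1(s):
    a, b, x, y, z = s
    return (a, b, f(a), y, z)    # line (1): x = f(a)

def F2(s):
    a, b, x, y, z = s
    return (a, b, x, g(x, b), z) # line (2): y = g(x, b)

def F3(s):
    a, b, x, y, z = s
    return (a, b, x, y, h(y, x)) # line (3): z = h(y, x)

a, b = 1.0, 2.0
x0 = (a, b, 0.0, 0.0, 0.0)  # initial state
x3 = F3(F2(F1(x0)))         # final state; x3[4] holds z
print(x3)
```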
4 Forward mode
We can now view our program as the composition of state-transforming functions $F_3 \circ F_2 \circ F_1$ and calculate its derivative using the chain rule as \[ \nabla (F_3 \circ F_2 \circ F_1)(\vec{x}_0) = \nabla F_3 (\vec{x}_2) \times \nabla F_2(\vec{x}_1) \times \nabla F_1 (\vec{x}_0) \] where $\times$ is matrix-matrix multiplication, and later matrix-vector multiplication as well. This calculation is the crux of both forward and reverse mode AD; the two modes differ in how they evaluate it.
For forward mode, we observe that the matrix product can be computed from right to left by \[ \begin{aligned} X_1 &= \nabla F_1(\vec{x}_0) \cr X_2 &= \nabla F_2(\vec{x}_1) \times X_1 \cr X_3 &= \nabla F_3(\vec{x}_2) \times X_2. \end{aligned} \] Materializing entire matrices would be inefficient in practice, so we instead multiply on the right by a vector $\vec{dx}_0$ to obtain \[ \nabla (F_3 \circ F_2 \circ F_1)(\vec{x}_0) \times \vec{dx}_0 = \nabla F_3 (\vec{x}_2) \times \nabla F_2(\vec{x}_1) \times \nabla F_1 (\vec{x}_0) \times \vec{dx}_0 \] giving the sequence of vectors \[ \begin{aligned} \vec{dx}_1 &= \nabla F_1(\vec{x}_0) \times \vec{dx}_0\cr \vec{dx}_2 &= \nabla F_2(\vec{x}_1) \times \vec{dx}_1\cr \vec{dx}_3 &= \nabla F_3(\vec{x}_2) \times \vec{dx}_2. \end{aligned} \] Calculating the Jacobian of the function $F_1(v_0, v_1, v_2, v_3, v_4) = (v_0, v_1, f(v_0), v_3, v_4)$ at $\vec{x}_0$, we see \[ \nabla F_1 (\vec{x}_0) = \begin{pmatrix} 1 & & & & \cr & 1 & & & \cr \partial f(a) & & 0 & & \cr & & & 1 & \cr & & & & 1 \cr \end{pmatrix} \] where $\partial f$ is the derivative of $f \colon \R \to \R$ and empty entries are $0$. Similarly, \[ \nabla F_2 (\vec{x}_1) = \begin{pmatrix} 1 & & & & \cr & 1 & & & \cr & & 1 & & \cr & \partial_{R} g(x, b) & \partial_{L} g(x, b) & 0 & \cr & & & & 1 \cr \end{pmatrix} \] \[ \nabla F_3 (\vec{x}_2) = \begin{pmatrix} 1 & & & & \cr & 1 & & & \cr & & 1 & & \cr & & & 1 & \cr & & \partial_{R} h(y, x) & \partial_{L} h(y, x) & 0 \cr \end{pmatrix} \] where $\partial_{R} g$ is the partial derivative of $g$ in the right argument and so on. Observe that the Jacobians are sparse because each $F_{i}$ changes only one variable. We now calculate the vectors $\vec{dx}_i$. We use the notation $\vec{dx}_i[a]$, $\vec{dx}_i[b]$, $\vec{dx}_i[x]$, $\vec{dx}_i[y]$, and $\vec{dx}_i[z]$ for the first through fifth components respectively. Pairing each vector with the matching line of our original program, we get \[ \begin{aligned} x &= f(a) & \quad \vec{dx}_1 &= {\color{gray}(\vec{dx}_0[a], \vec{dx}_0[b], {\color{black}\partial f(a) \cdot \vec{dx}_0[a]}, \vec{dx}_0[y], \vec{dx}_0[z])} \cr y &= g(x, b) & \quad \vec{dx}_2 &= {\color{gray}(\vec{dx}_1[a], \vec{dx}_1[b], \vec{dx}_1[x], {\color{black}\partial_R g(x, b) \cdot \vec{dx}_1[b] + \partial_L g(x, b) \cdot \vec{dx}_1[x]}, \vec{dx}_1[z])} \cr z &= h(y, x) & \quad \vec{dx}_3 &= {\color{gray}(\vec{dx}_2[a], \vec{dx}_2[b], \vec{dx}_2[x], \vec{dx}_2[y], {\color{black}\partial_R h(y, x) \cdot \vec{dx}_2[x] + \partial_L h(y, x) \cdot \vec{dx}_2[y]})}. \end{aligned} \] Observe that $\vec{dx}_3[x] = \vec{dx}_2[x] = \vec{dx}_1[x]$ because the $x$ component of the $\vec{dx}_i$’s only changes when $x$ is assigned to. Thus, we do not need to define a whole vector $\vec{dx}_i$ at each step; it is sufficient to define one new scalar variable per line. We can therefore rewrite the above as \[ \begin{aligned} x &= f(a) & \quad dx &= \partial f(a) \cdot da \cr y &= g(x, b) & \quad dy &= \partial_R g(x, b) \cdot db + \partial_L g(x, b) \cdot dx \cr z &= h(y, x) & \quad dz &= \partial_R h(y, x) \cdot dx + \partial_L h(y, x) \cdot dy \end{aligned} \] which exactly captures the forward mode algorithm. Namely, each line is paired with a derivative calculation using the partial derivatives, i.e. $y = f(x_1, x_2, \ldots, x_n)$ is paired with \[ dy = \sum^{n}_{i = 1} \partial_i f (x_1, x_2, \ldots, x_n) \cdot dx_i \] for a fresh variable $dy$.¹
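Below is a minimal Python sketch of forward mode on the example program, using the same placeholder $f$, $g$, and $h$ as before, with their partial derivatives written out by hand. Seeding $(da, db)$ with $(1, 0)$ or $(0, 1)$ recovers $\partial z / \partial a$ or $\partial z / \partial b$ respectively.

```python
import math

# Placeholder functions from the earlier sketch, with hand-written partial derivatives.
def f(a):       return math.sin(a)
def df(a):      return math.cos(a)
def g(x, b):    return x * b
def dg_L(x, b): return b        # ∂_L g
def dg_R(x, b): return x        # ∂_R g
def h(y, x):    return y + x * x
def dh_L(y, x): return 1.0      # ∂_L h
def dh_R(y, x): return 2 * x    # ∂_R h

def forward(a, b, da, db):
    # Each assignment is paired with a tangent computed from the partials.
    x = f(a);    dx = df(a) * da
    y = g(x, b); dy = dg_L(x, b) * dx + dg_R(x, b) * db
    z = h(y, x); dz = dh_L(y, x) * dy + dh_R(y, x) * dx
    return z, dz

# Seeding (da, db) = (1, 0) gives ∂z/∂a; (0, 1) gives ∂z/∂b.
print(forward(1.0, 2.0, 1.0, 0.0))
print(forward(1.0, 2.0, 0.0, 1.0))
```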
5 Reverse mode
For reverse mode, we observe that transposing the matrix product \[ \nabla (F_3 \circ F_2 \circ F_1)(\vec{x}_0)^\intercal = \nabla F_1 (\vec{x}_0)^\intercal \times \nabla F_2(\vec{x}_1)^\intercal \times \nabla F_3 (\vec{x}_2)^\intercal \] reverses the order of matrix multiplication. We can again calculate from right to left, \[ \begin{aligned} X_3 &= \nabla F_3(\vec{x}_2)^\intercal \cr X_2 &= \nabla F_2(\vec{x}_1)^\intercal \times X_3 \cr X_1 &= \nabla F_1(\vec{x}_0)^\intercal \times X_2 \end{aligned} \] and similarly multiply on the right by a vector $\vec{\delta x}_4$ \[ \nabla (F_3 \circ F_2 \circ F_1)(\vec{x}_0)^\intercal \times \vec{\delta x}_4 = \nabla F_1 (\vec{x}_0)^\intercal \times \nabla F_2(\vec{x}_1)^\intercal \times \nabla F_3 (\vec{x}_2)^\intercal \times \vec{\delta x}_4 \] so we can define a sequence of intermediate vectors \[ \begin{aligned} \vec{\delta x}_3 &= \nabla F_3(\vec{x}_2)^\intercal \times \vec{\delta x}_4 \cr \vec{\delta x}_2 &= \nabla F_2(\vec{x}_1)^\intercal \times \vec{\delta x}_3 \cr \vec{\delta x}_1 &= \nabla F_1(\vec{x}_0)^\intercal \times \vec{\delta x}_2. \end{aligned} \] The transposes of the Jacobians \[ \nabla F_1 (\vec{x}_0)^\intercal = \begin{pmatrix} 1 & & \partial f(a) & & \cr & 1 & & & \cr & & 0 & & \cr & & & 1 & \cr & & & & 1 \cr \end{pmatrix} \] \[ \nabla F_2 (\vec{x}_1)^\intercal = \begin{pmatrix} 1 & & & & \cr & 1 & & \partial_{R} g(x, b) & \cr & & 1 & \partial_{L} g(x, b) & \cr & & & 0 & \cr & & & & 1 \cr \end{pmatrix} \] \[ \nabla F_3 (\vec{x}_2)^\intercal = \begin{pmatrix} 1 & & & & \cr & 1 & & & \cr & & 1 & & \partial_{R} h(y, x) \cr & & & 1 & \partial_{L} h(y, x) \cr & & & & 0 \cr \end{pmatrix} \] are also sparse. Write $\vec{\delta x}_i[a]$, $\vec{\delta x}_i[b]$, $\vec{\delta x}_i[x]$, $\vec{\delta x}_i[y]$, and $\vec{\delta x}_i[z]$ for the first through fifth components of $\vec{\delta x}_i$ respectively. Calculating with components, we see \[ \begin{aligned} \vec{\delta x}_3 &= {\color{gray}(\vec{\delta x}_4[a], \vec{\delta x}_4[b], {\color{black}\vec{\delta x}_4[x] + \partial_R h(y, x) \cdot \vec{\delta x}_4[z]}, {\color{black}\vec{\delta x}_4[y] + \partial_L h(y, x) \cdot \vec{\delta x}_4[z]}, 0)} \cr \vec{\delta x}_2 &= {\color{gray}(\vec{\delta x}_3[a], {\color{black}\vec{\delta x}_3[b] + \partial_R g(x, b) \cdot \vec{\delta x}_3[y]}, {\color{black}\vec{\delta x}_3[x] + \partial_L g(x, b) \cdot \vec{\delta x}_3[y]}, 0, \vec{\delta x}_3[z])} \cr \vec{\delta x}_1 &= {\color{gray}({\color{black}\vec{\delta x}_2[a] + \partial f(a) \cdot \vec{\delta x}_2[x]}, \vec{\delta x}_2[b], 0, \vec{\delta x}_2[y], \vec{\delta x}_2[z])} \cr \end{aligned} \] and note that each line of the original program accumulates derivatives into its arguments, weighted by the derivative associated with the variable it assigns. For example, $x = f(a)$ adds $\partial f(a) \cdot \vec{\delta x}_2[x]$ to $\vec{\delta x}_2[a]$. We can use mutable variables $\delta a$, $\delta b$, $\delta x$, and $\delta y$, initialized to $0$, together with the seed $\delta z$ (playing the role of $\vec{\delta x}_4[z]$), to perform the above calculation \[ \begin{aligned} x &= f(a) \cr y &= g(x, b) \cr z &= h(y, x) \cr \delta y &\mathrel{+}= \partial_L h(y, x) \cdot \delta z, \quad \delta x \mathrel{+}= \partial_R h(y, x) \cdot \delta z \cr \delta x &\mathrel{+}= \partial_L g(x, b) \cdot \delta y, \quad \delta b \mathrel{+}= \partial_R g(x, b) \cdot \delta y \cr \delta a &\mathrel{+}= \partial f(a) \cdot \delta x \end{aligned} \] which is exactly reverse mode AD, modulo zeroing out mutable variables.
Namely, each line has a corresponding stateful derivative update that accumulates into the mutable derivatives associated with its arguments, i.e. $y = f(x_1, x_2, \ldots, x_n)$ is paired with \[ \delta x_1 \mathrel{+}= \partial_1 f(x_1, \ldots, x_n) \cdot \delta y,\, \ldots,\, \delta x_n \mathrel{+}= \partial_n f(x_1, \ldots, x_n) \cdot \delta y \] and these updates are performed in the reverse order of the original program.
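Here is the corresponding reverse mode sketch in Python, again with the placeholder $f$, $g$, and $h$ from the earlier sketches: a forward sweep runs the original program, and a reverse sweep applies the accumulation updates in reverse order, yielding both $\partial z / \partial a$ and $\partial z / \partial b$ from a single pass.

```python
import math

# Placeholder functions and partial derivatives, as in the forward mode sketch.
def f(a):       return math.sin(a)
def df(a):      return math.cos(a)
def g(x, b):    return x * b
def dg_L(x, b): return b
def dg_R(x, b): return x
def h(y, x):    return y + x * x
def dh_L(y, x): return 1.0
def dh_R(y, x): return 2 * x

def reverse(a, b, dz=1.0):
    # Forward sweep: run the original program, keeping intermediates.
    x = f(a)
    y = g(x, b)
    z = h(y, x)

    # Reverse sweep: accumulate adjoints into the arguments of each line,
    # in the reverse order of the original program.
    da = db = dx = dy = 0.0
    dy += dh_L(y, x) * dz; dx += dh_R(y, x) * dz   # z = h(y, x)
    dx += dg_L(x, b) * dy; db += dg_R(x, b) * dy   # y = g(x, b)
    da += df(a) * dx                               # x = f(a)
    return z, da, db

# One reverse sweep yields both ∂z/∂a and ∂z/∂b.
print(reverse(1.0, 2.0))
```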
6 Conclusion
We have seen how to derive forward and reverse mode automatic differentiation for pure, straight-line programs. We achieved this by:
- Viewing our programs as state-transformers on all of the variables.
- Applying the multivariate chain rule, and in the case of reverse mode transposing the result.
- Pre-multiplying by a vector to get a sequence of intermediate vectors.
- Analyzing the sparsity of the Jacobians to determine the structure of each vector.
- Using this analysis to never create the vectors or Jacobians to begin with.
If you want to see how to implement forward and reverse mode AD (and many more modes!) for a more general-purpose language, feel free to look at my thesis.
References
- Pearlmutter, B. A., & Siskind, J. M. (2008). Reverse-mode AD in a Functional Framework: Lambda the Ultimate Backpropagator. ACM Transactions on Programming Languages and Systems, 30(2), 7:1–7:36. https://doi.org/10.1145/1330017.1330018
- Wengert, R. E. (1964). A simple automatic derivative evaluation program. Communications of the ACM, 7(8), 463–464. https://doi.org/10.1145/355586.364791
- Speelpenning, B. (1980). Compiling fast partial derivatives of functions given by algorithms [Ph.D.]. University of Illinois at Urbana-Champaign.
- Griewank, A., & Walther, A. (2008). Evaluating Derivatives. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898717761
¹ Forward mode AD can also be viewed as arithmetic in the ring of truncated Taylor series (Griewank & Walther, 2008, Chapter 13).
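As an illustration of this footnote, here is a minimal Python sketch of dual numbers, i.e. first-order truncated Taylor series $a + b\varepsilon$ with $\varepsilon^2 = 0$; overloading addition and multiplication on such pairs is already enough to differentiate simple expressions in forward mode.

```python
class Dual:
    """First-order truncated Taylor series a + b·ε with ε² = 0."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + a'ε)(b + b'ε) = ab + (a'b + ab')ε, since ε² = 0.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__

# d/da of a·a + 3·a at a = 2 is 2·2 + 3 = 7.
a = Dual(2.0, 1.0)          # seed the tangent of the input with 1
print((a * a + 3 * a).tan)  # 7.0
```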