
AD Theory for Cholesky Factorization

Reference

See section 3.6 of Sebastian F. Walter’s Ph.D. thesis, Structured Higher-Order Algorithmic Differentiation in the Forward and Reverse Mode with Application in Optimum Experimental Design, Humboldt-Universität zu Berlin, 2011.

Notation

Cholesky Factor

We are given a positive definite symmetric matrix \(A \in \B{R}^{n \times n}\) and a Cholesky factorization

\[A = L L^\R{T}\]

where \(L \in \B{R}^{n \times n}\) is lower triangular.

Taylor Coefficient

The matrix \(A\) is a function of a scalar argument \(t\). For \(k = 0 , \ldots , K\), we use \(A_k\) for the corresponding Taylor coefficients; i.e.,

\[A(t) = o( t^K ) + \sum_{k = 0}^K A_k t^k\]

where \(o( t^K ) / t^K \rightarrow 0\) as \(t \rightarrow 0\). We use a similar notation for \(L(t)\).

Lower Triangular Part

For a square matrix \(C\), \(\R{lower} (C)\) is the lower triangular part of \(C\), \(\R{diag} (C)\) is the diagonal matrix with the same diagonal as \(C\) and

\[\R{low} ( C ) = \R{lower} (C) - \frac{1}{2} \R{diag} (C)\]
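
As a concrete illustration, the following is a minimal NumPy sketch of these three operators (the function names follow the notation above and are not from any particular library)::

    import numpy as np

    def lower(C):
        # lower triangular part of C, including the diagonal
        return np.tril(C)

    def diag(C):
        # diagonal matrix with the same diagonal as C
        return np.diag(np.diag(C))

    def low(C):
        # low(C) = lower(C) - (1/2) diag(C)
        return lower(C) - 0.5 * diag(C)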

Forward Mode

For each Taylor coefficient order \(k = 0 , \ldots , K\), the coefficients \(A_k \in \B{R}^{n \times n}\) satisfy the equation

\[A_k = \sum_{\ell=0}^k L_\ell L_{k-\ell}^\R{T}\]

In the case where \(k = 0\), this reduces to

\[A_0 = L_0 L_0^\R{T}\]

The value of \(L_0\) can be computed using the Cholesky factorization. In the case where \(k > 0\),

\[A_k = L_k L_0^\R{T} + L_0 L_k^\R{T} + B_k\]

where

\[B_k = \sum_{\ell=1}^{k-1} L_\ell L_{k-\ell}^\R{T}\]

Note that \(B_k\) is defined in terms of Taylor coefficients of \(L(t)\) that have order less than \(k\). We also note that

\[L_0^{-1} ( A_k - B_k ) L_0^\R{-T} = L_0^{-1} L_k + L_k^\R{T} L_0^\R{-T}\]

The first matrix on the right hand side is lower triangular, the second is upper triangular, and the diagonals are equal. It follows that

\[L_0^{-1} L_k = \R{low} [ L_0^{-1} ( A_k - B_k ) L_0^\R{-T} ]\]
\[L_k = L_0 \R{low} [ L_0^{-1} ( A_k - B_k ) L_0^\R{-T} ]\]

This expresses \(L_k\) in terms of the Taylor coefficients of \(A(t)\) and the lower order coefficients of \(L(t)\).
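
As an illustration only (not a production implementation), the recursion can be sketched in NumPy using the low operator above; a careful implementation would replace the explicit inverse of \(L_0\) by triangular solves::

    def forward_cholesky(A):
        # A is the list of Taylor coefficients A_0, ..., A_K of A(t);
        # returns the Taylor coefficients L_0, ..., L_K of L(t)
        K = len(A) - 1
        L = [np.linalg.cholesky(A[0])]   # A_0 = L_0 L_0^T
        L0inv = np.linalg.inv(L[0])
        for k in range(1, K + 1):
            # B_k = sum_{ell=1}^{k-1} L_ell L_{k-ell}^T
            Bk = sum((L[l] @ L[k - l].T for l in range(1, k)),
                     np.zeros_like(A[0]))
            # L_k = L_0 low[ L_0^{-1} (A_k - B_k) L_0^{-T} ]
            L.append(L[0] @ low(L0inv @ (A[k] - Bk) @ L0inv.T))
        return L

The result can be checked by verifying that \(A_k = \sum_{\ell=0}^k L_\ell L_{k-\ell}^\R{T}\) holds for each \(k\).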

Lemma 1

We use the notation \(\dot{C}\) for the derivative of a matrix valued function \(C(s)\) with respect to a scalar argument \(s\). We use the notation \(\bar{S}\) and \(\bar{L}\) for the partial derivatives of a scalar valued function \(\bar{F}( S, L )\) with respect to a symmetric matrix \(S\) and a lower triangular matrix \(L\). Define the scalar valued function

\[\hat{F}( S ) = \bar{F} [ S , \hat{L} (S) ]\]

We use \(\hat{S}\) for the total derivative of \(\hat{F}\) with respect to \(S\). Suppose that \(\hat{L} ( S )\) is such that

\[\dot{L} = L_0 \R{low} ( L_0^{-1} \dot{S} L_0^\R{-T} )\]

for any \(S(s)\). It follows that

\[\hat{S} = \bar{S} + \frac{1}{2} ( M + M^\R{T} )\]

where

\[M = L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} )^\R{T} L_0^{-1}\]

Proof

\[\partial_s \hat{F} [ S(s) , L(s) ] = \R{tr} ( \bar{S}^\R{T} \dot{S} ) + \R{tr} ( \bar{L}^\R{T} \dot{L} )\]
\[\R{tr} ( \bar{L}^\R{T} \dot{L} ) = \R{tr} [ \bar{L}^\R{T} L_0 \R{low} ( L_0^{-1} \dot{S} L_0^\R{-T} ) ]\]
\[= \R{tr} [ \R{low} ( L_0^{-1} \dot{S} L_0^\R{-T} )^\R{T} L_0^\R{T} \bar{L} ]\]
\[= \R{tr} [ L_0^{-1} \dot{S} L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} ) ]\]
\[= \R{tr} [ L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} ) L_0^{-1} \dot{S} ]\]
\[\partial_s \hat{F} [ S(s) , L(s) ] = \R{tr} ( \bar{S}^\R{T} \dot{S} ) + \R{tr} [ L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} ) L_0^{-1} \dot{S} ]\]

We now consider, for a fixed index pair \((i, j)\), the symmetric matrix function \(S(s)\) defined by

\[\begin{split}S_{k, \ell} (s) = \left\{ \begin{array}{ll} s & \R{if} \; k = i \; \R{and} \; \ell = j \\ s & \R{if} \; k = j \; \R{and} \; \ell = i \\ 0 & \R{otherwise} \end{array} \right\}\end{split}\]

For this choice, \(\dot{S}\) is one in the \((i, j)\) and \((j, i)\) positions and zero elsewhere, so the expression above for \(\partial_s \hat{F}\) reduces to the sum of the \((i, j)\) and \((j, i)\) components of \(\bar{S} + \frac{1}{2} ( M + M^\R{T} )\), while \(\partial_s \hat{F} = \R{tr} ( \hat{S}^\R{T} \dot{S} ) = \hat{S}_{i,j} + \hat{S}_{j,i}\). This shows that the formula in the lemma is correct for \(\hat{S}_{i,j}\) and \(\hat{S}_{j,i}\). This completes the proof because the index pair \((i, j)\) was arbitrary.
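
The lemma can also be checked numerically: by the \(k = 0\) forward mode result above, the Cholesky factor \(\hat{L}(S)\) of \(S\) satisfies the hypothesis. The following sketch (an illustration only, reusing low from above) compares the lemma's formula for \(\hat{S}\) with a central finite difference::

    rng = np.random.default_rng(0)
    n = 4
    W = rng.standard_normal((n, n))
    S = W @ W.T + n * np.eye(n)          # symmetric positive definite
    L0 = np.linalg.cholesky(S)
    L0inv = np.linalg.inv(L0)

    # fixed partials Sbar (symmetric) and Lbar (lower triangular)
    Sbar = rng.standard_normal((n, n))
    Sbar = 0.5 * (Sbar + Sbar.T)
    Lbar = np.tril(rng.standard_normal((n, n)))

    def f(S):
        # Fbar(S, L) = tr(Sbar^T S) + tr(Lbar^T L) with L = Lhat(S)
        return np.trace(Sbar.T @ S) + np.trace(Lbar.T @ np.linalg.cholesky(S))

    # the lemma's formula for the total derivative Shat
    M = L0inv.T @ low(L0.T @ Lbar).T @ L0inv
    Shat = Sbar + 0.5 * (M + M.T)

    # central finite difference along a random symmetric direction
    dS = rng.standard_normal((n, n))
    dS = 0.5 * (dS + dS.T)
    eps = 1e-6
    fd = (f(S + eps * dS) - f(S - eps * dS)) / (2.0 * eps)
    print(fd, np.trace(Shat.T @ dS))     # the two values should agree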

Lemma 2

We use the same assumptions as in Lemma 1 except that the matrix \(S\) is lower triangular (instead of symmetric). It follows that

\[\hat{S} = \bar{S} + \R{lower}(M)\]

where

\[M = L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} )^\R{T} L_0^{-1}\]

The proof of this lemma is identical to the proof of Lemma 1, except that the component function is defined by

\[\begin{split}S_{k, \ell} (s) = \left\{ \begin{array}{ll} s & \R{if} \; k = i \; \R{and} \; \ell = j \\ 0 & \R{otherwise} \end{array} \right\}\end{split}\]
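
The same kind of numerical check applies here. The sketch below (an illustration only, reusing names from the previous sketch) uses the linear map \(\hat{L}(S) = L_0 \R{low} ( L_0^{-1} S L_0^\R{-T} )\), which satisfies the hypothesis, and a lower triangular perturbation::

    # reuse S, L0, L0inv, Lbar, rng, n, eps and low from above
    def Lhat(S):
        # a linear map whose derivative satisfies the hypothesis
        return L0 @ low(L0inv @ S @ L0inv.T)

    Sbar = np.tril(rng.standard_normal((n, n)))   # lower triangular partial
    def g(S):
        return np.trace(Sbar.T @ S) + np.trace(Lbar.T @ Lhat(S))

    M = L0inv.T @ low(L0.T @ Lbar).T @ L0inv
    Shat = Sbar + np.tril(M)                      # Lemma 2: Sbar + lower(M)

    dS = np.tril(rng.standard_normal((n, n)))     # lower triangular direction
    fd = (g(S + eps * dS) - g(S - eps * dS)) / (2.0 * eps)
    print(fd, np.trace(Shat.T @ dS))              # the two values should agree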

Reverse Mode

k Equal To 0

For the case \(k = 0\),

\[\dot{A}_0 = \dot{L}_0 L_0^\R{T} + L_0 \dot{L}_0^\R{T}\]
\[L_0^{-1} \dot{A}_0 L_0^\R{-T} = L_0^{-1} \dot{L}_0 + \dot{L}_0^\R{T} L_0^\R{-T}\]
\[\R{low} ( L_0^{-1} \dot{A}_0 L_0^\R{-T} ) = L_0^{-1} \dot{L}_0\]
\[\dot{L}_0 = L_0 \R{low} ( L_0^{-1} \dot{A}_0 L_0^\R{-T} )\]

It follows from Lemma 1 that

\[\bar{A}_0 \stackrel{+}{=} \frac{1}{2} ( M + M^\R{T} )\]

where

\[M = L_0^\R{-T} \R{low} ( L_0^\R{T} \bar{L}_0 )^\R{T} L_0^{-1}\]

where the \(\stackrel{+}{=}\) updates \(\bar{A}_0\) from its value before \(L_0\) is removed from the scalar function dependency to its value after.
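
In code, this update might be sketched as follows (the function name is illustrative)::

    def reverse_cholesky_0(L0, L0bar):
        # contribution to Abar_0 from removing L_0 from the dependency
        L0inv = np.linalg.inv(L0)
        M = L0inv.T @ low(L0.T @ L0bar).T @ L0inv
        return 0.5 * (M + M.T)   # add this to Abar_0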

k Greater Than 0

In the case where \(k > 0\),

\[A_k = L_k L_0^\R{T} + L_0 L_k^\R{T} + B_k\]

where \(B_k\) is defined in terms of Taylor coefficients of \(L(t)\) that have order less than \(k\). It follows that

\[\dot{L}_k L_0^\R{T} + L_0 \dot{L}_k^\R{T} = \dot{A}_k - \dot{B}_k - \dot{L}_0 L_k^\R{T} - L_k \dot{L}_0^\R{T}\]
\[L_0^{-1} \dot{L}_k + \dot{L}_k^\R{T} L_0^\R{-T} = L_0^{-1} ( \dot{A}_k - \dot{B}_k - \dot{L}_0 L_k^\R{T} - L_k \dot{L}_0^\R{T} ) L_0^\R{-T}\]
\[L_0^{-1} \dot{L}_k = \R{low} [ L_0^{-1} ( \dot{A}_k - \dot{B}_k - \dot{L}_0 L_k^\R{T} - L_k \dot{L}_0^\R{T} ) L_0^\R{-T} ]\]
\[\dot{L}_k = L_0 \R{low} [ L_0^{-1} ( \dot{A}_k - \dot{B}_k - \dot{L}_0 L_k^\R{T} - L_k \dot{L}_0^\R{T} ) L_0^\R{-T} ]\]

The matrix \(A_k\) is symmetric; hence it follows from Lemma 1 that

\[\bar{A}_k \stackrel{+}{=} \frac{1}{2} ( M_k + M_k^\R{T} )\]

where

\[M_k = L_0^\R{-T} \R{low} ( L_0^\R{T} \bar{L}_k )^\R{T} L_0^{-1}\]

The matrix \(B_k\) is also symmetric and, by the equation above, \(L_k\) depends on \(A_k\) and \(B_k\) only through the difference \(A_k - B_k\); hence

\[\bar{B}_k = - \; \frac{1}{2} ( M_k + M_k^\R{T} )\]

We define the symmetric matrix \(C_k (s)\) by

\[\dot{C}_k = \dot{L}_0 L_k^\R{T} + L_k \dot{L}_0^\R{T}\]

and remove the dependency on \(C_k\) with

\[\R{tr}( \bar{C}_k^\R{T} \dot{C}_k ) = \R{tr}( \bar{B}_k^\R{T} \dot{C}_k ) = \R{tr}( \bar{B}_k^\R{T} \dot{L}_0 L_k^\R{T} ) + \R{tr}( \bar{B}_k^\R{T} L_k \dot{L}_0^\R{T} )\]
\[= \R{tr}( L_k^\R{T} \bar{B}_k^\R{T} \dot{L}_0 ) + \R{tr}( L_k^\R{T} \bar{B}_k \dot{L}_0 )\]
\[= \R{tr}[ L_k^\R{T} ( \bar{B}_k + \bar{B}_k^\R{T} ) \dot{L}_0 ]\]

Thus, removing \(C_k\) from the dependency results in the following update to \(\bar{L}_0\):

\[\bar{L}_0 \stackrel{+}{=} \R{lower} [ ( \bar{B}_k + \bar{B}_k^\R{T} ) L_k ]\]

which is the same as

\[\bar{L}_0 \stackrel{+}{=} 2 \; \R{lower} [ \bar{B}_k L_k ]\]

We still need to remove \(B_k\) from the dependency. It follows from its definition that

\[\dot{B}_k = \sum_{\ell=1}^{k-1} \left( \dot{L}_\ell L_{k-\ell}^\R{T} + L_\ell \dot{L}_{k-\ell}^\R{T} \right)\]
\[\R{tr}( \bar{B}_k^\R{T} \dot{B}_k ) = \sum_{\ell=1}^{k-1} \left[ \R{tr}( \bar{B}_k^\R{T} \dot{L}_\ell L_{k-\ell}^\R{T} ) + \R{tr}( \bar{B}_k^\R{T} L_\ell \dot{L}_{k-\ell}^\R{T} ) \right]\]
\[= \sum_{\ell=1}^{k-1} \R{tr}( L_{k-\ell}^\R{T} \bar{B}_k^\R{T} \dot{L}_\ell ) + \sum_{\ell=1}^{k-1} \R{tr}( L_\ell^\R{T} \bar{B}_k \dot{L}_{k-\ell} )\]

We now use the fact that \(\bar{B}_k\) is symmetric, together with the change of summation index \(\ell \rightarrow k - \ell\) in the second sum, to conclude

\[\R{tr}( \bar{B}_k^\R{T} \dot{B}_k ) = 2 \sum_{\ell=1}^{k-1} \R{tr}( L_{k-\ell}^\R{T} \bar{B}_k^\R{T} \dot{L}_\ell )\]

Each of the \(\dot{L}_\ell\) matrices is lower triangular. Thus, removing \(B_k\) from the dependency results in the following update for \(\ell = 1 , \ldots , k-1\):

\[\bar{L}_\ell \stackrel{+}{=} 2 \; \R{lower}( \bar{B}_k L_{k-\ell} )\]
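
Putting these updates together, a complete reverse sweep might be sketched as follows (an illustration only; it reuses reverse_cholesky_0 and low from above, and a careful implementation would use triangular solves instead of the explicit inverse)::

    def reverse_cholesky(L, Lbar):
        # L[0..K]    : Taylor coefficients of the Cholesky factor L(t)
        # Lbar[0..K] : partials of a scalar function w.r.t. those coefficients
        # returns Abar[0..K], the partials w.r.t. the coefficients of A(t)
        K = len(L) - 1
        L0inv = np.linalg.inv(L[0])
        Lbar = [lb.copy() for lb in Lbar]    # updated in place below
        Abar = [np.zeros_like(L[0]) for _ in range(K + 1)]
        for k in range(K, 0, -1):            # highest order first
            Mk = L0inv.T @ low(L[0].T @ Lbar[k]).T @ L0inv
            Abar[k] += 0.5 * (Mk + Mk.T)     # Abar_k += (1/2)(M_k + M_k^T)
            Bbar = -0.5 * (Mk + Mk.T)        # Bbar_k = -(1/2)(M_k + M_k^T)
            Lbar[0] += 2.0 * np.tril(Bbar @ L[k])           # removing C_k
            for l in range(1, k):
                Lbar[l] += 2.0 * np.tril(Bbar @ L[k - l])   # removing B_k
        Abar[0] += reverse_cholesky_0(L[0], Lbar[0])        # the k = 0 case
        return Abar

The result can be checked against central finite differences of a scalar function composed with forward_cholesky.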