lines 5-461 of file: xrst/theory/cholesky.xrst

{xrst_begin cholesky_theory app}
{xrst_spell
   diag
   diagonals
   humboldt
   lemma
   ph
   tr
   universitat
}

AD Theory for Cholesky Factorization
####################################

Reference
*********
See section 3.6 of
Sebastian F. Walter's Ph.D. thesis,
*Structured Higher-Order Algorithmic Differentiation*
*in the Forward and Reverse Mode*
*with Application in Optimum Experimental Design* ,
Humboldt-Universitat zu Berlin,
2011.

Notation
********

Cholesky Factor
===============
We are given a positive definite symmetric matrix
:math:`A \in \B{R}^{n \times n}`
and a Cholesky factorization

.. math::

   A = L L^\R{T}

where :math:`L \in \B{R}^{n \times n}` is lower triangular.

Taylor Coefficient
==================
The matrix :math:`A` is a function of a scalar argument
:math:`t`.
For :math:`k = 0 , \ldots , K`, we use :math:`A_k` for the
corresponding Taylor coefficients; i.e.,

.. math::

   A(t) = o( t^K ) + \sum_{k = 0}^K A_k t^k

where :math:`o( t^K ) / t^K \rightarrow 0` as :math:`t \rightarrow 0`.
We use a similar notation for :math:`L(t)`.

Lower Triangular Part
=====================
For a square matrix :math:`C`,
:math:`\R{lower} (C)` is the lower triangular part of :math:`C`,
:math:`\R{diag} (C)` is the diagonal matrix with the same diagonal as
:math:`C` and

.. math::

   \R{low} ( C ) = \R{lower} (C) - \frac{1}{2} \R{diag} (C)

Forward Mode
************
For Taylor coefficient order :math:`k = 0 , \ldots , K`
the coefficients
:math:`A_k \in \B{R}^{n \times n}`, and
satisfy the equation

.. math::

   A_k = \sum_{\ell=0}^k L_\ell L_{k-\ell}^\R{T}

In the case where :math:`k=0`, the

.. math::

   A_0 = L_0 L_0^\R{T}

The value of :math:`L_0` can be computed using the Cholesky factorization.
In the case where :math:`k > 0`,

.. math::

   A_k = L_k L_0^\R{T}  + L_0 L_k^\R{T}  + B_k

where

.. math::

   B_k = \sum_{\ell=1}^{k-1} L_\ell L_{k-\ell}^\R{T}

Note that :math:`B_k` is defined in terms of Taylor coefficients
of :math:`L(t)` that have order less than :math:`k`.
We also note that

.. math::

   L_0^{-1} ( A_k - B_k ) L_0^\R{-T}
   =
   L_0^{-1} L_k + L_k^\R{T} L_0^\R{-T}

The first matrix on the right hand side is lower triangular,
the second is upper triangular,
and the diagonals are equal.
It follows that

.. math::

   L_0^{-1} L_k
   =
   \R{low} [ L_0^{-1} ( A_k - B_k ) L_0^\R{-T} ]

.. math::

   L_k
   =
   L_0 \R{low} [ L_0^{-1} ( A_k - B_k ) L_0^\R{-T} ]

This expresses :math:`L_k` in term of the
Taylor coefficients of :math:`A(t)` and the lower order coefficients
of :math:`L(t)`.

Lemma 1
*******
We use the notation :math:`\dot{C}` for the derivative of a matrix
valued function :math:`C(s)` with respect to a scalar argument :math:`s`.
We use the notation :math:`\bar{S}` and :math:`\bar{L}` for the
partial derivative of a scalar value function :math:`\bar{F}( S, L)`
with respect to a symmetric matrix :math:`S` and
an lower triangular matrix :math:`L`.
Define the scalar valued function

.. math::

   \hat{F}( C ) = \bar{F} [ S , \hat{L} (S) ]

We use :math:`\hat{S}` for the total derivative of :math:`\hat{F}` with
respect to :math:`S`.
Suppose that :math:`\hat{L} ( S )` is such that

.. math::

   \dot{L} = L_0 \R{low} ( L_0^{-1} \dot{S} L_0^\R{-T} )

for any :math:`S(s)`. It follows that

.. math::

   \hat{S} = \bar{S} + \frac{1}{2} ( M + M^\R{T} )

where

.. math::

   M = L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} )^\R{T} L_0^{-1}

Proof
=====

.. math::

   \partial_s \hat{F} [ S(s) , L(s) ]
   =
   \R{tr} ( \bar{S}^\R{T} \dot{S} )
   +
   \R{tr} ( \bar{L}^\R{T} \dot{L} )

.. math::

   \R{tr} ( \bar{L}^\R{T} \dot{L} )
   =
   \R{tr} [
      \bar{L}^\R{T} L_0
      \R{low} ( L_0^{-1} \dot{S} L_0^\R{-T} )
   ]

.. math::

   =
   \R{tr} [
      \R{low} ( L_0^{-1} \dot{S} L_0^\R{-T} )^\R{T}
      L_0^\R{T} \bar{L}
   ]

.. math::

   =
   \R{tr} [
      L_0^{-1} \dot{S} L_0^\R{-T}
      \R{low}( L_0^\R{T} \bar{L} )
   ]

.. math::

   =
   \R{tr} [
      L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} ) L_0^{-1} \dot{S}
   ]

.. math::

   \partial_s \hat{F} [ S(s) , L(s) ]
   =
   \R{tr} ( \bar{S}^\R{T} \dot{S} )
   +
   \R{tr} [
      L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} ) L_0^{-1} \dot{S}
   ]

We now consider the :math:`(i, j)` component function,
for a symmetric matrix :math:`S(s)`,
defined by

.. math::

   S_{k, \ell} (s) = \left\{ \begin{array}{ll}
      1 & \R{if} \; k = i \; \R{and} \; \ell = j \\
      1 & \R{if} \; k = j \; \R{and} \; \ell = i \\
      0 & \R{otherwise}
   \end{array} \right\}

This shows that the formula in the lemma is correct for
:math:`\hat{S}_{i,j}` and :math:`\hat{S}_{j,i}`.
This completes the proof because the component :math:`(i, j)` was arbitrary.

Lemma 2
*******
We use the same assumptions as in Lemma 1 except that the
matrix :math:`S` is lower triangular (instead of symmetric).
It follows that

.. math::

   \hat{S} = \bar{S} + \R{lower}(M)

where

.. math::

   M = L_0^\R{-T} \R{low}( L_0^\R{T} \bar{L} )^\R{T} L_0^{-1}

The proof of this lemma is identical to Lemma 2 except that component function
is defined by

.. math::

   S_{k, \ell} (s) = \left\{ \begin{array}{ll}
      1 & \R{if} \; k = i \; \R{and} \; \ell = j \\
      0 & \R{otherwise}
   \end{array} \right\}

Reverse Mode
************

k Equal To 0
============
For the case :math:`k = 0`,

.. math::

   \dot{A}_0
   =
   \dot{L}_0 L_0^\R{T}
   +
   L_0  \dot{L}_0^\R{T}

.. math::

   L_0^{-1} \dot{A}_0 L_0^\R{-T}
   =
   L_0^{-1} \dot{L}_0
   +
   \dot{L}_0^\R{T} L_0^\R{-T}

.. math::

   \R{low} ( L_0^{-1} \dot{A}_0 L_0^\R{-T} )
   =
   L_0^{-1} \dot{L}_0

.. math::

   \dot{L}_0
   =
   L_0 \R{low} ( L_0^{-1} \dot{A}_0 L_0^\R{-T} )

It follows from Lemma 1 that

.. math::

   \bar{A}_0 \stackrel{+}{=} \frac{1}{2} ( M + M^\R{T} )

where

.. math::

   M = L_0^\R{-T} \R{low} ( L_0^\R{T} \bar{L}_0 )^\R{T} L_0^{-1}

and :math:`\bar{A}_0` is the partial before and after
is before and after :math:`L_0` is removed from the scalar function
dependency.

k Greater Than 0
================
In the case where :math:`k > 0`,

.. math::

   A_k = L_k L_0^\R{T}  + L_0 L_k^\R{T}  + B_k

where :math:`B_k` is defined in terms of Taylor coefficients
of :math:`L(t)` that have order less than :math:`k`.
It follows that

.. math::

   \dot{L}_k L_0^\R{T}
   +
   L_0 \dot{L}_k^\R{T}
   =
   \dot{A}_k - \dot{B}_k  - \dot{L}_0 L_k^\R{T} -  L_k \dot{L}_0^\R{T}

.. math::

   L_0^{-1} \dot{L}_k
   +
   \dot{L}_k^\R{T} L_0^\R{-T}
   =
   L_0^{-1} (
   \dot{A}_k - \dot{B}_k  - \dot{L}_0 L_k^\R{T} -  L_k \dot{L}_0^\R{T}
   ) L_0^\R{-T}

.. math::

   L_0^{-1} \dot{L}_k
   =
   \R{low} [ L_0^{-1} (
   \dot{A}_k - \dot{B}_k  - \dot{L}_0 L_k^\R{T} -  L_k \dot{L}_0^\R{T}
   ) L_0^\R{-T} ]

.. math::

   \dot{L}_k
   =
   L_0 \R{low} [ L_0^{-1} (
   \dot{A}_k - \dot{B}_k  - \dot{L}_0 L_k^\R{T} -  L_k \dot{L}_0^\R{T}
   ) L_0^\R{-T} ]

The matrix :math:`A_k` is symmetric, it follows that

.. math::

   \bar{A}_k \stackrel{+}{=} \frac{1}{2} ( M_k + M_k^\R{T} )

where

.. math::

   M_k = L_0^\R{-T} \R{low} ( L_0^\R{T} \bar{L}_k )^\R{T} L_0^{-1}

The matrix :math:`B_k` is also symmetric, hence

.. math::

   \bar{B}_k = - \; \frac{1}{2} ( M_k + M_k^\R{T} )

We define the symmetric matrix :math:`C_k (s)` by

.. math::

   \dot{C}_k = \dot{L}_0 L_k^\R{T} +  L_k \dot{L}_0^\R{T}

and remove the dependency on :math:`C_k` with

.. math::

   \R{tr}( \bar{C}_k^\R{T} \dot{C}_k )
   =
   \R{tr}( \bar{B}_k^\R{T} \dot{C}_k )
   =
   \R{tr}( \bar{B}_k^\R{T} \dot{L}_0 L_k^\R{T}  )
   +
   \R{tr}( \bar{B}_k^\R{T}  L_k \dot{L}_0^\R{T} )

.. math::

   =
   \R{tr}( L_k^\R{T} \bar{B}_k^\R{T} \dot{L}_0 )
   +
   \R{tr}( L_k^\R{T} \bar{B}_k \dot{L}_0 )

.. math::

   =
   \R{tr}[ L_k^\R{T} ( \bar{B}_k + \bar{B}_k^\R{T} ) \dot{L}_0 ]

Thus, removing :math:`C_k` from the dependency results in the
following update to :math:`\bar{L}_0`:

.. math::

   \bar{L}_0 \stackrel{+}{=} \R{lower} [ ( \bar{B}_k + \bar{B}_k^\R{T} ) L_k ]

which is the same as

.. math::

   \bar{L}_0 \stackrel{+}{=} 2 \; \R{lower} [ \bar{B}_k L_k ]

We still need to remove :math:`B_k` from the dependency.
It follows from its definition that

.. math::

   \dot{B}_k = \sum_{\ell=1}^{k-1}
      \dot{L}_\ell L_{k-\ell}^\R{T} + L_\ell \dot{L}_{k-\ell}^\R{T}

.. math::

   \R{tr}( \bar{B}_k^\R{T} \dot{B}_k )
   =
   \sum_{\ell=1}^{k-1}
   \R{tr}( \bar{B}_k^\R{T} \dot{L}_\ell L_{k-\ell}^\R{T} )
   +
   \R{tr}( \bar{B}_k^\R{T} L_\ell \dot{L}_{k-\ell}^\R{T} )

.. math::

   =
   \sum_{\ell=1}^{k-1}
   \R{tr}( L_{k-\ell}^\R{T} \bar{B}_k^\R{T} \dot{L}_\ell )
   +
   \sum_{\ell=1}^{k-1}
   \R{tr}( L_\ell^\R{T} \bar{B}_k \dot{L}_{k-\ell} )

We now use the fact that :math:`\bar{B}_k` is symmetric to conclude

.. math::

   \R{tr}( \bar{B}_k^\R{T} \dot{B}_k )
   =
   2 \sum_{\ell=1}^{k-1}
   \R{tr}( L_{k-\ell}^\R{T} \bar{B}_k^\R{T} \dot{L}_\ell )

Each of the :math:`\dot{L}_\ell` matrices is lower triangular.
Thus, removing :math:`B_k` from the dependency results in the following
update for :math:`\ell = 1 , \ldots , k-1`:

.. math::

   \bar{L}_\ell
   \stackrel{+}{=} 2 \; \R{lower}( \bar{B}_k L_{k-\ell} )

{xrst_end cholesky_theory}