3.1 Optimization Using Gradient Descent
Optimization methods are central to training machine learning models. In this section, we study gradient-based optimization, where we iteratively improve parameters by following the negative gradient of an objective function.
Gradient descent is a first-order optimization algorithm used to find a local minimum of a differentiable function \(f: \mathbb{R}^d \to \mathbb{R}\).
Definition 3.1 The gradient descent algorithm begins with an initial guess \(\mathbf{x}_0\). At each iteration, parameters are updated as: \[ \mathbf{x}_{i+1} = \mathbf{x}_i - \gamma_i (\nabla f(\mathbf{x}_i))^\top , \] where \(\gamma_i\) is the step-size (or learning rate).
The gradient points in the direction of steepest ascent, so moving in the opposite direction decreases the function value.
Example 3.1 Gradient Descent on a Simple Quadratic Function
Consider the quadratic function \[ f(x) = (x - 3)^2. \] This function has a global minimum at \(x = 3\).
The derivative (gradient) of \(f\) is \[ \nabla f(x) = f'(x) = 2(x - 3). \] Gradient descent updates the variable \(x\) according to \[ x_{k+1} = x_k - \gamma \nabla f(x_k), \] where \(\gamma > 0\) is the learning rate.
Substituting the gradient: \[ x_{k+1} = x_k - \gamma \, 2(x_k - 3). \] Let the initial value be \(x_0 = 0\) and choose \(\gamma = 0.1\). \[ x_1 = 0 - 0.1 \cdot 2(0 - 3) = 0.6 \] \[ x_2 = 0.6 - 0.1 \cdot 2(0.6 - 3) = 1.08 \] \[ x_3 = 1.08 - 0.1 \cdot 2(1.08 - 3) = 1.464 \] Each step moves \(x_k\) closer to the minimizer \(x = 3\).
Each step moves opposite the gradient, reducing the function value. Because \(f(x)\) is convex, gradient descent converges to the global minimum for any sufficiently small step-size.
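These iterates are easy to verify in code. The following minimal Python sketch (the loop bound and variable names are illustrative choices) reproduces them:

```python
# Gradient descent on f(x) = (x - 3)^2, reproducing Example 3.1.
def grad(x):
    return 2.0 * (x - 3.0)  # f'(x) = 2(x - 3)

x, gamma = 0.0, 0.1  # initial guess x_0 and learning rate from the example
for k in range(1, 4):
    x = x - gamma * grad(x)
    print(f"x_{k} = {x:.4f}")
# Output: x_1 = 0.6000, x_2 = 1.0800, x_3 = 1.4640
```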
This simple quadratic example shows how the iterative updates move toward the minimizer. However, if the problem is poorly conditioned (e.g., shaped like a long narrow valley), convergence becomes slow and oscillatory.
Example 3.2 Gradient Descent on a Two-Variable Quadratic Function
Consider the function \[ f(x, y) = (x - 2)^2 + (y + 1)^2. \] This is a convex quadratic function with a global minimum at \[ (x^\ast, y^\ast) = (2, -1). \]
The gradient of \(f\) is \[ \nabla f(x, y) = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 2(x - 2) \\ 2(y + 1) \end{bmatrix}. \]
Gradient descent updates the variables according to \[ \begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix} = \begin{bmatrix} x_k \\ y_k \end{bmatrix} - \gamma \begin{bmatrix} 2(x_k - 2) \\ 2(y_k + 1) \end{bmatrix}, \] where \(\gamma > 0\) is the learning rate.
Let the initial point be \[ (x_0, y_0) = (0, 0), \] and choose \(\gamma = 0.1\).
Iteration 1 \[ \begin{aligned} x_1 &= 0 - 0.1 \cdot 2(0 - 2) = 0.4, \\ y_1 &= 0 - 0.1 \cdot 2(0 + 1) = -0.2. \end{aligned} \]
Iteration 2 \[ \begin{aligned} x_2 &= 0.4 - 0.1 \cdot 2(0.4 - 2) = 0.72, \\ y_2 &= -0.2 - 0.1 \cdot 2(-0.2 + 1) = -0.36. \end{aligned} \]
Iteration 3 \[ \begin{aligned} x_3 &= 0.72 - 0.1 \cdot 2(0.72 - 2) = 0.976, \\ y_3 &= -0.36 - 0.1 \cdot 2(-0.36 + 1) = -0.488. \end{aligned} \]
The iterates move steadily toward \((2, -1)\).
The level sets of \(f(x,y)\) are concentric circles centered at \((2,-1)\). The gradient points perpendicular to level curves in the direction of steepest increase. Gradient descent moves opposite the gradient, descending toward the minimum.
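As before, a few lines of Python (a minimal sketch with illustrative names) reproduce these iterates:

```python
import numpy as np

# Gradient descent on f(x, y) = (x - 2)^2 + (y + 1)^2, reproducing Example 3.2.
def grad(v):
    x, y = v
    return np.array([2.0 * (x - 2.0), 2.0 * (y + 1.0)])

v, gamma = np.array([0.0, 0.0]), 0.1  # initial point and learning rate
for k in range(1, 4):
    v = v - gamma * grad(v)
    print(f"(x_{k}, y_{k}) = ({v[0]:.3f}, {v[1]:.3f})")
# Output ends with (x_3, y_3) = (0.976, -0.488)
```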
3.1.1 Step-size Selection
Choosing an appropriate step-size \(\gamma\) is crucial:
- Too small → slow convergence
- Too large → overshooting or divergence
Adaptive methods adjust the step-size based on local function behavior:
- If \(f(\mathbf{x}_{i+1}) > f(\mathbf{x}_i)\), the step was too large → undo the step and reduce \(\gamma\)
- If \(f(\mathbf{x}_{i+1}) < f(\mathbf{x}_i)\), the step could have been larger → increase \(\gamma\)
Undoing steps that increase \(f\) guarantees that the function values decrease monotonically; a simple implementation is sketched after Definition 3.2. Monotonic convergence refers to convergence in which a sequence approaches its limit without oscillating, moving consistently in one direction (either always increasing or always decreasing) toward the limit.
Definition 3.2 A sequence \(\{x_n\}\) converges monotonically to \(x^\ast\) if it satisfies both:
\(x_n \to x^\ast\) as \(n \to \infty\), and
the sequence is monotonic, meaning either:
- \(x_{n+1} \le x_n\) for all \(n\) (monotonically decreasing), or
- \(x_{n+1} \ge x_n\) for all \(n\) (monotonically increasing).
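The step-size heuristic above can be implemented by undoing any step that increases \(f\), as the following sketch illustrates (the shrink factor 0.5 and growth factor 1.1 are illustrative choices, not prescribed values):

```python
# Gradient descent with the adaptive step-size heuristic of Section 3.1.1.
def adaptive_gd(f, grad, x0, gamma=0.1, iters=50):
    x = x0
    for _ in range(iters):
        x_new = x - gamma * grad(x)
        if f(x_new) > f(x):   # step increased f: undo it and shrink gamma
            gamma *= 0.5
        else:                 # step decreased f: accept it and grow gamma
            x = x_new
            gamma *= 1.1
    return x

# f(x) = (x - 3)^2: the function values decrease monotonically toward x = 3.
print(adaptive_gd(lambda x: (x - 3.0) ** 2, lambda x: 2.0 * (x - 3.0), 0.0))
```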
Example 3.3 Solving \(\mathbf{A}\mathbf{x} = \mathbf{b}\) can be formulated as minimizing: \[ \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = (\mathbf{A}\mathbf{x} - \mathbf{b})^\top (\mathbf{A}\mathbf{x} - \mathbf{b}). \] The gradient is \[ \nabla_\mathbf{x} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2 = 2(\mathbf{A}\mathbf{x} - \mathbf{b})^\top \mathbf{A} .\] Although this problem has a closed-form solution, gradient descent provides an iterative alternative.
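As a concrete sketch, the snippet below runs gradient descent on an illustrative \(2 \times 2\) system (the matrix \(\mathbf{A}\), vector \(\mathbf{b}\), step-size, and iteration count are all assumptions made for the example). Note that the code uses the column-vector form \(2\mathbf{A}^\top(\mathbf{A}\mathbf{x} - \mathbf{b})\) of the gradient:

```python
import numpy as np

# Gradient descent for min ||Ax - b||^2 with an illustrative A and b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([5.0, 5.0])

x, gamma = np.zeros(2), 0.05
for _ in range(200):
    x = x - gamma * 2.0 * A.T @ (A @ x - b)  # column-vector gradient

print(x)                      # iterative solution, close to [1, 2]
print(np.linalg.solve(A, b))  # closed-form solution for comparison
```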
The convergence rate depends on the condition number \(\kappa = \frac{\sigma_{\max}(\mathbf{A})}{\sigma_{\min}(\mathbf{A})}\), where \(\sigma_{\max}\) and \(\sigma_{\min}\) denote the maximum and minimum singular values of \(\mathbf{A}\). You can picture a poorly conditioned problem as trying to find the lowest point of a slightly curved dinner plate, where the surface is nearly flat in some directions, versus a well-conditioned problem as finding the bottom of an ice-cream cone, which is equally steep in every direction. Poorly conditioned problems have a large condition number, while well-conditioned ones have a condition number close to 1.
Preconditioning can improve convergence. The idea is to multiply the residual by a matrix \(\mathbf{P}^{-1}\) (written \(\mathbf{P}\) in some texts) and solve \[\mathbf{P}^{-1}\left(\mathbf{A}\mathbf{x} - \mathbf{b} \right) = \mathbf{0}\] instead of the original system, where \(\mathbf{P}\) is chosen so that \(\mathbf{P}^{-1}\mathbf{A}\) is better conditioned than \(\mathbf{A}\) while remaining cheap to apply.
Example 3.4 Preconditioning Gradient Descent
Consider the quadratic function \[ f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\top \mathbf{A}\mathbf{x}, \quad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \] where \[ \mathbf{A} = \begin{bmatrix} 10 & 0 \\ 0 & 1 \end{bmatrix}. \]
This function is strongly anisotropic: it curves much more steeply in the \(x_1\)-direction than in the \(x_2\)-direction.
The gradient is \[ \nabla f(\mathbf{x}) = \mathbf{A}\mathbf{x} = \begin{bmatrix} 10x_1 \\ x_2 \end{bmatrix}. \] Standard gradient descent updates: \[ \mathbf{x}_{k+1} = \mathbf{x}_k - \gamma \nabla f(\mathbf{x}_k). \] Because of the large eigenvalue \(10\), the method must take small steps to avoid overshooting, leading to slow convergence.
We choose a preconditioning matrix \[ \mathbf{P} = \begin{bmatrix} \frac{1}{10} & 0 \\ 0 & 1 \end{bmatrix}. \]
The preconditioned gradient descent update becomes \[ \mathbf{x}_{k+1} = \mathbf{x}_k - \gamma \mathbf{P}\nabla f(\mathbf{x}_k). \]
The impact of this is the new preconditioned gradient: \[ \mathbf{P}\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{1}{10} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 10x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}. \]
Now the update rule is \[ \mathbf{x}_{k+1} = \mathbf{x}_k - \gamma \mathbf{P}\nabla f(\mathbf{x}_k) = (1 - \gamma)\,\mathbf{x}_k. \] Both components are scaled equally, eliminating the imbalance in curvature.
Without preconditioning: level sets are elongated ellipses, causing zig-zag motion. With preconditioning: the problem behaves like \[ f(\mathbf{x}) = \tfrac{1}{2}(x_1^2 + x_2^2), \] which has circular level sets. Convergence is faster and more stable.
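The effect is easy to see numerically. In the sketch below (the starting point, step-size, and iteration count are illustrative choices), the plain update makes the \(x_1\) component flip sign every step, while the preconditioned update shrinks both components at the same rate:

```python
import numpy as np

# Plain vs. preconditioned gradient descent on f(x) = 0.5 x^T A x (Example 3.4).
A = np.diag([10.0, 1.0])
P = np.diag([0.1, 1.0])          # the preconditioner chosen in the example

gamma = 0.18                     # near the plain method's stability limit 2/10
x_plain = np.array([1.0, 1.0])   # illustrative starting point
x_pre = np.array([1.0, 1.0])

for _ in range(20):
    x_plain = x_plain - gamma * (A @ x_plain)  # x_1 factor: 1 - 1.8 = -0.8
    x_pre = x_pre - gamma * (P @ A @ x_pre)    # both factors: 1 - 0.18 = 0.82

print(x_plain)  # x_1 oscillates in sign as it decays
print(x_pre)    # both components decay smoothly toward the minimizer at 0
```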
3.1.2 Gradient Descent with Momentum
Gradient descent with momentum adds a memory term to smooth updates and dampen oscillations: \[ \mathbf{x}_{i+1} = \mathbf{x}_i - \gamma_i (\nabla f(\mathbf{x}_i))^\top + \alpha \Delta \mathbf{x}_i \] where \(\Delta \mathbf{x}_i = \mathbf{x}_i - \mathbf{x}_{i-1}\) and \(\alpha \in [0,1]\) controls the contribution of past updates.
This technique behaves like a heavy ball rolling downhill, accelerating along consistent directions and reducing oscillations in narrow valleys. Momentum helps especially when gradients are noisy or the optimization landscape is highly curved.
Example 3.5 Gradient Descent with Momentum
Consider the one-dimensional quadratic function \[ f(x) = x^2. \]
The gradient is \[ \nabla f(x) = 2x. \] Momentum introduces a velocity term \(v_k\): \[ \begin{aligned} v_{k+1} &= \beta v_k + \nabla f(x_k), \\ x_{k+1} &= x_k - \gamma v_{k+1}, \end{aligned} \] where:
- \(\gamma > 0\) is the learning rate,
- \(\beta \in [0,1)\) is the momentum parameter.
For example, let: \[ x_0 = 4, \quad v_0 = 0, \quad \gamma = 0.1, \quad \beta = 0.9. \]
Gradient at \(x_0\):
\[ \nabla f(x_0) = 2(4) = 8. \]
Velocity update: \[ v_1 = 0.9(0) + 8 = 8. \]
Position update: \[ x_1 = 4 - 0.1(8) = 3.2. \]
Gradient at \(x_1\):
\[ \nabla f(x_1) = 2(3.2) = 6.4. \]
Velocity update: \[ v_2 = 0.9(8) + 6.4 = 13.6. \]
Position update: \[ x_2 = 3.2 - 0.1(13.6) = 1.84. \]
Momentum accumulates past gradients, helping the iterates move faster in consistent descent directions and reducing oscillations.
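The computation is compact in code; the following sketch reproduces the iterates above (the loop bound is an illustrative choice):

```python
# Gradient descent with momentum on f(x) = x^2, reproducing Example 3.5.
x, v = 4.0, 0.0          # x_0 and initial velocity v_0
gamma, beta = 0.1, 0.9   # learning rate and momentum parameter

for k in range(1, 4):
    g = 2.0 * x          # gradient of f at the current iterate
    v = beta * v + g     # velocity accumulates past gradients
    x = x - gamma * v
    print(f"v_{k} = {v:.2f}, x_{k} = {x:.3f}")
# Output: v_1 = 8.00, x_1 = 3.200; v_2 = 13.60, x_2 = 1.840; ...
```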
3.1.3 Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) approximates the gradient using a random subset of the data. Given a loss function that decomposes as a sum, \[ L(\boldsymbol{\theta}) = \sum_{n=1}^N L_n(\boldsymbol{\theta}), \] batch gradient descent uses every term: \[ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_i - \gamma_i \left( \nabla L(\boldsymbol{\theta}_i) \right)^\top = \boldsymbol{\theta}_i - \gamma_i \sum_{n=1}^N \left( \nabla L_{n}(\boldsymbol{\theta}_i) \right)^\top. \] SGD instead updates the parameters using only a randomly chosen subset (mini-batch) \(\mathcal{B}_i \subseteq \{1, \dots, N\}\) of the terms: \[ \boldsymbol{\theta}_{i+1} = \boldsymbol{\theta}_i - \gamma_i \sum_{n \in \mathcal{B}_i} \left( \nabla L_{n}(\boldsymbol{\theta}_i) \right)^\top. \] This reduces the computational cost per iteration and allows training on large datasets.
- Large mini-batches → lower variance, smoother convergence, but higher cost
- Small mini-batches → noisier updates, faster per-step computation, better ability to escape local minima
Despite noisy gradients, SGD converges (almost surely) to a local minimum under mild conditions, such as suitably decreasing step-sizes. It has become the default optimization method in large-scale machine learning.
Example 3.6 Stochastic Gradient Descent (SGD)
Consider a dataset with three data points and the loss function \[ f(x) = \frac{1}{3}\sum_{i=1}^3 f_i(x), \quad f_i(x) = (x - a_i)^2, \] where: \[ a_1 = 1, \quad a_2 = 3, \quad a_3 = 5. \]
Full Gradient (Batch Gradient Descent) \[ \nabla f(x) = \frac{2}{3}\sum_{i=1}^3 (x - a_i). \]
Stochastic Gradient Descent Update
SGD uses one data point at a time: \[ x_{k+1} = x_k - \gamma \nabla f_{i_k}(x_k), \] where \(i_k\) is chosen uniformly at random from \(\{1,2,3\}\).
For example, let: \[ x_0 = 0, \quad \gamma = 0.1, \] and randomly choose \(i_0 = 2\).
Gradient using only \(f_2\): \[ \nabla f_2(x_0) = 2(x_0 - 3) = -6 \]
Update: \[ x_1 = 0 - 0.1(-6) = 0.6 \]
Here, the update is noisy, but cheap. On average, SGD moves toward the minimum. It is widely used in machine learning for large datasets.
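A minimal Python sketch of this procedure follows (the random seed and iteration count are illustrative choices):

```python
import random

# SGD for f(x) = (1/3) * sum_i (x - a_i)^2, as in Example 3.6.
a = [1.0, 3.0, 5.0]   # data points; the minimizer is their mean, x = 3
x, gamma = 0.0, 0.1
random.seed(0)        # illustrative seed, for reproducibility

for _ in range(100):
    a_i = random.choice(a)            # pick one data point uniformly at random
    x = x - gamma * 2.0 * (x - a_i)   # step along the single-term gradient

print(x)  # hovers near x = 3; with a fixed gamma the noise does not vanish
```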
Exercises
Exercise 3.1 The notes talk about a function \[f\left(\begin{bmatrix}x_1\\x_2 \end{bmatrix}\right) = \dfrac{1}{2}\begin{bmatrix}x_1\\x_2 \end{bmatrix}^\top\begin{bmatrix}2 & 1 \\ 1 & 20 \end{bmatrix} \begin{bmatrix}x_1\\x_2 \end{bmatrix} - \begin{bmatrix}5\\3 \end{bmatrix}^\top\begin{bmatrix}x_1\\x_2 \end{bmatrix}.\] What kind of function is this? Show that the gradient is \[\nabla f\left(\begin{bmatrix}x_1\\x_2 \end{bmatrix}\right) = \begin{bmatrix}x_1\\x_2 \end{bmatrix}^\top\begin{bmatrix}2 & 1 \\ 1 & 20 \end{bmatrix} - \begin{bmatrix}5\\3 \end{bmatrix}^\top.\]
Exercise 3.2 Let \[f(x) = \dfrac{x^2 \cos(x) - x}{10},\] with \(x_0 = 6\) and stepsize 0.2. Use gradient descent to find the local minimum. Stop when successive terms are within 0.05 of each other.
Exercise 3.3 Let \[f(x) = xe^{-x^2} + \dfrac{x^2}{20}.\] Run 10 iterations of gradient descent from the initial position \(x_0 = -3\) and step size 0.1 to try to find the minimum of \(f\).
Exercise 3.4 Let \[f(x,y) = \left(\dfrac{3}{4}x - \dfrac{3}{2} \right)^2 + (y-2)^2 + \dfrac{1}{4} xy.\] Run 10 iterations of gradient descent from the initial position \((5,4)\) with step sizes 1, 0.1 and 0.01 to try to find the minimum of \(f\).
Exercise 3.5 Let \[f(x,y) = e^{-x^2-y^2} + \dfrac{x^2+y^2}{20}.\] Run 10 iterations of gradient descent from the initial position \((1,1)\) with step sizes 1, 0.1 and 0.01 to try to find the minimum of \(f\).
Exercise 3.6 Use gradient descent for 5 iterations and initial guess \(\mathbf{x}_0 = [4,2,-1]\) to find the minimum of \[f(x,y,z) = (x-4)^4 + (y-3)^2 + 4(z-5)^4.\]