Elementary Optimization

Understanding optimization from scratch; extreme points; constrained optimization with the Lagrangian method, and why it works

[Figure: objective function in blue with a constraint region in red; the optimal solution is w*. From the section on L1 and L2 regularization in Bishop's Pattern Recognition and Machine Learning.]

Finding extreme points of a real-valued function of n static variables is the place to start. While many calculus classes cover taking derivatives to find maxima and minima, I'll try to organize a few concepts: in which situations to use tools such as convexity and the Second Derivative Test, and why they fit those situations. Later on, the Lagrangian method is discussed.

We formulate an optimization problem as

$$\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$$
f is called the objective function. A maximization problem can be handled by minimizing the negative of f.

Convexity most closely connects with optimality, both mathematically and intuitively. There are several ways to understand convexity. One is to visualize it as 'concave up', like a bowl or cup. I like to remember concave as being the shape of a cave, and convex as the opposite. Another way is to check whether the second derivative is non-negative over the whole domain of x. The mathematical definition of convexity is that f(x) is convex if and only if:

$$f\big(\lambda \mathbf{x}_1 + (1-\lambda)\mathbf{x}_2\big) \le \lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)$$
where λ is between 0 and 1; concavity is obtained by switching the inequality sign. To internalize this inequality, imagine a 'concave up' graph with x1 < x2. If I draw a line segment from (x1, f(x1)) to (x2, f(x2)), every point on that segment lies on or above the function. Here, the right-hand side of the inequality is a linear combination of the two function values (representing the segment), and the left-hand side is the function itself. (Bolded x means a vector consisting of the components xi.)
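
To make the inequality concrete, here is a small numerical spot-check (my own illustrative snippet; the sample functions and grid are arbitrary choices):

```python
import numpy as np

def looks_convex(f, xs, n_lambdas=11, tol=1e-12):
    """Spot-check f(l*x1 + (1-l)*x2) <= l*f(x1) + (1-l)*f(x2)
    on sampled pairs. Passing is evidence of convexity, not a proof."""
    for x1 in xs:
        for x2 in xs:
            for lam in np.linspace(0.0, 1.0, n_lambdas):
                chord = lam * f(x1) + (1 - lam) * f(x2)   # right-hand side: the segment
                if f(lam * x1 + (1 - lam) * x2) > chord + tol:
                    return False                          # function rose above the chord
    return True

grid = np.linspace(-3.0, 3.0, 13)
print(looks_convex(lambda x: x**2, grid))   # True:  x^2 is convex
print(looks_convex(np.sin, grid))           # False: sin is not convex on [-3, 3]
```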

Suppose we find a point x* where the first derivative (the gradient) is zero. This condition, together with f's convexity, is sufficient to make x* a global minimum. Intuitively, a convex function only slopes upward as you move away from such a point. Moreover, given the mathematical definition of convexity, a few algebraic steps suffice to prove that if x* has first derivative equal to 0 and f is convex (which gives the inequality above), then f(x*) must indeed be less than or equal to f(x) for any other x.
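
For completeness, here is the algebra, as I would reconstruct it (assuming f is differentiable). For any x and λ in (0, 1], convexity gives

$$f\big(\mathbf{x}^* + \lambda(\mathbf{x} - \mathbf{x}^*)\big) = f\big(\lambda \mathbf{x} + (1-\lambda)\mathbf{x}^*\big) \le \lambda f(\mathbf{x}) + (1-\lambda) f(\mathbf{x}^*)$$

Subtracting f(x*) from both sides and dividing by λ:

$$\frac{f\big(\mathbf{x}^* + \lambda(\mathbf{x} - \mathbf{x}^*)\big) - f(\mathbf{x}^*)}{\lambda} \le f(\mathbf{x}) - f(\mathbf{x}^*)$$

Letting λ → 0⁺, the left-hand side tends to the directional derivative ∇f(x*)ᵀ(x − x*), which is 0 because the first derivative vanishes at x*. Hence f(x*) ≤ f(x) for every x.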

Oftentimes, the objective function is not convex. It is then helpful to find the local minima of the function, since the final solution must be among these local minima (and boundary points, if x is bounded, but constrained optimization is discussed in the next section). The Second Derivative Test comes into play when we look for local minima. The intuition is: given candidate points that satisfy first derivative equal to 0, is the function convex in the neighborhood of each of these points? If it is convex around such a point x, then x is a minimum, though only with respect to its neighborhood.
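
As a concrete illustration, here is a small sympy sketch (the example function is my own pick, not from the original text): it finds the critical points and checks whether the Hessian is positive definite at each one, i.e. whether the function is locally convex there.

```python
from sympy import symbols, diff, solve, hessian

x, y = symbols('x y', real=True)
f = x**3 - 3*x + y**2          # non-convex example: one local min, one saddle

grad = [diff(f, v) for v in (x, y)]
critical_points = solve(grad, (x, y), dict=True)

H = hessian(f, (x, y))
for pt in critical_points:
    if H.subs(pt).is_positive_definite:   # locally convex => local minimum
        print(f"{pt}: local minimum, f = {f.subs(pt)}")
    else:
        print(f"{pt}: not a local minimum (saddle or maximum)")
```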

Constrained Optimization with the Lagrangian Method

Oftentimes, we optimize an objective subject to constraints. We formulate such a problem as:

$$\max_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) = b_i, \quad i = 1, \dots, m$$
By convention, problems involving constraints on x are written as maximization rather than minimization. f and g can be any static functions, linear or non-linear.

A function, called the Lagrangian function, was ingeniously invented so that, as long as we maximize it in the same way as before, without separately handling the constraints (they are built into the Lagrangian), we find the optimal solution to the original problem. The Lagrangian function is defined as:

$$L(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i \big(g_i(\mathbf{x}) - b_i\big)$$
The optimal solution to the original problem must satisfy, for all i:

$$\frac{\partial L}{\partial x_i} = 0, \qquad \frac{\partial L}{\partial \lambda_i} = 0 \;\; \big(\text{equivalently, } g_i(\mathbf{x}) = b_i\big)$$
The above are the necessary conditions for an optimal solution.

For a sufficient condition, only a concavity requirement on L is added to the necessary conditions. Solve the above equations, check L's concavity, and you will have the final solution.

In other words, as long as an admissible x maximizes L, it maximizes f as well. It is quick to prove this: assume that L(x*) ≥ L(x) for any admissible x. We need to show that f(x*) ≥ f(x). We have that f(x*) − λ(g(x*) − b) = L(x*) ≥ L(x) = f(x) − λ(g(x) − b). Since every admissible x satisfies g(x) = b, the term λ(g(x) − b) vanishes on both sides, so f(x*) ≥ f(x).

Now, some practical advice on solving an optimization problem with the Lagrangian method. One way is to first solve the necessary conditions, then evaluate f at each of the resulting candidate solutions, and finally pick the candidate with the largest f; the other way is to verify the sufficient condition directly.
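
As a worked example (my own, not from the original post): maximize f(x, y) = xy subject to x + y = 10. A sketch with sympy:

```python
from sympy import symbols, diff, solve

x, y, lam = symbols('x y lambda', real=True)
f = x * y
g, b = x + y, 10                    # constraint: x + y = 10

L = f - lam * (g - b)               # the Lagrangian
stationarity = [diff(L, v) for v in (x, y, lam)]
print(solve(stationarity, (x, y, lam), dict=True))
# [{x: 5, y: 5, lambda: 5}]  ->  f = 25
```

Here λ = 5 also matches the shadow-price interpretation discussed below: relaxing the constraint to x + y = b gives f* = b²/4, and df*/db = b/2 = 5 at b = 10.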

Now consider inequality constraints. We formulate the problem as:

$$\max_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \le b_i, \quad i = 1, \dots, m$$
The Lagrangian function is still the same as in the case of equality constraints.

Below are the necessary conditions, for all i:

$$\frac{\partial L}{\partial x_j} = 0 \quad \text{for all } j$$

$$\lambda_i \big(g_i(\mathbf{x}) - b_i\big) = 0$$

$$\lambda_i \ge 0, \qquad g_i(\mathbf{x}) \le b_i$$
These conditions are also known as the Kuhn-Tucker (or Karush-Kuhn-Tucker, KKT) conditions. While I'll skip the proof of the necessary conditions, I'll mention an illuminating result that comes out of it: λ = df/db at the optimum. This result implies that λ ≥ 0, because with constraints written as g(x) ≤ b, the larger b is, the looser the constraint, and therefore df/db ≥ 0. Knowing this interpretation of λ helps one understand the second line of the Kuhn-Tucker conditions: either the constraint is active, i.e. g(x) = b, or the constraint is inactive, i.e. g(x) < b, and in the latter case increasing b no longer affects the solution, so λ = df/db = 0.
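
A tiny example (my own, purely for illustration) makes this concrete: maximize f(x) = −(x − 3)² subject to x ≤ b, with L = −(x − 3)² − λ(x − b). If b < 3, the constraint is active: x* = b, stationarity gives λ = 2(3 − b) > 0, the optimal value is f* = −(b − 3)², and indeed df*/db = 2(3 − b) = λ. If b ≥ 3, the unconstrained maximum x* = 3 is admissible, the constraint is inactive, f* = 0 no longer changes with b, and λ = df*/db = 0.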

For a sufficient condition, again only concavity of L is added to the above conditions. The proof of sufficiency is very similar: because we have both ∂L/∂x = 0 and concavity, the solution x* maximizes L, so L(x*) ≥ L(x) for any admissible x. We have that f(x*) − λ(g(x*) − b) = L(x*) ≥ L(x) = f(x) − λ(g(x) − b). By the Kuhn-Tucker conditions, λ(g(x*) − b) = 0 at x*; and for any admissible x, λ ≥ 0 and g(x) ≤ b imply λ(g(x) − b) ≤ 0. Therefore f(x*) ≥ f(x) − λ(g(x) − b) ≥ f(x).

Solving an optimization problem with inequality constraints via the Lagrangian function again requires either finding a solution to the sufficient condition, or solving the necessary conditions and picking, among the results, the solution with the largest f. So far, this is the same as in the case of equality constraints. The slightly trickier part is solving the Kuhn-Tucker conditions: for each constraint function g, one needs to consider two cases (the constraint active or inactive), so with m constraint functions one needs to solve 2^m systems of equations. Hopefully, the problems you need to solve by hand have only 2–3 constraints, and in some cases setting a λ to 0 allows quick computation. Thus, while the Lagrangian method is very flexible and great for small-scale problems, running a program that uses it on problems with many constraints is slow. Fortunately, there are other methods for optimization. In particular, when f and g are linear, the Simplex Method is fast and one of the most common algorithms.
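
To illustrate the 2^m case split, here is a toy sketch (entirely my own construction; the objective and constraints are arbitrary examples) that enumerates every active/inactive combination with sympy and keeps the best admissible candidate:

```python
from sympy import symbols, diff, solve

x, y, l1, l2 = symbols('x y l1 l2', real=True)
f = -(x - 2)**2 - (y - 2)**2
constraints = [(x + y, 2), (x, 1)]            # g_i(x) <= b_i
lams = [l1, l2]
L = f - sum(l * (g - b) for l, (g, b) in zip(lams, constraints))

best = None
for mask in range(2 ** len(constraints)):     # one case per active/inactive pattern
    eqs = [diff(L, v) for v in (x, y)]        # stationarity in x and y
    for i, (g, b) in enumerate(constraints):
        eqs.append(g - b if mask >> i & 1 else lams[i])  # active: g = b; inactive: lambda = 0
    for sol in solve(eqs, [x, y] + lams, dict=True):
        feasible = all(float((g - b).subs(sol)) <= 1e-9 for g, b in constraints)
        dual_ok = all(float(l.subs(sol)) >= -1e-9 for l in lams)
        if feasible and dual_ok:
            val = f.subs(sol)
            if best is None or val > best[0]:
                best = (val, sol)

print(best)   # (-2, {x: 1, y: 1, l1: 2, l2: 0}): optimum at x = y = 1
```

For linear problems, dedicated solvers (e.g. scipy.optimize.linprog) avoid this exponential enumeration entirely.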
