
Contents

Preface
   0.1 Three Tales
   0.2 Insight on your Present, Past, and Future
   0.3 Who Should Read this Book?
   0.4 What is a Mathematician?
   0.5 Final Words of Advice

1 Distance, Defined
   1.1 Know thy Enemy
   1.2 Vectors
   1.3 Sets
   1.4 Distance
   1.5 Proof Technique: Proving Universal Statements
   1.6 How Proofs Should Not be Done
   1.7 Squaring and Rooting
   1.8 Verifying that d is Actually a Distance Function
   1.9 Why the Distance Function Detour?

2 All About Angles
   2.1 Angles in R^n
   2.2 Dot Product Properties
   2.3 Cauchy-Schwarz Inequality
   2.4 On Keeping One's Word: Triangle Inequality
   2.5 Some Fun with Cauchy-Schwarz
   2.6 Proof Technique: The 7-10 Split

3 Let's get Linear!
   3.1 Linear Functions
   3.2 Subspaces of R^n
   3.3 How to Verify a Set is a Subspace
   3.4 Spanning Vectors
   3.5 Proof Technique: Proof by Contradiction
   3.6 Proof Technique: If and Only If
   3.7 Linear Independence
   3.8 On Keeping One's Word: Cauchy-Schwarz Equality

4 Under-determined Potential
   4.1 System of Equations
   4.2 Gaussian Elimination: Step One
   4.3 Proof Technique: Proof by Cases
   4.4 Proof Technique: Induction
   4.5 How Induction Should Not be Done
   4.6 Under-determined Systems Lemma
   4.7 Linear Dependence Lemma

5 Keeping it Real
   5.1 Thinking Axiomatically
   5.2 Proof Technique: Uniqueness
   5.3 Abelian Groups
   5.4 Fields
   5.5 Field Axiom
   5.6 Shorthand Notations
   5.7 Ordering Axioms
   5.8 Completeness Axiom
   5.9 Proof Technique: Existence

6 All Your Basis are Belong to Us
   6.1 Building Downwards
   6.2 Proof Technique: Proving Two Sets are Equal
   6.3 The Basis Theorem: Showing a Basis Exists
   6.4 The Basis Theorem, Part II: Finding a Basis
   6.5 Dimension
   6.6 Some Fun with the Basis Theorems
   6.7 Sketchy Shades of Grey: Axiom of Choice

7 Matrix Madness
   7.1 Let's be Honest
   7.2 Working with Sums
   7.3 Proving Two Matrices are Equal
   7.4 Distances on Matrices
   7.5 Importance Behind Matrices: Linear Maps

8 Row Space, Column Space, Null Space, Oh My!
   8.1 Column Space and Null Space
   8.2 Rank-Nullity Theorem
   8.3 Row Space
   8.4 Proving Rank Properties

9 The Sky's the Limit
   9.1 Capturing Closeness
   9.2 Intuition for Limits
   9.3 Proving Statements involving Multiple Quantifiers
   9.4 How to Prove a Sequence Converges to Some Limit
   9.5 Limit Properties: Addition and Scaling
   9.6 Limit Properties: Product
   9.7 Uniqueness of Limits

10 Being Bolzy
   10.1 The Next Big Thing
   10.2 Monotone Convergence Property
   10.3 The Sandwich Theorem
   10.4 Bolzano-Weierstrass Theorem
   10.5 An Alternate Proof of Bolzano-Weierstrass

11 Fishing for Complements
   11.1 The Story so Far...
   11.2 Proving the First Claim
   11.3 Proving the Second Claim
   11.4 A Happy Ending
   11.5 Something Extra for our Troubles: Orthogonal Projection Map
   11.6 Orthogonal Projection Properties

12 A Game of Cat and Gauss
   12.1 A Little Constructive Criticism
   12.2 Gaussian Elimination
   12.3 An Enlightening Example
   12.4 An Easier Theorem
   12.5 Null Space Basis
   12.6 Column Space Basis
   12.7 Inhomogeneous Equations

Midterm I: The Linear Algebra Menace

13 Continuing with Continuity
   13.1 Why Continuity?
   13.2 Limit of a Function
   13.3 Please Read: A Fundamental Difference in Texts
   13.4 Continuous Functions
   13.5 Properties of Continuous Functions
   13.6 Max and Min
   13.7 On a Rolle!
   13.8 Applying Continuous Functions to Limit Solving

14 Keeping an Open Mind
   14.1 All in the Intervals
   14.2 Open Intervals to Open Sets
   14.3 Verifying a Set is Open
   14.4 Closed Intervals to Closed Sets
   14.5 Verifying a Set is Closed
   14.6 Open and Closed Sets on R
   14.7 Open vs. Closed
   14.8 Open and Closed Set Properties

15 Continuing from R to R^n
   15.1 Plans of Ascension
   15.2 How Bolzano-Weierstrass Should Not be Proven
   15.3 Bringing Bolzy Back
   15.4 Continuous Functions and Limits in R^n

16 Dishing out Derivatives
   16.1 Motivation on a Multivariable Derivative
   16.2 Directional Derivatives
   16.3 Differentiability

17 Sum-body that I Used to Know
   17.1 A Second Chance
   17.2 The Most Basic Test: N-th Term Test
   17.3 Staying Non-negative
   17.4 Absolute Convergence
   17.5 Rearrangements

18 From Differentiable to Directional
   18.1 The Story Thus Far...
   18.2 Differentiability in Action
   18.3 Gradients

19 From Directional to Differentiable
   19.1 A Clever Converse
   19.2 Applying the 1D Mean Value Theorem
   19.3 The Proof

20 Ironclad Chain Rule
   20.1 Differentiating a Composition
   20.2 The Proof

21 Mighty Morphin Power Series
   21.1 From Polynomial to Power
   21.2 Change of Base-Point, Finite Case
   21.3 Change of Base-Point, Power Series

22 A Mixed Bag of Partials
   22.1 Out of Order
   22.2 The Proof

23 Second to None
   23.1 The Three Kings
   23.2 On Multivariable Extensions
   23.3 A Comedy of Errors
   23.4 Quadratic Forms
   23.5 Second Derivative Test

24 Chasing Curves
   24.1 What are Curves?
   24.2 Arc-Length

25 Taylor Swift Series
   25.1 Differentiating a Power Series
   25.2 Taylor Series

26 Mastering Manifolds
   26.1 Another look at Curves
   26.2 What is a Manifold?
   26.3 Graphs
   26.4 Tangent Spaces

27 Living La Vida Lagrangian
   27.1 The Engineering Perspective
   27.2 The Mathematics of Lagrange Multipliers

Midterm II: Conquering Calculus

28 Playing with Permutations
   28.1 Insight on Invariance
   28.2 Permutations
   28.3 The Trouble with Transposition

29 Determining Determinant
   29.1 The Magic of Multilinearity
   29.2 Uniqueness of D
   29.3 Computing Determinants

30 Flirting with Inverting
   30.1 A Revision of Algebra II
   30.2 Left and Right Inverse
   30.3 Cofactor Expansions
   30.4 Constructing the Left Inverse
   30.5 Proving det(A) = det(A^T)
   30.6 Constructing the Right Inverse

31 Gram-Schmidt Style
   31.1 Extra Structure
   31.2 The Best Basis
   31.3 Gram-Schmidt Process

32 Spooky Spectral Theorem
   32.1 Rewriting Matrices
   32.2 Eigenvectors
   32.3 Seeking Sufficiency
   32.4 Spectral Theorem

33 Keeping up with Contractions
   33.1 It's Inception All Over Again
   33.2 Non-Triviality of Non-Emptiness
   33.3 Contraction Mapping Theorem

34 Intimidating Inverse Function Theorem
   34.1 Introspection on Inverse
   34.2 Inverse Basics
   34.3 An Overall Schematic
   34.4 A Much Needed Simplification
   34.5 The Intricate Inverse Function Theorem

35 Implying Implicit Function Theorem
   35.1 On Keeping One's Word
   35.2 Intuition on Implicit Function Theorem
   35.3 Formalization
   35.4 The Proof

36 Proving FTA: An Analytic Way
   36.1 Journey to Another Plane: Preparations
   36.2 No Harm in Harmonics
   36.3 Getting Complex in Here
   36.4 The Proof

37 An Animal Farm of Infinities
   37.1 A Little More Complicated than it Looks
   37.2 Being a Bit Bijective
   37.3 Counting on Countability
   37.4 Realistically Uncountable
   37.5 Last Words on the Continuum

The Final: Acing Analysis

The After Math
Preface
I pass, like night, from land to land;
I have strange power of speech;
That moment that his face I see,
I know the man that must hear me:
To him, my tale, I teach.
-Rime of the Ancient Mariner
0.1 Three Tales
I've never been that great of a writer. A writer can create originality: original mathematics. I am
more of a storyteller: when I love a proof, I can rederive it with my own original spin. My goal today
is to be a storyteller and to tell three tales.
The first tale takes place on a warm October night. The crowd was shuffling into Cubberley
Auditorium, eagerly awaiting Brian Conrad's public lecture, "Rubik, Escher, Bank." For a mathematics
lecture, this was a pretty diverse crowd, filled with students and professors alike. In walked a
freshman, who took a seat next to me. Recognizing me from his Honors Multivariable Course, he inquired,
with a grin,
I wonder how many of us are here from Math 51H?
This question was rhetorical. Even in the midst of midterm week, it was evident that the front row
was teeming with frosh. Being the oddball that I am, I gave a cryptic reply:
Many. And not just your year.
Indeed, the young faces in the audience spanned at least a decade of upperclassmen, doctorates, and
graduates who all had one thing in common: they had all been through the introductory Math Honors
series. In this freshman's eyes, the presence of his peers was a testament to the strength and passion
of his class. But in my eyes, the presence of so many H-series veterans was a testament to
Leon Simon's strength as a teacher. I have taught for 7 years, across 3 states and 2 continents,
and I have never met a more interesting and enthusiastic teacher than Leon Simon. As a teacher, I
know there is no greater honor than to see one's students be inspired to continue the cause. And in
the small window I have been at Stanford, I have seen some of the brightest and most mathematically
talented emerge from his doors.
But not every student can survive the rigors of the H-series. Not every Dante gets out of the Inferno.
And that leads me to my second story.
The second tale is from my youth. Truth be told, I used to be a pretty bad kid. And then a
professor emeritus took me under his wing. Four years of teaching and he didn't ask for a dime. He
was the greatest man I have ever met and more of a father than I have ever known. All I ever really
wanted was to be just like him: a mathematician. I guess that's a pretty lousy reason. And I've
realized that I was a pretty naive kid: math is more than just Calc BC and the pages of Stewart and
Strang.
During my first week at Stanford, my undergraduate advisor asked me one key question to test my
math abilities:
What is the definition of continuity?
I gave him the typical American high school answer, epsilon and delta free. And he scolded me. He
was right to do so. But before I walked out, he did impart some words of wisdom:
Baptisma Pyros. Be prepared for a Baptism by Fire.
And it was. My undergraduate career was a Baptism by Fire.
The final tale is a retelling of one of the greatest stories ever told, Math 51H. It will mimic the
released notes exactly, except I have added additional commentary to build intuition, outlined proofs
and techniques, and given numerous examples and analogies.
But before I begin the tale, I should explain a few things.
0.2 Insight on your Present, Past, and Future
More than a quarter of the 51H class will have solid proof backgrounds and more than half will
have made it to AIME. At least ten kids are going to be from highly competitive math camps like
PROMYS and SUMaC. Others will hail from countries like Romania, Bulgaria, and Singapore, which
have relentless math programs. These students will know all sorts of obscure theorems like which odd
primes can be written as a sum of two squares. They will swim through the class like water and be
the ones raising their hands. But every student in the class will be from the top of his or her
school and will want to be a mathematician.
Here is a glimpse of your future:
You will study hard for the first test. It's probably your first Stanford midterm. It's also the first
exam you have ever taken at night. You feel confident afterwards because you have aced every math
exam ever thrown at you. Then you get the results:
WHAT THE F**K?! You are shocked at the histogram spread. You are even more shocked that you
are lower than the 50th percentile. Pretty pissed at yourself, you study even harder for the second
midterm.
The students who completely bombed the first midterm dropped out. Even though you studied more,
you did even worse.
On the final, you score less than a 30 and walk out of the class with a B. You don't admit it to
anyone, and if anyone asks, you got an A. You spend your winter break studying for 52H, and when
you return, you realize all the B and lower students dropped out. Fifty students become twenty. The
cycle begins again, except this time you are dead last.
Is it Leon Simon's fault that the class is going way too fast? No. But how can you master proof
techniques like induction if you are flung head-first into upper-level applications? The truth is, Leon
Simon cannot cater to everyone. There are so many different math backgrounds spanning the entire
globe, and ultimately it is your choice to be in the course. And it is your misfortune that you
went to a normal American high school. Because I have a confession. On behalf of all high school
math teachers, I want you to know:
I am sorry. Math is more than just SAT and calculation. It is an art whose beauty
and creativity have been omitted from the standard curricula. If you come from
the typical American high school, you are at an absurd disadvantage.
I can write a whole book about how bad the mathematics in America is, from elementary rote
memorization to mindless exercises.[1] Paul Lockhart already beat me to it in his A Mathematician's
Lament (I am sure Professor Devlin would be willing to lend you a copy). There is nothing I can
do to reform your education. But at the very least, I am going to give you a fighting chance at the
H-series.
0.3 Who Should Read this Book?
Do not read this book if any of the following are true:
If you want to be an engineer. The H-series will actually hurt you. You must practice
calculation. For any linear dynamical system, Fourier transform, infinite summation, or
overdetermined system of equations, you will be asked to forsake any theoretic considerations
(like convergence and existence) and just mindlessly calculate. You would be surprised how
many theoretical math majors cannot calculate curls or solve simple ODEs. Heck, I remember
Leon Simon once asking the class to use integration by parts, and everyone just stared in
awkward silence. If you want to understand what's underneath the box, then take Math 115 and
Math 109: they count for the Engineering Mathematics requirement as well as an easy minor.
If you are already well-versed in math proofs. You have to get used to reading concise[2]
proofs and figuring out the why on your own. Especially when you hit the yellow and blue
graduate texts (these books are beasts)!
If you are used to taking the highest available math course and you just want a good
grade. First, you are not going to get an A with that attitude. And second, from someone who
works in the real world, no one cares about the specific classes you take.
If you want to make the climb without a rope. Like Bruce Wayne in Dark Knight Rises,
fear can be a powerful motivating factor. If this is the case, I wish you the best of luck.
Now, if you satisfy the following necessary conditions
You feel, deep in your core, that you want to be a mathematician.
You want to think about this stuff way after graduation.
You think proofs are cool, and that mathematics is a beautiful art.
You do not have the same math background as everyone else and you always feel
like you're at the bottom of the class.
Then, at this point, I recommend heading to Coursera and watching Keith Devlin's
Introduction to Mathematical Thinking
as soon as possible! In fact, if you are really eager to become a mathematician, chances are you
already watched this series before coming to Stanford. Nevertheless, it will get your mind rolling in
the right direction.

[1] Be prepared to learn the difference between exercises and problems.
[2] To quote Professor Simon, "I really need to trim the fat from this book."
Then, instead of the H-series, you should take the following courses:
Math 115, Real Analysis: An excellent and easy introduction to the inner workings of
Calculus.
Math 109, Group Theory: Here, you will get tremendous practice in deriving theorems from
completely abstract properties.
Phil 151, Introduction to Logic: A first course on purely deductive reasoning in a Logic
System, an axiomatic view of mathematical reasoning.
CS 103, Discrete Mathematics: Induction, Pigeon-hole, and lotsa fun stuff.
Math 110, Number Theory: Cool properties of primes, modular arithmetic, and RSA.
The first 2 weeks of fundamentals taught in Math 51H will be spread out across 10 weeks in each of
the aforementioned courses.
The final necessary condition to read this book is
You are an incredibly stubborn bastard.
Chances are, you are just as stubborn as I was. And you refuse to give up. You will spend your days
studying your butt off, stuck in a library, and your grades won't reflect the effort you put in. How
can 10 weeks of work possibly be reflected in 6 hours of testing? Then this is absolutely the
book for you. And if this book helps you in any way, then the time I spent writing it (not to mention
the risk of getting sued) was worth it.
0.4 What is a Mathematician?
In the whole discussion above, I never really defined the term mathematician. How can you decide
to be a mathematician if you don't even know what one is? In general, there are a few questions for
which math people have automated responses. When discussing irrationality, they cite the drowning
of Hippasus by the Pythagorean cult. On the topic of constructing the reals from the integers, they
always cite Kronecker. To define a mathematician, they quote Hardy:
A mathematician, like a painter or poet, is a maker of patterns. If his patterns are more permanent
than theirs, it is because they are made with ideas.
-G.H. Hardy
This is a great answer. However, this is not mine. My answer took me a very long time to find.
However, I am not going to tell you, but simply point you where to look. Study the counterexample
to an infinitary extension of "component convergence implies point convergence" of R to R^∞. In the
words of the great mathematician Leon Simon, who I will often quote in this book,
Mull it over.
0.5 Final Words of Advice
What you should not be doing:
Skipping the proofs and understanding only calculation. Every problem on your exam
(except 1 definition and 1 calculation) will be a proof.
Highlighting every word in a proof to memorize an argument exactly. If you are
memorizing rather than understanding, I guarantee you will bomb the course.
Not attending lecture. This is where you gain intuition on proofs and problem solving
techniques. Not to mention, Professor Simon is highly entertaining.
Wikipedia-ing and googling solutions. This is like asking someone to lift your own weights!
Don't do it. The problem set is the only time you get to practice problem solving, and honestly,
you should be spending the whole week mulling over the p-set. In the worst case scenario, ask
Professor Simon or the TA for a hint. By the way, anyone who tells you that he finished the
p-set in less than an hour is either a liar or has seen the material before.
Hating yourself if you are completely lost in lecture. You are seeing it for the first time:
of course you will be lost on the spot! To quote Leon Simon,
The ε-N definition took 100 years to develop,
yet you are expected to know it in less than 20 minutes!
If it makes you feel any better, even the great Hilbert, in his youth,
...was not particularly quick at comprehending new ideas. He seemed never really able to understand
anything until he had worked it through his own mind.
-Constance Reid, Hilbert
What you should be doing:
Rewriting and rederiving proofs without looking. Do this even if you think you know a
proof because,
The human capacity for self delusion is limitless
-Leon Simon
In fact, for each of the key theorems, I've written a proof summary. Use this first to get an
overall picture of the proof and then try to fill in the blanks.
Going to office hours. Ask questions, even if it makes you look stupid. To quote one of the
math department's top graduate students,
In undergrad, I pestered a professor with questions after every lecture, and initially I felt
he wanted to escape me; later he said he was very happy to receive questions because it
showed someone was listening to and understanding his lecture. He ended up teaching me
tons of hard maths beyond his course, and wrote me an excellent recommendation letter,
both of which really helped me get into grad school. So you might get something amazing
out of talking to your professors; you wouldn't know if you didn't try.
-Amy Pang
Improving your proofs even if you already understand them: it is like improving a
Pina Colada by adding amaretto. This is especially true when homework solutions are released.
Even if you get the proofs correct, read over the solutions!
Being in a state of Sitz-Fleisch. This is the first thing Leon Simon will tell you (and one of
my favorite expressions). You have to be in a state of sitting-flesh, constantly thinking about
a problem and expanding your mind.
Now, without further ado, my re-telling of Math 51H. I hope you find it more useful than the Half-
Blood Prince's annotated copy of Advanced Potion-Making.
Lecture 1
Distance, Defined
The journey of a thousand miles begins with some sort of metric.
-Confucius
Goals: The first three weeks of 51H are dedicated to Linear Algebra: this is because the
objects we will be working with are vectors. Today, we define what a vector is, as well
as the distance between two vectors. We also introduce sets and the method for proving
universal statements.
1.1 Know thy Enemy
Before we begin our ten-week journey, we first need to know what we're studying. Particularly, we
need to ask ourselves,
What is Multivariable Calculus?
This is a two-part question: first, what is Calculus?
In high school, you learned that
Calculus is the study of change.
However, I feel a more apt description is that
Calculus is the study of closeness.
This is because in Calculus, you study sequences as the term number gets closer to infinity
\[
\lim_{n \to \infty} a_n,
\]
difference quotients over interval lengths that get closer to 0
\[
\lim_{h \to 0} \frac{f(x + h) - f(x)}{h},
\]
and area approximations that get closer to the true value:

[Figure: area approximations by rectangles under a curve, getting finer and finer.]
But to study the nature of closeness, we need a notion of distance between objects. And the type
of objects we will be dealing with is what makes this Multivariable Calculus. Namely, we will be
working with vectors.
1.2 Vectors
In your high school career, you worked with single numbers x, pairs (x, y), and triplets (x, y, z).
However, we do not need to limit ourselves to just 1, 2, or 3 components. In this course, we will
generalize to n components:
\[
(x_1, x_2, \ldots, x_n).
\]
Most of the time, we will prop these n-tuples as columns:
Definition. An n-dimensional (column[1]) vector is an n-tuple
\[
\vec{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}
\]
where v_1, v_2, \ldots, v_n are real numbers.
By convention, variables with an overhead arrow will denote vectors. Moreover, given a vector x, we
will denote its i-th component by x_i. Thus,
x_1, x_3, x_7
will represent the 1st, 3rd, and 7th components of x, respectively.

[1] We will always assume a vector is a column vector unless stated otherwise.
Algebraically, we can add and subtract vectors of the same size
\[
\vec{v} + \vec{w} =
\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} +
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix} =
\begin{pmatrix} v_1 + w_1 \\ v_2 + w_2 \\ \vdots \\ v_n + w_n \end{pmatrix}
\]
and scale vectors by a constant
\[
c\vec{v} = c \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} =
\begin{pmatrix} cv_1 \\ cv_2 \\ \vdots \\ cv_n \end{pmatrix}.
\]
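If you happen to program, the componentwise definitions translate almost verbatim into code. Here is a tiny Python sketch (mine, not part of the course; the helper names are my own) of the two operations:

```python
# A small sketch of vector addition and scalar multiplication on plain
# Python lists, just to make the componentwise definitions concrete.

def add(v, w):
    """Componentwise sum of two vectors of the same size."""
    assert len(v) == len(w), "vectors must have the same number of components"
    return [vi + wi for vi, wi in zip(v, w)]

def scale(c, v):
    """Multiply every component of v by the scalar c."""
    return [c * vi for vi in v]

v = [1, 2, 3]
w = [4, 5, 6]
print(add(v, w))      # [5, 7, 9]
print(scale(2, v))    # [2, 4, 6]
```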
For the cases n = 1, 2, 3, you can visualize vectors geometrically as points in space or directed arrows
from the origin:
[Figure: the vector (1, 2) drawn in the plane, both as a point and as an arrow from the origin, and the vector (3, 3, 5) drawn in space above the point (3, 3, 0).]
But how do you visualize the case n = 4 or higher?
I have heard some physics mumbo jumbo of visualizing four or higher spatial dimensions. I really
don't get it. If you can see in 4D, then you are too gifted to be taking undergraduate math.
Regardless, there are many ways you can interpret n-dimensional vectors.
For example,
\[
\vec{x} = \begin{pmatrix} 2 \\ 3 \\ 4 \\ 5 \end{pmatrix}
\]
can represent a particle at (2, 3, 4) at time 5:

[Figure: a particle plotted at the point (2, 3, 4) in space, at time t = 5.]
But we can also interpret the same vector as a colored point on the number line:
\[
\vec{x} = \begin{pmatrix} 2 \\ 3 \\ 4 \\ 5 \end{pmatrix}
\begin{matrix} \leftarrow \text{Position} \\ \leftarrow \text{Red Quantity} \\ \leftarrow \text{Blue Quantity} \\ \leftarrow \text{Green Quantity} \end{matrix}
\]
Therefore, treat vectors as purely algebraic objects and leave the interpretation to the context.
1.3 Sets
Often times, we would like to talk about some collection of vectors. In order to do this, we need to
discuss one of the most fundamental[1] objects of mathematics: sets.
A set is simply a collection of distinct objects. The most common sets that you will see in your
undergraduate career are:

Symbol   Definition
Z        The set of all integers.
Q        The set of all rationals.
R        The set of all real numbers.
C        The set of all complex numbers.
R^n      The set of all n-dimensional vectors.

We can also explicitly build sets using curly brackets:
\[
A = \{ \ldots \}
\]
[1] In Math 161: Set Theory, you will learn that we can encode mathematics using sets.
For example,
\[
A = \{\text{Red}, \text{Green}, \text{Blue}\} \qquad B = \{\vec{x}_1, \vec{x}_2, \vec{x}_3, \vec{x}_4, \vec{x}_5\}
\]
tells us that A is the set containing the colors Red, Green, and Blue and that B is the set containing
the vectors x_1, x_2, x_3, x_4, x_5.
Of course, we can have more complicated sets. Particularly, we can form sets by picking out elements
from a bigger set. In this case, we will use the notation
\[
A = \{\, x \in B \mid P(x) \,\}
\]
which is read as
"A is the set of all elements x in B such that x satisfies property P."
For example,
\[
S = \{\, x \in \mathbb{Z} \mid x = y^2 \text{ for some } y \in \mathbb{Z} \,\}
\]
is the set of all perfect squares whereas
\[
T = \{\, \vec{x} \in \mathbb{R}^5 \mid x_1 = 0 \,\}
\]
is the set of all 5-dimensional vectors with first component 0.
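As an aside for readers who program: Python's set comprehensions mimic this "pick out the x in B satisfying P" notation almost symbol for symbol. A small sketch of my own (restricted to finite sample sets, since a computer cannot hold all of Z or R^5):

```python
# The set-builder notation {x in B | P(x)} reads almost the same in Python.

B = range(0, 50)                                    # a finite "bigger set" to pick from
S_sample = {x for x in B if any(x == y * y for y in range(0, 8))}
print(sorted(S_sample))                             # [0, 1, 4, 9, 16, 25, 36, 49]

# A few vectors (as tuples) in R^5 whose first component is 0:
T_sample = {(0, a, b, 0, 0) for a in (-1, 0, 1) for b in (-1, 0, 1)}
print(len(T_sample))                                # 9
```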
1.4 Distance
Now that we have the objects of Multivariable Calculus, we would like to measure the distance
between them.
But what is the distance between two vectors?
In high school, you derived the distance formulas for 2D and 3D by repeated application of the
Pythagorean Theorem:
[Figure: in 2D, the segment from (x_1, x_2) to (y_1, y_2) is the hypotenuse of a right triangle with legs y_1 - x_1 and y_2 - x_2, so its length is \sqrt{(y_1 - x_1)^2 + (y_2 - x_2)^2}. In 3D, applying the Pythagorean Theorem again to the legs \sqrt{(y_1 - x_1)^2 + (y_2 - x_2)^2} and y_3 - x_3 gives the distance \sqrt{(y_1 - x_1)^2 + (y_2 - x_2)^2 + (y_3 - x_3)^2} between (x_1, x_2, x_3) and (y_1, y_2, y_3).]
From these two cases, you could hypothesize that
Distance can be calculated by plotting the two vectors spatially and
repeatedly applying the Pythagorean Theorem.
But this is incorrect! A vector is an algebraic construct, while the Pythagorean Theorem is a geometric
statement about lines in the plane. We can't even visualize higher dimensional vectors!
Instead, we must define the distance formula for vectors. But where do we begin?
We can't just spout out any formula. Our formula needs to have meaning. Namely, it must capture
the real world intuition of distance.
Generally,
Math Mantra: Study a physical phenomenon and decide on its key properties.
Then, incorporate these properties in an abstract definition.
We decide that distance should be a function that inputs two vectors and outputs a real number.
Moreover, we decide that a distance function should have the following properties:
Definition. We call a function d : R^n × R^n → R a distance function on R^n if the following
properties hold:

1. (Non-negativity) Distance is always non-negative: for all x, y ∈ R^n,
\[
d(\vec{x}, \vec{y}) \ge 0.
\]
2. (Symmetry) The distance from x to y should be the same as the distance from y to x:
\[
d(\vec{x}, \vec{y}) = d(\vec{y}, \vec{x})
\]
for all x, y ∈ R^n.

3. (Zero) If the distance between two objects is 0, then they should be the same object: for all
x, y ∈ R^n, if
\[
d(\vec{x}, \vec{y}) = 0
\]
then
\[
\vec{x} = \vec{y}.
\]
4. (Triangle Inequality) The direct path between two vectors should be shorter than or the same
length as any detour:
\[
d(\vec{x}, \vec{z}) \le d(\vec{x}, \vec{y}) + d(\vec{y}, \vec{z})
\]
for all x, y, z ∈ R^n.
This is an ideal definition for a distance function; however, we need to be careful:
Math Mantra: A definition does not guarantee existence!
How do we know there is actually a function d that satisfies all four properties? For example, I can
make the definition,
Let n be a number that is both even and odd.
But no such n exists! Likewise, I can say
Let x be a good M. Night Shyamalan movie.
and as we all know, there is no such x.
To find a suitable candidate for a distance function d, we look to the 2D and 3D distance formulas
for inspiration:
\[
\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2} \qquad \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}.
\]
Naturally, since we are working with n-dimensional vectors, we guess that
\[
d(\vec{x}, \vec{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}
\]
is a distance function. But in order to verify this, you'll need a basic proof technique:
1.5 Proof Technique: Proving Universal Statements
Suppose you want to prove a fact about every element in a set.
For example, you may want to prove
Every natural number can be written as a sum of four squares.
Or, as a less abstract example,
Every Californian bar is forbidden to sell Everclear-190.
The bone-headed thing to do (if it is even possible) would be to go through each particular case. In
our second example, this means bothering every bar owner in California.
The more logical thing to do is to take an arbitrary element in set S and call it x:
\[
x \in S.
\]
Then, using only the fact that x ∈ S, show that x has the property you are trying to prove. Since
we can reapply this argument for any particular element in S, we can conclude all elements in S
must have this property.
So in our second example, consider an arbitrary bar in California and call it x. Then we use the
following chain of reasoning:
By definition of being a Californian bar, x has to have a Californian liquor license.
A Californian liquor license only permits the sale of alcoholic beverages with alcoholic volume
of at most 75.5%.
Everclear-190 has volume 95%.
Thus, x cannot sell Everclear-190.
But we only used the fact that x is a Californian bar and not its particular identity. Thus, we can
conclude that any bar in California we choose is forbidden to sell Everclear-190. In other words, every
Californian bar is forbidden to sell Everclear-190.
Here are a few mathematical examples:
Example. Every odd number can be written as a difference of two squares.

Proof: Let x be an arbitrary odd number. By definition,
\[
x = 2n + 1
\]
for some integer n ∈ Z. Now consider the consecutive perfect squares, (n + 1)^2 and n^2. Then,
\[
(n + 1)^2 - n^2 = 2n + 1 = x.
\]
Thus, x can be written as a difference of two squares. Since x was an arbitrary odd number and we
only used the fact that x was odd, we conclude that all odd numbers can be written as a difference
of squares.

Notice, in this proof, that the choice of n is a function of the arbitrary choice x. If we picked x = 5,
then we would know n = 2. Or, if we picked x = 301, then n = 150. The point is, we can construct
n for any choice x.
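If you want to see this construction in action, here is a quick numerical check (my own sketch, not part of the notes); it just recovers n from x and verifies the identity:

```python
# For an odd number x = 2n + 1, the proof says x = (n + 1)^2 - n^2.
# This loop confirms the construction for a few odd numbers.

for x in [5, 301, 9999]:
    assert x % 2 == 1
    n = (x - 1) // 2                   # recover n from x = 2n + 1
    assert (n + 1) ** 2 - n ** 2 == x
    print(f"{x} = {n + 1}^2 - {n}^2")
```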
Example. The sum of two rational numbers is always rational.

Proof: Let a, b be two rational numbers. By definition of being rational, there exist integers
p_1, p_2, q_1, q_2 such that
\[
a = \frac{p_1}{q_1} \qquad b = \frac{p_2}{q_2}.
\]
Then,
\[
a + b = \frac{p_1}{q_1} + \frac{p_2}{q_2} = \frac{p_1 q_2 + p_2 q_1}{q_1 q_2}.
\]
Since p_1 q_2 + p_2 q_1 and q_1 q_2 are still integers, a + b is rational. Moreover, the choice of a, b was arbitrary;
thus, the sum of any two rational numbers is rational.
In the preceding proof, you need to avoid a noob mistake:
Math Mantra: DON'T MAKE A DUMMY MISTAKE WITH DUMMY VARIABLES!
Suppose you wrote:
Since a, b are rational,
\[
a = \frac{p}{q} \qquad b = \frac{p}{q}
\]
for some integers p, q.

This is completely wrong: this asserts that a = b.

True, there is no problem in changing the dummy variable. For example,
\[
\sum_{i=1}^{N} i \quad \text{is the same as} \quad \sum_{j=1}^{N} j.
\]
However, if you had used the same dummy variables in the preceding theorem, you would have
completely changed the intent of the expression. Generally,
Math Mantra: The purpose of notation is to precisely capture our mathematical
reasoning. It is the mathematics that create the notation, NOT the other way
around.
For our final example, recall that a function f is even if
\[
f(-x) = f(x)
\]
for every x. Likewise, a function f is odd if
\[
f(-x) = -f(x)
\]
for every x. While every integer is either even or odd, some functions are neither even nor odd. But
it turns out that every function has a neat decomposition:

Example. Any function can be written as the sum of an even function and an odd function.

Proof: Consider an arbitrary function f. Using f, we can create functions
\[
g(x) = \frac{f(x) + f(-x)}{2} \qquad h(x) = \frac{f(x) - f(-x)}{2}.
\]
Notice that g is even since
\[
g(-x) = \frac{f(-x) + f(-(-x))}{2} = \frac{f(x) + f(-x)}{2} = g(x).
\]
Moreover, h is odd:
\[
h(-x) = \frac{f(-x) - f(-(-x))}{2} = \frac{f(-x) - f(x)}{2} = -\,\frac{f(x) - f(-x)}{2} = -h(x).
\]
Summing g and h, we get
\[
g(x) + h(x) = \frac{f(x) + f(-x)}{2} + \frac{f(x) - f(-x)}{2} = \frac{2f(x)}{2} = f(x).
\]
Hence, f can be written as the sum of an even function and an odd function. Since f was arbitrary,
any function can be written as the sum of an even function and an odd function.
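A quick numerical sanity check of the decomposition (my own sketch; with f(x) = e^x, the even and odd parts g and h turn out to be cosh and sinh):

```python
import math

# Split f into its even part g and odd part h, exactly as in the proof.
def f(x):
    return math.exp(x)

def g(x):
    return (f(x) + f(-x)) / 2    # even part (here it equals cosh)

def h(x):
    return (f(x) - f(-x)) / 2    # odd part  (here it equals sinh)

for x in [0.0, 0.5, -1.3, 2.0]:
    assert abs(g(x) - g(-x)) < 1e-12            # g is even
    assert abs(h(x) + h(-x)) < 1e-12            # h is odd
    assert abs(g(x) + h(x) - f(x)) < 1e-12      # g + h recovers f
print("decomposition checks out")
```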
Here, you should be asking,
"Where in the world did g and h come from?"
The truth is, we had to find these explicit functions. And despite appearances, these functions didn't
pop out of thin air: we needed to do lots of scratch work.[1]

Math Mantra: Most mathematics comes from being locked in a room and wasting
tons of paper.

And when you find a beautiful idea, you keep this single sheet and burn all evidence of failure. This
is what makes mathematics an art.

[1] Although guessing an explicit formula is a tough process, experience will guide your path. After some exposure to
Complex Analysis and Euler's formula, thinking up g and h will become second nature.

1.6 How Proofs Should Not be Done

As an application reader for a prestigious mathematics program, I see a lot of top students apply.
All of them with 4.0 GPA, 10+ APs, perfect 800 SAT Math scores, great letters of recommendation,
and long-winded essays. And to be honest, more than half of them submit atrocious problem sets.

Consider the problem:
Prove that n^2 + n + 1 is never a multiple of 5.
More than half of the students submit an answer like this:

[A typical submitted answer, which just plugs in a few small values of n, is omitted here.]
This is not a proof! It is not enough to plug in a few numbers!

For example, suppose you wanted to prove
\[
p(x) = x^2 + x + 41
\]
spits out a prime number for any integer x. The completely bone-headed thing to do would be to
plug in the integers 0, 1, 2, 3, and then conclude it is true. In fact, if you keep on plugging in integers
from 1 to 39, you will keep getting a prime. But just plug in 40, and you will see
\[
p(40) = 1681 = 41^2
\]
isn't prime![1]

[1] If you want more curios like this, Google search the Law of Small Numbers by Richard Guy. FYI, there does
not exist a non-constant polynomial with integer coefficients that always outputs a prime number.
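If you are skeptical, a few lines of Python (my own sketch) will hunt down the first counterexample for you:

```python
# Search for the first integer x >= 0 where x^2 + x + 41 is composite.

def is_prime(m):
    return m > 1 and all(m % d for d in range(2, int(m ** 0.5) + 1))

x = next(x for x in range(100) if not is_prime(x * x + x + 41))
print(x, x * x + x + 41)   # 40 1681  (and 1681 = 41 * 41)
```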
Let's be honest: you can tell this is not a valid proof, but many of you would have submitted an
answer like this. Does this mean you are stupid? No. This is what we taught you. But now I am
teaching you something different:

Math Mantra: Proof does not mean support the argument!

A proof is an undeniable, irrefutable, absolutism. It is the final word on all arguments, the Judge
Dredd of the sciences. It is my hope that this book not only helps get you through the H-series, but
also helps correct High School misconceptions.
Before we return to the main theorem, we need one more digression on a very important fact.
1.7 Squaring and Rooting
In this course, we will be working with inequalities. A lot. Unfortunately, there are few high school
exercises that work with inequalities abstractly. In particular, you may have missed two key facts
involving non-negative numbers:
Square roots preserve inequalities.
and
Squaring both sides preserves inequalities.
Formally, this means that \sqrt{x} and x^2 are increasing functions over the non-negative numbers: if
\[
0 \le a \le b
\]
then
\[
\sqrt{a} \le \sqrt{b}
\]
and
\[
a^2 \le b^2.
\]
For example,
\[
3 \le 9
\]
thus
\[
\sqrt{3} \le \sqrt{9} \qquad 3^2 \le 9^2.
\]
Note that we need the non-negativity requirement since, for negative numbers,

1. \sqrt{x} is undefined

2. Squaring would not preserve inequalities. For example,
\[
-5 \le 3
\]
but
\[
25 \ge 9.
\]
You can check that x^2 and \sqrt{x} are increasing functions using Calculus, but if you don't believe me,
just stare at their graphs:

[Figure: the graphs of y = x^2 and y = \sqrt{x}, both increasing for x \ge 0.]
Now, back to our regularly scheduled program.
1.8 Verifying that d is Actually a Distance Function
Theorem. The function d : R^n × R^n → R defined by
\[
d(\vec{x}, \vec{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}
\]
satisfies the properties of a distance function.
Proof: We simply check the definition:

Non-negativity:
Let x and y be two arbitrary vectors in R^n. Since a sum of squares is always non-negative,
\[
(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2 \ge 0.
\]
Moreover, square roots preserve inequalities; thus, we can root both sides to get
\[
\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2} \ge 0,
\]
which is the same as
\[
d(\vec{x}, \vec{y}) \ge 0.
\]

Symmetry:
Simply use the fact that x^2 = (-x)^2: for arbitrary x, y,
\[
d(\vec{x}, \vec{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}
= \sqrt{(y_1 - x_1)^2 + (y_2 - x_2)^2 + \ldots + (y_n - x_n)^2}
= d(\vec{y}, \vec{x}).
\]

Zero:
Assume
\[
\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2} = 0.
\]
Squaring, we get
\[
(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2 = 0.
\]
But a sum of non-negative terms is 0 when, and only when, each term is 0:
\[
(x_1 - y_1)^2 = 0, \quad (x_2 - y_2)^2 = 0, \quad \ldots, \quad (x_n - y_n)^2 = 0,
\]
which is only true when
\[
x_1 = y_1, \quad x_2 = y_2, \quad \ldots, \quad x_n = y_n.
\]
Thus, \vec{x} = \vec{y}.
Triangle Inequality:
See next lecture.
Note, we have yet to prove that d(x, y) is a distance function (although we will finish the proof in the
next lecture, as a result of the Cauchy-Schwarz inequality).
Because a binary function d(x, y) is cumbersome to write, we define (for this entire course),

Definition. The Euclidean distance[1] between x, y ∈ R^n is
\[
\|\vec{x} - \vec{y}\| = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}.
\]
In particular, the distance between x and the zero vector,
\[
\|\vec{x}\| = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2},
\]
is called the norm of x.
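For the computationally inclined, the definition is a one-liner in any language; here is a minimal Python sketch (the function names are mine) that works for any n:

```python
import math

# Euclidean distance and norm on R^n, straight from the definition.

def dist(x, y):
    """||x - y|| = sqrt((x_1 - y_1)^2 + ... + (x_n - y_n)^2)."""
    assert len(x) == len(y)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def norm(x):
    """||x||: the distance from x to the zero vector."""
    return dist(x, [0] * len(x))

print(dist([1, 2, 3, 4], [1, 2, 3, 4]))   # 0.0
print(norm([3, 4]))                        # 5.0
print(dist([0, 0, 0], [1, 2, 2]))          # 3.0
```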
1.9 Why the Distance Function Detour?
I took the liberty of introducing the properties of a distance function before giving an explicit example.
The reason is that we can have different functions[2] d that satisfy the properties of a distance function.
For example, we could have defined:
\[
d(\vec{x}, \vec{y}) = |x_1 - y_1| + |x_2 - y_2| + \ldots + |x_n - y_n|.
\]
We could have even given d a completely bone-headed definition:
\[
d(\vec{x}, \vec{y}) =
\begin{cases}
1 & \text{if } \vec{x} \ne \vec{y} \\
0 & \text{if } \vec{x} = \vec{y}
\end{cases}
\]
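Both alternatives are easy to play with numerically. A small sketch of my own (it spot-checks a point or two, which of course proves nothing):

```python
# Two other functions that also satisfy the four distance properties.

def taxicab(x, y):
    """d(x, y) = |x_1 - y_1| + ... + |x_n - y_n|."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def discrete(x, y):
    """d(x, y) = 1 if x != y, and 0 if x = y."""
    return 0 if list(x) == list(y) else 1

x, y, z = [1, 2, 3], [4, 6, 3], [0, 0, 0]
print(taxicab(x, y))                                   # 7
print(discrete(x, x), discrete(x, y))                  # 0 1
# Spot-check the triangle inequality on these three points:
print(taxicab(x, z) <= taxicab(x, y) + taxicab(y, z))  # True
```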
A recurring theme in mathematics is,
Math Mantra: If we prove a theorem using ONLY the properties of an object, then
we can apply the theorem to any other object that satisfies the same properties.
The computer scientists call this using a different instantiation. So understand the properties of
distance, and don't just gloss them over: you'll thank me later.
[1] If ||x - y|| is not a distance function, then this would be a bone-headed name.
[2] In fact, we can take this abstraction one step further: instead of R^n, we can take d over some set M. You will see
this in Math 171 when you study metric spaces.
New Notation

∈ ("in", element of). Example: x ∈ A, "x is an element of the set A."
R (the set of all real numbers). Example: π ∈ R, "π is a real number."
C (the set of all complex numbers). Example: 2 + i ∈ C, "2 + i is a complex number."
Q (the set of all rationals). Example: 1/2 ∈ Q, "1/2 is rational."
Z (the set of all integers). Example: 3 ∈ Z, "3 is an integer."
R^n (the set of all n-dimensional vectors). Example: \vec{v} ∈ R^2, "\vec{v} is a 2-dimensional vector."
{x ∈ A | P(x)} (the set of all elements in A satisfying property P). Example: {x ∈ R | x > 0}, "the set of all real numbers that are greater than 0."
\vec{v} (vector v). Example: \vec{v} ∈ R^3, "\vec{v} is a vector with three real components."
\vec{0} (the zero vector). Example: \vec{x} + \vec{0} = \vec{x}, "the sum of vector \vec{x} and the zero vector is \vec{x}."
||\vec{x}|| (the norm, or length, of vector \vec{x}). Example: ||\vec{x}|| = 1, "the length of vector \vec{x} is 1."
Lecture 2
All About Angles
...there is no doubt that [the Cauchy-Schwarz inequality] is one of the most widely used
and most important inequalities in all of mathematics
-J. Michael Steele, The Cauchy-Schwarz Master Class
Goals: Today, we define the angle between two vectors. To do this, we introduce the dot
product and its properties. Also, to check that our definition of angle makes sense, we
derive the Cauchy-Schwarz inequality. The Cauchy-Schwarz inequality is a key inequality
which will be used time and time again throughout this course and all of your future
analysis courses.
2.1 Angles in R^n
Last lecture, we mentioned that vectors in R^2 and R^3 can be spatially visualized as directed arrows
from the origin. Under this visualization, we can calculate the angle between two vectors.
For example, in the case of R^2, suppose we are given the points (x_1, x_2) and (y_1, y_2):

[Figure: the points (x_1, x_2) and (y_1, y_2) drawn as arrows from the origin (0, 0), with the angle θ between them.]
To calculate θ, we first compute the distances:

[Figure: the triangle with vertices (0, 0), (x_1, x_2), and (y_1, y_2); its side lengths are \sqrt{x_1^2 + x_2^2}, \sqrt{y_1^2 + y_2^2}, and \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}.]
Then, using the Law of Cosines
\[
c^2 = a^2 + b^2 - 2ab\cos(\theta)
\]
with
\[
a = \sqrt{x_1^2 + x_2^2}, \qquad b = \sqrt{y_1^2 + y_2^2}, \qquad c = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2},
\]
we have
\[
\underbrace{(x_1 - y_1)^2 + (x_2 - y_2)^2}_{c^2} =
\underbrace{x_1^2 + x_2^2}_{a^2} + \underbrace{y_1^2 + y_2^2}_{b^2}
- 2\underbrace{\sqrt{x_1^2 + x_2^2}}_{a}\,\underbrace{\sqrt{y_1^2 + y_2^2}}_{b}\cos(\theta).
\]
Expanding the left hand side (LHS), we get
\[
x_1^2 - 2x_1y_1 + y_1^2 + x_2^2 - 2x_2y_2 + y_2^2 =
x_1^2 + x_2^2 + y_1^2 + y_2^2 - 2\sqrt{x_1^2 + x_2^2}\sqrt{y_1^2 + y_2^2}\cos(\theta).
\]
Then, cancelling terms, we are left with
\[
-2x_1y_1 - 2x_2y_2 = -2\sqrt{x_1^2 + x_2^2}\sqrt{y_1^2 + y_2^2}\cos(\theta).
\]
This gives us
\[
\cos(\theta) = \frac{x_1y_1 + x_2y_2}{\sqrt{x_1^2 + x_2^2}\,\sqrt{y_1^2 + y_2^2}}.
\]
Thus,
\[
\theta = \cos^{-1}\left(\frac{x_1y_1 + x_2y_2}{\sqrt{x_1^2 + x_2^2}\,\sqrt{y_1^2 + y_2^2}}\right).
\]
Likewise, we can follow a similar derivation to show that the angle between (x_1, x_2, x_3) and (y_1, y_2, y_3)
is
\[
\theta = \cos^{-1}\left(\frac{x_1y_1 + x_2y_2 + x_3y_3}{\sqrt{x_1^2 + x_2^2 + x_3^2}\,\sqrt{y_1^2 + y_2^2 + y_3^2}}\right).
\]
But what about the angle between two vectors in R^n for n ≥ 4?

To reiterate, vectors are simply algebraic objects and we cannot visualize R^n spatially for higher
values of n. So just as we defined the distance formula, we will need to define the angle between two
vectors.
Staring at the 2D and 3D cases,
\[
\cos^{-1}\left(\frac{x_1y_1 + x_2y_2}{\sqrt{x_1^2 + x_2^2}\,\sqrt{y_1^2 + y_2^2}}\right)
\qquad
\cos^{-1}\left(\frac{x_1y_1 + x_2y_2 + x_3y_3}{\sqrt{x_1^2 + x_2^2 + x_3^2}\,\sqrt{y_1^2 + y_2^2 + y_3^2}}\right),
\]
we are inspired to choose the following candidate for the angle formula in R^n:
\[
\cos^{-1}\left(\frac{x_1y_1 + x_2y_2 + \ldots + x_ny_n}{\sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}\,\sqrt{y_1^2 + y_2^2 + \ldots + y_n^2}}\right).
\]
If you look closely, you'll recognize that the denominator is the product
\[
\|\vec{x}\|\,\|\vec{y}\|.
\]
Moreover, the numerator is so important that we give it a name:
Definition. For x, y ∈ R^n, the dot product of x, y is
\[
\vec{x} \cdot \vec{y} = x_1y_1 + x_2y_2 + \ldots + x_ny_n.
\]
Using this notation, we formally define angles between vectors in R^n:

Definition. The angle between two non-zero vectors x, y ∈ R^n is
\[
\theta = \cos^{-1}\left(\frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}\right).
\]
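As a quick illustration (a Python sketch of my own, not the course's notation), the dot product, norm, and angle formula fit in a few lines:

```python
import math

# Dot product, norm, and the angle between two non-zero vectors in R^n.

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def angle(x, y):
    """theta = arccos( x.y / (||x|| ||y||) ), in radians."""
    return math.acos(dot(x, y) / (norm(x) * norm(y)))

print(dot([1, 2, 3], [4, 5, 6]))             # 32
print(angle([1, 0], [0, 1]))                 # pi/2, about 1.5708
print(angle([1, 1, 1, 1], [2, 2, 2, 2]))     # 0.0 (same direction, even in R^4)
```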
Again, our definition needs to make sense! Specifically, between any two non-zero vectors, the angle
should always exist. Suppose
\[
\left|\frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}\right| > 1.
\]
Then θ does not exist (the domain of cos^{-1} is [-1, 1])! Thus, we would like
\[
\left|\frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}\right| \le 1
\]
for x, y ≠ 0.
Before we can prove this, we need to learn dot product properties.
2.2 Dot Product Properties
The most basic properties you need to know are:
Theorem. The dot product satisfies the following properties:

1. (Commutativity) Order of dot product multiplication does not matter:
\[
\vec{x} \cdot \vec{y} = \vec{y} \cdot \vec{x}
\]
for all x, y ∈ R^n.

2. (Homogeneity) You can always pull scalars outside the dot product:
\[
(\lambda\vec{x}) \cdot \vec{y} = \vec{x} \cdot (\lambda\vec{y}) = \lambda(\vec{x} \cdot \vec{y})
\]
for all real λ and x, y ∈ R^n.

3. (Distributivity) Dot products distribute over vector addition:
\[
\vec{x} \cdot (\vec{y} + \vec{z}) = \vec{x} \cdot \vec{y} + \vec{x} \cdot \vec{z}
\]
for all x, y, z ∈ R^n.

4. (Relation to Norm[1]) The dot product of a vector with itself is the same as the square of its
norm:
\[
\vec{x} \cdot \vec{x} = \|\vec{x}\|^2
\]
for all x ∈ R^n.

[1] In math lingo, we say that the norm is induced by the dot product.
Proof: Expand the definition in each case. For example, to check (1), let x, y ∈ R^n. Then, by the
commutativity of the reals,
\[
\vec{x} \cdot \vec{y} = x_1y_1 + x_2y_2 + \ldots + x_ny_n = y_1x_1 + y_2x_2 + \ldots + y_nx_n = \vec{y} \cdot \vec{x}.
\]
Notice that the dot product looks a lot like normal multiplication; however, don't think of the dot
product as multiplication! In particular, associativity doesn't really make sense here: dot products
only take vectors as arguments.
Now that we have these properties, we should
Math Mantra: Try to use the derived properties and not the original definition
to prove a result.
Why shouldn't we go back to the original definition?
1. It could be tedious. We already derived the tools, so why not use them right away? Consider
the following proof:
Example. For any x, y ∈ R^n,
\[
\|\vec{x} + \vec{y}\|^2 = \|\vec{x}\|^2 + \|\vec{y}\|^2 + 2\,\vec{x} \cdot \vec{y}.
\]
Proof: Use the properties of dot product to expand the left hand side:
\[
\begin{aligned}
\|\vec{x} + \vec{y}\|^2 &= (\vec{x} + \vec{y}) \cdot (\vec{x} + \vec{y}) && \text{(Convert Norm to Dot Product)} \\
&= (\vec{x} + \vec{y}) \cdot \vec{x} + (\vec{x} + \vec{y}) \cdot \vec{y} && \text{(Distribute)} \\
&= \vec{x} \cdot (\vec{x} + \vec{y}) + \vec{y} \cdot (\vec{x} + \vec{y}) && \text{(Commutativity)} \\
&= \vec{x} \cdot \vec{x} + \vec{x} \cdot \vec{y} + \vec{y} \cdot \vec{x} + \vec{y} \cdot \vec{y} && \text{(Distribute)} \\
&= \vec{x} \cdot \vec{x} + \vec{x} \cdot \vec{y} + \vec{x} \cdot \vec{y} + \vec{y} \cdot \vec{y} && \text{(Commutativity)} \\
&= \|\vec{x}\|^2 + \|\vec{y}\|^2 + 2\,\vec{x} \cdot \vec{y} && \text{(Convert Dot Product to Norm)}
\end{aligned}
\]
If we went back to the dot product definition and wrote the norm explicitly as the square root
of a sum of squares, it would be a lot messier.
2. Generalization. This is the far more important reason, which we talked about at the end of last lecture. There are different norms that satisfy properties (1)-(4) in the Theorem. The preceding proof applies to any of these norms, whereas a proof involving the explicit definition of the dot product would not.
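As a quick sanity check of the identity from the example above, here is an illustrative Python sketch (the dot and norm helpers are my own names); it only verifies the algebra numerically, it is not a proof.

```python
import math, random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

random.seed(0)
x = [random.uniform(-5, 5) for _ in range(6)]
y = [random.uniform(-5, 5) for _ in range(6)]

lhs = norm([a + b for a, b in zip(x, y)]) ** 2       # ||x + y||^2
rhs = norm(x) ** 2 + norm(y) ** 2 + 2 * dot(x, y)    # ||x||^2 + ||y||^2 + 2 x.y
print(abs(lhs - rhs) < 1e-9)   # True, up to floating point error
```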
2.3 Cauchy-Schwarz Inequality
Now that we are equipped with the proper machinery, we can go back to proving
$$\left|\frac{x \cdot y}{\|x\|\,\|y\|}\right| \leq 1$$
for $x, y \neq \vec{0}$.
First, recall one of the properties of absolute value:
The absolute value of a product is the product of the individual absolute values:
$$|ab| = |a|\,|b|$$
Thus,
$$\left|\frac{x \cdot y}{\|x\|\,\|y\|}\right| = |x \cdot y|\,\left|\frac{1}{\|x\|\,\|y\|}\right| = \frac{|x \cdot y|}{\|x\|\,\|y\|}.$$
So now, our condition is just
$$\frac{|x \cdot y|}{\|x\|\,\|y\|} \leq 1$$
which is equivalent to
$$|x \cdot y| \leq \|x\|\,\|y\|$$
This is known as the Cauchy-Schwarz inequality. When expanded, it actually looks like
$$|x_1y_1 + x_2y_2 + \ldots + x_ny_n| \leq \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}\,\sqrt{y_1^2 + y_2^2 + \ldots + y_n^2}$$
But to prove this inequality, we are going to need an idea. In fact,
Math Mantra: Cool results can come from solving entirely different problems!
And the problem we are going to look at is one you solved in 9th grade Geometry:
Which point on a given line is closest to the origin?
To solve this problem, you found a perpendicular line that goes through the origin and calculated the
intersection:
[Figure: a line in the plane, together with the perpendicular line through the origin $(0, 0)$ and their intersection point.]
However, there is a direct way to solve this problem. Given the line
$$y = mx + b$$
we would like to minimize the distance to the origin:
$$\sqrt{x^2 + y^2} = \sqrt{x^2 + (mx + b)^2}.$$
But what do we do about the square root? The key idea is that this function is minimized precisely when its square is minimized. More precisely,
If we have a non-negative function $f$, then $f$ and $f^2$ achieve their minima at the same point $x_0$.
Why is this true? We showed the reason last lecture:
Square roots preserve inequalities.
and
Squaring both sides preserves inequalities.
So if $f$ achieves its minimum at $x_0$, then
$$f(x_0) \leq f(x)$$
for all $x$. Squaring,
$$\big(f(x_0)\big)^2 \leq \big(f(x)\big)^2$$
for all $x$. Hence, $f^2$ also achieves its minimum at $x_0$. Conversely, if $f^2$ achieves its minimum at $x_0$,
$$\big(f(x_0)\big)^2 \leq \big(f(x)\big)^2$$
for all $x$. Rooting,
$$f(x_0) \leq f(x)$$
for all $x$, i.e. $f$ also achieves its minimum at $x_0$.
Thus, we only need to find the $x$ that minimizes
$$x^2 + (mx + b)^2 = x^2 + m^2x^2 + 2mbx + b^2 = \underbrace{(1 + m^2)x^2}_{Ax^2} + \underbrace{(2mb)x}_{Bx} + \underbrace{b^2}_{C}$$
But this is a quadratic equation in terms of $x$. And we all know, from the SAT II or Calc BC, that the minimum of a quadratic $y = Ax^2 + Bx + C$ (with $A > 0$) is achieved at the vertex's $x$ coordinate
$$x = \frac{-B}{2A}$$
Here, the minimum is achieved at
$$x = \frac{-2mb}{2(1 + m^2)} = \frac{-mb}{1 + m^2}$$
But this is just for $\mathbb{R}^2$. How do we generalize to higher dimensions?
Again, we cannot visualize higher dimensional lines. Thus, we define a line in $n$ dimensions. Looking at the 2D formula for a line,
$$y = mx + b,$$
we make a guess that the equation for the $n$-dimensional case is
$$y = mx + \vec{b}.$$
This should make intuitive sense: each unit of $x$ increases vector $y$ by some $m$ "slope." Moreover, just to be clear, a line is the set of points satisfying this equation:
$$\left\{\, y \in \mathbb{R}^n \;\middle|\; y = mx + \vec{b} \text{ for some } x \in \mathbb{R} \,\right\}$$
However, we like to reserve $x, y$ for vectors and use $t$ for scalars, so we change variables and use a more common notation:
Definition. Given vectors $x, y \in \mathbb{R}^n$, the line that goes through $x$ in the direction of $y$ is the set of all points of the form
$$x + ty$$
where $t \in \mathbb{R}$.
Now we can prove the n-dimensional analogue of the point on a line closest to the origin:
Lemma. For $x, y \in \mathbb{R}^n$ with $y \neq \vec{0}$, consider the line that goes through vector $x$ in the direction of $y$:
$$x + ty.$$
Then this line is closest to the origin when
$$t = \frac{-\,x \cdot y}{\|y\|^2}$$
Proof Summary:
- This is equivalent to minimizing the square of a norm.
- Expand the square into a quadratic using dot product properties.
- The minimum of this quadratic occurs at the $t$ coordinate of the vertex.
Proof: We want to find the particular $t$ that minimizes the distance between $x + ty$ and the origin:
$$\|x + ty\|.$$
Since this is a non-negative function, we know it achieves its minimum when its square achieves its minimum:
$$\|x + ty\|^2$$
But now we can use dot product properties:
$$\begin{aligned}
\|x + ty\|^2 &= (x + ty) \cdot (x + ty) && \text{(Convert Norm to Dot Product)}\\
&= x \cdot x + x \cdot (ty) + (ty) \cdot x + (ty) \cdot (ty) && \text{(Distributive and Commutative Laws)}\\
&= \|x\|^2 + 2t\,(x \cdot y) + t^2\|y\|^2 && \text{(Convert Dot Product to Norm)}\\
&= \underbrace{\|y\|^2}_{A}\,t^2 + \underbrace{2(x \cdot y)}_{B}\,t + \underbrace{\|x\|^2}_{C}
\end{aligned}$$
But lo and behold, this is a quadratic equation in terms of $t$! Using the vertex formula
$$t = \frac{-B}{2A}$$
with
$$A = \|y\|^2 \qquad B = 2(x \cdot y),$$
the minimum occurs when
$$t = \frac{-\,x \cdot y}{\|y\|^2}.$$
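To see the Lemma numerically, here is an illustrative Python sketch (helper names are mine) that compares the t from the Lemma against a brute-force search over a grid of t values.

```python
import math, random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

random.seed(1)
x = [random.uniform(-3, 3) for _ in range(4)]
y = [random.uniform(-3, 3) for _ in range(4)]

def dist_to_origin(t):
    # ||x + t y||
    return norm([xi + t * yi for xi, yi in zip(x, y)])

t_star = -dot(x, y) / norm(y) ** 2           # the t from the Lemma
t_grid = min((dist_to_origin(t), t)          # best t found by brute force
             for t in [k / 1000.0 for k in range(-5000, 5000)])[1]
print(t_star, t_grid)   # the two values agree to within the grid spacing
```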
Now we use this result to derive one of the most important inequalities in mathematics:
Theorem (Cauchy-Schwarz Inequality). For any two vectors $x, y \in \mathbb{R}^n$,
$$|x \cdot y| \leq \|x\|\,\|y\|.$$
Proof Summary:
- We know that $\|x + ty\|^2$ is always non-negative.
- Plug in the minimizing $t$ from the previous Lemma.
- Isolate the dot product and square root both sides.
Proof: Since a square of a number is always non-negative,
$$\|x + ty\|^2 \geq 0.$$
From the proof of the lemma, we can rewrite the left side:
$$\underbrace{\|y\|^2t^2 + 2(x \cdot y)t + \|x\|^2}_{\|x + ty\|^2} \geq 0.$$
Now plug in a particular $t$. Namely, the $t$ where $x + ty$ is closest to the origin:
$$t = \frac{-\,x \cdot y}{\|y\|^2}.$$
Then, our inequality becomes
$$\underbrace{\frac{(x \cdot y)^2}{\|y\|^2} - 2\,\frac{(x \cdot y)^2}{\|y\|^2} + \|x\|^2}_{\|y\|^2t^2 + 2(x \cdot y)t + \|x\|^2} \geq 0.$$
Multiplying both sides by $\|y\|^2$ and moving terms to the other side, we get
$$\|y\|^2\|x\|^2 \geq (x \cdot y)^2.$$
Finally, take the square root of both sides:
$$|x \cdot y| \leq \|x\|\,\|y\|.$$
Are we done now? No! To quote Professor Simon,
"There is a slight fly in the ointment!"
We assumed in our proof $y \neq \vec{0}$ when we defined $t$, but this may not be the case. Luckily though, the proof is trivial when $y = \vec{0}$ since
$$\underbrace{0}_{|x \cdot y|} \leq \underbrace{0}_{\|x\|\,\|y\|}.$$
Now we're done!
∎
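The inequality is easy to test numerically. A minimal Python sketch (purely illustrative, with my own helper names):

```python
import math, random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

random.seed(2)
for _ in range(1000):
    n = random.randint(1, 8)
    x = [random.uniform(-10, 10) for _ in range(n)]
    y = [random.uniform(-10, 10) for _ in range(n)]
    # |x . y| <= ||x|| ||y||   (tiny slack for floating point error)
    assert abs(dot(x, y)) <= norm(x) * norm(y) + 1e-9
print("Cauchy-Schwarz held in every random trial")
```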
Remarkably, we just used a completely different problem (a geometric one) to prove the Cauchy-Schwarz Inequality! That's pretty cool! And because this is so awesome, I'll talk more about this proof technique at the end of this lecture.
Before we move on, take note of two steps in this proof:
1. We took the square of a norm.
Typically,
Math Mantra: When dealing with norms and square roots, it is often easier to work with their squares.
In particular, when we square a norm, we can expand it as a dot product:
$$\|x\|^2 = x \cdot x$$
This allows us to exploit dot product properties.
2. We took the square root of a square.
Near the end of the proof, we calculated
$$\sqrt{(x \cdot y)^2} = |x \cdot y|.$$
Note the absolute value sign! Generally,
$$\sqrt{a^2} = |a|$$
It is incorrect to write
$$\sqrt{a^2} = a.$$
2.4 On Keeping One's Word: Triangle Inequality
As promised in the previous chapter, we will complete our proof that
$$\|x - y\| \text{ is a distance function.}$$
All that was left to prove was the triangle inequality: for all $x, y, z \in \mathbb{R}^n$,
$$d(x, z) \leq d(x, y) + d(y, z)$$
i.e.
$$\|x - z\| \leq \|x - y\| + \|y - z\|.$$
We could prove this directly, but it's a lot easier to prove
$$\|\vec{a} + \vec{b}\| \leq \|\vec{a}\| + \|\vec{b}\|.$$
The result then follows by substituting
$$\vec{a} = x - y \qquad \vec{b} = y - z.$$
Theorem (Triangle Inequality). For any $\vec{a}, \vec{b} \in \mathbb{R}^n$,
$$\|\vec{a} + \vec{b}\| \leq \|\vec{a}\| + \|\vec{b}\|.$$
Proof Summary:
- Expand $\|\vec{a} + \vec{b}\|^2$ using dot product properties.
- Use Cauchy-Schwarz on the $\vec{a} \cdot \vec{b}$ term.
- The right hand side (RHS) of the inequality is simply $\big(\|\vec{a}\| + \|\vec{b}\|\big)^2$.
- Square root both sides.
Proof: We once again square the norm and exploit dot product properties:
$$\|\vec{a} + \vec{b}\|^2 = \big(\vec{a} + \vec{b}\big) \cdot \big(\vec{a} + \vec{b}\big) = \vec{a} \cdot \vec{a} + 2\big(\vec{a} \cdot \vec{b}\big) + \vec{b} \cdot \vec{b} = \|\vec{a}\|^2 + \|\vec{b}\|^2 + 2\big(\vec{a} \cdot \vec{b}\big).$$
But $x \leq |x|$, so in particular,
$$\vec{a} \cdot \vec{b} \leq |\vec{a} \cdot \vec{b}|.$$
Thus, we can create the bound
$$\|\vec{a}\|^2 + \|\vec{b}\|^2 + 2\big(\vec{a} \cdot \vec{b}\big) \leq \|\vec{a}\|^2 + \|\vec{b}\|^2 + 2\,|\vec{a} \cdot \vec{b}|.$$
Using Cauchy-Schwarz, we can bound the right hand side further by
$$\|\vec{a}\|^2 + \|\vec{b}\|^2 + 2\,|\vec{a} \cdot \vec{b}| \leq \|\vec{a}\|^2 + \|\vec{b}\|^2 + 2\,\|\vec{a}\|\,\|\vec{b}\|.$$
Thus, we have
$$\|\vec{a} + \vec{b}\|^2 \leq \|\vec{a}\|^2 + \|\vec{b}\|^2 + 2\,\|\vec{a}\|\,\|\vec{b}\|$$
Rewriting the right hand side as a square,
$$\|\vec{a} + \vec{b}\|^2 \leq \big(\|\vec{a}\| + \|\vec{b}\|\big)^2,$$
we root both sides to get
$$\|\vec{a} + \vec{b}\| \leq \|\vec{a}\| + \|\vec{b}\|.$$
∎
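And a matching numerical spot-check of the triangle inequality (again just an illustration in Python):

```python
import math, random

def norm(x):
    return math.sqrt(sum(t * t for t in x))

random.seed(3)
for _ in range(1000):
    a = [random.uniform(-10, 10) for _ in range(5)]
    b = [random.uniform(-10, 10) for _ in range(5)]
    lhs = norm([ai + bi for ai, bi in zip(a, b)])   # ||a + b||
    rhs = norm(a) + norm(b)                          # ||a|| + ||b||
    assert lhs <= rhs + 1e-9
print("Triangle inequality held in every random trial")
```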
2.5 Some Fun with Cauchy-Schwarz
There are lots of fun things you can show with Cauchy-Schwarz, but most of the time you'll be using it to establish an upper bound, one typically involving a magic letter named $\varepsilon$. But for practice, here are some fun applications:
Example. For any angle $\theta$,
$$|\cos^2\theta - \sin^2\theta| \leq 1.$$
Proof: Apply the Cauchy-Schwarz inequality to
$$x = \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix} \qquad y = \begin{pmatrix} \cos\theta \\ -\sin\theta \end{pmatrix}$$
Since
$$|x \cdot y| = |\cos\theta\cos\theta - \sin\theta\sin\theta| = |\cos^2\theta - \sin^2\theta|$$
and
$$\|x\|\,\|y\| = \left(\sqrt{(\cos\theta)^2 + (\sin\theta)^2}\right)\left(\sqrt{(\cos\theta)^2 + (-\sin\theta)^2}\right) = 1,$$
Cauchy-Schwarz tells us
$$\underbrace{|\cos^2\theta - \sin^2\theta|}_{|x \cdot y|} \leq \underbrace{1}_{\|x\|\,\|y\|}$$
∎
In case you forgot, the left hand side is just the double angle formula for cosine. You used this equation tons of times when integrating $\cos^2\theta$.
I learned the next Cauchy-Schwarz example from a tenth grade Taiwanese national, who would take
the H-series at Stanford 3 years later.
Example. For any $a, b, c > 0$,
$$(a + b + c)\left(\frac{1}{a} + \frac{1}{b} + \frac{1}{c}\right) \geq 9$$
Proof: Rewrite the left hand side as the square of a product of norms: let
$$x = \begin{pmatrix} \sqrt{a} \\ \sqrt{b} \\ \sqrt{c} \end{pmatrix} \qquad y = \begin{pmatrix} 1/\sqrt{a} \\ 1/\sqrt{b} \\ 1/\sqrt{c} \end{pmatrix}$$
Then,
$$\big(\|x\|\,\|y\|\big)^2 = \|x\|^2\,\|y\|^2 = (a + b + c)\left(\frac{1}{a} + \frac{1}{b} + \frac{1}{c}\right)$$
We also know
$$|x \cdot y|^2 = (1 + 1 + 1)^2 = 9$$
So, by the square of Cauchy-Schwarz,
$$\underbrace{9}_{|x \cdot y|^2} \leq \underbrace{(a + b + c)\left(\frac{1}{a} + \frac{1}{b} + \frac{1}{c}\right)}_{\big(\|x\|\,\|y\|\big)^2}$$
∎
The next example is a famous one that translates to
"The square of the average is less than or equal to the average of the squares."
If you are not sure what this means, it is always good to write down an example:
$$\left(\frac{1 + 2 + 5 + 6 + 10}{5}\right)^2 \leq \frac{1^2 + 2^2 + 5^2 + 6^2 + 10^2}{5}$$
Example. Let $a_1, a_2, \ldots, a_n \in \mathbb{R}$. Then,
$$\left(\frac{1}{n}\sum_{i=1}^{n} a_i\right)^2 \leq \frac{1}{n}\sum_{i=1}^{n} a_i^2$$
Proof: The magic vectors we are going to use this time are
$$x = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} \qquad y = \begin{pmatrix} 1/n \\ 1/n \\ \vdots \\ 1/n \end{pmatrix}$$
First, calculate the dot product as
$$|x \cdot y| = \left|\,\sum_{i=1}^{n} \frac{a_i}{n}\,\right| = \left|\,\frac{1}{n}\sum_{i=1}^{n} a_i\,\right|$$
and the norm product as
$$\|x\|\,\|y\| = \sqrt{\sum_{i=1}^{n} a_i^2}\;\sqrt{\sum_{i=1}^{n} \frac{1}{n^2}} = \frac{1}{\sqrt{n}}\sqrt{\sum_{i=1}^{n} a_i^2}.$$
So Cauchy-Schwarz tells us
$$\underbrace{\left|\,\frac{1}{n}\sum_{i=1}^{n} a_i\,\right|}_{|x \cdot y|} \leq \underbrace{\frac{1}{\sqrt{n}}\sqrt{\sum_{i=1}^{n} a_i^2}}_{\|x\|\,\|y\|}.$$
Since we have non-negative terms, we can square both sides:
$$\left(\frac{1}{n}\sum_{i=1}^{n} a_i\right)^2 \leq \frac{1}{n}\sum_{i=1}^{n} a_i^2.$$
∎
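A quick Python check of this statement, on the sample numbers from above and on random data (purely illustrative):

```python
import random

a = [1, 2, 5, 6, 10]                      # the example from the text
lhs = (sum(a) / len(a)) ** 2              # square of the average
rhs = sum(t * t for t in a) / len(a)      # average of the squares
print(lhs, rhs, lhs <= rhs)               # 23.04 33.2 True

random.seed(4)
for _ in range(100):
    a = [random.uniform(-10, 10) for _ in range(random.randint(1, 20))]
    assert (sum(a) / len(a)) ** 2 <= sum(t * t for t in a) / len(a) + 1e-9
```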
The final example doesn't have a cute saying (at least not one I can think of). But the symmetry is pretty sweet, and just like in love, it is easier to see beauty than describe it.
Example. Let $a_1, a_2, \ldots, a_n$ be positive. Then,
$$a_1 + a_2 + \ldots + a_n \leq \frac{a_1^2}{a_2} + \frac{a_2^2}{a_3} + \ldots + \frac{a_{n-1}^2}{a_n} + \frac{a_n^2}{a_1}$$
Proof: We will apply Cauchy-Schwarz to the vectors
$$x = \begin{pmatrix} \sqrt{a_2} \\ \sqrt{a_3} \\ \vdots \\ \sqrt{a_n} \\ \sqrt{a_1} \end{pmatrix} \qquad y = \begin{pmatrix} a_1/\sqrt{a_2} \\ a_2/\sqrt{a_3} \\ \vdots \\ a_{n-1}/\sqrt{a_n} \\ a_n/\sqrt{a_1} \end{pmatrix}$$
Then,
$$|x \cdot y| = \left|\,\sqrt{a_2}\,\frac{a_1}{\sqrt{a_2}} + \sqrt{a_3}\,\frac{a_2}{\sqrt{a_3}} + \ldots + \sqrt{a_n}\,\frac{a_{n-1}}{\sqrt{a_n}} + \sqrt{a_1}\,\frac{a_n}{\sqrt{a_1}}\,\right| = |a_1 + a_2 + \ldots + a_n|.$$
But we can drop the absolute value since the $a_i$ are non-negative:
$$|x \cdot y| = a_1 + a_2 + \ldots + a_n.$$
Also,
$$\|x\|\,\|y\| = \sqrt{(\sqrt{a_2})^2 + (\sqrt{a_3})^2 + \ldots + (\sqrt{a_n})^2 + (\sqrt{a_1})^2}\;\sqrt{\left(\frac{a_1}{\sqrt{a_2}}\right)^2 + \left(\frac{a_2}{\sqrt{a_3}}\right)^2 + \ldots + \left(\frac{a_{n-1}}{\sqrt{a_n}}\right)^2 + \left(\frac{a_n}{\sqrt{a_1}}\right)^2}$$
$$= \sqrt{a_2 + a_3 + \ldots + a_n + a_1}\;\sqrt{\frac{a_1^2}{a_2} + \frac{a_2^2}{a_3} + \ldots + \frac{a_{n-1}^2}{a_n} + \frac{a_n^2}{a_1}}$$
So by Cauchy-Schwarz,
$$\underbrace{a_1 + a_2 + \ldots + a_n}_{|x \cdot y|} \leq \underbrace{\sqrt{a_2 + a_3 + \ldots + a_n + a_1}\;\sqrt{\frac{a_1^2}{a_2} + \frac{a_2^2}{a_3} + \ldots + \frac{a_{n-1}^2}{a_n} + \frac{a_n^2}{a_1}}}_{\|x\|\,\|y\|}$$
Since both sides are non-negative, we can square this inequality to get
$$(a_1 + a_2 + \ldots + a_n)^2 \leq (a_2 + a_3 + \ldots + a_n + a_1)\left(\frac{a_1^2}{a_2} + \frac{a_2^2}{a_3} + \ldots + \frac{a_{n-1}^2}{a_n} + \frac{a_n^2}{a_1}\right).$$
Notice that
$$a_1 + a_2 + \ldots + a_n = a_2 + a_3 + \ldots + a_n + a_1.$$
Thus, we can divide out by
$$a_2 + a_3 + \ldots + a_n + a_1$$
to get
$$a_1 + a_2 + \ldots + a_n \leq \frac{a_1^2}{a_2} + \frac{a_2^2}{a_3} + \ldots + \frac{a_{n-1}^2}{a_n} + \frac{a_n^2}{a_1}.$$
∎
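Here is an illustrative Python check of the cyclic inequality on random positive inputs (not a proof, just a sanity check):

```python
import random

random.seed(5)
for _ in range(1000):
    n = random.randint(2, 10)
    a = [random.uniform(0.1, 10) for _ in range(n)]
    lhs = sum(a)
    # a_1^2/a_2 + a_2^2/a_3 + ... + a_{n-1}^2/a_n + a_n^2/a_1
    rhs = sum(a[i] ** 2 / a[(i + 1) % n] for i in range(n))
    assert lhs <= rhs + 1e-9
print("The cyclic inequality held in every random trial")
```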
After seeing these examples, the natural question to ask is:
When is equality achieved in the Cauchy-Schwarz inequality?
That's a very good question! We will save this discussion for the next chapter, after we introduce "if and only if" proofs.
One last remark: in the preceding proofs, notice that I did not tell you how we cooked up x and y. This is because:
Math Mantra: Some things are not obvious and you really have to mull them over.
And oftentimes, this will involve tons of scratch work. But that is what makes mathematics an art.
2.6 Proof Technique: The 7-10 Split
The hardest shot in bowling is the 7-10 Split:
To knock down both pins, the ball has to collide in such a way that one pin ricochets into the other.
Amazingly, trying to knock down one pin leads to the collapse of the other.
In math, we build theorems that often have immediate implications. But some of the rarest and most
beautiful proofs are the ones that pop out of nowhere through the solution of a seemingly unrelated
problem.
I hesitate mentioning this proof technique now because:
- You haven't learned all the basic proof techniques.
- Describing this proof technique is the mathematical equivalent of trying to explain irony.
But this is indeed the trick that we used to prove Cauchy-Schwarz: we derived this inequality by solving the geometric problem of finding the point on a given line that is closest to the origin.
There are some fantastic examples: my personal favorite is Prufer codes.¹ A Prufer code solves the problem of storing data structures known as trees. But we can also use it to solve a seemingly unrelated problem: the problem of counting the number of labelled trees with n nodes.
¹ Wikipedia it! The math is very simple and will be especially useful if you have CS aspirations. You will also see Prufer codes in Math 108.
The prototypical² example of the 7-10 split is one you will prove in 52H. Recall, in Calc BC, you used p-series to prove that
$$\sum_{n=1}^{\infty} \frac{1}{n^2}$$
converges. But what does it actually converge to? Using the entirely different subject of Fourier Series, the answer magically pops out:
$$\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}.$$
Awesome!
However, these examples involve words you've never seen before (like "trees" and "Fourier Series"). But you came here to do math! You want to see theorems that you can understand. So here is an example from your Algebra II days.
The Binomial Theorem states that
$$(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i}$$
This tells us how to expand a power of a binomial. But we can actually use this to prove set theoretic facts!
We say that
a set $A$ is a subset of a set $B$ if for every $x$ in $A$, $x$ is also in $B$.
Often, this expression will be denoted by
$$A \subseteq B$$
Alternatively, we can write
$$B \supseteq A$$
The first property is that any set of size $n$ has $2^n$ subsets. For example, the 3 element set
$$\{0, 1, 2\}$$
has $2^3$ subsets, namely,
$$\{\} \qquad \{0\} \quad \{1\} \quad \{2\} \qquad \{0, 1\} \quad \{0, 2\} \quad \{1, 2\} \qquad \{0, 1, 2\}$$
where $\{\}$ is just the empty set.
Now we prove this property using the Binomial Theorem:
² The prototypical example is actually elliptic curves and Fermat's Last Theorem, but I make it a policy to never talk about anything I can't prove. Guess I'd make a pretty lousy politician :)
Example. Any set of size $n$ has $2^n$ subsets.
Proof Summary:
- Expand $(1 + 1)^n$ using the Binomial Theorem.
- Interpret the right hand side as the number of subsets.
Proof: Plugging $x = 1$ and $y = 1$ into the Binomial Theorem yields
$$(1 + 1)^n = \sum_{i=0}^{n} \binom{n}{i}\, 1^i\, 1^{n-i}$$
which simplifies to
$$2^n = \sum_{i=0}^{n} \binom{n}{i}$$
But how do we build all the subsets of an $n$-element set? The number of ways to build a subset of size $i$ is simply the number of ways to choose $i$ out of the original $n$ elements:
$$\binom{n}{i}$$
So the number of subsets is
$$\underbrace{\binom{n}{0}}_{\text{number of 0-element subsets}} + \underbrace{\binom{n}{1}}_{\text{number of 1-element subsets}} + \ldots + \underbrace{\binom{n}{n}}_{\text{number of $n$-element subsets}}$$
which is the right hand side above, so it equals $2^n$ by the Binomial Theorem.
∎
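If you want to see the count with your own eyes, here is a tiny Python sketch (the helper name subsets is mine) that enumerates every subset of a small set and compares against 2^n:

```python
from itertools import combinations

def subsets(s):
    # build all subsets by choosing i elements for each i = 0, ..., n
    return [set(c) for i in range(len(s) + 1) for c in combinations(s, i)]

S = {0, 1, 2}
print(len(subsets(S)), 2 ** len(S))   # 8 8
```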
We can use the same trick to prove that any set has an equal number of odd and even sized subsets. In our previous example, $\{0, 1, 2\}$, the number of even sized subsets is 4:
$$\{\} \qquad \{0, 1\} \quad \{0, 2\} \quad \{1, 2\}$$
whereas the number of odd sized subsets is also 4:
$$\{0\} \quad \{1\} \quad \{2\} \qquad \{0, 1, 2\}$$
Example. For any set of size $n$, the number of odd sized subsets is equal to the number of even sized subsets.
Proof Summary:
- Expand $(1 + (-1))^n$ using the Binomial Theorem.
- Move all negative $\binom{n}{i}$ terms to the left side of the equation.
- Interpret the left side as the number of odd subsets and the right side as the number of even subsets.
Proof: For ease, let's assume $n$ is even (the argument for $n$ odd is essentially the same). Plugging $x = 1$ and $y = -1$ into the Binomial Theorem yields
$$(1 + (-1))^n = \sum_{i=0}^{n} \binom{n}{i}\, 1^{n-i} (-1)^{i}$$
which simplifies to
$$0 = \binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \binom{n}{3} + \ldots + \binom{n}{n}.$$
Moving all the negative terms to the left,
$$\binom{n}{1} + \binom{n}{3} + \binom{n}{5} + \ldots + \binom{n}{n-1} = \binom{n}{0} + \binom{n}{2} + \binom{n}{4} + \ldots + \binom{n}{n}.$$
But this just says the number of odd sized subsets is the same as the number of even sized subsets.
∎
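And the same enumeration confirms the odd/even balance for small n (illustrative Python; the helper name is mine):

```python
from itertools import combinations

def subsets(s):
    return [c for i in range(len(s) + 1) for c in combinations(s, i)]

for n in range(1, 8):
    subs = subsets(range(n))
    even = sum(1 for c in subs if len(c) % 2 == 0)
    odd = sum(1 for c in subs if len(c) % 2 == 1)
    print(n, even, odd)   # for every n >= 1, even == odd == 2**(n - 1)
```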
New Notation
- $x \cdot y$ — "the dot product of vectors $x$ and $y$." Example: $x \cdot y < 9$ ("the dot product of vectors $x$ and $y$ is less than 9").
- (LHS) — "the left hand side." Example: the (LHS) of $A = B$ is $A$ ("the left hand side of equation $A = B$ is $A$").
- (RHS) — "the right hand side." Example: the (RHS) of $A = B$ is $B$ ("the right hand side of equation $A = B$ is $B$").
- $A \subseteq B$ — "$A$ is a subset of $B$." Example: $\mathbb{Z} \subseteq \mathbb{Q}$ ("the set of integers is contained in the set of rationals").
- $B \supseteq A$ — "$B$ contains the subset $A$." Example: $\mathbb{C} \supseteq \mathbb{R}$ ("the set of complex numbers contains the set of reals").
Lecture 3
Let's get Linear!
"It's been quite a thousand years, and I feel some obligation to contribute to the millennial reminiscences before we get too far into the new century. I propose linearity as one of the most important themes of mathematics and its applications in times past and present."
-The Great Brad Osgood, EE261.
Goals: First, we define linear functions and use them to motivate the definition of a subspace (of $\mathbb{R}^n$). Afterwards, we introduce the notion of span to construct subspaces from some initial set of vectors. In this pursuit, natural questions arise concerning redundancy of vectors, allowing us to introduce Proof by Contradiction and "if and only if" statements.
3.1 Linear Functions
For the first few days, we have been studying Linear Algebra. But what exactly makes it "Linear"?
Linearity is a property that is pretty much everywhere, from physics to finance. To explain linearity, let's consider a simple example:
Suppose
x ounces of rum produces f(x) Mojitos.
Intuitively, if you double the input, you should double the output:
$$f(2x) = 2f(x)$$
Likewise, if
x ounces of rum produces f(x) Mojitos.
y ounces of rum produces f(y) Mojitos.
Then, x + y ounces of rum should produce f(x) + f(y) Mojitos:
$$f(x + y) = f(x) + f(y).$$
Simple, right? Precisely,
Definition. Let $f$ be a function with domain $V$. We say that $f$ is linear on $V$ if it satisfies two properties:
1. (Superposition) Applying the function to a sum of inputs is the same as summing the function applied to each term separately:
$$f(x + y) = f(x) + f(y)$$
for any $x, y \in V$.
2. (Homogeneity) Scaling the input by $\lambda$ scales the output by $\lambda$:
$$f(\lambda x) = \lambda f(x)$$
for any $x \in V$ and real number $\lambda$.
Next week, you will see that the most important linear functions in this course involve matrix multiplication in $\mathbb{R}^n$, namely:
$$f(x) = Ax$$
where $A$ is an $n \times n$ matrix. In fact, you will learn that any linear function on $\mathbb{R}^n$ can be expressed as a matrix multiplication!
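As a preview of that claim, here is an illustrative Python sketch (assuming NumPy is available; the specific matrix is an arbitrary choice of mine) that checks superposition and homogeneity for f(x) = Ax numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))          # any fixed 3x3 matrix
f = lambda x: A @ x                   # f(x) = Ax

x = rng.normal(size=3)
y = rng.normal(size=3)
lam = 2.7

# Superposition: f(x + y) = f(x) + f(y)
print(np.allclose(f(x + y), f(x) + f(y)))     # True
# Homogeneity: f(lam * x) = lam * f(x)
print(np.allclose(f(lam * x), lam * f(x)))    # True
```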
3.2 Subspaces of $\mathbb{R}^n$
Notice that I cheated in the definition of a linear function: I never defined its domain and just left it as $V$. If you take Math 113, Math 171, and Math 121, you will learn that $V$ is an abstract collection known as a vector space.
For example, suppose
V is the set of differentiable functions.
Then the derivative is a linear function on $V$: the derivative of the sum of two functions is just the sum of the individual derivatives
$$\frac{d}{dx}\big(f(x) + g(x)\big) = \frac{d}{dx}f(x) + \frac{d}{dx}g(x)$$
and you can always pull a scalar out of a derivative:
$$\frac{d}{dx}\big(\lambda f(x)\big) = \lambda\,\frac{d}{dx}f(x)$$
Likewise, if
V is the set of integrable functions,
the integral is linear on $V$:
$$\int \big(f(x) + g(x)\big)\,dx = \int f(x)\,dx + \int g(x)\,dx$$
$$\int \big(\lambda f(x)\big)\,dx = \lambda\int f(x)\,dx$$
But, for this course, we are going to assume
V is a subset of $\mathbb{R}^n$.
Moreover, we require $V$ to have certain properties.
Recall that our definition of a linear function took $V$ to be the domain of $f$. The property
$$f(x) + f(y) = f(x + y)$$
only makes sense if the domain is closed under addition: if $x$ and $y$ are in $V$, then so is $x + y$. Also,
$$f(\lambda x) = \lambda f(x)$$
only makes sense if the domain is closed under scaling: if $x$ is in $V$, so is $\lambda x$.
This motivates us to make the following definition for such a domain $V$:
Definition. A subspace is a subset $V \subseteq \mathbb{R}^n$ that satisfies the following properties:
1. (Existence of Zero) The zero vector is in $V$:
$$\vec{0} \in V$$
2. (Closure under Addition) Given two vectors in $V$, their sum is also in $V$:
$$x + y \in V$$
for all $x, y \in V$.
3. (Closure under Scalar Multiplication) For any vector, all scalar multiples of that vector are in $V$:
$$\lambda x \in V$$
for all $x \in V$ and real $\lambda$.
Note that we added the Existence of Zero requirement so that the empty set is not a subspace.
3.3 How to Verify a Set is a Subspace
"I guarantee a significant percentage of students who scored a 5 on the Calculus BC cannot prove the span is a subspace."
-Leon Simon
Professor Simon's assertion is absolutely correct. The reason is that students have yet to learn how to prove a universal statement. However, after you've mastered this technique, you'll realize that verifying that a set is a subspace is straightforward. Just check the definition!
Example. The set
$$A = \left\{ \begin{pmatrix} x \\ y \\ z \end{pmatrix} \in \mathbb{R}^3 \;\middle|\; x = y = z \right\}$$
is a subspace.
Proof: We need to directly check the definition of a subspace.
Existence of Zero
Immediately
$$\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \in A$$
since its components are equal.
Closure under Addition
Consider any two vectors $x, y \in A$. By definition, they are of the form
$$x = \begin{pmatrix} c \\ c \\ c \end{pmatrix} \qquad y = \begin{pmatrix} d \\ d \\ d \end{pmatrix}$$
for some reals $c, d$. Then,
$$x + y = \begin{pmatrix} c \\ c \\ c \end{pmatrix} + \begin{pmatrix} d \\ d \\ d \end{pmatrix} = \begin{pmatrix} c + d \\ c + d \\ c + d \end{pmatrix}$$
Each of its components are equal; therefore,
$$x + y \in A.$$
Closure under Scalar Multiplication
Consider any vector $x \in A$. By definition,
$$x = \begin{pmatrix} c \\ c \\ c \end{pmatrix}$$
for some constant $c$. Then, for any real $k$,
$$kx = k\begin{pmatrix} c \\ c \\ c \end{pmatrix} = \begin{pmatrix} kc \\ kc \\ kc \end{pmatrix}$$
which again has equal components, so
$$kx \in A.$$
∎
You also need to know how to prove theorems given arbitrary subspaces.
We define:
The intersection of $A$ and $B$, denoted by $A \cap B$, is the set of all elements in both $A$ and $B$.
Using this notation, we can prove
Example. For any subspaces $A, B \subseteq \mathbb{R}^n$,
$$A \cap B$$
is a subspace.
Proof:
Existence of Zero
By definition of a subspace,
$$\vec{0} \in A \qquad \vec{0} \in B$$
Therefore,
$$\vec{0} \in A \cap B.$$
Closure under Addition
Consider any two vectors $x, y \in A \cap B$. This means
$$x \in A \qquad y \in A$$
Since $A$ is a subspace,
$$x + y \in A$$
Likewise,
$$x \in B \qquad y \in B$$
and thus,
$$x + y \in B$$
Therefore,
$$x + y \in A \cap B.$$
Closure under Scalar Multiplication
Consider any vector $x \in A \cap B$. Then,
$$x \in A \qquad x \in B$$
and so for any real $k$,
$$kx \in A \qquad kx \in B.$$
Therefore,
$$kx \in A \cap B.$$
∎
The last two examples will be of fundamental importance in Lecture 8:
Example. Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be linear and let $V \subseteq \mathbb{R}^n$ be a subspace. The image of $V$ under $f$,
$$f(V) = \{\, f(x) \mid x \in V \,\}$$
is a subspace.
Proof:
Existence of Zero
By definition of a subspace,
$$\vec{0} \in V$$
Applying linearity of $f$,
$$f(\vec{0}) = f(\vec{0} + \vec{0}) = f(\vec{0}) + f(\vec{0}).$$
This gives us
$$f(\vec{0}) = \vec{0}$$
so
$$\underbrace{\vec{0}}_{f(\vec{0})} \in f(V).$$
Closure under Addition
Consider any two vectors $x, y \in f(V)$. By definition of $f(V)$, there exist vectors $v_1, v_2 \in V$ such that
$$f(v_1) = x \qquad f(v_2) = y$$
Closure of $V$ (under addition) yields
$$v_1 + v_2 \in V.$$
This gives us
$$f(v_1 + v_2) \in f(V).$$
Applying linearity,
$$\underbrace{f(v_1) + f(v_2)}_{f(v_1 + v_2)} \in f(V),$$
i.e.
$$x + y \in f(V).$$
Closure under Scalar Multiplication
Consider any vector $x \in f(V)$. Then,
$$f(v) = x$$
for some $v \in V$. Since $V$ is a subspace, for any real $k$,
$$kv \in V.$$
Now we have
$$f(kv) \in f(V).$$
Applying linearity,
$$\underbrace{kf(v)}_{f(kv)} \in f(V),$$
i.e.
$$kx \in f(V).$$
∎
Example. Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be linear and let $V \subseteq \mathbb{R}^n$ be a subspace. The solution set of $f(x) = \vec{0}$,
$$N = \left\{\, x \in \mathbb{R}^n \;\middle|\; f(x) = \vec{0} \,\right\}$$
is a subspace.
Proof:
Existence of Zero
Applying linearity of $f$,
$$f(\vec{0}) = f(\vec{0} + \vec{0}) = f(\vec{0}) + f(\vec{0}).$$
This gives us
$$f(\vec{0}) = \vec{0}.$$
Therefore,
$$\vec{0} \in N.$$
Closure under Addition
Consider any two vectors $x, y \in N$. By definition, they satisfy
$$f(x) = \vec{0} \qquad f(y) = \vec{0}$$
By linearity,
$$f(x + y) = \underbrace{f(x)}_{\vec{0}} + \underbrace{f(y)}_{\vec{0}} = \vec{0}.$$
In other words,
$$x + y \in N.$$
Closure under Scalar Multiplication
Consider any vector $x \in N$. Then,
$$f(x) = \vec{0}.$$
For any real $k$,
$$f(kx) = k\,\underbrace{f(x)}_{\vec{0}} = \vec{0}$$
by linearity. Therefore,
$$kx \in N.$$
∎
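Here is a small numerical illustration of the last example (assuming NumPy; the particular matrix A and the vectors are my own choices): sums and scalings of solutions of f(x) = 0 remain solutions.

```python
import numpy as np

# f(x) = Ax for this particular A; its solution set N = {x : Ax = 0}
# contains every multiple of (1, 1, 1).
A = np.array([[1.0, -1.0,  0.0],
              [0.0,  1.0, -1.0]])
f = lambda x: A @ x

v = np.array([1.0, 1.0, 1.0])
w = np.array([2.0, 2.0, 2.0])

print(np.allclose(f(v), 0), np.allclose(f(w), 0))          # both are in N
print(np.allclose(f(v + w), 0), np.allclose(f(5 * v), 0))  # sum and scaling stay in N
```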
3.4 Spanning Vectors
Now that we know what subspaces are, how do we build them?
First, we start with some initial set of vectors, say
$$v_1, \; v_2, \; v_3, \; v_4, \; v_5, \; v_6$$
Then, from this initial set of vectors, we are going to grow a full subspace $S$:
[Figure: the vectors $v_1, \ldots, v_6$ sitting inside the subspace $S$ that they generate.]
Formally, we take $S$ to be the set of all linear combinations of our initial vectors:
Definition. A linear combination of $v_1, v_2, \ldots, v_k \in \mathbb{R}^n$ is a vector of the form
$$c_1v_1 + c_2v_2 + \ldots + c_kv_k$$
where $c_1, c_2, \ldots, c_k \in \mathbb{R}$. The span of $v_1, v_2, \ldots, v_k$ is the set of all such linear combinations:
$$\mathrm{span}\{v_1, v_2, \ldots, v_k\} = \{\, c_1v_1 + c_2v_2 + \ldots + c_kv_k \mid c_1, c_2, \ldots, c_k \in \mathbb{R} \,\}$$
Intuitively, we know the span is a subspace: it is the collection of all possible sums and scalings from an initial set of vectors. Of course, we need a proof:
Theorem. For any vectors $v_1, v_2, \ldots, v_k \in \mathbb{R}^n$,
$$\mathrm{span}\{v_1, v_2, \ldots, v_k\}$$
is a subspace.
Proof:
Existence of Zero
Consider a linear combination with each $c_i = 0$:
$$0v_1 + 0v_2 + \ldots + 0v_k = \vec{0}.$$
Thus,
$$\vec{0} \in \mathrm{span}\{v_1, v_2, \ldots, v_k\}.$$
Closure under Addition
Let $x, y \in \mathrm{span}\{v_1, v_2, \ldots, v_k\}$. By definition of span,
$$x = c_1v_1 + c_2v_2 + \ldots + c_kv_k \quad \text{for some } c_1, c_2, \ldots, c_k \in \mathbb{R}$$
$$y = d_1v_1 + d_2v_2 + \ldots + d_kv_k \quad \text{for some } d_1, d_2, \ldots, d_k \in \mathbb{R}$$
Then,
$$x + y = (c_1 + d_1)v_1 + (c_2 + d_2)v_2 + \ldots + (c_k + d_k)v_k,$$
which is still a linear combination of $v_1, v_2, \ldots, v_k$. Hence,
$$x + y \in \mathrm{span}\{v_1, v_2, \ldots, v_k\}.$$
Closure under Scaling
For all $x \in \mathrm{span}\{v_1, v_2, \ldots, v_k\}$,
$$x = c_1v_1 + c_2v_2 + \ldots + c_kv_k$$
for some $c_1, c_2, \ldots, c_k \in \mathbb{R}$. Then for any real $\lambda$,
$$\lambda x = \lambda(c_1v_1 + c_2v_2 + \ldots + c_kv_k) = (\lambda c_1)v_1 + (\lambda c_2)v_2 + \ldots + (\lambda c_k)v_k$$
which is again a linear combination in our span. Thus,
$$\lambda x \in \mathrm{span}\{v_1, v_2, \ldots, v_k\}.$$
∎
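In practice, one convenient way to test numerically whether a vector lies in a span is least squares; this is only an illustrative sketch (assuming NumPy; the function name in_span is mine), not a method developed in the text.

```python
import numpy as np

def in_span(vectors, x, tol=1e-9):
    """True if x is (numerically) a linear combination of the given vectors."""
    V = np.column_stack(vectors)                 # columns are v_1, ..., v_k
    c, *_ = np.linalg.lstsq(V, x, rcond=None)    # best coefficients c_1, ..., c_k
    return np.linalg.norm(V @ c - x) < tol

v1 = np.array([1.0, 1.0, 1.0])
v2 = np.array([1.0, 1.0, 0.0])

print(in_span([v1, v2], np.array([3.0, 3.0, 2.0])))   # True:  2*v1 + 1*v2
print(in_span([v1, v2], np.array([1.0, 0.0, 0.0])))   # False: outside the plane they span
```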
At this point, you should be asking yourself:
- Can all subspaces be built from a span?
- Can you write one subspace as two different spans?
- Are all the vectors in our spanning list necessary?
The answer to the first question, remarkably, is yes. We will prove this fact in Lecture 6.
The answer to the second question is also yes:
$$\mathrm{span}\left\{\begin{pmatrix}1\\1\end{pmatrix}, \begin{pmatrix}1\\0\end{pmatrix}\right\} \quad \text{and} \quad \mathrm{span}\left\{\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\1\end{pmatrix}\right\}$$
are two different ways of writing $\mathbb{R}^2$. In fact, we will prove that we can find a "best" set in Lecture 31.
As for the third question, the answer is: not always. Consider the two spans,
$$\mathrm{span}\left\{\begin{pmatrix}1\\1\\1\end{pmatrix}, \begin{pmatrix}1\\1\\0\end{pmatrix}\right\} \qquad \mathrm{span}\left\{\begin{pmatrix}1\\1\\1\end{pmatrix}, \begin{pmatrix}1\\1\\0\end{pmatrix}, \begin{pmatrix}3\\3\\2\end{pmatrix}\right\}$$
These two spans are the same set. This is because the extra vector in the second list is redundant:
$$\begin{pmatrix}3\\3\\2\end{pmatrix} = 2\begin{pmatrix}1\\1\\1\end{pmatrix} + \begin{pmatrix}1\\1\\0\end{pmatrix}.$$
Formally, we say that the list is linearly dependent.
Before we can go into linear dependence, we are going to describe a very powerful proof technique called Proof by Contradiction. This will be useful not only in proving results about linear independence, but in many proofs throughout your mathematical career. Next, we will clarify, rigorously, the meaning of "equivalent" by showing how to prove "if and only if" statements. Finally, we will apply our new techniques to prove that showing linear independence of a set of vectors is equivalent to showing that none of these vectors can be written as a linear combination of the others.
3.5 Proof Technique: Proof by Contradiction
"Mathematicians are strange creatures, one may observe: They go into long arguments based on assumptions they know are false, and their happiest moments are when they find a contradiction between statements they have proved."
-L. Lovasz, Discrete Mathematics
One of the basic mathematical axioms is the law of excluded middle:
For any statement P, either P is true or P is not true.
In case you want to get it tattooed, the symbolic shorthand is
$$P \lor \neg P$$
In plain English, this means that for any event, it is always the case that it either happens or it does not happen.
For example, the following are always true:¹
- I left my glasses in the library or I did not leave my glasses in the library.
- You beat me at chess or you didn't beat me at chess.
- Carly Rae Jepsen will call or Carly Rae Jepsen will not call.
In each case, exactly one of the two possibilities is always true: they cannot both be true (by definition of "not") and at least one must be true by our axiom.
Why is this useful? We can assume one choice. If the assumption of this choice leads to a contradiction, we know the other choice must have been true! This approach is known as Proof by Contradiction.
There are many reasons why you should use Proof by Contradiction:
¹ Such statements that are always true are called tautologies.
- Proof by contradiction gives us a jumping point:
Math Mantra: Assuming the negation of the statement you are trying to prove gives you extra information to work with and can start you in the right direction.
- Proof by contradiction is often the natural way to approach a problem:
Math Mantra: To prove a definition that is the negation of another, it is natural to use proof by contradiction.
For example, "irrational" is defined as "not rational," "disconnected" means "not connected," and in our case, "linearly independent" means "not linearly dependent." In these cases, it is very natural to use proof by contradiction!
Proof by contradiction is very beautiful and is perhaps the most useful proof technique you will use during your undergraduate career. In the words of my Math 171 TA, David Sher:
"Start every proof with 'suppose not'!"
As usual with new proof techniques, I give a few examples:
Example. The product of an irrational number and a non-zero rational number is always irrational.
Proof: Suppose not. Then there exists an irrational number $x$ and a non-zero rational $y$ whose product is rational:
$$xy = \frac{p_1}{q_1}$$
for some integers $p_1, q_1$. Since $y$ is rational, by definition,
$$y = \frac{p_2}{q_2}$$
for some integers $p_2, q_2$. Plugging $y$ back into the original equation, we get
$$x\,\underbrace{\frac{p_2}{q_2}}_{y} = \frac{p_1}{q_1}.$$
Thus,
$$x = \frac{p_1 q_2}{q_1 p_2}.$$
But this means that $x$ is rational, a contradiction. In conclusion, a product of an irrational and a non-zero rational is always irrational.
∎
One of the fundamental facts about the integers is that any integer $n \geq 2$ can be written as a product of primes:
$$n = p_1 p_2 \cdots p_k.$$
For example,
$$360 = 2 \cdot 2 \cdot 2 \cdot 3 \cdot 3 \cdot 5.$$
We will use this fact to prove the following result:
Example. If $n$ is composite, then $n$ is divisible by some prime less than or equal to $\sqrt{n}$.
Proof Summary:
- Suppose not: $n$'s prime factors are all greater than $\sqrt{n}$.
- $n$ has at least 2 primes in its factorization (not necessarily distinct).
- Conclude $n > n$, contradiction.
Proof: Suppose not. Then there is some composite $n$ whose prime factors are all greater than $\sqrt{n}$. Moreover, we also know that $n$ has at least two prime factors in its prime factorization (otherwise $n$ would be prime):
$$n = p_1 p_2 \cdots p_k$$
for $k \geq 2$. In particular, $n$ must be greater than (or equal to) the product of its first two prime factors:
$$n \geq p_1 p_2$$
But by our assumption on $n$,
$$p_1 p_2 > \sqrt{n}\,\sqrt{n} = n.$$
This gives us
$$n > n$$
which is impossible. In conclusion, every composite number $n$ is divisible by some prime less than or equal to $\sqrt{n}$.
∎
Even though this is a fast proof, it gives us a neat way to check if a number is prime. For example, to prove
499 is prime
we just need to check that 499 is not divisible by
2, 3, 5, 7, 11, 13, 17, 19.
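That observation is exactly the classic trial-division primality test. A minimal Python sketch (the function name is mine):

```python
def is_prime(n):
    """Check primality by trial division up to sqrt(n)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:        # only divisors d with d <= sqrt(n) need checking
        if n % d == 0:
            return False
        d += 1
    return True

print(is_prime(499))                                 # True
print([p for p in range(2, 20) if 499 % p == 0])     # [] -- none of 2..19 divide 499
```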
The next example is the classic result that $\sqrt{2}$ is irrational:
Example. $\sqrt{2}$ is irrational.
Proof Summary:
- Suppose $\sqrt{2}$ is rational: $\sqrt{2} = a/b$ with $\gcd(a, b) = 1$.
- Show $a$ is even.
- Substitute $a = 2n$ and show $b$ is even.
- Contradict $\gcd(a, b) = 1$.
Proof: Suppose $\sqrt{2}$ is rational. Then,
$$\sqrt{2} = \frac{a}{b}$$
where we can assume that $a, b$ are integers such that $\gcd(a, b) = 1$ (we can always divide out common factors). Then, by isolating $a$ and squaring both sides, we have
$$2b^2 = a^2.$$
So $a^2$ is even. This implies $a$ must be even: otherwise, if $a$ were odd, $a^2$ would be odd, which is not the case. Thus,
$$a = 2n$$
for some integer $n$. Then, substituting this back into the equation,
$$2b^2 = \underbrace{(2n)^2}_{a^2}.$$
Simplifying, we get
$$b^2 = 2n^2.$$
So $b^2$ is even, and thus $b$ is even. But this means that $a$ and $b$ are both divisible by 2, contradicting $\gcd(a, b) = 1$. Thus we have a contradiction and $\sqrt{2}$ is irrational.
∎
Next, we prove one of my favorite theorems.¹ It's a great example for recent Calculus BC students:
¹ I would like to say that I like this theorem because we can use it to prove Euler's Theorem in Complex Analysis. However, the true reason it brings a smile to my face is that Leon Simon once asked one of my friends, who was sleeping in class, to prove it. My friend was completely dumbfounded and Professor Simon retorted with his Australian accent, "Caughtcha napping!" Good times.
Example. If a differentiable function $f$ satisfies $f'(x) = 0$ for every $x$, then $f$ is constant.
Proof Summary:
- Suppose $f$ is not constant.
- Apply the Mean Value Theorem to two different points.
- Contradict $f'(x) = 0$.
Proof: Suppose $f$ is not constant. Then $f$ must have two different values at two points, say $a$ and $b$, such that
$$f(a) \neq f(b).$$
Recall that the Mean Value Theorem states:
Given any open interval, we can find a point in this interval where the derivative of $f$ is equal to the slope of the secant line.
Formally, using the interval $(a, b)$ we can find $c \in (a, b)$ such that
$$f'(c) = \frac{f(b) - f(a)}{b - a}.$$
But $f'(c)$ is not zero since $f(b) \neq f(a)$! Thus, we have a contradiction, so $f(x)$ must be constant.
∎
In summary, Proof by Contradiction is one of the most powerful techniques in mathematics. In fact, there are some theorems that are very hard to prove if you do not allow proof by contradiction (the constructivists lurking in the math department can attest to this). Personally, I have no clue how to prove a number is irrational without using proof by contradiction!
Lastly, a tricky aspect of proof by contradiction is that
You may not know what contradiction you need to reach.
In one example, we contradicted that a fraction was in reduced form. Whereas in another example, we contradicted $n = n$. The possible contradictions you can form are way out there: for example, in the proof that $e$ is irrational, you create a number that is both an integer and not an integer! But finding what you need to contradict is one of the things that makes mathematics an art.
3.6 Proof Technique: If and Only If
Me: If you are good at Physics, then you are good at Calculus.
Dhruv: But I suck at Physics.
Me: But you can still be good at Calculus!
In Calculus BC, the whole concept of a limit was very informal. You knew it intuitively, but you never had to prove it formally. The same was true of the double arrow sign:
$$\Longleftrightarrow$$
You knew it meant "equivalently," but you never talked about it formally. The most you probably did was fill in some truth table. But,
Math Mantra: If you want to study mathematics, you have to be precise and everything you do must have meaning!
Let $A$ and $B$ be statements. Then,
$$A \Longrightarrow B$$
means
"Assuming A, you can derive the outcome B."
Likewise,
$$A \Longleftarrow B$$
means
"Assuming B, you can derive the outcome A."
For example, let $A$ and $B$ represent
A : x is a prime greater than 2
B : x is odd
Then $A \Longrightarrow B$ means
"If you assume x is a prime number greater than 2, then you can derive the outcome x is odd."
The double arrow notation
$$A \Longleftrightarrow B$$
means $A$ is equivalent to $B$ in the sense that
"if you assume A, then you can derive B"
and
"if you assume B, then you can derive A."
So logically, statement A is really the same as statement B.
For example, consider the implication:
$$x = 1 \;\Longrightarrow\; x^2 = 1$$
This is true. However, it is false to write
$$x = 1 \;\Longleftrightarrow\; x^2 = 1$$
If we start at $x = 1$, then it is the case that $x^2 = 1$. But we cannot go backwards:¹ even if $x^2 = 1$, it is possible that $x = -1$.
However, the following is true:
$$x + y = 2 \;\Longleftrightarrow\; x = 2 - y.$$
This is because you can perform an invertible operation to get from one equation to another.
¹ Unless we are working with non-negative numbers.
Here are three proofs that use "if and only if":
Example. An integer $n$ is even if and only if $n + 1$ is odd.
Proof:
($\Longrightarrow$) Let $n$ be even. Then
$$n = 2k$$
for some integer $k$. But then
$$n + 1 = 2k + 1$$
which is the definition of odd.
($\Longleftarrow$) Let $n + 1$ be odd. Then
$$n + 1 = 2k + 1$$
for some integer $k$. But then
$$n = 2k$$
which is again even by definition.
∎
Notice we could have compressed both directions into a single proof. This is often possible when each step in the proof is invertible.
Proof (Compressed):
$$n = 2k \text{ for some integer } k \;\Longleftrightarrow\; n + 1 = 2k + 1 \text{ for some integer } k$$
∎
The next example requires a fundamental fact about the integers:
Fundamental Theorem of Arithmetic: Every integer $n \geq 2$ has a unique prime factorization (up to reordering the multiplication):
$$n = p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_r^{\alpha_r}$$
for some distinct primes $p_1, p_2, \ldots, p_r$ and integer powers $\alpha_1, \alpha_2, \ldots, \alpha_r \geq 1$.
For example,
$$2250 = 2 \cdot 3^2 \cdot 5^3.$$
Example. A positive integer is a perfect square if and only if every power in its prime factorization is even.
Proof:
($\Longrightarrow$) Let $n$ be a perfect square. Then $n = k^2$ by definition of a perfect square. But $k$ has a prime factorization by the Fundamental Theorem of Arithmetic. So,
$$k = p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_r^{\alpha_r}$$
for some distinct primes $p_1, p_2, \ldots, p_r$ and powers $\alpha_1, \alpha_2, \ldots, \alpha_r$. Thus,
$$n = k^2 = \big(p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_r^{\alpha_r}\big)^2 = p_1^{2\alpha_1} p_2^{2\alpha_2} \cdots p_r^{2\alpha_r}$$
is the prime factorization of $n$ and it contains only even powers.
($\Longleftarrow$) Let $n$ have only even powers in its prime factorization. Then,
$$n = p_1^{2\alpha_1} p_2^{2\alpha_2} \cdots p_r^{2\alpha_r}$$
for some distinct primes $p_1, p_2, \ldots, p_r$ and powers $\alpha_1, \alpha_2, \ldots, \alpha_r$. But then,
$$n = p_1^{2\alpha_1} p_2^{2\alpha_2} \cdots p_r^{2\alpha_r} = \big(p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_r^{\alpha_r}\big)^2 = k^2$$
for integer $k = p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_r^{\alpha_r}$. Thus $n$ is a perfect square.
∎
Once again, the proof is compressible:
Proof (Compressed):
$$n = k^2 \;\Longleftrightarrow\; n = \big(p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_r^{\alpha_r}\big)^2 \;\Longleftrightarrow\; n = p_1^{2\alpha_1} p_2^{2\alpha_2} \cdots p_r^{2\alpha_r}$$
∎
Example. An integer n 2 is composite if and only if n is divisible by some prime p < n.
Proof:

We already proved this via contradiction.
1

Since n is divisible by some prime p < n, n is composite by denition.
Example. Given a dierentiable function f, f

(x) = 0 for every x if and only if f(x) is constant.


Proof:

We already proved this via contradiction.

The derivative of a constant function is 0.
Sometimes if and only if will be a godsend: it will give you two equivalent statements that are useful
at dierent times (you will see this in the discussion of closed sets and limits). Or it could be used to
reduce some unwieldy monstrosity of a denition to an easy equivalent statement.
3.7 Linear Independence
Now that we have discussed two of the most important proof techniques, let's return to the original question:
When are vectors in our spanning list redundant?
We make the following definitions:
Definition. We say a set of vectors $v_1, v_2, \ldots, v_n \in \mathbb{R}^n$ is linearly dependent if there exist coefficients $c_1, c_2, \ldots, c_n$, not all zero, such that
$$c_1v_1 + c_2v_2 + \ldots + c_nv_n = \vec{0}.$$
We also say a set is linearly independent if that set is not linearly dependent.
Using this definition, the answer to our original question is:
The spanning list has no redundant vector if and only if the spanning list is linearly independent.
Generally,
Theorem. A set is linearly independent if and only if no vector in the set can be written as a linear combination of other vectors in the set.
Proof:
($\Longrightarrow$) Let $v_1, v_2, \ldots, v_n$ be a linearly independent set of vectors. Suppose, for an eventual contradiction, that some vector $v_i$ in the set can be written as a linear combination of the other vectors. Then, by reordering (the jargon is WLOG, "without loss of generality": we claim that we can solve the general case through the solution of a specific case), we can assume
$$v_i = v_1$$
and
$$v_1 = c_2v_2 + c_3v_3 + \ldots + c_nv_n$$
for some (possibly zero) $c_2, c_3, \ldots, c_n \in \mathbb{R}$. By moving all terms to one side,
$$(-1)v_1 + c_2v_2 + c_3v_3 + \ldots + c_nv_n = \vec{0}$$
But this is a non-trivial linear combination equal to the zero vector, since $c_1 = -1$, contradicting linear independence. In conclusion, no vector in the set can be written as a linear combination of other vectors.
($\Longleftarrow$) Assume no vector in the set can be written as a linear combination of other vectors in the set. Suppose the set is not linearly independent. Then the set is, by definition, linearly dependent and there exist constants $c_1, c_2, \ldots, c_n$, not all zero, such that
$$c_1v_1 + c_2v_2 + \ldots + c_nv_n = \vec{0}$$
We know that one of these coefficients is non-zero, so without loss of generality, we can reorder these vectors so that $c_1$ is non-zero. Then, isolating $v_1$,
$$v_1 = -\frac{c_2}{c_1}v_2 - \frac{c_3}{c_1}v_3 - \ldots - \frac{c_n}{c_1}v_n$$
But we have written one vector as a linear combination of the remaining vectors, contradicting our initial assumption.
∎
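Numerically, one common way to test linear independence is via the rank of the matrix whose columns are the vectors; this is only an illustrative sketch (assuming NumPy, and using "rank," a notion we have not formally defined yet), run on the redundant list from earlier in the lecture.

```python
import numpy as np

def linearly_independent(vectors):
    """True if the only solution of c1*v1 + ... + ck*vk = 0 is the trivial one."""
    V = np.column_stack(vectors)
    return np.linalg.matrix_rank(V) == len(vectors)

v1 = np.array([1.0, 1.0, 1.0])
v2 = np.array([1.0, 1.0, 0.0])
v3 = np.array([3.0, 3.0, 2.0])               # v3 = 2*v1 + v2

print(linearly_independent([v1, v2]))        # True
print(linearly_independent([v1, v2, v3]))    # False (the list is linearly dependent)
```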
3.8 On Keeping One's Word: Cauchy-Schwarz Equality
Now that we have the tools, we can answer our Cauchy-Schwarz question from Lecture 2:
When is equality achieved in the Cauchy-Schwarz inequality?
The answer: equality in Cauchy-Schwarz holds if and only if the two vectors are parallel:
Theorem.
$$|x \cdot y| = \|x\|\,\|y\|$$
if and only if $x = ty$ for some $t \in \mathbb{R}$.
Proof:
($\Longleftarrow$) Let $x = ty$ for some $t$ in $\mathbb{R}$. Then, we need to show
$$|\underbrace{ty}_{x} \cdot\, y| = \|\underbrace{ty}_{x}\|\,\|y\|.$$
Starting from the left, use absolute value properties to rewrite
$$|ty \cdot y| = |t(y \cdot y)| = |t|\,|y \cdot y|$$
Converting to norms, this is
$$|t|\,|y \cdot y| = |t|\,\|y\|^2.$$
Now, pull the $|t|$ inside one of the norms:
$$|t|\,\|y\|^2 = \|ty\|\,\|y\|.$$
($\Longrightarrow$) To show two vectors are equal, we can exploit the zero property:
$$\|x\| = 0 \;\Longleftrightarrow\; x = \vec{0}.$$
Particularly, if we can find a $t$ such that
$$\|x - ty\| = 0$$
then
$$x = ty$$
But how do we find such a $t$?
Math Mantra: We assume¹ for the moment that something exists to extract information on what it MUST look like and form a guess. Then, with our guess, we retry the proof.
Let's assume, for the moment, a $t$ exists such that
$$\|x - ty\| = 0$$
Since it is easier to deal with squares, we square both sides to get
$$\|x - ty\|^2 = (x - ty) \cdot (x - ty) = x \cdot x - 2t\,x \cdot y + t^2\,y \cdot y = \underbrace{\|x\|^2}_{c} \;\underbrace{-\,2(x \cdot y)\,t}_{bt} \;+\; \underbrace{t^2\|y\|^2}_{at^2} = 0$$
¹ This is a cheat that we will use again when we see determinants.
Therefore, if such a $t$ were to exist, it would be a root of the quadratic expression above. Applying the quadratic formula,
$$t = \frac{2\,x \cdot y \pm \sqrt{(2\,x \cdot y)^2 - 4\|x\|^2\|y\|^2}}{2\|y\|^2} = \frac{x \cdot y \pm \sqrt{(x \cdot y)^2 - \|x\|^2\|y\|^2}}{\|y\|^2}$$
But we assumed Cauchy-Schwarz equality holds; thus, we can replace the inner dot product with the product of norms:
$$t = \frac{x \cdot y \pm \overbrace{\sqrt{(x \cdot y)^2 - \|x\|^2\|y\|^2}}^{=\,0}}{\|y\|^2} = \frac{x \cdot y}{\|y\|^2}.$$
Are we done? No. We only solved for what $t$ must look like. Now, we must erase all our previous work and define
$$t = \frac{x \cdot y}{\|y\|^2}.$$
We then show that our constructed $t$ satisfies
$$\|x - ty\| = 0$$
Starting from the left, we square to get
$$\|x - ty\|^2 = \|x\|^2 - 2t\,x \cdot y + t^2\|y\|^2$$
Substituting $t$, we have
$$\|x\|^2 - 2\underbrace{\left(\frac{x \cdot y}{\|y\|^2}\right)}_{t} x \cdot y + \underbrace{\left(\frac{x \cdot y}{\|y\|^2}\right)^2}_{t^2}\|y\|^2$$
which is the same as
$$\|x\|^2 - 2\,\frac{(x \cdot y)^2}{\|y\|^2} + \frac{(x \cdot y)^2}{\|y\|^2}.$$
But we assumed Cauchy-Schwarz equality holds; thus, this is equal to
$$\|x\|^2 - 2\,\frac{\big(\|x\|\,\|y\|\big)^2}{\|y\|^2} + \frac{\big(\|x\|\,\|y\|\big)^2}{\|y\|^2}.$$
Expanding, we now have
$$\|x\|^2 - 2\|x\|^2 + \|x\|^2 = 0$$
Since we have shown that
$$\|x - ty\|^2 = 0$$
we conclude
$$\|x - ty\| = 0$$
and thus
$$x = ty.$$
Are we done now? No! Just like in the proof of the Cauchy-Schwarz inequality,
"There is a slight fly in the ointment!"
Again, our definition of $t$ requires $y \neq \vec{0}$, which need not be the case. However, the proof is trivial in the case $y = \vec{0}$ since
$$0\,x = y.$$
Now we're done!
∎
Let's use this result on our fun Cauchy-Schwarz examples from Lecture 2!
Example. For any angle $\theta$,
$$|\cos^2\theta - \sin^2\theta| \leq 1.$$
Equality holds if and only if, for some constant $c$,
$$\begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix} = \begin{pmatrix} c\cos\theta \\ -c\sin\theta \end{pmatrix}.$$
Equating components,
$$\cos\theta = c\cos\theta \qquad \sin\theta = -c\sin\theta$$
Equivalently,
$$(1 - c)\cos\theta = 0 \qquad (1 + c)\sin\theta = 0$$
The first equation holds if and only if
$$c = 1 \quad \text{or} \quad \theta = \frac{\pi}{2} + n\pi \text{ for any integer } n,$$
whereas the second equation requires
$$c = -1 \quad \text{or} \quad \theta = n\pi \text{ for any integer } n.$$
In the case $c = 1$, we can plug $c$ into the second equation to get
$$2\sin\theta = 0$$
which holds if and only if
$$\theta = n\pi \text{ for any integer } n.$$
Likewise, plugging the case $c = -1$ into the first equation yields
$$2\cos\theta = 0$$
which is true if and only if
$$\theta = \frac{\pi}{2} + n\pi \text{ for any integer } n.$$
Therefore, both equations hold if and only if
$$\theta = n\pi \text{ for any integer } n \quad \text{or} \quad \theta = \frac{\pi}{2} + n\pi \text{ for any integer } n.$$
In conclusion, $|\cos^2\theta - \sin^2\theta| = 1$ if and only if
$$\theta = \frac{n\pi}{2} \text{ for any integer } n.$$
Example. For any $a, b, c > 0$, we have
$$(a + b + c)\left(\frac{1}{a} + \frac{1}{b} + \frac{1}{c}\right) \geq 9$$
Equality holds if and only if, for some constant $t$,
$$\begin{pmatrix} \sqrt{a} \\ \sqrt{b} \\ \sqrt{c} \end{pmatrix} = \begin{pmatrix} t/\sqrt{a} \\ t/\sqrt{b} \\ t/\sqrt{c} \end{pmatrix}$$
Equating components,
$$\sqrt{a} = \frac{t}{\sqrt{a}} \qquad \sqrt{b} = \frac{t}{\sqrt{b}} \qquad \sqrt{c} = \frac{t}{\sqrt{c}}.$$
Isolating $t$ in each equation, we see that equality holds if and only if
$$a = b = c.$$
Example. Let $a_1, a_2, \ldots, a_n \in \mathbb{R}$. Then
$$\left(\frac{1}{n}\sum_{i=1}^{n} a_i\right)^2 \leq \frac{1}{n}\sum_{i=1}^{n} a_i^2.$$
Equality holds if and only if, for some constant $c$,
$$\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} c/n \\ c/n \\ \vdots \\ c/n \end{pmatrix}.$$
But this is equivalently
$$a_1 = a_2 = \ldots = a_n.$$
Example. Let $a_1, a_2, \ldots, a_n$ be positive. Then,
$$a_1 + a_2 + \ldots + a_n \leq \frac{a_1^2}{a_2} + \frac{a_2^2}{a_3} + \ldots + \frac{a_{n-1}^2}{a_n} + \frac{a_n^2}{a_1}$$
Equality holds if and only if, for some constant $c$,
$$\begin{pmatrix} \sqrt{a_2} \\ \sqrt{a_3} \\ \vdots \\ \sqrt{a_n} \\ \sqrt{a_1} \end{pmatrix} = \begin{pmatrix} c\,a_1/\sqrt{a_2} \\ c\,a_2/\sqrt{a_3} \\ \vdots \\ c\,a_{n-1}/\sqrt{a_n} \\ c\,a_n/\sqrt{a_1} \end{pmatrix}.$$
This tells us
$$a_2 = ca_1, \quad a_3 = ca_2, \quad \ldots, \quad a_n = ca_{n-1}, \quad a_1 = ca_n.$$
Substituting each equality into the next yields
$$a_1 = c^n a_1.$$
Equivalently,
$$(1 - c^n)a_1 = 0.$$
Thus,
$$a_1 = 0 \quad \text{or} \quad c^n = 1.$$
But $a_1 \neq 0$ by definition. Moreover, $c \neq -1$ since $a_2 = ca_1$ and $a_2, a_1$ are both positive. Therefore, equality holds when and only when $c = 1$. Equivalently, when $c = 1$,
$$a_1 = a_2 = \ldots = a_n.$$
New Notation
- $A \cap B$ — "the intersection of sets $A$ and $B$." Example: $\mathbb{Q} \cap \mathbb{R} = \mathbb{Q}$ ("the intersection of the set of reals and the set of rationals is the set of rationals").
- $\mathrm{span}\{v_1, v_2, \ldots, v_n\}$ — "the span of the set of vectors $v_1, v_2, \ldots, v_n$." Example: $\mathrm{span}\{v_1\} \subseteq \mathbb{R}^2$ ("the span of $v_1$ is contained in $\mathbb{R}^2$").
- $A \Longrightarrow B$ — "$A$ implies $B$." Example: $x = 1 \Longrightarrow x^2 = 1$ ("if a number is equal to one, then its square is 1").
- $A \Longleftarrow B$ — "$B$ implies $A$." Example: $x = 0 \text{ or } y = 0 \Longleftarrow xy = 0$ ("if the product of two numbers is zero, then at least one of them is zero").
- $A \Longleftrightarrow B$ — "$A$ if and only if $B$." Example: $x = y + 2 \Longleftrightarrow x - 2 = y$ ("$x = y + 2$ if and only if $x - 2 = y$").
Lecture 4
Under-determined Potential
"Think around the statement (the Under-determined Systems Lemma). Try to uncover its meaning and how one could perhaps approach a proof of it. Look at special cases to see if they shed light on the statement. Maybe the statement is false!"
-Leon Simon
Goals: Today, we focus on systems of homogeneous linear equations. In order to simplify them without changing the solutions, we apply the first step of Gaussian Elimination. This will be the key step in proving the Under-determined Systems Lemma. We will also need to apply two new proof methods: the intuitive method of proof by cases, and the often mistaught proof by induction. We then use the Under-determined Systems Lemma to prove the incredibly important Linear Dependence Lemma.
4.1 System of Equations
In Algebra II, you learned how to solve a system of equations like
$$\begin{aligned}
3x + 2y + z &= 1\\
2x + y + 4z &= 3\\
5x + 2y - z &= 0
\end{aligned}$$
Of course, we can generalize to a system with $m$ equations and $n$ unknowns:
$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \ldots + a_{1n}x_n &= b_1\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \ldots + a_{2n}x_n &= b_2\\
&\;\;\vdots\\
a_{m1}x_1 + a_{m2}x_2 + a_{m3}x_3 + \ldots + a_{mn}x_n &= b_m
\end{aligned}$$
Here, $a_{ij}$ is the coefficient¹ of the $x_j$ term in the $i$-th equation and $b_i$ is the right-hand side of the $i$-th equation.
¹ Sometimes I wish the notation were $a^i_j$. Especially since, for example, $a_{(m+1)2}$ can be misconstrued as $a_{(2m+2)}$. Nevertheless, we reserve such superscript notation for exponents and sequences.
In this lecture we are going to consider only systems with each $b_i = 0$. We call such systems homogeneous:
$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \ldots + a_{1n}x_n &= 0\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \ldots + a_{2n}x_n &= 0\\
&\;\;\vdots\\
a_{m1}x_1 + a_{m2}x_2 + a_{m3}x_3 + \ldots + a_{mn}x_n &= 0
\end{aligned}$$
Why are we considering only homogeneous systems instead of general (possibly inhomogeneous) systems?
- We are always guaranteed a solution.
Namely, we always have the trivial solution
$$x_1 = x_2 = \ldots = x_n = 0$$
Thus, we don't have to worry about having no solutions, which, as you recall, arises from inconsistent equations like
$$x_1 + 2x_2 = 1$$
$$x_1 + 2x_2 = 0$$
If such $x_1, x_2$ did exist, then $1 = 0$, which is a big no-no.
- The homogeneous solution is needed to construct the solution of an inhomogeneous system!
Without going into too much detail, the reason is analogous to linear functions. Consider a linear function $f$ and a vector $y$ that satisfies
$$f(y) = b$$
Let $y^*$ be a solution to the homogeneous analogue
$$f(y^*) = 0.$$
Then,
$$f(y + y^*) = f(y) + f(y^*) = b + 0 = b.$$
This means the sum of a solution of the inhomogeneous system and of the homogeneous system is still a solution of the inhomogeneous system!
- We will only need homogeneous systems to prove the almighty Linear Dependence Lemma.
The first fact we will prove about homogeneous systems is the Under-determined Systems Lemma. But to prove this fact, we first need to talk about the Gaussian Elimination process. Here, we only focus on the first step, but in Lecture 12 we will use the full force of Gaussian Elimination to compute reduced row echelon forms.
4.2 Gaussian Elimination: Step One
The first step of Gaussian Elimination is used to help determine the solution for the first unknown, $x_1$. This is exactly what you did back in those matrix days, when simplifying
$$\begin{pmatrix} 0 & 2 & 0 & 1\\ 2 & 1 & 0 & 2\\ 3 & 1 & 1 & 3 \end{pmatrix}$$
Except we are going to be a lot more formal about it. Particularly, we are going to reserve matrix notation for Lecture 7. So pretend, for the moment, that matrices don't exist and just phrase everything in terms of systems of equations.
Consider the system
$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \ldots + a_{1n}x_n &= 0\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \ldots + a_{2n}x_n &= 0\\
&\;\;\vdots\\
a_{m1}x_1 + a_{m2}x_2 + a_{m3}x_3 + \ldots + a_{mn}x_n &= 0
\end{aligned}$$
The first step of Gaussian Elimination performs one of two actions:
Case 1: If the coefficient $a_{i1}$ of $x_1$ in each equation is 0 (the first column contains all zeros),
$$\begin{aligned}
0x_1 + a_{12}x_2 + a_{13}x_3 + \ldots + a_{1n}x_n &= 0\\
0x_1 + a_{22}x_2 + a_{23}x_3 + \ldots + a_{2n}x_n &= 0\\
&\;\;\vdots\\
0x_1 + a_{m2}x_2 + a_{m3}x_3 + \ldots + a_{mn}x_n &= 0
\end{aligned}$$
you are done.
Case 2: Otherwise, in some equation $k$, the coefficient $a_{k1} \neq 0$. Swap the first and $k$-th equations:
$$\begin{aligned}
a_{k1}x_1 + a_{k2}x_2 + a_{k3}x_3 + \ldots + a_{kn}x_n &= 0\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \ldots + a_{2n}x_n &= 0\\
&\;\;\vdots\\
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \ldots + a_{1n}x_n &= 0\\
&\;\;\vdots\\
a_{m1}x_1 + a_{m2}x_2 + a_{m3}x_3 + \ldots + a_{mn}x_n &= 0
\end{aligned}$$
Then divide the first equation by $a_{k1}$ to get
$$1x_1 + \frac{a_{k2}}{a_{k1}}x_2 + \frac{a_{k3}}{a_{k1}}x_3 + \ldots + \frac{a_{kn}}{a_{k1}}x_n = 0.$$
Subtract the proper multiples of the first equation from each of the remaining equations, so that each coefficient of $x_1$ is zero:
$$\begin{aligned}
1x_1 + a'_{12}x_2 + a'_{13}x_3 + \ldots + a'_{1n}x_n &= 0\\
0x_1 + a'_{22}x_2 + a'_{23}x_3 + \ldots + a'_{2n}x_n &= 0\\
&\;\;\vdots\\
0x_1 + a'_{m2}x_2 + a'_{m3}x_3 + \ldots + a'_{mn}x_n &= 0
\end{aligned}$$
Note the new $a'_{ij}$: we simply relabeled coefficients because we do not want to be distracted by fractions and subtractions. We only care about the structure of the first column.
Example. Compute the first step of Gaussian Elimination for
$$\begin{aligned}
0x_1 + 2x_2 + 1x_3 &= 0\\
2x_1 + 2x_2 + 4x_3 &= 0\\
6x_1 + 2x_2 - 1x_3 &= 0
\end{aligned}$$
We switch the second row with the first, yielding
$$\begin{aligned}
2x_1 + 2x_2 + 4x_3 &= 0\\
0x_1 + 2x_2 + 1x_3 &= 0\\
6x_1 + 2x_2 - 1x_3 &= 0
\end{aligned}$$
Then, divide the first equation by the coefficient of $x_1$:
$$\begin{aligned}
1x_1 + 1x_2 + 2x_3 &= 0\\
0x_1 + 2x_2 + 1x_3 &= 0\\
6x_1 + 2x_2 - 1x_3 &= 0
\end{aligned}$$
Finally, subtract some multiple of the first equation from each of the remaining equations, so that their $x_1$ coefficients are 0:
$$\begin{aligned}
1x_1 + 1x_2 + 2x_3 &= 0\\
0x_1 + 2x_2 + 1x_3 &= 0\\
0x_1 - 4x_2 - 13x_3 &= 0
\end{aligned}$$
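The first step is mechanical enough to write down as code. Here is an illustrative Python sketch (the function name is mine, and representing the system as plain coefficient lists is my own choice), run on the example above:

```python
def gaussian_step_one(rows):
    """First step of Gaussian Elimination on a homogeneous system.

    `rows` is a list of coefficient lists [a_i1, ..., a_in]; the right-hand
    sides are all 0, so we do not store them.  Returns the new coefficients.
    """
    rows = [row[:] for row in rows]          # do not modify the input
    # Case 1: the first column is all zeros -- nothing to do.
    k = next((i for i, row in enumerate(rows) if row[0] != 0), None)
    if k is None:
        return rows
    # Case 2: swap equation k to the top, divide it by its leading coefficient,
    # then subtract multiples of it from the remaining equations.
    rows[0], rows[k] = rows[k], rows[0]
    pivot = rows[0][0]
    rows[0] = [a / pivot for a in rows[0]]
    for i in range(1, len(rows)):
        factor = rows[i][0]
        rows[i] = [a - factor * b for a, b in zip(rows[i], rows[0])]
    return rows

system = [[0, 2, 1], [2, 2, 4], [6, 2, -1]]   # the example above
for row in gaussian_step_one(system):
    print(row)
# [1.0, 1.0, 2.0]
# [0.0, 2.0, 1.0]
# [0.0, -4.0, -13.0]
```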
In summary, the key observation is that after the first step of Gaussian Elimination, our system is transformed to one of two possible forms. Either the first column is all zeros:
$$\begin{aligned}
0x_1 + a'_{12}x_2 + \ldots + a'_{1n}x_n &= 0\\
0x_1 + a'_{22}x_2 + \ldots + a'_{2n}x_n &= 0\\
&\;\;\vdots\\
0x_1 + a'_{m2}x_2 + \ldots + a'_{mn}x_n &= 0
\end{aligned}$$
or the first equation has leading coefficient 1 and every other equation has first coefficient 0:
$$\begin{aligned}
1x_1 + a'_{12}x_2 + \ldots + a'_{1n}x_n &= 0\\
0x_1 + a'_{22}x_2 + \ldots + a'_{2n}x_n &= 0\\
&\;\;\vdots\\
0x_1 + a'_{m2}x_2 + \ldots + a'_{mn}x_n &= 0
\end{aligned}$$
By the way, you should ask yourself:
Does Gaussian Elimination preserve the solution space? Does it create or destroy solutions?
Intuitively, it is clear that the solutions are unchanged, but we should provide a proof. However, such a proof requires that we prove two sets are equal, a technique I shall delay until a more urgent need arises.
The proof techniques that need immediate attention, however, are the intuitive proof by cases and the less intuitive induction. If you want to understand the Under-determined Systems Lemma (as well as many proofs in this course), you need to learn how to apply both techniques.
4.3 Proof Technique: Proof by Cases
All Roads Lead to Rome.
-Proverb
Consider the story of poor Jose, who grew up in rural Portugal. Each day, his mother would ask him and his seven siblings,
"Would you like chicken, fish, or beef for dinner?"
Regardless of what the children chose, the answer from the mother would always be the same:
"We are having fish."
This is all Proof by Cases really is: every possible choice leads to the same result. Precisely,
If every possibility implies the same outcome, then that outcome MUST be true.
Visually, you can think of a crossroads; no matter which path you take, you end up at the same place.
[Figure: a crossroads diagram in which every road leads to the same destination.]
We give the prototypical example of proof by cases (and the bane of the mathematical constructivists).
Example. There exist irrational numbers $a, b$ such that $a^b$ is rational.
Proof: We know that either
$$\sqrt{2}^{\sqrt{2}} \text{ is rational} \quad \text{or} \quad \sqrt{2}^{\sqrt{2}} \text{ is irrational.}$$
Case 1: $\sqrt{2}^{\sqrt{2}}$ is rational.
In this case, we are already done since we know $\sqrt{2}$ is irrational. Just choose
$$a = \sqrt{2} \qquad b = \sqrt{2}$$
Case 2: $\sqrt{2}^{\sqrt{2}}$ is irrational.
In this case, we choose irrationals
$$a = \sqrt{2}^{\sqrt{2}} \qquad b = \sqrt{2}$$
Then, by exponent properties,
$$\left(\sqrt{2}^{\sqrt{2}}\right)^{\sqrt{2}} = \sqrt{2}^{\sqrt{2}\cdot\sqrt{2}} = \sqrt{2}^{\,2} = 2$$
which is rational.
Thus by cases, there exist irrational numbers $a, b$ such that $a^b$ is rational.
∎
Did we actually find numbers $a, b$ such that $a^b$ is rational? Nope. But we did prove such numbers exist! This is one of those fun math phenomena: we can prove the existence of an object without actually constructing it. Love it or hate it, you must admit it is pretty cool.
Another fun example is one from 8th grade Geometry: recall $(a, b, c)$ is a Pythagorean triple if
$$a^2 + b^2 = c^2$$
and $a, b, c$ are all positive integers. We can prove that any integer $n \geq 3$ occurs in some Pythagorean triple.
For example, consider the number 17. We can find a Pythagorean triple containing it, namely $(144, 17, 145)$, since
$$144^2 + 17^2 = 145^2.$$
Example. Any integer $n \geq 3$ occurs in some Pythagorean triple.
Proof: We know that either
$$n \text{ is odd} \quad \text{or} \quad n \text{ is even.}$$
Case 1: $n$ is odd.
In this case, $n^2$ is odd. This means
$$n^2 = 2q + 1$$
for some positive integer $q$. Using this $q$,
$$q^2 + \underbrace{(2q + 1)}_{n^2} = (q + 1)^2.$$
Therefore,
$$(q, n, q + 1)$$
is a Pythagorean triple containing $n$.
Case 2: $n$ is even.
In this case, $n^2 = 4q$ for some positive integer $q$. Using this $q$,
$$(q - 1)^2 + \underbrace{(4q)}_{n^2} = (q + 1)^2$$
Thus,
$$(q - 1, n, q + 1)$$
is a Pythagorean triple containing $n$.
∎
Here is a fun example from a math competition:
On the planet Ianoia, the currency is in $5 and $11 bills. What is the largest dollar amount that cannot be created using only $5 and $11 bills?
We can show that 39 cannot be created. Suppose it can. Then for some integers $x, y \geq 0$:
$$39 = 5x + 11y \;\Longrightarrow\; 39 - 5x = 11y$$
But when we subtract multiples of 5 from 39, we never get a non-negative multiple of 11:
$$39 - 0\cdot 5 = 39 \qquad 39 - 1\cdot 5 = 34 \qquad 39 - 2\cdot 5 = 29 \qquad 39 - 3\cdot 5 = 24$$
$$39 - 4\cdot 5 = 19 \qquad 39 - 5\cdot 5 = 14 \qquad 39 - 6\cdot 5 = 9 \qquad 39 - 7\cdot 5 = 4$$
If we were in the contest, we might just guess that 39 is the largest amount that cannot be created with $5 and $11 bills. As aspiring mathematicians, however, we need to formally show that any dollar amount greater than 39 can be made with $5 and $11 bills:
Example. Any integer $n \geq 40$ can be written as
$$n = 5x + 11y$$
for some non-negative integers $x, y$.

Proof: We generalize the idea that n is odd or n is even. For some non-negative integer q, n must
have one of the following forms:
$$n = 5q, \quad n = 5q + 1, \quad n = 5q + 2, \quad n = 5q + 3, \quad n = 5q + 4.$$
Why is this true? Simple: when you divide by 5, you either have remainder 0, 1, 2, 3, or 4. Also notice,
since $n \geq 40$, we have $q \geq 8$.

Case 1: $n = 5q$.

In this case, we choose
$$x = q, \qquad y = 0,$$
where¹ $x \geq 8$.

Case 2: $n = 5q + 1$.

Then,
$$n = 5q + 1 = 5(q - 2) + 10 + 1 = 5(q - 2) + 11.$$
Choose
$$x = q - 2, \qquad y = 1,$$
where $x \geq 8 - 2 = 6$.

Case 3: $n = 5q + 2$.

Then,
$$n = 5q + 2 = 5(q - 4) + 20 + 2 = 5(q - 4) + (11)2.$$
Choose
$$x = q - 4, \qquad y = 2,$$
where $x \geq 8 - 4 = 4$.

Case 4: $n = 5q + 3$.

Then,
$$n = 5q + 3 = 5(q - 6) + 30 + 3 = 5(q - 6) + 3(11).$$
Choose
$$x = q - 6, \qquad y = 3,$$
where $x \geq 8 - 6 = 2$.

Case 5: $n = 5q + 4$.

Then,
$$n = 5q + 4 = 5(q - 8) + 40 + 4 = 5(q - 8) + 4(11).$$
Choose
$$x = q - 8, \qquad y = 4,$$
where $x \geq 8 - 8 = 0$.

¹ In each case, we need to check that x is non-negative. Indeed, this is where the argument would fail for some $n < 40$.

Thus, in each case, we can find non-negative x, y such that
$$n = 5x + 11y.$$
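If you want to see the case analysis in action, here is a short Python sketch (my own check, assuming nothing beyond the statement) that recovers x and y for any n ≥ 40 and brute-force confirms that 39 really is unreachable:

```python
def as_5s_and_11s(n):
    """Return non-negative (x, y) with n = 5x + 11y, following the proof's five cases."""
    assert n >= 40
    q, r = divmod(n, 5)          # n = 5q + r with r in {0, 1, 2, 3, 4}
    x, y = q - 2 * r, r          # Case r: x = q - 2r, y = r, so 5x + 11y = 5q + r = n
    assert x >= 0 and 5 * x + 11 * y == n
    return x, y

# brute force: 39 has no representation, but everything from 40 on does
print(any(39 == 5 * x + 11 * y for x in range(8) for y in range(4)))  # False
print(all(as_5s_and_11s(n) is not None for n in range(40, 500)))      # True
```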
In general,
Math Mantra: If you are stuck on a proof, try breaking it up into cases. The
additional information you gain from each individual case can help you solve the
problem one step at a time.
4.4 Proof Technique: Induction
I am very powerful. Yet I am the least of all the guards. From hall to hall, door after
door, each guard is more powerful than the last...
-Franz Kafka, The Trial
In all my years of teaching, I have found that students struggle most with this technique. But if you
are going to work with n-dimensional objects or anything built by a recursive process, you must know
induction. Before I tell you the details, let us talk about induction algebraically.
Consider the following game:
The Game of Modus Ponens

Objects:

The only objects we will use are letters or two letters joined by the symbol $\Rightarrow$. For example,
$$X, \quad Y, \quad Z, \quad X \Rightarrow Z, \quad D \Rightarrow E, \quad C \Rightarrow C$$
are all objects in our game. Usually, we will number letters with subscripts:
$$A_1, \; A_2, \; \ldots, \; A_n$$
The Rule:
Suppose we are given two objects
$$A \Rightarrow B \quad \text{and} \quad A.$$
Then we are allowed to create the brand new object
$$B.$$
As an example of how to play this game, suppose you are given
$$A \Rightarrow B, \quad B \Rightarrow C, \quad B \Rightarrow D, \quad A.$$
Using these objects, you can build the additional objects B, C, D:
First, apply the rule to
$$A \Rightarrow B \quad \text{and} \quad A$$
to get B. Now we apply the rule with B and the objects
$$B \Rightarrow C, \quad B \Rightarrow D$$
respectively to get C and D.
Now, consider a game where you are given the object
$$A_1$$
and
$$A_n \Rightarrow A_{n+1}$$
for any natural number n. Note that the preceding line encodes infinitely many objects; namely:
$$A_1 \Rightarrow A_2, \quad A_2 \Rightarrow A_3, \quad A_3 \Rightarrow A_4, \quad A_4 \Rightarrow A_5, \quad \ldots$$
Using our rule, what objects can you build?
First, you can use
$$A_1 \Rightarrow A_2 \quad \text{and} \quad A_1$$
to build $A_2$. Then, using
$$A_2 \Rightarrow A_3 \quad \text{and} \quad A_2,$$
you can build $A_3$. In fact, you can build $A_n$ for any natural number n.
This is what induction really is: the objects are true statements like

I : 0 is neither positive nor negative.
A : John went to the movies.
N : There is no smallest positive real number.

In particular, the objects containing $\Rightarrow$ are just true conditionals.
For example, the following are true:

$S \Rightarrow T$ : If you scored a 2400 on the SAT (S), then you scored an 800 on the Math Section (T).
$A \Rightarrow B$ : If $|\vec{x} \cdot \vec{y}| = \|\vec{x}\|\,\|\vec{y}\|$ (A), then $\vec{x} = t\vec{y}$ for some t (B).
$C \Rightarrow D$ : If $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n$ is a linearly independent set (C), then no $\vec{v}_i$ in this set can be written as a linear combination of the other vectors (D).
Our rule is just the law of Modus Ponens:

Modus Ponens: If A is true and A implies B, then we can conclude B is true.

For example, if we have

$A \Rightarrow B$ : If n is prime (A), then n is not 8 (B).
$A$ : n is prime.

Then we conclude B:

$B$ : n is not 8.
An object $A_n$ is simply a true statement that is a function of n. And if we can prove the base case
$$A_1$$
and the inductive step
$$A_n \Rightarrow A_{n+1}$$
for every natural number n, then, as in our game, we can conclude $A_n$ is true for every natural
number n.

For example, suppose we know

$A_1$ : 1 is greater than 0

and, for all natural numbers n,

$A_n \Rightarrow A_{n+1}$ : if n is greater than 0, then n + 1 is greater than 0.

Then we equivalently know

$A_1$ : 1 is greater than 0
$A_1 \Rightarrow A_2$ : if 1 is greater than 0, then 2 is greater than 0.
$A_2 \Rightarrow A_3$ : if 2 is greater than 0, then 3 is greater than 0.
...

Using Modus Ponens, we conclude

$A_1$ : 1 is greater than 0
$A_2$ : 2 is greater than 0
$A_3$ : 3 is greater than 0
...

i.e. for every natural number n,

$A_n$ : n is greater than 0.
Here are some examples of induction. However, I reserve the prototypical example of Gauss's lemma
(and the warped high-school monstrosity version of it) for the next section.
We call an object a binary string if it is a finite sequence of 1s and 0s. For example
$$0100010$$
is a binary string. We can also form new strings by concatenating them, i.e. placing one string after
the other. Thus,
$$01000100100$$
is a string formed from concatenating the five strings
$$0, \quad 100, \quad 010, \quad 010, \quad 0.$$
Using induction, we can formally prove the following intuitive result:
Example. A string formed from concatenations of 0, 100, and 010 will have more 0s than 1s.

Proof: We do induction on the number of string pieces that are concatenated. Formally, we want to
prove, for every n, the property:

$P_n$ : A string formed from concatenating n pieces of 0, 100, and 010 has more 0s than 1s.

Base Case, n = 1.

Suppose we have a final string formed from only one piece. Then it is either
$$0, \quad 100, \quad 010,$$
which each have more 0s than 1s. Thus, $P_1$ is true.

Inductive Step

Let n be an arbitrary natural number. To prove¹ $P_n \Rightarrow P_{n+1}$, we need to assume $P_n$ and show
that we can conclude

$P_{n+1}$ : A string formed from concatenating n + 1 pieces of 0, 100, and 010 has more 0s than 1s.

Let $S'$ be some arbitrary string built from n + 1 pieces. By definition of how we construct new
strings, it must be formed from an n-piece string S, along with one of the strings
$$0, \quad 100, \quad 010.$$
Then, by our inductive hypothesis (our assumption of $P_n$), we know S has x zeros and y ones
where $x > y$. So $S'$ is either

String    Number of 0s    Number of 1s
S0        x + 1           y
S100      x + 2           y + 1
S010      x + 2           y + 1

In each case, the number of 0s is greater than the number of 1s since $x > y$, so $P_{n+1}$ is true.

Since n was arbitrary, we proved
$$P_n \Rightarrow P_{n+1}$$
for all natural numbers n. We can thus conclude by induction that, for $n \geq 1$, $P_n$ holds, i.e. a
string formed from concatenating n pieces of 0, 100, and 010 has more zeros than ones.

¹ Indeed, to prove a conditional is true, you assume the "if" statement and show the "then" statement holds. In fact,
you've already done this for several proofs!
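As a quick empirical companion to the induction (my own sketch, not from the text), we can concatenate random pieces and confirm the counting invariant:

```python
import random

def random_concatenation(n):
    """Concatenate n pieces drawn from {"0", "100", "010"}."""
    return "".join(random.choice(["0", "100", "010"]) for _ in range(n))

# every trial should have strictly more 0s than 1s, as P_n asserts
for _ in range(1000):
    s = random_concatenation(random.randint(1, 50))
    assert s.count("0") > s.count("1")
print("invariant held in all trials")
```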
Too easy? Here is a less obvious example:
Example. $8^n - 3^n$ is divisible by 5 for every non-negative integer n.

Proof: We do induction on the non-negative integer n where

$P_n$ : $8^n - 3^n$ is divisible by 5.

Base Case, n = 0.

$P_0$ is obviously true since
$$8^0 - 3^0 = 0$$
is divisible by 5.

Inductive Step

Let n be an arbitrary non-negative integer. Assume

$P_n$ : $8^n - 3^n$ is divisible by 5.

We want to show

$P_{n+1}$ : $8^{n+1} - 3^{n+1}$ is divisible by 5

is true. By algebraic shenanigans, we can rewrite
$$8^{n+1} - 3^{n+1} = 8\cdot 8^n - 3\cdot 3^n = (5 + 3)\cdot 8^n - 3\cdot 3^n = 5\cdot 8^n + 3\cdot 8^n - 3\cdot 3^n = 5\cdot 8^n + 3\,(8^n - 3^n).$$
By the inductive hypothesis,
$$8^n - 3^n = 5q$$
for some non-negative integer q. Thus,
$$8^{n+1} - 3^{n+1} = 5\cdot 8^n + 3\cdot 5q = 5\,(8^n + 3q),$$
meaning $8^{n+1} - 3^{n+1}$ is divisible by 5. Therefore, $P_{n+1}$ is true. We can thus conclude by induction
that for every non-negative integer n,

$8^n - 3^n$ is divisible by 5.
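A two-line check of my own, matching what the induction guarantees:

```python
# verify 8^n - 3^n is divisible by 5 for a range of n; the induction proves it for all n
assert all((8**n - 3**n) % 5 == 0 for n in range(200))
print("8^n - 3^n is divisible by 5 for n = 0, ..., 199")
```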
Here is an induction example involving inequalities: consider the sequence defined by
$$S_1 = 1, \qquad S_k = \sqrt{1 + S_{k-1}} \ \text{ for } k \geq 2.$$
The first few terms are
$$S_1 = 1, \quad S_2 = \sqrt{1 + \sqrt{1}}, \quad S_3 = \sqrt{1 + \sqrt{1 + \sqrt{1}}}, \quad S_4 = \sqrt{1 + \sqrt{1 + \sqrt{1 + \sqrt{1}}}}.$$
It turns out that this sequence converges to the golden ratio! We will prove this result in two weeks.
But one of the key steps in that proof is the following:

Example. $S_n < 2$ for every positive integer n, where we define $S_k$ by
$$S_1 = 1, \qquad S_k = \sqrt{1 + S_{k-1}} \ \text{ for } k \geq 2.$$
Informally, we are showing
$$\sqrt{1 + \sqrt{1 + \sqrt{1 + \sqrt{\cdots\sqrt{1 + \sqrt{1}}}}}} < 2$$
no matter how many finite iterations the "$\cdots$" represent.

Proof: We do induction on the term number (the subscript of S). That is, for each $n \geq 1$, we take
property $P_n$ to be

$P_n$ : $S_n < 2$.

We will show $P_n$ is true for all $n \geq 1$.

Base Case, n = 1:

$P_1$ is obviously true since $1 < 2$.

Inductive Step

Let $n \geq 1$ be arbitrary. Assume that property $P_n$ is true. We want to show

$P_{n+1}$ : $S_{n+1} < 2$

is true. But we know
$$S_{n+1} = \sqrt{1 + S_n} \qquad \text{(by Recursive Definition)}$$
$$< \sqrt{1 + 2} = \sqrt{3} \qquad \text{(by Inductive Hypothesis)}$$
$$< 2.$$
Thus $P_{n+1}$ is true. By induction, we conclude for every natural number n,

$P_n$ : $S_n < 2$.
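Numerically (my own sketch), the bound $S_n < 2$ is visible immediately, and the terms creep toward the golden ratio $(1+\sqrt{5})/2 \approx 1.618$ as promised:

```python
import math

s = 1.0                      # S_1 = 1
for k in range(2, 12):
    s = math.sqrt(1 + s)     # S_k = sqrt(1 + S_{k-1}), always below 2
    print(k, round(s, 6))
print("golden ratio:", (1 + math.sqrt(5)) / 2)   # the claimed limit, ~1.618034
```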
Finally, we provide an alternate proof of the Cauchy-Schwarz inequality using induction. Why should
we bother with an alternate proof if we already know the theorem statement is true?
Every proof reveals much more than just the bare fact stated in the theorem, and this
revelation may be more valuable than the theorem itself.
-L. Lovasz, Discrete Mathematics
By now, you already proved Cauchy-Schwarz by
1. Finding the closest point on a line to the origin.
2. Rewriting the difference of the square of the norm product and the square of the dot product as
a non-negative sum (Problem Set 1).
This time you are going to use
3. Induction
But for our induction to work, we will need to prove Cauchy-Schwarz for n = 2 separately.
Lemma (Cauchy-Schwarz for n = 2). For real numbers $x_1, x_2, y_1, y_2$,
$$|x_1 y_1 + x_2 y_2| \leq \sqrt{x_1^2 + x_2^2}\,\sqrt{y_1^2 + y_2^2}.$$

Proof: As usual, it is easier to work with squares. Thus, we try to prove
$$(x_1 y_1 + x_2 y_2)^2 \leq (x_1^2 + x_2^2)(y_1^2 + y_2^2).$$
But this is equivalent to proving
$$0 \leq (x_1^2 + x_2^2)(y_1^2 + y_2^2) - (x_1 y_1 + x_2 y_2)^2.$$
Expanding the right hand side,
$$(x_1^2 + x_2^2)(y_1^2 + y_2^2) - (x_1 y_1 + x_2 y_2)^2 = (x_1^2 y_1^2 + x_1^2 y_2^2 + x_2^2 y_1^2 + x_2^2 y_2^2) - (x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2)$$
$$= \underbrace{x_1^2 y_2^2}_{a^2} - \underbrace{2 x_1 y_1 x_2 y_2}_{2ab} + \underbrace{x_2^2 y_1^2}_{b^2}.$$
But lo and behold, what is this? It is one of our oldest friends, the factoring
$$(a - b)^2 = a^2 - 2ab + b^2$$
with
$$a = x_1 y_2, \qquad b = x_2 y_1.$$
But we already know
$$(x_1 y_2 - x_2 y_1)^2 \geq 0.$$
So Cauchy-Schwarz for the case n = 2 is true.
With the case n = 2 out of the way, let's prove the full-blown Cauchy-Schwarz inequality:

Theorem (Cauchy-Schwarz Inequality). For $x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n \in \mathbb{R}$ the following inequality always holds:
$$|x_1 y_1 + x_2 y_2 + \ldots + x_n y_n| \leq \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}\,\sqrt{y_1^2 + y_2^2 + \ldots + y_n^2}.$$

Proof Summary:

Base Case: Use the property $|ab| = |a||b|$.
Inductive Step: Start with the LHS and split it into the sum of the n case and an $x_{n+1}y_{n+1}$ term.
Visualize as Cauchy-Schwarz for the case n = 2 and apply the lemma.

Proof: We do induction on n to prove the property
$$P_n : \; |x_1 y_1 + x_2 y_2 + \ldots + x_n y_n| \leq \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}\,\sqrt{y_1^2 + y_2^2 + \ldots + y_n^2}$$
is true for every positive n.

Base case, n = 1.

To show
$$|x_1 y_1| \leq \sqrt{x_1^2}\,\sqrt{y_1^2},$$
just apply basic absolute value properties:
$$|x_1 y_1| = |x_1|\,|y_1| = \sqrt{x_1^2}\,\sqrt{y_1^2}.$$
Thus, $P_1$ is true.¹

¹ Equality should be intuitive in the n = 1 case, since any two numbers on the number line are parallel.
Inductive Step

Assume $P_n$ is true. We want to prove
$$P_{n+1} : \; |x_1 y_1 + x_2 y_2 + \ldots + x_{n+1} y_{n+1}| \leq \sqrt{x_1^2 + x_2^2 + \ldots + x_{n+1}^2}\,\sqrt{y_1^2 + y_2^2 + \ldots + y_{n+1}^2}.$$
Like with most inequalities, we start from the left and try to construct an upper bound. First,
visualize the left hand side as a sum of two pieces and apply the normal² triangle inequality:
$$\Bigl|\underbrace{(x_1 y_1 + x_2 y_2 + \ldots + x_n y_n)}_{a} + \underbrace{(x_{n+1} y_{n+1})}_{b}\Bigr| \leq \underbrace{|x_1 y_1 + x_2 y_2 + \ldots + x_n y_n|}_{|a|} + \underbrace{|x_{n+1} y_{n+1}|}_{|b|}.$$
Using the inductive hypothesis, we can bound this even further:
$$|x_1 y_1 + x_2 y_2 + \ldots + x_n y_n| + |x_{n+1} y_{n+1}| \leq \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}\,\sqrt{y_1^2 + y_2^2 + \ldots + y_n^2} + |x_{n+1} y_{n+1}|.$$
Now we do a very nice trick: apply the Cauchy-Schwarz inequality for the n = 2 case using
the vectors
$$\vec{a} = \begin{pmatrix} \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2} \\ |x_{n+1}| \end{pmatrix}, \qquad \vec{b} = \begin{pmatrix} \sqrt{y_1^2 + y_2^2 + \ldots + y_n^2} \\ |y_{n+1}| \end{pmatrix}.$$
Cauchy-Schwarz on $\vec{a}, \vec{b}$ gives us an upper bound for the right hand side:
$$\underbrace{\sqrt{x_1^2 + \ldots + x_n^2}\,\sqrt{y_1^2 + \ldots + y_n^2} + |x_{n+1}|\,|y_{n+1}|}_{\vec{a}\cdot\vec{b}} \leq \underbrace{\sqrt{\Bigl(\sqrt{x_1^2 + \ldots + x_n^2}\Bigr)^2 + |x_{n+1}|^2}\,\sqrt{\Bigl(\sqrt{y_1^2 + \ldots + y_n^2}\Bigr)^2 + |y_{n+1}|^2}}_{\|\vec{a}\|\,\|\vec{b}\|},$$
which, after simplifying, is
$$\sqrt{x_1^2 + x_2^2 + \ldots + x_n^2 + |x_{n+1}|^2}\,\sqrt{y_1^2 + y_2^2 + \ldots + y_n^2 + |y_{n+1}|^2}.$$
Of course, we can drop absolute values when we square:
$$\sqrt{x_1^2 + x_2^2 + \ldots + x_n^2 + x_{n+1}^2}\,\sqrt{y_1^2 + y_2^2 + \ldots + y_n^2 + y_{n+1}^2}.$$
Thus,
$$|x_1 y_1 + x_2 y_2 + \ldots + x_{n+1} y_{n+1}| \leq \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2 + x_{n+1}^2}\,\sqrt{y_1^2 + y_2^2 + \ldots + y_n^2 + y_{n+1}^2},$$
and the case $P_{n+1}$ is true. This lets us conclude the Cauchy-Schwarz inequality is true.

² The general triangle inequality would be circular!
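Here is a small numerical sanity check of my own for the inequality we just proved (random vectors, nothing more):

```python
import random, math

def cauchy_schwarz_holds(x, y):
    """Check |x.y| <= ||x|| ||y|| for two lists of reals (tiny tolerance for float error)."""
    dot = sum(a * b for a, b in zip(x, y))
    return abs(dot) <= math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)) + 1e-12

for _ in range(1000):
    n = random.randint(1, 10)
    x = [random.uniform(-5, 5) for _ in range(n)]
    y = [random.uniform(-5, 5) for _ in range(n)]
    assert cauchy_schwarz_holds(x, y)
print("Cauchy-Schwarz held in all random trials")
```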
Notice that we used induction on integers, strings, and sequences. Generally,
Math Mantra: Consider using induction when dealing with objects that are
constructed RECURSIVELY i.e. each successive object is expressed in terms of
the previous objects.
4.5 How Induction Should Not be Done
This is perhaps the most important memory I've collected. It is also a lie!
-Albus Dumbledore
As a teacher, I see induction incorrectly taught time and time again. In fact, I am sure your graders
will complain about it after the first homework!
Here is a proof of Gauss's Lemma pilfered from a handout by a math teacher at a reputable local
high school:
Theorem. For any positive integer n,
$$1 + 2 + 3 + \ldots + n = \frac{n(n+1)}{2}.$$

Fake proof:

Base, n = 1
$$1 \stackrel{?}{=} \frac{1(1+1)}{2} = 1$$
$$1 \stackrel{\checkmark}{=} 1$$

Inductive
$$1 + 2 + 3 + \ldots + n + (n+1) \stackrel{?}{=} \frac{(n+1)(n+2)}{2}$$
$$\frac{n(n+1)}{2} + (n+1) \stackrel{?}{=} \frac{(n+1)(n+2)}{2}$$
$$\frac{n^2 + n}{2} + \frac{2(n+1)}{2} \stackrel{?}{=} \frac{(n+1)(n+2)}{2}$$
$$\frac{n^2 + 3n + 2}{2} \stackrel{?}{=} \frac{(n+1)(n+2)}{2}$$
$$\frac{(n+1)(n+2)}{2} \stackrel{\checkmark}{=} \frac{(n+1)(n+2)}{2}$$
This mindless, mechanical process is a by-product of an education revolving around rote memorization.
To us, there is some merit in the process because the kids are showing an equivalence to a tautology.
But kids don't know that. Instead, they are taught:
1. Assume what you are trying to prove.
2. Manipulate to get a true statement.
This is nuttier than squirrel poo. The first step is circular: you cannot assume what you are trying
to prove! Combined with the second step, it creates great evil. If this process is permitted, then you
can derive anything:
$$27232 = 323 \implies$$
$$0 \cdot 27232 = 0 \cdot 323 \implies$$
$$0 = 0$$
By this logic, 27232 = 323. Fail!
Starting with
1 + 2 + 3 + . . . + n + (n + 1) ()
we can apply the inductive hypothesis
1 + 2 + 3 + . . . + n =
n(n + 1)
2
to rewrite () as
n(n + 1)
2
+ (n + 1)
which is just
n(n + 1)
2
+
2(n + 1)
2
=
n
2
+ n + 2(n + 1)
2
=
(n + 1)(n + 2)
2
.
4.6. UNDER-DETERMINED SYSTEMS LEMMA 83
Thus,
1 + 2 + 3 + . . . + n + (n + 1) =
(n + 1)(n + 2)
2
Building from what we know to what we want to conclude is a natural process. The natural process
of REASONING. Anyways, that's my rant on high school induction. Thanks for listening.
4.6 Under-determined Systems Lemma
With induction at our disposal, we can prove the Under-determined Systems Lemma. But what is
this lemma?
In Algebra II, you learned that a system of equations can be visualized as an intersection of lines. In
the case of
$$2x + 2y = 0$$
$$3x + 4y = 0$$
there can be only one solution, since these lines are non-parallel. Also note that this one solution
must be (0, 0) since this is a homogeneous system.
However, suppose we have only one equation:
$$2x + 2y = 0.$$
Then there exists a solution that is not (0, 0). In fact, we have infinitely many solutions, since any
point on the line
$$y = -x$$
satisfies the equation.
The Under-determined Systems Lemma generalizes this idea. Namely, in a homogeneous system with
more unknowns than equations, you will always have a non-trivial solution. This is very intuitive:
just think of equations as constraints and unknowns as freedoms. You are guaranteed a non-trivial solution when
you have more freedom than constraint.
Theorem (Under-determined Systems Lemma). Any homogeneous system of m equations with
more than m unknowns has a non-trivial solution.
Proof: We proceed by induction on the number of equations, k, to prove that the property

$P_k$ : Any homogeneous system with k equations and more than k unknowns has at least one
non-trivial solution

holds for every positive integer k.
Base Case, k = 1.

Consider a system of 1 equation with n unknowns, where $n > 1$:
$$a_{11} x_1 + a_{12} x_2 + \ldots + a_{1n} x_n = 0.$$
We have two possibilities:
$$a_{11} = 0 \quad \text{or} \quad a_{11} \neq 0.$$
If $a_{11} = 0$, we can form a non-trivial solution by simply setting the first unknown to 1 and
the rest to 0:
$$x_1 = 1, \quad x_2 = 0, \quad \ldots, \quad x_n = 0.$$
If $a_{11} \neq 0$, then set all unknowns except the first to 1:
$$x_2 = 1, \quad x_3 = 1, \quad \ldots, \quad x_n = 1.$$
Now solve for $x_1$:
$$a_{11} x_1 + a_{12}(1) + a_{13}(1) + \ldots + a_{1n}(1) = 0 \implies a_{11} x_1 = -(a_{12} + a_{13} + \ldots + a_{1n}) \implies x_1 = \frac{-(a_{12} + a_{13} + \ldots + a_{1n})}{a_{11}}.$$
This means
$$x_1 = \frac{-(a_{12} + a_{13} + \ldots + a_{1n})}{a_{11}}, \quad x_2 = 1, \quad x_3 = 1, \quad \ldots, \quad x_n = 1$$
is a non-trivial solution.

In either case, we have a non-trivial solution. Thus, the base case, $P_1$, is true.
Inductive Step

Assume $P_m$ is true. We want to show $P_{m+1}$ is true, i.e.

$P_{m+1}$ : Any homogeneous system with m + 1 equations and more than m + 1 unknowns has at
least one non-trivial solution.

Consider an arbitrary homogeneous system with m + 1 equations and n unknowns,
where $n > m + 1$:
$$a_{11} x_1 + a_{12} x_2 + a_{13} x_3 + \ldots + a_{1n} x_n = 0$$
$$a_{21} x_1 + a_{22} x_2 + a_{23} x_3 + \ldots + a_{2n} x_n = 0$$
$$\vdots$$
$$a_{(m+1)1} x_1 + a_{(m+1)2} x_2 + a_{(m+1)3} x_3 + \ldots + a_{(m+1)n} x_n = 0$$
The first step of Gaussian Elimination preserves the solutions (we will prove this later)! Thus,
we can perform this first step to transform the system into one of two possible cases:
Case 1

The system reduces to
$$0x_1 + a'_{12} x_2 + a'_{13} x_3 + \ldots + a'_{1n} x_n = 0$$
$$0x_1 + a'_{22} x_2 + a'_{23} x_3 + \ldots + a'_{2n} x_n = 0$$
$$\vdots$$
$$0x_1 + a'_{(m+1)2} x_2 + a'_{(m+1)3} x_3 + \ldots + a'_{(m+1)n} x_n = 0$$
This means that the value of $x_1$ does not matter. Setting
$$x_1 = 1, \quad x_2 = 0, \quad \ldots, \quad x_n = 0$$
yields a non-trivial solution.
Case 2

The system reduces to
$$1x_1 + a'_{12} x_2 + a'_{13} x_3 + \ldots + a'_{1n} x_n = 0$$
$$0x_1 + a'_{22} x_2 + a'_{23} x_3 + \ldots + a'_{2n} x_n = 0$$
$$\vdots$$
$$0x_1 + a'_{(m+1)2} x_2 + a'_{(m+1)3} x_3 + \ldots + a'_{(m+1)n} x_n = 0$$
But notice that, if we ignore the first equation, we have a system of m equations with n − 1
unknowns:
$$a'_{22} x_2 + a'_{23} x_3 + \ldots + a'_{2n} x_n = 0$$
$$\vdots$$
$$a'_{(m+1)2} x_2 + a'_{(m+1)3} x_3 + \ldots + a'_{(m+1)n} x_n = 0$$
By applying the inductive hypothesis to this subsystem, we get a non-trivial solution:
$$x_2 = s_2, \quad x_3 = s_3, \quad \ldots, \quad x_n = s_n.$$
Now plug this solution into the first equation
$$1x_1 + a'_{12} s_2 + a'_{13} s_3 + \ldots + a'_{1n} s_n = 0$$
and solve for $x_1$:
$$x_1 = -(a'_{12} s_2 + a'_{13} s_3 + \ldots + a'_{1n} s_n).$$
Thus,
$$x_1 = -(a'_{12} s_2 + a'_{13} s_3 + \ldots + a'_{1n} s_n), \quad x_2 = s_2, \quad \ldots, \quad x_n = s_n$$
is a non-trivial solution of the larger system.
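For intuition, here is a short numerical sketch of my own (using NumPy's SVD rather than the hand-rolled elimination of the proof) that produces a non-trivial solution for a random homogeneous system with more unknowns than equations:

```python
import numpy as np

def nontrivial_solution(A):
    """Given an m x n matrix A with n > m, return x != 0 with A @ x ~ 0.

    The trailing right-singular vectors span the null space, which is non-trivial
    whenever there are more unknowns than equations."""
    m, n = A.shape
    assert n > m
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]            # a unit vector in the null space

A = np.random.randn(3, 5)    # 3 equations, 5 unknowns
x = nontrivial_solution(A)
print(np.allclose(A @ x, 0), np.linalg.norm(x))   # True, 1.0
```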
By the way, you should ask yourself,
Why should we care about the Under-determined Systems Lemma?
One reason is that this lemma is used to prove the Linear Dependence Lemma. And without the Linear
Dependence Lemma, a lot of our upcoming subspace theory (like basis and dimension) wouldn't even
make sense!
4.7 Linear Dependence Lemma
Consider the question: can you choose 6 numbers from the set
$$\{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$$
such that none are even? The answer is obviously no, since the set only contains 5 odd numbers.
The Linear Dependence Lemma follows the same pigeon-hole philosophy: you cannot choose k + 1
vectors from the set
$$\operatorname{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k\}$$
such that these vectors are linearly independent.
At first sight, the proof of the Linear Dependence Lemma looks difficult. It's not. The only reason it
seems difficult is that it requires a healthy dose of book-keeping.
Math Mantra: Don't be deceived by cumbersome notation. Instead, look for the
BIG PICTURE!
Here is the big picture:
To prove linear dependence, you want to find a non-trivial solution (i.e. not $x_1 = \ldots = x_{k+1} = 0$) to
some equation of the form
$$x_1 \vec{w}_1 + x_2 \vec{w}_2 + \ldots + x_k \vec{w}_k + x_{k+1} \vec{w}_{k+1} = \vec{0}.$$
By substituting the definition of the $\vec{w}_i$, we can rewrite this equation as
$$c_1 \vec{v}_1 + c_2 \vec{v}_2 + \ldots + c_k \vec{v}_k = \vec{0}$$
where each coefficient $c_i$ is a function of the unknowns $x_1, x_2, \ldots, x_{k+1}$. To find some solution to this
equation, it suffices to find $x_1, x_2, \ldots, x_{k+1}$ that make each coefficient zero, for then we have
$$0\vec{v}_1 + 0\vec{v}_2 + \ldots + 0\vec{v}_k = \vec{0}.$$
Therefore, we will solve the system
$$c_1 = 0, \quad c_2 = 0, \quad \ldots, \quad c_k = 0.$$
When we write out the coefficients $c_1, c_2, \ldots, c_k$, this is a homogeneous system of equations with more
unknowns than equations. Just apply the Under-determined Systems Lemma and we're done!
Theorem (Linear Dependence Lemma). Let $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k \in \mathbb{R}^n$. Any set of k + 1 vectors taken
from
$$\operatorname{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k\}$$
is linearly dependent.

Proof Summary:

We want to find a non-trivial solution to
$$x_1 \vec{w}_1 + x_2 \vec{w}_2 + \ldots + x_{k+1} \vec{w}_{k+1} = \vec{0}.$$
Rewrite each $\vec{w}_i$ as a linear combination of $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k$.
Rewrite the new equation as a sum of scaled $\vec{v}_i$.
Set each coefficient of $\vec{v}_i$ to 0.
This is a homogeneous system with k equations and k + 1 unknowns: apply the Under-determined
Systems Lemma.
Proof: Let $\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_{k+1}$ be arbitrary vectors such that
$$\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_{k+1} \in \operatorname{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k\}.$$
Since we want to prove these vectors are linearly dependent, by definition, we need a non-trivial
solution to
$$x_1 \vec{w}_1 + x_2 \vec{w}_2 + \ldots + x_{k+1} \vec{w}_{k+1} = \vec{0}. \qquad (*)$$
First, expand the definition of the $\vec{w}_i$, i.e. write each of them as a member of the span. We could try
$$\vec{w}_1 = \alpha_1 \vec{v}_1 + \alpha_2 \vec{v}_2 + \ldots + \alpha_k \vec{v}_k$$
$$\vec{w}_2 = \alpha_1 \vec{v}_1 + \alpha_2 \vec{v}_2 + \ldots + \alpha_k \vec{v}_k$$
$$\vdots$$
$$\vec{w}_{k+1} = \alpha_1 \vec{v}_1 + \alpha_2 \vec{v}_2 + \ldots + \alpha_k \vec{v}_k$$
But this is a bone-headed labelling. Single-subscript notation is not going to cut it!
Why? We are going to have overlapping variables. Namely, the above says that
$$\vec{w}_1 = \vec{w}_2 = \ldots = \vec{w}_{k+1}.$$
This is not our intent! A smarter idea is to order our variables using double subscripts:
$$\vec{w}_1 = \alpha_{11} \vec{v}_1 + \alpha_{12} \vec{v}_2 + \ldots + \alpha_{1k} \vec{v}_k$$
$$\vec{w}_2 = \alpha_{21} \vec{v}_1 + \alpha_{22} \vec{v}_2 + \ldots + \alpha_{2k} \vec{v}_k$$
$$\vdots$$
$$\vec{w}_{k+1} = \alpha_{(k+1)1} \vec{v}_1 + \alpha_{(k+1)2} \vec{v}_2 + \ldots + \alpha_{(k+1)k} \vec{v}_k$$
Much better! Now it resembles the coefficients of a system of equations (the wheel starts to turn)!
Substituting the $\vec{w}_i$ into $(*)$ yields
$$x_1 \underbrace{(\alpha_{11} \vec{v}_1 + \ldots + \alpha_{1k} \vec{v}_k)}_{\vec{w}_1} + x_2 \underbrace{(\alpha_{21} \vec{v}_1 + \ldots + \alpha_{2k} \vec{v}_k)}_{\vec{w}_2} + \ldots + x_{k+1} \underbrace{(\alpha_{(k+1)1} \vec{v}_1 + \ldots + \alpha_{(k+1)k} \vec{v}_k)}_{\vec{w}_{k+1}} = \vec{0}.$$
Distributing, we have
$$\bigl(x_1 \alpha_{11} \vec{v}_1 + \ldots + x_1 \alpha_{1k} \vec{v}_k\bigr) + \bigl(x_2 \alpha_{21} \vec{v}_1 + \ldots + x_2 \alpha_{2k} \vec{v}_k\bigr) + \ldots + \bigl(x_{k+1} \alpha_{(k+1)1} \vec{v}_1 + \ldots + x_{k+1} \alpha_{(k+1)k} \vec{v}_k\bigr) = \vec{0}.$$
Now, group the $\vec{v}_i$ terms:
$$\bigl(x_1 \alpha_{11} + x_2 \alpha_{21} + \ldots + x_{k+1} \alpha_{(k+1)1}\bigr)\vec{v}_1 + \ldots + \bigl(x_1 \alpha_{1k} + x_2 \alpha_{2k} + \ldots + x_{k+1} \alpha_{(k+1)k}\bigr)\vec{v}_k = \vec{0}.$$
To solve this equation, it suffices to find $x_1, x_2, \ldots, x_{k+1}$ such that the coefficient of each $\vec{v}_i$ is
zero. This gives us the system
$$x_1 \alpha_{11} + x_2 \alpha_{21} + \ldots + x_{k+1} \alpha_{(k+1)1} = 0$$
$$x_1 \alpha_{12} + x_2 \alpha_{22} + \ldots + x_{k+1} \alpha_{(k+1)2} = 0$$
$$\vdots$$
$$x_1 \alpha_{1k} + x_2 \alpha_{2k} + \ldots + x_{k+1} \alpha_{(k+1)k} = 0$$
But this system has k equations with k + 1 unknowns! So by the Under-determined Systems Lemma,
we have a non-trivial solution. Awesome.
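A tiny NumPy illustration of the lemma (my own, not from the text): take k = 2 spanning vectors in R^3, build k + 1 = 3 vectors from their span, and exhibit a non-trivial dependence exactly as the proof does, by solving the homogeneous system of coefficient equations:

```python
import numpy as np

v = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])            # v_1, v_2 span a plane in R^3 (k = 2)
alpha = np.array([[1.0, 2.0],
                  [3.0, -1.0],
                  [0.5, 4.0]])              # w_i = alpha_{i1} v_1 + alpha_{i2} v_2  (k + 1 = 3 vectors)
w = alpha @ v

# the proof's system: k equations (one per v_j), k + 1 unknowns x_i  ->  alpha^T x = 0
_, _, Vt = np.linalg.svd(alpha.T)
x = Vt[-1]                                   # non-trivial solution, by the Under-determined Systems Lemma
print(x, np.allclose(x @ w, 0))              # x_1 w_1 + x_2 w_2 + x_3 w_3 = 0
```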
Lecture 5
Keeping it Real
One must be able to say at all times: instead of points, straight lines, and planes-
tables, chairs, and beer mugs.
-David Hilbert
Goals: After introducing a series of axioms, we will use the field and ordering properties
to rigorously derive real number theorems. In particular, we state the Completeness
Axiom. This fundamental axiom will be used multiple times throughout the course. We
also discuss how to prove existence and uniqueness.
5.1 Thinking Axiomatically
There is nothing more important to a mathematician than a proof. But,
What is a proof ?
Without going too much into philosophy, we can say that a proof is a truth-preserving operation. It
takes some collection of true statements and deduces a new true statement.
Naturally, we can try to trace all true statements back to some initial set of truths which we call
axioms.
The process of taking some theory and formally identifying the axioms at its foundation is known
as axiomatization. In particular, we would like to axiomatize the theory of real numbers. This
axiomatization is going to be in three parts:
The Field Axioms
The Ordering Axioms
The Completeness Axiom
For the first item, we are going to introduce the abstract notions of a group and a field. Then, to
check that our axioms make sense, we are going to apply only the algebraic properties of a field to
derive real number properties. This means forgetting all the meaning behind the symbols
$$+, \; -, \; \times, \; \div$$
and forgetting all the real numbers
$$\sqrt{2}, \; \pi, \; 7, \ldots$$
For all we care, we could call these numbers tables, chairs, and beer mugs. We will treat numbers
as purely algebraic objects to which we can apply formal algebraic rules. For example, consider the
object
$$-1 \cdot (2 + 3).$$
We do not know that this is $-5$. However, we can formally use the rule
$$-1 \cdot (x + y) = -1\cdot x + -1\cdot y$$
to derive
$$-1 \cdot (\underbrace{2}_{x} + \underbrace{3}_{y}) = -1\cdot \underbrace{2}_{x} + -1\cdot \underbrace{3}_{y}.$$
But before I cast Obliviate, we start off with a discussion of one of the most fundamental concepts in
all mathematics: uniqueness.
5.2 Proof Technique: Uniqueness
In the end, there can be only one!
-Duncan McCloud
When you walk into a bar, you may see a strange yellow bottle. This bottle holds Galliano, a sugary
vanilla-flavored liqueur. If you ever ask a bartender for a drink made with Galliano, he will always
make a Harvey Wallbanger. This is because

A Harvey Wallbanger is the only¹ drink in the entire universe that uses Galliano.

We have uniquely identified

Harvey Wallbanger

with the property

made with Galliano.

¹ As far as I know, but we will assume this for mathematical purposes.
Generally, we say that

Object x uniquely satisfies a property P

if x is the only object that satisfies property P. Technically, this is expressed as: x satisfies P and,
for any object y that satisfies P, it must be the case that y = x.
As a more mathematical example, consider the roots of
$$\sin(x).$$
There are infinitely many roots, namely
$$x = n\pi$$
for any integer n. But there is only one root in the interval $\left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$, namely $x = 0$. Therefore, we
have uniquely identified
$$x = 0$$
as

The root of $\sin(x)$ in the interval $\left(-\frac{\pi}{2}, \frac{\pi}{2}\right)$.
Two reasons why we study uniqueness are that

Uniqueness is important in the study of functions.
By definition, a function has a unique output for every input. Moreover, for a function to have
an inverse, every point in the image must be mapped from a unique point in the domain.

Uniqueness can be used in a proof by contradiction.
If an object has a unique representation of a certain form and we have two different ways to
express it in that form, then we have a contradiction. Here's a fun example:
Example. $\sqrt[5]{3}$ is irrational.

Proof Summary:

Suppose it is rational.
Rewrite as $3q^5 = p^5$.
The power of 3 in the prime factorization of $p^5$ is a multiple of 5.
The power of 3 in the prime factorization of $3q^5$ is one more than a multiple of 5.
This contradicts the uniqueness of prime factorization.
Proof: Suppose $\sqrt[5]{3}$ is rational. Then
$$\sqrt[5]{3} = \frac{p}{q}$$
for integers p, q. Raising both sides to the fifth power gives
$$3 = \frac{p^5}{q^5},$$
so
$$3q^5 = p^5.$$
In the prime factorization of
$$p^5,$$
all the prime powers must be multiples of 5. This is because we can write p in terms of its prime
factorization,
$$p = p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_r^{\alpha_r},$$
and expand
$$p^5 = \bigl(\underbrace{p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_r^{\alpha_r}}_{p}\bigr)^5 = p_1^{5\alpha_1} p_2^{5\alpha_2} \cdots p_r^{5\alpha_r}.$$
Since 3 divides $p^5$ (from above),

The power of 3 in the prime factorization of $p^5$ is a multiple of 5.

Likewise, we can show the prime factorization of $q^5$ only includes prime powers that are multiples of 5:
$$q^5 = \bigl(\underbrace{q_1^{\beta_1} q_2^{\beta_2} \cdots q_s^{\beta_s}}_{q}\bigr)^5 = q_1^{5\beta_1} q_2^{5\beta_2} \cdots q_s^{5\beta_s}.$$
Therefore,

The power of 3 in the prime factorization of $3q^5$ is one more than a multiple of 5.

But
$$3q^5 = p^5,$$
so by uniqueness of prime factorization the power of 3 in the unique prime factorization of $3q^5$
must be both

a multiple of 5, and
one more than a multiple of 5,

which is absurd. Thus, $\sqrt[5]{3}$ is irrational.
Uniqueness is important. But how do we prove it?
Simple:
To prove uniqueness, consider two objects that satisfy the same property. Using only
the fact that they satisfy this property, prove that the two objects must be equal.
5.2. PROOF TECHNIQUE: UNIQUENESS 93
As a rst example, recall the Algebra II topic of inverse functions. To construct an inverse, you
1. Switched x and y.
2. Solved for y in terms of x
However, mindlessly applying this procedure doesnt guarantee
1
an inverse exists. Namely, we need
to rst prove the function is injective. This means that every output is mapped from a unique input.
2
Example. Consider the function
$$f(x) = 3x + 5.$$
Then every point in its image is mapped from a unique point in the domain.

Proof: Suppose we have two points a, b in the domain that are mapped to the same image point:
$$f(a) = f(b).$$
By definition,
$$3a + 5 = 3b + 5,$$
which implies
$$a = b.$$
Thus, if we have two points that are mapped to the same image point, then those two points must
be equal.
The prototypical example is proving the uniqueness in the Division Algorithm:
In elementary school, you performed long division using a remainder term. For example, when you
divided 22 by 4, the quotient was 5 with remainder 2:

      5 R 2
  4 ) 22
      20
       2

Now that we are adults, the proper way to express this division is
$$22 = 4q + r$$
where
$$q = 5, \qquad r = 2.$$

¹ We delay the proof that injectivity implies that the inverse function exists: see Lecture 34.
² As an analogy, think of Cryptography. Suppose two messages A and B were encoded as the same coded message
C. When we receive C and try to decode (invert) it, we have no idea whether the original message was A or B!
I claim that when you divide an integer by any positive integer, the quotient and remainder pair¹
(q, r) is always unique.

Example (Division Algorithm). Given a positive integer b, every integer n can be written uniquely
in the form
$$n = bq + r$$
where q and r are integers, and r is either $0, 1, 2, \ldots, b - 1$. Formally, if we have both
$$n = bq + r$$
$$n = bq' + r'$$
satisfying the conditions above, then
$$(q, r) = (q', r').$$

Proof Summary:

Consider two different ways to write $n = bq + r$.
Isolate the q's on one side and the r's on the other.
Take absolute values of both sides.
Suppose the q's are different. Then we have a number both greater than or equal to b and less than b,
a contradiction. Conclude $q' = q$.
Substitute $q' = q$ to immediately get $r' = r$.
Conclude $(q, r) = (q', r')$.
Proof: Suppose
$$n = bq + r$$
$$n = bq' + r'$$
for integers $q, q', r, r'$ with
$$0 \leq r, r' \leq b - 1.$$
Equating the two right hand expressions and isolating the q's on one side and the r's on the other,
$$bq + r = bq' + r'.$$
In other words,
$$b(q - q') = r' - r.$$
Suppose $q \neq q'$. Because we don't want to be bothered with negatives, take absolute values of both
sides:
$$|b(q - q')| = |r' - r|$$
and use the absolute value property $|xy| = |x||y|$ to split the left hand side:
$$|b|\,|q - q'| = |r' - r|.$$
Because b is positive, we drop the absolute value:
$$b\,|q - q'| = |r' - r|. \qquad (*)$$
Since $q - q' \neq 0$, we know $|q - q'|$ is some positive integer and therefore
$$b\,|q - q'| \geq b \cdot 1 = b.$$
However,
$$|r' - r| \leq b - 1.$$
This is because $|r' - r|$ is biggest if one of $r, r'$ is 0 and the other is $b - 1$. Applying these bounds to
$(*)$,
$$\underbrace{b\,|q - q'|}_{\geq\, b} = \underbrace{|r' - r|}_{<\, b},$$
which is impossible! Therefore
$$q = q'.$$
Substituting,
$$bq + r = b\underbrace{q}_{q'} + r',$$
we solve to get
$$r = r'.$$
In conclusion, if
$$n = bq + r$$
$$n = bq' + r'$$
for integers $q, q', r, r'$ with
$$0 \leq r, r' \leq b - 1,$$
it must be the case that
$$(q, r) = (q', r').$$

¹ Two pairs are different if they differ in at least one component. For example, (2,3) is different from the pairs (3,5),
(2,4), and (4,3).
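In Python, `divmod` computes exactly this (q, r) pair; here is a quick check of my own that it satisfies the two defining conditions:

```python
# divmod(n, b) returns the unique (q, r) with n = b*q + r and 0 <= r < b (for positive b)
for n in range(-50, 50):
    for b in range(1, 12):
        q, r = divmod(n, b)
        assert n == b * q + r and 0 <= r < b
print(divmod(22, 4))   # (5, 2), matching the long division above
```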
For the next example of uniqueness, consider the question

Is there any function other than $e^x$ that goes through (0, 1) and is its own derivative?

The answer is no. To prove this, we use one of my favorite tricks in mathematics:

Math Mantra: If you want to prove that a differentiable function g is uniquely
some (non-zero) function f, try differentiating $g(x)\cdot\frac{1}{f(x)}$.

In Lecture 3, we proved that

$f'$ is the zero function if and only if f is a constant.

Therefore,
$$\left( g(x)\cdot\frac{1}{f(x)} \right)' = 0$$
implies
$$g(x)\cdot\frac{1}{f(x)} = k$$
for some constant k. Hence,
$$g(x) = k f(x).$$
To solve for k, we can plug any number we want into x, e.g. 0:
$$k = \frac{g(0)}{f(0)}.$$
If $k = 1$, then
$$g(x) = f(x)$$
as needed.
Example. The unique function that is its own derivative and has value 1 at $x = 0$ is
$$f(x) = e^x.$$

Proof Summary:

Consider u that satisfies the same property.
Differentiate $u(x)e^{-x}$.
The derivative is 0, so $u(x)e^{-x}$ is constant.
Plug in 0 to solve for that constant. Conclude $u(x) = e^x$.

Proof: Suppose there is some other function u that has the property
$$u'(x) = u(x), \qquad u(0) = 1.$$
Consider
$$u(x)\cdot\frac{1}{e^x},$$
which is equivalently
$$u(x)e^{-x}.$$
Differentiate according to the product rule:
$$\bigl(u(x)e^{-x}\bigr)' = u'(x)e^{-x} - u(x)e^{-x}.$$
But u is its own derivative, so this expression simplifies to
$$\bigl(u(x)e^{-x}\bigr)' = u'(x)e^{-x} - \underbrace{u'(x)}_{u(x)}e^{-x} = 0.$$
Therefore,
$$u(x)e^{-x} = k$$
for some constant k. Since we know $u(0) = 1$, plug in 0 to solve for k:
$$u(0)e^{0} = 1.$$
Therefore $k = 1$ and
$$u(x)e^{-x} = 1.$$
Multiplying by $e^x$, we get
$$u(x) = e^x$$
and conclude that $e^x$ is the unique function that is its own derivative and has value 1 at $x = 0$.
5.3 Abelian Groups
What's purple and commutes?
An Abelian grape.

To understand the field axioms, first we need to understand abelian groups.
Consider some set S and some function $\star$ that takes any two inputs
$$x, y \in S$$
and returns an output
$$\star(x, y) \in S.$$
Alternatively, we write the output as
$$x \star y.$$
Then, we call $\star$ a binary function on S. In general,
Definition. For a set S and binary function $\star : S \times S \to S$, we call the pair $(S, \star)$ an abelian group
if the following properties hold:

Existence of an identity element
There is some element in the set such that when you apply the operation, you get the other
input: there exists $e \in S$ such that
$$e \star x = x \star e = x$$
for all $x \in S$.

Existence of inverses
For any input x, you can find some other input such that applying the operation outputs the
identity element: for all $x \in S$, there exists¹ $x' \in S$ such that
$$x \star x' = x' \star x = e.$$

Commutativity
The order in which the operation is applied does not matter: for all $x, y \in S$,
$$x \star y = y \star x.$$

Associativity
Applying $\star$ to x and $y \star z$ is the same as applying $\star$ to $x \star y$ and z: for all $x, y, z \in S$,
$$x \star (y \star z) = (x \star y) \star z.$$

¹ Notice that the element $x'$ depends on the choice of x. In other words, the inverse is a function of x.
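As a concrete (and entirely optional) illustration, here is a small Python sketch of my own that brute-force checks the four properties for the finite set {0, 1, ..., n−1} under addition mod n:

```python
from itertools import product

def is_abelian_group(S, op):
    """Brute-force check of the four abelian group axioms on a finite set S."""
    S = list(S)
    identity = next((e for e in S if all(op(e, x) == x == op(x, e) for x in S)), None)
    if identity is None:
        return False
    has_inverses = all(any(op(x, y) == identity for y in S) for x in S)
    commutative  = all(op(x, y) == op(y, x) for x, y in product(S, S))
    associative  = all(op(x, op(y, z)) == op(op(x, y), z) for x, y, z in product(S, S, S))
    return has_inverses and commutative and associative

Zn = range(6)
print(is_abelian_group(Zn, lambda x, y: (x + y) % 6))   # True: (Z_6, + mod 6)
print(is_abelian_group(Zn, lambda x, y: (x * y) % 6))   # False: 0 (among others) has no inverse
```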
Here are a few fundamental properties you should know about abelian groups:
Theorem. The identity element e is unique.

Proof: Suppose there are two identity elements $e_1$ and $e_2$. By definition of $e_1$ being an identity, for
any element $x \in S$,
$$e_1 \star x = x.$$
Taking $x = e_2$ gives
$$e_1 \star \underbrace{e_2}_{x} = \underbrace{e_2}_{x}. \qquad (1)$$
Likewise, we know $e_2$ is an identity element, so
$$x \star e_2 = x$$
for any $x \in S$. In particular,
$$\underbrace{e_1}_{x} \star e_2 = \underbrace{e_1}_{x}. \qquad (2)$$
Combining (1) and (2),
$$e_1 = e_1 \star e_2 = e_2.$$

Theorem. For each $x \in S$, the inverse of x is unique.

Proof: Suppose x has two inverses, say $i_1$ and $i_2$. Then we know
$$i_1 = i_1 \star e \qquad \text{(Identity)}$$
$$= i_1 \star (x \star i_2) \qquad (i_2 \text{ is an inverse})$$
$$= (i_1 \star x) \star i_2 \qquad \text{(Associativity)}$$
$$= e \star i_2 \qquad (i_1 \text{ is an inverse})$$
$$= i_2 \qquad \text{(Identity)}$$

Theorem. The inverse of the inverse of an element is the original element:
$$x = (x')'.$$

Proof: Since inverses are unique by the last theorem, proving
$$x = (x')'$$
means we have to show x is the inverse of $x'$, i.e.
$$x \star x' = x' \star x = e.$$
But this is immediately true by the definition of $x'$.
5.4 Fields
To form a field, we start with an abelian group
$$(F, \oplus)$$
with identity element $e_1$. Then we form another abelian group on F, except with the identity $e_1$
removed (you'll see why):
$$(F \setminus \{e_1\}, \otimes)$$
Finally, we relate the two operations via a distributive law. Formally,
Definition. For a set F and binary functions $\oplus, \otimes$, we say that $(F, \oplus, \otimes)$ is a field if

$(F, \oplus)$ is an abelian group with identity $e_1$.
$(F \setminus \{e_1\}, \otimes)$ is an abelian group with identity $e_2$.
The identity elements of the two groups are different: $e_1 \neq e_2$.
$\oplus, \otimes$ satisfy the distributive law: for any $a, b, c \in F$,
$$a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c).$$

Note that we take $e_1$ away from F to get an abelian group under $\otimes$ and require
$$e_1 \neq e_2.$$
This is because we do not want $\oplus, \otimes$ to be the same abelian group operation. Moreover, this rule
forbids the existence of an inverse of $e_1$ under $\otimes$. To see why, first notice
Theorem. For any $x \in F$,
$$e_1 \otimes x = e_1.$$

Proof: For any $x \in F$,
$$e_1 \otimes x = (e_1 \oplus e_1) \otimes x \qquad \text{(Identity)}$$
$$= x \otimes (e_1 \oplus e_1) \qquad \text{(Commutativity)}$$
$$= (x \otimes e_1) \oplus (x \otimes e_1) \qquad \text{(Distributive Law)}$$
$$= (e_1 \otimes x) \oplus (e_1 \otimes x) \qquad \text{(Commutativity)}$$
Applying the inverse of $e_1 \otimes x$ (under $\oplus$) to both sides of
$$e_1 \otimes x = (e_1 \otimes x) \oplus (e_1 \otimes x)$$
yields
$$e_1 = e_1 \otimes x.$$
Using this theorem, we can show

Theorem. The inverse of $e_1$ under $\otimes$ does not exist.

Proof: Suppose it does exist. Let $e_1'$ denote the inverse of $e_1$ under $\otimes$.
By definition of inverse (under $\otimes$),
$$e_1 \otimes e_1' = e_2.$$
Yet the previous theorem gives us
$$e_1 \otimes e_1' = e_1.$$
Therefore,
$$e_1 = e_2,$$
which we have already forbidden!
5.5 Field Axiom
The first axiom¹ that we accept about the reals is

Field Axiom
$\mathbb{R}$ is a field.

In particular,

$(\mathbb{R}, +)$ is an abelian group with identity 0. The inverse of x under + is denoted by $-x$.
$(\mathbb{R} \setminus \{0\}, \cdot)$ is an abelian group with identity 1. The inverse of x under $\cdot$ is denoted by $x^{-1}$.

Because we have already proven results about general groups and fields, these theorems apply in
particular to $\mathbb{R}$. Generally,

Math Mantra: If a theorem about an object is proven from certain properties of
that object, then the theorem also holds true for any OTHER object that
satisfies the same properties.

Therefore,

General Group Theorem                                  $(\mathbb{R}, +)$          $(\mathbb{R} \setminus \{0\}, \cdot)$
The identity is unique                                 0 is unique                1 is unique
The inverse is unique                                  $-x$ is unique             $x^{-1}$ is unique
The inverse of the inverse is the original element     $-(-x) = x$                $(x^{-1})^{-1} = x$

Moreover, the field theorems give us
$$0 \cdot x = 0 \text{ for every } x \in \mathbb{R}$$
and
$$0^{-1} \text{ does not exist.}$$

¹ In most texts, this is introduced as a set of axioms.
In fact, by only applying the properties of a field, we can derive several fundamental facts about
the reals.
There are a few reasons why we focus on using only the axioms:

To convince ourselves that our axioms aren't bone-headed.
If we can use our axioms to prove a property that does not actually hold for the reals, then we
have made a lousy choice of axioms.

To generalize.
Properties derived from the field axioms hold for any field. In particular, they apply¹ to $\mathbb{Q}$ and
$\mathbb{C}$.

To design Logic Systems.
We can model the real numbers in a Logic System.² In such a system, we treat numbers as
purely syntax and apply formal syntax manipulation rules. Moreover, we can use such a system
to program a proof-solver.

Using only the field axioms (and their derived properties), we can prove the following fundamental
results for $\mathbb{R}$. Note that these results rely on the distributive law.³
3
Theorem. For any a, b R with b = 0,
(a b
1
) = (a) b
1
Proof: By denition of , we want to show (a) b
1
is the additive inverse of a b
1
:
(a) b
1
+ a b
1
= 0
But we can directly check that
(a) b
1
+ a b
1
= b
1
(a) + b
1
a (Commutativity)
= b
1
(a + a) (Distributive Law)
= b
1
0 (Additive Inverse)
= 0 b
1
(Commutativity)
= 0 (0 a = 0)
Thus, a b
1
is the additive inverse of a b
1
.
1
My favorite eld is F
scho
.
2
To learn more, I highly recommend taking Phil 151: Professor Sommer is an excellent lecturer.
3
Typically, if you are going to prove a result involving both operations, it is going to involve the distributive law.
This is because its the only eld property relating both operations.
5.5. FIELD AXIOM 103
Theorem. For any $a \in \mathbb{R}$,
$$-1 \cdot a = -a.$$

Proof: The above statement really means that we want to show $-1 \cdot a$ is the additive inverse of a.
Precisely,
$$(-1 \cdot a) + a = 0.$$
But we can see
$$(-1 \cdot a) + a = -1 \cdot a + 1 \cdot a \qquad \text{(Multiplicative Identity)}$$
$$= a \cdot (-1) + a \cdot 1 \qquad \text{(Commutativity)}$$
$$= a \cdot (-1 + 1) \qquad \text{(Distributive Law)}$$
$$= (-1 + 1) \cdot a \qquad \text{(Commutativity)}$$
$$= 0 \cdot a \qquad \text{(Additive Inverse)}$$
$$= 0 \qquad (0 \cdot a = 0)$$
Thus, we can conclude
$$-1 \cdot a = -a.$$

Theorem. In $\mathbb{R}$,
$$-1 \cdot -1 = 1.$$

Proof: Plugging $a = -1$ into the preceding theorem yields
$$-1 \cdot -1 = -(-1).$$
But we know the inverse of an inverse is the original element, so
$$-1 \cdot -1 = 1.$$
Theorem. If $a \cdot b = 0$ then either¹ $a = 0$ or $b = 0$.

Proof: Let $ab = 0$ and suppose it is not the case that either $a = 0$ or $b = 0$. Then $a \neq 0$ and $b \neq 0$.
In particular, a has a multiplicative inverse $a^{-1}$, so
$$a \cdot b = 0 \implies a^{-1} \cdot (a \cdot b) = a^{-1} \cdot 0 \implies (a^{-1} \cdot a) \cdot b = a^{-1} \cdot 0 \implies b = 0,$$
a contradiction since $b \neq 0$. Thus, it must be the case that either $a = 0$ or $b = 0$.

¹ This does not preclude $a = 0$ and $b = 0$.
5.6 Shorthand Notations

You may be wondering whether we need to define axioms for subtraction and division. There's no
need: we've already done the work when we defined inverses. Subtraction and division are really just
shorthand notations for "combine with the inverse." Specifically, we define
$$a - b$$
to be shorthand for
$$a + (-b)$$
and
$$\frac{a}{b}$$
to be shorthand for
$$a \cdot b^{-1}.$$
Here is a list of shorthands involving $+, \cdot$:

Shorthand        Definition
$a - b$          $a + (-b)$
$\frac{a}{b}$    $a \cdot b^{-1}$
$ab$             $a \cdot b$

and here are shorthands we will use when we introduce orderings:

Shorthand        Definition
$a \geq 0$       $a > 0$ or $a = 0$
$a > b$          $a - b > 0$
$a < b$          $b > a$
By the way, don't think of a shorthand as something new. It's not. It's just another way to write
something. Don't be scared of it!
If you ever feel like panicking when you see new notation, remember:
Math Mantra: Don't fear notation! Just expand the notation according to its
definition to get something you are familiar with.
5.7 Ordering Axioms
Recall that all the previous theorems apply to any field. In particular, the previous results hold true
for the rationals.
In order to axiomatize the reals and distinguish them from the rationals, we need to introduce one key
axiom. However, for this axiom to make any sense, we must first introduce the ordering axioms.¹

¹ As a fun exercise, you can show that the complex numbers do not satisfy the ordering axioms. Therefore, the
ordering axioms distinguish the reals from the complex numbers.
Ordering Axioms

Trichotomy: For any $a \in \mathbb{R}$, exactly one of the following must hold:
$$a > 0, \qquad a = 0, \qquad -a > 0.$$

Positivity: For any $x, y \in \mathbb{R}$, if $x > 0$ and $y > 0$, then
$$x + y > 0,$$
$$x \cdot y > 0.$$

Take note of trichotomy in particular: it says that a is the same object as 0, or either it or its additive
inverse is greater¹ than 0. Moreover, trichotomy prescribes that precisely one of these conditions
holds. This is useful because if we can ever show that two of the conditions hold, then we have a contra-
diction.
Here are a few consequences of the ordering axioms:

Theorem.
$$1 > 0$$

Proof: Suppose that it is not the case that $1 > 0$. Then by trichotomy either
$$1 = 0$$
or
$$-1 > 0.$$
By definition of a field, $1 \neq 0$, so it must be the case that
$$-1 > 0.$$
Using positivity,
$$\underbrace{-1}_{x} \cdot \underbrace{-1}_{y} > 0.$$
But we proved $-1 \cdot -1 = 1$, so
$$1 > 0,$$
which is a contradiction since we assumed this was not the case. Thus $1 > 0$.
The next two theorems order elements relative to their inverses: if a is greater than 0, its additive
inverse is less than 0 while its multiplicative inverse is greater than 0.

¹ I hesitate to say "greater" here, since > is just a symbol with the prescribed properties.
Theorem. If
$$a > 0$$
then
$$0 > -a.$$

Proof: We will prove that the two statements are equivalent. The inequality
$$0 > -a$$
is really just shorthand for
$$0 - (-a) > 0.$$
Moreover, $0 - (-a)$ is still shorthand, which we can expand to get
$$0 + (-(-a)) > 0.$$
We already proved the group property
$$-(-a) = a,$$
so the inequality is equivalent to
$$0 + a > 0.$$
Since 0 is the additive identity, this is just
$$a > 0,$$
as needed.
Theorem. If
$$a > 0$$
then
$$\frac{1}{a} > 0.$$

Proof: Let $a > 0$ and suppose it is not the case that $\frac{1}{a} > 0$. Then, by trichotomy on $\frac{1}{a}$, we either have
$$\frac{1}{a} = 0$$
or
$$-\frac{1}{a} > 0.$$

Case 1: $\frac{1}{a} = 0$.

If this is the case, we multiply both sides by a to get
$$1 = 0,$$
which is impossible.

Case 2: $-\frac{1}{a} > 0$.

We know that $a > 0$, so by positivity on
$$x = -\frac{1}{a}, \qquad y = a,$$
we have
$$\underbrace{-\frac{1}{a}}_{x} \cdot \underbrace{a}_{y} > 0.$$
But the left hand side is just
$$-\frac{1}{a} \cdot a = (-1 \cdot a^{-1}) \cdot a \qquad (-1 \cdot x = -x)$$
$$= -1 \cdot (a^{-1} \cdot a) \qquad \text{(Associativity)}$$
$$= -1 \cdot 1 \qquad \text{(Multiplicative Inverse)}$$
$$= -1 \qquad \text{(Identity)}$$
So,
$$-1 > 0.$$
But this contradicts trichotomy since we already know $1 > 0$ and exactly one of the three
scenarios must hold. Thus, this case is impossible.

Since each case yields a contradiction, we can conclude
$$\frac{1}{a} > 0.$$
5.8 Completeness Axiom
Notice that the field axiom and ordering axioms apply to both the reals and the rationals. Therefore,
we need an additional axiom to distinguish the two. Precisely, we need an axiom that implies

Irrational numbers exist.

To discover the missing axiom, first view the rationals on a number line where the holes are the
missing irrational numbers.
Consider some hole S on the number line. This hole can be associated to some set A of elements left of S.
Now, we create an axiom that says we can fill in this point, i.e. this point exists.
But how can we do this precisely?
Let's examine the relationship between the set A and the point S.
First notice that A is to the left of S. As per our number line convention, S is greater than any point in
the set A. So we say that S is an upper bound of the set A.
However, A has infinitely many upper bounds, namely all the points to the right of A.
Therefore, in order to distinguish S, we notice that S is the least of these points to the right of A.
Formally,
Definition. S is the supremum or least upper bound of the set A if it is both:

An upper bound.
S is greater than or equal to any element in the set:
$$x \leq S$$
for any $x \in A$.

The least upper bound.
S is smaller than (or equal to) every other upper bound: if there is some B such that for any
$x \in A$,
$$x \leq B,$$
then it is the case that
$$S \leq B.$$

Note that the supremum need not be in A.

Armed with this idea, we now define the axiom¹ that distinguishes the reals from the rationals:

¹ If you are like me, and don't like Hole-ly things, we will have a much nicer equivalent characterization of this axiom
once we discuss limits.
Completeness Axiom
Any non-empty set which is bounded above has a supremum.

Notice that we had to add a few provisos to our axiom, namely the words "bounded above" and
"non-empty." This is because:

A set that is not bounded above has no rightmost point (the set contains points that are
arbitrarily far right).
An empty set has no rightmost point (what is the rightmost point of nothing?).

The Completeness Axiom gives us a new proof technique: we can now justify that certain numbers exist!
5.9 Proof Technique: Existence
I think, therefore I am.
-Descartes

It is easy to prove something doesn't exist. For example, it is easy to prove that we can't have an
integer that is both even and odd: if it existed, we would have a contradiction. But,

How do we prove something exists?

The most intuitive answer is

Find it.

For example, to prove there exists a number that is divisible by 2 or 3, you can construct 6 as an
example. Or, to prove double rainbows exist, you just do a YouTube search.
But besides construction, there are other ways to prove existence.
For example, using proof by cases, we already proved that there exist irrationals a, b such that
$$a^b$$
is rational. Contrary to our intuitive technique for proving existence, this proof gives us no insight
into what such a number looks like. We will see this again next lecture: we will prove

Every non-trivial subspace has a basis

without constructing a particular basis! It's like the author of this book: this book is proof that I
exist, but you have no clue what I look like!
The only real¹ technique you will use to prove existence is the Completeness Axiom. Formally, suppose
we want to prove there exists an element that satisfies a specific property. Then,

¹ Ignoring my delightful pun, the Completeness Axiom is an axiom. So "proof" really means an agreed acceptance.
1. Choose some bounded set.
2. By the Completeness Axiom, there exists an element that is a least upper bound of this set.
3. Show that this element satisfies the desired property.
Let's prove the classic example that there is some real number x such that
$$x^2 = 2.$$
To make our lives easier, assume all the typical properties of the rationals and of >, and do not refer
back to the other axioms.¹

Theorem. There exists a real number x such that
$$x^2 = 2.$$

Proof Summary:

Consider the set of positive numbers whose squares are less than 2.
The set is bounded and non-empty; therefore, there exists a supremum S by the Completeness
Axiom.
$S^2 < 2$ is impossible: else you can construct $(S + \epsilon)^2 < 2$, contradicting that S is an upper bound.
$S^2 > 2$ is impossible: else you can construct $(S - \epsilon)^2 > 2$, contradicting that S is the least upper bound.
Conclude $S^2 = 2$.

¹ Before we return to analysis, I would like to make a recommendation. If you enjoyed manipulating abstract
properties, I highly recommend taking Math 120 and Math 121. Especially with Professor Sound: he is the Morgan
Freeman of the Math department!
Proof: Consider the set
$$\left\{ x \in \mathbb{R} \;\middle|\; 0 < x \text{ and } x^2 < 2 \right\}.$$
First notice this set is non-empty since it contains 1:
$$1^2 = 1 < 2.$$
The set is also bounded above by 2. Otherwise, if there were an element y in the set such that $y > 2$,
then
$$y^2 > 4.$$
But by definition,
$$y^2 < 2,$$
so
$$4 < 2,$$
which is impossible. Thus, by the Completeness Axiom, a supremum S exists. That's the easy part.
Here's the hard part: we have to check that this element satisfies
$$S^2 = 2.$$
Suppose not. Then either
$$S^2 < 2 \quad \text{OR} \quad S^2 > 2.$$

Case 1: $S^2 < 2$.

To prove this is impossible, we want to find an element that is bigger than S and in our
original set. This will contradict that S is an upper bound. In particular, we need to find a
positive $\epsilon$ such that
$$(S + \epsilon)^2 < 2.$$
If we can find such an $\epsilon$, then both
$$(S + \epsilon) > S$$
and
$$(S + \epsilon) \in \left\{ x \in \mathbb{R} \;\middle|\; 0 < x \text{ and } x^2 < 2 \right\}.$$
But finding such a satisfying $\epsilon$ means we must find one that satisfies
$$\underbrace{S^2 + 2S\epsilon + \epsilon^2}_{(S+\epsilon)^2} < 2,$$
or equivalently,
$$2S\epsilon + \epsilon^2 < 2 - S^2. \qquad (*)$$
Thus, if we can choose a positive $\epsilon$ that makes the last condition true, we are done!
But there is a problem: how can we work with an ugly $\epsilon^2$ term? If life were easy and it weren't
a square, we could isolate $\epsilon$ on one side.
Here's a trick you are going to use often in this course:

If $0 < \epsilon < 1$, then $\epsilon^2 < \epsilon$.

Why is this true? Simple:
$$\epsilon^2 = \epsilon \cdot \epsilon < \epsilon \cdot 1 \quad (\text{since } \epsilon < 1) \quad = \epsilon.$$
If we restrict $\epsilon < 1$, then we have an upper bound on $2S\epsilon + \epsilon^2$, namely
$$2S\epsilon + \epsilon^2 < 2S\epsilon + \epsilon.$$
Therefore, to show $(*)$ is satisfied, we simply need to show our upper bound on $2S\epsilon + \epsilon^2$ is smaller
than $2 - S^2$:
$$2S\epsilon + \epsilon < 2 - S^2.$$
But this is a much easier problem since we can isolate $\epsilon$:
$$\underbrace{\epsilon(2S + 1)}_{2S\epsilon + \epsilon} < 2 - S^2,$$
and rewrite it as
$$\epsilon < \frac{2 - S^2}{2S + 1}.$$
Therefore, given the constant S, we need to find a positive $\epsilon$ that satisfies the above inequality.
But that is easy: define $\epsilon$ to be the right hand side divided by 2!
$$\epsilon = \frac{2 - S^2}{2(2S + 1)}$$
We are almost good to go, but there is a slight fly in the ointment!
We need to make sure that our $\epsilon$ is smaller than 1 for our inequality to work.
Therefore, choose
$$\epsilon = \min\left\{ \frac{2 - S^2}{2(2S + 1)}, \; 1 \right\}.$$

Case 2: $S^2 > 2$.

To prove this is impossible, we find a smaller element that is an upper bound of our set.
Particularly, we want to find a positive $\epsilon$ such that
$$(S - \epsilon)^2 > 2.$$
Indeed, if this is the case,
$$S - \epsilon < S,$$
contradicting that S is the least upper bound.
By algebra again, this is equivalent to showing
$$-2S\epsilon + \epsilon^2 > 2 - S^2.$$
After dividing both sides by $-1$, this is equivalent to
$$2S\epsilon - \epsilon^2 < S^2 - 2.$$
Since $\epsilon^2 > 0$, and since $S \leq 2$ (after all, 2 is an upper bound of our set and S is the least upper bound),
$$2S\epsilon - \epsilon^2 < 2S\epsilon \leq 4\epsilon.$$
This means we just have to make sure
$$4\epsilon \leq S^2 - 2.$$
Simply choose
$$\epsilon = \frac{S^2 - 2}{4}.$$

In each case, we have a contradiction. Therefore,
$$S^2 = 2.$$
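The Completeness Axiom is non-constructive, but we can still approximate the supremum numerically. Here is a bisection sketch of my own that closes in on $\sup\{x > 0 : x^2 < 2\}$:

```python
# approximate S = sup{x > 0 : x^2 < 2} by bisection on [1, 2]
lo, hi = 1.0, 2.0            # 1 is in the set, 2 is an upper bound
for _ in range(60):
    mid = (lo + hi) / 2
    if mid * mid < 2:        # mid is in the set, so the supremum is at least mid
        lo = mid
    else:                    # mid is an upper bound
        hi = mid
print(lo, lo * lo)           # ~1.41421356..., and its square is ~2
```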
Once again, this tells us nothing about what such an S looks like, only that S exists. But,

Math Mantra: Once we know an object exists, we can give it a name and derive
properties about it!

In particular, using the fact that this S exists, we can give it the name $\sqrt{2}$. Then, we can prove neat
properties about it, e.g. that $\sqrt{2}$ is irrational.
Lecture 6
All Your Basis are Belong to Us
The best proofs are simple. I did not say easy, I said simple.
-Leon Simon
Goals: In Lecture 3, we learned how to construct a subspace by taking the span of a
base set of vectors. Today, we focus on the opposite direction: constructing a base set
of spanning vectors given the subspace.
6.1 Building Downwards
Whenever we define a new type of animal¹ in mathematics, we like to ask ourselves two major
questions. The first is,

Starting from some simple atoms, is there a process to construct animals of that type?

If the answer is yes, then we can build animals to play with. This is the upwards approach. The
second important question we can ask is,

Starting from the animal, can we break it down into simple atoms?

The reason why we like to ask this question is that

Math Mantra: If we can break an object into a simple atomic structure, then we
can derive tons of properties of the object just by examining the little atoms!

We've seen these upwards and downwards processes before: starting from the base set of prime num-
bers, we can build all positive integers, and starting from the positive integers, we can decompose
them into a product of primes. By studying the prime factorization, we can prove a ridiculous number
of cool properties about the integers.
Don't believe that these are important processes? Here is an undergraduate check-list:

¹ Another Leon Simonism.
Area of Mathematics    Topic
Number Theory          Prime Factorization
Algebra                Group Generators
Real Analysis          Open Sets into Intervals
Analysis               Vector Space¹ as a Basis Span
Geometry               Simple Polygons into Triangles
Logic                  Syntax
Discrete Math          Tree Growing Procedures
In Lecture 3, we built a subspace by taking the span of a set of vectors. In fact, by removing re-
dundancy, we could have assumed the span is built from a linearly independent set. That was going
upwards.
Today, we are going downwards. Given the subspace, we will construct a set of spanning vectors.
This theorem is literally one of the most important theorems of your undergraduate career and will
be used in many of your Linear Algebra proofs:
Every non-trivial² subspace has a basis.
But before this proof can make sense, you need to learn how to prove that two sets are equal. This
vital technique will be used in all of your future math courses.
6.2 Proof Technique: Proving Two Sets are Equal
What does it mean for two objects to be equal? In your mathematical career, you will see many
different notions of equality. Up until this point, the definition has been

Two objects are equal if they are identically the same object.

For example, when I taught in Taiwan, my boss Burch worked as a nightclub DJ. His name was DJ
Kimchi, so

Burch = DJ Kimchi

since both names reference the same object. So if I hit Burch in the head, that would be the same as
hitting DJ Kimchi in the head.
As a more mathematical example, suppose y is a constant satisfying
$$e^y = 2.$$
When you applied the log operation on both sides, what you were really doing was asserting:

Because $e^y$ and 2 are the same object, the log of $e^y$ is the same as the log of 2.

¹ Assuming the vector space is finite dimensional. In Math 171, you will see that this is not true for infinite
dimensional vector spaces.
² The trivial case $V = \{\vec{0}\}$ does not have a basis.
But what do we mean by "two sets are equal"? Intuitively, we know
$$\{a, b, c\}$$
is considered the same as
$$\{c, a, b\}$$
because they have the same elements.¹ Formally, we define the following axiom:

If $A \subseteq B$ and $B \subseteq A$, then $A = B$.

In words, this says

If we can show that every element in A is an element of B, and that every element in B is an
element of A, then we conclude A and B are the same set.

The logicians call this the Axiom of Extensionality.
Let's do a few examples:

Example. The set of all elements formed from adding integer combinations of 12 and 14 is the same
as the set of all even integers:
$$\{12x + 14y \mid x, y \in \mathbb{Z}\} = \{2n \mid n \in \mathbb{Z}\}.$$
Proof: Dene
A = {12x + 14y | x, y Z}
B = {2n| n Z}

Let a be an arbitrary element of A. By denition of A,
a = 12s + 14t
for some integers s, t. Then,
a = 12s + 14t = 2 (6s + 7t)
. .
Z
.
Thus, a B. Since a was arbitrary, every element of A is in B.

Now let b be an arbitrary element of B. Then,
b = 2j
for some integer j. But
b = 2j = (14 12)j = 12(j) + 14j
Therefore, b A. Since b was arbitrary, every element of B is in A.
1
However, (a, b, c) = (c, a, b). Ordered tuples are not the same as sets.
118 LECTURE 6. ALL YOUR BASIS ARE BELONG TO US
Since A B and B A, we can conclude A = B.
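If you like to sanity-check a set equality numerically before proving it, here is a minimal Python sketch (my own illustration, not part of the text); the search window LIMIT is an arbitrary choice:

    LIMIT = 50  # arbitrary search window
    A = {12 * x + 14 * y for x in range(-LIMIT, LIMIT + 1) for y in range(-LIMIT, LIMIT + 1)}
    B = {2 * n for n in range(-LIMIT, LIMIT + 1)}
    print(all(a % 2 == 0 for a in A))  # every element of A is even
    print(B <= A)                      # every even number in the window is some 12x + 14y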
In Lecture 4, we mentioned that applying elementary operations to a system of equations leaves the
solution space unchanged. This is equivalent to showing
The set of solutions of the original system is the same as
the set of solutions of the transformed system.
Since this is trivial to show for switching two equations and scaling one equation by a non-zero
constant, we only prove the last transformation:
Example. Consider the system of equations
$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \ldots + a_{1n}x_n &= b_1\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \ldots + a_{2n}x_n &= b_2\\
&\;\;\vdots\\
a_{m1}x_1 + a_{m2}x_2 + a_{m3}x_3 + \ldots + a_{mn}x_n &= b_m
\end{aligned}$$
and the system obtained by taking the first equation and adding a scalar k times another equation (the i-th):
$$\begin{aligned}
(a_{11} + ka_{i1})x_1 + (a_{12} + ka_{i2})x_2 + (a_{13} + ka_{i3})x_3 + \ldots + (a_{1n} + ka_{in})x_n &= b_1 + kb_i\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \ldots + a_{2n}x_n &= b_2\\
&\;\;\vdots\\
a_{m1}x_1 + a_{m2}x_2 + a_{m3}x_3 + \ldots + a_{mn}x_n &= b_m
\end{aligned}$$
Then
$$S_1 = S_2$$
where $S_1$ is the set of all solutions to the first system and $S_2$ is the set of all solutions to the
transformed system.

Proof:
($\subseteq$) Let $s \in S_1$,
$$s = (s_1, s_2, \ldots, s_n).$$
By definition, s satisfies every equation of the original system; in particular
$$\begin{aligned}
a_{21}s_1 + a_{22}s_2 + a_{23}s_3 + \ldots + a_{2n}s_n &= b_2\\
&\;\;\vdots\\
a_{m1}s_1 + a_{m2}s_2 + a_{m3}s_3 + \ldots + a_{mn}s_n &= b_m
\end{aligned}$$
which are also equations of the transformed system. So we just need to check that
$$(a_{11} + ka_{i1})s_1 + (a_{12} + ka_{i2})s_2 + (a_{13} + ka_{i3})s_3 + \ldots + (a_{1n} + ka_{in})s_n = b_1 + kb_i.$$
But
$$a_{i1}s_1 + a_{i2}s_2 + a_{i3}s_3 + \ldots + a_{in}s_n = b_i$$
implies
$$ka_{i1}s_1 + ka_{i2}s_2 + ka_{i3}s_3 + \ldots + ka_{in}s_n = kb_i,$$
and we are already given
$$a_{11}s_1 + a_{12}s_2 + a_{13}s_3 + \ldots + a_{1n}s_n = b_1.$$
Summing the two equations yields
$$(a_{11} + ka_{i1})s_1 + (a_{12} + ka_{i2})s_2 + (a_{13} + ka_{i3})s_3 + \ldots + (a_{1n} + ka_{in})s_n = b_1 + kb_i.$$
Thus, $s \in S_2$.
($\supseteq$) Now assume $s \in S_2$. As before, s satisfies equations $2, 3, \ldots, m$, so we need only show that
$$a_{11}s_1 + a_{12}s_2 + a_{13}s_3 + \ldots + a_{1n}s_n = b_1.$$
But we already know
$$a_{i1}s_1 + a_{i2}s_2 + a_{i3}s_3 + \ldots + a_{in}s_n = b_i,$$
so scaling both sides by k,
$$ka_{i1}s_1 + ka_{i2}s_2 + ka_{i3}s_3 + \ldots + ka_{in}s_n = kb_i.$$
Subtracting this from
$$(a_{11} + ka_{i1})s_1 + (a_{12} + ka_{i2})s_2 + (a_{13} + ka_{i3})s_3 + \ldots + (a_{1n} + ka_{in})s_n = b_1 + kb_i$$
yields
$$a_{11}s_1 + a_{12}s_2 + a_{13}s_3 + \ldots + a_{1n}s_n = b_1,$$
so $s \in S_1$.
In Lecture 5, we derived the division algorithm to get expressions of the form:
n = qb + r.
We can exploit this algorithm to rapidly calculate the greatest common divisor between two integers.
This calculation relies on the following fact:
Example. For any integers n, b, q, r such that
n = qb + r
we have
gcd(n, b) = gcd(b, r)
Proof: Define
$$A = \{d \in \mathbb{Z} \mid d \text{ divides } n \text{ and } d \text{ divides } b\}$$
$$B = \{d \in \mathbb{Z} \mid d \text{ divides } b \text{ and } d \text{ divides } r\}$$
i.e.,
A is the set of common divisors of n and b,
B is the set of common divisors of b and r.
If we can prove A and B are the same set, then, in particular, the greatest element in each set is the
same. Thus,
$$\gcd(n, b) = \gcd(b, r).$$
($\subseteq$) Let a be an arbitrary element of A. Then a divides n and b, so
$$n = t_1 a \qquad b = t_2 a$$
for some integers $t_1, t_2$. Plugging in
$$\underbrace{t_1 a}_{n} = q\,\underbrace{t_2 a}_{b} + r$$
we get
$$r = (t_1 - qt_2)a,$$
so a divides r. Since we already know a divides b, we conclude $a \in B$. Since a was arbitrary,
every element of A is in B.
($\supseteq$) Follow the same argument: let s be an arbitrary element of B. Then,
$$b = t_1 s \qquad r = t_2 s$$
for some integers $t_1, t_2$. Again, plug in:
$$n = q\,\underbrace{t_1 s}_{b} + \underbrace{t_2 s}_{r}$$
Then,
$$n = (qt_1 + t_2)s,$$
so s divides n. Since we already know s divides b, we conclude $s \in A$. Since s was an arbitrary element of B, every element of B is in A.
In conclusion,
$$\{d \in \mathbb{Z} \mid d \text{ divides } n \text{ and } d \text{ divides } b\} = \{d \in \mathbb{Z} \mid d \text{ divides } b \text{ and } d \text{ divides } r\}$$
Particularly, the greatest element in each set is the same, implying:
$$\gcd(n, b) = \gcd(b, r).$$
Why is the preceding theorem useful? Suppose you wanted to compute
$$\gcd(179217921,\ 17921792).$$
The elementary school way to compute this is to calculate the lists of factors and take the biggest
number on both lists:
$$17921792:\ 1, 2, 4, \ldots$$
$$179217921:\ 1, 3, 373, \ldots$$
This is painfully slow. Instead, we can use the theorem we just proved. Since
$$\underbrace{179217921}_{n} = \underbrace{10}_{q}\cdot\underbrace{17921792}_{b} + \underbrace{1}_{r},$$
we immediately get
$$\gcd(179217921,\ 17921792) = \gcd(17921792,\ 1) = 1.$$
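Repeating this step until the remainder hits 0 is exactly the Euclidean algorithm. Here is a minimal Python sketch of that idea (my own sanity-check aid, not part of the course material):

    def gcd(n, b):
        # Repeatedly replace (n, b) by (b, r), where n = qb + r, until b = 0.
        while b != 0:
            n, b = b, n % b
        return abs(n)

    print(gcd(179217921, 17921792))  # 1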
Number theory is great, but how about a more relevant example? Absolutely!
Here's a simple one: consider the vector with i-th component 1 and the rest 0:
$$\vec{e}_i = \begin{pmatrix} 0\\ \vdots\\ 0\\ 1\\ 0\\ \vdots\\ 0 \end{pmatrix} \leftarrow \text{$i$-th component}$$
This vector $\vec{e}_i$ is called the i-th standard basis vector. Why are the standard basis vectors
$\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_n$ important? We can break any vector in $\mathbb{R}^n$ into a sum of scaled standard basis vec-
tors. For example,
$$\begin{pmatrix} 3\\ 2\\ 1 \end{pmatrix} = 3\begin{pmatrix} 1\\ 0\\ 0 \end{pmatrix} + 2\begin{pmatrix} 0\\ 1\\ 0 \end{pmatrix} + 1\begin{pmatrix} 0\\ 0\\ 1 \end{pmatrix} = 3\vec{e}_1 + 2\vec{e}_2 + 1\vec{e}_3$$
Rigorously,

Example. For standard basis vectors $\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_n \in \mathbb{R}^n$,
$$\mathbb{R}^n = \text{span}\{\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_n\}$$
Proof:
($\supseteq$) This is immediate since any span of vectors in $\mathbb{R}^n$ is contained in $\mathbb{R}^n$.
($\subseteq$) Let $\vec{x} \in \mathbb{R}^n$. Then,
$$\vec{x} = \begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix}
= x_1\begin{pmatrix} 1\\ 0\\ \vdots\\ 0 \end{pmatrix}
+ x_2\begin{pmatrix} 0\\ 1\\ \vdots\\ 0 \end{pmatrix}
+ \ldots
+ x_n\begin{pmatrix} 0\\ 0\\ \vdots\\ 1 \end{pmatrix}
= x_1\vec{e}_1 + x_2\vec{e}_2 + \ldots + x_n\vec{e}_n$$
Thus,
$$\vec{x} \in \text{span}\{\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_n\}$$
The next example is an often-used span property. Specifically, if we have a redundant vector in our
spanning list, we can toss it away:
Example. Suppose $\vec{v}$ is a linear combination of $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n$. Then,
$$\text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n, \vec{v}\} = \text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n\}$$
Proof:
($\supseteq$) This is immediate from the definition. Any linear combination of the vectors on the right is a linear
combination of the vectors on the left if we just add $0\vec{v}$.
($\subseteq$) Let
$$\vec{x} \in \text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n, \vec{v}\}.$$
Then,
$$\vec{x} = \alpha_1\vec{v}_1 + \alpha_2\vec{v}_2 + \ldots + \alpha_n\vec{v}_n + \alpha\vec{v}$$
for some $\alpha_1, \ldots, \alpha_n, \alpha \in \mathbb{R}$. But $\vec{v}$ is a linear combination of the other vectors:
$$\vec{v} = \beta_1\vec{v}_1 + \beta_2\vec{v}_2 + \ldots + \beta_n\vec{v}_n$$
for some $\beta_1, \ldots, \beta_n \in \mathbb{R}$. Substitute this back into $\vec{x}$:
$$\vec{x} = \alpha_1\vec{v}_1 + \alpha_2\vec{v}_2 + \ldots + \alpha_n\vec{v}_n + \alpha\underbrace{(\beta_1\vec{v}_1 + \beta_2\vec{v}_2 + \ldots + \beta_n\vec{v}_n)}_{\vec{v}}.$$
Distribute:
$$\vec{x} = \alpha_1\vec{v}_1 + \alpha_2\vec{v}_2 + \ldots + \alpha_n\vec{v}_n + \alpha\beta_1\vec{v}_1 + \alpha\beta_2\vec{v}_2 + \ldots + \alpha\beta_n\vec{v}_n$$
and group terms:
$$\vec{x} = (\alpha_1 + \alpha\beta_1)\vec{v}_1 + (\alpha_2 + \alpha\beta_2)\vec{v}_2 + \ldots + (\alpha_n + \alpha\beta_n)\vec{v}_n.$$
Thus,
$$\vec{x} \in \text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n\}.$$
For the next example, let's prove a set theoretic property. Here's one of De Morgan's Laws:

Example. For any sets A and B,
$$(A \cup B)^c = A^c \cap B^c$$
Proof:
($\subseteq$) Let x be an arbitrary element of $(A \cup B)^c$. Then by definition,
$$x \notin A \cup B \qquad (*)$$
Suppose, for a contradiction, $x \in A$. Then, by definition of union,
$$x \in A \cup B.$$
But this contradicts $(*)$. Thus, $x \notin A$. In other words,
$$x \in A^c.$$
Likewise, suppose $x \in B$. Then, by definition of union,
$$x \in A \cup B.$$
Again, this contradicts $(*)$, allowing us to conclude
$$x \in B^c.$$
By definition of intersection,
$$x \in A^c \cap B^c.$$
($\supseteq$) Let x be an arbitrary element of $A^c \cap B^c$. By definition,
$$x \notin A \qquad x \notin B.$$
Suppose, for an eventual contradiction,
$$x \notin (A \cup B)^c.$$
This means
$$x \in A \cup B.$$
By definition of union, we know either
$$x \in A \quad \text{or} \quad x \in B.$$
In either case, we have a contradiction since we already know $x \notin A$ and $x \notin B$. In conclusion,
$$x \in (A \cup B)^c.$$
6.3 The Basis Theorem: Showing a Basis Exists
First we define

Definition. A basis for a subspace $V \subseteq \mathbb{R}^n$ is a finite set of vectors
$$\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k$$
such that:
• $\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k\}$ is linearly independent.
• $\text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k\} = V$.
We are going to show that every non-trivial subspace has a basis. But how do we begin?
Existence proofs are often very difficult, mostly because you don't even know where to start. To quote
Professor Simon,

You just need a nice, simple, down to Earth idea.

Here's the idea: let's just consider every combination of vectors in V. The biggest linearly indepen-
dent set that exists is going to be the basis!
Simple, right? There are, however, a few questions we need to ask:
1. How do we know that there is a biggest linearly independent set? Maybe I can keep
on finding arbitrarily large linearly independent sets:
$$\{\vec{a}_1, \vec{a}_2, \vec{a}_3\}$$
$$\{\vec{b}_1, \vec{b}_2, \vec{b}_3, \vec{b}_4, \vec{b}_5, \vec{b}_6, \vec{b}_7, \vec{b}_8\}$$
$$\{\vec{c}_1, \vec{c}_2, \vec{c}_3, \vec{c}_4, \vec{c}_5, \vec{c}_6, \vec{c}_7, \vec{c}_8, \vec{c}_9, \vec{c}_{10}, \vec{c}_{11}, \vec{c}_{12}, \vec{c}_{13}, \ldots, \vec{c}_{1792}\}$$
$$\vdots$$
2. Suppose a biggest linearly independent set does exist. Are we allowed to consider
every possible combination of vectors in V and choose a biggest linearly independent
set from this collection? We can't physically test every combination of vectors for linear
independence or even write them all out!

The first question is easy: by the Linear Dependence Lemma, we are going to show that we cannot
have more than n linearly independent vectors in $\mathbb{R}^n$.
The second question is far more difficult because it is philosophical.
Considering all possible combinations of vectors in V and being able to choose a maximal linearly
independent set is sketchy. Very sketchy indeed. It has to do with the thin grey line between working
mathematicians and the logicians, called the Axiom of Choice. We save this for the end of the
lecture, but for this course, you must choose to accept that we can find a biggest linearly independent
set (pun intended).
Theorem (Basis Theorem). Every non-trivial subspace $V \subseteq \mathbb{R}^n$ has a basis.

Proof Summary:
• By the Linear Dependence Lemma, V cannot contain more than n linearly independent vectors.
• Since we cannot have arbitrarily large linearly independent sets, the maximum size M must be
achieved by some linearly independent set.
• We can show this set is a basis for V.
  Linear Independence: Obvious.
  Spanning:
  ($\subseteq$): Obvious.
  ($\supseteq$): Suppose not. Then we have M + 1 linearly independent vectors in V. But we assumed the
  biggest set size is M, contradiction.
Proof: Let $V \subseteq \mathbb{R}^n$. I claim that V cannot have more than n linearly independent vectors. Suppose
it does. Then there exist vectors
$$\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n, \vec{v}_{n+1}, \vec{v}_{n+2}, \ldots, \vec{v}_{n+k}$$
that are linearly independent. Of course, this implies the first n + 1 vectors
$$\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{n+1}$$
are linearly independent (quick check using contradiction)!
Recall from our examples, we showed
$$\mathbb{R}^n = \text{span}\{\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_n\}.$$
Of course, each of our $\vec{v}$'s is in $\mathbb{R}^n$. But this means we just found n + 1 linearly independent vectors in
$$\text{span}\{\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_n\}.$$
This directly contradicts the Linear Dependence Lemma!
Therefore, we can find at most n linearly independent vectors in V.
Consider all possible combinations of vectors in V. Since we just proved that we cannot have arbi-
trarily large linearly independent sets, there must be some linearly independent set that achieves the
maximum size M, with $1 \leq M \leq n$:
$$\vec{v}_1, \vec{v}_2, \vec{v}_3, \ldots, \vec{v}_M.$$
Note, I did not tell you how to find this set of vectors. We know it exists¹ by considering all possible
vector combinations.
I claim this set is a basis. Since it is linearly independent by definition, we just need to show
$$\text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_M\} = V.$$
($\subseteq$) Immediate: if
$$\vec{x} \in \text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_M\}$$
then
$$\vec{x} = \alpha_1\vec{v}_1 + \alpha_2\vec{v}_2 + \ldots + \alpha_M\vec{v}_M.$$
But each of the $\vec{v}$'s is in V, and V is closed under addition and scaling. So $\vec{x} \in V$.
($\supseteq$) Let $\vec{v} \in V$. Suppose
$$\vec{v} \notin \text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_M\}.$$
This means
$$\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_M, \vec{v}$$
is linearly independent (because $\vec{v}$ cannot be written as a linear combination of the other vec-
tors). But this is impossible since we found M + 1 linearly independent vectors and we assumed
we can have at most M!
Thus,
$$\vec{v} \in \text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_M\}.$$
In conclusion,
$$V = \text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_M\}$$
and since V was an arbitrary non-trivial subspace, every non-trivial subspace has a basis.
6.4 The Basis Theorem, Part II: Finding a Basis
So we showed a basis exists. Cool. But that still doesn't answer the question of how to find it. What
if you pick a few initial linearly independent vectors and then hit a dead-end²? In this case, it's
impossible to add more vectors to make the span V:

¹ Remember when Rose tossed the Heart of the Ocean off the Titanic? I have no clue how to find her necklace. I
only know it's somewhere in the ocean.
² In computer science, you will see dilemmas like this in the packing problem.
[Figure: one chain of choices $\vec{v}_1, \vec{v}_2, \vec{v}_3, \vec{v}_4$ reaches a full basis, while another chain $\vec{w}_1, \vec{w}_2$ hits a dead end.]
But we can prove this doesn't happen and in fact, it doesn't matter what you choose as the
first few linearly independent vectors. You can always extend them to a full basis. So that means
that to find a basis, we just need to keep throwing in any linearly independent vectors we can find,
including the kitchen sink.¹
By modifying the Basis Theorem, we can come up with a powerful extension theorem:

Theorem (Basis Extension Theorem). Given linearly independent vectors
$$\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k$$
in V, we can always extend this set to a full basis. In other words, we can find $\vec{v}_{k+1}, \vec{v}_{k+2}, \ldots, \vec{v}_M$ such
that
$$\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k, \vec{v}_{k+1}, \vec{v}_{k+2}, \ldots, \vec{v}_M$$
are linearly independent and
$$V = \text{span}\{\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k, \vec{v}_{k+1}, \vec{v}_{k+2}, \ldots, \vec{v}_M\}$$
Proof Summary:
• V cannot contain more than n linearly independent vectors, by the Linear Dependence Lemma.
• The maximum size linearly independent set containing the original vectors is achieved by some set.
• This set is a basis for V.

Proof: Once again, by the Linear Dependence Lemma, we know that we cannot add an unlimited number
of linearly independent vectors to the set
$$\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k.$$

¹ Another Leon Simonism.
Otherwise, we would have n + 1 linearly independent vectors in $\mathbb{R}^n$.
In the very best case scenario, we can add vectors so our set has maximally M linearly independent
vectors
$$\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k, \vec{v}_{k+1}, \vec{v}_{k+2}, \ldots, \vec{v}_M.$$
Again, we show
$$\text{span}\{\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k, \vec{v}_{k+1}, \vec{v}_{k+2}, \ldots, \vec{v}_M\} = V.$$
($\subseteq$) This follows from closure of a subspace under addition and scaling.
($\supseteq$) Let $\vec{v} \in V$. Suppose
$$\vec{v} \notin \text{span}\{\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k, \vec{v}_{k+1}, \vec{v}_{k+2}, \ldots, \vec{v}_M\}.$$
This implies
$$\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k, \vec{v}_{k+1}, \vec{v}_{k+2}, \ldots, \vec{v}_M, \vec{v}$$
is a linearly independent set. But an (M + 1)-element linearly independent set contradicts that we can
have at most M linearly independent vectors in V. Thus,
$$\vec{v} \in \text{span}\{\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k, \vec{v}_{k+1}, \vec{v}_{k+2}, \ldots, \vec{v}_M\}.$$
Therefore, we will eventually get a basis by repeatedly adding whatever linearly independent vectors
we can find.
6.5 Dimension
In our proof of the Basis Theorem, we found a basis by looking at a maximal linearly independent
set. Perhaps this was overkill. Specifically, we need to ask ourselves

Is it possible to find a smaller basis?

The answer to this question is no:

Theorem. Every basis for a given subspace V has the same number of vectors.

Proof Summary:
• Suppose there exist two bases of different sizes.
• The vectors in the bigger basis are linearly independent in the span of the smaller basis.
• This contradicts the Linear Dependence Lemma.
Proof: Suppose not. Then we can find two bases of different sizes,
$$\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_m$$
$$\vec{w}_1, \vec{w}_2, \vec{w}_3, \vec{w}_4, \ldots, \vec{w}_M$$
Without loss of generality, assume the second set is bigger: m < M. Choose the first m + 1 vectors
in the second list:
$$\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_{m+1}$$
These are linearly independent in V; thus, we have m + 1 linearly independent vectors in
$$\text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_m\}.$$
KaBoom! This is a contradiction by the Linear Dependence Lemma. Thus, every basis for a given
subspace V has the same number of vectors.
Notice that this theorem allows us to associate a unique number to each subspace; namely, the number
of vectors in every basis for that subspace.

Definition. The dimension of a subspace V is the number of vectors in a basis that spans V. We
denote this by
$$\dim V$$
Out of convention, we define the dimension of $\{\vec{0}\}$ (the subspace containing only the zero vector) to
be 0.
Why should we care about this number? Intuitively,

Dimension tells us how big a subspace is.

For example, consider any subspace V of $\mathbb{R}^3$. If $\dim(V) = 0$, we have a single point at the origin:

[Figure: the origin in x, y, z coordinates.]

If $\dim(V) = 1$, we have a line:

[Figure: a line through the origin in x, y, z coordinates.]

If $\dim(V) = 2$, then we have a plane passing through the origin:

[Figure: a plane through the origin in x, y, z coordinates.]

If $\dim(V) = 3$, we have the entire space $\mathbb{R}^3$.
Another reason why dimension is useful is that we can actually use this number to help find a basis.
Wait,

Doesn't dimension come from knowing a particular basis and measuring its size?

Yes, but you are going to learn how to calculate the dimension of V (in some special cases) without
actually finding a basis.¹ Then, with the following theorem, we can use the dimension to help find a
basis:

¹ All hail the almighty Rank-Nullity Theorem!
Theorem. Let $V \neq \{\vec{0}\}$ be a subspace. Then

1. If $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{\dim V}$ span V, then they are linearly independent (and thus form a basis for V).

2. If $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{\dim V}$ are linearly independent, then they span V (and thus form a basis for V).

Proof Summary:
• 1: Suppose not. Shrink the set of $\vec{v}_i$'s to a set of $\dim V - 1$ elements that still spans V. Then
any basis for V is a set of $\dim V$ linearly independent vectors in a span of $\dim V - 1$ vectors,
which contradicts the Linear Dependence Lemma.
• 2: Suppose not. Extend the set of $\vec{v}_i$'s to a set that spans V and is still linearly independent.
This contradicts the fact that any basis has size $\dim V$.
Proof:
1. Let
$$\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{\dim V}$$
span V. Suppose they are not linearly independent.
Recall that, in our examples, we proved that if $\vec{v}$ is a linear combination of $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k$, then
$$\text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k, \vec{v}\} = \text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k\}.$$
By linear dependence, at least one of our $\vec{v}_i$ is a linear combination of the others. After possibly
reordering the vectors, we may assume that $\vec{v}_1$ is a linear combination of the others. Then,
$$\text{span}\{\vec{v}_2, \ldots, \vec{v}_{\dim V}\} = \text{span}\{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{\dim V}\} = V.$$
Now select a basis
$$\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_{\dim V}$$
for V by the Basis Theorem. Then, $\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_{\dim V}$ are $\dim V$ linearly independent vectors in
$\text{span}\{\vec{v}_2, \ldots, \vec{v}_{\dim V}\}$, a span of $\dim V - 1$ vectors, contrary to the Linear Dependence Lemma.
2. Let
$$\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{\dim V}$$
be linearly independent and suppose this set does not span V. This means that there is some
vector $\vec{v} \in V$ that cannot be written as a linear combination of the $\vec{v}$'s. Therefore,
$$\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{\dim V}, \vec{v}$$
is linearly independent. Applying our Basis Extension Theorem to this set, a basis would have
strictly more than $\dim V$ vectors, contradicting uniqueness of basis size!
Therefore, if we know the dimension of V , to check that a set of size dimV is a basis, we only need
to check that either the set is linearly independent or the set spans V .
6.6 Some Fun with the Basis Theorems
In Lecture 3, we proved that the intersection of two subspaces is still a subspace. Intuitively, if the
intersection is smaller, the basis should be smaller as well. In fact, the new basis should be smaller
than the bases of both the original subspaces:

Example. For subspaces $V, U \subseteq \mathbb{R}^n$,
$$\dim(V \cap U) \leq \min\{\dim V, \dim U\}$$
Proof: By the Basis Theorem, we know that $V \cap U$ has a basis
$$\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_k.$$
But $V \cap U \subseteq V$, so these $\vec{w}_i$ are linearly independent in V. Therefore, we can extend this to a full
basis for V. Thus,
$$\dim(V \cap U) \leq \dim V.$$
Likewise, $V \cap U \subseteq U$, so these $\vec{w}_i$ are linearly independent in U. Therefore, we can extend this to a
full basis for U. Thus,
$$\dim(V \cap U) \leq \dim U.$$
Because the dimension of $V \cap U$ is bounded by both $\dim U$ and $\dim V$, it is certainly bounded by the
smaller of the two:
$$\dim(V \cap U) \leq \min\{\dim V, \dim U\}.$$
By the way, a noob mistake would be to assume a similar result holds for set unions:
$$\dim(U \cup V) \leq \max\{\dim U, \dim V\}$$
$U \cup V$ need not be a subspace, so taking the dimension does not make sense! $U \cup V$ is an entirely
different animal! In general,

Math Mantra: Just because you can write an expression down, doesn't mean it
automatically makes sense! You must check that you have the correct animals!
Recall that we also showed that any linear mapping of a subspace is still a subspace. In fact, the
image does not gain dimension:

Example. Let V be a subspace of $\mathbb{R}^n$ and let f be a linear map on V, i.e.
$$f(\vec{x} + \vec{y}) = f(\vec{x}) + f(\vec{y}) \text{ for all } \vec{x}, \vec{y} \in V$$
$$f(\alpha\vec{x}) = \alpha f(\vec{x}) \text{ for all } \vec{x} \in V,\ \alpha \in \mathbb{R}$$
Then the image
$$f(V) = \{f(\vec{x}) \mid \vec{x} \in V\}$$
has dimension smaller than or equal to V's dimension:
$$\dim f(V) \leq \dim(V).$$
Proof: Let
$$\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k$$
be a basis for V (by the Basis Theorem). As a simple consequence of linearity, we can show
$$f(V) = \text{span}\{f(\vec{v}_1), f(\vec{v}_2), \ldots, f(\vec{v}_k)\}.$$
($\subseteq$) Let $\vec{x} \in f(V)$. By definition,
$$\vec{x} = f(\vec{v})$$
for some $\vec{v} \in V$. Since the $\vec{v}_i$'s are a basis for V, we can write
$$\vec{v} = \alpha_1\vec{v}_1 + \alpha_2\vec{v}_2 + \ldots + \alpha_k\vec{v}_k$$
for some $\alpha_1, \alpha_2, \ldots, \alpha_k \in \mathbb{R}$. Apply f to both sides:
$$\underbrace{f(\vec{v})}_{\vec{x}} = f(\alpha_1\vec{v}_1 + \alpha_2\vec{v}_2 + \ldots + \alpha_k\vec{v}_k).$$
By linearity,
$$\vec{x} = \alpha_1 f(\vec{v}_1) + \alpha_2 f(\vec{v}_2) + \ldots + \alpha_k f(\vec{v}_k).$$
Thus,
$$\vec{x} \in \text{span}\{f(\vec{v}_1), f(\vec{v}_2), \ldots, f(\vec{v}_k)\}.$$
($\supseteq$) Let $\vec{x} \in \text{span}\{f(\vec{v}_1), f(\vec{v}_2), \ldots, f(\vec{v}_k)\}$. Then, for some $\alpha_1, \alpha_2, \ldots, \alpha_k \in \mathbb{R}$,
$$\vec{x} = \alpha_1 f(\vec{v}_1) + \alpha_2 f(\vec{v}_2) + \ldots + \alpha_k f(\vec{v}_k).$$
By linearity, we can combine the f's:
$$\vec{x} = f(\alpha_1\vec{v}_1 + \alpha_2\vec{v}_2 + \ldots + \alpha_k\vec{v}_k).$$
Since $\alpha_1\vec{v}_1 + \alpha_2\vec{v}_2 + \ldots + \alpha_k\vec{v}_k \in V$,
$$\vec{x} \in f(V).$$
In conclusion,
$$f(V) = \text{span}\{f(\vec{v}_1), f(\vec{v}_2), \ldots, f(\vec{v}_k)\}.$$
Therefore, f(V) is a span of k vectors. This means f(V) can have at most k linearly independent
vectors in its basis, implying
$$\dim f(V) \leq \underbrace{\dim V}_{k}$$
6.7 Sketchy Shades of Grey: Axiom of Choice
There are many things in the field of Mathematical Logic that the working mathematician will de-
scribe as mumbo jumbo. But the Axiom of Choice isn't one of them.
If you're not careful, you can assume a seemingly innocuous statement and bad things will happen.
For example, if you are an aspiring logician, instead of axiomatizing mathematics, you may want to
look into axiomatizing knowledge and belief.¹ You can take simple axioms like

If you believe x, then you believe that you believe x.
If you know x, then you believe x.
If you believe x, then you believe that you know x.

and using simple rules of inference like Modus Ponens, you can derive

Knowing x is equivalent to believing x.

But that's wrong:

I believe The Last Airbender is a bad movie

and

I know The Last Airbender is a bad movie

are two completely different statements.
What does assuming innocuous axioms have to do with the Basis Theorem? The Basis Theorem
implicitly² uses the Axiom of Choice, which asserts

For any collection of sets, we can always find a function on that collection that inputs a non-empty
set and outputs an element of that particular set.

We call such a function a choice function since it is choosing an element from each set. Here's a fun
analogy:

¹ Example from Multi-agent Systems by Yoav Shoham.
² In the case of the Basis Theorem, we consider the collection of sets
{the set of all finite combinations of vectors in V | V is a non-empty subspace of $\mathbb{R}^n$}. We then construct a function
that selects a maximal linearly independent set of vectors from any set in this collection.
For example, consider the sets
$$\{1, 2, 3\} \qquad \{\ldots, 2, 2.00001, 42, \ldots\} \qquad \mathbb{R}^n \qquad \{0\} \qquad \{x \mid x \in \text{Stanford Math Department}\} \qquad \mathbb{R}$$
Constructing a choice function for this collection of sets is easy: go through each set one at a time
and choose some element as the output, e.g.
$$f(\{1, 2, 3\}) = 2$$
$$f(\{\ldots, 2, 2.00001, 42, \ldots\}) = 42$$
$$f(\mathbb{R}^n) = \vec{e}_2$$
$$f(\{0\}) = 0$$
$$f(\{x \mid x \in \text{Stanford Math Department}\}) = \text{Soren Galatius}$$
$$f(\mathbb{R}) = \pi$$
You can also construct choice functions for infinite collections of sets. For example, consider the
collection of all non-empty subsets of positive integers. Define f to always spit out the least element in the set:
$$f(\{2, 4, 6, 8, \ldots\}) = 2$$
$$f(\{72, 83, 94, \ldots\}) = 72$$
$$f(\{101, 1011, 1211, 3333, \ldots\}) = 101$$
$$\vdots$$
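To make the least-element rule concrete, here is a tiny Python sketch (my own illustration, not from the text) of a choice function on finite non-empty sets of positive integers:

    def choice(s):
        # Choice function: pick the least element of a non-empty set of positive integers.
        assert len(s) > 0, "a choice function is only defined on non-empty sets"
        return min(s)

    print(choice({2, 4, 6, 8}))   # 2
    print(choice({72, 83, 94}))   # 72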
However, suppose I asked you to construct a choice function on the collection of all non-empty subsets
of $\mathbb{R}$. It is humanly impossible to arbitrarily pick out one element at a time! The problem is that
the set is far too massive.¹
In the case of the reals, even though we cannot humanly construct a choice function, it seems like one
should exist. So we can just take this as an axiom, right?
There is an unholy consequence: the Axiom of Choice implies the monster known as the Banach-Tarski
Paradox, which states:

We can break a ball into finitely many non-overlapping pieces and rearrange them to form two
identical balls.

Weird! For more details on this, I highly recommend reading The Pea and the Sun by Leonard Wap-
ner. He gives a nice exposition that requires only a modest amount of mathematics.
Despite these weird consequences, most mathematicians continue to accept the Axiom of Choice.

¹ Precisely, the reals are uncountable. This will be discussed in the final lecture.
New Notation

Symbol           Reading                             Example                                        Example Translation
$\vec{e}_i$      The i-th standard basis vector      $\text{span}\{\vec{e}_1, \vec{e}_2, \vec{e}_3\} = \mathbb{R}^3$   The span of the first three standard basis vectors is $\mathbb{R}^3$.
$A \subseteq B$  A is a subset of or equals B        $\{0\} \subseteq V$                            $\{0\}$ is a subset of V or equals V.
$A \supseteq B$  A either equals or contains B       $\mathbb{R}^2 \supseteq V$                     $\mathbb{R}^2$ contains (or equals) V.
$\dim V$         The dimension of vector space V     $\dim \mathbb{R}^3 = 3$                        The dimension of $\mathbb{R}^3$ is 3.
Lecture 7

Matrix Madness

Unfortunately, no one can be told what the Matrix is. You have to practice bookkeeping and working
with double sums yourself.

-Morpheus

Goals: Today, we look at matrices and how to rigorously prove matrix properties. We
also define a notion of a matrix norm and prove a Cauchy-Schwarz-like inequality. This
proof will rely on the key matrix property that the product of a matrix and a vector can
be viewed as a linear combination of the matrix columns. Lastly, we prove another key
property: any linear function from $\mathbb{R}^n$ to $\mathbb{R}^m$ can be written as a matrix multiplication.
7.1 Lets be Honest
Here's a complete summary of matrices in high school: in Algebra II, you learned how to represent
a system of equations as a matrix and row reduce to solve for all the unknowns. Then, you learned
how to mindlessly compute matrix products, sums, and inverses without any context as to why. A
majority of Honors Algebra II teachers stop right there. The really good ones¹ go a little further and
say that you can use matrix products to represent the system as a matrix multiplication,
$$A\vec{x} = \vec{b}.$$
Then you can compute the inverse of A as long as $\det(A) \neq 0$ and multiply both sides by the inverse
to get
$$\begin{aligned}
A\vec{x} &= \vec{b}\\
A^{-1}A\vec{x} &= A^{-1}\vec{b}\\
\vec{x} &= A^{-1}\vec{b}
\end{aligned}$$
Typically, you learned this in a supplemental reading from Howard Anton since the standard high
school texts are very lacking in Linear Algebra. You also memorized the slogans

Matrix multiplication is associative.
Matrix multiplication is not always commutative.
For matrix multiplication to work, the inner dimensions must match.

¹ Shout out to Ms. Evans of Los Altos Hills, Mr. Friedland of Palo Alto High, and Mr. Lazar of San Jose Mission.
Now, the soul-shattering question I need to ask you is, why?

Why does $\det(A) = 0$ imply non-invertibility?
Why is matrix multiplication always associative?
Why does concatenating a matrix with the identity let you find the inverse?
$$\left(\begin{array}{ccc|ccc}
0 & 2 & 0 & 1 & 0 & 0\\
2 & 1 & 0 & 0 & 1 & 0\\
3 & 1 & 1 & 0 & 0 & 1
\end{array}\right)$$
But most importantly,

Why should you care?

You could have bypassed matrices and just stuck with systems of equations! It would have been easy
since you only dealt with $3 \times 3$ and $2 \times 2$ matrices.
You also need to ask yourselves the "is" questions:

Is there a way to compute solutions of $A\vec{x} = \vec{b}$ when $\det(A) = 0$?
Is there more use for determinants than checking invertibility?
Is there a greater purpose for matrices than just notation?

Like when Aladdin met Jasmine, the H-series is going to show you a whole new world behind
matrices.
I used to be in your shoes. I know Gaussian Elimination and computation are a breeze and you will
be able to understand the theorem statements (though you may have trouble juggling m and n).
There are two things that are going to scare you, mainly because you have never had any practice
with them:

Working with $\Sigma$-notation.
Proving two matrices (of arbitrary size) are equal.

Let's conquer these fears:
7.2 Working with Sums
It is my experience that $\Sigma$-notation is more of a hindrance than a help for beginners in linear
algebra. Therefore, I have generally avoided its use.

-Howard Anton

Unfortunately, we won't have this liberty. If you want to survive the H-series, you have to master
$\Sigma$-notation. True, I will avoid $\Sigma$-notation if it makes a concept clearer. However, when you hit deter-
minants, this notation will be completely unavoidable.
You especially need to be comfortable with this notation when working with matrices. Primarily,
$\Sigma$-notation condenses complicated expressions.
For example, given the matrices
$$A = \begin{pmatrix}
a_{11} & a_{12} & \ldots & a_{1n}\\
a_{21} & a_{22} & \ldots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \ldots & a_{mn}
\end{pmatrix}
\qquad
B = \begin{pmatrix}
b_{11} & b_{12} & \ldots & b_{1p}\\
b_{21} & b_{22} & \ldots & b_{2p}\\
\vdots & \vdots & \ddots & \vdots\\
b_{n1} & b_{n2} & \ldots & b_{np}
\end{pmatrix}$$
the matrix product AB is defined as the matrix
$$AB = \begin{pmatrix}
[AB]_{11} & [AB]_{12} & \ldots & [AB]_{1j} & \ldots & [AB]_{1p}\\
\vdots & \vdots & & \vdots & & \vdots\\
[AB]_{i1} & [AB]_{i2} & \ldots & [AB]_{ij} & \ldots & [AB]_{ip}\\
\vdots & \vdots & & \vdots & & \vdots\\
[AB]_{m1} & [AB]_{m2} & \ldots & [AB]_{mj} & \ldots & [AB]_{mp}
\end{pmatrix}$$
where each ij entry $[AB]_{ij}$ is the dot product of the i-th row of A with the j-th column of B.
We can use $\Sigma$-notation to condense this definition:

Definition. Let A be an $m \times n$ matrix and B be an $n \times p$ matrix. Then the product AB is defined as
the matrix with ij entry
$$[AB]_{ij} = \sum_{r=1}^{n} a_{ir}b_{rj}.$$
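As a sanity check on the $\Sigma$ definition, here is a bare-bones Python sketch (an illustration of the formula, not a library you would use in practice) that computes each entry with explicit loops:

    def matmul(A, B):
        # A is m x n, B is n x p, both given as lists of rows; [AB]_ij = sum_r a_ir * b_rj.
        m, n, p = len(A), len(B), len(B[0])
        assert all(len(row) == n for row in A), "inner dimensions must match"
        return [[sum(A[i][r] * B[r][j] for r in range(n)) for j in range(p)]
                for i in range(m)]

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(matmul(A, B))  # [[19, 22], [43, 50]]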
You also need to be comfortable with double sums.¹ Consider the expression
$$\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}.$$
In actuality, this is a shorthand for one sum nested within another. In the inner sum, the i variable
is fixed:
$$\sum_{i=1}^{m}\left(\sum_{j=1}^{n} a_{ij}\right) = \sum_{i=1}^{m}(a_{i1} + a_{i2} + \ldots + a_{in}).$$
Of course, we can apply single summation properties to double summations:

¹ You've already seen these when proving Cauchy-Schwarz on Homework 1.
Example.
$$\sum_{i=1}^{n}\sum_{j=1}^{n} ij = \left(\sum_{j=1}^{n} j\right)^2$$
Proof: Viewing the left hand side as one sum nested within the other,
$$\sum_{i=1}^{n}\left(\sum_{j=1}^{n} ij\right),$$
we can pull out i from the inner sum since it is a constant (as j varies):
$$\sum_{i=1}^{n} i\left(\sum_{j=1}^{n} j\right).$$
But within the outer sum, notice that
$$\sum_{j=1}^{n} j$$
is a constant (as i varies), so we can pull that out of the outer sum, giving us a product
$$\left(\sum_{j=1}^{n} j\right)\left(\sum_{i=1}^{n} i\right).$$
The dummy variable in each of the summations doesn't matter, so we change the i into a j, giving us
$$\left(\sum_{j=1}^{n} j\right)^2$$
One of the most fundamental properties of double summations is that the $\Sigma$'s commute:
$$\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij} = \sum_{j=1}^{n}\sum_{i=1}^{m} a_{ij}$$
Visualizing the terms in an array, this equality states that summing over the columns is the same as
summing over the rows:

[Figure: the array of terms $a_{ij}$ shown twice, once grouped row by row and once grouped column by column.]
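A two-line numerical check of the commuting property (my own sketch; the terms $a_{ij} = ij + j$ are an arbitrary choice):

    m, n = 4, 6
    a = lambda i, j: i * j + j  # arbitrary terms
    rows_first = sum(sum(a(i, j) for j in range(1, n + 1)) for i in range(1, m + 1))
    cols_first = sum(sum(a(i, j) for i in range(1, m + 1)) for j in range(1, n + 1))
    print(rows_first == cols_first)  # True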
Intuitively, this property is obvious. But how do we prove it rigorously?
If you try to directly apply single summation properties, you may have trouble untangling the i from
the j.
Instead of purely algebraic manipulation, we need to look at the meaning behind the summation
symbol.

Math Mantra: Instead of jumping to algebraic manipulation, try to understand
the INTENT¹ of the notation.

When we write a sum, we are actually summing over a set of distinct terms. For example,
$$\sum_{i=1}^{n} a_i$$
is really just shorthand for

Sum all terms of the form $a_i$ where $i \in \{1, 2, 3, \ldots, n\}$.

Alternatively, we could have represented this meaning by a different notation:
$$\sum_{i \in \{1, 2, 3, \ldots, n\}} a_i$$
where we sum over the distinct indices $\{1, 2, 3, \ldots, n\}$. The jargon for such a set is an indexing set.
Until now, you only worked with a notation that only permitted summations over terms indexed by
consecutive integers. But we can write more interesting expressions like
$$\sum_{\{d \,\mid\, d > 0,\ d \text{ divides } 24\}} d
\qquad \text{or} \qquad
\sum_{\{p \,\mid\, p \leq 15,\ p \text{ is prime}\}} p$$
or even over infinite indexing sets like
$$\sum_{x \in \mathbb{R}} a_x.$$
The first two sums, respectively, translate to
$$1 + 2 + 3 + 4 + 6 + 8 + 12 + 24 = 60$$
$$2 + 3 + 5 + 7 + 11 + 13 = 41$$
However, we must be careful about our notation. The last expression has a meaningless indexing set
in the context² of sums.

¹ You will see this again when proving $\det A = \det A^T$.
² This indexing set is meaningful, however, in the context of unions.
Example.
$$\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij} = \sum_{j=1}^{n}\sum_{i=1}^{m} a_{ij}$$
Proof: The left hand side is shorthand for
$$\sum_{(i,j) \in S} a_{ij}$$
where
$$S = \{(i, j) \mid i, j \text{ are integers},\ i \in \{1, 2, \ldots, m\},\ j \in \{1, 2, \ldots, n\}\}.$$
Likewise,
$$\sum_{j=1}^{n}\sum_{i=1}^{m} a_{ij}$$
is just shorthand for
$$\sum_{(i,j) \in S'} a_{ij}$$
where
$$S' = \{(i, j) \mid i, j \text{ are integers},\ j \in \{1, 2, \ldots, n\},\ i \in \{1, 2, \ldots, m\}\}.$$
But it is a simple exercise to show that $S' = S$, so
$$\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij} = \sum_{(i,j) \in S} a_{ij} = \sum_{(i,j) \in S'} a_{ij} = \sum_{j=1}^{n}\sum_{i=1}^{m} a_{ij}.$$
Here is a summation we will need when we study Taylor Series:

Example.
$$\sum_{i=0}^{N}\sum_{j=0}^{i} a_{ij} = \sum_{j=0}^{N}\sum_{i=j}^{N} a_{ij}$$
Proof: We can rewrite
$$\sum_{i=0}^{N}\sum_{j=0}^{i} a_{ij} = \sum_{(i,j) \in S} a_{ij}$$
where
$$S = \{(i, j) \mid i \text{ and } j \text{ are integers},\ 0 \leq i \leq N,\ 0 \leq j \leq i\}$$
and
$$\sum_{j=0}^{N}\sum_{i=j}^{N} a_{ij} = \sum_{(i,j) \in S'} a_{ij}$$
where
$$S' = \{(i, j) \mid i \text{ and } j \text{ are integers},\ 0 \leq j \leq N,\ j \leq i \leq N\}.$$
Now we prove that $S' = S$.
($\subseteq$) Let $(a, b) \in S$. By definition,
$$0 \leq a \leq N \qquad 0 \leq b \leq a.$$
By transitivity, this implies
$$0 \leq b \leq N \qquad b \leq a \leq N,$$
so $(a, b) \in S'$.
($\supseteq$) Let $(a, b) \in S'$. Then by definition,
$$0 \leq b \leq N \qquad b \leq a \leq N.$$
Again, we see that
$$0 \leq a \leq N \qquad 0 \leq b \leq a,$$
so $(a, b) \in S$.
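A quick numerical sanity check of this index swap (my own illustration; the values of $a_{ij}$ are arbitrary):

    N = 5
    a = [[(i + 1) * (j + 2) for j in range(N + 1)] for i in range(N + 1)]  # arbitrary terms
    lhs = sum(a[i][j] for i in range(N + 1) for j in range(i + 1))
    rhs = sum(a[i][j] for j in range(N + 1) for i in range(j, N + 1))
    print(lhs == rhs)  # True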
7.3 Proving Two Matrices are Equal
In your math career, you are going to see many matrix expressions. For example,
$$(A + B) + C = A + (B + C)$$
$$(AB)C = A(BC)$$
$$A\vec{x} = \sum_{j=1}^{n} x_j\vec{a}_j$$
Each of these statements asserts that one matrix is equal to another matrix. But how do you formally
prove that two matrices are equal?
Back in high school, you proved equality through direct computation. For example, let A, B, C be
$2 \times 2$ matrices. To prove associativity
$$(AB)C = A(BC)$$
you directly computed
$$(AB)C = \left(\begin{pmatrix} a_{11} & a_{12}\\ a_{21} & a_{22}\end{pmatrix}\begin{pmatrix} b_{11} & b_{12}\\ b_{21} & b_{22}\end{pmatrix}\right)\begin{pmatrix} c_{11} & c_{12}\\ c_{21} & c_{22}\end{pmatrix}
= \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22}\\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22}\end{pmatrix}\begin{pmatrix} c_{11} & c_{12}\\ c_{21} & c_{22}\end{pmatrix}$$
$$= \begin{pmatrix}
c_{11}(a_{11}b_{11} + a_{12}b_{21}) + c_{21}(a_{11}b_{12} + a_{12}b_{22}) & c_{12}(a_{11}b_{11} + a_{12}b_{21}) + c_{22}(a_{11}b_{12} + a_{12}b_{22})\\
c_{11}(a_{21}b_{11} + a_{22}b_{21}) + c_{21}(a_{21}b_{12} + a_{22}b_{22}) & c_{12}(a_{21}b_{11} + a_{22}b_{21}) + c_{22}(a_{21}b_{12} + a_{22}b_{22})
\end{pmatrix}$$
Then you mindlessly and painfully computed
$$A(BC) = \begin{pmatrix} a_{11} & a_{12}\\ a_{21} & a_{22}\end{pmatrix}\left(\begin{pmatrix} b_{11} & b_{12}\\ b_{21} & b_{22}\end{pmatrix}\begin{pmatrix} c_{11} & c_{12}\\ c_{21} & c_{22}\end{pmatrix}\right)
= \begin{pmatrix} a_{11} & a_{12}\\ a_{21} & a_{22}\end{pmatrix}\begin{pmatrix} b_{11}c_{11} + b_{12}c_{21} & b_{11}c_{12} + b_{12}c_{22}\\ b_{21}c_{11} + b_{22}c_{21} & b_{21}c_{12} + b_{22}c_{22}\end{pmatrix}$$
$$= \begin{pmatrix}
a_{11}(b_{11}c_{11} + b_{12}c_{21}) + a_{12}(b_{21}c_{11} + b_{22}c_{21}) & a_{11}(b_{11}c_{12} + b_{12}c_{22}) + a_{12}(b_{21}c_{12} + b_{22}c_{22})\\
a_{21}(b_{11}c_{11} + b_{12}c_{21}) + a_{22}(b_{21}c_{11} + b_{22}c_{21}) & a_{21}(b_{11}c_{12} + b_{12}c_{22}) + a_{22}(b_{21}c_{12} + b_{22}c_{22})
\end{pmatrix}$$
After comparing each component and confirming that they matched, you concluded
$$(AB)C = A(BC).$$
But this is a bone-headed way to prove the associative law! Here's why:

This argument does not apply to matrices of arbitrary size.
It's highly inefficient and, like the Blobfish, horrifyingly ugly to look at.
Instead we generalize. We know two matrices are equal if their components are the same. Thus,

To prove two matrices (of the same size) are equal, we have to show that for any i, j, the
component of A at position ij is the same as the component of B at position ij.

In our example, we could have just proved that a single (but arbitrary) ij component was equal
instead of writing out all four components of both matrices.
Now, let's give a better (and actual) proof¹ that matrix multiplication is associative.

Example. For $n \times n$ matrices A, B, C, we have
$$(AB)C = A(BC)$$

¹ For simplicity, let's assume A, B, C are $n \times n$.
Proof: We need to show, for an arbitrary component ij, that
$$[(AB)C]_{ij} = [A(BC)]_{ij}. \qquad (*)$$
Starting from the left-hand side, this is the dot product of the i-th row of AB with the j-th column
of C:
$$[(AB)C]_{ij} = \sum_{r=1}^{n} [AB]_{ir}c_{rj}.$$
Notice that the ir-entry of AB is the dot product of the i-th row of A with the r-th column of B.
Substituting, we get
$$[(AB)C]_{ij} = \sum_{r=1}^{n} \underbrace{\left(\sum_{q=1}^{n} a_{iq}b_{qr}\right)}_{[AB]_{ir}} c_{rj}.$$
Pull $c_{rj}$ into the innermost sum to get
$$[(AB)C]_{ij} = \sum_{r=1}^{n}\left(\sum_{q=1}^{n} a_{iq}b_{qr}c_{rj}\right).$$
Now, look at the right-hand side of $(*)$:
$$[A(BC)]_{ij} = \sum_{r=1}^{n} a_{ir}[BC]_{rj}.$$
Again, substitute the dot product definition for $[BC]_{rj}$:
$$[A(BC)]_{ij} = \sum_{r=1}^{n} a_{ir}\underbrace{\left(\sum_{q=1}^{n} b_{rq}c_{qj}\right)}_{[BC]_{rj}}$$
and then pull $a_{ir}$ into the innermost sum:
$$[A(BC)]_{ij} = \sum_{r=1}^{n}\left(\sum_{q=1}^{n} a_{ir}b_{rq}c_{qj}\right).$$
Now we have
$$[(AB)C]_{ij} = \sum_{r=1}^{n}\sum_{q=1}^{n} a_{iq}b_{qr}c_{rj}
\qquad
[A(BC)]_{ij} = \sum_{r=1}^{n}\sum_{q=1}^{n} a_{ir}b_{rq}c_{qj}.$$
Of course, we can switch the roles of the dummy variables in the first equation:
$$[(AB)C]_{ij} = \sum_{q=1}^{n}\sum_{r=1}^{n} a_{ir}b_{rq}c_{qj}
\qquad
[A(BC)]_{ij} = \sum_{r=1}^{n}\sum_{q=1}^{n} a_{ir}b_{rq}c_{qj}.$$
Looks better! Now we just have to fix the indexing set. But that's easy: we already proved that we
can switch the order of the double summation. Therefore, we can switch the order in the top line:
$$[(AB)C]_{ij} = \sum_{r=1}^{n}\sum_{q=1}^{n} a_{ir}b_{rq}c_{qj}
\qquad
[A(BC)]_{ij} = \sum_{r=1}^{n}\sum_{q=1}^{n} a_{ir}b_{rq}c_{qj}.$$
Thus,
$$[(AB)C]_{ij} = [A(BC)]_{ij}.$$
Since ij was arbitrary, we conclude that all the components agree, hence:
$$(AB)C = A(BC)$$
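A numerical spot check never replaces the proof above, but it is a handy way to catch bookkeeping mistakes. Here is a small self-contained sketch (mine, not the text's; the particular matrices are arbitrary):

    def matmul(A, B):
        # [AB]_ij = sum_r A[i][r] * B[r][j]
        return [[sum(A[i][r] * B[r][j] for r in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    A = [[1, 2], [3, 4]]
    B = [[0, 1], [1, 0]]
    C = [[2, 0], [1, 3]]
    print(matmul(matmul(A, B), C) == matmul(A, matmul(B, C)))  # True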
Notice that we actually had to do some work to prove that matrix multiplication is associative.
Associativity is not always obvious! If you ever study Cryptography and elliptic curves, you'll know
what I mean. Generally,

Math Mantra: Just because something is easy to write, DOESN'T MEAN it's easy to
prove!

Matrix multiplication is associative. Great. But how about an example of a matrix equality that's
more useful for the working mathematician? Sure!
Consider the product of matrix A with vector $\vec{x}$:
$$A\vec{x}.$$
By our definition, you can compute the new vector one element at a time, by dotting the i-th row of
A with $\vec{x}$. That's somewhat painful.
The better and more useful idea is to view $A\vec{x}$ as a linear combination of columns of A. For
example,
$$\begin{pmatrix} 1 & 4 & 7\\ 2 & 5 & 8\\ 3 & 6 & 9\end{pmatrix}\begin{pmatrix} x_1\\ x_2\\ x_3\end{pmatrix}
= x_1\begin{pmatrix} 1\\ 2\\ 3\end{pmatrix} + x_2\begin{pmatrix} 4\\ 5\\ 6\end{pmatrix} + x_3\begin{pmatrix} 7\\ 8\\ 9\end{pmatrix}$$
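To see the row view and the column view side by side, here is a short sketch (my own illustration, not from the text) that computes the same product both ways and confirms they agree:

    def matvec_rows(A, x):
        # Row view: dot each row of A with x.
        return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

    def matvec_columns(A, x):
        # Column view: a linear combination of the columns of A.
        m, n = len(A), len(x)
        result = [0] * m
        for j in range(n):
            for i in range(m):
                result[i] += x[j] * A[i][j]
        return result

    A = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
    x = [1, 0, 2]
    print(matvec_rows(A, x) == matvec_columns(A, x))  # True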
Example. Let A be an $m \times n$ matrix with columns $\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_n$:
$$A = \begin{pmatrix} \vec{a}_1 & \vec{a}_2 & \ldots & \vec{a}_n \end{pmatrix}$$
For any vector $\vec{x} \in \mathbb{R}^n$,
$$A\vec{x} = x_1\vec{a}_1 + x_2\vec{a}_2 + \ldots + x_n\vec{a}_n = \sum_{j=1}^{n} x_j\vec{a}_j$$
Proof: Consider
$$\sum_{j=1}^{n} x_j\vec{a}_j$$
and look at the i-th component of this sum. This is just the sum of all the scaled i-th components of each
of the columns:
$$\sum_{j=1}^{n} x_j\vec{a}_j =
\begin{pmatrix} x_1 a_{11}\\ \vdots\\ x_1 a_{i1}\\ \vdots\\ x_1 a_{m1}\end{pmatrix} +
\begin{pmatrix} x_2 a_{12}\\ \vdots\\ x_2 a_{i2}\\ \vdots\\ x_2 a_{m2}\end{pmatrix} + \ldots +
\begin{pmatrix} x_n a_{1n}\\ \vdots\\ x_n a_{in}\\ \vdots\\ x_n a_{mn}\end{pmatrix}$$
Thus,
$$\left[\sum_{j=1}^{n} x_j\vec{a}_j\right]_i = \sum_{j=1}^{n} x_j a_{ij}.$$
But the i-th entry of the vector $A\vec{x}$ is, by definition,
$$[A\vec{x}]_i = \sum_{j=1}^{n} a_{ij}x_j.$$
Thus,
$$[A\vec{x}]_i = \left[\sum_{j=1}^{n} x_j\vec{a}_j\right]_i.$$
Since this is true for every position i,
$$A\vec{x} = \sum_{j=1}^{n} x_j\vec{a}_j$$
Even though this theorem refers to the product of a matrix and vector, we can extend this idea to
the product of two matrices. Specifically, we can show that A distributes across the columns:

Example. Let A be an $m \times n$ matrix and B be an $n \times p$ matrix with columns $\vec{b}_1, \vec{b}_2, \ldots, \vec{b}_p$:
$$B = \begin{pmatrix} \vec{b}_1 & \vec{b}_2 & \ldots & \vec{b}_p \end{pmatrix}$$
Then the columns of AB are $A\vec{b}_1, A\vec{b}_2, \ldots, A\vec{b}_p$:
$$AB = \begin{pmatrix} A\vec{b}_1 & A\vec{b}_2 & \ldots & A\vec{b}_p \end{pmatrix}$$
Proof: Let's look at
$$\begin{pmatrix} A\vec{b}_1 & A\vec{b}_2 & \ldots & A\vec{b}_p \end{pmatrix}.$$
The ij entry of this matrix is the i-th component of $A\vec{b}_j$. Writing $A\vec{b}_j$ as a linear combination of the
columns of A,
$$A\vec{b}_j = \sum_{r=1}^{n} b_{rj}\vec{a}_r$$
where $\vec{a}_r$ is the r-th column of A. The i-th component is then
$$[A\vec{b}_j]_i = \sum_{r=1}^{n} b_{rj}a_{ir}.$$
We also know the ij entry of AB is
$$[AB]_{ij} = \sum_{r=1}^{n} a_{ir}b_{rj}.$$
Thus,
$$[AB]_{ij} = [A\vec{b}_j]_i.$$
Since ij was an arbitrary entry, we can conclude
$$AB = \begin{pmatrix} A\vec{b}_1 & A\vec{b}_2 & \ldots & A\vec{b}_p \end{pmatrix}$$
We can also look at the row analogues of the last two theorems. These will be vital when we prove
the rank theorems.
First up: when we multiply a row vector by a matrix, the result is a linear combination of the rows.
For example,
$$\begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix}\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9\end{pmatrix}
= x_1\begin{pmatrix} 1 & 2 & 3\end{pmatrix} + x_2\begin{pmatrix} 4 & 5 & 6\end{pmatrix} + x_3\begin{pmatrix} 7 & 8 & 9\end{pmatrix}$$
Example. Let $\vec{x}$ be a $1 \times n$ row vector
$$\vec{x} = \begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix}$$
and let B be an $n \times p$ matrix with rows $\vec{B}_1, \vec{B}_2, \ldots, \vec{B}_n$:
$$B = \begin{pmatrix} \vec{B}_1\\ \vec{B}_2\\ \vdots\\ \vec{B}_n \end{pmatrix}.$$
Then,
$$\vec{x}B = \sum_{i=1}^{n} x_i\vec{B}_i.$$
Proof: Let's look at the j-th entry of
$$\sum_{i=1}^{n} x_i\vec{B}_i.$$
Visually, we see that we are isolating the j-th components of the rows:
$$\sum_{i=1}^{n} x_i\vec{B}_i =
\begin{pmatrix} x_1 B_{11} & \ldots & x_1 B_{1j} & \ldots & x_1 B_{1p}\end{pmatrix}
+ \begin{pmatrix} x_2 B_{21} & \ldots & x_2 B_{2j} & \ldots & x_2 B_{2p}\end{pmatrix}
+ \ldots
+ \begin{pmatrix} x_n B_{n1} & \ldots & x_n B_{nj} & \ldots & x_n B_{np}\end{pmatrix}$$
Thus,
$$\left[\sum_{i=1}^{n} x_i\vec{B}_i\right]_j = \sum_{i=1}^{n} x_i B_{ij}.$$
But by definition of matrix multiplication,
$$[\vec{x}B]_j = \sum_{i=1}^{n} x_i B_{ij}.$$
This gives us
$$[\vec{x}B]_j = \left[\sum_{i=1}^{n} x_i\vec{B}_i\right]_j.$$
Since j was arbitrary, we can conclude
$$\vec{x}B = \sum_{i=1}^{n} x_i\vec{B}_i.$$
Lastly, we have the row distributive property:

Example. Let A be an $m \times n$ matrix with rows $\vec{A}_1, \vec{A}_2, \ldots, \vec{A}_m$:
$$A = \begin{pmatrix} \vec{A}_1\\ \vec{A}_2\\ \vdots\\ \vec{A}_m \end{pmatrix}$$
and let B be an $n \times p$ matrix. Then,
$$AB = \begin{pmatrix} \vec{A}_1 B\\ \vec{A}_2 B\\ \vdots\\ \vec{A}_m B \end{pmatrix}$$
Proof: Let's look at
$$\begin{pmatrix} \vec{A}_1 B\\ \vec{A}_2 B\\ \vdots\\ \vec{A}_m B \end{pmatrix}.$$
The ij entry is the j-th component of the i-th row, $\vec{A}_i B$. Writing $\vec{A}_i B$ as a linear combination of the
rows of B,
$$\vec{A}_i B = \sum_{r=1}^{n} a_{ir}\vec{B}_r$$
where $\vec{B}_r$ is the r-th row of B. The j-th component is then
$$[\vec{A}_i B]_j = \sum_{r=1}^{n} a_{ir}b_{rj}.$$
By definition of matrix multiplication, the ij entry of AB is
$$[AB]_{ij} = \sum_{r=1}^{n} a_{ir}b_{rj}.$$
Thus, we can conclude
$$AB = \begin{pmatrix} \vec{A}_1 B\\ \vec{A}_2 B\\ \vdots\\ \vec{A}_m B \end{pmatrix}$$
7.4 Distances on Matrices

Just as we defined distance between vectors, we can define a distance function on matrices. Here, we
define a matrix norm:

Definition. The norm of an $m \times n$ matrix A, denoted $\|A\|$, is defined as
$$\|A\| = \left\|\begin{pmatrix}
a_{11} & a_{12} & \ldots & a_{1n}\\
a_{21} & a_{22} & \ldots & a_{2n}\\
a_{31} & a_{32} & \ldots & a_{3n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \ldots & a_{mn}
\end{pmatrix}\right\| = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2}$$
Matrices inherit a lot of the distance properties from vectors. Why? The matrix norm is the same
as the vector norm: just unravel the matrix into a vector.
$$\begin{pmatrix}
a_{11} & a_{12} & \ldots & a_{1n}\\
a_{21} & a_{22} & \ldots & a_{2n}\\
a_{31} & a_{32} & \ldots & a_{3n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \ldots & a_{mn}
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
a_{11}\\ a_{12}\\ \vdots\\ a_{1n}\\ a_{21}\\ a_{22}\\ \vdots\\ a_{2n}\\ \vdots\\ a_{m1}\\ a_{m2}\\ \vdots\\ a_{mn}
\end{pmatrix}$$
In particular, we can prove an important Cauchy-Schwarz-like upper bound:
$$\|A\vec{x}\| \leq \|A\|\,\|\vec{x}\|$$
Intuitively, this gives us an upper bound on how much A scales $\vec{x}$ through matrix multiplication.
When we hit Multivariable Calculus, this inequality will be our bread and butter.
By the way, be careful! In the above expression, we are using the same symbol to denote the
matrix norm and the vector norm:
$$\underbrace{\|A\vec{x}\|}_{\text{vector norm}} \leq \underbrace{\|A\|}_{\text{matrix norm}}\,\underbrace{\|\vec{x}\|}_{\text{vector norm}}$$
As always,

Math Mantra: Watch out for overloaded notation!

Theorem. For any $m \times n$ matrix A and any vector $\vec{x} \in \mathbb{R}^n$,
$$\|A\vec{x}\| \leq \|A\|\,\|\vec{x}\|.$$
Proof: Recall that we proved
$$A\vec{x} = x_1\vec{a}_1 + x_2\vec{a}_2 + \ldots + x_n\vec{a}_n$$
where $\vec{a}_i$ is the i-th column of A. Applying the triangle inequality,
$$\underbrace{\|x_1\vec{a}_1 + x_2\vec{a}_2 + \ldots + x_n\vec{a}_n\|}_{\|A\vec{x}\|} \leq \|x_1\vec{a}_1\| + \|x_2\vec{a}_2\| + \ldots + \|x_n\vec{a}_n\|.$$
Pulling out the scalars from the norms, we rewrite the upper bound as
$$|x_1|\,\|\vec{a}_1\| + |x_2|\,\|\vec{a}_2\| + \ldots + |x_n|\,\|\vec{a}_n\|.$$
Stare at this for a moment: be like Jack in Christmas Town and ask "what's this?"
This is a dot product! Precisely, it is
$$\begin{pmatrix} |x_1|\\ |x_2|\\ \vdots\\ |x_n| \end{pmatrix} \cdot \begin{pmatrix} \|\vec{a}_1\|\\ \|\vec{a}_2\|\\ \vdots\\ \|\vec{a}_n\| \end{pmatrix}$$
And what can we do to dot products? We apply Cauchy-Schwarz! This gives us
$$|x_1|\,\|\vec{a}_1\| + |x_2|\,\|\vec{a}_2\| + \ldots + |x_n|\,\|\vec{a}_n\| \leq \underbrace{\sqrt{\sum_{i=1}^{n} x_i^2}}_{\|\vec{x}\|}\,\underbrace{\sqrt{\sum_{i=1}^{n} \|\vec{a}_i\|^2}}_{\|A\|}.$$
Thus,
$$\|A\vec{x}\| \leq \|A\|\,\|\vec{x}\|.$$
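A quick numerical illustration of the bound (my own sketch; it uses the matrix norm defined above and a random example):

    import math, random

    def matrix_norm(A):
        # Square root of the sum of squared entries, as in the definition.
        return math.sqrt(sum(a ** 2 for row in A for a in row))

    def vec_norm(x):
        return math.sqrt(sum(t ** 2 for t in x))

    A = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
    x = [random.uniform(-1, 1) for _ in range(3)]
    Ax = [sum(A[i][j] * x[j] for j in range(3)) for i in range(4)]
    print(vec_norm(Ax) <= matrix_norm(A) * vec_norm(x) + 1e-12)  # True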
7.5. IMPORTANCE BEHIND MATRICES: LINEAR MAPS 153
7.5 Importance Behind Matrices: Linear Maps
We developed all this theory for matrices, but we never answered why they are important. So here
is the big picture:

Any linear map (from $\mathbb{R}^n$ to $\mathbb{R}^m$) is a matrix multiplication!

Theorem. Let T be a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$. Then, the function¹ $T(\vec{x})$ can be written as the
matrix multiplication
$$T(\vec{x}) = A\vec{x}$$
where
$$A = \begin{pmatrix} T(\vec{e}_1) & T(\vec{e}_2) & \ldots & T(\vec{e}_n) \end{pmatrix}$$
Proof: Given a vector $\vec{x}$, we can rewrite it in terms of the standard basis vectors $\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_n$:
$$\underbrace{\begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix}}_{\vec{x}}
= x_1\begin{pmatrix} 1\\ 0\\ \vdots\\ 0 \end{pmatrix}
+ x_2\begin{pmatrix} 0\\ 1\\ \vdots\\ 0 \end{pmatrix}
+ \ldots
+ x_n\begin{pmatrix} 0\\ 0\\ \vdots\\ 1 \end{pmatrix}$$
Condensely,
$$\vec{x} = x_1\vec{e}_1 + x_2\vec{e}_2 + \ldots + x_n\vec{e}_n.$$
Applying T,
$$T(\vec{x}) = T(x_1\vec{e}_1 + x_2\vec{e}_2 + \ldots + x_n\vec{e}_n)$$
and by linearity,
$$T(\vec{x}) = x_1 T(\vec{e}_1) + x_2 T(\vec{e}_2) + \ldots + x_n T(\vec{e}_n).$$
But by our column distributive property, this can be expressed as the product of a matrix and a
vector:
$$T(\vec{x}) = A\vec{x}$$
where
$$A = \begin{pmatrix} T(\vec{e}_1) & T(\vec{e}_2) & \ldots & T(\vec{e}_n) \end{pmatrix}$$

¹ Again, be careful about overloaded notation! $T(\vec{x})$ is a function mapping whereas $A\vec{x}$ is a matrix product.

From this proof, we can make two major observations. The first:
We have completely classified linear maps from $\mathbb{R}^n$ to $\mathbb{R}^m$.

Notice that all the steps in our proof are completely invertible: we can represent any linear map as a
matrix multiplication and any matrix multiplication represents a linear map. This means that if we
want to talk about linear functions, it's enough just to talk about matrix multiplication!
The second major observation is:

We can determine a linear map T completely by computing its values on the standard basis vectors,
namely,
$$T(\vec{e}_1), T(\vec{e}_2), \ldots, T(\vec{e}_n).$$

This is amazing! To represent T, all we need to do is see how T acts on
$$\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_n$$
and then plug in the outputs as the columns of a matrix.
Example. Let $R_\theta : \mathbb{R}^2 \to \mathbb{R}^2$ denote the mapping that rotates a point in the plane counter-clockwise
by $\theta$ degrees. Express $R_\theta$ as a matrix multiplication.

Consider the point
$$\begin{pmatrix} 1\\ 0 \end{pmatrix}.$$
Schematically, we can see that $R_\theta$ transforms this point:

[Figure: the point $(1, 0)$ is rotated counter-clockwise by the angle $\theta$ to the point $(\cos\theta, \sin\theta)$.]

Thus,
$$R_\theta\begin{pmatrix} 1\\ 0 \end{pmatrix} = \begin{pmatrix} \cos\theta\\ \sin\theta \end{pmatrix}.$$
Likewise, we can see how $R_\theta$ transforms $\begin{pmatrix} 0\\ 1 \end{pmatrix}$:

[Figure: the point $(0, 1)$ is rotated counter-clockwise by the angle $\theta$ to the point $(-\sin\theta, \cos\theta)$.]

Thus,
$$R_\theta\begin{pmatrix} 0\\ 1 \end{pmatrix} = \begin{pmatrix} -\sin\theta\\ \cos\theta \end{pmatrix}.$$
Applying our preceding theorem,
$$R_\theta\begin{pmatrix} x\\ y \end{pmatrix}
= \begin{pmatrix} R_\theta\begin{pmatrix} 1\\ 0 \end{pmatrix} & R_\theta\begin{pmatrix} 0\\ 1 \end{pmatrix} \end{pmatrix}\begin{pmatrix} x\\ y \end{pmatrix}
= \begin{pmatrix} \cos\theta & -\sin\theta\\ \sin\theta & \cos\theta \end{pmatrix}\begin{pmatrix} x\\ y \end{pmatrix}$$
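If you want to test-drive the rotation matrix, here is a short Python sketch (an illustration only; note it works in radians, while the text speaks loosely of degrees):

    import math

    def rotate(theta, point):
        # Multiply the matrix [[cos, -sin], [sin, cos]] by the point (x, y).
        x, y = point
        return (math.cos(theta) * x - math.sin(theta) * y,
                math.sin(theta) * x + math.cos(theta) * y)

    print(rotate(math.pi / 2, (1, 0)))  # approximately (0, 1)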
Lecture 8

Row Space, Column Space, Null Space,
Oh My!

I'm living in the kernel of a rank-one map.
From my domain, its image looks so blue.
'Cause all I see are zeroes, it's a cruel trap.
But we're a finite simple group of order two.

-Klein Four

Goals: After introducing the row space, column space, and null space, we prove a funda-
mental relationship about their sizes. Namely, the Rank-Nullity Theorem asserts that
the dimensions of the null space and column space sum to the dimension of the domain.
The proof of this theorem hinges on our basis theorems from Lecture 6. Lastly, we prove
some fundamental rank properties.
8.1 Column Space and Null Space
Most of the material so far has been motivated by the study of linear functions. Particularly, given a
linear function f, we've seen that the following are all subspaces:

The domain of f.
The image of f.
The solution space of $f(\vec{x}) = \vec{0}$.

We also showed, last lecture, that linear functions on $\mathbb{R}^n$ are directly linked to matrices: any linear
function can be represented as a matrix multiplication
$$A\vec{x}$$
where A is the matrix whose columns are the mapped standard basis vectors. Conversely, any such
matrix multiplication represents a linear function.
One question we can ask is,
What do the aforementioned subspaces mean when we translate them into the world of matrix
multiplication?

The domain of f is all possible $\vec{x}$ that can be plugged into the matrix multiplication
$$\begin{pmatrix}
a_{11} & a_{12} & \ldots & a_{1n}\\
a_{21} & a_{22} & \ldots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \ldots & a_{mn}
\end{pmatrix}\begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix}.$$
Namely, it is the space $\mathbb{R}^n$.

The image of f is the output from plugging in all possible $\vec{x}$. But we showed that a matrix
multiplied by a vector simply outputs a linear combination of the columns:
$$\begin{pmatrix} \vec{a}_1 & \vec{a}_2 & \ldots & \vec{a}_n \end{pmatrix}\begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix}
= x_1\vec{a}_1 + x_2\vec{a}_2 + \ldots + x_n\vec{a}_n.$$
Thus, the image is really the span of the columns of A. We call this span the column space.

The solution space of $f(\vec{x}) = \vec{0}$ is the set of vectors $\vec{x}$ that A multiplies to $\vec{0}$:
$$\begin{pmatrix}
a_{11} & a_{12} & \ldots & a_{1n}\\
a_{21} & a_{22} & \ldots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \ldots & a_{mn}
\end{pmatrix}\begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix}
= \begin{pmatrix} 0\\ 0\\ \vdots\\ 0 \end{pmatrix}$$
We call this the null space of A.

Formally, we define:

Definition. Let A be an $m \times n$ matrix. Then,

The domain of A is $\mathbb{R}^n$.
The column space of A is the subspace of $\mathbb{R}^m$
$$C(A) = \text{span}\{\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_n\}$$
where the $\vec{a}_i$ are the columns of A.
The null space of A is the subspace
$$N(A) = \left\{\vec{x} \in \mathbb{R}^n \mid A\vec{x} = \vec{0}\right\}.$$

We call the dimension of C(A) the column rank, and the dimension of N(A) the nullity.
One question we can ask is

Is there any relationship between the domain, the column space, and the null space?

For starters, since the null space is contained in the domain $\mathbb{R}^n$ and the image is contained in the
space $\mathbb{R}^m$,
$$\dim N(A) \leq n \qquad \dim C(A) \leq m.$$
But these are pretty obvious (and pretty lame) inequalities. How about something more exciting?
Fortunately, we have an incredible result, the Rank-Nullity Theorem, which precisely relates the
dimensions of these three subspaces. In fact, Gilbert Strang refers to this theorem as the first part of
the Fundamental Theorem of Linear Algebra.¹
8.2 Rank-Nullity Theorem
This is the most important application of the Basis Theorem. Namely, it gives you a direct relationship
between the sizes of the domain, the column space, and the null space:

The size of the domain is the sum of the sizes of the column space and the null space.²

Precisely, for any $m \times n$ matrix A,
$$\dim C(A) + \dim N(A) = n.$$
One interpretation is that, after mapping by A, every vector in the domain $\mathbb{R}^n$ is either killed and
sent to zero or used to form the column space:

¹ I highly encourage you to google his MAA article with the same name. It requires only a modest Linear Algebra
background and has excellent illustrations!
² Since kernel is a synonym for null space, the lyrics at the beginning of this lecture imply that he's living in a very
big null space!
[Figure: the map A sends $\mathbb{R}^n$ onto the column space C; part of the domain is crushed to $\vec{0}$.]

If we add the size of the collection of vectors "killed" and the size of the collection of vectors
used to build the column space, we would get the size of the full domain.
Or, as a bar-nalogy, imagine pouring tequila into a shot-glass. Chances are, you are going to spill
some of that onto the bar mat. If you recombine the shot and the spilled tequila (via a sponge), the
conglomerated booze is the same as the original amount poured.
But why is this theorem worth being called the first part of the Fundamental Theorem of Linear
Algebra?

1. The Rank-Nullity Theorem is a counting result. Notice that n is a given fixed constant.
This means that we can directly calculate the dimension of the null space given the dimension
of the column space and vice versa. That's pretty neat! Moreover, we can use the Rank-Nullity
Theorem to prove certain linear mappings are impossible: otherwise things wouldn't add up.
For example, we can never have a linear map with domain $\mathbb{R}^2$ and column space $C(A) = \mathbb{R}^3$:
otherwise, by the Rank-Nullity Theorem,
$$3 + \dim N(A) = 2,$$
so $\dim N(A) = -1$, which is absurd.

2. The Rank-Nullity Theorem is an existence result. If we know the dimension of a subspace
is a certain number k, this gives us a basis of k vectors to work with! For example, if we know
n = 5 and $\dim C(A) = 2$, then
$$2 + \dim N(A) = 5.$$
Thus, $\dim N(A) = 3$. Even though we have no clue how to calculate the null space, we still
know it has some basis
$$\vec{v}_1, \vec{v}_2, \vec{v}_3$$
that we can work with.¹

¹ You are going to see this trick a lot in the next lecture, specifically when deriving limit values.
3. The Rank-Nullity Theorem is going to pop up many times. We are going to use it to
prove relationships between rows and columns, as well as facts about invertibility of matrices.
The Rank-Nullity Theorem even shows up in Math 52H and Math 53H, for example, in the
study of Jordan Canonical forms.

But how do you prove the Rank-Nullity Theorem?
First notice that N(A) is contained in the domain, $\mathbb{R}^n$:

[Figure: the null space N, with basis $\vec{N}_1, \vec{N}_2, \ldots, \vec{N}_{n-q}$, sitting inside $\mathbb{R}^n$.]

By the Basis Extension Theorem, we can extend the null space basis
$$\vec{N}_1, \vec{N}_2, \ldots, \vec{N}_{n-q}$$
to a full n-vector basis for $\mathbb{R}^n$:
$$\vec{N}_1, \vec{N}_2, \ldots, \vec{N}_{n-q}, \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_q.$$
Remarkably, we can prove that the image of the extension vectors under A,
$$A\vec{x}_1, A\vec{x}_2, \ldots, A\vec{x}_q,$$
is a basis for the column space:

[Figure: the column space C, with basis $A\vec{x}_1, A\vec{x}_2, \ldots, A\vec{x}_q$.]

Thus, the sum of the dimensions of the null space and the column space equals n, which is the
dimension of the domain $\mathbb{R}^n$:
$$\underbrace{\vec{N}_1, \vec{N}_2, \ldots, \vec{N}_{n-q}}_{\dim N(A)\, =\, n-q}\qquad\underbrace{A\vec{x}_1, A\vec{x}_2, \ldots, A\vec{x}_q}_{\dim C(A)\, =\, q}$$
Easy as $\pi$.
Theorem (Rank-Nullity Theorem). For any $m \times n$ matrix A,
$$\dim C(A) + \dim N(A) = n$$

Proof Summary:
• Extend a null space basis to a full basis for $\mathbb{R}^n$.
• Show the image of the extension vectors under A is a basis for the column space.
  Linear Independence:
  Suppose not. Then there exists a non-trivial combination of the $A\vec{x}_i$ whose sum is $\vec{0}$.
  Apply linearity of A to show A maps a non-trivial combination of the $\vec{x}_i$ to $\vec{0}$.
  Therefore, the combination is in the null space.
  This contradicts linear independence of our original basis for $\mathbb{R}^n$.
  Spanning:
  $\text{span}\{A\vec{x}_1, A\vec{x}_2, \ldots, A\vec{x}_q\} \subseteq C(A)$: definition.
  $C(A) \subseteq \text{span}\{A\vec{x}_1, A\vec{x}_2, \ldots, A\vec{x}_q\}$: for $\vec{y} \in C(A)$, $A\vec{x} = \vec{y}$ for some $\vec{x}$. Expand $\vec{x}$ in terms of the original basis.
• Conclude
$$\underbrace{\dim C(A)}_{q} + \underbrace{\dim N(A)}_{n-q} = n.$$
Proof: By the Basis Theorem, we know that N(A) has a basis. Moreover, since $N(A) \subseteq \mathbb{R}^n$,
$$\dim N(A) \leq n.$$
Therefore, we can assume N(A) has a basis of $\dim N(A) = n - q$ vectors, where q is some non-negative integer:
$$\vec{N}_1, \vec{N}_2, \ldots, \vec{N}_{n-q}.$$
Applying the Basis Extension Theorem, extend this to a full n-vector basis¹ for $\mathbb{R}^n$:
$$\vec{N}_1, \vec{N}_2, \ldots, \vec{N}_{n-q}, \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_q.$$
I claim that the image of the extension vectors under A,
$$A\vec{x}_1, A\vec{x}_2, \ldots, A\vec{x}_q,$$
gives us a basis for C(A). This would complete the proof since then
$$\dim C(A) = q$$
and thus
$$\underbrace{\dim C(A)}_{q} + \underbrace{\dim N(A)}_{n-q} = n.$$
So let's check the definition of a basis!

• Linearly independent.
Suppose
$$A\vec{x}_1, A\vec{x}_2, \ldots, A\vec{x}_q$$
is not linearly independent. Then we have some non-trivial combination
$$\alpha_1 A\vec{x}_1 + \alpha_2 A\vec{x}_2 + \ldots + \alpha_q A\vec{x}_q = \vec{0}.$$
By linearity of matrix multiplication,
$$A(\alpha_1\vec{x}_1 + \alpha_2\vec{x}_2 + \ldots + \alpha_q\vec{x}_q) = \vec{0}.$$
Thus,
$$\alpha_1\vec{x}_1 + \alpha_2\vec{x}_2 + \ldots + \alpha_q\vec{x}_q$$
is in the null space of A. Writing this vector in terms of the null space basis,
$$\alpha_1\vec{x}_1 + \alpha_2\vec{x}_2 + \ldots + \alpha_q\vec{x}_q = \beta_1\vec{N}_1 + \beta_2\vec{N}_2 + \ldots + \beta_{n-q}\vec{N}_{n-q}.$$
This gives us a non-trivial combination for $\vec{0}$:
$$\alpha_1\vec{x}_1 + \alpha_2\vec{x}_2 + \ldots + \alpha_q\vec{x}_q - \beta_1\vec{N}_1 - \beta_2\vec{N}_2 - \ldots - \beta_{n-q}\vec{N}_{n-q} = \vec{0}.$$
However, the above vectors are elements of our original basis
$$\vec{N}_1, \vec{N}_2, \ldots, \vec{N}_{n-q}, \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_q$$
for $\mathbb{R}^n$, directly contradicting linear independence! Thus,
$$A\vec{x}_1, A\vec{x}_2, \ldots, A\vec{x}_q$$
are linearly independent.

• $\text{span}\{A\vec{x}_1, A\vec{x}_2, \ldots, A\vec{x}_q\} = C(A)$
($\subseteq$) This is a freebie since the images of specific vectors (as well as their combinations) are
automatically in the column space.
($\supseteq$) Let $\vec{c} \in C(A)$. By definition of the column space, we can find a $\vec{y} \in \mathbb{R}^n$ such that
$$A\vec{y} = \vec{c}. \qquad (*)$$
Expand $\vec{y}$ in terms of our basis for $\mathbb{R}^n$,
$$\vec{y} = \beta_1\vec{N}_1 + \beta_2\vec{N}_2 + \ldots + \beta_{n-q}\vec{N}_{n-q} + \alpha_1\vec{x}_1 + \alpha_2\vec{x}_2 + \ldots + \alpha_q\vec{x}_q,$$
to get
$$A\vec{y} = A\left(\beta_1\vec{N}_1 + \beta_2\vec{N}_2 + \ldots + \beta_{n-q}\vec{N}_{n-q} + \alpha_1\vec{x}_1 + \alpha_2\vec{x}_2 + \ldots + \alpha_q\vec{x}_q\right).$$
Distribute,
$$A\vec{y} = \beta_1\underbrace{A\vec{N}_1}_{=\vec{0}} + \beta_2\underbrace{A\vec{N}_2}_{=\vec{0}} + \ldots + \beta_{n-q}\underbrace{A\vec{N}_{n-q}}_{=\vec{0}} + \alpha_1 A\vec{x}_1 + \alpha_2 A\vec{x}_2 + \ldots + \alpha_q A\vec{x}_q,$$
and use the fact that each $\vec{N}_i$ is in the null space to get
$$A\vec{y} = \alpha_1 A\vec{x}_1 + \alpha_2 A\vec{x}_2 + \ldots + \alpha_q A\vec{x}_q.$$
Plugging this into the left-hand side of $(*)$,
$$\underbrace{\alpha_1 A\vec{x}_1 + \alpha_2 A\vec{x}_2 + \ldots + \alpha_q A\vec{x}_q}_{A\vec{y}} = \vec{c}.$$
By definition, this means
$$\vec{c} \in \text{span}\{A\vec{x}_1, A\vec{x}_2, \ldots, A\vec{x}_q\} = C(A).$$

Since we proved the claim, we conclude
$$\underbrace{\dim C(A)}_{q} + \underbrace{\dim N(A)}_{n-q} = n.$$

¹ Often, students confuse m and n in the statement of the Rank-Nullity Theorem. Think domain. If you insist on a
mnemonic: extend the null space.
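For a quick numerical illustration of the count, here is a sketch using NumPy and SciPy (outside tools, not part of the course; the matrix is an arbitrary example). The rank and the nullity are computed independently and should add up to the number of columns:

    import numpy as np
    from scipy.linalg import null_space

    A = np.array([[1., 2., 3., 4.],
                  [2., 4., 6., 8.],
                  [0., 1., 0., 1.]])
    rank = np.linalg.matrix_rank(A)      # dim C(A)
    nullity = null_space(A).shape[1]     # dim N(A), via an orthonormal null space basis
    print(rank, nullity, rank + nullity == A.shape[1])  # 2 2 True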
8.3 Row Space
Notice that the preceding theorem is called the Rank-Nullity Theorem, not the Column Rank-Nullity
Theorem. This is because rank is more special than column rank (though they are the same number)!
We are going to upgrade our Rank-Nullity Theorem one step further. Or, to quote Elzar,

We are going to knock it up a notch!

Precisely, we are going to show that the dimension of the column space is the same as the dimension
of the row space. And this number will be called the rank.
As a first guess, you may think that:

The row space is the span of the rows of A.

This is almost correct. However, we only defined spans for column vectors!
To fix this, we make a definition:
Definition. Given an m × n matrix

A = [ a_11  a_12  ...  a_1n ]
    [ a_21  a_22  ...  a_2n ]
    [  ...   ...  ...   ... ]
    [ a_m1  a_m2  ...  a_mn ]

the transpose of A is the n × m matrix whose ij-component is the ji-component of A:

A^T = [ a_11  a_21  ...  a_m1 ]
      [ a_12  a_22  ...  a_m2 ]
      [  ...   ...  ...   ... ]
      [ a_1n  a_2n  ...  a_mn ]

Transposes will be extremely important in Lecture 30. For now, we will only transpose row vectors. In this case, we are simply propping a row vector up as a column vector:

x = [ x_1  x_2  ...  x_n ]   becomes   x^T = [ x_1 ]
                                             [ x_2 ]
                                             [ ... ]
                                             [ x_n ]

The correct definition of the row space is:
The row space is the span of the transposed rows of A.
Precisely,
Definition. Let A be an m × n matrix with rows A_1, A_2, ..., A_m:

A = [ A_1 ]
    [ A_2 ]
    [  ⋮  ]
    [ A_m ]

The row space of A is the subspace of R^n:
R(A) = span{ A_1^T, A_2^T, ..., A_m^T }.

For example, given the matrix

A = [  1  2  3  4  5 ]
    [ 11 12 13 14 15 ]

the row space of A is
R(A) = span{ (1, 2, 3, 4, 5)^T, (11, 12, 13, 14, 15)^T }.
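Although we have not yet proved that the row space and column space have the same dimension, you can already check it numerically for this example. Here is a small sketch (assuming NumPy is available) that compares the rank of A with the rank of A^T, whose column space is exactly R(A):

    import numpy as np

    A = np.array([[ 1,  2,  3,  4,  5],
                  [11, 12, 13, 14, 15]])

    col_dim = np.linalg.matrix_rank(A)    # dim C(A), a subspace of R^2
    row_dim = np.linalg.matrix_rank(A.T)  # dim R(A), a subspace of R^5
    print(col_dim, row_dim)               # both print 2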
Using the Rank-Nullity Theorem, we can prove the awesome fact that the dimension of the row space equals the dimension of the column space:
dim R(A) = dim C(A).
This is not obvious! You would think that the rows and columns would have nothing to do with each other. Indeed, the row space and column space are subsets of different spaces (R^n and R^m, respectively). But it turns out that their bases have the same size!

And once we prove this equality, we will tear down the walls of column and row discrimination and just say rank:

Definition. Let A be an m × n matrix. The rank of A is defined as
rank A = dim C(A) = dim R(A).

Then, the Rank-Nullity Theorem becomes
rank A + dim N(A) = n.

In many linear algebra books, this result is the crowning jewel of the chapter because it has a ton of applications, both theoretical and non-theoretical. Unfortunately, we are going to save the proof that
dim C(A) = dim R(A)
for next week. For now, let's assume the result and get some practice proving rank properties.
8.4 Proving Rank Properties
Here are three basic rank properties that you're expected to prove, assuming A is an m × n matrix:
rank(A + B) ≤ rank A + rank B
rank A ≤ min{m, n}
rank(AB) ≤ min{rank A, rank B}
But how do you even begin?

The first rule of Math Fight Club is:

Math Mantra: Don't let any math statement scare you. Go back to the definition and view the statement in simpler terms!

When you expand the definition of rank, you are going to realize it's just all the basis stuff you've been practicing until this point! The one minor difference is that the rank takes a matrix as an input whereas the dimension takes a subspace as input!

The second rule of Math Fight Club is:

Math Mantra: Don't even ATTEMPT to prove a math theorem until you are comfortable with all the definitions used!

Go to the Flomo catwalks. Or Roble Field. Or the Synergy garden (if you're into that stuff). Somewhere. Just stare at the sky and think about the definitions. Think about what they mean. Then, only when you are ready, attempt the proof. In the words of Professor Simon,

Mull it over until you feel comfortable with the definition. Write out basic cases. Then write out the extreme cases.
Example. Let A and B be m × n matrices. Then,
rank(A + B) ≤ rank A + rank B.

Proof: Write A and B in terms of their columns,
A = [ α_1  α_2  ...  α_n ]        B = [ β_1  β_2  ...  β_n ].
Then,
A + B = [ α_1 + β_1   α_2 + β_2   ...   α_n + β_n ].
Because rank is the dimension of the column space, let's study the span of the columns of A + B:
C(A + B) = span{ α_1 + β_1, α_2 + β_2, ..., α_n + β_n }.
Since any linear combination of these columns
c_1(α_1 + β_1) + c_2(α_2 + β_2) + ... + c_n(α_n + β_n)
is just
c_1 α_1 + c_2 α_2 + ... + c_n α_n + c_1 β_1 + c_2 β_2 + ... + c_n β_n,
we conclude that
span{ α_1 + β_1, ..., α_n + β_n } ⊆ span{ α_1, ..., α_n, β_1, ..., β_n }.
By our span theorems, we can collapse a linearly dependent spanning set into a basis. Reordering if necessary,
α_1, α_2, ..., α_n        β_1, β_2, ..., β_n
collapse into bases
α_1, α_2, ..., α_{rank A}        β_1, β_2, ..., β_{rank B},
respectively. Therefore,
span{ α_1, ..., α_n, β_1, ..., β_n } = span{ α_1, ..., α_{rank A}, β_1, ..., β_{rank B} },
making our set inclusion
span{ α_1 + β_1, ..., α_n + β_n } ⊆ span{ α_1, ..., α_{rank A}, β_1, ..., β_{rank B} }.
Even though we don't know the dimension of the right hand side, in the worst-case scenario its dimension is rank A + rank B (this happens if all the vectors on the right hand side are linearly independent). Thus,
dim C(A + B) ≤ rank A + rank B,
or by definition of rank,
rank(A + B) ≤ rank A + rank B.

The next example is an immediate¹ corollary of our definition of rank.

¹ Do not submit this proof on Homework 2! That particular exercise defines rank A = dim C(A). You cannot assume dim C(A) = dim R(A).
Theorem. Let A be an m × n matrix. Then
rank A ≤ min{m, n}.

Proof: The column space is contained in R^m, so
rank A = dim C(A) ≤ m.
Likewise, the row space is contained in R^n:
rank A = dim R(A) ≤ n.
Thus, rank A is less than or equal to the smaller of m and n:
rank A ≤ min{m, n}.
The last property is about the rank of a product of matrices.

Theorem. For an m × n matrix A and n × p matrix B,
rank AB ≤ min{rank A, rank B}.

Proof: We need to prove two inequalities separately.

• rank AB ≤ rank A.
Let's look at the columns of AB. Writing B in terms of its columns,
B = [ β_1  β_2  ...  β_p ],
apply the column distributive law to get
AB = [ Aβ_1  Aβ_2  ...  Aβ_p ].
By our matrix multiplication properties, each column of AB is a linear combination of the columns α_1, α_2, ..., α_n of A:
Aβ_j = Σ_{r=1}^{n} b_{rj} α_r.
Therefore, any linear combination of the columns of AB is also a linear combination of the columns of A, hence
C(AB) = span{ Aβ_1, Aβ_2, ..., Aβ_p } ⊆ span{ α_1, α_2, ..., α_n } = C(A).
Thus,
rank AB ≤ rank A.

• rank AB ≤ rank B.
Look at the rows of A:

A = [ A_1 ]
    [ A_2 ]
    [  ⋮  ]
    [ A_m ]

From our matrix multiplication properties, we know that the rows of AB are

AB = [ A_1 B ]
     [ A_2 B ]
     [   ⋮   ]
     [ A_m B ]

Moreover, each row of AB is a linear combination of the rows B_1, B_2, ..., B_n of B:
A_i B = Σ_{r=1}^{n} a_{ir} B_r.
The row space of AB is therefore contained in the row space of B:
R(AB) = span{ (A_1 B)^T, (A_2 B)^T, ..., (A_m B)^T } ⊆ span{ B_1^T, B_2^T, ..., B_n^T } = R(B).
This implies the dimension of the row space of AB is less than or equal to the dimension of the row space of B:
rank AB ≤ rank B.

Since
rank AB ≤ rank A   and   rank AB ≤ rank B,
rank AB is bounded by the smaller of rank A and rank B:
rank AB ≤ min{rank A, rank B}.
By the way, now that we have proved these rank theorems, you should be asking yourself why you should care.

These properties are important both theoretically and practically. Theoretically,
rank(A + B) ≤ rank A + rank B
says that adding two matrices cannot create a matrix with a rank greater than the sum of the ranks of the original two matrices. And
rank A ≤ min{m, n}
tells us that if A is m × n with m < n, the columns must always be linearly dependent. Likewise, if m > n, then the rows are always linearly dependent. Finally,
rank AB ≤ min{rank A, rank B}
tells us that when we multiply two matrices, we cannot create a matrix whose rank is greater than the ranks of the two original matrices.
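As a quick sanity check (not a proof, just my own illustrative snippet assuming NumPy), you can test all three inequalities on randomly generated integer matrices:

    import numpy as np

    rng = np.random.default_rng(0)
    rank = np.linalg.matrix_rank

    for _ in range(1000):
        m, n, p = rng.integers(1, 6, size=3)
        A = rng.integers(-2, 3, size=(m, n))
        B = rng.integers(-2, 3, size=(m, n))   # same shape as A, for A + B
        C = rng.integers(-2, 3, size=(n, p))   # n x p, so the product A C is defined

        assert rank(A + B) <= rank(A) + rank(B)
        assert rank(A) <= min(m, n)
        assert rank(A @ C) <= min(rank(A), rank(C))

    print("all three rank inequalities held on every random example")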
As for practical applications, unfortunately, you won't see any real-world application of rank formulas in Math 51H. However, I highly recommend taking
EE263: Linear Dynamical Systems.
It's an excellent course. In fact, Professor Boyd and his crack team of TAs literally scoured the globe for every important real-world application of linear algebra, assembling a massive treasure trove of 100+ problems. Even though it is a graduate level class, it is a lot easier than the H-Series. And every engineer agrees that it is a must-take course.
New Notation

• C(A), read "the column space of A". Example: C(A) = R^n, i.e. the column space of A is R^n.
• N(A), read "the null space of A". Example: N(A) = {0}, i.e. the null space of A is the set containing only the zero vector.
• R(A), read "the row space of A". Example: dim R(A) = dim C(A), i.e. the dimension of the row space equals the dimension of the column space.
• x^T, read "x transpose". Example: [1 0 0]^T = e_1, i.e. the transpose of the row vector [1 0 0] is the first standard basis vector.
• rank A, read "the rank of A". Example: rank A = dim R(A), i.e. the rank of A is equal to the dimension of the row space of A.
Lecture 9
The Sky's the Limit

The ε-N definition took a hundred years to develop.
Yet you are expected to learn it in less than twenty minutes!
- Leon Simon

Goals: We introduce the central notion of analysis, limits. Because this topic is so fundamental, I begin by giving the intuition behind the definition and explaining triple quantifiers. Then, I devote the rest of the lecture to numerous examples, to give you practice with ε-N proofs.
9.1 Capturing Closeness

The essence of Calculus is the study of closeness, or rather, the
Limit of processes.
For example,
• The derivative is the limit of the slopes of secant lines.
• The integral is the limit of area approximations.
• An infinite sum is the limit of partial sums.
But what precisely is a limit?

In seven years of teaching, I have never seen a high school teacher actually teach the limit definition correctly. Why? Because kids are squeamish about it. And teachers are squeamish¹ about it. So instead of getting a definition, you are taught to plug and chug.

For example, to calculate the limit of the sequence
a_n = n / e^n

¹ In their defence, not even Newton had the correct definition of limit. And that guy wrote calculus in one summer!
you plugged in 10, 50, 100, 200, and saw that the sequence gets closer to 0 as n approaches infinity:
a_10 ≈ .0005
a_50 ≈ 9.16 × 10^(−21)
a_100 ≈ 3.72 × 10^(−42)
a_200 ≈ 2.76 × 10^(−85)
Then you concluded that a_n converges to 0 as n approaches infinity:
a_n → 0.
However,

Math Mantra: You cannot conclude¹ a general result by just plugging in a few numbers!

Consider the sequence
a_n = 0 if n is a multiple of 10, and a_n = n otherwise.
Plugging in 10, 50, 100, 200 yields 0:
a_10 = 0
a_50 = 0
a_100 = 0
a_200 = 0
But this sequence
1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 11, ..., 101, 102, 103, 104, 105, 106, 107, 108, 109, 0, 111, ...
does not converge to 0!
9.2 Intuition for Limits

Before I give you a rigorous definition of the limit of a sequence, let's make an analogy. Consider the scenario of a man shooting arrows. The n-th shot is the n-th term. Moreover, each shot lands on the number line.

For example, the sequence (a_n) where
a_n = (n − 1)/n
can be visualized as shots getting closer to 1:

[Figure: a number line with ticks at 0, 1/4, 1/2, 3/4, 1; the shots a_1, a_2, a_3, a_4, a_5 land closer and closer to 1.]

¹ Always remember the x² + x + 41 example. The secret of the universe is not 42; it's 41.
Of course, we could be dealing with Sir Robert Locksley of Nottingham¹ (aka Robin Hood): he could hit the same spot twice! Consider the sequence
a_n = (−1)^n.

[Figure: a number line with −1 and 1 marked; the odd shots a_1, a_3, a_5, a_7 land on −1 and the even shots a_2, a_4, a_6, a_8 land on 1.]

This man's a boss. He goes back and forth hitting the −1, 1 marks every time!

As a first attempt at understanding limits, we guess that:
The limit is the bullseye we want to hit.
Call the bullseye L. We want his shots to get closer and closer to the bullseye. We want him to be able to eventually shoot an arrow within distance 1 of the target:

[Figure: the interval (L − 1, L + 1) around L, with a shot a_17 landing inside.]

And after waiting even longer, within distance 1/2 of the target:

[Figure: the interval (L − 1/2, L + 1/2) around L, with a shot a_92 landing inside.]

And after waiting still longer, within distance 1/10 of the target:

[Figure: the interval (L − 1/10, L + 1/10) around L, with a shot a_1792 landing inside.]

In fact, for any desired distance ε, we want one of his shots to eventually land within distance ε of the target:

[Figure: the interval (L − ε, L + ε) around L, with a shot a_94305 landing inside.]

Intuitively, this means that we can eventually find a shot that is as accurate to the target as we want. Mathematically,
for any ε > 0, there exists some integer index N such that |a_N − L| < ε.

¹ Or for the young folks, that little girl from Brave.
Is this the definition of limit? Almost. But we have to require even more. Consider the previous example of
a_n = (−1)^n.
It is indeed true that for any ε > 0 there exists some N such that
|a_N − 1| < ε.
In fact, you are always guaranteed to exactly hit the target 1 on every even numbered shot. But intuitively, we feel the shots do not converge to 1.

We make one revision. Instead of having the archer eventually land a shot within distance ε of the target, we tyrannically demand that
Eventually, all shots land within distance ε of the target.
So like checking in at the Hotel California, eventually his shots never leave.

[Figure: the interval (L − ε, L + ε) around L; the early shots a_1, a_2, a_3, ... may land outside, but from a_90, a_91, a_92, a_93, ... onward every shot lands inside.]

In this diagram, after a_90, all subsequent terms are trapped in the vortex between L − ε and L + ε.

We now formalize this intuition with the following definition:
Definition. We say L is the¹ limit of the sequence (a_n) if for any ε > 0 there exists an integer N such that for all i ≥ N,
|a_i − L| < ε.

¹ In order to say that L is the limit rather than a limit, we need to prove that there can be only one limit!

We write
a_n → L
to mean
The sequence (a_n) converges to L.
We can alternatively write
lim_{n→∞} a_n = L,
which has the exact same meaning, but is read as
The limit of the sequence (a_n) equals L.

Even though the latter is more aesthetically pleasing, I tend to use the former notation because I prefer how it is read. Moreover, it reminds us that we need to prove that the limit is unique, whereas the second notation already asserts this. I shall prove uniqueness of limits at the end of this lecture.

Note that the definition of a limit involves three magic words:
for any ε > 0 there exists an N such that for all i ≥ N.
In order to prove that a sequence converges to some limit, we need to understand how to prove statements involving multiple instances of any and exists.
9.3 Proving Statements involving Multiple Quantifiers

There exists a castle on a cloud,
where I like to go when I'm asleep.
Any floor there I don't need to sweep.
Here in my castle on a cloud.
- (Left) Cosette

One of the reasons why students struggle with limit proofs is that they haven't been taught the mathematical birds and bees. And by the birds and the bees, I mean the As and the Es:
Any and Exists
The logicians call these logical quantifiers. To sum up what I've taught you so far, typically:
• To prove a universal (any) statement, take an arbitrary element from the set, and show that it has the desired property.
• To prove an existential (exists) statement, construct a specific example.
Of course, there are some other ways to prove universal and existential statements. We've already given examples that use proof by cases, contradiction, and even mathematical axioms. But for the most part, we will use the two aforementioned techniques.

Some of our proofs have (innately) combined these two quantifiers. Particularly, we have worked through two combination types:
Any-Exists and Exists-Any
To prove these theorems, we only needed to combine the techniques above (in the right order¹):
• To prove an any-exists statement, take an arbitrary element x_1 and, using this x_1, construct some element x_2.

¹ The order of quantifiers does matter. You will see this when discussing continuous functions versus uniformly continuous functions.
• To prove an exists-any statement, find some element x_1 such that an arbitrary element x_2 satisfies some property involving x_1.

Here is an example of proving an any-exists statement:

Example. For any natural number n, there exists a perfect square greater than n.

Proof: Let n be a natural number. Using n, we construct a number bigger than n that is a perfect square. Take
(n + 1)².
It is a perfect square and
(n + 1)² = n² + (2n + 1) > n² ≥ n.

Here, the arbitrary element n is used to construct the number we need, namely (n + 1)².
As an example of an exists-any proof, we can show there exists a matrix (different from the identity) that does not change vector length:

Example. There exists a non-identity matrix A such that for any x ∈ R²,
‖Ax‖ = ‖x‖.

Proof: First we find a matrix. Intuitively, we know rotation shouldn't change vector length, so choose
A = [ cos θ  −sin θ ]
    [ sin θ   cos θ ]
Now we have to show that for any x ∈ R²,
‖Ax‖ = ‖x‖.
Let x ∈ R² be arbitrary. Then,
Ax = [ x_1 cos θ − x_2 sin θ ]
     [ x_1 sin θ + x_2 cos θ ]
Directly calculating the norm (the cross terms cancel when we expand the squares),
‖Ax‖ = √( (x_1 cos θ − x_2 sin θ)² + (x_1 sin θ + x_2 cos θ)² )
     = √( (x_1 cos θ)² + (x_1 sin θ)² + (x_2 sin θ)² + (x_2 cos θ)² )
     = √( x_1²(cos²θ + sin²θ) + x_2²(sin²θ + cos²θ) )
     = √( x_1² + x_2² ).
In conclusion,
‖Ax‖ = ‖x‖.

Here, we first constructed an object A and then showed that any vector x has its length unchanged by our chosen matrix A.

Before we move forward, I would like to re-emphasize that the second variable depends on the first.
• In the any-exists example, our constructed object was a function of the arbitrary n.
• In the exists-any example, our proof that any vector had its length unchanged by A relied on our choice of A.
• In our limit proofs, N will be a function of ε.
Now, we are ready to deal with any-exists-any proofs. This is just a combination of the any-exists and any proofs:
To prove an any-exists-any statement, take an arbitrary element x_1 and use it to construct an x_2, such that for an arbitrary element x_3, some property holds.
For example, here is a variation of the infinitude of primes proof:

Example. For any finite set of prime numbers P, there exists a natural number n, such that for any prime in P, that prime does not divide n.

Proof: Let P be a finite set of primes
P = {p_1, p_2, ..., p_s}.
We use these p_i to construct the n we need. Define
n = p_1 p_2 ⋯ p_s + 1.
Notice that for any p_i in P, dividing n by p_i always leaves remainder 1:
n = (p_1 p_2 ⋯ p_{i−1} p_{i+1} ⋯ p_s) · p_i + 1,
where the first factor is the quotient and 1 is the remainder. Thus, for any prime p_i ∈ P, p_i does not divide n.

We started with some arbitrary set of primes P and used it to construct a number n. Then for an arbitrary p_i in P, we showed that p_i did not divide n.

Once you feel comfortable with the preceding examples, you can try tackling limit proofs.
9.4 How to Prove a Sequence Converges to Some Limit

Start every limit proof with "Let ε be bigger than 0."
- Maksim Maydanskiy

Every (basic) limit proof should follow the same formula:
1. Let ε > 0, i.e. fix ε as some positive number.
2. Rewrite the condition |a_i − L| < ε into an equivalent condition on i.
3. Choose an N such that for all i ≥ N, a_i satisfies the aforementioned condition.
Make sure that your choice of N is a function¹ of ε.

Let's start with a classic example:

Example. The sequence (a_n) where
a_n = 1/n
converges to 0:
a_n → 0.

Proof: Let ε > 0. We need to find a corresponding N such that we can guarantee, for all i ≥ N,
| 1/i − 0 | < ε,
or simply
1/i < ε.
First, rewrite this condition to isolate i:
1/ε < i.   (*)
Now, we have to choose an N such that for all i ≥ N, condition (*) holds. Therefore, choose N to be an integer strictly greater than 1/ε. For example, choose
N = ⌊1/ε⌋ + 1.
Then, for any i ≥ N,
1/ε < N ≤ i,
so the condition (*) is satisfied. Thus, we can conclude
a_n → 0.

¹ This is literally one of my greatest pet-peeves. If you magically find an N that is independent of ε, you are either dealing with constants or mental health issues.
Note that it is not enough to choose
N = ⌊1/ε⌋.
This is because for some choices of ε (e.g. ε = 1/2) we have
1/ε = N.
However, we require
1/ε < N.
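If you like, you can watch this recipe play out numerically. The little sketch below (plain Python, my own illustration) takes an ε, picks N = ⌊1/ε⌋ + 1 exactly as in the proof, and then spot-checks a long stretch of terms past N. Of course, code can only check finitely many i, so this is a sanity check, not a proof:

    import math

    def N_for(eps):
        # the choice made in the proof: an integer strictly greater than 1/eps
        return math.floor(1 / eps) + 1

    for eps in [0.5, 0.1, 0.003]:
        N = N_for(eps)
        ok = all(abs(1 / i - 0) < eps for i in range(N, N + 10000))
        print(eps, N, ok)   # ok is True for each eps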
The next example involves negative numbers:

Example. Define
s_n = cos(n) / √n.
Then,
s_n → 0.

Proof: Let ε > 0. We need to find a corresponding N such that we can guarantee, for all i ≥ N,
| cos(i)/√i − 0 | < ε.
Because
| cos(i)/√i | = |cos(i)| / |√i| = |cos(i)| / √i,
this condition is equivalent to:
|cos(i)| / √i < ε.
To guarantee that the left hand side is less than ε, it is a lot easier to show that an upper bound of this quantity is less than ε. Since
|cos(i)| ≤ 1,
we have
|cos(i)| / √i ≤ 1/√i.
Thus it suffices to prove
1/√i < ε.
Isolating i, we can rewrite this condition as
1/ε² < i.
Choose
N = ⌊1/ε²⌋ + 1.
Then, for any i ≥ N,
1/ε² < N ≤ i,
as needed. Thus, we can conclude
s_n → 0.

Here's a harder example:

Example. Define
a_n = 2^n / n!.
Then,
a_n → 0.
Proof: Let ε > 0. Then, we need to find an N such that for all i ≥ N,
2^i / i! < ε.
Again, it is easier to find an upper bound for the (LHS) and then show that this upper bound is less than ε.

Expand the (LHS):
2^i / i! = (2 · 2 · 2 ⋯ 2) / (i · (i − 1) · (i − 2) ⋯ 3 · 2 · 1)   (i factors of 2)
and rewrite this as a product of fractions:
(2/i) · (2/(i − 1)) ⋯ (2/4) · (2/3) · (2/2) · (2/1).
Notice that the middle factors are at most 1:
2/(i − 1) ≤ 1,  ...,  2/4 ≤ 1,  2/3 ≤ 1,  2/2 ≤ 1.
Therefore, keeping the first and last factors, this product is bounded above by
(2/i) · 1 · 1 ⋯ 1 · 2,
giving us
2^i / i! ≤ 4/i.
Now we need only show that the upper bound is less than ε:
4/i < ε.
But this is equivalent to showing
4/ε < i.
Therefore, choose
N = ⌊4/ε⌋ + 1.
Then, for any i ≥ N,
4/ε < N ≤ i,
as needed. In conclusion,
a_n → 0.

In the preceding proof, notice that we did not use the fact that
2/i ≤ 1.
If we did, our bound would be too big:
2^i / i! ≤ 2.
No matter how big we require i to be, we cannot manipulate 2 to be less than ε. However, by leaving 2/i alone, we got
2^i / i! ≤ 4/i,
in which we can indeed find a condition on i to ensure
4/i < ε.
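Here is a small numerical illustration of why 4/i is the useful bound and 2 is not (a plain Python sketch of my own, nothing official). It also builds the term 2^N/N! one factor at a time to avoid overflowing a float:

    import math

    # compare the actual term 2^i / i! with the bound 4/i from the proof
    for i in [2, 5, 10, 20]:
        print(i, 2**i / math.factorial(i), 4 / i)

    # picking N as in the proof: for eps = 0.001 we need 4/eps < i
    eps = 0.001
    N = math.floor(4 / eps) + 1
    term = 1.0
    for i in range(1, N + 1):     # multiply the factors 2/1, 2/2, ..., 2/N
        term *= 2 / i             # term is now 2^N / N!
    print(N, term, term < eps)    # the term is far below eps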
The next example involves a non-zero limit:

Example. Define
a_n = (n − 1)/n.
Then,
a_n → 1.

Proof: Let ε > 0. We want to find an N such that for all i ≥ N,
| (i − 1)/i − 1 | < ε.
Since
(i − 1)/i − 1 = (i − 1)/i − i/i = −1/i,
we can simplify this condition to
| −1/i | < ε.
Moreover, we can simplify the absolute value,
1/i < ε,
and isolate i:
1/ε < i.
Again, choose
N = ⌊1/ε⌋ + 1.
Then, for any i ≥ N,
1/ε < N ≤ i.
Thus, we can conclude
a_n → 1.
Example. Dene
b
n
=
3n
2
+ 8
2n
2
+ 4
Then,
b
n

3
2
.
Proof: Let > 0. We want to nd an N such that for all i N,

3i
2
+ 8
2i
2
+ 4

3
2

< .
Since
3i
2
+ 8
2i
2
+ 4

3
2
=
3i
2
+ 8
2i
2
+ 4

3i
2
+ 6
2i
2
+ 4
=
2
2i
2
+ 4
=
1
i
2
+ 2
9.5. LIMIT PROPERTIES: ADDITION AND SCALING 185
we need to nd an N such that

1
i
2
+ 2

< .
Dropping the absolute value,
1
i
2
+ 2
<
and isolating i, we get
1
_
1

2 < i
Choose
N =
_
_
1

2
_
+ 1
Then, for an arbitrary i N,
_
1

2 < N i
Thus, we conclude
b
n

3
2

9.5 Limit Properties: Addition and Scaling

Oftentimes, we do not want to refer to the original ε-N definition of limit. Instead, we use limit properties. There are two main reasons:

• The limit may be intuitively obvious.

For example, we can feel it in our bones that the limit of
a_n = 3 · (n − 1)/n
is 3. Why? Because we already know the limit of
b_n = (n − 1)/n
is 1. Intuitively,
Scaling a convergent sequence by a constant scales the limit by the same constant.
As another example, the limit of
x_n = (n − 1)/n + 2^n/n!
should be 1. This is because the limits of the individual sequences
y_n = (n − 1)/n        z_n = 2^n/n!
sum to 1, and intuitively,
The limit of a sum of convergent sequences is the sum of their individual limits.

• We can break the terms of a sequence into easier parts.

Suppose we wanted to compute the limit of
a_n = (n − 1)^10 / n^10.
The bone-headed thing to do would be to expand (n − 1)^10 as an ugly polynomial. Instead, we can show that
A product of convergent sequences converges to the product of their individual limits.
In particular,
a_n = ((n − 1)/n) · ((n − 1)/n) ⋯ ((n − 1)/n),
where each (n − 1)/n converges to 1. So we conclude:
a_n → 1.

• Proving the limit directly from the definition can be too difficult.

For example, suppose you are given the sequence
a_n = n^(1/n)
and you want to prove
a_n → 1.
No, we can't just apply L'Hospital's rule. You don't know how to prove it (nor do 99% of high school students). If you tried direct expansion, you would get stuck showing
n < (1 + ε)^n.
Instead, we can apply a more advanced limit property known as the Sandwich Theorem,¹ which is proven in the next lecture.

We begin with proving the most basic limit property, scaling:

Theorem. Let
a_n → L
and let k be a constant. Then, the sequence formed from scaling each term by k,
(k a_n),
converges and
k a_n → kL.

¹ You could argue that we can still write everything directly in terms of the ε-N definition since the Sandwich Theorem follows from the same definition. But it's like applying a function in Java. We can either use 5 lines of code where each line represents 100 lines of code, or plug in 500 lines of code. Alternatively you can think of this as our matrix multiplication example: would you rather write out all 1000+ entries or a single arbitrary entry?
Proof Summary:
1. Let ε > 0.
2. Use convergence of (a_n) with the choice ε̃ = ε/|k| to get a corresponding N_ε̃.
3. Choose N = N_ε̃.

Proof: Assume k ≠ 0 (since the zero sequence obviously converges). Let ε > 0. We want to find an N such that for all i ≥ N,
|k a_i − kL| < ε.
By absolute value properties, we can pull out the k:
|k| |a_i − L| < ε,
and thus rewrite our condition as
|a_i − L| < ε/|k|.
Since we are given
a_n → L,
for any ε̃ > 0 we can find a corresponding N_ε̃ such that for all j ≥ N_ε̃,
|a_j − L| < ε̃.
Choose
ε̃ = ε/|k|.
Then there is a corresponding N_ε̃ such that for all j ≥ N_ε̃,
|a_j − L| < ε̃ = ε/|k|.
But lo and behold, this is exactly the condition we wanted to show. Just let N = N_ε̃ and we're done!

Notice that we had to add the tilde and change i to j. If we kept the same variables, we would have completely changed the meaning of the statement! Generally,

Math Mantra: Don't be a dummy with dummy variables!

Mull over the dummy variables in the proof and think about the fact that we can choose an ε̃ and corresponding N_ε̃. When you are comfortable enough to explain the proof to your grand pappy over the phone, move on to the sum property:
Theorem. Let
a_n → L_1        b_n → L_2.
Then, the sequence formed by adding each term point-wise,
(a_n + b_n),
converges and
(a_n + b_n) → L_1 + L_2.
Proof Summary:
1. Let ε > 0.
2. Apply the convergence definition of a_n with the choice ε_1 = ε/2 to get a corresponding N_1.
3. Apply the convergence definition of b_n with the choice ε_2 = ε/2 to get a corresponding N_2.
4. Choose N = max{N_1, N_2}.

Proof: Let ε > 0. We want to find an N such that
|a_i + b_i − L_1 − L_2| < ε
for all i ≥ N. Rearranging terms,
|(a_i − L_1) + (b_i − L_2)| < ε.
Again, it is often easier to show that some upper bound for the (LHS) is less than ε. Applying the triangle inequality to the (LHS),
|(a_i − L_1) + (b_i − L_2)| ≤ |a_i − L_1| + |b_i − L_2|.
Therefore, if we can find an N such that for all i ≥ N,
|a_i − L_1| + |b_i − L_2| < ε,
we are done!

By the definition of
a_n → L_1,
we know that for any ε_1 > 0, we can find an N_1 such that for all i ≥ N_1,
|a_i − L_1| < ε_1.
In particular, if we choose ε_1 = ε/2, there is an N_1 such that for all i ≥ N_1,
|a_i − L_1| < ε/2.
Likewise, using the definition of
b_n → L_2,
with the choice ε_2 = ε/2, we can find an N_2 such that for all i ≥ N_2,
|b_i − L_2| < ε/2.
Combining these N-hypotheses, if
i ≥ N_1 and i ≥ N_2,
then in fact
|a_i − L_1| + |b_i − L_2| < ε/2 + ε/2 = ε.
Therefore, to ensure that both hypotheses are satisfied, choose N to be the bigger of N_1 and N_2:
N = max{N_1, N_2}.
Then, when i ≥ N, we know i ≥ N_1 and i ≥ N_2, so
|a_i − L_1| + |b_i − L_2| < ε/2 + ε/2 = ε.
In conclusion,
(a_n + b_n) → L_1 + L_2.
9.6 Limit Properties: Product

We also have a product rule for convergent sequences. However, unlike scaling and summing, this proof is not straightforward. Particularly, we need two things:
• An algebraic trick
• A nice idea
The algebraic trick is the same one we have used ever since we completed the square in Algebra II: just add 0 (you'll see)!

As for the nice idea, we need to prove
Convergent sequences are bounded.
Why do we know this is the idea that we are going to need?

Math Mantra: Nice ideas usually come from reaching an impasse in the proof. When you get stuck, prove a new result to help you out.

To show a sequence is bounded above by a constant, just fix ε to be some number, say ε = 1. Then we know that the tail end of the sequence,
a_N, a_{N+1}, a_{N+2}, ...,
is bounded above by L + 1. Moreover, the finitely many terms at the beginning of the sequence are bounded by the greatest among them:
a_1, a_2, ..., a_{N−1}   are all at most   max{a_1, a_2, ..., a_{N−1}}.
Therefore, the entire sequence is bounded above by
K_1 = max{L + 1, a_1, a_2, ..., a_{N−1}}.
A similar process is used to prove that the sequence is bounded below. Now we fill in the details to get a formal proof:

Theorem. Convergent sequences are bounded. In other words, if
a_n → L,
then there exists some K > 0 such that
|a_i| ≤ K
for all i = 1, 2, ....
Proof Summary:
• We prove the sequence is bounded above by some K_1:
  - Set ε = 1.
  - K_1 is the biggest value among a_1, a_2, ..., a_{N−1}, L + 1.
• We prove the sequence is bounded below by some K_2:
  - Set ε = 1.
  - K_2 is the smallest value among a_1, a_2, ..., a_{N−1}, L − 1.
• Let K = max{|K_1|, |K_2|}.

Proof: First, we prove that the convergent sequence is bounded above, i.e. there exists K_1 such that
a_i ≤ K_1
for all i. Apply the convergence definition with the choice ε = 1. Then we know there is an N such that for all i ≥ N,
|a_i − L| < 1,
or equivalently,
L − 1 < a_i < L + 1.
In particular, past index N, a_i cannot be greater than L + 1. So in fact,
a_i < L + 1
for all i ≥ N. Hence, there are only finitely many terms that are not automatically bounded above by L + 1, namely:
a_1, a_2, ..., a_{N−1}.
Therefore, every term in the entire sequence is bounded above by
K_1 = max{a_1, a_2, ..., a_{N−1}, L + 1}.

To prove that the sequence is bounded below, we follow a similar argument: we want to find K_2 such that
K_2 ≤ a_i
for all i. Choose ε = 1. By the definition of convergence, we know that there is an N such that
|a_i − L| < 1
for all i ≥ N. In particular,
L − 1 < a_i
for all i ≥ N. Therefore, the entire sequence is bounded below by
K_2 = min{a_1, a_2, ..., a_{N−1}, L − 1}.

Since our sequence is bounded above and below,
K_2 ≤ a_i ≤ K_1
for all i. Define
K = max{|K_1|, |K_2|}.
Then for all i, we have
−K ≤ K_2 ≤ a_i ≤ K_1 ≤ K.
Hence,
|a_i| ≤ K
for all i.
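To make the K in this proof concrete, here is a tiny sketch (plain Python, with a sequence of my own choosing) that builds K_1, K_2, and K exactly as in the argument, and then spot-checks the bound:

    def a(n):
        return 3 + 10 / n        # converges to L = 3

    L, N = 3, 11                 # with eps = 1, |a_i - 3| < 1 holds for all i >= 11

    K1 = max([a(i) for i in range(1, N)] + [L + 1])   # 13.0
    K2 = min([a(i) for i in range(1, N)] + [L - 1])   # 2
    K = max(abs(K1), abs(K2))                         # 13.0

    print(K1, K2, K)
    print(all(abs(a(i)) <= K for i in range(1, 100000)))   # True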
Now that we have our nice idea, we can prove the product property:

Theorem. If
a_n → L_1        b_n → L_2,
then the sequence formed by taking the point-wise product,
(a_n b_n),
converges and
a_n b_n → L_1 L_2.

Proof Summary:
• In the ε-condition, add a_i L_2 − a_i L_2 (which is 0).
• Use the triangle inequality and the boundedness of convergent sequences to establish an upper bound.
• Apply the convergence of a_n with the choice ε/(2|L_2|) to get a corresponding N_1.
• Apply the convergence of b_n with the choice ε/(2K) to get a corresponding N_2.
• Choose N = max{N_1, N_2}.
Proof: Let ε > 0. We want to show that there exists an N such that for all i ≥ N,
|a_i b_i − L_1 L_2| < ε.
Using the awesome trick of adding 0, rewrite the ε-condition as:
|a_i b_i − a_i L_2 + a_i L_2 − L_1 L_2| < ε.
Equivalently,
|(a_i b_i − a_i L_2) + (a_i L_2 − L_1 L_2)| < ε.
Therefore, it suffices to show that some upper bound for
|(a_i b_i − a_i L_2) + (a_i L_2 − L_1 L_2)|
is less than ε. Using our trusty triangle inequality,
|(a_i b_i − a_i L_2) + (a_i L_2 − L_1 L_2)| ≤ |a_i b_i − a_i L_2| + |a_i L_2 − L_1 L_2|.
By absolute value properties, the (RHS) is
|a_i| |b_i − L_2| + |L_2| |a_i − L_1|.
Now we apply our nice idea. Since any convergent sequence is bounded, there exists a K > 0 such that for any i,
|a_i| ≤ K.
Therefore,
|a_i| |b_i − L_2| + |L_2| |a_i − L_1| ≤ K |b_i − L_2| + |L_2| |a_i − L_1|.
Since
a_i → L_1,
we know there¹ is some N_1 such that for every i ≥ N_1,
|a_i − L_1| < ε / (2|L_2|).
Likewise, we know there exists an N_2 such that for every i ≥ N_2,
|b_i − L_2| < ε / (2K).
To ensure that both ε-conclusions hold, let
N = max(N_1, N_2).
Then, for i ≥ N,
K |b_i − L_2| + |L_2| |a_i − L_1| < K · ε/(2K) + |L_2| · ε/(2|L_2|) = ε.

¹ Careful! At this point I divide by |L_2|. Therefore, we will need to consider the case L_2 = 0 separately. This proof is an easy exercise and I leave it to the enthusiastic reader to complete.
9.7 Uniqueness of Limits

We conclude our first lecture on limits with one more fundamental fact. First, let's take another look at our original definition of limit:

Definition. We say L is the limit of the sequence (a_n) if for any ε > 0 there exists an integer N such that for all i ≥ N,
|a_i − L| < ε.

Notice that I wrote the limit and not a limit. By using the word the, I imply that the limit is unique. It's like saying
Chuck Norris is the king of the world.
versus
Chuck Norris is a king of the world.
Why am I making such a big deal out of a single word?
• We don't want multiple limits! Particularly,
a_n = (−1)^n
should not converge to both −1 and 1.
• We will use uniqueness to solve for unknown limits. For example, if we can show
a_n → 3x + 1        a_n → 2x + 5,
then by uniqueness, we know that
3x + 1 = 2x + 5.
Thus x = 4 and therefore
a_n → 13.

Let's prove that the limit of a sequence is unique:

Theorem. If
a_n → L_1        a_n → L_2,
then
L_1 = L_2.
Proof Summary:
• If we can prove |L_1 − L_2| < ε for all ε > 0, then it must be the case that L_1 = L_2.
• Let ε > 0.
• Add 0: |L_1 − L_2| = |L_1 − a_i + a_i − L_2|.
• Use the triangle inequality and the limit definition to show the upper bound is less than ε.

Proof: Note that if
|L_1 − L_2| < ε
for any ε > 0, then L_1 − L_2 must be 0. Otherwise, the choice
ε = |L_1 − L_2|
forces
|L_1 − L_2| < |L_1 − L_2|,
which is absurd!

Therefore, if we can prove for any ε > 0,
|L_1 − L_2| < ε,
we can conclude
L_1 = L_2.

Let ε > 0. Rewrite |L_1 − L_2| by adding zero,
|L_1 − a_i + a_i − L_2|,
and regrouping,
|(L_1 − a_i) + (a_i − L_2)|.
Applying the triangle inequality,
|(L_1 − a_i) + (a_i − L_2)| ≤ |L_1 − a_i| + |a_i − L_2|.
Now we just have to show this upper bound is less than ε. Since
a_n → L_1,
we know for the particular choice of ε/2, we can find an N_1 such that for all i ≥ N_1,
|a_i − L_1| < ε/2.
Likewise, since
a_n → L_2,
we can find an N_2 such that for all i ≥ N_2,
|a_i − L_2| < ε/2.
Let N = max{N_1, N_2}. Then for all i ≥ N,
|L_1 − a_i| + |a_i − L_2| < ε/2 + ε/2 = ε.
Therefore
|L_1 − L_2| < ε,
and
L_1 = L_2.
New Notation

• (a_n), read "the sequence with each term a_n". Example: (1/n) converges to 0, i.e. the sequence whose n-th term is 1/n converges to 0.
• a_n → L, read "the sequence (a_n) converges to L". Example: 1/n → 0, i.e. the sequence (1/n) converges to 0.
• lim_{n→∞} a_n = L, read "the limit of the sequence (a_n) equals L". Example: lim_{n→∞} ⁿ√n = 1, i.e. the limit of the sequence (ⁿ√n) is 1.
• ε, read "epsilon". Example: Otis calls his students ε's, i.e. Otis calls his students epsilons.
Lecture 10
Being Bolzy

Come on you math majors if you want to be free,
From Corporate America you listen to me.
You've got a sequence that you built from your approximating tweakins,
And you really need to find a convergent subsequence,
So you ask my man Bolzano and his homie Weierstrass,
Who've found you a solution with a trick that's really boss.
- Stephen Sawin

Goals: Today, we prove the infamous Bolzano-Weierstrass theorem. But in order to do so, we first need to prove the Monotone Convergence Property and the Sandwich Theorem. These three theorems will be useful tools in proving limit properties.

10.1 The Next Big Thing

This is the biggest theorem since sliced bread. Or rather, the biggest theorem since Cauchy-Schwarz. And like Cauchy-Schwarz, Bolzano-Weierstrass has multiple proofs, so it must be super important. In fact, there is even a Bolzano-Weierstrass Rap (YouTube it)!

But in order to understand Bolzano-Weierstrass, we must first understand subsequences. A subsequence is just a sequence formed by picking out terms of a larger sequence. For example, given the sequence (a_n):
1, 4, 9, 16, 25, 36, 49, 64, ...
we can pick out specific terms, say the 2nd, 3rd, 5th, and 7th,
1, [4], [9], 16, [25], 36, [49], 64, ...
to form the subsequence
a_{n_1} = 4,  a_{n_2} = 9,  a_{n_3} = 25,  a_{n_4} = 49,  ...
Formally,
Definition. Given an original sequence (a_n)_{n=1}^∞, a subsequence is a sequence
(a_{n_j})_{j=1}^∞
where (n_i) is an increasing sequence of indices:
1 ≤ n_1 < n_2 < n_3 < ....

Note that:
• (a_n)_{n=1}^∞ is a sequence indexed by n:
a_1, a_2, a_3, a_4, a_5, ...
whereas (a_{n_j})_{j=1}^∞ is a sequence indexed by j:
a_{n_1}, a_{n_2}, a_{n_3}, a_{n_4}, a_{n_5}, ...
We will continue to write (a_n)_{n=1}^∞ as (a_n) and (a_{n_j})_{j=1}^∞ as (a_{n_j}) while assuming the indexing convention.
• The increasing n_i condition simply means
Move along the sequence, picking each term after the previous one, without ever going back to an earlier term in the sequence.

[Figure: the terms a_1, a_2, a_3, a_4, a_5, a_6, a_7, ... with arrows selecting a_1, a_2, a_4, a_7, ....]

In our diagram,
n_1 = 1,  n_2 = 2,  n_3 = 4,  n_4 = 7,  ...

To get practice with this definition, let's prove the following fundamental result:
Every subsequence of a convergent sequence also converges. Moreover, it converges to the same limit as the original sequence.
This is a very intuitive result:
Theorem. If
a_n → L,
then any subsequence (a_{n_i}) converges to the same limit:
a_{n_i} → L.

Proof: Let ε > 0. Since (a_n) converges to L, there exists some N such that for all i ≥ N,
|a_i − L| < ε.
Observe that
n_i ≥ i.
This is because (n_i) is an increasing sequence of positive integers; therefore, (n_i) increases at least as quickly as the slowest possible increasing sequence of positive integers 1, 2, 3, ....
By our observation, whenever i ≥ N,
n_i ≥ i ≥ N,
so we have
|a_{n_i} − L| < ε.
Therefore, for all i ≥ N,
|a_{n_i} − L| < ε.

Even though this was a very simple proof, do not underestimate its value! You'll exploit this result numerous times. Particularly,
Suppose you know that (a_n) converges but you do not know the value of the limit L. Then you can extract the value of L by finding the limit of a particular subsequence!
Now that you understand subsequences, we can state the Bolzano-Weierstrass theorem:
Every bounded sequence has a convergent subsequence.
In some cases, it is obvious. For example, in a bounded sequence like
1, 0, 1, 0, 1, 0, 1, 0, ...
we can pick out all the even terms:
1, [0], 1, [0], 1, [0], 1, [0], ...
to get the convergent subsequence (a_{n_j}) of all zeros. Even in a sequence that lists all the rationals on [0, 1], you could pick something like
0/1, 1/1, 0/2, 1/2, 2/2, 0/3, 1/3, 2/3, 3/3, 0/4, 1/4, 2/4, 3/4, 4/4, ...
But what if Bart Simpson decided to dedicate his life to picking crazy numbers from [−100, 100] one at a time:
π²/6, e^e, π²/90, 2√3, ...?
Here, constructing a convergent subsequence is not obvious.

The key is to use the fact that these numbers are bounded to plot them:

[Figure: the chosen points scattered on a number line between −K and K.]

From this picture, we are going to devise a simple plan to construct a convergent subsequence.

By the way, we never answered the question:
Why do we care about Bolzano-Weierstrass?
Here's a great reason:
Almost all of the magical theorems in this course will stem from this seemingly innocuous fact.
Yes, I'm serious. And if you want a ghost of a chance of surviving this course, much like Indiana Jones you need to take a leap of faith before you can achieve the Holy Grail.

To prove Bolzano-Weierstrass we will need two major theorems:
• Monotone Convergence Property
• Sandwich Theorem

10.2 Monotone Convergence Property

Imagine Barbossa forcing Jack Sparrow to walk the plank. Jack can either
• Move forward slightly.
• Stay still.
However, he cannot move closer to the ship (at the risk of getting skewered) and can't jump into the ocean (at the risk of drowning). Therefore, Jack gets arbitrarily close to some fixed position.

The Monotone Convergence Property is the same idea: we have a sequence that either increases or stays the same (we call such a sequence monotonically increasing). It cannot backtrack, and it is bounded above. So eventually it gets smooshed up somewhere:

[Figure: points on the number line between −K and K, accumulating toward the right end.]
The same is true of a monotonically decreasing sequence. Formally we say
All bounded monotonic sequences converge.
We only prove the theorem for monotonically increasing sequences since the case of a monotonically decreasing sequence is almost verbatim.

Theorem (Monotone Convergence Property). If
a_i ≤ a_j
for i < j and there exists K such that
a_i ≤ K
for all i, then (a_i) converges to some real number.

Proof Summary:
• Consider the set of sequence terms. By the Completeness Axiom, there is a least upper bound S.
• Show that the sequence converges to S:
  - Argue there is at least one element in the interval (S − ε, S].
  - All future elements remain in this interval.

Proof: Consider the set of sequence terms:
A = {a_i | i ∈ N}.
Since this set is bounded and nonempty, by the Completeness Axiom, A has a least upper bound S. We now show (a_i) converges to S:

Let ε > 0. Then we want to show that there exists N such that for all i ≥ N,
|a_i − S| < ε.
First, there must be at least one a_q that lies in the interval
S − ε < a_q ≤ S.
Otherwise, S − ε would be an upper bound, contradicting that S is the least upper bound of A.
Secondly, for all i ≥ q,
|a_i − S| < ε.
Suppose not. Then there must exist some j > q such that either
a_j ≤ S − ε   or   a_j ≥ S + ε.
In the case
a_j ≤ S − ε,
we have
a_j ≤ S − ε < a_q,
contradicting that (a_i) is monotonic increasing. The second case,
a_j ≥ S + ε,
contradicts that S is an upper bound of the sequence.
Thus, for all i ≥ q,
|a_i − S| < ε.
Therefore, set N = q and we're done!

At this point, every math textbook likes to point out that this theorem relies heavily on the Completeness Axiom. In fact, we can prove:
The Completeness Axiom is logically equivalent to the Monotone Convergence Property.
This means we can replace our Completeness Axiom by the Monotone Convergence Property in the original Field Axioms. But like choosing between Team Edward and Team Jacob, it doesn't matter.

I personally prefer the Monotone Convergence Property: to prove a sequence converges to a real number we just need to show the sequence is
• Bounded
• Monotonic
But showing these properties is pretty easy:
JUST USE INDUCTION!
Sequences are meant for induction, just like Starships are meant to fly. Don't believe me? Here's a fun example:

If you are a Khan Academy addict, you may have run into the magic number known as the Golden Ratio,
(1 + √5)/2.
It tends to show up in expressions that involve self-circularity. For example, you may have seen the following (incorrect) proof:

Example.
√(1 + √(1 + √(1 + ...))) = (1 + √5)/2.
Bad Proof: Define
y = √(1 + √(1 + √(1 + ...))).
Then,
y² = 1 + √(1 + √(1 + ...)).
But lo and behold, the right-hand side contains our original expression. Therefore,
y² = 1 + y.
Using the quadratic equation and selecting the positive solution since y is clearly non-negative, we get
y = (1 + √5)/2.

This proof has major problems. The first is:
How do we know that we can give √(1 + √(1 + √(1 + ...))) a name?
The name y suggests a finite number, but how do we know this expression doesn't explode? If it did, then bad things would happen. For example, consider the explosion
S = 1 + 1 + 1 + ....
Exploiting circularity as before,
S − 1 = S.
In other words,
1 = 0.
OUCH.

Secondly,
Who said we can even take infinite summations and square roots?
To quote Professor Simon,
Math is not a mystical study¹!

We are really looking at tricky notation for a limit, namely the limit of the sequence defined by
a_1 = √1
a_{n+1} = √(1 + a_n)   for all n ≥ 1.
Here is a proper proof using the Monotone Convergence Property and limits:

¹ Despite what some logicians believe (*cough* transfinite induction *cough*).
Example. Define a sequence by:
a_1 = √1
a_{n+1} = √(1 + a_n)   for all n ≥ 1.
Then,
a_n → (1 + √5)/2.

Proof: In Lecture 4, we already proved this sequence is bounded by 2. Therefore, to show convergence, we need only prove (a_n) is increasing.

We use induction on n to prove the property
P_n : a_{n+1} − a_n ≥ 0.

Base case, n = 1.
Obvious since
√(1 + √1) − √1 ≥ 0.

Inductive Step.
Let k ≥ 1. Assume P_k is true:
a_{k+1} − a_k ≥ 0.
We want to show property P_{k+1}:
a_{k+2} − a_{k+1} ≥ 0.
By definition,
a_{k+2} = √(1 + a_{k+1})        a_{k+1} = √(1 + a_k).
Therefore, P_{k+1} is equivalent to
√(1 + a_{k+1}) − √(1 + a_k) ≥ 0,
i.e.
√(1 + a_{k+1}) ≥ √(1 + a_k).
But from the inductive hypothesis,
a_{k+1} ≥ a_k.
Adding 1 to both sides and using the fact that square roots preserve non-negative inequalities (for the billionth time),
√(1 + a_{k+1}) ≥ √(1 + a_k),
as needed.

Thus (a_n) is increasing and bounded, so by the Monotone Convergence Property,
a_n → L
for some L. Awesome, we have an actual number L to work with!

Now, consider the sequence¹ (b_n) where
b_n = a_{n+1} · a_{n+1} − a_n − 1.
To compute the limit of (b_n), use the multiplication and sum properties of limits:
b_n → L · L − L − 1.
When we expand the right hand side of the original definition of b_n (using a_{n+1} · a_{n+1} = 1 + a_n), we see that
b_n = (1 + a_n) − a_n − 1 = 0.
Therefore,
b_n → 0.
Since
b_n → 0        b_n → L·L − L − 1,
by uniqueness of limits,
0 = L² − L − 1.
Solving the quadratic and taking the positive solution,
L = (1 + √5)/2.

¹ The definition of b_n mirrors the bad proof's step of y² = y + 1 (or really, y² − y − 1 = 0). Even the bad proof has some good ideas to learn from!
10.3 The Sandwich Theorem

The Sandwich Theorem states that
If we have a sequence sandwich-ed between two sequences that converge to the same limit, then the sandwich-ed sequence converges to this limit as well.
Like the Monotone Convergence Property, the biggest application of the Sandwich Theorem is proving the Bolzano-Weierstrass Theorem (which, like Johnny Depp in any movie, is literally the star of the show).

The Sandwich Theorem, however, is important in its own right as it will allow us to prove
ⁿ√n → 1.
This limit will be used in our proof of the Change of Base-Point Theorem in Lecture 21.
Theorem (Sandwich Theorem). If
a_i ≤ b_i ≤ c_i
for all i and
a_n → L        c_n → L,
then (b_n) converges and
b_n → L.

Proof Summary:
1. Let ε > 0.
2. There are N_1, N_2 such that all terms in (a_n) and (c_n) beyond the N_1-th term and N_2-th term, respectively, are within ε of L.
3. Let N = max{N_1, N_2}.
4. Explicitly write out all the limit definitions as inequalities without absolute values, and combine.

Proof: Let ε > 0. We want to show there is some N such that for all i ≥ N,
|b_i − L| < ε.
When we expand the absolute value, this means that we have to prove the pair of inequalities:
L − ε < b_i        b_i < L + ε.
By the definition of convergence for (a_n), there exists an N_1 such that for all i ≥ N_1,
|a_i − L| < ε.
This inequality is equivalently
L − ε < a_i        a_i < L + ε.   (*)
Likewise, we know there exists an N_2 such that for all i ≥ N_2,
L − ε < c_i        c_i < L + ε.   (**)
Since we want both (*) and (**) to hold, let
N = max{N_1, N_2}.
Then, for all i ≥ N,
L − ε < a_i ≤ b_i   (the first inequality by (*)),
and likewise,
b_i ≤ c_i < L + ε   (the last inequality by (**)).
Thus, for all i ≥ N,
L − ε < b_i < L + ε.

To prove
ⁿ√n → 1,
we need one quick result.

Lemma. If x ≥ 1, then
ⁿ√x ≥ 1.

Proof: Suppose ⁿ√x < 1. The product of positive numbers less than 1 is still less than 1 (recall the ordering axioms). Therefore,
ⁿ√x · ⁿ√x ⋯ ⁿ√x < 1 · 1 ⋯ 1 = 1   (n factors).
But we have n copies of ⁿ√x on the left hand side, so
x < 1,
a contradiction.

Now, we are ready.
Theorem.
ⁿ√n → 1.

Proof Summary:
• Define the sequence s_n = ⁿ√n − 1.
• (1 + s_n)^n bounds the third term of its binomial expansion.
• Isolate s_n and use the Sandwich Theorem to show the limit of s_n is 0.
• Conclude (ⁿ√n) → 1 by the limit sum property.

Proof: First, to make everything look nicer, define
s_n = ⁿ√n − 1.
By the preceding theorem, since n ≥ 1,
s_n ≥ 0.
Now here's the trick: consider
(1 + s_n)^n.
By the awesomeness of the Binomial Theorem,
(1 + s_n)^n = (n choose 0) 1^n (s_n)^0 + (n choose 1) 1^(n−1) (s_n)^1 + ... + (n choose n) 1^0 (s_n)^n.
All these terms are non-negative since s_n is non-negative (that's why we needed the previous theorem)! Therefore, (1 + s_n)^n must be greater than (or equal to) each individual term in its sum. In particular, it bounds the third term:
(1 + s_n)^n ≥ (n choose 2) 1^(n−2) (s_n)² = (n(n − 1)/2) (s_n)².
Substituting s_n into the (LHS), (1 + s_n)^n = n, so
n ≥ (n(n − 1)/2) (s_n)²,
and simplifying,
s_n ≤ √(2/(n − 1)).
Again, since s_n is non-negative,
0 ≤ s_n ≤ √(2/(n − 1)).
Since the zero sequence converges to 0 and we can easily show
√(2/(n − 1)) → 0,
we can conclude
s_n → 0
by the Sandwich Theorem. Therefore, by the limit sum theorem,¹
s_n + 1 → 1.
In other words,
ⁿ√n → 1.

¹ Note, the 1 signifies the constant sequence c_n = 1 for all n.
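As a quick numerical sanity check (my own sketch, plain Python), you can compare s_n = ⁿ√n − 1 against the sandwiching bound √(2/(n − 1)) from the proof:

    import math

    for n in [10, 1000, 100000]:
        s_n = n ** (1 / n) - 1          # the s_n from the proof
        bound = math.sqrt(2 / (n - 1))  # the upper half of the sandwich
        print(n, s_n, bound, 0 <= s_n <= bound)   # the bound always holds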
On this week's homework, you will take the Sandwich Theorem one step further to prove that if
1/n^k ≤ a_n ≤ n^k
for every n and some positive integer k, then
lim_{n→∞} ⁿ√(a_n) = 1.
10.4 Bolzano-Weierstrass Theorem

To explain the proof of Bolzano-Weierstrass, let's play a game: the game of bisection. If you've ever been to nerd-camp, you've probably seen this game before. Some cocky little trickster, say Puck, asserts

Puck: I betcha I can guess your birthday in 9 tries, as long as, after each try, you tell me if your birthday is later or earlier.

And if you're a naive little Muggle, you might reply:

Muggle: You're on!

Say Muggle's birthday is September 4th. Puck first thinks of all the dates of the year as an interval:
[January 1st, December 31st]
In order to eliminate as many choices as possible, Puck asks Muggle to compare her birthday to the middle of the year:

Puck: Is your birthday before or after July 1st?
Muggle: After

So Puck erases all the dates before July 1st.
[July 1st, December 31st]
He then calculates the new middle of this interval, October 1st.

Puck: Is your birthday before or after October 1st?
Muggle: Before

Then he removes all dates after October 1st.
[July 1st, October 1st]
The game continues for a few more iterations.
[August 17th, October 1st]
[August 17th, September 9th]
[August 29th, September 9th]
Finally, on the sixth guess,

Puck: Is your birthday September 4th?
Muggle: Wow, you're psychic!

Pretty simple, huh? Bolzano-Weierstrass constructs the subsequence in the same way! However, there are a few twists:
• The game goes on forever.
• Guesses will be terms in the original sequence.
• Instead of choosing the interval containing the birthday, you choose the interval that contains infinitely many terms of the original sequence. This ensures the game never ends.
It turns out that by the awesomeness of the Monotone Convergence Property and the Sandwich Theorem, your sequence of guesses is convergent!
Theorem (Bolzano-Weierstrass). Every bounded sequence has a convergent subsequence.

Proof Summary:
• Define a sequence of intervals, each one half the size of the previous and each containing infinitely many points of (a_n).
• From the j-th interval, select a term to be a_{n_j}.
• Use the Monotone Convergence Property to prove that the sequence of left interval endpoints and the sequence of right interval endpoints both converge.
• Use the Sandwich Theorem on the endpoint sequences to show (a_{n_j}) converges.

Proof: Let (a_n) be bounded by K. That means we are starting with the interval
I_1 = [−K, K].
To help us understand the process, we are going to use the following schematic:

[Figure: dots scattered in the interval [−K, K], each dot a term of the sequence.]

Here, each dot represents a different term of the sequence. Choose any a_{n_1} in this interval. In particular, we can choose
a_{n_1} = a_1.
Now consider the two subintervals
[−K, 0]        [0, K].
Since a sequence is infinite, at least¹ one of these intervals contains infinitely many terms. Let I_2 be this interval. In our schematic,
I_2 = [0, K].
Now choose a point a_{n_2} ∈ I_2. Make sure that you choose an index that is bigger than the previous:
n_2 > n_1.
This is because subsequences must have increasing indices. And because our interval contains infinitely many sequence terms, we know such an n_2 exists. Now we have
a_{n_1} ∈ I_1        a_{n_2} ∈ I_2.
Continuing, split interval I_2 in half and let I_3 be the half that contains infinitely many terms of (a_n). Choose a_{n_3} to be a point in this interval, such that
n_3 > n_2.
In our schematic, we choose
I_3 = [0, K/2].

Now that you understand the intuition, we give the inductive definition:
The intervals I_n are defined² as
I_1 = [−K, K],
I_n = [a, (a+c)/2]   if I_{n−1} = [a, c] and the interval [a, (a+c)/2] contains infinitely many sequence terms,
I_n = [(a+c)/2, c]   otherwise.
The inductive definition of the subsequence is just
a_{n_1} = a_1,
a_{n_j} = some point of (a_n) in I_j such that n_j > n_{j−1}.
To reiterate, we know a_{n_j} exists since I_j contains infinitely many terms of (a_n) by construction.

I claim (a_{n_j}) converges. And to prove this, we will use the Sandwich Theorem. Namely, we will use the fact that our sequence is sandwiched between the sequences formed from the endpoints of the intervals. Let
c_j = left endpoint of interval I_j,
d_j = right endpoint of interval I_j.
We check the preconditions of the Sandwich Theorem:

• c_j ≤ a_{n_j} ≤ d_j.
They are the left and right endpoints of interval I_j, duh.

• (c_j) and (d_j) both converge.
Inductively, you can show that (c_j) is a monotonically increasing sequence: every time you cut the interval in two, your next left endpoint either
- stays the same, or
- is the midpoint of the previous interval.
Likewise, we can prove (d_j) is monotonically decreasing. Since they are both bounded by K, we know by the Monotone Convergence Theorem that (c_j) and (d_j) both converge and their limits exist.

• (c_j) and (d_j) converge to the same limit.
Let
c_j → L_1        d_j → L_2.
To prove L_1 = L_2, it suffices to show
(d_j − c_j) → 0.
This is because, by our limit sum properties,
(d_i − c_i) + c_i → 0 + L_1 = L_1,
and moreover
(d_i − c_i) + c_i = d_i → L_2.
By uniqueness,
L_2 = L_1.
Showing
(d_i − c_i) → 0
is easy: d_i − c_i is just the interval length. From our construction, each interval is half the size of the previous. Inductively,
d_i − c_i = 2K / 2^(i−1),
which converges to 0.

In conclusion, by the Sandwich Theorem on
c_j ≤ a_{n_j} ≤ d_j,
(a_{n_j}) converges. Awesome.

¹ You may have a choice between two intervals containing infinitely many terms. Just pick one, but you should see from this that there may be many different convergent subsequences, each converging to a different limit. Mull over why this doesn't contradict the fact that any subsequence of a convergent sequence converges to the same limit.

² Don't be scared of the whole (a+c)/2 business: this is just the midpoint of a and c.
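If you want to "play" the bisection game on a computer, here is a rough sketch (plain Python; the example sequence and the finite-prefix shortcut are mine, not the text's). Code cannot test "contains infinitely many terms", so it keeps the half containing more terms of a long finite prefix, which is only a stand-in for the real argument:

    def a(n):
        # a bounded sequence with two cluster points: 1/n for even n, 1 - 1/n for odd n
        return 1 / n if n % 2 == 0 else 1 - 1 / n

    terms = [a(n) for n in range(1, 200001)]   # a long finite prefix, bounded by K = 1

    lo, hi = -1.0, 1.0                         # start with I_1 = [-K, K]
    last = -1                                  # last index used (indices must increase)
    picks = []
    for _ in range(15):                        # repeat pick-then-halve 15 times
        # choose the next term, with a larger index, lying in the current interval
        i = next(i for i in range(last + 1, len(terms)) if lo <= terms[i] <= hi)
        last = i
        picks.append(terms[i])
        mid = (lo + hi) / 2
        # stand-in for "the half with infinitely many terms": the half with more prefix terms
        left = sum(lo <= t <= mid for t in terms)
        right = sum(mid < t <= hi for t in terms)
        lo, hi = (lo, mid) if left >= right else (mid, hi)

    print(picks[-4:], hi - lo)   # the later picks agree to within the final interval length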
By the way, there are two other infamous proofs of Bolzano-Weierstrass involving
• limsups
• dominated terms
I leave the first item for Math 171. The second is a very cute forty-second proof.

10.5 An Alternate Proof of Bolzano-Weierstrass

In a given sequence (a_n), we say that a term a_j is dominant if it is strictly greater than all the terms that appear later in the sequence:
a_j > a_{j+1},  a_j > a_{j+2},  a_j > a_{j+3},  a_j > a_{j+4},  ...
Formally,
Definition. Let (a_n) be a sequence. Then we say a term a_k is dominant if for all i > k,
a_k > a_i.
Otherwise, there exists some i > k such that
a_i ≥ a_k,
and we say a_k is conquerable and that a_i dominates a_k.
Using this jargon, we can give a very simple proof of the Bolzano-Weierstrass theorem:
Theorem (Bolzano-Weierstrass). Every bounded sequence has a convergent subsequence.
Proof Summary:
There are two possible cases:
There are infinitely many dominant terms.
The subsequence of dominant terms is monotonically decreasing.
Apply the Monotone Convergence Property.
There are finitely many dominant terms.
After the last dominant term, every term is conquerable.
Construct a subsequence of conquerable terms such that each term is dominated by
the next term. This is a monotonically increasing subsequence.
Apply the Monotone Convergence Property.
Proof: We have one of two cases. Either:
There are infinitely many dominant terms
OR
There are finitely many dominated terms
There are infinitely many dominant terms.
Dene (a
n
i
) where
a
n
i
= the i-th dominant term.
a
n
1
. . . a
n
2
. . . a
n
3
. . .
10.5. AN ALTERNATE PROOF OF BOLZANO-WEIERSTRASS 215
By denition, this is a monotonically decreasing sequence. Since its also bounded, it follows by
the Monotone Convergence Property that (a
n
i
) converges.
There are finitely many dominant terms.
Since there are finitely many dominant terms, there is a last dominant term, say a_N. Then all the terms after a_N are conquerable. Formally, for all i > N, there exists some j > i such that
    a_i ≤ a_j.
Construct a subsequence (a_{n_i}) where each term is dominated by the next:^1
    a_{n_1} ≤ a_{n_2} ≤ a_{n_3} ≤ . . .
In particular, choose
    a_{n_1} = a_{N+1}.
Then for each i ≥ 1, supposing a_{n_i} has been selected, we may select
    a_{n_{i+1}} = a_j    for some a_j that dominates a_{n_i}.
By construction, (a_{n_i}) is a monotonically increasing sequence. Since it's also bounded, it follows by the Monotone Convergence Property that (a_{n_i}) converges.

In both cases, we constructed a convergent subsequence.

[1: Think of a fish being eaten by a bigger fish and that fish being eaten by an even bigger fish, and so on ad infinitum.]
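To make the dominant-term idea concrete, here is a small computational sketch (my own illustration, not part of the course material): within any finite window of a sequence, the terms that are strictly greater than everything after them always form a strictly decreasing subsequence, which is exactly the observation behind the first case of the proof.

# Illustrative sketch: dominant terms within a finite window form a
# strictly decreasing subsequence (Case 1 of the proof).
def dominant_indices(a):
    """Indices k with a[k] > a[i] for every i > k."""
    idx, best = [], float('-inf')
    for k in range(len(a) - 1, -1, -1):   # scan right-to-left, tracking the tail maximum
        if a[k] > best:
            idx.append(k)
            best = a[k]
    return idx[::-1]

a = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
dom = dominant_indices(a)
vals = [a[k] for k in dom]
print(dom, vals)                                      # [5, 7, 10] [9, 6, 5]
assert all(x > y for x, y in zip(vals, vals[1:]))     # strictly decreasing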
Lecture 11

Fishing for Complements

"They go together like oil and water. Nothing in common 'cept the zero vector."
    - Mathematical idiom

Goals: We fulfill our promise and prove that the dimensions of the row space and the column space are equal. The key to this proof lies in improving the original proof of the Rank-Nullity Theorem. This will require introducing orthogonal complements and proving many of their properties. Afterwards, we use these properties to define projection maps.
11.1 The Story so Far...

Last week, we left you with a cliffhanger: the phenomenal fact that the dimensions of the row space and column space are equal:
    dim C(A) = dim R(A)
Unlike Lost, we are not going to leave you stranded. To prove this fact, we need to take another look at the Rank-Nullity Theorem.

In truth, our proof of the Rank-Nullity Theorem was too fast. If we took the time to examine our steps more carefully, we could have improved this theorem tenfold.

Consider the first step of the proof. Starting with a basis for the null space, we extended this basis to a full basis for the domain R^n:
    N_1, N_2, . . . , N_{n−q}   (null space),    x_1, x_2, . . . , x_q   (extension vectors)
At that point, we had many choices of extension vectors. In fact, we could have chosen them to have additional structure:

Math Mantra: If you have multiple choices for an object, choose one that has additional structure. Then EXPLOIT this extra structure in your proof.
But what structure should we impose on
    x_1, x_2, . . . , x_q ?
To determine this structure, we first need to ask ourselves,
    How does the row space relate to the null space?
Write out a matrix multiplication for a vector in the null space and stare at it:

    [ 1  0  1 ] [  1 ]   [ 0 ]
    [ 0  1  1 ] [  1 ] = [ 0 ]
    [ 0  0  0 ] [ −1 ]   [ 0 ]

Notice that the null space vector has dot product 0 with each of the rows!

    (1, 0, 1) · (1, 1, −1) = 0
    (0, 1, 1) · (1, 1, −1) = 0
    (0, 0, 0) · (1, 1, −1) = 0
In fact, this holds for any vector in the null space, by definition. This property is so important that we give it a name:

Definition. We say x is orthogonal to y if
    x · y = 0.
For a subspace V ⊆ R^n, we define the orthogonal complement^1 V^⊥ to be the set of all vectors that are orthogonal to every vector in V:
    V^⊥ = { x ∈ R^n | x · y = 0 for all y ∈ V }.
First, I claim

Claim: We can choose the extension vectors
    x_1, x_2, . . . , x_q
such that they are a basis for (N(A))^⊥ and
    N_1, N_2, . . . , N_{n−q}   (in N(A)),    x_1, x_2, . . . , x_q   (in (N(A))^⊥)
is still a basis for R^n.

I also claim that:

Claim: The orthogonal complement of the null space is the row space:
    (N(A))^⊥ = R(A)

[1: Careful! Orthogonal complement is not the same as set complement!]
So by our claims, we can in fact choose
    x_1, x_2, . . . , x_q
to be a basis for the row space. Since we've already proven
    Ax_1, Ax_2, . . . , Ax_q
is a basis for the column space, we instantly have
    dim C(A) = dim R(A)
Awesome!

Now that we have the game plan, let's verify these claims.
11.2 Proving the First Claim

Consider our first claim,

Claim: We can choose the extension vectors
    x_1, x_2, . . . , x_q
such that they are a basis for (N(A))^⊥ and
    N_1, N_2, . . . , N_{n−q}   (in N(A)),    x_1, x_2, . . . , x_q   (in (N(A))^⊥)
is still a basis for R^n.
Instead of proving this claim directly, it is better to abstract (you'll see why)!

Revised Claim: Given a subspace V ⊆ R^n, let
    v_1, v_2, . . . , v_q    and    w_1, w_2, . . . , w_r
be bases for V and V^⊥ respectively. Then,
    v_1, v_2, . . . , v_q, w_1, w_2, . . . , w_r
is a basis for R^n.
This is a better claim, but we still need to be careful! Our claim could be completely nonsensical: we forgot to check that V^⊥ is the right animal. Remember, for our discussion of bases to make sense, V^⊥ needs to be a subspace! Let's verify this:

Lemma. Given a subspace V ⊆ R^n, V^⊥ is also a subspace.

Proof:

Existence of Zero
For any x ∈ V,
    x · 0 = 0.
Hence, 0 ∈ V^⊥.

Closure under Addition
Let a, b ∈ V^⊥. Then for any x ∈ V,
    x · (a + b) = x · a + x · b = 0 + 0 = 0.
Thus, (a + b) ∈ V^⊥.

Closure under Scaling
Let a ∈ V^⊥ and k ∈ R. Then for any x ∈ V,
    x · (ka) = k(x · a) = k · 0 = 0,
so ka ∈ V^⊥.
To prove our revised claim, we need to prove a few fundamental facts about subspaces. First, we need to show that V and V^⊥ have only the zero vector in common:

Lemma. For any subspace V,
    V ∩ V^⊥ = {0}.

Proof: Since V and V^⊥ are vector spaces, 0 ∈ V ∩ V^⊥.

Now let x ∈ V ∩ V^⊥, i.e.,
    x ∈ V    and    x ∈ V^⊥.
Since x ∈ V,
    x · y = 0
for any y ∈ V^⊥. But x ∈ V^⊥, so choose in particular
    y = x.
Then,
    x · x = 0,
i.e.,
    ‖x‖^2 = 0.
A vector has norm zero if and only if it is the zero vector; thus,
    x = 0.
The following lemma states that the whole space, and only the whole space, has just the zero vector in its orthogonal complement.

Lemma. For a vector space V ⊆ R^n,
    V^⊥ = {0}   ⟺   V = R^n

Proof Summary:
• If V = R^n:
    - x · e_i = 0 for all i.
    - x · e_i = x_i. Thus, x = 0.
• If V^⊥ = {0}:
    - Suppose the dimension of V is less than n.
    - Taking the dot product of an arbitrary x with each of V's basis vectors yields a system with fewer equations than unknowns.
    - This system has a non-trivial solution by the Under-determined Systems Lemma.
    - This non-trivial solution is in V^⊥, contradiction.
Proof:

• We are given V = R^n. Let x ∈ V^⊥. Then for any v ∈ V,
    x · v = 0.
Choose an arbitrary standard basis vector e_i, and let v = e_i:
    x · e_i = 0.
Dotting with e_i picks out the i-th component:
    (x_1, . . . , x_{i−1}, x_i, x_{i+1}, . . . , x_n) · (0, . . . , 0, 1, 0, . . . , 0) = x_i.
This tells us that the i-th component is 0:
    x_i = 0.
Of course, since this is true for any i,
    x = 0,
and hence
    V^⊥ ⊆ {0}.
Of course, {0} ⊆ V^⊥, so we conclude
    V^⊥ = {0}.
• Now we are given V^⊥ = {0}. We need to show V = R^n.

Suppose not. Then V has dimension k < n and we can select a basis
    v_1, v_2, . . . , v_k.
Consider the dot product of each v_i with an arbitrary vector x ∈ R^n:
    v_1 · x = 0
    v_2 · x = 0
      ⋮
    v_k · x = 0
Expanding each line, this gives us a system of k equations with n unknowns. By the Under-determined Systems Lemma, this system has a non-trivial solution x_0 ≠ 0. But we can show this non-trivial solution x_0 is in V^⊥. For any v ∈ V, we can write
    v = a_1 v_1 + a_2 v_2 + . . . + a_k v_k.
Then,
    x_0 · v = x_0 · (a_1 v_1 + a_2 v_2 + . . . + a_k v_k)
            = a_1 (x_0 · v_1) + a_2 (x_0 · v_2) + . . . + a_k (x_0 · v_k)
            = 0 + 0 + . . . + 0 = 0.
Therefore x_0 ∈ V^⊥, but x_0 ≠ 0; this contradicts the fact that V^⊥ only contains the zero vector! Thus,
    V = R^n.
For the next property, define

Definition. The sum of subspaces A, B ⊆ R^n is defined as
    A + B = { x + y | x ∈ A, y ∈ B }

It is a very easy exercise to prove
    A + B = span{ u_1, u_2, . . . , u_q, w_1, w_2, . . . , w_r }
where
    u_1, u_2, . . . , u_q    and    w_1, w_2, . . . , w_r
are bases for A and B respectively. In particular,
    V + V^⊥ = span{ v_1, v_2, . . . , v_q, w_1, w_2, . . . , w_r }
where the v's are a basis for V and the w's are a basis for V^⊥. Now we can prove part of our revised claim, namely that
    span{ v_1, v_2, . . . , v_q, w_1, w_2, . . . , w_r } = R^n,
by showing:

Lemma. For any subspace V ⊆ R^n,
    V + V^⊥ = R^n
Proof Summary:
• It suffices to prove that the orthogonal complement of V + V^⊥ is the zero vector.
• Any vector x ∈ (V + V^⊥)^⊥ is orthogonal to every vector in V + V^⊥.
• V ⊆ V + V^⊥, so x is orthogonal to every vector in V, hence x ∈ V^⊥.
• Thus, x ∈ V + V^⊥.
• x is in a subspace and its complement, so x = 0.

Proof: By the previous lemma, it suffices to show that
    (V + V^⊥)^⊥ = {0}.
Of course, {0} ⊆ (V + V^⊥)^⊥, so we need only show
    (V + V^⊥)^⊥ ⊆ {0}.
Let x ∈ (V + V^⊥)^⊥. Then for every y ∈ V + V^⊥,
    x · y = 0.
In particular,
    V ⊆ V + V^⊥,
so for every w ∈ V,
    x · w = 0.
This means x ∈ V^⊥. Of course this implies
    x ∈ V + V^⊥,
since
    x = 0 (∈ V) + x (∈ V^⊥).
Now,
    x ∈ (V + V^⊥)^⊥    and    x ∈ V + V^⊥.
But we proved that only the zero vector is in both a subspace and its complement, so
    x = 0.
Theorem (First Claim). Given a subspace V ⊆ R^n, let
    v_1, v_2, . . . , v_q    and    w_1, w_2, . . . , w_r
be bases for V and V^⊥ respectively. Then,
    v_1, v_2, . . . , v_q, w_1, w_2, . . . , w_r
is a basis for R^n.
Proof Summary:
• Spanning: Already proved V + V^⊥ = R^n.
• Linear Independence:
    - Suppose not. Then we have a non-trivial combination of 0.
    - Move all the V terms to one side and the V^⊥ terms to the other.
    - The (LHS) is in V and the (RHS) is in V^⊥, so both sides must be 0, since it is the only vector in both V and V^⊥.
    - Use the basis definition to conclude the original combination was trivial.
    - Contradiction.

Proof:

Spanning: We already proved
    V + V^⊥ = R^n.

Linear Independence: Suppose
    v_1, v_2, . . . , v_q   (in V),    w_1, w_2, . . . , w_r   (in V^⊥)
is not linearly independent. Then, there is a non-trivial combination that yields 0:
    c_1 v_1 + c_2 v_2 + . . . + c_q v_q + c_{q+1} w_1 + c_{q+2} w_2 + . . . + c_{q+r} w_r = 0.
Isolate all the V vectors on one side:
    c_1 v_1 + c_2 v_2 + . . . + c_q v_q  =  −c_{q+1} w_1 − c_{q+2} w_2 − . . . − c_{q+r} w_r,
where the left hand side is in V and the right hand side is in V^⊥. This gives us a vector in V that is equal to a vector in V^⊥. But we proved there is only one vector that is both in V and V^⊥: the zero vector. Thus, we have
    c_1 v_1 + c_2 v_2 + . . . + c_q v_q = 0,
    −c_{q+1} w_1 − c_{q+2} w_2 − . . . − c_{q+r} w_r = 0.
But the v's are a basis for V and the w's are a basis for V^⊥. Therefore,
    c_1 = c_2 = . . . = c_q = 0,
    c_{q+1} = c_{q+2} = . . . = c_{q+r} = 0,
which contradicts that the combination is non-trivial.

Thus we have a basis for R^n.
Note that this immediately implies
Theorem. Given a subspace V ⊆ R^n,
    dim V + dim V^⊥ = n.
11.3 Proving the Second Claim

Now let's prove the second claim,

Claim: The orthogonal complement of the null space is the row space:
    (N(A))^⊥ = R(A)

Unfortunately, if we tried to prove this directly, we would get stuck. We can easily prove that any vector in the row space is orthogonal to every vector in the null space. But,
    How do we know that the row space is all of (N(A))^⊥ ?
Instead, we try a different approach. Consider the easier lemma:

Lemma. For any m × n matrix A,
    (R(A))^⊥ = N(A)
Proof: Let
    A_1, A_2, . . . , A_m
be the rows of A.

(⊆) Let x ∈ (R(A))^⊥. Then
    x · y = 0
for every y ∈ R(A). In particular, since the transposed rows^1 of A are in R(A):
    x · A_i^T = 0    for i = 1, 2, . . . , m.
Representing this as a matrix multiplication: the i-th entry of Ax is exactly A_i · x, so
    Ax = 0,
and we can see x ∈ N(A).

[1: On the first day of Math 51H, Professor Simon states that, instead of arrows, you should denote vectors by underline. Here, you can see why.]
(⊇) Let x ∈ N(A). We want to show
    x · r = 0
for any r ∈ R(A). Any vector r ∈ R(A) is a linear combination of the transposed row vectors:
    r = r_1 A_1^T + r_2 A_2^T + . . . + r_m A_m^T.
Again, we know
    x · A_i^T = 0    for i = 1, 2, . . . , m.
Thus,
    x · r = x · (r_1 A_1^T + r_2 A_2^T + . . . + r_m A_m^T)
          = r_1 (x · A_1^T) + r_2 (x · A_2^T) + . . . + r_m (x · A_m^T)
          = 0,
which implies x ∈ (R(A))^⊥.
Since the two sides are identical, we can complement both sides:
    ((R(A))^⊥)^⊥ = (N(A))^⊥.
Now, if we could cancel out the double complement, we'd be done:
    R(A) = (N(A))^⊥.
Thus, we just need one more orthogonal complement property: the complement of the complement is the original subspace.

Lemma.
    V = (V^⊥)^⊥
Proof Summary:
• By homework, it suffices to prove V is contained in (V^⊥)^⊥ and that they have the same dimension:
    - V ⊆ (V^⊥)^⊥: directly from the definition.
    - dim V = dim (V^⊥)^⊥:
        · Substitute V for S in dim S + dim S^⊥ = n.
        · Substitute V^⊥ for S in dim S + dim S^⊥ = n.
        · Combine the equations.

Proof:

(V ⊆ (V^⊥)^⊥) Let v ∈ V. To prove
    v ∈ (V^⊥)^⊥,
we have to show that, for any x ∈ V^⊥,
    x · v = 0.
But because v ∈ V, this follows immediately from the definition of the complement space V^⊥.

Unfortunately, the other direction is not as easy. If you tried to prove it directly, you would get stuck. Instead,

Math Mantra: If you cannot prove a theorem using the typical methods, then exploit the structure of the objects so that a different method will work.

Recall the homework exercise

Homework. For subspaces A, B: if
    A ⊆ B
and
    dim A = dim B,
then A = B.

Intuitively, this says that if one subspace is within another and both these subspaces have the same size, then they must be equal. Instead of proving the reverse inclusion (V^⊥)^⊥ ⊆ V separately, we can exploit the additional subspace structure of our sets. Therefore, it suffices to prove that the dimensions of V and (V^⊥)^⊥ are equal:
    dim V = dim (V^⊥)^⊥
From the first claim, we concluded^1
    dim S + dim S^⊥ = n
for any subspace S. Plugging in V for S gives:
    dim V + dim V^⊥ = n.
Moreover, plugging in V^⊥ for S yields:
    dim V^⊥ + dim (V^⊥)^⊥ = n.
Equating, we get
    dim V = dim (V^⊥)^⊥.
The second claim is now easy to prove:

Theorem (Second Claim).
    (N(A))^⊥ = R(A).

Proof: We already proved
    (R(A))^⊥ = N(A).
Complement both sides:
    ((R(A))^⊥)^⊥ = (N(A))^⊥
and cancel out the double complement:
    R(A) = (N(A))^⊥.
11.4 A Happy Ending

As mentioned in the introduction, we could easily use our proven claims to rewrite the proof of the Rank-Nullity Theorem so that it encodes
    dim C(A) = dim R(A).
As a result of our hard work, however, there is no need. Instead, we can keep our original proof of the Rank-Nullity Theorem and apply our claims:

Theorem. For any m × n matrix A,
    dim C(A) = dim R(A)

[1: Notice, we use a new dummy variable S since V is already in use.]
Proof: We proved that for any subspace V ⊆ R^n,
    dim V + dim V^⊥ = n,
so in particular (taking V = N(A)),
    dim N(A) + dim (N(A))^⊥ = n.
Using the fact
    R(A) = (N(A))^⊥,
our equality becomes
    dim N(A) + dim R(A) = n.
Since the original Rank-Nullity Theorem gives us
    dim N(A) + dim C(A) = n,
we have
    dim N(A) + dim R(A) = dim N(A) + dim C(A).
Hence,
    dim C(A) = dim R(A).
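As a quick numerical sanity check (a sketch of mine, not part of the text): dim C(A) = dim R(A) says that a matrix and its transpose always have the same rank, which numpy's matrix_rank makes easy to test.

import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    A = rng.integers(-3, 4, size=(4, 6)).astype(float)
    A[:, 2] = A[:, 0] + A[:, 1]          # force a dependent column, so A is rank-deficient
    # dim C(A) is the rank of A; dim R(A) is the rank of A^T
    assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(A.T)
print("dim C(A) = dim R(A) held on all test matrices")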
11.5 Something Extra for our Troubles: Orthogonal Projection Map

We didn't need to talk generally about V and V^⊥ for any subspace V. Instead, we could have just stuck with R(A) and C(A). But,

Math Mantra: By taking the extra time to abstract, we are rewarded.

Particularly, we now have the orthogonal complement properties:

Orthogonal Complement Properties
• A subspace and its complement only share the zero vector:
    V ∩ V^⊥ = {0}
• The sum of a subspace and its complement is the entire space:
    V + V^⊥ = R^n
• The dimension of a subspace and its complement sum to the dimension of the full space:
    dim V + dim V^⊥ = n
• The complement of the complement is the original subspace:
    (V^⊥)^⊥ = V
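Here is a short numerical sketch (my own illustration, not from the text) of these properties and of the second claim (N(A))^⊥ = R(A): using the singular value decomposition, an orthonormal basis for R(A) and one for N(A) are mutually orthogonal, their dimensions add up to n, and together they span R^n.

import numpy as np

rng = np.random.default_rng(1)
A = rng.integers(-2, 3, size=(3, 5)).astype(float)

# In the full SVD, the rows of Vt paired with nonzero singular values span R(A);
# the remaining rows span N(A).
U, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))
row_basis  = Vt[:rank]         # orthonormal basis of R(A)
null_basis = Vt[rank:]         # orthonormal basis of N(A)

n = A.shape[1]
assert len(row_basis) + len(null_basis) == n              # dim R(A) + dim N(A) = n
assert np.allclose(row_basis @ null_basis.T, 0)           # R(A) is orthogonal to N(A)
assert np.linalg.matrix_rank(np.vstack([row_basis, null_basis])) == n   # together: all of R^n
print("(N(A))^perp = R(A) verified numerically for this A")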
Now that we have these neat properties, let's try to use them!

There were several times in high school when you needed to break a vector into orthogonal components. For example, in Algebra II you took a vector in the Cartesian Plane and divided it into a component along the x-axis and a component along the y-axis:
    (3, 4) = (3, 0) + (0, 4).
Another example is in Physics, when you studied force diagrams. For a sliding block, you broke the force of gravity mg into two orthogonal components: one component normal to the surface, and the other component parallel to the surface.

Today, we generalize these results! Namely, we proved that for any subspace V ⊆ R^n, we can break any vector x ∈ R^n into a sum of two component vectors, one in the subspace and the other in the orthogonal complement:
    x = v + w,    v ∈ V,  w ∈ V^⊥.
In fact, this decomposition is unique:
Theorem. For any subspace V ⊆ R^n and any vector x ∈ R^n, if
    x = v_1 + w_1    for some v_1 ∈ V and w_1 ∈ V^⊥
and
    x = v_2 + w_2    for some v_2 ∈ V and w_2 ∈ V^⊥,
then
    v_1 = v_2    and    w_1 = w_2.

Proof: Again, we will use the trick:
    If we have a vector in both V and V^⊥, then that vector is 0.
Equating
    v_1 + w_1 = v_2 + w_2,
rearrange to get
    v_1 − v_2 = w_2 − w_1.
But subspaces are closed under vector subtraction; therefore, the left hand side is in V whereas the right hand side is in V^⊥. This implies
    v_1 − v_2 ∈ V    and    v_1 − v_2 ∈ V^⊥.
So by our trick:
    v_1 − v_2 = 0.
Hence,
    v_1 = v_2.
Likewise,
    w_2 − w_1 ∈ V    and    w_2 − w_1 ∈ V^⊥
gives us
    w_1 = w_2.
This unique decomposition allows us to define a function f by
    f(x) = v,
where v is the component vector in V.

Particularly, in our Algebra II example,
    f(3, 4) = (3, 0)
for
    V = span{e_1}.

What do we know about the function f? Obviously, its image is in V:
    f(x) ∈ V.
We also know that
    x − v = w,
so the difference of an input and its output value under f is the orthogonal component:
    x − f(x) ∈ V^⊥.
In fact, I claim that these two properties alone uniquely define the function f:

Theorem. For any subspace V ⊆ R^n, there is only one function f : R^n → V that satisfies
    f(x) ∈ V
and
    x − f(x) ∈ V^⊥
for every x ∈ R^n.
Proof Summary:
• Consider two such maps f_1, f_2.
• Show the difference f_1(x) − f_2(x) is in V.
• Show the difference f_1(x) − f_2(x) is in V^⊥.
• Conclude f_1(x) − f_2(x) = 0.

Proof: Suppose there are two functions f_1 and f_2 such that for any x ∈ R^n,
    f_1(x) ∈ V,    x − f_1(x) ∈ V^⊥,
    f_2(x) ∈ V,    x − f_2(x) ∈ V^⊥.
To prove two functions are equal, we need to show that for any input, both functions have the same output. Let x ∈ R^n be an arbitrary input. If we can show
    f_1(x) − f_2(x) ∈ V    and    f_1(x) − f_2(x) ∈ V^⊥,
then we know
    f_1(x) − f_2(x) = 0,
since only 0 lies in both V and V^⊥. Hence,
    f_1(x) = f_2(x).
f_1(x) − f_2(x) ∈ V
Each term is in V. Since V is closed under vector subtraction,
    f_1(x) − f_2(x) ∈ V.

f_1(x) − f_2(x) ∈ V^⊥
Add zero to rewrite the expression:
    f_1(x) − f_2(x) = f_1(x) + (−x + x) − f_2(x)
                    = −( x − f_1(x) ) + ( x − f_2(x) ).
Each summand on the right is in V^⊥ (the first is a scalar multiple of x − f_1(x)). Since V^⊥ is closed under vector addition and scaling,
    f_1(x) − f_2(x) ∈ V^⊥.
Since these two properties define a unique function, we can give this function a name:

Definition. For any subspace V ⊆ R^n, the projection map P_V is the unique function that satisfies
    P_V(x) ∈ V
and
    x − P_V(x) ∈ V^⊥
for every x ∈ R^n.
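For intuition, here is a small numerical sketch (an illustration of mine, not a construction from the text): if the columns of a matrix B form a basis for V, then the map x ↦ B(BᵀB)⁻¹Bᵀx satisfies both defining properties above, so by the uniqueness theorem it must be P_V.

import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 2))          # columns of B: a basis for a 2-dimensional V inside R^5
P = B @ np.linalg.inv(B.T @ B) @ B.T     # candidate projection matrix onto V = C(B)

x = rng.standard_normal(5)
v = P @ x                                # claimed P_V(x)

# Property 1: P_V(x) lies in V (it is a combination of the columns of B)
coeffs, *_ = np.linalg.lstsq(B, v, rcond=None)
assert np.allclose(B @ coeffs, v)
# Property 2: x - P_V(x) is orthogonal to every basis vector of V
assert np.allclose(B.T @ (x - v), 0)
print("both defining properties of P_V hold")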
11.6 Orthogonal Projection Properties

Like 007's car, P_V has some sweet features. And they will all be derived with the same trick:

Math Mantra: To prove that two quantities are equal, it suffices to prove that their difference is zero.

In the universe of projection maps, we have two ways to create zero:
• To show a vector is 0, show that it lies in both V and V^⊥.
• To show a dot product is 0, show that one term is in V and the other is in V^⊥.

First we prove that P_V satisfies our favorite property:
Theorem. For any subspace V ⊆ R^n, P_V is a linear function.

Proof Summary:
• Additive:
    - Show P_V(x + y) − P_V(x) − P_V(y) ∈ V.
    - Show P_V(x + y) − P_V(x) − P_V(y) ∈ V^⊥.
    - Conclude P_V(x + y) − P_V(x) − P_V(y) = 0.
• Scaling:
    - Show P_V(kx) − kP_V(x) ∈ V.
    - Show P_V(kx) − kP_V(x) ∈ V^⊥.
    - Conclude P_V(kx) − kP_V(x) = 0.
Proof:

Additive
We want to show that for any x, y ∈ R^n,
    P_V(x + y) = P_V(x) + P_V(y).
This is equivalent to proving
    P_V(x + y) − P_V(x) − P_V(y) = 0.
Immediately,
    P_V(x + y) − P_V(x) − P_V(y) ∈ V,
since the output of P_V is always in V, and V is closed under vector addition and scaling.

To show this expression is also in V^⊥, all we need to do is add and subtract the input vector:
    P_V(x + y) − P_V(x) − P_V(y)
        = P_V(x + y) + ( −(x + y) + (x + y) ) − P_V(x) − P_V(y)
        = −( (x + y) − P_V(x + y) ) + ( x − P_V(x) ) + ( y − P_V(y) ).
Each of the three terms on the right is in V^⊥. Since V^⊥ is closed under vector addition and scaling,
    P_V(x + y) − P_V(x) − P_V(y) ∈ V^⊥.
Because this vector is in both V and V^⊥,
    P_V(x + y) − P_V(x) − P_V(y) = 0.
Scaling
We want to show that for any k ∈ R and x ∈ R^n,
    P_V(kx) = kP_V(x),
or equivalently,
    P_V(kx) − kP_V(x) = 0.
By closure,
    P_V(kx) − kP_V(x) ∈ V.
Again add and subtract the input vector:
    P_V(kx) − kP_V(x) = P_V(kx) − kx + kx − kP_V(x)
                      = −( kx − P_V(kx) ) + k( x − P_V(x) ).
Each term on the right is in V^⊥, so
    P_V(kx) − kP_V(x) ∈ V^⊥.
Since this vector is in both V and V^⊥, we conclude
    P_V(kx) − kP_V(x) = 0.
The second cool property is that a projection map can be swapped across a dot product:

Theorem. For any subspace V ⊆ R^n and any x, y ∈ R^n,
    x · P_V(y) = P_V(x) · y

Proof Summary:
• Add zero to x · P_V(y) − P_V(x) · y.
• Rewrite as a sum of dot products.
• Show each dot product is between an element of V and an element of V^⊥.
• Conclude the sum is 0.
Proof: To show
    x · P_V(y) − P_V(x) · y = 0,
just add 0:
    x · P_V(y) − P_V(x) · y
        = x · P_V(y) − P_V(x) · P_V(y) + P_V(x) · P_V(y) − P_V(x) · y
        = ( x − P_V(x) ) · P_V(y)  −  P_V(x) · ( y − P_V(y) ).
But
    ( x − P_V(x) ) · P_V(y) = 0    and    P_V(x) · ( y − P_V(y) ) = 0,
since each dot product pairs an element of V^⊥ with an element of V, so the whole expression is 0 − 0 = 0. In conclusion,
    x · P_V(y) − P_V(x) · y = 0.
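A quick numerical check of this symmetry (mine, not the text's), reusing the projection-matrix idea from the earlier sketch: a projection matrix built from a basis of V is symmetric, which is exactly the statement x · P_V(y) = P_V(x) · y.

import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 3))           # columns: a basis for V inside R^6
P = B @ np.linalg.inv(B.T @ B) @ B.T      # projection matrix onto V

x, y = rng.standard_normal(6), rng.standard_normal(6)
assert np.isclose(x @ (P @ y), (P @ x) @ y)   # x . P_V(y) = P_V(x) . y
assert np.allclose(P, P.T)                    # equivalently, P is symmetric
print("x . P_V(y) = P_V(x) . y")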
Lastly, we prove a minimal distance property. Namely, the projected vector is the element of V closest to the original vector.

Theorem. For any x ∈ R^n and v ∈ V,
    ‖x − P_V(x)‖ ≤ ‖x − v‖.
Moreover,
    ‖x − P_V(x)‖ = ‖x − v‖   ⟺   v = P_V(x).

Proof Summary:
• Consider ‖x − v‖^2.
• Add P_V(x) − P_V(x) inside the norm and expand.
• Cancel dot products between elements of V and V^⊥.
• The remaining equality implies all the results.
Proof: First, we derive a useful inequality by considering the square
    ‖x − v‖^2.
First, include P_V(x) terms by adding zero:
    ‖x − v‖^2 = ‖ x − P_V(x) + P_V(x) − v ‖^2.
Expanding as a dot product, with a = x − P_V(x) and b = P_V(x) − v, we get
    ‖a + b‖^2 = (a + b) · (a + b) = ‖a‖^2 + 2 a · b + ‖b‖^2
              = ‖x − P_V(x)‖^2 + 2 ( x − P_V(x) ) · ( P_V(x) − v ) + ‖P_V(x) − v‖^2.
Since the inner term is zero,
    2 ( x − P_V(x) ) · ( P_V(x) − v ) = 0
(the first factor is in V^⊥, and the second, a difference of vectors in V, is in V), we have
    ‖x − v‖^2 = ‖x − P_V(x)‖^2 + ‖P_V(x) − v‖^2.    (*)

‖x − P_V(x)‖ ≤ ‖x − v‖
The right side of (*) is a sum of non-negative terms; therefore, the total sum must bound either of its parts. In particular,
    ‖x − v‖^2 ≥ ‖x − P_V(x)‖^2.
Since square roots preserve inequalities,
    ‖x − v‖ ≥ ‖x − P_V(x)‖.

‖x − P_V(x)‖ = ‖x − v‖  ⟺  v = P_V(x)
Suppose
    ‖x − v‖ = ‖x − P_V(x)‖.
Then our equality (*) becomes
    ‖x − P_V(x)‖^2 = ‖x − P_V(x)‖^2 + ‖P_V(x) − v‖^2.
Therefore,
    ‖P_V(x) − v‖^2 = 0,
which only happens when
    P_V(x) = v.
Conversely, if P_V(x) = v, then
    ‖x − P_V(x)‖ = ‖x − v‖
immediately.
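A short numerical illustration of the minimal distance property (again a sketch of mine): among random vectors v in V, none gets closer to x than the projection P_V(x).

import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 2))           # columns: a basis for V in R^5
P = B @ np.linalg.inv(B.T @ B) @ B.T      # projection matrix onto V
x = rng.standard_normal(5)

best = np.linalg.norm(x - P @ x)          # ||x - P_V(x)||
for _ in range(1000):
    v = B @ rng.standard_normal(2)        # a random element of V
    assert np.linalg.norm(x - v) >= best - 1e-12
print("P_V(x) was the closest point of V to x on 1000 random trials")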
New Notation

Symbol: V^⊥
Reading: The orthogonal complement of V, or "V perp."
Example: V^⊥ = {0}
Example Translation: The orthogonal complement of V is {0}.

Symbol: P_V(x)
Reading: The projection of x onto V.
Example: P_V(x + y) = P_V(x) + P_V(y)
Example Translation: The projection of x + y onto V is the sum of the projection of x onto V and the projection of y onto V.
Lecture 12

A Game of Cat and Gauss

"Sometimes the truth isn't good enough, sometimes people deserve more.
Sometimes people deserve to have their faith rewarded..."
    - Batman, The Dark Knight

Goals: Using Gaussian Elimination, we construct explicit formulas for the bases of the null space and the column space. Not only is this useful in practice, but this construction also gives us an alternate proof of the Rank-Nullity Theorem. Finally, we end this unit by explaining how to solve inhomogeneous systems of equations.
12.1 A Little Constructive Criticism

By now, we have cited the basis theorem a gazillion times. We used it to assert the existence of bases for the null space and the column space, to derive the Rank-Nullity Theorem, and to prove that the dimensions of the column space and the row space are equal. That's great. But, in case you haven't noticed:
    We never told you how to explicitly calculate these bases.
It's not enough to simply assert that the bases exist. In the real world, we need to be able to explicitly find the bases for the null space and the column space. How else are our Kindles, iPads, and microwaves going to work?

We deserve more. We deserve to have our faith in proofs rewarded. Therefore, we are going to give a rigorous method to construct the bases. And as an added bonus, we will get an alternate proof of the Rank-Nullity Theorem. So everyone (mathematicians, engineers, and even those crazy mathematical constructivists) will be happy.
12.2 Gaussian Elimination

Recall, from Lecture 4, that we can transform a homogeneous system of equations

    a_11 x_1 + a_12 x_2 + a_13 x_3 + . . . + a_1n x_n = 0
    a_21 x_1 + a_22 x_2 + a_23 x_3 + . . . + a_2n x_n = 0
       ⋮          ⋮          ⋮                   ⋮
    a_m1 x_1 + a_m2 x_2 + a_m3 x_3 + . . . + a_mn x_n = 0

into one of two possible forms:

    0x_1 + a′_12 x_2 + . . . + a′_1n x_n = 0
    0x_1 + a′_22 x_2 + . . . + a′_2n x_n = 0
       ⋮          ⋮                  ⋮
    0x_1 + a′_m2 x_2 + . . . + a′_mn x_n = 0

or

    1x_1 + a′_12 x_2 + . . . + a′_1n x_n = 0
    0x_1 + a′_22 x_2 + . . . + a′_2n x_n = 0
       ⋮          ⋮                  ⋮
    0x_1 + a′_m2 x_2 + . . . + a′_mn x_n = 0
Using matrices, we can represent this reduction in a more condensed form. The matrix

    [ a_11  a_12  a_13  . . .  a_1n ]
    [ a_21  a_22  a_23  . . .  a_2n ]
    [   ⋮     ⋮     ⋮             ⋮ ]
    [ a_m1  a_m2  a_m3  . . .  a_mn ]

is reduced to one of two matrices:

    [ 0  a′_12  a′_13  . . .  a′_1n ]        [ 1  a′_12  a′_13  . . .  a′_1n ]
    [ 0  a′_22  a′_23  . . .  a′_2n ]   or   [ 0  a′_22  a′_23  . . .  a′_2n ]
    [ ⋮     ⋮      ⋮             ⋮  ]        [ ⋮     ⋮      ⋮             ⋮  ]
    [ 0  a′_m2  a′_m3  . . .  a′_mn ]        [ 0  a′_m2  a′_m3  . . .  a′_mn ]

For ease, let's call this maneuver (C1), for "first column reduction."

We will use (C1) to define Gaussian Elimination on matrices. Even though our algorithm may look unintuitive, just remember:
• The reductions are the same operations used in solving a homogeneous system of equations.
• Matrices are just a neat shorthand to represent these operations.
Moreover, Gaussian Elimination is a systematic procedure. And through this matrix shorthand, we will discover cool patterns.

Gaussian Elimination is composed of two phases.
• Phase I repeatedly applies (C1) and records pivot column indices. At the end of this phase, we say that our matrix A is in row echelon form.
• Phase II uses the pivot column indices acquired in Phase I to completely clean the system. We say the final matrix is in reduced row echelon form.
We define Phase I inductively:

Gaussian Elimination: Phase I

Consider the m × n matrix

    [ a_11  a_12  a_13  . . .  a_1n ]
    [ a_21  a_22  a_23  . . .  a_2n ]
    [   ⋮     ⋮     ⋮             ⋮ ]
    [ a_m1  a_m2  a_m3  . . .  a_mn ]

Step 1
Perform (C1) on this matrix to get one of two cases:

Case 1: The matrix reduces to

    [ 0  a′_12  a′_13  . . .  a′_1n ]
    [ 0  a′_22  a′_23  . . .  a′_2n ]
    [ ⋮     ⋮      ⋮             ⋮  ]
    [ 0  a′_m2  a′_m3  . . .  a′_mn ]

Then define S_2 to be the sub-matrix that ignores the first column (all m rows, columns 2 through n).

Case 2: The matrix reduces to

    [ 1  a′_12  a′_13  . . .  a′_1n ]
    [ 0  a′_22  a′_23  . . .  a′_2n ]
    [ ⋮     ⋮      ⋮             ⋮  ]
    [ 0  a′_m2  a′_m3  . . .  a′_mn ]

Record, as the first pivot column index,
    P_1 = 1.
Define S_2 to be the sub-matrix that ignores the first row and the first column (rows 2 through m, columns 2 through n).
Step ℓ (for ℓ ≥ 2)
Let

    [ b_11  b_12  . . .  b_1n ]
    [ b_21  b_22  . . .  b_2n ]
    [   ⋮     ⋮            ⋮  ]
    [ b_m1  b_m2  . . .  b_mn ]

be the full transformed matrix acquired after performing the previous steps, and focus on the current sub-matrix S_ℓ, whose upper-left entry sits in row i and column ℓ of the full matrix:

    S_ℓ =  [ b_iℓ        b_i(ℓ+1)        . . .  b_in      ]
           [ b_(i+1)ℓ    b_(i+1)(ℓ+1)    . . .  b_(i+1)n  ]
           [    ⋮             ⋮                     ⋮      ]
           [ b_mℓ        b_m(ℓ+1)        . . .  b_mn      ]

Leaving the rows and columns of the larger matrix outside S_ℓ unchanged, perform (C1) on the sub-matrix S_ℓ. Again, we have two cases:

Case 1: Sub-matrix S_ℓ reduces to

    [ 0  b′_i(ℓ+1)      . . .  b′_in     ]
    [ 0  b′_(i+1)(ℓ+1)  . . .  b′_(i+1)n ]
    [ ⋮        ⋮                   ⋮     ]
    [ 0  b′_m(ℓ+1)      . . .  b′_mn     ]

Then define S_{ℓ+1} to be the sub-matrix that ignores the first column of S_ℓ.

Case 2: Sub-matrix S_ℓ reduces to

    [ 1  b′_i(ℓ+1)      . . .  b′_in     ]
    [ 0  b′_(i+1)(ℓ+1)  . . .  b′_(i+1)n ]
    [ ⋮        ⋮                   ⋮     ]
    [ 0  b′_m(ℓ+1)      . . .  b′_mn     ]

Given that the previous steps produced pivot column indices
    P_1, P_2, . . . , P_k,
record the next pivot column index to be the current column number ℓ:
    P_{k+1} = ℓ.
Define S_{ℓ+1} to be the sub-matrix that ignores the first row and the first column of S_ℓ.

Phase I is finished once the sub-matrix S_ℓ is empty.
Upon completing Phase I, the resulting matrix is in row echelon form. Moreover, we have a list of pivot column indices:
    P_1, P_2, . . . , P_Q.
The columns with these indices are called pivot columns. Out of convention, we label (in order) the indices of the remaining n − Q columns
    N_1, N_2, . . . , N_{n−Q}
and call the corresponding columns non-pivot columns. All together, every column of our matrix is labelled either pivot or non-pivot.^1

Moreover, each pivot column of the row echelon form is of the form
    ( *, . . . , *, 1, 0, . . . , 0 )^T,    with the 1 in the i-th row,
where the *'s are numbers that need not be zero. This can be rigorously proven by induction. Although we omit the proof, this fact is simply a consequence of Case 2. To see this, let's write out the first few steps of Phase I. We start with the full matrix of unspecified entries as the sub-matrix.

[1: Be careful when interpreting the schematic! The non-pivot and pivot columns need not be in alternating order; the labelling merely emphasizes that every column is tagged one way or the other.]
Columns whose current leading entries are all zero are successively stripped from the sub-matrix (Case 1) until we hit Case 2. At that point we select the first pivot column: it is the first column of a sub-matrix that has had no rows removed, so after (C1) it has a 1 in the first row and 0's below:

    P_1-th column
    [ 0 . . . 0  1  *  . . .  * ]
    [ 0 . . . 0  0  *  . . .  * ]
    [ ⋮       ⋮  ⋮  ⋮         ⋮ ]
    [ 0 . . . 0  0  *  . . .  * ]

Then the first row and first column of the sub-matrix are removed to form a new sub-matrix, and the process repeats: zero columns are stripped (Case 1) until the second pivot column is selected (Case 2), which therefore has a 1 in the second row and 0's below it; then the third pivot column gets a 1 in the third row and 0's below; and so on:

     P_1           P_2      P_3
    [ 1  *  . . .   *   *    *  . . . ]
    [ 0  0  . . .   1   *    *  . . . ]
    [ 0  0  . . .   0   0    1  . . . ]
    [ ⋮  ⋮          ⋮   ⋮    ⋮        ]
    [ 0  0  . . .   0   0    0  . . . ]

If you do not fully understand Phase I, go back and read it again. Once you fully understand it, you can continue to Phase II.
The second phase simply kills the values above the 1 in each pivot column:

Gaussian Elimination: Phase II

After performing Phase I, we are left with a matrix in row echelon form where the pivot columns are exactly columns P_1, P_2, . . . , P_Q. For each pivot column P_i, which has a 1 in its i-th row, subtract the right multiple of the i-th row from each row above the i-th so that every entry of column P_i above the 1 becomes 0:

    before:  column P_i = ( *, . . . , *, 1, 0, . . . , 0 )^T
    after:   column P_i = ( 0, . . . , 0, 1, 0, . . . , 0 )^T

Formally,

Definition. After performing Gaussian Elimination on a matrix A, the resulting matrix is denoted
    rref(A)
and is called the reduced row echelon form of A.
Example. Perform Gaussian Elimination on the matrix

    A = [ 2   4  2   6 ]
        [ 2   4  2   6 ]
        [ 2   4  4  10 ]
        [ 2   4  0   2 ]

Phase I

Step 1. Apply (C1): divide the first row by 2 and clear the rest of the first column to get

    [ 1  2   1   3 ]
    [ 0  0   0   0 ]
    [ 0  0   2   4 ]
    [ 0  0  −2  −4 ]

Since we are in Case 2, set
    P_1 = 1
and define the sub-matrix S_2 by ignoring the first row and first column.

Step 2. Apply (C1) to the sub-matrix S_2. Its first column (the second column of the full matrix) is already all zeros, so nothing changes. Since we are in Case 1, define the sub-matrix S_3 by ignoring the first column of S_2.

Step 3. Apply (C1) to the sub-matrix S_3 to get

    [ 1  2  1  3 ]
    [ 0  0  1  2 ]
    [ 0  0  0  0 ]
    [ 0  0  0  0 ]

Since we are in Case 2, set
    P_2 = 3
and define the sub-matrix S_4 by ignoring the first row and first column of S_3.

Step 4. Apply (C1) to the first column of the sub-matrix S_4. Since we are in Case 1, define the sub-matrix S_5 by ignoring the first column of S_4. Since S_5 is empty, we are done with Phase I.

Phase II

From Phase I, we have pivot column indices
    P_1 = 1,    P_2 = 3
and row echelon form

    [ 1  2  1  3 ]
    [ 0  0  1  2 ]
    [ 0  0  0  0 ]
    [ 0  0  0  0 ]

Kill the entry above the 1 in pivot column P_2 by subtracting the second row from the first:

    rref(A) = [ 1  2  0  1 ]
              [ 0  0  1  2 ]
              [ 0  0  0  0 ]
              [ 0  0  0  0 ]
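Since Phase I and Phase II are completely mechanical, they are easy to code. Below is a compact Python sketch of mine (not course code) of the procedure described above: it applies the (C1) maneuver column by column and records the pivot column indices; for brevity it clears entries above and below each pivot in a single pass, which produces the same rref(A) as the two-phase description.

import numpy as np

def rref(A, tol=1e-12):
    """Return (R, pivots): the reduced row echelon form of A and its pivot column indices."""
    R = A.astype(float).copy()
    m, n = R.shape
    pivots = []
    row = 0                                   # first row of the current sub-matrix
    for col in range(n):                      # first column of the current sub-matrix
        if row >= m:
            break
        p = row + np.argmax(np.abs(R[row:, col]))
        if abs(R[p, col]) < tol:              # Case 1: column of zeros, move on
            continue
        R[[row, p]] = R[[p, row]]             # bring a nonzero entry to the top of the sub-matrix
        R[row] = R[row] / R[row, col]         # make the leading entry 1
        for r in range(m):                    # clear the rest of the column
            if r != row:                      # (below = Phase I, above = Phase II)
                R[r] -= R[r, col] * R[row]
        pivots.append(col)                    # Case 2: record pivot, shrink sub-matrix
        row += 1
    return R, pivots

A = np.array([[2, 4, 2, 6],
              [2, 4, 2, 6],
              [2, 4, 4, 10],
              [2, 4, 0, 2]])
R, pivots = rref(A)
print(R)         # matches the rref(A) computed by hand above
print(pivots)    # [0, 2], i.e. columns 1 and 3 in the text's 1-based numbering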
One question you should be asking yourself is,
    What is the point of Gaussian Elimination?
Notice that this procedure reflects the solution-preserving operations used to solve a homogeneous system of equations. This means that the solution set of
    Ax = 0
is the same as the solution set of
    rref(A) x = 0,
i.e.,
    N(A) = N(rref(A)).
By studying the structure of rref(A), we will derive an explicit formula for a basis of N(A). Generally,

Math Mantra: By reducing an object to some basic canonical form, we can exploit its structural properties.

But,
    What is so special about reduced row echelon form?
Matrices in reduced row echelon form have an incredible structural property:^1
    The i-th pivot column of rref(A) is the i-th standard basis vector!
Precisely,
    b_{P_i} = e_i
where b_k is the k-th column of rref(A).

In our example, the first and second pivot columns are e_1 and e_2 respectively:

    [ 1  2  0  1 ]
    [ 0  0  1  2 ]
    [ 0  0  0  0 ]
    [ 0  0  0  0 ]

[1: This can be proven rigorously with induction. If you have doubts, complete it during Thanksgiving break.]
As for non-examples, the following matrices are not in reduced row echelon form:

    [ 1 0 0 0 0 ]        [ 1 0 0 0 0 ]
    [ 0 0 0 0 0 ]        [ 0 1 0 0 0 ]
    [ 0 0 1 0 0 ]   ,    [ 0 0 0 0 1 ]
    [ 0 0 0 1 0 ]        [ 0 0 0 1 0 ]
    [ 0 0 0 0 0 ]        [ 0 0 0 0 0 ]

The left matrix fails because it is missing e_2, whereas the right matrix fails because the standard basis vectors are out of order.
12.3 An Enlightening Example

The explicit formula for the^1 null space basis looks very intimidating. But, to quote Professor Simon,
    It's just book-keeping!
In order to make the formula and its derivation precise, we need heavy notation. Particularly, we need to use double subscripts to distinguish a pivot column from a non-pivot column. But the proof is really just simple algebra in disguise!

In order to motivate the proof, we are going to do a specific example.^2 Generally,

Math Mantra: We can find inspiration for theorems in examples.

Then, we will prove the theorem in the special case when the pivots are actually the first Q columns. This is because of another mantra,

Math Mantra: If you cannot solve the harder problem, make a simplifying assumption and try to solve the easier version.^3

Finally, when you are ready to attack the notation, we will prove the general statement.

[1: We will often say "the" to describe the basis that we will construct. Of course, the null space has many bases. But this method finds one particular special basis.]
[2: For the midterm, you shouldn't even memorize the explicit null space basis formula. You should just understand how to find a null space basis in any particular problem by following the steps in our example.]
[3: Tons of special cases of Fermat's Last Theorem were verified long before Andrew Wiles came along.]
Example. Find a basis for the null space of the matrix

    A = [ 1   6  1  0  15 ]
        [ 0   0  1  1  17 ]
        [ 0   0  0  2  18 ]
        [ 2  12  0  0  14 ]

Use Phase I of Gaussian Elimination to reduce A to row echelon form and find the pivot indices:
    P_1 = 1,    P_2 = 3,    P_3 = 4.
It follows that the non-pivot indices are:
    N_1 = 2,    N_2 = 5.
Then use Phase II to reduce A to reduced row echelon form:

    rref(A) = [ 1  6  0  0  7 ]
              [ 0  0  1  0  8 ]
              [ 0  0  0  1  9 ]
              [ 0  0  0  0  0 ]

Expand
    rref(A) x = 0
as a system of equations:
    x_1 + 6x_2 + 7x_5 = 0
           x_3 + 8x_5 = 0
           x_4 + 9x_5 = 0
Isolate:
    x_1 = −6x_2 − 7x_5
    x_3 = −8x_5
    x_4 = −9x_5
This means that for x ∈ N(A),

    x = ( x_1, x_2, x_3, x_4, x_5 )^T
      = ( −6x_2 − 7x_5,  x_2,  −8x_5,  −9x_5,  x_5 )^T
      = x_2 ( −6, 1, 0, 0, 0 )^T  +  x_5 ( −7, 0, −8, −9, 1 )^T.

Since x_2, x_5 can take on any values, the null space is

    N(A) = span{ ( −6, 1, 0, 0, 0 )^T,  ( −7, 0, −8, −9, 1 )^T }.

In more condensed notation,
    N(A) = span{ e_2 − 6e_1,  e_5 − 7e_1 − 8e_3 − 9e_4 }.
Stare at the indices of the standard basis vectors:
    e_2 − 6e_1,    e_5 − 7e_1 − 8e_3 − 9e_4
and notice that these are exactly the non-pivot and pivot indices:
    e_{N_1} − 6e_{P_1},    e_{N_2} − 7e_{P_1} − 8e_{P_2} − 9e_{P_3}.
Moreover, the scaling coefficients are just the entries of the reduced row echelon form: the coefficient 6 of the first basis vector is the entry of column N_1 = 2 of rref(A) in the row of pivot P_1, and the coefficients 7, 8, 9 of the second basis vector are the entries of column N_2 = 5 in the rows of pivots P_1, P_2, P_3:

    rref(A) = [ 1  6  0  0  7 ]
              [ 0  0  1  0  8 ]
              [ 0  0  0  1  9 ]
              [ 0  0  0  0  0 ]
Generally, to construct each vector in the null space basis, we
• take a standard basis vector with non-pivot index
    e_{N_j}
and look at the corresponding non-pivot column b_{N_j} of rref(A), with entries
    b_{1N_j}, b_{2N_j}, . . . , b_{mN_j};
• for each of the pivot columns
    P_1, P_2, . . . , P_Q,
subtract the standard basis vector with index P_i scaled by the i-th entry of column b_{N_j}:
    e_{N_j} − b_{1N_j} e_{P_1} − b_{2N_j} e_{P_2} − . . . − b_{QN_j} e_{P_Q}.
This is an awesome result, but proving it is simple: it is just basic algebra in disguise. In fact, the proof is merely a formalization of our enlightening example.
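The book-keeping translates directly into code. The sketch below is mine (it assumes sympy is available for rref, rather than reusing the earlier helper): it assembles the null space basis vectors e_{N_j} − Σ_i b_{iN_j} e_{P_i} for the matrix of the enlightening example and checks that A sends each of them to 0.

import numpy as np
from sympy import Matrix

A = np.array([[1, 6, 1, 0, 15],
              [0, 0, 1, 1, 17],
              [0, 0, 0, 2, 18],
              [2, 12, 0, 0, 14]])

R, pivots = Matrix(A).rref()                 # rref(A) and its pivot columns (0-based)
B = np.array(R, dtype=float)
n = A.shape[1]
nonpivots = [j for j in range(n) if j not in pivots]

basis = []
for Nj in nonpivots:                         # one basis vector per non-pivot column
    v = np.zeros(n)
    v[Nj] = 1.0                              # start from e_{N_j} ...
    for i, Pi in enumerate(pivots):          # ... subtract b_{i,N_j} e_{P_i}
        v[Pi] -= B[i, Nj]
    basis.append(v)

for v in basis:
    assert np.allclose(A @ v, 0)             # every constructed vector lies in N(A)
print(np.array(basis))   # rows (-6,1,0,0,0) and (-7,0,-8,-9,1), matching the example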
12.4 An Easier Theorem

Before we derive the full-blown explicit null space basis formula, let's do an easier case. This way, you can understand the general idea of the proof before juggling double subscript notation.

Theorem. Let A be an m × n matrix and B = rref(A). Moreover, assume
    1, 2, . . . , Q
are the indices of the pivot columns and
    Q + 1, Q + 2, . . . , n
are the indices of the non-pivot columns. Then the basis for the null space of A is

    e_{Q+1} − Σ_{i=1}^{Q} b_{i(Q+1)} e_i,
    e_{Q+2} − Σ_{i=1}^{Q} b_{i(Q+2)} e_i,
      ⋮
    e_n − Σ_{i=1}^{Q} b_{in} e_i.
Proof Summary:
• Spanning
    - View Bx = 0 as a system of equations.
    - Look at the equations corresponding to the pivot column numbers.
    - Since our pivot columns are standard basis vectors, only one pivot column variable remains in each equation.
    - Isolate that variable.
    - Rewrite the solution vector x in terms of the non-pivot column variables.
    - Separate the vector into basis components.
• Linear Independence
    - Each e_j − Σ_{i=1}^{Q} b_{ij} e_i uniquely has a 1 in its j-th component, where j ≥ Q + 1.

Proof:

Spanning
Let x be a solution to
    Bx = 0.
Expand the system of equations and look at the equations corresponding to the pivot components (the first Q equations):

    b_11 x_1 + b_12 x_2 + . . . + b_1Q x_Q + . . . + b_1n x_n = 0
    b_21 x_1 + b_22 x_2 + . . . + b_2Q x_Q + . . . + b_2n x_n = 0
        ⋮          ⋮                  ⋮                   ⋮
    b_Q1 x_1 + b_Q2 x_2 + . . . + b_QQ x_Q + . . . + b_Qn x_n = 0        (*)
Recall that the pivot columns are standard basis vectors. In particular, the first Q columns of B are the first Q standard basis vectors:

    B = [ 1  0  0  . . .  0   b_1(Q+1)  . . .  b_1n ]
        [ 0  1  0  . . .  0   b_2(Q+1)  . . .  b_2n ]
        [ 0  0  1  . . .  0   b_3(Q+1)  . . .  b_3n ]
        [ ⋮  ⋮  ⋮         ⋮       ⋮               ⋮ ]
        [ 0  0  0  . . .  0   b_m(Q+1)  . . .  b_mn ]

This means that, for columns j = 1, 2, . . . , Q, we have
    b_ij = 1 if i = j,    b_ij = 0 if i ≠ j.
Therefore, our system (*) becomes

    x_1 + b_1(Q+1) x_{Q+1} + b_1(Q+2) x_{Q+2} + . . . + b_1n x_n = 0
    x_2 + b_2(Q+1) x_{Q+1} + b_2(Q+2) x_{Q+2} + . . . + b_2n x_n = 0
      ⋮            ⋮                    ⋮                     ⋮
    x_Q + b_Q(Q+1) x_{Q+1} + b_Q(Q+2) x_{Q+2} + . . . + b_Qn x_n = 0

Isolating x_1, x_2, . . . , x_Q, we get the constraints

    x_1 = −( b_1(Q+1) x_{Q+1} + b_1(Q+2) x_{Q+2} + . . . + b_1n x_n )
    x_2 = −( b_2(Q+1) x_{Q+1} + b_2(Q+2) x_{Q+2} + . . . + b_2n x_n )
      ⋮
    x_Q = −( b_Q(Q+1) x_{Q+1} + b_Q(Q+2) x_{Q+2} + . . . + b_Qn x_n )
Substituting these constraints into x, we get

    x = ( x_1, x_2, . . . , x_Q, x_{Q+1}, x_{Q+2}, . . . , x_n )^T
      = x_{Q+1} ( −b_1(Q+1), −b_2(Q+1), . . . , −b_Q(Q+1), 1, 0, . . . , 0 )^T
      + x_{Q+2} ( −b_1(Q+2), −b_2(Q+2), . . . , −b_Q(Q+2), 0, 1, . . . , 0 )^T
      + . . .
      + x_n     ( −b_1n,     −b_2n,     . . . , −b_Qn,     0, 0, . . . , 1 )^T,

that is,
    x = x_{Q+1} ( e_{Q+1} − Σ_{i=1}^{Q} b_{i(Q+1)} e_i ) + x_{Q+2} ( e_{Q+2} − Σ_{i=1}^{Q} b_{i(Q+2)} e_i ) + . . . + x_n ( e_n − Σ_{i=1}^{Q} b_{in} e_i ).

Since x_{Q+1}, x_{Q+2}, . . . , x_n can be any values, the vectors

    e_{Q+1} − Σ_{i=1}^{Q} b_{i(Q+1)} e_i,
    e_{Q+2} − Σ_{i=1}^{Q} b_{i(Q+2)} e_i,
      ⋮
    e_n − Σ_{i=1}^{Q} b_{in} e_i

span the null space.
Linear Independence
When expanded, each vector
    e_j − Σ_{i=1}^{Q} b_{ij} e_i    (j = Q + 1, . . . , n)
uniquely has a 1 at its j-th component:

    ( −b_1(Q+1), . . . , −b_Q(Q+1), 1, 0, . . . , 0 )^T,
    ( −b_1(Q+2), . . . , −b_Q(Q+2), 0, 1, . . . , 0 )^T,
    . . . ,
    ( −b_1n, . . . , −b_Qn, 0, 0, . . . , 1 )^T,

so no non-trivial combination of them can equal 0, and they are linearly independent.
12.5 Null Space Basis

If you understand the previous proof, then the following proof is just optional. You won't be tested on it, and it's just simple algebra stated in precise notation. In fact, the only difference between this proof and the previous one is that we consider a sequence of pivot indices. This forces us to introduce double subscripts.
Theorem. Let B = rref(A) and let
    P_1, P_2, . . . , P_Q
be the indices of the pivot columns and^1
    N_1, N_2, . . . , N_K
be the indices of the non-pivot columns. Then the basis for the null space of A is

    e_{N_1} − Σ_{i=1}^{Q} b_{iN_1} e_{P_i},
    e_{N_2} − Σ_{i=1}^{Q} b_{iN_2} e_{P_i},
      ⋮
    e_{N_K} − Σ_{i=1}^{Q} b_{iN_K} e_{P_i}.

[1: K is, of course, n − Q. We only leave it as K for simplification.]
Proof Summary:
• Spanning
    - View Bx = 0 as a system of equations.
    - Look at the equations corresponding to the pivot column numbers.
    - Since our pivot columns are standard basis vectors, only one pivot column variable remains in each equation.
    - Isolate that variable.
    - Rewrite the solution vector x in terms of the non-pivot column variables.
    - Separate the vector into basis components.
• Linear Independence
    - Each e_{N_j} − Σ_{i=1}^{Q} b_{iN_j} e_{P_i} uniquely has a 1 in its N_j-th component.

Proof:

Spanning
Let x be a solution to
    Bx = 0
and look at the equations of the expanded system. The i-th equation tells us
    Σ_{j=1}^{n} b_ij x_j = 0.
Split this sum into two, grouping the pivot and non-pivot columns:
    Σ_{r=1}^{Q} b_{iP_r} x_{P_r}  +  Σ_{r=1}^{K} b_{iN_r} x_{N_r}  =  0.    (∗)
By definition, b_{iP_r} is the i-th entry of the r-th pivot column. Moreover, we know the r-th pivot column is the r-th standard basis vector:
    b_{P_r} = e_r.
Therefore,
    b_{iP_r} = 1 if i = r,    and    b_{iP_r} = 0 otherwise.
Thus,
    Σ_{r=1}^{Q} b_{iP_r} x_{P_r} = x_{P_i},
and the i-th equation (∗) collapses into
    x_{P_i} + Σ_{r=1}^{K} b_{iN_r} x_{N_r} = 0,
i.e.,
    x_{P_i} = − Σ_{r=1}^{K} b_{iN_r} x_{N_r}.    (∗∗)

Split x into its separate standard basis components:
    x = Σ_{i=1}^{n} x_i e_i.
Again, separate this sum into non-pivot and pivot parts:
    x = Σ_{i=1}^{K} x_{N_i} e_{N_i}  +  Σ_{i=1}^{Q} x_{P_i} e_{P_i}.
Substituting (∗∗) into each pivot variable, we rewrite x as
    x = Σ_{i=1}^{K} x_{N_i} e_{N_i}  +  Σ_{i=1}^{Q} ( − Σ_{r=1}^{K} b_{iN_r} x_{N_r} ) e_{P_i},
i.e.,
    x = Σ_{i=1}^{K} x_{N_i} e_{N_i}  −  Σ_{i=1}^{Q} Σ_{r=1}^{K} b_{iN_r} x_{N_r} e_{P_i}.
Since double sums commute, and after changing the dummy variable of the first sum from i to r, this is
    x = Σ_{r=1}^{K} x_{N_r} e_{N_r}  −  Σ_{r=1}^{K} Σ_{i=1}^{Q} b_{iN_r} x_{N_r} e_{P_i}.
Then, pull the constant x_{N_r} out of the inner sum,
    x = Σ_{r=1}^{K} x_{N_r} e_{N_r}  −  Σ_{r=1}^{K} x_{N_r} ( Σ_{i=1}^{Q} b_{iN_r} e_{P_i} ),
and collapse using the distributive law:
    x = Σ_{r=1}^{K} x_{N_r} ( e_{N_r} − Σ_{i=1}^{Q} b_{iN_r} e_{P_i} ).
But each x_{N_r} can take on any value. Therefore, the null space is the span of

    e_{N_1} − Σ_{i=1}^{Q} b_{iN_1} e_{P_i},
    e_{N_2} − Σ_{i=1}^{Q} b_{iN_2} e_{P_i},
      ⋮
    e_{N_K} − Σ_{i=1}^{Q} b_{iN_K} e_{P_i}.
Linear Independence
If we look at the N_j-th component of
    e_{N_j} − Σ_{i=1}^{Q} b_{iN_j} e_{P_i},
we see that its value is 1:
    [ e_{N_j} ]_{N_j} − Σ_{i=1}^{Q} b_{iN_j} [ e_{P_i} ]_{N_j}  =  1 − 0  =  1,
since a non-pivot index N_j never equals a pivot index P_i. Moreover, for k ≠ j, the N_j-th component of
    e_{N_k} − Σ_{i=1}^{Q} b_{iN_k} e_{P_i}
has value 0:
    [ e_{N_k} ]_{N_j} − Σ_{i=1}^{Q} b_{iN_k} [ e_{P_i} ]_{N_j}  =  0 − 0  =  0.
Thus, each vector
    e_{N_j} − Σ_{i=1}^{Q} b_{iN_j} e_{P_i}
is the only vector in the list with a 1 in its N_j-th component, so the list is linearly independent.
12.6 Column Space Basis

Because of the massive amount of book-keeping we did in constructing the null space basis, we are rewarded for our troubles. We can actually use the null space basis to derive an explicit formula for the column space basis. In fact, we even get a pithy description:
    The basis of the column space of A is the columns of A corresponding to the pivot indices.
Note that we are referring to the columns of the original matrix, not the reduced row echelon form.^1

Theorem. Let A be an m × n matrix with columns a_i:
    A = [ a_1  a_2  . . .  a_n ].
Let B = rref(A) with pivot indices
    P_1, P_2, . . . , P_Q
and non-pivot indices
    N_1, N_2, . . . , N_K.
Then,
    a_{P_1}, a_{P_2}, . . . , a_{P_Q}
is a basis for C(A).

[1: That would be nuttier than squirrel poo: it would mean that every vector space is a span of standard basis vectors!]
Proof Summary:
• Spanning:
    - One inclusion is obvious.
    - For the other, it suffices to show an arbitrary non-pivot column is in the span of the pivot columns.
    - Multiply A by this non-pivot column's corresponding null space basis vector.
    - Rewrite the product to express the non-pivot column in terms of the pivot columns.
• Linear Independence:
    - Suppose not; then linear dependence gives us a non-zero null space vector c.
    - Write c as a combination of the explicit null space basis vectors.
    - c is zero at some non-pivot index while the combination is non-zero at the same location.
    - Contradiction.
Proof:

Spanning
Obviously,
    span{ a_{P_1}, a_{P_2}, . . . , a_{P_Q} }  ⊆  span{ a_1, a_2, . . . , a_n } = C(A).
Therefore, we need only show the reverse inclusion. Moreover, it suffices to show that every other column of A is contained in
    span{ a_{P_1}, a_{P_2}, . . . , a_{P_Q} }.
Consider a column that is not a pivot column. By definition, it is enumerated by some non-pivot index:
    a_{N_j}.
Then, take the corresponding null space basis vector
    e_{N_j} − Σ_{i=1}^{Q} b_{iN_j} e_{P_i}
and multiply it by A:
    A ( e_{N_j} − Σ_{i=1}^{Q} b_{iN_j} e_{P_i} ) = 0.
This yields
    A e_{N_j} = A ( Σ_{i=1}^{Q} b_{iN_j} e_{P_i} ).
But multiplying A by a standard basis vector e_k simply picks out the k-th column of A. Therefore,
    a_{N_j} = Σ_{i=1}^{Q} b_{iN_j} a_{P_i},
i.e.,
    a_{N_j} ∈ span{ a_{P_1}, a_{P_2}, . . . , a_{P_Q} }.
Linear Independence
Suppose the pivot columns are not linearly independent. Then we can find a non-trivial solution to
    c_{P_1} a_{P_1} + c_{P_2} a_{P_2} + . . . + c_{P_Q} a_{P_Q} = 0.
But this gives us a non-trivial vector in the null space. Precisely, A c = 0 for the vector
    c = Σ_{i=1}^{Q} c_{P_i} e_{P_i},
which has the coefficients c_{P_i} in the pivot components and 0 in every non-pivot component. Using our explicit formula for the null space basis, we can also write
    c = s_{N_1} ( e_{N_1} − Σ_{i=1}^{Q} b_{iN_1} e_{P_i} ) + . . . + s_{N_r} ( e_{N_r} − Σ_{i=1}^{Q} b_{iN_r} e_{P_i} ) + . . . + s_{N_K} ( e_{N_K} − Σ_{i=1}^{Q} b_{iN_K} e_{P_i} ),
where at least one s_{N_r} is non-zero. In particular, notice that the right hand side is non-zero at its N_r-th component: the only term with a non-zero N_r-th component is s_{N_r} e_{N_r}, so that component equals s_{N_r} ≠ 0. However, by definition, c is 0 at its N_r-th component! So we have a contradiction.
Notice that our explicit basis formulas give us an alternate proof of the Rank-Nullity Theorem. We explicitly constructed a null space basis that has as many vectors as there are non-pivot columns, hence:
    dim N(A) = # of non-pivot columns.
And we explicitly constructed a column space basis that has as many vectors as there are pivot columns, hence:
    dim C(A) = # of pivot columns.
Of course, every column is either a pivot or a non-pivot column:
    # of non-pivot columns + # of pivot columns = # of columns.
Therefore,
    dim N(A) + dim C(A) = n.
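A short sketch of mine (again assuming sympy for rref) that reads off the column space basis for the example matrix of Section 12.3 and confirms the pivot/non-pivot dimension count:

import numpy as np
from sympy import Matrix

A = np.array([[1, 6, 1, 0, 15],
              [0, 0, 1, 1, 17],
              [0, 0, 0, 2, 18],
              [2, 12, 0, 0, 14]])

R, pivots = Matrix(A).rref()
n = A.shape[1]
nonpivots = [j for j in range(n) if j not in pivots]

# Column space basis: the pivot columns of the ORIGINAL matrix A
col_basis = A[:, list(pivots)]
assert np.linalg.matrix_rank(col_basis) == len(pivots)   # they are linearly independent
assert np.linalg.matrix_rank(A) == len(pivots)           # and they span C(A)

# Rank-Nullity, counted with pivots: dim C(A) + dim N(A) = n
assert len(pivots) + len(nonpivots) == n
print("dim C(A) =", len(pivots), " dim N(A) =", len(nonpivots), " n =", n)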
12.7 Inhomogeneous Equations

We end this lecture with some pretty easy proofs. These proofs are of practical importance, and after three weeks of intense math, you've earned a break.

In the real world, we would like to solve inhomogeneous systems of equations of the form
    Ax = b.
Particularly, we need to ask,
1. How do we know there exists a solution?
2. How do we find one solution?
3. How do we find all solutions?

The first question is easy. The product Ax is a linear combination of the columns of A. In fact, a solution exists if and only if b is in the column space:
    Ax = b has a solution if and only if b ∈ C(A).
Practically, we check this the same way that we did in high school. Consider the inhomogeneous system:
    x_1 + x_2 = 1
    x_1 + x_2 = 5
After applying Gaussian Elimination, we have the equivalent system
    x_1 + x_2 = 1
        0 + 0 = 4
Here, we can instantly see that there is no solution (otherwise we would have 0 = 4).

This idea can be extended to arbitrary matrices. First, we can easily prove that the solutions to
    Ax = b
are the same as the solutions to
    A′ x = b′,
where A′ and b′ are the sub-matrices acquired from performing Gaussian Elimination on the augmented matrix:^1
    rref([ A | b ]) = [ A′ | b′ ].
Moreover, we can prove that a solution exists if and only if, for every row of zeros in A′, the corresponding component of b′ is also zero.

[1: I'm not going into too much detail since you've already learned this procedure in high school. Everything is precisely the same except your perspective has changed.]
(Schematically, each all-zero row of A′ in the augmented matrix must sit next to a 0 entry of b′.)
We can also use Gaussian Elimination to answer the second question. To construct a particular solution, perform Gaussian Elimination on the augmented matrix. Then, substitute arbitrary values for the variables with non-pivot indices. In fact, it is easiest to set
    x_{N_1} = 0,  x_{N_2} = 0,  . . . ,  x_{N_K} = 0.
We can then directly solve for each of the pivot variables.
Example. Find a particular solution x_0 to
    Ax = b,
where

    A = [ 1  2  0  1 ]        b = ( 4, 3, 0, 0 )^T.
        [ 0  0  1  2 ]
        [ 0  0  0  0 ]
        [ 0  0  0  0 ]

We want to solve the system
    x_1 + 2x_2 + x_4 = 4
          x_3 + 2x_4 = 3
Since A is already in reduced row echelon form,
    N_1 = 2,    N_2 = 4.
Therefore, set
    x_2 = 0,    x_4 = 0
to get
    x_1 = 4,    x_3 = 3.
Thus,
    x_0 = ( 4, 0, 3, 0 )^T
is a particular solution.
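The whole recipe (existence test, particular solution, general solution) fits in a few lines of code. The sketch below is mine, again leaning on sympy's rref as an assumed helper: it row-reduces the augmented matrix, sets the non-pivot variables to 0 to obtain a particular solution, and then describes every solution as that particular solution plus a null space vector.

import numpy as np
from sympy import Matrix

A = np.array([[1, 2, 0, 1],
              [0, 0, 1, 2],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
b = np.array([4, 3, 0, 0], dtype=float)

# Row-reduce the augmented matrix [A | b]
aug, pivots = Matrix(np.column_stack([A, b])).rref()
aug = np.array(aug, dtype=float)
m, n = A.shape

# Existence: no pivot may land in the augmented column
# (equivalently, a zero row of A' must be matched by a zero entry of b')
assert n not in pivots, "no solution: some equation reads 0 = nonzero"

# Particular solution: set every non-pivot variable to 0; then each pivot
# variable is simply the augmented entry of its row
x0 = np.zeros(n)
for row, col in enumerate(pivots):
    x0[col] = aug[row, n]
assert np.allclose(A @ x0, b)
print("particular solution x0 =", x0)        # [4, 0, 3, 0]

# General solution: x0 plus anything in N(A)
for v in Matrix(A).nullspace():
    shift = np.array(v, dtype=float).flatten()
    assert np.allclose(A @ (x0 + shift), b)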
Once you have a particular solution to an inhomogeneous system, you immediately know all solutions. The solution set is simply the set of all null space vectors shifted^1 by the particular solution:

Theorem. Let x_0 be a particular solution to
    Ax = b.    (∗)
Then x is a solution of (∗) if and only if x ∈ N(A) + x_0, where
    N(A) + x_0 = { n + x_0 | n ∈ N(A) }.

[1: We typically call the sum of a vector and a subspace an affine subspace.]
Proof: Let x_0 be a particular solution:
    A x_0 = b.

(⟹) For any x such that
    Ax = b,
we have
    Ax − b = 0.
Substituting b = A x_0, we get
    Ax − A x_0 = 0.
Therefore,
    A( x − x_0 ) = 0,
so
    ( x − x_0 ) ∈ N(A).
Since
    x = ( x − x_0 ) + x_0,
we conclude
    x ∈ N(A) + x_0.

(⟸) Let x ∈ N(A) + x_0. Then
    x = n + x_0
for some n ∈ N(A). Multiplying,
    Ax = A( n + x_0 ) = An + A x_0 = 0 + b = b.
New Notation

Symbol: rref(A)
Reading: The reduced row echelon form of the matrix A.
Example: N(A) = N(rref(A))
Example Translation: The null space of A equals the null space of the reduced row echelon form of A.
Midterm I: The Linear Algebra Menace

"If you only know definitions, theorem statements, and computation, then...
YOU SHALL NOT PASS"
    - Gandalf

Making a Choice and Mastering the Material

If you don't like doing proofs, then you gotta get out. Now. Transfer to Math 51. There's no shame in doing so. However, if you still feel in your bones that you need to be a mathematician, then stay. But unless you want to be trampled by all those IMO, SUMaC, PROMYS, CMS math prodigies, you need to master the material. But how do you even know whether you've obtained such mastery? There are two age-old tests:

1. Can you re-derive the proofs from scratch?
It is not enough to be able to read it from a book. To quote Professor Devlin,
    "It's like learning to ride a bike. Someone can ride up and down in front of you for hours, telling you how they do it. But you won't learn to ride from watching them and having them explain it to you. You have to keep trying for yourself and FAILING until it eventually clicks."

2. Can you explain it out loud?
Find someone, a dorm member, an upperclassman, or even a fellow Math 51H compatriot, and reteach the proofs. To quote Hilbert,
    "A mathematical theory is not to be considered complete until you have made it so clear that you can explain it to the first man whom you meet on the street."
Or, if you are too shy, go to a vacant classroom with a chalkboard and start talking to yourself. Or even better, just don a ninja costume and start teaching on YouTube.

Ask yourself the following questions to see if you have mastered all the topics.
Week 1
1. Do you understand how to do basic proofs?
Arbitrary Proofs
Proof by Contradiction
Proof by Cases
If and Only If
Induction
2. Do you know the definition of the distance and the angle between two (non-zero) vectors?
3. Do you understand the geometry of vectors? Can you construct line segments and lines using
sets of vectors?
4. Do you feel comfortable working with inequalities: squaring and square rooting, establishing
upper bounds?
5. Do you know all the dot product properties and how to prove them?
6. Do you know the fundamental relation between norm and dot product? Can you prove the law
of cosines, the parallelogram law, etc.?
7. Can you state and prove the Cauchy-Schwarz inequality?
8. Can you state and prove the Cauchy-Schwarz equality?
9. Can you apply Cauchy-Schwarz to prove other inequalities?
10. Do you know the definition of a linear function?
11. Do you know the definition of a subspace? Can you verify that a given set is a subspace?
12. Can you prove all the linear independence/dependence properties and equivalences?
13. Can you state and prove the Under-determined Systems Lemma?
14. Can you state and prove the Linear Dependence Lemma?
Week 2
1. Do you know how to prove something is unique?
2. Can you prove statements using the Field Axiom?
3. Do you understand the abstract meaning of a rational number?
4. Do you understand the abstract meaning of a real number?
5. Can you state the completeness axiom? Can you use the Completeness Axiom to prove
that some number exists? Can you prove that number satisfies some property?
269
6. Can you prove two sets are equal?
7. Can you state the Basis Theorem and prove it?
8. Can you state the Basis Extension Theorem and prove it?
9. Can you prove that dimension is unique?
10. Can you prove dimension properties?
11. Can you prove two matrices are equal?
12. Do you know that Ax is a linear combination of the column vectors of A?
13. Do you understand the norm of a matrix and can you prove the Cauchy Schwarz-like bound?
14. Can you solve for the matrix A that represents a given linear transformation?
15. Do you understand null space, column space, and row space?
16. Can you state and prove the Rank-Nullity Theorem?
17. Can you prove rank properties?
Week 3
1. Do you know the definition of a limit? Can you prove that a sequence converges to
a specific number?
2. Can you prove the limit properties?
3. Can you state and prove the Monotone Convergence Property?
4. Can you state and prove the Sandwich Theorem?
5. Can you state and prove the Bolzano-Weierstrass Theorem?
6. Can you state and prove all the orthogonal complement properties?
7. Can you state and prove all the projection map properties?
8. Can you prove that the dimension of the column space and the dimension of the
row space are equal?
9. Can you calculate bases for the null space and the column space?
10. Can you solve for all the solutions of an inhomogeneous system of equations? Can
you define an affine space?
270
Final Advice

That's the content. Here are some final pointers that will make a huge difference:

1. DO THE PRACTICE TEST.
The questions on the real test will be similar to the questions on the practice test.

2. MAKE SURE YOUR DEFINITIONS ARE EXACT.
Professor Simon is pretty relentless about this: if the question asks you to give a definition, your statement must be flawless.

3. GET ACCUSTOMED TO WORKING AT 7PM.
The test is at 7PM. I know. It sucks. So do practice questions (or even the practice test) at this time.

Alright, I think that's everything. This is going to be the first real math test you take: barely any calculation, 1 or 2 definition questions, and the rest proofs. So, to quote the Hunger Games,
    Good luck, and may the odds be ever in your favor.
Lecture 13
Continuing with Continuity
Like going from sine of x over x to sinc,
I hope this book provides a continuous extension
from high school to college mathematics.
- Otis B. Ramsay '99
Goals: Today, we cover one of the most important objects of Calculus: continuous functions. We prove basic properties about continuous functions, including the fundamental property that the maximum and minimum are always achieved on a closed interval. We then add the stronger condition of differentiability to prove Rolle's Theorem and the Mean Value Theorem. The rest of this week will be devoted to cashing in our work in R to generalize to Rⁿ.
13.1 Why Continuity?
We've all seen continuous functions in Calc BC. But what makes them so important? Why do we care?

Continuity is everywhere.

Signals, economics, motion: so many things in the world can be viewed as continuous functions. And why would we want to model these phenomena as continuous functions?

A continuous function, on a closed interval, always has a maximum and a minimum.

Makes intuitive sense, right? If we look at a continuous function on some limited space, we can trace it with our finger until we find the highest point and the lowest point. And if you take Ramesh Johari's infamous

MS&E 246: Game Theory with Engineering Applications

you will see incredible economic examples of maximizing profit.

However, because this book is written for underdogs, instead of a fantastic real-world example, I give you an xkcd comic:
By tracing the graph with your finger, you can find the happiest point in this man's relationship. But in the math world, we do not want to say

Trace with your finger.

That sounds kinky. We'd rather formalize this as

There exist points where the maximum and minimum are achieved.

But how do we prove this assertion? First, we need to give a formal definition of continuity. That means no definitions involving a pen, paper, and never letting go.¹
13.2 Limit of a Function
Last week, we defined the limit of a sequence:

Definition. We say L is the limit of the sequence (a_n) if for any ε > 0 there exists an integer N such that for all n ≥ N,

|a_n − L| < ε.
But we never defined the limit of a function! Don't be afraid, it uses the same idea:

As the input gets closer to some value, the output approaches some limit.

Let's look at the sequence limit definition for inspiration. To prove a sequence converges to some limit,

We find an integer N > 0 and check that if the term number is greater than N, then some condition holds.

But functions don't have term numbers. Instead, a function has a domain. And whereas before we were interested in later terms with very large term numbers, here we are interested in capturing closeness to some point c. Rigorously, we consider all points within some distance δ > 0 from c:

[Number line: the interval (c − δ, c + δ) around c]

¹We reserve such sayings for Titanic.
So a first attempt at a limit definition for functions is

We find a δ > 0 and check that if x lies within distance δ around c, then some condition holds.

Instead of checking that all terms a_n with n ≥ N satisfy some condition, we will be checking that all points in an interval satisfy some condition.

However, to quote Professor Simon,

There is a slight fly in the ointment!

Just as we don't look at a_n with n at infinity, we want to exclude the evaluation of the function at c itself. We want the limit to tell us what happens to the function as we approach c, even if the function isn't defined at the point c itself. Therefore, we remove that case:

We find a δ > 0 and check that, for x ≠ c, if x lies within distance δ around c, some condition holds.
In the sequence limit denition, we demanded that all terms were eventually trapped in an -vortex:
L L L +
a
1
a
2
a
3
. . .
a
90
a
91
a
92
a
93
. . .
Again, we dont have terms anymore! Instead we have a function. So we demand that all mappings
of the x in our interval are trapped within distance of L:
R
L L L +
R
c c c +
f f f
The precise denition of the limit of a function is:
Denition. A function f : R R has limit L at c if, for any > 0, there exists a > 0 such that if
0 < |x c| < ,
then
|f(x) L| < .
Note that the

0 < |x − c|

of the if condition simply means that we exclude the case

|x − c| = 0,

which is just the case x = c.
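This definition is worth poking at numerically before we do anything rigorous with it. Here is a small Python sketch (my own illustration, not anything from Professor Simon's text) that probes the limit of f(x) = sin(x)/x at c = 0 — a function that isn't even defined at c — and watches the values crowd around L = 1, just as the definition promises.

    import math

    def f(x):
        # sin(x)/x is undefined at x = 0, but its limit there is 1
        return math.sin(x) / x

    c, L = 0.0, 1.0
    for k in range(1, 8):
        x = c + 10.0 ** (-k)          # points approaching c (but never equal to c)
        print(f"x = {x:.1e}   |f(x) - L| = {abs(f(x) - L):.3e}")

Of course, a table of numbers is evidence, not a proof; the ε-δ definition is what turns this picture into mathematics.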
13.3 Please Read: A Fundamental Difference in Texts
When writing this book, I had to choose between

assuming f is defined for all real numbers

and

allowing f to be defined on a subset U ⊆ R.

The latter is more rigorous and indeed the convention of Math 51H. In particular, the definition of a limit of a function is

Definition. A function f : U → R where U ⊆ R has limit L at c ∈ U if, for any ε > 0, there exists a δ > 0 such that if

x ∈ U and 0 < |x − c| < δ,

then

|f(x) − L| < ε.

Note that the if condition requires

x ∈ U.

Otherwise, f(x) need not be defined!

However, with some hesitation, I chose to assume

f is defined for all real numbers.

This is because I wanted to make this book as simple as possible. With this assumption, the overall structure of the proofs remains the same, yet we avoid technical details pertaining to domain restrictions. True, function domains are very important; however, I feel that taking extra steps to ensure a function is defined detracts from a proof's big picture. Ultimately,

This book is not meant to supplant Professor Simon's text. This book's only goal is to illuminate the simple ideas.

Therefore:
WARNING

On the exams, do not assume f is defined for all real numbers!

For the rigorous definitions, read Professor Simon's book!

You've been warned.
13.4 Continuous Functions
Our definition of continuity is the same as in high school:

Definition. A function f : R → R is continuous at c if

1. There is a limit of f at c.
2. The value of this limit is just f(c).

Equivalently: for any ε > 0, there exists a δ > 0 such that if

|x − c| < δ,

then

|f(x) − L| < ε.

Notice that we replaced the

0 < |x − c| < δ

in the limit definition with

|x − c| < δ.

By dropping 0 < |x − c|, we include the case x = c. Hence, if the limit does exist, then for any ε and corresponding δ > 0 it is indeed the case that

|c − c| < δ,

so

|f(c) − L| < ε.

But this is true for any ε > 0, therefore

|f(c) − L| = 0,

which implies

L = f(c)

as needed.
The definition of a continuous function is also the same:

Definition. A function is continuous on the interval [a, b] if it is continuous at all points x ∈ [a, b].
Even though the definitions of continuity at a point and of continuous functions are the same, we have changed the meaning behind a key word. Namely, we now have a rigorous limit definition. However, this new definition raises one glaring issue:

How do we know that the formal definition of continuity captures our pen and paper intuition of continuity?

When you hand-waved your way through Calc BC limits, the picture of limits was obvious: just fill in the hole. But here, we are actually dealing with a complicated definition using two intervals, one bounded by ε and one bounded by δ. So what do we do?

Just combine both intervals into a single picture! Remember, we graph x horizontally and each corresponding f(x) vertically:
[Graph: a continuous curve; the inputs (c − δ, c + δ) on the x-axis all map into the band (L − ε, L + ε) on the f(x)-axis]

All the points from c − δ to c + δ satisfy the continuity condition: each of their heights f(x) is between L − ε and L + ε. In this example, the continuity condition is satisfied for this choice of δ.

But you may not understand this picture. Personally, I was always confused by Professor Simon's graphs. Also, it is not obvious from this graph why this is the correct definition of continuity. And I would rather you understand the intuition over the algebraic statement.
Here is an example where the chosen δ is not good enough for the given ε:

[Graph: some points of (c − δ, c + δ) have heights escaping the band (L − ε, L + ε)]

Notice that some of the points x in the interval [c − δ, c + δ] do not have f(x) within the [L − ε, L + ε] range! But this function looks continuous, so we try again with a smaller choice of δ:

[Graph: with a smaller δ, every height lands inside (L − ε, L + ε)]

We have found a δ that guarantees every point within distance δ of c has an image that is within distance ε of f(c).
The next example shows how our definition excludes the discontinuous graphs you saw in high school:

[Graph: a function with a jump at c; part of (c − δ, c + δ) maps outside the band (L − ε, L + ε)]

Here, no matter how much you shrink the δ, you're screwed. The points are going to escape the ε bounds:

[Graphs: the same jump with smaller and smaller choices of δ; some points always escape the ε band]

I know these pictures aren't obvious, but mull them over. Like Professor Simon says, this took a hundred years to develop, and not even Newton had the precise definition! When you achieve that Eureka moment, let's continue (puns, puns, for everyone).
Here are a few examples of how to rigorously check that a function is continuous:

Example. The function

f(x) = x²

is continuous on [1, 2].

Proof Summary:

• Let c ∈ [1, 2], and let ε > 0.
• Rewrite the condition as a bound on |x − c|.
• Restrict δ < 1 so that |x + c| < 5.
• Show the condition holds when δ = min{ε/5, 1}.
Proof: Let c be an arbitrary point in [1, 2] and let ε > 0. We want to show that there exists δ > 0 such that if

|x − c| < δ

then

|f(x) − f(c)| < ε.

Look at the condition:

|f(x) − f(c)| < ε.

This is just

|x² − c²| < ε

or

|(x − c)(x + c)| < ε.

Applying absolute value properties, we rewrite this as

|x − c| |x + c| < ε

and dividing,

|x − c| < ε / |x + c|.

Therefore, we need to choose a δ so that when x satisfies |x − c| < δ, the above inequality is guaranteed. Does this mean we can use

δ = ε / |x + c| ?

NO WAY! δ must rely only on ε and c; δ cannot depend on x because we only consider x in a fixed δ-interval once δ is selected.

Instead, suppose we add the restriction that

δ < 1.

Rewriting

|x + c| = |x − c + 2c|,

we apply the triangle inequality:

|x − c + 2c| ≤ |x − c| + 2c.

By our restriction on δ and the fact that c ∈ [1, 2],

|x − c| + 2c < δ + 2c ≤ 1 + 4 = 5,

and hence

|x + c| < 5.

Thus, if we choose

δ = min{ε/5, 1},

we are guaranteed

|x − c| < δ ≤ ε/5 ≤ ε/|x + c|

as needed. □
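If you like, you can sanity-check the choice δ = min{ε/5, 1} before trusting the algebra. This Python sketch (my own check, not part of the text) samples centers c in [1, 2] and points x within δ of them, and confirms that |x² − c²| < ε every time.

    import random

    def delta_for(eps):
        # the delta chosen in the proof for f(x) = x**2 on [1, 2]
        return min(eps / 5.0, 1.0)

    random.seed(0)
    for eps in (1.0, 0.1, 0.01):
        d = delta_for(eps)
        ok = True
        for _ in range(100_000):
            c = random.uniform(1.0, 2.0)
            x = c + random.uniform(-d, d)        # any x with |x - c| < delta
            ok = ok and abs(x * x - c * c) < eps
        print(f"eps = {eps}: delta = {d}, bound held = {ok}")

Random sampling proves nothing, of course — it just reassures you that the δ in the proof isn't nonsense.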
Example. The function

f(x) = { 0 : x ≤ 0;  1 : x > 0 }

is not continuous at 0.

Proof Summary:

• Suppose the function is continuous at 0. Then it satisfies the continuity definition for ε = 1/2 with some corresponding δ.
• x = δ/2 satisfies the δ bound, but f(δ/2) is not bounded by 1/2, a contradiction.
Proof: Suppose the function is continuous at x = 0. Then, for any ε > 0 there exists a δ > 0 such that if

|x| < δ

then

|f(x) − f(0)| = |f(x)| < ε.

So in particular, consider the case of ε = 1/2. Then there exists a δ > 0 such that if

|x| < δ

then

|f(x)| < 1/2.

But if we choose

x = δ/2,

we have

|x| < δ

yet

f(δ/2) = 1,

which is certainly not less than 1/2 — impossible! □
13.5 Properties of Continuous Functions
To prove that continuous functions on closed intervals achieve their extrema, we must first prove:

1. If a sequence on the interval [a, b] converges, then it must converge to a point in [a, b].
2. If (x_i) converges to x, then for any continuous function f, the image sequence (f(x_i)) converges to f(x).
3. Continuous functions on closed intervals are bounded.
Each of these properties is fundamental. In fact, they are so important that we are going to extend them to n dimensions. But for now, let's prove the 1-dimensional cases.

By the way, notice that we are always considering closed intervals [a, b]. This is a very important distinction. Otherwise, we can give each statement a corresponding counterexample:

1. The sequence

x_n = 1/n

is a sequence where every term lies in the interval (0, 1]. However, its limit is 0. Because the interval is not closed, the sequence can escape the interval.

2. This statement wouldn't even make sense. Suppose f(x) is only defined on (0, 1). Then, if we used

x_n = 1/n

once more, the image sequence cannot converge to f of the limit, because f is undefined at 0.

3. The function

f(x) = 1/x

is continuous on the interval (0, 1) but it is definitely not bounded.
Now that we have the suspects lined up, we begin the proofs. First up,
Theorem. If the sequence (x_i) satisfies

A ≤ x_i ≤ B

for every i and (x_i) converges to x, then x also satisfies the inequality

A ≤ x ≤ B.

Proof Summary:

• Suppose the limit x is greater than B.
• Set ε equal to half the distance from x to B.
• There exists an N such that for all n ≥ N, x_n > B. Contradiction.
• Apply the same argument to show x cannot be less than A.
Proof: Suppose not. Then either x > B or x < A. Let x > B. Schematically, this would show us

[Number line: A, then B, then x far to the right of B]

All of our terms are nowhere near the value they converge to, so something is definitely wrong. To derive a contradiction, it suffices to show that the limit would drag the sequence out of the B upper bound. This will be a direct consequence of the definition of convergence if ε is chosen to be half the distance between the upper bound B and x.

[Number line: the interval of radius (x − B)/2 around x lies entirely to the right of B]

Let's make this argument formal:

By definition of convergence, for any ε > 0 there is an integer N such that for all n ≥ N,

|x_n − x| < ε,

which is just

x − ε < x_n < x + ε.

Choose ε to be half the distance from B to x:

ε = (x − B)/2.

Note that this is indeed a valid choice of ε since x > B and so

(x − B)/2 > 0.

Then we know there exists an N such that for all terms with index n ≥ N,

x − (x − B)/2 < x_n < x + (x − B)/2,

where the left-hand side is just (x + B)/2. So in particular,

x_N > (x + B)/2 = x/2 + B/2 > B/2 + B/2 = B,

contrary to our stipulation that all terms be less than (or equal to) B. We can make a similar argument to prove that x < A also yields a contradiction. □
Notice how this argument would not have worked if the inequality in the conclusion were strict:

A < x < B.

Then, in our contradiction proof, we might have had x = B, so

ε = (x − B)/2 = 0,

which is an invalid choice of ε. In general,

Math Mantra: Whenever you find a counterexample to a theorem if a certain condition is relaxed, figure out where that condition is used in the proof and why that condition is necessary.

This proof also shows us that if a sequence in a closed interval converges to a limit, then that limit must be in the interval. Next lecture, we generalize this property to higher dimensions and, in fact, make this the defining property of a closed set.
Next on our list,
Theorem. Let f be a continuous function on [a, b] and let (x_i) be a sequence with each

x_i ∈ [a, b]

and that converges to some limit:

x_i → x.

Then the mapped sequence converges to f(x):

f(x_i) → f(x).

Proof Summary:

• Let ε > 0. We want to find an N that satisfies the sequence convergence definition for (f(x_i)).
• By the definition of continuity of f at the limit x with this choice of ε, we need only find an N such that the δ-condition always holds.
• Use the convergence of (x_i) to find N.
Proof: First write out all the definitions you know. Let ε > 0.

We want to prove: There exists an N such that for all i ≥ N,

|f(x_i) − f(x)| < ε.

We know f is continuous: for any ε₂ > 0 there exists a δ > 0 such that if

y ∈ [a, b] with |y − c| < δ,

then

|f(y) − f(c)| < ε₂.

We also know that

x_i → x.

So for any ε₃ > 0 there exists an integer N₀ such that for all i ≥ N₀,

|x_i − x| < ε₃.

Now just stare at what is given. We want to use the continuity definition at c = x and ε₂ = ε to give us a condition involving some δ_ε > 0: if

y ∈ [a, b] with |y − x| < δ_ε,

then

|f(y) − f(x)| < ε.

Now all we have to do is show that, eventually, the x_i satisfy

|x_i − x| < δ_ε,

for then, by continuity,

|f(x_i) − f(x)| < ε.

But satisfying the condition is easy. Just choose ε₃ = δ_ε. Since

x_i → x,

there exists an integer N₀ such that for all i ≥ N₀,

|x_i − x| < δ_ε.

So going back to the beginning, given ε, we can find an N (namely N = N₀) such that for all i ≥ N,

|x_i − x| < δ_ε.

But by definition of continuity, we know this ε-dependent choice of δ_ε guarantees

|f(x_i) − f(x)| < ε. □
Notice that:

1. This was a very simple and bone-headed proof! You only needed to juggle definitions and be very careful with variables. Generally,

Math Mantra: If you are not sure what to do, write out all the definitions and stare at them.

2. The converse does not hold: even if

f(x_n) → M,

it may not be the case that

x_n → x.

For example, consider the function f defined by

f(x) = 0

for all x ∈ R, and the sequence (x_n) defined by

x_n = n.

Then f(x_n) = 0 for all n, so clearly

f(x_n) → 0.

However, the original unmapped sequence (x_n) does not converge. Generally,

Math Mantra: Just because the forward implication holds DOES NOT MEAN the backward implication is true.

3. We can use this theorem to derive the limits of sequences. We save this incredible assertion for the end of the lecture.
Finally, we prove:

Theorem. Let f be a continuous function on [a, b]. Then f is bounded.

Proof Summary:

• Suppose f is unbounded. It suffices to consider only the case of f unbounded from above.
• We can construct a sequence in [a, b] where the i-th term is mapped to a number greater than or equal to i.
• By Bolzano-Weierstrass, this sequence has a convergent subsequence.
• The image of this subsequence diverges.
• But the previous theorem tells us that the mapping of a convergent sequence by a continuous function is still convergent, a contradiction.

Proof: Suppose f is unbounded. We need only consider the case of f unbounded from above (since the case of f unbounded from below follows similarly). Thus, for any n, we can choose an x_n ∈ [a, b] such that

f(x_n) ≥ n.

This means that we can pick specific values

x_1, x_2, x_3, ...

such that

f(x_1) ≥ 1
f(x_2) ≥ 2
f(x_3) ≥ 3
⋮

Since each x_n ∈ [a, b], we can apply Bolzano-Weierstrass to find a convergent subsequence

x_{n_1}, x_{n_2}, x_{n_3}, ....

By the previous theorem, the image of this subsequence

f(x_{n_1}), f(x_{n_2}), ...

converges. However, the i-th subsequence index must be greater than or equal to i:

n_i ≥ i,

thus

f(x_{n_1}) ≥ n_1 ≥ 1
f(x_{n_2}) ≥ n_2 ≥ 2
f(x_{n_3}) ≥ n_3 ≥ 3
⋮

But this means the sequence (f(x_{n_i})) fails to converge! So we have a contradiction and f must be bounded above on [a, b]. □
The lesson to take from this proof is

Math Mantra: Suppose all your points lie in a bounded interval. If you need a convergent sequence that satisfies some property, construct any sequence satisfying that property. Then use Bolzano-Weierstrass to extract a convergent subsequence, and check that this subsequence inherits the desired property.
13.6 Max and Min
We only need to prove that a continuous function on a closed interval achieves a maximum. Why? Because:

Lemma. Let f be a function on [a, b]. Then f achieves its maximum at y ∈ [a, b] if and only if (−f) achieves its minimum at the same y.

Proof: Let f achieve its maximum at y, that is,

f(y) ≥ f(x)

for every x ∈ [a, b]. Multiplying by −1, we get

−f(x) ≥ −f(y)

for every x ∈ [a, b]. This is equivalent to saying (−f) achieves its minimum at y. (The converse direction is the same argument run backwards.) □

So suppose we can prove that any continuous function on a closed interval [a, b] attains its maximum. Then, in particular, −f is a continuous function, so −f must attain its maximum. But we just proved this is equivalent to −(−f) = f achieving its minimum on [a, b].
Now, we are prepared to prove:

Theorem. Let f be a continuous function on [a, b]. Then f achieves its maximum for some y ∈ [a, b]:

f(y) ≥ f(x)

for all x ∈ [a, b]. Likewise, f achieves its minimum for some y ∈ [a, b]:

f(x) ≥ f(y)

for all x ∈ [a, b].

Proof Summary:

• By the lemma, it suffices to prove that f achieves its maximum.
• Consider the image set F of [a, b] under f.
• f is a continuous function on a closed interval, so F is bounded.
• By the Completeness Axiom, F has a least upper bound S.
• Define a sequence (x_i) of points in [a, b] whose image under f converges to S.
• By Bolzano-Weierstrass, we have a subsequence (x_{n_i}) that converges to some x. By our closed interval property, x ∈ [a, b].
• Since the subsequence converges and f is continuous, the image of (x_{n_i}) under f converges to f(x).
• But the image of (x_{n_i}) under f is a subsequence of a sequence converging to S, so it also converges to S.
• By uniqueness of limits, f(x) = S.
Proof: By the lemma, it suffices to prove that f achieves its maximum. Consider the set of all mapped values of x ∈ [a, b] under f:

F = { f(x) | x ∈ [a, b] }.

Since f is continuous on a closed interval, by one of our previous theorems we know that this set is bounded. Moreover, by the Completeness Axiom, we know that this set has a least upper bound S. The goal is to show that we have an element q ∈ [a, b] with

S = f(q).

Since S is the least upper bound, this gives

f(q) ≥ f(x)

for all x ∈ [a, b].

First, we find a sequence of points in [a, b] whose image converges to S. Note that given an ε_i > 0, we can always find a point

s_i ∈ F

within distance ε_i of S. If not, then this would contradict S being the least upper bound!¹ So, using the choices

ε_1 = 1
ε_2 = 1/2
ε_3 = 1/3
⋮

we can find s_i ∈ F such that

S − 1 ≤ s_1 ≤ S
S − 1/2 ≤ s_2 ≤ S
S − 1/3 ≤ s_3 ≤ S
⋮

Unravelling the definition of the set F, each s_i = f(x_i) for some x_i ∈ [a, b]. Thus,

S − 1 ≤ f(x_1) ≤ S
S − 1/2 ≤ f(x_2) ≤ S
S − 1/3 ≤ f(x_3) ≤ S
⋮

¹To quote Professor Cohen, this is a napkin problem. Alternatively, we showed this in the proof of the Monotone Convergence Property.
By the Sandwich Theorem,

f(x_n) → S.

Now, if (x_i) converged, we'd be done. Unfortunately, this need not be the case. Instead, we use our nice trick of finding a convergent subsequence. Our original sequence (x_n) is bounded (every term is between a and b), so by Bolzano-Weierstrass, there exists a convergent subsequence (x_{n_i}) with

x_{n_i} → x

for some x. Moreover, by our closed interval property, x is still in [a, b]. We also proved that the image of (x_{n_i}) under a continuous f yields

f(x_{n_i}) → f(x).

But now notice that

f(x_{n_1}), f(x_{n_2}), ...

is a subsequence of the convergent sequence

f(x_1), f(x_2), ....

Since we already proved that

If a sequence converges to some limit, then any subsequence converges to the same limit,

we have

f(x_{n_i}) → S.

Because

f(x_{n_i}) → f(x)
f(x_{n_i}) → S,

uniqueness of limits gives us

f(x) = S

for x ∈ [a, b]. Awesome. □

Notice that I only showed you that the maximum and minimum exist. I did not tell you how to find them. That's what derivatives are for!
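Since the proof's key move is building a sequence whose images sneak up on the least upper bound S, here is a rough numerical imitation of that construction (my own sketch; the example function f(x) = x·sin(1/x) on [0.05, 1] and the grid standing in for the Completeness Axiom are both my assumptions, not the book's):

    import math

    def f(x):
        return x * math.sin(1.0 / x)

    a, b, N = 0.05, 1.0, 100_001
    grid = [a + (b - a) * k / (N - 1) for k in range(N)]
    S = max(f(x) for x in grid)        # stand-in for the least upper bound of F

    for i in range(1, 6):
        # mimic the proof: pick some x_i whose image is within 1/i of S
        x_i = next(x for x in grid if f(x) > S - 1.0 / i)
        print(f"i = {i}: x_i = {x_i:.4f}, f(x_i) = {f(x_i):.4f}, S ~ {S:.4f}")

The printed points play the role of the x_i in the proof: their images are squeezed toward S, and a convergent subsequence of them heads toward a point where the maximum is actually achieved.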
13.7 On a Rolle!
First, we define:

Definition. A function f on R is differentiable at c if the quotient map

g(x) = (f(x) − f(c)) / (x − c)

has a limit at c. We call this limit the derivative of f at c and denote it f′(c).

Notice that this definition is the same one from high school. Except we stated it in a more precise way, and your perspective is different! Armed with a rigorous definition of limits, you now have the precision to prove theorems involving derivatives.

Recall that Rolle's Theorem states that if you have a differentiable function with two endpoints on the x-axis, there must be a turning point where the function has derivative 0:

[Graph: a curve with f(a) = f(b) = 0 and a point c in between where f′(c) = 0]

Theorem (Rolle's Theorem). For a function f that is continuous on [a, b] and differentiable on (a, b), if

f(a) = 0
f(b) = 0,

then there must be a point c ∈ (a, b) where f′(c) = 0.
Proof Summary:

• WLOG, assume f is entirely non-negative.
• Since f is a continuous function on a closed and bounded interval, it has a maximum at x_max.
• We want to show f′(x_max) = 0. Suppose not.
• Case 1: f′(x_max) > 0.
  – Apply the differentiability definition with ε = f′(x_max).
  – Choose an x that satisfies the δ-hypothesis and lies to the right of x_max.
  – Show f′(x_max) ≤ 0, a contradiction.
• Case 2: f′(x_max) < 0.
  – Apply the differentiability definition with ε = −f′(x_max).
  – Choose an x that satisfies the δ-hypothesis and lies to the left of x_max.
  – Show f′(x_max) ≥ 0, a contradiction.
• Conclude f′(x_max) = 0.
Proof: Without loss of generality, we can assume the function f is either completely non-positive or completely non-negative: just restrict to the interval between a and the first time after a that f crosses the x-axis.

Moreover, we can assume that f ≥ 0, since we can reapply this argument to −f. Also ignore the case that f is the zero function, because its derivative is always 0.

By the previous theorem, we know that f achieves a maximum¹ at x_max ∈ (a, b). I claim

f′(x_max) = 0.

Suppose not. We will use the fact that f′(x_max) exists to extract information on its sign and hence derive a contradiction.

Case 1: f′(x_max) > 0

Apply the differentiability definition with the choice

ε = f′(x_max).

Then there exists a δ > 0 such that if

x ∈ (x_max − δ, x_max + δ),

then

| (f(x) − f(x_max)) / (x − x_max) − f′(x_max) | < f′(x_max).

Consider an x that satisfies the δ-hypothesis and lies to the right of x_max:

x ∈ (x_max, x_max + δ).

Then

x − x_max > 0,

and by definition of a maximum,

f(x) − f(x_max) ≤ 0,

so the difference quotient is ≤ 0 while f′(x_max) > 0. Negate the term under the absolute value in the δ-conclusion:

| f′(x_max) − (f(x) − f(x_max)) / (x − x_max) | < f′(x_max).

Since we are taking the absolute value of a non-negative term, we can drop the absolute value sign:

f′(x_max) − (f(x) − f(x_max)) / (x − x_max) < f′(x_max).

But this says

− (f(x) − f(x_max)) / (x − x_max) < 0,

even though the left-hand side is ≥ 0, which is absurd.

Case 2: f′(x_max) < 0

Apply the differentiability definition with the choice

ε = −f′(x_max).

Then there exists a δ > 0 such that if

x ∈ (x_max − δ, x_max + δ),

then

| (f(x) − f(x_max)) / (x − x_max) − f′(x_max) | < −f′(x_max).

Consider an x that satisfies the δ-hypothesis and lies to the left of x_max:

x ∈ (x_max − δ, x_max).

Now x − x_max < 0 and f(x) − f(x_max) ≤ 0, so the difference quotient is ≥ 0 while f′(x_max) < 0. Since the term under the absolute value is non-negative,

(f(x) − f(x_max)) / (x − x_max) − f′(x_max) < −f′(x_max).

Therefore,

(f(x) − f(x_max)) / (x − x_max) < 0,

even though the left-hand side is ≥ 0, which is again an absurdity.

In conclusion,

f′(x_max) = 0. □

¹Notice the open interval. Otherwise, if the maximum occurred at an endpoint, it would force the entire function to be 0.
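The mechanism of the proof — the derivative gets pinched to 0 at an interior maximum — is easy to watch numerically. In this sketch (my own, with f(x) = x(1 − x) chosen as an example satisfying f(0) = f(1) = 0), we locate the maximum on a grid and check that the difference quotients on either side of it straddle 0:

    def f(x):
        return x * (1.0 - x)          # f(0) = f(1) = 0, differentiable everywhere

    N = 100_001
    grid = [k / (N - 1) for k in range(N)]
    x_max = max(grid, key=f)          # interior maximum (here x_max = 0.5)

    h = 1e-6
    right = (f(x_max + h) - f(x_max)) / h       # quotient from the right: <= 0 near a max
    left  = (f(x_max - h) - f(x_max)) / (-h)    # quotient from the left:  >= 0 near a max
    print(x_max, left, right)                   # both squeezed toward f'(x_max) = 0

The two one-sided quotients are exactly the quantities the two cases of the proof play off against each other.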
We can now use Rolle's Theorem to prove one of our most useful results, the Mean Value Theorem.¹ Recall that the Mean Value Theorem asserts the existence of a point where the derivative has the same slope as the secant line from a to b:

[Graph: a curve over [a, b], the secant line of slope (f(b) − f(a))/(b − a), and a point c where the tangent has that same slope]

Theorem (Mean Value Theorem). For a function f that is continuous on [a, b] and differentiable on (a, b), there exists a point c ∈ (a, b) such that

f′(c) = (f(b) − f(a)) / (b − a).

Proof Summary:

• Define a differentiable function g that is zero at a and b.
• By Rolle's Theorem, there is a point c where g′(c) = 0.
• Isolate f′(c) in the expression for g′(c).

¹We already used this to prove that a function is constant if and only if its derivative is always 0.
Proof: This is a smart application of Rolle's Theorem. Consider the function

g(x) = f(x) − f(a) − [(f(b) − f(a)) / (b − a)] (x − a).

We have

g(a) = f(a) − f(a) − [(f(b) − f(a)) / (b − a)] (a − a) = 0,
g(b) = f(b) − f(a) − [(f(b) − f(a)) / (b − a)] (b − a) = 0.

By Rolle's Theorem, there exists some c ∈ (a, b) such that

g′(c) = 0.

But

g′(x) = f′(x) − (f(b) − f(a)) / (b − a),

so in particular,

g′(c) = f′(c) − (f(b) − f(a)) / (b − a) = 0.

This gives us

f′(c) = (f(b) − f(a)) / (b − a). □

Notice that the proof is easy to understand. But cooking up the function g(x) is hard! Again,

Math Mantra: Part of the beauty of mathematics is coming up with those elegant ideas that work.

Rolle's Theorem has tons of fun applications, from fixing points to fixing bounds. If you happen to be free on a Friday night, I highly recommend sitting down and applying Rolle's Theorem to random functions. Who knows what you might find.
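The whole trick is the auxiliary function g, so here is a sketch (my own, with f(x) = x³ on [0, 2] as a made-up test case) that builds the same g, hunts for its interior extremum on a grid, and checks that the difference quotient of f there is close to the secant slope:

    def f(x):
        return x ** 3

    a, b = 0.0, 2.0
    slope = (f(b) - f(a)) / (b - a)           # secant slope, here 4.0

    def g(x):
        # the auxiliary function from the proof: zero at both endpoints
        return f(x) - f(a) - slope * (x - a)

    N = 200_001
    grid = [a + (b - a) * k / (N - 1) for k in range(N)]
    c = max(grid, key=lambda x: abs(g(x)))    # interior extremum of g

    h = 1e-6
    fprime_c = (f(c + h) - f(c - h)) / (2 * h)
    print(c, fprime_c, slope)                 # f'(c) should be close to the secant slope

For this f, the printed c is about 1.1547 (where 3c² = 4), and f′(c) agrees with the secant slope to several decimal places — exactly what the theorem promises.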
13.8 Applying Continuous Functions to Limit Solving
As promised, we take a closer look at the theorem:

Theorem. Let f be a continuous function on [a, b] and let (x_i) be a sequence with each

x_i ∈ [a, b]

and that converges to some limit:

x_i → x.

Then the mapped sequence converges to f(x):

f(x_i) → f(x).

Do you remember how, in high school, the chain rule gave you a vast new way to create derivatives? Same idea here. This theorem gives us a ton of new limits:

If we know the limit exists, we can apply a continuous operation to the sequence to solve for that limit.
Example. We can reprove

√(1 + √(1 + √(1 + ⋯))) = (1 + √5) / 2.

Consider the sequence

a_1 = 1,
a_{n+1} = √(1 + a_n).

We can still apply the Monotone Convergence Property to show the limit exists:

a_n → a

for some a. We want to solve for this a.

Look at

a_{n+1} = √(1 + a_n).

The function

f(x) = √(1 + x)

is continuous, so by our theorem, we have

f(a_n) → f(a),

which is by definition

a_{n+1} → √(1 + a).

Since we can prove that delaying a sequence by one term does not change its limit,

a_n → √(1 + a).

Now that

a_n → a,
a_n → √(1 + a),

by uniqueness of limits,

a = √(1 + a).

So we can solve for

a = (1 + √5) / 2. □
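Here is a quick numerical companion to this example (my own sketch, not part of the text): iterate a_{n+1} = √(1 + a_n) from a_1 = 1 and watch the terms home in on (1 + √5)/2.

    import math

    a = 1.0                                   # a_1 = 1
    target = (1.0 + math.sqrt(5.0)) / 2.0     # the golden ratio
    for n in range(1, 16):
        a = math.sqrt(1.0 + a)                # a_{n+1} = sqrt(1 + a_n)
        print(f"a_{n + 1} = {a:.10f}   |a - target| = {abs(a - target):.2e}")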
For the second example, recall one of your high school proofs that

lim_{n→∞} (1 + 1/n)ⁿ = e.

To prove this, you wrote

y = (1 + 1/n)ⁿ

and took the natural log:

ln(y) = n ln(1 + 1/n).

You visualized the right-hand side as a fraction:

ln(y) = ln(1 + 1/n) / (1/n).

Then, you took the limit as n approaches infinity to get

ln(y) = 0/0.

Because of this indeterminate form, you knew you could apply L'Hospital's rule. Differentiating the numerator and denominator of the right-hand side, you got:

[−1/(n² + n)] / [−1/n²] = n² / (n² + n) = n / (n + 1).

Taking the limit as n approaches infinity,

ln(y) = 1.

After raising both sides to the base e, we get

y = e.

Looking back, you should now realize this proof is totally sketch. There were quite a few steps we did not justify. For example, we set

y = (1 + 1/n)ⁿ.

Here, y starts off as a function of n. And then, all of a sudden, y is a constant:

y = e.

The sensible thing to say is that the limit of y is a constant, but we don't even know this constant exists! But ignoring that fail for a moment, notice that you really took limits and applied the same function to both sides, but you did so in different orders:

ln( lim_{n→∞} (1 + 1/n)ⁿ ) = lim_{n→∞} ln( (1 + 1/n)ⁿ ).

But who said you could do this? If we applied the same reasoning to the sequence

(1 + 1/n)

with the function

f(x) = { 1 : x = 1;  0 : x ≠ 1 },

we would get

f( lim_{n→∞} (1 + 1/n) ) = lim_{n→∞} f(1 + 1/n),

so

1 = 0.

OUCH.

The truth is, the magic comes from the continuity of ln.
Let's rewrite the beginning of the proof, rigorously.

Example.

lim_{n→∞} (1 + 1/n)ⁿ = e.

Proof: Let

a_n = (1 + 1/n)ⁿ.

First,

• The limit exists: This is because we can check, with induction, that

a_n = (1 + 1/n)ⁿ

is increasing and bounded above. By the Monotone Convergence Property,

a_n → y.

• Applying natural logs to both sides: Since

a_n → y

and ln is a continuous function, we know

ln(a_n) → ln(y).

Thus,

n ln(1 + 1/n) → ln(y).

• Solving for y: Using L'Hospital's rule, we can rigorously show that

n ln(1 + 1/n) → ln(y)

implies

n/(n + 1) → ln(y).

But we already know

n/(n + 1) → 1,

so

n/(n + 1) → 1
n/(n + 1) → ln(y).

By uniqueness of limits,

ln(y) = 1.

In conclusion,

y = e. □
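A numerical sanity check (my own, not the book's): the terms a_n = (1 + 1/n)ⁿ do crawl up toward e, though rather slowly.

    import math

    for n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
        a_n = (1.0 + 1.0 / n) ** n
        print(f"n = {n:>7}: a_n = {a_n:.8f}   e - a_n = {math.e - a_n:.2e}")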
Lastly, we can use the theorem to prove that the power tower

√2^(√2^(√2^⋰)) = 2.

We first have to notice that the left-hand side is shorthand for the limit of a sequence.

Example. The sequence defined by

a_1 = √2,
a_{n+1} = (√2)^{a_n} for all n ≥ 1

converges to 2.

Proof:

• The limit exists: This is because we can check, via induction, that (a_n) is an increasing sequence bounded above by 2. Thus,

a_n → y

for some y ∈ R.

• Applying natural logs to both sides: We also have

a_{n+1} → y,

and since ln is a continuous function, we know

ln(a_{n+1}) → ln(y).

Thus, because ln(a_{n+1}) = a_n ln(√2) = a_n · ln(2)/2,

a_n · ln(2)/2 → ln(y).

• Solving for y: By the scaling theorem for sequences, we know

a_n · ln(2)/2 → y · ln(2)/2.

But then

a_n · ln(2)/2 → y · ln(2)/2
a_n · ln(2)/2 → ln(y),

so by uniqueness of limits,

y · ln(2)/2 = ln(y).

To find y, we just have to solve

ln(y)/y = ln(2)/2.

Clearly, 2 is one solution. Taking the derivative of

ln(y)/y

we get

(1 − ln y)/y²,

so the function is strictly increasing on (0, e); hence 2 is the only solution in (0, 2]. Since all the terms in the sequence lie in the closed interval [0, 2], we know the limit

y ∈ [0, 2]

(and y > 0, since ln(y) makes sense). Thus, y must be 2. □
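And a sketch for the power tower (mine, purely illustrative): iterate a_{n+1} = (√2)^{a_n} and, separately, confirm that y = 2 really does satisfy the fixed-point equation ln(y)/y = ln(2)/2 from the proof.

    import math

    a = math.sqrt(2.0)                        # a_1 = sqrt(2)
    for n in range(1, 40):
        a = math.sqrt(2.0) ** a               # a_{n+1} = (sqrt 2)^(a_n)
    print("limit of the tower is about", a)   # approaches 2

    # the fixed-point equation from the proof: ln(y)/y = ln(2)/2 at y = 2
    print(math.log(2.0) / 2.0, math.log(a) / a)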
Lecture 14
Keeping an Open Mind
Sets aren't doors: they can be neither open nor closed!¹
-Leon Simon

Goals: Today, we generalize open and closed intervals in R to open and closed sets in Rⁿ. After defining open and closed sets, we give numerous examples. Then, we prove the fundamental relationship between the two notions as well as some basic properties.
14.1 All in the Intervals
On our journey from R to Rⁿ, there are a few things we gotta pack. First and foremost are open and closed intervals:

(a, b)    [a, b]

Why are we making such a big deal about these little things? They are vital to Calculus. Essentially,

• An open interval is a microcosm of the entire real number line and captures the essence of being

close but not touching.

No matter which point you choose in an open interval, it will never be at the end - there will always be points to the left and right.

• A closed interval forces nice properties on functions. For example,

Continuous functions on a closed interval are bounded.

Continuous functions on a closed interval achieve a maximum and a minimum.

¹On the other hand, see Hitler Learns Topology.
14.2 Open Intervals to Open Sets
To extend an open interval from R to Rⁿ we first need to ask ourselves,

What concept are we extending?

The answer is simple: we want to extend the idea that any point in an open interval is some positive distance from the end. But extending this concept requires that we first understand the shape of

the collection of all points that lie strictly within a fixed distance r from a center c.

Let's look at cases. In R, the collection of all points strictly within r of c is an open interval. In R², this is the set of all points within a circle. In R³, this is the set of all points within a sphere.

Generalizing, we define:

Definition. An open ball centered at c with radius r is the set of all points that are strictly within distance r of c:

B_r(c) = { y | ‖y − c‖ < r }.
Now that we have a formal definition of higher dimensional balls, how do we show that a point is inside a given set, but not at the end?

Picture an enclosed region representing our set and consider a point outside it. There is no way I can draw a ball around the point and have it lie entirely within the set. Also notice that if a point is on the boundary, we are in a similar scenario: no matter what ball we draw around the point, that ball will never be entirely contained in the set.

To exclude these two scenarios, we define a set to be open if, for any point in the set, we can draw a ball around that point so that the ball lies entirely within the set. In this way, the point is cushioned from the end of the set.

Definition. A set S is open if, for any x ∈ S, we can find a ball centered at x that lies entirely within S; that is,

B_r(x) ⊆ S

for some radius r > 0.

Now picture the same region as before but without its boundary: around any point in the set, we can draw a ball that lies entirely inside the set. So, schematically, we can see that this set is open.
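Since we can't literally draw balls in Rⁿ, here is a small helper (my own sketch, not the book's) that expresses the two definitions computationally: a membership test for B_r(c), and a crude random-sampling check of the "open" condition for a set given as a membership predicate.

    import math, random

    def in_ball(y, c, r):
        # y is in B_r(c) exactly when ||y - c|| < r
        return math.dist(y, c) < r

    def ball_inside_set(c, r, contains, trials=10_000):
        # crude check: sample points of B_r(c) and test each lies in the set;
        # a True answer is only evidence, not a proof (pictures aren't proofs!)
        random.seed(0)
        for _ in range(trials):
            y = [ci + random.uniform(-r, r) for ci in c]
            if in_ball(y, c, r) and not contains(y):
                return False
        return True

    # example: the open square {(x, y) : |x| < 1 and |y| < 1}
    square = lambda p: abs(p[0]) < 1 and abs(p[1]) < 1
    print(ball_inside_set([0.5, 0.5], 0.4, square))   # True: this ball fits
    print(ball_inside_set([0.9, 0.0], 0.4, square))   # False: this one pokes out

The square used here is exactly the set we are about to prove open in the next section — the code just foreshadows the radius-hunting we will do by hand.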
14.3 Verifying a Set is Open
Go into your closet and draw a picture.
When you have the right intuition in mind, destroy the evidence.
-Leon Simon
As a math teacher, I see a lot of bad proofs. Especially when students try to verify a set is open. They simply

• Draw some set with a dashed boundary.
• Draw some ball inside.
• Say they are done.

But,

Math Mantra: PICTURES AREN'T PROOFS!

Pictures are useful. I agree. But they aren't proofs! Namely because,

• Pictures can be deceiving! Here is an infamous example from Lovasz's Discrete Mathematics. By rearranging the pieces of a picture, you can "prove" 64 = 65.

• Sometimes you can't even visualize the picture! How do you even begin to imagine points in Rⁿ?

So how do we actually prove a set S is open? Here's the procedure:

Scratch work (closet-mode):

• Draw a schematic. Plot a simple 2D version of S. Pick a point c ∈ S and draw a ball around the point that lies entirely within S. It usually helps to draw the radius as big as possible (such that the ball is still entirely within S).

• Guess the radius. The radius is going to rely on the coordinates of the center c that you chose. If it's a constant, then you know you are constantly wrong.

Actual Proof:

• Show that a ball with this radius (around an arbitrary point) is still in the set. This is the hardest part. Take an arbitrary center c, and an arbitrary point x in the ball around c. Then show that x is still in the set. You typically need to add 0 to exploit extra information about the center.

Let's do a few examples:
Theorem. The region

S = { (x, y) ∈ R² : |x| < 1 and |y| < 1 }

is an open set.

Proof Summary:

• Choose an arbitrary center (c_x, c_y).
• Define the radius r as the shortest distance from the center to any side of the box.
• Form a ball and let (a, b) be in that ball.
• Show |a| < 1: add zero and condition on the sign of c_x.
• Show |b| < 1: add zero and condition on the sign of c_y.

Scratch work:

Draw a schematic: We plot the region, pick a random point, and draw a ball around it with radius as big as possible.

[Figure: the open square with corners (±1, ±1), a point inside, and the largest ball around it that still fits]

Guess a radius: In our schematic, the largest possible radius is the closest distance to the right side. But that is because the ball is placed very close to the right side! Generally, given position (c_x, c_y), the radius should be the distance to the closest of the four sides:

1 − c_x (right),   c_x − (−1) (left),   1 − c_y (top),   c_y − (−1) (bottom).

So, given center (c_x, c_y), the radius is

r = min{ c_x + 1 (left),  1 − c_y (top),  c_y + 1 (bottom),  1 − c_x (right) }.
Proof:

Show the ball is entirely within the set: Let (c_x, c_y) be an arbitrary point in the square. We want to show

B_r(c_x, c_y) ⊆ S,

where r is the radius suggested by the scratch work. Choose an arbitrary

(a, b) ∈ B_r(c_x, c_y).

We have to show (a, b) is still in the square S. This means we have to prove |a| < 1 and |b| < 1.

|a| < 1:

First we add 0:

|a| = |a + c_x − c_x|.

By the triangle inequality, this is bounded above by

|a − c_x| + |c_x|.

But we know

|a − c_x| < r,

so

|a| < r + |c_x|.

Now, we have one of two cases, depending on whether our point is closer to the right or left side of the square:

Case 1: c_x ≥ 0

By our definition of r,

r ≤ 1 − c_x   (right),

since r is the smallest of the four quantities. Thus,

|a| < r + c_x ≤ 1 − c_x + c_x = 1.

Case 2: c_x < 0

Notice that

|c_x| = −c_x.

Since

r ≤ c_x + 1   (left),

we have

|a| < r + |c_x| = r − c_x ≤ c_x + 1 − c_x = 1.

|b| < 1:

By the same tricks as before, we have

|b| < r + |c_y|.

Again, we have two cases, depending on whether our point is closer to the top or to the bottom of the square.

Case 1: c_y ≥ 0

Since

r ≤ 1 − c_y   (top),

we have

|b| < r + c_y ≤ 1 − c_y + c_y = 1.

Case 2: c_y < 0

Since

r ≤ c_y + 1   (bottom),

we have

|b| < r + |c_y| = r − c_y ≤ c_y + 1 − c_y = 1. □

Notice that this proof models our intuitions exactly: the radius depends on the nearest side to the point we choose. The only catch is that we had to make our picture into a rigorous proof.
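The radius formula from the scratch work is also easy to test numerically. This sketch (my own, not from the text) computes r = min{c_x + 1, 1 − c_x, c_y + 1, 1 − c_y} for random centers in the square and samples the resulting ball to confirm it stays inside:

    import random

    def radius(cx, cy):
        # distance from (cx, cy) to the nearest side of the square |x| < 1, |y| < 1
        return min(cx + 1.0, 1.0 - cx, cy + 1.0, 1.0 - cy)

    random.seed(1)
    all_inside = True
    for _ in range(1_000):
        cx, cy = random.uniform(-1, 1), random.uniform(-1, 1)
        r = radius(cx, cy)
        for _ in range(100):
            # sample a point of the box around (cx, cy), keep it only if it is in the ball
            a, b = cx + random.uniform(-r, r), cy + random.uniform(-r, r)
            if (a - cx) ** 2 + (b - cy) ** 2 < r ** 2:
                all_inside = all_inside and abs(a) < 1 and abs(b) < 1
    print(all_inside)   # True: every sampled ball point stayed in the square

Again, the sampling is reassurance only; the case analysis above is what actually proves it.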
But the last example was in only two dimensions.¹ How about an example in n dimensions? Sure! Notice that we defined an open ball. To quote Professor Simon,

This would be a completely bone-headed name if an open ball weren't open.

Let's check that we weren't bone-headed:

Theorem. The ball in Rⁿ centered at c with radius R,

B_R(c),

is open.

Proof Summary:

• Choose an arbitrary center x.
• Define the radius r as the difference between the radius of the giant ball and the distance from x to the giant ball's center, c.
• Form a ball at x and let a be in that ball.
• Show ‖a − c‖ < R: add zero and apply the triangle inequality.

Scratch work:

Draw a schematic: We draw an open ball. Now, we would like to choose a point and draw a ball with the radius as big as possible. So what do we do? First, choose your point x ∈ B_R(c). Then, draw the radius through that point. Then, swing the portion of the radius that connects your chosen point to the edge. Ta-da!

[Figure: the ball B_R(c), a point x inside it, and the leftover piece of the radius through x swung around x]

Guess a radius: The radius should be the length of this swung segment. Isolating this segment, notice that it is just the difference between the full radius R and the distance from the point x to the center c:

r = R − ‖x − c‖.

¹The general proof for an n-dimensional box would have actually cut our proof in half: we would prove an arbitrary component is in the box instead of handling the two cases |a|, |b| separately. Try it!
Proof:

Show the ball is entirely within the set: Let x be an arbitrary point in the giant open ball and let r be the radius suggested by the picture. Choose an arbitrary

a ∈ B_r(x).

To show

a ∈ B_R(c),

we must prove

‖a − c‖ < R.

Adding zero,

a − c = a − x + x − c.

Then, by the triangle inequality,

‖a − c‖ ≤ ‖a − x‖ + ‖x − c‖.

But a ∈ B_r(x) means ‖a − x‖ < r, thus

‖a − x‖ + ‖x − c‖ < r + ‖x − c‖.

By our choice of r,

r + ‖x − c‖ = R − ‖x − c‖ + ‖x − c‖ = R,

so

‖a − c‖ < R

as needed. □
14.4 Closed Intervals to Closed Sets
How do we extend the notion of a closed interval from R to Rⁿ? Last lecture, we proved the property:

If a sequence in a closed interval converges, then it must converge to a point in that closed interval.

This is exactly the property we will use to define closed sets in general. This definition will give us all those nice continuous function properties and will ensure that we have maxima and minima. First, we define convergence for a sequence of vectors:

Definition. We say L is the limit of the sequence of vectors (x_i) if for any ε > 0 there exists an integer N such that for all i ≥ N,

‖x_i − L‖ < ε.

Then,

Definition. A set S is closed if any convergent sequence in S converges to a point in S; that is, if

x_1, x_2, ... ∈ S

and

x_i → x,

then x ∈ S as well.

You've heard the expressions

Closed under addition.

and

Closed under multiplication.

When we say closed set, we mean

Closed under limits, i.e. closed under taking the limit of convergent sequences.

Whatever happens in Vegas, stays in Vegas. Whatever happens to a convergent sequence of points in a closed set, the limit stays in the set. Pictorially, think of a sequence getting squeezed to a point. That point can't slip out of the set.
14.5 Verifying a Set is Closed
Here's the general strategy for proving that a set S is closed:

• Consider an arbitrary convergent sequence in S.
• Suppose its limit is not in S.
• Derive a contradiction.

Let's do a few examples:

Consider the xy-plane in any dimension. For example, in R³, the xy-plane is just the floor. We can show that the xy-plane (the span of the vectors e_1, e_2 ∈ Rⁿ) always forms a closed set in Rⁿ, no matter the value of n:

Theorem. Consider e_1, e_2 ∈ Rⁿ. Then

S = span{e_1, e_2}

is a closed set in Rⁿ.

Proof Summary:

• Consider an arbitrary convergent sequence and suppose the limit c is not in S.
• Then c has a non-zero component c_j for some j > 2.
• Use the ε-condition with ε = |c_j| to derive a contradiction.

Proof: Consider a sequence of vectors (x_i) such that each x_i ∈ S and

x_i → c.

Suppose c ∉ S. This means c has a non-zero component at some position j > 2:
c = (c_1, c_2, ..., c_j, ...)   with c_j NON-ZERO.

By convergence, we know that for any ε > 0 there exists a corresponding integer N such that for all i ≥ N,

‖x_i − c‖ < ε.

Choose ε = |c_j|. Then there exists some N > 0 such that for all i ≥ N,

‖x_i − c‖ < |c_j|.

In particular,

‖x_N − c‖ < |c_j|.

Notice that the norm is the square root of a sum of squares:

‖x_N − c‖ = √( ((x_N)_1 − c_1)² + ((x_N)_2 − c_2)² + ⋯ + ((x_N)_n − c_n)² ).

In particular, the expression in the square root is at least as large as the square coming from the j-th component; so,

‖x_N − c‖ ≥ √( ((x_N)_j − c_j)² ).

By definition of S, all components of a vector in S after the first two are zero. Thus,

(x_N)_j = 0,

and so

√( ((x_N)_j − c_j)² ) = √(c_j²) = |c_j|.

Therefore,

‖x_N − c‖ ≥ |c_j|.

But we already found

‖x_N − c‖ < |c_j|,

a contradiction. We conclude that c ∈ S. □
The final example is one we will need when tackling quadratic forms:

Theorem. The unit sphere

S = { x ∈ Rⁿ | ‖x‖ = 1 }

is a closed set in Rⁿ.

Proof Summary:

• Consider an arbitrary convergent sequence and suppose the limit x is not in S.
• WLOG, let ‖x‖ < 1.
• Use the ε-condition with ε = 1 − ‖x‖ to derive a contradiction.

Proof: Let

x_i → x,

where each x_i ∈ S. Suppose x ∉ S. Then ‖x‖ ≠ 1. Assume ‖x‖ < 1 (the proof for ‖x‖ > 1 follows similarly). Applying the ε-condition of convergence with

ε = 1 − ‖x‖,

there is an N such that for all i ≥ N, we have

‖x_i − x‖ < 1 − ‖x‖.

In particular,

‖x_N − x‖ < 1 − ‖x‖.

Writing

x_N = x_N − x + x,

we apply the triangle inequality on the right-hand side:

‖x_N‖ ≤ ‖x_N − x‖ + ‖x‖ < (1 − ‖x‖) + ‖x‖ = 1.

Therefore,

‖x_N‖ < 1,

contradicting that x_N ∈ S! We conclude that x ∈ S. □
14.6 Open and Closed Sets on R
On R, we have more open sets and closed sets than just open intervals and closed intervals. For example, we will prove that the union of two open sets is open,

(0, 1) ∪ (2, 3),

and the union of two closed sets is closed,

[0, 1] ∪ [2, 3].

In Math 115, you will learn that

any non-empty open set in R is a countable union of disjoint open intervals.

Closed sets, on the other hand, cannot be as easily classified:¹

¹You will see the classic example of the Cantor Set in Math 171.
Theorem. The set

S = { 1, 1/2, 1/3, 1/4, ... } ∪ {0}

is closed.

Proof Summary:

• Consider an arbitrary convergent sequence in S with limit x.
• We condition on whether this sequence has a number that repeats infinitely many times.
• Case 1: There is at least one infinitely repeated number.
  – Consider an infinitely repeated number and call it T. Use the ε-condition to show x = T, so x ∈ S.
• Case 2: There are no infinitely repeated numbers.
  – We can construct a strictly decreasing subsequence: if not, then some term would repeat infinitely many times (pigeonhole principle).
  – This subsequence is also a subsequence of (1/n) and so converges to 0.
  – Since the original sequence is convergent, it must have the same limit as any subsequence. So x = 0 ∈ S.

Proof: Let

x_i → x,

where each x_i ∈ S. Then we have one of two cases:

Case 1: There is at least one infinitely repeated number.

Let T be one such infinitely repeated number in the sequence:

..., T, ..., T, ..., T, ....

Then for any ε > 0, there exists N such that for all n ≥ N,

|x_n − x| < ε.

Eventually, T shows up in the sequence beyond index N (else, T would appear only finitely many times). Then,

|T − x| < ε.

Since ε is arbitrary,

T = x.

But T ∈ S, so x ∈ S.
Case 2: There are no infinitely repeated numbers.

We are going to construct a decreasing subsequence of (x_i) that converges to 0. Then, using the fact that

If a sequence converges to a limit, then every subsequence converges to the same limit,

we can conclude (x_i) also converges to 0. But this means x = 0. Let's start:

It cannot be the case that every term in (x_i) is zero, for then 0 would appear infinitely many times. Thus, we can choose x_{n_1} to be the first non-zero term in (x_i). Now, I claim we can find a later term in the sequence that is non-zero and smaller than x_{n_1}:

Suppose not. Then all the remaining terms in the sequence

x_{n_1}, x_{n_1 + 1}, x_{n_1 + 2}, x_{n_1 + 3}, ...

are either 0 or greater than (or equal to) x_{n_1}. But there are only finitely many numbers in S that equal 0 or are greater than (or equal to) x_{n_1}, namely

{ 1, 1/2, 1/3, ..., x_{n_1} } ∪ {0}.

Then the infinitely many remaining terms in the sequence must all come from this finite set, so at least one of these numbers is repeated infinitely many times in the sequence, a contradiction.

If you didn't quite catch that, here's an example:

If x_{n_1} = 1/5 and there are no non-zero terms less than 1/5 in the rest of the sequence, then the infinitely many later terms can only be

1, 1/2, 1/3, 1/4, 1/5, 0.

So at least one of these numbers will be repeated infinitely many times (pigeonhole)! Thus, we have a contradiction.

Inductively applying the preceding argument, we can construct a decreasing non-zero subsequence

x_{n_1}, x_{n_2}, x_{n_3}, ....

Moreover, this is a subsequence of (1/n). Since

1/n → 0,

we conclude

x_{n_i} → 0.

It follows that the larger sequence (x_i) converges to 0. Since

x_i → 0,
x_i → x,

we have

x = 0.

Therefore, x ∈ S.

In either case, we conclude x ∈ S, as needed. □
14.7 Open vs. Closed
The fundamental relationship between open and closed sets is:

The complement of an open set is a closed set, and the complement of a closed set is open.

Why is this so important?

• If we prove facts about one type of set, we get analogous results about the other.
• Instead of proving a set is one type directly, it may be easier to prove the complement is the other type.

We will demonstrate both of these advantages in the next section. But for now, let's prove the theorem. Before we begin, be sure that you understand how to negate a logical quantifier. In particular,

• The negation of

Every convergent sequence converges to a point in the set

is

There exists a convergent sequence that does not converge to a point in the set.

Note that

does not converge to a point in the set

is the same as

converges to a point outside the set.

• The negation of

For every x, there exists an open ball around x contained in the set

is

There exists x such that every open ball around x is not contained in the set.

Note that

not contained

is the same as saying

non-empty intersection with the complement,

i.e. there exists some point in the ball that is not in the set.
Theorem. A set S is open if and only if its complement S^c is closed.

Proof Summary:

• The complement of an open set is a closed set:
  – Suppose not. Then there exists a sequence of points in S^c that converges to a point x ∈ S.
  – Since S is open, form an ε-ball around x.
  – Contradict sequence convergence with that ε.
• The complement of a closed set is open:
  – Suppose not. Then for some point x ∈ S, every δ-ball around x has non-empty intersection with S^c.
  – Discretize the delta: using balls centered at x with radii δ = 1, 1/2, 1/3, ..., construct a sequence of points in S^c that lie in these balls.
  – The limit of this sequence is x ∈ S, contradicting S^c being closed.
Proof:

(⇒) Let S be open. Suppose the complement S^c is not closed. Then, by negating the definition of closedness, there exists some sequence of points in S^c that converges to a point not in S^c. So there are

x_1, x_2, ... ∈ S^c

such that

x_i → x

and x ∉ S^c. But x ∉ S^c just means x ∈ S. Now use the fact that S is open to get some ball around x contained in S:

B_ε(x) ⊆ S

for some ε > 0. Using the definition of convergence with this ε, we know there exists an N such that ‖x_i − x‖ < ε for all i ≥ N. But this would mean each

x_i ∈ B_ε(x),

so

x_i ∈ S

for all i ≥ N. However, every point of the sequence is in S^c, a contradiction.

(⇐) Let S^c be closed. Suppose S is not open. By negating the definition of open, there exists some point x ∈ S such that for any δ > 0, B_δ(x) is not contained in S. This means that there exists some point in the ball that is not in S (which is the same as saying it is in S^c). Therefore, for any δ > 0 we can find some y such that

y ∈ B_δ(x) and y ∈ S^c.

Now we do a trick known as discretizing the delta. Particularly, we choose

δ_1 = 1
δ_2 = 1/2
δ_3 = 1/3
δ_4 = 1/4
⋮

to construct a sequence in S^c:

x_1 ∈ B_1(x) and x_1 ∈ S^c
x_2 ∈ B_{1/2}(x) and x_2 ∈ S^c
x_3 ∈ B_{1/3}(x) and x_3 ∈ S^c
x_4 ∈ B_{1/4}(x) and x_4 ∈ S^c
⋮

Notice that (x_i) converges to x: by definition of the balls,

‖x_1 − x‖ < 1
‖x_2 − x‖ < 1/2
⋮
‖x_N − x‖ < 1/N
‖x_{N+1} − x‖ < 1/(N + 1) < 1/N
‖x_{N+2} − x‖ < 1/(N + 2) < 1/N
⋮

So given ε > 0, simply choose N such that 1/N < ε. Thus

x_i → x.

But each x_i ∈ S^c yet x ∈ S, contradicting that S^c is closed! Thus, S is open. □
Examine this proof very closely! Especially be sure you know how to negate the definitions of open and closed. If you still think that the negation of open is simply closed, then your mind's not open and you should just close the door on your mathematical future.

By the way, we can plug the complement set S^c into this theorem statement to get

A set S^c is open if and only if its complement (S^c)^c is closed.

Using the fact that the complement of the complement is the original set,

(S^c)^c = S,

we automatically have:

Theorem. A set S is closed if and only if its complement S^c is open.
14.8 Open and Closed Set Properties
There are a few basic properties you need to know about open sets: the first is a two-second proof:
Theorem. The whole space Rⁿ and the empty set ∅ are open.

Proof: By definition, any ball lies completely in Rⁿ, so Rⁿ is open. To prove that ∅ is open, we have to check that any point in ∅ can be encircled in a ball that lies entirely in ∅. But there are no points to check, so the fact that ∅ is open is vacuously¹ true. □

¹In general, any statement of the form "for all x ∈ S, property P(x) holds" is vacuously true if S is empty.
The second property is:

Theorem. The intersection of finitely many open sets is open.

Proof Summary:

• If the intersection is empty, then this is vacuously true.
• Otherwise, take an arbitrary point x in the intersection.
• For each i, x ∈ A_i. Thus, we can form an open ball with radius δ_i contained in A_i.
• The open ball with radius min{δ_1, δ_2, ..., δ_n} is contained in the intersection.

Proof: Let

A_1, A_2, ..., A_n

be a collection of open sets. If

A_1 ∩ A_2 ∩ ⋯ ∩ A_n = ∅,

then by the preceding theorem,

A_1 ∩ A_2 ∩ ⋯ ∩ A_n

is open. Otherwise, let

x ∈ A_1 ∩ A_2 ∩ ⋯ ∩ A_n.

By definition of intersection,

x ∈ A_1
x ∈ A_2
⋮
x ∈ A_n.

Since each A_i is open, there exist δ_1, δ_2, ..., δ_n > 0 such that

B_{δ_1}(x) ⊆ A_1
B_{δ_2}(x) ⊆ A_2
⋮
B_{δ_n}(x) ⊆ A_n.

Choose δ = min{δ_1, δ_2, ..., δ_n}. Then,

B_δ(x) ⊆ B_{δ_1}(x) ⊆ A_1
B_δ(x) ⊆ B_{δ_2}(x) ⊆ A_2
⋮
B_δ(x) ⊆ B_{δ_n}(x) ⊆ A_n.

This means

B_δ(x) ⊆ A_1 ∩ A_2 ∩ ⋯ ∩ A_n.

Thus,

A_1 ∩ A_2 ∩ ⋯ ∩ A_n

is open. □
The third and final property involves a new concept, an indexing set. Don't panic! Up until now, you've been indexing sets by natural numbers:

A_1, A_2, A_3, ...

So the union would be

⋃_{i=1}^{∞} A_i.

But we can also represent this union using the notation

⋃_{i∈N} A_i.

In fact, instead of using N to index our sets, we can use even bigger sets. For example, the reals:

A_{.005}, A_{.92}, A_π, ..., A_e, A_{8.6}, ..., A_{7.321}, A_{9.7}, A_{11.2}, ...

The union of all these sets is

⋃_{i∈R} A_i.

Generalizing, we can index over any set Ω and represent the union as

⋃_{i∈Ω} A_i.

In fact, the indexing set doesn't even need to contain numbers. For example, just let

Ω = {red, white, blue}

and define

A_red = {1}
A_white = {2}
A_blue = {3}.

Then,

⋃_{i∈Ω} A_i = A_red ∪ A_white ∪ A_blue = {1, 2, 3}.

In summary, an indexing set is just an organizational scheme. It's nothing but a matter of notation. By the way, when taking unions over huge indexing sets, you shouldn't think of physically unioning each set, one at a time. Always recall Professor Simon's saying,

Math Mantra: Math is NOT a mystical study.

We are humans and we can't physically union infinitely many times. But we can consider collections with infinitely many elements. Nothing wrong there. So

⋃_{i∈Ω} A_i

simply translates to the set of all points that are in at least one of the A_i:

{ x | x ∈ A_i for some i ∈ Ω }.

This is just the word arbitrary in disguise, since the points can be in any arbitrary A_i. That's why they call this an arbitrary union.
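If the notation ⋃_{i∈Ω} A_i still feels abstract, a dictionary indexed by an arbitrary set says the same thing in code (my own analogy, nothing more): the union is just "the set of points that show up in at least one A_i."

    # an indexing set that isn't made of numbers, exactly as in the text
    A = {
        "red":   {1},
        "white": {2},
        "blue":  {3},
    }

    union = {x for i in A for x in A[i]}   # { x : x in A_i for some i in Omega }
    print(union)                           # {1, 2, 3}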
The third property states that the arbitrary union of open sets is still open:

Theorem. Let

{ A_i | i ∈ Ω }

be a collection of open sets. Then the arbitrary union

⋃_{i∈Ω} A_i

is open.

Proof Summary:

• Take an arbitrary point in this union.
• This point lies in at least one open set A_i. So we can form an open ball around that point contained in A_i.
• This ball is contained in the full union.

Proof: Let

x ∈ ⋃_{i∈Ω} A_i.

Then x ∈ A_j for some j ∈ Ω. But A_j is open, so we can find a ball

B_ε(x) ⊆ A_j.

Therefore,

B_ε(x) ⊆ ⋃_{i∈Ω} A_i.

Since the choice of x was arbitrary, we can conclude that the union

⋃_{i∈Ω} A_i

is open. □
Now, using the awesome fundamental relationship between open and closed sets (and De Morgan's laws), we instantly have:

Theorem. The following are true:

1. The whole space Rⁿ and the empty set ∅ are closed.
2. The union of finitely¹ many closed sets is closed.
3. The arbitrary intersection of closed sets is closed.

Proof: We only prove (3) since the others follow similarly.

Let

{ A_i | i ∈ Ω }

be a collection of closed sets. Then each A_i^c is open, and the arbitrary union of open sets is open, so

⋃_{i∈Ω} A_i^c

is open. But on this week's problem set, you will prove De Morgan's laws for an arbitrary union of sets:

⋃_{i∈Ω} A_i^c = ( ⋂_{i∈Ω} A_i )^c.

Taking complements,

( ⋃_{i∈Ω} A_i^c )^c = ( ( ⋂_{i∈Ω} A_i )^c )^c,

so

( ⋃_{i∈Ω} A_i^c )^c = ⋂_{i∈Ω} A_i.

Thus

⋂_{i∈Ω} A_i

is the complement of an open set, so it's closed. □

¹As an exercise, construct counterexamples to show that the arbitrary union of closed sets need not be closed and the arbitrary intersection of open sets need not be open.
Using these properties and our fundamental relation, we get a much easier proof of one of our examples:

Theorem. The set

S = { 1, 1/2, 1/3, 1/4, ... } ∪ {0}

is closed.

Proof: Notice that

S^c = (−∞, 0) ∪ (1, ∞) ∪ (1/2, 1) ∪ (1/3, 1/2) ∪ (1/4, 1/3) ∪ ⋯

or, concisely,

S^c = (−∞, 0) ∪ (1, ∞) ∪ ⋃_{i∈N} ( 1/(i + 1), 1/i ).

But open intervals are open (they are one-dimensional open balls). So their arbitrary union is open and thus S^c is open. The complement (S^c)^c = S is then closed. □

This was a lot easier than directly checking that S is closed. Typically,

Math Mantra: Instead of appealing to the original definition, you can save yourself a lot of trouble by using theorems you have already proven.
New Notation

| Symbol | Reading | Example | Example Translation |
|---|---|---|---|
| $B_r(c)$ | The open ball of radius $r$ centered at $c$ | $B_1(\vec 0) \cap M = \emptyset$ | The intersection of the unit ball at $\vec 0$ with $M$ is empty. |
| $\bigcup_{i \in \Lambda} A_i$ | The arbitrary union of $A_i$ indexed by $\Lambda$ | $\bigcup_{i \in \mathbb{N}} A_i = \mathbb{R}$ | The arbitrary union of $A_i$ indexed by the natural numbers is $\mathbb{R}$. |
Lecture 15

Continuing from $\mathbb{R}$ to $\mathbb{R}^n$

"The difficulty of the H-series is not the jump from single variable to multivariable. That's easy. What makes it hard is the jump from a non-rigorous treatment to a rigorous one."
- B F SCHO ([ ])

Goals: Today, we continue extending our results from $\mathbb{R}$ to $\mathbb{R}^n$. Particularly, we prove the n-dimensional Bolzano-Weierstrass Theorem and define continuous multivariable functions. The proofs of these extensions will either mimic the 1-dimensional proofs exactly or follow from repeated application of the single variable theorems.

15.1 Plans of Ascension

We are so close to completely extending our work in $\mathbb{R}$ to $\mathbb{R}^n$. So far we have extended the domains of our functions. Namely,

Open and closed intervals in $\mathbb{R}$ have become open and closed sets in $\mathbb{R}^n$.

These domains will ensure that our functions have certain nice properties (like the existence of maxima and minima). Next, we are going to introduce continuity for multivariable functions. But to prove facts about continuous multivariable functions, we need to reincarnate Bolzano-Weierstrass in the world of $\mathbb{R}^n$:

Every bounded sequence of vectors has a convergent subsequence.

As is often the case with n-dimensional analogues, the general theorem is just a consequence of the 1-dimensional theorem. But in order to apply the 1D result, we must first prove a fundamental relation between the convergence of n-dimensional sequences and the convergence of their 1-dimensional component sequences:

A sequence of vectors converges if and only if each component sequence converges.

Seems intuitive, right? For example, the sequence
$$\begin{pmatrix} 0 \\ 4 \\ 25 \\ 100 \end{pmatrix}, \; \begin{pmatrix} 0 \\ 2 \\ 5 \\ 10 \end{pmatrix}, \; \begin{pmatrix} 0 \\ 1 \\ 1 \\ 1 \end{pmatrix}, \; \begin{pmatrix} 0 \\ .5 \\ .2 \\ .1 \end{pmatrix}, \; \begin{pmatrix} 0 \\ .25 \\ .04 \\ .01 \end{pmatrix}, \; \ldots$$
converges to
$$\begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},$$
and each component sequence converges to 0:
$$0, \; 0, \; 0, \; 0, \; 0, \; \ldots$$
$$4, \; 2, \; 1, \; .5, \; .25, \; \ldots$$
$$25, \; 5, \; 1, \; .2, \; .04, \; \ldots$$
$$100, \; 10, \; 1, \; .1, \; .01, \; \ldots$$
The intuition is obvious. The only thing difficult about this theorem is the notation. Everyone freaks out about the whole superscript-subscript thing. But it's obvious if you just stare at it:
$$\underbrace{\begin{pmatrix} x^1_1 \\ x^1_2 \\ x^1_3 \\ \vdots \\ x^1_n \end{pmatrix}}_{x^1}, \; \underbrace{\begin{pmatrix} x^2_1 \\ x^2_2 \\ x^2_3 \\ \vdots \\ x^2_n \end{pmatrix}}_{x^2}, \; \underbrace{\begin{pmatrix} x^3_1 \\ x^3_2 \\ x^3_3 \\ \vdots \\ x^3_n \end{pmatrix}}_{x^3}, \; \ldots$$
Given the number $x^i_j$:
• The superscript $i$ denotes the $i$-th vector in our sequence, $x^i$.
• The subscript still means the $j$-th component of that vector, $x^i_j$.
Also, when we write
$$x^{(i)}_j \to x_j$$
we put extra parentheses around $i$ to show that the sequence is indexed by $i$ and not $j$. Notice that none of this is new mathematics. We are just agreeing on conventions! And always remember:
Math Mantra: Don't be afraid of new notation! Treat notation like an adorable puppy on the street; stare at it and then move on.

Now, back to the main theorem:

Theorem. Let $(x^i)$ be a sequence with each $x^i \in \mathbb{R}^n$. Then,
$$x^i \to x$$
if and only if, for any component $j$, the $j$-th component sequence
$$x^{(i)}_j \to x_j.$$

Proof Summary:
($\Rightarrow$)
• Let $\varepsilon > 0$.
• Expand the definition of $\|x^i - x\|$ as the square root of a sum of squares. The norm is bigger than the root of a single square.
($\Leftarrow$)
• Let $\varepsilon > 0$.
• Choose $\varepsilon_j = \frac{\varepsilon}{\sqrt{n}}$ in the convergence definition for each component to get corresponding integers $N_1, N_2, \ldots, N_n$.
• Set $N = \max\{N_1, N_2, \ldots, N_n\}$.
• Expand $\|x^i - x\|$ and show that for $i \geq N$, this norm is bounded by $\varepsilon$.

Proof:
($\Rightarrow$) Let $\varepsilon > 0$. By definition of vector sequence convergence, there is an $N$ such that for all $i \geq N$,
$$\|x^i - x\| < \varepsilon.$$
Expanding the definition of norm, we get a square root of a sum of squares:
$$\|x^i - x\| = \sqrt{(x^i_1 - x_1)^2 + (x^i_2 - x_2)^2 + \ldots + (x^i_n - x_n)^2}.$$
In particular, the norm is bigger than (or equal to) the root of one of its squares. So for each component $j$,
$$\|x^i - x\| \geq \sqrt{\left(x^i_j - x_j\right)^2} = \left|x^i_j - x_j\right|.$$
Thus, for each component $j$,
$$\left|x^i_j - x_j\right| < \varepsilon$$
for all $i \geq N$.
($\Leftarrow$) For each component $j$, we are given
$$x^{(i)}_j \to x_j.$$
Let $\varepsilon > 0$. We want to find an $N$ such that we can guarantee, for all $i \geq N$,
$$\|x^i - x\| = \sqrt{(x^i_1 - x_1)^2 + (x^i_2 - x_2)^2 + \ldots + (x^i_n - x_n)^2}$$
is bounded by $\varepsilon$. To do this, we are going to use the convergence definition of each component to bound each of the squares in the square root.

By definition¹ of convergence in component $j$, for any $\varepsilon_j > 0$, there exists an $N_j$ such that
$$\left|x^i_j - x_j\right| < \varepsilon_j$$
for all $i \geq N_j$. Choose
$$\varepsilon_1 = \varepsilon_2 = \ldots = \varepsilon_n = \frac{\varepsilon}{\sqrt{n}}$$
to get corresponding $N_1, N_2, \ldots, N_n$. Let
$$N = \max\{N_1, N_2, \ldots, N_n\}.$$
Now, for $i \geq N$,
$$\|x^i - x\| = \sqrt{\underbrace{(x^i_1 - x_1)^2}_{< \varepsilon^2/n} + \underbrace{(x^i_2 - x_2)^2}_{< \varepsilon^2/n} + \ldots + \underbrace{(x^i_n - x_n)^2}_{< \varepsilon^2/n}} < \sqrt{\underbrace{\tfrac{\varepsilon^2}{n} + \tfrac{\varepsilon^2}{n} + \ldots + \tfrac{\varepsilon^2}{n}}_{n\text{-times}}} = \sqrt{\varepsilon^2} = \varepsilon.$$
Thus,
$$\|x^i - x\| < \varepsilon. \quad \blacksquare$$

¹ Notice that I introduced subscripts to the $\varepsilon$. Each component converges. So applying the definition of convergence to each component, each component gets its own choice of $\varepsilon$ and a corresponding $N$. Subscripts help us keep track of these $\varepsilon$'s and $N$'s.

Fun fact: this result does not hold true for infinite dimensional vectors. If you remember the preface, you'll recall that the counterexample embodies what mathematics means to me.
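As a quick numerical sanity check (my own sketch, not part of the text), we can watch both the vector error $\|x^i - x\|$ and every component error shrink for the example sequence from the start of this section.

```python
import math

# The example sequence x^i = (0, 4/2^(i-1), 25/5^(i-1), 100/10^(i-1)) -> (0, 0, 0, 0).
def x(i):
    return [0.0, 4.0 / 2 ** (i - 1), 25.0 / 5 ** (i - 1), 100.0 / 10 ** (i - 1)]

limit = [0.0, 0.0, 0.0, 0.0]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

for i in [1, 3, 5, 7, 9]:
    diff = [a - b for a, b in zip(x(i), limit)]
    comp_errors = [abs(c) for c in diff]
    # The vector error dominates every component error, and both tend to 0.
    print(i, round(norm(diff), 6), [round(e, 6) for e in comp_errors])
```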
15.2 How Bolzano-Weierstrass Should Not be Proven

Now, before we bring Bolzano back, I would like to show a common incorrect proof of Bolzano-Weierstrass for $\mathbb{R}^n$:
Theorem (Bolzano-Weierstrass). Let $(x^i)$ be a bounded sequence of vectors in $\mathbb{R}^n$. Then this sequence has a convergent subsequence.

BAD PROOF: We are given that the sequence $(x^i)$ is bounded, which simply means that there exists a constant $K$ such that
$$\|x^i\| < K$$
for all $i$. But then this would mean that the component sequences are bounded. Indeed,
$$\|x^i\| = \sqrt{(x^i_1)^2 + (x^i_2)^2 + \ldots + (x^i_n)^2}$$
implies that for each $j$,
$$\|x^i\| \geq \sqrt{(x^i_j)^2} = |x^i_j|.$$
Hence,
$$|x^i_j| < K$$
for all $i$. Therefore, each component sequence is bounded:
$$x^1_1, \; x^2_1, \; x^3_1, \; \ldots$$
$$x^1_2, \; x^2_2, \; x^3_2, \; \ldots$$
$$x^1_3, \; x^2_3, \; x^3_3, \; \ldots$$
$$\vdots$$
Applying Bolzano-Weierstrass to each component sequence, we get
$$x^{n_1}_1, \; x^{n_2}_1, \; x^{n_3}_1, \; \ldots$$
$$x^{n_1}_2, \; x^{n_2}_2, \; x^{n_3}_2, \; \ldots$$
$$x^{n_1}_3, \; x^{n_2}_3, \; x^{n_3}_3, \; \ldots$$
$$\vdots$$
where the $j$-th component subsequence converges to $x_j$,
$$x^{(n_i)}_j \to x_j$$
for some $x_j$. Isolating each $x^{n_i}$ (that is, grouping the columns of this array into the vectors $x^{n_1}, x^{n_2}, x^{n_3}, \ldots$), we have a subsequence
$$x^{(n_i)} \to x.$$
This proof is rubbish. Why?

We made one really bad step. When we applied Bolzano-Weierstrass, we wrote:
$$x^{n_1}_1, \; x^{n_2}_1, \; x^{n_3}_1, \; \ldots$$
$$x^{n_1}_2, \; x^{n_2}_2, \; x^{n_3}_2, \; \ldots$$
$$x^{n_1}_3, \; x^{n_2}_3, \; x^{n_3}_3, \; \ldots$$
$$\vdots$$
Here, each component subsequence picks out the same indices:
$$n_1, n_2, \ldots$$
But that's a huge assumption. Applying Bolzano-Weierstrass to each component sequence only guarantees that each component sequence will have its own convergent subsequence. But we don't know that each subsequence will pick out the same terms. That's crazy!

For example, consider the pair of component sequences (the labels $n_1, n_2, n_3, n_4, n_5$ mark the positions of the 0's in the first row):
$$1, \; 0, \; 1, \; 0, \; 1, \; 0, \; 1, \; 0, \; 1, \; 0, \; \ldots$$
$$0, \; -1, \; 0, \; 1, \; 0, \; -1, \; 0, \; 1, \; 0, \; -1, \; \ldots$$
The first component sequence has a convergent subsequence,
$$0, \; 0, \; 0, \; 0, \; \ldots,$$
as guaranteed by Bolzano-Weierstrass. But the subsequence of the second component sequence formed by using the same indices,
$$-1, \; 1, \; -1, \; 1, \; \ldots,$$
fails to converge!
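A tiny computational sketch of this pitfall (my own illustration, not the text's): pick the indices where the first component is 0 and look at the second component at those same indices.

```python
# First component: 1, 0, 1, 0, ...   Second component: 0, -1, 0, 1, 0, -1, ...
def first(i):          # i = 1, 2, 3, ...
    return 0 if i % 2 == 0 else 1

def second(i):
    if i % 2 == 1:
        return 0
    return -1 if i % 4 == 2 else 1

# Indices n_1, n_2, ... where the first component subsequence (all zeros) converges.
n = [i for i in range(1, 21) if first(i) == 0]
print([first(i) for i in n])   # 0, 0, 0, ...  converges
print([second(i) for i in n])  # -1, 1, -1, 1, ...  does not converge
```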
The correct statement would have been that there are convergent subsequences:
$$x^{a_1}_1, \; x^{a_2}_1, \; x^{a_3}_1, \; \ldots$$
$$x^{b_1}_2, \; x^{b_2}_2, \; x^{b_3}_2, \; \ldots$$
$$x^{c_1}_3, \; x^{c_2}_3, \; x^{c_3}_3, \; \ldots$$
$$\vdots$$
where $(a_i), (b_i), \ldots$ are different indexing sequences.¹ But we can't work with this because it doesn't allow us to isolate vectors from the original sequence. In particular,

¹ Using letters is bad notation, but I didn't want to introduce another level of subscripts. This isn't Inception!
$$\begin{pmatrix} x^{a_1}_1 \\ x^{b_1}_2 \\ x^{c_1}_3 \\ \vdots \end{pmatrix}$$
need not be a vector in the sequence because its components may come from different vectors:
$$\begin{pmatrix} \mathbf{x^{a_1}_1} \\ x^1_2 \\ x^1_3 \\ \vdots \end{pmatrix}, \; \begin{pmatrix} x^2_1 \\ x^2_2 \\ x^2_3 \\ \vdots \end{pmatrix}, \; \begin{pmatrix} x^3_1 \\ \mathbf{x^{b_1}_2} \\ x^3_3 \\ \vdots \end{pmatrix}, \; \begin{pmatrix} x^4_1 \\ x^4_2 \\ x^4_3 \\ \vdots \end{pmatrix}, \; \begin{pmatrix} x^5_1 \\ x^5_2 \\ \mathbf{x^{c_1}_3} \\ \vdots \end{pmatrix}, \; \ldots$$
15.3 Bringing Bolzy Back

The remedy to the preceding false proof is a very simple idea. And we all know Professor Simon's saying,

Math Mantra: The best ideas in mathematics are simple.

Imagine that your friends Alice, Bob, and Chris all have to agree on a restaurant. First, Alice lists her favorite restaurants:
$$R_1 \quad R_2 \quad R_3 \quad R_4 \quad R_5 \quad R_6 \quad R_7 \quad R_8 \quad R_9 \quad \ldots$$
Then, Bob chooses his favorite restaurants from Alice's list:
$$R_1 \quad R_2 \quad R_4 \quad R_5 \quad R_7 \quad R_9 \quad \ldots$$
Finally, Chris chooses his favorite restaurants from Bob's list:
$$R_1 \quad R_2 \quad R_4 \quad R_7 \quad R_9 \quad \ldots$$
Notice that all the restaurants on Chris's list are places that everyone likes: each of $R_1, R_2, R_4, R_7, R_9$ survived Alice's, Bob's, and Chris's rounds of selection.
The proof of Bolzano-Weierstrass follows the same refinement process! First we find a subsequence of indices such that the resulting subsequence of the first component sequence converges. Then we construct a subsequence of indices from that subsequence of indices such that the resulting subsequences of the first two component sequences both converge.

To demonstrate, consider our previous example in which we found a convergent subsequence of the first component sequence (the indices $n_1, \ldots, n_5$ mark the positions of the 0's in the first row):
$$1, \; 0, \; 1, \; 0, \; 1, \; 0, \; 1, \; 0, \; 1, \; 0, \; \ldots$$
$$0, \; -1, \; 0, \; 1, \; 0, \; -1, \; 0, \; 1, \; 0, \; -1, \; \ldots$$
Focus on the second component corresponding to this subsequence:
$$-1, \; 1, \; -1, \; 1, \; -1, \; \ldots$$
From this subsequence, choose a convergent subsequence:
$$\underbrace{-1}_{x^{m_1}_2}, \; 1, \; \underbrace{-1}_{x^{m_2}_2}, \; 1, \; \underbrace{-1}_{x^{m_3}_2}, \; \ldots$$
Using the indices $m_1, m_2, \ldots$, we have two convergent subsequences with the same indices!
$$1, \; 0, \; 1, \; 0, \; 1, \; 0, \; 1, \; 0, \; 1, \; 0, \; \ldots$$
$$0, \; -1, \; 0, \; 1, \; 0, \; -1, \; 0, \; 1, \; 0, \; -1, \; \ldots$$
(Here $m_1, m_2, m_3, \ldots$ mark the positions where the first row is 0 and the second row is $-1$.)

This strategy works for two important reasons:
• A subsequence of a subsequence is still a subsequence of the original sequence.
• If a sequence converges to some limit, then all subsequences converge to that same limit.
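Before the formal proof, here is a short sketch of the refinement process (my own illustration, with made-up 0/±1 sequences). Since these toy sequences take only finitely many values, "extract a convergent subsequence" can be replaced by the pigeonhole step of keeping the indices of a value that appears most often; the point is that one single list of indices is refined through each component in turn.

```python
from collections import Counter

# Toy bounded component sequences taking finitely many values.
comp1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
comp2 = [0, -1, 0, 1, 0, -1, 0, 1, 0, -1, 0, 1]

def refine(indices, seq):
    """Keep only the indices where seq takes its most common value among them.
    For a sequence with finitely many values this is a pigeonhole stand-in for
    'extract a convergent subsequence'."""
    value, _ = Counter(seq[i] for i in indices).most_common(1)[0]
    return [i for i in indices if seq[i] == value]

indices = list(range(len(comp1)))
for comp in (comp1, comp2):   # refine the same index list through each component
    indices = refine(indices, comp)

print(indices)                      # shared indices for both components
print([comp1[i] for i in indices])  # constant, hence convergent
print([comp2[i] for i in indices])  # constant, hence convergent
```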
Now that we have the idea in mind, let's give a proper proof:

Theorem (Bolzano-Weierstrass). Let $(x^i)$ be a bounded sequence of vectors in $\mathbb{R}^n$. Then this sequence has a convergent subsequence.

Proof Summary:
• Since the sequence of vectors is bounded, each component sequence is bounded.
• Use induction to prove that, given $n$ bounded sequences, we can find a convergent subsequence for each. Moreover, these subsequences have the same indexing sequence.
  • Base Case¹
    • Apply Bolzano-Weierstrass on the first bounded sequence to get a sequence of indices $(a_i)$.
    • Apply Bolzano-Weierstrass on the subsequence of the second bounded sequence indexed by $(a_i)$. We now have a sequence of indices $(b_i)$.
    • Both bounded sequences each have a subsequence that converges with indexing sequence $(b_i)$.
  • Inductive step
    • By the Inductive Hypothesis, our $n$ bounded sequences each have a subsequence that converges with indexing sequence $(a_i)$.
    • Apply Bolzano-Weierstrass on the subsequence of the $(n+1)$-st bounded sequence indexed by $(a_i)$. We now have a sequence of indices $(b_i)$.
    • All the bounded sequences have subsequences that converge with indexing sequence $(b_i)$.
• Apply the inductive result to the component sequences.

¹ We already have the case $n = 1$, but for the sake of pedagogy, I start with the case $n = 2$.

Proof: The beginning of the "bad" proof of Bolzano-Weierstrass is still valid, so we know that the component sequences of the vectors are bounded:
$$x^1_1, \; x^2_1, \; x^3_1, \; \ldots$$
$$x^1_2, \; x^2_2, \; x^3_2, \; \ldots$$
$$x^1_3, \; x^2_3, \; x^3_3, \; \ldots$$
$$\vdots$$
Now we prove, via induction, the property

P(n): For any set of $n$ bounded sequences
$$(x^i_1), (x^i_2), \ldots, (x^i_n)$$
we can find an increasing sequence of indices $(b_i)$ such that the subsequences
$$\left(x^{(b_i)}_1\right), \left(x^{(b_i)}_2\right), \ldots, \left(x^{(b_i)}_n\right)$$
all converge.
Base Case, P(2)

Suppose we have bounded sequences
$$x^1_1, \; x^2_1, \; x^3_1, \; \ldots$$
$$x^1_2, \; x^2_2, \; x^3_2, \; \ldots$$
By 1D Bolzano-Weierstrass, the first sequence has a convergent subsequence
$$x^{a_1}_1, \; x^{a_2}_1, \; x^{a_3}_1, \; \ldots$$
where
$$x^{(a_i)}_1 \to x_1.$$
Now consider the second sequence under the same indices:
$$x^{a_1}_2, \; x^{a_2}_2, \; x^{a_3}_2, \; \ldots$$
By 1D Bolzano-Weierstrass, this sequence has a convergent subsequence
$$x^{b_1}_2, \; x^{b_2}_2, \; x^{b_3}_2, \; \ldots$$
where
$$x^{(b_i)}_2 \to x_2.$$
By definition of a subsequence, $(b_i)$ chooses specific indices of $(a_i)$. Thus,
$$\left(x^{(b_i)}_1\right) \text{ is a subsequence of } \left(x^{(a_i)}_1\right).$$
Using the fact that if a sequence converges then all its subsequences converge to the same limit,
$$x^{(a_i)}_1 \to x_1 \quad \text{implies} \quad x^{(b_i)}_1 \to x_1.$$
Therefore,
$$x^{(b_i)}_1 \to x_1, \qquad x^{(b_i)}_2 \to x_2.$$
P(n) $\Rightarrow$ P(n + 1)

Let
$$\left(x^{(i)}_1\right), \left(x^{(i)}_2\right), \ldots, \left(x^{(i)}_n\right), \left(x^{(i)}_{n+1}\right)$$
be bounded sequences. By the inductive hypothesis, we can find a sequence of indices $(a_i)$ such that
$$x^{(a_i)}_1 \to x_1, \quad x^{(a_i)}_2 \to x_2, \quad x^{(a_i)}_3 \to x_3, \quad \ldots, \quad x^{(a_i)}_n \to x_n$$
all converge. Now consider the subsequence
$$\left(x^{(a_i)}_{n+1}\right).$$
By Bolzano-Weierstrass, this sequence has a convergent subsequence
$$\left(x^{(b_i)}_{n+1}\right).$$
But $(b_i)$ chooses indices of $(a_i)$, so
$$\left(x^{(b_i)}_1\right) \text{ is a subsequence of } \left(x^{(a_i)}_1\right),$$
$$\left(x^{(b_i)}_2\right) \text{ is a subsequence of } \left(x^{(a_i)}_2\right),$$
$$\left(x^{(b_i)}_3\right) \text{ is a subsequence of } \left(x^{(a_i)}_3\right),$$
$$\vdots$$
$$\left(x^{(b_i)}_n\right) \text{ is a subsequence of } \left(x^{(a_i)}_n\right).$$
Since a subsequence of a convergent sequence converges to the same limit,
$$x^{(b_i)}_1 \to x_1, \quad x^{(b_i)}_2 \to x_2, \quad x^{(b_i)}_3 \to x_3, \quad \ldots, \quad x^{(b_i)}_{n+1} \to x_{n+1}.$$
Now that we've finished the induction proof, we can apply this result to the component sequences to get
$$x^{(b_i)}_j \to x_j.$$
Therefore,
$$x^{(b_i)} \to x. \quad \blacksquare$$
15.4 Continuous Functions and Limits in $\mathbb{R}^n$

As you can imagine, the definition of continuity for functions on $\mathbb{R}^n$ is the same as the definition for functions on $\mathbb{R}$. Except,
• The function inputs are vectors.
• Instead of 1-dimensional distance, you are dealing with n-dimensional distance (absolute values become norms).

So, our definition of a limit is now,

Definition. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ has vector limit $\vec L$ at $c$ if, for any $\varepsilon > 0$, there exists a constant $\delta > 0$ such that if
$$0 < \|x - c\| < \delta$$
then
$$\|f(x) - \vec L\| < \varepsilon.$$

Continuity at a point becomes

Definition. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is continuous at $c$ if
1. There is a limit of $f$ at $c$.
2. The value of this limit is $f(c)$.
Equivalently, for any $\varepsilon > 0$, there exists a $\delta > 0$ such that if
$$\|x - c\| < \delta,$$
then
$$\|f(x) - f(c)\| < \varepsilon.$$

Continuous functions are now

Definition. A function $f$ is continuous on a set $U \subseteq \mathbb{R}^n$ if it is continuous at all points $x \in U$.

Again, I would like to repeat the warning from Lecture 13:

WARNING: On the exams, do not assume $f$ is defined for all of $\mathbb{R}^n$! For the rigorous definitions, read Professor Simon's book!
Now, there are several n-dimensional limit properties, but we won't prove all of them. This is because their proofs are exactly the same as the proofs for the limits of 1-dimensional sequences. There are only a few minor cosmetic differences, like replacing the
$$n \geq N$$
with
$$0 < \|x - c\| < \delta.$$
For example, if a series of inequalities required the conditions
$$n \geq N_1, \quad n \geq N_2, \quad \ldots, \quad n \geq N_k,$$
you would define
$$N = \max\{N_1, N_2, \ldots, N_k\}$$
and let $n \geq N$ to make all inequalities hold. With continuous functions, however, you are given the conditions
$$0 < \|x - c\| < \delta_1, \quad 0 < \|x - c\| < \delta_2, \quad \ldots, \quad 0 < \|x - c\| < \delta_k.$$
To make all corresponding inequalities hold, you define
$$\delta = \min\{\delta_1, \delta_2, \ldots, \delta_k\} > 0$$
and let
$$0 < \|x - c\| < \delta.$$
Other than replacing $\varepsilon$-$N$ with $\varepsilon$-$\delta$, the proofs are verbatim. If you are skeptical (as you should always be), compare the proofs below with the previous lectures. It's mostly CTRL-C and CTRL-V.
First, the easy properties:

Theorem. Let $f, g$ be functions that each map $\mathbb{R}^n$ to $\mathbb{R}^m$. If the limits
$$\vec L_1 = \lim_{x \to a} f(x), \qquad \vec L_2 = \lim_{x \to a} g(x)$$
both exist, then
• Scaling: For any constant $K$, the limit of the scaled function $Kf$ at $a$ is simply the limit of $f$ at $a$ scaled by $K$:
$$\lim_{x \to a} Kf(x) = K \lim_{x \to a} f(x)$$
• Sum: The limit of the function sum at $a$ is the sum of the limits of the functions at $a$:
$$\lim_{x \to a} \big(f(x) + g(x)\big) = \lim_{x \to a} f(x) + \lim_{x \to a} g(x)$$
Proof: We only prove the sum property since both proofs are exactly the same as their sequence counterparts.

Let $\varepsilon > 0$. We want to find a $\delta > 0$ such that, if
$$0 < \|x - a\| < \delta$$
then
$$\|f(x) + g(x) - (\vec L_1 + \vec L_2)\| < \varepsilon.$$
Rewrite the (LHS) as
$$\|(f(x) - \vec L_1) + (g(x) - \vec L_2)\|$$
and apply the triangle inequality:
$$\|(f(x) - \vec L_1) + (g(x) - \vec L_2)\| \leq \|f(x) - \vec L_1\| + \|g(x) - \vec L_2\|.$$
As usual, it suffices to show that this upper bound is bounded by $\varepsilon$.

Since
$$\vec L_1 = \lim_{x \to a} f(x), \qquad \vec L_2 = \lim_{x \to a} g(x),$$
for any $\varepsilon_1 > 0$, we can find a corresponding $\delta_1 > 0$ such that for all $x$ satisfying $0 < \|x - a\| < \delta_1$,
$$\|f(x) - \vec L_1\| < \varepsilon_1.$$
Also, for any $\varepsilon_2 > 0$, we can find a corresponding $\delta_2 > 0$ such that for all $x$ satisfying $0 < \|x - a\| < \delta_2$,
$$\|g(x) - \vec L_2\| < \varepsilon_2.$$
Choosing
$$\varepsilon_1 = \varepsilon_2 = \frac{\varepsilon}{2},$$
we can find $\delta_1, \delta_2$ such that for all $x$ satisfying $0 < \|x - a\| < \delta_1$,
$$\|f(x) - \vec L_1\| < \frac{\varepsilon}{2},$$
and for all $x$ satisfying $0 < \|x - a\| < \delta_2$, we have
$$\|g(x) - \vec L_2\| < \frac{\varepsilon}{2}.$$
That means if we choose
$$\delta = \min\{\delta_1, \delta_2\} > 0,$$
whenever $0 < \|x - a\| < \delta$, it follows that $x$ automatically satisfies
$$\|x - a\| < \delta_1, \qquad \|x - a\| < \delta_2.$$
Thus,
$$\|(f(x) - \vec L_1) + (g(x) - \vec L_2)\| \leq \underbrace{\|f(x) - \vec L_1\|}_{< \varepsilon/2} + \underbrace{\|g(x) - \vec L_2\|}_{< \varepsilon/2} < \varepsilon. \quad \blacksquare$$
We also have a product rule. However, we cannot multiply two vectors, so it only makes sense for functions that map into $\mathbb{R}$:

Theorem. For functions $f, g$ that each map $\mathbb{R}^n$ to $\mathbb{R}$, if the limits
$$\lim_{x \to a} f(x), \qquad \lim_{x \to a} g(x)$$
both exist, then the limit of the product of the functions exists and is equal to the product of the limits:
$$\lim_{x \to a} f(x)g(x) = \lim_{x \to a} f(x) \cdot \lim_{x \to a} g(x)$$

To reiterate, because $f, g$ output real numbers, the right-hand side of this equation makes sense.

In this course, we will put special emphasis on functions that map $\mathbb{R}^n$ into $\mathbb{R}$. This is because a lot of important functions take multiple real numbers as inputs and spit out a single real number output. For example, the future value of an investment is a function of three variables: principal $P$, interest rate $r$, and time $t$:
$$f(P, r, t) = P(1 + r)^t.$$
Alternatively, the happiness level of a Tamagotchi is described by
$$\text{Happy}(\text{Food}, \text{Attention}, \text{Sleep}, \text{Discipline}) = \frac{(\text{Food} + \text{Sleep}) \cdot \text{Attention}}{\text{Discipline}}.$$
Functions from $\mathbb{R}^n$ to $\mathbb{R}$ have quite a few perks. For starters, we can order the values of the images, so a Sandwich Theorem also makes sense:
Theorem (Sandwich Theorem). For functions $f, g, h$ from $\mathbb{R}^n$ to $\mathbb{R}$, if
$$f(x) \leq g(x) \leq h(x)$$
for all $x$ and
$$\lim_{x \to a} f(x) = \lim_{x \to a} h(x) = L,$$
then the limit of $g(x)$ at $a$ exists and
$$\lim_{x \to a} g(x) = L.$$

Proof: Let $\varepsilon > 0$. We want to show there is some $\delta > 0$ such that for all $x$ that satisfy
$$0 < \|x - a\| < \delta$$
we have
$$|g(x) - L| < \varepsilon.$$
When we expand the absolute value, this really just means we have to show that the following pair of inequalities is satisfied:
$$L - \varepsilon < g(x), \qquad g(x) < L + \varepsilon.$$
But we know there exists a $\delta_1 > 0$ such that for all $x$ satisfying
$$0 < \|x - a\| < \delta_1$$
we have
$$|f(x) - L| < \varepsilon,$$
which is just the pair of inequalities
$$L - \varepsilon < f(x), \qquad f(x) < L + \varepsilon. \qquad (\ast)$$
Likewise, we know there exists a $\delta_2 > 0$ such that for all $x$ satisfying
$$0 < \|x - a\| < \delta_2$$
we have
$$L - \varepsilon < h(x), \qquad h(x) < L + \varepsilon. \qquad (\ast\ast)$$
Since we want $(\ast)$ and $(\ast\ast)$ to hold, let
$$\delta = \min\{\delta_1, \delta_2\}.$$
Then, for all
$$0 < \|x - a\| < \delta$$
we have both
$$0 < \|x - a\| < \delta_1, \qquad 0 < \|x - a\| < \delta_2,$$
so
$$L - \varepsilon < f(x) \leq g(x) \qquad \text{by } (\ast).$$
Likewise,
$$g(x) \leq h(x) < L + \varepsilon \qquad \text{by } (\ast\ast).$$
So for all $x$ satisfying
$$0 < \|x - a\| < \delta$$
we have
$$L - \varepsilon < g(x) < L + \varepsilon. \quad \blacksquare$$
But the most important fact about continuous functions from $\mathbb{R}^n$ to $\mathbb{R}$ is that, on a closed, bounded, and non-empty set, we can find a maximum and minimum:

[Figure: the graph of a continuous function over a closed, bounded set with its MAX and MIN marked.]
We are going to need two more n-dimensional extensions, but we leave them as exercises.¹ The first is about subsequences of convergent sequences:

Theorem. For a sequence of vectors $(x^i)$, if
$$x^i \to x,$$
then any subsequence
$$x^{n_i} \to x.$$

The second is about continuous mappings of sequences:

Theorem. Let $f$ be a continuous function on $\mathbb{R}^n$, and $(x^i)$ be a sequence of vectors in $\mathbb{R}^n$. If
$$x^i \to x,$$
then
$$f(x^i) \to f(x).$$

With these results, we can prove the existence of extrema. Again, it suffices to only prove the maximum case.

Theorem. Let $f: \mathbb{R}^n \to \mathbb{R}$ be a continuous function and let $K$ be a closed, bounded, and nonempty subset of $\mathbb{R}^n$. Then $f$ achieves its maximum on $K$, i.e. there exists some $y \in K$ with
$$f(y) \geq f(x)$$
for all $x \in K$. Likewise, $f$ achieves its minimum on $K$, i.e. there exists some $y \in K$ with
$$f(x) \geq f(y)$$
for all $x \in K$.

¹ I hate this expression too. But you can't expect me to rewrite the proofs, especially when they just replace the absolute values in 1D proofs with norms.
Proof: Since $f$ is bounded on $K$ and $K$ is nonempty, the set
$$F = \{f(x) \mid x \in K\}$$
has a least upper bound $S$ by the Completeness Axiom.

We want to find some $q \in K$ so that
$$f(q) = S,$$
for then
$$f(q) \geq f(x)$$
for all $x \in K$.

First, we construct a sequence of vectors whose images converge to $S$. For any $\varepsilon > 0$, notice that we can always find a point in
$$F = \{f(x) \mid x \in K\}$$
within $\varepsilon$ of $S$. Otherwise, all points of $F$ would be at least $\varepsilon$ from $S$; hence, $S - \varepsilon$ would be an upper bound for $F$, contradicting the fact that $S$ is the least upper bound of $F$.

Taking each of the following values for $\varepsilon$,
$$\varepsilon_1 = 1, \quad \varepsilon_2 = \tfrac{1}{2}, \quad \varepsilon_3 = \tfrac{1}{3}, \quad \ldots$$
we can find $s_1, s_2, \ldots \in F$ such that
$$S - 1 \leq s_1 \leq S$$
$$S - \tfrac{1}{2} \leq s_2 \leq S$$
$$S - \tfrac{1}{3} \leq s_3 \leq S$$
$$\vdots$$
Unravelling the definition of $F$, this means
$$S - 1 \leq \underbrace{f(x_1)}_{s_1} \leq S$$
$$S - \tfrac{1}{2} \leq \underbrace{f(x_2)}_{s_2} \leq S$$
$$S - \tfrac{1}{3} \leq \underbrace{f(x_3)}_{s_3} \leq S$$
$$\vdots$$
By the n-dimensional Sandwich Theorem,
$$f(x_n) \to S.$$
Our original sequence $(x_i)$ is bounded since it is contained in the bounded set $K$. So by the n-dimensional Bolzano-Weierstrass Theorem, there exists a convergent subsequence $(x_{n_i})$ with
$$x_{n_i} \to x.$$
Because $K$ is closed, $x \in K$. Taking the images of $x_{n_1}, x_{n_2}, x_{n_3}, \ldots$ under the continuous function $f$, we conclude
$$f(x_{n_i}) \to f(x).$$
But now notice that
$$f(x_{n_1}), f(x_{n_2}), \ldots$$
is a subsequence of the convergent sequence
$$f(x_1), f(x_2), \ldots,$$
which converges to $S$. Since any subsequence of a convergent sequence converges to the same limit,
$$f(x_{n_i}) \to S.$$
Because
$$f(x_{n_i}) \to f(x) \quad \text{and} \quad f(x_{n_i}) \to S,$$
uniqueness of limits gives us
$$f(x) = S.$$
Since $S$ is the least upper bound for $F$,
$$f(x) \geq f(y)$$
for all $y \in K$. $\blacksquare$
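The proof's construction can be mimicked numerically. Here is a rough sketch of my own (not part of the text, and with a made-up function and set): sample a continuous $f$ on a closed, bounded square on finer and finer nested grids; the sampled maxima increase toward the supremum.

```python
import math

# Sketch: approximate the max of a continuous f on the closed, bounded
# square K = [0, 2] x [0, 2] by sampling on finer and finer nested grids.
def f(x, y):
    return math.sin(x) + math.cos(y) - 0.1 * (x * x + y * y)

def sampled_max(m):
    # m + 1 grid points per axis; a sampled max never exceeds the true max on K.
    pts = [2.0 * i / m for i in range(m + 1)]
    return max(f(x, y) for x in pts for y in pts)

for m in (4, 16, 64, 256):
    print(m, round(sampled_max(m), 6))
# The printed values are non-decreasing and stabilize, consistent with f
# attaining its maximum somewhere on K.
```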
New Notation

| Symbol | Reading | Example | Example Translation |
|---|---|---|---|
| $x^i_j$ | The j-th component of vector $x^i$ | $x^i_j = 7$ | The j-th component of vector $x^i$ is 7. |
| $\left(x^{(i)}_j\right)$ | The sequence of j-th components of the sequence of vectors $(x^i)$ | $x^{(i)}_j \to 41$ | The j-th component sequence of $(x^i)$ converges to 41. |
Lecture 16

Dishing out Derivatives

"It's JACOB-BIAN not JAC-O-BE-AN. That's an era in the sixteenth hundreds."
- Leon Simon

Goals: Finally, we define the multivariable derivative. First, we explain why a natural extension fails. To remedy this, we introduce directional derivatives and give a stronger definition of differentiability. This definition will ensure that differentiable functions are continuous and all their directional derivatives exist. Moreover, calculation of directional derivatives is reduced to simple matrix multiplication. We save the proofs about directional derivatives for next week.

16.1 Motivation on a Multivariable Derivative

It's Friday night, Tri-Delt Special Dinner. It's two hours in and you are completely out of gin and green apple schnapps, which you need to make cucumber martinis. But you still have various other types of booze, a ton of cucumber juice, and dozens of sorority girls demanding,

MORE CUCUMBER MARTINIS!

So what do you do? You try to generalize the recipe. You replace the gin with a different base hard liquor (like rum or vodka) and try a different tasty liqueur (say strawberry schnapps). You just have to make sure
• It still has the same kick.
• It's still sweet and tasty.
And if it's not good enough, you revise.¹

¹ Don't ever mix cucumber juice with triple sec. It literally tastes like tire iron!
We are going to do the same thing: we generalize our recipe for 1-dimensional derivatives to get a recipe for n-dimensional derivatives. But we have to make sure that our n-dimensional derivative still makes sense and still satisfies some nice properties. Namely,
• Differentiable functions are always continuous.
• We can exploit differentiability to find extrema.

But where do we start? The first approach would be to look at the definition of the one-dimensional derivative,
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.$$
We are going to be using multivariable functions, so maybe we should try replacing the arguments with vectors:
$$x \rightsquigarrow x, \qquad h \rightsquigarrow \vec h, \qquad 0 \rightsquigarrow \vec 0.$$
Since dividing by a vector $\vec h$ doesn't make sense, we could try dividing by its norm. So our guess for the multivariable definition is
$$f'(x) = \lim_{\vec h \to \vec 0} \frac{f(x + \vec h) - f(x)}{\|\vec h\|}.$$
This may seem somewhat intuitive: instead of approaching the origin from only the left or right (a constant close to 0), we now approach from any direction (a vector close to $\vec 0$). However,

This is an awful guess.

This formula doesn't even make sense for the 1-dimensional case! Consider
$$f(x) = x.$$
If you tried to calculate
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{|h|},$$
the right limit would yield
$$\lim_{h \to 0^+} \frac{f(x + h) - f(x)}{|h|} = \lim_{h \to 0^+} \frac{h}{|h|} = 1.$$
However, the left limit would yield
$$\lim_{h \to 0^-} \frac{f(x + h) - f(x)}{|h|} = \lim_{h \to 0^-} \frac{h}{|h|} = -1.$$
Complete and utter fail. And the reason why it fails is that, by introducing the norm, we no longer take into account the sign of $h$.

Instead, a better guess is to

Differentiate along a fixed direction.
16.1. MOTIVATION ON A MULTIVARIABLE DERIVATIVE 349
Before I give you a rigorous denition, let me give you an intuition. Consider some point on a hill:
Suppose you approach the point by traveling parallel to the x direction on the surface of the xy plane:
If you consider just the slice of the hill in the xz plane, you get a curve.
Look at just the xz plane:
350 LECTURE 16. DISHING OUT DERIVATIVES
x
z
The derivative at the point is simply what you have been doing in 1D:
x
z
But you could have approached the base of the point from any direction:
The idea is that the direction of our dierentiation matters.
16.2. DIRECTIONAL DERIVATIVES 351
16.2 Directional Derivatives

A direction is just a vector. To travel along a direction means to travel along a line formed from the direction vector. Recall that a line is simply all scalings
$$t v$$
of a single vector $v$. To approach a point
$$f(x)$$
we actually mean that we are approaching its preimage $x$. Moreover, we are traveling along the line formed by $v$ that goes through $x$:
$$x + t v.$$
Visually:

[Figure: the line $x + tv$ through $x$ in the direction $v$, drawn beneath the graph of $f$.]

To isolate the curve, we evaluate $f$ on this line,
$$f(x + t v).$$
The multivariable extension is now easy: just take the difference quotient
$$\frac{f(x + t v) - f(x)}{t}$$
as $t$ approaches 0.

Definition. The directional derivative of $f$ evaluated at $x$ with respect to the direction $v$ is
$$D_v f(x) = \lim_{t \to 0} \frac{f(x + t v) - f(x)}{t}.$$
Example: Calculate the directional derivative of
$$f\begin{pmatrix} x \\ y \\ z \end{pmatrix} = xy + z^2$$
with respect to the direction
$$\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.$$
First, we calculate
$$\frac{f\begin{pmatrix} x + t \\ y + 2t \\ z + 3t \end{pmatrix} - f\begin{pmatrix} x \\ y \\ z \end{pmatrix}}{t} = \frac{(x + t)(y + 2t) + (z + 3t)^2 - xy - z^2}{t} = \frac{2xt + yt + 2t^2 + 6zt + 9t^2}{t}.$$
At this point, you know from Calc BC that
$$\lim_{t \to 0} \frac{2xt + yt + 2t^2 + 6zt + 9t^2}{t} = 2x + y + 6z.$$
But let's try to get some practice with $\varepsilon$-$\delta$. We guess that the limit is
$$2x + y + 6z.$$
So for any arbitrary $\varepsilon > 0$, we want to show that there is some $\delta > 0$ such that whenever
$$0 < |t| < \delta$$
we have
$$\left| \frac{2xt + yt + 2t^2 + 6zt + 9t^2}{t} - (2x + y + 6z) \right| < \varepsilon.$$
Let $\varepsilon > 0$. We can reduce the (LHS) of the $\varepsilon$-condition to
$$\left| \frac{2xt + yt + 2t^2 + 6zt + 9t^2}{t} - (2x + y + 6z) \right| = |2x + y + 2t + 6z + 9t - (2x + y + 6z)| = |11t| = 11|t|.$$
Choosing
$$\delta = \frac{\varepsilon}{11}$$
will make
$$11 \underbrace{|t|}_{< \delta} < 11 \cdot \frac{\varepsilon}{11} = \varepsilon,$$
as needed.
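If you want to double-check the algebra numerically, here is a quick sketch of my own (not from the text): the difference quotient at a sample point should approach $2x + y + 6z$ as $t \to 0$.

```python
def f(v):
    x, y, z = v
    return x * y + z ** 2

def diff_quotient(x, v, t):
    # (f(x + t*v) - f(x)) / t
    shifted = [a + t * b for a, b in zip(x, v)]
    return (f(shifted) - f(x)) / t

point = [1.0, 2.0, 3.0]
direction = [1.0, 2.0, 3.0]
expected = 2 * point[0] + point[1] + 6 * point[2]   # 2x + y + 6z = 22

for t in (1.0, 0.1, 0.01, 0.001):
    print(t, diff_quotient(point, direction, t), expected)
```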
One guess at the definition of differentiable is

A function is differentiable if all directional derivatives exist.

Unfortunately, this is not quite enough. To see why, we need to examine directional derivatives a little more closely. In particular, we take Leon Simon's advice,

Math Mantra: When given a new definition, you should always try the most extreme cases.

And the most extreme case is differentiating with respect to the i-th standard basis vector. It's so important that we even give it a name:

Definition. The i-th partial derivative of $f$ evaluated at $x$ is the directional derivative of $f$ evaluated at $x$ with respect to the direction $e_i$:
$$D_i f(x) = \lim_{h \to 0} \frac{f(x + h e_i) - f(x)}{h}.$$

NOTE: The convention (in this book) will be to write
$$\frac{\partial f}{\partial x_i}(x)$$
in place of $D_i f(x)$ whenever $f$ is a function into $\mathbb{R}$. So, $D_i f(x)$ will always be a vector and $\frac{\partial f}{\partial x_i}(x)$ will always be a real number.

Remarkably, computing the partial derivative is just single variable differentiation. Namely, for each component of $f$,
• Differentiate with respect to the i-th variable.
• Treat all other variables as constant.

Theorem. For a function $f: \mathbb{R}^n \to \mathbb{R}^m$, the partial derivative $D_i f(x)$ is the single variable derivative of $f$ with respect to variable $x_i$, applied to each component.
Proof: It suffices to assume $m = 1$ (we will give an explanation after the proof).

If you look at the partial derivative definition closely, we are simply adding $h$ to the i-th coordinate while holding all other components fixed:
$$\lim_{h \to 0} \frac{f\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_i + h \\ \vdots \\ x_n \end{pmatrix} - f\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_i \\ \vdots \\ x_n \end{pmatrix}}{h}.$$
Define $g: \mathbb{R} \to \mathbb{R}$ by
$$g(t) = f\begin{pmatrix} x_1 \\ \vdots \\ x_{i-1} \\ t \\ x_{i+1} \\ \vdots \\ x_n \end{pmatrix}.$$
Then, we can rewrite
$$\frac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \frac{f(x + h e_i) - f(x)}{h}$$
as
$$\frac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \frac{g(x_i + h) - g(x_i)}{h},$$
which is simply $g'(x_i)$, the single variable derivative of $g$, which is the function $f$ viewed as a function of $x_i$ only, with all other components constant. $\blacksquare$
A few notes:
• To prove this theorem, we applied a very important trick:

Math Mantra: By defining single variable functions from multivariable functions, we can apply single variable results to multivariable functions.

Although this was a very simple proof, don't underestimate this trick. It will be the linchpin of some of the most difficult theorems in this course.

• For this theorem, as well as several others, we assume $f$ is a mapping into the reals
$$f: \mathbb{R}^n \to \mathbb{R}$$
instead of a vector mapping
$$f: \mathbb{R}^n \to \mathbb{R}^m.$$
This is because we can often apply the theorem to each component of a vector mapping. For example, if
$$f(x) = \begin{pmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_m(x) \end{pmatrix},$$
then after expanding the components of
$$D_i f(x) = \lim_{h \to 0} \frac{f(x + h e_i) - f(x)}{h}$$
we get
$$D_i f(x) = \begin{pmatrix} \lim_{h \to 0} \frac{f_1(x + h e_i) - f_1(x)}{h} \\ \lim_{h \to 0} \frac{f_2(x + h e_i) - f_2(x)}{h} \\ \vdots \\ \lim_{h \to 0} \frac{f_m(x + h e_i) - f_m(x)}{h} \end{pmatrix}.$$
Now repeatedly apply the $m = 1$ case to each component.
Armed with a fast way to calculate partial derivatives, notice how much easier the next example is than the previous:

Example: Calculate the partial derivatives of
$$f\begin{pmatrix} x \\ y \\ z \end{pmatrix} = xy + z^2$$
with respect to the variables $x, y, z$.

By the previous theorem, this is just 1D Calculus:
$$\frac{\partial f}{\partial x} = y \quad \text{(since } y, z \text{ are constant)}$$
$$\frac{\partial f}{\partial y} = x \quad \text{(since } x, z \text{ are constant)}$$
$$\frac{\partial f}{\partial z} = 2z \quad \text{(since } x, y \text{ are constant)}$$

Comparing this example with the previous, your heart should skip a little. The directional derivative of $f$ with respect to the direction
$$\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}$$
was
$$2x + y + 6z = 1 \cdot y + 2 \cdot x + 3 \cdot (2z) = 1 \cdot \frac{\partial f}{\partial x} + 2 \cdot \frac{\partial f}{\partial y} + 3 \cdot \frac{\partial f}{\partial z}.$$
This directional derivative is a linear function of the partial derivatives. In fact, it can be computed by the matrix multiplication
$$\begin{pmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} & \frac{\partial f}{\partial z} \end{pmatrix} \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.$$
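To see the matrix-multiplication viewpoint in action, here is a small sketch of my own (not the book's): build the row of partial derivatives of $f(x, y, z) = xy + z^2$ numerically and multiply it by the direction $(1, 2, 3)$; the result should match the directional derivative $2x + y + 6z$.

```python
def f(v):
    x, y, z = v
    return x * y + z ** 2

def partial(f, v, i, h=1e-6):
    # Central-difference approximation of the i-th partial derivative at v.
    up = list(v); up[i] += h
    dn = list(v); dn[i] -= h
    return (f(up) - f(dn)) / (2 * h)

point = [1.0, 2.0, 3.0]
direction = [1.0, 2.0, 3.0]

row = [partial(f, point, i) for i in range(3)]        # (df/dx, df/dy, df/dz)
jvp = sum(p * d for p, d in zip(row, direction))      # row vector times direction
print(row)                                            # approx (y, x, 2z) = (2, 1, 6)
print(jvp, 2 * point[0] + point[1] + 6 * point[2])    # both approximately 22
```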
Note that the columns of the matrix on the left are just the partial derivatives. So maybe, to compute the directional derivative, we just need to multiply the matrix with partial derivative columns by the direction vector:
$$\begin{pmatrix} D_1 f(x) & D_2 f(x) & \ldots & D_n f(x) \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}$$
That would be awesome. Unfortunately, with our current guess at the definition of differentiability,

A function is differentiable if all directional derivatives exist,

we don't have this linearity property. On this week's homework, you will prove that
$$f(x, y) = \begin{cases} \dfrac{|x|\,y}{\sqrt{x^2 + y^2}} & \text{if } (x, y) \neq (0,0) \\[4pt] 0 & \text{if } (x, y) = (0,0) \end{cases}$$
has directional derivatives at $(0, 0)$ with respect to every direction. However, we can find a direction $v$ such that
$$D_v f(0, 0) \neq v_1 \frac{\partial f}{\partial x}(0, 0) + v_2 \frac{\partial f}{\partial y}(0, 0).$$
It gets even worse: our guess at differentiability doesn't even imply continuity! You will also prove that
$$f(x, y) = \begin{cases} \dfrac{x^3}{y} & \text{if } y \neq 0 \\[4pt] 0 & \text{if } y = 0 \end{cases}$$
has all directional derivatives at $(0, 0)$ but is not continuous at $(0, 0)$.

In summary, we want the following to hold:
• Every directional derivative is a linear function of the partial derivatives.
• A differentiable function is always continuous.
Since we saw that these properties are not satisfied, we are going to need a stronger definition of differentiability.
16.3 Differentiability

We want our definition of differentiability to force the directional derivative to be a linear function of the direction vectors. Therefore, we are going to encode this matrix multiplication in the definition. Let's return to our bad guess from the beginning of this lecture:
$$\lim_{\vec h \to \vec 0} \frac{f(x + \vec h) - f(x)}{\|\vec h\|}.$$
Indeed, for a fixed $x$, this limit does not exist: not only does it fail for the $n = 1$ case, but it fails to take into account how $\vec h$ approaches the origin. However, if we also fix $\vec h$, the quotient
$$\frac{f(x + \vec h) - f(x)}{\|\vec h\|} \qquad (\ast)$$
does exist. For any vector $\vec h$, we would like to approximate this quotient as a matrix multiplication involving $\vec h$. In particular, we want the approximation to become more accurate as $\vec h$ approaches $\vec 0$. In other words, we want the limit of the difference to approach 0 as $\vec h$ approaches $\vec 0$:
$$\lim_{\vec h \to \vec 0} \left( \frac{f(x + \vec h) - f(x)}{\|\vec h\|} - \text{Matrix Multiplication} \right) = 0.$$
There is nothing wrong here: even though the separate limits need not exist, it's possible that the limit of the difference does.¹ Moreover, the matrix $A$ used in the multiplication should be independent of $\vec h$ (though it can certainly depend on $x$). This is because we would like to use $A$ to calculate the directional derivatives without referring to limits (you'll see)!

¹ Here's my favorite example of this phenomenon: the limit of $\sum_{n=1}^{k} \frac{1}{n}$ and the limit of $\int_1^k \frac{1}{x}\,dx$ do not exist. However, the limit of the difference $\sum_{n=1}^{k} \frac{1}{n} - \int_1^k \frac{1}{x}\,dx$ does exist and is known as Euler's gamma constant.

As a first guess to approximating $(\ast)$, for every $\vec h$, we could try the matrix multiplication
$$A \vec h$$
for some fixed matrix $A$ independent of $\vec h$. But that's a bone-headed guess! The limit as $\vec h$ approaches $\vec 0$ is always 0:
$$\lim_{\vec h \to \vec 0} A \vec h = 0.$$
Instead, we want to consider only the direction of $\vec h$ and not its magnitude. Therefore, we try dividing $\vec h$ by its norm:
$$\frac{\vec h}{\|\vec h\|}.$$
This makes $\frac{\vec h}{\|\vec h\|}$ into a unit vector since
$$\left\| \frac{\vec h}{\|\vec h\|} \right\| = \frac{1}{\|\vec h\|} \|\vec h\| = 1.$$
However, the direction is still intact since $\frac{\vec h}{\|\vec h\|}$ is a scaling of $\vec h$.
As our improved guess to approximating $(\ast)$, we try the matrix multiplication
$$A \left( \frac{\vec h}{\|\vec h\|} \right)$$
for some fixed matrix $A$ independent of $\vec h$. Visually, in the case
$$f(x) = \begin{pmatrix} x_1^2 x_2 + 2x_1 \\ x_1 x_2 + 2x_2 \end{pmatrix}$$
with
$$A = \begin{pmatrix} 2x_1 x_2 + 2 & x_1^2 \\ x_2 & x_1 + 2 \end{pmatrix}$$
evaluated at
$$x = \begin{pmatrix} 1 \\ 1 \end{pmatrix},$$
we can see that
$$A \frac{\vec h}{\|\vec h\|}$$
will provide better approximations to
$$\frac{f(x + \vec h) - f(x)}{\|\vec h\|}$$
as $\vec h$ approaches $\vec 0$, no matter how $\vec h$ approaches $\vec 0$:

[Figure: for $\vec h = \binom{4}{0}$, $\binom{1}{1}$, and $\binom{0}{.25}$, the vectors $A\frac{\vec h}{\|\vec h\|}$ and $\frac{f(x+\vec h)-f(x)}{\|\vec h\|}$ are plotted in the $x_1 x_2$-plane.]

We now give a formal definition:
Definition. We say the function $f$ is differentiable at $x$ if there exists an $m \times n$ matrix $A$ (independent of $\vec h$ but possibly dependent on $x$) such that
$$\lim_{\vec h \to \vec 0} \frac{f(x + \vec h) - f(x) - A\vec h}{\|\vec h\|} = 0.$$
Moreover, we say a function is differentiable if it is differentiable at every point in its domain.¹

There are two important notes about this definition:
• When expanding the $\varepsilon$-$\delta$ definition of the derivative, consider the $\varepsilon$-condition
$$\left\| \frac{f(a + \vec h) - f(a) - A\vec h}{\|\vec h\|} \right\| < \varepsilon.$$
By pulling out the norm and multiplying by $\|\vec h\|$, we can alternatively express this as
$$\left\| f(a + \vec h) - f(a) - A\vec h \right\| < \varepsilon \|\vec h\|.$$
We will often use this rewrite in our proofs to avoid annoying fractions.
• We did not define the matrix $A$ to have partial derivative columns
$$A = \begin{pmatrix} D_1 f(x) & D_2 f(x) & \ldots & D_n f(x) \end{pmatrix}.$$
It turns out that the existence of $A$ forces $A$ to be of this form. Because Math people are complete badasses, we are going to prove it instead of assuming it. And once we're done kicking ass, we give $A$ a name: the Jacobian.
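As a numerical sketch of the definition (mine, not the text's), take the two-dimensional example pictured above, $f(x) = (x_1^2 x_2 + 2x_1,\; x_1 x_2 + 2x_2)$ with the stated matrix $A$ at $x = (1, 1)$, and watch $\|f(x+\vec h) - f(x) - A\vec h\| / \|\vec h\|$ shrink as $\vec h \to \vec 0$ along different directions.

```python
import math

def f(x1, x2):
    return (x1 ** 2 * x2 + 2 * x1, x1 * x2 + 2 * x2)

x = (1.0, 1.0)
# The matrix A from the example, evaluated at x = (1, 1).
A = [[2 * x[0] * x[1] + 2, x[0] ** 2],
     [x[1],                x[0] + 2]]

def error_ratio(h1, h2):
    # || f(x + h) - f(x) - A h || / || h ||
    fx = f(*x)
    fxh = f(x[0] + h1, x[1] + h2)
    Ah = (A[0][0] * h1 + A[0][1] * h2, A[1][0] * h1 + A[1][1] * h2)
    num = math.hypot(fxh[0] - fx[0] - Ah[0], fxh[1] - fx[1] - Ah[1])
    return num / math.hypot(h1, h2)

for s in (1.0, 0.1, 0.01, 0.001):
    # h approaching 0 along two different directions.
    print(s, round(error_ratio(s, 0.0), 8), round(error_ratio(s, s), 8))
```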
We end this lecture by doing a reality check on our definition of derivative: a differentiable function should always be continuous!

Theorem. Let $f: \mathbb{R}^n \to \mathbb{R}^m$. If $f$ is differentiable at $a$, then $f$ is continuous at $a$.

¹ Since this is a book for underdogs, we are going to prove results for differentiable functions rather than functions that are differentiable at a point. The proofs are virtually the same, except the latter is more precise. For the more precise presentation, you must read Professor Simon's text.
Proof Summary:
• Let $\varepsilon > 0$. Rewrite the continuity definition by substituting $\vec h = x - a$.
• Incorporate the differentiability definition by adding $\underbrace{A\vec h - A\vec h}_{\vec 0}$.
• Use inequalities to bound with the $\|A\|\|\vec h\|$ term and another term involving the $\varepsilon$-conclusion of differentiability.
• Using the differentiability definition for $\frac{\varepsilon}{2}$, get a corresponding $\delta'$.
• Set $\delta = \min\left\{1, \delta', \frac{\varepsilon}{2\|A\|}\right\}$.

Proof: Recall the definition of continuity: for any $\varepsilon > 0$, there exists a $\delta > 0$ such that if
$$0 < \|x - a\| < \delta$$
then
$$\|f(x) - f(a)\| < \varepsilon.$$
To make this resemble the differentiability definition, we substitute¹
$$\vec h = x - a.$$
This gives us the equivalent definition: for any $\varepsilon > 0$, there exists a $\delta > 0$ such that if
$$0 < \|\vec h\| < \delta$$
then
$$\|f(a + \vec h) - f(a)\| < \varepsilon.$$

¹ FYI, you've been doing this switch all the time in Calc BC.

Let $\varepsilon > 0$. Looking at
$$\|f(a + \vec h) - f(a)\|,$$
we are going to incorporate the $\varepsilon$-conclusion of differentiability by adding $\vec 0$:
$$\|f(a + \vec h) - f(a)\| = \|f(a + \vec h) - f(a) \underbrace{- A\vec h + A\vec h}_{= \vec 0}\|.$$
By the triangle inequality,
$$\|f(a + \vec h) - f(a) - A\vec h + A\vec h\| \leq \|f(a + \vec h) - f(a) - A\vec h\| + \|A\vec h\|,$$
and by the matrix version of Cauchy-Schwarz,
$$\|f(a + \vec h) - f(a) - A\vec h\| + \|A\vec h\| \leq \|f(a + \vec h) - f(a) - A\vec h\| + \|A\| \|\vec h\|.$$
Therefore, if we can find a bound on $\|\vec h\|$ that makes each of the terms on the (RHS) less than $\frac{\varepsilon}{2}$, we are done.

By definition of differentiability for $\frac{\varepsilon}{2}$, we know we can find a $\delta'$ such that if
$$0 < \|\vec h\| < \delta'$$
then
$$\left\| f(a + \vec h) - f(a) - A\vec h \right\| < \frac{\varepsilon}{2} \|\vec h\|.$$
By restricting $\|\vec h\| < 1$ as well,
$$\left\| f(a + \vec h) - f(a) - A\vec h \right\| < \frac{\varepsilon}{2}.$$
To make
$$\|A\| \|\vec h\| < \frac{\varepsilon}{2}$$
we need
$$\|\vec h\| < \frac{\varepsilon}{2\|A\|}.$$
So choose
$$\delta = \min\left\{1, \delta', \frac{\varepsilon}{2\|A\|}\right\}.$$
Then if
$$0 < \|\vec h\| < \delta$$
we have
$$\|f(a + \vec h) - f(a)\| \leq \underbrace{\left\| f(a + \vec h) - f(a) - A\vec h \right\|}_{< \varepsilon/2} + \underbrace{\|A\| \|\vec h\|}_{< \varepsilon/2} < \varepsilon. \quad \blacksquare$$
Recall that I claimed that an open set is a microcosm for the entire universe. Now that you understand the big picture of the previous proof, with a simple modification you can easily restrict the domain of $f$ to an open set:

Theorem. Let $f: U \to \mathbb{R}^m$ where $U$ is an open set. If $f$ is differentiable at $a$, then $f$ is continuous at $a$.

Proof: Since $U$ is open, by definition there is a $\delta^* > 0$ such that
$$B_{\delta^*}(a) \subseteq U.$$
Now repeat the preceding proof with
$$\delta < \min\left\{1, \delta', \frac{\varepsilon}{2\|A\|}, \delta^*\right\}.$$
This additional restriction on $\delta$ ensures that our function is actually defined for all $x$ with
$$\|x - a\| < \delta.$$

NOTE: For most theorems in this book, I will avoid the technicality of restricting to an open set. I do this to avoid juggling an additional condition and I'd rather focus on the main points by assuming that functions are defined on all of $\mathbb{R}^n$. But remember that all the proofs can be easily modified to apply to functions on open sets: simply restrict $\delta$ so $f$ is defined on all of $B_{\delta^*}(a)$.
New Notation

| Symbol | Reading | Example | Example Translation |
|---|---|---|---|
| $D_v f(x)$ | Directional derivative of $f$ with respect to the direction $v$ evaluated at $x$ | $D_{e_1} f(x) = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ | The directional derivative of $f$ with respect to the direction $e_1$ evaluated at $x$ is $\begin{pmatrix} 1 \\ 2 \end{pmatrix}$. |
| $D_i f(x)$ | The i-th partial derivative of $f$ evaluated at $x$ | $D_1 f = x_1^2 x_2$ | The first partial derivative of $f$ evaluated at $x$ is $x_1^2 x_2$. |
Lecture 17

Sum-body that I Used to Know

"Why so SERIES-OUS?"
- Heath Ledger

Goals: For this week's real analysis lecture, we review series and prove convergence tests. Some tests you have seen before: N-th term, geometric, and p-test. But we also prove fundamental results that were not covered in Calc BC: absolute convergence implies convergence; and any non-negative series can be rearranged without changing the limit.

17.1 A Second Chance

By now, you've probably noticed that much of the mathematics in high school was hand-waved. This is especially true about high school Calculus: it has more plot holes than an M. Night Shyamalan movie. For example, one of my personal pet-peeves is how students interpret
$$\sum_{n=1}^{\infty} a_n.$$
To quote Professor Simon,

"What is an infinite summation? Well, we are not adding infinitely many terms. That's humanly impossible!"

Remember, quoting Professor Simon again,

Math Mantra: Math is not a mystical study!

Suppose we were actually adding infinitely many terms. Then,
$$0 = 0 + 0 + 0 + 0 + 0 + 0 + 0 + \ldots$$
Substituting,
$$0 = (1 - 1) + (1 - 1) + (1 - 1) + (1 - 1) + (1 - 1) + (1 - 1) + (1 - 1) + \ldots$$
and applying the associativity rule infinitely many times:
$$0 = 1 + (-1 + 1) + (-1 + 1) + (-1 + 1) + (-1 + 1) + (-1 + 1) + (-1 + 1) + \ldots$$
so
$$0 = 1.$$
This is complete and utter rubbish. The correct interpretation of an infinite summation is that we are computing the limit of the sequence of partial sums. As a result, we cannot take for granted standard facts about finite sums when we are working with series. Don't let your guard down just because there is a $\sum$.
Definition. The series
$$\sum_{n=1}^{\infty} a_n$$
is just shorthand for the limit of the sequence $(S_k)$ where
$$S_k = \sum_{n=1}^{k} a_n.$$
The sequence $(S_k)$ is called the sequence of partial sums. Moreover, if
$$\sum_{n=1}^{\infty} a_n$$
exists, we say the series converges. Otherwise, we say the series diverges.
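To keep the "limit of partial sums" interpretation front and center, here is a tiny sketch of mine: computing the partial sums of $\sum_{n=1}^{\infty} \frac{1}{2^n}$ never adds infinitely many terms; it just builds the finite sequence $S_1, S_2, \ldots$ and watches where it heads.

```python
# Partial sums S_k of the series sum_{n=1}^infinity 1/2^n, which converge to 1.
partial_sums = []
total = 0.0
for n in range(1, 11):
    total += 1.0 / 2 ** n
    partial_sums.append(total)

print([round(s, 6) for s in partial_sums])
```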
Honestly, the hand-wavery in high school wasn't your teacher's fault: a teacher has to cater to an entire class. The problem is

The proofs and definitions are not tested on the Calculus BC Exam, so there is no incentive for teachers to give a rigorous treatment.

I want you to toss away this test-centric ideal. You are going to get a chance at redemption; an opportunity to fill in these gaps.¹ The remaining analysis lectures will focus on the final quarter of Calc BC: series. And like Sheldon Axler's treatment of Linear Algebra, we are going to do it right. Using the rigorous definition of a limit, we are going to prove all of our results. This includes:
• A rigorous derivation of Convergence Tests
• A rigorous treatment of Power Series
• A rigorous treatment of Taylor Series
Today, we will focus on the first: testing series for convergence.

¹ Like going from the rationals to the reals.
17.2. THE MOST BASIC TEST: N-TH TERM TEST 365
17.2 The Most Basic Test: N-th Term Test
The most obvious way to check that a sequence of partial sums diverge is to show the sequence of
added terms (a
n
) fails to converge to 0. Equivalently, in order for

n=1
a
n
to exist, it is necessary that
a
n
0.
For example, consider

n=1
1
2
n
.
Notice that the added terms
1
2
,
1
4
,
1
8
,
1
16
,
1
32
,
1
64
,
1
128
,
1
256
,
1
512
, . . .
approach 0. Precisely, the sequence
1
2
n
0.
This makes intuitive sense: if you want your partial sum to approach a value, the terms you are
adding should get closer to 0. Otherwise, the sum would explode (or oscillate).
Of course, the N-th term test is not a sucient condition for convergence. For example, the harmonic
series

n=1
1
n
.
diverges yet
1
n
0.
So how are we going to prove the N-th term test? The key is to relate the two sequences:
The sequence of partial sums (S
k
).
The sequence of added terms (a
n
).
But this is easy:
Math Mantra: To go from the partial sums back to the original sequence,
just subtract consecutive partial sums.
Precisely, since
S
k1
+ a
k
=
k1

n=1
a
n
+ a
k
=
k

n=1
a
n
= S
k
366 LECTURE 17. SUM-BODY THAT I USED TO KNOW
we have
a
k
= S
k
S
k1
.
For example, we check that a
6
= S
6
S
5
:
S
6
= a
1
+ a
2
+ a
3
+ a
4
+ a
5
+ a
6
S
5
= a
1
+ a
2
+ a
3
+ a
4
+ a
5
a
6
Now we are ready for the proof:

Theorem (N-th term test). If
$$\sum_{n=1}^{\infty} a_n$$
exists, then the sequence of added terms $(a_n)$ converges to 0.

Proof Summary:
• Rewrite $a_k$ as a difference of consecutive partial sums.
• Each partial sum converges to the same limit.

Proof: Using our aforementioned trick, rewrite $a_k$ as a difference of consecutive partial sums:
$$a_k = S_k - S_{k-1}.$$
The limit of the sequence $(a_k)$ is the limit of the sequence $(S_k - S_{k-1})$. Since
$$\sum_{n=1}^{\infty} a_n$$
exists,
$$S_k \to S.$$
But $S_{k-1}$ is the same sequence shifted by one term, so
$$S_{k-1} \to S.$$
By our limit difference theorem, the limit of $(S_k - S_{k-1})$ is the same as the difference of the limits of $(S_k)$ and $(S_{k-1})$. Therefore,
$$\underbrace{S_k}_{\to S} - \underbrace{S_{k-1}}_{\to S} \to 0.$$
Thus,
$$a_k \to 0. \quad \blacksquare$$
As in Calc BC, you can use the contrapositive of this theorem to instantly tell that the geometric series
$$1 + 2 + 4 + 8 + 16 + \ldots$$
fails to converge. Generally,

Theorem. If $a \neq 0$ and $|r| \geq 1$, the geometric series
$$\sum_{n=1}^{\infty} a r^n$$
diverges.

Proof: Consider the sequence of added terms,
$$\left( a r^k \right).$$
For each $i \geq 1$, the absolute value of the i-th term of this sequence is bounded below by $|a|$:
$$|a r^i| = |a| \cdot |\underbrace{r}_{\geq 1} \cdot \underbrace{r}_{\geq 1} \cdot \underbrace{r}_{\geq 1} \cdots \underbrace{r}_{\geq 1}| \geq |a| \qquad (i \text{ times}).$$
Since $|a| > 0$, the sequence of added terms does not converge to 0. $\blacksquare$
17.3 Staying Non-negative

Math people love when the added terms in a series are all non-negative:¹
$$S_k = \sum_{n=1}^{k} a_n, \qquad a_n \geq 0.$$
This is because the sequence of partial sums $(S_k)$ is non-negative and increasing:
$$0 \leq \underbrace{a_1}_{S_1} \leq \underbrace{a_1 + a_2}_{S_2} \leq \underbrace{a_1 + a_2 + a_3}_{S_3} \leq \ldots$$
This is awesome. Recall the almighty

Monotone Convergence Property: Bounded monotonic sequences converge.

To prove that the sequence of partial sums $(S_k)$ converges, it suffices² to prove that $(S_k)$ is bounded. And proving a sequence is bounded is a lot easier than proving convergence directly by finding some value and showing that the sequence actually converges to that value.

Now we formalize the preceding discussion into a theorem and a proof:

¹ All results on non-negative terms apply to non-positive terms (just scale by $-1$).
² In fact, the converse also holds since convergent sequences are bounded. Therefore, for partial sums over non-negative terms, boundedness is equivalent to convergence.
Theorem. Let $(a_n)$ be a sequence where each term is non-negative:
$$a_n \geq 0$$
for all $n \geq 1$. If the sequence of partial sums $(S_k)$ is bounded, then $(S_k)$ converges.

Proof: For any two consecutive terms,
$$S_k - S_{k-1} = a_k \geq 0.$$
Thus, the sequence of partial sums $(S_k)$ is monotonically increasing. If $(S_k)$ is also bounded, then $(S_k)$ converges by the Monotone Convergence Property. $\blacksquare$
For example, we can use this result to prove that a geometric series converges when the common ratio $r \in [0, 1)$:

Theorem. For $a \geq 0$ and $0 \leq r < 1$, the geometric series
$$\sum_{n=1}^{\infty} a r^n$$
converges.

Proof Summary:
• Since each added term is non-negative, it suffices to show that the partial sums are bounded.
• Exploit the self-symmetry of geometric series to derive a nice (closed form) formula for partial sums.
• Conclude that the partial sums are bounded.

Proof: Since each added term
$$a r^n \geq 0,$$
we only need to check that the partial sums
$$S_k = \sum_{n=1}^{k} a r^n$$
are bounded. To do this, we are going to use a nice trick¹ to convert this expression into a simple formula.

First, write out the sum
$$S_k = ar + ar^2 + ar^3 + \ldots + ar^k.$$
Now, exploit the self-symmetry of $S_k$: multiply both sides by $r$,
$$r S_k = ar^2 + ar^3 + ar^4 + \ldots + ar^{k+1},$$
and subtract:
$$r S_k - S_k = ar^{k+1} - ar.$$
Isolating $S_k$,
$$S_k = \frac{ar^{k+1} - ar}{r - 1}.$$
With this simple expression, it is easy to check that $(S_k)$ is bounded. Let $k \geq 1$. By absolute value properties,
$$|S_k| = \left| \frac{ar^{k+1} - ar}{r - 1} \right| = |a| \, \frac{|r^{k+1} - r|}{|r - 1|}.$$
To find an upper bound, we make the numerator as big as possible. Since $0 \leq r < 1$,
$$r^{k+1} = \underbrace{r}_{<1} \cdot \underbrace{r}_{<1} \cdot \underbrace{r}_{<1} \cdots \underbrace{r}_{<1} < 1,$$
hence both $r^{k+1}$ and $r$ lie in $[0, 1)$, and so
$$|r^{k+1} - r| \leq 1.$$
Therefore,
$$|S_k| = |a| \, \frac{\overbrace{|r^{k+1} - r|}^{\leq 1}}{|r - 1|} \leq \frac{|a|}{|r - 1|}.$$
Since $k$ was arbitrary, we conclude that
$$|S_k| \leq \frac{|a|}{|r - 1|}$$
for all $k \geq 1$. Thus, $(S_k)$ is bounded and we conclude that
$$\sum_{n=1}^{\infty} a r^n$$
converges. $\blacksquare$

¹ Although the formula for the sum of a finite geometric series is covered in high school, I am surprised that the derivation is often overlooked. It's a cute trick that everyone should know.
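A quick numerical sketch of the closed form (my own, with made-up values of a and r): the partial sums $S_k = \sum_{n=1}^{k} ar^n$ agree with $\frac{ar^{k+1} - ar}{r - 1}$ and stay below the bound $\frac{|a|}{|r-1|}$.

```python
a, r = 3.0, 0.8

bound = abs(a) / abs(r - 1)
for k in range(1, 31, 5):
    s_k = sum(a * r ** n for n in range(1, k + 1))     # direct partial sum
    closed = (a * r ** (k + 1) - a * r) / (r - 1)      # closed form from the proof
    print(k, round(s_k, 6), round(closed, 6), round(bound, 6))
```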
Too easy? How about a more interesting application? Recall that if you are given a p-series,
$$\sum_{n=1}^{\infty} \frac{1}{n^p},$$
this series converges if $p > 1$. Each term in this series is non-negative, so we can use our theorem on series with non-negative terms to prove this test. We just need to show that the partial sums are bounded when $p > 1$. But bounded by what?

We use a cute trick: notice that the n-th added term
$$\frac{1}{n^p}$$
can be visualized as the area of a width-1 rectangle under the curve
$$f(x) = \frac{1}{x^p}.$$

[Figure: a rectangle of width 1 and height $\frac{1}{n^p}$ sitting under the graph of $f$ near $x = n$.]

So the partial sum $S_k$ is just the sum of these rectangles with right endpoints $1, 2, \ldots, k$:

[Figure: the graph of $f(x) = \frac{1}{x^p}$ with rectangles of heights $\frac{1}{1^p}, \frac{1}{2^p}, \frac{1}{3^p}, \ldots$ at $x = 1, 2, 3, 4, \ldots$]

From this picture, one guess is that the partial sum $S_k$ is bounded by
$$\text{Area under } \tfrac{1}{x^p} \text{ from } x = 0 \text{ to } x = k.$$
This is true, but it doesn't help: we cannot evaluate the area of $\frac{1}{x^p}$ near 0.

Instead, we consider the area of the first rectangle separately and say the partial sum $S_k$ is bounded by
$$\underbrace{\text{Area of the first rectangle}}_{1} + \left( \text{Area under } \tfrac{1}{x^p} \text{ from } x = 1 \text{ to } x = k \right).$$
Now that we have this schematic, let's make the proof rigorous.
Theorem. If $p > 1$, then the series
$$\sum_{n=1}^{\infty} \frac{1}{n^p}$$
converges.

Proof Summary:
• Since $\frac{1}{x^p}$ is strictly decreasing, each added term $\frac{1}{n^p}$ is bounded by $\int_{n-1}^{n} \frac{1}{x^p}\,dx$.
• Apply this inequality on each term (except the first). After combining integrals, we get an upper bound of $1 + \int_{1}^{k} \frac{1}{x^p}\,dx$.
• Since $p > 1$, this bound is bounded above by an expression independent of $k$.
• The partial sums are bounded, and every added term $a_n$ is non-negative, so the series converges.

Proof: Let $k \geq 1$ and consider the partial sum
$$S_k = \sum_{n=1}^{k} \frac{1}{n^p}.$$
By our preceding discussion, we intuitively know that each term after the first,
$$\frac{1}{n^p},$$
can be represented by the area of a rectangle under $\frac{1}{x^p}$ from $n - 1$ to $n$. Visually, we see that
$$\underbrace{\frac{1}{n^p}}_{\text{rectangle}} \leq \underbrace{\int_{n-1}^{n} \frac{1}{x^p}\,dx}_{\text{area under curve}}.$$
But remember, pictures aren't proofs! Instead, we recall two facts from Calculus BC:
1. If the derivative of a function is negative, then the function is strictly decreasing.
2. If $f(x) \leq g(x)$ on the interval $[a, b]$, then
$$\int_a^b f(x)\,dx \leq \int_a^b g(x)\,dx.$$
The first fact tells us that $\frac{1}{x^p}$ is decreasing and thus
$$\frac{1}{x^p} \geq \frac{1}{n^p}$$
when $x$ is in $[n-1, n]$. Letting
$$f(x) = \frac{1}{n^p}, \qquad g(x) = \frac{1}{x^p},$$
we use the second fact to get
$$\int_{n-1}^{n} \underbrace{\frac{1}{n^p}}_{f(x)}\,dx \leq \int_{n-1}^{n} \underbrace{\frac{1}{x^p}}_{g(x)}\,dx.$$
But $f$ is a constant function; therefore,
$$\frac{1}{n^p} \leq \int_{n-1}^{n} \frac{1}{x^p}\,dx.$$
Applying this inequality to each term of $S_k$ (except the first) and noting that the first term is just 1, we have:
$$\frac{1}{1^p} \leq 1$$
$$\frac{1}{2^p} \leq \int_{1}^{2} \frac{1}{x^p}\,dx$$
$$\frac{1}{3^p} \leq \int_{2}^{3} \frac{1}{x^p}\,dx$$
$$\frac{1}{4^p} \leq \int_{3}^{4} \frac{1}{x^p}\,dx$$
$$\vdots$$
$$\frac{1}{k^p} \leq \int_{k-1}^{k} \frac{1}{x^p}\,dx$$
Summing all these inequalities gives us $S_k$ on the left:
$$S_k \leq 1 + \int_{1}^{2} \frac{1}{x^p}\,dx + \int_{2}^{3} \frac{1}{x^p}\,dx + \ldots + \int_{k-1}^{k} \frac{1}{x^p}\,dx.$$
Of course, we can collapse the integrals on the right:
$$S_k \leq 1 + \int_{1}^{k} \frac{1}{x^p}\,dx.$$
Evaluating the integral,
$$S_k \leq 1 + \frac{k^{-p+1}}{-p+1} - \frac{1}{-p+1}.$$
Now, if we can find an upper bound that is some constant independent of $k$, then we are done. But $p > 1$, so
$$\frac{\overbrace{k^{-p+1}}^{\geq 0}}{-p+1} \leq 0.$$
Thus,
$$S_k \leq 1 + \underbrace{\frac{k^{-p+1}}{-p+1}}_{\leq 0} - \frac{1}{-p+1} \leq 1 - \frac{1}{-p+1} = 1 + \frac{1}{p-1}.$$
Because $(S_k)$ is bounded and each added term is non-negative, we conclude that the series converges. $\blacksquare$

To prove divergence for $0 \leq p \leq 1$, use a similar argument. Instead of right-endpoint rectangles, use left endpoints and show that the p-series is bounded below by the area under the curve,
$$\int_{1}^{\infty} \frac{1}{x^p}\,dx,$$
and thus diverges. Since we still have a lot of material to cover, I challenge the reader to reconstruct the proof.
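Here is a short numerical sketch of my own (not from the text) of the bound used in the proof: for $p > 1$ the partial sums $S_k$ of $\sum 1/n^p$ sit below $1 + \int_1^k x^{-p}\,dx$, which in turn sits below the constant $1 + \frac{1}{p-1}$.

```python
p = 2.0

def S(k):
    # Partial sum of the p-series up to k.
    return sum(1.0 / n ** p for n in range(1, k + 1))

def integral_bound(k):
    # 1 + integral_1^k x^(-p) dx = 1 + (k^(1-p) - 1) / (1 - p)
    return 1.0 + (k ** (1 - p) - 1.0) / (1 - p)

constant_bound = 1.0 + 1.0 / (p - 1)
for k in (1, 10, 100, 1000):
    print(k, round(S(k), 6), round(integral_bound(k), 6), constant_bound)
```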
17.4 Absolute Convergence

Partial sums of non-negative terms are pretty sweet. If only every series had only non-negative added terms, then this whole convergence business would be as easy as $\pi$.¹

On the bright side, we know how to turn any sequence into a non-negative one: simply take the absolute value of each term:
$$(\,|a_n|\,).$$
Wouldn't it be great if we could relate the convergence of
$$\sum_{n=1}^{\infty} a_n$$
to the convergence of its absolute value analogue
$$\sum_{n=1}^{\infty} |a_n| \; ?$$
In fact, we can:

If the absolute value analogue of a series converges, then the original series converges as well.

For some series, we can make our lives a lot easier by just showing that the absolute value analogue of the series is bounded! Because we are so grateful, we give this type of convergence a name.

Definition. We say a series
$$\sum_{n=1}^{\infty} a_n$$
is absolutely convergent if its absolute value analogue
$$\sum_{n=1}^{\infty} |a_n|$$
is convergent.

Remember the slogan,

Absolute Convergence implies Regular Convergence.

Practically, this just says

Math Mantra: If you are having trouble proving that a series converges, it may be easier to prove absolute convergence. To do that, you only need to show that the absolute value analogue of the series is bounded.

¹ Or rather, $\frac{\pi^2}{6}$.
But how are we going to prove this implication?

Simple. Take an absolutely convergent series with partial sums $(S_k)$ and split each $S_k$ into the difference of two non-negative sequences:
$$S_k = \underbrace{P_k}_{\geq 0} - \underbrace{N_k}_{\geq 0}.$$
By limit properties, it suffices to prove that $(P_k)$ and $(N_k)$ both converge. Moreover, since they are non-negative and increasing, it suffices to prove that these two sequences are bounded. But that's easy to show: just use the bound on the convergent absolute value analogue!

For example, consider the series with partial sums of the form
$$S_k = -\frac{1}{1} + \frac{1}{4} - \frac{1}{9} + \frac{1}{16} - \frac{1}{25} + \frac{1}{36} - \frac{1}{49} + \ldots + \frac{(-1)^k}{k^2}.$$
We split this into the difference of two non-negative sums by separating the negative and non-negative added terms:
$$S_k = P_k - N_k$$
where
$$P_k = 0 + \frac{1}{4} + 0 + \frac{1}{16} + 0 + \frac{1}{36} + 0 + \ldots + \frac{1}{k^2}$$
$$N_k = \frac{1}{1} + 0 + \frac{1}{9} + 0 + \frac{1}{25} + 0 + \frac{1}{49} + \ldots + \frac{1}{(k-1)^2}$$
The absolute value analogue of the series has partial sums of the form
$$S_k^+ = \underbrace{\frac{1}{1}}_{N_k} + \underbrace{\frac{1}{4}}_{P_k} + \underbrace{\frac{1}{9}}_{N_k} + \underbrace{\frac{1}{16}}_{P_k} + \underbrace{\frac{1}{25}}_{N_k} + \underbrace{\frac{1}{36}}_{P_k} + \underbrace{\frac{1}{49}}_{N_k} + \ldots + \frac{1}{k^2}.$$
By absolute convergence, $(S_k^+)$ converges and hence is bounded by some $B$. But $P_k$ and $N_k$ are simply sub-sums of $S_k^+$!
$$P_k = \frac{1}{4} + \frac{1}{16} + \frac{1}{36} + \ldots + \frac{1}{k^2} \leq \frac{1}{1} + \frac{1}{4} + \frac{1}{9} + \frac{1}{16} + \frac{1}{25} + \frac{1}{36} + \frac{1}{49} + \ldots + \frac{1}{k^2}$$
$$N_k = \frac{1}{1} + \frac{1}{9} + \frac{1}{25} + \frac{1}{49} + \ldots + \frac{1}{(k-1)^2} \leq \frac{1}{1} + \frac{1}{4} + \frac{1}{9} + \frac{1}{16} + \frac{1}{25} + \frac{1}{36} + \frac{1}{49} + \ldots + \frac{1}{k^2}$$
Thus, for each $k$, $P_k$ and $N_k$ are both bounded above by $B$:
$$P_k \leq S_k^+ \leq B, \qquad N_k \leq S_k^+ \leq B.$$
Now that we have the intuition, let's formalize the proof:

Theorem. For any sequence $(a_i)$, if $\displaystyle\sum_{i=1}^{\infty} |a_i|$ converges, then $\displaystyle\sum_{i=1}^{\infty} a_i$ converges.

Proof Summary:
• Split the partial sum up into the difference of the positive terms $P_k$ and negative terms $N_k$.
• For each $k$, $P_k$ and $N_k$ are bounded by $\sum_{i=1}^{k} |a_i|$.
• By absolute convergence, $\sum_{i=1}^{k} |a_i| \leq B$, hence $P_k \leq B$ and $N_k \leq B$ for each $k$.
• Since $(P_k)$ and $(N_k)$ are both bounded increasing sequences, they both converge, and hence, so does their difference.

Proof: Consider the partial sum
$$S_k = \sum_{i=1}^{k} a_i.$$
We can split this into positive and negative parts:
$$S_k = \sum_{i=1}^{k} \underbrace{(p_i - n_i)}_{a_i}$$
where the positive part is
$$p_i = \begin{cases} a_i & \text{if } a_i \text{ is positive} \\ 0 & \text{otherwise} \end{cases}$$
and the negative part is
$$n_i = \begin{cases} |a_i| & \text{if } a_i \text{ is negative} \\ 0 & \text{otherwise.} \end{cases}$$
Then,
$$S_k = \sum_{i=1}^{k} p_i - \sum_{i=1}^{k} n_i.$$
Letting $P_k$ be the partial sum of positive parts,
$$P_k = \sum_{i=1}^{k} p_i,$$
and $N_k$ the partial sum of negative parts,
$$N_k = \sum_{i=1}^{k} n_i,$$
we have
$$S_k = P_k - N_k.$$
By limit properties, to prove that $(S_k)$ converges, it suffices to show that $(P_k)$ and $(N_k)$ both converge. But by construction,
$$P_k \leq \sum_{i=1}^{k} |a_i|, \qquad N_k \leq \sum_{i=1}^{k} |a_i|.$$
Moreover, by absolute convergence, the sequence of partial sums
$$\left( \sum_{i=1}^{k} |a_i| \right)$$
converges. Since convergent sequences are bounded,
$$\sum_{i=1}^{k} |a_i| \leq B$$
for all $k$. Thus, for all $k$,
$$P_k \leq B, \qquad N_k \leq B.$$
So $(P_k)$ and $(N_k)$ are both bounded and increasing non-negative sequences, hence they both converge. Therefore $(S_k)$ converges. $\blacksquare$
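The splitting trick is easy to see numerically. This sketch of mine splits the partial sums of $\sum (-1)^n/n^2$ into $P_k$ and $N_k$ and checks that both stay below the absolute-value partial sum.

```python
def a(n):
    return (-1) ** n / n ** 2

K = 20
P = N = absolute = 0.0
for n in range(1, K + 1):
    term = a(n)
    absolute += abs(term)
    if term > 0:
        P += term
    else:
        N += -term
    # P_k and N_k are increasing and bounded by the absolute partial sum.
    assert P <= absolute + 1e-12 and N <= absolute + 1e-12

print(round(P, 6), round(N, 6), round(P - N, 6), round(absolute, 6))
```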
We can now use this result to complete the geometric series test:

Theorem. Let $a > 0$. If $|r| < 1$, the geometric series
$$\sum_{n=1}^{\infty} a r^n$$
converges. Otherwise, the series diverges.

Proof: Since we already completed the cases $0 \leq r < 1$ and $|r| \geq 1$, we just need to prove the case $-1 < r < 0$. But this is simple: consider the absolute value analogue. By absolute value properties, it is just
$$\sum_{n=1}^{\infty} |a| \, |r|^n.$$
Because this is a geometric series with ratio
$$r' = |r|,$$
we can apply the $0 \leq r' < 1$ case to conclude that the absolute value analogue of the series converges. Since absolute convergence implies regular convergence,
$$\sum_{n=1}^{\infty} a r^n$$
converges. $\blacksquare$

Even though checking absolute convergence makes life a lot easier, you have to be careful. As is often the case in math,

Math Mantra: The converse need not be true!

For example, the alternating harmonic series
$$\frac{1}{1} - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \ldots$$
converges. However, it does not converge absolutely, since the absolute value analogue is the harmonic series
$$\frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \ldots,$$
which diverges.
17.5 Rearrangements
A natural question to ask is,
What happens to the partial sums if we rearrange the order in which the terms are added?
For example, instead of a typical sequence of partial sums
$$\begin{aligned}
S_1 &= a_1 \\
S_2 &= a_1 + a_2 \\
S_3 &= a_1 + a_2 + a_3 \\
S_4 &= a_1 + a_2 + a_3 + a_4 \\
S_5 &= a_1 + a_2 + a_3 + a_4 + a_5 \\
&\;\;\vdots
\end{aligned}$$
what happens if we decide to add the terms out of order?
$$\begin{aligned}
S'_1 &= a_6 \\
S'_2 &= a_6 + a_{50} \\
S'_3 &= a_6 + a_{50} + a_{454} \\
S'_4 &= a_6 + a_{50} + a_{454} + a_{59} \\
S'_5 &= a_6 + a_{50} + a_{454} + a_{59} + a_{74} \\
&\;\;\vdots
\end{aligned}$$
Your intuition about sums might lead you to think that reordering an infinite sum would yield the same value. But this is not necessarily true! You are not actually adding infinitely many terms.
You are looking at a sequence of values. And we can change the destiny of this sequence by deciding
which value to add next.
Again, consider the alternating harmonic sequence
$$\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n}.$$
Then, the partial sums
$$\begin{aligned}
S_1 &= 1 \\
S_2 &= 1 - \tfrac{1}{2} \\
S_3 &= 1 - \tfrac{1}{2} + \tfrac{1}{3} \\
S_4 &= 1 - \tfrac{1}{2} + \tfrac{1}{3} - \tfrac{1}{4} \\
S_5 &= 1 - \tfrac{1}{2} + \tfrac{1}{3} - \tfrac{1}{4} + \tfrac{1}{5} \\
&\;\;\vdots
\end{aligned}$$
are known to converge to
$$\ln(2).$$
However, if we rearrange the sequence of added terms so that we add two positive terms before each negative term, the partial sums become
$$\begin{aligned}
S_1 &= 1 \\
S_2 &= 1 + \tfrac{1}{3} \\
S_3 &= 1 + \tfrac{1}{3} - \tfrac{1}{2} \\
S_4 &= 1 + \tfrac{1}{3} - \tfrac{1}{2} + \tfrac{1}{5} \\
S_5 &= 1 + \tfrac{1}{3} - \tfrac{1}{2} + \tfrac{1}{5} + \tfrac{1}{7} \\
S_6 &= 1 + \tfrac{1}{3} - \tfrac{1}{2} + \tfrac{1}{5} + \tfrac{1}{7} - \tfrac{1}{4} \\
&\;\;\vdots
\end{aligned}$$
This sequence actually converges to
$$\frac{3}{2}\ln(2).$$
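Here is a quick numeric sketch of this rearrangement (my own illustration, not part of the original text; the number of blocks is an arbitrary choice). Taking two positive terms of the alternating harmonic series before each negative term really does push the partial sums toward $\frac{3}{2}\ln 2$ rather than $\ln 2$.

```python
import math

# Rearranged alternating harmonic series: two positive terms, then one negative.
# Positive terms are 1/1, 1/3, 1/5, ...; negative terms are -1/2, -1/4, -1/6, ...
def rearranged_partial_sum(blocks):
    total = 0.0
    pos = 1   # next odd denominator
    neg = 2   # next even denominator
    for _ in range(blocks):
        total += 1 / pos + 1 / (pos + 2) - 1 / neg
        pos += 4
        neg += 2
    return total

print(rearranged_partial_sum(100000))   # approx 1.0397...
print(1.5 * math.log(2))                # 1.0397...
print(math.log(2))                      # 0.6931..., the limit of the usual ordering
```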
Crazy! In fact, any conditionally convergent series (a series that converges but does not converge absolutely) can be rearranged to converge to any real number: whatever real number $a$ and conditionally convergent series you give me, I can rearrange that series to converge to $a$. Fortunately, if our terms are non-negative, this phenomenon doesn't happen: if a series of non-negative terms converges to some limit, then any rearrangement of that series will converge to the same limit. First, let's formalize:

Definition. A rearrangement of $(a_n)$ is a sequence $(a_{j_i})$ where the sequence of indices
$$j_1, j_2, j_3, \ldots$$
lists each natural number exactly once.
For example, the sequence
$$\underbrace{a_2}_{j_1},\; \underbrace{a_1}_{j_2},\; \underbrace{a_4}_{j_3},\; \underbrace{a_3}_{j_4},\; \underbrace{a_6}_{j_5},\; \underbrace{a_5}_{j_6},\; \ldots$$
is a rearrangement where
$$j_i = \begin{cases} i + 1 & \text{if } i \text{ is odd} \\ i - 1 & \text{if } i \text{ is even} \end{cases}$$
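As a tiny sketch (my own, with placeholder values standing in for the $a_i$), here is this particular rearrangement applied to the first six terms of a sequence; the index rule $j_i$ simply swaps each adjacent pair.

```python
# The rearrangement j_i = i+1 if i is odd, i-1 if i is even (1-indexed),
# applied to the first six terms of a sequence.
def j(i):
    return i + 1 if i % 2 == 1 else i - 1

a = {1: 'a1', 2: 'a2', 3: 'a3', 4: 'a4', 5: 'a5', 6: 'a6'}  # placeholder terms
rearranged = [a[j(i)] for i in range(1, 7)]
print(rearranged)   # ['a2', 'a1', 'a4', 'a3', 'a6', 'a5']
```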
Now, let's prove the theorem:

Theorem. Let $(a_n)$ be a non-negative sequence and $(a_{j_i})$ be a rearrangement of this sequence. If
$$\sum_{i=1}^{\infty} a_i$$
converges, then so does
$$\sum_{i=1}^{\infty} a_{j_i}.$$
Moreover, both sequences converge to the same limit.
Proof Summary:

Convergence
- Consider the $k$-th partial sum of the rearranged sequence
  $$S'_k = \sum_{i=1}^{k} a_{j_i}$$
  and let $j_{\max}$ be the highest index $j_i$ in this sum.
- $S'_k$ is bounded by $S_{j_{\max}}$.
- Since $(S_k)$ converges, $S_{j_{\max}} \leq B$ for some $B$.
- Thus $S'_k \leq B$, so $(S'_k)$ converges since it is non-negative and bounded.

Equal Limits
- $S_k$ is bounded by its limit, $\sum_{i=1}^{\infty} a_i$.
- $S'_k$ is bounded by $S_{j_{\max}}$, so
  $$\sum_{i=1}^{k} a_{j_i} \leq \sum_{i=1}^{\infty} a_i$$
- Since the right side is a constant, we know the limit of $S'_k$ satisfies
  $$\sum_{i=1}^{\infty} a_{j_i} \leq \sum_{i=1}^{\infty} a_i$$
- $(a_n)$ is a rearrangement of $(a_{j_i})$. So we can repeat the entire proof to get
  $$\sum_{i=1}^{\infty} a_i \leq \sum_{i=1}^{\infty} a_{j_i}$$
- Conclude
  $$\sum_{i=1}^{\infty} a_{j_i} = \sum_{i=1}^{\infty} a_i$$
Proof: First we need to prove the sum exists:
Convergence:
The idea of this proof is very simple: if we have a partial sum of a rearrangement, say
$$S'_3 = a_1 + a_8 + a_6$$
and we look at the partial sum corresponding to the maximum index among the rearranged terms,
$$S_8 = a_1 + a_2 + a_3 + a_4 + a_5 + a_6 + a_7 + a_8$$
then, by non-negativity:
$$S'_3 \leq S_8.$$
Using the fact that $(S_k)$ is bounded, we can use this upper bound to show $S'_k$ is bounded. Then, because $S'_k$ is bounded and has only non-negative added terms, it must converge.

Now that you have the right idea, the formalization is simple: let
$$S'_k = \sum_{i=1}^{k} a_{j_i}$$
be an arbitrarily rearranged partial sum. Then, let $j_{\max}$ be the maximum index:
$$j_{\max} = \max\{j_1, j_2, \ldots, j_k\}$$
Then, the added terms
$$a_{j_1}, a_{j_2}, \ldots, a_{j_k}$$
are among
$$a_1, a_2, a_3, \ldots, a_{j_{\max}}$$
and since all terms are non-negative,
$$\sum_{i=1}^{k} a_{j_i} \leq a_1 + a_2 + \ldots + a_{j_{\max}} = \sum_{i=1}^{j_{\max}} a_i$$
Thus,
$$S'_k \leq S_{j_{\max}}.$$
But we know the sequence $(S_k)$ converges, so we can find some $B$ that bounds every term of $(S_k)$. Thus,
$$S'_k \leq S_{j_{\max}} \leq B.$$
This also means every term of $(S'_k)$ is bounded by $B$, and thus this sequence converges.
Equal limits:

Recall the proof of the Monotone Convergence Theorem. In that proof, we showed that an increasing sequence $(t_n)$ converged to the supremum of the set of all terms:
$$T = \sup\{t_1, t_2, \ldots\}.$$
In particular, this meant that the limit $T$ was bigger than any term in the sequence:
$$t_i \leq T.$$
(This should make intuitive sense: the limit of an increasing sequence of partial sums should be bigger than all partial sums.) So in fact, we know
$$S_k \leq \sum_{i=1}^{\infty} a_i$$
for all $k$ and therefore
$$S'_k \leq S_{j_{\max}} \leq \sum_{i=1}^{\infty} a_i$$
for all $k$. Removing the middle man,
$$S'_k \leq \sum_{i=1}^{\infty} a_i.$$
Now, we are going to use two key facts (I would like to prove these facts, but there is so much more we need to cover; the first is an easy exercise on the N-$\varepsilon$ definition, and the second requires introducing the concept of a bijection):

1. If each term of a sequence is less than (or equal to) a constant, then the limit is less than (or equal to) that constant: if $(t_n) \to t$ and $t_n \leq c$, then $t \leq c$.
2. If $(x_n)$ is a rearrangement of $(y_n)$, then $(y_n)$ is a rearrangement of $(x_n)$.

Since $\sum_{i=1}^{\infty} a_i$ is just a constant, we use the first fact on
$$S'_k \leq \sum_{i=1}^{\infty} a_i$$
to get
$$\sum_{i=1}^{\infty} a_{j_i} \leq \sum_{i=1}^{\infty} a_i.$$
Now, using the second fact, we know $(a_n)$ is a rearrangement of $(a_{j_i})$. So we repeat the entire argument from the beginning to get
$$\sum_{i=1}^{\infty} a_i \leq \sum_{i=1}^{\infty} a_{j_i}.$$
Since
$$\sum_{i=1}^{\infty} a_{j_i} \leq \sum_{i=1}^{\infty} a_i \qquad\text{and}\qquad \sum_{i=1}^{\infty} a_i \leq \sum_{i=1}^{\infty} a_{j_i},$$
we conclude
$$\sum_{i=1}^{\infty} a_{j_i} = \sum_{i=1}^{\infty} a_i. \qquad\blacksquare$$
Two final comments about this proof:

- If you still have doubts about the "repeat the argument" step, simply define a new sequence $(b_i)$ by
  $$b_i = a_{j_i}$$
  for each $i$. Then define a rearrangement $(b_{q_i})$ such that
  $$b_{q_i} = a_i$$
  for each $i$. Now, using $b$'s instead of $a$'s, we get
  $$\sum_{i=1}^{k} \underbrace{b_{q_i}}_{a_i} \leq \sum_{i=1}^{\infty} \underbrace{b_i}_{a_{j_i}}.$$
- The assumption that the sequence is non-negative is overkill. We only need the sequence to be absolutely convergent. However, the proof for absolutely convergent sequences will be one of your homework exercises.
New Notation

Symbol: $\sum_{n=1}^{\infty} a_n$
Reading: The sum of $a_n$ from $n = 1$ to infinity.
Example: $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$
Example Translation: The sum of $\frac{1}{n^2}$ from $n = 1$ to infinity is $\frac{\pi^2}{6}$.
Lecture 18

From Differentiable to Directional

Go confidently in the direction of your dreams.
Live the life you have imagined.
Learn the mathematics you always wanted to learn.
- Henry David Thor R(A)

Goals: We fulfill our promise and prove the two remaining fundamental facts about differentiability. First, we prove that differentiability implies that all directional derivatives exist. Second, if a function is differentiable, then the computation of its directional derivative is reduced to a matrix multiplication between the matrix that satisfies the differentiability definition and the direction vector. Moreover, the j-th column of this matrix must be the j-th partial derivative. Lastly, we introduce gradients.
18.1 The Story Thus Far...
Last week, we provided motivation for the definition of differentiability. We explained that simply replacing the variables in
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
with vectors is not enough. Namely, we can take the derivative from any direction $v$:
$$\lim_{t \to 0} \frac{f(x + tv) - f(x)}{t}$$
So a first guess at the definition of differentiability is

All directional derivatives exist.

However, this is still insufficient since
- We can have all directional derivatives exist even for non-continuous functions.
- We want the computation of a directional derivative to be a simple matrix multiplication.

Instead, we approximate the difference
$$f(x + h) - f(x)$$
with a simple matrix multiplication
$$Ah$$
and say a function is differentiable if we can find some matrix $A$ (independent of $h$) such that
$$\lim_{h \to 0} \frac{f(x + h) - f(x) - Ah}{\|h\|} = 0.$$
Finally, we proved that this definition of differentiability implies continuity.
18.2 Differentiability in Action

Today, we justify our definition of differentiability by proving
- All directional derivatives exist.
- Any directional derivative is just a product of a matrix $A$ and the direction vector.
- This matrix $A$ is simply the matrix whose columns are partial derivatives.

We kill the first two with one stone:

Theorem. If $f : \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $x$, then all directional derivatives at $x$ exist. Moreover, the directional derivative with respect to $v$ is simply
$$Av$$
where $A$ is the matrix in the definition of differentiability.
Proof Summary:
- Let $\epsilon > 0$ and $v$ be an arbitrary direction vector.
- Use the (fractional form) differentiability definition with $\epsilon' = \frac{\epsilon}{\|v\|}$ and get a corresponding $\delta'$.
- Plug in $tv$ for $h$ to get the directional derivative $\delta$-condition with $\delta = \frac{\delta'}{\|v\|}$.

Proof: Let $x \in \mathbb{R}^n$. For any direction $v \in \mathbb{R}^n$, we want to show that the directional derivative with respect to $v$ exists and equals some $L$. By definition of directional derivative, we need to prove that for any $\epsilon > 0$, there exists a $\delta > 0$ such that if
$$0 < |t| < \delta$$
then
$$\left\| \frac{f(x + tv) - f(x)}{t} - L \right\| < \epsilon.$$
Let $v$ and $\epsilon > 0$ be arbitrary. Using the (fractional form) definition of differentiability, we know there exists a matrix $A$ such that for any $\epsilon' > 0$, there is a corresponding $\delta' > 0$ such that if
$$0 < \|h\| < \delta'$$
then
$$\left\| \frac{f(x + h) - f(x) - Ah}{\|h\|} \right\| < \epsilon'.$$
In particular, consider a vector $tv$ that satisfies the condition:
$$0 < \|tv\| < \delta'.$$
Then,
$$\left\| \frac{f(x + tv) - f(x)}{\|tv\|} - \frac{A(tv)}{\|tv\|} \right\| < \epsilon'.$$
By absolute value properties and pulling out constants, we can reduce this to: if
$$0 < |t| < \frac{\delta'}{\|v\|},$$
then
$$\frac{1}{\|v\|} \left\| \frac{f(x + tv) - f(x)}{|t|} - \frac{t(Av)}{|t|} \right\| < \epsilon'.$$
This motivates us to choose
$$\epsilon' = \frac{\epsilon}{\|v\|}.$$
With the corresponding $\delta'$, we construct
$$\delta = \frac{\delta'}{\|v\|}.$$
Therefore: if
$$0 < |t| < \underbrace{\frac{\delta'}{\|v\|}}_{\delta}$$
then
$$\left\| \frac{f(x + tv) - f(x)}{|t|} - \frac{t(Av)}{|t|} \right\| < \underbrace{\epsilon' \|v\|}_{\epsilon}.$$
Lastly, notice that when you multiply the inside of a norm by a constant $c$, the sign doesn't matter:
$$\|cv\| = |c|\,\|v\|.$$
Thus, we can replace $|t|$ by $t$ in our condition:
$$\left\| \frac{f(x + tv) - f(x)}{t} - \frac{t(Av)}{t} \right\| < \epsilon.$$
In conclusion, for every $\epsilon > 0$, there exists a $\delta > 0$ (namely $\delta = \frac{\delta'}{\|v\|}$) such that when
$$0 < |t| < \delta,$$
then
$$\left\| \frac{f(x + tv) - f(x)}{t} - Av \right\| < \epsilon.$$
In other words, the directional derivative at $v$ exists and equals $Av$. $\blacksquare$
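Here is a small finite-difference sanity check (my own example, not from the text; the function $f$, the point $x$, and the direction $v$ are arbitrary choices) that the quotient $\frac{f(x+tv)-f(x)}{t}$ really approaches $Av$, where $A$ is the Jacobian matrix of partial derivatives.

```python
import numpy as np

# f : R^2 -> R^2, f(x, y) = (x^2 * y, sin(x) + y)
def f(p):
    x, y = p
    return np.array([x**2 * y, np.sin(x) + y])

def jacobian(p):
    x, y = p
    return np.array([[2 * x * y, x**2],
                     [np.cos(x), 1.0]])

x = np.array([1.0, 2.0])
v = np.array([0.3, -0.7])     # an arbitrary direction vector
A = jacobian(x)

for t in [1e-2, 1e-4, 1e-6]:
    quotient = (f(x + t * v) - f(x)) / t
    print(t, np.linalg.norm(quotient - A @ v))   # error shrinks as t -> 0
```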
Now that we've proved that all directional derivatives at $x$ exist, it is easy to prove that the columns of $A$ are the partial derivatives.

Theorem. If $f : \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $x \in \mathbb{R}^n$, then the only $m \times n$ matrix $A$ that satisfies the differentiability definition at $x$ is
$$A = \begin{bmatrix} D_1 f(x) & D_2 f(x) & \ldots & D_n f(x) \end{bmatrix}$$

Proof: By the preceding proof, we know that the directional derivative of $f$ at $x$ with respect to $e_i$ exists and is
$$A e_i.$$
But this matrix multiplication just pulls out the $i$-th column. Therefore, the $i$-th column of $A$ is the $i$-th partial derivative. Duh. $\blacksquare$

Now that we have proved that $A$ must have this awesome form, we award it with a new symbol and title:

Definition. The Jacobian matrix of the differentiable function $f : \mathbb{R}^n \to \mathbb{R}^m$ evaluated at $x$ is the $m \times n$ matrix
$$Df(x) = \begin{bmatrix} D_1 f(x) & D_2 f(x) & \ldots & D_n f(x) \end{bmatrix}$$
Be careful about this notation! In particular,
$$Df(x)v$$
is not a product of $x$ with $v$. It is the multiplication of the matrix $Df(x)$ (the Jacobian of $f$ evaluated at $x$) with the vector $v$. I agree that this notation is very cumbersome (I always had reservations about the notation for the Jacobian and the gradient; in fact, in the proof of the Chain Rule I will dispense with this notation completely in favor of subscripts), but this is the standard notation so you just have to get used to it.
18.3 Gradients
For a function that maps into $\mathbb{R}$, you should ask

What does a directional derivative look like?

Since each partial derivative is just a number, the Jacobian is a row vector:
$$Df(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) & \frac{\partial f}{\partial x_2}(x) & \ldots & \frac{\partial f}{\partial x_n}(x) \end{bmatrix}$$
Therefore, the directional derivative with respect to $v$,
$$Df(x)v,$$
is just a single number:
$$\frac{\partial f}{\partial x_1}(x)v_1 + \frac{\partial f}{\partial x_2}(x)v_2 + \ldots + \frac{\partial f}{\partial x_n}(x)v_n.$$
Since we can talk about numbers being greater than or less than other numbers, another question we can ask is,

At a given point, what direction vector will give the greatest directional derivative?

Unfortunately, this is a bad question. We can choose larger direction vectors to get arbitrarily large directional derivatives:
$$\begin{aligned}
Df(x)(2v) &= 2\, Df(x)v \\
Df(x)(3v) &= 3\, Df(x)v \\
Df(x)(4v) &= 4\, Df(x)v \\
&\;\;\vdots
\end{aligned}$$
However, we can refine this question by focusing only on the direction of the vector. Just as in our definition of differentiability, by normalizing the direction vector to have norm 1 (so that it is on the unit sphere), we can control the effect of magnitude and isolate the effect of direction.

Our revised question is now

At a given point, what direction vector $v$ on the unit sphere will give the greatest directional derivative?

It turns out that the answer is really simple. First, because we hate row vectors, we prop the Jacobian up as a column and give it a new name.
Definition. For a function $f : \mathbb{R}^n \to \mathbb{R}$ that is differentiable at $x$, the gradient of $f$ evaluated at $x$ is the column vector of partial derivatives
$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \frac{\partial f}{\partial x_2}(x) \\ \frac{\partial f}{\partial x_3}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}.$$
Just like with maxima and minima, it only makes sense to talk about gradients when dealing with functions that map into $\mathbb{R}$. In this case, we have
$$D_v f(x) = Df(x)v = \nabla f(x) \cdot v$$
Now, to answer our question,

At a given point, the normalized gradient is the vector on the unit sphere that gives the greatest directional derivative.

Let's prove this fact!
Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be differentiable at $a$. If $\nabla f(a) \neq \vec{0}$, then for any direction vector $v \in \mathbb{R}^n$ with $\|v\| = 1$,
$$Df(a)v \leq Df(a)\left( \frac{\nabla f(a)}{\|\nabla f(a)\|} \right).$$
Moreover, equality is achieved when and only when
$$v = \frac{\nabla f(a)}{\|\nabla f(a)\|}.$$

Proof Summary:
- Rewrite the condition in terms of dot products and gradients.
- Bound $\nabla f(a) \cdot v$ using the alternate form of the Cauchy-Schwarz inequality.
- Use the fact that Cauchy-Schwarz equality is achieved with $v = \frac{1}{\|\nabla f(a)\|} \nabla f(a)$ to rewrite the upper bound.
- Take the norm of both sides of the Cauchy-Schwarz equality condition to show that equality is achieved only when $\lambda = \|\nabla f(a)\|$.
Proof: Since $Df(a)$ is a row vector, $Df(a)v$ is really just a vector dot product
$$Df(a)v = \underbrace{(Df(a))^T}_{\nabla f(a)} \cdot\, v$$
By definition of gradient, we can rewrite the inequality we are trying to prove as
$$\nabla f(a) \cdot v \;\leq\; \nabla f(a) \cdot \left( \frac{\nabla f(a)}{\|\nabla f(a)\|} \right).$$
On your first homework, you proved an alternate version of the Cauchy-Schwarz inequality:

Homework. For any $x, y \in \mathbb{R}^n$,
$$x \cdot y \leq \|x\|\,\|y\|.$$
Moreover, equality is achieved when and only when
$$x = \lambda y \text{ for some } \lambda \geq 0.$$

Applying this inequality,
$$\nabla f(a) \cdot v \leq \|\nabla f(a)\|\,\|v\|.$$
Since $\|v\| = 1$, in fact
$$\nabla f(a) \cdot v \leq \|\nabla f(a)\|. \qquad (*)$$
Moreover, equality is achieved for the unit vector
$$v = \underbrace{\frac{1}{\|\nabla f(a)\|}}_{\lambda \,\geq\, 0} \nabla f(a).$$
Therefore,
$$\nabla f(a) \cdot \left( \frac{\nabla f(a)}{\|\nabla f(a)\|} \right) = \|\nabla f(a)\| \underbrace{\left\| \frac{\nabla f(a)}{\|\nabla f(a)\|} \right\|}_{=1} = \|\nabla f(a)\|.$$
This enables us to rewrite $(*)$ as
$$\nabla f(a) \cdot v \;\leq\; \underbrace{\nabla f(a) \cdot \left( \frac{\nabla f(a)}{\|\nabla f(a)\|} \right)}_{\|\nabla f(a)\|}$$
as needed. Moreover, equality only holds in $(*)$ when
$$\nabla f(a) = \lambda v$$
for some $\lambda \geq 0$. Taking the norm of both sides,
$$\|\nabla f(a)\| = |\lambda| \underbrace{\|v\|}_{=1}.$$
Replacing $|\lambda|$ with $\lambda$ since it is non-negative, $\lambda = \|\nabla f(a)\|$. Therefore, equality holds when and only when
$$v = \frac{\nabla f(a)}{\|\nabla f(a)\|}. \qquad\blacksquare$$
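Here is a numeric sketch of what was just proved (my own example function and point, not from the text): among unit vectors, the dot product $\nabla f(a)\cdot v$ is largest when $v$ points along $\nabla f(a)$, and the maximum value is $\|\nabla f(a)\|$.

```python
import numpy as np

# f(x, y) = x^2 + 3xy, so grad f(x, y) = (2x + 3y, 3x)
def grad_f(a):
    x, y = a
    return np.array([2 * x + 3 * y, 3 * x])

a = np.array([1.0, 2.0])
g = grad_f(a)

# Sample many unit vectors and compute the directional derivative grad . v
angles = np.linspace(0, 2 * np.pi, 10000)
units = np.column_stack([np.cos(angles), np.sin(angles)])
derivs = units @ g

best = units[np.argmax(derivs)]
print(best, g / np.linalg.norm(g))       # nearly identical directions
print(derivs.max(), np.linalg.norm(g))   # maximum value is ||grad f(a)||
```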
So you don't forget this theorem, just remember:

the gradient is in the direction of maximal increase.

In fact, the gradient gives us something more. First, we define the multivariable analogue of a local maximum and a local minimum:

Definition. For $f : \mathbb{R}^n \to \mathbb{R}$, we say $f$ has a local maximum at $a$ if there exists an $\epsilon > 0$ such that
$$f(x) \leq f(a) \text{ for all } x \in B_\epsilon(a).$$
Likewise, $f$ has a local minimum at $a$ if there exists an $\epsilon > 0$ such that
$$f(a) \leq f(x) \text{ for all } x \in B_\epsilon(a).$$
Visually, we are just restricting the domain to a small ball.

Recall how in single variable calculus,

The derivative is 0 at a local maximum or minimum.

Analogously, in multivariable calculus,

The gradient is $\vec{0}$ at a local maximum or minimum.

We only prove the maximum case below, though both will be a direct result of the 1D case:

Theorem. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ has a local maximum at $a$. Then,
$$\nabla f(a) = \vec{0}.$$
Proof Summary:
- Consider an arbitrary component of $\nabla f(a)$.
- Rewrite this component as a derivative of a one dimensional function $g(t)$ evaluated at $a_i$.
- $g(t)$ has a local maximum at $a_i$, so $g'(a_i) = 0$.

Proof: Consider an arbitrary component of $\nabla f(a)$:
$$\frac{\partial f}{\partial x_i}(a)$$
We want to show that this must be 0.

Last week, we proved that the partial derivative can be rewritten as
$$\frac{\partial f}{\partial x_i}(a) = \lim_{h \to 0} \frac{g(a_i + h) - g(a_i)}{h} = \frac{dg}{dt}(a_i)$$
where $g(t)$ is the single variable function
$$g(t) = f(a_1, a_2, \ldots, t, \ldots, a_n)$$
with $t$ in the $i$-th slot. By definition of the multivariable local maximum, we know there exists an $\epsilon$ such that for all
$$x \in B_\epsilon(a)$$
we have
$$f(x) \leq f(a).$$
This means that
$$g(a_i + t) \leq g(a_i)$$
for all $t \in (-\epsilon, \epsilon)$: otherwise, if we did have
$$g(a_i + t^*) > g(a_i)$$
for some $t^*$, expanding the definition of $g$ would yield
$$(a_1, a_2, \ldots, a_i + t^*, \ldots, a_n) \in B_\epsilon(a)$$
yet
$$f(a_1, a_2, \ldots, a_i + t^*, \ldots, a_n) > f(a),$$
contradicting that $a$ is a local maximum of $f$. Since $g$ has a local maximum at $a_i$, by single-variable calculus,
$$\frac{dg}{dt}(a_i) = 0.$$
Because this is true for any component,
$$\nabla f(a) = \vec{0}. \qquad\blacksquare$$
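A quick finite-difference check of this fact (my own example, not from the text): $f(x,y) = 1 - (x-1)^2 - (y+2)^2$ has a local (in fact global) maximum at $(1,-2)$, and central differences there give a gradient that is numerically $\vec{0}$.

```python
import numpy as np

def f(p):
    x, y = p
    return 1 - (x - 1)**2 - (y + 2)**2   # local maximum at (1, -2)

def numeric_gradient(p, h=1e-6):
    grad = np.zeros(2)
    for i in range(2):
        e = np.zeros(2)
        e[i] = h
        grad[i] = (f(p + e) - f(p - e)) / (2 * h)   # central difference
    return grad

print(numeric_gradient(np.array([1.0, -2.0])))   # approximately [0, 0]
```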
New Notation

Symbol: $Df(x)$
Reading: The Jacobian of $f$ evaluated at $x$.
Example: $D_v f(x) = Df(x)v$
Example Translation: The directional derivative of $f$ at $x$ with respect to the direction $v$ is the product of the Jacobian of $f$ (evaluated at $x$) with $v$.

Symbol: $\nabla f(x)$
Reading: The gradient of $f$ evaluated at $x$.
Example: $D_v f(x) = \nabla f(x) \cdot v$
Example Translation: The directional derivative of $f$ at $x$ with respect to the direction $v$ is the dot product of the gradient of $f$ (evaluated at $x$) with $v$.
Lecture 19

From Directional to Differentiable

Let's convince ourselves we didn't do anything.
<Pause>. Yep! We didn't do anything.
- Leon Simon, regarding the telescoping series

Goals: Last lecture we proved that if a function is differentiable, then all directional derivatives (and hence all partial derivatives) exist. Today, we prove a partial converse: if all partial derivatives exist and are continuous, then the function is differentiable.
19.1 A Clever Converse
Last lecture, we proved:

If a function is differentiable, then all directional derivatives exist.

It is natural to ask if the converse is true:

If all directional derivatives exist, must the function be differentiable?

The answer is no: on last week's homework, you proved
$$f(x, y) = \begin{cases} \dfrac{|x|\,y}{\sqrt{x^2 + y^2}} & \text{if } (x, y) \neq (0, 0) \\[2mm] 0 & \text{if } (x, y) = (0, 0) \end{cases}$$
has all directional derivatives at $(0, 0)$ but is not differentiable at $(0, 0)$. But instead of giving up,

Math Mantra: If the converse fails, try to add some extra condition to make it true.
When extra conditions are added, we call the result a partial converse. Partial converses are abundant in mathematics. In fact, if you ask any math major, he will give you the prototypical example:

Cauchy's Theorem is a partial converse of Lagrange's Theorem.

But this is a book for underdogs, so here is a bartending analogy:

If you ever order a tropical drink, it's going to involve light rum. This is because light rum is easy to mix and doesn't overpower the fruity taste of the drink. Great examples are Mojitos, Mai Tais, and Pina Coladas. It is true (there are a few exceptions to this rule, like a Tequila Sunrise, but for mathematical reasons, assume this assertion is true) that

If a drink is tropical, then it has light rum.

But the converse

If a drink has light rum, then it is tropical.

is not true: there are many drinks with light rum that are certainly not tropical. For example, some Egg Nog recipes use light rum. And we all know Egg Nog is for Christmas!

However, suppose we impose the additional condition that the brand of the light rum is Malibu. Since Malibu light rum is infused with coconut extract, any drink with Malibu will have a Caribbean overtone. Thus, the partial converse is true:

If a drink has light rum and the brand of that light rum is Malibu, then it is tropical.

In fact, the drink is so tropical, we call it Tropic of Calculus. (Obligatory Tom Lehrer joke.)

Returning to the main question, we simply need to add the modest proviso of continuity:

If all directional derivatives exist and are continuous, then the function is differentiable.

In fact, since directional derivatives are linear combinations of partial derivatives, this can be boiled down to

If all partial derivatives exist and are continuous, then the function is differentiable.
So how are we going to prove this fact?

The proof that differentiability implies all directional derivatives exist was pretty obvious: just apply the definitions. But the converse is not as straightforward. Generally,

Math Mantra: Some proofs require EXTREME cleverness.

Although you will need to be clever, the proof is simple. To quote Professor Simon,

The best proofs are simple. I did not say easy, I said simple.

The clever ideas we are going to use are:
- By adding a lot of zeros, we are going to rewrite a difference as a giant sum (the jargon is that we rewrite the difference as a telescoping series).
- We can apply the 1D Mean Value Theorem to consecutive pairs in this sum.

The first idea is a technique we have been using since Week 1. The second, however, requires some discussion.
19.2 Applying the 1D Mean Value Theorem
To use 1-dimensional results to prove multidimensional theorems, we used two tricks:
- We invoked the 1-dimensional result multiple times (for example, on each component of a vector).
- We took a multivariable function and constructed a single variable function by fixing all terms except one.

The second trick was first used to convert the calculation of the directional derivative with respect to $e_j$ into single variable differentiation. This was also the lynchpin of the proof that, at local extrema, the gradient is $\vec{0}$. Here, we will use this trick to apply the single variable Mean Value Theorem to multivariable functions.

By the way, to prove

If all partial derivatives exist and are continuous, then the function is differentiable,

why would we even think to use the Mean Value Theorem?

Simply mull over the goal and the given: we need to find a way to rewrite the definition of differentiability, particularly the $\epsilon$-condition
$$|f(x + h) - f(x) - Df(a)h| < \epsilon \|h\|,$$
so that we can exploit the continuity of partial derivatives. Particularly, we need to introduce the differences of partial derivatives so we can shrink them by continuity:
$$\frac{\partial f}{\partial x_1}(a + h) - \frac{\partial f}{\partial x_1}(a), \quad \frac{\partial f}{\partial x_2}(a + h) - \frac{\partial f}{\partial x_2}(a), \quad \ldots, \quad \frac{\partial f}{\partial x_n}(a + h) - \frac{\partial f}{\partial x_n}(a) \qquad (\star)$$
If we expand the $Df(a)h$ term of
$$|f(x + h) - f(x) - Df(a)h|,$$
we get
$$-h_1 \frac{\partial f}{\partial x_1}(a) - h_2 \frac{\partial f}{\partial x_2}(a) - \ldots - h_n \frac{\partial f}{\partial x_n}(a).$$
Note that each term is a scaling of the subtracted part of the corresponding difference in $(\star)$. To complete the differences in $(\star)$, we stare at the other term,
$$f(x + h) - f(x).$$
This expression reminds us that, if $f$ were a single variable function, the 1D Mean Value Theorem could be used to transform
$$f(\underbrace{b}_{x+h}) - f(\underbrace{a}_{x})$$
into
$$f'(q)(b - a),$$
which is an expression involving a derivative. By applying the 1D Mean Value Theorem to multivariable functions, we are going to extract our missing partial derivatives:
Theorem. Let $f : \mathbb{R}^n \to \mathbb{R}$ be differentiable. Then for some $\theta_i \in [0, 1]$,
$$f(a_1, \ldots, a_i + h_i, \ldots, a_n) - f(a_1, \ldots, a_i, \ldots, a_n) = h_i \, \frac{\partial f}{\partial x_i}(a_1, \ldots, a_i + \theta_i h_i, \ldots, a_n),$$
where only the $i$-th component is perturbed.

Proof Summary:
- Define a single variable function $g$ that adds $t$ to the $i$-th argument of $f$.
- Apply the 1D Mean Value Theorem to $g$ and expand in terms of $f$.
- Rewrite the $q \in [0, h_i]$ that satisfies the Mean Value Theorem as $q = \theta_i h_i$.
Proof: Define the single variable function
$$g(t) = f(a_1, a_2, \ldots, a_i + t, \ldots, a_n).$$
By the 1D Mean Value Theorem, there exists a $q_i \in [0, h_i]$ such that
$$g(h_i) - g(0) = g'(q_i)\,(h_i - 0).$$
But
$$g'(q_i) = \lim_{t \to 0} \frac{g(q_i + t) - g(q_i)}{t}.$$
Expanding the right hand side, this is
$$\lim_{t \to 0} \frac{f(a_1, \ldots, a_i + q_i + t, \ldots, a_n) - f(a_1, \ldots, a_i + q_i, \ldots, a_n)}{t}.$$
But we can pull the $t$ out of the $i$-th component and view it as a perturbation by $t e_i$,
$$\lim_{t \to 0} \frac{f\big((a_1, \ldots, a_i + q_i, \ldots, a_n) + t e_i\big) - f(a_1, \ldots, a_i + q_i, \ldots, a_n)}{t},$$
so $g'(q_i)$ is just the $i$-th partial derivative
$$g'(q_i) = \frac{\partial f}{\partial x_i}(a_1, \ldots, a_i + q_i, \ldots, a_n).$$
Thus, when we expand
$$g(h_i) - g(0) = g'(q_i)\, h_i$$
in terms of $f$, we get
$$\underbrace{f(a_1, \ldots, a_i + h_i, \ldots, a_n)}_{g(h_i)} - \underbrace{f(a_1, \ldots, a_i, \ldots, a_n)}_{g(0)} = h_i \underbrace{\frac{\partial f}{\partial x_i}(a_1, \ldots, a_i + q_i, \ldots, a_n)}_{g'(q_i)}$$
Finally, by definition of $q_i$,
$$0 \leq q_i \leq h_i.$$
Therefore, we can find a $\theta_i$ that is between 0 and 1 such that
$$q_i = \underbrace{\frac{q_i}{h_i}}_{\theta_i}\, h_i.$$
Substituting,
$$f(a_1, \ldots, a_i + h_i, \ldots, a_n) - f(a_1, \ldots, a_i, \ldots, a_n) = h_i \, \frac{\partial f}{\partial x_i}(a_1, \ldots, a_i + \theta_i h_i, \ldots, a_n). \qquad\blacksquare$$
19.3 The Proof
We only need to prove the case where f maps into R. Again, this is because we can apply the case
for m = 1 to each component.
Theorem. Suppose all partial derivatives of $f : \mathbb{R}^n \to \mathbb{R}$ exist and are continuous. Then $f$ is differentiable.

Proof Summary:
- Let $\epsilon > 0$.
- Rewrite $f(x + h) - f(x)$ by adding pairs that differ by an $h_i$ component and applying the preceding theorem.
- By the triangle inequality, bound $|f(x + h) - f(x) - Df(x)h|$ by difference pairs of $i$-th partial derivatives.
- By continuity, we can shrink each pair to less than $\frac{\epsilon}{n}$ for a corresponding $\delta_i$.
- The $\epsilon$-condition of differentiability holds with the choice $\delta = \min\{\delta_1, \delta_2, \ldots, \delta_n\}$.
Proof: We want to show that for any $\epsilon > 0$, there exists a $\delta > 0$ such that if
$$0 < \|h\| < \delta$$
then
$$|f(x + h) - f(x) - Df(x)h| < \epsilon \|h\|.$$
When we expand the vector notation of
$$f(x + h) - f(x),$$
we have
$$f(x_1 + h_1,\; x_2 + h_2,\; x_3 + h_3,\; \ldots,\; x_n + h_n) - f(x_1,\; x_2,\; x_3,\; \ldots,\; x_n).$$
We now add zeros in between, where each zero pair differs by one $h_i$ component:
$$\begin{aligned}
&f(x_1 + h_1,\, x_2 + h_2,\, \ldots,\, x_{n-1} + h_{n-1},\, x_n + h_n) \\
&\quad - f(x_1 + h_1,\, x_2 + h_2,\, \ldots,\, x_{n-1} + h_{n-1},\, x_n) + f(x_1 + h_1,\, x_2 + h_2,\, \ldots,\, x_{n-1} + h_{n-1},\, x_n) \\
&\quad - f(x_1 + h_1,\, x_2 + h_2,\, \ldots,\, x_{n-1},\, x_n) + f(x_1 + h_1,\, x_2 + h_2,\, \ldots,\, x_{n-1},\, x_n) \\
&\qquad \vdots \\
&\quad - f(x_1 + h_1,\, x_2,\, \ldots,\, x_{n-1},\, x_n) + f(x_1 + h_1,\, x_2,\, \ldots,\, x_{n-1},\, x_n) \\
&\quad - f(x_1,\, x_2,\, \ldots,\, x_{n-1},\, x_n).
\end{aligned}$$
Consider one pair at a time:
$$\begin{aligned}
&f(x_1 + h_1,\, \ldots,\, x_{n-1} + h_{n-1},\, x_n + h_n) - f(x_1 + h_1,\, \ldots,\, x_{n-1} + h_{n-1},\, x_n) \\
&f(x_1 + h_1,\, \ldots,\, x_{n-1} + h_{n-1},\, x_n) - f(x_1 + h_1,\, \ldots,\, x_{n-1},\, x_n) \\
&\qquad \vdots \\
&f(x_1 + h_1,\, x_2,\, \ldots,\, x_n) - f(x_1,\, x_2,\, \ldots,\, x_n).
\end{aligned}$$
Notice that each pair only differs in the $i$-th component:
$$f(x_1 + h_1,\, \ldots,\, x_{i-1} + h_{i-1},\, x_i + h_i,\, x_{i+1},\, \ldots,\, x_n) - f(x_1 + h_1,\, \ldots,\, x_{i-1} + h_{i-1},\, x_i,\, x_{i+1},\, \ldots,\, x_n).$$
Applying the preceding theorem, this difference is the same as
$$h_i \, \frac{\partial f}{\partial x_i}(x_1 + h_1,\, \ldots,\, x_{i-1} + h_{i-1},\, x_i + \theta_i h_i,\, x_{i+1},\, \ldots,\, x_n)$$
for some $\theta_i \in [0, 1]$. Rewriting each pair and reordering, our sum reduces to
$$h_1 \frac{\partial f}{\partial x_1}(x_1 + \theta_1 h_1,\, x_2,\, \ldots,\, x_n) + h_2 \frac{\partial f}{\partial x_2}(x_1 + h_1,\, x_2 + \theta_2 h_2,\, x_3,\, \ldots,\, x_n) + \ldots + h_n \frac{\partial f}{\partial x_n}(x_1 + h_1,\, \ldots,\, x_{n-1} + h_{n-1},\, x_n + \theta_n h_n)$$
Now that expanding the vectors has served its purpose, we can rewrite this concisely as
$$h_1 \frac{\partial f}{\partial x_1}(x + \theta_1 h_1 e_1) + h_2 \frac{\partial f}{\partial x_2}(x + h_1 e_1 + \theta_2 h_2 e_2) + \ldots + h_n \frac{\partial f}{\partial x_n}(x + h_1 e_1 + h_2 e_2 + \ldots + \theta_n h_n e_n).$$
Using this expression for $f(x + h) - f(x)$ and
$$Df(x)(h) = h_1 \frac{\partial f}{\partial x_1}(x) + h_2 \frac{\partial f}{\partial x_2}(x) + \ldots + h_n \frac{\partial f}{\partial x_n}(x),$$
we can rewrite
$$|f(x + h) - f(x) - Df(x)(h)|$$
as
$$\left| h_1 \frac{\partial f}{\partial x_1}(x + \theta_1 h_1 e_1) + h_2 \frac{\partial f}{\partial x_2}(x + h_1 e_1 + \theta_2 h_2 e_2) + \ldots + h_n \frac{\partial f}{\partial x_n}(x + h_1 e_1 + \ldots + \theta_n h_n e_n) - h_1 \frac{\partial f}{\partial x_1}(x) - h_2 \frac{\partial f}{\partial x_2}(x) - \ldots - h_n \frac{\partial f}{\partial x_n}(x) \right|$$
After grouping the partial derivatives, we can use multiple applications of the triangle inequality to bound the above by
$$\left| h_1 \frac{\partial f}{\partial x_1}(x + \theta_1 h_1 e_1) - h_1 \frac{\partial f}{\partial x_1}(x) \right| + \left| h_2 \frac{\partial f}{\partial x_2}(x + h_1 e_1 + \theta_2 h_2 e_2) - h_2 \frac{\partial f}{\partial x_2}(x) \right| + \ldots + \left| h_n \frac{\partial f}{\partial x_n}(x + h_1 e_1 + h_2 e_2 + \ldots + \theta_n h_n e_n) - h_n \frac{\partial f}{\partial x_n}(x) \right|$$
But we can bound this further by pulling out the constant $|h_i|$ from each term and replacing it with $\|h\|$:
$$\|h\| \left( \left| \frac{\partial f}{\partial x_1}(x + \theta_1 h_1 e_1) - \frac{\partial f}{\partial x_1}(x) \right| + \left| \frac{\partial f}{\partial x_2}(x + h_1 e_1 + \theta_2 h_2 e_2) - \frac{\partial f}{\partial x_2}(x) \right| + \ldots + \left| \frac{\partial f}{\partial x_n}(x + h_1 e_1 + h_2 e_2 + \ldots + \theta_n h_n e_n) - \frac{\partial f}{\partial x_n}(x) \right| \right)$$
Now we would like to use the fact that each partial derivative is continuous: there exists $\delta_i > 0$ such that if
$$\|c_i\| < \delta_i$$
then
$$\left| \frac{\partial f}{\partial x_i}(x + c_i) - \frac{\partial f}{\partial x_i}(x) \right| < \frac{\epsilon}{n}.$$
Choose
$$\delta = \min\{\delta_1, \delta_2, \ldots, \delta_n\}$$
and let
$$\|h\| < \delta.$$
We then check that the choice
$$c_i = (h_1,\, h_2,\, \ldots,\, h_{i-1},\, \theta_i h_i,\, 0,\, \ldots,\, 0)$$
satisfies the $\delta$-condition. Expanding the norm,
$$\sqrt{h_1^2 + h_2^2 + \ldots + h_{i-1}^2 + \theta_i^2 h_i^2 + 0 + \ldots + 0},$$
because $\theta_i$ is between 0 and 1, this norm is bounded by
$$\sqrt{h_1^2 + h_2^2 + \ldots + h_{i-1}^2 + h_i^2 + 0 + \ldots + 0},$$
which is, of course, bounded by the full norm $\|h\|$. Thus,
$$\underbrace{\left| \frac{\partial f}{\partial x_1}(x + \theta_1 h_1 e_1) - \frac{\partial f}{\partial x_1}(x) \right|}_{< \frac{\epsilon}{n}} + \underbrace{\left| \frac{\partial f}{\partial x_2}(x + h_1 e_1 + \theta_2 h_2 e_2) - \frac{\partial f}{\partial x_2}(x) \right|}_{< \frac{\epsilon}{n}} + \ldots + \underbrace{\left| \frac{\partial f}{\partial x_n}(x + h_1 e_1 + h_2 e_2 + \ldots + \theta_n h_n e_n) - \frac{\partial f}{\partial x_n}(x) \right|}_{< \frac{\epsilon}{n}}$$
is less than $\epsilon$, and multiplying by $\|h\|$ gives
$$|f(x + h) - f(x) - Df(x)h| < \epsilon \|h\|. \qquad\blacksquare$$
Lecture 20

Ironclad Chain Rule

We do not want the reader to unlearn the automatic technique.
But the reader should know that the chain rule stands behind it.
- Kenneth Ross

Goals: Today, we prove the Chain Rule. This lecture is a Math 51H rite of passage: you will remember it for the rest of your life.

20.1 Differentiating a Composition
As a tutor and teacher of 7 years, one thing that completely boggles my mind is how the Chain Rule is taught. At most schools, it has devolved into some automated technique for strange looking functions. Students see
$$\sin(x^2)$$
and say

Differentiate the outside and multiply it by the derivative of the inside.

But you should not think of the Chain Rule in terms of inside and outside. The Chain Rule deserves more respect than an In-N-Out Burger. The correct view is that

The Chain Rule is a formula for differentiating the composition of functions.

Namely,

The derivative of $g(f(x))$ is the product of the derivative of $g$ evaluated at $f(x)$ with the derivative of $f$ evaluated at $x$.

Symbolically,
$$(g \circ f)'(x) = g'(f(x)) \cdot f'(x).$$
So to calculate the derivative of $\sin(x^2)$, we notice that it is the composition $(g \circ f)(x)$ where
$$g(x) = \sin(x) \qquad\qquad f(x) = x^2$$
Thus,
$$\underbrace{\big(\sin(x^2)\big)'}_{(g \circ f)'(x)} = \underbrace{\cos(x^2)}_{g'(f(x))} \cdot \underbrace{2x}_{f'(x)}$$
By the way, the Chain Rule is amazing! Before you learned the Chain Rule, you only knew how to differentiate
- Powers of $x$
- Sines and Cosines
- Sums of Differentiable Functions
- Products and Quotients of Differentiable Functions

Instantly, the Chain Rule gave you an entire new universe of differentiable functions to play with. For example, you could differentiate functions like
$$\cos\left( x^3 + \sin(\cos(\sin x^2)) \right)$$
You also used the Chain Rule to
- Solve Related Rates
- Implicitly Differentiate
- Calculate the Derivatives of Inverse Functions

(Literally, these three topics are simply the Chain Rule! It is absurd how some people just accept the sudden appearance of a mysterious $y'$ term.)

Now that we are in college, can we extend the magic of the 1-dimensional Chain Rule to multivariable functions? Absolutely! In fact, the Chain Rule simply becomes a product of matrices:
Example: Given the functions
$$g\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} y \\ x \\ y^2 \end{pmatrix} \qquad\qquad f\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} x^2 + xy \\ z^2 \end{pmatrix},$$
compute the Jacobian of the composition $g \circ f$:
$$D(g \circ f)(x)$$
and the matrix product
$$Dg(f(x))\, Df(x).$$
First, we compute the composition $g \circ f$:
$$(g \circ f)\begin{pmatrix} x \\ y \\ z \end{pmatrix} = g\begin{pmatrix} x^2 + xy \\ z^2 \end{pmatrix} = \begin{pmatrix} z^2 \\ x^2 + xy \\ z^4 \end{pmatrix}.$$
Calculating the Jacobians, we get
$$Dg\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 2y \end{pmatrix} \qquad Df\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 2x + y & x & 0 \\ 0 & 0 & 2z \end{pmatrix} \qquad D(g \circ f)\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 0 & 0 & 2z \\ 2x + y & x & 0 \\ 0 & 0 & 4z^3 \end{pmatrix}.$$
Notice that
$$\underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 2z^2 \end{pmatrix}}_{Dg(f(x))} \cdot \underbrace{\begin{pmatrix} 2x + y & x & 0 \\ 0 & 0 & 2z \end{pmatrix}}_{Df(x)} = \underbrace{\begin{pmatrix} 0 & 0 & 2z \\ 2x + y & x & 0 \\ 0 & 0 & 4z^3 \end{pmatrix}}_{D(g \circ f)(x)}$$
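As a sanity check of this example, here is a short symbolic sketch using sympy (my own addition, assuming sympy is available; the variable names are arbitrary). It computes both $D(g \circ f)$ and $Dg(f(x))\,Df(x)$ and confirms they agree.

```python
import sympy as sp

x, y, z, u, v = sp.symbols('x y z u v')

f = sp.Matrix([x**2 + x*y, z**2])        # f : R^3 -> R^2
g = sp.Matrix([v, u, v**2])              # g : R^2 -> R^3, written in variables (u, v)

Df = f.jacobian([x, y, z])
Dg = g.jacobian([u, v])

composition = g.subs({u: f[0], v: f[1]})           # g(f(x, y, z))
D_composition = composition.jacobian([x, y, z])
Dg_at_f = Dg.subs({u: f[0], v: f[1]})              # Dg evaluated at f(x, y, z)

print(D_composition)
print(sp.simplify(D_composition - Dg_at_f * Df))   # zero matrix
```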
From this example, we conjecture that

The Jacobian of $g \circ f$ is the product of the Jacobian of $g$ evaluated at $f(x)$ with the Jacobian of $f$ evaluated at $x$.

This is almost verbatim the 1D Chain Rule!

Now that we have the right idea, let's prove this fact. Like the 1D Chain Rule, the multivariable Chain Rule is going to open up a whole new world of possibilities.
20.2 The Proof
Before we get started, I must give you a warning. In class, the proof of the Chain Rule will take the full hour and will completely cover all three boards. You are going to dart back and forth referencing so many stars (*), double stars (**), and triple stars (***) that you'll be seeing stars. But really, this proof is just a ton of book-keeping. And always remember,

Math Mantra: NEVER be afraid of book-keeping!

Once you power your way through it, you are going to realize that this is a pretty simple proof. Personally, I think the only real difficulty is the notation.

For example, stare at the term
$$Dg(f(x))\,f(x + h).$$
Even though the $f$'s are right next to each other, they play two different roles. In
$$Dg(f(x))$$
we are evaluating the Jacobian at the vector $f(x)$. So
$$\underbrace{Dg(f(x))}_{\text{Matrix}} \underbrace{f(x + h)}_{\text{Vector}}$$
is really just a product of a matrix and a vector. We are also allowed to take the Jacobian of $f$ (evaluated at $x$), so
$$Dg(f(x))\,Df(x)\,h$$
is also a matrix multiplication
$$\underbrace{Dg(f(x))}_{\text{Matrix}} \underbrace{Df(x)}_{\text{Matrix}} \underbrace{h}_{\text{Vector}}$$
Honestly, toggling between
$$f(x) \qquad Df(x) \qquad Dg(f(x))$$
is as unnecessarily complicated as a Rube Goldberg machine.

To assist in this proof, I am going to introduce a new notation: a subscript will denote the vector at which the Jacobian is evaluated.

For ONLY this proof, let
$$Df_a$$
be a shorthand notation for the Jacobian of $f$ evaluated at $a$,
$$Df(a).$$
Do not get this confused with a directional derivative! Remember that the subscript of the directional derivative is placed between the $D$ and the $f$, while the vector at which the Jacobian is evaluated is placed after the $f$. So here, $Df_a$ will be a matrix and thus
$$\underbrace{Df_a}_{\text{Matrix}} \underbrace{h}_{\text{Vector}}$$
is a matrix multiplication.

Go ahead and yell at me for being inconsistent: the experts can add it to their list of grievances. But the vector where $Df$ is evaluated will have a minor role in the proof. It would just get in the way, so I've bumped it down to a subscript. You will see that doing so really clears up the proof. Regardless, once you understand the proof in its entirety, I encourage you to rewrite it with the standard notation.
Theorem. Let
$$g : \mathbb{R}^p \to \mathbb{R}^m \qquad\qquad f : \mathbb{R}^n \to \mathbb{R}^p$$
be differentiable functions. Then
$$g \circ f : \mathbb{R}^n \to \mathbb{R}^m$$
is differentiable and has Jacobian
$$D(g \circ f)(x) = Dg(f(x)) \cdot Df(x)$$

Proof Summary:
- Let $\epsilon > 0$.
- Incorporate the differentiability of $f$ into the $\epsilon$-condition of $g \circ f$ by adding
  $$\underbrace{- Dg_{f(x)}\big( f(x + h) - f(x) \big) + Dg_{f(x)}\big( f(x + h) - f(x) \big)}_{=0}$$
- Using the triangle inequality, bound the $\epsilon$-condition by
  $$\underbrace{\left\| g(f(x + h)) - g(f(x)) - Dg_{f(x)}\big( f(x + h) - f(x) \big) \right\|}_{A} + \underbrace{\left\| Dg_{f(x)}\big( f(x + h) - f(x) - Df_x h \big) \right\|}_{B}$$
- $B < \frac{\epsilon \|h\|}{2}$:
  - Use the Cauchy-Schwarz-like inequality for matrices to pull out $\|Dg_{f(x)}\|$.
  - Use differentiability of $f$ with $\epsilon_1 = \frac{\epsilon}{2 \|Dg_{f(x)}\|}$ and get a corresponding $\delta_1$.
- $A < \frac{\epsilon \|h\|}{2}$:
  - Rewrite differentiability of $g$ at $f(x)$ so the $\epsilon$-condition bounds $A$. Notice that this $\delta$-condition relies on $\|f(x) - f(x + h)\|$.
  - We can show this quantity is bounded by $\|h\|(1 + \|Df_x\|)$. To do this, we need to add
    $$\underbrace{- Df_x h + Df_x h}_{=0}$$
    and apply differentiability of $f$ with $\epsilon_2 = 1$ and corresponding $\delta_2$.
  - Using the revised differentiability definition, choose $\epsilon' = \frac{\epsilon}{2(1 + \|Df_x\|)}$; then we have $A < \frac{\epsilon \|h\|}{2}$ as long as $\|f(x) - f(x + h)\| < \delta_3$ and $\|h\| < \delta_2$.
  - To ensure the last condition is satisfied, further restrict $\|h\| < \frac{\delta_3}{1 + \|Df_x\|}$.
- $A + B < \epsilon \|h\|$ for the choice $\delta = \min\left\{ \delta_1,\; \delta_2,\; \frac{\delta_3}{1 + \|Df_x\|} \right\}$.
Proof: We want to show, for any $\epsilon > 0$, that there exists a $\delta > 0$ such that if
$$0 < \|h\| < \delta$$
then
$$\left\| g(f(x + h)) - g(f(x)) - Dg_{f(x)} Df_x h \right\| < \epsilon \|h\|.$$
Let $\epsilon > 0$. Now, we need to introduce the fact that $f$ and $g$ are differentiable, so stare at $Df_x h$ inside
$$\left\| g(f(x + h)) - g(f(x)) - Dg_{f(x)} Df_x h \right\|.$$
First, we try to incorporate the differentiability definition of $f$ by adding zero,
$$- \big( f(x + h) - f(x) \big) + \big( f(x + h) - f(x) \big).$$
However, the term containing $Df_x h$ is scaled by $Dg_{f(x)}$, so we add the missing differentiability terms scaled by $Dg_{f(x)}$:
$$\left\| g(f(x + h)) - g(f(x)) + \underbrace{ - Dg_{f(x)}\big( f(x + h) - f(x) \big) + Dg_{f(x)}\big( f(x + h) - f(x) \big)}_{=0} - Dg_{f(x)} Df_x h \right\|$$
Isolating the differentiability definition of $f$, we now have
$$\left\| \Big( g(f(x + h)) - g(f(x)) - Dg_{f(x)}\big( f(x + h) - f(x) \big) \Big) + Dg_{f(x)}\Big( f(x + h) - f(x) - Df_x h \Big) \right\|$$
Then, by the triangle inequality, this is bounded by
$$\underbrace{\left\| g(f(x + h)) - g(f(x)) - Dg_{f(x)}\big( f(x + h) - f(x) \big) \right\|}_{A} + \underbrace{\left\| Dg_{f(x)}\big( f(x + h) - f(x) - Df_x h \big) \right\|}_{B}$$
If we can find conditions on $\delta$ such that $A$ and $B$ are each less than $\frac{\epsilon \|h\|}{2}$, we are done!

$\bullet$ $B = \left\| Dg_{f(x)}\big( f(x + h) - f(x) - Df_x h \big) \right\| < \dfrac{\epsilon \|h\|}{2}$

Using the Cauchy-like inequality for matrices on
$$\underbrace{\left\| Dg_{f(x)} \big( f(x + h) - f(x) - Df_x h \big) \right\|}_{\|Ax\|},$$
we can bound this above by pulling out the matrix norm:
$$\underbrace{\left\| Dg_{f(x)} \right\|}_{\|A\|} \underbrace{\left\| f(x + h) - f(x) - Df_x h \right\|}_{\|x\|}.$$
Using differentiability of $f$, we know that, for
$$\epsilon_1 = \frac{\epsilon}{2 \left\| Dg_{f(x)} \right\|},$$
we can find a corresponding $\delta_1$ such that when
$$\|h\| < \delta_1$$
then
$$\left\| f(x + h) - f(x) - Df_x h \right\| < \frac{\epsilon}{2 \left\| Dg_{f(x)} \right\|} \|h\|.$$
Under this restriction on $\|h\|$,
$$\left\| Dg_{f(x)} \right\| \underbrace{\left\| f(x + h) - f(x) - Df_x h \right\|}_{< \frac{\epsilon \|h\|}{2 \| Dg_{f(x)} \|}} < \frac{\epsilon \|h\|}{2}.$$
Great! One term down, one to go.

$\bullet$ $A = \left\| g(f(x + h)) - g(f(x)) - Dg_{f(x)}\big( f(x + h) - f(x) \big) \right\| < \dfrac{\epsilon \|h\|}{2}$

Notice that
$$\left\| g(f(x + h)) - g(f(x)) - Dg_{f(x)}\big( f(x + h) - f(x) \big) \right\|$$
looks an awful lot like the $\epsilon$-condition of the definition of the differentiability of $g$ at $f(x)$: if
$$0 < \|s\| < \delta'$$
then
$$\left\| g(f(x) + s) - g(f(x)) - Dg_{f(x)} s \right\| < \epsilon' \|s\|.$$
However, we are a shifting factor off. So we consider a particular instance of $s$:
$$s = f(x + h) - f(x).$$
This gives us: if
$$0 < \underbrace{\| f(x + h) - f(x) \|}_{\|s\|} < \delta'$$
then
$$\left\| g(f(x + h)) - g(f(x)) - Dg_{f(x)}\big( f(x + h) - f(x) \big) \right\| < \epsilon' \| f(x + h) - f(x) \|.$$
Now we need to overcome two obstacles:
- We need to make the "if" condition
  $$0 < \| f(x + h) - f(x) \| < \delta'$$
  true by finding a restriction on $\|h\|$. Remember, we are trying to show that under certain restrictions on $\|h\|$, our condition of differentiability of $g \circ f$ holds.
- The right hand side of the condition,
  $$\epsilon' \| f(x + h) - f(x) \|,$$
  needs to be at most $\frac{\epsilon \|h\|}{2}$.

Look at
$$\| f(x + h) - f(x) \|.$$
We are going to find an upper bound that is a function of only $\|h\|$. Adding zero,
$$\left\| f(x + h) - f(x) \underbrace{ - Df_x h + Df_x h}_{=0} \right\|$$
and applying both the triangle inequality and the Cauchy-like inequality for matrices, we get the bound
$$\| f(x + h) - f(x) \| \leq \left\| f(x + h) - f(x) - Df_x h \right\| + \left\| Df_x \right\| \|h\|.$$
Using differentiability of $f$ for $\epsilon_2 = 1$ to get a corresponding $\delta_2$, we further bound
$$\| f(x + h) - f(x) \| \leq \underbrace{\left\| f(x + h) - f(x) - Df_x h \right\|}_{< 1 \cdot \|h\|} + \left\| Df_x \right\| \|h\| < \|h\| + \|h\| \left\| Df_x \right\|$$
by restricting $\|h\| < \delta_2$. Thus, if
$$\|h\| < \delta_2$$
we have
$$\| f(x + h) - f(x) \| < \|h\| \left( 1 + \left\| Df_x \right\| \right).$$
This bound solves our two obstacles: first, we know that for
$$\epsilon' = \frac{\epsilon}{2 \left( 1 + \left\| Df_x \right\| \right)}$$
we can find a $\delta_3$ such that if
$$0 < \| f(x) - f(x + h) \| < \delta_3$$
then
$$\left\| g(f(x + h)) - g(f(x)) - Dg_{f(x)}\big( f(x + h) - f(x) \big) \right\| < \frac{\epsilon}{2 \left( 1 + \left\| Df_x \right\| \right)} \left\| f(x) - f(x + h) \right\|.$$
Adding the condition
$$\|h\| < \delta_2,$$
we can further bound
$$\left\| g(f(x + h)) - g(f(x)) - Dg_{f(x)}\big( f(x + h) - f(x) \big) \right\| < \frac{\epsilon}{2 \left( 1 + \left\| Df_x \right\| \right)} \underbrace{\left\| f(x) - f(x + h) \right\|}_{< \|h\| (1 + \| Df_x \|)} < \frac{\epsilon \|h\|}{2}.$$
We are almost done: all that's left is to get a condition on $\|h\|$ that ensures
$$0 < \| f(x) - f(x + h) \| < \delta_3.$$
Again, we know that if $\|h\| < \delta_2$ we have
$$\| f(x) - f(x + h) \| < \|h\| \left( 1 + \left\| Df_x \right\| \right).$$
Shrinking $\|h\|$ such that
$$\|h\| < \frac{\delta_3}{1 + \left\| Df_x \right\|},$$
we can bound
$$\| f(x) - f(x + h) \| < \|h\| \left( 1 + \left\| Df_x \right\| \right) < \delta_3.$$
In conclusion, for
$$\delta = \min\left\{ \delta_1,\; \delta_2,\; \frac{\delta_3}{1 + \left\| Df_x \right\|} \right\},$$
if
$$\|h\| < \delta$$
then
$$\left\| g(f(x + h)) - g(f(x)) - Dg_{f(x)} Df_x h \right\| < \epsilon \|h\|. \qquad\blacksquare$$
If you can get through the waves of notation and bookkeeping, you should notice a subtlety.

In the proof of the multivariable Bolzano-Weierstrass Theorem, it was not enough to look at each component sequence simultaneously to get a subsequence of each. Instead, we examined the component sequences, one at a time, using each component sequence's subsequence to build the next component sequence's subsequence.

Though we didn't even have to look at component sequences in this proof, we have something similar here with the $\delta$-conditions. Normally, we examine a series of delta conditions separately and try to make them all simultaneously true. But in this proof, we needed to first fix one delta condition, namely
$$\|h\| < \delta_2.$$
Using this, we were then able to derive the remaining condition
$$\|h\| < \frac{\delta_3}{1 + \left\| Df_x \right\|}.$$
Lecture 21

Mighty Morphin Power Series

The Change of Base Potion calls for
1. Binomial Theorem
2. Interchange of Sums
3. The Limit of $\sqrt[n]{\binom{n}{m}}$ as $n \to \infty$
4. Three Measures of Boomslang Skin
- Hermione Im(G)

Goals: Today, we review power series and formally prove that the set on which a power series converges must be an interval. We also prove the Change of Base-Point Theorem for both polynomials and power series.
21.1 From Polynomial to Power
Ever since 7th grade, polynomials have been all the rave.
$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \ldots + a_3 x^3 + a_2 x^2 + a_1 x + a_0 = \sum_{i=0}^{n} a_i x^i$$
You spent a large fraction of high school math studying these objects (although most of the time, they were limited to the case $n = 2$). In particular, you learned how to:
- Multiply Binomials using the Acronym FOIL.
- Solve for Real and Non-Real Roots of Quadratics.
- Expand Powers of Binomials using the Binomial Theorem.
- Determine the Possible Rational Roots of Polynomials with Integer Coefficients.
- Divide and Factor Polynomials via Synthetic Division.
- Find the Asymptotes of the Quotient of Two Polynomials (Rational Functions).

And the list goes on. But what's the big deal with polynomials anyways?

Math Mantra: Polynomials are AMAZING.

They capture the construction process, starting with 1 and $x$, made by
- Multiplying
- Scaling
- Adding
For example, to build the polynomial
$$3x^3 + 5x^2,$$
start with just $x$. Then build $x^3$ and $x^2$ by multiplying $x$ by itself:
$$x \cdot x \cdot x = x^3 \qquad\qquad x \cdot x = x^2$$
Then scale to get
$$3 \cdot x^3 = 3x^3 \qquad\qquad 5 \cdot x^2 = 5x^2$$
Finally, just add:
$$3x^3 + 5x^2$$
Tada!

But you've seen the words Scaling, Adding, Multiplying many times! Particularly in
- The Definition of a Vector Space
- Theorems on Continuous Functions
- Theorems on Differentiable Functions

Indeed, the set of polynomials is a vector space. This will be important in your future analysis courses. Secondly,

All polynomials are continuous.

This is because 1 and $x$ are both continuous and continuous functions are closed under scaling, addition, and multiplication. Likewise,

All polynomials are differentiable.

And we all know that differentiating and integrating polynomials is a piece of cake. In fact, the derivative of a polynomial is still a polynomial. Moreover, if instead of just $x$, we started with
$$x_1, x_2, \ldots, x_n$$
and built multivariable polynomials, then

Any combination of partial derivatives of a polynomial is continuous.
Because this property is so cool, it has a name: we say that polynomials are smooth.

But it gets even better. As the last topic of Calc BC, you learned about power series:
$$f(x) = \sum_{n=0}^{\infty} a_n x^n$$
You were probably told that this was an "infinite polynomial." But that's ridiculous. Do not let the summation notation fool you!

Math Mantra: Math notation is often reused with different meanings! You must check the CONTEXT!

There's one major difference between polynomials and power series. Polynomials are always defined for all $x$. However, this is not true for all power series! For example, plugging $x = 1$ into the power series
$$f(x) = \sum_{n=0}^{\infty} n!\, x^n$$
gives us the series
$$f(1) = \sum_{n=0}^{\infty} n!$$
which completely explodes. In fact, the power series $f(x)$ diverges for any non-zero $x$.

Remember, the correct interpretation of any series is as the limit of partial sums. In particular, the correct interpretation of a power series is as the limit of approximating polynomials. For example,
$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$$
really means that for each $x$, $e^x$ is the limit of the sequence of polynomials:
$$\begin{aligned}
s_0 &= 1 \\
s_1 &= 1 + x \\
s_2 &= 1 + x + \frac{x^2}{2} \\
s_3 &= 1 + x + \frac{x^2}{2} + \frac{x^3}{6} \\
s_4 &= 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} \\
&\;\;\vdots
\end{aligned}$$
These polynomials provide increasingly accurate approximations of $e^x$ as we add higher degree terms.
[Figure: graphs of the partial-sum approximations $1 + x$, $\;1 + x + \frac{x^2}{2}$, $\;1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24}$, and $\;1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120} + \frac{x^6}{720}$ on the interval $[-1, 3]$, successively hugging the graph of $e^x$.]
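Here is a tiny numeric sketch of this picture (my own addition; the choice $x = 3$ is arbitrary): the partial sums $s_N(x) = \sum_{n=0}^{N} x^n/n!$ march toward $e^x$ as $N$ grows.

```python
import math

def partial_sum(x, N):
    return sum(x**n / math.factorial(n) for n in range(N + 1))

x = 3.0
for N in [1, 2, 4, 6, 10, 20]:
    print(N, partial_sum(x, N))
print("e^x =", math.exp(x))   # the partial sums approach 20.0855...
```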
Notice, however, that the power series itself is clearly not a polynomial: $e^x$ cannot be formed by multiplying, scaling, and adding 1 and $x$.
Next week we will answer the question of which functions can be expressed by a power series, and how to extract that power series. But today, we will consider the preliminary question,

For what values of $x$ does a power series converge?

At this moment, a few words from high school may come to mind: interval of convergence, ratio test, check both endpoints. You may think you know the answer. But you don't have the complete story. In case you forgot, here's an example:
Example: Determine the interval of convergence of
$$g(x) = \sum_{n=0}^{\infty} \frac{(-1)^n}{n + 1}\, x^n$$
First you would use the Ratio Test: call each term
$$b_n = \frac{(-1)^n}{n + 1}\, x^n$$
so
$$\frac{b_{n+1}}{b_n} = -\frac{n + 1}{n + 2}\, x.$$
Since
$$\lim_{n \to \infty} \left| \frac{b_{n+1}}{b_n} \right| = |x|,$$
the Ratio Test tells us that the series converges when
$$|x| < 1$$
and diverges when
$$|x| > 1.$$
Now we have to check the endpoints:
$$g(1) = \sum_{n=0}^{\infty} \frac{(-1)^n}{n + 1} \qquad\qquad g(-1) = \sum_{n=0}^{\infty} \frac{1}{n + 1}$$
Note that $g(1)$ is the alternating harmonic series whereas $g(-1)$ is the harmonic series. Thus, $g(x)$ converges if and only if
$$-1 < x \leq 1.$$
Therefore, the interval of convergence is $(-1, 1]$.
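A numeric illustration of the endpoint behaviour (my own sketch, not part of the example; the cutoffs are arbitrary): the partial sums at $x = 1$ settle down, while the partial sums at $x = -1$ keep growing like $\ln N$.

```python
def partial_sum(x, N):
    # partial sum of sum_{n=0}^{N} (-1)^n x^n / (n + 1)
    return sum((-1)**n * x**n / (n + 1) for n in range(N + 1))

for N in [10**3, 10**4, 10**5]:
    print(N, partial_sum(1.0, N), partial_sum(-1.0, N))
# The x = 1 column stabilizes (near 0.693...),
# while the x = -1 column is the harmonic series and keeps growing.
```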
However, the high school treatment of power series convergence is really sketch. In high school, we assumed that the set where a power series $g$ converged,
$$S = \{x \mid g(x) \text{ converges}\},$$
is an interval. This turned out to work because all our power series were simple enough that we could mindlessly apply the Ratio Test.

Simple. But that was high school. This is college, and the fact of the matter is:

We can't always apply the Ratio Test!

When you worked with power series in high school, you always had a pretty formula for $b_n$ and a nice way to compute the ratio. But what if Bart Simpson devoted his life to picking random coefficients for the power series?
$$g(x) = \pi + \frac{1}{11}\, x + \frac{1}{3!}\, x^2 + \frac{1}{24601}\, x^3 + \ldots$$
We can't apply the Ratio Test to this series! In fact, the Ratio Test cannot be applied to series that contain zero terms like
$$g(x) = 0 + x + 0 + 0 + x^4 + 0 + 0 + 0 + x^8 + \ldots$$
And without the Ratio Test, how do we know that the set of points where the series converges is actually an interval?

Luckily, because of the beauty of rigorous mathematics, we can prove that this is always the case. And our proof is going to rely on a simple lemma:
Lemma. If we can find some $\alpha$ such that the series
$$f(\alpha) = \sum_{n=0}^{\infty} a_n \alpha^n$$
converges, then for all $|x| < |\alpha|$, the series
$$f(x) = \sum_{n=0}^{\infty} a_n x^n$$
converges. (In fact it converges absolutely. But I want you to get into the habit of trying to check the simpler condition of absolute convergence as a strategy to prove convergence.)

Proof Summary:
- It suffices to prove absolute convergence, so we need to show the absolute value analogue is bounded.
- Make the added terms of the given convergent series, $a_n \alpha^n$, appear by multiplying by $\underbrace{\frac{\alpha^n}{\alpha^n}}_{=1}$.
- The sum is bounded by an infinite geometric series with ratio $\frac{|x|}{|\alpha|}$.

Proof: Remember our usual trick with series:
Math Mantra: If you have trouble proving that a series converges, it may be easier to prove that it converges absolutely. To do that, you only need to check that the absolute value analogue of the series is bounded.

Fix $x$ such that $|x| < |\alpha|$. To prove absolute convergence, we just need to show that the sequence of partial sums $(S_N)$, where
$$S_N = \sum_{n=0}^{N} |a_n x^n|,$$
is bounded. Looking at this partial sum, we first try to incorporate
$$a_n \alpha^n$$
by multiplying by 1:
$$S_N = \sum_{n=0}^{N} \left| \underbrace{\frac{\alpha^n}{\alpha^n}}_{=1}\, a_n x^n \right|$$
Pulling $\frac{x^n}{\alpha^n}$ outside the absolute value yields
$$S_N = \sum_{n=0}^{N} \left| \frac{x^n}{\alpha^n} \right| |a_n \alpha^n|$$
Notice that $a_n \alpha^n$ is the $n$-th added term of the given convergent series. Therefore, by the $N$-th term test,
$$a_n \alpha^n \to 0.$$
Since convergent sequences are bounded, there exists an $M$ such that
$$|a_n \alpha^n| \leq M$$
for every $n$. Thus,
$$\sum_{n=0}^{N} \left| \frac{x^n}{\alpha^n} \right| |a_n \alpha^n| \leq \sum_{n=0}^{N} \left| \frac{x^n}{\alpha^n} \right| M = M \sum_{n=0}^{N} \left| \frac{x^n}{\alpha^n} \right|$$
But lo and behold, what is this? Bringing the exponent outside the absolute value, we have a geometric series with ratio less than 1,
$$M \sum_{n=0}^{N} \underbrace{\left( \frac{|x|}{|\alpha|} \right)^n}_{\text{ratio} \,<\, 1},$$
which is bounded above by the infinite geometric sum
$$M \sum_{n=0}^{\infty} \left( \frac{|x|}{|\alpha|} \right)^n = \frac{M}{1 - \frac{|x|}{|\alpha|}}.$$
Thus, every partial sum is bounded above:
$$S_N = \sum_{n=0}^{N} |a_n x^n| \leq \frac{M}{1 - \frac{|x|}{|\alpha|}}.$$
We conclude that
$$f(x) = \sum_{n=0}^{\infty} a_n x^n$$
converges. $\blacksquare$
We will now use this lemma to prove that the convergence set of a power series always takes the form of an interval centered around 0:

Theorem. Given the power series
$$f(x) = \sum_{n=0}^{\infty} a_n x^n,$$
let $S$ be the set where $f(x)$ converges:
$$S = \{x \mid f(x) \text{ converges}\}.$$
Then $S$ has one of three possible forms:
1. $S = \{0\}$
2. $S = \mathbb{R}$
3. $S$ is a bounded interval centered around 0: for some $\beta > 0$, $S$ is one of
$$(-\beta, \beta), \quad [-\beta, \beta), \quad (-\beta, \beta], \quad [-\beta, \beta]$$

Proof Summary:
- Consider the set
  $$S' = \{|x| \;:\; f(x) \text{ converges}\}$$
- Either $S'$ is unbounded or bounded.
- $S'$ is unbounded:
  - Then $S = \mathbb{R}$. If not, use the preceding lemma and the unboundedness of $S'$ to derive a contradiction.
- $S'$ is bounded:
  - If $\sup(S') = 0$, then $S = \{0\}$.
  - If $\sup(S') \neq 0$, use the preceding lemma to show $f(x)$ diverges for all $|x| > \sup(S')$ and converges for all $|x| < \sup(S')$.
Proof: To exploit the symmetry around 0, it is easier to look at the set of the absolute values of $x$ where $f(x)$ converges:
$$S' = \{|x| \;:\; f(x) \text{ converges}\}$$
Then either
$$S' \text{ is unbounded} \quad\text{or}\quad S' \text{ is bounded.}$$

$\bullet$ $S'$ is unbounded

We can show that $S = \mathbb{R}$. Suppose there is a $y \in \mathbb{R}$ but $y \notin S$. Since $S'$ is unbounded, there is a $Y \in S'$ such that $Y > |y|$. But to be in $S'$, either $f(Y)$ or $f(-Y)$ converges, so by the previous lemma $f(x)$ converges for all $|x| < |Y|$. This means $f(y)$ converges and thus $y \in S$, a contradiction.

$\bullet$ $S'$ is bounded

We know 0 is in $S'$ (duh), so $S'$ is non-empty. Now, using our favorite axiom (Completeness), we know that $S'$ has a supremum. Either
$$\sup(S') = 0 \quad\text{or}\quad \sup(S') \neq 0.$$
If $\sup(S') = 0$, then we are instantly done and $S = \{0\}$.

If $\sup(S') \neq 0$, then for all $y$ such that
$$|y| > \sup(S')$$
it must be the case that $f(y)$ diverges. Otherwise, if $f(y)$ converges, $|y|$ would be in $S'$. But that means $|y| \leq \sup(S')$, a contradiction.

Moreover, for all $y$ such that
$$|y| < \sup(S')$$
it must be the case that $f(y)$ converges. Suppose we could find such a $y$ with $f(y)$ divergent. Then for all $Y > |y|$ it must be the case that $f(Y)$ and $f(-Y)$ diverge: otherwise, by our previous lemma, we would have $f(y)$ converge. But then this would mean that $S'$ is bounded above by $|y|$, contradicting that $\sup(S')$ is the least upper bound.

In summary, the solution set $S$ is all $y$ such that $|y| < \sup(S')$, possibly containing either $-\sup(S')$ or $\sup(S')$ (or both). Therefore, $S$ is one of
$$(-\beta, \beta), \quad [-\beta, \beta), \quad (-\beta, \beta], \quad [-\beta, \beta]$$
for $\beta = \sup(S')$. $\blacksquare$
21.2 Change of Base-Point, Finite Case
Imagine the scenario: you start with some polynomial
$$b_2 x^2 + b_1 x + b_0,$$
and after shifting to the right by $\alpha$, you get a new polynomial
$$a_2 x^2 + a_1 x + a_0.$$
What if, after too much partying, you lose your original coefficients $b_2, b_1, b_0$? How do you find them? Simple. Expand
$$b_2 (x - \alpha)^2 + b_1 (x - \alpha) + b_0$$
and match coefficients with your new polynomial:
Example: Given that the polynomial
$$f(x) = a_2 x^2 + a_1 x + a_0$$
is formed by shifting the polynomial
$$b_2 x^2 + b_1 x + b_0$$
to the right by $\alpha$, there is a nice formula to instantly solve for $b_2, b_1, b_0$ from
$$b_2 (x - \alpha)^2 + b_1 (x - \alpha) + b_0 = a_2 x^2 + a_1 x + a_0$$
Expanding the left hand side, we get
$$b_2 (x - \alpha)^2 + b_1 (x - \alpha) + b_0 = b_2 x^2 - 2 b_2 \alpha x + b_2 \alpha^2 + b_1 x - b_1 \alpha + b_0 = b_2 x^2 + (-2 b_2 \alpha + b_1)\, x + (b_0 - b_1 \alpha + b_2 \alpha^2)$$
Equating coefficients,
$$\underbrace{b_2}_{a_2} x^2 + \underbrace{(-2 b_2 \alpha + b_1)}_{a_1}\, x + \underbrace{(b_0 - b_1 \alpha + b_2 \alpha^2)}_{a_0},$$
we have a system of equations
$$\begin{aligned}
a_2 &= b_2 \\
a_1 &= -2 b_2 \alpha + b_1 \\
a_0 &= b_0 - b_1 \alpha + b_2 \alpha^2
\end{aligned}$$
Now simply reverse substitute. We are already given $b_2 = a_2$, and by plugging this into the second equation, we can solve for $b_1$ in terms of $a_1, a_2$. Then plugging this expression for $b_1$ into the third equation will let us solve for $b_0$. This will give us
$$\begin{aligned}
b_2 &= a_2 \\
b_1 &= a_1 + 2 a_2 \alpha \\
b_0 &= a_0 + a_1 \alpha + a_2 \alpha^2
\end{aligned}$$
But, of course, you can play the same game with a polynomial of degree 3:
$$\begin{aligned}
b_3 &= a_3 \\
b_2 &= a_2 + 3 a_3 \alpha \\
b_1 &= a_1 + 2 a_2 \alpha + 3 a_3 \alpha^2 \\
b_0 &= a_0 + a_1 \alpha + a_2 \alpha^2 + a_3 \alpha^3
\end{aligned}$$
But how about a polynomial of arbitrary degree $N$? If you rewrite
$$f(x) = a_N x^N + a_{N-1} x^{N-1} + \ldots + a_0$$
as
$$a_N x^N + a_{N-1} x^{N-1} + \ldots + a_0 = b_N (x - \alpha)^N + b_{N-1} (x - \alpha)^{N-1} + \ldots + b_0,$$
is there a nice formula for the coefficients $b_0, b_1, \ldots, b_N$ in terms of $a_0, a_1, \ldots, a_N$?

At this point, there should be an awkward silence. You see structure in the $N = 2, 3$ cases, but you can't quite put your finger on a formula. Or at least, not a formula for an arbitrary coefficient $b_m$ (you can probably guess the cases $m = 0, 1, N$).
b
3
=
_
3
3
_
a
3
b
2
=
_
2
2
_
a
2
+
_
3
2
_
a
3

b
1
=
_
1
1
_
a
1
+
_
2
1
_
a
2
+
_
3
1
_
a
3

2
b
0
=
_
0
0
_
a
0
+
_
1
0
_
a
1
+
_
2
0
_
a
2

2
+
_
3
0
_
a
3

3
An educated guess would be, for an N-th degree polynomial
b
m
=
_
m
m
_
a
m
+
_
m + 1
m
_
a
m+1
+
_
m + 2
m
_
a
m+2

2
+ . . . +
_
N
m
_
a
N

Nm
=
N

n=m
_
n
m
_
a
n

nm
Unless you are the Mathematical Dalai Lama, this shouldnt be obvious. Generally,
1
You can probably guess the cases m = 0, 1, N
21.2. CHANGE OF BASE-POINT, FINITE CASE 429
Math Mantra: When faced with a new formula, you have to mull it over and plug
in values yourself.
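To take that advice literally, here is a small computer check of the guessed formula. It is not part of the text; the polynomial and shift below are arbitrary choices made only for illustration.

```python
# Sketch: verify b_m = sum_{n=m}^{N} C(n, m) * a_n * alpha^(n - m)
# against direct expansion of f(x) in powers of (x - alpha).
from math import comb
import sympy as sp

x, alpha = sp.symbols("x alpha")
a = [5, -3, 2, 7]                      # arbitrary coefficients a_0..a_3 (illustration only)
N = len(a) - 1

f = sum(a[n] * x**n for n in range(N + 1))

# Coefficients of f written in powers of (x - alpha), computed directly by sympy.
direct = sp.Poly(f.subs(x, x + alpha).expand(), x).all_coeffs()[::-1]

# Coefficients predicted by the guessed formula.
guessed = [sum(comb(n, m) * a[n] * alpha**(n - m) for n in range(m, N + 1))
           for m in range(N + 1)]

print([sp.simplify(d - g) for d, g in zip(direct, guessed)])  # expect [0, 0, 0, 0]
```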
Ironically, even though discovering the formula is not obvious, proving it is very easy. We just need the Binomial Theorem and the following double summation identity:

Lemma.
$$\sum_{n=0}^{N}\sum_{m=0}^{n} c_{nm} = \sum_{m=0}^{N}\sum_{n=m}^{N} c_{nm}$$

Proof: Already proved in Lecture 7: Matrix Madness.

By the way, what does this summation even mean? Although the notation is scary, the intuition is obvious. Always remember,

Math Mantra: NEVER BE AFRAID OF NOTATION.

Suppose we wanted to add the following elements:
$$\begin{array}{lll} c_{00} & & \\ c_{10} & c_{11} & \\ c_{20} & c_{21} & c_{22} \end{array}$$
We can either add the elements one row at a time or one column at a time. Our lemma just says that we can add the elements either way!
$$\sum_{n=0}^{N}\underbrace{\left(\sum_{m=0}^{n} c_{nm}\right)}_{n\text{-th Row}} = \sum_{m=0}^{N}\underbrace{\left(\sum_{n=m}^{N} c_{nm}\right)}_{m\text{-th Column}}$$
By the way, when you try to prove this using Professor Simon's hint, you should think of the full square array of terms $c_{nm}$ with $0 \leq n, m \leq N$: after setting the terms above the diagonal (the ones in the dotted circles of the picture) to zero, just interchange the row and column sums as usual!
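If the notation still feels abstract, a tiny computation makes the row/column claim concrete; the triangular array below is an arbitrary example, not from the text.

```python
# Sketch: add a triangular array c[n][m] (0 <= m <= n <= N) by rows and by columns.
N = 3
c = [[10 * n + m for m in range(n + 1)] for n in range(N + 1)]  # arbitrary entries

by_rows = sum(c[n][m] for n in range(N + 1) for m in range(n + 1))
by_cols = sum(c[n][m] for m in range(N + 1) for n in range(m, N + 1))

print(by_rows, by_cols, by_rows == by_cols)  # same total either way
```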
Now we are ready to prove the formula for $b_m$:

Theorem (Change of Base-Point, Finite). For any shifting factor $\alpha$, the degree $N$ polynomial
$$f(x) = a_Nx^N + a_{N-1}x^{N-1} + \ldots + a_0$$
can be rewritten as
$$f(x) = b_N(x-\alpha)^N + b_{N-1}(x-\alpha)^{N-1} + \ldots + b_0$$
where
$$b_m = \sum_{n=m}^{N}\binom{n}{m}a_n\alpha^{n-m}.$$

Proof Summary:
- Replace the term $a_nx^n$ with $a_n(\alpha + x - \alpha)^n$.
- Expand using the Binomial Theorem.
- Apply our Lemma.

Proof: Recall that the Binomial Theorem states
$$(a+b)^n = \sum_{m=0}^{n}\binom{n}{m}a^{n-m}b^m.$$
For
$$f(x) = \sum_{n=0}^{N} a_nx^n,$$
add and subtract $\alpha$ to get
$$\sum_{n=0}^{N} a_n\big(\underbrace{\alpha}_{a} + \underbrace{(x-\alpha)}_{b}\big)^n$$
and expand each $n$-th power using the Binomial Theorem:
$$\sum_{n=0}^{N} a_n\underbrace{\sum_{m=0}^{n}\binom{n}{m}\alpha^{n-m}(x-\alpha)^m}_{(a+b)^n}.$$
Pulling $a_n$ inside the inner summation and applying our Lemma, we get
$$\sum_{n=0}^{N}\sum_{m=0}^{n}\underbrace{\binom{n}{m}a_n\alpha^{n-m}(x-\alpha)^m}_{c_{nm}} = \sum_{m=0}^{N}\sum_{n=m}^{N}\underbrace{\binom{n}{m}a_n\alpha^{n-m}(x-\alpha)^m}_{c_{nm}}.$$
But we can pull out $(x-\alpha)^m$ from the inner sum since it does not depend on $n$:
$$\sum_{m=0}^{N}\underbrace{\left(\sum_{n=m}^{N}\binom{n}{m}a_n\alpha^{n-m}\right)}_{b_m}(x-\alpha)^m.$$
This is a polynomial shifted by $\alpha$. So
$$f(x) = \sum_{m=0}^{N} b_m(x-\alpha)^m$$
where we set
$$b_m = \sum_{n=m}^{N}\binom{n}{m}a_n\alpha^{n-m}.$$
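As a sanity check on the finished theorem, the following sketch recenters a concrete polynomial numerically; the coefficients and the shifting factor are arbitrary illustrations, not values from the text.

```python
# Sketch: check f(x) == sum_m b_m (x - alpha)^m at a few sample points,
# with b_m given by the Change of Base-Point (Finite) formula.
from math import comb

a = [1.0, -4.0, 0.5, 2.0, 3.0]        # a_0..a_N, arbitrary
alpha = 1.7                            # arbitrary shifting factor
N = len(a) - 1

b = [sum(comb(n, m) * a[n] * alpha**(n - m) for n in range(m, N + 1))
     for m in range(N + 1)]

def f(x):        return sum(a[n] * x**n for n in range(N + 1))
def shifted(x):  return sum(b[m] * (x - alpha)**m for m in range(N + 1))

for x in (-2.0, 0.0, 0.3, 5.0):
    print(x, abs(f(x) - shifted(x)))   # differences should be ~0 (rounding only)
```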
21.3 Change of Base-Point, Power Series
We proved the Change of Base-Point Theorem for the finite case. To quote Abed from Community: "Cool-Cool-Cool."

But that was only the finite case. How about power series? Remarkably, we have a similar formula! As long as $x$ and $x - \alpha$ are both within the radius of convergence, we can replace the $N$ with $\infty$:
$$f(x): \quad \sum_{n=0}^{N} b_n(x-\alpha)^n \;\longrightarrow\; \sum_{m=0}^{\infty} b_m(x-\alpha)^m$$
$$b_m: \quad \sum_{n=m}^{N}\binom{n}{m}a_n\alpha^{n-m} \;\longrightarrow\; \sum_{n=m}^{\infty}\binom{n}{m}a_n\alpha^{n-m}$$
But there are quite a few catches. For starters, our $b_m$ are no longer finite sums. Therefore we must prove that all the $b_m$ converge. For this, we'll need another lemma:

Lemma. Let $m$ be a fixed non-negative integer. Then,
$$\lim_{n\to\infty}\sqrt[n]{\binom{n}{m}} = 1.$$

Proof Summary:
- Apply the Sandwich Theorem to the bounds
$$1 \leq \sqrt[n]{\binom{n}{m}} \leq \left(\sqrt[n]{n}\right)^m.$$
- The lower bound (the constant sequence $a_n = 1$) obviously converges to 1.
- Using the product rule for limits and the fact that $\sqrt[n]{n} \to 1$, the upper bound also converges to 1.

Proof: Since the case $m = 0$ is obvious, we can assume $m$ is positive. The heart of the proof is applying the Sandwich Theorem:

Upper Bound

After expanding,
$$\binom{n}{m} = \frac{n!}{m!\,(n-m)!} = \frac{1}{m!}\Big(\underbrace{n(n-1)(n-2)\cdots(n-m+1)}_{\frac{n!}{(n-m)!}}\Big),$$
we note that
$$n(n-1)(n-2)\cdots(n-m+1) \leq \underbrace{n \cdot n \cdots n}_{m \text{ times}} = n^m$$
to conclude
$$\binom{n}{m} \leq \frac{1}{m!}\,n^m.$$
But because $\frac{1}{m!} \leq 1$, in fact,
$$\binom{n}{m} \leq n^m$$
and thus
$$\sqrt[n]{\binom{n}{m}} \leq \sqrt[n]{n^m} = \left(\sqrt[n]{n}\right)^m.$$

Lower Bound

Notice that $\binom{n}{m}$ is a positive integer, so
$$1 \leq \binom{n}{m}.$$
Taking $n$-th roots on both sides (like square roots, $n$-th roots preserve inequalities that involve non-negative numbers),
$$1 \leq \sqrt[n]{\binom{n}{m}}.$$

So we have the bounds
$$1 \leq \sqrt[n]{\binom{n}{m}} \leq \left(\sqrt[n]{n}\right)^m.$$
We already proved in Lecture 10: Being Bolzy that $\sqrt[n]{n} \to 1$. Applying the product rule for limits $m$ times, we have
$$\underbrace{\left(\sqrt[n]{n}\right)\cdot\left(\sqrt[n]{n}\right)\cdots\left(\sqrt[n]{n}\right)}_{m \text{ times}} \;\to\; 1 \cdot 1 \cdots 1,$$
giving us
$$\left(\sqrt[n]{n}\right)^m \to 1.$$
We also know the constant sequence $1 \to 1$. Therefore, by the Sandwich Theorem,
$$\lim_{n\to\infty}\sqrt[n]{\binom{n}{m}} = 1.$$
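Numerically the convergence is slow but visible; the snippet below (an illustration, not part of the text) tabulates the $n$-th root for a fixed $m$.

```python
# Sketch: watch (n choose m)^(1/n) creep toward 1 for fixed m.
from math import comb

m = 4  # arbitrary fixed non-negative integer
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, comb(n, m) ** (1.0 / n))
# The printed values decrease toward 1, as the lemma predicts.
```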
Theorem. Let the power series
$$f(x) = \sum_{n=0}^{\infty} a_nx^n$$
have radius of convergence $\rho > 0$ and let $\alpha$ be a shifting factor such that $|\alpha| < \rho$. Then, for each $m \geq 0$ the series
$$b_m = \sum_{n=m}^{\infty}\binom{n}{m}a_n\alpha^{n-m}$$
converges.

Proof Summary:
- We first show that $\sum_{n=0}^{\infty}\binom{n}{m}a_nx^n$ converges absolutely for $|x| < \rho$.
- Consider an arbitrary partial sum $\sum_{n=0}^{N}\left|\binom{n}{m}a_nx^n\right|$.
- Choose $\epsilon > 0$ small enough so that $(1+\epsilon)|x| < \rho$.
- Use this in the preceding result to bound the tail of our partial sum by an absolutely convergent power series.
- Show the partial sum is bounded, and therefore the series converges absolutely.
- Manipulate the series to prove that $b_m$ converges.

Proof: Let $x$ satisfy $|x| < \rho$ and let $m$ be a non-negative integer. First we show that
$$\sum_{n=0}^{\infty}\binom{n}{m}a_nx^n$$
converges absolutely and then we will manipulate this series to show that $b_m$ converges.

For the umpteenth time, we check that the sequence of partial sums of the absolute value analogue is bounded. Consider an arbitrary term of this sequence,
$$\sum_{n=0}^{N}\left|\binom{n}{m}a_nx^n\right|.$$
Of course we can pull out the non-negative term from the absolute value:
$$\sum_{n=0}^{N}\underbrace{\binom{n}{m}}_{\geq 0}|a_nx^n|.$$
Now, we are going to choose $\epsilon > 0$ small enough so that $(1+\epsilon)|x|$ also lies in the radius of convergence:
$$(1+\epsilon)|x| < \rho.$$
We know such an $\epsilon$ exists because this is equivalent to
$$\epsilon < \underbrace{\frac{\rho - |x|}{|x|}}_{> 0}.$$
Using this $\epsilon$ in the preceding lemma, we know there exists some $Q$ such that for all $n \geq Q$,
$$\left|\sqrt[n]{\binom{n}{m}} - 1\right| < \epsilon.$$
But because the $n$-th root of a positive integer is greater than (or equal to) 1, we can drop the absolute value sign:
$$\sqrt[n]{\binom{n}{m}} - 1 < \epsilon,$$
and since $n$-th powers preserve inequalities,
$$\binom{n}{m} < (1+\epsilon)^n$$
for all $n \geq Q$. Thus,
$$\binom{Q}{m} < (1+\epsilon)^Q, \qquad \binom{Q+1}{m} < (1+\epsilon)^{Q+1}, \qquad \ldots, \qquad \binom{N}{m} < (1+\epsilon)^{N}.$$
Summing,
$$\sum_{n=Q}^{N}\binom{n}{m}|a_nx^n| < \sum_{n=Q}^{N}(1+\epsilon)^n|a_nx^n|. \qquad (\ast)$$
Because
$$\sum_{n=0}^{Q-1}\binom{n}{m}|a_nx^n| < \sum_{n=0}^{Q-1}(1+\epsilon)^n|a_nx^n| + C \qquad (\ast\ast)$$
for some constant $C$, we can add $(\ast)$ and $(\ast\ast)$ to get
$$\sum_{n=0}^{N}\binom{n}{m}|a_nx^n| < \sum_{n=0}^{N}(1+\epsilon)^n|a_nx^n| + C.$$
Pulling $(1+\epsilon)^n$ inside the absolute value and combining exponents,
$$\sum_{n=0}^{N}(1+\epsilon)^n|a_nx^n| = \sum_{n=0}^{N}\left|a_n\big((1+\epsilon)|x|\big)^n\right|.$$
These are the partial sums of the absolute value analogue of the series evaluated at $(1+\epsilon)|x|$, which is still in the radius of convergence! So there exists some $M$ such that for all $N \geq 1$,
$$\sum_{n=0}^{N}\left|a_n\big((1+\epsilon)|x|\big)^n\right| \leq M,$$
and so
$$\sum_{n=0}^{N}\left|\binom{n}{m}a_nx^n\right| \leq M + C.$$
Therefore,
$$\sum_{n=0}^{\infty}\binom{n}{m}a_nx^n$$
is absolutely convergent for all $x$ with $|x| < \rho$. By assumption, $|\alpha| < \rho$, so in particular
$$\sum_{n=0}^{\infty}\left|\binom{n}{m}a_n\alpha^n\right|$$
converges. By our scaling theorems, we can multiply by the constant $|\alpha^{-m}|$ and the resulting series
$$\sum_{n=0}^{\infty}\left|\binom{n}{m}a_n\alpha^{n-m}\right|$$
still converges, and hence its partial sums are bounded. Now, in general, if a sum of non-negative terms is bounded by some $L$,
$$|s_1| + |s_2| + \ldots + |s_K| \leq L,$$
then the sub-sum that begins at $m$ (which is smaller) is bounded by the same $L$:
$$|s_m| + |s_{m+1}| + \ldots + |s_K| \leq L.$$
So, in particular, the partial sums of
$$\sum_{n=m}^{\infty}\left|\binom{n}{m}a_n\alpha^{n-m}\right|$$
are bounded and hence this series converges. But this series is the absolute value analogue of $b_m$, so we can conclude that $b_m$ converges absolutely.
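For a feel of what these $b_m$ look like, here is a quick numerical illustration with the geometric series (my choice of example, not the text's): $a_n = 1$ for all $n$, so $f(x) = 1/(1-x)$ with $\rho = 1$, and the recentered coefficients should come out to $b_m = 1/(1-\alpha)^{m+1}$.

```python
# Sketch: partial sums of b_m = sum_{n>=m} C(n, m) * alpha^(n - m) for a_n = 1,
# compared with the closed form 1 / (1 - alpha)^(m + 1).
from math import comb

alpha = 0.3          # |alpha| < rho = 1, arbitrary choice
for m in range(4):
    partial = sum(comb(n, m) * alpha**(n - m) for n in range(m, 400))
    print(m, partial, 1.0 / (1.0 - alpha)**(m + 1))
```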
Now that we proved that all the $b_m$ actually exist, we can prove that our new power series equals $f(x)$. But before the proof, we need to give a quick explanation of why we can split infinite sums:

Lemma. For any convergent series
$$\sum_{m=0}^{\infty} g_m$$
we can write
$$\sum_{m=a}^{b} g_m = \sum_{m=a}^{\infty} g_m - \sum_{m=b+1}^{\infty} g_m.$$

Proof: For $K \geq b$, we can always make the following split:
$$\sum_{m=a}^{b} g_m = \sum_{m=a}^{K} g_m - \sum_{m=b+1}^{K} g_m.$$
Intuitively, we are just isolating a segment: the partial sum over $a, a+1, \ldots, K$, minus the partial sum over $b+1, \ldots, K$, leaves exactly the segment from $a$ to $b$ (keeping in mind that we are dealing with integer indices).

Since the sequence of partial sums $(S_K)$ where
$$S_K = \sum_{m=a}^{K} g_m$$
is convergent, the sequence $(S'_K)$ where
$$S'_K = \sum_{m=b+1}^{K} g_m$$
must also be convergent: this is because
$$S'_K = S_K - \sum_{m=a}^{b} g_m$$
and by the limit theorems for sequences, the limit of the difference
$$S_K - \underbrace{\sum_{m=a}^{b} g_m}_{\text{constant}}$$
exists. Now, we can apply our limit theorems on the sequences
$$\Big(\underbrace{S_K}_{\sum_{m=a}^{K} g_m}\Big) - \Big(\underbrace{S'_K}_{\sum_{m=b+1}^{K} g_m}\Big) = \sum_{m=a}^{b} g_m$$
to get
$$\sum_{m=a}^{\infty} g_m - \sum_{m=b+1}^{\infty} g_m = \sum_{m=a}^{b} g_m.$$

Theorem (Change of Base-Point). Let the power series
$$f(x) = \sum_{n=0}^{\infty} a_nx^n$$
have radius of convergence $\rho$ and let $\alpha$ be a shifting factor such that $|\alpha| < \rho$. Also define the coefficients
$$b_m = \sum_{n=m}^{\infty}\binom{n}{m}a_n\alpha^{n-m}.$$
Then
$$f(x) = \sum_{m=0}^{\infty} b_m(x-\alpha)^m$$
for $|x-\alpha| < \rho - |\alpha|$.

Proof Summary:
- First use Change of Base-Point (Finite) to rewrite the partial sum of the series as
$$S_N(x) = \sum_{m=0}^{N}\left(\sum_{n=m}^{N}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m.$$
- Then use the sum splitting lemma to rewrite the inner sum:
$$S_N(x) = \sum_{m=0}^{N}\left(\sum_{n=m}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m - \underbrace{\sum_{m=0}^{N}\left(\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m}_{A}.$$
- So it suffices to show that the subtracted term $A \to 0$ as $N \to \infty$.
- Show $|A|$ is bounded by the convergent power series
$$\sum_{n=N+1}^{\infty}|a_n|\,(|\alpha| + |x-\alpha|)^n.$$
- This power series converges to 0 as $N \to \infty$.
Proof: Looking at the power series partial sums
$$S_N(x) = \sum_{n=0}^{N} a_nx^n,$$
we can use Change of Base-Point (Finite) to rewrite this as
$$S_N(x) = \sum_{m=0}^{N}\left(\sum_{n=m}^{N}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m.$$
But we also proved that
$$\sum_{n=m}^{\infty}\binom{n}{m}a_n\alpha^{n-m}$$
is convergent, so we can use the preceding lemma to rewrite
$$\sum_{n=m}^{N}\binom{n}{m}a_n\alpha^{n-m} = \sum_{n=m}^{\infty}\binom{n}{m}a_n\alpha^{n-m} - \sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}.$$
Substituting, we have
$$S_N(x) = \sum_{m=0}^{N}\left(\sum_{n=m}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m - \sum_{m=0}^{N}\left(\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m.$$
Now stare at what we have and compare it with what we are trying to prove:
$$f(x) = \sum_{m=0}^{\infty}\left(\sum_{n=m}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m.$$
Other than the value of $N$, they only differ by a single term! So to finish the proof, we just have to show:

As $N$ goes to infinity, the subtracted term
$$\sum_{m=0}^{N}\left(\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m$$
goes to 0.

Applying the repeated triangle inequality to
$$\left|\sum_{m=0}^{N}\left(\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m\right|,$$
we get the upper bound
$$\sum_{m=0}^{N}\left|\left(\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m\right|$$
which, after absolute value properties, is
$$\sum_{m=0}^{N}\left(\left|\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right|\,\big|(x-\alpha)^m\big|\right).$$
Now, be careful. In particular, you can't just mindlessly apply the triangle inequality to the inner sum, because it isn't a sum!
$$\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}$$
is a real number. It is the limit of a sequence of partial sums. Just because it looks like a sum doesn't mean it is actually a sum!

However, for a series that converges absolutely, we know that
$$\left|\sum_{n=0}^{\infty} g_n\right| \leq \sum_{n=0}^{\infty}|g_n|,$$
so we can bound further:
$$\left|\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right| \leq \sum_{n=N+1}^{\infty}\left|\binom{n}{m}a_n\alpha^{n-m}\right|,$$
giving us
$$\sum_{m=0}^{N}\left(\left|\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right|\,\big|(x-\alpha)^m\big|\right) \leq \sum_{m=0}^{N}\left(\sum_{n=N+1}^{\infty}\left|\binom{n}{m}a_n\alpha^{n-m}\right|\,\big|(x-\alpha)^m\big|\right).$$
But remember that we can rearrange any convergent series of non-negative terms. So the expression on the right is the same as the one with the interchanged sum symbols:
$$\sum_{n=N+1}^{\infty}\left(\sum_{m=0}^{N}\left|\binom{n}{m}a_n\alpha^{n-m}\right|\,\big|(x-\alpha)^m\big|\right).$$
Now the goal is to compress the inner sum into something more tangible. First, distribute the absolute value and pull out the $|a_n|$:
$$\sum_{n=N+1}^{\infty}|a_n|\left(\sum_{m=0}^{N}\binom{n}{m}|\alpha|^{n-m}|x-\alpha|^m\right).$$
Since we always have $N < n$, we can increase the inner summation from $N$ to $n$:
$$\sum_{n=N+1}^{\infty}|a_n|\left(\sum_{m=0}^{n}\binom{n}{m}|\alpha|^{n-m}|x-\alpha|^m\right).$$
Of course this summation is greater (we are adding more non-negative terms), but there is a catch: how do we know that this limit actually exists?! First, we recognize that the inner sum is just a binomial expansion:
$$\sum_{n=N+1}^{\infty}|a_n|\,(|\alpha| + |x-\alpha|)^n.$$
Now we have a power series. Moreover, we know it converges since we made the strange assumption in the theorem statement that
$$|x-\alpha| < \rho - |\alpha|.$$
But that's the same as saying
$$|\alpha| + |x-\alpha| < \rho.$$
Therefore this series converges! Now we have the upper bound
$$\sum_{n=N+1}^{\infty}|a_n|\,(|\alpha| + |x-\alpha|)^n.$$
But we can easily show this upper bound goes to 0 as $N$ approaches infinity. By infinite sum splitting again, we know that
$$\sum_{n=1}^{N}|a_n|\,(|\alpha| + |x-\alpha|)^n = \sum_{n=1}^{\infty}|a_n|\,(|\alpha| + |x-\alpha|)^n - \sum_{n=N+1}^{\infty}|a_n|\,(|\alpha| + |x-\alpha|)^n.$$
So we can rewrite the upper bound as
$$\underbrace{\sum_{n=1}^{\infty}|a_n|\,(|\alpha| + |x-\alpha|)^n - \sum_{n=1}^{N}|a_n|\,(|\alpha| + |x-\alpha|)^n}_{\sum_{n=N+1}^{\infty}|a_n|\,(|\alpha|+|x-\alpha|)^n}.$$
But the left term is a constant. And the right converges to the same constant,
$$\sum_{n=1}^{\infty}|a_n|\,(|\alpha| + |x-\alpha|)^n.$$
Therefore
$$\sum_{n=N+1}^{\infty}|a_n|\,(|\alpha| + |x-\alpha|)^n$$
converges to 0 as $N$ approaches infinity. And since
$$\left|\sum_{m=0}^{N}\left(\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m\right| \leq \sum_{n=N+1}^{\infty}|a_n|\,(|\alpha| + |x-\alpha|)^n,$$
we conclude that, as $N$ approaches infinity,
$$\sum_{m=0}^{N}\left(\sum_{n=N+1}^{\infty}\binom{n}{m}a_n\alpha^{n-m}\right)(x-\alpha)^m$$
approaches 0.
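Here is a quick numerical illustration of the theorem (my own example, not the text's): recenter $e^x = \sum x^n/n!$ at $\alpha = 1$ and check that the recentered series reproduces $e^x$.

```python
# Sketch: Change of Base-Point for f(x) = exp(x) = sum_n x^n / n!  (rho = infinity).
# b_m = sum_{n>=m} C(n, m) * (1/n!) * alpha^(n-m); truncate both sums for the check.
from math import comb, factorial, exp

alpha, x = 1.0, 1.8
N = 60  # truncation level for this illustration

b = [sum(comb(n, m) * alpha**(n - m) / factorial(n) for n in range(m, N + 1))
     for m in range(N + 1)]
recentered = sum(b[m] * (x - alpha)**m for m in range(N + 1))

print(recentered, exp(x))   # the two values agree to many decimal places
# (For exp, b_m works out to exp(alpha)/m!, matching the usual Taylor series at alpha.)
```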
Now that we have worked so hard to prove the Change of Base-Point Theorem for power series, you should ask:

What was the point? Why did we go through all this trouble? Why should we care?

This is a very good question. In fact, if you google this formula, the first hit will be Professor Simon's book. But we will use this theorem to give an easy proof that power series can be differentiated term by term. We will save that proof for the final Real Analysis lecture.
Lecture 22
A Mixed Bag of Partials
"Which road do I take?" Alice asked.
"Where do you want to go?" responded the Cheshire Cat.
"I don't know," she answered.
"Then," said the Cat, "it doesn't matter."
- Lewis Carroll, Alice in Wonderland

Goals: Today, we prove that mixed partials commute. In the proof, we will find two different ways to rewrite an expression. And in each revision, we will make several smart applications of the Mean Value Theorem.

22.1 Out of Order

When you want to make a Mojito, you have to keep the order straight:
1. Brown Sugar
2. Lime and Mint
3. 4 oz Light Rum
4. Muddle
5. Add Ice
6. Fill with Tonic Water
You must follow the order. If you add ice first, then you can't muddle; pour tonic water first, and everything's floating. If you're working at Tri-Del or Phi-Psi, messing up is forgivable. But if you're at a nightclub, the bar's backed up, and your manager's watching, the wrong order could mean you're out of a job.

Luckily, the same isn't true of partial derivatives:
Example: Let
$$f\begin{pmatrix}x_1\\ x_2\end{pmatrix} = \begin{pmatrix}x_1x_2^2\\ e^{x_1}x_2\\ \cos(x_1) + x_2\end{pmatrix}.$$
Compute the 1st partial derivative of the 2nd partial derivative of $f$,
$$D_1D_2f(x),$$
and compute the 2nd partial derivative of the 1st partial derivative of $f$,
$$D_2D_1f(x).$$
First, we compute the partial derivatives of $f$:
$$D_1f\begin{pmatrix}x_1\\ x_2\end{pmatrix} = \begin{pmatrix}x_2^2\\ e^{x_1}x_2\\ -\sin(x_1)\end{pmatrix} \qquad D_2f\begin{pmatrix}x_1\\ x_2\end{pmatrix} = \begin{pmatrix}2x_1x_2\\ e^{x_1}\\ 1\end{pmatrix}$$
Then we take partial derivatives of these partial derivatives:
$$D_2D_1f\begin{pmatrix}x_1\\ x_2\end{pmatrix} = \begin{pmatrix}2x_2\\ e^{x_1}\\ 0\end{pmatrix} \qquad D_1D_2f\begin{pmatrix}x_1\\ x_2\end{pmatrix} = \begin{pmatrix}2x_2\\ e^{x_1}\\ 0\end{pmatrix}$$
Remarkably,
$$D_1D_2f(x) = D_2D_1f(x).$$
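If you want to see this without doing the differentiation by hand, a symbolic check is easy (an illustration only, using the same $f$ as above):

```python
# Sketch: symbolically verify D1 D2 f == D2 D1 f for the example above.
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f = sp.Matrix([x1 * x2**2, sp.exp(x1) * x2, sp.cos(x1) + x2])

d12 = sp.diff(sp.diff(f, x2), x1)      # first differentiate in x2, then in x1
d21 = sp.diff(sp.diff(f, x1), x2)      # first differentiate in x1, then in x2

print(d12)                     # Matrix([[2*x2], [exp(x1)], [0]])
print(sp.simplify(d12 - d21))  # zero matrix: the mixed partials agree
```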
This wasn't a coincidence: it turns out,

The order in which you take partial derivatives doesn't matter!

This is incredible. And if you were in Math 51, the story would stop right there with practice calculations. But I'm not going to bore you with that. The purpose of the H-Series is not to train calculators. It's to train thinkers. By now, you should realize

Math Mantra: The WHY is more important than the WHAT.

Why in the world would this magical relation be true? How are we going to prove it? We use a common trick:

Math Mantra: If you can find two different ways to represent the same expression, then you can equate the two to derive key properties.

You will really appreciate this trick in Math 108: Introduction to Combinatorics and several other classes in Discrete Mathematics (personally, my favorite application is the bijective proof of Euler's Theorem in Number Theory). To use the trick in this proof, you will need to bring back an old friend, our theorem from Lecture 19:

Theorem. Let $f: \mathbb{R}^n \to \mathbb{R}$ be differentiable and let $i \in \{1, 2, \ldots, n\}$. Then for some $\theta_i \in [0, 1]$,
$$f\begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_i + h_i\\ \vdots\\ a_n\end{pmatrix} - f\begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_i\\ \vdots\\ a_n\end{pmatrix} = h_i\,\frac{\partial f}{\partial x_i}\begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_i + \theta_i h_i\\ \vdots\\ a_n\end{pmatrix}.$$
22.2 The Proof

To prove that for all $i, j \in \{1, 2, \ldots, n\}$,
$$D_iD_jf(x) = D_jD_if(x),$$
we only need to consider the case $i \neq j$. We can also assume $m = 1$: once we prove that the partials commute for a function that maps into $\mathbb{R}$,
$$\frac{\partial^2 f}{\partial x_i\partial x_j}(x) = \frac{\partial^2 f}{\partial x_j\partial x_i}(x),$$
then for a multivariable function
$$f(x) = \begin{pmatrix}f_1(x)\\ f_2(x)\\ \vdots\\ f_m(x)\end{pmatrix}$$
we can apply the $m = 1$ result to each component of
$$D_iD_jf(x) = \begin{pmatrix}\frac{\partial^2 f_1}{\partial x_i\partial x_j}(x)\\[2pt] \frac{\partial^2 f_2}{\partial x_i\partial x_j}(x)\\ \vdots\\ \frac{\partial^2 f_m}{\partial x_i\partial x_j}(x)\end{pmatrix} = \begin{pmatrix}\frac{\partial^2 f_1}{\partial x_j\partial x_i}(x)\\[2pt] \frac{\partial^2 f_2}{\partial x_j\partial x_i}(x)\\ \vdots\\ \frac{\partial^2 f_m}{\partial x_j\partial x_i}(x)\end{pmatrix} = D_jD_if(x).$$
Now we turn to the actual proof.
Theorem. Let $f: \mathbb{R}^n \to \mathbb{R}^m$ be a $C^2$ function (i.e. all second order partial derivatives exist and are continuous). Then, for any $x \in \mathbb{R}^n$ and any $i, j \in \{1, 2, \ldots, n\}$, the order in which the $i$-th and $j$-th partials are taken at the point $x$ doesn't matter:
$$D_iD_jf(x) = D_jD_if(x).$$

Proof Summary:
- It suffices to consider the case $m = 1$.
- The goal is to find two different ways to express
$$f(x + h_ie_i + h_je_j) - f(x + h_ie_i) - f(x + h_je_j) + f(x).$$
- Way 1
  - Rewrite as a difference $A - B$ where $A$ and $B$ differ only in the $j$-th component.
  - Apply the Mean Value Theorem with respect to the $j$-th component.
  - Expand and apply the Mean Value Theorem with respect to the $i$-th component.
- Way 2
  - Group as a difference $A - B$ where $A$ and $B$ differ in the $i$-th component.
  - Apply the Mean Value Theorem with respect to the $i$-th component.
  - Expand and apply the Mean Value Theorem with respect to the $j$-th component.
- Equate the two expressions and take the limit as $h \to 0$ (while taking advantage of the $C^2$ assumption).

Proof: Let $x \in \mathbb{R}^n$ and let $i, j \in \{1, 2, \ldots, n\}$. As explained above, we may assume $m = 1$, so we need to show
$$\frac{\partial^2 f}{\partial x_i\partial x_j}(x) = \frac{\partial^2 f}{\partial x_j\partial x_i}(x).$$
Let $h_i, h_j \neq 0$ and consider the expression
$$f(x + h_ie_i + h_je_j) - f(x + h_ie_i) - f(x + h_je_j) + f(x).$$
In expanded notation, it is clear we are just adding constants to the $i$-th and $j$-th components:
$$f\begin{pmatrix}x_1\\ \vdots\\ x_i + h_i\\ \vdots\\ x_j + h_j\\ \vdots\\ x_n\end{pmatrix} - f\begin{pmatrix}x_1\\ \vdots\\ x_i + h_i\\ \vdots\\ x_j\\ \vdots\\ x_n\end{pmatrix} - f\begin{pmatrix}x_1\\ \vdots\\ x_i\\ \vdots\\ x_j + h_j\\ \vdots\\ x_n\end{pmatrix} + f\begin{pmatrix}x_1\\ \vdots\\ x_i\\ \vdots\\ x_j\\ \vdots\\ x_n\end{pmatrix}.$$
We are going to rewrite this expression in two ways.

Way 1

We are going to apply the Mean Value Theorem in a very unobvious way. Up until now, we only applied the Mean Value Theorem to a difference of two terms. But our expression has four terms. So we group them as two terms! Reordering as a difference of differences, we get
$$\big(f(x + h_ie_i + h_je_j) - f(x + h_je_j)\big) - \big(f(x + h_ie_i) - f(x)\big).$$
Notice anything? The first pair only differs from the second pair by the $h_j$ term in the $j$-th component. Define a new function $g$ by
$$g(x) = f(x + h_ie_i) - f(x).$$
We can rewrite our pairs in terms of $g$:
$$\underbrace{\big(f(x + h_ie_i + h_je_j) - f(x + h_je_j)\big)}_{g(x + h_je_j)} - \underbrace{\big(f(x + h_ie_i) - f(x)\big)}_{g(x)}.$$
Now the expression is in the correct form to apply the Mean Value Theorem: there exists some $\theta_1 \in [0, 1]$ such that
$$g(x + h_je_j) - g(x) = h_j\,\frac{\partial g}{\partial x_j}(x + \theta_1h_je_j).$$
By expanding $g$ in terms of $f$ and applying the linearity of the derivative,
$$\frac{\partial g}{\partial x_j}(x) = \frac{\partial f}{\partial x_j}(x + h_ie_i) - \frac{\partial f}{\partial x_j}(x).$$
Therefore our original expression is actually
$$h_j\left(\frac{\partial f}{\partial x_j}(x + \theta_1h_je_j + h_ie_i) - \frac{\partial f}{\partial x_j}(x + \theta_1h_je_j)\right).$$
Notice that the inner functions differ by just the $h_i$ in the $i$-th component, so we can immediately apply the Mean Value Theorem again: there exists some $\theta_2 \in [0, 1]$ such that
$$h_j\left(\frac{\partial f}{\partial x_j}(x + \theta_1h_je_j + h_ie_i) - \frac{\partial f}{\partial x_j}(x + \theta_1h_je_j)\right) = h_j\left(h_i\,\frac{\partial^2 f}{\partial x_i\partial x_j}(x + \theta_1h_je_j + \theta_2h_ie_i)\right).$$
Therefore,
$$f(x + h_ie_i + h_je_j) - f(x + h_ie_i) - f(x + h_je_j) + f(x) = h_j\left(h_i\,\frac{\partial^2 f}{\partial x_i\partial x_j}(x + \theta_1h_je_j + \theta_2h_ie_i)\right).$$
Way 2

We do the same trick and reorder as
$$\big(f(x + h_ie_i + h_je_j) - f(x + h_ie_i)\big) - \big(f(x + h_je_j) - f(x)\big).$$
Again, the first pair only differs from the second by the $h_i$ term in the $i$-th component. Using
$$g(x) = f(x + h_je_j) - f(x),$$
we can rewrite our pairs as
$$\underbrace{\big(f(x + h_ie_i + h_je_j) - f(x + h_ie_i)\big)}_{g(x + h_ie_i)} - \underbrace{\big(f(x + h_je_j) - f(x)\big)}_{g(x)}.$$
Applying the Mean Value Theorem, there exists some $\tilde\theta_1 \in [0, 1]$ such that
$$g(x + h_ie_i) - g(x) = h_i\,\frac{\partial g}{\partial x_i}(x + \tilde\theta_1h_ie_i).$$
Expanding $g$ in terms of $f$,
$$\frac{\partial g}{\partial x_i}(x) = \frac{\partial f}{\partial x_i}(x + h_je_j) - \frac{\partial f}{\partial x_i}(x).$$
Thus, our original expression is actually
$$h_i\left(\frac{\partial f}{\partial x_i}(x + \tilde\theta_1h_ie_i + h_je_j) - \frac{\partial f}{\partial x_i}(x + \tilde\theta_1h_ie_i)\right).$$
Notice that the inner functions differ by just the $h_j$ in the $j$-th component, so we again apply the Mean Value Theorem: there exists some $\tilde\theta_2 \in [0, 1]$ such that
$$h_i\left(\frac{\partial f}{\partial x_i}(x + \tilde\theta_1h_ie_i + h_je_j) - \frac{\partial f}{\partial x_i}(x + \tilde\theta_1h_ie_i)\right) = h_i\left(h_j\,\frac{\partial^2 f}{\partial x_j\partial x_i}(x + \tilde\theta_1h_ie_i + \tilde\theta_2h_je_j)\right).$$
Therefore,
$$f(x + h_ie_i + h_je_j) - f(x + h_ie_i) - f(x + h_je_j) + f(x) = h_i\left(h_j\,\frac{\partial^2 f}{\partial x_j\partial x_i}(x + \tilde\theta_1h_ie_i + \tilde\theta_2h_je_j)\right).$$
Equating our two expressions,
$$\underbrace{h_j\left(h_i\,\frac{\partial^2 f}{\partial x_i\partial x_j}(x + \theta_1h_je_j + \theta_2h_ie_i)\right)}_{\text{Way 1}} = \underbrace{h_i\left(h_j\,\frac{\partial^2 f}{\partial x_j\partial x_i}(x + \tilde\theta_1h_ie_i + \tilde\theta_2h_je_j)\right)}_{\text{Way 2}}$$
and cancelling the nonzero factor $h_ih_j$ gives us
$$\frac{\partial^2 f}{\partial x_i\partial x_j}(x + \theta_1h_je_j + \theta_2h_ie_i) = \frac{\partial^2 f}{\partial x_j\partial x_i}(x + \tilde\theta_1h_ie_i + \tilde\theta_2h_je_j). \qquad (\star)$$
At this point you could say we are done: let $h_i, h_j$ approach 0, so both points approach $x$. But of course, this is a book for underdogs, and it is good to get additional practice. Also, I want you to see for yourself why you can't overlook the $C^2$ assumption.
Consider the difference
$$\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(x) - \frac{\partial^2 f}{\partial x_j\partial x_i}(x)\right|.$$
We want to show that this quantity must be 0, so we can equivalently show that
$$\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(x) - \frac{\partial^2 f}{\partial x_j\partial x_i}(x)\right| < \epsilon$$
for any $\epsilon > 0$. First, we add 0. But this 0 is going to come from our proven property $(\star)$:
$$\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(x) \underbrace{- \frac{\partial^2 f}{\partial x_i\partial x_j}(x + \theta_1h_je_j + \theta_2h_ie_i) + \frac{\partial^2 f}{\partial x_j\partial x_i}(x + \tilde\theta_1h_ie_i + \tilde\theta_2h_je_j)}_{= 0} - \frac{\partial^2 f}{\partial x_j\partial x_i}(x)\right|.$$
By the triangle inequality, we can bound this above by
$$\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(x) - \frac{\partial^2 f}{\partial x_i\partial x_j}(x + \theta_1h_je_j + \theta_2h_ie_i)\right| + \left|\frac{\partial^2 f}{\partial x_j\partial x_i}(x + \tilde\theta_1h_ie_i + \tilde\theta_2h_je_j) - \frac{\partial^2 f}{\partial x_j\partial x_i}(x)\right|.$$
Now all we need to do is add some restrictions on our choice of $h_i, h_j$ that guarantee that the above must be bounded by $\epsilon$. Because $\frac{\partial^2 f}{\partial x_i\partial x_j}$ and $\frac{\partial^2 f}{\partial x_j\partial x_i}$ are continuous functions (since $f$ is $C^2$), we know that we can find $\delta_1, \delta_2 > 0$ such that if
$$\|\vec{h}\| < \delta_1$$
then
$$\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(x) - \frac{\partial^2 f}{\partial x_i\partial x_j}(x + \vec{h})\right| < \frac{\epsilon}{2},$$
and if
$$\|\vec{h}\| < \delta_2$$
then
$$\left|\frac{\partial^2 f}{\partial x_j\partial x_i}(x) - \frac{\partial^2 f}{\partial x_j\partial x_i}(x + \vec{h})\right| < \frac{\epsilon}{2}.$$
Both inequalities are satisfied if we require
$$\|\vec{h}\| < \min\{\delta_1, \delta_2\}.$$
But because
$$\|\theta_1h_je_j + \theta_2h_ie_i\| = \sqrt{\theta_1^2h_j^2 + \theta_2^2h_i^2} \leq \sqrt{h_i^2 + h_j^2}$$
and
$$\|\tilde\theta_1h_ie_i + \tilde\theta_2h_je_j\| = \sqrt{(\tilde\theta_1)^2h_i^2 + (\tilde\theta_2)^2h_j^2} \leq \sqrt{h_i^2 + h_j^2}$$
are both bounded by the full norm
$$\|\vec{h}\| = \sqrt{h_1^2 + h_2^2 + \ldots + h_n^2},$$
in particular,
$$\underbrace{\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(x) - \frac{\partial^2 f}{\partial x_i\partial x_j}(x + \theta_1h_je_j + \theta_2h_ie_i)\right|}_{< \epsilon/2} + \underbrace{\left|\frac{\partial^2 f}{\partial x_j\partial x_i}(x + \tilde\theta_1h_ie_i + \tilde\theta_2h_je_j) - \frac{\partial^2 f}{\partial x_j\partial x_i}(x)\right|}_{< \epsilon/2} < \epsilon.$$
Thus
$$\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(x) - \frac{\partial^2 f}{\partial x_j\partial x_i}(x)\right| < \epsilon,$$
and since our choice of $\epsilon$ was arbitrary, this inequality must hold for every $\epsilon > 0$. Therefore,
$$\frac{\partial^2 f}{\partial x_i\partial x_j}(x) - \frac{\partial^2 f}{\partial x_j\partial x_i}(x) = 0,$$
so
$$\frac{\partial^2 f}{\partial x_i\partial x_j}(x) = \frac{\partial^2 f}{\partial x_j\partial x_i}(x).$$
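A numerical way to see the theorem at work (my own illustration, not from the text): the four-term expression above, divided by $h_ih_j$, approximates both mixed partials, so it should approach a single common value as $h \to 0$.

```python
# Sketch: the symmetric second difference approximates the (equal) mixed partials.
import math

def f(x1, x2):
    return x1 * math.exp(x1 * x2) + math.sin(x2)   # arbitrary smooth test function

x1, x2 = 0.4, -0.7
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    second_diff = (f(x1 + h, x2 + h) - f(x1 + h, x2) - f(x1, x2 + h) + f(x1, x2)) / h**2
    print(h, second_diff)
# The printed values settle toward d^2f/dx1dx2 = d^2f/dx2dx1 at (0.4, -0.7).
```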
New Notation

Symbol: $D_iD_jf(x)$ or $\frac{\partial^2 f}{\partial x_i\partial x_j}(x)$
Reading: The $i$-th partial derivative of the $j$-th partial derivative of $f$ evaluated at $x$.
Example: $D_iD_jf(x) = D_jD_if(x)$
Example Translation: The $i$-th partial derivative of the $j$-th partial derivative of $f$ evaluated at $x$ is equal to the $j$-th partial derivative of the $i$-th partial derivative of $f$ evaluated at $x$.
Lecture 23
Second to None
The derivative of my derivative is my friend.
- Math Proverb
Goals: Today, we introduce quadratic forms and prove the Second Derivative Test. But to do this, we first prove that any $C^2$ function $f(x)$ can be approximated up to second order by a quadratic. The proof of this approximation is a cute application of an innocuous integral formula.

23.1 The Three Kings

In Math 51H, the three most difficult results to prove are:
- Second Derivative Test
- Lagrange Multiplier Theorem (and anything pertaining to Manifolds)
- Implicit Function Theorem
You have a choice. You could just understand these theorem statements and how to apply them. I wouldn't blame you. Life is busy and the application is the only thing you'll be tested on. And there is so much more you have to master. But,

Math Mantra: Math is about ENJOYING proofs.

If you really want to be a mathematician, you have to appreciate the proofs. Moreover,

Math Mantra: Math is about PERSEVERANCE.

Even though mathematics is difficult, if you just stare at a mathematical result long enough you can find beauty - a beauty more intricate than any piece of art and more colorful than any rainbow.

But it's not easy. And I know this is going to be difficult. Personally, when I was an undergrad, I learned difficult proofs like a Java compiler. I would try to understand a proof one step at a time to double check its validity. But I lost the full picture this way. As the proverb goes, I couldn't see the forest for the trees.

I don't want you to make the same mistake I did. I want you to treat these proofs like close friends or epic stories. And when you fully understand the proof of the Second Derivative Test, I encourage you to complete the Wikipedia entry (as of February 2013, no one has dared to post a proof of the multivariable Second Derivative Test on Wikipedia).

23.2 On Multivariable Extensions

To reiterate our goal for the umpteenth time, we would like to find the maxima and minima of multivariable functions. And once again, this only makes sense for functions that map into $\mathbb{R}$.

Back in Calc BC, finding local maxima was easy:
- Set the derivative equal to 0 and solve for $x_{\max}$.
- Check the sign of the derivative to the left and right of $x_{\max}$. If the derivative chart showed a positive sign to the left of $x_{\max}$ and a negative sign to the right, you knew it was a maximum.

Life was easy: you only had to consider 1 dimension and two directions. But in $\mathbb{R}^n$, life's not as simple! The input is now a vector, so there is no left or right! Nor can we say the derivative is positive or negative, since it's also a vector.

However, we have some new tools at our disposal. We showed that the gradient must be $\vec{0}$ at an extremum, so to find local maxima you can
- Set the gradient equal to $\vec{0}$ and solve for possible $x_{\max}$ (even if the gradient at $x$ is $\vec{0}$, a local maximum need not be achieved at $x$).
- See if you can find a ball around that point such that $f(x_{\max})$ is greater than (or equal to) $f$ evaluated at any point in that ball.

The first part is easy. However, finding a ball and proving maximality at $x_{\max}$ can be quite troublesome. In the real world, we need a more convenient way of isolating extrema.

Have no fear: we have a pretty sweet extension of the single variable Second Derivative Test. First, we'll need to talk about approximations and errors.
23.3 A Comedy of Errors

One of the standard exercises in Calculus BC was to solve for the line tangent to the graph of $f$ at a point $a$:
$$f(a) + f'(a)(x-a).$$
The idea was that the tangent line approximates the function locally (close to $a$).

You also learned how to express functions with Taylor Series:
$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \frac{f''''(a)}{4!}(x-a)^4 + \ldots$$
As already mentioned in Lecture 21, the successive partial sums approximate the function more and more accurately. But focus on only the first three terms of the Taylor Series:
$$\underbrace{f(a) + f'(a)(x-a)}_{\text{Tangent Line}} + \underbrace{\frac{f''(a)}{2}(x-a)^2}_{\text{Quadratic Term}}$$
This is just the tangent line plus an additional quadratic term. So how well does this quadratic function approximate $f(x)$ near $a$? From a first glance, it looks pretty sweet.

But first, we need to make the word "approximation" formal. We define the error $E(x)$ to be the difference between the real value of $f$ and the value of the guess:
$$E(x) = \underbrace{f(x)}_{\text{Original}} - \underbrace{\left(f(a) + f'(a)(x-a) + \frac{f''(a)}{2}(x-a)^2\right)}_{\text{Guess}}$$
We call a guess an approximation if the error gets smaller as $x$ gets closer to $a$. In fact, we should be able to make the error arbitrarily small, i.e.

For any $\epsilon > 0$, we can find a $\delta > 0$ such that if $|x-a| < \delta$, then $|E(x)| < \epsilon$.

That's the minimum requirement for a guess to be an approximation. A decent approximation would have an error that gets smaller much faster near $a$. Taking a hint from differentiability, a decent approximation satisfies

For any $\epsilon > 0$, we can find a $\delta > 0$ such that if $|x-a| < \delta$, then $|E(x)| < \epsilon|x-a|$.

But the error of a great approximation would get smaller even faster than that!

For any $\epsilon > 0$, we can find a $\delta > 0$ such that if $|x-a| < \delta$, then $|E(x)| < \epsilon|x-a|^2$.

Formally, we call such an approximation a second-order approximation. Indeed, we can prove that our quadratic approximation is second-order and that
$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2}(x-a)^2 + E(x)$$
where
$$\lim_{x\to a}\frac{E(x)}{|x-a|^2} = 0.$$
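A quick numerical check of that limit (my own illustration): for a familiar function, the ratio $E(x)/|x-a|^2$ really does die off as $x \to a$.

```python
# Sketch: E(x) / (x - a)^2 -> 0 for the quadratic Taylor guess of f(x) = exp(x) at a = 0.
import math

a = 0.0
f, f1, f2 = math.exp, math.exp, math.exp   # f, f', f'' all equal exp

for x in (0.5, 0.1, 0.01, 0.001):
    guess = f(a) + f1(a) * (x - a) + 0.5 * f2(a) * (x - a) ** 2
    E = f(x) - guess
    print(x, E / (x - a) ** 2)
# The ratio shrinks roughly like (x - a) / 6, heading to 0 as x approaches a.
```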
Can we extend this to Multivariable Calculus? Absolutely!

For a $C^2$ function $f: \mathbb{R}^n \to \mathbb{R}$, our approximation becomes
$$f(x) = f(a) + Df(a)\big(x-a\big) + Q(x-a) + E(x).$$
The extension of the first two terms is obvious. However, the quadratic term $Q(x-a)$ and the error $E(x)$ need some explanation.

Looking at the single variable quadratic term for inspiration,
$$\frac{f''(a)}{2}(x-a)^2,$$
we see that we take the second derivative with respect to a single variable $x$. In multivariable notation, this is
$$\frac{1}{2}\cdot\frac{\partial^2 f}{\partial x_1\partial x_1}(a)\big(x_1 - a_1\big)\big(x_1 - a_1\big).$$
But we do not have to take the derivative with respect to $x_1$. We can differentiate (twice) with respect to, say, $x_3$ and multiply by $(x_3 - a_3)^2$:
$$\frac{1}{2}\cdot\frac{\partial^2 f}{\partial x_3\partial x_3}(a)\big(x_3 - a_3\big)\big(x_3 - a_3\big).$$
But we don't even need to take both derivatives with respect to the same variable; for example, we can differentiate with respect to $x_5$ and $x_2$ and then multiply by $(x_5 - a_5)(x_2 - a_2)$:
$$\frac{1}{2}\cdot\frac{\partial^2 f}{\partial x_5\partial x_2}(a)\big(x_5 - a_5\big)\big(x_2 - a_2\big).$$
In fact, we can do this for all possible pairs $(i, j)$ where $i$ and $j$ vary from 1 to $n$:
$$\begin{array}{cccc}
\frac{\partial^2 f}{\partial x_1\partial x_1}(a)(x_1-a_1)(x_1-a_1) & \frac{\partial^2 f}{\partial x_1\partial x_2}(a)(x_1-a_1)(x_2-a_2) & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}(a)(x_1-a_1)(x_n-a_n)\\
\frac{\partial^2 f}{\partial x_2\partial x_1}(a)(x_2-a_2)(x_1-a_1) & \frac{\partial^2 f}{\partial x_2\partial x_2}(a)(x_2-a_2)(x_2-a_2) & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_n}(a)(x_2-a_2)(x_n-a_n)\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial^2 f}{\partial x_n\partial x_1}(a)(x_n-a_n)(x_1-a_1) & \frac{\partial^2 f}{\partial x_n\partial x_2}(a)(x_n-a_n)(x_2-a_2) & \cdots & \frac{\partial^2 f}{\partial x_n\partial x_n}(a)(x_n-a_n)(x_n-a_n)
\end{array}$$
It turns out $Q(x-a)$ is going to be half the sum of all the pairs:
$$Q(x-a) = \frac{1}{2}\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\big(x_i - a_i\big)\big(x_j - a_j\big).$$
As for the error term $E(x)$, it ends up being a complete mess:
$$E(x) = \sum_{i,j=1}^{n} q_{ij}(x)\big(x_i - a_i\big)\big(x_j - a_j\big)$$
where
$$q_{ij}(x) = \int_0^1 (1-t)\left(\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a)) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a)\right)dt.$$
Luckily, this term doesn't matter: we'll prove that it quickly goes to 0 as $x$ approaches $a$.

Before we prove that $f(x)$ can be written as a second-order approximation, consider Professor Maydanskiy's saying:

Math Mantra: You can either be clever on each individual problem or you can reuse the same tools and tricks again and again.

Unlike the last few proofs where we used the single variable Mean Value Theorem a gazillion times, we are going to be really clever and prove an innocuous single-variable identity:

Lemma. Let $h: \mathbb{R} \to \mathbb{R}$ be a twice differentiable function. Then,
$$\int_0^1 (1-t)h''(t)\,dt = h(1) - h(0) - h'(0).$$
Proof: Recall your integration by parts mnemonic,
$$\int u\,dv = uv - \int v\,du.$$
Applying integration by parts to
$$\int_0^1 \underbrace{(1-t)}_{u}\,\underbrace{h''(t)\,dt}_{dv}$$
with
$$u = 1-t, \qquad dv = h''(t)\,dt,$$
so that
$$du = -1\,dt, \qquad v = h'(t),$$
we have
$$\underbrace{(1-t)}_{u}\underbrace{h'(t)}_{v}\,\Big|_{t=0}^{1} - \int_0^1 \underbrace{h'(t)}_{v}\,\underbrace{(-1)\,dt}_{du} = -h'(0) + \int_0^1 h'(t)\,dt,$$
which is just
$$h(1) - h(0) - h'(0).$$
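The identity is easy to sanity-check symbolically; the particular $h$ below is an arbitrary choice for illustration.

```python
# Sketch: check  integral_0^1 (1 - t) h''(t) dt == h(1) - h(0) - h'(0)  for a sample h.
import sympy as sp

t = sp.symbols("t")
h = sp.exp(2 * t) + sp.sin(t)            # arbitrary twice-differentiable function

lhs = sp.integrate((1 - t) * sp.diff(h, t, 2), (t, 0, 1))
rhs = h.subs(t, 1) - h.subs(t, 0) - sp.diff(h, t).subs(t, 0)

print(sp.simplify(lhs - rhs))            # 0
```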
Now all we gotta do is plug and chug into the above integral equation!

Theorem. Let $f: \mathbb{R}^n \to \mathbb{R}$ be a $C^2$ function. Then
$$f(x) = f(a) + Df(a)\big(x-a\big) + \frac{1}{2}\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\big(x_i - a_i\big)\big(x_j - a_j\big) + E(x)$$
where
$$E(x) = \sum_{i,j=1}^{n} q_{ij}(x)\big(x_i - a_i\big)\big(x_j - a_j\big)$$
and
$$q_{ij}(x) = \int_0^1 (1-t)\left(\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a)) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a)\right)dt.$$

Proof Summary:
- Fix $x, a$. Define
$$h(t) = f(a + t(x-a)).$$
- Use the chain rule to calculate $h'(t)$ and $h''(t)$.
- Plug $h, h', h''$ into the previous integral identity:
$$f(x) = f(a) + Df(a)\big(x-a\big) + \int_0^1 (1-t)\left(\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a))\big(x_i - a_i\big)\big(x_j - a_j\big)\right)dt.$$
- Use a series of integral and summation shenanigans to rewrite the integral term.
Proof: Fix $x, a$. We are going to use our integral equation on the magic function
$$h(t) = f(a + t(x-a)).$$
Geometrically, you can view $h$ as the evaluation of $f$ along the (parametrized) line segment from $a$ to $x$ as $t$ goes from 0 to 1. First, we calculate $h'(t)$ and $h''(t)$:

Calculating $h'(t)$

$h(t)$ is really the composition
$$h(t) = (f \circ g)(t)$$
where
$$g(t) = a + t(x-a) = \begin{pmatrix}a_1 + t(x_1 - a_1)\\ a_2 + t(x_2 - a_2)\\ \vdots\\ a_n + t(x_n - a_n)\end{pmatrix}.$$
By the chain rule,
$$h'(t) = Df(g(t))\,Dg(t).$$
Since $f$ maps into $\mathbb{R}$, $Df(g(t))$ is the row vector
$$Df(g(t)) = \begin{pmatrix}\frac{\partial f}{\partial x_1}(a + t(x-a)) & \frac{\partial f}{\partial x_2}(a + t(x-a)) & \cdots & \frac{\partial f}{\partial x_n}(a + t(x-a))\end{pmatrix}.$$
Moreover, because $g$ is a single-variable function, $Dg(t)$ is the column vector
$$Dg(t) = \begin{pmatrix}x_1 - a_1\\ x_2 - a_2\\ \vdots\\ x_n - a_n\end{pmatrix}.$$
Performing the matrix multiplication, we get
$$h'(t) = \frac{\partial f}{\partial x_1}(a + t(x-a))\big(x_1 - a_1\big) + \ldots + \frac{\partial f}{\partial x_n}(a + t(x-a))\big(x_n - a_n\big) = \sum_{i=1}^{n}\underbrace{\frac{\partial f}{\partial x_i}(a + t(x-a))}_{H_i(t)}\big(x_i - a_i\big).$$

Calculating $h''(t)$

To differentiate $h'(t)$ with respect to $t$, by linearity of the derivative, we simply need to differentiate each
$$H_i(t) = \frac{\partial f}{\partial x_i}(a + t(x-a)).$$
We could go through the chain rule all over again. Instead, just compare this term with our original $h(t)$:
$$h(t) = f(a + t(x-a)), \qquad H_i(t) = \frac{\partial f}{\partial x_i}(a + t(x-a)).$$
Both arguments are the same! All we are doing is replacing the $f$ with $\frac{\partial f}{\partial x_i}$. Through our supreme laziness (if you have doubts, redo the chain rule), we can replace all the $f$ in $h'(t)$ with $\frac{\partial f}{\partial x_i}$. In particular, differentiating $\frac{\partial f}{\partial x_i}$ with respect to the $j$-th variable gives us $\frac{\partial^2 f}{\partial x_j\partial x_i}$. So
$$H_i'(t) = \frac{\partial^2 f}{\partial x_1\partial x_i}(a + t(x-a))\big(x_1 - a_1\big) + \ldots + \frac{\partial^2 f}{\partial x_n\partial x_i}(a + t(x-a))\big(x_n - a_n\big) = \sum_{j=1}^{n}\frac{\partial^2 f}{\partial x_j\partial x_i}(a + t(x-a))\big(x_j - a_j\big).$$
Thus,
$$h''(t) = \sum_{i=1}^{n}\underbrace{\left(\sum_{j=1}^{n}\frac{\partial^2 f}{\partial x_j\partial x_i}(a + t(x-a))\big(x_j - a_j\big)\right)}_{H_i'(t)}\big(x_i - a_i\big).$$
Merging this double sum,
$$h''(t) = \sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_j\partial x_i}(a + t(x-a))\big(x_i - a_i\big)\big(x_j - a_j\big).$$

Applying the Lemma (Integral Identity)

Now we have
$$h(t) = f(a + t(x-a)),$$
$$h'(t) = \underbrace{\sum_{i=1}^{n}\frac{\partial f}{\partial x_i}(a + t(x-a))\big(x_i - a_i\big)}_{Df(a + t(x-a))(x-a)},$$
$$h''(t) = \sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_j\partial x_i}(a + t(x-a))\big(x_i - a_i\big)\big(x_j - a_j\big).$$
By the identity
$$\int_0^1 (1-t)h''(t)\,dt = h(1) - h(0) - h'(0),$$
we have
$$\int_0^1 (1-t)\underbrace{\left(\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a))\big(x_i - a_i\big)\big(x_j - a_j\big)\right)}_{h''(t)}dt = \underbrace{f(x)}_{h(1)} - \underbrace{f(a)}_{h(0)} - \underbrace{Df(a)\big(x-a\big)}_{h'(0)},$$
which is simply
$$f(x) = f(a) + Df(a)\big(x-a\big) + \int_0^1 (1-t)\left(\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a))\big(x_i - a_i\big)\big(x_j - a_j\big)\right)dt.$$

Rewriting the integral

All that's left is to rewrite the integral term. First, notice that we are computing the integral of a sum (these linearity manipulations are vital in both math and engineering; if you cannot see the linearity, expand the summation):
$$\int_0^1 (1-t)\left(\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a))\,(x_i - a_i)(x_j - a_j)\right)dt$$
which, by linearity, is the same as a sum of integrals. This allows us to bring the integral within the sum:
$$\sum_{i,j=1}^{n}\left(\int_0^1 (1-t)\,\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a))\big(x_i - a_i\big)\big(x_j - a_j\big)\,dt\right).$$
Pulling out terms from the integral that are independent of $t$, we have
$$\sum_{i,j=1}^{n}(x_i - a_i)(x_j - a_j)\left(\int_0^1 (1-t)\,\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a))\,dt\right).$$
Since we want our quadratic to appear, add 0 to introduce the missing $\frac{\partial^2 f}{\partial x_i\partial x_j}(a)$:
$$\sum_{i,j=1}^{n}(x_i - a_i)(x_j - a_j)\int_0^1 (1-t)\bigg(\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a))\underbrace{- \frac{\partial^2 f}{\partial x_i\partial x_j}(a) + \frac{\partial^2 f}{\partial x_i\partial x_j}(a)}_{= 0}\bigg)dt.$$
Distributing the integral and the $(1-t)$, and then distributing the product over the inner sum, we can break this sum of sums into two:
$$\sum_{i,j=1}^{n}(x_i - a_i)(x_j - a_j)\int_0^1 (1-t)\left(\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a)) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a)\right)dt$$
$$+\;\sum_{i,j=1}^{n}(x_i - a_i)(x_j - a_j)\int_0^1 (1-t)\,\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\,dt.$$
The first line is $E(x)$ by definition. Moreover, after pulling out the constant $\frac{\partial^2 f}{\partial x_i\partial x_j}(a)$ from the second line, we have
$$E(x) + \sum_{i,j=1}^{n}(x_i - a_i)(x_j - a_j)\,\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\left(\int_0^1 (1-t)\,dt\right).$$
Evaluating the integral gives us
$$E(x) + \frac{1}{2}\sum_{i,j=1}^{n}(x_i - a_i)(x_j - a_j)\,\frac{\partial^2 f}{\partial x_i\partial x_j}(a).$$
Therefore,
$$f(x) = f(a) + Df(a)\big(x-a\big) + \frac{1}{2}\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\big(x_i - a_i\big)\big(x_j - a_j\big) + E(x).$$
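A concrete feel for the theorem (my own illustration): for a specific $f$ and base point $a$, the quadratic expression above tracks $f$ far better than the linear one near $a$.

```python
# Sketch: compare f(x) with its first- and second-order approximations at a,
# using f(x1, x2) = exp(x1) * cos(x2) and a = (0, 0) as an arbitrary test case.
import math

def f(x1, x2): return math.exp(x1) * math.cos(x2)

# Gradient and second partials at a = (0, 0), computed by hand for this f.
fa, grad = 1.0, (1.0, 0.0)
H = ((1.0, 0.0), (0.0, -1.0))   # d2f/dxi dxj at (0, 0)

def quad_approx(x1, x2):
    lin = fa + grad[0] * x1 + grad[1] * x2
    q = sum(H[i][j] * (x1, x2)[i] * (x1, x2)[j] for i in range(2) for j in range(2))
    return lin + 0.5 * q

for s in (0.5, 0.1, 0.02):
    x1, x2 = s, 0.5 * s
    print(s, abs(f(x1, x2) - quad_approx(x1, x2)), abs(f(x1, x2) - (fa + x1)))
# The first error column (quadratic guess) shrinks roughly like s^3,
# the second (linear guess) roughly like s^2.
```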
Now that we have an approximation, we can prove that this approximation is great (i.e. second-order). The key lies in the appearance of the following expression in the error:
$$\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a)) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a).$$
This difference is screaming for us to use the continuity of all second order partial derivatives.

Theorem. Let $f: \mathbb{R}^n \to \mathbb{R}$ be a $C^2$ function. Then the error term
$$E(x) = \sum_{i,j=1}^{n} q_{ij}(x)\big(x_i - a_i\big)\big(x_j - a_j\big)$$
where
$$q_{ij}(x) = \int_0^1 (1-t)\left(\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a)) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a)\right)dt$$
has limit
$$\lim_{x\to a}\frac{E(x)}{\|x-a\|^2} = 0.$$

Proof Summary:
- Let $\epsilon > 0$. We want to find a $\delta$ such that if $\|x-a\| < \delta$ then $|E(x)| < \epsilon\|x-a\|^2$.
- Bound $|E(x)|$ by $\sum_{i,j=1}^{n}|q_{ij}(x)|\,\|x-a\|^2$.
- Now we just need to ensure that $\sum_{i,j=1}^{n}|q_{ij}(x)| < \epsilon$.
- Use the continuity of all second partials with choice of $\frac{\epsilon}{n^2}$ for the $\epsilon$-condition. Choose $\delta$ as the minimum of all the corresponding $\delta$'s.
Proof: For any $\epsilon$, we want to find a $\delta$ such that if
$$\|x-a\| < \delta$$
then
$$|E(x)| < \epsilon\|x-a\|^2.$$
Let $\epsilon > 0$. Expanding $|E(x)|$, we get
$$\left|\sum_{i,j=1}^{n} q_{ij}(x)\big(x_i - a_i\big)\big(x_j - a_j\big)\right|.$$
By the triangle inequality applied to each term in this summand, this is bounded by
$$\sum_{i,j=1}^{n}\left|q_{ij}(x)\big(x_i - a_i\big)\big(x_j - a_j\big)\right|.$$
Moreover, by absolute value properties, this bound is the same as
$$\sum_{i,j=1}^{n}|q_{ij}(x)|\,|x_i - a_i|\,|x_j - a_j|.$$
Because $|x_i - a_i|$ and $|x_j - a_j|$ are both bounded by the full norm
$$\|x-a\| = \sqrt{\ldots + (x_i - a_i)^2 + \ldots + (x_j - a_j)^2 + \ldots},$$
we can further bound $|E(x)|$ by
$$\sum_{i,j=1}^{n}|q_{ij}(x)|\,\|x-a\|^2.$$
After pulling out the constant from the summation, this bound is
$$\|x-a\|^2\sum_{i,j=1}^{n}|q_{ij}(x)|.$$
Therefore, if we can ensure
$$\sum_{i,j=1}^{n}|q_{ij}(x)| < \epsilon,$$
then we are done, since
$$|E(x)| \leq \underbrace{\sum_{i,j=1}^{n}|q_{ij}(x)|}_{< \epsilon}\,\|x-a\|^2 < \epsilon\|x-a\|^2.$$
Recall from Calc BC that for a function $g$,
$$\left|\int_a^b g(t)\,dt\right| \leq \int_a^b |g(t)|\,dt.$$
(This is known as the triangle inequality for integration. Intuitively, it says that the absolute value of the area under the curve is less than or equal to the integral of the absolute value of the function.) Using this inequality,
$$|q_{ij}(x)| = \left|\int_0^1 (1-t)\left(\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a)) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a)\right)dt\right|$$
is bounded by
$$\int_0^1 \left|(1-t)\left(\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a)) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a)\right)\right|dt.$$
By absolute value properties, this bound is the same as
$$\int_0^1 |1-t|\,\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a)) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a)\right|dt.$$
Moreover, we are integrating $t$ from 0 to 1; thus, we can drop the absolute value on $(1-t)$:
$$\int_0^1 (1-t)\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a)) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a)\right|dt. \qquad (\star)$$
Applying continuity of $\frac{\partial^2 f}{\partial x_i\partial x_j}$ at $a$, we know there exists a $\delta_{ij}$ such that if
$$\|y-a\| < \delta_{ij}$$
then
$$\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(y) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a)\right| < \frac{\epsilon}{n^2}.$$
In particular,
$$y = a + t(x-a)$$
satisfies this condition when we restrict $\|x-a\| < \delta_{ij}$. This is because
$$\|\underbrace{a + t(x-a)}_{y} - a\| = \underbrace{|t|}_{\leq 1}\,\underbrace{\|x-a\|}_{< \delta_{ij}} < \delta_{ij}.$$
Therefore, if
$$\|x-a\| < \delta_{ij}$$
then
$$\left|\frac{\partial^2 f}{\partial x_i\partial x_j}(a + t(x-a)) - \frac{\partial^2 f}{\partial x_i\partial x_j}(a)\right| < \frac{\epsilon}{n^2}.$$
This means, as long as $\|x-a\| < \delta_{ij}$, we can further bound $(\star)$ by
$$\int_0^1 (1-t)\,\frac{\epsilon}{n^2}\,dt,$$
which, after pulling out the constant and integrating, is
$$\frac{\epsilon}{2n^2}.$$
Of course,
$$\frac{\epsilon}{2n^2} < \frac{\epsilon}{n^2}.$$
In conclusion, if
$$\|x-a\| < \delta_{ij},$$
then
$$|q_{ij}(x)| < \frac{\epsilon}{n^2}.$$
Therefore, if $\|x-a\| < \delta_{ij}$ holds for every one of the $n^2$ pairs $(i, j)$ simultaneously, then $|q_{ij}(x)| < \frac{\epsilon}{n^2}$ for every pair. Choosing
$$\delta = \min\{\delta_{11}, \delta_{12}, \ldots, \delta_{nn}\},$$
if
$$\|x-a\| < \delta$$
then
$$\sum_{i,j=1}^{n}|q_{ij}(x)| = |q_{11}(x)| + |q_{12}(x)| + \ldots + |q_{nn}(x)| < \underbrace{\frac{\epsilon}{n^2} + \frac{\epsilon}{n^2} + \ldots + \frac{\epsilon}{n^2}}_{n^2 \text{ times}} = n^2\left(\frac{\epsilon}{n^2}\right) = \epsilon.$$
23.4 Quadratic Forms

Consider the quadratic term in our approximation,
$$\frac{1}{2}\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\big(x_i - a_i\big)\big(x_j - a_j\big).$$
It turns out that studying this term is the key to the Second Derivative Test. Forget all the bells and whistles for a moment and consider the underlying structure:

Definition. A quadratic form is a function of the form
$$Q(x) = \sum_{i,j=1}^{n} a_{ij}x_ix_j$$
where each $a_{ij}$ satisfies
$$a_{ij} = a_{ji}. \qquad (\ast)$$
If the matrix $A = [a_{ij}]$ satisfies $(\ast)$, we call $A$ a symmetric matrix.

In particular, we can rewrite
$$\frac{1}{2}\underbrace{\sum_{i,j=1}^{n}\underbrace{\frac{\partial^2 f}{\partial x_i\partial x_j}(a)}_{a_{ij}}\big(x_i - a_i\big)\big(x_j - a_j\big)}_{Q(x-a)}$$
as half a quadratic form evaluated at $(x-a)$, where
$$Q(x-a) = \sum_{i,j=1}^{n} a_{ij}\big(x_i - a_i\big)\big(x_j - a_j\big)$$
and $A = [a_{ij}]$ is the matrix whose $ij$ entry is $\frac{\partial^2 f}{\partial x_i\partial x_j}(a)$:
$$A = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1\partial x_1}(a) & \frac{\partial^2 f}{\partial x_1\partial x_2}(a) & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}(a)\\
\frac{\partial^2 f}{\partial x_2\partial x_1}(a) & \frac{\partial^2 f}{\partial x_2\partial x_2}(a) & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_n}(a)\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial^2 f}{\partial x_n\partial x_1}(a) & \frac{\partial^2 f}{\partial x_n\partial x_2}(a) & \cdots & \frac{\partial^2 f}{\partial x_n\partial x_n}(a)
\end{pmatrix}$$
Again, this matrix only makes sense if our function maps into $\mathbb{R}$. Otherwise, we would have a matrix of vectors! Moreover, this matrix is symmetric since mixed partials commute:
$$a_{ij} = \frac{\partial^2 f}{\partial x_i\partial x_j}(a) = \frac{\partial^2 f}{\partial x_j\partial x_i}(a) = a_{ji}.$$
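This matrix of second partials is easy to build and play with by machine; the function below is an arbitrary illustration, not one from the text.

```python
# Sketch: build the matrix A of second partials at a point and evaluate the
# quadratic form Q(v) = sum_{i,j} A[i][j] * v_i * v_j.
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f = x1**3 + 4 * x1 * x2 + x2**2          # arbitrary C^2 function
a = {x1: 1, x2: -2}                      # arbitrary base point

vars_ = (x1, x2)
A = sp.Matrix(2, 2, lambda i, j: sp.diff(f, vars_[i], vars_[j]).subs(a))
print(A)                                 # symmetric, as the mixed-partials theorem promises

v = sp.Matrix([3, 5])                    # arbitrary input vector
print((v.T * A * v)[0])                  # the quadratic form Q(v), written as v^T A v
```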
It turns out that quadratic forms have pretty sweet properties, including a stronger form of maxima and minima:

Lemma. For any quadratic form $Q: \mathbb{R}^n \to \mathbb{R}$, there exist constants $m$ and $M$ such that for any $x \in \mathbb{R}^n$, we have
$$m\|x\|^2 \leq Q(x) \leq M\|x\|^2.$$

Proof Summary:
- It suffices to prove this for $x \neq \vec{0}$.
- On the unit sphere, $Q$ achieves its maximum $M$ and minimum $m$. So for any $y$ on the sphere, $m \leq Q(y) \leq M$.
- For any non-zero $x$, $\frac{x}{\|x\|}$ lies on the unit sphere. Plug this into the preceding inequality.

Proof: The inequality is immediate if $x = \vec{0}$, so it suffices to prove the result for $x \neq \vec{0}$. First, recall that we proved that the unit sphere,
$$S = \{x \mid \|x\| = 1\},$$
is closed. Clearly, $S$ is bounded as well. Also, $Q(x)$ is a multivariable polynomial and thus continuous. Because continuous functions achieve a maximum and a minimum on any closed and bounded set, there exist constants $M$ and $m$ such that
$$m \leq Q(y) \leq M$$
for every $y$ on the unit sphere.

But remember our trick that any non-zero point can be mapped to the unit sphere by dividing by its norm. Namely,
$$\frac{x}{\|x\|} \in S$$
since
$$\left\|\frac{x}{\|x\|}\right\| = \frac{1}{\|x\|}\,\|x\| = 1.$$
In particular, for any $x \in \mathbb{R}^n$ with $x \neq \vec{0}$,
$$m \leq Q\!\left(\frac{x}{\|x\|}\right) \leq M.$$
Since each component of the input is scaled by a constant, we see, after expanding $Q$, that $Q(x)$ is scaled by the constant squared:
$$Q\!\left(\frac{x}{\|x\|}\right) = \sum_{i,j=1}^{n} a_{ij}\left(\frac{x_i}{\|x\|}\right)\left(\frac{x_j}{\|x\|}\right) = \left(\sum_{i,j=1}^{n} a_{ij}x_ix_j\right)\frac{1}{\|x\|^2}.$$
Therefore,
$$Q\!\left(\frac{x}{\|x\|}\right) = Q(x)\,\frac{1}{\|x\|^2}.$$
Plugging this back into the inequality, we see that for any non-zero $x \in \mathbb{R}^n$,
$$m \leq Q(x)\,\frac{1}{\|x\|^2} \leq M,$$
which is equivalently
$$m\|x\|^2 \leq Q(x) \leq M\|x\|^2.$$
This is a great bound. But we can do even better as long as we have an additional proviso on the sign of $Q$:

Lemma. If for any $x \neq \vec{0}$,
$$Q(x) > 0,$$
then there exists some positive $m$ such that for all $x \in \mathbb{R}^n$:
$$\underbrace{m}_{> 0}\,\|x\|^2 \leq Q(x).$$
Likewise, if for any $x \neq \vec{0}$,
$$Q(x) < 0,$$
then there exists some negative $M$ such that for all $x \in \mathbb{R}^n$:
$$Q(x) \leq \underbrace{M}_{< 0}\,\|x\|^2.$$

Proof: Again, the inequalities clearly hold if $x = \vec{0}$, so it suffices to prove them for all $x \neq \vec{0}$. If $Q(x)$ is positive for $x \neq \vec{0}$, then in particular $Q(x)$ is strictly positive on the unit sphere. So in the preceding proof, the minimum value $m$ must be positive. Likewise, if $Q(x)$ is negative for $x \neq \vec{0}$, then the maximum value $M$ must be negative.
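For a concrete $Q$ you can watch these bounds hold numerically. (The snippet below also uses eigenvalues to pick candidate constants $m$ and $M$; that is a fact about symmetric matrices which the text does not rely on, included here only as an illustration.)

```python
# Sketch: for Q(x) = x^T A x with A symmetric, the extreme eigenvalues give
# constants with m * ||x||^2 <= Q(x) <= M * ||x||^2.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = (B + B.T) / 2                      # arbitrary symmetric matrix

eigs = np.linalg.eigvalsh(A)
m, M = eigs.min(), eigs.max()

for _ in range(5):
    x = rng.standard_normal(3)
    Q = x @ A @ x
    n2 = x @ x
    print(m * n2 - 1e-9 <= Q <= M * n2 + 1e-9)   # True each time
```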
23.5 Second Derivative Test

From all of our hard work, we can now easily derive the multivariable Second Derivative Test. Starting with the second-order approximation formula for $f$ about $a$,
$$f(x) = f(a) + Df(a)\big(x-a\big) + \frac{1}{2}\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\big(x_i - a_i\big)\big(x_j - a_j\big) + E(x),$$
we exploit quadratic form properties and the fact that the error is second order to find a ball around $a$ such that all the junk to the right of $f(a)$ is positive:
$$f(x) = f(a) + \underbrace{Df(a)\big(x-a\big) + \frac{1}{2}\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\big(x_i - a_i\big)\big(x_j - a_j\big) + E(x)}_{> 0}.$$
Then,
$$f(x) = f(a) + POSITIVE.$$
In other words,
$$f(a) < f(x)$$
for all points $x$ in the ball around $a$ (excluding $x = a$), i.e. $f$ has a strict local minimum at $a$.

Theorem (Second Derivative Test). Given a point $a$ such that
$$\nabla f(a) = \vec{0},$$
define the corresponding quadratic form
$$Q(x) = \sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\,x_ix_j.$$
If
$$Q(x) > 0$$
for all $x \neq \vec{0}$, then $f$ has a strict local minimum at $a$. Likewise, if
$$Q(x) < 0$$
for all $x \neq \vec{0}$, then $f$ has a strict local maximum at $a$.

Proof Summary:
- Use the second-order approximation formula at $a$.
- By the preceding lemma, show that there exists an $m > 0$ such that
$$\frac{1}{2}\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\big(x_i - a_i\big)\big(x_j - a_j\big)$$
is bounded below by $\frac{1}{2}\underbrace{m}_{>0}\|x-a\|^2$ for $x \neq a$.
- Find a ball around $a$ such that the error is less than $\frac{1}{2}m\|x-a\|^2$ in absolute value.
- Conclude $f(x) > f(a)$ for all $x$ in this ball (excluding $x = a$).

Proof: We prove the theorem only in the case of a strict local minimum since the maximum case follows similarly.

Applying our approximation formula at $a$, we have
$$f(x) = f(a) + Df(a)\big(x-a\big) + \frac{1}{2}\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\big(x_i - a_i\big)\big(x_j - a_j\big) + E(x).$$
Since
$$\nabla f(a) = \vec{0}$$
and $Df(a) = (\nabla f(a))^T$, the second term vanishes:
$$f(x) = f(a) + \underbrace{\frac{1}{2}\sum_{i,j=1}^{n}\frac{\partial^2 f}{\partial x_i\partial x_j}(a)\big(x_i - a_i\big)\big(x_j - a_j\big)}_{\frac{1}{2}Q(x-a)} + E(x).$$
Applying the preceding lemma with $Q$ at $(x-a)$, there is a positive $m$ such that for all $x$ with $x \neq a$,
$$Q(x-a) \geq \underbrace{m}_{>0}\,\|x-a\|^2.$$
Therefore,
$$f(x) \geq f(a) + \frac{1}{2}m\|x-a\|^2 + E(x)$$
for all $x \neq a$.

Now we just have to shrink $E(x)$. Using $E(x)$'s key property,
$$\lim_{x\to a}\frac{E(x)}{\|x-a\|^2} = 0,$$
with $\epsilon = \frac{1}{2}m$, we can find a $\delta$ such that if $0 < \|x-a\| < \delta$, then
$$|E(x)| < \underbrace{\tfrac{1}{2}m}_{\epsilon}\,\|x-a\|^2.$$
Expanding the absolute value, this implies
$$-\frac{1}{2}m\|x-a\|^2 < E(x).$$
Thus, if $0 < \|x-a\| < \delta$ (so $x \neq a$), then
$$f(x) \geq f(a) + \underbrace{\frac{1}{2}m\|x-a\|^2 + E(x)}_{> 0} > f(a).$$
Therefore $f$ has a strict local minimum at $a$.
Example. The function $f: \mathbb{R}^2 \to \mathbb{R}$,
$$f(x) = x_1^2 + x_2^2,$$
has a strict local minimum at $\vec{0}$ (its graph is the familiar upward-opening paraboloid over the $(x_1, x_2)$-plane).

Notice that
$$\nabla f(\vec{0}) = \begin{pmatrix}\frac{\partial f}{\partial x_1}(\vec{0})\\[2pt] \frac{\partial f}{\partial x_2}(\vec{0})\end{pmatrix} = \vec{0}.$$
Moreover,
$$Q(x) = \frac{\partial^2 f}{\partial x_1\partial x_1}(\vec{0})\,x_1^2 + \frac{\partial^2 f}{\partial x_1\partial x_2}(\vec{0})\,x_1x_2 + \frac{\partial^2 f}{\partial x_2\partial x_1}(\vec{0})\,x_2x_1 + \frac{\partial^2 f}{\partial x_2\partial x_2}(\vec{0})\,x_2^2 = 2x_1^2 + 2x_2^2,$$
and $Q(x) > 0$ for all $x \neq \vec{0}$. Since the conditions of the Second Derivative Test are satisfied, $f$ has a strict local minimum at $\vec{0}$.
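The same check is easy to run mechanically on messier functions; the function below is my own example, chosen only to exercise the test.

```python
# Sketch: apply the Second Derivative Test at a critical point, by checking that
# Q(v) = v^T A v is positive for a sample of nonzero directions v.
import sympy as sp
import random

x1, x2 = sp.symbols("x1 x2")
f = x1**2 + x1 * x2 + x2**2 + sp.cos(x1) - 1      # arbitrary C^2 function
a = {x1: 0, x2: 0}                                # gradient vanishes here

grad = sp.Matrix([sp.diff(f, v) for v in (x1, x2)]).subs(a)
A = sp.Matrix(2, 2, lambda i, j: sp.diff(f, (x1, x2)[i], (x1, x2)[j])).subs(a)
print(grad.T, A)                                  # grad = 0; A = [[1, 1], [1, 2]]

random.seed(1)
for _ in range(5):
    v = sp.Matrix([random.uniform(-1, 1), random.uniform(-1, 1)])
    print((v.T * A * v)[0] > 0)                   # True: Q > 0 on these sampled directions
```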
Lecture 24
Chasing Curves
In the middle of the journey of my life,
I found myself astray in a dark wood
where the image of the path has been lost.
- inf(Dante)

Goals: Focusing on functions from $\mathbb{R}$ to $\mathbb{R}^n$, we introduce curves. Specifically, we define a curve's arc-length to be the supremum of all linear approximations with respect to a finite partition. If our curve is $C^1$, we prove that, over a closed interval, this supremum is guaranteed to exist. Moreover, the arc-length can be directly calculated using the natural extension of the 2D arc-length formula from high school.
24.1 What are Curves?

For most of Math 51H, we have studied the specific case of multivariable functions that take a vector input and return a real number:
$$f: \mathbb{R}^n \to \mathbb{R}.$$
Today, we study functions that input a real number and return a vector:
$$f: \mathbb{R} \to \mathbb{R}^n.$$
These functions have geometric significance. Specifically, in $\mathbb{R}^2$ and $\mathbb{R}^3$, we can view the output of the function as a position in space as a function of a single variable (say, time). And if the function is continuous, then the image looks like a curve. For example, picture a continuous mapping from $[0, 60]$ to $\mathbb{R}^3$ tracing out a spiral: we can think of this function as describing the position of a circling bird from time $t = 0$ to time $t = 60$.

Formally, we say

Definition. A curve is a continuous function from a closed interval $[a, b]$ to $\mathbb{R}^n$.

(For today's lecture, we MUST define curves on a closed interval. However, we will extend the definition of curves to open intervals when studying manifolds.)

Careful though: although the intuition is that a curve is the image of a continuous function (like the spiraling path just described), formally, we define the curve to be the function itself. So in this context, the Dante quote should be the curve and not the image of the path.

24.2 Arc-Length

Of course, this is not your first encounter with curves. Every time you graphed parametric functions, you were really working with curves in 2D. You even knew how to calculate the arc-length of a curve. In Calc BC, to compute the length of
$$x = \sin(t), \qquad y = \cos(t)$$
as $t$ ranged from 0 to $\pi$, you would plug the equations for $x$ and $y$ into some magical formula
$$L = \int_a^b\sqrt{\left(\frac{dy}{dt}\right)^2 + \left(\frac{dx}{dt}\right)^2}\,dt$$
and out would pop your solution. But,

Why is this formula valid?

You probably were given some intuitive spiel about the Pythagorean Theorem and hypotenuses. But of course,

Math Mantra: INTUITION IS NOT A PROOF!

Today is your lucky day: we are going to gift you with a rigorous derivation that also applies to the multivariable case!

So purge your brain of that arc-length formula and let's consider again the example
$$x = \sin(t), \qquad y = \cos(t).$$
We don't know how to compute the length of a curve, but at least we know how to compute the length of a line segment connecting two points. We could try choosing points on the curve and calculating the sum of the lengths of the line segments connecting consecutive points. Unfortunately, with only a few points this estimation sucks more than a Hoover 500. Instead, we try with more points. In fact, these points don't have to be evenly spaced out.

We can keep guessing all day. The truth is, no matter what points we choose, we are not going to get the exact length of the curve (remember, we can only choose finitely many points; math is not a mystical study!). Instead, we cheat:

We consider all possible estimations (that could be achieved from any choice of finitely many points). Then, we define the length to be the supremum of (the set of) all these estimations.

i.e.
$$\text{length(curve)} = \sup\{\text{Estimation formed from set } P \mid P \text{ is a set of points on the curve}\}$$
Does this mean that there is some special collection of points on the curve such that, when we sum the consecutive distances between them, we get the actual arc-length?

Absolutely not! Remember, the supremum need not lie in the set! For example,
$$\sup\left\{1 - \frac{1}{n}\;\Big|\; n \in \mathbb{N}\right\} = 1$$
yet
$$1 \notin \left\{1 - \frac{1}{n}\;\Big|\; n \in \mathbb{N}\right\}.$$
But this does mean that for any arbitrarily chosen closeness, we can always find a set of points such that the length estimation is that close to the actual length.

But there are a few catches. The first is that our entire discussion has been imprecise! Always remember:

Math Mantra: PRECISION, PRECISION, PRECISION!

First, let's try to make the notion of our selected points precise. Particularly, we have to make sure that:

- Our points lie on the curve. For a function
$$f: [a, b] \to \mathbb{R}^n,$$
a point is a vector of the form $f(t_k)$ where $t_k \in [a, b]$. Think of $t_k$ as a time between $a$ and $b$.

- The sequence of times is ordered:
$$t_0 < t_1 < t_2 < \ldots < t_N.$$
This ensures that our points are ordered by the time they occur on the curve. Otherwise, our estimation method would be completely wack!

- The first point we choose occurs at the start time,
$$t_0 = a,$$
and the last point we choose occurs at the end time,
$$t_N = b.$$
Although this seems like a trivial decision, it will be a key step in our proofs.

Therefore, we make a definition satisfying these requirements:
Definition. A partition $P$ of $[a, b]$ is a finite set
$$P = \{t_0, t_1, \ldots, t_N\}$$
such that
1. $t_0 = a$ and $t_N = b$,
2. the $t_i$ form a strictly increasing sequence,
$$t_0 < t_1 < t_2 < \ldots < t_N.$$

We also define the length of $f$:

Definition. Let $f: [a, b] \to \mathbb{R}^n$ be a continuous function. The approximation of the length of $f$ with respect to the partition
$$P = \{t_0, t_1, \ldots, t_N\}$$
of $[a, b]$ is
$$L(f, P) = \sum_{i=1}^{N}\|f(t_i) - f(t_{i-1})\|.$$
The length of $f$ on $[a, b]$ is the supremum of all approximations of $f$ with respect to all partitions of $[a, b]$:
$$L(f) = \sup\{L(f, P) \mid P \text{ is a partition of } [a, b]\}.$$
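Before worrying about whether this supremum exists, it is worth seeing the definition in action; the snippet below (my own illustration) uses uniform partitions of the half-circle curve from earlier, whose length we expect to be $\pi$.

```python
# Sketch: L(f, P) for f(t) = (sin t, cos t) on [0, pi], with finer and finer
# uniform partitions; the approximations increase toward pi.
import math

def f(t):
    return (math.sin(t), math.cos(t))

def L(f, ts):
    total = 0.0
    for t_prev, t_next in zip(ts, ts[1:]):
        dx = f(t_next)[0] - f(t_prev)[0]
        dy = f(t_next)[1] - f(t_prev)[1]
        total += math.hypot(dx, dy)
    return total

for N in (2, 8, 32, 128, 512):
    ts = [math.pi * i / N for i in range(N + 1)]
    print(N, L(f, ts))
print(math.pi)
```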
Even with this formalization, we still have one more problem to deal with. Namely, in the definition of the length of $f$,

THE SUPREMUM MAY NOT EXIST!

Remember,

Math Mantra: Just because we define an object doesn't guarantee that the object exists!

How do you know that taking finer and finer partitions won't just give you greater and greater approximations? Luckily, this doesn't happen if we add the additional proviso that our curves are $C^1$. But in order to prove this result, we need an identity:
Lemma. Let $f : [a, b] \to \mathbb{R}^n$ be a continuous function. Then,
\[
\sqrt{\left(\int_a^b f_1(t)\,dt\right)^2 + \left(\int_a^b f_2(t)\,dt\right)^2 + \ldots + \left(\int_a^b f_n(t)\,dt\right)^2}
\;\le\;
\int_a^b \sqrt{\big(f_1(t)\big)^2 + \big(f_2(t)\big)^2 + \ldots + \big(f_n(t)\big)^2}\,dt
\]

Proof Summary:

• Consider the square of the inequality.

• Starting from the left hand side, expand each integral square and replace one of the variables of integration.

• Collapse the expression into a single integral and apply Cauchy-Schwarz.
Proof: As usual, if both sides of an inequality are non-negative and contain roots, we should prove the square of the inequality:
\[
\left(\int_a^b f_1(t)\,dt\right)^2 + \ldots + \left(\int_a^b f_n(t)\,dt\right)^2
\;\le\;
\left(\int_a^b \sqrt{\big(f_1(t)\big)^2 + \ldots + \big(f_n(t)\big)^2}\,dt\right)^2
\]
Starting from the left,
\[
\left(\int_a^b f_1(t)\,dt\right)^2 + \left(\int_a^b f_2(t)\,dt\right)^2 + \ldots + \left(\int_a^b f_n(t)\,dt\right)^2
\]
we can expand each of the squares in this sum:
\[
\left(\int_a^b f_1(t)\,dt\right)\left(\int_a^b f_1(t)\,dt\right) + \left(\int_a^b f_2(t)\,dt\right)\left(\int_a^b f_2(t)\,dt\right) + \ldots + \left(\int_a^b f_n(t)\,dt\right)\left(\int_a^b f_n(t)\,dt\right)
\]
Now we do a very cool trick: the variable of integration does not matter, so you can change one of them in each product:¹
\[
\left(\int_a^b f_1(x)\,dx\right)\left(\int_a^b f_1(t)\,dt\right) + \left(\int_a^b f_2(x)\,dx\right)\left(\int_a^b f_2(t)\,dt\right) + \ldots + \left(\int_a^b f_n(x)\,dx\right)\left(\int_a^b f_n(t)\,dt\right)
\]
But the integrals involving $x$ are constants, so we can pull them inside the integrals involving $t$. To avoid confusion, let's label these constants $c_1, c_2, \ldots, c_n$:
\[
\int_a^b \underbrace{\left(\int_a^b f_1(x)\,dx\right)}_{c_1} f_1(t)\,dt
+ \int_a^b \underbrace{\left(\int_a^b f_2(x)\,dx\right)}_{c_2} f_2(t)\,dt
+ \ldots
+ \int_a^b \underbrace{\left(\int_a^b f_n(x)\,dx\right)}_{c_n} f_n(t)\,dt.
\]
Applying linearity, we collapse this into a single integral:
\[
\int_a^b c_1 f_1(t) + c_2 f_2(t) + \ldots + c_n f_n(t)\,dt
\]
which is, by our integral properties, bounded by
\[
\int_a^b \left| c_1 f_1(t) + c_2 f_2(t) + \ldots + c_n f_n(t) \right| dt. \tag{$\star$}
\]
Here, the inner argument is the absolute value of a dot product! Using our old pal Cauchy-Schwarz with vectors
\[
c = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}
\qquad
f = \begin{pmatrix} f_1(t) \\ f_2(t) \\ \vdots \\ f_n(t) \end{pmatrix}
\]

¹ This is a very neat trick. You will see it again when proving that the area under the Gaussian is $\sqrt{\pi}$.
we have the inequality:
\[
\underbrace{\left| c_1 f_1(t) + c_2 f_2(t) + \ldots + c_n f_n(t) \right|}_{|c \,\cdot\, f|}
\;\le\;
\underbrace{\sqrt{c_1^2 + c_2^2 + \ldots + c_n^2}}_{\|c\|}\;
\underbrace{\sqrt{\big(f_1(t)\big)^2 + \big(f_2(t)\big)^2 + \ldots + \big(f_n(t)\big)^2}}_{\|f\|}
\]
This gives us the integral bound¹
\[
\int_a^b \left| c_1 f_1(t) + \ldots + c_n f_n(t) \right| dt
\;\le\;
\int_a^b \sqrt{c_1^2 + c_2^2 + \ldots + c_n^2}\,\sqrt{\big(f_1(t)\big)^2 + \ldots + \big(f_n(t)\big)^2}\,dt
\]
which allows us to conclude that $(\star)$ is bounded by
\[
\int_a^b \sqrt{c_1^2 + c_2^2 + \ldots + c_n^2}\,\sqrt{\big(f_1(t)\big)^2 + \ldots + \big(f_n(t)\big)^2}\,dt
\]
Comparing our starting point to the final upper bound, we have
\[
\left(\int_a^b f_1(t)\,dt\right)^2 + \ldots + \left(\int_a^b f_n(t)\,dt\right)^2
\;\le\;
\int_a^b \sqrt{c_1^2 + \ldots + c_n^2}\,\sqrt{\big(f_1(t)\big)^2 + \ldots + \big(f_n(t)\big)^2}\,dt
\]
Dividing each side by the constant $\sqrt{c_1^2 + c_2^2 + \ldots + c_n^2}$, we get
\[
\frac{\left(\int_a^b f_1(t)\,dt\right)^2 + \left(\int_a^b f_2(t)\,dt\right)^2 + \ldots + \left(\int_a^b f_n(t)\,dt\right)^2}
{\sqrt{\underbrace{\left(\int_a^b f_1(t)\,dt\right)^2}_{c_1^2} + \underbrace{\left(\int_a^b f_2(t)\,dt\right)^2}_{c_2^2} + \ldots + \underbrace{\left(\int_a^b f_n(t)\,dt\right)^2}_{c_n^2}}}
\;\le\;
\int_a^b \sqrt{\big(f_1(t)\big)^2 + \big(f_2(t)\big)^2 + \ldots + \big(f_n(t)\big)^2}\,dt
\]
But, lo and behold, the left-hand side is just a number divided by its square root! Simplify this to get
\[
\sqrt{\left(\int_a^b f_1(t)\,dt\right)^2 + \left(\int_a^b f_2(t)\,dt\right)^2 + \ldots + \left(\int_a^b f_n(t)\,dt\right)^2}
\;\le\;
\int_a^b \sqrt{\big(f_1(t)\big)^2 + \big(f_2(t)\big)^2 + \ldots + \big(f_n(t)\big)^2}\,dt
\]
AWESOME!

¹ We are using the fact that $f \le g$ implies $\int_a^b f\,dt \le \int_a^b g\,dt$.
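Not from the text, but here is a quick numerical sanity check of the lemma. The component functions below are an assumption, chosen purely for illustration.

```python
import numpy as np
from scipy.integrate import quad

# A sample continuous f : [0, 1] -> R^3 (chosen only for this check)
comps = [np.cos, np.sin, lambda t: t]
a, b = 0.0, 1.0

lhs = np.sqrt(sum(quad(fi, a, b)[0] ** 2 for fi in comps))               # || integral of f ||
rhs = quad(lambda t: np.sqrt(sum(fi(t) ** 2 for fi in comps)), a, b)[0]  # integral of || f ||
print(lhs, "<=", rhs, lhs <= rhs)                                        # the lemma's inequality holds
```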
Theorem. For a function $f : [a, b] \to \mathbb{R}^n$, if $f$ is $C^1$, then the length $L(f)$ of $f$ on $[a, b]$ exists.

Proof Summary:

• We need to show that there is a fixed number that bounds $L(f, P)$ under any partition $P$. So consider a sum over an arbitrary partition.

• In the $i$-th term, expand the vector difference $f(t_i) - f(t_{i-1})$ and apply the Fundamental Theorem of Calculus to rewrite each component as an integral over $f'$.

• Apply the preceding lemma to bound the $i$-th term by an integral.

• The sum of all terms is bounded by a sum of integrals. This integral sum collapses into a single integral independent of the partition.
Proof: To show that
\[
L(f) = \sup \{ L(f, P) \mid P \text{ is a partition of } [a, b] \}
\]
exists, by the magic of the Completeness Axiom, we just have to show that the corresponding set is bounded. This means we have to show, for an arbitrary partition $P$, that
\[
\underbrace{\sum_{i=1}^{N} \left\| f(t_i) - f(t_{i-1}) \right\|}_{L(f,P)} \;\le\; B
\]
where $B$ is some fixed bound independent of the choice of the partition.

Let $P$ be an arbitrary partition and consider the sum
\[
\sum_{i=1}^{N} \left\| f(t_i) - f(t_{i-1}) \right\|.
\]
Expanding the $i$-th term in this sum, we get
\[
f(t_i) - f(t_{i-1}) =
\begin{pmatrix}
f_1(t_i) - f_1(t_{i-1}) \\
f_2(t_i) - f_2(t_{i-1}) \\
f_3(t_i) - f_3(t_{i-1}) \\
\vdots \\
f_n(t_i) - f_n(t_{i-1})
\end{pmatrix}
\]
Recall the single-variable Fundamental Theorem of Calculus, which links derivatives and integrals:
\[
g(b) - g(a) = \int_a^b g'(t)\,dt.
\]
Applying this theorem to each component of $f(t_i) - f(t_{i-1})$, we can rewrite this as
\[
f(t_i) - f(t_{i-1}) =
\begin{pmatrix}
\int_{t_{i-1}}^{t_i} f_1'(t)\,dt \\[2pt]
\int_{t_{i-1}}^{t_i} f_2'(t)\,dt \\[2pt]
\int_{t_{i-1}}^{t_i} f_3'(t)\,dt \\
\vdots \\
\int_{t_{i-1}}^{t_i} f_n'(t)\,dt
\end{pmatrix}.
\]
After expanding, the norm is really
\[
\sqrt{\left(\int_{t_{i-1}}^{t_i} f_1'(t)\,dt\right)^2 + \left(\int_{t_{i-1}}^{t_i} f_2'(t)\,dt\right)^2 + \ldots + \left(\int_{t_{i-1}}^{t_i} f_n'(t)\,dt\right)^2}
\]
which we showed, in our lemma, is bounded by
\[
\int_{t_{i-1}}^{t_i} \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt
\]
for any $i$. This gives us a series of inequalities
\[
\begin{aligned}
\|f(t_1) - f(t_0)\| &\le \int_{t_0}^{t_1} \sqrt{\big(f_1'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt \\
\|f(t_2) - f(t_1)\| &\le \int_{t_1}^{t_2} \sqrt{\big(f_1'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt \\
\|f(t_3) - f(t_2)\| &\le \int_{t_2}^{t_3} \sqrt{\big(f_1'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt \\
&\;\;\vdots \\
\|f(t_N) - f(t_{N-1})\| &\le \int_{t_{N-1}}^{t_N} \sqrt{\big(f_1'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt
\end{aligned}
\]
By summing these inequalities and combining the integrals, we have
\[
\sum_{i=1}^{N} \left\| f(t_i) - f(t_{i-1}) \right\| \;\le\; \int_{t_0}^{t_N} \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt.
\]
But $t_0 = a$ and $t_N = b$; therefore,
\[
\sum_{i=1}^{N} \left\| f(t_i) - f(t_{i-1}) \right\| \;\le\; \underbrace{\int_a^b \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt}_{B}.
\]
This means our sum is bounded above by
\[
B = \int_a^b \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt
\]
and since our partition was arbitrary, we have, for any partition $P$,
\[
L(f, P) \le B.
\]
In conclusion, the length of $f$ exists. $\blacksquare$
But we can do better! We can actually show that the length equals the integral:
\[
L(f) = \int_a^b \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt
\]
This completely removes the middle man: no more partitions! And if you haven't noticed, this is just like the formula you saw in Calc BC. Just change the notation:
\[
L(f) = \int_a^b \sqrt{\left(\frac{dx_1}{dt}\right)^2 + \left(\frac{dx_2}{dt}\right)^2 + \ldots + \left(\frac{dx_n}{dt}\right)^2}\,dt
\]
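As a numerical illustration (not part of the original text, and the curve $f(t) = (t, t^2)$ on $[0, 1]$ is just an assumed example), the partition approximations creep up toward exactly this integral:

```python
import numpy as np
from scipy.integrate import quad

f  = lambda t: np.array([t, t ** 2])           # a sample C^1 curve, chosen only for illustration
df = lambda t: np.array([1.0, 2.0 * t])        # its derivative

integral = quad(lambda t: np.linalg.norm(df(t)), 0.0, 1.0)[0]    # the arc-length integral

def L(ts):
    """Polygonal approximation over the partition given by the times ts."""
    pts = np.array([f(t) for t in ts])
    return np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))

for N in (2, 10, 100, 1000):
    print(N, L(np.linspace(0.0, 1.0, N + 1)))  # increases toward the integral below
print("integral:", integral)                   # about 1.4789
```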
The multivariable formula is an obvious extension of the two-variable case! To prove it, we use our usual trick,

Math Mantra: Using only the fact that an object exists, we can DERIVE properties that object must satisfy.

By the preceding theorem, we know that the length of a curve $f$ that is $C^1$ on $[a, b]$ exists. But we didn't have to use the interval $[a, b]$. We could have used a smaller interval and we still would have had a $C^1$ function. Thus, for any $t$ such that $a \le t \le b$, the length of $f$ restricted to $[a, t]$,
\[
L(f|[a, t]),
\]
also exists. This allows us to define a function $S : \mathbb{R} \to \mathbb{R}$ that inputs the right endpoint and outputs the length of the curve that stops at the right endpoint:
\[
S(t) = L(f|[a, t])
\]
Using this seemingly innocuous function, we will derive our arc-length formula.
Theorem. For a function $f : [a, b] \to \mathbb{R}^n$, if $f$ is $C^1$, then
\[
L(f) = \int_a^b \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt.
\]

Proof Summary:

• Consider the function
\[
S(t) = L(f|[a, t])
\]
By the Fundamental Theorem of Calculus, it suffices to prove
\[
S'(t) = \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}
\]

• To do this, we use the Sandwich Theorem on the difference quotient
\[
\frac{S(t + h) - S(t)}{h}
\]
for the limit as $h \to 0^+$.

• By the additive property of lengths, the numerator is
\[
S(t + h) - S(t) = L(f|[t, t + h]).
\]

• Lower bound: $L(f|[t, t + h])$ is bounded below by the worst partition, which consists of only two points.

• Upper bound: $L(f|[t, t + h])$ is bounded by the final bound from the previous proof (applied to $[t, t + h]$).

• Combining the bounds, we have
\[
\left\| \frac{f(t + h) - f(t)}{h} \right\|
\;\le\;
\underbrace{\frac{S(t + h) - S(t)}{h}}_{\frac{L(f|[t,\,t+h])}{h}}
\;\le\;
\frac{\int_t^{t+h} \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt}{h}
\]

• For the limit of the lower bound, use continuity of the norm to apply the limit definition on each component.

• For the right, split the integral so it resembles the derivative definition. Then, differentiate the integral.
Proof: Define the function
\[
S(t) = L(f|[a, t]).
\]
If we can show that
\[
S'(t) = \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}
\]
then we are done. Why? By the magic of the Fundamental Theorem of Calculus,
\[
S(b) - S(a) = \int_a^b S'(t)\,dt.
\]
But
\[
S(a) = L(f|[a, a]) = 0
\qquad
S(b) = L(f|[a, b]) = L(f)
\]
giving us
\[
\underbrace{S(b)}_{L(f)} - \underbrace{S(a)}_{0} = \int_a^b \underbrace{\sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}}_{S'(t)}\,dt
\]
Therefore, if $S'(t)$ has the aforementioned value, then
\[
L(f) = \int_a^b \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt.
\]
With this goal in mind, let's look at the difference quotient of $S(t)$,
\[
\frac{S(t + h) - S(t)}{h}.
\]
The strategy is to use the Sandwich Theorem on this quotient. To find the proper bounds, we first rewrite the quotient using a theorem on this week's homework. You will formally prove that

The arc-length from $a$ to $c$ is the sum of the arc-lengths from $a$ to $b$ and $b$ to $c$, where $b$ is any intermediate point $a < b < c$.

Symbolically,
\[
L(f|[a, c]) = L(f|[a, b]) + L(f|[b, c])
\]
In particular, we have
\[
L(f|[a, t + h]) = L(f|[a, t]) + L(f|[t, t + h])
\]
which is the same as
\[
\underbrace{L(f|[a, t + h])}_{S(t+h)} - \underbrace{L(f|[a, t])}_{S(t)} = L(f|[t, t + h])
\]
Thus, our difference quotient is actually
\[
\underbrace{\frac{L(f|[t, t + h])}{h}}_{\frac{S(t+h) - S(t)}{h}}
\]
But before we construct our bounds, notice,

THERE IS A SLIGHT FLY IN THE OINTMENT!

If $h$ is negative, then
\[
L(f|[t, t + h])
\]
wouldn't make sense! It turns out though, if we compute the right and left derivatives separately,¹ then apart from a few minor differences, like replacing our above split
\[
L(f|[a, t + h]) = L(f|[a, t]) + L(f|[t, t + h])
\]
with
\[
L(f|[a, t]) = L(f|[a, t + h]) + L(f|[t + h, t]),
\]
our proof is exactly the same. Therefore, we focus on only the right derivative and positive $h$.

¹ Remember the beauty of 1D calculus: a derivative exists if and only if the left and right derivatives exist and are equal!
Lower Bound: Recall that the length of $f$ is a supremum over all partitions. In particular, it is greater than the approximation by the worst partition. Minimally, a partition contains the first and last endpoint. Therefore, the worst partition on $[t, t + h]$ only contains the two terms $t_0 = t$ and $t_1 = t + h$, giving us
\[
\|f(t + h) - f(t)\| \le L(f|[t, t + h])
\]
Dividing both sides by positive $h$,
\[
\frac{\|f(t + h) - f(t)\|}{h} \le \frac{L(f|[t, t + h])}{h}.
\]
Equivalently,
\[
\left\| \frac{f(t + h) - f(t)}{h} \right\| \le \frac{L(f|[t, t + h])}{h}.
\]

Upper Bound: In the previous proof, we found a bound for any partition approximation $L(f, P)$. Therefore $L(f)$, the supremum of the set, is bounded by the same bound. Using the previous proof with the interval $[t, t + h]$, we have
\[
L(f|[t, t + h]) \le \int_t^{t+h} \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt
\]
Dividing by positive $h$,
\[
\frac{L(f|[t, t + h])}{h} \le \frac{\int_t^{t+h} \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt}{h}
\]
Combining the bounds, we have
\[
\left\| \frac{f(t + h) - f(t)}{h} \right\|
\;\le\;
\underbrace{\frac{S(t + h) - S(t)}{h}}_{\frac{L(f|[t,\,t+h])}{h}}
\;\le\;
\frac{\int_t^{t+h} \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt}{h}
\tag{$\star$}
\]
Now we have to be careful! Particularly, we are only going to consider positive $h$ and have $h$ approach $0$ from the right. Consider the limit of the left bound first:
\[
\lim_{h \to 0^+} \left\| \frac{f(t + h) - f(t)}{h} \right\|.
\]
Using the fact that taking the norm is a continuous operation, after expanding component-wise, the limit is the same as
\[
\left\|
\begin{pmatrix}
\lim_{h \to 0^+} \frac{f_1(t+h) - f_1(t)}{h} \\[4pt]
\lim_{h \to 0^+} \frac{f_2(t+h) - f_2(t)}{h} \\
\vdots \\
\lim_{h \to 0^+} \frac{f_n(t+h) - f_n(t)}{h}
\end{pmatrix}
\right\|.
\]
But regardless of how $h$ approaches $0$, each component still gives us the normal derivative:
\[
\left\|
\begin{pmatrix}
f_1'(t) \\
f_2'(t) \\
\vdots \\
f_n'(t)
\end{pmatrix}
\right\|.
\]
Thus,
\[
\lim_{h \to 0^+} \left\| \frac{f(t + h) - f(t)}{h} \right\| = \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}.
\]
For the right bound
\[
\lim_{h \to 0^+} \frac{\int_t^{t+h} \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt}{h},
\]
split the integral in two:
\[
\lim_{h \to 0^+} \frac{\overbrace{\int_a^{t+h} \sqrt{\big(f_1'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt}^{G(t+h)} \; - \; \overbrace{\int_a^{t} \sqrt{\big(f_1'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt}^{G(t)}}{h}.
\]
This is just the definition of the (right) derivative of
\[
G(t) = \int_a^t \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}\,dt,
\]
which, from Calculus BC, is
\[
G'(t) = \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}.
\]
By the Sandwich Theorem¹ on $(\star)$, we have
\[
\lim_{h \to 0^+} \frac{S(t + h) - S(t)}{h} = \sqrt{\big(f_1'(t)\big)^2 + \big(f_2'(t)\big)^2 + \ldots + \big(f_n'(t)\big)^2}. \qquad \blacksquare
\]

¹ The right-handed Sandwich Theorem, though this would be easy to prove.
New Notation

Symbol: $f|S$. Reading: the function $f$ with its domain restricted to the set $S$. Example: $f|[0, 1]$. Example Translation: the function $f$ with its domain restricted to $[0, 1]$.

Symbol: $L(f|[a, b])$. Reading: the length of the function $f$ restricted to the interval $[a, b]$. Example: $S(t) = L(f|[a, t])$. Example Translation: $S(t)$ is the function that inputs the right endpoint $t$ and returns the length of the function $f$ restricted to $[a, t]$.
Lecture 25

Taylor Swift Series

And we are never, ever, ever,
gonna converge together.
- Taylor Series to the continuous extension of $e^{-\frac{1}{x^2}}$

Goals: For the first half of lecture, we apply the Change of Base-Point Theorem to prove convergence results for differentiated power series. The second half is devoted to proving a condition that guarantees the corresponding Taylor series will converge to $f(x)$. The proof will rely on Taylor's Theorem, which gives us a nice way to rewrite the error of an $m$-term Taylor approximation.
25.1 Differentiating a Power Series

Back in Calc BC, you were asked to differentiate a power series
\[
f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + \ldots
\]
term by term. And because you knew the derivative of $x^n$ was $n x^{n-1}$, this was cake:
\[
f'(x) = a_1 + 2a_2 x + 3a_3 x^2 + 4a_4 x^3 + \ldots
\]
But

Who said you can directly differentiate a power series?!

Remember, a series is not actually a sum of infinitely many terms!¹

¹ That wouldn't make any sense, but even if it did, the linearity rule for derivatives only applies to sums of finitely many terms!

What you are really looking at is a limit of a sequence of partial sums. In this case, you really calculated a sequence of differentiated partial sums:
\[
\begin{aligned}
S_1 &= a_1 \\
S_2 &= a_1 + 2a_2 x \\
S_3 &= a_1 + 2a_2 x + 3a_3 x^2 \\
S_4 &= a_1 + 2a_2 x + 3a_3 x^2 + 4a_4 x^3 \\
&\;\;\vdots
\end{aligned}
\]
But how do we know that this sequence actually converges to $f'(x)$? In other words, how do we know that the derivative of the limit of partial sums equals the limit of the derivatives of the partial sums? Even though it seems intuitively obvious, we have to prove it! Remember,

Math Mantra: INTUITION is not a proof!

Also, how do we even know the differentiated power series has the same radius of convergence? The sequence of differentiated terms is a completely different power series! Perhaps this series converges to $f'(x)$ on a smaller radius.

Luckily though, it does turn out that the series
\[
\sum_{n=1}^{\infty} n a_n x^{n-1}
\]
converges to $f'(x)$, and that this convergence holds for all $x$ in the interval of convergence of the original series. Moreover, this will be an easy consequence of our Change of Base-Point Theorem.

But wasn't the Change of Base-Point Theorem about shifting power series? How is it even related? Simple. Recall that the coefficients of the shifted series are
\[
b_m = \sum_{n=m}^{\infty} \binom{n}{m} a_n \alpha^{n-m}.
\]
Stare at the $b_1$ term:
\[
b_1 = \sum_{n=1}^{\infty} \underbrace{\binom{n}{1}}_{n} a_n \alpha^{n-1}
\]
Look familiar? It's the differentiated series evaluated at $x = \alpha$.
Theorem. Let the power series
\[
f(x) = \sum_{n=0}^{\infty} a_n x^n
\]
have radius of convergence $\rho > 0$. Then for every $\alpha$ with $|\alpha| < \rho$, $f$ is differentiable at $\alpha$ and $f'(\alpha)$ equals the convergent power series
\[
\sum_{n=1}^{\infty} n a_n \alpha^{n-1}
\]

Proof Summary:

• The differentiated series is simply $b_1$, which is convergent by the Change of Base-Point Theorem.

• We need to show, using the $\epsilon$-$\delta$ definition of derivative, that the quotient
\[
\frac{f(x) - f(\alpha)}{x - \alpha}
\]
converges to $b_1$.

• Let $\epsilon > 0$. Apply the Change of Base-Point formula to rewrite the $\epsilon$-condition as
\[
\left| \frac{1}{x - \alpha} \sum_{m=2}^{\infty} b_m (x - \alpha)^m \right| < \epsilon
\]

• Bound the quotient by a convergent power series.

• The $\epsilon$-$\delta$ definition is satisfied with the choice
\[
\delta = \min\left\{ \frac{\rho - |\alpha|}{2}, \frac{\epsilon}{M} \right\}
\]
where
\[
M = \sum_{m=2}^{\infty} |b_m| \left( \frac{\rho - |\alpha|}{2} \right)^{m-2}.
\]
Proof: Let $\alpha$ satisfy $|\alpha| < \rho$. Observe that the differentiated power series
\[
\sum_{n=1}^{\infty} n a_n \alpha^{n-1}
\]
is simply the coefficient $b_1$ from the Change of Base-Point formula. We already proved in the Change of Base-Point Theorem that $b_1$ is convergent, so we conclude that the differentiated power series is convergent. Now the goal is to show that the derivative of $f$ evaluated at $\alpha$ exists and equals $b_1$. This means that given $\epsilon > 0$, we must find a $\delta > 0$ such that when
\[
|x - \alpha| < \delta
\]
then
\[
\left| \frac{f(x) - f(\alpha)}{x - \alpha} - b_1 \right| < \epsilon.
\]
Staring at the $\epsilon$-condition, we first simplify the numerator
\[
f(x) - f(\alpha)
\]
to get $(x - \alpha)$ to appear, by expanding $f(x)$ as a power series centered at $\alpha$. But remember, in order to apply the Change of Base-Point Theorem, we need
\[
|x - \alpha| < \rho - |\alpha|.
\]
Therefore, we restrict
\[
\delta \le \rho - |\alpha|.
\]
Moreover, to help us in our proof, we actually restrict even more by dividing by 2 (you'll see why). So let $x$ satisfy
\[
|x - \alpha| < \frac{\rho - |\alpha|}{2}.
\]
Applying the Change of Base-Point Theorem to the original series
\[
f(x) = \sum_{n=0}^{\infty} a_n x^n
\]
we get the shifted series
\[
f(x) = \sum_{m=0}^{\infty} b_m (x - \alpha)^m
\]
with coefficients
\[
b_m = \sum_{n=m}^{\infty} \binom{n}{m} a_n \alpha^{n-m}
\]
Notice that the first term in the shifted series is
\[
b_0 (x - \alpha)^0 = b_0 = \sum_{n=0}^{\infty} \underbrace{\binom{n}{0}}_{=1} a_n \alpha^n,
\]
which is simply
\[
f(\alpha) = \sum_{n=0}^{\infty} a_n \alpha^n
\]
Now, rewrite the $\epsilon$-condition as:
\[
\left| \frac{f(x) - f(\alpha) - b_1(x - \alpha)}{x - \alpha} \right| < \epsilon
\]
So the numerator $f(x) - f(\alpha) - b_1(x - \alpha)$ is just
\[
\sum_{m=0}^{\infty} b_m (x - \alpha)^m - \underbrace{b_0}_{f(\alpha)} - b_1 (x - \alpha)
= \sum_{m=0}^{\infty} b_m (x - \alpha)^m - \sum_{m=0}^{1} b_m (x - \alpha)^m
\]
By the theorem for splitting series proved in a previous lecture, the right-hand side is the convergent power series
\[
\sum_{m=2}^{\infty} b_m (x - \alpha)^m
\]
Therefore the $\epsilon$-condition is
\[
\left| \frac{1}{x - \alpha} \sum_{m=2}^{\infty} b_m (x - \alpha)^m \right| < \epsilon
\]
But by sequence scaling theorems, we can pull out $(x - \alpha)^2$ to get
\[
|x - \alpha| \left| \sum_{m=2}^{\infty} b_m (x - \alpha)^{m-2} \right|.
\]
Recall that power series converge absolutely if they converge at all. Therefore, the convergent power series in the absolute value must converge absolutely, so we can bound the whole expression by
\[
|x - \alpha| \sum_{m=2}^{\infty} \left| b_m (x - \alpha)^{m-2} \right|
\]
which is equivalently
\[
|x - \alpha| \sum_{m=2}^{\infty} |b_m| \, |x - \alpha|^{m-2}.
\]
Now it's time to explain why we needed to divide by 2 in the $\delta$-restriction: we can bound the above by
\[
|x - \alpha| \sum_{m=2}^{\infty} |b_m| \Big( \underbrace{\tfrac{\rho - |\alpha|}{2}}_{>\, |x - \alpha|} \Big)^{m-2}
\]
and this series converges! Note that if we did not divide by 2, then this power series would not necessarily converge, since $\rho - |\alpha|$ could correspond to an endpoint of the interval of convergence.

Awesome, the right side is a fixed number! That means if we choose
\[
\delta = \min\left\{ \frac{\rho - |\alpha|}{2}, \frac{\epsilon}{M} \right\}
\]
where
\[
M = \sum_{m=2}^{\infty} |b_m| \left( \frac{\rho - |\alpha|}{2} \right)^{m-2}
\]
then
\[
\left| \frac{f(x) - f(\alpha)}{x - \alpha} - b_1 \right| < \frac{\epsilon}{M} \sum_{m=2}^{\infty} |b_m| \left( \frac{\rho - |\alpha|}{2} \right)^{m-2} = \epsilon. \qquad \blacksquare
\]
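The following quick check is not part of the original text. It takes the geometric series $f(x) = \sum x^n = 1/(1-x)$ (an assumed example with $a_n = 1$ and $\rho = 1$) and verifies numerically that the differentiated series $\sum n a_n \alpha^{n-1}$ lands on $f'(\alpha) = 1/(1-\alpha)^2$.

```python
alpha = 0.3                     # a point with |alpha| < rho = 1
N = 200                         # enough terms that the tail is negligible

diff_series = sum(n * alpha ** (n - 1) for n in range(1, N))   # sum of n * a_n * alpha^(n-1) with a_n = 1
exact = 1.0 / (1.0 - alpha) ** 2                               # f'(alpha) computed directly
print(diff_series, exact)                                      # both are about 2.0408
```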
25.2 Taylor Series

A natural question to ask is,

Which functions can be represented as the limit of a power series?

Back in Calc BC, you probably thought the answer was easy: just have all the derivatives exist. Then you can use your trusty Taylor Series formula (centered around a point $\alpha$) to extract a power series from $f$:
\[
f(x) = f(\alpha) + \frac{f'(\alpha)}{1!}(x - \alpha) + \frac{f''(\alpha)}{2!}(x - \alpha)^2 + \frac{f'''(\alpha)}{3!}(x - \alpha)^3 + \ldots = \sum_{n=0}^{\infty} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n
\]
Simply solve for the general form of the $n$-th derivative of $f$, evaluate at $\alpha$, and plug it back in. Voila. And you never second-guessed your formula since you were told it always worked. But,

This is the biggest lie ever told¹

¹ Yes, even bigger than Slughorn's false memory of Voldemort.

Just because all derivatives of $f$ at $\alpha$ exist and we can write down the Taylor series, this does not guarantee that the limit of the corresponding Taylor series will equal $f(x)$.

For example, on this week's homework, you will show that all derivatives of the continuous function
\[
g(x) =
\begin{cases}
e^{-\frac{1}{x^2}} & \text{if } x \neq 0 \\
0 & \text{if } x = 0
\end{cases}
\]
exist and equal $0$ at $x = 0$. This means that the Taylor series for $g$ about $0$ is just the zero function, not $g(x)$. In fact, you are going to prove that there does not exist a power series that represents $g$ on an open interval about $0$.
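Here is a small, non-authoritative sketch (not from the text) of that homework fact, using sympy to evaluate the limits of the first few derivatives of $e^{-1/x^2}$ at $0$:

```python
import sympy as sp

x = sp.symbols('x')
g = sp.exp(-1 / x ** 2)                          # the x != 0 branch of g

# The limits of the first few derivatives at 0 all vanish ...
for n in range(1, 5):
    print(n, sp.limit(sp.diff(g, x, n), x, 0))   # prints 0 each time

# ... so every Taylor polynomial of g about 0 is identically zero, yet g itself is not:
print(float(g.subs(x, sp.Rational(1, 2))))       # g(1/2) = e^{-4}, about 0.018, definitely not 0
```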
Luckily, we can find a condition that ensures, on some open interval about a given point $\alpha$, that the limit of the corresponding Taylor series centered at $\alpha$ converges to $f(x)$. And we will derive this condition from studying the error term.

But first, we give a formal definition of Taylor series:

Definition. Given that $f : \mathbb{R} \to \mathbb{R}$ is differentiable to all orders at $\alpha$, the Taylor series of $f$ centered around $\alpha$ is the power series
\[
\sum_{n=0}^{\infty} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n.
\]
When we approximate $f(x)$ with the $m$-th partial sum of its Taylor series, we can look at the error:
\[
\underbrace{E(x)}_{\text{Error}} = \underbrace{f(x)}_{\text{Actual}} - \underbrace{\left( f(\alpha) + \frac{f'(\alpha)}{1!}(x - \alpha) + \frac{f''(\alpha)}{2!}(x - \alpha)^2 + \frac{f'''(\alpha)}{3!}(x - \alpha)^3 + \ldots + \frac{f^{(m)}(\alpha)}{m!}(x - \alpha)^m \right)}_{\text{Guess}}
\]
Remarkably, we can always write the error in a nice form. Particularly, for a fixed $x$, there exists some $c$ in between¹ $\alpha$ and $x$ such that the error is the $(m + 1)$-th Taylor term, except that the derivative is evaluated at $c$ instead of $\alpha$:
\[
E(x) = \frac{f^{(m+1)}(c)}{(m + 1)!}(x - \alpha)^{m+1}.
\]
The words "existence of $c$" should remind you of Rolle's Theorem.² In fact, this will be the lynchpin of the proof!

¹ So this means $c \in (\alpha, x)$ or $c \in (x, \alpha)$. We say "in between" since we do not know whether $x$ or $\alpha$ is greater.
² Or really, the Mean Value Theorem. But you use Rolle's to prove the Mean Value Theorem.

Theorem (Taylor's Theorem). Let $f : \mathbb{R} \to \mathbb{R}$. For $\delta > 0$, let $f$ be differentiable to order $m + 1$ for all $x$ satisfying $|x - \alpha| < \delta$. Then, for any fixed $x$ in this interval $(\alpha - \delta, \alpha + \delta)$, we can find a constant $c$ between $x$ and $\alpha$ such that
\[
f(x) = \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n + \underbrace{\frac{f^{(m+1)}(c)}{(m + 1)!}(x - \alpha)^{m+1}}_{E(x)}
\]
Proof Summary:

• For $\alpha$ and $x$ fixed, define
\[
g(t) = f(t) - \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(t - \alpha)^n - M(t - \alpha)^{m+1}
\]
where
\[
M = \frac{f(x) - \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n}{(x - \alpha)^{m+1}}.
\]

• Show that $g(x) = 0$, $g(\alpha) = 0$, and for any derivative with order $i$ satisfying $1 \le i \le m$,
\[
g^{(i)}(\alpha) = 0
\]

• Apply Rolle's Theorem inductively until we have a constant $c_{m+1}$ such that
\[
g^{(m+1)}(c_{m+1}) = 0.
\]

• Expand the definition of $g^{(m+1)}$.

Proof: Remember how we proved the Mean Value Theorem? It was a clever application of Rolle's Theorem on some magical function. Same idea here.
Using $f$, we are going to build a magical function $g$ such that $g(x) = 0$, $g(\alpha) = 0$, and all derivatives with orders from $1$ to $m$ vanish at $\alpha$:
\[
g'(\alpha) = 0 \qquad g''(\alpha) = 0 \qquad \ldots \qquad g^{(m)}(\alpha) = 0
\]
This allows us to use Rolle's Theorem a ridiculous number of times. Applying Rolle's Theorem once to $g$, we know there is a point $c_1$ between $x$ and $\alpha$ such that $g'(c_1) = 0$:¹

[Figure: the graph of $g(t)$, which vanishes at $\alpha$ and $x$, with the point $c_1$ in between where $g'(c_1) = 0$]

But with this $c_1$ and $\alpha$, we can use Rolle's Theorem again on the function $g'$ to get a point $c_2$ between $\alpha$ and $c_1$ such that $g''(c_2) = 0$:

[Figure: the graph of $g'(t)$, which vanishes at $\alpha$ and $c_1$, with the point $c_2$ in between where $g''(c_2) = 0$]

But it doesn't stop there! Using $c_2$ and $\alpha$, we play the same game on $g''$:

¹ Below, we will draw the graphs as if $\alpha < x$, but keep in mind that $x$ may really be smaller than $\alpha$. The intuition of the graphs is the same in either case.
[Figure: the graph of $g''(t)$, which vanishes at $\alpha$ and $c_2$, with the point $c_3$ in between where $g'''(c_3) = 0$]

Inductively, we can do this for all derivatives of $g$ up to order $m$. At the end, we are left with a $c_{m+1}$ between $\alpha$ and $x$ such that
\[
g^{(m+1)}(c_{m+1}) = 0,
\]
and once we expand this last line, a miracle happens: the theorem is proven!

Now that we have the battle plan, let's proceed to the proof. Fix an arbitrary $x$ satisfying $|x - \alpha| < \delta$. So for the remainder of the proof $x$ is a constant. For simplicity, assume $\alpha < x$ (the proof for $\alpha > x$ is almost verbatim).

First, we build our magic function $g(t)$: take the difference of the original function $f(t)$ with the $m$-th partial sum of the Taylor series evaluated at $t$:
\[
f(t) - \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(t - \alpha)^n
\]
Then subtract a constant scaling of $(t - \alpha)^{m+1}$ to define $g(t)$:
\[
g(t) = f(t) - \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(t - \alpha)^n - M(t - \alpha)^{m+1}
\]
But what should the scale factor $M$ be? Remember, we need
\[
g(x) = 0
\]
so
\[
\underbrace{g(x)}_{0} = f(x) - \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n - M(x - \alpha)^{m+1}
\]
Solving for $M$ yields
\[
M = \frac{f(x) - \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n}{(x - \alpha)^{m+1}}
\]
Don't worry about this ugly $M$. It's just a constant! When you differentiate, just pull $M$ out to the front!
Now that we have our function $g(t)$ with $g(x) = 0$, we need to check that $g(\alpha) = 0$ and that all derivatives of $g$ up to order $m$ are $0$ at $\alpha$. Let's look at $g(t)$ expanded:
\[
g(t) = f(t) - f(\alpha) - \frac{f'(\alpha)}{1!}(t - \alpha) - \frac{f''(\alpha)}{2!}(t - \alpha)^2 - \ldots - \frac{f^{(m)}(\alpha)}{m!}(t - \alpha)^m - M(t - \alpha)^{m+1}
\]
Then
\[
g(\alpha) = \underbrace{f(\alpha) - f(\alpha)}_{=0} - \underbrace{\frac{f'(\alpha)}{1!}(\alpha - \alpha) - \ldots - \frac{f^{(m)}(\alpha)}{m!}(\alpha - \alpha)^m}_{=0} - \underbrace{M(\alpha - \alpha)^{m+1}}_{=0} = 0
\]
Likewise, when we directly calculate
\[
g'(t) = f'(t) - \frac{f'(\alpha)}{1!} - 2\frac{f''(\alpha)}{2!}(t - \alpha) - 3\frac{f'''(\alpha)}{3!}(t - \alpha)^2 - \ldots - m\frac{f^{(m)}(\alpha)}{m!}(t - \alpha)^{m-1} - (m+1)M(t - \alpha)^m
\]
and
\[
g''(t) = f''(t) - 2\frac{f''(\alpha)}{2!} - 2 \cdot 3\frac{f'''(\alpha)}{3!}(t - \alpha) - \ldots - (m-1)m\frac{f^{(m)}(\alpha)}{m!}(t - \alpha)^{m-2} - m(m+1)M(t - \alpha)^{m-1}
\]
Plugging in $t = \alpha$, it is easy to see
\[
g'(\alpha) = g''(\alpha) = 0
\]
In fact, it is left as a straightforward exercise in induction to formally prove
\[
g'(\alpha) = 0 \qquad g''(\alpha) = 0 \qquad \ldots \qquad g^{(m)}(\alpha) = 0
\]
Now, as already discussed, by repeatedly applying Rolle's Theorem,
\[
\begin{aligned}
g(\alpha) = 0 \text{ and } g(x) = 0 &\implies g'(c_1) = 0 \text{ for some } c_1 \in (\alpha, x) \\
g'(\alpha) = 0 \text{ and } g'(c_1) = 0 &\implies g''(c_2) = 0 \text{ for some } c_2 \in (\alpha, c_1) \\
g''(\alpha) = 0 \text{ and } g''(c_2) = 0 &\implies g'''(c_3) = 0 \text{ for some } c_3 \in (\alpha, c_2) \\
&\;\;\vdots \\
g^{(m)}(\alpha) = 0 \text{ and } g^{(m)}(c_m) = 0 &\implies g^{(m+1)}(c_{m+1}) = 0 \text{ for some } c_{m+1} \in (\alpha, c_m)
\end{aligned}
\]
so that
\[
\alpha < c_{m+1} < c_m < \ldots < c_1 < x
\]
and
\[
g^{(m+1)}(c_{m+1}) = 0
\]
But what is $g^{(m+1)}(t)$? Taking the derivative $m + 1$ times kills the inner $\sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(t - \alpha)^n$ term and converts $(t - \alpha)^{m+1}$ into $(m + 1)!$:
\[
g^{(m+1)}(t) = f^{(m+1)}(t) - (m + 1)!\,M
\]
Thus,
\[
\underbrace{g^{(m+1)}(c_{m+1})}_{0} = f^{(m+1)}(c_{m+1}) - (m + 1)!\,M
\]
and so,
\[
\frac{f^{(m+1)}(c_{m+1})}{(m + 1)!} = M.
\]
Now, expand the definition of $M$:
\[
\frac{f^{(m+1)}(c_{m+1})}{(m + 1)!} = \underbrace{\frac{f(x) - \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n}{(x - \alpha)^{m+1}}}_{M}
\]
which gives us
\[
f(x) = \frac{f^{(m+1)}(c_{m+1})}{(m + 1)!}(x - \alpha)^{m+1} + \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n. \qquad \blacksquare
\]
AMAZING!
Now that we have a nice form for the error, we can impose a condition that forces the error to shrink fast enough to ensure that the Taylor series converges to $f(x)$. As in the proof, assume here that we have fixed $x$ with $|x - \alpha| < \delta$, and for simplicity assume $\alpha < x$.

Since the error is always of the form
\[
\frac{f^{(n)}(c)}{n!}(x - \alpha)^n
\]
we can try to bound the coefficient
\[
\frac{f^{(n)}(c)}{n!}
\]
But we also know that all possible $c$ lie in between $x$ and $\alpha$, and $x$ is at most $\delta$ from $\alpha$,

[Figure: the number line with $\alpha$, then $x$, then $\alpha + \delta$ marked from left to right]

so we bound
\[
\left| \frac{f^{(n)}(c)}{n!} \right|
\]
for all $c$ such that $|c - \alpha| < \delta$. But what bound should we choose?

We have to make sure that, for any power, the bound on this coefficient counteracts the growth rate of $(x - \alpha)^n$. Since we are guaranteed $|(x - \alpha)^n| < \delta^n$, we guess the condition
\[
\left| \frac{f^{(n)}(c)}{n!} \right| < \frac{C}{\delta^n}
\]
where $C$ is some fixed constant.
Theorem. Let $f$ be differentiable to all orders for all $x$ with $|x - \alpha| < \delta$. If there is some constant $C$ such that for all $n$ and all such $x$,
\[
\left| \frac{f^{(n)}(x)}{n!} \right| < \frac{C}{\delta^n}
\]
then the Taylor series
\[
\sum_{n=0}^{\infty} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n
\]
is convergent and equals $f(x)$ for all $x$ with $|x - \alpha| < \delta$.

Proof Summary:

• We need to show the sequence of partial sums converges to $f(x)$, so look at the $\epsilon$-$N$ definition of convergence.

• Use the preceding theorem to rewrite the $\epsilon$-condition.

• Apply the given constraint to bound the left side of the $\epsilon$-condition by
\[
C \left( \frac{|x - \alpha|}{\delta} \right)^{m+1}
\]

• Choose $N$ such that for all $m \ge N$,
\[
\left( \frac{|x - \alpha|}{\delta} \right)^{m+1} < \frac{\epsilon}{C}
\]
Proof: We want to show that for any $\epsilon > 0$, there is some $N$ such that for all $m \ge N$,
\[
\left| f(x) - \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n \right| < \epsilon
\]
By Taylor's Theorem, we have a nice way to write the error: for some $c$ between $x$ and $\alpha$,
\[
\left| f(x) - \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n \right| = \left| \frac{f^{(m+1)}(c)}{(m + 1)!}(x - \alpha)^{m+1} \right|
\]
which is the same as
\[
\left| \frac{f^{(m+1)}(c)}{(m + 1)!} \right| |x - \alpha|^{m+1}.
\]
By our given condition (for the particular case of $n = m + 1$), we know this is bounded by
\[
\frac{C}{\delta^{m+1}} \left| (x - \alpha)^{m+1} \right|
\]
which is equivalent to
\[
C \left( \frac{|x - \alpha|}{\delta} \right)^{m+1}
\]
But
\[
\frac{|x - \alpha|}{\delta} < 1
\]
and recall that we showed, for any $0 < a < 1$, that successive powers of $a$ converge to $0$:
\[
a^n \to 0.
\]
In particular, we know there is some $N$ so that for all $m \ge N$,
\[
\left( \frac{|x - \alpha|}{\delta} \right)^{m+1} < \frac{\epsilon}{C}
\]
and so for all $m \ge N$,
\[
\left| f(x) - \sum_{n=0}^{m} \frac{f^{(n)}(\alpha)}{n!}(x - \alpha)^n \right| < C \cdot \frac{\epsilon}{C} = \epsilon. \qquad \blacksquare
\]
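Not part of the text: a quick numerical sketch of this theorem for $f(x) = e^x$ about $\alpha = 0$ with $\delta = 1$ (an assumed example, where the hypothesis holds with $C = e$ since $|f^{(n)}(x)/n!| = e^x/n! \le e$ on $|x| < 1$):

```python
import math

x = 0.9                                 # a point with |x - alpha| < delta = 1
for m in (2, 5, 10, 15):
    partial = sum(x ** n / math.factorial(n) for n in range(m + 1))   # m-th Taylor partial sum at alpha = 0
    error = abs(math.exp(x) - partial)
    bound = math.e * (x / 1.0) ** (m + 1)                             # C * (|x - alpha| / delta)^(m+1)
    print(m, error, bound, error <= bound)                            # the error dies off under the bound
```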
Lecture 26

Mastering Manifolds

Manifolds are a bit like pornography:
hard to define, but you know one when you see one.
- S. Weinberger

Goals: After giving considerable intuition for manifolds, we give a formal definition. We also define the tangent space at a point on a manifold and prove that this space is actually a subspace.

26.1 Another look at Curves

If you haven't realized it yet, curves are awesome. When you look at a curve, you intuitively know what the derivative (at a point) should be. Just follow the path:

[Figure: a curve with an arrow indicating the direction of travel at a point]

And unlike surfaces and other functions that have multiple inputs, we don't have the weirdness that arises from being able to take the derivative along any direction. Differentiating a curve is simply applying 1D differentiation component-wise:
\[
f'(t) =
\begin{pmatrix}
f_1'(t) \\ f_2'(t) \\ f_3'(t) \\ \vdots \\ f_n'(t)
\end{pmatrix}
\]
In fact, a directional derivative was really a derivative along a curve. For example, in Lecture 16: Dishing Out Derivatives, when we took the directional derivatives parallel to the $x$ direction, we were really differentiating a curve that lies on the surface.

[Figure: a surface over the $x$ and $z$ axes, with the curve obtained by slicing the surface parallel to the $x$ direction]

Now that we have a formal definition of a curve, we can go one step further: instead of taking curves along fixed directions, we can now consider all curves that go through point $a$ and that lie on the surface:

[Figure: several curves on the surface, all passing through the point $a$]

For any of these curves, we can evaluate the derivative at its origin (when $t = 0$).

So what does the collection of all of these derivative vectors look like? It turns out that this collection forms a subspace (and in this case, a plane¹)!

¹ Caution: even though, visually, we think of the subspace as passing through $a$, it actually passes through the origin (by the definition of subspace). In our drawings, we shift the tangent space by $a$ so that it hits the graph and we can see the tangency. Remember in Calc BC, when you computed the tangent line to a graph: after you took the derivative, you didn't just draw a line, with that slope, passing through the origin. You had to shift the line so it hit your graph and you could see the tangency. Same idea here.

But how do we prove that this collection of derivative vectors is a subspace? The answer is going to rely on the type of surface we are dealing with. Particularly, we need to study the feared topic of manifolds.

26.2 What is a Manifold?

The usual intuitive definition is:

A $k$-dimensional manifold is a geometric structure that, at every point on the structure, locally looks like $\mathbb{R}^k$.
At this point, you may be completely confuzzled. But it is actually a simple idea.
Let's look at the following geometric structures in $\mathbb{R}^3$: a curve, a plane, and a (hollow) sphere. Consider a point on each of these structures. Around each of these points, draw an open ball and consider all points on the structure that lie in that open ball:

[Figure: a curve, a plane, and a sphere in $\mathbb{R}^3$, each with a marked point, a small open ball around it, and the piece of the structure inside the ball]

Remember how we said that an open set in $\mathbb{R}^n$ was a microcosm of the whole $\mathbb{R}^n$ space? Same idea here. In particular, the region on the curve resembles the real number line $\mathbb{R}$, and the regions on the plane and sphere resemble $\mathbb{R}^2$. So in these cases, we say that the curve is a 1-manifold in $\mathbb{R}^3$, and the plane and sphere are 2-manifolds in $\mathbb{R}^3$.

Careful though, not everything is a manifold! For example, a curve in $\mathbb{R}^2$ with endpoints is not a 1-manifold! Just consider any ball surrounding an endpoint:

[Figure: a curve with an endpoint and a small ball around that endpoint]

This gives us a half-open interval, which is absolutely not a microcosm of $\mathbb{R}$. Also, in $\mathbb{R}^2$ a right angle is not¹ a 1-manifold! Consider any ball surrounding the vertex:

[Figure: a right angle with a small ball around its vertex]

No matter how close we zoom in, our interval is bent:

[Figure: successive zooms on the vertex, each still showing a bend]

Now that you intuitively understand what a manifold is, how do we formalize it? The original definition we had (for $M$ to be a manifold) was

At any point $a$ on $M$, $M$ locally looks like $\mathbb{R}^k$.

First, we need to replace

"locally"

with

"there exists some ball around $a$ such that the set of all points on $M$ in this ball"

or symbolically,

"there exists a $\delta > 0$ such that $B_\delta(a) \cap M$"

So now we have

For any point $a$ on $M$, there exists a $\delta > 0$ such that $B_\delta(a) \cap M$ looks like $\mathbb{R}^k$.

But how do we formalize

"looks like $\mathbb{R}^k$"?

For this, we need to talk about graphs.

¹ Actually, in some contexts we do consider this a manifold. Just like how in some contexts, we say that a doughnut looks like a coffee cup. BUT for this course, we are only considering $C^1$ manifolds.
26.3 Graphs

Throughout high school, you were asked to graph functions. For $\sin(x)$ on $[0, 2\pi]$, you graphed:

[Figure: the graph of $\sin(x)$ on $[0, 2\pi]$]

But you weren't exactly plotting $\sin(x)$. You were really plotting a set of vectors in $\mathbb{R}^2$:
\[
\left\{ \begin{pmatrix} x \\ \sin(x) \end{pmatrix} \;\Big|\; 0 \le x \le 2\pi \right\}
\]
Same idea in $\mathbb{R}^3$. When you graphed $f(x, y)$ on the unit square $[0, 1] \times [0, 1]$,

[Figure: the surface of $f(x, y)$ over the unit square, with the input $(x, y)$ in the base and the value $f(x, y)$ above it]

you were really plotting the set
\[
\left\{ \begin{pmatrix} x \\ y \\ f(x, y) \end{pmatrix} \;\Big|\; 0 \le x \le 1, \; 0 \le y \le 1 \right\}
\]
Generally, the graph of a function $g : \mathbb{R}^n \to \mathbb{R}^m$ is just the set of vectors where the inputs and corresponding outputs are concatenated to form a vector in $\mathbb{R}^{n+m}$:
\[
\begin{pmatrix} x \\ g(x) \end{pmatrix}
=
\begin{pmatrix}
x_1 \\ x_2 \\ \vdots \\ x_n \\ g_1(x_1, x_2, \ldots, x_n) \\ g_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ g_m(x_1, x_2, \ldots, x_n)
\end{pmatrix}
\]
So when we restrict the domain over a set $U \subseteq \mathbb{R}^n$, we say

Definition. The graph of $g : \mathbb{R}^n \to \mathbb{R}^m$ over $U \subseteq \mathbb{R}^n$ is the set
\[
G(U) = \left\{ \begin{pmatrix} x \\ g(x) \end{pmatrix} \;\Big|\; x \in U \right\} \subseteq \mathbb{R}^{n+m}
\]
We also define the corresponding graph map $G : \mathbb{R}^n \to \mathbb{R}^{n+m}$ by
\[
G(x) = \begin{pmatrix} x \\ g(x) \end{pmatrix}
\]
Note that the graph $G(U)$ is simply the image of the graph map over $U$, as is suggested by our choice of notation.
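As a tiny illustration of the graph map (not from the text, and the particular $g$ below is just an assumption), concatenating the input with the output lands us in $\mathbb{R}^{n+m}$:

```python
import numpy as np

# An assumed g : R^2 -> R^1 for illustration, g(x1, x2) = x1^2 + x2^2
g = lambda x: np.array([x[0] ** 2 + x[1] ** 2])

def G(x):
    """The graph map: concatenate the input x with g(x) to get a point of the graph in R^(n+m)."""
    return np.concatenate([x, g(x)])

print(G(np.array([1.0, 2.0])))    # [1. 2. 5.]  one point of the graph G(U)
```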
Easy. Exactly what you've been doing since 7th grade. So what does this have to do with manifolds? We can now make an educated guess how to formalize

"looks like $\mathbb{R}^k$"

Namely, we guess

"can be written as the graph of a $C^1$ function $g$ over an OPEN set $U$ in $\mathbb{R}^k$"

So consider again (the local restriction of) the curve

[Figure: the local piece of the curve from before]

This is really just the graph of the $C^1$ function
\[
g(x) = \begin{pmatrix} 2x \\ \sin(x) \end{pmatrix}
\]
over the open interval $U = (1.7, 2.4)$:
\[
G(U) = \left\{ \begin{pmatrix} x \\ 2x \\ \sin(x) \end{pmatrix} \;\Big|\; x \in (1.7, 2.4) \right\}
\]
From this example, it seems that our educated guess is correct. The $C^1$ assumption removes bends at a vertex. The openness of $U$ removes the endpoints:

[Figure: the curve piece over the open interval $U$, with the endpoints excluded]
It turns out that this is almost what we want. Consider the graph in $\mathbb{R}^2$ of a vertical line segment:

[Figure: the vertical segment $x = 1$, $1 < y < 3$ in the $xy$-plane]

Intuitively, we feel that this is a 1-manifold in $\mathbb{R}^2$. But we cannot write this as a graph of a function! This is literally the first thing we learn in grade school.
\[
\begin{pmatrix} x \\ g(x) \end{pmatrix}
\]
So what do we do? We cheat. We use a permuted graph:
\[
\begin{pmatrix} g(x) \\ x \end{pmatrix}
\]
so the vertical line segment is the permuted graph
\[
\left\{ \begin{pmatrix} 1 \\ x \end{pmatrix} \;\Big|\; x \in (1, 3) \right\}
\]
Intuitively, we are just flipping axes and viewing the shape from a different angle:

[Figure: the same segment drawn with the axes swapped, now appearing horizontal]

If you've read Professor Simon's notes,¹ you already saw this idea in Figure 2.5. Here, we have a closed curve in $\mathbb{R}^2$:

[Figure: a closed curve in the plane]

To show that it is a 1-manifold, for any point where the curve is horizontal, we isolate a local neighborhood and then represent this region by a non-permuted² graph (so a function of the $x$-coordinate):

[Figure: a horizontal piece of the curve over an interval $U$ on the $x$-axis, written as $(x, g(x))$]

For any point where the curve is vertical, we represent the local region by a permuted graph (a function of the $y$-coordinate):

[Figure: a vertical piece of the curve over an interval $U$ on the $y$-axis, written as $(g(y), y)$]

¹ For some reason, Professor Simon's graph reminds me of Roger from American Dad.
² Formally this is also a permuted graph, just with the identity permutation.
Of course, we can have more complicated permutations than simply switching $x$ and $y$. For example, if $g : \mathbb{R}^2 \to \mathbb{R}^3$,
\[
\begin{pmatrix}
x_2 \\ x_1 \\ g_3(x_1, x_2) \\ g_1(x_1, x_2) \\ g_2(x_1, x_2)
\end{pmatrix}
\qquad
\begin{pmatrix}
g_3(x_1, x_2) \\ x_2 \\ g_2(x_1, x_2) \\ x_1 \\ g_1(x_1, x_2)
\end{pmatrix}
\qquad
\begin{pmatrix}
g_3(x_1, x_2) \\ g_1(x_1, x_2) \\ g_2(x_1, x_2) \\ x_1 \\ x_2
\end{pmatrix}
\]
are all permutations of
\[
\begin{pmatrix}
x_1 \\ x_2 \\ g_1(x_1, x_2) \\ g_2(x_1, x_2) \\ g_3(x_1, x_2)
\end{pmatrix}.
\]
Now we can properly formalize

"looks like $\mathbb{R}^k$"

as

"can be written as a permuted graph¹ of a $C^1$ function $g$ over an OPEN set $U$ in $\mathbb{R}^k$"

This completes our manifold definition:

¹ In particular, any graph is a permuted graph, just with the identity permutation.
Definition. We say that $M \subseteq \mathbb{R}^n$ is a $k$-manifold if for any point $a \in M$, there exists $\delta > 0$ such that $B_\delta(a) \cap M$ is a permuted graph $P(G(U))$ of some $C^1$ function $g : \mathbb{R}^k \to \mathbb{R}^{n-k}$ over an open set $U \subseteq \mathbb{R}^k$.
Now that we finally have a definition, let's try it out!

Example. The unit circle $S^1$ is a 1-manifold in $\mathbb{R}^2$.

Because I want you to see the big picture, let's first construct the four graphs before defining their corresponding open sets $U_1, U_2, U_3, U_4$.

Around any point in the upper half of the circle,¹
\[
a = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}, \quad a_2 > 0,
\]
the circle, locally, is the graph
\[
\left\{ \begin{pmatrix} x \\ \sqrt{1 - x^2} \end{pmatrix} \;\Big|\; x \in U_1 \right\}
\]
Around any point in the lower half of the circle,
\[
a = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}, \quad a_2 < 0,
\]
the circle, locally, is the graph
\[
\left\{ \begin{pmatrix} x \\ -\sqrt{1 - x^2} \end{pmatrix} \;\Big|\; x \in U_2 \right\}
\]
Now, we just have to consider the remaining two points on the sides of the circle, where $a_2 = 0$. Around the left-side point,
\[
\begin{pmatrix} -1 \\ 0 \end{pmatrix},
\]
the circle is locally the permuted graph
\[
\left\{ \begin{pmatrix} -\sqrt{1 - y^2} \\ y \end{pmatrix} \;\Big|\; y \in U_3 \right\}
\]
and around the right-side point,
\[
\begin{pmatrix} 1 \\ 0 \end{pmatrix},
\]
the circle is locally the permuted graph
\[
\left\{ \begin{pmatrix} \sqrt{1 - y^2} \\ y \end{pmatrix} \;\Big|\; y \in U_4 \right\}
\]
To show how to construct the open sets, we will construct $U_1$. So consider a point $a$ on the upper half of the circle. WLOG, assume the point is in the first quadrant. Convert to polar coordinates and let $\theta_0$ denote the angle made by the point $a$ with the $x$-axis. Consider the arc
\[
\theta_0 - \frac{\theta_0}{2} < \theta < \theta_0 + \frac{\theta_0}{2}
\]

[Figure: the unit circle with the point $a$ at angle $\theta_0$ and the arc of angular width $\theta_0$ centered at $a$]

We would like to solve for the radius of the ball centered at $a$ whose boundary goes through the endpoints of the arc:

[Figure: the ball around $a$ passing through the two arc endpoints]

So we isolate the triangles formed by the center of the circle, the point $a$, and each arc endpoint (two sides of length 1 meeting at the apex angle $\theta_0/2$, with the third side of length $r$),

[Figure: the two isosceles triangles with side lengths 1, 1, $r$ and apex angle $\theta_0/2$]

and apply the Law of Cosines to get
\[
r = \sqrt{2 - 2\cos\left(\frac{\theta_0}{2}\right)}
\]
So to represent $S^1 \cap B_r(a)$ as a graph, we can use the open set
\[
U_1 = \left( \cos\left(\frac{3\theta_0}{2}\right), \; \cos\left(\frac{\theta_0}{2}\right) \right)
\]
The constructions of $U_2$, $U_3$, $U_4$ follow the same approach.

¹ We draw the point on the $x$-axis just to make the picture pretty.
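Here is a small numerical sketch of this chart (not part of the text; the specific angle $\theta_0 = \pi/4$ is just an assumption for the check): every circle point strictly inside $B_r(a)$'s arc really is of the form $(x, \sqrt{1 - x^2})$ with $x \in U_1$.

```python
import numpy as np

theta0 = np.pi / 4                                    # an assumed point a in the first quadrant
a = np.array([np.cos(theta0), np.sin(theta0)])
r = np.sqrt(2 - 2 * np.cos(theta0 / 2))               # radius from the Law of Cosines step
U1 = (np.cos(3 * theta0 / 2), np.cos(theta0 / 2))     # open interval of x-coordinates of the arc

for theta in np.linspace(theta0 - theta0 / 2 + 1e-6, theta0 + theta0 / 2 - 1e-6, 7):
    p = np.array([np.cos(theta), np.sin(theta)])      # a circle point strictly inside the arc
    assert np.linalg.norm(p - a) < r                  # it lies in the ball B_r(a)
    assert U1[0] < p[0] < U1[1]                       # its x-coordinate lies in U_1
    assert np.isclose(p[1], np.sqrt(1 - p[0] ** 2))   # and it is recovered by the graph map
print("the chart (x, sqrt(1 - x^2)) over U_1 covers the arc around a")
```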
What if the unit circle lies in a plane in $\mathbb{R}^3$ rather than in $\mathbb{R}^2$? Is it still a 1-manifold? The answer is yes, as we show in the next example.

Example. The unit circle in the $xy$-plane,
\[
C = \left\{ \begin{pmatrix} x \\ y \\ 0 \end{pmatrix} \;\Big|\; x^2 + y^2 = 1 \right\},
\]
is a 1-manifold in $\mathbb{R}^3$.
Consider a point $a$ on the circle

[Figure: the unit circle in the $xy$-plane inside $\mathbb{R}^3$ with a marked point $a$]

and consider a ball of radius $\delta$ around this point:

[Figure: a ball of radius $\delta$ centered at $a$, intersecting the circle in a small arc]

Note that the intersection of this ball with the $xy$-plane looks just like a ball in $\mathbb{R}^2$, so we can apply the previous example. Precisely,
\[
C \cap B_\delta(a) = \left\{ \begin{pmatrix} x \\ y \\ 0 \end{pmatrix} \;\Big|\; \begin{pmatrix} x \\ y \end{pmatrix} \in S^1 \cap B_\delta(a) \right\}
\]
where $B_\delta(a)$ denotes a ball in $\mathbb{R}^3$ on the left-hand side and a ball in $\mathbb{R}^2$ on the right-hand side (inside the brackets). Without loss of generality, assume the point $a$ has positive $x$- and $y$-coordinates. With $U_1$ and $r$ from the previous example,
\[
S^1 \cap B_r(a) = G(U_1)
\]
where $G(U_1)$ is the graph over $U_1$ of the function $g : \mathbb{R} \to \mathbb{R}$ defined by
\[
g(x) = \sqrt{1 - x^2}
\]
So we can easily extend to $\mathbb{R}^3$ by defining the function $\tilde{g} : \mathbb{R} \to \mathbb{R}^2$ by
\[
\tilde{g}(x) = \begin{pmatrix} \sqrt{1 - x^2} \\ 0 \end{pmatrix}
\]
Then letting $\tilde{G}(U_1)$ denote the graph of $\tilde{g}$ over $U_1$,
\[
C \cap B_r(a) = \tilde{G}(U_1)
\]
The extension to $\mathbb{R}^3$ for the graphs over $U_2$, $U_3$, $U_4$ is similar, so we conclude that $C$ is a 1-manifold in $\mathbb{R}^3$. Note that the function $\tilde{g}$ still has domain $\mathbb{R}$ because $C$ is still a 1-manifold, but $\tilde{g}$ maps into the higher dimensional space $\mathbb{R}^2$ (instead of $\mathbb{R}$) so that its graph will be in $\mathbb{R}^3$ (instead of $\mathbb{R}^2$).
26.4 Tangent Spaces

Now that we have finished our tangent on manifolds, we can give a formal definition of the collection of derivative vectors discussed in the introduction.

Definition. Let $M$ be a $C^1$ $k$-manifold, and let $a \in M$. The tangent space of $M$ at $a$, denoted $T_a M$, is the set
\[
\{ \gamma'(0) \mid \gamma \text{ is a } C^1 \text{ curve from some}^1 \ (-\epsilon, \epsilon) \text{ to } \mathbb{R}^n; \ \gamma(0) = a; \text{ and the image of } \gamma \text{ is contained in } M \}
\]
NOTE: This definition² is different from Professor Simon's text in that I take the curves to be defined on OPEN intervals.

¹ Formally, for some $\epsilon > 0$, but I have to use short-hand so the set fits on one line!
² Do not worry, the proofs are all the same. I make this choice because I feel it is more intuitive. Moreover, in the next lecture, it will make one of our proofs on tangential gradients less messy.

We claimed that $T_a M$ is a subspace. But what subspace? Using the definition of a manifold, the answer's easy:

The tangent space of $M$ at $a$ is just the set of all directional derivatives of the permuted graph map evaluated at the corresponding $k$ components³ of $a$.

³ In the case that the graph is not permuted, it is just the first $k$ components.

Just use the definition of manifold to get your permuted graph map, and then differentiate. Easy. And because the directional derivatives are linear combinations of the partial derivatives, this is going to be a subspace:

Theorem. For a $C^1$ $k$-manifold $M$, let $a \in M$. The tangent space $T_a M$ is a subspace.

Proof Summary:

• WLOG assume the map is not permuted.
• We want to show
\[
T_a M = \operatorname{span}\left\{ D_1 G(a_1, a_2, \ldots, a_k), \ D_2 G(a_1, a_2, \ldots, a_k), \ \ldots, \ D_k G(a_1, a_2, \ldots, a_k) \right\}
\]

• ($\subseteq$) Let $\gamma'(0) \in T_a M$.

  - Rewrite the original curve in terms of the graph.

  - Differentiate via the chain rule.

• ($\supseteq$) Let
\[
v = c_1 D_1 G(a_1, \ldots, a_k) + c_2 D_2 G(a_1, \ldots, a_k) + \ldots + c_k D_k G(a_1, \ldots, a_k)
\]

  - Consider the curve moving in direction $c$:
\[
\gamma(t) = G(\alpha + tc),
\]
  where $\alpha$ is the vector with the first $k$ components of $a$.

  - Compute the directional derivative.
Proof: Let $a \in M$. By the definition of manifold, there is a $\delta > 0$, an open set $U \subseteq \mathbb{R}^k$, and a graph map $G$, such that
\[
G(U) = B_\delta(a) \cap M
\]
To make life easier, we assume that this map is not permuted (the proof will be virtually the same except for a relabelling of coordinates). We are going to show that
\[
T_a M = \operatorname{span}\left\{ D_1 G(a_1, a_2, \ldots, a_k), \ D_2 G(a_1, a_2, \ldots, a_k), \ \ldots, \ D_k G(a_1, a_2, \ldots, a_k) \right\}
\]
Note that when we expand the $i$-th partial, the first $k$ components are $0$ except for a $1$ in the $i$-th component:
\[
D_i G(a_1, a_2, \ldots, a_k) =
\begin{pmatrix}
0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ D_i g_1(a_1, a_2, \ldots, a_k) \\ D_i g_2(a_1, a_2, \ldots, a_k) \\ \vdots \\ D_i g_{n-k}(a_1, a_2, \ldots, a_k)
\end{pmatrix}
\quad\text{(the 1 sits in the $i$-th component)}
\]
($\subseteq$) We simply need to rewrite the original curve in terms of the graph. Let $\gamma'(0) \in T_a M$. For all $x \in B_\delta(a) \cap M$,
\[
x =
\begin{pmatrix}
x_1 \\ x_2 \\ \vdots \\ x_k \\ g_1(x_1, x_2, \ldots, x_k) \\ g_2(x_1, x_2, \ldots, x_k) \\ \vdots \\ g_{n-k}(x_1, x_2, \ldots, x_k)
\end{pmatrix}
\]
In particular,
\[
\gamma(t) =
\begin{pmatrix}
\gamma_1(t) \\ \gamma_2(t) \\ \vdots \\ \gamma_k(t) \\ g_1\big(\gamma_1(t), \gamma_2(t), \ldots, \gamma_k(t)\big) \\ g_2\big(\gamma_1(t), \gamma_2(t), \ldots, \gamma_k(t)\big) \\ \vdots \\ g_{n-k}\big(\gamma_1(t), \gamma_2(t), \ldots, \gamma_k(t)\big)
\end{pmatrix}
\]
and the derivative at $0$ is
\[
\gamma'(0) =
\begin{pmatrix}
\gamma_1'(0) \\ \gamma_2'(0) \\ \vdots \\ \gamma_k'(0) \\ \Big( g_1\big(\gamma_1(t), \gamma_2(t), \ldots, \gamma_k(t)\big) \Big)'(0) \\ \Big( g_2\big(\gamma_1(t), \gamma_2(t), \ldots, \gamma_k(t)\big) \Big)'(0) \\ \vdots \\ \Big( g_{n-k}\big(\gamma_1(t), \gamma_2(t), \ldots, \gamma_k(t)\big) \Big)'(0)
\end{pmatrix}. \tag{$\star$}
\]
We then use the chain rule to expand
\[
\Big( g_i\big(\gamma_1(t), \gamma_2(t), \ldots, \gamma_k(t)\big) \Big)'(0)
\]
as the product
\[
\Big( D_1 g_i\big(\gamma_1(0), \ldots, \gamma_k(0)\big) \;\; D_2 g_i\big(\gamma_1(0), \ldots, \gamma_k(0)\big) \;\; \ldots \;\; D_k g_i\big(\gamma_1(0), \ldots, \gamma_k(0)\big) \Big)
\begin{pmatrix}
\gamma_1'(0) \\ \gamma_2'(0) \\ \vdots \\ \gamma_k'(0)
\end{pmatrix}
\]
which gives us
\[
\gamma_1'(0)\, D_1 g_i\big(\gamma_1(0), \ldots, \gamma_k(0)\big) + \gamma_2'(0)\, D_2 g_i\big(\gamma_1(0), \ldots, \gamma_k(0)\big) + \ldots + \gamma_k'(0)\, D_k g_i\big(\gamma_1(0), \ldots, \gamma_k(0)\big)
\]
So $(\star)$ is really just
\[
\gamma'(0) =
\begin{pmatrix}
\gamma_1'(0) \\ \vdots \\ \gamma_k'(0) \\ \gamma_1'(0) D_1 g_1\big(\gamma_1(0), \ldots, \gamma_k(0)\big) + \ldots + \gamma_k'(0) D_k g_1\big(\gamma_1(0), \ldots, \gamma_k(0)\big) \\ \gamma_1'(0) D_1 g_2\big(\gamma_1(0), \ldots, \gamma_k(0)\big) + \ldots + \gamma_k'(0) D_k g_2\big(\gamma_1(0), \ldots, \gamma_k(0)\big) \\ \vdots \\ \gamma_1'(0) D_1 g_{n-k}\big(\gamma_1(0), \ldots, \gamma_k(0)\big) + \ldots + \gamma_k'(0) D_k g_{n-k}\big(\gamma_1(0), \ldots, \gamma_k(0)\big)
\end{pmatrix}.
\]
But $\gamma(0) = a$, so $\gamma_i(0) = a_i$; therefore
\[
\gamma'(0) =
\begin{pmatrix}
\gamma_1'(0) \\ \vdots \\ \gamma_k'(0) \\ \gamma_1'(0) D_1 g_1(a_1, \ldots, a_k) + \gamma_2'(0) D_2 g_1(a_1, \ldots, a_k) + \ldots + \gamma_k'(0) D_k g_1(a_1, \ldots, a_k) \\ \gamma_1'(0) D_1 g_2(a_1, \ldots, a_k) + \gamma_2'(0) D_2 g_2(a_1, \ldots, a_k) + \ldots + \gamma_k'(0) D_k g_2(a_1, \ldots, a_k) \\ \vdots \\ \gamma_1'(0) D_1 g_{n-k}(a_1, \ldots, a_k) + \gamma_2'(0) D_2 g_{n-k}(a_1, \ldots, a_k) + \ldots + \gamma_k'(0) D_k g_{n-k}(a_1, \ldots, a_k)
\end{pmatrix}
\]

(0) = a
1
_

_
1
0
.
.
.
0
D
1
g
1
(a
1
, . . . , a
k
)
D
1
g
2
(a
1
, . . . , a
k
)
.
.
.
D
1
g
nk
(a
1
, . . . , a
k
)
_

_
+a
2
_

_
0
1
.
.
.
0
D
2
g
1
(a
1
, . . . , a
k
)
D
2
g
2
(a
1
, . . . , a
k
)
.
.
.
D
2
g
nk
(a
1
, . . . , a
k
)
_

_
+. . .+a
k
_

_
0
0
.
.
.
1
D
k
g
1
(a
1
, . . . , a
k
)
D
k
g
2
(a
1
, . . . , a
k
)
.
.
.
D
k
g
nk
(a
1
, . . . , a
k
)
_

_
.
Thus,

(0) span
_

_
D
1
G(a
1
, a
2
, . . . a
k
),
D
2
G(a
1
, a
2
, . . . a
k
),
.
.
.
D
k
G(a
1
, a
2
, . . . a
k
)
_

_

($\supseteq$) Let
\[
v \in \operatorname{span}\left\{ D_1 G(a_1, a_2, \ldots, a_k), \ D_2 G(a_1, a_2, \ldots, a_k), \ \ldots, \ D_k G(a_1, a_2, \ldots, a_k) \right\}.
\]
Then
\[
v = c_1 D_1 G(a_1, \ldots, a_k) + c_2 D_2 G(a_1, \ldots, a_k) + \ldots + c_k D_k G(a_1, \ldots, a_k)
\]
for some constants $c_1, \ldots, c_k$. Remember how directional derivatives were really derivatives along curves? Same idea here! We are going to build a curve from $G$ in the direction
\[
c = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_k \end{pmatrix}.
\]
Define the curve
\[
\gamma(t) = G(\alpha + tc)
\]
where $\alpha$ is just the first $k$ components of $a$:
\[
\alpha = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{pmatrix}.
\]
Then $\gamma'(0)$ is just the directional derivative:
\[
\gamma'(0) = \lim_{t \to 0} \frac{\gamma(t) - \gamma(0)}{t} = \lim_{t \to 0} \frac{G(\alpha + tc) - G(\alpha)}{t}.
\]
But we have a major theorem on computing directional derivatives! It is just the corresponding linear combination of partial derivatives of $G$:
\[
\gamma'(0) = \underbrace{c_1 D_1 G(a_1, \ldots, a_k) + c_2 D_2 G(a_1, \ldots, a_k) + \ldots + c_k D_k G(a_1, \ldots, a_k)}_{v}
\]
Thus, $v \in T_a M$. $\blacksquare$
Immediately, we see that this subspace is $k$-dimensional:

Theorem. For a $C^1$ $k$-dimensional manifold $M$, let $a \in M$. Then $T_a M$ is $k$-dimensional.

Proof: In the preceding proof, we showed that
\[
T_a M = \operatorname{span}\left\{ D_1 G(a_1, a_2, \ldots, a_k), \ D_2 G(a_1, a_2, \ldots, a_k), \ \ldots, \ D_k G(a_1, a_2, \ldots, a_k) \right\}
\]
But if you expand the vectors in the span,
\[
\begin{pmatrix}
1 \\ 0 \\ \vdots \\ 0 \\ D_1 g_1(a_1, \ldots, a_k) \\ D_1 g_2(a_1, \ldots, a_k) \\ \vdots \\ D_1 g_{n-k}(a_1, \ldots, a_k)
\end{pmatrix},
\begin{pmatrix}
0 \\ 1 \\ \vdots \\ 0 \\ D_2 g_1(a_1, \ldots, a_k) \\ D_2 g_2(a_1, \ldots, a_k) \\ \vdots \\ D_2 g_{n-k}(a_1, \ldots, a_k)
\end{pmatrix},
\ \ldots,\ 
\begin{pmatrix}
0 \\ 0 \\ \vdots \\ 1 \\ D_k g_1(a_1, \ldots, a_k) \\ D_k g_2(a_1, \ldots, a_k) \\ \vdots \\ D_k g_{n-k}(a_1, \ldots, a_k)
\end{pmatrix},
\]
we have $k$ linearly independent vectors. $\blacksquare$
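A small numerical companion (not from the text): for the upper unit sphere, viewed as the graph of $g(x_1, x_2) = \sqrt{1 - x_1^2 - x_2^2}$, the two partials of the graph map span $T_a M$, and both turn out orthogonal to the sphere's radius at $a$, as one would expect. The specific point is an assumption for illustration.

```python
import numpy as np

g = lambda x: np.sqrt(1.0 - x[0] ** 2 - x[1] ** 2)    # the upper hemisphere as a graph
G = lambda x: np.array([x[0], x[1], g(x)])            # its graph map R^2 -> R^3

alpha = np.array([0.3, 0.4])                          # first k = 2 components of the point a
a = G(alpha)

h = 1e-6
D1G = (G(alpha + np.array([h, 0])) - G(alpha - np.array([h, 0]))) / (2 * h)   # partials of the graph map
D2G = (G(alpha + np.array([0, h])) - G(alpha - np.array([0, h]))) / (2 * h)

# T_aM = span{D1G, D2G}; both basis vectors are tangent to the sphere (orthogonal to the radius a).
print(np.round(D1G, 4), np.round(D2G, 4))
print(np.isclose(D1G @ a, 0, atol=1e-4), np.isclose(D2G @ a, 0, atol=1e-4))
```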
New Notation

Symbol: $G(U)$. Reading: the graph of the function $g$ on $U$. Example: $G(U) = B_\delta(a) \cap M$. Example Translation: the intersection of the ball of radius $\delta$ around $a$ and $M$ is the graph of the function $g$ on $U$.

Symbol: $G(x)$. Reading: the graph map $G$ of the function $g$. Example: $D_1 G(a)$. Example Translation: the first partial derivative of the graph map $G$ evaluated at $a$.

Symbol: $P(G(U))$. Reading: a permuted graph of the function $g$ on $U$. Example: $P(G(U)) \subseteq M$. Example Translation: the permuted graph of the function $g$ on $U$ is contained in $M$.

Symbol: $T_a M$. Reading: the tangent space of the manifold $M$ at $a$. Example: $\gamma'(0) \in T_a M$. Example Translation: the vector $\gamma'(0)$ is in the tangent space of the manifold $M$ at $a$.
Lecture 27

Living La Vida Lagrangian

And He asked it, "What is thy name?"
And it answered, saying,
"My name is Lagrange Multiplier Theorem:
for we are many concepts."
- THE BOOK

Goals: Today, you will use the culmination of all your Math 51H knowledge to prove the beautiful Lagrange Multiplier Theorem.

27.1 The Engineering Perspective

Often, it is not enough to directly find a function's extrema. Instead, you will have to minimize (or maximize) a function subject to a series of constraints. By adding these constraints, the problem becomes a lot more difficult. For example, consider the problem

Minimize $f(x, y) = x^2 + y^2$

[Figure: the paraboloid $x^2 + y^2$ with its minimum marked at the origin]

The solution is straightforward: calculate the gradient
\[
\begin{pmatrix} 2x \\ 2y \end{pmatrix}
\]
and set it to $\vec{0}$ to solve for the critical point
\[
a = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.
\]
Then we can verify by the second derivative test that this is indeed a minimum. Easy.

But what if I added the restriction that the minimum must lie along the line
\[
x + y = 2?
\]
So now we are only considering the function along a single slice. Notice that the minimum of $f(x, y)$ on this line is not a critical point of the original function! The typical method of setting the gradient equal to $\vec{0}$ will not work!

Instead, we could rewrite the constraint as
\[
y = 2 - x
\]
and substitute into $f$ to get a new function
\[
x^2 + (2 - x)^2
\]
and use 1D calculus to solve for the minimum:
\[
a = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\]
But this is just a simple case! What if I gave a more complicated scenario in which you cannot substitute (or at least, the substitution is not obvious):

Minimize $f(x, y, z) = 3xy - 3z$ restricted to the unit sphere $x^2 + y^2 + z^2 = 1$

In fact, we could have more than one constraint!

Minimize $f(x, y, z) = xy + yz$ subject to the constraints
\[
x + 2y = 6
\qquad
x - 3z = 0
\]
Necessary condition of Lagrange Multipliers, Single Constraint:
Suppose we want to maximize (or minimize)
f(x)
subject to the constraint
g(x) = 0.
For any solution a, if g(a) =

0, the gradient of f at a is parallel to the gradient of the constraint


at a i.e.
f(a) = g(a)
for some .
Be careful! I do not want you to think that you can mindlessly solve
f(a) = g(a)
and youre done. This is a necessary condition:
Math Mantra: IF a solution exists, it MUST satisfy the NECESSARY condition.
BUT a solution NEED NOT EXIST.
Heres an analogy: suppose someone died. You can get information on what a murderer must look
like. This gives you a smaller suspect list. But it is entirely possible that the man died of natural
causes!
So the moral of the story is to
Use Lagranges multipliers to get additional information on what the extrema must look like.
Verify your guess is actually an extremum.
Lets apply this condition to our rst example:
Example: Apply Lagrange's condition to the problem,

Minimize $f(x, y) = x^2 + y^2$ subject to the constraint $x + y = 2$

Rewrite the constraint as
\[
g(x, y) = 0
\]
where
\[
g(x, y) = x + y - 2.
\]
Then, calculate the gradients:
\[
\nabla f(x, y) = \begin{pmatrix} 2x \\ 2y \end{pmatrix}
\qquad
\nabla g(x, y) = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.
\]
Lagrange's condition tells us that a solution $a$ must satisfy
\[
\begin{pmatrix} 2a_1 \\ 2a_2 \end{pmatrix} = \begin{pmatrix} \lambda \\ \lambda \end{pmatrix}
\]
so
\[
a_1 = a_2
\]
Again, does this tell us that a minimum exists? NOPE! But it does give us the extra information that, if a minimum did exist, then it would have to satisfy the condition $a_1 = a_2$.
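A quick sanity check (not from the text): the point $(1, 1)$ found earlier by substitution does satisfy both the constraint and Lagrange's condition, here with multiplier $\lambda = 2$.

```python
import numpy as np

grad_f = lambda x, y: np.array([2.0 * x, 2.0 * y])
grad_g = lambda x, y: np.array([1.0, 1.0])

a = (1.0, 1.0)                                       # the minimizer of x^2 + (2 - x)^2
lam = 2.0                                            # since (2, 2) = 2 * (1, 1)
print(a[0] + a[1] == 2.0,                            # the constraint holds
      np.allclose(grad_f(*a), lam * grad_g(*a)))     # and grad f = lambda * grad g
```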
Let's try a more difficult example:

Example: Apply Lagrange's condition to the problem,

Minimize $f(x, y, z) = 3xy - 3z$ restricted to the unit sphere $x^2 + y^2 + z^2 = 1$

Rewrite the constraint as
\[
g(x, y, z) = 0
\]
where
\[
g(x, y, z) = x^2 + y^2 + z^2 - 1
\]
Then,
\[
\nabla f(x, y, z) = \begin{pmatrix} 3y \\ 3x \\ -3 \end{pmatrix}
\qquad
\nabla g(x, y, z) = \begin{pmatrix} 2x \\ 2y \\ 2z \end{pmatrix}.
\]
For a solution $a$,
\[
\nabla f(a_1, a_2, a_3) = \lambda \nabla g(a_1, a_2, a_3)
\]
yields the system:
\[
\begin{aligned}
3a_2 &= 2\lambda a_1 \\
3a_1 &= 2\lambda a_2 \\
-3 &= 2\lambda a_3
\end{aligned}
\tag{$\star$}
\]
Consider the case $a_1 = 0$. Plugging back into the first equation in $(\star)$, we get $a_2 = 0$. This leaves us with the equation
\[
-3 = 2\lambda a_3.
\]
But $\lambda$ can be any non-zero number, so $a_3$ can be any non-zero number. In this case, the solution must be of the form
\[
\begin{pmatrix} 0 \\ 0 \\ a_3 \end{pmatrix}
\]
where $a_3 \neq 0$.

Now consider the case $a_1 \neq 0$. Then by the second equation in $(\star)$, $a_2 \neq 0$. Isolating $\lambda$ in the first two equations in $(\star)$ and then equating the two resulting expressions yields
\[
\frac{3a_2}{2a_1} = \frac{3a_1}{2a_2}
\]
or
\[
a_1^2 = a_2^2,
\]
which is simply
\[
a_1 = \pm a_2.
\]
And substituting into the square of the first equation
\[
9a_2^2 = 4\lambda^2 a_1^2
\]
we get
\[
9a_1^2 = 4\lambda^2 a_1^2,
\]
implying
\[
\lambda = \pm\frac{3}{2}.
\]
Plugging this back into the third equation tells us
\[
a_3 = -\frac{3}{2\lambda} = \mp 1.
\]
Overall, Lagrange's condition gives us the additional information that a solution must either be of the form
\[
\begin{pmatrix} 0 \\ 0 \\ a_3 \end{pmatrix}
\]
where $a_3 \neq 0$, or satisfy
\[
a_1 = \pm a_2, \qquad a_3 = \mp 1,
\]
where $a_1 \neq 0$ and $a_2 \neq 0$.
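As an illustration only (not part of the text), the candidates of the first form with $a_3 = \pm 1$ give $f = \mp 3$, and a crude random sample over the sphere never beats $-3$, consistent with the constrained minimum sitting at $(0, 0, 1)$:

```python
import numpy as np

f = lambda p: 3 * p[0] * p[1] - 3 * p[2]

print(f((0, 0, 1)), f((0, 0, -1)))                        # -3 and 3

rng = np.random.default_rng(0)
pts = rng.normal(size=(100_000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)         # random points on x^2 + y^2 + z^2 = 1
print((3 * pts[:, 0] * pts[:, 1] - 3 * pts[:, 2]).min())  # never below -3
```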
The extension to more than one constraint is simple. Stare at Lagrange's condition. Another way to interpret
\[
\nabla f(a) = \lambda \nabla g(a)
\]
is that the gradient of $f$ is a linear combination of a single constraint gradient. Generally, the gradient of $f$ is a linear combination of all constraint gradients:

Necessary condition of Lagrange Multipliers, Multiple Constraints:
Suppose we want to maximize (or minimize)
\[
f(x)
\]
subject to the constraints
\[
g_1(x) = 0, \quad g_2(x) = 0, \quad \ldots, \quad g_k(x) = 0
\]
For any solution $a$, if
\[
\nabla g_1(a), \nabla g_2(a), \ldots, \nabla g_k(a)
\]
are linearly independent, then the gradient of $f$ at $a$ is a linear combination of the gradients of the constraints at $a$:
\[
\nabla f(a) = \lambda_1 \nabla g_1(a) + \lambda_2 \nabla g_2(a) + \ldots + \lambda_k \nabla g_k(a)
\]
for some $\lambda_1, \lambda_2, \ldots, \lambda_k$.
Ok, this is all great for engineers. But as math people, we want to know why this condition is true.
Amazingly, with the mathematics we have developed about manifolds and projections, this is going
to be easy:
The maximum (or minimum) of $f$ restricted to these constraints will occur at a critical point on a manifold.

Specifically,

• We will define a new type of extremum and critical point that is relative to some manifold.

• At any critical point $a$ of $f$ (in this new sense, relative to some manifold $M$), $\nabla f(a) \in (T_a M)^{\perp}$.

• Apply this to a particular manifold, namely one built from our constraints. In particular, the solution of our minimization problem is a critical point $a$ on this manifold. Thus, $\nabla f(a) \in (T_a M)^{\perp}$.

• $\nabla g_1(a), \nabla g_2(a), \ldots, \nabla g_k(a)$ are also in $(T_a M)^{\perp}$.

• Now here's the kicker. Since
\[
\nabla g_1(a), \nabla g_2(a), \ldots, \nabla g_k(a)
\]
are linearly independent and $\dim\big((T_a M)^{\perp}\big) = k$, these vectors form a basis. Therefore we can write $\nabla f(a)$ as
\[
\nabla f(a) = \lambda_1 \nabla g_1(a) + \lambda_2 \nabla g_2(a) + \ldots + \lambda_k \nabla g_k(a).
\]
AWESOME!
27.2 The Mathematics of Lagrange Multipliers

We want to convert the problem of optimizing a function over a set of constraints into the language of manifolds. First, we need to define a notion of maximum and minimum on a manifold:

Definition. We say $f|M$ has a local maximum at $a$ if there exists an $\epsilon > 0$ such that for all
\[
x \in M \cap B_\epsilon(a),
\]
we have
\[
f(x) \le f(a).
\]
Likewise, we say $f|M$ has a local minimum at $a$ if there exists an $\epsilon > 0$ such that for all
\[
x \in M \cap B_\epsilon(a),
\]
we have
\[
f(a) \le f(x).
\]
This is exactly the normal definition of local max and min except we are only considering points on the manifold.

And like normal maxima and minima, we want the gradient to be $\vec{0}$ at these points (in other words, we want points of maximum and minimum to always be critical points). Then, whatever property we derive about critical points will apply to the points of extrema.

But this will involve a new definition of critical point relative to a manifold $M$. And that requires a new type of gradient relative to some manifold. To do this, we will use the key fact, from last lecture, that
\[
T_a M
\]
is a subspace. Our new gradient will be the ordinary gradient projected onto this subspace.

Definition. The tangential gradient of $f$ relative to a manifold $M$ is
\[
\nabla_M f(a) = P_{T_a M}\big( \nabla f(a) \big)
\]
Naturally, a critical point relative to a manifold $M$ is defined as

Definition. The point $a \in M$ is a critical point of $f|M$ if
\[
\nabla_M f(a) = \vec{0}.
\]
Before we can prove that a maximum (or minimum) can only occur at a critical point, we will need to prove a property about the tangential gradient. Recall that the directional derivative is
\[
D_v f(a) = v \cdot \nabla f(a).
\]
We can prove that if the direction $v$ is in $T_a M$, then we can replace $\nabla f(a)$ with $\nabla_M f(a)$:
\[
D_v f(a) = v \cdot \nabla_M f(a)
\]
This will be an easy consequence of the projection properties.

In case you forgot projections (Lecture 11), here are some quick sparknotes:

• For a subspace $V$, the projection map $P_V(x)$ maps $x$ into $V$ so that
\[
x = \underbrace{P_V(x)}_{\in V} + \underbrace{x - P_V(x)}_{\in V^{\perp}}
\]
In particular, for $v \in V$,
\[
P_V(v) = v.
\]
Also,
\[
P_V(x) = \vec{0}
\]
implies
\[
x \in V^{\perp}.
\]

• A projection map can be swapped across a dot product:
\[
x \cdot P_V(y) = P_V(x) \cdot y.
\]

Now the lemma!

Lemma. If $v \in T_a M$ and $f : \mathbb{R}^n \to \mathbb{R}$ is a $C^1$ function, then
\[
D_v f(a) = v \cdot \nabla_M f(a)
\]
Proof: Consider
\[
v \cdot \nabla_M f(a)
\]
By definition, this is just
\[
v \cdot P_{T_a M}\big( \nabla f(a) \big).
\]
Swapping the projection map across the dot product yields
\[
P_{T_a M}(v) \cdot \nabla f(a)
\]
But $v \in T_a M$, so $P_{T_a M}(v) = v$. Therefore our dot product is the same as
\[
\underbrace{P_{T_a M}(v)}_{v} \cdot \nabla f(a) = D_v f(a). \qquad \blacksquare
\]
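The following numerical sketch is not from the text. It assumes the upper-hemisphere graph from the previous lecture and an arbitrary linear $f$, builds $P_{T_a M}$ explicitly, and checks the lemma $D_v f(a) = v \cdot \nabla_M f(a)$ for a tangent direction $v$:

```python
import numpy as np

a = np.array([0.3, 0.4, np.sqrt(0.75)])                    # a point on the upper unit sphere (assumed)
grad_f = np.array([1.0, 2.0, 3.0])                         # gradient of an assumed f(x) = x1 + 2*x2 + 3*x3

# Basis of T_aM from the graph-map partials of g(x1, x2) = sqrt(1 - x1^2 - x2^2):
B = np.column_stack([[1, 0, -a[0] / a[2]],
                     [0, 1, -a[1] / a[2]]])                # columns D1G and D2G

P = B @ np.linalg.inv(B.T @ B) @ B.T                       # projection onto T_aM
grad_M = P @ grad_f                                        # the tangential gradient

v = B @ np.array([2.0, -1.0])                              # an arbitrary v in T_aM
print(np.isclose(v @ grad_f, v @ grad_M))                  # D_v f(a) = v . grad_M f(a)  -> True
```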
Now we can prove the analogous extreme value theorem for manifolds. Like all of our extrema proofs, we only need to consider the case of finding a maximum.

Lemma. If $f|M$ has a local maximum at $a \in M$, then $a$ is a critical point of $f|M$.

Proof Summary:

• Consider $f$ along a curve $\gamma$ with $\gamma'(0) = v$:
\[
f\big(\gamma(t)\big).
\]

• Use the continuity of $\gamma$ to show that $f(\gamma(t))$ has a local maximum at $t = 0$.

• Conclude that
\[
\left. \frac{d}{dt} f\big(\gamma(t)\big) \right|_{t=0} = 0
\]

• Use the Chain Rule to expand the left-hand side:
\[
v \cdot \nabla f(a) = 0
\]

• Apply the preceding lemma:
\[
v \cdot \nabla_M f(a) = 0
\]

• Choose $v = \nabla_M f(a)$ to conclude that $\nabla_M f(a) = \vec{0}$.

Proof: The strategy is to consider $f$ restricted to a $C^1$ curve $\gamma$ on the manifold. Then, by continuity of $\gamma$, we can show that the single-variable function $f(\gamma(t))$ has a local maximum at $t = 0$, so its derivative must equal $0$ at $t = 0$.

Let $v \in T_a M$. Then we know there exists some $\epsilon > 0$ and some $C^1$ curve
\[
\gamma : (-\epsilon, \epsilon) \to M
\]
such that $\gamma(0) = a$ and $\gamma'(0) = v$. Consider $f$ along this curve:
\[
f\big(\gamma(t)\big)
\]

[Figure: the curve $\gamma$ on $M$ through $a$ and the values of $f$ along it, peaking at $f(a)$]

We will show that this function $f(\gamma(t))$ has a local maximum at $t = 0$. Since $f|M$ has a local maximum at $a$, there is an $\epsilon > 0$ such that
\[
f(a) \ge f(x)
\]
for all
\[
x \in M \cap B_\epsilon(a).
\]
By the continuity of $\gamma$, there is a $\delta \in (0, \epsilon)$ such that if
\[
|t| < \delta
\]
then
\[
\|\gamma(t) - \underbrace{\gamma(0)}_{a}\| < \epsilon
\]
and since $\gamma(t) \in M$,
\[
\gamma(t) \in M \cap B_\epsilon(a).
\]
Therefore,
\[
f(\gamma(0)) \ge f(\gamma(t))
\]
for all
\[
t \in (-\delta, \delta),
\]
so $f(\gamma(t))$ has a local maximum at $t = 0$. By a single-variable calculus theorem proved in a previous lecture,
\[
\left. \frac{d}{dt} f\big(\gamma(t)\big) \right|_{t=0} = 0.
\]
Expanding the left-hand side by the chain rule, this says
\[
Df\big(\gamma(0)\big)\, \gamma'(0) = 0
\]
which is really just
\[
\underbrace{\gamma'(0)}_{v} \cdot \nabla f(\underbrace{\gamma(0)}_{a}) = 0
\]
In fact, since $v \in T_a M$, we can use our lemma to replace the usual gradient with the tangential gradient:
\[
v \cdot \nabla_M f(a) = 0.
\]
But we know that $\nabla_M f(a) \in T_a M$ since it is a projection onto $T_a M$. Since the vector $v \in T_a M$ was arbitrary, choose in particular $v = \nabla_M f(a)$. This gives us
\[
\nabla_M f(a) \cdot \nabla_M f(a) = 0.
\]
Therefore,
\[
\|\nabla_M f(a)\|^2 = 0,
\]
allowing us to conclude
\[
\nabla_M f(a) = \vec{0}. \qquad \blacksquare
\]
Now we can apply this theorem to a particular manifold, namely the one formed from our constraint set (with the additional linear independence condition):

Theorem. For $C^1$ functions $g_1, g_2, \ldots, g_k$ from $\mathbb{R}^n$ to $\mathbb{R}$, the set
$$M = \left\{ x \in \mathbb{R}^n \;\middle|\; g_1(x) = g_2(x) = \ldots = g_k(x) = 0 \text{ and } \nabla g_1(x), \nabla g_2(x), \ldots, \nabla g_k(x) \text{ are linearly independent} \right\}$$
is an $(n-k)$-manifold.

Unfortunately, we will not be able to prove that this set is actually a manifold until the final week of this course.
Before proving the Lagrange Multiplier Theorem we formally state one last lemma, which is really just a special case of the previous lemma:

Lemma. Let $g$ be a $C^1$ function from a manifold $M$ to $\mathbb{R}$. If for every $x \in M$,
$$g(x) = 0,$$
then for every $x \in M$,
$$\nabla_M g(x) = \vec{0}.$$

Proof: Let $x \in M$ be arbitrary. Then
$$g(x) = g(y) = 0$$
for all $y \in M$. In particular, $g = g|M$ (since the domain of $g$ is already the manifold $M$) has a local maximum at $x$. By the previous lemma,
$$\nabla_M g(x) = \vec{0}.$$
We finally have the full mechanics we need to prove the Lagrange Multiplier Theorem:

Theorem (Lagrange Multiplier Theorem). Let $g_1, g_2, \ldots, g_k, f$ be $C^1$ functions from $\mathbb{R}^n$ to $\mathbb{R}$ and consider the manifold
$$M = \left\{ x \in \mathbb{R}^n \;\middle|\; g_1(x) = g_2(x) = \ldots = g_k(x) = 0 \text{ and } \nabla g_1(x), \nabla g_2(x), \ldots, \nabla g_k(x) \text{ are linearly independent} \right\}.$$
If $a \in M$ is a critical point of $f|M$ (so in particular, if $f|M$ has an extremum at $a$), then
$$\nabla f(a) = \lambda_1 \nabla g_1(a) + \lambda_2 \nabla g_2(a) + \ldots + \lambda_k \nabla g_k(a)$$
for some $\lambda_1, \lambda_2, \ldots, \lambda_k$.
Proof Summary:

• By criticality and the previous lemma, we know
$$\nabla_M f(a) = \nabla_M g_1(a) = \nabla_M g_2(a) = \ldots = \nabla_M g_k(a) = \vec{0}.$$
• Expand the tangential gradient definition. By projection map properties,
$$\nabla f(a),\, \nabla g_1(a),\, \nabla g_2(a),\, \ldots,\, \nabla g_k(a) \in (T_aM)^\perp.$$
• By our assumed theorem, $M$ is an $(n-k)$-manifold, and so $\dim(T_aM) = n - k$.
• By a Corollary of the Rank-Nullity Theorem, $\dim\big((T_aM)^\perp\big) = k$.
• $\nabla g_1(a), \nabla g_2(a), \ldots, \nabla g_k(a)$ are $k$ linearly independent vectors in a $k$-dimensional subspace, and thus form a basis.
Proof: Since $a$ is a critical point of $f|M$, we know
$$\nabla_M f(a) = \vec{0}.$$
By definition, this is the projection
$$P_{T_aM}\big(\nabla f(a)\big) = \vec{0}.$$
Moreover, because the projection of this vector is $\vec{0}$, that vector must be in the orthogonal space:
$$\nabla f(a) \in (T_aM)^\perp.$$
Since the previous lemma also gives us
$$\nabla_M g_1(a) = \vec{0}, \quad \nabla_M g_2(a) = \vec{0}, \quad \ldots, \quad \nabla_M g_k(a) = \vec{0},$$
we can apply the same reasoning to get
$$\nabla g_1(a) \in (T_aM)^\perp, \quad \nabla g_2(a) \in (T_aM)^\perp, \quad \ldots, \quad \nabla g_k(a) \in (T_aM)^\perp.$$
But by a Corollary of the Rank-Nullity Theorem, $(T_aM)^\perp$ has dimension
$$n - \underbrace{(n-k)}_{\dim(T_aM)} = k.$$
Since we have $k$ linearly independent vectors in a $k$-dimensional subspace,
$$\nabla g_1(a), \nabla g_2(a), \ldots, \nabla g_k(a)$$
is a basis for $(T_aM)^\perp$. Then write $\nabla f(a)$ as a linear combination of these basis vectors:
$$\nabla f(a) = \lambda_1 \nabla g_1(a) + \lambda_2 \nabla g_2(a) + \ldots + \lambda_k \nabla g_k(a)$$
for some constants $\lambda_1, \ldots, \lambda_k$. INCREDIBLE!

You're not going to be tested on the proof of the Lagrange Multiplier Theorem. That would be absurd. But you will probably be asked to use it to solve an optimization problem on the midterm. But I really want you to understand this proof! It is the culmination of everything you've done: directional derivatives, projections, manifolds, curves, and even linear algebra (basis and Rank-Nullity). And if you completely understand this proof, then you are more than ready for the second midterm.
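As a concrete illustration of how the theorem is used in practice (the specific $f$, $g$, and library choice are my own, not from the lecture), the sketch below solves "maximize $f(x, y) = x + y$ subject to $g(x, y) = x^2 + y^2 - 1 = 0$" by solving the Lagrange conditions $\nabla f = \lambda \nabla g$ together with $g = 0$ using sympy.

    import sympy as sp

    x, y, lam = sp.symbols('x y lambda', real=True)
    f = x + y
    g = x**2 + y**2 - 1

    # Lagrange conditions: grad f = lambda * grad g, together with the constraint g = 0
    eqs = [sp.diff(f, x) - lam * sp.diff(g, x),
           sp.diff(f, y) - lam * sp.diff(g, y),
           g]
    solutions = sp.solve(eqs, [x, y, lam], dict=True)
    for s in solutions:
        print(s, 'f =', f.subs(s))
    # Two critical points: (sqrt(2)/2, sqrt(2)/2) with f = sqrt(2)  (the maximum)
    # and (-sqrt(2)/2, -sqrt(2)/2) with f = -sqrt(2)                (the minimum)

The theorem only guarantees that extrema are among these critical points; deciding which one is the maximum still requires comparing the values of $f$.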
New Notation

Symbol: $\nabla_M f(a)$
Reading: the tangential gradient, relative to the manifold $M$, of $f$ evaluated at $a$
Example: $\nabla_M f(a) = P_{T_aM}\big(\nabla f(a)\big)$
Example Translation: the tangential gradient relative to manifold $M$ of $f$ evaluated at $a$ is the projection of the ordinary gradient (of $f$ evaluated at $a$) onto the tangent space $T_aM$.
Midterm II: Conquering Calculus

Dear friends, surely we are not unlearned in evils.
This is no greater evil now than it was when
manifolds had us cooped up in the math library,
but even there, by courage, counsel, and intelligence, we survived.
I think that all this will be remembered some day too.
Then do as I say, let us all be won over.
-The (2n + 1)yssey

Round 2

After the first midterm, you probably noticed that the difficulty has increased tenfold. This is because you are expected to have a solid single variable Calculus background before entering 51H. And by solid background, I mean a theoretical one, not just a 5 on the Calculus BC!

But many of you have never seen a rigorous presentation of sequences, series, or limits. Other than direct calculation, you've never abstractly applied the Fundamental Theorem of Calculus, integral inequalities, or chain rule. And truth be told, $\epsilon$-$\delta$ really can't be taught in a day: I know, I've tried.

But you're still here. So that means I'm obligated to help you through.

First and foremost, if you are still using flashcards to study mathematics, then grab your favorite brand of Vodka. It could be Stoli's, Smirnoff, or even an awful Brita-filtered batch of Popov. Pour it all over the cards and ignite.[1]

If you haven't gotten the point,

Math Mantra: Math is NOT about memorizing mindless facts.

In almost all of my philosophy and engineering classes, I found that I could get through by memorizing formulas[2] and regurgitating facts. But in math, this doesn't cut it! Because you aren't just building your knowledge, you are changing the way you think.

[1] Alcohol will easily ignite if it is 100 proof or above. Coincidentally, this is the number of proofs you will see in the H-Series.
[2] Or even programming them into a TI-89.
Here is the correct way to study:

First,

UNDERSTAND THE PROOF OF EVERY THEOREM IN THIS BOOK.

This is a necessary but not a sufficient condition to passing the midterm.

Then, open up Chapter 13 and head to the first theorem:

• Read the theorem statement.
• Close the book.
• Start re-deriving the proof.
• If you get stuck, glance at the proof summary for a hint.
• Close the book again.
• Rinse and repeat.

Do this for every proof in the book. This is the only way you can be sure that you've mastered the material. Always remember Professor Simon's saying,

The human capacity for self-delusion is limitless.

Here are the topics you need to have mastered:

Week 4

1. Do you know the definition of a continuous function? Can you prove a function is continuous?
2. Do you know the basic properties of continuous functions (maxima and minima, mapping of sequences, boundedness)?
3. Do you know the Mean Value Theorem and how to apply it to multivariable functions?
4. Can you define open and closed sets? Can you verify a set is open or closed?
5. Can you prove a set is open if and only if its complement is closed?
6. Can you prove open sets are closed under arbitrary unions?
7. Can you prove a sequence of vectors converges if and only if their component sequences converge?
8. Can you prove multivariable Bolzano-Weierstrass?
9. Can you define directional and partial derivatives? Can you calculate a directional derivative? Do you know the general definition for differentiability at a point?
10. Can you prove differentiability implies continuity?

Week 5

1. Do you understand the formal definition of a series? Can you define absolute convergence?
2. Can you prove all the series convergence tests (N-th term, geometric, integral)?
3. Can you prove that an absolutely convergent series can be rearranged?
4. Do you know how to calculate the Jacobian and the gradient?
5. Can you prove that differentiability implies all directional derivatives exist and can be calculated as a simple matrix product?
6. Can you prove that the matrix in the differentiability definition must be the Jacobian?
7. Can you prove all gradient properties?
8. Do you know how to prove that, if all directional derivatives exist and are continuous, then the function is differentiable?
9. Do you understand how to apply the chain rule?

Week 6

1. Can you prove that the convergence set of a power series is always an interval?
2. Can you state the Change of Base-Point Theorem?
3. Can you prove mixed partials commute?
4. Do you understand quadratic forms and all basic quadratic form properties?
5. Do you know the statement for the second derivative test?
6. Can you apply the Second Derivative Test?
7. Can you define the length of a curve? Can you prove that the length of a $C^1$ curve always exists?

Week 7

1. Do you know how to prove that a differentiated power series converges to $f'(x)$?
2. Do you know the statement of Taylor's Theorem?
3. Can you state and prove a convergence condition on Taylor Series?
4. Do you know the definition of a manifold and the definition of a tangent space? Can you calculate the tangent space?
5. Can you prove that a tangent space is actually a subspace?
6. Can you define the tangential gradient? Can you state the Lagrange Multiplier Theorem?
7. Can you apply the Lagrange Multiplier Theorem?
8. Can you prove tangential gradient properties?

Final Advice

I repeat the same advice I gave for Midterm 1:

1. DO THE PRACTICE TEST.
The questions on the real test will be analogous to the practice tests.

2. MAKE SURE YOUR DEFINITIONS ARE EXACT.
Professor Simon is pretty relentless about this: if the question asks to give a definition, the statement must be flawless.

3. GET ACCUSTOMED TO WORKING AT 7PM.
The test is at 7PM. I know. It sucks. So do practice questions (or even the practice test) at this time.

And again,

Good luck, and may the odds be ever in your favor.
Lecture 28

Playing with Permutations

"I want a clean cup," interrupted the Hatter: "let's all move one place on."
He moved on as he spoke, and the Dormouse followed him:
the March Hare moved into the Dormouse's place,
and Alice rather unwillingly took the place of the March Hare.
-10/6

Goals: We are going to prove that any sequence of transpositions that corrects the ordering of a permutation has the same parity. Even though this seems unimportant, this will be a key step in the derivation of the determinant formula.
28.1 Insight on Invariance

Consider the following problem (SUMaC 2013):

Pennies are placed on an 8 x 8 checkerboard in an alternating pattern of heads and tails. You are allowed to make moves where in each move you turn over exactly two pennies that lie next to each other in the same row or column. Can you take a sequence of moves that leaves just one penny face up?

[Figure: an 8 x 8 checkerboard of pennies, alternating between heads (H) and tails (T).]

The answer is no. In the initial position, the total number of heads is even. Inductively, on the i-th move, we can make one of these types of flips:
$$HT \to TH, \qquad TH \to HT, \qquad HH \to TT, \qquad TT \to HH.$$
Regardless of the move, the total number of heads is still even. Therefore a situation with only one head is impossible.

The moral of the story:

Math Mantra: Look for some FIXED NON-VARYING QUANTITY.

We call such a quantity an invariant. In this case, our invariant was the parity of the number of heads. Regardless of what moves we made, the parity of heads remained the same (even). You are looking past the smoke and mirrors and latching onto some known truth.

So how does this apply to Math 51H?

One of the most important quantities in mathematics and engineering is the determinant. It has tons of applications:

• Calculating volume.
• Changing variables in integration.
• Checking invertibility of a matrix.
• Proving that every natural number can be written as a sum of four squares.[1]

To derive the determinant formula, we have to perform a number of swaps. And like the chessboard problem, we have an unlimited number of choices for swaps. But it turns out that only the parity of the swaps matters. This will be the lynchpin in our proof of determining the determinant.

[1] My favorite application.

28.2 Permutations

Back in the day, you were asked the problem:

How many ways can you rearrange ABC?
You made a little chart

ABC  BAC  CAB
ACB  BCA  CBA

Now that you are older, you realize it is far more kosher to use numbers and n-tuples instead of letters and concatenation:
$$(1, 2, 3) \quad (2, 1, 3) \quad (3, 1, 2)$$
$$(1, 3, 2) \quad (2, 3, 1) \quad (3, 2, 1).$$
Generally,

Definition. A permutation on $(1, 2, \ldots, n)$ is an n-tuple
$$(i_1, i_2, \ldots, i_n)$$
such that each number between 1 and $n$ appears exactly once:
$$\{i_1, i_2, \ldots, i_n\} = \{1, 2, \ldots, n\}.$$

Now consider the following scenario: you have the 7 Harry Potter books on a shelf in some order,
$$(4, 2, 1, 5, 6, 3, 7),$$
and you want to correct the ordering. But you are restricted to only swapping two books at a time. Is it possible to correct the ordering?

Of course! Just consider the sequence
$$(4, 2, 1, 5, 6, 3, 7)$$
$$(4, 2, 1, 5, 3, 6, 7)$$
$$(1, 2, 4, 5, 3, 6, 7)$$
$$(1, 2, 3, 5, 4, 6, 7)$$
$$(1, 2, 3, 4, 5, 6, 7)$$

Because we're math people, we like to give swaps a more formal name. We also like to think of a swap as a function on permutations.

Definition. A transposition $\tau_{j,k}$ is a function that maps permutations to permutations by swapping the values in the $j$ and $k$ positions:
$$\tau_{j,k}(i_1, i_2, \ldots, i_j, \ldots, i_k, \ldots, i_n) = (i_1, i_2, \ldots, i_k, \ldots, i_j, \ldots, i_n).$$

With this definition, we can precisely describe the above sequence:
$$(4, 2, 1, 5, 6, 3, 7) = (4, 2, 1, 5, 6, 3, 7)$$
$$(\tau_{5,6})(4, 2, 1, 5, 6, 3, 7) = (4, 2, 1, 5, 3, 6, 7)$$
$$(\tau_{1,3} \circ \tau_{5,6})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 4, 5, 3, 6, 7)$$
$$(\tau_{3,5} \circ \tau_{1,3} \circ \tau_{5,6})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 5, 4, 6, 7)$$
$$(\tau_{4,5} \circ \tau_{3,5} \circ \tau_{1,3} \circ \tau_{5,6})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7)$$

Note that we are applying a function, so composition is on the left.

Also notice that our choice of transpositions could have been smarter. We could have been completely methodical and gone from left to right, correcting one place at a time:
$$(4, 2, 1, 5, 6, 3, 7) = (4, 2, 1, 5, 6, 3, 7)$$
$$(\tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 4, 5, 6, 3, 7)$$
$$(\tau_{3,6} \circ \tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 5, 6, 4, 7)$$
$$(\tau_{4,6} \circ \tau_{3,6} \circ \tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 6, 5, 7)$$
$$(\tau_{5,6} \circ \tau_{4,6} \circ \tau_{3,6} \circ \tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7)$$

As a simple exercise on induction we can prove,

Theorem. For any permutation
$$(i_1, i_2, \ldots, i_n)$$
there exists a sequence of transpositions $\tau_1, \tau_2, \ldots, \tau_k$ such that
$$(\tau_k \circ \tau_{k-1} \circ \ldots \circ \tau_1)(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n).$$
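Here is a small Python sketch (my own, not from the text) of the methodical left-to-right strategy above; it reproduces the four transpositions $\tau_{1,3}, \tau_{3,6}, \tau_{4,6}, \tau_{5,6}$ of the second sequence. The function names are illustrative.

    def transpose(perm, j, k):
        """Return the permutation with the values in positions j and k swapped (1-indexed)."""
        p = list(perm)
        p[j - 1], p[k - 1] = p[k - 1], p[j - 1]
        return tuple(p)

    def correct_ordering(perm):
        """Correct a permutation left to right, recording the transpositions used."""
        p = list(perm)
        swaps = []
        for pos in range(len(p)):          # fix position pos (0-indexed)
            target = pos + 1
            if p[pos] != target:
                k = p.index(target)        # where the value `target` currently sits
                p[pos], p[k] = p[k], p[pos]
                swaps.append((pos + 1, k + 1))
        return tuple(p), swaps

    print(correct_ordering((4, 2, 1, 5, 6, 3, 7)))
    # ((1, 2, 3, 4, 5, 6, 7), [(1, 3), (3, 6), (4, 6), (5, 6)])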
28.3 The Trouble with Transposition

But there's a catch. In each case, we used 4 transpositions to restore the ordering to the identity:
$$(\tau_{4,5} \circ \tau_{3,5} \circ \tau_{1,3} \circ \tau_{5,6})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7)$$
$$(\tau_{5,6} \circ \tau_{4,6} \circ \tau_{3,6} \circ \tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7).$$
That's because we were smart. However, we could have used far more than 4:
$$(\tau_{3,4} \circ \tau_{1,4} \circ \tau_{3,4} \circ \tau_{1,3} \circ \tau_{5,6} \circ \tau_{4,6} \circ \tau_{3,6} \circ \tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7).$$
Even worse, if you had one too many Amaretto Sours, you could have flipped the first two coordinates a hundred times:
$$(\underbrace{\tau_{1,2} \circ \tau_{1,2} \circ \ldots \circ \tau_{1,2}}_{100 \text{ times}} \circ\, \tau_{5,6} \circ \tau_{4,6} \circ \tau_{3,6} \circ \tau_{1,3})(4, 2, 1, 5, 6, 3, 7) = (1, 2, 3, 4, 5, 6, 7).$$
In fact, the number of transpositions used to restore a permutation to the identity ordering can be arbitrarily large!

Luckily, even though there is no fixed number of transpositions, the parity of transpositions is always the same. And this will be the key step in deriving the determinant formula.

How are we going to prove this?

We are going to prove that the number of transpositions has the same parity as a nicer, fixed number.

Namely,

Definition. The number of inversions of a permutation,
$$N(i_1, i_2, \ldots, i_n),$$
is the number of pairs $(j, k)$ where $j < k$ and $i_k < i_j$.

Don't be afraid: it is a very simple interpretation. The values of the normal identity permutation increase as you go to the right:
$$(1, 2, 3, 4, 5, \ldots, n).$$
So for any pair, the left number is less than the right number. The number of inversions simply counts how many times this fails, i.e., when an element to the left is bigger than an element to the right. Think of the number of inversions as a way to measure how messed up a permutation is.

Example: Calculate
$$N(4, 2, 1, 5, 6, 3, 7).$$
Directly, we see that $(4, 2, 1, 5, 6, 3, 7)$ has 6 inversions, coming from the pairs of values
$$(4, 2), \quad (4, 1), \quad (4, 3), \quad (2, 1), \quad (5, 3), \quad (6, 3),$$
where in each pair the larger value sits to the left of the smaller one.

The number of inversions is a very nice number to work with. This is because we can break it into a sum of smaller calculations by fixing $j$ and defining
$$N_j(i_1, i_2, \ldots, i_n)$$
as the number of $k$ where $j < k$ and $i_k < i_j$. Visually, think of this as fixing a leftmost element and then comparing it to the elements on the right, one at a time.

This means
$$N(i_1, i_2, \ldots, i_n) = N_1(i_1, i_2, \ldots, i_n) + N_2(i_1, i_2, \ldots, i_n) + \ldots + N_{n-1}(i_1, i_2, \ldots, i_n),$$
and in our example,
$$N(4, 2, 1, 5, 6, 3, 7) = \underbrace{N_1(4, 2, 1, 5, 6, 3, 7)}_{3} + \underbrace{N_2(4, 2, 1, 5, 6, 3, 7)}_{1} + \underbrace{N_3(4, 2, 1, 5, 6, 3, 7)}_{0} + \underbrace{N_4(4, 2, 1, 5, 6, 3, 7)}_{1} + \underbrace{N_5(4, 2, 1, 5, 6, 3, 7)}_{1} + \underbrace{N_6(4, 2, 1, 5, 6, 3, 7)}_{0} = 6.$$
This decomposition will be the key step in the next proof.
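Since the decomposition above is just a double loop, here is a tiny Python sketch (mine, not from the text) that computes $N_j$ and $N$ and reproduces the count of 6 for the running example.

    def N_j(perm, j):
        """Inversions contributed by position j (1-indexed):
        entries to the right of position j that are smaller than perm[j]."""
        return sum(1 for k in range(j, len(perm)) if perm[k] < perm[j - 1])

    def N(perm):
        """Total number of inversions, as the sum of the N_j."""
        return sum(N_j(perm, j) for j in range(1, len(perm)))

    p = (4, 2, 1, 5, 6, 3, 7)
    print([N_j(p, j) for j in range(1, len(p))])  # [3, 1, 0, 1, 1, 0]
    print(N(p))                                    # 6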
Now, to prove that

the number of inversions has the same parity as the number of transpositions needed to correct[1] an ordering,

we need to show that applying a transposition to a permutation changes the parity of the number of inversions. First consider the case of swapping two consecutive elements:

[1] To correct means to return to the identity ordering. We think of the identity ordering as the correct one.

Lemma.
$$N\big(\tau_{j,j+1}(i_1, i_2, \ldots, i_n)\big)$$
and
$$N(i_1, i_2, \ldots, i_n)$$
have opposite parity.

Proof Summary:

• Assume $i_j < i_{j+1}$.
• Split $N(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$ into a sum
$$N_1(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) + N_2(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) + \ldots + N_{n-1}(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n).$$
• Observe for $s \neq j, j+1$,
$$N_s(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_s(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$$
and
$$N_j(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_{j+1}(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) + 1$$
$$N_{j+1}(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_j(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n).$$
• Replace each term in the summation and recombine to get $N(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) + 1$.
Proof: Since the argument is virtually the same, assume $i_j < i_{j+1}$. Notice that
$$\tau_{j,j+1}(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) = (i_1, i_2, \ldots, i_{j+1}, i_j, \ldots, i_n),$$
where $i_{j+1}$ is now in the $j$-th position and $i_j$ is now in the $(j+1)$-th position.

The key is to rewrite $N(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$ in terms of $N(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$. To do this, we'll rewrite each term of the decomposition
$$N(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_1(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) + N_2(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) + \ldots + N_{n-1}(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$$
in terms of $N_s(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$.

Consider an arbitrary position number $s$. For $s < j$, we must have
$$N_s(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_s(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n).$$
This is because nothing changed: recall that $N_s(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$ compares the $s$-th element to all the elements to the right. But in $N_s(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$, you are still making the same comparisons: all the elements on the right are the same, and the order of comparison doesn't matter!

Likewise, it follows that
$$N_s(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_s(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$$
for $s > j + 1$, so we only need to consider
$$s = j \quad \text{or} \quad s = j + 1.$$

$s = j$:

When we compute $N_j(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$, we get one inversion from comparing the first pair, since $i_j < i_{j+1}$ and $i_{j+1}$ now sits to the left of $i_j$. The remaining portion counts the number of inversions by comparing $i_{j+1}$ to $i_{j+2}, i_{j+3}, \ldots, i_n$. But by definition, this is exactly the same as $N_{j+1}(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$! Thus,
$$N_j(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_{j+1}(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) + 1.$$

$s = j + 1$:

When we compute $N_{j+1}(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n)$, we are comparing $i_j$ to $i_{j+2}, i_{j+3}, \ldots, i_n$. But we assumed $i_j < i_{j+1}$, so adding an extra comparison with $i_{j+1}$ doesn't change the number of inversions: this count is exactly $N_j(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n)$! Thus,
$$N_{j+1}(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) = N_j(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n).$$

Now we can rewrite each term in our decomposition:
$$
\begin{aligned}
N(\ldots, i_{j+1}, i_j, \ldots) \;=\;\; & N_1(\ldots, i_{j+1}, i_j, \ldots) && = N_1(\ldots, i_j, i_{j+1}, \ldots) \\
+\; & N_2(\ldots, i_{j+1}, i_j, \ldots) && = N_2(\ldots, i_j, i_{j+1}, \ldots) \\
& \quad\vdots \\
+\; & N_j(\ldots, i_{j+1}, i_j, \ldots) && = N_{j+1}(\ldots, i_j, i_{j+1}, \ldots) + 1 \\
+\; & N_{j+1}(\ldots, i_{j+1}, i_j, \ldots) && = N_j(\ldots, i_j, i_{j+1}, \ldots) \\
& \quad\vdots \\
+\; & N_{n-1}(\ldots, i_{j+1}, i_j, \ldots) && = N_{n-1}(\ldots, i_j, i_{j+1}, \ldots)
\end{aligned}
$$
so summing the right-hand column,
$$N(\ldots, i_{j+1}, i_j, \ldots) = N(\ldots, i_j, i_{j+1}, \ldots) + 1.$$
Or in other words,
$$N(i_1, \ldots, i_{j+1}, i_j, \ldots, i_n) \text{ and } N(i_1, \ldots, i_j, i_{j+1}, \ldots, i_n) \text{ have opposite parity.}$$
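A quick numeric check of this lemma (my own, not from the text): swapping two adjacent entries always flips the parity of the inversion count.

    from itertools import permutations

    def N(perm):
        n = len(perm)
        return sum(1 for j in range(n) for k in range(j + 1, n) if perm[k] < perm[j])

    def adjacent_swap(perm, j):
        """Swap positions j and j+1 (1-indexed)."""
        p = list(perm)
        p[j - 1], p[j] = p[j], p[j - 1]
        return tuple(p)

    # Check the lemma for every permutation of (1,...,5) and every adjacent position.
    assert all((N(adjacent_swap(p, j)) - N(p)) % 2 == 1
               for p in permutations(range(1, 6)) for j in range(1, 5))
    print("every adjacent swap flips the parity of N")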
Now we extend this result to work for any transposition. We will do this by rewriting any transposition as a composition of transpositions between consecutive coordinates.

Lemma.
$$N\big(\tau_{j,k}(i_1, i_2, \ldots, i_n)\big)$$
and
$$N(i_1, i_2, \ldots, i_n)$$
have opposite parity.

Proof Summary:

• WLOG, assume $j < k$.
• Apply $k - j$ consecutive transpositions to move $i_j$ to position $k$ (so it is to the right of $i_k$).
• Apply $k - j - 1$ consecutive transpositions to move $i_k$ to position $j$.
• Apply the preceding lemma on a total of $2(k - j) - 1$ consecutive transpositions.
Proof: Without loss of generality, we may assume $j < k$ (for otherwise, we can just relabel the transposition as $\tau_{k,j}$, since switching $j$ and $k$ is the same as switching $k$ and $j$). Applying only consecutive transpositions, we want to take
$$(\ldots, i_j, i_{j+1}, i_{j+2}, \ldots, i_{k-1}, i_k, \ldots)$$
and swap $i_j$ and $i_k$:
$$(\ldots, i_k, i_{j+1}, i_{j+2}, \ldots, i_{k-1}, i_j, \ldots).$$
First, apply consecutive transpositions to move $i_j$ to the right, one position at a time, past $i_{j+1}$, then $i_{j+2}$, and so on until it passes $i_k$. After applying the transpositions
$$\underbrace{\tau_{k-1,k} \circ \tau_{k-2,k-1} \circ \ldots \circ \tau_{j+2,j+3} \circ \tau_{j+1,j+2} \circ \tau_{j,j+1}}_{k-j \text{ transpositions}}$$
we have
$$(\ldots, i_{j+1}, i_{j+2}, i_{j+3}, \ldots, i_{k-1}, i_k, i_j, \ldots).$$
Now, starting with $i_k$ on the left of $i_j$, we use consecutive transpositions to shift $i_k$ to the $j$-th position, one position at a time. So after applying the additional transpositions
$$\underbrace{\tau_{j,j+1} \circ \tau_{j+1,j+2} \circ \ldots \circ \tau_{k-3,k-2} \circ \tau_{k-2,k-1}}_{k-j-1 \text{ transpositions}}$$
we have the desired transposition:
$$(\ldots, i_k, i_{j+1}, i_{j+2}, \ldots, i_{k-1}, i_j, \ldots).$$
To swap $j$ and $k$, we used a total of
$$\underbrace{k-j}_{\text{right}} + \underbrace{k-j-1}_{\text{left}} = 2(k-j) - 1$$
transpositions. Applying the previous lemma to each of these $2(k-j)-1$ transpositions, the parity will change an odd number (namely, $2(k-j)-1$) of times, so in the end the parity will change. We conclude that
$$N\big(\tau_{j,k}(i_1, i_2, \ldots, i_n)\big) \text{ and } N(i_1, i_2, \ldots, i_n) \text{ have opposite parity.}$$
Finally, we can prove that any two sequences of transpositions that restore the normal ordering always have the same parity:

Theorem. Let $(i_1, \ldots, i_n)$ be a permutation. For any two sequences of transpositions
$$\tau_1, \tau_2, \ldots, \tau_q$$
$$t_1, t_2, \ldots, t_r$$
where
$$(\tau_q \circ \tau_{q-1} \circ \ldots \circ \tau_1)(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n)$$
$$(t_r \circ t_{r-1} \circ \ldots \circ t_1)(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n),$$
$q$ and $r$ have the same parity.

Proof Summary:

• Apply inverse transpositions to get
$$(i_1, i_2, \ldots, i_n) = (\tau_1 \circ \tau_2 \circ \ldots \circ \tau_q)(1, 2, \ldots, n)$$
$$(i_1, i_2, \ldots, i_n) = (t_1 \circ t_2 \circ \ldots \circ t_r)(1, 2, \ldots, n).$$
• Inductively apply the preceding theorem.

Proof: First notice that any transposition is its own inverse:
$$(\tau_{i,j} \circ \tau_{i,j})(i_1, i_2, \ldots, i_n) = (i_1, i_2, \ldots, i_n).$$
From
$$(\tau_q \circ \tau_{q-1} \circ \ldots \circ \tau_1)(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n)$$
$$(t_r \circ t_{r-1} \circ \ldots \circ t_1)(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n)$$
we can apply a series of inverse transpositions to both sides,
$$(\tau_1 \circ \tau_2 \circ \ldots \circ \tau_{q-1} \circ \tau_q \circ \tau_q \circ \tau_{q-1} \circ \ldots \circ \tau_2 \circ \tau_1)(i_1, i_2, \ldots, i_n) = (\tau_1 \circ \tau_2 \circ \ldots \circ \tau_q)(1, 2, \ldots, n)$$
$$(t_1 \circ t_2 \circ \ldots \circ t_{r-1} \circ t_r \circ t_r \circ t_{r-1} \circ \ldots \circ t_2 \circ t_1)(i_1, i_2, \ldots, i_n) = (t_1 \circ t_2 \circ \ldots \circ t_r)(1, 2, \ldots, n),$$
to get
$$(i_1, i_2, \ldots, i_n) = (\tau_1 \circ \tau_2 \circ \ldots \circ \tau_q)(1, 2, \ldots, n)$$
$$(i_1, i_2, \ldots, i_n) = (t_1 \circ t_2 \circ \ldots \circ t_r)(1, 2, \ldots, n).$$
Since
$$N(1, 2, \ldots, n) = 0,$$
we can apply the previous theorem $q$ times on
$$N(i_1, i_2, \ldots, i_n) = N\big((\tau_1 \circ \tau_2 \circ \ldots \circ \tau_q)(1, 2, \ldots, n)\big)$$
to get that $N(i_1, i_2, \ldots, i_n)$ has the same parity as $q$.[1] Likewise, we can apply the theorem $r$ times on
$$N(i_1, i_2, \ldots, i_n) = N\big((t_1 \circ t_2 \circ \ldots \circ t_r)(1, 2, \ldots, n)\big)$$
to get that $N(i_1, i_2, \ldots, i_n)$ has the same parity as $r$. Therefore, $q$ and $r$ have the same parity.

[1] To see this, note that starting from 0, the parity is switched $q$ times. If $q$ is even, $N(i_1, i_2, \ldots, i_n)$ will have the same parity as 0, i.e. even. If $q$ is odd, then $N(i_1, i_2, \ldots, i_n)$ will have the opposite parity as 0, i.e. odd.
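A quick way to convince yourself of this theorem numerically (my own sketch, not from the text): generate the greedy transposition sequence from the earlier code and compare the parity of its length with the inversion count $N$, for every small permutation.

    from itertools import permutations

    def inversions(perm):
        """N(perm): pairs (j, k) with j < k and perm[k] < perm[j]."""
        n = len(perm)
        return sum(1 for j in range(n) for k in range(j + 1, n) if perm[k] < perm[j])

    def greedy_transposition_count(perm):
        """Number of swaps used by the left-to-right correction strategy."""
        p, count = list(perm), 0
        for pos in range(len(p)):
            if p[pos] != pos + 1:
                k = p.index(pos + 1)
                p[pos], p[k] = p[k], p[pos]
                count += 1
        return count

    # For every permutation of (1,...,5), the swap count and N have the same parity.
    assert all(greedy_transposition_count(p) % 2 == inversions(p) % 2
               for p in permutations(range(1, 6)))
    print("parity of #transpositions matches parity of N for all permutations of length 5")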
New Notation

Symbol: $(i_1, i_2, \ldots, i_n)$
Reading: the permutation $(i_1, i_2, \ldots, i_n)$
Example: $(3, 2, 1)$
Example Translation: the permutation $(3, 2, 1)$.

Symbol: $\tau_{j,k}$
Reading: the transposition that swaps the coordinates $j$ and $k$
Example: $\tau_{1,2}(3, 2, 1)$
Example Translation: the transposition that swaps coordinates 1 and 2, applied to the permutation $(3, 2, 1)$.

Symbol: $N(i_1, i_2, \ldots, i_n)$
Reading: the number of inversions of the permutation $(i_1, i_2, \ldots, i_n)$
Example: $N(3, 2, 1) = 3$
Example Translation: the number of inversions of $(3, 2, 1)$ is 3.

Symbol: $N_j(i_1, i_2, \ldots, i_n)$
Reading: the number of inversions, relative to the $j$-th coordinate, of the permutation $(i_1, i_2, \ldots, i_n)$
Example: $N_2(3, 2, 1) = 1$
Example Translation: the number of inversions, relative to the 2nd coordinate, of $(3, 2, 1)$ is 1.
Lecture 29

Determining Determinant

ONE function to determine them all.
- Lord of Z, R, Q, and C.

Goals: Today, we build the determinant from a multilinear function D. Remarkably, we can prove that this D is the UNIQUE multilinear function that outputs 1 at the ordered standard basis and switches sign whenever two inputs are swapped. Finally, we prove column and row reduction properties to efficiently compute the determinant.

29.1 The Magic of Multilinearity

Unless you slept through the first seven weeks of Math 51H, you've probably realized

Math Mantra: LINEARITY IS AWESOME!

We've used this property a gazillion times. Namely when

• Distributing an integral across a sum of functions.
• Representing a function as a matrix multiplication.
• Calculating the directional derivative.

But why restrict this awesomeness to a function with a single input? Why not consider a function that has more than one input,
$$f(x_1, x_2),$$
and make it linear with respect to each of these inputs? That means scaling any input scales the output by the same factor,
$$f(c\,x_1, x_2) = f(x_1, c\,x_2) = c\,f(x_1, x_2),$$
and if we have a sum in any component, we fix the other components and perform normal linearity:
$$f(a + b, x_2) = f(a, x_2) + f(b, x_2)$$
$$f(x_1, a + b) = f(x_1, a) + f(x_1, b).$$
As with linear maps, there are lots of examples of multilinear maps. For example, consider a function that inputs three variables and returns the product:
$$f(x, y, z) = xyz.$$
But let's consider only multilinear functions that input $n$ vectors in $\mathbb{R}^n$ and output a real number:
$$f : \underbrace{\mathbb{R}^n \times \mathbb{R}^n \times \ldots \times \mathbb{R}^n}_{n \text{ inputs}} \to \mathbb{R}.$$
Suppose we add two seemingly innocuous conditions:

• If we swap any two inputs, the value of our function is negated:
$$f(x_1, \ldots, x_i, \ldots, x_j, \ldots, x_n) = -f(x_1, \ldots, x_j, \ldots, x_i, \ldots, x_n).$$
• If we input the ordered standard basis vectors, the function returns 1:
$$f(e_1, e_2, \ldots, e_n) = 1.$$

The remarkable fact is that there is one and only one function that satisfies this! And we will call[1] this almighty function D.

[1] When we apply D to the specific case of matrix columns, then we call it the determinant.

29.2 Uniqueness of D

How do we prove that there is exactly one function that satisfies the aforementioned properties? First, we have to prove such a satisfying function actually exists. To do this, we normally

• Prove that the function exists.
• Using the fact that it exists, derive properties on what it must look like.

We've used this trick a million times, especially in the calculation of limits. Most recently, in Lecture 24, this was the key step in deriving the formula for the arc-length of a curve.

But that's not going to cut it here. We have to cheat. We are going to assume D exists to derive its formula. Generally,

Math Mantra: To deduce an object exists, we first ASSUME it actually does exist. Then, by exploiting its properties, we derive an explicit formula for that object. We then VERIFY that this explicit formula satisfies all the properties we need.
This was the strategy in the Cauchy-Schwarz equality proof. We also exploited this thinking with Lagrange Multipliers: we found necessary conditions on what a maximum must look like. Then, we formed a guess and verified that our guess was indeed a maximum.

Once we assume D exists, the proof is easy (though there is quite a bit of notation). We are going to do the same trick we used when we proved that any linear function from $\mathbb{R}^n$ to $\mathbb{R}$ can be written as a matrix multiplication. Rewrite $x$ in terms of standard basis vectors and apply linearity!

Before we proceed to the proof, we prove an extremely easy lemma.

Lemma. Let $D : \underbrace{\mathbb{R}^n \times \mathbb{R}^n \times \ldots \times \mathbb{R}^n}_{n \text{ inputs}} \to \mathbb{R}$ and suppose D changes sign whenever we interchange any two inputs:
$$D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n) = -D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n).$$
Then D must evaluate to 0 if two of its inputs are the same:
$$D(x_1, \ldots, a, \ldots, a, \ldots, x_n) = 0.$$

Proof: Suppose that the inputs at the $j$-th and $k$-th coordinate are equal:
$$D(x_1, \ldots, \underbrace{a}_{j\text{-th}}, \ldots, \underbrace{a}_{k\text{-th}}, \ldots, x_n).$$
Interchanging the $j$-th and $k$-th components, we have
$$D(x_1, \ldots, a, \ldots, a, \ldots, x_n) = -D(x_1, \ldots, a, \ldots, a, \ldots, x_n),$$
implying
$$D(x_1, \ldots, a, \ldots, a, \ldots, x_n) = 0.$$
Even though this is an incredibly easy lemma, do not underestimate it! When we try to derive the
determinant formula, we will use this lemma to kill unnecessary terms.
Theorem. If there exists some function $D : \underbrace{\mathbb{R}^n \times \mathbb{R}^n \times \ldots \times \mathbb{R}^n}_{n \text{ inputs}} \to \mathbb{R}$ that satisfies the following properties:

• D is linear in each component: for any $j$,
$$D(x_1, \ldots, \underbrace{ca + b}_{j\text{-th component}}, \ldots, x_n) = c\,D(x_1, \ldots, \underbrace{a}_{j\text{-th}}, \ldots, x_n) + D(x_1, \ldots, \underbrace{b}_{j\text{-th}}, \ldots, x_n).$$
• D changes sign whenever we interchange any two inputs: for any $j \neq k$,
$$D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n) = -D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n).$$
• D evaluates to 1 on the ordered standard basis vectors:
$$D(e_1, e_2, \ldots, e_n) = 1.$$

Then D must satisfy
$$D(x_1, x_2, \ldots, x_n) = \sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}.$$
Proof Summary:
• Assume such a D exists.
• Write each input in terms of standard basis vectors. Be sure to index the sums.
• Apply linearity on each component and condense the summations into one.
• By the preceding lemma, any term over non-distinct indices is 0. The summation simplifies to permutations of $(1, 2, \ldots, n)$.
• In each term, apply the swapping property on
$$D(e_{i_1}, e_{i_2}, \ldots, e_{i_n})$$
to restore it to
$$\underbrace{D(e_1, e_2, \ldots, e_n)}_{=1}.$$
• From the last lecture, the number of times you apply the swapping property has the same parity as the number of inversions. This introduces a
$$(-1)^{N(i_1, i_2, \ldots, i_n)}$$
factor in each term.
Proof: Assume that there exists a function D with the desired properties:
$$D(x_1, x_2, \ldots, x_n).$$
Now we rewrite each input in terms of the standard basis vectors. So a first attempt would be to expand as
$$D\left( \sum_{i=1}^n x_{i1} e_i,\; \sum_{i=1}^n x_{i2} e_i,\; \ldots,\; \sum_{i=1}^n x_{in} e_i \right).$$
But we have $n$ summations! Reusing the same dummy variable is completely boneheaded!

To keep everything straight, we decide to index the indexing terms:
$$D\left( \sum_{i_1=1}^n x_{i_1 1} e_{i_1},\; \sum_{i_2=1}^n x_{i_2 2} e_{i_2},\; \ldots,\; \sum_{i_n=1}^n x_{i_n n} e_{i_n} \right).$$
Now, we use repeated applications of linearity to pull out the first component's summation:
$$\sum_{i_1=1}^n x_{i_1 1}\, D\left( e_{i_1},\; \sum_{i_2=1}^n x_{i_2 2} e_{i_2},\; \ldots,\; \sum_{i_n=1}^n x_{i_n n} e_{i_n} \right).$$
Then, we can do the same trick to pull out the summation from the second component:
$$\sum_{i_1=1}^n \sum_{i_2=1}^n x_{i_1 1}\, x_{i_2 2}\, D\left( e_{i_1}, e_{i_2}, \ldots, \sum_{i_n=1}^n x_{i_n n} e_{i_n} \right).$$
Inductively, after going through all components, we have
$$\sum_{i_1=1}^n \sum_{i_2=1}^n \cdots \sum_{i_n=1}^n x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n}). \qquad (\ast)$$
Chances are, this is the first time you've seen $n$ nested sums. But it is just like a double summation: a summation over a summation. Except you are doing this $n$ times (giving us a total of $n^n$ terms)!

I know this is scary:[1] you are summing over sums of sums of sums, etc. etc. And because we have so many indexing terms, we need to introduce ugly subscripts. But just to put your mind at ease, let's write out the expansion for the first two components.

Expanding the first component of
$$D\left( \sum_{i_1=1}^n x_{i_1 1} e_{i_1},\; \sum_{i_2=1}^n x_{i_2 2} e_{i_2},\; \ldots,\; \sum_{i_n=1}^n x_{i_n n} e_{i_n} \right),$$
we have
$$D\left( x_{11} e_1 + x_{21} e_2 + \ldots + x_{n1} e_n,\; \sum_{i_2=1}^n x_{i_2 2} e_{i_2},\; \ldots,\; \sum_{i_n=1}^n x_{i_n n} e_{i_n} \right).$$
By repeated application of linearity on the first component, this becomes
$$x_{11}\, D\left( e_1,\; \sum_{i_2=1}^n x_{i_2 2} e_{i_2},\; \ldots \right) + x_{21}\, D\left( e_2,\; \sum_{i_2=1}^n x_{i_2 2} e_{i_2},\; \ldots \right) + \ldots + x_{n1}\, D\left( e_n,\; \sum_{i_2=1}^n x_{i_2 2} e_{i_2},\; \ldots \right).$$
But we can expand each of these terms. Consider the $i_1$-th term:
$$x_{i_1 1}\, D\left( e_{i_1},\; \sum_{i_2=1}^n x_{i_2 2} e_{i_2},\; \ldots,\; \sum_{i_n=1}^n x_{i_n n} e_{i_n} \right).$$

[1] It's Inception all over again.

Now we apply repeated linearity on the second component of this term to get
$$x_{i_1 1}\, x_{12}\, D\left( e_{i_1}, e_1,\; \sum_{i_3=1}^n x_{i_3 3} e_{i_3},\; \ldots \right) + x_{i_1 1}\, x_{22}\, D\left( e_{i_1}, e_2,\; \sum_{i_3=1}^n x_{i_3 3} e_{i_3},\; \ldots \right) + \ldots + x_{i_1 1}\, x_{n2}\, D\left( e_{i_1}, e_n,\; \sum_{i_3=1}^n x_{i_3 3} e_{i_3},\; \ldots \right).$$
Then you can play the same game by expanding the $i_2$-th term of this sum:
$$x_{i_1 1}\, x_{i_2 2}\, D\left( e_{i_1}, e_{i_2},\; \sum_{i_3=1}^n x_{i_3 3} e_{i_3},\; \sum_{i_4=1}^n x_{i_4 4} e_{i_4},\; \ldots,\; \sum_{i_n=1}^n x_{i_n n} e_{i_n} \right).$$
If you are still unsure, I recommend practicing with smaller cases.
Returning to $(\ast)$, we try to make the notation easier on the eyes by rewriting it with a single sum symbol:
$$\sum_{i_1, i_2, \ldots, i_n = 1}^{n} x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n}).$$
By the preceding lemma, we know that whenever two components of D are equal, the term is 0. Therefore, we only need to consider terms where the indices are all distinct:
$$\sum_{\substack{i_1, i_2, \ldots, i_n = 1 \\ i_1, i_2, \ldots, i_n \text{ distinct}}}^{n} x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n}).$$
Notice that $i_1, i_2, \ldots, i_n$ are distinct numbers from 1 to $n$. So we are really looking at all permutations of $(1, 2, \ldots, n)$:
$$\sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n}).$$
Next, use our swapping property to rearrange
$$D(e_{i_1}, e_{i_2}, \ldots, e_{i_n})$$
into
$$D(e_1, e_2, \ldots, e_n).$$
Each swap multiplies D by $-1$. Moreover, each swap is a transposition; thus, the power of $-1$ is the number of transpositions that rearrange
$$(i_1, i_2, \ldots, i_n)$$
into
$$(1, 2, \ldots, n).$$
For each specific term
$$x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n})$$
we can find transpositions $\tau_1, \tau_2, \ldots, \tau_q$ to rewrite this term as
$$(-1)^q\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, \underbrace{D(e_1, e_2, \ldots, e_n)}_{=1}.$$
Now stare at $q$: the only thing that matters is its parity. And we proved last lecture that the number of transpositions has the same parity as the number of inversions! So this term is just
$$(-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n},$$
and our summation is
$$\sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}.$$
There is a very crucial philosophical step in this proof that you must mull over. Particularly, think about the step in which you found transpositions $\tau_1, \tau_2, \ldots, \tau_q$ to rewrite
$$x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_{i_1}, e_{i_2}, \ldots, e_{i_n})$$
as
$$(-1)^q\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_1, e_2, \ldots, e_n).$$
Suppose the parity of $q$ was not unique and that we could find both odd and even sequences of transpositions that correct the ordering of
$$(i_1, i_2, \ldots, i_n).$$
Then the same term could be rewritten as both
$$+\,x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_1, e_2, \ldots, e_n)$$
and
$$-\,x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}\, D(e_1, e_2, \ldots, e_n).$$
FAIL!

To avoid catastrophes like this, we must make sure that our operations are well-defined. What does it mean for a function to be well-defined? Suppose you have multiple ways to represent an input. A function is well-defined if it returns the same output independent of how you choose to represent the input.

For example, the following function is ill-defined: for an integer $x$,
$$f(x) = s \quad \text{where } x = st.$$
For $x = 20$, we can represent
$$x = 5 \cdot 4,$$
which implies
$$f(20) = 5.$$
But we could have also written $x$ as
$$x = 2 \cdot 10,$$
giving us
$$f(20) = 2.$$
You will see the issue of well-definedness again in Math 120 when you talk about Quotient Groups and in Math 171 when you define the Lebesgue Integral. For now, keep in mind that

Math Mantra: When you define a function, make sure that your output is INDEPENDENT of how you choose to represent the input.
Now that we've proven

IF D exists, then it must look like
$$\sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n},$$

we must verify that this expression satisfies the three properties of D.

Theorem.
$$D(x_1, x_2, \ldots, x_n) = \sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1}\, x_{i_2 2} \cdots x_{i_n n}$$
satisfies the following properties:
• D is linear in each component: for any $j$,
$$D(x_1, \ldots, \underbrace{ca + b}_{j\text{-th component}}, \ldots, x_n) = c\,D(x_1, \ldots, \underbrace{a}_{j\text{-th}}, \ldots, x_n) + D(x_1, \ldots, \underbrace{b}_{j\text{-th}}, \ldots, x_n).$$
• D changes sign whenever we interchange any two inputs: for any $j \neq k$,
$$D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n) = -D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n).$$
• D evaluates to 1 on the ordered standard basis vectors:
$$D(e_1, e_2, \ldots, e_n) = 1.$$

Proof Summary:

• Linearity: Directly apply the definition and distribute the summation.
• Swapping: From $D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n)$, apply a transposition and swap $x_{i_k k}$ and $x_{i_j j}$. The names of the indexing variables do not matter, so swap $i_j$ and $i_k$. Compare the inner summation to $D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n)$: the terms are exactly the same. Even though the indexing set is in a different order, it is still over all permutations. So the summations are equal.
• Ordered Standard Basis: Directly apply the definition. All terms are zero unless
$$(i_1, i_2, \ldots, i_n) = (1, 2, \ldots, n).$$
Proof:

Linearity:

Expand the definition of
$$D(x_1, \ldots, \underbrace{ca + b}_{j\text{-th component}}, \ldots, x_n)$$
to get
$$\sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, x_{i_1 1} \cdots \underbrace{(c\,a_{i_j} + b_{i_j})}_{j\text{-th factor}} \cdots x_{i_n n}.$$
Then distribute the sum:
$$\sum_{\substack{\text{perm. } (i_1, \ldots, i_n)\\ \text{of } (1, \ldots, n)}} c\,(-1)^{N(i_1, \ldots, i_n)}\, x_{i_1 1} \cdots a_{i_j} \cdots x_{i_n n} \;+\; \sum_{\substack{\text{perm. } (i_1, \ldots, i_n)\\ \text{of } (1, \ldots, n)}} (-1)^{N(i_1, \ldots, i_n)}\, x_{i_1 1} \cdots b_{i_j} \cdots x_{i_n n}.$$
But this is just
$$c\,D(x_1, \ldots, \underbrace{a}_{j\text{-th}}, \ldots, x_n) + D(x_1, \ldots, \underbrace{b}_{j\text{-th}}, \ldots, x_n).$$

Swapping:

Expand
$$D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n)$$
to get
$$\sum_{\substack{\text{perm. } (i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)\\ \text{of } (1, \ldots, n)}} (-1)^{N(i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)}\, x_{i_1 1} \cdots x_{i_j j} \cdots x_{i_k k} \cdots x_{i_n n}.$$
Switch the factors $x_{i_k k}$ and $x_{i_j j}$:
$$\sum_{\substack{\text{perm. } (i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)\\ \text{of } (1, \ldots, n)}} (-1)^{N(i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)}\, x_{i_1 1} \cdots x_{i_k k} \cdots x_{i_j j} \cdots x_{i_n n},$$
and apply a single transposition to switch $i_j$ and $i_k$ inside $N$, pulling out a $-1$:
$$-\left( \sum_{\substack{\text{perm. } (i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)\\ \text{of } (1, \ldots, n)}} (-1)^{N(i_1, \ldots, i_k, \ldots, i_j, \ldots, i_n)}\, x_{i_1 1} \cdots x_{i_j j} \cdots x_{i_k k} \cdots x_{i_n n} \right).$$
But the names of the dummy indexing variables do not matter! So switch $i_j$ with $i_k$:
$$-\left( \sum_{\substack{\text{perm. } (i_1, \ldots, i_k, \ldots, i_j, \ldots, i_n)\\ \text{of } (1, \ldots, n)}} (-1)^{N(i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)}\, x_{i_1 1} \cdots x_{i_j k} \cdots x_{i_k j} \cdots x_{i_n n} \right).$$
Compare the inside to the expansion of $D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n)$:
$$\sum_{\substack{\text{perm. } (i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)\\ \text{of } (1, \ldots, n)}} (-1)^{N(i_1, \ldots, i_j, \ldots, i_k, \ldots, i_n)}\, x_{i_1 1} \cdots x_{i_j k} \cdots x_{i_k j} \cdots x_{i_n n}.$$
They are summing the same terms over all permutations. The only difference is that the permutations are indexed by different dummy variables. Therefore,
$$D(x_1, \ldots, x_j, \ldots, x_k, \ldots, x_n) = -D(x_1, \ldots, x_k, \ldots, x_j, \ldots, x_n).$$

Ordered Standard Basis:

Expand
$$D(e_1, e_2, \ldots, e_n)$$
to get
$$\sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, e_{i_1 1}\, e_{i_2 2} \cdots e_{i_n n}.$$
But
$$e_{jk} = 1$$
only when $j = k$ (and 0 otherwise). Thus every term in this sum is 0, except when
$$i_1 = 1, \quad i_2 = 2, \quad \ldots, \quad i_n = n.$$
This leaves us with
$$(-1)^{N(1, 2, \ldots, n)}\, e_{11}\, e_{22} \cdots e_{nn} = 1.$$
Combining the last two theorems, we have:

Theorem. There exists one and only one function $D : \underbrace{\mathbb{R}^n \times \mathbb{R}^n \times \ldots \times \mathbb{R}^n}_{n \text{ inputs}} \to \mathbb{R}$ such that

• D is linear in each component.
• D changes sign whenever we interchange any two inputs.
• D evaluates to 1 on the ordered standard basis vectors.
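For small $n$ you can evaluate the permutation formula directly. The brute-force sketch below (my own, not from the text) implements D on a list of column vectors and checks it against numpy's determinant; it has $n!$ terms, which is exactly why the row and column properties below matter.

    import numpy as np
    from itertools import permutations

    def inversions(perm):
        n = len(perm)
        return sum(1 for j in range(n) for k in range(j + 1, n) if perm[k] < perm[j])

    def D(columns):
        """Evaluate D on n column vectors in R^n via the permutation formula."""
        n = len(columns)
        total = 0.0
        for perm in permutations(range(n)):         # perm[j] plays the role of i_{j+1} - 1
            term = (-1) ** inversions(perm)
            for col in range(n):
                term *= columns[col][perm[col]]     # x_{i_j j}: row perm[col] of column col
            total += term
        return total

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    print(D([A[:, j] for j in range(3)]))   # 8.0
    print(np.linalg.det(A))                 # 8.0 up to floating point error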
29.3 Computing Determinants

As you may have already guessed, we like to evaluate the function D on the columns of a square matrix:

Definition. The determinant of an $n \times n$ matrix A is the value of D applied to the columns of A:
$$\det(A) = D(a_1, a_2, \ldots, a_n)$$
where
$$A = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix}.$$
Back in high school, you already learned how to compute determinants of $2 \times 2$ and $3 \times 3$ matrices using some mnemonic. You drew some diagonals on the array
$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\begin{matrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{matrix}$$
(with the first two columns copied to the right) and said,

Main diagonals minus anti-diagonals.

But,

Math Mantra: Just because a formula is true for a specific case DOES NOT MEAN it is true for all cases!

We cannot apply this mnemonic for higher $n$. For example, in the case of $n = 5$, the diagonal process would only give us 10 terms. But in the actual determinant definition, our sum has $5! = 120$ terms.

The truth is: no one ever uses the direct definition
$$\det(A) = \sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, i_2, \ldots, i_n)}\, a_{i_1 1}\, a_{i_2 2} \cdots a_{i_n n}.$$
It is a completely unwieldy mess of $n!$ terms. No one in their right mind wants to expand a sum that big!
Here's a smarter idea:

Math Mantra: Instead of appealing to the original definition, you can save yourself a lot of trouble by using theorems you've already proved.

The determinant immediately inherits some sweet simplification properties from D. Namely,

Theorem. For an $n \times n$ matrix A,

• Adding a scaling of one column to another does not change the determinant:
$$\det\begin{bmatrix} \cdots & a_i & \cdots & a_j & \cdots \end{bmatrix} = \det\begin{bmatrix} \cdots & a_i & \cdots & c\,a_i + a_j & \cdots \end{bmatrix}.$$
• Scaling a column by $c$ scales the determinant by $c$:
$$\det\begin{bmatrix} \cdots & c\,a_i & \cdots \end{bmatrix} = c\,\det\begin{bmatrix} \cdots & a_i & \cdots \end{bmatrix}.$$
• Swapping two columns switches the sign of the determinant:
$$\det\begin{bmatrix} \cdots & a_j & \cdots & a_i & \cdots \end{bmatrix} = -1 \cdot \det\begin{bmatrix} \cdots & a_i & \cdots & a_j & \cdots \end{bmatrix}.$$

But there's more. We also have the same properties for the rows. And this will be a simple corollary of the product property for determinants. Note that this proof follows the exact same method as the derivation of D:

Theorem. For $n \times n$ matrices A and B,
$$\det(AB) = \det(A)\det(B).$$
Proof Summary:

• Write each input in terms of columns of B. Be sure to index the sums.
• Apply linearity on each component and condense the summations into one.
• Any term over non-distinct indices is 0, therefore the summation is now over all permutations of $(1, \ldots, n)$.
• In each term, apply the swapping property on
$$D(a_{i_1}, a_{i_2}, \ldots, a_{i_n})$$
to restore it to
$$\underbrace{D(a_1, a_2, \ldots, a_n)}_{\det(A)}.$$
This introduces a
$$(-1)^{N(i_1, i_2, \ldots, i_n)}$$
factor in each term.
• Pull out $\det(A)$, and recognize the remaining sum as $\det(B)$.

Proof: Recall that the columns of AB are linear combinations of the columns of A:
$$AB = \begin{bmatrix} | & | & & | \\ A b_1 & A b_2 & \cdots & A b_n \\ | & | & & | \end{bmatrix}$$
where the $j$-th column is
$$A b_j = \sum_{i=1}^n b_{ij}\, a_i.$$
When we apply the definition of determinant,
$$\det(AB) = D\big(A b_1, A b_2, \ldots, A b_n\big),$$
we expand each input as a linear combination of columns of A:
$$D\left( \sum_{i_1=1}^n b_{i_1 1}\, a_{i_1},\; \sum_{i_2=1}^n b_{i_2 2}\, a_{i_2},\; \ldots,\; \sum_{i_n=1}^n b_{i_n n}\, a_{i_n} \right).$$
Remember to index the sums!

Applying linearity on each component, we can, once more, pull out each sum,
$$\sum_{i_1=1}^n \sum_{i_2=1}^n \cdots \sum_{i_n=1}^n b_{i_1 1}\, b_{i_2 2} \cdots b_{i_n n}\, D(a_{i_1}, a_{i_2}, \ldots, a_{i_n}),$$
and rewrite as
$$\sum_{i_1, i_2, \ldots, i_n = 1}^{n} b_{i_1 1}\, b_{i_2 2} \cdots b_{i_n n}\, D(a_{i_1}, a_{i_2}, \ldots, a_{i_n}).$$
Then, kill off terms where components are non-distinct. The summation is now over permutations:
$$\sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} b_{i_1 1}\, b_{i_2 2} \cdots b_{i_n n}\, D(a_{i_1}, a_{i_2}, \ldots, a_{i_n}).$$
The swapping property yields
$$\sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, \ldots, i_n)}\, b_{i_1 1}\, b_{i_2 2} \cdots b_{i_n n}\, \underbrace{D(a_1, a_2, \ldots, a_n)}_{=\det(A)}.$$
Yet the inner determinant term is a constant! Pull it out to get
$$\det(A) \underbrace{\left( \sum_{\substack{\text{permutations } (i_1, i_2, \ldots, i_n)\\ \text{of } (1, 2, \ldots, n)}} (-1)^{N(i_1, \ldots, i_n)}\, b_{i_1 1}\, b_{i_2 2} \cdots b_{i_n n} \right)}_{=\det(B)}.$$
This is the same as
$$\det(A)\det(B).$$
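A one-line sanity check of the product property with numpy (my own check, not part of the text):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B = rng.random((4, 4)), rng.random((4, 4))
    print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))  # True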
To prove the row reduction properties, we express the row operations as matrix multiplications and use the preceding theorem:

Theorem. For an $n \times n$ matrix A (writing $A_i$ for the $i$-th row of A),

• Adding a scaling of one row to another does not change the determinant:
$$\det\begin{bmatrix} \vdots \\ A_i \\ \vdots \\ A_j \\ \vdots \end{bmatrix} = \det\begin{bmatrix} \vdots \\ A_i \\ \vdots \\ c A_i + A_j \\ \vdots \end{bmatrix}.$$
• Scaling a row by $c$ scales the determinant by $c$:
$$\det\begin{bmatrix} \vdots \\ c A_i \\ \vdots \end{bmatrix} = c\,\det\begin{bmatrix} \vdots \\ A_i \\ \vdots \end{bmatrix}.$$
• Swapping two rows switches the sign of the determinant:
$$\det\begin{bmatrix} \vdots \\ A_j \\ \vdots \\ A_i \\ \vdots \end{bmatrix} = -1 \cdot \det\begin{bmatrix} \vdots \\ A_i \\ \vdots \\ A_j \\ \vdots \end{bmatrix}.$$
Proof Summary:

• Adding a scaling of one row to another: represent the operation as the product $E_{\text{add}} A$. Then
$$\det(E_{\text{add}} A) = \underbrace{\det(E_{\text{add}})}_{=1} \det(A) = \det(A).$$
• Scaling: represent the operation as the product $E_{\text{scale}} A$. Then
$$\det(E_{\text{scale}} A) = \underbrace{\det(E_{\text{scale}})}_{=c} \det(A) = c\,\det(A).$$
• Swapping: represent the operation as the product $E_{\text{swap}} A$. Then
$$\det(E_{\text{swap}} A) = \underbrace{\det(E_{\text{swap}})}_{=-1} \det(A) = -\det(A).$$

Proof:

Adding a Scaling of One Row to Another:

Let $E_{\text{add}}$ be the identity matrix with a constant $c$ in position $(j, i)$ for $i \neq j$:
$$E_{\text{add}} = \begin{bmatrix}
1 & & & & \\
& \ddots & & & \\
& & 1 & \cdots & c \\
& & & \ddots & \vdots \\
& & & & 1
\end{bmatrix},$$
where the $c$ sits in row $j$, column $i$. Then
$$E_{\text{add}} A = \begin{bmatrix} \vdots \\ A_i \\ \vdots \\ c A_i + A_j \\ \vdots \end{bmatrix}.$$
Notice that
$$\det(E_{\text{add}}) = 1.$$
This is because we can apply the column properties of the determinant and subtract $c$ times the $j$-th column from the $i$-th column, which clears the $c$ and leaves the identity matrix. Thus,
$$\det(E_{\text{add}} A) = \det(E_{\text{add}})\det(A) = \det(A).$$

Scaling:

Let $E_{\text{scale}}$ be the identity matrix with $c$ at coordinate $(i, i)$:
$$E_{\text{scale}} = \begin{bmatrix}
1 & & & & \\
& \ddots & & & \\
& & c & & \\
& & & \ddots & \\
& & & & 1
\end{bmatrix}.$$
Then multiplying on the left by $E_{\text{scale}}$ scales the $i$-th row:
$$E_{\text{scale}} A = \begin{bmatrix} \vdots \\ c A_i \\ \vdots \end{bmatrix}.$$
By the column scaling property of the determinant,
$$\det(E_{\text{scale}}) = c,$$
giving us
$$\det(E_{\text{scale}} A) = \det(E_{\text{scale}})\det(A) = c\,\det(A).$$

Swapping:

Let $E_{\text{swap}}$ be the identity matrix with the $i$ and $j$ columns swapped:
$$E_{\text{swap}} = \begin{bmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 0 & \cdots & 1 & \\
& & \vdots & \ddots & \vdots & \\
& & 1 & \cdots & 0 & \\
& & & & & 1
\end{bmatrix}.$$
Then multiplying on the left by $E_{\text{swap}}$ swaps the $i$-th and $j$-th rows:
$$E_{\text{swap}} A = \begin{bmatrix} \vdots \\ A_j \\ \vdots \\ A_i \\ \vdots \end{bmatrix}.$$
By the column swapping property of the determinant,
$$\det(E_{\text{swap}}) = -1,$$
allowing us to conclude
$$\det(E_{\text{swap}} A) = \det(E_{\text{swap}})\det(A) = -\det(A).$$
By exploiting the row and column properties, we can easily[1] compute determinants.

[1] This is especially easy with SPARSE matrices (matrices with lots of zeros). You will see these in your engineering courses (e.g. EE263).

Example. Compute the determinant of
$$\begin{bmatrix}
0 & 0 & 5 & 0 & 0 \\
0 & 4 & 0 & 5 & 0 \\
0 & 0 & 5 & 0 & 5 \\
0 & 5 & 0 & 6 & 0 \\
1 & 0 & 0 & 0 & 1
\end{bmatrix}.$$
First, pull out constants from the first and third rows:
$$\det\begin{bmatrix}
0 & 0 & 5 & 0 & 0 \\
0 & 4 & 0 & 5 & 0 \\
0 & 0 & 5 & 0 & 5 \\
0 & 5 & 0 & 6 & 0 \\
1 & 0 & 0 & 0 & 1
\end{bmatrix}
= 25 \det\begin{bmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 4 & 0 & 5 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 5 & 0 & 6 & 0 \\
1 & 0 & 0 & 0 & 1
\end{bmatrix}.$$
Then, subtract the second column from the fourth,
$$25 \det\begin{bmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 4 & 0 & 5 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 5 & 0 & 6 & 0 \\
1 & 0 & 0 & 0 & 1
\end{bmatrix}
= 25 \det\begin{bmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 4 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 5 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1
\end{bmatrix},$$
and then four times the fourth column from the second:
$$25 \det\begin{bmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 4 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 5 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1
\end{bmatrix}
= 25 \det\begin{bmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1
\end{bmatrix}.$$
By subtracting rows from each other,
$$25 \det\begin{bmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1
\end{bmatrix}
= 25 \det\begin{bmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1
\end{bmatrix}
= 25 \det\begin{bmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0
\end{bmatrix},$$
which after permuting rows (7 swaps, each contributing a factor of $-1$) gives us
$$25 \cdot (-1)^7 \det\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}
= -25.$$
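If you want to double-check a computation like this numerically, numpy agrees (my own check, not part of the text):

    import numpy as np

    A = np.array([[0, 0, 5, 0, 0],
                  [0, 4, 0, 5, 0],
                  [0, 0, 5, 0, 5],
                  [0, 5, 0, 6, 0],
                  [1, 0, 0, 0, 1]], dtype=float)
    print(round(np.linalg.det(A)))  # -25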
New Notation

Symbol: $\det(A)$
Reading: the determinant of the matrix A
Example: $\det(I) = 1$
Example Translation: the determinant of the identity matrix is 1.

Lecture 30

Flirting with Inverting

In the dating game, sometimes it's better to stay singular.
- Q, Principal Ideal Domain

Goals: After giving a formal definition of a matrix inverse, we prove necessary and sufficient conditions for invertibility. Namely, a matrix is invertible if and only if the determinant is non-zero. To prove this, we derive the cofactor expansion formula. We then use this formula to explicitly construct the left and right inverses.

30.1 A Revision of Algebra II

A long time ago, in a galaxy far, far away (Algebra II), you were given a definition of determinant. Your teachers gave you a mnemonic, and told you how to calculate it. They also gave you some concatenated matrices and a methodology to compute an inverse:
$$\left[\begin{array}{ccc|ccc} 0 & 0 & 4 & 1 & 0 & 0 \\ 0 & 1 & 4 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \end{array}\right]
\;\xrightarrow{\text{ADD}}\;
\left[\begin{array}{ccc|ccc} 0 & 0 & 4 & 1 & 0 & 0 \\ 0 & 1 & 0 & -1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 \end{array}\right]
\;\xrightarrow{\text{SWAP}}\;
\left[\begin{array}{ccc|ccc} 1 & 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & -1 & 1 & 0 \\ 0 & 0 & 4 & 1 & 0 & 0 \end{array}\right]
\;\xrightarrow{\text{SCALE}}\;
\left[\begin{array}{ccc|ccc} 1 & 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & -1 & 1 & 0 \\ 0 & 0 & 1 & .25 & 0 & 0 \end{array}\right]$$

By now, you understand
Math Mantra: We don't care about the methodology. Methodology can always be looked up or programmed on a computer. We care about the MEANING.

From last lecture, you should realize that the determinant is a complicated little bugger. It deserves a lot more credit than some measly mnemonic. And now that we have different eyes, we can go back to Algebra II and understand what we did and why we did it.

So like John Smith in Pocahontas, we are going to listen to the mathematics that surround us and learn things we never knew that we never knew.

30.2 Left and Right Inverse

The first thing we need to do is define inverses:

Definition. Let A be an $n \times n$ matrix. The right inverse of A, if it exists, is the right multiplicative inverse of A. Formally, it is the $n \times n$ matrix B that satisfies
$$AB = I.$$
Likewise, the left inverse of A, if it exists, is the $n \times n$ matrix C that satisfies
$$CA = I.$$

Notice we are making two major points.

• The inverse need not exist. For example, it may be the case that for every choice of B,
$$\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} B \neq I.$$
• We have to consider the left and right inverse separately. This is because matrices are not commutative!

Let's address the second issue first. Using the same algebraic shenanigans as in Lecture 5, we can prove that there is a unique matrix that is both the left and the right inverse, provided that the left and right inverse both exist.

Theorem. Let A be an $n \times n$ matrix. Suppose left and right inverses exist.
• The right inverse is unique: if
$$AB = I \quad \text{and} \quad AB' = I,$$
then
$$B = B'.$$
• The right inverse equals the left inverse: if
$$AB = I \quad \text{and} \quad CA = I,$$
then
$$B = C.$$

Proof:

Right inverse is unique:

Suppose there exist $B, B'$ such that
$$AB = I \quad \text{and} \quad AB' = I.$$
Then,
$$AB = AB'.$$
Since we also assumed that the left inverse of A exists, multiply both sides by a left inverse C:
$$C(AB) = C(AB').$$
Then by associativity,
$$(CA)B = (CA)B',$$
and since C is the left inverse of A, this reduces to
$$B = B'.$$

Right inverse equals left inverse:

Assume $B, C$ satisfy
$$AB = I \quad \text{and} \quad CA = I.$$
First,
$$B = IB.$$
Substituting,
$$IB = (CA)B.$$
By associativity,
$$(CA)B = C(AB) = C.$$
Thus,
$$B = C.$$
580 LECTURE 30. FLIRTING WITH INVERTING
Did we forget to prove that the left inverse is unique? Nope. Notice that this follows from the fact
that any left inverse must equal the right inverse, which was proven to be unique. Finally, we can
formally dene an inverse:
Denition. Let A be an n n matrix. If the left and right inverse both exist, the inverse of
A is the n n matrix A
1
that satises
AA
1
= A
1
A = I.
Now, we can explain the methodology behind our inverse computation. In truth, the augmented matrix

    [ 0 0 4 | 1 0 0 ]
    [ 0 1 4 | 0 1 0 ]
    [ 1 0 0 | 0 0 1 ]

was really shorthand for the system

    AX = I.

When we solved for X, we were multiplying both sides, on the left, by elementary matrices.

What is an elementary matrix? Last lecture, we used the matrices E_add, E_swap, E_scale:

    E_add:   the identity matrix with a single extra off-diagonal entry c,
    E_scale: the identity matrix with one diagonal entry replaced by c,
    E_swap:  the identity matrix with two rows interchanged.

An elementary matrix is simply a matrix of one of these forms. So we multiplied matrices on both sides to reduce the left-hand side of

    E_3 E_2 E_1 A X = E_3 E_2 E_1 I,

where

    E_3 = [ 1 0 0 ; 0 1 0 ; 0 0 .25 ],   E_2 = [ 0 0 1 ; 0 1 0 ; 1 0 0 ],   E_1 = [ 1 0 0 ; -1 1 0 ; 0 0 1 ],   A = [ 0 0 4 ; 0 1 4 ; 1 0 0 ],

to just

    X = E_3 E_2 E_1 I = [  0   0  1 ]
                        [ -1   1  0 ]
                        [ .25  0  0 ].

To reiterate, this computation required the assumption that the inverse exists, and for the umpteenth time,

The inverse need not exist!
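To make the bookkeeping concrete, here is a minimal Python sketch (my own, not from the text; it assumes NumPy is available) that performs the same three kinds of elementary row operations on the augmented block [A | I] and reads off X = A^{-1} for the 3 × 3 example above. If no usable pivot can be found, the routine gives up, matching the warning that the inverse need not exist.

import numpy as np

def inverse_by_row_reduction(A):
    """Gauss-Jordan elimination on [A | I]; returns A^{-1} or raises if A is singular."""
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])          # the augmented matrix [A | I]
    for col in range(n):
        pivot = np.argmax(np.abs(M[col:, col])) + col     # find a usable pivot row (E_swap)
        if np.isclose(M[pivot, col], 0.0):
            raise ValueError("matrix is not invertible")
        M[[col, pivot]] = M[[pivot, col]]                  # swap rows
        M[col] /= M[col, col]                              # scale the pivot row (E_scale)
        for row in range(n):
            if row != col:
                M[row] -= M[row, col] * M[col]             # clear the column (E_add)
    return M[:, n:]                                        # the right block is now A^{-1}

A = np.array([[0, 0, 4],
              [0, 1, 4],
              [1, 0, 0]])
X = inverse_by_row_reduction(A)
print(X)          # [[ 0. 0. 1.] [-1. 1. 0.] [0.25 0. 0.]]
print(A @ X)      # the identity, so AX = I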
Can we come up with a nice condition that guarantees that the inverse exists?

Absolutely. To derive the condition, we follow the first rule of Math Fight Club:

Math Mantra: You should always play around with new theorems and definitions!

We have already done this a million times:

- Applying the Cauchy-Schwarz inequality to derive new inequalities.
- Applying Rolle's Theorem to prove the Mean Value Theorem.
- Using the standard basis vectors in the definition of the directional derivative.
- etc., etc., etc.

Therefore, let's apply the determinant to the inverse definition,

    AA^{-1} = I,

to get

    det(AA^{-1}) = det(I),

which by our determinant properties is

    det(A) det(A^{-1}) = 1.

This means that if the inverse exists, then

    det(A) ≠ 0.

That's great. But it doesn't give us a condition on whether the inverse exists. However, consider the converse:

    If det(A) ≠ 0, then A^{-1} exists.

Remarkably, this is true!
30.3 Cofactor Expansions

Before we can prove the preceding assertion, we will need an unintuitive way to rewrite a determinant in terms of smaller matrices. Specifically, let A_{ij} denote the (n-1) × (n-1) sub-matrix of A with the i-th row and j-th column removed.

For any column number j, we can always rewrite the determinant as

    det(A) = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} det(A_{ij}).

Intuitively, you are choosing a column number and expanding the determinant along that column. So in the case of a 3 × 3 matrix

    [ a_11 a_12 a_13 ]
    [ a_21 a_22 a_23 ]
    [ a_31 a_32 a_33 ]

we can rewrite its determinant as an expansion along the 2nd column:

    - a_12 det[ a_21 a_23 ; a_31 a_33 ] + a_22 det[ a_11 a_13 ; a_31 a_33 ] - a_32 det[ a_11 a_13 ; a_21 a_23 ],

where each 2 × 2 matrix is the original 3 × 3 matrix with the corresponding row and column removed. We call this sum a cofactor expansion.
Theorem. For any column number j,

    det(A) = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} det(A_{ij}),

where A_{ij} is the sub-matrix of A with the i-th row and j-th column removed.

Proof Summary:

- Perform swaps to move the j-th component to the last component of D(a_1, a_2, ..., a_n).
- Use multilinearity on the last component of D(a_1, a_2, ..., a_{j-1}, a_{j+1}, ..., a_n, a_j) to expand it into a summation.
- For the i-th term of this summation, look at D(a_1, a_2, ..., a_{j-1}, a_{j+1}, ..., a_n, e_i) and perform swaps to move the i-th row to the last.
- In the last column, the only non-zero component is in the last row. Therefore, when you evaluate D directly, i_n = n and det(A_{ij}) pops out.
Proof: By definition of determinant,

    det(A) = D(a_1, a_2, ..., a_n),

where a_1, ..., a_n denote the columns of A. First, shift the j-th column, a_j, all the way to the right:

    D(a_1, ..., a_{j-1}, a_{j+1}, a_{j+2}, a_{j+3}, ..., a_n, a_j).

This requires n - j swaps, so we have

    det(A) = (-1)^{n-j} D(a_1, a_2, ..., a_{j-1}, a_{j+1}, ..., a_n, a_j).

Expand only the last component,

    det(A) = (-1)^{n-j} D( a_1, a_2, ..., a_{j-1}, a_{j+1}, ..., a_n, \sum_{i=1}^{n} a_{ij} e_i ),

and apply multilinearity:

    det(A) = (-1)^{n-j} \sum_{i=1}^{n} a_{ij} D(a_1, a_2, ..., a_{j-1}, a_{j+1}, ..., a_n, e_i).     (*)

Let's look at the D in the i-th term of this sum:

    D(a_1, a_2, ..., a_{j-1}, a_{j+1}, ..., a_n, e_i).

This is the determinant of a matrix whose last column has a 1 in the i-th row and zeros everywhere else. Applying the same trick we did with the columns, shift the i-th row to the last row. This requires n - i transpositions, so now we have

    D(a_1, ..., a_{j-1}, a_{j+1}, ..., a_n, e_i) = (-1)^{n-i} det(B),

where B is the resulting matrix: its first n - 1 rows are the rows of A other than the i-th (in their original order) with the j-th column deleted, its last row is ( a_{i1}, ..., a_{i(j-1)}, a_{i(j+1)}, ..., a_{in}, 1 ), and its last column is e_n.

Look at the determinant expansion of the matrix B:

    \sum_{permutations (i_1, i_2, ..., i_n) of (1, 2, ..., n)} (-1)^{N(i_1, i_2, ..., i_n)} b_{i_1 1} b_{i_2 2} ... b_{i_n n}.

The last column is e_n, so b_{i_n n} is zero unless i_n = n. This means

    i_1, i_2, ..., i_{n-1}

must be a permutation of the remaining numbers 1, 2, ..., n-1. Moreover, since n is the largest number,

    N(i_1, i_2, ..., i_{n-1}) = N(i_1, i_2, ..., i_{n-1}, n),

and our sum reduces to

    \sum_{permutations (i_1, ..., i_{n-1}) of (1, 2, ..., n-1)} (-1)^{N(i_1, ..., i_{n-1})} b_{i_1 1} b_{i_2 2} ... b_{i_{n-1}(n-1)},

which is just the determinant of B restricted to the first n - 1 rows and n - 1 columns.

Lo and behold, what is this? The i-th row is missing and the j-th column is missing. This is precisely the submatrix A_{ij}! Therefore,

    det(B) = det(A_{ij}),

and thus

    D(a_1, a_2, ..., a_{j-1}, a_{j+1}, ..., a_n, e_i) = (-1)^{n-i} det(A_{ij}).

Plugging back into (*),

    det(A) = (-1)^{n-j} \sum_{i=1}^{n} a_{ij} (-1)^{n-i} det(A_{ij}).

Of course, we can distribute the (-1)^{n-j} and combine

    (-1)^{n-i} (-1)^{n-j} = (-1)^{2n} ((-1)^{-1})^{i+j} = (-1)^{i+j},

so

    det(A) = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} det(A_{ij}).
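As a quick illustration (my own sketch, not part of the text), the column-expansion formula translates directly into a recursive routine in plain Python. It is hopelessly slow for large n, but it makes the formula concrete:

def minor(A, i, j):
    """The sub-matrix A_ij: delete row i and column j (0-indexed)."""
    return [row[:j] + row[j+1:] for r, row in enumerate(A) if r != i]

def det(A, j=0):
    """Cofactor expansion of det(A) along column j (0-indexed)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** (i + j) * A[i][j] * det(minor(A, i, j)) for i in range(n))

A = [[0, 0, 4],
     [0, 1, 4],
     [1, 0, 0]]
print(det(A, j=0), det(A, j=2))   # -4 from both expansions: the column chosen does not matter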
30.4 Constructing the Left Inverse

Armed with a neat way to write the determinant, we would like to use this new formula to explicitly construct an inverse. To do this, we will write the matrix

    [ det A    0      0    ...    0   ]
    [   0    det A    0    ...    0   ]
    [   0      0    det A  ...    0   ]
    [   :      :      :     .     :   ]
    [   0      0      0    ...  det A ]

as a product

    det(A) I = MA

for some matrix M. This means that, as long as det(A) ≠ 0,

    (1 / det(A)) M

is the left inverse of A.
Theorem. If det(A) ≠ 0, then there exists a left inverse of A.

Proof Summary:

- Define a two-input function

      F(r, c) = \sum_{i=1}^{n} (-1)^{i+r} a_{ic} det(A_{ir}).

- By the cofactor expansion theorem, F(c, c) = det(A).
- F(r, c) = 0 for r ≠ c:
  - Consider the matrix A' where the r-th column has been replaced by the c-th column.
  - Cofactor-expand A' along the c-th column and rewrite the expression as F(r, c) = 0.
- The matrix with components F(r, c) is det(A) I.
- This matrix can also be written as a product MA.
- (1 / det(A)) M is the left inverse of A.
Proof: We would like to construct a function F(r, c) that is det(A) only when the inputs are equal (and 0 otherwise):

    F(r, c) = { det(A)  if r = c
              { 0       otherwise.

This is because we want to rewrite det(A) I as the matrix of values of F:

    [ det A    0    ...    0   ]     [ F(1,1)  F(1,2)  ...  F(1,n) ]
    [   0    det A  ...    0   ]  =  [ F(2,1)  F(2,2)  ...  F(2,n) ]
    [   :      :     .     :   ]     [   :       :      .     :    ]
    [   0      0    ...  det A ]     [ F(n,1)  F(n,2)  ...  F(n,n) ].

But what should our magic function F be?

When we expanded the determinant, we chose the column number j. This is a function of one variable (though a boneheaded one that spits out det(A) for every j = 1, 2, ..., n):

    f(j) = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} det(A_{ij}).

But notice that j occurs multiple times in the above formula. So why not split the inputs:

    F(r, c) = \sum_{i=1}^{n} (-1)^{i+r} a_{ic} det(A_{ir})?

Immediately we have, when r = c,

    F(c, c) = det(A).

This is just the normal expansion formula for the choice j = c. But what happens when r ≠ c?

This takes some creativity: stare at the matrix A' where the r-th column of A has been REPLACED by the c-th column, so that the column a_c now appears both in position r and in position c:

    A' = [ a_1  a_2  ...  a_c  ...  a_c  ... ]
                         (pos r)   (pos c)

We know the determinant of A' is 0 (two columns are the same, duh). Therefore, when we do a cofactor expansion along the c-th column, we have

    0 = \sum_{i=1}^{n} (-1)^{i+c} a_{ic} det(A'_{ic}).

Take a look at A'_{ic} and compare it to A_{ir}: A'_{ic} still contains the copy of a_c sitting in position r, while A_{ir} contains a_c in its original position. So A'_{ic} has the same columns as A_{ir}, except in a different order! After performing some number of transpositions q, we have

    0 = \sum_{i=1}^{n} (-1)^{i+c} a_{ic} (-1)^q det(A_{ir}),

where (-1)^q det(A_{ir}) = det(A'_{ic}). Multiply both sides by (-1)^{r-c-q}. This replaces the power of -1:

    0 = \sum_{i=1}^{n} (-1)^{i+r} a_{ic} det(A_{ir}).

But the (RHS) is just F(r, c)! So when r ≠ c,

    F(r, c) = 0.

Now notice that F(r, c) is simply a matrix product involving the c-th column of A:

    F(r, c) = [ (-1)^{1+r} det(A_{1r})   (-1)^{2+r} det(A_{2r})   ...   (-1)^{n+r} det(A_{nr}) ] · ( a_{1c}, a_{2c}, a_{3c}, ..., a_{nc} )^T.

Therefore, the product

    MA = det(A) I,

where M is the matrix whose r-th row is

    [ (-1)^{1+r} det(A_{1r})   (-1)^{2+r} det(A_{2r})   ...   (-1)^{n+r} det(A_{nr}) ],

that is, the (r, i) entry of M is (-1)^{i+r} det(A_{ir}). Thus,

    (1 / det(A)) M

is the left inverse of A.
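Here is a small numerical sketch of the construction in the proof (mine, assuming NumPy): build the matrix M with entries M_{ri} = (-1)^{i+r} det(A_{ir}) and check that MA really is det(A) I, so that M / det(A) is a left inverse.

import numpy as np

def cofactor_matrix_M(A):
    """The M from the proof: M[r, i] = (-1)^(i+r) * det(A_ir), so that M @ A = det(A) * I.
    Indices here are 0-based, which leaves the parity of (-1)^(i+r) unchanged."""
    n = A.shape[0]
    M = np.empty((n, n))
    for r in range(n):
        for i in range(n):
            A_ir = np.delete(np.delete(A, i, axis=0), r, axis=1)   # remove row i, column r
            M[r, i] = (-1) ** (i + r) * np.linalg.det(A_ir)
    return M

A = np.array([[0., 0., 4.],
              [0., 1., 4.],
              [1., 0., 0.]])
M = cofactor_matrix_M(A)
print(np.round(M @ A, 10))                              # det(A) * I, here -4 * I
print(np.round((M / np.linalg.det(A)) @ A, 10))         # the identity: M / det(A) is a left inverse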
Did we finish proving

    If det(A) ≠ 0, then A^{-1} exists?

Absolutely not! We only proved that the left inverse exists. Remember, the inverse does not exist unless both the left inverse and the right inverse exist!

To complete the proof, we will need to prove that we can calculate the cofactor expansion of det(A) along a row. This requires proving one more fundamental fact about determinants.
30.5 Proving det(A) = det(A^T)

To the astute 51H veteran and careful reader: you probably noticed I completely sidestepped the proof that

    det(A) = det(A^T).

Instead of proving this fact first, I introduced elementary matrices and used the product property to justify row reduction of determinants. I did this because

- I wanted to explain the computation of inverses, which uses elementary matrices.
- I wanted to warn students:

Math Mantra: When consulting other sources, be careful of circularity!

In math, there are many different ways to introduce a subject. One source can

    Use Theorem A to prove Theorem B,

whereas another source can

    Use Theorem B to prove Theorem A.

If you mindlessly borrow from another source, you can accidentally

    Use Theorem A to prove Theorem A.

Complete and utter fail!

Particularly, you cannot use the following proof from Khan Academy:

Theorem (Bad Proof). For any n × n matrix A,

    det(A) = det(A^T).

Bad Proof: We proceed by induction on the matrix size n × n. The base case n = 1 is obvious. For the inductive step, cofactor expand det(A) along the first column:

    det(A) = a_{11} det(A_{11}) - a_{21} det(A_{21}) + ... + (-1)^{n+1} a_{n1} det(A_{n1}).

By the induction hypothesis, we know the (RHS) is

    a_{11} det((A_{11})^T) - a_{21} det((A_{21})^T) + ... + (-1)^{n+1} a_{n1} det((A_{n1})^T),

which is the same as

    a_{11} det((A^T)_{11}) - a_{21} det((A^T)_{12}) + ... + (-1)^{n+1} a_{n1} det((A^T)_{1n}).

But this is the formula for the cofactor expansion of det(A^T) along the first row:

    det(A^T) = a_{11} det((A^T)_{11}) - a_{21} det((A^T)_{12}) + ... + (-1)^{n+1} a_{n1} det((A^T)_{1n}).

However, we cannot use this proof because we will use det(A) = det(A^T) to derive the cofactor expansion row formula!
Instead, we give the following proof:

Theorem. For any n × n matrix A,

    det(A) = det(A^T).

Proof Summary:

- Rewrite the determinant definition as

      det(A) = \sum_{permutations σ = (i_1, i_2, ..., i_n) of (1, 2, ..., n)} (-1)^{N(j^σ_1, j^σ_2, ..., j^σ_n)} x_{1 j^σ_1} x_{2 j^σ_2} ... x_{n j^σ_n},

  where

      (j^σ_1, j^σ_2, ..., j^σ_n) = τ_k τ_{k-1} ... τ_1 (1, 2, ..., n)    for    σ = τ_1 τ_2 ... τ_k (1, 2, ..., n).

- Show every term of the preceding summation is contained in

      det(A^T) = \sum_{permutations (i_1, i_2, ..., i_n) of (1, 2, ..., n)} (-1)^{N(i_1, i_2, ..., i_n)} x_{1 i_1} x_{2 i_2} ... x_{n i_n},

  and vice versa.

Proof: Starting with the formula

    det(A) = \sum_{permutations (i_1, i_2, ..., i_n) of (1, 2, ..., n)} (-1)^{N(i_1, i_2, ..., i_n)} x_{i_1 1} x_{i_2 2} ... x_{i_n n},     (*)

the goal is to rewrite it into the form

    det(A^T) = \sum_{permutations (i_1, i_2, ..., i_n) of (1, 2, ..., n)} (-1)^{N(i_1, i_2, ..., i_n)} x_{1 i_1} x_{2 i_2} ... x_{n i_n}.

Particularly, we need to make the right indices of

    x_{i_1 1} x_{i_2 2} ... x_{i_n n}

appear on the left:

    x_{1 i_1} x_{2 i_2} ... x_{n i_n}.

Notice that (i_1, i_2, ..., i_n) is a permutation of (1, 2, ..., n), so we can always reorder x_{i_1 1} x_{i_2 2} ... x_{i_n n} to be

    x_{1 j_1} x_{2 j_2} ... x_{n j_n}.

But what are the new j's?

Consider again the left indices (i_1, i_2, ..., i_n). When we moved the i's around, we were really applying transpositions that restored the i's to the original ordering:

    τ_k τ_{k-1} ... τ_1 (i_1, i_2, ..., i_n) = (1, 2, ..., n).

At the same time, we were also applying these transpositions to the right components. They got tacked on for the ride:

    τ_k τ_{k-1} ... τ_1 (1, 2, ..., n) = (j_1, j_2, ..., j_n).

Now each term of (*) is of the form

    (-1)^{N(i_1, i_2, ..., i_n)} x_{1 j_1} x_{2 j_2} ... x_{n j_n}.

Moreover, we know

    N(j_1, j_2, ..., j_n) = N(i_1, i_2, ..., i_n).

This is because from

    τ_k τ_{k-1} ... τ_1 (1, 2, ..., n) = (j_1, j_2, ..., j_n),

we can apply transpositions to both sides to get

    τ_1 ... τ_{k-1} τ_k τ_k τ_{k-1} ... τ_1 (1, 2, ..., n) = τ_1 ... τ_{k-1} τ_k (j_1, j_2, ..., j_n).

Therefore,

    (1, 2, ..., n) = τ_1 ... τ_{k-1} τ_k (j_1, j_2, ..., j_n),

i.e., the number of transpositions needed to restore the j's to the correct ordering is also k. Thus, each term of (*) is of the form

    (-1)^{N(j_1, j_2, ..., j_n)} x_{1 j_1} x_{2 j_2} ... x_{n j_n}.

Note that (j_1, j_2, ..., j_n) is a function of the permutation σ = (i_1, i_2, ..., i_n), so to signify this fact, we write

    (j^σ_1, j^σ_2, ..., j^σ_n),

and thus (*) becomes

    det(A) = \sum_{permutations σ = (i_1, ..., i_n) of (1, 2, ..., n)} (-1)^{N(j^σ_1, ..., j^σ_n)} x_{1 j^σ_1} x_{2 j^σ_2} ... x_{n j^σ_n}.     (**)

I claim that this sum is exactly the same as

    det(A^T) = \sum_{permutations (i_1, ..., i_n) of (1, 2, ..., n)} (-1)^{N(i_1, ..., i_n)} x_{1 i_1} x_{2 i_2} ... x_{n i_n}.     (***)

First notice that every term in (**) appears in (***), since (j^σ_1, j^σ_2, ..., j^σ_n) is indeed a permutation. Moreover, every term in (***) appears in (**): consider a term

    (-1)^{N(q_1, q_2, ..., q_n)} x_{1 q_1} x_{2 q_2} ... x_{n q_n}

in (***). Then

    (q_1, q_2, ..., q_n) = t_s t_{s-1} ... t_1 (1, 2, ..., n)

for some transpositions t_i. If we choose

    σ = t_1 t_2 ... t_s (1, 2, ..., n),

we have

    (j^σ_1, j^σ_2, ..., j^σ_n) = t_s t_{s-1} ... t_1 (1, 2, ..., n),

and thus

    (q_1, q_2, ..., q_n) = (j^σ_1, j^σ_2, ..., j^σ_n).

Since each term of (**) is contained in (***) and vice versa, we conclude

    det(A) = det(A^T).
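This is of course only a sanity check, not a proof, but in the spirit of playing around with new theorems, a couple of random matrices (my own sketch, NumPy assumed) confirm the identity numerically:

import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    A = rng.standard_normal((4, 4))
    # det is invariant under transposition, as the permutation argument shows
    assert np.isclose(np.linalg.det(A), np.linalg.det(A.T))
print("det(A) == det(A^T) held on all random 4x4 samples")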
30.6 Constructing the Right Inverse

Consider the cofactor expansion of det(A^T) along a column. Using det(A) = det(A^T), we know this is the same as the cofactor expansion of det(A) across a row. Thus, we automatically have

Theorem. For any row number i,

    det(A) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} det(A_{ij}),

where A_{ij} is the sub-matrix of A with the i-th row and j-th column removed.
Now we can do a similar trick to prove a right inverse exists:

Theorem. If det(A) ≠ 0, then there exists a right inverse of A.

Proof Summary:

- Define a two-input function

      F(r, c) = \sum_{j=1}^{n} (-1)^{c+j} a_{rj} det(A_{cj}).

- By the cofactor expansion theorem, F(c, c) = det(A).
- F(r, c) = 0 for r ≠ c:
  - Consider the matrix A' where the c-th row has been replaced by the r-th row.
  - Cofactor-expand A' along the r-th row and rewrite the expression as F(r, c) = 0.
- The matrix with components F(r, c) is det(A) I.
- This matrix can also be written as a product AM.
- (1 / det(A)) M is the right inverse of A.
Proof: We do the same trick, but use the row expansion formula instead of the column expansion. Define

    F(r, c) = \sum_{j=1}^{n} (-1)^{c+j} a_{rj} det(A_{cj}).

Immediately we have, if r = c,

    F(c, c) = \sum_{j=1}^{n} (-1)^{c+j} a_{cj} det(A_{cj}) = det(A).

If r ≠ c, consider the matrix A' where the c-th row of A is replaced by the r-th row, so that the row A_r now appears both in position r and in position c:

    A' = [ A_1 ; A_2 ; ... ; A_r ; ... ; A_r ; ... ]
                            (pos r)    (pos c)

Again, since two rows are the same we know

    det(A') = 0.

Cofactor-expand along the r-th row to get

    0 = det(A') = \sum_{j=1}^{n} (-1)^{r+j} a_{rj} det(A'_{rj}).

Compare A'_{rj} to A_{cj}: A'_{rj} still contains the copy of the r-th row sitting in position c, while A_{cj} contains the r-th row in its original position. So A_{cj} has the same rows as A'_{rj}, except in a different order! After performing some number of transpositions q, we have

    0 = \sum_{j=1}^{n} (-1)^{r+j} a_{rj} (-1)^q det(A_{cj}),

where (-1)^q det(A_{cj}) = det(A'_{rj}). Multiplying both sides by (-1)^{c-r-q}, we can replace the power of -1:

    0 = \sum_{j=1}^{n} (-1)^{c+j} a_{rj} det(A_{cj}).

But this is just F(r, c)! Therefore, when r ≠ c,

    F(r, c) = 0.

Because F(r, c) is simply a matrix product involving the r-th row of A,

    F(r, c) = [ a_{r1}  a_{r2}  a_{r3}  ...  a_{rn} ] · ( (-1)^{c+1} det(A_{c1}), (-1)^{c+2} det(A_{c2}), (-1)^{c+3} det(A_{c3}), ..., (-1)^{c+n} det(A_{cn}) )^T,

we conclude that the product

    AM = det(A) I,

where M is the matrix whose c-th column is

    ( (-1)^{c+1} det(A_{c1}), (-1)^{c+2} det(A_{c2}), ..., (-1)^{c+n} det(A_{cn}) )^T,

that is, the (j, c) entry of M is (-1)^{c+j} det(A_{cj}). Thus,

    (1 / det(A)) M

is the right inverse of A. (Notice our M is the same as in the left inverse proof, as it should be.)

Since det(A) ≠ 0 implies a left and a right inverse exist, we can finally conclude:

Theorem. If det(A) ≠ 0, then there exists an inverse of A.
New Notation

    Symbol    Reading                                     Example        Example Translation
    A^{-1}    The inverse of A.                           AA^{-1} = I.   The product of A with its inverse is the identity.
    A_{ij}    The sub-matrix of A with the i-th row       A_{23}         The sub-matrix of A with the 2nd row and 3rd
              and j-th column removed.                                   column removed.
Lecture 31

Gram-Schmidt Style

    Heeeeeeeeeeeyyyyyy, Sexy Basis!
    Or, Or, Or, Or, OrthoGonal Style!

Goals: Today, we give the definition of an orthonormal basis and prove basic properties. Then, we use the Gram-Schmidt process to prove that every vector space can be written as the span of an orthonormal basis. We will use this fact in the proof of the Spectral Theorem.
31.1 Extra Structure

In our discussion on vector spaces, we mentioned that bases are not unique. For example, consider

    V = span{ (1,1,1,1,0)^T, (0,1,1,1,0)^T, (0,1,1,1,1)^T }.

The sets

    { (1,0,0,0,0)^T, (0,1,1,1,0)^T, (0,0,0,0,1)^T },
    { (1,0,0,0,1)^T, (0,1,1,1,0)^T, (0,0,0,0,1)^T },
    { (1,1,1,1,0)^T, (0,0,0,0,1)^T, (0,1,1,1,1)^T }

are all bases for V. In fact, there are infinitely many choices of bases at our disposal.

A natural question to ask is

    What is the best basis we can choose?

Before we can answer this question, we should ask ourselves,

    Why are we even bothering with choosing a different basis?

If it's not broken, why fix it? Does it even make a difference what basis we choose as long as they all span V?

It absolutely does make a difference. Generally,

Math Mantra: Suppose we can always rewrite our objects in some form that has additional structure. Then we can EXPLOIT this extra structure in our proofs.

A good example of this is in Number Theory. By the Fundamental Theorem of Arithmetic, given a natural number (≥ 2), we can always find a unique prime factorization:

    n = p_1^{α_1} p_2^{α_2} ... p_n^{α_n}.

We can focus on this form and exploit the properties of its prime components.

With subspaces, we will prove that we can always find a basis with a special structure. And next lecture, we will exploit this special structure to prove the infamous Spectral Theorem.
31.2 The Best Basis

Going back to the original question, we must first clarify what we mean by best. What properties should the best basis have? To solve this,

Math Mantra: Look for an object with nice properties and try to GENERALIZE it.

What is the best basis in the world? The standard basis, of course!

    { e_1, e_2, ..., e_n }

In particular, the incredible property we have used a billion times (at least) is that we can instantly write any vector as a linear combination of the standard basis vectors:

    v = v_1 e_1 + v_2 e_2 + ... + v_n e_n.

Calculating the scaling coefficients is completely trivial!

But what if I gave you the subspace

    V = span{ (47,43,41,37,31)^T, (29,23,19,17,13)^T, (11,7,5,3,2)^T }

and asked you to solve for the scaling coefficients c_1, c_2, c_3 of

    (91,90,81,80,64)^T = c_1 (47,43,41,37,31)^T + c_2 (29,23,19,17,13)^T + c_3 (11,7,5,3,2)^T?

This is not obvious! Instead of an immediate answer, you would have to solve a system of the form Ax = b.

Luckily, it is indeed possible to rewrite the basis in such a way that, for any vector in that space, we can easily solve for the corresponding linear combination of basis vectors. To do so, we study the standard basis even further. Namely, we focus on the properties that

- The norm of each vector in the basis is 1.
- The dot product of any two distinct vectors is 0.

We call such a set of vectors orthonormal.
Definition. A set of vectors

    { u_1, u_2, ..., u_n }

is orthonormal if, for any two vectors in this set,

    u_i · u_j = { 1  if i = j
                { 0  if i ≠ j.

Remarkably, if the basis for V is orthonormal, then for any vector v ∈ V, we can easily solve for the linear combination of basis vectors that equals v. Specifically, let

    v = c_1 u_1 + c_2 u_2 + ... + c_n u_n.

To solve for the i-th coefficient c_i, all we need to do is take the dot product of v with the i-th basis vector u_i.

Theorem. Let

    { u_1, u_2, ..., u_k }

be an orthonormal basis for V ⊆ R^n. Then for any v ∈ V,

    v = c_1 u_1 + c_2 u_2 + ... + c_k u_k,

where

    c_i = v · u_i.
Proof Summary:

- Expand v as some arbitrary linear combination of the basis vectors u_i.
- Take the dot product with u_i and apply orthonormality.

Proof: By the definition of basis, there are some constants c_1, ..., c_k such that

    v = c_1 u_1 + c_2 u_2 + ... + c_k u_k.

To find the i-th coefficient, take the dot product of both sides with u_i:

    v · u_i = (c_1 u_1 + c_2 u_2 + ... + c_k u_k) · u_i,

and distribute:

    v · u_i = c_1 u_1 · u_i + c_2 u_2 · u_i + ... + c_k u_k · u_i.

By definition of an orthonormal basis, only u_i · u_i is non-zero, so every term on the right vanishes except c_i u_i · u_i = c_i. Thus

    c_i = v · u_i.
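A tiny illustration of the theorem (my own, with NumPy and an orthonormal basis chosen just for the example): the coefficients of any vector in the span really are just dot products.

import numpy as np

# An orthonormal basis for a 2-dimensional subspace V of R^3 (illustrative choice).
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)

# Any v in V is a combination of u1, u2; the coefficients are recovered by dot products.
v = 3.0 * u1 - 2.0 * u2
c1, c2 = v @ u1, v @ u2                  # c_i = v . u_i, as the theorem says
print(c1, c2)                            # 3.0, -2.0 (up to rounding)
print(np.allclose(v, c1 * u1 + c2 * u2))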
We just showed that for any v ∈ V,

    v = \sum_{i=1}^{k} (v · u_i) u_i,

where the u_i are members of an orthonormal basis.

That's cool. But it gets even better! For any x in the entire space R^n, the summation

    \sum_{i=1}^{k} (x · u_i) u_i

is actually the projection of x onto V!

Theorem. Let

    { u_1, u_2, ..., u_k }

be an orthonormal basis for V ⊆ R^n. Then, for any x ∈ R^n,

    P_V(x) = \sum_{i=1}^{k} (x · u_i) u_i.
Proof Summary:

- By definition of a projection map, we must show

      \sum_{i=1}^{k} (x · u_i) u_i ∈ V    and    x - \sum_{i=1}^{k} (x · u_i) u_i ∈ V^⊥.

- The first inclusion follows from closure.
- To prove the second, take the dot product with each u_i and apply orthonormality.

Proof: Recall that P_V(x) is uniquely defined by the property that, for any x,

    P_V(x) ∈ V    and    x - P_V(x) ∈ V^⊥.

Therefore, to show that

    f(x) = \sum_{i=1}^{k} (x · u_i) u_i

is the projection P_V(x), we just have to show that f(x) satisfies this projection property. Automatically,

    \sum_{i=1}^{k} (x · u_i) u_i ∈ V

by closure. So we only need to show

    x - \sum_{i=1}^{k} (x · u_i) u_i ∈ V^⊥.

Equivalently, we need to check

    ( x - \sum_{i=1}^{k} (x · u_i) u_i ) · u_j = 0

for any basis vector u_j. Distribute the dot product on the (LHS) and again use orthonormality to kill terms: every term (x · u_i) u_i · u_j with i ≠ j is 0, and the i = j term is x · u_j. This leaves

    x · u_j - x · u_j = 0.
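The formula translates into a one-line projection routine. The following sketch (my own, assuming NumPy) checks the defining property that x - P_V(x) is orthogonal to every basis vector:

import numpy as np

def project(x, U):
    """P_V(x) = sum_i (x . u_i) u_i, where the columns of U form an orthonormal basis of V."""
    return sum((x @ u) * u for u in U.T)

u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)
U = np.column_stack([u1, u2])

x = np.array([2.0, 5.0, -1.0])
p = project(x, U)
print(p)                                                               # the projection of x onto V
print(np.isclose((x - p) @ u1, 0.0), np.isclose((x - p) @ u2, 0.0))    # x - P_V(x) lies in V-perp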
31.3 Gram-Schmidt Process

As you can see, orthonormal bases are awesome. But how do you convert any basis into an orthonormal basis? The process is actually pretty simple.

Suppose you had finished writing a book and you wanted to edit it. In particular, you wanted to make sure that

- No chapter overlaps with any material from the previous chapters.
- Each chapter is completely edited.

You can do the following procedure:

- Start with Chapter 1 and edit it to get Edited Chapter 1.
- With Chapter 2, look for any overlap of material with Edited Chapter 1 and remove it. Then edit the remains to get Edited Chapter 2.
- For Chapter 3, look for any overlap of material with Edited Chapter 1 and Edited Chapter 2 and remove it. Then edit the remains to get Edited Chapter 3.
- Generally, we remove all the overlap of Chapter j with the previous Edited Chapters 1, 2, ..., j-1. Then we edit what is left over to get Edited Chapter j.

After we go through all the chapters, our book is completely edited with no overlapping material.

Simple, right? The process to convert a basis into an orthonormal basis follows exactly the same reasoning!

Given some basis for V:

    { v_1, v_2, ..., v_k }

- Start with v_1 and normalize it to get vector u_1.
- From v_2, subtract the projection of v_2 onto span{u_1}. Normalize the difference to get vector u_2.
- From v_3, subtract the projection of v_3 onto span{u_1, u_2}. Normalize the difference to get vector u_3.
- Generally, from v_j, subtract the projection of v_j onto span{u_1, u_2, ..., u_{j-1}}. Normalize the difference to get vector u_j.

When this process terminates, we have a set

    { u_1, u_2, ..., u_k }.

It turns out that this set is an orthonormal basis for V!
Theorem (Gram-Schmidt Process). Let

    { v_1, v_2, ..., v_k }

be a basis for V. Set

    u_1 = v_1 / ||v_1||

and for i > 1, define

    u_i = ( v_i - P_{span{u_1, u_2, ..., u_{i-1}}}(v_i) ) / || v_i - P_{span{u_1, u_2, ..., u_{i-1}}}(v_i) ||.

Then the set

    { u_1, u_2, ..., u_k }

is an orthonormal basis for V.
Proof Summary:

- Spanning and Existence (Induction):
  - Base Case: Obvious.
  - Inductive Step, Existence: Suppose not. Then

        v_i - P_{span{u_1, u_2, ..., u_{i-1}}}(v_i) = 0.

    Use projection properties and the inductive hypothesis to show

        v_i ∈ span{v_1, v_2, ..., v_{i-1}}.

    This contradicts that the v's form a basis.
  - Inductive Step, Spanning: By the inductive hypothesis, it suffices to prove

        u_i ∈ span{v_1, v_2, ..., v_i}    and    v_i ∈ span{u_1, u_2, ..., u_i}.

    Expand the definition of u_i in each case and argue by closure.
- Basis: We know the dimension is k and we have k spanning vectors.
- Orthonormal:
  - In our construction, we divide by the norm in each step. So ||u_i|| = 1.
  - For i < j, expand the higher index term of u_i · u_j and look at the numerator

        u_i · v_j - u_i · P_{span{u_1, u_2, ..., u_{j-1}}}(v_j).

    Use the swapping property of projections to rewrite this as

        u_i · v_j - P_{span{u_1, u_2, ..., u_{j-1}}}(u_i) · v_j.

    But this is 0 since P_{span{u_1, u_2, ..., u_{j-1}}}(u_i) = u_i.
Proof: We have to be really careful! This is because the above construction may not make sense! Namely, we have to make sure we never divide by 0!

Instead of immediately assuming that u_1, u_2, ..., u_k already exist, we are going to do induction on each step of the construction and show that the next u_i exists. For this to work, we interweave it with our spanning proof.

Spanning and Existence

We do induction on the i-th step of the construction process to prove the property

    Q(i):  u_1, u_2, ..., u_i exist (if you understand strong induction, you can simplify this to "u_i exists") and
           span{v_1, v_2, ..., v_i} = span{u_1, u_2, ..., u_i},

for i ≤ k.

Base Case, i = 1

Since v_1 is a member of a basis, v_1 ≠ 0, so

    u_1 = v_1 / ||v_1||

exists and

    span{v_1} = span{u_1}.

Thus, Q(1) is true.

Inductive Step, Existence

Assume Q(i-1): u_1, u_2, ..., u_{i-1} exist and

    span{v_1, v_2, ..., v_{i-1}} = span{u_1, u_2, ..., u_{i-1}}.

First, we need to show u_i exists. Suppose not. That means

    || v_i - P_{span{u_1, u_2, ..., u_{i-1}}}(v_i) || = 0.

Then the argument of the norm must be zero. This implies

    v_i = P_{span{u_1, u_2, ..., u_{i-1}}}(v_i).

Remember, by the definition of projection, this means

    v_i ∈ span{u_1, u_2, ..., u_{i-1}}.

By our induction hypothesis,

    v_i ∈ span{v_1, v_2, ..., v_{i-1}}.

But we assumed the v's formed a basis, a contradiction. Therefore, u_i exists.

Inductive Step, Spanning

By our inductive hypothesis, to prove

    span{v_1, v_2, ..., v_i} = span{u_1, u_2, ..., u_i},

we only need to show that the new vectors satisfy

    u_i ∈ span{v_1, v_2, ..., v_i}    and    v_i ∈ span{u_1, u_2, ..., u_i}.

To prove the first set inclusion, consider

    u_i = ( v_i - P_{span{u_1, u_2, ..., u_{i-1}}}(v_i) ) / || v_i - P_{span{u_1, u_2, ..., u_{i-1}}}(v_i) ||

and use the inductive hypothesis to rewrite the numerator as

    v_i - P_{span{v_1, ..., v_{i-1}}}(v_i),

a difference of two vectors in span{v_1, ..., v_i}. By closure,

    u_i ∈ span{v_1, v_2, ..., v_i}.

To prove the second set inclusion, consider again the definition of u_i and isolate v_i:

    v_i = || v_i - P_{span{u_1, u_2, ..., u_{i-1}}}(v_i) || u_i + P_{span{u_1, u_2, ..., u_{i-1}}}(v_i),

where both terms on the right lie in span{u_1, u_2, ..., u_i}. By closure,

    v_i ∈ span{u_1, u_2, ..., u_i}.

Basis

Since the dimension of V is k and we proved

    span{v_1, v_2, ..., v_k} = span{u_1, u_2, ..., u_k},

we automatically know

    { u_1, u_2, ..., u_k }

is a basis by our basis properties.

Orthonormal

Because we divide by the norm in each step, we automatically know that u_1, u_2, ..., u_k are unit vectors. Therefore, we only need to argue that they are pairwise orthogonal.

Let i < j and consider

    u_i · u_j.

Expand the definition of only the vector with the higher index (i.e., u_j):

    u_i · ( ( v_j - P_{span{u_1, u_2, ..., u_{j-1}}}(v_j) ) / || v_j - P_{span{u_1, u_2, ..., u_{j-1}}}(v_j) || ).

Now we need only show

    u_i · ( v_j - P_{span{u_1, u_2, ..., u_{j-1}}}(v_j) ) = 0.

Distribute the dot product,

    u_i · v_j - u_i · P_{span{u_1, u_2, ..., u_{j-1}}}(v_j),

and use the swapping property of projections over dot products to get

    u_i · v_j - P_{span{u_1, u_2, ..., u_{j-1}}}(u_i) · v_j.

But i < j, implying

    u_i ∈ span{u_1, u_2, ..., u_{j-1}}.

Thus, the projection map simply spits out u_i, and so the expression becomes

    u_i · v_j - u_i · v_j = 0.
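The construction in the theorem is easy to run. Here is a minimal sketch (my own, assuming NumPy) applied to the basis from Section 31.1; the zero-norm check mirrors the existence step of the proof:

import numpy as np

def gram_schmidt(vectors):
    """Turn a basis (list of 1-D arrays) into an orthonormal basis of the same span."""
    ortho = []
    for v in vectors:
        # subtract the projection of v onto the span of the u's found so far ...
        w = v - sum((v @ u) * u for u in ortho)
        norm = np.linalg.norm(w)
        if np.isclose(norm, 0.0):
            raise ValueError("input vectors are linearly dependent")
        ortho.append(w / norm)          # ... then normalize the difference
    return ortho

basis = [np.array([1.0, 1.0, 1.0, 1.0, 0.0]),
         np.array([0.0, 1.0, 1.0, 1.0, 0.0]),
         np.array([0.0, 1.0, 1.0, 1.0, 1.0])]
U = gram_schmidt(basis)
print(np.round(np.array(U) @ np.array(U).T, 10))   # the Gram matrix is the identity: orthonormal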
Lecture 32

Spooky Spectral Theorem

    Right now I'm looking at you and I can't believe,
    I now know, oh oh,
    I now know you're beautiful.
    Whoa, oh oh,
    But that's what makes math beautiful!
    - Same Direction

Goals: First, we introduce eigen-decompositions and their applications. We then show how to find such a decomposition: to do this, we introduce eigenvectors and eigenvalues. Unfortunately, not all square matrices have such a decomposition. Thus, we give two different sufficient conditions that guarantee the existence of this decomposition. The first condition is that the matrix has distinct eigenvalues. The second is that the matrix is symmetric (Spectral Theorem).
32.1 Rewriting Matrices

Last lecture, we talked about putting a basis in a nice form. Particularly, we can always rewrite a vector space as a span of an orthonormal basis. But how about matrices? Do we have nice ways to rewrite them?

Absolutely! And our method of decomposition follows the same philosophy as prime factorization:

    n = p_1^{α_1} p_2^{α_2} ... p_n^{α_n}.

Here, we break n into a product of numbers that have nice properties (namely, being prime). Likewise, we can decompose a matrix into a product of matrix factors, where each factor has a nice property. Here are a few famous matrix decompositions:

- QR Factorization
- LU Decomposition
- Singular Value Decomposition
- Jordan Canonical Form
- Eigen-decomposition

The one we will focus on today will be the eigen-decomposition. Namely, we can break a matrix A into

    A = SDS^{-1},

where S is invertible and D is diagonal (remember, we say a matrix D = (d_{ij}) is diagonal if all of its entries outside the main diagonal are zero, i.e. d_{ij} = 0 when i ≠ j). But there is a catch. Not all matrices have an eigen-decomposition!

Definition. We call an n × n matrix A diagonalizable if there exists an invertible matrix S and a diagonal matrix D such that

    A = SDS^{-1}.

In this case, we say that SDS^{-1} is the eigen-decomposition of A.

But why in the world would we care about eigen-decompositions?

This decomposition is incredibly important, especially when you hit Math 53H. In particular, you will use this method to solve linear systems of differential equations of the form

    x'(t) = Ax,

where A is some fixed matrix (FYI, this is simply the multivariable extension of the differential equation dx/dt = ax).

However, Math 53H is at least another 5 months away, so here is a more accessible application: suppose you wanted to compute the product of A multiplied by itself n times:

    A^n = A · A · A · ... · A   (n times).

Normally, this takes quite a bit of work. But suppose A is diagonalizable:

    A = SDS^{-1}.

Then

    A^n = (SDS^{-1})(SDS^{-1}) ... (SDS^{-1})   (n times).
Cancelling each of the inner S
1
S terms we get
A
n
= SD
n
S
1
.
And the power of D in the middle is easy to deal with. In general, taking powers of a diagonal matrix
is very easy! Simply take the corresponding powers of the diagonal entries!
Example. We can easily calculate
A
100
where
A =
_
4 2
1 1
_
We can nd an eigen-decomposition for A:
_
4 2
1 1
_
. .
A
=
_
1 2
1 1
_
. .
S
_
2 0
0 3
_
. .
D
_
1 2
1 1
_
. .
S
1
.
Then,
A
100
=
_
1 2
1 1
_
. .
S
_
2
100
0
0 3
100
_
. .
D
100
_
1 2
1 1
_
. .
S
1
=
_
2 3
100
2
100
2 3
100
2
101
2
100
3
100
2
101
3
100
_
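A quick numerical check of this example (my own sketch, using NumPy and the matrices as reconstructed above): the eigen-decomposition reproduces A, and powering only the diagonal reproduces A^{100}.

import numpy as np

A = np.array([[4.0, 2.0],
              [-1.0, 1.0]])
S = np.array([[-1.0, -2.0],
              [ 1.0,  1.0]])                  # columns: eigenvectors for eigenvalues 2 and 3
D = np.diag([2.0, 3.0])
S_inv = np.linalg.inv(S)

print(np.allclose(A, S @ D @ S_inv))                             # A = S D S^{-1}
A100_eigen  = S @ np.diag([2.0 ** 100, 3.0 ** 100]) @ S_inv      # power only the diagonal entries
A100_direct = np.linalg.matrix_power(A, 100)
print(np.allclose(A100_eigen, A100_direct))
print(A100_eigen[0, 0], 2 * 3.0 ** 100 - 2.0 ** 100)             # matches the formula above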
But don't think that eigen-decomposition is only useful for practical calculations. We can also use it to prove cool theoretical results. For example, eigen-decompositions can be used to derive an explicit formula for the Fibonacci numbers.

Example. Consider the Fibonacci sequence

    F_n = { 0                   if n = 0
          { 1                   if n = 1
          { F_{n-1} + F_{n-2}   if n > 1.

Then the n-th Fibonacci number is explicitly

    F_n = (1/√5) [ ((1 + √5)/2)^n - ((1 - √5)/2)^n ].

The key idea is to consider vectors of consecutive Fibonacci numbers

    (F_1, F_0)^T, (F_2, F_1)^T, (F_3, F_2)^T, (F_4, F_3)^T, (F_5, F_4)^T, ..., (F_{i+1}, F_i)^T, ...

By the Fibonacci relation, to go from one pair to the next we just need to do a matrix multiplication:

    [ 1 1 ] [ F_{i+1} ]  =  [ F_{i+2} ]
    [ 1 0 ] [ F_i     ]     [ F_{i+1} ].

In particular, starting from

    (F_1, F_0)^T = (1, 0)^T,

we can repeatedly multiply on the left by [ 1 1 ; 1 0 ] to get (F_{n+1}, F_n)^T. In fact, we only need to do this n times:

    [ 1 1 ]^n [ 1 ]  =  [ F_{n+1} ]     (*)
    [ 1 0 ]   [ 0 ]     [ F_n     ].

Using the eigen-decomposition

    [ 1 1 ]  =  S D S^{-1},    S = [ (1-√5)/2   (1+√5)/2 ],   D = [ (1-√5)/2      0     ],   S^{-1} = [ -1/√5    (1+√5)/(2√5) ]
    [ 1 0 ]                        [     1          1    ]        [     0     (1+√5)/2  ]            [  1/√5   -(1-√5)/(2√5) ],

we can simplify the left side of (*) as S D^n S^{-1} (1, 0)^T. Multiplying from the right,

    S^{-1} (1, 0)^T = ( -1/√5, 1/√5 )^T,

then applying D^n gives

    ( -(1/√5) ((1-√5)/2)^n,  (1/√5) ((1+√5)/2)^n )^T,

and finally applying S gives

    [ (1/√5) ( ((1+√5)/2)^{n+1} - ((1-√5)/2)^{n+1} ) ]
    [ (1/√5) ( ((1+√5)/2)^n     - ((1-√5)/2)^n     ) ].

Therefore, (*) says

    [ (1/√5) ( ((1+√5)/2)^{n+1} - ((1-√5)/2)^{n+1} ) ]  =  [ F_{n+1} ]
    [ (1/√5) ( ((1+√5)/2)^n     - ((1-√5)/2)^n     ) ]     [ F_n     ].

Equating the second component,

    F_n = (1/√5) [ ((1 + √5)/2)^n - ((1 - √5)/2)^n ].
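The closed form is easy to test against the recurrence. A small sketch (my own, plain Python):

import math

def fib_closed_form(n):
    """The explicit formula derived from the eigen-decomposition above."""
    sqrt5 = math.sqrt(5)
    phi, psi = (1 + sqrt5) / 2, (1 - sqrt5) / 2
    return round((phi ** n - psi ** n) / sqrt5)

def fib_recurrence(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(all(fib_closed_form(n) == fib_recurrence(n) for n in range(40)))   # True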
32.2 Eigenvectors

Hopefully, you are convinced that eigen-decompositions are useful. But,

    How do you even calculate the eigen-decomposition (assuming that it exists)?

For that, we need a discussion of eigenvectors and eigenvalues.

Definition. Let v be a non-zero vector. Then we say that v is an eigenvector of A corresponding to eigenvalue λ ∈ R if

    Av = λv.

Eigenvectors and eigenvalues have a very simple geometric interpretation. Intuitively, eigenvectors are the vectors that have the same direction after multiplication by A. The eigenvalue is just the scale factor.

Suppose you can find eigenvectors v_1, v_2, ..., v_n corresponding to eigenvalues λ_1, λ_2, ..., λ_n, respectively, such that v_1, v_2, ..., v_n are linearly independent. Then, when we concatenate the v's into a single matrix and multiply on the left by A,

    A [ v_1  v_2  ...  v_n ] = [ Av_1  Av_2  ...  Av_n ].
By the definition of eigenvectors, the (RHS) is

    [ λ_1 v_1   λ_2 v_2   ...   λ_n v_n ].

After pulling out a diagonal matrix

    D = diag(λ_1, λ_2, λ_3, ..., λ_n),

we can rewrite our equality as

    A S = S D,    where S = [ v_1  v_2  ...  v_n ].

Because the columns (eigenvectors) v_1, ..., v_n are linearly independent, S is invertible! That means we can multiply both sides on the right by S^{-1} to get

    A = SDS^{-1}.

AWESOME!

But how do we find the eigenvectors and eigenvalues?

The trick is to solve for the eigenvalues first. To do this, we use the determinant to reduce the problem to solving for the roots of a polynomial.
Theorem. There exists an eigenvector v of A corresponding to eigenvalue λ if and only if

    det(A - λI) = 0.

Proof Summary:

- Rewrite as (A - λI)v = 0.
- v must be non-zero, so (A - λI) is non-invertible.
- (A - λI) non-invertible is equivalent to det(A - λI) = 0.

Proof: Suppose there exist v and λ such that

    Av = λv.

Equivalently,

    Av - λv = 0.

Pulling out the v, we have

    (A - λI)v = 0.

This means (A - λI) is non-invertible! Why? Suppose (A - λI)^{-1} did exist. Then multiplying both sides on the left by (A - λI)^{-1},

    (A - λI)^{-1}(A - λI)v = (A - λI)^{-1} 0,

would give us v = 0. However, v is an eigenvector and eigenvectors are non-zero by definition, a contradiction.

By determinant properties, non-invertibility of (A - λI) is equivalent to

    det(A - λI) = 0.

Now, the existence of eigenvalues and eigenvectors is equivalent to solving

    det(A - λI) = 0.

Observe that this is an n-degree polynomial in terms of the variable λ:

    λ^n - b_{n-1} λ^{n-1} + ... + b_2 λ^2 + b_1 λ + b_0.

By solving for the roots of this polynomial, you can get values for λ and plug them back into

    (A - λI)v = 0.

But we know how to solve a system of this form!

Example. Solve for all eigenvectors of

    A = [  4  2 ]
        [ -1  1 ].
Expand

    det(A - λI) = det [ 4-λ    2  ]  =  0
                      [ -1   1-λ  ]

to get

    λ^2 - 5λ + 6 = 0.

Therefore, λ = 2 or λ = 3. These are the eigenvalues.

λ = 2:

Plugging into (A - λI)v = 0,

    [  2  2 ] [ v_1 ]  =  [ 0 ]
    [ -1 -1 ] [ v_2 ]     [ 0 ].

Solving,

    v_1 = -v_2,

so

    v = [ -v_2 ]  =  v_2 [ -1 ]
        [  v_2 ]         [  1 ].

Since v_2 can be any real number, we conclude that any vector of the form

    t [ -1 ]
      [  1 ]

is an eigenvector of A corresponding to the eigenvalue λ = 2.

λ = 3:

Plugging into (A - λI)v = 0,

    [  1  2 ] [ v_1 ]  =  [ 0 ]
    [ -1 -2 ] [ v_2 ]     [ 0 ].

Solving,

    v_1 = -2v_2.

This allows us to reduce:

    v = [ -2v_2 ]  =  v_2 [ -2 ]
        [  v_2  ]         [  1 ].

Thus, any vector of the form

    t [ -2 ]
      [  1 ]

is an eigenvector of A corresponding to the eigenvalue λ = 3.
If you go back to the first example of this lecture, this explains how we found the eigen-decomposition

    A = [  4  2 ]  =  S D S^{-1},    S = [ -1 -2 ],   D = [ 2 0 ],   S^{-1} = [  1  2 ]
        [ -1  1 ]                        [  1  1 ]        [ 0 3 ]             [ -1 -1 ].

We just used the most obvious eigenvectors corresponding to the eigenvalues 2 and 3, respectively.
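In practice you would let a library do this computation. A short sketch (my own, assuming NumPy) for the matrix above; the library returns unit-length eigenvectors, which are just scalar multiples of the ones we found by hand:

import numpy as np

A = np.array([[4.0, 2.0],
              [-1.0, 1.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                            # 2 and 3, the roots of lambda^2 - 5*lambda + 6
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))        # each returned column v satisfies A v = lambda v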
32.3 Seeking Sufficiency

Even though we now know how to compute eigenvectors and eigenvalues, an arbitrary matrix is not necessarily diagonalizable!

Take another look at the eigen-decomposition process. For this to work, we required that the eigenvectors v_1, v_2, ..., v_n are linearly independent. But it may be impossible to find such a set! So what do we do?

Math Mantra: If you want some property to hold, look for sufficient conditions that GUARANTEE it does.

Luckily, there do exist sufficient conditions that will guarantee that we can find a linearly independent set of eigenvectors. Today, we will focus on two conditions:

- A has n distinct eigenvalues.
- A is symmetric.

The first is a pretty easy proof, involving a simple algebraic shenanigan. The second will involve a ton more work. But each will use induction.

NOTE: In both proofs, the n = 1 case is immediate. I will note this in the proof summary. However, I will prove the case n = 2 so you can see the main ideas.

Theorem. Let A be an n × n matrix. If A has n distinct eigenvalues, then there exists a set of n linearly independent eigenvectors of A.

Proof Summary:

- Base Case k = 1: Immediate.
- Inductive Step:
  - Suppose a linear combination of k + 1 of the eigenvectors equals 0. We want to show that each coefficient c_i = 0.
  - Multiply this linear combination by the (k+1)-th eigenvalue λ_{k+1} to get one equation.
  - Multiply the original linear combination by A to get a second equation.
  - Subtract the two equations to kill the v_{k+1} term.
  - Use the inductive hypothesis and the distinctness of the eigenvalues to show c_1 = c_2 = ... = c_k = 0.
  - Conclude c_{k+1} = 0.

Proof: We will use induction to prove:

    P(k): If A has eigenvectors v_1, v_2, ..., v_k such that their corresponding λ_i are all distinct, then v_1, v_2, ..., v_k are linearly independent.

For then P(n) yields the theorem.

Base Case, k = 2

Consider two eigenvectors v_1 and v_2 corresponding to distinct eigenvalues λ_1 and λ_2, respectively. Assume

    c_1 v_1 + c_2 v_2 = 0.     (*)

Multiplying both sides of (*) by λ_2 yields

    c_1 λ_2 v_1 + c_2 λ_2 v_2 = 0.

We can also multiply both sides of (*) by A to get

    c_1 A v_1 + c_2 A v_2 = 0,

which by definition of an eigenvector is

    c_1 λ_1 v_1 + c_2 λ_2 v_2 = 0.

Subtracting the two equations, the v_2 terms cancel and we are left with

    c_1 (λ_2 - λ_1) v_1 = 0.

Because λ_1 ≠ λ_2 and v_1 is non-zero (by definition of an eigenvector), we must have c_1 = 0. Plugging back into (*), the first term vanishes and we conclude c_2 = 0.

Thus, P(2) holds.

Inductive Step:

Assume P(k) and consider k + 1 eigenvectors v_1, ..., v_{k+1} with corresponding distinct eigenvalues λ_1, ..., λ_{k+1}. Assume

    c_1 v_1 + c_2 v_2 + ... + c_{k+1} v_{k+1} = 0.     (*)

Multiplying both sides of (*) by λ_{k+1} yields

    c_1 λ_{k+1} v_1 + c_2 λ_{k+1} v_2 + ... + c_{k+1} λ_{k+1} v_{k+1} = 0.

We can also multiply both sides of (*) by A,

    c_1 A v_1 + c_2 A v_2 + ... + c_{k+1} A v_{k+1} = 0,

and apply the eigenvector definition to get

    c_1 λ_1 v_1 + c_2 λ_2 v_2 + ... + c_{k+1} λ_{k+1} v_{k+1} = 0.

Subtracting the two equations, the v_{k+1} terms cancel:

    c_1 (λ_{k+1} - λ_1) v_1 + c_2 (λ_{k+1} - λ_2) v_2 + ... + c_k (λ_{k+1} - λ_k) v_k = 0.

By the inductive hypothesis, v_1, v_2, ..., v_k are linearly independent. Therefore,

    c_1 (λ_{k+1} - λ_1) = c_2 (λ_{k+1} - λ_2) = ... = c_k (λ_{k+1} - λ_k) = 0.

But λ_{k+1} is distinct from the other λ_i's; thus,

    c_1 = c_2 = ... = c_k = 0.

Plugging back into (*) yields

    c_{k+1} = 0.
32.4 Spectral Theorem

Unlike the previous sufficient condition, this one is not as easy to prove. However, it is a very fundamental result that

    If an n × n matrix A is symmetric, then there exists a linearly independent set of n eigenvectors.

In fact, it gets even better:

    If an n × n matrix A is symmetric, then there exists an orthonormal set of n eigenvectors

(as an exercise, prove that orthonormality implies linear independence). This is known as the Spectral Theorem. But how do we prove it?

First you need to notice that we are dealing with a symmetric matrix. We've first seen these matrices in Lecture 23: it is a required condition in the definition of a quadratic form:

    Q(x) = \sum_{i,j=1}^{n} a_{ij} x_i x_j.

Moreover, in that lecture, we showed that quadratic forms have a minimum on the unit sphere. We all know that the gradient is 0 at a minimum.

The first step to proving the Spectral Theorem is observing a magical fact: when you differentiate

    Q(x / ||x||)

and plug in the minimum, out pops the eigenvector relationship! In fact, the minimum is achieved at the eigenvector and the minimum value is the corresponding eigenvalue.

This gives you the first eigenvector. We need n - 1 more, so we use induction.

For the inductive step, the trick is to remove the first eigenvector from the space: formally, we look at the orthogonal complement. By Gram-Schmidt, the complement has a basis of n - 1 orthonormal vectors in R^n:

    w_1, w_2, ..., w_{n-1}.

Using this basis, we are going to build a smaller (n-1) × (n-1) symmetric matrix and apply the inductive hypothesis. This will give us n - 1 eigenvectors in R^{n-1}.

Here's the kicker: consider an eigenvector of the smaller matrix:

    (v_1, v_2, ..., v_{n-1})^T.

We can use this eigenvector's coordinates to form a linear combination of the basis vectors:

    v_1 w_1 + v_2 w_2 + ... + v_{n-1} w_{n-1}.

Remarkably, this is an eigenvector of the bigger matrix!

Before we begin, here are a few notes:

Rewriting Q(x)

We will need to compute the k-th partial derivative of the quadratic form

    Q(x) = \sum_{i,j=1}^{n} a_{ij} x_i x_j.

If you expand it directly, the derivative is obvious. However, we are professionals now! As professionals, we would like to rewrite the quadratic so that the differentiation is immediate. Therefore, split Q(x) into a sum over three cases:

    Q(x) = \sum_{i=j} a_{ij} x_i x_j + \sum_{i<j} a_{ij} x_i x_j + \sum_{j<i} a_{ij} x_i x_j.

Exploiting the symmetry of A, we can rewrite the last term as

    \sum_{j<i} a_{ij} x_i x_j = \sum_{j<i} a_{ji} x_i x_j.

Switching the indexing variables, this is

    \sum_{j<i} a_{ji} x_i x_j = \sum_{i<j} a_{ij} x_i x_j.

Moreover, the first term in the sum simply goes from 1 to n. Therefore, we can rewrite Q(x) as

    Q(x) = \sum_{i=1}^{n} a_{ii} x_i^2 + \sum_{i<j} 2 a_{ij} x_i x_j.

This form of Q(x) makes differentiation a breeze! Differentiating the first term with respect to x_k kills everything except the k-th term, giving us

    2 a_{kk} x_k.

To differentiate the second term, notice that all terms are killed except the ones over pairs involving k:

    (1, k), (2, k), (3, k), ..., (k-1, k), (k, k+1), (k, k+2), (k, k+3), ..., (k, n).

Therefore, differentiation reduces this sum to

    \sum_{i=1, i≠k}^{n} 2 a_{ik} x_i.

Base case

Again, you can skip the base case in the following proof since the result is immediate with n = 1. However, I advise that you understand the n = 2 case first. It gives some (but not all!) of the main ideas and will help paint a clearer picture. When you are ready, move on to the inductive step.

Theorem. Let A be an n × n matrix. If A is symmetric, then there exists a set of n orthonormal eigenvectors of A.
Proof Summary:

- Base Case, n = 1: Immediate.
- Inductive Step, First Eigenvector:
  - Consider the corresponding quadratic form over the unit sphere and extend to a function S defined on R^n \ {0}.
  - S achieves a minimum value of m at the point ξ.
  - The gradient of S is 0 at ξ. Expand this system to show that ξ is an eigenvector corresponding to eigenvalue m.
- Inductive Step, Remaining Eigenvectors:
  - Consider the orthogonal complement V^⊥ of V = span{ξ}. It has an orthonormal basis w_1, w_2, ..., w_{n-1} by Gram-Schmidt.
  - Take the image of the w's under left multiplication by A. By the homework, each of these mapped vectors can be written as a linear combination of the original w's.
  - Show that the matrix of coefficients B is symmetric and apply the inductive hypothesis to get eigenvectors in R^{n-1}.
  - Use the coefficients of each eigenvector of B to get a corresponding linear combination of w's. Call this combination q_i.
  - Show that the q_i are eigenvectors of the larger matrix A: directly multiply on the left by A and use summation shenanigans. Take particular note that the innermost summation will collapse into the eigenvector relation for the k-th component of B.
  - Directly show that ξ, q_1, ..., q_{n-1} is orthonormal.
Proof: We will use induction to prove:
P(n) : If A is an n n symmetric matrix, then there exists a set of n orthonormal eigenvectors of A.
Base Case, n = 2. First Eigenvector:
Consider the corresponding quadratic form
Q(x) = a
11
x
2
1
+ 2a
12
x
1
x
2
+ a
22
x
2
2
,
over the unit sphere and extend to a new function S : R
2
\ {

0} R that normalizes the input


vector x and then applies Q to the resulting vector on the unit sphere:
S(x) = Q
_
x
x
_
=
a
11
x
2
1
+ 2a
12
x
1
x
2
+ a
22
x
2
2
x
2
=
a
11
x
2
1
+ 2a
12
x
1
x
2
+ a
22
x
2
2
x
2
1
+ x
2
2
.
Since the unit sphere is closed and bounded, it follows by the Extreme Value Theoerem that Q
achieves its minimum value of m at some unit vector . Since the images of S and Q are the
same, m is also the minimum value of S. In particular, S achieves this minimum value m at
(and more generally, at any positive scalar multiple of ). Therefore, the gradient of S is

0 at
, giving us the system

x
1
S () = 0

x
2
S () = 0
_

_
()
We would like to expand this system. First, calculate the derivatives on the left by applying
single variable quotient rule:

x
1
S(x) =
(
x
2
1
+x
2
2
)
(2a
11
x
1
+2a
12
x
2
)
(
a
11
x
2
1
+2a
12
x
1
x
2
+a
22
x
2
2
)
(2x
1
)
(x
2
1
+x
2
2
)
2

x
2
S(x) =
(
x
2
1
+x
2
2
)
(2a
22
x
2
+2a
12
x
1
)
(
a
11
x
2
1
+2a
12
x
1
x
2
+a
22
x
2
2
)
(2x
2
)
(x
2
1
+x
2
2
)
2
By construction, is on the unit sphere, so
=
2
1
+
2
2
= 1.
Since S and Q agree on the unit sphere,
S() = Q() = a
11

2
1
+ 2a
21

2
+ a
22

2
2
= m
624 LECTURE 32. SPOOKY SPECTRAL THEOREM
Plugging into the derivatives,

x
1
S() =
1
..
_

2
1
+
2
2
_
(2a
11

1
+2a
12

2
)
m
..
_
a
11

2
1
+ 2a
12

2
+ a
22

2
2
_
(2
1
)
(
2
1
+
2
2
)
2
. .
1

x
2
S() =
1
..
_

2
1
+
2
2
_
(2a
22

2
+2a
12

1
)
m
..
_
a
11

2
1
+ 2a
12

2
+ a
22

2
2
_
(2
2
)
(
2
1
+
2
2
)
2
. .
1
which simplies to

x
1
S() = (2a
11

1
+ 2a
12

2
) m(2
1
)

x
2
S() = (2a
22

2
+ 2a
12

1
) m(2
2
)
.
Substituting into the original system (), isolate the equations
a
11

1
+ a
12

2
= m
1
a
22

2
+ a
12

1
= m
2
But this is just
A
_

1

2
_
. .

= m
_

1

2
_
. .

,
so is an eigenvector of A corresponding to the eigenvalue m.
Base Case, n = 2. Remaining Eigenvectors:
Let
V = span{}
By the Gram-Schmidt process, we know we can nd an orthonormal basis for V

. Particularly
in the case n = 2, V

has dimension 1 so we have the single unit vector
w
1
.
Consider the operation of multiplication on the left by A:
Ax
On this weeks homework, you will prove that
32.4. SPECTRAL THEOREM 625
Homework. Given a symmetric matrix A, if x V

then Ax V

.
Therefore, we can write the image of the basis of V

(under left multiplication by A) in terms
of the basis of V

:
A w
1
= a
11
w
1
.
Thus, w
1
is an eigenvector corresponding to eigenvalue a
11
.
But V and w
1
V

, so they are orthogonal. Moreover, = 1 by construction and
w
1
= 1 by Gram-Schmidt. In conclusion,
, w
1
is an orthonormal set of eigenvectors of A.
Inductive step, First Eigenvector:
By the note preceding this proof, we can rewrite the quadratic form as
Q(x) =
n

i=1
a
ii
x
2
i
+
n

i<j
2a
ij
x
i
x
j
Q(x) =
n

i=1
a
ii
x
2
i
+
n

i<j
2a
ij
x
i
x
j
As before, consider this quadratic form on the unit sphere and extend to a function
S : R
n
\ {

0} R dened by
S(x) = Q
_
x
x
_
=
n

i=1
a
ii
x
2
i
+
n

i<j
2a
ij
x
i
x
j
x
2
=
n

i=1
a
ii
x
2
i
+
n

i<j
2a
ij
x
i
x
j
x
2
1
+ x
2
2
+ . . . + x
2
n
.
As before, Q achieves its minimum m at some unit vector . It follows that the minimum of S
is also m, and S achieves its minimum m at . Thus, the gradient of S must be

0 at , giving
the system

x
1
S() = 0

x
2
S() = 0
.
.
.

x
n
S() = 0
_

_
()
The goal is to expand this system: once we do, an eigenvector magically pops out!
626 LECTURE 32. SPOOKY SPECTRAL THEOREM
First, use the single-variable quotient rule to calculate the k-th partial

x
k
S(x) =
(x
2
1
+ x
2
2
+ . . . + x
2
n
)
_
a
kk
2x
k
+
n

i=1,i=k
2a
ik
x
i
_
(2x
k
)
_
n

i=1
a
ii
x
2
i
+
n

i<j
a
ij
x
i
x
j
_
(x
2
1
+ x
2
2
+ . . . + x
2
n
)
2
.
Observe that = 1 since it lies on the unit sphere. Moreover,
n

i=1
a
ij

2
i
+
n

i<j
a
ij

j
= m
This is because, by denition, S takes the value m at and
S () =
n

i=1
a
ii

2
i
+
n

i<j
a
ij

2
1
+
2
2
+ . . . +
2
n
. .
=1
= m.
Plugging into the k-th partial,

x
k
S() =
1
..
(
2
1
+
2
2
+ . . . +
2
n
)
_
2a
kk

k
+
n

i=1,i=k
2a
ik

i
_
(2
k
)
m
..
_
n

i=1
a
ii

2
i
+
n

i<j
a
ij

j
_
(
2
1
+
2
2
+ . . . +
2
n
)
2
. .
1
which reduces to

x
k
S() =
_
2a
kk

k
+
n

i=1,i=k
2a
ik

i
_
(2
k
) m.
Notice that the 2a
kk

k
is the missing term in the summation! Therefore, recombine to get

x
k
S() =
_
n

i=1
2a
ik

i
_
(2
k
) m.
Substituting into system (), isolate the m
i
terms to get
n

i=1
a
i1

i
= m
1
n

i=1
a
i2

i
= m
2
.
.
.
n

i=1
a
in

i
= m
n
32.4. SPECTRAL THEOREM 627
which is the matrix multiplication
_

_
a
11
a
21
. . . a
n1
a
12
a
22
. . . a
n2
.
.
.
.
.
.
.
.
.
.
.
.
a
1n
a
2n
. . . a
nn
_

_
. .
A
_

2
.
.
.

n
_

_
. .

= m
_

2
.
.
.

n
_

_
. .
m
.
But the left matrix is still A since it is symmetric. Therefore, is an eigenvector of A corre-
sponding to eigenvalue m.
Inductive step, Remaining Eigenvectors:
Let
V = span{}
By the Gram-Schmidt process, we know we can nd an orthonormal basis for V

w
1
, w
2
, . . . , w
n1
.
Consider the operation of multiplying on the left by A:
Ax
On this weeks homework, you will prove that this multiplication maps vectors of V

directly
into V

:
V

x
A
Ax
In particular, this means we can write the image of the basis of V

(under left multiplication
by A) in terms of the basis of V

:
A w
1
= b
11
w
1
+ b
21
w
2
+ . . . + b
(n1)1
w
n1
A w
2
= b
12
w
1
+ b
22
w
2
+ . . . + b
(n1)2
w
n1
A w
3
= b
13
w
1
+ b
23
w
2
+ . . . + b
(n1)3
w
n1
.
.
.
.
.
.
.
.
.
A w
n1
= b
1(n1)
w
1
+ b
2(n1)
w
2
+ . . . + b
(n1)(n1)
w
n1
628 LECTURE 32. SPOOKY SPECTRAL THEOREM
Consider the matrix of these coecients
B =
_

_
b
11
b
12
. . . b
1(n1)
b
21
b
212
. . . b
2(n1)
.
.
.
.
.
.
.
.
.
.
.
.
b
(n1)1
b
(n1)2
. . . b
(n1)(n1)
_

_
Since it is (n 1) (n 1), if we can prove it is symmetric, then we can apply the inductive
hypothesis.
To show B is symmetric, consider another theorem from this weeks homework:
Homework. For a symmetric matrix A,
(A w
j
) w
i
= (A w
i
) w
j
.
Using
A w
i
= b
1i
w
1
+ b
2i
w
2
+ . . . + b
(n1)i
w
n1
dot product both sides with w
j
to get
(A w
i
) w
j
= b
ij
by orthonormality. Applying the same trick with
A w
j
= b
1j
w
1
+ b
2j
w
2
+ . . . + b
(n1)j
w
n1
,
dot product both sides with w
i
to get
(A w
j
) w
i
= b
ji
.
Therefore,
b
ij
..
(A w
j
) w
i
= b
ji
..
(A w
i
) w
j
.
In other words, B is symmetric.
Now we can apply the induction hypothesis to B. This gives us orthonormal eigenvectors
u
1
, u
2
, . . . , u
n1
of B with corresponding eigenvalues
1
,
2
, . . . ,
n1
, respectively. For any one of these eigen-
vectors
u
i
=
_

_
u
1i
u
2i
.
.
.
u
(n1)i
_

_
32.4. SPECTRAL THEOREM 629
we can check that
q
i
= u
1i
w
1
+ u
2i
w
2
+ . . . + u
(n1)i
w
n1
=
n1

j=1
u
ji
w
j
is an eigenvector of the original matrix A: multiply on the left by A
Aq
i
= A
_
n1

j=1
u
ji
w
j
_
=
n1

j=1
u
ji
A w
j
.
Expanding our definition of \(A w_j\), rewrite the sum as
\[
\sum_{j=1}^{n-1} u_{ji}\Bigl(\sum_{k=1}^{n-1} b_{kj}w_k\Bigr).
\]
Pull in \(u_{ji}\), switch the order of summation, and pull out \(w_k\):
\[
\sum_{j=1}^{n-1}\Bigl(\sum_{k=1}^{n-1} b_{kj}u_{ji}w_k\Bigr) = \sum_{k=1}^{n-1}\Bigl(\sum_{j=1}^{n-1} b_{kj}u_{ji}w_k\Bigr) = \sum_{k=1}^{n-1} w_k\Bigl(\sum_{j=1}^{n-1} b_{kj}u_{ji}\Bigr).
\]
But we have a cute way to rewrite the inner sum! Look at the definition of the i-th eigenvector of B. Focus on the k-th component:
\[
\underbrace{\begin{pmatrix}
\vdots & \vdots & \vdots & & \vdots\\
b_{k1} & b_{k2} & b_{k3} & \dots & b_{k(n-1)}\\
\vdots & \vdots & \vdots & & \vdots
\end{pmatrix}}_{B}
\underbrace{\begin{pmatrix} u_{1i}\\ u_{2i}\\ u_{3i}\\ \vdots \end{pmatrix}}_{u_i}
=
\underbrace{\begin{pmatrix} \vdots\\ \lambda_i u_{ki}\\ \vdots \end{pmatrix}}_{\lambda_i u_i}
\]
This means
\[
\Bigl(\sum_{j=1}^{n-1} b_{kj}u_{ji}\Bigr) = \lambda_i u_{ki}.
\]
Therefore,
\[
\sum_{k=1}^{n-1} w_k\Bigl(\sum_{j=1}^{n-1} b_{kj}u_{ji}\Bigr) = \lambda_i \sum_{k=1}^{n-1} w_k u_{ki} = \lambda_i q_i.
\]
Thus, \(q_i\) is an eigenvector of A corresponding to \(\lambda_i\). So
\[
\xi,\ q_1,\ q_2,\ \dots,\ q_{n-1}
\]
is a set of eigenvectors of A.

All that's left is to check that this set is orthonormal. But \(\|\xi\| = 1\) and \(\xi \in V\) whereas each \(q_i \in V^{\perp}\). Therefore we only need to check that the \(q_i\) are orthonormal.
Directly compute the dot product. First, expand only one term and distribute:
\[
q_i\cdot q_j = \underbrace{\Bigl(\sum_{r=1}^{n-1} u_{ri}w_r\Bigr)}_{q_i}\cdot\, q_j = \sum_{r=1}^{n-1} \bigl(u_{ri}\, w_r\cdot q_j\bigr).
\]
Then expand the other term and distribute:
\[
\sum_{r=1}^{n-1}\Biggl( u_{ri}\, w_r\cdot \underbrace{\Bigl(\sum_{t=1}^{n-1} u_{tj}w_t\Bigr)}_{q_j}\Biggr) = \sum_{r=1}^{n-1}\sum_{t=1}^{n-1} u_{ri}u_{tj}\,(w_r\cdot w_t).
\]
Since the w's are orthonormal, only the terms where r = t remain:
\[
\sum_{r=1}^{n-1} u_{ri}u_{rj}
\]
which is just \(u_i\cdot u_j\). But the u's are orthonormal! This means that the q's are orthonormal.
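To see the full statement we just proved in action, here is a quick numerical illustration — a sketch that leans on NumPy's eigensolver rather than on the inductive construction above: a random symmetric matrix admits an orthonormal basis of eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                      # symmetric

lam, Q = np.linalg.eigh(A)             # columns of Q are eigenvectors

# The columns are orthonormal ...
assert np.allclose(Q.T @ Q, np.eye(n))
# ... and each column q_i satisfies A q_i = lambda_i q_i.
assert np.allclose(A @ Q, Q @ np.diag(lam))
print("eigenvalues:", lam)
```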
Lecture 33
Keeping up with Contractions
Take a map of Texas and throw it into Texas.
Theres going to be some place in Texas where it lands
that aligns with its corresponding point on the map
- H

i
Goals: Today, we study functions that map into themselves. In particular, we prove the Contraction Mapping Theorem which guarantees, under certain conditions, the existence of a fixed point. Not only does this theorem give us a constructive method to find such a point, but it also serves as a vital step in the proof of the Inverse Function Theorem.
33.1 It's Inception All Over Again
One of the key steps in the proof of the Spectral Theorem was considering a mapping whose image was contained in its domain: \(f(E) \subseteq E\). But what if we mapped the image again? Then the image of the image will lie within the original domain!
Because we love Inception, we keep on mapping the images under f. This gives us a nested sequence of sets:
\[ E \supseteq f(E) \supseteq f(f(E)) \supseteq f(f(f(E))) \supseteq f(f(f(f(E)))) \supseteq f(f(f(f(f(E))))) \supseteq \dots \]
Can we prove anything about sequences of nested sets? Absolutely! Namely, we are going to prove
the Contraction Mapping Theorem.
33.2 Non-Triviality of Non-Emptiness
Before we prove the Contraction Mapping Theorem, let's do an easier proof that has the same flavor.
Ignore the role of f and consider just the sets. Then,
Given a nested sequence of closed, bounded, and non-empty sets, their intersection is also
closed, bounded, and non-empty.
We already know the intersection is bounded and closed. Strangely, the only thing we really have to
prove is that the intersection is non-empty! Weird!
But what possible applications could this theorem have? Who cares about whether the intersection
is non-empty or not?
When you continue your studies in analysis, you will see that this can actually be used to prove the incredible Heine-Borel Theorem. Also, in the final lecture, we will use this to prove that the reals are uncountable.
To prove our nested sequence has a non-empty intersection, we need to find a point x with the property that x is in every set of the sequence.
We use a nice trick:
Math Mantra: Suppose you are given a sequence of nested closed sets, and you
want to find an x that satisfies some property. Form a sequence by taking a
point from each set. As long as our sequence of points is bounded, we can apply
Bolzano-Weierstrass. Hopefully, the limit will satisfy the property you need.
This trick will also be used to prove the Contraction Mapping Theorem.
Theorem. Let \(C_1, C_2, C_3, \dots\) be a nested sequence of closed, bounded, and non-empty sets in \(\mathbb{R}^n\):
\[ C_1 \supseteq C_2 \supseteq C_3 \supseteq \dots \]
Then the intersection
\[ \bigcap_{i\in\mathbb{N}} C_i \]
is closed, bounded, and non-empty.
Proof Summary:
• Construct a sequence \((x_k)\) where \(x_k \in C_k\). Apply Bolzano-Weierstrass to get \(x_{n_j} \to x\).
• For arbitrary i,
\[ x_{n_i},\ x_{n_{i+1}},\ x_{n_{i+2}},\ \dots \]
is a subsequence in \(C_i\). Thus, \(x \in C_i\).
• Since i was arbitrary,
\[ x \in \bigcap_{i\in\mathbb{N}} C_i. \]
Proof: To reiterate, the fact that the intersection is closed and bounded is immediate. We only need to prove it is non-empty.
First construct a sequence by choosing a point in each set
\[ x_k \in C_k. \]
Since the sets are bounded, we automatically know \((x_k)\) is bounded. Applying Bolzano-Weierstrass, there is a subsequence
\[ x_{n_j} \to x. \]
Now consider any set \(C_i\) and notice that \(i \le n_i\). Then
\[ x_{n_i},\ x_{n_{i+1}},\ x_{n_{i+2}},\ \dots \]
is a subsequence in \(C_i\) since
\[ C_{n_i} \subseteq C_i,\quad C_{n_{i+1}} \subseteq C_i,\quad C_{n_{i+2}} \subseteq C_i,\ \dots \]
But this sequence still converges to x and since \(C_i\) is closed,
\[ x \in C_i. \]
Since \(C_i\) was arbitrary,
\[ x \in \bigcap_{i\in\mathbb{N}} C_i. \]
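Here is a tiny sketch of the idea with a concrete (and entirely hypothetical) family of nested sets \(C_k = [0, 1 + 1/k]\) in \(\mathbb{R}\): picking a point from each set and passing to the limit lands us in the intersection.

```python
import numpy as np

# Sketch: nested closed, bounded, non-empty sets C_k = [0, 1 + 1/k] in R.
# Their intersection is [0, 1], which is indeed non-empty.
def in_C(x, k):
    return 0.0 <= x <= 1.0 + 1.0 / k

# Choose a point x_k in each C_k (here, the right endpoint).
xs = np.array([1.0 + 1.0 / k for k in range(1, 10_001)])

# The sequence is bounded, and it converges to 1 (no subsequence needed here).
limit = 1.0
assert abs(xs[-1] - limit) < 1e-3

# The limit lies in every C_k, hence in the intersection.
assert all(in_C(limit, k) for k in range(1, 10_001))
print("a point of the intersection:", limit)
```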
33.3 Contraction Mapping Theorem
Consider again a function f on a closed set¹ E that maps into itself:
\[ f(E) \subseteq E. \]
Suppose f satisfies the following property: for any two points in E, the distance between those two points is strictly smaller after being mapped by f.

¹ We do not need to assume E is bounded.
Moreover, suppose that the ratio¹ of the distance after mapping to the distance before mapping is bounded by a universal fixed scale factor \(\alpha < 1\):
\[ \|f(x) - f(y)\| \le \alpha\|x - y\|. \]
Remarkably, it follows that there must exist some point \(z \in E\) that gets mapped to itself:
\[ f(z) = z. \]
This is pretty cool! But don't think it is as intuitively obvious as the last proof. Specifically, you should not be thinking:

There is a point in common in all the mapped sets, so this must be true.

The fact that z is in all the images is a necessary condition. The Contraction Mapping Theorem actually says something stronger. It says that even though our f could rotate and shift our points around,

¹ Generally, for an unrestricted constant \(\alpha\), this type of function is called Lipschitz. In Math 52H you will learn that these are great functions to work with.
there is at least one point that stays fixed by the mapping. In fact, this point is unique.
So how do we prove this? It's going to be similar to the previous proof. We are going to construct a sequence of points such that each point lies in the next nested image. However, there is a big difference: E need not be bounded! So we cannot use boundedness to instantly apply Bolzano-Weierstrass.
But there is a way around this. We do what Bane did in Dark Knight Rises: we throw a body into the river and let it ride the current as it gets closer to a full stop. To prove the Contraction Mapping Theorem, we are going to choose an arbitrary initial point and keep mapping it as the successive images get closer to a full stop.
Theorem (Contraction-Mapping Theorem). Let \(f : \mathbb{R}^n \to \mathbb{R}^n\) and let \(E \subseteq \mathbb{R}^n\) be a closed set such that \(f(E) \subseteq E\). Suppose there exists some \(\alpha \in (0, 1)\) such that, for any \(x, y \in E\),
\[ \|f(x) - f(y)\| \le \alpha\|x - y\|. \]
Then there exists a fixed point \(z \in E\):
\[ f(z) = z. \]
Proof Summary:
• Choose \(x_0 \in E\). Construct sequence \((x_k)\) where
\[ x_{k+1} = f(x_k). \]
• Inductively apply the contraction mapping property to get the bound
\[ \|x_r - x_{r-1}\| \le \alpha^{r-1}\|x_1 - x_0\|. \]
• Apply this bound to a geometric series to obtain the key inequality
\[ \|x_l - x_k\| \le \frac{\alpha^{k}\|x_1 - x_0\|}{1 - \alpha} \]
for \(l \ge k\).
• Use the key inequality to show \((x_k)\) is bounded. Apply Bolzano-Weierstrass to get \(x_{n_i} \to z\).
• By continuity of f, \(f(x_{n_i}) \to f(z)\).
• Use key inequality to show \(f(x_{n_i}) \to z\).
• Conclude \(z = f(z)\).
Proof: Choose some \(x_0 \in E\) and form a sequence \((x_i)\) by repeatedly mapping \(x_0\) under f:
\[ x_1 = f(x_0),\quad x_2 = f(x_1),\quad x_3 = f(x_2),\ \dots \]
Generally,
\[ x_{k+1} = f(x_k). \]
Particularly, our contraction mapping tells us
\[ \|x_{k+1} - x_k\| = \|f(x_k) - f(x_{k-1})\| \le \alpha\|x_k - x_{k-1}\|. \]
First, we will prove that this sequence gets closer together. Let's look at the distance between two terms of the sequence
\[ \|x_l - x_k\|. \]
WLOG let \(l \ge k\). We can repeatedly add \(\vec{0}\):
\[ \|x_l - x_k\| = \bigl\|(x_l - x_{l-1}) + (x_{l-1} - x_{l-2}) + \dots + (x_{k+1} - x_k)\bigr\| \]
and bound this by repeated triangle inequality
\[ \le \|x_l - x_{l-1}\| + \|x_{l-1} - x_{l-2}\| + \dots + \|x_{k+1} - x_k\|. \]
Observe that each term of this sum is bounded: by inductively applying the contraction mapping property,
\[ \|x_r - x_{r-1}\| \le \alpha\|x_{r-1} - x_{r-2}\| \le \alpha^2\|x_{r-2} - x_{r-3}\| \le \dots \]
we have
\[ \|x_r - x_{r-1}\| \le \alpha^{r-1}\|x_1 - x_0\|. \]
Now we can further bound
\[
\begin{aligned}
\|x_l - x_{l-1}\| + \|x_{l-1} - x_{l-2}\| + \dots + \|x_{k+1} - x_k\|
&\le \alpha^{l-1}\|x_1 - x_0\| + \alpha^{l-2}\|x_1 - x_0\| + \dots + \alpha^{k}\|x_1 - x_0\|\\
&= \bigl(\alpha^{l-1} + \alpha^{l-2} + \dots + \alpha^{k}\bigr)\|x_1 - x_0\|\\
&= \bigl(\alpha^{l-k-1} + \dots + \alpha + 1\bigr)\alpha^{k}\|x_1 - x_0\|.
\end{aligned}
\]
The last line contains a geometric sum with \(0 < \alpha < 1\), so we can bound this sum by the full infinite geometric sum:
\[ \underbrace{\bigl(\alpha^{l-k-1} + \dots + \alpha + 1\bigr)}_{\le \frac{1}{1-\alpha}}\,\alpha^{k}\|x_1 - x_0\| \;\le\; \frac{\alpha^{k}\|x_1 - x_0\|}{1 - \alpha}.
\]
This gives us the key inequality: for \(l \ge k\),
\[ \|x_l - x_k\| \le \frac{\alpha^{k}\|x_1 - x_0\|}{1 - \alpha}. \tag{\(\star\)} \]
Using this key inequality, we can show that our sequence \((x_k)\) is bounded: for arbitrary s,
\[ \|x_s\| = \|x_s - x_1 + x_1\| \le \|x_s - x_1\| + \|x_1\|. \]
Plugging \(l = s\), \(k = 1\) into \((\star)\), this sum is bounded by
\[ \frac{\alpha\|x_1 - x_0\|}{1 - \alpha} + \|x_1\| \]
which are all constants. Thus the sequence \((x_k)\) is bounded.
This allows us to apply Bolzano-Weierstrass: there exists a convergent subsequence
\[ x_{n_i} \to z. \]
Since E is closed, \(z \in E\).
I claim z is a fixed point. To prove this, first note that f is a continuous map (just choose \(\delta = \varepsilon/\alpha\)). Therefore,
\[ f(x_{n_i}) \to f(z). \]
Thus, if we can show \(\bigl(f(x_{n_i})\bigr)\) converges to z as well, then
\[ f(x_{n_i}) \to f(z), \qquad f(x_{n_i}) \to z \]
and by uniqueness of limits,
\[ f(z) = z. \]
Let's look at
\[ \|f(x_{n_i}) - z\|. \]
By our recursive construction, this is just
\[ \|x_{n_i+1} - z\|. \]
Then,
\[ \|x_{n_i+1} - z\| = \|x_{n_i+1} - x_{n_i} + x_{n_i} - z\| \le \|x_{n_i+1} - x_{n_i}\| + \|x_{n_i} - z\|. \]
By convergence, we know we can find an \(N_1\) such that for \(i \ge N_1\),
\[ \|x_{n_i} - z\| < \frac{\varepsilon}{2}. \]
Moreover, by our key inequality,
\[ \|x_{n_i+1} - x_{n_i}\| \le \frac{\alpha^{n_i}\|x_1 - x_0\|}{1 - \alpha} \]
and since \(\alpha \in (0, 1)\), we can find an \(N_2\) such that for \(i \ge N_2\),
\[ \|x_{n_i+1} - x_{n_i}\| < \frac{\varepsilon}{2}. \]
For \(i \ge \max\{N_1, N_2\}\),
\[ \|x_{n_i+1} - z\| \le \underbrace{\|x_{n_i+1} - x_{n_i}\|}_{< \varepsilon/2} + \underbrace{\|x_{n_i} - z\|}_{< \varepsilon/2} < \varepsilon, \]
i.e.,
\[ f(x_{n_i}) \to z. \]
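Here is a small numerical sketch of the construction in the proof, with an example of my own choosing: f(x) = cos x maps E = [0, 1] into itself and, by the Mean Value Theorem, is a contraction there with α = sin(1) < 1, so iterating from any starting point converges to the unique fixed point.

```python
import math

# f(x) = cos(x) maps E = [0, 1] into itself, and |f'(x)| = |sin x| <= sin(1) < 1
# there, so f is a contraction on E with alpha = sin(1).
f = math.cos
alpha = math.sin(1.0)

x = 0.0                          # any starting point x_0 in E works
for _ in range(100):             # x_{k+1} = f(x_k)
    x = f(x)

# x is (numerically) the fixed point: f(z) = z.
assert abs(f(x) - x) < 1e-12
print("fixed point of cos:", x)  # ~0.7390851332151607
```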
Here are a few important observations:
• This is a constructive proof. To find the¹ fixed point, just take any point in E and keep applying f.
• Our key inequality shows that our sequence is a Cauchy sequence. Intuitively, a Cauchy sequence is a sequence that eventually bunches up. Cauchy sequences are very important: in upper level analysis courses, you will use these to construct the reals from the rationals. Essentially, we tell the Completeness Axiom to back off. We don't need to take it as an axiom. Instead, the Completeness Property will be a corollary of our construction.
• Lastly, we can easily show that our fixed point is actually unique:

¹ We will justify why this is "the" and not "a" very soon.
Theorem. Let \(f : \mathbb{R}^n \to \mathbb{R}^n\) and let \(E \subseteq \mathbb{R}^n\) be a closed set such that \(f(E) \subseteq E\). Suppose there exists some \(\alpha \in (0, 1)\) such that for any \(x, y \in E\),
\[ \|f(x) - f(y)\| \le \alpha\|x - y\|. \]
If for \(z_1, z_2 \in E\),
\[ f(z_1) = z_1, \qquad f(z_2) = z_2, \]
then
\[ z_1 = z_2. \]
Proof: Rewrite the distance between \(z_1\) and \(z_2\) in terms of f and apply the contraction mapping property:
\[ \|z_1 - z_2\| = \|f(z_1) - f(z_2)\| \le \alpha\|z_1 - z_2\|. \]
Subtracting \(\alpha\|z_1 - z_2\|\) from both sides,
\[ (1 - \alpha)\|z_1 - z_2\| \le 0. \]
Since \((1 - \alpha) > 0\),
\[ \|z_1 - z_2\| = 0, \]
i.e.,
\[ z_1 = z_2. \]
Lecture 34
Intimidating Inverse Function Theorem
The only thing we have to fear is fear itself
-Franklin (Evelt)

Goals: After giving a brief review of inverses, we prove the infamous Inverse Function
Theorem.
34.1 Introspection on Inverse
In my experience, there is a lot of confusion among high school students when it comes to inverses.
Particularly, they confuse the algebraic inverse with the functional inverse.
1
Just in case,
An algebraic inverse is an object x
1
that you apply, under some operation, to an object x to
get the identity element:
x
1
x = x x
1
= e
A functional inverse is a function f
1
such that inputting f(x) into f
1
returns x:
f
1
(f(x)) = x.
Today, we are going to focus on functional inverses. In fact, we are going to prove the most dicult
theorem in this course, the Inverse Function Theorem.
But before you jump headrst into a dicult yet fundamental result, you must know the basics. By
now, you should have realized that
Math Mantra: NEW mathematics is built on PREVIOUS mathematics. You have to
have a SOLID GRASP before continuing.
1
Of course, you can think of a functional inverse as an algebraic inverse under the operation of composition, where
the identity element is the identity function I(x) = x. But the typical high school student wouldnt know this.
641
642 LECTURE 34. INTIMIDATING INVERSE FUNCTION THEOREM
You already know this: the material in the second half of 51H relies on a solid understanding of both
single variable calculus and linear algebra.
Unfortunately, you may have had a lousy introduction to inverses. Instead of being taught concepts
like 1 : 1, your theory of inverses (and functions) may have been reduced to a mindless methodology:
Apply the Vertical Line Test.
Swap x and y and solve for y.
Fold a paper in half.
Yuck!
I am going to ll this gap, and when you are ready, you can move on to the Inverse Function Theorem.
34.2 Inverse Basics
Lets start with some function f. For every input x, we know there is some output f(x).
x
f(x)
Using this f, we want to construct some function f
1
such that when you input f(x), you get x. So
what does such a function have to satisfy?
The domain of f
1
must contain the image of f. Otherwise, f
1
(f(x)) = x wouldnt make
sense!
No inputs of f can map to the same output. Consider, for example, the case y = x
2
:
Then the inputs −1 and 1 would map to the output 1:
\[ f(-1) = 1, \qquad f(1) = 1. \]
So if an inverse did exist,
\[ f^{-1}(1) = -1, \qquad f^{-1}(1) = 1, \]
meaning \(f^{-1}\) is not a function!
To avoid the case where two inputs are competing for the same output, we dene the condition
Denition. A function is 1 : 1 (read: one-to-one) or injective if, for any x, y such that
f(x) = f(y)
it must be the case that
x = y.
This condition, along with the fact that f
1
is dened on Im(f), makes the existence of the inverse
immediate. Just switch the range and domain, and show that it is a function.
Theorem. If f is 1:1, then there exists a function f
1
with domain Im(f) such that
f
1
(f(x)) = x
Proof: Define \(f^{-1} : \operatorname{Im}(f) \to \operatorname{dom}(f)\) by
\[ f^{-1}(x) = y, \]
where y is some value such that
\[ f(y) = x. \]
Suppose \(f^{-1}\) is ill-defined (i.e., it is not a function). Then there exists a point z that is mapped to two outputs:
\[ f^{-1}(z) = a, \qquad f^{-1}(z) = b \]
where \(a \ne b\). Then, by construction,
\[ f(a) = z, \qquad f(b) = z, \]
which contradicts the fact that f is 1 : 1.
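For a function on a finite domain, the construction in this proof amounts to swapping inputs and outputs. A quick sketch with a toy example of my own:

```python
# A 1:1 function on a finite domain, stored as input -> output.
f = {1: 10, 2: 20, 3: 30}

# The construction in the proof: f_inv is defined on Im(f) by f_inv(f(x)) = x.
f_inv = {y: x for x, y in f.items()}

assert all(f_inv[f[x]] == x for x in f)       # f_inv(f(x)) = x

# If f were not 1:1, two inputs would compete for the same output and the
# "inverse" would not be a function (the dict silently loses one pair).
g = {-1: 1, 1: 1}                             # g(x) = x^2 on {-1, 1}
g_inv = {y: x for x, y in g.items()}
assert len(g_inv) < len(g)                    # a = -1 and b = 1 collide at z = 1
```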
Thats all the theory you need to understand the meaning of the Inverse Function Theorem. But
before you embark on this great proof, I want you to know that the concepts of 1 : 1 and inverses are
pretty darn important. Particularly in the elds of
Cryptography: Every encrypted message can be decrypted to precisely one corresponding
text.
Combinatorics: If we can produce 1 : 1 mappings between two nite sets, then they must
have the same size. This lets us prove some pretty nifty results.
Set Theory: We can create a notion of size for innite sets. Intuitively, the size of A is less
than (or equal to) the size of B if there is a 1 : 1 mapping of A into B.
We shall save the rst item for courses like Math 110 and CS 255. The last two items will be
discussed in our nal lecture.
34.3 An Overall Schematic
The Inverse Function Theorem is a dicult theorem to prove. There is no question about that. This
is because it contains a lot of intricacies. But so do some of the greatest works: Les Miserables,
Memento, and Pulp Fiction. So lets breathe slowly, and take this one step at a time.
First, lets give a simple example.
Focus on only the universe of functions that are at least C
1
. For example, consider the sine function:
6 4 2 2 4 6
34.3. AN OVERALL SCHEMATIC 645
If you tried to directly invert this function, you would completely and utterly fail. Instead, we cheat.
We restrict the domain so that it is locally invertible.
6 4 2 2 4 6
This was the entire point of arcsin(x), arccos(x), and arctan(x)!
Of course, where we center the restricted interval makes a dierence. If I centered around a turning
point, say at x
0
=

2
,
x
0
then no matter how much we restrict the interval, we cannot nd a local inverse!
Notice that this problem arose because the derivative at x
0
is zero. However,
For a C
1
function, at a point where the Jacobian is non-zero, we can always restrict the function to
an open set so that it is invertible. In fact, we can show that this inverse is also C
1
and dened on
an open set (meaning that Im(f) is open).
This is the Inverse Function Theorem.
So how do we prove this theorem?
Start with an open ball around \(x_0\). Then, we add mysterious conditions: shrink the ball so that for all of its points x, the matrix inverse \([Df(x)]^{-1}\) exists. Moreover, for any point in this ball, when we evaluate the Jacobian, the error from the Jacobian at \(x_0\) must be less than a fixed \(\varepsilon \in (0, \tfrac{1}{2})\):
\[ [Df(x)]^{-1} \text{ exists}, \qquad \|Df(x) - Df(x_0)\| < \varepsilon. \]
We then take the image of the ball under f.
Incredibly, this f is 1 : 1 and its image is open. To prove this, we use the mysterious restrictions to prove a magical inequality
\[ \bigl\|\bigl(x - f(x)\bigr) - \bigl(y - f(y)\bigr)\bigr\| \le \varepsilon\|y - x\|. \]
Almost every step in our proof is going to require this inequality.
After we prove that f is 1 : 1 and the image is open, we know that \(f^{-1}\) exists. Then, we have a Majora's Mask moment. Like Link in the Stone Tower Temple, flip the world and reverse f.
Then we rewrite our magical inequality to get a magical reverse inequality:
\[ \|f^{-1}(u) - f^{-1}(v)\| \le \frac{\|u - v\|}{1 - \varepsilon}. \]
Using this inequality, we can check that \(f^{-1}\) is indeed \(C^1\).
That is the overall schematic. But there are a lot of details. Luckily we can make our lives easier:
34.4 A Much Needed Simplication
In the proof of the inverse function theorem, it turns out that we can actually assume
Df(x
0
) = I
To explain why, rst we need to talk about open sets.
Suppose that you take an open set V and multiply every point by an invertible matrix
AV = {Av | v V } .
It turns out that AV is still open.
Careful! The proof is not a straightforward denition check! Its going to need an idea. Namely,
Math Mantra: If you cannot prove the result directly, try to come up with an
INTERMEDIATE step.
Particularly, instead of directly proving
A C
we add an intermediate set inclusion
A B C
and prove instead that
A B
B C
You must understand this trick: in fact, its a fundamental step in the proof of the Inverse Function
Theorem.
Lemma. Let V be an open set and A an invertible matrix. Then
AV = {Av | v V }
is also open.
648 LECTURE 34. INTIMIDATING INVERSE FUNCTION THEOREM
Choose such that
B

(v) V.
It suces to nd a such that
B

(Av) A
_
B

_
v
__
AV.
A
_
B

_
v
__
AV
Immediate.
B

(Av) A
_
B

_
v
__
Expand denition and choose
=

A
1

.
Proof: For a point Av AV , we want to nd a > 0 such that
B

(Av) AV.
Av

AV
Proving this directly is too dicult. Instead, we throw in an intermediate set inclusion. First, use
openness to nd a ball centered around v contained in V :
v

V
34.4. A MUCH NEEDED SIMPLIFICATION 649
Thus is some radius such that
B

(v) V
I claim that, with the right choice of given , when we map this ball under A, it contains our set
B

(Av) and lies in AV :


v

Av

A
The goal now is to nd a such that
B

(Av) A
_
B

_
v
__
AV
Notice that the right inclusion
A
_
B

(v)
_
AV
is automatically true. Namely, we chose so that
B

(v) V.
Thus
A
_
B

(v)
_
AV,
and all we have to show is
B

(Av) A
_
B

(v)
_
.
In other words, we have to show there exists a such that for all x where
x Av <
there exists a q B

(v) such that


Aq = x.
By invertibility, we can solve for
q = A
1
x,
so we just need to show A
1
x B

(v):
A
1
x v < .
Pulling out an A
1
and applying the Cauchy-like inequality for matrices, we get an upper bound:
A
1
x A
1
A
. .
I
v = A
1
(x Av) A
1
x Av.
650 LECTURE 34. INTIMIDATING INVERSE FUNCTION THEOREM
Therefore, if we choose
=

A
1

we would get
A
1
x Av
. .
<
< A
1


A
1

= .
Suppose you can prove the Inverse Function Theorem in the case Df(x
0
) = I. Then in particular,
the theorem is true for the function

f(x) = [Df(x
0
)]
1
f(x)
since
D

f(x
0
) = [Df(x
0
)]
1
. .
constant
Df(x
0
) = I.
Thus, there exist open sets U, V such that

f : U V is 1 : 1 and f(U) = V . By denition,

f(U) = V [Df(x
0
)]
1
f(U) = V
so
f(U) = Df(x
0
)V.
But U is open so by our lemma, Df(x
0
)V is open. The inverse of

f is also explicitly

f
1
(x) = Df(x
0
)f
1
(x),
thus
f
1
(x) = [Df(x
0
)]
1

f
1
(x)
and ergo, f
1
is also C
1
.
Therefore, if the Inverse Function Theorem is true in the case Df(x
0
) = I, then it is in fact true when
Df(x
0
) is any arbitrary invertible matrix.
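Here is a numerical sketch of this reduction, with a function f chosen only for illustration: for f(x, y) = (x + y², xy + 1) and x₀ = (1, 1), the rescaled function f̃ = [Df(x₀)]⁻¹ f has Jacobian (approximately) I at x₀. The finite-difference Jacobian below is only an approximation.

```python
import numpy as np

def f(p):
    x, y = p
    return np.array([x + y**2, x * y + 1.0])

def jacobian(g, p, h=1e-6):
    """Central finite-difference Jacobian of g at p (an approximation)."""
    p = np.asarray(p, dtype=float)
    cols = []
    for i in range(len(p)):
        e = np.zeros_like(p); e[i] = h
        cols.append((g(p + e) - g(p - e)) / (2 * h))
    return np.column_stack(cols)

x0 = np.array([1.0, 1.0])
Df_x0 = jacobian(f, x0)                             # invertible at x0

f_tilde = lambda p: np.linalg.solve(Df_x0, f(p))    # [Df(x0)]^{-1} f(p)

# D f_tilde(x0) = [Df(x0)]^{-1} Df(x0) = I, up to finite-difference error.
assert np.allclose(jacobian(f_tilde, x0), np.eye(2), atol=1e-5)
```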
34.5 The Intricate Inverse Function Theorem
Now that we know we can assume that \(Df(x_0) = I\), we begin the legendary proof:

Theorem (Inverse Function Theorem). Let \(f : \mathbb{R}^n \to \mathbb{R}^n\) be a \(C^1\) function, let \(x_0 \in \mathbb{R}^n\), and suppose \(Df(x_0)\) is invertible. Then there exists an open set U containing \(x_0\) and an open set V such that \(f : U \to V\) is 1 : 1 and \(f(U) = V\) (so \((f|_U)^{-1} : V \to U\) exists). Moreover, \((f|_U)^{-1}\) is \(C^1\).
Proof Summary:
By the preceding section, assume Df(x
0
) = I.
Construct an open set U = B

(x
0
) and dene such that for every x U,
Df(x) I < <
1
2
and Df(x) is invertible.
Prove the magical inequality

_
x f(x)
_

_
y f(y)
_
y x.
by applying the Fundamental Theorem of Calculus on
F
_
x + t(y x)
_
.
Prove V = f
_
B

(x
0
)
_
is open:
It suces to show
B

_
f(z)
_
f
_
B
2
(z)
_
f
_
B

(x
0
)
_
where
=
z x
0

2
.
f
_
B
2
(z)
_
f
_
B

(x
0
)
_
Directly check denition.
B

_
f(z)
_
f
_
B
2
(z)
_
Show
F(x) = x f(x) +c
is a contraction mapping on B
2
(z) that maps into B
2
(z). Use the magical inequality.
Prove f is 1 : 1 on U by using the magical inequality. Conclude that f
1
exists.
Prove the magical reverse inequality for f
1
,
f
1
(u) f
1
(v)
u v
1
by rewriting the magical inequality in terms of f
1
.
Prove f
1
is dierentiable by checking that
Df
1
(v
0
) = [Df(u
0
)]
1
for f
1
(v
0
) = u
0
. Apply the magical reverse inequality on fs dierentiability denition.
652 LECTURE 34. INTIMIDATING INVERSE FUNCTION THEOREM
Prove Df
1
is continuous as a consequence of composition properties and the explicit formula
for the matrix inverse.
Proof:
Building our open set U = B

(x
0
) and dening
As we mentioned earlier in our schematic, we would like U to have two properties:
For every x U, Df(x) I < <
1
2
Since f is C
1
we know that for a xed (0,
1
2
), there is a > 0 such that for all
x x
0
<
we have
Df(x) Df( x
0
) < <
1
2
.
But notice that the condition really describes all points in a -ball and by the preceding
section, we can assume Df( x
0
) = I. Hence, for all
x B

(x
0
)
we have
Df(x) I < <
1
2
.
For every x U, Df(x) is invertible.
Let x B

(x
0
). To prove this matrix is invertible, it suces to show that for any non-zero
vector v,
Df(x)v > 0.
This would mean that the null space of Df(x) is trivial and thus the matrix is invertible.
Adding

0, we have
Df(x)v = Df(x)v +v v = v + [Df(x) I] v.
To construct a lower bound, we will use the reverse
1
triangle inequality:
a b a b
Rewrite the (RHS) as a dierence and apply this inequality:
_
_
v
_
I Df(x)

v
_
_
v
_
_
_
I Df(x)

v
_
_
.
By our Cauchy-like inequality for matrices, we know that the (RHS) is minimally
v I Df(x) v.
1
Just use triangle inequality with x = a

b and y =

b.
34.5. THE INTRICATE INVERSE FUNCTION THEOREM 653
Recall that for all points in our ball,
Df(x) I <
1
2
.
Therefore, we can shrink our lower bound to
v
1
2
..
>IDf(x)
v =
1
2
v.
Now we have
Df(x)v
1
2
v.
Thus, for any non-zero vector v,
Df(x)v > 0.
Proving the Magical Inequality:

_
x f(x)
_

_
y f(y)
_
y x
We do the same rst step as we had in the proof of the Second Derivative Test. Consider some
function F where the input goes from x to y as t goes from 0 to 1:
F
_
x + t(y x)
_
By the Fundamental Theorem of Calculus,
F(y)
..
F(x+1(yx))
F(x)
. .
F(x+0(yx))
=
_
1
0
d
dt
F
_
x + t(y x)
_
dt.
Applying chain rule, expand the (RHS) as
_
1
0
_
DF
_
x + t(y x)
_
_
(y x) dt.
Now we have the integral equation
F(y) F(x) =
_
1
0
_
DF
_
x + t(y x)
_
_
(y x) dt.
For our choice of F, lets consider plugging in the function that spits out the dierence of the
mapped point from the original vector:
F(x) = x f(x).
In particular,
DF(x) = I Df(x).
Plugging F into our integral equation,
_
x f(x)
_

_
y f(y)
_
=
_
1
0
_
_
_
I Df
_
x + t(y x)
_
. .
DF(x+t(yx))
_
_
_
(y x) dt.
654 LECTURE 34. INTIMIDATING INVERSE FUNCTION THEOREM
Now, take the norm of both sides

_
x f(x)
_

_
y f(y)
_
=
_
_
_
_
_
1
0
_
I Df
_
x + t(y x)
_
_
(y x) dt
_
_
_
_
and from the right, we can compute an upper bound. First, we know the integral is bounded
by the integral with the norm pulled inside:
_
1
0
_
_
_
_
I Df
_
x + t(y x)
_
_
(y x)
_
_
_ dt.
Then, by our Cauchy-like inequality for matrices, we can bound this by
_
1
0
I Df
_
x + t(y x)
_
y x dt.
But the above looks like the second property of B

(x
0
)! To apply it, formally, you need to prove
that for t (0, 1), the line segment
x + t(y x) B

(x
0
)
lies in B

(x
0
). A set with this property is known as convex:
x
+
t
(
y

x
)
x
y
I leave it to you to prove that balls are convex.
1
Using the fact that balls are convex, we can apply our property of U to get our nal upper
bound,
_
1
0

..
IDf(x+t(yx))
y x dt = (y x).
This gives us

_
x f(x)
_

_
y f(y)
_
y x. ()
Proving the Image V = f
_
B

(x
0
)
_
is open.
We want to show that for any y f
_
B

(x
0
)
_
, we can nd a ball centered around y that is still
contained in f
_
B

(x
0
)
_
.
1
I think its only fair for you to prove one fact (when I have to prove a gazillion of them!) Dont worry: just check
the denition!
34.5. THE INTRICATE INVERSE FUNCTION THEOREM 655
y

Proving this directly isnt easy! Instead, we insert an intermediate set inclusion and try
to prove that instead. But what should this intermediate step be?
First, since y f
_
B

(x
0
)
_
, we know
y = f(z)
for some z B

(x
0
). Therefore, the equation we are trying to prove is
B

_
f(z)
_
f
_
B

(x
0
)
_
Now consider the ball around this z with some radius 2:
B
2
(z)
x
0
z
2

(x
0
)
I claim that when we map this ball under f, it will contain our set B

(y) and lie in f


_
B

(x
0
)
_
B

_
f(z)
_
f
_
B
2
(z)
_
f
_
B

(x
0
)
_
.
656 LECTURE 34. INTIMIDATING INVERSE FUNCTION THEOREM
z
2

f(z)

f
Great! But what should our be?
We do the same trick we did in Chapter 14 when we proved that open balls are open. Draw
the radius through z
x
0
z
and calculate the distances:
.

.
.

x
0

.
.

Therefore, let
=
z x
0

2
.
Now that we have the game plan, lets prove:
B

_
f(z)
_
f
_
B
2
(z)
_
f
_
B

(x
0
)
_
34.5. THE INTRICATE INVERSE FUNCTION THEOREM 657
f
_
B
2
(z)
_
f
_
B

(x
0
)
_
Let x f
_
B
2
(z)
_
. Then x = f(y) for some y B
2
(z). The goal is to show
1
y B

(x
0
)
since this implies
x
..
f(y)
f
_
B

(x
0
)
_
.
By our usual triangle shenanigans, introduce z
y x
0
= y x
0
z +z
. .
=0
y z +x
0
z
and then bound the (RHS) by using the fact y B
2
(z):
y z +x
0
z < 2
..
2
zx
0

2
+x
0
z = .
Therefore, y B

(x
0
).
B

_
f(z)
_
f
_
B
2
(z)
_
Let c B

_
f(z)
_
. The goal is to show that there exists some

d B
2
(z) such that
c = f(

d)
Showing

d exists takes a little bit of creativity. Consider a variation of our earlier F:
F(x) = x f(x) +c.
Suppose we can nd a xed point x
FIX
B
2
(z):
F(x
FIX
) = x
FIX
.
Then, plugging in the above,
x
FIX
. .
F(x
FIX
)
= x
FIX
f(x
FIX
) +c
giving us
f(x
FIX
) = c.
Thus, we can use x
FIX
as our

d !
Before we apply last lectures work to prove F is a contraction mapping, notice that there
is a catch:
The Contraction Mapping Theorem only applies to contraction mappings on closed sets.
1
This should be intuitively obvious. Look at the previous diagram: this is how we chose !
658 LECTURE 34. INTIMIDATING INVERSE FUNCTION THEOREM
Therefore, were going to show F is a contraction mapping on the closed ball B
2
(z). More-
over, we add the additional proviso that F maps this closed ball into the open ball B
2
(z).
This implies, particularly, that the xed point x
FIX
B
2
(z).
Now we must prove:
F
_
B
2
(z)
_
B
2
(z).
There is a constant (0, 1) such that for any x, y B
2
(z),
F(x) F(y) x y.
But weve seen the second item before. Namely, its our magical inequality ():

_
x f(x)
_

_
y f(y)
_
x y
where (0,
1
2
). By adding c c, we get the inequality we need:
(x f(x) +c)
. .
F(x)
(y f(y) +c)
. .
F(y)

..

x y
Therefore, we only need to check the rst item:
F
_
B
2
(z)
_
B
2
(z).
Let q B
2
(z). We want to show that F(q) z < 2. Expanding and doing our usual
tricks,
F(q) z = q f(q) +c z + f(z) f(z)
. .
=0
=
_
q f(q)
_

_
z f(z)
_
+
_
c f(z)
_
.
By triangle inequality, this is bounded by

_
q f(q)
_

_
z f(z)
_
+c f(z).
But weve just seen the left summand: it is again just an application of the magical
inequality () with x = q and y = z ! Thus, we can bound our sum:

_
q f(q)
_

_
z f(z)
_
+c f(z) <
..
<
1
2
q z+c f(z)
1
2
q z+c f(z).
By denition, c B

_
f(z)
_
and q B
2
(z), so in fact
1
2
q z +c f(z) <
1
2
2 + = 2,
giving us
F(q) z < 2.
34.5. THE INTRICATE INVERSE FUNCTION THEOREM 659
f is a 1 : 1 map from U = B

( x
0
) to V = f
_
B

( x
0
)
_
Again, use the magical inequality ():

_
x f(x)
_

_
y f(y)
_
x y
This tells us that if f(x) f(y) = 0, then
x y x y.
Thus
(1 )x y 0.
which implies x = y.
In particular, this tells us (f|B

( x
0
))
1
: f
_
B

( x
0
)
_
B

( x
0
) exists.
Magical Reverse-Inequality for f
1
,
f
1
(u) f
1
(v)
u v
1
Starting from our magical inequality ()

_
x f(x)
_

_
y f(y)
_
x y,
rewrite as

_
x y
_

_
f(x) f(y)
_
y x
and apply the reverse triangle inequality on the left to get
x y f(x) f(y)
_
x y
_

_
f(x) f(y)
_
.
Now,
x y f(x) f(y) y x
and by moving terms, we get
(1 )x y f(x) f(y).
For u, v V , plug in
x = f
1
(u)
y = f
1
(v)
This gives us
(1 )f
1
(u) f
1
(v) u v
. .
f(f
1
(u))f(f
1
(v))

so
f
1
(u) f
1
(v)
u v
1
. ()
660 LECTURE 34. INTIMIDATING INVERSE FUNCTION THEOREM
f
1
is dierentiable.
We directly check the denition
1
of dierentiability at any point v
0
V . First, we guess that
The Jacobian of f
1
at v
0
is the inverse of the Jacobian matrix of f evaluated at the inverse
of v
0
under f.
Symbolically,
Df
1
(v
0
) = [Df(u
0
)]
1
where f
1
(v
0
) = u
0
.
Now we need to check that, for any
1
, there exists a
1
> 0 such that if
v v
0
<
1
then
_
_
f
1
(v) f
1
(v
0
) [Df(u
0
)]
1
(v v
0
)
_
_
<
1
v v
0
.
Starting from
_
_
f
1
(v) f
1
(v
0
) [Df(u
0
)]
1
(v v
0
)
_
_
,
rewrite as
_
_
_
_
_
_
[Df(u
0
)]
1
[Df(u
0
)]
. .
I
f
1
(v) [Df(u
0
)]
1
[Df(x)]
. .
I
f
1
(v
0
) [Df(u)]
1
(v v
0
)
_
_
_
_
_
_
and pull out the inverse:
_
_
_[Df(u
0
)]
1
_
[Df(x)]
_
f
1
(v) f
1
(v
0
)
_

_
v v
0
_
__
_
_ .
But this is bounded by
_
_
[Df(u
0
)]
1
_
_
_
_
[Df(x)]
_
f
1
(v) f
1
(v
0
)
_

_
v v
0
_ _
_
.
Therefore, if we can nd a condition on
1
such that
_
_
[Df(x)]
_
f
1
(v) f
1
(v
0
)
_

_
v v
0
_ _
_
<

1
_
_
[Df(x)]
1
_
_
v v
0

then we are done.


To nd this condition, apply the dierentiability denition of f at f
1
(v
0
) with the choice of

2
=

1
(1 )
_
_
[Df(x)]
1
_
_
: there exists a
2
such that if
q f
1
(v
0
) <
2
1
Rather, we use an equivalent denition, in which we replace

h with the dierence v u.
34.5. THE INTRICATE INVERSE FUNCTION THEOREM 661
then
_
_
[Df(x)]
_
q f
1
(v
0
)
_

_
f(q) v
0
_ _
_
<

1
(1 )
_
_
[Df(x)]
1
_
_
q f
1
(v
0
).
In particular, by adding a provision on
1
, this -hypothesis holds for choice
q = f
1
(v).
This is because, by the magical reverse inequality (),
f
1
(v) f
1
(v
0
)
v v
0

1
<
2
when v v
0
<
2
(1 ). Therefore, by choosing

1
=
2
(1 ),
we have
_
_
_
_
_
_
[Df(x)]
_
f
1
(v)
. .
q
f
1
(v
0
)
_

_
v
..
f( q)
v
0
_
_
_
_
_
_
_
<

1
(1 )
_
_
[Df(x)]
1
_
_
f
1
(v)
. .
q
f
1
(v
0
).
But we can increase this upper bound by applying the magical reverse inequality ()
f
1
(v) f
1
(v
0
) <
v v
0

1
and multiplying both sides by

1
(1 )
_
_
[Df(x)]
1
_
_
,

1
(1 )
_
_
[Df(x)]
1
_
_
|f
1
(v) f
1
(v
0
) <

1
_
_
[Df(x)]
1
_
_
v v
0
.
In conclusion,
_
_
[Df(x)]
_
f
1
(v) f
1
(v
0
)
_

_
v v
0
_ _
_
<

1
_
_
[Df(x)]
1
_
_
v v
0

for choice
1
=
2
(1 ).
Df
1
(v) is continuous
1
.
We just showed that
Df
1
(v) = [Df(u)]
1
where
f(u) = v,
so
Df
1
(v) =
_
Df
_
f
1
(v)
_
1
1
In other words, each component function is continuous.
662 LECTURE 34. INTIMIDATING INVERSE FUNCTION THEOREM
But Df and f
1
are continuous, so their composition Df
_
f
1
(v)
_
is also continuous. Moreover,
the inverse matrix of a continuous matrix is also continuous: apply the explicit formula for the
inverse from Lecture 30:
A
1
=
_

_
(1)
1+1
det(A
11
) . . . (1)
c+1
det(A
c1
) . . . (1)
n+1
det(A
n1
)
(1)
1+2
det(A
12
) . . . (1)
c+2
det(A
c2
) . . . (1)
n+2
det(A
n2
)
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
(1)
1+n
det(A
1n
) . . . (1)
c+n
det(A
cn
) . . . (1)
n+n
det(A
nn
)
_

_
where
A = Df
_
f
1
(v)
_
.
Determinants are polynomials, so each component is a composition of continuous functions.
Without a doubt, this is one of the most difficult results to prove. But the mathematics itself isn't hard. The only obstacle is that there are a lot of things that need checking.
In truth, I didn't really understand the Inverse Function Theorem's proof until Math 148: Analysis on Manifolds (Professor Wieczorek is an excellent teacher)! If you're worried about the final, just memorize the statement and learn the proof during Winter Break.
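Before the notation summary, here is a numerical sketch of the theorem's punchline, using a map whose local inverse we can write down by hand (polar coordinates — my own example, not the text's): near a point with r > 0 the map is locally invertible, and the Jacobian of the inverse is the inverse of the Jacobian.

```python
import numpy as np

def f(p):                                   # polar -> Cartesian
    r, t = p
    return np.array([r * np.cos(t), r * np.sin(t)])

def f_inv(q):                               # local inverse near points with x > 0
    x, y = q
    return np.array([np.hypot(x, y), np.arctan2(y, x)])

def jacobian(g, p, h=1e-6):
    """Central finite-difference Jacobian of g at p (an approximation)."""
    p = np.asarray(p, dtype=float)
    cols = []
    for i in range(len(p)):
        e = np.zeros_like(p); e[i] = h
        cols.append((g(p + e) - g(p - e)) / (2 * h))
    return np.column_stack(cols)

u0 = np.array([2.0, 0.5])                   # r > 0, so Df(u0) is invertible
v0 = f(u0)

# Df^{-1}(v0) = [Df(u0)]^{-1}, up to finite-difference error.
assert np.allclose(jacobian(f_inv, v0), np.linalg.inv(jacobian(f, u0)), atol=1e-4)
assert np.allclose(f_inv(v0), u0)
```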
New Notation

Symbol | Reading | Example | Example Translation
1 : 1 | One to one | f(x) is 1 : 1 | f(x) is a 1 : 1 function
\(\overline{B}_r(a)\) | The closed ball of radius r centered at a | \(\overline{B}_1(\vec 0)\) | The unit closed ball centered at \(\vec 0\).
Lecture 35
Implying Implicit Function Theorem
The Math is lovely, dark and deep.
But I have promises to keep.
And miles to go before I sleep.
-Robert
Goals: We prove the Implicit Function Theorem as an application of the Inverse Function
Theorem. Then, using the Implicit Function Theorem, we nally complete the proof of
the Lagrange Multiplier Theorem.
35.1 On Keeping Ones Word
Right before the second midterm, we proved the Lagrange Multiplier Theorem. Unfortunately, we
had to assume one lemma without proof:
Theorem. For \(C^1\) functions \(g_1, g_2, \dots, g_k\), the set
\[ M = \Bigl\{ x \in \mathbb{R}^n \;\Big|\; g_1(x) = g_2(x) = \dots = g_k(x) = 0 \ \text{ and }\ \nabla g_1(x), \nabla g_2(x), \dots, \nabla g_k(x) \text{ are linearly independent} \Bigr\} \]
is an \((n-k)\)-manifold.
Today, we make good on our promise. After many miles of intense mathematics, we prove this theo-
rem as a corollary of the Implicit Function Theorem.
663
664 LECTURE 35. IMPLYING IMPLICIT FUNCTION THEOREM
35.2 Intuition on Implicit Function Theorem
Recall way back in high school, when you tried to graph the circle
\[ x^2 + y^2 = 4 \]
using your trusty TI-89: you couldn't plot it directly. Instead, you solved for y in terms of x, and got
\[ y = \sqrt{4 - x^2} \quad\text{or}\quad y = -\sqrt{4 - x^2}, \]
so you plugged in each equation in your calculator separately. The full circle was the combination of the two graphs.
That's all familiar, but now let's look at this under a mathematical lens.
Consider the function
\[ G(x, y) = x^2 + y^2 - 4. \]
The circle is simply the set of points where this function is 0:
\[ S = \left\{ \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^2 \;\middle|\; G\begin{pmatrix} x \\ y \end{pmatrix} = 0 \right\}. \]
For any point
\[ \begin{pmatrix} a \\ b \end{pmatrix} \in S, \]
we want to write the y-coordinate as a function of x. But we have to consider two cases: for a point in the upper half of the circle, it is contained in the graph
\[ H_1 = \left\{ \begin{pmatrix} x \\ f_1(x) \end{pmatrix} \;\middle|\; -2 \le x \le 2 \right\} \quad\text{where } f_1(x) = \sqrt{4 - x^2}. \]
If the point is in the lower half, it is contained in the graph
\[ H_2 = \left\{ \begin{pmatrix} x \\ f_2(x) \end{pmatrix} \;\middle|\; -2 \le x \le 2 \right\} \quad\text{where } f_2(x) = -\sqrt{4 - x^2}. \]
Easy, right? This almost expresses the idea behind the Implicit Function Theorem. But because we are in Math 51H, we are going to add another spin.
As you may know, we are in love with open sets. So let's ask,
Is it true that any point \(\begin{pmatrix} a \\ b \end{pmatrix} \in S\) is contained in some graph
\[ H = \left\{ \begin{pmatrix} x \\ f(x) \end{pmatrix} \;\middle|\; x \in U \right\} \]
where U is an open set?
Absolutely not! Let \(\begin{pmatrix} x_0 \\ 0 \end{pmatrix}\) be one of the two points on the x-axis:
(Take, say, \(x_0 = 2\).) Consider any open interval U containing \(x_0\). If you tried to define a function f(x), it would only give points on one side of \(x_0\). Complete and utter fail!¹
But why does it fail? Well, it's because at \(\begin{pmatrix} x_0 \\ 0 \end{pmatrix}\) the graph has a turning point along the y direction.

¹ This should remind you of how, in our proof that the unit circle is a 1-manifold, the two points on the x-axis required special treatment.
If we look at the Jacobian
\[ DG(x, y) = \begin{pmatrix} 2x & 2y \end{pmatrix} \]
and restrict the matrix to only the entry corresponding to differentiation with respect to the y variable,
\[ \begin{pmatrix} 2y \end{pmatrix}, \]
we have a non-invertible matrix (i.e., the zero real number, since the matrix is 1 × 1) precisely when y = 0.
In summary,
• Start with some set of points S where the function G is zero (the circle).
• For any x ∈ S, we want to write S locally as a graph on some open set (an open interval on the x axis).
• We can do this if the Jacobian restricted to the graph variables is invertible (the derivative of G with respect to y is not zero).
This is the essence of the Implicit Function Theorem.
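A small numerical sketch of this summary for the circle (the point (a, b) = (1, √3) is my own choice): away from the two points with y = 0, the relation G(x, y) = 0 is locally the graph of h(x) = √(4 − x²), and implicit differentiation gives h′(a) = −(∂G/∂x)/(∂G/∂y) evaluated at (a, b).

```python
import numpy as np

def G(x, y):
    return x**2 + y**2 - 4.0

# Near (a, b) = (1, sqrt(3)) we have dG/dy = 2b != 0, so locally y = h(x).
a, b = 1.0, np.sqrt(3.0)
h = lambda x: np.sqrt(4.0 - x**2)           # the explicit local graph

xs = np.linspace(a - 0.5, a + 0.5, 101)
assert np.allclose(G(xs, h(xs)), 0.0)       # G(x, h(x)) = 0 on an interval around a

# Implicit differentiation: h'(a) = -(dG/dx)/(dG/dy) evaluated at (a, b).
dGdx, dGdy = 2 * a, 2 * b
h_prime_numeric = (h(a + 1e-6) - h(a - 1e-6)) / 2e-6
assert np.isclose(h_prime_numeric, -dGdx / dGdy, atol=1e-6)
```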
35.3 Formalization
To give a formal description of Implicit Function Theorem, we will need new notation. Consider a
function with more inputs than outputs:
G : R
n
R
m
, n > m.
When we look at the inputs, we can consider the rst n m variables and the remaining m variables
separately:
G
_
_
_
_
_
_
_
_
_
_
_
_
_
x
1
x
2
.
.
.
x
nm
x
nm+1
x
nm+2
.
.
.
x
n
_
_
_
_
_
_
_
_
_
_
_
_
_
668 LECTURE 35. IMPLYING IMPLICIT FUNCTION THEOREM
To make our lives easier, we relabel variables and denote the input by a concatenation of vectors
x R
nm
and y R
m
:
G
_
_
_
_
_
_
_
_
_
_
_
_
_
x
1
x
2
.
.
.
x
nm
y
1
y
2
.
.
.
y
m
_
_
_
_
_
_
_
_
_
_
_
_
_
= G
_
x
y
_
.
Using this notation, we can formally dene a Jacobian restricted to variables x, y:
Denition. Let G : R
n
R
m
where n > m and denote the input vector of G as the concatenation
of x R
nm
and y R
m
,
_
x
y
_
.
The Jacobian restricted to variable x is the rst n m columns of the Jacobian:
D
x
G(q) =
_

_
D
1
G(q) D
2
G(q) . . . D
nm
G(q)
_

_
.
The Jacobian restricted to variable y is the last m columns of the Jacobian:
D
y
G(q) =
_

_
D
nm+1
G(q) D
nm+2
G(q) . . . D
n
G(q)
_

_
.
Observe that D
y
G(q) is an mm matrix, so it makes sense to talk about its inverse (if it exists).
We will also use sub-vector notation:
_
x

a
b
=
_

_
x
a
x
a+1
.
.
.
x
b
_

_
.
Armed with the proper notation, lets describe the theorem:
35.3. FORMALIZATION 669
For a function G : R
n
R
m
where n > m, let S be the set where G is

0:
S =
__
x
y
_
R
n

G
_
x
y
_
=

0
_
and consider an element
_
a

b
_
S.
S
_
a

b
_
Suppose that
D
y
G
_
a

b
_
is invertible
Then for some open set V containing
_
a

b
_
, consider S V :
V
We can write this local region of S as a graph
1
of a C
1
function h over U
S V =
__
x
h(x)
_

x U
_
where U is a open set in R
nm
containing a:
670 LECTURE 35. IMPLYING IMPLICIT FUNCTION THEOREM
h(a)
a
U
35.4 The Proof
Proving the Implicit Function Theorem is going to be easy. Why? Because we did all the work when
we proved the Inverse Function Theorem! Generally,
Math Mantra: Dont reprove a theorem from scratch if you can build off previous
work!
Muych like the Mean Value Theorem is a cute application of Rolles Theorem, the Implicit Function
Theorem is a cute application of the Inverse Function Theorem. The key is to consider the function
f
_
x
y
_
=
_

_
x
G
_
x
y
_
_

_
.
Note that f maps
_
a

b
_
G
zero
to a vector a concatenated to the zero vector:
_
a

b
_ _
a

0
_
f
So, if the inverse did exist, it would give us the reverse mapping
1
If you dont see the connections to manifolds, you gotta go back to Lecture 26!
35.4. THE PROOF 671
_
a

0
_ _
a

b
_
f
1
Stare at the rst n m inputs and the last m outputs of f
1
:
_
a

0
_ _
a

b
_
f
1
This is exactly what we need: a mapping that inputs a and spits out

b. Therefore, form the function
h(x) =
_
f
1
_
x

0
__
nm+1
n
Now, all we have to do is verify that this is our graph map. Luckily, this is going to be an easy
consequence of the Inverse Function Theorem!
Theorem (Implicit Function Theorem). Let G : R
n
R
m
where n > m and denote the input
vector of G as the concatenation of x R
nm
and y R
m
,
_
x
y
_
.
Consider the set
G
zero
=
__
x
y
_
R
n

G
_
x
y
_
=

0
_
.
Then for any element
_
a

b
_
G
zero
, provided that D
y
G
_
a

b
_
is invertible, there exist a C
1
function
h and open sets V R
n
, U R
nm
such that a U and
G
zero
V =
__
x
h(x)
_

x U
_
.
Proof Summary:
672 LECTURE 35. IMPLYING IMPLICIT FUNCTION THEOREM
Directly check that
f
_
x
y
_
=
_

_
x
G
_
x
y
_
_

_
satises the conditions of the Inverse Function Theorem.
By Inverse Function Theorem, there exist open sets U

R
2nm
, V R
n
such that f
1
: U

V
exists.
Dene
h(x) =
_
f
1
_
x

0
__
nm+1
n
and
U =
_
_
x

1
nm

x U

_
.
Verify
G
zero
V =
__
x
h(x)
_

x U
_
.
Proof: Let
_
a

b
_
G
zero
such that D
y
G
_
a

b
_
is invertible. Consider the function
f
_
x
y
_
=
_

_
x
G
_
x
y
_
_

_
.
First, we show that f
1
exists by verifying the conditions of the Inverse Function Theorem. Im-
mediately, we know that f is C
1
since G is C
1
. Also, we can directly compute the Jacobian of f
as
Df
_
x
y
_
=
_

_
1 0 . . . 0 . . . 0
0 1 . . . 0 . . . 0
0 0 . . . 0 . . . 0
.
.
.
.
.
.
.
.
.
.
.
. . . . 0
0 0 . . . 1
.
.
. 0
D
1
G
_
a

b
_
D
2
G
_
a

b
_
. . . D
nm
G
_
a

b
_
. . . D
n
G
_
a

b
_
_

_
Condensely, this is just
35.4. THE PROOF 673
Df
_
x
y
_
=
_

_
_

_
I
nm
0
m
D
x
G
_
a

b
_
D
y
G
_
a

b
_
where I
nm
, 0
m
are the (n m) (n m) identity matrix and mm zero matrix, respectively.
Because of the identity matrix, when we compute the determinant of Df
_
x
y
_
, we are forced to
choose
i
1
= 1
i
2
= 2
.
.
.
i
nm
= n m
so in fact,
det
_
Df
_
x
y
__
= det
_
D
y
G
_
a

b
__
.
Thus, we can apply the Inverse Function Theorem: there exist open sets U

R
2nm
, V R
n
such
that f
1
: U

V exists. Dene
h(x) =
_
f
1
_
x

0
__
nm+1
n
and take U to be the rst n m components of U

:
U =
_
_
x

1
nm

x U

_
Using a simple proof by contradiction, we can verify that U is indeed open.
Now, all we need to show is
G
zero
V =
__
x
h(x)
_

x U
_
:

Let
_
c

d
_
G
zero
V . Since
_
c

d
_
G
zero
,
f
_
c

d
_
=
_
c

0
_
.
674 LECTURE 35. IMPLYING IMPLICIT FUNCTION THEOREM
Moreover,
_
c

d
_
V , so we can invert the map:
f
1
_
c

0
_
=
_
c

d
_
.
But
_
c

0
_
U

implies
c U,
and so
_
c
h(c)
_
=
_
_
c
_
f
1
_
c

0
__
nm+1
n
_
_
=
_
c

d
_
.

Let c U, and consider
_
c
h(c)
_
. The key observation is to notice
1
that the rst n m
components of f
1
s output is x, so in fact,
f
1
_
x
y
_
=
_
_
x
_
f
1
_
x
y
__
nm+1
n
_
_
.
Thus
f
1
_
c

0
_
=
_
_
c
_
f
1
_
c

0
__
nm+1
n
_
_
=
_
c
h(c)
_
.
But f
1
: U

V so
_
c
h(c)
_
V
Moreover, by denition of f,
f
_
c
h(c)
_
=
_

_
c
G
_
c
h(c)
_
_

_
.
But we also know
f
_
c
h(c)
_
=
_
c

0
_
therefore
_

_
c
G
_
c
h(c)
_
_

_
=
_
c

0
_
.
1
The inverse of the identity I(x) = x is just itself!
35.4. THE PROOF 675
Equating components,
G
_
c
h(c)
_
=

0.
Thus,
_
c
h(c)
_
G
zero
and so
_
c
h(c)
_
G
zero
V.
Now, we can nally fulll our promise:
Theorem. For C
1
functions g
1
, g
2
, . . . g
k
, the set
M =
_
_
_
x R
n

g
1
(x) = g
2
(x) = . . . = g
k
(x) = 0
and
g
1
(x), g
2
(x), . . . , g
k
(x) are linearly independent
_
_
_
is an (n k)-manifold.
Proof Summary:
Dene G : R
n
R
k
as
G
_
x
y
_
=
_

_
g
1
(x, y)
g
2
(x, y)
.
.
.
g
k
(x, y)
_

_
for input variables x R
nk
and y R
k
.
Compute DG
_
a

b
_
. There are k linearly independent rows; thus, there must exist k linearly
independent columns. WLOG, assume the last k columns are linearly independent.
Apply the Implicit Function Theorem: there exist a C
1
function h and open sets V R
n
, U
R
nk
such that a U and
G
zero
V =
__
x
h(x)
_

x U
_
.
Check that G
zero
can be replaced by M.
Proof: By denition of a manifold, we need to consider an arbitrary point
_
a

b
_
M,
676 LECTURE 35. IMPLYING IMPLICIT FUNCTION THEOREM
where a R
nk
,

b R
k
. Then, we need to show there exist open sets U R
nk
, V R
n
such that
a U and M V is a permuted graph of some function h over U.
The key is to apply the Implicit Function Theorem. First, dene G : R
n
R
k
,
G
_
x
y
_
=
_

_
g
1
(x, y)
g
2
(x, y)
.
.
.
g
k
(x, y)
_

_
for input variables x R
nk
and y R
k
.
Then,
DG
_
a

b
_
=
_

_
_
g
1
(a,

b)
_
T
_
g
2
(a,

b)
_
T
.
.
.
_
g
k
(a,

b)
_
T
_

_
.
By denition of M, this matrix has k linearly independent rows. Furthermore, by our theorem on
rank, there must exist k linearly independent columns. Without loss of generality,
1
we can reorder
the input variables of G so that the last k columns of DG
_
a

b
_
are linearly independent. Thus,
D
y
G
_
a

b
_
is invertible.
So by the Implicit Function Theorem, there exists a C
1
function h and open sets V R
n
, U R
nk
such that a U and
G
zero
V =
__
x
h(x)
_

x U
_
If we can replace G
zero
by M, then we are done.
But recall that in our construction of V in the original Inverse Function Theorem, we designed
D
y
G
_
x
y
_
to be non-singular on V . Therefore the rank of D
y
G
_
x
y
_
is k on V . This implies the
full matrix
DG
_
x
y
_
=
_

_
_
g
1
(x, y)
_
T
_
g
2
(x, y)
_
T
.
.
.
_
g
k
(x, y)
_
T
_

_
has rank at least k on V . But there are only k rows, implying the rows are linearly independent.
These are the g
i
(x, y), so they are linearly independent on V , allowing us to conclude
M V =
__
x
h(x)
_

x U
_
.
1
Reordering will give us a permuted graph, which is all we need. But to avoid the headache of P(G(U)), lets make
this assumption.
35.4. THE PROOF 677
New Notation
Symbol Reading Example Example Translation
G
_
x
y
_
The function G with
a concatenated input
formed from x, y
G
_
e
1
e
2
_
The concatenation of the rst two
standard basis vectors inputted
into G.
D
x
G(q) The Jacobian of G re-
stricted to x evaluated
at q
D
x
G(q) is invertible The Jacobian of G restricted to x
evaluated at q is invertible.
_
x

a
b
The sub-vector of x
from a to b
_
x

1
3
The sub-vector formed from the
rst 3 components of x .
678 LECTURE 35. IMPLYING IMPLICIT FUNCTION THEOREM
Simons Secret Lecture 36
Proving FTA: An Analytic Way
When I read papers in the humanities, sometimes I see cool ideas and nod my head.
I completely understand where the author is coming from.
But only with mathematical proofs have I ever found myself in complete and total awe,
wondering how these ideas could have been pulled out of the ether.
- B
F
SCHO
([])
Goals: In the rst of two optional lectures, we prove the Fundamental Theorem of
Algebra. The proof will be purely analytic, relying on complex numbers and harmonic
functions.
36.1 Journey to Another Plane: Preparations
When you studied polynomials in Algebra II, you especially focused on solving for the roots, i.e., the
values r such that
P(r) = 0.
But why this emphasis? Why should we care?
The more obvious, practical reason is that a real world phenomenon could be modeled by some
polynomial P(x). Solving for a particular value b is equivalent to nding the root of
Q(x) = P(x) b.
But we also have a theoretical reason. Remember,
Math Mantra: Suppose we can always rewrite our objects in some form that has
additional structure. Then we can EXPLOIT this extra structure in our proofs.
The key idea is that we can use roots to factorize polynomials:
679
680 SIMONS SECRET LECTURE 36. PROVING FTA: AN ANALYTIC WAY
Lemma. If \(r \in \mathbb{C}\) is a root of a monic polynomial
\[ P(x) = x^n + a_{n-1}x^{n-1} + \dots + a_1 x + a_0, \]
then
\[ P(x) = (x - r)\,Q(x) \]
where Q is an \((n-1)\)-degree monic polynomial.
Proof: Since
\[ P(r) = 0, \]
we know
\[ P(x) = P(x) - P(r). \]
Expanding the (RHS),
\[ (x^n + a_{n-1}x^{n-1} + \dots + a_1 x + a_0) - (r^n + a_{n-1}r^{n-1} + \dots + a_1 r + a_0), \]
which we can regroup as
\[ \bigl(x^n - r^n\bigr) + a_{n-1}\bigl(x^{n-1} - r^{n-1}\bigr) + \dots + a_2\bigl(x^2 - r^2\bigr) + a_1\bigl(x - r\bigr) + \underbrace{(a_0 - a_0)}_{=0}. \]
But we can pull out \(x - r\) from each of these terms using the infamous identity:¹
\[ x^j - r^j = (x - r)\underbrace{\bigl(x^{j-1} + x^{j-2}r + x^{j-3}r^2 + \dots + x^2 r^{j-3} + x\, r^{j-2} + r^{j-1}\bigr)}_{P_j(x)}. \]
Now we have
\[ (x - r)P_n(x) + a_{n-1}(x - r)P_{n-1}(x) + \dots + a_2(x - r)P_2(x) + a_1(x - r)P_1(x), \]
giving us
\[ P(x) = (x - r)\underbrace{\bigl(P_n(x) + a_{n-1}P_{n-1}(x) + \dots + a_1 P_1(x)\bigr)}_{Q(x)} \]
where Q(x) is a degree \(n - 1\) monic polynomial (since only \(P_n\) contributes the \(x^{n-1}\) term).
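A quick numerical sketch of the lemma, using NumPy's polynomial division on a polynomial of my own choosing: dividing P by (x − r) at a known root r leaves zero remainder, and the quotient is monic of degree n − 1.

```python
import numpy as np

# P(x) = (x - 1)(x - 2)(x + 3) = x^3 - 7x + 6, coefficients highest degree first.
P = np.array([1.0, 0.0, -7.0, 6.0])
r = 2.0                                   # a root: P(r) = 0
assert np.isclose(np.polyval(P, r), 0.0)

Q, remainder = np.polydiv(P, np.array([1.0, -r]))   # divide by (x - r)

assert np.allclose(remainder, 0.0)        # P(x) = (x - r) Q(x) exactly
assert np.isclose(Q[0], 1.0) and len(Q) == len(P) - 1   # Q is monic, degree n-1
print("Q coefficients:", Q)               # x^2 + 2x - 3
```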
Suppose we can prove

Theorem (Fundamental Theorem of Algebra). Every polynomial has at least one complex² root.

Then we can inductively apply the preceding lemma to rewrite every polynomial as a product
\[ P(x) = (x - r_1)(x - r_2)\cdots(x - r_n). \]
AWESOME!

¹ Expand it yourself to check!
This important fact is used in tons of proofs. For example, recall the polynomial
\[ x^2 + x + 41 \]
that always spits out a prime number for
\[ x = 1, 2, \dots, 39. \]
x = 1, 2, . . . , 39.
Using the Fundamental Theorem of Algebra (FTA), you can easily prove that there does not exist a
non-constant polynomial that spits out a prime for every integer input. But for this proof, you will
need to wait. Eventually you will hear the SOUND-Kararajan of mathematics.
But how do we prove the FTA?
Math Mantra: You dont have to limit yourself to a single branch of
Mathematics!
We are going to jump out of the Algebra zone and appeal to the gods of Analysis. But not just any
Analysis. We will need Complex Analysis. But before we take a journey to another plane, we need
to talk about harmonic functions.
36.2 No Harm in Harmonics
First, we dene an important class of functions:
Definition. Consider a \(C^2\) function f that maps into \(\mathbb{C}\). We say that f is harmonic if, at every point in its domain, the sum of the pure second derivatives is zero:
\[ \sum_{j=1}^{n} D_j D_j f(x) = 0. \]
For example,
\[ x^3 - 3xy^2 \]
is a harmonic function.

² This is a VERY important distinction. For example, \(x^2 + 1\) has NO real roots.
682 SIMONS SECRET LECTURE 36. PROVING FTA: AN ANALYTIC WAY
Harmonic functions have a key property:
Recall that a continuous functions over a closed and bounded set always achieves its extrema. If our
function is also harmonic, the extrema over a closed and bounded set are always achieved on the
boundary of the set.
Of course, this only makes sense if the values of the function over this set are real (how do you
maximize a complex number)?
For this lecture, we only need to consider the case of a closed ball centered around

0. And as always,
it suces to prove the maximum case:
Theorem. Let \(f : \overline{B}_R(\vec 0) \to \mathbb{R}\) be continuous. If \(f|_{B_R(\vec 0)}\) is harmonic, then the extrema of f are achieved on the boundary of the ball (denoted \(\partial B_R(\vec 0)\)). In other words, if an extremum is achieved at \(\vec a\), then
\[ \|\vec a\| = R. \]
Proof Summary:
It suces to show that for every > 0,
g(x) = f(x) + x
2
achieves its maxima on the boundary.
Suppose g achieves a maximum at a where a / B
R
(

0).
Show
n

j=1
D
j
D
j
g(a) > 0 and
n

j=1
D
j
D
j
g(a) 0.

j=1
D
j
D
j
g(a) > 0
Directly compute D
j
D
j
g(a).
36.2. NO HARM IN HARMONICS 683

j=1
D
j
D
j
g(a) 0
Dene
s(t) = g
_
_
_
_
_
_
_
a
1
.
.
.
a
j
+ t
.
.
.
a
n
_
_
_
_
_
_
_
and use 1D calculus to show that for all j,
s

(0) = D
j
D
j
g(a) 0
Proof: First, dene an auxiliary function
g(x) = f(x) + x
2
for an arbitrary > 0. If we can prove that the maximum of g over B
R
(

0) is achieved on the boundary,


then this implies that the maximum of f is also achieved on the boundary. To see this, note that if
g achieves its maximum at a B
R
(

0), then for all x B


R
(

0),
f(a) + R
2
. .
g(a)
f(x) + x
2
. .
g(x)
g(x)
and since is arbitrary we may tke the limit as 0 to get:
f(a) f(x)
Suppose that g achieves a maximum at a where a / B
R
(

0). We will derive a contradiction by


showing
n

j=1
D
j
D
j
g(a) > 0 and
n

j=1
D
j
D
j
g(a) 0.

j=1
D
j
D
j
g(a) > 0
Directly dierentiate
g(x) = f(x) + (x
2
1
+ x
2
2
+ . . . + x
2
n
)
Then
D
j
g(x) = D
j
f(x) + (2x
j
)
and thus
D
j
D
j
g(x) = D
j
D
j
f(x) + 2.
Summing across all j and evaluating at a yields
n

j=1
D
j
D
j
g(a) =
n

j=1
D
j
D
j
f(a) +
n

j=1
2.
684 SIMONS SECRET LECTURE 36. PROVING FTA: AN ANALYTIC WAY
But > 0 and

n
j=1
D
j
D
j
f(a) = 0 since f is harmonic. Thus,
n

j=1
D
j
D
j
g(a) > 0.

j=1
D
j
D
j
g(a) 0
We do our usual trick: build a single variable function and apply 1D Calculus.
Dene
s(t) = g
_
_
_
_
_
_
_
a
1
.
.
.
a
j
+ t
.
.
.
a
n
_
_
_
_
_
_
_
.
Because g achieves a maximum at a, s achieves a maximum at 0. By 1D Calculus,
s

(0) 0
But
s

(t) = lim
h0
g
_
_
_
_
_
_
_
a
1
.
.
.
a
j
+ t + h
.
.
.
a
n
_
_
_
_
_
_
_
g
_
_
_
_
_
_
_
a
1
.
.
.
a
j
+ t
.
.
.
a
n
_
_
_
_
_
_
_
h
= D
j
g
_
_
_
_
_
_
_
a
1
.
.
.
a
j
+ t
.
.
.
a
n
_
_
_
_
_
_
_
,
implying
s

(0) = lim
h0
s

(h) s

(0)
h
= lim
h0
D
j
g
_
_
_
_
_
_
_
a
1
.
.
.
a
j
+ h
.
.
.
a
n
_
_
_
_
_
_
_
D
j
g
_
_
_
_
_
_
_
a
1
.
.
.
a
j
.
.
.
a
n
_
_
_
_
_
_
_
h
= D
j
D
j
g(a).
Thus
D
j
D
j
g(a) 0
and summing over all j yields
n

j=1
D
j
D
j
g(a) 0.
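Here is a rough numerical sketch of the statement, with a harmonic function of my own choosing and a sampled grid standing in for the closed ball: the maximum of u(x, y) = x² − y² over the disk of radius R is attained on the boundary.

```python
import numpy as np

# u(x, y) = x^2 - y^2 is harmonic: u_xx + u_yy = 2 - 2 = 0.
u = lambda x, y: x**2 - y**2
R = 1.0

# Sample the closed ball of radius R.
xs, ys = np.meshgrid(np.linspace(-R, R, 401), np.linspace(-R, R, 401))
inside = xs**2 + ys**2 <= R**2
values = np.where(inside, u(xs, ys), -np.inf)

i, j = np.unravel_index(np.argmax(values), values.shape)
x_max, y_max = xs[i, j], ys[i, j]

# The maximizer sits (numerically) on the boundary of the ball.
assert np.isclose(np.hypot(x_max, y_max), R, atol=1e-2)
print("max of u attained at radius", np.hypot(x_max, y_max))
```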
36.3. GETTING COMPLEX IN HERE 685
One question you should be asking yourself is
What do harmonic functions have to do with the Fundamental Theorem of Algebra?
It isnt obvious. You cant just mindlessly apply the preceding lemma to polynomials because
For a single real variable x, generally a polynomial P(x) is not harmonic.
This is because P(x) is harmonic if and only if its second derivative is 0. Unless the degree is less
than three, this is absolutely untrue.
However,
For a complex variable z, a polynomial P(z) is haromonic.
1
Namely, it is a harmonic function with respect to its real and imaginary variables! But you still cant
use the preceding lemma with this fact: P(z) maps into C, not R, maximum doesnt even make
sense!
Instead, we will use that fact that P(z) is harmonic to prove
The real and imaginary parts of
1
P(z)
are harmonic.
which is the lynchpin of the FTA!
36.3 Getting Complex in Here
In case you were too busy studying for Midterm 2, lets discuss some of the philosophy behind the
rst problem of Homework 8. But I highly recommend that, before going on, you go back and prove
for yourself that P(z) is indeed harmonic. Its a pretty cool result!
Consider a complex function
f(z)
where we take z to be a complex input
z = x + iy
and the variables x, y correspond to the real and imaginary components, respectively. We are going
to think of f(z) as a function from R
2
C:

f(x, y) = f(x + iy)


For ease, we suppress the binary input and leave it as f(z) with the convention that z = x + iy and
that f is in fact, a function of two variables.
The denition of harmonic boils down to showing
D
x
D
x
f(z) + D
y
D
y
f(z) = 0
1
You proved this on the rst problem of Homework 8.
686 SIMONS SECRET LECTURE 36. PROVING FTA: AN ANALYTIC WAY
But because we are dealing with complex functions, we actually have an easier criterion.
For a complex function f(z), think of its output as a sum of two functions:
f(z) = u(x, y) + iv(x, y)
where u(x, y), v(x, y) correspond to the real and imaginary component of f(z), respectively:
Definition. Let f be a complex function:
\[ f(z) = u(x, y) + i\,v(x, y) \]
where u and v are \(C^2\). We say that f satisfies the Cauchy-Riemann Equations on \(S \subseteq \mathbb{R}^2\) if, for all \((x, y) \in S\),
\[ \frac{\partial u}{\partial x}(x, y) = \frac{\partial v}{\partial y}(x, y), \qquad \frac{\partial u}{\partial y}(x, y) = -\frac{\partial v}{\partial x}(x, y). \]
It turns out that
Lemma. Let f be a complex function with \(C^2\) real and imaginary components u and v, respectively. If f satisfies the Cauchy-Riemann Equations on S, then f is harmonic on S.

Proof: We simply apply the Cauchy-Riemann Equations and use the fact that for \(C^2\) functions, mixed partials commute.
Differentiate with respect to x:
\[ D_x f(z) = \frac{\partial u}{\partial x}(x, y) + i\,\frac{\partial v}{\partial x}(x, y) \]
and apply the Cauchy-Riemann Equations:
\[ D_x f(z) = \frac{\partial v}{\partial y}(x, y) - i\,\frac{\partial u}{\partial y}(x, y). \]
Then differentiate again:
\[ D_x D_x f(z) = \frac{\partial^2 v}{\partial y\,\partial x}(x, y) - i\,\frac{\partial^2 u}{\partial y\,\partial x}(x, y). \]
Likewise, when we differentiate with respect to y,
\[ D_y f(z) = \frac{\partial u}{\partial y}(x, y) + i\,\frac{\partial v}{\partial y}(x, y), \]
we can apply the Cauchy-Riemann Equations to get
\[ D_y f(z) = -\frac{\partial v}{\partial x}(x, y) + i\,\frac{\partial u}{\partial x}(x, y). \]
Thus,
\[ D_y D_y f(z) = -\frac{\partial^2 v}{\partial x\,\partial y}(x, y) + i\,\frac{\partial^2 u}{\partial x\,\partial y}(x, y), \]
so
\[ D_x D_x f(z) + D_y D_y f(z) = \frac{\partial^2 v}{\partial y\,\partial x}(x, y) - i\,\frac{\partial^2 u}{\partial y\,\partial x}(x, y) - \frac{\partial^2 v}{\partial x\,\partial y}(x, y) + i\,\frac{\partial^2 u}{\partial x\,\partial y}(x, y). \]
Since mixed partials commute,
\[ D_x D_x f(z) + D_y D_y f(z) = 0. \]
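A numerical sketch of this lemma for the polynomial P(z) = z³ (my own example, checked by finite differences): its real and imaginary parts satisfy the Cauchy-Riemann equations, and the function is harmonic.

```python
import numpy as np

P = lambda x, y: (x + 1j * y) ** 3            # P(z) = z^3 as a function of (x, y)
u = lambda x, y: P(x, y).real
v = lambda x, y: P(x, y).imag

def d(g, x, y, wrt, h=1e-5):
    """Central finite-difference partial derivative (an approximation)."""
    if wrt == "x":
        return (g(x + h, y) - g(x - h, y)) / (2 * h)
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

x, y = 0.7, -1.3                              # an arbitrary test point

# Cauchy-Riemann: u_x = v_y and u_y = -v_x.
assert np.isclose(d(u, x, y, "x"), d(v, x, y, "y"), atol=1e-6)
assert np.isclose(d(u, x, y, "y"), -d(v, x, y, "x"), atol=1e-6)

# Harmonic: u_xx + u_yy = 0 (second central differences).
u_xx = (u(x + 1e-4, y) - 2 * u(x, y) + u(x - 1e-4, y)) / 1e-8
u_yy = (u(x, y + 1e-4) - 2 * u(x, y) + u(x, y - 1e-4)) / 1e-8
assert abs(u_xx + u_yy) < 1e-4
```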
This should all be familiar: on Homework 8, you used this criterion to prove that P(z) is harmonic.
We will use this criterion again to prove that
1
P(z)
is also harmonic. From this, we will then conclude
that the real and imaginary components of
1
P(z)
are harmonic.
To minimize distractions and make this proof easier to understand, we are going to use the notation of
Leon Simons text. We drop the (x, y) input symbol and use subscripts to denote partial derivatives.
Dont forget that u, v, S, T are functions!
Theorem. Let P(z) be a polynomial and let¹
\[ A = \{(x, y) \in \mathbb{R}^2 : P(x + iy) = 0\}. \]
Then define \(S, T : \mathbb{R}^2 \setminus A \to \mathbb{R}\) to be the real and imaginary components, respectively, of \(\frac{1}{P(z)}\):
\[ \frac{1}{P(z)} = S(x, y) + i\,T(x, y). \]
Then S and T are harmonic on \(\mathbb{R}^2 \setminus A\).
Proof Summary:
It suces to show that
1
P(z)
is harmonic.
For P(z) = u + iv, solve
S =
u
u
2
+ v
2
T =
v
u
2
+ v
2
Use the fact that u, v satisfy the Cauchy-Riemann Equations to show that S, T also satisfy the
Cauchy-Riemann Equations.
688 SIMONS SECRET LECTURE 36. PROVING FTA: AN ANALYTIC WAY
Proof: To show that S and T are harmonic on R² \ A, it suffices to show that 1/P(z) is harmonic on R² \ A. This is because for a complex function

f(z) = a + ib,

if

f_xx + f_yy = 0,

then expanding the partial derivatives gives

(a_xx + i b_xx) + (a_yy + i b_yy) = 0.

Regrouping,

(a_xx + a_yy) + (b_xx + b_yy) i = 0 + 0i.

Equating real and imaginary coefficients,

a_xx + a_yy = 0
b_xx + b_yy = 0

Thus, a and b are harmonic.

Now we just need to check that 1/P(z) is harmonic by showing that S, T satisfy the Cauchy-Riemann Equations.

On the homework, you already proved that the real and imaginary components of P(z) satisfy these equations: for

P(z) = u + iv

we have

u_x = v_y
u_y = -v_x

Rewrite S, T in terms of u, v:

1/P(z) = 1/(u + iv).

By our old trick of multiplying by conjugates,

1/(u + iv) = 1/(u + iv) · (u - iv)/(u - iv) = (u - iv)/(u² + v²) = u/(u² + v²) - i · v/(u² + v²).

Equating real and imaginary coefficients,

S = u/(u² + v²)        T = -v/(u² + v²)

Checking that 1/P(z) is harmonic is now easy:
• S_x = T_y

Apply the quotient and chain rules to compute

S_x = [ (u² + v²) u_x - u (2u u_x + 2v v_x) ] / (u² + v²)²
    = [ (v² - u²) u_x - 2uv v_x ] / (u² + v²)²

T_y = [ -(u² + v²) v_y + v (2u u_y + 2v v_y) ] / (u² + v²)²
    = [ (v² - u²) v_y + 2uv u_y ] / (u² + v²)²

By the Cauchy-Riemann equations for u, v (substitute u_x = v_y and v_x = -u_y),

S_x = [ (v² - u²) v_y + 2uv u_y ] / (u² + v²)² = T_y.
• S_y = -T_x

Apply the quotient and chain rules to compute

S_y = [ (u² + v²) u_y - u (2u u_y + 2v v_y) ] / (u² + v²)²
    = [ (v² - u²) u_y - 2uv v_y ] / (u² + v²)²

T_x = [ -(u² + v²) v_x + v (2u u_x + 2v v_x) ] / (u² + v²)²
    = [ (v² - u²) v_x + 2uv u_x ] / (u² + v²)²

By the Cauchy-Riemann equations for u, v (substitute u_y = -v_x and v_y = u_x),

S_y = [ -(v² - u²) v_x - 2uv u_x ] / (u² + v²)² = -T_x.

So S and T satisfy the Cauchy-Riemann Equations, 1/P(z) is harmonic on R² \ A, and therefore S and T are harmonic on R² \ A. ∎
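As a sanity check on all of this symbol pushing, here is a short sympy sketch (the sample polynomial P(z) = z² + 1 is my own arbitrary choice): it builds S and T exactly as above and verifies both Cauchy-Riemann equations, plus the Laplace equation, away from the zeros of P.

import sympy as sp

x, y = sp.symbols('x y', real=True)
P = sp.expand((x + sp.I*y)**2 + 1)      # a sample polynomial P(z) = z^2 + 1
u, v = sp.re(P), sp.im(P)

S = u / (u**2 + v**2)                   # real part of 1/P(z), as derived above
T = -v / (u**2 + v**2)                  # imaginary part of 1/P(z)

# S, T satisfy the Cauchy-Riemann Equations ...
assert sp.cancel(sp.diff(S, x) - sp.diff(T, y)) == 0
assert sp.cancel(sp.diff(S, y) + sp.diff(T, x)) == 0

# ... so 1/P(z) is harmonic: S and T each satisfy Laplace's equation
assert sp.cancel(sp.diff(S, x, 2) + sp.diff(S, y, 2)) == 0
assert sp.cancel(sp.diff(T, x, 2) + sp.diff(T, y, 2)) == 0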
36.4 The Proof
The proof of the Fundamental Theorem of Algebra will be by contradiction. And the final punchline is that we will show, for

1/P(z) = S(x, y) + iT(x, y),

that

S(x, y) = T(x, y) = 0.

This would mean

1/P(z) = 0,

which is complete and utter nonsense: the reciprocal of a number is never zero!
The way that we reach this contradiction is, visually, very beautiful.

First, for an arbitrary ε > 0, we show that there is some R_0 such that for all |z| > R_0,

|1/P(z)| < ε.

(Recall that 1/P(z) is only defined on R² \ A; but in the actual proof, we will assume for a contradiction that P(z) has no zeros, hence A = ∅.)

Geometrically, this means that outside a ball of radius R_0 centered at the origin, 1/P(z)'s magnitude is bounded by ε:

[Figure: the complex plane, with the ball of radius R_0 about the origin; outside it, |1/P(z)| < ε.]

But 1/P(z)'s magnitude is greater than (or equal to) the magnitudes of its real and imaginary parts. So in fact, we have for all |(x, y)| > R_0,

|S| < ε    and    |T| < ε.

In other words, outside a ball of radius R_0 centered at the origin, S's and T's magnitudes are also bounded by ε:
[Figure: the graphs of S and T over R²; outside the ball of radius R_0, both are trapped between -ε and ε.]

Let's focus on S. Consider any ball with radius greater than R_0. We know that S is harmonic, so the extrema of S over this ball must occur on the boundary.

[Figure: the graph of S over a ball of radius R > R_0, with boundary maximum M and minimum m lying in the gray region between -ε and ε.]

Therefore the magnitudes of the maximum value M and minimum value m of S over this ball must be less than ε (since they lie in the gray region):

|M| < ε
|m| < ε

In fact, for all (x, y) ∈ B_R(0),

-ε < m ≤ S(x, y) ≤ M < ε,

i.e.

|S(x, y)| < ε for all (x, y) ∈ B_R(0),

meaning that we can plug up this white hole and conclude that the magnitude is bounded by ε everywhere:

|S(x, y)| < ε for all (x, y) ∈ R².

Likewise we can show

|T(x, y)| < ε for all (x, y) ∈ R².

Since ε was arbitrary, we must have

S(x, y) = T(x, y) = 0,

which, as aforementioned, is a big no-no.
The preceding schematic is correct, but we have to be really careful. One of my favorite Leon Simonisms is

Math Mantra: You must always know the TYPE OF ANIMAL you are working with.

I used the word magnitude and expressions like

|P(z)|,    |1/P(z)|,

but | . . . | is not the normal absolute value symbol! This is because the argument is a complex number!

For a complex number, we overload notation and define a new norm:

Definition. The complex norm of z = a + ib is

|z| = √(a² + b²).
Remember the discussion on metrics all the way back in Lecture 1? The complex norm can indeed be used to form a metric on C:

d(z_1, z_2) = |z_1 - z_2|.

In particular, we are going to use the properties:

Lemma. The complex norm satisfies

• Triangle inequality:

  |z_1 + z_2| ≤ |z_1| + |z_2|

• Product property:

  |z_1 z_2| = |z_1| |z_2|

• The absolute values of the real and imaginary parts are bounded by the full norm: for

  z = a + ib

  we have

  max(|a|, |b|) ≤ |z|.

After 36 lectures of 51H, these should all be easy to check.
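If you would like a computer to spot-check these three properties before you prove them by hand, here is a small Python sketch using the built-in complex type (random samples are no substitute for a proof, of course):

import random

random.seed(0)

def rand_z():
    # a random complex number with real and imaginary parts in [-10, 10]
    return complex(random.uniform(-10, 10), random.uniform(-10, 10))

for _ in range(10_000):
    z1, z2 = rand_z(), rand_z()
    # Triangle inequality (a tiny tolerance absorbs floating-point error)
    assert abs(z1 + z2) <= abs(z1) + abs(z2) + 1e-9
    # Product property
    assert abs(abs(z1 * z2) - abs(z1) * abs(z2)) < 1e-9
    # The real and imaginary parts are bounded by the full norm
    assert max(abs(z1.real), abs(z1.imag)) <= abs(z1) + 1e-12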
Now that we understand all the pieces, let's write out this legendary proof:

Theorem (Fundamental Theorem of Algebra). Every nonconstant polynomial has at least one complex root.
Proof Summary:

• Suppose P(z) is never zero. Show that

  |1/P(z)| < ε

  for all |z| > R_0, where

  R_0 = max{ 1,  2(|a_{n-1}| + . . . + |a_2| + |a_1| + |a_0|),  (2/ε)^{1/n} }.

• Then for all |(x, y)| > R_0,

  |S(x, y)| < ε
  |T(x, y)| < ε

  where

  1/P(z) = S(x, y) + iT(x, y).
• Let R > R_0. Use the fact that S, T are harmonic on B_R(0) to show

  |S(x, y)| < ε
  |T(x, y)| < ε

  for all (x, y) ∈ R².

• Conclude that

  S(x, y) = 0
  T(x, y) = 0,

  giving us

  |1/P(z)| = 0,

  a contradiction.
Proof: Suppose not. Then P(z) is never zero and, in fact, 1/P(z) is always defined.

Let ε > 0. First, we need to find an R_0 > 0 such that for all |z| > R_0,

|1/P(z)| < ε.

This condition is equivalent to showing

|P(z)| > 1/ε.
Starting from the left, expand

|P(z)| = |z^n + (a_{n-1} z^{n-1} + . . . + a_2 z² + a_1 z + a_0)|

(as usual, we take P(z) = z^n + a_{n-1} z^{n-1} + · · · + a_1 z + a_0 to be monic; dividing by the leading coefficient changes neither the zeros of P nor the claim).

Then, apply the reverse triangle inequality to get

|z^n + (a_{n-1} z^{n-1} + . . . + a_2 z² + a_1 z + a_0)| ≥ |z^n| - |a_{n-1} z^{n-1} + . . . + a_2 z² + a_1 z + a_0|.

Using multiple applications of the normal triangle inequality on the subtracted term, we get a smaller bound

≥ |z^n| - (|a_{n-1} z^{n-1}| + . . . + |a_2 z²| + |a_1 z| + |a_0|).

By repeatedly applying the product property of norms, this bound is equal to

|z|^n - (|a_{n-1}| |z|^{n-1} + . . . + |a_2| |z|² + |a_1| |z| + |a_0|).

For the next step, add the restriction that

|z| > 1.
This will allow us to shrink the lower bound by taking each power of z in the subtracted term

|z|^n - (|a_{n-1}| |z|^{n-1} + . . . + |a_2| |z|² + |a_1| |z|¹ + |a_0| |z|⁰)

and increasing their exponents. We could increase each |z|^i to |z|^n, but you can check this leads to a possible divide-by-zero case. Instead, we shrink the bound by increasing each |z|^i to |z|^{n-1}:

≥ |z|^n - (|a_{n-1}| |z|^{n-1} + . . . + |a_2| |z|^{n-1} + |a_1| |z|^{n-1} + |a_0| |z|^{n-1})

= |z|^n ( 1 - (|a_{n-1}| + . . . + |a_2| + |a_1| + |a_0|) / |z| ).
Further restricting

|z| > 2 (|a_{n-1}| + . . . + |a_2| + |a_1| + |a_0|),

we can shrink our bound to

|z|^n ( 1 - (|a_{n-1}| + . . . + |a_2| + |a_1| + |a_0|) / (2 (|a_{n-1}| + . . . + |a_2| + |a_1| + |a_0|)) ) = |z|^n / 2.

Finally, we know

|z|^n / 2 > 1/ε

if we have the restriction

|z| > (2/ε)^{1/n}.
In summary, if we choose

R_0 = max{ 1,  2(|a_{n-1}| + . . . + |a_2| + |a_1| + |a_0|),  (2/ε)^{1/n} },

then for all |z| > R_0,

|P(z)| > 1/ε,

which is the same as

|1/P(z)| < ε.

Now, for

1/P(z) = S(x, y) + iT(x, y),

we immediately have, for all |(x, y)| > R_0,

|S(x, y)| < ε
|T(x, y)| < ε
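Here is a quick numerical illustration of this choice of R_0, with an arbitrary sample polynomial and ε of my own choosing (the bound is deliberately crude, so the check passes with room to spare):

import cmath
import random

random.seed(1)

coeffs = [5, -1, 2]                  # sample a_0, a_1, a_2 for P(z) = z^3 + 2z^2 - z + 5
eps = 1e-3                           # sample epsilon
n = len(coeffs)

def P(z):
    return z**n + sum(a * z**k for k, a in enumerate(coeffs))

R0 = max(1, 2 * sum(abs(a) for a in coeffs), (2 / eps) ** (1 / n))

for _ in range(1000):
    r = R0 * random.uniform(1.001, 10)            # a point strictly outside the ball of radius R0
    z = cmath.rect(r, random.uniform(0, 2 * cmath.pi))
    assert abs(1 / P(z)) < eps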
Let R > R_0. Because we proved S is harmonic, we know that S achieves its extrema on the boundary of B_R(0). Thus, for minimum m and maximum M, there exist points (x_min, y_min) and (x_max, y_max) such that

S(x_min, y_min) = m
S(x_max, y_max) = M

and

|(x_min, y_min)| = R
|(x_max, y_max)| = R.

Of course,

|(x_min, y_min)| > R_0
|(x_max, y_max)| > R_0,

implying

|m| = |S(x_min, y_min)| < ε
|M| = |S(x_max, y_max)| < ε.

So in fact, for all (x, y) ∈ B_R(0),

-ε < m ≤ S(x, y) ≤ M < ε.

Since R > R_0 was arbitrary, this means, for all (x, y) ∈ R²,

|S(x, y)| < ε.

Likewise, for all (x, y) ∈ R², we can show

|T(x, y)| < ε.

Since ε was arbitrary,

S(x, y) = 0
T(x, y) = 0,

giving us

0 = |1/P(z)| = 1/|P(z)|.

This is a contradiction, since the inverse of a nonzero real number (in particular 1/|P(z)|) is never zero. ∎
The complex universe is a very rich field (pun intended), and this proof demonstrates that fact. Namely, we delved into the complex world to extract two real-valued functions S and T. Then, we applied our theory from the real universe to these functions. To quote Professor Maydanskiy,

Math Mantra: If you want to prove a fact about the reals, you can delve into the complex plane to extract the information you need.

You will see this remarkable strategy again in Math 106 and Math 116.
Simon's Secret Lecture 37

An Animal Farm of Infinities

All infinite sets are infinite,
but some sets are more infinite than others.
-GEORGE ORWELL

Goals: In the final optional lecture, we discuss the notion of size for infinite sets. Particularly, we notice that in the finite case, two sets have the same size if there exists a bijection between them. This inspires us to define an extension of "same size" between infinite sets. We then proceed to prove the classic results that the rationals are countable and the reals are uncountable. Finally, we end the lecture with the statement of the Continuum Hypothesis.
37.1 A Little More Complicated than it Looks
An underlying theme in Math 51H (and generally, Calculus) is that we are working with infinity. Primarily, we have captured infinity through limits, which we defined with the word arbitrary:

• N can be arbitrarily big
• ε can be arbitrarily small
• Consider an arbitrary union of sets
• The sequence gets arbitrarily close to some limit

However, like the novel Animal Farm, infinity is a lot more complicated than it seems.

Consider, for example, the natural numbers:

1, 2, 3, 4, 5, 6, . . .

There are infinitely many natural numbers and, in between each consecutive pair n and n + 1, there are infinitely many real numbers. In fact, there are more reals than natural numbers. However, there are also infinitely many rational numbers between each consecutive pair. Yet the set of rationals has the same size as the set of natural numbers.

Before we can define size for infinite sets, we need to talk about bijections.
37.2 Being a Bit Bijective
Suppose the function f is 1 : 1 on a finite set X. Notice that the image f(X) has the same size as X. (Formally, we can show this via induction, but I shall spare you the proof.)

In particular, if

f(X) = A,

then X has the same size as A. In fact, if we are given f : X → A, we automatically know f(X) ⊆ A. Therefore, to show f(X) = A, we just have to show A ⊆ f(X).

Definition. For f : X → A, we say f maps X onto A if

A ⊆ f(X).

If this function is also 1 : 1, we call it a bijection.

One of the key ideas about bijections is

Math Mantra: To count the size of a certain set, produce a bijection with a set that is easier to count.
This is a pretty cool proof technique, so it's worth giving a few classic examples.

The first is an alternate proof that the number of subsets of a set of size n is 2^n.
Theorem. The number of subsets of a finite set

S = {x_1, x_2, . . . , x_n}

is 2^n.

Proof: Consider the set P(S) of all subsets of S and the set B of binary n-tuples (i.e., n-tuples that have only 1 and 0 as elements). Define the mapping

f : B → P(S)

where

f(a_1, a_2, . . . , a_n) = {x_i | a_i = 1}.

For example,

f(1, 0, 0, 1, 1) = {x_1, x_4, x_5}.

We can show that our f is a bijection:

• 1 : 1

Let a, b ∈ B,

a = (a_1, a_2, . . . , a_n)
b = (b_1, b_2, . . . , b_n),

and

f(a) = f(b).

Suppose a ≠ b. Then there is some component a_i ≠ b_i. WLOG, say a_i = 1, b_i = 0. Then

x_i ∈ f(a)
x_i ∉ f(b),

so f(a) ≠ f(b), a contradiction.

• Onto

Let s ∈ P(S),

s = {x_{n_1}, x_{n_2}, . . . , x_{n_k}}.

We need to find some b ∈ B such that

f(b) = s.

But we can choose

b = (b_1, b_2, . . . , b_n)

where

b_i = 1 if i = n_j for some j, and b_i = 0 otherwise.

Thus, P(S) ⊆ f(B). Since the other inclusion is automatic, f(B) = P(S).

Since the size of B is 2^n, the size of P(S) is 2^n as well, so the number of subsets of S is 2^n. ∎
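The bijection in this proof is easy to play with on a computer. Here is a short Python sketch (the particular set S and the helper names are mine) that implements f and confirms it hits every subset exactly once for a small n:

from itertools import product

def f(bits, S):
    """Map a binary n-tuple to the subset of S it encodes."""
    return frozenset(x for x, a in zip(S, bits) if a == 1)

n = 4
S = [f"x{i}" for i in range(1, n + 1)]                 # S = {x1, x2, x3, x4}
images = [f(bits, S) for bits in product((0, 1), repeat=n)]

assert len(images) == 2**n              # one image per binary n-tuple
assert len(set(images)) == 2**n         # all images distinct, so f is 1:1 and hits all 2^n subsets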
In the second example, we calculate the number of ways to break an integer n into a sum of positive integers. Formally, for a given n, it counts the number of tuples

(a_1, a_2, . . . , a_k)

with k ≥ 1 and each a_i > 0 such that

a_1 + a_2 + · · · + a_k = n.

For example, the number 4 can be broken up into 8 such tuples:

(1, 1, 1, 1)   (1, 1, 2)   (1, 3)   (4)
               (1, 2, 1)   (2, 2)
               (2, 1, 1)   (3, 1)
Theorem. For a positive integer n, consider the set P_n of all ordered sequences

(a_1, a_2, . . . , a_k)

with k ≥ 1 and each a_i > 0 such that

a_1 + a_2 + · · · + a_k = n.

There are 2^{n-1} elements in this set.
Proof Summary:

• Define

  f : B → P_n

  where

  f(x_1, x_2, . . . , x_{n-1}) = (1 ∗_1 1 ∗_2 1 ∗_3 · · · ∗_{n-1} 1)

  and ∗_i is a comma if x_i = 1 and a plus sign if x_i = 0.

• 1 : 1

  Suppose not: f(a) = f(b) but a ≠ b. Let N be the first time a_i ≠ b_i. WLOG assume a_N = 1, b_N = 0.

  - If N is the first occurrence of 1, the first component of f(b) is strictly bigger than the first component of f(a).
  - If N is the m-th occurrence of 1 where m > 1, the m-th component of f(b) is strictly bigger than the m-th component of f(a).

• Onto

  The (n - 1)-tuple corresponding to

  00 . . . 0 (a_1 - 1 zeros) 1 00 . . . 0 (a_2 - 1 zeros) 1 . . . 1 00 . . . 0 (a_k - 1 zeros)

  maps to (a_1, a_2, . . . , a_k).
Proof: Let B be the set of binary (n - 1)-tuples and define the function

f : B → P_n

by

f(x_1, x_2, . . . , x_{n-1}) = (1 ∗_1 1 ∗_2 1 ∗_3 . . . ∗_{n-1} 1)

where ∗_i is a comma if x_i = 1 and a plus sign if x_i = 0.

For example, in the case n = 5,

f(1, 0, 1, 0) = (1, 1 + 1, 1 + 1) = (1, 2, 2).

Even though f involves mapping to some strange symbols, don't be afraid! It's extremely intuitive if you think of it in computer science terms. You begin with a single 1 in the first entry. Then, keep hitting the 0 button until that 1 grows to the desired size a_1. Then hit 1 to break off a_1, and play the same game with the second entry.
• 1 : 1

Let a, b ∈ B,

a = (a_1, a_2, . . . , a_{n-1})
b = (b_1, b_2, . . . , b_{n-1}),

and

f(a) = f(b).

Suppose a ≠ b. Define N to be the first component where a and b differ:

a_i = b_i for i < N,
a_N ≠ b_N.

WLOG, let

a_N = 1
b_N = 0.

We now show a contradiction, and we split into cases to build intuition (we could just consider a single case for m ≥ 1 and the proof would be correct, but even after 37 chapters, this is still a book for underdogs):

- N is the position of the first occurrence of 1 in a.

Compare the number of zeros to the left of position N:

a = 000 . . . 0 1 . . .
b = 000 . . . 0 0 . . .

Since there are N - 1 zeros to the left of a_N, the first component of f(a) is N. But there are also N - 1 zeros to the left of b_N, as well as an additional zero at position N. Therefore, the first component of f(b) is strictly greater than N, a contradiction.

- N is the position of the m-th occurrence of 1 in a, where m > 1.

The (m - 1)-th occurrence of 1 must occur at some position J < N in both a and b.

a = . . . 1 00 . . . 0 1 . . .
b = . . . 1 00 . . . 0 0 . . .

Since there are N - J - 1 zeros between the (m - 1)-th and m-th occurrences of 1, the m-th component of f(a) is N - J. (The m-th, not the (m - 1)-th! Be careful: since the first component comes after zero commas and the second component comes after the first comma, the m-th component comes after the (m - 1)-th comma.) However, the m-th component of f(b) is strictly greater than N - J, a contradiction.
• Onto

Let

(a_1, a_2, . . . , a_k) ∈ P_n.

Then the (n - 1)-tuple

00 . . . 0 (a_1 - 1 zeros) 1 00 . . . 0 (a_2 - 1 zeros) 1 . . . 00 . . . 0 (a_k - 1 zeros)

maps to (a_1, a_2, . . . , a_k). Formally, for

x_i = 1 if i ∈ {a_1, a_1 + a_2, . . . , a_1 + a_2 + · · · + a_{k-1}}, and x_i = 0 otherwise,

we have

f(x_1, x_2, . . . , x_{n-1}) = (a_1, a_2, . . . , a_k).

Thus, f(B) ⊇ P_n.

Since the size of B is 2^{n-1}, there are 2^{n-1} elements in P_n. ∎
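Here is the same bijection in Python (the function names are mine); for a small n it checks the n = 5 example from the text, that f is 1:1, and that f produces every element of P_n:

from itertools import product

def f(bits):
    """Map a binary (n-1)-tuple to a composition of n: 1 acts as a comma, 0 as a plus."""
    parts, current = [], 1
    for x in bits:
        if x == 1:
            parts.append(current)   # "hit 1": break off the current entry
            current = 1
        else:
            current += 1            # "hit 0": grow the current entry
    parts.append(current)
    return tuple(parts)

def compositions(n):
    """Brute-force list of all ordered ways to write n as a sum of positive integers."""
    if n == 0:
        return [()]
    return [(first,) + rest for first in range(1, n + 1) for rest in compositions(n - first)]

n = 6
images = [f(bits) for bits in product((0, 1), repeat=n - 1)]

assert f((1, 0, 1, 0)) == (1, 2, 2)              # the n = 5 example from the text
assert len(set(images)) == 2 ** (n - 1)          # 1:1
assert set(images) == set(compositions(n))       # onto P_n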
If we can produce a bijection between two finite sets, then they must have the same size. That's great! But what about infinite sets?
37.3 Counting on Countability
We are inspired to make an infinite extension:

Definition. We say that two sets A and B have the same cardinality (i.e. the same size) if we can produce a bijection between A and B.

In particular,

Definition. For an infinite set A, if A has the same cardinality as the natural numbers, then we say A is countable.

For example, we know the positive evens, positive odds, and the integers are countable, since we can respectively construct the invertible maps

E(n) = 2n
O(n) = 2n - 1
Z(n) = m if n = 2m, and -(m - 1) if n = 2m - 1.
Don't think of this as anything new. You've worked with mappings defined on N a million times. This is because a sequence (a_n) is a mapping from N. Therefore, a bijection from N onto A is really just a sequence that enumerates each element of A precisely once. This is exactly the reason why such a set is called countable.
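To see this "sequence" point of view concretely, here is a tiny Python sketch of the three maps above (the ranges printed are arbitrary); each one walks through its target set without repeats:

def E(n):
    return 2 * n                          # enumerates the positive evens

def O(n):
    return 2 * n - 1                      # enumerates the positive odds

def Z(n):
    if n % 2 == 0:                        # n = 2m  ->  m
        return n // 2
    return -((n + 1) // 2 - 1)            # n = 2m - 1  ->  -(m - 1)

print([E(n) for n in range(1, 7)])        # [2, 4, 6, 8, 10, 12]
print([O(n) for n in range(1, 7)])        # [1, 3, 5, 7, 9, 11]
print([Z(n) for n in range(1, 8)])        # [0, 1, -1, 2, -2, 3, -3]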
The classic result you need to know about countable sets is that

The rationals are countable.

To prove this fact, you can think of each rational number (in reduced form) as a pair of natural numbers:

p/q ↦ (p, q).

As an easy exercise, you can show that an infinite subset of a countable set is still countable (this is merely a subsequence)! Therefore, to prove that the rationals are countable, it suffices to prove that N × N is countable.
This is the fastest and cutest proof of this fact:

Theorem. N × N is countable.

Proof: Since the inverse of a bijection is still a bijection, it suffices to find a bijection from N × N to N. Consider

f(i, j) = 2^{i-1} (2j - 1).

Now we check:

• 1 : 1

Suppose

2^{i_1 - 1} (2j_1 - 1) = 2^{i_2 - 1} (2j_2 - 1).

Notice that (2j_1 - 1), (2j_2 - 1) are odd numbers and thus contain no power of 2. By the Fundamental Theorem of Arithmetic, we know the powers of 2 are equal:

i_1 - 1 = i_2 - 1,

implying

i_1 = i_2.

Dividing out 2^{i_1 - 1}, we get

2j_1 - 1 = 2j_2 - 1,

thus

j_1 = j_2.
• Onto

First note that

f(1, 1) = 1.

For n ≥ 2, by the Fundamental Theorem of Arithmetic, n has the prime factorization

n = 2^{α_1} · p_2^{α_2} . . . p_s^{α_s},

where the product p_2^{α_2} . . . p_s^{α_s} of odd primes is odd. Choosing

i = α_1 + 1
j = (p_2^{α_2} . . . p_s^{α_s} + 1) / 2,

we have

f(i, j) = n. ∎
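This map is fun to test by machine. A minimal Python sketch (finite ranges only, naturally) checks that f is 1:1 on a box of pairs and recovers (i, j) from n by peeling off the power of 2:

def f(i, j):
    return 2 ** (i - 1) * (2 * j - 1)

# 1:1 on a finite box: all values are distinct
values = {f(i, j) for i in range(1, 30) for j in range(1, 30)}
assert len(values) == 29 * 29

def f_inverse(n):
    """Recover (i, j): strip off the power of 2, then solve 2j - 1 = odd part."""
    i = 1
    while n % 2 == 0:
        n //= 2
        i += 1
    return i, (n + 1) // 2

# Onto: every n in a finite range is hit, and f_inverse really inverts f
for n in range(1, 10_000):
    i, j = f_inverse(n)
    assert f(i, j) == n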
Here, our choice of f relied more on arithmetic intuition than physical intuition. A more classic proof that N × N is countable is to visualize it as a grid of pairs

(1, 1)  (1, 2)  (1, 3)  (1, 4)  . . .
(2, 1)  (2, 2)  (2, 3)  (2, 4)  . . .
(3, 1)  (3, 2)  (3, 3)  (3, 4)  . . .
(4, 1)  (4, 2)  (4, 3)  (4, 4)  . . .

and count along diagonals:

1   2   4   7
3   5   8
6   9
10
It's a very simple idea; in fact, we can cook up an explicit mapping:

Theorem. N × N is countable.

Proof Summary:

• Define

  f(i, j) = (i + j - 2)(i + j - 1)/2 + i.

• Onto

  Substitute k = i + j and rewrite f using Gauss' formula. For any n ∈ N, choose a particular k such that

  1 + 2 + · · · + (k - 2) < n ≤ 1 + 2 + · · · + (k - 2) + (k - 1).

  Then f(i, j) maps to n, where

  i = n - (1 + 2 + · · · + (k - 2))
  j = k - i.
• 1 : 1

  Suppose

  (i_1 + j_1 - 2)(i_1 + j_1 - 1)/2 + i_1 = (i_2 + j_2 - 2)(i_2 + j_2 - 1)/2 + i_2.

  It suffices to show that f maps to the same diagonal:

  i_1 + j_1 = i_2 + j_2.

  Derive a contradiction using Gauss' formula.
Proof: Define the mapping

f(i, j) = (i + j - 2)(i + j - 1)/2 + i.

• Onto

The key is to notice that along any diagonal, the sum i + j is constant. To make life easier, define a new variable

k = i + j.

Then,

f(i, j) = (k - 2)(k - 1)/2 + i

where 1 ≤ i ≤ k - 1.

But lo and behold, what is this? The fraction is an application of Gauss' formula:

1 + 2 + 3 + · · · + n = n(n + 1)/2.

Therefore,

f(i, j) = 1 + 2 + · · · + (k - 2) + i

where 1 ≤ i ≤ k - 1.

For any n ∈ N, choose a particular k such that

1 + 2 + · · · + (k - 2) < n ≤ 1 + 2 + · · · + (k - 2) + (k - 1).

Then, for the choice

i = n - (1 + 2 + · · · + (k - 2))
j = k - i,

we have

f(i, j) = 1 + 2 + · · · + (k - 2) + [ n - (1 + 2 + · · · + (k - 2)) ] = n.

(This is analogous to the first step of converting a number into binary: subtract off the biggest power 2^i to get a remainder. Here, subtract off the biggest (k - 2)(k - 1)/2 less than n to get a remainder i. Think of it as Base Gaussian.)
• 1 : 1

Suppose

(i_1 + j_1 - 2)(i_1 + j_1 - 1)/2 + i_1 = (i_2 + j_2 - 2)(i_2 + j_2 - 1)/2 + i_2.
I claim that f must map to the same diagonal, i.e., the sums

k_1 = i_1 + j_1    and    k_2 = i_2 + j_2

are equal. Rewrite our equation as

(k_1 - 2)(k_1 - 1)/2 + i_1 = (k_2 - 2)(k_2 - 1)/2 + i_2.

Suppose, for a contradiction, that k_1 ≠ k_2; WLOG, suppose k_1 < k_2. By Gauss' formula again, we can bound the LHS:
(k_1 - 2)(k_1 - 1)/2 + i_1 = 1 + 2 + · · · + (k_1 - 2) + i_1
                           ≤ 1 + 2 + · · · + (k_1 - 2) + (k_1 - 1)        (since i_1 ≤ k_1 - 1)
                           = (k_1 - 1) k_1 / 2.

But by basic integer properties, k_1 < k_2 implies

k_1 ≤ k_2 - 1
k_1 - 1 ≤ k_2 - 2.

Therefore, we can further bound

(k_1 - 1) k_1 / 2 ≤ (k_2 - 2)(k_2 - 1)/2.
Moreover,

(k_2 - 2)(k_2 - 1)/2 < (k_2 - 2)(k_2 - 1)/2 + i_2        (since i_2 ≥ 1).

This gives us

(k_1 - 2)(k_1 - 1)/2 + i_1 < (k_2 - 2)(k_2 - 1)/2 + i_2,

which is a contradiction. Thus,

k_1 = k_2.

Now the proof is easy. Our equality becomes

(k_1 - 2)(k_1 - 1)/2 + i_1 = (k_1 - 2)(k_1 - 1)/2 + i_2,

which immediately implies

i_1 = i_2

and then

j_1 = k_1 - i_1 = k_2 - i_2 = j_2. ∎
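Again, nothing beats poking at the formula. A short Python sketch (finite ranges only) checks that f has no collisions and inverts it using the "Base Gaussian" idea mentioned above:

def f(i, j):
    k = i + j
    return (k - 2) * (k - 1) // 2 + i

def f_inverse(n):
    """Subtract off the largest triangular number (k-2)(k-1)/2 below n ('Base Gaussian')."""
    k = 2
    while (k - 1) * k // 2 < n:        # find k with (k-2)(k-1)/2 < n <= (k-1)k/2
        k += 1
    i = n - (k - 2) * (k - 1) // 2
    return i, k - i

# f walks N x N along diagonals without repeats ...
values = [f(i, j) for i in range(1, 60) for j in range(1, 60)]
assert len(set(values)) == len(values)                  # 1:1 on this finite box

# ... and f_inverse recovers the pair for every n
for n in range(1, 5000):
    i, j = f_inverse(n)
    assert i >= 1 and j >= 1 and f(i, j) == n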
37.4 Realistically Uncountable
Not all infinite sets are countable! In analogy with the definition of irrational, we define
Definition. For an infinite set A, if A is not countable, then we say A is uncountable.

One of the most fundamental results is

The reals are uncountable.

This is the cutest proof that I know:
Theorem. R is uncountable.

Proof Summary:

• Suppose I_1 = [0, 1] is countable and is enumerated by (a_n).

• Construct a nested sequence of closed, non-empty sets

  I_1 ⊇ I_2 ⊇ I_3 ⊇ . . .

  where I_j contains only points of (a_n) with index n ≥ j.

• Let

  a ∈ ⋂_{k∈N} I_k.

  Then a ∈ [0, 1], yet a is not an element of (a_n). Contradiction.
Proof: It suffices to show that the closed interval [0, 1] is uncountable. Suppose that there does exist a bijection f from N to [0, 1]. Then, consider the image sequence (a_n) where

a_n = f(n).

First, we are going to construct a sequence of nested closed sets. Starting with

I_1 = [0, 1],

split the interval into three equal closed pieces I_1^L, I_1^M, I_1^R. At least one of these pieces contains only points of (a_n) with index n ≥ 2 (the single point a_1 can lie in at most two of the three closed pieces, so at least one piece avoids it). Call that set I_2
and play the same game: split I_2 into three equal closed pieces I_2^L, I_2^M, I_2^R, and choose I_3 to be one of the pieces that contains only points of (a_n) with index n ≥ 3.

Formally, we define the recursive relation

I_{k+1} = I_k^L   if I_k^L contains only points of (a_n) with index n ≥ k + 1,
          I_k^M   else if I_k^M contains only points of (a_n) with index n ≥ k + 1,
          I_k^R   else if I_k^R contains only points of (a_n) with index n ≥ k + 1,

where, for I_k = [a, b],

I_k^L = [a, a + (b - a)/3]
I_k^M = [a + (b - a)/3, a + 2(b - a)/3]
I_k^R = [a + 2(b - a)/3, b].

(There really is no need to be formal. You only need to understand that we trisect the interval and, each time, choose either left, middle, or right.)
Now that we have a nested sequence of closed, non-empty sets,

I_1 ⊇ I_2 ⊇ I_3 ⊇ . . . ,

by our work in Lecture 33, we know that the arbitrary intersection is non-empty. Therefore, let

a ∈ ⋂_{k∈N} I_k.

Since a ∈ [0, 1], by our countability assumption,

a_S = a

for some S. But by construction,

a_S = a ∉ I_{S+1},

and so

a ∉ ⋂_{k∈N} I_k,

a contradiction. ∎
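You can even let a computer play the trisection game against any finite list of would-be "enumerated" reals; it always produces a point the list misses (an illustration of the idea, not a substitute for the proof). The sample list below is arbitrary:

from fractions import Fraction

def missing_point(points):
    """Trisect [0, 1]; at step k keep a closed third that avoids points[k]."""
    a, b = Fraction(0), Fraction(1)
    for p in points:
        thirds = [(a, a + (b - a)/3), (a + (b - a)/3, a + 2*(b - a)/3), (a + 2*(b - a)/3, b)]
        # a single point lies in at most two of the three closed thirds, so one of them avoids it
        a, b = next((lo, hi) for lo, hi in thirds if not (lo <= p <= hi))
    return (a + b) / 2                        # every point of the final interval is off the list

sample = [Fraction(1, 2), Fraction(1, 3), Fraction(2, 9), Fraction(17, 81), Fraction(0)]
x = missing_point(sample)
assert all(x != p for p in sample) and 0 <= x <= 1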
This is a pretty cool proof, and it uses the awesome nested sequence property. But it's not the classic argument (although it does have the same essence of diagonalization). Here is the diagonalization proof that every mathematics student must know. (This proof is almost valid. First, you should prove that every real has a decimal expansion. More importantly, decimal expansions are not unique. For your Math 171 WIM, you will need to make the proper adjustments.)
Theorem. R is uncountable.

Proof Summary:

• Suppose that (0, 1) is countable and is enumerated by (a_n).

• Expand each a_n in terms of its decimal expansion and arrange in a list.

• Define s to have the decimal expansion

  s = .s_1 s_2 s_3 s_4 s_5 . . .

  where the digit s_i = 3 if a_ii ≠ 3, and s_i = 5 if a_ii = 3.

• Then s ∈ (0, 1), yet s is not an element of (a_n). Contradiction.
Proof: Consider the open interval (0, 1) and suppose that there does exist a bijection f from N to (0, 1). Then, consider the image sequence (a_n) where

a_n = f(n).

We can write each term a_n as a decimal expansion

a_n = 0.a_n1 a_n2 a_n3 . . .

where each a_ni is a digit:

a_ni ∈ {0, 1, 2, . . . , 9}.

Line the a_n up in a grid:

a_1 = 0. a_11  a_12  a_13  a_14  a_15  . . .
a_2 = 0. a_21  a_22  a_23  a_24  a_25  . . .
a_3 = 0. a_31  a_32  a_33  a_34  a_35  . . .
a_4 = 0. a_41  a_42  a_43  a_44  a_45  . . .
a_5 = 0. a_51  a_52  a_53  a_54  a_55  . . .
  .        .     .     .     .     .
Using the diagonal digits a_11, a_22, a_33, . . . of this grid, we construct an element that is not on the list. Specifically, we construct s such that its i-th digit is different from a_ii:

s   = 0.  s_1    s_2    s_3    s_4    s_5   . . .

a_1 = 0. [a_11]  a_12   a_13   a_14   a_15  . . .
a_2 = 0.  a_21  [a_22]  a_23   a_24   a_25  . . .
a_3 = 0.  a_31   a_32  [a_33]  a_34   a_35  . . .
a_4 = 0.  a_41   a_42   a_43  [a_44]  a_45  . . .
a_5 = 0.  a_51   a_52   a_53   a_54  [a_55] . . .
  .         .      .      .      .      .
Formally, we can define s to have the decimal expansion

s = .s_1 s_2 s_3 s_4 s_5 . . .

where the digit

s_i = 3 if a_ii ≠ 3, and s_i = 5 if a_ii = 3.

Then s ∈ (0, 1), yet s is not an element of (a_n). Because if it were, then for some K,

a_K = s.

But a_K and s differ in the K-th digit by construction:

s_K ≠ a_KK,

a contradiction. Thus, the reals are uncountable. ∎
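The diagonal construction is mechanical enough to code. Here is a Python sketch that takes a finite list of decimal digit strings (a stand-in for the first rows of the grid; the sample rows are arbitrary) and produces a number that differs from each of them in at least one digit:

def diagonal_escape(rows):
    """rows[i] holds the digits of a_i after the '0.'; build s with s_i different from a_ii."""
    digits = []
    for i, row in enumerate(rows):
        a_ii = row[i] if i < len(row) else '0'   # pad short expansions with trailing zeros
        digits.append('5' if a_ii == '3' else '3')
    return '0.' + ''.join(digits)

rows = ['500000', '141592', '333333', '999999', '271828']
s = diagonal_escape(rows)

# s differs from the i-th row in the i-th digit, so it is not on the list
for i, row in enumerate(rows):
    assert s[2 + i] != (row[i] if i < len(row) else '0')
print(s)    # 0.33533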
37.5 Last Words on the Continuum
There are many levels of uncountability! In fact, for a given set S, S is smaller than the set of all subsets of S. (Here, we say A is smaller than B if there is a 1:1 mapping from A to B but no bijection between them.) This means we can have arbitrarily large infinite sets.

Of course, I cannot explain that here. If you want to understand how to classify the sizes of infinite sets, I highly recommend taking Math 161: Set Theory. You will be exposed to some philosophical and interesting topics. Also, Professor Sommer is an excellent lecturer.

I do feel, however, that it is fitting to end with the statement of a famous problem. Even though we can have arbitrarily large uncountable sets, it is unknown if there is a smallest:

Conjecture (Continuum Hypothesis). R is the smallest uncountable set.

Once you finish the Math 51H final, I highly recommend mulling over this statement. Even though it is way beyond the scope of this course, ask yourself what it means. How do you formalize it? And as always, think about how one can possibly go about proving it.
The Final: Acing Analysis
[Cover art: "51H POTTER: The Last Run", with a Golden Snitch.]
You've finally made it. Ten weeks of Math 51H. And now you are faced with the final.

My first advice is to

COMPLETE HOMEWORK 10.

I know it's not graded. And you are probably swamped with work. That's why they call it dead-week. But problems are the only way to truly test your understanding of the material. Moreover, Professor Simon may put a variation of one of these problems on the final.

Secondly,

DO THE PRACTICE FINAL AND MAKE SURE YOU TIME YOURSELF.

To succeed, it is not enough to understand the material. You must make sure you have mastered it. This means being able to write out complete proofs quickly in a limited amount of time.

In fact, while you're at it,

REDO ALL THE PREVIOUS MIDTERMS.

On the final, all previous material is fair game. In fact, you cannot progress without full comprehension of earlier concepts!
Lastly, I repeat for the final time, the correct way to study:

Open up Chapter 1 and head to the first theorem. Then,

• Read the theorem statement.
• Close the book.
• Start re-deriving the proof.
• If you get stuck, glance at the proof summary for a hint.
• Close the book again.
• Rinse and repeat.

Do this for every proof in the book. This is the only way you can be sure that you understand the material. Always remember Professor Simon's saying,

The human capacity for self-delusion is limitless.

In addition to the first 7 weeks, here are the topics you need to have mastered:

Week 8

1. Do you know the definition of permutations, transpositions, and number of inversions? Can you calculate the number of inversions of a given permutation?

2. Can you prove that applying a transposition to a permutation switches the parity of the number of inversions?

3. Do you know the defining properties of D? Do you know the explicit formula for D? Can you prove that it is unique?

4. Can you prove basic determinant properties? Can you prove det(AB) = det(A) det(B)? Can you quickly calculate determinants using row and column reduction properties?

5. Do you know how to invert a matrix? Do you know how to apply a cofactor expansion? Can you prove the cofactor expansion formulas? Can you prove det(A) = det(A^T)? Do you know the explicit formula for A^{-1}? Can you derive the explicit formula for A^{-1}?

6. Do you know the definition of orthonormal? Do you know how to apply the Gram-Schmidt process?
Week 9

1. Do you know how to compute eigenvectors and eigenvalues? Do you know how to use eigenvectors and eigenvalues to diagonalize a matrix?

2. Do you know the statement of the Spectral Theorem? Can you apply the Spectral Theorem?

3. Do you know the statement of the Contraction Mapping Theorem? Can you prove the Contraction Mapping Theorem? Can you use the Contraction Mapping Theorem on some system of equations to show the existence of a solution?

4. Do you know the definition of 1 : 1? Do you know the definition of onto? Do you know the statement of the Inverse Function Theorem?

5. Do you know the statement of the Implicit Function Theorem? Can you use the Inverse Function Theorem to prove the Implicit Function Theorem? Can you use the Implicit Function Theorem to complete the proof of the Lagrange Multiplier Theorem?
Last Words on Math 51H

Even though it is a challenging course, I hope you enjoyed Math 51H. Without a doubt,

Professor Simon is an incredible lecturer who is second to none.

He has truly exposed you to some beautiful proofs and given you a preview of the different branches of mathematics. In just ten weeks, you have learned more mathematics than you have in the last ten years.

And by learn, I mean you have acquired more than just content. You've acquired a completely new way of thinking. And like the Golden Snitch,

Your mind is now open at the close of Math 51H.
The After Math

In life, unlike chess, the game continues after checkmate.
-Isaac Asimov

A Message to the Underdog

If your experience in Math 51H was anything like mine, after you see your grades, you're going to be shocked. For the first time in your life, you didn't get an A. For me, that was just the tip of the iceberg. By the end of that year, I scored a 1 on the Putnam and tanked the remaining H-series. Almost every job rejected me that summer.

But the scores and rejections weren't the worst part. It was the feeling of resignation. The feeling that no matter how hard you tried, you would never catch up. That everyone in that room will always be smarter than you and that you didn't belong.

When people talk about the Stanford Duck Syndrome, I feel it is especially true with the math majors. There is an influx of brilliant students each year. Students who are already properly trained in mathematical thinking. Because they've already seen the algebraic and analytic concepts, they will be the ones constantly raising their hands.

And these students are also the ones the professors take time to notice.

From someone who has been at the bottom of every math class, I assure you, it doesn't get any easier. TAs will roll their eyes at you. Professors will call your questions trivial. (God, I hate that word. If there was any word that captures mathematical arrogance, this would be it.) When you give an incorrect answer in class, people will snicker at you. But worst of all, ten weeks of persistent effort will be quantized into three hours of testing.

So the big question you need to ask yourself is

Do you still want to pursue theoretical math?

You don't have to study pure math. You can try engineering, economics, or even philosophy. From personal experience, even someone with a mediocre mathematics background can excel in these subjects. If you want acknowledgement, then maybe that's where you should go.

However, the H-series could have left you with a gnawing feeling. Because even amidst dark confusion and struggle, you saw something. A glimpse of pure creativity. An idea that is paradoxically so
difficult yet so simple. An idea that was magically pulled out of thin air. And even if it was for a brief moment, you understood. You were rapt in awe.

If this is how you feel, then you have to ignore the grades and keep chasing mathematics. To quote Professor Vakil,

Of course, the best three on the Putnam will go off to become fantastic mathematicians. But the people who are going to take over the world are those fighting to get the 1s.

Indeed, it's going to be a tough fight. And the only one who is going to care (or even notice) is you. But be sure to go at your own pace! You don't have to jump into Hardy, Dummit, and Pollack. Read Jones, Armstrong, and Munkres. In fact, read as many yellow UTM and SUMS books as you can get a hold of. Don't just stop there. Devour the vast online resources that are available. Become an expert in Coursera, Wikipedia, and Stack Exchange.

In the meanwhile, ignore the pompous people in your class. Find the right ones to collaborate with. Learn from bad grades and move on. Try harder next time. There is no shame in failing: you can always retake a course. Even if a professor laughs at you for sitting in for the hundredth time, just grin back. You gotta keep fighting, even if the world's telling you to give up. Because that's what it means to live.

And when you completely understand a beautiful proof and are teeming with excitement, you have to teach it to other people. Just be sure that, when you explain the proof, it's not to make yourself look smarter. Deliver the punch line in the right way so that even the struggling underdog walks away with a smile. Shout it out to the world so that everyone can know it's beautiful. Because that's what it means to love.

Only the Beginning

As for me, I've decided to stop pursuing the dream of becoming a professor. Not because I've given up on mathematics, but because professors teach the best. And I have no interest in teaching the best. I want to inspire the best.

To inspire students, I intend on keeping a promise. Before my best friend overdosed, we made a pact to take on the world. Even though she's gone, I still intend on keeping my end of that promise. I am going to find the world's most infamous introductory math courses and I am going to translate them. Because I feel that all students who have a passion for mathematics, yet lack the problem solving and proof techniques, deserve to learn from these epic courses.

That's enough about my future. As for you, it is inevitable that as you grow up, you will hit some hard times. When that happens, you just have to keep living, laughing, and loving. Because, in the end,

Let Hercules do as he may. Cats will mew. Dogs will have their day.
Index
1 : 1, 643
T_aM, 519
, 41
C, 4
Q, 4
R, 4
Z, 4
, 218
, 34
, 34
abelian group, 97
absolute convergence, 373
Aladdin, 138
arbitrary union, 324
Axiom of Choice, 124, 134
Axler, Sheldon, 364
Banach-Tarski Paradox, 135
Baptisma Pyros, viii
basis, 124
Basis Extension Theorem, 127
Basis Theorem, 125
applications of, 132
Batman, x, 239, 636
bijection, 698
Bolzano-Weierstrass, 197, 210
alternate proof, 213
multivariable, 330
Boyd, Stephen, 171
cardinality, 703
Cauchy Sequences, 639
Cauchy-Riemann Equations, 686
Cauchy-Schwarz Equality, 57
applications of, 59, 60
Cauchy-Schwarz Inequality, 21, 25
alternate proof, 79
applications of, 29-31
matrices, 152
chain rule, 409
Change of Base-Point Theorem, 438
finite, 430
choice function, 134
closed set, 312
cofactor expansion, 582
column rank, 159
column space, 158
basis, 260
Completeness Axiom, 107, 639
Conrad, Brian, vii
constructivism, 51, 239
continuous function
applications, 294
multivariable, 338
properties, 280
single variable, 275
Continuum Hypothesis, 713
Contraction Mapping Theorem, 636, 657
convex, 654
countable, 703
N × N, 704, 705
rationals, 704
critical point
manifold, 531
curve, 474
length, 479
Depp, Johnny, 205
derivative
directional, 351
partial, 353
single variable, 289
determinant, 567
Devlin, Keith, x, 267
diagonalizable, 610
differentiable, 358
dimension, 129
directional derivative, 351
discretizing the delta, 320
distance function, 7
Division Algorithm, 93
dot product, 19
EE263: Linear Dynamical Systems, 171
eigen-decomposition, 610
eigenvalue, 613
eigenvector, 613
Everclear-190, 8
Field Axiom, 101
fields, 101
Frost, Robert, 663
Fundamental Theorem of Algebra, 680, 693
Fundamental Theorem of Arithmetic, 54
Fundamental Theorem of Calculus, 485
Fundamental Theorem of Linear Algebra, 159
Galatius, Soren, 135
Gaussian Elimination
full, 241
step one, 65
geometric series, 366, 377
gradient, 389
Gram-Schmidt Process, 602, 604
graph, 509
Guy, Richard, 12
Hardy, G.H., xi
harmonic function, 681
Harry Potter, xiii, 81, 417, 494, 545, 715, 717
Hilbert, xii, 89, 267
Hitler Learns Topology, 301
homogeneous equations, 83
Implicit Function Theorem, 671
Inception, 631
indexing set, 323
inhomogeneous equations, 263
injective, 643
invariant, 544
inverse, 580
Inverse Function Theorem, 650
Jacobian, 358, 388
Jepsen, Carly Rae, 47
Johari, Ramesh, 271
Jones, Indiana, 200
Kafka, Franz, 71
Lagrange Multipliers, 674
multiple constraints, 530
proof, 536
single constraint, 527
Law of Excluded Middle, 47
Law of Small Numbers, 12
left inverse, 578
Lehrer, Tom
Tropic of Calculus, 396
limit
function, 273, 274
multivariable, 337
sequence, 176
sequence of vectors, 312
uniqueness, 193
linear combination, 45
Linear Dependence Lemma, 86
linear function, 38
local maximum, 392
manifold, 531
local minimum, 392
manifold, 531
logical quantifiers, 177
Lovasz, L., 47, 305
Malibu Light Rum, 396
manifold, 505, 514
Martini, 347
Math 106: Introduction to Complex Analysis, 696
Math 108: Introduction to Combinatorics, 33, 445
Math 114: Linear Algebra II, 38
Math 116: Introduction to Complex Analysis, 696
Math 120: Modern Algebra I, 110
Math 121: Modern Algebra II, 38, 110
Math 171: Introduction to Analysis, 38, 116
Maydanskiy, Maksim, 180, 457, 696
Mean Value Theorem, 293
application of, 51
multivariable application, 398
Mojito, 37, 443
Monotone Convergence Property, 200
MS&E 246: Game Theory with Engineering Applications, 271
multilinear function, 557
N-th term test, 365
non-pivot column, 244
norm
complex, 692
matrix, 151
vector, 15
Norris, Chuck, 193
null space, 159
basis, 256
nullity, 159
number of inversions, 547
onto, 698
open ball, 302
open set, 304
orthogonal, 218
orthogonal complement, 218
properties, 230
orthonormal, 599
Orwell, George, 697
Osgood, Brad, 37
Pang, Amy, xiii
parity, 544
partial derivative, 353
partition, 478
permutation, 545
Phil 151: Introduction to Logic, 102
Pina Colada, xiii
pivot column, 243
index, 241
Pocahontas, 578
Popov, 539
power series, 419
differentiation, 491
projection map, 234
properties, 234
Proof Technique
7-10 split, 33
contradiction, 47
iff, 52
cases, 67
existence, 109
induction, 71
set equality, 116
uniqueness, 90
universal statements, 8
quadratic form, 467
rank, 166
Rank-Nullity Theorem, 159, 162
rearrangement, 379
right inverse, 578
Rolle's Theorem, 290
Ross, Kenneth, 407
row space, 165
Sandwich Theorem, 205
multivariable, 341
Sawin, Stephen, 197
Second Derivative Test, 470
Sher, David, 48
Shoham, Yoav, 134
Shyamalan, M. Night, 7, 363
Simon, Leon, vii, xxiii, 40, 50, 63, 115, 124, 127,
167, 173, 203, 250, 270, 273, 301, 304, 324,
347, 363, 395, 397, 442, 512, 519, 540, 542,
715-717
Smirnoff, 539
Sommer, Professor, 102, 713
Sound, K., 110, 681
span, 45
Sparrow, Jack, 200
Spectral Theorem, 619
standard basis vector, 121
Steele, Michael, 17
Stolichnaya, 539
Strang, Gilbert, 159
sub-matrix, 582
subsequence, 198
subspace, 39
SUMaC, viii, 267, 543
symmetric matrix, 467
tangent space, 519
tangential gradient, 531
Taylor Series, 494
Taylor's Theorem, 495
Titanic, 126
transpose, 165
transposition, 545
triangle inequality, 7, 27
uncountable, 708
reals, 708, 710
Under-determined Systems Lemma, 83
Vakil, Ravi, 720
vector space, 38, 418
Weinberger, 503
well-dened, 563
Wieczorek, W., 662