In another post (https://alexshtf.github.io/2025/03/27/Free-Poly.html) the author fits a degree-10000 (ten thousand!) polynomial using the Legendre basis. The polynomial _doesn't overfit_, demonstrating double descent. "What happened to our overfitting from ML 101 textbooks? There is no regularization. No control of the degree. But “magically” our high degree polynomial is not that bad!"
So... are _all_ introductions to machine learning just extremely wrong here? I feel like I've seen tens of reputable books and courses that introduce overfitting and generalization using severe overfitting and terrible generalization of high-degree polynomials in the usual basis (1,x,x^2,...). Seemingly everyone warns of the dangers of high-degree polynomials, yet here the author just says "use another basis" and proves everyone wrong? Mind blown, or is there a catch?
The catch is that orthogonal polynomial bases (like Legendre) implicitly regularize by controlling the Riesz representer norm, effectively implementing a form of spectral filtering that penalizes high-frequency components.
https://arxiv.org/pdf/2503.02113
This paper shows that polynomials exhibit most features of deep neural nets, including double descent and the ability to memorize the entire dataset.
It connects the dots there - the polynomials are regularized to be as simple as possible, and the author argues that the hundreds of billions of parameters in modern neural networks act as regularizers too: they attenuate decisions that are "too risky."
I really enjoyed that paper, a gem that sheds light everywhere.
The degree-10000 polynomial is definitely overfitting - every data point has its own little spike. The truncated curves look kind of nice, but the data points aren't shown on those plots, and the curves aren't very close to them.
There are also some enormous numbers hidden away here, with associated floating point precision problems; the articles show that the coefficients are small, but that's because the polynomial basis functions themselves contain gigantic numbers. The Bernstein basis for degree 100 involves 100 choose 50, which is already > 10^29. You have to be careful calculating these polynomials or parts of your calculation will exceed 64-bit floating point range, e.g. factorials of 10000, 2^10000. See the formulas and table in this section: https://en.wikipedia.org/wiki/Legendre_polynomials#Rodrigues...
You can probably calculate the value of P_n(x) using the equation:
P_n(x) = (2 - 1/n) x P_{n-1}(x) - (1 - 1/n) P_{n-2}(x)
This is numerically stable if x is in [0,1], but I haven't validated that you can get to very high n using floating point.
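For what it's worth, here is a minimal numpy sketch of that recurrence (the helper name and the check against numpy's own Legendre evaluation are mine, not from the thread):

    import numpy as np

    def legendre_recurrence(x, n):
        # P_k(x) = (2 - 1/k) x P_{k-1}(x) - (1 - 1/k) P_{k-2}(x), starting from P_0 = 1, P_1 = x
        x = np.asarray(x, dtype=float)
        P = np.empty((n + 1,) + x.shape)
        P[0] = 1.0
        if n >= 1:
            P[1] = x
        for k in range(2, n + 1):
            P[k] = (2 - 1 / k) * x * P[k - 1] - (1 - 1 / k) * P[k - 2]
        return P

    x = np.linspace(-1, 1, 5)
    ours = legendre_recurrence(x, 50)[50]
    ref = np.polynomial.legendre.legval(x, [0] * 50 + [1])  # coefficient vector that picks out P_50
    print(np.max(np.abs(ours - ref)))  # tiny: the recurrence matches numpy's own evaluation on [-1, 1]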
No. All the textbooks know that polynomials of high degree are numerically dangerous and you need to be careful when handling them.
The article's examples only work because the interval 0 to 1 (or -1 to 1) was chosen. For whatever reason the author does not point that out, or even acknowledge that, had he chosen a larger interval, the limitations of floating point arithmetic would have ruined the argument he was trying to make.
10^100 is a very large number and numerically difficult to treat. For whatever reason the author pretends this is not a valid reason to be cautious about high degree polynomials.
He seems reasonably explicit about this:
""
This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid the use of polynomials on raw un-normalized features.
""
No.
This paragraph has nothing to do with numerics. It is about the fact that continuous functions can not be approximated globally by polynomials. So you need to restrict to intervals for reasons of mathematical theory. This is totally unrelated to the numerical issues, which are nowhere even acknowledged.
But what's the point in acknowledging numerical issues outside of [-1,1] if polynomials do not even work there, as author explicitly notes?
All polynomials "work" globally. That some polynomials form an orthonormal basis over certain intervals is essentially irrelevant.
The author does not address the single most important reason why high-degree polynomials are dangerous. Which is pretty insane, to be honest: obviously you have to at least mention why people tell you to be cautious about high degree polynomials AND point out why your examples circumvent the problem. Anything else is extremely dishonest and misleading.
Is there a mathematical proof or basis to back what you’re saying?
That polynomials do not approximate continuous functions globally??
That is pretty obvious. Consider that every non-constant polynomial grows arbitrarily large, but not every continuous function does.
Actually they aren't. You never compute high powers of the argument when working with specialized bases.
You use the recursive formula that both the Bernstein basis and the orthogonal polynomial bases are endowed with. This is implemented in numpy, so you don't have to do anything yourself. Just call, for example, np.polynomial.legendre.legvander to get the features for the Legendre basis.
And a basis orthogonal over [-1,1] is easily made orthogonal over an arbitrary interval. Take p_i to be the i-th Legendre polynomial; then the basis composed of q_i(x) = p_i(2(x-a)/(b-a) - 1) is orthogonal over [a,b]. Each q_i is itself a polynomial of degree i, but you never use its coefficients explicitly.
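A minimal sketch of that workflow, with a made-up target and interval (legvander builds the Legendre features by recurrence, so no explicit powers of x appear anywhere):

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = 3.0, 40.0                                   # arbitrary raw-feature interval
    x = rng.uniform(a, b, size=200)
    y = np.sin(x / 5.0) + 0.1 * rng.standard_normal(x.size)

    t = 2 * (x - a) / (b - a) - 1                      # the q_i(x) = p_i(2(x-a)/(b-a) - 1) substitution
    V = np.polynomial.legendre.legvander(t, 50)        # Legendre features up to degree 50
    coef, *_ = np.linalg.lstsq(V, y, rcond=None)       # plain least squares on those features
    print("train RMSE:", np.sqrt(np.mean((V @ coef - y) ** 2)))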
There is an entire library for computing with polynomial approximants of functions over arbitrary intervals using orthogonal polynomials - Chebfun. The entire scientific and spectral differential equations community knows there are no numerical issues working with high degree polynomials over arbitrary intervals.
The ML community just hasn't caught up.
>The entire scientific and spectral differential equations community knows there are no numerical issues working with high degree polynomials over arbitrary intervals.
This is totally wrong. Of course there are enormous numerical problems with high degree polynomials. Computing large powers of large numbers is enormously unstable and needs to be avoided under basically all circumstances; that is what makes dealing with those polynomials difficult, and cautioning people against this is obviously correct.
What you described are the ways to deal with those problems. But this isn't what the article does. My problem with the article is the following:
- Does not mention the most important issue with high degree polynomials, namely their numerical instability.
- Gives examples, but does not explain why they circumvent the numerical problems at all. The most important choice made (the interval of 0 to 1) is portrayed as essentially arbitrary, not an absolute necessity.
- Concludes that there are no problems with high degree polynomials, based on the fact that the experiments worked. Not based on the fact that the actual issues were circumvented, leaving the reader with a totally wrong impression of the issue.
This is terrible scholarship and makes for a terrible article. Not acknowledging the issue is a terrible thing to do, and not explaining why seemingly arbitrary choices are extremely important is another terrible thing to do. The whole article fails at actually portraying the issue at all.
To be clear, I am not saying that this approximation does not work or that with appropriate scaling and the right polynomials these issues can't be mostly circumvented. Or that high degree polynomials "in general" are incalculable. I am saying that this article completely fails to say what the issue is and why the examples work.
> Computing large powers of large numbers is enormously unstable and needs to be avoided under basically all circumstances, that is what makes dealing with those polynomials difficult and cautioning people against this is obvious correct
But we don’t compute large powers of large numbers? Chebyshev is on [-1, 1]. Your large powers go to zero. And your coefficients almost always decrease as the degree goes up. Then to top it off, you usually compute the sum of your terms in descending order, so that the biggest ones are added last.
Could you please read the post before commenting?
"To be clear, I am not saying that this approximation does not work or that with appropriate scaling and the right polynomials these issues can't be mostly circumvented. Or that high degree polynomials "in general" are incalculable. I am saying that this article completely fails to say what the issue is and why the examples work."
I believe the author assumes that it's clear to the reader that there is a distinction between how a mathematical object is defined, and how it's computationally used. A polynomial can be defined as a sum of powers of x, but that's not how it is used computationally. In this sense, the author was mistaken.
But it's not that the problems are "circumvented", in the sense that it's a kind of a hack or a patch, but they are solved, in the sense that there is a systematic way to correctly compute with polynomials.
>But it's not that the problems are "circumvented", in the sense that it's a kind of a hack or a patch, but they are solved, in the sense that there is a systematic way to correctly compute with polynomials.
But the author does not even acknowledge that there is a wrong way. He claims that it is just a "MYTH" that polynomials have huge problems.
Read the article, nowhere does the author acknowledge that these numerical issues exist at all. Nowhere does the author talk about why the specific methods he uses work or even acknowledge that had he used a naive approach and written the polynomials as a power series over a large interval everything would have fallen apart.
Surely the "MYTH" that high degree polynomials are dangerous is not a myth. Concrete examples where naive approaches fail are trivially easy to find.
Again. I think you are missing the point entirely. I am not saying that these Fourier type projections are "wrong" or "bad" or "numerically unstable" or that you can't write an article about those or that high degree polynomials are impossible to work with.
I am saying that if you claim that it is a "MYTH" that high degree polynomials are dangerous, you should mention why people think that is the case and why your method works. Everything else seems totally disingenuous.
Neural network training is harder when the input range is allowed to deviate from [-1, 1]. The only reason why it sometimes works for neural networks is because the first layer has a chance to normalize it.
If you have a set of points, why can't you just map them to an interval [0, 1] before fitting the polynomial?
Well, one catch is that you're doing degree ten thousand. That alone already increases your computational costs and can run afoul of numerical error as your numbers start losing precision. You'll have to start worrying about overflow and underflow. Horner's method or another representation of your polynomial might start to become important. You'll also start to notice your CPU spiking more. In essence, this is kind of fitting more and more points until you get rid of every oscillation you don't want.
The other catch, which the author mentions, is that extrapolation is still essentially impossible with polynomials. This is easy to see. The highest-degree term quickly comes to dominate all the other terms once you step outside the interval of interest, and every non-constant polynomial grows without bound.
Honestly, though, what the author is doing isn't much different than a Taylor approximation. If you're okay fitting any polynomial in a small neighbourhood of the data, then go whole hog and fit the best possible polynomial: a Taylor polynomial. You wouldn't usually need to go to that high degree to get what you want in a neighbourhood either.
Taylor is for points. For functions on an interval you want Chebyshev interpolation, and you probably want to throw in some poles to do better if L1, not L2, matters.
> Horner's method or another representation of your polynomial might start to become important.
Horner's is the default way to evaluate a polynomial - and I think you're overstating the cost of evaluation.
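For reference, Horner's rule is just a fold over the coefficients, which is essentially what np.polyval does under the hood:

    def horner(coeffs, x):
        # coeffs = [c0, c1, ..., cn] for c0 + c1*x + ... + cn*x**n
        acc = 0.0
        for c in reversed(coeffs):
            acc = acc * x + c
        return acc

    assert horner([1, -3, 2], 4.0) == 1 - 3 * 4 + 2 * 16  # 2x^2 - 3x + 1 at x = 4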
> what the author is doing isn't much different than a Taylor approximation
Yes it is. The Taylor approximation is only valid around a point - and it's based on matching the nth order derivative. There are no error metrics. And it requires differentiation to find it.
The original paper about the bias-variance tradeoff, which the double descent papers targeted, had some specific constraints.
1) Data availability and compute limited training set sizes. 2) They could simulate infinite datasets.
While challenging for our minds, training set sizes today make it highly likely that the patterns in your test set are similar to concept classes in your training set.
This is very different from procedurally or randomly generated test sets, both of which can lead to problems like overfitting with over-parameterized networks.
When the chances are that similar patterns exist, the cost of some memorization goes down, and it is actually somewhat helpful for generalization.
There are obviously more factors at play here, but go look at the double descent papers and their citations to early-90s papers and you will see this.
The low sensitivity of transformers also dramatically helps, with UHAT without CoT only having the expressiveness of TC0, and with log-space scratch space having PTIME expressibility.
You can view this from autograd requiring a smooth manifold with the ability to approximate global gradient too if that works better for you.
But yes all intros have to simplify concepts, and there are open questions.
Numerical analysis is ALL about picking the right tool for the job. Unfortunately, there aren't super methods and you need to know how to classify problems.
> So... are _all_ introductions to machine learning just extremely wrong here?
It's more of a heuristic. Most people have their first experience in Excel, where you can fit a polynomial. Cranking up the degree will always improve R^2 (since Excel doesn't do a holdout), so it's a very common mistake new students make.
It's much more understandable at the beginner level to say "you'll overfit if you crank up the degree" than it is to explain regularization and bases. Later on you can introduce it, but early on it's confusing and distracting to students who might not even know how to fit an ordinary least squares model.
It's not like they would sabotage competitors by training them wrong, as a joke.
A well-known related paper that I didn't see mentioned in the article (although Trefethen was mentioned) is "Six Myths of Polynomial Interpolation and Quadrature".
https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf
This article is much better, more informative, and factual than I'd have expected from the title. Note that it's part of an 8-article series.
Worth a read if you're ever fitting functions.
There's something I'm fundamentally missing here--if the standard basis and the Bernstein basis describe exactly the same set of polynomials of degree n, then surely the polynomial of degree n that minimizes the mean square error is unique (and independent of the basis--the error is the samples vs the approximation, the coefficients/basis are not involved), so both the standard basis and Bernstein basis solutions are the same (pathological, overfitted, oscillating) curve?
Like I understand how the standard basis is pathological because the higher degree powers diverge like mad, so given "reasonable" coefficients the Bernstein basis is more likely to give "reasonable" curves, but if you're already minimizing I don't understand how you arrive at a different curve.
What am I missing?
The minimization is regularized, meaning you add a penalty term for large coefficients. The coefficients will be different for the two bases, meaning the regularization will work differently.
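A small sketch of that point with toy data and an arbitrary penalty strength: the same ridge penalty applied to coefficients in two different bases produces two different fitted curves.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 1, 30))
    y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
    grid = np.linspace(0, 1, 200)
    deg = 20

    bases = {
        "standard": (np.vander(x, deg + 1, increasing=True),
                     np.vander(grid, deg + 1, increasing=True)),
        "Legendre": (np.polynomial.legendre.legvander(2 * x - 1, deg),
                     np.polynomial.legendre.legvander(2 * grid - 1, deg)),
    }
    for name, (V, G) in bases.items():
        fit = Ridge(alpha=1e-4, fit_intercept=False).fit(V, y)   # same penalty, different coefficients
        curve = fit.predict(G)                                   # the two fitted curves differ
        print(name, "coef norm:", np.linalg.norm(fit.coef_),
              "curve range:", curve.min(), curve.max())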
Ok, yeah, doing a little googling, that makes sense. I kind of feel that the article author was burying the lede by saying this was about ML optimization, where apparently regularization is the norm (so to speak, lol) and basis selection is the whole ball game, indirectly through the way it influences convex optimization.
This is why I believe a numerical methods course should be a requirement for any AI majors.
I'm exactly in the category of AI majors who are not familiar with numerical methods. Can you broadly explain where the gap in AI pedagogy is and how students can fill it?
The series of articles posted here are interesting and I plan to review them in more detail. But I'm concerned about what the "unknown-unknowns" are.
Sure. The original normalizing flows used a fixed number of layers. Someone at UToronto recognized that, as the number of layers gets very deep, this is essentially an ordinary differential equation (ODE). Why?
Suppose you have n residual layers that look like:
x_0 = input
x_{i+1} = x_i + f(x_i)
x_n = output
If you replace them with an infinite number of layers, and use "time" t instead of "layer" i, you get
x(t+dt) = x(t) + dt f(x(t)) <=> x'(t) = f(x, t)
so to find the output, you just need to solve an ODE. It gets better! The goal of normalizing flows is to "flow" your probability distribution from a normal distribution to some other (e.g. image) distribution. This is usually done by trying to maximize the probability the training images should show up, according to your model, i.e.
loss(model) = product model^{-1}(training image)
Notice how you need the model to be reversible, which is pretty annoying to implement in the finite-layer case, but with some pretty lenient assumptions is guaranteed to be true for an ODE. Also, when you're inverting the model, the probabilities will change according to the derivative; since you have more than one dimension, this means you need to calculate the determinant of the Jacobian for every layer, which is decently costly in the finite-layer case. There are some tricks that can bring this down to O(layer size^2) (Hutchinson++), but the ODE case is trivial to compute (just exp(trace)).
So, turning the model into an ODE makes it blazing fast, and since you can use any ODE solver, you can train at different levels of precision based on the learning rate (i.e. the real log canonical threshold from singular learning theory). I haven't seen any papers that do this exactly, but it's common to use rougher approximations at the beginning of training. Probably the best example of this is the company Liquid AI.
Finally, this all turns out to be very similar to diffusion models. Someone realized this, and combined the two ideas into flow-matching.
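Going back to the residual-layer-to-ODE step above, here is a minimal numeric sketch with a toy f standing in for the network: as the number of residual layers grows (with an explicit step size 1/n), the stacked output approaches the ODE solution.

    import numpy as np
    from scipy.integrate import solve_ivp

    def f(t, x):                       # toy "layer" function; in a real model this is a small network
        return np.tanh(x) - 0.5 * x

    def residual_stack(x, n):          # x_{i+1} = x_i + (1/n) f(x_i): n Euler steps of size 1/n
        for i in range(n):
            x = x + (1.0 / n) * f(i / n, x)
        return x

    x0 = np.array([1.0, -2.0])
    ode_out = solve_ivp(f, (0.0, 1.0), x0, rtol=1e-9, atol=1e-9).y[:, -1]
    for n in (4, 16, 64, 256):
        err = np.linalg.norm(residual_stack(x0, n) - ode_out)
        print(f"{n:4d} layers: difference from the ODE solve = {err:.2e}")  # shrinks roughly like 1/n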
-----
This is one place it's super useful to know numerical methods, but here are a couple others:
1. Weight initialization --> need to know stability analysis (see the sketch below)
2. Convolutions --> the Winograd algorithm, which is similar to ideas in the FFT and quadrature
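On the weight-initialization point, a rough sketch of the kind of stability analysis meant here (random ReLU layers; the width, depth, and scales are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 512, 20
    x = rng.standard_normal((width, 1000))

    for name, scale in [("naive, std=1", 1.0), ("He, std=sqrt(2/width)", np.sqrt(2.0 / width))]:
        h = x.copy()
        for _ in range(depth):
            W = rng.standard_normal((width, width)) * scale
            h = np.maximum(W @ h, 0.0)                 # ReLU layer
        print(f"{name}: activation std after {depth} layers = {h.std():.3e}")
    # the naive scale blows up by many orders of magnitude; the He scale stays O(1)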
> Can you broadly explain where the gap in AI pedagogy is and how students can fill it?
Machine learning can be made much more effective and efficient, by first modeling the problem, and then optimizing that tailored representation. This is an alternative to throwing a bunch of layers of neurons, or copy pasting an architecture, and hoping something works out.
One of the most successful applications of ML is convolutional neural networks, and it is based on this principle. Image processing algorithms come from optical theory, which can be modeled with convolution - so what if we used optimization to find those convolution kernels?
With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.
Also you need to know when a problem is NOT optimization - for example solving equations via the bisection method.
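For instance, a bare-bones bisection sketch: no loss function, no gradients, just a bracketing interval and sign changes.

    def bisect(f, lo, hi, tol=1e-12):
        # assumes f(lo) and f(hi) have opposite signs, so a root lies in [lo, hi]
        assert f(lo) * f(hi) < 0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if f(lo) * f(mid) <= 0:
                hi = mid
            else:
                lo = mid
        return 0.5 * (lo + hi)

    print(bisect(lambda x: x**3 - 2, 0.0, 2.0))  # cube root of 2, roughly 1.259921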
>But I'm concerned about what the "unknown-unknowns" are.
Try the examples in the article with the interval 0 to 10.
That problem's a known-known for most people interested in any of this, and for anyone who's made it as far as the first picture.
It obviously is something the author lacks any understanding of, as it is the most obvious reason why polynomials of high degree are dangerous.
If you are arguing that they aren't, of course you have to at least mention that objection and point out how to circumvent it.
See also this stackexchange answer, which makes basically the same point:
https://stats.stackexchange.com/questions/560383/how-can-we-...
I think the posted article has some important confusions. First is the distinction between a basis and an objective. If I fit a polynomial to go through every point in my dataset, I will end up with exactly the same resulting solution, it doesn't matter what basis I use to represent it (although the basis-specific coefficients will of course differ). This is because all polynomial bases (by definition) represent the same hypothesis class, and the recipe "fit a polynomial that passes through every point" can be expressed as an optimization problem over that hypothesis class. More generally, any two bases will give exactly the same solution when optimizing for the same objective (ignoring things like floating point errors).
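A quick sketch of that claim with an interpolation objective (8 random points, a degree-7 polynomial): the monomial and Chebyshev bases give the same curve up to floating point noise, even though the coefficients look completely different.

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 1, 8)
    y = rng.standard_normal(8)

    # Same objective (pass through every point), two different bases
    c_std = np.linalg.solve(np.vander(x, 8, increasing=True), y)
    c_cheb = np.linalg.solve(np.polynomial.chebyshev.chebvander(2 * x - 1, 7), y)

    grid = np.linspace(0, 1, 200)
    p_std = np.vander(grid, 8, increasing=True) @ c_std
    p_cheb = np.polynomial.chebyshev.chebvander(2 * grid - 1, 7) @ c_cheb
    print(np.max(np.abs(p_std - p_cheb)))  # tiny: same curve, different coefficients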
So why do the results with Bernstein basis look better than for the standard basis? It's because they are actually optimizing a different objective function (which is explicitly written out in the stackexchange post). So in that sense the Bernstein-vs.-standard basis is really a comparison between different objective functions rather than between different bases. It so happens that the solution to the new objective function has a simple expression in terms of the Bernstein basis; but in principle I could set up and solve the objective in terms of the standard basis and obtain exactly the same result. From this perspective, the Bernstein polynomials are nothing more than a convenient computational device for expressing the solution to the objective. The OP kind of gets at this with the distinction between Fitting and Interpolation, but seems to conflate different fitting procedures with different bases.
Secondly, there is actually no need for explicit regularization when fitting Bernstein polynomials (cf. the stackexchange post). The way the OP fits Bernstein polynomials is non-standard. Although one can say that the specific form of the objective function provides an important source of implicit regularization.
So in conclusion I think the OP has importantly mis-attributed the source of success of the Bernstein method. It does not have to do with explicit regularization or choice of basis, and has everything to do with re-defining the objective function.
This part about double decent is really good: https://alexshtf.github.io/2025/03/27/Free-Poly.html#fnref:2
Right? I'm no stranger to ML, but this feels like magic, so cool! It clearly explains why normalizing features can be important (some polynomials blow up outside their range), provides a simple example of double descent and fits extremely high-degree polynomials without loss of generalization - amazing stuff!
In this case wouldn't a Fourier-type approach work better? At least there's no risk the function blows up and it possibly needs fewer parameters?
The problem with Fourier approximation is that it works terribly for relatively "simple" functions. E.g. fitting a linear relationship is extremely hard.
Ah good point. There ought to be a way to get the best of both worlds.
https://en.wikipedia.org/wiki/Pad%C3%A9_approximant
The "normal" Pade Approximation is irrelevant. Could you point out why exactly this is superior or actually solves the issue? Any paper comparing these methods would also be interesting.
One way I handle the 'simple case that Fourier has trouble with' is I add the 'simple' bases to the catalogue of Fourier basis functions and then orthogonalise the resulting basis system.
Depending on specific 'simple' cases this can be messy but effective.
The same deal happens for decision trees. (Axis parallel) Trees have a hard time fitting linear functions, requiring many modes. One can use essentially the same solution.
BTW very informative commentary. The things that you bring up really needed to be brought out.
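A minimal sketch of the orthogonalisation trick described above, with a made-up target (a linear trend plus a decaying wiggle): stack the 'simple' columns together with some Fourier columns and orthogonalise the whole system with a QR factorization.

    import numpy as np

    x = np.linspace(0, 1, 400)
    y = 3.0 * x + np.sin(8 * np.pi * x) * np.exp(-3 * x)

    cols = [np.ones_like(x), x]                       # the 'simple' bases
    for k in range(1, 11):                            # plus a few Fourier modes on [0, 1]
        cols += [np.sin(2 * np.pi * k * x), np.cos(2 * np.pi * k * x)]
    A = np.column_stack(cols)

    Q, _ = np.linalg.qr(A)                            # orthogonalise the mixed basis (discrete Gram-Schmidt)
    coef = Q.T @ y                                    # fitting is now just a projection
    print("max residual:", np.max(np.abs(y - Q @ coef)))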
I’ll keep asking my ignorant questions since I found informed people: what about a basis of wavelets or some heterogeneous bag of functions (which could include linear, polynomial times Gaussian, and why not some harmonic functions too)? At the end of the day a functional basis might not need to be “well formed” like Fourier or some polynomial family. It just needs to do the job.
Yeah, I thought the whole point of a polynomial approximation was that it is really only useful for the first couple of powers (quick and cheap) or because you have a particular process that you know a priori has a particular form (non-convergent, non-conservative, higher powered, etc.).
The "particular form" is important--if you don't know that then there is no reason for choosing x^10000 over x^5 (other than computational complexity). But there is also no reason for choosing x^5 over x^10000! Maybe that function really is flat and maybe it isn't. You really just don't know.
If you don't have anything too weird, Fourier is pretty much optimal in the limit--if a bit expensive to calculate. In addition, since you can "bandwidth limit" it, you can very easily control overfitting and oscillation. Even more, Fourier often reflects something about the underlying process (epicycles, for example, were correct--they were pointing at the fact that orbits were ellipses).
>If you don't have anything too weird, Fourier is pretty much optimal
"Optimal" by what metric? Why is projecting on the Fourier basis "better" than projecting on a polynomial basis?
Pick a metric. Fourier is probably better in general.
With respect to machine learning, probably the fact that Fourier is bounded and gives coherent (if completely random) results even very far from the interval of interest.
However, this is like saying that an O(n log n) algorithm is better than O(n^2). Sure, it's true in the limit, but the constant terms can be quite large and O(n^2) can remain better up to remarkably useful values of n.
As I pointed out, the advantage that polynomial basis generally has is that you can be accurate enough with lower powers that are much, much faster to calculate (we use matrices of linear equations precisely for that reason). Or, you can match a particular process because you know specifically that it follows some particular function (we know that road damage follows the fourth power of load--it would be counterproductive to model that with Fourier).
Using high powers for polynomial basis is almost always worse than any other choice.
I don't disagree, but I do not think it's meaningful to call something optimal if it is clearly unusable under certain circumstances.
If you know that some relationship is close to polynomial, obviously a polynomial basis is more suitable. E.g. a line is approximated terribly by a truncated Fourier series.
>Using high powers for polynomial basis is almost always worse than any other choice.
For some value of "high", yes.
It should also be noted that this kind of fitting is extremely closely related to integration, or in other words to calculating the mean.
By using the "right" kind of polynomial basis you can get a polynomial approximation which also tells you the mean and variance of the function under a random variable.
For large polynomial degree, this seems to me to be very similar to the P-spline method with a very large number of parameters.
Wanted to add my voice to the chorus of appreciation for this article (actually a series of 8). Very informative and engaging.
Great article! Very curious how this orthogonalization + regularization idea could be extended to other kinds of series, such as Fourier series.
This is the Fourier projection, just on a different basis: the mathematics are nearly identical, you just change the subspace.
The Fourier basis is already orthonormal.
See also https://en.wikipedia.org/wiki/Chebyshev_polynomials
They're the kind of math which is non-obvious and a bit intricate, yet if you knew the basics of how they worked and you were bored you might sit down with pencil and paper and derive everything about them. Then you wouldn't be bored anymore.
good article, but I am very bothered by the "standard basis"... it's called the canonical basis in math. I don't think "standard" is the right name in any context.
If you treat the ring of polynomials as a vector space in their coefficients, then the unit monomials 1, x, x^2, etc. form the standard basis of that vector space. Wikipedia has an example of "standard basis" in this sense [0].
[0] https://en.wikipedia.org/wiki/Standard_basis#Generalizations
I completely disagree with the conclusion of the article. The reason the examples worked so well is because of an arbitrary choice, which went completely uncommented.
The interval was chosen as 0 to 1. This single fact was what made this feasible. Had the interval been chosen as 0 to 10, a degree-100 polynomial would have to compute values on the order of 10^100, which would have led to drastic numerical errors.
The article totally fails to give any of the totally legitimate and very important reasons why high degree polynomials are dangerous. It is absurd to say that well-known numerical problems do not exist because you just found one example where they did not occur.
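For scale, a two-line check of my own (not from the article or the thread): on [0, 1] the monomials x^k stay tame, while at x = 10 they reach 1e100 by degree 100 and overflow float64 (max ~1.8e308) somewhere past degree 300.

    import numpy as np

    # Magnitudes of the monomial basis on the two intervals being argued about.
    degrees = np.arange(0, 401, 100, dtype=float)
    for x0 in (0.9, 10.0):
        with np.errstate(over="ignore"):
            vals = np.power(x0, degrees)
        print(f"x = {x0}: x**{degrees.astype(int).tolist()} = {vals}")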
The article specifically points out that these polynomials only work well on specific intervals (emphasis copied from the article):
"The second source of their bad reputation is misunderstanding of Weierstrass’ approximation theorem. It’s usually cited as “polynomials can approximate arbitrary continuous functions”. But that’s not entrely true. They can approximate arbitrary continuous functions in an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid the use of polynomials on raw un-normalized features."
As I understand it, one of the main ideas of this series of posts is that normalizing features to very specific intervals is important when fitting polynomials. I don't think this "went completely uncommented".
Yes! And the next articles in the series double down on this:
"Any polynomial basis has a “natural domain” where its approximation properties are well-known. Raw features must be normalized to that domain. The natural domain of the Bernstein basis is the interval [0,1][0,1]."
Totally irrelevant. The natural domain, if compact, can always be rescaled; it has nothing to do with the numerical problems.
Also, the Hermite polynomials have an unbounded natural domain.
The quote has absolutely nothing to do with my point.
The scaling to an interval in the quote is about formal mathematical reasons, in particular that polynomials do not approximate continuous functions globally. This is totally unrelated to numerics.
The issue is that in particular the interval 0 to 1 has to be chosen, as otherwise the numerics totally fall apart. The message of the article is that high degree polynomials pose no danger, but that is wrong. All the examples in the article only work because of a specific choice of interval. All the major numerical issues are totally ignored, even though acknowledging them would immediately invalidate the core thesis of the article. If you calculate 10^100 in 64-bit floating point you will run into trouble. The article pretends that will not be the case.
However, if you normalize your data to [0,1], you'll never have to compute 10^100 and thus never face any numerical issues. "Never" assumes no distribution shift.
Indeed, the examples work thanks to this choice of the interval, but this comes with the choice of the basis. Of course Bernstein basis functions explode outside [0,1], but I think the point is that high-degree polynomials pose no danger if you scale the data *according to the polynomial* (use [0,1] for Bernstein and [-1,1] for Chebyshev, for example). So the "magic combo" is polynomial + scaling to its interval. Otherwise all bets are off, of course.
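To make that "magic combo" concrete, a small sketch of my own with synthetic data: numpy's Chebyshev class maps the raw data interval onto the basis's natural domain [-1, 1] for you, so a high-degree fit on a raw feature in [0, 1000] stays numerically stable.

    import numpy as np
    from numpy.polynomial import Chebyshev

    # Pick a basis and scale the raw feature to that basis's natural domain.
    # Chebyshev.fit stores the data interval as its domain and maps it to [-1, 1]
    # internally before evaluating the basis.
    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 1000, 500))          # raw, un-normalized feature
    y = np.sin(x / 50.0) + 0.1 * rng.normal(size=x.size)

    p = Chebyshev.fit(x, y, deg=100)                # domain defaults to [x.min(), x.max()]
    print("domain used:", p.domain)
    print("train RMSE:", np.sqrt(np.mean((p(x) - y) ** 2)))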
The article totally ignores this and does not even mention the numerical issues at all, which is pretty insane.
Surely naming THE ONE reason high degree polynomials are dangerous has to be done, at the very least. Writing an article arguing that something is not a problem, while not even acknowledging the single most important reason why people believe the problem exists, is totally disingenuous and pretty terrible scholarship.
At least include that the choice of 0 to 1 is necessary for this to work. Not including it makes the author look either clueless or malicious.
Yes, and it fails to talk about boundary conditions or predicates or whatever.
Great article and clever use of linkbait!
And the discussion here is pretty good.
One thought I had while looking into regression recently is to consider the model created with a given regularization coefficient not as a line on a 2-dimensional graph but as a slice of a surface on a 3-dimensional graph, where the third dimension is the regularization coefficient.
In my case the model was for logistic regression and it was the boundary lines of the classification, but the thought is largely the same. Viewed as a 3D shape formed by the boundary lines, hill tops become areas where entire classification boundaries disappear as the regularization coefficient grows large enough to eliminate them. Impractical to do on models of any size and only useful when looking at two features at a time, but a fun consideration.
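A rough sketch of that picture (my own, on a toy dataset): sweep the regularization strength C of a two-feature logistic regression and record each decision boundary; stacking the boundaries along the C axis traces out exactly the kind of surface being described.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # One decision boundary per regularization strength: w1*x1 + w2*x2 + b = 0.
    X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                               n_informative=2, random_state=0)
    for C in np.logspace(-3, 2, 6):
        clf = LogisticRegression(C=C).fit(X, y)
        (w1, w2), b = clf.coef_[0], clf.intercept_[0]
        print(f"C={C:8.3f}  boundary: {w1:+.3f}*x1 {w2:+.3f}*x2 {b:+.3f} = 0")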
More on topic with the article, how well does this work when considering multiple features and the different combinations of them? Instead of sigma(n => 50) of x^n, what happens if you have sigma(n => 50) of sigma(m => 50) of (x^n)*(y^m)? Well, probably less than 50 in the second example; maybe it is fair to have n and m go to 7 so there are 49 total terms, compared to the original 50, instead of 2500 terms if they both go to 50.
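A back-of-the-envelope check of those counts (my own sketch; the helper names are made up): the full tensor product with each exponent up to d has (d+1)^2 terms, which is why one usually caps the total degree n + m <= d instead, cutting the count roughly in half.

    import numpy as np
    from itertools import product

    # Tensor-product features for two inputs: all x**n * y**m with n, m <= d.
    def tensor_features(x, y, d):
        return np.column_stack([x**n * y**m
                                for n, m in product(range(d + 1), repeat=2)])

    x = np.random.rand(100)
    y = np.random.rand(100)
    print("d=7  ->", tensor_features(x, y, 7).shape[1], "terms")     # 64
    print("d=50 ->", tensor_features(x, y, 50).shape[1], "terms")    # 2601

    # Total-degree truncation keeps only the terms with n + m <= d.
    total_deg = lambda d: sum(1 for n, m in product(range(d + 1), repeat=2)
                              if n + m <= d)
    print("total degree <= 50 ->", total_deg(50), "terms")           # 1326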