I came across a historical story that offers important lessons in system design strategy. These lessons came to mind as I was considering an exploratory data-analysis strategy for root-cause analysis after a product failure. I call this context “Quality forensics.” (It comes up often enough to deserve a name….)
Thus, here I offer a tale of World War II aviation, machine learning, and product quality forensic analysis. It will all make sense eventually….
Grumman F6F Hellcat: a lesson in system design

(You might enjoy this short documentary.)
If you, or someone you love, have an interest in WWII fighter aircraft, you’re no doubt familiar with the most famous, iconic, and most capable American fighters: the P-51 Mustang, the P-38 Lightning, and the Vought F4U Corsair. By “capable” I mean, mainly, fast, but also reasonably (or very) maneuverable, with long range.
However, there is an American fighter aircraft that was not the fastest, nor the most maneuverable, nor the longest-ranged, and accordingly has never attained the fame of those planes. Nevertheless, it was arguably the most effective American fighter aircraft of the war: the Grumman F6F Hellcat.
There are system design lessons to be had here. Let me explain.
When the United States entered the war against Japan, it found its then-deployed carrier fighter, the Grumman F4F Wildcat, to be outclassed by the Mitsubishi A6M Zero. When they encountered the Zero, US naval flyers in their Wildcats managed to avoid utter devastation only by deploying clever tactics and flying defensively. They were in no position to go on the offensive in the skies. This aircraft was not going to win the war.
The Navy had a successor in development, the Vought F4U Corsair (see image below). The Corsair’s design principle was to pair the largest possible engine with the largest possible propeller. As such, it exerted substantial torque, which challenged pilots. The fact that, during takeoff, this torque tended to roll the Corsair rightwards, toward an aircraft carrier’s bridge, convinced the Navy to give the plane to the Marines, who could fly it from land bases.

So the Navy still needed a carrier-based successor to the Wildcat. Grumman, maker of the Wildcat, was already working on the F6F Hellcat, and the Navy asked Grumman to accelerate its development.
Even though Grumman was under time pressure, its engineers took care to ensure the system would be successful against the Zero. Engineers interviewed multiple pilots who had engaged with Zeros; one engineer even flew (from the East Coast?) to Hawai’i to interview a pilot, a non-trivial journey in those days!
The US also recovered an intact Zero, allowing designers to discover its weak points.
With all of this information, Grumman tweaked its design in large and small ways. The Hellcat in development was no longer a standard upgrade; rather, it was becoming a tailor-made Zero killer.
- Grumman engineers improved the pilot’s visibility by raising his position in the cockpit and sloping the top of the plane’s front downwards.
- The Hellcat was easy to control, lacking the Corsair’s inherent torque.
- It was adapted to work from aircraft carriers, with folding wings and low-speed takeoff and landing.
- It was easy to manufacture in large numbers.
- It was easy to repair.
- It was rugged; it could take hits and keep flying, thanks to self-sealing fuel tanks and components that exceeded their specified requirements. Grumman was famous for testing components beyond their specifications. One way to prevail in dogfights is to not get shot down.
Meanwhile, the Navy refined its dogfighting tactics, tactics suitable to the Hellcat, and trained all its pilots in them. In other services, tactics were left for lower-level commanders to determine.
The Hellcat entered service in 1943, and immediately began to dominate the Japanese aircraft it encountered. It went on to become one of the most effective fighter aircraft of WWII, and certainly the most effective carrier-based airplane. It shot down more aircraft in WWII than any other US Navy aircraft, and earned a win/loss ratio of 19:1, higher than any other aircraft in US forces.
One has to be a little circumspect regarding a win/loss ratio: it depends on the opposition aircraft as well. (I’m a statistician, I’m professionally obligated to mention weaknesses in the scoring system!) American fighter aircraft in the European theater engaged with German planes that were class-leading. The Zero was class-leading at the beginning of the war, but fighters around the world began to use more powerful engines in next-generation aircraft. Japan, however, was never able to field a next-generation fighter in large numbers, although they developed a few.
Still, the point stands that the Hellcat was developed in a multidimensional, well-rounded way to deliver on a particular task, and for that task, it excelled extravagantly. May your development projects go as well!
The Hellcat’s lessons for all designers
The Hellcat development story is one in which:
- There is a small set of criteria widely held to be critical: “primary” criteria.
  - E.g., speed, maneuverability, and range.
  - These are often focused on the proximate mission, e.g., dogfighting.
- There are also secondary criteria (visibility, ruggedness, ability to operate safely from aircraft carriers, etc.).
  - Many secondary criteria pertain to deployment and ease of use, rather than the proximate mission. Dogfighting performance is important, but the system must also get to the dogfight.
- A designer who values deployment-related secondary criteria exercises a holistic view of what makes a system successful.
  - Such holistic designers are willing to sacrifice a little bit on the primary criteria in order to deliver on secondary criteria.
  - A well-rounded system may well be more successful than a “better” system that is harder to use or maintain.
  - Note that the system still does very well on the primary criteria; the designers haven’t abandoned them.
The Vought Corsair is a good example of design focused overwhelmingly on primary criteria (e.g., speed). Its designers sought to make one of the fastest propeller-driven aircraft in history, and they succeeded. But in doing so they sacrificed some secondary criteria (visibility, ease of handling) which made it less suitable than the Hellcat for the intended mission. Strictly speaking, it was not successful at its originally-assigned task, even though the world remembers the Corsair better than it remembers the Hellcat.
I suggest that when there is a commonly-accepted critical criterion, and a designed system performs very well but not strictly best-in-class for that criterion, yet is overall well-rounded and effective in its context, we call that system a “Hellcat” in honor of the highly-effective yet often-overlooked WWII fighter.
Context of quality forensics
A Hellcat is a Hellcat in a particular context; there is no general Hellcat. What is the context for quality forensic exploratory analysis?
Quality forensics is a private exploratory analysis exercise designed to generate leads for an internal team to investigate. A product has failed; hopefully the failure has been detected by QC methods or monitoring, and not by customer complaints!
Once a failure has been discovered, you can’t simply spin up processes again; how do you know the process won’t exhibit the same failures? You must identify the root cause of the failure, demonstrate that it is in fact a root cause–and the only root cause–and put preventative measures in place.
While you’re doing all this, the product line is dormant and your company is burning money. Pressure!
If the product or process is biochemical in nature, and uses biologically-generated materials, then it may not be possible to map out all mechanisms down to first principles, forcing the team to adopt a degree of empiricism. Data analysis will be paramount.
A team will pull together all available data for all failing cases and for all successful cases going back for a period of time. What batch numbers of component materials were used? When were the components manufactured? Received? When were the final-product batches manufactured? Which dates fell on weekends? Fridays? Mondays? (If a process doesn’t run on a weekend, materials might be held differently or processed differently before, during, or after the weekend.) And so on.
Once the data analyst receives the compiled data, she will need to generate results very quickly. Those results will need to be:
- Delivered quickly, as mentioned.
  - It’s preferable not to have to spend time optimizing fitting parameters.
  - Flexibility of the model will be key.
- Clear, not ambiguous or merely suggestive.
- Robust to the choice of fitting parameters. Oversensitivity of conclusions to fitting parameters may reduce trust in the results.
- Comprehensive; all possibilities should be investigated.
- Descriptive. If a particular factor predicts failure, what is its magnitude? If it’s continuous, what is the shape of its relation to the outcome?
- Generated from small data sets.
  - Statistical efficiency will be required. Binning continuous variables is not recommended.
  - Resampling methods such as cross-validation may not be available.
One thing the analyst does not need to do is assess evidence, or make an argument to a skeptical audience. The analyst will generate leads and characterize them; the team will validate them with further research. The analyst needs to minimize the likelihood of false leads and generate the best possible relationship estimates for the team’s benefit. She does not need to be a data referee. Since many statisticians predominantly serve in such a referee role, some may find it difficult to adapt their priorities and practices to this context; please see my post on statistical archetypes.
The Hellcats of exploratory analysis
With the context now established, let me introduce my Hellcats of exploratory analysis:
- Random Forest
- MARS (Multivariate Adaptive Regression Splines)
- Generalized Additive Models (GAM’s), as implemented in R’s mgcv package
A Hellcat is extremely effective for its context yet not class-leading on a criterion widely viewed as critical. In this case, that primary criterion is predictive accuracy. Yet these tools’ predictive accuracy is very good; see the tool-specific discussion below.
The classification tools that are proving most successful nowadays, and are certainly getting the most attention, are deep neural nets. Given the limited sample size and the lack of hierarchical structure in the data, I would actually expect deep neural nets to perform poorly in this context. My Hellcats of exploratory analysis don’t get that kind of attention, yet they are incredibly effective.
Random Forest
Random Forest (for prediction and classification) is probably the fastest, most effective, and most idiot-proof exploratory modeling tool in existence. Its operating principle is that if we can manage to generate a large number of weakly-correlated predictive models, each of which is nearly unbiased yet has high variance, then if we average their predictions, we’ll have a nearly-unbiased, low-variance model. The trick is in generating the weakly-correlated individual, nearly-unbiased models; after that, the “smoothing parameter” is essentially the Law of Large Numbers. Thus, once a sufficient number of individual models has been generated, generating more doesn’t cause overfitting. “Tuning” a random forest fit is essentially never a concern, unless the data is problematic for the method (more on that below).
In applications, Random Forest consistently exhibits estimated prediction accuracy below that of the class leaders, neural nets and support-vector machines. But its estimated performance is consistently only slightly lower, and it offers many other useful attributes. It exhibits a classic Hellcat profile.
As with any tree-based method, we can see which variables were selected (and which weren’t), and the increase in response homogeneity due to each split. This supports an excellent variable-importance measure. Another popular measure assesses how much prediction degrades if a variable’s values are permuted.
Random Forest doesn’t offer any built-in methods for plotting identified relationships, but the plotmo R package may help. I discuss this more with MARS, below.
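To make this concrete, here is a minimal sketch of a Random Forest fit in R, assuming a hypothetical data frame dat with the response in column y and candidate predictors in the remaining columns (nothing is tuned; the randomForest package’s defaults are used):
library(randomForest)
library(plotmo)
rfMod <- randomForest(y ~ ., data = dat, importance = TRUE)  # importance = TRUE adds permutation importance
importance(rfMod)    # variable-importance measures (permutation-based and node-purity-based)
varImpPlot(rfMod)    # quick visual ranking of the variables
plotmo(rfMod)        # plot the fitted relationships for the variables the forest uses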
Random Forest has some limitations that may not be obvious:
- It requires a minimum sample size and a minimum number of predictor variables in order to work well. I don’t have a clear recommendation, but I would suggest at least 50 cases and at least 5 predictor variables.
- It may be counter-intuitive that Random Forest can have too few predictor variables to work well. This stems from its reliance on randomly sampling subsets of predictor variables that have limited overlap with one another.
- Random Forest’s prediction at a point is always a convex combination of outcomes from a subset of observations. Thus, it will never predict a response beyond the limits of the observed data. In some ways this is a good thing, but it also implies an edge effect: predictions at the edge of the observed data will be biased towards the overall mean. Consider, for instance, that a prediction at the boundary will be a weighted average of points within the boundary.
- Random Forest can seriously overfit if one variable is a factor with many unordered levels, such as a set of zip codes or a set of component lot numbers. Substitute quantitative information for these things; for instance, instead of zip codes, use latitude and longitude as quantitative variables; instead of batch number, use the time, or order, of manufacture.
Multivariate Adaptive Regression Splines (MARS)
Multivariate Adaptive Regression Splines, or “MARS,” is another Hellcat. I recommend looking at the earth R package (some creative naming there, to get around trademark restrictions!). The earth package is written by Stephen Milborrow. He also wrote the plotmo package, a general package for plotting estimated effects based on a model’s predicted values. It is also well worth investigating; it is the default plotting method for earth, but it can be applied to other methods. Kudos to Dr. Milborrow!
Dr. Milborrow has written some excellent documentation on earth, and also on plotmo.
For fast computing, MARS uses linear or piecewise-linear fits. It searches for linear effects, slope change-point locations, and interactions in the form of products (tensor-product splines). Because it iteratively scans the predictor data set for the best elements to include or exclude, it is constructive in the sense that tree-based methods are, and so offers a quick, fundamental variable-importance measure: which variable(s) did it include?
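Roughly speaking (this is the standard MARS formulation, not anything particular to earth), the fitted model is a sum of hinge-function terms:
\[
\hat{f}(x) \;=\; \beta_0 + \sum_{m=1}^{M} \beta_m B_m(x),
\]
where each basis function \(B_m(x)\) is a hinge term \(\max(0,\, x_j - t)\) or \(\max(0,\, t - x_j)\) for some variable \(x_j\) and knot \(t\), or a product of such hinges (the tensor-product interactions mentioned above).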
What is the hedge that makes MARS a Hellcat?
- MARS doesn’t use any shrinkage in estimation, the way LASSO and ridge regression do. It merely searches for basis elements to include or exclude. Once an element is included, the fit is via maximum likelihood, with no fitting penalty. A variant that used some flavor of LASSO to simultaneously select factors and shrink their coefficient estimates would probably yield modestly improved predictions.
  - However, that approach would allow many small shrunken effects to be included; the model would be less sparse, and so harder to describe.
- Because it uses piecewise-linear fits, its fitted surfaces can look crude and unrealistic.
Meanwhile the contextual secondary Hellcat features include:
- Computational speed
- Relative ease of comprehension and plotting
- Multiple realistic variable-importance metrics
- No requirement for resampling-based complexity assessment.
- Ability to operate effectively on small data sets
By default, MARS calculates “generalized cross-validation” (GCV). The name is a misnomer: no cross-validation occurs, and it is not a generalization of cross-validation; rather, it is an analytical approximation to leave-one-out cross-validation, based on statistical expectations. GCV is popular in the fitting of curves by splines, and in my experience it works pretty well, although it can be a bit permissive (leading to modest overfitting).
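For reference, the generic form of the GCV criterion is
\[
\mathrm{GCV}(M) \;=\; \frac{\tfrac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\bigl(1 - C(M)/n\bigr)^2},
\]
where \(n\) is the number of observations and \(C(M)\) is the effective number of parameters of model \(M\) (the exact count earth plugs in depends on its penalty settings). The squared denominator inflates the apparent error of more complex models, mimicking leave-one-out cross-validation without refitting.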
By using GCV, MARS does not require cross-validation. This opens the way to operate on small data sets.
A small example
How well can MARS perform on a small data set? I simulated one (a sketch of the simulation code follows the list):
- Five quantitative variables are independently uniformly distributed (from zero to one). Call them \(X_1\) through \(X_5\) .
- The (quantitative) response has expected value \(X_1\) if \(X_1 < 0.5\) and 0.5 otherwise.
- I added Gaussian error with SD 0.1.
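Here is a minimal sketch of how such a data set could be simulated; the column names and the seed are my own arbitrary choices, and the sample size of 20 is taken from the discussion below:
set.seed(1)   # arbitrary seed
n <- 20
dat <- as.data.frame(matrix(runif(n * 5), ncol = 5,
                            dimnames = list(NULL, paste0("x", 1:5))))
dat$y <- pmin(dat$x1, 0.5) + rnorm(n, sd = 0.1)   # E[y] rises with x1 up to 0.5, then is flat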
With no particular tuning, the earth package identified the following:
- \(X_1\) is the only important variable.
- There is a change in slope in \(X_1\).
Here is variable \(X_1\) and the fitted model:

The fact that MARS correctly identified variable \(X_1\), and only \(X_1\), and fit the data pretty well, all with only 20 observations, is pretty remarkable!
Fitting the model was extremely easy. Here is the sum total of the code I needed to write in order to fit the model and extract basic information, once the data set was in an appropriate format:
Mod1 <- earth(x = dat[, 1:5], y = dat$y)
plotmo(Mod1)
summary(Mod1)
- The first line fits the model, of course. I could have added an argument to allow for two-way interactions (see the sketch after this list).
  - With this example data, if given that option, the search correctly does not identify any interactions.
- The second line calls a generic plotting function, which adapts to the elements selected in the model.
- The third line’s summary presents useful information in a readable format.
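For completeness, the interaction-allowing variant mentioned in the first bullet would look something like this; earth’s degree argument sets the maximum interaction order, and degree = 2 permits two-way products:
Mod2 <- earth(x = dat[, 1:5], y = dat$y, degree = 2)   # allow two-way interactions in the search
summary(Mod2)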
Statistical evidence
MARS doesn’t provide any assessment of statistical evidence. And you shouldn’t believe it if it did: since it conducts a search for variable inclusion, all of the included variables would appear to have statistical support, which would be overly optimistic. However, it is easy to deploy an omnibus “Is anything there?” test by permuting the response values and calculating the correlation between the observed response and the predicted response (a sketch follows the list below).
- We can expect a positive correlation even when response values are permuted (there is “nothing there”), since substantial searching occurs in the course of fitting the model.
- If the original (unpermuted) response values yield a higher correlation than those derived from permuted responses, this indicates evidence that there is “something there.”
- A permutation test p-value estimate is the fraction of permutation-based correlation values that equal or exceed the observed correlation.
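Here is a hedged sketch of that permutation test, reusing dat and Mod1 from the example above; 500 permutations matches the run reported below, and an intercept-only permuted fit (whose correlation is undefined) is treated as having zero correlation:
obs_cor <- cor(dat$y, as.vector(predict(Mod1)))   # correlation for the real response
perm_cor <- replicate(500, {
  y_perm <- sample(dat$y)                         # permute the response
  m_perm <- earth(x = dat[, 1:5], y = y_perm)     # rerun the full MARS search
  cor(y_perm, as.vector(predict(m_perm)))
})
perm_cor[is.na(perm_cor)] <- 0                    # intercept-only fits contribute no correlation
mean(perm_cor >= obs_cor)                         # permutation p-value estimate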
In this case, I generated 500 permutations (this took a fraction of a second), and only one permutation induced a correlation greater than the correlation based on the real data, for a p-value estimate of 1/500, or 0.002, indicating strong evidence that at least one component of the model is not null. The following figure illustrates; clearly the observed correlation is “extreme” relative to the permuted values.

This finding pertains to the whole model, not to particular components. However, the model also supports a feature-importance ranking. Since we have evidence that at least one identified feature matters, we can presume that evidence applies to the most important one; if more than one feature is identified, we don’t know how far down the ranking the evidence extends. Still, this offers a strong basis for the team to start with the most important factor and work down, depending on what their budget allows.
In this simple example, you can see that MARS is fast, easy, informative, and generates good predictions, even with a small data set. It is a Hellcat for exploratory analysis!
Generalized Additive Modeling via R’s mgcv package
Generalized Additive Modeling is an extension of regression modeling in which we relax the requirement of linear contributions. A quantitative factor’s contribution to the response remains mathematically independent of contributions from other factors, yet it is not constrained to be linear in its effect.
For informativeness of a model, it turns out that additivity is much more critical than linearity. We can’t easily describe how a fitted neural net makes its predictions, because it breaks additivity; a unit change in variable \(X\) yields some change in \(Y\) but it depends on all the other variables, so we can’t make a general statement about how \(X\) influences \(Y\).
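In symbols (this is the standard generalized additive model formulation), a response \(Y\) with link function \(g\) is modeled as
\[
g\bigl(E[Y]\bigr) \;=\; \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p),
\]
where each \(f_j\) is a smooth function estimated from the data. Changing \(X_j\) shifts the link-scale prediction by \(f_j(x_j^{\text{new}}) - f_j(x_j^{\text{old}})\), an amount that does not depend on the other variables; that is exactly the describability that additivity buys.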
mgcv is a Hellcat for a slightly different context, one which may also apply to root-cause exploration:
- You have a small number of variables.
- You want to assess the effect of continuous variables, such as:
  - The frequency of failures over time, in manufacturing
  - Whether specimen age contributes to failure rate
I’ve used mgcv a number of times to assess the role of time in a process, including QC forensics, so I include it among the Hellcats.
mgcv doesn’t conveniently select variables, so if discovering variables of interest is key, stick with Random Forest or MARS.
mgcv uses GCV, like MARS, except to determine the smoothness of flexible curves rather than the inclusion or exclusion of variables. If you have a non-small data set, you can tolerate a number of useless variables and still get a good estimate of continuous relationships, so simply don’t bother with variable selection. However, if you have a small data set, you may have to tweak smoothing parameters manually, which makes the tool less attractive in terms of speed of analysis and robustness to fitting parameters.
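As a hedged sketch of the kind of use I have in mind (the data frame qc and its columns fail, run_date, and specimen_age are hypothetical, and mgcv’s default GCV-based smoothness selection is assumed):
library(mgcv)
qc$day <- as.numeric(qc$run_date - min(qc$run_date))   # days since the first production run
gamMod <- gam(fail ~ s(day) + s(specimen_age),
              family = binomial, data = qc)             # smoothness chosen by GCV by default
summary(gamMod)          # approximate significance of each smooth term
plot(gamMod, pages = 1)  # estimated smooth effects of time and specimen age
# With a very small data set, the basis size (e.g., s(day, k = 5)) or the sp argument
# may need manual tweaking, as noted above.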
Epilogue: why exploratory modeling?
If you asked people with basic statistics training to search among candidate predictor variables for correlations with a response variable, many would generate a plot for each predictor, with an accompanying correlation or hypothesis test. Hopefully this would surface a factor that correlates substantially better than the others. Isn’t this the standard classical analysis? Why model at all, then?
My answer is:
- Speed. A model-fitting algorithm conducts a thorough internal search for relationships, all in the blink of an eye.
- Thoroughness. See the aforementioned internal search.
- Compactness. One fitted model puts all the important information in one place.
- Ability to discover and characterize multiple influencers acting at the same time, adjusting each for the effect of the other.
- Ability to discover whether there is an interaction between two factors, i.e., whether the shape of one factor’s contribution to the response depends on the state of the other.
In truth, I would also generate plots–but rather than relating factors to the response, I would assess each factor’s distribution:
- I would apply transformations to make continuous variables more normally distributed, if they’re not already.
- I would check for possible outliers.
  - In the quality forensics context, I would raise outliers to the team for further consideration, and meanwhile I would Winsorize them to limit (but not expunge) their influence in models (a minimal sketch follows this list).
- I would assess categorical factors for degeneracy, such as a two-level factor where 95% of cases fall into one level.
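Here is the minimal sort of Winsorizing I have in mind; the 5th/95th-percentile cut points and the column name are illustrative choices only:
winsorize <- function(x, probs = c(0.05, 0.95)) {
  lims <- quantile(x, probs, na.rm = TRUE)
  pmin(pmax(x, lims[1]), lims[2])   # pull extreme values in to the cut points
}
dat$x1 <- winsorize(dat$x1)         # hypothetical column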
Then I would fit a model.
Fitting a model will have at least the same power of discovery as generating plots. Consider how MARS operates: it evaluates every variable to assess whether there is evidence of a linear relationship between the variable and the response, just as someone looking at plots would do. If a variable is categorical, it assesses evidence for a difference in response between levels, again just as someone looking at plots would do. It does this for every variable, in a split second. Furthermore, even if it finds one factor, it can still detect others, and it can then estimate how each contributes to the response, adjusting each for the effects of the others.
There is a risk of masking; if two variables are correlated, MARS will tend to pick the one that predicts best and ignore the other. It is best to generate a correlation matrix so one understands how factors are related to each other. This is true with the plotting approach also.
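For example (again assuming the hypothetical dat from earlier), a quick look at the predictor correlation matrix is a one-liner:
round(cor(dat[, 1:5]), 2)   # pairwise correlations among the candidate predictors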
In other words, the workflow is as follows:
- Assess the distributions of every predictor variable, as described above.
- Assess the distribution of the response as well, and transform if appropriate (but don’t do so to maximize correlation with predictors).
- Fit a model.
- Summarize the model and generate plots describing the relationships it has identified.
  - If the model has an interaction, the plot will represent a surface in 3 dimensions, something multiple univariate analyses will never uncover.
Note that the modeler does in fact generate plots relating predictor variables to the response. However, these plots are directed: we plot relationships that have already been discovered.
Plotting and multiple linear regression
“But,” you might ask, “I took a statistics course, and we studied multiple linear regression, and we were advised to plot all of the predictor variables against the response as a standard precursor to modeling. So you’re telling me now not to do such plotting?”
Multiple linear regression is foundational for statistical modeling; MARS, for example, wouldn’t exist without it. However, it is fraught with restrictive assumptions, and those assumptions require, in turn, knowledge of how to deal with violations. Hence I’ve adopted the radical opinion that linear regression should not be left to beginners.
Precursor plots help the analyst diagnose whether the linearity assumption is valid. Otherwise, it’s best to use plots to explore relationships among predictor variables, but to leave it to the model-fitting algorithm to assess the relationship to the response.
These precursor plots raise the risk of non-model overfitting: when you select a relationship by eye, you are actually doing some fitting, yet that fitting is not accounted for. For instance, if you examine a plot of \(X_1\) vs. \(Y\) and say, “That looks curved, I’ll include a quadratic term,” you’ve done some implicit modeling. Now your estimates are optimistic and your statistical inference is biased.
After all, I did name my statistical practice “Replicate!” and the precursor plots threaten the likelihood of replication.
Frankly, as a beginner, don’t do multiple linear regression. Use MARS, or mgcv. Counter-intuitively, flexible methods are safer.