The only reason you need: PCA is not a feature selection algorithm
Is the author misunderstanding something very basic or are they deliberately writing this way for clicks and attention? I can see that they have great credentials so probably the latter? It's a weird article.
It's very clear if you read the article that what the author is calling "feature selection" might be better termed "feature generation". He explicitly calls out what he means in the post:
> When used for feature selection, data scientists typically regard z^p := (z_1, …, z_p) as a feature vector that contains fewer and richer representations than the original input x for predicting a target y.
I don't even think this is necessarily incorrect terminology, especially given the author's background of working primarily at Google and the like. It's the difference between considering feature selection as "choosing from a list of the provided features" vs. "choosing from the set of all possible features". The author's term makes perfect sense given the latter.
PCA is used for this all the time in the field. There have been an astounding number of presentations I've seen where people start with PCA/SVD as the first round of feature transformation. I always ask "why are you doing that?" and the answer is always mumbling with shoulder shrugging.
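The pattern described above, PCA as a first-round transformation before a downstream model, typically looks something like the following sketch (dataset, component count, and classifier here are illustrative placeholders, not anyone's actual pipeline):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The common (and often unexamined) recipe: standardize, project onto the
# top principal components, then fit a classifier on the projected data.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),       # arbitrary cutoff, chosen without reference to y
    LogisticRegression(max_iter=1000),
)
pipe.fit(X, y)
print(pipe.score(X, y))
```

Note that nothing in the `PCA` step looks at `y`, which is exactly the point the article is making.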
This is a solid post and I find it odd that you try to dismiss it as either ignorant or click bait, when a quick skim of it dismisses both of these options.
The eigenvalues do give information about energy, so if your selection criterion is simply to pick the linear component with the most energy, you can use PCA to select that feature and extract it with the corresponding eigenvector. The MUSIC algorithm is a classic example.
Edit: having now actually read the article, the case I mention falls into the author's test of (do linear combinations of my features make sense and have as much of a relationship to the target as the features themselves).
so what's a good way/algorithm/strategy/method/technique to select features? (obviously asking for a friend who's not that familiar with this, whereas I've all the black belts in feature engineering!)
Feature selection ought to be model-specific. Because a feature wasn't selected by Lasso (in a linear model) does not mean it cannot be useful in a non-linear model.
I work with ML regularly, and there is always something new to learn!
Another commenter mentioned L1 regularization, which is useful for linear regression. You wouldn't use it for all classes of problems. L1 regularization has to do with minimizing error of absolute values, instead of squared errors or similar.
I'm probably nitpicking your language, but L1 regularization is precisely that: regularization. (See https://en.wikipedia.org/wiki/Regularization_(mathematics)#R....) In your typical linear regression setting, it does not replace the squared error loss but rather augments it. In regularized linear regression, for example, your loss function becomes a weighted sum of the usual squared error loss (aiming to minimize residuals/maximize model fit) and the norm of the vector of estimated coefficients (aiming to minimize model complexity).
Hey, I appreciate your correction! I wrote my comment late at night and definitely mangled the details. Your nice "nitpick" is a much-needed correction to my inaccuracy.
> L1 regularization has to do with minimizing error of absolute values
Not quite. It has to do with minimizing the sum of absolute values of the coefficients, not the error. The squared error is still the "fidelity" term in the cost function.
Lasso is probably the easiest way to do it relatively quickly.
Lasso is also known as L1 regularisation, and it tends to set the coefficients of a bunch of features to zero, hence performing feature selection.
Note that if two predictors are very correlated, lasso may pick one mostly at random. Obviously one should do CV and bootstrapping to ensure that the results are relatively stable.
In general though, there's no real substitute for domain expertise when it comes to selecting good features.
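A minimal sketch of lasso-based selection with scikit-learn (the synthetic data and the regularization strength `alpha` are illustrative choices; in practice you would tune `alpha` by cross-validation, e.g. with `LassoCV`):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 10))
# Only features 0 and 3 actually drive the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

# The L1 penalty drives most coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)
print(selected)  # typically keeps the informative features, zeroes most others
```

As the comment above notes, with correlated predictors the surviving subset can be unstable, so it is worth checking it across bootstrap resamples.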
> Sequential feature selection (backward and forward) is quite common
This is fine if you have test/validation sets, but never, ever report p-values on the result of such a selection process, as they are incredibly biased.
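For reference, a sketch of forward sequential selection with scikit-learn's `SequentialFeatureSelector` (synthetic data; the estimator and fold count are illustrative). The selection is scored on held-out folds, but, per the warning above, any p-values computed on the same data after selecting would still be biased:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
# Only features 1 and 4 carry signal.
y = 2.0 * X[:, 1] + X[:, 4] + rng.normal(scale=0.3, size=150)

# Greedy forward selection: add one feature at a time,
# keeping whichever addition improves cross-validated score the most.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X, y)
print(np.flatnonzero(sfs.get_support()))
```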
This is the basic promise of deep learning: with ResNet, U-Net, transformers, etc., you replace feature engineering with the model itself. Otherwise, feature engineering comes from domain expertise and human experience, or from trial and error starting with known features that work elsewhere.
Do you have an example? I've never seen this and would like to see what people are doing here. I teach a lot of newbies data mining, so I'm very interested in how people get this wrong.
But this is not so much "feature selection" as it is "compressing the data". It says right in the conclusion that the entire goal was "dimensionality reduction". In a very real way, PCA is selecting all features. That is, your data collection process remains unchanged. In a real feature selection, you would be able to say "ok, we don't need to collect X data anymore".
The point of the article, though, is that dimensionality reduction which minimizes information loss (PCA) isn't necessarily dimensionality reduction which minimizes signal loss.
A good example from the article is random features: random implies high information content but no signal value.
Feature selection is a process by which you drop features for different reasons. The main reason features tend to be dropped is because they are closely related to another feature, so you only need one of them. This makes algorithms train faster, reduces noise, and makes it easier to diagnose what the algorithm did.
PCA takes some N features and compresses them into N − n transformed features. This process also eliminates collinearity completely, as the resulting compressed features are mutually uncorrelated. However, calling PCA a feature selection algorithm is a bit untrue, because you have essentially selected none of your features; you have transformed them into something else entirely.
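The decorrelation claim is easy to check numerically (a sketch with numpy/scikit-learn; the data is synthetic, with two nearly collinear columns):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
a = rng.normal(size=300)
# Columns 0 and 1 are almost perfectly collinear; column 2 is independent.
X = np.column_stack([a, a + 0.05 * rng.normal(size=300), rng.normal(size=300)])

Z = PCA(n_components=2).fit_transform(X)

# The sample covariance of the principal-component scores is
# (numerically) diagonal: the components are uncorrelated.
cov = np.cov(Z, rowvar=False)
print(np.round(cov, 6))
```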
It doesn't seem like such a stretch to conceptualize that if PCA assigns a tiny weight to a variable (assuming all variables have been standardized to mean 0, std dev 1), then it is saying that the feature doesn't contribute much to the overall prediction, and is therefore "deselecting" it by merging it with several other variables it's correlated with and downweighting it relative to them.
Most successful techniques I see in deep nets take the incoming features and mux them into intermediate features which are the actual ones being learned. Feature selection and PCA are in a sense just built into the network.
Short answer: feature selection is one particular method of dimensionality reduction.
And most people when they say feature selection mean deliberate domain-driven selection of features.
That is to say, you can also create an entirely new synthesized feature from, say, 5 raw features and use it to replace those 5 features (this is ... PCA-esque.)
Or you could also use random forest techniques, for example, which randomly subsample the features considered by each individual decision tree in the forest.
I agree many people here are making mountains out of molehills of terminology.
Practically speaking I have never had a reason to do this with PCA, but I find this reaction very weird. I have encountered the concept of using PCA to reduce the dimensionality of your data before training many times, including in university. Cursory searches show plenty of discussion, questions on Cross Validated, etc.
It’s not, but the typical use for PCA as a dimensionality reduction step prior to a downstream classifier is to introduce a bottleneck to prevent overfitting. In some sense, that can be thought of as a means of avoiding specific feature selection by using an unsupervised method to discover the population covariance structure
In that PCA pre-processing step, nothing guarantees that principal components are better representations for your problem than the original inputs; in fact, PCA knows nothing about your target, so how could it guarantee that principal components are better representations for predicting it?
Similarly, understanding the covariance structure of your original inputs will not necessarily help you predict your target better.
Here's a simple example illustrating this. Take x, a single feature highly informative about y, and z, a (large) d-dimensional highly structured vector that is independent of both x and y. Now consider using [x, z] to predict y.
In this case, x happens to be a principal component, as x and z are independent; it is associated with eigenvector [1, 0, …, 0] and eigenvalue Var(x). All other eigenvectors are of the form [0, u_1, …, u_d], where [u_1, …, u_d] is an eigenvector of Cov(z).
All it would take for x to be the very last (i.e. 'least important') principal component is for Var(x) to be smaller than all eigenvalues of Cov(z), which is easily conceivable, irrespective of y! In your quest for a lower-dimensional 'bottleneck' using PCA you would end up removing x, the only useful feature for predicting y! This will certainly not reduce overfitting.
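The construction above is easy to reproduce (a sketch with numpy/scikit-learn; the dimensions and variances are illustrative choices that make Var(x) smaller than every eigenvalue of Cov(z)):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n, d = 500, 10

# x: low-variance but perfectly informative about y.
x = rng.normal(scale=0.1, size=n)
y = x
# z: high-variance noise, independent of x and y.
z = rng.normal(scale=5.0, size=(n, d))

X = np.column_stack([x, z])
# Keep all but the last principal component -- a seemingly mild bottleneck.
Xp = PCA(n_components=d).fit_transform(X)

# The dropped (smallest-eigenvalue) direction is essentially x itself:
# every retained component is nearly uncorrelated with x.
corrs = [abs(np.corrcoef(x, Xp[:, j])[0, 1]) for j in range(d)]
print(max(corrs))  # small: the only informative feature was projected away
```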
PCA and other autoencoders work well as pre-processing step when there are structural reasons to believe low energy loss (or low reconstruction error) coincides with low signal loss. In tabular data, this tends to be the exception, not the norm.
This is false, though: your top eigenvectors are going to have non-zero entries in all components unless your original data has some features with zero covariance, which is highly unlikely. You still need to collect all the original features; you have not selected features that can be removed.
That’s like saying that in a stacked model every layer is feature selection. That doesn’t really make sense. Feature selection is about which input data is used, not about how it is transformed during the whole prediction procedure.
PCA is a form of dimensionality reduction. It can be used to reduce your data to only a few features while preserving properties like the predictive quality of your model. Simple example: you give PCA 10 input features, it gives back the principal components, and you pick the first 2 principal components to use in your regression model (you still select how many components to use yourself; PCA doesn't do that for you).
Feature selection is well... selecting features. You start with your original 10 features and you pick which ones to use. You can do this based on domain knowledge, based on analysing variance, based on correlation between features, based on feature importance in some model, by adding/removing/shuffling features and seeing how your model performs etc.
>Feature selection is well... selecting features. You start with your original 10 features and you pick which ones to use.
That sounds a lot like another form of dimensionality reduction to me. The part people trip up on is that dimensions have a very specific definition in the world of statistics. All of these are basically reductionist approaches to tackle representing more complex systems by shedding information, they just go about it in different ways.
> but you can't say that PCA is feature selection.
One of the steps in PCA is to select the top N eigenvectors sorted by eigenvalues. So while it is not feature selection, it can be argued that it involves a feature selection step.
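That sort-and-keep step can be sketched directly with numpy (synthetic data; the per-column scaling just gives the features unequal variances):

```python
import numpy as np

rng = np.random.default_rng(4)
# Five features with deliberately unequal variances.
X = rng.normal(size=(100, 5)) * np.array([1.0, 3.0, 0.5, 2.0, 0.1])
Xc = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix, then the "selection" step:
# sort eigenvectors by eigenvalue, descending, and keep the top k.
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]
k = 2
top = evecs[:, order[:k]]  # selected *directions*, not original features
scores = Xc @ top          # the data projected onto those directions
print(scores.shape)
```

What gets "selected" here are directions in feature space, which is exactly why the thread distinguishes this from selecting original features.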