On the other hand, in the presence of the spurious feature, the full model can fit the training data perfectly with a smaller norm by assigning weight \(1\) to the feature \(s\) (\(\|\theta^{\text{-s}}\|_2^2 = 4\) while \(\|\theta^{\text{+s}}\|_2^2 + w^2 = 2 < 4\)).
Generally, in the overparameterized regime, since the number of training examples is less than the number of features, there are some directions of data variation that are not observed in the training data. In this example, we do not observe any information about the second and third features. However, the non-zero weight for the spurious feature leads to a different assumption for the unseen directions. In particular, the full model does not assign weight \(0\) to the unseen directions. Indeed, by substituting \(s\) with \({\beta^\star}^\top z\), we can view the full model as not using \(s\) but implicitly assigning weight \(\beta^\star_2=2\) to the second feature and \(\beta^\star_3=-2\) to the third feature (unseen directions at training).
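To make this concrete, here is a minimal numpy sketch of the example. The training point, \(\theta^\star\), and the first coordinate of \(\beta^\star\) are illustrative assumptions (the original example is not fully reproduced here), chosen so that the quantities quoted above come out the same: squared norm 4 without \(s\), squared norm 2 with weight \(w=1\) on \(s\), and implicit weights 2 and \(-2\) on the unseen second and third features after substituting \(s = {\beta^\star}^\top z\).

```python
import numpy as np

# Toy setup: 3 features, a single training point that only varies along the
# first feature, so the second and third features are unseen directions.
# theta_star, beta_star[0], and the training point are assumed for illustration;
# beta_star[1] = 2 and beta_star[2] = -2 are the values from the text.
theta_star = np.array([2.0, 0.0, -2.0])   # true target (agrees with beta_star on feature 3, not feature 2)
beta_star = np.array([1.0, 2.0, -2.0])    # spurious-feature parameter
Z = np.array([[1.0, 0.0, 0.0]])           # training inputs (seen direction = first feature)
y = Z @ theta_star                        # noiseless labels
s = Z @ beta_star                         # spurious feature evaluated on the training data

# Min-norm model without s: fit y using Z alone.
theta_minus_s, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(np.sum(theta_minus_s**2))           # 4.0

# Min-norm full model: fit y using [Z, s] jointly.
sol, *_ = np.linalg.lstsq(np.hstack([Z, s[:, None]]), y, rcond=None)
theta_plus_s, w = sol[:3], sol[3]
print(np.sum(theta_plus_s**2) + w**2, w)  # 2.0, w = 1.0

# Substituting s = beta_star^T z reveals the implicit weights on the unseen directions.
print(theta_plus_s + w * beta_star)       # [ 2.  2. -2.]
```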
In this example, removing \(s\) reduces the error for a test distribution with high deviations from zero on the second feature, whereas removing \(s\) increases the error for a test distribution with high deviations from zero on the third feature.
As we saw in the previous example, by using the spurious feature, the full model incorporates \(\beta^\star\) into its estimate. The true target parameter (\(\theta^\star\)) and the true spurious feature parameters (\(\beta^\star\)) agree on some of the unseen directions and do not agree on the others. Thus, depending on which unseen directions are weighted heavily in the test time, removing \(s\) can increase or decrease the error.
More formally, the weight assigned to the spurious feature is proportional to the projection of \(\theta^\star\) onto \(\beta^\star\) in the seen directions. If this number is close to the projection of \(\theta^\star\) onto \(\beta^\star\) in the unseen directions (in comparison to 0), removing \(s\) increases the error; otherwise, it decreases the error. Note that since we assume noiseless linear regression and choose models that fit the training data, the model predicts perfectly in the seen directions, and only variations in the unseen directions contribute to the error.
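Continuing with the same illustrative numbers, a quick check of this claim: the two test covariances below are assumed purely for illustration, one with high variance on the second feature and one with high variance on the third.

```python
import numpy as np

# Weight vectors from the sketch above (theta_star assumed as before).
theta_star = np.array([2.0, 0.0, -2.0])    # true target
theta_full = np.array([2.0, 2.0, -2.0])    # full model after substituting s = beta_star^T z
theta_minus_s = np.array([2.0, 0.0, 0.0])  # min-norm model without s

def test_error(theta_hat, Sigma):
    # Expected squared error for a zero-mean test distribution with covariance Sigma.
    d = theta_hat - theta_star
    return d @ Sigma @ d

Sigma_feat2 = np.diag([1.0, 10.0, 0.0])    # high deviations on the second feature
Sigma_feat3 = np.diag([1.0, 0.0, 10.0])    # high deviations on the third feature

for name, Sigma in [("feature 2", Sigma_feat2), ("feature 3", Sigma_feat3)]:
    print(name, "with s:", test_error(theta_full, Sigma),
          "without s:", test_error(theta_minus_s, Sigma))
# feature 2 with s: 40.0 without s: 0.0   -> removing s decreases the error
# feature 3 with s: 0.0  without s: 40.0  -> removing s increases the error
```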
(Left) The projection of \(\theta^\star\) onto \(\beta^\star\) is positive in the seen direction but negative in the unseen direction; therefore, removing \(s\) decreases the error. (Right) The projection of \(\theta^\star\) onto \(\beta^\star\) is similar in the seen and unseen directions; hence, removing \(s\) increases the error.
Let’s now formalize the conditions under which removing the spurious feature (\(s\)) increases or decreases the error. Let \(\Pi = Z^\top(ZZ^\top)^{-1}Z\) denote the projection onto the span of the training data (seen directions); \(I-\Pi\) is then the projection onto its null space (unseen directions). The equation below determines when removing the spurious feature decreases the error.
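One way to write this comparison, as a sketch under the setup above (with \(\Sigma\) the test-time covariance, \(\langle u, v\rangle_\Sigma = u^\top \Sigma v\), \(\|u\|_\Sigma^2 = u^\top \Sigma u\), and \(w\) the weight the full model assigns to \(s\), which is proportional to the seen-direction projection discussed above), is:

\[
\underbrace{\left|\, w \;-\; \frac{\langle (I-\Pi)\,\theta^\star,\ (I-\Pi)\,\beta^\star\rangle_{\Sigma}}{\|(I-\Pi)\,\beta^\star\|_{\Sigma}^{2}} \,\right|}_{\text{left side}}
\quad \text{vs.} \quad
\underbrace{\left|\, 0 \;-\; \frac{\langle (I-\Pi)\,\theta^\star,\ (I-\Pi)\,\beta^\star\rangle_{\Sigma}}{\|(I-\Pi)\,\beta^\star\|_{\Sigma}^{2}} \,\right|}_{\text{right side}}
\]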
The left side is the difference between the projection of \(\theta^\star\) onto \(\beta^\star\) in the seen directions and its projection in the unseen directions, scaled by the test-time covariance. The right side is the difference between 0 (i.e., not using the spurious feature) and the projection of \(\theta^\star\) onto \(\beta^\star\) in the unseen directions, scaled by the test-time covariance. Removing \(s\) helps if the left side is greater than the right side.
Since the theory applies only to linear models, we now show that in non-linear models trained on real-world datasets, removing a spurious feature reduces accuracy and affects groups disproportionately.