Bias Vs Variance in Supervised Learning

-Sridhar Gangavarapu

I spent quite a bit of time understanding the underlying concepts in (Linear and Polynomial) Regression models. The concepts of Bias and Variance threw me off several times before I could properly understand what they mean and why they matter in Machine Learning.

There are three types of errors in machine learning:

1) Training Error: How well the model fitted on the training data performs on that same training data. Overfitted models do well on training data but poorly elsewhere.

2) Test Error: How well the model fitted on the training data performs when predicting values on the test dataset. We assume the model never saw the contents of the test set (a small sketch contrasting training and test error follows this list).

3) Generalization Error: Generalization error compares the expected error of the model over the true input distribution with the empirical error measured on the data at hand. Here the probability distribution of the inputs is taken into consideration: for example, how likely is a house with 25 bathrooms, or one with only 10 sq ft? The probabilities of such inputs affect the empirical value calculations.
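
To make the first two errors concrete, here is a minimal sketch (my own example, assuming scikit-learn and NumPy, with made-up square-footage data, none of which appears in the original post) that fits a linear regression on a training split and reports the error on both the training and test data:

# Sketch: compare training error vs test error for a simple linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)           # hypothetical feature: square footage
price = 150 * sqft + rng.normal(0, 50_000, 200)   # hypothetical price with noise added

X = sqft.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))   # training error
test_error = mean_squared_error(y_test, model.predict(X_test))      # test error
print(f"Training MSE: {train_error:,.0f}")
print(f"Test MSE:     {test_error:,.0f}")

The training error usually looks a little better than the test error; a large gap between the two is the first warning sign of overfitting.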

Now let's look into the Bias vs Variance concepts:


A fitted model can be underfitted or overfitted, meaning it predicts the values of the training set either poorly or extremely well. There is an issue at both extremes. When the model is overfitted, it may fail to predict values in the test set or in other data it has not been trained on. For example, if a metropolitan area has the highest home prices in the state, an overfitted model trained only on that metropolitan data cannot predict home prices in a different county. This is Variance: variance is a measure of the model's sensitivity to small deviations/fluctuations in the training set.

Bias is a type of error the model introduces by making incorrect (overly simple) assumptions about the data. An underfitted model has high bias and therefore fails to capture the real relationship between the features in the training set.
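
A quick way to see both effects is to fit the same noisy data with a very simple and a very flexible model. This is only an illustrative sketch (assuming scikit-learn; the data and the polynomial degrees are my own choices, not from the post):

# Sketch: degree-1 polynomial underfits (high bias), degree-15 overfits (high variance).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)   # true curve plus noise
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)                  # noiseless "truth" for comparison

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x.reshape(-1, 1), y)
    train_mse = mean_squared_error(y, model.predict(x.reshape(-1, 1)))
    test_mse = mean_squared_error(y_test, model.predict(x_test.reshape(-1, 1)))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

The degree-1 model has similar (and high) error on both sets, which is the signature of high bias; the degree-15 model has near-zero training error but much worse test error, which is the signature of high variance.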

Bias and variance pull in opposite directions; as we saw, they are the outcomes of underfitting versus overfitting. The solution could be:

1) The data gathered should include a variety of sources.
2) Split the input dataset into Training, Validation, and Test sets.


So the model is trained on the training set and evaluated on the validation set, and the best algorithm is selected based on the lessons learned from validation. Only then is the model tested on the test set. This allows the model to be refined a few times before it is sent into production.
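
One common way to get that three-way split (a sketch assuming scikit-learn's train_test_split; the 60/20/20 proportions are just an example) is simply to split twice:

# Sketch: carve a dataset into training, validation, and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # placeholder features
y = np.arange(1000)                  # placeholder targets

# First carve off 20% as the final test set; the model never sees this.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remainder into training and validation (here 75% / 25%).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 600, 200, 200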

Noise: It is a well-known fact that the data used for analysis is noisy. This means there are external factors that influence the actual outcome. For example, if I urgently need a home in a good school district and am ready to pay a 25% premium over the asking price, that creates noise that has nothing to do with the house-size vs house-price relationship; the 25% in this case contributes to the variance of the noise. This is just one example, but there can be many such out-of-whack data points that skew the outcome: a personal relationship between owner and buyer, a bank liquidating foreclosed homes, etc.
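
As a toy illustration of that idea (the numbers are mine, not from any real data), a single sale at a 25% premium shows up as a residual that square footage alone cannot explain:

# Sketch: one "urgent buyer" sale appears as noise the feature cannot account for.
import numpy as np
from sklearn.linear_model import LinearRegression

sqft = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
price = 150 * sqft                      # "clean" prices driven purely by square footage
price[2] *= 1.25                        # one urgent buyer pays a 25% premium

model = LinearRegression().fit(sqft.reshape(-1, 1), price)
residuals = price - model.predict(sqft.reshape(-1, 1))
print(residuals.round(0))               # the premium sale stands out as a large residual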

Note: No model is perfect. Error and noise are part of every dataset. Keeping the error low increases the level of predictability. The number of features and their relationships may change over time, so models need to be retrained often to keep capturing the relationships between features. There may even be times when the algorithm itself is no longer suitable because of environmental changes.

For example, if the price of a house is strictly based on sq ft, # bedrooms, and # bathrooms, then linear regression is a good fit. But if demand and supply play a key part along with those features, a purely linear model may no longer make sense.

In those cases, a k-nearest-neighbors approach is better suited: if the K most similar homes in your neighborhood sold for an average of N dollars, then your home is worth around N.
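
Here is a minimal sketch of that idea (assuming scikit-learn's KNeighborsRegressor; the homes and prices are made up): the predicted price is simply the average price of the K most similar sales.

# Sketch: k-nearest-neighbors regression on a handful of hypothetical sales.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# hypothetical neighborhood sales: [sqft, bedrooms, bathrooms]
X = np.array([
    [1200, 2, 1],
    [1500, 3, 2],
    [1600, 3, 2],
    [2000, 4, 2],
    [2400, 4, 3],
])
y = np.array([220_000, 275_000, 290_000, 350_000, 410_000])

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
my_home = np.array([[1550, 3, 2]])
print(knn.predict(my_home))   # roughly the average of the 3 most similar sales

In practice you would scale the features first so that square footage does not dominate the distance calculation, but the idea stays the same.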

Thanks,
Sridhar.

