## Neural networks: the bias-variance dilemma

Not only mixture models, but also a wide variety of other classical statistical models for density estimation are representable as simple networks with one or more layers of adaptive weights. For instance, the standard Bayes rule for the posterior probability of one of two classes can be rewritten as a logistic function of the log-odds.
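The conversion of Bayes' rule into a logistic function can be sketched as follows for the two-class case. The notation ($C_1$, $C_2$ for the classes, $a$ for the log-odds) is assumed here for illustration and is not taken from the text.

```latex
\begin{align}
P(C_1 \mid x)
  &= \frac{p(x \mid C_1)\,P(C_1)}{p(x \mid C_1)\,P(C_1) + p(x \mid C_2)\,P(C_2)} \\
  &= \frac{1}{1 + \dfrac{p(x \mid C_2)\,P(C_2)}{p(x \mid C_1)\,P(C_1)}} \\
  &= \frac{1}{1 + \exp(-a)}, \qquad
  a = \ln \frac{p(x \mid C_1)\,P(C_1)}{p(x \mid C_2)\,P(C_2)}.
\end{align}
```

When the class-conditional densities are Gaussian with a shared covariance matrix, the quadratic terms in $a$ cancel and $a$ is linear in $x$, so the posterior takes the form of a single-layer network with a logistic output.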

To achieve good generalization it is important to have more data points than adaptive parameters in the model.

It has been demonstrated that multilayer perceptron (MLP) models of this form (with one hidden layer) can approximate any continuous function defined on a compact domain to arbitrary accuracy, provided the number $M$ of hidden units is sufficiently large.
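As a minimal sketch of such a model, the following assumes `tanh` hidden units and linear outputs (the text does not fix the activation function). The helper names `init_mlp`, `mlp_forward`, and `n_parameters` are hypothetical. Counting the adaptive parameters makes it easy to compare model size with the number of data points.

```python
import numpy as np


def init_mlp(d, M, c, rng):
    """Randomly initialise a one-hidden-layer MLP: d inputs, M hidden units, c outputs."""
    W1 = rng.normal(0.0, 1.0, size=(d, M))  # input-to-hidden weights
    b1 = np.zeros(M)                        # hidden biases
    W2 = rng.normal(0.0, 1.0, size=(M, c))  # hidden-to-output weights
    b2 = np.zeros(c)                        # output biases
    return W1, b1, W2, b2


def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass for inputs x of shape (N, d); returns outputs of shape (N, c)."""
    z = np.tanh(x @ W1 + b1)  # hidden activations (tanh is an assumption)
    return z @ W2 + b2        # linear output units


def n_parameters(d, M, c):
    """Total number of adaptive weights and biases in the network."""
    return (d + 1) * M + (M + 1) * c
```

For example, a network with one input, five hidden units, and one output has `n_parameters(1, 5, 1)` adaptive parameters, which grows linearly with $M$.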

The linear, logistic, and softmax functions are (inverse) canonical links for the Gaussian, Bernoulli, and multinomial distributions, respectively.
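The logistic and softmax output functions can be written in a few lines. This is an illustrative sketch only; the function names are chosen here and are not from the text. It also shows that, for two classes, the softmax reduces to the logistic function of the difference of the two activations.

```python
import numpy as np


def logistic(a):
    """Logistic sigmoid: maps an activation to a Bernoulli mean in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))


def softmax(a):
    """Softmax: maps a vector of activations to multinomial probabilities."""
    a = np.asarray(a, dtype=float)
    # Subtract the maximum for numerical stability; this leaves the result unchanged.
    z = np.exp(a - np.max(a, axis=-1, keepdims=True))
    return z / np.sum(z, axis=-1, keepdims=True)


# With two classes, softmax([a1, a2])[0] equals logistic(a1 - a2).
a1, a2 = 0.7, -1.2
two_class = softmax([a1, a2])
```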

A variety of such pruning algorithms are available [cf. Bishop, 1995].

Some theoretical insight into the problem of overfitting can be obtained by decomposing the expected squared error into the sum of a (squared) bias term and a variance term. A model that is too inflexible is unable to represent the true structure in the underlying density function, and this gives rise to a high bias. Conversely, a model that is too flexible becomes tuned to the specific details of the particular data set and gives a high variance. The best generalization is obtained from the optimum trade-off of bias against variance.
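The trade-off can be illustrated numerically. The sketch below is not taken from the text: it assumes a sinusoidal target function, Gaussian noise, and polynomial models whose degree controls flexibility. It estimates the squared bias and the variance by refitting each model to many independently generated data sets.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_function(x):
    """Assumed target function for the illustration."""
    return np.sin(2.0 * np.pi * x)


def bias_variance(degree, n_points=25, n_datasets=200, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_points)
    x_test = np.linspace(0.0, 1.0, 101)
    predictions = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        # Each data set shares the inputs but has fresh noise on the targets.
        t = true_function(x_train) + rng.normal(0.0, noise_sd, n_points)
        fit = Polynomial.fit(x_train, t, degree)
        predictions[i] = fit(x_test)
    mean_prediction = predictions.mean(axis=0)
    # Squared bias: error of the average prediction relative to the target.
    bias2 = np.mean((mean_prediction - true_function(x_test)) ** 2)
    # Variance: spread of the predictions around their own average.
    variance = np.mean(predictions.var(axis=0))
    return bias2, variance


for degree in (1, 3, 9):
    b2, var = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

Under these assumptions the straight-line model shows the larger bias and the degree-9 polynomial the larger variance, which is the pattern described above.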

