What is ensemble averaging?
In machine learning, particularly in the creation of artificial neural networks, ensemble averaging is the process of creating multiple models and combining them to produce the desired output, as opposed to creating just one model. Frequently an ensemble of models performs better than any individual model because the various errors of the models "average out."
Ensemble averaging is one of the simplest types of committee machines. Along with boosting, it is one of the two major types of static committee machines. In contrast to standard network design in which many networks are generated but only one is kept, ensemble averaging keeps the less satisfactory networks around, but with less weight. The theory of ensemble averaging relies on two properties of artificial neural networks:
- In any network, the bias can be reduced at the cost of increased variance
- In a group of networks, the variance can be reduced at no cost to bias
Ensemble averaging creates a group of networks, each with low bias and high variance, and then combines them into a new network with (hopefully) low bias and low variance. It thus offers a resolution of the bias-variance dilemma. The idea of combining experts has been traced back to Pierre-Simon Laplace.
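As a concrete illustration, here is a minimal sketch of the idea using scikit-learn. The dataset, network size, and number of members are illustrative assumptions, not anything prescribed by the technique; each member is a low-bias, high-variance network that differs only in its random initialization, and the ensemble prediction is the plain average of the member predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Hypothetical synthetic regression task; any dataset could be substituted.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train several low-bias, high-variance networks that differ only in random init.
members = [
    MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=seed).fit(X_train, y_train)
    for seed in range(5)
]

# Ensemble averaging: the combined prediction is the mean of the member predictions.
ensemble_pred = np.mean([m.predict(X_test) for m in members], axis=0)
```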
How is the ensemble average calculated?
Ensemble averages are essentially estimated by taking a random walk in configuration space. A Markov chain is constructed whose individual states are points in the usual configuration space of statistical mechanics.
Here is a summary of the steps in a Monte Carlo calculation for a system consisting of N atoms at temperature T:
- The initial coordinates (position vectors) of the N atoms are r_1^a, ..., r_N^a. Determine the potential energy
  V_a = \frac{1}{2} \sum_{i, j = 1,\ i \ne j}^{N} v(|r_i^a - r_j^a|)
  and the Boltzmann factor
  b_a = \exp\{-V_a / k_B T\}
  from these initial coordinates.
- Select a particle of the system (say, particle 1), either serially or at random, and draw random atomic displacements \xi_1, \eta_1, \zeta_1, each uniformly distributed over the interval (-\delta/2, \delta/2), where \delta is the maximum allowed displacement.
- Using the new configuration, the potential energy V_n is
  V_n = V_a - \sum_{j=2}^{N} \left[ v(r_{1j}^a) - v(r_{1j}^n) \right]
  with the corresponding Boltzmann factor b_n = \exp\{-V_n / k_B T\}.
- If \Delta V = V_n - V_a is negative, the old configuration is replaced by the new one. If \Delta V is positive, the move is accepted only with probability \exp\{-\Delta V / k_B T\}: the computer draws a uniformly distributed random number \eta from the interval (0, 1) and compares it with \exp\{-\Delta V / k_B T\}. If the exponential is larger, the new configuration and the new potential energy are accepted and become the current properties of the system; otherwise, the old configuration is counted again.
In a sufficiently long chain produced by these rules, configurations \Gamma appear with a probability proportional to \exp\{-V(\Gamma) / k_B T\}, as required in the canonical ensemble.
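The steps above amount to the standard Metropolis scheme. Below is a minimal Python sketch under simplifying assumptions: an illustrative soft-sphere pair potential, no periodic boundary conditions, and units with k_B = 1. All function and parameter names are ours, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def pair_potential(r):
    # Illustrative soft-sphere potential v(r) = (1/r)**12; any pair potential could be used.
    return (1.0 / r) ** 12

def energy_of_particle(positions, i):
    # Sum of pair interactions between particle i and every other particle.
    diffs = np.delete(positions, i, axis=0) - positions[i]
    dists = np.linalg.norm(diffs, axis=1)
    return np.sum(pair_potential(dists))

def metropolis_step(positions, beta, delta):
    """One Monte Carlo move following the steps above."""
    i = rng.integers(len(positions))                          # pick a particle at random
    old_e = energy_of_particle(positions, i)
    trial = positions.copy()
    trial[i] += rng.uniform(-delta / 2, delta / 2, size=3)    # displacements xi, eta, zeta
    new_e = energy_of_particle(trial, i)
    dV = new_e - old_e                                        # change in potential energy
    # Accept if dV < 0; otherwise accept with probability exp(-dV / kT).
    if dV < 0 or rng.random() < np.exp(-beta * dV):
        return trial
    return positions                                          # old configuration counted again

# Usage: N atoms at temperature T (k_B = 1 in these units).
N, T, delta = 32, 1.5, 0.2
positions = rng.uniform(0.0, 5.0, size=(N, 3))
for _ in range(1000):
    positions = metropolis_step(positions, beta=1.0 / T, delta=delta)
```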
How does ensemble averaging work?
An ensemble averaging model combines the predictions from each model equally and often performs better on average than any single model.
Sometimes there are very good models that we would like to contribute more to an ensemble prediction, and less skillful models that may still be useful but should contribute less. A weighted average ensemble is an approach that allows multiple models to contribute to a prediction in proportion to their trust or estimated performance.
What is Weighted Average Ensemble?
Model averaging is an ensemble learning approach in which each ensemble member contributes an equal amount to the final prediction.
In the case of regression, the ensemble prediction is calculated as the average of the member predictions. In the case of predicting a class label, the prediction is calculated as the mode of the member predictions. In the case of predicting a class probability, the prediction can be calculated as the argmax of the summed probabilities for each class label.
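As a sketch of these three combination rules, using small hypothetical arrays standing in for the member predictions:

```python
import numpy as np

# Regression: average the member predictions (rows = members, columns = samples).
reg_preds = np.array([[2.1, 3.0], [1.9, 3.4], [2.3, 2.9]])
reg_ensemble = reg_preds.mean(axis=0)                                # [2.1, 3.1]

# Class labels: take the mode (majority vote) of the member predictions per sample.
label_preds = np.array([[0, 1], [0, 2], [1, 1]])
label_ensemble = np.array([np.bincount(col).argmax() for col in label_preds.T])  # [0, 1]

# Class probabilities: sum the per-class probabilities, then take the argmax.
proba_preds = np.array([
    [[0.7, 0.3], [0.4, 0.6]],   # member 1: (samples, classes)
    [[0.6, 0.4], [0.2, 0.8]],   # member 2
])
proba_ensemble = proba_preds.sum(axis=0).argmax(axis=1)              # [0, 1]
```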
A limitation of this approach is that each model has an equal contribution to the final prediction made by the ensemble. The only requirement is that all ensemble members have some skill compared to random chance, even though some models are known to perform much better or much worse than others.
A weighted ensemble is an extension of a model averaging ensemble where the contribution of each member to the final prediction is weighted by the performance of the model.
The model weights are small positive values and the sum of all weights equals one, allowing the weights to indicate the percentage of trust or expected performance from each model.
Uniform values for the weights (e.g. 1/k, where k is the number of ensemble members) mean that the weighted ensemble acts as a simple averaging ensemble. There is no analytical solution for finding the weights (we cannot calculate them directly); instead, the values of the weights can be estimated using either the training dataset or a holdout validation dataset.
Finding the weights using the same training set used to fit the ensemble members will likely result in an overfit model. A more robust approach is to use a holdout validation dataset unseen by the ensemble members during training.
The simplest, perhaps most exhaustive, approach would be to grid search weight values between 0 and 1 for each ensemble member. Alternatively, an optimization procedure such as a linear solver or gradient descent optimization can be used to estimate the weights, using a unit norm weight constraint to ensure that the vector of weights sums to one.
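Here is a minimal sketch of a weighted average ensemble for regression, estimating the weights by a coarse grid search on holdout predictions. The three-member setup, grid step, and variable names are illustrative assumptions.

```python
import itertools
import numpy as np

def weighted_prediction(weights, member_preds):
    # member_preds: (n_members, n_samples); weights sum to one.
    return np.average(member_preds, axis=0, weights=weights)

def grid_search_weights(member_preds_val, y_val, step=0.1):
    """Exhaustively try weight vectors (summing to one) on holdout predictions."""
    n_members = member_preds_val.shape[0]
    grid = np.arange(0.0, 1.0 + step, step)
    best_w, best_err = None, np.inf
    for w in itertools.product(grid, repeat=n_members):
        if not np.isclose(sum(w), 1.0):          # keep only weights that sum to one
            continue
        err = np.mean((weighted_prediction(w, member_preds_val) - y_val) ** 2)
        if err < best_err:
            best_w, best_err = w, err
    return np.array(best_w)

# Hypothetical holdout predictions from three ensemble members, plus true targets.
val_preds = np.array([[1.0, 2.0, 3.0],
                      [1.2, 2.1, 2.7],
                      [0.5, 1.5, 2.5]])
y_val = np.array([1.1, 2.0, 2.9])
weights = grid_search_weights(val_preds, y_val)
```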
Unless the holdout validation dataset is large and representative, a weighted ensemble may overfit compared to a simple averaging ensemble.
A simple alternative that gives a model more weight without calculating explicit weight coefficients is to add that model to the ensemble more than once. Although less flexible, this allows a well-performing model to contribute more than once to a given prediction made by the ensemble.
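For example, with hypothetical per-model predictions, listing one model twice before plain averaging is equivalent to giving it twice the weight of the other member:

```python
import numpy as np

# Hypothetical predictions; model_a is trusted more, so it appears twice in the list.
model_a_pred = np.array([2.0, 3.0])
model_b_pred = np.array([1.0, 5.0])
ensemble_members = [model_a_pred, model_a_pred, model_b_pred]   # model_a counted twice

# Plain averaging over the duplicated list is equivalent to weights (2/3, 1/3).
ensemble_pred = np.mean(ensemble_members, axis=0)               # [1.667, 3.667]
```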