Techniques such as ensembling and distillation promise model quality improvements
when paired with almost any base model.
However, because ensembles increase test-time cost and distillation complicates the training pipeline, these techniques are challenging to use in industrial settings.
In this paper, Google researchers explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters.
Their first claim is that online distillation enables the use of extra parallelism to fit very large datasets about twice as fast.
Crucially, training can still be sped up even after reaching the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent.
Two neural networks trained on disjoint subsets of the data can share knowledge
by encouraging each model to agree with the predictions the other model would
have made.
These predictions can come from a stale version of the other model, so they can be safely computed using weights that only rarely get transmitted.
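
The agreement idea described above can be sketched as a per-example loss. This is a minimal illustration, not the authors' implementation: the function names, the `distill_weight` parameter, and the choice of cross-entropy as the agreement term are all assumptions made for the example. The peer's logits are assumed to come from a stale copy of its weights and are treated as constants.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    # H(target, pred) = -sum_i target_i * log(pred_i)
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))

def codistillation_loss(own_logits, label, stale_peer_logits, distill_weight=1.0):
    # Hypothetical helper: ordinary label loss plus a term that rewards
    # agreement with the (stale) peer model's predicted distribution.
    own_probs = softmax(own_logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(len(own_logits))]
    label_loss = cross_entropy(one_hot, own_probs)
    # Peer predictions act as fixed soft targets; no gradient would flow to the peer.
    peer_probs = softmax(stale_peer_logits)
    agreement_loss = cross_entropy(peer_probs, own_probs)
    return label_loss + distill_weight * agreement_loss
```

In this sketch, setting `distill_weight` to 0 recovers plain supervised training, and a peer that disagrees with the model yields a larger loss than one that agrees. How often the stale peer weights are refreshed is left outside the sketch.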
The second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible.