Logarithmic-time Schedules for Scaling Language Models with Momentum
Damien Ferbach, Courtney Paquette et al.
TLDR: ADANA, an optimizer whose hyperparameters follow time-varying schedules, improves training efficiency for large-scale language models by up to 40% compared to AdamW.
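
The summary above does not spell out ADANA's actual update rule, so the sketch below is only a rough illustration of the general idea named in the title: a momentum hyperparameter that varies on a logarithmic timescale rather than staying fixed. The schedule form `log_beta_schedule`, its constants, the heavy-ball update, and the toy objective are all assumptions for illustration, not the paper's method.

```python
# Illustrative sketch only: ADANA's real update is not reproduced here.
# This toy shows a momentum coefficient that drifts on a logarithmic
# timescale, so it adapts quickly early in training and slowly later.
import math


def log_beta_schedule(t: int, beta_min: float = 0.9, beta_max: float = 0.999) -> float:
    """Hypothetical momentum coefficient moving from beta_min toward
    beta_max on a logarithmic timescale."""
    # log1p(t) grows slowly, so beta changes fast at small t, slowly at large t.
    frac = math.log1p(t) / (1.0 + math.log1p(t))
    return beta_min + (beta_max - beta_min) * frac


def momentum_step(x: float, v: float, grad: float, t: int, lr: float = 0.1):
    """One heavy-ball step whose momentum coefficient follows the schedule."""
    beta = log_beta_schedule(t)
    v = beta * v + (1.0 - beta) * grad  # exponential moving average of gradients
    x = x - lr * v
    return x, v


# Toy usage: minimize f(x) = 0.5 * x^2, whose gradient is simply x.
x, v = 5.0, 0.0
for t in range(200):
    x, v = momentum_step(x, v, grad=x, t=t)
print(f"final x = {x:.4f}")  # |x| shrinks toward the minimizer at 0
```

In this sketch the schedule raises the momentum coefficient over time, averaging gradients over an ever-longer horizon; the specific logarithmic form here is an assumption chosen only to match the "logarithmic-time" terminology in the title.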