Epoch-skipping: A Faster Method for Training Neural Networks (2024)


This is a snippet.

I’ve had this idea on my mind for about a year, but I haven’t found the time to work on it, and I don’t think I have the resources to make it successful. So I figured I’d post about it in case someone can make it a reality (and hopefully I can get a little credit for the idea).

The idea is called epoch-skipping. The main thought behind it is that neural networks are really good at making predictions. So, why go through the work of training a neural network when you could just get a neural network to predict what the right weights are?

When I shared this idea among my grad student friends, one of them mentioned this wonderful paper that attempted to do this, with decent results. The general conclusions of the paper are that 1. You can represent machine learning architectures in a way that allows one neural network to make predictions for the weights of many different types of architectures and 2. If you try to predict the final weights of a neural network, you’ll get a neural network that performs about 60% as well as the actual neural network would.

However, I thought this paper was missing a large section: how long does it take to train that predicted 60% neural network up to the fully-trained 100%? If it takes half the time, then this is a wonderful advancement! But if the 60% network’s weights sit in a local minimum such that training out of that minimum takes just as long as training the network to 100% from scratch, then this method seems relatively useless.

So my idea is that instead of trying to predict the final network’s weights from scratch, you should instead just try to predict what the weights will look like after one more epoch of training. That is, train a neural network on a sequence prediction problem of “given the weights of a neural network after a few epochs of training, what will the weights look like at the next epoch?”
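
To make that concrete, here is a rough sketch of the sequence-prediction setup, written in PyTorch. Everything here is illustrative and made up by way of example (the module name, the window size k, the MLP shape); it only works for a small, fixed-size target network, which is part of the problem discussed further down.

```python
import torch
import torch.nn as nn

class EpochPredictor(nn.Module):
    """Toy predictor: maps the flattened weights from the last k epochs
    to the flattened weights expected at the next epoch."""

    def __init__(self, num_params: int, k: int = 3, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_params * k, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_params),
        )

    def forward(self, past_weights: torch.Tensor) -> torch.Tensor:
        # past_weights: (batch, k, num_params)
        return self.net(past_weights.flatten(start_dim=1))

# Trained by ordinary regression on saved weight trajectories:
# x = weights at epochs [t-k+1, ..., t], y = weights at epoch t+1
# loss = nn.functional.mse_loss(predictor(x), y)
```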

Chances are, the results won’t be quite as good as if you had just actually trained the neural network on real data for that epoch. But running the neural network to make the prediction should be much faster than actually training the neural network. You can then alternate between predicting weights and actually training, essentially allowing you to skip epochs of training by running the predictor instead.

By training in-between skips, you allow the model to improve on the real data and fix issues that the prediction may have introduced. But by running the predictor, you also hopefully get performance gains faster than regular training would (because prediction is far faster than training).
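
The alternation itself could look something like the loop below. This is only a sketch of the control flow under my own assumptions: `train_one_epoch` stands in for whatever real training code you have, `predictor` is the kind of model sketched above, and alternating one real epoch with one skipped epoch is just one possible schedule.

```python
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def train_with_epoch_skipping(model, predictor, train_one_epoch, num_epochs, k=3):
    """Alternate real epochs with 'skipped' epochs in which the predictor
    overwrites the weights instead of running actual training."""
    history = [parameters_to_vector(model.parameters()).detach().clone()]
    for epoch in range(num_epochs):
        if epoch % 2 == 0 or len(history) < k:
            # Real epoch: update the model on actual data.
            train_one_epoch(model)
        else:
            # Skipped epoch: let the predictor guess the next weights.
            past = torch.stack(history[-k:]).unsqueeze(0)  # (1, k, num_params)
            with torch.no_grad():
                predicted = predictor(past).squeeze(0)
            vector_to_parameters(predicted, model.parameters())
        history.append(parameters_to_vector(model.parameters()).detach().clone())
    return model
```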

The rationale behind why you should do this at the epoch level rather than the batch level is that each example contains new information. If you try to get a prediction of how the network will look after the next batch, that prediction will have to be made only from data that doesn’t contain that batch. However, after training on a single epoch, all of the information in that epoch should have had some impact on how training went, and so the change in weights should store some information about every single example in the dataset. Thus, a really good prediction network should be able to make accurate predictions about what the network will look like after training for another epoch on the exact same data.

Now, the paper mentioned above uses a pretty complex mechanism for representing the architecture of a neural network, such that their prediction network can make predictions for many different types of architectures. However, I come from a natural language processing (NLP) background, and I think NLP could help with the encoding. The idea here is that there will likely be large advancements in neural network architectures, so if you come up with your own hard-coded way of representing architectures, your entire prediction model becomes useless as soon as a new architecture comes out, because it can’t represent it.

But, all neural network architectures can be represented by natural language, otherwise we wouldn’t be able to describe the architectures to each other. So, imagine that the parameter prediction model takes as its input a description of the neural network, such as “3 fully connected layers with 100 inputs, a hidden layer size of 300, and an output size of 3” followed by three sets of weights from 3 epochs of training. Then, the goal of the model is to output a new set of weights that should mimic what the weights would change to after a fourth epoch of training.
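
One hypothetical interface for that kind of model is sketched below. The text embedding is assumed to come from any off-the-shelf sentence encoder (for example, a sentence-transformers model); the module name, dimensions, and the commented-out usage lines are all invented for illustration, not something from the paper.

```python
import torch
import torch.nn as nn

class DescribedArchPredictor(nn.Module):
    """Toy predictor conditioned on an embedding of a natural-language
    description of the architecture, plus the last k weight snapshots."""

    def __init__(self, text_dim: int, num_params: int, k: int = 3, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + num_params * k, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_params),
        )

    def forward(self, text_embedding: torch.Tensor, past_weights: torch.Tensor) -> torch.Tensor:
        # text_embedding: (batch, text_dim), from any sentence encoder
        # past_weights:   (batch, k, num_params)
        x = torch.cat([text_embedding, past_weights.flatten(start_dim=1)], dim=1)
        return self.net(x)

# description = "3 fully connected layers with 100 inputs, a hidden layer size
#                of 300, and an output size of 3"
# text_embedding = sentence_encoder(description)  # e.g. a sentence-transformers model
```

Of course, the fixed-size layers in this sketch dodge the real difficulty: different architectures have different numbers of weights, which is exactly the representation problem the update below runs into.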

The issue with this is that the output may not be perfect: it may output the wrong number of weights, or the description may contain words the model never encountered during training. However, as word embeddings improve, even a new word’s embedding should carry enough information about its meaning to help the model make reasonably accurate predictions.

In short, by having the architectures be described by natural language, we future-proof the predictor network to allow it to make predictions for the weights of future neural networks that didn’t exist at the time of the predictor network’s training.

The main issue with this sort of prediction is collecting the training data. From my perspective, only companies like Google or OpenAI, which train many models, could gather the data required to build such a predictor network, and only if they saved the weights of their networks after each epoch. Part of the insight of the previously mentioned paper is that although we don’t have that kind of training data for many different architectures, we do have the final weights of most published neural networks, so the authors were able to write and submit their paper based on that data.

Personally, I’ve collected data from training some basic neural networks on MNIST, but haven’t yet built a predictor network. If you’re looking to do this, I’d love to hear from you just so I can get updates on the results, because I think this is an interesting idea.
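
For reference, the data collection itself is the easy part: train as usual and save a flattened copy of the weights after every epoch. Something like the following sketch with PyTorch and torchvision’s MNIST (the architecture, hyperparameters, and file name are arbitrary choices of mine):

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=128, shuffle=True,
)

snapshots = []  # one flattened weight vector per epoch; training data for the predictor
for epoch in range(10):
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    snapshots.append(parameters_to_vector(model.parameters()).detach().clone())

torch.save(torch.stack(snapshots), "mnist_weight_trajectory.pt")
```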

UPDATE 3/30/2024:

Over the summer of 2023, I spent some time working on this idea with an undergrad researcher. We were particularly excited about new research from the lead author of the previous paper, showing that it was possible to 1. represent different network architectures (without NLP), 2. predict reasonably-performing weights for those architectures, and 3. train the model to strong performance more quickly by starting from those predicted weights rather than training from scratch, thus answering my question above: “How long does it take to train that predicted 60% neural network up to the fully-trained 100%?”

Throughout the summer, we found that the main issue in trying to predict the weights of a model is that models have a lot of weights, especially in comparison to the size of their outputs. For example, the largest GPT-2 has 1.5 billion parameters just to produce a hidden state of about 1,600 numbers at each step. If we want a model to accurately predict each of those 1.5 billion numbers, we’re going to need a much bigger neural network.

So, we began looking into ways of representing the large numbers of parameters in a neural network using very few numbers. That is, trying to figure out a good way to encode the parameters of a neural network.

Overall, we were unsuccessful: most encodings we tried were not stable and lost too much information, so if you encoded the parameters and then decoded them again, the neural network would no longer function.

Looking back on this, now with more experience in the field, I recognize that our work is also closely tied to weight quantization and some forms of model distillation, neither of which we dove into enough over the summer.

In short, for this method to work, we need a good way of representing all the weights of a neural network with a small number of values, where changes to those few values result in reasonable changes to the weights.
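
To be concrete about what that means, the simplest version is an autoencoder over the flattened weight vector, with a round-trip check for exactly the failure mode described above. This is only an illustrative sketch under my own assumptions (the class name, code size, and the `eval_fn` callback are all made up), not the specific encodings we tried:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

class WeightAutoencoder(nn.Module):
    """Compress a flattened weight vector into a small code and back."""

    def __init__(self, num_params: int, code_dim: int = 64, hidden: int = 2048):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_params, hidden), nn.ReLU(), nn.Linear(hidden, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_params))

    def forward(self, flat_weights: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(flat_weights))

def roundtrip_accuracy(model, autoencoder, eval_fn):
    """The stability check: encode and decode the weights, load them back into
    the model, and measure whether it still functions (eval_fn returns accuracy)."""
    flat = parameters_to_vector(model.parameters()).detach()
    with torch.no_grad():
        reconstructed = autoencoder(flat.unsqueeze(0)).squeeze(0)
    vector_to_parameters(reconstructed, model.parameters())
    return eval_fn(model)
```

The catch is that the reconstruction has to be nearly lossless in function, not just in mean-squared error, which is where our attempts fell apart.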

It could be that this already exists, and we just didn’t find it, so if you’re aware of any methods for encoding neural network parameters, I would love to hear from you.

If you are interested in this idea and perhaps want to work on it more together, please reach out! I am happy to fill you in on the details of exactly what we tried, what worked, and what didn’t, and to see if we could work together to make neural network training faster.
