We Can Now Train Big Neural Networks on Small Devices

The gadgets around us are constantly learning about our lives. Smartwatches pick up on our cadences to track our health. Home speakers listen to our conversations to recognize our voices. Smartphones watch what we write to fix our idiosyncratic typos. We appreciate these conveniences, but the information we share with our gadgets isn’t always kept between us and our electronic minders. Machine learning can require heavy hardware, so “edge” devices like phones often send raw data to central servers, which then return trained algorithms. Some people would like that training to happen locally. A new AI training method expands the training capabilities of smaller devices, potentially helping to preserve privacy.

The most powerful machine-learning systems use neural networks, complex functions filled with tunable parameters. During training, a network receives an input, such as a set of pixels, generates an output, such as the label “cat,” compares its output with the correct answer, and adjusts its parameters to do better next time. To know how to tune each of those internal knobs, it needs to remember the effect of each one, and they regularly number in the millions or even billions. That requires a lot of memory. Training a neural network can require hundreds of times the memory of merely using one, because during use the network is allowed to forget what each layer did (its intermediate outputs, known as activations) as soon as it passes information to the next layer.
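To get a feel for that gap, here is a rough back-of-the-envelope sketch in Python. The layer count and sizes are invented for illustration, and the model is deliberately crude: it counts only the per-layer intermediate values and ignores the parameters, gradients, and optimizer state that real training also has to hold.

```python
# Illustrative only: the layer sizes below are made up, not measurements.

def inference_peak(activation_bytes):
    """Peak memory when each layer's output is dropped once the next layer has used it."""
    # At any moment only one layer's input and its output need to be alive.
    pairs = zip(activation_bytes, activation_bytes[1:])
    return max(a + b for a, b in pairs)

def training_peak(activation_bytes):
    """Peak memory when every layer's output is kept for the backward pass."""
    return sum(activation_bytes)

sizes = [4_000_000] * 50          # hypothetical: 50 layers, 4 MB of intermediate values each
print(inference_peak(sizes))      # 8000000
print(training_peak(sizes))       # 200000000
```

In this toy case, keeping everything costs 25 times as much memory as forgetting as you go, and the ratio grows with the depth of the network.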

To reduce the memory demands, researchers have used a few tricks. In one, called paging or offloading, the machine moves those activations from short-term memory to a slower but more abundant type of memory such as flash or an SD card, then brings them back when needed. In another, called rematerialization, it deletes the activations, then computes them again later. Previous memory-reduction systems used one of those two tricks or, according to Shishir Patil, a computer scientist at the University of California, Berkeley, and the lead author of the new work, combined them using “heuristics” that are “suboptimal,” often requiring a lot of energy. The new work formalizes the combination of paging and rematerialization.
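The two tricks can be sketched in a few lines of Python. This is a toy, not POET's code: simple arithmetic functions stand in for network layers, two dictionaries stand in for fast and slow memory, and the names `forward`, `fetch`, and the "keep"/"page"/"drop" labels are invented for the example.

```python
# Toy "layers": plain functions on numbers stand in for real network layers.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

def forward(x, plan):
    """Run the chain; plan[i] is 'keep', 'page', or 'drop' for layer i's output."""
    fast, slow = {-1: x}, {}          # index -1 holds the network input
    for i, layer in enumerate(layers):
        x = layer(x)
        if plan[i] == "keep":
            fast[i] = x               # stays in fast memory
        elif plan[i] == "page":
            slow[i] = x               # moved to slower, larger storage
        # 'drop': store nothing; the value is recomputed on demand
    return fast, slow

def fetch(i, fast, slow):
    """Return layer i's output for the backward pass, however it was handled."""
    if i in fast:
        return fast[i]
    if i in slow:
        return slow[i]                # paging: read the value back in
    # Rematerialization: recompute from the previous layer's output.
    return layers[i](fetch(i - 1, fast, slow))

fast, slow = forward(5, ["drop", "page", "drop", "keep"])
print([fetch(i, fast, slow) for i in range(len(layers))])  # [6, 12, 9, 81]
```

Whichever plan is chosen, the values that come back are the same as if everything had been kept. What changes is the cost: paging pays for data movement, and rematerialization pays for extra computation.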

“Taking these two techniques, combining them well into this optimization problem, and then solving it—that’s really nice,” says Jiasi Chen, a computer scientist at the University of California, Riverside, who works on edge computing but was not involved in the work.

Patil presented his system, called POET (Private Optimal Energy Training), in July, in Baltimore, at the International Conference on Machine Learning. He first gives POET a device’s technical details and the architecture of a neural network he wants it to train. He specifies a memory budget and a time budget. He then asks it to create a training process that minimizes energy usage. It might decide to page certain activations that would be inefficient to recompute, and rematerialize others that are simple to redo but require a lot of memory to store.

One of the keys was to define the problem as a mixed integer linear programming (MILP) puzzle, a set of constraints and relationships between variables. For each device and network architecture, POET plugs its variables into Patil’s hand-crafted MILP program, then finds the optimal solution. “The main challenge is actually formulating that problem in a nice way so that you can input it into a solver,” Chen says. “So you capture all of like the realistic system dynamics, like energy, latency, and memory.”
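The flavor of that puzzle can be shown with a much smaller stand-in. The sketch below is not POET's formulation, which models the whole training schedule. It makes one three-way choice per activation, uses invented cost numbers, and replaces the MILP solver with an exhaustive search, which is only practical for a handful of variables. The name `best_plan` and the cost table are assumptions made for the example.

```python
from itertools import product

# Hypothetical per-activation costs: (fast-memory KB, extra time ms, extra energy mJ).
COSTS = [
    {"keep": (400, 0, 0), "page": (0, 6, 9), "remat": (0, 2, 3)},   # big, cheap to redo
    {"keep": (100, 0, 0), "page": (0, 3, 4), "remat": (0, 9, 14)},  # small, costly to redo
    {"keep": (300, 0, 0), "page": (0, 5, 7), "remat": (0, 4, 6)},
]

def best_plan(costs, memory_budget, time_budget):
    """Return (energy, plan) with the least energy that fits both budgets, or None."""
    best = None
    for plan in product(("keep", "page", "remat"), repeat=len(costs)):
        # Total memory, time, and energy for this combination of choices.
        mem, extra_time, energy = (
            sum(c[choice][k] for c, choice in zip(costs, plan)) for k in range(3)
        )
        if mem <= memory_budget and extra_time <= time_budget:
            if best is None or energy < best[0]:
                best = (energy, plan)
    return best

print(best_plan(COSTS, memory_budget=50, time_budget=10))
# (13, ('remat', 'page', 'remat'))
```

With too little memory to keep anything, the cheapest plan that meets the time budget pages the activation that is costly to recompute and rematerializes the two that are cheap to redo. A real solver reaches the same kind of answer without trying every combination.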

The team tested POET on four different devices, ranging from the kind you might find in a fitness tracker to the kind you might find in a fancy smartphone. On each, they trained three different neural network architectures: two types popular in image recognition (VGG16 and ResNet-18) and a popular language-processing network (BERT). In many of the tests, POET could reduce memory usage by about 80 percent, without boosting energy use, while comparison methods couldn’t do both at the same time. Patil says that BERT can now work on the smallest devices, which was previously impossible.

“When we started off, POET was mostly a cute idea,” Patil says. Now, several companies have reached out about using it, and at least one large company has tried it in its smart speaker. One thing they like, Patil says, is that POET doesn’t reduce network precision by “quantizing,” or abbreviating, activations to save memory. So the teams that design networks don’t have to coordinate with teams that implement them in order to negotiate tradeoffs between precision and memory.

Patil notes other reasons to use POET besides privacy concerns. Some devices need to train networks locally because they have low or no internet connection, such as those on farms or in submarines. Others need to do so because data transmission requires too much energy. And then large devices—internet servers—might benefit from POET too when they train giant networks. But as for keeping data private, Patil says, “I guess this is very timely, right?”