High-powered computers are back, but how do we deal with their thirst for energy?
In a presentation for a panel on teraflop computing at the Supercomputing conference in Reno at the end of the 1980s, Lawrence Livermore Laboratories researcher Eugene Brooks predicted an “attack of killer micros” would sweep aside machines like the fastest supercomputer of its day, the NEC SX-3.
With a peak throughput of 5.5 billion floating-point operations per second (Gflops), the core of the NEC SX-3 and other machines of its generation was a single vector processor, a unit designed specifically to handle the matrices and tensors innate to most scientific computing problems. Driven by a single stream of instructions, the vector processor seemed ideal because pretty much all matrix arithmetic can be broken down into operations on vectors of various lengths. The relative uniformity of matrix arithmetic means one instruction can fire off tens of calculations at once. The SX-3 could operate on vectors with as many as 64 elements at a time.
Though this was immensely powerful, such a machine depended on being fed problems built around large matrices. If your algorithm used lots of small matrices, you either had to come up with ways of combining the operations or leave much of the vector unit's power unused.
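The penalty for short vectors can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which every vector instruction occupies all 64 element slots whether or not they carry data; the function name vector_utilisation is hypothetical and nothing here models the SX-3's real pipeline.

```python
import math

VECTOR_LENGTH = 64  # maximum elements per vector instruction, as quoted for the SX-3


def vector_utilisation(n_elements: int, vector_length: int = VECTOR_LENGTH) -> float:
    """Fraction of vector slots doing useful work for an n-element operation.

    Simplified model (an assumption): every instruction occupies all
    `vector_length` slots, whether or not they hold real data.
    """
    if n_elements <= 0:
        raise ValueError("n_elements must be positive")
    instructions = math.ceil(n_elements / vector_length)
    return n_elements / (instructions * vector_length)


if __name__ == "__main__":
    for n in (4, 16, 64, 65, 1024):
        print(f"{n:5d} elements -> {vector_utilisation(n):.1%} of slots used")
```

In this model a 16-element operation uses a quarter of the slots, while a 65-element one needs two instructions and leaves almost half of them empty.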
In the meantime, chips such as Intel’s i860 were popping up. A single i860 was 70 times less powerful than the mighty SX-3. But if you had an application you could slice into modules that run in parallel, it became possible to get comparable performance from a hundred PC-style motherboards for roughly what it would cost to lease the SX-3 for less than half a year. The i860 itself failed to gain much traction because its spartan instruction set was fiddly to program at the assembly level. The compiler technology of the day was not up to the job of harnessing its potential and Microsoft abandoned an early plan to port Windows NT to it. But the design techniques in its high-throughput pipeline quickly moved into mainstream microprocessors that in turn continued the process of bulldozing traditional mainframes and supercomputers out of the way.
Based on off-the-shelf x86 processors, Intel’s ASCI Red soared past Hitachi’s traditional vector machine to take the top-supercomputer slot of 1997. By 1995, Cray Computer had filed for bankruptcy protection and mainframe maker IBM was trying to claw its way out of trouble after declaring what was then the biggest financial loss in US corporate history, having also fallen foul of the micro attack.
The late 1980s also saw the first wave of chips for artificial neural networks and they had their own size problem. Inova Microelectronics aimed to capitalise on the wave of enthusiasm for AI with what was then the largest processor ever made. It stretched over twice as much silicon area as one of the floating-point units that went into NEC’s supercomputers. But it was a short-lived behemoth. Neural network-based artificial intelligence (AI) slumped back into one of its periodic winters. And it would stay near frozen for close to a decade.
Today, AI is bigger than ever and now driving the design of a new generation of high-end computers that are seeing techniques from the 1980s getting their revenge. Big iron is back and in a big way. Where cloud computing once meant slapping hundreds of PC-style “blades” into a rack, the new generation of machines are not just taking on many of the attributes of mainframes and supercomputers but combining them. Even the vector units of old supercomputers have staged a comeback in both the Arm and RISC-V architectures, albeit with some changes to make them work better on DNNs.
The reason? Deep learning and the statistical processing needed for “big data” computing put enormous stress on parts of the system that have been ignored for years in the push to improve core processor performance. That focus translated into high energy costs that are quickly becoming excessive.
In the spring of 2019, Emma Strubell and two colleagues from the University of Massachusetts at Amherst calculated how much energy it takes to train some of the popular deep neural networks (DNNs). The situation is not quite as bad as the figures quoted in some recent keynote speeches. The popular claim speakers use is that training the kind of language model that powers Google’s predictive search emits the same quantity of carbon dioxide as the lifetime emissions from five family cars. In reality, this was the total estimated for the development of the complete model, including all the training runs performed during prototyping. The estimate for a single training run on the typical off-the-shelf graphics processing units (GPUs) installed in cloud data centres run by AWS or Microsoft Azure is several hundred times lower.
If you take a model such as BERT-large, which has 345 million trainable parameters, the carbon dioxide emissions for an 80-hour run on 64 GPUs worked out to be around 650kg, which according to researchers from the University of York is the same amount the average Briton is responsible for during three days of (normal) Christmas festivities or a drive from Boston to Vancouver in a Ford Focus.
Though the figures for a training run are not as eyewatering as some make out, deep learning is doing its best to achieve those numbers. BERT-large today really is not that big. OpenAI’s GPT-3 language model, launched in the summer, has 175 billion parameters. According to the team that created it, the model requires about 600 times more floating-point calculations to train than BERT-large. Were it run on the same hardware as that used for Strubell and colleagues’ paper, you genuinely would be looking at lifetime emissions for a family car.
Deep learning is not stopping at a couple of hundred billion parameters. At the company’s design summit in October, Steven Woo, Rambus fellow and distinguished inventor, claimed: “We are seeing trillion-parameter models on the horizon.”
There are good reasons for making models ever bigger and spending more time training them. “Deep learning has the property that if you feed it more data or make the model larger, the accuracy improves,” says Sidney Tsai, research staff member at IBM.
At the HotChips conference in the summer, Raja Koduri, chief architect and senior vice president of the discrete graphics division at Intel, said: “Intelligence requires a ton of computing. Advances in AI have resulted in a need for compute that is [growing] at a much steeper trajectory than Moore’s Law: a doubling every 3.4 months.” Graph 1 demonstrates this recent acceleration, plotting the number of days it takes to train the network given a computer that performs 1 petaflop/s.
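To put that doubling rate in context, the short sketch below compares it with a doubling every 24 months. The 24-month figure is an assumption, chosen as a commonly quoted reading of Moore's Law rather than anything Koduri stated, and growth_factor is a hypothetical helper.

```python
AI_DOUBLING_MONTHS = 3.4      # doubling period quoted by Koduri
MOORE_DOUBLING_MONTHS = 24.0  # assumption: a commonly quoted reading of Moore's Law


def growth_factor(months: float, doubling_period_months: float) -> float:
    """How many times larger a quantity becomes after `months` of steady doubling."""
    if doubling_period_months <= 0:
        raise ValueError("doubling period must be positive")
    return 2 ** (months / doubling_period_months)


if __name__ == "__main__":
    for years in (1, 2, 5):
        months = 12 * years
        ai = growth_factor(months, AI_DOUBLING_MONTHS)
        moore = growth_factor(months, MOORE_DOUBLING_MONTHS)
        print(f"{years} year(s): AI compute demand x{ai:,.0f}, 24-month doubling x{moore:.1f}")
```

On these assumptions, demand for compute grows by a factor of more than eleven in a year, while the 24-month curve delivers about 1.4.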
One claimed advantage of a model like GPT-3 is that once trained it can be fine-tuned and deployed on numerous applications with practically no additional compute-intensive training. Though training is not a one-and-done process, the number of machines needed for it is small relative to the quantity you need to run the models on real data. Inferencing takes far fewer operations to perform. If you take ResNet-50, now quite a small model at a mere 23 million parameters, getting it to classify an image needs about 4Gflops. That is only 20 per cent less than the SX-3’s peak throughput, but is easily deliverable for about 80mW by a £2000 Nvidia P4 card. You would need to run 250 million images to get to the number of calculations required to train it. With billions of social-media users aiming to do just that, the amount of computer hardware you need to support these data-hungry applications quickly adds up. And ResNet-50 deals with pretty small images: just 224 x 224 pixels.
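The arithmetic behind those inference and training figures can be checked in a few lines. This sketch uses only the numbers quoted above, treats the 4Gflops figure as an operation count per image, and borrows the 1 petaflop/s reference machine mentioned earlier purely as a yardstick.

```python
OPS_PER_IMAGE = 4 * 10**9        # about 4 billion operations per ResNet-50 classification, as quoted
IMAGES_TO_MATCH_TRAINING = 250 * 10**6  # images needed to equal the training workload, as quoted
REFERENCE_MACHINE_OPS_PER_S = 10**15    # a machine sustaining 1 petaflop/s

# Total operations implied for one training run.
training_ops = OPS_PER_IMAGE * IMAGES_TO_MATCH_TRAINING

# Time the reference machine would need for that many operations.
training_seconds = training_ops / REFERENCE_MACHINE_OPS_PER_S

if __name__ == "__main__":
    print(f"Implied training workload: {training_ops:.1e} operations")
    print(f"Time at 1 petaflop/s: {training_seconds:.0f} seconds")
```

On those figures, training comes to about 10^18 operations, or roughly 1,000 seconds on a machine that sustains 1 petaflop/s.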
This immense hunger for processing power would not be so bad if silicon scaling had continued as it did in the 1990s. But things changed after 2003, when power no longer ratcheted down alongside transistor size.
Speaking at his company’s autumn processor conference, Linley Gwennap, chief analyst and head of analyst firm The Linley Group, said: “We are not seeing a lot of improvement even as transistors are getting smaller. If we are not getting improvements from transistors we have to turn to architecture. We have seen a rise in recent years in the use of application-specific architectures.”
The first moves towards application-specific computing in servers came with the parallel arithmetic units, such as the Intel MMX, which were designed to boost the speed of video and audio encoding. GPUs for rendering 3D scenes on desktop PCs and games consoles pushed this further. Initially those rendering pipelines were hardwired but, realising games makers wanted more flexibility to try out different approaches, Nvidia and other suppliers made them programmable, a decision that proved instrumental in moving GPUs over to servers once deep learning became established. The one thing that 3D rendering and DNNs have in common is a focus on matrix arithmetic. Nvidia happily built ever bigger GPUs to satisfy the thirst for greater throughput.
Though GPUs successfully pushed regular general-purpose processors to one side in these data-centric tasks, they are now the targets of chip and add-in card vendors with more specialised accelerators because of their high energy consumption. “We have seen examples recently of custom chips completely optimised for specific algorithms,” Gwennap says.
A problem that’s common to GPUs and Intel’s Xeon or AMD’s Ryzen processors is the way software has to continually shuffle data around for each calculation. The actual multiplications and additions take up less than half of the total energy needed. Most of the rest comes from moving data along buses and in and out of the register files and caches that act as temporary storage.
Gwennap points to the use of designs like systolic arrays, which are commonly used in hardwired signal processors for radar, and which now form the basis of Google’s Tensor Processing Unit (TPU) among others. “Data shifts from one unit to another without having to go through a register file,” he explains.
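How data moves through such an array can be sketched in software. The following is a toy simulation, not a description of Google's TPU: it assumes an output-stationary arrangement in which each processing element keeps one running sum, hands its left-hand operand to the neighbour on its right and its top operand to the neighbour below. The function name and layout are illustrative choices.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m systolic array."""
    if not a or not b:
        raise ValueError("matrices must be non-empty")
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a) or any(len(row) != m for row in b):
        raise ValueError("matrix dimensions do not match")

    acc = [[0] * m for _ in range(n)]    # running sum held inside each element
    a_reg = [[0] * m for _ in range(n)]  # operand travelling left to right
    b_reg = [[0] * m for _ in range(n)]  # operand travelling top to bottom

    # Inputs are skewed: row i of `a` and column j of `b` enter i and j cycles late,
    # so matching operands meet at element (i, j) on the same cycle.
    for t in range(n + m + k - 2):
        # Visit elements from the far corner so each one reads its neighbour's
        # value from the previous cycle before that neighbour is updated.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    s = t - i
                    a_in = a[i][s] if 0 <= s < k else 0  # fed from the array edge
                else:
                    a_in = a_reg[i][j - 1]               # shifted in from the left
                if i == 0:
                    s = t - j
                    b_in = b[s][j] if 0 <= s < k else 0  # fed from the array edge
                else:
                    b_in = b_reg[i - 1][j]               # shifted in from above
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each operand is fetched from outside the array once, at an edge, and then reused by every element it passes through, which is where the saving over a register-file design comes from in this simplified picture.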
Vendors like Cerebras and Groq are pushing the dataflow idea further with their architectures. Like Inova decades ago, Cerebras has taken integration close to its limits with a processor array that takes up much of a silicon wafer, with the aim of raising the performance per watt of training. “We have built the largest chip in the world,” says Sean Lie, cofounder and chief hardware architect at Cerebras. “In that enormous size we can pack 1.2 trillion transistors with processors, all fed by 18GB of onchip memory.”
Groq focuses on inference with a smaller processor, but one that also deploys a 2D array of processors that can be formed on the fly into vectors far longer than the 64 elements offered by the NEC SX-3. “We chain operations so they don’t have to touch memory in between,” says Dennis Abts, chief architect at Groq.
The issue of moving data between registers and onchip memory is shadowed by a bigger problem, especially when it comes to working with the huge quantities of data needed by AI and data-analytics software. If moving data along onchip buses is a relatively heavyweight operation in terms of power, consider how that scales up when trying to read and write memory in external chips or boards. Reading a single data value from external DRAM mounted on a PCB needs at least ten times more power than a single calculation.
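A rough sketch shows why this matters. It assumes relative energy units in which one calculation costs 1 and one external DRAM read costs 10, the lower bound quoted above; on-chip data movement is ignored, and arithmetic_energy_share is a hypothetical helper rather than a measurement of any real chip.

```python
CALC_COST = 1.0        # relative energy of one calculation (assumed unit)
DRAM_READ_COST = 10.0  # relative energy of one external DRAM read: the "at least ten times" figure


def arithmetic_energy_share(calculations: int, dram_reads: int,
                            dram_read_cost: float = DRAM_READ_COST) -> float:
    """Share of total energy spent on arithmetic rather than on external memory reads."""
    if calculations <= 0:
        raise ValueError("calculations must be positive")
    if dram_reads < 0:
        raise ValueError("dram_reads cannot be negative")
    compute = calculations * CALC_COST
    movement = dram_reads * dram_read_cost
    return compute / (compute + movement)


if __name__ == "__main__":
    # One calculation that needs two fresh operands from DRAM.
    print(f"No reuse:   {arithmetic_energy_share(1, 2):.1%} of energy is arithmetic")
    # A hundred calculations that share the same two operands.
    print(f"With reuse: {arithmetic_energy_share(100, 2):.1%} of energy is arithmetic")
```

With two fresh DRAM reads per calculation, arithmetic accounts for less than 5 per cent of the energy in this model; if 100 calculations share the same two reads, the share rises above 80 per cent.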
There is a second problem that deep learning presents. Traditional software takes advantage of cache memories to hold data temporarily, which both speeds up operations and keeps power down, but big-data applications, true to the name, do not benefit nearly so much from caching. Each piece of data may only be needed once before being overwritten by a new calculation. Trying to cache any of that data consumes more power than trying just to pipeline it through an accelerator.
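The difference between the two access patterns shows up clearly in a toy cache simulation. This sketch assumes a least-recently-used replacement policy and a 64-entry cache, both chosen purely for illustration; real processor caches are organised differently.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity: int) -> float:
    """Fraction of accesses served from a least-recently-used cache of `capacity` entries."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    cache = OrderedDict()
    hits = 0
    total = 0
    for address in accesses:
        total += 1
        if address in cache:
            hits += 1
            cache.move_to_end(address)     # mark as most recently used
        else:
            cache[address] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / total if total else 0.0


if __name__ == "__main__":
    reuse = [i % 32 for i in range(3200)]  # loops over a small working set
    stream = range(3200)                   # every item is touched exactly once
    print(f"Small working set: {lru_hit_rate(reuse, 64):.1%} hits")
    print(f"Streaming data:    {lru_hit_rate(stream, 64):.1%} hits")
```

A working set that fits in the cache hits almost every time in this model, while a pure stream never hits at all, so the cache only adds overhead.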
Some of the new generation of processors try to deal with the problem by bringing data much closer, which cuts down on access power. One method is to put bulk memory on top of the processors or at least in the same package using techniques originally employed to make smartphone processors and the all-in-one module for the Arm-based M1 in the latest crop of Apple Macintosh computers.
Lie says stacking memories on top of the waferscale processor is “a logical next step beyond what we’ve shown today, it’s something we are looking into”. But he notes that though the SRAM in the current generation is less dense compared to DRAM, the amount the company can pack in for the current generation is enough to cope with most AI models.
There is a question as to how far brute-force scaling can go before the law of diminishing returns takes hold. One of the big trends in inferencing that is helping to limit the amount of additional hardware that needs to go into a system is the use of pruning and similar techniques to try to remove unnecessary calculations. Pruning inspects the neural connections to find those that have no practical effect on the outcome for any data and removes them. That can be done in compilation, but a new wave of architectures tries to work out on the fly, for a given image or piece of text, which calculations can be dropped. This focus on only processing as much data as necessary drove the design of Tenstorrent’s architecture.
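Compile-time pruning can be sketched with a simple stand-in. The example below uses magnitude thresholding, which treats very small weights as having no practical effect; that is a common proxy and an assumption here, not the method used by any vendor named in this article. The function names, weights and threshold are all illustrative.

```python
def prune_by_magnitude(weights, threshold: float):
    """Zero out connections whose weight is smaller in magnitude than `threshold`."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def sparse_matvec(weights, x):
    """Multiply a weight matrix by a vector, skipping pruned (zero) connections.

    Returns the output vector and the number of multiplications actually performed.
    """
    outputs = []
    multiplications = 0
    for row in weights:
        total = 0.0
        for w, xi in zip(row, x):
            if w != 0.0:
                total += w * xi
                multiplications += 1
        outputs.append(total)
    return outputs, multiplications


if __name__ == "__main__":
    layer = [[0.9, 0.001, -0.5],
             [0.0005, 0.7, -0.002]]
    inputs = [1.0, 2.0, 3.0]
    _, before = sparse_matvec(layer, inputs)
    _, after = sparse_matvec(prune_by_magnitude(layer, 0.01), inputs)
    print(f"Multiplications before pruning: {before}, after: {after}")
```

Hardware only benefits if it can actually skip the zeroed connections, which is one reason the newer architectures build that ability into the processor.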
Ljubisa Bajic, CEO and lead architect at Tenstorrent, points to extensive redundancies in BERT as an example. “Instead of running an input through the whole network you cut in early where you have a high probability of getting an answer,” he says. Using ‘early exit’ improved throughput five-fold in initial experiments, with further tuning used to inform the design of the Jawbridge and Grayskull chips so that they prune calculations on the fly more readily than conventional processors, which tend to favour large, fixed batches of work. “As you open up the model you go from being ten times better to being a hundred times better,” he adds.
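The early-exit idea can be sketched in a few lines. This is a generic illustration, not Tenstorrent's implementation: it assumes each layer is followed by a small classifier that reports a label and a confidence, and that processing stops once the confidence passes a threshold. The toy layers, the toy classifier and the 0.9 threshold are all hypothetical.

```python
def run_with_early_exit(x, layers, exit_heads, threshold: float = 0.9):
    """Run `x` through the layers, stopping once an exit head is confident enough.

    Returns (label, confidence, layers_used).
    """
    if not layers or len(layers) != len(exit_heads):
        raise ValueError("need one exit head per layer and at least one layer")
    state = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        state = layer(state)
        label, confidence = head(state)
        if confidence >= threshold:
            break  # confident enough: skip the remaining layers
    return label, confidence, depth


def toy_layer(state: float) -> float:
    """Stand-in for a network layer: nudges the state by a fixed amount."""
    return state + 1.0


def toy_head(state: float):
    """Stand-in for an exit classifier: confidence grows with the size of the state."""
    confidence = min(1.0, abs(state) / 4.0)
    return ("positive" if state > 0 else "negative"), confidence


LAYERS = [toy_layer] * 4
HEADS = [toy_head] * 4

if __name__ == "__main__":
    for sample in (3.0, 0.5):
        label, confidence, used = run_with_early_exit(sample, LAYERS, HEADS)
        print(f"input {sample}: {label} ({confidence:.2f}) after {used} of {len(LAYERS)} layers")
```

Easy inputs leave after the first layer in this toy model while harder ones run the full depth, so the amount of work varies from one input to the next, which is awkward for hardware built around large, fixed batches.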
The growth spurt that has characterised AI over the past couple of years may turn out to be just that and the industry will settle into a more sedate pace. Even in that case, the pressure on energy consumption looks likely to continue and drive the industry to greater use of accelerators and densely packaged memories for even more muscular servers that are very different to the racks full of PC-class motherboards that used to be the norm.
In the big-iron machines of the past, how to move data in and out of the system got a lot of attention. Built using high-speed custom logic, the central processor was a precious resource that was wasted on jobs such as pulling data off a disk drive. So, mainframes and machines like them had I/O channel processors arranged like a tree that would have the job of organising the information that needed to stream through the machine at speed. Data-centric computing means the circle has turned once more, though with a few modifications for the world of cloud computing.
The big difference today is that there is not really just one server with a lot of peripherals. Instead, there are now separate machines that talk to each other over a network organised into microservices, each of which may only be running for a small fraction of time before being reallocated to another job. That keeps the overall data centre running more efficiently although it puts a lot of pressure on the internal network. So-called east-west traffic – the data flowing between server nodes – is now far more intense than north-south, which is the data moving in and out of the data centre.
To make the process more efficient, the nodes compress and filter data on the fly so that it consumes less network bandwidth. They also handle the task of reorganising and preprocessing data so that it is in a form that is more easily handled by machine-learning and analytics software running on the core processors and accelerators. This has led to the creation of a new category of processor: the data processing unit (DPU). Nvidia bought its way into this market with the acquisition of Mellanox, and startups such as Fungible have emerged over the past few years aiming to capitalise on this distribution of workloads.
Even bulk-storage drives are getting their own processors. As with the memory wall for central processors, it makes sense to keep data inside the drive as much as possible and only move it outside where absolutely necessary. Jobs such as searching for data can be distributed across multiple intelligent drives and run in parallel, only sending the results back to a central processor rather than gigabytes of raw data.
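The pattern can be sketched as follows. SmartDrive and distributed_search are hypothetical names used only for illustration and do not correspond to any vendor's interface; threads stand in for the processors inside each drive.

```python
from concurrent.futures import ThreadPoolExecutor


class SmartDrive:
    """Toy model of a drive that can filter its own records before returning them."""

    def __init__(self, records):
        self.records = list(records)

    def search(self, predicate):
        # Runs "inside the drive": only matching records leave it.
        return [record for record in self.records if predicate(record)]


def distributed_search(drives, predicate):
    """Run the same search on every drive in parallel and merge the results."""
    if not drives:
        return []
    with ThreadPoolExecutor(max_workers=len(drives)) as pool:
        partial_results = pool.map(lambda drive: drive.search(predicate), drives)
        return [record for part in partial_results for record in part]


if __name__ == "__main__":
    drives = [SmartDrive(range(0, 1000)), SmartDrive(range(1000, 2000))]
    matches = distributed_search(drives, lambda record: record % 500 == 0)
    stored = sum(len(drive.records) for drive in drives)
    print(f"{len(matches)} of {stored} records crossed the bus: {matches}")
```

The host sees only the handful of matching records; the rest never leave the drives.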
As well as DPUs, custom processors are moving into this space. In November, Samsung and Xilinx teamed up to launch a smart flash drive that uses a programmable-logic device to handle specific tasks as well as handling applications built by third parties. Bigstream, for example, developed a module that can be loaded into the Xilinx device inside the SmartSSD to filter the data on a drive locally on behalf of applications written for Apache Spark, a widely used tool for data analytics.
Over time, more of the AI and analytics seem likely to devolve down to these I/O machines, only passing the most vital nuggets of data up to the main processors where the biggest models get to work on them.