Software needs a rethink for future chips, Intel says
Chip giant says Moore’s Law is alive and well, but the way software runs will have to change to take advantage of it.
In his keynote at the Hot Chips conference, held online this week, Raja Koduri, Intel’s chief architect and head of the graphics division, showed off the company’s snappy slogan for the future of computer architecture: “No transistor left behind.” But software is going to have to do more to make that happen.
Apparently, the slogan emerged at the chipmaking giant two years ago, during a meeting where the company’s architects tried to work out how to get past the apparent roadblocks in Moore’s Law and computer design, now that neither is advancing as easily as it used to. Despite its problems getting process-technology advances out on schedule, Intel is keen to stress that Moore’s Law is not dead, and that recent claims to the contrary are as off-target as similar predictions were two and three decades ago, when technologies such as lithography looked to be at the end of their rope.
Though it is not dead, Moore’s Law has started to smell a bit funny. Straightforward scaling of transistor size and interconnect spacing disappeared from the menu some years back. Increasingly, foundries and designers have worked together to achieve effective scaling through alternative techniques, such as changes to how devices are connected across the ten or more metal layers available for routing. This led Philip Wong and colleagues to look for ways of measuring progress on density different from those used over the past few decades. Ironically, some of these metrics hark back to those used by Gordon Moore in his 1975 speech, in which he predicted that chip circuit density would most likely double every two years. Transistor-size reductions accounted for only part of the expected improvement, with smart circuit layout and bigger chips making up the difference.
In practice, although chip sizes grew steadily into the 1980s, which is where Moore’s 1975 projection stopped, getting 100 per cent functional circuitry onto larger chips proved to be a major problem. The economic sweet spot stuck at around a square centimetre for decades, though high-end processors pushed as far as the reticle on a lithographic stepper: the biggest area the machine can expose on a die at one time. That works out to a bit under nine square centimetres.
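Assuming the 26mm × 33mm maximum exposure field of current scanners – the figure that estimate appears to rest on – the arithmetic is:

$$26\,\mathrm{mm} \times 33\,\mathrm{mm} = 858\,\mathrm{mm}^2 \approx 8.6\,\mathrm{cm}^2$$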
Today, the reticle itself is no longer a limit. Though this is not the first time companies have tried wafer-scale integration, a combination of money flowing into AI research and improvements in on-chip redundancy techniques has brought the concept back. Cerebras Systems, for example, worked out a way to connect the circuitry in each reticle to its neighbours and found a willing partner in TSMC to produce the resulting module. Various forms of 3D integration are expanding the overall die area with stacks and arrays of chips on reticle-sized interposers and larger plastic packages. As a result, chips in mobile systems and in high-end computers are getting much, much larger.
Though 3D stacking is not without its problems, a general consensus is forming in the industry that it represents the way forward. One big advantage is that it allows different process technologies to be mixed more cost-effectively. Power and analogue circuits make poor use of the highly advanced technologies used for digital processors, and memory technology is pursuing different paths from those taken for the sub-20nm nodes aimed at logic circuits. Intel and its competitors expect to pursue this approach.
“It will allow us to fundamentally change our approach to design,” Koduri claimed: an approach that, by not trying to pack everything onto monolithic chips, should help keep the company in line with Moore’s Law.
The issue is finding ways to use all that hardware capacity productively. A spinoff of the mix-and-match-in-a-stack approach to chip production will likely be changes to the way software is put together. “The primary contract between hardware and software that we are used to is being overwhelmed,” Koduri argued.
In the past, software developers were happy to take the improvements in cycle time and core count to drive the amount of work computers could do. The problem now is that computer architectures are splintering. General-purpose processors are giving way to accelerators with specific functions, many of them intended for matrix-heavy code for physics and AI. Over time, architects expect many other parts of the software base to be accelerated, from accessing files to crunching through network packets. But compiler writers find it hard to keep up with the growing use of heterogeneous architectures.
“There are enormous performance gains to be had from heterogeneity,” Koduri said. “We continue to add more heterogeneity for math in the CPU and in doing so we have learned a lot about programming abstractions. But we estimate it takes three to five years to gain broad adoption of new heterogeneous extensions.”
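To see what one of those heterogeneous extensions for math looks like at the code level, here is a vector addition written directly against Intel’s AVX-512 intrinsics – our illustration, not Intel’s. Code like this is fast, but it is welded to one instruction-set extension, which is exactly the adoption problem Koduri describes:

    #include <cstddef>
    #include <immintrin.h>  // AVX-512 intrinsics (compile with -mavx512f)

    // Adds two float arrays 16 lanes at a time using 512-bit vector registers.
    // Illustrative sketch: assumes n is a multiple of 16 and the CPU has AVX-512F.
    void add_f32(const float* a, const float* b, float* out, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);  // unaligned 16-float load
            __m512 vb = _mm512_loadu_ps(b + i);
            _mm512_storeu_ps(out + i, _mm512_add_ps(va, vb));  // lane-wise add, store
        }
    }

Retargeting that routine for a GPU, an FPGA or the next ISA extension means starting over, which is why the three-to-five-year adoption lag Koduri cites matters.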
Some of the ground is being made up through libraries of pre-packaged operations, but this is not enough, Koduri said. “We need productivity at all levels and in all languages and there need to be abstractions at multiple layers.”
Intel’s own response is what it calls “oneAPI”, which despite its name packages many programming interfaces into an agglomeration that supports the range of architectures Intel sells, including programmable-logic devices and PC processors. The reason for the breadth is that, despite developers’ best efforts to abstract away details, abstraction rarely holds up in practice: optimising code for speed or memory size invariably involves reaching down into lower layers to make changes. Koduri said the neat layers of abstraction shown in architectural diagrams are more like Swiss cheese.
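For flavour, this is roughly what the portable layer looks like: a minimal sketch of the same vector addition written in SYCL, the Khronos C++ model on which oneAPI’s DPC++ compiler is built. The kernel is our illustration, not Intel’s, and the runtime, not the programmer, picks the device it runs on:

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        sycl::queue q;  // runtime selects a default device: CPU, GPU, FPGA...
        std::vector<float> a(1024, 1.0f), b(1024, 2.0f), out(1024);
        {
            // Buffers hand ownership of the host data to the SYCL runtime.
            sycl::buffer<float> ba(a.data(), sycl::range<1>(a.size()));
            sycl::buffer<float> bb(b.data(), sycl::range<1>(b.size()));
            sycl::buffer<float> bo(out.data(), sycl::range<1>(out.size()));
            q.submit([&](sycl::handler& h) {
                sycl::accessor va(ba, h, sycl::read_only);
                sycl::accessor vb(bb, h, sycl::read_only);
                sycl::accessor vo(bo, h, sycl::write_only, sycl::no_init);
                // Same lane-wise add as before, expressed device-independently.
                h.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> i) {
                    vo[i] = va[i] + vb[i];
                });
            });
        }  // buffer destructors wait for the kernel and copy results back
        return out[0] == 3.0f ? 0 : 1;  // trivial check: 1.0 + 2.0
    }

The catch is the one Koduri concedes: when the portable version is not fast enough, developers still reach down through the layers, which is what makes the abstractions Swiss cheese.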
It’s not even clear whether programmers will continue to notice differences in instruction sets by the time this process has finished, Koduri added: “It’s the trillion-dollar question ahead of us. I’m not proposing that we have an answer today but we will share it when we find one through oneAPI. We all need to work on it.”
The need to accelerate low-level functions that are today managed by operating systems, together with arithmetic-heavy application code, means that hardware design teams will absorb more and more software engineers. “Today we have a ton of CPU cycles consumed by the bottom layer of the stack. There are definite opportunities to offload functions [from software into hardware] in this part of the stack. Replumbing this infrastructure is not going to be easy but we do see a huge amount of opportunity,” said Koduri. Doing it will mean recruiting software engineers to advise on the kinds of hardware optimisation developers and their compilers can use.
Intel has, after all, been here before, and it did not go well. Derived from R&D that went into the Multics operating system – itself displaced by Unix, whose very name was a poke at the complexity of the original – the x86 processor had instructions that were meant to take over task management from software. But programmers never used them; they got faster code by running those operations as streams of much simpler instructions. The industry cannot afford to let that happen again, or a lot of transistors dedicated to specific functions will be left behind.