The AI surge makes the PC processor maker’s purchase of Xilinx look like a one-way bet, but trouble may surface sooner than many expect.
In the summer, the Elon Musk-backed non-profit institute OpenAI released the application programming interface (API) for its GPT-3 language model to a select list of users. Aside from its ability to act as the most powerful chatbot the internet has ever seen, which led to it masquerading as a real user on Reddit, GPT-3 is remarkable for its sheer size. With 175 billion trainable parameters, it is ten times larger than the Turing-NLG model Microsoft unveiled early this year. Microsoft is also the one company that holds a licence to build and use its own copies of GPT-3.
Language models have overtaken the old deep-learning stalwart of recognising photographs of cats in baskets by some distance. Before the BERT language model appeared on the landscape just a couple of years ago, image-processing pipelines topped the size charts but were only doubling every year. The sudden growth spurt in language models, built from stacks of densely connected neural-network layers known as Transformers, has pushed the rate of increase to at least ten times per year.
Though image-processing AI is nowhere near the scale of natural language processing, the research teams at the big social-media companies believe scale matters to their models as well. In an online seminar organised by IBM and the IEEE last week, Facebook AI Research director Laurens van der Maaten claimed that their experiments show simply increasing the parameter count in a deep-learning pipeline boosts accuracy, especially when coupled with training on ever-larger data sets.
“One observation is that every time you double the training data you get a fixed increase, which is kind of log-linear. And the increase is also dependent on the number of parameters. Every time you double the numbers of examples, the increase is larger if you have more parameters. You can see it in our study as well as in the GPT-3 study. You can think of GPT-3 as the language-recognition version of our study.”
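The relationship van der Maaten describes can be sketched in a few lines of code. This is purely illustrative: the functional form (accuracy rising with the log of the data size, with a slope that grows with the log of the parameter count) follows his description, but the coefficients here are made up, not taken from the Facebook or GPT-3 studies.

```python
import math

def predicted_accuracy(num_examples, num_params,
                       base=0.60, slope_coeff=0.0005):
    """Illustrative log-linear scaling curve (made-up coefficients):
    each doubling of training data adds a fixed increment to accuracy,
    and that increment is larger for models with more parameters."""
    gain_per_doubling = slope_coeff * math.log2(num_params)
    return base + gain_per_doubling * math.log2(num_examples)

# Doubling the data adds a fixed amount for a given model size...
gap_small_model = (predicted_accuracy(2e6, 1e8)
                   - predicted_accuracy(1e6, 1e8))
# ...and the fixed amount is bigger when the model is bigger.
gap_big_model = (predicted_accuracy(2e6, 1e10)
                 - predicted_accuracy(1e6, 1e10))
print(gap_small_model, gap_big_model)
```

Running this shows the per-doubling gain is constant for a fixed model size but larger for the bigger model, which is the log-linear behaviour described in the quote.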
Naturally, chipmakers have been salivating over what this means for them. The rate of growth far outstrips Moore’s Law, which means they will be selling a lot more silicon if these enormous models become widespread.
At the company’s design summit a few weeks ago, Rambus fellow and distinguished inventor Steven Woo claimed: “We are seeing trillion-parameter models on the horizon.”
This plays into the reasoning behind AMD’s decision to buy FPGA maker Xilinx, as well as explaining AMD’s ability to fund an all-stock deal worth $35bn, not far short of the $40bn nVidia has agreed to pay Softbank for Arm. As the growth in AI has combined with the cloud-computing revolution, AMD has gone from being a sideshow to Intel’s dominance of server and PC silicon to an increasingly significant threat to the chip giant’s place in servers. That has seen AMD’s share price go through the kind of parabolic increase over the past five years also seen at Apple and Microsoft.
AMD is hardly unusual in wanting to buy an FPGA maker, especially now that its paper is worth far more than it was when Intel came to a similar conclusion several years ago and bought Altera. Intel aimed to combine FPGA silicon with its Xeon processors, either in multi-chip packages or with full single-chip integration. That has not worked out so well for Intel.
FPGAs have something of a mixed relationship with the AI sector, which is partly why Xilinx’s share price was on a tear up to 2019 before falling back. The timing for AMD may be a bit more advantageous.
FPGAs do not perform well in the training applications that have captured most of Wall Street’s attention because they lack the floating-point horsepower of the GPUs that nVidia supplies. AMD is also a GPU provider, though it has not capitalised on AI training in data centres to the same extent. However, FPGAs perform well once you start to optimise a neural network for day-to-day inference work. Their ability to reroute data efficiently is a major advantage, especially given the rapid evolution of deep learning. Barely a day goes by on the arXiv preprint server without some novel deep-learning pipeline being proposed. Increasingly, these structures are moving towards irregular, sparse graphs rather than the dense matrix and tensor arithmetic at which GPUs excel.
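The dense-versus-sparse distinction is easy to see in miniature. The sketch below (pure Python, simplified compressed-sparse-row storage) multiplies the same matrix by a vector both ways: the dense version does every multiply, while the sparse version touches only the non-zero entries via irregular, data-dependent indexing — the kind of access pattern that suits reconfigurable logic better than a GPU’s wide, regular arithmetic units.

```python
# Dense vs sparse matrix-vector multiply (illustrative sketch).
# GPUs excel at the dense, regular case; the sparse case involves
# data-dependent indexing closer to what FPGAs can wire up directly.

def dense_matvec(matrix, vec):
    """Multiply every entry, zero or not: 9 multiplies for a 3x3."""
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in matrix]

def to_csr(matrix):
    """Compress a row-major matrix to (values, column indices, row pointers)."""
    values, cols, rowptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                cols.append(j)
        rowptr.append(len(values))
    return values, cols, rowptr

def sparse_matvec(values, cols, rowptr, vec):
    """Touch only the stored non-zeros: 3 multiplies for this matrix."""
    return [sum(values[k] * vec[cols[k]]
                for k in range(rowptr[r], rowptr[r + 1]))
            for r in range(len(rowptr) - 1)]

m = [[0, 2, 0],
     [1, 0, 0],
     [0, 0, 3]]
v = [4, 5, 6]
print(dense_matvec(m, v))            # [10, 4, 18]
print(sparse_matvec(*to_csr(m), v))  # same answer from a third of the work
```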
If you want to put AI inference into things like robots and self-driving cars, FPGAs look attractive right now because they save manufacturers from being locked into a deep-learning architecture that might not last a full year before being surpassed by something else.
However, a potential issue is looming that will demonstrate once again that no exponential lasts forever. Scientists are peering into the core structure of models such as BERT; they have to work with BERT because GPT-3 is a black box to anyone outside OpenAI or Microsoft. But language models, like other deep-learning architectures, are pretty inscrutable. A number of papers on “BERTology” have appeared over the past year that try to work out what exactly is going on inside these models and why they work so well on some problems and then fail miserably on others. Some have indicated you could cheerfully prune BERT of millions of parameters for a particular target task and see no reduction in performance. There is clearly a lot of redundancy in these networks that more forensic work will uncover.
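The simplest of the pruning ideas that work draws on is magnitude pruning: delete the weights with the smallest absolute values and keep the rest. A toy sketch follows; real BERT pruning operates per layer or per attention head and is usually followed by fine-tuning, but the core operation is this.

```python
# Magnitude pruning (illustrative): zero out the fraction of weights
# with the smallest absolute values. If accuracy survives, those
# parameters were redundant for the task at hand.

def prune_by_magnitude(weights, fraction):
    """Return a copy of `weights` with the smallest-magnitude
    `fraction` of entries set to zero."""
    n_prune = int(len(weights) * fraction)
    if n_prune == 0:
        return list(weights)
    # Threshold is the magnitude of the n_prune-th smallest weight.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < n_prune:
            pruned.append(0.0)  # ties broken by position, capped at n_prune
            removed += 1
        else:
            pruned.append(w)
    return pruned

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
print(prune_by_magnitude(w, 0.5))  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

The interesting empirical finding in the BERTology papers is not the mechanism, which is trivial, but how large a fraction can be zeroed before task performance drops.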
At that point, the growth prospects for high-end silicon look somewhat less frothy. And like the parabolic blowoff tops of the past, the long-term trend line to which the price falls can be some way further down the graph.