
Large models are moving onto terminal devices. What about the chips?

Artificial intelligence (AI) has become the most significant new driving force in the semiconductor industry over the past few years. Last year, large models, represented by ChatGPT, further ignited the AI and AI-chip markets. The large models behind ChatGPT are becoming the face of the next generation of AI and are expected to spawn a wave of new applications.

When people think of large models, what usually comes to mind is models running on cloud servers. In fact, however, large models are already making their way into terminal devices. On the one hand, a considerable body of work has shown that, with appropriate optimization, large models can run on terminal devices rather than only on cloud servers; on the other hand, running large models on terminal devices brings real value to users. We therefore believe that over the next few years more and more large models will run on terminal devices, which will in turn drive further development of the related chip technologies and industries.


Smart cars are the first important terminal market for large models. From an application perspective, the primary driving force is that large models deliver significant performance gains on intelligent-driving tasks. Last year's BEVFormer, an end-to-end bird's-eye-view (BEV) camera model, can be considered the first milestone for large models in the smart-car field: it feeds multiple camera video streams directly into a transformer-based model, and its final accuracy is nearly 10 points higher than that of traditional convolutional neural network (CNN) models, a step change in capability. At last month's CVPR, SenseTime released the UniAD large model, which trains a single vision large model and then adapts it to multiple downstream tasks. It substantially surpasses the previous best models across tasks: multi-object tracking accuracy improves by 20%, lane-line prediction accuracy by 30%, and motion-forecasting and planning errors drop by 38% and 28%, respectively.
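To make the "multiple camera streams into one transformer" idea concrete, below is a heavily simplified sketch of BEV-style cross-attention: a grid of learnable bird's-eye-view queries aggregates features from all camera views in a single attention step. All shapes and module choices are illustrative assumptions, not BEVFormer's published configuration (which uses deformable spatial and temporal attention rather than plain multi-head attention).

```python
import torch
import torch.nn as nn

# Scaled-down, illustrative sizes (assumptions, not the real model's settings)
num_cams, tokens_per_cam, embed_dim = 6, 300, 256
bev_h, bev_w = 25, 25  # resolution of the bird's-eye-view grid

# Per-camera image features, already projected to a common embedding size
cam_feats = torch.randn(num_cams * tokens_per_cam, 1, embed_dim)  # (seq, batch, dim)

# One learnable query per BEV grid cell
bev_queries = torch.randn(bev_h * bev_w, 1, embed_dim)

# Cross-attention: each BEV cell gathers evidence from all camera tokens at once
cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8)
bev_feats, _ = cross_attn(query=bev_queries, key=cam_feats, value=cam_feats)

print(bev_feats.shape)  # (bev_h * bev_w, 1, embed_dim): a unified BEV feature map
```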

Currently, car companies (especially the newer EV makers) are actively embracing these large models for smart cars. BEVFormer and related models have already been adopted by many car companies, and we expect the next generation of large models to enter intelligent driving over the next few years. From an application standpoint, the large models on smart cars must run on the terminal itself, because smart cars place very stringent requirements on the reliability and latency of model execution; running the model in the cloud and sending results back over the network cannot meet those needs.

Beyond smart cars, mobile phones are the other important terminal market for large models. Language models of the kind behind ChatGPT are effectively becoming a core part of the next generation of user interaction, so running large language models on phones would bring that new interaction experience into the mobile operating system. The main advantage of running a large language model directly on the device is that it can deliver personalized experiences (such as chat and summarization tailored to the user) while protecting the user's privacy. Already, the open-source community can run the Llama large language model on Android phone CPUs, answering a question in roughly 5-10 seconds, and we believe the potential from here is substantial.
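As a rough illustration of what such on-device inference looks like in practice, here is a minimal sketch using the open-source llama-cpp-python bindings around the llama.cpp runtime, one of the community paths for running Llama on CPUs (including Android builds). The model file name and parameter values are illustrative assumptions, not a specific recommended setup.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-7b-q4.gguf",  # a 4-bit-quantized checkpoint (hypothetical file name)
    n_ctx=512,                        # small context window to limit memory use
    n_threads=4,                      # run on a handful of CPU cores
)

out = llm("Q: What is the capital of France?\nA:", max_tokens=32)
print(out["choices"][0]["text"])
```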

Smart-car chips accelerating large models: computing power and power consumption are key

Artificial intelligence is already widely used in the driver-assistance functions of smart cars, so most smart-car chips already support AI, for example by integrating AI accelerators. However, these accelerators were designed mainly to speed up the previous generation of models, represented by convolutional neural networks, which typically have far fewer parameters and much lower compute requirements.

To support the next generation of large models, smart-car chips will have to change accordingly. The main requirements are:

1. High computing power: the perception and planning tasks on a smart car must complete in real time, so the chip must provide enough computing power to keep up.

2. Low power consumption: the power budget in a car is still limited. Given heat-dissipation constraints, the chip cannot draw hundreds of watts the way a data-center GPU does.

3. Reasonable cost: a smart-car chip cannot cost thousands of dollars the way a high-end GPU can.

The central question for a large-model accelerator in a smart car is therefore how to deliver as much computing power as possible within these power and cost constraints.

We can start from the most successful large-model accelerator to date, the GPU, and speculate about the architecture of a smart-car chip for large models: which GPU design ideas should be carried forward, and which should be redesigned.

First, the GPU contains a very large number of matrix compute units, which are the core of its computing power (by contrast, the CPU lacks such an array of matrix units, which is why its throughput cannot be high). These compute units are equally essential on a smart-car chip. However, because a smart-car chip does not need the GPU's generality in data flow and operator support, it does not need a large number of general-purpose streaming cores, so the control logic can be simplified to reduce chip area and cost.

Second, another key to running large models well on GPUs is the combination of ultra-high-speed memory interfaces and large memory capacity, because current large models often have parameter counts in the hundreds of billions and need memory to match. Smart-car chips need this too, but they may not be able to afford ultra-high-end (and expensive) memory such as the HBM used on GPUs; instead, they will likely co-design the memory system and the architecture to make full use of the bandwidth of interfaces such as LPDDR.
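A quick back-of-the-envelope calculation shows why the memory interface, rather than raw compute, is often the binding constraint: every full forward pass has to stream the model's weights from memory at least once, so memory bandwidth caps the achievable inference rate. The bandwidth and model-size figures below are rough illustrative assumptions only.

```python
def max_passes_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on forward passes (frames or tokens) per second, ignoring caching."""
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 1e9 * 1.0   # e.g. a 1B-parameter model stored at 8 bits per weight
lpddr_bw = 50e9           # ~50 GB/s, a plausible multi-channel LPDDR5 figure
hbm_bw = 800e9            # ~800 GB/s, a plausible figure for a modern HBM stack

print("LPDDR-class :", max_passes_per_second(model_bytes, lpddr_bw), "passes/s")
print("HBM-class   :", max_passes_per_second(model_bytes, hbm_bw), "passes/s")
```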

Third, the GPU offers good scalability and distributed computing: when a model does not fit on a single GPU, it can be conveniently partitioned into sub-models and computed across multiple GPUs. Smart-car chips can adopt a similar architecture to ensure that a vehicle can keep up with evolving model requirements over its service life.
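The sketch below illustrates the partitioning idea in its simplest form: a single large layer is split column-wise across two accelerators, each computes its partial result locally, and only the much smaller activations need to cross the interconnect. The sizes are arbitrary assumptions, and the two "devices" are simulated with separate arrays.

```python
import numpy as np

d_in, d_out = 1024, 1024
x = np.random.randn(1, d_in)        # one input activation vector
W = np.random.randn(d_in, d_out)    # full weight matrix of the layer

# Each accelerator holds half of the output columns of W
W_dev0, W_dev1 = W[:, : d_out // 2], W[:, d_out // 2:]

# Each device computes its partial result locally ...
y_dev0 = x @ W_dev0
y_dev1 = x @ W_dev1

# ... and only the (much smaller) activations cross the interconnect
y = np.concatenate([y_dev0, y_dev1], axis=1)

assert np.allclose(y, x @ W)   # identical to computing the layer on one device
```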

Putting these together, we speculate that a smart-car chip architecture for large models may consist of multiple AI accelerators running in parallel, each with a simple design (for example, a lightweight control core paired with a large array of compute units), equipped with large memory and high-speed memory interfaces, and connected to one another by high-speed interconnects so that large models are accelerated in a locally distributed fashion. From this perspective, memory and memory interfaces will play a decisive role in intelligent-driving chips. Such an architecture is also well suited to implementing each accelerator as a chiplet and using advanced packaging (including 2.5D and 3D packaging) to integrate multiple accelerators. In other words, the use of large models in smart cars will further drive the adoption and evolution of next-generation memory interfaces and advanced packaging technology.

Large models will drive innovation in mobile phone memory and AI accelerators

As noted earlier, bringing large models to mobile phones brings the next generation of user-interaction paradigms with them. We expect this to be a gradual process: today, even the small 7-billion-parameter version of the Llama model cannot fit entirely in a phone's memory and must partly run out of flash storage, which slows it down considerably. Over the next few years we expect large language models on phones to start from smaller versions (models with under 1 billion parameters) and grow in parameter count from there.

Seen this way, running large models on phones will accelerate the development of phone chips in the relevant areas, especially memory and AI accelerators. After all, the mainstream models running on phones today have fewer than 10M parameters, while large language models are at least two orders of magnitude larger, and model sizes will keep growing quickly. On one hand, this will push phone memory and interface technology to evolve faster: to meet the needs of large models, we can expect phone memory capacity to grow more quickly and phone memory-interface bandwidth to advance at a faster pace, because memory is currently the real bottleneck for large models.
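The memory arithmetic behind this is straightforward; the sketch below works through the weight footprint of 7B- and 1B-parameter models at a few common precisions against a typical phone RAM budget. All numbers are round illustrative assumptions rather than measurements of any specific phone.

```python
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for n_params, label in [(7e9, "7B"), (1e9, "1B")]:
    for fmt, b in bytes_per_param.items():
        print(f"{label} model @ {fmt}: ~{n_params * b / 1e9:.1f} GB of weights")

# A flagship phone today typically has on the order of 8-16 GB of LPDDR RAM,
# shared with the OS and all other apps, which is why a 7B model spills into
# flash and why sub-1B models are the more realistic near-term starting point.
```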

Beyond memory, the AI accelerators in phone chips will also have to change for large models. AI accelerators (various NPU IPs) are now practically standard in phone chips, but their designs largely target the previous generation of convolutional neural networks rather than large models. To support large models, an AI accelerator first needs higher memory-access bandwidth and lower memory-access latency. That requires changes at the accelerator's interface (for example, allocating more pins to the memory interface) as well as corresponding changes in the on-chip interconnect so that the accelerator's memory traffic can be served.

In addition, within the accelerator's compute logic, we expect low-precision quantized computation (4-bit or even 2-bit) and sparse computation to be pushed more aggressively. Current academic research suggests that large language models are particularly amenable to such low-precision quantization and sparsification; if weights can be quantized to 4 bits, the chip area needed for the compute units shrinks dramatically, and so does the model's memory footprint (4-bit quantization roughly halves memory requirements relative to the previously standard 8-bit precision). We expect this to be a key design direction for mobile AI accelerators.
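As a concrete illustration of the memory argument, here is a minimal sketch of block-wise symmetric 4-bit weight quantization. Production schemes (e.g., GPTQ- or AWQ-style methods) are considerably more sophisticated; the block size and scale format below are assumptions chosen only for clarity.

```python
import numpy as np

def quantize_4bit(w: np.ndarray, block: int = 64):
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # map each block to [-7, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # 4-bit codes (held in int8 here)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096 * 64).astype(np.float32)
q, scale = quantize_4bit(w)

# Two 4-bit codes pack into one byte, plus a small per-block fp16 scale:
# roughly 0.5 bytes per weight, about half of an 8-bit layout.
approx_bytes = q.size / 2 + scale.size * 2
print(f"~{approx_bytes / w.size:.2f} bytes per parameter")
print("max abs reconstruction error:", np.abs(w - dequantize_4bit(q, scale)).max())
```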

Based on this analysis, we expect that, from a market perspective, mobile phone memory chips will grow in importance on the back of the on-device large-model trend, with faster progress in both high-capacity memory and high-speed memory interfaces. At the same time, AI accelerator IP for phones will see new requirements and new development, and we expect that market to become considerably more active.
