
What is Microsoft's play with self-developed chips?

Microsoft has been in the spotlight throughout the recent wave of artificial intelligence: from its massive investments in OpenAI to the integration of ChatGPT into the Bing search engine, it has stood at the forefront of the entire field. Just a few days ago, news emerged that Microsoft is collaborating with AMD to develop its own AI chips. The story has been full of twists and turns; here we outline the general trajectory of Microsoft's self-developed AI chip effort.

First, about half a month ago, media reports stated that Microsoft was developing its own chips for large language models (LLMs, currently the most cutting-edge AI technology and the model technology behind ChatGPT), under the internal codename Athena. Then, on May 2nd, during the analyst call following AMD's first-quarter 2023 earnings release, an analyst asked about AMD's views on internet cloud-computing companies developing their own chips and whether AMD planned to cooperate with such companies on semi-custom chips. AMD CEO Lisa Su stated that AMD has a very comprehensive IP library spanning CPUs, GPUs, FPGAs, and DPUs, as well as a strong semi-custom chip team, and that the company plans to invest further in this field to cooperate with major customers. Two days later, Bloomberg reported that AMD is collaborating with Microsoft on AI chips, with Microsoft providing AI-related R&D support to AMD and AMD developing the Athena chip for Microsoft. After the report was published, AMD's stock price briefly rose about 6%. Following the Bloomberg report, a Microsoft spokesperson stated that AMD is an important partner for Microsoft, but that the Athena chip is not currently being developed by AMD. Microsoft did not, however, deny the reports of AI cooperation with AMD.


Summarizing the existing reports, we believe that, on the one hand, semi-custom chips will be one of AMD's key investment directions in the AI field, since the large customers for AI applications (mainly the internet technology giants) have great interest in this area. On the other hand, although the Athena chip may not be directly developed by AMD, the possibility of Microsoft cooperating with AMD on AI hardware is very high. The most likely scenario at present is that Microsoft and AMD are jointly developing a complete hardware solution for accelerating AI large language models, one that includes Microsoft's self-developed Athena chip alongside AMD's CPUs and other chips. During Athena's development, Microsoft is very likely to add interfaces and optimizations targeting AMD's chips (and may even use some of AMD's IP), while AMD may add semi-custom elements defined by Microsoft to its side of the joint hardware solution (such as data interfaces, memory bandwidth, and optimizations for Microsoft's AI framework).

Finally, in terms of chip-level system integration, it would be a natural outcome for Microsoft to use AMD's extensive experience in advanced packaging to integrate Athena with AMD's chips; at the software level above, Microsoft and AMD are expected to cooperate deeply to ensure that the entire AI system runs efficiently.

Watching these developments, it is hard not to marvel at how times have changed. Thirty years ago, the deep Wintel cooperation between Microsoft and Intel ignited the rapid growth of the entire PC market, and both companies grew quickly in the process, while AMD played a marginal role; it was even said that Intel kept AMD alive mainly to avoid triggering antitrust action and being broken up. Today, AMD's market value has surpassed Intel's, and Microsoft has chosen to cooperate with AMD in AI, the hottest field of all. We also believe that the deep hardware and chip cooperation between Microsoft and AMD opens a new chapter in the history of self-developed chips at technology giants: a shift from emphasizing chips made entirely in-house to emphasizing cooperation with traditional chip companies. Note that this is not merely cooperation on foundry or design services, but deep cooperation on design targets, IP, and hardware/software interfaces.

The history of internet companies developing their own chips

Let's review the history of internet companies developing their own chips, a history that has run almost in step with the AI boom that began in 2016. The rise of AI has had a decisive impact on internet businesses: in the cloud, AI technology has greatly improved core businesses such as recommendation and advertising systems, while on the device side, AI has enabled many important computer vision and voice technologies. The companies developing their own chips for AI-related businesses include nearly all the technology giants: Google, Microsoft, Amazon, Alibaba, ByteDance, Baidu, and others. Looking at their motivations, internet technology companies have historically developed their own chips for two main reasons: cost and functionality.

From the cost perspective, AI computing demands enormous computing power, so its cost is correspondingly high. On the supply side, Nvidia is the dominant cloud AI chip supplier; its GPUs are expensive, and for technology giants, over-reliance on a single supplier also carries supply-chain risk (for Chinese internet giants in particular, relying on Nvidia carries even greater risk and uncertainty due to geopolitics). In addition, the energy efficiency of GPUs running AI workloads is far from perfect; in cloud data centers, a large share of the electricity bill goes to AI applications. The main goals of internet giants' self-developed cloud AI chips are therefore to reduce dependence on Nvidia and to achieve better energy efficiency than Nvidia's GPUs, so that at large deployment scale the total cost comes in below that of simply buying Nvidia GPUs. Google's TPU is the famous example here: after several iterations, the TPU's performance is usually comparable to Nvidia's GPUs, but in energy efficiency and other cost-determining respects it can do better.
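The total-cost argument above can be sketched with some simple arithmetic. All the figures below (unit prices, power draw, fleet size, electricity rate) are illustrative assumptions, not real pricing data; the point is only how a lower unit cost and better energy efficiency compound at scale.

```python
# Hypothetical TCO comparison: purchased GPUs vs. a self-developed accelerator.
# Every number here is an illustrative assumption, not real pricing data.

def total_cost_of_ownership(unit_price, power_watts, count,
                            years=4, price_per_kwh=0.10):
    """Rough TCO: hardware purchase cost plus electricity over the lifetime."""
    hardware = unit_price * count
    hours = years * 365 * 24
    electricity = power_watts / 1000 * hours * price_per_kwh * count
    return hardware + electricity

# Assumed: an off-the-shelf GPU vs. a custom chip with lower unit cost and
# lower power draw at comparable throughput, deployed as a 1,000-unit fleet.
gpu_tco = total_cost_of_ownership(unit_price=15_000, power_watts=400, count=1_000)
custom_tco = total_cost_of_ownership(unit_price=8_000, power_watts=250, count=1_000)

print(f"GPU fleet TCO:    ${gpu_tco:,.0f}")
print(f"Custom fleet TCO: ${custom_tco:,.0f}")
print(f"Savings: {1 - custom_tco / gpu_tco:.0%}")
```

With these made-up numbers the custom fleet comes out well under half the hardware-plus-power cost of the GPU fleet, which is the kind of gap that justifies a several-hundred-person chip team.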

The other main motivation for internet technology companies to develop their own chips is stronger functionality: when no chip on the market can meet the company's needs, it must develop its own to satisfy its design requirements, which in turn yields higher product competitiveness than rivals using third-party general-purpose chips. A typical example is Microsoft's self-developed HPU chip in HoloLens, which accelerates AI machine-vision workloads so that HoloLens's core functional modules (such as indoor SLAM positioning) get sufficient computing power without draining the battery. Google's Tensor processor in Pixel phones is another example.

Internet companies' self-developed chips have traditionally emphasized independence, meaning that the most critical modules (IP) and the system architecture are designed by the internet company itself. In practice, since internet giants have little accumulated experience in the chip industry, they usually build a team of a few hundred people responsible mainly for defining the chip architecture and designing and verifying the core IPs; general-purpose IPs (such as DDR interfaces) are typically purchased, and work that can be outsourced, such as backend design, is handed to external design-service companies. In short, the usual pattern is for the in-house core team to define the chip architecture and design the core modules, then cooperate with neutral third-party IP and design-service companies to buy the remaining general-purpose IPs and complete the overall chip design flow.

Microsoft Opens a New Chapter in Internet Chip Making

Microsoft's collaboration with AMD marks a new milestone in the tech giant's chip-making endeavors: this time, Microsoft is not only partnering with a neutral third-party design service company but also collaborating with a traditional chip giant to design chips and hardware systems that support next-generation artificial intelligence technology. In other words, the tech giant's self-developed chips have gradually moved from emphasizing "independence" to "cooperation" today.

If we explore the reasons for this shift, we see at least two driving factors. The first is that the computing power demanded by future artificial intelligence is growing exponentially, and the complexity required of the chip system is far beyond anything seen before.

For example, in 2016 the hottest AI application was machine vision (object recognition and classification), with mainstream models typically having 10M-100M parameters and compute requirements of around 1-10 GFLOPs; today's popular large language models (such as ChatGPT and its successor GPT-4) have parameter counts on the order of 1T and compute requirements of around 1-10 PFLOPs, several orders of magnitude larger in both parameters and compute. Under these circumstances, AI chip design is completely different from that of Google's TPU of 2017, which was designed mainly for machine-vision tasks. In 2017, Google's TPU could accelerate a large share of AI workloads by building around its systolic-array-based convolution acceleration IP and a relatively large on-chip SRAM; the TPU was fairly independent of the other chips in the system, and as long as the systolic array and on-chip memory were done well, performance would meet the target. In 2023, however, with model parameter counts and compute requirements up by several orders of magnitude, designing an AI accelerator requires careful consideration of the other chips in the hardware system, including memory access, high-speed data interconnect, and the partitioning and movement of data and computation between CPUs and AI chips. It is a very complex system, and every chip in it must perform reasonably well for the overall system to be efficient; otherwise any single chip can become the bottleneck. In other words, optimizing only the AI accelerator without optimizing the other chips may not yield high overall performance.
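The scale gap described above can be made concrete with round numbers taken from the text (the per-inference figures are rough illustrative values, not measurements of any specific model):

```python
# Back-of-the-envelope scaling comparison, using round illustrative numbers.

vision_params = 100e6   # ~100M parameters, a 2016-era vision model
vision_flops = 10e9     # ~10 GFLOPs of compute

llm_params = 1e12       # ~1T parameters, a current large language model
llm_flops = 10e15       # ~10 PFLOPs of compute

param_growth = llm_params / vision_params
flops_growth = llm_flops / vision_flops

print(f"Parameter growth: {param_growth:,.0f}x")
print(f"Compute growth:   {flops_growth:,.0f}x")

# At ~1T parameters in fp16 (2 bytes each), the weights alone occupy ~2 TB,
# far beyond any single chip's on-package memory. This is why memory access,
# interconnect, and multi-chip co-design now dominate accelerator design.
weights_bytes = llm_params * 2
print(f"fp16 weights: {weights_bytes / 1e12:.0f} TB")
```

The weight-footprint calculation at the end is the crux: once a model's parameters cannot fit near a single accelerator die, the system's bottleneck shifts from the compute IP to memory bandwidth and chip-to-chip interconnect, exactly the areas where the text argues co-design with a partner like AMD becomes necessary.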

Obviously, a tech giant cannot develop every chip in such a system itself; it must cooperate deeply with traditional chip giants to build a system that is optimized as a whole. AMD in particular has deep accumulated expertise in overall system integration (advanced packaging and data interconnect), while Microsoft's strengths lie more on the software side, so deep cooperation between the two plays to each side's advantages.

Beyond system complexity, the other driving factor is the current economic climate. Although artificial intelligence remains hot, the global macroeconomic outlook is not optimistic, so tech giants tend to scale back expansion and investment in non-core businesses. For their chip efforts, this means concentrating resources where they matter most, namely the IP for core AI acceleration, while turning to partners for the non-core IPs and the other chips in the system, rather than expanding in-house teams to do as much as possible, as they did a few years ago.

Looking ahead, the pattern of tech giants making chips will continue to some extent in its current form, but we also expect deeper cooperation with traditional chip giants. As noted earlier, for applications such as next-generation artificial intelligence, we can expect more partnerships like Microsoft and AMD's to jointly tackle such complex systems. At the same time, under economic pressure, we expect more and more internet giants to move upstream in their chip work: defining the chip architecture and delivering the core IP, while leaving the integration of that IP into an SoC to partners. We may even see more customized SoCs in which the internet giant's core IP is integrated on top of a partner's existing reference SoC design, minimizing design cost.

From this perspective, internet giants need not just a design-service partner but a chip partner that already has relevant SoC design and mass-production experience; AMD, Samsung, MediaTek, and others stand to benefit, since they combine strong design-service/semi-custom chip departments with cutting-edge SoC design and mass-production experience. Technically, advanced packaging and chiplet technology are expected to play a core enabling role in such cooperation: if chiplets can be used, a tech giant's core IP can be placed in a chiplet and integrated with other SoCs without taping out a dedicated monolithic SoC, greatly reducing design cost while greatly increasing design flexibility. This may be another reason Microsoft chose to work with AMD, which has rich experience in chiplet-based advanced packaging.
