Generative AI, one of the fastest-growing technologies, powers chat services such as OpenAI’s ChatGPT and Google Bard and image generation systems such as Stable Diffusion and DALL-E. Still, it has a significant limitation: these tools rely on cloud-based data centers with hundreds of GPUs to perform the computation needed for every query.
But one day you could run generative AI tasks directly on your mobile device. Or your connected car. Or in your living room, bedroom, and kitchen on smart speakers like Amazon Echo, Google Home, or Apple HomePod.
MediaTek believes this future is closer than we realize. Today, the Taiwan-based semiconductor company announced that it is working with Meta to port the social giant’s Llama 2 LLM to its hardware. Combined with MediaTek’s latest-generation APUs and NeuroPilot software development platform, the model would run generative AI tasks on devices without relying on external processing.
Of course, there’s a catch: This won’t eliminate the data center entirely. Due to the size of LLM datasets (the number of parameters they contain) and the storage system’s required performance, you still need a data center, albeit a much smaller one.
For example, Llama 2’s “small” dataset is 7 billion parameters, or about 13GB, which is suitable for some rudimentary generative AI functions. However, the much larger version of 70 billion parameters requires proportionally more storage, even with advanced data compression, which puts it outside the practical capabilities of today’s smartphones. Over the next several years, LLMs in development will easily be 10 to 100 times the size of Llama 2 or GPT-4, with storage requirements in the hundreds of gigabytes and higher.
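The storage figures above can be checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate only: it assumes weights are stored at a fixed number of bits per parameter (16-bit is what makes 7 billion parameters come out at about 13GB), treats lower bit widths as a stand-in for compression, and ignores file metadata and runtime memory. The function name `model_size_gib` is made up for illustration, and 70 billion is the largest size Meta has published for Llama 2.

```python
GIB = 2**30  # bytes in one gibibyte


def model_size_gib(num_params: int, bits_per_param: int = 16) -> float:
    """Approximate storage for the weights alone, in GiB.

    Assumes every parameter uses the same number of bits and ignores
    metadata, activations, and any runtime overhead.
    """
    return num_params * bits_per_param / 8 / GIB


# Parameter counts for the smallest and largest Llama 2 models.
models = [("Llama 2 7B", 7_000_000_000), ("Llama 2 70B", 70_000_000_000)]

for name, params in models:
    for bits in (16, 8, 4):  # 16-bit baseline, then two assumed compressed widths
        print(f"{name} @ {bits}-bit: {model_size_gib(params, bits):.1f} GiB")
```

Under these assumptions the 7-billion-parameter model lands at roughly 13 GiB at 16 bits, while the 70-billion-parameter model still needs more than 30 GiB even at 4 bits per parameter, which is the gap the paragraph above describes.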
That is too much for a smartphone to store while still delivering enough IOPS for database-like performance, but it is well within reach of specially designed cache appliances with fast flash storage and terabytes of RAM. So, for Llama 2, it is possible today to host an appliance optimized for serving mobile devices in a single rack unit, without all the heavy compute. It’s not a phone, but it’s pretty impressive anyway!
MediaTek expects Llama 2-based AI applications to become available for smartphones powered by its next-generation flagship SoC, scheduled to hit the market by the end of the year.
For on-device generative AI to access these datasets, mobile carriers would have to rely on low-latency edge networks — small data centers/equipment closets with fast connections to the 5G towers. These data centers would reside directly on the carrier’s network, so LLMs running on smartphones would not need to go through many network “hops” before accessing the parameter data.
In addition to running AI workloads on device using specialized processors such as MediaTek’s, domain-specific LLMs can be moved closer to the application workload by running in a hybrid fashion with these caching appliances within the miniature data center, in a “constrained device edge” scenario.
So, what are the benefits of using on-device generative AI?
- Reduced latency: Because the data is being processed on the device itself, the response time is reduced significantly, especially if localized caching is used for frequently accessed parts of the parameter dataset.
- Improved data privacy: By keeping the data on the device, that data (such as a chat conversation or training submitted by the user) isn’t transmitted through the data center; only the model data is.
- Improved bandwidth efficiency: Today, generative AI tasks require all data from the user conversation to go back and forth to the data center. With localized processing, a large amount of this occurs on the device.
- Increased operational resiliency: With on-device generation, the system can continue functioning even if the network is disrupted, particularly if the device has a large enough parameter cache.
- Energy efficiency: It doesn’t require as many compute-intensive resources at the data center, or as much energy to transmit that data from the device to the data center.
However, achieving these benefits may involve splitting workloads and using other load-balancing techniques to alleviate centralized data center compute costs and network overhead.
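The localized parameter cache mentioned in the list above could, in principle, look something like the following minimal sketch. Everything here is an assumption for illustration rather than anything MediaTek or Meta has described: the idea that parameters are split into named "shards", the `ShardCache` class, the least-recently-used eviction policy, and the `fetch` callback that stands in for a request to an edge caching appliance. Whether real LLM parameter access has enough locality for such a cache to help is also an open assumption.

```python
from collections import OrderedDict
from typing import Callable


class ShardCache:
    """Hypothetical on-device LRU cache for parameter shards."""

    def __init__(self, capacity: int, fetch: Callable[[str], bytes]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity      # max number of shards kept on device
        self.fetch = fetch            # stand-in for a call to the edge appliance
        self._shards = OrderedDict()  # shard_id -> bytes, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, shard_id: str) -> bytes:
        if shard_id in self._shards:
            # Cache hit: mark the shard as most recently used.
            self._shards.move_to_end(shard_id)
            self.hits += 1
            return self._shards[shard_id]
        # Cache miss: go to the network, then evict the oldest shard if full.
        self.misses += 1
        data = self.fetch(shard_id)
        self._shards[shard_id] = data
        if len(self._shards) > self.capacity:
            self._shards.popitem(last=False)
        return data


def fake_edge_fetch(shard_id: str) -> bytes:
    """Placeholder for a network request; returns dummy bytes."""
    return shard_id.encode()


cache = ShardCache(capacity=2, fetch=fake_edge_fetch)
for shard in ["layer0", "layer1", "layer0", "layer2", "layer1"]:
    cache.get(shard)
print(f"hits={cache.hits} misses={cache.misses}")
```

Every hit is a request that never leaves the device, which is where the latency, bandwidth, and resiliency benefits would come from; every miss still depends on the edge network.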
In addition to the continued need for a fast-connected edge data center (albeit one with vastly reduced computational and energy requirements), there’s another issue: Just how powerful an LLM can you really run on today’s hardware? And while there is less concern about on-device data being intercepted across a network, there is the added security risk of sensitive data being compromised on the local device if it isn’t properly managed. There is also the challenge of updating the model data and maintaining data consistency across a large number of distributed edge caching devices.
And finally, there is the cost: Who will foot the bill for all these mini edge data centers? Edge networking today is provided by edge service providers (such as Equinix), whose facilities are used by services such as Netflix and Apple’s iTunes; traditionally, it has not come from mobile network operators such as AT&T, T-Mobile, or Verizon. Generative AI service providers such as OpenAI/Microsoft, Google, and Meta would need to work out similar arrangements.
There are a lot of considerations with on-device generative AI, but it’s clear that tech companies are thinking about it. Within five years, your on-device intelligent assistant could be thinking all by itself. Ready for AI in your pocket? It’s coming — and far sooner than most people ever expected.