Apple Enables Larger AI Models to Run on Smartphones

2023-12-26

Over the past year, the high-tech world went through a revolution as artificial intelligence displaced the metaverse and became the hottest topic on the internet. Suddenly everyone was building large language models (LLMs), but most of them ran in the cloud on powerful server hardware. Smartphones simply do not have enough memory to hold the largest and most capable models, but Apple believes it has a solution. In a new research paper, Apple engineers propose storing LLM parameters in an iPhone's NAND flash storage and loading them into RAM on demand.


With companies like Qualcomm and Intel building machine-learning hardware into their latest chips, your next device may have everything it needs to run AI locally. The problem is that large language models are simply too big. During inference, billions (and in the largest models, trillions) of parameters have to be held in memory, and phone RAM is very limited; even Apple's iPhone 15 Pro tops out at 8GB.
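To put the size problem in concrete terms, here is a rough back-of-the-envelope sketch (illustrative figures of our own, not numbers from Apple's paper) of how much memory the weights of a mid-sized 7-billion-parameter model would need at common precisions:

```python
# Rough estimate of how much RAM a model's weights need (illustrative only).
def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the weights in gigabytes."""
    return num_params * bytes_per_param / 1e9

# A 7-billion-parameter model at common precisions (assumed, hypothetical setup):
for label, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    size = weight_footprint_gb(7e9, bytes_per_param)
    print(f"7B model @ {label}: ~{size:.1f} GB of weights")

# Prints roughly 14.0 GB (fp16), 7.0 GB (int8) and 3.5 GB (int4) --
# even the 4-bit version leaves little of an 8GB phone's RAM for the OS and apps.
```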


In data centers, the AI accelerator cards that run these models carry far more memory than comparable graphics cards. The Nvidia H100, for example, is equipped with 80GB of HBM2e memory, while the gaming-focused RTX 4090 has only 24GB of GDDR6X.


Google is tackling mobile LLMs with its new Gemini family, which includes a "Nano" version designed specifically for smartphones. Apple's new research instead leans on NAND flash storage, which typically offers at least ten times the capacity of a phone's RAM, to fit a larger model onto the device. The catch is speed: flash memory is far slower than RAM.
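To see why speed is the sticking point, consider a rough comparison (the bandwidth figures below are assumptions for illustration, not measurements from the paper) of how long it would take to stream a quantized model's weights from RAM versus from flash:

```python
# Illustrative comparison of load times from DRAM vs. NAND flash.
# Bandwidth numbers are rough, assumed figures for a modern phone.
WEIGHTS_GB = 3.5            # e.g. a 7B model quantized to 4 bits
DRAM_GB_PER_S = 50.0        # assumed LPDDR5 read bandwidth
FLASH_GB_PER_S = 1.5        # assumed NAND sequential read bandwidth

print(f"From DRAM:  ~{WEIGHTS_GB / DRAM_GB_PER_S * 1000:.0f} ms to read all weights")
print(f"From flash: ~{WEIGHTS_GB / FLASH_GB_PER_S * 1000:.0f} ms to read all weights")
# Roughly 70 ms vs. 2,300 ms: naively streaming everything from flash for each
# token would dominate inference time, so the trick is to read as little as possible.
```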


Apple NAND Speed Boost


According to the paper, the team used two techniques to run models that are larger than the available RAM, both of which reduce how much data has to be loaded from flash storage. Windowing lets the model reuse the parameters already loaded for the last few tokens, essentially recycling data so that only newly needed parameters are fetched from storage. Row-column bundling stores related rows and columns of the weight matrices contiguously, so the model can read flash in larger, more efficient chunks.
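Below is a minimal, simplified sketch of how those two ideas could fit together (our own illustration under assumed data layouts, not Apple's implementation): a cache keeps the weights of neurons activated during the last few tokens in RAM (windowing), and each neuron's up-projection column and down-projection row are stored contiguously so a single sequential read from flash fetches both (row-column bundling).

```python
import numpy as np

WINDOW = 5  # number of recent tokens whose active neurons stay cached (assumed value)

class FlashWeightCache:
    """Simplified sketch: keep weights for recently active neurons in RAM
    and fetch only the missing ones from flash-backed storage."""

    def __init__(self, bundled_weights):
        # bundled_weights[i] holds neuron i's up-projection column and
        # down-projection row stored contiguously (row-column bundling),
        # so one sequential read brings in everything that neuron needs.
        self.bundled = bundled_weights   # e.g. an np.memmap over a file on flash
        self.cache = {}                  # neuron id -> weights currently in RAM
        self.recent = []                 # active-neuron sets for the last WINDOW tokens

    def load_for_token(self, active_neurons: set) -> dict:
        # Windowing: only neurons not already cached from recent tokens
        # need to be read from flash.
        missing = active_neurons - self.cache.keys()
        for i in missing:
            self.cache[i] = np.asarray(self.bundled[i])  # one bundled read per neuron

        # Slide the window: evict neurons no longer used by any recent token.
        self.recent.append(active_neurons)
        if len(self.recent) > WINDOW:
            expired = self.recent.pop(0)
            still_needed = set().union(*self.recent)
            for i in expired - still_needed:
                self.cache.pop(i, None)

        return {i: self.cache[i] for i in active_neurons}
```

In Apple's paper the selective loading targets the sparsely activated feed-forward layers; here the neuron indices simply stand in for whichever units are predicted to be active for the current token.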


The research succeeds in expanding what LLMs can do on an iPhone. With this approach, inference runs 4-5 times faster than naive flash loading on a standard CPU and 20-25 times faster on a GPU. Perhaps most importantly, an iPhone can run models up to twice the size of its available RAM by keeping the parameters in internal flash storage and loading them as needed. The paper concludes that this method paves the way for running LLMs on devices with limited memory.