Apple Flash: Enhancing Intelligence in Our Smart Devices

2023-12-29

Now imagine enhancing these capabilities: engaging in deep natural-language conversations to answer academic or personal questions; comparing our vital signs against global databases to flag urgent health issues; drawing on vast databases to provide real-time translation between speakers of different languages; or conversing with GPS software for detailed tips on the best burgers, movies, hotels, or scenic viewpoints along your route.

We have witnessed significant progress in communication between humans and the technology we increasingly rely on, thanks to the power of large language models and natural language processing.

However, there is a barrier to running such AI on our portable devices, and Apple researchers say they are tackling it.

The problem is memory. Large language models need enormous amounts of it: their weights can reach into the trillions of parameters, while an ordinary smartphone such as Apple's iPhone 15 carries a mere 8 GB of memory, far short of what the task requires.
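To make the mismatch concrete, here is a quick back-of-the-envelope check (the model size is illustrative, not a figure from Apple's paper):

```python
# Rough memory footprint of a mid-sized language model (illustrative numbers).
params = 7e9              # a 7-billion-parameter model
bytes_per_param = 2       # 16-bit (fp16/bf16) weights
print(f"{params * bytes_per_param / 1e9:.0f} GB")   # -> 14 GB, well above 8 GB of DRAM
```

Even aggressive 4-bit quantization would still leave such a model at roughly 3.5 GB, before accounting for the operating system and other apps competing for the same memory.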

In a paper uploaded to the preprint server arXiv on December 12th, Apple announced that they have developed a method that keeps model parameters in flash memory and transfers them into DRAM only as needed, allowing smart devices to run a powerful AI system.

The researchers state that their approach can run AI models up to twice the size of the device's available DRAM, speeding up CPU inference by as much as 500% and GPU inference by as much as 25 times compared to current methods.

In their paper titled "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory," the researchers state, "Our approach involves constructing an inference cost model that is coordinated with flash behavior, guiding our optimizations in two key areas: reducing the amount of data transferred from flash to RAM and reading data in larger, more contiguous blocks."

The two techniques they employ are:

  1. Windowing, which reduces the amount of data exchanged between flash and RAM by reusing the data already loaded for recently processed tokens, minimizing I/O requests and saving energy and time.
  2. Row-column bundling, which achieves higher efficiency by reading larger, contiguous chunks of data from flash in a single operation (a simplified sketch of both ideas follows this list).
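The sketch below is a heavily simplified illustration in Python/NumPy, not Apple's implementation: the names (`flash`, `dram`, `ffn_forward`, `WINDOW`) are hypothetical, flash storage is faked with an in-memory dictionary, and the activation "predictor" is a naive stand-in for the learned predictor the paper describes. It only shows the shape of the two ideas: reuse neurons already loaded for recent tokens (windowing), and fetch each neuron's up-projection row together with its down-projection column in one read (row-column bundling).

```python
# Minimal sketch of windowing + row-column bundling (assumptions labeled above).
import numpy as np

D_MODEL, D_FFN, WINDOW = 64, 256, 5      # toy sizes; real models are far larger

rng = np.random.default_rng(0)
up = rng.standard_normal((D_FFN, D_MODEL)).astype(np.float32)    # FFN up-projection (rows)
down = rng.standard_normal((D_MODEL, D_FFN)).astype(np.float32)  # FFN down-projection (columns)

# Row-column bundling: each neuron's up-projection row and down-projection
# column are stored contiguously, so a single flash read returns both.
flash = {i: np.concatenate([up[i], down[:, i]]) for i in range(D_FFN)}

dram = {}        # neuron id -> bundled weights currently resident in DRAM
history = []     # active neuron sets for the last WINDOW tokens

def ffn_forward(x):
    """Process one token, loading only newly active neurons from 'flash'."""
    # Hypothetical predictor: guess which neurons survive the ReLU.
    active = set(np.flatnonzero(up @ x > 0).tolist())

    # Windowing: neurons loaded for recent tokens are reused; only the
    # difference is fetched, shrinking flash -> DRAM traffic.
    for i in active - dram.keys():
        dram[i] = flash[i]

    # Slide the window and evict neurons no recent token needs anymore.
    history.append(active)
    if len(history) > WINDOW:
        history.pop(0)
    needed = set().union(*history)
    for i in list(dram):
        if i not in needed:
            del dram[i]

    # Compute the sparse FFN output from the resident bundles.
    y = np.zeros(D_MODEL, dtype=np.float32)
    for i in active:
        u, d = dram[i][:D_MODEL], dram[i][D_MODEL:]   # unbundle row / column
        y += d * max(float(u @ x), 0.0)               # ReLU(u . x) * down column
    return y

x = rng.standard_normal(D_MODEL).astype(np.float32)
print(ffn_forward(x).shape)   # -> (64,)
```

In a real system the loop over active neurons would be vectorized and the flash reads issued as batched sequential I/O; the point of the sketch is only which data moves between flash and DRAM, and when.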

The researchers state, "These two processes together result in a significant reduction in data volume and improved memory utilization."

They add, "This breakthrough is crucial for deploying advanced LLMs in resource-constrained environments, thus expanding their applicability and accessibility."

In another recent breakthrough, Apple announced that they have designed a program called HUGS, which can create animated avatars from just a few seconds of single-camera video, whereas current avatar-creation tools require multiple camera angles. The paper, titled "HUGS: Human Gaussian Splats," was uploaded to arXiv on November 29th.

Apple states that their program can create realistic dancing avatars in just 30 minutes, much faster than the two days currently required by popular methods.