Transformer-Based Large Language Models Shine in NLP Despite High Training and Deployment Costs


2024-06-14

Transformer-based generative large language models (LLMs) have proven highly effective across a wide range of natural language processing (NLP) applications. Although many applications benefit from LLMs, the high cost of training and deploying them deters most developers. To address this challenge, leading AI companies such as OpenAI, Google, and Baidu offer Language Model as a Service (LMaaS), which gives developers access to LLMs through APIs.

Challenges in LMaaS Scenarios

In LMaaS scenarios, developers send user input messages and specific instructions to LLM services. To improve quality of service (QoS) and support more customers, service providers constantly look for ways to reduce response time and increase throughput. However, existing serving systems such as TensorFlow Serving and Triton Inference Server process queries inefficiently. They execute queries in first-come, first-served (FCFS) order with fixed batch sizes, which limits how much of the GPU's parallel computing capability can be used and can cause out-of-memory (OOM) errors.
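The tension between a fixed batch size and GPU memory can be illustrated with a rough estimate of key-value cache size. This is a minimal sketch, assuming a standard multi-head attention cache that stores one key and one value vector per layer per token; the model dimensions and the memory budget are hypothetical and are not taken from the systems named above.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, hidden_size, bytes_per_value=2):
    """Approximate key-value cache size for one batch.

    The factor 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit floats (an assumption).
    """
    return 2 * n_layers * hidden_size * bytes_per_value * batch_size * seq_len


def max_batch_size(free_bytes, seq_len, n_layers, hidden_size, bytes_per_value=2):
    """Largest batch whose cache fits in free_bytes at the given sequence length."""
    per_request = kv_cache_bytes(1, seq_len, n_layers, hidden_size, bytes_per_value)
    return free_bytes // per_request


# Hypothetical model shape and memory budget, for illustration only.
N_LAYERS, HIDDEN = 28, 4096
FREE = 16 * 1024**3  # 16 GiB assumed free for the cache

short_limit = max_batch_size(FREE, 256, N_LAYERS, HIDDEN)
long_limit = max_batch_size(FREE, 2048, N_LAYERS, HIDDEN)
print(short_limit, long_limit)
```

Under these assumptions, a batch size that is safe for long sequences leaves most of the memory unused for short ones, while a batch size tuned for short sequences risks OOM when long ones arrive.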

Continuous Batch Processing Approach and Its Limitations

One proposed remedy is continuous batching, which dynamically removes completed requests from a running batch and adds new ones. However, this approach often relies on conservative GPU memory management, which underuses the GPU's parallel processing capability and therefore limits throughput. Other strategies, such as model quantization and pruning, reduce memory consumption but may compromise the quality of the generated output.
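The idea of removing completed requests and admitting new ones at every step can be sketched as a small simulation. This is an illustrative model only, assuming each running request produces one token per decode step and that the number of tokens to generate is known in advance; the function and variable names are hypothetical.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate decode steps under continuous batching.

    requests: list of (request_id, tokens_to_generate), in arrival order.
    Returns a dict mapping request_id to the step at which it finished.
    """
    waiting = deque(requests)
    running = {}      # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Fill any free slots with waiting requests before the next step.
        while waiting and len(running) < max_batch:
            rid, remaining = waiting.popleft()
            running[rid] = remaining
        step += 1
        # One decode step produces one token for every running request.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] <= 0:
                del running[rid]          # slot is freed immediately
                finished_at[rid] = step
    return finished_at


result = continuous_batching([("a", 2), ("b", 5), ("c", 1)], max_batch=2)
print(result)
```

In this example request "c" takes the slot freed by "a" and finishes at step 3, instead of waiting for the longer request "b" to complete.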

Magnus System: Optimizing Batch Processing Services in LMaaS

A research team in China has proposed the Magnus system, which uses application-level and user-level semantic information, together with the length of the user input, to predict the generation length of each request. Magnus consists of four core components: length prediction, adaptive batch processing, batch scheduling, and a service time estimator.

Length Prediction: Uses a random forest regressor to estimate the generation length of each request from the user input, application-level semantic features, and user-level semantic features.
Adaptive Batch Processing: Groups requests with similar lengths and dynamically selects the optimal batch size to minimize wasted computation.
Batch Scheduling: Selects batches using the highest response ratio next (HRRN) strategy to minimize request queuing time and response time.
Service Time Estimator: Uses k-nearest neighbors (KNN) regression to predict the service time of each batch, further improving service quality.
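Two of the components above, length-based grouping and HRRN scheduling, can be sketched in a few lines. This is a simplified illustration, not the Magnus implementation: predicted lengths and service-time estimates are supplied as plain numbers rather than produced by a random forest or KNN model, the batch size is capped by a fixed hypothetical limit rather than chosen dynamically, and all names are invented for the example. The response ratio uses the standard HRRN definition, (waiting time + service time) / service time.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    predicted_length: int  # stand-in for the output of a length predictor


def group_by_length(requests, max_gap, max_batch):
    """Greedily batch requests whose predicted lengths lie within
    max_gap of the shortest request already in the batch."""
    batches = []
    for req in sorted(requests, key=lambda r: r.predicted_length):
        current = batches[-1] if batches else None
        if (current and len(current) < max_batch
                and req.predicted_length - current[0].predicted_length <= max_gap):
            current.append(req)
        else:
            batches.append([req])
    return batches


def response_ratio(waiting_time, service_time):
    """Standard HRRN priority: grows as a batch waits longer."""
    return (waiting_time + service_time) / service_time


def pick_next(candidates):
    """candidates: list of (name, waiting_time, estimated_service_time).
    Returns the name of the batch with the highest response ratio."""
    return max(candidates, key=lambda c: response_ratio(c[1], c[2]))[0]


reqs = [Request("a", 10), Request("b", 500), Request("c", 12),
        Request("d", 505), Request("e", 15)]
batches = group_by_length(reqs, max_gap=20, max_batch=8)
print([[r.request_id for r in b] for b in batches])
print(pick_next([("short", 2.0, 1.0), ("long", 6.0, 10.0)]))
```

Grouping keeps short and long requests in separate batches, so short requests are not held back by long ones in the same batch. HRRN favors short batches at first, but a long batch's ratio rises as it waits, which prevents starvation.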

Performance Testing and Validation of the Magnus System

In tests on NVIDIA V100 GPUs serving ChatGLM-6B instances, Magnus showed substantial performance gains. Compared to the baseline method, it increased request throughput by 234% and reduced response time by 89.7%. These results support the effectiveness of using generation length prediction to optimize batch serving in LMaaS.
