Transformer-based generative large language models (LLMs) have demonstrated strong performance across a wide range of natural language processing (NLP) applications. Despite the benefits these applications gain from LLMs, the high costs of training and deployment put them out of reach for most developers. To address this challenge, leading AI companies such as OpenAI, Google, and Baidu have introduced Language Model as a Service (LMaaS), giving developers access to LLMs through APIs.
Challenges in LMaaS Scenarios
In LMaaS scenarios, developers send user input messages and specific instructions to LLM services. To improve quality of service (QoS) and serve more customers, service providers constantly seek ways to reduce response time and increase throughput. However, existing systems such as TensorFlow Serving and Triton Inference Server suffer from inefficiencies in query processing: they execute queries first-come, first-served (FCFS) with fixed batch sizes, which limits the GPU's parallel computing capability and can cause out-of-memory (OOM) failures.
Continuous Batch Processing Approach and Its Limitations
Continuous batching has been proposed to address these issues: completed requests are removed from the running batch and new requests are admitted on the fly. However, existing implementations often rely on conservative GPU memory management, sacrificing GPU parallelism and thus limiting throughput. Meanwhile, memory-saving strategies such as model quantization and pruning can compromise the quality of generated outputs.
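To make the idea concrete, the loop below sketches continuous batching in simplified Python. The request queue, decoding step, and memory check (`pending`, `step`, `fits_in_memory`) are illustrative placeholders, not the API of any particular serving framework.

```python
from collections import deque

def continuous_batching_loop(pending: deque, fits_in_memory, step):
    """Illustrative continuous batching: finished requests leave the batch
    after every decoding step and waiting requests are admitted immediately,
    instead of waiting for an entire fixed-size batch to complete."""
    active = []
    while pending or active:
        # Admit new requests as long as a (conservative) memory budget allows.
        while pending and fits_in_memory(active, pending[0]):
            active.append(pending.popleft())

        # One decoding step for every active request; step() returns the
        # requests that emitted their end-of-sequence token this iteration.
        finished = step(active)

        # Completed requests are removed right away, freeing their slots.
        active = [r for r in active if r not in finished]
```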
Magnus System: Optimizing Batch Processing Services in LMaaS
A research team in China has proposed the Magnus system, which leverages application-level and user-level semantic information, together with the length of the user input, to predict the length of each generated response. Magnus consists of four core components: batch scheduling, adaptive batch processing, a service time estimator, and length prediction.
- Length Prediction: Uses a random forest regressor to estimate each request's response length from the user input together with application-level and user-level semantic features (a sketch follows this list).
- Adaptive Batch Processing: Groups requests with similar predicted lengths and dynamically selects the batch size that minimizes wasted computation (see the binning sketch below).
- Batch Scheduling: Selects the next batch to run using the highest response ratio next (HRRN) policy to reduce request queuing time and response time (see the scheduling sketch below).
- Service Time Estimator: Uses KNN regression to predict how long a batch will take to execute, which feeds the HRRN scheduler (also sketched below).
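As a rough illustration of the length-prediction component, the sketch below trains a scikit-learn random forest regressor on hand-picked features (input length plus application- and user-level embeddings). The feature layout, the `build_features` helper, and the synthetic data are assumptions for illustration, not the features or data Magnus actually uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def build_features(input_len, app_embedding, user_embedding):
    # Hypothetical feature vector: user input length concatenated with
    # application-level and user-level semantic embeddings.
    return np.concatenate(([input_len], app_embedding, user_embedding))

# Synthetic stand-in for historical (request features, response length) pairs.
n, emb_dim = 500, 8
input_lens = rng.integers(10, 500, size=n)
app_embs = rng.normal(size=(n, emb_dim))
user_embs = rng.normal(size=(n, emb_dim))
X = np.stack([build_features(l, a, u)
              for l, a, u in zip(input_lens, app_embs, user_embs)])
y = input_lens * 0.5 + rng.normal(scale=20, size=n)  # fake observed lengths

length_predictor = RandomForestRegressor(n_estimators=100, random_state=0)
length_predictor.fit(X, y)

# Predict the generation length of a new request before it is batched.
new_x = build_features(120, rng.normal(size=emb_dim), rng.normal(size=emb_dim))
predicted_len = length_predictor.predict(new_x.reshape(1, -1))[0]
print(f"predicted response length: {predicted_len:.0f} tokens")
```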
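The adaptive batching step can be pictured as binning requests by predicted length and capping each batch so the padded token count fits a GPU memory budget. The bin width and token budget below are made-up parameters, not values from the paper.

```python
from collections import defaultdict

def form_batches(requests, bin_width=64, token_budget=8192):
    """Group requests with similar predicted total lengths and size each
    batch so batch_size * padded_length stays under a token budget.
    `requests` is a list of (request_id, predicted_total_len) tuples."""
    bins = defaultdict(list)
    for req_id, pred_len in requests:
        bins[pred_len // bin_width].append((req_id, pred_len))

    batches = []
    for _, reqs in sorted(bins.items()):
        reqs.sort(key=lambda r: r[1])
        batch = []
        for req in reqs:
            padded_len = max(r[1] for r in batch + [req])
            # Admit the request only if the padded batch still fits the budget.
            if (len(batch) + 1) * padded_len > token_budget:
                batches.append(batch)
                batch = [req]
            else:
                batch.append(req)
        if batch:
            batches.append(batch)
    return batches

# Example: request ids with predicted total lengths (prompt + predicted response).
print(form_batches([(1, 120), (2, 130), (3, 700), (4, 110), (5, 650)]))
```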
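Finally, batch scheduling and service time estimation fit together roughly as follows: a KNN regressor (here scikit-learn's KNeighborsRegressor on batch size and padded length, features chosen purely for illustration) estimates how long a candidate batch would run, and the batch with the highest response ratio, i.e. (waiting time + estimated service time) / estimated service time, is dispatched next.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Service-time estimator trained on past batch executions.
# Illustrative features: (batch size, padded length); target: seconds.
history_X = np.array([[2, 128], [4, 128], [8, 256], [4, 512], [8, 512]])
history_y = np.array([0.4, 0.6, 1.5, 1.8, 3.2])
service_time_estimator = KNeighborsRegressor(n_neighbors=3).fit(history_X, history_y)

def response_ratio(waiting_time, service_time):
    # Highest response ratio next (HRRN): long-waiting batches rise in
    # priority, while short batches are favoured when waits are equal.
    return (waiting_time + service_time) / service_time

def pick_next_batch(candidates):
    """`candidates`: dicts with 'batch_size', 'padded_len', 'waiting_time'."""
    def priority(c):
        est = service_time_estimator.predict(
            [[c["batch_size"], c["padded_len"]]])[0]
        return response_ratio(c["waiting_time"], est)
    return max(candidates, key=priority)

print(pick_next_batch([
    {"batch_size": 8, "padded_len": 512, "waiting_time": 2.0},
    {"batch_size": 2, "padded_len": 128, "waiting_time": 1.0},
]))
```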
Performance Testing and Validation of the Magnus System
In tests on an NVIDIA V100 GPU with ChatGLM-6B instances, the Magnus system delivered significant performance improvements: compared to the baseline method, it increased request throughput by 234% and reduced response time by 89.7%. These results demonstrate the effectiveness of using length prediction to optimize batched serving in LMaaS.