Cloudflare Optimizes MLOps: Driving Efficient Deployment of AI Models at Scale

2023-12-21

Cloudflare's blog describes its MLOps platform and its best practices for running artificial intelligence (AI) deployments at scale. Cloudflare products such as WAF attack scoring, bot management, and global threat identification rely on continually evolving machine learning (ML) models. These models play a crucial role in enhancing customer protection and supporting services. Cloudflare says it has achieved unprecedented scale in delivering ML across its network, which underscores the importance of a robust ML training methodology.

Cloudflare's MLOps team collaborates with data scientists to implement best practices. Jupyter notebooks, deployed on Kubernetes through JupyterHub, provide a scalable and collaborative environment for data exploration and model experimentation. GitOps is the cornerstone of Cloudflare's MLOps strategy, with Git as the single source of truth for managing infrastructure and deployment processes. Argo CD handles declarative GitOps, automating the deployment and management of applications and infrastructure.

The roadmap includes migrating the platform to Kubeflow, a machine learning workflow platform on Kubernetes that recently became a CNCF incubating project. The transition is facilitated by the deployKF project, which provides distributed configuration management for Kubeflow components.

To help data scientists launch projects confidently and efficiently with the right tools, Cloudflare's MLOps team provides model templates: production-ready repositories that ship with example models. The templates are currently used internally, but Cloudflare plans to open-source them. They cover the following use cases:

- Training templates: Configured for ETL processes, experiment tracking, and DAG-based orchestration.
- Batch inference templates: Optimized for efficient processing through scheduled model runs.
- Streaming inference templates: Customized for real-time inference using FastAPI on Kubernetes.
- Explainability templates: Generate model insight dashboards using tools like Streamlit and Bokeh.

Another key task of the MLOps platform is orchestrating ML workflows efficiently. Cloudflare adopts different orchestration tools depending on team preferences and use cases:

- Apache Airflow: A standard DAG orchestrator with extensive community support.
- Argo Workflows: Kubernetes-native orchestration for microservice workflows.
- Kubeflow Pipelines: Designed specifically for ML workflows, emphasizing collaboration and version control.
- Temporal: A stateful workflow engine for event-driven applications.

Optimizing performance involves understanding workloads and adjusting hardware accordingly. Cloudflare emphasizes using GPUs for core data center workloads and for edge inference, and it uses Prometheus metrics for observability and optimization.

Cloudflare attributes successful adoption to simplifying ML workflows, standardizing pipelines, and introducing projects to teams that lack data science expertise. The company's vision is for data science to play a critical role in the business, which is why Cloudflare invests in its AI infrastructure and works with other companies such as Meta to make Llama 2 available globally on its platform.
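
The DAG-based orchestration that the training templates and tools such as Apache Airflow rely on can be illustrated with a minimal sketch. It uses only Python's standard-library `graphlib` rather than any of the orchestrators named above, and every task name, the toy data, and the stand-in "model" are hypothetical, not taken from Cloudflare's templates.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps; each one reads from and writes to a shared context dict.
def extract(ctx):
    ctx["rows"] = [1.0, 2.0, 3.0, 4.0]  # toy data standing in for an ETL source

def transform(ctx):
    ctx["features"] = [x * 2 for x in ctx["rows"]]

def train(ctx):
    # Stand-in "model": simply the mean of the features.
    ctx["model"] = sum(ctx["features"]) / len(ctx["features"])

def evaluate(ctx):
    # Mean absolute error of the constant prediction against the features.
    errors = [abs(x - ctx["model"]) for x in ctx["features"]]
    ctx["mae"] = sum(errors) / len(errors)

TASKS = {
    "extract": extract,
    "transform": transform,
    "train": train,
    "evaluate": evaluate,
}

# The DAG: each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "transform": {"extract"},
    "train": {"transform"},
    "evaluate": {"train", "transform"},
}

def run_pipeline():
    """Run every task in an order that respects the dependency graph."""
    ctx = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx

if __name__ == "__main__":
    order, ctx = run_pipeline()
    print(order, ctx["model"], ctx["mae"])
```

Because the dependencies here form a single chain, the only valid execution order is extract, transform, train, evaluate. A production orchestrator adds what this sketch leaves out: scheduling, retries, parallel execution of independent tasks, and persisted run state.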