Docker Model Runner Aims to Simplify Local LLM Model Execution

2025-04-23

Docker Model Runner is now available as a preview for Docker Desktop 4.40 on macOS with Apple Silicon. It enables developers to run models locally and iterate on application code against them, all without disrupting their container-based workflows.

Developing with local LLMs offers several benefits, such as cost reduction, enhanced data privacy, reduced network latency, and greater control over the model.

Docker Model Runner addresses multiple pain points developers face when integrating LLMs into containerized applications, including managing different tools, configuring environments, and handling models outside of containers. Additionally, there's no standardized approach for storing, sharing, or serving models. To minimize related friction, Docker Model Runner includes:

  • An inference engine that is part of Docker Desktop, built on llama.cpp, and accessible via the familiar OpenAI API. No extra tools, setup, or workflow interruptions are required. Everything is centralized, allowing you to quickly test and iterate directly on your machine.

  • Host-based execution, which avoids the performance overhead typically associated with virtual machines. Models run directly on Apple Silicon, leveraging GPU acceleration, which is critical for inference speed and smooth development cycles.

  • Model distribution based on the OCI standard, the same specification used for container distribution; Docker unsurprisingly bets on unifying both into a single workflow.

You can now effortlessly pull ready-to-use models from Docker Hub. Soon, you'll also be able to push your own models, integrate with any container registry, connect them to your CI/CD pipelines, and manage access control and automation using familiar tools.

If you're using Docker Desktop 4.40 for macOS on Apple Silicon, you can use docker model commands that support workflows similar to those you're accustomed to with images and containers. For example, you can pull and run a model, specifying an exact version, such as its size or quantization, with a tag:

docker model pull ai/smollm2:360M-Q4_K_M
docker model run ai/smollm2:360M-Q4_K_M "Give me a fact about whales."

However, the mechanism behind these commands is specific to models, as they don't actually create containers. Instead, the run command delegates the inference task to an inference server running natively as a host process on top of llama.cpp. The inference server loads the model into memory and caches it there for a while.

You can use Model Runner with any OpenAI-compatible client or framework via the OpenAI endpoint http://model-runner.docker.internal/engines/v1, which is available inside containers. You can also access the endpoint from the host if you enable TCP host access by running docker desktop enable model-runner --tcp 12434.
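As a minimal sketch, a Python client using the official openai package might look like the following, assuming TCP host access has been enabled on port 12434 as above (the exact base URL and model tag depend on your setup):

from openai import OpenAI

# Point the OpenAI SDK at the local Model Runner endpoint instead of api.openai.com.
# Assumes TCP host access was enabled with: docker desktop enable model-runner --tcp 12434
client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="ai/smollm2:360M-Q4_K_M",  # a model previously pulled with docker model pull
    messages=[{"role": "user", "content": "Give me a fact about whales."}],
)
print(response.choices[0].message.content)

Because the server speaks the OpenAI API, any framework that lets you override the base URL should work the same way.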

Docker Hub hosts various models compatible with Model Runner, including smollm2 for on-device applications, as well as llama3.3 and gemma3. Docker has also released a tutorial on integrating Gemma 3 into a comment-processing app with Model Runner. It covers common tasks such as configuring the OpenAI SDK to work with local models and generating test data using the model itself.
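The test-data generation step could look roughly like the following sketch, reusing the same local endpoint; the prompt, function name, and ai/gemma3 tag are illustrative rather than taken from the tutorial:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="not-needed")

def generate_test_comments(n: int = 5) -> list[str]:
    # Ask the locally running model to produce synthetic user comments for testing.
    response = client.chat.completions.create(
        model="ai/gemma3",  # illustrative tag; pull it first with docker model pull
        messages=[{
            "role": "user",
            "content": f"Generate {n} short, realistic user comments about a product, one per line.",
        }],
    )
    return response.choices[0].message.content.splitlines()

print(generate_test_comments())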

Docker Model Runner isn't the only option for running LLMs locally. If Docker’s container-centric approach doesn’t appeal to you, you might want to check out Ollama. It operates as a standalone tool, offering a larger model repository and community, along with more flexibility for model customization. While Docker Model Runner is currently macOS-only, Ollama is cross-platform. However, while Ollama supports GPU acceleration on Apple Silicon when running natively, this feature is unavailable when running within containers.