ServiceNow, Hugging Face, and Nvidia have announced the release of StarCoder2, the latest generation of their collaborative StarCoder series of open-access large language models designed for code generation.
StarCoder2 offers significantly better performance than its predecessor, along with greater speed and flexibility. The model also incorporates features to protect intellectual property, addressing the security needs of developers and enterprises.
Trained on 619 programming languages, StarCoder2 was developed by the BigCode community, an open research collaboration managed by ServiceNow and Hugging Face. The model is built on a new code dataset called The Stack v2, which is seven times larger than The Stack v1. Beyond the larger dataset, new training techniques help the model understand low-resource programming languages such as COBOL, as well as mathematics and discussions of program source code.
StarCoder2 has various applications and can be fine-tuned and embedded into enterprise applications for tasks such as source code generation, workflow generation, and text summarization. Developers can leverage the model's code completion, code summarization, and code snippet retrieval capabilities to improve code writing efficiency.
To cater to different user needs, StarCoder2 offers three model sizes to choose from: a 3 billion parameter model trained by ServiceNow, a 7 billion parameter model trained by Hugging Face, and a 15 billion parameter model built by Nvidia using its NeMo generative AI framework and trained on Nvidia infrastructure. The smaller variants can run on consumer-grade graphics processors, saving computational costs.
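The claim about consumer-grade hardware can be checked with rough arithmetic. The sketch below estimates the memory needed just to hold the weights of each model size at a few common numeric precisions. It assumes the nominal parameter counts (3, 7, and 15 billion) and ignores activations, the attention cache, and framework overhead, so real usage will be higher. The precision options are illustrative assumptions, not figures from the announcement.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: nominal parameter counts; exact counts differ slightly.
PARAM_COUNTS = {
    "3B": 3_000_000_000,
    "7B": 7_000_000_000,
    "15B": 15_000_000_000,
}

# Assumption: common storage precisions, chosen for illustration only.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Return the memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


for size, num_params in PARAM_COUNTS.items():
    estimates = ", ".join(
        f"{precision}: {weight_memory_gb(num_params, nbytes):.0f} GB"
        for precision, nbytes in BYTES_PER_PARAM.items()
    )
    print(f"StarCoder2-{size} -> {estimates}")
```

By this estimate, the 3 billion parameter model needs about 6 GB for its weights at 16-bit precision, which is within reach of many consumer graphics cards, while the 15 billion parameter model needs about 30 GB at the same precision.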
Notably, the 3 billion parameter model of StarCoder2 matches the performance of the original 15 billion parameter model of StarCoder and can make more accurate predictions. This is because training on a larger language corpus enables the model to provide better context-aware predictions.
With growing demand for AI in software development, AI coding tools have become essential assistants for developers, driven in part by early successes such as GitHub's Copilot and Amazon Web Services' CodeWhisperer. However, a recent GitHub survey found that although 91% of US developers use AI coding tools, nearly a quarter remain skeptical about the value of AI, and 28% said their employers prohibit its use.
In response to these concerns, the three sponsoring companies emphasize the transparency of StarCoder2. The model is built on responsibly sourced data used under license from Software Heritage, which hosts what it claims to be the largest public collection of source code. The supporting code for the model will reside on the BigCode project's GitHub page and will be available for royalty-free access and use under the BigCode OpenRAIL-M license. While this license has many open-source characteristics, it is not technically a fully open-source license. OpenRAIL-M imposes some restrictions, such as prohibiting licensed software from providing medical advice or performing judicial functions. All StarCoder2 models can be downloaded from Hugging Face, and the 15 billion parameter model is also available through Nvidia AI Foundation Models.
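For readers who want to try a downloaded model for code completion, the following is a minimal sketch using the Hugging Face `transformers` library. It assumes the 3 billion parameter checkpoint is published on the Hub as `bigcode/starcoder2-3b`, that a recent `transformers` release and PyTorch are installed, and that any license terms shown on the model page have been accepted. The helper name `complete_code` is hypothetical and used only for illustration.

```python
# Assumption: Hub repository name for the 3B checkpoint.
MODEL_ID = "bigcode/starcoder2-3b"


def complete_code(prompt: str, max_new_tokens: int = 64) -> str:
    """Return the prompt followed by a model-generated continuation.

    Hypothetical helper for illustration; it reloads the model on every call.
    """
    # Imported lazily so this file can be read without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Tokenize the prompt into PyTorch tensors and generate a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Example call (downloads several gigabytes of weights on first use):
# print(complete_code("def fibonacci(n):"))
```

Loading the model once and reusing it across calls would be the more practical design; the single-function form is kept here only for brevity.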