Meta Open-Sources Llama 3: The Next-Generation Large Language Model

2024-04-19

Meta has released Llama 3, the latest generation of open generative AI models in its Llama series. Specifically, Meta has introduced two models in the new series, with the remaining models to be released at a later date.

Meta describes the newly released models - Llama 3 8B with 8 billion parameters and Llama 3 70B with 70 billion parameters - as a 'significant leap' in performance over the previous-generation Llama models (Llama 2 7B and Llama 2 70B). In the field of AI, parameters essentially define a model's capacity for tasks such as analyzing and generating text, with higher-parameter-count models generally being more capable than lower-count ones.

Meta states that for their respective parameter sizes, both Llama 3 8B and Llama 3 70B - trained on two custom clusters with 24,000 GPUs - are among the top-performing generative AI models currently available.

This claim is quite bold. So how does Meta support it? The company points to the Llama 3 models' scores on popular AI benchmarks such as MMLU (measuring knowledge), ARC (measuring skill acquisition), and DROP (testing a model's ability to reason over text passages). While the effectiveness and practicality of these benchmarks are still debatable, they remain among the standard ways for AI players like Meta to evaluate their models.

In at least nine benchmarks, Llama 3 8B outperforms other open source models such as Mistral's Mistral 7B and Google's Gemma 7B, both of which have 7 billion parameters. These benchmarks include MMLU, ARC, DROP, GPQA (a set of questions related to biology, physics, and chemistry), HumanEval (code generation test), GSM-8K (math word problems), MATH (another math benchmark), AGIEval (problem-solving test set), and BIG-Bench Hard (common-sense reasoning evaluation).

It is worth noting that Mistral 7B and Gemma 7B are not cutting-edge models (Mistral 7B was released in September of last year), and in several of the benchmarks Meta cites, Llama 3 8B scores only a few percentage points higher. However, Meta also claims that the larger Llama 3 model - Llama 3 70B - is competitive with flagship generative AI models, including the latest member of Google's Gemini series, Gemini 1.5 Pro.

In benchmarks such as MMLU, HumanEval, and GSM-8K, Llama 3 70B outperforms Gemini 1.5 Pro. And although it falls short of Anthropic's best-performing model, Claude 3 Opus, Llama 3 70B scores higher than the second-weakest model in the Claude 3 series, Claude 3 Sonnet, on five benchmarks (MMLU, GPQA, HumanEval, GSM-8K, and MATH).

Additionally, Meta has developed its own test set, covering various application scenarios from coding and creative writing to reasoning and summarization. Surprisingly, Llama 3 70B stands out in comparisons with Mistral's Mistral Medium model, OpenAI's GPT-3.5, and Claude Sonnet. Meta states that to ensure objectivity, their modeling team did not have access to this test set, but given that the test set was designed by Meta itself, these results must be taken with caution.

From a more qualitative perspective, Meta says users of the new Llama models can expect greater 'controllability,' a lower likelihood of refusing to answer questions, and higher accuracy on trivia questions, history-related questions, STEM fields (such as engineering and science), and general coding recommendations. This is partly attributed to a much larger dataset: an astonishing 15 trillion tokens, or approximately 750 billion words, seven times the size of the Llama 2 training set. (In the field of AI, 'tokens' refer to subdivided pieces of raw data, such as the syllables 'fan,' 'tas,' and 'tic' in the word 'fantastic.')
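As a rough illustration of how text breaks into sub-word tokens, here is a toy greedy longest-match tokenizer in Python. This is a simplification invented for this article - Llama models actually use a learned byte-pair-encoding vocabulary with roughly 128,000 entries, not a hand-picked one - but it shows the basic idea of splitting a word into pieces drawn from a fixed vocabulary:

```python
def tokenize(word, vocab):
    """Split `word` into the longest pieces found in `vocab` (toy example)."""
    tokens = []
    i = 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matches: emit the character as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("fantastic", {"fan", "tas", "tic"}))  # → ['fan', 'tas', 'tic']
```

Counting a training corpus in tokens rather than words is standard practice because models consume exactly these pieces, and one word often maps to several tokens.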

Regarding this data, Meta has not disclosed specific sources, saying only that it comes from 'publicly available sources.' The dataset reportedly includes four times more code than the Llama 2 dataset, and 5% of it is non-English data covering approximately 30 languages, intended to improve the model's performance in languages other than English. Meta also says it used synthetic data - data generated by AI - to create longer training documents for Llama 3, an approach that is somewhat controversial because of its potential performance drawbacks, though Meta appears confident in it.

Meta explains in an article shared with the tech blog TechCrunch, 'Although the models we currently release have only been fine-tuned for English output, the increased diversity of data helps the model better recognize subtle differences and patterns and demonstrate superior performance across a range of tasks.'

Many generative AI vendors view training data as a competitive advantage and therefore keep it and related information tightly under wraps. Training-data details can also become fodder for intellectual-property lawsuits, another strong incentive against disclosure. Recent reports indicated that Meta, in a bid to keep pace with AI rivals, had used copyrighted e-books for AI training despite warnings from its own lawyers. Meta and OpenAI are now the targets of ongoing lawsuits filed by authors, including comedian Sarah Silverman, accusing the vendors of using copyrighted data for training without authorization.

So, what about the other common concerns with generative AI models - toxicity and bias? Has Llama 3 improved in these aspects? Meta claims that there have indeed been improvements.

Meta states that it has developed new data-filtering pipelines to improve the quality of its training data, and has updated its two generative AI safety suites - Llama Guard and CyberSecEval - to help prevent misuse of, and unwanted text generations from, Llama 3 models and others. The company has also released a new tool called Code Shield, designed to detect code produced by generative AI models that might introduce security vulnerabilities.

However, it is important to note that data filtering is not foolproof, and tools like Llama Guard, CyberSecEval, and Code Shield can only go so far. For example, the Llama 2 model was found to have a tendency to fabricate answers and to leak private health and financial information. We will therefore have to wait and see how Llama 3 performs in real-world applications, including in tests by the academic community on alternative benchmarks.

Currently, Meta says the Llama 3 models are available for download and are powering Meta's AI assistant on Facebook, Instagram, WhatsApp, Messenger, and the web. The models will soon be available in hosted form on a range of cloud platforms, including AWS, Databricks, Google Cloud, Hugging Face, Kaggle, IBM's watsonx, Microsoft Azure, Nvidia's NIM, and Snowflake. Versions optimized for AMD, AWS, Dell, Intel, Nvidia, and Qualcomm hardware will also be released in the future.

Although the Llama 3 models may be widely available, when we use the term 'open' to describe them, it is not entirely in the sense of 'open source.' Despite Meta's claims, the Llama models are not as unrestricted as the company would like people to believe. Yes, they can be used for both research and commercial applications. However, Meta prohibits developers from using Llama models to train other generative models, and app developers with more than 700 million monthly active users must apply to Meta for a special license, which the company grants at its own discretion.

A more powerful Llama model is on the horizon. Meta says it is currently training Llama 3 models with more than 400 billion parameters - models that will be able to 'converse in multiple languages,' take in more data, and understand images and other modalities in addition to text. This would bring the Llama 3 series in line with open releases such as Hugging Face's Idefics2.

Meta writes in a blog post, 'Our near-term goals are to make Llama 3 multilingual and multimodal, with longer context and continued improvements in core [large language model] capabilities like reasoning and coding. There is much more to look forward to in the future.'