On the Latent Space podcast, Thomas Scialom, an AI research scientist at Meta, discussed the design decisions behind Llama 3.1 and previewed plans for Llama 4.
Scialom said the parameter count of Llama 3.1 was chosen by weighing several factors, including scaling laws, training time, and GPU and hardware constraints. The team looked for a balance point that still offered acceptable inference efficiency, which led them to scale the model up to 405B parameters.
Revisiting scaling laws, Scialom noted that the Chinchilla law emphasizes the total number of training tokens needed for compute-optimal training. To get a better model at a given inference cost, however, they deliberately trained on more tokens and for longer than Chinchilla would prescribe, pushing the model into an "overtrained" regime.
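To make the trade-off concrete, here is a back-of-the-envelope sketch (not figures from the interview): the common Chinchilla heuristic of roughly 20 training tokens per parameter would put the compute-optimal budget for a 405B model at about 8T tokens, well below the roughly 15T tokens publicly reported for Llama 3.1 pre-training.

```python
# Rough comparison of a Chinchilla-optimal token budget vs. the "overtraining"
# regime described above. The 20-tokens-per-parameter rule of thumb is a
# heuristic from the Chinchilla paper; the 15T-token figure is the publicly
# reported Llama 3.1 pre-training budget, not a number from this interview.

PARAMS = 405e9                    # Llama 3.1 405B parameters
CHINCHILLA_TOKENS_PER_PARAM = 20  # heuristic, not an exact law

chinchilla_optimal_tokens = PARAMS * CHINCHILLA_TOKENS_PER_PARAM
reported_tokens = 15e12           # ~15T tokens reportedly used

print(f"Chinchilla-optimal budget: {chinchilla_optimal_tokens / 1e12:.1f}T tokens")
print(f"Reported training budget:  {reported_tokens / 1e12:.1f}T tokens")
print(f"Overtraining factor:       {reported_tokens / chinchilla_optimal_tokens:.1f}x")

# Approximate training compute via the common 6 * N * D FLOPs estimate.
train_flops = 6 * PARAMS * reported_tokens
print(f"Approximate training compute: {train_flops:.2e} FLOPs")
```

Spending extra training compute this way buys a stronger model at a fixed serving cost, which is the inference-efficiency argument Scialom describes.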
Regarding model architecture, Scialom believes the current Transformer architecture still lacks flexibility and expects further improvements in the future. He explained why a mixture-of-experts (MoE) architecture was not used for Llama 3.1, while noting that the team continues to explore architectural choices like MoE as hyperparameters.
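For context on the architecture they chose not to adopt, the sketch below shows a minimal mixture-of-experts feed-forward layer with top-1 token routing. It is a generic PyTorch illustration, not Meta's implementation; the class name, dimensions, and expert count are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoELayer(nn.Module):
    """Minimal mixture-of-experts feed-forward layer with top-1 routing.
    A dense Transformer (as in Llama 3.1) uses a single FFN here instead."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each token per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for per-token routing
        tokens = x.reshape(-1, x.shape[-1])
        gate_probs = F.softmax(self.router(tokens), dim=-1)  # (tokens, experts)
        top_prob, top_idx = gate_probs.max(dim=-1)           # one expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # scale by the gate probability so routing stays differentiable
                out[mask] = expert(tokens[mask]) * top_prob[mask].unsqueeze(-1)
        return out.reshape_as(x)

layer = Top1MoELayer()
print(layer(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```

The appeal of MoE is that only one (or a few) experts run per token, so parameter count grows without a proportional increase in per-token compute; the trade-off is added routing and systems complexity, which is part of why a dense model can be the simpler choice.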
In terms of data, Scialom mentioned that they filtered the pre-training corpus down to high-quality tokens, and that fine-tuning relied entirely on synthetic data generated by Llama 2. He has high expectations for the potential of synthetic data.
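A common pattern for building such a synthetic fine-tuning set is to sample candidate answers from an existing model and keep only those a reward model scores highly. The sketch below illustrates that rejection-sampling style of curation in general terms; the `generate` and `reward_score` helpers are hypothetical stand-ins for real model calls, not Meta's pipeline or APIs.

```python
from typing import Callable, Dict, List

def build_synthetic_sft_set(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],   # hypothetical: model text generation
    reward_score: Callable[[str, str], float],   # hypothetical: reward-model scoring
    samples_per_prompt: int = 4,
    min_score: float = 0.7,
) -> List[Dict[str, str]]:
    """Sample candidate answers per prompt and keep the best-scoring one if it
    clears a quality threshold. A generic sketch, not Meta's exact method."""
    dataset = []
    for prompt in prompts:
        candidates = generate(prompt, samples_per_prompt)
        scored = sorted(((reward_score(prompt, c), c) for c in candidates), reverse=True)
        best_score, best_answer = scored[0]
        if best_score >= min_score:
            dataset.append({"prompt": prompt, "response": best_answer})
    return dataset

# Toy stand-ins so the sketch runs end to end.
demo = build_synthetic_sft_set(
    prompts=["Explain overtraining in one sentence."],
    generate=lambda p, n: [f"Draft {i} answering: {p}" for i in range(n)],
    reward_score=lambda p, a: 0.9 if a.startswith("Draft 0") else 0.5,
)
print(demo)
```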
On evaluating and improving LLMs, Scialom considers it an open research problem with no definitive answer yet. The team has tried various evaluation methods, including reward models, model-as-a-judge, and the use of diverse prompts.
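To make the model-as-a-judge idea concrete, the sketch below shows a minimal pairwise comparison: a judge model is asked which of two candidate answers is better and its reply is parsed into a preference. The prompt template and the `call_judge` helper are invented for this example and are not from the interview.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are an impartial judge. Given the user prompt and two
candidate answers, reply with exactly "A" or "B" for the better answer.

Prompt: {prompt}

Answer A: {answer_a}

Answer B: {answer_b}

Better answer:"""

def judge_pair(
    prompt: str,
    answer_a: str,
    answer_b: str,
    call_judge: Callable[[str], str],  # hypothetical: sends text to a judge LLM
) -> str:
    """Return 'A' or 'B' according to the judge model, or 'tie' if the reply
    cannot be parsed."""
    reply = call_judge(
        JUDGE_TEMPLATE.format(prompt=prompt, answer_a=answer_a, answer_b=answer_b)
    ).strip().upper()
    return reply if reply in ("A", "B") else "tie"

# Toy judge that always picks A, just to exercise the function.
verdict = judge_pair(
    "What is a scaling law?",
    "A rule relating model quality to compute, data, and parameters.",
    "A law.",
    call_judge=lambda text: "A",
)
print(verdict)  # A
```

In practice such comparisons are usually run twice with the answer order swapped to control for position bias, and aggregated over many diverse prompts, which is where reward models and prompt diversity come back into the picture.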
Finally, Scialom revealed that Meta has already begun training Llama 4, with a focus on agent technology. The team has prior work on agent tooling such as Toolformer. He emphasized that an agent's capabilities are ultimately limited by the quality of the underlying instruction model, so building a strong instruction model remains central to the agent work they plan to continue.
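As a generic illustration of the tool-use pattern behind systems like Toolformer (not Meta's agent stack), the sketch below shows a minimal loop in which a model's output is scanned for a tool call, the tool is executed, and the result is fed back into the context. The bracketed call syntax, tool set, and `call_model` helper are all invented for the example.

```python
import re
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {
    # Hypothetical tools; Toolformer-style models learn to emit such calls inline.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only
    "uppercase": lambda text: text.upper(),
}

CALL_PATTERN = re.compile(r"\[(\w+)\((.*?)\)\]")  # e.g. [calculator(2+2)]

def run_with_tools(prompt: str, call_model: Callable[[str], str], max_steps: int = 3) -> str:
    """Let the model produce text, execute any tool call it emits, append the
    result to the context, and repeat until no more calls appear."""
    context = prompt
    for _ in range(max_steps):
        output = call_model(context)
        match = CALL_PATTERN.search(output)
        if match is None:
            return output
        tool, arg = match.group(1), match.group(2)
        result = TOOLS[tool](arg) if tool in TOOLS else "unknown tool"
        context = f"{context}\n{output}\nTool result: {result}"
    return context

# Toy model: first asks the calculator, then answers with the result it sees.
def toy_model(context: str) -> str:
    if "Tool result:" in context:
        return "The answer is " + context.rsplit("Tool result: ", 1)[-1]
    return "Let me compute that. [calculator(21*2)]"

print(run_with_tools("What is 21 times 2?", toy_model))  # The answer is 42
```

The loop only works as well as the model's ability to follow the calling convention and use the returned results, which is the point Scialom makes about strong instruction models being a prerequisite for capable agents.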