EvolutionaryScale launches ESM3: AI-designed novel proteins

2024-06-27

While much of the AI world is still debating whether GPT-4o can surpass Claude 3.5 Sonnet, EvolutionaryScale, an AI research lab founded by former Meta engineers, is making significant progress in a completely different field: making biology programmable.


The task may sound daunting, but the one-year-old company has already made waves in the industry. It recently announced the launch of ESM3, a natively multimodal, generative language model that can design novel proteins from prompts. In testing, the model generated a new green fluorescent protein (esmGFP), a result the company says is equivalent to hundreds of millions of years of natural evolution.

"The sequence of esmGFP... has only 58% similarity to the closest known fluorescent protein. Based on the diversification rate of GFPs found in nature, we estimate that the generation of this novel fluorescent protein simulates over 500 million years of evolution," the company wrote in a preprint paper published on its website on Tuesday.

Alongside the new model, which comes in three sizes, the startup announced that it has raised $142 million in a seed funding round led by Nat Friedman, Daniel Gross, and Lux Capital. The venture capital arms of Amazon and Nvidia also participated in the round. The smallest model has been open-sourced to accelerate research.

However, building the model is just the first step, and its real-world impact will require time to observe.

Why EvolutionaryScale Targets Biology with AI

While generative AI models have made great strides in understanding and reasoning about human language, many are wondering whether such models can be trained to interpret the core language of life and used to develop new molecules. The core molecules of life (RNA, proteins, and DNA) have evolved through natural chemical reactions over the past 3.5 billion years. A method to program biology and design new molecules could pave the way for addressing some of the biggest challenges humanity faces, including climate change, plastic pollution, and diseases like cancer.

Several organizations, including Google DeepMind and Isomorphic Labs, have already ventured into this field, and EvolutionaryScale is the latest to join them. Founded in 2023, the company has been developing protein language models over the past few months, and its latest product, ESM3, is its largest so far and is natively multimodal and generative.

ESM3 is described as a frontier generative model for biology. It was trained with about 10^24 (one trillion trillion) floating-point operations of compute on 2.78 billion natural proteins drawn from a wide range of organisms and biological communities, amounting to 771 billion unique tokens. It can jointly reason about three fundamental properties of proteins: sequence, structure, and function. These three data modalities are represented as tracks of discrete tokens at the input and output of ESM3. Users can therefore prompt the model with any combination of partial inputs across the tracks, and the model returns predictions for all of the tracks, which is how it generates novel proteins.
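To make the idea of partial inputs across tracks concrete, here is a minimal Python sketch of what such a prompt could look like as a data structure. It is an illustration only, not EvolutionaryScale's API: the `ProteinPrompt` class, the `MASK` marker, the structure token ids, and the function keywords are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical marker for "position left for the model to fill in".
MASK = None


@dataclass
class ProteinPrompt:
    """Toy three-track prompt; each track has one entry per residue."""

    sequence: List[Optional[str]]   # amino-acid letters
    structure: List[Optional[int]]  # illustrative discrete structure token ids
    function: List[Optional[str]]   # illustrative function keywords

    def __post_init__(self) -> None:
        lengths = {len(self.sequence), len(self.structure), len(self.function)}
        if len(lengths) != 1:
            raise ValueError("all tracks must cover the same number of residues")

    def masked_positions(self) -> Dict[str, List[int]]:
        """Return, per track, the positions a model would be asked to predict."""
        tracks = {
            "sequence": self.sequence,
            "structure": self.structure,
            "function": self.function,
        }
        return {
            name: [i for i, token in enumerate(track) if token is MASK]
            for name, track in tracks.items()
        }


# A partial prompt: a few known residues, a short structural hint,
# and a function keyword attached to two positions.
prompt = ProteinPrompt(
    sequence=["M", MASK, MASK, "G", MASK],
    structure=[MASK, 17, 17, MASK, MASK],
    function=[MASK, MASK, "fluorescence", "fluorescence", MASK],
)
print(prompt.masked_positions())
```

In this sketch the model's job would be to replace every masked position in every track with a predicted token, so that a fully specified sequence, structure, and function annotation comes out of a partially specified prompt.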

"ESM3's multimodal reasoning capability enables scientists to generate new proteins with unprecedented control. For example, the model can be prompted to combine structure, sequence, and function to propose a potential scaffold for the active site of PETase, an enzyme that degrades polyethylene terephthalate (PET) and is of interest to protein engineers for plastic waste degradation," the company explained.

In one case, the company used the model with guided evolution to design a new version of green fluorescent protein, a rare type of protein that can be attached to another protein to label it with fluorescence, allowing scientists to see where specific proteins are present in cells. EvolutionaryScale found that the generated protein was about as bright as natural fluorescent proteins. The company estimates that a comparably different protein would take about 500 million years to evolve in nature.
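The 58% similarity figure quoted for esmGFP is the kind of number a percent-identity calculation produces. The sketch below shows one common way to compute it from two sequences that have already been aligned. It is an assumption for illustration: the article does not say how the preprint aligned the sequences or which positions it counted, and the function name `percent_identity` and the toy sequences are made up.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of compared positions where two pre-aligned sequences match.

    Positions where either sequence has a gap are skipped. Other
    conventions (for example dividing by the full alignment length)
    give different numbers.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared


# Toy aligned fragments, not real GFP sequences.
identity = percent_identity("MSKGEEL", "MSRGE-L")
print(f"{identity:.1f}% identity")
```

Here six ungapped positions are compared and five of them match, so the toy pair is far more similar than esmGFP is to its nearest known relative.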


The team also noted that ESM3 can self-improve by providing feedback on the quality of its own generations. Feedback from lab experiments or existing experimental data can also be used to align its generations with a given goal.

Impact Still to Be Observed

ESM3 currently comes in three sizes: small, medium, and large. The smallest version has 1.4 billion parameters and has been open-sourced on GitHub, with weights and code released under a non-commercial license. The medium and large versions, with up to 98 billion parameters, are available to companies for commercial use through EvolutionaryScale's API and through partner platforms from Nvidia and AWS.

EvolutionaryScale hopes that researchers can use the technology to address some of the world's biggest problems and benefit human health and society. Its broader applications, however, will take time to validate. The biggest potential beneficiaries may be pharmaceutical companies, which could use it to develop new drugs for life-threatening conditions.

The company's previous models have been used to improve therapeutic properties of antibodies and detect COVID-19 variants that pose significant risks to public health, among other use cases.