Creators of Transformer Reunite at NVIDIA GTC 2024 Conference

2024-04-02

"Attention is All You Need" brought together all the authors of the book, except Niki Parmer, after seven years. This moment finally arrived during the NVIDIA GTC 2024 conference, titled "Transforming AI," hosted by Jensen Huang, the leader in the GPU field.


Character AI founder Noam Shazeer revealed that the Transformer architecture was once called "CargoNet," a name that never caught on.


"There were many names, one of which was CargoNet (Convolution, Attention, Recognition, and Google's abbreviation)," Shazeer excitedly said. However, this name didn't leave a lasting impression, and everyone unanimously voted that it was "terrible."


Eventually, Jakob Uszkoreit proposed the name "Transformer." "It became the generic name because, in theory, our focus was never limited to translation. We certainly realized that we were trying to create something very general, something that could truly transform anything into anything else," said Llion Jones, co-founder of Sakana AI.


Talking about the Transformer's multimodality, Aidan Gomez, co-founder and CEO of Cohere, said, "When we built the Tensor2Tensor library, we really focused on scaling up autoregressive training. It's not just for language; there are components for images, audio, and text, both as input and output."


What are the creators of Transformer busy with now?


Illia Polosukhin was the first to leave Google, departing in 2017. He went on to co-found NEAR Protocol, a blockchain platform aimed at being faster, cheaper, and more user-friendly than existing options.


Ashish Vaswani left Google in 2021. "One of the main reasons I left was that to make these models smarter, it can't just be about working in the vacuum of the lab; it actually has to go out and be put in people's hands," he said.


In late 2022, he co-founded a company called Essential AI with Niki Parmar. "We're very excited about building models that can eventually learn to solve new tasks as efficiently as humans do, by observing what we do," Vaswani said, adding that their ultimate goal is to change the way we interact with computers and how we work.


Meanwhile, Shazeer founded Character AI in 2021. "The biggest frustration at the time was that this incredible technology wasn't democratized to everyone, even though it had so many uses," Shazeer said.


Gomez co-founded Cohere in 2019. He said the idea behind Cohere was similar to Shazeer's: this technology would change the world once computers started communicating with humans.


"I think where I differ from Noam is that Cohere is built for enterprises. We've created a platform for every enterprise to adopt and integrate (genAI) into their products, rather than going directly to consumers," Gomez said.


In 2023, Jones co-founded Sakana AI, a nature-inspired AI startup based in Japan; in Japanese, "sakana" means fish. The company is currently researching a technique called Evolutionary Model Merge, which combines diverse models with different capabilities drawn from the ocean of open-source models.


"We're handcrafting algorithms. We take all the models available on Hugging Face and use a lot of computation for evolutionary computation to search for ways to merge and stack layers," Jones said.


"I want to remind everyone that NVIDIA provides us with a lot of computational power, and besides gradient descent, we can do other things," he added.


Lukasz Kaiser joined OpenAI in 2021. "It's the place to build the best Transformers. It's a lot of fun there. We know that with a lot of data and compute you can create beautiful things," Kaiser said.


Uszkoreit founded Inceptive in 2021, aiming to use AI to design novel biomolecules for vaccines, therapies, and other treatments, essentially creating a new kind of "biological software." Uszkoreit said, "My first child was born during the pandemic, and that really made me realize the fragility of life, among other reasons."


What lies ahead after Transformer?


Jensen Huang asked the panelists about the most important improvements to the Transformer's foundational design. Gomez answered that a great deal of work has gone into speeding up inference for these models. However, he expressed dissatisfaction that everything being built today still rests on the original Transformer.


"I still feel uncomfortable with how similar we are to the original form. I think the world needs something better than Transformer," he said, adding that he hopes Transformer will be replaced by a "new performance peak." "I think it's too similar to something from six or seven years ago."


Jones mentioned that companies like OpenAI are currently using a lot of computational power. When Jensen Huang asked about their interest in larger context windows and faster token generation, Jones said, "I think they're doing a lot of wasted computation." Huang quickly added, "We're working hard to improve efficiency."


Uszkoreit believes that the key to solving the computation problem lies in proper allocation. "It's really about putting in the right amount of effort, and ultimately energy," he said. He also finds state space models (SSMs) "too complex" and "not elegant enough."


Meanwhile, Ashish Vaswani, CEO of Essential AI, believes that the right interface is crucial to creating better models. "If we eventually want to build models that can imitate and learn how to solve tasks by observing us, then the interface becomes crucial," he said.


Jones believes that many young researchers have forgotten the era before the Transformer, and that the problems the field struggled with back then are likely still present in today's models. "People seem to have forgotten the era before the Transformer, so they have to rediscover all those problems," he added.


Polosukhin mentioned that the Transformer has a loop-like step. "Interestingly, I found that nobody really takes advantage of the fact that you can run the Transformer with a variable number of steps and train it differently," he said.
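A minimal sketch of the idea Polosukhin is gesturing at, assuming a toy single-head block with random, untrained weights: because the block's weights are shared across steps, the number of times it is applied becomes a runtime argument rather than a fixed architectural constant. Everything here (the shapes, the `run` helper, the step counts) is a hypothetical illustration, not anything shown at the panel.

```python
# Minimal, hypothetical sketch: the same Transformer-style block applied a
# variable number of times, with weights shared across steps.
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# One shared block: single-head self-attention followed by a small MLP.
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
W1 = rng.normal(scale=0.1, size=(dim, 4 * dim))
W2 = rng.normal(scale=0.1, size=(4 * dim, dim))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def block(h):
    """One shared block: self-attention plus a ReLU MLP, each with a residual."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(dim)) @ v
    h = h + attn
    return h + np.maximum(h @ W1, 0.0) @ W2

def run(tokens, n_steps):
    """Apply the same block n_steps times; depth is chosen at call time."""
    h = tokens
    for _ in range(n_steps):
        h = block(h)
    return h

tokens = rng.normal(size=(10, dim))   # 10 toy token embeddings
easy = run(tokens, n_steps=2)         # shallow computation
hard = run(tokens, n_steps=8)         # deeper computation, same weights
print(easy.shape, hard.shape)
```

This is close in spirit to weight-tied, variable-depth computation as explored in the Universal Transformer, where the number of steps can in principle vary per input; the quote's point is that this flexibility is rarely exploited in practice.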


Meanwhile, Lukasz Kaiser offered his view on recurrence: "I personally believe that we never really learned how to train recurrent layers using gradient descent. Maybe it's just not possible," he said.