Nous releases open-source vision-language model Hermes 2 Vision Alpha, faces optimization challenges

2023-12-05

Nous Research, a private applied research group known for its open-source work on large language models (LLMs), has released a lightweight vision-language model called Nous Hermes 2 Vision.

The open-source model is available on Hugging Face and builds on the company's earlier OpenHermes-2.5-Mistral-7B model, adding visual capabilities such as extracting text from images and generating detailed answers from image prompts.

However, shortly after the release, users found that the model hallucinated more than expected, producing erroneous output, and the project was renamed Hermes 2 Vision Alpha. The company plans to release a more stable version offering the same capabilities with fewer errors.

Nous Hermes 2 Vision Alpha

Named after the Greek god Hermes, Nous Hermes 2 Vision aims to be a system that can "navigate complex human discourse with divine skill." It takes user-provided images and combines that visual information with its learned knowledge to provide detailed answers in natural language.

For example, it can analyze a user's image and describe its different aspects in detail. One of Nous' co-founders, known as Teknium on X, shared a test screenshot in which the model analyzed a picture of a hamburger, judged whether eating it would be harmful to health, and explained its reasoning.
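The interaction pattern behind that demo is straightforward to reproduce. Because the Alpha checkpoint shipped before a standard transformers integration existed, the sketch below uses a generic LLaVA checkpoint as a stand-in; the checkpoint id, classes, and prompt template are illustrative assumptions, not Nous' documented API.

```python
# Minimal sketch of the image-plus-question pattern, with a generic LLaVA
# checkpoint standing in for Nous-Hermes-2-Vision-Alpha (the Alpha repo may
# require its own loading code; consult its model card).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in vision-language checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("burger.jpg")  # any local photo
prompt = "USER: <image>\nWould eating this be harmful to my health, and why? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```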

While ChatGPT, backed by GPT-4V, also supports image prompts, Nous' open-source product stands out in two key respects.

Firstly, unlike traditional approaches that rely on heavy 3B-parameter vision encoders, Nous Hermes 2 Vision adopts the 400M-parameter SigLIP-400M. This not only simplifies the model's architecture, making it lighter than comparable products, but also helps improve performance on vision-language tasks.
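The size difference is easy to verify against the public SigLIP release. The sketch below assumes google/siglip-so400m-patch14-384 (the shape-optimized, roughly 400M-parameter SigLIP checkpoint) is the encoder family Nous refers to; Nous did not publish an exact checkpoint id.

```python
# Quick parameter count for the lighter encoder. Assumes the public
# google/siglip-so400m-patch14-384 weights match the "SigLIP-400M" that
# Nous names; requires transformers >= 4.37.
from transformers import SiglipVisionModel

vision_tower = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
n_params = sum(p.numel() for p in vision_tower.parameters())
print(f"SigLIP vision tower: {n_params / 1e6:.0f}M parameters")  # ~400M, versus ~3B
```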

Secondly, it was trained on a custom dataset enriched with function calling. This lets users prompt the model with special tags to extract written information from images, such as menus or billboards; a hypothetical example of the pattern is sketched after the quote below.

"This unique addition transforms Nous-Hermes-2-Vision into a vision-language action model. Developers now have a versatile tool ready to create various sophisticated automations," the company wrote on the model's Hugging Face page.

Other datasets used for training the model include LVIS-INSTRUCT4V, ShareGPT4V, and dialogues from OpenHermes-2.5.

Ongoing Issues

Although the Nous vision-language model is available for research and development purposes, early usage has shown that it is far from perfect.

Shortly after the release, the co-founder posted that the model had problems, including frequent hallucinations and misuse of EOS (end-of-sequence) tokens. The model was subsequently renamed as an alpha version.

"I see people talking about 'hallucinations,' and yes, the situation is indeed bad. I am aware of it too, as the underlying LLM is an unreviewed model. I will update this version by the end of this month to address these issues," wrote Quan Nguyen, a researcher at Nous, on X.

However, Nguyen pointed out in another post that the function calling feature still works well if users define a good architecture for it, and that he will release a model dedicated to function calling if there is sufficient user feedback.

To date, Nous Research has released 41 open-source models across its Hermes, YaRN, Capybara, Puffin, and Obsidian series, spanning different architectures and capabilities.