This is a key observation that is simple and intuitive:
>All CLIP-like models perform poorly on mixed-modality search due to a phenomenon known as the modality gap. As illustrated in the figure below, the closest vector to the snippet “I address you, members of the Seventy-Seventh Congress…” is not its screenshot, but other texts. This leads to search results that are skewed towards items of the same modality; in other words, text vectors will be closer to irrelevant texts than relevant images in the embedding space.
This quote is important, but in isolation it's not clear that they are claiming to have beaten this problem: they are saying the new model, voyage-multimodal-3, instead identifies linked concepts across modalities. That would indeed be pretty cool -- if there is a latent space that could cluster the same idea, whether represented visually or in text.
> ... the vectors truly capture the semantic content contained in the screenshots. This robustness is due to the model’s unique approach of processing all input modalities through the same backbone.
With that said, I think this benchmark is a pretty narrow way of thinking about multi-modal embedding. Having text embed close to images of related text is cool and convenient, but doesn't necessarily extend to other notions of related visual expression (e.g. "rabbit" vs a photo of a rabbit). And on the narrow goal of indexing document images, I suspect there are other techniques that could also work quite well.
This seems like a great opportunity for a new benchmark dataset with multi-modal concept representations beyond media-of-text.
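To make the modality gap from the quoted passage concrete, here is a toy sketch. The 3-d vectors are made-up numbers, not output from any real model: the first two dimensions stand in for "content" and the last one for "modality". With the modality axis present, the nearest neighbour of the text snippet is an unrelated text; with it removed, the screenshot of the same content wins.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_name, vectors):
    """Name of the item most similar to the query, excluding the query itself."""
    query = vectors[query_name]
    scored = [(cosine(query, vec), name)
              for name, vec in vectors.items() if name != query_name]
    return max(scored)[1]

def drop_modality_axis(vectors):
    """Hypothetical 'gap-free' space: discard the last (modality) dimension."""
    return {name: vec[:-1] for name, vec in vectors.items()}

# Synthetic embeddings: [content_x, content_y, modality]
items = {
    "text: congress speech": [1.0, 0.0, 2.0],
    "text: cooking recipe": [0.0, 1.0, 2.0],
    "image: congress speech screenshot": [1.0, 0.0, -2.0],
}

query = "text: congress speech"
with_gap = nearest(query, items)
without_gap = nearest(query, drop_modality_axis(items))
print(with_gap)     # the unrelated text wins when the modality axis dominates
print(without_gap)  # the matching screenshot wins once that axis is gone
```

Real models do not have a single clean "modality axis", so this only illustrates why same-modality items can crowd out relevant cross-modality ones.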
They could be solving it with multimodal mixup, a technique that makes sure there's no big latent gap between the two modalities: https://arxiv.org/abs/2203.03897
If you are interested in that space, I'd throw our project into the mix, which uses ColPali under the hood transparently.
https://github.com/tjmlabs/ColiVara
The main benchmark for this is the ViDoRe leaderboard, where we would love to see how VoyageAI performs compared to the more open-source implementations.
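I'm missing something. Shouldn't any LLM that's 'natively multimodal' somehow include embeddings which are multimodal? For example, here's Google's blog post on Gemini:
> Until now, the standard approach to creating multimodal models involved training separate components for different modalities and then stitching them together to roughly mimic some of this functionality. These models can sometimes be good at performing certain tasks, like describing images, but struggle with more conceptual and complex reasoning.
> We designed Gemini to be natively multimodal, pre-trained from the start on different modalities. Then we fine-tuned it with additional multimodal data to further refine its effectiveness. This helps Gemini seamlessly understand and reason about all kinds of inputs from the ground up, far better than existing multimodal models — and its capabilities are state of the art in nearly every domain.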
Because LLMs such as Gemini -- and other causal language models more broadly -- are trained on next token prediction, the vectors that you get from pooling the output token embeddings aren't that useful for RAG or semantic search compared to what you get from actual embedding models.
One distinction to make here is that token embeddings and the embeddings/vectors that are output from embedding models are related but separate concepts. There are numerous token embeddings (one per token) which become contextualized as they propagate through the transformer, while an embedding model outputs a single vector/embedding per input (such as a long text, a photo, or a document screenshot).
LLM embeddings contain superpositions of many concepts, so while they might predict the next token well, they don’t actually outperform contrastively pretrained embedding models.
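As a rough sketch of the "many token vectors vs. one embedding" distinction: mean pooling is one common way to collapse per-token vectors into a single vector. It is an assumption here, not a description of what Voyage or Gemini actually does, and the token vectors below are hypothetical.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one vector of the same dimension."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Hypothetical contextualized token vectors: one 2-d vector per token.
token_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]

pooled = mean_pool(token_vectors)   # one vector for the whole input
embedding = l2_normalize(pooled)    # unit-length version for similarity search
print(len(token_vectors), "token vectors ->", len(embedding), "-d embedding")
```

Pooling like this always yields a vector, but as the comments here note, whether that vector is useful for retrieval depends on how the model was trained.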
Fwiw if the other replies aren't clear: change "embeddings" to "List<double> that some layer of my AI model produces" (that's not exactly correct, it's slightly more specific than that, but in this context it's correct)
LLMs, including multimodal LLMs, do have embeddings, but they're embeddings learned by generating text rather than by finding similar documents.
Indeed, sad that their models are both commercial/proprietary and API-only.
Sad that people have to pay their employees?
First, I don't see the problem with conflicting interests. Sad for them is not necessarily sad for us.
Then, in my case it was more "sad" from a commercial point of view, because it means that despite their models potentially being better, almost no one uses them, and they are not well known. That will probably not change, as there is a high barrier to entry: people have to trust them enough to suddenly start using their models through their APIs out of the blue. Not that many people will test, benchmark, and then recommend the models.
Also, sad in one last respect that is not inconsistent with paying their employees:
- If you only offer an API but no way to self-host the commercial models, you greatly limit your pool of potential customers who are looking for alternatives to OpenAI. This is the same somewhat shitty move as Adobe forcing full "cloud" solutions.
No, but it serves everyone in the "AI retrieval" space better if we continue to make rapid improvements. New models are great, but not the ultimate solution.
This does read as very impressive. Any critical perspectives on the presented evaluation? What about non-English text?
I understand the model is, like other commercial ones, available exclusively through their API, right?
Yes, voyage models are API only.
There was a part here about multilingualism but that was wrong! Sorry!
FWIW: Voyage also has separate `law`, `code`, and `finance` models. See [1]
Really cool results, anyway.
[1]: https://docs.voyageai.com/docs/embeddings
Glad you liked the results! We do have multilingual models (and rerankers) -- voyage-3, in particular, is multilingual: https://blog.voyageai.com/2024/09/18/voyage-3/
voyage-multimodal-3 is multilingual as well, supporting the same set of languages as voyage-3.
Sorry for spreading false information. I edited the post above.
It is interesting that you’re not as up front about multilingualism as Cohere. They seem to mention it a lot, which led to my confusion.
No worries at all. That's great feedback and an area of improvement for us when it comes to future posts -- we'll be more explicit about multilingualism in blogs and in our docs.
API-only model. No thanks but congrats anyway.
Agreed on both parts of the statement. Granted, there are obvious considerations for an exclusive API focus beyond just trying to get money from people, but I personally would not consider it, given that they don't offer other options.
I understand the sentiment. We are starting to open source some tools, mostly around embedding model evaluation (i.e. https://github.com/voyage-ai/voyage-evaluation-public), with other stuff coming up.
FWIW, there are other deployment options besides the API as well: AWS (https://docs.voyageai.com/docs/aws-marketplace-model-package), Azure (https://docs.voyageai.com/docs/azure-marketplace-managed-app...), Snowflake (https://docs.voyageai.com/docs/snowflake), and vector database integrations (https://docs.voyageai.com/docs/integrations-and-other-librar..., https://milvus.io/docs/integrate_with_voyageai.md, https://docs.pinecone.io/integrations/voyage, https://weaviate.io/developers/weaviate/model-providers/voya..., https://qdrant.tech/documentation/embeddings/voyage/, etc).
Looks quite interesting! I’ve been working on AnyModal, a framework for integrating different data types (like images and audio) with LLMs: https://github.com/ritabratamaiti/AnyModal. It seems that voyage-multimodal-3 would be quite promising in developing multimodal LLMs, but I am not sure if that is the intended use case.
In the traditional Python API, the Voyage engine will tokenize blocks of text and output a string of characters. This model seems to be doing that by vectorizing images in space.
Words like 'you' and 'apple' will be a unitary token. More complex terms like 'pikachu' may be divided into pik-a-chu.
[1]: https://docs.voyageai.com/docs/tokenization
This is a cool way to look at multimodal embeddings. They look at performance as the percentage of inputs slides from one modality to another:
https://i0.wp.com/blog.voyageai.com/wp-content/uploads/2024/...
> https://i0.wp.com/blog.voyageai.com/wp-content/uploads/2024/...
Why does it pop up at the end?
Check out ColPali and ColQwen for a SOTA open source version.
The colab measures dot product values 0.428 and 0.498, describing them as "...similarity value is quite high." Is that high? Can you design a system that confidently labels data with a 0.4 threshold?
While the raw similarity score does matter, what typically matters more is the score relative to other documents. In the case of the examples in the notebook, those values were the highest in relative terms.
I can see why this may be unclear/confusing -- we will correct it. Thank you for the feedback!
The raw output value is generally irrelevant. What matters is its position in the distribution of outputs.
A 0.4 with cosine similarity is not the same as a 0.4 with sigmoid thresholding.
0.4 cosine similarity is pretty good for real-world data that isn't a near-identical duplicate.
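A small sketch of the "relative score matters more than raw score" point made above. The vectors and the 0.5 threshold are made up for illustration and are not taken from the colab: a fixed cutoff rejects every document, while ranking still surfaces the relevant one by a clear margin.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine(query, vec), name) for name, vec in docs.items()]
    return sorted(scored, reverse=True)

# Synthetic query and document vectors (hypothetical values).
query = [1.0, 1.0, 0.0]
docs = {
    "relevant": [1.0, 0.0, 2.0],
    "unrelated_a": [0.0, 0.0, 1.0],
    "unrelated_b": [-1.0, 0.0, 1.0],
}

ranked = rank(query, docs)
threshold = 0.5  # arbitrary absolute cutoff, chosen only for illustration
above_threshold = [name for score, name in ranked if score >= threshold]
margin = ranked[0][0] - ranked[1][0]  # gap between best and runner-up

for score, name in ranked:
    print(f"{name}: {score:.3f}")
print("passing fixed threshold:", above_threshold)
print("top-1 margin:", round(margin, 3))
```

In practice the score distribution depends on the model and the corpus, so any threshold would need to be calibrated on real data rather than picked up front.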
I wish people would take the time to put in real datasets and make qualitative analysis of when and why "foo new solution" is better.
Quantitative benchmarks are great, but sparse.
Funny, all those big name Stanford advisors for a company that builds embeddings... A couple of strong MLEs can deliver everything they are doing. This shouldn't be a company but OK... I'm sure some clueless VCs in SV gave them money.
And just to be clear, I don't think that delivering strong embeddings for different domains is an easy task. However, it's 2024, not 2016.