Opinion

It’s a Multimodal World, After All

LLaVA is a fusion of a large language model, like GPT-4, and a vision encoder, like CLIP. What does it do, and can it solve today’s AI problems?

By Junko Yoshida

What’s at stake:
The new frontier in AI is in multimodal models such as CLIP and Stable Diffusion. Because humans interact with the world through both vision and language, AI researchers believe machines, too, need multimodal channels.  Can something like LLaVA foster a “general-purpose assistant” that effectively follows multimodal vision-and-language instructions aligned with human intent? If so, at what cost?

We live in a tumultuous time.

Business reporters are often pressed to investigate commercial and technological advancements almost immediately after the scientific community breaks new ground. This is particularly true in the field of artificial intelligence.

With little time for either reflection or examination, reporters take scientists’ word on the Next Big Thing, reducing AI journalism to little better than stenography. Remember when promoters kept insisting the autonomous vehicle was “just around the corner”?

Corporations, obliged to become AI prospectors, now face a similar dilemma. Like the 49ers of the Gold Rush, a company hoping to strike it rich makes staking a claim its first priority, worrying only afterward whether its particular AI investment is the mother lode or a dry hole.

Clouding our judgement further is the exponential growth of Nvidia. Money talks.

