As fall approaches, Google and OpenAI are locked in a good ol’ fashioned software race, each aiming to launch the next generation of large language models: multimodal models. These models can work with images and text alike, producing code for a website just from a sketch of what a user wants the site to look like, for instance, or spitting out a text analysis of visual charts so you don’t have to ask your engineer friend what they mean.
Google’s getting close. It has shared its upcoming Gemini multimodal LLM with a small group of outside companies (as I scooped last week), but OpenAI wants to beat Google to the punch. The Microsoft-backed startup is racing to integrate GPT-4, its most advanced LLM, with multimodal features akin to what Gemini will offer, according to a person with knowledge of the situation. OpenAI previewed those features when it launched GPT-4 in March but made them available to only one company, Be My Eyes, which creates technology for people who are blind or have low vision. Six months later, the company is preparing to roll out the features, known as GPT-Vision, more broadly.