
Written by:
CEO & Founder

GPT-4o, where 'o' stands for "omni" or "everything" as it translates to. The model is a step towards a much more natural interaction between humans and ChatGPT. It is a so-called multimodal model — it accepts text, audio, and image as input and generates text, audio, and image as output. It can respond to audio input in as little as 1/4 of a second, which is similar to human response time in a conversation.
It matches GPT-4 Turbo performance on English text and code, with significant improvements on text in other languages, while being much faster and the API is 50% cheaper. GPT-4o is particularly better at understanding images and audio compared to existing models.
Before GPT-4o, you could use Voice to talk to ChatGPT but it couldn't directly observe tone, multiple speakers, or background sounds, and it couldn't generate laughter, singing, or express emotions like GPT4o can.
In videos that OpenAI has shown, you can also interrupt the AI mid-conversation to, for example, ask a new question, which makes the flow of the conversation much more natural.