OpenAI has unveiled a new image generation capability built directly into ChatGPT, referred to as "Image Generation within ChatGPT." Users can now use GPT-4o inside ChatGPT to create images.
This initial release focuses solely on image creation and will be accessible across ChatGPT's Plus, Pro, Team, and free tiers. According to an OpenAI spokesperson, the usage limits for image generation in the free tier are similar to those for DALL-E, though the exact number remains undisclosed and may vary with demand over time. Previously, free users could generate up to three images per day through DALL-E 3.
Gabriel Goh, the research lead, highlighted that GPT-4o is a "multimodal" model capable of producing various data types, including text, images, audio, and video. He noted marked improvements in correctly associating attributes with objects (binding). Older image generators often mixed up colors and shapes once a prompt involved more than roughly 5 to 8 items, whereas the new model binds attributes correctly across 15 to 20 objects, a substantial gain in precision and reliability.
Text rendering has also improved, making it easier to produce coherent and error-free text on images. Goh mentioned that ensuring correct text rendering has been a significant challenge, as errors in small captions or labels can make the entire image unusable. After months of iteration, the team has achieved consistently usable text quality, though challenges remain with very small text.
The system employs an autoregressive method, generating images sequentially from left to right and top to bottom, akin to how text is written, rather than the diffusion approach used by most image generators, including DALL-E, which refines the entire image at once over repeated denoising steps. Goh speculates that this technical distinction might explain why ChatGPT's image generation excels at text rendering and binding.
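To make the distinction concrete, here is a minimal, purely illustrative Python sketch contrasting the two approaches on a toy grayscale grid. The "models" are trivial stand-ins (a running average and a fixed target), not anything resembling OpenAI's actual architecture; the point is only the order of operations: pixel by pixel with growing context versus whole-image refinement over many passes.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8  # toy 8x8 grayscale "image"

def toy_autoregressive_generate():
    """Autoregressive sketch: pixels are produced one at a time,
    left to right and top to bottom, each conditioned on everything
    generated so far (here, a trivial running-average 'model')."""
    img = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            context = img[:y, :].ravel().tolist() + img[y, :x].tolist()
            prior = np.mean(context) if context else 0.5
            img[y, x] = np.clip(prior + rng.normal(0, 0.1), 0, 1)
    return img

def toy_diffusion_generate(steps=50):
    """Diffusion sketch: start from pure noise and repeatedly refine
    the whole image at once (here, nudging every pixel toward a fixed
    target instead of running a learned denoiser)."""
    img = rng.random((H, W))          # pure noise
    target = np.full((H, W), 0.5)     # stand-in for a denoiser's prediction
    for _ in range(steps):
        img = img + 0.1 * (target - img)  # global update each step
    return img

print(toy_autoregressive_generate().round(2))
print(toy_diffusion_generate().round(2))
```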
In pre-launch briefings, the team showcased several examples, such as scientific diagrams of Newton's prism experiment with correctly labeled components, multi-panel comics with consistent characters and speech bubbles, and promotional posters with accurately rendered text. They also emphasized practical applications like creating transparent-background images for stickers, restaurant menus, and logo designs.
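As a rough illustration of the sticker use case, the sketch below requests a transparent-background asset through the OpenAI Python SDK's images endpoint. This is an assumption-laden example: the article describes the ChatGPT interface, not a developer API, and the model name "gpt-image-1" as well as the background and output_format parameters are assumed here for illustration only.

```python
# Hypothetical sketch of requesting a transparent-background sticker asset.
# The model name "gpt-image-1" and the background/output_format parameters are
# assumptions for illustration; the article covers the ChatGPT UI, not an API.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-1",  # assumed model identifier
    prompt="A cartoon fox sticker with a thick white outline, no background",
    size="1024x1024",
    background="transparent",  # assumed parameter
    output_format="png",
)

# The response carries base64-encoded image data; decode and save it as a PNG.
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("fox_sticker.png", "wb") as f:
    f.write(image_bytes)
```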
The head of multimodal products stated that the new model incorporates world knowledge into the image generation process. Hence, when users request an image of Newton's prism experiment, they receive the corresponding visual without needing to explain the concept.
Although the new system takes longer to generate images than before, OpenAI believes this trade-off is worthwhile. They acknowledged there's room for improvement in latency, but the image quality, functionality, and embedded world knowledge justify the additional wait time for users.