OpenAI Sora – Text-to-video AI with (almost) realistic results

OpenAI wants to maintain its position as the top dog in the market for generative AI models. In addition to the chatbot ChatGPT and the image generator DALL-E, the company has now presented a tool for generating video material: Sora. The AI creates high-resolution, detailed videos from simple text prompts, images or other videos. People, buildings, animals, plants, specific scenes, styles, camera types, eras and the like can be rendered. The results shown so far from the development of OpenAI Sora are impressive, but still have enough flaws to give them away as AI material. Still…

This woman doesn't exist. In an impressive video, OpenAI shows what the new video AI called Sora can do. In addition to realistic-looking people, the AI model also generates detailed environments including light reflections, motion blur, etc.

OpenAI Sora creates videos with up to 1 minute of consistent content

From two pirate ships fighting in a cup of coffee, to a young man reading a book while sitting on a cloud, to a parade celebrating Chinese New Year, all this and much more is possible. On its announcement page, OpenAI shows which realistic and fantastical scenes can already be realized with the Sora AI for text-to-video tasks.

It's not just the focal subjects or protagonists that are rendered in great detail. The entire video, including backgrounds and supporting characters, is usually spatially consistent, with complex lighting effects, physically plausible props and the like. At first glance, most of the material appears real.

OpenAI Sora can also create a gallery with exhibited artworks of different styles. The prompt for this is very short and simple.

Sora is still in its early stages and has limited access

The Sora AI is currently only available to the OpenAI "Red Team" and professional creatives from the fields of film and design. The so-called Red Team consists of people who test new OpenAI technologies for their dangers and risks. Video AIs in particular harbor plenty of these, as they could theoretically be used to create convincing deepfakes of celebrities, politicians and even private individuals.

So while the Red Team is supposed to identify such potential threats, the creative professionals are involved to provide feedback for improving Sora. The aim is to find out which features would be useful for cinema, YouTube and the like. Public access is certainly planned, hopefully with security mechanisms against misuse of the tool.

This man doesn't exist. The wealth of detail of OpenAI's Sora AI is reflected in a wide variety of elements: skin, hair, lighting effects, fabrics and surfaces, etc.

OpenAI draws attention to the weaknesses of video AI

In addition to genuinely impressive and sometimes very realistic-looking AI videos, OpenAI also shows a few outliers from previous test runs on the page linked above. For example, Sora was supposed to animate a person on a treadmill. That worked in principle, but the man was running in the wrong direction.

Another example shows wolf pups frolicking on a dirt road. The problem: more and more of the little animals emerge from the pack, seemingly out of nowhere. In other examples, objects suddenly appear, or emerge from behind other objects that could not realistically have hidden them. Hands are still a problem, too, including natural hand movements.

Is the treadmill running backwards or what's going on in that Sora video?

More or less useful security mechanisms announced

A video AI that can produce (almost) realistic scenes of up to one minute in length offers opportunities, but of course also dangers. OpenAI has therefore announced various security mechanisms to prevent misuse of the tool. For example, prompt filters are to be implemented that prevent certain inputs from being turned into video.

In particular, extreme violence, sexual content, hateful depictions, celebrity likenesses and the use of franchise material (characters from cartoons, films, series, video games, etc.) are to be blocked. As with images from DALL-E, C2PA metadata is also to be embedded in the output video files. These are admittedly easy to remove, however. It remains to be seen how safe the first public version of Sora will be.
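OpenAI has not disclosed how Sora's prompt filtering actually works; real moderation systems typically use trained classifiers. Purely as a hypothetical illustration of the principle of rejecting inputs before any generation happens, a naive keyword-based filter could be sketched like this in Python:

```python
# Hypothetical, naive prompt filter -- NOT how OpenAI's moderation works.
# It only illustrates the idea of rejecting prompts before generation.
BLOCKED_TERMS = {"extreme violence", "sexual content", "hateful"}

def is_prompt_allowed(prompt: str) -> bool:
    """Reject prompts containing any blocked term (case-insensitive)."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

print(is_prompt_allowed("Two pirate ships battle in a cup of coffee"))  # True
print(is_prompt_allowed("A scene of extreme violence"))                 # False
```

A production filter would of course be far more sophisticated (trained classifiers, embeddings, human review); simple substring matching is trivially easy to circumvent.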

A closeup of the woman from the OpenAI Sora video example shown at the top. At first and second glance, the AI origin of the scene cannot be recognized.

The technology behind it: Sora is a “diffusion” model

As with the corresponding image AIs, video AIs can work as diffusion models. This means that, as a first step, they generate static noise and then remove that noise over numerous steps until the described image or video ultimately emerges. Unlike with images, temporal coherence must also be maintained for videos, as content should not suddenly change completely or deform unrealistically.

In addition, objects and characters that leave the virtual camera's field of view must look the same when they re-enter the action. Techniques for this have also been implemented in the Sora model. Ultimately, Sora can also be viewed as a multimodal AI model because, in addition to text input, it can also use images and videos as source material.
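As a toy illustration of the diffusion idea described above (start from pure noise, then remove it step by step), here is a minimal Python sketch. In a real model like Sora, a trained neural network predicts the noise to remove at each step; the linear blend toward a known target below is merely a stand-in for that learned denoiser.

```python
import numpy as np

def toy_denoise(target, steps=50, seed=0):
    """Toy diffusion-style denoising: begin with pure static noise
    and blend a little toward the target on each step. A real model
    replaces this blend with a learned noise-prediction network."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)     # step 1: pure noise
    for t in range(steps):
        alpha = (t + 1) / steps               # simple denoising schedule
        x = (1 - alpha) * x + alpha * target  # remove a bit of noise
    return x

target = np.linspace(0.0, 1.0, 8)  # stand-in for an image or video frame
result = toy_denoise(target)
print(np.allclose(result, target))  # the final step lands exactly on target
```

For video, the same principle applies per frame, with the added constraint of coherence across frames, which is exactly the part that makes models like Sora hard to build.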

Impressive at first glance. But the dog's shadow is wrong. Sora also ignores the fact that the shutter is so far away from the house that the dog can't walk in front of it. The AI origin of the clip is therefore recognizable if you know what to look for.

Sora can extend and touch up videos as well as animate images

In addition to text commands for creating completely new video content, OpenAI also presents the Sora AI as a tool for extending and repairing existing videos. Furthermore, the video AI should make it possible to select an image file and animate it while preserving the details it shows. Here too, text is used to describe what should happen in the animated version of the image.

When extending videos, adding new content, or removing unwanted content from video files, the user can likewise communicate via text input what should ultimately be seen. This allows a clip to be extended at the beginning and/or end to provide a better introduction or a more exciting finale. People could also be removed or added.

Impressive: While the buildings the train passes are marked by motion blur, the reflection in the window in front remains sharp. The person from whose perspective the video is created even becomes visible when the train passes a bridge/tunnel. The prompt is implemented shockingly well.

OpenAI and AGI – Sora is intended to be a step towards "everything AI"

A large part of Sora's announcement consists of describing the creative possibilities the multimodal video AI opens up. Nevertheless, the long article, with its many video examples, descriptions of the underlying technology and other details, ends with this sentence (loosely translated): "Sora serves as the foundation for models that can understand and simulate the real world - a capability that we believe will be an important milestone on the path to AGI."

AGI stands for "Artificial General Intelligence", which in theory should be able to understand and solve any intellectual task. This still-theoretical construct would be a highly autonomous system whose exact form has not yet been uniformly defined. Like all AI, AGI comes with both opportunities and risks. More details on the topic, as well as links to relevant specialist literature, can be found at Wikipedia.

Did you like the article, and did the instructions on the blog help you? Then I would be happy if you would support the blog via a Steady Membership.
