Open-Sora 2.0 Explained
The $200K Model That’s Changing Video AI
Watch the video!
Last week, I was at GTC, Nvidia’s annual event, and I had the chance to check out some incredible new technology. One initiative that really caught my attention is a fully open-source video generator called Open‑Sora. The team managed to train an end-to-end video generator, one that takes text and produces a short video from it, for just $200,000. Okay, $200,000 is a lot of money, but it’s quite low compared to what state-of-the-art video generation models like OpenAI’s Sora, Runway’s, and the others I covered on my channel cost: millions of dollars to train for similar results.
Before diving into how they achieved that, let’s begin by understanding the problem itself. Text-to-video generation isn’t like generating a single image from text; it’s about creating a sequence of images that flow together seamlessly over time. You have to capture not only all the fine spatial details of a scene but also ensure that the motion is smooth and realistic from frame to frame. This added temporal dimension introduces an entirely new layer of complexity and cost, largely because these AI systems don’t understand time. They only see tokens, whether those tokens represent our words or pixels. They don’t have the intuitive grasp of the laws of physics that humans develop through trial and error as babies; they access our world only through tokens, which makes temporal consistency in video extremely difficult.
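To make that extra temporal dimension concrete, here is a minimal sketch (in PyTorch, with illustrative patch sizes I picked for the example; this is not Open‑Sora’s actual code) of how a short clip gets cut into the spatio-temporal tokens a transformer actually sees:

```python
import torch

# A short clip: (batch, frames, channels, height, width).
# The frame axis is exactly what a text-to-image model never deals with.
video = torch.randn(1, 16, 3, 256, 256)

# Cut the clip into spatio-temporal patches ("tokens"): here each token
# spans 4 frames in time and a 32x32 pixel region in space.
pt, ph, pw = 4, 32, 32
b, f, c, h, w = video.shape
tokens = (
    video.reshape(b, f // pt, pt, c, h // ph, ph, w // pw, pw)
         .permute(0, 1, 4, 6, 2, 3, 5, 7)   # group patch dims together
         .reshape(b, (f // pt) * (h // ph) * (w // pw), pt * c * ph * pw)
)

print(tokens.shape)  # torch.Size([1, 256, 12288])
```

Even this tiny 16-frame clip becomes 256 tokens that the model must keep mutually consistent across both space and time; that is the extra cost and complexity the temporal axis adds over single-image generation.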
There are essentially two approaches to tackling this problem. The first is to train a model to convert text directly into video, which means the model has to learn both how to generate high‑quality images and how to stitch them together into coherent motion in one go, without glitches or artifacts. Of course, this is ideal, and it’s what we want to end up with, but it runs into the same challenges I just mentioned. The second approach instead takes a detour, simplifying the problem into a two‑step process: first, you train a model to generate a high‑quality image from a text prompt, and then you use that model and the generated image as a conditioning signal to generate a video. Open‑Sora 2.0 adopts the second approach.
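Open‑Sora 2.0 trains its own models for both stages and conditions the video stage on the text as well as the image. As a rough, runnable stand-in for the same two-step idea, here is what the detour looks like with off-the-shelf Hugging Face diffusers pipelines (a sketch, not Open‑Sora’s pipeline; note that Stable Video Diffusion conditions on the image alone):

```python
import torch
from diffusers import StableDiffusionPipeline, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

prompt = "a red fox running through fresh snow"

# Stage 1: a text-to-image model handles all the fine spatial detail.
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = t2i(prompt).images[0].resize((1024, 576))  # SVD expects 1024x576

# Stage 2: a video model animates the generated frame, so it only has to
# learn plausible motion rather than image quality from scratch.
i2v = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")
frames = i2v(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "fox.mp4", fps=7)
```

The design payoff is that the hard, expensive part (photorealistic detail) is learned once in the cheap image stage, and the video stage starts from a strong conditioning signal instead of raw noise.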