
Show Your Work | Artificial Intelligence

How we tested and evaluated AI-generated dance videos

Pink-tinted photograph of a dancing group posing in sync, their arms arched, standing in formation on a stage in a dark dance studio; the background is a black backdrop with LED lights hanging on metal railing rigs. (Photo: Alisha Jucevic)

Have you read this article yet? You may want to start here.

Artificial intelligence models can produce lifelike video footage with a simple text prompt. But these tools still struggle with generating realistic videos of complex natural movements, like human dance. 

When CalMatters and The Markup asked dancers and choreographers whether AI could disrupt their industry, most concluded that human dancers could not be replaced.

For the most part, we found that they were right. We tested nine different cultural, modern and popular dance styles using four commercially available generative AI video models, generating a total of 36 videos. We found that the latest commercially available AI video generation models produced convincingly lifelike videos of people dancing — but none produced a figure performing the prompted dance.

About a third of the generated videos exhibited inconsistencies in a subject’s appearance from frame to frame, along with abnormalities in movement and limbs. Still, the frequency and severity of these issues had declined markedly since our initial testing in late 2024.


Methodology


Define task

CalMatters and The Markup used four commercial video generation models from major tech companies to create video clips of traditional and popular dance.

We limited our tests to consumer-facing, closed-source generative video tools because they are the most readily available for everyday users and tend to perform better than open-source models. We tested Sora 2 by OpenAI, Veo 3.1 by Google, Kling 2.5 by Kuaishou, and Hailuo 2.3 by MiniMax.


Prepare prompts

We drafted nine video prompts testing a variety of dances in different settings, such as dance floors, stages, bedrooms, studios, cultural events, public squares and classrooms. We tested popular, modern and traditional cultural dance styles, including the Macarena, the Mashed Potato, folklórico and popular TikTok dances. See the Appendix for more details.

We varied the level of specificity to test whether identifying the dance by name was enough to generate a video of the desired motion, or whether explicitly specifying the exact physical movements improved output. 

Before finalizing the list of prompts, we submitted them to ChatGPT for edits based on the Sora 2 Prompting Guide. See Limitations: Prompt Optimization for more details.


Submit prompts for video generation

Each prompt was submitted once, using each model’s default settings for generating landscape-oriented videos. Three prompts submitted to Sora 2 had to be edited to remove words that triggered OpenAI’s filter, which blocks prompts that may violate “guardrails concerning similarity to third-party content.” For example, Sora 2 flagged prompts referencing specific years, popular music artists and banned words. One blocked prompt was for a video of a politician dancing the Macarena; replacing “politician in a suit” with “man in a suit” bypassed the guardrail. Veo 3.1 flagged similar prompts when we submitted them via Gemini or Flow, but not when we submitted them directly to the Veo 3.1 API.
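For readers curious what a direct API submission involves, below is a minimal sketch of one text-to-video request using Google’s google-genai Python SDK, the route we used for Veo 3.1. The model ID, polling interval and output filename are illustrative assumptions, not values from our testing.

```python
# A minimal sketch of one text-to-video request via Google's
# google-genai Python SDK. The model ID below is an assumption;
# check Google's current documentation for the exact name.
import time

from google import genai

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed model ID
    prompt=(
        "On a bright dance floor, a smartly dressed man in a suit "
        "dances the 'Macarena'. The camera holds still."
    ),
)

# Video generation is asynchronous: poll the long-running operation
# until the clip is ready, then download and save it.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("macarena.mp4")
```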


Evaluate generated videos

We evaluated the generated videos on six different criteria related to prompt alignment and video consistency:

  1. Did the main subject dance in any way?
  2. Did the main subject perform the specific dance we prompted for?
  3. Did the main subject maintain the same physical appearance throughout the video?
  4. Did the main subject produce realistic motions based on human physiology?
  5. Did the scene and setting match the prompt?
  6. Did the camera match the prompted camera angle and position?

Each of the above criteria was assessed as a pass or a fail by a single reviewer, with the assistance of a second reviewer when needed. The generated videos of cultural dances were reviewed for accuracy by dancers familiar with them.
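As an illustration of how such a pass/fail rubric can be recorded and tallied, here is a minimal Python sketch; the field names and the sample record are invented for illustration, not copied from our tracking sheet.

```python
# A minimal sketch of recording the six pass/fail criteria per video
# and tallying pass rates per criterion. Field names and the sample
# record are illustrative, not our actual evaluation data.
from dataclasses import dataclass, fields

@dataclass
class Evaluation:
    any_dance: bool          # 1. subject danced in any way
    prompted_dance: bool     # 2. subject performed the prompted dance
    stable_appearance: bool  # 3. appearance consistent across frames
    realistic_motion: bool   # 4. motion plausible for human physiology
    scene_match: bool        # 5. scene and setting match the prompt
    camera_match: bool       # 6. camera angle/position match the prompt

def pass_rates(evals: list[Evaluation]) -> dict[str, float]:
    """Fraction of videos that passed each criterion."""
    return {
        f.name: sum(getattr(e, f.name) for e in evals) / len(evals)
        for f in fields(Evaluation)
    }

# Example: one hypothetical video that danced, but not the prompted dance.
results = [Evaluation(True, False, True, True, True, True)]
print(pass_rates(results))
```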


Results

Of the 36 videos generated, all but one showed a figure dancing. The exception, produced by Kling 2.5, instead showed the bottom half of a figure performing side lunges.

No video produced the actual dance we prompted for. For the Cahuilla Band of Indians bird dance, tribal member Emily Clarke said, “None of these depictions are anywhere close to bird dancing, in my opinion.” The videos for the Horton dance did not show the specific dance movement we prompted for, but choreographer Emma Andre said she found the depiction by Veo 3.1 to be “staggeringly lifelike.” 

For the remaining pop culture dances, we compared the generated videos against reference videos on YouTube to evaluate whether each dance was accurate.

Eleven of the 36 videos exhibited issues with motion or appearance consistency, including sudden changes in clothing, hair or limb structure; some showed heads rotating on axes separate from their bodies, or limbs liquefying and reconstituting.

See the Appendix for full results and videos.


Limitations


Image-to-video generation

We did not use images to prompt the models. Image-to-video generation involves uploading a static image along with a text prompt, producing a dynamic video from both. Some models advertise this as a way to produce dance videos from user-submitted images.


Multi-subject dance videos

We did not prompt for videos with multiple dancers, even though some of the dances are often performed in groups. We limited our video prompts to showcase a single dancer to avoid ambiguity around whether a failed evaluation was due to issues with generating complex human movement or a realistic multi-subject video.


Prompt optimization

We did not optimize prompts on a per-model basis. Each company publishes its own prompt guide. (See the guidelines for Veo 3.1, Hailuo 2.3, Kling 2.5, and Sora 2.) Instead, we used ChatGPT 5 to standardize prompts across models to align with the Sora 2 Prompting Guide. It’s possible that optimizing prompts for each model according to its specific guide could have yielded more accurate results.
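As a rough illustration of this standardization step, the sketch below uses the OpenAI Python SDK to ask a chat model to rewrite a draft prompt against a prompting guide. The model ID and the system instruction are assumptions for illustration; they are not the exact wording we used.

```python
# A minimal sketch of standardizing a draft video prompt with the
# OpenAI Python SDK, in the spirit of our ChatGPT-based editing pass.
# The model ID and instruction wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

draft = "A teacher does the chicken dance in a school gym."

response = client.chat.completions.create(
    model="gpt-5",  # assumed model ID
    messages=[
        {
            "role": "system",
            "content": (
                "Rewrite the user's draft video prompt to follow the "
                "Sora 2 Prompting Guide: describe the subject, action, "
                "setting and camera in concrete visual terms."
            ),
        },
        {"role": "user", "content": draft},
    ],
)
print(response.choices[0].message.content)
```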

We also tried to improve the quality of the videos by giving detailed step-by-step instructions of each dance. However, these instructions did not produce videos that were any more accurate than those produced with simpler prompts.


Human-motion generation models

We did not test generative models focused on human-motion generation. These models are used in animation and video games to generate and capture natural human motion. Researchers train some state-of-the-art academic models in this space using large datasets, including footage of popular dances on TikTok. Although these models may perform better than the consumer-facing models we tested, they require technical expertise and substantial computational resources to run.  


Sample size

Our evaluation is limited to the generated videos for nine prompts; it is not a comprehensive assessment of the models used. Some video generation benchmarks, like those from Tencent’s AI Lab and others, use several hundred prompts to test capabilities such as complex motion, multiple subjects and creative style.


Acknowledgements

We thank Yuhang Yang (University of Science and Technology of China) and Xiaodong Cun (Great Bay University) for reviewing an early draft of this methodology.


Appendix

View evaluations by prompt or evaluations by model.


Evaluations by prompt

Videos for each prompt below were generated using, clockwise from top-left, OpenAI’s Sora 2, Google’s Veo 3.1, Kuaishou’s Kling 2.5 and MiniMax’s Hailuo 2.3. (Credit: The Markup)

Prompt: “In a bright dance-studio, a woman grooves the ‘Apple’ dance from summer 2024 (the tune by Charli XCX plays). The camera holds still as she hits the signature moves—pastel outfit, energetic attitude, clean room. The scene feels fresh, fun, contemporary.” For Sora 2, we removed the reference to Charli XCX after the original prompt was rejected for possible violation of the platform’s guardrails for third-party likeness. (View reference video)



Prompt: “A Cahuilla Band of Indians woman in colorful ribbon skirt dances the bird dance in slow, majestic movement. The camera holds still. The mood is respectful, ceremonial and visually rich.” (View reference video)



Prompt: “In a bright school gym, a teacher in casual clothes does the chicken dance—flapping arms, twisting hips, fun and silly. The camera doesn’t move, just watches the whole body doing the moves.” (View reference video)



Prompt: “In a bright, white-walled dance studio, a dancer in tights performs the Horton fortification number 3 from the Lester Horton technique: strong lines, extended limbs, controlled movement. The camera stays still and watches the body’s form and transition.” (View reference video)



Prompt: “In a golden-sunlit public square somewhere in Jalisco, Mexico, a woman in a traditional Jalisco folklórico dress performs a Jalisco folklórico dance. The camera holds still, watching her full-body movement from a comfortable distance. The mood is festive and free-spirited, with natural sunlight and bright colours of the dress catching the light.” (View reference video)



Prompt: “On a bright dance floor, a smartly dressed politician in a suit dances the ‘Macarena’. The camera holds still, capturing his suit-clad moves, the retro fun of the moment, and the light shining on the floor.” For Sora 2, we replaced “politician in a suit” with “man in a suit” after the original prompt was rejected for possible violation of the platform’s guardrails for third-party likeness. (View reference video)



Prompt: “On a brightly lit stage, a man in a shiny satin shirt and flared trousers grooves the classic 1962 mashed-potato dance. The camera holds still, watching his feet swivel and his body rock in vintage style. The vibe is retro, fun and energetic.” For Sora 2, we removed the reference to the year after the original prompt was rejected for possible violation of the platform’s guardrails for third-party likeness. (View reference video)



Prompt: “In a cozy blue-walled bedroom, someone in fun pajamas does the Renegade dance (TikTok-famous from 2019). The camera doesn’t move—it just watches them hit the moves. The vibe is playful and relaxed.” (View reference video)



Prompt: “On a dark stage under a spotlight, a person in a sharp business suit does the classic sprinkler move — one hand behind the neck, the other sweeping wide-arc as they spin lightly. The camera stays still and captures the full body. The mood is retro, playful and fluid.” (View reference video)



Evaluations by model
