Data Science · NLP · Pattern Recognition

Every movie ever made is one of 6 stories.

I analyzed 29,737 film plots using sentiment analysis and machine learning — and found that Hollywood has been telling the same six stories on repeat since forever.

Siri Lahari Chava · April 2026 · 7 min read

It started with The Intern.

I was watching it for probably the third time — Anne Hathaway, Robert De Niro, the whole warm fuzzy thing — and somewhere around the midpoint I had this uncomfortable thought: I have seen this exact story before. Not the characters. Not the jokes. The shape of it. The way it moves. The feeling it leaves you with.

Outsider enters unfamiliar world. Struggles. Earns trust. Changes the place that changed them. Roll credits.

I'd seen it in The Devil Wears Prada. In The Pursuit of Happyness. In Finding Nemo, if you squint. The names change, the settings change, but the emotional journey? Identical.

"What if every movie you've ever loved is actually the same movie — just wearing a different costume?"

Most people would let that thought go. I opened a dataset.

The Method

The CMU Movie Summary Corpus contains plot summaries for over 42,000 films. I filtered it down to 29,737 movies with enough text to be meaningful, then ran each plot through a sentiment analysis pipeline — breaking every summary into 10 equal segments and scoring the emotional tone of each one.
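Here is a minimal sketch of that segment-and-score step, not the pipeline's actual code. It assumes the split is by word count (the split could just as well be by sentence or character), and it uses a tiny made-up lexicon, `TOY_LEXICON`, as a stand-in for a real sentiment scorer so the example runs with nothing installed. The function names are illustrative.

```python
import re

# Toy lexicon: a hypothetical stand-in for a real scorer such as VADER.
TOY_LEXICON = {
    "love": 1.0, "hope": 0.8, "happy": 0.9, "wins": 0.7, "friend": 0.5,
    "dies": -1.0, "loses": -0.7, "fear": -0.8, "war": -0.9, "alone": -0.5,
}

def toy_score(text):
    """Average lexicon score of the known words in text; 0.0 if none are known."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def split_into_segments(text, n_segments=10):
    """Split text into n_segments runs of words, as equal in length as possible."""
    words = text.split()
    bounds = [round(i * len(words) / n_segments) for i in range(n_segments + 1)]
    return [" ".join(words[bounds[i]:bounds[i + 1]]) for i in range(n_segments)]

def emotional_arc(text, score_fn=toy_score, n_segments=10):
    """Score each segment in order; the resulting list is the emotional arc."""
    return [score_fn(segment) for segment in split_into_segments(text, n_segments)]

# A 20-word toy "plot": hopeful first half, dark second half.
plot = " ".join(["hope"] * 10 + ["dies"] * 10)
arc = emotional_arc(plot)
print(arc)  # ten scores: positive for the first half, negative for the second
```

To use a real scorer, pass a different `score_fn`. With the `vaderSentiment` package installed, for example, a function that returns `SentimentIntensityAnalyzer().polarity_scores(segment)["compound"]` would slot in. Very short summaries produce empty segments, which this sketch scores as neutral (0.0).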

The result for each movie is an emotional arc: a curve showing how the feeling of the story rises and falls from opening scene to final frame. Hopeful start? The curve goes up. Dark middle? It dips. Redemptive ending? Back up again.

Then I ran K-means clustering on all 29,737 arcs and asked the algorithm a simple question: how many natural story shapes actually exist?
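To make "asking the algorithm" concrete, here is a small pure-Python sketch of K-means (Lloyd's algorithm) run for several values of k on synthetic arcs. It is an illustration under assumptions, not the project's code: the data is invented (three rising and three falling arcs), and the seeding is a simple deterministic farthest-point rule chosen so the example is reproducible. A real pipeline would more likely use a library implementation with k-means++ seeding and several restarts.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length arcs."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(arcs, k):
    """Deterministic farthest-point seeding (a simplification for this sketch)."""
    centroids = [list(arcs[0])]
    while len(centroids) < k:
        farthest = max(arcs, key=lambda a: min(sq_dist(a, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids

def kmeans(arcs, k, max_iter=100):
    """Lloyd's algorithm. Returns (centroids, labels, inertia)."""
    centroids = init_centroids(arcs, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each arc joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(a, centroids[j]))
                      for a in arcs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [a for a, lab in zip(arcs, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Inertia: total squared distance from each arc to its assigned centroid.
    inertia = sum(sq_dist(a, centroids[lab]) for a, lab in zip(arcs, labels))
    return centroids, labels, inertia

# Synthetic demo data (invented): 3 rising arcs and 3 falling arcs, 10 segments each.
offsets = (-0.05, 0.0, 0.05)
rising = [[i / 9 + o for i in range(10)] for o in offsets]
falling = [[1 - i / 9 + o for i in range(10)] for o in offsets]
arcs = rising + falling

inertia_by_k = {k: kmeans(arcs, k)[2] for k in range(1, 5)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 3))
```

On this toy data the inertia drops sharply from k=1 to k=2 and barely moves after that, which is the kind of signal used to pick the number of clusters.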

29,737 films analyzed · 6 story shapes found · 10 emotional segments per plot

The answer was 6. Not 3, not 12. Six distinct emotional shapes, and every one of the 29,737 arcs sits closest to one of them. Vonnegut hypothesized something like this decades ago. I wanted to test it with data.

The 6 Shapes

Shape 01 · 4,621 films
The Tragedy
Dark from start to finish. No relief, no redemption. The story just gets heavier.
Examples: The Godfather Part II · Spider-Man 3

Shape 02 · 4,971 films
The False Hope
Builds toward something good, then falls apart at the end. Hope, then loss.
Examples: Forrest Gump · The Notebook · The Devil Wears Prada

Shape 03 · 6,212 films
The Feel Good
Consistently warm all the way through. No real darkness. Just good.
Examples: Inception · Rocky · Clueless

Shape 04 · 4,529 films
The Roller Coaster
Up, down, up, down. It never lets you settle. Classic adventure structure.
Examples: Schindler's List · Harry Potter

Shape 05 · 4,196 films
The Slow Burn
Starts okay. Slowly unravels. Hope fades quietly. You barely notice until it's gone.
Examples: Mean Girls · Gravity

Shape 06 · 5,208 films
The Triumph
Rough start, hard middle, but it ends on a high. Dark to light.
Examples: Titanic · Finding Nemo · The Pursuit of Happyness

The Finding That Surprised Me

I expected the clusters to be messier. I expected the algorithm to need 10 or 12 groups to make sense of 29,737 films. It didn't. Six was clean. When I plotted clustering error against the number of clusters, the elbow was clear and decisive.
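One common way to turn "the elbow was clear" into a number is to look for the k where the error curve bends most sharply. The sketch below does that with a second difference. The helper name `elbow_k` is hypothetical, and the inertia values are invented purely to show the mechanics. They are not the project's results.

```python
def elbow_k(inertia_by_k):
    """Return the k where the inertia curve bends most sharply.

    Uses the discrete second difference, so it needs at least three
    consecutive values of k.
    """
    ks = sorted(inertia_by_k)
    if len(ks) < 3:
        raise ValueError("need inertia for at least three values of k")
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Large positive value: steep drop before k, flat after it.
        bend = inertia_by_k[prev_k] - 2 * inertia_by_k[k] + inertia_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Invented numbers for illustration only.
toy_inertia = {1: 100.0, 2: 55.0, 3: 20.0, 4: 17.0, 5: 15.0, 6: 14.0}
print(elbow_k(toy_inertia))  # 3
```

A rule like this is a sanity check on what the plot already shows, not a substitute for looking at it.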

What surprised me more was where specific movies landed. Forrest Gump, widely remembered as an uplifting American story, is a False Hope. The data reads its plot as building toward something and then collapsing. Which, if you think about it, is exactly right. Jenny dies. The warmth of the ending is bittersweet at best. The algorithm caught something the marketing never admitted.

The Intern is a Triumph. Shape 6. Rough start, uncertain middle, warm ending. It shares its arc with Titanic and Finding Nemo, which sounds ridiculous until you map the emotional journey and realize all three follow the same curve: descent into difficulty, slow rebuild, resolution that earns its feeling.

That's the thing about working with data at this scale. You stop seeing movies as individual stories and start seeing them as instances of patterns. Every plot is a data point. Every emotional beat is a signal. And when you have 29,737 of them, the patterns become impossible to ignore.

What This Actually Is

This project is a Streamlit app built on top of the CMU Movie Summary Corpus. Paste in any plot (a movie, a book, even a story you're writing) and it classifies the emotional arc in real time, shows you the sentiment curve, and tells you which famous films share your story's DNA.
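The classification step can be sketched as a nearest-centroid lookup: once a pasted plot has been turned into a 10-point arc, compare it to each cluster center and to each film's stored arc. Everything concrete below is a placeholder. The three centroids are hand-drawn, not the trained ones, the film arcs are invented, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length arcs."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify_arc(arc, centroids):
    """Return the name of the shape whose centroid is closest to arc."""
    return min(centroids, key=lambda name: sq_dist(arc, centroids[name]))

def most_similar(arc, library, top_n=3):
    """Return the top_n titles in library whose arcs are closest to arc."""
    ranked = sorted(library, key=lambda title: sq_dist(arc, library[title]))
    return ranked[:top_n]

# Hand-drawn placeholder centroids (NOT the trained cluster centers).
centroids = {
    "The Tragedy": [-0.5] * 10,
    "The Feel Good": [0.5] * 10,
    "The Triumph": [-0.5 + i / 9 for i in range(10)],
}

# Invented arcs for three hypothetical films.
library = {
    "Film A": [-0.4 + 0.1 * i for i in range(10)],
    "Film B": [0.4] * 10,
    "Film C": [0.5 - 0.1 * i for i in range(10)],
}

# An arc that starts low and ends high.
query = [-0.4, -0.4, -0.3, -0.2, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
print(classify_arc(query, centroids))          # The Triumph
print(most_similar(query, library, top_n=2))   # ['Film A', 'Film B']
```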

The pipeline uses VADER for sentiment scoring, K-means clustering trained on all 29,737 arcs, and a custom weighting system that gives extra importance to the ending of a plot — because how a story ends matters more than how it begins.
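One way such ending weighting could work, sketched under assumptions: give each of the 10 segments a weight that ramps up toward the end, then scale each arc by the square root of its weights before clustering. After that scaling, the ordinary squared distance K-means uses becomes a weighted distance, so differences in the final segments count for more. The linear ramp, the `ending_boost` value of 2.0, and the function names are all hypothetical. The actual weighting scheme isn't specified here.

```python
import math

def ending_weights(n_segments=10, ending_boost=2.0):
    """Weights rising linearly from 1.0 (first segment) to ending_boost (last)."""
    if n_segments == 1:
        return [ending_boost]
    step = (ending_boost - 1.0) / (n_segments - 1)
    return [1.0 + step * i for i in range(n_segments)]

def apply_weights(arc, weights):
    """Scale each point by sqrt(weight) so squared distances become weighted."""
    return [x * math.sqrt(w) for x, w in zip(arc, weights)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length arcs."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

weights = ending_weights()

flat = [0.0] * 10
differs_at_start = [1.0] + [0.0] * 9   # same as flat except the opening
differs_at_end = [0.0] * 9 + [1.0]     # same as flat except the ending

d_start = sq_dist(apply_weights(flat, weights), apply_weights(differs_at_start, weights))
d_end = sq_dist(apply_weights(flat, weights), apply_weights(differs_at_end, weights))
print(d_start, d_end)  # the ending difference counts roughly twice as much
```

Scaling the inputs rather than changing the distance function means any standard K-means implementation can be used unmodified.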

The insight is older than film. In 1995, Kurt Vonnegut said stories have shapes. It took 29,737 data points to back him up.

Written by Siri Lahari Chava
Data Scientist · Frisco, TX