It started with The Intern.
I was watching it for probably the third time — Anne Hathaway, Robert De Niro, the whole warm fuzzy thing — and somewhere around the midpoint I had this uncomfortable thought: I have seen this exact story before. Not the characters. Not the jokes. The shape of it. The way it moves. The feeling it leaves you with.
Outsider enters unfamiliar world. Struggles. Earns trust. Changes the place that changed them. Roll credits.
I'd seen it in The Devil Wears Prada. In The Pursuit of Happyness. In Finding Nemo, if you squint. The names change, the settings change, but the emotional journey? Identical.
Most people would let that thought go. I opened a dataset.
The CMU Movie Summary Corpus contains plot summaries for over 42,000 films. I filtered it down to 29,737 movies with enough text to be meaningful, then ran each plot through a sentiment analysis pipeline — breaking every summary into 10 equal segments and scoring the emotional tone of each one.
The result for each movie is an emotional arc: a curve showing how the feeling of the story rises and falls from opening scene to final frame. Hopeful start? The curve goes up. Dark middle? It dips. Redemptive ending? Back up again.
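Here is a minimal sketch of that arc-building step, not the project's actual code. It assumes "equal segments" means equal word counts, which the description above does not specify. The tiny word lists and the names `toy_scorer`, `split_into_segments`, and `emotional_arc` are hypothetical stand-ins so the example runs without extra packages. The `vader_scorer` helper shows how VADER's compound score could be plugged in if the `vaderSentiment` package is installed.

```python
from typing import Callable, List

# Tiny stand-in lexicon so the sketch runs without extra packages (hypothetical).
POSITIVE = {"happy", "love", "hope", "wins", "joy"}
NEGATIVE = {"sad", "dies", "loses", "fear", "alone"}


def toy_scorer(text: str) -> float:
    """Score text in [-1, 1] as (positive - negative words) / total words."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def vader_scorer() -> Callable[[str], float]:
    """Return a scorer backed by VADER's compound score.

    Requires the vaderSentiment package; imported lazily so the rest of
    the sketch works without it.
    """
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def split_into_segments(text: str, n_segments: int = 10) -> List[str]:
    """Split text into n_segments chunks of (nearly) equal word count."""
    words = text.split()
    # Integer boundaries at 0, len/n, 2*len/n, ..., len.
    bounds = [i * len(words) // n_segments for i in range(n_segments + 1)]
    return [" ".join(words[bounds[i]:bounds[i + 1]]) for i in range(n_segments)]


def emotional_arc(
    text: str,
    scorer: Callable[[str], float] = toy_scorer,
    n_segments: int = 10,
) -> List[float]:
    """Score each segment in order; the list of scores is the arc."""
    return [scorer(seg) for seg in split_into_segments(text, n_segments)]
```

Calling `emotional_arc(plot, scorer=vader_scorer())` would return ten numbers between -1 and 1, one per segment, which is the curve described above.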
Then I ran K-means clustering on all 29,737 arcs and asked the algorithm a simple question: how many natural story shapes actually exist?
The answer was 6. Not 3, not 12. Six distinct emotional shapes that account for nearly every film in the corpus. Vonnegut hypothesized something like this decades ago. I wanted to prove it with data.
I expected the clusters to be messier. I expected the algorithm to need 10 or 12 groups to make sense of 29,737 films. It didn't. Six was clean. When I plotted the clustering error against the number of clusters, the elbow in the curve was clear and decisive.
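For readers who have not met the elbow method: you cluster the arcs for several values of k, record the total squared distance from each arc to its cluster center (the inertia), and look for the k where adding more clusters stops helping much. The sketch below is a plain-Python version of K-means written only to keep the example self-contained. It is an assumption, not the code behind this project, and in practice scikit-learn's `KMeans` with its `inertia_` attribute would be the usual choice. The function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Arc = Sequence[float]


def _sq_dist(a: Arc, b: Arc) -> float:
    """Squared Euclidean distance between two arcs."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _assign(arcs: Sequence[Arc], centroids: List[List[float]]) -> List[int]:
    """Label each arc with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: _sq_dist(a, centroids[c]))
        for a in arcs
    ]


def kmeans(
    arcs: Sequence[Arc], k: int, n_iter: int = 50, seed: int = 0
) -> Tuple[List[List[float]], List[int], float]:
    """Basic Lloyd's algorithm. Returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen arcs.
    centroids = [list(a) for a in rng.sample(list(arcs), k)]
    for _ in range(n_iter):
        labels = _assign(arcs, centroids)
        updated = []
        for c in range(k):
            members = [a for a, lab in zip(arcs, labels) if lab == c]
            if members:
                # New centroid is the per-segment mean of its members.
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Keep an empty cluster's centroid where it was.
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    labels = _assign(arcs, centroids)
    inertia = sum(_sq_dist(a, centroids[lab]) for a, lab in zip(arcs, labels))
    return centroids, labels, inertia


def elbow_inertias(arcs: Sequence[Arc], ks: Sequence[int]) -> Dict[int, float]:
    """Inertia for each candidate k; plot these to look for the elbow."""
    return {k: kmeans(arcs, k)[2] for k in ks}
```

Plotting the values from `elbow_inertias(arcs, range(1, 13))` gives the curve in question: it drops steeply at first and then flattens, and the bend is the elbow.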
What surprised me more was where specific movies landed. Forrest Gump, widely remembered as an uplifting American story, is a False Hope. The data reads its plot as building toward something and then collapsing. Which, if you think about it, is exactly right. Jenny dies. The warmth of the ending is bittersweet at best. The algorithm caught something the marketing never admitted.
The Intern is a Triumph. Shape 6. Rough start, uncertain middle, warm ending. It shares its arc with Titanic and Finding Nemo, which sounds ridiculous until you map the emotional journey and realize all three follow the same curve: descent into difficulty, slow rebuild, resolution that earns its feeling.
That's the thing about working with data at this scale. You stop seeing movies as individual stories and start seeing them as instances of patterns. Every plot is a data point. Every emotional beat is a signal. And when you have 29,737 of them, the patterns become impossible to ignore.
This project is a Streamlit app built on top of the CMU Movie Summary Corpus. Paste in any plot (a movie, a book, even a story you're writing) and it classifies the emotional arc in real time, shows you the sentiment curve, and tells you which famous films share your story's DNA.
The pipeline uses VADER for sentiment scoring, K-means clustering trained on all 29,737 arcs, and a custom weighting system that gives extra importance to the ending of a plot — because how a story ends matters more than how it begins.
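One way such ending-weighted classification could work is to match a new arc to the nearest cluster center while counting mismatches in the final segments more heavily. This is a sketch under that assumption, not the app's actual weighting system, whose details are not described here. The weight values, the number of "ending" segments, and the function names are all hypothetical.

```python
from typing import List, Sequence


def ending_weights(
    n_segments: int = 10, n_ending: int = 3, ending_weight: float = 2.0
) -> List[float]:
    """Weight 1.0 for early segments, ending_weight for the last n_ending.

    The defaults are illustrative guesses, not values from the project.
    """
    return [1.0] * (n_segments - n_ending) + [ending_weight] * n_ending


def weighted_sq_dist(
    arc: Sequence[float], centroid: Sequence[float], weights: Sequence[float]
) -> float:
    """Squared distance where each segment's error is scaled by its weight."""
    return sum(w * (a - c) ** 2 for a, c, w in zip(arc, centroid, weights))


def classify_arc(
    arc: Sequence[float],
    centroids: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> int:
    """Return the index of the centroid closest to arc under the weights."""
    return min(
        range(len(centroids)),
        key=lambda i: weighted_sq_dist(arc, centroids[i], weights),
    )
```

With uniform weights, an arc that tracks one shape for most of its length gets that shape's label. With the ending weighted up, a story that finishes like a different shape can flip to that shape instead, which is the behavior the weighting is meant to produce.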
The insight is older than film. Kurt Vonnegut said in 1995 that stories have shapes. It just took 29,737 data points to prove him right.