Creating a Demo SQLite Database for Teens with IMDb and Python

Creating a Demo SQLite Database for Teens with IMDb and Python

When you’re teaching SQL to teenagers, you don’t just want them to memorize SELECT * FROM customers. You want them to explore something fun, relatable, and realistic. That’s why I built a demo database based on IMDb movies and series for our SQL course.

The goal: give students a lightweight, easy-to-install database they can query at home, while working with data they actually care about.

Why SQLite + DBeaver?

For a classroom setting, SQLite is perfect. It’s lightweight, requires no server setup, and runs on pretty much any machine. Pair it with DBeaver as the front-end, and you’ve got a professional-looking environment that’s still beginner-friendly. Students can install it at home without headaches, and it’s more than enough for learning SQL queries.

Choosing the Dataset: Movies!

Let’s be honest: querying a “customers and orders” dataset isn’t exactly thrilling. Movies, on the other hand, are instantly engaging. Everyone has a favorite film or series, and that makes the data relatable.

The IMDb Non-Commercial Datasets was the obvious choice. It’s realistic, current, and huge. The export comes as TSV files, and with the Python package imdb-sqlite · PyPI, you can transform those files into a ready-to-query SQLite database.

Getting Started

For absolute beginners, here’s the quick setup:

  1. Install a Python distribution (I recommend Anaconda).
  2. Open the Anaconda Prompt.
  3. Navigate to your target folder: cd path\to\your\project
  4. Install the package: pip install imdb-sqlite
  5. Run the tool: imdb-sqlite

This will generate a roughly 20 GB big SQLite database from the IMDb TSV files.

The Challenges

Of course, nothing worth doing comes without hurdles:

  • The IMDb dataset is huge – around 20GB in sqlite with indexes.
  • It includes everything since the early 1900s, but we only need films and series from 2000 onwards.
  • And yes, IMDb also contains adult-only titles. Not exactly suitable for a youth course.

So, cleanup was required.

Cleaning Up with Python + Copilot

With Copilot’s help, I wrote a Python script to filter the dataset. Some key lessons learned along the way:

  • Delete in batches (e.g., 500,000 rows at a time). This keeps the SQLite journal file manageable (under ~500MB).
  • SQLite doesn’t support every SQL command you might expect. For example, DELETE ... LIMIT isn’t available by default. The workaround: use a temporary table and rowid.
  • Always run VACUUM at the end. Otherwise, the file size stays bloated even after deletions.

The script handles all of this automatically. For sure I could have coded that myself however it would have taken much longer than having it generated by Copilot.
Download the code here.

By running this script I could reduce the size to 9 GB. After zipping the SQLite file it became manageable 1.2 GB in size (7zip Maximum Compression). This is half the size of a standard zip file. When unzipping it might take a few minutes (3 minutes on my laptop with SSD…your mileage might vary).

Example Output

Here’s what the cleanup looks like in action:

(base) C:\Users\myUser\myFolder>python cleanup_imdb_titles_2000plus.py
[2025-11-09 11:28:27] Starting cleanup process...
[2025-11-09 11:28:27] Column is_adult not found, skipping deletion of adult titles...
[2025-11-09 11:28:27] Column is_adult already removed, skipping...
[2025-11-09 11:28:29] Step 2 (premiered < 2000): nothing to delete
[2025-11-09 11:29:45] Step 3a (crew): total rows to delete = 36,887,467
[2025-11-09 11:30:11] Step 3a (crew): deleted 500,000 rows (total 500,000, 1.36% done, 36,387,467 remaining)
[2025-11-09 11:30:29] Step 3a (crew): deleted 500,000 rows (total 1,000,000, 2.71% done, 35,887,467 remaining)
...

Lean back…it takes a while to delete the data (almost 3 hours on my machine). For sure there are more efficient databases than SQLite which might also delete faster but the simplicity of running it is amazing.

Wrapping Up

With this setup, students get:

  • A realistic dataset they can relate to.
  • A lightweight database they can install at home.
  • A chance to practice SQL queries on something fun.
  • And a taste of real-world data cleaning challenges.

It’s a great way to make SQL approachable, engaging, and practical.

Have fun exploring IMDb data yourself – and don’t forget to run VACUUM!

Disclaimer

This article was created based on my personal notes with support from Microsoft Copilot. While Copilot assisted in structuring and refining the content, all technical details have been carefully reviewed and developed by me.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.