
Trying to make a Victorian LLM with nanochat
What if you could talk to someone from over a century ago and it didn’t feel like a cosplay chatbot doing a bad accent? That itch is what pushed me into this. I wanted an excuse to train Karpathy’s nanochat and actually touch the whole pipeline: mid-training, supervised fine-tuning, and the boring-but-important grind of collecting a lot of text without losing my mind.
The first wall I hit was data. There isn’t a big, clean, “Victorian conversation starter pack” dataset sitting around, so I had to make one. This whole thing was inspired by TimeCapsuleLLM (https://github.com/haykgrigo3/TimeCapsuleLLM), which uses Internet Archive texts with nanoGPT. That repo had a curated list of Internet Archive IDs, mostly popular London texts from roughly 1800–1875. Great starting point, but I wanted wider geography and a broader time window, so I went hunting.
Internet Archive’s advanced search is the real hero here. You can treat it like a weird little query machine and, with a bit of binary-search-style poking around, pull slices of what you want. I started small and grabbed around 1,000 text IDs just to see if the plumbing worked. Once it did, I rented a cheap VPS with a lot of cores and a fast pipe and tried to scale up. That’s when I learned the API has a personality too: with an access key, it was happy at about 256 concurrent downloads. Past that, requests started timing out and everything turned into a sad queue of retries.
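To make the search side concrete: it’s just an HTTP endpoint you can page through. Here’s a minimal sketch of the kind of query I mean, against the public advancedsearch.php API; the filters are illustrative rather than my exact query (the DjVuTXT format flag is how OCR plain text usually shows up):

```python
import requests

# Minimal sketch: page through Internet Archive's advanced search API.
# The query filters below are illustrative, not the exact production query.
SEARCH_URL = "https://archive.org/advancedsearch.php"

def fetch_ids(query, rows=1000):
    """Yield item identifiers matching `query`, one page at a time."""
    page = 1
    while True:
        resp = requests.get(SEARCH_URL, params={
            "q": query,
            "fl[]": "identifier",        # only return the item ID
            "sort[]": "downloads desc",  # popular (likely-to-exist) items first
            "rows": rows,
            "page": page,
            "output": "json",
        }, timeout=60)
        resp.raise_for_status()
        docs = resp.json()["response"]["docs"]
        if not docs:
            return
        for doc in docs:
            yield doc["identifier"]
        page += 1

if __name__ == "__main__":
    q = ('mediatype:texts AND language:(English) '
         'AND date:[0001-01-01 TO 1899-12-31] AND format:(DjVuTXT)')
    for i, identifier in enumerate(fetch_ids(q)):
        print(identifier)
        if i >= 999:  # the ~1,000-ID smoke test
            break
```

(The paged endpoint gets unhappy if you page too deep; for the big pull, IA’s cursor-based scrape API is the better tool, but this was plenty for the smoke test.)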
After a few solid runs, I went big: I pulled IDs for 1 million texts, years 0001 to 1899, English, OCR text available, then sorted by download count so I’d hit the “likely to exist” stuff first. I let the downloader run overnight and woke up to around 700,000 TXT files. It slowed down hard over time, because once you get into the low-popularity tail, tons of items claim OCR text but don’t actually have it. That meant a lot of HTTP 400 responses and wasted time. At some point the hit rate felt like digging for coins in a parking lot, so I killed the run.
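The downloader itself was nothing clever. A sketch of the shape, using aiohttp with a semaphore pinned at 256; the `_djvu.txt` filename convention and the LOW-style auth header are assumptions about how IA serves OCR text, so verify both before burning a night on it:

```python
import asyncio
import os
import aiohttp

# Sketch of the overnight downloader: a semaphore caps concurrency at the
# ~256 the API tolerated, and anything that 400s/404s (OCR text advertised
# but missing) is just recorded as a miss.
MAX_CONCURRENCY = 256
HEADERS = {"Authorization": "LOW accesskey:secret"}  # hypothetical credentials

async def fetch_txt(session, sem, identifier, out_dir="texts"):
    url = f"https://archive.org/download/{identifier}/{identifier}_djvu.txt"
    async with sem:
        try:
            async with session.get(url) as resp:
                if resp.status != 200:
                    return identifier, False
                body = await resp.read()
        except (asyncio.TimeoutError, aiohttp.ClientError):
            return identifier, False
    with open(os.path.join(out_dir, f"{identifier}.txt"), "wb") as f:
        f.write(body)
    return identifier, True

async def download_all(identifiers, out_dir="texts"):
    os.makedirs(out_dir, exist_ok=True)
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    timeout = aiohttp.ClientTimeout(total=120)
    async with aiohttp.ClientSession(headers=HEADERS, timeout=timeout) as session:
        results = await asyncio.gather(
            *(fetch_txt(session, sem, i, out_dir) for i in identifiers)
        )
    hits = sum(ok for _, ok in results)
    print(f"{hits}/{len(results)} items actually had OCR text")

# asyncio.run(download_all(ids))  # ids from the search step above
```

For the full million-ID run you’d feed identifiers through in chunks instead of creating a million tasks up front, but the semaphore is the part that actually kept the API happy.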
Now I had a mountain of raw text, which is where the unglamorous work starts. The first pages of a lot of these books are packed with boilerplate like “digitized by Google” or “provided by X”, plus random catalog junk. I stripped that out with a couple of regex patterns and a simple Python cleanup pass, then stored the result as Parquet so nanochat’s loader wouldn’t choke. I chunked the dataset into shards of about 250M characters, 1024 rows per shard. Final count: 233 GB, around 593.7B characters. I uploaded the chunked dataset here: https://huggingface.co/datasets/meettilavat/InternetArchive_1899_Chunked
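Here’s a simplified sketch of that pass; the patterns below are stand-ins (the real list was longer and uglier), but the shard logic matches the numbers above:

```python
import re
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in boilerplate patterns; the real list was longer and uglier.
BOILERPLATE = [
    re.compile(r"digitized by google", re.IGNORECASE),
    re.compile(r"provided by \S+", re.IGNORECASE),
]

def clean(text):
    for pat in BOILERPLATE:
        text = pat.sub("", text)
    return text.strip()

def write_shards(texts, prefix="shard", rows_per_shard=1024,
                 chars_per_shard=250_000_000):
    """Pack cleaned documents into Parquet shards: flush at 1024 rows or ~250M chars."""
    rows, chars, shard_id = [], 0, 0

    def flush():
        nonlocal rows, chars, shard_id
        pq.write_table(pa.table({"text": rows}), f"{prefix}_{shard_id:05d}.parquet")
        rows, chars, shard_id = [], 0, shard_id + 1

    for t in map(clean, texts):
        rows.append(t)
        chars += len(t)
        if len(rows) >= rows_per_shard or chars >= chars_per_shard:
            flush()
    if rows:
        flush()
```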
For mid-training, I wanted to nudge the model toward a period voice and vibe without going full theatre kid. That part was easier because Karpathy already provides a script to generate mid-training data, so it was mostly prompt tweaking and hooking it up through OpenRouter. The last piece was SFT, because I needed the model to behave like a chat model instead of a fancy autocomplete engine.
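OpenRouter speaks the OpenAI API, so the hookup is basically a base_url swap. A minimal sketch; the model slug and prompts here are placeholders, not what I actually ran:

```python
from openai import OpenAI

# OpenRouter is OpenAI-compatible, so the stock client works with a
# swapped base_url. Model slug and prompts are placeholders.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # OpenRouter key
)

resp = client.chat.completions.create(
    model="some-provider/some-model",  # hypothetical slug
    messages=[
        {"role": "system", "content": (
            "You are a well-read correspondent writing in 1899. Answer in "
            "period-appropriate English and never reference anything later."
        )},
        {"role": "user", "content": "What news of the new underground railway?"},
    ],
)
print(resp.choices[0].message.content)
```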
Nanochat uses smol-smoltalk (https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk), which is cool, but it’s packed with modern Q&A and modern knowledge. That defeats the whole point. The nice surprise was that Hugging Face published the generation scripts: https://github.com/huggingface/smollm/tree/main/text/data/smoltalk. So I adapted those and generated my own synthetic set using GPT-5-nano, batched out to about 400,000 prompts across a bunch of topics. That dataset is here: https://huggingface.co/datasets/meettilavat/smol-oldtalk
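The batching layer was the least glamorous part. Here’s a sketch of the shape, assuming OpenAI’s Batch API with a toy topic list; each batch file caps out at 50,000 requests, so 400K prompts means a stack of these:

```python
import json
from openai import OpenAI

# Sketch: write a JSONL request file for the OpenAI Batch API, upload it,
# and start a batch job. Topics and the prompt template are toy stand-ins.
client = OpenAI()
TOPICS = ["agriculture", "etiquette", "steam engines", "household medicine"]
prompts = [f"Explain {t} to a curious reader." for t in TOPICS]  # really ~400K

def request_line(i, prompt):
    return json.dumps({
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5-nano",  # the model named above
            "messages": [
                {"role": "system", "content": "Answer as a knowledgeable Victorian."},
                {"role": "user", "content": prompt},
            ],
        },
    })

with open("batch_00.jsonl", "w") as f:
    for i, p in enumerate(prompts):
        f.write(request_line(i, p) + "\n")

batch_file = client.files.create(file=open("batch_00.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(job.id, job.status)
```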
Once everything was assembled, I trained it on an 8xH100 cluster on vast.ai. From “go” to “done”, it took around three hours. The demo is live here: https://huggingface.co/spaces/meettilavat/nanochat_d20_1899_demo

And yeah, the result isn’t magic. It hallucinates. A lot. I didn’t use enough fact cards in SFT, and I only trained on around 4B characters of the giant pile, so it’s missing huge chunks of knowledge and it shows. Still, I don’t regret it for a second. I came out of this with a much sharper feel for what it takes to build an LLM end-to-end, and it made me appreciate how many moving parts frontier teams must juggle without anything catching fire. Honestly, getting even a messy “Victorian-ish” chat model to run feels like a small win.