I learned so much today

I hate learning stuff.
There’s a book about the history of this region written in the 1830s, long in the public domain. Google Books has it as a PDF, but it’s a shitty PDF that isn’t searchable, and I’m looking for something specific in it.
So I went hunting.
Turns out, there’s a reputedly powerful command-line tool called OCRmyPDF that looked like it would read the PDF and OCR it in one go. But it’s open source and built for Linux, so I had to install WSL (Windows Subsystem for Linux) and a distro (I chose Ubuntu because it’s for babies). Which is kind of awesome, because I like Linux, but not enough to make it my main operating system.
Command line for the win.
But turns out I also needed a copy of Ghostscript (which is an interpreter for PostScript and PDF files). And Tesseract (which is the engine that actually does the OCR). They just would not play nice with each other. Until they did.
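For anyone retracing these steps, here is a rough sketch of the moving parts in Python. It assumes the usual Ubuntu executable names (`ocrmypdf`, `tesseract`, and `gs` for Ghostscript); on Ubuntu, `sudo apt install ocrmypdf` should normally bring the other two along, though that clearly isn't guaranteed to go smoothly. The function names are made up for illustration and are not part of OCRmyPDF.

```python
import shutil
import subprocess

# Executable names as they usually appear on Ubuntu; "gs" is Ghostscript.
REQUIRED_TOOLS = ["ocrmypdf", "tesseract", "gs"]


def missing_tools(which=shutil.which):
    """Return the required tools that are not found on PATH."""
    return [name for name in REQUIRED_TOOLS if which(name) is None]


def build_command(src, dst):
    """Basic OCRmyPDF call: read src, write a searchable copy to dst."""
    return ["ocrmypdf", src, dst]


def run_ocr(src, dst):
    """Check the tools first, then run OCRmyPDF and raise if it fails."""
    missing = missing_tools()
    if missing:
        raise RuntimeError("not on PATH: " + ", ".join(missing))
    subprocess.run(build_command(src, dst), check=True)
```

Calling `run_ocr("history.pdf", "history_ocr.pdf")` (placeholder file names) would then either complain about whichever tool is missing or hand the file to OCRmyPDF.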
And yes, it successfully OCR’ed the whole six-hundred-some pages and the resulting text is…pretty much shit. Which I don’t understand because I thought the original scan was okay. I’ll have to think on’t.
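One thing that might be worth trying on a second pass, sketched in Python below: OCRmyPDF has options to straighten pages (`--deskew`), upsample low-resolution scans before OCR (`--oversample`), set the language (`-l`), and dump the recognized text to a plain file (`--sidecar`) that is easy to search. Whether any of this helps with an 1830s typeface is a guess, and the 300 DPI figure and the function name are assumptions for illustration only.

```python
from pathlib import Path


def retry_command(src, dst, lang="eng", dpi=300):
    """Build an OCRmyPDF command with a few quality-related options.

    The sidecar text file is written next to dst with a .txt suffix.
    The lang and dpi defaults are illustrative, not tuned values.
    """
    sidecar = str(Path(dst).with_suffix(".txt"))
    return [
        "ocrmypdf",
        "--deskew",                 # straighten crooked pages
        "--oversample", str(dpi),   # resample images to at least this DPI
        "-l", lang,                 # Tesseract language pack to use
        "--sidecar", sidecar,       # also write the recognized text as plain text
        src,
        dst,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)`, and the sidecar file can be searched with any text tool instead of digging through the PDF.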
Bonus thing I learned today: the regular old command line terminal (Command Prompt) is not exactly the same thing as Windows PowerShell. You can get either, but PowerShell is what Windows defaults to.
January 29, 2026 — 7:15 pm