I learned so much today

I hate learning stuff.

There’s a book about the history of this region written in the 1830s, long in the public domain. Google Books has it as a pdf file, but it’s a shitty pdf file that isn’t searchable. I’m looking for something specific in it.

So I went hunting.

Turns out, there’s a reputedly powerful command line tool called OCRmyPDF that looked like it would read and OCR it in one go. But it’s open source and natively Linux, so I had to install WSL (Windows Subsystem for Linux) and a distro (I chose Ubuntu because it’s for babies). Which is kind of awesome, because I like Linux, but not enough to make it my main computer.

Command line for the win.

But turns out I also needed a copy of Ghostscript (which is a program for handling pdf files). And Tesseract (which is the thing that actually does the OCR). They just would not play nice with each other. Until they did.
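For the curious, the whole job boils down to one command once everything is installed. Here is a minimal Python sketch of it, assuming `ocrmypdf` is on the PATH inside WSL. The function names are made up for illustration; `-l` (Tesseract language) and `--deskew` are real OCRmyPDF options, but whether they suit this particular scan is a guess.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src, dst, language="eng", deskew=False):
    """Return the argument list for one OCRmyPDF run (nothing is executed here)."""
    cmd = ["ocrmypdf", "-l", language]
    if deskew:
        cmd.append("--deskew")  # straighten crooked page scans before OCR
    cmd += [str(src), str(dst)]
    return cmd


def ocr_pdf(src, dst, **options):
    """Run OCRmyPDF on src and write a searchable PDF to dst."""
    if shutil.which("ocrmypdf") is None:
        # Assumption: the tool was installed inside the WSL distro.
        raise RuntimeError("ocrmypdf not found on PATH")
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)
```

Calling `ocr_pdf("book.pdf", "book_ocr.pdf", deskew=True)` (hypothetical file names) would hand the scan to OCRmyPDF, which in turn leans on Ghostscript and Tesseract behind the scenes.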

And yes, it successfully OCR’ed the whole six-hundred-some pages and the resulting text is…pretty much shit. Which I don’t understand because I thought the original scan was okay. I’ll have to think on’t.

Bonus thing I learned today: the regular old command line (Command Prompt) is not the same thing as Windows PowerShell. You can get either, but the latter is what Windows defaults to.

Comments


Comment from durnedyankee
Time: January 29, 2026, 9:19 pm

Ugh….looks like UNIX!

And Unix is a virus with a user interface.


Comment from QuasiModo
Time: January 29, 2026, 11:48 pm

An alternative to Windows Subsystem for Linux is running Linux in a VirtualBox virtual machine. That’s the way I do it until I permanently switch to Linux…then it will be Windows 10 in the VM 🙂

I keep an EndeavourOS VM (based on Arch), a CachyOS VM (based also on Arch) and a plain vanilla Debian VM.

You can map to shares on your Windows host to transfer files back and forth, paste from the Windows clipboard, access serial ports, all kinds of stuff.


Comment from nbc
Time: January 30, 2026, 11:44 am

I concur with QuasiModo. I’m running Linux Mint (Ubuntu-based) in VMware Player now so that I can kick MS to the kerb.

If you have access to the current MS Word, there is OCR capability in there so you may not need Ghostscript et al.


Comment from MrKnowitall
Time: January 30, 2026, 1:39 pm

OR – you can just get a second SSD (ok, first sell a kidney) and install Linux on it and make your computer a true dual boot machine. As my father once told me, “keep your pens here, and your pencils here”. Although I think he was talking about an entirely different subject at the time…


Comment from S. Weasel
Time: January 30, 2026, 8:42 pm

There’s an awesome OCR built into Google Docs. Have I mentioned? All you have to do is open the .jpg with Google Docs. But I need to learn how to handle a 600 page document.

Next step, I learn how to pre-process images before the OCR so it scans better. But first, snip a 600 page pdf into individual images (that’s going to be a disk eater for a while!).
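A rough sketch of the snipping step, assuming Ghostscript's `gs` command is on the PATH. It renders each page to a grayscale PNG. The function names, the 300 dpi figure and the output file pattern are my own placeholders, not settings I have tested on this book.

```python
import subprocess
from pathlib import Path


def build_split_command(pdf_path, out_dir, dpi=300):
    """Return a Ghostscript command that writes one grayscale PNG per page."""
    # %04d is filled in by Ghostscript with the page number (0001, 0002, ...).
    pattern = Path(out_dir, "page-%04d.png").as_posix()
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH",      # run straight through without prompting
        "-sDEVICE=pnggray",          # 8-bit grayscale PNG output
        f"-r{dpi}",                  # render resolution in dots per inch
        f"-sOutputFile={pattern}",
        str(pdf_path),
    ]


def split_pdf(pdf_path, out_dir, dpi=300):
    """Create out_dir if needed, then render every page of the PDF into it."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_split_command(pdf_path, out_dir, dpi), check=True)
```

Something like `split_pdf("book.pdf", "pages")` (hypothetical names again) should leave a folder of numbered page images ready for cleanup before OCR.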

My ultimate plan is to make good, searchable digital copies of a lot of local history documents that have fallen into the public domain.


Comment from Rich Rostrom
Time: January 30, 2026, 11:57 pm

Preview for Mac OS X is an open-more-or-less-everything app. It does pretty damn good OCR – I ran a whole book in French through it. (I had scans of every page of the book.)

Give me the name of the book, I’ll get it off Google Books and see what I can do.
