Converting an image pdf file to a searchable text pdf file in a Linux environment

April 09, 2021

Okay, so that's a really long title for a blog post, but sometimes it takes many words to say what you are actually doing. I learned that lesson by spending a lot of time on mostly worthless forums, where few people can write a subject line that has anything to do with their issue.

At any rate, some background. I love downloading public-domain (mostly) books and documents, but often they are scanned as image files. As I am a writer and want to use quotes from the pdf, it is much easier if I convert the picture pdf to a text pdf so I can copy and paste, rather than re-typing the quoted material.

There are lots of ways to go about this conversion task, but often they require buying conversion software or paying to play in the cloud. I hate spending money on work stuff, so here's my simple, quick solution.

1. Install gscan2pdf. In Ubuntu, you can do that from the Ubuntu Software Centre or GNOME Software. If you are into installing using a terminal, have at it, but we won't describe the "how to" of terminal installation here.
2. Open your file browser and find the file you want to OCR. Right-click on the file and choose Open With gscan2pdf. If a dialog tells you that no scanner was detected, just close it. You aren't going to scan anything on the scanner.
3. gscan2pdf will proceed to load all the pages of the image pdf.
4. Click on Tools, OCR.
5. Select the Page Range, then click OCR (I leave the other settings at default).
6. Click on the OCR tab to watch as the OCR is being performed. Don't be frightened: it looks like an awful jumble, but the end product won't be jumbled.
7. After the last page is OCR'd, click File, Save. Select the format in which you want to save the OCR'd file. In this case, as I simply want a pdf file with searchable text that I can copy, highlight and annotate, I chose to save the file as PDF. (Bonus: you can now also convert the pdf to a text file that can be edited in a word processor.)
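If you want to try the bonus step, or just confirm that the saved pdf really has copyable text, here is a minimal Python sketch. It is not part of gscan2pdf. It assumes the `pdftotext` command from the poppler-utils package is installed, and the function names (`build_pdftotext_command`, `looks_searchable`, `pdf_to_text`) and the 20-character threshold are my own illustrative choices.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for: pdftotext -layout input.pdf output.txt"""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def looks_searchable(text, min_chars=20):
    """Rough heuristic (an assumption, not a standard): an OCR'd pdf should
    yield at least a handful of non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def pdf_to_text(pdf_path):
    """Write the pdf's text layer to a .txt file next to it; return that path."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it ships with poppler-utils")
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path


if __name__ == "__main__" and len(sys.argv) == 2:
    out = pdf_to_text(sys.argv[1])
    text = out.read_text(encoding="utf-8", errors="replace")
    print(f"Wrote {out}; searchable text found: {looks_searchable(text)}")
```

Save it under any name you like and run it with the OCR'd pdf as the only argument. An image-only pdf should produce a nearly empty text file, so the check reports False; after a successful OCR pass and save, it should report True.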

The file that I converted from an image pdf to a text pdf was a 35-page Employee Handbook from 1940. Start to finish, the whole job took less than five minutes.

gscan2pdf also works well with the scanner in my Epson MFP when I need to scan documents from scratch.

