Converting an image pdf file to a searchable text pdf file in a Linux environment
Okay, so that's a really long title for a blog post, but sometimes you need many words to say what you are actually doing. I learned that lesson by spending a lot of time on mostly worthless forums, where people rarely manage to write a subject line that has anything to do with their issue.
At any rate, some background. I love downloading public-domain (mostly) books and documents, but often they are scanned as image files. As I am a writer and want to use quotes from the pdf, it is much easier if I convert the picture pdf to a text pdf so I can copy and paste, rather than re-typing the quoted material.
There are lots of ways to go about this conversion task, but often they require buying conversion software or paying to play in the cloud. I hate spending money on work stuff, so here's my simple, quick solution.
- Install gscan2pdf. In Ubuntu, you can do that from Ubuntu Software Center or GNOME Software. If you are into installing using a terminal, have at it, but I won't describe the "how to" of terminal installation here.
- Open your file browser and find the image pdf you want to OCR. Right-click the file and choose Open With gscan2pdf. If a dialog tells you that no scanner was detected, just close it. You aren't going to scan anything on a scanner.
- gscan2pdf will proceed to load all the pages of the image pdf.
- Click on Tools, OCR.
- Select Page Range, then click OCR (I leave the other settings at default).
- Click on the OCR tab to watch as the OCR is being performed. Don't be frightened: it looks like an awful jumble, but the end product won't be jumbled.
- After the last page is OCR'd, click File, Save. Select the format to which you want to save the OCR'd file. In this case, as I simply want a pdf file with searchable text that I can copy, highlight and annotate, I chose to save the file as PDF. (Bonus: you also can now convert the pdf to a text file that can be edited in a word processor.)
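If you ever need to repeat the steps above for a whole folder of scans, the same idea can be scripted. What follows is only a rough sketch and not the gscan2pdf method described in this post: it swaps in two separate command-line tools, ocrmypdf (to add the text layer) and pdftotext from poppler-utils (for the text-file bonus), and assumes both are already installed. The function names and the "-ocr" output suffix are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def default_output(src: Path) -> Path:
    """scan.pdf -> scan-ocr.pdf, so the original scan is never overwritten."""
    return src.with_name(src.stem + "-ocr.pdf")


def ocr_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Build the ocrmypdf call that adds a searchable text layer."""
    return ["ocrmypdf", "-l", language, str(src), str(dst)]


def text_command(pdf: Path) -> list[str]:
    """Build the pdftotext call that writes the text layer to a .txt file."""
    return ["pdftotext", "-layout", str(pdf), str(pdf.with_suffix(".txt"))]


def make_searchable(src: Path) -> Path:
    """Run both tools on one image pdf and return the path of the new pdf."""
    # Assumption: both tools are on PATH; stop early with a clear message if not.
    for tool in ("ocrmypdf", "pdftotext"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")
    dst = default_output(src)
    subprocess.run(ocr_command(src, dst), check=True)
    subprocess.run(text_command(dst), check=True)
    return dst
```

Calling make_searchable(Path("book.pdf")) should leave book-ocr.pdf and book-ocr.txt next to the original, assuming the tools behave as documented. For a one-off file, the point-and-click route above is probably still less fuss.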