Converting an image pdf file to a searchable text pdf file in a Linux environment

Okay, so that's a really long title for a blog post, but sometimes you need a lot of words to explain what you are actually doing. I learned that lesson by spending a lot of time on mostly worthless forums, where people have very little ability to write a subject line that has anything to do with their issue.

At any rate, some background. I love downloading (mostly) public-domain books and documents, but often they are scanned as image files. As I am a writer and want to quote from the pdf, it is much easier to convert the picture pdf to a text pdf so I can copy and paste rather than retype the quoted material.

There are lots of ways to go about this conversion task, but often they require buying conversion software or paying to play in the cloud. I hate spending money on work stuff, so here's my simple, quick solution. 

  1. Install gscan2pdf. In Ubuntu, you can do that from the Ubuntu Software Center or GNOME Software. If you are into installing using a terminal, have at it, but we won't describe the "how to" of terminal installation here.
  2. Open your file browser and find the file you want to OCR. Right-click on the file and choose Open With gscan2pdf. If a dialog tells you that no scanner was detected, just close it. You aren't going to scan anything on the scanner.
  3. gscan2pdf will proceed to load all the pages of the image pdf. 
  4. Click on Tools, OCR.
  5. Select Page Range, then click OCR (I leave the other settings at default).
  6. Click on the OCR tab to watch as the OCR is being performed. Don't be frightened: it looks like an awful jumble, but the end product won't be jumbled.
  7. After the last page is OCR'd, click File, Save. Select the format in which you want to save the OCR'd file. In this case, as I simply wanted a pdf file with searchable text that I can copy, highlight and annotate, I chose to save the file as PDF. (Bonus: you can now also convert the pdf to a text file that can be edited in a word processor.)

The file that I converted from an image pdf to a text pdf was a 35-page Employee Handbook from 1940. The whole process took less than five minutes from start to finish.
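
For anyone who would rather script these steps than click through them, here's a rough sketch in Python. To be clear, this is not necessarily what gscan2pdf does under the hood; it is a stand-in pipeline that assumes three command-line tools are already installed: pdftoppm and pdfunite (both from poppler-utils) and tesseract. The function names and the 300 dpi setting are my own made-up choices for illustration, not anything from gscan2pdf.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    """pdftoppm writes one PNG per page, named <prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image):
    """tesseract <image> <outputbase> pdf writes <outputbase>.pdf
    containing the page image plus a searchable text layer."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "pdf"]


def merge_cmd(page_pdfs, output):
    """pdfunite joins the per-page PDFs, in the order given, into one file."""
    return ["pdfunite", *map(str, page_pdfs), str(output)]


def page_number(path):
    """Pull the page number out of a name like page-07.png."""
    return int(Path(path).stem.rsplit("-", 1)[1])


def make_searchable(pdf, output):
    """Image pdf in, searchable pdf out (assumes the tools are on PATH)."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix), check=True)
        # Sort numerically so the pages stay in their original order.
        images = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        for image in images:
            subprocess.run(ocr_cmd(image), check=True)
        page_pdfs = [image.with_suffix(".pdf") for image in images]
        subprocess.run(merge_cmd(page_pdfs, output), check=True)


if __name__ == "__main__":
    # Usage: python3 make_searchable.py scanned.pdf searchable.pdf
    make_searchable(sys.argv[1], sys.argv[2])
```

Saved as make_searchable.py (a hypothetical file name), it takes the scanned pdf and the output name as its two arguments. I kept the command building separate from the running so you can print the commands and look them over before letting them loose on a real file.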

gscan2pdf also works well with the scanner in my Epson MFP when I need to scan documents from scratch. 
