
Showing posts with the label Document Conversion

Common Regex Find & Replace strings used by this Blogger.

As I have stated in other posts, I do a lot of document conversions from pdf, txt and epub to .docx, .odt and Google Docs. Most of the documents are between 5 and 500 pages, occasionally stretching to 700 or 1,000+ pages. I work in a Linux environment, so my most common means of converting documents are Okular, Tesseract (command line), G Suite, and Calibre, in that order, with Okular being the most often used. In some cases, the documents I am converting have already been scanned into plain text (txt), and I simply download them and convert them to the final format (usually .odt or .docx) so that I can add images, footnotes, comments, etc., often reconverting the final product back to pdf and/or epub. All that is more fully explained in this post. Below are some of the more common (and not-so-common) Regex and non-Regex hacks I use. Again, I work strictly in a Linux OS environment, but much of my work is converted to MS Word and Google Docs, so I use LibreOffice's built-in Fi...

Using Find and Replace in LibreOffice Writer to Clean up Converted Text Documents in any Platform

If your work involves converting documents, and you need to do a lot of Find and Replace afterward to clean up the converted document, you might have found yourself needing Regular Expressions. Trying to use Regex in MS Word or Google Docs is frustrating unless you are, perhaps, a developer; for the everyday user at work or at home (WFH), it is a mess even with one of the Find and Replace G Suite add-ons. You can use Regular Expressions (Regex) to find and replace empty paragraphs; multiple contiguous spaces; spaces at the beginning of a paragraph; etc. As someone who works almost exclusively in a Linux environment, and whose work requires the conversion of hundreds of pdf, txt, and epub documents to .docx and .odt formats, my favorite (and only really useful) hack is to use LibreOffice and its "Find & Replace" function, which is far superior to those of MS Word and G Suite. The first step i...
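For readers who would like to see the three cleanups named in this excerpt (empty paragraphs, runs of spaces, and spaces at the start of a paragraph) spelled out, here is a minimal sketch in Python. It is not taken from the post and is not the LibreOffice dialog itself: the function name `clean_converted_text` is hypothetical, and it assumes the converted document has been exported to plain text with one paragraph per line.

```python
import re


def clean_converted_text(text: str) -> str:
    """Apply three common cleanups to plain text (one paragraph per line).

    Hypothetical helper for illustration only; the patterns are Python
    regexes, not the exact search strings used in LibreOffice.
    """
    # 1. Collapse multiple contiguous spaces (or tabs) into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # 2. Remove spaces at the beginning of every paragraph (line).
    text = re.sub(r"^[ \t]+", "", text, flags=re.MULTILINE)
    # 3. Remove empty paragraphs by collapsing runs of newlines into one.
    text = re.sub(r"\n{2,}", "\n", text)
    return text


if __name__ == "__main__":
    sample = "  Hello   world\n\n\n   \nSecond  paragraph"
    print(clean_converted_text(sample))
```

The order matters in this sketch: whitespace-only lines are emptied by the first two steps, so the last step can treat them as empty paragraphs and drop them.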