Posts

Showing posts with the label Regex

Initial clean-up of converted or downloaded pdf, epub and txt documents using LO Writer Find and Replace

Scenario: Your downloaded .txt or converted .pdf file has a paragraph break after each line, but you want a paragraph break only at the actual or "real" end of each paragraph. Here's how you take care of that in three easy steps in LibreOffice Writer. Perennial Reminder: Most replacements are pretty quick, but depending on the length of the document, some Find and Replace tasks can take a long while, sometimes as long as five or ten minutes. Be patient. Take a break and let your F&R do its thing. Go to your browser and post something on Facebook or Twitter. Go to the kitchen and make a sandwich. First, you should replace the actual or "real" end of each paragraph with a placeholder; the reason will become obvious later in these instructions. So here, we are replacing each instance of the "real" paragraph end (highlighted) with ".9999". Later, we will replace all instances of 9999. Below is the find and replace instruction: (note that the...
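For readers who want to try the placeholder idea outside of Writer, here is a minimal Python sketch of the same three-step approach. It is not the Find and Replace instruction from the post, which is cut off above. It assumes that a "real" paragraph end is a line ending in ".", "!" or "?", and that the text "9999" does not otherwise appear in the document; the function name `unwrap_hard_breaks` is made up for this example.

```python
import re

# Placeholder marker, as in the post. Assumption: "9999" never occurs
# in the document's own text.
PLACEHOLDER = "9999"


def unwrap_hard_breaks(text: str) -> str:
    """Remove per-line breaks but keep breaks at 'real' paragraph ends."""
    # Step 1: mark each assumed "real" paragraph end, i.e. sentence-ending
    # punctuation immediately before a line break.
    marked = re.sub(r"([.!?])[ \t]*\n", r"\g<1>" + PLACEHOLDER + "\n", text)
    # Step 2: turn every remaining line break into a single space.
    joined = re.sub(r"[ \t]*\n[ \t]*", " ", marked)
    # Step 3: swap each placeholder (and the space after it) for a break.
    restored = re.sub(PLACEHOLDER + r" ?", "\n", joined)
    return restored.strip()


if __name__ == "__main__":
    sample = (
        "This is a line\n"
        "that was broken\n"
        "mid-sentence.\n"
        "A second paragraph\n"
        "starts here.\n"
    )
    print(unwrap_hard_breaks(sample))
```

A line that happens to end in a period in the middle of a paragraph will be treated as a paragraph end, so the result still needs a quick read-through.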

Common Regex Find & Replace strings used by this Blogger.

As I have stated in other posts, I do a lot of document conversions from pdf, txt and epub to .docx, .odt and Google Docs. Most of the documents are between 5 and 500 pages, occasionally stretching to 700 or 1,000+ pages. I work in a Linux environment, so my most common means of converting documents are Okular, Tesseract (command line), G Suite, and Calibre, in that order, with Okular being the most often used. In some cases, the documents I am converting have already been scanned into plain text (txt) and I am simply downloading them, then converting to the final format (usually .odt or .docx) so that I can add images, footnotes, comments, etc., often reconverting the final product back to pdf and/or epub formats. All that is more fully explained in this post. Below are some of the more common (and not-so-common) Regex and non-Regex hacks I use. Again, I work strictly in a Linux OS environment, but much of my work is converted to MS Word and Google Docs, so I use LibreOffice's built-in Fi...

Using Find and Replace in LibreOffice Writer to Clean up Converted Text Documents in any Platform

If your work involves converting documents, and after the conversion you need to do a lot of Find and Replace to clean up the converted document, you might have found yourself needing to use Regular Expressions. Trying to use Regex in MS Word or Google Docs is frustrating unless you are, perhaps, a developer. For the everyday user at work or at home (WFH), figuring out Regex in the MS or G Suite environment is a mess, even if you use one of the Find and Replace G Suite add-ons. You can use Regular Expressions (Regex) to find and replace empty paragraphs; multiple, contiguous empty spaces; empty spaces at the beginning of a paragraph; etc. As someone who works almost exclusively in a Linux environment, and whose work requires the conversion of hundreds of pdf, txt, and epub documents to .docx and .odt formats, my favorite (and only really useful) hack is to use LibreOffice and its far-superior-to-MS Word and G Suite "Find & Replace" function. The first step i...
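To make the three clean-up targets named above concrete (empty paragraphs, multiple contiguous spaces, and spaces at the beginning of a paragraph), here is a small Python sketch. It uses Python's `re` module rather than Writer's Find & Replace dialog, so the patterns are illustrations of the idea and not the strings from this post; it also assumes each paragraph sits on its own line, and the function name `clean_paragraphs` is made up for this example.

```python
import re


def clean_paragraphs(text: str) -> str:
    """Apply three common clean-ups to text with one paragraph per line."""
    cleaned = []
    for line in text.split("\n"):
        # Remove spaces or tabs at the beginning of the paragraph.
        line = re.sub(r"^[ \t]+", "", line)
        # Collapse runs of two or more spaces into a single space.
        line = re.sub(r" {2,}", " ", line)
        # Skip paragraphs that are empty or contain only whitespace.
        if line.strip() == "":
            continue
        cleaned.append(line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    messy = "  Hello   world\n\n\n\tSecond  line\n"
    print(clean_paragraphs(messy))
```

Doing each clean-up as its own small substitution mirrors running separate Find & Replace passes, which makes it easier to check the result after every pass.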