Common Regex Find & Replace strings used by this Blogger.
As I have stated in other posts, I do a lot of document conversions from pdf, txt and epub to .docx, .odt and Google Docs. Most of the documents are between 5 and 500 pages, occasionally stretching to 700 or 1,000+ pages. I work in a Linux environment, so my most common means of converting documents are Okular, Tesseract (command line), G Suite, and Calibre, in that order, with Okular being the most often used. In some cases, the documents I am converting have already been scanned into plain text (txt), and I simply download them and convert them to the final format (usually .odt or .docx) so that I can add images, footnotes, comments, etc. I often reconvert the final product back to pdf and/or epub. All that is more fully explained in this post.
Below are some of the more common (and not-so-common) Regex and non-Regex hacks I use. Again, I work strictly in a Linux OS environment, but much of my work is converted to MS Word and Google Docs, so I use LibreOffice's built-in Find and Replace, along with an occasional foray into the Alternate Find & Replace extension, to clean up converted documents.
Some common Find & Replace hacks, using Regular Expressions: (All within LibreOffice Writer)
(non-Regex) Replacing multiple, contiguous spaces with a single space. Many of the documents I download and convert have runs of multiple spaces between words, anywhere from 2 to as many as 30. This is really simple, so you probably cannot screw it up, even if you try.
In the Find box, hit your space bar multiple times. For an extremely long document with varying multiple spaces between words, I usually start with five taps of the space bar in the Find box and 1 space in the Replace box, then click Replace All. Then I go back to the Find box and delete one space, so that I am now replacing four spaces with 1 space, and click Replace All again. I do the same for three spaces, then two spaces. This will still leave some instances of two spaces, but these get cleaned up later.
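The staged passes above can be sketched outside LibreOffice to show why a few double spaces may survive. This is only a minimal illustration in Python, not the way LibreOffice works internally: the function names and the sample string are hypothetical, and Python's str.replace is assumed to behave like Replace All (left to right, non-overlapping matches). The second function shows a single-pass alternative using the regular expression " {2,}" (a space followed by {2,}), which matches any run of two or more spaces.

```python
import re


def collapse_spaces_staged(text: str) -> str:
    """Mimic the manual passes: replace 5, then 4, 3, 2 spaces with one."""
    for width in (5, 4, 3, 2):
        # One "Replace All" pass per width, each run exactly once.
        text = text.replace(" " * width, " ")
    return text


def collapse_spaces_regex(text: str) -> str:
    """Collapse every run of two or more spaces in a single pass."""
    return re.sub(r" {2,}", " ", text)


# Contrived sample: a run of 39 spaces between two letters.
sample = "a" + " " * 39 + "b"

staged = collapse_spaces_staged(sample)
single = collapse_spaces_regex(sample)

print(repr(staged))  # the staged passes leave a double space here
print(repr(single))  # the regex pass leaves exactly one space
```

With this sample the staged version shrinks the run from 39 to 11, 5, 3 and finally 2 spaces, which is the kind of leftover that a later cleanup pass has to catch. The same " {2,}" pattern should also work in LibreOffice's Find box when the "Regular expressions" option is checked, but treat that as an assumption to test on a copy of your document first.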