Scheduled Postgres Backups to S3
How to achieve recurring uploads of Postgres dumps to AWS S3.
2022-03-16
Welcome to my first post on this new blog!
In this article I’ll explain how I make my ever-growing pile of PDFs searchable.
Recently I ran into some situations where I needed to find a specific document. It drove me nuts to manually open and close PDFs to find the one I was looking for.
So I made all my PDFs searchable and here is how that worked:
Throw all your PDF files into a folder, run ocrmypdf to make them searchable:
find . -maxdepth 1 -type f -name '*.pdf' ! -name '*_ocr.pdf' -exec sh -c 'ocrmypdf "$1" "${1%.pdf}_ocr.pdf"' sh {} \;
then use pdfgrep to search for things:
pdfgrep -i --cache 'taxpayer identification number' *.pdf
and never waste time on tagging, renaming or categorizing PDFs again.
There are some really powerful tools out there to get us out of this dilemma, and using them is super simple.
The first step is my favorite: throw all your PDFs into one folder. No manual categorizing, no file tags, no problems!
Tesseract is an open-source tool for optical character recognition (OCR). It takes an image and extracts the text it finds into a text file (here, output.txt):
tesseract input.png output -l eng
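If you have a whole folder of scanned images, the same call can be wrapped in a small loop. This is only a sketch: the function name ocr_images is made up for illustration, and it assumes your scans are PNG files in a single directory.

```shell
# Sketch: run Tesseract on every PNG in a directory.
# "ocr_images" is a hypothetical helper name, not part of Tesseract.
# tesseract writes its result to <output base>.txt, so page.png yields page.txt.
ocr_images() {
    dir=${1:-.}     # directory to scan, defaults to the current one
    lang=${2:-eng}  # Tesseract language code, defaults to English
    for img in "$dir"/*.png; do
        [ -f "$img" ] || continue   # the glob matched nothing
        tesseract "$img" "${img%.png}" -l "$lang"
    done
}
```

Calling ocr_images ./scans eng would then leave one .txt file next to each image.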
Cool! Now we have a way to read text from images. However, this post is all about PDFs. So how can we feed a PDF to Tesseract?
Unfortunately, Tesseract only works with image formats like PNG, JPEG and some others. You could convert your PDFs to images with ImageMagick’s convert like this:
convert -density 300 input.pdf -depth 8 output.png
and then use Tesseract on the generated images. But there is a simpler way to do this: OCRmyPDF. This is another CLI tool that handles the conversion for you and runs Tesseract under the hood:
ocrmypdf -l eng input.pdf output.pdf
That command works for one file. To run OCR on all PDF files in a folder, use this one:
find . -maxdepth 1 -type f -name '*.pdf' ! -name '*_ocr.pdf' -exec sh -c 'ocrmypdf "$1" "${1%.pdf}_ocr.pdf"' sh {} \;
This keeps the original files and creates copies of them with “_ocr” appended to their filenames. Whenever you add more PDFs to your folder and need to OCR them, just re-run this command and it will try to OCR every file without the “_ocr” suffix.
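Re-running the command above processes every original again, including ones that already have an “_ocr” copy. If you keep the originals around, a small function that skips those saves a lot of time. This is a minimal sketch, not the only way to do it: ocr_new_pdfs and the OCR_CMD override are names I made up for illustration, and the only real tool involved is ocrmypdf.

```shell
# Sketch: OCR only the PDFs that do not have an "_ocr" copy yet.
# "ocr_new_pdfs" and "OCR_CMD" are hypothetical names for this example;
# OCR_CMD defaults to ocrmypdf and only exists so the command can be swapped out.
ocr_new_pdfs() {
    dir=${1:-.}
    for src in "$dir"/*.pdf; do
        [ -f "$src" ] || continue        # the glob matched nothing
        case $src in
            *_ocr.pdf) continue ;;       # this is already an OCR output
        esac
        dst="${src%.pdf}_ocr.pdf"
        [ -e "$dst" ] && continue        # OCR copy exists, skip the original
        ${OCR_CMD:-ocrmypdf} "$src" "$dst" || echo "OCR failed: $src" >&2
    done
}
```

Run ocr_new_pdfs . in your PDF folder whenever you add new files; anything that already has a matching “_ocr” file is left alone.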
You can either keep the old files or delete all files without the “_ocr” suffix in the current directory by using this command:
find . -maxdepth 1 -type f -name '*.pdf' ! -name '*_ocr.pdf' -delete
Only searchable PDFs are left now. Whenever you throw in more PDFs, you will instantly know which ones still need to be treated with ocrmypdf: all those without the _ocr suffix.
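One caveat with the delete command: it removes every original, even one whose OCR run failed and never produced a copy. If that worries you, here is a more cautious sketch that only deletes an original once a non-empty “_ocr” file sits next to it. The function name delete_ocred_originals is hypothetical.

```shell
# Sketch: delete an original PDF only if a non-empty "_ocr" copy exists.
# "delete_ocred_originals" is a hypothetical helper name for this example.
delete_ocred_originals() {
    dir=${1:-.}
    for src in "$dir"/*.pdf; do
        [ -f "$src" ] || continue        # the glob matched nothing
        case $src in
            *_ocr.pdf) continue ;;       # never delete the OCR outputs
        esac
        if [ -s "${src%.pdf}_ocr.pdf" ]; then
            rm -- "$src"
        else
            echo "keeping $src (no OCR copy yet)" >&2
        fi
    done
}
```

Originals that were never OCR'd stay in the folder, so they still stand out by their missing suffix.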
Now that all your PDFs are searchable, let’s try pdfgrep:
pdfgrep -i --cache 'invoice' *.pdf
This prints every matching line of text, prefixed with the name of the file it was found in. Add the -n flag if you also want the page number of each match.
Note that depending on the number of PDFs, the first run could take a while. Thanks to the --cache flag, the next search will be much quicker!
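If you search often, you can wrap the call in a tiny shell function. The sketch below assumes you kept the “_ocr” naming from earlier and only want to search those copies; pdfsearch is a made-up name, while -i, -n and --cache are regular pdfgrep flags (case-insensitive, page numbers, caching).

```shell
# Sketch: case-insensitive, cached search across the OCR'd copies only.
# "pdfsearch" is a hypothetical helper name, not part of pdfgrep.
pdfsearch() {
    if [ $# -lt 1 ]; then
        echo "usage: pdfsearch PATTERN [DIR]" >&2
        return 2
    fi
    pattern=$1
    dir=${2:-.}   # folder holding the PDFs, defaults to the current one
    # -n adds the page number of each match to the output
    pdfgrep -i -n --cache "$pattern" "$dir"/*_ocr.pdf
}
```

With that in your shell profile, pdfsearch invoice ~/Documents/pdfs is all you need to type.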
If you are familiar with Docker, you can also try this Docker image. It lets you specify an input and an output directory. You put all your PDFs into the input directory, and the running Docker container will perform OCR on them and write the searchable PDFs to the specified output directory.
Have a look at the README for more instructions on how to use and configure it.