
Make all your PDFs searchable

Welcome to my first post on this new blog!
In this article I’ll explain how I make my ever-growing pile of PDFs searchable.

Motivation

Recently I ran into some situations where I needed to find a specific document. It drove me nuts to manually open and close PDFs to find the one I was looking for.

So I made all my PDFs searchable and here is how that worked:

TL;DR

Throw all your PDF files into a folder and run ocrmypdf to make them searchable:

find . -maxdepth 1 -type f -name '*.pdf' ! -name '*_ocr.pdf' -exec sh -c 'ocrmypdf "$1" "${1%.pdf}_ocr.pdf"' sh {} \;

then use pdfgrep to search for things:

pdfgrep -i --cache 'taxpayer identification number' *.pdf

and never waste time on tagging, renaming or categorizing PDFs again.

Using tesseract & pdfgrep

There are some really powerful tools out there to get us out of this dilemma. And it’s super simple:

Put all PDFs in one place
Let tesseract take care of OCR
Use pdfgrep for search

The first step is my favorite. No manual categorizing, no file tags, no problems!

Tesseract

Tesseract is an open source tool for optical character recognition (OCR). It takes an image and finds text in it:

tesseract input.png output -l eng

Cool! Now we have a way to read text from images. However, this post is all about PDFs. So how can we feed a PDF to tesseract?

Unfortunately, tesseract only works with image formats like PNG, JPEG and some others. You could convert your PDFs to images like this:

convert -density 300 input.pdf -depth 8 output.png

and then use tesseract on the generated images. But there is a simpler way to do this: OCRmyPDF. This is another CLI that converts the pages, runs tesseract on them and writes the recognized text back into the PDF for you:

ocrmypdf -l eng input.pdf output.pdf

That command works for one file. To run OCR on all PDF files in a folder, use this one:

find . -maxdepth 1 -type f -name '*.pdf' ! -name '*_ocr.pdf' -exec sh -c 'ocrmypdf "$1" "${1%.pdf}_ocr.pdf"' sh {} \;

This keeps the original files and creates copies with “_ocr” appended to their filenames. Whenever you add more PDFs to your folder and need to OCR them, just re-run this command. It processes every file without the “_ocr” suffix, so any originals you have kept will be OCR'd again.
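If you keep the originals, re-running the command above repeats work that is already done. Here is a minimal sketch of an incremental variant, assuming a POSIX shell: it skips any PDF that already has an “_ocr” copy next to it. The function name ocr_new_pdfs is made up for this example. The --skip-text flag is my addition: as far as I know, ocrmypdf stops on pages that already contain text unless you tell it otherwise, and this flag passes such pages through unchanged.

```shell
# Sketch: OCR only those PDFs in a directory that have no "_ocr" copy yet.
# ocr_new_pdfs is a hypothetical helper name, not part of ocrmypdf.
ocr_new_pdfs() {
    dir="${1:-.}"
    for src in "$dir"/*.pdf; do
        [ -f "$src" ] || continue        # no PDFs at all: the glob stays literal
        case "$src" in
            *_ocr.pdf) continue ;;       # this file is already an OCR result
        esac
        dst="${src%.pdf}_ocr.pdf"
        [ -e "$dst" ] && continue        # OCR copy exists, nothing to do
        # --skip-text (assumption, see above): leave pages with text untouched
        ocrmypdf --skip-text "$src" "$dst" || echo "OCR failed: $src" >&2
    done
}
```

Run it as ocr_new_pdfs /path/to/folder. Without an argument it works on the current directory.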

You can either keep the old files or delete all files without the “_ocr” suffix in the current directory by using this command:

find . -maxdepth 1 -type f -name '*.pdf' ! -name '*_ocr.pdf' -delete

Only searchable PDFs are left now. Whenever you throw in more PDFs, you will instantly know which ones still need to be treated with ocrmypdf: all those without the _ocr suffix.
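One caveat: the delete command above removes every original, even one whose OCR run failed. A more careful sketch, again assuming a POSIX shell, deletes an original only when a non-empty “_ocr” copy sits next to it. The name delete_ocred_originals is hypothetical.

```shell
# Sketch: delete an original only if its "_ocr" copy exists and is non-empty.
# delete_ocred_originals is a hypothetical helper name.
delete_ocred_originals() {
    dir="${1:-.}"
    for src in "$dir"/*.pdf; do
        [ -f "$src" ] || continue        # no PDFs at all: the glob stays literal
        case "$src" in
            *_ocr.pdf) continue ;;       # never delete the OCR results
        esac
        if [ -s "${src%.pdf}_ocr.pdf" ]; then
            rm -- "$src"
        else
            echo "kept (no OCR copy yet): $src" >&2
        fi
    done
}
```

Files that are kept are reported on stderr, so you can see which ones still need an OCR run.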

pdfgrep

Now that all your PDFs are searchable, let’s try pdfgrep:

pdfgrep -i --cache 'invoice' *.pdf

This will give you a list of filenames together with the line of text around each match. Add the -n flag if you also want the page number of each match.

Note that depending on the number of PDFs, the first run could take a while. With the --cache flag, pdfgrep stores the extracted text, so the next search will be much quicker!
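If you search often, a small wrapper saves some typing. This is only a sketch, assuming a POSIX shell: pdfsearch is a made-up name, while -i, -n and --cache are pdfgrep's own flags.

```shell
# Sketch: case-insensitive, cached search that also prints page numbers.
# pdfsearch is a hypothetical helper name, not part of pdfgrep.
pdfsearch() {
    if [ "$#" -lt 1 ]; then
        echo "usage: pdfsearch PATTERN [DIR]" >&2
        return 2
    fi
    pattern="$1"
    dir="${2:-.}"
    # -i: ignore case, -n: prefix each match with its page number,
    # --cache: reuse the extracted text on later runs
    pdfgrep -i -n --cache "$pattern" "$dir"/*.pdf
}
```

For example, pdfsearch 'invoice' searches the current directory, and pdfsearch 'invoice' /path/to/folder searches another one.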

Using Docker

If you are familiar with Docker, you can also try this docker image. It lets you specify an input and an output directory. You put all your PDFs into the input directory, and the running Docker container performs OCR on them and writes the searchable PDFs to the specified output directory.

Have a look at the Readme for more instructions on how to use and configure it.