Recursive PDF conversion with unoconv
One of our customers has about 4500 documents in Word (Docx and Doc), RTF, TXT, OTF and PDF format, collected by their users and organized in a forest of folders. The requirement was to make all these documents available for preview in their online candidates portal. Time for LibreOffice, unoconv and a bit of Python handwork.
I looked around on the web and found a few utilities for Windows that could handle this, some partly and some completely. Only 7-PDF Maker did the job almost as I wanted. The challenge is that all the documents have to be converted to PDF recursively and with good quality. 7-PDF Maker, a free command line utility, does a great job, but it broke during the conversion more than once.
In this article I describe how I did the mass conversion on Linux (my other favorite operating system) with LibreOffice, a utility called “unoconv” and a bit of Python programming. Don’t worry, I am not an experienced Python programmer, so I will stick to the plan of getting you to the PDFs as quickly as possible.
Required to be installed
- LibreOffice (from here: Link to LibreOffice)
- Unoconv (installation instructions)
- Python (Python 2.7.12 installed) (installation instructions)
- py-unoconv-batch-recursive (from Github)
The name is a bit unwieldy, but what matters is that you install it in a folder where you can easily find it, like /<somewhere-easy>/py-unoconv-batch-recursive. You can also give the folder any name you like.
You can simply do:
git clone https://github.com/enovision/py-unoconv-batch-recursive.git
in a terminal window, after changing to the directory of your choice.
Once everything is installed go to the root of the folder that contains the documents which you want to convert (let’s assume: /media/somewhere/CD-Data).
python /tmp/py-unoconv-batch-recursive/recursive-pdf-converter --in="/media/somewhere/CD-Data"
If you don’t add the “--in” parameter, it will take the path where the Python script is located as the root path of your conversion.
The program will convert documents with the following formats by default:
To use a different list of extensions, add the parameter:
--ext="doc docx yyy zzz"
without commas or other characters; the extensions are separated by spaces only.
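To illustrate how two such parameters can be handled, here is a minimal sketch with Python's standard argparse module. It is not the code from py-unoconv-batch-recursive: the function name parse_args and the attribute names root and extensions are made up for this example, and when “--ext” is left out the sketch simply returns None instead of guessing the script's built-in list.

```python
import argparse
import os
import sys


def parse_args(argv=None):
    """Parse --in and --ext as described in the text (illustrative sketch only)."""
    parser = argparse.ArgumentParser(description="Recursive PDF conversion (sketch)")
    # "in" is a reserved word in Python, so the value is stored as "root".
    # Default: the folder of the running script, as described for a missing --in.
    parser.add_argument(
        "--in",
        dest="root",
        default=os.path.dirname(os.path.abspath(sys.argv[0])),
        help="root folder of the documents to convert",
    )
    # Space-separated list such as "doc docx"; None means "use the built-in list".
    parser.add_argument("--ext", default=None, help='extensions, e.g. "doc docx"')
    args = parser.parse_args(argv)
    # Hypothetical convenience attribute: lower-cased extensions without dots.
    args.extensions = (
        [ext.lower().lstrip(".") for ext in args.ext.split()] if args.ext else None
    )
    return args
```

Calling parse_args(['--in=/media/somewhere/CD-Data', '--ext=doc docx']) would give a root of /media/somewhere/CD-Data and the extensions doc and docx.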
After starting, the script checks all folders below the root of the given input path for files with the selected extensions. Every file found is converted to a file with the name of the original plus an added “.pdf”. So “something.doc” becomes “something.doc.pdf”. This prevents “something.docx” from overwriting the PDF of “something.doc” when both are in the same folder. The output files are put in the same folder as the original document. There is an “--out” parameter, but it does not do anything yet. Maybe in a later version of the script.
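The folder walk and the naming rule can be sketched in a few lines with os.walk. This is an illustration of the idea, not the script's actual code; the function name find_documents is an assumption of this example.

```python
import os


def find_documents(root, extensions):
    """Yield (source, target) pairs for every matching file below root."""
    # Normalise "doc", ".doc" and "DOC" to the same form: ".doc"
    wanted = set("." + ext.lower().lstrip(".") for ext in extensions)
    for folder, _subfolders, files in os.walk(root):
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() in wanted:
                source = os.path.join(folder, name)
                # Append ".pdf" to the full name, so that something.doc and
                # something.docx end up as two different PDF files.
                yield source, source + ".pdf"
```

Because the target keeps the original extension, the PDF always lands next to its source document and two files that differ only in extension never collide.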
Unoconv is a command line program that is used to convert between different office document file formats. The nice thing about converting with this “unoconv-LibreOffice” method is that the generated PDFs are not converted as bitmaps, but as layered PDFs. I am using the results of the conversion in my ExtJS package ext-pdf-viewer, a PDF Viewer panel for ExtJS version 6, based on Mozilla’s pdf.js library.
I admit that my (first) Python script is very basic, but it worked very well for me. Converting about 4500 documents of mostly 1 to 2 pages took about 30 minutes on my own laptop. The script starts with a delay of 20 seconds(!) to give the “unoconv” listener enough time to start. This listener avoids starting a new LibreOffice instance every time “unoconv” is called. At the end of the script the listener is killed and the program stops with the message “Done”.
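The listener pattern described here could look roughly like the sketch below. It assumes that unoconv is on the PATH and uses its “--listener”, “-f” and “-o” options. The function names build_convert_command and convert_all are invented for this example, and the real script may call unoconv differently.

```python
import subprocess
import time


def build_convert_command(source, target):
    """Return the unoconv call that writes source as a PDF to target."""
    return ["unoconv", "-f", "pdf", "-o", target, source]


def convert_all(pairs, startup_delay=20):
    """Start one listener, convert every (source, target) pair, stop the listener.

    Returns the list of source files for which unoconv reported an error.
    """
    listener = subprocess.Popen(["unoconv", "--listener"])
    failed = []
    try:
        # Give the listener (and LibreOffice behind it) time to come up
        time.sleep(startup_delay)
        for source, target in pairs:
            if subprocess.call(build_convert_command(source, target)) != 0:
                failed.append(source)
    finally:
        # Stop the listener even if a conversion raised an exception
        listener.terminate()
        listener.wait()
    return failed
```

Collecting the failed files instead of stopping at the first error keeps a long batch running. Depending on the system, terminating the listener may leave a LibreOffice process behind, so it is worth checking for that after a run.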