Batch recursive conversion to PDF using Linux, LibreOffice and Unoconv

Recursive PDF conversion with unoconv

One of our customers has about 4500 documents in Word (Docx and Doc), RTF, TXT, OTF and PDF format collected by their users, all organized in a forest of folders. The demand was to have all these documents available for preview in their online candidates portal. Time for LibreOffice, unoconv and a bit of Python handwork.

I looked around on the web and found a few utilities for Windows that could handle this to a certain or complete extent. Only 7-PDF Maker on Windows did the job almost as I wanted. The challenge is that all the documents have to be converted to PDF recursively and with good quality. 7-PDF Maker, a free command line utility, does a great job, but it broke during the conversion more than once.

In this article I describe how I did the mass conversion on Linux (my other favorite operating system) with LibreOffice, a utility called “unoconv” and a bit of Python programming. Don’t worry, I am not an experienced Python programmer, so I will stick to the plan of getting you to the PDFs as quickly as possible.

Required to be installed

Py-unoconv-batch-recursive

The name is a bit of a mouthful. Install it in a folder where you can easily find it, like /<somewhere-easy>/py-unoconv-batch-recursive. You can also give the folder any name you like.

You can simply do:

in a terminal window with your path changed to the path of your choice.

Once everything is installed, go to the root of the folder that contains the documents you want to convert (let’s assume: /media/somewhere/CD-Data).

If you don’t add the “--in” parameter, the script takes the folder where the Python script is located as the root path of the conversion.

The program will convert documents with the following formats by default:

  • docx
  • doc
  • rtf
  • otf
  • txt

To have an alternate list of extensions you can add the parameter:

without commas or other characters.
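As an illustration of how such parameters could be handled, here is a minimal sketch using Python's standard argparse module. It is not the code of the actual script: the function name parse_args and the option name “--ext” are assumptions made up for this example, and only “--in” and the default extension list come from the description above.

```python
import argparse
import os

# Default extensions as listed in the article.
DEFAULT_EXTENSIONS = ["docx", "doc", "rtf", "otf", "txt"]


def parse_args(argv, script_path):
    """Parse the command line; both option handlers are illustrative only."""
    parser = argparse.ArgumentParser(description="Batch convert documents to PDF")
    # "in" is a reserved word in Python, so the value is stored as "root".
    # Without --in, fall back to the folder that contains the script.
    parser.add_argument(
        "--in",
        dest="root",
        default=os.path.dirname(os.path.abspath(script_path)),
        help="root folder of the conversion",
    )
    # Hypothetical option name: a space-separated list, no commas.
    parser.add_argument(
        "--ext",
        nargs="+",
        default=DEFAULT_EXTENSIONS,
        help="extensions to convert, separated by spaces",
    )
    return parser.parse_args(argv)
```

In a real script this would be called as parse_args(sys.argv[1:], __file__), so that a call without “--in” falls back to the location of the script itself.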

Once started, the script checks all folders below the given input path for files with the selected extensions. Every file found is converted to a file with the original filename plus an added “.pdf”, so “something.doc” becomes “something.doc.pdf”. This prevents the PDF of “something.docx” from overwriting the PDF of a “something.doc” in the same folder that has just been converted. The output files are put in the same folder as the original document. There is an ‘--out’ parameter, but it does not do anything yet. Maybe in a later version of the script.
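The folder walk and the naming rule can be sketched in a few lines of Python. This is my own minimal illustration of the behaviour described above, not the script itself; the function name find_documents is made up for this example.

```python
import os

# Default extensions as listed in the article.
DEFAULT_EXTENSIONS = ("docx", "doc", "rtf", "otf", "txt")


def find_documents(root, extensions=DEFAULT_EXTENSIONS):
    """Yield (source, target) path pairs for every matching file below root."""
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            # Compare the extension case-insensitively, so "CV.DOC" matches too.
            if os.path.splitext(name)[1].lower() in wanted:
                source = os.path.join(folder, name)
                # Append ".pdf" instead of replacing the extension, so that
                # "something.doc" and "something.docx" get different PDFs.
                yield source, source + ".pdf"
```

Because “.pdf” is not in the extension list, existing PDF files (including the freshly generated ones) are skipped when the script is run again.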

Unoconv

Unoconv is a command line program that converts between different office document file formats. The nice thing about converting with this “unoconv-LibreOffice” method is that the generated PDFs are not rendered as bitmaps, but as layered PDFs. I am using the results of the conversion in my ExtJS package ext-pdf-viewer, a PDF Viewer panel for ExtJS version 6, based on Mozilla’s pdf.js library.

Conclusion

I admit that my (first) Python script is very basic, but it worked very well for me. Converting about 4500 documents of mostly 1 to 2 pages took about 30 minutes on my own laptop. The script starts with a delay of 20 seconds(!) to give the “unoconv” listener enough time to start. This listener avoids starting a new LibreOffice instance every time “unoconv” is called. At the end of the script the listener is killed and the program stops with the message “Done”.
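The listener pattern described above could look roughly like the sketch below. It is an assumption about the structure, not a copy of my script: the function names are invented for this example, and the run, popen and sleep parameters exist only so the function can be tried out without unoconv installed. The unoconv options used are “--listener”, “-f” (output format) and “-o” (output name).

```python
import subprocess
import time


def build_convert_command(source, target):
    """Return the unoconv call that converts one document to PDF."""
    return ["unoconv", "-f", "pdf", "-o", target, source]


def convert_all(pairs, startup_delay=20,
                run=subprocess.run, popen=subprocess.Popen, sleep=time.sleep):
    """Convert all (source, target) pairs; return the sources that failed."""
    failed = []
    # Start one listener so LibreOffice is launched only once.
    listener = popen(["unoconv", "--listener"])
    try:
        # Give the listener time to come up before the first conversion.
        sleep(startup_delay)
        for source, target in pairs:
            result = run(build_convert_command(source, target))
            if result.returncode != 0:
                failed.append(source)
    finally:
        # Stop the listener even if a conversion raised an exception.
        listener.terminate()
        listener.wait()
    print("Done")
    return failed
```

Collecting the failed files instead of stopping at the first error means one broken document does not abort a run of several thousand files.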

Johan van de Merwe
Dedicated to professional software development since 1985. Has worked since 1992 as IT manager in several internationally operating companies. Since 2007 CEO and Sencha Ext JS web application developer at Enovision GmbH.
