Batch recursive conversion to PDF using Linux, LibreOffice and Unoconv

This article shows how you can batch convert large numbers of documents to PDF with LibreOffice, Unoconv and a short Python script (included).

Recursive PDF conversion with unoconv

One of our customers has about 4500 documents in Word (Docx and Doc), RTF, TXT, OTF and PDF format, collected by their users and organized in a forest of folders. The requirement was to make all these documents available for preview in their online candidates portal. Time for LibreOffice, unoconv and a bit of Python handwork.

I looked around on the web and found a few utilities for Windows that could handle this to some extent. Only 7-PDF Maker, a free command line utility, did the job almost as I wanted. The challenge is that all the documents have to be converted to PDF recursively and with good quality. 7-PDF Maker does a great job, but it broke during the conversion more than once.

In this article I describe how I did the mass conversion on Linux (my other favorite operating system) with LibreOffice, a utility called "unoconv" and a bit of Python programming. Don't worry, I am not an experienced Python programmer, so I will stick to the plan of getting you to the PDFs as quickly as possible.

Required to be installed

Py-unoconv-batch-recursive

The name is somewhat difficult, so install it in a folder where you can easily find it, like /<somewhere-easy>/py-unoconv-batch-recursive. You can also give the folder any name you like.

To install it:

git clone https://github.com/enovision/py-unoconv-batch-recursive.git

Run the git clone command in a terminal window, after changing to the path of your choice.

Once everything is installed, go to the root of the folder that contains the documents you want to convert (let's assume: /media/somewhere/CD-Data).

python /tmp/py-unoconv-batch-recursive/recursive-pdf-converter --in="/media/somewhere/CD-Data"

If you don't add the --in parameter, the script takes the folder where the Python script is located as the root path of your conversion.

The program will convert documents with the following formats by default:

  • docx
  • doc
  • rtf
  • otf
  • txt

To use an alternate list of extensions, add the parameter:

--ext="doc docx yyy zzz"

Separate the extensions with spaces, without commas or other characters.
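To make the two parameters concrete, here is a minimal sketch of how --in and --ext could be parsed. It is not the code from the repository: the function name parse_args and the use of argparse are my own assumptions, and only the parameter names and defaults come from the description above.

```python
import argparse

# Default extensions as listed in this article.
DEFAULT_EXT = "docx doc rtf otf txt"


def parse_args(argv, script_dir):
    """Return (root_path, extensions) for the given command line arguments.

    Hypothetical helper: script_dir is the folder of the Python script,
    used as the root path when --in is not given.
    """
    parser = argparse.ArgumentParser(description="Recursive PDF conversion")
    # "in" is a Python keyword, so the value is stored under another name.
    parser.add_argument("--in", dest="root", default=script_dir)
    parser.add_argument("--ext", default=DEFAULT_EXT)
    args = parser.parse_args(argv)
    # The extension list is separated by whitespace only.
    return args.root, args.ext.split()
```

Called as parse_args(sys.argv[1:], os.path.dirname(os.path.abspath(__file__))), this would give the behaviour described above: the script folder is the fallback root, and --ext="doc docx yyy zzz" becomes a list of four extensions.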

After starting, the script checks all folders below the root of the given input path for files with the selected extensions. Every file found is converted to a PDF whose filename is the original filename with .pdf added, so something.doc becomes something.doc.pdf. This prevents the PDF of something.docx from overwriting the PDF of something.doc when both files are in the same folder. The output files are put in the same folder as the original document. There is an --out parameter, but it does not do anything yet. Maybe in a later version of the script.
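The folder walk and the naming rule can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the script from the repository: find_documents and pdf_name are hypothetical names, and matching the extensions case-insensitively is a choice I made for the sketch.

```python
import os

# Default extensions as listed in this article.
DEFAULT_EXTENSIONS = ("docx", "doc", "rtf", "otf", "txt")


def find_documents(root, extensions=DEFAULT_EXTENSIONS):
    """Yield the path of every file below root with a selected extension."""
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in wanted:
                yield os.path.join(folder, name)


def pdf_name(source_path):
    """Append .pdf to the full filename: something.doc -> something.doc.pdf."""
    return source_path + ".pdf"
```

Because the original extension stays in the name, something.doc and something.docx map to two different PDFs in the same folder.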

Unoconv

Unoconv is a command line program that converts between different office document file formats. The nice thing about converting with this unoconv-LibreOffice method is that the generated PDFs are not converted as bitmaps, but as layered PDFs. I am using the results of the conversion in my ExtJS package ext-pdf-viewer, a PDF viewer panel for ExtJS version 6, based on Mozilla's pdf.js library.

Conclusion

I admit that my (first) Python script is very basic, but it worked very well for me. Converting about 4500 documents of mostly 1 to 2 pages took about 30 minutes on my own laptop. The script starts with a delay of 20 seconds(!) to give the "unoconv" listener enough time to start. This listener avoids starting a LibreOffice instance every time unoconv is called. At the end of the script the listener is killed and the program stops with the message "Done".
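The listener lifecycle described here (start, wait, convert, kill) could look roughly like the sketch below. It is not the actual script: the function names are hypothetical, and the run, popen and sleep parameters are only there so the flow can be tried without unoconv installed. The unoconv options used (--listener, -f pdf, -o) are the standard ones, but check them against your installed version.

```python
import subprocess
import time


def listener_command():
    # Start unoconv as a listener so LibreOffice is launched only once.
    return ["unoconv", "--listener"]


def convert_command(source_path):
    # Write the PDF next to the source file, named <original>.pdf.
    return ["unoconv", "-f", "pdf", "-o", source_path + ".pdf", source_path]


def convert_all(paths, startup_delay=20, run=subprocess.call,
                popen=subprocess.Popen, sleep=time.sleep):
    """Convert every path and return the list of paths that failed."""
    listener = popen(listener_command())
    try:
        sleep(startup_delay)  # give the listener enough time to start
        failed = [p for p in paths if run(convert_command(p)) != 0]
    finally:
        listener.terminate()  # stop the listener at the end of the run
        listener.wait()
    print("Done")
    return failed
```

Collecting the failed paths instead of stopping at the first error is a design choice for this sketch: with thousands of files, a single broken document should not end the whole run.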
