Multi Word document conversion into PDF with Linux and LibreOffice and Unoconv

01 Apr 2019

Johan van de Merwe

Posted in Tools

The article demonstrates how to efficiently convert a large number of documents into PDF format using LibreOffice, Unoconv, and a simple Python script (provided within the guide).

Efficient PDF Conversion with unoconv

When faced with the task of converting thousands of documents into PDF format for online portal access, tools like unoconv can streamline the process. This article outlines a step-by-step approach to mass conversion using LibreOffice, unoconv, and basic Python scripting on the Linux platform.

Prerequisites

Before proceeding with the conversion process, ensure the following software is installed:

LibreOffice
unoconv
Python 2.7.12
py-unoconv-batch-recursive (available on GitHub)

Installing py-unoconv-batch-recursive

To set up py-unoconv-batch-recursive, clone the repository to a convenient location on your system, such as /<somewhere-easy>/py-unoconv-batch-recursive. From the directory that should hold the clone, run the following command in a terminal window:

git clone https://github.com/enovision/py-unoconv-batch-recursive.git

Then run the Python script and point the --in parameter at the root folder containing the documents to convert (e.g., /media/somewhere/CD-Data). The example below assumes the repository was cloned into /tmp:

python /tmp/py-unoconv-batch-recursive/recursive-pdf-converter.py --in="/media/somewhere/CD-Data"

If the --in parameter is omitted, the script uses the directory in which the Python script itself is located as the root directory for the conversion.

By default, the program processes documents with the extensions docx, doc, rtf, odt, and txt. To specify other file extensions, use the --ext parameter with a space-separated list:

--ext="doc docx yyy zzz"
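The two parameters described above can be sketched with Python's argparse module. This is a minimal illustration written for Python 3, not the code of the actual script: the function name parse_options, the script_path argument, and the fallback extension string are assumptions made for this example.

```python
import argparse
import os


def parse_options(argv, script_path):
    """Parse --in and --ext as described in the article (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Batch PDF conversion (sketch)")
    # --in falls back to the directory that holds the script itself.
    parser.add_argument(
        "--in", dest="root",
        default=os.path.dirname(os.path.abspath(script_path)),
    )
    # --ext takes one space-separated string, e.g. "doc docx yyy zzz".
    # The fallback value here is a placeholder, not the script's real default.
    parser.add_argument("--ext", default="docx doc rtf txt")
    args = parser.parse_args(argv)
    extensions = [ext.lower().lstrip(".") for ext in args.ext.split()]
    return args.root, extensions
```

In a real script, parse_options(sys.argv[1:], __file__) would be the typical call; passing both values explicitly keeps the sketch easy to try out in isolation.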

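The script traverses all subfolders of the root directory, converts each matching file, and appends '.pdf' to the original filename (a file named report.doc, for example, becomes report.doc.pdf). Keeping the original extension in the name prevents clashes between files that share a base name but differ in extension. An --out option exists, but it currently has no effect.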
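The traversal and naming rule can be sketched as follows. This is a hypothetical helper written for Python 3, not an excerpt from the script; the name find_documents is an assumption, and the function only lists what would be converted without converting anything.

```python
import os


def find_documents(root, extensions):
    """Yield (source, target) path pairs for every matching file below root."""
    wanted = {"." + ext.lower() for ext in extensions}
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in wanted:
                source = os.path.join(folder, name)
                # Append ".pdf" to the full name: report.doc -> report.doc.pdf
                yield source, source + ".pdf"
```

Because the target keeps the original extension, report.doc and report.docx in the same folder map to two different PDF files.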

Unoconv

Unoconv converts between file formats from the command line, using LibreOffice as the conversion engine. PDFs produced this way are layered documents that preserve the text and layout of the original.
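A single conversion from Python could look like the sketch below, which calls the unoconv command line with its -f (format) and -o (output) options. The function names build_command and convert and the injectable run argument are assumptions for this example; whether the batch script invokes unoconv in exactly this way is not confirmed here.

```python
import subprocess


def build_command(source, target):
    """Return the unoconv command line that converts source into a PDF at target."""
    return ["unoconv", "-f", "pdf", "-o", target, source]


def convert(source, target, run=subprocess.check_call):
    """Convert one document; raises CalledProcessError if unoconv fails."""
    # 'run' is injectable so the command can be inspected without unoconv installed.
    run(build_command(source, target))
```

Passing the arguments as a list avoids shell quoting problems with filenames that contain spaces.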

These PDF outputs are ideal for integration with tools like ext-pdf-viewer, an Ext JS package leveraging Mozilla's pdf.js library.

Conclusion

Although the Python script is simple, it performed well during the conversion of 4500 documents: the run took around 30 minutes on average on a standard laptop (by 2019 standards). The script waits 20 seconds for the unoconv listener to initialize before it starts converting. When all files have been processed, the listener is terminated and a "Done" message marks the end of the program.
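The listener lifecycle described above (start, wait, convert, terminate, report) can be sketched like this. It is an illustration written for Python 3 under stated assumptions, not the script's actual code: the names start_unoconv_listener and run_with_listener are hypothetical, and the convert argument stands for any function that converts one file.

```python
import subprocess
import time


def start_unoconv_listener():
    """Start unoconv in listener mode as a background process."""
    return subprocess.Popen(["unoconv", "--listener"])


def run_with_listener(jobs, convert, start_listener=start_unoconv_listener,
                      wait_seconds=20, sleep=time.sleep):
    """Run convert(source, target) for each job while a listener is active."""
    listener = start_listener()
    try:
        # Give the listener time to initialize before the first conversion.
        sleep(wait_seconds)
        for source, target in jobs:
            convert(source, target)
    finally:
        # Stop the listener even if one of the conversions failed.
        listener.terminate()
        listener.wait()
    print("Done")
```

The try/finally block is a design choice of this sketch: it ensures the background listener is stopped even when a conversion raises an error.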
