Batch conversion of pdf files to text

The procedure for automatically converting pdf files into txt files in the shell has been described in detail here.
In this gist I will focus on writing a bash script that uses the find command-line program. It allows a much sleeker implementation with less code (essentially a one-liner), while being robust to file and folder names that contain whitespace or other non-standard characters (more on the issues of word splitting in bash here).
Here's the original script:
#!/bin/bash
FILES=~/pdfs/*.pdf
for f in $FILES
do
    echo "Processing $f file. "
    pdftotext -enc UTF-8 $f
done
Problems start when the path contains characters other than alphanumeric or underscore, e.g. whitespace:
FILES=~/pdfs/party\ manifestos/*.pdf
for f in $FILES
do
    echo "Processing $f file. "
    pdftotext -enc UTF-8 $f
done
Running this script will result in:
./convertpdf.sh
Processing /home/tom/pdfs/party file.
I/O Error: Couldn't open file '/home/tom/pdfs/party': No such file or directory.
Processing manifestos/*.pdf file.
I/O Error: Couldn't open file 'manifestos/*.pdf': No such file or directory.
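For reasons described here, this problem cannot be solved by putting $FILES in double quotes as "$FILES". Quoting does stop the word splitting, but it also stops the shell from expanding *.pdf, so the loop runs once with the literal pattern instead of once per file.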
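The difference between the unquoted and quoted loops can be checked without any pdf files. The following is a minimal sketch that only counts loop iterations; the throwaway directory and the file names a.pdf and b.pdf are made up for the demonstration.

```shell
#!/bin/bash
# Demo in a throwaway directory; all names here are made up for illustration.
demo=$(mktemp -d)
mkdir "$demo/party manifestos"
touch "$demo/party manifestos/a.pdf" "$demo/party manifestos/b.pdf"

# In an assignment the glob is stored literally, it is not expanded yet.
FILES=$demo/party\ manifestos/*.pdf

# Unquoted: the value is split at the space, then each piece is globbed.
unquoted=0
for f in $FILES; do
    unquoted=$((unquoted + 1))
done

# Quoted: no splitting, but no globbing either, so there is one literal pattern.
quoted=0
for f in "$FILES"; do
    quoted=$((quoted + 1))
    last=$f
done

echo "unquoted iterations: $unquoted"
echo "quoted iterations:   $quoted (value: $last)"

rm -r "$demo"
```

Neither loop visits the two pdf files. Assuming the temporary path itself contains no whitespace, the unquoted loop sees the two broken halves of the path, and the quoted loop sees a single string that still ends in *.pdf.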
The correct and more robust way to batch process multiple pdf files with pdftotext is to let find locate the files and hand them over (more on using find correctly here). Here is how to do it in two lines of code:
FOLDER=~/pdfs/party\ manifestos/
find "$FOLDER" -name '*.pdf' -exec pdftotext -enc UTF-8 {} \;
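Before converting a large folder, it can help to preview the commands that find would run. As a small sketch, putting echo in front of pdftotext makes find print each command instead of executing it. The throwaway directory and the file names below are made up so that the example runs anywhere.

```shell
#!/bin/bash
# Throwaway folder with one pdf and one non-pdf file (made-up names).
demo=$(mktemp -d)
mkdir "$demo/party manifestos"
touch "$demo/party manifestos/manifesto 2015.pdf" "$demo/party manifestos/notes.txt"

# find replaces {} with each matching path and passes it as a single argument,
# so the spaces in the name need no extra quoting; \; ends the -exec command.
preview=$(find "$demo" -name '*.pdf' -exec echo pdftotext -enc UTF-8 {} \;)
echo "$preview"

rm -r "$demo"
```

Only the pdf file shows up in the preview. Note that echo prints the path without quotes, so the printed line is for inspection only and is not safe to paste back into the shell.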
Alternatively, to keep the echo output from the original loop, pipe the results of find into a while loop:
FOLDER=~/pdfs/party\ manifestos/
find "$FOLDER" -name '*.pdf' | while IFS= read -r i; do
    # IFS= and -r keep leading spaces and backslashes in file names intact
    echo "Processing $i"
    pdftotext -enc UTF-8 "$i"
done
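One gap remains in the loop above: find separates its results with newlines, so a file name that itself contains a newline would be read as two paths. Below is a minimal sketch of a NUL-delimited variant. It assumes bash (for read -d '' and process substitution) and a find that supports -print0, which both GNU and BSD find do. The function name convert_all is made up for illustration.

```shell
#!/bin/bash
# convert_all is a hypothetical helper name, not part of the original script.
convert_all() {
    local folder=$1
    local i
    # -print0 ends every path with a NUL byte; read -d '' reads up to the next NUL.
    # IFS= keeps surrounding whitespace, -r keeps backslashes literal.
    while IFS= read -r -d '' i; do
        echo "Processing $i"
        pdftotext -enc UTF-8 "$i"
    done < <(find "$folder" -name '*.pdf' -print0)
}

# Example call, using the folder from the text:
# convert_all ~/pdfs/party\ manifestos/
```

Feeding the loop through process substitution rather than a pipe also keeps it in the current shell, so any counters or variables set inside the loop remain available afterwards.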