Extracting pages with specified string from PDF-file with Python

January 15th, 2009 by Tuomas Rasila

Yesterday I had a simple problem: one big PDF file with 6000 pages that I needed to prepare for a mail house. To get the job done, they needed two PDF files, one containing the single-page documents and one containing the multi-page ones.

Luckily, every single-page document had the string “Page: 1/1” (or the same in Finnish) at the top of the page, so writing a small Python script to do the job was easy. In addition to Python, I needed the pdftotext binary (here is a dmg for OS X) and pyPdf. So here is the code:


#!/usr/bin/python
'''This script needs the pyPdf module and the pdftotext binary'''
import sys
import os
from pyPdf import PdfFileReader, PdfFileWriter

def findstr(lookup, filename):
    '''Return True if lookup occurs anywhere in the given text file.'''
    textfile = open(filename, 'rb')
    text = textfile.read()
    textfile.close()
    # a plain substring test is all we need
    return lookup in text

try:
    searchstr = sys.argv[1]
    searchstr2 = sys.argv[2]
    pdffile = PdfFileReader(file(sys.argv[3], "rb"))
    numpages = pdffile.getNumPages()
    singlefile = sys.argv[4]
    multifile = sys.argv[5]
except IndexError:
    print "Usage: getmulti.py [searchstring1] [searchstring2] [sourcefile] [destinationfile-single] [destinationfile-multi]"
    # stop here; the rest of the script needs all five arguments
    sys.exit(1)

print "****************"
print "Extracting multipaged and singlepaged files from " + sys.argv[3] + " (%s pages)" % numpages
print "Outputting multipaged to " + multifile + " and singlepaged to " + singlefile

singleoutput = PdfFileWriter()
multioutput = PdfFileWriter()
for i in xrange(numpages):
    # extract the text of page i+1 only (pdftotext counts pages from 1);
    # the source path is quoted so that names with spaces work
    cmd = "pdftotext -f %s -l %s '%s' /tmp/foo.txt" % (i+1, i+1, sys.argv[3])
    os.system(cmd)
    print cmd
    if findstr(searchstr, "/tmp/foo.txt") or findstr(searchstr2, "/tmp/foo.txt"):
        print "got it"
        singleoutput.addPage(pdffile.getPage(i))
    else:
        print "not got"
        multioutput.addPage(pdffile.getPage(i))

multioutputStream = file(multifile, "wb")
singleoutputStream = file(singlefile, "wb")

multioutput.write(multioutputStream)
singleoutput.write(singleoutputStream)
multioutputStream.close()
singleoutputStream.close()

So now I can just say:

./getmulti.py "Page: 1/ 1" "Sivu: 1/ 1" orig.pdf single.pdf multi.pdf
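
For anyone adapting this, here is a rough sketch of the same per-page check without the temporary file and the shell. It is not the script above: it is written for Python 3, all function names are made up for illustration, and it assumes that pdftotext writes to stdout when the output file is given as "-" (which is what its man page describes). The PDF writing part is left out on purpose.

```python
import subprocess


def pdftotext_command(pdf_path, page_number):
    """Build the argument list that extracts one page (1-based) to stdout."""
    page = str(page_number)
    # "-" as the output file asks pdftotext to write the text to stdout
    return ["pdftotext", "-f", page, "-l", page, pdf_path, "-"]


def page_text(pdf_path, page_number):
    """Run pdftotext for a single page and return its text as bytes."""
    # Passing a list (no shell) keeps paths with spaces or quotes intact
    return subprocess.run(
        pdftotext_command(pdf_path, page_number),
        check=True,
        stdout=subprocess.PIPE,
    ).stdout


def is_single_page(text, markers):
    """True if any marker such as b"Page: 1/ 1" occurs in the page text."""
    return any(marker in text for marker in markers)


def split_page_indices(page_texts, markers):
    """Return (single, multi): 0-based page indices with and without a marker."""
    single, multi = [], []
    for index, text in enumerate(page_texts):
        (single if is_single_page(text, markers) else multi).append(index)
    return single, multi
```

The two index lists could then be fed to getPage(i) and addPage() in the same way as in the loop of the script, with page_text(path, i + 1) supplying the text for page index i.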

One Response to “Extracting pages with specified string from PDF-file with Python”

  1. Seth says:

    Thanks for the pdftotext. Doesn’t handle math type well.
