Yesterday I had a simple problem. I had one big PDF file with 6000 pages, and I wanted to prepare it for a mail house. To get the job done, they needed two PDF files: one with the single-page documents and one with the multi-page ones.
Luckily, all the single-page documents had the string “Page: 1/1” (or the same in Finnish) at the top of the page, so writing a small Python script to do the job was easy. In addition to Python, I needed the pdftotext binary (here is a dmg for OS X) and pyPdf. So here is the code:
#!/usr/bin/python
'''This script needs the pyPdf module and the pdftotext binary.'''
import sys
import os
from pyPdf import PdfFileReader, PdfFileWriter
def findstr(lookup, filename):
    '''Return True if lookup occurs anywhere in the given text file.'''
    textfile = open(filename, 'rb')
    text = textfile.read()
    textfile.close()
    # a plain substring test; the original loop returned on its first pass anyway
    return lookup in text
try:
    searchstr = sys.argv[1]
    searchstr2 = sys.argv[2]
    pdffile = PdfFileReader(file(sys.argv[3], "rb"))
    numpages = pdffile.getNumPages()
    singlefile = sys.argv[4]
    multifile = sys.argv[5]
except IndexError:
    print "Usage: getmulti.py [searchstring1] [searchstring2] [sourcefile] [destinationfile-single] [destinationfile-multi]"
    # stop here; the names used below are not all defined without the arguments
    sys.exit(1)
print "****************"
print "Extracting multipaged and singlepaged files from " + sys.argv[3] + " (%s pages)" % numpages
print "Outputting multipaged to " + multifile + " and singlepaged to " + singlefile
singleoutput = PdfFileWriter()
multioutput = PdfFileWriter()
for i in xrange(numpages):
    # pdftotext numbers pages from 1, pyPdf from 0
    os.system("pdftotext -f %s -l %s %s /tmp/foo.txt" % (i+1, i+1, sys.argv[3]))
    print "pdftotext -f %s -l %s %s /tmp/foo.txt" % (i+1, i+1, sys.argv[3])
    if findstr(searchstr, "/tmp/foo.txt") or findstr(searchstr2, "/tmp/foo.txt"):
        print "got it"
        singleoutput.addPage(pdffile.getPage(i))
    else:
        print "not got"
        multioutput.addPage(pdffile.getPage(i))
multioutputStream = file(multifile, "wb")
singleoutputStream = file(singlefile, "wb")
multioutput.write(multioutputStream)
singleoutput.write(singleoutputStream)
multioutputStream.close()
singleoutputStream.close()
So now I can just say:
./getmulti.py "Page: 1/ 1" "Sivu: 1/ 1" orig.pdf single.pdf multi.pdf
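For anyone adapting this, here is a rough sketch of the same split rule with the text extraction and the decision pulled apart, so the decision can be tried without a PDF at hand. It is not the script above: it assumes Python 3.7 or newer, and the function names (pdftotext_command, page_text, is_single_page, split_pages) are made up for illustration. It also assumes pdftotext writes to standard output when the output file is given as "-", which avoids the shared /tmp/foo.txt file.

```python
import subprocess


def pdftotext_command(pdf_path, page_number):
    """Build an argument list that extracts one page (1-based) to stdout."""
    page = str(page_number)
    # Assumption: "-" as the output file makes pdftotext write to stdout.
    # Passing a list (no shell) keeps file names with spaces intact.
    return ["pdftotext", "-f", page, "-l", page, pdf_path, "-"]


def page_text(pdf_path, page_number):
    """Run pdftotext for a single page and return its text."""
    result = subprocess.run(
        pdftotext_command(pdf_path, page_number),
        capture_output=True,
        check=True,
    )
    # Undecodable bytes are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")


def is_single_page(text, markers):
    """True if any marker string appears anywhere in the page text."""
    return any(marker in text for marker in markers)


def split_pages(texts, markers):
    """Return (single, multi) lists of 0-based page indices."""
    single, multi = [], []
    for index, text in enumerate(texts):
        if is_single_page(text, markers):
            single.append(index)
        else:
            multi.append(index)
    return single, multi


# Made-up page texts standing in for pdftotext output.
markers = ["Page: 1/ 1", "Sivu: 1/ 1"]
texts = [
    "Page: 1/ 1\nInvoice",
    "Page: 1/ 2\nReport",
    "Page: 2/ 2\ncontinued",
    "Sivu: 1/ 1\nLasku",
]
print(split_pages(texts, markers))  # ([0, 3], [1, 2])
```

One thing this makes easy to see: because the check is a plain substring match, a marker such as "Page: 1/ 1" would also match the first page of a document with ten or more pages ("Page: 1/ 10"), so a longer or anchored marker may be needed for such files.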
Thanks for the pdftotext. Doesn’t handle math type well.