<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Rasila Garage &#187; scripting</title>
	<atom:link href="http://rasilagarage.com/tag/scripting/feed/" rel="self" type="application/rss+xml" />
	<link>http://rasilagarage.com</link>
	<description>Tuomas Rasila's blog about software and entrepreneurship</description>
	<lastBuildDate>Sun, 07 Mar 2010 09:11:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Extracting pages with specified string from PDF-file with Python</title>
		<link>http://rasilagarage.com/2009/01/extracting-pdf-with-python/</link>
		<comments>http://rasilagarage.com/2009/01/extracting-pdf-with-python/#comments</comments>
		<pubDate>Thu, 15 Jan 2009 20:43:44 +0000</pubDate>
		<dc:creator>Tuomas Rasila</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[scripting]]></category>

		<guid isPermaLink="false">http://rasilagarage.com/?p=1</guid>
		<description><![CDATA[Yesterday I had a simple problem. I had one big PDF file with 6000 pages in it and I wanted to prepare it for a mail house. What they needed to get the job done was two PDF files, one with single-page documents and one with multi-page ones. Luckily, all the single-page documents had the string &#8220;Page: 1/1&#8221; (or [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I had a simple problem. I had one big PDF file with 6000 pages in it and I wanted to prepare it for a mail house. What they needed to get the job done was two PDF files, one with the single-page documents and one with the multi-page ones.</p>
<p>Luckily, all the single-page documents had the string &#8220;Page: 1/1&#8221; (or the same in Finnish) at the top of the page, so writing a small Python script to do the job was easy. In addition to Python, I needed the <a href="http://en.wikipedia.org/wiki/Pdftotext">pdftotext</a> binary (<a href="http://www.bluem.net/downloads/pdftotext_en/">here is a dmg for OS X</a>) and <a href="http://pybrary.net/pyPdf/">pyPdf</a>. So here is the code:<br />
<span id="more-1"></span></p>
<pre><code>
#!/usr/bin/python
'''This script needs the pyPdf module and the pdftotext binary.'''
import sys
import os
from pyPdf import PdfFileReader, PdfFileWriter

def findstr(lookup, filename):
    '''Return True if the string lookup occurs anywhere in the given file.'''
    textfile = open(filename, 'rb')
    text = textfile.read()
    textfile.close()
    # a plain substring test is enough; we only care whether it occurs at all
    return lookup in text

try:
    searchstr = sys.argv[1]
    searchstr2 = sys.argv[2]
    pdffile = PdfFileReader(file(sys.argv[3], "rb"))
    numpages = pdffile.getNumPages()
    singlefile = sys.argv[4]
    multifile = sys.argv[5]
except IndexError:
    print "Usage: getmulti.py [searchstring1] [searchstring2] [sourcefile] [destinationfile-single] [destinationfile-multi]"
    # stop here; the rest of the script needs all five arguments
    sys.exit(1)

print "****************"
print "Extracting multipaged and singlepaged files from " + sys.argv[3] + " (%s pages)" % numpages
print "Outputting multipaged to " + multifile + " and singlepaged to " + singlefile

singleoutput = PdfFileWriter()
multioutput = PdfFileWriter()
for i in xrange(numpages):
    # pdftotext counts pages from 1; quote the file name so paths with spaces work
    command = "pdftotext -f %s -l %s '%s' /tmp/foo.txt" % (i+1, i+1, sys.argv[3])
    os.system(command)
    print command
    if findstr(searchstr, "/tmp/foo.txt") or findstr(searchstr2, "/tmp/foo.txt"):
        print "got it"
        singleoutput.addPage(pdffile.getPage(i))
    else:
        print "not got"
        multioutput.addPage(pdffile.getPage(i))

multioutputStream = file(multifile, "wb")
singleoutputStream = file(singlefile, "wb")

multioutput.write(multioutputStream)
singleoutput.write(singleoutputStream)
multioutputStream.close()
singleoutputStream.close()
</code>
</pre>
<p>So now I can just say:<br />
<code><br />
./getmulti.py "Page: 1/ 1" "Sivu: 1/ 1" orig.pdf single.pdf multi.pdf<br />
</code></p>
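<p>The script above is Python 2. As a rough sketch of how the same page test might look in Python 3, the snippet below reads the page text straight from pdftotext's standard output (a file name of "-" makes pdftotext print to stdout) instead of going through /tmp/foo.txt. The function names page_text and split_pages are made up for this illustration, and passing -enc UTF-8 is an assumption about the encoding you want. The demo at the bottom uses stand-in strings, so it runs without pdftotext or a PDF file:</p>
<pre><code>
#!/usr/bin/env python3
"""Sketch: sort PDF pages into single/multi by a marker string."""
import subprocess


def page_text(pdf_path, page_number):
    """Return the text of one page (1-based); needs the pdftotext binary."""
    # The trailing "-" tells pdftotext to write the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8",
         "-f", str(page_number), "-l", str(page_number), pdf_path, "-"],
        capture_output=True, check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_pages(page_texts, markers):
    """Return (single, multi): 0-based indexes of pages with / without a marker."""
    single, multi = [], []
    for index, text in enumerate(page_texts):
        if any(marker in text for marker in markers):
            single.append(index)
        else:
            multi.append(index)
    return single, multi


if __name__ == "__main__":
    # Stand-in strings in place of real pdftotext output (hypothetical data).
    demo = ["Page: 1/ 1\nInvoice", "Page: 1/ 2\nLetter", "Sivu: 1/ 1\nLasku"]
    print(split_pages(demo, ["Page: 1/ 1", "Sivu: 1/ 1"]))
</code>
</pre>
<p>With a real file you would build the list with page_text(path, n) for each page number n, and then pass the two index lists to whatever PDF library you use to write the output files.</p>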
]]></content:encoded>
			<wfw:commentRss>http://rasilagarage.com/2009/01/extracting-pdf-with-python/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
