Monday, September 24, 2007

Download all pdfs in a web page

A while ago I found an out-of-print book whose web page listed a series of downloadable pdfs.

Since there were 17 chapters and several other sections, I didn't want to right-click my way through the list, so I wrote a quick Python script to do it for me.

This must be a Python idiom. I've had to re-invent it so often that I thought I'd blog it, so that I - and you - will never have to create it from scratch again.

Here it is.

import urllib, re, sys
# download each pdf linked to in a web-page.
# assumes that the urls are all relative,
# which is usually the case
# [^"]+ keeps the match inside a single href,
# even when one line holds several links
pdflink = re.compile(r'<a href="([^"]+\.pdf)">')
baseURL = sys.argv[1]
page = urllib.urlopen(baseURL+"contents.html")

for line in page.readlines():
    match = pdflink.search(line)
    if match:
        filename = match.group(1)
        print "downloading %s" % filename
        urllib.urlretrieve(baseURL+filename, filename)
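
The script above is Python 2. On Python 3 the stdlib moved things around, so here's a sketch of the same idiom using urllib.request, under the same assumptions (relative hrefs, an index page named contents.html):

import re, sys, urllib.request

# Python 3 version of the script above.
# Same assumptions: the hrefs are relative to baseURL
# and the contents page lives at baseURL + "contents.html".
pdflink = re.compile(r'<a href="([^"]+\.pdf)">')
baseURL = sys.argv[1]
page = urllib.request.urlopen(baseURL + "contents.html")

for line in page.read().decode("utf-8", errors="replace").splitlines():
    match = pdflink.search(line)
    if match:
        filename = match.group(1)
        print("downloading %s" % filename)
        urllib.request.urlretrieve(baseURL + filename, filename)

Run either version with the base url - trailing slash included, since the script just concatenates - as its only argument, something like python getpdfs.py http://example.com/book/ (that script name is just an example; use whatever you saved it as).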
