Monday, September 24, 2007

Download all PDFs in a web page

A while ago I found an out-of-print book whose web page listed a series of downloadable PDFs.

Since there were 17 chapters and several other sections, I didn't want to right-click my way through the list, so I wrote a quick Python script to do it for me.

This must be a Python idiom. I've had to re-invent it so often that I thought I'd blog it, so that I - and you - will never have to write it from scratch again.

Here it is.

import urllib, re, sys

# Download each PDF linked to in a web page.
# Assumes that the URLs are all relative, which is usually the case.
# [^"]* stops the match at the closing quote, so a line with
# several links won't be swallowed by one greedy match.
pdflink = re.compile(r'<a href="([^"]*\.pdf)"')
baseURL = sys.argv[1]
page = urllib.urlopen(baseURL + "contents.html")

for line in page.readlines():
    match = pdflink.search(line)
    if match:
        filename = match.group(1)
        print "downloading %s" % filename
        urllib.urlretrieve(baseURL + filename, filename)
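
If the contents page mixes relative and absolute links, urlparse.urljoin handles both cases. Here's a minimal variant of the script above - a sketch along the same lines, not something I've needed yet:

import urllib, urlparse, re, sys

pdflink = re.compile(r'<a href="([^"]*\.pdf)"')
baseURL = sys.argv[1]
page = urllib.urlopen(baseURL + "contents.html")

for line in page.readlines():
    match = pdflink.search(line)
    if match:
        # urljoin leaves absolute URLs alone and resolves relative ones
        url = urlparse.urljoin(baseURL, match.group(1))
        # use the last path component as the local filename
        filename = url.split("/")[-1]
        print "downloading %s" % filename
        urllib.urlretrieve(url, filename)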

Google soft-launches social bookmarking tool

Is Google Shared Stuff the latest hot thing? It's a new social bookmarking tool.

It's available now, although you won't yet find it in the list of Google services and tools, nor on the Google labs page.

As well as sharing with your friends, the new bookmarklet allows you to share your links via several third-party sites, including Facebook, digg, reddit and del.icio.us.

In itself, it wouldn't be world-shaking; but
  1. It's from Google.
  2. There are rumors of a soon-to-be-released Google API for social networking.

Saturday, September 22, 2007

Count Blog subscriptions with this Python script

If you blog, you're probably keen to know the size of your readership. Many of your subscribers will read your feed through an RSS aggregation service, and these days there are lots of aggregators. So how do you count subscribers across all of them?

This short Python script does just that.

If you can access the httpd logs of the server that hosts your blog, you'll find that many aggregators send you a subscriber count when they check for new blog entries.

A log entry for a typical aggregator request looks like this:

65.214.44.29 - - [21/Sep/2007:04:12:34 +0100]
"GET /atom.xml HTTP/1.1" 200 0 "-"
"Bloglines/3.1 (http://www.bloglines.com; 10 subscribers)"


The Python script below filters out lines that contain the string "subscriber", extracts the IP address and subscriber count, and calculates the total number of subscriptions.

The script is very simple, but there are a couple of wrinkles.

A single log file can include multiple requests from a given aggregator. The script puts counts into a dictionary keyed by the aggregator's IP address, so the total includes only the latest value for each aggregator.

In theory, you could get a request that included the string 'subscriber' from some other source. (For example, someone might visit a URL for newsletter subscribers.)
The script has a fudge to treat such requests as if they came from a dummy aggregator with no subscribers.

Here's the script:

# count blog subscribers from httpd log file
import re

ip = r"^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
anything = ".*"
count = r"\s(\d+)\s"
subscriber = "subscriber"
subscription_entry = re.compile(ip + anything + count + subscriber)

def ipc(line):
    match = subscription_entry.search(line)
    if match:
        return match.group(1, 2)
    else:
        # just in case the line contains 'subscriber'
        # but does not match the regular expression
        null_entry = ("0.0.0.0", 0)
        return null_entry

def total_counts(ips_and_counts):
    # there may be multiple log entries for a given aggregator,
    # so we use a dictionary and just keep the last value,
    # which is the latest one for each aggregator
    latest = {}
    for (addr, n) in ips_and_counts:
        latest[addr] = int(n)
    return sum(latest.values())

def countSubscribersInLogFile(filename):
    log = open(filename)
    counts = [ipc(line) for line in log if subscriber in line]
    total = total_counts(counts)
    log.close()
    return total

if __name__ == "__main__":
    import sys
    filename = sys.argv[1]
    details = (countSubscribersInLogFile(filename), filename)
    print "%d subscriptions in %s" % details

Monday, September 10, 2007

Builders

Nat Pryce has posted an excellent overview of Test Data Builders. (Don't be put off by the image!)

Sadly, he doesn't (yet) say much about the problem of unit testing builders, though he does point out that Builder classes may well find their way into production code. Nat, that's a hint...

XPDay7 is shaping up; now on to SPA2008!

The programme committee for XPDay7 met last week to review session proposals. We'd received an outstanding collection of submissions, so there was no problem in filling the programme with quality sessions. The hard part, as usual, was to decide which ones we would not include.

The meeting went very smoothly. Angela Martin and Ivan Moore are joint program chairs this year. They put a lot of hard work into preparation, and it really paid off.

If you've submitted a proposal for XPDay7, expect to hear from Angela or Ivan soon.

I'm now bending my mind to SPA2008. Ivan Moore and Eoin Woods are joint program chairs. As conference chair, I'm as keen as they are to see another outstanding programme.

If you submitted a proposal for XPDay7, consider submitting a session for SPA. We like a mix of new and proven sessions, so acceptance for XPDay7 shouldn't stop you from submitting a similar proposal for SPA. Nor should rejection - there were excellent submissions for which we just had no room, and they might well fit in the (longer) SPA programme.

The SPA deadline is Monday 17th September, but you don't need to submit finished material - just a proposal.

Sunday, September 09, 2007

Backing up vital data with Amazon S3

I finally got around to trying Amazon S3 for data backup.

Just in case you've missed out on S3, it's a commercial network-based data storage service offered by Amazon. They claim to use the same technology for S3 that lies behind the Amazon stores.

S3 comes with no service level guarantees and there have been some reports of occasional unavailability and/or slow transfer. I'm not too worried; the things that are critical to me are data security and cost.

I'm using the JetS3t Java(tm) toolkit to manage the interaction with the S3 service. The Cockpit application gives a GUI view of what's in your S3 buckets; the Synchronize application gives a simple command-line interface that lets you keep data on S3 in step with data on your machine(s), and retrieve it when necessary.

You need to be a little careful, as synchronization can delete data as well as add or update it. To mitigate this, there is a --noaction option which allows you to see what would happen to your data without actually changing anything.
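
A dry run looks something like this - the launcher name and argument order here are from memory, so check the JetS3t documentation for your version, and the bucket name and path are made-up examples:

synchronize.sh --noaction UP MyBackupBucket /home/me/data

If the planned changes look right, drop --noaction to do the real transfer.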

Minor snags

I've hit a couple of minor snags so far; neither is caused by the S3 service itself, but you need to be aware of them.

The first arose when I tried to back up a large amount of data. I'm keen to keep a copy of my CVS repository off-site; it's well backed up, but all the recent backups are in my home office. If we had a fire or a burglary I'd be in real trouble.

I realised soon after the upload started that it would take a while. The repository is about 2.7 GB and, although I have fast broadband, my upstream speed is only 256 kbit/s. That works out at roughly a day of solid uploading, and indeed the upload took 25 hours!

Smart synchronization

Synchronize is smart enough not to upload files if they are unchanged. I figured that the next upload would take a few minutes at most. The second snag arose when I tried to check that.

Part way through, the Synchronize application bombed with an out-of-memory error. I've now modified the launch script to give the JVM 512 MB of heap space, and it runs just fine. And now a burglary would be inconvenient but not disastrous.
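
If you hit the same error, the fix is the standard JVM heap flag: add -Xmx512m to the java invocation in the launch script, leaving everything else as it was:

java -Xmx512m ...rest of the original synchronize command line...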

Friday, September 07, 2007

Measuring cultural differences

James Manktelow has just published an interesting article about measuring and understanding cultural differences. It's based on Geert Hofstede's work on cultural dimensions.

Hofstede has identified five key dimensions that can be used to classify and understand local cultures. His analysis is based on a database of employee-values surveys that IBM collected in more than 70 countries during the 1960s and 70s.

Hofstede's Cultural Dimensions

Hofstede analyses culture in these dimensions:
  • Power Distance
  • Individualism
  • Masculinity
  • Uncertainty Avoidance
  • Long-Term Orientation
It looks as if these measures would form a useful framework for analysing and understanding the culture of the organisations we work for.

Introducing Agile - Gerry Weinberg's approach

David Petersen just sent me a link to a great interview: "5Qs on Agile with Gerald M. Weinberg".

The interview includes a neat way of getting people to want to adopt Agile techniques, rather than trying to force them.

Weinberg is one of my favourite authors, but I wasn't subscribed to his blogs. I am now.

His writing blog has another interesting post about self-publishing.