Tuesday, October 16, 2007

Jython lives!

It's good to see that Jython is under active development again.

I use CPython a lot. I'd pretty much stopped using Jython because it does not currently support language features that I find essential - decimals, generators and list comprehensions, for example.

The project seemed to be dormant, but Jython 2.2 has just been released, and it looks like Jython 2.5 is on the way, with all the critical features that I need. Excellent news!

Monday, September 24, 2007

Download all pdfs in a web page

A while ago I found an out-of-print book with a web-page that listed a series of downloadable pdfs.

Since there were 17 chapters and several other sections, I didn't want to right-click my way through the list so I wrote a quick Python script to do it for me.

This must be a Python idiom. I've had to re-invent it so often that I thought I'd blog it so that I - and you - will never have to create it from scratch again.

Here it is.

import urllib, re, sys
# download each pdf linked to in a web-page.
# assumes that the urls are all relative,
# which is usually the case
# [^"]* keeps the match inside a single href value
pdflink = re.compile(r'<a href="([^"]*\.pdf)">')
baseURL = sys.argv[1]
page = urllib.urlopen(baseURL+"contents.html")

for line in page.readlines():
    match = pdflink.search(line)
    if match:
        filename = match.group(1)
        print "downloading %s" % filename
        urllib.urlretrieve(baseURL+filename, filename)
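The script above is written for Python 2 and assumes relative links. As a rough sketch only, here is how the link-finding step might look in Python 3, using urljoin so that relative and absolute links are both handled. The function name pdf_urls, the sample HTML and the example.com addresses are all made up for illustration.

```python
import re
from urllib.parse import urljoin

# Matches the href of any anchor whose target ends in .pdf.
# [^"]+ keeps the match inside a single attribute value.
PDF_LINK = re.compile(r'<a\s+[^>]*href="([^"]+\.pdf)"', re.IGNORECASE)


def pdf_urls(html, page_url):
    """Return absolute URLs for every PDF link found in html."""
    # urljoin leaves absolute links alone and resolves relative ones
    # against the address of the page that contained them.
    return [urljoin(page_url, href) for href in PDF_LINK.findall(html)]


# Invented page content, used only to exercise the function.
sample = (
    '<a href="ch01.pdf">Chapter 1</a> <a href="notes.html">Notes</a>\n'
    '<a class="dl" href="http://example.org/extra/appendix.pdf">Appendix</a>\n'
)
for url in pdf_urls(sample, "http://example.com/book/contents.html"):
    print(url)
```

Each URL returned could then be passed to urllib.request.urlretrieve, which is the Python 3 home of the urlretrieve call used above.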

Google soft-launches social bookmarking tool

Is Google Shared Stuff the latest hot thing? It's a new social bookmarking tool.

It's available now, although you won't yet find it in the list of Google services and tools, nor on the Google labs page.

As well as sharing with your friends, the new bookmarklet allows you to share your links via several third-party sites, including Facebook, digg, reddit and del.icio.us.

In itself, it wouldn't be worldshaking; but
  1. It's from Google
  2. There are rumors of a soon-to-be-released Google API for social networking.

Saturday, September 22, 2007

Count Blog subscriptions with this Python script

If you blog, you're probably keen to know the size of your readership. Many of your subscribers will use an RSS aggregation service, and these days there are lots of feed aggregators. How do you count the subscribers?

This short Python script does just that.

If you can access the httpd logs of the server that hosts your blog, you'll find that many aggregators send you a subscriber count when they check for new blog entries.

A log entry for a typical aggregator request looks like this:

- - [21/Sep/2007:04:12:34 +0100]
"GET /atom.xml HTTP/1.1" 200 0 "-"
"Bloglines/3.1 (http://www.bloglines.com; 10 subscribers)"

The Python script below picks out the lines that contain the string "subscriber", extracts the IP address and subscriber count from each, and calculates the total number of subscriptions.

The script is very simple, but there are a couple of wrinkles.

A single log file can include multiple requests from a given aggregator. The script puts counts into a dictionary keyed by the aggregator's IP address, so the total count includes only the latest value for each aggregator.

In theory, you could get a request that included the string 'subscriber' from some other source. (For example, someone might visit a URL for newsletter subscribers.)
The script has a fudge to treat such requests as if they came from a dummy aggregator with no subscribers.

Here's the script:

# count blog subscribers from httpd log file
import re

ip = r"^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
anything = ".*"
count = r"\s(\d+)\s"
subscriber = "subscriber"
subscription_entry = re.compile(ip+anything+count+subscriber)

def ipc(line):
    # return the (ip, count) pair found in a log line
    match = subscription_entry.search(line)
    if match:
        return match.group(1,2)
    # just in case the line contains 'subscriber'
    # but does not match the regular expression
    null_entry = ("", 0)
    return null_entry

def total_counts(ips_and_counts):
    # there may be multiple log entries for a given aggregator
    # so we use a dictionary and just keep the last value
    # which is the latest one for each aggregator
    latest = {}
    for (ip, count) in ips_and_counts:
        latest[ip] = int(count)
    return sum(latest.values())

def countSubscribersInLogFile(filename):
    log = open(filename)
    counts = [ipc(line) for line in log if subscriber in line]
    log.close()
    total = total_counts(counts)
    return total

if __name__ == "__main__":
    import sys
    filename = sys.argv[1]
    details = (countSubscribersInLogFile(filename), filename)
    print "%d subscriptions in %s" % details
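To see the two wrinkles in action, here is a small Python 3 sketch that applies the same regular expression to a handful of log lines held in memory. The log lines, the 192.0.2.x addresses and the ExampleReader aggregator are invented for the example, and the function name count_subscribers is mine. Unlike the script above, it simply skips lines that don't match rather than using a dummy entry, which gives the same total.

```python
import re

# Same pattern as the script above: leading IP address, then the
# number that immediately precedes the word "subscriber".
ENTRY = re.compile(r"^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*\s(\d+)\ssubscriber")


def count_subscribers(lines):
    """Sum the latest subscriber count reported from each IP address."""
    latest = {}
    for line in lines:
        match = ENTRY.search(line)
        if match:
            ip, count = match.groups()
            latest[ip] = int(count)  # later entries overwrite earlier ones
    return sum(latest.values())


# Invented log lines; the addresses come from a documentation-only range.
sample_log = [
    '192.0.2.10 - - [21/Sep/2007:04:12:34 +0100] "GET /atom.xml HTTP/1.1" '
    '200 0 "-" "Bloglines/3.1 (http://www.bloglines.com; 10 subscribers)"',
    '192.0.2.10 - - [21/Sep/2007:09:12:34 +0100] "GET /atom.xml HTTP/1.1" '
    '200 0 "-" "Bloglines/3.1 (http://www.bloglines.com; 12 subscribers)"',
    '192.0.2.20 - - [21/Sep/2007:05:00:00 +0100] "GET /atom.xml HTTP/1.1" '
    '200 0 "-" "ExampleReader/1.0 (http://reader.example; 3 subscribers)"',
    '192.0.2.30 - - [21/Sep/2007:06:00:00 +0100] "GET /subscribers.html '
    'HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(count_subscribers(sample_log))  # prints 15
```

The first aggregator reports twice, so only its later count of 12 is kept; the visit to /subscribers.html contains the word but no count, so it adds nothing.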

Monday, September 10, 2007


Nat Pryce has posted an excellent overview of Test Data Builders. (Don't be put off by the image!)

Sadly, he doesn't (yet) say much about the problem of unit testing builders, though he does point out that Builder classes may well find their way into production code. Nat, that's a hint...

XPDay7 is shaping up; now on to SPA2008!

The programme committee for XPDay7 met last week to review session proposals. We'd received an outstanding collection of submissions, so there was no problem in filling the program with quality sessions. The hard part, as usual, was to decide which ones we would not include.

The meeting went very smoothly. Angela Martin and Ivan Moore are joint program chairs this year. They put a lot of hard work into preparation, and it really paid off.

If you've submitted a proposal for XPDay7, expect to hear from Angela or Ivan soon.

I'm now bending my mind to SPA2008. Ivan Moore and Eoin Woods are joint program chairs. As conference chair, I'm as keen as they are to see another outstanding programme.

If you submitted a proposal for XPDay7, consider submitting a session for SPA. We like a mix of new and proven sessions, so acceptance for XPDay7 shouldn't stop you from submitting a similar proposal for SPA. Nor should rejection - there were excellent submissions for which we just had no room, and they might well fit in the (longer) SPA programme.

The SPA deadline is Monday 17th September, but you don't need to submit finished material - just a proposal.

Sunday, September 09, 2007

Backing up vital data with Amazon S3

I finally got around to trying Amazon S3 for data backup.

Just in case you've missed out on S3, it's a commercial network-based data storage service offered by Amazon. They claim to use the same technology for S3 that lies behind the Amazon stores.

S3 comes with no service level guarantees and there have been some reports of occasional unavailability and/or slow transfer. I'm not too worried; the things that are critical to me are data security and cost.

I'm using the JetS3t Java(tm) toolkit to manage the interaction with the S3 service. The Cockpit application gives a GUI view of what's in your S3 buckets; the Synchronize application gives a simple command-line interface that allows you to keep data on S3 in step with data on your machine(s), and retrieve that data when necessary.

You need to be a little careful, as synchronization can delete data as well as add or update it. To mitigate this, there is a --noaction option which allows you to see what would happen to your data without actually changing anything.

Minor snags

I've hit a couple of minor snags so far; neither is caused by the S3 service itself, but you need to be aware of them.

The first arose when I tried to back up a large amount of data. I'm keen to keep a copy of my CVS repository off-site; it's well backed-up, but all the recent backups are in my home office. If we had a fire or a burglary I'd be in real trouble.

I realised soon after the upload started that it would take a while. The repository is about 2.7 GB and although I have fast broadband, my upstream speed is only 256 kbit/s. The upload took 25 hours!
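A back-of-envelope check shows that figure is about what you'd expect. The sketch below assumes the repository size is 2.7 decimal gigabytes and the upstream speed is 256 kilobits per second, and it ignores protocol overhead, so it gives a lower bound rather than a prediction.

```python
def upload_hours(size_gigabytes, upstream_kbit_per_s):
    """Estimate upload time in hours, ignoring protocol overhead."""
    size_bits = size_gigabytes * 1e9 * 8           # decimal gigabytes to bits
    seconds = size_bits / (upstream_kbit_per_s * 1000)
    return seconds / 3600


print(round(upload_hours(2.7, 256), 1))  # prints 23.4
```

That is roughly 23 and a half hours of pure transfer, so 25 hours once overhead is included looks plausible.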

Smart synchronization

Synchronize is smart enough not to upload files if they are unchanged. I figured that the next upload would take a few minutes at most. The second snag arose when I tried to check that.

Part way through, the Synchronize application bombed with an out-of-memory error. I've now modified the script to give it 512 MB of heap space, and it runs just fine. And now a burglary would be inconvenient but not disastrous.

Friday, September 07, 2007

Measuring cultural differences

James Manktelow has just published an interesting article about measuring and understanding cultural differences. It's based on Geert Hofstede's work on cultural dimensions.

Hofstede has identified five key dimensions which can be used to classify and understand local cultures. His analysis is based on a database collected by IBM in the 1960s and 1970s, surveying the values of employees in more than 70 countries.

Hofstede's Cultural Dimensions

Hofstede analyses culture in these dimensions:
  • Power Distance
  • Individualism
  • Masculinity
  • Uncertainty Avoidance and
  • Long-Term Orientation
It looks as if these measures would form a useful framework for analysing and understanding the culture of the organisations we work for.

Introducing Agile - Jerry Weinberg's approach

David Petersen just sent me a link to a great interview: "5Qs on Agile with Gerald M. Weinberg".

The interview includes a neat way of getting people to want to adopt Agile techniques, rather than trying to force them.

Weinberg is one of my favourite authors, but I wasn't subscribed to his blogs. I am now.

His writing blog has another interesting post about self-publishing.

Friday, August 31, 2007

Virtual Appliances: only run them if you trust them!

I forgot to add an important warning to yesterday's post about Virtual Appliances!

When you start up a virtual appliance it will probably try to get an IP address on your private network via DHCP. If the Virtual Appliance contains malicious code, it could then do all kinds of nasty things.

The bottom line: as with any software that you install behind your firewall, don't run a virtual appliance unless you trust the source you got it from.

Thursday, August 30, 2007

The Virtual Appliance: a simple way to install complex software

This morning I saw an email asking for suggestions for a cheap or low-cost web content management system. There was one other important requirement - the software would be installed, managed and used by "semi-technical people".

A couple of people suggested Drupal. I've heard good things about Drupal, so I took a quick look at the website. Here's a sample of what hits you on the installation page:

Drupal requires a web server, PHP4 (4.3.3 or greater) or PHP5 (http://www.php.net/) and either MySQL (http://www.mysql.com/) or PostgreSQL (http://www.postgresql.org/).

That's hardly ideal for semi-technical people.

Once Drupal is installed, however, it seems easy to configure and use. What my friend's users need is a simple way to install the application in the first place.

The best approach I can think of is to install a Virtual Appliance.

A Virtual Appliance is a virtual machine image which you can run using VMware, Xen or Parallels. It contains a complete application, pre-configured and ready to run.

Using a Virtual Appliance with the free VMware player takes just 5 steps:
  1. Download VMware Player.
  2. Install it (on Windows, that's a one-click operation).
  3. Download and unzip a suitable Virtual Appliance.
  4. Start the Player and
  5. Start the Appliance.
Now your chosen application is up and ready for use.

I tried out the Drupal appliance from JumpBox. It took me about 5 minutes to install and start using it. (I had a bit of a head-start because I already had VMware Player installed.)

If you know someone who wants user-friendly software but who lacks the technical expertise to install it, look and see if you can find a virtual appliance to suit.

More and more software (open source and commercial) is being offered this way, and it's also a great way to distribute software that you've written or configured.

Monday, August 27, 2007

Automated end-to-end testing made easier with VMware Server

Automated testing is one of the most important practices in Agile Development. I like to have a hierarchy of automated tests that cover the whole system as deployed and all of its component parts, right down to unit tests that tell me whether each class is fulfilling its contract.

Starting with end-to-end tests

There's a lot published about unit testing and acceptance tests, but much less about end-to-end testing. That's disappointing, given that end-to-end tests are arguably the first thing you should work on when developing a new application.

The need for a clean deployment environment

End-to-end testing gives you a chance to check out your deployment process and verify that the deployed application works in the target environment. Sometimes, though, a badly planned testing process can lull you into a false sense of security; most of us have encountered applications that work well in their test environment but fail when deployed to a clean machine.

It usually takes too much time and effort to create a fresh testing environment from scratch every time you want to do an end-to-end test. There is a simple, fast alternative: VMware Server.

VMware Server to the rescue

VMware Server is a free (as in beer) version of VMware's virtualisation software. I run it on an Ubuntu server, and use it to create a clean environment for applications that I'm developing. It has a snapshot capability that allows me to capture a clean "before deployment" image, and restore it in a few seconds before running a test. Best of all, there's an external scripting interface which allows you to automate the process.

You can download VMware Server for free (after registration). I also use their commercial VMware Workstation product; the two play well together, as you'd expect, but you can do everything you need for end-to-end testing with the free product.
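As a rough sketch of what that automation might look like, the Python 3 function below runs the three steps in order: restore the clean snapshot, deploy, then run the end-to-end tests. It stops at the first failure. The function name and the command lists are placeholders of my own, not real VMware commands; the actual snapshot-restore command depends on the scripting interface you have installed.

```python
import subprocess


def run_clean_test(restore_cmd, deploy_cmd, test_cmd, run=subprocess.call):
    """Restore the clean snapshot, deploy, then run the end-to-end tests.

    Each *_cmd is an argument list for an external program. Returns the
    name of the first step that failed, or None if every step succeeded.
    The run parameter exists so that the function can be tried out
    without launching real programs.
    """
    steps = [("restore", restore_cmd), ("deploy", deploy_cmd), ("test", test_cmd)]
    for name, cmd in steps:
        if run(cmd) != 0:      # a non-zero exit status means the step failed
            return name        # stop here: later steps would be meaningless
    return None
```

It might be called as run_clean_test(["./restore-snapshot.sh"], ["./deploy.sh"], ["./run-e2e-tests.sh"]), where all three script names are hypothetical wrappers that you would write yourself.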

Thursday, July 19, 2007

JavaScript - are you serious?

One of the sessions at Monday's miniSPA conference was Serious JavaScript led by Peter Marks and David Harvey.

I missed the original session at SPA2007, and I was very glad to get a chance to see it the second time round. I found it fascinating and alarming.

It was fascinating because the presenters brought out some of JavaScript's real strengths as a development language. One example: JavaScript supports functions as first-class objects, which allows smart developers to do all kinds of clever and useful things.

The session was alarming because it showed how seriously JavaScript is marred by arbitrary and counter-intuitive semantics.

Time after time the session leaders asked the audience to predict the results of simple JavaScript expressions. Time after time we predicted incorrectly (and sometimes the presenters did too).

The language seems to have been designed around the principle of most surprise.

People are suggesting that JavaScript is going to be the next big language, and it looks like a fair few people at Google are in that camp, including Steve Yegge. If so, heaven help us all. Power and unpredictability make a very dangerous combination.

Wednesday, July 18, 2007

miniSPA and jMock2

I spent yesterday at miniSPA. It's a selection of popular presentations from last year's SPA conference; we run it to attract new participants and new session leaders.

This year's miniSPA was brilliantly organised by Ivan Moore and Andy Moorley. Ivan is programme chair for SPA2008. Andy runs the conference, organises miniSPA, and generally keeps the whole world turning as it should.

I was at miniSPA in two capacities; as next year's conference chair, and as a presenter from last year. Nat Pryce and I ran a tutorial on Test Driven Development with jMock2, which went really well. We've offered to run it again in November at XPDay7.

I'm about to analyse the feedback and first impressions are very positive. SPA is a remarkable conference. If you haven't been, take a look at the website and think about going next year. If you want to go next year, think about submitting a session proposal.

Wednesday, July 11, 2007

Unit testing builders

Should one unit-test Builder classes? If so, how?

A well-designed application is a composed network of simple components. Many developers use one or more builder classes to construct applications.

Builders can get quite complex.

In many cases we want a single component to be shared by several client components. It's easy to get this wrong. If we do so, we may build a network in which all dependencies appear to be satisfied, but in which we have actually created multiple copies of a component when a single instance should be shared.

Builders often use lazy initialization, in which case they have state. This introduces additional opportunities for error.
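Here's a minimal Python sketch of the kind of builder I have in mind. All the class names (Database, Service, AppBuilder) are invented for illustration. The builder creates the shared component lazily and hands the same instance to each client; the comment marks the spot where the copy-instead-of-share mistake would creep in.

```python
class Database:
    """Stand-in for an expensive component that should exist only once."""


class Service:
    def __init__(self, database):
        self._database = database


class AppBuilder:
    def __init__(self):
        self._database = None          # state held for lazy initialization

    def database(self):
        # Create the database on first request, then reuse it.
        if self._database is None:
            self._database = Database()
        return self._database

    def orders(self):
        return Service(self.database())

    def billing(self):
        # The easy mistake would be Service(Database()) here: the
        # dependency is satisfied, but with a copy nobody else shares.
        return Service(self.database())


builder = AppBuilder()
print(builder.database() is builder.database())  # prints True
```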

Since it's easy to get builders wrong, we need to test them. End-to-end tests will usually tell us if we've built our application without all the necessary parts, but they don't pinpoint the problem, and they may not tell us if we've mistakenly introduced multiple objects instead of sharing a single one.

This suggests that we need to write unit tests for builders, but that's often difficult. The classes that are built will hide their implementation (as they should) so unit tests cannot easily verify that the right internal components are there inside them.

I'm coming to the conclusion that builder tests need to break encapsulation. That's normally a major code smell, but the nature of builders seems to leave no alternative.
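To make that concrete, here is a small Python sketch of a builder test that deliberately breaks encapsulation. The Cache, Client and Builder classes are hypothetical; the point is only that the test has to reach into the private _cache attribute to confirm that both clients were given the same instance.

```python
import unittest


class Cache:
    """Hypothetical component that should be shared."""


class Client:
    def __init__(self, cache):
        self._cache = cache            # private: not part of the public API


class Builder:
    def __init__(self):
        self._cache = None

    def _shared_cache(self):
        # Lazily create the one cache that every client should receive.
        if self._cache is None:
            self._cache = Cache()
        return self._cache

    def reader(self):
        return Client(self._shared_cache())

    def writer(self):
        return Client(self._shared_cache())


class BuilderTest(unittest.TestCase):
    def test_reader_and_writer_share_one_cache(self):
        builder = Builder()
        # Deliberately reaches into the private _cache attributes.
        self.assertIs(builder.reader()._cache, builder.writer()._cache)
```

The test is coupled to the clients' internals, which is exactly the smell mentioned above, but I can't see another way to detect an accidental second copy.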