Saturday, September 22, 2007

Count blog subscriptions with this Python script

If you blog, you're probably keen to know the size of your readership. Many of your subscribers will use an RSS aggregation service, and these days there are lots of feed aggregators. How do you count the subscribers?

This short Python script does just that.

If you can access the httpd logs of the server that hosts your blog, you'll find that many aggregators send you a subscriber count when they check for new blog entries.

A log entry for a typical aggregator request looks like this:

65.214.44.29 - - [21/Sep/2007:04:12:34 +0100]
"GET /atom.xml HTTP/1.1" 200 0 "-"
"Bloglines/3.1 (http://www.bloglines.com; 10 subscribers)"


The Python script below selects the lines that contain the string "subscriber", extracts the IP address and subscriber count from each one, and calculates the total number of subscriptions.
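As a rough illustration of the extraction step, here is a minimal sketch that applies a regular expression to the sample log entry shown above. The names `sample` and `pattern` are illustrative assumptions, and the three display lines of the entry are joined into the single line they would occupy in a real log.

```python
import re

# The sample entry from above, joined into one log line.
sample = (
    '65.214.44.29 - - [21/Sep/2007:04:12:34 +0100] '
    '"GET /atom.xml HTTP/1.1" 200 0 "-" '
    '"Bloglines/3.1 (http://www.bloglines.com; 10 subscribers)"'
)

# Leading IP address, then anything, then a whitespace-delimited number
# immediately followed by the word "subscriber".
pattern = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}).*\s(\d+)\ssubscriber")

match = pattern.search(sample)
if match:
    ip_address, count = match.group(1), int(match.group(2))
    print(ip_address, count)
```

The first group captures the address at the start of the line, and the second captures the number that sits directly before "subscriber".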

The script is very simple, but there are a couple of wrinkles.

A single log file can include multiple requests from a given aggregator. The script puts the counts into a dictionary keyed by the aggregator's IP address, so the total includes only the latest value for each aggregator.
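The "latest value wins" behaviour can be sketched with a few made-up entries. The IP addresses and counts below are invented for illustration, and `latest_counts` is a hypothetical helper name.

```python
def latest_counts(entries):
    """Keep only the most recent count seen for each IP address."""
    latest = {}
    for ip_address, count in entries:
        # A later entry for the same address overwrites the earlier one.
        latest[ip_address] = int(count)
    return latest

# Invented data, in log order: the first aggregator polls twice.
entries = [
    ("192.0.2.1", "10"),
    ("192.0.2.2", "4"),
    ("192.0.2.1", "12"),
]

latest = latest_counts(entries)
total = sum(latest.values())
print(latest, total)
```

Here the total is 16 rather than 26, because the earlier count of 10 is replaced by 12 instead of being added to it.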

In theory, you could get a request that included the string 'subscriber' from some other source. (For example, someone might visit a URL for newsletter subscribers.)
The script has a fudge to treat such requests as if they came from a dummy aggregator with no subscribers.
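To see why the fallback matters, here is a small sketch using an invented request for a newsletter page. The URL, the address, and the helper name `parse` are assumptions made for illustration only.

```python
import re

pattern = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}).*\s(\d+)\ssubscriber")

def parse(line):
    """Return (ip, count), or a dummy entry when the line does not match."""
    match = pattern.search(line)
    if match:
        return match.group(1), int(match.group(2))
    # Dummy aggregator with no subscribers: adds nothing to the total.
    return "0.0.0.0", 0

# Invented log line: contains "subscriber" but reports no subscriber count.
visitor = (
    '192.0.2.7 - - [21/Sep/2007:05:00:00 +0100] '
    '"GET /newsletter/subscribers.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"'
)

print(parse(visitor))
```

Because every such line maps to the same dummy address with a count of zero, stray matches on the word "subscriber" leave the total unchanged.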

Here's the script:

# count blog subscribers from httpd log file
import re
import sys

ip = r"^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
anything = ".*"
count = r"\s(\d+)\s"
subscriber = "subscriber"
subscription_entry = re.compile(ip + anything + count + subscriber)

def ipc(line):
    # return the (ip, count) pair found in a log line
    match = subscription_entry.search(line)
    if match:
        return match.group(1, 2)
    else:
        # just in case the line contains 'subscriber'
        # but does not match the regular expression
        null_entry = ("0.0.0.0", 0)
        return null_entry

def total_counts(ips_and_counts):
    # there may be multiple log entries for a given aggregator
    # so we use a dictionary and just keep the last value
    # which is the latest one for each aggregator
    latest = {}
    for (ip, count) in ips_and_counts:
        latest[ip] = int(count)
    return sum(latest.values())

def countSubscribersInLogFile(filename):
    # the with block closes the log file even if parsing fails
    with open(filename) as log:
        counts = [ipc(line) for line in log if subscriber in line]
    return total_counts(counts)

if __name__ == "__main__":
    filename = sys.argv[1]
    details = (countSubscribersInLogFile(filename), filename)
    print("%d subscriptions in %s" % details)
