Python, Erlang, Map/Reduce
I found both python and map/reduce conspicuously missing in the recent discussion at ongoing about Wide Finder log parsing in Erlang. I have been very short of time, and thus I have been limited to read the discussion. But today I got a bit of time. Wanting to get into first principles first, and given that I don’t find Ruby particularly easy to read, I decided to rewrite it in python in “map/reduce” style:
import re
from collections import defaultdict
matches = (re.search(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ",line)
for line in file("o10k.ap"))
mapp = (match.groups()[0] for match in matches if match)
count=defaultdict(int)
for page in mapp:
count[page] +=1
for key in sorted(count.keys(), key=count.get):
print "%40s = %s" % (key, count[key])
The program uses the data file that Tim Bray published, and tries to be as declarative as possible. It requires python 2.5 due to the use of collections.defaultdict. The first two comprehensions “map” the file, i.e., the first one maps the file into an iterator that yields either a match or None, and the second one filters the stream of matches, removing the nulls, and retrieves the group.
After this, we need to reduce. I have not (yet) found a declarative way to reduce the stream, but a short loop does its job. The use of defaultdict(int) is a nice technique to get a map that will default to 0 but it is easy to read.
This code is one line shorter than the ruby example that Tim Bray shows and, barring the ugly python syntax for regular expressions, I tend to find it more elegant than the original. Note that it will handle easity the quarter Gig file, as the file iterator reads one line at a time, and the rest of the processing takes place one line at a time. It will start barking when the number of entries (and thus the dictionary) becomes really big.
I’ll try to keep constructing, bottom up, the code for the problem, and discussing the trade offs and gotchas, as time permits. But I thought that leaving here this short snippet while I can find time is better than let it rot in my Tomboy notes. :)
Welcome, Tom.
The typical use case for Erlang (or for Hadoop, as you pointed) would be more like “do what Tim says in 500 machines and aggregate the results”. I skimmed on your comment, now I went to the real paper. When I said missing I meant mostly Tim, and it was before his fifth post appeared.
The funny thing about the code is I was going to use the classic map and reduce functions, but I remembered that reduce is doomed, and reading Guido’s post I found:
filter(P, S) is almost always written clearer as [x for x in S if P(x)], and this has the huge advantage that the most common usages involve predicates that are comparisons, e.g. x==42, and defining a lambda for that just requires much more effort for the reader (plus the lambda is slower than the list comprehension). Even more so for map(F, S) which becomes [F(x) for x in S].
As an added benefit, the code is much more similar to Erlang’s comprehensions. Actually using [] will make them in-memory lists, but using () will turn them into generators.
Posted by Santiago Gala atIf you’re looking for cheap optimizations, initialize a regexp object outside the loop and save a lot of parsing.
Posted by dbt at
defaultdict as seen in the wild is almost always just a “bag”, a set-like object that allows objects to be in it multiple times. Using that structure would make the code a bit shorter. I threw together a little bag implementation: [link]
You should be able to do:
b = Bag(map_lines(file))
for count, page in b.mostcommon():
print ‘%40s=%s’ % (page, count)
This will probably slow the code down, but if Bag was written in C like defaultdict it wouldn’t.
Posted by Ian Bicking atdefaultdict as seen in the wild is almost always just a “bag”, a set-like object that allows objects to be in it multiple times.
I remember the Bag from my old days as a Smalltalk programmer (Digitalk under MS-DOS, man I feel old) and I have always wondered why supposedly rich collection frameworks like the ones in java and python don’t have it. It is exactly what is needed for the reduce task here: counting.
Here I did what the original code was doing and what your code and Smalltalk uses as the implementation of Bag: use a dictionary to store the counts.
I even hesitated to use defaultdict to keep 2.5- compatibility, but now that major linux distros are moving to 2.5 I guess it is mainstream enough.
Posted by Santiago Gala atWF VII: Other Voices
This is the seventh progress report from the Wide Finder Project , in which I report on the work of others and turn this into kind of a contest. I’ll be back later with one more whack at Erlang, and then probably move on to other Wide Finder...Excerpt from ongoing at
I have an obvious bug (that I have not yet fixed, but it is identified), in the routine that checks the logs for referrers. Plus a power supply burned tonight. The power supply has been replaced, but in the process of cleaning the entries I erased accidentally, and google’s cache is too old, the one proposing a short version by using a try: except: pass idiom to avoid having to check for matches vs. None. Please, do you have the code? I read it fast and can’t remember the name/URL.
In a sense,
def Filter(Iterator):
while Something in Iterator:
try: yield UnsafeOp(Something)
except: pass
is a useful filter in the chain. Not unlike the regex one (yield all matches...)
Posted by Santiago Gala atThis is what I found to be fastest, beating Ruby by a full second on my test data; actually a list comprehension was slightly faster but at increased resource use. Oddly, ruby is penalized for using F.L.'s filter optimization - almost twice as slow.
counts = defaultdict(int) for match in (res.group(1) for res in (ongoing_re.search(line) for line in open(sys.argv[1]) if "GET /ongoing/When" in line) if res): counts[match] +=1Posted by Michael Watkins at
Map/Reduce: Map and mind twisting in programming
Abelson and Sussman, SICP: One extremely useful operation is to apply some transformation to each element in a list and generate the list of results. (...) We can abstract this general idea and capture it as a common pattern expressed as a higher-order... [more]Trackback from Santiago Gala at
Wide Finder
More notes on Wide Finder Background Fredrik Lundh’s notes and commentary, in Python What am I doing? mmap and re playing together using the buffer interface mxTextTools, with regexps via Martel mxTextTools is zippy fast! Making a faster standard...Excerpt from Andrew Dalke's writings at
Blogosphäre (aus JavaSPEKTRUM 06/07)
Nachdem wir uns in den letzten Jahren schon daran gewöhnt hatten, dass die tatsächlichen oder empfundenen Vor- und Nachteile unterschiedlicher Programmiersprachen bestenfalls von akademischem Interesse zu sein schienen und man mit einer Sprache wie...Excerpt from JavaSPEKTRUM Blogosphäre at
Blogosphäre (aus JavaSPEKTRUM 06/07)
Nachdem wir uns in den letzten Jahren schon daran gewöhnt hatten, dass die tatsächlichen oder empfundenen Vor- und Nachteile unterschiedlicher Programmiersprachen bestenfalls von akademischem Interesse zu sein schienen und man mit einer Sprache wie...Excerpt from JavaSPEKTRUM Blogosphäre at
Blogosphäre (aus JavaSPEKTRUM 06/07)
Nachdem wir uns in den letzten Jahren schon daran gewöhnt hatten, dass die tatsächlichen oder empfundenen Vor- und Nachteile unterschiedlicher Programmiersprachen bestenfalls von akademischem Interesse zu sein schienen und man mit einer Sprache wie...Excerpt from JavaSPEKTRUM Blogosphäre at
Blogosphäre (aus JavaSPEKTRUM 06/07)
Nachdem wir uns in den letzten Jahren schon daran gewöhnt hatten, dass die tatsächlichen oder empfundenen Vor- und Nachteile unterschiedlicher Programmiersprachen bestenfalls von akademischem Interesse zu sein schienen und man mit einer Sprache wie...Excerpt from JavaSPEKTRUM Blogosphäre at
Blogosphäre (aus JavaSPEKTRUM 06/07)
Nachdem wir uns in den letzten Jahren schon daran gewöhnt hatten, dass die tatsächlichen oder empfundenen Vor- und Nachteile unterschiedlicher Programmiersprachen bestenfalls von akademischem Interesse zu sein schienen und man mit einer Sprache wie...Excerpt from JavaSPEKTRUM Blogosphäre at
Here are the optimizations I would make:
(1) Initialize the regexp (as others have said)
r = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")
(2) Filter out data that can’t match
matches = (r.search(line) for line in file("o10k.ap") if line.startswith("GET /ongoing/When/"))
(3) Add psyco
import psyco
psyco.full()
mapfilter
Have you ever mapped a list, then filtered it? Or filtered first, then mapped? Why not do it all in one pass with mapfilter? mapfilter? mapfilter is a function that combines the traditional map & filter of functional programming by using the...Excerpt from Stream Hacker at

I wrote a comment on Tim’s blog about using MapReduce for the Wide Finder project. BTW I like the way you use the MapReduce idiom to write a regular python program.
Cheers,
Tom
Posted by Tom White at