http://memojo.com/~sgala/blog/25.atom ../images/favicon.ico Boxes and Glue Santiago Gala and his Symbols Santiago Gala sgala@apache.org http://memojo.com/~sgala/blog/ 2010-09-03T04:35:09+02:00 tag:memojo.com,2004:25 Python, Erlang, Map/Reduce
I found both python and map/reduce conspicuously missing in the recent discussion at ongoing about Wide Finder log parsing in Erlang. I have been very short of time, and thus I have been limited to read the discussion. But today

I found both python and map/reduce conspicuously missing in the recent discussion at ongoing about Wide Finder log parsing in Erlang. I have been very short of time, and thus I have been limited to read the discussion. But today I got a bit of time. Wanting to get into first principles first, and given that I don’t find Ruby particularly easy to read, I decided to rewrite it in python in “map/reduce” style:

import re
from collections import defaultdict

matches = (re.search(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ",line)
                   for line in file("o10k.ap"))
mapp    = (match.groups()[0] for match in matches if match)

count=defaultdict(int)
for page in mapp:
   count[page] +=1

for key in sorted(count.keys(), key=count.get):
   print "%40s = %s" % (key, count[key])

The program uses the data file that Tim Bray published, and tries to be as declarative as possible. It requires python 2.5 due to the use of collections.defaultdict. The first two comprehensions “map” the file, i.e., the first one maps the file into an iterator that yields either a match or None, and the second one filters the stream of matches, removing the nulls, and retrieves the group.

After this, we need to reduce. I have not (yet) found a declarative way to reduce the stream, but a short loop does its job. The use of defaultdict(int) is a nice technique to get a map that will default to 0 but it is easy to read.

This code is one line shorter than the ruby example that Tim Bray shows and, barring the ugly python syntax for regular expressions, I tend to find it more elegant than the original. Note that it will handle easity the quarter Gig file, as the file iterator reads one line at a time, and the rest of the processing takes place one line at a time. It will start barking when the number of entries (and thus the dictionary) becomes really big.

I’ll try to keep constructing, bottom up, the code for the problem, and discussing the trade offs and gotchas, as time permits. But I thought that leaving here this short snippet while I can find time is better than let it rot in my Tomboy notes. :)

2007-09-29T18:34:07+02:00
tag:memojo.com,2004:25-1191100172 http://problemsworthyofattack.blogspot.com/ 81.168.52.230 form Tom White Python, Erlang, Map/Reduce

I wrote a comment on Tim’s blog about using MapReduce for the Wide Finder project. BTW I like the way you use the MapReduce idiom to write a regular python program.

Cheers,

Tom

2007-09-30T00:09:32+02:00
tag:memojo.com,2004:25-1191104440 http://memojo.com/~nostromo/blog/ form https://api.screenname.aol.com/auth/openidServer Santiago Gala Python, Erlang, Map/Reduce

Welcome, Tom.

The typical use case for Erlang (or for Hadoop, as you pointed) would be more like “do what Tim says in 500 machines and aggregate the results”. I skimmed on your comment, now I went to the real paper. When I said missing I meant mostly Tim, and it was before his fifth post appeared.

The funny thing about the code is I was going to use the classic map and reduce functions, but I remembered that reduce is doomed, and reading Guido’s post I found:

filter(P, S) is almost always written clearer as [x for x in S if P(x)], and this has the huge advantage that the most common usages involve predicates that are comparisons, e.g. x==42, and defining a lambda for that just requires much more effort for the reader (plus the lambda is slower than the list comprehension). Even more so for map(F, S) which becomes [F(x) for x in S].

As an added benefit, the code is much more similar to Erlang’s comprehensions. Actually using [] will make them in-memory lists, but using () will turn them into generators.

2007-09-30T01:20:40+02:00
tag:memojo.com,2004:25-1191246603 http://meat.net/ chi-ra.chicagotrading.com form dbt Python, Erlang, Map/Reduce If you’re looking for cheap optimizations, initialize a regexp object outside the loop and save a lot of parsing. 2007-10-01T16:50:03+02:00 tag:memojo.com,2004:25-1191274312 http://blog.ianbicking.org cpe-66-108-80-238.nyc.res.rr.com form Ian Bicking Python, Erlang, Map/Reduce

defaultdict as seen in the wild is almost always just a “bag”, a set-like object that allows objects to be in it multiple times.  Using that structure would make the code a bit shorter.  I threw together a little bag implementation: [link]

You should be able to do:

b = Bag(map_lines(file))
for count, page in b.mostcommon():
  print ‘%40s=%s’ % (page, count)

This will probably slow the code down, but if Bag was written in C like defaultdict it wouldn’t.

2007-10-02T00:31:52+02:00
tag:memojo.com,2004:25-1191302325 http://memojo.com/~sgala/blog/ form https://api.screenname.aol.com/auth/openidServer Santiago Gala Python, Erlang, Map/Reduce

Ian Bicking:

defaultdict as seen in the wild is almost always just a “bag”, a set-like object that allows objects to be in it multiple times.

I remember the Bag from my old days as a Smalltalk programmer (Digitalk under MS-DOS, man I feel old) and I have always wondered why supposedly rich collection frameworks like the ones in java and python don’t have it. It is exactly what is needed for the reduce task here: counting.

Here I did what the original code was doing and what your code and Smalltalk uses as the implementation of Bag: use a dictionary to store the counts.

I even hesitated to use defaultdict to keep 2.5- compatibility, but now that major linux distros are moving to 2.5 I guess it is mainstream enough.

2007-10-02T08:18:45+02:00
tag:memojo.com,2004:25-1191357419 http://www.tbray.org/ongoing/When/200x/2007/10/01/WF-Roundup excerpt ongoing WF VII: Other Voices This is the seventh progress report from the Wide Finder Project , in which I report on the work of others and turn this into kind of a contest. I’ll be back later with one more whack at Erlang, and then probably move on to other Wide Finder... 2007-10-02T23:36:59+02:00 tag:memojo.com,2004:25-1191416537 http://memojo.com/~sgala/blog/ form https://api.screenname.aol.com/auth/openidServer Santiago Gala Python, Erlang, Map/Reduce

I have an obvious bug (that I have not yet fixed, but it is identified), in the routine that checks the logs for referrers. Plus a power supply burned tonight. The power supply has been replaced, but in the process of cleaning the entries I erased accidentally, and google’s cache is too old, the one proposing a short version by using a try: except: pass idiom to avoid having to check for matches vs. None. Please, do you have the code? I read it fast and can’t remember the name/URL.

In a sense,

def Filter(Iterator):
  while Something in Iterator:
    try: yield UnsafeOp(Something)
    except: pass

is a useful filter in the chain. Not unlike the regex one (yield all matches...)

2007-10-03T16:02:17+02:00
tag:memojo.com,2004:25-1191543483 http://mikewatkins.ca/tags/python/ S01060007e9f6985b.vc.shawcable.net form Michael Watkins Python, Erlang, Map/Reduce

This is what I found to be fastest, beating Ruby by a full second on my test data; actually a list comprehension was slightly faster but at increased resource use. Oddly, ruby is penalized for using F.L.'s filter optimization - almost twice as slow.

counts = defaultdict(int) 
for match in (res.group(1) for res in (ongoing_re.search(line) for line in open(sys.argv[1]) if "GET /ongoing/When" in line) if res):
  counts[match] +=1
2007-10-05T03:18:03+02:00
tag:memojo.com,2004:25-1191653416 memojo.com trackback Santiago Gala http://memojo.com/~sgala/blog/2007/10/06/Map-Reduce-Map-and-mind-twisting-in-programming Map/Reduce: Map and mind twisting in programming Abelson and Sussman, SICP: One extremely useful operation is to apply some transformation to each element in a list and generate the list of results. (...) We can abstract this general idea and capture it as a common pattern expressed as a higher-order... 2007-10-06T09:50:16+02:00 tag:memojo.com,2004:25-1192102020 http://www.dalkescientific.com/writings/diary/archive/2007/10/07/wide_finder.html excerpt Andrew Dalke's writings Wide Finder More notes on Wide Finder Background Fredrik Lundh’s notes and commentary, in Python What am I doing? mmap and re playing together using the buffer interface mxTextTools, with regexps via Martel mxTextTools is zippy fast! Making a faster standard... 2007-10-11T14:27:00+02:00 tag:memojo.com,2004:25-1195543269 http://www.sigs.de/blog/js/?p=32 excerpt JavaSPEKTRUM Blogosphäre Blogosphäre (aus JavaSPEKTRUM 06/07) Nachdem wir uns in den letzten Jahren schon daran gewöhnt hatten, dass die tatsächlichen oder empfundenen Vor- und Nachteile unterschiedlicher Programmiersprachen bestenfalls von akademischem Interesse zu sein schienen und man mit einer Sprache wie... 2007-11-20T09:21:09+01:00 tag:memojo.com,2004:25-1203575602 http://www.sigs.de/blog/js/?p=32 excerpt JavaSPEKTRUM Blogosphäre Blogosphäre (aus JavaSPEKTRUM 06/07) Nachdem wir uns in den letzten Jahren schon daran gewöhnt hatten, dass die tatsächlichen oder empfundenen Vor- und Nachteile unterschiedlicher Programmiersprachen bestenfalls von akademischem Interesse zu sein schienen und man mit einer Sprache wie... 2008-02-21T08:33:22+01:00 tag:memojo.com,2004:25-1203575611 http://www.sigs.de/blog/js/?p=32 excerpt JavaSPEKTRUM Blogosphäre Blogosphäre (aus JavaSPEKTRUM 06/07) Nachdem wir uns in den letzten Jahren schon daran gewöhnt hatten, dass die tatsächlichen oder empfundenen Vor- und Nachteile unterschiedlicher Programmiersprachen bestenfalls von akademischem Interesse zu sein schienen und man mit einer Sprache wie... 2008-02-21T08:33:31+01:00 tag:memojo.com,2004:25-1207141415 http://www.sigs.de/blog/js/?p=32 excerpt JavaSPEKTRUM Blogosphäre Blogosphäre (aus JavaSPEKTRUM 06/07) Nachdem wir uns in den letzten Jahren schon daran gewöhnt hatten, dass die tatsächlichen oder empfundenen Vor- und Nachteile unterschiedlicher Programmiersprachen bestenfalls von akademischem Interesse zu sein schienen und man mit einer Sprache wie... 2008-04-02T16:03:35+02:00 tag:memojo.com,2004:25-1207141461 http://www.sigs.de/blog/js/?p=32 excerpt JavaSPEKTRUM Blogosphäre Blogosphäre (aus JavaSPEKTRUM 06/07) Nachdem wir uns in den letzten Jahren schon daran gewöhnt hatten, dass die tatsächlichen oder empfundenen Vor- und Nachteile unterschiedlicher Programmiersprachen bestenfalls von akademischem Interesse zu sein schienen und man mit einer Sprache wie... 2008-04-02T16:04:21+02:00 tag:memojo.com,2004:25-1213111766 http://www.iwebthereforeiam.com/ cpe-66-108-204-239.nyc.res.rr.com form Hugh Brown Python, Erlang, Map/Reduce

Here are the optimizations I would make:

(1) Initialize the regexp (as others have said)
r = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

(2) Filter out data that can’t match
matches = (r.search(line) for line in file("o10k.ap") if line.startswith("GET /ongoing/When/"))

(3) Add psyco
import psyco
psyco.full()

2008-06-10T18:29:26+02:00
tag:memojo.com,2004:25-1236692116 http://streamhacker.wordpress.com/2009/03/03/mapfilter/ excerpt Stream Hacker mapfilter Have you ever mapped a list, then filtered it? Or filtered first, then mapped? Why not do it all in one pass with mapfilter? mapfilter? mapfilter is a function that combines the traditional map & filter of functional programming by using the... 2009-03-10T15:35:16+01:00 tag:memojo.com,2004:25-1282006306 http://www.mecosplay.com 222.95.240.191 form cosplay Python, Erlang, Map/Reduce It is the filter or reduce the size of the map? 2010-08-17T03:51:46+02:00 tag:memojo.com,2004:25-1282006306-1282006388 http://www.mecosplay.com 222.95.240.191 form cosplay Python, Erlang, Map/Reduce If you have no conception how to finish your ancient essay paper, you would have to use the custom writing service to learn a correct way to create free essays. 2010-08-17T03:53:08+02:00 tag:memojo.com,2004:25-1282118235 http://www.resumedocket.com 119.155.8.229 form Resume Writing Python, Erlang, Map/Reduce Outstanding presentation. Nice demonstration of your writing. Thank you for the time spent writing these articles. A pleasure to read you regularly and learn many things. 2010-08-18T10:57:15+02:00 tag:memojo.com,2004:25-1282620899 h3300723@yahoo.cn 221.6.35.67 form h3300723 Python, Erlang, Map/Reduce
Cosplay Costumes
Cosplay
Gothic dresses
Lolita dresses
school girl costume
Sexy Costumes
sexy nurses costume
Sports costume
Christmas Costumes
Holiday Costumes
Disney Costumes
Movie TV Costume
jewellery
dresses
shoes blog
jewellery
ed hardy blog
wedding sale
2010-08-24T06:34:59+02:00
tag:memojo.com,2004:25-1282634476 216.83.63.50 form anonymous Python, Erlang, Map/Reduce
cheap nfl jerseys
cheap mlb jerseys
wholesale mlb Jerseys
basketball jerseys
phillies jersey
R4 ds
jade jewellery
2010-08-24T10:21:16+02:00
tag:memojo.com,2004:25-1282634612 216.83.63.50 form mlb jerseys Python, Erlang, Map/Reduce
Ptfe Valve
Colts jerseys
NFL youth jerseys
tiffany ring
steelers jersey
2010-08-24T10:23:32+02:00
tag:memojo.com,2004:25-1282714222 http://www.jerseywholesales.com 61.190.27.42 form nfl jersey Python, Erlang, Map/Reduce
nhl jerseys wholesale
Roll forming machine
blackhawks jersey
jade jewellery
nike air max shoes
tiffany
nfl jerseys wholesale
Yankees jersey
2010-08-25T08:30:22+02:00
tag:memojo.com,2004:25-1282765352 http://www.a2zessays.com/ 180.149.217.100 form an essay Python, Erlang, Map/Reduce Thanks for sharing. Its really great. 2010-08-25T22:42:32+02:00 tag:memojo.com,2004:25-1282765393 http://www.papersassistance.com/ 180.149.217.100 form academic writing Python, Erlang, Map/Reduce Thank you for there share. 2010-08-25T22:43:13+02:00 tag:memojo.com,2004:25-1282866184 http://www.completeessays.com/ 180.149.217.100 form essay writing services Python, Erlang, Map/Reduce
Hi,
Nice post! You have worked hard on jotting down the essential information. Keep sharing the good work in future too.
2010-08-27T02:43:04+02:00
tag:memojo.com,2004:25-1282866220 http://www.dissertationsolution.com/ 180.149.217.100 form dissertation writing Python, Erlang, Map/Reduce Thanks for the info.... Very good. 2010-08-27T02:43:40+02:00 tag:memojo.com,2004:25-1282999684 http://www.ebusinesssubmit.com 221.120.234.2 form SEO Services Python, Erlang, Map/Reduce Nice post! You have worked down the necessary information. Keep the exchange of good work in the future. 2010-08-28T15:48:04+02:00 tag:memojo.com,2004:25-1283027587 http://www.bestonlinepapers.com/ 180.149.217.100 form custom written essays Python, Erlang, Map/Reduce
great post
good idea
thanks.
2010-08-28T23:33:07+02:00
tag:memojo.com,2004:25-1283231444 61.190.27.42 form anonymous Python, Erlang, Map/Reduce

China tour
mlb jerseys

cheap nfl jerseys
cheap mlb jerseys
wholesale mlb Jerseys

2010-08-31T08:10:44+02:00
tag:memojo.com,2004:25-1283231542 http://www.chinastarshop.com 61.190.27.42 form Nfl football jerseys Python, Erlang, Map/Reduce
Ptfe Valve
Colts jerseys
NFL youth jerseys
tiffany ring
2010-08-31T08:12:22+02:00
tag:memojo.com,2004:25-1283321575 http://www.fonntai.com 216.83.63.50 form Roll former Python, Erlang, Map/Reduce
Hello, I found your blog in a new directory of blogs. I dont know how your blog came up, must have been a typo, Your blog looks good. Have a nice day.
chinese jade
latin shoes
cheap mlb Jerseys
nfl jersey
cheap mlb jerseys
gucci jewelry
Colts jerseys
jade jewellery
Nhl hockey jerseys
2010-09-01T09:12:55+02:00
tag:memojo.com,2004:25-1283459778 http://www.transferpod.com/ 222.127.195.11 form Ipod to Computer Python, Erlang, Map/Reduce Thank you very much for this.. Very nice! :) 2010-09-02T23:36:18+02:00 tag:memojo.com,2004:25-1283475844 315258065@qq.com hn.kd.ny.adsl form Christian Louboutin Python, Erlang, Map/Reduce
baidu
[url=http://www.baidu.com]baidu[/url]
2010-09-03T04:04:04+02:00
tag:memojo.com,2004:25-1283475904 315258065@qq.com hn.kd.ny.adsl form Christian Louboutin Python, Erlang, Map/Reduce
Christian Louboutin
Christian Louboutin
hermes birkin
Hermes handbags
moncler
moncler
herve leger
herve leger
YOU CAN CHOOSE WHAT YOU LIKE!
2010-09-03T04:05:04+02:00