Scraping google.com
Mon Mar 05, 2007 · 115 words

I don't know when this happened, but it seems you can't scrape (parse) google.com as easily as before. At least with Python, a simple

import urllib

# Fetch the results page with urllib's default User-Agent
s = urllib.urlopen('http://www.google.com/search?q=define%3Aesoteric')
r = s.read()
print r

will give you this notice instead:

/snip/ Your client does not have permission to get URL /search?q=define%3Aesoteric from this server /snip/

And here's me not understanding why my regexes are not matching. :) Apparently, you should at least send some kind of User-Agent header. In Python you can do this easily by subclassing URLopener and setting its version attribute to something, like:

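# urllib sends the opener's "version" attribute as the User-Agent header
class MyOpener(urllib.URLopener):
    version = "InternetExploiter/666"
# urllib.urlopen() uses this module-level opener once it is set
urllib._urlopener = MyOpener()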
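
For anyone on Python 3, where urllib.urlopen and URLopener no longer exist in that form, the same idea can be sketched with urllib.request by attaching a User-Agent header to a Request object. This is only a rough equivalent of the opener subclass above, not a tested recipe: the names build_request and fetch and the "MyScraper/0.1" string are made up for illustration, and nothing here guarantees that Google will accept any particular User-Agent.

```python
import urllib.request

# Hypothetical User-Agent string, chosen only for illustration.
USER_AGENT = "MyScraper/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch a URL with the custom User-Agent and return the body as text."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch('http://www.google.com/search?q=define%3Aesoteric') would then send the request with that header; build_request is split out so the header can be inspected without touching the network.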

And I'm really starting to love TextMate. Just type your little Python script and hit Cmd-R. Brilliant.

