Archive of March 2008


Parsing HTML with Python

Note: lifted from my old(er) blog: ctshryock.blogspot.com

Intro

Mark Pilgrim released an excellent book, also available online as a free PDF, called Dive Into Python, which I highly recommend. In chapter 8 he discusses retrieving, parsing, and then reconstructing an HTML page. I’m going to butcher that example, focusing on the first two steps and then capturing the specific data that I want.

From chapter 8, section 2:

SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods.

So let’s do that.

In this example we’ll be parsing http://google.com. We’ll be looking for the alt text of the main image (the giant Google logo).

Here is the start of our Python:

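# sgmllib ships with Python 2 only
from sgmllib import SGMLParser
import urllib

class demoParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.imgTag = False  # True once we have come across an <img> tag
        self.alt = ""        # alt text of the last <img> that had one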

The class demoParser subclasses SGMLParser, which does the dispatching for us. When the parser encounters a tag it calls the matching method, if we have defined one. When it encounters plain text, it calls handle_data(self, text). urllib is the heavy lifter: we use it to fetch the contents of a given address. SGMLParser looks up a method named after each tag (<a> : start_a, <img> : start_img, <pre> : start_pre, etc.) and falls back to unknown_starttag for anything not defined. The generic hooks such as handle_data and unknown_starttag exist in SGMLParser but don’t do anything; they’re just there for us to override. Each start method can have a matching end method as well (end_a, end_pre, etc.).
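sgmllib is gone in Python 3, so here is a rough sketch of that dispatch idea rebuilt on top of html.parser.HTMLParser instead. This is not how SGMLParser is actually implemented; the class name TagDispatcher, the calls list, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class TagDispatcher(HTMLParser):
    """Mimics SGMLParser-style start_<tag>/end_<tag> dispatch (illustrative only)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.calls = []  # log of which handler fired, in order

    def handle_starttag(self, tag, attrs):
        # Look for a start_<tag> method; fall back to unknown_starttag.
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Same lookup for end_<tag>, with unknown_endtag as the fallback.
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def start_a(self, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.calls.append(("start_a", dict(attrs)))

    def end_a(self):
        self.calls.append(("end_a",))

    def unknown_starttag(self, tag, attrs):
        self.calls.append(("unknown_starttag", tag))

    def unknown_endtag(self, tag):
        self.calls.append(("unknown_endtag", tag))

    def handle_data(self, text):
        self.calls.append(("handle_data", text))


parser = TagDispatcher()
parser.feed('<p>see <a href="/x">this</a></p>')
parser.close()
for call in parser.calls:
    print(call)
```

The <p> tag has no start_p method, so it lands in unknown_starttag, while <a> goes to start_a and the text in between goes to handle_data.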

self.imgTag is a flag we set when we come across an <img> tag in the body, and self.alt holds the alt text we’re looking for.

Here’s our logic (prepare to be amazed):

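    # These methods belong inside the demoParser class.
    def start_img(self, attrs):
        self.imgTag = True
        # attrs is a list of (name, value) pairs; keep the alt values
        alt = [v for k, v in attrs if k == 'alt']
        if alt:
            self.alt = ''.join(alt)

    def end_img(self):
        self.imgTag = False

    def runDemo(self):
        # fetch the page, feed it to the parser, then close both
        u = urllib.urlopen('http://google.com')
        self.feed(u.read())
        u.close()
        self.close()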

runDemo(self) is the method that kicks things off. It calls urllib.urlopen(), which fetches the given web page. self.feed() is a method defined in SGMLParser that takes the HTML read from the result of urllib.urlopen() and parses it. You close both for good measure (see Mark’s Dive Into Python, chapter 8 specifically, for an actual explanation). As it parses, the parser calls the methods above whenever it encounters a matching tag, and they perform the tasks we want. Running the example:

Python 2.4.3 (#1, Mar 30 2006, 11:02:15)
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import demo
>>> d = demo.demoParser()
>>> d.runDemo()
>>> print d.alt
Google
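If you only have Python 3 around, sgmllib won’t import (it was removed) and urllib.urlopen now lives at urllib.request.urlopen. Here is a minimal sketch of the same alt-text capture using html.parser.HTMLParser as a stand-in. To keep it runnable offline it parses a hard-coded string rather than fetching anything; AltParser and SAMPLE_HTML are names I made up, and the markup is only a guess at what such a page might contain.

```python
from html.parser import HTMLParser


class AltParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.alt = ""  # alt text of the last <img> that had one

    def handle_starttag(self, tag, attrs):
        # HTMLParser has one generic start handler, so check the tag name here
        if tag == "img":
            alt = [v for k, v in attrs if k == "alt" and v is not None]
            if alt:
                self.alt = "".join(alt)


# Hypothetical markup standing in for a fetched page
SAMPLE_HTML = '<html><body><img src="logo.gif" alt="Google" width="276"></body></html>'

p = AltParser()
p.feed(SAMPLE_HTML)
p.close()
print(p.alt)
```

With the sample string above this prints Google, same as the Python 2 session.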

You can take this example and expand the basic concept (trapping text, etc., given certain criteria) to any common HTML element.
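As one possible expansion, here is a sketch that uses the same flag trick to trap the text inside <a> tags. It is written against Python 3’s html.parser.HTMLParser rather than sgmllib, and the names LinkTextParser, in_link, and links, plus the sample HTML, are invented for the example.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_link = False  # flag: are we currently inside an <a> tag?
        self.links = []       # collected text of each link, in page order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.links.append("")  # start a fresh entry for this link

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, text):
        # Only keep text seen while the flag is set
        if self.in_link:
            self.links[-1] += text


p = LinkTextParser()
p.feed('<p>Try <a href="/a">Images</a> or <a href="/b">Ma<b>ps</b></a>.</p>')
p.close()
print(p.links)
```

This prints ['Images', 'Maps']. Text outside the links is ignored, and text inside a nested tag such as <b> is still captured because the flag stays set until the closing </a>.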

March 7th, 2008

mikeleegoogle.png

mike lee is awesome. possibly my new programmer idol

March 6th, 2008

iphone sdk

i would love to twitter about the iphone sdk but i can't because, like during most apple events, twitter has shat itself

March 6th, 2008
