Parsing HTML with Python
Note: lifted from my old(er) blog: ctshryock.blogspot.com
Intro
Mark Pilgrim released an excellent book (also available as a free PDF online) called Dive Into Python, which I highly recommend. In chapter 8 he discusses retrieving, parsing, and then reconstructing an HTML page. I’m going to butcher that example, focusing on the first two steps and then capturing the specific data that I want.
From chapter 8, section 2:
SGMLParser. SGMLParser parses HTML into useful pieces, like
start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, it calls a method on
itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these
methods.
So let’s do that.
In this example we’ll be parsing http://google.com. We’ll be looking for the alt text of the main image (the giant Google logo).
Here is the start of our Python:
from sgmllib import SGMLParser
import urllib

class demoParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.imgTag = False  # True once we have seen an <img> tag
        self.alt = ""        # alt text of the image we are looking for

The class demoParser subclasses SGMLParser, which does the parsing and calls methods on itself as it goes. When it encounters a tag, it calls the matching method; when it encounters plain text, it calls handle_data(self, text). urllib is the heavy lifter: we use it to go and fetch the contents of a given address. SGMLParser doesn’t do anything useful with what it finds on its own; the hooks are just there for us to override. There is a method name reserved for each tag (<a> : start_a, <img> : start_img, <pre> : start_pre, etc.), as well as handle_data for the text inside these tags and unknown_starttag for any tag we haven’t written a method for. Each of these has a matching end method as well (end_a, end_pre, unknown_endtag, etc.).
self.imgTag is a flag we set when we come across an <img> tag in the body, and self.alt holds the alt text we’re looking for.
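To make the callback idea concrete, here is a minimal sketch of the same dispatch pattern. It assumes Python 3, where sgmllib no longer exists, so it swaps in html.parser.HTMLParser from the standard library. That class uses one generic handle_starttag(tag, attrs) hook instead of a start_<tag> method per tag. The class name EventLogger and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every callback the parser fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags
        if data.strip():
            self.events.append(("data", data.strip()))


logger = EventLogger()
logger.feed('<p>Hello <a href="/x">link</a></p>')
logger.close()
for event in logger.events:
    print(event)
```

Each printed tuple corresponds to one call the parser made on itself, in document order: start tag, text, nested start tag, and so on.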
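Here’s our logic (prepare to be amazed):
    def start_img(self, attrs):
        # attrs is a list of (name, value) pairs for this tag
        self.imgTag = True
        alt = [v for k, v in attrs if k == 'alt']
        if alt:
            self.alt = ''.join(alt)

    def end_img(self):
        # Clear the flag on a closing </img>, if the page has one
        self.imgTag = False

    def runDemo(self):
        # Fetch the page, hand the HTML to the parser, then clean up
        u = urllib.urlopen('http://google.com')
        self.feed(u.read())
        u.close()
        self.close()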
runDemo(self) is the method that kicks things off. It calls urllib.urlopen(), which fetches the given web page. self.feed() is a method defined on SGMLParser; it takes the HTML returned by u.read() and parses it. You close both for good measure (see chapter 8 of Mark’s Dive Into Python for an actual explanation). As the parser works through the HTML, it calls the methods above whenever it encounters a matching tag, and they perform the tasks we want.
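For anyone following along on Python 3, where sgmllib and urllib.urlopen are gone, the same flow can be sketched roughly like this. It is an approximation, not a drop-in port: it assumes html.parser.HTMLParser as the replacement parser, and it feeds a hard-coded sample string instead of fetching a live page so that it runs offline. AltParser, run_demo, and SAMPLE are illustrative names.

```python
from html.parser import HTMLParser


class AltParser(HTMLParser):
    """Illustrative Python 3 counterpart to demoParser."""

    def __init__(self):
        super().__init__()
        self.alt = ""  # alt text of the first <img> that has one

    def handle_starttag(self, tag, attrs):
        # One generic hook replaces start_img; filter on the tag name
        if tag == "img" and not self.alt:
            alt = [v for k, v in attrs if k == "alt" and v]
            if alt:
                self.alt = "".join(alt)

    def run_demo(self, html):
        # Feed the markup, flush the parser, and return what we found
        self.feed(html)
        self.close()
        return self.alt


# Made-up markup standing in for a fetched page
SAMPLE = '<html><body><img src="logo.gif" alt="Google"><p>Search</p></body></html>'

parser = AltParser()
print(parser.run_demo(SAMPLE))
```

With the sample markup this prints Google. To parse a real page you would pass in the decoded body of a response from urllib.request.urlopen() instead of SAMPLE.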
Running the example:
Python 2.4.3 (#1, Mar 30 2006, 11:02:15)
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import demo
>>> d = demo.demoParser()
>>> d.runDemo()
>>> print d.alt
Google
You can take this example and expand the basic concept (trapping text, attributes, etc. when certain criteria are met) to any common HTML element.
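As one possible extension, here is a sketch that uses the same flag trick to trap the text inside every <a> tag along with its href. It again assumes Python 3 and html.parser.HTMLParser rather than sgmllib, and LinkCollector and the sample markup are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Illustrative parser: collects (href, link text) pairs."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # flag: are we currently inside an <a>?
        self.href = None
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        # Only keep text while the flag is set
        if self.in_link:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.links.append((self.href, "".join(self.text).strip()))
            self.in_link = False


collector = LinkCollector()
collector.feed('<p>See <a href="/docs">the docs</a> or <a href="/faq">the <b>FAQ</b></a>.</p>')
collector.close()
print(collector.links)
```

The flag is set in the start handler, checked in handle_data, and cleared in the end handler, which is the same pattern self.imgTag was set up for above.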