Parsing HTML with Python

Intro

Mark Pilgrim released an excellent book, also available online as a free PDF, called Dive Into Python, which I highly recommend. In chapter 8 he discusses retrieving, parsing, and then reconstructing an HTML page. I'm going to butcher that example, focusing on the first two steps and then capturing the specific data I want.

From chapter 8, section 2:

SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods.

So let's do that.
In this example we'll be parsing http://google.com. We'll be looking for the text attached to the main image (the giant Google logo), which lives in the image's alt attribute.

Here is the start of our Python:

from sgmllib import SGMLParser
import urllib

class demoParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.imgTag = False
        self.alt = ""

The class demoParser subclasses SGMLParser, which does the actual parsing. When the parser encounters a tag it calls the matching method; when it encounters plain text, it calls handle_data(self, text). urllib is the heavy lifter: we use it to fetch the contents of a given address. SGMLParser's own handlers (handle_data, unknown_starttag, and so on) don't do anything by default; they're just there for us to override. Per-tag handlers are looked up by name (<a>: start_a, <img>: start_img, <pre>: start_pre, etc.), handle_data processes the text inside those tags, and unknown_starttag catches any tag without a handler of its own. Each start method can have a matching end method as well (end_a, end_pre, etc.).

self.imgTag is a flag we set when we come across an <img> tag in the body, and self.alt holds the text we're looking for. It starts out as an empty string so that it exists even if the page has no image with an alt attribute.
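
For anyone following along on Python 3, where sgmllib no longer exists, here is a rough sketch of that name-based dispatch. It assumes html.parser.HTMLParser from the standard library as a stand-in; TagDispatchParser and its events list are made-up names for illustration and are not part of sgmllib or Dive Into Python.

```python
from html.parser import HTMLParser


class TagDispatchParser(HTMLParser):
    """Hypothetical helper that mimics sgmllib's start_<tag>/end_<tag> naming."""

    def __init__(self):
        super().__init__()
        self.events = []  # record of every handler call, in order

    def handle_starttag(self, tag, attrs):
        # Look for a method named after the tag; fall back if there is none
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def handle_data(self, data):
        # Plain text between tags; whitespace-only chunks are skipped
        if data.strip():
            self.events.append(("data", data.strip()))

    def unknown_starttag(self, tag, attrs):
        self.events.append(("unknown_start", tag))

    def unknown_endtag(self, tag):
        self.events.append(("unknown_end", tag))

    # The only per-tag handlers this sketch defines
    def start_a(self, attrs):
        self.events.append(("start_a", dict(attrs)))

    def end_a(self):
        self.events.append(("end_a",))


parser = TagDispatchParser()
parser.feed('<p>See <a href="/docs">the docs</a></p>')
parser.close()
for event in parser.events:
    print(event)
```

The <a> tag reaches start_a and end_a because those methods exist, while <p> has no handler of its own and falls through to the unknown_* methods.
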
Here's our logic (prepare to be amazed):

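    # These methods go inside the demoParser class body.
    def start_img(self, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        self.imgTag = True
        alt = [v for k, v in attrs if k == 'alt']
        if alt:
            self.alt = ''.join(alt)

    def end_img(self):
        self.imgTag = False

    def runDemo(self):
        # Fetch the page, feed it to the parser, then close both
        u = urllib.urlopen('http://google.com')
        self.feed(u.read())
        u.close()
        self.close()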

runDemo(self) is the method that kicks things off. It calls urllib.urlopen(), which fetches the given web page. self.feed() is a method defined in SGMLParser; it takes the HTML returned by u.read() and parses it. You close both for good measure (see chapter 8 of Mark's Dive Into Python for an actual explanation). As it parses, the parser calls the methods above whenever it encounters a matching tag, and they perform the tasks we want. Running the example:

Python 2.4.3 (#1, Mar 30 2006, 11:02:15)
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import demo
>>> d = demo.demoParser()
>>> d.runDemo()
>>> print d.alt
Google
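
The same fetch, feed, close flow can be sketched for Python 3, with the caveat that this is a port under assumptions rather than the code above: sgmllib and urllib.urlopen are gone there, so it swaps in html.parser.HTMLParser and urllib.request.urlopen. ImgAltParser, parse_alts, fetch_alts, and the sample markup are hypothetical names and data, and the UTF-8 fallback is a guess for pages that don't declare a charset.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ImgAltParser(HTMLParser):
    """Collects the alt text of every <img> tag (hypothetical port of demoParser)."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased
        if tag == "img":
            self.alts.extend(v for k, v in attrs if k == "alt" and v)


def parse_alts(html):
    """Feed a string of HTML to the parser and return the alt texts found."""
    parser = ImgAltParser()
    parser.feed(html)
    parser.close()
    return parser.alts


def fetch_alts(url):
    """Fetch a page over the network and parse it (not called below)."""
    with urlopen(url) as response:
        # Assumption: fall back to UTF-8 when no charset is declared
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return parse_alts(html)


# Made-up markup standing in for a fetched page
sample = '<body><img src="/logo.gif" alt="Google" width="276"><img src="/x.gif"></body>'
alts = parse_alts(sample)
print(alts)
```

Keeping the parsing in parse_alts, separate from the network call in fetch_alts, means the parser can be tried on a string without touching the network.
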

You can take this example and expand on the basic concept (trapping text, attributes, etc. that match certain criteria) for any common HTML element.
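
As one possible extension, here is a sketch that uses a flag the way imgTag is meant to be used: switch it on at the start tag, collect text while it is set, and switch it off at the end tag. It is written for Python 3's html.parser.HTMLParser rather than sgmllib, and LinkTextParser, its attributes, and the sample markup are all invented for illustration.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Hypothetical example: collects (href, text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # flag: are we currently inside an <a>?
        self.href = None
        self.chunks = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.href = dict(attrs).get("href")
            self.chunks = []

    def handle_data(self, data):
        # Only trap text while the flag is set
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.links.append((self.href, "".join(self.chunks).strip()))
            self.in_link = False


parser = LinkTextParser()
parser.feed('<p>Try <a href="/images">Image <b>Search</b></a> or <a href="/maps">Maps</a>.</p>')
parser.close()
links = parser.links
print(links)
```

Text inside nested tags such as <b> is still collected, because the flag stays set until the closing </a> arrives.
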
