Using Beautiful Soup for Screen Scraping

I’ve been curious about screen scraping for some time. Then I heard about Beautiful Soup, a Python library that is great at parsing HTML. Since I’ve also been learning Python, now seemed like the perfect time to explore some scraping.

In the past I had some trouble using PHP to parse the official Magic: The Gathering site for new card info while working on my MTG card database. I didn’t spend much time trying to figure that out, but with Python I didn’t have a problem.

After copying Beautiful Soup onto my Python path, I started typing some Python at the command line.

# Python 2 with Beautiful Soup 3
from BeautifulSoup import BeautifulSoup as BSoup
import urllib
url  = 'http://ww2.wizards.com/gatherer/Index.aspx?setfilter=Shards%20of%20Alara&output=Spoiler'
html = urllib.urlopen(url).read()
soup = BSoup(html)
# Print the text of the first cell in every table row
for tr in soup.findAll('tr'):
    if tr.td:  # skip rows that have no <td>
        print tr.td.string

This prints all of the Magic card names on the page (and some other stuff). Here is another example: getting image URLs when you know the value of the id attribute on the img tags.

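# Reuses urllib and BSoup imported in the first snippet
url  = 'http://ww2.wizards.com/gatherer/CardDetails.aspx?&id=175000'
html = urllib.urlopen(url).read()
soup = BSoup(html)
# Find every tag whose id is '_imgCardImage' and print its src attribute
for img in soup.findAll(id='_imgCardImage'):
    print img['src']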
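
The two snippets above are Python 2 with Beautiful Soup 3. On Python 3, urllib.urlopen and the print statement are gone, so here is a rough sketch of the same two extractions using only the standard library's html.parser in place of Beautiful Soup. The CardPageParser class, the parse_cards helper, and the sample HTML are all made up for illustration. The real Gatherer markup may well differ, and nested tables are not handled.

```python
from html.parser import HTMLParser

# Invented sample markup standing in for the two pages fetched above.
SAMPLE_HTML = """
<table>
  <tr><td>Example Card One</td><td>Creature</td></tr>
  <tr><td>Example Card Two</td><td>Instant</td></tr>
  <tr></tr>
</table>
<img id="_imgCardImage" src="/cards/175000.jpg">
<img id="logo" src="/logo.png">
"""


class CardPageParser(HTMLParser):
    """Collect the text of the first <td> in each <tr>, plus the src of
    every <img> whose id equals wanted_id. (Hypothetical helper class.)"""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.first_cells = []
        self.image_urls = []
        self._cells_seen = 0          # number of <td> tags seen in the current row
        self._in_first_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self._cells_seen = 0
        elif tag == "td":
            self._cells_seen += 1
            if self._cells_seen == 1:
                self._in_first_cell = True
                self._buffer = []
        elif tag == "img" and attrs.get("id") == self.wanted_id:
            if "src" in attrs:
                self.image_urls.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "td" and self._in_first_cell:
            self.first_cells.append("".join(self._buffer).strip())
            self._in_first_cell = False

    def handle_data(self, data):
        if self._in_first_cell:
            self._buffer.append(data)


def parse_cards(html, wanted_id="_imgCardImage"):
    """Return (first_cell_texts, image_urls) for the given HTML string."""
    parser = CardPageParser(wanted_id)
    parser.feed(html)
    parser.close()
    return parser.first_cells, parser.image_urls


names, images = parse_cards(SAMPLE_HTML)
print(names)
print(images)
```

To run this against a live page, the HTML string could come from urllib.request.urlopen(url).read().decode() instead of the sample. With Beautiful Soup 4 the original calls carry over more directly: the import becomes from bs4 import BeautifulSoup, and findAll is spelled find_all.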

With a little more time I could get all the cards and their images and fill up my database. I just have to find the time now.

Update 12/28/08

I just heard about Scrapy. Now I need to try it out with a project.