Using Beautiful Soup for Screen Scraping
I’ve been curious about screen scraping for some time. Then I heard about Beautiful Soup, a Python library that is great for parsing HTML. Since I’ve also been learning Python, now seemed like the perfect time to explore some scraping.
In the past I had trouble using PHP to parse the official Magic: The Gathering site for new card info while working on my MTG card database. I didn’t spend much time trying to figure that out, but with Python I didn’t have a problem.
After copying Beautiful Soup onto my Python path, I started typing some Python at the command line.
    from BeautifulSoup import BeautifulSoup as BSoup
    import urllib

    url = 'http://ww2.wizards.com/gatherer/Index.aspx?setfilter=Shards%20of%20Alara&output=Spoiler'
    html = urllib.urlopen(url).read()
    soup = BSoup(html)
    for tr in soup.fetch('tr'):
        if tr.td:
            print tr.td.string
This prints the text of the first cell in every table row, which gives all of the Magic card names on the page (and some other stuff). Here is another example, continuing the same session: getting image URLs when you know the value of the id attribute on the img tags.
    url = 'http://ww2.wizards.com/gatherer/CardDetails.aspx?&id=175000'
    html = urllib.urlopen(url).read()
    soup = BSoup(html)
    for img in soup.findAll(id='_imgCardImage'):
        print img['src']
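The snippets above are Python 2 with the old BeautifulSoup 3 module. For anyone on Python 3, here is a rough sketch of the same two lookups using the newer bs4 package (installed with `pip install beautifulsoup4`). To keep it runnable without hitting the network it parses a small made-up HTML string instead of the Gatherer pages. The sample markup, the card names in it, the image path, and the function names `first_cell_texts` and `image_urls` are all placeholders of mine, not anything from the real site.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page (assumption, not real Gatherer HTML).
SAMPLE_HTML = """
<table>
  <tr><td>Example Card One</td><td>Creature</td></tr>
  <tr><td>Example Card Two</td><td>Instant</td></tr>
  <tr><th>Header row with no td</th></tr>
</table>
<img id="_imgCardImage" src="/images/example-card.jpg">
"""


def first_cell_texts(html):
    """Return the text of the first <td> in every <tr> that has one."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for tr in soup.find_all("tr"):  # find_all is the bs4 name for fetch/findAll
        if tr.td is not None:       # skip rows without any <td>
            texts.append(tr.td.get_text(strip=True))
    return texts


def image_urls(html, img_id):
    """Return the src of every <img> whose id attribute equals img_id."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.find_all("img", id=img_id)]


print(first_cell_texts(SAMPLE_HTML))
print(image_urls(SAMPLE_HTML, "_imgCardImage"))
```

To run this against a live page, the HTML string would come from something like `urllib.request.urlopen(url).read()` in place of `SAMPLE_HTML`. One small difference from my original loop: `get_text()` returns the combined text of a cell, whereas `.string` is None when the cell contains more than one child node.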
With a little more time I could get all the cards and their images and fill up my database. I just have to find the time now.
Update 12/28/8
I just heard about Scrapy. Now I need to try it out with a project.