How I Made isthemarketdown.com

posted on January 21st, 2009 by Greg in Personal Projects

The idea came on Inauguration Day, when someone wondered how the stock market was doing. I had visited http://isobamapresident.com/ earlier in the day, and a single-serving site for the stock market came to mind.

It only took about 3 hours to create after I had the idea.

First 30 minutes: Find free data. Is there a feed or API?
Next 30: I couldn’t find anything useful, so I decided to scrape a website using BeautifulSoup.
Next 60: I set up a Django project to put it all together. I created a model to hold the scraped data so I wouldn’t have to re-scrape on every page load, wrote a simple view, and laid out a template file and a simple stylesheet (a rough sketch of the model and view follows this list).
Next 15: I figured out how to add a custom management command so I could run the scraper from the command line: python manage.py market_parse (also sketched below).
Next 15: Debugging and adjusting. I hadn’t looked at the site yet, but there weren’t that many bugs.
Last 30 minutes: Add a crontab entry for every 20 minutes on weekdays from 9 to 5 (example below). It took a while to get going because cron needed the full path to python.
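
The model and view were only a few lines each. Something like this sketch (the app, model, and field names are illustrative, not the exact code):

# models.py -- field names are illustrative
from django.db import models

class MarketSnapshot(models.Model):
    value = models.DecimalField(max_digits=12, decimal_places=2)
    change = models.DecimalField(max_digits=8, decimal_places=2)
    scraped_at = models.DateTimeField(auto_now_add=True)

# views.py -- the view just reads the newest row instead of scraping
from django.shortcuts import render_to_response
from market.models import MarketSnapshot  # 'market' is a stand-in app name

def index(request):
    snapshot = MarketSnapshot.objects.latest('scraped_at')
    return render_to_response('index.html', {
        'snapshot': snapshot,
        'is_down': snapshot.change < 0,
    })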
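
The management command is just a file at market/management/commands/market_parse.py (plus empty __init__.py files so Django picks it up). A sketch, with the URL and selectors as placeholders since the real site had its own markup:

# market/management/commands/market_parse.py
import urllib
from BeautifulSoup import BeautifulSoup as BSoup
from django.core.management.base import BaseCommand
from market.models import MarketSnapshot

class Command(BaseCommand):
    help = "Scrape the current market numbers and store a snapshot."

    def handle(self, *args, **options):
        # Placeholder URL and selectors -- the page I scraped was different.
        html = urllib.urlopen('http://example.com/market').read()
        soup = BSoup(html)
        value = soup.find('span', id='index-value').string
        change = soup.find('span', id='index-change').string
        MarketSnapshot.objects.create(value=value, change=change)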
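
And the crontab entry looked something like this (the paths are examples; the fix was spelling out the full path to python):

# every 20 minutes, weekdays, during market hours
*/20 9-17 * * 1-5 /usr/bin/python /home/greg/market/manage.py market_parse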

So, is the market down?

Using Beautiful Soup for Screen Scraping

posted on November 12th, 2008 by Greg in Personal Projects

I’ve been curious to learn more about screen scraping for some time. Then I heard about Beautiful Soup, a Python library that is great for parsing HTML. Since I’ve also been learning Python, I thought now was the perfect time to explore some scraping.

In the past I had some trouble using PHP to parse the official Magic: The Gathering site for new card info while working on my MTG card database. I didn’t spend much time trying to figure that out, but with Python I didn’t have a problem.

After copying Beautiful Soup onto my Python path, I started typing some Python at the interactive prompt.

from BeautifulSoup import BeautifulSoup as BSoup
import urllib

# Grab the Shards of Alara spoiler page and parse it.
url = 'http://ww2.wizards.com/gatherer/Index.aspx?setfilter=Shards%20of%20Alara&output=Spoiler'
html = urllib.urlopen(url).read()
soup = BSoup(html)

# Each card row is a <tr> whose first <td> holds the card name.
for tr in soup.findAll('tr'):
    if tr.td:
        print tr.td.string

This outputs all of the Magic card names on the page (and some other stuff). Here is another example: grabbing image URLs when you know the value of the id attribute on the img tag.

# Fetch a card detail page the same way.
url = 'http://ww2.wizards.com/gatherer/CardDetails.aspx?&id=175000'
html = urllib.urlopen(url).read()
soup = BSoup(html)

# findAll can also match on attributes, so look the img tag up by its id.
for img in soup.findAll(id='_imgCardImage'):
    print img['src']

With a little more time I could get all the cards and their images and fill up my database; a rough sketch of how that might go is below. I just have to find the time now.
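
This sketch assumes each card row’s first cell links to its CardDetails page, which I haven’t checked against the real markup:

import urllib
import urlparse
from BeautifulSoup import BeautifulSoup as BSoup

spoiler_url = 'http://ww2.wizards.com/gatherer/Index.aspx?setfilter=Shards%20of%20Alara&output=Spoiler'
soup = BSoup(urllib.urlopen(spoiler_url).read())

cards = {}
for tr in soup.findAll('tr'):
    # Assumes the first cell of a card row links to CardDetails.aspx.
    if tr.td and tr.td.a and 'CardDetails' in tr.td.a.get('href', ''):
        name = tr.td.a.string
        detail_url = urlparse.urljoin(spoiler_url, tr.td.a['href'])
        detail = BSoup(urllib.urlopen(detail_url).read())
        img = detail.find('img', id='_imgCardImage')
        if img:
            cards[name] = urlparse.urljoin(detail_url, img['src'])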

Update 12/28/08

I just heard about Scrapy. Now I need to try it out with a project.