How to Display Realtime Traffic Analytics

posted on September 2nd, 2009 by Greg Allard in Greg's Posts on Code Spatter
Presskit'n Hits

Users of Presskit’n have been asking for traffic statistics on their press releases, so I decided to get them the most recent data possible. At first I parsed the access log once a minute, but while testing I decided that wasn’t updating fast enough. I’ve gotten used to everything being instant on the internet and I didn’t want to wait a minute to see how many more views there were. In this post I show how I got it to update on every page load using Apache, Python, Django, and memcached.

Apache Access Logs

Apache is installed with rotatelogs, a program that can rotate the logs after they get too large. However, I wanted a few more features. Cronolog will update a symlink every time it creates a new log file, so you can always get at the most recent stats through the same path.

CustomLog "|/usr/bin/cronolog --symlink=/path/to/access /path/to/%Y/%m/%d/access.log" combined
ErrorLog "|/usr/bin/cronolog --symlink=/path/to/error /path/to/%Y/%m/%d/error.log"

The CustomLog and ErrorLog directives in Apache let you pipe output to a command, so I put the full path to cronolog followed by its parameters. The --symlink option points the named symlink at the most recent log created by cronolog. After the options comes the path to the log location, which can contain date formats. I decided to break mine up by day.
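To see what the date template turns into on a given day, here is a minimal Python sketch. It assumes cronolog's %Y, %m and %d specifiers expand the same way as the strftime ones of the same name. The helper name dated_log_path is made up for this example, and /path/to is the same placeholder used in the directives above.

```python
from datetime import date

# Same placeholder template as in the CustomLog directive above.
LOG_TEMPLATE = '/path/to/%Y/%m/%d/access.log'

def dated_log_path(day, template=LOG_TEMPLATE):
    """Return the path the template expands to for the given day."""
    return day.strftime(template)

print(dated_log_path(date(2009, 9, 2)))
```

The symlink saves you from doing this calculation whenever you only care about the current log.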

Piping Apache Log info to a Python Script

Apache can log the same request to more than one destination, so I wrote my own logging script in Python that inserts into memcached. Here is the extra line I added to the Apache config:

CustomLog "|/path/to/python /path/to/log_cache.py" combined

And this is log_cache.py:

#!/usr/bin/env python
 
import os
import sys
import re
from datetime import date
 
sys.path = ['/path/to/project',] + sys.path
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'
 
from django.core.cache import cache
 
r = re.compile(r'"GET (?P<url>\S+) ') # named group 'url' is read in log_data
 
def be_parent(child_pid):
    # os.waitpid returns a (pid, status) tuple; only the status matters here
    _, exit_status = os.waitpid(child_pid, 0)
    if exit_status: # if there's an error, restart the child.
        pid = os.fork()
        if not pid:
            be_child()
        else:
            be_parent(pid)
    return
 
def be_child():
    while True:
        line = sys.stdin.readline() # wait for apache log data
        if not line:
            return # without error code so everything stops
        log_data(line)
 
def log_data(data):
    page = r.search(data)
    if page:
        key = '%s%s' % (date.today(), page.group('url'))
        try:
            cache.incr(key)
        except ValueError:
            # add it to the cache for 24 hours
            cache.set(key, 1, 24*60*60)
    return
 
pid = os.fork()
if not pid:
    be_child()
else:
    be_parent(pid)

A blog post about using python to store access records in postgres helped me out a lot. The parent/child processing came from that and fixed a lot of problems I was having before.

The page views are being added to memcached (with cache.incr() which is new in django 1.1) for quick retrieval and the logs will still be created by cronolog so no data will be lost when the cache expires. Those logs are used in the next part.
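The key building and counting can be tried without Apache or memcached. This is a minimal sketch, not the production script: the sample log line is invented, a plain dict stands in for the cache, and the helper names cache_key and count_hit are hypothetical.

```python
import re
from datetime import date

# Named group "url" captures the path from the request part of the line.
REQUEST_RE = re.compile(r'"GET (?P<url>\S+) ')

# Invented sample line in Apache's combined format.
SAMPLE = ('127.0.0.1 - - [02/Sep/2009:10:15:32 -0400] '
          '"GET /releases/42/ HTTP/1.1" 200 5120 '
          '"http://example.com/" "Mozilla/5.0"')

def cache_key(line, day):
    """Build the per-day, per-URL key, or return None if no GET is found."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    return '%s%s' % (day, match.group('url'))

def count_hit(counts, key):
    """Increment a counter, creating it on first use (dict stand-in for the cache)."""
    counts[key] = counts.get(key, 0) + 1
    return counts[key]

counts = {}
key = cache_key(SAMPLE, date(2009, 9, 2))
count_hit(counts, key)
count_hit(counts, key)
print('%s %s' % (key, counts[key]))
```

Because the date is part of the key, each day starts with a fresh counter and yesterday's keys simply expire.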

Parsing the Logs

The hit counts will expire from the cache after 24 hours so I parse the logs once a day and put that information into my database. For this I wrote a django management command (I didn’t do a management command before because I wasn’t sure how it would handle the parent and child processes). This command is called by ./manage.py parse_log

from django.conf import settings
from django.contrib.contenttypes.models import ContentType
from django.core.cache import cache
from django.core.management.base import BaseCommand
from django.core.urlresolvers import resolve, Resolver404
import datetime
# found on page linked above
from apachelogs import ApacheLogFile
from app.models import Model_being_hit
from metrics.models import Hits
 
def save_log(alf, date):
    hits = {}
    # loop to sum hits
    for log_line in alf:
        request = log_line.request_line
        request_parts = request.split(' ')
        hits[request_parts[1]] = hits.get(request_parts[1], 0) + 1
    for page, views in hits.iteritems():
        try:
            view, args, kwargs = resolve(page)
            # I check kwargs for something only passed to one app
            if 'param' in kwargs:
                a = Model_being_hit.objects.get(id=kwargs['id'])
                try:
                    content_type = ContentType.objects.get_for_model(a)
                    hit = Hits.objects.get(
                        date=date,
                        content_type=content_type,
                        object_id=a.id,
                    )
                    hit.views = views
                except Hits.DoesNotExist:
                    hit = Hits(date=date, views=views, content_object=a)
                hit.save()
        except Exception:
            # something not in urls file like static files
            # (Exception rather than a bare except, so Ctrl-C still works)
            pass
 
class Command(BaseCommand):
    def handle(self, *args, **options):
        day = datetime.date.today()
        day = day - datetime.timedelta(days=1)
        alf = ApacheLogFile('%s/%s/%s/%s/access.log' % (
            settings.ACCESS_LOG_LOCATION,
            day.year,
            day.strftime('%m'), #month
            day.strftime('%d'), #day
        ))
        save_log(alf, day)

I use django.core.urlresolvers.resolve so that I can use my urls file and I don’t have to repeat myself.
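The summing loop in save_log doesn't depend on Django, so it can be tested by itself. A minimal sketch, assuming each request line looks like "GET /path HTTP/1.1": the function name sum_hits and the sample lines are made up, and malformed lines are skipped here instead of raising an IndexError.

```python
from collections import Counter

def sum_hits(request_lines):
    """Count how many times each path appears in the request lines."""
    hits = Counter()
    for request in request_lines:
        parts = request.split(' ')
        if len(parts) < 2:
            continue  # malformed request line; nothing to count
        hits[parts[1]] += 1
    return hits

sample = [
    'GET /releases/42/ HTTP/1.1',
    'GET /releases/42/ HTTP/1.1',
    'GET /static/site.css HTTP/1.1',
    '-',
]
hits = sum_hits(sample)
print(hits['/releases/42/'])
```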

Hits is a django model I created with a few fields for storing date and views. It uses the content types framework so that it can be tied to any of my django models.

from django.contrib.contenttypes        import generic
from django.contrib.contenttypes.models import ContentType
from django.db import models
 
class Hits(models.Model):
    date        = models.DateField()
    views       = models.IntegerField()
    # to add to any model
    content_type   = models.ForeignKey(ContentType)
    object_id      = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey('content_type', 'object_id')
 
    def __unicode__(self):
        return "%s hits on %s" % (self.views, self.date)

This was added to my cron with crontab -e

# every day at 00:01
1 0 * * * /path/to/python /path/to/manage.py parse_log > /dev/null

Displaying the Hits

On my models I added a couple methods that would look up the info in the cache or database.

    @property
    def hits_today(self):
        from datetime import date
        from django.core.cache import cache
        key = '%s%s' % (date.today(), self.get_absolute_url())
        return cache.get(key)
 
    @property
    def hits(self):
        from metrics.models import Hits
        from django.contrib.contenttypes.models import ContentType
        content_type = ContentType.objects.get_for_model(self)
        hits = Hits.objects.filter(
            content_type=content_type,
            object_id=self.id,
        ).order_by('-date')
        return hits

The hits_today method requires that you define get_absolute_url, which is useful in other places as well. @property is a decorator that makes it possible to access the data with object.hits and leave off the parentheses.

The hits method uses the content type framework again to look up the hits in the database.
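If you want one total for a template, the two values can be combined. This is a minimal sketch and not part of my models: total_views is a hypothetical helper, and it assumes the cache lookup gives None when a page has no hits yet today, which is what Django's cache.get returns for a missing key.

```python
def total_views(cached_today, stored_daily_views):
    """Add today's cached count to the per-day counts saved in the database."""
    today = cached_today or 0  # None (cache miss) counts as zero hits
    return today + sum(stored_daily_views)

print(total_views(None, [3, 4]))
print(total_views(5, [3, 4]))
```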

Just the Basics

There is a lot more that can be done with this. This barely touches the raw data available in the logs. A few ways I’ve already started improving this are to not count known bots as hits, check the referrer to see where traffic is coming from, and save the keywords used in search engines.
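As a starting point for the bot and referrer ideas, here is a minimal sketch of pulling those fields out of a combined-format line. It is not the code I run: the regular expression only covers the quoted tail of the line, the sample lines are invented, and BOT_MARKERS is a short illustrative list, not a complete one.

```python
import re

# Tail of a combined-format line: "request" status bytes "referrer" "user agent"
COMBINED_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative substrings only; a real list would be much longer.
BOT_MARKERS = ('bot', 'spider', 'crawler', 'slurp')

def parse_line(line):
    """Return a dict of the named fields, or None if the line doesn't match."""
    match = COMBINED_RE.search(line)
    return match.groupdict() if match else None

def is_bot(agent):
    """Guess from the user agent string whether the hit came from a bot."""
    agent = agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)

human = parse_line(
    '127.0.0.1 - - [02/Sep/2009:10:15:32 -0400] '
    '"GET /releases/42/ HTTP/1.1" 200 5120 '
    '"http://example.com/" "Mozilla/5.0 (X11; Linux i686) Firefox/3.5"'
)
bot = parse_line(
    '127.0.0.1 - - [02/Sep/2009:10:15:40 -0400] '
    '"GET /releases/42/ HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
print(human['referrer'])
print(is_bot(human['agent']))
print(is_bot(bot['agent']))
```

A check like is_bot could run in log_data before the counter is incremented, so bots never reach the cache.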

How I Made isthemarketdown.com

posted on January 21st, 2009 by Greg in Personal Projects

The idea came on inauguration day when someone was wondering how the stock market was doing. I had been to http://isobamapresident.com/ earlier in the day and the single serving site idea for the stock market came to mind.

It only took about 3 hours to create after I had the idea.

First 30 minutes: Find free data. Is there a feed or API?
Next 30: I couldn’t find anything useful so I decided to scrape a website using BeautifulSoup.
Next 60: I set up a Django project to put it all together. Created a model to hold the scraped data so I wouldn’t scrape on every page load. Created a simple view and laid out a template file and simple stylesheet.
Next 15: I found out how to add a method to manage.py so that I could call the scraping from the command line. python manage.py market_parse
Next 15: Debugging and adjusting. I hadn’t looked at the site yet, but there weren’t that many bugs.
Last 30 minutes: Added a crontab entry for every 20 minutes on weekdays from 9-5. It took a while to get going because I needed the full python path in the crontab.

So, is the market down?

Main Page Updater for Emergencies

posted on October 3rd, 2008 by Greg in CDWS Projects

At a large institution like UCF, it is good to have a plan for emergencies. I set up a simple form that will update the main page at http://ucf.edu in an emergency so that important information can be released as fast as possible.

The main page is an HTML file that is copied every few minutes from our database-driven application. This speeds up the website and cuts down on processor utilization considerably. A simple update to our cron job checks whether the site is in emergency mode and, if so, pulls from our separate emergency page instead. This emergency page is created with a simple form and simple template file.

I created this page updater to be reliable and simple so that there is little turnaround time from emergency situation to information available. The form edits files in the filesystem instead of using a database, which would add complexity. There is a place to update the important information. That info is then put into the pre-built template when the user hits preview. Once the user is satisfied with the way it looks, there is a button to enable/disable the page. It updates the status that the cron job looks for and the main page will change in under a minute.
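The check the cron job makes can be sketched in a few lines of Python. This is only an illustration of the idea, not the script that runs on the site: it assumes the status is a flag file that exists while emergency mode is on, and every file name below is hypothetical.

```python
import os
import shutil
import tempfile

FLAG_NAME = 'emergency.enabled'    # hypothetical status file toggled by the form
EMERGENCY_PAGE = 'emergency.html'  # hypothetical page built from the template
REGULAR_PAGE = 'generated.html'    # hypothetical copy from the database-driven app
MAIN_PAGE = 'index.html'           # hypothetical file served as the main page

def publish_main_page(site_dir):
    """Copy the right source page over the main page; return the source used."""
    in_emergency = os.path.exists(os.path.join(site_dir, FLAG_NAME))
    source = EMERGENCY_PAGE if in_emergency else REGULAR_PAGE
    shutil.copyfile(os.path.join(site_dir, source),
                    os.path.join(site_dir, MAIN_PAGE))
    return source

def write_file(site_dir, name, text):
    with open(os.path.join(site_dir, name), 'w') as handle:
        handle.write(text)

def read_main_page(site_dir):
    with open(os.path.join(site_dir, MAIN_PAGE)) as handle:
        return handle.read()

# Demo in a throwaway directory.
site_dir = tempfile.mkdtemp()
write_file(site_dir, REGULAR_PAGE, 'regular page')
write_file(site_dir, EMERGENCY_PAGE, 'emergency notice')

publish_main_page(site_dir)
normal_content = read_main_page(site_dir)

write_file(site_dir, FLAG_NAME, '')  # the form enables emergency mode
publish_main_page(site_dir)
emergency_content = read_main_page(site_dir)

shutil.rmtree(site_dir)
print(normal_content)
print(emergency_content)
```

Keeping the status in the filesystem means the switch still works if the database is the thing that is down.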