Author Archive

How to Display Realtime Traffic Analytics

posted on September 2nd, 2009 by Greg Allard in Greg's Posts on Code Spatter
Presskit'n Hits

Users of Presskit’n have been asking for traffic statistics on their press releases, so I decided to get them the most recent data possible. At first I was parsing the access log once a minute, but while testing that I decided it wasn’t updating fast enough. I’ve gotten used to everything being instant on the internet and I didn’t want to wait a minute to see how many more views there were. In this post I show how I got it to update on page load using Apache, Python, Django, and memcached.

Apache Access Logs

Apache is installed with rotatelogs, a program that can rotate the logs after they get too large. However, I wanted a few more features, so I used cronolog instead. Cronolog will update a symlink every time it creates a new log file, so the most recent log is always available at a known path.

CustomLog "|/usr/bin/cronolog --symlink=/path/to/access /path/to/%Y/%m/%d/access.log" combined
ErrorLog "|/usr/bin/cronolog --symlink=/path/to/error /path/to/%Y/%m/%d/error.log"

The CustomLog and ErrorLog directives in Apache will let you pipe output to a command, so I put the full path to cronolog and then specified its parameters. --symlink will point the named symlink to the most recent log created with cronolog. After the options, the path to the log location is specified, and date formats can be used in it. I decided to break mine up by day.
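To see what the date formats in that path expand to, here is a small Python sketch. Cronolog's template uses strftime-style specifiers, so Python's strftime should give the same result for %Y, %m, and %d. The helper name log_path_for is made up for this example, and /path/to is the same placeholder used in the config above.

```python
from datetime import date

# Same template as in the CustomLog line; /path/to is a placeholder.
TEMPLATE = '/path/to/%Y/%m/%d/access.log'

def log_path_for(day):
    """Return the log file path that the template expands to for `day`."""
    return day.strftime(TEMPLATE)

print(log_path_for(date(2009, 9, 2)))  # /path/to/2009/09/02/access.log
```

The month and day are zero-padded, which matters later when building the path to a specific day's log.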

Piping Apache Log info to a Python Script

Apache can have multiple log locations and will write each request to all of them, so I wrote my own logging script in Python that inserts into memcached. Here is the extra line I added to the Apache config:

CustomLog "|/path/to/python /path/to/log_cache.py" combined

And this is log_cache.py:

#!/usr/bin/env python
 
import os
import sys
import re
from datetime import date
 
sys.path = ['/path/to/project',] + sys.path
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'
 
from django.core.cache import cache
 
r = re.compile(r'"GET (?P<url>\S+) ')
 
def be_parent(child_pid):
    # os.waitpid returns a (pid, status) tuple; only the status matters here
    _, exit_status = os.waitpid(child_pid, 0)
    if exit_status: # if there's an error, restart the child.
        pid = os.fork()
        if not pid:
            be_child()
        else:
            be_parent(pid)
    return
 
def be_child():
    while True:
        line = sys.stdin.readline() # wait for apache log data
        if not line:
            return # without error code so everything stops
        log_data(line)
 
def log_data(data):
    page = r.search(data)
    if page:
        key = '%s%s' % (date.today(), page.group('url'))
        try:
            cache.incr(key)
        except ValueError:
            # add it to the cache for 24 hours
            cache.set(key, 1, 24*60*60)
    return
 
pid = os.fork()
if not pid:
    be_child()
else:
    be_parent(pid)

A blog post about using Python to store access records in Postgres helped me out a lot. The parent/child processing came from that and fixed a lot of problems I was having before.

The page views are added to memcached with cache.incr() (new in Django 1.1) for quick retrieval. The logs are still created by cronolog, so no data is lost when the cache entries expire. Those logs are used in the next part.
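The increment-or-set logic in log_data can be tried without Apache or memcached. This is only a sketch: FakeCache is a made-up in-memory stand-in that copies one behavior of Django's cache (incr raises ValueError for a missing key), and count_hit and the sample log line are invented for the example.

```python
import re
from datetime import date

REQUEST_RE = re.compile(r'"GET (?P<url>\S+) ')

class FakeCache(object):
    """Hypothetical in-memory stand-in for django.core.cache.cache."""
    def __init__(self):
        self.store = {}

    def incr(self, key):
        if key not in self.store:
            raise ValueError('Key %r not found' % key)
        self.store[key] += 1
        return self.store[key]

    def set(self, key, value, timeout=None):
        # The timeout is accepted but ignored in this stand-in.
        self.store[key] = value

def count_hit(cache, line, today):
    """Count one log line; return the cache key used, or None if not a GET."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    key = '%s%s' % (today, match.group('url'))
    try:
        cache.incr(key)
    except ValueError:
        cache.set(key, 1, 24 * 60 * 60)  # first hit of the day for this URL
    return key

cache = FakeCache()
sample = ('127.0.0.1 - - [02/Sep/2009:10:00:00 -0400] '
          '"GET /releases/1/ HTTP/1.1" 200 512 "-" "Mozilla/5.0"')
print(count_hit(cache, sample, date(2009, 9, 2)))  # 2009-09-02/releases/1/
```

One thing the sketch makes visible: two processes that both see a missing key could each set it to 1 and lose a hit. For page view counts that seemed like an acceptable trade.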

Parsing the Logs

The hit counts will expire from the cache after 24 hours, so I parse the logs once a day and put that information into my database. For this I wrote a django management command (I didn’t do a management command before because I wasn’t sure how it would handle the parent and child processes). This command is called by ./manage.py parse_log

from django.conf import settings
from django.contrib.contenttypes.models import ContentType
from django.core.cache import cache
from django.core.management.base import BaseCommand
from django.core.urlresolvers import resolve, Resolver404
import datetime
# found on page linked above
from apachelogs import ApacheLogFile
from app.models import Model_being_hit
from metrics.models import Hits
 
def save_log(alf, date):
    hits = {}
    # loop to sum hits
    for log_line in alf:
        request = log_line.request_line
        request_parts = request.split(' ')
        hits[request_parts[1]] = hits.get(request_parts[1], 0) + 1
    for page, views in hits.iteritems():
        try:
            view, args, kwargs = resolve(page)
            # I check kwargs for something only passed to one app
            if 'param' in kwargs:
                a = Model_being_hit.objects.get(id=kwargs['id'])
                try:
                    content_type = ContentType.objects.get_for_model(a)
                    hit = Hits.objects.get(
                        date=date,
                        content_type=content_type,
                        object_id=a.id,
                    )
                    hit.views = views
                except Hits.DoesNotExist:
                    hit = Hits(date=date, views=views, content_object=a)
                hit.save()
        except (Resolver404, Model_being_hit.DoesNotExist):
            # something not in urls file like static files,
            # or a URL for an object that no longer exists
            pass


class Command(BaseCommand):
    def handle(self, *args, **options):
        day = datetime.date.today()
        day = day - datetime.timedelta(days=1)
        alf = ApacheLogFile('%s/%s/%s/%s/access.log' % (
            settings.ACCESS_LOG_LOCATION,
            day.year,
            day.strftime('%m'), #month
            day.strftime('%d'), #day
        ))
        save_log(alf, day)

I use django.core.urlresolvers.resolve so that I can use my urls file and I don’t have to repeat myself.
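The summing loop at the top of save_log can be tried on its own, without the apachelogs module or Django. This sketch assumes request lines shaped like "GET /path HTTP/1.1"; the function name sum_hits and the sample data are made up, and unlike the loop above it skips lines that have no path instead of raising an IndexError.

```python
from collections import Counter

def sum_hits(request_lines):
    """Count requests per path from lines like 'GET /a/ HTTP/1.1'."""
    hits = Counter()
    for request in request_lines:
        parts = request.split(' ')
        if len(parts) < 2:
            continue  # malformed request line, nothing to count
        hits[parts[1]] += 1
    return hits

sample = [
    'GET /releases/1/ HTTP/1.1',
    'GET /releases/2/ HTTP/1.1',
    'GET /releases/1/ HTTP/1.1',
    '-',
]
print(sum_hits(sample)['/releases/1/'])  # 2
```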

Hits is a django model I created with a few fields for storing date and views. It uses the content types framework so that it can be tied to any of my django models.

from django.contrib.contenttypes        import generic
from django.contrib.contenttypes.models import ContentType
from django.db import models
 
class Hits(models.Model):
    date        = models.DateField()
    views       = models.IntegerField()
    # to add to any model
    content_type   = models.ForeignKey(ContentType)
    object_id      = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey('content_type', 'object_id')
 
    def __unicode__(self):
        return "%s hits on %s" % (self.views, self.date)

This was added to my cron with crontab -e

# every day at 00:01
1 0 * * * /path/to/python /path/to/manage.py parse_log > /dev/null

Displaying the Hits

On my models I added a couple methods that would look up the info in the cache or database.

    @property
    def hits_today(self):
        from datetime import date
        from django.core.cache import cache
        key = '%s%s' % (date.today(), self.get_absolute_url())
        return cache.get(key)
 
    @property
    def hits(self):
        from metrics.models import Hits
        from django.contrib.contenttypes.models import ContentType
        content_type = ContentType.objects.get_for_model(self)
        hits = Hits.objects.filter(
            content_type=content_type,
            object_id=self.id,
        ).order_by('-date')
        return hits

The hits_today method requires that you define get_absolute_url, which is useful in other places as well. @property is a decorator that makes it possible to access the data with object.hits and leave off the parentheses.

The hits method uses the content type framework again to look up the hits in the database.
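One detail to watch when combining the two: cache.get() returns None when the key is missing, so hits_today is None for a page with no views yet today. A sketch of adding today's count to the stored per-day totals, using plain values instead of model instances (total_views is a made-up helper, not something from my models):

```python
def total_views(stored_views, cached_today):
    """Add today's cached count to the per-day totals from the database.

    cached_today may be None, since cache.get() returns None for a missing key.
    """
    return sum(stored_views) + (cached_today or 0)

print(total_views([12, 30, 7], 5))     # 54
print(total_views([12, 30, 7], None))  # 49
```

With the real models, stored_views would be something like [h.views for h in obj.hits] and cached_today would be obj.hits_today.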

Just the Basics

There is a lot more that can be done with this. This barely touches the raw data available in the logs. A few improvements I’ve already started on are excluding known bots from the hit counts, checking the referrer to see where traffic is coming from, and saving the keywords used in search engines.
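As a rough idea of where the bot and referrer checks could start, here is a sketch that pulls the last two quoted fields out of a combined-format log line. It is not the code I ended up with: the function name is invented, and BOT_MARKERS is a hypothetical and deliberately incomplete list.

```python
import re

# The combined log format ends with "referrer" "user agent".
TAIL_RE = re.compile(r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"\s*$')

# Hypothetical, incomplete list of user-agent fragments to treat as bots.
BOT_MARKERS = ('googlebot', 'msnbot', 'slurp')

def referrer_and_bot(line):
    """Return (referrer, is_bot) for a log line, or None if it does not parse."""
    match = TAIL_RE.search(line)
    if match is None:
        return None
    agent = match.group('agent').lower()
    is_bot = any(marker in agent for marker in BOT_MARKERS)
    return match.group('referrer'), is_bot
```

In log_data, a line where is_bot is true could simply be skipped before the cache is touched.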

Conditions on Count or Sum in MySQL

posted on August 28th, 2009 by Greg Allard in Greg's Comments on the Internet

yeah that looks like it should work

Read more comments by Greg Allard

A Django Model Manager for Soft Deleting Records and How to Customize the Django Admin

posted on August 10th, 2009 by Greg Allard in Greg's Comments on the Internet

That would work. If you are using a signal (pre_delete or post_delete), you might need to send it from the new function since you wouldn’t want to call the real delete.

I’ve been setting object.deleted = 1 and then calling object.save(), without calling or overriding delete(). That way I still have the option to do the real delete in case I need it. You could probably make a real_delete() function to do that if needed though.

Python Projects in Users’ Home Directories with wsgi

posted on July 22nd, 2009 by Greg Allard in Greg's Comments on the Internet

I like the AddHandler approach more than what I was trying in this post. It is better since AddHandler will work in an .htaccess file, which means it doesn’t require a new public_python folder and doesn’t require /p/ to be added to the url.

Before arriving at the solution in my post I tried using an .htaccess file and the directives I tried weren’t supported in .htaccess. I didn’t read the part about AddHandler so I missed that.

Something with either WSGIDaemonProcess or WSGIProcessGroup from the code in the blog post is making those applications work in daemon mode. It seems like any wsgi file that is touched will result in the code being reloaded for that project.

A Django Model Manager for Soft Deleting Records and How to Customize the Django Admin

posted on July 22nd, 2009 by Greg Allard in Greg's Comments on the Internet

I tested this out with a simple many to many example and I am not getting the soft deleted objects returned. With objects = SoftDeleteManager(), the many to many queries will be using get_query_set(), which won’t return the soft deleted records. I might need to see an example of how you are getting the deleted results to be able to figure out what is going on. I tried it with some_object.related_things.all() and the returned set doesn’t include deleted related_things.

Python Projects in Users’ Home Directories with wsgi

posted on July 9th, 2009 by Greg Allard in Greg's Comments on the Internet

I tried it without the LocationMatch directives and it works with just having AliasMatch in there for the static locations.

I didn’t expect that WSGIDaemonProcess wouldn’t expand the python_project_name. I was doing that so that each project would have a different process, so touching one wsgi file wouldn’t affect another project. It seemed like it was working like that.

If you can figure out a better way of doing this that would be awesome.

Python Projects in Users’ Home Directories with wsgi

posted on July 8th, 2009 by Greg Allard in Greg's Comments on the Internet

Thanks for taking a look at this. I’ll test it without SetHandler None when I get a chance. It is probably something I was keeping around from when I was using mod_python. I’m guessing I need to keep LocationMatch in there so that the request doesn’t go to the wsgi file though, right?

Python Projects in Users’ Home Directories with wsgi

posted on July 8th, 2009 by Greg Allard in Greg's Posts on Code Spatter

Letting users put static files and php files in a public_html folder in their home directory has been a common convention for some time. I created a way for users to have a public_python folder that will allow for python projects.

In the apache configuration files I created some regular expression patterns that will look for a wsgi file based on the url requested. To serve this url: http://domain/~user/p/myproject, the server will look for this wsgi file: /home/user/public_python/myproject/deploy/myproject.wsgi
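The same mapping can be written out in Python to check which file a given url will end up at. This is only an illustration of the pattern used in the Apache configuration for this post; the function name wsgi_file_for is made up.

```python
import re

# Same pattern as the WSGIScriptAliasMatch directive in the Apache config.
PROJECT_URL = re.compile(r'^/~(\w+)/p/(\w+)')

def wsgi_file_for(url_path):
    """Return the wsgi file a URL path maps to, or None if it does not match."""
    match = PROJECT_URL.match(url_path)
    if match is None:
        return None
    user, project = match.groups()
    return '/home/%s/public_python/%s/deploy/%s.wsgi' % (user, project, project)

print(wsgi_file_for('/~user/p/myproject'))
# /home/user/public_python/myproject/deploy/myproject.wsgi
```

Since \w only matches letters, digits, and underscores, a username or project name containing a dash or a dot won't match.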

It is set up to run wsgi in daemon mode so that each user can touch their own wsgi file to restart their project instead of needing to reload the apache config and inconvenience everyone.
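Touching the file just means updating its modification time, which is what mod_wsgi checks in daemon mode. Running touch from the shell is the usual way; from Python (a deploy script, for example) a minimal equivalent looks like this. The path in the comment is a placeholder.

```python
import os

def touch(path):
    """Set a file's access and modification times to now."""
    # Unlike the shell command, this raises an error if the file is missing.
    os.utime(path, None)

# Example (placeholder path):
# touch('/home/username/public_python/project_name/deploy/project_name.wsgi')
```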

This is the code I added to the apache configuration (in a virtual host, other configs might be different):

RewriteEngine On
RewriteCond %{REQUEST_URI} ^/~(\w+)/p/(\w+)/(.*)
RewriteRule . - [E=python_project_name:%2]
 
WSGIScriptAliasMatch ^/~(\w+)/p/(\w+)  /home/$1/public_python/$2/deploy/$2.wsgi
WSGIDaemonProcess wsgi_processes.%{ENV:python_project_name} processes=2 threads=15
WSGIProcessGroup wsgi_processes.%{ENV:python_project_name}
 
AliasMatch ^/~(\w+)/p/(\w+)/files(.*) /home/$1/public_python/$2/files$3
<LocationMatch ^/~(\w+)/p/(\w+)/files(.*)>
       SetHandler none
</LocationMatch>
 
AliasMatch ^/~(\w+)/p/(\w+)/media(.*) /home/$1/public_python/$2/media$3
<LocationMatch ^/~(\w+)/p/(\w+)/media(.*)>
       SetHandler none
</LocationMatch>

This will also serve two directories statically for images, css, and javascript. For one of them, I always make a symbolic link to the django admin media and tell my settings file to use that.

ln -s /path/to/django/contrib/admin/media media

To use this for a django project

This is a sample wsgi file to use for a django project. Username and project_name will need to be replaced. I’m also adding an apps folder to the path, following the style I mention in my reusable apps post.

import os
import sys
 
sys.path = ['/home/username/public_python/', '/home/username/public_python/project_name/apps'] + sys.path
from django.core.handlers.wsgi import WSGIHandler
 
os.environ['DJANGO_SETTINGS_MODULE'] = 'project_name.settings'
application = WSGIHandler()

I’ve been using this for a couple weeks and it’s working great for me. If you use it, I’d like to know how it works out for you. Let me know in the comments.

A Django Model Manager for Soft Deleting Records and How to Customize the Django Admin

posted on July 2nd, 2009 by Greg Allard in Greg's Comments on the Internet

You can do this:

from somewhere import SoftDeleteManager

class NewManager(SoftDeleteManager):
    '''new stuff'''

and in the model:

objects = NewManager()

A Django Model Manager for Soft Deleting Records and How to Customize the Django Admin

posted on July 1st, 2009 by Greg Allard in Greg's Posts on Code Spatter

Sometimes it’s good to hide things instead of deleting them. Users may accidentally delete something and this way there will be an extra backup. The way I’ve been doing this is I set a flag in the database, deleted = 1. I wrote this code to automatically hide records from django if they are flagged.

Django allows developers to create model managers that can change how the models work. The code below was written to return only the undeleted records by default. I added two new methods in case I need to get some of the deleted records.

from django.db import models
 
class SoftDeleteManager(models.Manager):
    ''' Use this manager to get objects that have a deleted field '''
    def get_query_set(self):
        return super(SoftDeleteManager, self).get_query_set().filter(deleted=False)
    def all_with_deleted(self):
        return super(SoftDeleteManager, self).get_query_set()
    def deleted_set(self):
        return super(SoftDeleteManager, self).get_query_set().filter(deleted=True)

This is usable by many models by adding this line to the model (it needs a deleted field): objects = SoftDeleteManager()

This will hide deleted records from django completely, even the django admin and even if you specify the id directly. The only way to find it is through the database itself or an app like phpMyAdmin. This might be good for some cases, but I went a step further to make it possible to undelete things in the django admin.

Django has a lot of customization options for the admin interface (this article has some more info on customizing the django admin). I wanted the queryset to be different in the admin, so I created a ModelAdmin to customize what is displayed. First I set it up to show a few more columns than just __unicode__ on the list of items and added a filter to help easily separate the deleted from the active.

from django.contrib import admin
 
class SoftDeleteAdmin(admin.ModelAdmin):
    list_display = ('id', '__unicode__', 'deleted',)
    list_filter = ('deleted',)
# this requires __unicode__ to be defined in your model

This can also be used by many models by adding this at the bottom of the models.py file:

from django.contrib import admin
from wherever import SoftDeleteAdmin
admin.site.register(MyModel, SoftDeleteAdmin)

The next thing to do was override the queryset method in the default ModelAdmin. I copied the code from the django source and changed it from using get_query_set() to all_with_deleted(), the method I added to SoftDeleteManager. The following code was added to SoftDeleteAdmin.

    def queryset(self, request):
        """ Returns a QuerySet of all model instances that can be edited by the
        admin site. This is used by changelist_view. """
        # Default: qs = self.model._default_manager.get_query_set()
        qs = self.model._default_manager.all_with_deleted()
        # TODO: this should be handled by some parameter to the ChangeList.
        ordering = self.ordering or () # otherwise we might try to *None, which is bad ;)
        if ordering:
            qs = qs.order_by(*ordering)
        return qs

The list of objects in the admin will start to look like this.

A screenshot of the django admin interface

They are showing up there now, but won’t be editable yet because django is using get_query_set to find them. There are two methods I added to SoftDeleteManager so that django can find the deleted records.

    def get(self, *args, **kwargs):
        ''' if a specific record was requested, return it even if it's deleted '''
        return self.all_with_deleted().get(*args, **kwargs)
 
    def filter(self, *args, **kwargs):
        ''' if pk was specified as a kwarg, return even if it's deleted '''
        if 'pk' in kwargs:
            return self.all_with_deleted().filter(*args, **kwargs)
        return self.get_query_set().filter(*args, **kwargs)

With those updated methods, django will be able to find records if the primary key is specified, not only in the admin section, but everywhere in the project. Lists of objects will still include deleted records only in the admin section.
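To make the lookup rules easier to see, here is a plain-Python model of them. It is not Django code and doesn't touch a database: Record and FakeSoftDeleteManager are made-up stand-ins that only mimic which records each kind of lookup can see.

```python
class Record(object):
    def __init__(self, pk, deleted=False):
        self.pk = pk
        self.deleted = deleted

class FakeSoftDeleteManager(object):
    """Plain-Python stand-in that mimics the lookup rules, not Django code."""
    def __init__(self, records):
        self.records = list(records)

    def all(self):
        # Like get_query_set(): deleted records are hidden from lists.
        return [r for r in self.records if not r.deleted]

    def all_with_deleted(self):
        return list(self.records)

    def filter(self, **kwargs):
        # A pk lookup searches every record; anything else skips deleted ones.
        pool = self.all_with_deleted() if 'pk' in kwargs else self.all()
        return [r for r in pool
                if all(getattr(r, k) == v for k, v in kwargs.items())]

objects = FakeSoftDeleteManager([Record(1), Record(2, deleted=True)])
print([r.pk for r in objects.all()])         # [1]
print([r.pk for r in objects.filter(pk=2)])  # [2]
```

Note that filter(deleted=True) comes back empty here, just as it would with the real manager, which is why deleted_set() exists.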

This code can be applied to a bunch of models and easily allow soft deletes of records to prevent loss of accidentally deleted objects.
