How to Display Realtime Traffic Analytics

posted on September 2nd, 2009 by Greg Allard in Greg's Posts on Code Spatter
Presskit'n Hits

Users of Presskit’n have been asking for traffic statistics on their press releases, so I decided to get them the most recent data possible. At first I parsed the access log once a minute, but while testing that I decided it wasn’t updating fast enough. I’ve gotten used to everything being instant on the internet and I didn’t want to wait a minute to see how many more views there were. In this post I show how I got it to update on page load using Apache, Python, Django, and memcached.

Apache Access Logs

Apache is installed with rotatelogs, a program that can rotate the logs after they get too large. However, I wanted a few more features. Cronolog will update a symlink every time it creates a new log file, so you can always find the most recent log at a fixed path.

CustomLog "|/usr/bin/cronolog --symlink=/path/to/access /path/to/%Y/%m/%d/access.log" combined
ErrorLog "|/usr/bin/cronolog --symlink=/path/to/error /path/to/%Y/%m/%d/error.log"

The CustomLog and ErrorLog directives in Apache will let you pipe output to a command, so I put the full path to cronolog and then specified its parameters. --symlink will point the named symlink to the most recent log created with cronolog. After the options, the path to the log location is specified, and date formats can be used in it. I decided to break mine up by day.
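To make the date formats concrete, here is a minimal Python sketch of how a template like the one above expands for a given day. cronolog does this expansion itself; the expand_log_path helper and the sample date are only illustrative and are not part of my setup.

```python
from datetime import date

# Same template that is passed to cronolog in the CustomLog directive.
TEMPLATE = '/path/to/%Y/%m/%d/access.log'


def expand_log_path(day, template=TEMPLATE):
    """Return the log path the template yields for the given day."""
    return day.strftime(template)


# One directory per year, month, and day.
print(expand_log_path(date(2009, 9, 2)))
```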

Piping Apache Log info to a Python Script

Apache can have multiple log locations and log multiple times. So I wrote my own logging script in python that would insert into memcached. Here is the extra line I added to apache:

CustomLog "|/path/to/python /path/to/log_cache.py" combined

And this is log_cache.py:

#!/usr/bin/env python
 
import os
import sys
import re
from datetime import date
 
sys.path = ['/path/to/project',] + sys.path
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'
 
from django.core.cache import cache
 
# the named group "url" is read back with page.group('url') below
r = re.compile(r'"GET (?P<url>\S+) ')
 
def be_parent(child_pid):
    # os.waitpid returns a (pid, status) tuple; only the status is checked
    child_pid, exit_status = os.waitpid(child_pid, 0)
    if exit_status: # if there's an error, restart the child.
        pid = os.fork()
        if not pid:
            be_child()
        else:
            be_parent(pid)
    return
 
def be_child():
    while True:
        line = sys.stdin.readline() # wait for apache log data
        if not line:
            return # without error code so everything stops
        log_data(line)
 
def log_data(data):
    page = r.search(data)
    if page:
        key = '%s%s' % (date.today(), page.group('url'))
        try:
            cache.incr(key)
        except ValueError:
            # add it to the cache for 24 hours
            cache.set(key, 1, 24*60*60)
    return
 
pid = os.fork()
if not pid:
    be_child()
else:
    be_parent(pid)

A blog post about using python to store access records in postgres helped me out a lot. The parent/child processing came from that and fixed a lot of problems I was having before.

The page views are being added to memcached (with cache.incr(), which is new in Django 1.1) for quick retrieval, and the logs will still be created by cronolog so no data will be lost when the cache expires. Those logs are used in the next part.
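The counting step can be tried without Apache or memcached. This is a minimal sketch, assuming a plain dict in place of the cache and a made-up log line; count_hit and the sample values are hypothetical names used only for illustration.

```python
import re
from datetime import date

# The named group "url" captures the path from a GET request line.
GET_RE = re.compile(r'"GET (?P<url>\S+) ')


def count_hit(counts, line, day):
    """Increment the per-day counter for the URL found in one log line."""
    match = GET_RE.search(line)
    if match is None:
        return None  # not a GET request, or a malformed line
    key = '%s%s' % (day, match.group('url'))
    counts[key] = counts.get(key, 0) + 1
    return key


# Hypothetical combined-format log line, for illustration only.
sample = ('127.0.0.1 - - [02/Sep/2009:10:00:00 -0400] '
          '"GET /releases/42/ HTTP/1.1" 200 512 "-" "ExampleAgent/1.0"')

counts = {}
count_hit(counts, sample, date(2009, 9, 2))
count_hit(counts, sample, date(2009, 9, 2))
print(counts)
```

The key is the date followed by the requested path, so each page gets one counter per day.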

Parsing the Logs

The hit counts will expire from the cache after 24 hours, so I parse the logs once a day and put that information into my database. For this I wrote a django management command (I didn’t do a management command before because I wasn’t sure how it would handle the parent and child processes). This command is called by ./manage.py parse_log:

from django.conf import settings
from django.contrib.contenttypes.models import ContentType
from django.core.cache import cache
from django.core.management.base import BaseCommand
from django.core.urlresolvers import resolve, Resolver404
import datetime
# found on page linked above
from apachelogs import ApacheLogFile
from app.models import Model_being_hit
from metrics.models import Hits
 
def save_log(alf, date):
    hits = {}
    # loop to sum hits
    for log_line in alf:
        request = log_line.request_line
        request_parts = request.split(' ')
        hits[request_parts[1]] = hits.get(request_parts[1], 0) + 1
    for page, views in hits.iteritems():
        try:
            view, args, kwargs = resolve(page)
            # I check kwargs for something only passed to one app
            if 'param' in kwargs:
                a = Model_being_hit.objects.get(id=kwargs['id'])
                try:
                    content_type = ContentType.objects.get_for_model(a)
                    hit = Hits.objects.get(
                        date=date,
                        content_type=content_type,
                        object_id=a.id,
                    )
                    hit.views = views
                except Hits.DoesNotExist:
                    hit = Hits(date=date, views=views, content_object=a)
                hit.save()
        except (Resolver404, Model_being_hit.DoesNotExist):
            # something not in urls file like static files,
            # or a url whose object no longer exists
            pass


class Command(BaseCommand):
    def handle(self, *args, **options):
        day = datetime.date.today()
        day = day - datetime.timedelta(days=1)
        alf = ApacheLogFile('%s/%s/%s/%s/access.log' % (
            settings.ACCESS_LOG_LOCATION,
            day.year,
            day.strftime('%m'), #month
            day.strftime('%d'), #day
        ))
        save_log(alf, day)

I use django.core.urlresolvers.resolve so that I can use my urls file and I don’t have to repeat myself.
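The first loop in save_log, which sums hits per path before anything is resolved, can be sketched on its own without Django. This version is an illustration rather than the code I run: sum_hits is a hypothetical name, the request lines are made up, and it adds a guard for request lines that have no path.

```python
def sum_hits(request_lines):
    """Count how many times each path appears in a list of request lines."""
    hits = {}
    for request in request_lines:
        parts = request.split(' ')
        if len(parts) < 2:
            continue  # malformed request line such as "-", skip it
        path = parts[1]
        hits[path] = hits.get(path, 0) + 1
    return hits


# Hypothetical request lines, as found between the quotes in an access log.
lines = [
    'GET /releases/42/ HTTP/1.1',
    'GET /releases/42/ HTTP/1.1',
    'GET /files/site.css HTTP/1.1',
    '-',
]
totals = sum_hits(lines)
print(totals)
```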

Hits is a django model I created with a few fields for storing date and views. It uses the content types framework so that it can be tied to any of my django models.

from django.contrib.contenttypes        import generic
from django.contrib.contenttypes.models import ContentType
from django.db import models
 
class Hits(models.Model):
    date        = models.DateField()
    views       = models.IntegerField()
    # to add to any model
    content_type   = models.ForeignKey(ContentType)
    object_id      = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey('content_type', 'object_id')
 
    def __unicode__(self):
        return "%s hits on %s" % (self.views, self.date)

This was added to my cron with crontab -e

# every day at one minute past midnight
1 0 * * * /path/to/python /path/to/manage.py parse_log > /dev/null

Displaying the Hits

On my models I added a couple methods that would look up the info in the cache or database.

    @property
    def hits_today(self):
        from datetime import date
        from django.core.cache import cache
        key = '%s%s' % (date.today(), self.get_absolute_url())
        return cache.get(key)
 
    @property
    def hits(self):
        from metrics.models import Hits
        from django.contrib.contenttypes.models import ContentType
        content_type = ContentType.objects.get_for_model(self)
        hits = Hits.objects.filter(
            content_type=content_type,
            object_id=self.id,
        ).order_by('-date')
        return hits

The hits_today method requires that you define get_absolute_url, which is useful in other places as well. @property is a decorator that makes it possible to access the data with object.hits and leave off the parentheses.

The hits method uses the content type framework again to look up the hits in the database.
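One thing to watch is that hits_today only finds a count if its key matches the key the logging script wrote, character for character. The sketch below illustrates that, assuming a plain dict in place of the cache; hit_key and the standalone hits_today function are hypothetical helpers, and returning 0 for a missing key is a choice made here, not what the method above does (it returns None).

```python
from datetime import date


def hit_key(day, url):
    """Build the cache key; the logger and the model must agree on this."""
    return '%s%s' % (day, url)


def hits_today(cache, day, absolute_url):
    """Return the count for a page, treating a missing key as zero hits."""
    count = cache.get(hit_key(day, absolute_url))
    return count if count is not None else 0


# A plain dict stands in for the cache; it has a compatible .get().
day = date(2009, 9, 2)
fake_cache = {hit_key(day, '/releases/42/'): 7}

print(hits_today(fake_cache, day, '/releases/42/'))
print(hits_today(fake_cache, day, '/releases/99/'))
```

Because the logger stores whatever path was requested, a request that carries a query string ends up under a different key than the bare get_absolute_url() value.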

Just the Basics

There is a lot more that can be done with this. This barely touches the raw data available in the logs. A few improvements I’ve already started on are excluding known bots from the hit counts, checking the referrer to see where traffic is coming from, and saving the keywords used in search engines.


Python Projects in Users’ Home Directories with wsgi

posted on July 8th, 2009 by Greg Allard in Greg's Posts on Code Spatter

Letting users put static files and php files in a public_html folder in their home directory has been a common convention for some time. I created a way for users to have a public_python folder that will allow for python projects.

In the apache configuration files I created some regular expression patterns that will look for a wsgi file based on the url requested. To serve this url: http://domain/~user/p/myproject, the server will look for this wsgi file: /home/user/public_python/myproject/deploy/myproject.wsgi
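The mapping from url to wsgi file can be sketched in a few lines of Python. This only mirrors the pattern used in the configuration below so the rule is easy to test by hand; Apache does the real matching, and wsgi_file_for is a hypothetical name.

```python
import re

# Same pattern as the url part of the WSGIScriptAliasMatch directive.
PROJECT_URL = re.compile(r'^/~(\w+)/p/(\w+)')


def wsgi_file_for(url_path):
    """Return the wsgi file that would serve url_path, or None."""
    match = PROJECT_URL.match(url_path)
    if match is None:
        return None
    user, project = match.groups()
    return '/home/%s/public_python/%s/deploy/%s.wsgi' % (
        user, project, project)


print(wsgi_file_for('/~user/p/myproject'))
```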

It is set up to run wsgi in daemon mode so that each user can touch their own wsgi file to restart their project instead of needing to reload the apache config and inconvenience everyone.

This is the code I added to the apache configuration (in a virtual host, other configs might be different):

RewriteEngine On
RewriteCond %{REQUEST_URI} ^/~(\w+)/p/(\w+)/(.*)
RewriteRule . - [E=python_project_name:%2]
 
WSGIScriptAliasMatch ^/~(\w+)/p/(\w+)  /home/$1/public_python/$2/deploy/$2.wsgi
WSGIDaemonProcess wsgi_processes.%{ENV:python_project_name} processes=2 threads=15
WSGIProcessGroup wsgi_processes.%{ENV:python_project_name}
 
AliasMatch ^/~(\w+)/p/(\w+)/files(.*) /home/$1/public_python/$2/files$3
<LocationMatch ^/~(\w+)/p/(\w+)/files(.*)>
       SetHandler none
</LocationMatch>
 
AliasMatch ^/~(\w+)/p/(\w+)/media(.*) /home/$1/public_python/$2/media$3
<LocationMatch ^/~(\w+)/p/(\w+)/media(.*)>
       SetHandler none
</LocationMatch>

This will also serve two directories statically for images, css, and javascript. For one of them, I always make a symbolic link to the django admin media and tell my settings file to use that.

ln -s /path/to/django/contrib/admin/media media

To use this for a django project

This is a sample wsgi file to use for a django project. Username and project_name will need to be replaced. I’m also adding an apps folder to the path following the style I mention in my reusable apps post.

import os
import sys
 
sys.path = ['/home/username/public_python/', '/home/username/public_python/project_name/apps'] + sys.path
from django.core.handlers.wsgi import WSGIHandler
 
os.environ['DJANGO_SETTINGS_MODULE'] = 'project_name.settings'
application = WSGIHandler()

I’ve been using this for a couple weeks and it’s working great for me. If you use it, I’d like to know how it works out for you. Let me know in the comments.


How to Speed up Your Django Sites with NginX, Memcached, and django-compress

posted on April 23rd, 2009 by Greg Allard in Greg's Posts on Code Spatter

A lot of these steps will speed up any kind of application, not just django projects, but there are a few django specific things. Everything has been tested on IvyLees, which is running in a Debian/Ubuntu environment.

These three simple steps will speed up your server and allow it to handle more traffic.

Reducing the Number of HTTP Requests

Yahoo has developed a firefox extension called YSlow. It analyzes all of the traffic from a website and gives a score on a few categories where improvements can be made.

It recommends combining all of your css files into one file and all of your js files into one file, or as few as possible. There is a pluggable, open source django application available to help with that task. After setting up django-compress, a website will have css and js files that are minified (excess white space and characters are removed to reduce file size). The application will also give the files version numbers so that they can be cached by the web browser and won’t need to be downloaded again until a change is made and a new version of the file is created. How to set up the server to send a far future expiration is shown below in the lightweight server section.

Setting up Memcached

Django makes it really simple to set up caching backends and memcached is easy to install.

sudo aptitude install memcached python-setuptools

We will need setuptools so that we can do the following command.

sudo easy_install python-memcached

Once that is done you can start the memcached server by doing the following:

sudo memcached -d -u www-data -p 11211 -m 64

-d will start it in daemon mode, -u is the user for it to run as, -p is the port, and -m is the maximum number of megabytes of memory to use.

Now open up the settings.py file for your project and add the following line:

CACHE_BACKEND = 'memcached://127.0.0.1:11211/'

Find the MIDDLEWARE_CLASSES section and add this to the beginning of the list:

    'django.middleware.cache.UpdateCacheMiddleware',

and this to the end of the list:

    'django.middleware.cache.FetchFromCacheMiddleware',
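Put together, the relevant part of settings.py might look like the sketch below. The two entries in the middle are placeholders for whatever your project already lists; only the first and last entries and the CACHE_BACKEND line come from the steps above. The order matters because the update middleware has to run last when the response goes out, which in Django means being listed first.

```python
# settings.py (fragment)
CACHE_BACKEND = 'memcached://127.0.0.1:11211/'

MIDDLEWARE_CLASSES = (
    # Must be first so it sees the finished response and can cache it.
    'django.middleware.cache.UpdateCacheMiddleware',
    # Placeholder entries; keep the ones your project already has here.
    'django.middleware.common.CommonMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    # Must be last so it can answer from the cache before the view runs.
    'django.middleware.cache.FetchFromCacheMiddleware',
)
```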

For more about caching with django see the django docs on caching. You can reload the server now to try it out.

sudo /etc/init.d/apache2 reload

To make sure that memcached is set up correctly you can telnet into it and get some statistics.

telnet localhost 11211

Once you are in, type stats and it will show some information (press ctrl ] and then ctrl d to exit). If there are too many zeroes, it either isn’t working or you haven’t visited your site since the caching was set up. See the memcached site for more information.
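The same check can be scripted instead of typed into telnet. This is a minimal sketch that assumes memcached is listening on 127.0.0.1:11211 as configured above; parse_stats and fetch_stats are hypothetical helper names, and the sample reply is made up and trimmed to three counters.

```python
import socket


def parse_stats(raw):
    """Turn the text reply to "stats" into a dict of name -> value."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(' ', 2)
        if len(parts) == 3 and parts[0] == 'STAT':
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host='127.0.0.1', port=11211):
    """Send "stats" to a running memcached and return the parsed reply."""
    conn = socket.create_connection((host, port), timeout=5)
    try:
        conn.sendall(b'stats\r\n')
        data = b''
        # The reply is a series of STAT lines terminated by END.
        while not data.endswith(b'END\r\n'):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    finally:
        conn.close()
    return parse_stats(data.decode('ascii'))


# Hypothetical reply, for illustration only.
sample = 'STAT cmd_get 12\r\nSTAT get_hits 9\r\nSTAT get_misses 3\r\nEND\r\n'
stats = parse_stats(sample)
print(stats['get_hits'])
```

Calling fetch_stats() against a live server returns the full set of counters; get_hits and get_misses are the ones to watch after visiting a few pages.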

Don’t Use Apache for Static Files

Apache has some overhead involved that makes it good for serving php, python, or ruby applications, but you do not need that for static files like your images, style sheets, and javascript. There are a few options for lightweight servers that you can put in front of apache to handle the static files. Lighttpd (lighty) and nginx (engine x) are two good options. Adding this layer also means apache is no longer exposed directly to the internet, so there is a small security bonus on top of the speed bonus.

There is this guide to install a django setup with nginx and apache from scratch. If you followed my guide to set up your server or already have apache set up for your application, then there are a few steps to get nginx handling your static files.

sudo aptitude install nginx

Edit the config file for your site (sudo nano /etc/apache2/sites-available/default) and change the port from 80 to 8080 and change the ip address (might be *) to 127.0.0.1. The lines will look like the following

NameVirtualHost 127.0.0.1:8080
<VirtualHost 127.0.0.1:8080>

Also edit the ports.conf file (sudo nano /etc/apache2/ports.conf) so that it will listen on 8080.

Listen 8080

Don’t restart the server yet; you want to configure nginx first. Edit the default nginx config file (sudo nano /etc/nginx/sites-available/default) and find where it says

        location / {
               root   /var/www/nginx-default;
               index  index.html index.htm;
        }

and replace it with

location / {
    proxy_pass http://127.0.0.1:8080;
    proxy_redirect off;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    client_max_body_size 10m;
    client_body_buffer_size 128k;
    proxy_connect_timeout 90;
    proxy_send_timeout 90;
    proxy_read_timeout 90;
    proxy_buffer_size 4k;
    proxy_buffers 4 32k;
    proxy_busy_buffers_size 64k;
    proxy_temp_file_write_size 64k; 
}
location /files/ {
    root /var/www/myproject/;
    expires max;
}

/files/ is where I’ve stored all of my static files and /var/www/myproject/ is where my project lives and it contains the files directory.
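The reason root points at the project directory rather than at the files directory is that nginx appends the whole request path to root. A small sketch of that rule, with nginx_root_path as a hypothetical helper and a made-up file name:

```python
def nginx_root_path(root, uri):
    """With the root directive, the full request path is appended to root."""
    return root.rstrip('/') + uri


# A request for /files/css/site.css is read from inside the project.
print(nginx_root_path('/var/www/myproject/', '/files/css/site.css'))
```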

Set static files to expire far in the future

expires max; will tell your users’ browsers to cache the files from that directory for a long time. Only use that if you are sure those files won’t change. You can use expires 24h; if you aren’t sure.

Configure gzip

Edit the nginx configuration to use gzip on all of your static files (sudo nano /etc/nginx/nginx.conf). Where it says gzip on; make sure it looks like the following:

    gzip  on;
    gzip_comp_level 2;
    gzip_proxied any;
    gzip_types      text/plain text/html text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript;
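To get a feel for what compression does to text files, here is a rough sketch using Python's zlib module at the same level as gzip_comp_level above. It is only an illustration: the stylesheet is made up and far more repetitive than a real one, and zlib wraps the compressed data slightly differently than the gzip format nginx sends, although the underlying algorithm is the same.

```python
import zlib

# Hypothetical stylesheet: one rule repeated, so it compresses very well.
css = ('.button { color: #333; padding: 4px 8px; margin: 0; }\n'
       * 200).encode('ascii')

# Level 2 is a low setting that favours speed over the smallest output.
compressed = zlib.compress(css, 2)

# Compare the sizes, then check that nothing was lost.
print(len(css))
print(len(compressed))
restored = zlib.decompress(compressed)
```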

The servers should be ready to be restarted.

sudo /etc/init.d/apache2 reload
sudo /etc/init.d/nginx reload

If you are having any problems I suggest reading through this guide and seeing if you have something set up differently.

Speedy Django Sites

Those three steps should speed up your server and allow for more simultaneous visitors. There is a lot more that can be done, but getting these three easy things out of the way first is a good start.


Setting up Apache2, mod_python, MySQL, and Django on Debian Lenny or Ubuntu Hardy Heron

posted on October 15th, 2008 by Greg Allard in Greg's Posts on Code Spatter

Both Debian and Ubuntu make it really simple to get a server up and running. I was trying a few different Machine Images on Amazon and I found myself repeating a lot of things so I wanted to put them here for reference for those who might find it useful.

With a fresh image, the first thing to do is update the package lists and upgrade the installed packages.

apt-get update && apt-get upgrade -y

Then grab all of the software to use.

apt-get install -y xfsprogs mysql-server  apache2  libapache2-mod-python  python-mysqldb  python-imaging  python-django  subversion php5  phpmyadmin

xfsprogs is for formatting an Elastic Block Store volume and may not be needed in all cases.

I like to check out the latest version of Django from their repository because it makes it easier to update later. This also starts a project named myproject (this name is used later).

cd /usr/lib/python2.5/site-packages
svn co http://code.djangoproject.com/svn/django/trunk/django django
ln -s /usr/lib/python2.5/site-packages/django/bin/django-admin.py /usr/local/bin
cd /var/www
django-admin.py startproject myproject

Now to edit the apache config to tell it about our project.

cd /etc/apache2
nano httpd.conf

Add the following to set up python to run the django files and php to run the phpmyadmin files. There is also an example of serving static files. Change where it says myproject if you used a different name.

<Location "/">
    SetHandler python-program
    PythonHandler django.core.handlers.modpython
    SetEnv DJANGO_SETTINGS_MODULE myproject.settings
    PythonOption django.root /myproject
    PythonDebug On
    PythonPath "['/var/www'] + sys.path"
</Location>
 
 
Alias /adm_media/ /usr/lib/python2.5/site-packages/django/contrib/admin/media/
<Location "/adm_media/">
    SetHandler None
</Location>
 
Alias /files/ /var/www/myproject/files/
<Location "/files/">
    SetHandler None
</Location>
 
Alias /phpmyadmin/ /usr/share/phpmyadmin/
<Location "/phpmyadmin/">
    SetHandler None
</Location>

Restart apache for it to use the new configuration.

/etc/init.d/apache2 restart

The only thing left to do is set up the database. If Ubuntu had you set up a root password already, add -p to the end of the following command to use it.

mysql

There are some users in mysql without a username, and it is best to remove those.

drop user ''@'localhost';

Do that for each host that has a blank username. Use the following to see all users.

SELECT user, host FROM mysql.user;

Create a database and add a user.

CREATE DATABASE db_name;
GRANT ALL ON db_name.* to user_name WITH GRANT OPTION;
SET PASSWORD FOR user_name = password('psswdhere');

If root doesn’t have a password yet, use the SET PASSWORD command above with root as the username.
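The project then needs to be told about the new database. This is a sketch of the matching lines in myproject/settings.py, assuming a Django release from around the time of this post (newer releases use a single DATABASES dictionary instead) and reusing the placeholder names from the SQL above.

```python
# settings.py (fragment), old-style database settings
DATABASE_ENGINE = 'mysql'
DATABASE_NAME = 'db_name'        # from CREATE DATABASE above
DATABASE_USER = 'user_name'      # from GRANT above
DATABASE_PASSWORD = 'psswdhere'  # from SET PASSWORD above
DATABASE_HOST = ''               # empty string means localhost
DATABASE_PORT = ''               # empty string means the default port
```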

Amazon has a page about how to use EBS with MySQL, but there are reported issues with using Debian Lenny and EBS.