EuroPython 2011: Simone Federici on Django Productivity Tips and Tricks

Links: Talk description, video and slides


Know the environment

Use Linux with the "Terminator" shell.

Use a version control system.

Use virtualenv, for managing different versions of Python and dependencies, e.g.

virtualenv path --python=python2.7 --no-site-packages
system libs: ./configure --prefix=envpath ; export LD_LIBRARY_PATH=envpath/lib

Use yolk, a handy tool to query PyPI and check the status of installed packages:

yolk -l (to see installed packages)
yolk -U (to see if there are updates on pypi)
yolk -a

Use the bash autocomplete script.

Use djangodevtools (PyPi/site with description of the new commands), for lots of useful things such as adding test coverage (./manage.py cover test myapp).

Continuous integration

Use Buildbot and Django-Buildbot (note: there was a configuration example on the speaker slides).

In settings.py:

try:
    from setting_local import *
except ImportError:
    pass

setting_local.py shouldn't be shared or checked in.

Thanks to alias, from djangodevtools, you can simply create commands to do whatever you want, e.g. clean up rabbit queues. The commands are stored in a manage.cfg file that is shared.

uWSGI is an application server with many options. --auto-snapshot sounds quite interesting. It supports clustering. It'll be in the official Django deployment documentation from the next release (ticket 16057).

Coding

get_profile() tends to be a problem. It's possible to monkey patch the User.get_profile() method (to make sure a new profile is created if it doesn't exist) but you have to be careful where it's loaded. It's also possible to use a Meta proxy together with a new middleware (set up after the authentication middleware)

Django model form factory (django.forms.models.modelform_factory) sounds interesting to create forms more quickly.

uWSGI

There was an on-the-fly short talk on uWSGI after the talk, by someone whose name I didn't pick up. It speaks many protocols and has lots of plugins, so you only use what you need. It's not the fastest but speed isn't the main factor that should make you decide to use it.


EuroPython 2011: Simon Willison on Advanced Aspects of the Django Ecosystem

Links: Talk description and video and slides.

This talk will be about 3 tools, that can be considered secret weapons: they offer great payoffs for little effort.

(Note: the slides enhance most of these concepts with lots of code examples, have a look!)

Haystack

Haystack does full text search, and is available as modular search for Django. It's very easy to get a nice search interface if you already use the Django ORM, and the queryset can also be defined to limit search queries to what you want (e.g. only published entries).

You can have different templates/html bits depending on the type of objects returned by the search.

Scaling/Backend

  • Whoosh (Python) - Good if you have no write traffic, and not a lot of traffic in general
  • Xapian (C)
  • Solr (Java) tends to be the default choice. It has an HTTP interface, and there are tons of things that are already baked into Solr, like filtering down by object type. Objects can be associated with a template, although it sounds like it's more about relevance than display: the speaker mentioned showing the title twice in the template to increase its weight in search results. It can scale with SearchQuerySet, and works faster than complicated crazy SQL.

Search indexes usually don't like being updated much. Haystack offers several solutions. Sites with low write traffic can update the index in real time at every change. Or changes can be batched every 6 hours. At a higher scale, you have to roll your own solution. For Lanyrd they have added a "needs_indexing" boolean to their models that defaults to True and is also set in the save() hook. Then using a management command or something else, it's possible to look at what needs to be indexed, process it and set the flag to False.
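To make the "needs_indexing" idea concrete, here's a minimal sketch in plain Python. The Entry class is only a stand-in for a Django model, and the dict stands in for the search backend; both are my assumptions, not Lanyrd's code.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Stand-in for a Django model; only the flag logic matters here."""
    pk: int
    title: str
    needs_indexing: bool = True  # new rows start out flagged

    def save(self):
        # Every save marks the row as stale for the search index.
        self.needs_indexing = True


def index_pending(entries, index, batch_size=100):
    """Index up to batch_size flagged entries, then clear their flags."""
    pending = [e for e in entries if e.needs_indexing][:batch_size]
    for entry in pending:
        index[entry.pk] = entry.title  # hypothetical search backend call
        entry.needs_indexing = False
    return len(pending)


entries = [Entry(1, "PyCon"), Entry(2, "EuroPython")]
search_index = {}
first_run = index_pending(entries, search_index)   # both entries are new
entries[0].title = "PyCon US"
entries[0].save()                                  # the edit flags it again
second_run = index_pending(entries, search_index)  # only the edited entry
```

In a real project index_pending would presumably live in a management command run from cron, and the flag would be cleared with a queryset update rather than in memory.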

Solr has master/slaves capabilities and knows how to balance the reads between slaves, the writes should be sent to master. Haystack only knows how to talk to one url, but using nginx it's possible to balance and to set up different proxies depending on the URLs to make sure the writes go where they should -- remember, Solr speaks HTTP.

Celery

Celery is a fully featured, distributed task queue manager. Any task that would take more than 200ms should go on the queue! For instance...

  • Updating the search index
  • Resizing images
  • Hitting external APIs
  • Generating reports

Using the @task decorator, the method works normally if called directly, but also gains a delay() that adds the method to the queue to be picked up by workers.
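The decorator behaviour described above can be illustrated with a toy version. This is not Celery's implementation, just a sketch of the pattern: the function stays callable, and gains a delay() that puts the call on a queue (an in-process queue.Queue here, where Celery would use a message broker).

```python
import queue

task_queue = queue.Queue()  # stand-in for the message broker


def task(func):
    """Toy version of the pattern: keep func callable, add delay()."""
    def delay(*args, **kwargs):
        # Enqueue the call instead of running it now.
        task_queue.put((func, args, kwargs))
    func.delay = delay
    return func


@task
def resize_image(name, width):
    return "%s resized to %dpx" % (name, width)


def run_worker():
    """Drain the queue the way a worker process would; return the results."""
    results = []
    while not task_queue.empty():
        func, args, kwargs = task_queue.get()
        results.append(func(*args, **kwargs))
    return results


direct = resize_image("logo.png", 100)   # runs immediately
resize_image.delay("photo.jpg", 640)     # runs later, in the worker
queued = run_worker()
```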

For tasks launched by users (such as uploading a picture or figuring out what's at a url):

  • To deal with people using Javascript or not: if 'ajax' in request.POST, show an inline loading indicator, otherwise redirect.
  • Use memcached for automatic housekeeping: if the user closes the browser and doesn't come back, the task shouldn't be kept around forever. The oldest entries get dropped automatically after a few hours.

Use celerybeat for scheduling, celerymon for monitoring the worker cluster, celerycam to take snapshots -- this helps figuring out when/where things go wrong.

The activity stream pattern gives everyone an "inbox" when everyone needs to receive something, like a tweet: it gets written to everyone's stream. Redis can handle 100,000 writes/second and is a handy tool to deal with this; this is also the kind of task that's a great candidate for queueing.
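Here's a small sketch of the fan-out write behind the activity stream pattern, using plain Python containers where the talk would use Redis. The follower graph and the inbox cap are made-up values for illustration.

```python
from collections import defaultdict, deque

STREAM_LENGTH = 3  # assumed cap per inbox, kept tiny for the example

followers = {"alice": ["bob", "carol"], "bob": ["carol"]}
inboxes = defaultdict(lambda: deque(maxlen=STREAM_LENGTH))


def fan_out(author, message):
    """Write the message to every follower's inbox (newest first)."""
    for user in followers.get(author, []):
        inboxes[user].appendleft((author, message))


fan_out("alice", "hello")
fan_out("bob", "hi")
```

Reading a stream is then a single lookup (inboxes["carol"]) instead of a join across everyone the user follows, which is the whole point of paying the cost at write time.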

Fabric

Fabric is great for automated and repeatable deployments, it also makes it easier to roll back. You could use chef and puppet, which are ridiculously powerful but quite complex to set up. Fabric fits the developer mental model better generally, it kind of wraps your current processes into Python.

For instance, you can create a clear_cache() that calls flush_all() on the cache. Then, to clear the cache on your server, call from your machine:

fab -H host1,host2 clear_cache

The file (fabfile.py) is version controlled therefore documenting your process -- so you don't forget how to do it 6 months from now!

env is a global variable used by Fabric, you can add your own variables to it that can be reused in other commands, for instance env.deploy_date to store the deployment date and time and make it easier to roll back and roll forward (he uses symlinks).

They use a servers.json configuration file that documents the instance id, public dns, private dns, names and roles (solr, memcached, etc). Fabric can use this to deploy, nginx can use it to load balance, Django can import it in the settings to know what to talk to.
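As a rough idea of how such a file could be consumed from Python, here's a sketch. The JSON layout and field names below are my guesses based on the description above, not the actual Lanyrd file.

```python
import json

# Hypothetical contents of servers.json; the field names are assumptions.
SERVERS_JSON = """
[
  {"instance_id": "i-0001", "public_dns": "web1.example.com",
   "private_dns": "10.0.0.1", "name": "web1", "roles": ["web"]},
  {"instance_id": "i-0002", "public_dns": "search1.example.com",
   "private_dns": "10.0.0.2", "name": "search1",
   "roles": ["solr", "memcached"]}
]
"""


def hosts_with_role(servers, role, key="public_dns"):
    """Return the chosen address of every server that has the given role."""
    return [s[key] for s in servers if role in s["roles"]]


servers = json.loads(SERVERS_JSON)
solr_hosts = hosts_with_role(servers, "solr")
memcached_private = hosts_with_role(servers, "memcached", key="private_dns")
```

The same helper could feed a fabfile (public names to deploy to) and settings.py (private addresses to talk to), so the list of machines only lives in one place.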


EuroPython 2011: Simon Willison on Challenges in developing a large Django site

Links: talk description and video and slides.


Simon Willison is the co-founder of lanyrd.com, a social website for conferences.

Tips and tricks

Signing (from 1.4, currently in trunk)

Using cryptographic signing for various things can ensure that they haven't been tampered with, for instance a cookie or an unsubscribe link. If you sign your session cookies you don't have to hit the database anymore, you just need to check the signature on the cookie.

The speaker showed a couple of short code examples to demonstrate how simple it is to use, and how the interface is consistent with the other serialisation interfaces.

from django.core import signing
signing.dumps({"foo": "bar"})  # url safe
signing.loads(string)
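To show the principle behind signed values, here's a standard-library sketch with the same dumps/loads shape. It is not Django's implementation (the token format and the SECRET placeholder are my own); it only illustrates that the payload stays readable while any tampering invalidates the signature.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"not-a-real-secret"  # placeholder; a real app keeps this private


def _b64(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text):
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def _sign(payload):
    digest = hmac.new(SECRET, payload.encode("ascii"), hashlib.sha256).digest()
    return _b64(digest)


def dumps(obj):
    """Serialise obj and append an HMAC so tampering can be detected."""
    payload = _b64(json.dumps(obj, sort_keys=True).encode("utf-8"))
    return payload + ":" + _sign(payload)


def loads(token):
    """Check the signature, then decode; raise ValueError on a mismatch."""
    payload, _, signature = token.rpartition(":")
    if not hmac.compare_digest(signature, _sign(payload)):
        raise ValueError("bad signature")
    return json.loads(_unb64(payload).decode("utf-8"))


token = dumps({"foo": "bar"})
```

Note that signing is not encryption: anyone can decode the payload, they just can't change it without knowing the secret.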

cache_version

This is another way to do cache invalidation. You add a cache_version field to the model, that is incremented when calling the save() hook or a touch() method. In the template cache fragment, you use the primary key and the cache_version to invalidate.

You can also mass invalidate by updating the cache version of objects.all() using F() -- example from the slides:

topic.conferences.all().update(
    cache_version = F('cache_version') + 1
)
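Here's a plain-Python sketch of why bumping cache_version invalidates a cached fragment. The Conference class and the dict cache are stand-ins I made up for a Django model and memcached; the key format is an assumption too.

```python
cache = {}  # stand-in for memcached


class Conference:
    """Minimal stand-in for a model with a cache_version field."""

    def __init__(self, pk, name):
        self.pk = pk
        self.name = name
        self.cache_version = 0

    def touch(self):
        # Bumping the version makes every old key unreachable.
        self.cache_version += 1

    def cache_key(self, fragment):
        return "%s:%d:%d" % (fragment, self.pk, self.cache_version)


def render_sidebar(conf):
    """Return the cached fragment, rendering it only on a miss."""
    key = conf.cache_key("sidebar")
    if key not in cache:
        cache[key] = "<h2>%s</h2>" % conf.name
    return cache[key]


conf = Conference(1, "EuroPython")
before = render_sidebar(conf)
conf.name = "EuroPython 2011"
stale = render_sidebar(conf)   # same key, so the old fragment is served
conf.touch()
after = render_sidebar(conf)   # new key, so the fragment is re-rendered
```

Nothing is ever deleted explicitly: the old entries simply stop being requested and eventually fall out of the cache.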

noSQL for denormalisation

Use noSQL to denormalise and keep the database and the cache/nosql in sync. It's more work but it's worth it.

For instance they use Redis sets to maintain lists such as username-follows, europython-attendees and then they simply need to do a set intersection to get the information they want. These are only lists of ids so they don't take that much space.
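The set intersection itself is tiny. Here it is with Python sets standing in for Redis sets (in Redis this would be a SINTER on the two keys); the ids are made up.

```python
# Python sets standing in for Redis sets; the ids are made up.
sets = {
    "simon-follows": {1, 2, 3, 5, 8},
    "europython-attendees": {2, 3, 4, 8, 13},
}


def friends_attending(username, conference):
    """Ids the user follows who also attend the conference."""
    return sets["%s-follows" % username] & sets["%s-attendees" % conference]


result = friends_attending("simon", "europython")
```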

Hashed static asset filenames in CloudFront

They created a management command to push static assets, that compresses Javascript, changes the names/urls, etc. This way they can publish them in advance, and also keep static files around if there's a need to rollback. The different names are also good to prevent Internet Explorer caching.
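A sketch of the renaming step, assuming the new name is derived from a hash of the file content (the talk didn't detail the exact scheme, so the digest choice and name layout here are mine).

```python
import hashlib
import posixpath


def hashed_name(path, content, digest_length=8):
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_length]
    root, ext = posixpath.splitext(path)
    return "%s.%s%s" % (root, digest, ext)


v1 = hashed_name("js/site.js", b"alert(1);")
v2 = hashed_name("js/site.js", b"alert(2);")
```

Because the name only changes when the content does, old and new versions can sit side by side on the CDN, which is what makes publishing in advance and rolling back possible.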

Challenges

This part of the talk is about things they don't really have answers for.

HTTP Requests

e.g. talking to an API: what if it fails or takes 30 seconds? Do you use urllib? What if people enter private urls from within your Intranet? :O

You have to handle connection timeouts, logging and profiling, url validation, and http caching. All of these are a common set of problems that should be baked into the framework.

Profiling and debugging production problems

Debugging in development rocks, with the django-debug-toolbar, the way 500 errors are handled, pdb, etc.

Once you turn debug to False, you're blind. After a while, all the bugs, particularly performance bugs, only happen in production.

He showed us a code snippet for a UserBasedExceptionMiddleware: if a page throws a 500 error and the logged-in user has is_superuser set to True, they see the traceback instead of the default 500 page (so if one of your users reports a problem, you can go to the page straight off and see a traceback).

At the database level, there is a handy tool called mysql-proxy that is customisable using Lua. Using a wonderful, horribly documented library called log.lua, you can for instance turn on logging for a couple of minutes when needed.

He created an app called django_instrumented (unreleased, until it's cleaned up) that collects statistics and sticks them into memcached. He has a special bookmark to access them, they are stored for 5 minutes only -- so they waste neither space nor time.

This actually helped improve the performance: if you measure something and make it visible, people will improve it over time.

0 downtime deployments

Code-wise it's easy enough to do, but when there are database changes it's tougher. Ideally they try to make schema changes backwards compatible, then use ./manage.py migrate (using South) on another web server.

Having a read-only mode made a lot of problems easier! It's not 0 downtime but the content is still readable. It can be a setting or a Redis key.

Feature flags work in the same way but at a more fine-grained level, for instance turning off search while you update your solr cluster. There's quite a bit more work involved.

One lesson we keep on learning in Django

We went from one database to multi-databases, from one cache to multi-caches, from one haystack backend to multiple backends.

Debug is one single setting, that affects a lot of things.

The timezone setting also affects Apache log files.

The middleware concept is very powerful, but is executed on every single request: if there's a conditional it has to be done within the middleware.

Really, global settings should be flushed out of the project! They are evil settings that cannot be changed at runtime.


EuroPython 2011: David Cramer on building scalable websites

Link to talk description and video (videos should be public next week I believe)


Performance (e.g. a request should return in less than 5 seconds) is not the same as scalability (e.g. a request should ALWAYS return in less than 5 seconds). Fortunately, it turns out that when you start working on scalability you usually end up improving performance as well -- note that this doesn't work the other way around.

Common bottlenecks

The database is almost always an issue.

Caching and invalidation help.

They use Postgres for 98% of their data, it works great on good hardware with one master only (Disqus, his company, uses Django to serve 3 billion page views a month)

Packaging matters

Packaging is key: it makes your deployment repeatable, which is incredibly useful even when you're working by yourself. Unfortunately there are too many ways to do packaging in Python, and none that solves all the problems. He uses setuptools, because it usually works.

Plenty of benefits to packaging:

  • The handy 'develop' command installs all the dependencies.
  • Dependencies are frozen.
  • It's a great way to get a new team member quickly set up.

Then, they use fabric to deploy consistently.

Database(s)

This applies to any kind of datastore, as they are the usual bottleneck. It can become difficult to scale once there is more than one server.

The rest of the talk uses a Twitter clone as an example.

For the public timeline, you select everything and order it by date. It's ok if there is only 1 database server, otherwise you need to use some sort of map/reduce variant to get it working. The index on date will be fairly heavy though. It's quite easy to cache (add tweet to a queue whenever it's added), and invalidate.

For personal timelines, you can use vertical partitioning, with the user and tweets on separate machines. Unfortunately this means a SQL JOIN is not possible. Materialised views are a possible answer but they aren't supported by many databases (for instance MySQL doesn't support them: it will generate a view by rerunning the query every time, which means you can't index it).

Using Postgres and Redis, you can have a sorted set, using the tweet id with the timestamp as its weight (will become ordering). Note that you can't have a never ending long tail of data, data will be truncated after 30 days or whatever (remove the data from Redis).
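Here's a toy model of that sorted set in plain Python, to show the three operations involved. In Redis they would map to commands such as ZADD, ZREVRANGE and ZREMRANGEBYSCORE; the Timeline class and the timestamps are mine, for illustration only.

```python
DAY = 86400  # seconds


class Timeline:
    """Toy sorted set: tweet id -> timestamp score, read newest first."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.scores = {}

    def add(self, tweet_id, timestamp):
        self.scores[tweet_id] = timestamp

    def truncate(self, now):
        """Drop everything older than max_age."""
        cutoff = now - self.max_age
        self.scores = {t: s for t, s in self.scores.items() if s >= cutoff}

    def latest(self, count):
        """Return up to count tweet ids, most recent first."""
        ordered = sorted(self.scores, key=self.scores.get, reverse=True)
        return ordered[:count]


timeline = Timeline(max_age=30 * DAY)
timeline.add(101, 0)
timeline.add(102, 10 * DAY)
timeline.add(103, 35 * DAY)
timeline.truncate(now=36 * DAY)   # tweet 101 is now older than 30 days
recent = timeline.latest(10)
```

Only the ids live in the sorted set; the tweets themselves would still be fetched from Postgres (or a cache) by id.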

Now the new problem is to scale Redis! You can partition per user, say if you keep 1000 tweets per user you can know how much space a user will take, and how many you can have per server.

See github.com/disqus/nydus, which packages a cluster of connections to Redis; it can be used like (?) a Django database. They run 64 Redis nodes on the same machine in virtual machines.

Vertical vs. Horizontal partitioning

You can have:

  • Master database with no indexes, only primary keys
  • A database of users
  • A database of tweets

So far the hardware scales at the same time as their app. If you need more machines, more RAM, it's cheap enough, and when you need it again in a few years it will be the same price.

Asynchronous tasks

Using Rabbit and Celery, you can use application triggers to manage your data, e.g. a signal on a model save() hook that adds the new item to a queue after it's been added to the database. This way, when the worker starts on the task it can add the new tweet to all the caches without blocking (e.g. if someone has 7 million followers, their tweet needs to be added to 7 million streams)

Building an API

Having an API is important to scale your code and your architecture. Make sure that all the places in your code (the Django code, the Redis code, the REST part, whatever) use the same API, or are refactored to use the same API, so that you can change them all in one place.

To wrap up

  • Use a framework (like Django, to do some of the legwork for you), then iterate. Start with querying the database then scale.
  • Scaling can lead to performance but not the other way around.
  • When you have a large infrastructure, architect it in terms of services, it's easier to scale
  • Consolidate the entry points, it becomes easier to optimise

Lessons learnt

  • Have more upfront, for instance 64 VMs, so that you can scale up to 64 machines if needed.
  • Redistributing/rebalancing shards is a nightmare, plan far ahead.
  • PUSH to the cache, don't PULL: otherwise if the data is not there, 5000 users might request it at the same time and suddenly you have 5000 hits to the database. Cache everything, it's easier to invalidate (everything is cached 5 minutes in memcached in their system)
  • Write counters to denormalise views (updated via queues, stored in Redis I think)
  • Push everything to a queue from the start, it will make processing faster -- there is no excuse, Celery is so easy to set up
  • Don't write database triggers, handle the trigger logic in your queue
  • Database pagination is slow and crappy: LIMIT 0, 1000 may be ok -- LIMIT 1000, 2000 and suddenly the database has to count rows, it gets slower and consumes CPU and memory. There are easier ways to do pagination, he likes to do id chunks and select range of ids, it's very quick.
  • Build with future sharding in mind. Think massive, use Puppet.
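The id-chunk pagination from the list above can be tried out with the sqlite3 module from the standard library. The table and data are invented for the example; the point is that each query seeks straight to an id through the primary key instead of skipping over an ever-growing offset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO tweet (id, body) VALUES (?, ?)",
    [(i, "tweet %d" % i) for i in range(1, 26)],
)


def id_chunks(conn, chunk_size):
    """Yield rows in chunks, seeking by primary key instead of using OFFSET."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, body FROM tweet WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # the next chunk starts after this id


chunks = list(id_chunks(conn, 10))
```

With 25 rows and a chunk size of 10 this produces chunks of 10, 10 and 5 rows, and it keeps working if there are gaps in the ids.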

One of the questions was: does that mean there are 7 million cache misses if someone deletes a tweet? Answer: Yes indeed.


Testing django admin customisations

In preparation for an upgrade, I've been writing unit tests for a Django app with the help of a fantastic book -- Django 1.1 Testing and Debugging by Karen M. Tracey, there'll be a review coming up when I finish it.

I had some issues with the code to test admin customisations (Chapter 4), I want to document the changes I made for future reference.

Error 403

Despite using client.login() in the setUp() method, response returned with a status_code of 403 (Forbidden) when creating a new item.

class AdminCustomisationTest(TestCase):
    
    def setUp(self):
        username = 'test_user'
        pwd = 'secret'

        self.u = User.objects.create_user(username, '', pwd)
        self.u.is_staff = True
        self.u.is_superuser = True
        self.u.save()

        self.assertTrue(self.client.login(username=username, password=pwd),
            "Logging in user %s, pwd %s failed." % (username, pwd))

        Survey.objects.all().delete()

    def tearDown(self):
        self.client.logout()
        self.u.delete()

    def test_add_survey_ok(self):
        self.assertEquals(Survey.objects.count(), 0)

        post_data = { 'title': u'Test OK',
                      'open_date': datetime.now(),
                    }

        response = self.client.post(reverse('admin:survey_survey_add'), post_data)

        self.assertRedirects(response, reverse('admin:survey_survey_changelist'))
        self.assertEquals(Survey.objects.count(), 1)

At first I blamed some customisations I did to the authentication backends to allow OpenID, however printing the response.content revealed otherwise:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><body><h1>403 Forbidden</h1><p>Cross Site Request Forgery detected. Request aborted.</p></body></html>

The Django CSRF protection system prevented the tests from passing. Note that this app runs on Django 1.1, Django 1.2 overhauled the CSRF system and would likely work without problems.

I'm sure there are more elegant ways to solve this. Here's a method that works:

    def get_csrf_token(self, response):
        csrf = "name='csrfmiddlewaretoken' value='"
        start = response.content.find(csrf) + len(csrf)
        end = response.content.find("'", start)

        return response.content[start:end]

    def test_add_survey_ok(self):
        self.assertEquals(Survey.objects.count(), 0)

        response = self.client.get(reverse('admin:survey_survey_add'))
        csrf = self.get_csrf_token(response)
        post_data = { 'title': u'Test OK',
                      'open_date': datetime.now(),
                      'csrfmiddlewaretoken': csrf,
                    }

        response = self.client.post(reverse('admin:survey_survey_add'), post_data)

        self.assertRedirects(response, reverse('admin:survey_survey_changelist'))
        self.assertEquals(Survey.objects.count(), 1)

(Note/Update: If I had waited until the next chapter, I would have found out how to integrate Twill with Django apps and none of this would have been necessary -- haha!)

This field is required (datetime)

After this, although the response.status_code was finally 200, the survey still wasn't created. Peering through response.content showed that the datetime field was considered to be missing. This is what post_data should look like instead for a datetime field:

        post_data = { 'title': u'Test OK',
                      'open_date_0': '2011-03-17',
                      'open_date_1': '9:50:00',
                      'csrfmiddlewaretoken': csrf,
                    }

Test passes!
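To avoid typing the _0/_1 keys by hand in every test, a small helper can build them from a datetime. The helper name is mine; it just mirrors the two keys shown above, which is how the admin's split date/time widget names its inputs.

```python
from datetime import datetime


def split_datetime_field(name, value):
    """Build the two POST keys the admin's split date/time widget expects."""
    return {
        "%s_0" % name: value.strftime("%Y-%m-%d"),
        "%s_1" % name: value.strftime("%H:%M:%S"),
    }


post_data = {"title": u"Test OK"}
post_data.update(
    split_datetime_field("open_date", datetime(2011, 3, 17, 9, 50, 0))
)
```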


Django admin actions and intermediate pages: Tagging multiple articles at once

The other day I created an "Ireland" tag on this blog and thought, hey, wouldn't it be cool to go back and tag all the posts I wrote about events in Ireland, so I have an easy way to see what's going on locally?

Of course retagging each post manually would suck so I decided to create a new Django admin action, with an intermediate page that would present me with my current list of tags so I could select one, then update.

I also wanted:

  1. To keep the Django admin look and feel
  2. To have the nice admin message "X posts were successfully updated"

Now the Django documentation is very clear in explaining and showing how to do each of those, but not together.

Here's how I ended up doing it, see links at the end for other helpful resources. I'm quite pleased with this solution because it's done the Django way (through the use of forms, etc) rather than by working around Django.

Directly in the ModelAdmin section for blog posts, create an inner class for the new form (tag picker in my case), and a function to process the form like you would in a normal view. Don't forget to add the new method to the actions list.

    actions = ['add_tag']

    class AddTagForm(forms.Form):
        _selected_action = forms.CharField(widget=forms.MultipleHiddenInput)
        tag = forms.ModelChoiceField(Tag.objects)

    def add_tag(self, request, queryset):
        form = None

        if 'apply' in request.POST:
            form = self.AddTagForm(request.POST)

            if form.is_valid():
                tag = form.cleaned_data['tag']

                count = 0
                for article in queryset:
                    article.tags.add(tag)
                    count += 1

                plural = ''
                if count != 1:
                    plural = 's'

                self.message_user(request, "Successfully added tag %s to %d article%s." % (tag, count, plural))
                return HttpResponseRedirect(request.get_full_path())

        if not form:
            form = self.AddTagForm(initial={'_selected_action': request.POST.getlist(admin.ACTION_CHECKBOX_NAME)})

        return render_to_response('admin/add_tag.html', {'articles': queryset,
                                                         'tag_form': form,
                                                        })
    add_tag.short_description = "Add tag to articles"

And here's my intermediate page (quite simple!), which uses the admin look & feel and is located in templates/admin/add_tag.html.

{% extends "admin/base_site.html" %}

{% block content %}

<p>Select tag to apply:</p>

<form action="" method="post">

    {{ tag_form }}

    <p>The tag will be applied to:</p>

    <ul>{{ articles|unordered_list }}</ul>

    <input type="hidden" name="action" value="add_tag" />
    <input type="submit" name="apply" value="Apply tag" />
</form>

{% endblock %}

The hidden field is essential for Django to recognise the form submission as an admin action.

Update: Follow this ticket to learn how to apply an admin action to selected items on multiple pages -- or look at the comments below for tips!

Example of the code in action:

Selecting articles and the new admin action --> Select the tag to apply --> The message is displayed

Resources:


Django: GROUP BY datetime

While trying to make the archive pages of this site more useful, I had a bit of trouble working with Django's annotation feature to give me the numbers I was looking for. I wanted to show on the index page how many posts were published in a given month. This ticket was very helpful to figure out how to do it, and as a bonus it means it will work more simply in future-Django :)

At the moment, the workaround is unfortunately database specific. The ticket contains an example using SQLite, here's how I did it using PostgreSQL. I think with MySQL you'd need to play with YEAR() and MONTH() to get the same.

bymonth_select = {"month": """DATE_TRUNC('month', creation_date)"""} # Postgres specific
months = Article.objects.exclude(draft=True).extra(select=bymonth_select).values('month').annotate(num_posts=Count('id')).order_by('-month')

I removed a couple of filters for readability, but this is basically a normal QuerySet with a couple of Django caveats:

  1. Write your filters before the annotate call, as the Count() processes only what comes before it
  2. .values('blah') for a GROUP BY effect, without it you'll get '1' for every record fetched back
  3. .order_by(), with or without an argument otherwise things will likely get messed up by the order_by of your Meta class if any -- inspired by comment #8 of the ticket linked above, thanks!
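For reference, here's roughly the kind of SQL this boils down to, runnable with the sqlite3 module from the standard library. It uses SQLite's strftime where Postgres has DATE_TRUNC, and the table and rows are invented for the example, so treat it as a sketch of the grouping rather than the exact query Django generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article ("
    "id INTEGER PRIMARY KEY, creation_date TEXT, draft INTEGER)"
)
conn.executemany(
    "INSERT INTO article (creation_date, draft) VALUES (?, ?)",
    [
        ("2011-03-02 10:00:00", 0),
        ("2011-03-17 09:50:00", 0),
        ("2011-03-20 12:00:00", 1),  # draft, excluded below
        ("2011-06-21 18:30:00", 0),
    ],
)

# Filter first, then group by the truncated date, then order explicitly.
rows = conn.execute(
    """
    SELECT strftime('%Y-%m', creation_date) AS month, COUNT(id) AS num_posts
    FROM article
    WHERE draft = 0
    GROUP BY strftime('%Y-%m', creation_date)
    ORDER BY month DESC
    """
).fetchall()
```

The draft from March is filtered out before the count, which matches caveat 1: the filter has to come before the aggregation.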

Django + mod_wsgi and PHP: Slowly improving my httpd.conf-fu

Initially I followed the Django documentation and the mod_wsgi documentation to run Django on my local Apache server, but after installing mod_php5 the 2 modules began conflicting with each other when I wanted both to run peacefully in parallel.

PHP was to keep the default port 80 and Django was to run on port 9000. Here's how I triumphed over the slew of Error 500:

1. <VirtualHost *:9000> for WSGI

2. Add the following statement before the <VirtualHost *:9000> line (that's the bit not mentioned in the documentation I was following, and that I want to document here for $future_self...)

Listen 127.0.0.1:9000

3. Update the MEDIA_URL in settings.py to reference port 9000 as well.

And they all lived on their respective port happily ever after.


InternalError: current transaction is aborted, commands ignored until end of transaction block (Take Two)

Hm... Starting to get the feeling this is the catch-all exception for Django, considering the wide range of scenarios in which people slam themselves against this error. (Although it's more frequent when using psycopg2, I hear.)

InternalError: current transaction is aborted, commands ignored until end of transaction block

This time, I was trying to restore a postgresql backup and that's all Django would give me locally. Turns out the users don't match between the servers, so no permissions were granted to anyone on the newly created database (\c <database>, \dp). GRANT ALL on every table, and my restore worked handsomely. (Edit: This command helps. A lot.)

(For another take on this error, see this previous post.)


InternalError: current transaction is aborted, commands ignored until end of transaction block

Just because my own Googling wasn't particularly fruitful when I encountered the following error:

InternalError: current transaction is aborted, commands ignored until end of transaction block

Turns out I had forgotten to update my database and run a syncdb after adding a new module. Nothing in the traceback hinted at that module (the last line of my code it was failing on was context_instance=RequestContext(request) in the render_to_response call), and the new field wasn't directly referenced either.

My hope is that if I'd had DEBUG set to True the error message would have been more helpful, but because it was on a production server I was reluctant to start with that (I reverted the code to the old working version instead and started investigating).


Welcome

A few months back, while investigating the technologies I was less familiar with for a Python job that looked pretty cool, I finally got to properly discover Django.

Until then I thought that when I would finally get around to learn a new web framework it would be Perl based, but without any project I particularly cared about my attempts kinda kept fizzling out. I went a bit further with Django, likely because it's such a breeze to set up for development, and their website is awesome for those just starting out with the framework. They have all the buzzwords in all the right places, hint at all that's possible to do ("other batteries included", yes indeed!), but most importantly they offer a fantastic tutorial that showcases how powerful Django is, while not taking a huge amount of time to go through. Like I suspect most people, I was awed when playing with the admin site for the first time. Such a painstakingly repetitive part of any webapp... Now fun again!

Life got busy after this and I stopped poking around, but as soon as I got some breathing time an old idea popped back into my head, making my own website / portfolio for the stuff I care about; open-source, education, software development. And getting back to writing regularly so I can get better at it, maybe putting together tutorials about all the cool stuff I end up clashing against and figuring out.

After a month or so of fun with Django in the evening, here we are with version 0.1. It should be fun!
