EuroPython 2011: Simon Willison on Advanced Aspects of the Django Ecosystem

Links: talk description, video and slides.

This talk is about three tools that can be considered secret weapons: they offer great payoffs for little effort.

(Note: the slides illustrate most of these concepts with lots of code examples -- have a look!)

Haystack

Haystack provides modular full-text search for Django. It's very easy to get a nice search interface if you already use the Django ORM, and the queryset used for indexing can be defined to limit search to what you want (e.g. only published entries).
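
As a sketch, an index for a hypothetical Entry model might look like this (using today's Haystack 2.x API, so the details differ slightly from the 2011-era version shown in the talk):

    from haystack import indexes
    from myapp.models import Entry

    class EntryIndex(indexes.SearchIndex, indexes.Indexable):
        text = indexes.CharField(document=True, use_template=True)
        pub_date = indexes.DateTimeField(model_attr='pub_date')

        def get_model(self):
            return Entry

        def index_queryset(self, using=None):
            # Only published entries end up in the search index.
            return self.get_model().objects.filter(published=True)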

You can have different templates/html bits depending on the type of objects returned by the search.

Scaling/Backend

  • Whoosh (Python) - Good if you have no write traffic, and not a lot of traffic in general
  • Xapian (C++)
  • Solr (Java) tends to be the default choice. It has an HTTP interface, and tons of things are already baked into it, like filtering down by object type. Objects can be associated with a template, although it sounds like it's more about relevance than display: the speaker mentioned showing the title twice in the template to increase its weight in search results. Filtering with SearchQuerySet scales well and is faster than complicated, crazy SQL.

Search indexes usually don't like being updated too often. Haystack offers several strategies. Sites with low write traffic can update the index in real time on every change. Or changes can be batched, say every 6 hours. At a higher scale, you have to roll your own solution. For Lanyrd they added a "needs_indexing" boolean to their models that defaults to True and is set back to True in the save() hook. Then, using a management command or something similar, it's possible to look at what needs to be indexed, process it and set the flag to False.
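
A minimal sketch of that "needs_indexing" pattern (model and field names here are illustrative, not Lanyrd's actual code):

    from django.db import models

    class Entry(models.Model):
        title = models.CharField(max_length=200)
        needs_indexing = models.BooleanField(default=True)

        def save(self, *args, **kwargs):
            # Any change flags the object for the next indexing run.
            self.needs_indexing = True
            super(Entry, self).save(*args, **kwargs)

A management command (or cron job) can then grab Entry.objects.filter(needs_indexing=True), push those objects to the search backend, and reset the flag.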

Solr has master/slave replication and knows how to balance reads between slaves; writes should be sent to the master. Haystack only knows how to talk to one URL, but with nginx it's possible to load balance and to proxy differently depending on the URL, making sure the writes go where they should -- remember, Solr speaks HTTP.

Celery

Celery is a fully featured, distributed task queue manager. Any task that would take more than 200ms should go on the queue! For instance...

  • Updating the search index
  • Resizing images
  • Hitting external APIs
  • Generating reports

Using the @task decorator, the function still works normally if called directly, but also gains a delay() method that puts the call on the queue to be picked up by workers.
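
For example (using the Celery 2.x-era import that was current at the time; resize_image is a made-up task):

    from celery.task import task

    @task
    def resize_image(image_id, size):
        # ...open the image, resize it, save the thumbnail...
        pass

    resize_image(42, '640x480')        # runs synchronously, like a normal call
    resize_image.delay(42, '640x480')  # queued, picked up by a worker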

For tasks launched by users (such as uploading a picture or figuring out what's at a URL):

  • To deal with people with or without JavaScript: if 'ajax' in request.POST, show an inline loading indicator, otherwise redirect (see the view sketch after this list).
  • Use memcached for automatic housekeeping: if the user closes the browser and doesn't come back, the task's status isn't kept around forever -- the oldest entries get dropped automatically after a few hours.
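
A hypothetical view sketch of that pattern (save_upload() and process_picture are made-up helpers, process_picture being a Celery task like the one above):

    from django.http import HttpResponseRedirect
    from django.shortcuts import render_to_response

    def upload_picture(request):
        # Stash the raw upload somewhere, then queue the slow processing.
        upload_id = save_upload(request.FILES['picture'])
        result = process_picture.delay(upload_id)
        if 'ajax' in request.POST:
            # JavaScript clients get an inline loading indicator to poll with.
            return render_to_response('uploads/loading.html',
                                      {'task_id': result.task_id})
        # Non-JavaScript clients get redirected to a "pending" page instead.
        return HttpResponseRedirect('/uploads/pending/%s/' % result.task_id)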

Use celerybeat for scheduling, celerymon for monitoring the worker cluster, and celerycam to take snapshots -- this helps figure out when/where things go wrong.
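
A sketch of a celerybeat schedule, e.g. for the "reindex every 6 hours" idea mentioned earlier (Celery 2.x-era settings; the task name is made up):

    from datetime import timedelta

    CELERYBEAT_SCHEDULE = {
        'update-search-index': {
            'task': 'search.tasks.update_index',
            'schedule': timedelta(hours=6),
        },
    }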

The activity stream pattern gives every user an "inbox": when something needs to reach everyone, like a tweet, it gets written into each follower's stream. Redis can handle 100,000 writes/second and is a handy tool for this; it's also exactly the kind of task that's a great candidate for queueing.
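
A sketch of that fan-out, assuming the redis-py client and a hypothetical followers_of() helper; running it as a Celery task keeps the thousands of writes out of the request/response cycle:

    import json
    import redis
    from celery.task import task

    r = redis.Redis()

    @task
    def fan_out_tweet(author_id, tweet):
        payload = json.dumps(tweet)
        for user_id in followers_of(author_id):
            # One write per follower: push onto their inbox and cap its length.
            r.lpush('inbox:%d' % user_id, payload)
            r.ltrim('inbox:%d' % user_id, 0, 999)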

Fabric

Fabric is great for automated and repeatable deployments, and it also makes it easier to roll back. You could use Chef or Puppet, which are ridiculously powerful but quite complex to set up. Fabric generally fits the developer's mental model better: it essentially wraps your current processes in Python.

For instance, you can create a clear_cache() task that calls flush_all() on the cache. Then, to clear the cache on your servers, run from your machine:

fab -H host1,host2 clear_cache
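
The fabfile behind that command could be as small as this (a sketch that assumes memcached listening on its default port; the exact flush command is illustrative):

    from fabric.api import run

    def clear_cache():
        # Runs on every host passed with -H, flushing its memcached instance.
        run("echo 'flush_all' | nc -q 1 localhost 11211")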

The file (fabfile.py) is version controlled, thereby documenting your process -- so you don't forget how to do it 6 months from now!

env is a global object used by Fabric; you can add your own variables to it and reuse them in other commands, for instance env.deploy_date to store the deployment date and time, which makes it easier to roll back and roll forward (he uses symlinks).
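
Roughly how that could look (paths and repository URL are made up; the timestamped release directory plus symlink is the idea from the talk):

    from datetime import datetime
    from fabric.api import env, run

    def deploy():
        env.deploy_date = datetime.now().strftime('%Y%m%d-%H%M%S')
        env.release_dir = '/srv/myapp/releases/%s' % env.deploy_date
        run('git clone git://example.com/myapp.git %s' % env.release_dir)
        run('ln -sfn %s /srv/myapp/current' % env.release_dir)

    def rollback(release):
        # Point the "current" symlink at any earlier (or later) release.
        run('ln -sfn /srv/myapp/releases/%s /srv/myapp/current' % release)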

They use a servers.json configuration file that documents each instance's id, public DNS, private DNS, name and roles (solr, memcached, etc). Fabric can use it to deploy, nginx can use it to load balance, and Django can import it in the settings to know what to talk to.
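
A sketch of consuming such a file from Django settings (the JSON structure shown is an assumption):

    import json

    with open('servers.json') as f:
        SERVERS = json.load(f)

    # e.g. point Django's cache at every host that has the "memcached" role:
    CACHES = {
        'default': {
            'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
            'LOCATION': ['%s:11211' % server['private_dns']
                         for server in SERVERS
                         if 'memcached' in server['roles']],
        },
    }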
