EuroPython 2011: Mark Ramm on Relate or !Relate

Link: Talk description and video

This talk was about non-relational databases. I didn't take a lot of notes :o) The most important morale is probably: don't keep the mind altering substances and the tools in the same shed.

With 2 decades of relational databases, they are pretty robust by now. They cover different spectrum of ACID compliance ; for instance MySQL is faster, Postgres is more reliable (though becoming faster... if you tick off the reliability options!). Relational databases are supposed to be normalised, except they are not really: there is also a spectrum here as databases tend to get denormalised for performance reasons.

Amazon uses an "eventually consistent" system, which they can pull off by charging at shipping time only. Conflicts are rare, if 2 orders are placed and there is only 1 item available, someone might get a gift certificate instead.

The NoSQL taxonomy includes wildly different tools that don't have much in common except for the fact that they don't use SQL: key-value stores, document stores, ....

CAP: Consistency, Availability, Partition tolerance. You can have 1 or 2, not all 3. (Brewer's Theorem)

There was only 1 Postgres database for all of SourceForge for a long time, while they were in the top 100 sites. Don't obsess about scale you'll never achieve.

One of the question was about how difficult it is to convert from a relational database to NoSQL. The answer is, from something like Postgres to MongoDB, it wouldn't be that much work (he did suggest 4 people 6 weeks though, which doesn't sound that trivial to me). Changing to Cassandra on the other hand would be a huge effort.

Leave a comment

EuroPython 2011: Simon Willison on Challenges in developing a large Django site

Links: talk description and video and slides.


Simon Willison is the co-founder lanyrd.com, a social website for conferences.

Tips and tricks

Signing (from 1.4, currently in trunk)

Using cryptographic signing for various things can ensure that they haven't been tampered with, for instance a cookie or an unsubscribe link. If you encrypt your session cookies you don't have to hit the database anymore, you just need to check the proper signed cookie.

The speaker showed a couple of short code examples to demonstrate how simple it is to use, and how the interface is consistent with the other serialisation interfaces.

from django.core import signing
signing.dumps({"foo": "bar"})  # url safe
signing.loads(string)

cache_version

This is another way to do cache invalidation. You add a cache_version field to the model, that is incremented when calling the save() hook or a touch() method. In the template cache fragment, you use the primary key and the cache_version to invalidate.

You can also mass invalidate by updating the cache version of objects.all() using F() -- example from the slides:

topic.conferences.all().update(
    cache_version = F('cache_version') + 1
)

noSQL for denormalisation

Use noSQL to denormalise and keep the database and the cache/nosql in sync. It's more work but it's worth it.

For instance they use Redis sets to maintain lists such as username-follows, europython-attendees and then they simply need to do a set intersection to get the information they want. These are only lists of ids so they don't take that much space.

Hashed static asset filenames in CloudFront

They created a management command to push static assets, that compresses Javascript, changes the names/urls, etc. This way they can publish them in advance, and also keep static files around if there's a need to rollback. The different names are also good to prevent Internet Explorer caching.

Challenges

This part of the talk is about things they don't really have answers for.

HTTP Requests

e.g. talking to an API: what if it fails or take 30 seconds? Do you use urllib? What if people enter private urls from within your Intranet? :O

You have to handle connection timeouts, logging and profiling, url validation, and http caching. All of these are a common set of problems that should be baked into the framework.

Profiling and debugging production problems

Debugging in development rocks, with the django-debug-toolbar, the way error 500 are handled, pdb, etc.

Once you turn debug to False, you're blind. After a while, all the bugs, particularly performance bugs, only happen in production.

He showed us a code snippet for a UserBasedExceptionMiddleware, that if you access the page throwing a 500 error and is_superuser is True, you will see a traceback, not the default 500 error (so if one of your users reports a problem, you can go to the page straight off and see a traceback).

At the database level, there is a handy tool called mysql-proxy that is customisable using Lua. Using a wonderful, horribly documented library called log.lua, you can for instance turn on logging for a couple of minutes when needed.

He created an app called django_instrumented (unreleased, until it's cleaned up) that collects statistics and sticks them into memcached. He has a special bookmark to access them, they are stored for 5 minutes only  -- so they waste neither space or time.

This actually helped improve the performance: if you measure something and make it visible, people will improve it over time.

0 downtime deployments

Code-wise it's easy enough to do, but when there are database changes it's tougher. Ideally they try to make schema changes backwards compatible, then use ./manage.py migrate (using South) on another web server.

Having a read-only mode made a lot of problems easier! It's not 0 downtime but the content is still readable. It can be a setting or a Redis key.

Feature flags work in the same way but at a more fine-grained level, for instance turning off search while you update your solr cluster. There's quite a bit more work involved.

One lesson we keep on learning in Django

We went from one database to multi-databases, from one cache to multi-caches, from one haystack backend to multiple backends.

Debug is one single setting, that affects a lot of things.

The timezone setting also affects Apache log files.

The middleware concept is very powerful, but is executed on every single request: if there's a conditional it has to be done within the middleware.

Really, global settings should be flushed out of the project! They are evil settings that cannot be changed at runtime.

Leave a comment

EuroPython 2011: David Cramer on building scalable websites

Link to talk description and video (videos should be public next week I believe)


Performance (e.g. a request should return in less than 5 seconds) is not the same as scalability (e.g. a request should ALWAYS return in less than 5 seconds). Fortunately, it turns out that when you start working on scalability you usually end up improving performance as well -- note that this doesn't work the other way around.

Common bottlenecks

The database is almost always an issue.

Caching and invalidation help.

They use Postgres for 98% of their data, it works great on good hardware with one master only (Disqus, his company, uses Django to serve 3 billion page views a month)

Packaging matters

Packaging is key: it lets you repeat your deployment, makes it repeatable which is incredibly useful even when you're working by yourself. Unfortunately there are too many ways to do packaging in Python, and none that solves all the problem. He uses setuptools, because it usually works.

Plenty of benefits to packaging:

  • The handy 'develop' command installs all the dependencies.
  • Dependencies are frozen.
  • It's a great way to get a new team member quickly set up.

Then, they use fabric to deploy consistently.

Database(s)

This applies to any kind of datastore, which are the usual bottleneck. It can become difficult to scale once there is more than one server.

The rest of the talk uses a Twitter clone as an example.

For the public timeline, you select everything and order it by date. It's ok if there is only 1 database server, otherwise you need to use some sort of map/reduce variant to get it working. The index on date will be fairly heavy though. It's quite easy to cache (add tweet to a queue whenever it's added), and invalidate.

For personal timelines, you can use vertical partitioning, with the user and tweets on separate machines. Unfortunately this means a SQL JOIN is not possible. Materialised views are a possible answer but there aren't supported by many databases (for instance it's not supported by MySQL. MySQL will generate a view by rerunning the query everytime, which means you can't index it).

Using Postgres and Redis, you can have a sorted set, using the tweet id with the timestamp as its weight (will become ordering). Note that you can't have a never ending long tail of data, data will be truncated after 30 days or whatever (remove the data from Redis).

Now the new problem is to scale Redis! You can partition per user, say if you keep 1000 tweets per user you can know how much space a user will take, and how many you can have per server.

See: github.com/disqus/nydus to package cluster of connections to Redis, it can be used like (?) a Django database. They store 64 redis nodes on the same machine in virtual machines.

Vertical vs. Horizontal partitioning

You can have:

  • Master database with no indexes, only primary keys
  • A database of users
  • A database of tweets

So far the hardware scales at the same time as their app. If you need more machines, more RAM, it's cheap enough, and when you need it again in a few years it will be the same price.

Asynchronous tasks

Using Rabbit and Celery, you can use application triggers to manage your data, e.g. a signal on a model save() hook that adds the new item to a queue after it's been added to the database. This way, when the worker starts on the task it can add the new tweet to all the caches without blocking (e.g. if someone has 7 million followers, their tweet needs to be added to 7 million streams)

Building an API

Having an API is important to scale your code and your architecture. Making sure that all the places in your code (the Django code, the Redis code, the REST part, whatever) all use the same API, or are refactored to use the same API so that you can change them all in one place.

To wrap up

  • Use a framework (like Django, to do some of the legwork for you), then iterate. Start with querying the database then scale.
  • Scaling can lead to performance but not the other way around.
  • When you have a large infrastructure, architecture it in terms of services, it's easier to scale
  • Consolidate the entry points, it becomes easier to optimise

Lessons learnt

  • Have more upfront, for instance 64 VMs, so that you can scale up to 64 machines if needed.
  • Redistributing/rebalancing shards is a nightmare, plan far ahead.
  • PUSH to the cache, don't PULL: otherwise if the data is not there, 5000 users might request it at the same time and suddenly you have 5000 hits to the database. Cache everything, it's easier to invalidate (everything is cached 5 minutes in memcached in their system)
  • Write counters to denormalise views (updated via queues, stored in Redis I think)
  • Push everything to a queue from the start, it will make processing faster -- there is no excuse, Celery is so easy to set up
  • Don't write database triggers, handle the trigger logic in your queue
  • Database pagination is slow and crappy: LIMIT 0, 1000 may be ok -- LIMIT 1000, 2000 and suddenly the database has to count rows, it gets slower and consumes CPU and memory. There are easier ways to do pagination, he likes to do id chunks and select range of ids, it's very quick.
  • Build with future sharding in mind. Think massive, use Puppet.

One of the questions was: does that mean there are 7 million cache misses if someone deletes a tweet? Answer: Yes indeed.

Leave a comment