Archives for June 2011


Upgrading to Moodle 2.0

Over the week-end, I thought it was time to see how my Moodle block handled the upgrade to 2.0. I did a brand new install of Moodle 1.9-latest, restored my old SQL dev backup, anxiously proceeded to start the upgrade and... ta-dah! Short answer: it doesn't work. Here are my notes on a couple of issues I encountered during the upgrade, so I don't waste time when I try again after updating the plug-in.

Upgrading: "Error: database connection failed"

I use PostgreSQL and after uploading the 2.0 code the site would only display "Error: database connection failed". I'm not sure what the best way to move past this is, but here's what worked for me. The configuration string for Postgres databases always looked strange in config.php, something like this:

$CFG->dbhost    = 'user=\'mydbuser\' password=\'mydbpass\' dbname=\'moodle\'';

I changed it to this:

$CFG->dbhost    = 'localhost';
$CFG->dbname    = 'moodle';
$CFG->dbuser    = 'mydbuser';
$CFG->dbpass    = 'mydbpass';

and happily, I was able to move on. Unfortunately I had forgotten to turn off debugging (activated from the SQL backup), and there may have been some CSS/theme trickery to set up before upgrading so... the upgrade screens were quite bare and sad looking.

Plugin "block/dvreport" is defective or outdated, can not continue, sorry.

The upgrade is quite a depressing process. Whenever it encounters a module that won't work with 2.0, it just shows this message about "defective" and "outdated" and won't move forward until the 'defective' module is removed. I just rm -rf'ed them since this is a dev system, but this was a fresh install, and besides DVReport these used to be standard modules. The migration path for people who actually used them must be painful.

No CSS, text-only homepage

After the upgrade, search for "Theme selector" and select a theme.

Next?

Well, there is some documentation on migrating code to work with Moodle 2.0 so I'll start there. It's a stub though, and looks fairly incomplete. Reminds me of the initial work on DVReport: Moodle 1.7 came out when I was starting out and contributors had to figure out all the novelties like the new Role system before they were documented. I can do this again, just need to make the time!


Getting Hibernate (Suspend to disk) working Linux Debian Squeeze / Lenovo Thinkpad X201

Here's how I got Hibernate working on my Thinkpad X201, currently running Debian Squeeze.

First, make sure you have a swap partition at least as large as your RAM. Following that little incident last year, my swap had a different UUID and I had never realised it was no longer being mounted. Manually editing /etc/fstab with the UUID from the useful blkid output did the trick.

Second, install uswsusp, helpfully discovered thanks to the ThinkWiki.

And that's it! Hibernate works, through the Shutdown menu and pm-hibernate, and so far Suspend doesn't seem to have been adversely affected. Woohoo!


EuroPython 2011

Tomorrow I fly to Italy to make my way over to EuroPython 2011 in Florence. I am tremendously looking forward to it! With 5 days of talks and 2 days of sprints (though probably only one for me) this will be the longest conference I've ever attended, we'll see how I do :-)

The conference looks wonderfully organised. I've been really impressed with everything the all-volunteer staff has already accomplished to make the event as enjoyable as possible for everyone (the scheduling with estimated attendance vs. room size, the data SIM, the cultural events...). Can't wait!

Some of the talks I look forward to are...

There are of course many-many-many more talks I highlighted on my schedule, I expect these to be my conference highlights. From past experience though I know I'll be amazed at plenty more!

Now of course, this may be compromised if I fail to hear my alarm clock at 3:45 tomorrow morning :O


Origami workshop in Tog Friday, July 8th

Interested in learning origami? I'm organising a free 2-hour origami workshop in Tog, on July 8th (a Friday) from 7pm to 9pm, kindly taught by Jamie O'Leary. Complete beginners and intermediate levels are very welcome, though if more advanced students would like to drop by and give a hand, they're welcome too!

For more information, including how to register, see the announcement.


Know some cool craft, involved in an interesting open-source project and would like to share the love and/or teach about it? Contact me at julie AT this domain (or see my about page) and let's have a chat! We'd love to see more of these workshops in Tog, if you're around Dublin.


EuroPython 2011: David Cramer on building scalable websites

Link to talk description and video (videos should be public next week I believe)


Performance (e.g. a request should return in less than 5 seconds) is not the same as scalability (e.g. a request should ALWAYS return in less than 5 seconds). Fortunately, it turns out that when you start working on scalability you usually end up improving performance as well -- note that this doesn't work the other way around.

Common bottlenecks

The database is almost always an issue.

Caching and invalidation help.

They use Postgres for 98% of their data; it works great on good hardware with a single master. (Disqus, his company, uses Django to serve 3 billion page views a month.)

Packaging matters

Packaging is key: it makes your deployment repeatable, which is incredibly useful even when you're working by yourself. Unfortunately there are too many ways to do packaging in Python, and none that solves all the problems. He uses setuptools, because it usually works.

Plenty of benefits to packaging:

  • The handy 'develop' command installs all the dependencies.
  • Dependencies are frozen.
  • It's a great way to get a new team member quickly set up.

Then, they use fabric to deploy consistently.

Database(s)

This applies to any kind of datastore, since datastores are the usual bottleneck. It can become difficult to scale once there is more than one server.

The rest of the talk uses a Twitter clone as an example.

For the public timeline, you select everything and order it by date. It's ok if there is only 1 database server, otherwise you need to use some sort of map/reduce variant to get it working. The index on date will be fairly heavy though. It's quite easy to cache (add tweet to a queue whenever it's added), and invalidate.

For personal timelines, you can use vertical partitioning, with the users and tweets on separate machines. Unfortunately this means a SQL JOIN is not possible. Materialised views are a possible answer but they aren't supported by many databases (for instance MySQL doesn't support them: it generates a view by rerunning the query every time, which means you can't index it).

Using Postgres and Redis, you can have a sorted set, using the tweet id with the timestamp as its weight (will become ordering). Note that you can't have a never ending long tail of data, data will be truncated after 30 days or whatever (remove the data from Redis).
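To make the sorted-set idea concrete, here is a minimal sketch in plain Python. It does not talk to Redis at all: the `Timeline` class is a hypothetical in-memory stand-in that keeps tweet ids ordered by timestamp (the "weight") and drops anything older than a cut-off, which is my reading of the truncation described above.

```python
import bisect


class Timeline:
    """Hypothetical in-memory stand-in for one Redis sorted set."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self._entries = []  # always sorted list of (timestamp, tweet_id)

    def add(self, tweet_id, timestamp):
        # The timestamp acts as the score, so insertion keeps the ordering.
        bisect.insort(self._entries, (timestamp, tweet_id))

    def truncate(self, now):
        # Remove everything older than the cut-off: no endless long tail.
        cutoff = now - self.max_age_seconds
        index = bisect.bisect_left(self._entries, (cutoff,))
        del self._entries[:index]

    def latest(self, count):
        # Newest first, like reading a sorted set in reverse score order.
        newest = self._entries[-count:] if count > 0 else []
        return [tweet_id for _, tweet_id in reversed(newest)]
```

In a real setup the same three operations would map onto sorted-set commands on the Redis side; the 30-day window is just the figure quoted in the talk.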

Now the new problem is to scale Redis! You can partition per user, say if you keep 1000 tweets per user you can know how much space a user will take, and how many you can have per server.
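The capacity reasoning can be written down as a couple of lines of arithmetic. This is only a sketch: the 64 bytes per stored entry and the modulo-based shard choice are assumptions of mine for illustration, not numbers or a scheme given in the talk.

```python
def users_per_server(server_memory_bytes, tweets_per_user=1000, bytes_per_entry=64):
    """How many users fit on one server if each keeps a fixed-size timeline.

    bytes_per_entry is an assumed average cost per stored tweet id.
    """
    return server_memory_bytes // (tweets_per_user * bytes_per_entry)


def shard_for(user_id, shard_count=64):
    """Pick a shard from the user id (simple modulo, assumed for illustration)."""
    return user_id % shard_count
```

Because every user has a bounded footprint, the number of users per node is predictable, which is what makes partitioning per user attractive here.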

See: github.com/disqus/nydus to package a cluster of connections to Redis, it can be used like (?) a Django database. They store 64 redis nodes on the same machine in virtual machines.

Vertical vs. Horizontal partitioning

You can have:

  • Master database with no indexes, only primary keys
  • A database of users
  • A database of tweets

So far the hardware scales at the same time as their app. If you need more machines, more RAM, it's cheap enough, and when you need it again in a few years it will be the same price.

Asynchronous tasks

Using Rabbit and Celery, you can use application triggers to manage your data, e.g. a signal on a model save() hook that adds the new item to a queue after it's been added to the database. This way, when the worker starts on the task it can add the new tweet to all the caches without blocking (e.g. if someone has 7 million followers, their tweet needs to be added to 7 million streams)
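Here is a rough sketch of that fan-out pattern using only the standard library. The `queue.Queue` and the worker thread stand in for Rabbit and Celery, and `followers`, `timelines`, `save_tweet` and `run_demo` are hypothetical names of mine; the point is only that the "save" returns as soon as the task is queued.

```python
import queue
import threading
from collections import defaultdict

followers = {"alice": ["bob", "carol"]}  # hypothetical follower graph
timelines = defaultdict(list)            # stand-in for the per-user caches
tasks = queue.Queue()                    # stand-in for the message broker


def save_tweet(author, text):
    # After the database write (omitted), only enqueue: the caller never blocks
    # on the fan-out, however many followers there are.
    tasks.put((author, text))


def worker():
    while True:
        item = tasks.get()
        if item is None:  # sentinel: stop the worker
            break
        author, text = item
        for follower in followers.get(author, []):
            timelines[follower].append(text)


def run_demo():
    thread = threading.Thread(target=worker)
    thread.start()
    save_tweet("alice", "hello")
    tasks.put(None)
    thread.join()
    return dict(timelines)
```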

Building an API

Having an API is important to scale your code and your architecture. Make sure that all the places in your code (the Django code, the Redis code, the REST part, whatever) use the same API, or are refactored to use it, so that you can change them all in one place.

To wrap up

  • Use a framework (like Django, to do some of the legwork for you), then iterate. Start with querying the database then scale.
  • Scaling can lead to performance but not the other way around.
  • When you have a large infrastructure, architect it in terms of services, it's easier to scale
  • Consolidate the entry points, it becomes easier to optimise

Lessons learnt

  • Have more upfront, for instance 64 VMs, so that you can scale up to 64 machines if needed.
  • Redistributing/rebalancing shards is a nightmare, plan far ahead.
  • PUSH to the cache, don't PULL: otherwise if the data is not there, 5000 users might request it at the same time and suddenly you have 5000 hits to the database. Cache everything, it's easier to invalidate (everything is cached 5 minutes in memcached in their system)
  • Write counters to denormalise views (updated via queues, stored in Redis I think)
  • Push everything to a queue from the start, it will make processing faster -- there is no excuse, Celery is so easy to set up
  • Don't write database triggers, handle the trigger logic in your queue
  • Database pagination is slow and crappy: LIMIT 0, 1000 may be ok -- LIMIT 1000, 2000 and suddenly the database has to count rows, it gets slower and consumes CPU and memory. There are easier ways to do pagination, he likes to do id chunks and select range of ids, it's very quick.
  • Build with future sharding in mind. Think massive, use Puppet.
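The pagination point is easy to try out. Below is a small sketch of the "id chunks" approach using sqlite3 from the standard library: instead of an ever-growing offset, each page asks for rows whose id is greater than the last id seen. The table name, columns and helper names are made up for the example.

```python
import sqlite3


def make_demo_db(row_count):
    """Build a throwaway in-memory table of tweets (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO tweets (body) VALUES (?)",
        [("tweet %d" % n,) for n in range(1, row_count + 1)],
    )
    return conn


def fetch_chunk(conn, after_id, chunk_size):
    # The primary key index finds the starting point directly,
    # so the database never has to count and skip rows.
    return conn.execute(
        "SELECT id, body FROM tweets WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, chunk_size),
    ).fetchall()


def iter_all(conn, chunk_size):
    last_id = 0
    while True:
        rows = fetch_chunk(conn, last_id, chunk_size)
        if not rows:
            return
        for row in rows:
            yield row
        last_id = rows[-1][0]  # remember where this chunk ended
```

The trade-off is that you can only move to the next chunk, not jump to an arbitrary page number, which is usually fine for feeds and batch jobs.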

One of the questions was: does that mean there are 7 million cache misses if someone deletes a tweet? Answer: Yes indeed.


EuroPython 2011: Simon Willison on Challenges in developing a large Django site

Links: talk description and video and slides.


Simon Willison is the co-founder of lanyrd.com, a social website for conferences.

Tips and tricks

Signing (from 1.4, currently in trunk)

Using cryptographic signing for various things can ensure that they haven't been tampered with, for instance a cookie or an unsubscribe link. If you sign your session cookies you don't have to hit the database anymore, you just need to check that the cookie's signature is valid. (Signing is not encryption: the content stays readable, it just can't be modified unnoticed.)

The speaker showed a couple of short code examples to demonstrate how simple it is to use, and how the interface is consistent with the other serialisation interfaces.

from django.core import signing
signing.dumps({"foo": "bar"})  # url safe
signing.loads(string)
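To show what signing does under the hood, here is a standard-library sketch of the general idea. It is not Django's implementation or its token format: the `dumps`/`loads` names merely mirror the interface above, and `SECRET_KEY` is a placeholder value.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"not-a-real-secret"  # placeholder, assumed for the example


def dumps(obj, key=SECRET_KEY):
    # Serialise, make it URL safe, then append an HMAC of the payload.
    payload = base64.urlsafe_b64encode(json.dumps(obj, sort_keys=True).encode("utf-8"))
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    return (payload + b":" + signature).decode("ascii")


def loads(token, key=SECRET_KEY):
    payload, _, signature = token.encode("ascii").rpartition(b":")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    # Constant-time comparison, so timing does not leak the signature.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature does not match")
    return json.loads(base64.urlsafe_b64decode(payload).decode("utf-8"))
```

Anyone can still read the payload; what they cannot do without the key is change it and produce a matching signature.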

cache_version

This is another way to do cache invalidation. You add a cache_version field to the model, that is incremented when calling the save() hook or a touch() method. In the template cache fragment, you use the primary key and the cache_version to invalidate.

You can also mass invalidate by updating the cache version of objects.all() using F() -- example from the slides:

topic.conferences.all().update(
    cache_version = F('cache_version') + 1
)
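Outside of Django templates, the same trick can be sketched with a plain dictionary as the cache. The `Conference` class, `cache_key` and `cached_fragment` are hypothetical names of mine; the idea is simply that bumping `cache_version` changes the key, so the stale entry is never read again.

```python
class Conference:
    """Hypothetical model-like object with a cache_version counter."""

    def __init__(self, pk, name):
        self.pk = pk
        self.name = name
        self.cache_version = 0

    def touch(self):
        # Bumping the version makes every key built from it obsolete.
        self.cache_version += 1


def cache_key(conference):
    return "conference:%d:%d" % (conference.pk, conference.cache_version)


def cached_fragment(cache, conference, render):
    key = cache_key(conference)
    if key not in cache:
        cache[key] = render(conference)  # only rendered on a cache miss
    return cache[key]
```

Old entries are not deleted, they are just orphaned, so this works best with a cache that evicts unused keys on its own (as memcached does).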

noSQL for denormalisation

Use noSQL to denormalise and keep the database and the cache/nosql in sync. It's more work but it's worth it.

For instance they use Redis sets to maintain lists such as username-follows, europython-attendees and then they simply need to do a set intersection to get the information they want. These are only lists of ids so they don't take that much space.
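Python's built-in sets behave the same way, so the intersection is easy to picture. The ids and the `friends_attending` helper below are invented for the example; in their setup the sets would live in Redis rather than in a dictionary.

```python
# Hypothetical data: sets of user ids, keyed the way the Redis sets are named.
follows = {"julie": {3, 7, 12, 40}}
attendees = {"europython": {7, 12, 99}}


def friends_attending(username, conference):
    # People the user follows who are also attending: a set intersection.
    return follows.get(username, set()) & attendees.get(conference, set())
```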

Hashed static asset filenames in CloudFront

They created a management command to push static assets, that compresses Javascript, changes the names/urls, etc. This way they can publish them in advance, and also keep static files around if there's a need to rollback. The different names are also good to prevent Internet Explorer caching.
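A tiny sketch of the renaming step, assuming the name is derived from a hash of the file's content (the talk notes don't say which hash or naming scheme they use, so `hashed_name` and its format are my own illustration):

```python
import hashlib
import os.path


def hashed_name(filename, content, digest_length=8):
    """Return e.g. 'js/site.<hash>.js' for the given bytes (assumed scheme)."""
    digest = hashlib.sha1(content).hexdigest()[:digest_length]
    root, ext = os.path.splitext(filename)
    return "%s.%s%s" % (root, digest, ext)
```

Since the name changes whenever the content does, old and new versions can sit side by side, and a browser never serves a stale copy under the new URL.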

Challenges

This part of the talk is about things they don't really have answers for.

HTTP Requests

e.g. talking to an API: what if it fails or takes 30 seconds? Do you use urllib? What if people enter private urls from within your Intranet? :O

You have to handle connection timeouts, logging and profiling, url validation, and http caching. All of these are a common set of problems that should be baked into the framework.
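As a partial illustration of two of those concerns (timeouts and url validation), here is a sketch with the standard library. `is_public_url` and `fetch` are hypothetical helpers of mine, and the check is deliberately simple: it is not a complete defence, since the name is resolved once for the check and again for the actual request.

```python
import ipaddress
import socket
import urllib.request
from urllib.parse import urlparse


def _is_public_ip(address):
    ip = ipaddress.ip_address(address.split("%")[0])  # drop any IPv6 scope id
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_public_url(url):
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        # The host is already an IP literal.
        return _is_public_ip(parsed.hostname)
    except ValueError:
        pass
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    # Every address the name resolves to must be public.
    return all(_is_public_ip(info[4][0]) for info in infos)


def fetch(url, timeout_seconds=5):
    if not is_public_url(url):
        raise ValueError("refusing to fetch %r" % url)
    # Without a timeout a slow API would tie up the worker indefinitely.
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.read()
```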

Profiling and debugging production problems

Debugging in development rocks, with the django-debug-toolbar, the way error 500 are handled, pdb, etc.

Once you turn debug to False, you're blind. After a while, all the bugs, particularly performance bugs, only happen in production.

He showed us a code snippet for a UserBasedExceptionMiddleware: if a page throws a 500 error and the logged-in user has is_superuser set to True, they see the traceback instead of the default 500 page (so if one of your users reports a problem, you can go straight to the page and see a traceback).

At the database level, there is a handy tool called mysql-proxy that is customisable using Lua. Using a wonderful, horribly documented library called log.lua, you can for instance turn on logging for a couple of minutes when needed.

He created an app called django_instrumented (unreleased, until it's cleaned up) that collects statistics and sticks them into memcached. He has a special bookmark to access them, and they are stored for 5 minutes only -- so they waste neither space nor time.

This actually helped improve the performance: if you measure something and make it visible, people will improve it over time.

0 downtime deployments

Code-wise it's easy enough to do, but when there are database changes it's tougher. Ideally they try to make schema changes backwards compatible, then use ./manage.py migrate (using South) on another web server.

Having a read-only mode made a lot of problems easier! It's not 0 downtime but the content is still readable. It can be a setting or a Redis key.

Feature flags work in the same way but at a more fine-grained level, for instance turning off search while you update your solr cluster. There's quite a bit more work involved.
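Both ideas boil down to checking a switch before doing the work. A minimal sketch, where the `flags` dictionary stands in for the setting or Redis key and all function names are hypothetical:

```python
def is_enabled(flags, name):
    # Unknown flags count as off, so a typo fails safe.
    return bool(flags.get(name, False))


def search(flags, query):
    if not is_enabled(flags, "search"):
        return "Search is temporarily unavailable."
    return "results for %r" % query


def save_comment(flags, store, text):
    # In read-only mode the content stays readable but writes are refused.
    if is_enabled(flags, "read_only"):
        return False
    store.append(text)
    return True
```

The extra work mentioned above comes from having to put a check like this (and a sensible fallback) around every feature you want to be able to switch off.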

One lesson we keep on learning in Django

We went from one database to multi-databases, from one cache to multi-caches, from one haystack backend to multiple backends.

Debug is one single setting, that affects a lot of things.

The timezone setting also affects Apache log files.

The middleware concept is very powerful, but is executed on every single request: if there's a conditional it has to be done within the middleware.

Really, global settings should be flushed out of the project! They are evil settings that cannot be changed at runtime.


EuroPython 2011: Wesley Chun on Python 3, Python 103 Incomplete notes on interesting talks

Python 3: The Next Generation

Link: Talk description and video

Python was created in 1991 -- it's 20 years old now! Lots of time for cruft.

Python 3 is backwards incompatible. Stuff will break. Hopefully the migration will not be grudging, the main thing for most programs will likely be unicode strings.

In 2002 Guido gave a talk called "Python Regrets", which later on and with other things became the basis for Python 3000.

A few of the changes (note from me: they were not all mentioned and I didn't take note of everything either!):

  • The print statement becomes the print() function.
  • Numbers, divisions: division will now be true division by default (1/2 == 0.5 as opposed to the current 1/2 == 0 -- better to teach, especially young ones)
  • Dictionary methods such as keys(), values() and items() will return views instead of building lists, to save memory
  • Likewise, built-ins like map/filter will return iterators, among many changes for better speed and/or memory efficiency
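A few lines of Python 3 showing the changes listed above (the variable names and values are just for illustration):

```python
# print is a function, so it takes keyword arguments such as sep.
print("hello", "world", sep=", ")

# / is true division; // is the explicit floor division.
half = 1 / 2
floored = 1 // 2

# keys() returns a view onto the dictionary, not a copied list:
# it reflects later changes.
counts = {"a": 1}
keys = counts.keys()
counts["b"] = 2

# map() is lazy: nothing is computed until you iterate over it.
squares = map(lambda n: n * n, range(3))
```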

With regard to migrating:

  • Wait for your dependencies to port
  • Have a good test suite
  • Move your code to at least 2.6, which is when Python 3 functionality started to be backported
  • The -3 switch tells you about incompatibilities
  • The 2to3 tool offers diffs on what should be ported

Be careful with Python 3 books, if they cover 3.0 they are already obsolete. Lots of changes!


Python 103

Link: Talk description and video

This talk aims to fill in the gaps in the knowledge of not-quite-beginners-anymore. We'll have a look at: the memory model, mutability, and methods.

Special methods

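Arguments are always passed as references to objects. A mutable object can be changed in place by the function, so it behaves like pass by reference; an immutable object (like a string or number) cannot, so it behaves like pass by value.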

class Stuff:
    version = 1.0

john = Stuff()

As a shortcut you can use john.version rather than Stuff.version. However if you assign john.version = "blah", you're hiding access to the class 'version' attribute, and only changing this attribute for the john instance -- basically creating a new attribute on the instance.
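Continuing the example, with a second instance (`jane`, added by me) to show that only `john` is affected:

```python
class Stuff:
    version = 1.0


john = Stuff()
jane = Stuff()

# Assignment creates an attribute on the john instance only;
# the class attribute is untouched and jane still sees it.
john.version = "blah"

john_has_own = "version" in vars(john)   # stored on the instance
jane_has_own = "version" in vars(jane)   # still looked up on the class
```

Deleting the instance attribute with `del john.version` makes the class attribute visible through `john` again.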

To initialise, if there's one parent you should do Parent.__init__(self). If you're lower down, try to use super() for MRO and stuff. Extra reading: Python's super() considered super! and Python's Super Considered Harmful.
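A small sketch of why super() matters once there is more than one parent. The class names are invented, and the code uses the Python 3 spelling `super().__init__()`; in Python 2 you would write `super(Child, self).__init__()` and inherit from `object`.

```python
class Base:
    def __init__(self):
        self.calls = ["Base"]


class Left(Base):
    def __init__(self):
        super().__init__()  # next class in the MRO, not necessarily Base
        self.calls.append("Left")


class Right(Base):
    def __init__(self):
        super().__init__()
        self.calls.append("Right")


class Child(Left, Right):
    def __init__(self):
        super().__init__()
        self.calls.append("Child")
```

With super(), each `__init__` runs exactly once, following `Child.__mro__`. Calling `Base.__init__(self)` directly from both `Left` and `Right` would run it twice and reset `calls` along the way.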

You can make your own classes act like Python types, by overloading existing operators and existing built-in functions. There are loads of them! __init__, __new__ (for immutable objects), __len__, __eq__, __getslice__, __getitem__, __contains__ (for the 'in' keyword), ...
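For instance, a hypothetical `Playlist` class (my own example, not one from the talk) that supports len(), indexing, slicing, `in` and `==` by defining a handful of special methods:

```python
class Playlist:
    def __init__(self, tracks):
        self._tracks = list(tracks)

    def __len__(self):                # len(playlist)
        return len(self._tracks)

    def __getitem__(self, index):     # playlist[0], playlist[1:3], iteration
        return self._tracks[index]

    def __contains__(self, track):    # "song" in playlist
        return track in self._tracks

    def __eq__(self, other):          # playlist == other
        if not isinstance(other, Playlist):
            return NotImplemented
        return self._tracks == other._tracks
```

In current Python versions `__getitem__` receives a slice object for `playlist[1:3]`, so a separate `__getslice__` is not needed.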

How to (not) use the stack - performance

The timeit module will run a snippet a million times (by default) and help find which implementation is faster.

For instance:

1. while i < len(x)
2. strlen = len(x); while i < strlen

The second one is faster, that's the penalty of a stackframe: the len() value is not cached because it could change! (With overloading, etc.) With the 2nd version we removed a function call.
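The two loops can be compared with timeit along these lines. The function names and the `compare` helper are mine, and I'm deliberately not quoting timings: the numbers depend on the machine and the Python version, so run it and see.

```python
import timeit


def loop_calling_len(x):
    i = 0
    while i < len(x):      # len() is called again on every iteration
        i += 1
    return i


def loop_caching_len(x):
    i = 0
    strlen = len(x)        # one function call, then a plain local lookup
    while i < strlen:
        i += 1
    return i


def compare(number=200):
    data = "x" * 1000
    return {
        "calling len()": timeit.timeit(lambda: loop_calling_len(data), number=number),
        "cached length": timeit.timeit(lambda: loop_caching_len(data), number=number),
    }
```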

Objects and references

All Python objects have:

  • An identity (similar to the memory address)
  • A type
  • A value (only that one can change, sort of)

References are also called aliases. There is a reference count to track the total number. Variables do not hold data, they point to objects. For instance if 2 variables are assigned 1, they would both be pointing to the same (immutable) Integer object representing 1.

3 types of standard types

His classification, nothing official.

  • Storage: Linear/Scalar vs. Container
  • Update: Mutable vs. Immutable
  • Access: Direct (number) vs. Sequence (string, list) vs. Mapping (dictionary)

"Interned" numbers are a list of integers in range(-5, 257): these are special numbers that are never deallocated.

"is" is a keyword to assess if 2 variables point to the same object.

Beware shallow copy vs. deep copy when you have a list of lists.
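A short example of the difference, using the copy module and a made-up list of lists:

```python
import copy

grid = [[1, 2], [3, 4]]

alias = grid                  # same object, just another name
shallow = copy.copy(grid)     # new outer list, but the inner lists are shared
deep = copy.deepcopy(grid)    # new outer list and new inner lists

grid[0].append(99)            # mutate one of the inner lists
```

After the append, `shallow[0]` also shows the 99 because it is the very same inner list, while `deep[0]` is unaffected.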

Memory allocation

Growing memory: When you call .append() and the list is full, 4 free units are malloc'ed, then double that, then a bit less: usually about 12.5% additional free slots are created at once in advance. What this means is that you should try not to have a lot of short lists.
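You can watch the over-allocation happen with sys.getsizeof. This is a sketch that assumes CPython; the exact sizes and growth steps are implementation details that vary between versions, so the helpers (names are mine) only record them rather than expect particular values.

```python
import sys


def growth_steps(appends=64):
    """Record the list's reported size after each append."""
    items = []
    sizes = [sys.getsizeof(items)]
    for n in range(appends):
        items.append(n)
        sizes.append(sys.getsizeof(items))
    return sizes


def resize_count(sizes):
    # The size only changes when the list had to grab more room.
    return sum(1 for before, after in zip(sizes, sizes[1:]) if after != before)
```

The number of resizes comes out much smaller than the number of appends, because each resize reserves several free slots in advance.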

Shrinking memory is a fairly inexpensive operation.

Inserting in the middle of a list creates a lot of shifting. collections.deque is better for adding and removing items at either end (append()/appendleft() and pop()/popleft()), though less good for access by index.
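A quick look at the double-ended operations, with invented data:

```python
from collections import deque

jobs = deque(["b", "c"])

jobs.appendleft("a")    # cheap at the left end; list.insert(0, ...) shifts everything
jobs.append("d")        # cheap at the right end too

first = jobs.popleft()  # removes and returns the leftmost item
last = jobs.pop()       # removes and returns the rightmost item
```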


EuroPython 2011: Alex Martelli on API Design Anti-Patterns A few notes

Link: Talk description and video


The easy way to write an API is to use your current implementation, but then you expose implementation details which makes it harder to change or improve the implementation in the future.

In software development, you shouldn't think up a big design upfront, except for 2 things: security, and API.

Forget about your implementation: think up 2 or 3 different ways you could have implemented your project, and keep only the common parts, the "substance" for the API.

To motivate people to migrate to your new API: don't add new features to the old one.

Make choices.

Also, don't be in an environment where making mistakes is punished, rather than fixed :)


Couple of thoughts on EuroPython 2011

Taking a break from writing up my notes into blog posts to share a few general thoughts about the conference.

General feeling: Woohooooo! This was a fantastic week, I learnt a ton and met an amazing amount of awesome people. If you're here because I gave you one of my cards, hi! o/ It was lovely to meet you. (If you asked about the card and forgot, these are MooCards from moo.com. Get yourself some! People fought over these at EuroPython, I'll have you know. They're that good!)

The conference was wonderfully well organised, including the evening events. I fondly recommend the bistecca alla fiorentina from Zaza! Everyone was incredibly friendly: like at PyCon Ireland last year, it was common to strike up interesting conversations with a random stranger beside you, and impromptu dinner plans would be shared between groups.

I was humbled by how egoless most people I spoke with were. They seem to know that no one knows absolutely everything about Python (and there were funny anecdotes about this, such as famous names requesting new features that are already in the language!). I was incredibly surprised when one of the keynote speakers sat at my table during lunch on the first day -- I had assumed well-known people would have solid cliques and no time or desire to meet new faces. And of course they ended up being just as nice as everyone else.

Some things I grumbled over: the constant strikes in Italy, first at the airport when I landed, then the trains when I (tried to) leave! I was also disappointed to miss out on the training I was hoping to attend: the rooms were a bit small and filled up long before the training's starting time. Lesson learnt for next year!

...And indeed I am much looking forward to going again next year. In the meantime I welcome all Pythonistas to PyCon Ireland in Dublin this October! :D

