Training at EuroPython 2014: Making your first contribution to OpenStack

Last week at EuroPython, I ran a 3-hour training on how to get started contributing to OpenStack. The aim was to give a high-level overview of how the contribution process works in the project and guide people through making their first contribution, from account creation to submitting a patch.

Overview

The session starts with an extremely fast overview of OpenStack, geared toward giving the participants an idea of the different components and possible areas for contribution. We then go through creating the accounts, why they're all needed, and how to work with DevStack for the people who have installed it. From there we finally start talking about the contribution process itself and some general points on open-source and OpenStack culture, then go through a number of ideas for small tasks suitable for a first contribution. After that it's down to the participants to work on something and prepare a patch. Some people chose to file and triage/confirm bugs. The last part is about making sure the patch matches the community standards, submitting it, and talking about what happens next both to the patch and to the participant as a new member of the community.

Preparing

During the weeks preceding the event, I ran two pilot workshops with small groups (less than 10 people) at my local hackerspace, in preparation for the big one in Berlin. That was absolutely invaluable for making the material more understandable, refining the content around items I hadn't initially thought of covering (e.g. screen, openrc files) and topics that could use more in-depth explanations (e.g. how to find your first task), adjusting timings, and generally getting a feel for what's reasonably achievable within a 3-hour intro workshop.

Delivering

I think it went well, despite some issues at the beginning due to lack of Internet connectivity (always a problem during hands-on workshops!). About 70 people had signed up to attend (a.k.a. about 7 times too many); thankfully other members of the OpenStack community stepped up and offered their help as mentors - thanks again everyone! In the end, about half the participants showed up in the morning, and we lost another dozen to the Internet woes. The people who stayed were mostly enthusiastic and seemed happy with the experience. According to the session etherpad, at least 5 new contributors uploaded a first patch :) Three are merged so far.

Distributing the slides early proved popular and useful. For an interactive workshop with lots of links and references, it's really helpful for people to be able to go back to something they missed or want to check again.

Issues

The start of the workshop is a bit lecture-heavy and could be titled "Things I Desperately Wish I Knew When Starting Out," and although there are some quizzes/discussions/demos I'd love to make it more interactive in the future.

The information requested in order to join the Foundation tends to surprise people, I think because they come at it from the perspective of "I want to submit a patch" rather than "I am preparing to join a Foundation." At the hackerspace sessions in particular (maybe because it was easier to have candid discussions in such a small group), people weren't impressed with being forced to state an affiliation. The lack of an obvious answer for volunteers gave the impression that the project cares more about contributions from companies. "Tog" might make an appearance in the company stats in the future :-)

On the sign-up form, the "Statement of Interest" is intimidating and confusing for some people (I certainly remember being uncertain over mine and what was appropriate, back when I was new and joining the Foundation was optional!). I worked around this after the initial session by offering suggestions/tips for both of these fields and speaking a bit more about their purpose.

A few people suggested I simply tell people to sign up for all these accounts in advance so there's more time during the workshop to work on the contribution itself. It's an option, though a number of people still hit non-obvious issues with Gerrit that are difficult to debug (some we opened bugs for, others we added to the etherpad). During one of the pilot sessions at the hackerspace, 6 of the 7 participants had errors when running git review -s - I'm still not sure why, as it Worked On My Machine (tm) just fine at the same time.


Overall, I'm glad I did this! It was interesting to extract all this information from my brain, wiki pages and docs and attempt to make it as consumable as possible. It's really exciting when people come back later to say they've made a contribution and that the session helped to make the process less scary and more comprehensible. Thanks to all the participants who took the time to offer feedback as well! I hope I can continue to improve on this.

wvdial: stackmaster assertion failure & EuroPython

At EuroPython, I happily took advantage of the prepaid SIM card that you could order together with your ticket. However, tethering with Wind (Italy) didn't work quite that simply. On Debian Wheezy I ended up with the following error:

[...]
--> Modem initialized.
wvdial: utils/wvtask.cc:409: static void WvTaskMan::_stackmaster(): Assertion `magic_number == -0x123678' failed.

A bit of astute googling reveals that libwvstreams needs to be downgraded.

$ aptitude versions libwvstreams4.6-base # or apt-cache policy libwvstreams4.6-base
Package libwvstreams4.6-base:                        
p A 4.6.1-1                                       stable                    500 
i A 4.6.1-4                                       testing,unstable          500 
$ sudo aptitude install libwvstreams4.6-base=4.6.1-1

Tadam! wvdial now works. The same configuration as last year seems to do the trick (though on Fedora 16, Network Manager's Mobile Broadband tool was easy to set up with Wind without bothering with wvdial.)

[Dialer wind]
Modem Type = USB Modem
Modem = /dev/ttyACM0
Username = "wind"
Password = "wind"
Init6 = AT+CGDCONT=1,"IP","internet.wind"

EuroPython 2012: Let's go

I'm very happy to be going to EuroPython again this year :) Once more the talks I'm most looking forward to are mainly related to scalability, perhaps with a dash of internationalisation/encoding on the side. The tutorial on Django testing with Selenium should be interesting too.

Say hi if you see me! :)

EuroPython 2011: Nicholas Tollervey on the London Python Code Dojo

Link: Talk description and video

The Python Code Dojo is a community-organised monthly meeting.

Dojo

A dojo is a place where you go to practice; learning is a continuous process. It's based on the idea of deliberate practice.

Paris

Codingdojo.org was started in Paris, where it follows a very structured format.

Katas are forms that you practice to prepare yourself. You learn how to solve a problem using baby steps. In Paris they do this in silence, unless you really don't understand and have to ask a question. "Randori kata" is public pair programming, with a pilot and a co-pilot that solve a problem on stage.

London

The London Dojo works more like a seminar and attendees are encouraged to interrupt. Participation is expected. They do team dojo where the team must solve a problem within a timeframe. Problems are written on a blackboard, people vote for one and then everyone works at solving it in a team of 5 or 6 people over 1h30. Finally there is a show, tell, review and question event where each team presents their solution/approach.

Why participate in a dojo?

  • The educational benefit, of learning by doing
  • You can fail safely in a sympathetic environment, and experiment
  • People teach one another, all levels can attend
  • You build a community: in London, that's relaxing with pizza and beer

What's a good dojo?

From the attendee's perspective: it's fun, you get to solve problems, it's safe to make mistakes, and show and tell is encouraged, which is good for getting feedback.

From the organiser's perspective: it self-organises, mostly.

To see if it's going well, check that there is a positive aim, that something is done to reach this aim, and that there's some sort of feedback at the end.

Personal observations

Beware of systems and gurus. Ignore systems if something else works for you; you can actually do damage otherwise. Learn to practice learning!

Q&A tidbits

When they (or another dojo?) started using meetup.com they doubled their numbers! Meetup.com or EventBrite, the idea is to have a centralised system, with tickets, to predict attendance.

EuroPython 2011: Lightning Talks

The lightning talks were very fast paced (5 minutes) so I only jotted down some project names I want to check out and interesting tidbits, and missed speaker names just about all the time. Sorry!

To easily create diagrams, check out blockdiag. It includes different shapes that make me hopeful it might be a less painful way to do nice topologies.

Someone's project to remove the GIL in PyPy (future):

global_lock.acquire() / .release()
object.acquire()
with Transaction():

Learning a language in 60 hours: http://sotospeak.se/ (English homepage). It's a piece of software for your mobile phone that encourages you to learn a language like children do. It's written in Python.

shlex for simple lexical analysis.

Python Edinburgh are a bunch of cool folks with their own conference :)

DjangoZoom, effortless deployment for Django (like Heroku?)

pip install null, if a need to use the Null object pattern arises.

EuroPython 2011: Mark Ramm on A Python Takeover

Link: Talk description and video

Two and a half years ago, SourceForge was all PHP except for one little Python service. SourceForge was originally written in 1999, when Python wasn't so great at the web. After 10 years though, it all started to atrophy.

So SourceForge decided to do a little experiment and assigned 2 people to create FossFor.us. They chose a web framework thinking Python was good, and were impressed with the Django documentation: there was no real learning curve. CouchDB was a bit slow, though.

The experiment was a success (although the site is now gone, due to managerial reasons if I understood correctly) and so they were then given 6 weeks to redo the download page/files of SourceForge. It had to be dynamic, to offer you a download relevant to your platform: they use the user agent string to figure out the operating system, and borrowed code from setuptools to figure out which release is the latest.

Their admins loved Apache so they went with mod_wsgi, which worked out well.

The whole system worked fine on a laptop handling all the traffic from SourceForge: that was the only load testing! And unfortunately when the system went live it took about 8 seconds to load a page. CPU and memory usage showed nothing unusual. It turned out they had saturated their gigabit Ethernet card by loading all the releases, which some projects have a lot of (e.g. JBoss). Memcached turned out to be slower than MongoDB, because of pickling the 4-megabyte objects and the associated CPU cost.

Finally they updated the list to not include all releases and the project was deemed a success! Thus everything shall become Python. They are developing their new platform openly, Allura. They use a FUSE filesystem in Python to control permissions.

From now on, they have an internal mandate that everything should be written using Python unless there is a good reason not to. And now SourceForge can compete again! As well as explore new directions.

EuroPython 2011: Mark Ramm on Relate or !Relate

Link: Talk description and video

This talk was about non-relational databases. I didn't take a lot of notes :o) The most important moral is probably: don't keep the mind altering substances and the tools in the same shed.

With 2 decades behind them, relational databases are pretty robust by now. They cover a spectrum of ACID compliance; for instance MySQL is faster, Postgres is more reliable (though becoming faster... if you tick off the reliability options!). Relational databases are supposed to be normalised, except they are not really: there is also a spectrum here, as databases tend to get denormalised for performance reasons.

Amazon uses an "eventually consistent" system, which they can pull off by charging at shipping time only. Conflicts are rare, if 2 orders are placed and there is only 1 item available, someone might get a gift certificate instead.

The NoSQL taxonomy includes wildly different tools that don't have much in common except for the fact that they don't use SQL: key-value stores, document stores, ....

CAP: Consistency, Availability, Partition tolerance. You can have at most 2 of the 3, never all of them (Brewer's theorem).

There was only 1 Postgres database for all of SourceForge for a long time, while they were in the top 100 sites. Don't obsess about scale you'll never achieve.

One of the questions was about how difficult it is to convert from a relational database to NoSQL. The answer is, from something like Postgres to MongoDB, it wouldn't be that much work (he did suggest 4 people for 6 weeks though, which doesn't sound that trivial to me). Changing to Cassandra on the other hand would be a huge effort.

EuroPython 2011: Raymond Hettinger on Python Tips, Tricks and Idioms

Link: Talk description and video

I couldn't find the slides online but please do link me if you find them, they were a treasure trove of awesome tips and very well laid out!

The talk touched on many things, the following tips are unrelated to each other and in no particular order.


Beware the corner cases of negative slicing -- if you use a variable for slicing there's likely a bug lurking in your code!

mytuple = 1, 2, 3 # Tuple declaration also works without parentheses

Learn Python the way Guido intended it and indented it :P

for x in reversed(seq): # Better than negative slicing because it's clearer, there's no need to do a double take

for i, x in enumerate(seq): # That's Pythonic! Forget about i in range, seq[i]

for x, y in zip(seqx, seqy): # Iterate over two sequences in parallel

for/else: in a for loop with an else clause, the else executes if there was no break in the loop, that is, when the loop runs to completion. It's only useful when there is a break (the searching use case).
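
A minimal sketch of the searching use case (the function and names are mine, not from the talk):

def find(seq, target):
    for i, value in enumerate(seq):
        if value == target:
            print("found at index %d" % i)
            break
    else:  # no break: the loop ran to completion
        print("not found")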

None is always smaller than everything else in comparisons (in Python 2, that is; Python 3 drops ordering comparisons with None).

Sets aren't guaranteed any meaningful order after being sorted, because __lt__ on sets has been overridden to indicate subsets and supersets, not ordering.

If you intend to define __lt__ you should also implement the other rich comparison methods (6 in total).
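
As a sketch of that tip: since Python 2.7/3.2, the functools.total_ordering decorator can derive the rest from __eq__ and __lt__ (the class here is my own example):

from functools import total_ordering

@total_ordering
class Version:
    def __init__(self, number):
        self.number = number
    def __eq__(self, other):
        return self.number == other.number
    def __lt__(self, other):
        return self.number < other.number

assert Version(1) <= Version(2)  # __le__ was filled in by the decorator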

# Key function:
sorted(s, key=str.lower) # (awesome!)

If a class takes in an iterable and emits iterables, lots of awesome may occur with the prebuilt tools and other surprising uses (it kind of works like pipes and filters).
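
A tiny sketch of the idea (my own example): each stage takes an iterable and yields one, so stages chain like shell pipes:

def strip_comments(lines):
    for line in lines:
        if not line.startswith("#"):
            yield line

def upper(lines):
    for line in lines:
        yield line.upper()

for line in upper(strip_comments(["# header", "hello", "world"])):
    print(line)  # HELLO, then WORLD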

d = defaultdict(dict)
d[stuff][otherstuff] = "blah"
# as opposed to
d = dict()
d[stuff] = dict()
# etc...

d.update(dict(d))

EuroPython 2011: Brian Fitzpatrick on the Myth of the Genius Programmer

Link: talk description and video

A lot of user questions for the Google Code project are along the lines of "how do I hide code until it's ready", "how can I wipe the history and start from scratch" and so on: they are about people's insecurities.

When you have elitism and anonymity, suddenly everyone is elite. There's a whole mythology that gets built around programming heroes (Torvalds, van Rossum, Wall).

There are no geniuses. The only thing this mythology has created is a fear of showing mistakes. This insecurity inhibits progress, makes the process slower, brings lower quality and a lower bus factor.

Avoiding the trap

  • Lose the ego
  • Criticism is not evil: give and take it politely. You are not your code. People criticising your code are not out to get you.
  • Embrace failure (just don't fail the same thing over and over!)

The speaker shared an interesting story (probably an aphorism?) of an executive who makes a mistake that costs his company 10 million dollars. The day after, he starts packing up his things, and when the CEO summons him to his office, he's ready to hand in his resignation, saying there is no need to fire him. The CEO replies "Fire you? Why would I do that? I just spent 10 million dollars training you!"

  • Iterate quickly
  • Practice is key
  • Be a small fish, as in don't be the smartest person in your company. You'll learn much more, much faster
  • Be influenced. "Alpha" programmers think they know everything and won't ever listen -- you may find that you actually gain more influence, by being willing to be influenced!
  • Be vulnerable: repeated vulnerability becomes a strength long term.

Tools

Tools won't solve sociological problems, but they may influence behaviours. Pay attention to the tools, they can influence culture and morale, for instance by encouraging "cave" behaviour where developers work on their own for a long time and dump a big chunk of code: it's bad for collaboration, reviewing, etc.

You don't need to hide a project until it's "ready." Simply don't advertise it. People may find you because they are looking for something like this.

Don't wait until it's too late to let people collaborate: they may help with code reviews, or point out existing libraries you missed. If it's too late in the project, they have no possibility to drive, to be a strong part of the project, and it's less likely they will contribute.

Certainly get a prototype ready, some running code and some design, but let it still be something that you're happy to step back on.

Conclusion

  • Don't try to be a genius
  • Collaborate early and often
  • Pay attention to default behaviours (the ones encouraged by tools especially)
  • Pay attention to timing

...and if you do all this, people will think you're a genius!

Some of the questions/responses

Make sure to write a manifesto with the direction you want for your project from early development, so there are no major clashes or misunderstandings later on when people get involved.

If you don't care about credit, wow will you go places. If someone "steals" your idea and they have more reach (e.g. more clout/connections), it's great! It means it's more likely the idea will be implemented, and you'll have more ideas anyway.

On influencing a new team you just joined with best/better practices: start by doing good work and building up a good reputation, then you get to pay it back on something you believe matters. You have to choose your battles, you can't step in front of every train (resistance to change is like a very fast train! Hard to stop.)

EuroPython 2011: Simone Federici on Django Productivity Tips and Tricks

Links: Talk description, video and slides


Know the environment

Use Linux with the "Terminator" terminal emulator.

Use a version control system.

Use virtualenv, for managing different versions of Python and dependencies, e.g.

virtualenv path --python=python2.7 --no-site-packages
system libs: ./configure --prefix=envpath ; export LD_LIBRARY_PATH=envpath/lib

Use yolk, a handy tool to query PyPI and the status of installed packages:

yolk -l (to see installed packages)
yolk -U (to see if there are updates on pypi)
yolk -a

Use the bash autocomplete script.

Use djangodevtools (PyPi/site with description of the new commands), for lots of useful things such as adding test coverage (./manage.py cover test myapp).

Continuous integration

Use Buildbot and Django-Buildbot (note: there was a configuration example on the speaker slides).

In settings.py:

try:
    from setting_local import *
except ImportError:
    pass

setting_local.py shouldn't be shared or checked in.

Thanks to the alias feature from djangodevtools, you can simply create commands to do whatever you want, e.g. clean up rabbit queues. The commands are stored in a manage.cfg file that is shared.

uWSGI is an application server with many options. --auto-snapshot sounds quite interesting. It supports clustering. It'll be in the official Django deployment documentation from the next release (ticket 16057).

Coding

get_profile() tends to be a problem. It's possible to monkey patch the User.get_profile() method (to make sure a new profile is created if it doesn't exist), but you have to be careful where it's loaded. It's also possible to use a Meta proxy together with a new middleware (set up after the authentication middleware).
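
A sketch of the monkey-patching variant, assuming a Profile model with a user foreign key (the app and model names are my assumptions, not the speaker's code):

from django.contrib.auth.models import User
from myapp.models import Profile  # assumed app and model names

def get_profile(self):
    # create the profile on first access instead of raising
    profile, created = Profile.objects.get_or_create(user=self)
    return profile

User.get_profile = get_profile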

Django model form factory (django.forms.models.modelform_factory) sounds interesting to create forms more quickly.

uWSGI

There was an on-the-fly short talk on uWSGI after the talk, by someone whose name I didn't pick up. It can talk to many protocols, it has lots of plugins so you can only use what you need. It's not the fastest but speed isn't the main factor that should make you decide to use it.

EuroPython 2011: Simon Willison on Advanced Aspects of the Django Ecosystem

Links: Talk description, video and slides.

This talk is about 3 tools that can be considered secret weapons: they offer great payoffs for low effort.

(Note: the slides enhance most of these concepts with lots of code examples, have a look!)

Haystack

Haystack does full text search, and is available as modular search for Django. It's very easy to get a nice search interface if you already use the Django ORM, and the queryset can also be defined to limit search queries to what you want (e.g. only published entries).

You can have different templates/html bits depending on the type of objects returned by the search.

Scaling/Backend

  • Whoosh (Python) - Good if you have no write traffic, and not a lot of traffic in general
  • Xapian (C)
  • Solr (Java) tends to be the default choice. It has an HTTP interface, and there are tons of things that are already baked into Solr, like filtering down by object type. Objects can be associated with a template, although it sounds like it's more about relevance than display: the speaker mentioned showing the title twice in the template to increase its weight in search results. It can scale with SearchQuerySet, and works faster than complicated crazy SQL.

Search indexes usually don't like being updated much. Haystack offers several solutions. Sites with low write traffic can update the index in real time at every change. Or changes can be batched every 6 hours. At a higher scale, you have to roll your own solution. For Lanyrd they have added a "needs_indexing" boolean to their models that defaults to True and is also set in the save() hook. Then using a management command or something else, it's possible to look at what needs to be indexed, process it and set the flag to False.
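
A sketch of that pattern as I understand it (the model and field names other than needs_indexing are assumptions):

from django.db import models

class Entry(models.Model):
    title = models.CharField(max_length=200)
    needs_indexing = models.BooleanField(default=True)

    def save(self, *args, **kwargs):
        self.needs_indexing = True  # any change means a reindex is due
        super(Entry, self).save(*args, **kwargs)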

Solr has master/slaves capabilities and knows how to balance the reads between slaves, the writes should be sent to master. Haystack only knows how to talk to one url, but using nginx it's possible to balance and to set up different proxies depending on the URLs to make sure the writes go where they should -- remember, Solr speaks HTTP.

Celery

Celery is a fully featured, distributed task queue manager. Any task that would take more than 200ms should go on the queue! For instance...

  • Updating the search index
  • Resizing images
  • Hitting external APIs
  • Generating reports

Using the @task decorator, the method works normally if called directly, but also gains a delay() method that adds the call to the queue, to be picked up by workers.
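
A minimal sketch, assuming the Celery 2.x-era decorator API (the task itself is made up):

from celery.task import task

@task
def resize_image(image_id):
    pass  # the slow work goes here

resize_image(42)        # called directly: runs synchronously
resize_image.delay(42)  # queued, picked up by a worker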

For tasks launched by users (such as uploading a picture or figuring out what's at a url):

  • To deal with people using Javascript or not: if 'ajax' in request.POST, show an inline loading indicator, otherwise redirect.
  • Use memcached for automatic housekeeping: in case the user closes the browser and doesn't come back, don't keep the task around forever. The oldest will get dropped automatically after a few hours.

Use celerybeat for scheduling, celerymon for monitoring the worker cluster, and celerycam to take snapshots -- this helps figure out when/where things go wrong.

The activity stream pattern gives everyone an "inbox": when everyone needs to receive something, like a tweet, it gets written to everyone's stream. Redis can handle 100,000 writes/second and is a handy tool to deal with this; this is also the kind of task that's a great candidate for queueing.

Fabric

Fabric is great for automated and repeatable deployments, and it also makes it easier to roll back. You could use Chef and Puppet, which are ridiculously powerful but quite complex to set up. Fabric generally fits the developer mental model better; it kind of wraps your current processes into Python.

For instance, you can create a clear_cache() that calls flush_all() on the cache. Then, to clear the cache on your server, call from your machine:

fab -H host1,host2 clear_cache
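
The fabfile behind it might look something like this (a sketch; I'm assuming a small management command on the servers that calls flush_all() on the cache):

# fabfile.py
from fabric.api import run

def clear_cache():
    run("/srv/app/manage.py clear_cache")  # assumed path and command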

The file (fabfile.py) is version controlled, therefore documenting your process -- so you don't forget how to do it 6 months from now!

env is a global variable used by Fabric, you can add your own variables to it that can be reused in other commands, for instance env.deploy_date to store the deployment date and time and make it easier to roll back and roll forward (he uses symlinks).
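
A sketch of that idea (the paths and naming scheme are my assumptions):

from datetime import datetime
from fabric.api import env, run

env.deploy_date = datetime.now().strftime("%Y%m%d-%H%M%S")

def deploy():
    # each deploy gets its own directory; "current" is a symlink, so
    # rolling back just means re-pointing it at an older deploy_date
    run("mkdir -p /srv/app/releases/%s" % env.deploy_date)
    run("ln -sfn /srv/app/releases/%s /srv/app/current" % env.deploy_date)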

They use a servers.json configuration file that documents the instance id, public dns, private dns, names and roles (solr, memcached, etc). Fabric can use this to deploy, nginx can use it to load balance, Django can import it in the settings to know what to talk to.
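
For instance, the settings side could look something like this (the field names are my guesses based on the description):

# settings.py
import json

with open("servers.json") as f:
    SERVERS = json.load(f)

# e.g. point Django at whichever instances carry the memcached role
MEMCACHED_HOSTS = [s["private_dns"] for s in SERVERS
                   if "memcached" in s["roles"]]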

Couple of thoughts on EuroPython 2011

Taking a break from writing up my notes into blog posts to share a few general thoughts about the conference.

General feeling: Woohooooo! This was a fantastic week, I learnt a ton and met an amazing amount of awesome people. If you're here because I gave you one of my cards, hi! o/ It was lovely to meet you. (If you asked about the card and forgot, these are MooCards from moo.com. Get yourself some! People fought over these at EuroPython, I'll have you know. They're that good!)

The conference was wonderfully well organised, including the evening events. I fondly recommend the bistecca alla fiorentina from Zaza! Everyone was incredibly friendly; like at PyCon Ireland last year, it was common to strike up interesting conversations with a random stranger beside you, and impromptu dinner plans would be shared between groups.

I was humbled by how egoless most people I spoke with were. They seem to know that no one knows absolutely everything about Python (and there were funny anecdotes about this, such as famous names requesting new features that are already in the language!). I was incredibly surprised when one of the keynote speakers sat at my table during lunch on the first day -- I had assumed well-known people would have solid cliques and no time or desire to meet new faces. And of course they ended up being just as nice as everyone else.

Some things I grumbled over: the constant strikes in Italy, first at the airport when I landed, then the trains when I (tried to) leave! I was disappointed to miss out on the training I was hoping to attend as well: the rooms were a bit small and filled up long before the training's starting time. Lesson learnt for next year!

...And indeed I am much looking forward to going again next year. In the meantime I welcome all Pythonistas to PyCon Ireland in Dublin this October! :D

EuroPython 2011: Alex Martelli on API Design Anti-Patterns (a few notes)

Link: Talk description and video


The easy way to write an API is to use your current implementation, but then you expose implementation details which makes it harder to change or improve the implementation in the future.

In software development, you shouldn't think up a big design upfront, except for 2 things: security, and API.

Forget about your implementation: think up 2 or 3 different ways you could have implemented your project, and keep only the common parts, the "substance" for the API.

To motivate people to migrate to your new API: don't add new features to the old one.

Make choices.

Also, don't be in an environment where making mistakes is punished, rather than fixed :)

EuroPython 2011: Wesley Chun on Python 3 and Python 103 (incomplete notes on interesting talks)

Python 3: The Next Generation

Link: Talk description and video

Python was created in 1991 -- it's 20 years old now! Lots of time for cruft.

Python 3 is backwards incompatible. Stuff will break. Hopefully the migration will not be too painful; the main change for most programs will likely be unicode strings.

In 1997 Guido wrote "Python regrets" which later on and with other things became the basis for Python 3000.

A few of the changes (note from me: they were not all mentioned and I didn't take note of everything either!):

  • The print statement becomes the print() function.
  • Numbers, divisions: division will now be true division by default (1/2 == 0.5 as opposed to the current 1/2 == 0) -- better for teaching, especially young ones
  • Dictionaries will use iterators/views by default to save memory
  • Likewise for built-ins like map/filter, many changes to have better speed and/or memory efficiency

With regard to migrating:

  • Wait for your dependencies to port
  • Have a good test suite
  • Move your code to at least 2.6, which is when Python 3 functionality started to be backported
  • The -3 switch tells you about incompatibilities
  • The 2to3 tool offers diffs on what should be ported

Be careful with Python 3 books, if they cover 3.0 they are already obsolete. Lots of changes!


Python 103

Link: Talk description and video

This talk aims to fill in the gaps in the knowledge of not-quite-beginners-anymore. We'll have a look at: the memory model, mutability, and methods.

Special methods

Mutable objects are passed by reference, immutable objects (like a string or number) are passed by value.

class Stuff:
    version = 1.0

john = Stuff()

As a shortcut you can call john.version rather than Stuff.version. However if you assign john.version = "blah", you're hiding access to the class 'version' attribute and only changing it for the john instance -- basically creating a new instance attribute.
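
Continuing the example above:

print(john.version)    # 1.0, looked up on the class
john.version = "blah"  # creates an instance attribute...
print(john.version)    # "blah" -- it shadows Stuff.version
print(Stuff.version)   # 1.0, the class attribute is untouched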

To initialise, if there's one parent you should do Parent.__init__(self). If you're lower down, try to use super() for MRO and stuff. Extra reading: Python's super() considered super! and Python's Super Considered Harmful.

You can make your own classes act like Python types, by overloading existing operators and existing built-in functions. There are loads of them! __init__, __new__ (for immutable objects), __len__, __eq__, __getslice__, __getitem__, __contains__ (for the 'in' keyword), ...

How to (not) use the stack - performance

The timeit module will run a statement a million times and help find which implementation is faster.

For instance:

1. while i < len(x)
2. strlen = len(x); while i < strlen

The second one is faster: that's the penalty of a stack frame. The len() value is not cached because it could change (with overloading, etc.), so the 2nd version removes a function call from every iteration.
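
A quick way to check this with timeit (my own setup, not the speaker's):

import timeit

setup = "x = 'a' * 1000"
loop1 = "i = 0\nwhile i < len(x): i += 1"
loop2 = "i = 0\nstrlen = len(x)\nwhile i < strlen: i += 1"

print(timeit.timeit(loop1, setup, number=1000))  # slower: calls len()
print(timeit.timeit(loop2, setup, number=1000))  # faster: cached length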

Objects and references

All Python objects have:

  • An identity (similar to the memory address)
  • A type
  • A value (only that one can change, sort of)

References are also called aliases. There is a reference count to track the total number. Variables do not hold data, they point to objects. For instance if 2 variables are assigned 1, they would both be pointing to the same (immutable) Integer object representing 1.

3 types of standard types

His classification, nothing official.

  • Storage: Linear/Scalar vs. Container
  • Update: Mutable vs. Immutable
  • Access: Direct (number) vs. Sequence (string, list) vs. Mapping (dictionary)

"Interned" numbers are a list of integers in range(-5, 257): these are special numbers that are never deallocated.

"is" is a keyword to assess if 2 variables point to the same object.

Beware shallow copy vs. deep copy when you have a list of lists.
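
The pitfall in a few lines:

import copy

grid = [[0, 0], [0, 0]]
shallow = copy.copy(grid)   # the inner lists are shared!
shallow[0][0] = 9
print(grid[0][0])           # 9 -- the original changed too

deep = copy.deepcopy(grid)  # the inner lists are duplicated
deep[1][1] = 7
print(grid[1][1])           # still 0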

Memory allocation

Growing memory: When you call .append() and the list is full, 4 free units are malloc'ed, then double that, then a bit less: usually about 12.5% additional free slots are created at once in advance. What this means is that you should try not to have a lot of short lists.

Shrinking memory is a fairly inexpensive operation.

Inserting in the middle of a list creates a lot of shifting. Deque is better for push() and pop(), though less good for access.
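
With collections.deque:

from collections import deque

d = deque([1, 2, 3])
d.append(4)      # O(1) at the right end
d.appendleft(0)  # O(1) at the left end, unlike list.insert(0, ...)
print(d.popleft())  # 0
print(d.pop())      # 4
# but indexing into the middle is O(n), where a list would be O(1)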

EuroPython 2011: Simon Willison on Challenges in developing a large Django site

Links: Talk description, video and slides.


Simon Willison is the co-founder of lanyrd.com, a social website for conferences.

Tips and tricks

Signing (from 1.4, currently in trunk)

Using cryptographic signing for various things can ensure that they haven't been tampered with, for instance a cookie or an unsubscribe link. If you encrypt your session cookies you don't have to hit the database anymore, you just need to check the proper signed cookie.

The speaker showed a couple of short code examples to demonstrate how simple it is to use, and how the interface is consistent with the other serialisation interfaces.

from django.core import signing
signing.dumps({"foo": "bar"})  # url safe
signing.loads(string)

cache_version

This is another way to do cache invalidation. You add a cache_version field to the model, that is incremented when calling the save() hook or a touch() method. In the template cache fragment, you use the primary key and the cache_version to invalidate.
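
A sketch of the model side (the names are mine, pieced together from the description):

from django.db import models

class Conference(models.Model):
    name = models.CharField(max_length=200)
    cache_version = models.IntegerField(default=0)

    def save(self, *args, **kwargs):
        # bump the version on every change; a touch() method could
        # do the same without other field updates
        self.cache_version += 1
        super(Conference, self).save(*args, **kwargs)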

You can also mass invalidate by updating the cache version of objects.all() using F() -- example from the slides:

topic.conferences.all().update(
    cache_version = F('cache_version') + 1
)

noSQL for denormalisation

Use noSQL to denormalise and keep the database and the cache/nosql in sync. It's more work but it's worth it.

For instance they use Redis sets to maintain lists such as username-follows, europython-attendees and then they simply need to do a set intersection to get the information they want. These are only lists of ids so they don't take that much space.
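
With redis-py that intersection is a single call (the key names are made up):

import redis

r = redis.Redis()
r.sadd("follows:alice", "1", "2", "3")
r.sadd("attendees:europython", "2", "3", "4")

# "people alice follows who attend EuroPython"
print(r.sinter("follows:alice", "attendees:europython"))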

Hashed static asset filenames in CloudFront

They created a management command to push static assets, that compresses Javascript, changes the names/urls, etc. This way they can publish them in advance, and also keep static files around if there's a need to rollback. The different names are also good to prevent Internet Explorer caching.

Challenges

This part of the talk is about things they don't really have answers for.

HTTP Requests

e.g. talking to an API: what if it fails or takes 30 seconds? Do you use urllib? What if people enter private urls from within your Intranet? :O

You have to handle connection timeouts, logging and profiling, url validation, and http caching. All of these are a common set of problems that should be baked into the framework.

Profiling and debugging production problems

Debugging in development rocks, with the django-debug-toolbar, the way 500 errors are handled, pdb, etc.

Once you turn debug to False, you're blind. After a while, all the bugs, particularly performance bugs, only happen in production.

He showed us a code snippet for a UserBasedExceptionMiddleware: if you access a page throwing a 500 error and is_superuser is True, you see a traceback instead of the default 500 page (so if one of your users reports a problem, you can go to the page straight off and see a traceback).
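
Reconstructed from the behaviour described (not necessarily the exact snippet shown in the talk):

import sys
from django.views.debug import technical_500_response

class UserBasedExceptionMiddleware(object):
    def process_exception(self, request, exception):
        if request.user.is_superuser:
            # superusers get the full debug traceback in production
            return technical_500_response(request, *sys.exc_info())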

At the database level, there is a handy tool called mysql-proxy that is customisable using Lua. Using a wonderful, horribly documented library called log.lua, you can for instance turn on logging for a couple of minutes when needed.

He created an app called django_instrumented (unreleased, until it's cleaned up) that collects statistics and sticks them into memcached. He has a special bookmark to access them, and they are stored for 5 minutes only -- so they waste neither space nor time.

This actually helped improve the performance: if you measure something and make it visible, people will improve it over time.

0 downtime deployments

Code-wise it's easy enough to do, but when there are database changes it's tougher. Ideally they try to make schema changes backwards compatible, then use ./manage.py migrate (using South) on another web server.

Having a read-only mode made a lot of problems easier! It's not 0 downtime but the content is still readable. It can be a setting or a Redis key.

Feature flags work in the same way but at a more fine-grained level, for instance turning off search while you update your solr cluster. There's quite a bit more work involved.

One lesson we keep on learning in Django

We went from one database to multi-databases, from one cache to multi-caches, from one haystack backend to multiple backends.

Debug is one single setting, that affects a lot of things.

The timezone setting also affects Apache log files.

The middleware concept is very powerful, but is executed on every single request: if there's a conditional it has to be done within the middleware.

Really, global settings should be flushed out of the project! They are evil settings that cannot be changed at runtime.
