PyPap: April 2009

Wednesday, April 22, 2009

PyCon 2009 Notes - March 27th

Friday, March 27th through Sunday, March 29th were the “core” conference days. These are the days with regular scheduled talks, keynote talks, lightning talks and open spaces. You can see an overview of the schedule for these three days at http://us.pycon.org/2009/conference/schedule/.

I’ll present my notes from PyCon in chronological order.

Last year there were lightning talk sessions after the scheduled talks on all three days. Perhaps the scheduling committee got plenty of “more lightning talks” feedback, because this year each day also started with lightning talks. So, except for a brief introduction from the PyCon 2009 Chair, David Goodger, on Friday morning, the conference kicked off with lightning talks. I found this quite fitting and in keeping with PyCon being a community conference.

Morning Lightning Talks

You’ll find the video at http://us.pycon.org/2009/conference/schedule/event/2/.

Brett Cannon (~3:45 in the video) presented on python.org’s switching to a DVCS (from Subversion). See http://www.python.org/dev/peps/pep-0374/, particularly http://www.python.org/dev/peps/pep-0374/#chosen-dvcs which makes the decision to go with Mercurial official.

Jonathan Ellis (about 12:30 in the video) spoke about the Cassandra distributed database, open sourced by Facebook in the summer of 2008. He made it sound like it’s the only (open source?) “distributed database” with a Python API.

Jeff Rush - About Python Namespaces (and Code Objects)

See http://us.pycon.org/2009/conference/schedule/event/7/ for the video, slides and other files.

I didn’t know compiling and disassembling Python code is as simple as:

s = 'x = 5'
co = compile(s, '<stdin>', 'exec')
from dis import dis
dis(co)

His slides and/or the video are worth reviewing.

His “thunk” example—which he defines as “like a proxy but it gets out of the way when you need it”—is interesting. See page 38 of the PDF or about 11:30 of the video.

I also noted that he said “we spend a lot of time going over source code in Dallas [at the Dallas Python Interest Group]”. That would be a worthwhile thing to try at BayPIGgies—I’ll propose it [TODO].

Adam D Christian - Using Windmill

See http://us.pycon.org/2009/conference/schedule/event/9/ for the video and PowerPoint slides.

“Windmill is the best-integrated solution for Web test development and its flexibility is largely due to its development in Python.”

Open source
Looks pretty slick
http://www.getwindmill.com
Selenium does SSL, Windmill doesn’t…yet.
Selenium has strong Java integration (and Windmill does not)

Question: “Why did you create Windmill?”
Answer: “At the time it took us longer to debug a Selenium test than to write it over again.”

Mike Fletcher - Introduction to Python Profiling

See http://us.pycon.org/2009/conference/schedule/event/15/ for the video and slides in OOo & PDF formats.

Asked 12 programmers “If I had a million dollars to spend on Python…”. The top three answers were about improving performance.

Good introduction. You may want to check out the slides first and then turn to the video for more detail.

Visualization Tools:
- KCacheGrind - “some assembly required for use with Python” [and non-trivial to get it working on Mac OS X]
- RunSnakeRun: (http://www.vrplumber.com/programming/runsnakerun/) - “doesn’t provide all the bells-and-whistles of a program like KCacheGrind, it’s intended to allow for profiling your Python programs, and just your Python programs”
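
For readers who have never collected profile data at all, here is a minimal sketch using the standard library's cProfile and pstats modules. It is not taken from the talk: the `slow_sum` workload and the file name in the final comment are made up for illustration, and the code is written for Python 3.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Hypothetical workload, deliberately naive so it shows up in the profile.
    total = 0
    for i in range(n):
        total += i * i
    return total


# Collect profile data only around the code of interest.
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100000)
profiler.disable()

# Summarize the collected data as text, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)

# To inspect the data in a graphical viewer instead, the raw statistics can
# be written to a file (the name here is arbitrary):
# stats.dump_stats("slow_sum.prof")
```

The text report is usually enough to spot the most expensive function; a visual tool becomes more useful once the call graph is large.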

Kumar McMillan - Strategies For Testing Ajax Web Applications

See http://us.pycon.org/2009/conference/schedule/event/18/ for the video and a ZIP containing the slides in HTML (or go to http://farmdev.com/talks/test-ajax/).

5 strategies:

Test Data Handlers
Test JavaScript
Isolate UI for Testing
Automate UI Tests
Gridify Your Test Suite

Some resources on his wrap-up slide.

Aaron Maxwell - Building an Automated QA Infrastructure using Open-Source Python Tools

See http://us.pycon.org/2009/conference/schedule/event/22/ for the video and the slides (in OOo & PPT formats). Or see http://redsymbol.net/talks/auto-qa-python/:

“This demonstrates the value of an automated QA system. If you need to manually execute the code coverage tool, then in practice you just won’t do it as often as if it is run for you. If your QA system automatically runs code coverage each night (for example), you and your team are freed up from bothering to do it manually - or even remembering to do so. It’s just done silently, and a fresh coverage report is available when you are ready to see it.

“This talk referenced the Buildbot QA/CI Framework. There are many such frameworks with different plusses and minuses. BuildBot’s weakness is its brief but steep learning curve, which makes it harder than anyone would like to set up for simple projects. Its plusses are its generality, range, and extensibility: it can be made to do almost anything you need your QA system to do, even for tremendously large projects with complex test metrics. Overall, I recommend BuildBot be used for building your QA framework, unless you have some particular reason to use one of the others that are out there.”

From slide 5: “Your QA System is ONLY as good as its reporting of results. If you don’t get this done well… none of the rest matters. Under appreciated…And critically, critically important.”

From slide 9: “BuildBot is probably the best general purpose Python-based, open-source framework available now.”

Slide 11 gives quick definitions of some BuildBot architectural terms.

Slides 12-19 walk through examples of a simple and a more complex BuildBot configuration.

Slides 20 & 21 show examples of extending BuildBot.

Owen Taylor - Reinteract: a better way to interact with Python

I didn’t attend this talk, but several people remarked on it later. I’ve since played with Reinteract and I recommend you check it out: http://www.reinteract.org/.

See http://us.pycon.org/2009/conference/schedule/event/23/ for the video and the slides (in PDF format). The slides are not at all useful by themselves. But I definitely recommend you watch the video. Reinteract could well be a tool you’ll want to use regularly.

“Traditionally Python has worked one of two ways: either a program with an edit-run cycle or a command prompt where the user types commands. Reinteract introduces a new way of working where the user creates a worksheet that interleaves Python code with the results of that code. Previously entered code can be changed and corrected. The ability to insert graphs and plots in the worksheet makes Reinteract very suitable for data analysis, but it also is a good for basic experimentation with the Python language. This talk introduces Reinteract and gives a high-level peek at the magic behind the scenes.”

Ned Batchelder - Coverage testing, the good and the bad.

See http://us.pycon.org/2009/conference/schedule/event/26/ for the video and the slides (in PDF format).

“Coverage testing tests your tests”

The slides are easy to read without the video if you prefer, so I won’t duplicate them here.

Writing more tests is the “only way to truly increase code coverage”. Excluding code to boost coverage is tempting, but you’ll never come back, so you’re only hurting yourself.

What is currently “100% broken”:

branch coverage
path coverage
loop path coverage
data-driven code - can’t measure data used
complex conditionals
hidden branches
broken tests
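
One plausible reading of the first items on this list is that a report of 100% statement coverage says nothing about branches that were never taken. The following sketch illustrates that reading with a made-up function; it does not come from the slides and does not use any coverage tool.

```python
def describe(n):
    label = "number"
    if n < 0:
        label = "negative " + label
    return label


# This single test executes every line of describe(), so a statement
# coverage report would show 100% for the function...
assert describe(-1) == "negative number"

# ...yet the path on which the `if` body is skipped has never run.
# Only a second test exercises it.
assert describe(1) == "number"
```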

Dr. C. Titus Brown - Building tests for large, untested codebases

See http://us.pycon.org/2009/conference/schedule/event/30/ for the video and the slides (in PDF format).

Presented on his experiences creating tests for pygr, a Python graph database (for use in bioinformatics). (slide 11)

~8K of Python, ~2K of Pyrex (-> C, for speed)
almost all library and framework (complex)
lots of technical debt

Code coverage invaluable when aimed at (slide 16)

new tests efforts on legacy code
understanding code bases

Grokking code through coverage (slide 19)

start with minimum useful statement
examine code that’s actually executed
add additional statement
examine executed code
repeat

(At some point, though I can’t find it in the slides, he showed a --coverage-diff command-line option, to figleaf?)

Coverage driven testing (slide 29)

each new test should “attack” an uncovered line of code
immediate gratification of new code coverage
finds simple bugs with ease
you now understand that code

Jesse Noller - Introduction to Multiprocessing in Python

I didn’t attend this (as it was at the same time as the above talk), but I heard it was good. See http://us.pycon.org/2009/conference/schedule/event/31/ for the video and the slides (in PDF format).

Michael Foord - Functional Testing of Desktop Applications

See http://us.pycon.org/2009/conference/schedule/event/34/ for the video. See http://www.voidspace.org.uk/python/articles/testing/index.shtml for “online slides”.

If you write applications without tests then you are a bad person, incapable of love. — Wilson Bilkovich (The Rails Way)

Why Test Functionally? (http://www.voidspace.org.uk/python/articles/testing/processes.shtml)

Unit tests test components - not the application as a whole
Check new features don’t break existing functionality
Massively helpful when refactoring
Individual tests act as specification for a feature
Test suites are a specification for the application
When the test passes you know the feature is done
They can drive development

Good advice in dealing with problems (http://www.voidspace.org.uk/python/articles/testing/problems.shtml)

Fragility due to layout changes

Timing problems (beware the lure of the voodoo sleep)
Some UI elements are very hard to test
System dialogs (that are hard to interact with programmatically)
How do you test printing?
Bugs in the GUI toolkit
Spurious, random and impossible failures
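
On the timing problems in particular: a common alternative to a fixed-length sleep is to poll for the expected condition with a timeout. The sketch below is an assumption about what that could look like, not code from the talk; `wait_until` and `dialog_is_open` are hypothetical names, and the code is written for Python 3.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns a true value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a UI event: the "dialog" counts as open on the third check.
checks = {"count": 0}


def dialog_is_open():
    checks["count"] += 1
    return checks["count"] >= 3


opened = wait_until(dialog_is_open, timeout=2.0, interval=0.01)
```

The test proceeds as soon as the condition holds and fails with a clear timeout otherwise, rather than relying on a sleep that happens to be long enough on one machine.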

Raymond Hettinger - Easy AI with Python

I didn’t attend this (as it was at the same time as the above talk), but I heard it was good. See http://us.pycon.org/2009/conference/schedule/event/71/ for the video and slides (in PPT & PDF formats).

Evening Lightning Talks

You’ll find the video at http://us.pycon.org/2009/conference/schedule/event/39/.

RANT: “import *” is evil (right at 0:05 in the video)

some call Brazil “Belindia” because it’s like “islands of Belgium in a sea of India” (6:18)

Michael Foord - Metaclasses in Five Minutes (12:00)
http://www.voidspace.org.uk/python/articles/five-minutes.shtml

Thursday, April 16, 2009

Python 401: Some Advanced Topics

On Thursday, March 26th, I attended Steve Holden’s Python 401: Some Advanced Topics tutorial. This one wasn’t as mind-expanding as the previous tutorial (or the three I took at PyCon 2008), and none of the material was new to me. But I’ve found re-learning material will often fill in the gaps in my knowledge, and that certainly was the case here.

You’ll find the slides at http://holdenweb.com/files/Python401.pdf.

The material was divided into six “lessons”, and three appendices.

Lesson 1 (slides 4 through 15) was on string interpolation, which I thought I had mastered (especially after the Secrets of the Framework Creators tutorial at PyCon 2008 and after writing http://pypap.blogspot.com/2008/03/string-interpolation.html). But I did learn a few new things. For example, I didn’t realize that the ‘%s’ conversion converts the value with str(), that is, it calls the value’s __str__() method. So one can quite safely do:

print '%s' % foo

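…regardless of the type of foo. (The one exception is a tuple, which the % operator treats as the argument list; write '%s' % (foo,) in that case.)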
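
A short sketch of that behavior, using a made-up `Temperature` class. It also shows the one case that needs care: a tuple on the right-hand side of % is taken as the argument list, so it has to be wrapped in another tuple.

```python
class Temperature(object):
    # Hypothetical class, only here to show that '%s' goes through __str__().
    def __init__(self, degrees):
        self.degrees = degrees

    def __str__(self):
        return "%d degrees" % self.degrees


as_text = '%s' % Temperature(21)   # uses Temperature.__str__()
number = '%s' % 3.5
nothing = '%s' % None

# A bare tuple would be unpacked as the arguments, so wrap it in a 1-tuple.
pair = '%s' % ((1, 2),)
```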

I also didn’t know that one can use an asterisk to make width and precision “data dependent”. (Steve notes this only works with tuple data—of course it won’t work with a dictionary because the values are not ordered.) So you can do the following:

>>> def foo_wide(width):
...   print '%*s' % (width, 'foo')
...
>>> foo_wide(4)
 foo
>>> foo_wide(10)
       foo

…or…

>>> import math
>>> def pi_wide(width, precision):
...   print '%*.*f' % (width, precision, math.pi)
...
>>> pi_wide(8,2)
    3.14
>>> pi_wide(10,5)
   3.14159

Lesson 2 (slides 17 through 27) was on iteration. Steve explained the “iteration protocol”:

iterables must have an __iter__() method which returns an iterator
iterators must be iterable, and must also have a next() method (spelled __next__() in Python 3)
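
The two rules can be sketched with a pair of made-up classes. This is an illustration rather than anything from Steve's slides; it is written for Python 3, where the method is named `__next__()`, with an alias for the Python 2 spelling.

```python
class Countdown(object):
    """An iterable: __iter__() returns a fresh iterator on every call."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)


class CountdownIterator(object):
    """An iterator: it is itself iterable and produces the values."""

    def __init__(self, current):
        self.current = current

    def __iter__(self):
        # Iterators must be iterable; returning self satisfies that.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

    next = __next__  # Python 2 spelling of the same method


countdown = Countdown(3)
first_pass = list(countdown)
second_pass = list(countdown)  # works again because each pass gets a new iterator
```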

Steve (and I later observed Michael Foord also) pronounced __iter__ as “dunder-iter”. It sounded a little strange at first, but it’s certainly easier than saying “under-under-iter” or “under-under-iter-under-under”.

Steve mentioned the itertools standard library module, but didn’t allocate time in the tutorial to cover it. (For that I recommend Doug Hellmann’s PyMOTW blog post.)

He concludes lesson 2 with a slide (#27) explaining how to use the enumerate() built-in function (which I have found useful many times).
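
As a small illustration of enumerate() (the list contents are made up), it yields (index, item) pairs, which saves keeping a separate counter:

```python
talks = ['profiling', 'coverage', 'multiprocessing']

numbered = []
for index, title in enumerate(talks):
    # index starts at 0, so add 1 for a human-friendly number.
    numbered.append('%d. %s' % (index + 1, title))

pairs = list(enumerate(talks))
```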

Lesson 3 (slides 29 through 35) was on generators and generator expressions. I like Steve’s explanation that generators are for creating sequences where computation is needed to create each element. And in conclusion, he writes that generators can “express producer-consumer algorithms more naturally” since the “generation of values is cleanly separated from their processing”. But aside from these insights, I didn’t learn anything new about generators. (That may be difficult after David Beazley’s excellent “Generator Tricks for Systems Programmers” tutorial at PyCon 2008.) And in spite of the lesson’s title, Steve didn’t cover generator expressions.
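
A minimal sketch of that producer-consumer separation, with made-up function names. The last line shows the equivalent generator expression, the topic the lesson title promised.

```python
def squares(limit):
    # Producer: each value is computed only when the consumer asks for it.
    n = 0
    while n < limit:
        yield n * n
        n += 1


def total(values):
    # Consumer: knows nothing about how the values are generated.
    result = 0
    for value in values:
        result += value
    return result


answer = total(squares(5))

# The same computation written as a generator expression.
same_answer = sum(n * n for n in range(5))
```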

Lesson 4, covering Descriptors and Properties, was the most useful to me. I’d heard of descriptors and properties, but never really studied them or read code that used them. First, Steve explains in detail how attribute lookup works in new-style classes. This leads (after an aside which I’ll mention later) to his definition of properties: “a way of interposing code between client and server of a namespace”. One can define, using the property() built-in, a getter, setter and deleter, plus a doc string. And since the first argument to property() is the getter function, one can use property as a decorator (with no arguments) around a method. (See slide 41.) David Beazley (who was also taking this tutorial) spoke up and pointed out that in Python 2.6 a property object (returned from the property built-in) has setter and deleter methods that can be used as decorators. See the property built-in documentation for an example. On slide 45, Steve shows how to define properties without namespace pollution. Finally (slides 47 through 50) he goes into detail on the difference between old-style and new-style attribute lookup. I realized as Steve wrapped up this lesson that I still didn’t understand what a descriptor is, so I asked. Steve’s answer (I think) was that the “descriptor protocol” is what enables properties to work. I gave myself a to-do to read the Python documentation on descriptors.
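
Here is a sketch of both decorator forms, using a made-up `Celsius` class rather than the examples on the slides. The last line calls the property's `__get__()` by hand, which is roughly what attribute lookup does through the descriptor protocol.

```python
class Celsius(object):
    def __init__(self, degrees):
        self._degrees = degrees

    @property
    def degrees(self):
        """Temperature in degrees Celsius."""
        return self._degrees

    @degrees.setter  # the setter decorator available since Python 2.6
    def degrees(self, value):
        if value < -273.15:
            raise ValueError("below absolute zero")
        self._degrees = value

    @property
    def fahrenheit(self):
        # Read-only: no setter is defined for this property.
        return self._degrees * 9.0 / 5.0 + 32


t = Celsius(20)
t.degrees = 25            # runs the setter, including its validation
reading = t.fahrenheit    # runs the getter; no call parentheses needed

# A property is a descriptor: looking up t.degrees ends up calling this.
via_descriptor = Celsius.degrees.__get__(t, Celsius)
```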

Back to the aside (on slide 39) I mentioned above. Steve notes that when you look up a callable on an instance, the interpreter creates a “bound method”, therefore (presumably because these are objects like everything else in Python) “a method call carries object creation overhead”. There’s a good illustration of this on the slide. This would be good to keep in mind if I ever find myself trying to squeeze as much performance as possible out of some Python code.
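
The object creation can be observed directly. In this sketch (a made-up `Greeter` class, not the slide's illustration), two lookups of the same method give two distinct bound-method objects, and hoisting the lookup out of a loop avoids repeating that work.

```python
class Greeter(object):
    def greet(self):
        return "hello"


g = Greeter()

# Each attribute lookup builds a new bound-method object.
first = g.greet
second = g.greet
distinct_objects = first is not second
equal_methods = first == second   # they still compare equal

# In a hot loop, look the method up once and reuse the bound method.
greet = g.greet
results = [greet() for _ in range(3)]
```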

Lesson 5 (slides 52 through 73) is on metaclasses. I’d seen these before in the “Secrets of the Framework Creators” tutorial at PyCon 2008 (and during the time I spent digging around inside the Django sources). If you’re still trying to wrap your head around metaclasses, this may be a quick way to get there. I won’t attempt to summarize, but the insight I gained from this lesson is that the type() built-in, when called with three arguments, returns a new type object. In other words it’s a dynamic form of the class statement. (This is the mechanism for implementing metaclasses.) You may also want to read Michael Foord’s “Metaclasses in Five Minutes” notes or watch the video of his lightning talk at PyCon 2009 (which is supposed to start 11 minutes in). Though I would conclude that if you’re considering using metaclasses, you should seriously consider using class decorators first. (See my notes on the “Class Decorators: Radically Simple” PyCon 2009 talk.)
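
A minimal sketch of the three-argument form, with made-up names: the arguments are the class name, a tuple of base classes, and a dictionary that becomes the class namespace.

```python
def describe(self):
    return "%s at (%d, %d)" % (type(self).__name__, self.x, self.y)


# Equivalent to a class statement named Point with x, y and describe in its body.
Point = type('Point', (object,), {'x': 0, 'y': 0, 'describe': describe})

p = Point()
p.x = 3
description = p.describe()
```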

Lesson 6 (slides 75 & 76) was not really a lesson but a very quick wrap-up.

Finally there are three appendices. We did find the time to cover Appendix A (slides 78 through 84). It’s on decorators, but only on the simpler form of decorators that don’t take arguments. Slides 83 and 84 cover functools.wraps and functools.partial, and are interesting reading.
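
For a taste of those two functools helpers, here is a sketch with a made-up `logged` decorator (not the slides' example). `functools.wraps` copies the wrapped function's name and doc string onto the wrapper, and `functools.partial` freezes some arguments of a callable.

```python
import functools


def logged(func):
    """Simple argument-free decorator that records each call."""
    calls = []

    @functools.wraps(func)  # keep func's __name__ and __doc__ on the wrapper
    def wrapper(*args, **kwargs):
        calls.append((args, kwargs))
        return func(*args, **kwargs)

    wrapper.calls = calls
    return wrapper


@logged
def power(base, exponent):
    """Raise base to exponent."""
    return base ** exponent


# partial() returns a new callable with exponent already filled in.
square = functools.partial(power, exponent=2)
nine = square(3)
```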

We did not cover the other two appendices. Appendix B (slides 86 through 89) is on context managers, which I myself covered back in July 2008 to present a “Newbie Nugget” to BayPIGgies on the with statement. Appendix C (slides 92 through 107) is on unit testing. If you’re new to unit testing or new to the Python unittest module, then this is worth a read.

Thursday, April 9, 2009

A Curious Course on Coroutines and Concurrency

On Wednesday, March 25th, I attended David Beazley's A Curious Course on Coroutines and Concurrency tutorial at PyCon 2009. This was an excellent tutorial that continued from where David's Generator Tricks for Systems Programmers tutorial from PyCon 2008 left off. (See my notes on that tutorial in my PyCon 2008 Notes blog post; I see I never did write a summary.)

David again has made his tutorial materials (including excellent slides and plenty of code samples) publicly available: http://www.dabeaz.com/coroutines/.

At the start of the tutorial I had written generators and had some recollection of the "Generator Tricks for Systems Programmers" tutorial. But I had only a vague sense of what a coroutine is. After the tutorial I feel like my understanding of coroutines is much deeper. I'm ready to use them when called for. I might even understand them well enough to avoid looking for all kinds of inappropriate nails to hit with this new hammer.

After an entertaining overview—David's sense of humor provided some sugar to help the medicine of the sometimes challenging material go down—he introduces coroutines. As he writes (on slide 8), in Python 2.5 generators "picked up some new features to allow 'coroutines'" (see PEP-342 "Coroutines via Enhanced Generators"), "most notably: a new send() method". He adds "If Python books are any guide, this is the most poorly documented, obscure, and apparently useless feature of Python."

I digress, but as the author of the Python Essential Reference, David has raised the bar for himself. I happened to pick up a copy of a draft manuscript of a fragment of chapter 6—"Functions and Functional Programming"—from the upcoming 4th Edition at the Addison-Wesley booth at PyCon. Following a section in that chapter explaining what coroutines are is a section entitled "Using Generators and Coroutines". David explains that "generator functions are useful if you want to set up a processing pipeline, similar in nature to using a pipe in the UNIX shell." After an example of this he writes "Coroutines can be used to write programs based on data-flow processing. Programs organized in this way look like inverted pipelines." The example that follows explains that "the coroutine pipeline remains active indefinitely or until close() is explicitly called on it." So "a program can continue to feed data into a coroutine for as long as necessary", and his example shows two consecutive calls to send different data into the pipeline.

I've never owned a copy of Python Essential Reference, but after reading this draft manuscript and seeing first-hand David's ability to simplify sometimes complex material, I've pre-ordered a copy of the 4th Edition.

Anyway, I need to remember that this is meant to be a summary of the tutorial. If you want the details you can read the slides and look at the code samples.

The tutorial was divided into 9 parts. Part 1 (slides 15 through 33) is a very clear introduction to generators and coroutines. He summarizes (in slide 33) that generators produce data for iteration, whereas coroutines are consumers of data and are not related to iteration. He warns us not to mix the two concepts together.
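
To make the consumer idea concrete, here is a minimal sketch of a coroutine driven by send(). It is an illustration under assumptions, not David's code: the `grep` name and the list used to collect matches are made up, and it is written for Python 3.

```python
def grep(pattern, matches):
    # A coroutine: it never produces values for iteration. It waits at the
    # yield expression until a caller pushes a line in with send().
    while True:
        line = (yield)
        if pattern in line:
            matches.append(line)


found = []
g = grep('python', found)
next(g)                          # "prime" it: run up to the first yield
g.send('a line without a match')
g.send('python generators rock')
g.close()                        # raises GeneratorExit inside the coroutine
```

Forgetting the priming call to next() is the classic mistake: send() with a real value on a just-created generator raises a TypeError.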

In Part 2 ("Coroutines, Pipelines, and Dataflow", slides 34 through 52) David explains that coroutines can be used to set up pipes. Each pipeline needs an initial source (a producer) and an end-point (a sink). Because (unlike generators which pull data through the pipe with iteration) coroutines push data into the pipeline with send(), they allow you to send data to multiple destinations. That is, you can have branches in the pipeline. He shows broadcasting to multiple targets as an example. (See slides 44 through 46.) He concludes by showing how coroutines are "somewhat similar to OO design patterns involving simple handler objects". (I think he's talking about the Chain of Responsibility pattern.) He notes that just like a generator is an iterator "stripped down to the bare essentials", so is a coroutine very simple compared to the multiple classes required to implement this pattern. (This is an example of the claim made by Joe Gregorio in his The (lack of) design patterns in Python PyCon 2009 talk.) David also shows that coroutines are faster than objects (because of the lack of self lookups).
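
The shape of such a pipeline can be sketched as follows. All names here (`coroutine`, `broadcast`, `keep_if`, `collect`, `source`) are illustrative and the code is not copied from the slides; it is written for Python 3. The `coroutine` decorator simply takes care of the initial next() call.

```python
import functools


def coroutine(func):
    """Decorator that primes a coroutine so it is ready for send()."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start


@coroutine
def broadcast(targets):
    # Branch point: every item is pushed to all targets.
    while True:
        item = (yield)
        for target in targets:
            target.send(item)


@coroutine
def keep_if(predicate, target):
    # Filter stage: forwards only the items that satisfy predicate.
    while True:
        item = (yield)
        if predicate(item):
            target.send(item)


@coroutine
def collect(into):
    # Sink: the end-point of a branch.
    while True:
        into.append((yield))


def source(items, target):
    # Producer: an ordinary function that drives the pipeline.
    for item in items:
        target.send(item)


evens, everything = [], []
pipeline = broadcast([
    keep_if(lambda n: n % 2 == 0, collect(evens)),
    collect(everything),
])
source(range(6), pipeline)
```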

Part 3 ("Coroutines and Event Dispatching", slides 53 through 74) shows that "coroutines can be used to write various components that process event streams". His example shows parsing the XML data that is available with the real-time GPS tracking data of most Chicago Transit Authority buses. He has a coroutine that implements a simple state machine to convert the "events" from an XML parser into dictionaries of bus data, another coroutine to filter on dictionary fields, and a coroutine to print the dictionaries as a table. What's quite slick about this is that he hooks them together into a pipeline that works without modification with SAX, expat, and a custom C extension written on top of the expat C library. (Each is faster than its predecessor, and the last is slightly faster than using ElementTree.)

Part 4 ("From Data Processing to Concurrent Programming", slides 75 through 91) shows how coroutines "naturally tie into problems involving threads and distributed systems", since you send data to coroutines just as you do to threads (via queues) or processes (via messages). He creates a coroutine called threaded and hooks it up (slide 84) to the example coroutines from Part 3 so the filters and printing coroutines run in a separate thread. (And he notes this makes it run about 50% slower!) He then shows how coroutines could also be used to bridge two processes over a pipe or socket. So he notes that coroutines allow us to separate the implementation of a task (the coroutines) from the execution environment (threads, subprocesses, network). But he cautions us that huge collections of coroutines, threads and processes may be difficult to maintain and without careful study may make your program run slower. He also warns that the send() method on a coroutine must be synchronized, and if you call send() on an already-executing coroutine your program will crash (so no loops or cycles in the pipeline or multiple threads sending data into the same coroutine).
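
The idea of running the tail of a pipeline in another thread can be sketched like this. It is a rough approximation under assumptions, not the code from slide 84: the `_DONE` sentinel, the `workers` list (kept only so the example can wait for the thread) and the `collect` sink are all made up, and the code is written for Python 3.

```python
import functools
import queue
import threading

_DONE = object()  # hypothetical sentinel telling the worker thread to stop


def coroutine(func):
    """Decorator that primes a coroutine so it is ready for send()."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start


@coroutine
def collect(into):
    while True:
        into.append((yield))


@coroutine
def threaded(target, workers):
    # Items sent to this coroutine are queued; a worker thread pulls them
    # off the queue and is the only caller of target.send().
    messages = queue.Queue()

    def run_target():
        while True:
            item = messages.get()
            if item is _DONE:
                target.close()
                return
            target.send(item)

    worker = threading.Thread(target=run_target)
    workers.append(worker)
    worker.start()
    try:
        while True:
            messages.put((yield))
    except GeneratorExit:
        messages.put(_DONE)  # closing the pipe shuts the worker down


received = []
workers = []
pipe = threaded(collect(received), workers)
for n in range(3):
    pipe.send(n)
pipe.close()
for worker in workers:
    worker.join()  # wait until the worker has drained the queue
```

Because only the worker thread ever calls send() on the target, the rule about never sending into an already-executing coroutine is respected.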

From here, the tutorial gets much more "mondo". In Part 5 (slides 92 through 98) he explains that coroutines look like tasks, the building blocks of concurrent programming. In Part 6 (slides 99 through 109), he gives a crash course in operating systems (and shows the yield statement can be thought of like a trap) in order to prepare us for Part 7 (slides 110 through 168), where he builds an operating system using coroutines. I'm not going to go into detail on this, because this summary is already too long and you're better off reading his slides and looking at the code sample than reading my description of this. I'll note though that I enjoyed the humor in slide 152, where having written a Task class to wrap a coroutine, support for system calls and a Scheduler class with basic task management, he states "The next step is obvious; we must implement a web framework". But he settles for an echo server. In Part 8 (slides 169 through 188) he explains that coroutines can't call subroutine functions that yield, and explains the solution using "trampolining". Slide 187 is worth noting, where he shows that application code has "normal looking control flow", just like traditional socket code (and unlike any code using a module that uses event callbacks).
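
The "yield as a trap" idea can be shown with a toy round-robin scheduler. This sketch is far simpler than the Task and Scheduler classes in the slides (no system calls, no I/O, no trampolining), and the `countdown` and `run` names are made up; it is written for Python 3.

```python
from collections import deque


def countdown(name, n, log):
    # A "task": each bare yield hands control back to the scheduler,
    # much like a trap hands control to an operating system.
    while n > 0:
        log.append((name, n))
        yield
        n -= 1


def run(tasks):
    # Round-robin scheduler: run each task up to its next yield, then
    # put it at the back of the queue until it finishes.
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished; do not reschedule it
        ready.append(task)


log = []
run([countdown('a', 2, log), countdown('b', 3, log)])
```

The log shows the two tasks interleaving even though no threads are involved: the scheduler regains control every time a task yields.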

He wraps it all up in Part 9 (slides 189 through 198). Here's my summary of his summary:

generators (and coroutines) are "far more powerful than most people realize"
they have decent performance
but he's not convinced that it's worth using coroutines for general multitasking
it is "critically important" not to mix the three main uses of yield together:
iteration
receiving messages
a trap

If you find this at all interesting, I urge you to read David's slides and code. And join me in urging him to present another tutorial at PyCon 2010.
