Nate Silva

May 20 2008

Python PostgreSQL interfaces: pg8000 vs. psycopg2 micro-benchmark

I have great hopes for the pg8000 project, which is a PostgreSQL interface written entirely in Python.

The current standard in this category is something called psycopg2. It depends on libpq (the official PostgreSQL client library). While it works great, it can be difficult to get it running on some platforms (ahem, OS X), because you need to download all of PostgreSQL to get the client library. You can install just libpq if you can suss out the correct build parameters, but it’s still quite a hassle when deploying software on random client machines.

What pg8000 offers is no dependencies. I can do database stuff without having the client install anything other than Python itself.

But this comes at a cost; namely: it’s very slow.

I used one of our build scripts that records metadata from a directory tree into a PostgreSQL database. The directory tree contains more than 3,000 files. The script reads some data from each file, as well as its last modification time. For each file, it then executes two statements: one SELECT to validate the data, and one INSERT to plunk it into the database. A single commit() is executed at the very end of the whole process.

Using psycopg2 gave the following results:

Processing complete. 3230 files total.

real    0m28.654s
user    0m4.419s
sys 0m6.582s

With pg8000 I got the following results:

Processing complete. 3230 files total.

real    7m53.424s
user    0m13.516s
sys 0m8.267s

So, for this test, pg8000 took 16 times as long to complete.

I haven’t figured out why it’s so much slower. Perhaps pg8000 is opening transactions for each statement instead of using a single transaction. Or maybe an all-Python database interface is just destined to be slow.

For everyday use (i.e., you’re not executing thousands of consecutive statements at once) you probably won’t notice much difference using pg8000. As always, you need to profile to find out where the bottleneck is. If your application only needs to do a few statements per minute then the speed of your database interface probably isn’t a constraint.

Page 1 of 1