16 December 2013

Posted by: Duncan
Tags: Python | Psycopg2 | Database

An understanding and appreciation of Shakespeare's tragedies is, of course, a crucial aspect of modern software engineering; and today I could not help recalling Polonius's musings on the mental health of the Prince of Denmark after a difficult spell extracting binary data with Python from a Postgres database.

The object was simple - just get some sea-state data stored as arrays of 8-byte reals - and prep the results for use with a graphing package. The steps needed to achieve this may appear mad, as the following notes suggest, but in fact they are, like Hamlet's mind, perfectly logical and not "strange or odd".

Psycopg2

The popular adaptor implements PEP 249, the Python Database API Specification v2.0, in a cross-platform package for Python 2.6+ that wraps libpq, the standard PostgreSQL client library.

After last year's trauma with Qt's pgsql driver (I may have deleted that post for health and safety reasons), psycopg2 proved to be generally painless, apart from handling binary data. Installation was simple: yum install python-psycopg2 into my virtualenv. Once installed, a simple threaded interface class proved that it was indeed thread-safe. Nice.
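
For the curious, that thread-safety check amounted to little more than the sketch below. The connection details are placeholders; the point is simply that a single psycopg2 connection can be shared between threads as long as each thread uses its own cursor;

# Minimal thread-safety sketch - connection parameters are placeholders.
# psycopg2 connections may be shared between threads; cursors may not.
import threading
import psycopg2

conn = psycopg2.connect(host='localhost', dbname='waves', user='duncan')

def worker():
    cur = conn.cursor()          # one cursor per thread
    cur.execute('SELECT 1')
    cur.fetchone()
    cur.close()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
conn.close()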

The data

The data was held in a single PostgreSQL table, in a column of type bytea, which is a binary string. In practice, this means that bytea data is stored as a stream of bytes in which the original formatting is lost. So if the data was stored as 32-bit integers, it will not be possible to work this out from the data alone unless some additional meta information is stored (such as in a separate column or as a fixed-size 'header' at the beginning of the bytea).
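
To illustrate (this has nothing to do with the database itself): the same eight bytes unpack quite happily as two 32-bit integers or as one 64-bit double, and the bytes alone cannot tell you which reading was intended;

# Illustration only: struct will interpret the same bytes either way.
import struct

raw = struct.pack('<2i', 1, 2)          # eight bytes, packed as two ints
as_ints = struct.unpack('<2i', raw)     # (1, 2) - the intended reading
as_double = struct.unpack('<d', raw)    # one meaningless subnormal double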

The data itself was sea surface (wave) height data stored as 8-byte double precision arrays, with supporting descriptive data in other columns providing the location and timings (in fact I had written the C++ code that loaded the data via Qt's pgsql driver). In C++, the insertion was;

// wrap the raw array of doubles for insertion into the bytea column
QByteArray waveHeights =
   QByteArray::fromRawData((char*)wp->data, wp->numRecs*sizeof(double));

However, psycopg2 offers no direct route from a fetched bytea back to the original numeric data. This is where the lunacy begins...

Fetching the data

psycopg2 will return a buffer when it fetches a bytea from a PostgreSQL database. It looks like this;

>>> rec = db.fetch(column='wave_heights', id='2010_03_30_14_30_05_BPA')
>>> data = rec['data']
>>> type(data)
<type 'buffer'>
>>> data
<read-only buffer for 0x00AC6140, size 3200, offset 0 at 0x00C25920>
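
The db.fetch() call above is a thin wrapper class of mine; with plain psycopg2 the equivalent fetch would be along these lines (the table and column names are purely illustrative);

# Equivalent fetch with plain psycopg2 - names are illustrative only.
import psycopg2

conn = psycopg2.connect(host='localhost', dbname='waves', user='duncan')
cur = conn.cursor()
cur.execute("SELECT wave_heights FROM wave_records WHERE record_id = %s",
            ('2010_03_30_14_30_05_BPA',))
data = cur.fetchone()[0]    # the read-only buffer shown above
cur.close()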

Converting the data

The first step is to convert the buffer object to a string;

>>> s = str(psycopg2.BINARY(data, db.cursor))
>>> s
'CtejcD0KF0D2KFyPwvUWQPYoXI/C9RZA9ihcj8L1FkAK16Nw...'

The string is actually Base64-encoded, so it has to be decoded;

>>> import base64
>>> data2 = base64.standard_b64decode(s)
>>> data2
'\n\xd7\xa3p=\n\x17@\xf6(\\\x8f\xc2\xf5\x16@\xf6(\\\x8f\xc2...'

len(data2) now gives 2400 which, divided by 8 (bytes per double), gives 300 - which I knew to be the number of wave heights in the record. But wait, there's more to do...
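
Spelled out at the prompt, keeping the count for the unpacking step below;

>>> len(data2)
2400
>>> numrecs = len(data2) / 8
>>> numrecs
300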

As it stands, a decoded Base64 string is not massively useful. What I wanted was a list of doubles - effectively the format of the data before it was inserted into the database. Using the struct module from the Python standard library, this bit at least is easy (sort of);

>>> import struct
>>> final_list = list(struct.unpack('<{0}d'.format(numrecs), data2))
>>> final_list
[5.76, 5.74, 5.74, 5.74, 5.76, 5.76, 5.76, 5.77, 5.77,...]

Note that struct.unpack() returns a tuple, an iterable that can be cast to a list.

So there it is, finally. A somewhat elaborate way to get bytea data back into its original format, but one that works.
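
For reference, here is the whole dance rolled into a single helper. This is a sketch only: the connection, table and column names are placeholders, and it takes a plain str() of the buffer (which under Python 2 yields its raw contents) rather than going through the typecaster shown above;

# Sketch only - table/column names are placeholders.
import base64
import struct
import psycopg2

def fetch_wave_heights(conn, record_id):
    cur = conn.cursor()
    cur.execute("SELECT wave_heights FROM wave_records WHERE record_id = %s",
                (record_id,))
    buf = cur.fetchone()[0]                      # read-only buffer (Python 2)
    cur.close()
    raw = base64.standard_b64decode(str(buf))    # buffer -> str -> raw bytes
    numrecs = len(raw) // 8                      # 8 bytes per double
    return list(struct.unpack('<{0}d'.format(numrecs), raw))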

Links