Sur un air de balisette



Filed under: programming, Python — Alexandre Fayolle @ 09:00

An interesting project, Plumbum:

* Tomer Filiba’s blog post introducing Plumbum
* Plumbum documentation

Similar in goal to some of the modules living under the logilab.common package, with a nice and modern API (a lot of things in logilab.common were written before decorators were introduced… Heck, I started working on logilab.common at the time of python 1.5.2…)



setting default value in a python dictionary

Filed under: Python — Alexandre Fayolle @ 12:14

I encountered the following idiom in some OpenERP module recently:

some_dict['key'] = some_dict.get('key', {})

It struck me as an unusual way of setting a default value in a dictionary. So I investigated a bit, comparing the performance of this construct with other ways of doing the same thing, in order to check whether what I would have written (f2 below) was the most efficient way in addition to being (in my opinion, at least) the most readable. I used the timeit module and a few different dictionaries, in case the presence of the key influenced the performance of the operation:

import timeit

def f1(d):
    d['value'] = d.get('value', {})

def f2(d):
    if 'value' not in d:
        d['value'] = {}

def f3(d):
    d.setdefault('value', {})

if __name__ == '__main__':
    for d in ('{}', "{'a': None, 'b': None}", "{'a': None, 'b': None, 'value': None}"):
        print d
        for i in (1, 2, 3):
            print "f%d" % i
            t = timeit.Timer('f%d(d)' % i,
                             'from __main__ import f1, f2, f3; d = %s' % d)
            print t.repeat()

With my local version (python 2.7.3), f1 is always the slowest, closely followed by f3, while the readable explicit version (f2) runs about 2x faster.
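For readers on Python 3, a minimal sketch of the same comparison might look like the following. The `bench` helper and the loop count are my own choices for illustration, not part of the original script. Calling through a lambda adds some overhead to every measurement, and timings depend on the interpreter and the machine, so no numbers are shown here.

```python
import timeit

def f1(d):
    d['value'] = d.get('value', {})

def f2(d):
    if 'value' not in d:
        d['value'] = {}

def f3(d):
    d.setdefault('value', {})

def bench(func, template, number=100000):
    """Time `number` calls of func on one copy of template (hypothetical helper)."""
    d = dict(template)
    return timeit.timeit(lambda: func(d), number=number)

if __name__ == '__main__':
    for template in ({}, {'a': None, 'b': None},
                     {'a': None, 'b': None, 'value': None}):
        print(template)
        for func in (f1, f2, f3):
            print(func.__name__, bench(func, template))
```

As in the original script, the key is only missing on the very first call for a given dictionary, so the measurements are dominated by the "key already present" case.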


pdb.set_trace no longer working: problem solved

Filed under: Python — Alexandre Fayolle @ 14:01

I had a bad case of bug hunting today which took me more than 5 hours to track down (with the help of Adrien in the end).

I was trying to start a CubicWeb instance on my computer, and was encountering some strange pyro error at startup. So I edited some source file to add a pdb.set_trace() statement and restarted the instance, waiting for Python’s debugger to kick in. But that did not happen. I was baffled. I first checked for standard problems:

  • no pdb.py or pdb.pyc was lying around in my Python sys.path
  • the pdb.set_trace function had not been silently redefined
  • no other thread was bugging me
  • the standard input and output were what they were supposed to be
  • I was not able to reproduce the issue on other machines

After triple checking everything, grepping everywhere, I asked a question on StackOverflow before taking a lunch break (if you go there, you’ll see the answer). After lunch, no useful answer had come in, so I asked Adrien for help, because two pairs of eyes are better than one in some cases. We dutifully traced down the pdb module’s code to the underlying bdb and cmd modules and learned some interesting things on the way down there. Finally, we found out that the Python code frames which should have been identical were not. This discovery caused further bafflement. We looked at the frames, and saw that the class of one of those frames was a psyco-generated wrapper.

It turned out that CubicWeb can use two implementations of the RQL module: one which uses gecode (a C++ library for constraint-based programming) and one which uses logilab.constraint (a pure Python library for constraint solving). The former is the default, but it would not load on my computer, because the gecode library had been replaced by a more recent version during an upgrade. So the pure Python implementation was used instead, and that one tries to use psyco to speed things up, which is where the psyco wrapper in the frames came from. Installing the correct version of libgecode solved the issue. End of story.

When I checked out StackOverflow, Ned Batchelder had provided an answer. I didn’t get the satisfaction of answering the question myself…

Once this was figured out, solving the initial pyro issue took 2 minutes…


Launching Python scripts via Condor

Filed under: Python — Alexandre Fayolle @ 14:03

As part of an ongoing customer project, I’ve been learning about the Condor queue management system (actually it is more than just a batch queue management system, tackling the high-throughput computing problem, but in my current project, we’re not using the full possibilities of Condor, and the choice was dictated by other considerations outside the scope of this note). The documentation is excellent, and the features of the product are really amazing (pity the project runs on Windows, and we cannot use 90% of these…).

To launch a job on a computer participating in the Condor farm, you just have to write a job file which looks like this:


and then run condor_submit my_job_file and use condor_q to monitor the status of your job (queued, running…).

My program generates Condor job files and submits them, and I spent hours yesterday trying to understand why they were all failing: the stderr file contained a message from Python complaining that it could not import site and exiting.

A point which was not clear in the documentation I read (but I probably overlooked it) is that the executable mentioned in the job file is supposed to be a local file on the submission host, which is copied to the computer running the job. In the jobs generated by my code, I was using sys.executable for the Executable field, and a path to the Python script I wanted to run in the Arguments field. This resulted in the Python interpreter being copied to the execution host and not being able to run, because it could not find the standard files it needs at startup.

Once I figured this out, the fix was easy: I made my program write a batch script which launched the Python script and changed the job to run that script.

UPDATE: I’m told there is a Transfer_executable=False line I could have put in the job file to achieve the same thing.
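As a rough sketch of the fix described above, the generating program can write a small wrapper script and point the job file at that script instead of at the interpreter. The `write_job` helper, the file names and the particular set of submit commands below are assumptions made for illustration, not the files from the actual project; the Condor manual is the reference for the commands a given pool needs.

```python
import os

def write_job(workdir, script_path, python='python'):
    """Write a wrapper batch script and a Condor job file (hypothetical layout)."""
    wrapper = os.path.join(workdir, 'run_job.bat')
    with open(wrapper, 'w') as f:
        # The wrapper relies on the Python already installed on the
        # execution host, so the interpreter itself is never copied.
        f.write('@echo off\n"%s" "%s"\n' % (python, script_path))
    job_file = os.path.join(workdir, 'my_job_file')
    lines = [
        'Universe = vanilla',
        'Executable = %s' % wrapper,  # small script, safe to transfer
        'Output = job.out',
        'Error = job.err',
        'Log = job.log',
        'Queue',
    ]
    with open(job_file, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    return wrapper, job_file
```

The resulting job file can then be passed to condor_submit as before. The script path written into the wrapper has to be valid on the execution host, which is an assumption about how the machines are set up.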


Why you should get rid of os.system, os.popen, etc. in your code

Filed under: Python — Alexandre Fayolle @ 13:57

I regularly come across code such as:

output = os.popen('diff -u %s %s' % (appl_file, ref_file), 'r')

Code like this might well work on your machine, but it is buggy and will fail (preferably during the demo or once shipped).

Where is the bug?

It is in the use of %s, which can inject into your command any string you want, and also strings you don’t want. The problem is that you probably did not check appl_file and ref_file for weird things (spaces, quotes, semicolons…). Putting quotes around the %s in the string will not solve the issue.

So what should you do? The answer is “use the subprocess module”: subprocess.Popen takes a list of arguments as first parameter, which are passed as-is to the new process creation system call of your platform, and not interpreted by the shell:

pipe = subprocess.Popen(['diff', '-u', appl_file, ref_file], stdout=subprocess.PIPE)
output = pipe.stdout

By now, you should have guessed that the shell=True parameter of subprocess.Popen should not be used unless you really really need it (and even then, I encourage you to question that need).
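To see the "passed as-is" behaviour in action, here is a small sketch written for Python 3. It does not use diff, so that it runs anywhere: the child process is simply the current Python interpreter, asked to write its first argument back. The `echo_argument` helper and the weird file name are made up for the demonstration.

```python
import subprocess
import sys

def echo_argument(arg):
    """Run a child Python process that writes its first argument to stdout."""
    child_code = 'import sys; sys.stdout.write(sys.argv[1])'
    pipe = subprocess.Popen([sys.executable, '-c', child_code, arg],
                            stdout=subprocess.PIPE)
    output, _ = pipe.communicate()
    return output.decode()

if __name__ == '__main__':
    weird_name = 'my file; echo "oops" .txt'
    # The argument reaches the child untouched: no shell ever parses it.
    print(echo_argument(weird_name) == weird_name)
```

With the os.popen version, the same string would have been split on the space and the part after the semicolon run as a second command.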


Using tempfile.mkstemp correctly

Filed under: Python — Alexandre Fayolle @ 14:01

The mkstemp function in the tempfile module returns a tuple of 2 values:

  • an OS-level handle to an open file (as would be returned by os.open())
  • the absolute pathname of that file.

I often see code using mkstemp only to get the filename to the temporary file, following a pattern such as:

from tempfile import mkstemp
import os

def need_temp_storage():
    _, temp_path = mkstemp()
    os.system('some_commande --output %s' % temp_path)
    file = open(temp_path, 'r')
    data = file.read()
    file.close()
    os.remove(temp_path)
    return data

This seems to be working fine, but there is a bug hiding in there. The bug will show up on Linux if you call this function many times in a long-running process, and on the first call on Windows. We have leaked a file descriptor.

The first element of the tuple returned by mkstemp is typically an integer used to refer to a file by the OS. In Python, not closing a file is usually no big deal because the garbage collector will ultimately close the file for you, but here we are not dealing with file objects, but with OS-level handles. The interpreter sees an integer and has no way of knowing that the integer is connected to a file. On Linux, calling the above function repeatedly will eventually exhaust the available file descriptors. The program will stop with:

IOError: [Errno 24] Too many open files: '/tmp/tmpJ6g4Ke'

On Windows, it is not possible to remove a file which is still opened by another process, and you will get:

Windows Error [Error 32]

Fixing the above function requires closing the file descriptor using os.close():

from tempfile import mkstemp
import os

def need_temp_storage():
    fd, temp_path = mkstemp()
    os.system('some_commande --output %s' % temp_path)
    file = open(temp_path, 'r')
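    data = file.read()
    # close the OS-level handle returned by mkstemp
    os.close(fd)
    file.close()
    os.remove(temp_path)
    return data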

If you need your process to write directly in the temporary file, you don’t need to call os.write(fd, data). The function os.fdopen(fd) will return a Python file object using the same file descriptor. Closing that file object will close the OS-level file descriptor.
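A minimal sketch of that last pattern, written for Python 3, could look like this. The `write_temp_data` name is only an illustration.

```python
import os
from tempfile import mkstemp

def write_temp_data(data):
    """Write data to a temporary file through the descriptor mkstemp opened."""
    fd, temp_path = mkstemp()
    # os.fdopen wraps the existing descriptor in a file object; leaving
    # the with block closes the file object, and fd along with it.
    with os.fdopen(fd, 'w') as temp_file:
        temp_file.write(data)
    return temp_path

if __name__ == '__main__':
    path = write_temp_data('hello')
    with open(path) as f:
        print(f.read())
    os.remove(path)
```

Since nothing is left open when the function returns, the caller can remove the file afterwards, on Windows as well as on Linux.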


Converting excel files to CSV using OpenOffice.org and pyuno

Filed under: Python — Alexandre Fayolle @ 14:02

The Task

I recently received from a customer a fairly large amount of data, organized in dozens of xls documents, each having dozens of sheets. I needed to process this, and in order to ease the manipulation of the documents, I’d rather use standard text files in CSV (Comma Separated Values) format. Of course I didn’t want to spend hours manually converting each sheet of each file to CSV, so I thought this would be a good time to get my hands on pyUno.

So I gazed over the documentation, found the Calc page on the wiki, read some sample code and got started.

The easy bit

The first few lines I wrote were (all imports are here, though some were actually added later).

import logging
import sys
import os.path as osp
import os
import time

import uno

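def convert_spreadsheet(filename):
    pass  # written further down

def run():
    for filename in sys.argv[1:]:
        convert_spreadsheet(filename)

def configure_log():
    # send log messages to the standard output
    logger = logging.getLogger('')
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(sys.stdout)
    format = "%(asctime)s %(levelname)-7s [%(name)s] %(message)s"
    handler.setFormatter(logging.Formatter(format))
    logger.addHandler(handler)

if __name__ == '__main__':
    configure_log()
    run()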

That was the easy part. In order to write the convert_spreadsheet function, I needed to open the document. And to do that, I needed to start OpenOffice.org.

Starting OOo

I started by copy-pasting some code I found in another project, which expected OpenOffice.org to be already started with the -accept option. I changed that code a bit, so that the function would launch soffice with the correct options if it could not contact an existing instance:

def _uno_init(_try_start=True):
    """init python-uno bridge infrastructure"""
    try:
        # Get the uno component context from the PyUNO runtime
        local_context = uno.getComponentContext()
        # Get the local Service Manager
        local_service_manager = local_context.ServiceManager
        # Create the UnoUrlResolver on the Python side.
        local_resolver = local_service_manager.createInstanceWithContext(
            "com.sun.star.bridge.UnoUrlResolver", local_context)
        # Connect to the running OpenOffice.org and get its context.
        # XXX make host/port configurable
        context = local_resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
        # Get the ServiceManager object
        service_manager = context.ServiceManager
        # Create the Desktop instance
        desktop = service_manager.createInstance("com.sun.star.frame.Desktop")
        return service_manager, desktop
    except Exception as exc:
        if exc.__class__.__name__.endswith('NoConnectException') and _try_start:
            logging.info('Trying to start UNO server')
            status = os.system('soffice -invisible -accept="socket,host=localhost,port=2002;urp;"')
            time.sleep(2)
            logging.info('status = %d', status)
            return _uno_init(False)
        logging.exception("UNO server not started, you should fix that now. "
                          "`soffice \"-accept=socket,host=localhost,port=2002;urp;\"` "
                          "or maybe `unoconv -l` might suffice")
        # re-raise so that the caller does not get None instead of a tuple
        raise

Spreadsheet conversion

Now the easy part (sort of, once you start understanding the OOo API): to load a document, use desktop.loadComponentFromURL(). To get the sheets of a Calc document, use document.getSheets() (that one was easy…). To iterate over the sheets, I used a sample from the SpreadsheetCommon page on the wiki.

Exporting the CSV was a bit more tricky. The function to use is document.storeToURL(). There are two gotchas, however. The first one is that we need to specify a filter, and to parameterize it correctly. The second one is that the CSV export filter is only able to export the active sheet, so we need to change the active sheet as we iterate over the sheets.

Parametrizing the export filter

The parameters are passed in a tuple of PropertyValue uno structures, as the second argument to the storeToURL() method. I wrote a helper function which accepts any named arguments and converts them to such a tuple:

def make_property_array(**kwargs):
    """convert the keyword arguments to a tuple of PropertyValue uno
    structures"""
    array = []
    for name, value in kwargs.iteritems():
        prop = uno.createUnoStruct("com.sun.star.beans.PropertyValue")
        prop.Name = name
        prop.Value = value
        array.append(prop)
    return tuple(array)

Now, what do we put in that array? The answer is in the FilterOptions page of the wiki: the FilterName property is "Text - txt - csv (StarCalc)". We also need to configure the filter by using the FilterOptions property. This is a string of comma-separated values:

  • ASCII code of field separator
  • ASCII code of text delimiter
  • character set, use 0 for “system character set”, 76 seems to be UTF-8
  • number of first line (1-based)
  • Cell format codes for the different columns (optional)

I used the value "59,34,76,1", meaning I wanted semicolons for separators, and double quotes for text delimiters.
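Since the first two fields are plain ASCII codes, the option string can be built from readable characters rather than typed by hand. The helper below is a hypothetical convenience, not part of the original script; it only covers the first four fields of the list above and leaves out the optional cell format codes.

```python
def make_filter_options(separator=';', delimiter='"', charset=76, first_line=1):
    """Build a FilterOptions string from readable values (hypothetical helper)."""
    fields = [ord(separator), ord(delimiter), charset, first_line]
    return ','.join(str(field) for field in fields)

if __name__ == '__main__':
    print(make_filter_options())  # prints 59,34,76,1
```

Calling make_filter_options(separator=',') would give a comma-separated export instead.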

Here’s the code:

def convert_spreadsheet(filename):
    """load a spreadsheet document, and convert all sheets to
    individual CSV files"""
    logging.info('processing %s', filename)
    url = "file://%s" % osp.abspath(filename)
    export_mask = make_export_mask(url)
    # initialize Uno, get a Desktop object
    service_manager, desktop = _uno_init()
    # load the Document
    document = desktop.loadComponentFromURL(url, "_blank", 0, ())
    controller = document.getCurrentController()
    sheets = document.getSheets()
    logging.info('found %d sheets', sheets.getCount())

    # iterate on all the spreadsheets in the document
    enumeration = sheets.createEnumeration()
    while enumeration.hasMoreElements():
        sheet = enumeration.nextElement()
        name = sheet.getName()
        logging.info('current sheet name is %s', name)
        # the CSV filter only exports the active sheet
        controller.setActiveSheet(sheet)
        outfilename = export_mask % name.replace(' ', '_')
        document.storeToURL(outfilename,
                            make_property_array(FilterName="Text - txt - csv (StarCalc)",
                                                FilterOptions="59,34,76,1"))

def make_export_mask(url):
    """convert the url of the input document to a mask for the written
    CSV file, with a substitution for the sheet name

    >>> make_export_mask('file:///home/foobar/somedoc.xls')
    'file:///home/foobar/somedoc$%s.csv'
    """
    components = url.split('.')
    components[-2] += '$%s'
    components[-1] = 'csv'
    return '.'.join(components)


Windows, open files and unit tests

Filed under: Python — Alexandre Fayolle @ 14:00

A problem I ran into yesterday: a unit test crashes on Windows after creating an object which keeps files open. The test's tearDown is called, but it fails because Windows refuses to delete open files, and the test framework keeps a reference to the test function so that the call stack can be examined. On Linux, there is no problem (you are allowed to delete an open file from the disk, so the teardown runs without trouble).

A few ideas to work around the problem:

  1. wrap the test in a try...finally, with a del of the object which keeps the files open in the finally clause. Drawback: when the test fails, pdb no longer lets you see much
  2. instead of cleaning up in tearDown, clean up later, in an atexit handler for instance. It remains to be seen how this works out if several tests want to write to the same files (I think each test would need its own temporary directory, if we want to be able to have several failing tests and examine their data, but this needs testing to be sure)
  3. put a try...except in tearDown around the removal of each file, and put the files which cause trouble in a list which will be processed when the program exits (with atexit for instance).

This looks like a hack, but we are dealing with a Windows behaviour over which we have no control (even with Administrator or System privileges, there is no way around this impossibility of deleting an open file, as far as I know).

Another approach, much more heavyweight, would be to virtualize file creation in order to work in memory (at the very least override os.mkdir and the open builtin, and even, in the case at hand, the modules which work with zip files). There may be things like this floating around. Asking on the TIP list might bring some answers (a quick search in the archives turned up nothing).

See also these threads from March 2004 and November 2004 on comp.lang.python.
