Sur un air de balisette

2009/09/10

Using tempfile.mkstemp correctly

Filed under: Python — Alexandre Fayolle @ 14:01

The mkstemp function in the tempfile module returns a tuple of 2 values:

  • an OS-level handle to an open file (as would be returned by os.open())
  • the absolute pathname of that file.

I often see code using mkstemp only to get the filename to the temporary file, following a pattern such as:

from tempfile import mkstemp
import os

def need_temp_storage():
    _, temp_path = mkstemp()
    os.system('some_commande --output %s' % temp_path)
    file = open(temp_path, 'r')
    data = file.read()
    file.close()
    os.remove(temp_path)
    return data

This seems to work fine, but there is a bug hiding in there. It will show up on Linux if you call this function many times in a long-running process, and on the very first call on Windows: we have leaked a file descriptor.

The first element of the tuple returned by mkstemp is typically an integer used to refer to a file by the OS. In Python, not closing a file is usually no big deal because the garbage collector will ultimately close the file for you, but here we are not dealing with file objects, but with OS-level handles. The interpreter sees an integer and has no way of knowing that the integer is connected to a file. On Linux, calling the above function repeatedly will eventually exhaust the available file descriptors. The program will stop with:

IOError: [Errno 24] Too many open files: '/tmp/tmpJ6g4Ke'
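To make the point concrete, here is a small sketch showing that the handle is a plain integer and that it stays open at the OS level until os.close() is called. The helper descriptor_is_open is a name made up for this illustration; it simply relies on os.fstat() failing on a closed descriptor.

```python
import os
from tempfile import mkstemp

def descriptor_is_open(fd):
    """Return True if fd still refers to an open file at the OS level
    (hypothetical helper, for illustration only)."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False

fd, temp_path = mkstemp()
handle_type = type(fd).__name__            # just an integer for the interpreter
open_before_close = descriptor_is_open(fd)  # nothing has closed it for us
os.close(fd)                                # explicit close is the only way out
open_after_close = descriptor_is_open(fd)
os.remove(temp_path)
```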

On Windows, it is not possible to remove a file which is still open (here, through the leaked descriptor held by our own process), and you will get:

Windows Error [Error 32]

Fixing the above function requires closing the file descriptor using os.close():

from tempfile import mkstemp
import os

def need_temp_storage():
    fd, temp_path = mkstemp()
    os.system('some_commande --output %s' % temp_path)
    file = open(temp_path, 'r')
    data = file.read()
    file.close()
    os.close(fd)
    os.remove(temp_path)
    return data
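A slightly more defensive variant, as a sketch: since only the path is needed, the descriptor can be closed right after mkstemp(), and the removal can go in a finally clause so the file is cleaned up even if something fails in between. The function name read_via_temp_file and its produce callback are assumptions made for this example; produce stands in for whatever fills the file (the external command above, for instance).

```python
import os
from tempfile import mkstemp

def read_via_temp_file(produce):
    """Call produce(path) to fill a temporary file, then return its content."""
    fd, temp_path = mkstemp()
    # We only need the path, so release the OS-level handle right away.
    os.close(fd)
    try:
        produce(temp_path)
        with open(temp_path, 'r') as temp_file:
            return temp_file.read()
    finally:
        # Runs whether or not produce() or the read raised an exception.
        os.remove(temp_path)
```

Closing the descriptor first also means nothing in our own process holds the file open while the producer writes to it.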

If you need your process to write directly to the temporary file, you don’t need to call os.write(fd, data). The function os.fdopen(fd) will return a Python file object using the same file descriptor. Closing that file object will close the OS-level file descriptor.
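A minimal sketch of that approach (the function name write_temp_file is made up for the example, and leaving the removal of the file to the caller is a choice of this sketch):

```python
import os
from tempfile import mkstemp

def write_temp_file(data):
    """Write data to a fresh temporary file and return its path."""
    fd, temp_path = mkstemp()
    # os.fdopen wraps the existing descriptor in a file object;
    # leaving the with block closes the file object and the descriptor.
    with os.fdopen(fd, 'w') as temp_file:
        temp_file.write(data)
    return temp_path
```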


2008/09/19

Converting excel files to CSV using OpenOffice.org and pyuno

Filed under: Python — Alexandre Fayolle @ 14:02

The Task

I recently received from a customer a fairly large amount of data, organized in dozens of xls documents, each having dozens of sheets. I needed to process this, and in order to ease the manipulation of the documents, I’d rather use standard text files in CSV (Comma Separated Values) format. Of course I didn’t want to spend hours manually converting each sheet of each file to CSV, so I thought this would be a good time to get my hands dirty with PyUNO.

So I skimmed the documentation, found the Calc page on the OpenOffice.org wiki, read some sample code and got started.

The easy bit

The first few lines I wrote were (all imports are here, though some were actually added later).


import logging
import sys
import os.path as osp
import os
import time

import uno

def convert_spreadsheet(filename):
    pass

def run():
    for filename in sys.argv[1:]:
        convert_spreadsheet(filename)

def configure_log():
    logger = logging.getLogger('')
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(sys.stdout)
    logger.addHandler(handler)
    format = "%(asctime)s %(levelname)-7s [%(name)s] %(message)s"
    handler.setFormatter(logging.Formatter(format))

if __name__ == '__main__':
    configure_log()
    run()

That was the easy part. In order to write the convert_spreadsheet function, I needed to open the document. And to do that, I needed to start OpenOffice.org.

Starting OOo

I started by copy-pasting some code I found in another project, which expected OpenOffice.org to be already started with the -accept option. I changed that code a bit, so that the function would launch soffice with the correct options if it could not contact an existing instance:


def _uno_init(_try_start=True):
    """init python-uno bridge infrastructure"""
    try:
        # Get the uno component context from the PyUNO runtime
        local_context = uno.getComponentContext()
        # Get the local Service Manager
        local_service_manager = local_context.ServiceManager
        # Create the UnoUrlResolver on the Python side.
        local_resolver = local_service_manager.createInstanceWithContext(
            "com.sun.star.bridge.UnoUrlResolver", local_context)
        # Connect to the running OpenOffice.org and get its context.
        # XXX make host/port configurable
        context = local_resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
        # Get the ServiceManager object
        service_manager = context.ServiceManager
        # Create the Desktop instance
        desktop = service_manager.createInstance("com.sun.star.frame.Desktop")
        return service_manager, desktop
    except Exception as exc:
        # No running instance could be contacted: start one, then retry once
        if exc.__class__.__name__.endswith('NoConnectException') and _try_start:
            logging.info('Trying to start UNO server')
            status = os.system('soffice -invisible -accept="socket,host=localhost,port=2002;urp;"')
            time.sleep(2)
            logging.info('status = %d', status)
            return _uno_init(False)
        else:
            logging.exception("UNO server not started, you should fix that now. "
                              "`soffice \"-accept=socket,host=localhost,port=2002;urp;\"` "
                              "or maybe `unoconv -l` might suffice")
            raise
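The fixed two-second sleep is a guess at how long the server takes to come up. One possible refinement, not part of the original script, is to poll the port until it accepts connections. The sketch below uses only the standard socket module; the function name wait_for_port and its default values are assumptions, and I have not checked how a running soffice reacts to a connection that is opened and closed immediately.

```python
import socket
import time

def wait_for_port(host='localhost', port=2002, timeout=10.0, interval=0.2):
    """Return True as soon as host:port accepts a TCP connection,
    or False if it still refuses after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # The connection is closed again as soon as the with block exits.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet: wait a little and try again.
            time.sleep(interval)
    return False
```

With such a helper, the call to time.sleep(2) could be replaced by wait_for_port() before retrying the connection.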

Spreadsheet conversion

Now the easy part (sort of, once you start understanding the OOo API): to load a document, use desktop.loadComponentFromURL(). To get the sheets of a Calc document, use document.getSheets() (that one was easy…). To iterate over the sheets, I used a sample from the SpreadsheetCommon page on the OpenOffice.org wiki.

Exporting the CSV was a bit more tricky. The function to use is document.storeToURL(). There are two gotchas, however. The first is that we need to specify a filter, and to parameterize it correctly. The second is that the CSV export filter is only able to export the active sheet, so we need to change the active sheet as we iterate over the sheets.

Parametrizing the export filter

The parameters are passed in a tuple of PropertyValue uno structures, as the second argument to the storeToURL method. I wrote a helper function which accepts any named arguments and converts them to such a tuple:


def make_property_array(**kwargs):
    """convert the keyword arguments to a tuple of PropertyValue uno
    structures"""
    array = []
    # items() works on both Python 2 and Python 3 dictionaries
    for name, value in kwargs.items():
        prop = uno.createUnoStruct("com.sun.star.beans.PropertyValue")
        prop.Name = name
        prop.Value = value
        array.append(prop)
    return tuple(array)

Now, what do we put in that array? The answer is in the FilterOptions page of the wiki: the FilterName property is "Text - txt - csv (StarCalc)". We also need to configure the filter by using the FilterOptions property. This is a string of comma separated values:

  • ASCII code of field separator
  • ASCII code of text delimiter
  • character set, use 0 for “system character set”, 76 seems to be UTF-8
  • number of first line (1-based)
  • Cell format codes for the different columns (optional)

I used the value "59,34,76,1", meaning I wanted semicolons for separators, and double quotes for text delimiters.
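Rather than looking up ASCII codes by hand, the option string can be built from the characters themselves. This is only a sketch: the helper name csv_filter_options and its defaults are mine, and it covers just the first four fields listed above (no per-column cell format codes).

```python
def csv_filter_options(separator=';', delimiter='"', charset=76, first_line=1):
    """Build a FilterOptions string for the CSV export filter.

    separator and delimiter are single characters, converted to their
    ASCII codes; charset 76 is the value that seems to mean UTF-8.
    """
    fields = (ord(separator), ord(delimiter), charset, first_line)
    return ','.join(str(field) for field in fields)

options = csv_filter_options()            # the value used in this post
tab_options = csv_filter_options('\t')    # tab-separated variant
```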

Here’s the code:


def convert_spreadsheet(filename):
    """load a spreadsheet document, and convert all sheets to
    individual CSV files"""
    logging.info('processing %s', filename)
    url = "file://%s" % osp.abspath(filename)
    export_mask = make_export_mask(url)
    # initialize Uno, get a Desktop object
    service_manager, desktop = _uno_init()
    # load the Document (outside the try block, so that the finally
    # clause never refers to an unbound name if loading fails)
    document = desktop.loadComponentFromURL(url, "_blank", 0, ())
    try:
        controller = document.getCurrentController()
        sheets = document.getSheets()
        logging.info('found %d sheets', sheets.getCount())

        # iterate on all the spreadsheets in the document
        enumeration = sheets.createEnumeration()
        while enumeration.hasMoreElements():
            sheet = enumeration.nextElement()
            name = sheet.getName()
            logging.info('current sheet name is %s', name)
            # the CSV filter only exports the active sheet
            controller.setActiveSheet(sheet)
            outfilename = export_mask % name.replace(' ', '_')
            document.storeToURL(outfilename,
                                make_property_array(FilterName="Text - txt - csv (StarCalc)",
                                                    FilterOptions="59,34,76,1"))
    finally:
        document.close(True)

def make_export_mask(url):
    """convert the url of the input document to a mask for the written
    CSV file, with a substitution for the sheet name

    >>> make_export_mask('file:///home/foobar/somedoc.xls')
    'file:///home/foobar/somedoc$%s.csv'
    """
    components = url.split('.')
    components[-2] += '$%s'
    components[-1] = 'csv'
    return '.'.join(components)

2008/07/22

Windows, open files and unit tests

Filed under: Python — Alexandre Fayolle @ 14:00

A problem I ran into yesterday: a unit test crashes on Windows after creating an object that keeps files open. The test's tearDown is called, but it fails because Windows refuses to delete open files, and the test framework keeps a reference to the test function so that the call stack can be examined. On Linux there is no problem (deleting an open file from disk is allowed, so the tearDown runs without trouble).

A few leads for working around the problem:

  1. wrap the test in a try...finally, with a del on the object that keeps the files open in the finally clause. Drawback: when the test fails, pdb no longer lets you see much
  2. instead of cleaning up in tearDown, clean up later, in an atexit handler for instance. It remains to be seen how this behaves if several tests want to write to the same files (I think you would need one temporary directory per test if you want to have several failing tests and examine their data, but this needs testing to be sure)
  3. put a try...except in the tearDown around the removal of each file, and put the files that cause trouble in a list which is processed when the program exits (with atexit for instance).

This looks like a hack, but we are dealing with Windows behaviour over which we have no control (even with Administrator or System privileges, there is no way around this impossibility of deleting an open file, as far as I know).

Another, much heavier approach would be to virtualize file creation so as to work in memory (at the very least overriding os.mkdir and the builtin open, and possibly, in the case at hand, the modules that work with zip files). There may be such things floating around. Asking on the TIP list might bring some answers (a quick search of the archives turned up nothing).

See also these threads from March 2004 and November 2004 on comp.lang.python.
