Data

Dealing with xls files in Python

Hammy Goonan

May 18, 2015 • 2 min read

It's been a long day coding. I've nearly finished all the preliminary scrapers for each state parliament and I must say I'll be glad to move onto the next phase.

I thought it's probably worth outlining my biggest coding struggle today just in case anyone else ever has the same issues I did.

To date, all mailing lists that I've processed have been csv files, which are wonderful because they are incredibly simple and for the most part just don't need to be any more complex than they are. However, it would seem the Parliament of Western Australia disagrees and to hell with anyone that operates outside a Microsoft environment. I'll spare you the diatribe about the importance of governments being platform neutral...

So I had to process an xls file which proved to be quite a bit more difficult than I thought it might be although in the end the solution was quite simple:

import requests
import os
import xlrd

class WaData(BaseData):

    ...

    def getData(self, url):
        """ Returns dictionary of CSV Data """
        csvfile = requests.get(url, stream=True)
        # create temp file
        temp_file = 'temp.xls'
        with open(temp_file, 'wb') as f:
            for chunk in csvfile.iter_content():
                f.write(chunk)
        f.close()
        book = xlrd.open_workbook(temp_file)
        # get excel 'sheets'
        sheet_names = book.sheet_names()
        # get first sheet
        sheet = book.sheet_by_name(sheet_names[0])
        # get header row
        header = sheet.row_values(0)
        # create list of dictionary values skipping the first one
        values = []
        for rownum in range(1, sheet.nrows):
            values.append(
                dict(zip(header, sheet.row_values(rownum)))
            )
        os.remove(temp_file)
        return values

There are some odd choices in here which is it worth explaining.

Firstly, I create a temp file 'manually' rather than using the tempfile library. That's because xlrd doesn't seem to like the temp files created with tempfile. tempfile creates a 'file-like' object which seems to exist as a buffer and xlrd doesn't want to open them.

I also wrote the content to the temp file using the requests inter_content() method as it just seemed to work a bit nicer given it's not plain text that I'm writing to the file.

Then, while I'm sure it's an excel/Microsoft issue, rather than an xlrd one, finding your way through the book object is a bit of a pain. So I get the sheet names of the book, then get the first one returned (lines 20 - 22).

I then loop over the rows in the sheet using an index and create a dictionary by combining the header row with the values in each row (lines 26 - 30).

Finally I clean up the temp file. Easy really.