A few weeks ago we got an email from a colleague here, with a problem that needed solving: our ICT people were starting to look at the issue of data storage, and had flagged up things like the Spend Over £500 files – big CSV files that we publish every month, going back several years – as something we needed to look at before they became unmanageable. At the same time we were hearing that the process for adding a new file each month was ridiculously manual, tedious and about as far from agile as you could get. It got even worse when we talked about moving the files somewhere and updating the web pages they’re catalogued on – every link would have to be re-made manually. That was a big fat No Thanks as far as the team was concerned, and they asked us if we had any clever ideas.
As it happened, we thought we might; tackling the problem from both ends at once, we arrived at the same place – why not store the files in GitHub? From the storage end, it has all sorts of advantages. It’s free, because the repository is completely open; there’s built-in version control and auditing, so we could see who’s done what with the files if we needed to; and it’s already well known to developers, who we would hope could make use of the data. From the cataloguing and publishing end, we had a few ideas about how to automate this so the list of files appears on a dedicated page on our website, using WordPress, some PHP and the GitHub API.
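For anyone curious what that kind of automation looks like, here is a minimal sketch of the listing step. It is not our actual code (ours is PHP inside WordPress); it is Python purely for illustration, and the owner and repository names in the usage note are placeholders. It relies on the GitHub REST API's repository contents endpoint, which returns a JSON list of entries with `name`, `type` and `download_url` fields.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def contents_url(owner, repo, path=""):
    """Build the URL of the 'repository contents' endpoint for a folder."""
    return f"{API_ROOT}/repos/{owner}/{repo}/contents/{path}".rstrip("/")


def csv_files(entries):
    """Keep only the CSV files from a decoded contents listing.

    Sorted by name in reverse, so date-stamped names come out newest first.
    """
    files = [
        {"name": entry["name"], "url": entry["download_url"]}
        for entry in entries
        if entry.get("type") == "file" and entry["name"].lower().endswith(".csv")
    ]
    return sorted(files, key=lambda f: f["name"], reverse=True)


def fetch_csv_files(owner, repo, path=""):
    """Call the API and return the CSV files found in one folder of a repo."""
    request = urllib.request.Request(
        contents_url(owner, repo, path),
        headers={
            "Accept": "application/vnd.github+json",
            # GitHub expects a User-Agent header; this value is made up.
            "User-Agent": "spend-listing-sketch",
        },
    )
    with urllib.request.urlopen(request) as response:
        return csv_files(json.load(response))
```

Calling `fetch_csv_files("example-org", "example-repo")` (both names hypothetical) would give a list of file names and download links that a page template could loop over, so adding a new month's file to the repository is all it takes for it to appear on the page.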
Just to see if it could be done, we put together a very quick prototype and published it in beta. The thing about prototypes is, they’re never perfect; we look on them as an invitation to critique, and that’s why as soon as we had something to show we asked smart people in the open data world to give us their opinion. Here’s what we’ve heard from them so far:
- A CSV file for each year would be helpful for activists and journalists;
- The README is a bit sparse; it would be useful to have the licence restated there;
- Any chance of expanding the README to include some definitions, metadata and a standard schema?
- The links on the web page are confusing and hard to read, especially on mobile. Links should be e.g. July 2016 rather than spendover500_201607;
- and of course, the obligatory typo – no, we don’t know what ‘Githib’ is either!
As I may have mentioned once or twice, I love my internet friends, and these helpful technical communities especially. Big thanks to everyone who took the time to have a look and send us their thoughts.
So, we made some changes before moving to the live website, but not all of the suggestions could be followed. Here’s why:
- We need to keep the file name convention in GitHub to make the data easier to consume via code for now. That means we can’t make the file names on the webpage any neater or shorter without breaking the automation that brings them through from GitHub, though we’re open to suggestions if someone knows a neat way of doing that.
- We looked into providing the data in chunks by year, but we weren’t sure how many people would want that – we could do both monthly and yearly but then we’d be doubling the storage we’re using and that felt a bit like overkill. We’d be happy to hear from anyone who urgently needs the files by year, though we figured anyone doing that scale of analysis would be able to stitch twelve CSV files together easily enough for themselves. That’s a big assumption, and again we are happy to talk to people about it.
- We haven’t expanded the README yet but it’s on our backlog – standards may be a way off, but we aim to clarify the licence and the schema at the very least.
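Two of the points above lend themselves to a rough sketch, offered as a starting point rather than as what we run. The first function leaves the file names in GitHub alone and only derives friendlier link text from them, assuming every name follows the `spendover500_YYYYMM` pattern seen in the feedback. The second stitches monthly files into one yearly file, assuming (and this is an assumption, not something we have checked across every file) that they are UTF-8 and all share the same header row. The function names are invented for illustration.

```python
import csv
import re

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

# Assumed naming convention: spendover500_YYYYMM, with or without ".csv".
NAME_PATTERN = re.compile(r"^spendover500_(\d{4})(\d{2})(?:\.csv)?$")


def month_label(filename):
    """Turn 'spendover500_201607' into 'July 2016' for use as link text.

    Names that do not fit the assumed pattern are returned unchanged,
    so an unexpected file still gets a working (if ugly) link.
    """
    match = NAME_PATTERN.match(filename)
    if not match:
        return filename
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        return filename
    return f"{MONTHS[month - 1]} {year}"


def stitch(paths, out_path):
    """Concatenate monthly CSV files into one, writing the header once.

    Files are processed in name order, which is also date order under
    the assumed naming convention.
    """
    header = None
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sorted(paths):
            with open(path, newline="", encoding="utf-8") as source:
                reader = csv.reader(source)
                first_row = next(reader, None)
                if first_row is None:
                    continue  # skip empty files
                if header is None:
                    header = first_row
                    writer.writerow(header)
                elif first_row != header:
                    raise ValueError(f"{path} has a different header row")
                writer.writerows(reader)
```

Because `month_label` only changes what is displayed, the links themselves can keep pointing at the original file names, so nothing that consumes the data via code needs to change.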
We know this isn’t perfect, and it may not be a long-term solution, but it’s a neat way to take advantage of an existing open and free tool that works well for a lot of people – plus it gives us something we can easily update and move elsewhere if we need to. As an added bonus, it plays nicely with other things we use, like the extract process from our finance system – so we could automate the process even further.
The live version is now up at https://www.devon.gov.uk/factsandfigures/open-data/spending-over-500/
As always, your feedback is very welcome – leave a comment here, tell us on Twitter or drop us an email.