City of Chicago Launches OpenData ETL Utility Kit

The City of Chicago’s Department of Innovation and Technology released the OpenData ETL Utility Kit at the Code for America Summit this morning. The ETL Utility Toolkit will give cities the same tools Chicago uses to get data from their own internal systems and unto their data portals.

Chicago River South Branch from the Cityscape Bar

Photo by Chris Smith Photography

We’ll explain why this is such a big deal and what ETL is below the fold!

What’s an ETL?

ETL stands for Extract, Transform and Load. It’s short-hand for a system that takes data from one system and puts it someplace else. Like fancy plumbing for data.

So, say that the city tows your car. At some point, the city enters your vehicle’s information into a city system.  On the DoIT’s end, they use an ETL system to take the data from that system, transforms it into a format that they can then put here – on the city’s data portal. (Where the the community can then do cool stuff like this.)

Here’s Director of Analytics and Performance Management for the City of Chicago Tom Schenk, Open Data Program Manager for Chicago Jon Levy, and master data wrangler and 2013 CfA fellow Dave Guarino to explain this a little deeper.

To do this, the City of Chicago has deployed a number of ETL tools. They’ve now open sourced those tools as part of a toolkit that any city will be able to adopt for free.

Features

Here’s how the team describes the app on GitHub:

ETL Utilities for an Open Data Program

This toolkit provides several utilities and framework to help governments deploy automated ETLs using the open-source Pentaho data integration (Kettle) software.

Namely, this toolkit will assist with:

  • Load data from a database an load it to a Socrata data portal
  • Steps to integrate with an email server to provide e-mail alerts on the outcome of ETL scripts
  • Handles deployment issues when using multiple operating systems during development
  • Utilities to allow administrators to quickly analyze the log files of ETLs for quick diagnostics

The ETL framework is organized so each function can be modified in one file that is used by all ETLs. This provides for easier maintenance, upgrading, and modification over hundreds of ETLs.

Features

  • Open source at the core – this framework can be deployed using Kettle, an open-source ETL software. Pentaho also provides telephone support and training if desired.
  • Compatible with multiple data sources – this ETL framework can be used with a variety of data sources, including a range of databases (MySQL, PostgreSQL, Oracle, SQL Server, and variety of NoSQL), APIs, text files, etc.
  • Compatible workflow for multiple operating systems – ETLs can be developed and deployed across multiple operating systems. ETLs can be developed on a Windows environment and deployed on Linux
  • Helpful utilities – includes several scripts to help users quickly analyze log files

You can catch a livestream of the presentation on Thursday  at 11:00am Chicago time on the Code for America Summit website.