Potholes and Crime Stats – A Useful Package for Cleaning Data

Note: this is a guest post from Geoffrey Hing from FreeGeek Chicago. FreeGeek Chicago is working with us on a project to help analyze and visualize Cook County conviction data obtained through a FOIA request by the Chicago Justice Project. As part of the project, Geoffrey created a set of open source packages that we think will be useful to the overall civic innovation community.   —DXO

Potholes are an ever-present nuisance for Chicago residents. While not the biggest challenge facing Chicagoans, they make day-to-day navigation of the city more frustrating. This year, I’ve lost a car rim and had numerous near-crashes on my bike due to these street craters. I often feel like the blocks where streets have new pavement, or where the largest potholes have been filled in, are the exception rather than the rutted norm.

Working with open data, I often run into problems that feel like potholes. They’re small, solvable problems, but they make it harder to get to the bigger problem or the new insight. These are the kinds of problems that dozens of civic hackers have each hacked around, their solutions buried somewhere in their code repositories.

Working on a project covering records of convictions in Cook County criminal courts, our project team ran into one of these data potholes.

We wanted to analyze the number of convictions and the variance in sentencing based on the type of offense. To do that, we wanted to roll up the statutes under which people were convicted into a common, easily understandable set of offenses and categories. We decided to use the offenses that are part of the Illinois Uniform Crime Reporting (IUCR) program for our analysis. However, our data didn’t have fields mapping each record to an IUCR offense. Instead, it had fields describing the statute under either the Illinois Revised Statutes (ILRS) or the Illinois Compiled Statutes (ILCS). This page on the General Assembly website was the best reference I could find describing the differences between how laws are referenced in the two compilations. The Illinois State Police published a crosswalk between ILCS statutes and IUCR offenses, but it was in PDF format and not practical for processing the thousands of records we needed to handle. Furthermore, our data had references to both ILRS and ILCS statutes, so we needed to convert each ILRS statute to an ILCS one in order to look up the IUCR code.

Values for statutes in our data set look like this:

38-12-4
38 9-1E
38-19-1-A
38-18-1-A
38   18-2
38   11-1
720-5/24-1.1(a)
720-5/21-1.3(a)

I ended up implementing an ilcs package for converting ILRS references to ILCS references and an iucr package for looking up an IUCR offense based on an ILCS reference.

It’s just CSV!

There isn’t much to the Python code in these packages. Essentially, they provide classes that give statutes and offenses a string representation and let them be compared and used as keys in dictionaries.
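To make that concrete, here is a minimal sketch of a class with a string representation, comparison, and hashing. The `Section` class below is a hypothetical stand-in written for illustration, not the class shipped in the ilcs package; only the attribute names (`chapter`, `act_prefix`, `section`) and the printed format follow the usage example in this post.

```python
from functools import total_ordering


@total_ordering
class Section:
    """Hypothetical stand-in for a statute section (not the ilcs package's class)."""

    def __init__(self, chapter, act_prefix, section):
        self.chapter = chapter
        self.act_prefix = act_prefix
        self.section = section

    def _key(self):
        # A single tuple drives equality, ordering, and hashing.
        return (self.chapter, self.act_prefix, self.section)

    def __str__(self):
        return "{} ILCS {}/{}".format(self.chapter, self.act_prefix, self.section)

    def __eq__(self, other):
        if not isinstance(other, Section):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, Section):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        # Equal objects hash equally, so instances work as dictionary keys.
        return hash(self._key())


a = Section("720", "5", "12-4")
b = Section("720", "5", "12-4")
notes = {a: "example entry"}

print(a)         # 720 ILCS 5/12-4
print(a == b)    # True
print(notes[b])  # example entry
```

Because the components are kept as strings in this sketch, ordering is lexicographic rather than numeric.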

The packages just wrap CSV versions of the crosswalks provided by the state. This isn’t the most performant solution, because it requires that the CSV be parsed when the packages are imported, but I wanted to make it easy for people to update the data, use the data in a spreadsheet or database without using the Python interface, or implement similar functionality in other programming languages.
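The general idea of wrapping a CSV crosswalk can be sketched as follows. This is an illustration under assumptions, not the packages' actual code: the column names are invented, `build_index` is a hypothetical helper, and the single data row simply mirrors the ILRS-to-ILCS example shown later in this post.

```python
import csv
import io
from collections import defaultdict

# Assumed column names; the real CSV files may be laid out differently.
SAMPLE_CSV = """ilrs_chapter,ilrs_paragraph,ilcs_chapter,ilcs_act_prefix,ilcs_section
38,12-4,720,5,12-4
"""


def build_index(csv_file):
    """Read crosswalk rows into a dict keyed by (ILRS chapter, ILRS paragraph)."""
    index = defaultdict(list)
    for row in csv.DictReader(csv_file):
        key = (row["ilrs_chapter"], row["ilrs_paragraph"])
        # Values are lists because one ILRS reference can map to several sections.
        index[key].append(
            (row["ilcs_chapter"], row["ilcs_act_prefix"], row["ilcs_section"])
        )
    # Convert to a plain dict so that unknown keys raise KeyError.
    return dict(index)


# A package would do this once at import time, reading a file shipped with it.
INDEX = build_index(io.StringIO(SAMPLE_CSV))

print(INDEX[("38", "12-4")])  # [('720', '5', '12-4')]
```

Keeping the data in a plain CSV file means the same file can be opened in a spreadsheet or loaded by code in another language, at the cost of parsing it on every import.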

You can view or download the raw CSV data for the ILCS package here and for the IUCR package here.

Using the packages

Let’s look at an example of looking up an IUCR offense from an ILRS reference:

>>> import ilcs, iucr
>>> import re
>>> ilrs_re = re.compile(r'(?P<chapter>\d+)-(?P<paragraph>[-0-9]+)')
>>> # This is an example of an ILRS reference from our data
... ilrs_ref = '38-12-4'
>>> 
>>> # Parse the reference into chapter and paragraph parts
... m = ilrs_re.match(ilrs_ref)
>>> chapter, paragraph = m.groups()
>>> 
>>> # Lookup the ILCS section from the ILRS reference
... # Note that the lookup functions return lists because some ILRS sections
... # map to multiple ILCS sections
... ilcs_section = ilcs.lookup_by_ilrs(chapter=chapter, paragraph=paragraph)[0]
>>> 
>>> # The section object can evaluate to a nicely formatted string
... print(ilcs_section)
720 ILCS 5/12-4
>>> 
>>> # And you can access its individual components
... print(ilcs_section.chapter, ilcs_section.act_prefix, ilcs_section.section)
720 5 12-4
>>> 
>>> # Now let's look up the IUCR offense.
... # Again, the lookup function returns a list because in some cases,
... # an ILCS statute maps to multiple offenses
... iucr_offense = iucr.lookup_by_ilcs(ilcs_section.chapter, ilcs_section.act_prefix, ilcs_section.section)[0]
>>> 
>>> # An Offense object has various useful attributes
... print("The 4-digit code for the offense is {}".format(iucr_offense.code))
The 4-digit code for the offense is 0410
>>> print("The description of the offense is {}".format(iucr_offense.offense))
The description of the offense is Aggravated Battery
>>> print("The category of the offense is {}".format(iucr_offense.offense_category))
The category of the offense is Battery

Improvements

These are a few areas where I could imagine improvements for the packages.

We’d love to hear about your use cases for these packages, get updates or corrections to the underlying CSV data, suggestions for improvements to the API, or pull requests implementing them.

The best way to provide this feedback is by opening an issue or a pull request through the GitHub repositories for the python-ilcs or python-iucr packages.

Documentation

There are docstrings for the public API of the packages, so you can do something like:

import ilcs
help(ilcs)
help(ilcs.lookup_by_ilrs)

and get some help about the classes and functions in the packages. However, as the API matures, it would be nice to have Sphinx-generated HTML documentation for the packages.

Exceptions

Currently, the KeyError exceptions trickle up when looking up ILCS sections or IUCR offenses. It’s probably better to catch these and raise more domain-specific exceptions.
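One possible shape for this, as a sketch only: `SectionNotFound` is a hypothetical exception name, and the toy `_TABLE` and `lookup_by_ilrs` below stand in for the package's internals, which may be organized differently.

```python
class SectionNotFound(LookupError):
    """Hypothetical domain-specific error; not part of the ilcs package."""


# Toy table standing in for the package's internal lookup data.
_TABLE = {("38", "12-4"): ["720 ILCS 5/12-4"]}


def lookup_by_ilrs(chapter, paragraph):
    """Return the matching sections, or raise SectionNotFound if there are none."""
    try:
        return _TABLE[(chapter, paragraph)]
    except KeyError:
        # Replace the bare KeyError with an error that names the reference.
        raise SectionNotFound(
            "No ILCS section found for ILRS chapter {}, paragraph {}".format(
                chapter, paragraph
            )
        ) from None


try:
    lookup_by_ilrs("38", "99-9")
except SectionNotFound as exc:
    print(exc)
```

Subclassing `LookupError`, which is also the parent of `KeyError`, lets callers catch both the old and the new error with a single `except LookupError` clause.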

Fuzzy lookup or parsing

In our dataset, statutes were referenced in a variety of formats, often including subsection references. Because of this, we couldn’t just pass the raw values to the lookup functions in our packages. It might be good to add functions for parsing strings containing statute references into more standardized formats that can be used with the lookup functions, or using something like jellyfish for doing approximate string matching.
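A parsing function along these lines could look like the sketch below. It is based only on the sample values shown earlier in this post, and it assumes that paragraphs and sections have the form number-number (optionally with a decimal part) and that anything after that is a subsection. `parse_statute` and both regular expressions are hypothetical and are not part of the ilcs or iucr packages.

```python
import re

# Matches values such as "720-5/24-1.1(a)".
ILCS_RE = re.compile(
    r"^(?P<chapter>\d+)[-\s]+(?P<act_prefix>\d+)/"
    r"(?P<section>\d+-\d+(?:\.\d+)?)"
    r"(?:\((?P<subsection>\w+)\))?$"
)

# Matches values such as "38-12-4", "38 9-1E", "38-19-1-A", and "38   18-2".
ILRS_RE = re.compile(
    r"^(?P<chapter>\d+)[-\s]+"
    r"(?P<paragraph>\d+-\d+(?:\.\d+)?)"
    r"(?:-?(?P<subsection>[A-Za-z]\w*))?$"
)


def parse_statute(raw):
    """Split a raw statute string into named parts, or raise ValueError."""
    text = raw.strip()
    for compilation, pattern in (("ILCS", ILCS_RE), ("ILRS", ILRS_RE)):
        match = pattern.match(text)
        if match:
            parsed = match.groupdict()
            parsed["compilation"] = compilation
            return parsed
    raise ValueError("Unrecognized statute reference: {!r}".format(raw))


for value in ["38-12-4", "38 9-1E", "38-19-1-A", "38   18-2", "720-5/24-1.1(a)"]:
    print(value, "->", parse_statute(value))
```

The chapter and paragraph (or chapter, act prefix, and section) from the result could then be passed to the lookup functions, with the subsection kept separately. Real data would likely contain formats that these two patterns do not cover.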