This story was published in partnership with The Marshall Project
For the Marshall Project’s joint investigation with the Los Angeles Times, we built a one-of-a-kind database to analyze more than 3,500 people who participated in 25 of Southern California’s pay-to-stay jail programs. Here’s how we did it.
Collection

We surveyed every municipality in California, more than 500 in all, to determine which had pay-to-stay programs. Beginning in October 2015, we sent public records requests to 26 existing programs, requesting names of participants and individual convictions, lengths of stay, dates incarcerated, amounts paid, and courts of jurisdiction. Five cities resisted the requests, requiring legal intervention. One city, Baldwin Park, said it did not maintain records prior to 2016. It took about 13 months to obtain all the data.

The data was often incomplete, forcing us to get creative. For example, the city of Anaheim delayed for months and then said it could not provide its own database of participants. To get past this hurdle, we asked for every application Anaheim approved for its pay-to-stay program, identifying 243 participants. We looked up each case individually across at least 16 courts to determine accurate conviction information. We then requested receipts from the program to calculate lengths of stay.

We determined we had the most complete data for the years 2011 through 2015, which is where we focused our reporting.

Cleaning and standardizing

Responses came back in a wide variety of formats: databases, paper applications, jail sign-in sheets, and financial receipts. We used Tabula to convert PDF documents into a spreadsheet format, and we manually transcribed paper documents and image-based PDFs.
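Tabula itself is a standalone extraction tool; a minimal sketch of the same step using its Python wrapper, tabula-py, might look like the following. The file names are hypothetical and this is an illustration rather than our exact workflow.

```python
# Sketch of converting a text-based PDF of jail records into CSVs for review.
# "torrance_signin.pdf" and the output names are hypothetical placeholders.
import tabula  # provided by the tabula-py package

# Pull every table from every page into a list of pandas DataFrames.
tables = tabula.read_pdf("torrance_signin.pdf", pages="all", lattice=True)

# Save each table as its own CSV so it can be checked by hand and merged later.
for i, table in enumerate(tables):
    table.to_csv(f"torrance_signin_table_{i}.csv", index=False)
```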
We needed to produce uniform data listing one record per participant stay. For instance, a jail sign-in sheet might include many entries for one participant as he or she showed up for weekend stays or left for work furlough. To combine those entries into one complete record, we used Python in a Jupyter notebook to aggregate and group each data set.
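The aggregation amounted to grouping rows by person and collapsing them. A minimal sketch with pandas, assuming hypothetical column names (city, name, check_in, check_out):

```python
# Sketch of collapsing repeated sign-in rows into one record per participant
# stay. Column names are hypothetical stand-ins for a given jail's records.
import pandas as pd

signins = pd.read_csv("signin_sheet.csv", parse_dates=["check_in", "check_out"])

stays = (
    signins
    .groupby(["city", "name"], as_index=False)
    .agg(
        first_check_in=("check_in", "min"),
        last_check_out=("check_out", "max"),
        days_served=("check_in", "count"),  # assumes one row per day signed in
    )
)
```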
The jails generally did not assign each person a unique identifier, such as a prisoner number, making it difficult to root out duplicates or repeat offenders with the same or similar names. We tried to make this determination with the cities when it was possible. But because we did not have unique identifiers, we could not be certain we weren’t counting some people twice. We chose to err on the safe side: although there are roughly 3,700 total records in the database, we estimate the number of participants as the number of records with unique names, or about 3,500. Some of the remaining 200 cases could have been cross-checked in court filings, but doing so would have been time- and cost-prohibitive.
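That conservative count is simple to express in code; a minimal sketch, assuming the combined records sit in a single table with a name column (the file and column names are hypothetical):

```python
# Sketch of the conservative participant estimate: count distinct names rather
# than raw records. The file and column names are hypothetical.
import pandas as pd

records = pd.read_csv("all_programs_combined.csv")

total_records = len(records)                                       # roughly 3,700
participants = records["name"].str.strip().str.lower().nunique()   # about 3,500
```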
Some individual records were incomplete. Torrance and Hawthorne, for example, did not have records for some or all of 2011 and 2012. Huntington Park was missing conviction information for the majority of its participants. When information was missing, we asked the cities to track it down. If they could not or would not do that, we attempted to find it ourselves in court filings and other public records. There are still some missing cells in individual records.

The result was a uniform data set of individual names and cases, ready for categorization and further analysis.

Clustering

Conviction data proved especially difficult because it varied widely in format across different jails. Sometimes a simple phrase described the offense, sometimes it was a section of the California Penal Code, and sometimes it was a combination of the two or a small variation of either. For example, all of the following would constitute a DUI conviction: Driving under the influence, DUI, DWI, 23152(a), 23152(A), 23152 a, PC 23152(a) and 23152(a) PC. We used a two-step approach to wrangle these thousands of messy, inconsistent records.

First, we used the data analysis tool OpenRefine to cluster convictions that were very similar into smaller subsets. Then we labeled everything left over by hand. Once we did that, we had a data set containing almost 200 separate types of convictions.
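OpenRefine’s clustering is interactive, but the core idea is key-collision clustering: normalize each string to a fingerprint and group the strings that share one. A minimal sketch of that idea, with a few illustrative values:

```python
# Sketch of key-collision clustering in the spirit of OpenRefine's
# "fingerprint" method; the sample conviction strings are illustrative.
import re
from collections import defaultdict

def fingerprint(text: str) -> str:
    # Lowercase, strip punctuation, then sort and deduplicate the tokens.
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(sorted(set(tokens)))

convictions = ["PC 23152(a)", "23152(a) PC", "23152(A)", "23152 a", "DUI"]

clusters = defaultdict(list)
for value in convictions:
    clusters[fingerprint(value)].append(value)

# "PC 23152(a)" clusters with "23152(a) PC", and "23152(A)" with "23152 a";
# leftovers such as "DUI" were the kind of values we labeled by hand.
```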
Categorization

To narrow it down further, we developed a categorization system. We defined the following broad categories: offenses involving violence or threats of violence, including assault, battery, robbery and domestic violence; sex crimes against adults or children; DUIs; driving violations; driving offenses causing injury; property crime; fraud; and disobeying court orders. Everything else, about 3.5% of the total, was placed into an “other” category. We categorized participants with multiple charges according to the most serious offense.
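In code, this step reduces to mapping each cleaned conviction to a category and, for people with multiple charges, keeping the most serious one. A minimal sketch; the keyword lists and the severity order are illustrative assumptions, not our exact rules:

```python
# Sketch of conviction categorization. Keywords and the severity ranking are
# illustrative assumptions rather than the project's exact rules.
CATEGORY_RANK = [  # assumed order, most serious first
    "violence or threats of violence",
    "sex crime",
    "DUI",
    "driving offense causing injury",
    "driving violation",
    "property crime",
    "fraud",
    "disobeying court orders",
    "other",
]

KEYWORDS = {
    "violence or threats of violence": ["assault", "battery", "robbery", "domestic violence"],
    "DUI": ["dui", "driving under the influence", "23152"],
    "property crime": ["burglary", "theft", "vandalism"],
    # ...remaining categories follow the same keyword-matching pattern.
}

def categorize(conviction: str) -> str:
    """Map one cleaned conviction label to a broad category."""
    text = conviction.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "other"

def most_serious(charges: list[str]) -> str:
    """For a participant with multiple charges, keep the most serious category."""
    return min((categorize(c) for c in charges), key=CATEGORY_RANK.index)

print(most_serious(["Petty theft", "Assault with a deadly weapon"]))  # -> violence or threats of violence
```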
Fact-checking

After our analysis was complete, we compiled reports for each city to confirm our findings and the accuracy of the data set. We sent each city a summary of what we had found, including the number of participants; the number and percentage of what we called “serious” offenses (those involving violence, threats of violence or sex crimes); the range of days served by participants; the facility’s cost per day; and any additional details about the city that were mentioned in the story. We included the original documents provided to us by the city, as well as our cleaned and formatted final versions of that data.

Each city was given three weeks or more to respond, and we worked with them during that time to resolve any questions or errors. For example, most jails define “one day served” as a 24-hour period, but a few count an eight-hour shift as one day, something we needed to know to make sure lengths of stay were consistent and comparable across jails. This process allowed us to identify and resolve such inconsistencies and errors, both in our analysis and in the original data.
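As an illustration of that day-count issue, here is a minimal sketch of converting reported days into 24-hour-equivalent days; the city name and hour values are hypothetical:

```python
# Sketch of putting lengths of stay on a comparable footing once we knew how
# each jail defined a "day." The city and hour values are hypothetical.
HOURS_PER_REPORTED_DAY = {
    "Example City": 8,  # a few jails counted an eight-hour shift as one day
}

def comparable_days(city: str, days_reported: float) -> float:
    """Convert a jail's reported days served into 24-hour-equivalent days."""
    hours = HOURS_PER_REPORTED_DAY.get(city, 24)  # most jails use 24-hour days
    return days_reported * hours / 24
```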
Credits: Produced by Lily Mihalik