Like many SAS users, Bill Roehl uses SAS in some very clever ways that aren’t necessarily work-related. Roehl is a Research Analyst at Capella University and an outspoken community journalist. When he isn’t working for Capella University, he blogs about community news for residents of Dakota County, Minnesota, drawing thousands of readers each day. Recently, a colleague asked a question about the makeup of the inmate population in the Dakota County jail system, and Roehl used SAS to find the answer.
According to Roehl, inmate data is provided in a publicly available report, but that report is summarized only once a year. He needed an answer based on more current data.
“What’s a data geek like me to do? Start from scratch, of course,” said Roehl.
There are several ways to get fresh data for the report that Roehl would need:
- Request the data from the county using a FOIA (Freedom of Information Act) request – This can be time-consuming and costly, and the information you receive may not be in a format that you can use.
- Manually obtain the data – Roehl could personally visit the facilities and interview each inmate. This is also time-consuming and may be inaccurate.
- Programmatically obtain the data from the county website – The website data may require reformatting, but it provides the best option.
On Dakota County’s site, each inmate’s web page contains basic information including the inmate’s name, booking number, age, gender and race. The record also lists the arrest date, agency and location, reason held and anticipated release date. This information, formatted in a more visual way, can help answer questions about public funding, give the public and governmental institutions insight into the inmate population, and promote cross-departmental collaboration by showing the dynamic nature of crime.
Programmatically obtaining the data is called scraping. According to Roehl, scraping means taking information in its raw HTML form from a publicly available website and converting it to data that you can use for reporting.
The scraping code is built from SAS macros and Perl regular expressions that pull down the data and then parse and process it. Two macros do most of the work: GETINMATEDATA and PROCESSDATA. Because inmate booking numbers are sequential, the first workhorse, GETINMATEDATA, can step through them to retrieve each inmate’s page. The second, PROCESSDATA, gathers only the needed portion of the data on the HTML page, using simple regular expressions to parse each inmate’s data into a data set. According to Roehl, there are many ways this could have been done, but he believes the Perl regular expression functions are the easiest; he uses PRXPARSE and PRXMATCH. He then uses the TRANSPOSE and APPEND procedures to reshape each individual’s data set and add it to a larger SAS data set that includes all of the inmates. This allows him to analyze the data across variables.
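The parse-then-append pattern described above can be illustrated with a minimal sketch, written here in Python rather than SAS. It is not Roehl’s code. The HTML snippet, the field labels, and the names FIELD_PATTERN, parse_inmate_page and all_inmates are assumptions made up for the example, because the article does not show the county’s actual page markup. The regular expression plays the role of the PRX functions, turning label/value rows into one wide record is loosely what a transpose does, and adding that record to a running list stands in for the append step.

```python
import re

# Hypothetical markup: the real Dakota County page layout is not shown in the
# article, so this snippet only stands in for "label/value rows in raw HTML".
SAMPLE_PAGE = """
<tr><td class="label">Name</td><td>DOE, JOHN</td></tr>
<tr><td class="label">Booking Number</td><td>1001</td></tr>
<tr><td class="label">Age</td><td>34</td></tr>
<tr><td class="label">Gender</td><td>M</td></tr>
"""

# One capture group for the field label, one for its value.
FIELD_PATTERN = re.compile(r'<td class="label">([^<]+)</td><td>([^<]*)</td>')


def parse_inmate_page(html):
    """Turn the label/value rows of one page into a single wide record."""
    record = {}
    for label, value in FIELD_PATTERN.findall(html):
        # "Booking Number" -> "booking_number"
        key = label.strip().lower().replace(" ", "_")
        record[key] = value.strip()
    return record


# Append each parsed record to one collection covering all inmates.
all_inmates = []
record = parse_inmate_page(SAMPLE_PAGE)
all_inmates.append(record)
print(record)
```

Keeping every value as text at this stage is a deliberate simplification; converting ages or dates to proper types would be a separate cleaning step.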
He keeps his data current by automatically crawling the Dakota County site about once every four hours and auto-updating all charts and graphs.
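A recurring crawl over sequential booking numbers might look roughly like the Python sketch below. It is an illustration only, not the GETINMATEDATA macro: the function name crawl_new_bookings, the fetch_page callback, and the rule of stopping after a few consecutive missing numbers are all assumptions, and a dictionary stands in for the website so that the example makes no network requests.

```python
def crawl_new_bookings(fetch_page, start, max_misses=3):
    """Walk booking numbers upward from `start`.

    `fetch_page(booking)` should return the page HTML, or None when no page
    exists for that number. The walk stops after `max_misses` consecutive
    missing numbers (an assumed stopping rule, not taken from the article).
    """
    pages = {}
    booking = start
    misses = 0
    while misses < max_misses:
        html = fetch_page(booking)
        if html is None:
            misses += 1
        else:
            pages[booking] = html
            misses = 0  # a hit resets the run of missing numbers
        booking += 1
    return pages


# Stand-in for the real site: booking number -> page HTML (hypothetical data).
FAKE_SITE = {1001: "<html>A</html>", 1002: "<html>B</html>", 1004: "<html>C</html>"}

new_pages = crawl_new_bookings(FAKE_SITE.get, start=1001)
print(sorted(new_pages))
```

Each scheduled run would start from the last booking number already stored, so only new pages are requested.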
Not All Peachy
There are pitfalls to scraping for your data rather than gathering it directly from the agency:
- Understanding data intricacies – “If you don’t know how the data is built over time, your reporting could be messy,” says Roehl. “For instance, in Dakota County, some people are incarcerated for weekend stays only. They’ll show up multiple times in the data, but they are actually just a single entity.”
- Thwarting crawling of the data – “This could happen in any number of ways, including changing the inmate ID number, which is currently sequentially incremented by one, to a random number. Or they could just outright ban your IP address,” he said.
- Data display format changes and updates – According to Roehl, your code is set up to parse within specific parameters. If the HTML template changes, your code will no longer match the page and may return missing or incorrect data.
- Purposefully misleading data provided – An organization or individual may purposefully provide incorrect data in the public-facing website.
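Two of these pitfalls lend themselves to simple defensive checks, sketched below in Python under stated assumptions. The field names in REQUIRED_FIELDS mirror the items the article lists for each inmate page, but the exact names are hypothetical. The identity key used to collapse repeat bookings (name, gender and race) is also an assumption; the article does not say how Roehl identifies a single person, and this key could merge two different people who share those values.

```python
# Fields a parsed record is expected to carry (names are hypothetical).
REQUIRED_FIELDS = {"name", "booking_number", "age", "gender", "race"}


def missing_fields(record):
    """List expected fields absent from a record.

    A non-empty result hints that the page template may have changed.
    """
    return sorted(REQUIRED_FIELDS - set(record))


def count_people(records):
    """Count distinct individuals rather than bookings.

    Uses (name, gender, race) as an assumed identity key, so someone booked
    on several weekends is counted once.
    """
    return len({(r["name"], r["gender"], r["race"]) for r in records})


# Made-up records: the same person booked twice, plus one other person.
bookings = [
    {"name": "DOE, JOHN", "booking_number": "1001", "age": "34", "gender": "M", "race": "W"},
    {"name": "DOE, JOHN", "booking_number": "1008", "age": "34", "gender": "M", "race": "W"},
    {"name": "ROE, JANE", "booking_number": "1003", "age": "41", "gender": "F", "race": "B"},
]

people = count_people(bookings)
problems = missing_fields({"name": "DOE, JOHN", "booking_number": "1001"})
print(people, problems)
```

Checks like these do not prevent the pitfalls, but they surface them early instead of letting them quietly distort a chart.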
In his paper, Roehl assures readers that he has used this code in a slightly modified form for two other crime data sources: Scott County’s Inmate Registry and Minnesota’s Level 3 Sex Offender database. “Both of these sources required very little modification to work with this code even though they are utilizing completely different data display formats. With this in mind it only makes sense that someone with an understanding of regular expressions would be able to easily extract the necessary information from other websites for their own reporting purposes.”
SAS Business Analytics provides another way to bring together disparate criminal information using SAS. Check out this video about CJLEADS. To watch it, open the video titled "Opening Session, Enterprise Excellence Award - NC Office, State Controller."
"CJLEADS brings together disparate criminal justice data to help create a more rounded profile of offenders and provides a single source of information from a variety of criminal justice organizations —including court, warrant, probation, parole and local jail information — which agencies can access securely via the Web," says Kay Meyer, Project Director, NC Office of State Controller.
These code examples are provided as is, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.