👮📊

HPD Stats

A project to track arrest statistics for the Honolulu Police Department (HPD).
View Project

About The Project

This project provides a dashboard that tracks arrest statistics and updates itself from the HPD's daily published arrest log reports.

Why this exists:

The Attorney General's Office provides annual reports on the state of crime in Hawaii. This project provides a mechanism to validate these reports, track the numbers daily, and keep an archive of the raw data.

Project Screenshots

Arrests by Sex

Chart comparing arrests by Sex

Arrests by Age

Chart comparing arrests by Age

Arrests by Ethnicity

Chart comparing arrests by Ethnicity

Percentages comparing arrests by Ethnicity

Officer Breakdown

Officer Breakdown

Officer Breakdown Detailed View

How It Works

Using a combination of image cropping and OCR, we extract data about each arrest from each daily published arrest log.

Full Breakdown

Every day (with cron!), the script is run (cd scrape && python3 main.py) to scrape and parse the newly published arrest log. It then does the following:

  1. Uploads the PDF file to AWS S3 for archiving
  2. Downloads the PDF file locally for parsing purposes
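The archiving step could look roughly like the sketch below. This is not the project's actual code: the key layout, the bucket and directory arguments, and the helper names `archive_key` and `archive_pdf` are all assumptions made for illustration. It assumes the PDF has already been fetched as bytes, and it uses boto3's `put_object` call on an S3 client.

```python
from datetime import date
from pathlib import Path


def archive_key(log_date: date, prefix: str = "arrest-logs") -> str:
    """Build a date-based S3 object key (hypothetical layout, not the project's)."""
    return f"{prefix}/{log_date:%Y/%m/%d}.pdf"


def archive_pdf(pdf_bytes: bytes, log_date: date, bucket: str, local_dir: Path) -> Path:
    """Archive the PDF to S3, then save a local copy for parsing."""
    # Imported here so archive_key() can be used without boto3 installed.
    import boto3

    key = archive_key(log_date)
    # 1. Upload the raw PDF to S3 for archiving.
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=pdf_bytes)

    # 2. Write the same bytes to disk so the parsing steps can read them.
    local_path = local_dir / f"{log_date.isoformat()}.pdf"
    local_path.write_bytes(pdf_bytes)
    return local_path
```

Keying the archive by date keeps one object per daily log, which makes it straightforward to re-run the parser against any past day.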

After we download the file, we prepare it for image cropping and OCR. To do this, we:

  1. Split the PDF into individual pages (Example Page PDF)
  2. Convert all the PDF file's pages into images (Example Page Image)
  3. Vertically concatenate all the page images into one long image, cropping out the top and the bottom so that it contains only arrest records (Example Vertically Concatenated Image)
  4. Crop each individual arrest record using the location of pixels (Example Record Image)
  5. Crop each portion of the arrest record by the categories we want to parse
  6. Use OCR (PyTesseract) to parse the text
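Steps 3 and 4 above can be illustrated with a small, dependency-free sketch. It represents each page as rows of grayscale pixel values rather than using a real image library, and it assumes that records are separated by fully dark horizontal rule lines. That separator rule, the threshold, and the function names `stack_pages` and `record_bounds` are assumptions for illustration; the project's actual pixel logic may differ.

```python
from typing import List, Tuple

# A page is a list of rows; each row is a list of grayscale values
# (0 = black, 255 = white). A real pipeline would use an image library.
Image = List[List[int]]


def stack_pages(pages: List[Image], top: int, bottom: int) -> Image:
    """Drop `top` header rows and `bottom` footer rows per page, then stack."""
    stacked: Image = []
    for page in pages:
        stacked.extend(page[top:len(page) - bottom])
    return stacked


def record_bounds(image: Image, dark_threshold: int = 128) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges for each record.

    Assumes a row whose pixels are all darker than `dark_threshold`
    is a separator line between records.
    """
    bounds: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(image):
        is_separator = all(px < dark_threshold for px in row)
        if is_separator:
            if start is not None:
                bounds.append((start, y))  # end is exclusive
                start = None
        elif start is None:
            start = y  # first non-separator row of a new record
    if start is not None:
        bounds.append((start, len(image)))
    return bounds
```

Each `(start, end)` range can then be cropped out of the long image, split into its per-category regions, and passed to PyTesseract (for example with `pytesseract.image_to_string`).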

We then upload the data to AWS DynamoDB. Using Flask and boto3 (the AWS SDK for Python), data is served from DynamoDB to the HPDStats website. An example of the artifacts generated from the script can be viewed here: Example Artifacts
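As a rough illustration of how parsed records could become dashboard numbers, the sketch below computes the percentage of arrests per category. It assumes the records have already been read from DynamoDB into plain dictionaries. The field names and the helper `percentages_by` are hypothetical, not taken from the project.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def percentages_by(records: Iterable[Mapping[str, str]], field: str) -> Dict[str, float]:
    """Return each category's share of arrests, as a percentage rounded to 0.1."""
    # Records missing the field (e.g. an OCR miss) are grouped under "Unknown".
    counts = Counter(record.get(field, "Unknown") for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total, 1) for category, n in counts.items()}
```

A Flask route could return the resulting dictionary as JSON for the charts to consume.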