A number of scrapers collect data for this project. In some cases the initial scraping was done manually, and the data was then loaded into the database. Nightly Celery tasks update and maintain the data; these are described in UPDATES_CELERY.
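For orientation, a nightly update in Celery is typically wired up with a beat schedule. The sketch below is illustrative only: the app name, task paths, and run times are hypothetical, and the real schedule is documented in UPDATES_CELERY.

```python
# Minimal sketch of a Celery beat schedule for nightly data updates.
# The app name, task paths, and crontab times are hypothetical; see
# UPDATES_CELERY for the project's actual configuration.
from celery import Celery
from celery.schedules import crontab

app = Celery("flatgov")  # hypothetical app name

app.conf.beat_schedule = {
    "update-bill-data-nightly": {
        "task": "scrapers.tasks.update_bill_data",   # hypothetical task
        "schedule": crontab(hour=2, minute=0),
    },
    "update-cbo-scores-nightly": {
        "task": "scrapers.tasks.update_cbo_scores",  # hypothetical task
        "schedule": crontab(hour=3, minute=0),
    },
}
```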
The following scrapers are available:
- US Congress (for bills and metadata)
- Statements of Administration Policy
- CRS reports
- Relevant Committee Documents
- CBO Scores
To download the bill status bulk data and process the bills, run:
./run govinfo --bulkdata=BILLSTATUS
./run bills
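As a quick sanity check after the run, you can count the downloaded bill status files. The sketch below assumes the congress scraper's default data/ output layout; adjust the root path if your checkout differs.

```python
# Count BILLSTATUS XML files produced by "./run govinfo --bulkdata=BILLSTATUS".
# The data/govinfo/BILLSTATUS root is an assumption about the scraper's
# default layout; adjust it if your setup differs.
from pathlib import Path

root = Path("data/govinfo/BILLSTATUS")
xml_files = list(root.rglob("*.xml"))
print(f"{len(xml_files)} bill status files under {root}")
```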
When running the scraper initially, I got an error because the bulk directories had not been created. To unzip the files manually in all directories:
find . -name "*.zip" | xargs -P 5 -I fileName sh -c 'unzip -o -d "$(dirname "fileName")/$(basename -s .zip "fileName")" "fileName"'
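The same can be done from Python with the standard library, which may be easier to adapt on platforms without xargs -P. This is a sketch mirroring the one-liner above, not a project script.

```python
# Unzip every .zip under the current directory into a sibling directory
# named after the archive, mirroring the find/xargs one-liner above.
import zipfile
from pathlib import Path

for archive in Path(".").rglob("*.zip"):
    target = archive.parent / archive.stem  # e.g. data/foo.zip -> data/foo/
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)  # overwrites existing files, like unzip -o
```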
For more detail, see USCONGRESS_SCRAPER.
Instructions for loading the database fixture for the Statements of Administration Policy are in the DATA BACKGROUND document, under DATA BACKGROUND: Statement of Administration Policy.
To load the latest data (from the Biden Administration), run:
./manage.py biden_statements
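The command can also be invoked programmatically, for example from a script or a Celery task, using Django's standard management API:

```python
# Run the biden_statements management command from Python, equivalent
# to "./manage.py biden_statements". Django settings must already be
# configured, as in any management-command context.
from django.core.management import call_command

call_command("biden_statements")
```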
The scraper for CRS Reports, and its instructions, are described in CRS_REPORTS_SCRAPER.
To load the Relevant Committee Documents data, follow these steps (a sketch of the final loading step appears after the list):
- After installing the requirements in the scrapers directory, run the crec_scrape_urls.py file in that directory.
- Go to the crec_scrapy folder and run the “scrapy crawl crec” command. It takes about an hour to scrape all the data into crec_scrapy/data/crec_data.json.
- Copy the scraped data from crec_scrapy/data/crec_data.json to the Django base directory, first deleting the old data there or replacing it.
- Run the “./manage.py load_crec” command to load the data into the database.
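For orientation, a loader like load_crec typically reads the JSON file and bulk-creates model rows. The sketch below is illustrative only: the committee_docs app, the CommitteeDocument model, and the JSON keys are assumptions, not the project's actual schema.

```python
# Hypothetical sketch of a load_crec-style management command. It reads
# crec_data.json from the Django base directory and bulk-creates rows.
# The committee_docs app, CommitteeDocument model, and the title/url/date
# keys are assumptions for illustration.
import json

from django.core.management.base import BaseCommand

from committee_docs.models import CommitteeDocument  # hypothetical model


class Command(BaseCommand):
    help = "Load scraped CREC committee documents into the database"

    def handle(self, *args, **options):
        with open("crec_data.json") as f:
            records = json.load(f)
        CommitteeDocument.objects.bulk_create(
            [
                CommitteeDocument(
                    title=item.get("title", ""),
                    url=item.get("url", ""),
                    date=item.get("date"),
                )
                for item in records
            ],
            ignore_conflicts=True,
        )
        self.stdout.write(f"Loaded {len(records)} CREC documents")
```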
The CBO task processes metadata from the bill status files that are downloaded by the unitedstates/congress scraper (above). It finds the 'cboCostEstimates' field, collects those entries for all bills, and then stores any new scores in the database.
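Conceptually, the extraction step looks like the sketch below: walk the parsed bill data and collect any cboCostEstimates entries. The data.json filename and the pubDate/title/url sub-keys are assumptions about the BILLSTATUS-derived layout, not verified against the project's code.

```python
# Sketch of the CBO step: scan parsed bill status records for the
# "cboCostEstimates" field and collect the entries. The data.json name
# and the pubDate/title/url sub-keys are assumptions for illustration.
import json
from pathlib import Path

def collect_cbo_estimates(bill_status_dir: str) -> list[dict]:
    estimates = []
    for path in Path(bill_status_dir).rglob("data.json"):
        bill = json.loads(path.read_text())
        for item in bill.get("cboCostEstimates", []):
            estimates.append({
                "bill": path.parent.name,
                "title": item.get("title"),
                "url": item.get("url"),
                "pubDate": item.get("pubDate"),
            })
    return estimates
```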
The text of bills can be scraped with the Python project at https://github.com/unitedstates/congress. First, clone this repository as a child of the FlatGovDir directory above, alongside the project's other directories. In this way, running the scraper will fill out the text data within the congress directory.
$ cd /path/to/FlatGovDir
$ git clone https://github.com/unitedstates/congress.git
(no credentials are needed; it is an open repo)
Set up the congress Python virtual environment and install the requirements (pip install -r requirements.txt). The scrapers were built with Python 2.7 and have not been upgraded; updates may be needed for a production environment, but the @unitedstates/congress scraper is sufficient to gather the baseline data to test the utilities in this repository.
Note: Scraping the initial data can be very time-consuming (most of a day, depending on your internet download speeds). To get started, it is worth finding a source for bulk downloads of the text, if possible.
On macOS (Catalina), installing the congress requirements involved a few adjustments:
- Install OpenSSL 1.0.2 with Homebrew. The latest OpenSSL (>1.1) causes problems with certain requirements; unfortunately, version 1.0.0 also failed. A GitHub user set up a script to install version 1.0.2:
brew uninstall openssl --ignore-dependencies; brew uninstall openssl --ignore-dependencies; brew uninstall libressl --ignore-dependencies; brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/8b9d6d688f483a0f33fcfc93d433de501b9c3513/Formula/openssl.rb;
- Link the OpenSSL libraries:
export LDFLAGS=-L/usr/local/opt/openssl/lib
export CPPFLAGS=-I/usr/local/opt/openssl/include
- Install pytz, pep517, and cryptography directly:
pip install pytz
pip install pep517
pip install cryptography
- Install the requirements: from the congress repository directory, run pip install -r requirements.txt