For this assignment, I used the gov01data.R package to scrape government documents. I encountered three issues, all of which have been resolved. Here is an overview of the scraping process:
File Format Requirements: The first challenge involved downloading the dataset listings. I assumed that only the CSV file was required; however, the govdata01 package relies on both the CSV and JSON files. Downloading both was necessary to ensure compatibility with the package’s functions and to enable proper data extraction.
Setting the File Directory: At first, the CSV and JSON files were not being recognized because there was no designated directory for them. With the guidance of AI, specifically ChatGPT, I determined that a dedicated folder was required to organize and access the data files. I therefore stored the CSV and JSON files in data, the folder I had created for assignment 4, which allowed the download and processing steps to complete successfully.
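The directory check described above can be sketched in a few lines of base R. This is only an illustration, not the code I ran: the helper name check_listing_files and the file names listing.csv and listing.json are hypothetical placeholders, and the example works in a temporary folder so that it does not touch the real data folder.

```r
# Hypothetical helper (not part of the original script): make sure the
# data folder exists and report which listing files are present in it.
check_listing_files <- function(data_dir, files) {
  # Create the folder if it is missing; does nothing if it already exists
  dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
  paths <- file.path(data_dir, files)
  # Named logical vector: TRUE where the file was found
  stats::setNames(file.exists(paths), files)
}

# Example in a temporary folder with placeholder file names (assumptions)
demo_dir <- file.path(tempdir(), "data")
check_listing_files(demo_dir, c("listing.csv", "listing.json"))

# Simulate having downloaded only the CSV listing
writeLines("packageId,title", file.path(demo_dir, "listing.csv"))
after <- check_listing_files(demo_dir, c("listing.csv", "listing.json"))
print(after)
```

Running a check like this before reading the listings would show right away whether the JSON file is missing, instead of finding out through a later read error.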
File Downloads: During the first download process, the last five entries failed because the corresponding JSON records contained missing or empty PDF link fields. As a result, items 6–10 were labeled “Failed to download” due to the absence of valid PDF URLs. To resolve this issue, I obtained updated CSV and JSON files and attempted the process again with 20 government records. On the second attempt, 19 of the 20 files downloaded successfully, with only the third file failing due to a persistent missing PDF reference.
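Records like these could be flagged before any download is attempted. The following is a minimal sketch under stated assumptions, not the code I ran: has_pdf_link is a hypothetical helper, the toy data frame stands in for the parsed JSON result set, and the link field is assumed to be a character column named pdfLink with an index column alongside it.

```r
# Hypothetical helper: TRUE where a PDF link is present and non-empty
has_pdf_link <- function(links) {
  !is.na(links) & nzchar(trimws(links))
}

# Toy records standing in for the parsed JSON result set (made-up values)
records <- data.frame(
  index   = 1:4,
  pdfLink = c("https://example.org/a.pdf", "", NA, "https://example.org/d.pdf"),
  stringsAsFactors = FALSE
)

ok <- has_pdf_link(records$pdfLink)

# Keep only rows that can actually be downloaded
downloadable <- records[ok, ]

# Record which entries were skipped so they can be reported
skipped_ids <- records$index[!ok]
print(skipped_ids)
```

Separating the check from the download step means that a "Failed to download" message would point to a network or server problem, not to a record that never had a link.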
Data Usability and Recommendations: Despite these challenges, the resulting dataset remains usable, as it includes a diverse range of recent government documents. For future development, a stronger filtering mechanism, such as automatically excluding records with missing PDF link fields, would improve the overall efficiency of the document retrieval process. Below, I’ve pasted my code from the run in which 19 of the 20 government files were downloaded.
## Scraping Government data
## Website: GovInfo (https://www.govinfo.gov/app/search/)
## Prerequisite: Download from website the list of files to be downloaded
## Designed for background job
# Start with a clean plate and lean loading to save memory
gc(reset=T)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  604110 32.3    1374994 73.5   604110 32.3
Vcells 1097459  8.4    8388608 64.0  1097459  8.4
The following object is masked from 'package:purrr':
set_names
## Set path for reading the listing and home directory
## For Windows, use "c:\\directory\\subdirectory\\"
## For Mac, "/Users/YOURNAME/path/"
library(rjson)
library(jsonlite)
Attaching package: 'jsonlite'
The following objects are masked from 'package:rjson':
fromJSON, toJSON
The following object is masked from 'package:purrr':
flatten
library(data.table)
Attaching package: 'data.table'
The following object is masked from 'package:purrr':
transpose
library(readr)
Warning: package 'readr' was built under R version 4.5.3
## CSV method
govfiles=read.csv(file="data/govinfo-search-results-2025-10-29T19_06_10.csv", skip=2)
## JSON method
### rjson
gf_list <- rjson::fromJSON(file ="data/govinfo-search-results-2025-10-29T19_06_42.json")
govfile2=dplyr::bind_rows(gf_list$resultSet)
### jsonlite
gf_list1 = jsonlite::read_json("data/govinfo-search-results-2025-10-29T19_06_42.json")
### Extract the list
govfiles3 <- gf_list1$resultSet
### One more step
govfiles3 <- gf_list1$resultSet |> dplyr::bind_rows()
# Preparing for bulk download of government documents
govfiles$id = govfiles$packageId
pdf_govfiles_url = govfiles3$pdfLink
pdf_govfiles_id <- govfiles3$index
# Directory to save the pdf's
# Be sure to create a folder for storing the pdf's
save_dir <- "datamethods"
# Function to download pdfs
download_govfiles_pdf <- function(url, id) {
  tryCatch({
    # file.path() inserts the path separator; paste0(save_dir, "govfiles_", ...)
    # would write "datamethodsgovfiles_<id>.pdf" into the working directory instead
    destfile <- file.path(save_dir, paste0("govfiles_", id, ".pdf"))
    download.file(url, destfile = destfile, mode = "wb") # Binary files
    Sys.sleep(runif(1, 1, 3)) # Important: random sleep between 1 and 3 seconds to avoid suspicion of "hacking" the server
    return(paste("Successfully downloaded:", url))
  },
  error = function(e) {
    return(paste("Failed to download:", url))
  })
}
# Download files, potentially in parallel for speed
# Simple timer, can use package like tictoc
## Try twenty
start.time <- Sys.time()
message("Starting downloads")