Assignment 5: Government Data API

For this assignment, I used the gov01data.R package to scrape government documents. I encountered three issues along the way and worked through each of them. Here is an overview of the scraping process, followed by a recommendation:

  1. File Format Requirements: The first challenge involved downloading the dataset listings. I assumed that only the CSV file was required; however, the govdata01 package relies on both the CSV and the JSON listing. Downloading both files was necessary to ensure compatibility with the package’s functions and to enable proper data extraction.

  2. Setting the File Directory: Before the downloads could start, the CSV and JSON files were not being recognized because there was no designated directory for them. With guidance from AI, specifically ChatGPT, I determined that a dedicated folder was required to organize and access the data files. I therefore stored the CSV and JSON files in the folder named data that I had created for Assignment 4, which allowed the download and processing steps to complete successfully.

  3. File Downloads: On the first download attempt, the last five entries failed because the corresponding JSON records contained missing or empty PDF link fields. As a result, items 6–10 were labeled “Failed to download” because they had no valid PDF URLs. To resolve this issue, I obtained updated CSV and JSON files and ran the process again with 20 government records. On the second attempt, 19 of the 20 files downloaded successfully; only the third file failed, because its PDF link was still missing.

  4. Data Usability and Recommendations: Despite these challenges, the resulting dataset remains usable, as it includes a diverse range of recent government documents. For future development, a stronger filtering mechanism, such as automatically excluding records with missing pdfLink fields, would improve the efficiency of the document retrieval process. Below, I’ve pasted my code from the run in which 19 of the 20 government files were downloaded.
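Before the code itself, the filtering idea from item 4 can be sketched as follows. This is a minimal illustration, not part of the run reported here: the `listing` data frame is a small hypothetical stand-in for the table built from the JSON file, and `keep_downloadable` is an illustrative helper name. Only the `index` and `pdfLink` column names are taken from the code in this report.

```r
# Hypothetical listing standing in for the data frame built from the JSON file.
# Only the column names `index` and `pdfLink` come from the report's code.
listing <- data.frame(
  index = 1:4,
  pdfLink = c(
    "https://www.govinfo.gov/content/pkg/BILLS-119s3056is/pdf/BILLS-119s3056is.pdf",
    "",   # empty link
    NA,   # missing link
    "https://www.govinfo.gov/content/pkg/CREC-2025-10-27/pdf/CREC-2025-10-27-pt1-PgS7753-4.pdf"
  ),
  stringsAsFactors = FALSE
)

# Keep only the rows whose link is present and non-empty.
# `keep_downloadable` is a hypothetical helper name.
keep_downloadable <- function(df, link_col = "pdfLink") {
  links <- df[[link_col]]
  has_link <- !is.na(links) & nzchar(trimws(links))
  df[has_link, , drop = FALSE]
}

downloadable <- keep_downloadable(listing)
skipped <- setdiff(listing$index, downloadable$index)

message(nrow(downloadable), " records kept, ", length(skipped), " skipped")
```

Filtering before the download loop would mean that records without a PDF link never reach `download.file()`, so they would not appear as failures in the results.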

## Scraping Government data
## Website: GovInfo (https://www.govinfo.gov/app/search/)
## Prerequisite: download the list of files (CSV and JSON) from the website
## Designed to run as a background job

# Start with a clean slate and load only what is needed to save memory

gc(reset = TRUE)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  604110 32.3    1374994 73.5   604110 32.3
Vcells 1097459  8.4    8388608 64.0  1097459  8.4
# install.packages(c("purrr", "magrittr"))
library(purrr)
library(magrittr) # Alternatively, load tidyverse

Attaching package: 'magrittr'
The following object is masked from 'package:purrr':

    set_names
## Set path for reading the listing and home directory
## For Windows, use "c:\\directory\\subdirectory\\"
## For Mac, "/Users/YOURNAME/path/"

library(rjson)
library(jsonlite)

Attaching package: 'jsonlite'
The following objects are masked from 'package:rjson':

    fromJSON, toJSON
The following object is masked from 'package:purrr':

    flatten
library(data.table)

Attaching package: 'data.table'
The following object is masked from 'package:purrr':

    transpose
library(readr)
Warning: package 'readr' was built under R version 4.5.3
## CSV method
govfiles <- read.csv(file = "data/govinfo-search-results-2025-10-29T19_06_10.csv", skip = 2)

## JSON method
### rjson
gf_list <- rjson::fromJSON(file ="data/govinfo-search-results-2025-10-29T19_06_42.json")
govfile2=dplyr::bind_rows(gf_list$resultSet)

### jsonlite
gf_list1 = jsonlite::read_json("data/govinfo-search-results-2025-10-29T19_06_42.json")

### Extract the list of records
govfiles3 <- gf_list1$resultSet

### Bind the records into a single data frame
govfiles3 <- gf_list1$resultSet |> dplyr::bind_rows()


# Preparing for bulk download of government documents
govfiles$id = govfiles$packageId
pdf_govfiles_url = govfiles3$pdfLink
pdf_govfiles_id <- govfiles3$index

# Directory to save the PDFs
# Create the folder first if it does not already exist
save_dir <- "datamethods"
if (!dir.exists(save_dir)) dir.create(save_dir, recursive = TRUE)

# Function to download one PDF; returns a status message instead of stopping on error
download_govfiles_pdf <- function(url, id) {
  tryCatch({
    # Build the destination path inside save_dir
    destfile <- file.path(save_dir, paste0("govfiles_", id, ".pdf"))
    download.file(url, destfile = destfile, mode = "wb") # Binary files
    Sys.sleep(runif(1, 1, 3))  # Important: random pause of 1 to 3 seconds between requests to be polite to the server
    return(paste("Successfully downloaded:", url))
  },
  error = function(e) {
    return(paste("Failed to download:", url))
  })
}
# Download files one at a time (this could be parallelized for speed)
# Simple timer; a package such as tictoc would also work

## Try the first twenty records
start.time <- Sys.time()
message("Starting downloads")
Starting downloads
results <- 1:20 %>% 
  purrr::map_chr(~ download_govfiles_pdf(pdf_govfiles_url[.], pdf_govfiles_id[.]))
Warning in download.file(url, destfile = destfile, mode = "wb"): URL '': status
was 'URL using bad/illegal format or missing URL'
message("Finished downloads")
Finished downloads
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 43.19393 secs
# Print results
print(results)
 [1] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119scal-2025-10-29/pdf/CCAL-119scal-2025-10-29-pt2.pdf" 
 [2] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119scal-2025-10-29/pdf/CCAL-119scal-2025-10-29-pt6.pdf" 
 [3] "Failed to download: "                                                                                                     
 [4] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119hcal-2025-10-28/pdf/CCAL-119hcal-2025-10-28-pt9.pdf" 
 [5] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119hcal-2025-10-28/pdf/CCAL-119hcal-2025-10-28-pt13.pdf"
 [6] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119hcal-2025-10-28/pdf/CCAL-119hcal-2025-10-28-pt22.pdf"
 [7] "Successfully downloaded: https://www.govinfo.gov/content/pkg/BILLS-119s3056is/pdf/BILLS-119s3056is.pdf"                   
 [8] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CREC-2025-10-27/pdf/CREC-2025-10-27-pt1-PgS7753-4.pdf"       
 [9] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119scal-2025-10-27/pdf/CCAL-119scal-2025-10-27-pt2.pdf" 
[10] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119scal-2025-10-27/pdf/CCAL-119scal-2025-10-27-pt6.pdf" 
[11] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119hcal-2025-10-24/pdf/CCAL-119hcal-2025-10-24-pt9.pdf" 
[12] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119hcal-2025-10-24/pdf/CCAL-119hcal-2025-10-24-pt13.pdf"
[13] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CREC-2025-10-23/pdf/CREC-2025-10-23-pt1-PgD1072.pdf"         
[14] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CREC-2025-10-23/pdf/CREC-2025-10-23-pt1-PgS7722-9.pdf"       
[15] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CREC-2025-10-23/pdf/CREC-2025-10-23-pt1-PgS7729-5.pdf"       
[16] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CREC-2025-10-23/pdf/CREC-2025-10-23-pt1-PgS7730.pdf"         
[17] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CREC-2025-10-23/pdf/CREC-2025-10-23-pt1-PgS7732.pdf"         
[18] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CREC-2025-10-23/pdf/CREC-2025-10-23-pt1-PgS7733-3.pdf"       
[19] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119scal-2025-10-23/pdf/CCAL-119scal-2025-10-23-pt2.pdf" 
[20] "Successfully downloaded: https://www.govinfo.gov/content/pkg/CCAL-119scal-2025-10-23/pdf/CCAL-119scal-2025-10-23-pt6.pdf"
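
To check the outcome without reading every line, the status vector can be tallied. The following is a minimal sketch, assuming the messages keep the "Successfully downloaded:" and "Failed to download:" prefixes returned by `download_govfiles_pdf()`. The three-element `results` vector is a shortened stand-in for the real one, and `succeeded`, `n_ok`, and `failed_at` are illustrative names.

```r
# Shortened stand-in for the `results` vector, in the same message format
results <- c(
  "Successfully downloaded: https://www.govinfo.gov/content/pkg/BILLS-119s3056is/pdf/BILLS-119s3056is.pdf",
  "Failed to download: ",
  "Successfully downloaded: https://www.govinfo.gov/content/pkg/CREC-2025-10-27/pdf/CREC-2025-10-27-pt1-PgS7753-4.pdf"
)

# Classify each message by its prefix
succeeded <- startsWith(results, "Successfully downloaded:")
n_ok <- sum(succeeded)
failed_at <- which(!succeeded)  # positions of the failed downloads

message(n_ok, " of ", length(results), " downloads succeeded")
if (length(failed_at) > 0) {
  message("Failed positions: ", paste(failed_at, collapse = ", "))
}
```

Run against the full 20-element vector printed above, the same logic would count 19 successes and flag position 3, matching the summary in item 3.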