Using gwArsenicR to Model Health Effects of Groundwater Arsenic Exposure

Introduction

Welcome to the gwArsenicR package vignette. This document provides a comprehensive guide to using the package for estimating arsenic exposure from private well water and assessing its association with health outcomes.

The core methodology implemented in this package is based on the approaches used in foundational epidemiological studies by Bulka et al. (2022) and Lombard et al. (2021).

This vignette will cover:

The conceptual background of the statistical model.
The required data formats for the analysis.
A complete, step-by-step example from data preparation to results interpretation.

Conceptual Framework

The primary challenge in studying the health effects of arsenic in private wells is the uncertainty in exposure assessment. Private wells are not federally regulated, and direct measurements for large populations are often unavailable. gwArsenicR addresses this by implementing a multi-stage approach:

Probabilistic Exposure Modeling: The package combines two sources of arsenic prediction data:
- USGS Models: Machine learning models (Lombard et al., 2021) that provide a multinomial probability for arsenic concentration falling into one of three categories (<5, 5-10, >10 µg/L) for grid cells across the U.S.
- EPA Data: County-level estimates of arsenic concentrations from public supply wells. These are combined, weighted by the proportion of the population using private vs. public wells, to create a county-specific probability distribution of arsenic exposure.
Multiple Imputation: Because we don’t know the true arsenic exposure for each individual, we treat it as missing data. The package uses a multiple imputation framework to create n complete datasets (ndraws). In each dataset, a plausible arsenic exposure category is assigned to each county based on the probability distribution derived in the previous step. The framework also imputes missing values in any user-specified covariates.
Regression Analysis and Pooling: A regression model (typically a linear mixed-effects model) is fitted to each of the n imputed datasets. The results (coefficients and standard errors) from these models are then pooled using Rubin’s Rules to produce a single set of estimates that properly accounts for the uncertainty introduced by the imputation process.

Data Requirements

The perform_sensitivity_analysis() function requires three main data inputs.

1. USGS Arsenic Probability Data (`as_usgs_prob_csv`)

A CSV file containing the multinomial probability of arsenic levels from the USGS model for each raster grid cell.

Required Columns:
- RFC3_C1v2, RFC3_C2v2, RFC3_C3v2: Numeric probabilities for the three arsenic categories. These must sum to 1 for each row.
- Wells_2010: Numeric population of well users for each raster cell.
- GEOID10: A 5-digit character or numeric county identifier (FIPS code).

2. EPA Arsenic Data (`as_epa_prob_csv`)

A CSV file containing the EPA model data.

Required Columns:
- EPA_AS_meanlog: Numeric mean (on a log scale) of the arsenic concentration.
- PWELL_private_pct: Numeric percentage of the population on private wells (from 0 to 100).
- GEOID10: A 5-digit county identifier (FIPS code) to link with the USGS data.

3. Birth Data (`birth_data`)

An R data.frame containing individual-level health outcome and covariate data.

Required Columns:
- FIPS: A 5-digit county identifier to link with the arsenic data.
- A column for each health outcome (e.g., BWT for birth weight).
- Columns for all covariates included in the regression formula (e.g., MAGE_R, smoke, MRSTATE).

A Complete Walkthrough

Let’s walk through a full example.

Step 1: Setup

First, load the necessary libraries.

library(gwArsenicR)
library(lme4) # Loaded by gwArsenicR, but good practice for clarity
#> Loading required package: Matrix

Step 2: Generate Example Data

For this vignette to be self-contained, we’ll first define and then use a helper function to generate dummy data files. In a real analysis, you would load your own data.

Step 3: Define Analysis Parameters

Next, define all the parameters for the analysis, including the regression formula.

# Use a small number for the vignette; recommend >= 20 for real analysis
ndraws <- 5
targets <- c("BWT", "OEGEST") # Birth Weight and Gestational Age
vars_to_impute <- c("MAGE_R", "smoke")
output_dir <- "analysis_results"
as_level_col <- "AsLevel"

# Define the regression formula
# This is a complex formula provided as an example. It models the outcome as a
# function of the imputed Arsenic Level, with adjustments for various
# covariates. It includes a random effect for county (FIPS) nested within state
# (MRSTATE) to account for spatial clustering.
regression_formula <- paste0(
  "~ as.factor(", as_level_col, ") + ",
  "rms::rcs(MAGE_R, parms = 3) + ",
  "as.factor(MRACEHISP_F) + ",
  "MEDUC_2 + MEDUC_3 + MEDUC_4 + ",
  "as.factor(RUCC_F) + DMAR_1 + smoke + pm + ",
  "(1 | MRSTATE / FIPS)"
)

Step 4: Run the Analysis

With all parameters defined, call the main function perform_sensitivity_analysis().

# This chunk may take a moment to run
results <- perform_sensitivity_analysis(
  birth_data = dummy_files$births,
  as_usgs_prob_csv = dummy_files$usgs,
  as_epa_prob_csv = dummy_files$epa,
  ndraws = ndraws,
  regression_formula = regression_formula,
  impute_vars = vars_to_impute,
  output_dir = output_dir,
  targets = targets
)
#> All 5 imputed datasets validated successfully
#> Processing target variable: BWT
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> Processing target variable: OEGEST

Step 5: Interpret the Results

The results object is a list where each element is a mipo object from the mice package. To view a publication-ready table of the pooled results, use summary().

Let’s examine the results for Birth Weight (BWT).

bwt_summary <- results$BWT
knitr::kable(
  bwt_summary,
  caption = "Pooled Regression Results for Birth Weight (BWT)"
)

Pooled Regression Results for Birth Weight (BWT)
term	q.mi	se.mi	statistic	conf.low	conf.high	p.value
As5-10	3.398037	26.13952	0.1299962	-47.92836	54.72443	0.8966090
As10+	18.354884	38.26601	0.4796654	-64.09919	100.80895	0.6392332

The summary table includes: * term: The regression model term. * estimate: The pooled point estimate (e.g., the mean difference in birth weight). * std.error: The pooled standard error. * statistic: The t-statistic. * df: The degrees of freedom. * p.value: The p-value for the term.

From this table, you can assess the magnitude, direction, and statistical significance of the association between different levels of arsenic exposure and the health outcome, while accounting for uncertainty from both exposure and covariate imputation.

Step 6: Cleanup

Finally, remove the dummy data directory created for this example.

unlink("example_data", recursive = TRUE)

References

Bulka, C. M., et al. (2022). Arsenic in private well water and birth outcomes in the United States. Environment International, 163, 107176.

Lombard, M. A., et al. (2021). Machine learning models of arsenic in private wells throughout the conterminous United States as a tool for exposure assessment in human health studies. Environmental science & technology, 55(8), 5012-5023.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.

van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67.

Sayantan Majumdar, Scott M. Bartell, Melissa A. Lombard, Ryan G. Smith, and Matthew O. Gribble

2025-06-29

Introduction

Conceptual Framework

Data Requirements

1. USGS Arsenic Probability Data (`as_usgs_prob_csv`)

2. EPA Arsenic Data (`as_epa_prob_csv`)

3. Birth Data (`birth_data`)

A Complete Walkthrough

Step 1: Setup

Step 2: Generate Example Data

Step 3: Define Analysis Parameters

Step 4: Run the Analysis

Step 5: Interpret the Results

Step 6: Cleanup

References

Using gwArsenicR to Model Health Effects of Groundwater Arsenic Exposure

Sayantan Majumdar, Scott M. Bartell, Melissa A. Lombard, Ryan G. Smith, and Matthew O. Gribble

2025-06-29

Introduction

Conceptual Framework

Data Requirements

1. USGS Arsenic Probability Data (as_usgs_prob_csv)

2. EPA Arsenic Data (as_epa_prob_csv)

3. Birth Data (birth_data)

A Complete Walkthrough

Step 1: Setup

Step 2: Generate Example Data

Step 3: Define Analysis Parameters

Step 4: Run the Analysis

Step 5: Interpret the Results

Step 6: Cleanup

References

1. USGS Arsenic Probability Data (`as_usgs_prob_csv`)

2. EPA Arsenic Data (`as_epa_prob_csv`)

3. Birth Data (`birth_data`)