Using gwArsenicR to Model Health Effects of Groundwater Arsenic Exposure
Sayantan Majumdar, Scott M. Bartell, Melissa A. Lombard, Ryan G. Smith, and Matthew O. Gribble
2025-06-29
gwArsenicR-vignette.Rmd
Introduction
Welcome to the gwArsenicR
package vignette. This
document provides a comprehensive guide to using the package for
estimating arsenic exposure from private well water and assessing its
association with health outcomes.
The core methodology implemented in this package is based on the approaches used in foundational epidemiological studies by Bulka et al. (2022) and Lombard et al. (2021).
This vignette will cover:
- The conceptual background of the statistical model.
- The required data formats for the analysis.
- A complete, step-by-step example from data preparation to results interpretation.
Conceptual Framework
The primary challenge in studying the health effects of arsenic in
private wells is the uncertainty in exposure assessment. Private wells
are not federally regulated, and direct measurements for large
populations are often unavailable. gwArsenicR
addresses
this by implementing a multi-stage approach:
-
Probabilistic Exposure Modeling: The package combines two sources of arsenic prediction data:
-
USGS Models: Machine learning models (Lombard et
al., 2021) that provide a multinomial probability for arsenic
concentration falling into one of three categories (
<5
,5-10
,>10
µg/L) for grid cells across the U.S. - EPA Data: County-level estimates of arsenic concentrations from public supply wells. These are combined, weighted by the proportion of the population using private vs. public wells, to create a county-specific probability distribution of arsenic exposure.
-
USGS Models: Machine learning models (Lombard et
al., 2021) that provide a multinomial probability for arsenic
concentration falling into one of three categories (
Multiple Imputation: Because we don’t know the true arsenic exposure for each individual, we treat it as missing data. The package uses a multiple imputation framework to create
n
complete datasets (ndraws
). In each dataset, a plausible arsenic exposure category is assigned to each county based on the probability distribution derived in the previous step. The framework also imputes missing values in any user-specified covariates.Regression Analysis and Pooling: A regression model (typically a linear mixed-effects model) is fitted to each of the
n
imputed datasets. The results (coefficients and standard errors) from these models are then pooled using Rubin’s Rules to produce a single set of estimates that properly accounts for the uncertainty introduced by the imputation process.
Data Requirements
The perform_sensitivity_analysis()
function requires
three main data inputs.
1. USGS Arsenic Probability Data
(as_usgs_prob_csv
)
A CSV file containing the multinomial probability of arsenic levels from the USGS model for each raster grid cell.
-
Required Columns:
-
RFC3_C1v2
,RFC3_C2v2
,RFC3_C3v2
: Numeric probabilities for the three arsenic categories. These must sum to 1 for each row. -
Wells_2010
: Numeric population of well users for each raster cell. -
GEOID10
: A 5-digit character or numeric county identifier (FIPS code).
-
2. EPA Arsenic Data (as_epa_prob_csv
)
A CSV file containing the EPA model data.
-
Required Columns:
-
EPA_AS_meanlog
: Numeric mean (on a log scale) of the arsenic concentration. -
PWELL_private_pct
: Numeric percentage of the population on private wells (from 0 to 100). -
GEOID10
: A 5-digit county identifier (FIPS code) to link with the USGS data.
-
3. Birth Data (birth_data
)
An R data.frame
containing individual-level health
outcome and covariate data.
-
Required Columns:
-
FIPS
: A 5-digit county identifier to link with the arsenic data. - A column for each health outcome (e.g.,
BWT
for birth weight). - Columns for all covariates included in the regression formula (e.g.,
MAGE_R
,smoke
,MRSTATE
).
-
A Complete Walkthrough
Let’s walk through a full example.
Step 2: Generate Example Data
For this vignette to be self-contained, we’ll first define and then use a helper function to generate dummy data files. In a real analysis, you would load your own data.
Step 3: Define Analysis Parameters
Next, define all the parameters for the analysis, including the regression formula.
# Use a small number for the vignette; recommend >= 20 for real analysis
ndraws <- 5
targets <- c("BWT", "OEGEST") # Birth Weight and Gestational Age
vars_to_impute <- c("MAGE_R", "smoke")
output_dir <- "analysis_results"
as_level_col <- "AsLevel"
# Define the regression formula
# This is a complex formula provided as an example. It models the outcome as a
# function of the imputed Arsenic Level, with adjustments for various
# covariates. It includes a random effect for county (FIPS) nested within state
# (MRSTATE) to account for spatial clustering.
regression_formula <- paste0(
"~ as.factor(", as_level_col, ") + ",
"rms::rcs(MAGE_R, parms = 3) + ",
"as.factor(MRACEHISP_F) + ",
"MEDUC_2 + MEDUC_3 + MEDUC_4 + ",
"as.factor(RUCC_F) + DMAR_1 + smoke + pm + ",
"(1 | MRSTATE / FIPS)"
)
Step 4: Run the Analysis
With all parameters defined, call the main function
perform_sensitivity_analysis()
.
# This chunk may take a moment to run
results <- perform_sensitivity_analysis(
birth_data = dummy_files$births,
as_usgs_prob_csv = dummy_files$usgs,
as_epa_prob_csv = dummy_files$epa,
ndraws = ndraws,
regression_formula = regression_formula,
impute_vars = vars_to_impute,
output_dir = output_dir,
targets = targets
)
#> All 5 imputed datasets validated successfully
#> Processing target variable: BWT
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
#> Processing target variable: OEGEST
Step 5: Interpret the Results
The results
object is a list where each element is a
mipo
object from the mice
package. To view a
publication-ready table of the pooled results, use
summary()
.
Let’s examine the results for Birth Weight (BWT
).
bwt_summary <- results$BWT
knitr::kable(
bwt_summary,
caption = "Pooled Regression Results for Birth Weight (BWT)"
)
term | q.mi | se.mi | statistic | conf.low | conf.high | p.value |
---|---|---|---|---|---|---|
As5-10 | 3.398037 | 26.13952 | 0.1299962 | -47.92836 | 54.72443 | 0.8966090 |
As10+ | 18.354884 | 38.26601 | 0.4796654 | -64.09919 | 100.80895 | 0.6392332 |
The summary table includes: * term: The regression model term. * estimate: The pooled point estimate (e.g., the mean difference in birth weight). * std.error: The pooled standard error. * statistic: The t-statistic. * df: The degrees of freedom. * p.value: The p-value for the term.
From this table, you can assess the magnitude, direction, and statistical significance of the association between different levels of arsenic exposure and the health outcome, while accounting for uncertainty from both exposure and covariate imputation.
Step 6: Cleanup
Finally, remove the dummy data directory created for this example.
unlink("example_data", recursive = TRUE)
References
Bulka, C. M., et al. (2022). Arsenic in private well water and birth outcomes in the United States. Environment International, 163, 107176.
Lombard, M. A., et al. (2021). Machine learning models of arsenic in private wells throughout the conterminous United States as a tool for exposure assessment in human health studies. Environmental science & technology, 55(8), 5012-5023.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67.