An R package for modeling and estimating arsenic exposure from groundwater, based on epidemiological studies and existing concentration models.
Principal Investigator: Dr. Matthew O. Gribble [matt.gribble@ucsf.edu]
Co-Investigators: Dr. Sayantan Majumdar [sayantan.majumdar@dri.edu], Dr. Ryan G. Smith [ryan.g.smith@colostate.edu]
Statement of Need
gwArsenicR is an R package that provides a streamlined workflow for epidemiologists and public health researchers to estimate arsenic exposure from private and public well water and to assess its association with various health outcomes. The package implements a sophisticated statistical approach that combines geospatial arsenic prediction models with a multiple imputation framework to account for exposure uncertainty. The core functionality integrates predicted arsenic concentration probabilities from U.S. Geological Survey (USGS) models for private wells with U.S. Environmental Protection Agency (EPA) data for public water systems. It then uses a weighting scheme based on the proportion of the population served by each water source to create a unified, county-level exposure probability distribution. By encapsulating these complex methodologies into a single, user-friendly function, gwArsenicR makes this type of analysis more accessible, reproducible, and standardized.
Existing methods from studies like Bulka et al. (2022) and Lombard et al. (2021) provide robust statistical frameworks but require complex implementation involving:
- Integration of multiple probabilistic models (USGS and EPA)
- Multiple imputation techniques for uncertain exposures
- Mixed-effects regression with proper pooling of results
gwArsenicR solves this by packaging these sophisticated methods into a single, user-friendly function. This enables researchers to:
- 🔬 Focus on science, not statistical programming
- 📊 Ensure reproducibility across studies
- ⚡ Accelerate research on arsenic health effects
- 🎯 Apply best practices for uncertainty quantification
By democratizing access to these methods, gwArsenicR promotes more rigorous and comparable arsenic exposure research.
Installation
System Dependencies: Pandoc
To build the package vignettes, which provide detailed documentation and examples, you need to have Pandoc installed. RStudio bundles a version of Pandoc, so if you are an RStudio user, you may not need to install it separately.
If you do not have Pandoc, you can download and install it from the official website: pandoc.org/installing.html.
R Package Installation
Prerequisites: R (≥ 4.1.0) and Pandoc for building vignettes.
Install from GitHub:
# Install devtools if not already installed
if (!require("devtools")) install.packages("devtools")
# Install gwArsenicR
devtools::install_github("montimaj/gwArsenicR", build_vignettes = TRUE)
# Load the package
library(gwArsenicR)
For development and testing (includes coverage tools):
# Install additional development dependencies
install.packages(c("devtools", "testthat", "covr", "DT", "htmltools"))
# Install gwArsenicR
devtools::install_github("montimaj/gwArsenicR", build_vignettes = TRUE)
Verify installation:
# Check package version
packageVersion("gwArsenicR")
# View help
?perform_sensitivity_analysis
Package Documentation Website:
The complete package documentation, including detailed function references, vignettes, and examples, is available at: https://montimaj.github.io/gwArsenicR
This website is automatically generated and updated whenever changes are made to the main branch, ensuring you always have access to the latest documentation.
Modifying the R Package
If you want to customize this package, run the following in R after modifying the code.
# Set the working directory to your package directory
setwd("path/to/gwArsenicR")
# Load devtools
library(devtools)
# Generate documentation
document()
# Build the package
build()
# Install the package
install()
Quick Start
library(gwArsenicR)
# Run sensitivity analysis with your data
results <- perform_sensitivity_analysis(
ndraws = 10,
as_usgs_prob_csv = "path/to/usgs_data.csv",
as_epa_prob_csv = "path/to/epa_data.csv",
birth_data_txt = "path/to/birth_data.txt",
regression_formula = "~ AsLevel + MAGE_R + rural + (1|FIPS)",
output_dir = "results/",
targets = c("OEGEST", "BWT")
)
# View results
print(results)
For detailed examples with dummy data, see the vignette.
Package Structure
The gwArsenicR package follows the standard structure for R packages:
gwArsenicR/
├── R/
│ └── gwArsenic.R # Main exported function and all package code
├── man/
│ ├── perform_sensitivity_analysis.Rd
│ └── [other function documentation]
├── tests/
│ ├── testthat/
│ │ ├── helper-create_dummy_data.R # Test helper functions
│ │ └── test-gwArsenic.R # Comprehensive test suite
│ └── testthat.R # Test runner
├── scripts/
│ └── test_coverage_local.R # Local coverage testing script
├── vignettes/
│ └── gwArsenicR-vignette.Rmd # Package tutorial and examples
├── .github/
│ └── workflows/
│ └── r.yml # CI/CD pipeline with coverage
├── docs/ # Generated by pkgdown (auto-created)
├── coverage/ # Coverage reports (auto-created)
├── _pkgdown.yml # Website configuration
├── CONTRIBUTING.md # Contribution guidelines
├── DESCRIPTION # Package metadata and dependencies
├── NAMESPACE # Package exports and imports
├── LICENSE # Apache 2.0 license file
├── LICENSE.md # License details
├── NEWS.md # Change log and version history
├── README.md # README file
└── .gitignore # Git ignore rules
R/ Directory Structure
The package uses a monolithic architecture where all functionality is contained in a single, well-organized R file:
- gwArsenic.R: Contains all package functions, including:
  - perform_sensitivity_analysis(): Main exported function
  - Data loading and processing functions
  - Multiple imputation functions using MICE
  - Mixed-effects regression analysis functions
  - Utility and validation functions
This approach provides several advantages:
- Simplified maintenance: All related code in one location
- Easier debugging: Complete workflow visible in a single file
- Reduced complexity: No cross-file dependencies
- Better performance: Reduced package loading overhead
Testing Infrastructure
The package includes comprehensive testing and coverage tools:
- tests/testthat/test-gwArsenic.R: Main test suite with >95% code coverage
- tests/testthat/helper-create_dummy_data.R: Creates synthetic data for testing
- tests/testthat.R: Standard R package test runner
- scripts/test_coverage_local.R: Local coverage analysis script
- .github/workflows/r.yml: Automated CI/CD with coverage reporting
Key Internal Functions
While users primarily interact with perform_sensitivity_analysis(), the package includes several internal functions organized by functionality:
Data Loading (data-loading.R):
- load_and_process_arsenic_data(): Orchestrates USGS and EPA data loading
- load_usgs_data(): Loads USGS probability data
- convert_epa_to_multinomial(): Converts EPA lognormal to multinomial probabilities
- create_weighted_prob_matrix(): Creates weighted combined probability matrix
- load_and_process_birth_data(): Loads and processes birth/health outcome data
Multiple Imputation (imputation.R):
- impute_arsenic_exposure(): Creates multiple imputed datasets with arsenic exposure
- impute_additional_variables(): Imputes missing covariates using MICE
- validate_imputed_datasets(): Validates imputation results
Regression Analysis (regression.R):
- regression_analysis(): Performs mixed-effects regression on imputed data
- pool_estimates_by_term(): Pools regression estimates using Rubin’s Rules
- pool_single_estimate(): Applies Rubin’s Rules to individual parameters
Utilities:
- validate_all_inputs(): Input validation and error checking
- format_geographic_ids(): Formats FIPS codes and geographic identifiers
- Additional helper functions for data validation and formatting
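To illustrate what converting a lognormal distribution to multinomial probabilities involves, here is a minimal sketch. It is not the package's implementation: the function name lognormal_to_multinomial(), the sdlog argument, and the cut points of 5 and 10 µg/L (inferred from the As5-10 and As10+ terms in the test output) are all assumptions made for illustration.

```r
# Hypothetical sketch: turn lognormal parameters into probabilities for three
# arsenic concentration categories. Cut points (5 and 10 ug/L) and the sdlog
# argument are assumptions, not confirmed package behavior.
lognormal_to_multinomial <- function(meanlog, sdlog, cut_low = 5, cut_high = 10) {
  # Cumulative probability at each cut point, padded with 0 and 1
  cdf <- c(0, plnorm(c(cut_low, cut_high), meanlog = meanlog, sdlog = sdlog), 1)
  # Differences between consecutive CDF values give the category probabilities
  probs <- diff(cdf)
  names(probs) <- c("As<5", "As5-10", "As10+")
  probs
}

# Example with a median concentration of 3 ug/L (illustrative values only)
probs <- lognormal_to_multinomial(meanlog = log(3), sdlog = 1)
print(round(probs, 3))
```

The three probabilities always sum to 1, so each row can be used directly as a multinomial sampling distribution.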
Core Functions
The package’s primary functionality is exposed through a single main function:
perform_sensitivity_analysis(): This is the main exported function of the package. It orchestrates the entire workflow from data loading through final analysis:
- Data Loading: Loads and processes USGS probability data, EPA lognormal parameters, and birth/health outcome data
- Probability Integration: Combines USGS and EPA models using weighted averages based on private well usage
- Multiple Imputation: Creates multiple datasets with probabilistically assigned arsenic exposure levels
- Covariate Imputation: Optionally imputes missing values in additional covariates using MICE
- Statistical Analysis: Fits mixed-effects regression models to assess exposure-outcome relationships
- Results Pooling: Applies Rubin’s Rules to pool results across imputed datasets
- Output Generation: Saves results and returns structured analysis output
The function is designed to handle the complex statistical methodology while providing a simple, user-friendly interface that requires minimal statistical programming expertise.
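The probability integration and multiple imputation steps above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's internal code: the county identifiers, probability values, private-well fractions, and category labels below are all made up.

```r
set.seed(42)  # fixed seed so the draws are reproducible

# Hypothetical category probabilities for two counties (each row sums to 1)
usgs_prob <- matrix(c(0.80, 0.15, 0.05,
                      0.50, 0.30, 0.20),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("01001", "01003"),
                                    c("As<5", "As5-10", "As10+")))
epa_prob <- matrix(c(0.95, 0.04, 0.01,
                     0.90, 0.07, 0.03),
                   nrow = 2, byrow = TRUE, dimnames = dimnames(usgs_prob))

# Assumed fraction of each county's population served by private wells
private_frac <- c(0.25, 0.60)

# Weighted average: private wells follow the USGS model, public systems the
# EPA model. The vector recycles down the rows, so row i is scaled by
# private_frac[i].
combined <- private_frac * usgs_prob + (1 - private_frac) * epa_prob

# Draw one exposure category per county for each of ndraws imputations
ndraws <- 5
draws <- apply(combined, 1, function(p) {
  sample(colnames(combined), size = ndraws, replace = TRUE, prob = p)
})

print(combined)
print(draws)  # ndraws x counties matrix of category labels
```

Each row of draws corresponds to one imputed dataset. A regression model would then be fitted to each dataset separately before the results are pooled.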
Key Features
- ✅ Automated workflow: Single function handles entire analysis pipeline
- ✅ Multiple imputation: Accounts for uncertainty in arsenic exposure assignment
- ✅ Flexible modeling: Supports custom regression formulas and multiple outcomes
- ✅ Robust statistics: Implements Rubin’s Rules for proper inference
- ✅ Data integration: Combines USGS and EPA models with population weighting
- ✅ Missing data handling: Optional MICE imputation for covariates
- ✅ Reproducible results: Seed control and comprehensive output saving
- ✅ Extensive testing: >95% code coverage with automated CI/CD
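For readers unfamiliar with Rubin's Rules, the following sketch pools one coefficient across imputed datasets. It is a generic textbook version, not the package's pool_single_estimate(): the function name pool_rubin() is hypothetical, and the classic Rubin (1987) degrees of freedom are assumed. The output column names mirror those shown in the package's test output.

```r
# Hypothetical sketch of Rubin's Rules for a single regression coefficient.
# estimates: per-imputation point estimates; std_errors: their standard errors.
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  stopifnot(m >= 2, length(std_errors) == m)
  q_bar <- mean(estimates)              # pooled point estimate
  u_bar <- mean(std_errors^2)           # within-imputation variance
  b <- var(estimates)                   # between-imputation variance
  total_var <- u_bar + (1 + 1 / m) * b  # total variance
  # Classic Rubin (1987) degrees of freedom (assumed here); Inf when b is 0
  df <- (m - 1) * (1 + u_bar / ((1 + 1 / m) * b))^2
  se <- sqrt(total_var)
  stat <- q_bar / se
  data.frame(
    q.mi = q_bar,
    se.mi = se,
    statistic = stat,
    conf.low = q_bar - qt(0.975, df) * se,
    conf.high = q_bar + qt(0.975, df) * se,
    p.value = 2 * pt(-abs(stat), df)
  )
}

# Illustrative values only
pooled <- pool_rubin(estimates = c(1, 2, 3), std_errors = c(1, 1, 1))
print(pooled)
```

The total variance adds the between-imputation spread to the average sampling variance, which is how uncertainty in the exposure assignment widens the final confidence interval.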
Documentation
Online Documentation
📖 Complete documentation website: https://montimaj.github.io/gwArsenicR
The documentation website includes:
- Function Reference: Detailed documentation for all package functions
- Vignettes: Step-by-step tutorials with working examples
- Getting Started Guide: Quick introduction to package usage
- Methodology: Statistical background and implementation details
- FAQ: Common questions and troubleshooting
Local Documentation
# View package help
help(package = "gwArsenicR")
# View main function documentation
?perform_sensitivity_analysis
# Browse vignettes
browseVignettes("gwArsenicR")
Data Requirements
The package requires three input files:
- USGS Probability Data (as_usgs_prob_csv): CSV with arsenic concentration probabilities and geographic identifiers
- EPA Parameters (as_epa_prob_csv): CSV with EPA lognormal distribution parameters
- Health Outcome Data (birth_data_txt): Text file with health outcomes and county identifiers
Expected columns:
- USGS data: GEOID10, RFC3_C1v2, RFC3_C2v2, RFC3_C3v2, Wells_2010
- EPA data: EPA_AS_meanlog, PWELL_private_pct
- Health data: FIPS, outcome variables (e.g., BWT, OEGEST), covariates
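Before running an analysis, it can help to confirm that each input file contains the expected columns. The sketch below uses only the column names listed above; the helper missing_columns() and the example data frame are hypothetical and are not part of the package.

```r
# Required columns, taken from the list above
required_cols <- list(
  usgs = c("GEOID10", "RFC3_C1v2", "RFC3_C2v2", "RFC3_C3v2", "Wells_2010"),
  epa = c("EPA_AS_meanlog", "PWELL_private_pct"),
  health = c("FIPS")
)

# Hypothetical helper: return the required columns absent from a data frame
missing_columns <- function(df, required) {
  setdiff(required, names(df))
}

# Example: a USGS table that lacks the Wells_2010 column (made-up values)
usgs_example <- data.frame(
  GEOID10 = "01001",
  RFC3_C1v2 = 0.80,
  RFC3_C2v2 = 0.15,
  RFC3_C3v2 = 0.05
)

absent <- missing_columns(usgs_example, required_cols$usgs)
if (length(absent) > 0) {
  message("USGS data is missing: ", paste(absent, collapse = ", "))
}
```

In practice the data frame would come from reading your own CSV, for example with read.csv(), before the check is applied.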
Running Tests
The package includes comprehensive tests to ensure reliability and correctness. All tests use synthetic dummy data and run quickly for efficient development and CI/CD workflows.
Run all tests:
Run specific test file:
Run tests during development (faster - no package rebuild):
Run tests with clean output (suppress MICE warnings):
Check package integrity:
Interactive testing in R/RStudio:
# Install coverage dependencies if needed
if (!require("DT")) install.packages("DT")
if (!require("htmltools")) install.packages("htmltools")
# Load package for development
devtools::load_all()
# Run all tests
devtools::test()
# Run specific test with detailed output
testthat::test_file("tests/testthat/test-gwArsenic.R", reporter = "progress")
# Check test coverage (requires DT and htmltools)
system2("Rscript", "scripts/test_coverage_local.R")
Testing Configuration
Tests are configured for speed and reliability:
- Fast execution: Uses ndraws = 2 for quick multiple imputation
- Synthetic data: All tests use generated dummy data, no external dependencies
- Comprehensive coverage: Tests data loading, imputation, regression, and output generation
- Warning suppression: Expected MICE convergence warnings are suppressed for clean output
- Automatic cleanup: Temporary files are automatically removed after each test
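The synthetic-data and automatic-cleanup pattern described above might look roughly like the following base R sketch. It is not the package's helper-create_dummy_data.R: the function names, the tab-delimited file layout, and the value ranges are assumptions chosen for illustration.

```r
# Hypothetical helper: write a small synthetic health-outcome file and return
# its path. The tab-delimited layout is an assumption.
write_dummy_birth_data <- function(dir, n = 20) {
  dummy <- data.frame(
    FIPS = sample(c("01001", "01003"), n, replace = TRUE),
    BWT = round(rnorm(n, mean = 3300, sd = 500)),  # made-up birth weights
    OEGEST = sample(35:42, n, replace = TRUE)      # made-up gestational ages
  )
  path <- file.path(dir, "birth_data.txt")
  write.table(dummy, path, sep = "\t", row.names = FALSE, quote = FALSE)
  path
}

# Create the data in a temporary directory that is removed when the function
# exits, whether or not an error occurs.
run_with_temp_data <- function() {
  tmp_dir <- tempfile("gwArsenicR-test-")
  dir.create(tmp_dir)
  on.exit(unlink(tmp_dir, recursive = TRUE), add = TRUE)
  path <- write_dummy_birth_data(tmp_dir)
  # Read FIPS as character so leading zeros are preserved
  dat <- read.delim(path, colClasses = c(FIPS = "character"))
  list(n_rows = nrow(dat), dir = tmp_dir)
}

set.seed(1)
res <- run_with_temp_data()
print(res$n_rows)
```

Using on.exit() keeps the cleanup next to the setup, so the temporary directory does not linger if an assertion fails partway through.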
Expected Test Output
A successful test run will show:
✅ | F W S OK | Context
⠹ | 3 | Test perform_sensitivity_analysis with comprehensive coverage
--- Pooled Analysis Results ---
$BWT
term q.mi se.mi statistic conf.low conf.high p.value
1 As5-10 6.675316 32.49704 0.2054131 -60.42082 73.77145 0.8389941
2 As10+ 19.148237 32.77495 0.5842339 -45.79132 84.08780 0.5602382
$OEGEST
term q.mi se.mi statistic conf.low conf.high p.value
1 As5-10 3.618142e-02 0.1502064 0.2408779292 -0.2780558 0.3504186 0.8122120
2 As10+ -5.288651e-05 0.1693747 -0.0003122457 -0.3491999 0.3490941 0.9997534
-----------------------------
[1] "Checking results for target: BWT"
[1] "Checking results for target: OEGEST"
✅ | 189 | Test perform_sensitivity_analysis with comprehensive coverage [5.6s]
══ Results═══════════════════════════════════════════════════════════════════
Duration: 5.6 s
✅ [ FAIL 0 | WARN 0 | SKIP 0 | PASS 189 ]
Continuous Integration
The package includes automated testing via GitHub Actions:
- R CMD check: Validates package structure and dependencies
- Multiple R versions: Tests on R 4.1+ across different operating systems
- Dependency validation: Ensures all required packages are properly declared
- Documentation checks: Validates roxygen2 documentation completeness
Troubleshooting Tests
If tests fail with data.table errors:
If MICE warnings are excessive:
If tests time out:
# Check system resources and reduce ndraws in test files if needed
# Default test configuration uses ndraws = 2 for speed
# For very slow systems, you can temporarily modify test parameters:
# Edit tests/testthat/test-gwArsenic.R and change ndraws = 2 to ndraws = 1
# Or run tests with extended timeout
Rscript -e "options(testthat.default_timeout = 600); devtools::test()"
# Skip slow integration tests and run only unit tests
Rscript -e "devtools::test(filter = 'unit')"
Memory issues:
Testing Best Practices
When developing or modifying the package:
- Run tests frequently during development
- Add new tests for new functionality
- Use small datasets in tests for speed (ndraws = 2)
- Test edge cases such as missing data or unusual parameter values
- Validate statistical correctness of pooled results
Contributing
We welcome contributions! Please see our contribution guidelines for details.
Ways to contribute:
- 🐛 Report bugs via GitHub issues
- 💡 Suggest features or improvements
- 📖 Improve documentation
- 🧪 Add tests or examples
- 🔧 Submit pull requests
Development setup:
Troubleshooting
RStudio Installation Issues
Some users may encounter missing dependency errors when installing or using gwArsenicR in RStudio, even after successful installation. This typically manifests as errors like:
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
there is no package called 'R6'
Common missing packages:
- R6
- digest
- purrr
- vctrs
- glue
Solution: Install these core dependencies manually before installing gwArsenicR:
# Install missing dependencies
install.packages(c("R6", "digest", "purrr", "vctrs", "glue"))
# Then install gwArsenicR
devtools::install_github("montimaj/gwArsenicR")
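To see which of these dependencies are actually missing before reinstalling anything, a short check along the following lines may help. The helper find_missing_packages() is a hypothetical convenience function, not part of gwArsenicR.

```r
# Hypothetical helper: return the packages from pkgs that cannot be loaded
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

core_deps <- c("R6", "digest", "purrr", "vctrs", "glue")
missing_pkgs <- find_missing_packages(core_deps)

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install only what is missing
} else {
  message("All core dependencies are available.")
}
```

Installing only the missing packages avoids re-downloading dependencies that already work.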
Why this happens: This issue typically occurs when:
- RStudio’s package installation doesn’t properly resolve all transitive dependencies
- System-level R and RStudio R installations differ
- Package library paths are misconfigured
- Previous incomplete installations left corrupted package states
Alternative solutions:
- Restart R session and try again:
  # In RStudio: Session -> Restart R
  .rs.restartR()
- Update all packages first:
  update.packages(ask = FALSE)
- Install from a clean state:
  # Remove any partial installations
  remove.packages("gwArsenicR")
  # Clear package cache
  .libPaths() # Check library paths
  # Reinstall dependencies and package
  install.packages(c("devtools", "R6", "digest", "purrr", "vctrs", "glue"))
  devtools::install_github("montimaj/gwArsenicR")
- Check R version compatibility:
  # Ensure you're running R >= 4.1.0
  R.version.string
If issues persist, please report them with your R version and sessionInfo().
Citation
If you use gwArsenicR in your research, please cite:
Majumdar, S., Bartell, S. M., Lombard, M. A., Smith, R. G., & Gribble, M. O. (2025). gwArsenicR: An R package for modeling and estimating arsenic exposure from groundwater. Journal of Open Source Software (under review).
BibTeX:
@article{majumdar2025gwarsenic,
title={gwArsenicR: An R package for modeling and estimating arsenic exposure from groundwater},
author={Majumdar, Sayantan and Bartell, Scott M and Lombard, Melissa A and Smith, Ryan G and Gribble, Matthew O},
journal={Journal of Open Source Software},
year={2025},
note={Under review},
url={https://github.com/montimaj/gwArsenicR}
}
DOI: Coming soon upon publication
Acknowledgments
This work was supported by the National Heart, Lung, and Blood Institute (R21HL159574) and funding from the United States Geological Survey’s John Wesley Powell Center for Analysis and Synthesis.
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Version: 0.1.0 | Status: Under active development
References
Bulka, C. M., Bryan, M. S., Lombard, M. A., Bartell, S. M., Jones, D. K., Bradley, P. M., … & Argos, M. (2022). Arsenic in private well water and birth outcomes in the United States. Environment International, 163, 107176. https://doi.org/10.1016/j.envint.2022.107176
Lombard, M. A., Bryan, M. S., Jones, D. K., Bulka, C., Bradley, P. M., Backer, L. C., … & Ayotte, J. D. (2021). Machine learning models of arsenic in private wells throughout the conterminous United States as a tool for exposure assessment in human health studies. Environmental Science & Technology, 55(8), 5012-5023. https://doi.org/10.1021/acs.est.0c05239
Wickham, H., & Bryan, J. (2023). R Packages (2nd ed.). O’Reilly Media. https://r-pkgs.org/