I’ve been involved with a couple of consulting projects that compared groundwater quality indicators (i.e., major cation and anion concentrations) between site conditions and background in the southern San Joaquin Valley of California. For these projects, and for many others, the question of what constitutes background, and how that definition changes over time and in response to anthropogenic inputs, geology, and other factors, is paramount.

Groundwater in the San Joaquin Valley has been impacted by both nitrate and salts stemming from fertilizer application, animal waste impoundment lagoons, septic systems, and irrigation water return flows, among other anthropogenic inputs. For any site investigation, information on the spatial distribution of, and historic trends in, groundwater quality helps establish background conditions prior to any impact from site-specific activities. The California State Water Resources Control Board’s Groundwater Ambient Monitoring and Assessment (GAMA) database warehouses groundwater quality data from water supply wells as well as environmental monitoring wells across the state. Historic monitoring data for major and minor ions, organic contaminants, and other parameters are stored as tab-delimited text files, organized by county, with sample locations indicated by well names and longitude and latitude values. Screened intervals for the wells are not provided, however.

The objective of the particular exercise described here was to expand the scale of concern beyond the southern San Joaquin Valley to the entire state, exploring spatial and temporal trends in groundwater quality, specifically nitrate, across broad regions. The approach is statistical, relying on principal component analysis (PCA) to identify regional trends in the data, followed by spatial correlation analysis.


The GAMA database, downloaded as individual files for each of the 58 counties in California, was processed and analyzed with a script written in R. I chose R for this particular exercise in part because spatial analysis packages are readily available as add-ins (to keep things simple, I used the ncf package). In summary, the script does the following:

  1. Import and parse the text files containing well location data, sampling dates, and analytical results for each county, filtering the data to select only results for calcium, chloride, potassium, magnesium, nitrate, sulfate, and sodium. While alkalinity is often reported among the major analytes, it is missing from enough samples in the database that including it would significantly reduce the number of complete data sets, so it was excluded. However, as discussed below, it is possible to crudely estimate alkalinity afterward from charge balance considerations.
  2. Separately, add manganese as a potential indicator of the oxidation-reduction conditions when and where a given sample was collected.
  3. Write the analyte-filtered data to a single file, so that Step #1 does not have to be repeated for subsequent analyses (and so that the data are read much more quickly).
  4. Generate histograms of the distributions of the individual analytes for a quick visual assessment.
  5. Conduct PCA and write the scores to output files to be post-processed or analyzed (via GIS, etc.).
  6. Conduct spatial analyses of the time-averaged data, consisting of (1) plotting correlograms and cross-correlograms of select PCA scores, and (2) plotting local indicators of spatial association, or LISA, for the spatial distributions of the PCA scores.
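The import-and-filter step (Step #1) can be illustrated with a short sketch. This is a Python illustration written for this post, not an excerpt from the R script, and it rests on assumptions: the column names (`well`, `date`, `analyte`, `result`) and the analyte codes are hypothetical placeholders, and the real GAMA files may use different headers and labels.

```python
import csv
import io

# Hypothetical analyte codes; the real GAMA files may label these differently.
ANALYTES = ("CA", "CL", "K", "MG", "NO3", "SO4", "NA")

def complete_samples(tsv_file, analytes=ANALYTES):
    """Pivot long-format rows into one record per (well, date) and keep
    only the records that report every analyte in `analytes`."""
    samples = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if row["analyte"] not in analytes:
            continue  # analyte filter
        try:
            value = float(row["result"])
        except ValueError:
            continue  # skip non-numeric results
        if value <= 0:
            continue  # non-positive values cannot be log-transformed later
        key = (row["well"], row["date"])
        # If an analyte is reported twice for a sample, the last value wins.
        samples.setdefault(key, {})[row["analyte"]] = value
    return {k: v for k, v in samples.items() if len(v) == len(analytes)}

# Tiny in-memory example using only two analytes.
demo = io.StringIO(
    "well\tdate\tanalyte\tresult\n"
    "W1\t2001-01-01\tCA\t50\n"
    "W1\t2001-01-01\tCL\t20\n"
    "W2\t2001-01-01\tCA\t30\n"
)
print(complete_samples(demo, analytes=("CA", "CL")))
```

Requiring a complete set of analytes per sample is what makes a frequently missing parameter such as alkalinity costly to include.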

A full listing of the code can be accessed via my GitHub repository.


Over 115,600 groundwater samples collected from approximately 25,000 individual wells across all of California were included in the analysis for the major ions (i.e., exclusive of manganese). The probability distributions of the log-transformed concentrations of each of the major ions, normalized by chloride to adjust for evapotranspiration effects, indicate some degree of compositional variability:
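To see why dividing by chloride helps, consider the following minimal Python sketch (not taken from the R script). It assumes a base-10 logarithm and that both concentrations share the same units; the function name is hypothetical. If evapotranspiration concentrates every dissolved ion by the same factor, the ratio to chloride, and therefore its logarithm, does not change.

```python
import numpy as np

def log_chloride_ratio(conc, chloride):
    """log10 of an analyte concentration divided by the chloride
    concentration of the same sample (same units for both)."""
    conc = np.asarray(conc, dtype=float)
    chloride = np.asarray(chloride, dtype=float)
    return np.log10(conc / chloride)

# Two samples: the second is the first concentrated threefold.
sodium = [10.0, 30.0]
chloride = [5.0, 15.0]
print(log_chloride_ratio(sodium, chloride))  # both entries are equal
```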


A graphical depiction of the correlation matrix of the log-transformed concentrations of each of the analytes with respect to one another indicates some degree of correlation between all with the exception of nitrate:


PCA provides some insight into the variance in the major ion concentration data. An expanded correlation matrix that includes the principal component scores in addition to the log major ion concentrations indicates that much of the apparent correlation among the non-nitrate analyte concentrations is attributable to a single factor (PC#1) that accounts for approximately 55 percent of the variance in the data. A second component, which accounts for another 22 percent of the variance, is largely associated with nitrate itself:
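For readers unfamiliar with how the variance percentages and scores arise, here is a minimal PCA sketch in Python with NumPy. It is not the R script's implementation. It assumes the analysis is run on standardized (zero-mean, unit-variance) log concentrations, i.e. on the correlation matrix; whether the original analysis scaled the variables this way is an assumption.

```python
import numpy as np

def pca_scores(log_conc):
    """PCA of a samples-by-analytes array via the correlation matrix.

    Returns (scores, loadings, explained), where explained[k] is the
    fraction of total variance carried by component k."""
    X = np.asarray(log_conc, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize columns
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                 # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                              # one row per sample
    explained = eigvals / eigvals.sum()
    return scores, eigvecs, explained

# Synthetic example: four variables that share one common factor.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
X = common + 0.3 * rng.normal(size=(200, 4))
scores, loadings, explained = pca_scores(X)
print(np.round(explained, 3))
```

The sign of each loading vector is arbitrary, so scores from different software may be mirrored without changing the interpretation.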


The first five principal components represent approximately 97 percent of the total variance in the data. The next step in the analysis is to review the spatial distributions of these principal component scores, which can be accomplished by plotting the data on a map of the state. Rather than do this in GIS, I summarized the results using Power Map, a new geographic data visualization feature included in Microsoft Excel 365, as an exercise:

Scores for Principal Components 1-3 (temporal averages in each well). 
Scores for Principal Components 4-5 (temporal averages in each well).

These visualizations suggest a degree of spatial correlation among the principal component scores, particularly for PCs #1 and #2. Correlograms constructed for the principal component scores do indicate some spatial correlation among the time-averaged scores in individual wells, extending out to a distance of tens of kilometers.

Correlogram for Principal Component #1 scores across a subset of counties representing the San Joaquin Valley.
Correlogram for Principal Component #2 scores across a subset of counties representing the San Joaquin Valley.
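A correlogram of this kind reports a spatial autocorrelation statistic, commonly Moran's I, for pairs of wells grouped by separation distance. The sketch below is a plain NumPy illustration of that idea, not the ncf implementation used in the R script. It assumes planar coordinates (e.g. projected kilometers rather than raw longitude and latitude), builds the full pairwise distance matrix (so it only suits modest numbers of wells), and omits the resampling-based significance tests.

```python
import numpy as np

def moran_correlogram(x, y, z, increment, n_classes):
    """Moran's I of z for pairs of points binned by separation distance.

    Returns (class_centers, moran_i) for the non-empty distance classes."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    dev = z - z.mean()
    dist = np.hypot(x[:, None] - x[None, :], y[:, None] - y[None, :])
    cross = np.outer(dev, dev)
    upper = np.triu_indices(len(z), k=1)   # count each pair once
    dist, cross = dist[upper], cross[upper]
    variance = np.sum(dev ** 2) / len(z)
    centers, morans = [], []
    for k in range(n_classes):
        in_class = (dist >= k * increment) & (dist < (k + 1) * increment)
        if in_class.any():
            centers.append((k + 0.5) * increment)
            morans.append(cross[in_class].mean() / variance)
    return np.array(centers), np.array(morans)

# A smooth gradient along a transect: nearby points are alike,
# points at opposite ends are not.
x = np.arange(20.0)
centers, morans = moran_correlogram(x, np.zeros(20), x, 1.0, 20)
print(np.round(morans[:3], 2), np.round(morans[-1], 2))
```

Values near zero at a given distance indicate that scores in wells that far apart are essentially unrelated, which is how a correlation range of tens of kilometers would be read off such a plot.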

Additional analyses (cross-correlograms, and the results of LISA evaluations) are not shown here, for brevity, but are computed by the R script.
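For context on what a LISA evaluation computes, the following Python sketch calculates a local Moran's I for each well from its neighbors within a fixed radius. It is an illustration only and not the ncf routine from the R script: it assumes planar coordinates, equal weights for all neighbors inside the radius, and no permutation test for significance.

```python
import numpy as np

def local_moran(x, y, z, radius):
    """Local Moran's I per point, using the mean deviation of all other
    points within `radius` as the spatial lag. Points with no neighbors
    get NaN."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    dev = z - z.mean()
    m2 = np.mean(dev ** 2)
    dist = np.hypot(x[:, None] - x[None, :], y[:, None] - y[None, :])
    w = (dist <= radius).astype(float)
    np.fill_diagonal(w, 0.0)               # a point is not its own neighbor
    n_neigh = w.sum(axis=1)
    lag = np.full(len(z), np.nan)
    has_neigh = n_neigh > 0
    lag[has_neigh] = (w @ dev)[has_neigh] / n_neigh[has_neigh]
    return dev * lag / m2

# Two tight clusters with opposite values.
x = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
z = [5.0, 5.0, 5.0, -5.0, -5.0, -5.0]
print(local_moran(x, np.zeros(6), z, radius=2.5))
```

Positive values flag wells surrounded by similar scores (high among high, or low among low), while negative values flag local outliers.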

Including manganese as an additional analyte for PCA yields a clue that oxidation-reduction conditions may play an important role in the distribution of nitrate in groundwater. This is evident, for example, in the revised correlation matrix structure when Mn is included:

This composite correlation matrix is similar to the one shown above, except that the addition of Mn reveals a relationship among NO3, Mn, and Principal Component #2 scores that generally does not involve the other analytes. It should be noted that including Mn reduced the number of groundwater samples with reported results for all eight analytes to just under 63,600.

Here, a potential influencing mechanism on nitrate concentrations as captured by Principal Component #2 scores is suggested: the role of localized reducing groundwater chemistry which would be associated with increasing manganese concentrations (via reduction and dissolution of Mn-oxide minerals) and at the same time diminished nitrate concentrations via denitrification. PCA itself provides a useful means for illustrating this explanation graphically: while a scatter plot relating log-transformed manganese and nitrate exhibits much scatter that obscures the negative correlation between the two, both analytes exhibit much more obvious and contrasting correlations with Principal Component #2 scores …

Mn-NO3 summary

It is worth noting that the indication that nitrate concentrations are appreciably influenced by denitrification along the eastern side of the San Joaquin Valley is consistent with other studies (e.g., Burow, KR, BC Jurgens, K. Belitz, and NM Dubrovsky, 2012; Assessment of regional change in nitrate concentrations in groundwater in the Central Valley, California, USA, 1950s–2000s, Environ Earth Sci; DOI 10.1007/s12665-012-2082-4).

In conclusion,

  • Principal component analysis is capable of elucidating broad regional trends in major ion data that are visually discernible and exhibit some spatial correlation.
  • For nitrate in particular, there is an association with redox conditions, consistent with other studies; PCA makes this association clearer.

As an aside, the partial pressure of CO2 was calculated for a subset of 5,000 samples with relatively high Principal Component #2 scores by imposing charge balance and equilibrium with calcite to solve for pH and bicarbonate concentration (using PHREEQC). The most elevated CO2 values appear to be associated with the west side of the valley, where elevated concentrations of other constituents occur vis-a-vis Principal Component #1. Thus, a clear association of CO2 with a source of nitrate on a regional scale is not apparent.
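The crude charge-balance estimate of alkalinity mentioned earlier can be sketched in a few lines of Python. This is a simplified stand-in and not the PHREEQC calculation: it assigns the entire cation excess to bicarbonate, ignores carbonate, H+, OH-, minor ions, and analytical error, and assumes concentrations in mg/L with nitrate reported as NO3 rather than as N. The function name is hypothetical.

```python
# Molar mass (g/mol) and signed charge of each measured major ion.
IONS = {
    "Ca": (40.078, 2), "Mg": (24.305, 2), "Na": (22.990, 1), "K": (39.098, 1),
    "Cl": (35.453, -1), "SO4": (96.06, -2), "NO3": (62.004, -1),
}
HCO3_MOLAR_MASS = 61.017  # g/mol

def bicarbonate_from_charge_balance(mg_per_l):
    """Estimate bicarbonate (mg/L) as the anion deficit in meq/L.

    `mg_per_l` maps every key in IONS to a concentration in mg/L.
    A negative balance (apparent anion excess) is reported as zero."""
    net_meq = sum(mg_per_l[ion] / mass * charge
                  for ion, (mass, charge) in IONS.items())
    return max(net_meq, 0.0) * HCO3_MOLAR_MASS

sample = {"Ca": 40.078, "Mg": 0.0, "Na": 22.990, "K": 0.0,
          "Cl": 35.453, "SO4": 0.0, "NO3": 0.0}
print(bicarbonate_from_charge_balance(sample))  # 2 meq/L of cation excess
```

Because every analytical error in the measured ions ends up in the estimate, values obtained this way are best treated as rough indicators rather than measured alkalinity.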