SIAM News Blog

Artificial Intelligence Maps Environmental Variables by Their Relationships with the Landscape

By Charlie Kirkwood

Earth scientists often wish to know the value of environmental properties at all points in space — that is, they want to view as continuous maps properties that one can realistically only measure at a finite number of locations due to sampling costs and other constraints. Attempts to solve this type of problem gave rise to the field of geostatistics, which originated with South African mining engineer Danie G. Krige’s modeling of gold grades in the 1940s and 1950s and was subsequently developed by French mathematician Georges Matheron in the 1960s. The eponymous family of “kriging” interpolation methods has since become ubiquitous in applications that require interpolation of point-sampled observations across geographic space. Nevertheless, such methods are not without limitations. Here we briefly discuss these restrictions and explore the ways in which recent artificial intelligence (AI) developments are improving our ability to incorporate more sources of information into environmental modeling and mapping procedures [4-7].

Ordinary kriging equates to Gaussian process regression in two or three spatial dimensions, such that the interpolation’s mean is a smooth function (or surface) through space that does not account for any information besides the values of the observations themselves. It relies on an assumption of stationarity — that the spatial autocorrelation properties of the observations will remain consistent throughout the study area — and often further assumes that these autocorrelation properties are consistent in all directions (i.e., isotropy). Regression kriging upholds the same assumptions about spatial autocorrelation but expands the setup to include a non-spatial regression component in the model; predictions are also informed by the values of spatially continuous covariates at each location, which may include satellite imagery or other types of remotely sensed data (e.g., terrain elevation). In this way, the interpolated maps can capture some amount of structure from the covariates. However, the ability to learn complex relationships is limited by the regression formulation, which is typically linear (i.e., a weighted sum of covariate values plus an intercept). We could therefore regard traditional kriging methods as making overly restrictive assumptions when mapping environmental variables at large scales in the wild.
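The correspondence between kriging and Gaussian process regression can be made concrete in a few lines of numpy. The following sketch interpolates scattered point observations onto new locations; the squared-exponential kernel, length scale, and nugget value are illustrative choices on our part, not taken from any particular geostatistical workflow.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 2D locations."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_interpolate(obs_xy, obs_val, pred_xy, length_scale=1.0, nugget=1e-8):
    """Predictive mean and variance at pred_xy, given point observations."""
    K = sq_exp_kernel(obs_xy, obs_xy, length_scale) + nugget * np.eye(len(obs_xy))
    Ks = sq_exp_kernel(pred_xy, obs_xy, length_scale)
    Kss = sq_exp_kernel(pred_xy, pred_xy, length_scale)
    alpha = np.linalg.solve(K, obs_val)
    mean = Ks @ alpha
    var = np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T))
    return mean, var

# Five scattered observations, interpolated onto two query locations.
obs_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
obs_val = np.array([1.0, 2.0, 2.0, 3.0, 2.0])
mean, var = gp_interpolate(obs_xy, obs_val, np.array([[0.5, 0.5], [0.2, 0.8]]))
```

Note how the predictive variance collapses at observed locations and grows away from them — the uncertainty behavior that kriging is prized for, but obtained here without any covariate information.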

While kriging for spatial interpolation rose to prominence over the second half of the 20th century [1], the last decade has seen an explosion of AI approaches for all kinds of problems — particularly in the form of machine learning via multi-layered neural networks (NNs) (hence the term “deep learning”). The allure of deep learning lies in its ability to alleviate the need for manual feature engineering [8]. In the context of mapping environmental variables, feature engineering tends to involve the use of terrain analysis software to derive features such as slope aspects, angles, and curvatures from gridded elevation data (because pointwise elevation alone does not tell the landscape’s full story). If we are lucky, these engineered features may help improve our predictions. Nonetheless, they are unlikely to be optimal for the task at hand, and it is difficult to hypothesize the best features. This challenge generates a laborious cycle of trial-and-error experimentation to manually craft what may or may not turn out to be useful features for our task-specific spatial models.
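To give a flavor of what such manual feature engineering looks like, the sketch below derives slope angle and aspect from a gridded digital elevation model. The cell size, toy elevation grid, and aspect convention (degrees clockwise from north, pointing downslope) are our illustrative assumptions; terrain analysis software offers many variations on these formulas.

```python
import numpy as np

def slope_and_aspect(dem, cell_size=1.0):
    """Slope angle (degrees) and aspect (degrees) from a gridded DEM."""
    dz_dy, dz_dx = np.gradient(dem, cell_size)   # gradients along rows, cols
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    aspect = np.degrees(np.arctan2(-dz_dx, dz_dy)) % 360.0
    return slope, aspect

# A planar ramp that rises one unit of elevation per unit of easting:
dem = np.tile(np.arange(5.0), (5, 1))
slope, aspect = slope_and_aspect(dem)
```

For this ramp the slope is 45 degrees everywhere and the downslope direction faces west — but for a real landscape it is far from obvious that slope and aspect are the features a predictive model actually needs, which is the trial-and-error problem described above.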

Figure 1. Overview of our deep neural network architecture, which is visualized with the help of NN-SVG software [9]. For each observation, INPUT A feeds an image of the surrounding terrain into a stack of convolutional layers (shown as horizontal blocks). Simultaneously, INPUT B feeds the observation’s location variables into a fully connected layer. These two branches of the network are then concatenated and fed through another two fully connected layers (shown as vertical blocks) that output the two parameters of a Gaussian distribution, which provide our likelihood function. Figure courtesy of [7].

Deep learning offers an alternative to laborious manual feature engineering because automatic feature learning is inherent to NN operation; in the process of learning a composite hierarchical representation of the relation between observed values and predictors, the nodes of each layer act as learned features for the next layer. In the case of convolutional NNs, these features become specialized to indicate the presence of task-relevant patterns in images (or in gridded covariates, for environmental mapping). It is surely not a coincidence that both modern artificial NNs and the mammalian neocortex possess a multi-layered structure that allows problem solving through composite functions, i.e., by modeling an output as a function of a function of a function (and so on) of the inputs. The successes of deep learning—and of mammals—indicate that it is easier to learn complex new functions as “deep” compositions of simpler functions than in a single “shallow” step.
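The power of composition can be illustrated with a classic toy example (ours, not from the article): the “tent map” is representable with just two ReLU units, yet composing it with itself \(k\) times produces a function with \(2^{k-1}\) peaks — a level of complexity that a single shallow layer would need many more units to express.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    # The tent map 2*min(x, 1-x) on [0, 1], written as a two-ReLU layer.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

# Composing the same simple function k times yields 2**(k-1) peaks:
k = 4
x = np.linspace(0.0, 1.0, 2**k + 1)   # grid that hits every dyadic peak
y = x.copy()
for _ in range(k):
    y = tent(y)                        # a "function of a function of a function"
n_peaks = int(np.sum(np.isclose(y, 1.0)))
```

Each additional layer of composition doubles the oscillation count — complexity grows exponentially with depth but only linearly with width, which is the intuition behind preferring “deep” compositions over a single “shallow” step.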

This notion brings us to our recent work on Bayesian deep learning for spatial interpolation in the presence of auxiliary information [7], which uses deep neural networks (DNNs) to incorporate computer vision capabilities into the spatial interpolation process to automatically recognize task-relevant features within the landscape (see Figure 1 for an illustration of the DNN architecture). We can therefore make probabilistic predictions of environmental variables not just as interpolations in geographic space, but also as interpolations in a self-learned hybrid space: a high-dimensional representation that combines global location information with local contextual information that is extracted via convolution. Our deep learning approach enables the generation of more detailed interpolated maps than “shallow” modeling approaches such as traditional kriging, as it automatically extracts task-specific auxiliary information from gridded auxiliary variables. This effort facilitates the automatic discovery and utilization of new environmental relationships that may be more complex than what we could have manually proposed.
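The overall shape of such a two-branch architecture can be sketched as a single numpy forward pass. Everything here — patch size, kernel bank, layer widths, and the randomly initialized weights — is an illustrative stand-in, not the trained network from [7]; the point is only the data flow of Figure 1: terrain image in, location in, Gaussian likelihood parameters out.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def conv2d(img, kernels):
    """Valid cross-correlation of one image with a bank of kernels."""
    kh, kw = kernels.shape[1:]
    H, W = img.shape
    out = np.empty((kernels.shape[0], H - kh + 1, W - kw + 1))
    for c, k in enumerate(kernels):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[c, i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

# INPUT A: a patch of gridded terrain around the observation (toy 8x8 DEM)
terrain_patch = rng.normal(size=(8, 8))
# INPUT B: the observation's location variables (easting, northing)
location = np.array([0.3, 0.7])

# Branch A: convolutional feature extraction from the terrain image
kernels = 0.1 * rng.normal(size=(4, 3, 3))
features = relu(conv2d(terrain_patch, kernels)).ravel()

# Branch B: a fully connected layer on the location variables
W_loc, b_loc = 0.1 * rng.normal(size=(8, 2)), np.zeros(8)
loc_features = relu(W_loc @ location + b_loc)

# Concatenate the branches, then map to the two Gaussian parameters
h = np.concatenate([features, loc_features])
W1, b1 = 0.1 * rng.normal(size=(16, h.size)), np.zeros(16)
h = relu(W1 @ h + b1)
W2, b2 = 0.1 * rng.normal(size=(2, 16)), np.zeros(2)
mu, raw_sigma = W2 @ h + b2
sigma = np.log1p(np.exp(raw_sigma))   # softplus keeps the scale positive
```

The two outputs parameterize the Gaussian likelihood for one site; evaluated over a grid of sites, the same forward pass yields a full map.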

Operating within the framework of Bayesian inference enables the quantification of uncertainties within the deep-learned model (i.e., epistemic uncertainty, which is reducible with more observations) as well as in the data itself (i.e., aleatoric uncertainty, which is irreducible). Unlike traditional kriging, our approach utilizes DNNs rather than Gaussian processes and does not require assumptions of stationarity or isotropy — though each layer of our neural network architecture would be equivalent to a Gaussian process in the limit of infinite width, and so the output is conceptually not worlds apart from what kriging provides. It is also possible that future research directions will combine Gaussian processes with deep learning and convolution to achieve the so-called “best of both worlds.”
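For a Gaussian likelihood, the split between these two kinds of uncertainty follows from a standard variance decomposition: average the likelihood variances across posterior weight samples to get the aleatoric part, and take the variance of the predictive means across those samples to get the epistemic part. The posterior samples below are synthetic stand-ins purely to show the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: each posterior weight sample i yields (mu_i, sigma_i)
mus = 5.0 + 0.3 * rng.normal(size=1000)   # predictive means across weight samples
sigmas = np.full(1000, 0.5)               # likelihood scales across weight samples

aleatoric = np.mean(sigmas**2)            # average noise variance (irreducible)
epistemic = np.var(mus)                   # spread of means across the posterior
total = aleatoric + epistemic             # total predictive variance
```

More observations shrink the spread of the posterior (and hence the epistemic term), while the aleatoric term persists no matter how much data arrives.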

Figure 2. Calcium concentrations that were interpolated from stream sediment geochemistry observations across the U.K. via our Bayesian deep learning approach, which uses computer vision to uncover calcium concentrations’ relations to the landscape itself (as represented by gridded terrain elevation data). The map depicts the mean of our deep neural network’s predictive distribution, which has captured complex relationships between terrain features and calcium concentrations. Figure courtesy of [7].
From a Bayesian perspective, our DNN architecture allows us to “cast the net wide” in terms of our prior beliefs about the space of functions that the model should be able to represent, while simultaneously encouraging the combination of global location information with local contextual information that is learned from gridded auxiliary variables (e.g., from satellite imagery or other remote sensing). When tackling big data problems for which we have many observations, our “wide prior” approach allows us to learn spatial maps of environmental variables with detailed and realistic appearances while also achieving strong metrics against held-out test data. For example, our recent demonstration of calcium concentration mapping shows that the posterior predictive distribution achieves near-perfect probabilistic calibration and is composed of simulated maps whose autocorrelation properties are very similar to those from the held-out test observations [7]. We can also simply use the mean of the predictive distribution as a deterministic prediction (see Figure 2), which achieves an \(\textrm{R}^2\) of \(0.74\) on held-out test observations. This result indicates that our deep model can explain 74 percent of the variance in stream-sediment calcium across the U.K.; perhaps more importantly, the quality of calibration indicates that our model is successfully being honest about what it does not know.
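Both evaluation ideas mentioned here — \(\textrm{R}^2\) of the predictive mean against held-out observations, and probabilistic calibration (does the central 90 percent prediction interval contain roughly 90 percent of held-out values?) — are straightforward to compute from predictive samples. The data below are synthetic, constructed so that the predictive distribution is calibrated by design.

```python
import numpy as np

rng = np.random.default_rng(2)

means = rng.normal(size=2000)                  # latent site means
y_obs = means + rng.normal(size=2000)          # held-out observations (unit noise)
samples = means[None, :] + rng.normal(size=(500, 2000))  # predictive draws per site

# Deterministic skill: R^2 of the predictive mean against held-out data
y_pred = samples.mean(axis=0)
r2 = 1.0 - np.sum((y_obs - y_pred)**2) / np.sum((y_obs - y_obs.mean())**2)

# Probabilistic skill: empirical coverage of the central 90% interval
lo, hi = np.quantile(samples, [0.05, 0.95], axis=0)
coverage = np.mean((y_obs >= lo) & (y_obs <= hi))
```

A well-calibrated model can still have a modest \(\textrm{R}^2\) (and vice versa), which is why we report both: one measures how much variance is explained, the other measures whether the model is honest about what it cannot explain.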

At its heart, our deep learning approach to mapping environmental variables is powered by an approximate implementation of Bayesian inference for DNNs [2]. This implementation employs the typical optimization algorithm for training NNs—stochastic gradient descent—to optimize an approximate posterior distribution over the NN’s weights [10], rather than simply optimizing a fixed set of NN weights as with maximum likelihood estimation (which does not estimate model uncertainty). Once our Bayesian NN is trained, probabilistic predictions are provided by the predictive distribution

\[p(Y_s | x_s,\boldsymbol{y}) = \int_w p( Y_s | x_s,w)p(w | \boldsymbol{y} )dw.\]

Here, \(p(w|\boldsymbol{y})\) is the posterior distribution—the learned distribution over NN weights \(w\) given the training data \(\boldsymbol{y}\)—and \(p(Y_s|x_s,w)\) is the likelihood distribution—the probability of observing some value \(Y\) at site \(s\) given corresponding site predictors \(x_s\) and a sampled configuration of NN weights \(w\) from the posterior distribution \(p(w|\boldsymbol{y})\). Our NN architecture provides this forward model (see Figure 1). To generate one possible map of the target variable, we first sample a configuration of weights \(w\) from the posterior distribution. For each site across the map (cell centers on a regular grid of specified resolution), we then sample a value of the target variable \(Y\) from the likelihood distribution that is given by the corresponding forward model. Repeating this process many times simulates different possible maps and corresponds to estimating the posterior predictive distribution by Monte Carlo integration. For a well-tuned model, our resultant predictive distribution \(p(Y_s|x_s,\boldsymbol{y})\) is a sharp and well-calibrated distribution [3] of the target variable’s possible values at each site (given the predictors at the site and the overall training data). This predictive distribution is composed of all possible maps of the target variable, whose mean is displayed as the map in Figure 2.
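The Monte Carlo recipe just described can be sketched directly. In place of the trained DNN and its approximate posterior, the snippet below uses a toy linear forward model on a one-dimensional transect and a synthetic Gaussian “posterior” over three weights — both illustrative assumptions — but the loop structure (sample weights, run the forward model, sample one map, repeat) is exactly the procedure in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
grid_x = np.linspace(0.0, 1.0, 50)        # cell centers of a 1D transect

def forward_model(x, w):
    """Toy network: weights -> Gaussian likelihood parameters at each site."""
    mu = w[0] + w[1] * x
    sigma = np.exp(w[2])
    return mu, sigma

n_draws = 2000
maps = np.empty((n_draws, grid_x.size))
for i in range(n_draws):
    w = np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=3)  # posterior draw
    mu, sigma = forward_model(grid_x, w)
    maps[i] = rng.normal(mu, sigma)        # one simulated map of the target

predictive_mean = maps.mean(axis=0)        # analogue of the map in Figure 2
```

Each row of `maps` is one plausible map; summaries of the ensemble (mean, quantiles, exceedance probabilities) then answer whatever question is at hand.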

The full paper provides examples of individual simulated maps and a more thorough explanation and analysis of the approach [7].


References
[1] Cressie, N. (1990). The origins of kriging. Math. Geol., 22(3), 239-252.
[2] Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd international conference on machine learning (pp. 1050-1059). New York, NY: Proceedings of Machine Learning Research.
[3] Gneiting, T., Balabdaoui, F., & Raftery, A.E. (2007). Probabilistic forecasts, calibration and sharpness. J. Roy. Stat. Soc. Series B Stat. Methodol., 69(2), 243-268.
[4] Haupt, S.E., Chapman, W., Adams, S.V., Kirkwood, C., Hosking, J.S., Robinson, N.H., ... Subramanian, A.C. (2021). Towards implementing artificial intelligence post-processing in weather and climate: Proposed actions from the Oxford 2019 workshop. Philos. Trans. Roy. Soc. A, 379(2194), 20200091.
[5] Haupt, S.E., Pasini, A., & Marzban, C. (Eds.). (2008). Artificial intelligence methods in the environmental sciences. New York, NY: Springer.
[6] Kirkwood, C., Cave, M., Beamish, D., Grebby, S., & Ferreira, A. (2016). A machine learning approach to geochemical mapping. J. Geochem. Explor., 167, 49-61.
[7] Kirkwood, C., Economou, T., Pugeault, N., & Odbert, H. (2022). Bayesian deep learning for spatial interpolation in the presence of auxiliary information. Math. Geosci., 54, 507-531.
[8] Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems 25 (NIPS 2012) (pp. 1097-1105). Lake Tahoe, NV.
[9] LeNail, A. (2019). NN-SVG: Publication-ready neural network architecture schematics. J. Open Source Softw., 4(33), 747.
[10] Wilson, A.G. (2020). The case for Bayesian deep learning. Preprint, arXiv:2001.10995.

  Charlie Kirkwood is a Ph.D. student in the Statistical Science group at the University of Exeter, where he works in collaboration with the U.K. Meteorological Office. His research interests lie in the development and application of probabilistic artificial intelligence methods for high-fidelity mapping and forecasting. 