Interpolation of spatial data – A stochastic or a deterministic problem?

Interpolation of spatial data is a very general mathematical problem with various applications. In geostatistics, it is assumed that the underlying structure of the data is a stochastic process which leads to an interpolation procedure known as kriging. This method is mathematically equivalent to kernel interpolation, a method used in numerical analysis for the same problem, but derived under completely different modelling assumptions. In this paper we present the two approaches and discuss their modelling assumptions, notions of optimality and different concepts to quantify the interpolation accuracy. Their relation is much closer than has been appreciated so far, and even results on convergence rates of kernel interpolants can be translated to the geostatistical framework. We sketch different answers obtained in the two fields concerning the issue of kernel misspecification, present some methods for kernel selection and discuss the scope of these methods with a data example from the computer experiments literature.


A survey in mathematics for industry
Interpolation of spatial data is a very general mathematical problem with various applications, such as surface reconstruction, the numerical solution of partial differential equations, learning theory, computer experiments and the prediction of environmental variables, to name a few. Specific instances from different fields of application can be found in [9] and [85]. The precise mathematical formulation of the problem is as follows: Reconstruct a function f : T → R, where T is a domain in R d , based on its values at a finite set of data points X := {x 1 , . . . , x n } ⊂ T (usually called 'sampling locations' in geostatistics and 'centres' in kernel interpolation). This situation is illustrated in Figure 1 with topographic data from Mount Eden, New Zealand, 1 with X being a randomly chosen set of 300 sampling locations. In order to derive optimal procedures for reconstruction and to provide a priori estimates of their precision, it is necessary to make assumptions about f. We focus on two different approaches that deal with the above problem in different ways: kernel interpolation and kriging. The former assumes that f belongs to some Hilbert space H of functions of certain smoothness. This allows one to use Taylor approximation techniques to derive bounds for the approximation error in terms of the density of the data points. Smoothness is a comparatively weak and flexible assumption, and the error bounds permit control of the precision whenever it is possible to control the sampling. By construction, the kernel interpolation approach yields minimal approximation errors with respect to the norm · H on H.

T f(x)
In some applications there is only limited or no control over the sampling and one has to get by with the (sometimes very sparse) data that are available. Typical examples are environmental modelling or mining where sampling involves high costs or is limited by lack of accessibility of the variable of interest. Moreover, in these applications the variable of interest is often a very rough function, and together with the sparsity of data this implies that error bounds obtained on the basis of Taylor approximation are only of limited use. A way out is possible if the stronger modelling assumption that comes with a statistical modelling approach is adequate: the assumption that f is a realization of a random field. Then again optimal approximation procedures can be derived, and a satisfactory stochastic description of the approximation error is available.
It is quite remarkable that both approaches finally come up with the same type of approximant despite different model assumptions and motivations of its construction. Several authors, including [10,42,51,57], have already pointed out this connection, and a comprehensive overview over both approaches is given by [4]. While the authors of [4] also establish the link between stochastic processes and reproducing kernel Hilbert spaces (RKHS), the equivalence of kernel interpolation and kriging is shown by the usual algebraic arguments. In this paper we introduce kernel interpolation and the underlying RKHS model in a different way than that usually taken in the spline literature. This will make it clear that the derivation of optimal interpolation procedures follows the same principles in the stochastic and deterministic frameworks, and reveal that the connection between these frameworks goes much further than algebraic equivalence of respective interpolants.
In Sections 2 and 3 we describe kernel interpolation and kriging respectively along with their modelling assumptions and concepts of optimality. Some problems closely related to spatial interpolation and generalizations are discussed in Section 4. The presentation in Section 2 and 3 will show that the function that characterizes the magnitude of the pointwise approximation error appears -with different interpretations -in both frameworks. This will be used in Section 5 to apply theorems on the convergence rates of kernel interpolants to the stochastic framework, where statements of comparable generality have not been available so far. In Section 6, the issue of kernel misspecification is addressed, and we give an overview of the answers given in both communities to the question about the consequences of using an 'incorrect' kernel for the construction of the interpolant. In Section 7 we turn to the issue of parameter estimation and describe some of the procedures used to select a kernel based on the available data. The scope of these methods is briefly discussed and illustrated with a data example.
The main focus of this paper is to point out the interconnections between the two approaches to spatial interpolation. Some topics which receive considerable attention in one of the two frameworks but are (from our current perspective) hardly relevant for the respective other, will be briefly addressed in the final discussion.

Positive definite kernels
In the kernel interpolation framework, f is assumed to belong to some Hilbert space H of real-valued functions on T with inner product (·, ·) H . It is further assumed that for all where H * denotes the dual of H with dual norm Then by the theory of reproducing kernel Hilbert spaces (see [67] and references therein), a unique symmetric function K : T × T → R ('reproducing kernel') exists with K(·, x) ∈ H and for all x, y ∈ T , g ∈ H. Note that either of the last two equations imply for any choice of points x 1 , . . . , x n ∈ T , n ∈ N and coefficients α = (α 1 , . . . , α n ) ∈ R n \ {0}. If the functionals δ x 1 , . . . , δ x n are linear independent when based on distinct points, we even have strict inequality in (2.3). In this case K is called a positive definite kernel and H = H K is called the native space for K. The important role of positive definiteness for kernel interpolation was pointed out by Micchelli [55].  [68,84] for further examples The native space is an abstract concept, but for some important function spaces an explicit link can be made to translation-invariant kernels where K(x, y) = Φ(y − x) for some function Φ : R d → R. Consider, for example, the Sobolev spaces W τ 2 (R d ), which are widely used in numerical analysis, in particular in the context of Partial Differential Equations (PDEs). They not only have a rich mathematical structure but also characterize the degree of smoothness of the functions belonging to them. For τ ∈ N, this can be seen directly from their definition where D α denotes an αth weak partial derivative [22,Section 5.2]. An equivalent definition exists in terms of Fourier transforms, and this definition has a straightforward generalization to non-integer orders, τ > 0. In the remainder of this paper, we always have τ > d 2 , which implies, by the Sobolev embedding theorem, that every equivalence class in W τ 2 (R d ) contains a continuous representer. We will interpret W τ 2 (R d ) as a set of continuous functions in this way. The following theorem [85,Cor. 10.13] shows that it constitutes the native space of certain translation-invariant kernels Φ which have a degree of smoothness that depends on τ.

4)
for some τ > d 2 and constants 0 < c 1 6 c 2 . Then the native space of Φ coincides with the Sobolev space W τ 2 (R d ) as a vector space, and the native space norm and the Sobolev norm are equivalent. Table 1 shows examples (see [31,46,50,84] for details) of positive definite functions Φ commonly used in geostatistics and the approximation theory. For Φ ∈ L 2 (R d ) their Fourier transforms are defined as stated in Theorem 2.1. We write O((1 + ξ 2 ) −τ ) to denote that (2.4) is satisfied where closed forms are not available.
While Sobolev spaces are intuitively more accessible and allow one to better understand what exactly is assumed for f, the framework of native spaces is useful to derive an optimal approximation of f based on the given data at X in the sense that the worst case approximation error is minimized pointwise. Specifically, we consider approximants of the form x∈ T , (2.5) which are, at each point x ∈ T , a linear combination of the given values of f. The coefficient functions u 1 , . . . , u n : T → R are defined pointwise, and for fixed x 0 ∈ T we consider the norm of the error functional λ err : According to (2.3), its square can be written as a quadratic form and it follows that optimal coefficients u * 1 (x 0 ), . . . , u * n (x 0 ) minimizing Q must satisfy If K is positive definite, then this system has a unique solution, and this in turn implies that the so-called Lagrange conditions, are satisfied. Hence, s f,X interpolates f at X, and it can be shown that it has minimal native space norm among all such interpolants [65]. This property is the usual starting point for the derivation of kernel interpolants in the spline literature.

Conditionally positive definite kernels
Some important classical interpolation schemes, such as thin-plate splines [18][19][20] or Hardy's multiquadrics [35], are not covered by the above theory, but can still be incorporated into the framework of kernel interpolation. To this end we must allow for kernels that are only conditionally positive definite with respect to some finite dimensional function space P (in applications we usually have P = π m (T ), the space of polynomials on T of order at most m). Let L P (T ) denote the space of all linear functionals of the form a i δ x i , a 1 , . . . , a n ∈ R, X := {x 1 , . . . , x n } ⊂ T , n ∈ N Table 2. Some conditionally positive definite functions Φ(h) together with the minimal space with respect to which they are conditionally positive definite that vanish on P, i.e. λ X (p) = 0 for all p ∈ P. This is a vector space over R under usual operations. A kernel K is called conditionally positive definite with respect to P if where the superscripts denote the application of λ X with respect to the first and second argument of K respectively. Note that such K is also conditionally positive definite with respect to any finite dimensional function space P ⊃ P, in particular we can always consider a conditionally positive definite kernel with respect to π m (T ) as conditionally positive definite with respect to π l (T ) if l > m. Assume from now that K : T × T → R is (symmetric and) conditionally positive definite with respect to P. In analogy with (2.3) we let and due to (2.9) this defines an inner product on L P (T ). This can be used to define the native space of K as the largest space on which all functionals from L P (T ) act continuously, i.e.
H K,P := {g : T → R : |λ(g)| 6 C g λ K for all λ ∈ L P (T )}, (2.10) where C g < ∞ is a constant depending only on g. A semi-norm on H K,P can be defined via This characterization goes back to the pioneering work of Madych and Nelson [48]. We chose it because of its striking analogy with the theory of intrinsic random fields [52], which will be further discussed in Section 3. In [67] a detailed derivation of the native space of a given conditionally positive definite kernel K is presented, showing how values g(x), x ∈ T can be assigned to the abstract function g which a priori can be evaluated only by functionals from L P (T ) which does not include the point evaluation functionals δ x . Note that the positive definite case discussed above corresponds to P = {0}, and the continuity of any λ ∈ L {0} (T ) is a consequence of assumption (2.1) and the Riesz representation theorem. Examples of conditionally positive definite functions are given in Table 2. For the thin-plate splines, from now denoted by Φ d,l , a counterpart of Theorem 2.1 will provide some intuitive understanding of the corresponding native space. Φ d,l will be considered as conditionally positive definite with respect to π l−1 (R d ). The corresponding function spaces are the Beppo-Levi spaces Beppo-Levi spaces are closely related to Sobolev spaces, and this relation can be used to show that any BL l (R d ) with l > d 2 can be embedded into C(R d ) [19].  Table 2, considered as a conditionally positive definite with respect to π l−1 (R d ). Then the associated native space H K,P is the Beppo Levi space BL l (R d ) of order l, and the semi-norms are the same.
When it comes to deriving an optimal approximation of f, minimization of the norm of the error functional λ err = δ x 0 − λ u(x 0 ) for fixed x 0 ∈ T again amounts to the minimization of the quadratic form Q in (2.6). In the general framework of conditionally positive definite kernels, however, λ err is not automatically in L P (T ), and the additional constraint for all p ∈ P (2.11) must be satisfied to ensure that λ err K is defined. Note that this constraint also implies that functions from P are always reproduced exactly by s f,X . Since P was assumed finite dimensional, we can choose a basis p 1 , . . . , p q , and the above condition becomes Minimizing (2.6) subject to (2.12) can be done using Lagrange multipliers η 1 (x 0 ), . . . , η q (x 0 ), and it follows that optimal coefficients u * 1 (x 0 ), . . . , u * n (x 0 ) must satisfy (2.12) and ) of s f,X is a good starting point to derive a pointwise optimal approximation of f, but it is quite inefficient from a computational point of view. Some algebraic manipulations of (2.5), (2.12) and (2.13) however yield the alternative representation (2.14) where the coefficients α 1 , . . . , α n and β 1 , . . . , β q are defined by the system which is again uniquely solvable if K is conditionally positive definite and X is Punisolvent. Note that the first set of equations simply forces s f,X to interpolate the data, while the second set is necessary to ensure a unique decomposition into two terms in (2.14). This system needs to be solved only once and then yields an expression for s f,X in closed form, valid on the whole of T . Its solution requires O(n 3 ) floating point operations which may still be too expensive for large spatial data sets. We refer to [85,Chapter 15] for an overview over some algorithms that compute an approximate solution to reduce the computational cost to a practically manageable level.

Kriging
The statistical counterpart to kernel interpolation is known as kriging, the geostatistical term for optimal linear prediction of spatial processes. Kriging is based on the modelling assumption that f is a realization of a random field Z, which is a collection {Z(x) : x ∈ T } of random variables over the same probability space (Ω, A, P ), indexed over T . The observations f(x 1 ), . . . , f(x n ) are then realizations of the random variables Z(x 1 ), . . . , Z(x n ).
To predict Z at some (unobserved) location x 0 ∈ T , one considers all linear predictors of the form which are themselves random variables. The prediction of f(x 0 ) given f(x 1 ), . . . , f(x n ) is then as for kernel interpolation To determine optimal weights u * 1 (x 0 ), . . . , u * n (x 0 ), additional structural assumptions on Z are needed, and depending on these assumptions one distinguishes simple, ordinary, universal and intrinsic kriging. There are still quite other forms (such as complex kriging [45], indicator kriging [41] or disjunctive kriging [53,54]) but we shall only discuss the aforementioned ones due to their close connection to kernel interpolation. Figure 2. (Colour online) Perspective plots of one realization of Gaussian random fields with different Matérn covariance functions Φ ν with ν = 1.0 (left), ν = 1.5 (middle) and ν = 2.0 (right). We use the parametrization of Φ ν proposed by Handcock and Wallis [34], where the argument is rescaled such that the value of ν influences the shape of Φ ν (h) only near h = 0.

Simple kriging
The additional assumption with simple kriging is that Z(x) is centred, i.e. E(Z(x)) = 0, and that the second moments exist for every x ∈ T . Then the covariance function can be defined and it follows from some basic properties of the (co)variance that K is always symmetric and positive semi-definite. In this framework K describes the probabilistic structure of Z, but certain 'deterministic properties' such as the smoothness of realizations are controlled by K as well (see Figure 2, created with [73]). Note however, that a complete characterization of the probabilistic structure of a random field requires additional assumptions on its distribution (e.g. assuming all finite dimensional distributions to be multivariate Gaussian). The covariance function K can be viewed as an inner product on the vector space . . , a n ∈ R, x 1 , . . . , x n ∈ T , n ∈ N of second-order random variables. The closure of V Z under this inner product yields a Hilbert space that is isomorphic to H K from Section 2 (see [4]). When it comes to spatial interpolation, however, the natural counterpart of the Hilbert space generated by Z is the dual H * K rather than H K , as will become clear in the following. The geostatistical notion of optimality is to consider as the 'best' linear predictor Z u * (x 0 ) the random variable with minimal expected squared deviation from Z(x 0 ), i.e.
The optimal weights are then obtained by minimizing . This is, however, the same quadratic form Q as in (2.6), and so the simple kriging prediction coincides with the optimal approximation (2.5) with weights u * 1 (x 0 ), . . . , u * n (x 0 ) determined by (2.7). Briefly, in kriging the covariance function takes the role of the interpolation kernel, and Q, originally introduced as the squared norm of the pointwise error functional at x 0 , becomes the expected squared prediction error.
We briefly mention another perspective which is in a way even more probabilistic: the Bayesian approach. In the simple kriging framework, it is essentially the additional assumption of Z being Gaussian that needs to be made. As mentioned above, once K is fixed, the distribution of a Gaussian random field is completely determined, and the Bayesian approach uses it as infinite-dimensional prior distribution for the unknown function f. The posterior distribution of f at x 0 given the observations f(x 1 ), . . . , f(x n ) is then a Gaussian distribution with mean s f,X (x 0 ) (with weights u * 1 (x 0 ), . . . , u * n (x 0 ) as above) and variance Q(u * (x 0 )). Unlike in simple kriging and kernel interpolation, this result is not based on a particular loss function. It is an immediate consequence of the complete specification of a prior distribution for f (see also [62,Section 24.]).

Ordinary and universal kriging
Especially the assumption that Z is centred seems inappropriate in most applications of geostatistics. The first generalization of simple kriging is therefore to allow for a non-zero mean function m(x) := E(Z(x)) while still keeping the assumption that Z has second moments. The mean function is usually unknown in practice, but this problem can be bypassed by requiring the potential predictors Z u of Z to be unbiased, i.e.
This has the additional advantage of preventing systematic over-or underestimation of Z(x 0 ). Note that any such predictor is automatically unbiased if Z is centred. Using this unbiasedness constraint to recalculate the target function (3.3) one obtains , which is again the quadratic form Q in (2.6), depending only on K but not on m. Its minimizer, however, is in general not the same as above, since the additional unbiasedness constraint restricts the choice of weights. To ensure that condition (3.4) can be satisfied at all, one cannot let the mean function completely general, but must assume a sufficiently simple, finite dimensional model. The simplest model is a constant (but unknown) mean function, and this assumption leads to what is called ordinary kriging. This is, however, just a special case of universal kriging where the mean function is modelled as with known and linear independent functions p 1 , . . . , p q , and unknown coefficients β 1 , . . . , β q . Such a mean function is also called a trend, and condition (3.4) becomes This condition must hold for any set of coefficients β 1 , . . . , β q , and so when predicting at x 0 we are back to condition (2.12) restricting the weights u 1 (x 0 ), . . . , u n (x 0 ). It follows that the universal kriging prediction coincides with the optimal approximation (2.5) from the conditionally positive definite kernel interpolation setup, with optimal weights u * 1 (x 0 ), . . . , u * n (x 0 ) determined by (2.12) and (2.13). Representation (2.14) was already noted by Matheron [51], and the corresponding equation system is known as dual kriging. The universal kriging interpolant can also be derived within a Bayesian framework. As a prior distribution for f, one assumes a Gaussian random field with covariance function K and mean function m as in (3.5). For a complete specification of the prior for f, a distribution assumption needs to be made for β 1 , . . . , β q as well. Then the posterior distribution of f can be worked out, but it will depend both on K and on the prior distribution of the trend coefficients. In the special case of a flat (uninformative) prior, however, Omre and Halvorsen [59] show that the posterior of f at x 0 is a Gaussian distribution with mean s f,X (x 0 ) and variance Q(u * (x 0 )), both calculated with the optimal weights from the universal kriging approach.
Universal kriging and kernel interpolation with conditional positive definite kernels are formally equivalent and are derived from the same loss function Q. Nevertheless, the analogy is not yet perfect because the universal kriging assumption that Z(x) has second moments for every x ∈ T automatically entails positive (semi)definiteness of K. We therefore consider a slightly different stochastic model leading to kriging interpolants of the same form, but using a more general dependence structure that permits the use of conditionally positive definite kernels.

Intrinsic kriging
The idea with intrinsic random fields (introduced in [52]) is that one no longer specifies the full second-order structure of Z, but only the dependence structure of certain increments. More specifically, let P again be a finite dimensional space of functions on T , and let L P (T ) be as in Section 2, i.e. the space of functionals of the form λ X = n i=1 a i δ x i with λ X (p) = 0 for all p ∈ P. For every such λ X ∈ L P (T ) is called an allowable linear combination of Z with respect to P. Assume that all allowable linear combinations of Z have second moments and are centred, i.e. E(Z λ,X ) = 0. The We note that Z and Z + p have the same generalized covariance function for any p ∈ P. Moreover, since the expectation of any squared random variable is non-negative, K must be conditionally positive semi-definite with respect to P.
The most important case in practice is the case where P = π m (R d ), the space of polynomials of order at most m, and where Z is intrinsically (weakly) stationary of order m, i.e. K(x, y) = Φ(y − x) for some function Φ : R d → R that is conditionally positive semi-definite with respect to π m (R d ). Assuming K to be translation invariant is often reasonable because general dependence structures are usually too complex for reliable inference. Note that λ X ∈ L π m (R d ) implies λ X+x ∈ L π m (R d ) for all x ∈ R d , owing to the binomial formula, and hence all random variables It follows that the random field {Z λ,X+x : x ∈ R d } of mth-order increments is weakly stationary (i.e. centred with second moments and translation-invariant covariance function) with Whenever in practice one faces the situation that Z itself does not appear to be weakly stationary, there is still a chance that this seems plausible for some higher order increments, and this then motivates the modelling with intrinsically stationary random fields. A detailed introduction to this topic is given in [9,Chapter 4], and in particular the differences from a modelling perspective to the model underlying the universal kriging approach are illustrated excellently.
To predict Z(x 0 ) we consider the error functional λ err = δ x 0 − λ u(x 0 ) as in Section 2, but in the stochastic framework we are interested in the expected squared prediction error. When K is the generalized covariance function of Z, we have by definition which is again the quadratic form Q in (2.6). The requirement λ err ∈ L P (T ) again entails condition (2.12), and so it follows that the intrinsic kriging prediction is identical to the optimal approximation (2.5) in the conditionally positive definite kernel interpolation setup with optimal weights u * 1 (x 0 ), . . . , u * n (x 0 ) determined by the equation systems (2.13) and (2.12).
The difference to universal kriging lies in the interpretation of representation (2.14). While it is legitimate in universal kriging to interpret the second term in (2.14) as an approximate mean function and the first term as an approximate deviation from it, such an interpretation would be wrong in intrinsic kriging where a mean function is not even defined.

Comparison with the deterministic framework
We sum up what has been pointed out so far concerning the two different modelling approaches and the corresponding notions of optimality. In both frameworks we seek to minimize, at every fixed location x 0 ∈ T , the quadratic form possibly subject to some additional restrictions. The sense in which the resulting approximation of f at x 0 is optimal then differs according to the interpretation of Q: (1) In the deterministic framework, Q(u(x 0 )) indicates how well δ x 0 can be approximated by the linear combination λ u(x 0 ) of the point evaluation functionals for points of X. It measures how big the approximation error can be in the worst case assuming only that f ∈ H K,P .
(2) In the stochastic framework, Q(u(x 0 )) indicates how big the error for approximating a random field Z at x 0 by λ u(x 0 ) (Z) will be on average. Its calculation is based on the assumption that Z has generalized covariance function K.
Both worst-case and average-case behaviour of numerical algorithms are studied and compared by Ritter [64]. For his average-case analysis, Ritter adopts the stochastic perspective and specifies a probability measure on the space of all functions by making the geostatistical assumption of a random field Z with covariance kernel K. The average-case optimality of K-splines that is stated in this monograph is then a consequence of the equivalence of kriging and kernel interpolation. A short list with terminology used in kernel interpolation and geostatistics is provided in Table 3. We note that in situations where both frameworks are applicable, different answers are obtained as to the question of which K should be used. Indeed, assume that f is a realization of a centred random field Z on R d with translation invariant covariance function Φ τ whose Fourier transform Φ τ satisfies (2.4). For this model the simple kriging framework applies, and states that the optimal interpolant is obtained with Φ τ . On the other hand, Scheuerer [71] shows that the realizations of Z-and hence f-are in the Sobolev space W μ 2 (R d ) if and only if μ < τ − d 2 . The space W μ 2 (R d ), however, calls for some reproducing kernel Φ μ satisfying (2.4) with μ instead of τ (see Theorem 2.1), and so Table 3. Some frequently used terms in the language of statistics (left column) and numerical analysis (right column) Covariance function Symmetric and positive definite kernel ∼ of a weakly stationary random field Translation-invariant kernel ∼ of a w. stat. and isotropic random field Radially symmetric, translation-invariant kernel Kriging Kernel interpolation Intrinsic kriging ∼ with a conditionally positive definite kernel Kriging variance P 2 K,X Power function P K,X a numerical analyst would rather use Φ μ with μ ≈ τ − d 2 for interpolation. In other words: if worst case optimality is aspired, then a rougher kernel is considered appropriate for the same function f than when the aim is average-case optimality.

Generalizations and related problems
The spatial interpolation problem discussed in this paper can be generalized by considering arbitrary functionals λ 0 (f), λ 1 (f), . . . , λ n (f) instead of f(x 0 ), f(x 1 ), . . . , f(x n ). Such a generalization covers, for example, the situation where f is to be reconstructed from both function values and derivatives ('Hermite-Birkhoff interpolation'), or the case where the interest is in approximating integrals of f over certain sub-domains of T . The latter is an important problem in mining, where measurements (which may also effectively be integrals over small areas) are taken at certain locations in some ore deposit, and the interest is in predicting the overall content.
In numerical analysis, approximation with general functionals goes back to [87]. In the positive definite setup, the considered functionals must be in the dual space H * K , the analogue requirement with simple kriging is that λ 0 (Z), λ 1 (Z), . . . , λ n (Z) are all welldefined random variables and have second moments. If these conditions are met, the generalization of the quadratic form (2.6) is well defined, and its minimization yields optimal weights u * 1 (λ 0 ), . . . , u * n (λ 0 ) for the approximation of λ 0 (f) by In the framework of conditionally positive definite kernels (intrinsic kriging), one needs to require that (λ 1 i λ 2 j )(K) and λ i (p k ) are well defined for all i, j ∈ {0, . . . , n} and k ∈ {1, . . . , q}, and condition (2.12) becomes The quadratic form Q from above is then again well defined, and its minimization is straightforward. Wendland [85, Section 16.2] discusses the case of Hermite-Birkhoff interpolation in more detail and shows that the resulting system of equations has a unique solution if λ 1 , . . . , λ n are linearly independent, and if λ i (p) = 0 for all p ∈ P and i ∈ {1, . . . , n} implies that p = 0. The close link between approximation and integration of f and its consequences on the error analysis of these problems is discussed in [64]. In the geostatistical framework, both kriging with gradient information and kriging of block averages are described in [9]. The former method has been applied to meteorological problems involving several variables related by physical laws [7,8].
A different type of generalization of the setup in Sections 2 and 3 is required when the set X is infinite. One could think, for example, of data from moving tracking devices which measure f continuously along one-dimensional trajectories. λ u(x 0 ) itself is then no longer a finite linear combination of point evaluation functionals, but lies in the closure of such functionals, i.e.
The definition of Q(u(x 0 )) := δ x 0 − λ u(x 0 ) 2 K remains unchanged and the functional λ u * (x 0 ) defining the optimal interpolant is obtained as the projection of δ x 0 on F P,X,x 0 [69]. In the same way, the kriging approximation Z u * (x 0 ) of Z(x 0 ) based on the values of Z at all points of X is obtained as the projection of Z(x 0 ) on the space where the closure is under the inner product induced by the generalized covariance function. While this generalization is straightforward in theory, it is not clear how a solution can be obtained in practice. Matheron [52] studies a special case where Z u (x 0 ) can be represented by a measure μ x 0 with support in X, i.e.
Such a representation is not always possible, and a counterexample in [52] shows that smooth covariance functions, such as the Gaussian model, may lead to unsolvable systems. If, however, the optimal solution can be represented in that way, then the system of equations (2.12) and (2.13) becomes a system of integral equations that determine μ x 0 . At least in special cases, for example when K is such that Z has the Markov property and the geometry of X is sufficiently simple, closed-form solutions for μ x 0 can be obtained from these equations.
A mathematical problem that is closely related to the spatial interpolation problem discussed in this paper arises when the function f has to be reconstructed based on data i.e. where f is observed with measurement errors 1 , . . . , n . In this situation the aim is no longer interpolation of the data, but rather approximation by a function that is close to the noisy observations (4.1) on the one hand but still reasonably smooth on the other hand. This is the standard setup in machine learning, and depending on the loss function that is used to assess the fidelity to the data, different approaches turn out to be optimal. Schaback and Wendland [70] discuss the role of kernel methods in machine learning. The approach that is most closely related to the kernel interpolants ('splines') in Section 2 is the concept of smoothing splines, which corresponds to a quadratic loss function. More specifically, instead of minimizing the native space norm among all functions interpolating the data, smoothing splines minimize where λ is a regularization parameter that controls the fit/smoothness trade-off mentioned above. The universal kriging counterpart of (4.2) is obtained when f is assumed to be a realization of a random field Z with covariance function K, and 1 , . . . , n are assumed to be realizations of independent, centred random variables with variance λ. The optimal solution in that case still has the form (2.5), but in the system of equation (2.13) defining the kriging weights, K(x i , x i ) is replaced by (K(x i , x i ) + λ) for all i ∈ {1, . . . , n}. For a detailed description of smoothing splines, their connection to RKHS on the one hand and geostatistical methods on the other, we refer to [83] and references therein.

Error estimates
In both frameworks the magnitude of the approximation error at x 0 ∈ T is characterized by Q. In the literature on kernel interpolation, the value of Q 1/2 for approximation with the optimal weights u * 1 (x 0 ), . . . , u * n (x 0 ) is denoted by and is called the power function. The definition of Q 1/2 immediately implies the error bound The function f is unknown but fixed, and so in order to control the approximation error one is interested in quantifying how P K,X depends on K and X. Conversely, a bound on the absolute approximation error at x 0 normalized by |f| H K,P that holds for any f ∈ H K,P implies a bound on P K,X (x 0 ).
In geostatistics too P K,X (x 0 ) is a well-known quantity: its square is the so-called kriging variance. Stein's book [77] is the main reference for an in-depth study of the asymptotic behaviour of kriging interpolants. In Section 3.6 he considers a centred, weakly stationary random field Z on R with covariance function Φ τ satisfying (2.4). The simple kriging interpolant Z u * (0) at x 0 = 0 is calculated based on the values of Z at ±δ, ±2δ, . . . (interpolation problem) or at −δ, −2δ, . . . (extrapolation problem). In both cases it turns out that the kriging variance can be bounded by This result is very useful to understand the impact of the smoothness of Z on the precision of kriging approximations, but the geometric setup is rather special. In the kernel interpolation literature similar statements exist for very general alignment of points where f is observed. As a consequence of the coincidence of P 2 K,X with the kriging variance, these results can be easily translated to the stochastic framework.
To characterize the density of locations in X without restricting to lattice data, one defines the fill distance Intuitively, h X,T is the radius of the largest ball centred at some x ∈ T that does not contain any of the data points. Results will be given for approximation on a bounded domain T ⊆ R d with Lipschitz boundary, which means that the boundary can be thought of as locally being the graph of a Lipschitz continuous function. Moreover, T must satisfy an interior cone condition, i.e. there exists an angle θ ∈ (0, π 2 ) and a radius r > 0 such that for every x ∈ T a unit vector ξ(x) exists such that the cone is contained in T . This simply means that there exists a cone of fixed size that can be placed everywhere inside T , thus excluding the possibility of extremely narrow bulges of the boundary. We state two results from the kernel interpolation literature (see [85,Section 11.6] and [58,Section 4]) and formulate their consequences for kriging: Theorem 5.1 Suppose that T ⊂ R d is a bounded domain, has a Lipschitz boundary, and satisfies an interior cone condition. Let X ⊂ T be a given discrete set and s f,X be the kernel interpolant based on a translation invariant and positive definite kernel Φ τ satisfying (2.4) with τ = k + s, where k > d 2 is a positive integer and 0 6 s < 1. Then the error between f ∈ W τ 2 (T ) and its interpolant s f,X can be bounded by for all sufficiently dense sets X.

Corollary 1
Let Z be a centred, weakly stationary random field on a bounded domain T ⊂ R d with covariance function Φ τ as in Theorem 5.1. Assume that T has a Lipschitz boundary and satisfies an interior cone condition. Then the kriging variance can be bounded by for all sufficiently dense sets X.
Theorem 5.2 Suppose that T ⊂ R d is a bounded domain that satisfies an interior cone condition. Consider the thin-plate splines Φ d,l from Table 2 as conditionally positive definite with respect to π l−1 (T ). Then the error between f ∈ W l 2 (T ) and its thin-plate spline interpolant s f,X can be bounded by for all sufficiently dense sets X.

Corollary 2
Let Z be an intrinsically stationary random field of order l on a bounded domain T ⊂ R d with generalized covariance function Φ d,l as in Table 2. Assume that T satisfies an interior cone condition. Then the kriging variance can be bounded by for all sufficiently dense sets X.
The preceding Corollaries give rates for the speed of decline of the kriging variance as the data become denser. Corollary 1 is in agreement with, but more general than the result from [77, Section 3.6] mentioned above. To our knowledge, there is no result on convergence rates of kriging predictions in the statistical literature that covers geometric setups of data points with the generality of Corollaries 1 and 2. We refer to [86] for generalizations of Theorems 5.1 and 5.2 to the situation (4.1) where f is observed with measurement error.

Interpolation with misspecified kernels
So far it has always been assumed that the correct K is known. In geostatistics this means that the covariance structure of the random field under study is known, in kernel interpolation it amounts to the assumption that one knows the native space in which f is contained. In practice, however, such knowledge is usually not available, and so the question arises whether the interpolation schemes discussed above are still near-optimal if an 'incorrect' kernelK is used instead of K.
In kernel interpolation, the main interest is to ensure that the optimal rates in Theorems 5.1 and 5.2 are maintained. If this is the only goal, then rescaling the argument of Φ does not have an effect because in (2.4) rescaling only changes the constants, and thin-plate spline interpolants are invariant to rescaling of the argument of Φ d,l anyway. Misspecifying the smoothness of Φ, however, does have an effect on the rate. If a kernel Φτ withτ < τ is used in the setup of Theorem 5.1, then the statement remains valid for the lower rate ofτ − d 2 . Forτ > τ on the contrary, it cannot be guaranteed that f ∈ Wτ 2 (T ), and so does not decline faster than h X,T , however, it can be shown that the error rate is of the same order as if a kernel with the correct degree of smoothness was used [58]. Hence, for quasi-uniform sets X, i.e. q X 6 h X,T 6 cq X for some fixed constant c > 0, using a very smooth kernel does not degrade the approximation accuracy of s f,X with respect to the error rate. The power function, however, is independent of the true smoothness of f, thus decreases with the faster rate ofτ − d 2 , and consequently yields a false description of the magnitude of approximation errors.
The last point is not considered a big deficiency in kernel interpolation, but in geostatistics the exact quantification of the approximation error plays an important role, and a different perspective has been adopted here. A major step towards a theoretically founded answer to the kernel misspecification issue was made in [75]: If K andK are compatible, then the approximation based onK will have the same asymptotic efficiency as the optimal approximation, and the relative deviation of the true expected squared approximation error from the one calculated under the false assumption thatK is correct is asymptotically negligible. A full explanation of the concept of compatibility is beyond the scope of this paper, for details consider [38,77,78]. To compare with the statement above, we shall however give a sufficient condition for compatibility in an important special case where x,y∈ T , σ, a, ν > 0, (6.1) i.e. K is translation-invariant, radially symmetric and of the Matérn type (see Table 1). In addition to the parameter ν controlling the smoothness of K, we consider a parameter a rescaling the argument, and a variance parameter σ that does not affect the interpolant s f,X but scales the power function. This choice of K satisfies (2.4) with τ = ν + d 2 for any value of σ and a. When K has the above form and d 6 3, compatibility of K andK is guaranteed [88] This still allows for certain deviations ofK from K, but limits the choice ofK much more than the conditionν > ν that ensures optimal rates of the approximation error. Note the dependence of the above condition on the space dimension. For d > 5, condition (6.2) is no longer sufficient, and K andK are compatible only in the trivial case whereK = K [1]. The case d = 4 is still open. To formulate the precise statement of [75], consider a random field Z on a bounded domain T with mean function of the form (3.5) and covariance function K. Let ZK(x 0 ) be the kriging prediction at x 0 ∈ T based on observations of Z at some set X n ⊂ T , derived under the (false) assumption thatK is the covariance function. Assume further that x 0^Xn , the sequence (X n ) n∈N of point sets is getting dense in T and as n tends to infinity, where E K denotes the expectation under K. Then it holds, for any compatible covariance functionK, that as n tends to infinity. The convergence is even uniform on T [77]. Recall that where the subscripts K andK denote for Q that the quadratic form is calculated using K andK respectively, and for u * the optimal weights were obtained by minimizing Q K and QK respectively. In the language of numerical analysis, (6.3) says that asymptotically the interpolant obtained with a compatible kernelK is still optimal, and that the power function calculated withK tends to the 'true' power function Q K (u * K (x 0 ). This statement is much stronger than that of an optimal convergence rate, but it is based on more restrictive assumptions like (6.2).
An extension of this result to some conditionally positive definite covariance functions is proved in [78]. Putter and Young [60] consider the setting whereK is not fixed but may depend on n, which accommodates the situation in practice whereK n can be estimated from the data at X n (see Section 7) with increasing precision as n tends to infinity. This convergence is formalized by introducing the concept of contiguity (which replaces compatibility, see [60] for definition), and it is shown that (6.3) still holds if the stochastic models corresponding to the sequence (K n ) n∈N on the one hand and the true covariance function K on the other hand are contiguous.

Kernel selection and parameter estimation
An immediate question to follow up the issue of kernel misspecification is how to identify the 'correct' K based on the information and data at hand. We do not intend to give a comprehensive list of all methods available, but focus on two methods that are applicable in both deterministic and stochastic frameworks.
The issue of kernel selection has received comparatively little attention in the framework of kernel interpolation. This is not surprising in the light of the preceding section where we noted that working with some smooth kernel would always guarantee optimal convergence rates whatever be the particular form (and scaling) of this kernel, provided that the sampling locations are quasi-uniform. Consequently, more emphasis was put on the study of good configurations of sampling locations [14,16,39] on the one hand, and edge correction strategies (see [26] for an overview) on the other hand to avoid undesired oscillations near the boundaries that often come with smooth and flat kernels. Nevertheless, several authors [6,27,28,63] have pointed out the big impact of the choice of, for example, the scaling parameter on the accuracy of interpolant. When ill-conditioning (see Section 8) is not an issue for a relevant range of parameter values, there is usually a value that minimizes interpolation errors.
In the earlier literature, the question of suitable scaling of kernel has been typically solved by ad hoc rules [25,28,35]. Rippa [63] was the first to propose an algorithm based on the idea of leave-one-out cross validation (LOOCV) which chooses the scale parameter such that some norm of the LOOCV error vector ε is minimized. In the kernel interpolation setup, the components of ε are formed by leaving out one sampling location x i at a time, calculating the interpolant based on the remaining ones only and taking ε i to be the difference between the true value f(x i ) and the approximation s f,X (x i ). This procedure yields good choices of the scaling parameter and can be implemented such that the calculation of ε for a given kernel can be done with the computational cost of order O(n 3 ), where n is the number of sampling locations. A more recent paper [24] discusses extensions of Rippa's algorithm that have been applied in the context of an iterated approximate moving least-squares approximation of function value data and RBF (radially symmetric kernels) pseudo-spectral methods for the solution of partial differential equations.
LOOCV does not make any explicit modelling assumptions and is therefore also applicable in the geostatistical framework. In the geostatistical literature, however, cross validation is mainly used as a diagnostic tool to compare the performances of geostatistical models. Traditionally, variogram-based estimation methods have been used (see e.g. [11, or [9,Chapter 2] for details) since an estimate of the variogram usually constitutes the first step in the exploratory analysis of geostatistical data.
Here we focus on maximum likelihood estimation [49], which is applicable in all of the kriging setups presented above, and makes optimal use of the information contained in the data [37, Chapter 2 and Theorem 8.1]. It is usually derived under the additional modelling assumptions that Z is a Gaussian random field, and that K belongs to some parametric class {K θ : θ ∈ Θ} of covariance models. In the simple case where Z has a zero mean, the log likelihood function, i.e. the logarithm of the probability density function of the random vector (Z(x 1 ), . . . , Z(x n )) evaluated with the data vector f := (f(x 1 ), . . . , f(x n )) is then given by with A θ as defined below and |A θ | denoting its determinant. The maximum likelihood estimator then chooses the parameter that maximizes l(θ), reasoning that under the corresponding stochastic model observing the data f becomes most likely. An extension that works for both the case of a non-trivial mean of the form (3.5) and the case of a generalized covariance function was proposed by Kitanidis [43]. The idea is to use the information of n − q allowable linear combinations of f only, rather than the complete data vector. In the universal kriging setup this causes the mean function to be filtered out from the data. This procedure is called restricted maximum likelihood (REML) estimation, and it can be shown [36] that the restricted log likelihood function can be written as with A θ and P by An elementary introduction to maximum likelihood methods in spatial statistics is given in [44]. A major drawback seems to be the strong assumption that Z is Gaussian under which the maximum likelihood estimator is derived. In [72], however, an alternative derivation of REML in the framework of kernel interpretation (where much weaker modelling assumptions are made) is given, and a numerical study with several nonstochastic test cases is presented in which REML often yields very good choices of K.
Within the Bayesian paradigm, parameter selection and interpolation are not formally distinct. The full specification of a probabilistic model permits, via Bayes' Theorem, to obtain posterior distributions for the unobserved values of f, the trend parameters β 1 , . . . , β q and the covariance parameter θ. One could even step up yet another level and let the Bayesian methodology choose between different parametric model structures [62,Section 5.2]. This unified treatment of model selection and interpolation has the advantage that the additional uncertainty due to the fact that the data-generating model is unknown is reflected in posterior distributions. These distributions can, however, in general not be stated in closed form. For certain choices of the priors, some of the integrals that result from the repeated applications of Bayes' Theorem within the hierarchical model specification can be calculated analytically [17,33], but the final posterior distributions usually require numerical approximations or Markov chain Monte Carlo (MCMC) methods.
In the situation (4.1) where only noisy observations of f are available, the main focus is on estimating the regularization parameter λ in (4.2). Wahba [82] discusses a generalized cross-validation (GCV) procedure which has the advantage over standard LOOCV that it achieves certain desirable invariance properties (see [83] for a detailed motivation and asymptotic results for GCV). While Stein [76] proves that REML is asymptotically (as the sampling locations get increasingly dense) superior to GCV when the geostatistical assumptions are true, asymptotic results from Wahba [82] suggest that REML can fail when f is a smooth deterministic function, whereas GCV chooses a good λ in all frameworks. The following example, however, shows that a rather different behaviour may be observed in our interpolation framework and finite settings.
We illustrate and compare LOOCV and REML with a test function (the 'borehole model') used by many authors (e.g. [40,56]) to compare different methods in computer experiments. Examples from this field of application are particularly interesting in the context of the present paper because they are typically deterministic in nature but considered as realizations of Gaussian random fields.   ranges of interest are summarized in Table 4. We rescale these variables to the range (1, 3) and use the same orthogonal sampling design as Joseph et al. [40] with [27] locations. We now assume f to be a realization of a stationary Gaussian random field with covariance function of the Gaussian type. Its mean function will be considered constant but unknown so that we are in the framework of ordinary kriging (see Section 3). Table 5 shows the estimates for the parameters θ 1 , . . . , θ 8 and σ 2 obtained via REML, LOOCV 1 and LOOCV 2 , where the subscript indicates that either the · 1 -norm or the · 2 -norm of the cross-validation errors is minimized. Unlike real-world applications of computer experiments, the borehole function is cheap to evaluate, and this allows us to calculate its values on the grid G := {1, 1.5, 2, 2.5, 3} 8 on the space of the scaled input variables and compare them with the values predicted via ordinary kriging with the covariance functions estimated by different methods. The  following error statistics are given in Table 5: In the borehole example, both root mean squared error (RMSE) and mean absolute error (MAE) are the lowest for the interpolant computed with the REML estimates, but the LOOCV estimates too give good results. To judge how well the kriging variance describes the prediction uncertainty, one can look at the mean absolute standardized errors (MAStE). If the kriging variance (which also depends on the estimated parameters) has correct magnitude, the absolute standardized errors should average to 1, and indeed all three parameter choices yield an MAStE quite close to that. Since we have assumed a Gaussian random field, we can go even further and calculate, for every x ∈ G, the probability integral transform (PIT) F x (f(x)), where F x is the cumulative distribution function of a Gaussian distribution with mean s f,X (x) and variance P 2 K,X (x). F x is a probabilistic forecast of f(x) that automatically comes with our stochastic modelling assumptions. If it is correct, then the PIT values have a uniform distribution in [0, 1], and this property can be checked by plotting them in the form of a histogram [2]. The PIT histograms in Figure 3 are quite far from uniformity, suggesting that the assumption of a Gaussian distribution is rather questionable. It is quite remarkable that REML, which is based on this assumption, does an excellent job in selecting good parameters, and we found that this is also true for many other test cases (see [72]).
A critical issue about REML estimation is the computational cost of O(n 3 ) floating point operations for each choice of θ, which is prohibitive for large spatial data sets. When all sampling locations are on a (near-)regular lattice, spectral methods to approximate the likelihood can be used and allow to reduce the computational cost to an order of O(n log(n)) [13,29,32]. These techniques cannot be applied to scattered data, but other approaches to approximating likelihoods [5,47,79,81], covariance tapering [30] or simplified Gaussian models of low rank [3,12,21] have been proposed and shown to be quite effective in reducing the computational effort to an order that allows the application of REML in most practical situations.

Discussion
A variety of practical problems amount to or can be linked to the mathematical problem of data interpolation. In this paper two approaches -kernel interpolation and krigingwere presented and their interconnections were pointed out. In either framework the interpolation procedure is optimal in a certain sense, but optimality is based on the assumption that the 'correct' kernel is used. Answers given by numerical analysts and statisticians to the question about the consequences on approximation accuracy of using an 'incorrect' kernel were discussed. Finally, some methods for choosing a suitable kernel based on the given data were presented.
The borehole example analysed in the preceding section poses the interesting challenge that it is not entirely clear if a stochastic modelling perspective is appropriate. While this does not matter anyway with respect to the interpolation method, it is comforting to see that with both cross validation and maximum likelihood good choices of an interpolation kernel are obtained. At first sight, this seems to contradict the asymptotic results by Wahba [82] mentioned above. It seems, however, that in this and many other examples the sample size is simply too small for asymptotic statements to hold. Moreover, in Wahba's setup the actual interpolation kernel is fixed, and only λ is estimated. Our belief is that REML is mostly competitive even in deterministic settings as long as it can choose from a sufficiently flexible class of kernels that permits, for example, adaptation to the regularity of f. Generally, when a high approximation accuracy is expected, the deterministic perspective seems more appropriate. When the data are sparse and/or f has low regularity, a random field model often yields a good description of f. The transition between the two perspectives and their respective methodologies, however, is rather smooth.
We have focused our discussion on topics that are relevant for both numerical analysts and statisticians. An important issue in kernel interpolation not mentioned so far is that of an ill-conditioned equation system (2.15). This problem frequently arises because in the deterministic framework very smooth and flat kernels are often preferred since they can achieve high convergence rates when f is very smooth (see Sections 5 and 6). Such kernels, however, inevitably lead to ill-conditioned systems which are a big challenge for numerical algorithms, and they call for special techniques such as preconditioning or changes of basis. If the standard basis K(·, x 1 ), . . . , K(·, x n ) is used as suggested by representation (2.14), ill-conditioning is tied to smoothness of kernel and small approximation errors in terms of power function [66]. But the interpolant s f,X in function space is not dramatically ill-conditioned [15] such that ill-conditioning is a problem of bad basis, not a problem of the reconstruction process. In geostatistics, the variables of interest in typical applications are usually very rough and call for kernels with low smoothness, and so ill-conditioning is usually not a big issue.
We shall finally mention a field of research where the methods discussed in this paper are applied in a slightly different context: the field of machine learning. The problem studied there can again be formulated as an interpolation problem, and both stochastic and deterministic modelling approaches can be used for its solution. An outline of connections to Gaussian processes and reproducing kernels is given in [62,74]. Van der Vaart and van Zanten [80] discuss the Bayesian approach to the machine learning problem and provide -in a slightly different setting and based on a different risk function -results on convergence rates and the role of regularity of the covariance kernel similar to the results that have been discussed in Sections 5 and 6. In machine learning too it is not always obvious if stochastic modelling assumptions are appropriate, and so understanding the implications of different assumptions and identifying the scope of the corresponding methods seem vital.