From Estimation to Prediction of Genomic Variances: Allowing for Linkage Disequilibrium and Unbiasedness

The additive genomic variance, the chief ingredient for the heritability, is often underestimated in phenotype-genotype regression models. Various remedies, including different models and estimators, have been proposed in order to improve on what has been coined the “missing heritability”. Recently, debates have been conducted whether estimators for the genomic variance include linkage disequilibrium (LD) and how to explicitly account for LD in estimation procedures. Up-to-now, the genomic variance in random effect models (REM) has been estimated as a parameter of the marginal, i.e. unconditional model. We propose that the genomic variance in REM should be predicted as a conditional random quantity based on the conditional distribution of β. This signifies a paradigm shift from the estimation to the prediction of the genomic variance. This approach is structurally in perfect accordance to the Bayesian regression model (BRM), where the posterior expectation of the genomic variance is estimated based on the posterior of β. We introduce a novel, mathematically rigorously founded predictor for the conditional genomic variance in (g)BLUP, which is structurally close to the Bayesian estimator. The conditioning of the novel predictor on the data is intrinsically tied to the inclusion of the contribution of LD and the predicted effects. In addition to that, the predictor shows much weaker dependence on distribution assumptions than estimators of other approaches, e.g. GCTA-GREML. Last but not least, the predictor, contrasted with the estimator in the unconditional model, enables an innovative approximation of the influence of LD on the genomic variance in the dataset. An exemplary simulation study based on the commonly used dataset of 1814 mice genotyped for 10346 polymorphic markers substantiates that the bias of the novel predictor is small in all standard situations, i.e. that the predictor for the conditional genomic variance remarkably reduces the “missing heritability”.


Introduction
T he additive (genotypic) variance is defined as the variance of the breeding value (BV) and is the chief cause of resemblance between relatives and therefore the most important determinant of the response of a population to selection (Falconer and Mackay 1996). In addition to that, the additive variance can be estimated from observations made on the population and is a principal component of the (narrow-sense) heritability, which is one of the main objectives in many genetic studies (Falconer and Mackay 1996). The heritability is eminent, amongst other things, for the prediction of the response to selection in the breeder's equation (Piepho and Moehring 2007;Hill 2010). Although non-additive genomic variation exists, most of the genetic variation is additive, such that it is sufficient to investigate the additive genetic variance (Hill et al. 2008). In more detail, epistasis is only important on the gene-level but not for genetic variances (Hill et al. 2008), and Zhu et al. (2015) show that for human complex traits dominance variation contributes little. Nevertheless, linkage disequilibrium (LD) is an important factor especially when departing from random mating and Hardy-Weinberg equilibrium, which is often the case in animal breeding (Hill et al. 2008;Dempfle 2018). The genomic variance, the genomic equivalent to the genetic variance, is defined as the variance of a trait that can be explained by a linear regression on a set of markers (de los Campos et al. 2015). Many authors have been chasing what is sometimes coined "missing heritability" (Maher 2008) which means that only a fraction of the "true" genomic variance can be captured by regression on influential loci. To begin with, researchers have used genome-wide association studies (GWAS) in order to find quantitative trait loci by using fixed effect regression combined with variable selection. After having added the estimated corresponding genomic variances of the single statistically significant loci, they asserted that they could only account for a fraction of the "true" genetic variance. For instance, Maher (2008) found that only 5% instead of the widely accepted heritability estimate of 80% of human height could be explained. Golan et al. (2014) state that the "true" genetic variance is generally underestimated when applying variable selection, e.g. GWAS, to genomic datasets which are typically characterized by their high-dimensionality, where the number of variables p is much larger than the number of observations n. It is well known that a lot of traits are influenced by many genes and that at least some loci with tiny effects are missed when using variable selection or even single-marker regression models. Consequently, Yang et al. (2010) decided to fit all common markers jointly using genomic best linear unbiased prediction (GBLUP), where they assume the effect vector to vary at random. Then, they estimated the genomic variance using restricted maximum likelihood (REML) in an approach that they termed genome-wide complex trait analysis (GCTA) GREML (Yang et al. 2011). They showed that quantifying the combined effect of all single-nucleotide polymorphisms (SNPs) explains a larger part of the heritability than only using certain variants quantified by GWAS methods. They illustrate their results on the dataset on human height by pointing out that they could explain a heritability of about 45%. They concluded that the main reason for the remaining missing heritability was incomplete linkage disequilibrium of causal variants with the genotyped SNPs, which refers to the general difference of genetic and genomic variances (de los Campos et al. 2015). Recently, there has been a general discussion whether estimators for the genomic variance account for linkage disequilibrium (LD) between markers, which is defined as the covariance between additive effects of marker pairs (Bulmer 1971). Some authors argue that estimators similar to GCTA-GREML lack the contribution of LD (de los Campos et al. 2015;Kumar et al. 2015Kumar et al. , 2016Lehermeier et al. 2017) whereas others (Yang et al. 2016) resolutely disagree. More specifically, Kumar et al. (2015Kumar et al. ( , 2016 state that in GCTA-GREML the contributions of the p markers to the phenotypic value are assumed to be independent normally distributed random variables with equal variances. Thus, they claim that the random contribution made by each marker is not correlated with the random contributions made by any other marker which leads to a negligence of the contribution of LD to the genomic variance. Moreover, Kumar et al. (2015Kumar et al. ( , 2016 criticize GCTA-GREML because of the assumption that the estimated genomic relationship matrix (GRM) is treated as a fixed quantity without sampling errors although the GRM is actually a realization of an underlying stochastic process. In a study on the model plant Arabidopsis (The 1001 Genomes Consortium 2016), Lehermeier et al. (2017) use Bayesian ridge regression (BRR) to relate the phenotype flowering time to the genomic data. They use an estimator (termed M2) based on the posterior distribution of the marker effects obtained by Markov Chain Monte Carlo (MCMC) methods and show that this estimator explains a larger proportion of the phenotypic variance than the estimator, termed M1, based on GBLUP (VanRaden 2008; Yang et al. 2010Yang et al. , 2011. Lehermeier et al. (2017) show that the reason for the better performance of the Bayesian estimator for the genomic variance is the explicit inclusion of disequilibrium covariances. In this article we investigate the additive genomic variance in linear regression models within the framework of quantitative genetics, i.e. the genetic variance stems from the variation of QTL genotypes whereas the effects of alleles on a trait are fixed parameters (Falconer and Mackay 1996;de los Campos et al. 2015). The difference of individuals in their genetics values is caused by the inter-individual differences in allele content at QTL (de los Campos et al. 2015). These assumptions are reflected in the fixed effect model (FEM) that we treat in Section Fixed Effect Model (FEM), where β is a deterministic parameter and the genomic variability comes in only through the randomness of the marker content. We show that the genomic variance in FEM explicitly includes the contribution of LD and we derive a nearly unbiased estimator for the genomic variance in this model, i.e. that the remaining bias consists only of possible correlations between the plug-in quantities. However, the FEM is not applicable to genomic datasets that are characterized by their high-dimensionality. As a remedy, Bayesian regression models (BRM) and random effect models (REM) are often used. In this models, tough, the effect vector β is defined as a random variable and therefore these models do not ly within the framework of classical quantitative genetics. We investigate the expression for the genomic variance in FEM, BRM and REM and notice that, in general, the genomic variance strongly depends on the assumptions for the effect vector. In BRM in Section Bayesian Regression Model (BRM), β is assigned a prior distribution and we seek its posterior distribution by means of the likelihood of the data. We show that in this model set-up it is necessary to consider the genomic variance as a random quantity and not as a fixed population parameter. This results in the estimation of the posterior expectation of the (random) genomic variance, which has already been hinted at in Lehermeier et al. (2017). In Section Random Effect Model (REM) we show that up-to-now, the genomic variance in REM has been estimated as a parameter of the marginal, i.e. unconditional model (e.g. GCTA-GREML). By strictly conditioning on the effect vector as in BRM, we constitute a paradigm shift from the estimation of the marginal genomic variance to the prediction of the random conditional genomic variance, which is structurally in perfect accordance to the posterior genomic variance in BRM. Inspired by the prediction of random effects (or in equivalent terminology: the estimation of the realized values of random effects) introduced by Henderson (1984) at the beginning of his chapter on prediction of random variables, we call our procedure the prediction of the genomic variance in REM. To this end, we introduce a mathematically founded nearly unbiased predictor for the genomic variance that is adapted to the specified model assumptions. The application of the conditional genomic variance explicitly allows for an adaptation of the genomic variance to the data which is caused by the radical change in the structure of the conditional variance-covariance of β compared to the structure of marginal variance-covariance of β (from a diagonal structure to an arbitrary structure). By doing so, we take on the above mentioned critique that GCTA-GREML neglects the contribution of LD due to the diagonal covariance structure of the marginal β (Kumar et al. 2015(Kumar et al. , 2016. We show that the conditional genomic variance explicitly accounts for LD which has special practical relevance in animal breeding (Dempfle 2018), and remarkably reduces the missing heritability. In addition to that, the difference of the novel predictor and the estimator of the marginal genomic variance in REM can be used as an indicator for the contribution of LD to the genomic variance. In general, the conditional genomic variance in REM is structurally similar to the genomic variance in FEM and therefore has an interpretation close to the classical genetic variance from quantitative genetics. For reasons of clarity, we provide all calculations and detailed derivations in the Appendix. We illustrate our results for ordinary least-squares (OLS) from the class of FEM, for BRR from the class of BRM and for (G)BLUP from the class REM on simulated datasets, where we borrow the covariance structure from the commonly used dataset on 1814 mice that comes with the R-package "BGLR" (Perez and de los Campos 2014).

Linear Models and the Genomic Variance
We consider the basic additive linear model where Y is the phenotype of a random individual, µ is a deterministic intercept and β is a p-vector of marker effects. The random allele content at the markers is coded by the random row-p-vector X with expectation E[X] = 0 (in Subsection Notes on the mean-centering of X in the Appendix we consider deviations from this assumption) and covariance matrix Σ X . The residual ε is assumed to be independent of Xβ and normally distributed with mean 0 and variance σ 2 ε . The additive genomic variance V is then defined as the variance of the genomic value Xβ which consists of the inter-individual differences in allele content at the markers as well as the effects of the markers themselves (de los Campos et al. 2015): Due to independence of Xβ and ε we can separate the phenotypic variance σ 2 Y in the genomic variance V and in the residual variance σ 2 ε : Typically, one considers n realizations of model (1), i.e. y i = Y|(X = x i1 , x i2 , ..., x ip ) for i = 1, ..., n, which results in the conditional (on X) model where X denotes the n × p design matrix containing n realizations of the stochastic p-vector X. We consider mean-centered data: ∑ n i=1 x ij = 0 for j = 1, ..., p (in Subsection Notes on the mean-centering of X in the Appendix we consider deviations from this assumption). An unbiased estimator for Σ X is given by the method-of-moments estimator: Fitting linear models to data is based on model (4) and many authors (Piepho and Moehring 2007;Yang et al. 2010Yang et al. , 2011Piepho et al. 2012;Janss et al. 2012;Lee and Chow 2014;Lehermeier et al. 2017) also build their studies on the investigation and estimation of the genomic variance on model (4).
We base our analysis for the (theoretical) expression of the genomic variance on (the theoretical) model (1) for several reasons.
First of all, model (1) describes the underlying data-generating process which induces the realized model (4). By defining the genomic variance V given by (2) in model (1), we tackle the criticism of using the realized GRM without accounting for estimation uncertainty issued by Kumar et al. (2015Kumar et al. ( , 2016. More importantly, treating X as a random variable represents the interpretations from quantitative genetics that the genetic variability is caused by variation in QTL content. Strikingly, the genomic variance V in the fixed effect model in Section Fixed Effect Model (FEM) constantly equals 0 when building on model (4).

Fixed Effect Model (FEM)
In quantitative genetics the uncertainty about genetic values is assumed to stem from the uncertainty in allele content at the markers, whereas the effects are population parameters and therefore possess no variance (Falconer and Mackay 1996;Gianola et al. 2009;de los Campos et al. 2015). In accordance to that, we consider β here as a p-vector of fixed effects, i.e. as a deterministic population parameter. Consequently, we calculate the genomic variance V defined in (2) as where the expression β Σ X β is the genomic equivalent of the genetic variance defined in quantitative genetics textbooks for multiple QTL (Lehermeier et al. 2017). We have split this term up into the additive locus-specific variance (also called genic variance) and the contribution of linkage disequilibrium between different loci. We notice that the genomic variance V f is a weighted sum of the variances of the single marker content and the covariance between the content of the markers, where the weights are given by the elements of the fixed population effect vector β.
Replacing β and Σ X in (6) by unbiased estimatorsβ andΣ X leads to the plug-in estimator for the genomic variance V f (6). The estimatorV bias f (7) contains second order products of the random variablesβ j , j = 1, ..., p, and is therefore a biased estimator for (6). We correct for the covariance of the estimatorβ by defininĝ as a less biased estimator for V f (6), whereΣβ denotes an unbiased estimator for the variance-covariance matrix Σβ := Cov β ofβ.
In the case that we have more observations (individuals) n than variables (markers) p we can fit the linear model (4) using ordinary least-squares (OLS), for instance. We make the note that if we would base the definition of the genomic variance V given by (2) on the genomic value in model (4), we would obtain V f = Var(Xβ) ≡ 0. As an outcome of the OLS application we obtain the estimated effect vectorβ and its estimated covariance matrixΣβ. Plugging these quantities intoV f (8) we obtain an improved estimator for the genomic variance V f (6) in OLS and notice that the empirical phenotypic sample varianceσ 2 y splits up as in (3) into the unbiased estimatorV f (8) for the genomic variance in FEM and the unbiased estimator for the residual varianceσ 2 ε :σ 2 y =V f +σ 2 ε .
This implies that with mean-centered data we can expect the improved estimator for the genomic variance and the estimator for the residual variance to sum up exactly to the phenotypic variance regardless of the data considered. This implies that when using the OLS method to fit a linear model, using the less biased estimatorV f (8) to estimate the genomic variance contribution of all markers is equivalent to simply subtracting the residual variance from the phenotypic variance. We refer to Section FEM in the Appendix for a detailed mathematical derivation of the results in this section.

Bayesian Regression Model (BRM)
Due to the paper of Meuwissen et al. (2001) the usage of Bayesian methods has strongly increased in quantitative genetics. The high-dimensionality of genomic data necessitates some way of regularization. The basic idea of adjustment in Bayesian regression models is to express uncertainty of the effect vector β by assigning it a prior distribution. Then, by adapting to the data by means of its likelihood, one attains the posterior distribution of the effect vector.
In the first place, we consider the linear model (1) again where β possesses the prior distribution p(β) with prior expectation µ β (often chosen as 0) and prior variance-covariance matrix Σ β . The specific form of the distribution of β is not relevant for the following analysis. The genomic variance V given by (2) equals This expression for the genomic variance is meaningless because we can arbitrarily strongly influence it by the choice of the prior expectation and prior variance-covariance matrix. Instead, we require the genomic variance in BRM to move away from the prior assumptions by adapting to the data. In order to enable this Bayesian learning, we consider the variance of the genomic value Xβ conditional on the effect vector β: which is a quadratic form in the effect vector β. By assigning β a prior distribution, the genomic variance W (11) is assigned a prior distribution with prior expectation In the conditional (on X) linear model (4) investigations in BRM are performed on the posterior distribution of β by adapting to the phenotypic data y. We use characteristics of the posterior distribution p(β|y) of β to infer the posterior distribution of the genomic variance W given by (11), or equivalently the posterior distribution of the quadratic form W of β. We define the posterior mean of the genomic variance W as and notice that it comprises the posterior expectation µ β|y := E[β|y] and the posterior variance-covariance matrix Σ β|y := Var(β|y) of β. The expression W b structurally resembles the prior expectation V b of W but includes the posterior mean as well as the posterior covariance of β instead of the prior moments. Structurally, the expressions W and W b resemble the genomic variance V f given by (6) in the FEM in Section Fixed Effect Model (FEM) and explicitly include the contribution of LD, where the role of the weights for the covariance terms of X formerly played by β i β j , i = j, in V f , see equation (6) is taken over by the off-diagonal elements of the matrix of the posterior second moments E[ββ |y] of β. Hence, W b can be split up in the genic variance and a part including the contribution of LD similar to V f in (6).
There are many different approaches to fit the conditional model (4) in BRM that mainly differ in the choice of the prior distribution for the effect vector β. Most of the time, the posterior distribution of β is approximated using MCMC methods. Then, it is possible to estimate characteristics of the (posterior) effect vector by the mean value or the empirical variance of the resulting Markov chain. In this context, we denote the sequence of the Markov chain of the estimated effects, after discarding the burn-in iterations and after thinning the chain, by the sequence of p-vectors β (m) m=1,...,M . These vectors are draws from the distribution p(β|y). Consequently, we express the quantities µ β|y and Σ β|y by their empirical counterparts, namely the estimated posterior mean (also often just termed the estimated effects)μ β|y of βμ and the estimated posterior covariancê We propose to plug (13) and (14) into the estimator for the mean of the posterior genomic variance W b (12) in BRM, whereΣμ β|y denotes an unbiased estimator for the covariance Σμ β|y of the estimated effectsμ β|y . A detailed mathematical derivation of the results in this section has been provided in Subsection BRM in the Appendix, where we also show that plugging (13) and (14) into (12) is equivalent to using which can explicitly be interpreted as an estimator for the posterior mean of the genomic variance W in (11), in which the empirical mean is taken over the realizations in every MCMC sample. The estimator W Post is called M2 in Lehermeier et al. (2017) and it has already been mentioned that this approach draws inferences on the posterior distribution of the genomic variance, whereas the estimator M1 mentioned in Lehermeier et al. (2017) equals tr(Σ XΣβ|y ). By strictly deriving W b , see (15), as well as its corresponding theoretical expression W b in (12) we have also derived a relation of the estimators used in Lehermeier et al. (2017): becauseΣμ β|y ≈ 0. Gianola et al. (2009) shows on toy examples how the genomic variance changes when treating the effect vector as random instead of treating it as a fixed quantity. In random effect models we assume that the effect vector β in model (1) is a normally distributed random variable with mean 0 and diagonal variancecovariance matrix with equal variances σ 2 β , which is equivalent to modeling the single p components of β as independent random variables β j ∼ N (0, σ 2 β ), j = 1, ..., p. We obtain the marginal genomic variance V in model (1) as

Random Effect Model (REM)
The effect vector in the linear model (4) in REM is treated as random. Consequently, it is not possible to estimate β but we have to predict β in the terminology introduced by Henderson (1984), which means that the realized values of the random effects are estimated. Common approaches to find a best linear unbiased predictor (BLUP) for β are based on the mixed model equations (Henderson 1984) or in general on maximum-likelihood approaches (Searle et al. 1992). The variance components σ 2 ε and σ 2 β in model (4) are usually estimated using restricted maximum likelihood (REML) (Patterson and Thompson 1971). Subsequently, the marginal genomic variance V r (18) can be estimated byV The derivation of the genomic variance in REM is often entirely based on the realized model (4) (Piepho and Moehring 2007;Yang et al. 2010Yang et al. , 2011Piepho et al. 2012) which results in the genomic variance-covariance matrix The (realized) genomic variance is estimated as the mean value of the n genomic variances on the diagonal of (20): Due to the properties of the trace,V real r is (approximately) equivalent to the estimated genomic varianceV r in (19) that is based on the marginal variance V r , see (18) in model (1). Thus, there is no difference in considering the stochastic model (1) or the realized model (4) for the expression of the marginal genomic variance in REM. Due to computational advantages, it is common (Piepho and Moehring 2007;VanRaden 2008;Piepho 2009;Yang et al. 2010Yang et al. , 2011Speed et al. 2012;Janss et al. 2012;Lee and Chow 2014;Fernando et al. 2017) to consider the so-called linear equivalent model (Henderson 1984) to model (4), i.e. y in (22) is equally distributed as y in model (4). In the linear model (22) the n-vector of genomic values is called breeding-value (BV) (VanRaden 2008; Hill 2010) and describes the expected performance of a progeny. The covariance matrix σ 2 β XX of g can be replaced by some sort of equivalent genomic relationship matrix (GRM) G (VanRaden 2008) which is the reason why a model fit in REM based on model (22) is called genomic BLUP (GBLUP). For high-dimensional data where p n, it is computationally more efficient to investigate the n-vector g and its n × n covariance matrix than the p-vector β and its corresponding p × p covariance matrix. Basically, where σ 2 g := pσ 2 β and G := 1 p XX . The estimated equivalent genomic varianceV equi r in the equivalent model (22) where additionally the mean trace of the GRM G is often standardized to equal 1. Consequently, the marginal genomic vari-anceV equi r in the equivalent model (22) equalsσ 2 g . This approach is termed GCTA-GREML (Yang et al. 2010(Yang et al. , 2011. The estimated genomic varianceσ 2 g is equivalent to the estimated genomic varianceV real r (21) and therefore also in accordance with the marginal genomic variance V r given by (18). No matter which of the equivalent approachesV r (19),V real r (21) orV equi r (25) to estimate the marginal genomic variance V r (18) is used, they are similar to the first part of expression V f (6), namely ∑ p j=1 β 2 j Var(X j ). But instead of weighting the variances of the allele content by different components of the (fixed) effect vector β, the weights in V r , see (18), equal the variance component σ 2 β for every locus. More strikingly, the covariances between the different loci take no part in V r in (18) but they do so in V f in (6). Nevertheless, it is not clear how strong the disequilibrium covariances are involved in the estimation ofσ 2 β orσ 2 g in the REML equations and implicitly influence the estimatesV r (19), V real r (21) andV equi r (25). The assumptions on the marginal distribution of β (especially on its covariance structure) are very influential and cause the marginal genomic variance V r in (18) to be unsatisfactory. This is similar the genomic variance V b in (10) in Section Bayesian Regression Model (BRM) that is arbitrarily strongly influenced by the prior moments of β. As a consequence, we consider the genomic variance V (2) conditional on the effect vector β, analogously to Section Bayesian Regression Model (BRM): which is a quadratic form in the normally distributed effect vector β. The genomic variance W is a random variable with E[W] = V r . Investigations on the random variable W have to be done similar to investigations on the random effect β in REM, namely by a strict conditioning on the phenotypic data y in accordance to the prediction (Henderson 1984) of the effect vector β, where the BLUP µ β|y := E[β|y] of β is given by the conditional expectation of β on y (Searle et al. 1992). Consequently, we define an unbiased predictor for the random genomic variance W in (11) as the expectation of the random variable W conditional on the data y: where we have used the conditional variance-covariance matrix Σ β|y := Var(β|y) of β additional to the BLUP µ β|y . The predictor W r for the genomic variance W is structurally in perfect accordance with the posterior genomic variance W b in (12) and consequently has the same interpretation as V f , see (6), similar to the genetic variance in quantitative genetics. Most importantly, opposed to the marginal genomic variance V r in (18), the predicted genomic variance W r includes the contribution of disequilibrium covariances similar to V f in (6). This is done by weighting the covariances of X with the off-diagonals of the matrix of the conditional second moment E[ββ |y] of β. Hence, W r can be split up in the genic variance and a part including the contribution of LD similar to V f in (6). We examine the covariance of β|y in model (4) more closely and obtain: The marginal covariance structure σ 2 β 1 p×p of β (components independent with equal variances) in V r in (18) changes drastically when considering the conditional (on y) covariance structure Σ β|y of β. In this conditional approach, the single components of β|y are not equally and independently distributed, but posses an arbitrary covariance structure by adapting to the data by means of the likelihood of the data similar to the posterior covariance Σ β|y in Section Bayesian Regression Model (BRM). By introducing the concept of the prediction of conditional genomic variance, we tackle one of the central points of critique of GCTA-GREML issued by Kumar et al. (2015Kumar et al. ( , 2016. We substitute the variance components implicitly included in W r in (26) by estimates and obtain a nearly unbiased predictor for the conditional genomic variance : or W equi r when the prediction procedure is based on the equivalent model (22). By considering the genomic variance in REM as random and deriving a predictor for this random variable, we also bridge the gap between the estimation of the posterior mean of the genomic variance in BRM and estimation of the marginal variance in REM that has been observed in Lehermeier et al. (2017). For a detailed mathematical derivation of the results in this chapter we refer to Subsection REM in the Appendix. For an extension of the REM to the mixed-effect model (MEM) we refer to Subsection MEM in the Appendix.

Statistical Analysis
In this section we compare the performance of the estima-torsV bias f in (7) andV f in (8) from Section Fixed Effect Model (FEM), the performance of the estimator W b in (15) from Section Bayesian Regression Model (BRM) and the performance of the estimatorV r in (19)as well as the predictor W r in (28) from Section Random Effect Model (REM) with respect to the genomic variance V f defined in (6). We have already mentioned in Section Fixed Effect Model (FEM) that the genomic variance V f is the genomic equivalent of the genetic variance as defined in quantitative genetics for multiple QTL and explicitly accounts for the contribution of LD. We executed all calculations in this section with the free software R (R Development Core Team 2008).

Preparation of Datasets
We considered the mice dataset that comes with the R-package BGLR (Perez and de los Campos 2014). The data originally stem from an experiment from Valdar et al. (2006a,b) in a mice population. The dataset contains p = 10346 polymorphic markers that were measured in n = 1814 mice. The trait under consideration was body mass index (BMI). In order to compare the estimators from the FEM with the ones from BRM and REM we created a second dataset (the reduced mice dataset) where we included only the firstp = 0.6n ≈ 1088 markers, such thatp < n holds true. We used the n × p (n ×p) matrix X coding the marker content from the mice (reduced mice) dataset to obtain a realistic LDstructure for the further analysis. In order to obtain modified datasets with different QTL-to-marker densities we assigned k out of the p (p) markers to be QTL. We attributed each designated QTL with a corresponding "true" (fixed) effect k-vector β k . Then, we calculated the "true" genomic variance V k as using formula (6) for V f from Section Fixed Effect Model (FEM) because it resembles the genetic variance as defined in quantitative genetics. We denote by X k the restriction of the marker content data to the (designated) QTL content. For each k, we calculated the covariance matrix Σ X k applying the method of moments estimator (5) to the QTL content data X k with all individuals (serves as the whole population). It has been claimed that the main source of missing heritability is imperfect LD between markers and QTL (Yang et al. 2010) which we exclude by explicitly assigning markers to be QTL. In addition to that, the genomic variance under consideration is purely additive and the variance-covariance matrix of the QTL content is given. Consequently, the performance of the estimators and of the predictor depends only on their ability to represent the genomic variance V k for all k.
In order to investigate the estimation (prediction) procedures for each k for different levels of heritabilities we set the true error variance σ 2 ε equal to 1 and multiplied the "true" effect vector β k by the constant c k , where This results in considering "true" genomic variances of V k ∈ {0.25, 1, 4} for each QTL-marker ratio k/p (k/p). We drew n realizations of ε from a normal distribution with mean 0 and standard deviation σ ε = 1, calculated the phenotypic values y k using the additive linear model (4), and hence obtained several modified genomic datasets with phenotypic and genotypic values for each V k and h 2 k ).

Model-fitting and Genomic Variance Calculation
Given the phenotypic and genotypic data described in Subsection Preparation of Datasets, we fitted the OLS model using the R-function "lm" and obtained the estimated effect vector β as well as the estimated error variancesσ 2 ε . We used these in order to calculate the biased estimatorV bias f in (7) as well as the nearly unbiased estimatorV f in (8). The OLS method is not well-defined in applications where the number of markers p is larger than the number of individuals n. Therefore we applied this method only to the reduced mice dataset. We fitted the BRR model with the function "BGLR" with the specification of the model equal to "BRR" in the R-package "BGLR" (Perez and de los Campos 2014). We decided to use 30000 iterations of the Markov chain and discarded the first 10000 as burn-in, after we had exemplarily checked the convergence of the resulting Markov chain and asserted convergence in every case. We kept only every fifth realization of the remaining chain in order to obtain approximate independence. This left us with M = 4000 state values that are assumed to be representative of the posterior distribution. As a result of the application we obtained estimatorsμ for the intercept,σ 2 ε for the residual variance, and a M × p (M ×p) matrix with realizations of the estimated effect vectorβ (m) , m = 1, ..., M, in every state m of the considered Markov chain. We plugged these into W post , see (16), in order to calculate an estimator for the posterior expectation of the genomic variance W b defined in (12). We fitted the (G)BLUP model in its equivalent form (22) as in Section Random Effect Model (REM) by using the R-package "sommer" (Covarrubias-Pazaran 2017) and in particular its function "mmer". We obtained the predicted effectsμ g|y and the estimated variance componentsσ 2 g andσ 2 ε . We used these quantities in order to calculate the estimatorV equi r in (25) for V r and the predictor W equi r in (29) for the conditional genomic variance W r . Despite the explicit implementation ofV equi r and W equi r we use the equivalent quantitiesV r , see (19), and W r , see (28), to describe the simulation studies in order to emphasize the derivation using the stochastic data-generating process X.

Performance Indexes
We compared each estimatorV for the genomic variance V k in (30) with respect to the absolute value of the relative bias and their relative root-mean-squared-error For the analysis in Subsection Variation of QTL-Allocations. we define the relative contribution rLD of LD to the genomic variance V k as and the indicator I r in REM for the contribution of LD to the genomic variance as

Variation of Observational Data
We randomly selected k QTL as described in Subsection Preparation of Datasets and fixed them for the further analysis, where we chose the number k from the sets K m := {10, 100, 500, 1000, 2000, 5000, 10000} for the mice dataset and K rm := {10, 50, 100, 200, 500, 1000} for the reduced mice dataset. For practical reasons of creating effect vectors with shapes of realizations of normal distributions or the heavier-tailed gamma distribution, we chose the "true" effect vector β k as a realization (i.e. fixed value) according to the distributions depicted in Table  1. Formally, we considered an unknown data-generating process X with n realized p-vectors contained in the design matrices X. We randomly selectedñ = 0.8n out of the n realizations (individuals) 500 times for each combination of k and h 2 which imitates drawing from the data-generating process X. In each iteration, we calculated the estimators and the predictor in the OLS, BRR and (g)BLUP models as described in Subsection Model-fitting and Genomic Variance Calculation. The estimation performance of the biased estimatorV bias f compared to the improved estimatorV f from FEM in the reduced mice dataset is depicted in Figure 1 for a heritability of 0.2 (V k = 0.25 for all k). The biased estimatorV bias f performs drastically worse than the improved estimatorV f . This behavior ofV bias f is very similar for all considered h 2 which emphasizes the importance of the bias-correction in the FEM. For reasons of clarity we abstain from depicting the estimatorV bias f in the further analysis. We compared the performance of the remaining estimators and the predictor for the genomic variances in the reduced mice dataset for h 2 = 0.2 in Figure 2, for h 2 = 0.5 in Figure 3 and for h 2 = 0.8 in Figure 4. The estimated variances are averaged over the 500 realizations and are depicted in subject to the number of QTL k which also determines the QTL-marker ratios k/p. The bias-corrected estimatorV f given by (8) performs best and is very close to the "true" value of the genomic variance for all levels of heritabilities h 2 and numbers of QTL k. This is expected because the "true" genomic variance is calculated according to the genomic variance in the FEM, the genomic equivalent of the genetic variance as defined in quantitative genetics. The estimator W b , see(15), from the BRM overestimates the "true" genomic variance for h 2 = 0.2 for about over 10%. The performance of the estimator improves with larger heritability and for h 2 = 0.8 the estimator is very close to the "true" value for all k. Possible reasons for the overestimation by W b for small h 2 are dependencies between the states of the Markov chain, such that the approximation (40) is not good enough and that the model-fit gets worse such that the plugged-in state values are not representative of the posterior distribution (although the MCMC algorithm had converged). The estimation performance ofV r given by (19) depends on the QTL-marker ratio such that with increasing number of QTL's k the underestimation ofV r drastically increases, whereas for a small QTL-marker ratio, the estimatorV r tends to overestimate the "true" genomic variance. The performance of the estimator strongly declines with increasing heritability, such that for h 2 = 0.2 the relative bias amounts to about 4%, for h 2 = 0.5 to 5 − 15% and for h 2 = 0.8 to 5% − 20%. The novel predictor W r defined in (28) from the REM overestimates the "true" genomic variance for h 2 = 0.2 but nevertheless performs better than the estimators from the REM and the BRM. The predictor W r performs relatively independent of the QTLmarker ratio and its performance advantage uponV r increases with increasing h 2 . Although the "true" genomic variance is calculated according to the FEM, the performance of W r can more than compete with the estimatorsV f from FEM and W b from the BRM. We put special emphasis on the performance improvement of the novel predictor W r versus the estimatorV r in the case of higher heritability (Figure 4). This resembles the study of the missing heritability (Maher 2008;Yang et al. 2010) and the novel predictor remarkably reduced the missing heritability in REM in our simulation study. The number of covariances that contribute to the genomic variance V k depends quadratically on k (k 2 − k) and we draw the conclusion that the increasing bias of V r in (19) with increasing k is due to the quadratic increase in the number of missed covariances. In contrast to that, the estimatorŝ V f in (8), W b in (15) and the predictor W r in (28) fluctuate around the "true" value of the genomic variance independent on the number of covariances. The performance of the estimators and the predictor from BRM and REM in the full mice dataset is very similar to the performance in the reduced mice dataset such that we can also assert the improved performance of W b and W r in the case of p n. In addition to that, we compared the estimators and the predictor with respect to relative root-mean-squared-error and assert similar behavior as when investigating the estimation performance. We conclude that treating the genomic variance as random is also advantageous with respect to the precision of the estimators and the predictor.

Variation of QTL-Allocations
In Subsection Variation of Observational Data we investigated the performance of the estimators and the predictor of the genomic variance for a fixed QTL-allocation and varying observations. Hence, it is possible that the conclusions made depend strongly on the specific QTL-allocation and the corresponding implied LD-structure, and cannot be generalized. Consequently, we considered the whole dataset of individuals and conducted the analysis in this section for different QTL-allocations for each level of heratibility and number of QTL k. In order to do so, we undertook 2000 iterations of randomly choosing the actual QTLallocations for every level of heritability h 2 and each number of QTL k ∈ K, where K m = {10, 100, 500, 1000, 2000, ..., 10000} for the mice dataset and K rm = {10, 50, 100, 200, ..., 1000} for the  The "true" genomic variance V k equals 0.25 for all k which resembles a heritability h 2 of 0.2. The estimatorV f from FEM performs best followed by the predictor W r for the conditional genomic variance in REM which slightly overestimates V k . The estimatorV r underestimates V k and the bias of the estimators increases with k. The estimator for the posterior genomic variance W b constantly overestimates V by around 10%. The "true" genomic variance V k equals 1 for all k which resembles a heritability h 2 of 0.5. The estimatorV f from the FEM and the predictor W r from the REM are very close to the "true" V k for all k. The estimator W b for the posterior mean of genomic variance also performs well but slightly overestimate V k with increasing k. The estimatorV r drastically underestimates V k and the bias of the estimator strongly increases with k.

Figure 4
Estimated variance (mean value over iterations of subsets of individuals) in the reduced mice dataset for different number of QTL k and fixed numbers of markersp = 1088. The "true" genomic variance V k equals 4 for all k which resembles a heritability h 2 of 0.8. The estimatorV f from the FEM, the predictor W r from the REM and the estimator W b are constantly very close to V k for all QTL-marker ratios k/p. The estimatorV r from the REM drastically underestimates V k and the bias of the estimator strongly increases with k. reduced mice dataset. We used β = (1, ..., 1) k as the "true" effect vector in V k , see (30), prior to scaling by c k , in order to weight all locus-specific variances as well as all disequilibrium covariances equally. We compared the performance of the estimatorV bias f in (7) to the improved estimatorV f in (8) in the reduced mice dataset in Figure 5. Similar to Subsection Variation of Observational Data we notice that the bias-corrected estimatorV f behaves much better than the estimatorV bias f . In addition to that,V f fluctuates around the true value of V = 0.25 for all k, which indicates that the performance is independent of the QTL-marker ratio. We observed similar behavior for h 2 = 0.5 and h 2 = 0.8. In Figures  6 (h 2 = 0.2), 7 (h 2 = 0.5) and 8 (h 2 = 0.8) we depict the average of the estimators and the predictor over all considered QTLallocations in the reduced mice dataset for different number of QTL k for fixed number of markersp = 1088. We notice that the behavior of all considered quantities over k is more bumpy compared to the analysis for a fixed QTL-allocation. This indicates that the QTL-allocation influences the estimators and the predictor. The general conduct of the estimatorV f , see (8), the estimator W b , see (15) and the predictor W r , see (28), is similar and independent of the level of heritability, as we notice that these quantities have spikes and slabs for the same k (same QTLallocations) for each h 2 . This indicates thatV f , W b , and W r are in accordance and confirms that they can be used to estimate the genomic variance as defined in the FEM (and quantitative genetics). The estimatorV f fluctuates around the "true" value of the genomic variance, wheres the estimator W b constantly overestimates the V k for all k for small h 2 (as in Subsection Variation of Observational Data). The predictor W r fluctuates around the "true" value of the genomic variance for h 2 = 0.2 and slightly overestimates for larger heritabilities, but performs at least as good asV f and W b . The estimatorV r from REM underestimates the "true" value of the genomic variance in all cases where the bias increases with increasing k regardless of h 2 . Compared the behavior in Subsection Variation of Observational Data where only one QTL-allocation was examined, the estimatorV r underestimates V also for small k. The difference to the novel predictor W r is striking. Especially for h 2 = 0.8 the estimatorV r accounts for less than half of the true genomic variance, which is in accordance with observations of the missing heritability (Maher 2008;Yang et al. 2010). The missed covariances increase quadratically in k which explains the increasing bias of the estimatorV r . This simulation studies indicates that the novel predictor W r in (28) as well as the estimator W b in (15) are possible solutions to the missing heritability, because of their explicit inclusion of LD. For each level of heritability h 2 and each number of QTL k we considered 2000 different QTL-allocations and each of them defines a specific LD-structure. Consequently, the "true" value of the genomic variance for each QTL-allocation can be distinguished by a different relative contribution of LD to V as defined in rLD defined in (33). We depict the empirical covariance of this relative contribution of LD with the value ofV r , W r , the indicator I r given by (34), the relative bias (31) ofV r and the relative bias of W r for each h 2 and k in Figure 9. The correlation ofV r with the relative contribution of LD is negative (about −0.75) which indicates that the larger the contribution of LD becomes, the smaller the estimator becomes. This is clearly contrasted by the novel predictor W r which is approximately uncorrelated with the contribution of LD. In addition to that, the relative bias ofV r is positively correlated (about 0.75) with the relative contribu-tion of LD which demonstrates that the larger the contribution of LD, the larger the bias of the estimator becomes. This is once again contrasted by the relative bias of W r that is approximately uncorrelated to the contribution of LD. Strikingly, the empirical correlation of the indicator I r , which can be calculated using onlŷ V r and W r , is positively correlated with the relative contribution of LD to the genomic variance. As a consequence, I r constitutes a novel approximation of the relative contribution of LD to the genomic variance. In addition to the analysis for the reduced mice dataset we compared the estimators and the predictor in the full mice dataset where p n. The performance of the estimator W b ,V r , and the predictor W r are very similar to the performance in the reduced mice dataset.

Figure 5
Estimated variance in the FEM (mean value over different QTL allocations) in the reduced mice dataset for different numbesr of QTL k and fixed number of markersp = 1088. The "true" genomic variance V k equals 0.25 for all k which resembles a heritability h 2 of 0.2. The estimatorV f performs remarkably better than the biased estimatorV bias f and is very close to V independently of the QTL-to-marker ratio.

Results
We defined the genomic variance V in (2) as the variance of the genomic value for a stochastic marker content X in Section Linear Models and the Genomic Variance. We noticed that the expression for the genomic variance V as a fixed population parameter strongly depends on the models assumptions on the effect vector β. As a consequence, we distinguished the analysis of the genomic variance between the FEM, the BRM and the REM.
The genomic variance V f in the FEM in Section Fixed Effect Model (FEM), where the effect vector β is a deterministic parameter, is the genomic equivalent of the genetic variance in quantitative genetics and explicitly includes the contribution of LD. We noticed that the simple plug-in estimatorV bias f , given by (7), is clearly biased and introduced the improved estimatorV f in (8) that is unbiased for V if the plugged-in estimators are uncorrelated. The FEM can be applied when the number of effects is small or after the application of variable selection or reductions methods. In the simulation studies on the reduced mice dataset in Section Statistical Analysis we showed that the bias-corrected estimator largely improved the estimation of genomic variance V in FEM (Figures 1 and 5). The genomic variance V b , given by (10) in the BRM in Section Bayesian Regression Model (BRM), proved to be meaningless Figure 6 Estimated variance (mean value over different QTL allocations) in the reduced mice dataset for different numbers of QTL k and fixed number of markersp = 1088. The "true" genomic variance V k equals 0.25 for all k which resembles a heritability h 2 of 0.2. The estimatorV f from the FEM performs similar to the predictor W r from the REM and they are both close to the true V of 0.25. The estimator W b from the BRM performs solidly but constantly sightly overestimates the "true" genomic variance. The estimatorV r from the REM underestimates V by around 40% and the bias of the estimator tends to increase with the number of QTL k.

Figure 7
Estimated variance (mean value over different QTL allocations) in the reduced mice dataset for different numbers of QTL k and fixed number of markersp = 1088. The "true" genomic variance V k equals 1 for all k which resembles a heritability h 2 of 0.5. The estimatorV f from the FEM performs similar to the predictor W r from the REM and the estimator W b from the BRM and they are all very close to V. The estima-torV r from the REM underestimates V increasingly with the number of QTL k and by at least 40% starting at a QTL-marker ratio of 10%.

Figure 8
Estimated variance (mean value over different QTL allocations) in the reduced mice dataset for different numbers of QTL k and fixed number of markersp = 1088. The "true" genomic variance V k equals 4 for all k which resembles a heritability h 2 of 0.8. The estimatorV f from the FEM, the predictor W r from the REM and the estimator W b from the BRM are very close to the true V of 4. The estimatorV r from the REM drastically underestimates V and the bias of the estimator tends to increase with the number of QTL's k. For a large QTL-marker ratio,V r can only recover about 40% of the true genomic variance.

Figure 9
Empirical correlations with the relative contribution of LD to the "true" genomic variance in the reduced mice dataset for different numbers of QTL k for fixed number of markersp ≈ 1088. Values are averaged over the different levels of h 2 ∈ {0.2, 0.5, 0.8}. The correlation of the estimatorV r with the relative contribution of LD is about −0.7 except for k = 10, whereas the correlation of the predictor W r fluctuate around 0. The correlation of the relative bias ofV r is about 0.7 except for k = 10 which indicates that the larger the contribution of LD to the genomic variance, the larger the bias of V r becomes. Contrary to that, the bias of the predictor W r is approximately uncorrelated to the relative contribution of LD. The quantity I r is positively correlated (0.7 − 0.8) to the relative contribution of LD which makes it an usable indicator for the relative contribution of LD to the genomic variance. because of its dependence on characteristics of the prior distribution of the effect vector β. In order to include characteristics of the posterior distribution of β, we considered the genomic variance as a random variable W in (11) (conditional on the effect vector β) with prior expectation V b . We inferred its posterior expectation W b in (12), which includes the contribution of LD and has an interpretation similar to the genomic variance in the FEM. By doing so, we laid the theoretical foundations for the estimation of the posterior genomic variance in BRM and proved that the estimator for the mean of the posterior genomic variance W Post , given by (16), (Lehermeier et al. 2017) is nearly unbiased for the expectation of the random variable W. In Section Statistical Analysis we illustrated that the estimator W b performs similarly to the estimatorV f from the FEM (Figures 4,7,8) although it tends to overestimate in cases of low heritability (Figures 2, 3, 6). This enables an estimation of the genomic variance as defined in the FEM for high-dimensional genomic datasets (p n). In Section Random Effect Model (REM) we showed that, upto-now, the genomic variance in REM has been treated as the parameter V r , given by (18), and that popular estimation methods (e.g. GCTA-GREML) are based on the marginal covariance matrix of the effect vector β which leads to a negligence of the contribution of LD. In perfect accordance with the BRM, we introduced the novel concept of the random genomic variance W in (11) in REM by conditioning on the effect vector β. Similar to the prediction of the random effect β, the random genomic variance W has to be predicted by a strict conditioning on the phenotypic data by means of its likelihood. We derived a nearly unbiased predictor W r in (28) for the random genomic variance in REM that is based on the covariance of the conditional distribution of β given the data y. By adapting to the data, this approach explicitly allows for the contribution of LD and remarkably reduces the missing heritability ofV r in REM. In Section Statistical Analysis we illustrated that W r performs drastically superior to the estimatorV r (19) (Figures 3, 4, 6, 7, 8). Furthermore, the novel predictor W r performs at least as good as the estimator used for the mean of the posterior genomic variance in BRM W b and similar to the estimatorV f . This enables an estimation of the genomic variance as defined in the FEM in the REM. In Section Statistical Analysis we compared the estimators and predictors for the genomic variance with respect to their ability to estimate the genomic equivalent of the genetic variance in quantitative genetics in the mice dataset as well as the reduced mice dataset. In Subsection Variation of Observational Data we designated a fixed subset of markers to be QTL and investigated the performance of the estimator and the predictor for varying observations, which simulated drawing from the data-generating process of X. Because the investigation had only been executed for one fixed QTL setting and corresponding fixed population effect vector β, we investigated the dependence of the performance of the estimators and the predictor of the genomic variance on the QTL-allocation in Subsection Variation of QTL-Allocations. We asserted that the performance of the estimators and of the predictor as described above is consistent when varying the number of observations as well as when varying the underlying QTL allocation. In the end, we introduced an innovative indicator I r in (34) of the contribution of LD on the genomic variance (Figure 9) by comparing the estimatorV r and the predictor W r . This comparison added to the conclusion that the improved performance of the novel predictor W r compared to the estimatorV r is caused by the inclusion of LD.

Discussion
The additive genetic variance and the narrow sense heritability are clearly and uniquely defined, but nevertheless estimation procedures give different results (Chen 2016). The estimation of the genomic variance (especially in REM) varies even when using the same marker data to calculate different genomic relationship matrices. (Legarra 2015;Fernando et al. 2017). We showed in Subsection Notes on the mean-centering of X in the Appendix that transformations of the input marker-matrix X change the estimate of the genomic variance when using estimators similar to GCTA-GREML likeV r (19),V real r (21), orV equi r (25). The estimate of the genomic variance depends on the specific form on the estimated GRM and whether one considers mean-centered data or not. This critique goes hand in hand with Kumar et al. (2015Kumar et al. ( , 2016 that state that the GRM in GCTA-GREML is an estimate of the underlying data-generating process but is treated as a fixed quantity, which makes the calculation of the genomic variance as in (20) (Yang et al. 2010(Yang et al. , 2011 invalid. As a solution, we built our analysis on the data-generating process of the marker data by considering X as a random vector in model (1), which is also vital to generate a genomic variance in the FEM where the effects are fixed population parameters. We showed that treating the marker data as random is not enough because of the approximate equivalence ofV r in (19) andV real r in (21) (as well as its equivalent forms). We introduced the random genomic variance W in (11) by explicitly conditioning on the effect vector. As a consequence, the (random) genomic variance depends on the variance-covariance structure of the data-generating process X and is consequently independent of linear transformations of the marker content. In addition to that, the novel predictor W r , given by (28), for this random quantity is based on the conditional (on the data y) distribution of the effects which implies a departure from the marginal variancecovariance structure of β (diagonal, with equal variances σ 2 β ) to the arbitrary variance-covariance structure of β conditional on y. This approach is in accordance with the estimation of the posterior mean of the genomic variance W b given by (15) in BRM (Lehermeier et al. 2017) and tackles yet another central point of critique on GCTA-GREML issued by Kumar et al. (2015Kumar et al. ( , 2016, namely that the single marker effects are treated as independent random variables with equal variances. In the theoretical expression of the genomic variance V r in (18) LD does not contribute. However, when using the REML algorithm to estimate the variance component σ 2 β , LD implicitly contributes to estimated variances similar to GCTA-GREML likê V r (19),V real r (21), orV equi r (25). Nevertheless, as we have noticed in Figure 9, the bias ofV r is still very much correlated with with the contribution of LD. Our approach of considering the genomic variance W in BRM and REM as a random variable conditional on the effect vector can be considered as an extension of the genomic variance from the FEM to high-dimensional datasets. This is intrinsically tied to an explicit contribution of LD to the genomic variance. To be more specific, we have noticed in Figure 9 that the bias of W r (28) is approximately uncorrelated with the contribution of LD. This led us to deducing the indicator I r , given (34), for the contribution of LD to the genomic variance. The estimation and prediction of the effects β in highdimensional datasets using the BRM and the REM is executed by adapting to the data by means of its likelihood, which possibly results in an over-adjustment. In accordance to that, estimating the posterior mean of the conditional variance in BRM or predict-ing the conditional genomic variance in REM bears the risk of over-adjustment to the data. In our simulation study in Section Statistical Analysis we have assumed a very simplistic model and excluded, e.g., the influence of imperfect linkage between the markers and the QTL. This removed one of the main sources of the missing heritability claimed in literature, e.g. Yang et al. (2010). The stability of the novel predictor W r as well as of the estimator of the posterior mean W b has still to be further tested in more complex scenarios and for different datasets with different LD-structures. Specifically, it would be of interest to apply the novel predictor W r to the dataset of human height (not available to us), which is characterized by a large heritability of 80%, and compare the prediction performance of the genomic variance to the estimation of the genomic variance performed in Yang et al. (2010).
where only tr Σ X Σβ is amenable to estimation. Consequently, we define the bias-corrected estimatorV f for the genomic variance V f (6) asV whereΣβ is an unbiased estimator for Σ β = Cov(β). We first investigate the bias-correction term tr Σ XΣβ and find that We examine the estimatorV f (8): The estimatorV f (8) is biased by the amount Cov σ X ij ,β iβj −σβ ij , that is caused only by dependencies between the unbiased plugin estimatorsΣ X ,β andΣβ. If they are pairwise uncorrelated, the estimator V f is unbiased. We call estimators that are biased only due to correlations between plugged-in estimators "nearly unbiased". If we fit the conditional (on X) linear model by OLS, we can express y = µ + Xβ + ε =μ + Xβ +ε, whereβ = (X X) −1 X y andε = y − Xβ. It holds true thatβ ∼ N β, σ 2 ε (X X) −1 .
Subsequently, an unbiased estimatorΣβ for the variance ofβ in OLS is given by:Σβ It is well-known in OLS theory that an unbiased estimator for the residual variance σ 2 ε in the case of (p + 1) variables (including the intercept) is given bŷ where H := X(X X) −1 X is the so-called hat-matrix. For mean-centered phenotypes (∑ n i=1 y i /n = 0) we express the nearly unbiased estimatorV f (8) in OLS aŝ We obtain the exact empirical variance decomposition σ 2 y =V f +σ 2 ε .
in the OLS model which resembles the theoretical variance decomposition (3) in model (1).

BRM
In Section Bayesian Regression Model (BRM) we assume β in the stochastic model to be distributed according to the prior distribution p(β) with finite prior expectation µ β := E[β] and finite prior variancecovariance matrix Σ β := Cov(β). We leave the specific form of the distribution p(β) unspecified in this general approach. We calculate the genomic variance We can arbitrarily strongly influence the genomic variance V b (10) by the choice of the prior first and second moment of β.
As a solution, we define W := Var(Xβ|β) = β Σ X β = tr Σ X ββ , as the variance of the genomic value Xβ conditional on β. This random quantity has (prior) expectation We define = tr Σ X E ββ |y = tr Σ X E β|y E[β |y + Cov(β|y) = µ β|y Σ X µ β|y + tr Σ X Σ β|y (12) as the corresponding posterior mean of the genomic variance W (11), where In the conditional model we adapt to the data and define the estimator W b for the expectation of the posterior genomic variance W b (12): where we correct for the bias of the purely plug-in estimator by subtracting tr(Σ XΣμ β|y ) equivalently to the nearly unbiased estimatorV f (8) in Subsection FEM. The first part of expression (15), W (1) b , is similar toV f (8) such that we calculate as in (37): Cov σ X ij , (μ β|y ) i (μ β|y ) j −σμ β|y ij .
We derive the expectation of the second part of expression (15),

W
(2) b , as (25) are nearly unbiased estimators for the marginal genomic variance V r (18) using the same reasoning. In order to explicitly include LD into the expression of the genomic variance we condition on the effect vector β and define W := Var(Xβ|β) = β Σ X β = tr(Σ X ββ ), which is a quadratic form in the normally distributed β. This random variable has expectation Var(X j ) = V r .