
Benchmark Priors Revisited

Author(s):
Martin Feldkircher and Stefan Zeugner
Published Date:
September 2009

I Introduction

As a method of dealing with model uncertainty, Bayesian Model Averaging (BMA) has received considerable attention – ever since Raftery (1995) and Hoeting et al. (1999) demonstrated that inference neglecting model uncertainty leads to overstated confidence in statistical estimates. BMA, in contrast, tackles model uncertainty directly by basing inference on a weighted average of all potential covariate combinations. In the Bayesian framework, these weights arise naturally as posterior model probabilities (PMP), akin to the classical likelihood concept. The PMP for model Ms conditional on data (y, X) is proportional to its marginal likelihood p(y|Ms, X) times a model prior p(Ms):
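$$p(M_s \mid y, X)\;\propto\;p(y \mid M_s, X)\;p(M_s)$$

(the proportionality stated in the preceding sentence, written out in symbols).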

Relying on this framework, numerous authors (e.g. Raftery 1995, Fernández et al. 2001a, Liang et al. 2008) have demonstrated that BMA outperforms other strategies in terms of predictive ability. Virtually all of them have so far concentrated on linear models with model-specific inference based on the natural-conjugate Normal-Gamma framework with Zellner’s g prior (Zellner, 1986). This approach aims to represent the lack of prior knowledge by employing a conditional prior on model coefficients $\beta\,|\,\sigma^2 \sim N\!\left(0,\; g\,\sigma^2\,(X'X)^{-1}\right)$ that is partly determined by the scalar hyperparameter g.

The g prior structure has proven universally popular in BMA, since it leads to simple closed form expressions of posterior statistics and because it reduces prior elicitation to the choice of a single hyperparameter g. This applies in particular to the resulting marginal likelihood p(y|Ms, X):1

The elicitation of g is subject to intense debate (e.g. Liang et al. (2008), Hoeting et al. (1999), Fernández et al. (2001a), Eicher et al. (2009)) and constitutes the focus of the present paper. So far, the literature has discussed the optimal choice of g in practice by means of Monte Carlo simulations, and theoretically by tuning g according to two considerations: first, it focused on consistency, i.e. the choice of g such that BMA asymptotically uncovers ‘the true model’. Second, the specification of g was studied in terms of its virtues as a model size penalty term to favor parsimonious models. In this respect, Fernández et al. (2001a) as well as Foster and George (1994) demonstrate how g can be calibrated to asymptotically mimic popular information criteria such as RIC or BIC by adjusting g to their respective model size penalties.

These studies were motivated by asymptotic consistency (as mentioned above), which focuses on a unique ‘true’ model and requires only a g increasing with N. However from a Bayesian viewpoint, many models might be ‘true’, in the sense that they are generating the data examined.2 In this case, the quest for asymptotic consistency is analytically less clear-cut.

The virtues of the different g elicitation mechanisms have been subject to debate – however the use of a constant hyperparameter g as such has been less frequently criticized.

Considered from the perspective of an applied researcher, the practical advantages of Zellner’s g that render it ubiquitous in BMA come at a serious cost: g exerts non-negligible influence on posterior inference since it governs how posterior mass is spread over the models. Larger values of g induce posterior mass to concentrate on fewer models, whereas smaller values of g spread PMPs more evenly, irrespective of model sizes and their penalty terms. Posterior statistics, and in particular PMPs and posterior inclusion probabilities (PIP), are thus notoriously sensitive to the value of the g prior. In other words, the researcher’s prior on g plays a considerable role in determining how much posterior mass is attributed to a few, or the single, best-performing model – regardless of whether these have been generating the data. Henceforth, we will refer to this concentration of posterior mass on a few models as the supermodel effect. While crucial in terms of prior sensitivity, this effect went more or less unnoticed in the focus on consistency as in Fernández et al. (2001a): focusing on uncovering a single ‘true’ model in Monte Carlo simulations, previous studies longed for settings ascribing the bulk of posterior mass to this data-generating model – a job facilitated by choosing large values for g.

In our view, the most earnest approach to tackling the issue is the introduction of a nondegenerate prior distribution on g, thus ‘letting the data choose’. Only a few papers have attempted this so far: among them are Cui and George (2008), Strachan and van Dijk (2004), and Liang et al. (2008), with the latter being probably the most comprehensive contribution. We thus propose using a hyperprior distribution on g in the vein of Liang et al. (2008). The advantages of a hyperprior are fivefold: First, it greatly reduces the g prior sensitivity of posterior mass. Second, it does so by adjusting the posterior distribution of g such that it reflects the data’s signal-to-noise ratio, shrinking the estimated coefficients more toward zero in noisier data sets.3 Thus the hyper-g prior allows for data-dependent shrinkage, in contrast to prior structures fixing g to some constant. Third, it still leaves ample space for the researcher to formulate her prior beliefs, but without incurring the risk of unwantedly affecting posterior statistics. Fourth, it is computationally feasible and complies with the same asymptotic consistency considerations as the standard setting. Finally, the hyper-g prior is not exposed to the aforementioned supermodel effect a priori. It adjusts the distribution of posterior mass depending on the information provided by the data. Thus if noise dominates the data, PMPs under the hyper-g prior will be distributed more evenly, whereas in the case of minor noise, posterior mass will be concentrated even more than in fixed settings that impose large values for g.

In this sense, we complement the contribution of Liang et al. (2008) by providing further closed-form representations for important posterior quantities. Additionally, we present analytical expressions that allow for a sound numerical implementation in terms of accuracy. Furthermore we demonstrate the behavior of several prominent prior structures such as the benchmark prior put forward by Fernández et al. (2001a) under a typical situation where the data generating process cannot be traced back to one specific model. Our results show that under noisy data the hyper-g prior dilutes the posterior mass among models whereas the benchmark prior incorrectly favors one (wrong) model. By means of a simulation study we examine the predictive properties of various settings for g indicating superior predictive ability for the hyper-g prior under varying signal-to-noise ratios.

The remainder of this study is organized as follows: the next section briefly summarizes Bayesian model averaging under the natural conjugate framework employing Zellner’s g prior and reviews the prior settings that have resurfaced most often in the literature so far. Section III introduces the hyper-g prior and outlines derivations of further posterior quantities as well as an implementation strategy of practical relevance. Section IV examines the supermodel effect inherent to traditional priors and highlights the predictive performance of flexible priors based on a simulation study. The following section demonstrates the sensitivity of posterior results to the choice of g by means of an empirical application to a prominent growth data set. Section VI concludes the paper.

II Bayesian Model Averaging under Zellner’s g Prior

Consider the canonical regression problem of sample size N with the dependent variable in the N × 1 vector y, Xs an N × ks design matrix of covariates, and ε an N-dimensional vector of residuals in the following linear model Ms:
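Written out, with $\iota_N$ denoting the N-vector of ones:

$$M_s:\qquad y \;=\; \alpha\,\iota_N \;+\; X_s\,\beta_s \;+\; \varepsilon$$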

Here α denotes the (scalar) intercept, and the $k_s \times 1$ vector $\beta_s$ the nonzero regression coefficients. The residuals are assumed to be independently and identically normally distributed with variance $\sigma^2$, i.e. $\varepsilon \sim N(0, \sigma^2 I)$. Note that Xs can be assumed to be centered ($X_s'\iota_N = 0$) without loss of generality, as this will only affect the posterior distribution of the constant $\alpha_s$ (Liang et al., 2008, p. 412). Bayesian Model Averaging deals with uncertainty about the model Ms by drawing on the model-specific inference presented above. In the generic linear BMA problem, model uncertainty focuses on the choice of covariates Xs, which may be drawn from a set of K potential regressors. Thus there exist $2^K$ unique covariate combinations, as represented by the model candidate space $\mathcal{M} = \{M_1, M_2, \ldots, M_{2^K}\}$ (cf. Hoeting et al. 1999 for a more detailed account).

The Bayesian framework calls for specifying a prior distribution on the model’s parameters α, βs, and $\sigma^2$. The bulk of the BMA literature (Raftery, 1995; Chipman et al., 2001) favors the natural-conjugate approach, which puts a conditionally normal prior on the coefficients, $\beta_s\,|\,\sigma^2, M_s \sim N\!\left(\bar\beta_s,\; \sigma^2\,\bar V_s\right)$, where $\bar\beta_s$ and $\bar V_s$ represent hyperparameters.

With respect to the two other parameters, we depart from earlier tradition and follow Fernández et al. (2001a), who proposed improper priors on α and σ: let p(α) ∝ 1, which corresponds best to a complete lack of prior information on the constant. Moreover, put an equally uninformative prior $p(\sigma) \propto \sigma^{-1}$, which (in contrast to the traditional Gamma priors) offers the additional advantage of being invariant under scale transformations (Fernández et al., 2001a, p. 391). The popularity of this prior structure is due to its closed-form solutions for posterior distributions, which allow for efficient coding with respect to large-scale model selection. Most notably, employing Bayes’ Theorem via

yields the posterior distribution of β as a $k_s$-variate Student-t distribution whose variance structure is primarily shaped by the expression $\left(\bar V_s^{-1} + X_s'X_s\right)^{-1}$.

The above framework requires explicit formulation of $\bar\beta_s$ and $\bar V_s$, the prior hyperparameters on β|σ2 – which is difficult to elicit given the many combinations possible in model selection problems. Virtually all linear BMA applications have thus opted for a common uninformative prior centered at zero, with the variance structure given by Zellner’s g prior, i.e. $\bar V_s = g\,(X_s'X_s)^{-1}$ (Zellner, 1986). It thus assumes the prior covariance to be proportional to the posterior covariance expression $(X_s'X_s)^{-1}$ that arises from the sample, with the scalar g determining how much importance is attributed to the prior precision $\bar V_s^{-1}$. Apart from offering computational efficiency, Zellner’s g thus reduces the elicitation of the covariance structure to simply choosing the scalar g. The conditional prior on the coefficients, $\beta_s\,|\,\sigma^2, g, M_s \sim N\!\left(0,\; \sigma^2\, g\,(X_s'X_s)^{-1}\right)$,

simplifies the posterior distribution of β such that it follows a $k_s$-variate Student-t distribution with the following density function (for N > 2):

where the scale matrix equals $\frac{\tilde y'\tilde y}{N-1}\left(1-\frac{g}{1+g}R_s^2\right)\frac{g}{1+g}\,(X_s'X_s)^{-1}$

Here, $\tilde y = y - \bar y\,\iota_N$ is the centered response vector, $\hat\beta_s$ the OLS estimator of the coefficient vector, and $R_s^2$ the OLS R-squared of model Ms. Apart from simplifying the posterior covariance of βs, g also affects its expected value $E(\beta_s\,|\,y, X, M_s, g) = \frac{g}{1+g}\hat\beta_s$, which becomes a convex combination of its OLS estimator and the prior expected value (zero). Note that the value of the shrinkage factor $\frac{g}{1+g}$ thus determines the importance of the prior expected value $E(\beta_s) = 0$ with respect to the sample estimates.

Furthermore, Zellner’s g yields a simple expression for the marginal likelihood of Ms:4
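In a standard form (given here for reference, with proportionality up to a constant identical across models; cf. Fernández et al. 2001a):

$$p(y \mid M_s, X, g)\;=\;\int_0^{\infty}\!\!\int_{-\infty}^{\infty}\!\!\int_{B} p(y \mid \alpha, \beta_s, \sigma, M_s)\,p(\beta_s \mid \sigma, g, M_s)\,p(\alpha)\,p(\sigma)\;d\beta_s\,d\alpha\,d\sigma\;\propto\;(\tilde y'\tilde y)^{-\frac{N-1}{2}}\,(1+g)^{\frac{N-1-k_s}{2}}\,\bigl[1+g\,(1-R_s^2)\bigr]^{-\frac{N-1}{2}}$$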

with B denoting the parameter space of the β-coefficient vector. The Bayesian framework further calls for defining prior model probabilities p(Mj) for all models contained in the model space, $j \in \{1, 2, \ldots, 2^K\}$. While advocates of purism may call for subjective prior specification of p(Ms), the sheer number of model candidates renders this virtue infeasible. Consequently, most authors have relied on the uniform model prior $p(M_s) = 2^{-K}$, whereas several (Ley and Steel, 2009; Brown et al., 1998; Sala-i-Martin et al., 2004) have proposed to specify model priors in dependence of average model size ks, typically in such a way that prior elicitation is reduced to choosing the prior expected model size. The beta-binomial model prior put forward by Ley and Steel (2009) falls into this category,5 and we will rely on it in the latter part of our study.

Given the prior model probabilities we can calculate the posterior model probabilities p(Ms|y, X, g) conditional on (y, X, g) serving as the model weights in Bayesian model averaging:
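$$p(M_s \mid y, X, g)\;=\;\frac{p(y \mid M_s, X, g)\,p(M_s)}{\sum_{j=1}^{2^K} p(y \mid M_j, X, g)\,p(M_j)}$$

(the normalization runs over all candidate models in $\mathcal{M}$).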

The key component constituting the PMP is the marginal (or integrated) likelihood p(y|Ms, X). The Bayes factor (i.e. the ratio of the marginal likelihoods for two competing models) allows for comparing any two models Ms and Mj by assessing their relative weights:
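$$B(M_s : M_j)\;=\;\frac{p(y \mid M_s, X, g)}{p(y \mid M_j, X, g)}$$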

Multiplied by the prior odds, it yields the posterior odds $p(M_s\,|\,y, X, g)/p(M_j\,|\,y, X, g) = B(M_s : M_j)\,p(M_s)/p(M_j)$. Consequently, the posterior model probability given by the product of (3) and the model prior may be expressed as the (nested) Bayes factor with respect to the null model M0 (times a model prior):

Finally, model averaging comes into play as the marginal posterior distribution of any statistic θ may be obtained as a mixture over posterior model probabilities:
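$$p(\theta \mid y, X)\;=\;\sum_{s=1}^{2^K} p(\theta \mid y, X, M_s)\;p(M_s \mid y, X)$$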

This property is particularly useful in computing the posterior moments of the coefficient vector β as a weighted average over all models.6 Likewise, posterior inclusion probabilities, popular for assessing the importance of single covariates, are obtained as the sum of the posterior model probabilities of all models in which the covariate is included.
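As an illustration of this mixing step, the following sketch combines per-model output into model-averaged coefficient means and PIPs (Python; function and variable names are ours and not taken from any particular BMA implementation):

```python
import numpy as np

def bma_combine(pmp, coef_means, inclusion):
    """Combine per-model posterior output into BMA summaries (sketch).

    pmp        : (M,) posterior model probabilities
    coef_means : (M, K) posterior coefficient means per model,
                 with zeros for excluded covariates
    inclusion  : (M, K) 0/1 indicators of covariate inclusion
    """
    pmp = np.asarray(pmp, dtype=float)
    pmp = pmp / pmp.sum()                                    # normalize the model weights
    post_mean = pmp @ np.asarray(coef_means, dtype=float)    # mixture mean of beta
    pip = pmp @ np.asarray(inclusion, dtype=float)           # sum of PMPs of models containing x_k
    return post_mean, pip
```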

II.1 Popular Settings for Zellner’s g

In view of equation (5), BMA inference thus hinges on posterior model probabilities and, in turn, on model priors p(Ms) and marginal likelihoods p(y|Ms, X) (and thus the hyperparameter g). Since the employed beta-binomial model prior already offers a sound statistical framework that aims at minimizing the impact of prior arbitrariness on posterior inference, we consequently focus on the characteristics of the prior choice on g.

With respect to the marginal likelihood p(y|Ms, Xs), the discussion has concentrated on the elicitation of g, with two predominant considerations:

  • Consistency: The choice of g such that posterior model probabilities asymptotically uncover ‘the true model’ MT, i.e. p(MT|Y) → 1 as N → ∞
  • The importance of g as a penalty term enforcing parameter parsimony (the factor $(1+g)^{\frac{k_j-k_s}{2}}$ in (4))

Both issues have already been treated in the classical paper by Fernández et al. (2001a): with respect to consistency, they prove that a choice of $g = w(N)$ such that $\lim_{N\to\infty} w(N) = \infty$, with $w(N)$ growing at a suitable rate relative to the sample size, ensures consistency as mentioned above.

Still, consistency leaves open the exact specification of g, and over the course of more than a decade, various ‘automatic’ or ‘default’ specifications have been put forward that typically specify g in dependence of sample size. Mostly, their theoretical foundations build on the penalty term $(1+g)^{-k_s/2}$ in (5) and aim at asymptotically mimicking popular information criteria such as the BIC (cf. for instance Fernández et al. 2001a, p. 424). Thus the specification of g was frequently debated in terms of its virtues as a model size penalty term. However, from a Bayesian viewpoint, subjective and theoretical considerations on such a penalty should more properly be fused into the formulation of the model prior.7 It is straightforward to neutralize the factor $(1+g)^{-k_s/2}$ in equation (5) by an appropriate model prior p(Mj) and introduce more or less penalty as one pleases.8
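To make the point concrete (our illustration; $\pi(k_j)$ is a placeholder of our choosing): since g is a known constant in the fixed settings, a model prior of the form

$$p(M_j)\;\propto\;(1+g)^{\frac{k_j}{2}}\,\pi(k_j)$$

cancels the $(1+g)^{-k_j/2}$ penalty in the marginal likelihood exactly, after which any desired size penalty can be reintroduced through the factor $\pi(k_j)$.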

In the remainder of this study, we thus try to appreciate g as what it was intended to be: a hyperparameter on the prior distribution of β. Consequently, we focus on $D_{sj}$, the second factor in (4) (as the first factor might be adjusted by model priors): ceteris paribus, the larger g, the more $D_{sj}$ tends away from unity9 and the more tightly posterior mass is concentrated on a few ‘super models’. The relative distribution of PMPs is therefore crucially affected by the choice of g. Thus the relative magnitude of the g hyperparameter for two competing prior structures will determine the strength and direction of the supermodel effect.

With respect to the previously mentioned motivations, several studies have compared the performance of different specifications by means of Monte Carlo simulations, among them Fernández et al. (2001a), Eicher et al. (2009) and Liang et al. (2008). We briefly reiterate the most popular concepts, closely following Liang et al. (2008):

  • Risk Inflation Criterion Prior (g-RIC): implies setting $g = K^2$. This calibrates the posterior model probability to asymptotically match the risk inflation criterion proposed by Foster and George (1994).
  • Unit Information Prior (g-UIP): In the linear case, it corresponds to $g = N$. It draws on the notion that the ‘amount of information’ contained in the prior should equal the amount of information in one observation (Kass and Wasserman, 1995). For this setting, Fernández et al. (2001a, p. 424) demonstrate that as N → ∞ the log of the Bayes factor in (4) approaches the ratio of the Bayesian information criterion (BIC) for the two models Ms and Mj.
  • Benchmark Prior (g-BRIC): Corresponds to $g = \max(N, K^2)$. After an extensive study of various specifications for g involving different settings for N, K, and ks, Fernández et al. (2001a) determined this combination of the g-UIP and g-RIC priors to perform best with respect to predictive performance.
  • Empirical Bayes – Local (EBL): $g_s = \arg\max_g\, p(y\,|\,M_s, X, g)$. Authors such as George and Foster (2000) or Hansen and Yu (2001) advocate an ‘Empirical Bayes’ approach, using information contained in the data (y, X) to elicit g. The latter provide a theoretical underpinning for doing so locally, i.e. separately for each model. In the formulation given in Liang et al. (2008), this corresponds to $g_s = \max(0,\, F_s - 1)$, where $F_s$ is the standard F-statistic for testing the OLS formulation of $M_s$, with $F_s = \frac{R_s^2\,(N-1-k_s)}{(1-R_s^2)\,k_s}$. Note that this formulation frequently raises objections, as the data-dependency of g runs counter to the intuition of a prior.

Several more ‘automatic’ specifications for g have been proposed (cf. Fernández et al. 2001a, George and Foster 2000 or Eicher et al. 2009), but the ones above have resurfaced most frequently and also serve as a benchmark reference for Liang et al. (2008).
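For concreteness, the fixed rules above and the local empirical Bayes estimate can be written down in a few lines (a Python sketch under the definitions just listed; names are ours):

```python
def g_fixed(N, K, rule):
    """Fixed values of g under the common 'automatic' specifications."""
    if rule == "RIC":
        return K ** 2              # risk inflation criterion, g = K^2
    if rule == "UIP":
        return N                   # unit information prior, g = N
    if rule == "BRIC":
        return max(N, K ** 2)      # benchmark prior of Fernandez et al. (2001a)
    raise ValueError("unknown rule: %s" % rule)

def g_ebl(r2, N, k):
    """Local empirical Bayes estimate g_s = max(0, F_s - 1) for one model M_s."""
    f_stat = r2 * (N - 1 - k) / ((1.0 - r2) * k)    # OLS F-statistic of M_s
    return max(0.0, f_stat - 1.0)
```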

III The Hyper-g Prior: A Beta Prior on the Shrinkage Factor

Motivated by theoretical considerations,10 Liang et al. (2008) introduce two prior distributions on the hyperparameter g. One of them, the so-called ‘hyper-g’ prior, is particularly interesting in that it provides closed form solutions for almost all posterior statistics of interest.

While Liang et al. (2008) ingeniously outline the basic features of the hyper-g prior, we derive further posterior quantities required for a fully Bayesian analysis. Equations (10)-(12) complement their article by providing the posterior distribution of β|y, X and its second moments, as well as the second moment of the shrinkage factor.

In addition, equations (13)-(15) focus on facilitating the numerical implementation of the hyper-g prior. Most notably, posterior expressions given in Liang et al. (2008) involve ratios of Gaussian hypergeometric functions, which they propose to compute via Laplace approximations for reasons of computational feasibility. This approach gives rise to numerical inaccuracies, in particular with respect to the mentioned ratios. This section demonstrates how accurate statistics may be obtained in a timely fashion by some algebraic transformations. We also show how the original hyper-g prior approach may be reconciled with consistency in the sense of Fernández et al. (2001a), with details given in the appendix.

The hyper-g prior for g translates into a Beta prior on the shrinkage factor $\frac{g}{1+g}$ that is common to all model candidates (Liang et al. (2008, p. 415)):
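In their parameterization, this prior is

$$p(g)\;=\;\frac{a-2}{2}\,(1+g)^{-\frac{a}{2}},\qquad g>0,$$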

i.e. $\frac{g}{1+g}$ is Beta distributed with $E\!\left(\frac{g}{1+g}\right) = \frac{2}{a}$.11 The elicitation of g is therefore supplanted by the choice of the hyperparameter a ∈ (2, ∞): a = 4 renders the prior distribution of $\frac{g}{1+g}$ uniform, while moving a close to 2 concentrates the prior mass on shrinkage factors close to 1. Conversely, any a > 4 tends to concentrate prior mass near 0. Liang et al. (2008) therefore omit those cases and concentrate on a ∈ (2, 4] – a strategy we will follow in this study.

The authors derive the posterior distribution of g|y, X, Ms and some posterior statistics by relying on an integral representation for the Gaussian hypergeometric function ${}_2F_1(a, b, c, z)$ (as, for instance, in Abramowitz and Stegun 1972, p. 563):
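For reference, the Euler integral form in question is

$${}_2F_1(a, b, c, z)\;=\;\frac{\Gamma(c)}{\Gamma(b)\,\Gamma(c-b)}\int_0^1 \frac{t^{\,b-1}(1-t)^{\,c-b-1}}{(1-tz)^{a}}\,dt,\qquad c > b > 0.$$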

This integral representation is employed to derive the posterior distribution of g:
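Combining the prior on g with the g-conditional marginal likelihood of Section II gives, up to normalization,

$$p(g \mid y, X_s, M_s)\;\propto\;(1+g)^{\frac{N-1-k_s-a}{2}}\,\bigl[1+g\,(1-R_s^2)\bigr]^{-\frac{N-1}{2}},\qquad g>0.$$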

Moreover, the marginal likelihood may be expressed as (Liang et al., 2008, equation (17)):
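Written as the null-based Bayes factor (a reconstruction consistent with the expressions used further below; the marginal likelihood follows by multiplying with the null model’s marginal likelihood, which is common to all models):

$$\frac{p(y \mid M_s, X)}{p(y \mid M_0, X)}\;=\;\frac{a-2}{k_s+a-2}\;{}_2F_1\!\left(\frac{N-1}{2},\,1,\,\frac{k_s+a}{2},\,R_s^2\right)$$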

The posterior expected value of the shrinkage factor is given by (Liang et al. (2008, equation (19))):12
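$$E\!\left(\frac{g}{1+g}\;\middle|\;y, X_s, M_s\right)\;=\;\frac{2}{k_s+a}\;\frac{{}_2F_1\!\left(\frac{N-1}{2},\,2,\,\frac{k_s+a+2}{2},\,R_s^2\right)}{{}_2F_1\!\left(\frac{N-1}{2},\,1,\,\frac{k_s+a}{2},\,R_s^2\right)}$$

(a reconstruction obtained from the integral representation above; the printed form in Liang et al. (2008) may differ by algebraically equivalent rearrangements).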

Under this setting, the posterior expected value of the response under the hyper-g prior for model Ms is given by

with $\hat\beta_s$ denoting the OLS estimator for model Ms. From equation (9) the importance of the shrinkage factor becomes evident, with the hyper-g prior allowing for model-specific, data-adaptive shrinkage as opposed to fixing the value of the shrinkage factor a priori.

The posterior statistics outlined so far suffice for the analysis in Liang et al. (2008). However, fully Bayesian inference requires several more expressions, notably with respect to second moments. Therefore, we provide the second moments of $\frac{g}{1+g}$, as well as those of β|y, X and its posterior distribution below.13 The posterior covariance of βs is given by:

This covariance corresponds to the posterior distribution of βs that may be represented as follows:

Note that this expression is of close, though not perfect resemblance to a hypergeometric function distribution of type II.14

Finally, the second posterior moment of the shrinkage factor is given by:15
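Using the same integral representation, one way to write this second moment is (our derivation; the form may differ from equation (12) by equivalent rearrangements):

$$E\!\left[\left(\frac{g}{1+g}\right)^{2}\;\middle|\;y, X_s, M_s\right]\;=\;\frac{8}{(k_s+a)(k_s+a+2)}\;\frac{{}_2F_1\!\left(\frac{N-1}{2},\,3,\,\frac{k_s+a+4}{2},\,R_s^2\right)}{{}_2F_1\!\left(\frac{N-1}{2},\,1,\,\frac{k_s+a}{2},\,R_s^2\right)}$$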

The above posterior moments are all characterized by fractions of differing hypergeometric functions. As computing the value of hypergeometric functions is quite involved, this form renders numerical implementation extremely difficult in terms of computation time and accuracy. However, they may all be expressed in dependence of $F_s^* \equiv {}_2F_1\!\left(\frac{N-1}{2},\,1,\,\frac{k_s+a}{2},\,R_s^2\right)$ using Gauss’ relations for contiguous hypergeometric functions (Abramowitz and Stegun, 1972, p. 563). Let $\bar N \equiv N-3$ and $\bar\theta_s \equiv k_s+a-2$ represent collected terms. Tedious, but straightforward algebra then yields the following results for (8), (10), and (12) (as long as $R_s^2 > 0$):16

Note that the equations above all contain the term $\bar\theta_s/F_s^*$, which is just 2/(a – 2) times the integration constant of $p(g\,|\,y, X_s, M_s)$, or $\frac{a-2}{BF(M_s:M_0)}$, where $BF(M_s : M_0)$ is the null-based Bayes factor for model Ms. So for each model’s statistics, a hypergeometric function (or its Laplace approximation) has to be computed only once, which benefits numerical implementation in terms of computation speed.
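To make the implementation point concrete, the sketch below computes the null-based Bayes factor and the posterior mean shrinkage for one model from a single hypergeometric evaluation, using scipy’s `hyp2f1`. The closed forms are our rendering of the representations discussed above (not verbatim copies of equations (13)-(15)), and the helper names are ours:

```python
from scipy.special import hyp2f1

def hyper_g_model_stats(r2, N, k, a=3.0):
    """Null-based Bayes factor and posterior mean of g/(1+g) for one model
    under the hyper-g prior, reusing a single 2F1 evaluation (sketch)."""
    theta = k + a - 2.0                       # collected term, k_s + a - 2
    n_bar = N - 3.0                           # collected term, N - 3
    f_star = hyp2f1((N - 1.0) / 2.0, 1.0, (k + a) / 2.0, r2)
    bf_null = (a - 2.0) / theta * f_star      # BF(M_s : M_0)
    f_hat = r2 * (n_bar - theta) / ((1.0 - r2) * theta)   # adjusted F-statistic
    # posterior mean shrinkage: 1 - 1/F_hat plus a correction that keeps it
    # non-negative and vanishes quickly as the data become informative
    shrink = 1.0 - 1.0 / f_hat + theta / (f_star * r2 * (n_bar - theta))
    return bf_null, shrink
```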

Moreover, equations (13)-(15) reveal a certain resemblance to the respective posterior statistics under the ‘Empirical Bayes – Local’ (EBL) approach as outlined in Section II.1: the main difference is the term $\bar\theta_s/F_s^*$, which guarantees non-negativity for the above expressions. Considering that the models associated with very low $\bar\theta_s/F_s^*$ (and thus high PMP) are disproportionately weighted into model averaging, this term thus virtually disappears from model-averaged statistics.17 Moreover, the marginal model likelihood in (7) does not differ much from its equivalent under EBL.18 Section IV illustrates this effect in showing that hyper-g results differ far less from EBL than from settings under constant g.

One virtue of the hyper-g prior lies in the fact that the posterior distribution of the shrinkage factor $\frac{g}{1+g}$ can be interpreted in terms of goodness-of-fit. Equation (13) presents its model-specific expected value as close to $1 - 1/\hat F_s$, where $\hat F_s$ represents an adjusted OLS F-statistic for the model $M_s$: $\hat F_s = \frac{R_s^2\,(\bar N - \bar\theta_s)}{(1-R_s^2)\,\bar\theta_s}$. Larger values of the shrinkage factor hence correspond to more variance explained by the model Ms.

The model-averaged expected value of the shrinkage factor $E\!\left(\frac{g}{1+g}\,\middle|\,y, X\right)$ may be interpreted likewise. As long as there are some Bayes factors considerably larger than one, the following inequality holds:19

where $R_F^2$ is the OLS R-squared of the ‘full model’ with K regressors, and E(k|y, X) is the expected posterior model size. The right-hand side thus constitutes a pseudo F-statistic that relates $R_F^2$ with the ‘number of parameters’ E(k|y, X) + a – 2. It thus forms an upper bound for the ‘goodness-of-fit’ that can be achieved by BMA. Additionally, (16) gives rise to the following inequality, which may serve to establish a relationship to the classical interpretation of R-squared:

Finally, the hyperparameter a can still be trimmed to represent prior beliefs on the shrinkage factor. It is straightforward, for instance, to specify the prior beliefs such that the expected shrinkage factor matches the expressions laid out in Section II.1. In general, most popular settings for g can thus be emulated by $a = 2 + 2/w(N)$, with $w(N) > 0$ and $\lim_{N\to\infty} w(N) = \infty$, thus positioning the prior expected value at $E\!\left(\frac{g}{1+g}\right) = \frac{w(N)}{1+w(N)}$. Note that in this case, ‘consistency’ in the sense of Fernández et al. (2001a, p. 6) is ensured for the ‘hyper-g’ prior20 (cf. Section A.1 in the appendix).

In this light, we propose the following specifications for the prior beliefs on the shrinkage factor:

  • HG-UIP: $a = 2 + \frac{2}{N}$ corresponds to the ‘g-UIP’ shrinkage factor with $E\!\left(\frac{g}{1+g}\right) = \frac{N}{1+N}$. Then 95% of the prior mass on the shrinkage factor is contained in the interval $[1 - 0.95^N,\, 1]$ (a quick check follows this list).
  • HG-RIC: $a = 2 + \frac{2}{K^2}$ corresponds to ‘g-RIC’ shrinkage with $E\!\left(\frac{g}{1+g}\right) = \frac{K^2}{1+K^2}$. In this case 95% of the prior mass is contained in the interval $[1 - 0.95^{K^2},\, 1]$.
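As a quick check of these interval statements (our arithmetic): with $a = 2 + 2/w$, the prior on the shrinkage factor is Beta$\left(1, \tfrac{1}{w}\right)$, so that

$$P\!\left(\frac{g}{1+g}\le q\right)\;=\;1-(1-q)^{1/w};$$

setting this to 0.05 yields $q = 1 - 0.95^{\,w}$, which with $w = N$ (HG-UIP) or $w = K^2$ (HG-RIC) reproduces the intervals above.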

Similarly, other specifications akin to ‘classic’ g formulations could be implemented – as long as they depend on N as defined above, in order to retain asymptotic consistency. However, as posterior expressions are quite insensitive to the value of a, and most of these formulations will lead to a close to 2, the resulting posterior statistics will be virtually identical. We therefore limit our attention to the two specifications above.

IV A Simulation Study

In this section we carry out a simulation study empirically investigating the supermodel effect as well as assessing the predictive performance of selected prior structures. We can broadly distinguish two classes of prior settings: degenerate ones fixing g to a constant (fixed prior settings), as opposed to model-specific and data-dependent g prior structures (flexible prior settings). In the following, we concentrate on the 8 prior structures given in Table 1.

Table 1: Definition of Prior Settings.

Fixed Prior Settings
  • g-RIC: Risk inflation criterion, g = K².
  • g-UIP: Unit information prior, g = N.
  • g-E(g/(1+g)|Y): g/(1+g) is set to the posterior mean under the HG-4 prior, i.e. E(g/(1+g)|Y).

Flexible Prior Settings
  • EBL: Local empirical Bayes estimate of g.
  • HG-3: Hyper-g prior with a = 3.
  • HG-4: Hyper-g prior with a = 4.
  • HG-RIC: Hyper-g prior with a = 2 + 2/K².
  • HG-UIP: Hyper-g prior with a = 2 + 2/N.

The first two fixed settings correspond to what Fernández et al. (2001a) coined the “benchmark” prior, which is widely used in applied work.21 In these applications the suggestion made by Fernández et al. (2001a) often results in the g-RIC prior, which will hence serve as our reference prior. The implied (large) value for g under g-RIC is expected to have two consequences: first, g-RIC will favor parsimonious models, and second, posterior mass will be concentrated on a small set of models.22 The unit information prior and the g-E(g/(1+g)|Y) setting complete the set-up for fixed prior structures on g. For the latter we impose g/(1+g) a priori to equal the (model-weighted) posterior mean of (g/(1+g)|Y) under the HG-4 setting (the hyperprior with a = 4). We have chosen this particular prior structure to exemplify the impact of adaptive shrinkage: a priori, both priors are expected to be very similar in posterior mass distribution. However, posterior results are expected to differ seriously regarding the assignment of PMPs. In principle, the g-E(g/(1+g)|Y) setting will more strongly favor parsimonious models and those with comparably small posterior support under HG-4, due to keeping g constant.

Secondly, we propose more flexible prior structures that embody model-specific g values and data-dependent shrinkage. In particular, these settings are the local empirical Bayes estimates and the hyper-g prior corresponding to a fully Bayesian approach. One strength of placing a prior on g lies in the fact that we can incorporate our prior beliefs following the rules of Bayesian statistics23 via the hyperparameter a. For the simulation study, we devise four different values for a: HG-3 (a = 3) corresponds to a prior expected shrinkage factor of 2/3, whereas HG-4 (a = 4) corresponds to a uniform prior over the shrinkage factor. We contrast these two settings with two settings calibrated to match the g-RIC and g-UIP prior structures (HG-RIC, HG-UIP). That is, the prior expected value of the shrinkage factor E(g/(1+g)) conforms to the shrinkage factors induced by g-RIC (g = K²) or g-UIP (g = N).

Data-wise, we employ two different settings, where the first set-up “A” is as in Fernández et al. (2001a). Each Monte Carlo run draws 10 potential explanatory variables (x1, …, x10) with N = 100 observations from a standard normal distribution for each covariate. An additional 5 variables are generated by multiplying the first five regressors by [0.3, 0.5, 0.7, 0.9, 1.1], inducing a correlation structure among the covariates. Note that this correlation structure impedes uncovering the data generating model.

The second set-up “B” is more demanding since the data generating process cannot be traced back to a single model. This is more in line with Bayesian thinking, where the question is not whether the preferred model is perfectly true (to which the answer is no), but whether under the assumed model(s) the observed data is a plausible outcome.24 The data generating process is composed of 5 partially nested models with unequal model weights imposed. This creates a “hierarchy” of models, with y4 and y5 relatively dominating the remaining models in terms of explained variation.

Setup “A”:

Setup “B”:

Posterior inference under the different prior structures will be examined with varying signal-to-noise ratios. In particular, we conduct the simulation study for four increasing levels of noise: σ = 1/2, σ = 1, σ = 2.5, σ = 5. The relatively low number of K = 15 potential regressors allows for enumerating the model space and basing posterior inference on the results of the full set of $2^K$ models. This guarantees that differences in results for the competing priors are not influenced by additional variation due to stochastic search.
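To illustrate what full enumeration entails, a minimal Python sketch is given below; `log_ml` stands for whichever (log) marginal likelihood one adopts (e.g. a fixed-g or hyper-g formula) and is an assumption of this illustration, as is the uniform model prior:

```python
import itertools
import numpy as np

def enumerate_pmp(y, X, log_ml):
    """Enumerate all 2^K covariate combinations and return the models together
    with their normalized posterior model probabilities under a uniform model
    prior (sketch; the empty combination corresponds to the null model)."""
    K = X.shape[1]
    models, log_weights = [], []
    for size in range(K + 1):
        for combo in itertools.combinations(range(K), size):
            models.append(combo)
            log_weights.append(log_ml(y, X[:, list(combo)]))
    log_weights = np.asarray(log_weights)
    weights = np.exp(log_weights - log_weights.max())   # guard against overflow
    return models, weights / weights.sum()
```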

Applied research often focuses on the posterior inclusion probabilities (PIPs) of the variables entering the analysis and the posterior moments of the related coefficients. Tables 2 and 3 show PIPs for setting “A”, averaged over 50 Monte Carlo draws (standard deviations in parentheses): under situations characterized by small degrees of noise (σ = 1/2 and σ = 1), results do not differ considerably between fixed and data-dependent priors for g. Under the σ = 2.5 setting, PIPs corresponding to coefficients of the data generating model exhibit differences in magnitude but still lead to the same interpretation. Results change when looking at the σ = 5 setting. Posterior mass under the flexible priors is spread more evenly than with fixed priors. The g-RIC prior shows strong support for the first variable, with a large PIP for β1 of approximately 80%. The remaining variables receive negligible posterior support, tempting the researcher to believe that the data generating process is solely driven by the first variable. In contrast, flexible priors still ‘identify’ all variables. As expected, mass is spread more evenly, and over larger models, resulting in a high share of covariates with PIP close to 0.5 – which reflects the serious degree of noise in the data.

Besides the PIP, the posterior model probability of the model that was used in generating the data can be of interest to examine consistency properties in the sense of Fernández et al. (2001a). Tables 6 and 7 show summary statistics for the posterior model probabilities based on the 50 Monte Carlo draws. In line with the asymptotic results, more information in the data leads the hyperprior to uncover the data generating process with the highest precision, whereas increasing noise deteriorates the selection ability of BMA for all settings. The ratio of the posterior model probability for the data generating process to the one with the highest PMP is given in Table 7. The results show that in situations described by higher degrees of noise in the data all specifications favor a model different from the one generating the data. However, the deterioration of PMPs, coupled with a surge in the PMP ratio of true to best model for flexible priors when noise increases, indicates that mass is more diluted. In other words, while flexible priors fail to uncover the data generating model (as do the fixed ones), the PMP they assign to the best model is small compared to fixed priors. Hence the surge in uncertainty is reflected in the posterior mass distribution. Figures 1 and 2 exemplify the differences in PMP ascription for the 8 priors. The first figure shows the cumulative posterior mass of the 100 best models under the four signal-to-noise settings, averaged over 50 Monte Carlo draws. From this figure and the numbers in Table 6 it becomes evident that flexible priors uncover the data generating model with the highest precision and concentrate most mass on it in situations characterized by a high degree of information in the data. This means that the posterior mean $E\!\left(\frac{g}{1+g}\,\middle|\,y, X\right)$ is larger for the flexible priors than the constant values of $\frac{g}{1+g}$ under the fixed priors. As noise increases, the flexible priors distribute mass more evenly among explanatory variables, reflecting the surge in uncertainty. In contrast, approaches fixing g are not capable of adjusting the posterior mass distribution to the uncertainty inherent in the data. The merits of Bayesian model averaging regarding the handling of model uncertainty and predictive abilities are thus limited in these settings. Figure 2 shows a QQ-plot for the prior settings with the g-RIC specification as the reference prior. For all data-dependent priors we see that differences increase with noise, as is expected.25

Under setting “B” we view the employed models rather as approximations, and uncovering a “true” model is of minor importance. The results, illustrated in Figures 3 and 4, once again exemplify the supermodel effect of fixed prior settings. Small degrees of noise trigger a concentration of posterior mass under the hyper-g prior and the empirical Bayes approach. A surge in noise is reflected in a wider spread of posterior mass among models under flexible priors, whereas fixed priors still concentrate on a small number of models.

Finally, we draw attention to the shrinkage factor’s role in weighting prior versus posterior coefficients and its implications for prediction. Results from a prediction exercise are expected to vary considerably between fixed and flexible prior settings, since the latter incorporate data-adaptive shrinkage. Akin to Liang et al. (2008), we calculate the root mean squared error (rmse) based on 30 out-of-sample observations, averaged over 50 Monte Carlo steps.

The rmse statistics shown in Table 8 are normalized with respect to forecasting results under the g-RIC prior. Thus values below 1 indicate better predictive performance (in terms of accuracy) of the respective prior structure as compared to the g-RIC prior. The top panel of Table 8 shows mixed results for setting “A”. As expected, the g-RIC prior excels in nearly all signal-to-noise settings, concentrating on a single (and luckily the correct data-generating) model. In the σ = 1/2 test bed, however, the flexible priors concentrate mass even more tightly than does the g-RIC and consequently yield comparably better predictions in terms of rmse. As noise increases, the g-RIC outperforms the other priors by greater margins, exploiting the comparative advantage that the data generating process is composed of a single model.

In contrast, the predictive merits of flexible priors are more pronounced in the complex data generating process of setting “B”: predictive abilities differ by a greater margin, with flexible priors nearly dominating throughout all signal-to-noise settings. Especially the HG-3 prior and the empirical Bayes approach demonstrate superior predictive abilities, with the latter outperforming the g-RIC prior for all signal-to-noise setups. This contrasts with simulation exercises in the literature, as their data generating processes can be traced back to single models, which plays in favor of (large) fixed priors because of the supermodel effect.

V Growth Determinants Revisited

We now apply the different prior settings in order to examine growth determinants in a cross-country growth data set. There is a dense empirical growth literature that has employed model averaging techniques.26 We use the data set given in Fernández et al. (2001b) and described in Sala-i-Martin (1997). Following Fernández et al. (2001b) we use 41 potential growth determinants for 72 countries. The data comprise proxy variables for human capital, institutional quality indicators, investment variables and regional dummy variables. For the sake of comparison we employ uniform model priors instead of the beta-binomial model prior. Results on the posterior inclusion probabilities are shown in Table 9, where we have used 3,000,000 posterior draws after a burn-in phase of 2,000,000 draws.27 Variables with posterior inclusion probabilities greater than 0.5 are often identified as ‘robustly related’ to the dependent variable.28 However, note that such a threshold should be allowed to vary with the information content of the data, as well as with the implied model size penalty of model priors. Given our model prior structure, the model size penalty term in the marginal likelihoods has not been adjusted for: a larger posterior shrinkage factor thus corresponds to more weight being put on a small subset of parsimonious models. A PIP that only slightly exceeds the 0.5 threshold, coupled with a rather small posterior mean of $\frac{g}{1+g}$, cannot be interpreted in the same way as in the case when $E\!\left(\frac{g}{1+g}\,\middle|\,y, X\right)$ is large. In particular this applies to comparing results with the g-RIC prior, under which the (posterior) shrinkage factor is much larger than otherwise.

The flexible prior structures identify a range of additional growth determinants as compared to the g-RIC prior setting, manifested in the differences of the posterior mean model sizes given at the bottom of the table. The top right panel of Figure 5 illustrates the behavior under the different prior structures. Due to the degree of noise in the data set, flexible priors distribute posterior mass more evenly than fixed priors, in particular the g-RIC setting. The following figures exemplify the variation of posterior inference for the 8 prior settings, with variables on the X-axis ordered according to the posterior inclusion probabilities under the g-RIC setting. The figures reveal the remarkably small differences within the class of flexible priors, again illustrating the close inter-relatedness of this group. By virtue of their smaller g-values, the g-E(g/(1+g)|Y) and the g-UIP are the fixed priors coming closest to the results of the flexible priors, whereas the g-RIC is far off.

Moreover, the graphs show that inference under the hyper-g priors is insensitive to the prior choice on the hyperparameter a. Due to the supermodel effect and model size penalty, the g-RIC prior identifies a smaller subset of growth determinants. However, this does not hold for all variables: the number of years an economy has been open (YrsOpen) loses significance under the flexible priors, with the g-RIC setting being the only prior structure identifying the variable as an important growth determinant. The distribution of posterior mass under the employed prior structures is shown in the top left panel of Figure 5. Due to the degree of noise in the data, the flexible priors distribute posterior mass more evenly than the fixed ones. Besides the (robust) identification of growth drivers, we are interested in the posterior means and the standardized coefficients (i.e. posterior mean/posterior standard deviation) as a further significance indicator, given in the bottom panel of Figure 5.29 Disparities in posterior means emanate from differences in the posterior inclusion probabilities and magnitudes of the shrinkage factor among the prior settings. As expected, for some variables (a regional dummy (Hindu) and two proxy variables for human capital (HighEnroll and PublicEducpt)) posterior means vary considerably compared to the g-RIC setting.

Variations of posterior inference are solely due to differences in the value of the g hyperparameter, thus once again emphasizing the importance of this parameter for Bayesian model averaging inference. For the growth data exercise, the (posterior) shrinkage factor varies from 0.999 (g-RIC) to 0.951 (HG-4). Fixing g - in our view - bears the danger of ignoring the information in the data and exerts non-negligible influence on posterior results.

VI Concluding Remarks

The ubiquity of Zellner’s g prior in linear BMA rests on two reasons: it provides closed-form solutions and reduces the complexity of prior elicitation to, in practice, one scalar g. Consequently, theoretical considerations have mostly focused on the choice of g, in particular its virtues as a penalty term for model size.

This study deviates from this literature by bringing forward two arguments that have been overlooked so far: First, model size considerations should be decoupled from the prime feature of g (scaling coefficient covariance) and more properly be fused into the formulation of model priors. The elicitation of g should thus not interfere with prior desiderata on model size.

Second, we demonstrated that fixing g to arbitrary values may have unintended consequences on posterior model probabilities: The higher g, the more tightly posterior mass will concentrate on the few best-performing ‘super models’ – regardless of model sizes, number of observations or signal-to-noise ratios. Ultimately, a large value for g will favor a single model, thereby acting in a model-selection fashion. As previous studies have predominantly assessed BMA performance on simulated data generated by a single model, they tended to favor g-specifications ascribing large values to g that effectively select the right model. We demonstrate in section IV that the g-RIC prior suggested by Fernández et al. (2001a) is particularly prone to this supermodel behavior.

In order to overcome these problems, we propose to put a prior distribution on the g parameter: such a hyperprior allows for data-dependent shrinkage, thus adjusting the weight of prior beliefs more properly according to data quality. In discriminating among models only as far as data quality allows, a prior on g thus offers a remedy for the supermodel effect. In this manner, we focus on the hyper-g prior introduced by Liang et al. (2008), whose formulation offers three main advantages: First, it admits closed form solutions for almost any quantity of interest, thereby facilitating implementation. Second, it allows for BMA consistency. Third, its hyperparameter allows for formulating prior beliefs on coefficient variance, but without incurring the risk of unintended consequences on posterior model mass. We complement the existing literature on the hyper-g prior by providing additional posterior expressions that allow for fully Bayesian inference, as well as for sound numerical implementation.

Section IV contrasts various formulations of fixed and hyper-g priors in simulations, concentrating on predictive performance under varying signal-to-noise ratios. As expected, the fixed (especially the g-RIC) priors perform well when the data generating process rests on a single model that is part of the candidate model space. However, in more complex settings, the virtues of flexible prior structures become pronounced: flexible priors outperform fixed g settings (in particular g-RIC) in terms of forecasting accuracy and exhibit a more stable structure of posterior model and inclusion probabilities as noise varies.

The final section illustrates these considerations by applying the same priors to a prominent growth data set. The results demonstrate that fixing g runs the risk of grossly over- or understating the importance of some variables – the degree of openness, for instance, is not as important to growth as one may think under the g-RIC prior. In this data set, fixing g to values larger than implied by flexible priors leads to stronger discrimination among posterior inclusion probabilities, which may incite overconfidence in BMA results. Finally, the magnitudes of several coefficients differ markedly between fixed and hyper-g priors, but are negligible among the hyper-g prior structures.

Concluding, the hyper-g prior offers a sound, fully Bayesian approach that features the virtues of prior input and predictive gains without incurring the risk of misspecification.

References

    Abramowitz, M. and Stegun, I. (1972). Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables. National Bureau of Standards Applied Mathematics Series 55, Tenth Printing.

    Barbieri, M. M. and Berger, J. O. (2003). Optimal Predictive Model Selection. Annals of Statistics 32: 870-897.

    Bernardo, J. and Smith, A. (1994). Bayesian Theory. John Wiley and Sons, New York.

    Brown, P., Vannucci, M. and Fearn, T. (1998). Multivariate Bayesian Variable Selection and Prediction. Journal of the Royal Statistical Society B 60: 627-641.

    Chipman, H., George, E. and McCulloch, R. (2001). The Practical Implementation of Bayesian Model Selection. Institute of Mathematical Statistics Lecture Notes-Monograph Series, Vol. 38, Beachwood, Ohio.

    Crespo Cuaresma, J. and Doppelhofer, G. (2007). Nonlinearities in Cross-Country Growth Regressions: A Bayesian Averaging of Thresholds (BAT) Approach. Journal of Macroeconomics 29: 541-554.

    Cui, W. and George, E. (2008). Empirical Bayes vs. fully Bayes variable selection. Journal of Statistical Planning and Inference 138(4): 888-900.

    Eicher, T., Papageorgiou, C. and Raftery, A. (2009). Determining growth determinants: default priors and predictive performance in Bayesian model averaging. Journal of Applied Econometrics, forthcoming.

    Eklund, J. and Karlsson, S. (2007). Forecast Combination and Model Averaging using Predictive Measures. Econometric Reviews 26: 329-362.

    Fernández, C., Ley, E. and Steel, M. F. (2001a). Benchmark Priors for Bayesian Model Averaging. Journal of Econometrics 100: 381-427.

    Fernández, C., Ley, E. and Steel, M. F. (2001b). Model Uncertainty in Cross-Country Growth Regressions. Journal of Applied Econometrics 16: 563-576.

    Foster, D. P. and George, E. I. (1994). The Risk Inflation Criterion for Multiple Regression. The Annals of Statistics 22: 1947-1975.

    Gelman, A., Carlin, J. B., Stern, S. H. and Rubin, B. D. (1995). Bayesian Data Analysis. Chapman & Hall.

    George, E. and Foster, D. (2000). Calibration and empirical Bayes variable selection. Biometrika 87(4): 731-747.

    Gupta, A. K. and Nagar, D. K. (2000). Matrix Variate Distributions. Chapman & Hall/CRC Monographs and Surveys in Pure and Applied Mathematics 104.

    Hansen, M. and Yu, B. (2001). Model selection and the principle of minimum description length. Journal of the American Statistical Association 96(454): 746-774.

    Hoeting, J. A., Madigan, D., Raftery, A. E. and Volinsky, C. T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science 14(4): 382-417.

    Kass, R. and Raftery, A. (1995). Bayes Factors. Journal of the American Statistical Association 90: 773-795.

    Kass, R. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association: 928-934.

    Koop, G. and Potter, S. (2003). Forecasting in Large Macroeconomic Panels Using Bayesian Model Averaging. FRB NY Staff Report 163.

    Laud, P. and Ibrahim, J. (1995). Predictive model selection. Journal of the Royal Statistical Society Series B 57: 247-262.

    Ley, E. and Steel, M. F. (2009). On the Effect of Prior Assumptions in Bayesian Model Averaging with Applications to Growth Regressions. Journal of Applied Econometrics 24(4): 651-674.

    Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008). Mixtures of g Priors for Bayesian Variable Selection. Journal of the American Statistical Association 103: 410-423.

    Masanjala, W. and Papageorgiou, C. (2008). Rough and Lonely Road to Prosperity: A Re-examination of the Sources of Growth in Africa Using Bayesian Model Averaging. Journal of Applied Econometrics 23: 671-682.

    R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org.

    Raftery, A. E. (1995). Bayesian Model Selection in Social Research. Sociological Methodology 25: 111-163.

    Sala-i-Martin, X. (1997). I Just Ran 2 Million Regressions. American Economic Review 87: 178-183.

    Sala-i-Martin, X., Doppelhofer, G. and Miller, R. I. (2004). Determinants of Long-Term Growth: A Bayesian Averaging of Classical Estimates (BACE) Approach. American Economic Review 94: 813-835.

    Strachan, R. and van Dijk, H. (2004). Exceptions to Bartlett's Paradox. Keele Economic Research Papers.

    Zellner, A. (1986). On Assessing Prior Distributions and Bayesian Regression Analysis with g-Prior Distributions. In: Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. North-Holland: Amsterdam.

    Zellner, A. (2008). Comments on 'Mixtures of g-priors for Bayesian Variable Selection' by F. Liang, R. Paulo, G. Molina, M. A. Clyde and J. O. Berger.
A Technical Appendix

A.1 Consistency of the Hyper-g Prior

Fernández et al. (2001a) define asymptotic ‘consistency’ as follows: consider that only model Ms is true, while all other models $M_j \neq M_s$ are not true. Consistency then requires:
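Using the notation of the main text, this is the requirement that the true model accumulates all posterior mass:

$$\lim_{N\to\infty}\;p(M_s \mid y, X)\;=\;1 .$$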

Liang et al. (2008, Appendix B) have proven the above for the hyper-g prior except for the case where the true model Ms is the null model M0. They stop their proof short because in this case the Bayes factor B(Mj : M0) is (Liang et al., 2008, p. 423):

Moreover they state that if the above integral vanishes as N → ∞, then consistency is ensured.

Applying the hyper-g setting transforms the right-hand side in (A.1) into the following (since a > 2):

If $a = 2 + w(N)$ with $w(N) > 0$ and $\lim_{N\to\infty} w(N) = 0$, then the integral vanishes, which concludes the proof.

A.2 Relationship between Hyper-g Prior and EBL

Due to perceived numerical difficulties, Liang et al. (2008) propose the use of a Laplace approximation for the posterior model likelihood under the hyper-g distribution (Liang et al. (2008, equation (17))). Depending on the data, Laplace approximations can be prone to substantial numerical inaccuracies. However, they may be useful for the purpose of this section, namely a tentative approach to establishing a rough equivalence between particular forms of the hyper-g prior and Empirical Bayes with respect to posterior statistics. Consider the familiar form of the Laplace approximation
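In its usual form (a standard statement, included here for reference):

$$\int \exp\{h(\theta)\}\,d\theta\;\approx\;\exp\{h(\hat\theta)\}\,\sqrt{\frac{2\pi}{-h''(\hat\theta)}}$$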

where $\hat\theta$ is the maximizer of the integrand's logarithm h(θ). Consider in turn the null-based model Bayes Factor for the hyper-g prior formulation as in (7):

Letting

yields the maximizer:

where $\hat g = 0$ if and only if $k + a \geq R^2(N-1)$. Liang et al. (2008, p. 421) note the similarity to the local Empirical Bayes (EBL) estimator of g, but abstain from further investigating the issue.

The second derivative of h(g) is given as

The Bayes factor is thus approximately equal to:

In case we have $\hat g > 0$, algebraic manipulation of the expression above yields:

Now consider the equivalent null-based model Bayes Factor for the EBL approach which is:

in case $k \leq R^2(N-1)$

Therefore, if $k + a \leq R^2(N-1)$:

So if a → 2, the hyper-g Bayes Factor is approximately equivalent to an EBL Bayes factor times a k-based model prior (that does not depend on the data). Moreover, this model prior is bounded in a relatively narrow range: Note that

The upper bound follows from the fact that $\left(\frac{k+a}{k}\right)^{k/2} = \left(1+\frac{a/2}{k/2}\right)^{k/2} < \exp\left(\frac{a}{2}\right)$. Similarly, $\left(\frac{N-1-k-a}{N-1-k}\right)^{\frac{N-1-k}{2}} < \exp\left(\frac{a}{2}\right)$. Setting k = 1 and letting N – 1 ≥ k + a + 1 yields the lower bound.30 The effect of the term in square roots actually counters the impact of the latter term, as $\frac{4}{N-1}\sqrt{\frac{N-1}{(N-1-k-a)(k+a)}} \leq 1$ for k + a < N – 1. The ‘model prior’ thus results in an upweighting of models with few or many coefficients, while intermediate model sizes are downweighted.

The impact of the k-based ‘model prior’, however, is virtually negligible with respect to the size of $BF_{EBL}$. Thus, at least as long as $R^2(N-1) > k + a$, $BF_h$ is quite close to $BF_{EBL}$. And as long as the signal-to-noise ratio in the data is not too small, BMA posterior statistics will be disproportionately based on models with large PMPs (and thus $(N-1)R^2 > k + a$). Models with large differences between $BF_h$ and $BF_{EBL}$ will thus hardly affect posterior statistics.

A.3 The Shrinkage Factor and Goodness-of-Fit

In order to demonstrate equation (16), consider a reformulation of the posterior expected value of the shrinkage factor (13):

where the correction term equals $\sum_{s=1}^{2^K} p(M_s\,|\,y, X)\;\frac{\bar\theta_s}{F_s^*\,R_s^2\,(\bar N - \bar\theta_s)}$.

This correction term is based on the expression $\bar\theta_s/F_s^*$ in (13), whose only role is to keep $E\!\left(\frac{g}{1+g}\,\middle|\,X, y\right)$ non-negative in case of a ‘bad’ model, whereas it rapidly vanishes for models with higher signal-to-noise ratios. As long as the null model is not the single ‘true’ model, this term vanishes as N → ∞ for fixed K – but even in small samples, it tends rapidly towards zero as data quality increases. Moreover, in BMA sampling with any viable signal-to-noise ratio, any models with very low PMP will hardly affect posterior results, and hence the expression will vanish as soon as there exist some models with considerable null-based Bayes factors (which therefore must have $F_s^* \gg 1$).

In the following, suppose that K + a < N. Now proceed to demonstrating the inequality in (16) by considering that, as long as $\bar\theta_s \leq \bar N$, the following holds:31

where $E_M(x) \equiv \sum_{j=1}^{2^K} x_j\; p(M_j\,|\,y, X)$ denotes the expected value over model probabilities. Multiply with $\bar N$ and subtract unity to obtain

Moreover, since any nested model's R-squared $R_s^2$ cannot exceed the R-squared of the full model $R_F^2$, we have that $\frac{1-R_s^2}{R_s^2} \geq \frac{1-R_F^2}{R_F^2}$ and therefore:

Retransforming and integrating in (A.2) yields another representation of (16) (recall that $\bar N \equiv N-3$ and $E(\bar\theta_s) = E(k_s) + a - 2$):

How close $E\!\left(\frac{g}{1+g}\,\middle|\,y, X\right)$ comes to this upper bound is mainly determined by the posterior variance of model size (the less variance, the closer), and by the parsimoniousness of the model priors. Note that the term on the left-hand side might break the inequality in rare instances. However, this term tends to be very small: numerical simulations of a null hypothesis with varying N, K, a and standard deviations have yielded no single instance in which $R_F^2 > \frac{K+a-2}{\bar N}$ and $E\!\left(\frac{g}{1+g}\,\middle|\,y, X\right)$ was larger than the right-hand side above. Therefore, if $R_F^2 > \frac{K+a}{N}\;\left(> \frac{K+a-2}{N-3}\right)$,32 then the term can be safely omitted from the inequality above.

A.4 The Posterior Predictive Distribution and the Hyper-g Prior

Consider using the data (X, y) to forecast the dependent variable ŷ conditional on ‘prediction’ covariates $\hat X$. Let X be an N × k matrix and y an N × 1 vector, while ŷ is l × 1 and $\hat X$ is l × k. The posterior predictive distribution of ŷ is then given as a multivariate t-distribution of dimension l (Eklund and Karlsson, 2007, equation (A.15)):33

where Σ = (I_l + s X̂ (X′X)⁻¹ X̂′) · (y − ȳ)′(y − ȳ)/(N − 1) · (1 − sR²)

Here, s denotes the shrinkage factor s = g/(1+g), R² the (centered) R-squared of y on X, and ȳ an N-dimensional vector whose elements equal the arithmetic mean of y. Integrating the density function of ŷ | X̂, X, y, g with respect to the shrinkage factor yields the integrand of the following equation (after some rearrangement):

… where the demeaned vector y − ȳ is used in place of y. To our knowledge, there is no closed-form solution to the integral above, nor to its Laplace approximation. We therefore recommend resorting to numerical integration.
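Since we recommend numerical integration at this point, the following R sketch illustrates that route for a single out-of-sample observation (l = 1). The functions cond_dens and weight are hypothetical placeholders: cond_dens(yhat, s) stands for the Student-t density of ŷ conditional on the shrinkage factor s (with location, scale and degrees of freedom as implied by the expression above), and weight(s) for the unnormalised posterior density of s; both would have to be supplied from the exact integrand.

```r
# Minimal sketch (not the paper's code): the predictive density is a mixture
# over the shrinkage factor s = g/(1+g) in (0, 1), evaluated numerically
# with R's integrate().
predictive_density <- function(yhat, cond_dens, weight,
                               lower = 1e-8, upper = 1 - 1e-8) {
  num <- integrate(function(s) cond_dens(yhat, s) * weight(s),
                   lower, upper)$value           # integral of f(yhat | s) * w(s)
  den <- integrate(weight, lower, upper)$value   # normalising constant of w(s)
  num / den
}

# Toy placeholder functions, purely illustrative:
w  <- function(s) (1 - s)^2 * (1 - 0.8 * s)^(-20)                  # hypothetical weight kernel for s
fd <- function(yhat, s) dt((yhat - 1.5 * s) / 0.7, df = 30) / 0.7  # hypothetical conditional t-density
predictive_density(0.9, fd, w)
```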

A.5 The Beta-binomial Prior over the Model Space

BMA calls for eliciting a prior distribution over the model space. Two prior specifications have typically been imposed in the literature: a) an uninformative flat prior over all models, under which the posterior odds ratio reduces to the Bayes factor and model comparison is governed solely by relative marginal likelihoods, and b) a prior that discriminates among models according to the number of regressors they include, so that prior probability mass is allocated by model size (see Sala-i-Martin et al. (2004)). This second alternative assumes that each covariate enters a model with probability ϑ, which implies that the prior mass for model j, which includes k_j variables, amounts to P(M_j) = ϑ^(k_j) (1 − ϑ)^(K − k_j). The uninformative prior in a) is nested in b) by imposing ϑ = 1/2, which results in an equal model probability of 2^(−K) for every model.
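As a small illustration of alternative b) (with function name and the toy value of K chosen by us), the R lines below compute the fixed-ϑ prior mass of a model with k regressors and confirm that ϑ = 1/2 assigns 2^(−K) to every model:

```r
# Binomial model prior: each of K candidate regressors enters independently
# with probability theta, so a model with k regressors gets theta^k (1-theta)^(K-k).
binom_model_prior <- function(k, K, theta = 0.5) theta^k * (1 - theta)^(K - k)

K <- 10                                           # toy number of candidate regressors
binom_model_prior(k = 3, K = K)                   # = 2^-K = 1/1024 for theta = 1/2
sum(choose(K, 0:K) * binom_model_prior(0:K, K))   # prior mass over all 2^K models sums to 1
```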

Ley and Steel (2009) show that fixing ϑ = 1/2 puts most prior mass on models with about K/2 regressors, since these are dominant in number. Their recommendation is thus to treat ϑ as random and to place a (hyper)prior on it. Specifically, Ley and Steel (2009) propose that the model size follow a beta-binomial(a, b) distribution (Bernardo and Smith (1994)) with a = 1, so that

The prior can then be elicited by anchoring the prior expected model size, m.34 Ley and Steel (2009) quantify the influence that a poorly specified prior exerts on posterior results when ϑ is fixed: the relative merits of BMA become less pronounced and its predictive power deteriorates. In contrast, their results indicate that, with ϑ treated as random, the choice of m has little impact on posterior inference, so that the prior over models is close to non-informative.
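Continuing the sketch above, the beta-binomial prior with a = 1 can be elicited from the prior expected model size m alone, using b = (K − m)/m; the helper below (our naming) relies on the standard beta-function form of the marginal model probability obtained by integrating ϑ out:

```r
# Beta-binomial model prior (a = 1): integrating theta^k (1-theta)^(K-k) against a
# Beta(a, b) prior on theta gives prior mass B(a + k, b + K - k) / B(a, b) for a
# model with k regressors; b = (K - m)/m anchors the prior expected model size at m.
betabinom_model_prior <- function(k, K, m) {
  a <- 1
  b <- (K - m) / m
  beta(a + k, b + K - k) / beta(a, b)
}

K <- 10; m <- 3
p <- betabinom_model_prior(0:K, K, m)
sum(choose(K, 0:K) * p)                    # total prior mass equals 1
sum((0:K) * choose(K, 0:K) * p)            # prior expected model size equals m = 3
```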

A.6 Charts and Tables

Figure 1:Cumulated posterior model probabilities for Setting A. Top panel corresponds to a signal-to-noise ratio of σ = 1/2 (left) and σ = 1 (right). Bottom panel to a ratio of σ = 2.5 (left) and σ = 5 (right).

Figure 2:QQ-plot of cumulated posterior mass for different choices of g against that of the g-RIC setting (Setting A based on 50 Monte Carlo draws).

Table 2: Posterior Inclusion Probabilities for Setting A, with standard deviations in parentheses. The left panel corresponds to a signal-to-noise ratio of σ = 1/2, the right panel to a ratio of σ = 1. Coefficients corresponding to variables of the data generating model are in bold. PIP values exceeding 0.5 in bold. Results are averaged over 50 Monte Carlo Steps.
Signal-to-noise ratio σ = 1/2 (columns: g-RIC, g-UIP, g-E(g/(1+g)|Y), EBL, HG-3, HG-4, HG-RIC, HG-UIP):
β1: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β2: 0.085 (0.072), 0.114 (0.069), 0.068 (0.067), 0.055 (0.058), 0.055 (0.055), 0.059 (0.058), 0.050 (0.051), 0.050 (0.051)
β3: 0.083 (0.067), 0.112 (0.061), 0.066 (0.064), 0.054 (0.055), 0.052 (0.046), 0.056 (0.049), 0.047 (0.043), 0.047 (0.043)
β4: 0.077 (0.065), 0.106 (0.064), 0.061 (0.062), 0.050 (0.060), 0.050 (0.059), 0.054 (0.060), 0.046 (0.057), 0.046 (0.057)
β5: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β6: 0.086 (0.059), 0.116 (0.061), 0.068 (0.052), 0.055 (0.046), 0.055 (0.045), 0.060 (0.048), 0.050 (0.042), 0.050 (0.042)
β7: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β8: 0.113 (0.153), 0.136 (0.132), 0.098 (0.156), 0.085 (0.149), 0.082 (0.138), 0.086 (0.141), 0.077 (0.135), 0.077 (0.135)
β9: 0.084 (0.066), 0.114 (0.070), 0.066 (0.058), 0.054 (0.055), 0.054 (0.054), 0.058 (0.057), 0.049 (0.051), 0.049 (0.051)
β10: 0.086 (0.082), 0.115 (0.078), 0.068 (0.077), 0.056 (0.069), 0.056 (0.066), 0.060 (0.069), 0.051 (0.062), 0.051 (0.062)
β11: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β12: 0.085 (0.054), 0.115 (0.056), 0.066 (0.048), 0.054 (0.043), 0.054 (0.042), 0.059 (0.044), 0.049 (0.040), 0.050 (0.040)
β13: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β14: 0.079 (0.069), 0.109 (0.065), 0.063 (0.067), 0.052 (0.064), 0.052 (0.063), 0.055 (0.064), 0.047 (0.061), 0.047 (0.061)
β15: 0.093 (0.096), 0.122 (0.094), 0.075 (0.090), 0.062 (0.085), 0.062 (0.084), 0.066 (0.087), 0.057 (0.080), 0.057 (0.080)
E(g/(1+g)|Y): 0.996, 0.990, 0.998, 0.999, 0.998, 0.998, 0.999, 0.999

Signal-to-noise ratio σ = 1 (same columns):
β1: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β2: 0.094 (0.125), 0.132 (0.128), 0.109 (0.127), 0.115 (0.128), 0.116 (0.127), 0.124 (0.128), 0.107 (0.127), 0.107 (0.127)
β3: 0.085 (0.073), 0.125 (0.084), 0.101 (0.078), 0.108 (0.081), 0.109 (0.080), 0.117 (0.083), 0.099 (0.077), 0.099 (0.077)
β4: 0.078 (0.057), 0.117 (0.068), 0.093 (0.062), 0.100 (0.063), 0.101 (0.063), 0.110 (0.065), 0.092 (0.060), 0.092 (0.060)
β5: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β6: 0.067 (0.021), 0.105 (0.031), 0.082 (0.025), 0.089 (0.029), 0.090 (0.029), 0.098 (0.032), 0.081 (0.027), 0.081 (0.027)
β7: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β8: 0.087 (0.061), 0.129 (0.075), 0.104 (0.068), 0.111 (0.069), 0.111 (0.068), 0.121 (0.071), 0.101 (0.065), 0.102 (0.065)
β9: 0.083 (0.049), 0.125 (0.066), 0.099 (0.057), 0.107 (0.065), 0.108 (0.064), 0.117 (0.068), 0.098 (0.060), 0.098 (0.060)
β10: 0.100 (0.086), 0.145 (0.106), 0.118 (0.095), 0.126 (0.105), 0.127 (0.103), 0.136 (0.107), 0.116 (0.099), 0.116 (0.099)
β11: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β12: 0.109 (0.122), 0.151 (0.134), 0.126 (0.128), 0.133 (0.131), 0.133 (0.129), 0.142 (0.131), 0.123 (0.126), 0.123 (0.126)
β13: 0.992 (0.049), 0.992 (0.045), 0.992 (0.047), 0.992 (0.047), 0.992 (0.048), 0.992 (0.047), 0.992 (0.049), 0.992 (0.049)
β14: 0.097 (0.124), 0.136 (0.130), 0.112 (0.127), 0.120 (0.129), 0.120 (0.127), 0.129 (0.129), 0.111 (0.125), 0.111 (0.125)
β15: 0.101 (0.096), 0.144 (0.109), 0.119 (0.103), 0.125 (0.101), 0.126 (0.100), 0.135 (0.102), 0.115 (0.096), 0.115 (0.096)
E(g/(1+g)|Y): 0.996, 0.990, 0.994, 0.993, 0.992, 0.991, 0.993, 0.993
Table 3: Posterior Inclusion Probabilities for Setting A, with standard deviations in parentheses. The left panel corresponds to a signal-to-noise ratio of σ = 2.5, the right panel to a ratio of σ = 5. Coefficients corresponding to variables of the data generating model are in bold. PIP values exceeding 0.5 in bold. Results correspond to 50 Monte Carlo Steps.
Signal-to-noise ratio σ = 2.5 (columns: g-RIC, g-UIP, g-E(g/(1+g)|Y), EBL, HG-3, HG-4, HG-RIC, HG-UIP):
β1: 1.000 (0.002), 1.000 (0.001), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β2: 0.086 (0.106), 0.131 (0.126), 0.317 (0.149), 0.304 (0.153), 0.304 (0.151), 0.330 (0.151), 0.276 (0.150), 0.276 (0.150)
β3: 0.094 (0.064), 0.141 (0.077), 0.331 (0.109), 0.316 (0.115), 0.317 (0.113), 0.343 (0.114), 0.289 (0.111), 0.289 (0.111)
β4: 0.108 (0.119), 0.158 (0.144), 0.339 (0.165), 0.327 (0.172), 0.327 (0.169), 0.351 (0.168), 0.300 (0.169), 0.300 (0.169)
β5: 0.796 (0.244), 0.826 (0.215), 0.871 (0.155), 0.873 (0.155), 0.870 (0.156), 0.875 (0.150), 0.865 (0.164), 0.865 (0.163)
β6: 0.062 (0.044), 0.103 (0.068), 0.282 (0.115), 0.272 (0.118), 0.273 (0.117), 0.298 (0.119), 0.245 (0.113), 0.245 (0.113)
β7: 0.997 (0.014), 0.999 (0.005), 1.000 (0.001), 1.000 (0.001), 1.000 (0.001), 1.000 (0.001), 1.000 (0.001), 1.000 (0.001)
β8: 0.075 (0.065), 0.121 (0.095), 0.305 (0.136), 0.292 (0.151), 0.292 (0.149), 0.317 (0.149), 0.265 (0.148), 0.265 (0.148)
β9: 0.131 (0.161), 0.187 (0.199), 0.369 (0.217), 0.356 (0.219), 0.356 (0.216), 0.380 (0.213), 0.329 (0.218), 0.330 (0.218)
β10: 0.082 (0.089), 0.127 (0.113), 0.313 (0.144), 0.302 (0.151), 0.303 (0.148), 0.328 (0.149), 0.275 (0.146), 0.275 (0.146)
β11: 0.828 (0.240), 0.869 (0.205), 0.925 (0.141), 0.922 (0.146), 0.919 (0.147), 0.924 (0.141), 0.913 (0.156), 0.913 (0.156)
β12: 0.065 (0.036), 0.109 (0.053), 0.301 (0.105), 0.288 (0.109), 0.289 (0.107), 0.315 (0.111), 0.259 (0.102), 0.260 (0.102)
β13: 0.437 (0.332), 0.504 (0.322), 0.651 (0.261), 0.640 (0.268), 0.637 (0.267), 0.653 (0.258), 0.617 (0.277), 0.617 (0.277)
β14: 0.098 (0.129), 0.143 (0.146), 0.321 (0.157), 0.309 (0.163), 0.310 (0.161), 0.334 (0.159), 0.283 (0.162), 0.283 (0.162)
β15: 0.165 (0.147), 0.212 (0.156), 0.386 (0.148), 0.372 (0.160), 0.373 (0.158), 0.396 (0.156), 0.347 (0.161), 0.347 (0.161)
E(k|Y): 5.027, 5.629, 7.711, 7.572, 7.568, 7.844, 7.260, 7.264
E(g/(1+g)|Y): 0.996, 0.990, 0.949, 0.955, 0.947, 0.939, 0.956, 0.956

Signal-to-noise ratio σ = 5 (same columns):
β1: 0.792 (0.257), 0.842 (0.208), 0.955 (0.066), 0.953 (0.068), 0.946 (0.077), 0.946 (0.075), 0.884 (0.207), 0.905 (0.171)
β2: 0.038 (0.037), 0.067 (0.058), 0.512 (0.149), 0.455 (0.171), 0.445 (0.166), 0.481 (0.159), 0.370 (0.185), 0.380 (0.180)
β3: 0.050 (0.052), 0.081 (0.074), 0.514 (0.160), 0.460 (0.180), 0.450 (0.176), 0.486 (0.167), 0.376 (0.197), 0.386 (0.192)
β4: 0.044 (0.072), 0.074 (0.104), 0.503 (0.158), 0.449 (0.176), 0.439 (0.172), 0.476 (0.164), 0.366 (0.195), 0.376 (0.190)
β5: 0.184 (0.267), 0.236 (0.286), 0.617 (0.215), 0.579 (0.246), 0.567 (0.244), 0.594 (0.228), 0.497 (0.279), 0.509 (0.273)
β6: 0.048 (0.070), 0.080 (0.100), 0.515 (0.185), 0.463 (0.208), 0.453 (0.204), 0.488 (0.194), 0.380 (0.226), 0.390 (0.221)
β7: 0.506 (0.342), 0.585 (0.340), 0.845 (0.183), 0.829 (0.208), 0.817 (0.212), 0.829 (0.195), 0.750 (0.284), 0.766 (0.268)
β8: 0.049 (0.075), 0.080 (0.098), 0.509 (0.157), 0.460 (0.172), 0.450 (0.169), 0.486 (0.160), 0.376 (0.194), 0.386 (0.188)
β9: 0.045 (0.061), 0.077 (0.094), 0.508 (0.165), 0.455 (0.176), 0.445 (0.173), 0.482 (0.165), 0.372 (0.198), 0.382 (0.192)
β10: 0.042 (0.040), 0.072 (0.059), 0.520 (0.147), 0.465 (0.163), 0.454 (0.159), 0.491 (0.152), 0.377 (0.181), 0.388 (0.175)
β11: 0.305 (0.316), 0.369 (0.323), 0.717 (0.222), 0.689 (0.246), 0.677 (0.246), 0.699 (0.229), 0.609 (0.294), 0.622 (0.285)
β12: 0.054 (0.067), 0.093 (0.109), 0.533 (0.171), 0.480 (0.196), 0.469 (0.192), 0.504 (0.182), 0.393 (0.217), 0.404 (0.211)
β13: 0.083 (0.119), 0.129 (0.151), 0.558 (0.189), 0.511 (0.216), 0.500 (0.212), 0.532 (0.200), 0.426 (0.239), 0.437 (0.233)
β14: 0.043 (0.058), 0.072 (0.083), 0.504 (0.160), 0.449 (0.176), 0.439 (0.172), 0.476 (0.164), 0.366 (0.191), 0.375 (0.187)
β15: 0.074 (0.114), 0.109 (0.131), 0.520 (0.155), 0.470 (0.175), 0.461 (0.171), 0.496 (0.162), 0.387 (0.195), 0.398 (0.191)
E(k|Y): 2.359, 2.965, 8.830, 8.167, 8.011, 8.467, 6.928, 7.103
E(g/(1+g)|Y): 0.996, 0.990, 0.783, 0.817, 0.795, 0.760, 0.856, 0.849

Figure 3: Cumulated posterior model probabilities for Setting B. Top panel corresponds to a signal-to-noise ratio of σ = 1/2 (left) and σ = 1 (right). Bottom panel to a ratio of σ = 2.5 (left) and σ = 5 (right).

Figure 4:QQ-plot of cumulated posterior mass for different choices of g against that of the g-RIC setting (Setting B based on 50 Monte Carlo draws).

Table 4: Posterior Inclusion Probabilities for Setting B, with standard deviations in parentheses. The left panel corresponds to a signal-to-noise ratio of σ = 1/2, the right panel to a ratio of σ = 1. PIP values exceeding 0.5 in bold. Results are averaged over 50 Monte Carlo Steps.
Signal-to-noise ratio σ = 1/2 (columns: g-RIC, g-UIP, g-E(g/(1+g)|Y), EBL, HG-3, HG-4, HG-RIC, HG-UIP):
β1: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β2: 0.801 (0.223), 0.835 (0.182), 0.818 (0.205), 0.810 (0.218), 0.808 (0.217), 0.813 (0.213), 0.804 (0.223), 0.804 (0.222)
β3: 0.762 (0.239), 0.802 (0.195), 0.781 (0.220), 0.773 (0.228), 0.772 (0.227), 0.777 (0.222), 0.767 (0.232), 0.767 (0.232)
β4: 0.999 (0.004), 0.999 (0.004), 0.999 (0.004), 0.999 (0.004), 0.999 (0.004), 0.999 (0.004), 0.999 (0.004), 0.999 (0.004)
β5: 0.885 (0.147), 0.897 (0.119), 0.891 (0.135), 0.889 (0.140), 0.888 (0.140), 0.889 (0.136), 0.886 (0.143), 0.886 (0.143)
β6: 0.943 (0.116), 0.952 (0.093), 0.948 (0.106), 0.947 (0.107), 0.946 (0.108), 0.947 (0.105), 0.944 (0.111), 0.944 (0.111)
β7: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β8: 0.696 (0.234), 0.749 (0.190), 0.720 (0.216), 0.712 (0.221), 0.711 (0.220), 0.718 (0.215), 0.704 (0.225), 0.704 (0.225)
β9: 0.679 (0.256), 0.734 (0.211), 0.704 (0.238), 0.694 (0.247), 0.693 (0.246), 0.700 (0.241), 0.686 (0.252), 0.686 (0.252)
β10: 0.998 (0.006), 0.998 (0.006), 0.999 (0.006), 0.999 (0.006), 0.998 (0.006), 0.998 (0.006), 0.998 (0.006), 0.998 (0.006)
β11: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β12: 0.999 (0.004), 0.999 (0.004), 0.999 (0.004), 0.999 (0.004), 0.999 (0.004), 0.999 (0.004), 0.999 (0.004), 0.999 (0.004)
β13: 0.989 (0.053), 0.990 (0.043), 0.989 (0.048), 0.989 (0.050), 0.989 (0.050), 0.989 (0.049), 0.989 (0.052), 0.989 (0.052)
β14: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β15: 0.486 (0.191), 0.575 (0.155), 0.524 (0.177), 0.508 (0.185), 0.509 (0.183), 0.519 (0.179), 0.499 (0.187), 0.499 (0.187)
E(k|Y): 13.238, 13.530, 13.373, 13.318, 13.313, 13.348, 13.275, 13.275
E(g/(1+g)|Y): 0.996, 0.990, 0.994, 0.995, 0.994, 0.993, 0.994, 0.994

Signal-to-noise ratio σ = 1 (same columns):
β1: 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000)
β2: 0.374 (0.266), 0.487 (0.262), 0.567 (0.242), 0.562 (0.247), 0.560 (0.245), 0.575 (0.240), 0.544 (0.250), 0.544 (0.250)
β3: 0.393 (0.295), 0.480 (0.270), 0.550 (0.243), 0.548 (0.248), 0.548 (0.247), 0.560 (0.241), 0.533 (0.253), 0.534 (0.253)
β4: 0.747 (0.282), 0.806 (0.242), 0.840 (0.207), 0.840 (0.206), 0.838 (0.206), 0.845 (0.199), 0.831 (0.215), 0.831 (0.215)
β5: 0.437 (0.293), 0.516 (0.271), 0.577 (0.245), 0.573 (0.249), 0.572 (0.247), 0.584 (0.241), 0.559 (0.254), 0.560 (0.254)
β6: 0.470 (0.332), 0.560 (0.303), 0.625 (0.272), 0.621 (0.277), 0.620 (0.276), 0.632 (0.269), 0.606 (0.283), 0.606 (0.283)
β7: 0.980 (0.039), 0.988 (0.023), 0.991 (0.018), 0.990 (0.020), 0.990 (0.021), 0.990 (0.019), 0.989 (0.022), 0.989 (0.022)
β8: 0.325 (0.237), 0.430 (0.242), 0.509 (0.227), 0.506 (0.233), 0.506 (0.231), 0.520 (0.227), 0.489 (0.236), 0.490 (0.236)
β9: 0.255 (0.189), 0.364 (0.196), 0.452 (0.188), 0.449 (0.197), 0.449 (0.195), 0.465 (0.192), 0.431 (0.198), 0.431 (0.198)
β10: 0.737 (0.271), 0.804 (0.225), 0.841 (0.190), 0.838 (0.192), 0.836 (0.193), 0.843 (0.186), 0.828 (0.200), 0.828 (0.200)
β11: 0.997 (0.022), 0.998 (0.013), 0.999 (0.010), 0.999 (0.010), 0.998 (0.010), 0.999 (0.010), 0.998 (0.011), 0.998 (0.011)
β12: 0.671 (0.300), 0.754 (0.248), 0.804 (0.207), 0.795 (0.217), 0.793 (0.218), 0.802 (0.210), 0.782 (0.227), 0.782 (0.227)
β13: 0.728 (0.303), 0.773 (0.265), 0.804 (0.233), 0.801 (0.236), 0.800 (0.235), 0.806 (0.229), 0.793 (0.243), 0.794 (0.243)
β14: 0.993 (0.030), 0.996 (0.021), 0.997 (0.016), 0.996 (0.018), 0.996 (0.018), 0.996 (0.017), 0.996 (0.020), 0.996 (0.020)
β15: 0.285 (0.201), 0.380 (0.194), 0.460 (0.180), 0.457 (0.185), 0.457 (0.184), 0.472 (0.180), 0.441 (0.188), 0.441 (0.188)
E(k|Y): 9.391, 10.335, 11.015, 10.976, 10.964, 11.089, 10.822, 10.824
E(g/(1+g)|Y): 0.996, 0.990, 0.982, 0.983, 0.980, 0.978, 0.982, 0.982
Table 5: Posterior Inclusion Probabilities for Setting B, with standard deviations in parentheses. The left panel corresponds to a signal-to-noise ratio of σ = 2.5, the right panel to a ratio of σ = 5. PIP values exceeding 0.5 in bold. Results correspond to 50 Monte Carlo Steps.
Signal-to-noise ratio σ = 2.5 (columns: g-RIC, g-UIP, g-E(g/(1+g)|Y), EBL, HG-3, HG-4, HG-RIC, HG-UIP):
β1: 0.999 (0.004), 1.000 (0.002), 1.000 (0.001), 1.000 (0.001), 1.000 (0.001), 1.000 (0.001), 1.000 (0.001), 1.000 (0.001)
β2: 0.060 (0.059), 0.105 (0.096), 0.355 (0.179), 0.377 (0.196), 0.373 (0.193), 0.409 (0.188), 0.327 (0.197), 0.328 (0.197)
β3: 0.112 (0.150), 0.165 (0.170), 0.414 (0.200), 0.432 (0.211), 0.426 (0.209), 0.460 (0.203), 0.382 (0.213), 0.382 (0.213)
β4: 0.055 (0.060), 0.105 (0.114), 0.376 (0.208), 0.394 (0.225), 0.389 (0.221), 0.426 (0.216), 0.342 (0.224), 0.342 (0.224)
β5: 0.056 (0.066), 0.093 (0.083), 0.332 (0.159), 0.355 (0.185), 0.351 (0.183), 0.387 (0.179), 0.306 (0.186), 0.307 (0.186)
β6: 0.083 (0.154), 0.128 (0.174), 0.374 (0.210), 0.396 (0.216), 0.391 (0.213), 0.428 (0.207), 0.344 (0.218), 0.345 (0.218)
β7: 0.230 (0.262), 0.304 (0.298), 0.547 (0.292), 0.555 (0.299), 0.549 (0.296), 0.579 (0.283), 0.508 (0.311), 0.508 (0.311)
β8: 0.057 (0.072), 0.095 (0.103), 0.335 (0.172), 0.359 (0.188), 0.355 (0.185), 0.392 (0.183), 0.308 (0.185), 0.309 (0.185)
β9: 0.056 (0.063), 0.097 (0.098), 0.332 (0.179), 0.356 (0.193), 0.353 (0.190), 0.388 (0.186), 0.309 (0.193), 0.309 (0.193)
β10: 0.107 (0.193), 0.154 (0.210), 0.402 (0.235), 0.424 (0.252), 0.419 (0.249), 0.453 (0.241), 0.376 (0.255), 0.376 (0.255)
β11: 0.503 (0.352), 0.580 (0.344), 0.764 (0.269), 0.768 (0.260), 0.761 (0.261), 0.781 (0.245), 0.727 (0.283), 0.728 (0.282)
β12: 0.115 (0.164), 0.173 (0.197), 0.433 (0.241), 0.451 (0.243), 0.445 (0.239), 0.480 (0.234), 0.399 (0.243), 0.399 (0.243)
β13: 0.126 (0.175), 0.178 (0.201), 0.411 (0.227), 0.430 (0.243), 0.426 (0.240), 0.458 (0.232), 0.384 (0.246), 0.384 (0.246)
β14: 0.234 (0.273), 0.318 (0.291), 0.600 (0.276), 0.600 (0.286), 0.592 (0.285), 0.625 (0.269), 0.546 (0.304), 0.547 (0.304)
β15: 0.045 (0.039), 0.083 (0.066), 0.327 (0.151), 0.350 (0.169), 0.346 (0.167), 0.383 (0.163), 0.300 (0.170), 0.300 (0.170)
E(k|Y): 2.837, 3.577, 7.001, 7.244, 7.173, 7.649, 6.556, 6.564
E(g/(1+g)|Y): 0.996, 0.990, 0.933, 0.926, 0.913, 0.897, 0.931, 0.931

Signal-to-noise ratio σ = 5 (same columns):
β1: 0.819 (0.274), 0.851 (0.244), 0.944 (0.113), 0.940 (0.109), 0.930 (0.115), 0.927 (0.107), 0.804 (0.239), 0.857 (0.192)
β2: 0.030 (0.050), 0.048 (0.068), 0.412 (0.163), 0.392 (0.183), 0.376 (0.174), 0.423 (0.168), 0.261 (0.171), 0.282 (0.172)
β3: 0.040 (0.069), 0.063 (0.094), 0.422 (0.184), 0.399 (0.206), 0.384 (0.197), 0.428 (0.187), 0.264 (0.177), 0.288 (0.184)
β4: 0.033 (0.049), 0.053 (0.066), 0.436 (0.168), 0.412 (0.186), 0.395 (0.177), 0.442 (0.170), 0.276 (0.178), 0.298 (0.178)
β5: 0.028 (0.025), 0.047 (0.039), 0.420 (0.168), 0.397 (0.187), 0.382 (0.178), 0.428 (0.173), 0.263 (0.169), 0.286 (0.171)
β6: 0.024 (0.034), 0.040 (0.051), 0.395 (0.149), 0.372 (0.180), 0.357 (0.170), 0.404 (0.163), 0.239 (0.148), 0.262 (0.157)
β7: 0.062 (0.144), 0.086 (0.165), 0.446 (0.193), 0.423 (0.216), 0.407 (0.209), 0.451 (0.199), 0.288 (0.209), 0.312 (0.210)
β8: 0.024 (0.027), 0.040 (0.042), 0.396 (0.157), 0.375 (0.183), 0.360 (0.174), 0.407 (0.168), 0.245 (0.166), 0.267 (0.168)
β9: 0.025 (0.029), 0.042 (0.046), 0.397 (0.158), 0.379 (0.175), 0.364 (0.166), 0.411 (0.162), 0.250 (0.160), 0.271 (0.161)
β10: 0.046 (0.072), 0.071 (0.102), 0.435 (0.197), 0.417 (0.209), 0.401 (0.202), 0.446 (0.193), 0.286 (0.206), 0.308 (0.205)
β11: 0.138 (0.243), 0.170 (0.257), 0.524 (0.236), 0.501 (0.245), 0.484 (0.241), 0.525 (0.226), 0.360 (0.249), 0.387 (0.250)
β12: 0.040 (0.067), 0.065 (0.098), 0.433 (0.173), 0.410 (0.196), 0.394 (0.188), 0.440 (0.179), 0.274 (0.187), 0.297 (0.187)
β13: 0.073 (0.178), 0.094 (0.188), 0.436 (0.187), 0.415 (0.211), 0.400 (0.205), 0.444 (0.193), 0.284 (0.212), 0.307 (0.212)
β14: 0.051 (0.072), 0.080 (0.097), 0.475 (0.198), 0.448 (0.216), 0.431 (0.208), 0.475 (0.198), 0.308 (0.208), 0.332 (0.209)
β15: 0.033 (0.073), 0.051 (0.092), 0.410 (0.159), 0.391 (0.181), 0.376 (0.173), 0.422 (0.168), 0.260 (0.173), 0.282 (0.173)
E(k|Y): 1.465, 1.803, 6.981, 6.670, 6.444, 7.073, 4.662, 5.036
E(g/(1+g)|Y): 0.996, 0.990, 0.782, 0.796, 0.767, 0.716, 0.866, 0.850
Table 6:Summary statistics of posterior model probabilities for true model based on setting “A” and 50 Monte Carlo Steps. Top panel corresponds to σ = 1/2, second panel to σ = 1, third panel to σ = 2.5, fourth panel to σ = 5
Columns: g-RIC, g-UIP, g-E(g/(1+g)|Y), EBL, HG-3, HG-4, HG-RIC, HG-UIP.

σ = 1/2:
Min.: 0.1349, 0.1279, 0.1573, 0.1888, 0.1934, 0.1809, 0.2096, 0.2094
Mean: 0.4618, 0.3725, 0.5306, 0.5951, 0.5973, 0.5758, 0.6220, 0.6217
Max.: 0.6297, 0.5106, 0.7037, 0.7669, 0.7644, 0.7461, 0.7845, 0.7843
St.Dev.: 0.1342, 0.1019, 0.1482, 0.1551, 0.1490, 0.1487, 0.1484, 0.1484

σ = 1:
Min.: 0.0539, 0.0317, 0.0426, 0.0308, 0.0320, 0.0289, 0.0362, 0.0362
Mean: 0.4433, 0.3290, 0.3922, 0.3944, 0.3932, 0.3690, 0.4219, 0.4215
Max.: 0.6115, 0.4849, 0.5578, 0.5954, 0.5922, 0.5658, 0.6220, 0.6216
St.Dev.: 0.1373, 0.1138, 0.1283, 0.1358, 0.1342, 0.1293, 0.1392, 0.1392

σ = 2.5:
Min.: 0.0021, 0.0022, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
Mean: 0.1201, 0.1048, 0.0556, 0.0665, 0.0660, 0.0606, 0.0721, 0.0720
Max.: 0.4609, 0.3392, 0.1493, 0.1978, 0.1968, 0.1768, 0.2216, 0.2213
St.Dev.: 0.1133, 0.0834, 0.0387, 0.0487, 0.0478, 0.0441, 0.0524, 0.0524

σ = 5:
Min.: 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
Mean: 0.0012, 0.0023, 0.0010, 0.0025, 0.0024, 0.0021, 0.0027, 0.0028
Max.: 0.0215, 0.0324, 0.0114, 0.0323, 0.0308, 0.0254, 0.0347, 0.0347
St.Dev.: 0.0035, 0.0057, 0.0026, 0.0062, 0.0060, 0.0051, 0.0067, 0.0067
Table 7:Summary statistics of posterior model probabilities for true model based on setting “A” and 50 Monte Carlo Steps. Top panel corresponds to σ = 1/2, second panel to σ = 1, third panel to σ = 2.5, fourth panel to σ = 5
Columns: g-RIC, g-UIP, g-E(g/(1+g)|Y), EBL, HG-3, HG-4, HG-UIP, HG-RIC.

σ = 1/2:
Min.: 0.4752, 0.6133, 0.4704, 0.5192, 0.5386, 0.5175, 0.5695, 0.5691
Mean: 0.9806, 0.9919, 0.9807, 0.9872, 0.9908, 0.9902, 0.9914, 0.9914
Max.: 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000
St.Dev.: 0.1342, 0.1019, 0.1482, 0.1551, 0.1490, 0.1487, 0.1484, 0.1484

σ = 1:
Min.: 0.1363, 0.1115, 0.1226, 0.1131, 0.1188, 0.1158, 0.1239, 0.1238
Mean: 0.9650, 0.9552, 0.9604, 0.9612, 0.9626, 0.9604, 0.9657, 0.9656
Max.: 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000
St.Dev.: 0.1373, 0.1138, 0.1283, 0.1358, 0.1342, 0.1293, 0.1392, 0.1392

σ = 2.5:
Min.: 0.0071, 0.0076, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
Mean: 0.4683, 0.5202, 0.5516, 0.5325, 0.5331, 0.5274, 0.5382, 0.5383
Max.: 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000
St.Dev.: 0.1133, 0.0834, 0.0387, 0.0487, 0.0478, 0.0441, 0.0524, 0.0524

σ = 5:
Min.: 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000
Mean: 0.0067, 0.0173, 0.0070, 0.0203, 0.0200, 0.0147, 0.0284, 0.0285
Max.: 0.1643, 0.3744, 0.0842, 0.2835, 0.2737, 0.1685, 0.4719, 0.4688
St.Dev.: 0.0035, 0.0057, 0.0026, 0.0062, 0.0060, 0.0051, 0.0067, 0.0067
Table 8: Relative Root Mean Squared Error based on 30 out-of-sample forecasts, averaged over 50 Monte Carlo Steps. Values below 1 indicate superior predictive performance as compared to the g-RIC setting. The top panel corresponds to setting A, the bottom panel to setting B.
Columns: g-RIC, g-UIP, g-E(g/(1+g)|Y), EBL, HG-3, HG-4, HG-RIC, HG-UIP.

Setting A:
σ = 1/2: -, 1.00877, 0.99754, 0.99798, 0.99793, 0.99817, 0.99771, 0.99771
σ = 1: -, 1.00347, 1.00200, 1.00128, 1.00219, 1.00315, 1.00126, 1.00127
σ = 2.5: -, 0.99501, 1.00079, 1.00556, 1.00320, 1.00699, 1.00039, 1.00042
σ = 5: -, 0.99034, 1.00697, 1.00594, 1.00720, 1.01958, 1.01256, 1.00692

Setting B:
σ = 1/2: -, 0.99754, 0.99926, 0.99794, 0.99948, 0.99910, 0.99998, 0.99998
σ = 1: -, 0.98166, 0.97396, 0.97316, 0.97501, 0.97398, 0.97648, 0.97647
σ = 2.5: -, 0.98875, 0.97284, 0.96760, 0.97580, 0.97847, 0.97631, 0.97627
σ = 5: -, 0.99578, 1.00427, 0.99968, 1.00747, 1.02216, 1.01853, 1.00976
Table 9:Posterior Inclusion Probabilities for different prior settings. Values larger than 0.5 in bold.
Columns: g-RIC, g-UIP, g-E(g/(1+g)|Y), EBL, HG-3, HG-4, HG-UIP, HG-RIC.
GDP60: 0.9989, 1.0000, 0.9999, 0.9999, 0.9998, 0.9998, 0.9998, 0.9999
Confucian: 0.9880, 0.9990, 0.9953, 0.9959, 0.9955, 0.9952, 0.9964, 0.9961
LifeExp: 0.9314, 0.9890, 0.9751, 0.9787, 0.9748, 0.9746, 0.9792, 0.9764
EquipInv: 0.9271, 0.9245, 0.8949, 0.9031, 0.8958, 0.8895, 0.8996, 0.8974
SubSahara: 0.7288, 0.9480, 0.9087, 0.9129, 0.9096, 0.9042, 0.9127, 0.9151
Muslim: 0.6524, 0.5196, 0.5695, 0.5673, 0.5646, 0.5644, 0.5596, 0.5592
RuleofLaw: 0.4855, 0.7569, 0.6761, 0.6822, 0.6697, 0.6611, 0.6797, 0.6826
YrsOpen: 0.5172, 0.2452, 0.3530, 0.3433, 0.3546, 0.3630, 0.3384, 0.3373
EcoOrg: 0.4536, 0.6328, 0.6400, 0.6320, 0.6402, 0.6370, 0.6371, 0.6447
Protestants: 0.4488, 0.5665, 0.5795, 0.5759, 0.5772, 0.5825, 0.5782, 0.5820
Mining: 0.4572, 0.8975, 0.8633, 0.8698, 0.8638, 0.8587, 0.8693, 0.8656
NequipInv: 0.4297, 0.7159, 0.7207, 0.7209, 0.7216, 0.7185, 0.7218, 0.7264
LatAmerica: 0.2084, 0.6297, 0.5842, 0.5962, 0.5870, 0.5861, 0.5946, 0.5956
PrScEnroll: 0.2144, 0.5114, 0.4987, 0.5118, 0.4964, 0.4928, 0.5032, 0.4960
Buddha: 0.1986, 0.3526, 0.4232, 0.4190, 0.4223, 0.4225, 0.4183, 0.4135
BlMktPm: 0.1796, 0.6471, 0.6145, 0.6197, 0.6104, 0.6103, 0.6176, 0.6176
Catholic: 0.1277, 0.2388, 0.3138, 0.3066, 0.3103, 0.3157, 0.3019, 0.3040
CivlLib: 0.1281, 0.4559, 0.4501, 0.4536, 0.4484, 0.4445, 0.4492, 0.4485
Hindu: 0.1213, 0.8613, 0.8198, 0.8358, 0.8259, 0.8176, 0.8345, 0.8331
PrExports: 0.1039, 0.1828, 0.2784, 0.2684, 0.2762, 0.2846, 0.2661, 0.2679
PolRights: 0.0959, 0.3354, 0.3820, 0.3810, 0.3821, 0.3821, 0.3787, 0.3819
RFEXDist: 0.0836, 0.2072, 0.2740, 0.2742, 0.2769, 0.2744, 0.2700, 0.2672
Age: 0.0817, 0.3064, 0.3590, 0.3544, 0.3583, 0.3639, 0.3533, 0.3553
WarDummy: 0.0776, 0.2593, 0.3380, 0.3336, 0.3421, 0.3447, 0.3351, 0.3347
LabForce: 0.0771, 0.8017, 0.7414, 0.7697, 0.7519, 0.7392, 0.7608, 0.7587
Foreign: 0.0697, 0.1501, 0.2248, 0.2206, 0.2271, 0.2346, 0.2196, 0.2222
English: 0.0689, 0.3687, 0.3822, 0.3868, 0.3863, 0.3835, 0.3854, 0.3860
EthnoL: 0.0584, 0.6826, 0.6206, 0.6461, 0.6277, 0.6078, 0.6299, 0.6321
Spanish: 0.0563, 0.4485, 0.4341, 0.4496, 0.4367, 0.4293, 0.4386, 0.4374
stdBMP: 0.0495, 0.1264, 0.2049, 0.2002, 0.2030, 0.2073, 0.1971, 0.1971
French: 0.0510, 0.4092, 0.3992, 0.4190, 0.4055, 0.3956, 0.4078, 0.4004
Abslat: 0.0434, 0.1529, 0.2273, 0.2247, 0.2243, 0.2377, 0.2261, 0.2236
WorkPop: 0.0427, 0.1333, 0.2064, 0.1994, 0.2066, 0.2135, 0.2014, 0.2012
HighEnroll: 0.0448, 0.6947, 0.6165, 0.6415, 0.6169, 0.6049, 0.6308, 0.6261
Popg: 0.0368, 0.1485, 0.2182, 0.2135, 0.2179, 0.2218, 0.2093, 0.2087
Brit: 0.0386, 0.3304, 0.3298, 0.3382, 0.3316, 0.3260, 0.3345, 0.3289
OutwarOr: 0.0383, 0.3387, 0.3507, 0.3613, 0.3489, 0.3463, 0.3517, 0.3501
Jewish: 0.0355, 0.1289, 0.1966, 0.1927, 0.1995, 0.2096, 0.1949, 0.1922
RevnCoup: 0.0300, 0.1252, 0.1946, 0.1924, 0.1925, 0.1962, 0.1877, 0.1842
PublEdupct: 0.0315, 0.2951, 0.3152, 0.3167, 0.3148, 0.3182, 0.3151, 0.3133
Area: 0.0298, 0.1391, 0.2152, 0.2097, 0.2164, 0.2180, 0.2101, 0.2072
E(k|Y): 10.442, 19.657, 20.389, 20.518, 20.411, 20.377, 20.395, 20.368
E(g/(1+g)|Y): 0.999, 0.986, 0.955, 0.960, 0.955, 0.951, 0.958, 0.958
Table 10:Fully standardized posterior means for different prior settings. Coefficients corresponding to covariates with PIP exceeding 0.5 in bold.
Columns: g-RIC, g-UIP, g-E(g/(1+g)|Y), EBL, HG-3, HG-4, HG-UIP, HG-RIC.
GDP60: -0.7810, -0.8019, -0.7743, -0.7785, -0.7740, -0.7704, -0.7768, -0.7769
Confucian: 0.2703, 0.3055, 0.2881, 0.2921, 0.2884, 0.2858, 0.2905, 0.2901
LifeExp: 0.5267, 0.5546, 0.5348, 0.5411, 0.5358, 0.5331, 0.5386, 0.5366
EquipInv: 0.3040, 0.2361, 0.2222, 0.2259, 0.2227, 0.2202, 0.2240, 0.2230
SubSahara: -0.2545, -0.4094, -0.3619, -0.3686, -0.3632, -0.3586, -0.3683, -0.3693
Muslim: 0.1448, 0.1017, 0.1096, 0.1089, 0.1083, 0.1084, 0.1073, 0.1067
RuleofLaw: 0.1322, 0.1658, 0.1406, 0.1422, 0.1391, 0.1368, 0.1418, 0.1429
YrsOpen: 0.1439, 0.0325, 0.0482, 0.0465, 0.0482, 0.0502, 0.0462, 0.0455
EcoOrg: 0.0816, 0.0888, 0.0866, 0.0850, 0.0861, 0.0860, 0.0862, 0.0875
Protestants: -0.0772, -0.0804, -0.0799, -0.0785, -0.0794, -0.0800, -0.0796, -0.0806
Mining: 0.0800, 0.1486, 0.1419, 0.1433, 0.1421, 0.1415, 0.1432, 0.1426
NequipInv: 0.0737, 0.1066, 0.1066, 0.1065, 0.1070, 0.1064, 0.1068, 0.1079
LatAmerica: -0.0429, -0.1759, -0.1464, -0.1528, -0.1478, -0.1455, -0.1515, -0.1510
PrScEnroll: 0.0598, 0.1262, 0.1099, 0.1142, 0.1090, 0.1066, 0.1116, 0.1101
Buddha: 0.0264, 0.0335, 0.0410, 0.0404, 0.0408, 0.0410, 0.0400, 0.0398
BlMktPm: -0.0219, -0.0783, -0.0702, -0.0713, -0.0697, -0.0694, -0.0711, -0.0709
Catholic: -0.0050, -0.0112, -0.0111, -0.0100, -0.0111, -0.0110, -0.0111, -0.0121
CivlLib: -0.0271, -0.0944, -0.0853, -0.0877, -0.0852, -0.0832, -0.0866, -0.0859
Hindu: -0.0184, -0.3591, -0.3003, -0.3169, -0.3041, -0.2949, -0.3123, -0.3104
PrExports: -0.0168, -0.0162, -0.0271, -0.0252, -0.0264, -0.0273, -0.0250, -0.0252
PolRights: -0.0157, -0.0454, -0.0490, -0.0490, -0.0494, -0.0493, -0.0487, -0.0494
RFEXDist: -0.0095, -0.0148, -0.0187, -0.0190, -0.0190, -0.0186, -0.0186, -0.0183
Age: -0.0078, -0.0252, -0.0282, -0.0280, -0.0283, -0.0287, -0.0279, -0.0282
WarDummy: -0.0081, -0.0222, -0.0288, -0.0284, -0.0293, -0.0296, -0.0286, -0.0288
LabForce: 0.0102, 0.2997, 0.2487, 0.2646, 0.2525, 0.2445, 0.2599, 0.2572
Foreign: 0.0068, 0.0038, 0.0071, 0.0068, 0.0074, 0.0078, 0.0069, 0.0069
English: -0.0057, -0.0350, -0.0337, -0.0346, -0.0342, -0.0336, -0.0344, -0.0344
EthnoL: 0.0055, 0.1331, 0.1084, 0.1155, 0.1094, 0.1045, 0.1122, 0.1116
Spanish: 0.0053, 0.1066, 0.0821, 0.0888, 0.0833, 0.0792, 0.0857, 0.0852
stdBMP: -0.0034, -0.0021, -0.0037, -0.0035, -0.0036, -0.0039, -0.0035, -0.0035
French: 0.0037, 0.0585, 0.0448, 0.0492, 0.0458, 0.0432, 0.0471, 0.0464
Abslat: 0.0006, -0.0071, -0.0066, -0.0065, -0.0065, -0.0069, -0.0067, -0.0068
WorkPop: -0.0030, -0.0035, -0.0062, -0.0060, -0.0062, -0.0064, -0.0062, -0.0061
HighEnroll: -0.0044, -0.1871, -0.1441, -0.1551, -0.1453, -0.1393, -0.1513, -0.1492
Popg: 0.0027, 0.0072, 0.0110, 0.0104, 0.0106, 0.0113, 0.0105, 0.0103
Brit: -0.0018, 0.0413, 0.0263, 0.0303, 0.0271, 0.0249, 0.0288, 0.0281
OutwarOr: -0.0020, -0.0286, -0.0266, -0.0280, -0.0265, -0.0258, -0.0270, -0.0267
Jewish: -0.0012, -0.0009, -0.0019, -0.0018, -0.0020, -0.0021, -0.0020, -0.0018
RevnCoup: -0.0001, -0.0003, -0.0002, -0.0002, -0.0003, -0.0003, -0.0002, -0.0003
PublEdupct: 0.0004, 0.0266, 0.0229, 0.0238, 0.0228, 0.0226, 0.0237, 0.0235
Area: -0.0005, -0.0026, -0.0030, -0.0031, -0.0031, -0.0030, -0.0031, -0.0031
E(k|Y): 10.442, 19.657, 20.389, 20.518, 20.411, 20.377, 20.395, 20.368
E(g/(1+g)|Y): 0.999, 0.986, 0.955, 0.960, 0.955, 0.951, 0.958, 0.958

Figure 5: Top left panel shows the cumulative posterior mass, top right panel the posterior inclusion probabilities, bottom left panel the standardized coefficients, and bottom right panel the posterior mean for the growth determinants exercise.

*Martin Feldkircher is affiliated with Oesterreichische Nationalbank. The opinions in this paper are those of the authors and do not necessarily coincide with those of Oesterreichische Nationalbank. The authors are grateful to Jesús Crespo Cuaresma, Gernot Doppelhofer and Eduardo Ley for their helpful comments.
1Here, Rs2 denotes the OLS R-squared of model Ms, and (N, ks) the dimensions of its design matrix.
2Therefore, practitioners employing BMA focus much more on model-wise marginal distributions, such as posterior inclusion probabilities or posterior beta distributions.
3i.e. by up- or downweighting the prior beliefs on coefficients β.
4Note that although the term (y′y)^(−(N−1)/2) is constant over models, it is frequently included in the marginal likelihood expression, such as in Fernández et al. (2001a) – while others, such as Liang et al. (2008), omit it.
5Please consult the technical appendix for further details.
6Note that we have retained the improper priors for α and σ as common to all models.
7In fact, the discussion of model priors has largely focused on model size penalties, as for instance in Sala-i-Martin et al. (2004).
8Consider, for instance, the model prior p(M_s) = (1+g)^(k_s/2) / (√(1+g) + 1)^K, which would completely neutralize the factor (1+g)^(−k_s/2) – and could be combined with other model priors that add more or less of a model size penalty.
9Note, however, that Dij is bounded according to the values of Rj2 and Rs2. Nevertheless the exponent (N – 1)/2 exacerbates any variations in g to quite a large extent.
10Liang et al. (2008) motivate their paper with two ‘paradoxes’ that arise with constant g. First, they raise a BMA formulation of ‘Bartlett’s paradox’ stating that if g →∞ for fixed N and K, the Bayes Factor B(Ms: M0) of any model with respect to the null model eventually goes to zero. Second, they refer to an ‘information paradox’ stating that for fixed N and K, if the R-squared of model Ms converges to unity, its Bayes factor with respect to any other fit-wise inferior model does not go to infinity. Note, however, the comment by Zellner 2008. Moreover, both arguments bite only in the case when N and K are kept constant: Bartlett’s paradox in this case may be less relevant as typical specifications for g require it to rise in line with N. The ‘information paradox’ does not contradict the standard consistency argument that requires the respective Bayes Factor to converge to infinity only when N tends likewise to infinity.
11Note that this is equivalent to putting the following prior on g: p(g) = ((a−2)/2) (1+g)^(−a/2).
12Note that E(β_s|y, X_s, M_s) = E(g/(1+g)|y, X_s, M_s) β̂_s.
13For completeness, y’s posterior predictive distribution is provided in the appendix.
14See Gupta and Nagar (2000) for the exact definition of the type II hypergeometric distribution.
15Note that with respect to equations (8) and (12) it is straightforward to derive the corresponding expressions for E(g|y, X_s, M_s) and E(g²|y, X_s, M_s). However, E(g|y, X_s, M_s) is finite only for k_s + a > 4 and E(g²|y, X_s, M_s) only for k_s + a > 6. We therefore concentrate on the posterior moments of the shrinkage factor.
16In case Rs2=0 (in particular for the null model), the respective quantities follow directly from (8), (10), and (12) since 2F1 (a, b, c, 0) = 1 for any (a, b, c).
17Note that 2F1((N−1)/2, 1, (k_s+a)/2, R_s²) goes quite rapidly towards infinity as R_s² increases. The term θ̄_s/F_s* could thus noticeably affect model-averaged posterior moments only if the data examined offer a very low signal-to-noise ratio.
18Please refer to section A.2 in the appendix for a theoretical underpinning of this claim.
19Even though this inequality will hold in virtually all relevant cases, it may fail when the correlation between the dependent variable and the covariates is even lower than what would be expected under a null hypothesis of no relation. As a rule of thumb, R_F² > (K+a−2)/(N−3) is sufficient for (16) to hold in any case. Please refer to section A.3 in the appendix for further details.
20Consistency does not directly apply to the g-RIC prior outlined below. However, throughout the following sections, g-RIC is in practice identical to the g-BRIC prior (as K² > N throughout). Since the latter qualifies for consistency, the notion may be extended to g-RIC, at least in our case.
22This facilitates quick convergence of stochastic search algorithms such as the MC3 to the target distribution.
23See Laud and Ibrahim (1995) for a model selection approach designing information criteria that allow for the input of prior knowledge.
24See for example Gelman et al. (1995).
25We have omitted results from the HG-3 setting, since they are very similar to those of HG-4.
27The computer program is coded in R (R Development Core Team, 2008) and is available from the authors upon request.
28Eicher et al. (2009) translate the scale of evidence put forward originally by Kass and Raftery (1995) into four categories: weak (50-75% PIP), substantial (75-95%), strong (95-99%) and decisive (99%+) evidence. Barbieri and Berger (2003), on the other hand, highlight the predictive merits of the median model, consisting of those regressors whose PIP exceeds 0.5.
29Table 10 lists results on the posterior means for the whole set of prior structures. Note that we have fully (response variable and covariates) standardized the coefficients implying that the slopes refer to changes in terms of standard deviations.
30Note that if k = 0, BFEBL = BFh = 1.
31This stems from the fact that, for a random variable x ≥ 0, the definition of covariance implies E(1/x)E(x) = 1 − Cov(x, 1/x), and since x ≥ 0 we have Cov(x, 1/x) ≤ 0. Therefore E(1/x) ≥ 1/E(x).
32Note that this threshold is just slightly higher than the expected value of R_F² under the classic null hypothesis of no significant variance explanation by a regression model. As a rule of thumb, if the standard F-statistic for the full model is ‘significant’ at least at the 20% level, then the inequality above is guaranteed to hold.
33The slight differences with respect to Eklund and Karlsson (2007) are due to the fact that we employ improper priors on the variance σ and the constant.
34Note that b is then implicitly defined through b = (K − m)/m, since with a = 1 the prior expected model size equals K/(1 + b) = m.
