
The shape of the risk premium: evidence from a semiparametric generalized autoregressive...

We examine the relationship between the risk premium on the Center for Research on Security Prices (CRSP) value-weighted index total return and its conditional variance. We propose a new semiparametric model in which the conditional variance process is parametric and the conditional mean is an arbitrary function of the conditional variance. For monthly CRSP value-weighted excess returns, the relationship between the two moments that we uncover is nonlinear and nonmonotonic.

KEY WORDS: Asset pricing; Autoregressive conditional heteroscedasticity; Backfitting; Fourier series; Kernel; Risk premium.

1. INTRODUCTION

Modern asset pricing theories, such as those of Abel (1987, 1999), Cox, Ingersoll, and Ross (1985), Merton (1973), and Gennotte and Marsh (1993), imply restrictions on the time series properties of expected returns and conditional variances of market aggregates. These restrictions are generally quite complicated, depending on utility functions as well as on the driving process of the stochastic components of the model. However, in an influential paper, Merton (1973) obtained very simple restrictions albeit under somewhat drastic assumptions. He showed in the context of a continuous-time partial equilibrium model that

(1) [[mu].sub.t] = E[([r.sub.mt] - [r.sub.ft])|[F.sub.t-1]] = [gamma] var[([r.sub.mt] - [r.sub.ft])|[F.sub.t-1]] = [gamma][[sigma].sup.2.sub.t],

where [r.sub.mt] and [r.sub.ft] are the returns on the market portfolio and risk-free asset and [F.sub.t-1] is the market-wide information available at time t-1. The constant [gamma] is the Arrow-Pratt measure of relative risk aversion.

The simplicity of the foregoing restrictions and their apparent congruence with the original capital asset pricing model (CAPM) restrictions (see Sharpe 1964; Lintner 1965) have motivated a large number of empirical studies that test some variant of this restriction. A convenient statistical framework for examining the relationship between the quantities [[mu].sub.t] and [[sigma].sup.2.sub.t] in discrete financial time series is the autoregressive conditional heteroscedasticity (ARCH) class of models (see Bollerslev, Chou, and Kroner 1992; Bollerslev, Engle, and Nelson 1994 for references). Engle, Lilien, and Robins (1987) examined the relationship between government bonds of different maturities using the ARCH-M model in which the errors follow an ARCH(p) process and [[mu].sub.t] = [mu]([[sigma].sup.2.sub.t]) for some parametric function [mu](*). They examined [[mu].sub.t] = [[gamma].sub.0] + [[gamma].sub.1][[sigma].sub.t] and [[mu].sub.t] = [[gamma].sub.0] + [[gamma].sub.1] ln([[sigma].sup.2.sub.t]) and found that the latter specification provided the better fit. French, Schwert, and Stambaugh (1987) and Nelson (1991) also examined this relationship using generalized autoregressive conditional heteroscedasticity (GARCH) models.

Gennotte and Marsh (1993) argued that the linear relationship (1) should be regarded as a very special case. They constructed a general equilibrium model of asset returns and derived the equilibrium relationship

(2) [[mu].sub.t] = [gamma][[sigma].sup.2.sub.t] + g([[sigma].sup.2.sub.t]),

where the form of g(*) depends on preferences and on the parameters of the distribution of asset returns. If the representative agent has logarithmic utility, then g(*) [equivalent to] 0, and the simple restrictions of Merton pertain. In addition, Backus, Gregory, and Zin (1989) and Backus and Gregory (1993) provided simulation evidence that g(*), and hence [mu](*), could be of arbitrary functional form in general equilibrium. Whitelaw (2000) developed these empirical findings into an equilibrium asset-pricing model with regime changes in which the relation is linear within each regime but nonlinear overall because of the presence of the two distinct regimes. Veronesi (2001) also developed a model in which investors receive noisy signals and the shape of the relation between the risk premium and the conditional variance is ambiguous and depends on investor uncertainty.

Pagan and Hong (1990) argued that the risk premium, [[mu].sub.t], and the conditional variance, [[sigma].sup.2.sub.t], are highly nonlinear functions of the past whose form is not captured by standard parametric GARCH-M models. They estimated [[mu].sub.t] and [[sigma].sup.2.sub.t] nonparametrically and found evidence of considerable nonlinearity. They then estimated [delta] from the regression

(3) [r.sub.mt] - [r.sub.ft] = [beta]'[x.sub.t] + [delta][[sigma].sup.2.sub.t] + [[eta].sub.t],

by the least squares and instrumental variables methods, with [[sigma].sup.2.sub.t] replaced by its nonparametric estimate, finding a negative but insignificant [delta]. Perron (2003) analyzed this approach using weak instrument asymptotics and found similar results.

The Pagan and Hong (1990) approach has a couple of drawbacks. First, the conditional moments are calculated using a restricted conditioning set--the information set used in defining [[mu].sub.t] and [[sigma].sup.2.sub.t] contained only a finite number of lags, that is, [F.sub.t-1] = {[y.sub.t-1], ..., [y.sub.t-p]} for some fixed p and data series [y.sub.t] = [r.sub.mt] - [r.sub.ft]. This greatly restricts the dynamics for the variance process. In particular, if the conditional variance is highly persistent, then the nonparametric estimator of the conditional variance provides a poor approximation, as confirmed by the simulation evidence reported by Perron (1998). Second, linearity of the relationship between [[mu].sub.t] and [[sigma].sup.2.sub.t] is imposed, and this seems to be somewhat restrictive in view of earlier findings.

In this article, we investigate the relationship between the risk premium and the conditional variance of excess returns on the Center for Research on Security Prices (CRSP) value-weighted index. We consider a semiparametric specification that differs from previous treatments. In particular, we choose a parametric form for the variance dynamics [in our case, exponential GARCH (EGARCH)], while allowing the mean to be an unknown function of [[sigma].sup.2.sub.t]. This model takes into account the high level of persistence and leverage effect found in stock index return volatility, while at the same time allowing for an arbitrary functional form to describe the relationship between risk and return at the market level. We develop two estimation methods for this model: a Fourier series method and a method based on kernels. The kernel method is based on iterative one-dimensional smoothing and is similar in this respect to the back-fitting method for estimating additive nonparametric regression (see Hastie and Tibshirani 1990). We also suggest a bootstrap algorithm for obtaining confidence intervals. Using these methods, we find evidence of a nonlinear and nonmonotonic relationship between the risk premium and the conditional variance.

Other work applying nonparametric methods to this problem has been done by Boudoukh, Richardson, and Whitelaw (1997) and Harvey (2001). Our work differs from theirs in the parametric specification that we choose for the conditional variance. This enables joint estimation of the two elements of interest, as described in Section 3.

In the next section we discuss the specification of our model, while in Sections 3 and 4 we describe how to obtain point and interval estimates respectively. In Section 5 we present our empirical results and the results of a small simulation experiment, and in Section 6 we conclude.

2. A SEMIPARAMETRIC-MEAN EXPONENTIAL GENERALIZED AUTOREGRESSIVE CONDITIONAL HETEROSCEDASTICITY MODEL

We suppose that the excess returns [y.sub.t] are generated as

(4) [y.sub.t] = [mu]([[sigma].sup.2.sub.t]) + [[epsilon].sub.t][[sigma].sub.t], t = 1,2, ..., T,

where [[epsilon].sub.t] is a martingale difference sequence with unit (conditional) variance and [mu](*) is a smooth function of unknown functional form. The restriction that E[[y.sub.t]|[F.sub.t-1]], where [F.sub.t-1] = [{[y.sub.t-j]}.sup.[infinity].sub.j=1], depends on the past through only [[sigma].sup.2.sub.t] is quite severe but is a consequence of asset pricing models (e.g., Backus and Gregory 1993; Gennotte and Marsh 1993). In any case, it is possible to generalize this formulation in a number of directions. It is straightforward to incorporate fixed explanatory variables, lagged [[sigma].sup.2.sub.t], or lagged [y.sub.t] either as linear regressors or inside the unknown function [mu](*). More complicated dynamics for [[epsilon].sub.t], such as an ARMA(p, q) model and a multivariate extension, can also be accommodated.

We propose using a parametric function for the conditional variance so as to allow for rich dynamics in the volatility. To be specific, we consider the EGARCH model introduced by Nelson (1991),

(5) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

The presence of the lagged dependent variables [h.sub.t-j] allows very rich dynamics for the variance process itself, which cannot yet be achieved by nonparametric methods. The foregoing model also allows both the sign and the level of [[epsilon].sub.t-k] to affect [[sigma].sup.2.sub.t]--good news and bad news can have different effects on volatility, hence allowing the possibility of the so-called "leverage effect" in stock returns. The parameter d controls the relative importance of the symmetric versus asymmetric effects. Evidence of such a leverage effect in returns on stock indices is widespread in the literature (e.g., Nelson 1991 for daily data and Braun, Nelson, and Sunier 1991 for monthly data).
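The display for (5) is not reproduced in this copy. As a hedged sketch only, the EGARCH(p, q) recursion below is the natural generalization of the p = q = 1 recursion written out in step 3 of the kernel estimation algorithm in Section 3.2.1, using the parameters a, [b.sub.1], ..., [b.sub.p], [c.sub.1], ..., [c.sub.q], d listed in Section 3.1 and the convention that [h.sub.t] is the log of the conditional variance. The exact form printed in the original article is an assumption.

```latex
h_t \equiv \ln \sigma_t^2
  = a + \sum_{j=1}^{p} b_j\, h_{t-j}
      + \sum_{k=1}^{q} c_k \left( |\epsilon_{t-k}| - E|\epsilon_{t-k}| + d\,\epsilon_{t-k} \right)
```

Under this reading, d = 0 gives a purely symmetric response to shocks, and a negative d makes bad news raise volatility by more than good news of the same size.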

A number of authors (e.g., Nelson 1991) have found that standardized residuals from estimated GARCH models are leptokurtic relative to the normal (see also Engle and Gonzalez-Rivera 1991). We thus assume that [[epsilon].sub.t] has a distribution within the generalized error distribution (GED) family,

(6) f([epsilon]) = v exp(-(1/2)[|[epsilon]/[lambda]|.sup.v]) / ([lambda][2.sup.(1+1/v)][GAMMA](1/v)),

[lambda] = [[[2.sup.(-2/v)][GAMMA](1/v)/[GAMMA](3/v)].sup.1/2],

where [GAMMA] is the gamma function. The GED family of errors includes the normal (v = 2), uniform (v = [infinity]), and Laplace (v = 1) as special cases. The distribution is symmetric about 0 for all v and has finite second moments for v > 1. With this density, we obtain that E|[[epsilon].sub.t]| = ([lambda][2.sup.1/v][GAMMA](2/v))/[GAMMA](1/v) (Hamilton 1994, p. 669).
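The density in (6) and the expression for E|[[epsilon].sub.t]| can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `ged_lambda`, `ged_pdf`, and `ged_abs_mean` are hypothetical and introduced here purely for illustration.

```python
import math


def ged_lambda(v: float) -> float:
    """Scale constant lambda in (6) that gives the GED unit variance."""
    return math.sqrt(2.0 ** (-2.0 / v) * math.gamma(1.0 / v) / math.gamma(3.0 / v))


def ged_pdf(eps: float, v: float) -> float:
    """GED density (6) evaluated at eps for shape parameter v."""
    lam = ged_lambda(v)
    numerator = v * math.exp(-0.5 * abs(eps / lam) ** v)
    denominator = lam * 2.0 ** (1.0 + 1.0 / v) * math.gamma(1.0 / v)
    return numerator / denominator


def ged_abs_mean(v: float) -> float:
    """E|eps| under the GED, as quoted in the text from Hamilton (1994)."""
    lam = ged_lambda(v)
    return lam * 2.0 ** (1.0 / v) * math.gamma(2.0 / v) / math.gamma(1.0 / v)
```

With v = 2 these functions reduce to the standard normal case: the density at zero equals 1/sqrt(2 pi) and E|[[epsilon].sub.t]| equals sqrt(2/pi). With v = 1 they give the unit-variance Laplace distribution.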

We assume that the parameter values satisfy the requirements for stationarity given by Nelson (1991). Carrasco and Chen (2002) established a general result about the dependence properties of a general class of volatility models, which suggests that the process [y.sub.t] is [beta] mixing under some conditions.

Newey and Steigerwald (1997) showed that quasi-likelihood estimators in GARCH models based on distributions other than the normal are generally inconsistent. Therefore, we also investigate our EGARCH(p, q) specification for the variance combined with a normal error distribution.

The main difference between our model and previous treatments is that we do not restrict the functional form of [mu](*) a priori. This has a number of implications for both estimation and testing. In particular, a simple consistent estimator of [mu](*) is difficult to obtain and would appear to depend on first obtaining consistent estimates of the parameters of the variance process. On the other hand, to estimate these parameters, we need to have a good estimate of [mu](*). In the next section we propose a solution to this problem.

3. ESTIMATION

3.1 Parametric Estimation

Estimation of the unknown parameters by maximum likelihood when [mu](*) is known apart from a finite number of parameters, say [tau], has been considered by Engle et al. (1987) and Nelson (1991). In this case let [theta] = ([phi], [tau]), where [phi] = (a, [b.sub.1], ..., [b.sub.p], [c.sub.1], ..., [c.sub.q], d, v)' and [tau] is the vector of unknown mean parameters. Then [[epsilon].sub.t]([theta]) and [h.sub.t]([theta]) can be built up recursively given initial conditions, and the conditional log-likelihood function is

(7) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
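The display for (7) is not reproduced in this copy. As a hedged sketch, the per-period log-likelihood implied by the GED density (6) can be derived directly: the conditional density of [y.sub.t] is f([[epsilon].sub.t])/[[sigma].sub.t], and ln [[sigma].sub.t] = [h.sub.t]/2. The arrangement of terms in the original article may differ.

```latex
\ell_t(\theta) = \ln v - \ln \lambda - \left(1 + \tfrac{1}{v}\right) \ln 2
  - \ln \Gamma\!\left(\tfrac{1}{v}\right)
  - \tfrac{1}{2} \left| \frac{\epsilon_t(\theta)}{\lambda} \right|^{v}
  - \tfrac{1}{2}\, h_t(\theta),
\qquad
L_T(\theta) = \sum_{t=1}^{T} \ell_t(\theta),
\qquad
\epsilon_t(\theta) = \frac{y_t - \mu_t(\tau)}{\exp\{h_t(\theta)/2\}}
```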

The likelihood function can be maximized with respect to [phi] and [tau] using the Berndt, Hall, Hall, and Hausman (BHHH) algorithm,

(8) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

where [[lambda].sup.[i]] is a variable step length chosen to maximize the log-likelihood function in the given direction, and the score functions [l.sub.t[theta]] are evaluated at [[theta].sup.[i]]. Although the likelihood function is not smooth in all parameters (because of the presence of the absolute value of [epsilon]), this derivative-based method seems to work well in practice. Some authors have modified the specification by using a smooth substitute for the absolute function for values around 0 to avoid this problem. This proved unnecessary in our case.
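To make the mechanics of a BHHH update concrete, here is a minimal sketch in which the search direction is the inverse outer product of the scores times the summed score, and the step length is picked from a grid to increase the log-likelihood. It is applied to a toy model (independent normal data with unknown mean m and log variance s), not to the EGARCH likelihood of the article; the function names and the step grid are assumptions made for illustration.

```python
import math


def loglik_and_scores(theta, y):
    """Gaussian log-likelihood and per-observation scores for theta = (m, s),
    where m is the mean and s is the log variance (toy model, an assumption)."""
    m, s = theta
    var = math.exp(s)
    ll, scores = 0.0, []
    for yt in y:
        e = yt - m
        ll += -0.5 * (math.log(2.0 * math.pi) + s + e * e / var)
        scores.append((e / var, -0.5 + 0.5 * e * e / var))
    return ll, scores


def bhhh_step(theta, y):
    """One BHHH update: direction = (sum of score outer products)^-1 * (sum of scores)."""
    ll0, scores = loglik_and_scores(theta, y)
    g0 = sum(sc[0] for sc in scores)
    g1 = sum(sc[1] for sc in scores)
    a = sum(sc[0] * sc[0] for sc in scores)
    b = sum(sc[0] * sc[1] for sc in scores)
    d = sum(sc[1] * sc[1] for sc in scores)
    det = a * d - b * b
    dir0 = (d * g0 - b * g1) / det  # closed-form 2x2 inverse applied to the gradient
    dir1 = (-b * g0 + a * g1) / det
    best_ll, best_theta = ll0, theta
    for k in range(21):  # variable step length: 1, 1/2, 1/4, ..., 2^-20
        lam = 2.0 ** -k
        cand = (theta[0] + lam * dir0, theta[1] + lam * dir1)
        ll, _ = loglik_and_scores(cand, y)
        if ll > best_ll:
            best_ll, best_theta = ll, cand
    return best_theta, best_ll - ll0


def bhhh(theta, y, tol=1e-12, max_iter=500):
    """Iterate BHHH steps until the log-likelihood gain falls below tol."""
    for _ in range(max_iter):
        theta, gain = bhhh_step(theta, y)
        if gain < tol:
            break
    return theta
```

For this toy model the maximizer is known in closed form (the sample mean and the log of the mean squared deviation), which gives a simple check that the iteration behaves as expected.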

3.2 Semiparametric Estimation

We propose several methods of constructing estimates of [phi] and [mu](*) in the semiparametric model. We estimate [mu] using two main approaches. The first approach involves treating the T x 1 vector [mu] = ([[mu].sub.1], [[mu].sub.2], ..., [[mu].sub.T])' as unknown parameters and estimating them through a kernel-smoothing method inside the optimization routine. The second approach is to parameterize [mu](*) in a flexible way using series expansion methods. We use the Fourier flexible form of Gallant (1981), although other methods could be used. Estimation of [phi] is then achieved by concentrating the likelihood function. We describe the estimation and the construction of confidence intervals for each method in turn.

3.2.1 Kernel Estimation. The first method estimates [mu] by a smoothing procedure based on kernels (see Hardle 1990; Hardle and Linton 1994; Pagan and Ullah 1999 for a discussion of kernel nonparametric regression estimation). Suppose that we could obtain some estimate of [mu](*); then we could easily estimate the parameters of the variance and error distribution using maximum likelihood on the residuals. Unfortunately, it is very difficult to obtain a satisfactory direct estimate of [mu](*). In our time series model, the relevant information set is the entire infinite past; that is, [mu](*) = E[[y.sub.t]|[F.sub.t-1]] depends on the entire past of the series. Thus literally computing this expectation empirically is infeasible. One could argue, as did Pagan and Hong (1990), that consistent estimates of E[[y.sub.t]|[F.sub.t-1]] could be obtained using nonparametric regression with a truncated information set [F.sup.P(T).sub.t-1] = {[y.sub.t-1], ..., [y.sub.t-P]}, where P(T) [right arrow] [infinity] at a very slow rate. This estimate could then be used to obtain consistent estimates of the parameters of [h.sub.t]. This is not a particularly appealing procedure from a practical standpoint, because of the high dimension of the conditioning set. Silverman (1986) dramatically illustrated the curse of dimensionality by showing the effective sample size needed to achieve a certain precision.

In semiparametric problems where one cannot obtain direct estimates of the nonparametric function, one can often instead use a semiparametric profile likelihood method as described by Powell (1994) in which the nonparametric function is estimated for each given parameter value and then the parameters are chosen to minimize some criterion function that would have been the likelihood if the functions were known rather than estimated. In general, such parametric estimators are root-n consistent and asymptotically normal, and the nonparametric estimators are at least consistent. Unfortunately, in our model we cannot define the corresponding profiled quantity [[mu].sub.[phi]]([[sigma].sup.2.sub.t]) so easily, because [[sigma].sup.2.sub.t] depends not only on the parameters, but also on lagged [epsilon]'s, which in turn depend on lagged [mu]'s. Therefore, we need to know the entire function [mu](*) (or at least its values at the T sample points) to construct [[mu].sub.[phi]]([[sigma].sup.2.sub.t]).

At first glance, this might appear to make the estimation procedure hopeless; but this is a false impression. The same sort of issues arise in the estimation of additive nonparametric models and an enormous literature has arisen that proposes estimation algorithms, and, more recently, distribution theory (see, e.g., Breiman and Freedman 1985; Hastie and Tibshirani 1990; Opsomer and Ruppert 1997; Mammen, Linton, and Nielsen 1998). We borrow from this literature and suggest an estimation procedure based on iterative updating of both the finite-dimensional parameters [phi] and the function [mu](*). We first pick starting values for [mu] and [phi]. We then define a modified version of the Berndt, Hall, Hall, and Hausman (BHHH) algorithm to update our estimates of [phi]. Finally, we update our estimates of [mu] using kernel estimates based on the previous iteration's filtered log variances. The main advantage of the procedure is that it relies only on one-dimensional smoothing operations at each step, so that the curse of dimensionality does not operate. The main disadvantage is that the procedure is time-consuming and may or may not converge to local minima.

For convenience, we describe our algorithm for the case p = q = 1. We smooth on the log of variance [h.sub.t] instead of the variance itself. Because the logarithm is a monotonic transformation, the two approaches are equivalent, but because log variance has a more symmetric distribution with less effect from outliers, it helps in selecting a bandwidth. Our main algorithm is as follows:

Kernel estimation algorithm

1. Choose starting values for [[phi].sup.[0]] and [{[[mu].sup.[0].sub.s]}.sup.T.sub.s=1] that imply [{[h.sup.[0].sub.s]}.sup.T.sub.s=1].

2. Given [{[h.sup.[r-1].sub.t]}.sup.T.sub.t=1], calculate

(9) [[mu].sup.[r].sub.t] = [[SIGMA].sub.s] K(([h.sup.[r-1].sub.t] - [h.sup.[r-1].sub.s]) / [delta]) [y.sub.s] / [[SIGMA].sub.s] K(([h.sup.[r-1].sub.t] - [h.sup.[r-1].sub.s]) / [delta])

for t = 1,2, ..., T, where [delta] > 0 is a small bandwidth parameter and K is a bounded kernel satisfying [integral] K(u) du = 1.

3. Given initial values [h.sup.[r].sub.0]([phi]) and [[epsilon].sup.[r].sub.0]([phi]), define recursively for any parameter value [phi]

[h.sup.[r].sub.t] = a + b[h.sup.[r].sub.t-1] + [c.sub.1](|[[epsilon].sup.[r].sub.t-1]| - E|[[epsilon].sup.[r].sub.t-1]| + d[[epsilon].sup.[r].sub.t-1])

and

[[epsilon].sup.[r].sub.t] = ([y.sub.t] - [[mu].sup.[r].sub.t]) / [[exp([h.sup.[r].sub.t])].sup.1/2],

for t = 1, 2, ..., T. Then for any [phi], construct [l.sup.[r].sub.t]([phi]) = [l.sub.t]([phi]; [[mu].sup.[r]]), the period t contribution to the rth likelihood function, where [[mu].sup.[r]] = ([[mu].sup.[r].sub.1], ..., [[mu].sup.[r].sub.T])'.

4. Calculate

(10) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

where [l.sup.[r].sub.t[phi]] is the vector of partial derivatives of [l.sup.[r].sub.t]([phi]) with respect to [phi] evaluated at [[phi].sup.[r]] and [[mu].sup.[r]].

5. Repeat until convergence. We define convergence in terms of the relative gradient and the change in the nonparametric estimate, that is,

(11) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII],

for some small prespecified [epsilon] (we set [epsilon] = [10.sup.-4]). Denote the resulting estimates by [phi] and [mu].
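Steps 2 and 3 of the algorithm above can be sketched as follows for p = q = 1. This is an illustration under stated assumptions, not the authors' code: the Gaussian kernel is one admissible choice of K (the text requires only a bounded kernel that integrates to 1), and the function names `nw_smooth` and `egarch_filter`, as well as the argument `abs_mean` standing for E|[[epsilon].sub.t]|, are hypothetical.

```python
import math


def gaussian_kernel(u: float) -> float:
    """Bounded kernel integrating to one (assumed choice of K)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_smooth(h_prev, y, delta):
    """Step 2, equation (9): kernel-weighted average of y_s, with weights based
    on the distance between the previous iteration's log variances."""
    mu = []
    for ht in h_prev:
        weights = [gaussian_kernel((ht - hs) / delta) for hs in h_prev]
        mu.append(sum(w * ys for w, ys in zip(weights, y)) / sum(weights))
    return mu


def egarch_filter(y, mu, a, b, c1, d, abs_mean, h0, eps0):
    """Step 3: rebuild log variances h_t and standardized residuals eps_t
    recursively, given the current mean estimates mu_t and parameters."""
    h, eps = [], []
    h_lag, eps_lag = h0, eps0
    for yt, mut in zip(y, mu):
        ht = a + b * h_lag + c1 * (abs(eps_lag) - abs_mean + d * eps_lag)
        et = (yt - mut) / math.exp(0.5 * ht)
        h.append(ht)
        eps.append(et)
        h_lag, eps_lag = ht, et
    return h, eps
```

One pass of the outer loop would call `nw_smooth` on the previous log variances, then `egarch_filter` inside the likelihood that the BHHH update of step 4 maximizes. Each smoothing pass costs on the order of T squared kernel evaluations, which is one reason the procedure is time-consuming.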

We are unable to prove convergence of the foregoing algorithm, although in practice it seems to work reasonably well and to give similar answers for a range of starting values. Note that convergence of the backfitting algorithm for separable nonparametric regression has been shown only in some special cases--specifically, when the estimator is linear in the dependent variable. However, backfitting has been defined and widely used to estimate more general models than additive nonparametric regression (see Hastie and Tibshirani 1990), and it is widely believed to do a good job in such cases. In addition, Audrino and Buhlmann (2001) have proposed an iterative algorithm for estimating a nonparametric volatility model. They provided a result on convergence in a special case where a contraction property can be established. Dominitz and Sherman (2001) have reported some related results in parametric cases. Unfortunately, no such contraction property can be guaranteed here.

In practice, the estimated parameters of [h.sub.t] appear to be quite robust to different parametric specifications of the mean equation. The filtered estimate of [h.sub.t] based on [[mu].sup.[0].sub.t] = [T.sup.-1] [[SIGMA].sup.T.sub.s=1][y.sub.s] should be close to the true [h.sub.t] and should provide good starting values. We also use the fitted values from an EGARCH-M model as starting values to check for robustness. As in the parametric case, additional iterations should improve the performance of the estimated parameters and function.

The stopping rule (11) was arrived at after some experimentation. It is desirable to ensure that the entire parameter vector ([phi], [mu]) is convergent.

3.2.2 Fourier Series Estimation. The second approach that we consider is to parameterize the mean equation using a flexible functional form. By letting the number of terms grow with sample size and with a suitable choice of basis functions, this method can approximate arbitrary functions. This is an example of sieve estimation, but for a given sample size, it reduces to a parametric method with a finite number of parameters, and the estimation algorithm is just the standard BHHH algorithm given earlier.

The basis that we use is a modification of the flexible Fourier form of Gallant (1981) by adding sine and cosine terms to a linear function. Because this method uses trigonometric terms, it is convenient for the data to lie in the [0, 2[pi]] interval. We recenter and rescale the estimates of [h.sub.t] and define a new variable,

(12) [h.sup.*.sub.t] = ([h.sub.t] - [h.sub.L]) 2[pi] / ([h.sub.U] - [h.sub.L]),

where [h.sub.L] and [h.sub.U] are scalars (a lower and an upper bound) such that [h.sub.L] is less than min([h.sub.t]) and [h.sub.U] is greater than max([h.sub.t]). Then the Fourier approximation is

(13) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

The number of parameters to estimate is p + q + 2M + 5.
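The rescaling in (12) and a Fourier flexible form of order M can be sketched as below. The display for (13) is not reproduced in this copy, so the functional form used here (a linear term plus M sine and cosine pairs) is an assumption inferred from the verbal description, from the M = 1 expression in the bootstrap algorithm of Section 4, and from the gradient vector (15). The names `rescale`, `fourier_basis`, and `fourier_mean` are hypothetical.

```python
import math


def rescale(h, h_lo, h_hi):
    """Equation (12): map log variances into the open interval (0, 2*pi).
    h_lo must lie below min(h) and h_hi above max(h)."""
    return [(ht - h_lo) * 2.0 * math.pi / (h_hi - h_lo) for ht in h]


def fourier_basis(h_star: float, M: int):
    """Basis (1, h*, sin(h*), cos(h*), ..., sin(M h*), cos(M h*))."""
    basis = [1.0, h_star]
    for j in range(1, M + 1):
        basis.append(math.sin(j * h_star))
        basis.append(math.cos(j * h_star))
    return basis


def fourier_mean(h_star: float, tau, M: int) -> float:
    """Assumed form of (13): inner product of the basis with the 2M + 2
    mean parameters tau = (gamma_0, gamma_1, psi_1, phi_1, ..., psi_M, phi_M)."""
    return sum(t * b for t, b in zip(tau, fourier_basis(h_star, M)))
```

The 2M + 2 mean parameters, together with the p + q + 3 variance and distribution parameters (a, the b's, the c's, d, and v), account for the p + q + 2M + 5 total stated above.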

4. INFERENCE

There is a general theory of inference for maximum likelihood and quasi-maximum likelihood estimators in time series (see Wooldridge 1994 for a state-of-the-art survey). Specifically, Bollerslev and Wooldridge (1992) showed that under high-level conditions, quasi-maximum likelihood estimators in a parametric GARCH model are consistent and asymptotically normal provided only that the mean and the variance equations are correctly specified. However, their theory is based on high-level conditions that are rather difficult to verify even in the simplest cases. Authors who have derived an asymptotic theory for these models from primitive conditions include Weiss (1986), for ARCH models, and Lumsdaine (1996) and Lee and Hansen (1994), for the GARCH(1,1) model. For other specifications in the GARCH class, the asymptotic theory used in practice is not known to be valid. Similarly, the distribution theory for the EGARCH model of Nelson even in the special case with no mean effects and normal errors has not yet been established rigorously. However, there is much simulation evidence to support the normal approximation in this general class of models, and the results of Bollerslev and Wooldridge (1992) are widely believed to hold more generally and are frequently used in practice. Gonzalez-Rivera and Drost (1999) have investigated the efficiency of various estimation criteria under different specifications.

Given the complicated structure of our semiparametric model, it is not surprising that we cannot provide rigorous asymptotic theory for our estimators. However, if [h.sub.t] were observed, then a kernel estimate of [mu](*) as in (9) would be consistent and asymptotically normal under appropriate conditions, because the process [h.sub.t] is weakly dependent. Therefore, the results of Robinson (1983) can be applied to establish consistency, provided that [delta](T) [right arrow] 0 at an appropriate rate. This argument can be extended to the case where [h.sub.t] is replaced by a consistent parametric estimate. Indeed, the asymptotic distribution of nonparametric estimates is usually independent of any preliminary parametric estimation (Powell 1994). We therefore expect [[mu].sub.t] to be consistent at the usual nonparametric rate. Regarding [phi], we expect it to be [square root of (T)] consistent and to have a limiting normal distribution, with a variance including some component arising from the estimation of [mu].

We now turn to the construction of standard errors for the parameter estimates and the risk premium. For the former, we report analytical and bootstrap standard errors. The analytical standard errors are obtained by taking the outer product of the gradient with respect to the estimated parameters when the GED distribution is used. When the conditional distribution is Gaussian, we use the Bollerslev-Wooldridge (1992) quasi-maximum likelihood estimator standard errors. For the kernel estimator, the estimated parameters are just [phi], the parameters of the error distribution and the variance process, whereas for the series estimator we are estimating these parameters jointly with the pseudoparameters, [tau], of the mean function. For the series estimator, we therefore compute standard errors from the matrix [[[SIGMA].sup.T.sub.t=1] [l.sub.t[theta]] [l'.sub.t[theta]] ([theta])].sup.-1], whereas for the kernel estimators we compute them from the smaller matrix, [[[SIGMA].sup.T.sub.t=1] [l.sub.t[theta]] [l'.sub.t[theta]] ([phi], [mu])].sup.-1]. The kernel standard errors asymptotically understate the true uncertainty associated with the parameter estimates, because they neglect the loss of efficiency associated with nonparametric estimation of [mu](*).

The second method of obtaining standard errors is through the bootstrap. The numerous methods for time series models include some that make very weak assumptions regarding the dependence structure, like the block bootstrap and the sieve bootstrap. In practice, however, the performance of these methods depends a lot on the implementation and the model structure. We instead prefer a bootstrap procedure that uses some of our model structure. We give an algorithm for calculating such confidence intervals for p = q = 1 in the case of the kernel procedure. We use a modified version of the wild bootstrap (see Hardle 1990, p. 247), because we do not wish to rule out higher-order conditional heterogeneity, which is relevant for the sampling variability of our estimators.

Wild bootstrap algorithm

1. Given estimates [mu], [phi], [h.sub.t]([phi], [mu]), and [[epsilon].sub.t] = [[epsilon].sub.t]([phi], [mu]), calculate the recentered residuals, [[epsilon].sup.c.sub.t] = [[epsilon].sub.t] - [T.sup.-1] [[SIGMA].sup.T.sub.s=1][[epsilon].sub.s].

2. Let [z.sub.t] be a random variable with E([z.sup.j.sub.t]) = 0 for j = 1, 3 and E([z.sup.j.sub.t]) = 1 for j = 2, 4. Draw a random sample {[z.sub.1], ..., [z.sub.T]} from this distribution and let [[epsilon].sup.*.sub.t] = [[epsilon].sup.c.sub.t] * [z.sub.t]. The variable [[epsilon].sup.*.sub.t] will satisfy E([[epsilon].sup.*.sub.t]) = 0, E([[epsilon].sup.*2.sub.t]) = [[epsilon].sup.c2.sub.t], E([[epsilon].sup.*3.sub.t]) = 0, and E([[epsilon].sup.*4.sub.t]) = [[epsilon].sup.c4.sub.t]. We choose [z.sub.t] to be a discrete variable that takes values -1 and 1 with equal probability.

3. Given starting values [h.sub.0] and [[epsilon].sup.*.sub.0], define recursively

[h.sub.t] = a + b[h.sub.t-1] + [c.sub.1]([|[[epsilon].sup.*.sub.t-1]| - E|[[epsilon].sup.*.sub.t-1]|] + d[[epsilon].sup.*.sub.t-1])

and

[y.sup.*.sub.t] = [mu]([h.sub.t]; [{[h.sub.s]}.sup.T.sub.s=1]) + [[epsilon].sup.*.sub.t] [[sigma].sup.*.sub.t],

with the corresponding choice of [mu](*). In the case of the kernel estimator, some auxiliary bandwidth parameter [delta] that over-smooths the data should be chosen, where

[mu](x; [{[h.sub.s]}.sup.T.sub.s=1], [delta]) = [[SIGMA].sub.s[not equal to]t] K((x - [h.sub.s]) / [delta]) [y.sub.s] / [[SIGMA].sub.s[not equal to]t] K((x - [h.sub.s]) / [delta]),

whereas for the Fourier series

[mu]([h.sub.t]) = [[gamma].sub.0] + [[gamma].sub.1][h.sup.*.sub.t] + [[psi].sub.1] sin([h.sup.*.sub.t]) + [[phi].sub.1] cos([h.sup.*.sub.t]),

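with [h.sup.*.sub.t] = ([h.sub.t] - [h.sub.L]) 2[pi] / ([h.sub.U] - [h.sub.L]), where [h.sub.L] and [h.sub.U] are the lower and upper bounds used in the rescaling (12).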

4. Given [{[y.sup.*.sub.t]}.sup.T.sub.t=1], calculate parameter estimates [[phi].sup.*] using the foregoing quasi-Newton procedure.

5. Repeat steps 2-4 m times. The standard errors are estimated from the sample standard deviation of the bootstrap parameter estimates [[phi].sup.*].

This method of obtaining standard errors is time-consuming for large datasets because it relies on simulation. However, it should fully reflect the loss of precision associated with estimating [mu](*). We impose a condition of symmetry on the errors for simplicity. However, we do not impose the restriction E([[epsilon].sup.*2.sub.t]) = 1, because this would require E([z.sup.2.sub.t]) = 1/[[epsilon].sup.c2.sub.t], which is numerically unstable and generates paths with very large outliers. Our chosen distribution for [z.sub.t] is the Rademacher distribution advocated by Davidson and Flachaire (2001) based on Edgeworth expansions.
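Steps 1 to 3 of the wild bootstrap algorithm can be sketched as follows for p = q = 1. Several details are assumptions made for illustration: the function names, the use of exp([h.sub.t]/2) for the bootstrap standard deviation, the constant `abs_mean` standing for the expected absolute shock in the recursion, and the callable `mu_fn` standing for whichever mean function (over-smoothed kernel or Fourier series) is being used.

```python
import math
import random


def recenter(eps):
    """Step 1: subtract the sample mean from the standardized residuals."""
    mean = sum(eps) / len(eps)
    return [e - mean for e in eps]


def rademacher(n, rng):
    """Step 2: draws equal to -1 or +1 with probability 1/2 each."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def wild_bootstrap_path(eps_hat, mu_fn, a, b, c1, d, abs_mean, h0, eps0, rng):
    """Steps 1-3: build one bootstrap sample y* from sign-flipped residuals."""
    eps_c = recenter(eps_hat)
    z = rademacher(len(eps_c), rng)
    eps_star = [e * zt for e, zt in zip(eps_c, z)]
    h, y_star = [], []
    h_lag, e_lag = h0, eps0
    for e in eps_star:
        ht = a + b * h_lag + c1 * (abs(e_lag) - abs_mean + d * e_lag)
        # Assumed: bootstrap standard deviation is exp(h_t / 2).
        y_star.append(mu_fn(ht) + e * math.exp(0.5 * ht))
        h.append(ht)
        h_lag, e_lag = ht, e
    return y_star, h, eps_star
```

Re-estimating the model on each simulated path (step 4) and repeating m times (step 5) would then give the bootstrap distribution of the parameter estimates. Passing a seeded `random.Random` instance as `rng` keeps the draws reproducible.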

The second problem--construction of confidence intervals for [mu]--can be approached in two ways. We can think of standard errors that are conditional on a value of [h.sub.t] (and therefore allow us to look at the issue of the shape of the risk premium) and those that are conditional on all observables and thus allow us to run real-time experiments, and would be of interest to a decision maker. The second type is more difficult to construct because [h.sub.t] depends on the infinite past, and hence these standard errors must be built up recursively.

On the other hand, computing standard errors conditional on the value of [h.sub.t] is rather simple. For the kernel method, the variance of [[mu].sub.t] was given by Hardle (1990) as

(14) (1 / (n[delta])) [[sigma].sup.2.sub.t] [integral] K[(u).sup.2] du / f([h.sub.t]),

where n = T is the sample size and f([h.sub.t]) is the ergodic density of [h.sub.t] evaluated at [h.sub.t]. This quantity can be estimated by replacing [[sigma].sup.2.sub.t] and f([h.sub.t]) with their sample estimates.
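As a rough numerical illustration of (14), the sketch below evaluates the pointwise variance when K is the Gaussian kernel, for which the integral of K squared equals 1/(2 sqrt(pi)). The Gaussian kernel, the kernel density estimate used for f, and the function names are assumptions made for this example; the article does not specify them in this passage.

```python
import math

# Integral of K(u)^2 du for the standard Gaussian kernel.
GAUSSIAN_ROUGHNESS = 1.0 / (2.0 * math.sqrt(math.pi))


def kernel_density(x, sample, delta):
    """Kernel density estimate of f at x with a Gaussian kernel (assumed)."""
    total = 0.0
    for s in sample:
        u = (x - s) / delta
        total += math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return total / (len(sample) * delta)


def kernel_mean_variance(sigma2_t, f_t, n, delta, roughness=GAUSSIAN_ROUGHNESS):
    """Equation (14): sigma_t^2 * int K(u)^2 du / (n * delta * f(h_t))."""
    return sigma2_t * roughness / (n * delta * f_t)
```

The formula makes the trade-off explicit: the variance falls with the sample size and the bandwidth, and rises where the density of [h.sub.t] is low, so the risk premium is estimated least precisely at extreme volatility levels.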

For the series approximation, define [tau] as the estimated mean parameters and let [H.sub.t] be the vector of slopes, that is, [differential][mu]/[[differential][tau]|.sub.[tau]]. For instance, for the Fourier series,

(15) [H.sub.t] = (1, [h.sup.*.sub.t], sin([h.sup.*.sub.t]), cos([h.sup.*.sub.t]), ..., sin(M[h.sup.*.sub.t]), cos(M[h.sup.*.sub.t]))'.

Then

(16) var[[mu]([h.sub.t])|[h.sub.t]]=[H'.sub.t]var([tau])[H.sub.t],

where var([tau]) is the appropriate submatrix of the covariance matrix of [theta] obtained by the bootstrap method as described earlier.
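The quadratic form in (16) is straightforward to compute once a covariance matrix for the mean parameters is available. The following minimal sketch builds [H.sub.t] as in (15) and evaluates [H'.sub.t] var([tau]) [H.sub.t]; the function names and the nested-list representation of the covariance matrix are hypothetical choices for illustration.

```python
import math


def fourier_gradient(h_star: float, M: int):
    """Equation (15): derivative of the series mean with respect to tau."""
    grad = [1.0, h_star]
    for j in range(1, M + 1):
        grad.append(math.sin(j * h_star))
        grad.append(math.cos(j * h_star))
    return grad


def series_mean_variance(h_star: float, cov_tau, M: int) -> float:
    """Equation (16): H_t' var(tau) H_t, with cov_tau a (2M+2) x (2M+2)
    matrix given as a list of rows."""
    H = fourier_gradient(h_star, M)
    if len(cov_tau) != len(H):
        raise ValueError("cov_tau must have 2M + 2 rows")
    return sum(H[i] * cov_tau[i][j] * H[j]
               for i in range(len(H)) for j in range(len(H)))
```

A pointwise confidence band for the risk premium at a given [h.sub.t] would then use the square root of this quantity as the standard error.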

Finally, choice of bandwidth is a nontrivial problem here. It is necessary to undersmooth our estimate of [mu](*) to obtain good estimates of [phi], as has been pointed out by, for example, Robinson (1988). We adopt a cross-validation approach in which we maximize the likelihood function for each point on a grid of [delta] and choose the value that maximizes the (leave-one-out) likelihood function. However, to obtain a reasonable choice of bandwidth, we needed to remove the outliers when doing this; we removed 25% at each end of the data.

5. NUMERICAL RESULTS

5.1 Empirical Results

5.1.1 Data. We examine the monthly excess returns on the most comprehensive CRSP value-weighted index (including dividends)--the monthly continuously compounded return on the index minus the monthly return on the 30-day Treasury Bills--over the period January 1926-December 2001. The data were obtained from the CRSP, which includes the New York Stock Exchange (NYSE), American Stock Exchange (AMEX), and NASDAQ and is perhaps the best readily available proxy for "the market." We also conducted an analysis on the Standard & Poor (S&P) 500 series and obtained similar results. The data are plotted in the top panel of Figure 1. Table 1 reports sample moments for the data over the whole sample and two subsamples, each containing approximately half of the data: I (1926-1961) and II (1962-2000).

[FIGURE 1 OMITTED]

There is strong evidence of leptokurtosis and negative skewness in the full sample and in both subsamples. The table reveals some differences in moments across subsamples. In particular, the first subperiod has much higher mean and variance, more pronounced negative skewness, and fatter tails than the rest of the sample. The standard deviation is approximately 10 times the size of the mean, which appears to support the widely held view that it is fundamentally difficult to estimate any mean effect in the presence of such large volatility. However, from a nonparametric standpoint, this evidence is not by itself convincing. The global moments sit at one end of the smoothing spectrum, where the bandwidth is infinite. At the other end, where the bandwidth is 0, the point mean equals the observation itself and the point standard deviation is the same quantity. To illustrate this point, we computed a running mean and running standard deviation with seven observations and equal weighting. The results, shown in the bottom two panels of Figure 1, reveal the time-varying nature of the mean and volatility. At this frequency, the mean and standard deviation are much closer in magnitude. Note also that this approach to estimating volatility provides estimates similar to those obtained from the dynamic models that we propose. In both cases, estimated volatility is high around well-known events, including the depression years, World War II, the oil shock, and the 1987 crash.
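A minimal sketch of the running-moment calculation follows. The text specifies seven observations with equal weights; the centered window and the use of the sample standard deviation are assumptions made here, and the toy series is for illustration only. The block also reproduces the full-sample ratio of standard deviation to mean implied by Table 1.

```python
import math
import statistics

def running_moments(x, window=7):
    """Equal-weight running mean and standard deviation over a centered window."""
    half = window // 2
    means, sds = [], []
    for t in range(half, len(x) - half):
        chunk = x[t - half : t + half + 1]
        means.append(statistics.mean(chunk))
        sds.append(statistics.stdev(chunk))  # sample standard deviation (assumption)
    return means, sds

# Full-sample ratio implied by Table 1 (both entries are reported x100).
full_mean = 0.4987 / 100
full_sd = math.sqrt(0.3031 / 100)
ratio = full_sd / full_mean  # roughly 11

# Toy series, for illustration only.
toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
toy_means, toy_sds = running_moments(toy)
```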

5.1.2 Estimation. We first discuss some model selection choices that had to be made. For the series estimator, values of the tuning parameters of up to 3 were considered, with the models selected by the Akaike information criterion (AIC), which maximizes 2 ln L([omega]) - 2k, where k is the number of parameters in the model, and the Bayesian information criterion (BIC), which maximizes 2 ln L([omega]) - k ln T. These two criteria gave somewhat conflicting results, but both favor the model with p = 2, q = 1, and M = 1. For the EGARCH-M model, AIC chooses p = 1 and q = 3, whereas BIC chooses p = 1 and q = 1. The selected model is the second choice for both criteria. For the model with Fourier terms, AIC chooses p = 2, q = 1, and M = 2, whereas BIC chooses p = 1, q = 1, and M = 0. The selected model is only marginally worse than these preferred ones. The values p = 2 and q = 1 were also chosen by Nelson (1991).
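To make the two criteria concrete, the sketch below evaluates them in the "larger is better" form stated above, using the likelihood values reported in Table 2 for the Fourier-GED and EGARCH-M columns. The parameter counts (10 and 8) are our own count of the rows with estimates, T = 912 assumes 76 years of monthly data with no observations lost to lags, and the table entries are assumed to be log-likelihoods. This cross-family comparison is purely illustrative; the selection described in the text is made within each family.

```python
import math

def aic(loglik, k):
    """Akaike criterion in the form 2 ln L - 2k (larger is better)."""
    return 2.0 * loglik - 2.0 * k

def bic(loglik, k, T):
    """Bayesian criterion in the form 2 ln L - k ln T (larger is better)."""
    return 2.0 * loglik - k * math.log(T)

T = 912  # assumption: January 1926 to December 2001, monthly
fourier = {"loglik": 1510.8, "k": 10}   # Table 2, Fourier-GED column
egarch_m = {"loglik": 1507.2, "k": 8}   # Table 2, EGARCH-M column

aic_fourier = aic(fourier["loglik"], fourier["k"])
aic_egarch = aic(egarch_m["loglik"], egarch_m["k"])
bic_fourier = bic(fourier["loglik"], fourier["k"], T)
bic_egarch = bic(egarch_m["loglik"], egarch_m["k"], T)
```

Under these assumptions AIC ranks the Fourier specification higher while BIC ranks the linear one higher, which illustrates how the heavier BIC penalty can produce the kind of conflict described above.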

We chose the same values of p and q when estimating the model using the kernel approach. (Results for other choices of p and q are available from the authors on request.) It is difficult to compare the fit of the model estimated with the kernel for various values of p and q, because the models are then nonnested. As explained earlier, the bandwidth was selected by cross-validation over a grid of potential bandwidths. The bandwidth has the form

(17) [delta] = k[sigma]([h.sub.t])[T.sup.-1/5],

where [sigma]([h.sub.t]) is the standard deviation of [h.sub.t], updated at each iteration to reflect the new estimates of [h.sub.t]. The bandwidth constant k is allowed to vary between .5 and 2.5 in increments of .1, and the estimated value of k is the one that produces the highest value of the cross-validated likelihood. We set the lower and upper bounds for [h.sub.t] at -10 and -2, based on the results from kernel estimation, which does not impose such restrictions. We also check to ensure that no value of [h.sub.t] lies outside of these bounds in the course of optimization.
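The grid search over the bandwidth constant can be sketched as follows. This is a simplified stand-in, not the procedure used in the paper: it applies leave-one-out Nadaraya-Watson regression with a Gaussian kernel and a squared-error criterion instead of the cross-validated likelihood, it holds [h.sub.t] fixed instead of updating it at each iteration, and the data are simulated. All function names are hypothetical.

```python
import math
import random
import statistics

def bandwidth_grid(h, k_min=0.5, k_max=2.5, step=0.1):
    """Return (k, delta) pairs with delta = k * sd(h) * T**(-1/5)."""
    scale = statistics.pstdev(h) * len(h) ** (-0.2)
    n_steps = round((k_max - k_min) / step)
    ks = [k_min + i * step for i in range(n_steps + 1)]
    return [(k, k * scale) for k in ks]

def loo_fit(h, y, delta):
    """Leave-one-out kernel regression fits (Gaussian kernel)."""
    fits = []
    for i, h_i in enumerate(h):
        num = den = 0.0
        for j, h_j in enumerate(h):
            if j != i:
                w = math.exp(-0.5 * ((h_i - h_j) / delta) ** 2)
                num += w * y[j]
                den += w
        # Fall back to the sample mean if every weight underflows to zero.
        fits.append(num / den if den > 0.0 else statistics.mean(y))
    return fits

def cv_criterion(h, y, delta):
    """Mean squared leave-one-out prediction error (stand-in criterion)."""
    fits = loo_fit(h, y, delta)
    return sum((a - b) ** 2 for a, b in zip(y, fits)) / len(y)

# Simulated data, for illustration only.
rng = random.Random(0)
h = [rng.gauss(-6.0, 0.8) for _ in range(150)]
y = [0.01 * math.sin(v) + rng.gauss(0.0, 0.05) for v in h]

grid = bandwidth_grid(h)
best_k, best_delta = min(grid, key=lambda kd: cv_criterion(h, y, kd[1]))
```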

We now turn to the estimation results. The results from the estimation using the two methods considered here and their associated standard errors ([se.sub.[phi]]([phi])) are presented in Table 2. Our parameter estimates appear quite robust to the estimation method chosen. They are also consistent with many other studies in the area (e.g., Nelson 1991; Glosten, Jagannathan, and Runkle 1993; Bollerslev et al. 1994). In particular, volatility persistence is quite high (the sum of the estimates of [b.sub.1] and [b.sub.2] is well over .9), and the estimate of the leverage effect d is negative. But this parameter is not precisely estimated with the kernel procedure and is not significantly different from 0. Finally, the estimated value of v is around 1.4, which is again consistent with previous findings. The innovation density that we find has fatter tails than the normal, because v < 2. Note that the bootstrap standard errors tend to be larger than the analytic standard errors, sometimes dramatically so. The QMLE standard errors are not appreciably different from those obtained from GED estimation.

The last row of Table 2 lists results of a likelihood ratio test for the significance of the coefficients on the nonlinear terms in the Fourier series. The results clearly show that linearity is strongly rejected at usual significance levels.
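As a check on the arithmetic, the p values in Table 2 can be recovered from the reported test statistics under the assumption that the likelihood ratio statistic is referred to a chi-square distribution with two degrees of freedom, one for each restricted coefficient when M = 1. With two degrees of freedom the upper tail probability has the closed form exp(-x/2), so no statistical library is needed.

```python
import math

def chi2_sf_2df(x):
    """Upper tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

p_ged = chi2_sf_2df(7.369)   # Fourier-GED statistic from Table 2
p_qmle = chi2_sf_2df(8.956)  # Fourier-QMLE statistic from Table 2
```

Rounded to three decimals, these give .025 and .011, matching the p values listed in the table.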

The risk premium estimated using the kernel method is graphed in Figure 2(a) as a function of [h.sub.t]. Confidence intervals at the 95% level constructed using the pointwise kernel confidence intervals are also provided. The figure clearly reveals a nonmonotonic relation between [h.sub.t] and E[[y.sub.t]|[F.sub.t-1]]. This is consistent with the findings of Backus and Gregory (1993), Whitelaw (2000), and Veronesi (2001) that in general equilibrium the risk premium may have virtually any shape. Although the estimated risk premium is not significantly different from a constant at this level for some part of its range, the evidence is stronger in the middle range [h.sub.t] [member of] [-7.5, -5.5], which is where most of the data lie (Figure 2(c) plots the marginal density). The negative values of the risk premium for some values of the state are not statistically significant. The evolution of the estimated risk premium, conditional standard deviation, and Sharpe ratio (in monthly terms) are illustrated in Figure 3. The episodes of high volatility revealed by this figure coincide closely with those obtained by a simple running average, as is done in Figure 1. Note that stocks were a great deal in the 1990s according to the Sharpe ratio, but that they have become much less so in recent years.

[FIGURES 2-3 OMITTED]

Figure 2(b) provides the shape of the risk premium estimated using the Fourier series. The graph also includes the analytical 95% confidence intervals conditional on [h.sub.t]. Again, the estimated shape is nonlinear.

The two smoothing methods have advantages and disadvantages. The kernel estimate appears quite imprecise at the endpoints, where there is not much data, as evidenced by the large standard errors and the volatile point estimates. The Fourier series method, on the other hand, is very smooth and gives the appearance of being precisely estimated. However, there is a pronounced upward slope at the high end, which seems at odds with the kernel method finding. This end trend is quite symptomatic of these polynomial-based methods, and we view it with some skepticism. Note also the difference in the standard errors for the two methods. The confidence band for the Fourier series method is almost the same width throughout the shown range, whereas the confidence band for the kernel is very wide at the endpoints, reflecting the relative paucity of the data in this region. Thus the Fourier series confidence band gives the appearance of being very precisely estimated in a region where we have little data. This is because it is a global fitting method that draws its estimates from all of the data. We thus redraw the two estimates on the same plot in Figure 2(d), the bottom right panel. The methods agree quite closely; there is a hump shape that is first concave and then convex.

Finally, we provide some diagnostics on the standardized residuals [[epsilon].sub.t] = ([y.sub.t] - [[mu].sub.t])/[[sigma].sub.t]. We report the results for just the kernel; similar results have been obtained for the series approach. The plots of the autocorrelogram of both the residuals and their squares indicate that they are close to white noise; there are 4 significant autocorrelation coefficients at the 5% level among the first 100 lags in the levels and 5 significant autocorrelations in the squares.
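For context, under white noise roughly 5 of the first 100 sample autocorrelations would be expected to fall outside a 5% band by chance alone, so counts of 4 and 5 are in line with that benchmark. The sketch below shows one way to count such exceedances, assuming the usual approximate band of 1.96 divided by the square root of T; the series is simulated Gaussian noise, not the residuals from the paper, and the function names are hypothetical.

```python
import math
import random

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    ck = sum((x[t] - m) * (x[t - lag] - m) for t in range(lag, n))
    return ck / c0

def count_significant(x, max_lag=100):
    """Count lags whose autocorrelation lies outside +/- 1.96 / sqrt(T)."""
    band = 1.96 / math.sqrt(len(x))
    return sum(1 for k in range(1, max_lag + 1) if abs(autocorr(x, k)) > band)

# Simulated stand-in for standardized residuals (illustration only).
rng = random.Random(42)
resid = [rng.gauss(0.0, 1.0) for _ in range(912)]
n_levels = count_significant(resid)
n_squares = count_significant([v * v for v in resid])
```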

5.1.3 Subsample Estimation. To see how robust our estimates are, we reestimated the model over two subsamples (1926-1961 and 1962-2001) using the kernel method. The results are presented in Table 3 (with analytical standard errors in parentheses).

The results show quite a bit of instability in the point estimates. Figure 5 illustrates the estimated risk premium using the same scale as in the other figures. Because the second subsample is characterized by lower volatility than the first one, the estimated log-volatility is concentrated toward the left of the graph for that period. The estimated risk premium of the second period is much flatter than that of the first period because of the much larger bandwidth constant chosen, although the point estimate suggests a similar nonmonotonic shape as for the full sample and the first subsample.

[FIGURE 5 OMITTED]

5.2 Monte Carlo

To appreciate the performance of our kernel procedure in estimating the risk premium in financial data, we carried out four simulation experiments. We repeated each experiment 5,000 times on samples of size 500. To make the experiments as realistic as possible, we set the parameters of each experiment to values estimated from our dataset. The data-generating processes used for the experiments are presented in Table 4.

Simulation experiment 1 involves generating a risk premium from a linear model. We thus estimated an EGARCH-M model with GED errors from the data (results presented in the last column of Table 2) and used it to generate 5,000 samples. We then applied our nonparametric procedure to these simulated samples. The results are presented in Figure 6(a). The solid line represents the true risk premium, which is linear. The long-dashed line is the median estimated function at each point on our equispaced grid. The short-dashed lines represent the 25th and 75th percentiles. The method appears to do quite well; the median estimate deviates from the true function marginally for all values of the log-conditional variance. The interquartile range of the estimates is relatively narrow in the middle of the distribution, but increases dramatically for large or small volatility because there is less data.
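The summary lines plotted in Figure 6 are pointwise quantiles across replications. A minimal sketch of that summary step follows; the five "replications" are the linear mean function of experiment 1 shifted by hypothetical offsets that stand in for sampling noise, and the inclusive quantile method is an assumption about how the percentiles are computed.

```python
import statistics

def pointwise_bands(curves):
    """Return 25th, 50th, and 75th percentiles at each grid point.

    curves[r][g] is the estimate from replication r at grid point g.
    """
    n_grid = len(curves[0])
    q25, med, q75 = [], [], []
    for g in range(n_grid):
        column = [curve[g] for curve in curves]
        low, mid, high = statistics.quantiles(column, n=4, method="inclusive")
        q25.append(low)
        med.append(mid)
        q75.append(high)
    return q25, med, q75

# Equispaced grid of log-conditional variances (illustrative values).
grid = [-8.0, -7.0, -6.0, -5.0, -4.0]
# Hypothetical offsets standing in for estimation noise across replications.
offsets = [-0.02, -0.01, 0.0, 0.01, 0.02]
curves = [[-0.003 - 0.002 * g + off for g in grid] for off in offsets]

q25, med, q75 = pointwise_bands(curves)
```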

[FIGURE 6 OMITTED]

Experiment 2 uses the model estimated by the Fourier series and GED errors presented in the previous section to generate the data. The results for this experiment are shown in Figure 6(b). The kernel procedure unveils the nonlinear mean function well except for small log-conditional variance.

Experiment 3 is a GARCH-M model with normal errors and linear mean. This experiment is designed to check the robustness of our results to misspecification in the conditional variance process and the innovation density (the parametric components of our model). The results are shown in Figure 6(c). The kernel procedure discovers the linear mean very well; however, the confidence bands are very wide, reflecting the additional uncertainty caused by misspecification.

Finally, experiment 4 consists of a GARCH model with normal errors and mean function estimated with Fourier series. Once again, the mean function is well estimated where most of the data lie, but the uncertainty is large due to misspecification.

Table 5 presents the median and interquartile ranges for the estimated parameters over the 5,000 replications. Some of these parameters are difficult to interpret, because the estimated model is misspecified in experiments 3 and 4. For experiments 1 and 2, in which the model is correctly specified, the procedure estimates most parameters well. It has a tendency to underestimate [c.sub.1], the effect of past innovation on the log-conditional variance. Also, it does not distinguish well between the effect of [h.sub.t-1] and [h.sub.t-2] individually in experiment 2, although the overall persistence is well estimated. For the two misspecified models in which data are generated from a GARCH(1,1) model, the parameter values appear reasonable, suggesting a single lag of [h.sub.t-1] and no leverage effect. Moreover, the normality of the innovations is well discovered.

Overall, these results suggest that our kernel procedure performs well in uncovering possible nonlinearities in the data, and yet if the model were truly linear, the procedure would not mislead us. It is thus a useful tool for examining the shape of the risk premium.

6. CONCLUSIONS

We have found a highly nonlinear relationship between the first two moments of index returns as suggested by Backus and Gregory (1993) and Gennotte and Marsh (1993). In particular, the risk premium appears to be nonmonotonic and indeed hump-shaped. This result appears to be quite robust to the estimation method and the tuning parameters selected. However, the estimated risk premiums are subject to quite a bit of variability and are not uniformly significantly different from 0 at the 95% level. This and some instability over time must temper our interpretations to some degree.

Table 1. Raw Data by Subperiod

                    Full sample    1926:1-1961:12    1962:1-2001:12

Mean (x100)             .4987            .6686             .3457
Variance (x100)         .3031            .4132             .2043
Skewness               -.5037           -.4224            -.7422
Excess kurtosis        6.8015           3.5760            2.8460

NOTE: Descriptive statistics for monthly excess returns on the CRSP
value-weighted index for the entire sample (1926:1-2001:12) and two
subsamples (1926:1-1961:12) and (1962:1-2001:12). The skewness and
excess kurtosis are obtained after standardizing excess returns as
([r.sub.t] - [r.sup.f.sub.t] - [mu])/[sigma], where [mu] and [sigma] are
the mean and standard deviation of excess returns over the relevant
sample. Skewness is computed as the sample third moment of the
standardized excess returns; excess kurtosis, as the sample fourth
moment of the standardized excess returns minus 3. Excess kurtosis
would be 0 for a normal distribution.
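The moment definitions in the note to Table 1 can be written out as a short sketch. The use of the population (divide by T) standard deviation is an assumption, and the toy inputs are for illustration only.

```python
import math

def standardized_moments(x):
    """Skewness and excess kurtosis of x after standardizing by mean and sd."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)  # population sd (assumption)
    z = [(v - mu) / sigma for v in x]
    skew = sum(v ** 3 for v in z) / n
    excess_kurt = sum(v ** 4 for v in z) / n - 3.0
    return skew, excess_kurt

# A symmetric two-point series has zero skewness and excess kurtosis of -2.
skew_sym, kurt_sym = standardized_moments([-1.0, 1.0] * 10)
# A series with one large positive value is right-skewed.
skew_right, kurt_right = standardized_moments([0.0, 0.0, 0.0, 1.0])
```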

Table 2. Full Sample Estimates

[r.sub.t] - [r.sup.f.sub.t] = [[mu].sub.t] +
[[sigma].sub.t][[epsilon].sub.t]

[h.sub.t] = ln([[sigma].sup.2.sub.t]) = a + [b.sub.1][h.sub.t-1] +
[b.sub.2][h.sub.t-2] + [c.sub.1](|[[epsilon].sub.t-1]| -
E|[[epsilon].sub.t-1]| - d[[epsilon].sub.t-1])

Fourier: [[mu].sub.t] = [[gamma].sub.0] + [[gamma].sub.1]
[h.sup.*.sub.t] + [[psi].sub.1] sin ([h.sup.*.sub.t]) + [[phi].sub.1]
cos([h.sup.*.sub.t])

[[epsilon].sub.t] ~ GED(v) or N(0, 1)

                                 Kernel-GED         Kernel-QMLE

a                                   -.311              -.222
                                (.112) (.209)      (.089) (.196)
[b.sub.1]                            .780               .978
                                (.333) (.417)      (.097) (.432)
[b.sub.2]                            .169              -.015
                                (.325) (.402)      (.086) (.419)
[c.sub.1]                            .293               .260
                                (.091) (.088)      (.053) (.099)
d                                   -.142              -.080
                                (.122) (.222)      (.122) (.211)
v                                   1.444
                                (.078) (.121)
[[gamma].sub.0]
[[gamma].sub.1]
[[psi].sub.1]
[[phi].sub.1]
Bandwidth                            .9                 .9
  constant
Likelihood                         1,502.3            1,489.5
Linearity test
  [H.sub.0]:[[psi].sub.i] =
  [[phi].sub.i] = 0,
  i > 1 (p value)

                                 Fourier-GED       Fourier-QMLE

a                                   -.407              -.399
                                (.085) (.177)      (.136) (.163)
[b.sub.1]                            .452               .379
                                (.150) (.146)      (.181) (.116)
[b.sub.2]                            .484               .555
                                (.147) (.154)      (.179) (.117)
[c.sub.1]                            .241               .254
                                (.029) (.117)      (.039) (.053)
d                                   -.763              -.721
                                (.131) (.355)      (.190) (.175)
v                                   1.425
                                (.088) (.192)
[[gamma].sub.0]                     -.370              -.363
                                (.022) (.157)      (.131) (.102)
[[gamma].sub.1]                      .117               .115
                                (.006) (.054)      (.039) (.034)
[[psi].sub.1]                        .137               .133
                                (.009) (.050)      (.047) (.036)
[[phi].sub.1]                        .009              -.007
                                (.010) (.021)      (.014) (.024)
Bandwidth
  constant
Likelihood                         1,510.8            1,493.5
Linearity test                      7.369              8.956
  [H.sub.0]:[[psi].sub.i] =        (.025)             (.011)
  [[phi].sub.i] = 0,
  i > 1 (p value)

                                   EGARCH-M

a                                   -.340
                                    (.109)
[b.sub.1]                            .353
                                    (.138)
[b.sub.2]                            .593
                                    (.136)
[c.sub.1]                            .267
                                    (.048)
d                                   -.536
                                    (.182)
v                                   1.419
                                    (.844)
[[gamma].sub.0]                     -.003
                                    (.003)
[[gamma].sub.1]                      .002
                                     (0)
[[psi].sub.1]
[[phi].sub.1]
Bandwidth
  constant
Likelihood                         1,507.2
Linearity test
  [H.sub.0]:[[psi].sub.i] =
  [[phi].sub.i] = 0,
  i > 1 (p value)

NOTE: Empirical estimates of the semiparametric EGARCH model for
monthly excess returns on the CRSP value-weighted index for the
entire sample (1926-2001). The numbers in parentheses are analytical
and wild bootstrap standard errors. For the GED, the analytical
standard errors are from the outer product of gradient (OPG); for the
QMLE, the analytical standard errors are those of Bollerslev and
Wooldridge (1992).

Table 3. Subperiod Estimates

                         1926-1961          1962-2001

a                          -.135              -.754
                       (.088) (.216)      (.473) (.718)
[b.sub.1]                  1.137               .469
                       (.468) (.465)      (.334) (.488)
[b.sub.2]                   .159               .411
                       (.460) (.454)      (.333) (.467)
[c.sub.1]                   .223               .341
                       (.118) (.116)      (.137) (.108)
d                           .098              -.244
                       (.165) (.779)      (.229) (.476)
v                          1.444              1.487
                       (.125) (.177)      (.107) (.194)
Bandwidth constant          .7                2.5
Likelihood               680.7              829.9

NOTE: See Table 2 note.

Table 4. Results From Simulation Experiments: Data-Generating Processes

Experiment 1: Linear mean, EGARCH conditional variance, and GED errors

[[mu].sub.t] = -.003 - .002[h.sub.t]

[h.sub.t] = -.340 + .353[h.sub.t-1] + .593[h.sub.t-2] +
.267(|[[epsilon].sub.t-1]| - E|[[epsilon].sub.t-1]| - .536
[[epsilon].sub.t-1])

[[epsilon].sub.t] ~ GED(1.419)

Experiment 2: Fourier mean, EGARCH conditional variance, and GED errors

[[mu].sub.t] = -.370 + .117[h.sup.*.sub.t] + .137 sin
([h.sup.*.sub.t]) - .009 cos ([h.sup.*.sub.t])

[h.sub.t] = -.407 + .452[h.sub.t-1] + .484[h.sub.t-2] +
.241(|[[epsilon].sub.t-1]| - E|[[epsilon].sub.t-1]| - .763
[[epsilon].sub.t-1])

[[epsilon].sub.t] ~ GED(1.425)

Experiment 3: Linear mean, GARCH conditional variance, and normal
errors

[[mu].sub.t] = .013 + .001 [h.sub.t]

[[sigma].sup.2.sub.t] = 7.402 x [10.sup.-5] + .867
[[sigma].sup.2.sub.t-1] + .109[u.sup.2.sub.t-1]

[[epsilon].sub.t] ~ N(0, 1)

Experiment 4: Fourier mean, GARCH conditional variance, and normal
errors

[[mu].sub.t] = -.229 + .073[h.sup.*.sub.t] +
.082 sin ([h.sup.*.sub.t]) - .006 cos ([h.sup.*.sub.t])

[[sigma].sup.2.sub.t] = 7.163 x [10.sup.-5] +
.867[[sigma].sup.2.sub.t-1] + .117[u.sup.2.sub.t-1]

[[epsilon].sub.t] ~ N(0, 1)
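A minimal sketch of how one sample from the experiment 1 process could be generated is given below, using the coefficients exactly as printed in Table 4. Several choices are assumptions made here and are not stated in the table: the GED is taken in the unit-variance parameterization of Nelson (1991), draws are obtained by transforming a gamma variate, the recursion starts at the unconditional mean of [h.sub.t], and a burn-in of 200 observations is discarded. All function names are hypothetical.

```python
import math
import random

def ged_scale(v):
    """Scale lambda that gives the GED(v) unit variance."""
    return math.sqrt(2.0 ** (-2.0 / v) * math.gamma(1.0 / v) / math.gamma(3.0 / v))

def ged_abs_mean(v):
    """E|eps| for a unit-variance GED(v) variable."""
    return ged_scale(v) * 2.0 ** (1.0 / v) * math.gamma(2.0 / v) / math.gamma(1.0 / v)

def ged_draw(rng, v):
    """Draw from GED(v): 0.5 * |eps / lambda|**v follows a Gamma(1/v, 1) law."""
    w = rng.gammavariate(1.0 / v, 1.0)
    magnitude = ged_scale(v) * (2.0 * w) ** (1.0 / v)
    return magnitude if rng.random() < 0.5 else -magnitude

def simulate_experiment_1(T=500, seed=0, burn_in=200):
    """Simulate returns y and log-variances h from the experiment 1 process."""
    rng = random.Random(seed)
    a, b1, b2, c1, v = -0.340, 0.353, 0.593, 0.267, 1.419
    e_abs = ged_abs_mean(v)
    h_lag1 = h_lag2 = a / (1.0 - b1 - b2)  # start at the unconditional mean
    eps_lag = 0.0
    y, h_path = [], []
    for t in range(T + burn_in):
        # Innovation term as printed in Table 4: |eps| - E|eps| - .536 eps.
        shock = abs(eps_lag) - e_abs - 0.536 * eps_lag
        h = a + b1 * h_lag1 + b2 * h_lag2 + c1 * shock
        eps = ged_draw(rng, v)
        mu = -0.003 - 0.002 * h  # linear mean as printed in Table 4
        if t >= burn_in:
            y.append(mu + math.exp(h / 2.0) * eps)
            h_path.append(h)
        h_lag2, h_lag1, eps_lag = h_lag1, h, eps
    return y, h_path

y, h_path = simulate_experiment_1()
```

Repeating this with different seeds and applying the estimator to each sample would give the replications summarized in Table 5.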

Table 5. Results from Simulation Experiments: Median
Estimated Parameters

                   Experiment 1            Experiment 2

a                      -.340                   -.382
                  (-.489, -.267)          (-.598, -.242)
[b.sub.1]               .375                    .715
                   (.352, .766)            (.415, 1.155)
[b.sub.2]               .545                    .190
                   (.148, .593)            (-.219, .494)
[c.sub.1]               .216                    .136
                   (.129, .267)            (.081, .199)
d                      -.536                   -.838
                  (-.823, -.444)          (-1.335,-.531)
v                      1.419                   1.411
                  (1.346, 1.460)          (1.325, 1.506)

                   Experiment 3            Experiment 4

a                      -.273                   -.181
                  (-.417, -.146)          (-.326, -.086)
[b.sub.1]               .890                    .978
                   (.423, 1.351)           (.596, 1.391)
[b.sub.2]               .045                   -.027
                   (-.402, .498)           (-.429, .348)
[c.sub.1]               .199                    .196
                   (.122, .267)            (.120, .237)
d                      -.029                   -.010
                   (-.211, .123)           (-.160, .137)
v                      1.969                   1.981
                  (1.813, 2.136)          (1.833, 2.143)

NOTE: Entries are the median of the estimated parameters over the
5,000 replications. The entries in parentheses are the 25th and
75th percentiles over the 5,000 replications.

ACKNOWLEDGMENTS

We thank Adrian Pagan; participants at the 1999 EC^2 conference in Madrid and at seminars at Montreal, Queen's University, and University of California, Santa Barbara; an associate editor; and two anonymous referees for comments and discussion. Perron acknowledges financial assistance from the Fonds pour la Formation des chercheurs et l'aide a la recherche (FCAR) and the Mathematics of Information Technology and Complex Systems (MITACS) network. Linton thanks the ESRC and STICERD for financial support.

[Received September 2000. Revised December 2002.]

REFERENCES

Abel, A. B. (1987), "Stock Prices Under Time-Varying Dividend Risk: An Exact Solution in an Infinite-Horizon General Equilibrium Model," Journal of Monetary Economics, 22, 375-393.

--(1999), "Risk Premia and Term Premia in General Equilibrium," Journal of Monetary Economics, 43, 3-33.

Audrino, F., and Buhlmann, P. (2001), "Tree-Structured GARCH Models," Journal of the Royal Statistical Society, 63, 727-744.

Backus, D. K., and Gregory, A. W. (1993), "Theoretical Relations Between Risk Premiums and Conditional Variances," Journal of Business and Economic Statistics, 11, 177-185.

Backus, D. K., Gregory, A. W., and Zin, S. E. (1989), "Risk Premiums in the Term Structure: Evidence from Artificial Economies," Journal of Monetary Economics, 24, 371-399.

Bollerslev, T., Chou, R. Y., and Kroner, K. F. (1992), "ARCH Modelling in Finance," Journal of Econometrics, 52, 5-59.

Bollerslev, T., Engle, R. F., and Nelson, D. B. (1994), "ARCH Models," in Handbook of Econometrics, Vol. IV, eds. R. F. Engle and D. L. McFadden, Amsterdam: Elsevier Science, pp. 2959-3038.

Bollerslev, T., and Wooldridge, J. M. (1992), "Quasi-Maximum Likelihood Estimation and Inference in Dynamic Models With Time-Varying Covariances," Econometric Reviews, 11, 143-172.

Boudoukh, J., Richardson, M., and Whitelaw, R. F. (1997), "Nonlinearities in the Relation Between the Equity Premium and the Term Structure," Management Science, 43, 371-385.

Braun, P. A., Nelson, D. B., and Sunier, A. M. (1995), "Good News, Bad News, Volatility and Betas," Journal of Finance, 50, 1575-1604.

Breiman, L., and Friedman, J. H. (1985), "Estimating Optimal Transformations for Multiple Regression and Correlation," Journal of the American Statistical Association, 80, 580-598.

Carrasco, M., and Chen, X. (2002), "Mixing and Moment Properties of Various GARCH and Stochastic Volatility Models," Econometric Theory, 18, 17-39.

Cox, J., Ingersoll, J., and Ross, S. (1985), "An Intertemporal General Equilibrium Model of Asset Prices," Econometrica, 53, 363-384.

Davidson, R., and Flachaire, E. (2001), "The Wild Bootstrap, Tamed at Last," unpublished manuscript, Queen's University at Kingston.

Dominitz, J., and Sherman, R. (2001), "Convergence Theory for Stochastic Iterative Procedures With an Application to Semiparametric Estimation," unpublished manuscript, California Institute of Technology.

Engle, R. F., and Gonzalez-Rivera, G. (1991), "Semiparametric ARCH Models," Journal of Business and Economic Statistics, 9, 345-359.

Engle, R. F., Lilien, D. M., and Robins, R. P. (1987), "Estimating Time Varying Risk Premia in the Term Structure: The ARCH-M Model," Econometrica, 55, 391-407.

French, K. R., Schwert, G. W., and Stambaugh, R. F. (1987), "Expected Stock Returns and Volatility," Journal of Financial Economics, 19, 3-29.

Gallant, A. R. (1981), "On the Bias in Flexible Functional Forms and an Essentially Unbiased Form: The Fourier Flexible Form," Journal of Econometrics, 15, 211-245.

Gennotte, G., and Marsh, T. (1993), "Valuations in Economic Uncertainty and Risk Premiums on Capital Assets," European Economic Review, 37, 1021-1041.

Glosten, L. R., Jagannathan, R., and Runkle, D. E. (1993), "On the Relation Between the Expected Value and the Volatility of the Nominal Excess Returns on Stocks," Journal of Finance, 48, 1779-1801.

Gonzalez-Rivera, G., and Drost, F. C. (1999), "Efficiency Comparisons of Maximum-Likelihood Based Estimators in GARCH Models," Journal of Econometrics, 93, 93-111.

Hamilton, J. D. (1994), Time Series Analysis, Princeton, NJ: Princeton University Press.

Hardle, W. (1990), Applied Nonparametric Regression, Cambridge, U.K.: Cambridge University Press.

Hardle, W., and Linton, O. (1994), "Applied Nonparametric Methods," in Handbook of Econometrics, Vol. IV, eds. R. F. Engle and D. L. McFadden, Amsterdam: Elsevier, pp. 2295-2339.

Harvey, C. (2001), "The Specification of Conditional Expectations," Journal of Empirical Finance, 8, 573-638.

Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, New York: Chapman and Hall.

Lee, S., and Hansen, B. (1994), "Asymptotic Theory for the GARCH(1,1) Quasi-Maximum Likelihood Estimator," Econometric Theory, 10, 29-52.

Lintner, J. (1965), "The Valuation of Risky Assets and the Selection of Risky Investment in Stock Portfolios and Capital Budgets," Review of Economics and Statistics, 47, 13-37.

Lumsdaine, R. L. (1996), "Consistency and Asymptotic Normality of the Quasi-Maximum Likelihood Estimator in IGARCH(1,1) and Covariance Stationary GARCH(1,1) Models," Econometrica, 64, 575-596.

Mammen, E., Linton, O., and Nielsen, J. P. (1999), "The Existence and Asymptotic Properties of a Backfitting Projection Algorithm Under Weak Conditions," The Annals of Statistics, 27, 1443-1490.

Merton, R. C. (1973), "An Intertemporal Capital Asset Pricing Model," Econometrica, 41, 867-887.

Nelson, D. B. (1990), "Stationarity and Persistence in the GARCH(1,1) Model," Econometric Theory, 6, 318-334.

--(1991), "Conditional Heteroscedasticity in Asset Returns: A New Approach," Econometrica, 59, 347-370.

Newey, W. K., and Steigerwald, D. G. (1997), "Asymptotic Bias for Quasi-Maximum Likelihood Estimators in Conditional Heteroskedasticity Models," Econometrica, 65, 587-599.

Opsomer, J. D., and Ruppert, D. (1997), "Fitting a Bivariate Additive Model by Local Polynomial Regression," The Annals of Statistics, 25, 186-211.

Pagan, A. R., and Hong, Y. S. (1990), "Non-Parametric Estimation and the Risk Premium," in Nonparametric and Semiparametric Methods in Econometrics and Statistics: Proceedings of the Fifth International Symposium in Economic Theory and Econometrics, eds. W. A. Barnett, J. Powell, and G. Tauchen, Cambridge, U.K.: Cambridge University Press, pp. 51-75.

Pagan, A. R., and Ullah, A. (1999), Nonparametric Econometrics, Cambridge, U.K.: Cambridge University Press.

Perron, B. (1998), "A Monte Carlo Comparison of Non-Parametric Estimators of the Conditional Variance," unpublished manuscript, University of Montreal.

--(in press), "Semi-Parametric Weak Instrument Regressions With an Application to the Risk-Return Trade-Off," Review of Economics and Statistics, 85, 424-443.

Powell, J. (1994), "Estimation of Semiparametric Models," in Handbook of Econometrics, Vol. IV, eds. R. F. Engle and D. L. McFadden, Amsterdam: Elsevier, pp. 2443-2521.

Robinson, P. M. (1983), "Nonparametric Estimators for Time Series," Journal of Time Series Analysis, 4, 185-207.

--(1988), "Root-N-Consistent Semiparametric Regression," Econometrica, 56, 931-954.

Sharpe, W. (1964), "Capital Asset Prices: A Theory of Market Equilibrium Under Conditions of Risk," Journal of Finance, 19, 567-575.

Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, New York: Chapman and Hall.

Tadikamalla, P. R. (1980), "Random Sampling From the Exponential Power Distribution," Journal of the American Statistical Association, 75, 683-686.

Veronesi, P. (2001), "How Does Information Quality Affect Stock Returns?" Journal of Finance, 55, 807-837.

Weiss, A. (1986), "Asymptotic Theory for ARCH Models: Estimation and Testing," Econometric Theory, 2, 107-131.

Whitelaw, R. F. (2000), "Stock Market Risk and Return: An Equilibrium Approach," Review of Financial Studies, 13, 521-547.

Wooldridge, J. M. (1994), "Estimation and Inference for Dependent Processes," in Handbook of Econometrics, Vol. IV, eds. R. F. Engle and D. L. McFadden, Amsterdam: Elsevier, pp. 2659-2738.

Oliver LINTON

Department of Economics, London School of Economics, Houghton Street, London WC2A 2AE, U.K. (lintono@lse.ac.uk)

Benoit PERRON

Departement de Sciences Economiques, CIREQ, CIRANO, Universite de Montreal, C.P. 6128, Succursale Centre-ville, Montreal, Quebec H3C 3J7, Canada (benoit.perron@umontreal.ca)
