The single index model is a generalization of the linear regression model with E(y|x) = g(x'[beta]), where g is an unknown function. The model provides a flexible alternative to the linear regression model while providing more structure than a fully nonparametric approach. Although the fitting
KEY WORDS: Autocorrelation; Heteroscedasticity; Local polynomial estimation; Model selection.
**********
1. INTRODUCTION
In recent years the classic linear regression model,
[y.sub.i] = [x'.sub.i][beta] + [[epsilon].sub.i] (1.1)
with E([[epsilon].sub.i]|[x.sub.i]) = 0, has been extended and generalized in various ways. One simple, yet extremely useful, generalization is the single index model, where (1.1) is replaced by
[y.sub.i] = g([x'.sub.i][beta]) + [[epsilon].sub.i], (1.2)
where g is an unknown function, often taken to be smooth. The single index model is more flexible than the linear model, because it allows for nonlinear relationships between the index variable x'[beta] and the target variable y while avoiding many of the drawbacks of a fully nonparametric approach. Such drawbacks include decreasing precision with increasing dimension of x (the so-called curse of dimensionality), difficulty in representing relationships graphically, and the inability to make predictions at points x outside the support of the observed [x.sub.i] values. This model and its variants have been applied successfully to data in fields including economics and medicine; see, e.g., Carroll, Fan, Gijbels, and Wand (1997) and Horowitz (1998).
Model (1.2) can be fit by using a two-step procedure. First, [beta] is estimated, and the linear index variable z = X[^.[beta]] is formed, where X is the matrix of predictor values. Then g is estimated using nonparametric regression as a smooth function based on {[z.sub.i], [y.sub.i]}, where [z.sub.i] = [x'.sub.i][^.[beta]], i = 1,..., n. We focus on two possible estimates of [beta]. Brillinger (1983) showed that if predictors are roughly Gaussian, the ordinary least squares (OLS) estimate based on (1.1) is a [square root of n]-consistent estimate of [beta] in (1.2) up to a constant of proportionality. Note that an intercept term [[beta].sub.0] is not identifiable in this context, as it can be absorbed into the function g. (The OLS estimate is based on centered predictors and includes an intercept, but the linear index does not use the estimated intercept.) A second approach is sliced inverse regression (SIR), proposed by Duan and Li (1991). This estimate is easy to compute and is [square root of n]-consistent (up to a proportionality constant) and asymptotically normally distributed. We do not consider here more complicated semiparametric estimators that have been proposed for this problem, such as those of Powell, Stock, and Stoker (1989) and Ichimura (1993), although the tests derived here would be equally applicable to those estimators.
Given z, any nonparametric regression smoother can be used to estimate g, yielding fitted values [^.y.sub.i] = [^.g]([z.sub.i]). Simonoff (1996, chap. 5) discussed various approaches to this problem. We focus here on local polynomial estimation, where a pthorder local polynomial estimator [^.g](z) is the constant term [^.[gamma].sub.0] of the minimizer of
[n.summation over (i=1)][[y.sub.i] - [[gamma].sub.0] -...- [[gamma].sub.p]([z.sub.i] - z)[.sup.p]][.sup.2]K([z.sub.i] - z]/h), (1.3)
where K is the kernel function, generally taken to be a symmetric probability density function with finite second derivative. In all simulations and examples treated here a local quadratic (p = 2) estimate is used. We also need an estimate of the derivative of g, [dot.g], which is just the slope term [^.[gamma].sub.1] of the minimizer of (1.3). The smoothness of [^.g] is controlled by the smoothing parameter h. Hurvich, Simonoff, and Tsai (1998) proposed using a corrected Akaike information criterion (AIC) to choose the smoothing parameter, where h is chosen to minimize
AI[C.sub.C] = log[^.[sigma].sup.2] + [1 + tr(H)/n]/[1 - (tr(H) + 2)/n] = log[^.[sigma].sup.2] + 1 + [2(tr(H) + 1)]/[n - tr(H) - 2],
where [^.[sigma].sup.2] = [summation][^.e.sub.i.sup.2]/n (where [^.e.sub.i] = [y.sub.i] - [^.y.sub.i] is the ith element of the residual vector), H satisfies [^.y] = Hy and depends on z and depends on y only through [^.[beta]], and they showed that this choice is asymptotically optimal for minimizing the integrated squared error of [^.g]. Note that although in theory a different smoothing parameter should be used for estimating g and [dot.g], in the tests constructed here we take the simple strategy of using the same smoothing parameter (the one derived to estimate g) for both estimates. This issue is discussed more fully in Section 3.
Although the fitting of single index models does not require distributional assumptions on [epsilon], the properties of the estimates do depend on such assumptions. Hardle, Hall, and Ichimura (1993) derived an optimal smoothing parameter for model (1.2) and showed that the efficiency of [^.[beta]] depends on the distribution of [epsilon] (focusing on an assumption of constant variance of [[epsilon].sub.i]). Similarly, a modified estimator is appropriate if the errors exhibit autocorrelation. In addition, the existence of violations of model (1.2) affects practical application of the model in, for example, the construction of prediction intervals.
In this article we derive score tests to test for the presence of violations of model (1.2) with [epsilon] [approximately] N (0, [[sigma].sup.2]I). The principle here is that using the score test can be viewed as a unifying principle in identifying such violations, which can then be addressed using methods adapted to those violations. Score tests have the advantage of requiring fitting only under the null model, thus avoiding the fitting of models that incorporate those violations unless necessary. In addition, in one of the scenarios examined here, a likelihood-ratio-type alternative test is not useful, leaving the score test without apparent competition appropriate for this model.
In the next section a generalized version of (1.2) is formulated; it has as special cases heteroscedasticity, autocorrelation, and additional covariates. Score tests are then derived for each possible violation. Section 3 summarizes the results of Monte Carlo simulations investigating the usefulness of the tests, and several real datasets are analyzed in Section 4. Discussion and recommendations for further work conclude the article.
2. DERIVATION OF SCORE TESTS
In all that follows, the notation g(b), where b is a vector, is used to represent the vector with ith entry g([b.sub.i]). Consider the generalization of (1.2)
y = g([X.sub.1][[beta].sub.1] + [X.sub.2][[beta].sub.2]) + [[summation].sup.1/2][epsilon], [epsilon] [approximately] N (0, [[sigma].sup.2]I), (2.1)
where [X.sub.1] and [X.sub.2] are an n X [p.sub.1] and n X [p.sub.2] matrices of predictor values, respectively, [[beta].sub.1] and [[beta].sub.2] are [p.sub.1] X 1 and [p.sub.2] X 1 vectors of predictor coefficients, respectively, and
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
with |p| < 1, [w.sub.i] = w([u.sub.i], [delta]), [u'.sub.i] the ith row of an n X k matrix U, and [delta] is a q X 1 vector. Define [[delta].sub.0] so that w(u, [[delta].sub.0]) = 1. As shown below, setting parameters to specified values results in models consistent with heteroscedasticity, first-order autoregressive autocorrelation, and additional covariates.
The general form of the score test [see, for example, Cox and Hinkley (1974, p. 324)] follows a structure similar to that for nonlinear regression models (Seber and Wild 1989, pp. 228-231). Let L([theta]) be the log-likelihood function of the observed value of y, where the m X 1 vector [theta] is partitioned into [theta] = ([[theta]'.sub.1], [[theta]'.sub.2])', with [[theta].sub.1] being [m.sub.1] X 1 and [[theta].sub.2] being [m.sub.2] X 1. Let
[U.sub.j]([theta]) = [[partial derivative]L([theta])]/[[partial derivative][[theta].sub.j]], j = 1, 2,
and
[I.sub.ij]([theta]) = E[-[[[partial derivative].sup.2]L([theta])]/[[partial derivative][[theta].sub.i][partial derivative][[theta].sub.j]]], i = 1, 2, j = 1, 2.
Note that [I.sub.21]([theta]) = [I.sub.12]([theta])'. The score test statistic tests the hypotheses
[H.sub.0] : [[theta].sub.2] = [[theta].sub.20] (2.2a)
versus
[H.sub.a] : [[theta].sub.2] [not equal to] [[theta].sub.20] (2.2b)
and is then defined as
S = [[U.sub.2] - [I.sub.21][I.sub.11.sup.-1][U.sub.1]]'[[I.sub.22] - [I.sub.21][I.sub.11.sup.-1][I.sub.12]][.sup.-1][[U.sub.2] - [I.sub.21][I.sub.11.sup.-1][U.sub.1]],
where [U.sub.j] and [I.sub.ij] are evaluated at [[theta].sub.1] = [^.[theta].sub.1]([[theta].sub.20]) (a [square root of n]-consistent estimator of [[theta].sub.1] given [[theta].sub.2] = [[theta].sub.20]) and [[theta].sub.2] = [[theta].sub.20].
In certain circumstances, the asymptotic distribution of the score test can be shown to be [chi square] on [m.sub.2] degrees of freedom under the null hypothesis. For example, in the nonlinear regression situation where g is known, this is the case (Seber and Wild 1989, p. 198; Tsai 1986). Similar results are sometimes available in the current context and are noted below. In any event, we appeal to this distribution for S as a benchmark here, and we use Monte Carlo simulations in the next section to investigate this approximation.
2.1 Test for Heteroscedasticity in Errors
Consider model (2.1) with [rho] = 0, i.e., [summation] = diag([w.sub.i]). This model incorporates possible heteroscedasticity through [w.sub.i] = w([u.sub.i], [delta]), because var([[epsilon].sub.i]) = [w.sub.i][[sigma].sup.2]. The values [u.sub.i] represent variables that are predictive for var([[epsilon].sub.i]), which can, but need not, include variables from X. To construct the test, a specific functional form for w is required; a typical choice is an exponential function
w([u.sub.i], [delta]) = exp([u'.sub.i][delta]), (2.3)
implying that [delta] = [[delta].sub.0] = 0 corresponds to constant variance [see Simonoff and Tsai (1994) and the references therein]. The score test is based on hypotheses (2.2) with [[theta].sub.1] = ([beta]', [[sigma].sup.2])' and [[theta].sub.2] = [delta] and equals
[S.sub.H] = [1/2]c'[bar.D]([bar.D'][bar.D])[.sup.-1][bar.D']c, (2.4)
where c is the n X 1 vector with ith entry [c.sub.i] = [^.e.sub.i.sup.2]/[^.[sigma].sup.2], [bar.D] = (I - 11'/n)D, 1 is a vector of 1's, [^.e] = y - [^.g](X[^.[beta]]) is the vector of residuals, [^.[sigma].sup.2] = [^.e]'[^.e]/n, and D is the n X q matrix [partial derivative]w([u.sub.i], [delta])/[partial derivative][delta] evaluated at [delta] = [[delta].sub.0]. This is one-half the regression sum of squares of the regression of c on D and is thus easy to calculate. Koenker (1981) showed that the score test for nonconstant variance in the linear regression model can be made more robust through studentizing, where [^.[sigma].sup.2] is replaced with [[[summation].sub.i]([^.e.sub.i.sup.2] - [^.[sigma].sup.2])[.sup.2]/(2n)][.sup.1/2]. Simonoff and Tsai (1994) applied this same idea to the score test for heteroscedasticity based on the modified profile likelihood.
Eubank and Thomas (1993) investigated this score test for heteroscedasticity in the context of the nonparametric regression model,
[y.sub.i] = g([x.sub.i]) + [[epsilon].sub.i], [[epsilon].sub.i] [approximately] N (0, [[sigma].sub.i.sup.2]), (2.5)
estimating g using a cubic smoothing spline. Specifically, they proposed the test statistic
T = [[[summation].sub.i=1.sup.n][A.sub.i]([^.e.sub.i.sup.2]/[^.[sigma].sup.2]) - 1][.sup.2]]/[2[[summation].sub.i=1.sup.n][A.sub.i.sup.2]]
to test for nonconstant variance of the form [[sigma].sub.i.sup.2] = [[sigma].sup.2] + [n.sup.-1/2][d.sub.i] (putting appropriate boundedness conditions on the {[A.sub.i]} and {[d.sub.i]}). They then showed that for a wide range of choices of the smoothing parameter (including ones that lead to optimal convergence rates), T is asymptotically noncentral [[chi square].sub.1] with noncentrality parameter depending on {[A.sub.i]} and {[d.sub.i]}.
This result can be adapted to the single index model when there is one variance predictor. Because [beta] is [square root of n]-consistent, the linear index values [z.sub.i] can be treated as the predictor values [x.sub.i] in (2.5), implying an asymptotic [[chi square].sub.1] distribution under the null hypothesis, and power to detect [square root of n] departures from constant variance. The theorem requires the local quadratic smoothing parameter to be o([n.sup.-1/12]), which is satisfied by the smoothing parameter that minimizes the (mean) integrated squared error (which has a rate of O([n.sup.-1/9]) in the interior), although technically it does not allow for data-dependent choices of h.
2.2 Test for First-Order Autoregressive Errors
Consider model (2.1) with [delta] = [[delta].sub.0]. This model incorporates first-order autoregressive [AR(1)] autocorrelation into the errors if [rho] [not equal to] 0. The score test statistic is based on hypotheses (2.2) with [[theta].sub.1] = ([beta]', [[sigma].sup.2])' and [[theta].sub.2] = [rho] and equals
[S.sub.A] = [[n.summation over (i=2)] [^.e.sub.i][^.e.sub.i-1]/[^.[sigma].sup.2]][.sup.2]/(n - 1). (2.6)
Defining an estimate of [rho] to be [^.[rho]] = [[summation].sub.i=2.sup.n][^.e.sub.i][^.e.sub.i-1]/[[summation].sub.i=1.sup.n][^.e.sub.i.sup.2], the score test statistic is effectively [S.sub.A] = n[^.p.sup.2].
The statistic [S.sub.A] can be written in the form
[S.sub.A] = [y' My - tr(A)]/[2 tr([A.sup.2])],
where M = (I - H)A(I - H) and A is the n X n matrix with 1's in the main off-diagonal and 0's elsewhere. Eubank and Thomas (1993, p. 154) showed that statistics of this form are asymptotically [[chi square].sub.1] under the null hypothesis in the nonparametric regression context, and this result again carries over to the single index model, justifying the use of [[chi square].sub.1] critical values here.
2.3 Test for the Need for Additional Covariates
Consider model (2.1) with [rho] = 0 and [delta] = [[delta].sub.0]. Given [[beta].sub.1] [not equal to] 0, this model allows for the possibility of additional useful covariates if [[beta].sub.2] [not equal to] 0. The score test statistic is based on hypotheses (2.2) with [[theta].sub.1] = [[beta].sub.1] and [[theta].sub.2] = [[beta].sub.2] and equals
[S.sub.C] = [[^.e'.sub.1][~.X'.sub.2][[~.X'.sub.2](I - [~.H.sub.1])[~.X.sub.2]][.sup.-1][~.X'.sub.2][^.e.sub.1]]/[^.[sigma].sub.1.sup.2], (2.7)
where
[~.X.sub.2] = Diag[[??]([X.sub.1][^.[beta].sub.1])][X.sub.2.sup.c],
[~.H.sub.1] = [~.X.sub.1]([~.X'.sub.1][~.X.sub.1])[.sup.-1][~.X.sub.1],
[~.X.sub.1] = Diag[[??]([X.sub.1][^.[beta].sub.1])][X.sub.1.sup.c],
[^.[sigma].sub.1.sup.2] = [^.e'.sub.1][^.e.sub.1]/n,
[^.e.sub.1] = y - [^.g]([X.sub.1][^.[beta].sub.1]),
[??]([X.sub.1][^.[beta].sub.1]) and [^.g]([X.sub.1][^.[beta].sub.1]) are the estimates of [dot.g] and g evaluated at the observations, respectively, based on using only [X.sub.1] as predictors in the model, and [X.sub.1.sup.c] and [X.sub.2.sup.c] are mean-centered versions of [X.sub.1] and [X.sub.2], respectively. Centered versions of the predictors must be used for this test because, as noted above, the intercept [[beta].sub.0] is not identifiable in model (1.2). Similarly, because [^.[beta]] is determined only up to a constant, the values of [??](X[^.[beta]]) must be standardized to have unit mean (as they would have if g were linear). Evans and Savin (1982) proposed an adjustment to the score test for extra covariates based on degrees of freedom in the linear regression model that can also be applied here. Unfortunately, there is no asymptotic theory for [S.sub.C] in the single index situation, but we will continue to use [chi square]-based critical values as a benchmark.
3. MONTE CARLO SIMULATIONS
In this section we examine the properties of the tests using Monte Carlo simulations. We examine both the size and the power properties of the tests, and how they relate to sample size, the strength of the relationship between y and x, the nature of g, and the type of [^.[beta]] used. All tests have the same level [alpha] = .05 and are examined using sample sizes n = 30, 50. Tests for heteroscedasticity are based on the model y = g([[beta].sub.1][x.sub.1] + [[beta].sub.2][x.sub.2] + [[beta].sub.3][x.sub.3]) + [epsilon] X exp(z[delta]/2) with [[beta].sub.1] = [[beta].sub.2] = [[beta].sub.3] = 1, [epsilon] [approximately] N(0, [[sigma].sup.2]), z standard normal, and [delta] taking the values (-2, -1, -.5, 0, .5, 1, 2). Tests for autocorrelation are based on the model y = g([[beta].sub.1][x.sub.1] + [[beta].sub.2][x.sub.2] + [[beta].sub.3][x.sub.3]) + [epsilon], where [[beta].sub.1] = [[beta].sub.2] = [[beta].sub.3] = 1, [epsilon] follows an AR(1) process, and [rho] [member of] (-.9, -.5, -.1, 0, .1, .5, .9). Tests for an additional covariate are based on the true model y = g([[beta].sub.1][x.sub.1] + [[beta].sub.2][x.sub.2] + [[beta].sub.3][x.sub.3]) + [epsilon], where [[beta].sub.1] = [[beta].sub.2] = 1 and [[beta].sub.3] takes the values (0, 1, 5, 9), [[beta].sub.3] = 0 corresponding to the null hypothesis. For each situation [x.sub.1], [x.sub.2], and [x.sub.3] are generated once as uniform on (0, 1), g is linear, negative exponential, or a sine function, and [sigma] equals the range of g([[beta].sub.1][x.sub.1] + [[beta].sub.2][x.sub.2] + [[beta].sub.3][x.sub.3]) or the range of g([[beta].sub.1][x.sub.1] + [[beta].sub.2][x.sub.2] + [[beta].sub.3][x.sub.3]) divided by 10, respectively [that is, the signal-to-noise ratio (SNR) equals 1 or 10, respectively]. Each setting is simulated with 1,000 replications, resulting in a maximum standard error of the estimates of power of .016 (estimates of size, which are around .05, have standard errors roughly half as large).
Either OLS or SIR was used to estimate [beta] for each simulation replication, followed by a local quadratic estimate of g using AI[C.sub.C] to choose the smoothing parameter in each case. The tests based on ordinary least squares (OLS) and SIR performed similarly, with no clear advantage for one method over the other. Given that the OLS estimates are much more familiar, and simpler to calculate, all the simulation results reported here are based on using OLS.
Figures 1-3 summarize results for the tests for heteroscedasticity, autocorrelation, and an extra covariate, respectively. Each figure consists of six plots giving simulated power and size values, separated by the form of g and the SNR (the solid lines refer to the score test). In each plot, the lines represent linear interpolations between the actual values obtained in the simulations, those showing higher power corresponding to sample size n = 50 and those with lower power corresponding to n = 30. The horizontal line in each plot represents the nominal size [alpha] = .05.
Figure 1 summarizes results for testing heteroscedasticity. In addition to the score test, two other easily calculated tests were examined. Levene's test (dashed lines) is the F-test for significance in a regression of |[^.e]| on the variance predictor(s), and Harvey's test is the regression sum of squares of a regression of log[([^.e])[.sup.2]] on the variance predictor(s) divided by 4.9348 and is compared to a [chi square] distribution. The latter test consistently exhibited much lower power than the score and Levene's tests, so its values are not given in the plots.
The tests generally hold their size well, with estimated size less than .06 in virtually all cases, despite the small samples used here. As would be expected, the power of the tests increases as the heteroscedasticity increases (|[delta]| increasing), with positive and negative [delta] generally equally easy to identify. Whereas the score test and Levene's test are close in power, the score test is consistently the more powerful test. The strength of the regression relationship and the nature of g seem to have little effect on the test, with the exception of a sine relationship and high SNR, where power is lower. Power values otherwise exceed .9 (n = 30) or .95 (n = 50) for |[delta]| = 2.
The properties of the score test for autocorrelation (Figure 2) are broadly similar to those for heteroscedasticity. The plots also give values for an application of the Durbin--Watson statistic (dashed lines) to the observed residuals,
d = [[[summation].sub.i=2.sup.n]([^.e.sub.i] - [^.e.sub.i-1])[.sup.2]]/[[[summation].sub.i=1.sup.n][^.e.sub.i.sup.2]].
[FIGURE 1 OMITTED]
[FIGURE 2 OMITTED]
Significance of the statistic is based on an asymptotic normal limit, or equivalently a [[chi square].sub.1] approximation for n[(d/2 - 1)[.sup.2]. The tests hold the .05 size well, with all estimated sizes less than .06. The underlying form of g does not have a large effect, except for the sine function with high SNR, where power is noticeably lower. This is presumed to relate to the difficulties of estimating functions with more structure (as in the heteroscedasticity test) combined with the difficulties of AI[C.sub.C] in differentiating between a regression function with some curvature and some autocorrelation (the actual situation) and one with even more curvature and little autocorrelation (which would erroneously lead to nonrejection of the null hypothesis). The score test and Durbin-Watson test have similar properties; the score test is more powerful for negative autocorrelation and the Durbin-Watson test is more powerful for positive autocorrelation. This pattern is easy to understand, because
n([d/2] - 1)[.sup.2] = n([^.[rho] + [[^.e.sub.1.sup.2] + [^.e.sub.n.sup.2]]/[[[summation].sub.i=1.sup.n] [^.e.sub.i.sup.2]])[.sup.2] [equivalent to] n[~.[rho].sup.2].
Clearly [~.[rho]] [greater than or equal to] [^.[rho]], implying a (typically) smaller [chi square] statistic (and lower power) when [^.[rho]] < 0 and larger [chi square] statistic (and higher power) when [^.[rho]] > 0.
Figure 3 presents results for the score test for an additional covariate, along with ones based on the ordinary t test (dashed lines). The degrees-of-freedom adjustment of Evans and Savin (1982) has little effect in the situations studied here, so its results are not presented. The tests are slightly, but consistently, anticonservative, with sizes typically between .06 and .07. The power properties of the score test are considerably more complex than those of the heteroscedasticity and autocorrelation score tests, for several reasons. Recall that the score test is derived assuming that the null hypothesis is true, which means that the model is misspecified under the alternative hypothesis. This is only a second-order effect in the presence of heteroscedasticity or autocorrelation (in the sense that the mean function is still correctly specified), which means that [^.g]([x.sub.i]/[beta]) is still a consistent estimator of E([y.sub.i]). This is not the case when testing for the need for an extra covariate, because then the estimated linear index is not a consistent estimate of the true linear index. Thus, [^.g] is not a consistent estimate of the true g, and [??] is not a consistent estimate of [dot.g].
This accounts for several of the patterns in Figure 3. The score test is best behaved for linear g, but the power decreases with increasing [[beta].sub.3] when SNR equals 10 (the same is true for the exponential function). The problem here is in estimation of [dot.g]. Recall that the same smoothing parameter is used to estimate both g and [dot.g], although theory shows that they should be different. This leads to degradation of the test. If the true values of [dot.g] are used in the simulations (this can be done because [dot.g] is known), the power functions for linear g monotonically increase to 1 with increasing [[beta].sub.3] as would be expected, which confirms that the problem is in estimation of [dot.g].
[FIGURE 3 OMITTED]
The poor performance of the test for exponential g and SNR = 1 arises because E([y.sub.i]) is close to zero and with the added noise exhibits little structure. In this case including the predictor [x.sub.3] adds very little to the fit. The structure of [x.sub.3] used here results in a similar pattern for the sine function when [[beta].sub.3] = 9.
On the other hand, the flexibility of the single index model means that the estimate [^.g] based on the misspecified (null) model will often fit better than the true g based on the null model, which can translate into improved ability to identify structure related to [x.sub.3]. The score test based on the true g (results not presented here) is sometimes more powerful than that based on the single index model (reflecting the benefit of not having to estimate g), and sometimes it is less powerful (reflecting better fit of the nonparametrically estimated g).
The t test performs reasonably well in many of the situations, because the partial relationships between y and [x.sub.3] given [x.sub.1] and [x.sub.2] for these functions have enough of a linear component to be identified. It is not at all difficult, however, to construct situations in which the score test is much more powerful than the t test. For example, if the entries in [x.sub.1] and [x.sub.2] have mean zero, and the [x.sub.3] values are symmetric around zero, the t test has no power to identify nonzero [[beta].sub.3] in the quadratic function ([[beta].sub.1][x.sub.1] + [[beta].sub.2][x.sub.2] + [[beta].sub.3][x.sub.3])[.sup.2], whereas the score test can have power greater than .5 when n = 30. The superiority of the score test over the t test when g is linear and the SNR is low seems paradoxical. This occurs because [^.[sigma].sup.2] is biased downward, reflecting the flexibility of the single index model noted earlier ([^.g] is closer to the observed y values than is g).
Although the score test [S.sub.C] is useful, it should be used with caution in practice. However, because its difficulties are much more related to power than to size, a statistically significant test can indicate that the additional covariates are important to the model fitting. It might seem that the solution to the difficulties in using the score test is to use a likelihood-ratio-type test,
LR = n[log([^.[sigma].sub.1.sup.2]) - log([^.[sigma].sub.1,2.sup.2])],
where [^.[sigma].sub.1,2.sup.2] is the variance estimate [summation] [^.e.sub.i.sup.2]/n based on the full dataset ([X.sub.1], [X.sub.2]). (Recall that [^.[sigma].sub.1.sup.2] is the corresponding estimate based on using only [X.sub.1] as predictors.) Unfortunately, this is not the case, because the statistic does not follow a standard distribution under the null hypothesis. The unique characteristics of the single index model make it impossible to appeal to standard inferential arguments. Unlike in linear regression models, it is not necessarily the case that the unrestricted model (including the extra covariate) fits better than the restricted model (omitting it). Because the form of g is different under the null and alternative hypotheses, it can happen that [^.[sigma].sub.1.sup.2] < [^.[sigma].sub.1,2.sup.2], implying LR < 0. For the cases examined here, this happened 15-25% of the time for high SNR situations and more than 30% of the time for low SNRs. Further, the flexibility of estimation of g in the single index model implies that even under the null hypothesis the model using the additional covariate improves estimation accuracy more than would usually be expected. Using a [chi square] approximation, the tests are seriously anticonservative, with null sizes for both n = 30 and n = 50 around .15 - .2 for low signal-to-noise and .08 - .1 for high SNR.
Small-scale simulations based on [t.sub.3]-distributed errors showed that although the score tests for autocorrelation and an additional covariate were relatively insensitive to the nonnormal errors, for heteroscedasticity it was not, becoming strongly anticonservative. Studentizing the score test brings its size back to the proper level, but at the price of reduced power, making that approach problematic. For example, the studentized score test is less powerful than the ordinary score test for normal errors and is less powerful than Levene's test for [t.sub.3] errors.
4. ANALYSIS OF REAL DATA
In this section three real datasets are analyzed using the single index model score tests, and it is shown that not taking model misspecifications into account can lead to very different inferences. In all cases the linear index was calculated using OLS on centered predictors, and g was estimated using a local quadratic estimator with smoothing parameter chosen using AI[C.sub.C]. For all three examples SIR-based linear indices led to substantially similar results.
4.1 Automobile Prices
The data come from the 1999 Car Buyers issue of Consumer Reports (1999) and represent a sample of 41 autos. The price (in thousands of dollars) of the auto is modeled as a function of engine power (in horsepower), weight (in thousands of pounds), passenger space (in cubic feet), and reliability rating (from 1 to 5, with 1 being worst and 5 being best), as these are all characteristics that could be viewed as important by consumers. Figure 4(a) gives a scatter plot of price versus the linear index, with the estimated g superimposed, indicating nonlinearity in the relationship. Figure 4(b) is a scatter plot of the residuals from this fit versus the linear index, which strongly suggests heteroscedasticity. The score test [S.sub.H] given in (2.4), assuming an exponential model V ([[epsilon].sub.i]|[z.sub.i]) = [[sigma].sub.i.sup.2] = [[sigma].sup.2] exp([delta][z.sub.i]), equals 6.13 on one degree of freedom (p = .013), strongly rejecting constant variance. This inference is insensitive to the choice of smoothing parameter. Figure 5 gives a significance trace for this test (Bowman and Azzalini 1997, p. 89), which gives the tail probability of the test based on different choices of the smoothing parameter (the AI[C.sub.C] choice is represented by the vertical dashed line). It is clear that the p-value of the test is virtually constant over a wide range of smoothing parameters.
[FIGURE 4 OMITTED]
[FIGURE 5 OMITTED]
Hardle, Hall, and Ichimura (1993) showed that a weighted version of [^.[beta]] has higher efficiency than an unweighted version when there is nonconstant variance, and they proposed a multistage estimator. We do not pursue this because their heteroscedasticity model is different from the one used here. Still, even if [^.[beta]] is not changed, heteroscedasticity needs to be taken into account in making predictions, for example. A rough pointwise 95% prediction interval at z = [z.sub.0] is
[^.g]([z.sub.0])[+ or -]2[square root of ([[^.V]([^.g]([z.sub.0])) + [^.V]([epsilon]|[z.sub.0])])]. (4.1)
This is only roughly a prediction interval, because it ignores the bias in [^.g]([z.sub.0]), but it is useful as a guide for the effect of nonconstant variance on predictions. Once heteroscedasticity is suspected, a weighted version of [^.g] is appropriate, with weights [^.[sigma].sub.i.sup.-2]. Asymptotic forms for V([^.g]([z.sub.0])) are well known; see, for example, Fan and Gijbels (1996, p. 62).
Estimating the weights requires estimating the variance coefficients [delta]. Because E([[epsilon].sub.i.sup.2]) = [[sigma].sub.i.sup.2] is an exponential function of [z.sub.i], a Poisson regression of [[epsilon].sup.2] on z using the canonical logarithmic link function
log[E([[epsilon].sub.i.sup.2])] = [[delta].sub.0] + [[delta].sub.1][z.sub.i]
provides consistent estimates for [delta], and hence [[sigma].sub.i.sup.2] (and thus V([epsilon]|[z.sub.0])). The errors [epsilon] are not available, of course, so a Poisson regression of [^.e.sup.2] on z provides a practical implementation of this idea.
Figure 6 gives a plot of the pointwise prediction intervals for the unweighted (dashed) and weighted (solid) models based on (4.1). It is apparent that the intervals without accounting for heteroscedasticity are not appropriate, being too wide for smaller z and too narrow for larger z. In contrast, the intervals accounting for heteroscedasticity follow the observed relationship closely, and they could be used in predictions for other automobiles not in the sample.
[FIGURE 6 OMITTED]
4.2 Carbon Monoxide Emissions Data
These data track carbon monoxide emissions in the United States from 1960 through 1989. The target variable is per capita emissions (in metric tonnes per person), with predictors per capita constant dollar gasoline consumption, millions of acres destroyed by forest fires, and a linear time trend, thereby taking into account two major causes of carbon monoxide (internal combustion engines and forest fires), while still allowing for a trend in emissions given these causes. The data come from the web sites of the Census Bureau, the Environmental Protection Agency, the Department of Agriculture, and Statistical Review of World Energy, published by British Petroleum. The coefficient for time in the OLS-based linear index is negative, indicating a declining trend in emissions given these causes. The linear model indicates a strong relationship between the predictors and emissions ([R.sup.2] = 91.3%) but also shows strong evidence of autocorrelation in the residuals (for example, the Durbin-Watson statistic equals .70, which is highly statistically significant).
Figure 7 gives a plot of emissions versus the linear index, with estimated g superimposed. The plot suggests some nonlinearity in the relationship. This nonlinearity of [^.g] completely changes inferences about the existence of autocorrelation, because the score test for autocorrelation given in (2.6) is [S.sub.A] = .34, which does not give any sign of autocorrelation (p = .56). Figure 8 shows why the nonlinearity of g is so important. That figure plots the residuals versus year from the linear model (dashed line) and single index model (solid line). It is apparent that the Durbin-Watson test is interpreting the pattern in Figure 7 as autocorrelation in the errors, rather than nonlinearity in g. This is also apparent in the significance trace for this test given in Figure 9. Although the test is not statistically significant for a wide range of smoothing parameters, as the smoothing parameter becomes larger the estimated curve becomes straighter, and eventually the test becomes statistically significant. It is impossible to tell which interpretation (autocorrelation in the errors or nonlinearity of g) is correct for these real data, given the existence of other factors related to air pollution that are not in the model, but nonlinearity seems reasonable. For example, the emissions observations at the far right of Figure 7 correspond to pre-1973 data, before emissions began to steadily fall. This drop is attributable in large part to the implementation of the Clear Air Act of 1970, which put sharp limitations on sources of air pollution.
[FIGURE 7 OMITTED]
[FIGURE 8 OMITTED]
4.3 Ozone Concentration Data
The data come from Cook and Weisberg (1994, p. 127) and refer to air quality readings taken on 111 days in the New York City area in 1973. The target variable is the ozone concentration in parts per billion, and the potential predictors are the temperature (in degrees Fahrenheit), wind speed (in miles per hour), and solar radiation (in Langleys). The t statistics for the three variables from an OLS fit are 6.52, -5.10, and 2.58, respectively. Thus, the evidence for the need for the solar radiation variable given the other two variables is barely statistically insignificant at a .01 level (p = .011). This is dependent, of course, on an appropriate linear model.
Figure 10 is a plot of ozone concentration versus the linear index based on only temperature and wind speed, with estimated g superimposed; noticeable curvature in g is again evident. Given the use of these two predictors, the score test (2.7) for the need for solar radiation level in the single index model is [S.sub.C] = 11.96, which is highly statistically significant (p = .0005). The significance trace for this test given in Figure 11 shows that this inference is very insensitive to the choice of smoothing parameter. Thus, when the curvature of g is considered, the usefulness of the solar radiation variable is considerably greater than would have been thought otherwise.
[FIGURE 9 OMITTED]
[FIGURE 10 OMITTED]
These data form a time series, of course, so issues of autocorrelation are also relevant here. According to the score test for autocorrelation, this is not a problem, as it gives no indication of any autocorrelation.
5. CONCLUSIONS
In this article we used the general structure of the score test to derive tests for three possible misspecifications of the single index model. Monte Carlo simulations show that the tests hold their size well and can be reasonably powerful, even for small samples.
Score tests also can be derived for other models of interest. For example, if autocorrelation (if present) is assumed to be AR(p), rather than AR(1), the score test generalizes to the intuitive form
S = n([^.[rho].sub.1.sup.2] +...+ [^.[rho].sub.p.sup.2]),
where [^.[rho].sub.j] = [[summation].sub.i=j+1.sup.n][^.e.sub.i][^.e.sub.i-j]/[[summation].sub.i=1.sup.n] [^.e.sub.i.sup.2], referenced to a [[chi square].sub.p] distribution. Obviously the Durbin-Watson test is potentially useless for models of this type, if [[rho].sub.1] is small but higher-order autocorrelations are large.
[FIGURE 11 OMITTED]
The reduced power of the test for autocorrelation when the underlying function was a sine function shows that smoothing parameter selection for this test could be improved. The test is derived assuming the null is true, so the AI[C.sub.C] choice is not inappropriate, but it can lead to autocorrelation being mistaken for signal. Hart (1994) and Hart and Yi (1998) proposed time series cross-validation for smoothing parameter selection under autocorrelation, and it is reasonable to think that using a smoothing parameter based on this method in the autocorrelation test could improve power.
Problems with smoothing parameter choice also affect the properties of the test for an additional covariate, in that they affect estimation of [dot.g]. Using a different smoothing parameter for estimating g and [dot.g] is appropriate in theory and would definitely help for g functions close to linear. There has not been a great deal of work done on smoothing parameter choice for estimation for regression derivatives, although the plug-in method of Fan and Gijbels (1995) is one possible approach. The deterioration of power for other g functions as the coefficient of the additional covariate gets larger is clearly a concern, but it seems to be an inherent part of the score test construction. In addition, the flexibility of the single index model means that the estimated g under the null can be far from the "true" g, potentially reducing the power of the test, but the failure of the likelihood ratio to provide a useful test makes it clear that further research is needed in this area.
S-Plus code to implement these tests is available from the authors.
ACKNOWLEDGMENTS
We thank Karen Kafadar, an associate editor, and the referees for helpful comments on an earlier version of this manuscript. Chih-Ling Tsai's research was supported in part by National Institute of Health grant DA-01-043.
[Received August 2000. Revised September 2001.]
REFERENCES
Bowman, A. W., and Azzalini, A. (1997), Applied Smoothing Techniques for Data Analysis, Oxford: Clarendon Press.
Brillinger, D. R. (1983), "A Generalized Linear Model with 'Gaussian' Regression Variables," in A Festschrift for Erich L. Lehmann in Honor of His Sixty-Fifth Birthday, eds. P. J. Bickel, K. A. Doksum, and J. L. Hodges, Belmont, CA: Wadsworth, pp. 97-114.
Carroll, R. J., Fan, J., Gijbels, I., and Wand, M. P. (1997), "Generalized Partially Linear Single-Index Models," Journal of the American Statistical Association, 92, 477-489.
Consumer Reports (1999), "The '99s," Consumer Reports, Vol. 64, No. 4, Yonkers, NY: Consumers Union.
Cook, R. D., and Weisberg, S. (1994), An Introduction to Regression Graphics, New York: Wiley.
Cox, D. R., and Hinkley, D. V. (1974), Theoretical Statistics, London: Chapman and Hall.
Duan, N., and Li, K. C. (1991), "Slicing Regression: A Link Free Regression Method," Annals of Statistics, 19, 505-530.
Eubank, R. L., and Thomas, W. (1993), "Detecting Heteroscedasticity in Non-parametric Regression," Journal of the Royal Statistical Society, Ser. B, 55, 145-155.
Evans, G. B. A., and Savin, N. E. (1982), "Conflict Among the Criteria Revisited; The W, LR and LM Tests," Econometrica, 50, 737-748.
Fan, J., and Gijbels, I. (1995), "Data-Driven Bandwidth Selection in Local Polynomial Fitting: Variable Bandwidth and Spatial Adaptation," Journal of the Royal Statistical Society, Ser. B, 57, 371-394.
_______ (1996), Local Polynomial Modelling and its Applications, London: Chapman and Hall.
Hardle, W., Hall, P., and Ichimura, H. (1993), "Optimal Smoothing in Single-Index Models," Annals of Statistics, 21, 157-178.
Hart, J. D. (1994), "Automated Kernel Smoothing of Dependent Data by Using Time Series Cross-Validation," Journal of the Royal Statistical Society, Ser. B, 56, 529-542.
Hart, J. D., and Yi, S. (1998), "One-Sided Cross-Validation," Journal of the American Statistical Association, 93, 620-631.
Horowitz, J. L. (1998), Semiparametric Methods in Econometrics, New York: Springer-Verlag.
Hurvich, C. M., Simonoff, J. S., and Tsai, C.-L. (1998), "Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion," Journal of the Royal Statistical Society, Ser. B, 60, 271-293.
Ichimura, H. (1993), "Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single Index Models," Journal of Econometrics, 58, 71-120.
Koenker, R. (1981), "A Note on Studentizing a Test for Heteroscedasticity," Journal of Econometrics, 17, 107-112.
Powell, J. L., Stock, J. H., and Stoker, T. M. (1989), "Semiparametric Estimation of Index Coefficients," Econometrica, 61, 123-138.
Seber, G. A. F., and Wild, C. J. (1989), Nonlinear Regression, New York: Wiley.
Simonoff, J. S. (1996), Smoothing Methods in Statistics, New York: Springer-Verlag.
Simonoff, J. S., and Tsai, C.-L. (1994), "Use of Modified Profile Likelihood for Improved Tests of Constancy of Variance in Regression," Applied Statistics, 43, 357-370.
Tsai, C.-L. (1986), "Score Test for the First-Order Autoregressive Model With Heteroscedasticity," Biometrika, 73, 455-460.
Jeffrey S. SIMONOFF
Leonard N. Stern School of Business
New York University
New York, NY 10012-0258
(jsimonof@stern.nyu.edu)
Chih-Ling TSAI
Graduate School of Management
University of California at Davis
Davis, CA 95616-8609
(cltsai@ucdavis.edu)