Article information

Title: Combining yearly and quarterly data in regression analysis.
Author: Ahmad, Eatzaz
Journal: Pakistan Development Review
Print ISSN: 0030-9729
Publication year: 1988
Issue: December
Language: English
Publisher: Pakistan Institute of Development Economics
Abstract: Data deficiency is often a problem in regression analysis. The problem, for example, may be due to non-availability of data on some variable, missing observations, lack of information due to multicollinearity and measurement errors, etc. Various approaches have been suggested to deal with the problem depending on its precise nature. One such problem we want to focus our attention on is the lack of time disaggregated data in time-series regression analysis. In particular, observations on some variables over a shorter time interval like a quarter may be limited in number while the corresponding observations over a longer time interval like a year are available for a long period of time. The number of quarterly observations may not be sufficient to estimate the desired relationship with acceptable degrees of freedom. On the other hand, estimation with yearly data may require the use of a long time series going way back into the past. The estimates thus obtained may not capture the relationship prevailing at present or in the recent past and, therefore, mislead the researcher. In addition, the use of yearly data may also result in lack of degrees of freedom.
Keywords: Regression analysis

1. INTRODUCTION

Data deficiency is often a problem in regression analysis. The problem, for example, may be due to non-availability of data on some variable, missing observations, lack of information due to multicollinearity and measurement errors, etc. Various approaches have been suggested to deal with the problem depending on its precise nature. One such problem we want to focus our attention on is the lack of time disaggregated data in time-series regression analysis. In particular, observations on some variables over a shorter time interval like a quarter may be limited in number while the corresponding observations over a longer time interval like a year are available for a long period of time. The number of quarterly observations may not be sufficient to estimate the desired relationship with acceptable degrees of freedom. On the other hand, estimation with yearly data may require the use of a long time series going way back into the past. The estimates thus obtained may not capture the relationship prevailing at present or in the recent past and, therefore, mislead the researcher. In addition, the use of yearly data may also result in lack of degrees of freedom.

One possible approach to deal with the problem is to convert the yearly data into quarterly data by using some distribution scheme and to combine the resulting series with the other quarterly observations available. Friedman (1962) suggests various non-correlation methods of distributing a time aggregated series into a time disaggregated series. Chow and Lin (1971), Friedman (1962), Hsiao (1979) and Palm and Nijman (1982) also suggest various methods of using a time disaggregated related series to distribute a time aggregated series over shorter time intervals.

The problem with any method of distribution is that it introduces measurement error and related problems in regression analysis. For this reason we suggest another approach, which is quite simple and does not involve distribution of yearly data over quarterly intervals. One can simply pool quarterly and yearly data, adjusting for the heteroscedasticity introduced by the pooling. It is shown that the estimators of regression coefficients with pooled data have smaller variances than the estimators with yearly or quarterly data alone.

The paper is organized as follows. In Section 2 we present the model and derive an Ordinary Least Squares estimator and the corresponding variance-covariance matrix of the vector of regression coefficients with alternative data sets. Relative efficiency of the alternative estimators is compared in Section 3. Section 4 presents the conclusions of the paper with some thoughts on future research.

2. THE MODEL AND ITS ESTIMATION WITH ALTERNATIVE DATA SETS

Let the relationship to be estimated be described by the following linear regression equation:

$$y_{ti} = b_1 + b_2 x_{2ti} + \cdots + b_k x_{kti} + u_{ti} \qquad (1)$$

where $t$ and $i$ refer to year and quarter, respectively. We assume that:

(1) All the variables are flow variables;

(2) There is no lagged variable in the equation;

(3) All the x variables are non-stochastic;

(4) $u_{ti}$ is randomly distributed with $E(u_{ti}) = 0$ for all $t$ and $i$; and

(5) $E(u_{ti} u_{sj}) = \sigma^2$ if $t = s$ and $i = j$, and $E(u_{ti} u_{sj}) = 0$ if $t \neq s$ or $i \neq j$.

Owing to assumptions (1) and (2) we can conveniently define yearly observations as:

$$Y_t = \sum_{i=1}^{4} y_{ti} \quad \text{for all } t$$

$$X_{jt} = \sum_{i=1}^{4} x_{jti} \quad \text{for all } t \text{ and } j$$

$$U_t = \sum_{i=1}^{4} u_{ti} \quad \text{for all } t.$$

Thus we can write Equation (1) for yearly observations as:

$$Y_t = 4 b_1 + b_2 X_{2t} + \cdots + b_k X_{kt} + U_t \qquad (2)$$

where $U_t$ satisfies the following properties, which follow directly from assumptions (4) and (5).

(4') $U_t$ is randomly distributed with $E(U_t) = 0$ for all $t$; and

(5') $E(U_t U_s) = 4\sigma^2$ if $t = s$, and $E(U_t U_s) = 0$ if $t \neq s$.
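As a short worked step, properties (4') and (5') can be checked by expanding the yearly disturbance as a sum of four quarterly disturbances and applying assumptions (4) and (5) term by term:

$$E(U_t) = \sum_{i=1}^{4} E(u_{ti}) = 0,$$

$$E(U_t^2) = \sum_{i=1}^{4} \sum_{j=1}^{4} E(u_{ti} u_{tj}) = \sum_{i=1}^{4} E(u_{ti}^2) = 4\sigma^2,$$

$$E(U_t U_s) = \sum_{i=1}^{4} \sum_{j=1}^{4} E(u_{ti} u_{sj}) = 0 \quad \text{for } t \neq s.$$

The cross terms vanish because quarterly disturbances in different quarters or years are uncorrelated.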

With no loss of generality, we assume that the quarterly observations are available over a whole number of years, say m. In addition, n yearly observations are also assumed to be available. As is more likely, it is assumed that the yearly observations appear before the quarterly ones. We consider the following three options available to estimate the relationship.

Option I

Use yearly data over n + m years. With this option the regression model can be written as:

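$$\begin{bmatrix} Y \\ Y_* \end{bmatrix} = \begin{bmatrix} X \\ X_* \end{bmatrix} B + \begin{bmatrix} U \\ U_* \end{bmatrix}$$

where,

$$Y = (Y_1, \ldots, Y_n)', \quad Y_* = (Y_{n+1}, \ldots, Y_{n+m})', \quad U = (U_1, \ldots, U_n)', \quad U_* = (U_{n+1}, \ldots, U_{n+m})', \quad B = (b_1, \ldots, b_k)',$$

$X$ is the $n \times k$ matrix whose row for year $t$ is $(4, X_{2t}, \ldots, X_{kt})$, $t = 1, \ldots, n$, and $X_*$ is the $m \times k$ matrix with the same row structure for $t = n+1, \ldots, n+m$.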

The best linear estimator of B, its mean vector and variance-covariance matrix are as follows. (1)

$$B_{I} = (X'X + X_*'X_*)^{-1} (X'Y + X_*'Y_*)$$

$$E(B_{I}) = B$$

$$V(B_{I}) = 4\sigma^2 (X'X + X_*'X_*)^{-1} \qquad (3)$$

Option II

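Use only the 4m quarterly observations. In this case the regression model can be written as:

$$y_* = x_* B + u_*$$

where,

$$y_* = (y_{n+1,1}, \ldots, y_{n+1,4}, \ldots, y_{n+m,1}, \ldots, y_{n+m,4})', \quad u_* = (u_{n+1,1}, \ldots, u_{n+1,4}, \ldots, u_{n+m,1}, \ldots, u_{n+m,4})',$$

$x_*$ is the $4m \times k$ matrix whose row for quarter $i$ of year $t$ is $(1, x_{2ti}, \ldots, x_{kti})$, $t = n+1, \ldots, n+m$, $i = 1, \ldots, 4$,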

and B is the same as defined before.

The best linear estimator of B, its mean vector and variance-covariance matrix under this option are: (2)

$$B_{II} = (x_*'x_*)^{-1} (x_*'y_*)$$

$$E(B_{II}) = B$$

$$V(B_{II}) = \sigma^2 (x_*'x_*)^{-1} \qquad (4)$$

Option III

Combine 4m quarterly and n yearly observations. Since the variance of $U_t$ is four times as high as the variance of $u_{ti}$, we have the problem of heteroscedasticity. With pre-adjustment for heteroscedasticity, that is, with the yearly observations divided by 2, the model becomes:

$$\begin{bmatrix} \tfrac{1}{2} Y \\ y_* \end{bmatrix} = \begin{bmatrix} \tfrac{1}{2} X \\ x_* \end{bmatrix} B + \begin{bmatrix} \tfrac{1}{2} U \\ u_* \end{bmatrix}$$

where $Y$, $y_*$, $X$, $x_*$, $U$, $u_*$ and $B$ are the same as defined earlier. With this option the best linear estimator of $B$, its mean vector and variance-covariance matrix are:

$$B_{III} = (\tfrac{1}{4} X'X + x_*'x_*)^{-1} (\tfrac{1}{4} X'Y + x_*'y_*)$$

$$E(B_{III}) = B$$

$$V(B_{III}) = \sigma^2 (\tfrac{1}{4} X'X + x_*'x_*)^{-1} \qquad (5)$$
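The pooled estimator of Option III can be sketched numerically. The following is a minimal illustration rather than the author's own computation: it assumes simulated data with one regressor plus an intercept (k = 2) and normally distributed errors, and the function names (`simulate`, `pooled_estimator`, `pooled_by_rescaling`) are hypothetical. It computes the closed-form expression for the pooled estimator and checks it against ordinary least squares on the stacked data in which the yearly rows are divided by 2.

```python
import numpy as np

def simulate(n=30, m=6, b=(1.0, 0.5), sigma=1.0, seed=0):
    """Simulate quarterly data for n + m years and aggregate the first n years."""
    rng = np.random.default_rng(seed)
    years = n + m
    x2 = rng.normal(size=(years, 4))              # one quarterly regressor
    u = rng.normal(scale=sigma, size=(years, 4))  # homoscedastic quarterly errors
    y = b[0] + b[1] * x2 + u                      # Equation (1) with k = 2
    # Yearly block (first n years): sums over quarters, intercept column equals 4.
    X = np.column_stack([np.full(n, 4.0), x2[:n].sum(axis=1)])
    Y = y[:n].sum(axis=1)
    # Quarterly block (last m years): intercept column equals 1.
    xs = np.column_stack([np.ones(4 * m), x2[n:].ravel()])
    ys = y[n:].ravel()
    return X, Y, xs, ys

def pooled_estimator(X, Y, xs, ys):
    """Closed form: (X'X/4 + x*'x*)^{-1} (X'Y/4 + x*'y*)."""
    A = X.T @ X / 4.0 + xs.T @ xs
    rhs = X.T @ Y / 4.0 + xs.T @ ys
    return np.linalg.solve(A, rhs)

def pooled_by_rescaling(X, Y, xs, ys):
    """Same estimator via OLS on stacked data with the yearly rows halved."""
    X_stack = np.vstack([X / 2.0, xs])
    Y_stack = np.concatenate([Y / 2.0, ys])
    coef, *_ = np.linalg.lstsq(X_stack, Y_stack, rcond=None)
    return coef

X, Y, xs, ys = simulate()
b_closed = pooled_estimator(X, Y, xs, ys)
b_stacked = pooled_by_rescaling(X, Y, xs, ys)
print(b_closed, b_stacked)
```

Dividing the yearly rows by 2 gives their disturbances variance $\sigma^2$, the same as the quarterly disturbances, which is why plain least squares on the rescaled stack reproduces the closed form.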

3. RELATIVE EFFICIENCY OF THE ALTERNATIVE ESTIMATORS OF B

Since all the three estimators of B outlined above are unbiased, their relative efficiency depends only on relative variances. For the comparison of variances we will use the following theorem taken from Maddala (1977).

Theorem

If $B$ is a positive definite matrix and $A - B$ is positive semidefinite, then $B^{-1} - A^{-1}$ is positive semidefinite.

Proof

Since $B$, and hence $B^{-1}$, is positive definite and $A - B$ is positive semidefinite, the matrix $B^{-1}(A - B)$ is positive semidefinite. Therefore the equation $|B^{-1}(A - B) - vI| = 0$ has all roots $v \geq 0$. But the roots of this equation are the same as the roots of $|(A - B) - vB| = 0$, or $|A - (1 + v)B| = 0$, or $|B^{-1} - (1 + v)A^{-1}| = 0$, or $|A(B^{-1} - A^{-1}) - vI| = 0$. Thus all $v \geq 0$ implies that $A(B^{-1} - A^{-1})$ is positive semidefinite. Since $B$ is positive definite and $A - B$ is positive semidefinite, their sum $A$ is positive definite, and so is $A^{-1}$. Multiplying the positive semidefinite matrix $A(B^{-1} - A^{-1})$ by this positive definite matrix $A^{-1}$ gives the positive semidefinite matrix $B^{-1} - A^{-1}$. This completes the proof.
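The theorem can also be checked numerically on an example. The sketch below is only an illustration on randomly generated matrices (an assumption, not part of the paper): it builds a positive definite $B$, adds a rank-2 positive semidefinite matrix to obtain $A$, and inspects the smallest characteristic root of $B^{-1} - A^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4

M = rng.normal(size=(k + 3, k))
B = M.T @ M + np.eye(k)          # positive definite; the identity keeps it well conditioned
N = rng.normal(size=(2, k))
A = B + N.T @ N                  # A - B = N'N is positive semidefinite (rank 2)

D = np.linalg.inv(B) - np.linalg.inv(A)
D = (D + D.T) / 2.0              # symmetrize to remove rounding asymmetry
min_eig = np.linalg.eigvalsh(D).min()
print(min_eig)                   # nonnegative up to rounding error
```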

Let us now apply this theorem to compare variances of various estimators of $B$. We will consider the variance of a linear combination of the elements of $B_r$ ($r$ = I, II, III), namely $c'B_r$, where $c$ is a $k \times 1$ column vector of known constants.

Var ($c'B_{III}$) Versus Var ($c'B_{I}$)

Consider the following matrix.

$$(\tfrac{1}{4} X'X + x_*'x_*) - (\tfrac{1}{4} X'X + \tfrac{1}{4} X_*'X_*) = x_*'x_* - \tfrac{1}{4} X_*'X_*$$

The element in jth row and hth column of this matrix is:

$$\sum_{t=n+1}^{n+m} \sum_{i=1}^{4} x_{jti} x_{hti} - \frac{1}{4} \sum_{t=n+1}^{n+m} X_{jt} X_{ht} = \sum_{t=n+1}^{n+m} \sum_{i=1}^{4} z_{jti} z_{hti}$$

where $z_{jti} = x_{jti} - \bar{x}_{jt}$, $z_{hti} = x_{hti} - \bar{x}_{ht}$, $\bar{x}_{jt} = \sum_{i=1}^{4} x_{jti}/4$

and $\bar{x}_{ht} = \sum_{i=1}^{4} x_{hti}/4$.

The above discussion implies that the whole matrix $(\tfrac{1}{4} X'X + x_*'x_*) - (\tfrac{1}{4} X'X + \tfrac{1}{4} X_*'X_*) = x_*'x_* - \tfrac{1}{4} X_*'X_*$ can be written as $Z_*'Z_*$ where

$Z_*$ is the $4m \times k$ matrix whose row for quarter $i$ of year $t$ is $(z_{1ti}, \ldots, z_{kti})$, $t = n+1, \ldots, n+m$, $i = 1, \ldots, 4$.

Clearly, $Z_*'Z_* = (\tfrac{1}{4} X'X + x_*'x_*) - (\tfrac{1}{4} X'X + \tfrac{1}{4} X_*'X_*)$ is positive semidefinite. In addition, $(\tfrac{1}{4} X'X + \tfrac{1}{4} X_*'X_*)$ is positive definite. Therefore, according to the theorem, the matrix $4(X'X + X_*'X_*)^{-1} - (\tfrac{1}{4} X'X + x_*'x_*)^{-1}$ is positive semidefinite, so that for any vector $c$:

$$4\sigma^2 c'(X'X + X_*'X_*)^{-1} c \geq \sigma^2 c'(\tfrac{1}{4} X'X + x_*'x_*)^{-1} c.$$

Or, according to Equations (3) and (5),

$$\operatorname{var}(c'B_{I}) \geq \operatorname{var}(c'B_{III}).$$

Notice that if $x_{jti} = \bar{x}_{jt}$, that is, if there is no variation across quarters within a year, then $Z_*'Z_*$ is the zero matrix and hence positive as well as negative semidefinite. In this case $\operatorname{var}(c'B_{I})$ is equal to $\operatorname{var}(c'B_{III})$. This is precisely what one should expect. If within a year variation across quarters is zero, then disaggregating yearly observations into quarterly observations does not provide any additional information.

Var ($c'B_{III}$) Versus Var ($c'B_{II}$)

Now consider the matrix $(\tfrac{1}{4} X'X + x_*'x_*) - x_*'x_* = \tfrac{1}{4} X'X$, which is obviously positive semidefinite. Since, in addition, $x_*'x_*$ is positive definite, the above theorem implies that the matrix $(x_*'x_*)^{-1} - (\tfrac{1}{4} X'X + x_*'x_*)^{-1}$ is positive semidefinite.

Therefore we can write:

$$\sigma^2 c'(x_*'x_*)^{-1} c \geq \sigma^2 c'(\tfrac{1}{4} X'X + x_*'x_*)^{-1} c$$

This implies, according to Equations (4) and (5), that

$$\operatorname{var}(c'B_{II}) \geq \operatorname{var}(c'B_{III}).$$

This result does not require any explanation. Addition of observations to a given data set improves efficiency of the estimators.
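Both variance comparisons can be illustrated numerically. The sketch below assumes simulated regressors with k = 2 and $\sigma^2 = 1$; the helper names (`yearly`, `quarterly`, `min_eig`) are hypothetical. It builds the three variance-covariance matrices from Equations (3), (4) and (5), verifies that $x_*'x_* - \tfrac{1}{4} X_*'X_*$ equals $Z_*'Z_*$, and checks that the two variance differences have no negative characteristic roots.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma2 = 20, 5, 1.0
x2 = rng.normal(size=(n + m, 4))          # x2[t, i]: regressor in quarter i of year t

def yearly(block):
    """Yearly design rows (4, X_2t) built from a years-by-quarters block."""
    return np.column_stack([np.full(len(block), 4.0), block.sum(axis=1)])

def quarterly(block):
    """Quarterly design rows (1, x_2ti), ordered year by year."""
    return np.column_stack([np.ones(block.size), block.ravel()])

X, X_star, x_star = yearly(x2[:n]), yearly(x2[n:]), quarterly(x2[n:])

V_I = 4.0 * sigma2 * np.linalg.inv(X.T @ X + X_star.T @ X_star)      # Equation (3)
V_II = sigma2 * np.linalg.inv(x_star.T @ x_star)                     # Equation (4)
V_III = sigma2 * np.linalg.inv(X.T @ X / 4.0 + x_star.T @ x_star)    # Equation (5)

def min_eig(S):
    """Smallest characteristic root of the symmetrized matrix."""
    return np.linalg.eigvalsh((S + S.T) / 2.0).min()

# Within-year deviations: each quarterly row minus its yearly mean row.
Z_star = x_star - np.repeat(X_star / 4.0, 4, axis=0)
gap = x_star.T @ x_star - X_star.T @ X_star / 4.0

print(min_eig(V_I - V_III), min_eig(V_II - V_III))
```

Because the intercept column has no within-year variation, the first row and column of `gap` are zero, so the efficiency gain over Option I comes entirely from the within-year variation of the other regressors.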

Illustration: One Explanatory Variable Case

Consider the case of one explanatory variable without intercept:

$$y_{ti} = b x_{ti} + u_{ti}$$

In this case the variance of $b_r$ ($r$ = I, II, III) can be calculated as follows:

$$\operatorname{var}(b_{I}) = \frac{\sigma^2}{\frac{1}{4} \sum_{t=1}^{n+m} \left( \sum_{i=1}^{4} x_{ti} \right)^2}$$

$$\operatorname{var}(b_{II}) = \frac{\sigma^2}{\sum_{t=n+1}^{n+m} \sum_{i=1}^{4} x_{ti}^2}$$

$$\operatorname{var}(b_{III}) = \frac{\sigma^2}{\frac{1}{4} \sum_{t=1}^{n} \left( \sum_{i=1}^{4} x_{ti} \right)^2 + \sum_{t=n+1}^{n+m} \sum_{i=1}^{4} x_{ti}^2}$$

Denoting the denominator of $\operatorname{var}(b_r)$ by $D_r$ for $r$ = I, II and III, we can show that:

$$D_{III} = D_{I} + \sum_{t=n+1}^{n+m} \sum_{i=1}^{4} (x_{ti} - \bar{x}_t)^2$$ and

$$D_{III} = D_{II} + \frac{1}{4} \sum_{t=1}^{n} \left( \sum_{i=1}^{4} x_{ti} \right)^2$$

This implies that:

$\operatorname{var}(b_{III}) < \operatorname{var}(b_{I})$ unless $x_{ti} = \bar{x}_t$ for $t = n+1, \ldots, n+m$ and $i = 1, \ldots, 4$

$\operatorname{var}(b_{III}) < \operatorname{var}(b_{II})$ unless $\sum_{i=1}^{4} x_{ti} = 0$ for $t = 1, \ldots, n$.
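The relations among the three denominators can be verified on a small numerical example. The sketch below assumes simulated values of the single regressor; the variable names are illustrative only. It computes each denominator directly from the variance expressions and compares the differences with the within-year sum of squares in the last m years and with one quarter of the sum of squared yearly totals in the first n years.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 12, 4
x = rng.normal(size=(n + m, 4))            # x[t, i]: quarter i of year t

year_tot = x.sum(axis=1)                   # yearly totals of the regressor

D_I = 0.25 * np.sum(year_tot ** 2)                   # yearly data, all n + m years
D_II = np.sum(x[n:] ** 2)                            # quarterly data, last m years
D_III = 0.25 * np.sum(year_tot[:n] ** 2) + D_II      # pooled data

# Within-year variation in the years that have quarterly observations.
within = np.sum((x[n:] - x[n:].mean(axis=1, keepdims=True)) ** 2)
# Contribution of the n years observed only as yearly totals.
early = 0.25 * np.sum(year_tot[:n] ** 2)

print(D_III - D_I, within)
print(D_III - D_II, early)
```

Since both differences are sums of squares, the pooled denominator is at least as large as either alternative, so the pooled variance is no larger.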

4. CONCLUDING REMARKS

Aggregation of quarterly data into yearly data for any part of the sample period results in loss of efficiency. Using quarterly data alone when additional yearly data are available also results in loss of efficiency. The best use of the limited data is to pool yearly observations with the quarterly observations with appropriate adjustment for heteroscedasticity. The result can be generalized to include seasonal effects in the regression equation. It can be shown that pooling yearly data with a given set of quarterly data also improves the efficiency of the estimators of seasonal effects, although the yearly data alone are useless for estimating seasonal effects. (3)

The research can be extended to develop tests for autocorrelation of first or fourth order (these are the most likely orders of autocorrelation at quarterly level). Our preliminary research suggests that it is quite complicated to determine the order of autocorrelation with pooled data. Once the order of autocorrelation is determined, one can use various procedures to improve the asymptotic efficiency of the estimates.

Comments on "Combining Yearly and Quarterly Data in Regression Analysis"

In this paper the author has proposed an alternative approach to deal with the issue of non-availability of comparable data in regression analysis. In particular, the paper deals with the problem where both quarterly and annual observations on some variables are available but the number of observations of each type is limited, as a result of which it is difficult to estimate the desired relationship with acceptable degrees of freedom by using either quarterly or annual data alone. Various approaches have been suggested in the literature to deal with such problems. A common weakness of most of these approaches is that they introduce measurement errors. This paper suggests that instead of estimating the desired relationship by using either quarterly or annual data, the researcher should simply pool the quarterly and the annual data with appropriate adjustment for heteroscedasticity. It is claimed that the coefficients obtained using pooled data will have a lower variance as compared to the estimators with yearly or quarterly data.

To show the relative efficiency of the estimates obtained using pooled data the author has made use of one of the standard theorems available in the econometric literature. There is thus little room to cast doubt on his claim. The paper, therefore, can be regarded as a theoretical contribution in the area.

A general problem faced by many researchers, at least in developing countries, is not that both quarterly and annual data are available for the same variables but that for some variables quarterly data are available while for others only annual data are available. In these circumstances the technique suggested by the author is hardly of any use. I wonder if we can somehow modify the suggested technique to handle the above-mentioned problem, which is more frequently faced by researchers.

Nadeem A. Burney

Pakistan Institute of Development Economics, Islamabad

REFERENCES

Chow, G. C., and A. L. Lin (1971). "Best Linear Unbiased Interpolation, Distribution and Extrapolation of Time Series by Related Series". Review of Economics and Statistics. Vol. LIII, No. 4. pp. 372-375.

Friedman, M. (1962). "The Interpolation of Time Series by Related Series". Journal of American Statistical Association. Vol. 57, pp. 729-757.

Hsiao, C. (1979). "Linear Regression Using Both Temporally Aggregated and Temporally Disaggregated Data". Journal of Econometrics. Vol. 10, No. 2, pp. 243-252.

Maddala, G. S. (1979). Econometrics. Tokyo: McGraw-Hill, Inc.

Palm, F. C., and T. E. Nijman (1982). "Linear Regression Using Both Temporally Aggregated and Temporally Disaggregated Data". Journal of Econometrics. Vol. 19, pp. 333-343.

(1) It is assumed that the stacked matrix $[X' \; X_*']'$ has full column rank $k$, which is less than $n + m$.

(2) The matrix $x_*$ is assumed to have full column rank $k < 4m$.

(3) This result has been proved in the paper originally presented in the general meeting. The paper had to be reduced due to the space constraint.

EATZAZ AHMAD, The author is Assistant Professor in the Department of Economics, Quaid-i-Azam University, Islamabad. He is grateful to Professor F. T. Denton of McMaster University for his comments on an earlier draft of this paper.