Combining yearly and quarterly data in regression analysis.
Ahmad, Eatzaz
1. INTRODUCTION
Data deficiency is often a problem in regression analysis. The
problem, for example, may be due to non-availability of data on some
variable, missing observations, lack of information due to
multicollinearity and measurement errors, etc. Various approaches have
been suggested to deal with the problem depending on its precise nature.
One such problem we want to focus our attention on is the lack of time
disaggregated data in time-series regression analysis. In particular,
observations on some variables over a shorter time interval like a
quarter may be limited in number while the corresponding observations
over a longer time interval like a year are available for a long period
of time. The number of quarterly observations may not be sufficient to
estimate the desired relationship with acceptable degrees of freedom. On
the other hand, estimation with yearly data may require the use of a
long time series going far back into the past. The estimates thus
obtained may not capture the relationship prevailing at present or in
the recent past and, therefore, mislead the researcher. In addition, the
use of yearly data may also result in lack of degrees of freedom.
One possible approach to deal with the problem is to convert the
yearly data into quarterly data by using some distribution scheme and
combine with these data the other quarterly observations available.
Friedman (1962) suggests various non-correlation methods of distributing
a time aggregated series into a time disaggregated series. Chow and Lin
(1971), Friedman (1962), Hsiao (1979) and Palm and Nijman (1982) also
suggest various methods of using a time disaggregated related series to
distribute a time aggregated series over shorter time intervals.
The problem with any method of distribution is that it introduces
measurement error and related problems in regression analysis. For
this reason we suggest another approach, which is quite simple and does
not involve distribution of yearly data over quarterly intervals. One
can simply pool quarterly and yearly data adjusting for
heteroscedasticity introduced by the pooling. It is shown that the
estimators of regression coefficients with pooled data have smaller
variances as compared to the estimators with yearly or quarterly data.
The paper is organized as follows. In Section 2 we present the
model and derive an Ordinary Least Squares estimator and the
corresponding variance-covariance matrix of the vector of regression
coefficients with alternative data sets. Relative efficiency of the
alternative estimators is compared in Section 3. Section 4 presents the
conclusions of the paper with some thoughts on future research.
2. THE MODEL AND ITS ESTIMATION WITH ALTERNATIVE DATA SETS
Let the relationship to be estimated be described by the following
linear regression equation:
$$ y_{ti} = b_1 + b_2 x_{2ti} + \cdots + b_k x_{kti} + u_{ti} \tag{1} $$
where, t and i refer to year and quarter respectively. We assume
that:
(1) All the variables are flow variables;
(2) There is no lagged variable in the equation;
(3) All the x variables are non-stochastic;
(4) [u.sub.ti] is randomly distributed with E([u.sub.ti]) = 0 for
all t and i; and
(5) E([u.sub.ti] [u.sub.sj]) = [[sigma].sup.2] if t = s and i = j, and
E([u.sub.ti] [u.sub.sj]) = 0 if t [not equal to] s or i [not equal to] j.
Owing to assumptions (1) and (2) we can conveniently define yearly
observations as:
[Y.sub.t] = [4.summation over (i=1)] [y.sub.ti] for all t
[X.sub.jt] = [4.summation over (i=1)] [x.sub.jti] for all t and j
[U.sub.t] = [4.summation over (i=1)] [u.sub.ti] for all t.
Thus we can write Equation (1) for yearly observations as:
[Y.sub.t] = 4[b.sub.1] + [b.sub.2] [X.sub.2t] + ... + [b.sub.k]
[X.sub.kt] + [U.sub.t] ... ... ... (2)
where [U.sub.t] satisfies the following obvious properties.
(4') [U.sub.t] is randomly distributed with E([U.sub.t]) = 0
for all t and
(5') E([U.sub.t] [U.sub.s]) = 4 [[sigma].sup.2] if t = s, and
E([U.sub.t] [U.sub.s]) = 0 if t [not equal to] s.
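The aggregation that leads from Equation (1) to Equation (2) can be illustrated with a minimal Python sketch. The function name aggregate_year and the numbers used are hypothetical and serve only as an illustration; the disturbance is set to zero so that the arithmetic is easy to follow. Summing four quarterly rows turns the intercept column of ones into 4 and leaves the slope coefficient unchanged.

```python
def aggregate_year(quarter_rows):
    """Sum four quarterly rows (lists of equal length) into one yearly row."""
    if len(quarter_rows) != 4:
        raise ValueError("a year needs exactly four quarterly rows")
    return [sum(col) for col in zip(*quarter_rows)]

# Hypothetical coefficients: b1 (intercept) and b2 (slope).
b = [2.0, 0.5]

# Quarterly regressor rows [1, x_2ti] for one year, and noise-free y_ti.
x_quarters = [[1.0, 3.0], [1.0, 4.0], [1.0, 6.0], [1.0, 7.0]]
y_quarters = [b[0] * row[0] + b[1] * row[1] for row in x_quarters]

X_year = aggregate_year(x_quarters)   # intercept column becomes 4
Y_year = sum(y_quarters)              # yearly flow of y

# Equation (2) without the disturbance: Y_t = 4*b1 + b2*X_2t
Y_from_eq2 = b[0] * X_year[0] + b[1] * X_year[1]
```

Here Y_year and Y_from_eq2 coincide, which is the content of Equation (2) when all variables are flows and no lags are present.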
With no loss of generality, we assume that the quarterly
observations are available over a whole number of years, say m. In
addition, n yearly observations are also assumed to be available. As is
more likely, it is assumed that the yearly observations appear before
the quarterly ones. We consider the following three options available to
estimate the relationship.
Option I
Use yearly data over n + m years. With this option the regression
model can be written as:
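$$ \begin{bmatrix} Y \\ Y_* \end{bmatrix} = \begin{bmatrix} X \\ X_* \end{bmatrix} B + \begin{bmatrix} U \\ U_* \end{bmatrix} $$
where,
$$ Y = (Y_1, \ldots, Y_n)', \quad Y_* = (Y_{n+1}, \ldots, Y_{n+m})', \quad U = (U_1, \ldots, U_n)', \quad U_* = (U_{n+1}, \ldots, U_{n+m})', \quad B = (b_1, \ldots, b_k)', $$
$X$ is the $n \times k$ matrix whose row $t$ is $(4, X_{2t}, \ldots, X_{kt})$ for $t = 1, \ldots, n$, and $X_*$ is the $m \times k$ matrix with rows of the same form for $t = n+1, \ldots, n+m$.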
The best linear estimator of B, its mean vector and
variance-covariance matrix are as follows. (1)
[B.sub.I] = [(X' X + [X'.sub.*][X.sub.*]).sup.-1]
(X' Y + [X'.sub.*][Y.sub.*])
E([B.sub.I]) = B
V([B.sub.I]) = 4 [[sigma].sup.2] [(X' X +
[X'.sub.*][X.sub.*]).sup.-1] ... ... (3)
Option II
Use only quarterly data over 4m quarterly observations. In this
case the regression model can be written as:
[y.sub.*] = [x.sub.*] B + [u.sub.*]
where,
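$$ y_* = (y_{n+1,1}, \ldots, y_{n+1,4}, \ldots, y_{n+m,1}, \ldots, y_{n+m,4})', \quad u_* = (u_{n+1,1}, \ldots, u_{n+1,4}, \ldots, u_{n+m,1}, \ldots, u_{n+m,4})', $$
$x_*$ is the $4m \times k$ matrix whose row for year $t$ and quarter $i$ is $(1, x_{2ti}, \ldots, x_{kti})$, $t = n+1, \ldots, n+m$, $i = 1, \ldots, 4$,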
and B is the same as defined before.
The best linear estimator of B, its mean vector and
variance-covariance matrix under this option are: (2)
[B.sub.II] = [([x'.sub.*][x.sub.*]).sup.-1]
([x'.sub.*][y.sub.*])
E([B.sub.II]) = B
V([B.sub.II]) = [[sigma].sup.2]
[([x'.sub.*][x.sub.*]).sup.-1] ... ... ... (4)
Option III
Combine 4m quarterly and n yearly observations. Since the variance of
[U.sub.t] is four times the variance of [u.sub.ti], we have the
problem of heteroscedasticity. With pre-adjustment for
heteroscedasticity, the model becomes:
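$$ \begin{bmatrix} \tfrac{1}{2} Y \\ y_* \end{bmatrix} = \begin{bmatrix} \tfrac{1}{2} X \\ x_* \end{bmatrix} B + \begin{bmatrix} \tfrac{1}{2} U \\ u_* \end{bmatrix} $$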
where, Y, [y.sub.*], X, [x.sub.*], U, [u.sub.*] and B are the same
as defined earlier. With this option the best linear estimator of B, its
mean vector and variance-covariance matrix are:
[B.sub.III] = [(1/4 X' X + [x'.sub.*][x.sub.*]).sup.-1]
(1/4 X' Y + [x'.sub.*][y.sub.*])
E([B.sub.III]) = B
V([B.sub.III]) = [[sigma].sup.2] [(1/4 X' X +
[x'.sub.*][x.sub.*]).sup.-1] ... ... ... (5)
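The three options can be compared side by side in a short NumPy sketch. The function name three_estimators, the synthetic data and the coefficient values are assumptions made for illustration only; the formulas follow the expressions for [B.sub.I], [B.sub.II] and [B.sub.III] given above. The example uses noise-free data so that all three estimators should reproduce the assumed coefficients.

```python
import numpy as np

def three_estimators(X, Y, x_star, y_star):
    """Return (B_I, B_II, B_III) for Options I, II and III.

    X, Y           : n yearly observations (first column of X is 4).
    x_star, y_star : 4m quarterly observations (first column of x_star is 1),
                     ordered year by year, four rows per year.
    """
    m = x_star.shape[0] // 4
    # Yearly aggregates of the quarterly block (X_* and Y_* in the text).
    X_star = x_star.reshape(m, 4, -1).sum(axis=1)
    Y_star = y_star.reshape(m, 4).sum(axis=1)

    # Option I: least squares on n + m yearly observations.
    B_I = np.linalg.solve(X.T @ X + X_star.T @ X_star,
                          X.T @ Y + X_star.T @ Y_star)
    # Option II: least squares on the 4m quarterly observations only.
    B_II = np.linalg.solve(x_star.T @ x_star, x_star.T @ y_star)
    # Option III: pooled data; yearly rows are scaled by 1/2, which puts
    # the weight 1/4 on the yearly cross-products.
    B_III = np.linalg.solve(0.25 * X.T @ X + x_star.T @ x_star,
                            0.25 * X.T @ Y + x_star.T @ y_star)
    return B_I, B_II, B_III

# Synthetic, noise-free illustration (all values hypothetical).
rng = np.random.default_rng(0)
n, m = 6, 3
B_true = np.array([1.0, 2.0])
x2 = rng.uniform(1.0, 5.0, size=4 * (n + m))
x_all = np.column_stack([np.ones_like(x2), x2])
y_all = x_all @ B_true

# First n years are observed only as yearly sums; last m years quarterly.
X = x_all[:4 * n].reshape(n, 4, -1).sum(axis=1)
Y = y_all[:4 * n].reshape(n, 4).sum(axis=1)
x_star, y_star = x_all[4 * n:], y_all[4 * n:]

B_I, B_II, B_III = three_estimators(X, Y, x_star, y_star)
```

With disturbances added to y_all the three estimates would differ, and the comparison of their variances is the subject of the next section.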
3. RELATIVE EFFICIENCY OF THE ALTERNATIVE ESTIMATORS OF B
Since all the three estimators of B outlined above are unbiased,
their relative efficiency depends only on relative variances. For the
comparison of variances we will use the following theorem taken from
Maddala (1977).
Theorem
If B is a positive definite matrix and A - B is positive
semidefinite then [B.sup.-1] - [A.sup.-1] is positive semidefinite.
Proof
Since B and hence [B.sup.-1] are positive definite and A - B is
positive semidefinite, the matrix [B.sup.-1] (A - B) is positive
semidefinite. Therefore the equation |[B.sup.-1] (A - B) - vI| = 0
has all roots v [greater than or equal to] 0. But the roots of
this equation are the same as the roots of |(A - B) - vB| = 0, or
|A - (1 + v)B| = 0, or |[B.sup.-1] - (1 + v)[A.sup.-1]| = 0, or
|A([B.sup.-1] - [A.sup.-1]) - vI| = 0. Thus all v [greater than or
equal to] 0 implies that A([B.sup.-1] -
[A.sup.-1]) is positive semidefinite. Since B is positive definite and A
- B is positive semidefinite, the sum of the two, that is A, is positive
definite and so is [A.sup.-1]. Multiplying this positive definite matrix
([A.sup.-1]) by the positive semidefinite matrix A([B.sup.-1] -
[A.sup.-1]) gives a positive semidefinite matrix [B.sup.-1] - [A.sup.-1].
This completes the proof.
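The theorem can also be checked numerically. The following NumPy sketch is only an illustration on randomly generated matrices, not a substitute for the proof; the helper is_psd and the tolerance 1e-10 are assumptions introduced here.

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """True if the symmetric part of M has no eigenvalue below -tol."""
    S = (M + M.T) / 2.0
    return bool(np.linalg.eigvalsh(S).min() >= -tol)

rng = np.random.default_rng(1)
G = rng.normal(size=(5, 3))
H = rng.normal(size=(2, 3))

B = G.T @ G + 0.1 * np.eye(3)   # positive definite by construction
A = B + H.T @ H                 # A - B = H'H is positive semidefinite

# The theorem states that B^{-1} - A^{-1} is positive semidefinite.
diff = np.linalg.inv(B) - np.linalg.inv(A)
theorem_holds = is_psd(diff)
```

Because H has only two rows, A - B is singular here, so the example covers the semidefinite (not strictly definite) case that matters for the comparisons below.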
Let us now apply this theorem to compare variances of various
estimators of B. We will consider the variance of a linear combination
of the elements of [B.sub.r] (r = I, II, III), namely c' [B.sub.r], where
c is a k x 1 column vector of known constants.
Var (c'[B.sub.III]) Versus Var (c'[B.sub.I])
Consider the following matrix.
(1/4 X'X + [x'.sub.*][x.sub.*]) - (1/4 X'X + 1/4
[X'.sub.*][X.sub.*]) = [x'.sub.*][x.sub.*] - 1/4
[X'.sub.*][X.sub.*]
The element in jth row and hth column of this matrix is:
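$$ \sum_{t=n+1}^{n+m} \sum_{i=1}^{4} x_{jti} x_{hti} - \frac{1}{4} \sum_{t=n+1}^{n+m} \Big( \sum_{i=1}^{4} x_{jti} \Big) \Big( \sum_{i=1}^{4} x_{hti} \Big) = \sum_{t=n+1}^{n+m} \sum_{i=1}^{4} z_{jti} z_{hti} $$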
where, [z.sub.jti] = [x.sub.jti] - [[bar.x].sub.jt], [z.sub.hti] =
[x.sub.hti] - [[bar.x].sub.ht], [[bar.x].sub.jt] = [4.summation over (i
= 1)] [x.sub.jti]/4
and [[bar.x].sub.ht] = [4.summation over (i=1)] [x.sub.hti]/4.
The above discussion implies that the whole matrix (1/4 X'X +
[x'.sub.*][x.sub.*]) - (1/4 X'X + 1/4
[X'.sub.*][X.sub.*]) = [x'.sub.*][x.sub.*] - 1/4
[X'.sub.*][X.sub.*] can be written as [Z'.sub.*][Z.sub.*]
where,
$Z_*$ is the $4m \times k$ matrix whose row for year $t$ and quarter $i$ is $(z_{1ti}, \ldots, z_{kti})$, $t = n+1, \ldots, n+m$, $i = 1, \ldots, 4$.
Clearly, [Z'.sub.*] [Z.sub.*] = (1/4 X' X +
[x'.sub.*] [x.sub.*]) - (1/4 X' X + 1/4 [X'.sub.*]
[X.sub.*]) is positive semidefinite. In addition, (1/4 X' X + 1/4
[X'.sub.*][X.sub.*]) is positive definite. Therefore, according to
the theorem, the matrix 4 [(X' X + [X'.sub.*]
[X.sub.*]).sup.-1] - [(1/4 X' X + [x'.sub.*]
[x.sub.*]).sup.-1] is positive semidefinite. Hence, for any c,
4 [[sigma].sup.2] c'[(X' X + [X'.sub.*]
[X.sub.*]).sup.-1] c [greater than or equal to] [[sigma].sup.2] c'
[(1/4 X' X + [x'.sub.*][x.sub.*]).sup.-1] c.
Or, according to Equations (3) and (5),
var (c' [B.sub.I]) [greater than or equal to] var (c'
[B.sub.III]).
Notice that, if [x.sub.jti] = [[bar.x].sub.jt], that is, if there
is no variation across quarters within a year, then
[Z'.sub.*][Z.sub.*] is positive as well as negative semidefinite,
that is, a zero matrix.
In this case var (c'[B.sub.I]) is equal to var (c'[B.sub.III]).
This is precisely what one should expect. If within a year variation
across quarters is zero then disaggregating yearly observations into
quarterly observations does not provide any additional information.
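The decomposition used above, namely that the quarterly cross-product matrix minus one quarter of the yearly cross-product matrix equals the cross-product matrix of within-year deviations, can be verified numerically. The sketch below uses hypothetical random data; the function name within_year_deviations is an assumption introduced for illustration.

```python
import numpy as np

def within_year_deviations(x_star):
    """Subtract each year's quarterly mean from its four quarterly rows."""
    m = x_star.shape[0] // 4
    blocks = x_star.reshape(m, 4, -1)
    return (blocks - blocks.mean(axis=1, keepdims=True)).reshape(4 * m, -1)

rng = np.random.default_rng(2)
m = 3
# Quarterly regressors: intercept column of ones plus one x variable.
x_star = np.column_stack([np.ones(4 * m), rng.uniform(0.0, 10.0, 4 * m)])
X_star = x_star.reshape(m, 4, -1).sum(axis=1)   # yearly aggregates
Z_star = within_year_deviations(x_star)

lhs = x_star.T @ x_star - 0.25 * X_star.T @ X_star
rhs = Z_star.T @ Z_star

# Special case: no variation across quarters within a year.
x_const = np.repeat(X_star / 4.0, 4, axis=0)
Z_const = within_year_deviations(x_const)
```

In this sketch lhs and rhs agree up to rounding, and Z_const is a zero matrix, which corresponds to the case in which disaggregation adds no information.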
Var (c' [B.sub.III]) Versus Var (c' [B.sub.II])
Now consider the matrix (1/4 X' X + [x'.sub.*][x.sub.*])
- [x'.sub.*][x.sub.*] = 1/4 X' X which is obviously positive
semidefinite. Since, in addition, [x'.sub.*][x.sub.*] is positive
definite, the above theorem implies that the matrix
[([x'.sub.*][x.sub.*]).sup.-1] - [(1/4 X' X +
[x'.sub.*][x.sub.*]).sup.-1] is positive semidefinite.
Therefore we can write:
[[sigma].sup.2] c' [([x'.sub.*][x.sub.*]).sup.-1] c
[greater than or equal to] [[sigma].sup.2] c' [(1/4 X' X +
[x'.sub.*][x.sub.*]).sup.-1] c
This implies, according to Equations (4) and (5), that
var (c' [B.sub.II]) [greater than or equal to] var (c'
[B.sub.III]).
This result does not require any explanation. Addition of
observations to a given data set improves efficiency of the estimators.
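Both variance comparisons can be illustrated numerically from Equations (3), (4) and (5). The following NumPy sketch uses hypothetical random regressors, sets [[sigma].sup.2] to 1, and picks c so that c'B is the slope coefficient; the function name variance_matrices is an assumption introduced for illustration.

```python
import numpy as np

def variance_matrices(X, x_star, sigma2=1.0):
    """Variance-covariance matrices of B_I, B_II, B_III (Equations 3 to 5)."""
    m = x_star.shape[0] // 4
    X_star = x_star.reshape(m, 4, -1).sum(axis=1)
    V_I = 4.0 * sigma2 * np.linalg.inv(X.T @ X + X_star.T @ X_star)
    V_II = sigma2 * np.linalg.inv(x_star.T @ x_star)
    V_III = sigma2 * np.linalg.inv(0.25 * X.T @ X + x_star.T @ x_star)
    return V_I, V_II, V_III

rng = np.random.default_rng(3)
n, m = 8, 4
x2 = rng.uniform(0.0, 10.0, size=4 * (n + m))
x_all = np.column_stack([np.ones_like(x2), x2])

X = x_all[:4 * n].reshape(n, 4, -1).sum(axis=1)   # n yearly rows
x_star = x_all[4 * n:]                            # 4m quarterly rows
V_I, V_II, V_III = variance_matrices(X, x_star)

c = np.array([0.0, 1.0])   # linear combination selecting the slope
var_I, var_II, var_III = (float(c @ V @ c) for V in (V_I, V_II, V_III))
```

For any such data set var_III should not exceed var_I or var_II, in line with the two inequalities derived above.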
Illustration: One Explanatory Variable Case
Consider the case of one explanatory variable without intercept:
[y.sub.ti] = b [x.sub.ti] + [u.sub.ti]
In this case the variance of [b.sub.r] (r = I, II, III) can be
calculated as follows:
$$ \operatorname{var}(b_{I}) = \frac{\sigma^2}{\frac{1}{4} \sum_{t=1}^{n+m} \big( \sum_{i=1}^{4} x_{ti} \big)^2} $$
$$ \operatorname{var}(b_{II}) = \frac{\sigma^2}{\sum_{t=n+1}^{n+m} \sum_{i=1}^{4} x_{ti}^2} $$
$$ \operatorname{var}(b_{III}) = \frac{\sigma^2}{\frac{1}{4} \sum_{t=1}^{n} \big( \sum_{i=1}^{4} x_{ti} \big)^2 + \sum_{t=n+1}^{n+m} \sum_{i=1}^{4} x_{ti}^2} $$
Calling the denominator of var ([b.sub.r]) [D.sub.r] for
r = I, II and III, we can show that:
$$ D_{III} = D_{I} + \sum_{t=n+1}^{n+m} \sum_{i=1}^{4} (x_{ti} - \bar{x}_t)^2 $$
and
$$ D_{III} = D_{II} + \frac{1}{4} \sum_{t=1}^{n} \Big( \sum_{i=1}^{4} x_{ti} \Big)^2 $$
This implies that:
var ([b.sub.III]) < var ([b.sub.I]) unless [x.sub.ti] =
[[bar.x].sub.t] for t = n+1, ..., n+m and i = 1, ..., 4, and
var ([b.sub.III]) < var ([b.sub.II]) unless [4.summation over
(i=1)][x.sub.ti] = 0 for t = 1, ..., n.
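A small worked example may make the one-variable case concrete. The data below are hypothetical, with n = 2 years observed only as yearly sums and m = 1 year observed quarterly; the function name denominators is an assumption introduced for illustration. It computes the denominators of the three variances, so that var ([b.sub.r]) is [[sigma].sup.2] divided by the corresponding value.

```python
def denominators(x_quarterly, n, m):
    """Return (D_I, D_II, D_III) for the model y_ti = b*x_ti + u_ti.

    x_quarterly holds 4*(n+m) quarterly values, year by year. Only the
    yearly sums of the first n years are treated as observed.
    """
    years = [x_quarterly[4 * t:4 * t + 4] for t in range(n + m)]
    # Option I: all n + m years aggregated to yearly sums.
    d1 = 0.25 * sum(sum(q) ** 2 for q in years)
    # Option II: quarterly values of the last m years only.
    d2 = sum(v ** 2 for q in years[n:] for v in q)
    # Option III: yearly sums of the first n years pooled with Option II.
    d3 = 0.25 * sum(sum(q) ** 2 for q in years[:n]) + d2
    return d1, d2, d3

# Hypothetical data: two yearly-only years, then one year with quarters.
x = [1.0, 2.0, 3.0, 4.0,   2.0, 2.0, 2.0, 2.0,   1.0, 3.0, 5.0, 7.0]
D_I, D_II, D_III = denominators(x, n=2, m=1)

# Within-year variation of the quarterly year (its quarterly mean is 4).
within = sum((v - 4.0) ** 2 for v in x[8:])
```

Here D_III exceeds D_I by the within-year variation of the quarterly year, and exceeds D_II by one quarter of the sum of squared yearly totals of the first two years, so the pooled estimator has the largest denominator and hence the smallest variance.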
4. CONCLUDING REMARKS
Aggregation of quarterly data into yearly data for any part of the
sample period results in loss of efficiency. Using quarterly data alone
when additional yearly data are available also results in loss of
efficiency. The best use of the limited data is to pool yearly
observations with the quarterly observations with appropriate adjustment
for heteroscedasticity. The result can be generalized to include
seasonal effects in the regression equation. It can be shown that
pooling yearly data with a given set of quarterly data also improves
the efficiency of the estimated seasonal effects, although the yearly
data alone are useless for estimating seasonal effects. (3)
The research can be extended to develop tests for autocorrelation of first or fourth order (these are the most likely orders of
autocorrelation at quarterly level). Our preliminary research suggests
that it is quite complicated to determine the order of autocorrelation
with pooled data. Once the order of autocorrelation is determined, one
can use various procedures to improve the asymptotic efficiency of the
estimates.
Comments on "Combining Yearly and Quarterly Data in Regression
Analysis"
In this paper the author has proposed an alternative approach to
deal with the issue of non-availability of comparable data in regression
analysis. In particular, the paper deals with the problem where both
quarterly as well as annual observations on some variables are available
but the number of observations of each type is limited, as a
result of which it is difficult to estimate the desired relationship,
using either quarterly or annual data, with acceptable degrees of
freedom. Various approaches have been suggested in the literature to
deal with such problems. A common weakness of most of these approaches
is that they introduce measurement errors. This paper suggests that
instead of estimating the desired relationship by either using quarterly
or annual data the researcher should simply pool the quarterly and the
annual data with appropriate adjustment for heteroscedasticity. It is
claimed that the coefficients obtained using pooled data will have a
lower variance as compared to the estimators with yearly or quarterly
data.
To show the relative efficiency of the estimates obtained using
pooled data the author has made use of one of the standard theorems
available in the econometric literature. There is thus little room to
cast doubt on his claim. The paper, therefore, can be regarded as a
theoretical contribution in the area.
A general problem faced by many researchers in at least developing
countries is not that both quarterly as well as annual data are
available for the same variables but that for some variables quarterly
data are available while for others the annual data are available. In
these circumstances the technique suggested by the author is hardly of
any use. I wonder if we can somehow modify the suggested technique to
handle the above-mentioned problem which is more frequently faced by the
researchers.
Nadeem A. Burney
Pakistan Institute of Development Economics, Islamabad
REFERENCES
Chow, G. C., and A. L. Lin (1971). "Best Linear Unbiased
Interpolation, Distribution and Extrapolation of Time Series by Related
Series". Review of Economics and Statistics. Vol. LIII, No. 4. pp.
372-375.
Friedman, M. (1962). "The Interpolation of Time Series by
Related Series". Journal of the American Statistical Association. Vol.
57, pp. 729-757.
Hsiao, C. (1979). "Linear Regression Using Both Temporally
Aggregated and Temporally Disaggregated Data". Journal of
Econometrics. Vol. 10, No. 2, pp. 243-252.
Maddala, G. S. (1977). Econometrics. Tokyo: McGraw-Hill, Inc.
Palm, F. C., and T. E. Nijman (1982). "Linear Regression Using
Both Temporally Aggregated and Temporally Disaggregated Data".
Journal of Econometrics. Vol. 19, pp. 333-343.
(1) It is assumed that the matrix [X' [X'.sub.*]] has
full column rank k which is less than n + m.
(2) The matrix [x.sub.*] is assumed to have full column rank k <
4m.
(3) This result has been proved in the paper originally presented
in the general meeting. The paper had to be reduced due to the space
constraint.
EATZAZ AHMAD, The author is Assistant Professor in the Department
of Economics, Quaid-i-Azam University, Islamabad. He is grateful to
Professor F. T. Denton of McMaster University for his comments on an
earlier draft of this paper.