Using spreadsheets to conduct Monte Carlo experiments for teaching introductory econometrics. (Targeting Teaching).
Craft, R. Kim
1. Introduction
In recent years a number of authors have begun to use Monte Carlo
simulation to help explain elementary concepts in statistics and
econometrics (e.g., Gujarati 1999, pp. 79-81 and 159-161; Albright,
Winston, and Zappe 2000, pp. 191-194; Studenmund 2001, pp. 101-102).
Furthermore, Kennedy (1998a, b) has argued that the Monte Carlo experiment is an indispensable pedagogical tool for the undergraduate
econometrics course. In particular, he believes that Monte Carlo methods can be used to illuminate the idea of a sampling distribution, a
fundamental concept that is often difficult for students to grasp.
Kennedy maintains that because the notion of a sampling distribution is
the "statistical lens" that makes other statistical concepts
clear, and because Monte Carlo methods provide a superior vehicle for
acquiring this lens, introductory students should be required to
describe a Monte Carlo experiment related to "every major topic in
the course" (Kennedy 1998b, p. 148). Kennedy further contends,
however, that intro ductory students should not be asked to conduct
their own Monte Carlo experiments because of the high opportunity cost
of learning to program standard econometric software.
I concur with Kennedy's assertion that teaching statistical
concepts in the context of a Monte Carlo study is an enormously valuable
pedagogical technique. I also agree that the marginal benefit derived
from requiring students to program Monte Carlo experiments with standard
econometric software, rather than to only demonstrate their ability to
conceptualize the experiment, probably exceeds the marginal cost for
most undergraduates. Nevertheless, it is clear that students can derive
substantial learning benefits from conducting their own Monte Carlo
experiments. It is well known that many students learn best through
experiencing and experimenting, and even students with strong abstract
reasoning abilities usually find experiential learning exercises
helpful. The process of developing and experimenting with a simulation
model can provide such experiential learning opportunities with
otherwise very abstract statistical concepts; but the opportunity cost
is a problem.
I have found that the benefit can substantially exceed the cost
when introductory students use spreadsheets to perform their own Monte
Carlo experiments. As I hope to demonstrate below, the opportunity cost
of learning associated with conducting a Monte Carlo experiment with a
spreadsheet can be relatively low. Nearly all of my econometrics
students (juniors and seniors in finance and economics, most of whom
have had a prior course in spreadsheet modeling) are familiar with
spreadsheets and find the environment very natural. Thus, they learn the
necessary spreadsheet commands quickly and are then able to focus on the
purpose of the exercise rather than on programming. In addition,
instructors can control the cost to some degree by supplying spreadsheet
templates and directions tailored to the background of the students and
the goals of the exercise (see Cahill and Kosicki 2001 for a useful
discussion of this point). The use of spreadsheets to conduct Monte
Carlo experiments in introductory courses can therefo re overcome the
high-opportunity-cost problem to a large extent.
The use of spreadsheets to teach Monte Carlo simulation offers
additional learning benefits because the environment is especially
intuitive and user friendly. For instance, the spreadsheet setting
invites exploration: Once a model is created, the effects of changing a
parameter value can be investigated by simply entering a new number in a
cell and pressing a key. Spreadsheet models can also be superior
teaching tools because of the way data are displayed. Realizations of
random events can be organized, annotated, and presented in a tabular
form that undergraduates can easily relate to. With a well-designed
spreadsheet model, students can almost see samples being drawn and
estimates being created. As opposed to the "black box" nature
of standard econometric software, these models can be very explicit,
with each step connected in a straightforward way.
Finally, there are important positive externalities associated with
the use of spreadsheets to teach Monte Carlo methods. Most of the
spreadsheet techniques needed for a Monte Carlo study are useful in a
variety of other modeling situations, ranging from economic theory
(e.g., Cahill and Kosicki 2000) to personal finance (e.g., Holden and
Womack 2000). Students therefore learn practical spreadsheet modeling
skills along with econometrics. Moreover, since spreadsheet software is
widely available, students are more likely to later use their knowledge
of spreadsheets and Monte Carlo methods in a work environment. Because
they recognize these benefits (with help from the instructor), most
students are quite willing to invest in learning the necessary
spreadsheet skills.
Along with the numerous authors who have promoted the use of
spreadsheets in teaching economic theory, Judge (1999) introduced a
simple method for conducting Monte Carlo experiments with spreadsheets.
The approach proposed in this paper differs from Judge's approach
in at least three important ways. First, the method outlined here allows
for random regressors. Although it is not useful for most research
applications, sampling from a population of regressors adds an element
of realism (for cross-sectional analysis) that facilitates student
comprehension--my students are able to get a better grasp of the meaning
and significance of the "fixed in repeated samples" assumption
when I contrast fixed regressors with random regressors in a Monte Carlo
context. Another distinguishing feature of the approach presented here
involves the method of sampling. I believe it is both more intuitive and
more efficient than that used with Judge's approach; in particular,
it is easier to draw larger samples and to repeat experime nts under
alternative scenarios. Finally, because it is easier to repeat
experiments under alternative scenarios, the approach proposed here
facilitates student experimentation and enables instructors to
accomplish more with any Monte Carlo exercise.
My objectives in the remainder of this paper are to (1) describe an
effective method of conducting Monte Carlo experiments in Excel and (2)
discuss some of the ways in which I have used spreadsheet-based Monte
Carlo experiments successfully in teaching introductory econometrics
courses.
2. Monte Carlo Simulation with Excel
In this section, I briefly outline how to conduct a Monte Carlo
experiment in Microsoft Excel with a specific example involving a
cross-sectional application. More detailed instructions designed for
students, including a number of spreadsheet modeling tips and
guidelines, can be found on the course Website at
http://www.suu.edu/faculty/craft/EC4260/MonteCarlo.htm. The Website also
contains corresponding Excel templates and other supporting material.
The presentation is based on a lecture and computer demonstration
that I give students in preparation for their first Monte Carlo
assignment. The initial demonstration and discussion usually take nearly
one week of class meetings, but I believe it is time well spent for
introductory students. I typically assign the first Monte Carlo exercise
in the fourth or fifth week of the semester, after students are familiar
with the basics of the classical multiple linear regression model and
just before hypothesis testing. It is assumed that most have a basic
understanding of spreadsheets. I teach the Monte Carlo experiment as a
six-step process, which differs slightly from that typically outlined by
other authors. Each step is described below with reference to the
spreadsheet model displayed in Figures 1 and 2.
The Data-Generating Process
To begin the illustration, assume that the true data-generating
process is a classical linear regression model with two explanatory variables:
Y = [[beta].sub.0] + [[beta].sub.1][x.sub.1] +
[[beta].sub.2][x.sub.2] + u.
where the error term u is independently distributed Normal(0,
[[sigma].sup.2]). Further, assume that the true parameter values are,
say, [[beta].sub.0] = 25, [[beta].sub.1] = 6, [[beta].sub.2] = -2, and
[[sigma].sup.2] = 2500. Any parameter values can, of course, be chosen.
To make the example more concrete, I sometimes let the dependent
variable represent something real, like the selling price of a house,
and then choose the parameter values accordingly. Depending on the
purpose of the exercise, it can also be useful to let the error term
follow some nonnormal distribution.
Explanatory Variables
While a fixed set of regressors can be specified, and it is
computationally simpler to do so, it is both more realistic and more
intuitive for students to randomly draw the explanatory variables from a
population. Moreover, undergraduates generally find it easier to grasp
the notion of sampling from a finite population rather than from an
infinite one. Students are therefore provided with a
"population" of explanatory variables such as that displayed
in columns C and D of Figure 1, in which the population size is 750. A
much larger population is feasible, but I like to keep the in-class
simulation small to speed up the demonstration. The independent
variables in Figure 1 were created using Excel's random number
generator, although the assumed population values can be obtained in
different ways, such as pretending that some real-world sample of data
is actually a census.' The explanatory variables can also be made
to have certain properties if that is important for the exercise. For
example, the variables can be constructed so that they exhibit a certain
degree of collinearity by generating one as a linear function of the
other: [x.sub.i2] = f([x.sub.i1]) + -[epsilon].sub.i], where f(.) is a
linear function and [[epsilon].sub.i] is a stochastic term with desired
properties.
Obtaining a Random Sample
The stage of the exercise in which a random sample is obtained
consists of two steps: (i) generating the dependent variable according
to the assumed data-generating process and (ii) obtaining a sample from
the population. While these steps can be reversed, this ordering seems
to be most natural for students. Excel will produce a random draw from a
normal distribution with mean [mu] and standard deviation with the
formula =NORMINV(RAND(),[mu][sigma]). This formula is entered into the
error term cells of the spreadsheet (cells E10 : E759), with the
arguments being referenced to fixed parameter value cells. The dependent
variable is then generated for each element of the population with
standard spreadsheet functions (cells B10 B759), again by referencing
cells containing the appropriate population parameters. Note that once
the formulas are entered for the first element, the remainder can be
entered by simply copying and pasting if relative and absolute cell
references are set properly.
The model will draw a new error term for each individual and
consequently generate a new set of dependent variables every time the
spreadsheet is recalculated. Excel can be set to recalculate formulas
automatically whenever a change is made in the spreadsheet or to
recalculate manually. When developing simulation models, it is generally
best to have the spreadsheet set on manual calculation to avoid
unnecessary recalculations.
Next, a random sample of n observations, { ([x.sub.i1], [x.sub.i2],
[y.sub.i]) i 1,2,..., n], is drawn from the artificially created
population using simple random sampling with replacement. Here I think
it is worthwhile to briefly mention some of the sample selection issues
that can arise in practice and to remind students that the Gauss-Markov
theorem requires random sampling. A random selection can be made from
the sampling frame with the Excel formula =INT(N*RAND())+ 1, which
yields a uniformly distributed random integer from 1 to N. This formula
is entered into cells G4 through G43 of the spreadsheet model (Figure 1)
to obtain a sample of 40.
After n random numbers have been drawn, each is matched with the
corresponding population element and the data are extracted. One way to
accomplish this in Excel is with the VLOOKUP formula. The syntax for
VLOOKUP is =VLOOKUP(lookup value, table, column number), where lookup
value is the number that you want Excel to find in the first column of a
table of data, table is a reference to the table of information from
which you want to take data, and column number is the column in table
from which to take the data. This formula is entered into those cells in
which the sample observations and variables are to be displayed. For
instance, cell 14 of the model contains VLOOKUP($G4,$A1O: $D$759,2),
which says to find the number in the first column of the table defined
by cell range A10 : D759 that matches the value in cell G4 and return
the corresponding number from the second column of table AlO : D759.
This process yields the first observation on the dependent variable. The
remaining observations and variables are o btained similarly. For
example, cell J4 returns [x.sub.11] with the formula = VLOOKUP($G4,$Al0
$D$759,3).
Estimating the Parameters
A regression model is now estimated from the sample data. The Excel
array formula = LTNEST(range of y, range of x, const, stats), where x
represents all of the explanatory variables arranged adjacent to each
other on the spreadsheet, returns ordinary least squares (OLS)
regression estimates and certain statistics. (2) The argument const is a
logical value: If const is "true" (equal to 1) or omitted, an
intercept term is included in the regression. The argument stats is also
a logical value: If stats is "true" (equal to 1), LINEST
returns certain regression statistics. Figure 2 shows the output
generated by the LINEST formula for our example, where stats is
"true." It is important to note that LINEST is an array
formula. Array formulas can return more than one value, and the way in
which they must be entered is different from that for regular formulas.
(3) For instance, to generate the regression output illustrated in
Figure 2, select the cell range P3:R7, enter the formula LINEST(I4 :
I43,J4 : K43,1,1). (onet ime), and then, with the cell range still
selected, press CTRL + SHIFT + ENTER. The formula will be automatically
entered into each cell in the selected range with braces placed around
it, indicating that it is an array formula, and regression output will
be returned for the specific data set. To change an array formula,
select the entire range of cells, make the change, and then press CTRL +
SHIFT + ENTER again.
Running the Simulation
At this point, a new population is created, a new sample is
selected, and new OLS estimates are obtained each time the spreadsheet
is recalculated. In a classroom demonstration, it can be very helpful to
run the simulation manually several times and let students observe the
process while discussing the meaning and importance of the sampling
distribution of [[beta].sub.j].
Although it was not designed for this specific purpose, the Excel
data table command provides a simple way to run the simulation a large
number of times and record the results (see the course Web page for a
supplement called "On Using the Excel Data Table" for more
information). (4) To illustrate, the following describes how to program
Excel to draw 500 different samples, obtain OLS results for each sample,
and store the corresponding values of [[beta].sub.1],
se([[beta].sub.1]), and [R.sup.2] on another worksheet (while the
simulation results can be placed anywhere, I like to put them on a
separate worksheet to help drive home the sometimes difficult point that
a sample of estimates is distinct from a sample of data). With most
personal computers, it would be practical to perform a much larger
simulation if desired, but 500 is adequate for the example. First, name
the original worksheet Model and open a new worksheet and name it
Results. (5) Next, on the Results sheet, create a table with rows
labeled 1 thro ugh 500 and three columns labeled [[beta].sub.1],
se([[beta].sub.1]), and [R.sup.2]. In the first row of the table, just
under each column heading, place a reference to the cell in the Model
worksheet that contains the value of [[beta].sub.1], se([[beta].sub.1]),
and [R.sup.2]. This can be accomplished by typing "=" and then
clicking on the appropriate cell in the Model worksheet. Figure 3
illustrates the new worksheet.
To implement the data table command, select the entire table,
including the column with the row labels (cell range A3 : D502 in Figure
3); click on the Data menu and select Table; in the Table window, leave
"Row input cell" blank and in "Column input cell"
enter any empty cell on the Results worksheet outside of the table;
click OK. Excel will fill in the body of the table with the formulas
(=TABLE(,column input cell)) and will initially show the values from the
first row. When F9 is pressed, Excel will put each row number from the
first column of the table sequentially into the column input cell,
recalculate the spreadsheet, and store the new values for
[[beta].sub.1], se([[beta].sub.1]), and [R.sup.2]. This produces results
for 500 OLS regression runs, as illustrated in Figure 3.
Analyzing the Results
The bottom rows of Figure 3 indicate some of the summary measures
that might be used to investigate the results. It is also fairly
simple--and typically very helpful for students--to create histograms of
the estimates or statistics under consideration.
3. Class Discussion
I normally spend one or two entire class periods simply discussing
results from the initial demonstration. (I also spend some class time
discussing the results and implications of each Monte Carlo exercise
assigned.) Depending on students' familiarity with spreadsheets, it
may be a good idea to postpone a detailed discussion until after they
have worked through the first Monte Carlo experiment so that they will
be in a better position to focus on the meaning of the experiment rather
than on the details of creating the spreadsheet. In either case, the
following are some class discussion points and experiments that I have
found to be useful following the introduction of Monte Carlo simulation:
* Compare and contrast the sample with the sampling distribution.
In my experience, students frequently confuse these concepts (because
they do not understand the sampling distribution concept). Given a set
of Monte Carlo results, one can more clearly demonstrate how and why
they are distinctly different things.
* Note that the experimental results support the Gauss-Markov
theorem, discuss why they should, and take this opportunity to review
the meaning and importance of the theorem.
* Discuss a histogram of [[beta].sub.j]s representing draws from a
sampling distribution. Discuss why it appears to be approximately
normally distributed. Change the random error term to follow some
nonnormal distribution (a uniform distribution is easy to specify) and
compare the resulting [[beta].sub.j] sampling distribution to the
original. Discuss the implications for hypothesis testing.
* Compare and contrast the standard deviation of the sample of
[[beta].sub.j].s with the OLS standard error, and discuss the
interpretation and properties of se([[beta].sub.j]).
* Consider the variation in the [R.sup.2] statistic and discuss how
even when the model is well specified, the OLS regression line might fit
well or not so well depending on the variance of the error term and the
characteristics of a particular sample (i.e., a low [R.sup.2] does not
necessarily imply a "bad" model).
* Run an experiment over again with a larger sample size and
compare histograms associated with each sampling distribution.
* Compare and contrast alternative estimators using summary
statistics and histograms generated from a Monte Carlo simulation.
4. Monte Carlo Exercises
My course Web page includes spreadsheet-based Monte Carlo
assignments that I have successfully used with undergraduate
econometrics students. In addition, Kennedy (1998c) suggests numerous
Monte Carlo exercises, most of which could easily be implemented using
the spreadsheet framework described in this paper. With the first
exercise, I typically provide a set of detailed instructions on how to
conduct Monte Carlo simulation in Excel and a spreadsheet template similar to Figures 1 and 2 (see the course Web page).
In the introductory course, I generally focus on exercises that
illustrate the effects of violations of the basic assumptions, the
sampling distribution concept, and other fundamental principles. The
objective of the first assignment, for instance, is to demonstrate the
effects of an omitted variable when it is strongly correlated with the
included variable(s) and when it is not. I provide students with a
population regression model with two independent variables, complete
with parameter values and an error distribution, and two different
populations of explanatory variables. The explanatory variables are
highly correlated in one population and much less so in the other (but I
do not state this). Students are instructed to conduct a Monte Carlo
experiment with each population estimating [[beta].sub.1] first with
[x.sub.2] included and then with [x.sub.2] omitted (i.e., to purposely estimate an underspecified model). The students are then asked a series
of questions regarding the properties of the OLS estimato r in each
case. They are also asked to explain why the underspecified model yields
an OLS estimator that is more biased in one experiment than in the
other.
As they work through the first Monte Carlo exercise, most of my
students quickly become comfortable with the necessary spreadsheet
techniques, although I am often asked to debug their models (recall that
nearly all of my students have had a prior course in spreadsheet
modeling). And while the students have no problem with the idea that
excluding a relevant variable causes OLS to be "wrong," except
in special circumstances, their answers to the first assignment often
indicate either a superficial or an incorrect understanding of basic
principles. For instance, after completing the Monte Carlo exercise just
described, students still
make comments such as, "the estimate is biased because the
number is too small," "the estimate is more accurate when the
standard error of the explanatory variable is smaller," or even,
"it is important that the error term not be correlated with the y
variable."
Thus, nearly all students need more time and experience to fully
grasp the sampling distribution concept and other fundamental principles
that I hope to illustrate through Monte Carlo simulation. It is
therefore both beneficial and inexpensive at this point to assign Monte
Carlo exercises to illustrate any econometric topic that deserves
emphasis. This helps to drive home specific concepts and also
deepens--or awakens--understanding of fundamental principles at the same
time.
The second Monte Carlo assignment might be designed to demonstrate
the effects of multicollinearity and how one can get an imprecise estimate of a regression parameter when [R.sup.2] is high and vice
versa. This could be accomplished by asking students to compare and
contrast Monte Carlo results from two scenarios: In the first case,
specify a relatively large value for [[sigma].sup.2] along with a
population with little collinearity among the independent variables; in
the second case, specify the opposite conditions. Along with
illustrating the implications of multicollinearity, this experiment
helps students to understand the practical meaning of [R.sup.2], since
they tend to become overly concerned with its magnitude, and also
provides another opportunity to illustrate the sampling distribution
concept.
While students with less spreadsheet experience may struggle with
the techniques, it is important to note that the learning economies
involved in developing Monte Carlo experiments in Excel are substantial.
Once students have done one experiment, additional exercises require
much less effort with respect to spreadsheet modeling; students are
therefore better able to focus on the main purpose of the exercise.
Thus, it seems that the marginal benefit is increasing while the
marginal cost is decreasing over the first three or four exercises.
5. Conclusions
In his introductory text, Wooldridge (2000, p. iii) states that
"Not only are undergraduates capable of learning about econometrics
in a way that reflects modern practice, but this approach enhances their
interest in and enjoyment of the subject." Making Monte Carlo
experiments an integral part of the introductory course can help support
this objective in several ways: first, developing an understanding of
Monte Carlo methods leads to a deeper understanding of fundamental
principles of econometrics and statistics; second, Monte Carlo
experiments can help illustrate key econometric concepts and results;
third, conducting Monte Carlo experiments gives students hands-on
experience that is both instructive and enjoyable; and finally,
experience with simulation gives beginning students an understanding of
an important method of econometric research.
While the educational benefits of having students conduct their own
Monte Carlo experiments are substantial, in the past the practicality of
doing so has been limited because of the high opportunity cost involved
in learning to program. More recently, however, it has become possible
to conduct Monte Carlo experiments using standard spreadsheet software.
Because most students have experience with spreadsheets, pick up new
spreadsheet modeling techniques rather quickly, and generally find the
spreadsheet environment to be intuitive and user friendly, the
opportunity cost involved in having students conduct Monte Carlo
experiments with spreadsheets is comparatively small. This paper has
described a method for conducting meaningful Monte Carlo experiments
with Excel and has discussed how it can be used as an effective teaching
tool for introductory econometrics students.
(1.) The Excel formula RAND() generates a uniformly distributed
pseudorandom number between 0.0 and 1.0 (technically. 0 [less than or
equal to] RAND() < 0.999).
(2.) The Regression Analysis tool in Excel's Analysis ToolPak
cannot be used here because it will not automatically recalculate the
regression parameters during the resampling process.
(3.) Look op "array formula" or "=linest" in
the Excel online help catalog for more information.
(4.) The process of resampling and analyzing the results can be
greatly facilitated by commercial spreadsheet add-ins designed for this
purpose, such as Crystal Ball or @Risk. My objective here is to show how
to conduct a Monte Carlo simulation experiment using only the inherent
capabilities of Excel.
(5.) To create and name the Results worksheet, (i) click the Insert
menu and select Worksheet, (ii) click the Format menu and select
Sheet-Rename, (iii) type the name of the new worksheet, and (iv) click
anywhere on the new worksheet or press [Enter].
References
Albright, S. Christian, Wayne L. Winston, and Christopher J. Zappe.
2000. Managerial statistics. Pacific Grove, CA: Duxbury.
Cahill, Miles, and George Kosicki. 2000. Exploring economic models
using Excel. Southern Economic Journal 66:770-92.
Cahill, Miles, and George Kosicki. 2001. A framework for developing
spreadsheet applications in economics. Social Science Computer Review
19:186-200.
Gujarati, Damodar. 1999. Essentials of econometrics. Boston, MA:
Irwin McGraw-Hill.
Holden, Craig W., and Kent L. Womack. 2000. Spreadsheet modeling in
finance and investments courses. FEN Educator (online) 5(5). Available
at: http://papers.ssm.com/sol3/papers.cfm?abstract id=241708.
Judge, Guy. 1999. Simple Monte Carlo studies on a spreadsheet.
Computers in Higher Education Economics Review (online) 13(2). Available
at: http://econltsn.ilrt.bris.ac.uk/cheer/ch13_2/ch13_2pl2.htm.
Kennedy, Peter E. 1998a. Teaching undergraduate econometrics: A
suggestion for fundamental change. American Economic Review 88:487-91.
Kennedy, Peter E. 1998b. Using Monte Carlo studies for teaching
econometrics. In Teaching undergraduate economics: Alternatives to chalk
and talk, edited by W. E. Becker and M. Watts. Aldershot, UK: Edward
Elgar pp. 141-59.
Kennedy, Peter E. 1998c. A guide to econometrics. 4th edition.
Cambridge, MA: MIT Press.
Studenmund, A. H. 2001. Using econometrics: A practical guide. 4th
edition. Boston, MA: Addison Wesley Lougman.
Wooldridge, Jeffrey M. 2000. Introductory econometrics. Cincinnati,
OH: South-Western College Publishing.
R. Kim Craft *
* School of Business, Southern Utah University, Cedar City, UT
84720, USA; E-mail
[email protected].
The author appreciates helpful comments and suggestions from Joe
Baker, David Tufte, and two anonymous referees. Remaining errors are the
author's alone.