**MAX D. MORRIS***

Iowa State University

I congratulate and thank Sacks et al. for an interesting and thoughtful case study of model validation in an important application area. The authors offer an insightful description of the general process of testing a computer model against reality, but, more importantly, describe how they accomplished this task in a very specific, complex setting. In the development of new methodology, the "devil" is always in showing that the proposed ideas and techniques can be relevant to the "details" of real, important problems. Careful case studies, such as this one, are important steps toward improving the practice of model validation.

Each of the points I raise in this discussion has been addressed in some form and to some degree by the authors. I hope that my restatement and elaboration gives readers a useful alternative view of a few of the issues that must be faced when designing and interpreting a validation study.

I will focus my remarks on only a few aspects of the problem and model considered by Sacks et al. (at least in part to avoid the certain embarrassment that would otherwise arise because I do not have their extensive knowledge of traffic modeling). During any given period of time, real vehicles travel through the area studied by the authors, each experiencing some stop delay time at intersection approaches; the total of all such times across vehicles is a well-defined quantity *φ*. We have a clear general understanding of the physical process that gives rise to *φ*; individual vehicles arrive at the intersections corresponding to the entry nodes displayed in figure 1, negotiate their way through the grid, and exit or disappear into garages; given enough detail on the individual movement of each vehicle, it is a simple matter to calculate its contribution to total delay time. This simplified concept of reality might be denoted by

φ ← **R**(*t*,*u*;*c*)

where (with apologies to Sacks et al. for using notation not entirely consistent with their own) *t* denotes the exact and complete collection of arrival times at each entry node, *u* represents an extensive set of variables that fully characterizes each vehicle's destination and the rules it uses in reacting to its environment, and *c* represents the timing of the signal lights (that we will "control"). The notation "←" rather than "=" in the above expression indicates that this is our idea of how reality works, not necessarily the same thing as reality itself. Envisioned in this way, **R** is conceptually simple. In fact, a model could at least in principle be written that does *exactly* what **R** does, given *t*, *u*, and *c* as inputs.

However, models that require detailed values of *t* and *u* that match reality are of limited practical value because *t* cannot in practice be known before the time period of interest (and then only if impractically extensive measurements could be recorded at each entry node during that time period), and realistically *u* can never be known. Instead, simulation models like CORSIM are written with the idea that these quantities can be regarded as random processes, fully specified by a comparatively very small number of parameters. Rather than demanding the unattainable *t* and *u* as inputs, we define a model as a sort of stochastic generalization of **R**:

Φ ~ **M**(λ,*π*;*c*)

where *λ* and *π* are vectors that characterize distributions of random variables *T* and *U*, intended to represent the uncertainty in *t* and *u*, and so serving as the definition of a random variable Φ. The practical distinction between *t* and *u*, as discussed by the authors, is that it is sometimes possible to collect limited data directly related to the first, while distributions used to represent the second are usually set by "defaults." A computer program expressing **M** then serves two purposes: it generates a single realization of the random vectors *T* and *U*, and *then* evaluates **R** *as if* these were the actual values of *t* and *u*. It is worth rewriting **M** to emphasize this:

(*T,U*)~**D**(λ, *π*)

Φ = **R**(*T,U*;*c*)

where **D** expresses the distribution of *T* and *U*, given the input parameters. One execution of the model produces one realization of Φ, rather than the exact response that would follow from the fully defined deterministic inputs. (Here, I will use the capital letters *T*, *U*, and Φ to denote the distributions defined when *λ* and *π* are selected, the random variables defined by these distributions, and the realizations of these random variables produced when the model is evaluated; corresponding lowercase letters are used to denote data collected from the physical system.) The outputs from a number of repeated runs initiated with the same parameter inputs *λ* and *π* yield simulation estimates of the characteristics of the distribution of Φ, for example, mean, standard deviation, and quantiles. We may wish to give this distribution a frequentist interpretation, hoping that values of *φ* observed on similar days will look like a random sample from this distribution.
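The two-stage structure of **M** can be made concrete with a toy sketch. Everything in it is an assumption made for illustration and has nothing to do with CORSIM's internals: a single approach, Poisson arrivals with rate `lam` standing in for *λ*, an exponential reaction delay with mean `pi` standing in for *π*, and a signal that is red for the first half of each cycle `cycle` standing in for *c*. The function names are hypothetical.

```python
import random
import statistics

WINDOW = 3600.0  # length of the simulated period in seconds (assumed)


def sample_inputs(lam, pi, rng):
    """D: draw one realization (T, U) given the parameters lam and pi."""
    arrivals = []
    clock = rng.expovariate(lam)          # T: Poisson arrivals, rate lam per second
    while clock < WINDOW:
        arrivals.append(clock)
        clock += rng.expovariate(lam)
    # U: one reaction delay per vehicle, exponential with mean pi seconds
    reactions = [rng.expovariate(1.0 / pi) for _ in arrivals]
    return arrivals, reactions


def evaluate_r(arrivals, reactions, cycle):
    """R: deterministic total stop delay for given t, u, and signal cycle c."""
    red = cycle / 2.0                      # toy rule: red for the first half of each cycle
    total = 0.0
    for t, u in zip(arrivals, reactions):
        phase = t % cycle
        if phase < red:                    # vehicle arrives on red and must stop
            total += (red - phase) + u     # wait for green, then react
    return total


def run_model(lam, pi, cycle, rng):
    """M: one execution yields one realization of Phi."""
    arrivals, reactions = sample_inputs(lam, pi, rng)
    return evaluate_r(arrivals, reactions, cycle)


def reference_distribution(lam, pi, cycle, runs=200, seed=0):
    """Repeat M with fixed parameters and summarize the simulated Phi values."""
    rng = random.Random(seed)
    draws = [run_model(lam, pi, cycle, rng) for _ in range(runs)]
    cuts = statistics.quantiles(draws, n=20)   # 19 cut points: 5%, 10%, ..., 95%
    return {
        "mean": statistics.mean(draws),
        "sd": statistics.stdev(draws),
        "q05": cuts[0],
        "q95": cuts[-1],
    }


summary = reference_distribution(lam=0.1, pi=2.0, cycle=90.0)
print(summary)
```

Note that `evaluate_r` contains no randomness at all; every stochastic element lives in `sample_inputs`, which is the separation between **R** and **D** described above.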

Now, this formulation sounds fairly pedantic, but it may help in describing some very serious questions related to the model validation process.

Given validation data from a specific period, what distributions (*T,U*) should be used in generating the corresponding CORSIM outputs? If *t* or some portion of it can be collected along with *φ* during the test period, one possibility would be to calibrate **D** so that the distribution *T* is consistent with *t*, that is, compute an estimate of *λ* using these data. The authors note in the section on Data Collection that this is often regarded as forbidden. However, we make one fundamental operational simplification in writing **M**: we drop the demand to actually know the very extensive vector *t*, settling instead for the much easier-to-specify and hopefully "statistically similar" *T*. Unless the realizations of *T* generated in the process of evaluating **M** are in some sense credible relative to the value of *t* realized during the validation exercise, the apparent validation error between *φ* and Φ can be the result of

- a lack of structural integrity in **R**,
- problems with the distributional assumptions expressed in **D**,
- unrealistic specification of distribution parameter inputs *λ* and *π*, or
- any combination of these.

It seems to me that if the testing of model structure is of primary interest, *T* should be defined so as to correspond to *t* as closely as possible.

However, if the goal of validation is assessment of the *prediction capability* that would be obtained from using the model in realistic situations, then the use of *t* in specifying *T* should certainly be forbidden. In this case, the process of specifying *T* must be viewed as a "hard-wired" piece of the model, and validation must be carried out using *T* as it would be constructed in practice. Here, validation is actually a joint test against all three kinds of problems listed above, including those associated with the technique used to select inputs to characterize *T* for the prediction time period. The Bayesian sampling approach mentioned in the section on Analysis of Uncertainty would be one way to perform this joint assessment. However, improving any of the conceptual forms of **R**, the distributional form expressed in **D**, or the process used to select input parameters can potentially improve performance relative to this kind of validation. Because each kind of improvement requires a different kind of developmental effort, the "factoring" of predictive uncertainty corresponding to these sources is an important part of the validation process.

Practice that forbids fitting input parameters to data collected along with validation data may sometimes stem from fears of "overfitting" *T*, that is, customizing *T* so that *t* is "too typical" an outcome. This is certainly a legitimate concern, but may be secondary to the more fundamental issue of what is to be validated: the physically motivated **R**, **R** plus the operationally necessary **D**, or the full model plus the process of setting inputs.

The authors stress the important fact that selection of the output to be used in validation requires a balance between the *relevance* to the important questions and the *feasibility* of collecting the measurements. (I've used *φ* and Φ here to denote the measured and computed quantities being compared, respectively, even if they are actually functions of what the programmer might ordinarily call the model's output.) Hence, the authors select *stopped delay time* as the basis for comparing the model with reality, even though the much more difficult-to-measure *average link travel time* might be more relevant to questions concerning the timing of signals. This quandary exists anytime a model is produced to simulate physical circumstances that are difficult to examine directly, a common situation because such difficulty is often a major motivation for writing the model in the first place. Sometimes the discrepancy between what can realistically be measured and what is of most interest is even greater, for example, for models written to evaluate the reliability of nuclear weapons.

In the last section, I suggest that the Sacks et al. model validation is really a joint validation of **R**, **D**, and the method by which distributional parameters are set as inputs. Continuing this process, and agreeing with Sacks et al. that any validation must be done in the context of the purpose of the model, I think we may also need to consider the relationship between the output selected for measurement and comparisons and the output variables most critical in evaluating the success of setting *c*. Sacks et al. note that average link travel time and stopped delay are highly correlated (see the section on the Validation Process) and so at least informally consider this point.

In some settings, validation based on simultaneous comparisons of several outputs to various kinds of measurements may be possible. There may be few (or, in the case of the modeling of a nuclear weapon, no) measurements available that would be judged to be most relevant for the purposes of model use, a considerable quantity of data available corresponding to outputs of less relevance, and an intermediate quantity of data lying somewhere between these on some scale of relevance. Methodology that formally accounts for relationships between multiple sources of validation data, and for the fact that some are more relevant to the purposes of modeling than others, will be useful in such contexts.

Given specification of the input parameters by whatever means, repeated executions of **M** lead to a simulated "reference distribution" Φ. The validation exercise may be considered successful if the observed *φ* is a credible realization from this distribution. So, for example, the authors compare the "field" values with corresponding computed average and standard deviation values in tables 1 through 4. This amounts to a test of the hypothesis that *t* and *u* are drawn from the joint distribution characterized by **D** and the selected input parameters, and that **R** faithfully represents reality given *t* and *u*. However, even given effective specification of *T* and *U*, the authors remind us that "no simulator can be expected to capture real behavior exactly" (see the second section of the article); various details are always omitted, some intentionally and others through incomplete knowledge. Thus, what I have called the **R** section of CORSIM may not (and probably should not) contain explicit representations of the effects of emergency vehicles, thunderstorms, short-term construction work, and the use of cell phones by the drivers of some vehicles. A more detailed concept of reality that includes such phenomena, and so is perhaps closer in some sense to what happens in the streets, might be denoted
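One informal way to ask whether an observed *φ* is a "credible realization" of the simulated reference distribution is a two-sided Monte Carlo tail probability. The sketch below is an illustration only, not the comparison Sacks et al. report in their tables; the function name and the example numbers are hypothetical.

```python
def tail_probability(phi_observed, phi_simulated):
    """Two-sided Monte Carlo tail probability of phi_observed.

    phi_simulated holds realizations of Phi from repeated model runs.
    The +1 terms count the observation itself, so the result is never 0.
    """
    n = len(phi_simulated)
    at_or_below = sum(1 for x in phi_simulated if x <= phi_observed)
    at_or_above = sum(1 for x in phi_simulated if x >= phi_observed)
    p = 2.0 * min(at_or_below + 1, at_or_above + 1) / (n + 1)
    return min(1.0, p)


# Invented numbers: 99 simulated totals and two candidate field values.
simulated = [float(x) for x in range(1, 100)]
print(tail_probability(50.0, simulated))    # central value: large tail probability
print(tail_probability(1000.0, simulated))  # far outside the simulated range: small
```

A small value says only that *φ* is unusual relative to Φ; it cannot by itself say whether **R**, **D**, or the input parameters are at fault.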

φ ← **R**_{*}(*t*,*u*,*v*;*c*)

where *v* represents the additional specific deterministic details of these unmodeled subphenomena, and **R**_{*} is the more elaborate understanding of reality that takes these into account. Suppose for simplicity that *v* is parameterized so that **R**_{*} is the same as the simpler **R** when *v* = 0:

**R**_{*}(*t*,*u*,0;*c*) = **R**(*t*,*u*;*c*) ∀ *t*,*u*,*c*.

Hence, even if *T* and *U* effectively represent the physical variability of *t* and *u*, and our model expresses **R** perfectly, Φ may be inconsistent with *φ* because of the particular value of *v* at the time of validation. A strict "frequentist" might wonder whether the average of real-world *φ* values from a large number of days with identical *t* and *u*, but with *v* varying over some implied distribution *V*, might look like a reasonable realization of Φ. Related to this, we would consider

**R**_{*}(*t*,*u*,0;*c*) ≟ E_{*V*}[**R**_{*}(*t*,*u*,*V*;*c*)]

where equality would suggest that the model might be trusted to predict such averages. But this is likely not to be what the developer of the model had in mind, and in any case, the test would require data that are operationally or even theoretically impossible to collect. Still, if such omitted effects are actually present (and they nearly always are), they imply potential variability in *φ* that is not represented by the random variables in our model. This could mean that when *T* and *U* faithfully represent variation in *t* and *u*, Φ suggests less variation than should be attached to *φ*. Alternatively, it could lead to a situation in which the specified distributions *T* and *U* must have unrealistically large variances if the observed *φ*s are to "fill out" their matching calculated reference distributions.
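The claim that Φ understates the variation in *φ* can be sketched with the law of total variance. The derivation rests on assumptions that go beyond the text: *φ* is treated as **R**_{*}(*T*,*U*,*V*;*c*), *V* is taken to be independent of (*T*,*U*), and the averaging equality above is assumed to hold for every *t* and *u*, so that the conditional mean given (*T*,*U*) is exactly Φ = **R**(*T*,*U*;*c*).

```latex
\begin{align*}
\operatorname{Var}(\varphi)
  &= \operatorname{E}\bigl[\operatorname{Var}\bigl(\mathbf{R}_{*}(T,U,V;c)\mid T,U\bigr)\bigr]
   + \operatorname{Var}\bigl(\operatorname{E}\bigl[\mathbf{R}_{*}(T,U,V;c)\mid T,U\bigr]\bigr) \\
  &= \operatorname{E}\bigl[\operatorname{Var}\bigl(\mathbf{R}_{*}(T,U,V;c)\mid T,U\bigr)\bigr]
   + \operatorname{Var}(\Phi)
  \;\ge\; \operatorname{Var}(\Phi).
\end{align*}
```

The first term is the variability contributed by the omitted effects alone; it vanishes only if *v* has no influence on the outcome once *t* and *u* are fixed.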

Since the quantity and variety of data needed to fully answer these questions cannot typically be obtained experimentally, the pragmatic conclusion may be this: If it is important to predict both the mean *and* variability of *φ* for specified conditions, validation should be aimed at judging not only whether the observed *φ*s are close enough to their predictive means, but also, for example, whether their squared deviations from that mean agree with predictive variance, with the understanding that this does not automatically follow from getting the input distributions right (physically). The authors do the next best thing to checking the day-to-day variation of *φ* by looking at how some output quantities and measured quantities vary over time within a single validation period (see figure 4). This may be as close to comparing day-to-day distributions as can be achieved within the constraints imposed by sampling in this particular problem.
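A minimal sketch of such a joint check, assuming several validation cases are available, each with an observed *φ* and the predictive mean and standard deviation estimated from repeated model runs. The function name and the numbers are hypothetical and are not taken from the article's tables.

```python
def calibration_summary(cases):
    """Summarize standardized deviations of observed phi from predictions.

    cases: list of (phi_observed, predictive_mean, predictive_sd) tuples.
    mean_z near 0 suggests the predictive means are unbiased;
    mean_z2 near 1 suggests the predictive variance is about right,
    while values well above 1 point to variability the model omits.
    """
    z = [(phi - mean) / sd for phi, mean, sd in cases]
    n = len(z)
    return {
        "mean_z": sum(z) / n,
        "mean_z2": sum(value * value for value in z) / n,
    }


# Invented example: two cases, each one predictive sd away from its mean.
example = calibration_summary([(12.0, 10.0, 2.0), (8.0, 10.0, 2.0)])
print(example)
```

With only a handful of validation periods, both summaries are very noisy, which is one reason the within-period comparison the authors use may be the practical alternative.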

In thinking about experiments for validation of any kind of computer model (whether stochastic or not), it may be useful to remember a basic tenet of physical laboratory experimentation. A model cannot be expected to contain all the details of reality, but our hope is that it faithfully represents the major influences and effects associated with important and interesting characteristics of the system (in this case, the timing pattern of the signals). So, while it may be too much to ask that a model precisely predict the activity of a certain condition, we may hope that it usefully predicts the effect of changing the important characteristics in the absence of any other changes. Classical experimental design and analysis recognizes similar concepts in its use of experimental blocks and focuses on systematic differences among treatments within a block, rather than attempting to predict the result of a specific treatment in an unspecified block.

The authors take this approach when discussing the values of Δ in table 8. A simplified view of the experiment described in this paper is a two-treatment design (signal timing settings) within a single block corresponding to a single definition of *T* (since the authors assumed that "the conditions in the field for the September data collection would be the same as in May"). Viewed in this way, we realize that the field information addresses the effect of changing *c* at only one level of *T*. Would Δ be different at another *T* specification, for example, traffic conditions at another time of day? Traditional experiments are often designed under the assumption of additive block effects, in hopes that the answer to this question is "no." Additional experiments covering other (*T; c*) combinations, for example, more blocks and treatments, may be too expensive for practical considerations in studies of this kind. But without them, we are left assuming that the effects of *T* and *c* are additive, or understanding that our validation pertains only to the *T* we have specified.
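The additivity question can be illustrated with a two-block, two-treatment layout. The sketch below uses invented numbers, not values from table 8; the block labels, treatment labels, and function names are hypothetical. Under additive block effects the timing effect Δ is the same in every block, so the interaction contrast is zero.

```python
def timing_effect(responses, block, old="old", new="new"):
    """Delta within one block: response under new timing minus old timing."""
    return responses[(block, new)] - responses[(block, old)]


def interaction_contrast(responses, block_a, block_b):
    """Difference between the timing effects of two blocks (0 if additive)."""
    return timing_effect(responses, block_b) - timing_effect(responses, block_a)


# Invented total stop delays for two traffic conditions (blocks of T)
# and two signal plans (treatments c).
additive = {
    ("rush", "old"): 150.0, ("rush", "new"): 130.0,
    ("midday", "old"): 100.0, ("midday", "new"): 80.0,
}
interacting = {
    ("rush", "old"): 150.0, ("rush", "new"): 130.0,
    ("midday", "old"): 100.0, ("midday", "new"): 98.0,
}

print(interaction_contrast(additive, "rush", "midday"))     # same Delta in both blocks
print(interaction_contrast(interacting, "rush", "midday"))  # Delta depends on the block
```

With field data from a single block, only one `timing_effect` can be computed, so the contrast cannot be estimated at all; that is the limitation described above.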

The authors have selected morning and evening rush hours, undoubtedly the most important conditions when setting signal timing; perhaps it is sufficient for their purposes to certify that the model can predict the effect of changes in the signal timing pattern for these conditions. If it is anticipated that *T* and *c* have important "interactions" in reality, and if studies can be extended to cover other traffic conditions, a validation exercise of broader scope might be considered.

Finally, returning to the authors' statement that "no simulator can be expected to capture real behavior exactly," the most natural question to ask will generally not be whether **M** can be thought of as a universal replacement for measurements that are difficult or impossible to make in reality; this simply will not be the case. With careful development and tuning, we may hope for a model that does a respectable job *within some range of conditions*. But just as good "weather" models do not produce good "climate" forecasts and vice versa, models that do a generally good job of modeling traffic in some circumstances may be entirely unrealistic in others. And so a more useful (but more difficult) eventual endpoint of model validation may be the solution to an inverse problem: Under what set of circumstances is **M** a reliable representative of reality, or where in the space of input values can **M** be trusted?

As with the selection of outputs for validation, our ability to usefully answer this question depends both on the range of circumstances of interest and the range of circumstances over which we can expect to collect physical data. It is of little use to consider model validity outside of the first range, and meaningful comparisons will be very difficult, perhaps indirect, and sometimes impossible outside of the second. As with other experiments, however, the goal should be not just a simple answer to one question, but a collection of answers that indicate the sort of situations for which the model (or model-and-sampling process) might be presently "certified," and the identification of other settings within which further study or development is needed. Hence the authors' conclusion that: "CORSIM, though imperfect, is effective in evaluating signal plans in urban networks, at least under some restrictions."

All the issues I have noted here can be framed in other ways, and each can be described from entirely different viewpoints. I have variously referred to quantities as random, fixed-and-unmeasurable, or altogether absent as it fits my purposes, while a mechanistic approach might ignore all randomness except that used to define the model, and a full Bayesian approach might always see all quantities as random. Regardless of the perspective, the issues of how and which data are used in setting input values, how validation data are collected and compared with outputs, and how the agreement between outputs and validation data is assessed raise difficult questions. Sacks et al. have done an excellent job of carefully considering and addressing these questions in the context of a specific and important problem. This and other thoughtful exercises of this sort will be the building blocks from which new and better methods for model validation may be developed.

Max D. Morris, Departments of Statistics and Industrial and Manufacturing Systems Engineering, Iowa State University, Ames, IA 50011-1210. Email: mmorris@iastate.edu.