**Gary Davis*
University of Minnesota**

**Shimin Yang
The St. Paul Companies**

Predicted or estimated totals of traffic volume over one or more years are required in both highway pavement and safety engineering. While current recommended practices contain guidance on how to generate such estimates, they are less clear on how to quantify the uncertainty attached to the estimates. This paper describes an initial solution to this problem. Empirical Bayes methods are used to compute quantiles of the predictive probability distribution of the traffic total at a highway site, given a sample of daily traffic volumes from that site. Probable ranges and their associated probability values are readily found, and a point prediction of the total traffic can be obtained as the median, or 50th percentile, of the predictive distribution. The method of derivation can also be used to find the predictive density and the moments of the predictive distribution if needed. No data other than those routinely collected by statewide traffic monitoring programs are needed. A test comparing computed 90% credible intervals for annual traffic volume with the corresponding actual volume at 48 automatic traffic recorder sites showed that the actual coverage percentage was not significantly different from the nominal 90% value.

In engineering design, it is sometimes necessary to work with variables whose values are not known with certainty. In such cases, a rational compromise between over- and under-design requires first the determination of a probable range for a variable's outcome and then a design to accommodate values in this range. Implementation hinges on an acceptably accurate assessment of this probable range because if the assessed range is too narrow, the likelihood of a failure will be unacceptably high, but if the assessed range is too broad, resources will be expended in anticipation of improbable events. In addition, the users of scientific and engineering measurements often expect an assessment of the uncertainty attached to those measurements. The need for an assessment of uncertainty is revealed in the reporting of error bounds associated with opinion survey estimates, in the U.S. Supreme Court's recommendation that judges consider "known or potential rate of error" when determining the admissibility of expert scientific testimony (Foster and Huber 1999), and in the recommendation by the American Association of State Highway and Transportation Officials (AASHTO) that the "precision and bias" attached to traffic volume measurements be assessed and reported (AASHTO 1992).

The ability to justify an uncertainty assessment becomes especially important when either particular values of an estimate or the uncertainty range itself could be used by partisans to justify or oppose a controversial policy (Sarewitz and Pielke 2000). In a discussion of scientific predictions and social policy, Stewart (2000) found it useful to distinguish between two sources of uncertainty, which he called aleatory and epistemic uncertainties. Aleatory uncertainty arises when the outcomes of interest are governed by physically random processes that are at least in principle capable of generating stable long-run relative frequencies. Epistemic uncertainty, on the other hand, arises when our knowledge of underlying states of nature is incomplete. This duality in the notion of uncertainty appears to date back to the origin of modern ideas concerning probability (Hacking 1975) and is arguably the source of the current debate between Bayesian and frequentist views on the foundations of statistics (Howson and Urbach 1993).

Estimates or forecasts of the total traffic volume on a section of road for one or more years are used in both pavement design and traffic safety analysis and are also used to generate state- and nationwide estimates of total distance traveled. These are most often computed by multiplying an estimate of the road's mean daily traffic (MDT) volume by the number of days in the desired time horizon. For example, in pavement design (AASHTO 1993) estimates of MDT for each vehicle class are multiplied by 365 to obtain estimated yearly traffic totals, which are in turn used to predict the traffic loading over a pavement's design life. In traffic safety, the traffic exposure at a road site is computed by multiplying an estimate of MDT by the total number of days over which traffic accidents have been counted. For an intersection, these traffic totals are then summed over the intersection's approaches to give the total entering vehicles. For a highway section, the traffic total is multiplied by the section's length, producing an estimate of total vehicle kilometers of travel. Clearly, aleatory uncertainty is attached to an estimate of total traffic because, even if we knew a site's MDT exactly, the estimated traffic total and the actual total would likely differ, due to the unpredictable decisions of individual travelers. Epistemic uncertainty is present when the true MDT is not known exactly but has been estimated from a sample of daily traffic counts. AASHTO's (1993) recommended pavement design method explicitly allows for aleatory uncertainty as one of the components making up the overall variation term in the pavement design equation; however, epistemic uncertainty is not addressed. In traffic safety, current practices address the uncertainty in estimated accident rates due to the random nature of accident counts but do not appear to consider the contributions of either aleatory or epistemic uncertainty when estimating exposure (see Parker 1991).

Draper (1995) has illustrated how an accounting of multiple sources of uncertainty can be accomplished using Bayesian statistical methods, and in this paper we will consider the problem of assessing the uncertainty attached to estimates or forecasts of the total traffic volume. The second section will illustrate, using a simple example, how a Bayesian approach can be used to combine the contributions of aleatory and epistemic uncertainties into one assessment. The reasoning illustrated in that section will then be applied in the next section to develop an expression for the predictive distribution of a traffic total, which can be used to compute both point and interval estimates. The fourth section will then describe an initial empirical evaluation of this estimation method, and the final section will present conclusions. The development described in the third section draws heavily on past research into statistical models for time series of daily traffic counts (Davis and Guan 1996) and on weak convergence results for sums of lognormal random variables (Marlow 1967).

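Consider the problem of estimating the total traffic volume over a period of $N$ days, using a short count collected with a portable traffic counter. Let

$z_t$ = traffic volume on day $t$,

$z^N = \sum_{t=1}^{N} z_t$, the total traffic volume over days $t = 1, \ldots, N$,

$z_l$ = traffic count on the $l$-th sample day, $l = 1, \ldots, n$,

$\bar{z} = \frac{1}{n}\sum_{l=1}^{n} z_l$, the sample average,

$\mathbf{z}$ = the $n$-dimensional vector containing the sample counts.

The ultimate objective is to estimate the total traffic volume $z^N$ from the traffic count sample $\mathbf{z}$. For this example, suppose the daily counts are independent normal random variables with unknown mean $\mu$ (the MDT) and known variance $\sigma^2$. Bayes Theorem then combines a prior density $f(\mu)$ with the likelihood of the sample to give the posterior density of $\mu$,

$$f(\mu \mid \mathbf{z}) \propto f(\mathbf{z} \mid \mu)\, f(\mu). \qquad (1)$$

In particular, if $f(\mu)$ is noninformative in the sense of being uniformly distributed on the real line, it can be shown that the posterior uncertainty concerning $\mu$ is characterized by a normal distribution with mean equal to the sample average $\bar{z}$ and variance equal to $\sigma^2/n$ (Box and Tiao 1973). The joint effect of aleatory and epistemic uncertainty can then be determined by treating $\mu$ as a nuisance parameter and integrating it out of the joint density for $z^N$ and $\mu$:

$$f(z^N \mid \mathbf{z}) = \int f(z^N \mid \mu)\, f(\mu \mid \mathbf{z})\, d\mu \qquad (2)$$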

(Draper 1995). Here $f(z^N \mid \mathbf{z})$ denotes the predictive probability density of the total traffic given the count sample, while $f(z^N \mid \mu)$ denotes the predictive probability density of the total traffic when the MDT is known. For this example, closed form evaluation of (2) is possible (Box and Tiao 1973), leading to the conclusion that the predictive probability density is normal with mean equal to $N\bar{z}$ ($N$ times the sample average) and variance given by

$$N\sigma^2 + \frac{N^2\sigma^2}{n},$$

where $\sigma^2$ is the day-to-day variance of the counts and $n$ is the number of days in the sample.

In this case, the contributions to the total variance attributable to aleatory and epistemic uncertainty can be separated, with $N\sigma^2$ as variance due to aleatory uncertainty and $N^2\sigma^2/n$ as variance due to uncertainty concerning the MDT. Interestingly, while the variance due to aleatory uncertainty increases linearly with the number of days in the traffic total, the variance due to epistemic uncertainty increases quadratically. To see the relative contributions of these sources, suppose that we seek to predict one year's total traffic volume using a 10-day sample count and that the daily traffic volume has an MDT of 1,000 vehicles per day and a coefficient of variation equal to 0.1. The day-to-day variance would be $\sigma^2 = (0.1 \times 1{,}000)^2 = 10{,}000$, so the standard deviation due to aleatory uncertainty would be $\sqrt{365 \times 10{,}000}$, which equals 1,910 vehicles. The standard deviation due to epistemic uncertainty would be $\sqrt{365^2 \times 10{,}000/10}$, equal to 11,540 vehicles. This example illustrates how epistemic uncertainty can be the dominant source of error and that neglecting its contribution can lead to a serious overstatement of a prediction's precision.
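The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the simple normal example above, assuming the variance splits into an aleatory part (number of days times the day-to-day variance) and an epistemic part (number of days squared times the day-to-day variance, divided by the sample size); the function and variable names are ours and are not part of the original study.

```python
import math

def total_sd_components(n_days, n_sample, mdt, cv):
    """Standard deviations of a traffic total under the simple normal example.

    n_days   : number of days in the total being predicted
    n_sample : number of days in the short count
    mdt      : mean daily traffic (vehicles per day)
    cv       : coefficient of variation of the daily counts
    """
    sigma2 = (cv * mdt) ** 2                                 # day-to-day variance
    aleatory = math.sqrt(n_days * sigma2)                    # grows with sqrt(n_days)
    epistemic = math.sqrt(n_days ** 2 * sigma2 / n_sample)   # grows with n_days
    combined = math.sqrt(aleatory ** 2 + epistemic ** 2)
    return aleatory, epistemic, combined

aleatory_sd, epistemic_sd, combined_sd = total_sd_components(365, 10, 1000.0, 0.1)
print(f"aleatory sd:  {aleatory_sd:,.0f} vehicles")
print(f"epistemic sd: {epistemic_sd:,.0f} vehicles")
print(f"combined sd:  {combined_sd:,.0f} vehicles")
```

Under these assumptions the combined standard deviation is only slightly larger than the epistemic component alone, which is another way of seeing that the short sample, not day-to-day randomness, dominates the uncertainty.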

Because the main objective of this paper is to show how a more complete accounting of uncertainty can be added to current traffic monitoring practices, we describe these practices next. The chief purpose of a traffic monitoring program is to generate estimates of MDT on each of a jurisdiction's road segments. Ideally, this is done with year-round counting on each segment, but the cost of installing and maintaining such a comprehensive traffic monitoring system is prohibitive. Therefore, MDT estimates on the majority of road segments are obtained from samples gathered using portable traffic counters. Since traffic volumes vary systematically throughout the course of the year as well as across the days of the week, averages computed from short count samples are generally biased estimates of full year averages. However, if the magnitude of the bias is known, adjustments can be made. To determine these adjustments, most states employ a small number of permanent automatic traffic recorders (ATRs) placed on a representative sample of road segments. The daily traffic counts from the ATRs are used to cluster the ATRs into factor groups such that daily traffic volumes at sites in a factor group show similar seasonal and day-of-week variation patterns. The ATR counts are also used to estimate the seasonal and day-of-week factors characterizing each group. Each non-ATR road section is then assigned to one of these factor groups, and the variation factors characterizing the assigned group are used to adjust the short-count sample, providing a better estimate of the section's MDT. It is currently recommended that a suitably adjusted short count of 48 hours produces an estimate of MDT with acceptable precision (AASHTO 1992; USDOT FHWA 1995).
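As an illustration of the conventional adjustment described above, the following sketch applies one common multiplicative form of seasonal and day-of-week factors to a 48-hour count. The factor values, the day labels, and the function name are hypothetical and chosen only for illustration; an agency's actual factors would come from its ATR data.

```python
def adjusted_mdt(short_counts, month_factor, dow_factors):
    """Estimate MDT from a short count using multiplicative adjustment factors.

    short_counts : list of (daily_count, day_of_week) pairs
    month_factor : ratio of annual mean volume to the mean for the count month
    dow_factors  : dict mapping day-of-week to the ratio of the overall mean
                   volume to the mean for that day
    """
    adjusted = [count * month_factor * dow_factors[dow]
                for count, dow in short_counts]
    return sum(adjusted) / len(adjusted)

# Hypothetical 48-hour count taken on a Tuesday and Wednesday in a busy month.
counts = [(1200, "Tue"), (1250, "Wed")]
mdt_estimate = adjusted_mdt(counts, month_factor=0.9,
                            dow_factors={"Tue": 1.02, "Wed": 1.00})
print(f"adjusted MDT estimate: {mdt_estimate:.1f} vehicles per day")
```

The estimate is only as good as the factors: if the site is assigned to the wrong factor group, the adjustment itself introduces error, which is the adjustment error discussed next.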

At least two sources of potential error can cause an estimated MDT to differ from a section's true (but unknown) MDT: sampling error, arising anytime the estimate is based on less than a complete census of the section's traffic volumes, and adjustment error, arising if the factors used to adjust the short-count sample differ from those which actually describe the sampled section's variation pattern. In a recent review of MDT estimation, Davis (1997b) pointed out that much of the earlier research used to justify the use of short counts for estimating MDT tended to underestimate the potential effect of adjustment error, and an analysis of the potential contributions from the two error sources indicated that adjustment error can plausibly be two to three times larger than sampling error. This analysis was consistent with recent empirical work by Sharma et al. (1996), which investigated the effect of adjustment errors in estimating MDT, as well as with work which highlighted the error caused by applying adjustment factors developed for traffic dominated by passenger cars to estimate the MDT of heavy trucks (Hallenbeck and Kim 1994; Cambridge Systematics 1994). The review also pointed out that both sampling and adjustment error can be explicitly accounted for within a hierarchical statistical model of the process generating the daily traffic counts and that this model can be used to develop an empirical Bayes (EB) estimator of MDT, which does not require that each roadway section be assigned a priori to a factor group (Davis and Guan 1996; Davis 1997a). Rather, a structure similar to that shown in equations (1) and (2) is used, in which the sample data are used to assess the posterior probabilities that the sampled site belongs to each factor group. The MDT is then estimated as a weighted average, with the factor group probabilities providing the weights.
The next section describes how this hierarchical modeling approach can be extended to develop a method for computing the predictive distribution of a site's total traffic volume, rather than its MDT, given a traffic count sample at the site.
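The weighted-average idea behind the EB estimator of MDT can be sketched as follows. This is a simplified illustration, not the estimator of Davis and Guan (1996): it assumes that a log-likelihood of the sample under each factor group and a group-specific MDT estimate are already in hand, and all numerical values and function names are hypothetical.

```python
import math

def posterior_weights(log_likelihoods, priors=None):
    """Posterior group probabilities from per-group log-likelihoods."""
    k = len(log_likelihoods)
    if priors is None:
        priors = [1.0 / k] * k               # equal prior probability per group
    log_post = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    peak = max(log_post)                     # subtract the maximum for stability
    unnorm = [math.exp(v - peak) for v in log_post]
    total = sum(unnorm)
    return [v / total for v in unnorm]

def weighted_mdt(group_mdt_estimates, weights):
    """MDT estimate as a probability-weighted average over factor groups."""
    return sum(w * est for w, est in zip(weights, group_mdt_estimates))

# Hypothetical log-likelihoods and group-specific MDT estimates for 3 groups.
log_liks = [-12.4, -10.1, -15.0]
mdt_by_group = [1080.0, 1130.0, 990.0]
weights = posterior_weights(log_liks)
mdt_estimate = weighted_mdt(mdt_by_group, weights)
print([round(w, 3) for w in weights], round(mdt_estimate, 1))
```

Because no single group is selected, a sample that fits two groups almost equally well yields an estimate between the two group-specific values rather than committing to one of them.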

The two steps exemplified in equations (1) and (2), using Bayes Theorem to assess the information provided by a sample and then integrating out nuisance parameters, provide the basic framework for deriving the predictive distribution of traffic totals from more realistic assumptions. In the above example, we derived a predictive probability density, but here we will focus on the corresponding predictive distribution function $F(z^N \mid \mathbf{z})$. The distribution function is more useful from a practical standpoint since it leads immediately to a method for finding the quantiles of the predictive distribution by solving equations of the form $F(z^N \mid \mathbf{z}) = \alpha$ for a chosen probability $\alpha$. Since the expression for the cumulative distribution function turns out to have the form of a weighted average, an argument similar to that employed below could also be used to find the predictive probability density or the moments of the predictive distribution.

**Aleatory Uncertainty**

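We will develop an explicit expression for $F(z^N \mid \mathbf{z})$ in several steps. As in the example, the total traffic count $z^N$ is determined as the sum of the daily counts,

$$z^N = \sum_{t=1}^{N} z_t,$$

but the logarithm of each daily count is now modeled as

$$y_t = u + \sum_{i=1}^{12} \Delta_{t,i}\, m_{k,i} + \sum_{j=1}^{7} \delta_{t,j}\, w_{k,j} + e_t, \qquad (4)$$

where

$y_t = \log_e(z_t)$, the natural logarithm of a daily count,

$u$ = expected log traffic count on a typical day,

$\Delta_{t,i}$ = 1, if the count $z_t$ was made during month $i$, $i = 1, \ldots, 12$, and 0 otherwise,

$m_{k,i}$ = correction term for month $i$, characteristic of factor group $k$,

$\delta_{t,j}$ = 1, if the count $z_t$ was made on day-of-week $j$, $j = 1, \ldots, 7$, and 0 otherwise,

$w_{k,j}$ = correction term for day-of-week $j$, characteristic of factor group $k$,

$e_t$ = random error.

If we let $\boldsymbol{\beta}_k$ denote a column vector containing the monthly and day-of-week adjustment terms for factor group $k$, and $\mathbf{x}_t$ denote the row vector containing the corresponding indicators $\Delta_{t,i}$ and $\delta_{t,j}$, equation (4) can be written in a slightly simpler form:

$$y_t = u + \mathbf{x}_t \boldsymbol{\beta}_k + e_t.$$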

In the above model, the mean value of the logarithm of the daily count varies according to month and day-of-week, and the magnitude of these variations depends on the factor group to which the site of interest belongs. Analysis of the regression residuals obtained after estimating the adjustment terms indicated that the error terms $e_t$ were not independent but showed day-to-day dependencies, which could be described by a multiplicative autoregressive (AR) model driven by a noise sequence $\{a_t\}$. Here the $a_t$ are independent, identically distributed, normal random variables with zero mean and common variance, and the autoregressive coefficients are site-specific.
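The mean structure of the log-count model described above can be sketched in code. The sketch assumes a 19-element adjustment vector laid out as 12 monthly terms followed by 7 day-of-week terms (Monday first); that layout, the function names, and the numerical values are our assumptions for illustration, and the error term is left out entirely.

```python
from datetime import date

def indicator_row(day):
    """19-element 0/1 row: 12 month indicators, then 7 day-of-week indicators."""
    row = [0] * 19
    row[day.month - 1] = 1          # month indicator, January = position 0
    row[12 + day.weekday()] = 1     # day-of-week indicator, Monday = position 12
    return row

def mean_log_count(day, u, beta):
    """Expected log count: typical-day level plus month and day-of-week terms."""
    row = indicator_row(day)
    return u + sum(x * b for x, b in zip(row, beta))

# Hypothetical factor group: July is 0.10 above typical, Sunday 0.20 below.
beta = [0.0] * 19
beta[6] = 0.10       # July
beta[12 + 6] = -0.20 # Sunday
expected = mean_log_count(date(1991, 7, 7), u=7.0, beta=beta)  # a Sunday in July
print(round(expected, 3))
```

Each day activates exactly one month term and one day-of-week term, so the model is an ordinary linear regression on indicator variables once the logarithms of the counts are taken.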

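The above model is parameterized by $u$, a mean-value parameter; $\boldsymbol{\beta}_k$, the monthly and day-of-week adjustment terms; the variance of the $e_t$ terms, which we will denote by $\sigma^2$; and the autoregressive coefficients, which we will denote collectively by $\boldsymbol{\phi}$. In the next step, we will assume we know the values of these parameters but nothing else about the site. Properties of lognormal random variables (Shimizu and Crow 1988) can be used to show that the expected value of the total traffic volume is

$$\mu_N = E\left[z^N\right] = \sum_{t=1}^{N} \exp\left(u + \mathbf{x}_t\boldsymbol{\beta}_k + \frac{\sigma^2}{2}\right) \qquad (7)$$

and the variance of the total traffic volume is

$$\sigma_N^2 = \mathrm{Var}\left[z^N\right] = \sum_{s=1}^{N}\sum_{t=1}^{N} \exp\left(2u + \mathbf{x}_s\boldsymbol{\beta}_k + \mathbf{x}_t\boldsymbol{\beta}_k + \sigma^2\right)\left(\exp\left(\sigma^2\rho_{|s-t|}\right) - 1\right). \qquad (8)$$

Here $\rho_{|s-t|}$ denotes the correlation between $e_s$ and $e_t$. Under conditions verified in the appendix, the distribution of $(\mu_N/\sigma_N)\left(\log_e(z^N) - \log_e(\mu_N)\right)$ converges to that of a standard normal random variable, implying that for large $N$, $\log_e(z^N)$ is approximately normal with mean $\log_e(\mu_N)$ and standard deviation $\sigma_N/\mu_N$, so that

$$P\left[\log_e(z^N) \le y\right] \approx \Phi\left(\frac{\mu_N}{\sigma_N}\left(y - \log_e(\mu_N)\right)\right),$$

where $\Phi$ denotes the standard normal distribution function.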
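The moment calculations and the normal approximation for the logarithm of the total can be sketched numerically. The code below uses the standard lognormal moment formulas for a sum of correlated lognormal variables and, as a simplifying assumption that differs from the multiplicative AR model used in the paper, a first-order autocorrelation function; the parameter values and function names are hypothetical.

```python
import math
from statistics import NormalDist

def total_mean_and_sd(log_means, sigma2, rho):
    """Mean and standard deviation of a sum of correlated lognormal counts.

    log_means : mean of log(z_t) for each day in the total
    sigma2    : variance of the log-scale errors
    rho       : function giving the lag-h correlation of the log-scale errors
    """
    n = len(log_means)
    mean_total = sum(math.exp(m + sigma2 / 2.0) for m in log_means)
    var_total = 0.0
    for s in range(n):
        for t in range(n):
            scale = math.exp(log_means[s] + log_means[t] + sigma2)
            # expm1 keeps precision when the correlation is very small
            var_total += scale * math.expm1(sigma2 * rho(abs(s - t)))
    return mean_total, math.sqrt(var_total)

def log_total_quantile(p, mean_total, sd_total):
    """Quantile of log(total) under the large-N normal approximation."""
    approx = NormalDist(mu=math.log(mean_total), sigma=sd_total / mean_total)
    return approx.inv_cdf(p)

# Hypothetical site: constant log-mean, log-scale variance 0.01, AR(1) errors.
log_means = [math.log(1000.0)] * 365
sigma2 = 0.01
mu_n, sd_n = total_mean_and_sd(log_means, sigma2, rho=lambda h: 0.5 ** h)
lower = log_total_quantile(0.05, mu_n, sd_n)
upper = log_total_quantile(0.95, mu_n, sd_n)
print(round(mu_n), round(sd_n), round(lower, 4), round(upper, 4))
```

These quantiles reflect aleatory uncertainty only, since the parameters are treated as known; the epistemic contribution enters in the next step.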

**Epistemic Uncertainty**

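The sample $\mathbf{z}$ contains two types of information concerning $z^N$: information about the site's model parameters and, because daily counts are correlated, information about those days in the total that lie close in time to the sample days. If the second type is neglected, the parameters can be treated as nuisance parameters and integrated out, as in equation (2), giving

$$F(z^N \mid \mathbf{z}) \approx \int F(z^N \mid u, \boldsymbol{\beta}, \sigma^2, \boldsymbol{\phi})\, f(u, \boldsymbol{\beta}, \sigma^2, \boldsymbol{\phi} \mid \mathbf{z})\; du\, d\boldsymbol{\beta}\, d\sigma^2\, d\boldsymbol{\phi}. \qquad (11)$$

When the sample counts are correlated with counts comprising the total $z^N$, the expression (11) will only be approximate, with the approximation deteriorating with increasing overlap between the sample counts and the counts entering into the total. In principle, smoothing algorithms could be used to include dependency on $\mathbf{z}$.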

The final steps involve characterizing the posterior distribution of the model parameters given the sample and then finding a computationally feasible way to evaluate the (multidimensional) integral in (11). It turns out, however, that this problem is very similar to the problem of computing Bayes estimates of mean daily traffic described in Davis (1997a) and Davis and Guan (1996), and a similar solution can be employed here. The essence of this approach is to assess the prior uncertainty concerning the model parameters, and then use Bayes Theorem to account for information provided by the data sample.

As in Davis (1997a), we will assume that the highway agency has divided its road segments into a set of $m$ factor groups and that estimates of the adjustment factors for each group, $\hat{\boldsymbol{\beta}}_k$, $k = 1, \ldots, m$, are available. We will further assume that the agency maintains a total of $M$ ATRs and that for each ATR estimates of the covariance parameters, $(\hat{\sigma}_r^2, \hat{\boldsymbol{\phi}}_r)$, $r = 1, \ldots, M$, are also available. Straightforward procedures for computing these estimates from ATR data, using commonly available software packages, are described in Davis (1997a). Prior to collecting any data for a site, we will assume that our uncertainty concerning that site's parameters is captured by the prior probability distribution

$$f(u, \boldsymbol{\beta}, \sigma^2, \boldsymbol{\phi}) \propto \left[\frac{1}{m}\sum_{k=1}^{m} I_{\hat{\boldsymbol{\beta}}_k}(\boldsymbol{\beta})\right] \left[\frac{1}{M}\sum_{r=1}^{M} I_{(\hat{\sigma}_r^2,\, \hat{\boldsymbol{\phi}}_r)}(\sigma^2, \boldsymbol{\phi})\right],$$

where $I_b(\cdot)$ is an indicator function, with $I_b(a) = 1$ if $a = b$ and $I_b(a) = 0$ otherwise.

Basically, this prior assumes that before collecting data we are completely uncertain of the value of $u$ in the sense that our prior probability is uniformly distributed on the real line. For the adjustment term $\boldsymbol{\beta}$, we are certain it takes on one of the values $\hat{\boldsymbol{\beta}}_k$ characterizing our factor groups, but we are equally uncertain which of these is correct. Similarly, for $(\sigma^2, \boldsymbol{\phi})$ we are certain one of the sets of values estimated from our ATR sites is correct, but prior to collecting data we are equally uncertain which one. Completing the specification of this prior by generating estimates of the $\hat{\boldsymbol{\beta}}_k$ and the $(\hat{\sigma}_r^2, \hat{\boldsymbol{\phi}}_r)$ from ATR data results in an empirical Bayes (EB) method, in the sense of Padgett and Robinson (1978). That is, empirical distributions from samples are used to form the priors.

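Because the logarithms of the traffic counts are normal random variables, the likelihood function of the sample is easy to specify. Letting $\mathbf{y}$ denote the vector containing the logarithms of the sample counts and $V$ denote the correlation matrix of $\mathbf{y}$ (which can be computed once the value of the AR parameters $\boldsymbol{\phi}$ is known), then if we knew the site-specific values for the parameters $u$, $\boldsymbol{\beta}$, $\sigma^2$, and $\boldsymbol{\phi}$, the likelihood of the sample could be computed using the appropriate multivariate normal density,

$$f(\mathbf{y} \mid u, \boldsymbol{\beta}, \sigma^2, \boldsymbol{\phi}) = (2\pi\sigma^2)^{-n/2}\, |V|^{-1/2} \exp\left(-\frac{1}{2\sigma^2}\,(\mathbf{y} - u\mathbf{1}_n - X\boldsymbol{\beta})^{T}\, V^{-1}\, (\mathbf{y} - u\mathbf{1}_n - X\boldsymbol{\beta})\right).$$

Here $X$ is a matrix of dimension $n \times 19$, each row having elements equal to 0 or 1 according to the month and day-of-week of the corresponding sample count, while $\mathbf{1}_n$ is an $n$-dimensional column vector with each element equal to 1.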
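A likelihood of this kind can be evaluated without forming or inverting the correlation matrix when the errors follow a simple autoregressive structure. The sketch below assumes consecutive daily counts and first-order AR errors, which is a simplification of the multiplicative AR model used in the paper; under that assumption the multivariate normal density factors into a marginal term for the first day and conditional terms for the remaining days. The function name and example values are hypothetical.

```python
import math

def ar1_log_likelihood(y, means, sigma2, phi):
    """Log-likelihood of consecutive-day log counts with AR(1) errors.

    y      : observed log counts on consecutive days
    means  : model means of the log counts for the same days
    sigma2 : marginal variance of the errors
    phi    : lag-one autocorrelation, with abs(phi) < 1
    """
    resid = [obs - mu for obs, mu in zip(y, means)]
    # First residual: marginal normal with variance sigma2.
    loglik = -0.5 * (math.log(2.0 * math.pi * sigma2) + resid[0] ** 2 / sigma2)
    # Later residuals: normal about phi * previous, variance sigma2 * (1 - phi^2).
    cond_var = sigma2 * (1.0 - phi ** 2)
    for prev, cur in zip(resid, resid[1:]):
        innov = cur - phi * prev
        loglik += -0.5 * (math.log(2.0 * math.pi * cond_var) + innov ** 2 / cond_var)
    return loglik

# Hypothetical one-week sample of log counts around a constant mean of 7.0.
y = [7.05, 6.98, 7.10, 7.02, 6.95, 6.80, 6.75]
ll = ar1_log_likelihood(y, [7.0] * 7, sigma2=0.01, phi=0.4)
print(round(ll, 3))
```

Evaluating such a log-likelihood once for each combination of candidate adjustment terms and covariance parameters supplies the quantities that Bayes Theorem needs in order to weight the combinations.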

**Predictive Distribution of Total Traffic**

Applying Bayes Theorem to the prior and likelihood to obtain the posterior distribution for the parameters, substituting this into (11), and performing the indicated integrations produces, after some tedious algebra,

where $y^N = \log_e(z^N)$, and where the mean and variance of the total traffic are as defined in (7) and (8) but evaluated using the estimates for the corresponding factor group and ATR site. The distribution given in (14) is a finite mixture of normal distributions where the weights given to the mixture components are the posterior probabilities that the sampled site has adjustment factors and covariance parameters characteristic of each of the $m$ factor groups and each of the $M$ ATR sites. Although the expressions in (14) and (15) appear rather forbidding, the implied computations are readily carried out on a personal computer.
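Once the mixture weights and the component means and standard deviations are available, quantiles of a finite mixture of normal distributions can be found by a simple root search on its distribution function. The sketch below uses bisection; the three components and their weights are hypothetical stand-ins, not values computed from the expressions in the paper.

```python
from statistics import NormalDist

def mixture_cdf(y, weights, means, sds):
    """Distribution function of a finite mixture of normal distributions."""
    return sum(w * NormalDist(m, s).cdf(y) for w, m, s in zip(weights, means, sds))

def mixture_quantile(p, weights, means, sds, tol=1e-10):
    """Solve mixture_cdf(y) = p for y by bisection."""
    lo = min(m - 10.0 * s for m, s in zip(means, sds))
    hi = max(m + 10.0 * s for m, s in zip(means, sds))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, weights, means, sds) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical predictive mixture for the log of an annual traffic total.
weights = [0.6, 0.3, 0.1]
means = [12.80, 12.86, 12.70]
sds = [0.04, 0.05, 0.06]
lower = mixture_quantile(0.05, weights, means, sds)
median = mixture_quantile(0.50, weights, means, sds)
upper = mixture_quantile(0.95, weights, means, sds)
print(round(lower, 4), round(median, 4), round(upper, 4))
```

The 5th and 95th percentiles bound a 90% credible interval for the log total, and exponentiating the median gives a point prediction of the total itself. Bisection is used here because the distribution function is monotone, so the search cannot fail once the bracket contains the solution.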

As noted above, the distribution (14) approximates the predictive distribution
of a total traffic count, the approximation being appropriate when predicting the
total of a large number of days (for example, a year or more) from a small sample
(for example, two weeks or less). In an earlier study, Davis (1997a) used traffic
counts from the year 1992 from 50 ATRs in outstate Minnesota to estimate monthly
and day-of-week adjustment terms for the Minnesota Department of Transportation's
(Mn/DOT) 3 outstate factor groups, as well as covariance parameters for each of the
50 ATRs. These estimates were then used to construct the discrete prior distributions
for the adjustment terms and the covariance parameters, giving
*m*=3 and *M*=50. In addition, daily counts from the year 1991 were available for 48 ATRs, and for each of these ATRs a sample consisting of a one-week count from the month of March and a one-week count from the month of July was drawn. The 1992 data were used for estimation, and 1991 data were used for validation because more ATRs had good data in 1992. A MATLAB (Mathworks 1992) program for evaluating (14) was written, and then for each of the 48 ATRs, the 5th and 95th percentile points of the predictive distribution of the logarithm of the 1991 total traffic volume were computed by embedding this routine inside MATLAB's root-finding algorithm. Finally, the logarithm of the total 1991 actual traffic volume was also computed for each ATR. The results of these computations are displayed in
tables 1 through 3.

Note that the 5th and 95th percentile points describe the bounds of a 90% credible interval, and, clearly, if a large number of actual traffic counts fell outside the bounds of our intervals, we would have evidence for inaccurate prediction. On the other hand we would still expect a few actual counts to fall outside our bounds. If the intervals caught all actual volumes, we would be inclined to believe that the computed credible intervals were too large. If the approximation is acceptably accurate, we would expect the actual count to fall outside the bounds 10% of the time, and a test of the adequacy of the estimated credible bounds can be made by treating the number of missed totals as the outcome of a binomial random variable with 48 trials and a hypothesized miss probability of *p*=0.1. Inspection of the tables shows that for 8 of the ATRs (2, 8, 12, 204, 208, 217, 218, and 226) the actual count fell outside the estimated bounds, for a total of 8 binomial "successes." Since the probability of obtaining 8 or more successes by chance is 0.102, this result is not inconsistent with the hypothesis that equation (14) provides a reasonable approximation of the predictive distribution.
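The binomial calculation behind this test is easy to reproduce. The sketch below computes the probability of 8 or more misses in 48 independent trials with a miss probability of 0.1; the function name is ours.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

p_value = binomial_upper_tail(8, 48, 0.1)
print(f"P(8 or more misses out of 48) = {p_value:.3f}")
```

With an expected 4.8 misses under the hypothesis, observing 8 is on the high side but not unusual enough to reject the nominal 90% coverage at conventional significance levels.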

Predicted or estimated traffic totals are required in both highway pavement and safety engineering and are used to produce statewide and nationwide estimates of total distance traveled. Although recommended practices exist for estimating traffic totals as part of a traffic monitoring program, it is less clear how we should characterize the uncertainty associated with these estimates. This paper describes an initial solution to this problem, in which empirical Bayes methods are used to compute the quantiles of a traffic total's predictive distribution, given a sample of daily traffic volumes. Probable ranges and their associated probability values are readily found, and, if desired, a point prediction of the total traffic can be obtained as the median, or 50th percentile, of the predictive distribution. The method of derivation can also be used to find the predictive density and the moments of the predictive distribution. No data are required beyond that routinely collected by statewide traffic monitoring programs, and the estimates of the factor group adjustment parameters can be computed using standard linear regression methods. All other computations have been successfully implemented as MATLAB macros.

In conclusion, almost all engineering decisions must be made in the face of uncertainty, and the art of successful engineering requires cost-effective hedging against this uncertainty. It was argued earlier that standard methods for predicting total traffic ignore potentially important sources of error, and, hence, understate the resulting uncertainty characterizing estimates and predictions. Many of the statistical procedures used in highway engineering date to the middle part of the 20th century and are based on simplified statistical models adapted to the computational constraints of those times. Statistical science has advanced considerably since then, and these advances can support and encourage the use of more realistic models in highway engineering. This paper proposes a modest step in this direction by providing a computationally practical method which accounts for uncertainty in traffic volume predictions. Of course, the importance of hedging against uncertainty depends on the consequences of error, and, fortunately, so far the consequences attached to using mistakenly precise traffic forecasts have not been too severe. Whether or not this state of affairs continues is of course another uncertain prediction about the future.

The authors would like to thank Mark Flinner of Mn/DOT for providing the traffic-count data used in this study. This research was sponsored by Mn/DOT. However, all facts, conclusions and opinions expressed here are solely the responsibility of the authors and do not necessarily reflect the views of Mn/DOT.

AASHTO. 1992. *Guidelines for Traffic Data Programs.* Washington, DC.

____. 1993. *Guide for Design of Pavement Structures.* Washington, DC.

Box, G. and G. Tiao. 1973. *Bayesian Inference in Statistical Analysis.* New York: Wiley and Sons.

Brockwell, P. and R. Davis. 1991. *Time Series: Theory and Methods.* New York: Springer-Verlag.

Cambridge Systematics. 1994. *Use of Data from Continuous Monitoring Sites.* Report to Federal Highway Administration, U.S. Department of Transportation, Washington, DC.

Davis, G. 1997a. *Estimation Theory Approach to Monitoring and Updating Average Daily Traffic.* Report 97-05, Minnesota Department of Transportation, St. Paul, MN.

Davis, G. 1997b. Accuracy of Estimates of Mean Daily Traffic: A Review. *Transportation Research Record* 1593:12-6.

Davis, G. and Y. Guan. 1996. Bayesian Assignment of Coverage Count Locations to Factor Groups and Estimation of Mean Daily Traffic. *Transportation Research Record* 1542:30-7.

Draper, D. 1995. Assessment and Propagation of Model Uncertainty. *Journal of the Royal Statistical Society* B 57:45-97.

Foster, K. and P. Huber. 1999. *Judging Science: Scientific Knowledge and the Federal Courts.* Cambridge: MIT Press.

Gallant, R. and H. White. 1988. *A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models.* New York: Basil-Blackwell.

Hacking, I. 1975. *The Emergence of Probability.* Cambridge: Cambridge University Press.

Hallenbeck, M. and S-G. Kim. 1994. *Truck Loads and Flows, Task A Final Technical Report.* Washington State Transportation Research Center (TRAC), Seattle, WA.

Howson, C. and P. Urbach. 1993. *Scientific Reasoning: The Bayesian Approach.* Chicago: Open Court.

Johnson, N., S. Kotz, and N. Balakrishnan. 1994. *Continuous Univariate Distributions, Volume 1.* Second edition. New York: Wiley and Sons.

Marlow, N. 1967. A Normal Limit Theorem for Power Sums of Independent Random Variables. *Bell System Technical Journal* 46:2081-9.

Mathworks. 1992. *MATLAB Reference Guide,* Version 4. Mathworks, Inc.

Padgett, W. and J. Robinson. 1978. Empirical Bayes Estimators of Reliability for Lognormal Failure Models. *IEEE Transactions on Reliability* R-27:223-336.

Parker, M. 1991. *Highway Safety Engineering Studies: Procedural Guide.* Report No. FHWA-HI-88-039. Federal Highway Administration, U.S. Department of Transportation, Washington, DC.

Sarewitz, D. and R. Pielke. 2000. Prediction in Science and Policy. *Prediction: Science, Decision Making, and the Future of Nature.* Washington, DC: Island Press.

Sharma, S., B. Gulati, and S. Rizak. 1996. Statewide Traffic Volume Studies and Precision of AADT Estimates. *ASCE Journal of Transportation Engineering* 122:430-9.

Shimizu, K. and E. Crow. 1988. History, Genesis, and Properties. *Lognormal Distributions: Theory and Applications.* New York: Marcel Dekker.

Stewart, T. 2000. Uncertainty, Judgment, and Error in Prediction. *Prediction: Science, Decision Making, and the Future of Nature.* Washington, DC: Island Press.

U.S. Department of Transportation (USDOT), Federal Highway Administration (FHWA). 1995. *Traffic Monitoring Guide, Third Edition.* Washington, DC.

**Weak Convergence of Sums For a Class of Correlated Lognormal Random Variables**

As above, let *z _{t}*,

and the error terms {*e _{t}*} follow a stationary

then

To verify conditions (a) and (b), we will impose the restriction that the monthly and day-of-week factors for any given day are bounded from above and also bounded away from zero. That is, there exist constants such that

for all *t* and *k*. It is then possible to show that

where

whenever the are
the autocorrelations for a stationary AR(*p*) process. Similarly,

so that

and condition (a) is satisfied.

If the daily counts *z _{t}* were independent, we could use either the Lyapunov or Lindeberg central limit theorems to verify condition (b), as was done by Marlow (1967). A more general central limit theorem, allowing for dependence of the sort generated by the AR(

then condition (b) will be satisfied, and we are done.

1) Let , and
since *z _{t}* is lognormal, its fourth central moment is known, so that

2) This condition is satisfied trivially since

implies

3) This condition is also satisfied trivially since the fact that the noise
process {*e _{t}*} is a stationary AR(

4) This follows from the fact, demonstrated above, that

Gary A. Davis, Department of Civil Engineering, University of Minnesota, 122 CivE, 500 Pillsbury Drive SE, Minneapolis, MN 55455. Email: drtrips@tc.umn.edu.