Data systems produced within a DOT agency are created to fulfill user needs.
Users can be inside or outside DOT. The data are compiled to satisfy
an external user need, to measure success toward a strategic goal (internal user),
or to serve as a tool necessary to perform work toward a goal (internal user). Data
system planning consists of four stages: collecting user needs, developing
objectives for the system, translating those objectives into data requirements,
and planning the top-level methods that will be used to acquire the data.
2.1 Data System Objectives
- A "data system" is any collection of information that is used
as a source by any Government entity to disseminate information to the public,
along with the associated planning, collection, processing, and evaluation. A data
system can cover any combination of information treated as a single system
for the sake of documentation and other guideline issues.
- The "system owner" as used in these guidelines is the organizational
entity whose strategic plan and budget will guide the creation or continued
maintenance of the data system.
- "Users" of a data system are people or organizations who use
information products that incorporate data from the system, either in raw
form or in statistics. "Major Users" of the data system are system
users identified as such in strategic plans and legislation supporting the
creation and maintenance of the data system. "User needs" should
be in the form of questions that specific users want to be answered.
- "Objectives" of the data system describe what federal programs
and external users will accomplish with the information.
- Stating system objectives in clear, specific terms, identifying data users and
key questions to be answered by the data system, will help guide the system
development to produce the results required.
- Just as user needs change over time, the objectives of the data system
will need to change over time to meet new requirements.
- Users will benefit from knowing the objectives that guided the system
- Data system objectives should be written in terms of the questions that
need to be answered by the data, not in terms of the data itself.
- Every data system objective should be traceable to user needs.
For example, NHTSA, as an internal user of the Fatality Analysis Reporting
System (FARS), has a primary goal to improve traffic safety and a need for
information related to that goal. So, one objective for FARS could be to
provide an overall measure of highway safety to evaluate the effectiveness
of highway safety improvement efforts.
- The system owner should develop and update the data system objectives in
partnership with critical users and stakeholders. The owner should have a
process to regularly update the system as user needs change.
For example, for the Highway Performance Monitoring System (HPMS), one of
the objectives may be: to provide state and national level measures of the
overall condition of the nation's public roads for Congress, condition and
performance information for the traveling public, and information necessary
to make equitable apportionments of highway funds to the states. The specific
needs of major users have to be monitored and continuously updated.
- Objectives should include timeliness of the data related to user needs.
- The current data system objectives should be documented and made available
to the public, unless restricted.
- The updating process should be documented and include how user information
- Huang, K., Y.W. Lee, and R.Y. Wang. 1999. Quality Information and Knowledge.
Saddle River, NJ: Prentice Hall.
2.2 Data Requirements
- An "empirical indicator" is a characteristic of people, businesses,
objects, or events (e.g., people or businesses in a city or state, cars or
trains in the United States, actions at airports, incidents on highways).
Examples: The level of success in stopping illicit drug smuggling into the
U.S. over maritime routes. The level of use of public transit in a metropolitan
- Before deciding on what data should be in a data system or how to acquire
them, the data system objectives need to be linked to more specific "empirical
indicators," from which data requirements will be derived.
Example: For FARS, the objective "To provide an overall measure of highway
safety" leads to an empirical indicator of "Injury or death of people
on the highways of the U.S."
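One way to keep this chain traceable is to record it as linked records. The
sketch below is illustrative only: the class and function names (Objective,
EmpiricalIndicator, untraceable) and the wording of the user need are
hypothetical, while the FARS objective and indicator are taken from the
examples in these guidelines.

```python
from dataclasses import dataclass, field


@dataclass
class EmpiricalIndicator:
    description: str
    data_requirements: list[str] = field(default_factory=list)


@dataclass
class Objective:
    statement: str
    user_needs: list[str] = field(default_factory=list)  # questions users want answered
    indicators: list[EmpiricalIndicator] = field(default_factory=list)


def untraceable(objectives):
    """Return statements of objectives lacking a user need or an indicator."""
    return [o.statement for o in objectives if not o.user_needs or not o.indicators]


fars = Objective(
    statement="To provide an overall measure of highway safety",
    user_needs=["Are highway safety improvement efforts effective?"],  # hypothetical wording
    indicators=[EmpiricalIndicator("Injury or death of people on the highways of the U.S.")],
)
draft = Objective(statement="Hypothetical objective with no recorded user need")
```

Calling untraceable([fars, draft]) flags only the draft objective, which is the
kind of gap a periodic review of the objectives would be looking for.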
- Empirical indicators related to objectives can be outcomes that change
as objectives are achieved, outputs from agency accomplishments related
to an objective, efficiency concepts, inputs, and quality of work.
- From the empirical indicators, data requirements are created for possible
measurement of each empirical indicator.
- Maintaining the link from data system objectives to empirical indicators
to data requirements will help to ensure "relevance" of the data
- In the data requirements, the use of standard names, variables, numerical
units, codes, and definitions allows data comparisons across databases.
- Besides data that are directly related to strategic plans, additional data
may be required for possible cause and effect analysis.
For example, data collected for traffic crashes may include weather data for
- Each data system objective should have one or more "indicators"
that need to be measured. Characteristics or attributes of the target group
that are the focus of the objective should be covered by one or more empirical
For HPMS, the objective "to provide a measure of highway road use"
can lead to the empirical indicator of "the annual vehicle miles of travel
on the interstate system and other principal arteries."
- The empirical indicators should be those characteristics which, when changing
in a favorable way, indicate progress toward achievement of an objective.
Note: Exceptions to this description are measures of magnitude, such as a
total population or total vehicle miles traveled. These are "denominator
measures" used to allow comparisons over time.
- Once the empirical indicators are chosen, develop data requirements needed
to quantify them.
Example: For HPMS, the empirical indicator, "the annual vehicle miles
of travel on the interstate system and other principal arteries" can
lead to a data requirement for state-level measures of annual vehicle-miles
traveled accurate to within 10 percent at 80 percent confidence.
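A requirement of this form can be checked against an estimate once its standard
error is known. The sketch below assumes that "within 10 percent at 80 percent
confidence" means the half-width of a two-sided, normal-approximation confidence
interval is at most 10 percent of the estimate; that reading, the function
names, and the numbers in the usage note are assumptions for illustration.

```python
from statistics import NormalDist


def relative_margin(estimate, std_error, confidence=0.80):
    """Half-width of a two-sided normal confidence interval, as a fraction of the estimate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.28 for 80 percent
    return z * std_error / abs(estimate)


def meets_requirement(estimate, std_error, max_relative_margin=0.10, confidence=0.80):
    """True if the estimate is precise enough under the stated requirement."""
    return relative_margin(estimate, std_error, confidence) <= max_relative_margin
```

With a made-up estimate of 50,000 and a standard error of 3,000, the relative
margin is about 7.7 percent and the requirement is met; with a standard error
of 5,000 it is about 12.8 percent and the requirement is not met.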
- There is usually more than one way to quantify an empirical indicator. All
reasonable measures should be considered without regard to source or availability
of data. The final data choices will be made in the "methods" phase
based on ease of acquisition, constraining factors (e.g., cost, time, legal
factors), and accuracy of available data.
Example: A concept of commercial airline travel "delay" can be measured
as a percent of flights on-time in accordance with schedule, or a measure
of average time a passenger must be in the airport including check in, security,
and flight delay (feasibility of measure is not considered at this stage).
- In the data requirements, each type of data should be described in detail.
Key variables should include requirements for accuracy, timeliness, and completeness.
The accuracy should be based on how the measure will be used for decision-making.
Example: For FARS, the concept, "The safety of people and pedestrians
on the highways of the U.S." can lead to data requirements for counts
of fatalities, injuries, and motor vehicle crashes on U.S. highways and streets.
The fatalities for a fiscal year should be as accurate as possible (100% data
collection), available within three months after the end of the fiscal year,
and as complete as possible. The injury counts in traffic crashes for the
fiscal year totals should have a standard error of no more than 6 percent,
be available within three months after the end of the fiscal year, and have
an accident coverage rate of at least 90 percent.
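Requirements stated this way can be written down as thresholds and compared
with what a collection actually delivers. The following is a minimal sketch:
the DataRequirement class and shortfalls function are hypothetical, and the
"standard error of no more than 6 percent" is assumed to be a relative
standard error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequirement:
    name: str
    max_relative_std_error: float  # accuracy
    max_lag_months: int            # timeliness
    min_coverage: float            # completeness


def shortfalls(req, relative_std_error, lag_months, coverage):
    """List which parts of a requirement the delivered data fail to meet."""
    problems = []
    if relative_std_error > req.max_relative_std_error:
        problems.append("accuracy")
    if lag_months > req.max_lag_months:
        problems.append("timeliness")
    if coverage < req.min_coverage:
        problems.append("completeness")
    return problems


# Thresholds taken from the FARS injury-count example above.
injury_counts = DataRequirement("injury counts", 0.06, 3, 0.90)
```

For instance, shortfalls(injury_counts, 0.08, 2, 0.95) reports only an accuracy
shortfall, since the timeliness and coverage thresholds are met.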
- When selecting possible data, consider standardization with other databases.
First, consider measures used for similar concepts in other DOT databases.
Second, consider measures for similar concepts in databases outside DOT (e.g.,
the Census). Where coding is used, coding standards should be applied and made
part of the data requirements. Such standardization leads to "coherence"
Examples: the North American Industry Classification System (NAICS) codes,
the Federal Information Processing Standards (FIPS) for geographic codes (country,
state, county, etc.), the Standard Occupational Classification (SOC) codes, and
International Organization for Standardization (ISO) codes (money, countries,
containers).
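To illustrate why a shared coding standard matters, the sketch below re-keys
one hypothetical dataset to two-digit FIPS state codes so that it can be
combined with another. The FIPS codes shown are real; the datasets, values,
and function name are made up for illustration.

```python
# Two-digit FIPS state codes (a small subset, for illustration).
STATE_FIPS = {"CA": "06", "GA": "13", "NY": "36", "TX": "48"}

# Hypothetical records from two databases that identify states differently.
crashes_by_abbrev = {"GA": 120, "TX": 310}
vmt_by_fips = {"13": 1.1, "48": 2.7}  # made-up values in made-up units


def crash_rate_by_fips(crashes, vmt):
    """Re-key crash counts to FIPS codes, then divide by the matching VMT figure."""
    rates = {}
    for abbrev, count in crashes.items():
        fips = STATE_FIPS[abbrev]
        if fips in vmt:
            rates[fips] = count / vmt[fips]
    return rates
```

Had both databases used FIPS codes from the start, the lookup step would be
unnecessary and mismatches between naming schemes could not occur.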
- The current data system empirical indicators and data requirements should
be documented and clearly posted with the data.
2.3 Methods to Acquire Data
Given data requirements for a wide range of possible measures, the next phase
is to consider the realities associated with gathering the data to construct
estimates and perform analysis. After looking at the ease of data acquisition,
complexity of possible acquisition approaches, budget restrictions, and time
considerations, the list of possible measures is likely to be reduced to a more
reasonable level. First, consider possible sources of data and then the process
of acquiring it.
The more critical data needs invariably require greater accuracy. This
in turn usually leads to a more complex data collection process. As the
process gets more complex, there is no substitute for expertise. If the
expertise for a complex design is not available in-house, consider acquiring
the expertise by either contacting an agency specializing in statistical
data collection like the Bureau of Transportation Statistics or by getting
2.4 Sources of Data
- A common arrangement in transportation is a reporting collection in which
the target group automatically sends data. Most of these are dictated by law
or regulation. That limits the collection planning to working out the physical
For example: 46 USC Chapter 61 specifies a marine casualty reporting collection,
while 46 CFR 4.05 specifies details.
- If existing data can be found that address the data requirements, using them
is by far the most efficient (i.e., cheapest) approach to data acquisition. Sources
of existing data can be current data systems or administrative records.
- "Administrative records" are data that are created by government
agencies to perform facilitative functions, but do not directly document the
performance of mission functions (National Archives definition). In addition
to providing a source for the data itself, administrative records may also
provide information helpful in the design of the data collection process (e.g.,
sampling lists, stratification information).
Examples include state driver's license records, Social Security records, IRS
records, boat registration records, and mariner license records.
- Another method, less costly than developing a new data collection system,
is to use existing data collections tailored to your needs. The owner of such
a system may be willing to add additional data collection or otherwise alter
the collection process to gather data that will meet data requirements.
For example, the Bureau of Transportation Statistics Omnibus survey is a monthly
transportation survey that will add questions related to transportation for special
collections of data from several thousand households. This method could be used if
the process is accurate enough for the data system needs.
- The "target group" is the group of all people, businesses, objects,
or events about which information is required.
For example, the following could be target groups: all active natural gas
pipelines in the U.S. on a specific day, traffic crashes in FY2000 involving
large trucks, empty seat-miles on the MARTA rail network in Atlanta on a given
day, hazardous material incidents involving radioactive material in FY2001,
mariners in distress on a given day, and U.S. automobile drivers.
- One possible approach is to go directly to the "target group,"
either all of them (100%) or a sample of them. This would work with people
- Another method frequently necessary with transportation data is the use
of third party sources. Third party sources are people, businesses, or even
government entities that have knowledge about the target group or collect
information for other purposes, such as investigators, observers, or service
providers (e.g., doctors).
Examples: traffic observers, police observers, investigators, bus drivers
counting passengers, state data collectors.
- Research whether government and private data collections already have data
that meet the data requirements. Consider surveys, reporting collections,
and administrative records.
- If existing data meet some but not all of the data requirements, determine
whether the existing data systems can be altered to meet the data needs.
For example, another agency may be willing to add to or alter their process
in exchange for financial support.
- A primary consideration in whether to gather data from the target group
or an indirect source is access to the entire group. A 100% data collection
requires access to every member of the target group. A sample approach
will not include the entire target group, but all members should have a non-zero
(and known) probability of selection, or the sample will not necessarily
be representative of the target group.
- Consider getting information directly from the target group (if they are
people or businesses), having the target group observed (events as they occur),
or getting information about the target group from another source (third party
source discussed above).
- In some situations, the information desired is not directly available. In
this case, consider collecting related information that can be used to derive
or estimate the information required.
For example: Collecting the number of people on and off a bus at each stop
combined with a separate estimate of trip length between stops to estimate
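As a rough sketch of such a derivation, the code below takes passenger-miles
as the derived quantity (an assumption made for illustration) and computes it
from on/off counts and stop-to-stop distances. The function name and the
numbers in the usage note are hypothetical.

```python
def passenger_miles(boardings, alightings, segment_miles):
    """Sum of (passengers on board) x (distance) over each stop-to-stop segment.

    boardings[i] and alightings[i] are counts at stop i; segment_miles[i] is the
    distance from stop i to stop i + 1, so it has one fewer entry.
    """
    if not (len(boardings) == len(alightings) == len(segment_miles) + 1):
        raise ValueError("need one segment length between each pair of stops")
    load = 0
    total = 0.0
    for on, off, miles in zip(boardings, alightings, segment_miles):
        load += on - off  # passengers on board when leaving this stop
        total += load * miles
    return total
```

For a three-stop route with boardings of 10, 5, and 0, alightings of 0, 3, and
12, and segments of 2 and 3 miles, the loads are 10 and 12 passengers, giving
10 x 2 + 12 x 3 = 56 passenger-miles.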
- When using third-party data for a data system, ensure that the data from
the third party meets data requirements. If the third party source is mandated
or a sole source for the data, gather information on each data requirement,
- The choices made for sources and their connection to the data requirements
should be documented and clearly posted with the data, or with disseminated
output from the data.
- Electronic Records Work Group Report to the National Archives and Records
Administration dated September 14, 1998.
2.5 Data Collection Design
- The design of data collection is one of the most critical phases in developing
a data system. The accuracy of the data and of estimates derived from the
data is heavily dependent upon the design of data collection.
For example, the accuracy is dependent upon proper sample design, making use
of sampling complexity to minimize variance. The data collection process itself
will also determine the accuracy and completeness of the raw data.
- Data collection from 100% of the target group is usually the most accurate
approach, but is not always feasible due to cost, time, and other resource
restrictions. It is also often far more accurate than the data requirements
demand and can be a waste of resources.
- A "probability sample" is an efficient way to automatically
select a data source representative of the target group with the accuracy
determined by the size of the sample.
- When sampling people, businesses, and/or things, sampling lists (also
known as frames) of the target group are required to select the sample.
Availability of such lists is often a restriction to the method used in
- For most statistical situations, it is usually important to be able to
estimate the variance along with estimating the mean or total.
- Sample designs should be based on established sampling theory, making
use of multi-staging, stratification, and clustering to enhance efficiency
- Sample sizes should be determined based on the data requirements for key
data, taking into account the sample design and missing data.
- The data collection designer should use a probability sample, unless a 100%
collection is required by law, necessitated by accuracy requirements, or turns
out to be inexpensive (e.g., data readily available).
For example, a system that collects data to estimate the total vehicle miles
traveled (VMT) for a state of the U.S. cannot possibly collect 100 percent
of all trips on every road, so a sampling approach is necessary. However,
when it comes to collecting passenger miles for a large transit system, it
may be possible with fare cards and computer networks to collect 100% of passenger
- The sample design should give all members of the target group a non-zero
(and known) probability of being represented in the sample.
DANGER => Samples of convenience, such as collecting transportation counts
at an opportune location, will produce data, but the data will almost always be
biased. By contrast, randomly selecting counting sites from all possible locations
will be statistically sound (with allowances for correlations between locations).
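A random selection of counting sites with known, non-zero selection
probabilities might look like the following. This is a sketch under stated
assumptions: it uses a stratified simple random sample, and the frame, stratum
names, sample sizes, and function name are all hypothetical.

```python
import random


def select_sites(strata, sample_sizes, seed=None):
    """Draw a simple random sample within each stratum.

    strata maps a stratum name to its list of candidate sites (the frame);
    sample_sizes maps the same names to the number of sites to draw.
    Returns (site, stratum, inclusion_probability) tuples.
    """
    rng = random.Random(seed)
    selected = []
    for name, sites in strata.items():
        n = sample_sizes[name]
        if not 0 < n <= len(sites):
            raise ValueError(f"sample size for {name!r} must be between 1 and {len(sites)}")
        prob = n / len(sites)  # known and non-zero for every site in the stratum
        for site in rng.sample(sites, n):
            selected.append((site, name, prob))
    return selected


# Hypothetical frame of candidate counting locations.
frame = {
    "interstate": [f"I-{i}" for i in range(20)],
    "arterial": [f"A-{i}" for i in range(80)],
}
sample = select_sites(frame, {"interstate": 5, "arterial": 10}, seed=1)
```

Every site in the frame can be drawn, and the recorded inclusion probability
(0.25 for interstate sites, 0.125 for arterial sites here) is what later allows
the counts to be weighted into unbiased estimates.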
- The design of any samples should be based on established sampling theory.
Determine sample size using appropriate formulas to ensure data requirements
for accuracy are met with adjustments for sample design and missing data.
Use an appropriate random method to select the sample according to the design.
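One common textbook calculation for sample size is sketched below. It assumes
the goal is to estimate a mean to within a relative margin under simple random
sampling, then inflates the result by a design effect and an expected response
rate. The function and parameter names are hypothetical, and a real design
should use the formula appropriate to the chosen sample design.

```python
import math
from statistics import NormalDist


def sample_size(cv, rel_margin, confidence=0.95, population=None,
                design_effect=1.0, response_rate=1.0):
    """Sample size for estimating a mean to within rel_margin (as a fraction).

    cv is the anticipated coefficient of variation (std. dev. / mean).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = (z * cv / rel_margin) ** 2        # simple random sampling, large population
    if population is not None:
        n = n / (1 + n / population)      # finite population correction
    n *= design_effect                    # allowance for clustering and similar features
    n /= response_rate                    # allowance for missing data
    return math.ceil(n)
```

With an anticipated coefficient of variation of 0.5 and a 5 percent margin at
95 percent confidence, the function returns 385. Adding a design effect of 1.5
and an 80 percent response rate raises that to 721.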
- If some form of sampling is used, design the data collection to collect
sufficient information to estimate the variance of each estimate to be produced.
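For the simplest case, a simple random sample without replacement, the estimate
of a total and its standard error can be computed as sketched below. The
function name is hypothetical, and more complex designs (stratified, clustered,
multi-stage) require the variance formulas that match the design.

```python
import math
from statistics import mean, variance


def estimate_total(sample_values, population_size):
    """Estimated population total and its standard error under simple random sampling."""
    n = len(sample_values)
    if n < 2:
        raise ValueError("at least two observations are needed to estimate variance")
    s2 = variance(sample_values)            # sample variance (n - 1 divisor)
    fpc = 1 - n / population_size           # finite population correction
    total = population_size * mean(sample_values)
    std_error = population_size * math.sqrt(fpc * s2 / n)
    return total, std_error
```

The calculation needs the individual sample values and the population size, so
both must be retained by the data collection, not just the final estimate.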
- The collection design and its connection to the data requirements should
be documented and clearly posted with the data, or with disseminated output
from the data. The documentation should include references for the sampling
- If the data collection process performed by DOT uses sampling, a statistician
or other sampling expert should develop or review the design.
- If the data system uses third party data collected using sampling, sample
design information should be collected and provided with collection design
documentation, when available.
- Cochran, William G., Sampling Techniques (3rd Ed.),
New York: Wiley, 1977.