Data Analysis

BTS employs a wide variety of statistical techniques in its work. However, regardless of the techniques used, there are some steps that should be included in any data analysis. This chapter provides general guidance on those steps, and then leaves the choice of analytical tools up to the data analyst performing the work.

This chapter contains standards for planning a data analysis (Section 5.1), calculating estimates and performing inferences (Section 5.2), and documenting the data analysis (Section 5.3). For quick-response projects, compliance with these standards is recommended, but not required.

**Standard 5.1**: Plan before starting a specific data analysis to ensure that the resulting product addresses the needs of BTS customers and that the resources are available to complete the data analysis.

**Key Terms:** key variable, target audience

The data analysis should be relevant, objective, comprehensive, and add value to existing information. To meet these goals, data analysts need to:

- Conduct the data analysis in an objective and policy-neutral manner that focuses on the statistical and economic facts.
- Maintain awareness of subject matter issues so that the data analysis can address topics of interest and importance.
- Consult with subject area specialists about relevant issues, the strengths and weaknesses of data sources, and important references to key topic elements.
- If the data analysis is not comprehensive, indicate what further types of data analysis should be considered and whether BTS plans to do that work.

Prepare a data analysis plan in the proper format (BTS 2004) prior to the start of the data analysis.

- Include the purpose of the data analysis, the research question, target audience, data sources (including a description and any limitations), key variables to be used, and the data analysis methods. Also provide target completion dates and an estimate for the amount of resources needed to complete the product.
- Subject matter experts should review the plan to ensure that the proposed data analysis will answer relevant questions. Data analysis experts should review the plan to ensure that appropriate data and methods will be used.
- The data analysis plan must be approved by the designated manager.

Bureau of Transportation Statistics (BTS). 2004. *BTS Information Product Scoping Paper*. Washington, DC.

**Approval Date:** June 28, 2005

**Standard 5.2**: Estimates and statistical inferences made regarding the data must be based on acceptable statistical practice.

**Key Terms:** accuracy, bias, bridge estimates, estimates, inference, reliability, robustness, time series, trend, variance

Analyses must use theory and methods justifiable by reference to statistical literature (provided below in Related Information) or by mathematical derivation.

- Use appropriate analysis methods for complex sample, time series, and geospatial data, or variance estimates may be seriously biased.
- If extensive seasonality, irregularities, known special causes, or variation in trends are present in the data, take those into account in the trend analysis.
- Use robust methods if in doubt about the quality of the data (i.e., the quality of the data cleaning) or about the suitability of the data for analysis by standard parametric methods.
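
As a minimal sketch of what "robust methods" can mean in practice, the example below compares the mean and standard deviation with the median and the scaled median absolute deviation (MAD) on a small series containing one suspect value. The data values and the helper name `median_abs_deviation` are made up for illustration; the scale constant 1.4826 is the usual factor that makes the MAD comparable to a standard deviation under normality.

```python
import statistics


def median_abs_deviation(values, scale=1.4826):
    """Scaled median absolute deviation (hypothetical helper name).

    With scale=1.4826 the result is comparable to a standard deviation
    when the data are approximately normal.
    """
    values = list(values)
    center = statistics.median(values)
    return scale * statistics.median([abs(v - center) for v in values])


# Illustrative data only: six plausible readings and one suspect value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 100.0]

print("mean   :", statistics.mean(readings))
print("stdev  :", statistics.stdev(readings))
print("median :", statistics.median(readings))
print("MAD    :", median_abs_deviation(readings))
```

The single suspect value pulls the mean and standard deviation far from the bulk of the data, while the median and MAD barely move. Whether the suspect value is an error or a real observation is still a subject matter question.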

Statistical statements should be accompanied by some assessment of the limitations and uncertainty of the results.

- Estimated errors due to statistical sampling or modeling indicate the reliability of the estimate. However, these estimated errors do not account for bias, which may have a greater effect on accuracy and which, unlike sampling error, does not decrease as the number of cases increases.
- Analysts must consider data quality issues related to measurement error and missing data. The purpose, design, methods, and quality of processing can all place limitations on the analysis and interpretation of the data. If possible, quantify and eliminate biasing effects. Otherwise, discuss the nature and estimated magnitude of these limitations in the report.
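
The distinction between sampling error and bias can be illustrated with the standard decomposition of mean squared error into squared bias plus variance. The sketch below assumes a simple random sample mean with a fixed additive bias; the bias, standard deviation, and sample sizes are invented numbers, and the function name `root_mean_squared_error` is hypothetical.

```python
import math


def root_mean_squared_error(bias, sigma, n):
    """RMSE of a sample mean with a fixed additive bias.

    Uses MSE = bias**2 + variance, where the variance of the mean of
    n independent observations is sigma**2 / n.
    """
    standard_error = sigma / math.sqrt(n)
    return math.sqrt(bias ** 2 + standard_error ** 2)


# Illustrative values only: a bias of 2 units and a unit-level
# standard deviation of 10 units.
for n in (100, 10_000, 1_000_000):
    se = 10.0 / math.sqrt(n)
    rmse = root_mean_squared_error(bias=2.0, sigma=10.0, n=n)
    print(f"n={n:>9,}  standard error={se:.4f}  RMSE={rmse:.4f}")
```

As the sample size grows, the standard error shrinks toward zero but the root mean squared error levels off at the size of the bias, which is why a small reported sampling error alone does not establish accuracy.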

Support statistical statements with proper testing and inference procedures.

- Sampling error estimates should accompany any estimates from samples.
- For complex sample designs, the BTS office originating the data should provide guidance on estimation and variance calculation. The guidelines should cover proper use of weights and recommend a maximum coefficient of variation and a minimum cell size for usability.
- When doing multiple comparisons with the same data between subgroups, include a note with the test results indicating whether or not the significance criterion (Type I error) was adjusted and, if adjusted, the method used.
- Not every statistically significant difference is important. Given a comparison with a statistically significant difference, subject matter expertise is needed to determine whether the difference is important. In the context of the measure and its fluctuation over time, a statistically significant difference may still be too small to matter in practice.
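
For the multiple-comparisons note above, the following sketch shows two widely used adjustments of the significance criterion, Bonferroni and Holm's step-down procedure. These are examples of methods an analyst might report, not methods this standard prescribes; the function names and the p-values are illustrative assumptions.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[index])
        # Enforce monotonicity so adjusted values never decrease.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


# Illustrative p-values from three subgroup comparisons on the same data.
raw = [0.01, 0.04, 0.03]
print("raw       :", raw)
print("Bonferroni:", bonferroni_adjust(raw))
print("Holm      :", holm_adjust(raw))
```

Whichever adjustment is used, the note accompanying the test results should name it so that readers can interpret the reported significance levels.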

If the scope of data collection changes or part of an historical series is revised, data for both the old and the new series should be published for a suitable overlap period.
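
One reason to publish an overlap period is that users can then relate the old series to the new one. The sketch below shows one simple way a user might do so, by averaging the ratio of new to old values over the overlapping periods and applying it to earlier old-series values. This ratio-linking approach is an assumption chosen for illustration, not a method specified by this standard, and all series values and names (`bridge_factor`, `old_series`, `new_series`) are hypothetical.

```python
def bridge_factor(old_series, new_series):
    """Average new/old ratio over the periods present in both series.

    Both arguments are dicts mapping a period label to a value.
    Returns the factor and the sorted list of overlapping periods.
    """
    overlap = sorted(set(old_series) & set(new_series))
    if not overlap:
        raise ValueError("the two series share no overlapping periods")
    ratios = [new_series[period] / old_series[period] for period in overlap]
    return sum(ratios) / len(ratios), overlap


# Illustrative values only: the series overlap in 2002 and 2003.
old_series = {2001: 100.0, 2002: 110.0, 2003: 120.0}
new_series = {2002: 121.0, 2003: 132.0, 2004: 140.0}

factor, overlap = bridge_factor(old_series, new_series)
print("overlap periods:", overlap)
print("bridge factor  :", round(factor, 4))
print("2001 on new basis:", round(old_series[2001] * factor, 2))
```

A longer overlap gives a more stable factor and lets users check whether the relationship between the two series is itself constant over time.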

State all statistical assumptions (such as assumptions about data distributions or structured dependence) made during the data analysis.

- Perform diagnostics to detect violations of assumptions, and provide the results of the diagnostics in the report. Plots of data and statistical output, such as residuals, are often useful in detecting violations of assumptions.
- For each assumption, include a discussion of the likelihood that the assumption will be violated by small or large amounts and the robustness of the data analysis method to each such violation.
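
As an illustration of a residual-based diagnostic, the sketch below fits a straight-line trend by ordinary least squares and computes the Durbin-Watson statistic on the residuals, which is one common check of the assumption that errors are not serially correlated. The data are invented and the function names are hypothetical; a real analysis would typically add residual plots and other checks.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests little first-order
    autocorrelation; values toward 0 or 4 suggest positive or negative
    autocorrelation, respectively."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(r * r for r in residuals)
    return numerator / denominator


# Illustrative data only: a rising series with a slow wave around the trend.
periods = list(range(8))
values = [1.0, 3.4, 5.6, 7.3, 8.7, 10.6, 12.4, 15.0]

intercept, slope = fit_line(periods, values)
residuals = [v - (intercept + slope * p) for p, v in zip(periods, values)]
print("intercept, slope:", round(intercept, 3), round(slope, 3))
print("Durbin-Watson   :", round(durbin_watson(residuals), 3))
```

Reporting the statistic alongside a plot of the residuals against time lets readers judge for themselves whether the independence assumption is reasonable.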

Agresti, A. 1990. *Categorical Data Analysis.* New York, NY: Wiley.

Anderson, T.W. 2003. *An Introduction to Multivariate Statistical Analysis, 3rd ed*. New York: Wiley.

Box, G.E.P., Jenkins, G.M., and Reinsel, G.C. 1994. *Time Series Analysis: Forecasting and Control, 3rd ed*. New York: Prentice Hall.

Casella, G. and Berger, R.L. 2001. *Statistical Inference, 2nd ed*. Belmont, CA: Duxbury Press.

Chatfield, C. 2003. *The Analysis of Time Series: An Introduction, 6th ed*. New York: Chapman and Hall.

Cleveland, W.S. 1993. *Visualizing Data*. Summit, NJ: Hobart Press.

Cochran, W.G. 1977. *Sampling Techniques, 3rd ed.* New York: Wiley.

Cook, R.D. and Weisberg, S. 1999. *Applied Regression Including Computing and Graphics*. New York: Wiley.

Cressie, N. 1991. *Statistics for Spatial Data*. New York: Wiley.

Daniel, C. and Wood, F.S. 1980. *Fitting Equations to Data*. New York: Wiley.

DeGroot, M.H. 1989. *Probability and Statistics*. Reading, MA: Addison-Wesley.

Diggle, P.J., Liang, K.-Y., and Zeger, S.L. 2000. *Analysis of Longitudinal Data*. Oxford: Oxford University Press.

Draper, N.R. and Smith, H. 1998. *Applied Regression Analysis, 3rd ed.* New York: Wiley.

Efron, B. and Tibshirani, R.J. 1994. *An Introduction to the Bootstrap*. New York: Chapman and Hall.

Fleiss, J.L. 1981. *Statistical Methods for Rates and Proportions, 2nd ed*. New York: Wiley.

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. 2005. *Robust Statistics: The Approach Based on Influence Functions, rev. ed*. New York: Wiley.

Harvey, A.C. 1993. *Time Series Models, 2nd ed*. Cambridge, MA: MIT Press.

Hicks, C.R., and Turner, K.V. 1999. *Fundamental Concepts in the Design of Experiments*. Oxford, UK: Oxford University Press.

Hogg, R.V., Craig, A., and McKean, J.W. 2004. *Introduction to Mathematical Statistics, 6th ed*. New York: Prentice Hall.

Hosmer, D.W., and Lemeshow, S. 1989. *Applied Logistic Regression*. New York: Wiley.

Huber, P.J. 1981. *Robust Statistics*. New York: Wiley.

Kelsey, J.L., Whittemore, A.S., Evans, A.S., and Thompson, W.D. 1996. *Methods in Observational Epidemiology*. New York: Oxford University Press.

Kleinbaum, D.G., Kupper, L.L., and Muller, K.E. 1988. *Applied Regression Analysis and Other Multivariable Methods*. Boston: PWS-Kent.

Lehmann, E.L. and Romano, J.P. 2005. *Testing Statistical Hypotheses, 3rd ed.* New York: Springer Verlag.

Lehmann, E.L. and Casella, G. 1998. *Theory of Point Estimation, 2nd ed.* New York: Springer Verlag.

Little, R.J.A. and Rubin, D. 1987. *Statistical Analysis with Missing Data*. New York: Wiley.

McCulloch, C.E. and Searle, S.R. 2001. *Generalized, Linear, and Mixed Models*. New York: Wiley.

Mood, A.M., Graybill, F.A., and Boes, D.C. 1974. *Introduction to the Theory of Statistics*. New York: McGraw-Hill.

Office of Management and Budget (OMB). 2005. *Standards for Statistical Surveys (Proposed)*, Sections 4.1 (Developing Estimates and Projections) and 5.2 (Inference and Comparisons). Washington, DC. July 14.

Pankratz, A. 1983. *Forecasting with Univariate Box-Jenkins Models*. New York: Wiley.

Rao, C.R. 1973. *Linear Statistical Inference and Its Applications, 2nd ed*. New York: Wiley.

Rohatgi, V.K. 1976. *An Introduction to Probability Theory and Mathematical Statistics*. New York: Wiley.

__________. 1984. *Statistical Inference*. New York: Wiley.

Rousseeuw, P.J., and Leroy, A.M. 1987. *Robust Regression and Outlier Detection*. New York: Wiley.

Särndal, C.-E., Swensson, B., and Wretman, J. 1991. *Model Assisted Survey Sampling*. New York: Springer Verlag.

Scheffé, H. 1959. *Analysis of Variance*. New York: Wiley.

Searle, S.R., Casella, G., and McCulloch, C.E. 1992. *Variance Components*. New York: Wiley.

Seber, G.A.F., and Lee, A.J. 2003. *Linear Regression Analysis, 2nd ed*. New York: Wiley.

Selvin, S. 1996. *Statistical Analysis of Epidemiologic Data*. Oxford, UK: Oxford University Press.

Skinner, C., Holt, D., and Smith, T. 1989. *Analysis of Complex Surveys.* New York: Wiley.

Snedecor, G.W. and Cochran, W.G. 1989. *Statistical Methods, 8th ed.* Ames, IA: Iowa State University Press.

Tukey, J. 1977. *Exploratory Data Analysis.* Reading, MA: Addison-Wesley.

U.S. Department of Transportation. 2002. *The Department of Transportation Information Dissemination Quality Guidelines*, Appendix A, Sections 4.3 (Production of Estimates and Projections) and 4.4 (Data Analysis and Interpretation). Available at http://dms.dot.gov/ombfinal092502.pdf as of January 19, 2005.

Wolter, K.M. 1985. *Introduction to Variance Estimation*. New York: Springer Verlag.

Zacks, S. 1971. *Theory of Statistical Inference*. New York: Wiley.

**Approval Date:** June 28, 2005

**Standard 5.3**: Document the methods and models used in data analysis products to help ensure objectivity, utility, transparency, and reproducibility of the estimates and projections.

**Key Terms:** reproducibility, transparency

The data analysis report must contain details of the methods used during the data analysis, including a description of software used, a discussion of the data analysis assumptions, and key information relevant to obtaining the data analysis results.

- Document all methods, assumptions, diagnostics, and robustness checks. Provide references to support the methods used in the data analysis, or a derivation of the theory supporting the method used in the report.
- Include a statement of the limitations of the data analysis, including coverage and response limitations and statistical variation.
- Archive the data and models used in the data analysis so the estimates can be reproduced.
- Archive supporting technical documentation, such as standard error and significance test calculations, that helps ensure transparency and reproducibility.
- For recurring reports, consider producing a methodological report.
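
One lightweight way to support the archiving items above is to store, alongside the archived files, a manifest that records a checksum for each file and the software version used, so that a later reader can confirm the archive is unchanged. The sketch below is an assumption about how this might be done, not a BTS procedure; the function names and the manifest layout are hypothetical.

```python
import hashlib
import json
import platform
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record the Python version and a checksum for each archived file."""
    return {
        "python_version": platform.python_version(),
        "files": {str(Path(p)): file_sha256(p) for p in paths},
    }


def write_manifest(paths, manifest_path):
    """Write the manifest as JSON next to the archived material."""
    manifest = build_manifest(paths)
    Path(manifest_path).write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8"
    )
    return manifest
```

A call such as `write_manifest(["estimates.csv", "model.py"], "MANIFEST.json")`, with file names standing in for the actual data and model files, produces a record that can be rechecked whenever the estimates are reproduced.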

Bureau of Transportation Statistics (BTS). 2005. *BTS Statistical Standards Manual*, Section 6.8 (Public Documentation), Washington, DC. Available at http://www.bts.gov/programs/statistical_policy_and_research/bts_statistical_standards_manual/index.html, as of June 10, 2005.

Office of Management and Budget (OMB). 2002. *Guidelines for Ensuring and Maximizing the Quality, Objectivity, Utility, and Integrity of Information Disseminated by Federal Agencies*. Federal Register, Vol. 67, No. 36, pp. 8452-8460. Washington, DC. February 22.

__________. 2005. *Standards for Statistical Surveys (Proposed)*, Section 4.1 (Developing Estimates and Projections). Washington, DC. July 14.