Once the data have been collected or acquired from an external source, some processing is usually necessary to make the data ready for conversion into information products.
This chapter contains standards for securing the data during processing (Section 4.1), checking data for potential errors (Section 4.2), dealing with missing data issues (Section 4.3), and adding information to the data (Section 4.4). It also contains standards for monitoring and evaluating data operations, including nonresponse analysis (Section 4.5), and for documenting the data processing operations (Section 4.6).
Standard 4.1: Safeguards must be taken throughout data processing to protect the data from disclosure, theft, or loss.
Key Terms: confidentiality, information security, storage
Implement the confidentiality procedures given in the BTS Confidentiality Procedures Manual sections on Physical Security Procedures and Security of Information Systems to protect the data (e.g., completed data collection forms, electronic files, and hard copy printouts) from unauthorized disclosure or release during data production, use, storage, transmittal, and disposition.
Guideline 4.1.2: Security of Information Systems
Follow the information system security procedures in the BTS Confidentiality Procedures Manual, and periodically monitor and update them. Ensure that:
Develop and implement routine data backups. Secure backup data from unauthorized access or release.
Bureau of Transportation Statistics. 2004. Confidentiality Procedures Manual. Washington, DC.
Federal Committee on Statistical Methodology. 1994. Report on Statistical Disclosure Limitation Methodology, Statistical Policy Working Paper 22. Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/working-papers/spwp22.html as of November 15, 2004.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.4 (Data Protection [during data collection]) and Section 6.5 (Data Protection [during information dissemination]). Washington, DC. July 14.
Approval Date: April 20, 2005
Standard 4.2: As part of standard data processing, mitigate errors by checking and editing both data BTS collects and data it acquires from external sources.
Key Terms: edit, imputation, outliers, skip pattern
At a minimum, the editing process must include checking for the items below, and appropriate editing if errors are detected.
In a data editing system:
Several actions are possible when a data value fails an edit check. Recommended procedures are:
Federal Committee on Statistical Methodology. 1990. Data Editing in Federal Statistical Agencies, Statistical Policy Working Paper 18. Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/working-papers/wp18.html as of November 15, 2004.
__________. 1996. Data Editing Workshop and Exposition, Statistical Policy Working Paper 25. Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/working-papers/wp25a.html as of November 15, 2004.
__________. 2001. Measuring and Reporting Sources of Error in Surveys, Statistical Policy Working Paper 31, Section 7.2.3 (Editing Errors). Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/01papers/spwp31_final.pdf as of November 15, 2004.
Hawkins, D.M. 1980. Identification of Outliers. New York: Chapman and Hall.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.1 (Data Editing). Washington, DC. July 14.
Approval Date: April 20, 2005
Standard 4.3: Unit and item nonresponse must be appropriately measured, adjusted for, and reported. Response rates must be computed using standard formulas to measure the proportion of the eligible respondents represented by the responding units.
Key Terms: bias, eligible unit, imputation, item, item nonresponse, multivariate analysis, nonresponse bias, overall unit nonresponse, probability of selection, response rates, sample substitution, unit, unit nonresponse, weight
Calculate unit and item response rates based either on the probability of selection (for household or personal data collections) or on the unit's measure of size (for industry or establishment data collections).
Calculate unit response rates (RRU) as the ratio of the number of completed data collection cases (CC) to the number of in-scope sample cases (AAPOR 2000). The total number of in-scope cases is made up of several categories of cases:
CC = number of completed cases;
R = number of cases that refused to provide any data;
O = number of eligible units not responding for reasons other than refusal;
NC = number of noncontacted units known to be eligible;
U = number of units of unknown eligibility; and
e = estimated proportion of units of unknown eligibility that are eligible.
The unit response rate (OMB 2005) represents a composite of these components:
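Taken together, the definitions above suggest the following unweighted form of the composite. This is a reconstruction from the listed components and from the stated ratio of completed cases to in-scope cases, not a quotation of the original formula. For weighted rates, each count would presumably be replaced by the corresponding sum of weights.

```latex
\[
RRU = \frac{CC}{CC + R + O + NC + e \cdot U}
\]
```

For example, with CC = 700, R = 100, O = 50, NC = 50, U = 200, and e = 0.5, the denominator is 700 + 100 + 50 + 50 + 100 = 1000, so RRU = 0.70.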
Calculate item response rates (RRI) as the ratio of the number of respondents for whom an in-scope response was obtained (CCx for item x) to the number of respondents who were requested to provide information for that item. The number requested to provide information for an item is the number of unit level respondents (CC) minus the number of respondents with a valid skip for item x (Vx). When an abbreviated questionnaire is used to convert refusals, the eliminated questions are treated as item nonresponse.
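The item-level calculation can be sketched in Python as follows. The function and argument names are illustrative assumptions, not part of the standard, and the counts are unweighted.

```python
def item_response_rate(cc_x: int, cc: int, v_x: int) -> float:
    """Item response rate for item x: RRI = CCx / (CC - Vx).

    cc_x -- respondents with an in-scope response to item x (CCx)
    cc   -- unit-level respondents (CC)
    v_x  -- respondents with a valid skip for item x (Vx)
    """
    requested = cc - v_x  # respondents who were asked item x
    if requested <= 0:
        raise ValueError("no respondents were requested to answer this item")
    if not 0 <= cc_x <= requested:
        raise ValueError("CCx must lie between 0 and CC - Vx")
    return cc_x / requested


# Hypothetical counts: 1,000 respondents, 200 valid skips, 600 in-scope answers.
rate = item_response_rate(cc_x=600, cc=1000, v_x=200)
print(f"RRI = {rate:.2f}")
```

Respondents who received an abbreviated questionnaire that dropped item x are not valid skips under the rule above. They stay in the denominator (they are not counted in Vx) and count as item nonresponse.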
Base decisions on whether to adjust data, adjust weights, or impute for missing data on how the data will be used and on an assessment of the bias that the missing data are likely to introduce.
For data collections involving sampling, adjust weights for unit nonresponse, unless unit imputation is warranted. Adjust weights for missing units within classes of sub-populations to reduce bias.
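One common way to adjust weights within classes is a weighting-class adjustment, sketched below. The standard does not prescribe this particular method. The record layout (`cls`, `weight`, `responded`) and the function name are assumptions made for illustration, and every unit passed in is assumed to be eligible.

```python
from collections import defaultdict


def weighting_class_adjustment(units):
    """Return nonresponse-adjusted weights, in the same order as `units`.

    Each unit is a dict with hypothetical keys:
      'cls'       -- adjustment class (sub-population) label
      'weight'    -- base weight of the eligible sample unit
      'responded' -- True if the unit completed the data collection
    Within each class, respondent weights are inflated so that they sum to
    the total base weight of all eligible units in that class.
    """
    total = defaultdict(float)      # base weight of all units, by class
    responding = defaultdict(float)  # base weight of respondents, by class
    for u in units:
        total[u["cls"]] += u["weight"]
        if u["responded"]:
            responding[u["cls"]] += u["weight"]

    for cls in total:
        if responding[cls] == 0:
            # A class with no respondents cannot be adjusted on its own.
            raise ValueError(f"class {cls!r} has no respondents; collapse classes")

    adjusted = []
    for u in units:
        if u["responded"]:
            factor = total[u["cls"]] / responding[u["cls"]]
            adjusted.append(u["weight"] * factor)
        else:
            adjusted.append(0.0)  # nonrespondents carry no weight
    return adjusted
```

The adjusted weights sum to the same class totals as the base weights, so estimated population totals are preserved within each class. The bias reduction depends on choosing classes in which respondents and nonrespondents are similar.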
American Association for Public Opinion Research. 2000. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. Lenexa, Kansas: AAPOR.
Kalton, G. 1983. Compensating for Missing Survey Data. Institute for Social Research, University of Michigan.
__________ and Flores-Cervantes, I. 2003. Weighting Methods. Journal of Official Statistics, Vol. 19, No. 2.
__________ and Kasprzyk, D. 1982. Imputing for missing survey responses. Proceedings of the Section on Survey Research Methods, American Statistical Association, 1982, 22-31.
__________ and Kasprzyk, D. 1986. The treatment of missing survey data. Survey Methodology, Vol. 12, No. 1, 1-16.
Little, R.J.A. and Rubin, D. 1987. Statistical Analysis with Missing Data. New York: Wiley.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.2 (Missing Data). Washington, DC. July 14.
Rubin, D.B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Schafer, J.L. 1997. Analysis of Incomplete Multivariate Data. London, UK: Chapman and Hall.
Approval Date: April 20, 2005
Standard 4.4: To allow appropriate analysis, use codes to identify missing, edited, and imputed items. When codes are added to convert collected text information into a form that facilitates analysis, standardized codes must be used, when available, to enhance comparability with other data sources.
Key Terms: coding, editing, external source, imputation, skip pattern
Use codes on the file that clearly distinguish between cases where an item is missing and cases where an item does not apply, such as when skipped over by a skip pattern.
Code the data set to indicate edit actions and imputed values.
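A minimal sketch of how a file might carry these distinctions is shown below. The status codes ("R", "E", "I", "M", "N"), the class names, and the function are hypothetical choices for illustration. The standard requires only that the codes clearly separate missing, not-applicable, edited, and imputed items.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ItemStatus(Enum):
    """Hypothetical status flags stored alongside each data item."""
    REPORTED = "R"        # value as collected
    EDITED = "E"          # value changed by an edit action
    IMPUTED = "I"         # value filled in by imputation
    MISSING = "M"         # item applied, but no value was obtained
    NOT_APPLICABLE = "N"  # item skipped by a valid skip pattern


@dataclass(frozen=True)
class CodedItem:
    value: Optional[float]
    status: ItemStatus


def code_item(reported: Optional[float],
              in_valid_skip: bool,
              edited_value: Optional[float] = None,
              imputed_value: Optional[float] = None) -> CodedItem:
    """Attach a status flag to one item so that users can tell cases apart."""
    if in_valid_skip:
        return CodedItem(None, ItemStatus.NOT_APPLICABLE)
    if imputed_value is not None:
        return CodedItem(imputed_value, ItemStatus.IMPUTED)
    if edited_value is not None:
        return CodedItem(edited_value, ItemStatus.EDITED)
    if reported is None:
        return CodedItem(None, ItemStatus.MISSING)
    return CodedItem(reported, ItemStatus.REPORTED)
```

Keeping the flag in a separate field, rather than overloading the value with reserved numbers, is one design choice. It lets analysts drop or reweight imputed and edited values without risk of treating a reserved code as data.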
Although it is preferable to pre-code responses, it may be necessary to code open-ended text fields for further use.
American Association for Public Opinion Research. 1998. "Standard Definitions – Final Dispositions of Case Codes and Outcome Codes for RDD Telephone Surveys and In-Person Household Surveys," http://www.aapor.org/ethics/stddef.html.
Bureau of Transportation Statistics (BTS). 2003. BTS Guide to Style and Publishing Procedures. Washington, DC.
__________. 2005. BTS Statistical Standards Manual, Chapter 2 (Data Collection Planning and Design). Washington, DC.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.3 (Coding). Washington, DC. July 14.
Approval Date: April 20, 2005
Standard 4.5: Monitor and evaluate each data processing activity, both to assess the impact on data quality and to inform data users.
Key Terms: frame, imputation, item nonresponse, incident data, longitudinal, missing at random, multivariate modeling, nonresponse bias, overall unit nonresponse, population, response rates, unit nonresponse, weight
Establish quality control procedures to monitor and report on the operation of data processing procedures.
Conduct an analysis of nonresponse for any data collection with an overall unit response rate (Guideline 4.3.2) less than 80 percent. The objective is to measure the impact of the nonresponse and to determine whether the data are missing at random.
If the item response rate (Guideline 4.3.3) is less than 70 percent, conduct an item nonresponse analysis to determine if the data are missing at random at the item level, in a similar fashion to Guideline 4.5.2.
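The two thresholds above can be applied mechanically once the rates are computed. The sketch below assumes that rates are expressed as proportions between 0 and 1. The function name and the item names in the usage example are hypothetical.

```python
UNIT_THRESHOLD = 0.80  # overall unit response rate below this triggers analysis
ITEM_THRESHOLD = 0.70  # item response rate below this triggers analysis


def nonresponse_analyses_required(unit_rate, item_rates):
    """Return (unit_analysis_needed, sorted list of items needing analysis).

    unit_rate  -- overall unit response rate as a proportion
    item_rates -- dict mapping item name to its item response rate
    """
    unit_needed = unit_rate < UNIT_THRESHOLD
    flagged_items = sorted(
        name for name, rate in item_rates.items() if rate < ITEM_THRESHOLD
    )
    return unit_needed, flagged_items


# Hypothetical rates for illustration only.
unit_needed, items = nonresponse_analyses_required(
    0.78, {"trip_distance": 0.65, "travel_mode": 0.92}
)
print(unit_needed, items)
```

Both comparisons are strict, matching the "less than" wording: a rate exactly at the threshold does not trigger an analysis.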
Conduct unit and item nonresponse bias analyses prior to the release of any information products.
In those cases where the analysis indicates that the data are not missing at random, the decision to publish individual items should be based on the amount of potential bias due to missing data.
Bureau of Transportation Statistics. 2005. BTS Statistical Standards Manual, Chapter 6 (Dissemination of Information). Washington, DC.
Groves, R. 1989. Survey Errors and Survey Costs. New York, NY: Wiley, Chapters 10 and 11.
Interagency Household Survey Nonresponse Group. Information available at http://www.fcsm.gov/committees/ihsng/ihsng.htm as of April 18, 2005.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.2 (Nonresponse Analysis and Response Rate Calculation). Washington, DC. July 14.
Approval Date: April 20, 2005
Standard 4.6: The data processing procedures must be documented for both BTS and public use. For external source data, the documentation must include procedures used by the external source as well as procedures that were implemented on the data at BTS. Documentation must allow reproduction of the steps leading to the results.
Key Terms: coding, derived data, edit, external source, imputation, item response, response rates, unit response, weight
Documentation must describe:
For key edits as identified by the data collection staff, maintain measures for the number of:
Documentation of procedures for handling missing data must include:
Document both the source for any coding scheme used and the coding process (whether automated or manual), and make it available to data users. Any reliability or accuracy studies of the coding process should also be documented and made available.
Documentation for any data items derived and added to the file should include all formulas, a detailed description of how each item was created, and the sources of any external information used.
Systems for the processing of data should have documentation of all operations (both automated and manual) necessary to operate, maintain, and update the systems.
Update documentation whenever a major change is made to the processing system, and at least annually when major changes occur less often than once a year.
American Association for Public Opinion Research. 1998. "Standard Definitions – Final Dispositions of Case Codes and Outcome Codes for RDD Telephone Surveys and In-Person Household Surveys." Available at http://www.aapor.org/ethics/stddef.html as of April 18, 2005.
Office of Management and Budget. 2002. Guidelines for Ensuring and Maximizing the Quality, Objectivity, Utility, and Integrity of Information Disseminated by Federal Agencies. Federal Register, Vol. 67, No. 36, pp. 8450-8460. Washington, DC. February 22.