Chapter 4: Processing of Data
Once the data have been collected or acquired from an external source, some processing is usually necessary to make the data ready for conversion into information products.
This chapter contains standards for securing the data during processing (Section 4.1), checking data for potential errors (Section 4.2), dealing with missing data issues (Section 4.3), and adding information to the data (Section 4.4). This chapter also contains standards for monitoring and evaluating data operations, including nonresponse analysis (Section 4.5), and for documenting the data processing operations (Section 4.6).
Standard 4.1: Safeguards must be taken throughout data processing to protect the data from disclosure, theft, or loss.
Key Terms: confidentiality, information security, storage
Guideline 4.1.1: Confidentiality Procedures
Implement the confidentiality procedures given in the BTS Confidentiality Procedures Manual sections on Physical Security Procedures and Security of Information Systems to protect the data (e.g., completed data collection forms, electronic files, and hard copy printouts) from unauthorized disclosure or release during data production, use, storage, transmittal, and disposition.
Guideline 4.1.2: Security of Information Systems
Follow the information system security procedures in the BTS Confidentiality Procedures Manual, and periodically monitor and update them. Ensure that:
- Data files, networks, servers, and desktop PCs are secure from malicious software, unauthorized access, or theft.
- Access to confidential data is controlled so that only authorized staff can read and/or write to the data. The project manager responsible for the data should periodically review staff access rights to guard against unauthorized release or alteration.
Guideline 4.1.3: Data Storage
Develop and implement routine data backups. Secure backup data from unauthorized access or release.
Bureau of Transportation Statistics. 2004. Confidentiality Procedures Manual. Washington, DC.
Federal Committee on Statistical Methodology. 1994. Report on Statistical Disclosure Limitation Methodology, Statistical Policy Working Paper 22. Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/working-papers/spwp22.html as of November 15, 2004.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.4 (Data Protection [during data collection]) and Section 6.5 (Data Protection [during information dissemination]). Washington, DC. July 14.
Approval Date: April 20, 2005
Standard 4.2: As part of standard data processing, mitigate errors by checking and editing both data BTS collects and data it acquires from external sources.
Key Terms: edit, imputation, outliers, skip pattern
Guideline 4.2.1: Types of Edits
At a minimum, the editing process must include checking for the items below, and appropriate editing if errors are detected.
- Omission or duplication of records/units,
- Data that fall outside a pre-specified range, or for categorical data, data that are not equal to specified categories,
- Data that contradict other data within an individual record/unit,
- Data inconsistent with past data or with data from outside sources,
- Missing data that can be directly filled from other portions of the same record or through follow-up with the data provider,
- Incorrect flow through prescribed skip patterns, and
- Selections in excess of the allowable number, such as multiple selections for a "mark one" data item.
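Several of the checks listed above can be expressed as simple automated rules. The following is a minimal sketch only: the field names (`id`, `weight`, `mode`, `hazmat`, `hazmat_class`), the allowable range, and the category list are hypothetical and are not taken from any BTS data collection.

```python
from collections import Counter

# Hypothetical edit rules; the range and categories are illustrative only.
VALID_MODES = {"air", "rail", "truck", "water"}
WEIGHT_RANGE = (0, 80000)  # assumed allowable range for a numeric item


def run_edits(records):
    """Return a list of (record_id, edit_name) pairs, one per failed edit."""
    failures = []
    id_counts = Counter(r["id"] for r in records)
    for r in records:
        # Duplication of records/units
        if id_counts[r["id"]] > 1:
            failures.append((r["id"], "duplicate"))
        # Data outside a pre-specified range
        w = r.get("weight")
        if w is not None and not (WEIGHT_RANGE[0] <= w <= WEIGHT_RANGE[1]):
            failures.append((r["id"], "range"))
        # Categorical data not equal to a specified category
        if r.get("mode") not in VALID_MODES:
            failures.append((r["id"], "category"))
        # Skip pattern: hazmat_class should be answered only when hazmat == "yes"
        if r.get("hazmat") == "no" and r.get("hazmat_class") is not None:
            failures.append((r["id"], "skip_pattern"))
    return failures


records = [
    {"id": 1, "weight": 1200, "mode": "truck", "hazmat": "no", "hazmat_class": None},
    {"id": 2, "weight": 95000, "mode": "truck", "hazmat": "no", "hazmat_class": None},
    {"id": 3, "weight": 500, "mode": "pipeline", "hazmat": "no", "hazmat_class": 3},
    {"id": 1, "weight": 1200, "mode": "truck", "hazmat": "no", "hazmat_class": None},
]
failures = run_edits(records)
```

Checks against past data or outside sources, and checks for omitted records, need a reference file and are not shown here.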
Guideline 4.2.2: Editing Process
In a data editing system:
- Develop editing rules in advance of any data processing. Rules may be modified during data processing (Guideline 4.5.1).
- Minimize manual intervention, since it will result in inconsistent applications of the edit rules and may introduce human error.
- Set the acceptable data ranges for outlier checks at broad enough levels so that legitimate special effects, trend shifts, or industry changes are not erroneously removed.
Guideline 4.2.3: Edit Resolution
Several actions are possible when a data value fails an edit check. Recommended procedures are:
- Verify with the original source or respondent and correct as appropriate,
- Change the data value to the most likely value based upon other information collected, or impute a substitute value (Guideline 4.3.4),
  - For administrative or regulatory data, any changed value needs the data provider's acceptance.
  - Notify the source if a change is made to data provided by an external source.
- Replace the failed value with a missing value indicator (Guideline 4.4.2), or
- Accept the data value as reported. Provide reasons for overriding edits.
Federal Committee on Statistical Methodology. 1990. Data Editing in Federal Statistical Agencies, Statistical Policy Working Paper 18. Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/working-papers/wp18.html as of November 15, 2004.
__________. 1996. Data Editing Workshop and Exposition, Statistical Policy Working Paper 25, Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/working-papers/wp25a.html as of November 15, 2004.
__________. 2001. Measuring and Reporting Sources of Error in Surveys, Statistical Policy Working Paper 31, Section 7.2.3 (Editing Errors), Washington, DC: Office of Management and Budget. Available at http://www.fcsm.gov/01papers/spwp31_final.pdf as of November 15, 2004.
Hawkins, D.M. 1980. Identification of Outliers. New York: Chapman and Hall.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.1 (Data Editing). Washington, DC. July 14.
Approval Date: April 20, 2005
Standard 4.3: Unit and item nonresponse must be appropriately measured, adjusted for, and reported. Response rates must be computed using standard formulas to measure the proportion of the eligible respondents represented by the responding units.
Key Terms: bias, eligible unit, imputation, item, item nonresponse, multivariate analysis, nonresponse bias, overall unit nonresponse, probability of selection, response rates, sample substitution, unit, unit nonresponse, weight
Guideline 4.3.1: Basis for Rates
Calculate unit and item response rates based either on the probability of selection (for household or personal data collections) or on the unit's measure of size (for industry or establishment data collections).
- Base proportions of the total industry on a measure of size available for all eligible units (e.g., annual operating revenue, total employment).
- For sample surveys, use the inverse of the probability of selection (base weights) in response rate calculation. For 100 percent (universe) data collections, the base weight for each unit is one.
- For sample designs using unequal probabilities, such as stratified designs with optimal allocation, report weighted missing data rates along with unweighted missing data rates.
- If sample substitutions were made, calculate response rates without the substituted cases.
Guideline 4.3.2: Unit Response Rates
Calculate unit response rates (RRU) as the ratio of the number of completed data collection cases (CC) to the number of in-scope sample cases (AAPOR 2000). A number of different categories of cases comprise the total number of in-scope cases:
CC = number of completed cases;
R = number of cases that refused to provide any data;
O = number of eligible units not responding for reasons other than refusal;
NC = number of noncontacted units known to be eligible;
U = number of units of unknown eligibility; and
e = estimated proportion of units of unknown eligibility that are eligible.
The unit response rate (OMB 2005) represents a composite of these components:
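Written out from the component definitions above, the unweighted rate takes the form below. This is a reconstruction consistent with the cited AAPOR and OMB definitions, since the expression itself does not appear in this text. For weighted rates (Guideline 4.3.1), each count would be replaced by the corresponding sum of base weights.

```latex
\[
RRU = \frac{CC}{CC + R + O + NC + e(U)}
\]
```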
- The numerator includes all cases that have submitted sufficient information to be considered complete responses for the data collection period.
- Complete cases may contain some missing data items. Data collection staff and principal data users should jointly determine the criteria for considering a case to be complete.
- The denominator includes all original survey units that were identified as being eligible, including units with pending responses with no data received, new eligible units added to the data collection effort, and an estimate of the number of eligible units among the units of unknown eligibility. The denominator does not include units deemed out-of-business, out-of-scope, or duplicates.
- An unweighted version of the unit response rate can be used for tracking and analyzing data collection operations.
- A simple way to calculate e(U) is to compute the weighted ratio of eligible to ineligible in completed cases or eligibility-known cases and assume the same ratio will apply to the U cases.
- If a data collection has special circumstances that justify a formula other than the one above, such as longitudinal or partial response considerations, a more appropriate formula can be used if accompanied by a full explanation of the calculation method.
- When a data collection has multiple stages, calculate the overall unit response rates (RROC) as the product of two or more unit level response rates.
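The calculation described in this guideline can be sketched as follows. The function names and the numbers are illustrative assumptions; the inputs may be raw counts (unweighted rate) or sums of base weights (weighted rate).

```python
def unit_response_rate(cc, r, o, nc, u, e):
    """RRU = CC / (CC + R + O + NC + e*U).

    Inputs may be counts or sums of base weights.
    """
    denominator = cc + r + o + nc + e * u
    if denominator == 0:
        raise ValueError("no in-scope cases")
    return cc / denominator


def estimate_e(eligible_known, ineligible_known):
    """Share of eligibility-known cases that are eligible, assumed to carry over to the U cases."""
    return eligible_known / (eligible_known + ineligible_known)


def overall_unit_response_rate(stage_rates):
    """Overall rate for a multi-stage collection: the product of the stage-level rates."""
    result = 1.0
    for rate in stage_rates:
        result *= rate
    return result


# Illustrative numbers only.
e = estimate_e(eligible_known=900, ineligible_known=100)        # 0.9
rru = unit_response_rate(cc=700, r=100, o=50, nc=50, u=100, e=e)  # 700 / 990
overall = overall_unit_response_rate([0.9, 0.8])
```

Here the 900 eligibility-known eligible cases are the sum of CC, R, O, and NC, so the estimate of e uses the same cases that enter the denominator.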
Guideline 4.3.3: Item Response Rates
Calculate item response rates (RRI) as the ratio of the number of respondents for whom an in-scope response was obtained (CCx for item x) to the number of respondents who were requested to provide information for that item. The number requested to provide information for an item is the number of unit level respondents (CC) minus the number of respondents with a valid skip for item x (Vx). When an abbreviated questionnaire is used to convert refusals, the eliminated questions are treated as item nonresponse.
- Calculate the total item response rates (RRTx) for specific items as the product of the overall unit response rate (RRO) and the item response rate for item x (RRIx).
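As a minimal sketch of the two rates defined in this guideline (function names and numbers are illustrative assumptions):

```python
def item_response_rate(cc_x, cc, v_x):
    """RRI for item x: in-scope responses (CCx) over respondents asked the item (CC - Vx)."""
    asked = cc - v_x
    if asked <= 0:
        raise ValueError("no respondents were asked this item")
    return cc_x / asked


def total_item_response_rate(rro, rri_x):
    """RRT for item x: overall unit response rate times the item response rate."""
    return rro * rri_x


# Illustrative numbers only: 700 unit respondents, 100 valid skips, 540 answers.
rri = item_response_rate(cc_x=540, cc=700, v_x=100)  # 540 / 600
rrt = total_item_response_rate(rro=0.75, rri_x=rri)
```

With these inputs the item rate is 0.9, but the total item rate falls to 0.675 once unit nonresponse is taken into account.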
Guideline 4.3.4: Imputation
Decisions regarding whether or not to adjust data, adjust weights, and impute for missing data should be based on how the data will be used and the assessment of the bias due to missing data that is likely to be encountered.
- To avoid biased estimates, include imputed data in any reported totals.
- When used, imputation procedures should be internally consistent, be based on theoretical and empirical considerations, be appropriate for the analysis, and make use of the most relevant data available.
- Since most data sets are subject to analysis by users to detect relationships between variables, implement imputation methods that preserve multivariate relationships.
- To ensure data integrity, re-edit data after imputation.
Guideline 4.3.5: Weight Adjustments
For data collections involving sampling, adjust weights for unit nonresponse, unless unit imputation is warranted. Adjust weights for missing units within classes of sub-populations to reduce bias.
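One common way to adjust weights within classes is a weighting-class adjustment, in which respondents' base weights are inflated so that they also represent the nonrespondents in the same class. The guideline does not prescribe this particular method; the sketch below is one possible approach, with hypothetical field names and classes.

```python
from collections import defaultdict


def adjust_weights(units):
    """Weighting-class nonresponse adjustment.

    units: list of dicts with keys 'id', 'cls', 'base_weight', 'responded'.
    Returns {id: adjusted weight} for responding units only.
    """
    total = defaultdict(float)      # sum of base weights, all eligible units in class
    responding = defaultdict(float) # sum of base weights, respondents in class
    for u in units:
        total[u["cls"]] += u["base_weight"]
        if u["responded"]:
            responding[u["cls"]] += u["base_weight"]
    adjusted = {}
    for u in units:
        if u["responded"]:
            factor = total[u["cls"]] / responding[u["cls"]]
            adjusted[u["id"]] = u["base_weight"] * factor
    return adjusted


units = [
    {"id": "a", "cls": "small", "base_weight": 10.0, "responded": True},
    {"id": "b", "cls": "small", "base_weight": 10.0, "responded": False},
    {"id": "c", "cls": "large", "base_weight": 2.0, "responded": True},
    {"id": "d", "cls": "large", "base_weight": 2.0, "responded": True},
]
adjusted = adjust_weights(units)
```

The adjusted weights sum to the same total as the original base weights (24 in this example), so weighted totals continue to represent the full eligible population. A class with no respondents receives no adjusted weights and would need to be collapsed with another class.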
American Association for Public Opinion Research. 2000. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. Lenexa, Kansas: AAPOR.
Kalton, G. 1983. Compensating for Missing Survey Data. Institute for Social Research, University of Michigan.
__________ and Flores-Cervantes, I. 2003. Weighting Methods, Journal of Official Statistics, Vol. 19, No. 2.
__________ and Kasprzyk, D. 1982. Imputing for missing survey responses. Proceedings of the Section on Survey Research Methods, American Statistical Association, 1982, 22-31.
__________ and Kasprzyk, D. 1986. The treatment of missing survey data. Survey Methodology, Vol. 12, No. 1, 1-16.
Little, R.J.A. and Rubin, D. 1987. Statistical Analysis with Missing Data. New York: Wiley.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.2 (Missing Data). Washington, DC. July 14.
Rubin, D.B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Schafer, J.L. 1997. Analysis of Incomplete Multivariate Data. London, UK: Chapman and Hall.
Approval Date: April 20, 2005
Standard 4.4: To allow appropriate analysis, use codes to identify missing, edited, and imputed items. Codes added to convert collected text information into a form that facilitates analysis must use standardized codes, when available, to enhance comparability with other data sources.
Key Terms: coding, editing, external source, imputation, skip pattern
Guideline 4.4.1: Codes for Missing and Inapplicable Data
Use codes on the file that clearly distinguish between cases where an item is missing and cases where an item does not apply, such as when skipped over by a skip pattern.
- Distinguish between data missing initially from the source, unreadable data, and data deleted in the editing process.
- If the data collection instrument contains skip patterns, distinguish between items skipped and items not ascertained (such as refusals).
- Do not use blanks and zeros to identify missing data, as they tend to be confused with actual data. Similarly, do not use numeric codes like a series of nines or eights for missing numeric items if these could be legitimate reported values.
- If a data file acquired from an external source was not previously coded, the level of coding effort should depend on how BTS plans to use the file and on whether BTS plans to further disseminate the file.
- For data in tabular form, the BTS Guide to Style and Publishing Procedures contains a number of symbols and abbreviations to place in cells with various types of missing or inapplicable data.
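One way to keep missing and inapplicable items distinct from real values, including legitimate zeros, is to use dedicated non-numeric codes. The code names and values below are hypothetical and are shown only to illustrate the distinctions this guideline asks for.

```python
from enum import Enum


class MissingCode(Enum):
    """Hypothetical codes; none can be mistaken for a reported numeric value."""
    NOT_REPORTED = "M1"     # missing initially from the source
    UNREADABLE = "M2"       # reported but unreadable
    DELETED_IN_EDIT = "M3"  # removed during editing
    REFUSED = "M4"          # item not ascertained
    VALID_SKIP = "S1"       # item does not apply (skip pattern)


def reported_values(values):
    """Keep only actual reported data, dropping every missing or inapplicable code."""
    return [v for v in values if not isinstance(v, MissingCode)]


def count_missing(values):
    """Count items that are missing, excluding items that legitimately do not apply."""
    return sum(
        1 for v in values
        if isinstance(v, MissingCode) and v is not MissingCode.VALID_SKIP
    )


trips = [3, 0, MissingCode.VALID_SKIP, MissingCode.REFUSED, 7]
reported = reported_values(trips)
n_missing = count_missing(trips)
```

The reported zero stays in the data, the valid skip is excluded from the missing count, and only the refusal counts as item nonresponse.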
Guideline 4.4.2: Indicating Edit Actions and Imputations
Code the data set to indicate edit actions and imputed values.
- Indicate whether cases passed or failed each edit. If a case fails an edit, indicate the edit disposition (Guideline 4.2.3).
- If more than one method could be used to impute a missing data item, indicate the imputation method used.
Guideline 4.4.3: Coding Text Information
Although it is preferable to pre-code responses, it may be necessary to code open-ended text fields for further use.
- To code text data for easier analysis, use standardized codes if they exist (Guideline 2.3.3). Develop other types of codes by using existing DOT or other federal agency practice, or by using standard codes from industry or international organizations, when they exist.
- When manually coding text, create a quality assurance process that verifies at least a sample of the coding to determine if a specific level of coding accuracy and reliability is being maintained.
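A simple quality assurance measure for manual coding is the agreement rate between the original coder and an independent verifier on a sample of cases. The sketch below assumes a hypothetical 95 percent accuracy target and made-up case identifiers and codes; the guideline itself does not set a specific level.

```python
def coding_accuracy(original, verified):
    """Share of verified cases where the verifier's code matches the original code.

    original: {case_id: code} for all coded cases
    verified: {case_id: code} for the sampled cases re-coded by a verifier
    """
    if not verified:
        raise ValueError("verification sample is empty")
    matches = sum(1 for case_id, code in verified.items() if original[case_id] == code)
    return matches / len(verified)


ACCURACY_TARGET = 0.95  # assumed target, set by the project

# Made-up codes for four cases; three were sampled for verification.
original = {101: "4841", 102: "4821", 103: "4811", 104: "4841"}
verified = {101: "4841", 103: "4812", 104: "4841"}

accuracy = coding_accuracy(original, verified)
meets_target = accuracy >= ACCURACY_TARGET
```

In this example the verifier disagrees on one of three sampled cases, so the coding would fall short of the assumed target and warrant review.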
American Association for Public Opinion Research. 1998. "Standard Definitions – Final Dispositions of Case Codes and Outcome Codes for RDD Telephone Surveys and In-Person Household Surveys," http://www.aapor.org/ethics/stddef.html.
Bureau of Transportation Statistics (BTS). 2003. BTS Guide to Style and Publishing Procedures. Washington, DC.
__________. 2005. BTS Statistical Standards Manual, Chapter 2 (Data Collection Planning and Design). Washington, DC.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.3 (Coding). Washington, DC. July 14.
Approval Date: April 20, 2005
Standard 4.5: Monitor and evaluate each data processing activity, both to assess the impact on data quality and to inform data users.
Key Terms: frame, imputation, item nonresponse, incident data, longitudinal, missing at random, multivariate modeling, nonresponse bias, overall unit nonresponse, population, response rates, unit nonresponse, weight
Guideline 4.5.1: Quality Control
Establish quality control procedures to monitor and report on the operation of data processing procedures.
- Incorporate quality control into the processing procedures to automatically produce outputs useable by data system managers. Outputs produced during data processing should be used to adjust procedures for higher quality results and greater efficiency.
- Monitor failure rates for each edit and by case. Analyze the pattern of edit failures graphically to pinpoint problems more easily and prioritize items for follow-up.
- When applicable, automate the process of referring data problems to data providers for quicker resolution.
- Maintain information on the amount of missing data, actions taken, and problems encountered during imputation for inclusion in the data processing (Guidelines 4.6.2 and 4.6.3) and user documentation (Guideline 6.8.1).
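Failure rates by edit and by case can be tallied directly from the list of edit failures. This is a minimal sketch with hypothetical edit names; it assumes each case can fail a given edit at most once.

```python
from collections import Counter


def edit_failure_summary(failures, n_cases):
    """Summarize edit failures.

    failures: iterable of (case_id, edit_name) pairs
    n_cases:  total number of cases processed
    Returns (failure rate per edit, number of failed edits per case).
    """
    failures = list(failures)
    by_edit = Counter(edit for _, edit in failures)
    by_case = Counter(case for case, _ in failures)
    rates = {edit: count / n_cases for edit, count in by_edit.items()}
    return rates, by_case


failures = [(1, "range"), (2, "range"), (2, "skip_pattern"), (5, "range")]
rates, by_case = edit_failure_summary(failures, n_cases=10)
worst_cases = by_case.most_common(1)
```

Sorting the per-edit rates or the per-case counts gives a quick priority list for follow-up, and the same tallies can feed a chart of failure patterns.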
Guideline 4.5.2: Unit Response Analysis Requirement
Conduct an analysis of nonresponse for any data collection with an overall unit response rate (Guideline 4.3.2) less than 80 percent. The objective is to measure the impact of the nonresponse and to determine whether the data are missing at random.
- Compare respondents and nonrespondents across subgroups using external or frame data, if available, or through a nonresponse follow-back survey.
- Compare respondents' characteristics to known characteristics of the population from an external source. This comparison can indicate possible bias, especially if the characteristics in question are related to the data collection effort's key variables.
- Consider multivariate modeling of response using respondent and nonrespondent external data to determine if nonresponse bias exists.
- For a multi-stage data collection effort, focus the response analysis on the stages with the higher missing data rates.
- Evaluate the impact of weighting adjustments on nonresponse bias.
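A first step in comparing respondents and nonrespondents across subgroups is to compute response rates by a frame variable. The sketch below uses a hypothetical `region` variable and unweighted counts; a weighted version would sum base weights instead.

```python
from collections import defaultdict


def response_rate_by_subgroup(frame, key):
    """Unweighted response rate for each value of a frame variable.

    frame: list of dicts, each with the subgroup variable and a 'responded' flag
    """
    counts = defaultdict(lambda: [0, 0])  # subgroup -> [respondents, total units]
    for unit in frame:
        counts[unit[key]][1] += 1
        if unit["responded"]:
            counts[unit[key]][0] += 1
    return {group: resp / total for group, (resp, total) in counts.items()}


frame = [
    {"region": "NE", "responded": True},
    {"region": "NE", "responded": True},
    {"region": "NE", "responded": False},
    {"region": "NE", "responded": True},
    {"region": "SW", "responded": True},
    {"region": "SW", "responded": False},
    {"region": "SW", "responded": False},
    {"region": "SW", "responded": False},
]
rates_by_region = response_rate_by_subgroup(frame, "region")
```

A large gap between subgroups, as in this made-up example, suggests the data may not be missing at random and that further analysis or class-based weight adjustment is warranted.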
Guideline 4.5.3: Item Response Analysis Requirement
If the item response rate (Guideline 4.3.3) is less than 70 percent, conduct an item nonresponse analysis to determine if the data are missing at random at the item level, in a similar fashion to Guideline 4.5.2.
- Analyze missing data rates at the item level and compare the characteristics of the reporters and the non-reporters.
- For some data collections, such as incident data collections, missing data rates may not be known. In such cases, provide estimates or qualitative information on what is known.
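The 80 percent and 70 percent thresholds in Guidelines 4.5.2 and 4.5.3 lend themselves to a simple automated check. The function name and item names below are illustrative assumptions.

```python
UNIT_THRESHOLD = 0.80  # overall unit response rate (Guideline 4.5.2)
ITEM_THRESHOLD = 0.70  # item response rate (Guideline 4.5.3)


def analyses_required(overall_unit_rate, item_rates):
    """Return (unit analysis needed?, sorted list of items needing item analysis)."""
    needs_unit = overall_unit_rate < UNIT_THRESHOLD
    needs_item = sorted(
        name for name, rate in item_rates.items() if rate < ITEM_THRESHOLD
    )
    return needs_unit, needs_item


needs_unit, needs_item = analyses_required(
    0.76, {"revenue": 0.65, "employees": 0.92, "fuel": 0.70}
)
```

An item exactly at 70 percent does not trigger the analysis, because the guideline applies to rates less than the threshold.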
Guideline 4.5.4: Timing of Nonresponse Bias Analyses
Conduct unit and item nonresponse bias analyses prior to the release of any information products.
- Analyze the missing data effect at least annually if the data collection occurs more than once a year or is continuous.
- Analyze the missing data effect each time data are collected if the collection occurs annually or less often.
- For data collections from longitudinal panels, analyze the effect of missing data after each collection due to attrition of respondents over time.
Guideline 4.5.5: Publishable Items
In those cases where the analysis indicates that the data are not missing at random, the decision to publish individual items should be based on the amount of potential bias due to missing data.
- If the missing data bias analysis shows that the data are not missing at random and the total item response rate (Guideline 4.3.3) is less than 70 percent, the estimate should be regarded as unreliable.
- Suppress or flag estimates that are unreliable due to missing data.
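The decision rule above can be sketched as a flagging step applied before publication. This sketch reads the 70 percent threshold as applying to the total item response rate of Guideline 4.3.3, which is an interpretation; the item names and values are made up.

```python
def publication_flags(estimates, threshold=0.70):
    """Label each item 'unreliable' or 'ok'.

    estimates: {item: (missing_at_random, total_item_response_rate)}
    An item is unreliable when the data are not missing at random AND the
    total item response rate is below the threshold.
    """
    flags = {}
    for item, (missing_at_random, rate) in estimates.items():
        unreliable = (not missing_at_random) and rate < threshold
        flags[item] = "unreliable" if unreliable else "ok"
    return flags


flags = publication_flags({
    "fuel_cost": (False, 0.62),    # not missing at random, low rate
    "trip_count": (False, 0.85),   # not missing at random, adequate rate
    "vehicle_age": (True, 0.60),   # low rate, but missing at random
})
```

Items labeled unreliable would then be suppressed or flagged in the released product.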
Bureau of Transportation Statistics. 2005. BTS Statistical Standards Manual, Chapter 6 (Dissemination of Information). Washington, DC.
Groves, R. 1989. Survey Errors and Survey Costs. New York, NY: Wiley, Chapters 10 and 11.
Interagency Household Survey Nonresponse Group. Information available at http://www.fcsm.gov/committees/ihsng/ihsng.htm as of April 18, 2005.
Office of Management and Budget. 2005. Standards for Statistical Surveys (Proposed), Section 3.2 (Nonresponse Analysis and Response Rate Calculation). Washington, DC. July 14.
Approval Date: April 20, 2005
Standard 4.6: The data processing procedures must be documented for both BTS and public use. For external source data, the documentation must include procedures used by the external source as well as procedures that were implemented on the data at BTS. Documentation must allow reproduction of the steps leading to the results.
Key Terms: coding, derived data, edit, external source, imputation, item response, response rates, unit response, weight
Guideline 4.6.1: Edit Procedures
Documentation must describe:
- The edit rules and their purpose,
- Procedures for handling records that fail edits,
- A description of the codes used to indicate edit disposition (Guideline 4.2.3), and
- The procedures for, and the results of, any edit performance evaluations.
Guideline 4.6.2: Measures of Edit Performance
For key edits as identified by the data collection staff, maintain measures for the number of:
- Edit messages, by edit disposition (Guideline 4.2.3),
- Edit messages resulting in revisions of the original data, and
- Edit messages overridden, by reason for overriding the edit.
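These measures can be produced from a log of edit messages. The log layout, disposition labels, and override reasons below are hypothetical; the dispositions loosely follow the options in Guideline 4.2.3.

```python
from collections import Counter

# Hypothetical edit log: one entry per edit message.
edit_log = [
    {"edit": "range", "disposition": "corrected_by_source", "revised": True, "override_reason": None},
    {"edit": "range", "disposition": "imputed", "revised": True, "override_reason": None},
    {"edit": "range", "disposition": "accepted_as_reported", "revised": False,
     "override_reason": "confirmed by respondent"},
    {"edit": "consistency", "disposition": "set_to_missing", "revised": True, "override_reason": None},
    {"edit": "consistency", "disposition": "accepted_as_reported", "revised": False,
     "override_reason": "known industry change"},
    {"edit": "range", "disposition": "accepted_as_reported", "revised": False,
     "override_reason": "confirmed by respondent"},
]

# Edit messages by edit disposition
by_disposition = Counter(m["disposition"] for m in edit_log)
# Edit messages resulting in revisions of the original data
n_revised = sum(1 for m in edit_log if m["revised"])
# Edit messages overridden, by reason for overriding the edit
overrides_by_reason = Counter(
    m["override_reason"] for m in edit_log if m["override_reason"] is not None
)
```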
Guideline 4.6.3: Procedures for Handling Missing Data
Documentation of procedures for handling missing data must include:
- The unit response rate or rates,
- Item response rates for key variables as identified by the data collection staff,
- Item response rates for any items with response rates less than 70 percent,
- Formulas used to calculate unit and item response rates,
- Results of response bias analyses,
- Full documentation of the methods of imputation or weight adjustments,
- A description of the coding schemes used to identify missing and imputed values, and
- An assessment of the nature, extent, and effects of imputation or weight adjustments.
Guideline 4.6.4: Procedures for Coding Text Information
Document both the source for any coding scheme used and the coding process (whether automated or manual), and make it available to data users. Any reliability or accuracy studies of the coding process should also be documented and made available.
Guideline 4.6.5: Derived Data Items
Documentation should include all formulas, detailed descriptions of how each derived item was created, and the sources of any external information used to derive additional data items for the file.
Guideline 4.6.6: Information Systems Documentation
Systems for the processing of data should have documentation of all operations (both automated and manual) necessary to operate, maintain, and update the systems.
- The documentation should provide an overview of integrated manual and automated operations, workflow, interfaces, and personnel requirements.
- Documentation should be sufficiently detailed and complete that personnel unfamiliar with the systems can become knowledgeable and operate them, if necessary.
- Information systems documentation may be incorporated into existing documentation or written as a separate document.
Guideline 4.6.7: Documentation Updates
Update documentation whenever a major change to the processing system is made, but at least annually when the frequency is less than annual.
American Association for Public Opinion Research. 1998. "Standard Definitions – Final Dispositions of Case Codes and Outcome Codes for RDD Telephone Surveys and In-Person Household Surveys." Available at http://www.aapor.org/ethics/stddef.html as of April 18, 2005.
Office of Management and Budget. 2002. Guidelines for Ensuring and Maximizing the Quality, Objectivity, Utility, and Integrity of Information Disseminated by Federal Agencies. Federal Register, Vol. 67, No. 36, pp. 8450-8460. Washington, DC. February 22.