Chapter 4
Processing Data
Once the collected data is in electronic form, some "processing" is usually necessary to mitigate obvious errors, and some analysis is usually necessary to convert data into useful information for decision documents, publications, and postings for the Internet.
4.1 Data Editing and Coding
Principles
- Data editing is the application of checks that identify missing, invalid, duplicate, or inconsistent entries, or that otherwise point to data records that are potentially in error.
- Typical data editing includes range checks, validity checks, consistency checks (comparing answers to related questions), and checks for duplicate records.
- For numerical data, "outliers" are not necessarily bad data. They should be examined for possible correction, rather than systematically deleted.
Note: By "examine" we mean you can check the original forms, compare data items with each other for consistency, and/or follow up with the original source, all to see whether the data are accurate or an error has been introduced.
- Editing is a final inspection-correction method. It is almost always necessary, but data quality is better achieved much earlier in the process through clarity of definitions, forms design, data collection procedures, etc.
- Coding is the process of adding codes to the data set as additional information or converting existing information into a more useful form. Some codes indicate information about the collection. Other codes are conversions of data, such as text data, into a form more useful for data analysis.
For example, a code is usually added to indicate the "outcome" of each case. If there were multiple follow-up phases, the code may indicate in which phase the result was collected. Codes may also be added to indicate editing and missing data actions taken. Text entries are often coded to facilitate analysis. So, a text entry asking for a free form entry of a person's occupation may be coded with a standard code to facilitate analysis.
- Many coding schemes have been standardized.
Examples: the North American Industry Classification System (NAICS) codes, the Federal Information Processing Standards (FIPS) for geographic codes (country, state, county, etc.), and the Standard Occupational Classification (SOC) codes.
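As a minimal sketch of coding a text entry with a standard scheme, the Python fragment below converts free-text state names to FIPS state codes while retaining the original text. The function name `code_state`, the output field names, and the four-entry lookup table are assumptions made for illustration; a production table would hold the complete published code list.

```python
# Illustrative subset of FIPS state codes; a real table would be complete.
FIPS_STATE = {
    "new jersey": "34",
    "new york": "36",
    "california": "06",
    "texas": "48",
}

def code_state(text):
    """Return a coded record for a free-text state entry.

    The original text is kept for troubleshooting, and a status code
    records whether the entry could be matched to a standard code.
    """
    key = " ".join(text.strip().lower().split())  # normalize case and spacing
    fips = FIPS_STATE.get(key)
    return {
        "state_text": text,        # original entry, unchanged
        "state_fips": fips,        # None when no match is found
        "coding_status": "coded" if fips else "uncoded",
    }

records = [code_state(t) for t in ["New Jersey", "  new   york ", "Nwe Jersey"]]
for r in records:
    print(r)
```

The misspelled third entry is left uncoded rather than guessed at, so it can be reviewed against the retained text.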
Guidelines
- An editing process should be applied to every data collection and to third-party data to reduce obvious error in the data. A minimum editing process should include range checks, validity checks, checks for duplicate entries, and consistency checks.
Examples of edits: If a data element has five categories numbered from 1 to 5, an answer of 8 should be edited to delete the 8 and flag it as a missing data value. Range checks should be applied to numerical values (e.g., income should not be negative). Rules should be created to deal with inconsistency (e.g., if dates are given for a train accident and the accident date is before the departure date, the rule would say how to deal with it). Data records should be examined for obvious duplicates.
- Most editing decisions should be made in advance and automated. Reliance on manual intervention in editing should be minimized, since it may introduce human error.
- Do not use outlier edits to the extent that special effects and trends would be hidden. Outliers can be very informative for analysis. Over-editing can lead to severe biases resulting from fitting data to implicit models imposed by the edits.
Rapid industry changes could be missed if an agency follows an overly restrictive editing regimen that rejects large changes.
- Some method should be used to allow after-the-fact identification of edits. One method is to add a separate field containing an edit code (i.e., a "flag"). Another is to keep "version" files, though this provides less information to the users.
- To avoid quality problems from analyst coding and spelling problems, text information to be used for data analysis should be coded using a standard coding scheme (e.g., NAICS, SOC, and FIPS discussed above). Retain the text information for troubleshooting.
- The editing and coding process should clearly identify missing values on the data file. The method of identifying missing values should be clearly described in the file documentation. Special consideration should be given to files that will be directly manipulated by analysts or users. Blanks or zeros used to indicate missing data have historically caused confusion. Also, using a code to identify the reason for the missing data will facilitate missing data analysis.
- The editing and coding process and editing statistics should be documented and clearly posted with the data, or with disseminated output from the data.
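The minimum editing process described in these guidelines can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed implementation: the field names (`category`, `income`, `departure_date`, `accident_date`), the flag strings, and the rule that an inconsistent date pair is flagged for review rather than changed are all hypothetical. Edited values are set to `None` and every action is recorded in an `edit_flags` field so edits can be identified after the fact.

```python
from datetime import date

VALID_CATEGORIES = {1, 2, 3, 4, 5}

def edit_record(rec):
    """Apply validity, range, and consistency edits to one record.

    Returns an edited copy with an 'edit_flags' list describing each action.
    None marks a missing value.
    """
    out = dict(rec)
    flags = []
    # Validity check: a reported category must be one of the five codes.
    cat = out.get("category")
    if cat is not None and cat not in VALID_CATEGORIES:
        out["category"] = None
        flags.append("category:invalid_set_missing")
    # Range check: income should not be negative.
    income = out.get("income")
    if income is not None and income < 0:
        out["income"] = None
        flags.append("income:out_of_range_set_missing")
    # Consistency check: an accident cannot precede the departure.
    dep, acc = out.get("departure_date"), out.get("accident_date")
    if dep is not None and acc is not None and acc < dep:
        flags.append("accident_date:before_departure_review")
    out["edit_flags"] = flags
    return out

def edit_file(records, key_fields):
    """Edit all records and flag repeats of the same key as duplicates."""
    seen = set()
    edited = []
    for rec in records:
        out = edit_record(rec)
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            out["edit_flags"].append("record:duplicate")
        seen.add(key)
        edited.append(out)
    return edited

good = {"id": 1, "category": 3, "income": 52000,
        "departure_date": date(2004, 3, 1), "accident_date": date(2004, 3, 1)}
bad = {"id": 2, "category": 8, "income": -100,
       "departure_date": date(2004, 3, 2), "accident_date": date(2004, 3, 1)}
edited = edit_file([good, bad, dict(good)], key_fields=["id"])
for rec in edited:
    print(rec["id"], rec["edit_flags"])
```

Because the rules are fixed in advance and applied by the program, the same decisions are made for every record, and counts of each flag can serve as the editing statistics posted with the data.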
References
- Little, R. and P. Smith (1987) "Editing and Imputation for Quantitative Survey Data," Journal of the American Statistical Association, Vol. 82, No. 397, pp. 58-68.
4.2 Handling Missing Data
Principles
- Untreated, missing data can introduce serious error into estimates. Frequently, there is a correlation between the characteristics of those missing and variables to be estimated, resulting in biased estimates. For this reason, it is often best to employ adjustments and imputation to mitigate this damage.
- Without weight adjustments or imputation, calculated totals are underestimated. Essentially, zeroes are implicitly imputed for the missing items.
- One method used to deal with unit-level missing data is weighting adjustments. All cases, including the missing cases, are put into classes using variables known for both types. Within the classes, the weights for the missing cases are evenly distributed among the non-missing cases.
- "Imputation" is a process that substitutes values for missing or inconsistent reported data. Such substitutions may be strongly implied by known information or derived as statistical estimates.
- If imputation is employed and flagged, users can either use the imputed values or deal with the missing data themselves.
- The impact of missing data for a given estimate is a combination of how much is missing (often known via the missing data rates) and how much the missing differ from the sources that provided data in relation to the estimate (usually unknown).
For example, given a survey of airline pilots that asks about near-misses they are involved in and whether they reported them, it is known how many of the sampled pilots did not respond. You will not know if the ones who did respond had a lower number of near-misses than the ones who did not.
- For samples with unequal probabilities, weighted missing data rates give a better indication of impact of missing data across the population than do unweighted rates.
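The weighting adjustment and the missing data rates described above can be sketched as follows. The sketch uses the common multiplicative form of the weighting class adjustment, in which each respondent's weight is multiplied by the ratio of the total class weight to the respondent class weight; this matches an even redistribution when respondents in a class share the same weight. The function names, field names, and the five-case data set are assumptions for illustration.

```python
from collections import defaultdict

def missing_rates(cases):
    """Return (unweighted, weighted) unit-level missing data rates."""
    n_missing = sum(1 for c in cases if not c["responded"])
    w_missing = sum(c["weight"] for c in cases if not c["responded"])
    w_all = sum(c["weight"] for c in cases)
    return n_missing / len(cases), w_missing / w_all

def adjust_for_nonresponse(cases):
    """Weighting class adjustment for unit nonresponse.

    Respondent weights in each class are scaled so that respondents also
    represent the nonrespondents in the same class.
    """
    total = defaultdict(float)
    resp = defaultdict(float)
    for c in cases:
        total[c["cls"]] += c["weight"]
        if c["responded"]:
            resp[c["cls"]] += c["weight"]
    for cls in total:
        if resp[cls] == 0:
            raise ValueError(f"class {cls!r} has no respondents; collapse classes")
    return [
        {**c, "adj_weight": c["weight"] * total[c["cls"]] / resp[c["cls"]]}
        for c in cases
        if c["responded"]
    ]

# Hypothetical sample: classes are defined by a variable known for all cases.
cases = [
    {"id": 1, "cls": "urban", "weight": 100.0, "responded": True},
    {"id": 2, "cls": "urban", "weight": 100.0, "responded": False},
    {"id": 3, "cls": "rural", "weight": 50.0, "responded": True},
    {"id": 4, "cls": "rural", "weight": 50.0, "responded": True},
    {"id": 5, "cls": "rural", "weight": 50.0, "responded": False},
]
unweighted, weighted = missing_rates(cases)
adjusted = adjust_for_nonresponse(cases)
print(unweighted, weighted)
print(sum(c["adj_weight"] for c in adjusted))
```

After adjustment the respondent weights sum to the original total of all sampled weights, so estimated totals no longer treat the missing cases as zeroes.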
Guidelines
- Unit nonresponse should normally be handled with a weighting adjustment as described above. If no adjustment is made, data users should be informed about the missing values.
- Imputing for missing item-level data (see definition above) should be considered to mitigate bias. A missing data expert should make or review decisions about imputation. If imputation is used, a separate field containing a code (i.e., a flag) should be added to the imputed data file indicating which variables have been imputed and by what method.
- All methods of imputation or weight adjustments should be fully documented.
- The missing data effect should be analyzed. For periodic data collections, it should be analyzed after each collection. For continuous collections, it should be analyzed at least annually. At a minimum, the analysis should include missing data rates at the unit and item levels and analysis of the characteristics of the reporters and the non-reporters to see how they differ. For some reporting collections, such as with incidents, missing data rates may not be known. For such cases, estimates or just text information on what is known should be provided.
- For sample designs using unequal probabilities (e.g., stratified designs with optimal allocation), weighted missing data rates should be reported along with unweighted missing data rates.
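A flagged imputation of the kind recommended above might look like the sketch below. Class-mean imputation is used only because it is short; it is one simple method among many, it tends to understate variability, and the choice of method remains a matter for a missing data expert. The function name, the `_impflag` suffix, the flag values, and the sample records are hypothetical.

```python
from statistics import mean

def impute_class_mean(records, item, cls_field):
    """Fill missing values of `item` with the mean of the reported values in
    the same class, recording the action in a companion flag field."""
    reported = {}
    for r in records:
        if r[item] is not None:
            reported.setdefault(r[cls_field], []).append(r[item])
    flag_field = item + "_impflag"
    out = []
    for r in records:
        new = dict(r)
        if r[item] is not None:
            new[flag_field] = "reported"
        elif reported.get(r[cls_field]):
            new[item] = mean(reported[r[cls_field]])
            new[flag_field] = "class_mean"
        else:
            # No reported values in this class: leave the value missing.
            new[flag_field] = "missing_not_imputed"
        out.append(new)
    return out

records = [
    {"id": 1, "mode": "bus", "trip_miles": 4.0},
    {"id": 2, "mode": "bus", "trip_miles": 6.0},
    {"id": 3, "mode": "bus", "trip_miles": None},
    {"id": 4, "mode": "rail", "trip_miles": None},
]
imputed = impute_class_mean(records, item="trip_miles", cls_field="mode")
for r in imputed:
    print(r)
```

Because the flag names the method, users who prefer their own treatment can drop the imputed values and start from the reported ones.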
References
- Chapter 4, Statistical Policy Working Paper 31, Measuring and Reporting Sources of Error in Surveys, Statistical Policy Office, Office of Information and Regulatory Affairs, Office of Management and Budget, July 2001.
- The American Association for Public Opinion Research. 2000. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. Ann Arbor, Michigan: AAPOR.
4.3 Production of Estimates and Projections
Principles
- "Derived" data items are additional case-level data that are either directly calculated from other data collected (e.g., the number of days between two dates), added from a separate data source (e.g., the weather on a given date), or some combination of the two (e.g., given the departing and arriving airports, calculating the distance from an external source). Deriving data is a way to enhance the data set without increasing respondent burden or significantly raising costs.
- An "estimate" is an approximation of some characteristic of the target group, like the average age, constructed from the data.
- A "projection" is a prediction of an outcome from the target group, usually in the future.
Examples: The average daily traffic volume at a given point of the Garden State Parkway in New Jersey two years from now. Total airline operations ten years from now.
- Estimates from samples should be calculated taking the sample design into account. The most common way this is done is weighted averages using weights based on the design.
- Estimates of standard error of an estimate will give an indication of the precision of the estimate. However, it will not include a measure of bias that may be introduced by problems in collection or design.
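To illustrate how the sample design enters an estimate, the sketch below computes a weighted total and weighted mean using weights equal to the inverse of each case's sampling probability. The function names and the four-case data set (two cases sampled with probability 0.1 and two with probability 0.5) are assumptions for illustration.

```python
def weighted_total(values, weights):
    """Estimate a population total: each case counts for `weight` units."""
    return sum(w * y for y, w in zip(values, weights))

def weighted_mean(values, weights):
    """Estimate a population mean as the weighted total over the weight sum."""
    return weighted_total(values, weights) / sum(weights)

# Hypothetical sample: ages of four respondents and their sampling probabilities.
ages = [30, 40, 50, 60]
probs = [0.1, 0.1, 0.5, 0.5]
weights = [1 / p for p in probs]           # design weights: 10, 10, 2, 2

design_mean = weighted_mean(ages, weights)
naive_mean = sum(ages) / len(ages)         # ignores the design
print(design_mean, naive_mean)
```

The unweighted mean gives the heavily sampled group the same influence as the lightly sampled one, so it differs noticeably from the design-based estimate.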
Guidelines
- Use derived data to enhance the data set without additional burden on data suppliers.
For example, the data collection can note the departure and arrival airports, and the distance of the flight can be derived from a separate table and added to the record.
- Weights should be used in all estimates from samples. Weights give the number of cases in the target group that each case represents, and are calculated as the inverse of the sampling probability. If using weights, adjust weights for nonresponse as discussed in section 4.2.
For example, the National Household Travel Survey is designed to be a sample representing the households of the United States, so the total of the weights for all sample households should equal the number of households in the United States. Due to sampling variability, it won't. Since we have a very good count of households in the United States from the 2000 Census, we can do a ratio adjustment of all weights to make them total to that count.
- Construct estimation methods using published techniques or your own documented derivations appropriate for the characteristic being estimated. Forecasting experts should be consulted when determining projections.
Example: You have partial year data and you want to estimate whole year data. A simple method is to use past partial year to whole year ratios (if stable year to year) to construct an extrapolation projection (Armstrong 2001).
- Standard error estimates should accompany any estimates from samples.
Standard errors should be calculated taking the sample design into account. For more complex sample designs, use replication methods (e.g., jackknife, successive differences) incorporating the sample weights. Consult with a variance estimation expert.
- Ensure that any statistical software used in constructing estimates and their standard errors uses methods that take into account the design of the data collection.
- The methods used for estimations and projections should be documented and clearly posted with the resulting data.
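Two of the steps in these guidelines, the ratio adjustment of weights to a known control total and a replication-based standard error, can be sketched as follows. The jackknife shown is the basic delete-one form, which assumes a single-stage sample with no strata or clusters; for a stratified or clustered design the replicates have to be formed to follow the design, which is where a variance estimation expert comes in. All names and numbers are hypothetical.

```python
from math import sqrt

def ratio_adjust(weights, control_total):
    """Scale all weights by one factor so they sum to a known control total."""
    factor = control_total / sum(weights)
    return [w * factor for w in weights]

def weighted_mean(values, weights):
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)

def jackknife_se(values, weights):
    """Delete-one jackknife standard error of the weighted mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two cases")
    # Re-estimate the mean n times, leaving out one case each time.
    reps = [
        weighted_mean(values[:i] + values[i + 1:], weights[:i] + weights[i + 1:])
        for i in range(n)
    ]
    center = sum(reps) / n
    return sqrt((n - 1) / n * sum((r - center) ** 2 for r in reps))

# Hypothetical sample of five households: daily trips and design weights.
trips = [2.0, 4.0, 4.0, 6.0, 9.0]
weights = [120.0, 80.0, 100.0, 100.0, 90.0]

adj_weights = ratio_adjust(weights, control_total=500.0)
estimate = weighted_mean(trips, adj_weights)
se = jackknife_se(trips, adj_weights)
se_equal = jackknife_se(trips, [1.0] * len(trips))  # equal-weight check case
print(sum(adj_weights), estimate, se)
```

A single ratio factor changes estimated totals but leaves the weighted mean unchanged. With equal weights, the delete-one jackknife reduces to the familiar sample standard deviation divided by the square root of the sample size, which is a useful check on the implementation.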
References
- Armstrong, J.S. (2001). "Extrapolation of Time Series and Cross-Sectional Data," in Principles of Forecasting: A Handbook for Researchers and Practitioners, edited by J. S. Armstrong, Boston: Kluwer.
- Cochran, W.G. (1977). Sampling Techniques (3rd Ed.). New York: Wiley.
- Wolter, K.M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.
4.4 Data Analysis and Interpretation
Principles
- Careful planning of complex analyses needs to involve concerned parties. Data analysis starts with questions that need to be answered. Analyses should be designed to focus on answering the key questions rather than showing all data results from a collection.
- Analysis methods are designed around probability theory allowing the analyst to separate indications of information from uncertainty.
- For analysis of data collected using complex sample designs, such as surveys, the design must be taken into account when determining data analysis methods (e.g., use weights, replication for variances).
- Estimates from 100% data collections do not have sampling error, though they are usually measuring a random phenomenon (e.g., highway fatalities), and therefore have a non-zero variance.
- Data collected at sequential points in time often require analysis with time series methods to account for inter-correlation of the sequential points. Similarly, data collected from contiguous geographical areas require spatial data analysis.
Note: Methods like linear regression assume independence of the data points, which may make them invalid in time and geographical cases. The biggest impact is in variance estimation and testing.
- Interpretation should take into account the stability of the process being analyzed. If the analysis interprets something about a process, but the process has been altered significantly since the data collection, the analysis results may have limited usefulness in decision making.
- The "robustness" of an analytical method is the degree to which its results are insensitive to violations of its assumptions. Robustness is a critical factor in planning and interpreting an analysis.
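One way to check the independence assumption mentioned in the note above is to look at the residuals of a fitted trend for correlation between neighboring points. The sketch below fits a straight-line trend by ordinary least squares and computes the Durbin-Watson statistic, a standard diagnostic that is not named in this chapter and is used here only as an illustration. Values near 2 suggest little lag-one correlation, while values toward 0 indicate positive correlation between successive residuals. The monthly series is hypothetical.

```python
def linear_trend_residuals(y):
    """Residuals from an ordinary least squares fit of y on time 0, 1, 2, ..."""
    n = len(y)
    t = list(range(n))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    slope = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / sxx
    intercept = y_bar - slope * t_bar
    return [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]

def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squared residuals."""
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

# Hypothetical monthly counts with a smooth cyclical pattern.
series = [10, 12, 13, 13, 12, 10, 9, 9, 10, 12, 13, 13]
dw = durbin_watson(linear_trend_residuals(series))
print(dw)
```

For this series the statistic falls well below 2, a sign that standard errors and tests from an ordinary regression would be unreliable and that time series methods are the safer choice.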
Guidelines
- The planning of data analysis should begin with identifying the questions that need to be answered. For all but the simplest analyses, a project plan should be developed. Subject matter experts should review the plan to ensure that the analysis is relevant to the questions that need answering. Data analysis experts should review the plan (even if written by one) to ensure proper methods are used. Even "exploratory analyses" should be planned.
- All statistical methods used should be justifiable by statistical derivation or reference to statistical literature. The analysis process should be accompanied by a diagnostic evaluation of the analysis assumptions. The analysis should also include an examination of the probability that statistical assumptions will be violated to various degrees, and the effect such violations would have on the conclusions. All methods, derivations or references, assumption diagnostics, and the robustness checks should be documented in the plan and the final report.
Choices of data analysis methods include descriptive statistics for each variable, a wide range of graphical methods, comparison tests, multiple linear regression, logistic regression, analysis of variance, nonparametric methods, nonlinear models, Bayesian methods, control charts, data mining, cluster analysis, and factor analysis (this list is not exhaustive).
- Any analysis of data collected using a complex sample design should incorporate the sample design into the methods via weights and changes to variance estimation (e.g., replication).
- Data analysis for the relationship between two or more variables should include other related variables to assist in the interpretation. For example, an analysis may find a relationship between race and travel habits. That analysis should probably include income, education, and other variables that vary with race. Missing important variables can lead to bias. A subject matter expert should choose the related variables.
- Results of the analysis should be documented and either included with any report that uses the results or posted with it. The documentation should focus on the questions that are answered, identify the methods used (along with the accompanying assumptions) with derivation or reference, and include limitations of the analysis. The analysis report should always contain a statement of the limitations, including coverage and response limitations (e.g., not all private transit operators are included in the National Transit Database; any analysis should take this into account). The wording of the results should reflect the fact that a statistically significant result is only an indication that the null hypothesis may not hold true; it is not absolute proof. Similarly, when a test does not show significance, it does not mean that the null hypothesis is true; it only means that there was insufficient evidence to reject it.
- Results from analysis of 100 percent data typically should not include tests or confidence intervals that are based on a sampling concept. Any test or confidence interval should use a measure of the variability of the underlying random phenomenon.
For example, the standard error of the time series can be used to measure the variance of the underlying random phenomenon with 100 percent data over time. It can also be used to measure sampling error and underlying variance when the sample is not 100 percent.
- The interpretation of the analysis results should comment on the stability of the process analyzed.
For example, if an analysis were performed on two years of airport security data prior to the creation of the Transportation Security Administration and the new screening workforce, the interpretation of the results relative to the new processes would be questionable.
References
- Skinner, C., D. Holt, and T. Smith. 1989. Analysis of Complex Surveys. New York, NY: Wiley.
- Tukey, J. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.
- Agresti, A. 1990. Categorical Data Analysis. New York, NY: Wiley.