Given the collection design, the next phase in the data acquisition process is the collection process itself. This collection process can be a one-time execution of a survey, a monthly (or other periodic) data collection, a continuous reporting of incident data, or a compilation of data already collected by one or more third parties. The physical details of carrying out the collection are critical to making the collection design a reality.
3.1 Data Collection Operations
- The collection "instruments" are forms, questionnaires, automated collection screens, and file layouts used to collect the data. They consist of sets of questions or annotated blanks on paper or computer that request information from data suppliers. They should be designed to maximize communication to the data supplier.
- Data collection includes all the processes involved in carrying out the data collection design to acquire data. Data collection operations can have a high impact on the ultimate data quality.
- The data collection method should be appropriate to the data complexity, collection size, data requirements, and amount of time available.
Examples: A reporting collection will rely partially on the required reporting process, but will also follow up on missing data. Similarly, a large survey requiring a high response rate will often start with a mail out, followed by telephone contact, and finally by a personal visit.
- Specific data collection environmental choices can significantly affect error introduced at the collection stage.
For example, if the data collector is collecting as a collateral duty or is working in an uncomfortable environment, the quality of the data collected may suffer. Likewise, data that are particularly difficult to collect are more likely to be of lower quality.
- Conversion of data on paper to electronic form (e.g., key entry, scanning) introduces a certain amount of error which must be controlled.
- Third party sources of data may introduce some degree of error in their collection processes.
- Collection instruments should be clearly defined for data suppliers, with entries in a logical sequence, reasonable visual cues, and limited skip patterns. Instructions should help minimize missing data and response error.
- A status tracking procedure should be used to ensure that data are not lost in mailings, file transfers, or collection handling. A tracking system for incoming third-party data should ensure that all required data are received.
- Data entry of paper forms should have a verification process ensuring that data entry errors remain below set limits based on data accuracy requirements.
For example, the verification samples of key entry forms can be based on an average outgoing quality limit for batches of forms. A somewhat more expensive approach would be 100 percent verification.
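The sample-based verification described above can be sketched as follows. This is a minimal illustration, not a full average outgoing quality limit (AOQL) sampling plan: it re-keys a random sample of forms from a batch and accepts the batch only if the observed error rate stays at or below a limit. The function name `verify_batch`, the dictionary layout, and the limit value are assumptions made for the example.

```python
import random

def verify_batch(keyed, rekeyed, sample_size, max_error_rate, seed=0):
    """Compare a random sample of keyed forms against an independent re-keying.

    keyed, rekeyed: dicts mapping form_id -> record (hypothetical layout).
    Returns a summary dict with the observed error rate and an accept flag.
    """
    if not keyed:
        raise ValueError("batch is empty")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample_ids = rng.sample(sorted(keyed), min(sample_size, len(keyed)))
    # A form counts as an error if the two key entries disagree in any way.
    errors = sum(1 for form_id in sample_ids if keyed[form_id] != rekeyed[form_id])
    error_rate = errors / len(sample_ids)
    return {
        "sampled": len(sample_ids),
        "errors": errors,
        "error_rate": error_rate,
        "accept": error_rate <= max_error_rate,
    }
```

Setting `sample_size` to the batch size corresponds to 100 percent verification. A rejected batch would typically be re-keyed or fully verified, depending on the accuracy requirements.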
- Make the data collection as easy as possible for the collector.
- If interviewers or observers are used, a formal training process should be established to ensure proper procedures are followed.
- Data calculations and conversions at the collection level should be minimized.
For example, if a bus driver is counting passengers, they should not be doing calculations such as summations. The driver should record the raw counts and calculations should be performed where they are less likely to result in mistakes.
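As a small illustration of this division of labor, the sketch below assumes the driver logs only raw boardings per stop and that totals are computed later by the processing system. The record layout and the names `raw_counts` and `total_boardings_by_trip` are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records as logged in the field: (trip_id, stop, boardings).
raw_counts = [
    ("trip-1", "A", 4),
    ("trip-1", "B", 7),
    ("trip-2", "A", 3),
    ("trip-2", "B", 0),
]

def total_boardings_by_trip(records):
    """Sum raw boardings per trip; performed centrally, not by the collector."""
    totals = defaultdict(int)
    for trip_id, _stop, boardings in records:
        totals[trip_id] += boardings
    return dict(totals)

totals = total_boardings_by_trip(raw_counts)
```

Keeping the raw counts also means the summation can be rechecked or redone later, which is not possible if only a hand-computed total is recorded.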
- The collection operation procedures should be documented and clearly posted with the data, or with disseminated output from the data. If third party data collection is used, procedures used by the third party should be provided as well.
- Federal Committee on Statistical Methodology. 1983. Approaches to Developing Questionnaires. Washington, DC: U.S. Office of Management and Budget (Statistical Policy Working Paper 10).
- Groves, R. 1989. Survey Errors and Survey Costs. New York, NY: Wiley, Chs. 10 & 11.
3.2 Missing Data Avoidance
- Some missing data occur in almost any data collection effort. Unit-level missing data occur when a report that should have been received is completely missing or is received and cannot be used (e.g., garbled data, missing key variables). Item-level missing data occur when data are missing for one or more items in an otherwise complete report.
For example, for an incident report on a hazardous material spill, unit-level missing data occur if the report was never sent in. They also occur if the report was sent in but all entries were obliterated. Item-level missing data occur if the report was complete except that it did not indicate the quantity spilled.
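The distinction can be made mechanical once the key variables are named. The sketch below is one possible classification, assuming a report is a dictionary and that `KEY_FIELDS` and `ITEM_FIELDS` list the key variables and the remaining items; both field lists are hypothetical and would come from the collection design.

```python
# Hypothetical field lists for a spill incident report.
KEY_FIELDS = ("report_id", "incident_date")
ITEM_FIELDS = ("material", "quantity_spilled")

def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())

def classify_report(report):
    """Return (status, missing_items) for one expected report.

    status is 'unit-missing' if the report is absent or unusable,
    'item-missing' if some non-key items are blank, else 'complete'.
    """
    if report is None:
        return "unit-missing", []
    if any(is_blank(report.get(field)) for field in KEY_FIELDS):
        # Missing key variables make the whole report unusable.
        return "unit-missing", []
    missing = [field for field in ITEM_FIELDS if is_blank(report.get(field))]
    return ("item-missing", missing) if missing else ("complete", [])
```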
- The extent of unit-level missing data can sometimes be difficult to determine. If a report should be sent in whenever a certain kind of incident occurs, then non-reporters can only be identified if crosschecked with other data sources. On the other hand, if companies are required to send in periodic reports, the previous period may provide a list of the expected reporters for the current period.
Both situations can also arise for item-level missing data. For example, in a travel survey asking for trips made, forgotten trips would not necessarily be known.
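For periodic reporting, using the previous period as the list of expected reporters amounts to a set comparison. The sketch below assumes each reporter has a stable identifier; the function names and the identifiers are illustrative only.

```python
def missing_reporters(previous_period_ids, current_period_ids):
    """Reporters seen last period that have not reported this period."""
    return sorted(set(previous_period_ids) - set(current_period_ids))

def new_reporters(previous_period_ids, current_period_ids):
    """Reporters seen this period that were not in last period's list."""
    return sorted(set(current_period_ids) - set(previous_period_ids))

# Hypothetical company identifiers.
previous = ["C001", "C002", "C003"]
current = ["C001", "C003", "C004"]
to_follow_up = missing_reporters(previous, current)
```

A reporter on the follow-up list is not necessarily a nonrespondent (the company may have closed or merged), so the list is a starting point for recontact rather than a final count of unit-level missing data.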
- Some form of missing data follow-up will dramatically reduce the incidence of both unit-level and item-level missing data.
For example, a process to recontact the data source can be used, especially when critical data are left out. A series of recontacts may be used for unit nonresponse. Incident reporting collections can use some form of cross-check with other data sources to detect incidents that occur but are not reported.
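A cross-check against another data source might look like the sketch below. It assumes that incidents in both sources can be matched exactly on date and location, which is a simplification; real matching usually has to tolerate differences in how dates and places are recorded. The field names and the function `unreported_incidents` are hypothetical.

```python
def unreported_incidents(external_events, received_reports):
    """Return external events with no matching received report.

    Both inputs are lists of dicts with 'date' and 'location' keys
    (an assumed layout); matching is exact on that pair.
    """
    reported = {(r["date"], r["location"]) for r in received_reports}
    return [
        event for event in external_events
        if (event["date"], event["location"]) not in reported
    ]

# Hypothetical data: events known from another source vs. reports received.
external = [
    {"date": "2001-03-02", "location": "Route 9"},
    {"date": "2001-03-05", "location": "Main St"},
]
received = [{"date": "2001-03-02", "location": "Route 9"}]
candidates = unreported_incidents(external, received)
```

Each candidate is a prompt for follow-up with the expected reporter, not proof that a report is missing.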
- When data are supplied by a third-party data collector, some initial data check and follow-up for missing data will dramatically reduce the incidence of missing data.
- All data collection programs should have some follow-up of missing reports and data items, even if the data are provided by third-party sources.
For example, for surveys and periodic reports, it is easy to tell what is missing at any stage and institute some form of contact (e.g., mail out, telephone contact, or personal visit) to fill in the missing data. For incident reports, it is a little more difficult, as a missing report may not be obvious.
- For incident reporting collections where missing reports may not be easily tracked, some form of checking process should exist to reduce missing reports.
- For missing data items, the data collection owner should distinguish critical items, such as legally required items or otherwise important items (e.g., items used to measure DOT or agency performance), from less critical items.
- The missing data avoidance procedures should be documented and clearly posted with the data, or with disseminated output from the data.
- Data collection program design documentation should address how the collection process was designed to produce high rates of response.
- If data are collected by a third party, the data collection program documentation should indicate how the third party deals with missing data, if that documentation is available.
- Groves, R.M. and M.P. Couper. 1998. Nonresponse in Household Interview Surveys. New York, NY: Wiley.