Household Survey Results

General Methodology

August 2000 to March 2001

The Bureau of Transportation Statistics (BTS), the federal statistical agency for the United States Department of Transportation (USDOT) charged with improving the knowledge base for public decision making, coordinates the Omnibus Survey program. The survey is a ONEDOT effort to collect information about the transportation system, how it is used, and how it is viewed by the users. Through Omnibus Household Surveys, BTS gathers data each month from 1,000 randomly selected households to determine the general public's perception of, expectations from, and satisfaction with the nation's transportation system and to prioritize improvements to the transportation system.

Each of the monthly surveys contains a set of core questions based on critical information needs within DOT. In addition, supplemental questions are included each month that correspond to one of DOT's five strategic goals: safety, mobility, economic growth, human and natural environment, and security. Finally, specific questions posed by the various DOT modes are included on each survey.

Data collected from completed interviews are provided, for each month, in the following file formats:

- Comma-delimited ASCII (CSV file extension)
- Microsoft Excel 97 (XLS file extension)
- SAS Transport (ZIP file extension)

The tables of results are presented in two different formats:

- Hypertext Markup Language (HTML file extension)
- Adobe Acrobat (PDF file extension)

This section describes the overall survey methodology, including the identification of the target population, the selection of the sample, the calculation of the survey weights, and variance estimation procedures.

The target population for the Omnibus Household Survey comprises non-institutionalized persons aged 18* years or older who live in the United States at the time of the interview. This is the population about which inferences are to be made.

*For the months of August, September, and October 2000, the target population included the non-institutionalized population, aged 16 years or older who lived in the United States at the time of the interview.

From August 2000 to March 2001, the GENESYS sampling system, developed and maintained by the Marketing Systems Group (Fort Washington, PA), was used to draw the samples for the monthly surveys. This system employs list-assisted random digit dialing. List-assisted refers to the use of commercial lists of directory-listed telephone numbers to increase the likelihood of dialing household residences. This method gives unlisted telephone numbers the same chance to be selected as directory-listed numbers.

Banks of 100 consecutive telephone numbers (e.g., 301-475-8100 to 301-475-8199) were constructed and compared to a database containing the count of directory-listed residential telephone numbers in each bank. The banks that contained zero directory-listed telephone numbers were deleted from the sampling frame. This greatly increased the chance of dialing residential households. The deleted banks do contain some residential telephone numbers; however, recent research has shown that fewer than 2 percent of residential telephone numbers nationally are located in 100-banks with zero directory-listed numbers.

Prior to sample selection, GENESYS imposed an implicit stratification on the telephone prefixes using the U.S. Census divisions and metropolitan status. Within each U.S. Census division, counties and their associated prefix areas located in metropolitan statistical areas (MSAs) were ordered by the size of the MSA. Counties and their associated prefix areas within a U.S. Census division that are located outside of MSAs were first sorted by state. Within each state, the counties and their associated prefix areas were ordered by geographic location. This implicit stratification ensured that the sample of telephone numbers was geographically representative.

After the prefixes were stratified by U.S. Census division and metropolitan status, a single-stage equal-probability sample of telephone numbers was drawn. The total number of ten-digit telephone numbers in the universe was 100 times the total number of working banks in the universe. The selection interval was calculated by dividing the total number of ten-digit telephone numbers by the designated sample size. To identify the first sample telephone number, a random number between 0 and 1 was generated and multiplied by the selection interval. The integer part of this product divided by 100 identified the sequential working bank where the first sample number was located. The fractional portion of this product, truncated to two digits, provided the suffix. To identify the second sample number, a new random number was generated and was multiplied by the selection interval. This product was added to the selection interval, and the result was divided by 100. The suffix of the sample number was identified in the same way as the suffix of the first sample number. This process continued until all sample telephone numbers were determined.
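The selection arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the interval logic, not the GENESYS implementation: the function name `select_numbers`, the zero-based bank index, and the reading that the k-th sample number falls in the k-th selection interval are assumptions made for the example.

```python
import random

def select_numbers(working_banks, sample_size, rng=None):
    """Return (bank_index, suffix) pairs, one per selection interval."""
    rng = rng or random.Random()
    total_numbers = 100 * working_banks        # ten-digit numbers in the universe
    interval = total_numbers / sample_size     # selection interval
    picks = []
    for k in range(sample_size):
        # Random start inside the k-th interval (k = 0 gives the first number).
        position = k * interval + rng.random() * interval
        index = min(int(position), total_numbers - 1)  # guard against rounding
        # position / 100: integer part is the bank, two-digit remainder the suffix.
        bank, suffix = divmod(index, 100)
        picks.append((bank, suffix))
    return picks

# Hypothetical frame of 50 working banks and a sample of 10 numbers.
sample = select_numbers(working_banks=50, sample_size=10, rng=random.Random(1))
```

Each returned pair identifies a working bank by its position in the sorted frame and a two-digit suffix within that bank.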

Each month GENESYS-ID Plus was used to detect non-working numbers before the sample was released. This system actually dials the telephone number. If the telephone number starts to ring, GENESYS-ID Plus hangs up immediately. If the system detects non-working intercept signals, the telephone number being dialed is excluded from the sample. Non-residential telephone numbers also were excluded from the sample by comparing them to a database of Yellow Pages listings.

This section discusses the development of the survey weights. The final analysis weight reflects all adjustments for non-response, multiple telephone lines, persons per household, and post-stratification and is the weight that should be used for the analysis of the data. The sampling weight, which represents the inverse of the probability of selection, is the starting point for the calculation of the final analysis weight.

The final analysis weights for each month were developed using the following steps:

- calculation of the sampling weight
- adjustment for non-response
- adjustment for multiple telephone lines
- adjustment for selecting a random, adult household member
- post-stratification adjustment to the target population

The product of all of the above quantities represented the final analysis weight. Extreme values of the final analysis weight were then reduced using standard weight-trimming procedures.

The first step in weighting each month's sample is to calculate the sampling weight for each sampled telephone number. The sampling weight *W _{S}* for each telephone number was calculated as the inverse of its probability of selection, or

*W _{S} = N / n*

where *N* is the total number of telephone numbers in the population and *n* is the total number of telephone numbers in the sample.

The non-response adjustment was based on U.S. Census division and metropolitan status (inside or outside an MSA) classification of the telephone numbers. The adjustment method for non-response was changed after October 2000.

From August 2000 through October 2000, the non-response adjustment factor for all telephone numbers in each U.S. Census division *c* by metropolitan status *s* combination was calculated as follows:

where * R _{CS}* is the total number of responding households in U.S. Census region

For data collected from November 2000 through March 2001, the non-response adjustment factor for all telephone numbers in each U.S. Census division *c* by metropolitan status *s* combination was calculated using the Council of American Survey Research Organizations (CASRO) definition:

where the denominator is the CASRO response rate for U.S. Census division *c* and metropolitan status *s*. The non-response adjustment factor for a specific cell (defined by metropolitan status and U.S. Census division) is therefore the inverse of the response rate, that is, the ratio of the estimated number of telephone households to the number of completed surveys. The estimated number of telephone households is the sum of the responding households, non-responding households, and the estimate of telephone households among unresolved numbers. The non-response adjusted weight *W _{NR}* is the product of the sampling weight *W _{S}* and the non-response adjustment factor.
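As a rough illustration of the cell-level calculation described above, the sketch below computes the response rate and its inverse for one division-by-metropolitan-status cell. The function and parameter names are hypothetical, the counts are invented, and the estimate of telephone households among unresolved numbers is taken as an input because the text does not say how it was derived.

```python
def casro_nonresponse_factor(completed, responding, nonresponding,
                             unresolved_households):
    """Return (adjustment factor, response rate) for one cell."""
    # Estimated telephone households: responding + non-responding
    # + estimated households among unresolved numbers.
    estimated_households = responding + nonresponding + unresolved_households
    response_rate = completed / estimated_households
    # The adjustment factor is the inverse of the response rate.
    return estimated_households / completed, response_rate

# Invented counts for a single cell.
factor, rate = casro_nonresponse_factor(
    completed=60, responding=60, nonresponding=30, unresolved_households=10
)
```

With these invented counts the cell has 100 estimated telephone households, so every completed case in the cell stands in for 100/60 households.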

This adjustment accounts for the multiple chances of selection of households with multiple telephone lines used primarily for voice communication. The adjustment for multiple telephone lines is the inverse of the smaller of 3 and the number of telephone lines:

*ADJ _{MT} = 1 / min(3, Number of Telephone Lines)*

For respondents who did not provide this information, it was assumed that the household contained only one telephone line. The non-response adjusted weight *W _{NR}* is then multiplied by the adjustment factor for multiple telephone lines.

The probability of selecting an individual respondent depends upon the number of eligible respondents in the household. Therefore, it is important to account for the total number of eligible household members when constructing the sampling weights. The adjustment used for selecting a random, adult household member is:

*ADJ _{RA} = Number of Eligible Household Members*

For respondents who did not provide this information, a value for *ADJ _{RA}* was imputed according to the distribution of the number of people in a household (from responding households) within the age, gender, and education cross-classification cell matching that of the respondent for whom the value is being imputed. The weight that is adjusted for non-response and for multiple probabilities of selection due to multiple telephone lines is then multiplied by *ADJ _{RA}* to produce *W _{NRMTRA}*.
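The adjustments described so far multiply together, which a short sketch can make concrete. The function name, argument names, and the numbers in the example are illustrative assumptions; in particular, the sketch expects the number of eligible adults to be supplied already (observed or imputed) and does not attempt the imputation itself.

```python
def weight_before_raking(frame_size, sample_size, nonresponse_factor,
                         phone_lines, eligible_adults):
    """Weight after the sampling, non-response, telephone-line,
    and random-adult adjustments (before post-stratification)."""
    sampling_weight = frame_size / sample_size           # W_S = N / n
    # A missing number of lines is treated as one line, as in the text.
    lines = 1 if phone_lines is None else phone_lines
    line_factor = 1.0 / min(3, lines)                    # inverse of min(3, lines)
    return sampling_weight * nonresponse_factor * line_factor * eligible_adults

# Invented values: 200,000 numbers in the frame, 1,000 sampled.
w = weight_before_raking(200_000, 1_000, nonresponse_factor=1.5,
                         phone_lines=2, eligible_adults=3)
w_missing_lines = weight_before_raking(200_000, 1_000, nonresponse_factor=1.5,
                                       phone_lines=None, eligible_adults=1)
```

In the first call the sampling weight of 200 is inflated by 1.5 for non-response, halved for the second telephone line, and tripled for the three eligible adults.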

The final adjustment to the survey weights is a post-stratification adjustment that allows the weights to sum to the target population, i.e., U.S. non-institutionalized persons aged 18 years or older (16 years or older for surveys conducted prior to November 2000), by age, gender, and education. The method of adjustment that was used is called Iterative Proportional Fitting (IPF), or Raking^{a}. The outcome of that procedure is a multiplier *M* that scales *W _{NRMTRA}* within each age/gender/education cell so that the weighted marginal sums for age, gender, and education agree with the corresponding Census Bureau distributions for these characteristics. Respondents who did not supply the demographic information necessary to categorize their age, gender, and/or education were excluded from the Raking procedure and were assigned a value of 1 for *M*.

^{a}SAS Institute, Inc. (1990), *SAS/IML Software: Usage and Reference, Version 6*, First Edition, pp. 355-358, Cary, North Carolina: SAS Institute, Inc.
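A minimal sketch of raking may help readers unfamiliar with the technique. It is a generic iterative proportional fitting loop in plain Python, not the SAS/IML routine cited above; the function name `rake`, the two-dimension example (age and gender only), and all numbers are assumptions for illustration. It also assumes every target category has at least one respondent.

```python
def rake(weights, categories, targets, max_iter=100, tol=1e-8):
    """Return one multiplier per respondent so that weighted margins
    match the target totals for every dimension in `targets`."""
    multipliers = [1.0] * len(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for dim, margin in targets.items():
            # Current weighted total for each level of this dimension.
            totals = {level: 0.0 for level in margin}
            for w, m, cat in zip(weights, multipliers, categories):
                totals[cat[dim]] += w * m
            # Scale each respondent so this margin hits its target.
            for i, cat in enumerate(categories):
                factor = margin[cat[dim]] / totals[cat[dim]]
                multipliers[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:   # all margins already agree
            break
    return multipliers

# Four invented respondents with equal pre-raking weights.
weights = [100.0, 100.0, 100.0, 100.0]
cats = [{"age": "young", "gender": "f"}, {"age": "young", "gender": "m"},
        {"age": "old", "gender": "f"}, {"age": "old", "gender": "m"}]
targets = {"age": {"young": 300.0, "old": 200.0},
           "gender": {"f": 250.0, "m": 250.0}}
multipliers = rake(weights, cats, targets)
raked = [w * m for w, m in zip(weights, multipliers)]
```

The loop adjusts one margin at a time and repeats until no margin needs further scaling, which is why the procedure is described as iterative.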

Extreme values of *W _{final}* were trimmed to avoid over-inflation of the sampling variance. In short, the trimming procedure limits the relative contribution of the variance associated with the

Each household having a final analysis weight that exceeded the determined threshold value was assigned a trimmed weight equal to the threshold. Next, the age/gender/education cell used in the post-stratification was identified for each household with a trimmed weight. To maintain the overall weighted sum within the cell, the trimmed portions of the original weights were re-assigned to the cases whose weights were unchanged in the trimming process. For cases having trimmed weights but missing age, gender, and/or education information, the trimmed portions of the original weights were assigned to all remaining cases whose weights were unchanged in the trimming process.

The entire procedure was then repeated on the new set of weights: a new threshold value was re-calculated and the new extreme values were re-adjusted. The process was repeated until no new extreme values were found.
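The trim-and-redistribute loop can be sketched as follows. Two points are assumptions rather than documented practice: the threshold rule is supplied by the caller (the example uses three times the mean weight purely for illustration), and the trimmed excess is shared among untrimmed cases in proportion to their current weights. A cell value of `None` stands for missing age, gender, or education information.

```python
def trim_weights(weights, cells, threshold_fn, max_rounds=50):
    """Iteratively cap extreme weights and reassign the excess."""
    w = list(weights)
    for _ in range(max_rounds):
        limit = threshold_fn(w)                    # threshold for this round
        extreme = {i for i, x in enumerate(w) if x > limit}
        if not extreme:                            # no new extreme values
            break
        kept = [i for i in range(len(w)) if i not in extreme]
        for i in extreme:
            excess = w[i] - limit
            w[i] = limit                           # trimmed weight = threshold
            # Untrimmed cases in the same cell, or all untrimmed cases
            # when the trimmed case has no cell information.
            pool = [j for j in kept if cells[i] is None or cells[j] == cells[i]]
            share = sum(w[j] for j in pool)
            for j in pool:
                w[j] += excess * w[j] / share      # proportional reassignment
    return w

# Invented weights in a single cell; hypothetical rule: 3 x mean weight.
trimmed = trim_weights([10.0, 10.0, 10.0, 10.0, 100.0], ["a"] * 5,
                       threshold_fn=lambda ws: 3 * sum(ws) / len(ws))
```

In this example the single extreme weight is capped and its excess is spread over the four remaining cases, so the weighted total for the cell is unchanged.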

Introduction. The data collected in the Omnibus Household Survey are obtained through a complex sample design involving stratification, and the final weights are subject to several adjustments. Any variance estimation methodology must involve some simplifying assumptions about the design and weighting. Some simplified conceptual design structures that allow users of these data to compute reasonably accurate standard errors are provided in this section.

At BTS, the software package SUDAAN (Research Triangle Institute, Research Triangle Park, NC) has been used to produce standard errors. An example of SUDAAN computer code is provided, but without guarantees of any kind. The computer code and methods used are subject to change without notification to the user. The entire risk as to the results and performance is assumed by the user. BTS recommends that any analysis of Omnibus Household Survey data be done under the supervision of a statistician who understands the implications of complex sample design surveys.

Sample Design. The Omnibus Household Survey uses random digit dialing (RDD). Sample telephone numbers were obtained from the GENESYS sampling system. The standard GENESYS RDD sample methodology produces a strict single-stage equal probability sample of residential telephone numbers. In other words, a GENESYS RDD sample ensures an equal and known probability of selection for every residential telephone number in the sample frame.

Randomly generated telephone numbers were produced within the Master Exchange Database (MED), which consists of more than 48,000 residential area code/exchange combinations.

- The MED is structured using twenty independent strata: ten divisions of the United States split by metro and non-metro county definitions. The ten divisions are approximately equivalent to the U.S. Census definition of nine divisions. The tenth division in the GENESYS sampling design is made up of Alaska and Hawaii (which are in U.S. Census division nine).
- Within each of the ten division/metro strata, counties are ordered from those serving the largest MSA/Primary Metropolitan Statistical Area (PMSA) to those serving the smallest.
- Within each rank-ordered MSA/PMSA, exchanges are ordered by those serving the county(s) containing the central city(s), followed by those serving each of the remaining non-central city county(s).
- Within each county, exchanges and their associated working banks are ordered numerically, lowest to highest.
- For the ten division/non-metro strata, counties are ordered in a geographic serpentine pattern within each state.
- Within each county, exchanges are again ordered numerically.

The rationale for sorting the MED in such a fashion is to ensure strict geographic representation and to increase the homogeneity within the implicit strata created by the GENESYS sampling procedures.

Given this sample design, a one-stage sample should be specified and final sampling weights (adjusted by post stratification) used. The user should note that one simplifying procedure is used by BTS for variance estimation in SUDAAN. Whereas the GENESYS sample uses ten divisions as a sort criterion, BTS has used the U.S. Census definition of nine divisions. The rationale for this is that few respondents are interviewed in Alaska and Hawaii. Thus, these states are collapsed back into nine divisions.

Design Information for Variance Estimation. Three variables, DIVISION, METRO, and FINALWGT, are needed for variance estimation in SUDAAN. The variable DIVISION is not included in the data files of August 2000 through January 2001. For these months, the DIVISION variable has to be constructed from the variable FIPSCODE using the U.S. Census classification of states within divisions. To construct the variable DIVISION:

- Use only the first 2 digits in the variable FIPSCODE (a 5-digit number where, from left to right, the first two digits are the state identifier and the last three digits represent a county).
- Use the information in Table 1 to recode the 2 digits from FIPSCODE into the variable DIVISION.

**Table 1. State Codes Within Each of the Nine Divisions**

State Code from Variable FIPSCODE | DIVISION Code |
---|---|
09, 23, 25, 33, 44, and 50 | 1 |
34, 36, and 42 | 2 |
17, 18, 26, 39, and 55 | 3 |
19, 20, 27, 29, 31, 38, and 46 | 4 |
10, 11, 12, 13, 24, 37, 45, 51, and 54 | 5 |
01, 21, 28, and 47 | 6 |
05, 22, 40, and 48 | 7 |
04, 08, 16, 30, 32, 35, 49, and 56 | 8 |
02, 06, 15, 41, and 53 | 9 |
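The recoding in Table 1 can be expressed as a simple lookup. The sketch below is one possible way to do it in Python; the function name is hypothetical, and padding to five digits is an assumption that guards against a leading zero being lost if FIPSCODE was read as a number.

```python
# State codes by division, copied from Table 1.
DIVISION_STATES = {
    1: ["09", "23", "25", "33", "44", "50"],
    2: ["34", "36", "42"],
    3: ["17", "18", "26", "39", "55"],
    4: ["19", "20", "27", "29", "31", "38", "46"],
    5: ["10", "11", "12", "13", "24", "37", "45", "51", "54"],
    6: ["01", "21", "28", "47"],
    7: ["05", "22", "40", "48"],
    8: ["04", "08", "16", "30", "32", "35", "49", "56"],
    9: ["02", "06", "15", "41", "53"],
}
STATE_TO_DIVISION = {
    state: division
    for division, states in DIVISION_STATES.items()
    for state in states
}

def division_from_fipscode(fipscode):
    """Map a 5-digit FIPSCODE (state + county) to its DIVISION code."""
    code = str(fipscode).zfill(5)      # restore a dropped leading zero
    return STATE_TO_DIVISION[code[:2]]

example_division = division_from_fipscode("24031")   # state code 24
```

A code whose first two digits are not in Table 1 raises a `KeyError`, which makes unexpected values easy to spot.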

Variance Estimation Method. This method uses the DIVISION and METRO variables to create 18 strata, a single-stage selection with replacement procedure, and the final weight. This method provides somewhat conservative standard error estimates. Assuming a simplified sample design structure, the following SUDAAN statements may be used (note that the data file must first be sorted by the DIVISION and METRO variables before it is used in SUDAAN).

PROC ... DESIGN = STRWR;

NEST DIVISION METRO ;

WEIGHT FINALWGT ;
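For readers without SUDAAN, the idea behind a stratified with-replacement variance estimate can be sketched with the textbook Taylor-linearization formula for a weighted mean, treating each respondent as its own sampling unit. This is not SUDAAN code and is not guaranteed to reproduce SUDAAN output; the function name and the toy data are assumptions for illustration.

```python
from collections import defaultdict

def weighted_mean_se(values, weights, strata):
    """Weighted mean and its linearized standard error under a
    stratified, single-stage, with-replacement design."""
    total_w = sum(weights)
    mean = sum(w * y for w, y in zip(weights, values)) / total_w
    # Linearized score for each respondent, grouped by stratum.
    by_stratum = defaultdict(list)
    for y, w, h in zip(values, weights, strata):
        by_stratum[h].append(w * (y - mean) / total_w)
    variance = 0.0
    for scores in by_stratum.values():
        n_h = len(scores)
        if n_h < 2:
            raise ValueError("stratum with one respondent: variance undefined")
        center = sum(scores) / n_h
        variance += n_h / (n_h - 1) * sum((z - center) ** 2 for z in scores)
    return mean, variance ** 0.5

# Toy data: one (DIVISION, METRO) stratum, equal weights.
mean, se = weighted_mean_se([1, 2, 3, 4], [1, 1, 1, 1], [(1, 1)] * 4)
```

With equal weights and a single stratum the result reduces to the familiar standard error of a simple mean, which is a convenient sanity check.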

A commonly used rule of thumb for the degrees of freedom associated with a standard error is the quantity (number of unweighted records - number of strata) in the dataset. The rule-of-thumb degrees of freedom for the method above would fluctuate from month to month depending on the number of records in each monthly dataset. Most monthly datasets would yield around 1,000 degrees of freedom. For practical purposes, any number of degrees of freedom exceeding 120 can be treated as infinite, i.e., one uses a normal *Z*-statistic instead of a *t*-statistic for testing.

Note that the critical *t* for an upper-tail probability of 0.025 (a two-tailed test at the 0.05 level) is 1.98 at 120 degrees of freedom, while the corresponding *z*-value at infinite degrees of freedom is 1.96. If a variable of interest covers most of the sample strata, this limiting value would probably be adequate for analysis. Users should consult mathematical statisticians for discussion of degrees of freedom.
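The rule of thumb amounts to a subtraction and a comparison, sketched below. The function names are hypothetical; the default of 18 strata and the cutoff of 120 are taken from the text above.

```python
def rule_of_thumb_df(n_records, n_strata=18):
    """Degrees of freedom: unweighted records minus number of strata."""
    return n_records - n_strata

def reference_distribution(df, cutoff=120):
    """Use the normal (z) distribution once df exceeds the cutoff."""
    return "z" if df > cutoff else "t"

# A monthly file with 1,000 unweighted records.
df_month = rule_of_thumb_df(1000)
dist_month = reference_distribution(df_month)
```

A heavily subsetted file can fall below the cutoff, in which case a *t* critical value would be the more cautious choice.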

Subsetted Data Analysis. Frequently, analytical studies are restricted to select sub-domains, e.g., persons aged 65 and older. To save on storage, some users delete all records outside the domain of interest. This procedure of keeping only select records is called subsetting the data. With a subsetted data set, variance estimates sometimes cannot be computed. When data collected under a complex survey design are subsetted, the sample design structure can be compromised because complete design information is no longer available for every stratum. Subsetting data may delete important design information needed for variance estimation.

If records are deleted from the Omnibus Household Survey data such that only one respondent is left in a particular stratum, variance estimates cannot be computed. When using subsetted data in SUDAAN, the MISSUNIT option can be added to the NEST statement to correct for possible missing design information. For example:

NEST DIVISION METRO / MISSUNIT ;

SUDAAN's MISSUNIT option performs a fix-up that produces variance estimates identical to those obtained when using the full data set.

The procedures for response rate calculation for the monthly surveys are based on the guidelines established by CASRO in defining a response rate. The final response rate for the survey was obtained using the following formula:

The distribution of household telephone numbers by disposition categories is shown in the methods section specific to each month. The number of household cases in each category was used in the above formula to calculate an overall response rate for each month.

The Omnibus Household Survey, by design, contains questions that are not asked of certain respondents based on their response(s) to other questions. In addition, there will always be some respondents who do not know the answer to or choose not to answer some items in the survey. Each of these responses can have a different meaning to the data user. While each of these response categories is important in characterizing the results of the survey, they are often removed from certain analyses, particularly those involving percentages. Therefore, the categories were given standard codes for easy identification. Table 2 below presents the response categories and how they are represented in each data file.

Data have not been imputed to account for missing values in specific questions, except during the weighting process. Those values were imputed only for the purpose of weighting the data and were not included in the final data files.

**Table 2. Summary of Codes for Missing Value Response Categories by Type of Data File**

Response Category | SAS Transport^{1} Value | Microsoft Excel Value | ASCII Value |
---|---|---|---|
Appropriate Skip | .S | -7 | -7 |
Refused | .R | -8 | -8 |
Don't Know | .D | -9 | -9 |

^{1}All codes represent special cases of SAS missing values and are treated as such in SAS procedures.
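When working with the ASCII or Excel files, the codes in Table 2 usually need to be set aside before percentages are computed. The sketch below shows one way to do that for a single question; the function name and the sample responses are invented, the values are assumed to have been read as integers, and the calculation is unweighted for brevity (a real analysis would apply the final analysis weight).

```python
# Missing-value codes used in the ASCII and Excel files (Table 2).
MISSING_CODES = {-7: "Appropriate Skip", -8: "Refused", -9: "Don't Know"}

def valid_percentages(responses):
    """Percentage distribution of responses, excluding missing codes."""
    valid = [r for r in responses if r not in MISSING_CODES]
    counts = {}
    for r in valid:
        counts[r] = counts.get(r, 0) + 1
    return {value: 100.0 * n / len(valid) for value, n in counts.items()}

# Invented answers to one question (1 = yes, 2 = no).
percentages = valid_percentages([1, 2, 2, -7, -9, 1, 2, -8])
```

Here three of the eight records carry missing codes, so the percentages are based on the five valid answers only.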

All survey data were collected using a computer-assisted telephone interviewing (CATI) program. CATI was also used to schedule calls and track cases. It was programmed to release telephone numbers for calling based on standard and project-specific scheduling algorithms. Calls were scheduled based on optimal calling patterns and dispersed over different times of the day. Calls also were prioritized based upon their case status. For example, a telephone number for a household where a respondent had already agreed to participate was given a higher priority in the scheduler than a number where no contact had been made.

Follow-up efforts were limited to 15 attempts to determine whether a telephone number was residential, an additional ten attempts to identify an eligible respondent, and a final ten attempts to secure a completed interview or refusal. Therefore, the maximum number of call attempts to any household was 35. Once contact was made with a household, follow-up attempts followed a loose callback schedule established at the initial contact. That is, good times and days to call back were requested at the initial contact, but follow-up calls also were attempted before these appointment times unless the household asked otherwise. This allowed the maximum number of attempts to be made within the study period.

Once contact was made with individuals at a dialed telephone number, interviewers screened for eligibility by verifying that the number belonged to a residence (not a business or institution). An adult household member was then asked to identify the individual 18 years or older (16 years or older for surveys conducted prior to November 2000) in the household who would have the next birthday. The method preserved the randomness of the selection without requiring the time and effort to acquire a household roster and helped to avoid a potential break-off. If the respondent was available, the interviewer immediately attempted to complete the interview. If the selected respondent was not available, the interviewer asked for a good time to call back. In order to preserve respondent anonymity in the latter case, the interviewer asked for and recorded only the potential respondent's first name or initial.

No incentives were offered to respondents for completing the interview, and the survey was conducted only in English. If the selected household member refused the interview, the interviewer recorded the reason for refusal. The average length of the completed interview was approximately 15 minutes. Additionally, about 3-5 minutes were needed to recruit/screen potential respondents.

Once contact was made with the eligible respondent, the interviewer briefly explained the purpose of the survey and asked for the respondent's cooperation. The respondent was assured that the survey responses were being provided anonymously and that the respondent would not be asked for his/her full name, address, or other identifying information. Verbal consent to participate in the survey was requested from all respondents.

The interviews were completed in one telephone call. If a respondent started, but refused to complete an interview in one phone call, the session was broken off and the interview was coded as a refusal. No attempts were made to weight these data.

Interviewer performance was evaluated on the basis of production reports and regular on-line monitoring. Interviewer conduct during interviews was evaluated primarily by supervisory monitoring of actual calls, supplemented by review of interviewer notes maintained in the CATI system (all calls and notes recorded about those calls are maintained by the CATI system).

The CATI code was written to strictly enforce questionnaire logic. An interview could not be certified as "clean" until all appropriate questions had either been answered or assigned an acceptable non-response value, and until the data record for each interview was consistent with the instrument program logic.

A program was written to reformat the cleaned responses from the instrument into files that could be used for analytical purposes. Additional edits were performed in SAS. The additional edits included checks on the number of missing values, assignment of additional non-response values, and some constructed variables. Weights were also applied to the data files.