2. Sample Design

2.1 Target Population

The October 2009 OHS consists of a national survey and a survey of nine targeted metropolitan statistical areas (MSAs). The target population for the national survey is the U.S. non-institutionalized adult population (18 years of age or older). The target population for the targeted MSA survey is the non-institutionalized adult population in the nine targeted MSAs.

2.2 Sampling Frame and Selection

Both the national survey and the targeted MSA survey used the same questionnaire, but their samples were generated separately. To ensure that the October 2009 OHS is comparable to past OHS administrations (November 2008 and earlier), the current survey used the same methodology as the previous surveys.

The samples for both the national survey and the targeted MSA survey were purchased from Survey Sampling International (SSI), a firm that provides samples for numerous government agencies and the private sector. The national sample included all 50 states and the District of Columbia. Using list-assisted random-digit-dialing (RDD) methodology, a national probability sample of telephone numbers was generated for the survey. All telephone numbers in the sampling frame - SSI's total active blocks - were divided into 18 strata by Census division (Table 1) and metropolitan status (i.e., inside MSA versus outside MSA) at the county level. The number of sampled telephone numbers for each stratum was proportionate to the size of the sampling population within the stratum. The national sampling rate was computed by dividing the number of RDD sample elements required by the total possible telephone numbers in the sampling frame.

Table 1: Census Bureau Regions and Divisions

Region | Division | States
Northeast | New England | CT, ME, MA, NH, RI, VT
Northeast | Middle Atlantic | NJ, NY, PA
Midwest | East North Central | IN, IL, MI, OH, WI
Midwest | West North Central | IA, KS, MN, MO, NE, ND, SD
South | South Atlantic | DE, DC, FL, GA, MD, NC, SC, VA, WV
South | East South Central | AL, KY, MS, TN
South | West South Central | AR, LA, OK, TX
West | Mountain | AZ, CO, ID, NM, MT, UT, NV, WY
West | Pacific | AK, CA, HI, OR, WA
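
The proportionate allocation described before Table 1 can be illustrated with a minimal Python sketch. The stratum labels and frame counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not SSI frame figures, and the function name is an assumption made for this illustration.

```python
def proportionate_allocation(frame_sizes, total_sample):
    """Allocate total_sample across strata in proportion to frame size.

    frame_sizes maps a stratum label to the count of possible telephone
    numbers in that stratum (hypothetical values in this sketch).
    """
    total_frame = sum(frame_sizes.values())
    # Overall sampling rate: sample elements required / total frame size.
    sampling_rate = total_sample / total_frame
    # Each stratum receives its frame size times the common sampling rate.
    # Rounding can make the allocations sum to slightly more or less than
    # total_sample when the products are not whole numbers.
    allocation = {
        stratum: round(size * sampling_rate)
        for stratum, size in frame_sizes.items()
    }
    return sampling_rate, allocation


# Hypothetical frame counts for three of the 18 strata.
frame_sizes = {
    "New England / inside MSA": 400_000,
    "New England / outside MSA": 100_000,
    "Pacific / inside MSA": 1_500_000,
}
rate, allocation = proportionate_allocation(frame_sizes, 2_000)
print(rate, allocation)
```

Because every stratum is sampled at the same rate, each stratum's share of the sample matches its share of the frame.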

SSI developed the sample by first imposing an implicit stratification on the telephone exchange areas by Census division and metropolitan status at the county level. Within each Census division, counties and their associated telephone exchange areas located in MSAs were sorted by the size of the MSAs. The size of an MSA was measured by its population. After the MSAs were sorted according to the population, an indicator for metropolitan status (MSC) of the counties was created and added as a variable to the sample files. For the purpose of OHS, the MSC is defined as follows:

  • 1 = Large MSA - 1 million population or more.
  • 2 = Medium MSA - with 500,000-999,999 population.
  • 3 = Small MSA - with less than 500,000 population.
  • 5 = Outside MSA.
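
The MSC coding above can be expressed as a small Python function. This is only a sketch of the classification rule as listed; the function name and arguments are assumptions made for illustration and do not reflect how SSI assigns the indicator in its sample files.

```python
def msc_code(in_msa, msa_population=None):
    """Return the metropolitan status code (MSC) for a county.

    in_msa: True if the county lies inside an MSA.
    msa_population: population of that MSA (ignored when in_msa is False).
    """
    if not in_msa:
        return 5  # Outside MSA
    if msa_population >= 1_000_000:
        return 1  # Large MSA
    if msa_population >= 500_000:
        return 2  # Medium MSA
    return 3      # Small MSA


print(msc_code(True, 2_500_000), msc_code(True, 750_000),
      msc_code(True, 120_000), msc_code(False))
```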

Counties and their associated telephone exchange areas within a Census division located outside of MSAs were first sorted by state. Within each state, the counties and their associated telephone exchange areas were sorted by geographic location. The sampling interval for all strata was the inverse of the national sampling rate computed above so that the number of intervals was equivalent to the number of sample elements required. Within each sampling interval, a single random number was generated between one and the interval size; the corresponding phone number within the interval was identified and written to an output file. This implicit stratification ensured that the sample of telephone numbers was geographically representative.

In addition to the national sample, a sample was drawn with the same probability-proportionate-to-size sampling method from nine targeted MSAs, each with a population of one million or more and rail transit (Table 2). The sampling rate was computed by dividing the number of RDD sample elements required for the survey of targeted MSAs by the total possible telephone numbers in the corresponding sampling frame. Each of the targeted MSAs was a stratum. Prior to sampling, counties and their associated telephone exchange areas within each MSA were first sorted by state. Within each state, the counties and their associated telephone exchange areas were sorted by geographic location.

Table 2: Targeted Metropolitan Statistical Areas

MSA Code (2008 CBSA code) | MSA Title
12060 | Atlanta-Sandy Springs-Marietta, GA
14460 | Boston-Cambridge-Quincy, MA-NH
16980 | Chicago-Naperville-Joliet, IL-IN-WI
31100 | Los Angeles-Long Beach-Santa Ana, CA
33100 | Miami-Fort Lauderdale-Pompano Beach, FL
35620 | New York-Northern New Jersey-Long Island, NY-NJ-PA
37980 | Philadelphia-Camden-Wilmington, PA-NJ-DE-MD
41860 | San Francisco-Oakland-Fremont, CA
47900 | Washington-Arlington-Alexandria, DC-VA-MD-WV

A total of 18,050 telephone numbers were purchased for the October 2009 OHS. Of those numbers, 6,964 were determined to be working numbers for the national survey and 7,326 for the targeted MSA survey. For survey administration, the working numbers for each survey were divided into five replicates, which were released over the period of data collection: the first replicates of 2,000 national and 1,000 MSA cases were released on the first day of interviewing; the second replicates of 1,953 national and 1,000 MSA cases were released on October 3; the third replicates of 1,047 national and 500 MSA cases were released on October 11; and the fourth replicates of 350 national and 330 MSA cases were released on October 19. The remaining replicates were not used. The following section describes the standard procedures for generating an RDD landline sample, which were used to generate the samples for the October 2009 OHS.

2.2.1 RDD Landline Sample

To generate the RDD landline sample, SSI employed a list-assisted RDD system. List-assisted refers to the use of commercial lists of directory-listed telephone numbers, such as Telcordia, to increase the likelihood of dialing household residences. This method gives unlisted telephone numbers the same chance to be selected as directory-listed numbers.

The system utilizes a database of "working blocks." A block (also known as a 100-bank or a bank) is a set of 100 contiguous numbers identified by the first two digits of the last four digits of a telephone number. A block is defined as working if it contains one or more listed telephone households. The database consists of all residential telephone exchanges, working block information, and various geographic service parameters such as state, county, primary zip code, etc. On a national basis, this definition covers an estimated 97.7 percent of all residential telephone numbers (noting that slightly over 20 percent of U.S. households had only wireless telephones in the second half of 2008), while the listed database covers 99.96 percent of directory listed landline phones. This database is updated on a quarterly basis.

The sampling frame consists of the set of all telephone exchanges that meet the geographic criteria. This geographic definition is made using one or more of the geographic codes included in the database. Following specification of the geographic area, the system selects all exchanges and associated working banks that meet those criteria.

Based on the sampling frame defined, the system computes an interval such that the number of intervals is equivalent to the desired number of sample elements. The interval is computed by dividing the total possible telephone numbers in the sampling frame (i.e., # of working banks × 100) by the number of RDD sample elements required. Within each interval, a single random number is generated between one and the interval size; the corresponding phone number within the interval is identified and written to an output file. The result is that every potential telephone number within the defined sampling frame has a known and equal probability of selection.
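
The interval-based selection described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not SSI's system: each working bank is represented by a hypothetical 8-digit prefix (area code, exchange, and the first two of the last four digits), the frame is the banks laid end to end in their sorted order, and the function name and `seed` parameter are inventions for the example.

```python
import random


def systematic_sample(working_banks, n_required, seed=None):
    """Draw one telephone number from each of n_required equal intervals.

    working_banks: ordered list of 8-digit bank prefixes (hypothetical).
    Each bank contributes 100 possible numbers (suffixes 00-99).
    """
    rng = random.Random(seed)
    frame_size = len(working_banks) * 100  # number of working banks x 100
    if not 0 < n_required <= frame_size:
        raise ValueError("n_required must be between 1 and the frame size")
    interval = frame_size / n_required     # may be fractional

    sample = []
    for i in range(n_required):
        # Frame positions (0-based) covered by interval i.
        start = int(i * interval)
        end = int((i + 1) * interval)
        # One random position within the interval.
        position = rng.randrange(start, end)
        bank_index, suffix = divmod(position, 100)
        sample.append(f"{working_banks[bank_index]}{suffix:02d}")
    return sample


# Three hypothetical working banks give a frame of 300 possible numbers;
# six sample elements give an interval of 50.
banks = ["20255501", "20255502", "20255517"]
sample = systematic_sample(banks, 6, seed=2009)
print(sample)
```

Because the frame is sorted before selection, taking one number per interval spreads the sample across the sort order, which is how the implicit stratification yields geographic spread. When the interval is not a whole number, the selection probabilities in this sketch are only approximately equal.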

2.2.2 Purging for Ineligible Numbers

The SSI purging process is designed to purge about 75 percent of the non-productive numbers (non-working, businesses, and fax/modems). Since this process is completed after the sample is generated, the statistical integrity of the sample is maintained.

First, the file of generated numbers is passed against a database that is comprised of the business database and the listed household database. Business numbers are eliminated from the file of generated numbers while listed household numbers are set aside so that they can be recombined after the active Dialer Phase.

Second, disconnected numbers are purged in a post-production process that identifies non-working or unassigned numbers, as well as modem and fax numbers in RDD telephone samples. It employs a proprietary technology that recognizes almost half of these numbers, thereby improving the effective working phones rate of random digit telephone samples by an average of 10-15 percent.

2.2.3 Address Matching

The Multi-Source Phone Data Product from CAS, Inc. was used for residential reverse matches (name and address). With this product, CAS collects millions of individuals' telephone numbers and associated address information from many different sources including telephone directories, subscription databases, government agencies, associations, court records, and internet databases that are updated on a weekly basis. This compiled listing of over 215 million individuals was then used to match telephone numbers with the most current address or vice versa depending on the client's needs.

2.3 Sample Administration

The national sample and the sample of targeted MSAs were administered separately during data collection for tracking purposes so that the goals for each sample could be attained. The goal was to reach a minimum of 1,000 completed interviews for the national sample and a minimum of 500 for the sample of targeted MSAs, and to achieve a 50 percent response rate for both samples. All the procedures for the national sample were followed for the sample of targeted MSAs. Since the questionnaires were the same for both samples, the interviews for both groups were conducted identically, but the files for the two samples were kept separate. After the data collection, the cases from the targeted MSAs in the national sample remained a part of the national sample. They were also combined with the original sample of targeted MSAs to achieve a larger sample size for the survey of targeted MSAs. The specific means for attaining the highest response rate possible, such as callbacks and refusal conversion, were the same for both samples. These are discussed in detail in the section on Data Collection (Section 5).

2.4 Precision of Estimates

The precision of estimated frequencies can be assessed by evaluating the width of the 95 percent confidence interval around the estimates. For this application, the confidence interval can be approximated for design purposes as follows:

$$p_s \pm Z_{\alpha/2}\,\sqrt{\operatorname{Var}(p_s)}$$

Where:

$p_s$ is the estimated (sample) proportion;

$Z_{\alpha/2}$ is the critical value of the normal distribution at the α = 0.05 significance level; and

$\operatorname{Var}(p_s)$ is the variance of $p_s$.

The calculation of the end points of the confidence interval can be rewritten as follows:

$$p_s \pm Z_{\alpha/2}\,\sqrt{\frac{p_s\,(1 - p_s)}{n}}$$

or

$$p_s - Z_{\alpha/2}\,\sqrt{\frac{p_s\,(1 - p_s)}{n}} \;\le\; P \;\le\; p_s + Z_{\alpha/2}\,\sqrt{\frac{p_s\,(1 - p_s)}{n}}$$

Where:

P is the true population value of the proportion; and

n is the sample size.

Therefore, with a sample size of 1,082, $p_s$ = 50 percent, and α = 0.05, the confidence interval would be approximately 47 percent ≤ P ≤ 53 percent (see footnote 1).
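
The worked figure above can be checked with a short Python sketch. It uses the simple binomial variance given in the formula, so it does not account for weighting or other design effects; the function name is an assumption made for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_s, n, alpha=0.05):
    """Normal-approximation confidence interval for a sample proportion."""
    # Two-sided critical value Z_{alpha/2}; about 1.96 when alpha = 0.05.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_s * (1 - p_s) / n)
    return p_s - half_width, p_s + half_width


lower, upper = proportion_ci(0.50, 1082)
print(f"{lower:.3f} <= P <= {upper:.3f}")  # prints "0.470 <= P <= 0.530"
```

The half-width is about 3 percentage points. Since the product p_s(1 - p_s) is largest at p_s = 0.5, any other estimated proportion gives a narrower interval at the same sample size.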

1 This method of confidence interval calculation is conservative.