Appendix C. Sample Design, Data Collection, and Estimation

The primary goal for the 2007 Commodity Flow Survey (CFS) is to estimate shipping volumes (value, tons, and ton-miles) by commodity and mode of transportation at varying levels of geographic detail. A secondary objective is to estimate the volume of shipments moving from one geographic area to another (i.e., flows of commodities between states, regions, etc.) by mode and commodity. A detailed description of the sample design for the 2007 CFS is provided below.

Sample Design

The sample for the 2007 Commodity Flow Survey (CFS) was selected using a stratified three-stage design in which the first-stage sampling units were establishments, the second-stage sampling units were groups of four 1-week periods (reporting weeks) within the survey year, and the third-stage sampling units were shipments.

First Stage - establishment selection

Sampling frame

To create the first-stage sampling frame, a subset of establishment records (as of August 2006) was extracted from the Census Bureau's Business Register. The Business Register is a database of all known establishments located in the United States or its territories, and an establishment is a single physical location where business transactions take place or services are performed. Establishments located in the United States, having nonzero payroll in 2005, and classified in mining (except oil and gas extraction), manufacturing, wholesale, electronic shopping and mail order, fuel dealers, and publishing industries, as defined by the 2002 North American Industry Classification System (NAICS), were included on the sampling frame. Auxiliary establishments (e.g., warehouses and central administrative offices) with shipping activity were also included on the sampling frame. Auxiliary establishments are establishments that are primarily involved in rendering support services for other establishments within the same company, instead of for the public, government, or other business firms. Establishments classified in forestry, fishing, utilities, construction, transportation, and all other retail and services industries were not included on the sampling frame. Farms and government-owned entities (except government-owned liquor stores) were also excluded from the sampling frame. The resulting frame comprised approximately 754,000 establishments as listed in the table below.

Trade Area        Establishments
Mining                     6,789
Manufacturing            327,826
Wholesale                356,477
Retail                    25,190
Services                  22,539
Auxiliaries               14,878
Total                    753,699

For each establishment, sales, payroll, number of employees, a six-digit NAICS code, name and address, and a primary identifier were extracted, and a measure of size was computed. The measure of size was designed to approximate an establishment's annual total value of shipments for the year 2004.

All of the establishments included on the sampling frame had State, county, and place geographic codes, which were used to assign each establishment to one of the 73 metropolitan areas (MAs) defined as a combination of the metropolitan statistical areas (MSAs), combined statistical areas (CSAs) and States. Establishments not located in an MA were assigned to the balance of the State.

Stratification

The sampling frame was stratified by geography and industry. A particular geographic-by-industry combination defined a primary stratum. Geographic strata were defined by a combination of the 50 States, the District of Columbia, and 73 metropolitan areas (MAs) based on their population and importance as transportation gateways. All other MAs were collapsed with the non-MAs within the State into Rest of State (ROS) strata. When an MA crossed State boundaries, the size of each part of the MA was considered relative to the MA's total measure of size when determining whether or not to create strata in each State in which the MA was defined. Six MAs had strata in two or more States.

The industry strata were determined as follows. Within each of the geographic strata, 48 industry groups were defined based on the 2002 NAICS:

  • 3 mining (4-digit NAICS);
  • 21 manufacturing (3-digit NAICS);
  • 18 wholesale (4-digit NAICS);
  • 2 retail (NAICS 4541 and 45431);
  • 1 services (NAICS 5111 and 51223 combined), and
  • 3 auxiliary (combinations of NAICS 4931 and 551114).

If a three- or four-digit NAICS industry contributed at least 4 percent of the total value (based on sampling measure of size) or tonnage (based on 2002 CFS data) for the geographic stratum or the Nation, it was designated as a "do not collapse" industry stratum within the geographic stratum. Industries not meeting this level of activity within a geographic stratum were grouped with other similar industries. The remaining industry strata were collapsed to form at most 10 collapsed industry strata within each geographic stratum.
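The 4 percent rule can be illustrated with a short Python sketch. The function name, the dictionary inputs, and the figures are hypothetical and are not taken from the CFS production system. The sketch tests shares within a single set of totals; applying it once to a geographic stratum and once to national totals would cover both parts of the rule.

```python
def do_not_collapse(value_by_industry, tons_by_industry, threshold=0.04):
    """Return industries whose share of value or tonnage is at least threshold.

    Both inputs are hypothetical dicts mapping a NAICS code to a total for
    one geographic stratum (or for the Nation).
    """
    total_value = sum(value_by_industry.values())
    total_tons = sum(tons_by_industry.values())
    keep = set()
    for naics in value_by_industry:
        value_share = value_by_industry[naics] / total_value
        tons_share = tons_by_industry.get(naics, 0.0) / total_tons
        if value_share >= threshold or tons_share >= threshold:
            keep.add(naics)
    return keep


# Made-up figures for illustration only.
value = {"311": 50.0, "312": 3.0, "4241": 47.0}
tons = {"311": 60.0, "312": 2.0, "4241": 35.0}
flagged = do_not_collapse(value, tons)
```

Industries left out of the returned set would be candidates for the collapsing step.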

The method used to collapse the remaining strata used 2002 CFS data as input to a Classification and Regression Tree (CART) procedure that related industries with commodities. The terminal nodes from the CART procedure were then grouped using a hierarchical clustering algorithm. Using the results from the hierarchical clustering algorithm, some of the clusters were manually regrouped to arrive at the final industry clusters.

To produce better estimates of the shipment of hazardous materials for 2007, a total of 160 strata targeting HAZMAT shippers were created. Using 2002 CFS data, the six-digit NAICS industries that accounted for a large proportion of the estimated total value and/or total tonnage for six groups of hazardous materials were identified. These groups were ammonium nitrate, ethanol, explosives, hydrogen, toxic by inhalation, and all other miscellaneous hazardous materials.

The treatment of auxiliary establishments was modified for 2007 to take advantage of the data collected through the advance survey. For auxiliaries that responded to the advance survey and were considered to be shippers, 123 strata were created, one in each geographic stratum, combining both NAICS 4931 and 551114. Two national strata for auxiliary establishments were also created for those that did not respond to the advance survey—one stratum for nonresponding warehouses (those classified in NAICS 4931) and one stratum for nonresponding management offices (NAICS 551114).

The table below summarizes the primary stratification of the CFS sampling frame. Of the 2,745 primary strata, 232 were designated as take-all strata because of the small number of establishments in the stratum and/or their importance.

Primary Strata                                  Number
Do Not Collapse                                  1,306
Collapsed                                        1,154
Auxiliaries (Advance Survey responders)            123
Auxiliaries (Advance Survey non-responders)          2
HAZMAT                                             160
Total                                            2,745

Sample size and allocation

Sample sizes were computed to meet coefficient of variation (CV) constraints on estimated value of shipments totals for each primary stratum. A CV of 1.5 percent on the estimated total value of shipments was used for each primary stratum because it produced total sample sizes of approximately 100,000 establishments.

The primary constraints were budget related and translated into an approximately fixed total sample size for the survey. The goal of the design was to allocate this fixed total sample size in a statistically efficient manner. The CV constraints were primarily used as a tool to allocate more of the sample to more important strata. It was assumed that the cost of data collection would not vary by stratum. Maximum sampling weight and minimum sample size constraints were also imposed. For the CFS design, the maximum first-stage sampling weight was set to 100 and the minimum sample size to 2 establishments per stratum.

The procedure for determining sampling parameters was an iterative computerized process. The sample design programs used in the process are part of a group of generalized programs that have been modified to accommodate the needs of the survey, but use common methods such as the Dalenius & Hodges cumulative sqrt(f) procedure, Neyman allocation, and similar rules for determining acceptable designs.
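As an illustration of the cumulative sqrt(f) rule named above, the following Python sketch derives size-stratum cut points from a list of measures of size. It is a textbook version of the Dalenius and Hodges rule, not the Census Bureau's generalized program. The function name and arguments are hypothetical, the bins are assumed to be of equal width, and the sizes are assumed not to be all identical.

```python
import math


def cum_sqrt_f_boundaries(sizes, n_bins, n_strata):
    """Return n_strata - 1 cut points using the cumulative sqrt(f) rule."""
    lo, hi = min(sizes), max(sizes)
    width = (hi - lo) / n_bins
    # Frequency of units in each equal-width bin.
    freq = [0] * n_bins
    for s in sizes:
        idx = min(int((s - lo) / width), n_bins - 1)
        freq[idx] += 1
    # Cumulate the square roots of the bin frequencies.
    cum = []
    running = 0.0
    for f in freq:
        running += math.sqrt(f)
        cum.append(running)
    step = cum[-1] / n_strata
    # Each cut point is the upper edge of the first bin whose cumulative
    # sqrt(f) reaches the next multiple of step.
    cuts = []
    target = step
    for i, c in enumerate(cum):
        if len(cuts) < n_strata - 1 and c >= target:
            cuts.append(lo + (i + 1) * width)
            target += step
    return cuts
```

With a coarse frequency distribution, one bin can span more than one target, in which case this simple version returns fewer cut points than requested; a larger number of bins reduces that risk.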

For each (non-take-all) primary sampling stratum, the survey designer specified as input to a Generalized Univariate Stratification (GUS) program:

  • desired number of bins (for a frequency distribution used in the Dalenius & Hodges cumulative sqrt(f) procedure),
  • desired number of size strata,
  • desired number of certainty companies,
  • desired coefficient of variation for total value of shipments,
  • maximum sampling weight, and
  • minimum sample size.
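Once size strata exist within a primary stratum, Neyman allocation (one of the common methods named earlier) distributes a sample across them in proportion to stratum size times stratum standard deviation. The Python sketch below is a generic illustration with a minimum-sample-size floor. The function name, arguments, and rounding rule are assumptions, and the actual GUS program may handle these details differently.

```python
def neyman_allocation(n_total, stratum_sizes, stratum_sds, min_n=2):
    """Allocate n_total in proportion to N_h * S_h, with a floor of min_n."""
    products = [N * S for N, S in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    alloc = []
    for N, p in zip(stratum_sizes, products):
        n_h = round(n_total * p / total)
        # Apply the minimum, but never sample more units than the stratum holds.
        alloc.append(min(N, max(min_n, n_h)))
    return alloc
```

Because of rounding and the floor, the allocated sizes may not sum exactly to `n_total`; an iterative process would adjust the inputs until the design is acceptable.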

Once designs were determined for each of the primary strata, the information from these designs was used as input to a program that attempted to more efficiently allocate the sample to meet the desired CV on each primary stratum and also determine the sample sizes needed to meet a national-level constraint. Designs with a national-level constraint tend to allocate more of the sample to the larger States, so there is a tradeoff between better national estimates and the quality of the more detailed geographic estimates. For the 2007 CFS, a design with a primary strata CV of 1.7 percent and a national CV of 0.036 percent was chosen. The final first-stage sample size was 102,369 establishments.

Second Stage - reporting week selection

The frame for the second stage of sampling consisted of the 52 weeks from January 6, 2007 to January 4, 2008. Each establishment selected into the 2007 CFS sample was systematically assigned to report for four reporting weeks, one in each quarter of the reference year. Each of the four weeks was in the same relative position within its quarter. For example, an establishment might have been requested to report data for the 5th, 18th, 31st, and 44th weeks of the reference year. In this instance, each reporting week corresponds to the 5th week of its quarter. Prior to assignment of weeks to establishments, the selected sample was sorted by primary stratum (State x metropolitan area x industry) and measure of size.
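The relationship between a position within the quarter and the four reporting weeks can be sketched in Python as follows. The function names are hypothetical. The cycling rule in `assign_positions` is only one plausible reading of "systematically assigned" and is an assumption, not a documented procedure.

```python
def reporting_weeks(week_in_quarter, weeks_per_quarter=13, quarters=4):
    """Return the week-of-year numbers for one position within each quarter."""
    if not 1 <= week_in_quarter <= weeks_per_quarter:
        raise ValueError("week_in_quarter must be between 1 and weeks_per_quarter")
    return [week_in_quarter + q * weeks_per_quarter for q in range(quarters)]


def assign_positions(sorted_establishment_ids, start=1, weeks_per_quarter=13):
    """Assumed systematic assignment: cycle through positions 1..13 down the
    sorted list of establishments, beginning at `start`."""
    return {
        est_id: (start - 1 + i) % weeks_per_quarter + 1
        for i, est_id in enumerate(sorted_establishment_ids)
    }
```

Calling `reporting_weeks(5)` reproduces the example in the text: weeks 5, 18, 31, and 44.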

Third Stage - shipment selection

For each of the four reporting weeks in which an establishment was asked to report, we requested the respondent to construct a sampling frame consisting of all shipments made by the establishment in the reporting week. Each respondent was asked to count or estimate the total number of shipments comprising the sampling frame and to record this number on the questionnaire. For each assigned reporting week, if an establishment made more than 40 shipments during that week, we asked the respondent to select a systematic sample of the establishment's shipments and to provide us with information only about the selected shipments. If an establishment made 40 or fewer shipments during that week, we asked the respondent to provide information about all of the establishment's shipments made during that week; i.e., no sampling was required.
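A minimal Python sketch of the respondent's selection rule follows. The threshold of 40 shipments comes from the text. The sampling interval, computed here as the ceiling of the shipment count divided by 40, and the random start are assumptions made for illustration, since the interval given to respondents is not stated above.

```python
import math
import random


def select_shipments(shipments, threshold=40, rng=None):
    """Return every shipment if there are at most `threshold` of them,
    otherwise a systematic sample taken at a fixed interval."""
    n = len(shipments)
    if n <= threshold:
        return list(shipments)          # no sampling required
    rng = rng or random.Random()
    interval = math.ceil(n / threshold)  # assumed rule, not from the source
    start = rng.randrange(interval)      # random start in [0, interval)
    return shipments[start::interval]
```

Under this assumed interval, an establishment with 100 shipments in a week would report every third shipment.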

Data Collection

Each establishment selected into the CFS sample was mailed a questionnaire for each of its four reporting weeks; that is, an establishment was sent a questionnaire once every quarter of 2007. For a given establishment, the respondent was asked to provide the following information about each of the establishment's reported shipments:

  • shipment ID number,
  • shipment date (month, day),
  • shipment value,
  • shipment weight in pounds,
  • commodity code from Standard Classification of Transported Goods (SCTG) list,
  • commodity description,
  • United Nations or North America (UN/NA) number for hazardous material shipments,
  • U.S. destination (city, State, ZIP Code) or gateway for export shipment,
  • modes of transport,
  • an indication of whether the shipment was an export,
  • city and country of destination for exports, and
  • export mode.

For a shipment that included more than one commodity, the respondent was instructed to report the commodity that made up the greatest percentage of the shipment's weight.

Imputation of Shipment Value or Weight

Only two items were ever imputed in the 2007 CFS: shipment value and shipment weight. To correct for nonresponse to either the value or weight for a given shipment reported in the CFS, the missing value for the item (or value that failed edit) was replaced by a predicted value obtained from an appropriate model. Such a shipment was considered a "recipient" if it had a valid commodity code and the other item reported was greater than zero and had passed edit. The recipient's item that was missing or failed edit was imputed as follows. First, a "donor" shipment was randomly selected from shipments that were reported in the CFS with:

  • the same commodity code as the recipient,
  • both value and weight items reported greater than zero and passing edit, and
  • similar origin and value for the item reported by the recipient.

Then, the donor's value and weight data were used to calculate a ratio, which was then applied to the recipient's reported item, to impute the item that was missing or failed edit. If no donor was found, the median ratio for all shipments reported in the survey with the same commodity code as the recipient—and with both value and weight items reported greater than zero—was applied to the recipient's reported item. For either the value or weight item, about three percent of the shipment records used for the calculation of estimates had imputed data for the item.
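The ratio imputation described above can be sketched in Python for the case where weight is missing and value was reported. The function name and the two input lists are hypothetical. Matching donors on commodity code, origin, and similar value is assumed to have been done before the call.

```python
import random
import statistics


def impute_missing_weight(reported_value, donor_pool, commodity_pool, rng=None):
    """Impute a missing weight from a reported value via a weight/value ratio.

    donor_pool: (value, weight) pairs already matched to the recipient on
        commodity code, origin, and value range (matching not shown).
    commodity_pool: (value, weight) pairs for every usable shipment with the
        recipient's commodity code; used only when donor_pool is empty.
    """
    rng = rng or random.Random()
    if donor_pool:
        value, weight = rng.choice(donor_pool)   # randomly selected donor
        ratio = weight / value
    else:
        # Fallback: median ratio over all shipments with the same commodity.
        ratio = statistics.median(w / v for v, w in commodity_pool)
    return reported_value * ratio
```

Imputing a missing value from a reported weight would use the inverse ratio in the same way.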

Estimation

Estimated totals (e.g., value of shipments, tons, ton-miles) were produced as the sum of weighted shipment data (reported or imputed). Percent change and percent-of-total estimates were derived using the appropriate estimated totals. Estimates of average miles per shipment were computed by dividing an estimate of the total miles traveled by the estimated number of shipments.

Each shipment had associated with it a single tabulation weight, which was used in computing all estimates to which the shipment contributed. The tabulation weight was a product of seven different component weights. A description of each component weight follows.

CFS respondents provided data for a sample of shipments made by their respective establishments in the survey year. For each establishment, an estimate of that establishment's total value of shipments was produced for the entire survey year. To do this, four different weights were used—the shipment weight, the shipment nonresponse weight, the quarter weight, and the quarter nonresponse weight. Three additional weights were then applied to produce estimates representative of the entire universe—the establishment-level adjustment weight, the establishment (or sample) weight, and the industry-level adjustment weight.
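The way the seven components combine can be sketched in Python as follows. The class and field names are hypothetical labels for the weights named above, and the example figures are made up.

```python
import math
from dataclasses import dataclass, fields


@dataclass
class ComponentWeights:
    shipment: float
    shipment_nonresponse: float
    quarter: float
    quarter_nonresponse: float
    establishment_adjustment: float
    establishment: float
    industry_adjustment: float

    def tabulation_weight(self):
        """Product of the seven component weights."""
        return math.prod(getattr(self, f.name) for f in fields(self))


def estimated_total(records):
    """Sum of weighted shipment data; records are (amount, ComponentWeights)."""
    return sum(amount * w.tabulation_weight() for amount, w in records)


# Made-up example: one shipment worth 100 with a tabulation weight of 260.
w = ComponentWeights(2.0, 1.0, 13.0, 1.0, 1.0, 10.0, 1.0)
total = estimated_total([(100.0, w)])
```

The same tabulation weight would apply whether the amount being summed is value, tons, or ton-miles.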

Like establishments, shipments were identified as either certainty or noncertainty (see the Nonsampling Error section below). For noncertainty shipments, the shipment weight was defined as the ratio of the reported total number of shipments made by an establishment in a reporting week to the number of sampled shipments for the same week. This weight used data from the sampled shipments to represent all the establishment's shipments made in the reporting week. However, a respondent may have failed to provide sufficient information about a particular sampled shipment. For example, a respondent may not have been able to provide value, weight, or a destination for one of the sampled shipments. If this data item could not be imputed, then this shipment did not contribute to tabulations and was deemed unusable. (A usable shipment is one that has valid entries for value, weight, and origin and destination ZIP Codes.) To account for these unusable shipments, a shipment nonresponse weight was applied. For noncertainty shipments from a particular establishment's reporting week, the weight was equal to the ratio of the number of sampled shipments for the reporting week to the number of usable shipments for the same week. The shipment weight for certainty shipments from a particular establishment's reporting week was equal to one.
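The two shipment-level ratios can be written out as a short Python sketch. The function and argument names are hypothetical. The treatment of the nonresponse weight for certainty shipments is not specified above, so the sketch covers the noncertainty case only for that weight.

```python
def shipment_weight(total_reported, n_sampled, certainty=False):
    """Reported shipments in the week divided by sampled shipments;
    equal to one for certainty shipments."""
    return 1.0 if certainty else total_reported / n_sampled


def shipment_nonresponse_weight(n_sampled, n_usable):
    """Sampled shipments divided by usable shipments (noncertainty case)."""
    return n_sampled / n_usable


# Made-up week: 120 shipments made, 40 sampled, 38 usable.
combined = shipment_weight(120, 40) * shipment_nonresponse_weight(40, 38)
```

For noncertainty shipments, the product of the two weights reduces to the reported total divided by the usable count, so each usable shipment in this made-up week stands for 120/38 shipments.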

The quarter weight inflated an establishment's estimate for a particular reporting week to an estimate for the corresponding quarter. For noncertainty shipments, the quarter weight was equal to 13. The quarter weight for most certainty shipments was also equal to 13. However, if a respondent was able to provide information about all large (or certainty) shipments made in the quarter containing the reporting week, then the quarter weight for each of these shipments was one. For each establishment, the quarterly estimates were added to produce an estimate of the establishment's value of shipments for the entire survey year. Whenever an establishment did not provide the Census Bureau with a response for each of its four reporting weeks, a quarter nonresponse weight was computed. The quarter nonresponse weight for a particular establishment was defined as the ratio of the number of quarters for which the establishment was in business in the survey year to the total number of quarters (reporting weeks) for which usable shipment data were received from the establishment.
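A brief Python sketch of the two quarter-level weights follows. The function and argument names are hypothetical. The value 13 reflects the 13 weeks in each quarter of the 52-week reference year.

```python
def quarter_weight(certainty=False, full_quarter_reported=False):
    """13 inflates one reporting week to a quarter; certainty shipments
    reported for the whole quarter need no inflation."""
    return 1.0 if (certainty and full_quarter_reported) else 13.0


def quarter_nonresponse_weight(quarters_in_business, quarters_with_usable_data):
    """Quarters in business divided by quarters with usable shipment data."""
    return quarters_in_business / quarters_with_usable_data
```

For example, an establishment in business all year that returned usable data for three of its four reporting weeks would receive a quarter nonresponse weight of 4/3.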

Using these four component weights, an estimate of each establishment's value of shipments was computed for the entire survey year. This estimate was then multiplied by a factor that adjusts the estimate using value of shipments and sales data obtained from other surveys and censuses conducted by the Census Bureau. This weight, the establishment-level adjustment weight, attempted to correct for any sampling or nonsampling errors that occurred during the sampling of shipments by the respondent.

The adjusted value of shipments estimate for an establishment was then weighted by the establishment (or sample) weight. This weight was equal to the reciprocal of the establishment's probability of being selected into the first stage sample.

A final adjustment weight, the industry-level adjustment weight, used information from other surveys and censuses conducted by the Census Bureau to account for establishment nonresponse or unusable response, and for changes in the universe of establishments between 2006, when the first-stage sampling frame was constructed, and 2007, the year in which the data were collected. Separate industry-level adjustment weights were determined for nonauxiliary and auxiliary establishments. For the final CFS estimates, these industry-level adjustments were made by State at the three-digit (Manufacturing) or four-digit (all other industries) NAICS levels. There were approximately 2,150 separate industry adjustment weights computed.

A noise factor was then applied to provide additional disclosure protection (see Appendix B, "Reliability of the Estimates").