Multiple Imputation of Missing Passenger Boarding Data in the National Census of Ferry Operators
by Lee H. Giesbrecht
The Bureau of Transportation Statistics (BTS), a component of the Research and Innovative Technology Administration (RITA) of the U.S. Department of Transportation (DOT), conducted the National Census of Ferry Operators in 2006. This data collection updated information collected by the Federal Highway Administration in 2000. The resulting database contains ferry operation data for calendar year 2005, supplemented with ferry data from other sources such as the U.S. Coast Guard and the Army Corps of Engineers. Ferry operators were asked about their season of operation, vessel fleet, modes of access to their terminals, and the route segments that they serve between terminals, including route segment length, average trip time, and number of passengers served.
Ferry operations included are those providing itinerant, fixed route, common carrier passenger and/or vehicle ferry service. Ferry operations that are exclusively nonitinerant (e.g., excursion services—whale watches, casino boats, day cruises, dinner cruises, etc.), passenger-only water-taxi services not operating on a fixed route, LoLo (Lift-on/Lift-off) freight/auto carrier services, or long-distance passenger-only cruise ship services are not included within the scope of this census. The geographic scope includes ferries operating within the United States and its possessions, encompassing the 50 states, Puerto Rico, the U.S. Virgin Islands, and the Commonwealth of the Northern Mariana Islands. In addition to ferry operators providing domestic service within the United States and its possessions, operators providing services to or from at least one U.S. terminal are also included.
BTS identified 230 ferry operators that were in business in 2005 and fell within the scope outlined above. Of those, approximately 92 percent responded to the census questionnaire. Data are missing because not all ferry operators responded to the census. However, some data variables for nonresponding ferry operators (e.g., vessel characteristics) were completed based on information from other sources. In particular, passenger and vehicle boarding data are blank in the database for ferry operators that did not respond to the census, did not have access to these numbers, refused to report them, or required BTS to keep them confidential.
Need for Imputation
About 15 percent of the ferry route segments (a route segment is the part of a ferry route between two terminals) have missing values for passenger boarding data in the 2006 National Census of Ferry Operators. The sum of passengers for all nonmissing values (including those for which the operator required confidentiality) is about 89 million. This incomplete count is arguably less useful than an estimate of all passengers, which would include the 15 percent of route segments with missing values. Estimates of passenger-miles traveled and other passenger-related statistics will also be less useful unless they are based on either complete passenger boarding data or accurate estimates of the missing values.
Multiple imputation techniques allow values to be imputed for missing data along with a measure of variability for estimates computed from the imputed values. As a federal statistical agency, the Bureau of Transportation Statistics strives to fully inform users about the quality of its data. Providing users with a measure of the variability added to the data due to imputation helps to satisfy this goal.
The Imputation Model
The basic idea of multiple imputation is to impute plausible values for the missing data (in this case, missing passenger data) from a distribution of values multiple times. This way, one can estimate distributions from the multiple replicates of the data. The method chosen for the missing ferry passenger boarding data fits a linear regression model that uses auxiliary information about the source of the missing data (ferry operator and route segment variables) along with prior data (from the 2000 data collection) to construct probability distributions of plausible values from which to impute the number of passengers. This method has the advantage of using a model to compute values of the missing data based on information known about the source (which results in better imputed values), while also providing a measure of uncertainty around estimates that make use of the imputed values. The technique was developed by Rubin (1987).1 Multiple imputation is widely regarded today as the method of choice because it treats imputation variance appropriately.
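The idea of repeated draws can be illustrated with a minimal sketch. It is deliberately much simpler than the procedure used for the census: it fits an ordinary least squares line on a single hypothetical covariate and draws each missing value from a normal distribution around the fitted line. It does not use MCMC, an informative prior, or bounds on the imputed values, and it does not redraw the regression parameters for each replicate, so it understates the uncertainty that a full multiple imputation would capture. The function names and the example records are assumptions made for illustration only.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    resid_sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, resid_sd

def impute_totals(records, m=10, seed=0):
    """records: list of (covariate, passengers or None).

    Returns m completed-data totals, one per imputation replicate.
    """
    rng = random.Random(seed)
    observed = [(x, y) for x, y in records if y is not None]
    missing_x = [x for x, y in records if y is None]
    a, b, sd = fit_line([x for x, _ in observed], [y for _, y in observed])
    observed_sum = sum(y for _, y in observed)
    totals = []
    for _ in range(m):
        # Draw one plausible value per missing record around the fitted line.
        draws = [rng.gauss(a + b * x, sd) for x in missing_x]
        totals.append(observed_sum + sum(draws))
    return totals

# Hypothetical records: (vessel capacity, annual passengers or None if missing).
example = [(100, 52000), (150, 71000), (200, 110000), (250, 118000),
           (120, None), (220, None)]
replicate_totals = impute_totals(example, m=10, seed=1)
```

The spread of `replicate_totals` across the 10 replicates is what carries the imputation uncertainty into any estimate computed from the completed data.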
As previously mentioned, the passenger data are at the ferry route segment level. This requires that covariates be standardized at this level of analysis. The covariates in the imputation model are variables with nonmissing values that are used as predictors of the missing passenger values. These variables are not all descriptive of the ferry route segment. Some are descriptive of the ferry operator, the ferry vessels, or the ferry terminals. As many covariates as possible (provided they are logically related to the number of passengers) are desired to improve the predictive power of the imputation model. The process of fitting the model began with a version that included 10 variables and no geographic information. This model resulted in a very wide range of imputed values and, therefore, a large imputation variance. It was felt that geographic information would greatly improve the model’s predictive power. Subsequent changes to the model added a metro/nonmetro variable for both terminals, terminal access variables, and an indicator of whether either terminal served a national park. Census division for either terminal was then added. Each addition reduced the variance of the estimate of total passengers due to imputation. Finally, the Census division variables were replaced with variables for each state. This resulted in a model that would not converge. It may be that the state variables resulted in too many unique records. The final model included the variables listed in box A.
SAS Proc MI (version 8.02)2 was used to run 10 multiple imputations for missing 2005 passenger boarding values. The Markov Chain Monte Carlo (MCMC)3 option was used with an informative prior distribution based on the 2000 data, which contained all the same variables as the main model. The one exception is the variable indicating that the operator requested that their passenger data be kept confidential. This was not asked in 2000, so the values for this variable in the prior distribution were all zero.
Determining a Maximum Imputation Value
A very conservative approach was used to determine a range of plausible values from which to impute. No minimum value was specified, and the maximum value was based on the assumption that vessel capacity would be fully utilized at each route segment.
SAS Proc MI allows the analyst to control the range of valid values for the imputed variable. A reasonable upper limit for the imputation of missing passengers was determined based on information about vessels and route segments. The upper limit for the missing data was set as the total annual passengers computed using the following logic.
The following route segment and vessel information was considered:
- Segment length – Missing segment lengths were imputed with values computed using geographic information system software. The software computed segment length using precise coordinates for the two ferry terminals and presumed waterway paths between the terminals from the U.S. Army Corps of Engineers Navigable Waterway Network GIS database.
- Average travel time per segment – Missing values for average time were imputed with the average time of 14.1 minutes per mile for all route segments with missing passenger data but with average time reported.
- Number of months segment operated per year – Route segments with missing values for the number of operating months were assumed to be operated year-round and 12 months was imputed.
- Number of vessels available per segment – The number of vessels available was computed by dividing the total number of vessels per operator by the number of route segments per operator.
- Average vessel capacity per operator – This was computed by summing the vessel capacity fields (if operator-reported capacity was missing, data from the U.S. Coast Guard were used) and dividing by the number of vessels for each operator.
- Average capacity per route segment – The number of vessels per segment was then multiplied by the average vessel capacity per operator (because multiple vessels may be used on the same route segment) to get the capacity available for each route segment.
The number of runs per year was based on an assumption of an 8-hour work day (no data on work day length were available from the ferry survey). It was assumed that each round trip took the average time multiplied by two, with zero time to load and unload passengers. The resulting number of round trips per day was multiplied by the number of days operating per year (based on the number of operating months per year). This estimated total number of trips per year was then multiplied by the available passenger capacity per trip, giving the maximum possible number of passengers for that ferry route segment. The highest passenger count possible based on these criteria, for the highest capacity missing route segment, was 7,358,400. The lowest upper limit that could be used for a missing route segment while still allowing the imputation model to converge was about 400,000. The imputation model was run for each missing route segment using the upper limit computed as described above and a lower limit of zero.
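The upper-limit logic described above can be written out as a short calculation. The following is a sketch under stated assumptions rather than the code used for the census: the function and parameter names are hypothetical, operating days are assumed to scale as 365 times the operating months divided by 12, fractional runs per day are not rounded, and each round trip is assumed to carry one full load of the available capacity. The report does not spell out these details.

```python
from typing import Optional, Sequence

MINUTES_PER_MILE = 14.1    # average reported in the text, used when travel time is missing
WORK_DAY_MINUTES = 8 * 60  # 8-hour work day assumed in the text
DAYS_PER_YEAR = 365        # assumption: operating days = 365 * months / 12

def max_annual_passengers(
    segment_miles: float,
    avg_minutes: Optional[float],
    months: Optional[float],
    operator_vessel_capacities: Sequence[float],
    operator_segment_count: int,
) -> float:
    """Upper limit on annual passengers for one route segment."""
    # Missing travel time: impute from segment length at 14.1 minutes per mile.
    if avg_minutes is None:
        avg_minutes = MINUTES_PER_MILE * segment_miles
    # Missing operating months: assume year-round operation.
    if months is None:
        months = 12

    n_vessels = len(operator_vessel_capacities)
    vessels_per_segment = n_vessels / operator_segment_count
    avg_capacity = sum(operator_vessel_capacities) / n_vessels
    capacity_per_run = vessels_per_segment * avg_capacity

    round_trip_minutes = 2 * avg_minutes  # zero loading and unloading time
    runs_per_day = WORK_DAY_MINUTES / round_trip_minutes
    operating_days = DAYS_PER_YEAR * months / 12
    return runs_per_day * operating_days * capacity_per_run

# Hypothetical operator: two vessels (150 and 250 passengers), two route segments,
# a 3-mile segment with a 15-minute average crossing, operated 6 months per year.
limit = max_annual_passengers(3.0, 15.0, 6, [150, 250], 2)
```

In this hypothetical case the segment has one vessel of average capacity 200, makes 16 round trips per 8-hour day over 182.5 operating days, and so the upper limit is 16 × 182.5 × 200 = 584,000 passengers.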
Table 1 shows the estimated total annual passengers and passenger miles for all states, along with their associated 95 percent confidence intervals (CI) and coefficients of variation (CV)4. The CIs and CVs were computed based on the standard deviations across 10 imputation replicates.
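As an illustration of how a confidence interval and coefficient of variation can be derived from the standard deviation across replicates, the sketch below summarizes a list of replicate totals. The report does not give the exact formulas, so the normal-theory multiplier of 1.96 and the use of the plain between-replicate standard deviation (without any further adjustment for the number of replicates) are assumptions, as are the function name and the example totals.

```python
import statistics

def summarize_replicates(totals, multiplier=1.96):
    """Point estimate, 95 percent CI, and CV (percent) from replicate totals.

    Assumes the only source of variance is imputation, i.e., the spread
    of the completed-data totals across replicates.
    """
    estimate = statistics.fmean(totals)
    sd = statistics.stdev(totals)   # standard deviation across replicates
    half_width = multiplier * sd    # assumed normal-theory interval
    return {
        "estimate": estimate,
        "ci_lower": estimate - half_width,
        "ci_upper": estimate + half_width,
        "cv_percent": 100.0 * sd / estimate,
    }

# Hypothetical totals from 10 imputation replicates (not census values).
example_totals = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
summary = summarize_replicates(example_totals)
```

A state with no missing route segments has identical totals in every replicate, which gives a standard deviation of zero and therefore no imputation error.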
It is likely that even the lower bound of the 95 percent confidence interval of 104 million passengers is much closer to the actual number of passengers for 2005 than the total computed without imputation of about 89 million because it is the most conservative estimate that accounts for the missing data. Some other estimates based on the imputed passenger data include state totals for California and Washington of 9,350,649 and 14,695,039, respectively. These estimates also have imputation error associated with them of plus or minus 592,402 and 380,069, respectively. The total number of passengers for Alaska, 711,809, has no imputation error because there were no route segments with missing data. Note that several other states have no imputation error as well. Estimates for some states, such as Massachusetts, may still be useful despite having a 4.9 percent CV for passengers, but many other states have imputation errors too large for accurate reporting. It should also be noted that state totals cannot be revealed for four states, New York, Connecticut, South Carolina, and Wisconsin (see first row of table), due to confidentiality restrictions. None of these states had any missing passenger data, but reporting a state total for any of these states would reveal the confidential data for some ferry operators.
The estimated number of passengers for 2005 may now be reported by BTS, along with its estimated variance due to imputation. Other estimates based on the passenger boarding data may also be computed and reported, such as passenger miles traveled. Care will be taken to ensure confidentiality for operators who requested their data be kept confidential. The confidential data were used in the production of the imputation replicates.
It may be possible to impute data for missing passenger reports from the 2000 ferry database and compare results. The methodology for imputing the 2000 data, however, must be different because there is no earlier census to serve as the source of an informative prior, which will likely result in a larger variance due to imputation. It is not clear whether a statistically significant difference between the two years could be detected using this methodology. It may also be possible to further reduce the range of plausible values for each operator with missing values on an individual basis. These ideas may be explored in future research.
In the next round (2008) of the census, additional data collection should be considered to better inform the imputation process for this variable. Information such as the schedule/number of trips for each route, the usual vessel for each route, and whether or not the route includes vehicles or is a passenger-only route would be helpful in this effort.
2 Documentation for the SAS MI Procedure can be found as of the date of this publication at http://support.sas.com.
3 More about the MCMC option in SAS Proc MI is contained in the SAS documentation. For further reading about Markov Chain Monte Carlo techniques, see http://www.stat.columbia.edu/~liam/teaching/neurostat-spr07/papers/mcmc/mcmc-gibbs-intro.pdf as of May 2008.
About this Report
This report was prepared by Lee H. Giesbrecht, survey statistician and project manager for the 2006 National Census of Ferry Operators.
This report presents findings from the 2006 National Census of Ferry Operators (NCFO) augmented with imputed values for passengers and passenger miles. Because of the imputation procedures used to fill in missing data, totals in Table 1 may not correspond to calculations obtained from using only the data in the NCFO. The 2006 NCFO data were collected from 230 ferry operators by the Bureau of Transportation Statistics, a component of the Research and Innovative Technology Administration in the U.S. Department of Transportation. The data were supplemented by other sources of ferry data, such as the U.S. Coast Guard and the Army Corps of Engineers. The database contains information on ferry systems, including operators, routes, vessels, and passenger and vehicle boarding. The ferry database is available online at www.bts.gov.