Variance estimation procedures have been developed to account for complex sample designs. Using these procedures, factors such as the selection of the sample, the use of differential sampling rates to subsample a subpopulation and nonresponse adjustments can be appropriately reflected in estimates of sampling error. The two main methods for estimating variances from a complex survey are known as Taylor series variance estimation (linear approximation) and replication (including jackknife and balanced repeated replication (BRR) methods). Wolter (1985) is a useful reference on the theory and applications of these methods. Shao (1996) is a more recent review paper that compares these methods.
Standard statistical software packages that assume a simple random sampling design do not properly compute variance estimates from weighted data collected under a design other than simple random sampling. Using the variable RAKEDW00 as the final full sample weight in standard statistical programs will produce accurate point estimates; however, it will not produce accurate variance estimates.
To overcome this limitation, this document gives guidance for analyzing the survey data using the software package SUDAAN (Software for the Statistical Analysis of Correlated Data) based on the Taylor series and replication methods (Research Triangle Institute, 1997). SUDAAN is a statistical package developed by Research Triangle Institute (RTI) to analyze data from complex sample surveys. SUDAAN computes the standard errors of the estimates taking the survey design into account. While SUDAAN version 8 and later can use replication methods, it is most often used for computing variances based on the first-order Taylor series approximation, also known as linearization. Though this section only provides details on the use of SUDAAN, the software packages STATA and WesVar can also be used for linear approximation and replication methods, respectively.
Although SUDAAN's estimates of variance based on linearization take into account the sample design of the survey, they do not properly reflect the variance reduction due to raking and poststratification. The weights in this survey were raked to control totals in the final step of the weighting process. Replication methods are more appropriate for computing estimates of variance under this condition. The magnitude of the reduction, however, will depend on the type of estimate (e.g., total or proportion) and on the correlation between the variable being analyzed and the dimensions used in raking.
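Raking can be illustrated with a small sketch. The Python fragment below is a minimal illustration under stated assumptions, not the procedure used for this survey: the function name rake, the two dimensions (sex and age group), the starting weights and the control totals are all hypothetical. It adjusts the weights one dimension at a time until the weighted margins agree with the control totals.

```python
def rake(weights, dimensions, controls, iterations=50):
    """Rake weights so weighted margins match the control totals.

    dimensions: one list per raking dimension, giving each record's category.
    controls:   one dict per dimension, mapping category -> control total.
    All names and values used here are hypothetical illustrations.
    """
    w = list(weights)
    for _ in range(iterations):
        for cats, ctrl in zip(dimensions, controls):
            # Current weighted total of each category in this dimension.
            sums = {}
            for wi, c in zip(w, cats):
                sums[c] = sums.get(c, 0.0) + wi
            # Scale every record so its category hits the control total.
            w = [wi * ctrl[c] / sums[c] for wi, c in zip(w, cats)]
    return w


# Hypothetical four-record example with two raking dimensions.
sex = ["M", "M", "F", "F"]
age = ["young", "old", "young", "old"]
raked = rake(
    [1.0, 1.0, 1.0, 1.0],
    [sex, age],
    [{"M": 40.0, "F": 60.0}, {"young": 30.0, "old": 70.0}],
)
print(raked)
```

Because the raked weights are forced to reproduce the control totals, estimates for variables that are closely related to the raking dimensions vary less from sample to sample than the unraked design alone would suggest.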
This section describes how to use SUDAAN with both the Taylor series and replication methods to analyze the survey data and compute appropriate standard errors, and shows which options are appropriate to use. The data file contains 5,019 records, one for every completed extended interview.
Required Variables
The variables that provide information about the sample design in SUDAAN are:
Variable TSVUNIT (Taylor series variance unit). The variable TSVUNIT indicates the primary sampling unit (PSU) to be used for computing the estimates of variance using the Taylor series method. In the survey, the PSU corresponds to the household.
Variable RAKEDW00 (final full sample weight). The variable RAKEDW00 contains the final weight for the full sample. This weight is positive for all the records.
SUDAAN Keywords
The statements and keywords needed to run SUDAAN to compute variance estimates based on the Taylor Series approximation are:
DESIGN=WR (required). The sample was drawn without replacement; however, the WR (with replacement) design option is used because the finite population correction factor (fpc) is negligible. (Note: STRWR is not used because this requires that each record be a PSU, which is not the case because two persons could be sampled from the same household.)
NEST TSVUNIT /PSULEV = 1 (required). The keyword NEST lists the variables whose values identify the sampling stages. The option /PSULEV = 1 instructs SUDAAN that TSVUNIT is the PSU level variable in position 1 in the NEST statement.
WEIGHT RAKEDW00 (required). The keyword WEIGHT lists the final weight to be used in the analysis. In this case, the variable for the weight is the final full sample weight RAKEDW00.
The variable TSVSTR in combination with the variable TSVUNIT can also be used to compute the standard errors with the appropriate changes in the NEST statement. The variable TSVSTR indicates the sampling stratum. In the survey, TSVSTR is set to 1 for all the records. An example of the use of this variable is also included in the following section.
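The calculation that these statements request can be sketched outside SUDAAN. The Python fragment below is a minimal illustration, assuming the textbook with-replacement linearization estimator for a weighted mean with all PSUs in a single stratum; the function name taylor_wr_mean and the example records are hypothetical, and the code is not a description of SUDAAN's internal implementation.

```python
from collections import defaultdict


def taylor_wr_mean(values, weights, psu_ids):
    """Weighted mean and its linearized with-replacement standard error.

    Assumes a single stratum, as when TSVSTR is 1 for every record.
    """
    total_w = sum(weights)
    mean = sum(w * y for w, y in zip(weights, values)) / total_w

    # Linearized score of each record, summed within its PSU (household).
    psu_scores = defaultdict(float)
    for y, w, psu in zip(values, weights, psu_ids):
        psu_scores[psu] += w * (y - mean) / total_w

    # Between-PSU variance of the score totals, with no finite
    # population correction (the WR assumption).
    n = len(psu_scores)
    score_mean = sum(psu_scores.values()) / n
    variance = n / (n - 1) * sum(
        (z - score_mean) ** 2 for z in psu_scores.values()
    )
    return mean, variance ** 0.5


# Hypothetical records: two persons share household H1.
ages = [34, 36, 52, 27]
final_weights = [1000.0, 1000.0, 1500.0, 800.0]
households = ["H1", "H1", "H2", "H3"]
print(taylor_wr_mean(ages, final_weights, households))
```

Summing the scores within each household before computing the variance is what distinguishes this calculation from one that treats every record as its own PSU.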
SUDAAN is not the only statistical software that can be used to generate approximate standard errors using linear approximation. The statistical software STATA can be used as well. The variables TSVUNIT and TSVSTR can be used as the nesting variables and RAKEDW00 as the full sample weight in STATA to correctly generate both point estimates and standard errors.
The additional statements and keywords needed to run SUDAAN to compute estimates of variance based on replication methods are:
DESIGN= JACKKNIFE (required). The survey data file includes replicate weights that can be used in SUDAAN. The replication method used to create the weights is a form of the jackknife method. If estimates of variance based on replication methods are computed, the option JACKKNIFE should be used in the design statement.
JACKWGTS RAKEDW01-RAKEDW80 / ADJJACK=1 (required). The keyword JACKWGTS is followed by the list of variable names for the 80 replicate weights created for the survey (RAKEDW01-RAKEDW80). When computing variances, replicate-based estimates need to be adjusted by a constant value c that depends on the replication method used. For the replicates in this survey, the value of c is 1, and the option ADJJACK=1 instructs SUDAAN to apply this adjustment.
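The role of the constant c can be seen in a short sketch. The Python fragment below is a minimal illustration, assuming the usual replicate-weight variance formula in which the variance is c times the sum of squared differences between each replicate estimate and the full sample estimate; the function names and the four-record example with two replicates are hypothetical, whereas the survey file carries 80 replicate weights.

```python
def weighted_mean(values, weights):
    """Weighted mean of values under one set of weights."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def replicate_se(values, full_weights, replicate_weights, c=1.0):
    """Full sample estimate and its replicate-based standard error.

    c is the adjustment constant for the replication method
    (c = 1 for the replicates described in this document).
    """
    full = weighted_mean(values, full_weights)
    # Re-estimate the statistic once per set of replicate weights.
    replicates = [weighted_mean(values, rw) for rw in replicate_weights]
    variance = c * sum((r - full) ** 2 for r in replicates)
    return full, variance ** 0.5


# Hypothetical data: four records, two sets of replicate weights.
values = [1.0, 2.0, 3.0, 4.0]
full_w = [1.0, 1.0, 1.0, 1.0]
rep_w = [
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0, 0.0],
]
print(replicate_se(values, full_w, rep_w, c=1.0))
```

If each set of replicate weights has been through the same raking step as the full sample weight, as the variable names RAKEDW01 to RAKEDW80 suggest, the spread of the replicate estimates reflects the variance reduction from raking without any special formula.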
WesVar can also be used to generate point estimates and appropriate standard errors using replication methods. The dataset contains 80 replicate weights (RAKEDW01-RAKEDW80) for the full sample weight RAKEDW00. These replicate weights should be included in the file when creating the WesVar dataset, and JK2 should be selected as the jackknife method. The ID variable on this file is PERSID.
Estimates Using SUDAAN based on the Taylor Series approximation
Listing 1 shows an example of running SUDAAN's PROC CROSSTAB to compute totals, percentages and standard errors for the variable GENDER based on the Taylor Series approximation. The procedure CROSSTAB produces weighted frequencies and percentage distributions for categorical variables. The following statements were used to produce the output in Listing 1.
proc crosstab data = btsall design=WR ;
weight RAKEDW00 ;
NEST TSVUNIT /PSULEV=1 ;
subgroup gender ;
levels 2;
setenv colwidth = 17 decwidth= 3 ;
run ;
The following statements also produce the same output as Listing 1. The difference is the use of the variable TSVSTR in the NEST statement.
proc crosstab data = btsall design=WR ;
weight RAKEDW00 ;
NEST TSVSTR TSVUNIT;
subgroup gender ;
levels 2;
setenv colwidth = 17 decwidth= 3 ;
run ;
Date: 12-12-2002    Research Triangle Institute    Page  : 1
Time: 11:31:59      The CROSSTAB Procedure         Table : 1

Variance Estimation Method: Taylor Series (WR)
by: WHAT IS YOUR/SUBJECT'S GENDER.

                    WHAT IS YOUR/SUBJECT'S GENDER
                    Total              1                  2
Sample Size         5011.000           2322.000           2689.000
Weighted Size       273335024.970      133394837.990      139940186.980
SE Weighted         3826319.579        3328823.884        3319195.188
Row Percent         100.000            48.803             51.197
Col Percent         100.000            48.803             51.197
Tot Percent         100.000            48.803             51.197
SE Row Percent      0.000              0.995              0.995
SE Col Percent      0.000              0.995              0.995
SE Tot Percent      0.000              0.995              0.995
*The standard errors of both the estimated totals and percentages in Listing 1 are much larger than standard errors that take raking into account. This is because the effect of raking cannot be accounted for in PROC CROSSTAB when using Taylor series linearization.
Listing 2 shows an example of running SUDAAN's PROC DESCRIPT to compute means and standard errors for the variable AGE based on the Taylor Series approximation. The procedure DESCRIPT produces weighted totals and means and their standard errors for continuous variables. The following statements were used to produce the output in Listing 2.
PROC DESCRIPT DATA = btsall design = WR ;
WEIGHT RAKEDW00 ;
NEST TSVUNIT /PSULEV=1 ;
VAR AGE ;
setenv colwidth = 17 decwidth= 3 ;
print / style = nchs ;
run ;
S U D A A N
Software for the Statistical Analysis of Correlated Data
Copyright Research Triangle Institute July 2001
Release 8.0.0

Date: 12-12-2002    Research Triangle Institute    Page  : 1
Time: 11:32:24      The DESCRIPT Procedure         Table : 1

Variance Estimation Method: Taylor Series (WR)
by: Variable, One.

Variable          One   Sample Size   Weighted Size    Total             Mean     SE Mean
AGE AT SCREENER   1     4952.000      269936641.060    9544546622.010    35.358   0.423
Estimates Using SUDAAN based on replication
Listing 3 shows an example of running SUDAAN's PROC CROSSTAB to compute totals, percentages and standard errors for the variable GENDER based on replication. The standard errors are smaller than those in Listing 1 because replication methods can reflect the reduction in variance caused by raking. The survey weights were raked to five dimensions in the last step of weighting. For GENDER, the standard errors are much smaller (in particular for totals) because GENDER was used to create one of the raking dimensions. The following statements were used to produce the output in Listing 3.
proc crosstab data = btsall design=JACKKNIFE;
weight RAKEDW00 ;
JACKWGTS RAKEDW01-RAKEDW80 /ADJJACK=1;
subgroup gender ;
levels 2;
setenv colwidth = 17 decwidth= 3 ;
run ;
S U D A A N
Software for the Statistical Analysis of Correlated Data
Copyright Research Triangle Institute July 2001
Release 8.0.0

Number of observations read    : 5019    Weighted count : 273643273
Denominator degrees of freedom : 80

Date: 01-08-2003    Research Triangle Institute
Time: 13:00:12      The CROSSTAB Procedure

Variance Estimation Method: Replicate Weight Jackknife
by: WHAT IS YOUR/SUBJECT'S GENDER.

                    WHAT IS YOUR/SUBJECT'S GENDER
                    Total              1                  2
Sample Size         5011.000           2322.000           2689.000
Weighted Size       273335024.970      133394837.990      139940186.980
SE Weighted         129773.082         83463.088          95960.303
Row Percent         100.000            48.803             51.197
Col Percent         100.000            48.803             51.197
Tot Percent         100.000            48.803             51.197
SE Row Percent      0.000              0.023              0.023
SE Col Percent      0.000              0.023              0.023
SE Tot Percent      0.000              0.023              0.023
Listing 4 shows an example of running SUDAAN's PROC DESCRIPT to compute means and standard errors for the variable AGE based on replication. The following statements were used to produce the output in Listing 4.
PROC DESCRIPT DATA = btsall design = JACKKNIFE ;
WEIGHT RAKEDW00 ;
JACKWGTS RAKEDW01-RAKEDW80 /ADJJACK=1;
VAR AGE ;
setenv colwidth = 17 decwidth= 3 ;
print / style = nchs ;
run ;
Date: 01-08-2003    Research Triangle Institute    Page  : 1
Time: 13:26:21      The DESCRIPT Procedure         Table : 1

Variance Estimation Method: Replicate Weight Jackknife
by: Variable, One.

Variable          One   Sample Size   Weighted Size    Total             Mean     SE Mean
AGE AT SCREENER   1     4952.000      269936641.060    9544546622.010    35.358   0.081
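The size of the difference between the two methods can be read directly from the four listings. The short Python fragment below simply divides each Taylor series standard error by the corresponding jackknife standard error; the figures are copied from the listings, and the dictionary labels are descriptive names chosen for this illustration.

```python
# (Taylor series SE, jackknife SE) pairs copied from Listings 1 to 4.
standard_errors = {
    "GENDER, weighted total": (3826319.579, 129773.082),
    "GENDER = 1, percent": (0.995, 0.023),
    "AGE, mean": (0.423, 0.081),
}

# Ratio of the linearization SE to the replication SE for each estimate.
ratios = {
    label: taylor / jackknife
    for label, (taylor, jackknife) in standard_errors.items()
}

for label, ratio in ratios.items():
    print(f"{label}: Taylor series SE is {ratio:.1f} times the jackknife SE")
```

The ratios are largest for GENDER, which was used to create one of the raking dimensions, and smaller for AGE, which is consistent with the statement that the reduction depends on the relationship between the analysis variable and the raking dimensions.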
Shao, J. (1996). Resampling Methods in Sample Surveys (with Discussion). Statistics, 27, 203-254.
Wolter, K. (1985). Introduction to Variance Estimation.
Research Triangle Institute. (1997). SUDAAN user's manual (Release 7.5). Research Triangle Park: Author.