Lawrence H. Cox, Associate Director
National Center for Health Statistics
LCOX@CDC.GOV
Bureau of Transportation Statistics Confidentiality Seminar
Washington, DC
September 17, 2003
PRESENTATION HANDOUT–DO NOT QUOTE OR CITE
Tabular data
Tables
What is disclosure?
Count data: disclosure = small counts (1, 2, ...)
Magnitude data: disclosure = dominated cell value
Example:
Shoe company # 1: | $10M |
Shoe company # 2: | $6M |
Other companies (total): | $1M |
Cell value: | $17M |
# 2 can subtract its contribution from cell value and infer contribution of #1 to within 10% of its true value = DISCLOSURE
Cells containing disclosure are called sensitive cells
How is disclosure in tabular data limited by statistical agencies?
Example: If estimation of any contribution to within 20% is safe (policy decision), then a safe value above would be $18M
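The arithmetic behind the example can be checked with a short helper; the function name and interface are hypothetical, invented here for illustration:

```python
def estimation_precision(cell_total, second_largest, largest):
    """Relative error when the second-largest contributor subtracts its own
    contribution from the published cell total to bound the largest one."""
    estimate = cell_total - second_largest  # upper bound on the largest contribution
    return (estimate - largest) / largest

# Shoe-company cell: $10M + $6M + $1M = $17M
print(estimation_precision(17, 6, 10))  # 0.1 -> within 10%: sensitive under a 20% rule
# Publishing $18M instead widens the bound to exactly the 20% threshold
print(estimation_precision(18, 6, 10))  # 0.2
```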
What is cell suppression?
Some properties of cell suppression:
How does cell suppression address data quality?
Cell suppression employs a linear objective function to control oversuppression
Namely, the mathematical program is instructed to minimize, e.g., the number of suppressed cells or the total value (units) suppressed
These are overall (global) measures of data distortion
Further, individual cell costs or capacities can be set to control individual (local) distortion
These are all sensible criteria and worth doing
However, they do not preserve statistical properties (moments)
Moreover, suppression destroys data and thwarts analysis
Original CTA Method (Dandekar and Cox 2002)
167 | 317 | 1284 | 587 | 4490 | 3981 | 2442 | 1150 | 70 (21) | 14488 |
57(1) | 1487 | 172 | 667 | 1006 | 327 | 1683 | 1138 | 46 (7) | 6583 |
616 | 202 | 1899 | 1098 | 2172 | 3825 | 4372 | 300(40) | 787 | 15271 |
0 | 36(10) | 0 | 16(4) | 0 | 0 | 65 | 0 | 140(40) | 257 |
840 | 2042 | 3355 | 2368 | 7668 | 8133 | 8562 | 2588 | 1043 | 36599 |
Example 1: 4x9 Table of Magnitude Data & Protection Limits (shown in parentheses) for the 7 Disclosure Cells
D | 317 | 1284 | D | 4490 | 3981 | 2442 | 1150 | D | 14488 |
D | 1487 | 172 | 667 | 1006 | 327 | 1683 | D | D | 6583 |
616 | D | 1899 | 1098 | 2172 | 3825 | 4372 | D | 787 | 15271 |
0 | D | 0 | D | 0 | 0 | 65 | 0 | D | 257 |
840 | 2042 | 3355 | 2368 | 7668 | 8133 | 8562 | 2588 | 1043 | 36599 |
Example 1a: After Optimal Suppression: 11 Cells (30%) & 2759 Units (7.5%) Suppressed
167 | 317 | 1276 | 587 | 4490 | 3981 | 2442 | 1150 | 91 | 14501 |
56 | 1487 | 172 | 667 | 1006 | 327 | 1679 | 1138 | 39 | 6571 |
617 | 196 | 1899 | 1095 | 2172 | 3825 | 4371 | 260 | 797 | 15232 |
0 | 26 | 0 | 12 | 0 | 0 | 70 | 0 | 180 | 288 |
840 | 2026 | 3347 | 2361 | 7668 | 8133 | 8562 | 2548 | 1107 | 36592 |
Example 1b: After Controlled Tabular Adjustment
167 | 317 | 1276 | 587 | 4490 | 3981 | 2442 | 1150 | 91 | 14501 |
56 | 1487 | 172 | 667 | 1006 | 327 | 1683 | 1138 | 35 | 6571 |
617 | 202 | 1899 | 1098 | 2172 | 3825 | 4372 | 260 | 787 | 15232 |
0 | 20 | 0 | 9 | 0 | 0 | 65 | 0 | 194 | 288 |
840 | 2026 | 3347 | 2361 | 7668 | 8133 | 8562 | 2548 | 1107 | 36592 |
Example 1c: Table After Optimal Controlled Tabular Adjustment (Regression)
Original data: n×1 vector a
Adjusted data: n×1 vector a + y^+ - y^-
T denotes the coefficient matrix for the tabulation equations
Denote y = y^+ - y^-
Cells i = 1, ..., s are the sensitive cells
Upper (lower) protection for sensitive cell i denoted p_i (-p_i)
MILP for the case of minimizing the sum of absolute adjustments:
min Σ_i (y_i^+ + y_i^-)
Subject to:
T(y) = 0
y_i^- = p_i(1 - I_i), y_i^+ = p_i I_i, i = 1, ..., s (sensitive cells)
0 ≤ y_i^-, y_i^+ ≤ e_i, i = s+1, ..., n (nonsensitive cells)
I_i binary, i = 1, ..., s
Capacities e_i on adjustments to nonsensitive cells are typically small, e.g., based on measurement error
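As a sketch, the MILP above can be set up for a small hypothetical 2x2 table (with row, column, and grand totals) using SciPy's `milp` (SciPy ≥ 1.9). The table values, the single sensitive cell, and the choices of p and e are all invented for illustration; they are not from the handout's examples:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Cells: a11, a12, a21, a22, r1, r2, c1, c2, g (internal, row, column, grand totals)
a = np.array([10., 6., 7., 9., 16., 16., 17., 15., 32.])
n = len(a)
# Tabulation equations T a = 0
T = np.array([
    [1, 1, 0, 0, -1,  0,  0,  0,  0],   # a11 + a12 = r1
    [0, 0, 1, 1,  0, -1,  0,  0,  0],   # a21 + a22 = r2
    [1, 0, 1, 0,  0,  0, -1,  0,  0],   # a11 + a21 = c1
    [0, 1, 0, 1,  0,  0,  0, -1,  0],   # a12 + a22 = c2
    [0, 0, 0, 0,  1,  1,  0,  0, -1],   # r1 + r2 = g
    [0, 0, 0, 0,  0,  0,  1,  1, -1],   # c1 + c2 = g
], dtype=float)

p = 5.0   # protection for the single sensitive cell, index 0 (illustrative)
e = 3.0   # capacity on adjustments to nonsensitive cells (illustrative)

# Variable vector x = [y+ (n), y- (n), I (1 binary)]
c = np.concatenate([np.ones(2 * n), [0.0]])        # min sum of absolute adjustments

A_tab = np.hstack([T, -T, np.zeros((6, 1))])       # T(y+ - y-) = 0
row_m = np.zeros(2 * n + 1); row_m[n] = 1.0; row_m[-1] = p    # y0- + p*I = p
row_p = np.zeros(2 * n + 1); row_p[0] = 1.0; row_p[-1] = -p   # y0+ - p*I = 0
A_eq = np.vstack([A_tab, row_m, row_p])
b_eq = np.concatenate([np.zeros(6), [p, 0.0]])

lb = np.zeros(2 * n + 1)
ub = np.concatenate([[p], e * np.ones(n - 1), [p], e * np.ones(n - 1), [1.0]])
integrality = np.concatenate([np.zeros(2 * n), [1.0]])   # only I is integer

res = milp(c=c, constraints=LinearConstraint(A_eq, b_eq, b_eq),
           integrality=integrality, bounds=Bounds(lb, ub))
y = res.x[:n] - res.x[n:2 * n]
adjusted = a + y          # additive, tabulation-consistent adjusted table
```

The binary I forces the sensitive cell the full protection distance down (I = 0) or up (I = 1), while the nonsensitive cells absorb small compensating adjustments within their capacities.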
Based on mathematical programming, just like cell suppression, CTA can minimize an overall distortion measure, e.g., the sum of absolute adjustments Σ|y_i|
In addition, adjustments to nonsensitive cells can be restricted to lie within measurement error
Still, this may not ensure good statistical outcomes, namely, that analyses on the original and adjusted data yield comparable results
Verification of “comparable results” is mostly empirical
Many, many analyses are possible: Which analysis to choose?
Instead, we focus on preserving key statistics and linear models
between original and adjusted data
Can do this using direct (Tabu) search
I will describe how to do so well in most cases using LP
For simplicity, assume that the down/up decisions for sensitive cells have already been made (by heuristic)
When the LP holds a total fixed, it preserves the mean of the cell values contributing to that total; e.g., fixing the grand total preserves the overall mean
In general, to preserve a mean, introduce (new) constraint: Σ (adjustments to cells contributing to the mean) = 0
A criticism of CTA is that it introduces too much distortion into the values of the sensitive cells
In general, the intruder does not necessarily know which cells are sensitive, nor care to analyze only sensitive data, so focusing on distortions to sensitive values may be a bit of a red herring
Still, it is useful to demonstrate how to preserve the mean of the sensitive cell values, as the method applies to preserving the mean of any subset of cells
Preserving the mean of the sensitive cell values is equivalent to constraining their net adjustment to zero: Σ_{i=1}^{s} (y_i^+ - y_i^-) = 0
If, as in the original Dandekar-Cox implementation, we allow only the two choices ±p_i for y_i, this is unlikely to be feasible
However, satisfying this constraint is not a problem if we simply expand the set of possible y-values, viz., if we permit slightly larger down/up adjustments
The MILP is:
min c(y)
Subject to:
T(y) = 0
Σ_{i=1}^{s} (y_i^+ - y_i^-) = 0
p_i(1 - I_i) ≤ y_i^- ≤ q_i(1 - I_i)
p_i I_i ≤ y_i^+ ≤ q_i I_i, i = 1, ..., s
0 ≤ y_i^-, y_i^+ ≤ e_i, i = s+1, ..., n
I_i binary, i = 1, ..., s
q_i are appropriate upper bounds on changes to sensitive cells
c(y) is a linear cost function, typically involving the sum of absolute adjustments
If the down/up directions are pre-selected, this is an LP
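When the down/up direction for each sensitive cell is pre-selected, the program above reduces to an LP. The sketch below solves that LP with SciPy's `linprog` on an invented 2x2 table with margins (one sensitive cell pushed up; p, q, and e are illustrative), including the zero-net-adjustment (mean-preserving) constraint over the internal cells:

```python
import numpy as np
from scipy.optimize import linprog

# Cells: a11, a12, a21, a22, r1, r2, c1, c2, g; values are illustrative
a = np.array([10., 6., 7., 9., 16., 16., 17., 15., 32.])
n = len(a)
T = np.array([  # tabulation equations T a = 0
    [1, 1, 0, 0, -1,  0,  0,  0,  0],
    [0, 0, 1, 1,  0, -1,  0,  0,  0],
    [1, 0, 1, 0,  0,  0, -1,  0,  0],
    [0, 1, 0, 1,  0,  0,  0, -1,  0],
    [0, 0, 0, 0,  1,  1,  0,  0, -1],
    [0, 0, 0, 0,  0,  0,  1,  1, -1],
], dtype=float)

p, q = 5.0, 7.0   # lower/upper protection bounds for sensitive cell 0 (illustrative)
e = 3.0           # capacity for nonsensitive cells (illustrative)

# Variables x = [y+ (n), y- (n)]; direction "up" pre-selected for cell 0,
# so p <= y0+ <= q and y0- = 0 -- no binaries, a pure LP.
c = np.ones(2 * n)                       # min sum of absolute adjustments
A_eq = np.hstack([T, -T])                # T(y+ - y-) = 0
b_eq = np.zeros(6)
m = np.zeros(n); m[:4] = 1.0             # internal cells contribute to the mean
A_eq = np.vstack([A_eq, np.concatenate([m, -m])])   # their net adjustment = 0
b_eq = np.append(b_eq, 0.0)

bounds = [(p, q)] + [(0.0, e)] * (n - 1)       # y+ bounds
bounds += [(0.0, 0.0)] + [(0.0, e)] * (n - 1)  # y- bounds

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
y = res.x[:n] - res.x[n:]
adjusted = a + y   # protected, additive, mean-preserving over internal cells
```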
Seek: Var(a + y) ≈ Var(a), assuming the net adjustment is zero (the mean of y is 0)
Var(a + y) = Var(a) + 2Cov(a, y) + Var(y)
Define L(y) = Cov(a, y)/Var(a)
L(y) is a linear function of the adjustments y
Var(a + y)/Var(a) = 1 + 2L(y) + Var(y)/Var(a)
|Var(a + y)/Var(a) - 1| = |2L(y) + Var(y)/Var(a)|
Var(y) is nonlinear, but can be linearly approximated
Alternatively: typically Var(y)/Var(a) is small
Thus, variance is approximately preserved by minimizing |L(y)|
The absolute value is minimized as follows:
* incorporate two new linear constraints in the system:
w ≥ L(y)
w ≥ - L(y)
* minimize w
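A minimal sketch of this linearization with `linprog` (the data are invented; f is the coefficient vector of L(y) when the mean of y is held at zero):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical cell values; with mean(y) = 0, L(y) = Cov(a, y)/Var(a) = f . y
a = np.array([3., 7., 5., 9.])
n = len(a)
f = (a - a.mean()) / (n * a.var())       # coefficients of the linear function L(y)

# Variables: y (n adjustments) and the auxiliary scalar w.
# Fix y[0] = 2 (a stand-in for a protection adjustment), require sum(y) = 0,
# and minimize |L(y)| via:  min w  s.t.  w >= f.y  and  w >= -f.y
c = np.concatenate([np.zeros(n), [1.0]])            # objective: minimize w
A_ub = np.array([np.append(f, -1.0),                #  f.y - w <= 0
                 np.append(-f, -1.0)])              # -f.y - w <= 0
b_ub = np.zeros(2)
A_eq = np.array([np.append(np.ones(n), 0.0)])       # sum(y) = 0 (zero net adjustment)
b_eq = np.array([0.0])
bounds = [(2.0, 2.0)] + [(-3.0, 3.0)] * (n - 1) + [(0.0, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
y, w = res.x[:n], res.x[n]               # at the optimum, w = |L(y)|
```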
Seek: Corr(a, a + y) ≈ 1
Corr(a, a + y) = Cov(a, a + y) ÷ √(Var(a) Var(a + y))
After some algebra,
Corr(a, a + y) = (1 + L(y)) ÷ √(Var(a + y)/Var(a))
Again: min |L(y)| yields a good approximation because it drives both numerator and denominator to one
Seek: under ordinary least squares regression
Y = β_1 X + β_0
of adjusted data Y = a + y on original data X = a,
we want β_1 ≈ 1 and β_0 ≈ 0
As the net adjustment is constrained to zero (the mean of y is 0), β_0 ≈ 0 if β_1 ≈ 1
Because β_1 = 1 + L(y), this corresponds to L(y) ≈ 0 (if feasible)
Note again: this is achieved via min |L(y)|
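A quick numerical check of the slope identity β_1 = 1 + L(y) under zero-mean adjustments (random illustrative data, not from the handout):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(10.0, 100.0, size=12)   # original cell values (illustrative)
y = rng.uniform(-3.0, 3.0, size=12)     # candidate adjustments
y -= y.mean()                           # impose zero net adjustment, as in CTA

L = np.cov(a, y, bias=True)[0, 1] / a.var()     # L(y) = Cov(a, y)/Var(a)
slope, intercept = np.polyfit(a, a + y, 1)      # OLS fit of adjusted on original

# slope equals 1 + L(y); intercept equals mean(a)*(1 - slope) since mean(y) = 0
```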
Variance is preserved by minimizing |L(y)|
Correlation is preserved by minimizing |L(y)|
Regression slope is preserved by L(y) ≈ 0 (if feasible)
All subject to the tabulation and protection constraints
If Var(y)/Var(a) is small (the typical case), imposing the objective function min |L(y)| assures good results simultaneously
A shortcut is to incorporate the constraint L(y) = 0 (if feasible)
Choosing L(y) ≈ 0 is motivated statistically because it implies (near) zero correlation between the values a and the adjustments y; viz., as the solutions y and -y are interchangeable, this correlation should be zero
4x9 Table
Original Table
167500 | 317501 | 1283751 | 587501 | 4490751 | 3981001 | 2442001 | 1150000 | 70000 | 14490006 |
56250 | 1487000 | 172500 | 667503 | 1006253 | 327500 | 1683000 | 1138250 | 46000 | 6584256 |
616752 | 202750 | 1899502 | 1098751 | 2172251 | 3825251 | 4372753 | 300000 | 787500 | 15275510 |
0 | 35000 | 0 | 16250 | 0 | 0 | 65000 | 0 | 140000 | 256250 |
840502 | 2042251 | 3355753 | 2370005 | 7669255 | 8133752 | 8562754 | 2588250 | 1043500 | 36606022 |
Protection Levels (+/-)
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21000 |
625 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7800 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 40000 | 0 |
0 | 10500 | 0 | 4875 | 0 | 0 | 0 | 0 | 42000 |
Table 1: 4x9 Table of Magnitude Data and Protection Limits for Its Seven Sensitive Cells (in red)
min Σ|y_i|
166875 | 307001 | 1283751 | 587501 | 4490751 | 3981001 | 2442001 | 1150000 | 91000 | 14499881 |
56875 | 1487000 | 172500 | 667503 | 1006253 | 327500 | 1683000 | 1141875 | 38200 | 6580706 |
616752 | 202750 | 1899502 | 1103626 | 2172251 | 3825251 | 4372753 | 260000 | 816300 | 15269185 |
0 | 45500 | 0 | 11375 | 0 | 0 | 65000 | 36375 | 98000 | 256250 |
840502 | 2042251 | 3355753 | 2370005 | 7669255 | 8133752 | 8562754 | 2588250 | 1043500 | 36606022 |
min |L-Bnd| (Variance)
167500 | 317501 | 1283751 | 587501 | 4490751 | 3981001 | 2442001 | 1150000 | 91003 | 14511009 |
55625 | 1487000 | 172500 | 667503 | 1006253 | 327500 | 1683000 | 1146675 | 38200 | 6584256 |
616752 | 202750 | 1899502 | 1098751 | 2172251 | 3825251 | 4372753 | 260000 | 787498 | 15235508 |
0 | 18791 | 0 | 8125 | 0 | 0 | 65000 | 0 | 191756 | 283672 |
839877 | 2026042 | 3355753 | 2361880 | 7669255 | 8133752 | 8562754 | 2556675 | 1108457 | 36614445 |
max L (Corr.)
167500 | 317501 | 1283751 | 587501 | 4490751 | 3981001 | 2442001 | 1129000 | 91000 | 14490006 |
55313 | 1499637 | 172500 | 667503 | 1006253 | 327500 | 1683000 | 1138250 | 34300 | 6584256 |
616752 | 202750 | 1899502 | 1098751 | 2172251 | 3825251 | 4372753 | 359884 | 787500 | 15335394 |
937 | 19250 | 0 | 8938 | 0 | 0 | 65000 | 0 | 94815 | 188940 |
840502 | 2039138 | 3355753 | 2362693 | 7669255 | 8133752 | 8562754 | 2627134 | 1007615 | 36598596 |
min |L| (Regress.)
167500 | 317501 | 1276439 | 587501 | 4490751 | 3981001 | 2442001 | 1150000 | 91000 | 14503694 |
55625 | 1487000 | 172500 | 667503 | 1006253 | 327500 | 1683000 | 1138250 | 34420 | 6572051 |
616752 | 202750 | 1899502 | 1106063 | 2172251 | 3825251 | 4372753 | 260000 | 787500 | 15242822 |
0 | 19250 | 0 | 8938 | 0 | 0 | 65000 | 0 | 194267 | 287455 |
839877 | 2026501 | 3348441 | 2370005 | 7669255 | 8133752 | 8562754 | 2548250 | 1107187 | 36606022 |
Table 2: Original Table After Various Controlled Tabular Adjustments Using Linear Programming to Preserve Statistical Properties of Sensitive Cells Only
Summary: 4x9 Table Linear Programming
Sensitive Cells | Corr. | Regress. Slope | New Var. / Original Var. |
min Σ|y_i| | 0.98 | 0.82 | 0.70 |
min |L-Bound| (Var.) | 0.95 | 0.93 | 0.94 |
max L (Cor.) | 0.97 | 1.20 | 1.52 |
min |L| (Reg.)* | 0.95 | 0.93 | 0.95 |
All Cells | Corr. | Regress. Slope | New Var. / Original Var. |
All 4 Functions | 1.00 | 1.00 | 1.00 |
Table 3: Summary of Results of Numeric Simulations on 4x9 Table Using Linear Programming
* = compromise solution
Summary: 13x13x13 Table Linear Programming
Sensitive Cells | Corr. | Regress. Slope | New Var. / Original Var. |
min Σ|y_i| | 0.995 | 0.96 | 0.94 |
min |L-Bound| (Var.) | 0.995 | 1.00 | 1.00 |
max L (Cor.) | 0.995 | 1.00 | 1.21 |
min |L| (Reg.)* | 0.995 | 1.00 | 1.01 |
All Cells | Corr. | Regress. Slope | New Var. / Original Var. |
All 4 Functions | 1.00 | 1.00 | 1.00 |
Table 4: Summary of Results of Numeric Simulations on 13x13x13 Table Using Linear Programming
* = compromise solution