GEEの注意点の論文（Pepe & Anderson, 1994）を読んだ話

Pepe & Anderson (1994)

先日雑談をしていて、GEEで時間依存性共変量を使うときに注意が必要という話題になり、この論文を思い出して改めて読みなおしてみようかなと思った次第です。

A Key Condition

細かい話は論文を確認すれば良いとして、この論文で書かれているKey Conditionを見てみます（導出はセクション2に丁寧に書いてあります）。

When using an estimating equation of the form (2) to estimate $\beta$ from a marginal model $E(Y_{it} \mid X_{it}) = g(X_{it}\beta)$, either a diagonal working covariance matrix should be employed or one must validate that the marginal expectation $E(Y_{it} \mid X_{it}) = E(Y_{it} \mid X_{ij}, j=1,...,n_{i})$.

推定方程式$E(Y_{it} \mid X_{it}) = g(X_{it}\beta)$を使って$\beta$を推定するなら、対角の作業相関行列を使うか、$E(Y_{it} \mid X_{it}) = E(Y_{it} \mid X_{ij}, j=1,...,n_{i})$が成立することを確認せよということでした。
$X_{it}$が時間について固定されている場合には成立つことが自明ですが、$\mid X_{ij}$が時間変動する場合は微妙なケースが多そうです。

事例

論文ではシミュレーションを実施しており、そのうちの一つが、
\[
Y_{it}=\alpha Y_{i(t-1)}+\beta X_{it} + \epsilon_{it},~
t=1,\ldots,n_i,
\] というモデルでした。
ただし、$Y_{i0}=0$で、$(X_{it}, \epsilon_{it})$が平均$0$で互いに独立かつ$Y_{i(t-1)}$とも独立としていました。
これらの条件を設定しておくと、周辺平均に対しては
\[
E(Y_{it} \mid X_{it})=X_{it}\beta,
\] が成り立ちます。
一方で、条件付き周辺平均に対しては
\[
E(Y_{it} \mid X_{ij}, j=1,...,n_{i})=\beta \sum_{j \leq t} \alpha^{t-j}X_{ij}
\] が成り立ちます。
この例は、$E(Y_{it} \mid X_{it}) \ne E(Y_{it} \mid X_{ij}, j=1,...,n_{i})$となります。

検算

検算しやすいように、$X$は二値にして平均$0$になるように設定しました。

library(dplyr)
library(tidyr)

beta <- 1
alpha <- 1.5
df <- NULL
for(i in 1:1000) {
  x <- rep(NA, 3)
  for(t in 1:3) {
    x[t] <- sample(c(-1, 1), 1)
    if (t == 1) {
      ypre <- 0
    }
    y <- alpha*ypre + beta*x[t] + rnorm(1, 0, 0.1)
    pcm <- 0
    for(k in 1:t) {
      pcm <- pcm + alpha^(t - k)*x[k]
    }
    pcm <- beta*pcm
    ypre <- y
    df <- rbind(df, data.frame(i, t, y, x = x[t], pcm))
  } 
}

df %>%
  group_by(t, x) %>%
  summarize(mm = mean(y), mm_altc = mean(beta*x))

df2 <- df %>%
  pivot_wider(id_cols = i, names_from = t, values_from = x,
              names_prefix = "x") %>%
  mutate(patrn = paste0(x1, x2, x3))

df <- left_join(df, df2, by = "i")

df %>%
  group_by(patrn, t) %>%
  summarize(pcmm = mean(y), pcmm_altc = mean(pcm)) %>%
  print(n = 24)

周辺平均は以下の通り（mmはデータから計算、mm_altcは真値です）。

> df %>%
+   group_by(t, x) %>%
+   summarize(mm = mean(y), mm_altc = mean(beta*x))
# A tibble: 6 × 4
# Groups:   t [3]
      t     x    mm mm_altc
  <int> <dbl> <dbl>   <dbl>
1     1    -1 -1.00      -1
2     1     1  1.00       1
3     2    -1 -1.05      -1
4     2     1  1.06       1
5     3    -1 -1.23      -1
6     3     1  1.01       1

条件付き周辺平均は以下の通りです（pcmmはデータから計算、pcmm_altcは真値です）。

> df %>%
+   group_by(patrn, t) %>%
+   summarize(pcmm = mean(y), pcmm_altc = mean(pcm)) %>%
+   print(n = 24)
# A tibble: 24 × 4
# Groups:   patrn [8]
   patrn      t   pcmm pcmm_altc
   <chr>  <int>  <dbl>     <dbl>
 1 -1-1-1     1 -1.01      -1   
 2 -1-1-1     2 -2.52      -2.5 
 3 -1-1-1     3 -4.77      -4.75
 4 -1-11      1 -0.995     -1   
 5 -1-11      2 -2.50      -2.5 
 6 -1-11      3 -2.75      -2.75
 7 -11-1      1 -1.00      -1   
 8 -11-1      2 -0.515     -0.5 
 9 -11-1      3 -1.79      -1.75
10 -111       1 -0.993     -1   
11 -111       2 -0.489     -0.5 
12 -111       3  0.295      0.25
13 1-1-1      1  0.995      1   
14 1-1-1      2  0.483      0.5 
15 1-1-1      3 -0.274     -0.25
16 1-11       1  1.00       1   
17 1-11       2  0.486      0.5 
18 1-11       3  1.72       1.75
19 11-1       1  1.00       1   
20 11-1       2  2.51       2.5 
21 11-1       3  2.76       2.75
22 111        1  1.02       1   
23 111        2  2.54       2.5 
24 111        3  4.81       4.75

明らかに、$E(Y_{it} \mid X_{it}) \ne E(Y_{it} \mid X_{ij}, j=1,...,n_{i})$になっていることが確認できます。

上記のデータにGEEを適用してみます。

library(geepack)
summary(geeglm(y ~ x, id = i, data = df, corstr = "independence"))
summary(geeglm(y ~ x, id = i, data = df, corstr = "exchangeable"))
summary(geeglm(y ~ x, id = i, data = df, corstr = "unstructured"))

作業相関構造が"independence"以外では、$\hat{\beta}$が$1$に近い値になりませんでした。
あと、geepackではエラーが出ないですが、"unstructured"は推定に失敗していそうです。

> summary(geeglm(y ~ x, id = i, data = df, corstr = "independence"))

Call:
geeglm(formula = y ~ x, data = df, id = i, corstr = "independence")

 Coefficients:
            Estimate Std.err   Wald Pr(>|W|)    
(Intercept)  -0.0345  0.0428   0.65     0.42    
x             1.0584  0.0338 978.72   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = independence 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)     3.31  0.0736
Number of clusters:   1000  Maximum cluster size: 3 

> summary(geeglm(y ~ x, id = i, data = df, corstr = "exchangeable"))

Call:
geeglm(formula = y ~ x, data = df, id = i, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err   Wald Pr(>|W|)    
(Intercept)  -0.0563  0.0535   1.11     0.29    
x             0.3477  0.0176 392.38   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)     3.82   0.105
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.624 0.00495
Number of clusters:   1000  Maximum cluster size: 3 

> summary(geeglm(y ~ x, id = i, data = df, corstr = "unstructured"))

Call:
geeglm(formula = y ~ x, data = df, id = i, corstr = "unstructured")

 Coefficients:
            Estimate Std.err Wald Pr(>|W|)   
(Intercept)  -0.0691  0.0651 1.13   0.2878   
x            -0.0860  0.0285 9.10   0.0026 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = unstructured 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)     4.62    0.18
  Link = identity 

Estimated Correlation Parameters:
          Estimate Std.err
alpha.1:2    0.366 0.00869
alpha.1:3    0.556 0.00329
alpha.2:3    1.149 0.00537
Number of clusters:   1000  Maximum cluster size: 3

おまけ

（制限付き）最尤法ベースでもあまり変わらないですね。

library(nlme)
summary(glm(y ~ x, data = df))
summary(gls(y ~ x, data = df, correlation = corCompSymm(form = ~ 1 | i)))
summary(gls(y ~ x, data = df, correlation = corSymm(form = ~ 1 | i)))

> summary(glm(y ~ x, data = df))

Call:
glm(formula = y ~ x, data = df)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-4.437  -1.191   0.033   1.116   4.719  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.0345     0.0333   -1.04      0.3    
x             1.0584     0.0333   31.83   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 3.31)

    Null deviance: 13292.8  on 2999  degrees of freedom
Residual deviance:  9935.1  on 2998  degrees of freedom
AIC: 12112

Number of Fisher Scoring iterations: 2

> summary(gls(y ~ x, data = df, correlation = corCompSymm(form = ~ 1 | i)))
Generalized least squares fit by REML
  Model: y ~ x 
  Data: df 
    AIC   BIC logLik
  11401 11425  -5696

Correlation Structure: Compound symmetry
 Formula: ~1 | i 
 Parameter estimate(s):
  Rho 
0.625 

Coefficients:
             Value Std.Error t-value p-value
(Intercept) -0.056    0.0535   -1.05   0.293
x            0.347    0.0260   13.37   0.000

 Correlation: 
  (Intr)
x 0.015 

Standardized residuals:
    Min      Q1     Med      Q3     Max 
-2.6229 -0.4340  0.0394  0.4639  2.6845 

Residual standard error: 1.95 
Degrees of freedom: 3000 total; 2998 residual

> summary(gls(y ~ x, data = df, correlation = corSymm(form = ~ 1 | i)))
Generalized least squares fit by REML
  Model: y ~ x 
  Data: df 
    AIC   BIC logLik
  10602 10638  -5295

Correlation Structure: General
 Formula: ~1 | i 
 Parameter estimate(s):
 Correlation: 
  1      2     
2 -0.677       
3  0.062  0.596

Coefficients:
            Value Std.Error t-value p-value
(Intercept) 0.024    0.0192     1.3   0.206
x           0.577    0.0155    37.2   0.000

 Correlation: 
  (Intr)
x 0.04  

Standardized residuals:
      Min        Q1       Med        Q3       Max 
-2.52e+00 -5.56e-01 -3.64e-06  4.87e-01  2.50e+00 

Residual standard error: 1.98 
Degrees of freedom: 3000 total; 2998 residual

感想

完全に理解できたとは言い難いので、$E(Y_{it} \mid X_{it}) = E(Y_{it} \mid X_{ij}, j=1,...,n_{i})$の事例と$E(Y_{it} \mid X_{it}) \ne E(Y_{it} \mid X_{ij}, j=1,...,n_{i})$の事例をいっぱい作ってもうちょっと勉強したいなと思いました。

文献

Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics - Simulation and Computation 1994; 23(4): 939-951. doi: 10.1080/03610919408813210.