Notes on trying ridge regression in Stan [Statistics]
These are notes on trying ridge regression, following the Stan Modeling Language manual and berobero's post "Feature selection with the Bayesian Lasso in Stan".
Generating the data
set.seed(1234)
n.obs  <- 100
n.vars <- 20
x <- matrix(runif(n.obs * n.vars, 0, 1), nrow = n.obs, ncol = n.vars)
y <- 2 * x[, 1] + 3 * x[, 2] + rnorm(n.obs, 0, 0.3)
First, fit an ordinary regression model.
library(rstan)

reg_model <- "
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  vector[N] y;
}
parameters {
  vector[K] beta;       // coefficients
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(x * beta, sigma);
}"

fit <- stan(model_code = reg_model,
            data = list(N = n.obs, K = n.vars + 1, y = y,
                        x = cbind(rep(1, n.obs), x)),
            pars = c("beta", "sigma"), chains = 4, iter = 2000)
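Written out, this is a plain linear regression; since no priors are declared, Stan puts implicit flat priors on beta and on sigma (uniform over its positive support):

```latex
y_i \sim \mathrm{Normal}\!\left(x_i^\top \beta,\ \sigma\right), \qquad i = 1, \dots, N
```

Here the first column of x is the constant 1, so beta[1] is the intercept.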
Results
> fit
Inference for Stan model: reg_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta[1]   0.05    0.01 0.23  -0.41  -0.11   0.05   0.20   0.50  1981    1
beta[2]   2.09    0.00 0.11   1.88   2.02   2.09   2.16   2.29  3650    1
beta[3]   2.97    0.00 0.10   2.77   2.91   2.97   3.04   3.17  3293    1
beta[4]   0.01    0.00 0.11  -0.21  -0.07   0.01   0.08   0.23  4000    1
beta[5]  -0.07    0.00 0.10  -0.26  -0.14  -0.07  -0.01   0.11  4000    1
beta[6]   0.07    0.00 0.11  -0.14   0.00   0.07   0.15   0.30  2635    1
beta[7]  -0.07    0.00 0.10  -0.26  -0.13  -0.07   0.00   0.13  3562    1
beta[8]   0.01    0.00 0.10  -0.19  -0.06   0.01   0.08   0.21  3229    1
beta[9]   0.00    0.00 0.10  -0.20  -0.07   0.00   0.07   0.21  2955    1
beta[10] -0.18    0.00 0.11  -0.40  -0.26  -0.18  -0.11   0.04  2387    1
beta[11] -0.14    0.00 0.10  -0.33  -0.21  -0.15  -0.08   0.04  3585    1
beta[12]  0.02    0.00 0.10  -0.18  -0.05   0.02   0.09   0.22  3613    1
beta[13]  0.02    0.00 0.11  -0.19  -0.05   0.02   0.09   0.24  2893    1
beta[14] -0.36    0.00 0.11  -0.57  -0.43  -0.36  -0.29  -0.14  2788    1
beta[15]  0.09    0.00 0.11  -0.12   0.02   0.09   0.16   0.30  3005    1
beta[16] -0.08    0.00 0.11  -0.28  -0.16  -0.09  -0.01   0.13  3072    1
beta[17]  0.15    0.00 0.12  -0.08   0.07   0.15   0.23   0.39  2780    1
beta[18] -0.06    0.00 0.11  -0.28  -0.14  -0.07   0.01   0.16  3019    1
beta[19]  0.19    0.00 0.11  -0.02   0.12   0.19   0.26   0.40  4000    1
beta[20]  0.10    0.00 0.11  -0.10   0.03   0.09   0.17   0.30  2736    1
beta[21]  0.15    0.00 0.11  -0.07   0.08   0.15   0.22   0.37  2874    1
sigma     0.27    0.00 0.02   0.23   0.25   0.27   0.28   0.31  2698    1
lp__     81.66    0.10 3.76  73.34  79.29  82.04  84.32  87.94  1475    1

Samples were drawn using NUTS(diag_e) at Sun Oct 12 09:47:19 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Next, least squares. Passing -squared_error as the argument to increment_log_prob() makes the (unnormalized) log density the negative sum of squared errors, so the mode of the sampled distribution is the least-squares solution.
ols_model <- "
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  vector[N] y;
}
parameters {
  vector[K] beta;  // coefficients
}
transformed parameters {
  real<lower=0> squared_error;
  squared_error <- dot_self(y - x * beta);
}
model {
  increment_log_prob(-squared_error);
}"

fit.ols <- stan(model_code = ols_model,
                data = list(N = n.obs, K = n.vars + 1, y = y,
                            x = cbind(rep(1, n.obs), x)),
                pars = c("beta"), chains = 4, iter = 2000)
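Because increment_log_prob() simply adds to the log density, the distribution being sampled here is, up to a constant,

```latex
\log p(\beta \mid y) = -\lVert y - X\beta \rVert^2 + \mathrm{const},
\qquad
\hat{\beta}_{\mathrm{mode}} = (X^\top X)^{-1} X^\top y .
```

Note the usual Gaussian factor 1/(2σ²) is deliberately omitted, so while the posterior mean matches OLS, the posterior spread is wider than the standard OLS sampling variance (which is presumably why the sd column below is larger than in the first fit).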
Results
> fit.ols
Inference for Stan model: ols_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta[1]    0.06    0.01 0.62  -1.13  -0.36   0.05   0.49   1.26  2340    1
beta[2]    2.09    0.00 0.28   1.57   1.90   2.09   2.28   2.61  3827    1
beta[3]    2.98    0.00 0.26   2.47   2.80   2.98   3.16   3.48  4000    1
beta[4]    0.01    0.01 0.31  -0.59  -0.20   0.01   0.22   0.63  3860    1
beta[5]   -0.08    0.00 0.25  -0.56  -0.24  -0.08   0.08   0.42  3512    1
beta[6]    0.07    0.00 0.29  -0.51  -0.13   0.06   0.27   0.65  4000    1
beta[7]   -0.07    0.00 0.25  -0.56  -0.24  -0.07   0.10   0.44  4000    1
beta[8]    0.00    0.00 0.28  -0.53  -0.19   0.00   0.18   0.56  4000    1
beta[9]    0.01    0.00 0.27  -0.50  -0.17   0.01   0.19   0.54  3260    1
beta[10]  -0.19    0.01 0.29  -0.76  -0.38  -0.18   0.00   0.37  3088    1
beta[11]  -0.14    0.00 0.25  -0.65  -0.31  -0.14   0.03   0.36  4000    1
beta[12]   0.02    0.00 0.27  -0.49  -0.16   0.02   0.20   0.56  4000    1
beta[13]   0.01    0.00 0.29  -0.54  -0.18   0.01   0.21   0.59  3384    1
beta[14]  -0.36    0.01 0.29  -0.93  -0.56  -0.36  -0.16   0.20  2884    1
beta[15]   0.09    0.00 0.27  -0.45  -0.09   0.10   0.27   0.64  4000    1
beta[16]  -0.09    0.00 0.27  -0.63  -0.27  -0.08   0.09   0.45  4000    1
beta[17]   0.14    0.01 0.31  -0.47  -0.07   0.14   0.35   0.76  3325    1
beta[18]  -0.07    0.00 0.29  -0.62  -0.26  -0.07   0.12   0.48  3453    1
beta[19]   0.18    0.00 0.28  -0.36  -0.01   0.18   0.38   0.72  4000    1
beta[20]   0.10    0.00 0.28  -0.42  -0.09   0.10   0.28   0.66  3649    1
beta[21]   0.14    0.00 0.30  -0.44  -0.05   0.14   0.34   0.74  3762    1
lp__     -15.91    0.07 3.21 -22.90 -17.84 -15.65 -13.58 -10.63  2033    1

Samples were drawn using NUTS(diag_e) at Sun Oct 12 09:47:45 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Now for ridge regression. Adding one more increment_log_prob() line imposes a penalty of λ times the sum of squared coefficients.
ridge_model <- "
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  vector[N] y;
  real<lower=0> lambda;
}
parameters {
  vector[K] beta;  // coefficients
}
transformed parameters {
  real<lower=0> squared_error;
  squared_error <- dot_self(y - x * beta);
}
model {
  increment_log_prob(-squared_error);
  increment_log_prob(-lambda * dot_self(beta));
}"

model.ridge <- stan_model(model_code = ridge_model)
fit.ridge.1 <- sampling(model.ridge,
                        data = list(N = n.obs, K = n.vars + 1, y = y,
                                    x = cbind(rep(1, n.obs), x), lambda = 1),
                        pars = c("beta"), chains = 4, iter = 2000)
fit.ridge.2 <- sampling(model.ridge,
                        data = list(N = n.obs, K = n.vars + 1, y = y,
                                    x = cbind(rep(1, n.obs), x), lambda = 10),
                        pars = c("beta"), chains = 4, iter = 2000)
fit.ridge.3 <- sampling(model.ridge,
                        data = list(N = n.obs, K = n.vars + 1, y = y,
                                    x = cbind(rep(1, n.obs), x), lambda = 50),
                        pars = c("beta"), chains = 4, iter = 2000)
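The target density is now the classic ridge objective, and its mode has the familiar closed form:

```latex
\log p(\beta \mid y) = -\lVert y - X\beta \rVert^2 - \lambda \lVert \beta \rVert^2 + \mathrm{const},
\qquad
\hat{\beta}_{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y .
```

Equivalently, the extra term acts like an independent mean-zero Gaussian prior on each coefficient (with scale 1/√(2λ), up to the omitted 1/(2σ²) factor), so larger λ pulls all the coefficients toward zero.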
Results. With λ = 50, the penalty was, unsurprisingly, too strong.
> fit.ridge.1
Inference for Stan model: ridge_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta[1]    0.16    0.01 0.44  -0.72  -0.13   0.17   0.47   1.01  2656    1
beta[2]    1.83    0.00 0.26   1.33   1.65   1.83   2.01   2.33  4000    1
beta[3]    2.63    0.00 0.24   2.16   2.46   2.63   2.78   3.09  4000    1
beta[4]    0.07    0.00 0.27  -0.47  -0.11   0.07   0.25   0.61  4000    1
beta[5]   -0.04    0.00 0.23  -0.49  -0.19  -0.04   0.11   0.42  4000    1
beta[6]    0.11    0.00 0.26  -0.40  -0.07   0.10   0.28   0.61  4000    1
beta[7]   -0.05    0.00 0.23  -0.50  -0.21  -0.04   0.12   0.41  4000    1
beta[8]    0.02    0.00 0.25  -0.48  -0.15   0.01   0.18   0.52  4000    1
beta[9]    0.04    0.00 0.25  -0.45  -0.14   0.04   0.21   0.54  4000    1
beta[10]  -0.11    0.00 0.26  -0.65  -0.29  -0.11   0.06   0.39  4000    1
beta[11]  -0.15    0.00 0.23  -0.60  -0.31  -0.16   0.01   0.30  4000    1
beta[12]   0.03    0.00 0.25  -0.46  -0.14   0.03   0.19   0.50  4000    1
beta[13]   0.10    0.00 0.26  -0.40  -0.07   0.11   0.28   0.61  4000    1
beta[14]  -0.32    0.00 0.25  -0.81  -0.49  -0.32  -0.16   0.16  4000    1
beta[15]   0.11    0.00 0.25  -0.36  -0.06   0.11   0.27   0.60  4000    1
beta[16]  -0.04    0.00 0.25  -0.53  -0.20  -0.04   0.14   0.42  4000    1
beta[17]   0.14    0.00 0.28  -0.40  -0.05   0.13   0.32   0.68  3327    1
beta[18]  -0.03    0.00 0.26  -0.53  -0.21  -0.03   0.15   0.49  4000    1
beta[19]   0.15    0.00 0.26  -0.35  -0.03   0.15   0.33   0.62  4000    1
beta[20]   0.04    0.00 0.25  -0.46  -0.13   0.04   0.21   0.54  4000    1
beta[21]   0.14    0.00 0.26  -0.36  -0.03   0.13   0.31   0.64  3370    1
lp__     -27.77    0.08 3.15 -34.91 -29.66 -27.48 -25.49 -22.63  1761    1

Samples were drawn using NUTS(diag_e) at Sun Oct 12 10:02:21 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

> fit.ridge.2
Inference for Stan model: ridge_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
beta[1]    0.32    0.00 0.19  -0.05   0.19   0.32   0.45   0.71  4000    1
beta[2]    0.94    0.00 0.17   0.61   0.83   0.94   1.05   1.27  4000    1
beta[3]    1.38    0.00 0.16   1.06   1.27   1.39   1.50   1.70  4000    1
beta[4]    0.21    0.00 0.17  -0.13   0.09   0.21   0.33   0.55  4000    1
beta[5]    0.09    0.00 0.16  -0.22  -0.02   0.09   0.20   0.41  4000    1
beta[6]    0.19    0.00 0.17  -0.15   0.07   0.19   0.30   0.51  4000    1
beta[7]    0.08    0.00 0.16  -0.23  -0.03   0.08   0.20   0.40  4000    1
beta[8]    0.10    0.00 0.17  -0.23  -0.02   0.10   0.21   0.42  4000    1
beta[9]    0.15    0.00 0.16  -0.18   0.03   0.14   0.26   0.47  4000    1
beta[10]   0.07    0.00 0.17  -0.26  -0.04   0.07   0.19   0.40  4000    1
beta[11]  -0.06    0.00 0.16  -0.37  -0.17  -0.06   0.05   0.26  4000    1
beta[12]   0.09    0.00 0.16  -0.23  -0.03   0.09   0.20   0.40  4000    1
beta[13]   0.28    0.00 0.17  -0.05   0.17   0.28   0.40   0.61  4000    1
beta[14]  -0.13    0.00 0.16  -0.46  -0.24  -0.13  -0.02   0.19  4000    1
beta[15]   0.20    0.00 0.16  -0.11   0.09   0.20   0.31   0.53  4000    1
beta[16]   0.13    0.00 0.16  -0.19   0.02   0.13   0.24   0.45  4000    1
beta[17]   0.17    0.00 0.18  -0.20   0.05   0.16   0.29   0.51  4000    1
beta[18]   0.12    0.00 0.17  -0.20   0.01   0.12   0.24   0.44  4000    1
beta[19]   0.13    0.00 0.17  -0.20   0.01   0.13   0.24   0.46  4000    1
beta[20]  -0.01    0.00 0.16  -0.33  -0.13  -0.02   0.10   0.30  4000    1
beta[21]   0.12    0.00 0.17  -0.23   0.00   0.12   0.23   0.45  4000    1
lp__     -78.08    0.08 3.25 -85.18 -80.05 -77.68 -75.75 -72.74  1568    1

Samples were drawn using NUTS(diag_e) at Sun Oct 12 10:02:27 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

> fit.ridge.3
Inference for Stan model: ridge_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

            mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta[1]     0.36    0.00 0.09    0.18    0.30    0.36    0.42    0.54  4000    1
beta[2]     0.40    0.00 0.09    0.21    0.34    0.40    0.46    0.58  4000    1
beta[3]     0.57    0.00 0.09    0.40    0.51    0.57    0.63    0.75  4000    1
beta[4]     0.21    0.00 0.10    0.02    0.14    0.21    0.28    0.40  4000    1
beta[5]     0.17    0.00 0.09    0.00    0.11    0.17    0.23    0.34  4000    1
beta[6]     0.19    0.00 0.09    0.02    0.13    0.19    0.25    0.37  4000    1
beta[7]     0.17    0.00 0.09    0.00    0.11    0.17    0.23    0.35  4000    1
beta[8]     0.15    0.00 0.09   -0.02    0.09    0.16    0.22    0.33  4000    1
beta[9]     0.17    0.00 0.09    0.00    0.11    0.17    0.23    0.35  4000    1
beta[10]    0.16    0.00 0.09   -0.02    0.09    0.16    0.22    0.33  4000    1
beta[11]    0.08    0.00 0.09   -0.09    0.02    0.08    0.14    0.25  4000    1
beta[12]    0.15    0.00 0.09   -0.03    0.08    0.15    0.21    0.32  4000    1
beta[13]    0.23    0.00 0.09    0.06    0.17    0.23    0.29    0.41  4000    1
beta[14]    0.06    0.00 0.09   -0.12    0.00    0.06    0.12    0.24  4000    1
beta[15]    0.20    0.00 0.09    0.02    0.14    0.20    0.26    0.38  4000    1
beta[16]    0.18    0.00 0.09    0.01    0.12    0.18    0.24    0.36  4000    1
beta[17]    0.17    0.00 0.09   -0.02    0.11    0.17    0.23    0.35  4000    1
beta[18]    0.16    0.00 0.09   -0.01    0.10    0.16    0.23    0.34  4000    1
beta[19]    0.16    0.00 0.09   -0.02    0.10    0.16    0.22    0.33  4000    1
beta[20]    0.08    0.00 0.09   -0.10    0.02    0.08    0.14    0.26  4000    1
beta[21]    0.16    0.00 0.09   -0.02    0.10    0.16    0.22    0.34  4000    1
lp__     -144.49    0.08 3.17 -151.52 -146.40 -144.20 -142.24 -139.17  1493    1

Samples were drawn using NUTS(diag_e) at Sun Oct 12 10:03:57 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
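The shrinkage pattern can be sanity-checked without MCMC: the mode of the penalized density minimizes ||y - Xβ||² + λ||β||², which is the closed-form ridge estimate (XᵀX + λI)⁻¹Xᵀy. A minimal sketch in NumPy (not from the original post; it regenerates similar data with NumPy's RNG, so the draws differ from the R data above, but the monotone shrinkage in λ is the same):

```python
import numpy as np

# Regenerate data in the same shape as the R example: 20 uniform
# predictors, only the first two actually drive y.
rng = np.random.default_rng(1234)
n_obs, n_vars = 100, 20
x = rng.uniform(0.0, 1.0, size=(n_obs, n_vars))
y = 2 * x[:, 0] + 3 * x[:, 1] + rng.normal(0.0, 0.3, size=n_obs)
X = np.hstack([np.ones((n_obs, 1)), x])  # prepend intercept column

def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam*I) beta = X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# lam = 0 is OLS; the coefficient norm shrinks as lam grows,
# mirroring fit.ols -> fit.ridge.1 -> fit.ridge.2 -> fit.ridge.3.
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 10.0, 50.0)]
print(norms)
```

Note the intercept is penalized here too, exactly as in the Stan model above (beta[1] gets the same λ penalty as the slopes); standard ridge implementations usually leave the intercept unpenalized.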