The Linear Regression Finale (Ridge Regression, Lasso Regression: Principles and Formula Derivations), Everything You Want Is Here (Part 2)


\[X : n\times (m+1) \\ \hat{w} : (m+1)\times 1 \\ Y : n\times 1 \\ f(X) : n\times 1\]With these shapes, it is easy to see that the error over the entire dataset, \(\mathcal{L}(w, b)\), is:
\[\mathcal{L}(w, b) = \mathcal{L}(\hat{w}) = \sum_{i=1}^{n}(\hat{y}_i - y_i)^2 = \sum_{i=1}^{n}(x_i\hat{w}-y_i)^2 = ||X\hat{w}-Y||_2^2 = (X\hat{w}-Y)^T(X\hat{w}-Y) \tag{16}\]\[x_i = (x_{i1}, x_{i2}, ..., x_{im}, 1)\\\hat{w} = (a_1, a_2, ..., a_m, b)^T\]Now let us analyze formula \((16)\) carefully. First, for a \(1\times n\) or \(n \times 1\) vector, the 2-norm is:
\[||x||_2 = \sqrt{\sum_{i=1}^{n}x_i^2}\]and the squared 2-norm is:
\[||x||_2^2 = \sum_{i=1}^{n}x_i^2\]This is why \(\sum_{i=1}^{n}(x_i\hat{w}-y_i)^2 = ||X\hat{w}-Y||_2^2\). In formula \((16)\), \(X\hat{w}-Y\) is an \(n\times 1\) vector:
\[X\hat{w}-Y = \left[\begin{matrix}\hat{y}_1 - y_1 \\ \hat{y}_2 - y_2 \\ \vdots \\ \hat{y}_{n-1} - y_{n-1} \\ \hat{y}_{n} - y_{n} \end{matrix}\right]\]So by matrix multiplication:
\[(X\hat{w} - Y)^T(X\hat{w} - Y)=\left[\begin{matrix} \hat{y}_1 - y_1, & ..., & \hat{y}_{n} - y_{n} \end{matrix}\right]\cdot \left[\begin{matrix}\hat{y}_1 - y_1 \\ \hat{y}_2 - y_2 \\ \vdots \\ \hat{y}_{n-1} - y_{n-1} \\ \hat{y}_{n} - y_{n} \end{matrix}\right] = \sum_{i=1}^{n}(\hat{y}_i - y_i)^2 \tag{17}\]From this analysis we finally obtain the model's error:
\[\mathcal{L}(w, b) = \mathcal{L}(\hat{w}) = ||X\hat{w} - Y||_2^2 = (X\hat{w} - Y)^T(X\hat{w} - Y) \tag{18}\]Now we need to minimize this error, which is an optimization problem. \(\mathcal{L}(\hat{w})\) is a convex function of \(\hat{w}\), so the \(\hat{w}\) at which its derivative with respect to \(\hat{w}\) equals 0 is the optimal solution. We will not prove convexity here; if time permits, a dedicated article will cover it. The next step is to differentiate with respect to \(\hat{w}\).
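Before differentiating, here is a minimal numerical sanity check of formulas \((16)\) and \((18)\). The data, shapes, and random seed below are arbitrary choices of mine, purely for illustration; the point is that the sum of squared residuals, the squared 2-norm, and the quadratic form all coincide:

```python
import numpy as np

# Sanity check: the three expressions for the loss in (16) and (18) agree.
# All data here is randomly generated; the shapes n=5, m=3 are arbitrary.
rng = np.random.default_rng(0)
n, m = 5, 3
X = np.hstack([rng.normal(size=(n, m)), np.ones((n, 1))])  # n x (m+1); the 1s column absorbs b
w_hat = rng.normal(size=(m + 1, 1))                        # (m+1) x 1
Y = rng.normal(size=(n, 1))                                # n x 1

r = X @ w_hat - Y                                          # residual vector, n x 1
loss_sum = np.sum(r ** 2)                                  # sum of squared residuals, formula (16)
loss_norm = np.linalg.norm(r) ** 2                         # ||Xw - Y||_2^2
loss_quad = (r.T @ r).item()                               # (Xw - Y)^T (Xw - Y), formula (18)
assert np.allclose([loss_sum, loss_norm], loss_quad)       # all three agree
```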
\[\mathcal{L}(\hat{w}) = ||X\hat{w} - Y||_2^2 = (X\hat{w} - Y)^T(X\hat{w} - Y)\]\[= ((X\hat{w})^T - Y^T)(X\hat{w} - Y) = (\hat{w}^TX^T - Y^T)(X\hat{w} - Y)\]\[=\hat{w}^TX^TX\hat{w} - \hat{w}^TX^TY - Y^TX\hat{w} + Y^TY\]To differentiate this expression, let us first derive the matrix calculus rules we need. Buckle up:
Derivative rule 1:
For any vector \(A: 1 \times n\) and \(X: n \times 1\), if \(Y = A\cdot X\), then \(\frac{\partial Y}{\partial X} = A^T\), where \(Y\) is a scalar.
Proof:
Without loss of generality, let:
\[A = (a_1, a_2, a_3, ..., a_n)\]\[X = (x_1, x_2, x_3, ..., x_n)^T\]\[\therefore Y = (a_1, a_2, a_3, ..., a_n)\cdot \left[\begin{matrix} x_1\\x_2\\x_3\\ \vdots \\x_n\end{matrix}\right] = \sum_{i = 1}^{n}a_ix_i\]When differentiating with respect to \(x_i\), every other \(x_j\) with \(j \neq i\) can be treated as a constant, so:
\[\frac{\partial Y}{\partial x_i} = 0 + ... + 0 + a_i + 0 + ... + 0\]\[\therefore \frac{\partial Y}{\partial X} = \left[\begin{matrix}\frac{\partial Y}{\partial x_1}\\ \frac{\partial Y}{\partial x_2}\\ \vdots \\ \frac{\partial Y}{\partial x_{n-1}}\\\frac{\partial Y}{\partial x_n}\end{matrix}\right] = \left[\begin{matrix}a_1\\ a_2\\ \vdots \\a_{n-1}\\ a_n\end{matrix}\right] = (a_1, a_2, a_3, ..., a_n)^T = A^T\]From the analysis above:
\[\frac{\partial Y}{\partial X} = A^T\]Derivative rule 2:
If \(Y = X^TA\), where \(X: n\times 1\) and \(A: n\times 1\), then \(\frac{\partial Y}{\partial X} = A\). (This follows directly from rule 1, since the scalar \(X^TA\) equals \(A^TX\).)
Derivative rule 3:
If \(Y = X^TAX\), where \(X: n\times 1\) and \(A: n\times n\), then \(\frac{\partial Y}{\partial X} = (A^T + A)X\).
These rules can be proved in the same way as rule 1, so the details are omitted here; a numerical check of all three rules is sketched below.
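The following sketch verifies all three rules against a central finite-difference gradient. The `num_grad` helper and the random matrices are my own illustrative constructions, not part of the original derivation:

```python
import numpy as np

# Verify the three derivative rules numerically; all matrices are random.
rng = np.random.default_rng(1)
n = 4
x = rng.normal(size=(n, 1))
A_row = rng.normal(size=(1, n))   # rule 1: Y = A x,      dY/dx = A^T
a_col = rng.normal(size=(n, 1))   # rule 2: Y = x^T a,    dY/dx = a
M = rng.normal(size=(n, n))       # rule 3: Y = x^T M x,  dY/dx = (M^T + M) x

def num_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

assert np.allclose(num_grad(lambda v: (A_row @ v).item(), x), A_row.T)
assert np.allclose(num_grad(lambda v: (v.T @ a_col).item(), x), a_col)
assert np.allclose(num_grad(lambda v: (v.T @ M @ v).item(), x), (M.T + M) @ x)
```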
\[\mathcal{L}(\hat{w}) = \hat{w}^TX^TX\hat{w} - \hat{w}^TX^TY - Y^TX\hat{w} + Y^TY \tag{19}\]From formula \((19)\) and the derivative rules above:
\[\frac{\partial \mathcal{L}(\hat{w})}{\partial \hat{w}} = ((X^TX)^T + X^TX)\hat{w} - X^TY - (Y^TX)^T\\= 2X^TX\hat{w} - 2X^TY = 2X^T(X\hat{w} - Y) = 0\]\[X^TX\hat{w} = X^TY \tag{20}\]\[\therefore \hat{w}^* = (X^TX)^{-1}X^TY \tag{21}\]That is, \(\hat{w}^* = (X^TX)^{-1}X^TY\) is the parameter we are after.
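As a quick illustration of formula \((21)\) in NumPy (the synthetic data, shapes, and "true" parameters below are assumptions for the demo): in practice one solves the linear system \((20)\) with `np.linalg.solve` rather than explicitly forming \((X^TX)^{-1}\), which is cheaper and numerically safer.

```python
import numpy as np

# Solve the normal equations (20) on synthetic data and compare the result
# against NumPy's built-in least-squares solver.
rng = np.random.default_rng(2)
n, m = 100, 3
X = np.hstack([rng.normal(size=(n, m)), np.ones((n, 1))])  # append the all-ones column for b
true_w = np.array([[2.0], [-1.0], [0.5], [3.0]])           # arbitrary ground-truth parameters
Y = X @ true_w + 0.01 * rng.normal(size=(n, 1))            # targets with a little noise

w_star = np.linalg.solve(X.T @ X, X.T @ Y)                 # X^T X w = X^T Y, formula (20)
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)            # reference solution
assert np.allclose(w_star, w_lstsq)
```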
Ridge Regression. A note before we start: for an \(n\times n\) matrix \(A\), if we want to compute its inverse, the determinant of \(A\) must be nonzero, i.e., \(A\) must be a full-rank matrix with \(r(A) = n\).
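A small sketch of why this matters for formula \((21)\), with made-up data: if \(X\) has linearly dependent columns, \(X^TX\) is singular and has no inverse, so \(\hat{w}^*\) is not uniquely determined. Adding a multiple of the identity, which is exactly the move ridge regression makes, restores full rank:

```python
import numpy as np

# A duplicated column makes X rank deficient, so X^T X is singular.
rng = np.random.default_rng(3)
n = 10
col = rng.normal(size=(n, 1))
X = np.hstack([col, col, np.ones((n, 1))])   # duplicated column: rank(X) = 2 < 3

G = X.T @ X
print(np.linalg.matrix_rank(G))              # 2: G is singular, det(G) = 0

lam = 0.1                                    # illustrative regularization strength
G_ridge = G + lam * np.eye(3)
print(np.linalg.matrix_rank(G_ridge))        # 3: full rank, now invertible
```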
