线性回归 | 紫梦沁香

算法思想

用于解决回归问题，做法就是找一条最能拟合测试数据的直线，求该拟合直线的斜率和截距就是最值问题，可以通过梯度下降法解决。预测数据也比较简单，将数据代入直线中求出结果即可。

实例

在举例之前，先介绍几种求误差的方法。

MSE (Mean Square Error) 均方误差

$\frac{1}{m}\sum\limits_{n=1}^{m} (y - y_n)^2$
RMSE (Root Mean Square Error) 均方根误差

$\sqrt{\frac{1}{m}\sum\limits_{n=1}^m (y - y_n)^2}$
MAE (Mean Absolute Error) 平均绝对误差

$\frac{1}{m}\sum\limits_{n=1}^{m} |y - y_n|$

上述三种中，均方误差和均方根误差只是差了个根号，本质是一样的，都是求预测值与真实值的平方差。

因为是平方差，会对极度偏颇的数据特别敏感，因此会着重去拟合这部分数据。

而绝对值对所有的数据都是一样的，没有侧重点。

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-10, 10, 0.2).reshape([100,1])
y = 3.5 * x + np.random.normal(0, 20, 100).reshape([100,1])

plt.scatter(x,y)
plt.plot([-10, 10],[-55, 55], color = "red")

如上图所示，蓝色的离散点就是我们的测试数据，红色是要求的拟合曲线，方程为$y = ax + b$。

我们需要让这条直线的预测值和真实值的误差尽可能小，因此成本函数为

$cost = \frac{1}{m}\sum\limits_{n=1}^{m} (ax_n + b - y_n)^2$ （MSE）

此处$x_n$和$y_n$都是已知的数据，a和b才是我们要求的未知数。

该成本函数的偏导函数为

$\frac{dcost}{da} = \frac{2}{m}\sum\limits_{n=1}^{m} x_n(ax_n + b - y_n)$

$\frac{dcost}{db} = \frac{2}{m}\sum\limits_{n=1}^{m} (ax_n + b - y_n)$

def J(a, b, x, y):
    return 1/len(x) * np.sum((a * x + b - y) ** 2)

def DJ_a(a, b, x, y):
    return 2 / len(x) * np.sum(x * (a * x + b - y))

def DJ_b(a, b, x, y):
    return 2 / len(x) * np.sum(a * x + b - y)

def gradient_descent(a, b, eta, epsilon = 1e-6):
    i_iter = 0
    while True:
        last_a = a
        last_b = b
        a = a - eta * DJ_a(a, b, x, y)
        b = b - eta * DJ_b(a, b, x, y)
        if np.abs(J(last_a, last_b, x, y) - J(a, b, x, y)) < epsilon:
            break
    return a, b

a = 0
b = 0
eta = 0.02
epsilon = 1e-6

a, b = gradient_descent(a, b, eta, epsilon)

plt.plot([-10, 10],[-10 * a + b, 10 * a + b], color = "red")
plt.scatter(x,y)
plt.show()

效果也是挺不错的，不过这是一元的情况，可以写两个DJ_a和DJ_b，但如果是百元呢？

下面将会实现一种通用的梯度下降法。

先更改一下我们的成本函数，其中θ是系数矩阵，对于一元函数，它的shape就是(2, )，x需要再添一列（数值为1），这样x * a + 1 * b就正好是我们的直线方程预测值。

$cost = \frac{1}{m}\sum (x.dot(θ) - y)^2$ （MSE）

导函数

$\frac{dcost}{dθ} = \frac{2}{m} x.t.dot((x.dot(θ) - y))$

推导过程

def J(theta, x, y):
    return 1/len(x) * np.sum((x.dot(theta) - y) ** 2)

def DJ(theta, x, y):
    return 2 / len(x) * (x.T.dot(x.dot(theta) - y))

def gradient_descent(theta, x, y, eta, n_iters = 2e5, epsilon = 1e-6):
    i_iter = 0
    
    while i_iter < n_iters:
        last_theta = theta
        theta = theta - eta * DJ(theta, x, y)
        if np.abs(J(last_theta, x, y) - J(theta, x, y)) < epsilon:
            break
        i_iter += 1
        
    return theta

然后我们来拟合刚刚随机的数据

x = np.hstack([x, np.ones([100, 1])])

theta = np.zeros([2])
eta = 0.02
y = y.reshape([100])
theta = gradient_descent(theta, x, y, eta)

plt.plot([-10, 10],[-10 * theta[0] + theta[1], 10 * theta[0] + theta[1]], color = "red")
plt.scatter(x[:, 0],y.reshape([100, 1]))
plt.show()

theta 		# array([3.4686148, -1.06100372])

显然，这条直线还是能较好拟合这个测试数据的。

theta[0] = 3.4686148 ≈ 3.5，也和我们预设的基本一致。

代码实现

class LinearRegression:
    
    def __init__(self):
        self.X = None
        self.y = None
        
    def fit(self, X, y):
        self.X = X
        self.y = y
        self.len = len(X)
        self.X_b = np.hstack([self.X, np.ones([self.len, 1])])
        return self
    
    def J(self, theta):
        return np.sum((self.y - self.X_b.dot(theta)) ** 2) / self.len
    
    def DJ(self, theta):
        return self.X_b.T.dot(self.X_b.dot(theta) - y) * 2 / self.len
    
    def transform(self, eta = 0.01, n_iters = 1e5, epsilon = 1e-8):
        assert self.X_b is not None and self.y is not None, \
        "must fit before transform"
        
        theta = np.zeros(self.X.shape[1] + 1)
        i_iter = 0
        while i_iter < n_iters:
            last_theta = theta
            theta = theta - eta * self.DJ(theta)
            if np.abs(self.J(theta) - self.J(last_theta)) < epsilon:
                break
            i_iter += 1
        self.theta = theta
        self.intercept = self.theta[-1]
        self.coefficient = self.theta[:-1]
        return theta
    
    def predict(self, X):
        return X.dot(self.coefficient) + self.intercept
    
    def score(self, X, y):
        y_pred = X.dot(self.coefficient) + self.intercept
        return np.sum((y_pred - y) ** 2) / len(X)

需要注意的是，第28行中theta = theta - eta * self.DJ(theta)，不能写成theta -= eta * self.DJ(theta)，这是因为现在last_theta和theta都指向同一块内存单元，-=是原地操作，在更改了theta的同时，last_theta也被更改了，而=则是重新开辟一块新的内存空间，然后再让theta指向它。

测试

import sklearn
from sklearn import datasets

boston=datasets.load_boston()
X=boston.data
y=boston.target

lr = LinearRegression()
lr.fit(X, y)
lr.transform(eta = 1e-6, n_iters = 1e6)

"""lr.coefficient
array([-0.09218218,  0.06825598, -0.00529991,  0.46137796,  0.17983213,
        4.30941577,  0.02543425, -0.70440603,  0.17038895, -0.00977886,
       -0.09173025,  0.01804173, -0.55937148])
       
   lr.intercept 	
   0.4361364266724813
"""

因为数据是13维的，没有办法可视化，我们只能使用MSE指标来衡量，其中lr.score(X, y) = 26.11494。

为了能看得清楚点，我们只使用其中一维来进行预测。

1
2
3

X = X[:, 5]
plt.scatter(X, y)
plt.show()

这里上面有一排数值都为50，这是因为在做调查的时候，最高一个档就是50（大于50W都只能填50W），因此这部分数据我们当做无效数据，直接丢弃掉。

X = X[y < 50]
y = y[y < 50]

plt.scatter(X, y)
plt.show()

做一下筛选，只选y<50的数据就行了。

然后开始整体，进行预测。

lr = LinearRegression()
X = X.reshape([-1, 1])
lr.fit(X, y)
theta = lr.transform(eta = 1e-6, n_iters = 1e6)

plt.plot(X, lr.predict(X), color="red")
plt.scatter(X, y)
plt.show()

lr.score(X, y) = 42.28978618974813