# CART树的生成

## CART树的生成算法

输入： 训练数据集X，样本标签y\
输出：回归树f(x)

### 步骤

1. 若D中所有实例属于同一类$C\_k$，则T为单结点树，并将类$C\_k$作为该结点的类标记，返回T &#x20;
2. 对每个特征feature的每个取值value，将y分为$R\_1$和$R\_2$两个集合，因为现在还不是真正的split，只是要计算split后的基尼指数，只需要用到split之后的y   &#x20;

   $$
   \begin{aligned}
   y\_1(feature, value) = {y\_i | x\_i^{(feature)} \le value}  \\
   y\_2(feature, value) = {y\_i | y\_i^{(feature)} \gt value}
   \end{aligned}
   $$
3. 计算$y\_1$和$y\_2$的基尼指数之和 &#x20;

$$
Gini(p) = \sum^K p\_k(1-p\_k) = 1 - \sum^Kp\_k^2
$$

1. 选择基尼指数计算结果最小的(feature, value)作为当前的最优划分 &#x20;
2. 基于最优划分生成2个子结点，将数据分配到两个子结点中 &#x20;
3. 对子结点递归调用CART算法 &#x20;

## 代码

```python
def gini(y):
    ySet = set(y)
    ret, n = 1, y.shape[0]
    for yi in ySet:
        ret -= (y[y==yi].shape[0]/n)**2
    return ret

def CART(X, y):
    # 若D中所有实例属于同一类$$C_k$$
    if len(set(y))==1:
        # 将类$$C_k$$作为该结点的类标记
        return y[0]
    bestGini = np.inf
    # 对每个特征feature的每个取值value
    for feature in range(X.shape[1]):
        for value in set(X[:,feature]):
            # 将X分为$$R_1$$和$$R_2$$两个集合
            y1 = y[X[:,feature]<= value]
            y2 = y[X[:,feature]> value]
            # 计算$$R_1$$和$$R_2$$的基尼指数之和
            sumGini = gini(y1) + gini(y2)
            # 选择基尼指数计算结果最小的(feature, value)作为当前的最优划分
            if sumGini < bestGini:
                bestFeature, bestValue, bestGini = feature, value, sumGini
    # 基于最优划分生成2个子结点，将数据分配到两个子结点中
    node = {'feature':bestFeature,
            'value':bestValue,
            'left':CART(X[X[:,bestFeature]<= bestValue], y[X[:,bestFeature]<= bestValue]),
           'right':CART(X[X[:,bestFeature]> bestValue], y[X[:,bestFeature]> bestValue])}
    return node
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://windmising.gitbook.io/lihang-tongjixuexifangfa/cart/6.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
