# The Information Gain Algorithm

## The Algorithm

Input: training data set $$D$$ and feature $$A$$\
Output: the information gain $$g(D, A)$$ of feature $$A$$ with respect to data set $$D$$

Notation:\
$$K$$: the number of classes the labels can take\
$$C_k$$: the number of samples whose label is $$k$$\
$$m$$: the total number of samples\
$$a_1, \dots, a_n$$: the $$n$$ distinct values that feature $$A$$ can take\
$$D_i$$: the number of samples whose feature $$A$$ equals $$a_i$$\
$$D_{ik}$$: the number of samples whose feature $$A$$ equals $$a_i$$ and whose label is $$k$$

**Step 1: compute the empirical entropy H(D) of data set D**

$$
\begin{aligned}
H(D) &= -\sum_{k=1}^K P_k \log_2 P_k \\
P_k &= \frac{C_k}{m}
\end{aligned}
$$
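As a quick numeric sketch of the formula above (the label vector here is hypothetical: 9 samples of one class and 6 of the other, so $$m = 15$$), the empirical entropy comes out to about 0.971 bits:

```python
import numpy as np

# Hypothetical labels: 9 samples of class 1, 6 samples of class 0 (m = 15)
y = np.array([1] * 9 + [0] * 6)

entropy = 0.0
for ck in set(y):
    Pk = np.sum(y == ck) / y.shape[0]  # P_k = C_k / m
    entropy -= Pk * np.log2(Pk)

print(round(entropy, 3))  # 0.971
```

A near-even class split like this sits close to the maximum of 1 bit; a pure data set would give an entropy of 0.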

**Step 2: compute the empirical conditional entropy H(D|A) of data set D given feature A**

$$
\begin{aligned}
H(D|A) &= \sum_{i=1}^n p_i H(D_i) \\
&= -\sum_{i=1}^n p_i \sum_{k=1}^K p_{ik} \log_2 p_{ik} \\
p_i &= \frac{D_i}{m} \\
p_{ik} &= \frac{D_{ik}}{D_i}
\end{aligned}
$$

That is, $$H(D|A)$$ is the sum, over the subsets that feature $$A$$ partitions the data into, of each subset's entropy weighted by the subset's proportion of the whole data set.

**Step 3: compute the information gain**

$$
g(D, A) = H(D) - H(D|A)
$$
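The three steps can be combined in a short worked example. The data set below is hypothetical (15 samples, a three-valued feature that splits them into subsets with class counts 2:3, 3:2, and 4:1); the resulting gain is about 0.083 bits:

```python
import numpy as np

def entropy(y):
    # H(D): empirical entropy of a label vector
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical data: feature A takes values 0/1/2, labels are 0/1
X = np.array([0] * 5 + [1] * 5 + [2] * 5)   # one feature column
y = np.array([1, 1, 0, 0, 0,                # subset D_1: classes 2 vs 3
              1, 1, 1, 0, 0,                # subset D_2: classes 3 vs 2
              1, 1, 1, 1, 0])               # subset D_3: classes 4 vs 1

HD = entropy(y)                             # H(D)
HDA = sum(np.sum(X == v) / len(y) * entropy(y[X == v])
          for v in np.unique(X))            # H(D|A)
g = HD - HDA                                # information gain g(D, A)
print(round(g, 3))  # 0.083
```

A small gain like this says the feature barely sharpens our knowledge of the label; when choosing a split in a decision tree, the feature with the largest gain is preferred.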

## Code

```python
import numpy as np

def H(y):
    """Empirical entropy H(D) of the label vector y."""
    entropy = 0.0
    # Distinct values the labels can take
    for ck in set(y):
        Pk = y[y == ck].shape[0] / y.shape[0]
        if Pk != 0:
            entropy -= Pk * np.log2(Pk)
    return entropy

def info_gain(X, y, feature):
    """Information gain g(D, A) of the given feature column of X."""
    # Distinct values this feature can take
    a = set(X[:, feature])
    # Empirical entropy H(D) of the data set
    HD = H(y)
    # Empirical conditional entropy H(D|A)
    HDA = 0.0
    for value in a:
        yDi = y[X[:, feature] == value]
        HDA += yDi.shape[0] / y.shape[0] * H(yDi)
    return HD - HDA
```

