Some Basic Concepts from Information Theory

Information Entropy

H(X)=E[I(X)]=E[-\ln(P(X))]

where P is the probability mass function of X, E is the expectation operator, and I(X) is the information content (also called self-information) of X.

H(X)=\sum_i P(x_i)I(x_i)=-\sum_i P(x_i)\log_b P(x_i)

The base b of the logarithm determines the unit:

\begin{matrix} b & \text{unit}\cr 2 & \text{bit}\cr e & \text{nat}\cr 10 & \text{Hart} \end{matrix}
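A minimal Python sketch of this formula, assuming the distribution is given as a list of probabilities; the function name `entropy` and the coin examples are illustrative, not from the original text:

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum_i p_i * log_b(p_i); terms with p_i = 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # 1.0 bit: a fair coin
print(entropy([0.5, 0.5], math.e))   # ~0.693 nat: same distribution, natural log
print(entropy([0.9, 0.1]))           # ~0.469 bit: a biased coin is less uncertain
```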

Conditional Entropy

When the feature x is fixed at a single value x_i: H(c|x_i)

When the overall distribution of the feature X is known, averaging over its values: H(c|X)=\sum_i P(x_i)H(c|x_i)
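A minimal sketch of this quantity, assuming the probabilities are estimated by relative frequency from (feature value, class) pairs; the helper names and the toy data are illustrative:

```python
import math
from collections import Counter, defaultdict

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(pairs):
    """H(c|X) = sum_i P(x_i) * H(c|x_i), estimated from (feature value, class) pairs."""
    by_x = defaultdict(list)
    for x, c in pairs:
        by_x[x].append(c)
    n = len(pairs)
    total = 0.0
    for classes in by_x.values():
        p_x = len(classes) / n   # P(x_i)
        h_c_given_xi = entropy([cnt / len(classes) for cnt in Counter(classes).values()])  # H(c|x_i)
        total += p_x * h_c_given_xi
    return total

toy = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "no"), ("overcast", "yes")]
print(conditional_entropy(toy))   # class uncertainty that remains after observing the feature
```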

Information Gain

IG(X) = H(c)-H(c|X)
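A self-contained sketch of the same computation, again estimating probabilities from (feature value, class) pairs; the toy data are the same illustrative set as above:

```python
import math
from collections import Counter, defaultdict

def information_gain(pairs):
    """IG(X) = H(c) - H(c|X), estimated from (feature value, class) pairs."""
    def h(labels):
        n = len(labels)
        return -sum((cnt / n) * math.log2(cnt / n) for cnt in Counter(labels).values())
    by_x = defaultdict(list)
    for x, c in pairs:
        by_x[x].append(c)
    n = len(pairs)
    h_c = h([c for _, c in pairs])                                   # H(c)
    h_c_given_x = sum(len(cs) / n * h(cs) for cs in by_x.values())   # H(c|X)
    return h_c - h_c_given_x

toy = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "no"), ("overcast", "yes")]
print(information_gain(toy))
```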

Gini Coefficient (Gini Impurity)

Gini(D)=1-\sum_{i=1}^n p_i^2

Gini(D|A)=\sum_{i=1}^n\frac{|D_i|}{|D|}Gini(D_i)

where the D_i are the subsets of D produced by splitting on the values of attribute A.
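A minimal sketch of both formulas under the same (feature value, class) representation as above; names and data are illustrative:

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 over the class proportions in D."""
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def gini_given_attribute(pairs):
    """Gini(D|A) = sum_i |D_i|/|D| * Gini(D_i), the D_i being the subsets induced by attribute A."""
    by_a = defaultdict(list)
    for a, c in pairs:
        by_a[a].append(c)
    n = len(pairs)
    return sum(len(subset) / n * gini(subset) for subset in by_a.values())

toy = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "no"), ("overcast", "yes")]
print(gini([c for _, c in toy]))   # impurity of the whole set
print(gini_given_attribute(toy))   # impurity remaining after splitting on the attribute
```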

Information Gain Ratio and Split Information

GR(D|A)=\frac{IG(D|A)}{SI(D|A)}

SI(D|A)=-\sum_{i=1}^n\frac{N_i}{N}\log_2\frac{N_i}{N}
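A minimal sketch; the attribute values and the placeholder IG value are illustrative assumptions:

```python
import math
from collections import Counter

def split_information(values):
    """SI(D|A) = -sum_i N_i/N * log2(N_i/N): entropy of the partition induced by attribute A."""
    n = len(values)
    return -sum((cnt / n) * math.log2(cnt / n) for cnt in Counter(values).values())

def gain_ratio(information_gain, values):
    """GR(D|A) = IG(D|A) / SI(D|A); penalises attributes that split D into many tiny subsets."""
    si = split_information(values)
    return information_gain / si if si > 0 else 0.0

values = ["sunny", "sunny", "rain", "rain", "overcast"]
print(gain_ratio(0.42, values))   # 0.42 is a placeholder information gain, not a computed one
```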

Boundary Entropy

BE(w_1w_2\cdots w_k) = -\sum_{w \in C}p(w\vert w_1w_2\cdots w_k)\log p(w\vert w_1w_2\cdots w_k)

where w is a character adjacent to w_1w_2\cdots w_k.
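A minimal sketch, assuming the conditional probabilities p(w | w_1w_2...w_k) are estimated by counting the characters that appear next to the string in a corpus; the corpus string and the function name are illustrative:

```python
import math
from collections import Counter

def boundary_entropy(corpus, ngram, side="right"):
    """Entropy of the characters that follow (or precede) `ngram` in `corpus`."""
    k = len(ngram)
    neighbours = []
    for i in range(len(corpus) - k + 1):
        if corpus[i:i + k] == ngram:
            if side == "right" and i + k < len(corpus):
                neighbours.append(corpus[i + k])
            elif side == "left" and i > 0:
                neighbours.append(corpus[i - 1])
    n = len(neighbours)
    if n == 0:
        return 0.0
    # -sum_w p(w|ngram) * log p(w|ngram), with p estimated by relative frequency
    return -sum((cnt / n) * math.log(cnt / n) for cnt in Counter(neighbours).values())

text = "今天天气很好今天天气不好"
print(boundary_entropy(text, "天气", side="right"))   # two equally likely right neighbours -> ln 2
```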

Accessor Variety (AV)

AV(w_1w_2\cdots w_k)=\log RL_{av}(w_1w_2\cdots w_k)

where RL_{av} denotes the number of distinct characters adjacent to the string w_1w_2\cdots w_k.
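A minimal sketch under the same assumptions as the boundary-entropy example (right-side neighbours counted in a small illustrative corpus):

```python
import math

def accessor_variety(corpus, ngram, side="right"):
    """log of the number of distinct characters adjacent to `ngram` in `corpus`."""
    k = len(ngram)
    neighbours = set()
    for i in range(len(corpus) - k + 1):
        if corpus[i:i + k] == ngram:
            if side == "right" and i + k < len(corpus):
                neighbours.add(corpus[i + k])
            elif side == "left" and i > 0:
                neighbours.add(corpus[i - 1])
    return math.log(len(neighbours)) if neighbours else 0.0

text = "今天天气很好今天天气不好"
print(accessor_variety(text, "天气"))   # log(2): '很' and '不' both follow "天气"
```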