
Random Forest


The basic procedure of a random forest (given an m x n data matrix, where m is the number of samples and n the feature dimension) is: 1. Training: for each tree, randomly select a small number of features r << n (apparently r ≈ sqrt(n) is the common choice) and build a decision tree on them. 2. Prediction: classify the input with every decision tree, then take a vote; the class receiving the most votes is the predicted class.
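
As a quick illustration of this train-and-vote procedure, here is a minimal sketch using scikit-learn's RandomForestClassifier; the toy dataset and the parameter values are placeholders of mine, not part of the original text:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: m = 100 samples, n = 16 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# max_features='sqrt' draws r = sqrt(n) candidate features at each split,
# matching the r << n rule of thumb above.
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                random_state=0)
forest.fit(X, y)

# predict() aggregates the votes of all trees; the majority class wins.
print(forest.predict(X[:5]))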

The decision tree is constructed as follows, where the most effective splitting feature is found by maximizing the information gain:

How to grow a Decision Tree

source : [3]

LearnUnprunedTree(X, Y)

Input: X, a matrix of R rows and M columns, where X_ij = the value of the j'th attribute in the i'th input datapoint. Each column consists of either all real values or all categorical values.
Input: Y, a vector of R elements, where Y_i = the output class of the i'th datapoint. The Y_i values are categorical.
Output: an unpruned decision tree

If all records in X have identical values in all their attributes (this includes the case where R < 2), return a Leaf Node predicting the majority output, breaking ties randomly.
If all values in Y are the same, return a Leaf Node predicting this value as the output
Else
    select m variables at random out of the M variables
    For j = 1 .. m
        If the j'th attribute is categorical
            IG_j = IG(Y|X_j) (see Information Gain)
        Else (the j'th attribute is real-valued)
            IG_j = IG*(Y|X_j) (see Information Gain)
    Let j* = argmax_j IG_j (this is the splitting attribute we'll use)
    If the j*'th attribute is categorical then
        For each value v of the j*'th attribute
            Let X_v = the subset of rows of X in which X_ij* = v. Let Y_v = the corresponding subset of Y
            Let Child_v = LearnUnprunedTree(X_v, Y_v)
        Return a decision tree node, splitting on the j*'th attribute. The number of children equals the number of values of the j*'th attribute, and the v'th child is Child_v
    Else the j*'th attribute is real-valued; let t be the best split threshold
        Let X_LO = the subset of rows of X in which X_ij* <= t. Let Y_LO = the corresponding subset of Y
        Let Child_LO = LearnUnprunedTree(X_LO, Y_LO)
        Let X_HI = the subset of rows of X in which X_ij* > t. Let Y_HI = the corresponding subset of Y
        Let Child_HI = LearnUnprunedTree(X_HI, Y_HI)
        Return a decision tree node, splitting on the j*'th attribute. It has two children corresponding to whether the j*'th attribute is below or above the threshold t.

Note: There are alternatives to Information Gain for splitting nodes

Note that maximizing the information gain works differently for categorical and real-valued attributes; only the real-valued case is described here: IG*(Y|X_j) = max_t IG(Y|X_j : t), which simultaneously determines the best split threshold t.
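
Below is a minimal Python sketch of the pseudocode above, restricted to real-valued attributes; the helper names (entropy, best_split, learn_unpruned_tree) are mine, and best_split performs exactly the max_t search just described:

import numpy as np

def entropy(y):
    # H(Y) = -sum_i p_i log2(p_i) over the empirical class frequencies of y.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(x, y):
    # IG*(Y|X) = max_t IG(Y|X:t): scan candidate thresholds between
    # the sorted distinct values of x.
    base, best_ig, best_t = entropy(y), -1.0, None
    for t in np.unique(x)[:-1]:
        lo = x <= t
        h_cond = lo.mean() * entropy(y[lo]) + (~lo).mean() * entropy(y[~lo])
        ig = base - h_cond
        if ig > best_ig:
            best_ig, best_t = ig, t
    return best_ig, best_t

def learn_unpruned_tree(X, y, m):
    # Leaf cases: all labels equal, or all rows identical in every attribute.
    if len(np.unique(y)) == 1 or np.all(X == X[0]):
        vals, counts = np.unique(y, return_counts=True)
        return ('leaf', vals[np.argmax(counts)])
    # Select m attributes at random out of the M columns (m <= M assumed).
    candidates = np.random.choice(X.shape[1], size=m, replace=False)
    scored = [(best_split(X[:, j], y), j) for j in candidates]
    (ig, t), j = max(scored, key=lambda s: s[0][0])
    if t is None:  # none of the sampled attributes had a usable split
        vals, counts = np.unique(y, return_counts=True)
        return ('leaf', vals[np.argmax(counts)])
    lo = X[:, j] <= t
    return ('node', j, t,
            learn_unpruned_tree(X[lo], y[lo], m),
            learn_unpruned_tree(X[~lo], y[~lo], m))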


Entropy: H = -sum_i p_i log2(p_i) (entropy is the H function used below). High entropy means the variable has a "boring", near-uniform distribution: the probabilities of its different values are all roughly equal. Low entropy means a varied (peaks-and-valleys) distribution: the variable takes one or two values with especially high probability. Our goal is to find the attribute with low conditional entropy, because information gain = H(Y) - H(Y|X); with H(Y) fixed, the lower H(Y|X) is, the more Y is concentrated on one or two values within each branch, so the more samples are classified cleanly, and the earlier that attribute should be chosen as a splitting node.
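
A quick numeric check of this intuition, using the entropy formula above:

import numpy as np

def entropy(p):
    # H = -sum_i p_i log2(p_i); zero-probability entries contribute nothing.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # 1.0 bit: near-uniform ("boring"), high entropy
print(entropy([0.9, 0.1]))  # ~0.47 bits: peaked, low entropy
print(entropy([1.0]))       # 0.0 bits: perfectly predictable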


Information gain

source : [3]

  1. nominal attributes

suppose X can have one of m values V_1, V_2, ..., V_m
P(X=V_1) = p_1, P(X=V_2) = p_2, ..., P(X=V_m) = p_m

H(X) = -sum_{j=1..m} p_j log2(p_j) (the entropy of X)
H(Y|X=v) = the entropy of Y among only those records in which X has value v
H(Y|X) = sum_j p_j H(Y|X=v_j)
IG(Y|X) = H(Y) - H(Y|X)

  2. real-valued attributes

suppose X is real-valued
define IG(Y|X:t) as H(Y) - H(Y|X:t)
define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
define IG*(Y|X) = max_t IG(Y|X:t)
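
These nominal-attribute definitions translate almost line for line into code; a minimal sketch (the function names are mine):

import numpy as np

def entropy(y):
    # H(Y) = -sum_j p_j log2(p_j) over the empirical class frequencies.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x):
    # IG(Y|X) = H(Y) - sum_j P(X=v_j) H(Y|X=v_j) for a nominal attribute x.
    h_cond = 0.0
    for v in np.unique(x):
        mask = (x == v)
        h_cond += mask.mean() * entropy(y[mask])
    return entropy(y) - h_cond

# Example: x predicts y perfectly, so IG(Y|X) = H(Y) = 1 bit.
y = np.array(['a', 'a', 'b', 'b'])
x = np.array([0, 0, 1, 1])
print(information_gain(y, x))  # 1.0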

How to grow a Random Forest

source : [1]

Each tree is grown as follows:

  1. If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree (see the sketch after this list).
  2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
  3. Each tree is grown to the largest extent possible. There is no pruning.
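
Steps 1 and 2 map directly onto a few lines of NumPy. In this sketch, learn_tree stands for any per-tree learner (for example the learn_unpruned_tree sketch earlier); that choice is an assumption of mine, not fixed by the source:

import numpy as np

def grow_forest(X, y, n_trees, m, learn_tree):
    # learn_tree(X, y, m) builds one tree on a bootstrap sample,
    # re-sampling m candidate attributes at every node.
    N = X.shape[0]
    rng = np.random.default_rng(0)
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample of N cases, drawn with replacement.
        idx = rng.integers(0, N, size=N)
        # Step 2 happens inside the tree learner; m itself stays
        # constant for the whole forest.
        forest.append(learn_tree(X[idx], y[idx], m))
    return forest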

Random Forest parameters

source : [2]
Random Forests are easy to use: the only two parameters the user has to choose are the number of trees and the number of variables (m) randomly selected at each node from the available set of variables.
Breiman's recommendation is to pick a large number of trees, and to set m to the square root of the number of variables.
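
In scikit-learn terms these two knobs are n_estimators and max_features, and the out-of-bag score gives a cheap sanity check on the choice; the dataset below is a placeholder:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))
y = (X[:, 0] > 0).astype(int)

# Breiman's recommendation: many trees, m = sqrt(number of variables).
forest = RandomForestClassifier(n_estimators=500, max_features='sqrt',
                                oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)  # out-of-bag accuracy estimate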
