The basic procedure of a random forest (for an m*n training set: m samples, n feature dimensions) is: 1) Training: grow each decision tree while randomly selecting r features with r << n (typically r is about sqrt(n)); 2) Prediction: classify the input with every decision tree and take a majority vote; the class with the most votes is the predicted class.
Each decision tree is grown as follows, choosing at every node the attribute that maximizes the information gain:
How to grow a Decision Tree
source : [3]
LearnUnprunedTree(X,Y)
Input: X a matrix of R rows and M columns where Xij = the value of the j'th attribute in the i'th input datapoint. Each column consists of either all real values or all categorical values.
Input: Y a vector of R elements, where Yi = the output class of the i'th datapoint. The Yi values are categorical.
Output: An Unpruned decision tree
If all records in X have identical values in all their attributes (this includes the case where R < 2), return a Leaf Node predicting the majority output, breaking ties randomly.
If all values in Y are the same, return a Leaf Node predicting this value as the output
Else
select m variables at random out of the M variables
For j = 1 .. m
If j'th attribute is categorical
IGj = IG(Y|Xj) (see Information Gain)
Else (j'th attribute is real-valued)
IGj = IG*(Y|Xj) (see Information Gain)
Let j* = argmaxj IGj (this is the splitting attribute we'll use)
If j* is categorical then
For each value v of the j*'th attribute
Let Xv = subset of rows of X in which Xij* = v. Let Yv = corresponding subset of Y
Let Childv = LearnUnprunedTree(Xv,Yv)
Return a decision tree node, splitting on the j*'th attribute. The number of children equals the number of values of the j*'th attribute, and the v'th child is Childv
Else j* is real-valued and let t be the best split threshold
Let XLO = subset of rows of X in which Xij* <= t. Let YLO = corresponding subset of Y
Let ChildLO = LearnUnprunedTree(XLO,YLO)
Let XHI = subset of rows of X in which Xij* > t. Let YHI = corresponding subset of Y
Let ChildHI = LearnUnprunedTree(XHI,YHI)
Return a decision tree node, splitting on the j*'th attribute. It has two children corresponding to whether the j*'th attribute is above or below the given threshold.
Note: There are alternatives to Information Gain for splitting nodes
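The pseudocode above can be sketched in Python. This is a minimal illustration rather than a production implementation, under two simplifying assumptions of my own: all attributes are real-valued (so every split is an IG* threshold split), and nodes are plain dicts. The per-node sampling of m attributes is the random-forest variant of the tree grower.

```python
import numpy as np

def entropy(y):
    """H(Y) over the empirical label distribution."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def learn_unpruned_tree(X, Y, m, rng=None):
    """Grow an unpruned tree; real-valued attributes only, IG* threshold splits."""
    rng = rng if rng is not None else np.random.default_rng(0)
    X, Y = np.asarray(X, dtype=float), np.asarray(Y)
    def leaf():
        vals, counts = np.unique(Y, return_counts=True)
        return {"leaf": vals[np.argmax(counts)].item()}
    # Leaf cases: all labels equal, or all rows identical (includes R < 2)
    if len(np.unique(Y)) == 1 or (X == X[0]).all():
        return leaf()
    # Random-forest variant: consider only m of the M attributes at this node
    attrs = rng.choice(X.shape[1], size=m, replace=False)
    h_y, best = entropy(Y), None
    for j in attrs:
        for t in np.unique(X[:, j])[:-1]:   # candidate split thresholds
            lo = X[:, j] <= t
            h_cond = lo.mean() * entropy(Y[lo]) + (~lo).mean() * entropy(Y[~lo])
            if best is None or h_y - h_cond > best[0]:
                best = (h_y - h_cond, int(j), float(t))
    if best is None:                        # sampled attributes were all constant
        return leaf()
    _, j, t = best
    lo = X[:, j] <= t
    return {"attr": j, "thresh": t,
            "lo": learn_unpruned_tree(X[lo], Y[lo], m, rng),
            "hi": learn_unpruned_tree(X[~lo], Y[~lo], m, rng)}

def predict(tree, x):
    """Walk the tree until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["lo"] if x[tree["attr"]] <= tree["thresh"] else tree["hi"]
    return tree["leaf"]
```

On a toy separable set such as `X = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.8, 0.0]]`, `Y = [0, 0, 1, 1]`, a single threshold split already yields pure children, so the recursion bottoms out after one level.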
Note that the maximal information gain is computed differently for categorical and real-valued attributes; only the real-valued case is described here: IG*(Y|X_j) = max_t IG(Y|X_j:t), which simultaneously determines the best split threshold t.
entropy = -sum_i p_i log2(p_i) (entropy is the H function below). High entropy means the variable follows a "boring" (near-uniform) distribution: the probabilities of its values are all similar. Low entropy means the variable follows a varied (peaks-and-valleys) distribution: one or two values occur with especially high probability. Our goal is to find splits that leave Y with low conditional entropy: since information gain = H(Y) - H(Y|X) and H(Y) is fixed, the lower H(Y|X) is, the more concentrated the class distribution is within each branch, i.e. the more samples are classified cleanly, and the earlier that attribute should be chosen as a split node.
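The high-vs-low entropy claim is easy to check numerically. Below is a minimal, self-contained entropy estimator over a sample (the helper name is my own):

```python
from collections import Counter
from math import log2

def entropy(values):
    """H = -sum_i p_i * log2(p_i), estimated from the sample frequencies."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

# A flat ("boring") distribution has high entropy...
flat = entropy(['a', 'b', 'c', 'd'])           # 2.0 bits
# ...while a peaked distribution has low entropy.
peaked = entropy(['a'] * 9 + ['b'])            # about 0.469 bits
```

Four equally likely values give the maximum 2 bits; a 90/10 split over two values gives well under 1 bit, which is exactly why a peaked conditional distribution of Y within a branch is what a good split produces.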
Information gain
source : [3]
-
nominal attributes
suppose X can have one of m values V1,V2,...,Vm
P(X=V1)=p1, P(X=V2)=p2,...,P(X=Vm)=pm
H(X) = -sum_{j=1..m} p_j log2 p_j (the entropy of X)
H(Y|X=v) = the entropy of Y among only those records in which X has value v
H(Y|X) = sum_j p_j H(Y|X=v_j)
IG(Y|X) = H(Y) - H(Y|X)
-
real-valued attributes
suppose X is real valued
define IG(Y|X:t) as H(Y) - H(Y|X:t)
define H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
define IG*(Y|X) = maxt IG(Y|X:t)
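The two definitions above can be sketched directly (a self-contained illustration; the function names are my own, and IG* simply scans the distinct values of X as candidate thresholds):

```python
from collections import Counter
from math import log2

def H(values):
    """Empirical entropy H(Y)."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def ig_nominal(y, x):
    """IG(Y|X) = H(Y) - sum_j P(X=vj) H(Y|X=vj) for a categorical X."""
    n = len(y)
    cond = sum(
        (sum(xi == v for xi in x) / n) * H([yi for xi, yi in zip(x, y) if xi == v])
        for v in set(x)
    )
    return H(y) - cond

def ig_star(y, x):
    """IG*(Y|X) = max_t IG(Y|X:t) for a real-valued X; returns (gain, t)."""
    best = (0.0, None)
    for t in sorted(set(x))[1:]:            # thresholds between distinct values
        lo = [yi for xi, yi in zip(x, y) if xi < t]
        hi = [yi for xi, yi in zip(x, y) if xi >= t]
        cond = (len(lo) / len(y)) * H(lo) + (len(hi) / len(y)) * H(hi)
        gain = H(y) - cond
        if gain > best[0]:
            best = (gain, t)
    return best
```

For example, with `y = [0, 0, 1, 1]` and `x = [1.0, 2.0, 3.0, 4.0]`, the maximizing threshold is t = 3.0, where both sides become pure and the gain equals H(y) = 1 bit.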
How to grow a Random Forest
source : [1]
Each tree is grown as follows:
- if the number of cases in the training set is N, sample N cases at random - but with replacement - from the original data. This sample will be the training set for growing the tree.
- if there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
- each tree is grown to the largest extent possible. There is no pruning.
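The first step above (bootstrap sampling) can be sketched with NumPy. A well-known consequence, illustrated below with an arbitrary N and seed of my choosing, is that each tree sees roughly 1 - 1/e, about 63.2%, of the distinct training cases; the remainder are "out-of-bag" for that tree.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed for reproducibility
N = 10_000
# Bootstrap sample: N draws *with replacement* from the N training indices
sample = rng.integers(0, N, size=N)
frac_unique = np.unique(sample).size / N
# Roughly 0.632 of the cases appear at least once in the sample
print(round(frac_unique, 3))
```

The out-of-bag cases are what Breiman uses for an unbiased internal error estimate, which is why no separate validation split is needed.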
Random Forest parameters
source : [2]
Random Forests are easy to use; the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (m) to be randomly selected from the available set of variables.
Breiman's recommendations are to pick a large number of trees, and to set m to the square root of the number of variables.
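Putting the two parameters together, here is a toy end-to-end sketch of the forest loop: bootstrap sample, random feature subset of size m = sqrt(M), fit, then majority vote at prediction time. To stay short it grows depth-1 threshold stumps rather than full unpruned trees, so it only illustrates the structure; all names and data are made up.

```python
import numpy as np

def majority(y):
    """Most frequent label in y."""
    vals, counts = np.unique(y, return_counts=True)
    return vals[counts.argmax()]

def fit_stump(X, y, feats):
    """Best single-feature threshold split (a depth-1 'tree') over the given features."""
    best = (-1.0, None)
    for j in feats:
        for t in np.unique(X[:, j])[:-1]:
            lo = X[:, j] <= t
            left, right = majority(y[lo]), majority(y[~lo])
            acc = (np.where(lo, left, right) == y).mean()
            if acc > best[0]:
                best = (acc, (int(j), float(t), left, right))
    return best[1]

def fit_forest(X, y, n_trees, rng):
    m = max(1, int(np.sqrt(X.shape[1])))                       # Breiman's m = sqrt(M)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))             # bootstrap sample
        feats = rng.choice(X.shape[1], size=m, replace=False)  # random feature subset
        forest.append(fit_stump(X[idx], y[idx], feats))
    return forest

def forest_predict(forest, x):
    """Majority vote over all stumps, as in step 2 of the procedure."""
    votes = [l if x[j] <= t else r for (j, t, l, r) in forest]
    return majority(np.array(votes))
```

On two well-separated Gaussian blobs, even this stump forest votes its way to the right class on both sides of the boundary; with full trees the same skeleton becomes Breiman's algorithm.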