
Key Machine Learning Concepts Explained: Dataset Splitting and Random Forest

Programming Diary 2024-08-01 19:10:00


Dataset Splitting

Splitting your data into training, cross validation, and test sets is a common best practice. It lets you tune the parameters of an algorithm without making judgements that conform specifically to the training data.

Motivation

Dataset splitting is necessary to keep an ML algorithm from becoming biased toward its training data. Tuning the parameters of an algorithm until they best fit the training data commonly produces an overfit model that performs poorly on data it has not seen. For this reason, we split the dataset into multiple disjoint subsets and use each one for a different job: fitting the model, choosing its parameters, and measuring the final result.

The Training Set

The training set is used to fit the actual model your algorithm will use when it is exposed to new data. It is typically 60%-80% of your available data, depending on whether or not you set aside a cross validation set.
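As a minimal sketch of the proportions described above, the split can be done in two passes with scikit-learn's `train_test_split`. The synthetic data, the variable names, and the exact 60/20/20 ratio are assumptions chosen for illustration, not part of the original article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration): 1000 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# First set aside 20% of the rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then split the remaining 80% into training and cross validation sets.
# 0.25 of the remaining 80% is 20% of the original data.
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_cv), len(X_test))  # prints: 600 200 200
```

Fixing `random_state` keeps the three subsets reproducible between runs, which matters when you compare models later.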

The Cross Validation Set

The cross validation set is used for model selection (typically ~20% of your data). Use it to compare different parameter settings for the algorithm after training on the training set. For example, you can evaluate different model parameters (the polynomial degree, or lambda, the regularization parameter) on the cross validation set to see which is most accurate.
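The model selection step might look like the sketch below. It assumes a ridge regression model, where scikit-learn's `alpha` argument plays the role of lambda; the synthetic data and the list of candidate values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)

X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical candidate values for lambda, the regularization strength.
candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]

cv_errors = {}
for lam in candidate_lambdas:
    # Fit on the training set only...
    model = Ridge(alpha=lam).fit(X_train, y_train)
    # ...and score on the cross validation set.
    cv_errors[lam] = mean_squared_error(y_cv, model.predict(X_cv))

# Keep the lambda with the lowest cross validation error.
best_lambda = min(cv_errors, key=cv_errors.get)
print(best_lambda, cv_errors[best_lambda])
```

The test set is deliberately absent here: it stays untouched until a single final model has been chosen.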

The Test Set

The test set is the last dataset you touch (typically ~20% of your data). It is the source of truth: because it played no part in training or model selection, your accuracy in predicting the test set is the estimate of how your ML algorithm will perform on new data.

Random Forest

A Random Forest is a group of decision trees that make better decisions as a whole than they do individually.

Problem

Decision trees by themselves are prone to overfitting. This means that a tree becomes so used to the training data that it has difficulty making decisions for data it has never seen before.
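One way to see this overfitting is to compare a single unrestricted tree's accuracy on the data it was trained on with its accuracy on held-out data. The sketch below assumes synthetic data from `make_classification` with some label noise; the sizes and noise level are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (flip_y), so memorising the
# training rows does not carry over to unseen rows.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# With no depth limit, the tree keeps splitting until every training
# row is classified correctly.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The gap between the two numbers is the symptom of overfitting that the ensemble approach tries to reduce.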

Solution with Random Forests

Random Forests belong to the category of ensemble learning algorithms. This class of algorithms uses many estimators to yield better results, which usually makes Random Forests more accurate than plain decision trees. In a Random Forest, many decision trees are created. Each tree is trained on a random subset of the data and a random subset of the features of that data. This greatly reduces the chance of the estimators getting used to the data (overfitting), because each of them works on different data and features than the others. Creating many estimators and training them on random subsets of the data is an ensemble learning technique called bagging, short for Bootstrap AGGregatING. To get a prediction, the decision trees each vote on the correct class (classification), or their results are averaged (regression).
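The procedure described above can be sketched by hand: draw a bootstrap sample of rows and a random subset of features for each tree, then take a majority vote. This is an illustrative sketch, not how any particular library implements it; the data is synthetic, and `n_trees` and `n_sub_features` are hypothetical settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25         # odd, so a 0/1 majority vote can never tie
n_sub_features = 4   # features each tree is allowed to see
forest = []          # list of (fitted tree, feature indices) pairs

for _ in range(n_trees):
    # Bootstrap: sample training rows with replacement.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Random subset of feature columns for this tree.
    cols = rng.choice(X_train.shape[1], size=n_sub_features, replace=False)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[rows][:, cols], y_train[rows])
    forest.append((tree, cols))

# Aggregate: every tree votes, and the majority class wins.
votes = np.array([tree.predict(X_test[:, cols]) for tree, cols in forest])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
forest_acc = (forest_pred == y_test).mean()
print(f"bagged trees accuracy: {forest_acc:.3f}")
```

In practice you would more likely reach for scikit-learn's `RandomForestClassifier`, which draws its random feature subset at every split rather than once per tree.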

Example of Boosting in Python

In this competition, we are given a list of collision events and their properties, and we predict whether a τ → 3μ decay happened in each collision. Scientists currently assume that τ → 3μ does not happen, and the goal of the competition was to discover whether it happens more frequently than scientists can currently explain. The challenge was to design a machine learning problem for something no one has ever observed before. Scientists at CERN developed the following designs to achieve the goal: https://www.kaggle.com/c/flavours-of-physics/data

# Data cleaning
import pandas as pd

data_test = pd.read_csv("test.csv")
data_train = pd.read_csv("training.csv")
# Drop the columns that exist only in the training file. The keyword form is
# used because current pandas no longer accepts the positional axis, drop('col', 1).
data_train = data_train.drop(columns=['min_ANNmuon', 'production', 'mass'])

# Cleaned data: 'signal' is the label, everything else is a feature
Y = data_train['signal']
X = data_train.drop(columns=['signal'])

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

seed = 9001  # this one's over 9000!!!
# Boost 50 depth-1 trees (decision stumps).
# Recent scikit-learn releases deprecate the algorithm argument; remove it if your version warns.
boosted_tree = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), algorithm="SAMME", n_estimators=50, random_state=seed)
model = boosted_tree.fit(X, Y)
predictions = model.predict(data_test)
print(predictions)
# Note: we can't really validate these predictions since we don't have an array of "right answers"

# Stochastic gradient boosting
# (with the default subsample=1.0 every tree sees all rows; set subsample < 1.0 for the stochastic variant)
from sklearn.ensemble import GradientBoostingClassifier

gradient_boosted_tree = GradientBoostingClassifier(n_estimators=50, random_state=seed)
model2 = gradient_boosted_tree.fit(X, Y)
predictions2 = model2.predict(data_test)
print(predictions2)
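The comment in the example notes that the predictions cannot be checked, because the test file has no labels. One common workaround, in line with the dataset splitting described earlier, is to hold out part of the labelled training data and score the model on it. The sketch below uses synthetic data as a stand-in for the competition's training file, so the data, the 80/20 split, and the choice of ROC AUC as the metric are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

seed = 9001

# Synthetic stand-in for the labelled training data (an assumption;
# the Kaggle CSV files are not loaded here).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=seed
)

# Hold out 20% of the labelled rows; stratify keeps the class balance.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=seed, stratify=y
)

# Same kind of model as in the example: 50 boosted decision stumps.
boosted_tree = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=seed
)
boosted_tree.fit(X_train, y_train)

# Score the held-out rows using the predicted probability of class 1.
val_scores = boosted_tree.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_scores)
print(f"validation ROC AUC: {val_auc:.3f}")
```

With the real data, `X` and `y` would be the cleaned feature table and the `signal` column from the example.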

Translated from: https://www.freecodecamp.org/news/key-machine-learning-concepts-explained-dataset-splitting-and-random-forest/