Cross-validation

1: Introduction To Validation

So far, we've been evaluating the accuracy of trained models on the same data the models were trained on. While this is an essential first step, it doesn't tell us much about how well a model does on data it has never seen before. In machine learning, we want to use training data, which is historical and contains the labelled outcomes for each observation, to build a classifier that will return predicted labels for new, unlabelled data. If we only evaluate a classifier's effectiveness on the data it was trained on, we can run into overfitting, where the classifier performs well only on the training data but doesn't generalize to future data.

To test a classifier's generalizability, or its ability to provide accurate predictions on data it wasn't trained on, we use cross-validation techniques. Cross-validation involves splitting historical data into:

  • a training set -- which we use to train the classifier,
  • a test set -- which we use to evaluate the classifier's effectiveness using various measures.

Cross-validation is an important step that should be utilized after training any kind of machine learning model. In this mission, we'll focus on using cross-validation for evaluating a binary classification model. We'll continue to work with the dataset on graduate school admissions, which contains data on 644 applications with the following columns:

  • gre - applicant's score on the Graduate Record Exam, a generalized test for prospective graduate students.
    • Score ranges from 200 to 800.
  • gpa - college grade point average.
    • Continuous between 0.0 and 4.0.
  • admit - whether the applicant was admitted.
    • Binary value, 0 or 1, where 1 means the applicant was admitted to the program and 0 means the applicant was rejected.

In the following code cell, we import the libraries we need, read in the admissions Dataframe, copy the admit column to a new actual_label column, and drop the admit column.

Instructions

This step is a demo. Play around with code or advance to the next step.

import pandas as pd
from sklearn.linear_model import LogisticRegression

admissions = pd.read_csv("admissions.csv")
admissions["actual_label"] = admissions["admit"]
admissions = admissions.drop("admit", axis=1)

print(admissions.head())

2: Holdout Validation

There are a few different types of cross-validation techniques we can use to evaluate a classifier's effectiveness. The simplest technique is called holdout validation, which involves:

  • randomly splitting our dataset into a training set and a test set,
  • fitting the model using the training set,
  • making predictions on the test set.

We'll randomly select 80% of the observations in the admissions Dataframe as the training set and the remaining 20% as the test set. This ratio isn't set in stone, and you'll see many people using a 75%-25% split instead.

We'll explore more advanced cross-validation techniques in later missions and will focus on holdout validation, the simplest kind of validation, in this mission. To split the data randomly into a training and a test set, we'll:

  • use the numpy.random.permutation function to return a list containing index values in random order,
  • return a new Dataframe in that list's order,
  • select the first 80% of the rows as the training set,
  • select the last 20% of the rows as the test set.

Instructions

  • Use the NumPy random.permutation function to randomize the index for the admissions Dataframe.

  • Use the loc[] method on the admissions Dataframe to return a new Dataframe in the randomized order. Assign this Dataframe to shuffled_admissions.

  • Select rows 0 to 514 (including row 514) from shuffled_admissions and assign to train.

  • Select the remaining rows and assign to test.

  • Finally, display the first 5 rows in shuffled_admissions.

import numpy as np
np.random.seed(8)
admissions = pd.read_csv("admissions.csv")
admissions["actual_label"] = admissions["admit"]
admissions = admissions.drop("admit", axis=1)
shuffled_index = np.random.permutation(admissions.index)
shuffled_admissions = admissions.loc[shuffled_index]

train = shuffled_admissions.iloc[0:515]
test = shuffled_admissions.iloc[515:len(shuffled_admissions)]

print(shuffled_admissions.head())
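The split above hard-codes row 515 as the cutoff, which is 80% of 644 rows rounded down. As a minimal sketch of the same shuffle-then-slice idea with the cutoff derived from a fraction, the helper below works on a plain list of rows. The name `holdout_split`, its parameters, and the toy rows are hypothetical, and the standard-library `random` module stands in for `numpy.random.permutation` so that the example runs without extra packages.

```python
import random

def holdout_split(rows, train_fraction=0.8, seed=8):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper: it mirrors the permutation-then-slice approach,
    but derives the cutoff from train_fraction instead of hard-coding it.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # random order of row positions
    cutoff = int(len(rows) * train_fraction)   # int(644 * 0.8) == 515
    train = [rows[i] for i in indices[:cutoff]]
    test = [rows[i] for i in indices[cutoff:]]
    return train, test

# Toy stand-in for the 644 applications: made-up (gpa, actual_label) pairs.
rows = [(2.0 + (i % 20) / 10, i % 2) for i in range(644)]
train, test = holdout_split(rows)
print(len(train), len(test))
```

With 644 rows and a fraction of 0.8, this gives 515 training rows and 129 test rows, the same sizes as the iloc slices above. Every row lands in exactly one of the two sets.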

3: Accuracy

Now that we've split up the dataset into a training and a test set, we can:

  • train a logistic regression model on just the training set,
  • use the model to predict labels for the test set,
  • evaluate the accuracy of the predicted labels for the test set.

Recall that accuracy helps us answer the question:

  • What fraction of the predictions were correct (actual label matched predicted label)?

Prediction accuracy boils down to the number of labels that were correctly predicted divided by the total number of observations:

$$\text{Accuracy} = \frac{\text{\# of Correctly Predicted}}{\text{\# of Observations}}$$
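As a small worked example of this ratio, the sketch below counts matching labels in two short lists. The function name `fraction_correct` and the label values are made up for illustration and are not taken from the admissions data.

```python
def fraction_correct(actual, predicted):
    """Return the share of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Made-up labels: 5 of the 8 predictions match.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [0, 0, 0, 1, 0, 0, 1, 0]
print(fraction_correct(actual, predicted))
```

Here 5 of 8 predictions are correct, so the accuracy is 0.625.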

Instructions

  • Train a logistic regression model using the gpa column from the train Dataframe.
  • Use the LogisticRegression method predict to return the predicted labels for the gpa column from the test Dataframe. Assign the resulting list of labels to the predicted_label column in the test Dataframe.
  • Calculate the accuracy of the predictions by dividing the number of rows where actual_label matches predicted_label by the total number of rows in the test set.
  • Assign the accuracy value to accuracy and display it using the print function.

from sklearn.linear_model import LogisticRegression

shuffled_index = np.random.permutation(admissions.index)
shuffled_admissions = admissions.loc[shuffled_index]
train = shuffled_admissions.iloc[0:515]
# Copy the slice so that adding a column below doesn't trigger a SettingWithCopyWarning.
test = shuffled_admissions.iloc[515:len(shuffled_admissions)].copy()

# Fit on the training set only, then predict labels for the held-out test set.
model = LogisticRegression()
model.fit(train[["gpa"]], train["actual_label"])
labels = model.predict(test[["gpa"]])
test["predicted_label"] = labels

# Accuracy = rows where the labels match / all rows in the test set.
matches = test["predicted_label"] == test["actual_label"]
correct_predictions = test[matches]
accuracy = len(correct_predictions) / len(test)
print(accuracy)

4: Sensitivity And Specificity

Looks like the prediction accuracy is about 63.6%, which isn't too far off from the accuracy value of 64.6% that we computed in the previous mission. If the model had performed significantly worse on new data, that would suggest it's overfitting. If the prediction accuracy had been much lower, say 40% instead of about 64%, we would reconsider using logistic regression.

When we evaluated the model on the training data in the previous mission, we achieved a sensitivity value of 12.7% and a specificity value of 96.3%. Let's calculate these measures for the test set and compare. Here's a quick refresher of sensitivity and specificity:

  • Sensitivity helps us answer the question:
    • How effective is this model at identifying positive outcomes?
    • Of all of the students that should have been admitted (True Positives + False Negatives), how many did the model correctly admit (True Positives)?
  • Specificity helps us answer the question:
    • How effective is this model at identifying negative outcomes?
    • Of all of the applicants who should have been rejected (False Positives + True Negatives), what proportion were correctly rejected (just True Negatives)?

Now it's your turn! Calculate the specificity and sensitivity values for the predictions on the test set. To encourage you to avoid relying on the formulas for these measures, we've hidden the exact formula in the Hint and prefer that you work backwards from the goals of these measures instead.

Instructions

  • Calculate the sensitivity value for the predictions on the test set and assign to sensitivity.
  • Calculate the specificity value for the predictions on the test set and assign to specificity.
  • Display both values using the print function.

model = LogisticRegression()
model.fit(train[["gpa"]], train["actual_label"])
labels = model.predict(test[["gpa"]])
test["predicted_label"] = labels
matches = test["predicted_label"] == test["actual_label"]
correct_predictions = test[matches]
accuracy = len(correct_predictions) / len(test)
# Sensitivity: share of actual positives that were predicted positive.
true_positives = len(test[(test["actual_label"] == 1) & (test["predicted_label"] == 1)])
false_negatives = len(test[(test["actual_label"] == 1) & (test["predicted_label"] == 0)])
sensitivity = true_positives / (true_positives + false_negatives)
# Specificity: share of actual negatives that were predicted negative.
true_negatives = len(test[(test["actual_label"] == 0) & (test["predicted_label"] == 0)])
false_positives = len(test[(test["actual_label"] == 0) & (test["predicted_label"] == 1)])
specificity = true_negatives / (false_positives + true_negatives)
print(specificity)
print(sensitivity)
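The four counts used above can also be collected in a single pass over the labels. The sketch below does this on plain lists. The helper name `confusion_counts` and the toy labels are hypothetical, and it assumes labels are coded strictly as 0 and 1.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fn, tn, fp) for binary labels coded as 0 and 1."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # should be admitted, predicted admitted
        elif a == 1 and p == 0:
            fn += 1          # should be admitted, predicted rejected
        elif a == 0 and p == 0:
            tn += 1          # should be rejected, predicted rejected
        elif a == 0 and p == 1:
            fp += 1          # should be rejected, predicted admitted
    return tp, fn, tn, fp

# Made-up labels: 4 actual positives and 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
tp, fn, tn, fp = confusion_counts(actual, predicted)
toy_sensitivity = tp / (tp + fn)
toy_specificity = tn / (tn + fp)
print(toy_sensitivity, round(toy_specificity, 3))
```

In this toy case the model finds only 1 of the 4 positives (sensitivity 0.25) but correctly rejects 5 of the 6 negatives (specificity about 0.833). This is the same low-sensitivity, high-specificity pattern the admissions model shows.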

5: False Positive Rate

It turns out that our test set achieved a sensitivity value of 8.3%, compared to a sensitivity value of 12.7% from the previous mission, and a specificity value of 96.3%, which matches the specificity value of 96.3% from the previous mission. We have a little more evidence now that our logistic regression model is able to generalize to new data.

So far, we've been using the LogisticRegression method predict to generate predictions for labels. For each observation, scikit-learn uses the logistic function, with the optimal parameter values for the data the model was trained on, to return a probability value. If the probability value is larger than 50%, the predicted label is 1, and if it's less than 50%, the predicted label is 0. For most problems, however, 50% is not the optimal discrimination threshold. We need a way to vary the threshold and compute the measures at each threshold. Then, depending on the measure we want to optimize, we can find the appropriate threshold to use for predictions.
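To make the role of the threshold concrete, here is a minimal sketch that turns a GPA into a probability with the logistic function and then applies two different thresholds. The intercept and coefficient are hypothetical values chosen for illustration, not the parameters fitted on the admissions data, and both function names are made up.

```python
import math

def predict_probability(gpa, intercept, coef):
    """Logistic function: maps a linear score to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-(intercept + coef * gpa)))

def predict_label(probability, threshold=0.5):
    """Return 1 when the probability exceeds the threshold, else 0."""
    return 1 if probability > threshold else 0

# Hypothetical parameters, NOT fitted to the admissions data.
intercept, coef = -4.0, 1.2
for gpa in (2.5, 3.3, 3.9):
    p = predict_probability(gpa, intercept, coef)
    print(gpa, round(p, 3), predict_label(p, 0.5), predict_label(p, 0.3))
```

With these made-up parameters, a GPA of 3.3 gets a probability just under 0.5, so it is labelled 0 at the 50% threshold but 1 at a 30% threshold. With a fitted scikit-learn model, the same comparison could be applied to the second column returned by predict_proba.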

The 2 common measures that are computed for each discrimination threshold are the False Positive Rate (or fall-out) and the True Positive Rate (or sensitivity). While we've explored the latter measure, we haven't discussed fall-out:

  • Fall-out or False Positive Rate - The proportion of applicants who should have been rejected (actual_label equals 0) but were instead admitted (predicted_label equals 1):

$$\text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}$$

These 2 rates describe how well the model admits the students it should admit and how often it wrongly admits students it should reject:

  • True Positive Rate: The proportion of students who should have been admitted that the model admitted.
  • False Positive Rate: The proportion of students who should have been rejected that the model admitted.
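Because the False Positive Rate and specificity share the same denominator (all applicants who should have been rejected), one follows directly from the other. The short derivation below uses TN and FP as shorthand for the True Negative and False Positive counts:

```latex
\text{FPR}
  = \frac{FP}{FP + TN}
  = 1 - \frac{TN}{FP + TN}
  = 1 - \text{Specificity}
```

For example, the test-set specificity of 96.3% reported earlier corresponds to a False Positive Rate of 3.7% at the 50% threshold.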

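import matplotlib.pyplot as plt
# Jupyter-only magic; remove this line when running as a plain Python script.
%matplotlib inline

from sklearn import metrics

# Column 1 of predict_proba holds the probability of the positive class (label 1).
probabilities = model.predict_proba(test[["gpa"]])
fpr, tpr, thresholds = metrics.roc_curve(test["actual_label"], probabilities[:,1])
plt.plot(fpr, tpr)
plt.show()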

6: ROC Curve

We can vary the discrimination threshold and calculate the TPR and FPR for each value. Plotting TPR against FPR across thresholds gives an ROC curve, short for receiver operating characteristic curve, which lets us understand a classification model's performance as the discrimination threshold is varied. To calculate the TPR and FPR values at each discrimination threshold, we can use the scikit-learn roc_curve function. This function calculates the false positive rate and true positive rate over a range of discrimination thresholds, from a threshold high enough that both rates are 0% down to one low enough that both rates are 100%.

This function takes 2 required parameters:

  • y_true: list of the true labels for the observations,
  • y_score: list of the model's probability scores for those observations.

As the example code in the documentation suggests, the roc_curve function returns 3 values which you can assign all at once:

fpr, tpr, thresholds = metrics.roc_curve(labels, probabilities)

You'll notice that the returned thresholds usually won't span 0.0 to 1.0. Instead, the function constrains the result set to the minimum range of thresholds over which FPR and TPR go from 0.0 to 1.0. Once we have the FPR and TPR for each relevant threshold, we can plot the ROC curve using the Matplotlib plot function.
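To show what sweeping the threshold involves, the following is a simplified pure-Python sketch that computes one (FPR, TPR) point per distinct score. It is not scikit-learn's implementation. The function name `roc_points`, the rule "predict positive when score >= threshold", and the toy scores are all assumptions made for illustration, and the sketch expects at least one positive and one negative label.

```python
def roc_points(actual, scores):
    """Return (fpr, tpr, thresholds) by using each distinct score as a threshold.

    Simplified sketch: an observation is predicted positive when its
    score is >= the threshold. Thresholds run from high to low.
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    # Start above every score so the first point predicts nothing as positive.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    fpr, tpr = [], []
    for t in thresholds:
        tp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 0)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr, thresholds

# Made-up labels and probability scores for six observations.
actual = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.8]
fpr, tpr, thresholds = roc_points(actual, scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(round(f, 2), round(t, 2), th)
```

As the threshold drops, more observations are predicted positive, so both rates can only stay the same or rise. The curve therefore starts at (0, 0) and ends at (1, 1).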

Instructions

  • Import the relevant scikit-learn package you need to calculate the ROC curve.
  • Use the model to return predicted probabilities for the test set.
  • Use the roc_curve function to return the FPR and TPR values for different thresholds.
  • Create and display a line plot with:
    • the FPR values on the x-axis and
    • the TPR values on the y-axis.

# Note the different import style!
from sklearn.metrics import roc_auc_score

probabilities = model.predict_proba(test[["gpa"]])
auc_score = roc_auc_score(test["actual_label"], probabilities[:,1])
print(auc_score)
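The AUC score summarizes an ROC curve as a single number: the area under it. As a rough sketch of that idea, and not of scikit-learn's internals, the area can be approximated from the (FPR, TPR) points with the trapezoidal rule. The function name `area_under_curve` and the two example curves are hypothetical.

```python
def area_under_curve(fpr, tpr):
    """Trapezoidal area under a curve whose points are sorted by increasing fpr."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        area += width * (tpr[i] + tpr[i - 1]) / 2
    return area

# A diagonal ROC curve (TPR always equals FPR) versus a perfect classifier.
diagonal = area_under_curve([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
perfect = area_under_curve([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])
print(diagonal, perfect)
```

The diagonal curve has an area of 0.5 and the perfect curve an area of 1.0, which is why an AUC near 50% is read as being close to random guessing.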

8: Next Steps

With an AUC score of about 57.8%, our model does a little bit better than 50%, which would correspond to randomly guessing, but not as well as the university may like. This could imply that using just one feature in our model, GPA, to predict admissions isn't enough. All of the measures and scores we've learned about are different ways of thinking about accuracy, and the important takeaway is that no single measure will tell us whether we want to use a specific model or not. Understanding how individual scores are calculated and what they focus on helps you converge on a clearer picture. It's always important to understand which measures matter most for the problem at hand.

In the next mission, we'll switch gears and learn how we can use machine learning on problems that don't involve predicting a label. This type of machine learning is called unsupervised machine learning and we'll focus on a technique called clustering.

Reposted from: https://my.oschina.net/Bettyty/blog/751409
