COMP7404 Machine Learning: Pipelining Transformers & K-Fold Cross-Validation
Published: 2019-04-30


Pipelining Transformer

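The Pipeline class itself has fit, predict, and score methods, and it behaves like any other model in Scikit-Learn.

A Pipeline is built from a list of (key, value) pairs. The key is the name you choose for the step, and the value is the object that does the processing for that step. The steps are passed in together as a list.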

 

Creating a pipeline

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

pipe = Pipeline(steps=[('pca', PCA()), ('svc', SVC())])  # pipe is a Pipeline object
print(pipe)

Or, equivalently:

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)

Or use make_pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = make_pipeline(MinMaxScaler(), SVC())

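With make_pipeline we do not have to name the steps ourselves; the function names them automatically.

In general, the automatic step name is the lowercase version of the class name. If several steps belong to the same class, a number is appended.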
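To see the automatic naming in action, here is a minimal sketch. The choice of two StandardScaler steps is only for illustration (it is not from the original post); it is there to trigger the numeric suffix.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two steps share the StandardScaler class, so their names get a numeric suffix.
# The duplicated scaler is purely illustrative.
pipe = make_pipeline(StandardScaler(), StandardScaler(), SVC())

# pipe.steps is a list of (name, estimator) tuples; collect only the names.
step_names = [name for name, _ in pipe.steps]
print(step_names)
```

The SVC step keeps the plain lowercase class name, while the two scalers are distinguished by the appended number.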

 

pipeline.steps

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

pipe = Pipeline(steps=[('pca', PCA()), ('svc', SVC())])  # pipe is a Pipeline object
print(pipe.steps)

 

Using set_params to reset the parameters of the class in each step

The syntax is <step name>__<parameter name>=<value> (note the double underscore).

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

pipe = Pipeline(steps=[('pca', PCA()), ('svc', SVC())])  # pipe is a Pipeline object
pipe.set_params(svc__C=10.0)
print(pipe.steps)
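As a small follow-up sketch, several parameters can be set in one call and then read back to confirm they reached the right step. The value n_components=2 is just an illustrative choice.

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA

pipe = Pipeline(steps=[('pca', PCA()), ('svc', SVC())])

# Each keyword is addressed as <step name>__<parameter name>.
pipe.set_params(pca__n_components=2, svc__C=10.0)

# Read the values back, either from get_params() or from the step itself.
params = pipe.get_params()
print(params['pca__n_components'], params['svc__C'])
print(pipe.named_steps['svc'].C)
```

The same double-underscore keys are what a parameter grid uses when the pipeline is tuned with GridSearchCV.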

 

 

Creating a pipeline and training it

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)

pipe_lr = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(random_state=1, solver='lbfgs'))
pipe_lr.fit(X_train, y_train)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
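It can help to see what the pipeline does for us. The sketch below compares the pipeline with the same steps chained by hand. To keep it self-contained it assumes sklearn.datasets.load_breast_cancer (the copy of the Wisconsin breast cancer data bundled with Scikit-Learn) instead of downloading the CSV, so the label encoding may differ from the code above. The explicit svd_solver='full' is also an addition, used only to keep the two runs deterministic.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumption: bundled dataset used as a stand-in for the downloaded wdbc.data file.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1)

# Version 1: let the pipeline chain the steps.
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2, svd_solver='full'),
                        LogisticRegression(random_state=1, solver='lbfgs'))
pipe_lr.fit(X_train, y_train)
pipe_pred = pipe_lr.predict(X_test)

# Version 2: the same steps by hand.
# fit: each transformer is fit on the training data and transforms it,
# then the final estimator is fit on the transformed training data.
scaler = StandardScaler()
pca = PCA(n_components=2, svd_solver='full')
clf = LogisticRegression(random_state=1, solver='lbfgs')
X_train_pca = pca.fit_transform(scaler.fit_transform(X_train))
clf.fit(X_train_pca, y_train)

# predict: the test data is only transformed (never fit), then classified.
manual_pred = clf.predict(pca.transform(scaler.transform(X_test)))

same = np.array_equal(pipe_pred, manual_pred)
print('Pipeline and manual predictions identical:', same)
```

The manual version makes it easy to accidentally call fit on the test data; the pipeline removes that risk.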

 

Using a pipeline in grid search

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('dataset/wdbc.data', header=None)
y = df.loc[:, 1].values
X = df.loc[:, 2:].values
le = LabelEncoder()
y = le.fit_transform(y)  # y holds the strings 'B' and 'M', so encode them as integers

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)
print('Accuracy: %.3f' % gs.score(X_test, y_test))
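After the search has been fit, the winning combination can be inspected. The following is a sketch under a few assumptions: it uses the bundled load_breast_cancer data rather than the local dataset/wdbc.data file, and a smaller grid than the one above so that it runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Assumption: bundled dataset as a stand-in for the local CSV file.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])

# Reduced grid (illustrative); keys still follow <step name>__<parameter name>.
param_grid = {'svm__C': [0.1, 1, 10],
              'svm__gamma': [0.001, 0.01, 0.1]}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5)
gs.fit(X_train, y_train)

# Best parameter combination and its mean cross-validated accuracy.
print(gs.best_params_)
print('Best CV accuracy: %.3f' % gs.best_score_)

# With the default refit=True, best_estimator_ is the whole pipeline
# refit on all of X_train using the best parameters.
best_pipe = gs.best_estimator_
print('Test accuracy: %.3f' % best_pipe.score(X_test, y_test))
```

Because the scaler is part of the pipeline, it is refit on the training portion of every fold, so the validation portion never influences the scaling statistics.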

 

 

Stratified K-Fold Cross-Validation

StratifiedKFold()

StratifiedKFold only splits the data into k folds for you; it does not compute any validation scores.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)

pipe_lr = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(random_state=1, solver='lbfgs'))

kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True).split(X_train, y_train)

scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %2d, Class dist.: %s, Acc: %.3f' % (k+1, np.bincount(y_train[train]), score))

print('\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
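The "stratified" part means each fold keeps roughly the class proportions of the full data. A minimal sketch on a made-up label vector (8 samples of class 0 and 4 of class 1, chosen only for illustration) shows this without any model:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): class ratio 2:1.
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))  # the features play no role in the split

skf = StratifiedKFold(n_splits=4)

test_counts = []
for k, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Count how many samples of each class land in the held-out fold.
    counts = np.bincount(y[test_idx], minlength=2).tolist()
    test_counts.append(counts)
    print('Fold %d: test class counts = %s' % (k + 1, counts))
```

Every held-out fold contains two samples of class 0 and one of class 1, the same 2:1 ratio as the full label vector.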

 

cross_val_score

A built-in k-fold cross-validation scorer that is simpler to write.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)

pipe_lr = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(random_state=1, solver='lbfgs'))

scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=10, n_jobs=1)
print('CV accuracy scores: %s' % scores)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
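When cv is an integer and the estimator is a classifier, cross_val_score uses a non-shuffled StratifiedKFold internally. The sketch below checks that passing the splitter explicitly gives the same scores. It assumes the bundled load_breast_cancer data as a stand-in for the downloaded CSV, and fixes svd_solver='full' only to keep both runs deterministic.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumption: bundled dataset instead of the wdbc.data download.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1)

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2, svd_solver='full'),
                        LogisticRegression(random_state=1, solver='lbfgs'))

# Integer cv: the splitter is chosen for us.
scores_int = cross_val_score(pipe_lr, X_train, y_train, cv=10)

# Explicit splitter: same number of folds, no shuffling.
scores_skf = cross_val_score(pipe_lr, X_train, y_train,
                             cv=StratifiedKFold(n_splits=10))

print('Same scores:', np.allclose(scores_int, scores_skf))
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores_int), np.std(scores_int)))
```

Passing the splitter explicitly is useful when you want shuffling or a fixed random_state, as in the manual StratifiedKFold loop.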

 

 

 

Reposted from: http://tmygf.baihongyu.com/
