
Ping An Medical Technology Disease QA Transfer Learning (CHIP-STS)

1. Introduction

This project tackles Evaluation Task 2 of CHIP 2019, hosted by Ping An Medical Technology.

Transfer learning is an important part of natural language processing. Its main goal is to improve learning on a new task by transferring knowledge from related tasks that have already been learned, thereby improving the model's ability to generalize.

The goal of this evaluation task is transfer learning across diseases on Chinese disease QA data. Specifically, given question pairs drawn from 5 different diseases, the task is to judge whether the two sentences have the same or similar meaning. All corpora come from real patient questions posted on the internet, filtered and manually annotated for intent matching.

# Install the latest version of paddlenlp
!pip install --upgrade pip
!pip install paddlenlp --upgrade
import paddlenlp
import paddle

2. Dataset

Data description

The data consists of three files: train.csv, dev.csv, and test.csv.

train.csv is the training set. It contains 20,000 manually labeled disease QA pairs across 5 diseases: 10,000 pairs for diabetes and 2,500 each for hypertension, hepatitis, aids, and breast_cancer;

dev.csv is the validation set. It contains 10,000 unlabeled disease QA pairs across the same 5 diseases, with 2,000 pairs each for diabetes, hypertension, hepatitis, aids, and breast_cancer;

test.csv is the test set. It contains 50,000 manually labeled disease QA pairs, of which only a portion is available for validation.

The category field gives the disease of each question pair. The values map to:

diabetes (糖尿病), hypertension (高血压), hepatitis (乙肝, hepatitis B), aids (艾滋病), breast_cancer (乳腺癌).

The label field indicates whether the two questions have the same meaning: 1 if they do, 0 if they do not.

Labels are known for the training set but unknown for the validation and test sets.

!cd /home/aistudio/data/data107842/ && unzip 平安医疗疾病问答比赛数据集.zip 
Archive:  平安医疗疾病问答比赛数据集.zip
  inflating: train.csv
  inflating: dev_id.csv
  inflating: test_final.csv

2.1 Inspecting the dataset

# Read the datasets
import pandas

train = pandas.read_csv('/home/aistudio/data/data107842/train.csv', sep=',')  # labeled training data
dev = pandas.read_csv('/home/aistudio/data/data107842/dev_id.csv', sep=',')   # unlabeled data to predict

# This project only judges whether the two questions mean the same thing,
# so the category column (and the id column of dev) can be dropped
del train['category']
del dev['category']
del dev['id']

# Show the first 5 training examples
train.head()
# Inspect the training data file info
# train.info()
# Check the label distribution; it is quite balanced, as you would expect from a competition dataset
# train['label'].value_counts()
   question1                                      question2                                   label
0  艾滋病窗口期会出现腹泻症状吗                 头疼腹泻四肢无力是不是艾滋病               0
1  由于糖尿病引起末梢神经炎,怎么根治?          糖尿病末梢神经炎的治疗方法                 1
2  H型高血压,是通所说的高血脂?                高血压引起脑出血怎么抢救治疗               0
3  糖尿病跟尿毒症有什么区别?                  糖尿病人,尿酸只有4.6是什么原因造成的?     0
4  你好,我60岁,患高血压,80135,爱喝酸奶可以吗?  高血压糖尿病人可以喝牛奶吗?                1
# Check text length statistics to help choose max_seq_length for the pretrained model
print(train['question1'].map(len).describe())
print(train['question2'].map(len).describe())
print(dev['question1'].map(len).describe())
print(dev['question2'].map(len).describe())
count    20000.000000
mean        13.052450
std          4.702489
min          2.000000
25%         10.000000
50%         12.000000
75%         15.000000
max         57.000000
Name: question1, dtype: float64
count    20000.000000
mean        13.916000
std          5.251421
min          2.000000
25%         10.000000
50%         13.000000
75%         16.000000
max         73.000000
Name: question2, dtype: float64
count    10000.000000
mean        13.352600
std          5.031442
min          2.000000
25%         10.000000
50%         12.000000
75%         15.000000
max         50.000000
Name: question1, dtype: float64
count    10000.000000
mean        14.493700
std          5.570787
min          3.000000
25%         11.000000
50%         13.000000
75%         17.000000
max         52.000000
Name: question2, dtype: float64
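Since section 3.3 tokenizes the two questions as one concatenated string, the combined length is what max_seq_length must cover. A quick check of that combined length (a sketch I've added; `combined_len` is not in the original project) confirms that 128 is a comfortable upper bound:

# Combined character length of each question pair, since question1 and
# question2 are concatenated before tokenization (see section 3.3)
combined_len = train['question1'].map(len) + train['question2'].map(len)
print(combined_len.describe())
print("99th percentile:", combined_len.quantile(0.99))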

2.2 Splitting train and validation sets

# Split train/validation 9:1 within each label class so the two sets share the
# same label distribution (you could also split directly by index; an equivalent
# one-call alternative with scikit-learn is sketched after this block)
from sklearn.utils import shuffle

new_train = pandas.DataFrame()  # training set
new_valid = pandas.DataFrame()  # validation set

tags = list(train.label.unique())  # all label classes
# Sample the split per class
for tag in tags:
    data = train[(train['label'] == tag)]
    # Sample 10% as the validation set
    valid_sample = data.sample(int(0.1 * len(data)))
    valid_index = valid_sample.index
    # The remaining 90% becomes the training set
    all_index = data.index
    residue_index = all_index.difference(valid_index)
    residue = data.loc[residue_index]
    # Collect the sampled data
    new_valid = pandas.concat([new_valid, valid_sample], ignore_index=True)
    new_train = pandas.concat([new_train, residue], ignore_index=True)

# Shuffle both sets
new_train = shuffle(new_train)
new_valid = shuffle(new_valid)

# Save the train and validation files
new_train.to_csv('train_data.csv', sep='\t', index=False)  # use '\t' because some examples contain ','
new_valid.to_csv('valid_data.csv', sep='\t', index=False)
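For reference, the same stratified 9:1 split can be done in a single call with scikit-learn (an equivalent sketch, not the code the project used):

from sklearn.model_selection import train_test_split

# stratify keeps the label ratio identical in both splits;
# random_state makes the split reproducible
new_train, new_valid = train_test_split(
    train, test_size=0.1, stratify=train['label'], random_state=42)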

2.3 Loading the dataset

from paddlenlp.datasets import DatasetBuilder

# Define the files for training/validation and how to parse them
class QueryData(DatasetBuilder):
    SPLITS = {
        'train': 'train_data.csv',
        'dev': 'valid_data.csv',
    }

    def _get_data(self, mode, **kwargs):
        filename = self.SPLITS[mode]
        return filename

    def _read(self, filename):
        """Read the data file."""
        with open(filename, 'r', encoding='utf-8') as f:
            # skip the header row
            head = None
            for line in f:
                data = line.strip().split("\t")    # columns are tab-separated
                if not head:
                    head = data
                else:
                    question1, question2, label = data
                    yield {"question1": question1, "question2": question2, "label": label}  # example format

    def get_labels(self):
        return ["0", "1"]   # class labels: 0, 1


# Define a dataset loading helper
def load_dataset(name=None, data_files=None, splits=None, lazy=None, **kwargs):
    reader_cls = QueryData
    if not name:
        reader_instance = reader_cls(lazy=lazy, **kwargs)
    else:
        reader_instance = reader_cls(lazy=lazy, name=name, **kwargs)
    datasets = reader_instance.read_datasets(data_files=data_files, splits=splits)
    return datasets


# Load the training and validation sets
train_ds, dev_ds = load_dataset(splits=["train", "dev"])
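A quick look at what was loaded (a check I've added, not in the original notebook):

# The loaded datasets behave like lists of dicts
print(len(train_ds), len(dev_ds))   # roughly 18000 and 2000 after the 9:1 split
print(train_ds[0])                  # a dict with question1, question2, and the label index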

3. Building a baseline model with PaddleNLP

3.1 Defining the pretrained model

For background on what BERT is, see the Zhihu article "什么是bert" (What is BERT).

More BERT pretrained models can be found under the BERT entry of the PaddleNLP documentation.

Other pretrained models can also be used; see the PaddleNLP Transformer pretrained models list.

# Define the pretrained model to fine-tune
MODEL_NAME = 'bert-wwm-ext-chinese'
# Loading the bare encoder is not strictly needed for fine-tuning;
# only the sequence-classification model below is used afterwards
bert_model = paddlenlp.transformers.BertModel.from_pretrained(MODEL_NAME)
model = paddlenlp.transformers.BertForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=2)
[2022-12-20 23:08:07,837] [    INFO] - Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "type_vocab_size": 2,
  "vocab_size": 21128
}
[2022-12-20 23:08:07,842] [    INFO] - Configuration saved in /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/config.json
W1220 23:08:07.847621   261 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1220 23:08:07.851722   261 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022-12-20 23:08:09,544] [    INFO] - Downloading bert-wwm-ext-chinese.pdparams from http://bj.bcebos.com/paddlenlp/models/transformers/bert/bert-wwm-ext-chinese.pdparams
100%|██████████| 390M/390M [00:14<00:00, 28.5MB/s]
[2022-12-20 23:08:24,973] [    INFO] - All model checkpoint weights were used when initializing BertModel.
[2022-12-20 23:08:24,976] [    INFO] - All the weights of BertModel were initialized from the model checkpoint at bert-wwm-ext-chinese.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertModel for predictions without further training.
[2022-12-20 23:08:24,981] [    INFO] - Model config BertConfig { ...same config as above... }
[2022-12-20 23:08:24,984] [    INFO] - Configuration saved in /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/config.json
[2022-12-20 23:08:26,234] [    INFO] - All model checkpoint weights were used when initializing BertForSequenceClassification.
[2022-12-20 23:08:26,237] [ WARNING] - Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-wwm-ext-chinese and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

3.2 Defining the model's tokenizer for data processing

PaddleNLP has a built-in tokenizer for every supported pretrained model; specifying the model name is enough to load the matching one. The tokenizer converts raw input text into the numeric input format the model can accept.

tokenizer = paddlenlp.transformers.BertTokenizer.from_pretrained(MODEL_NAME)
[2022-12-20 23:08:26,245] [    INFO] - Downloading http://bj.bcebos.com/paddlenlp/models/transformers/bert/bert-wwm-ext-chinese-vocab.txt and saved to /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese
[2022-12-20 23:08:26,284] [    INFO] - Downloading bert-wwm-ext-chinese-vocab.txt from http://bj.bcebos.com/paddlenlp/models/transformers/bert/bert-wwm-ext-chinese-vocab.txt
100%|██████████| 107k/107k [00:00<00:00, 6.18MB/s]
[2022-12-20 23:08:26,364] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/tokenizer_config.json
[2022-12-20 23:08:26,367] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/bert-wwm-ext-chinese/special_tokens_map.json
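To see concretely what the tokenizer returns, here is a small sketch (not part of the original project) on one training pair:

sample = tokenizer(
    text="艾滋病窗口期会出现腹泻症状吗" + "头疼腹泻四肢无力是不是艾滋病",
    max_seq_len=128)
print(sample["input_ids"])       # token ids; BERT adds [CLS] at the start and [SEP] at the end
print(sample["token_type_ids"])  # all zeros here, since the pair is fed as one concatenated string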

3.3 Data preprocessing

import numpy as np

# Convert one example: concatenate question1 and question2 and tokenize.
# (Passing the questions as text/text_pair instead would give the two sentences
# distinct token_type_ids; the original project simply concatenates them.)
def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    qtconcat = example["question1"] + example["question2"]
    encoded_inputs = tokenizer(text=qtconcat, max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids


# Build a DataLoader: map examples through trans_fn, then batch with batchify_fn
def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)
    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
    return paddle.io.DataLoader(dataset=dataset,
                                batch_sampler=batch_sampler,
                                collate_fn=batchify_fn,
                                return_list=True)


from functools import partial
from paddlenlp.data import Stack, Tuple, Pad

batch_size = 50
max_seq_length = 128

trans_func = partial(convert_example,
                     tokenizer=tokenizer,
                     max_seq_length=max_seq_length)

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids (segment)
    Stack(dtype="int64")                               # label
): [data for data in fn(samples)]

# Build the DataLoaders for the training and validation sets
train_data_loader = create_dataloader(train_ds,
                                      mode='train',
                                      batch_size=batch_size,
                                      batchify_fn=batchify_fn,
                                      trans_fn=trans_func)
dev_data_loader = create_dataloader(dev_ds,
                                    mode='dev',
                                    batch_size=batch_size,
                                    batchify_fn=batchify_fn,
                                    trans_fn=trans_func)

3.4 Setting the fine-tuning optimization strategy and evaluation metric

We create the learning-rate schedule with LinearDecayWithWarmup: the scheduler increases the learning rate linearly from 0 to the given peak value during the warmup period, and after warmup decreases it linearly from the peak back to 0.
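As a concrete illustration (a sketch under assumed numbers, not the project's actual step count), printing the learning rate over a 100-step run with 10% warmup shows the ramp-up and decay:

from paddlenlp.transformers import LinearDecayWithWarmup

# 10 warmup steps (10% of 100): lr ramps 0 -> 4e-5, then decays linearly back to 0
sched = LinearDecayWithWarmup(4e-5, total_steps=100, warmup=0.1)
for step in range(1, 101):
    sched.step()
    if step in (5, 10, 55, 100):
        print(step, sched.get_lr())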

# Define hyperparameters, loss, optimizer, etc.
from paddlenlp.transformers import LinearDecayWithWarmup

# Peak learning rate during training
learning_rate = 4e-5
# Number of training epochs
epochs = 5
# Proportion of steps used for learning-rate warmup
warmup_proportion = 0.1
# Weight decay coefficient, a regularization strategy to reduce overfitting
weight_decay = 0.01

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)

# Exclude bias and LayerNorm parameters from weight decay, following common BERT practice
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

3.5 Model training and evaluation

import numpy as np

# Evaluation function used for validation during training
@paddle.no_grad()  # no backprop, optimizer updates, or gradient clearing needed here
def evaluate(model, criterion, metric, data_loader):
    """Given a dataset, it evals model and computes the metric.

    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        criterion(obj:`paddle.nn.Layer`): It can compute the loss.
        metric(obj:`paddle.metric.Metric`): The evaluation metric.
    """
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
    accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    model.train()
    metric.reset()
    return np.mean(losses), accu


# Model training
import paddle.nn.functional as F

best_loss = float('inf')  # positive infinity
best_accu = 0
global_step = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f"
                  % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

    # Evaluate on the validation split after each epoch
    loss, accu = evaluate(model, criterion, metric, dev_data_loader)
    if best_loss > loss:
        print('best loss from {} to {}'.format(best_loss, loss))
        best_loss = loss
        # Saving uses the PaddleNLP API:
        # https://paddlenlp.readthedocs.io/zh/latest/source/paddlenlp.transformers.model_utils.html?highlight=save_pretrained()#paddlenlp.transformers.model_utils.PretrainedModel.save_pretrained_v2
        model.save_pretrained('./output/best_loss')
        tokenizer.save_pretrained('./output/best_loss')
    if best_accu < accu:
        print('best accuracy from {} to {}'.format(best_accu, accu))
        best_accu = accu
        model.save_pretrained('./output/best_accu')
        tokenizer.save_pretrained('./output/best_accu')

4. Model prediction

# Prediction function
def predict(model, data, tokenizer, label_map, batch_size=1):
    examples = []
    for text in data:
        input_ids, segment_ids = convert_example(text, tokenizer, max_seq_length=128, is_test=True)
        examples.append((input_ids, segment_ids))

    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids (segment)
    ): fn(samples)

    # Group the examples into batches
    batches = []
    one_batch = []
    for example in examples:
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        # the last batch may be smaller than batch_size
        batches.append(one_batch)

    results = []
    model.eval()
    for batch in batches:
        input_ids, segment_ids = batchify_fn(batch)
        input_ids = paddle.to_tensor(input_ids)
        segment_ids = paddle.to_tensor(segment_ids)
        logits = model(input_ids, segment_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results
# Load the best-accuracy checkpoint saved during training
model = paddlenlp.transformers.BertForSequenceClassification.from_pretrained('./output/best_accu')
[2022-12-20 23:14:52,001] [    INFO] - loading configuration file ./output/best_accu/config.json
[2022-12-20 23:14:52,004] [    INFO] - Model config BertConfig {
  "architectures": ["BertForSequenceClassification"],
  "attention_probs_dropout_prob": 0.1,
  "dtype": "float32",
  "fuse": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "paddlenlp_version": null,
  "pool_act": "tanh",
  "type_vocab_size": 2,
  "vocab_size": 21128
}
[2022-12-20 23:14:52,007] [    INFO] - Configuration saved in ./output/best_accu/config.json
[2022-12-20 23:14:53,373] [    INFO] - All model checkpoint weights were used when initializing BertForSequenceClassification.
[2022-12-20 23:14:53,376] [    INFO] - All the weights of BertForSequenceClassification were initialized from the model checkpoint at ./output/best_accu.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForSequenceClassification for predictions without further training.
# Run prediction
label_map = {0: '0', 1: '1'}

# Convert raw dataframe rows into the dict format predict() expects
def preprocess_prediction_data(data):
    examples = []
    for question1, question2 in data:
        examples.append({"question1": question1, "question2": question2})
    return examples

# Format the test set
data1 = list(dev.values)
examples = preprocess_prediction_data(data1)

# Predict on the test set
results = predict(model, examples, tokenizer, label_map, batch_size=batch_size)

# Save the predictions; the output format is a label column
results = pandas.DataFrame(results)
results.columns = ["label"]
results['question1'] = dev['question1']
results['question2'] = dev['question2']
results.to_csv('result.csv', index=False)
results.head()
   label  question1                                       question2
0  0      乳腺癌晚期治疗费用要多少                      乳腺癌症做放疗一次费用要多少钱呢
1  1      得了艾滋病会被隔离么                          艾滋病为什么不被隔离,如果病毒携带
2  1      口腔溃疡接吻会不会传染乙肝呀                  接吻可以传染乙肝吗
3  1      高血压脑病的个案护理?我想知道高血压脑病的个案护理。高血压脑病的护理问题有哪些  老年人患高血压,生活中应注意哪些方面?
4  1      乳腺癌晚期有什么症状                          晚期的乳腺癌的症状是哪些呢?
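As a last sanity check (my addition, not in the original), the predicted label distribution can be compared with the balanced training distribution:

# Count predictions per class; a heavy skew would suggest something went wrong
print(results['label'].value_counts())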

5. Summary

  • This project uses the high-level PaddleNLP API to build a text-matching model
  • Swapping in a different pretrained model is worth trying, although this one already performs well

Thanks to project mentor 张宏理 for the guidance.

This article is a repost.
Original project link:

http://www.lbrq.cn/news/1109287.html
