Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports.-Z研学术

Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports.

来自 PUBMED

作者：

Yoon HJ ， Klasky HB ， Gounley JP ， Alawad M ， Gao S ， Durbin EB ， Wu XC ， Stroup A ， Doherty J ， Coyle L ， Penberthy L ， Blair Christian J ， Tourassi GD

展开 

摘要：

In machine learning, it is evident that the classification of the task performance increases if bootstrap aggregation (bagging) is applied. However, the bagging of deep neural networks takes tremendous amounts of computational resources and training time. The research question that we aimed to answer in this research is whether we could achieve higher task performance scores and accelerate the training by dividing a problem into sub-problems. The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split a big problem into 20 sub-problems, resampled the training cases 2,000 times, and trained the deep learning model for each bootstrap sample and each sub-problem-thus, generating up to 40,000 models. We performed the training of many models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL). We demonstrated that aggregation of the models improves task performance compared with the single-model approach, which is consistent with other research studies; and we demonstrated that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, which had more than 500 class labels in the task; these results show that data partition may alleviate the complexity of the task. On the contrary, the methods did not achieve superior scores for the tasks of site and subsite classification. Intrinsically, since data partitioning was based on the primary cancer site, the accuracy depended on the determination of the partitions, which needs further investigation and improvement. Results in this research demonstrate that 1. The data partitioning and bagging strategy achieved higher performance scores. 2. We achieved faster training leveraged by the high-performance Summit supercomputer at ORNL.

收起

展开 

DOI：

10.1016/j.jbi.2020.103564

被引量：

年份：

1970

全部来源

SCI-Hub (全网免费下载)

发表链接

ResearchGate (全网免费下载)

钛学术 (全网免费下载)

通过文献互助平台发起求助，成功后即可免费获取论文全文。

查看求助

求助方法1：

知识发现用户

每天可免费求助50篇

求助

求助方法1：

关注微信公众号

每天可免费求助2篇

求助方法2：

求助需要支付5个财富值

您现在财富值不足

您可以通过应助全文获取财富值

求助方法2：

完成求助需要支付5财富值

您目前有 1000 财富值

求助

我们已与文献出版商建立了直接购买合作。

你可以通过身份认证进行实名认证，认证成功后本次下载的费用将由您所在的图书馆支付

您可以直接购买此文献，1~5分钟即可下载全文，部分资源由于网络原因可能需要更长时间，请您耐心等待哦~

身份认证全文购买

相似文献(109)

参考文献(13)

引证文献(5)

Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports.

In machine learning, it is evident that the classification of the task performance increases if bootstrap aggregation (bagging) is applied. However, the bagging of deep neural networks takes tremendous amounts of computational resources and training time. The research question that we aimed to answer in this research is whether we could achieve higher task performance scores and accelerate the training by dividing a problem into sub-problems. The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split a big problem into 20 sub-problems, resampled the training cases 2,000 times, and trained the deep learning model for each bootstrap sample and each sub-problem-thus, generating up to 40,000 models. We performed the training of many models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL). We demonstrated that aggregation of the models improves task performance compared with the single-model approach, which is consistent with other research studies; and we demonstrated that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, which had more than 500 class labels in the task; these results show that data partition may alleviate the complexity of the task. On the contrary, the methods did not achieve superior scores for the tasks of site and subsite classification. Intrinsically, since data partitioning was based on the primary cancer site, the accuracy depended on the determination of the partitions, which needs further investigation and improvement. Results in this research demonstrate that 1. The data partitioning and bagging strategy achieved higher performance scores. 2. We achieved faster training leveraged by the high-performance Summit supercomputer at ORNL.

Yoon HJ ，Klasky HB ，Gounley JP ，Alawad M ，Gao S ，Durbin EB ，Wu XC ，Stroup A ，Doherty J ，Coyle L ，Penberthy L ，Blair Christian J ，Tourassi GD ... - 《-》

被引量: 5 发表:1970年
Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks.

We implement 2 different multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) specifically designed for automatic extraction of cancer data from unstructured text in pathology reports. We show the importance of learning related information extraction (IE) tasks leveraging shared representations across the tasks to achieve state-of-the-art performance in classification accuracy and computational efficiency. Multitask CNN (MTCNN) attempts to tackle document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated the performance on a corpus of 95 231 pathology documents (71 223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the performance of the MTCNN models against single-task CNN models and 2 traditional machine learning approaches, namely support vector machine (SVM) and random forest classifier (RFC). MTCNNs offered superior performance across all 5 tasks in terms of classification accuracy as compared with the other machine learning models. Based on retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports respectively across all 5 tasks. The baseline models achieved 53.68% (CNN), 46.37% (RFC), and 36.75% (SVM). Based on prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). Moreover, hard parameter sharing MTCNNs outperformed the other models in computational efficiency by using about the same number of trainable parameters as a single-task CNN. The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks while maintaining similar training and inference time as those of a single task-specific model.

Alawad M ，Gao S ，Qiu JX ，Yoon HJ ，Blair Christian J ，Penberthy L ，Mumphrey B ，Wu XC ，Coyle L ，Tourassi G ... - 《-》

被引量: 32 发表:2020年
Scalable deep text comprehension for Cancer surveillance on high-performance computing.

Qiu JX ，Yoon HJ ，Srivastava K ，Watson TP ，Blair Christian J ，Ramanathan A ，Wu XC ，Fearn PA ，Tourassi GD ... - 《-》

被引量: - 发表:1970年
Development of message passing-based graph convolutional networks for classifying cancer pathology reports.

Yoon HJ ，Klasky HB ，Blanchard AE ，Christian JB ，Durbin EB ，Wu XC ，Stroup A ，Doherty J ，Coyle L ，Penberthy L ，Tourassi GD ... - 《BMC Medical Informatics and Decision Making》

被引量: - 发表:1970年
Classifying cancer pathology reports with hierarchical self-attention networks.

We introduce a deep learning architecture, hierarchical self-attention networks (HiSANs), designed for classifying pathology reports and show how its unique architecture leads to a new state-of-the-art in accuracy, faster training, and clear interpretability. We evaluate performance on a corpus of 374,899 pathology reports obtained from the National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) program. Each pathology report is associated with five clinical classification tasks - site, laterality, behavior, histology, and grade. We compare the performance of the HiSAN against other machine learning and deep learning approaches commonly used on medical text data - Naive Bayes, logistic regression, convolutional neural networks, and hierarchical attention networks (the previous state-of-the-art). We show that HiSANs are superior to other machine learning and deep learning text classifiers in both accuracy and macro F-score across all five classification tasks. Compared to the previous state-of-the-art, hierarchical attention networks, HiSANs not only are an order of magnitude faster to train, but also achieve about 1% better relative accuracy and 5% better relative macro F-score.

Gao S ，Qiu JX ，Alawad M ，Hinkle JD ，Schaefferkoetter N ，Yoon HJ ，Christian B ，Fearn PA ，Penberthy L ，Wu XC ，Coyle L ，Tourassi G ，Ramanathan A ... - 《-》

被引量: 21 发表:1970年

加载更多

来源期刊

影响因子：暂无数据

JCR分区：暂无

中科院分区：暂无