Title (eng)

Zero- and few-shot machine learning for named entity recognition in biomedical texts

Author

Košprdić, Miloš
Prodanović, Nikola
Ljajić, Adela
Bašaragin, Bojana
Milošević, Nikola

Publisher

Institut za molekularnu genetiku i genetičko inženjerstvo

Description (eng)

Named entity recognition (NER) is an NLP task that involves identifying and classifying named entities in text. Token classification is a crucial subtask of NER in which labels are assigned to individual tokens within a text, indicating the named entity category to which they belong. Fine-tuning large language models (LLMs) on labeled domain datasets has emerged as a powerful technique for improving NER performance. By training a pretrained LLM such as BERT on domain-specific labeled data, the model learns to recognize named entities specific to that domain with high accuracy. This approach has been applied to a wide range of domains, including the biomedical domain, and has demonstrated significant improvements in NER accuracy. Still, fine-tuning pre-trained LLMs requires large amounts of data, and labeling is a time-consuming and expensive process that requires expert domain knowledge. In addition, domains with an open set of classes pose difficulties for traditional machine learning approaches, since the number of classes to predict needs to be pre-defined. Our solution to these two problems is based on a data transformation that factorizes the initial multi-class classification problem into a binary one, combined with a cross-encoder-based BERT architecture for zero- and few-shot learning. To create our dataset, we transformed six widely used biomedical datasets containing various biomedical entities such as genes, drugs, diseases, adverse events, and chemicals into a uniform format. This transformation enabled us to merge the datasets into a single cohesive dataset of 26 named entity classes. We then fine-tuned two pre-trained language models, BioBERT and PubMedBERT, for the NER task in zero- and few-shot settings. The results of the experiment for 9 classes in zero-shot mode are promising for semantically similar classes and improve significantly after providing only a few supporting examples for almost all classes.
The best results were obtained using a fine-tuned PubMedBERT model, with average F1 scores of 35.44%, 50.10%, 69.94%, and 79.51% for zero-shot, one-shot, 10-shot, and 100-shot NER respectively.
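The factorization described in the abstract can be illustrated with a minimal sketch. The function below (a hypothetical helper, not the authors' code) shows one plausible way to turn a multi-class token-labeled sentence into per-class binary examples, each pairing a candidate class name with the sentence, as a cross-encoder input would:

```python
# Minimal sketch of factorizing multi-class NER into binary token
# classification: for each candidate entity class, emit one example
# pairing the class name with the sentence, where token labels become
# 1 (token belongs to that class) or 0 (it does not).
# The example data and function name are illustrative assumptions.

def to_binary_examples(tokens, labels, class_names):
    """For each class, produce (class_name, tokens, binary_labels)."""
    examples = []
    for cls in class_names:
        binary = [1 if lab == cls else 0 for lab in labels]
        examples.append((cls, tokens, binary))
    return examples

tokens = ["Aspirin", "treats", "headache"]
labels = ["Drug", "O", "Disease"]
examples = to_binary_examples(tokens, labels, ["Drug", "Disease"])
# Each example can then be encoded as a cross-encoder pair, e.g.
# "[CLS] class name [SEP] sentence [SEP]", so unseen classes at
# inference time only require a new class name, not a new output head.
```

Because the label set is expressed through the paired class name rather than a fixed output layer, this formulation accommodates an open set of classes, which is what enables the zero- and few-shot setting.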

Language

English

Date

2023

License

© All rights reserved

Subject

Key words: zero-shot learning, machine learning, deep learning, natural language processing, biomedical named entity recognition

Part of collection (1)

o:1610 Works of associates of the Institute for Artificial Intelligence