Biomedical Named Entity Recognition: Navigating the complexities of biomedical language, and the promise of Large Language Models (LLMs)

Accurate recognition of biomedical entities (genes, cells, diseases, etc.) in scientific literature demands a deep understanding of the field, as scientific meanings can evolve over time. Named entity recognition (NER) is a natural language processing (NLP) task focused on identifying and classifying key entities within text. Human expertise has been crucial for effective NER because the way entities are defined and classified keeps changing, often in response to new experimental and clinical insights. For example, the term “influenza” once covered any respiratory illness, but was later confined to diseases caused by influenza viruses specifically. This can be a problem for NLP (and even for human curators), particularly when processing a corpus of scientific literature published across several decades; the same term could have completely different meanings in two documents published years apart.
When differentiating between entity types, the surrounding context is also important. For example, “insulin” is an endogenous hormone, but the same word is used to refer to a therapy in the form of insulin injections. Similarly, “insulin” could refer to the gene that encodes the hormone (INS). The decision to classify a mention of “insulin” as a gene, protein, chemical, or drug therefore depends on the context in which it is described. Whilst a human biomedical domain expert could ascertain which is most appropriate, this poses a challenge for traditional NLP methods.
Another challenge is the use of overlapping synonyms across different (but often related) entity types. This can be an unfortunate consequence of the way genes and proteins have historically been named. For example, Duchenne Muscular Dystrophy is a severe neuromuscular disease, often abbreviated to “DMD”. In 1986, researchers identified a gene on the X chromosome that, when mutated, causes Duchenne Muscular Dystrophy (amongst other muscular dystrophies) (Kunkel et al., 1986). The protein it encodes was named “dystrophin”, but even today, the official gene symbol is “DMD”, which is identical to the acronym used to refer to the disease. Whilst an author might define the abbreviation at the start of a publication, NLP methods that rely on string matching within individual sentences struggle with such examples, because the defining context sits outside the sentence being processed.
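To make the string-matching problem concrete, here is a minimal Python sketch of a dictionary-based matcher. The lexicon, the function name match_entities, and the example sentences are all hypothetical and chosen purely for illustration; they do not represent any particular NLP tool. The point is that a sentence-level lookup returns the same candidate types for “DMD” whether the author means the gene or the disease.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only):
# surface form -> set of candidate entity types.
LEXICON = {
    "DMD": {"gene", "disease"},
    "dystrophin": {"protein"},
    "Duchenne Muscular Dystrophy": {"disease"},
}

def match_entities(sentence):
    """Return (surface form, sorted candidate types) for each lexicon hit,
    in order of appearance. No context is used, only the string itself."""
    hits = []
    for term, types in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            hits.append((m.start(), m.group(0), sorted(types)))
    return [(text, types) for _, text, types in sorted(hits)]

# Two made-up sentences: the first uses "DMD" as a gene, the second as a disease.
sentences = [
    "Mutations in DMD disrupt dystrophin expression.",
    "Patients with DMD typically lose ambulation in adolescence.",
]
for s in sentences:
    print(match_entities(s))
```

In both sentences the matcher reports “DMD” with the candidate types disease and gene, so the ambiguity is passed downstream unresolved.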
One of the hardest NER challenges in this domain is distinguishing between a gene, mRNA, and protein of the same name (e.g. PTEN). This is not an issue when using structured datasets because the nature of the dataset tends to determine the entity type, e.g. genomics, transcriptomics, or proteomics, respectively; but in the literature, an author will often refer to all three forms using the same string of text (perhaps with subtle differences in capitalisation or use of italics, if you’re lucky). This is a challenge even for human curators, but they may detect contextual hints that help resolve the mention, for example the gene being bound by a transcription factor, an mRNA transcript being inhibited by miRNA, or a protein being phosphorylated. This is a level of complexity far beyond the capabilities of traditional NLP approaches.
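The contextual hints described above can be written down as hand-crafted rules, which shows both the idea and its fragility. The sketch below is an assumption-laden illustration: the cue patterns, the function name guess_form, and the example sentences are hypothetical, and it is not the method used by any model mentioned in this post.

```python
import re

# Hypothetical cue patterns, taken directly from the examples in the text:
# transcription factor binding -> gene, miRNA inhibition -> mRNA,
# phosphorylation -> protein.
CONTEXT_CUES = {
    "gene": [r"transcription factor"],
    "mRNA": [r"\bmiRNA\b"],
    "protein": [r"phosphorylat"],
}

def guess_form(sentence):
    """Return every molecular form whose cue appears in the sentence,
    or ['unresolved'] when no cue is found."""
    found = [
        form
        for form, patterns in CONTEXT_CUES.items()
        if any(re.search(p, sentence, flags=re.IGNORECASE) for p in patterns)
    ]
    return found or ["unresolved"]

# Made-up sentences for illustration.
examples = [
    "A transcription factor binds the PTEN promoter.",
    "PTEN is inhibited by a miRNA.",
    "PTEN is phosphorylated.",
    "PTEN loss is common in tumours.",
]
for s in examples:
    print(s, "->", guess_form(s))
```

The first three sentences each trigger one cue, while the last is left unresolved. Rules like these cover only the phrasings someone thought to encode, and a sentence containing several cues would return several forms, which is why such approaches scale poorly to real literature.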
The promise of LLMs
Manual curation and conventional NLP techniques have their drawbacks. The rise of advanced AI and LLM technologies, however, presents a promising new approach, with the ability to address these challenges through a deeper understanding of complex human language. Figure 1 shows an abstract from a journal article centred around the discovery of a new splice variant of the DMD gene, identified in a patient with Duchenne Muscular Dystrophy (Wen et al., 2023). An LLM-based NER model, fine-tuned by Biorelate, can distinguish mentions where “DMD” refers to the gene from mentions where it refers to the disease. The model uses the surrounding context to make this distinction, and does so despite the authors’ inconsistent use of italics versus roman text when referring to the DMD gene. This example demonstrates the potential of LLMs to enable NER at scale with the quality of a biomedical domain expert. By accurately harnessing existing knowledge from scientific literature, we can enhance our understanding of disease mechanisms, drug targets, and clinical treatment responses. Read more about the promise of LLMs in this blog post.

References
Kunkel, L.M. et al., 1986. Analysis of deletions in DNA from patients with Becker and Duchenne muscular dystrophy. Nature, 322(6074), pp.73-77.
Wen, Y., Yang, L., Shen, G., Dai, S., Wang, J. and Wang, X., 2023. A novel splicing mutation identified in a DMD patient: a case report. Frontiers in Pediatrics, 11, p.1261318.