Biomedical Named Entity Recognition: Navigating the complexities of biomedical language, and the promise of Large Language Models (LLMs)

Hannah Percival, PhD, Product Manager, Biorelate

July 16, 2025


Accurate recognition of biomedical entities (genes, cells, diseases, etc.) in scientific literature demands a deep understanding of the field, as scientific meanings can evolve over time. Named entity recognition (NER) is a natural language processing (NLP) task focused on identifying and classifying key entities within text. Human expertise has been crucial for effective NER because the way entities are defined and classified keeps changing, often influenced by new experimental and clinical insights. For example, the term “influenza” once covered any respiratory illness, but was later confined to diseases caused by influenza viruses specifically. This can be a problem for NLP (and even human curators), particularly when processing a corpus of scientific literature published across several decades: the same term could have completely different meanings in two documents published years apart.

When differentiating between entity types, the surrounding context is also important. For example, “insulin” is an endogenous hormone, but the same word is used to refer to a therapy in the form of insulin injections. Similarly, “insulin” could also refer to the gene that encodes it (INS). The decision to classify a mention of “insulin” as a gene, protein, chemical, or drug therefore depends on the context in which it appears. Whilst a human biomedical domain expert could ascertain which is most appropriate, this poses a challenge for traditional NLP methods.

Another challenge is the use of overlapping synonyms across different (but often related) entity types. This can be an unfortunate consequence of the way genes and proteins have historically been named. For example, Duchenne Muscular Dystrophy is a severe neuromuscular disease, often abbreviated to “DMD”. In 1986, researchers identified a gene on the X chromosome that, when mutated, causes Duchenne Muscular Dystrophy (amongst other muscular dystrophies) (Kunkel, 1986). The protein it encodes was named “dystrophin”, but even today, the official gene symbol is “DMD”, which is identical to the acronym used to refer to the disease. Whilst an author might define the abbreviation at the start of a publication, NLP methods that rely on string matching within individual sentences struggle with such examples, because the sentence alone rarely says which meaning is intended.
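To make the string-matching problem concrete, here is a minimal Python sketch of a dictionary-based matcher. The lexicon, the function name `dictionary_match`, and the example sentence are all hypothetical and written for illustration only; this is not Biorelate's method or any particular library's API.

```python
import re

# Hypothetical toy lexicon: each surface string maps to every
# entity type it could denote (an assumption for illustration).
LEXICON = {
    "DMD": {"gene", "disease"},
    "dystrophin": {"protein"},
    "Duchenne Muscular Dystrophy": {"disease"},
}

def dictionary_match(sentence):
    """Return (text, start, candidate_types) for every lexicon hit."""
    hits = []
    for term, types in LEXICON.items():
        # Whole-word, case-sensitive match of the lexicon term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, sentence):
            hits.append((m.group(0), m.start(), sorted(types)))
    # Report hits in the order they appear in the sentence.
    return sorted(hits, key=lambda h: h[1])

# Invented example sentence in which "DMD" is used in both senses.
sentence = "A novel splicing mutation in DMD was identified in a patient with DMD."
hits = dictionary_match(sentence)
for text, start, types in hits:
    print(text, start, types)
```

Both mentions of “DMD” come back with the same two candidate types, gene and disease. The matcher has no way to tell that the first refers to the gene and the second to the disease, because it never looks at the surrounding words.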

One of the hardest NER challenges in this domain is distinguishing between a gene, mRNA, and protein of the same name (e.g. PTEN). This is not an issue when using structured datasets, because the nature of the dataset (genomics, transcriptomics, or proteomics, respectively) tends to determine the entity type. In the literature, however, an author will often refer to all three forms using the same string of text (perhaps with subtle differences in capitalisation or use of italics, if you’re lucky). This is a challenge even for human curators, but they may detect contextual hints that help to resolve the ambiguity, for example the gene form being bound by a transcription factor, an mRNA transcript being inhibited by miRNA, or a protein being phosphorylated. This level of complexity is far beyond the capabilities of traditional NLP approaches.
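The contextual hints described above can be sketched as a toy keyword heuristic in Python. The cue lists, the function name `guess_form`, and the example sentences are assumptions made up for this illustration. A rule set this simple is not how a curator or a trained model works, and it breaks down as soon as the wording varies.

```python
# Hypothetical cue substrings, loosely based on the hints described
# in the text (assumptions for illustration, not a curated resource).
CUES = {
    "gene": ("transcription factor", "promoter"),
    "mRNA": ("mirna", "microrna", "mrna"),
    "protein": ("phosphorylat", "kinase"),
}

def guess_form(sentence):
    """Guess gene / mRNA / protein by counting cue substrings."""
    lowered = sentence.lower()
    scores = {
        form: sum(cue in lowered for cue in cues)
        for form, cues in CUES.items()
    }
    best = max(scores.values())
    if best == 0:
        return "unresolved"  # no cue found at all
    winners = [form for form, score in scores.items() if score == best]
    # A tie between forms is also treated as unresolved.
    return winners[0] if len(winners) == 1 else "unresolved"

print(guess_form("A transcription factor binds the PTEN promoter."))  # gene
print(guess_form("A miRNA inhibits the PTEN transcript."))            # mRNA
print(guess_form("PTEN is phosphorylated by a kinase."))              # protein
print(guess_form("PTEN is frequently altered in cancer."))            # unresolved
```

The last sentence shows the limitation: without an explicit cue word the heuristic has nothing to go on, whereas a human reader would draw on the wider document and on domain knowledge.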

The promise of LLMs

Manual curation and conventional NLP techniques have their drawbacks. Yet the rise of advanced AI and LLM technologies presents a promising new approach, with the ability to address these challenges through a deeper understanding of complex human language. Figure 1 shows an abstract from a journal article centred around the discovery of a new splice variant of the DMD gene, identified in a patient with Duchenne Muscular Dystrophy (Wen et al., 2023). An LLM-based NER model, fine-tuned by Biorelate, can disambiguate between mentions where “DMD” is used interchangeably to refer to both the gene and the disease. The model uses the surrounding context to make this distinction, and does so despite the authors’ inconsistent use of italics versus roman text when referring to the DMD gene. This example demonstrates the potential of LLMs to enable NER at scale with the quality of a biomedical domain expert. By accurately harnessing existing knowledge from scientific literature, we can enhance our understanding of disease mechanisms, drug targets, and clinical treatment responses. Read more about the promise of LLMs in this blog post.

**Figure 1. A fine-tuned, LLM-based NER model can disambiguate between mentions where “DMD” is used to refer to the gene and the disease interchangeably.** The image shows the title and abstract from an article (Wen et al., 2023), processed by one of Biorelate’s fine-tuned LLM-based NER models. Purple and orange highlighted text represent mentions that the model has identified as a disease and a gene, respectively. (All other text mentions recognised as biomedical entities by the model have been excluded from this visualisation for simplicity.)

References
Kunkel, L.M. and co-authors, 1986. Analysis of deletions in DNA from patients with Becker and Duchenne muscular dystrophy. Nature, 322 (6074), pp.73-77.

Wen, Y., Yang, L., Shen, G., Dai, S., Wang, J. and Wang, X., 2023. A novel splicing mutation identified in a DMD patient: a case report. Frontiers in Pediatrics, 11, p.1261318.
