BLOG POST

The limitations of named entity recognition in the era of LLMs, NLP and text mining for drug discovery: the case study of P40

Hannah Percival, Product Manager and Chris Morris, Lead Data Scientist

The Named Entity Recognition (NER) challenge for drug discovery

Prior to genomic sequencing and modern-day proteomic techniques, the only easily observable property of a protein was its molecular weight. So it was reasonable to use “p40” to identify a novel 40-kilodalton protein, at least until its function was deduced and could inform a more meaningful name. Because many unrelated proteins share a similar molecular weight, such names are ambiguous, and named entity recognition for genes and proteins creates challenges for drug discovery researchers. Luckily, these challenges can be overcome by NLP and LLM-curated text mining approaches such as those offered by Biorelate.

As an example, the HUGO Gene Nomenclature Committee (HGNC) recognises p40 as a synonym for TP63 (HGNC:15979), ARMH1 (HGNC:34345), RPSA (HGNC:6502), LANCL1 (HGNC:6508), and H3P28 (HGNC:54461). When we take capitalisation of the “P” into account, “P40” is also a recognised synonym for ARHGEF2 (HGNC:682), PSMD7 (HGNC:9565), EBNA1BP2 (HGNC:15531), and IL9 (HGNC:6029).
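To make the ambiguity concrete, here is a minimal Python sketch of a naive alias lookup. The alias table is transcribed from the paragraph above, taken at face value; the function names (build_index, lookup) are hypothetical and are not part of any HGNC or Biorelate API.

```python
from collections import defaultdict

# Alias records transcribed from the text above:
# (alias as written, gene symbol, HGNC ID).
ALIAS_RECORDS = [
    ("p40", "TP63", "HGNC:15979"),
    ("p40", "ARMH1", "HGNC:34345"),
    ("p40", "RPSA", "HGNC:6502"),
    ("p40", "LANCL1", "HGNC:6508"),
    ("p40", "H3P28", "HGNC:54461"),
    ("P40", "ARHGEF2", "HGNC:682"),
    ("P40", "PSMD7", "HGNC:9565"),
    ("P40", "EBNA1BP2", "HGNC:15531"),
    ("P40", "IL9", "HGNC:6029"),
]


def build_index(records, case_sensitive=True):
    """Map each alias to every gene that claims it (hypothetical helper)."""
    index = defaultdict(list)
    for alias, symbol, hgnc_id in records:
        key = alias if case_sensitive else alias.lower()
        index[key].append((symbol, hgnc_id))
    return dict(index)


def lookup(index, mention, case_sensitive=True):
    """Return all candidate genes for a mention found in text."""
    key = mention if case_sensitive else mention.lower()
    return index.get(key, [])


strict = build_index(ALIAS_RECORDS, case_sensitive=True)
loose = build_index(ALIAS_RECORDS, case_sensitive=False)

strict_hits = lookup(strict, "p40")
loose_hits = lookup(loose, "p40", case_sensitive=False)

# A single mention never resolves to a single gene in either mode.
print(len(strict_hits), len(loose_hits))  # 5 9
```

Even the strict, case-sensitive lookup returns five candidate genes, and authors do not capitalise consistently, so a dictionary match alone cannot decide which gene a given “p40” refers to.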

Complex by name, complex by nature

NER becomes even more challenging when combining terms from multiple ontologies and controlled vocabularies. For example, a subunit of the IL-12 protein complex - IL12B (HGNC:5970) - has the synonym “interleukin 12, p40” in HGNC. The IL-12 complex can be either heterodimeric (IL12A and IL12B) or homodimeric (IL12B and IL12B). The heterodimer is known as p70, and the homodimer is also known as p40. Gene Ontology (GO) lists p40 as an alias for the IL-12 complex itself (GO:0043514), which it defines as the heterodimeric, “p70” form. Some authors use the unofficial but less ambiguous name IL12p40 for either the gene product or the homodimer. IL12B also functions as a subunit of IL-23, where it dimerises with IL23A. Adding even more ambiguity, GO also lists p40 as an alias for the IL-23 complex (GO:0070743).
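The cross-vocabulary problem can be sketched in the same way. The Python example below merges the three HGNC and GO entries mentioned above into one lookup and reports how each one matches the mention “p40”. The Entry class, the "gene"/"complex" labels, and the exact/partial matching rule are illustrative assumptions, not the behaviour of HGNC, GO, or Biorelate tooling.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    source: str      # vocabulary the entry comes from: "HGNC" or "GO"
    identifier: str  # identifier as quoted in the text above
    label: str       # primary name
    kind: str        # "gene" or "complex" (illustrative labels)
    aliases: tuple   # aliases mentioned in the text above


# Only the entries and aliases named in the paragraph above.
ENTRIES = [
    Entry("HGNC", "HGNC:5970", "IL12B", "gene", ("interleukin 12, p40",)),
    Entry("GO", "GO:0043514", "IL-12 complex", "complex", ("p40",)),
    Entry("GO", "GO:0070743", "IL-23 complex", "complex", ("p40",)),
]


def tokens(text):
    """Split an alias into lower-cased alphanumeric tokens."""
    return set(re.findall(r"[A-Za-z0-9]+", text.lower()))


def candidates(mention, entries):
    """Return (entry, match_type) pairs for a mention (hypothetical rule).

    "exact" means the whole alias equals the mention; "partial" means the
    mention is one token inside a longer alias.
    """
    m = mention.lower()
    hits = []
    for entry in entries:
        for alias in entry.aliases:
            if m == alias.lower():
                hits.append((entry, "exact"))
                break
            if m in tokens(alias):
                hits.append((entry, "partial"))
                break
    return hits


hits = candidates("p40", ENTRIES)
for entry, match in hits:
    print(entry.source, entry.identifier, entry.label, entry.kind, match)
```

One mention yields a gene and two different protein complexes. Keeping the source vocabulary and entity type alongside each candidate at least makes the conflict visible, so that surrounding context or a curator can resolve it.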

Today, these usages of “p40” are less common in the literature. However, two research communities still commonly use the term. Firstly, it is often used to refer to otherwise uncharacterised 40 kDa viral proteins. Secondly, in oncology it is the name of an important biomarker, a truncated isoform of TP63 (HGNC:15979) more specifically named ΔNp63. In HGNC, “p51”, “p63”, and “p40” are all official synonyms for this gene.

Ideas for overcoming NER and other challenges caused by data inconsistencies in literature reviews

Biorelate couples its best-in-class NLP technology with a world-class team of curators to ensure the above issues don’t muddy the insights extracted from the literature. As the examples above show, some of the complexity lies in the biology itself, whereas some is avoidable confusion in the literature.

In the future, a helpful fix would be for reviewers and publishers to offer more robust guidance to authors on using unambiguous names. For example, it helps a lot when a journal requires an abbreviations section.

For now, Biorelate seeks to ensure researchers can still access accurate and reliable insights from the data despite the language inconsistencies in the literature. Further technological advances in NLP and AI may well extend what can be achieved in automatically extracting insights across different public data sources.

Reporting unambiguous IDs for biological entities can facilitate collaboration across the different sub-disciplines of molecular biology, which can lead to new insights and new therapies. For more discussion, see doi:10.1371/journal.pbio.2001414.