Biomedical Ontology Retrieval-Augmented Language Models using django-snomed-ct and OgbujiPT

Chimezie Ogbuji
12 min read · Aug 31, 2023

As described previously, I’ve slowly gravitated back to my core intellectual passion: clinical research informatics. Most recently, I have been working on Home Case Mixer, a web-based application for performing exhaustive Home Care case-mix analysis, and this has involved heavy use of SNOMED via django-snomed-ct, which was unreleased at that time. This slow gravitation back to development and research I did in a previous professional life came to a blazing crescendo via recent work with Oori, a team of expert data engineers working with Large Language Models (LLMs), led by my brother Uche. In particular, I have started getting familiar with OgbujiPT (I would be lying if I said having our family name built-in wasn’t satisfying).

The world of LLM research and development is a gaping firehose of rapid change. The pace is significantly faster than I had become used to when working in the Semantic Web and Biomedical Ontology development and research fields. As a result, climbing the learning curve has been slow but intellectually stimulating. I’ve been particularly interested in where it overlaps with the work I did with Semantic DB and Cycorp [1], as well as contract work I did with Paul Haley’s Automata, both of which involved connecting logical reasoning systems, biomedical domain ontologies, and natural language processing to realize the kind of expert system question answering promised by systems such as IBM’s Watson.

To help myself rapidly push through the learning curve, I started working on using controlled natural language definitions from biomedical ontologies to facilitate In-Context Retrieval-Augmented Language Modeling (RALM) [2]. This took me way back to earlier work [3] I had done in 2008 on InfixOWL, a native, idiomatic API for constructing and manipulating OWL in Python that builds on the strengths of the Manchester OWL syntax. Back then, I discovered Attempto Controlled English, an ambitious approach to designing Controlled Natural Languages (CNLs): subsets of natural languages restricted so that they can be translated automatically into formal logic while remaining readable by humans.

So, via trial and error, research, and development with OgbujiPT and django-snomed-ct, I began to look into the possibility of using CNL SNOMED-CT phrases for medical concepts of interest as the external corpus that provides context for questions posed to the LLMs. Normally, such approaches only include fragments from the corpus that match the user-specified question based on cosine similarity. However, this may exclude information that is logically related to the question but not syntactically similar.
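To make that limitation concrete, here is a minimal sketch of similarity-based retrieval. It is not code from my script: the three-dimensional vectors are invented for illustration (a real pipeline would get them from an embedding model), and cosine_similarity and top_k are hypothetical helper names rather than OgbujiPT functions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=1):
    """Return the k corpus texts whose vectors are most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy (text, vector) pairs; the vectors are made up, not real embeddings
corpus = [
    ("Essential hypertension is a hypertensive disorder", [0.9, 0.1, 0.0]),
    ("Angioedema is characterized in form by some angioedema", [0.1, 0.9, 0.2]),
    ("Drug-induced lesion is a disease and a drug-related disorder", [0.2, 0.3, 0.9]),
]
question_vec = [0.8, 0.2, 0.1]  # pretend embedding of a question about hypertension

print(top_k(question_vec, corpus, k=1))
```

With k=1, only the statement closest to the question survives. A statement that is logically relevant but phrased differently gets a low score and never reaches the model, which is the gap the ontology-driven approach tries to close.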

It was this effort that led me to add a CNL module to django-snomed-ct and then finally release it to PyPI (version 0.5.2) so it can be easily installed on any Python system via:

$ pip install django-snomed-ct

You can refer to the documentation on the GitHub page, but what follows mostly covers how I used these capabilities in my tinkering with language model retrieval augmented with biomedical ontologies.

I don’t have a machine beefy enough to do LLM generation justice. I’m using a salvaged Dell XPS 15 7590 laptop with 32GB of RAM and 4GB of VRAM, and started off using a 4-bit quantization of the TheBloke/Llama-2-13B-GGML model with llama-cpp-python. Eventually, after getting more familiar with the differences in models, their parameters, and how they are trained, I settled on a 4-bit quantization of the TheBloke/medalpaca-13B-GGML model, which comes from MedAlpaca, a collection of language models specifically fine-tuned for biomedical tasks [4].

My implementation so far is encapsulated in a Python script with various options for adjusting the model parameters, how the context is separated into chunks, how patient problems are identified, etc. The script needs to be told which medical problems to consider and allows the user to pose a question against the model using the definitions of the problems as context.

The first part uses django-snomed-ct to collect the medical problems either as ICD 10 codes, SNOMED-CT concept string patterns, or SNOMED-CT identifiers:

if problem_type == 'ICD':
    items = ICD10_Mapping.objects.by_icd_codes(medical_problems).filter(map_rule='TRUE').concepts
elif regex:
    items = Concept.objects.by_full_specified_names(medical_problems, search_type=TextSearchTypes.CASE_INSENSITIVE_REGEX)
    if verbose:
        print("Matched (by pattern): ", items)
else:
    items = Concept.objects.by_ids(medical_problems)

If the problems are specified as ICD 10 codes, the ICD 10 mapping capabilities of django-snomed-ct are used to retrieve the SNOMED-CT concepts mapped to these codes. Problems can also be specified as regular expressions matched against the fully specified names of concepts. Otherwise, they are assumed to be SNOMED-CT identifiers.

For this example, we will use two SNOMED-CT concepts: 371627004 (Angiotensin converting enzyme inhibitor-aggravated angioedema) and 59621000 (Essential hypertension). The CNL module of django-snomed-ct renders their definitions as:

Angiotensin converting enzyme inhibitor-aggravated angioedema 
is an angioedema caused by angiotensin-converting-enzyme inhibitor, a
drug-aggravated angioedema-urticaria, an angiotensin-converting-enzyme inhibitor
adverse reaction, a chemical-induced dermatological disorder, a drug-induced
lesion, an angioedema due to disorder of kinin metabolism, a nonallergic
hypersensitivity disorder, a non-allergic hypersensitivity reaction, an
adverse reaction caused by drug, a hypersensitivity condition, an angioedema,
a non-allergic drug hypersensitivity disorder, and an allergic drug reaction.

It is a hypersensitivity process, is characterized in form by some angioedema,
is caused by some substance with angiotensin-converting enzyme inhibitor
mechanism of action, and is caused by some kallidin i.

Essential hypertension is a hypertensive disorder, systemic arterial. It is
an interpretation of blood pressure as increased. It is located in some
systemic circulatory system structure and entire cardiovascular system.

The SNOMED-CT concepts and properties are indicated in bold and italics.

The following function uses django-snomed-ct to collect the CNL sentences that define a SNOMED-CT concept, includes any textual definitions that exist, breaks the text into chunks using text_splitter from ogbujipt.text_helper, and adds the chunks to an in-memory Qdrant collection created using qdrant_collection from ogbujipt.embedding_helper.

def extract_snomed_concept(concept, chunk_size, chunk_overlap, collection):
    text = ''
    # Start with any active SNOMED-CT textual definitions, labeled with the concept name
    definitions = concept.definitions().filter(active=True)
    if definitions.exists():
        for defn in definitions:
            text += f'\n{concept.fully_specified_name_no_type}: {defn.term}'
    # Append the controlled natural language rendering of the logical definition
    text += ControlledEnglishGenerator(concept).get_controlled_english_definition(embed_ids=False)
    # Split on newlines into chunks and add them to the Qdrant collection
    chunks = text_splitter(text, chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator='\n')
    collection.update(texts=chunks)
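For readers unfamiliar with chunking, the sketch below is a simplified stand-in that shows roughly what chunk_size, chunk_overlap, and separator control. It is not OgbujiPT's text_splitter implementation, whose behavior may differ in the details; simple_chunks is a hypothetical helper written only for illustration.

```python
def simple_chunks(text, chunk_size, chunk_overlap, separator='\n'):
    """Greedily pack separator-delimited pieces into chunks of roughly
    chunk_size characters. Each new chunk begins with the last
    chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ''
    for piece in pieces:
        candidate = f'{current}{separator}{piece}' if current else piece
        if len(candidate) <= chunk_size or not current:
            # Still fits (or it is the first piece of a chunk): keep packing
            current = candidate
        else:
            # Close the current chunk and seed the next one with its tail
            chunks.append(current)
            tail = current[-chunk_overlap:] if chunk_overlap else ''
            current = (tail + separator + piece) if tail else piece
    if current:
        chunks.append(current)
    return chunks

print(simple_chunks('aaaa\nbbbb\ncccc', chunk_size=9, chunk_overlap=2))
```

The overlap means a sentence cut at a chunk boundary still shares some text with its neighbor, so a similarity search has a better chance of matching either side.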

Rather than just adding textual definitions for the explicitly given problems, the script also has an option to include the more general SNOMED-CT concepts that the given problems specialize according to their logical definitions (see section 4.2, Class Hierarchies, of the OWL 2 primer), provided these more general concepts also define a property restriction. This excludes more general, placeholder SNOMED-CT concepts that have no logical definition to translate into controlled natural language that the LLM can use as context. In the code below, items is a query set of Concepts, each specified as a medical problem.

item_ids = set().union(items.ids)
problem_names = [c.fully_specified_name_no_type.lower() for c in items]
for concept in items:
    item_ids.update(concept.isa.filter(source_relationships__isnull=False).ids)
items = Concept.objects.by_ids(item_ids)
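The same idea can be sketched without Django, using plain dictionaries. Everything here is a toy stand-in: PARENTS and ATTRIBUTES are hypothetical structures, 'hypothetical grouper concept' is an invented placeholder, and expand_with_defined_parents is not a django-snomed-ct function. The sketch assumes that a parent counts as defined when it has at least one attribute, which only approximates the filter in the query above.

```python
# Toy is-a hierarchy: concept -> direct parents
PARENTS = {
    'essential hypertension': [
        'hypertensive disorder, systemic arterial',
        'hypothetical grouper concept',  # invented placeholder with no attributes
    ],
}

# Toy defining attributes (property restrictions) per concept
ATTRIBUTES = {
    'hypertensive disorder, systemic arterial': [
        ('is located in', 'systemic circulatory system structure'),
    ],
    'hypothetical grouper concept': [],
}

def expand_with_defined_parents(problems, parents, attributes):
    """Return the problems plus every direct parent that has at least
    one defining attribute; parents without attributes are skipped."""
    expanded = set(problems)
    for concept in problems:
        for parent in parents.get(concept, []):
            if attributes.get(parent):
                expanded.add(parent)
    return expanded

print(sorted(expand_with_defined_parents(['essential hypertension'], PARENTS, ATTRIBUTES)))
```

The placeholder parent is dropped because it would contribute no CNL sentence beyond its own name, while the defined parent adds a statement the model can actually use.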

With this option, the full context presented to the LLM to generate an answer to a question is (the SNOMED-CT concepts whose definitions were included in this way are indicated in bold):

The patient has the following disorders: essential hypertension and 
angiotensin converting enzyme inhibitor-aggravated angioedema.

Angioedema caused by angiotensin-converting-enzyme inhibitor is a
drug-aggravated angioedema-urticaria, a drug-related disorder, and an
angioedema due to disorder of kinin metabolism. It is caused by some
substance with angiotensin-converting enzyme inhibitor mechanism of action,
is characterized in form by some angioedema, and is caused by some kallidin i

Angiotensin converting enzyme inhibitor-aggravated angioedema is an
angioedema caused by angiotensin-converting-enzyme inhibitor, a
drug-aggravated angioedema-urticaria, an angiotensin-converting-enzyme inhibitor adverse reaction,
a chemical-induced dermatological disorder, a drug-induced lesion,
an angioedema due to disorder of kinin metabolism, a nonallergic hypersensitivity disorder,
a non-allergic hypersensitivity reaction, an adverse reaction caused by drug,
a hypersensitivity condition, an angioedema,
a non-allergic drug hypersensitivity disorder, and an allergic drug reaction.
It is a hypersensitivity process, is characterized in form by some angioedema,
is caused by some substance with angiotensin-converting enzyme inhibitor mechanism of action,
and is caused by some kallidin i

Angiotensin-converting-enzyme inhibitor adverse reaction is
an antihypertensive adverse reaction. It is caused by some
substance with angiotensin-converting enzyme inhibitor mechanism of action

Angioedema due to disorder of kinin metabolism is an angioedema. It is characterized in form by some
angioedema and is caused by some kallidin i

Essential hypertension is a hypertensive disorder, systemic arterial. It is an
interpretation of blood pressure as increased. It is located in some
entire cardiovascular system and systemic circulatory system structure

Angioedema is an acute allergic reaction, a disorder characterized by edema,
an angioedema and/or urticaria, and a noninfectious, erythematous, papular and/or squamous disease.
It is characterized in form by some angioedema

Drug-aggravated angioedema-urticaria is an allergic drug reaction, a
drug-induced angioedema-urticaria, an angioedema, an eruption caused by drug, a
drug-induced dermatosis, and a chemical-aggravated angioedema-urticaria.
It is caused by some drug or medicament, is characterized in form by some
angioedema and urticaria, and is a hypersensitivity process. It is located in some
entire integumentary system, skin structure, and entire skin and subcutaneous tissue

Hypertensive disorder, systemic arterial is a finding of cardiovascular measurement,
a systemic arterial finding, a disorder of artery, and a disorder of cardiovascular system.
It is an interpretation of blood pressure as increased. It is located in some
entire cardiovascular system and systemic circulatory system structure

Drug-induced lesion is a disease and a drug-related disorder. It is characterized in form
by some lesion and is caused by some drug or medicament

Non-allergic drug hypersensitivity disorder is a non-allergic hypersensitivity condition,
a nonallergic hypersensitivity disorder, a disease, and a drug-related disorder.
It is caused by some drug or medicament and is caused by some non-allergic hypersensitivity reaction

Hypersensitivity condition: The disposition to develop an allergic or pseudoallergic reaction,
the reaction itself or its consequences.Hypersensitivity condition is a clinical finding and a disease


Adverse reaction caused by drug: All noxious and unintended responses to a
medicinal product related to any dose should be considered adverse drug reactions
(from US FDA Guideline for Industry, Clinical Safety Data Management: Definitions and Standards for
Expedited Reporting).


Adverse reaction caused by drug is a complication, an adverse reaction caused by substance,
and a drug-related disorder. It is caused by some drug or medicament

Non-allergic hypersensitivity reaction: A pathological nonimmune process generally
directed towards a foreign substance, which results in tissue injury, which is
usually transient. It is the realization of the pseudoallergic disposition.
A variety of mechanisms such as direct histamine release, complement activation,
cyclooxygenase activation and bradykinin generation may be involved.


Non-allergic hypersensitivity reaction is a hypersensitivity reaction and
a non-allergic hypersensitivity condition. It is a non-allergic hypersensitivity process

Chemical-induced dermatological disorder is a noninfectious,
vesicular and/or bullous disease and a disorder of skin and/or subcutaneous tissue.
It is located in some skin and/or subcutaneous tissue structure and is caused by some chemical

Note that the SNOMED-CT textual definitions for Adverse reaction caused by drug and Hypersensitivity condition were included in the context (in bold and italics) rather than generated by the CNL module.

The following question was submitted with this context to locally hosted, 4-bit quantized instances of the TheBloke/medalpaca-13B-GGML [4], TheBloke/Nous-Hermes-13B-GGML, and localmodels/WizardLM-13B-v1.1-ggml models:

Explain step by step if the patient should continue to use ACE-inhibitor medication to manage their essential hypertension

These are the model parameters that were used:

  • temperature: 0.7
  • top_p: .1
  • top_k: 40
  • repeat_penalty: 1.2
  • typical_p: 1.0
  • frequency_penalty: 0
  • presence_penalty: 1
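To give a feel for what top_k and top_p do, here is a textbook-style sketch of the two filters applied to an invented next-token distribution. This is an assumption-laden illustration, not how any particular runtime implements sampling: real implementations differ in the order of the filters and in how temperature and the penalties interact with them, and nucleus_filter is a hypothetical helper.

```python
def nucleus_filter(probs, top_k=40, top_p=0.1):
    """Keep the top_k most likely tokens, then the smallest prefix of
    those whose cumulative probability reaches top_p, and renormalize
    the survivors so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Invented probabilities for the next token
toy = {'continue': 0.45, 'stop': 0.35, 'consult': 0.15, 'maybe': 0.05}

print(nucleus_filter(toy, top_k=40, top_p=0.1))
print(nucleus_filter(toy, top_k=40, top_p=0.9))
```

In this toy case a top_p of 0.1 leaves only the single most likely token, so under these assumptions the settings above tend to behave close to greedy decoding even with a temperature of 0.7.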

Below are the results from each model.

TheBloke/medalpaca-13B-GGML

The response was:

As medical assistant I can understand your concern for your patient. For managing essential hypertension in patients with angioedema due to disorder of kinin metabolism, you can advise them ACE inhibitors such as captopril, enalapril maleate etc. These are safe even in presence of angioedema. You should educate the patient about importance of blood pressure control and low salt diet. In case of uncontrolled BP despite lifestyle modifications and twice daily medications then third step would be once daily CCBs like verapamil or diltiazem. If there are any signs and symptoms of acute decompensated heart failure (e.g shortness of breath, orthopnea) then start furosemide or lasix. Before prescribing any medicine it is better to do drug reconciliation so that no adverse interactions between existing medicines and new one exists. Hope this answers your question. If you have more questions then don’t hesitate to write back.

localmodels/WizardLM-13B-v1.1-ggml

The response was:

Based on the patient’s medical history with essential hypertension and angiotensin converting enzyme (ACE) inhibitor-aggravated angioedema, it would not be advisable to continue using an ACE inhibitor medication for managing their high blood pressure. The presence of angioedema aggravated by ACE inhibitors indicates that there might be another underlying issue causing this adverse drug reaction. In addition, discontinuing the use of ACE inhibitors can help prevent further complications related to angioedema. Instead, alternative antihypertensive medications should be considered after consulting with a healthcare professional to ensure proper management of the patient’s essential hypertension while avoiding potential risks associated with ACE inhibitors.

TheBloke/Nous-Hermes-13B-GGML

The response was:

Based on the information provided, it seems that you are asking if the patient should continue to use their angiotensin-converting enzyme (ACE) inhibitor medication for managing their essential hypertension despite having been diagnosed with both aggravated angioedema due to an ACE inhibitor and essential hypertension.

As a medical assistant, I would recommend consulting with your healthcare provider before making any changes to your treatment plan. They will consider factors such as the severity of your symptoms, potential side effects or complications from stopping the medication abruptly, and alternative treatments options available. Your doctor may also want to discuss other lifestyle modifications, such as dietary changes or increased physical activity, which can help manage blood pressure without relying solely on medication.

Some additional things were implemented for the sake of further optimization. For example, in SNOMED-CT, the definition of Essential hypertension only restricts the finding site attribute to the Systemic circulatory system structure (body structure) concept. However, the rendered CNL phrase includes entire cardiovascular system as an additional location for that disorder. This is because SNOMED-CT defines the systemic circulatory system structure (body structure) concept as a part of the entire cardiovascular system. The CNL module takes advantage of the useful biomedical ontology axiom that a disorder located in an anatomical part is also located in the anatomical whole.
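The part-whole axiom can be sketched in a few lines. This is only an illustration of the rule, not the CNL module's implementation: PART_OF is a toy mapping containing just the one relationship mentioned above, it assumes each structure has a single containing whole, and expand_locations is a hypothetical name.

```python
# Toy part-of relation: anatomical part -> the whole that contains it
PART_OF = {
    'systemic circulatory system structure': 'entire cardiovascular system',
}

def expand_locations(finding_sites, part_of):
    """Return the stated finding sites followed by every anatomical
    whole that (transitively) contains one of them."""
    locations = list(finding_sites)
    for site in finding_sites:
        whole = part_of.get(site)
        while whole is not None and whole not in locations:
            locations.append(whole)
            whole = part_of.get(whole)
    return locations

sites = expand_locations(['systemic circulatory system structure'], PART_OF)
print('It is located in some ' + ' and '.join(sites))
```

Applied to Essential hypertension, the single stated finding site yields two locations, which matches the pair of structures that appear in the rendered CNL sentence.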

This was a basic proof of concept for how In-Context Retrieval-Augmentation can be used with context generated from the world's most comprehensive, multilingual clinical healthcare terminology. There are many further optimizations and considerations that I intend to investigate. For example, the entire SNOMED-CT CNL corpus can be put in a large, scalable Vector DB to facilitate retrieving a complete set of CNL-defining statements by their textual similarity. Alternative prompting language could be explored to determine if it helps the LLM generate more coherent and useful responses. Existing models can be fine-tuned with the entire SNOMED-CT CNL corpus or a domain-specific subset rather than using a very small sample as a source for context-based retrieval augmentation alone. The resulting SNOMED-CT-trained LLMs could free up the context space for patient-specific data.

Finally, the semantics of the Description Logic that SNOMED-CT is based on could be used more robustly (possibly with off-the-shelf OWL reasoning systems) to generate relevant CNL statements that neither syntactic comparison nor the very rudimentary logical entailment used in this example would include.

I have uploaded the script I used to my ResearchGate account. It can be found here. It depends only on OgbujiPT, django-snomed-ct, and sentence-transformers.

The script pulls the prompt components from a separate configuration file in the TOML format that is read using OgbujiPT’s recently added Word Loom functionality. The configuration file content is:

lang = "en"

[system_prompt]
text="""
You are a helpful medical assistant who answers questions directly for a medical professional. Use the context to answer your question. If you cannot answer with the given context, just say so.
"""

[question]
text = """
explain step by step if the patient should continue to use ACE-inhibitor medication to manage their essential hypertension
"""

When saved locally to a file named prompt.toml, the script can be run like this (assuming a Linux or macOS command prompt) against a locally hosted instance of these models:

$ python medical_problems_chat.py --limit 40 --temperature .7 --generalization \
--problem-type SNOMED prompt.toml 371627004 59621000

[1] Pierce, Christopher D., David Booth, Chimezie Ogbuji, Chris Deaton, Eugene Blackstone, and Doug Lenat. “SemanticDB: A semantic web infrastructure for clinical research and quality reporting.” Current Bioinformatics 7, no. 3 (2012): 267–277.

[2] Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., & Shoham, Y. (2023). In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083.

[3] Ogbuji, C. (2008). InfixOWL: An Idiomatic Interface for OWL. In OWLED (Vol. 432).

[4] Han, T., Adams, L. C., Papaioannou, J. M., Grundmann, P., Oberhauser, T., Löser, A., … & Bressem, K. K. (2023). MedAlpaca — An Open-Source Collection of Medical Conversational AI Models and Training Data. arXiv preprint arXiv:2304.08247.
