Knowledge Management Tool: A Large Language Model-Based Search Engine

Speaker(s)

Gao W¹, Merrill C², Texeira BC², Weissmueller N², Gao C³, Bao Y², Anstatt D²
¹Bristol Myers Squibb, Buffalo Grove, IL, USA, ²Bristol Myers Squibb, Princeton Pike, NJ, USA, ³Bristol Myers Squibb, Princeton, NJ, USA

OBJECTIVES: To assess whether a large language model (LLM) can enhance text-based search within existing BMS document libraries by powering a search function that delivers comprehensive answers to user questions that would otherwise typically require interaction with researchers.

METHODS: In the initial phase, more than 700 BMS observational research protocols were collected and preprocessed. Protocol information was extracted and separated by section before being used for causal language modeling (a method of predicting the next word). Preprocessing included text cleaning, tokenization, stopword removal, and vectorization. The corpus was used to train a recently released LLM, specifically a fine-tuned Mistral 7B model. The frontend was built with Streamlit as an interactive web app featuring a query field and a feedback section. Power users tested the tool rigorously, their feedback was systematically collected, and the interface was subsequently improved based on that feedback.
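The preprocessing steps named above (text cleaning, tokenization, stopword removal, vectorization) can be illustrated with a minimal sketch. It is not the authors' pipeline: the stopword list, the function names, and the choice of simple bag-of-words count vectors are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption, for illustration only).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "was", "with"}


def clean(text):
    """Lowercase the text and replace every character that is not a
    lowercase letter, digit, or whitespace with a space."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text):
    """Split cleaned text into whitespace-delimited tokens."""
    return clean(text).split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def vectorize(sections):
    """Turn protocol sections into bag-of-words count vectors.

    Returns the sorted vocabulary and one count vector per section,
    with vector positions aligned to the vocabulary order.
    """
    token_lists = [remove_stopwords(tokenize(s)) for s in sections]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


# Made-up example sections, not taken from any real protocol.
sections = [
    "The study used the claims database.",
    "Variables: age, sex, and database source.",
]
vocab, vectors = vectorize(sections)
```

Count vectors are used here only because they are the simplest form of vectorization; an embedding model or a subword tokenizer would be an equally plausible reading of the abstract.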

RESULTS: The tool proved proficient at conducting research, generating contextual answers and referencing source material within seconds. It effectively addressed user inquiries about databases, variables, methods, and other commonly asked research questions. The feedback section proved valuable, allowing for continuous refinement of the LLMs. Additional fine-tuning was conducted on the base models using Quantized Low-Rank Adapters (QLoRA) on a corpus of protocol information. After training was completed, the quantized low-rank adapters were merged into the base model.
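The adapter merge mentioned above can be sketched with the standard low-rank adapter update, in which a trained pair of small matrices B and A is folded into a base weight matrix W as W + (alpha / r) * B A. The sketch below uses plain Python lists and made-up numbers; it omits the 4-bit quantization of the base weights that QLoRA uses during training, and the function names and the alpha value are assumptions rather than details from the abstract.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, a, b, alpha):
    """Fold a low-rank adapter into a base weight matrix.

    w: base weights, shape (d, k)
    a: adapter matrix A, shape (r, k)
    b: adapter matrix B, shape (d, r)
    alpha: scaling hyperparameter (the update is scaled by alpha / r)
    """
    r = len(a)  # rank = number of rows in A
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update, shape (d, k)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example with rank r = 1 (values are illustrative only).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.5]]
merged = merge_lora(w, a, b, alpha=2.0)
```

After merging, a single matrix multiplication with the merged weights gives the same output as running the base weights and the adapter separately, so inference needs no extra adapter pass.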

CONCLUSIONS: This integrated system transforms how we gather and leverage information, simplifies research, and supercharges productivity. It can be used to expedite onboarding for new team members, shorten their learning curve, and offer quick access to information and troubleshooting tips.

Code

RWD151

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas

ISPOR 2024

May 5-8, 2024