Knowledge Management Tool: A Large Language Model-Based Search Engine
Speaker(s)
Gao W1, Merrill C2, Texeira BC2, Weissmueller N2, Gao C3, Bao Y2, Anstatt D2
1Bristol Myers Squibb, Buffalo Grove, IL, USA, 2Bristol Myers Squibb, Princeton Pike, NJ, USA, 3Bristol Myers Squibb, Princeton, NJ, USA
OBJECTIVES: To assess whether a large language model (LLM) can enhance text-based search within existing BMS document libraries by powering a search function that delivers comprehensive answers to user questions that would otherwise typically require interaction with researchers.
METHODS: The initial phase involved collecting and preprocessing over 700 BMS observational research protocols. Preprocessing included text cleaning, tokenization, stopword removal, and vectorization. Protocol information was extracted and separated by section before being used for causal language modeling (a method of predicting the next word). The corpus was then used to train a recently released LLM, specifically a fine-tuned Mistral 7B model. The frontend was developed using Streamlit, resulting in an interactive web app featuring a query field and a feedback section. Power users tested the tool rigorously, their feedback was systematically collected, and the interface was subsequently improved based on that feedback.
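The preprocessing steps named in the methods (text cleaning, tokenization, stopword removal, vectorization) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the abstract does not describe the actual implementation, the `STOPWORDS` set is a deliberately tiny hypothetical list, whitespace tokenization stands in for whatever tokenizer was used, and "vectorization" is shown as simple term counts, which may differ from the representation the authors built.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list (an assumption, not the authors' list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "for", "on", "with"}


def clean(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text on whitespace (a stand-in for a real tokenizer)."""
    return clean(text).split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def build_vocab(docs_tokens):
    """Map each distinct term to a column index, in alphabetical order."""
    terms = sorted({t for tokens in docs_tokens for t in tokens})
    return {term: i for i, term in enumerate(terms)}


def vectorize(tokens, vocab):
    """Turn a token list into a term-count vector over the vocabulary."""
    vec = [0] * len(vocab)
    for term, n in Counter(tokens).items():
        if term in vocab:
            vec[vocab[term]] = n
    return vec


def preprocess(sections):
    """Run clean -> tokenize -> stopword removal -> vectorize over protocol sections."""
    docs = [remove_stopwords(tokenize(s)) for s in sections]
    vocab = build_vocab(docs)
    return vocab, [vectorize(d, vocab) for d in docs]
```

Calling `preprocess` on a list of section strings returns the vocabulary and one count vector per section.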
RESULTS: The tool generated contextual answers with references to source material within seconds. It effectively addressed user inquiries about databases, variables, methods, and other commonly asked research questions. The feedback section proved valuable, supporting continued refinement of the models. Additional fine-tuning was conducted on the base models using Quantized Low-Rank Adapters (QLoRA) on a corpus of protocol information. After training was completed, the quantized low-rank adapters were merged into the base model.
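Merging a low-rank adapter into a base model can be sketched for a single weight matrix. This follows the standard LoRA formulation, in which the merged weight is W + (alpha / r) * B * A; the function names, the pure-Python matrices, and the omission of quantization (in QLoRA the base weights are stored in quantized form and must be dequantized before merging) are simplifying assumptions, not the authors' code.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_adapter(w, b, a, alpha, r):
    """Return W + (alpha / r) * (B @ A) for one weight matrix.

    w: base weight, shape d x k
    b: adapter matrix B, shape d x r
    a: adapter matrix A, shape r x k
    alpha, r: LoRA scaling factor and rank
    """
    if len(a) != r or any(len(row) != r for row in b):
        raise ValueError("adapter shapes do not match rank r")
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
```

After merging, the adapter matrices are no longer needed at inference time, since their contribution is folded into the weight itself.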
CONCLUSIONS: This integrated system changes how information is gathered and used, simplifying research and improving productivity. It can expedite onboarding for new team members, shorten their learning curve, and offer quick access to information and troubleshooting tips.
Code
RWD151
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas