Welcome back to another edition of my ongoing journey in developing an AI-powered clinical trial summarization tool. As a seasoned medical affairs professional, I want to put my interest in cutting-edge technology to work and make a difference in the pharmaceutical industry. Today, I want to walk you through the latest developments and the insights gained during this challenging yet rewarding process.
Teasing out the Key Sections
This week was packed with fine-tuning the methods for extracting crucial sections from the plethora of clinical trial PDFs. If you’ve been following my journey, you’ll recall that I had decided to explore a software engineering solution as opposed to training AI for this task. The first taste of success came when the tool reliably started extracting eligibility criteria and primary endpoints by targeting certain keywords. However, it wasn’t as easy as it sounds – I inadvertently broke the table extractor functionality in the process. Although that was meant to be a bonus feature, my focus is razor-sharp on the primary goal: clinical trial summarization. So I moved on without it.
Tackling Study Objectives
My next challenge was extracting the study objective sentence, typically found before the methods section. I initially tried doing this based on its position but ran into numerous issues. After much deliberation, I switched to using keywords for identifying paragraphs. Adding specific keyword phrases seen in objectives (e.g., “objectives of this”) proved to be a successful strategy in extracting objectives. However, this wasn’t without its own set of challenges. The approach sometimes gave false positives as the keywords also appeared in other parts of the articles. At this point, it was clear that I needed to rethink my approach.
Back to the AI Board
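To make the false-positive problem concrete, here is a minimal Python sketch of phrase-based paragraph matching. It is an illustration under assumptions, not the tool's actual code: only the phrase "objectives of this" comes from this post, while the other phrases, the function name `find_objective_paragraphs`, and the sample paragraphs are hypothetical.

```python
import re

# Hypothetical keyword phrases; only "objectives of this" comes from the post.
OBJECTIVE_PHRASES = ["objectives of this", "objective of this", "aim of this study"]

# One case-insensitive pattern that matches any of the phrases.
OBJECTIVE_PATTERN = re.compile(
    "|".join(re.escape(phrase) for phrase in OBJECTIVE_PHRASES),
    re.IGNORECASE,
)


def find_objective_paragraphs(paragraphs):
    """Return (index, paragraph) pairs whose text contains a keyword phrase."""
    return [
        (index, text)
        for index, text in enumerate(paragraphs)
        if OBJECTIVE_PATTERN.search(text)
    ]


# Made-up paragraphs for illustration, not taken from a real trial.
paragraphs = [
    "The objectives of this study were to assess safety and efficacy.",
    "Patients were randomized 1:1 to treatment or placebo.",
    "Limitations: the secondary objectives of this analysis were exploratory.",
]

matches = find_objective_paragraphs(paragraphs)
for index, text in matches:
    print(index, text)
```

The first paragraph is the one we want, but the third is flagged too because the same phrase shows up in a limitations statement. Keywords tell you where a phrase occurs, not what role the paragraph plays.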
So, I decided to once again embrace the potential of AI in pharma medical information, specifically machine learning, to tackle these challenges head-on. Here’s my revised roadmap for leveraging AI to extract and summarize key information from clinical studies:
Step 1: Text Classification for Section Identification
- Choose a Model: Opt for models like BERT, RoBERTa, or SciBERT, which are widely used for text classification. Given that we are dealing with clinical trials, SciBERT seems apt as it’s pretrained on scientific text.
- Train the Model: Utilize the Hugging Face Transformers library to train the model on a labeled dataset containing paragraphs and section labels.
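As a rough idea of what Step 1 might look like in practice, here is a minimal sketch of preparing a labeled dataset and loading SciBERT with a classification head. The section labels, the example paragraphs, and the helper name `load_scibert_classifier` are my own assumptions for illustration. The loader is only defined, not called, since calling it requires the transformers library and downloads model weights.

```python
# Hypothetical section labels; a real project would define its own set.
SECTION_LABELS = ["objective", "eligibility", "endpoint", "other"]

# Classification models work with integer class ids, so map labels both ways.
label2id = {label: index for index, label in enumerate(SECTION_LABELS)}
id2label = {index: label for label, index in label2id.items()}

# Made-up (paragraph, section label) pairs for illustration only.
labeled_paragraphs = [
    ("The objective of this study was to evaluate drug X.", "objective"),
    ("Eligible patients were adults aged 18 years or older.", "eligibility"),
    ("The primary endpoint was overall survival.", "endpoint"),
]

# Records with a text field and an integer label, ready to be tokenized.
records = [
    {"text": text, "label": label2id[label]}
    for text, label in labeled_paragraphs
]


def load_scibert_classifier(model_name="allenai/scibert_scivocab_uncased"):
    """Load SciBERT with an untrained classification head (downloads weights)."""
    # Imported lazily so the data preparation above runs without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(SECTION_LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

The classification head starts out randomly initialized, so the model still has to be fine-tuned on the labeled records before its predictions mean anything.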
Step 2: Text Summarization
- Choose a Summarization Model: BART or T5 are excellent options for summarization.
- Train the Summarization Model: Employ the labeled dataset with input text and target summaries for training. Bear in mind that this step can be computationally intensive.
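For Step 2, the training data boils down to pairs of input text and target summary. Below is a minimal sketch of building such pairs, assuming hypothetical field names (`input_text`, `target_text`) and a crude word budget as a stand-in for proper token-level truncation. The "summarize: " prefix is the task prefix the original T5 checkpoints were trained with; BART needs no prefix.

```python
T5_PREFIX = "summarize: "  # task prefix used by the original T5 checkpoints


def truncate_words(text, max_words):
    """Keep at most max_words whitespace-separated words (a rough length cap)."""
    return " ".join(text.split()[:max_words])


def build_pair(source_text, target_summary, max_source_words=400, prefix=""):
    """Build one training record; the 400-word default is an arbitrary assumption."""
    return {
        "input_text": prefix + truncate_words(source_text, max_source_words),
        "target_text": target_summary.strip(),
    }


# Made-up example; a real dataset would pair trial text with written summaries.
pair = build_pair(
    "Patients were randomized to drug X or placebo and followed for 12 months.",
    "Randomized trial of drug X versus placebo over 12 months.",
    prefix=T5_PREFIX,
)
print(pair["input_text"])
print(pair["target_text"])
```

Capping the input length is one simple way to keep the compute cost in check, at the price of dropping whatever falls past the cutoff.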