Welcome back to another exciting edition of my ongoing journey in developing an AI-powered clinical trial summarization tool. As a seasoned medical affairs professional, my mission is to combine my domain expertise with my interest in cutting-edge technology to make a difference in the pharmaceutical industry. Today, I want to walk you through the latest developments and insights gained during this challenging yet rewarding process.

Teasing out the Key Sections

This week was packed with fine-tuning the methods for extracting crucial sections from the plethora of clinical trial PDFs. If you’ve been following my journey, you’ll recall that I had decided to explore a software engineering solution rather than training an AI model for this task. The first taste of success came when the tool started reliably extracting eligibility criteria and primary endpoints by targeting certain keywords. However, it wasn’t as easy as it sounds: I inadvertently broke the table-extraction functionality in the process. Since that was only meant to be a bonus feature, and my focus is razor-sharp on the primary goal of clinical trial summarization, I moved on without it.
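To make the keyword targeting concrete, here is a minimal sketch of the idea; the keyword lists and the `tag_paragraphs` helper are illustrative stand-ins, not my actual implementation:

```python
# Keywords that typically open the sections we care about; these lists
# are illustrative -- real trial PDFs vary in their heading conventions.
SECTION_KEYWORDS = {
    "eligibility_criteria": ["inclusion criteria", "exclusion criteria", "eligibility"],
    "primary_endpoint": ["primary endpoint", "primary outcome"],
}

def tag_paragraphs(paragraphs):
    """Label each paragraph with every section whose keyword it contains."""
    tagged = {}
    for para in paragraphs:
        lowered = para.lower()
        for section, keywords in SECTION_KEYWORDS.items():
            if any(kw in lowered for kw in keywords):
                tagged.setdefault(section, []).append(para)
    return tagged
```

Paragraphs with no matching keyword simply fall through, which is exactly the behavior you want for boilerplate like funding statements.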

Tackling Study Objectives

My next challenge was extracting the study objective sentence, typically found just before the methods section. I initially tried locating it by position but ran into numerous issues. After much deliberation, I switched to identifying candidate paragraphs by keyword. Adding specific phrases commonly seen in objective statements (e.g., “objectives of this”) proved to be a successful strategy. However, this approach wasn’t without its own challenges: it sometimes produced false positives, since the same phrases also appear in other parts of the articles. At this point, it was clear that I needed to rethink my approach.
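A rough sketch of the cue-phrase approach, including how it over-matches; apart from the "objectives of this" phrase, the cue list and helper name are hypothetical examples:

```python
# Cue phrases that often introduce a study-objective sentence.
# "objectives of this" is the one from my experiments; the rest are
# plausible additions, not a vetted list.
OBJECTIVE_CUES = ["objectives of this", "aim of this study", "purpose of this study"]

def find_objective_paragraphs(paragraphs, cues=OBJECTIVE_CUES):
    """Return (index, paragraph) pairs for every paragraph containing a cue."""
    hits = []
    for i, para in enumerate(paragraphs):
        lowered = para.lower()
        if any(cue in lowered for cue in cues):
            hits.append((i, para))
    return hits
```

The weakness shows up immediately: a discussion paragraph that happens to say "the objectives of this review" matches just as readily as the real objective statement, which is precisely the false-positive problem described above.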

Back to the AI Board

So, I decided to once again embrace the potential of AI in pharma medical information, specifically machine learning, to tackle these challenges head-on. Here’s my revised roadmap for leveraging AI to extract and summarize key information from clinical studies:

Step 1: Text Classification for Section Identification

  • Choose a Model: Opt for models like BERT, RoBERTa, or SciBERT, which are well suited to text classification. Given that we are dealing with clinical trials, SciBERT seems apt, as it is pretrained on scientific text.
  • Train the Model: Utilize the Hugging Face Transformers library to train the model on a labeled dataset containing paragraphs and section labels.
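Sketched with the Hugging Face Trainer API, Step 1 might look like the following; the label set, dataset shape, and hyperparameters are all assumptions, and the heavy imports are kept inside the function so the outline reads without them installed:

```python
# Assumed section labels for the classifier -- adjust to your dataset.
LABELS = ["background", "objectives", "methods", "results", "other"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

def train_section_classifier(train_texts, train_labels,
                             model_name="allenai/scibert_scivocab_uncased"):
    """Fine-tune SciBERT to classify paragraphs into trial sections.

    train_texts: list of paragraph strings; train_labels: parallel list of
    label names from LABELS. A sketch, not a tuned training recipe.
    """
    import torch
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABELS), id2label=id2label, label2id=label2id)

    encodings = tokenizer(train_texts, truncation=True, padding=True,
                          return_tensors="pt")

    class SectionDataset(torch.utils.data.Dataset):
        def __len__(self):
            return len(train_labels)

        def __getitem__(self, idx):
            item = {k: v[idx] for k, v in encodings.items()}
            item["labels"] = torch.tensor(label2id[train_labels[idx]])
            return item

    args = TrainingArguments(output_dir="scibert-sections",
                             num_train_epochs=3,
                             per_device_train_batch_size=8)
    Trainer(model=model, args=args, train_dataset=SectionDataset()).train()
    return model, tokenizer
```

The label-to-id mappings are passed into the model config so that later inference returns readable section names instead of raw indices.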

Step 2: Text Summarization

  • Choose a Summarization Model: BART or T5 are excellent options for summarization.
  • Train the Summarization Model: Employ the labeled dataset with input text and target summaries for training. Bear in mind that this step can be computationally intensive.
  • Employ the Summarization Model: Use the trained summarization model to generate summaries for the extracted sections.
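Before committing to a full training run, a quick way to prototype Step 2 is to run a pretrained checkpoint such as BART’s `facebook/bart-large-cnn` over each extracted section, chunking long text first; the 400-word chunk size below is a rough stand-in for the model’s real input limit, not a measured value:

```python
def chunk_text(text, max_words=400):
    """Split a long section into word-bounded chunks a summarizer can handle.

    max_words=400 is an assumed safe margin, not the model's exact limit.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def summarize_section(text, model_name="facebook/bart-large-cnn"):
    """Summarize one extracted section, chunk by chunk."""
    # Lazy import so the sketch is readable without transformers installed.
    from transformers import pipeline
    summarizer = pipeline("summarization", model=model_name)
    parts = [summarizer(chunk, max_length=130, min_length=30,
                        do_sample=False)[0]["summary_text"]
             for chunk in chunk_text(text)]
    return " ".join(parts)
```

Chunking first and stitching the partial summaries together is a pragmatic workaround for section text that exceeds the encoder’s input window.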

Step 3: Named Entity Recognition (NER)

  • Choose an NER Model: A token-classification model (for example, SciBERT with an NER head, or a biomedical spaCy pipeline) can pick out entities such as drug names, doses, and patient populations.
  • Train the NER Model: Fine-tune it on a labeled dataset of entity spans so the pipeline can pull key entities out of each extracted section.
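As a sketch of the NER step, a Hugging Face token-classification pipeline can handle the extraction; the model name below is a placeholder assumption, and any checkpoint tuned for clinical or biomedical entities would slot in the same way:

```python
def extract_entities(text, model_name="d4data/biomedical-ner-all"):
    """Return (entity_type, surface_form) pairs found in the text.

    model_name is an assumption -- substitute whichever biomedical
    token-classification checkpoint fits your entity schema.
    """
    # Lazy import so the sketch is readable without transformers installed.
    from transformers import pipeline
    ner = pipeline("ner", model=model_name, aggregation_strategy="simple")
    return [(ent["entity_group"], ent["word"]) for ent in ner(text)]

def group_by_type(entities):
    """Bucket (entity_type, surface_form) pairs by type for the summary."""
    grouped = {}
    for ent_type, word in entities:
        grouped.setdefault(ent_type, []).append(word)
    return grouped
```

The `aggregation_strategy="simple"` option merges word-piece tokens back into whole entity spans, which keeps the downstream summary readable.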

Step 4: Combine the Models

Once you have trained models for section identification, summarization, and NER, you can compile a script that leverages these models to automatically identify sections, extract entities, and generate summaries from new clinical study texts.
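The glue script for Step 4 can stay model-agnostic by taking the three trained components as plain callables; this is a sketch, and every function and parameter name here is mine rather than settled API:

```python
def summarize_trial(paragraphs, classify, extract_entities, summarize,
                    wanted=("objectives", "methods", "results")):
    """Run the three models in sequence over a parsed clinical study.

    classify, extract_entities, and summarize stand for the trained
    components from Steps 1-3, passed in as plain callables so each can
    be swapped out independently.
    """
    report = {}
    for section in wanted:
        # Step 1: gather paragraphs the classifier assigns to this section.
        matched = [p for p in paragraphs if classify(p) == section]
        if not matched:
            continue
        text = " ".join(matched)
        report[section] = {
            # Steps 2 and 3: summarize the section and extract its entities.
            "summary": summarize(text),
            "entities": extract_entities(text),
        }
    return report
```

Injecting the models as callables also makes the pipeline trivial to test with stubs before any real model is wired in.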

Looking Ahead

In my previous post, I mentioned having a dataset at hand. This week, I put that dataset to good use by initiating the training of the SciBERT model. The wheels are in motion and the anticipation is mounting! Join me next week as I reveal how this exciting phase unfolds. The pharmaceutical industry is on the cusp of a revolution, with AI and machine learning applications in pharma taking center stage. Through the tireless pursuit of innovation, we are not just changing the game; we are redefining it. Stay tuned for more insights and breakthroughs in my quest to unlock the full potential of AI in pharma medical information.