Screenshot displaying the output from the extractpdf.py script and project file structure in Visual Studio Code.

Hello everyone! Isaac here. This week, I took significant strides in my journey to fully embrace artificial intelligence (AI) and machine learning in pharma, specifically focusing on my project of integrating AI into medical writing. I thought it would be beneficial to share my progress, learnings, and even some of the challenges I’ve faced along the way.

Stepping Back for a Holistic View

One of the primary lessons I learned this week was the importance of breaking down a complex problem into smaller, more manageable parts. Although my project – incorporating AI into med info medical writing – is a multifaceted endeavor, I found it beneficial to deconstruct it into the following components:

    1. PDF Text Extractor: This is a tool that extracts text from PDF files.

    2. Summarizer: This is a component of my project that will condense large volumes of text into digestible summaries to help create a training dataset.

    3. Training Data: The quality of AI outputs is highly dependent on the quality of the data it learns from, making the collection and preparation of suitable training data a critical task.

    4. Trained Standard Response Letter (SRL) Summarizer: This is the culmination of the previous components: an AI model capable of summarizing medical text effectively, matching writing style, and adhering to a style guide (eg, the American Medical Association (AMA) Manual of Style, 11th edition).

I’ve been writing separate Python scripts for each of these functions and then integrating them into a main.py script, which will serve as the heart of the project. While I’m not a seasoned coder, I found this modular approach helpful as it allowed me to visualize my program and focus on each part independently. Drawing from my experience building guitar pedals, I am treating each piece of code as a component in a circuit, tweaking each one to achieve the desired end product.
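As a rough illustration of that modular layout (a minimal sketch, not the project's actual code), a main.py can simply pass data through a list of stage functions. The names run_pipeline, extract_text, and summarize are hypothetical placeholders, and the two stages are stubs standing in for the real scripts.

```python
from typing import Callable, List

# Each stage takes the output of the previous stage and returns its own.
Stage = Callable[[object], object]


def run_pipeline(source: object, stages: List[Stage]) -> object:
    """Pass `source` through each stage in order, like a signal through pedals."""
    result = source
    for stage in stages:
        result = stage(result)
    return result


# Placeholder stages (assumptions); each would live in its own script.
def extract_text(pdf_path: str) -> str:
    return f"text extracted from {pdf_path}"


def summarize(text: str) -> str:
    return text[:40]  # stand-in for a real summarizer


if __name__ == "__main__":
    print(run_pipeline("paper.pdf", [extract_text, summarize]))
```

Because every stage has the same shape, one component can be swapped or tuned without touching the others.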

Challenges: Tackling Tables and Figures in PDFs

One unique challenge I faced this week was dealing with tables and figures within PDF files. Unlike plain text, tables and figures require a specialized approach for extraction and interpretation. To tackle this, I turned to PDFMiner and Camelot, two Python libraries specifically designed to handle PDFs.

For those unfamiliar with coding, Python libraries are collections of functions and methods that allow you to perform many actions without writing extensive code. PDFMiner reports where each block of text sits on the page, which makes it practical to filter out headers and footers, something many simpler PDF extraction tools do not make easy. Meanwhile, Camelot enables the extraction of tables from PDF files into a more manageable format, like CSV files, which I can potentially use to rebuild tables in the SRL.
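One way to handle headers and footers with PDFMiner is to drop any text block that falls inside a band at the top or bottom of the page. The sketch below assumes the pdfminer.six package; the 8% margin and the function names drop_margins and body_text are illustrative assumptions, not the settings used in this project.

```python
from typing import List, Tuple

# (text, y0, y1): vertical extent of a block, measured from the page bottom,
# which is how pdfminer.six reports bounding boxes.
Block = Tuple[str, float, float]


def drop_margins(blocks: List[Block], page_height: float,
                 margin: float = 0.08) -> List[str]:
    """Keep blocks that sit fully outside the top and bottom margin bands."""
    low = page_height * margin
    high = page_height * (1 - margin)
    return [text for text, y0, y1 in blocks if y0 >= low and y1 <= high]


def body_text(pdf_path: str, margin: float = 0.08) -> str:
    """Extract text from a PDF, skipping likely headers and footers."""
    # Imported here so drop_margins works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    kept: List[str] = []
    for page in extract_pages(pdf_path):
        blocks = [(el.get_text(), el.bbox[1], el.bbox[3])
                  for el in page if isinstance(el, LTTextContainer)]
        kept.extend(drop_margins(blocks, page.height, margin))
    return "".join(kept)
```

The margin is a judgment call: too small and running headers slip through, too large and the first or last lines of body text get cut.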

Despite the steep learning curve, the benefits of these libraries have been tremendous, enabling me to extract not only the main body of text but also headers, footers, and references. Additionally, I’ve been able to remove the first page of the documents, which generally consists of an abstract and introduction that may not be essential for my dataset. However, I’m keeping an open mind to the possibility that the last paragraph of the introduction, which often contains the objectives, may fall on the first page. I will watch for it and update my code if necessary.
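A simple safeguard for the first-page concern is to flag any first-page paragraph that mentions objectives before discarding the page. This is a minimal sketch that assumes each page has already been extracted to a string; the cue words and the function name drop_first_page are assumptions to tune against real documents.

```python
import re
from typing import List, Tuple

# Cue words are an assumption; adjust them to the journals in the dataset.
OBJECTIVE_CUES = re.compile(r"\b(objective|aim|purpose)s?\b", re.IGNORECASE)


def drop_first_page(pages: List[str]) -> Tuple[List[str], List[str]]:
    """Return (remaining pages, first-page paragraphs that mention objectives)."""
    if not pages:
        return [], []
    first, rest = pages[0], pages[1:]
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", first) if p.strip()]
    flagged = [p for p in paragraphs if OBJECTIVE_CUES.search(p)]
    return rest, flagged
```

Anything returned in the flagged list can be reviewed by hand or carried forward into the dataset instead of being thrown away with the rest of the page.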

Celebrating Progress: Wins This Week

This week’s effort bore fruit in multiple ways. I successfully developed a PDF text extractor and integrated it into my main.py program. Then, I embarked on the development of my summarizer.py component, making notable progress. Leveraging the capabilities of PDFMiner and Camelot, I managed to extract tables from my documents, saving them for later as CSV files.
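For the table step, a sketch along these lines would save each table Camelot detects as its own CSV file and read it back later. It assumes the camelot-py package is installed; the file naming scheme and the function names export_tables and read_rows are hypothetical.

```python
import csv
from pathlib import Path
from typing import Iterable, List


def export_tables(pdf_path: str, out_dir: str) -> List[Path]:
    """Save every table Camelot detects in `pdf_path` as its own CSV file."""
    import camelot  # third-party; imported lazily so read_rows works alone

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    tables = camelot.read_pdf(pdf_path, pages="all")
    for index, table in enumerate(tables, start=1):
        target = out / f"{Path(pdf_path).stem}_table{index}.csv"
        # table.df is a pandas DataFrame holding the detected cells.
        table.df.to_csv(target, index=False, header=False)
        written.append(target)
    return written


def read_rows(lines: Iterable[str]) -> List[List[str]]:
    """Parse CSV lines (an open file works too) back into rows of cells."""
    return list(csv.reader(lines))
```

Table detection is not perfect, so it is worth opening a few of the exported files and comparing them against the source PDF.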

I also achieved some finer details in the extraction process. I was able to isolate references into a separate text file, potentially useful for importing into reference manager software like EndNote later on. Preprocessing the data to separate text by paragraph was another significant accomplishment. This separation will allow me to summarize each paragraph independently, potentially leading to a comprehensive summary of the entire document.
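Both of these preprocessing steps can be sketched with the standard library alone. The version below assumes the reference list begins at a line reading "References" (or "Bibliography") and that paragraphs are separated by blank lines; real PDFs may need looser rules, and the function names are illustrative.

```python
import re
from typing import List, Tuple

# Assumes the reference list starts at a line containing only this heading.
REFERENCES_HEADING = re.compile(r"^[ \t]*(references|bibliography)[ \t]*$",
                                re.IGNORECASE | re.MULTILINE)


def split_references(text: str) -> Tuple[str, str]:
    """Return (body, references); references is empty if no heading is found."""
    matches = list(REFERENCES_HEADING.finditer(text))
    if not matches:
        return text, ""
    last = matches[-1]  # use the last heading in case one appears earlier
    return text[:last.start()].rstrip(), text[last.end():].strip()


def split_paragraphs(body: str) -> List[str]:
    """Split on blank lines and rejoin hard-wrapped lines within a paragraph."""
    chunks = re.split(r"\n\s*\n", body)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
```

The references string can be written to its own text file, and each item in the paragraph list can then be handed to the summarizer on its own.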

Keeping the Balance

While my progress this week has been gratifying, I’ve learned a valuable lesson about maintaining balance. As many of you might understand, it’s easy to become engrossed in a project, especially when it involves the exciting world of AI in pharma. I found myself spending long hours coding, often stretching into the early hours of the morning. Remember, while our professional endeavors are important, it’s crucial to strike a healthy balance between work and rest.

Looking Ahead

As I move forward, I’m considering whether I need to segment the data differently. Currently, I’ve broken it down by paragraph, but it might be beneficial to divide the data by sections (methods, patients, endpoints, results). However, the structure of PDF documents can vary, making this task more complex. It’s an area I’ll be exploring further as I continue my journey into the world of clinical artificial intelligence.
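If section-level segmentation turns out to be worthwhile, one simple starting point is to treat any paragraph that consists only of a known heading as a section marker. This is a minimal sketch under that assumption; the heading list mirrors the sections named above, and the function name group_by_section is hypothetical.

```python
from typing import Dict, List

# Heading names come from the sections mentioned in the text; real papers
# vary, so treat this tuple as a starting assumption.
SECTION_NAMES = ("methods", "patients", "endpoints", "results")


def group_by_section(paragraphs: List[str]) -> Dict[str, List[str]]:
    """Assign each paragraph to the most recent heading seen before it."""
    sections: Dict[str, List[str]] = {"unlabeled": []}
    current = "unlabeled"
    for paragraph in paragraphs:
        label = paragraph.strip().rstrip(":.").lower()
        if label in SECTION_NAMES:
            current = label
            sections.setdefault(current, [])
        else:
            sections[current].append(paragraph)
    return sections
```

Documents whose headings do not match the list simply end up under "unlabeled", which makes it easy to spot the layouts that need extra handling.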

My journey into AI in pharma is far from over, and I’m excited about the possibilities that lie ahead. By sharing my progress, I hope to inspire and guide healthcare professionals, researchers, and pharmaceutical industry professionals in embracing AI’s transformative potential in medical writing and beyond. Let’s continue this journey together, exploring the frontiers of machine learning in pharma and uncovering innovative ways to enhance medical affairs and medical information in the pharmaceutical industry. Until next time, happy coding!