I would consider myself a SAS programmer, data wrangler, and data scientist. I collaborate with statisticians to help make sense of complex data. Day to day I analyze large health care datasets (SEER, SEER-MHOS, NIS, NSQIP) to better understand clinical outcomes in surgery. Unfortunately, not all datasets are well structured with clean data dictionaries. In health care we deal with complex, frequently nonsensical text, including clinic notes, handwritten orders, telephone communications, and emails. Analyzing such data requires pre-processing, and Python is well suited to the job.

Problem: How do I use Python to extract meaningful data from a PDF?

Here is a simple example of Python at work.

Each year surgery residents take an exam called the ABSITE (American Board of Surgery In-Training Examination). The report distributed to residents is a PDF outlining their score, percentiles, etc. We struggle with how best to use this information. How can we modify our curriculum to improve the educational experience of our residents? Consolidating this data and identifying deficiencies might help. Here is a sample report:

Using Python, we can import the text from each PDF, analyze it, and export the results to Excel. You can review the program on GitHub. Warning: it has been minimally tested, but it does the job.

First, define the packages used in the Python program.

#Created on Wed Mar 09 17:38:20 2016
#absite pdf parse and store in excel file wrong answers
#by clancy clark

import os
import re
import xlsxwriter
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

Create a function to read the PDF using pdfminer.

#open and read pdf utility
def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()

    text = output.getvalue()
    output.close()
    return text

Using xlsxwriter, create a new Excel file in the current directory and label the columns in the first row.

path = os.getcwd()
dirs = os.listdir(path)   

#create excel file
workbook = xlsxwriter.Workbook('resident_answers.xlsx')
worksheet = workbook.add_worksheet('Wrong')

#add headers
worksheet.write(0,0,'Resident Name')
worksheet.write(0,1,'Resident Year')

row = 1  
linenumber = 1
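If xlsxwriter is not installed, the same header row can be produced with nothing but the standard library. The sketch below (Python 3, unlike the Python 2 program above) writes CSV instead of Excel; the in-memory buffer is a stand-in for a real output file.

```python
import csv
import io

# Hypothetical stdlib-only fallback: write the same header row as CSV.
# io.StringIO stands in for an output file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Resident Name', 'Resident Year'])
```

The resulting CSV opens in Excel just as well, at the cost of losing multiple worksheets and cell formatting.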

Now, loop through the PDF files in the directory.

for gradefile in dirs:

  if gradefile.endswith('.pdf'):
    #add wrong answers   
    print 'working on', gradefile
    pdftext = convert(gradefile)
    #analyze text by line
    lines = pdftext.splitlines()

The long string of text is then split into lines. We loop through the lines, using regular expressions to pick out the resident's name and level and each wrong answer.
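The two patterns can be sanity-checked on their own before being wired into the loop. The sample lines below are invented for illustration; real ABSITE reports will differ, but the patterns are the same ones used in the program.

```python
import re

# Hypothetical header line: resident name followed by training level.
header = 'DOE, JOHN Level: 2'
m = re.search(r'(.+)Level:\s(\w+)', header)
resident = m.group(1).strip()   # name, with trailing whitespace removed
level = m.group(2)              # training level

# Hypothetical answer line: a three-letter category code (but not the
# word SCORE), with the parts of the entry separated by " - ".
answer = 'ABD Abdomen - Appendicitis - Missed'
a = re.search(r'\b(?!SCORE\b)[A-Z]{3}.+', answer)
sections = re.split(r'\s-\s', a.group())
# sections -> ['ABD Abdomen', 'Appendicitis', 'Missed']
```

The negative lookahead `(?!SCORE\b)` is what keeps the summary "SCORE" rows in the report from being mistaken for answer lines.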

    #loop through lines
    for line in lines: 
      linenumber = linenumber + 1 #count lines
      match = re.search(r'(.+)Level:\s(\w+)',line)  #find name and resident level
      if match: 
        #grab resident name and year level        
        resident = match.group(1)
        resident = resident.strip()
        level = match.group(2)
      if linenumber > 8: #after header information check for answers        
        redoline = line.decode('ascii','replace')
        answerline = re.search(r'\b(?!SCORE\b)[A-Z]{3}.+',redoline) #find answers  
        if answerline:  
          section = answerline.group()
          sections = re.split('\s-\s',section) 

          row = row + 1

Write the list of wrong answers to the Excel file, then close the workbook.

          #write excel cells
          worksheet.write(row,0,resident)
          worksheet.write(row,1,level)
          #write sections of answer
          column = 2          
          for phrase in sections:  
            worksheet.write(row,column,phrase)
            column = column + 1
  linenumber = 1 #reset line counting 
workbook.close()
print 'Complete'
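On Python 3, the line-by-line logic above can be factored into a small pure function that takes the extracted text and returns rows ready for writing. This is a sketch, not the program itself: the function name is mine, the sample text is invented, and the assumption that the first eight lines are header material is carried over from the original.

```python
import re

def parse_report(text):
    """Parse extracted ABSITE report text into rows of
    [resident, level, section, ...] (a Python 3 sketch)."""
    resident, level = '', ''
    rows = []
    # start=2 mirrors the original counter, which is incremented
    # before its first use and so labels the first line as 2
    for linenumber, line in enumerate(text.splitlines(), start=2):
        match = re.search(r'(.+)Level:\s(\w+)', line)
        if match:
            resident = match.group(1).strip()
            level = match.group(2)
        if linenumber > 8:  # assumed: header occupies the first 8 lines
            answerline = re.search(r'\b(?!SCORE\b)[A-Z]{3}.+', line)
            if answerline:
                rows.append([resident, level]
                            + re.split(r'\s-\s', answerline.group()))
    return rows

# Invented sample: one header line, blank header filler, one answer line.
sample = ('DOE, JOHN Level: 2\n' + '\n' * 6
          + 'ABD Abdomen - Appendicitis - Missed')
rows = parse_report(sample)
# rows -> [['DOE, JOHN', '2', 'ABD Abdomen', 'Appendicitis', 'Missed']]
```

Keeping the parsing in a pure function makes it trivial to test against saved sample text, independent of both pdfminer and xlsxwriter.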
It’s a better world with Python: an example of extracting PDF data