I would consider myself a SAS programmer, data wrangler, and data scientist. I collaborate with statisticians to help make sense of complex data. Day-to-day I analyze large health care datasets (SEER, SEER-MHOS, NIS, NSQIP) to better understand clinical outcomes in surgery. Unfortunately, not all datasets are well structured with clean data dictionaries. In health care, we deal with complex, frequently nonsensical text, including clinic notes, handwritten orders, telephone communications, and emails. Analyzing such data requires pre-processing, and Python is well suited to the job.
Problem: How do I use Python to extract meaningful data from a PDF?
Here is a simple example of Python at work.
Each year surgery residents take an exam called the ABSITE (American Board of Surgery In-Training Exam). The report distributed to residents is a PDF outlining their score, percentiles, etc. We struggle with how to best use this information. How can we modify our curriculum to improve the educational experience of our residents? Consolidating this data and identifying deficiencies might help. Here is a sample report:
Using Python, we can import text from the PDF, analyze it, and export the results to Excel. You can review the program on GitHub. Warning: it has been minimally tested but does the job.
First, define the packages used in the Python program.
```python
#Created on Wed Mar 09 17:38:20 2016
#absite pdf parse and store in excel file wrong answers
#by clancy clark
import os
import re
import xlsxwriter
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
```
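These imports assume Python 2 (note cStringIO and, later, the print statements). Under Python 3 with the maintained pdfminer.six fork, the pdfminer imports stay the same, but cStringIO is gone; a minimal sketch of the replacement:

```python
# Python 3: cStringIO no longer exists; io.StringIO is the replacement
from io import StringIO

output = StringIO()
output.write('captured text')
print(output.getvalue())  # -> captured text
```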
Create a function to read the PDF using pdfminer.
```python
#open and read pdf utility
def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = file(fname, 'rb')  # Python 2; use open() in Python 3
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text
```
Using xlsxwriter, create a new Excel file in the current directory and label the columns in the first row.
```python
path = os.getcwd()
dirs = os.listdir(path)

#create excel file
workbook = xlsxwriter.Workbook('resident_answers.xlsx')
worksheet = workbook.add_worksheet('Wrong')

#add headers
worksheet.write(0, 0, 'Resident Name')
worksheet.write(0, 1, 'Resident Year')
worksheet.write(0, 2, 'Section1')
worksheet.write(0, 3, 'Section2')
worksheet.write(0, 4, 'Section3')
worksheet.write(0, 5, 'Section4')
worksheet.write(0, 6, 'Answer')

row = 1
linenumber = 1
```
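If xlsxwriter is not available, the same header row can be written as CSV using only the standard library. This is a Python 3 sketch, not part of the original program; it writes to an in-memory buffer so it is easy to test, but you could swap in `open('resident_answers.csv', 'w', newline='')` to produce a real file (the filename is my own choice):

```python
import csv
import io

# Same column headers as the Excel version
HEADERS = ['Resident Name', 'Resident Year', 'Section1', 'Section2',
           'Section3', 'Section4', 'Answer']

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(HEADERS)
print(buf.getvalue().strip())
```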
Now, loop through the PDF files.
```python
for gradefile in dirs:
    if gradefile.endswith('.pdf'):
        #add wrong answers
        print 'working on', gradefile
        pdftext = convert(gradefile)
        #analyze text by line
        lines = pdftext.splitlines()
```
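One small caveat: `endswith('.pdf')` silently skips files with an uppercase extension such as `REPORT.PDF`. A case-insensitive variant could look like this (the helper name `pdf_files` is my own, not from the original program):

```python
def pdf_files(names):
    """Return the names that end in .pdf, ignoring case."""
    return [n for n in names if n.lower().endswith('.pdf')]

print(pdf_files(['smith.pdf', 'JONES.PDF', 'notes.txt']))
# -> ['smith.pdf', 'JONES.PDF']
```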
The long string of text is split into lines. Loop through the lines, using regular expressions to find the resident's name, level, and wrong answers.
```python
        #loop through lines
        for line in lines:
            linenumber = linenumber + 1  #count lines
            #find name and resident level
            match = re.search(r'(.+)Level:\s(\w+)', line)
            if match:
                #grab resident name and year level
                resident = match.group(1).strip()
                level = match.group(2)
            if linenumber > 8:
                #after header information check for answers
                redoline = line.decode('ascii', 'replace')
                #find answers
                answerline = re.search(r'\b(?!SCORE\b)[A-Z]{3}.+', redoline)
                if answerline:
                    section = answerline.group()
                    sections = re.split(r'\s-\s', section)
                    row = row + 1
```
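To see what these two regular expressions actually match, here is a small self-contained demo (Python 3; the sample lines are made up for illustration, not taken from a real ABSITE report):

```python
import re

# Hypothetical header line carrying the resident's name and level
line = 'Doe, Jane   Level: Junior'
match = re.search(r'(.+)Level:\s(\w+)', line)
resident = match.group(1).strip()   # 'Doe, Jane'
level = match.group(2)              # 'Junior'

# Hypothetical wrong-answer line; the negative lookahead skips SCORE lines
answer = 'ABDOMEN - Hernia - Inguinal - Repair'
answerline = re.search(r'\b(?!SCORE\b)[A-Z]{3}.+', answer)
sections = re.split(r'\s-\s', answerline.group())
print(resident, level, sections)
```

Note that a line such as `'SCORE - 75'` is rejected by the `(?!SCORE\b)` lookahead, which is how header rows are kept out of the answer list.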
Write the list of wrong answers to the Excel file and then close it.
```python
                    #write excel cells
                    worksheet.write(row, 0, resident)
                    worksheet.write(row, 1, level)
                    worksheet.write(row, 6, redoline)
                    #write sections of answer
                    column = 2
                    for phrase in sections:
                        worksheet.write(row, column, phrase)
                        column = column + 1
        linenumber = 1  #reset line counting
print 'Complete'
workbook.close()
```