Last week, I attended the IALP 2016 conference (20th International Conference on Asian Language Processing) in Taiwan. After the conference, each presenter received a USB drive with all accepted papers in PDF format. So when I got back to Beijing, I began going through the papers to extend my learning. Usually, when I return from a conference, I go through all the paper titles and my conference notes, then choose the most interesting articles and dive into them for details. I then summarize the important research findings in one document. This always takes me several days or more to complete.
This time, I decided to try SAS Text Analytics to help me read papers efficiently. Here’s how I did it.
My first experiment was to generate a word cloud of all papers. I used these three steps.
Step 1: Convert PDF collections into text files.
With the SAS procedure TGFilter and SAS Document Conversion Server, you can convert a collection of PDF files into a SAS dataset. If you don’t have SAS Document Conversion Server, you can download pdftotext for free. However, pdftotext only converts PDF files into text files, so you need to write SAS code to import all the text files into a dataset. Moreover, if you use pdftotext, you need to check whether each PDF file was converted correctly. Checking the texts one by one is tedious, so it is worth finding a smarter way to do this check. The TGFilter procedure has language detection functionality, and the detected language of a garbage document after conversion is empty rather than English. I therefore recommend TGFilter: you can filter out garbage documents easily with a WHERE statement on the language, because the garbage documents are exactly those whose language is not equal to ‘English’.
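For readers who take the pdftotext route, the conversion and the garbage check can be scripted. The following is a minimal sketch in Python rather than SAS, and it is not the method used for this post. It assumes the pdftotext command-line tool is installed and on the PATH. The function names, the small list of common words, and the 0.1 threshold are all hypothetical choices for illustration, and the check is only a rough stand-in for the language detection that TGFilter provides.

```python
import subprocess
from pathlib import Path

# Assumption: a handful of very common English function words.
# Real English prose is dense with them; garbled output usually is not.
COMMON_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "we"}


def convert_pdf(pdf_path, out_dir):
    """Call the pdftotext CLI on one PDF and return the path of the text file."""
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(out_path)],
        check=True,  # raise if pdftotext reports an error
    )
    return out_path


def looks_like_english(text, min_ratio=0.1):
    """Heuristic check: is the share of common English words high enough?"""
    tokens = text.lower().split()
    if not tokens:
        return False  # an empty conversion counts as garbage
    hits = sum(1 for t in tokens if t.strip(".,;:()") in COMMON_WORDS)
    return hits / len(tokens) >= min_ratio


def convert_and_filter(pdf_dir, out_dir):
    """Convert every PDF in pdf_dir; return (good, suspect) lists of text files."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    good, suspect = [], []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        txt_path = convert_pdf(pdf_path, out_dir)
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        (good if looks_like_english(text) else suspect).append(txt_path)
    return good, suspect
```

Calling `convert_and_filter("papers", "texts")` on a folder of PDFs (both folder names are placeholders) would return the files that look like English and the ones that deserve a manual look.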
Step 2: Parse documents into words and get word frequencies.
Run the SAS procedure HPTMINE or TGPARSE against the document dataset, with the stemming option turned on and the English stop-word list released by SAS, and you get the frequencies of all stems.
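To make the idea of this step concrete, here is a small Python sketch of counting stem frequencies after removing stop words. It does not reproduce HPTMINE or TGPARSE. The stop-word list is a tiny made-up sample rather than the list released by SAS, and `crude_stem` is a deliberately naive suffix stripper used only for illustration, not a real stemming algorithm.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the SAS English list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
              "for", "we", "on", "with", "this", "that"}


def crude_stem(word):
    """Naive stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_frequencies(documents):
    """Count stems over all documents, skipping stop words."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                counts[crude_stem(token)] += 1
    return counts


docs = ["The models are trained on texts.",
        "We model the text with a trained model."]
freqs = stem_frequencies(docs)
print(freqs.most_common(3))
```

A production stemmer handles far more cases than this one, which is one reason to rely on the SAS procedures for real work.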
Step 3: Generate word cloud plot.
Once you have the term frequencies, you can use either SAS Visual Analytics or R to generate the word cloud plot. I like programming, so I used the SAS procedure IML to submit R scripts from SAS.
These steps generated a word cloud with the top 500 words of 66 papers. There were 87 papers in total, and 21 of them could not be converted correctly by SAS Document Conversion Server. With pdftotext, 19 papers could not be converted correctly.
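The plot itself was produced in R, and that script is not shown here. As a rough illustration of the data preparation behind such a plot, the Python sketch below picks the most frequent terms and maps each frequency to a font size. The function names, the linear scaling, and the 10 to 72 point range are assumptions for this example; word cloud tools may scale sizes differently.

```python
def top_terms(freqs, n=500):
    """Return the n most frequent (term, count) pairs, ties broken by term."""
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def font_sizes(terms, min_size=10, max_size=72):
    """Map each count linearly onto the range [min_size, max_size]."""
    if not terms:
        return {}
    counts = [count for _, count in terms]
    lo, hi = min(counts), max(counts)
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {term: min_size + (count - lo) * (max_size - min_size) / span
            for term, count in terms}


sample = {"model": 3, "train": 2, "text": 2, "word": 1}
sizes = font_sizes(top_terms(sample, n=4))
```

With the default `n=500`, `top_terms` keeps the same number of words as the cloud described above.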