Interested in text mining? This week's SAS tip is from the new book Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS by Goutam Chakraborty, Murali Pagolu, and Satish Garla. This hands-on guide is getting strong early reviews, and perhaps you'd like to write your own after reading the book.
In the meantime, I hope you'll enjoy reading this week's free book excerpt.
The following excerpt is from SAS Press authors Goutam Chakraborty, Murali Pagolu, and Satish Garla's book “Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS” Copyright © 2013, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED. (please note that results may vary depending on your version of SAS software).
Importing textual data into SAS
As discussed in Chapter 1, the first step in the text mining process is collecting textual data and setting up what is sometimes referred to as a corpus. A text corpus usually represents documents from a particular domain. For example, all papers published in a journal during a year could be a document corpus. Though collecting data looks like a simple task, it is one of the most tedious and challenging steps in the text mining process, because unstructured data exists in various forms that cannot always be processed directly by SAS Text Miner. Unless the data is readily available as a SAS data set, a data conversion step usually takes place before the data is used in the text mining task.
The data collection step heavily depends on the business problem. This involves answering simple questions such as the following:
Once the data that is needed to solve the current problem is identified, the next challenge is to collect the data and convert it for SAS Text Miner to process. It is quite possible that you will face any of the following situations in your project:
- Data is readily available as a SAS data set.
- Data is available as textual files (PDF, XML, HTML, Word, etc.) in a directory or in a database.
- Data needs to be collected from the Internet.
The text mining process in SAS requires that the data be available as a SAS data set. This does not mean that the source data has to exist as a SAS data set (the second and third situations listed above involve other formats). In the first situation, you can directly create a data source and perform text mining. In the latter two situations, the source data has to be converted into a SAS data set. There are various ways to create a SAS data set from the data available in commercial databases using SAS data access features. Files in common formats such as comma-separated values (CSV) or Microsoft Excel can be easily imported into SAS using the SAS Enterprise Guide Data Import Wizard or the File Import node in SAS Enterprise Miner. The challenging part is to create SAS data sets from textual files. SAS Text Miner has the capability to create SAS data sets dynamically from textual files available in a directory or on the web. This is accomplished using the Text Import node in SAS Text Miner.
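For readers who prefer code to the point-and-click tools, a CSV file can also be read into a SAS data set with PROC IMPORT. The following is a minimal sketch and is not taken from the book; the file path, the output data set name, and the GUESSINGROWS value are assumptions chosen only for illustration.

```sas
/* Hypothetical file path and data set name, for illustration only */
proc import datafile="C:\data\reviews.csv"
    out=work.reviews      /* SAS data set to create */
    dbms=csv
    replace;              /* overwrite work.reviews if it already exists */
    getnames=yes;         /* take variable names from the first row */
    guessingrows=1000;    /* scan more rows when choosing variable types and lengths */
run;
```

Scanning more rows makes it less likely that a long free-text field is truncated to the length of the first few values, which matters when that field is the input to text mining.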
In Chapter 2, you learned how to collect data from the web and local files using SAS Information Retrieval Studio. Data collected using SAS Information Retrieval Studio is fed into SAS Text Miner for further text mining. The following sections discuss in detail the different ways to collect textual data using SAS Text Miner.