In an upcoming paper for SAS Global Forum, several of us from the SAS Text Analytics team explore shifting the context of our underlying representation from documents to the sentences within those documents. We then look at how this shift lets us answer new text mining questions and explore a collection from a different angle. The approach, which I explain below, uses SAS Text Miner to analyze sentences as "documents".
Our motivation for sentence-based analysis stems from the challenges that long documents present to fine-grained analysis of unstructured data. Text Miner uses the vector space model, in which each document is represented as a quantitative vector of the weighted frequencies of the terms in the collection. These vectors often have hundreds of thousands of entries because there is one entry for every kept term in the collection. The vectors are also very sparse because most documents contain only a small subset of those terms. A diagram of document vectors of this form is shown below. In the diagram there are m documents, each prefixed with the letter "d", and n terms, each prefixed with the letter "t". The filled-in squares indicate that a term occurred in the given document.
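For concreteness, here is a minimal sketch of how such a sparse term-by-document table can be produced with PROC HPTMINE. The data set and variable names (documents, docid, text) are assumptions for illustration, not fixed requirements; the OUTPARENT= table stores only the nonzero cells (term number, document, count), which is what keeps a matrix this sparse manageable.

/* Sketch: parse the collection into a sparse term-by-document table */
proc hptmine data=documents;
   doc_id docid;
   var text;
   parse outterms=terms        /* one row per kept term                  */
         outparent=sparseTDM;  /* nonzero (term, document, count) cells  */
run;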
Several nodes in SAS Text Miner then use the singular value decomposition (SVD) to create new, dense document vectors with far fewer latent factors than the number of terms, usually only 50 to 100. Not only that, but the factorization also provides a dense, latent vector representation for each distinct term in the collection. The results of the factorization are shown below; the latent factor dimensions are prefixed with an "f", and there are k of them.
The term representations on the right-hand side above are particularly useful because any new document, whether it contains a single term or thousands of terms, can be mapped into the same space as the training data: the new document is projected onto the latent factors through the term vectors of the terms it contains, as sketched below.
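To make the factorization concrete, here is a small sketch in SAS/IML with a toy 4-document-by-6-term matrix. The matrix values, the choice of k=2, and the folded-in new document are illustrative assumptions, not output from a real collection.

proc iml;
   /* Toy weighted term-frequency matrix: 4 documents (rows) x 6 terms (columns) */
   A = {1 0 2 0 0 1,
        0 3 0 1 0 0,
        2 0 1 0 1 0,
        0 1 0 0 2 1};
   call svd(U, Q, V, A);            /* A = U*diag(Q)*V`                 */
   k = 2;                           /* keep k latent factors            */
   docVecs  = U[, 1:k] # Q[1:k]`;   /* dense document vectors (m x k)   */
   termVecs = V[, 1:k] # Q[1:k]`;   /* dense term vectors (n x k)       */

   /* Fold a new document into the same latent space through its terms */
   qnew = {0 1 1 0 0 2};            /* raw term counts of a new document */
   newDocVec = qnew * V[, 1:k] * inv(diag(Q[1:k]));
   print docVecs, termVecs, newDocVec;
quit;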
The approach works well for detecting the main themes and topics in a collection, but subtler aspects of the text can be lost. That is because the original representation in the first figure throws away information contained in the text: the number of times a term occurs is maintained, but not the order of occurrence. The longer the document, the more significant this loss of information can be.
The approach in our paper simply replaces each document with its set of sentences, creates a vector for each sentence, and then performs text mining on the sentences. Notice that the number of rows in the diagram below is now much larger. The first index on the letter "d" is the document ID, and the second index indicates the sentence within the document. To ease the notation, the diagram assumes each document contains exactly three sentences.
Then the SVD can form a reduced representation for each sentence and each term, as shown below. Note that the term representation on the right-hand side has the same dimensions as before; only the values in the matrix have changed.
So, while the representation above still neglects the order in which terms occur, the rows are now sentence-based, so the values in the matrices reflect the influence of terms within a sentence rather than across an entire document. A model built on this sentence-based representation is typically more refined than one built at the document level. In short, it can make your analysis more effective, particularly with documents that are longer than a sentence or two.
If you are interested in trying out sentence analysis in your own text mining, you can contact me and I can send you SAS code to convert your document data set to sentences. Also, as soon as our paper is made available, I will add a link to it here. In the meantime, SAS Global Forum is less than a month away! I hope to see you there.
2 Comments
The sentence mining option could be very productive. I would like to try the utility to break documents into sentences.
I am pasting SAS code below to create a data set of sentences. The HPTMINE procedure requires a SAS Text Miner license.
/* Parse the documents and write token positions to the position data set */
proc hptmine data=documents;
   doc_id docid;
   var text;
   parse entities=none nostemming notagging nonoungroups shownumpunct
         outpos=position buildindex;
   performance details;
run;
/* Compute the starting offset and length of each sentence */
data sentenceSize;
   retain document start size;
   set position;
   by document sentence;
   /* _start_ and _end_ are zero-based offsets into the text */
   if first.sentence then start = _start_ + 1;
   if last.sentence then do;
      size = _end_ - start + 2;
      output;
   end;
   keep document start size;
run;
/* Extract each sentence from the original document text */
data sentenceObs;
   length sentences $1000;
   merge sentenceSize(in=A) documents(rename=(docid=document));
   by document;
   if A then do;
      sentences = substrn(text, start, size);
      output;
   end;
   keep sentences document;
run;
/* Remove short sentences of 2 or fewer terms and also number the sentences;
   tm is a previously assigned output library */
data tm.sentenceObs;
   retain sid 0;
   set sentenceObs;
   /* a sentence with at least 2 blanks has roughly 3 or more terms */
   if lengthn(kstrip(sentences)) ge lengthn(kstrip(kcompress(sentences))) + 2 then do;
      sid = sid + 1;
      output;
   end;
run;
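Once tm.sentenceObs exists, the sentences can be mined directly. As one sketch (the k value and output data set names are illustrative assumptions), PROC HPTMINE can compute the SVD projection of each sentence in a single step:

proc hptmine data=tm.sentenceObs;
   doc_id sid;                          /* each sentence is now a "document" */
   var sentences;
   parse outterms=sentenceTerms;
   svd k=50 outdocpro=sentenceVectors;  /* k-dimensional sentence vectors    */
run;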