Tag: SAS Visual Text Analytics

Advanced Analytics | Analytics | Data Management
Estelle Wang 2
Find duplicates and near-duplicates in a corpus with Natural Language Processing

To find exact duplicates, matching all string pairs is the simplest approach, but it is not a very efficient or sufficient technique. Using the MD5 or SHA-1 hash algorithms can get us a correct outcome with a faster speed, yet near-duplicates would still not be on the radar. Text similarity is useful for finding files that look alike. There are various approaches to this and each of them has its own way to define documents that are considered duplicates. Furthermore, the definition of duplicate documents has implications for the type of processing and the results produced. Below are some of the options. Using SAS Visual Text Analytics, you can customize and accomplish this task during your corpus analysis journey either with Python SWAT package or with PROC SQL in SAS.

Advanced Analytics
Sophia Rowland 2
Generating word embeddings

Word embeddings are the learned representations of words within a set of documents. Each word or term is represented as a real-valued vector within a vector space. Terms or words that reside closer to each other within that vector space are expected to share similar meanings. Thus, embeddings try to capture the meaning of each word or term through its relationships with the other words in the corpus.