SAS Blogs

Advanced Analytics | Analytics | Data Management

Estelle Wang | October 3, 2022

Find duplicates and near-duplicates in a corpus with Natural Language Processing

To find exact duplicates, comparing every pair of strings is the simplest approach, but it is neither efficient nor sufficient. Hashing each document with MD5 or SHA-1 produces the correct result faster, yet near-duplicates still go undetected. Text similarity is useful for finding files that look alike. There are various approaches, and each defines in its own way which documents count as duplicates. That definition, in turn, affects the type of processing required and the results produced. Below are some of the options. Using SAS Visual Text Analytics, you can customize and accomplish this task during your corpus analysis journey with either the Python SWAT package or PROC SQL in SAS.
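As a rough illustration of the two ideas in this excerpt, here is a minimal plain-Python sketch. It does not use SAS Visual Text Analytics or the SWAT package. Exact duplicates are grouped by MD5 digest; near-duplicates are scored with Jaccard similarity over word shingles, which is one common similarity measure chosen here for illustration and not necessarily the one the post uses. The function names, the sample documents, the shingle size `k=3`, and the `0.5` threshold are all assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def exact_duplicates(docs):
    """Group document ids whose text has the same MD5 digest."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    # Keep only digests shared by more than one document.
    return [ids for ids in groups.values() if len(ids) > 1]


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return (id1, id2, score) for pairs scoring at or above threshold."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    pairs = []
    for (id1, s1), (id2, s2) in combinations(sets.items(), 2):
        score = jaccard(s1, s2)
        if score >= threshold:
            pairs.append((id1, id2, score))
    return pairs


# Hypothetical sample corpus, for illustration only.
docs = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "The quick brown fox jumps over the lazy dog",
    "c": "The quick brown fox jumps over the lazy cat",
    "d": "An entirely different sentence about text analytics",
}

print(exact_duplicates(docs))
print(near_duplicates(docs))
```

On this sample, hashing catches only the identical pair "a" and "b", while the shingle comparison also flags "c", which differs from them by a single word. Note that the pairwise comparison is quadratic in the number of documents, so a larger corpus would need an indexing or sketching step that this example leaves out.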

English

Artificial Intelligence | Data for Good | Learn SAS

Reggie Townsend | August 8, 2022

Next generation of responsible AI innovators tackle real-world challenges with AI4ALL and SAS

As head of the SAS Data Ethics Practice, I spend a lot of time contemplating the social implications of AI. Consider its benefits, like augmenting medical decisions, and its pitfalls: making decisions based on biased data results in dire consequences for patients. Such implications have the potential to impact society in a variety

English

Education

Analytics | Learn SAS

Alexis Mallis | August 5, 2022

4 reasons to build your data analytics skills with SAS

Are you looking to broaden your data analytics skills to land your dream job or propel your career? Job posting statistics and the country's labor market data show that now is the time to jump on board. As the demand for data skills is growing, the

English

Blogs

Tag: data culture and fluency