Break a sentence into words in SAS

Two of my favorite string-manipulation functions in the SAS DATA step are the COUNTW function and the SCAN function. The COUNTW function counts the number of words in a long string of text. Here "word" means a substring that is delimited by special characters, such as a space character, a period, or a comma. The SCAN function enables you to parse a long string and extract words. You can specify the delimiters yourself or use the default delimiters. Ron Cody discusses these and other string manipulation functions in his excellent 2005 tutorial, "An Introduction to SAS Character Functions."

Using the COUNTW and SCAN functions in the DATA step

For example, the following DATA step reads in a long line of text. The COUNTW function counts how many words are in the string. A loop then iterates over the number of words and the SCAN function extracts each word into a variable:

data parse;
length word $20;                 /* number of characters in the longest word */
input str $ 1-80;
delims = ' ,.!';                 /* delimiters: space, comma, period, ... */
numWords = countw(str, delims);  /* for each line of text, how many words? */
do i = 1 to numWords;            /* split text into words */
   word = scan(str, i, delims);
   output;
end;
drop str delims i;
datalines;
Introduction,to SAS/IML   programming!
Do you have... a question?
;
 
proc print data=parse;
run;

Notice that the delimiters do not include the '/' or '?' characters. Therefore these characters are considered to be part of words. For example, the strings "SAS/IML" and "question?" include those non-letter characters. Notice also that consecutive delimiters, such as extra spaces or an ellipsis, are treated as a single delimiter, so they do not produce empty words.
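
To treat '/' and '?' as delimiters as well, one option is to add them to the delimiter string. The following is a minimal sketch, not part of the original example. The extended delimiter list and the use of the PUT statement to write each word to the log are assumptions made for illustration:

```sas
data _null_;
length word $20;
input str $ 1-80;
delims = ' ,.!/?';                /* assumption: slash and question mark added */
numWords = countw(str, delims);   /* count words by using the longer list */
put numWords= str=;
do i = 1 to numWords;
   word = scan(str, i, delims);
   put i= word=;                  /* write each word to the SAS log */
end;
datalines;
Introduction,to SAS/IML   programming!
Do you have... a question?
;
```

With this delimiter list, "SAS/IML" is split into the two words "SAS" and "IML", and "question?" becomes "question".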

Creating a vector of words in SAS/IML

One of the advantages of the SAS/IML matrix language is that you can call the hundreds of functions in Base SAS. When you pass in a vector of arguments to a Base SAS function, the function returns a vector that is the same size and shape as the parameter. In this way, you can vectorize the calling of Base SAS functions. In particular, you can pass in a vector of indices to the SCAN function and get back a vector of words. You do not need to write a loop to extract multiple words, as the following example demonstrates:

proc iml;
s = "Introduction,to SAS/IML... programming!";
delims = ' ,.!'; 
n = countw(s, delims);  
words = scan(s, 1:n, delims);  /* pass parameter vector: create vector of words */
print words;
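
The same idea applies to other functions that accept a character vector. As a hedged sketch (the names upper and len are illustrative and do not appear in the original program), the vector of words can be passed to the UPCASE and LENGTH functions, and each call returns a result that has the same dimensions as words:

```sas
proc iml;
s = "Introduction,to SAS/IML... programming!";
delims = ' ,.!';
n = countw(s, delims);
words = scan(s, 1:n, delims);  /* row vector of words, as in the previous example */
upper = upcase(words);         /* uppercase version of each word */
len = length(words);           /* number of characters in each word */
print words, upper, len;
quit;
```

No loop is needed in either call, because each function is applied to every element of the vector.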

In summary, Base SAS provides many useful functions such as the string manipulation functions. This article shows that when you call these functions from SAS/IML and pass in a parameter vector, you get back a vector of results.

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

10 Comments

  1. Thank you, very helpful.
    I'm struggling to apply this logic when in your example s is not a scalar but a vector like:
    s = {"Introduction,to SAS/IML... programming!", "Take it a Little further"};

  2. I'm trying something like:

    proc iml;
    s = {"Introduction,to SAS/IML... programming!", "Take it a little further"};

    delims = ' ,.!';
    n = countw(s, delims);
    ctr=nrow(n);
    delims=repeat(delims,ctr,1);

    ctr2=j(ctr,1,1);
    n=ctr2||n;

    words = scan(t(s), 1:2, delims`); /* pass parameter vector: create vector of words */
    print words;

    If I put 1:n as the second argument to the SCAN function, it gives an error. If I put 1:2, the output is the first word of the first row (Introduction) and the second word of the second row (it). I can't make words a matrix because I don't know how to convert the COUNTW result into a vector like {1:4, 1:5}...

    • Rick Wicklin

      In your example, s[1] contains 4 words but s[2] contains 5 words. Matrices have to be rectangular, but you are trying to create a matrix whose first row has four columns and whose second row has five columns.

      You can use a loop to process each element of s:

      do i = 1 to nrow(n);
      words = scan(s[i], 1:n[i], delims);
      print i words;
      end;

      • I have an error:

        60? %put &words;

        WARNING: Apparent symbolic reference WORDS not resolved.
        &words

        How do I make "words" a variable so I can retrieve it by using %put &words;

        I tried to retrieve the first word to nth word in a string and assigned it to another variable string.
        example: inputs: %let variable1 = I want to go to holidays every day;
        outputs: choose first word to 3rd word: %put variable2; "I want to"

        • Rick Wicklin

          You can use PUT instead of %PUT to output the value of a DATA step variable to the log. You can use CALL SYMPUT to create a macro variable from the value of a DATA step variable. If that doesn't answer your question, post your question and data to the SAS Support Communities.

  3. Michael Törnblom

    Does anyone have a solution to the opposite problem, that is, converting a vector of words into sentences without using PROC IML?

  4. Hi, Rick,

    Kudos for yet another practical and to the point post. I've used SCAN before, but needed a quick refresher -- and your post here fits the bill perfectly.

    I now have a macro that parses Libnames out of a master SAS program based on the first word being 'LIBNAME' and the second being the name of the Libname I'm trying to allocate. A simple %Master_Lib(name) now allocates files for us based on a "single source of truth" -- and only those needed instead of a list longer than War and Peace as we were doing by %INCLUDEing the master SAS program.

    My thanks,

    Jim
