Two of my favorite string-manipulation functions in the SAS DATA step are the COUNTW function and the SCAN function. The COUNTW function counts the number of words in a long string of text. Here "word" means a substring that is delimited by special characters, such as a space character, a period, or a comma. The SCAN function enables you to parse a long string and extract words. You can specify the delimiters yourself or use the default delimiters. Ron Cody discusses these and other string manipulation functions in his excellent 2005 tutorial, "An Introduction to SAS Character Functions."
Using the COUNTW and SCAN functions in the DATA step
For example, the following DATA step reads in a long line of text. The COUNTW function counts how many words are in the string. A loop then iterates over the number of words and the SCAN function extracts each word into a variable:
data parse;
length word $20;                 /* number of characters in the longest word */
input str $ 1-80;
delims = ' ,.!';                 /* delimiters: space, comma, period, ... */
numWords = countw(str, delims);  /* for each line of text, how many words? */
do i = 1 to numWords;            /* split text into words */
   word = scan(str, i, delims);
   output;
end;
drop str delims i;
datalines;
Introduction,to SAS/IML programming!
Do you have... a question?
;

proc print data=parse;
run;
Notice that the delimiters do not include the '/' or '?' characters. Therefore these characters are considered to be part of words. For example, the strings "SAS/IML" and "question?" include those non-letter characters. Notice also that consecutive delimiters, such as extra spaces or the periods of an ellipsis, are treated as a single delimiter, so they do not produce empty words.
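If you want "SAS/IML" to split into two words and "question?" to lose its question mark, one option is to add those characters to the delimiter list. The following is a minimal sketch of that idea, not part of the original example; the text is assigned directly to the variable rather than read from DATALINES, and the words are written to the log with PUT:

```sas
data _null_;
   length word $20;
   str = "Introduction,to SAS/IML programming! Do you have... a question?";
   delims = ' ,.!/?';                  /* assumption: also treat '/' and '?' as delimiters */
   do i = 1 to countw(str, delims);
      word = scan(str, i, delims);     /* "SAS" and "IML" are now separate words */
      put i= word=;
   end;
run;
```

With this delimiter list, the last word is "question" rather than "question?".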
Creating a vector of words in SAS/IML
One of the advantages of the SAS/IML matrix language is that you can call the hundreds of functions in Base SAS. When you pass in a vector of arguments to a Base SAS function, the function returns a vector that is the same size and shape as the parameter. In this way, you can vectorize the calling of Base SAS functions. In particular, you can pass in a vector of indices to the SCAN function and get back a vector of words. You do not need to write a loop to extract multiple words, as the following example demonstrates:
proc iml;
s = "Introduction,to SAS/IML... programming!";
delims = ' ,.!';
n = countw(s, delims);
words = scan(s, 1:n, delims);  /* pass parameter vector: create vector of words */
print words;
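The same vectorized behavior applies when the vector is the string argument rather than the index. As a small sketch (the choice of UPCASE and LENGTH here is only for illustration), you can pass the whole vector of words to another function and get back a result for each element:

```sas
proc iml;
s = "Introduction,to SAS/IML... programming!";
delims = ' ,.!';
n = countw(s, delims);
words = scan(s, 1:n, delims);   /* row vector of words */
upper = upcase(words);          /* character vector, same shape as words */
len   = length(words);          /* numeric vector: length of each word */
print upper, len;
quit;
```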
In summary, Base SAS provides many useful functions such as the string manipulation functions. This article shows that when you call these functions from SAS/IML and pass in a parameter vector, you get back a vector of results.
10 Comments
Thanks. I used this today in combination with some regular expressions to solve a problem I was working on.
Thank you, very helpful.
I'm struggling to apply this logic when in your example s is not a scalar but a vector like:
s = {"Introduction,to SAS/IML... programming!", "Take it a Little further"};
I'm trying something like:
proc iml;
s = {"Introduction,to SAS/IML... programming!", "Take it a little further"};
delims = ' ,.!';
n = countw(s, delims);
ctr=nrow(n);
delims=repeat(delims,ctr,1);
ctr2=j(ctr,1,1);
n=ctr2||n;
words = scan(t(s), 1:2, delims`); /* pass parameter vector: create vector of words */
print words;
If I put 1:n as the second argument to the SCAN function, it gives an error. If I put 1:2, the output is the first word of the first row (Introduction) and the second word of the second row (it). I can't make words a matrix because I don't know how to convert the COUNTW result into a vector like {1:4, 1:5}...
In your example, s[1] contains 4 words but s[2] contains 5 words. Matrices have to be rectangular, but you are trying to create a matrix whose first row has four columns and whose second row has five columns.
You can use a loop to process each element of s:
do i = 1 to nrow(n);
words = scan(s[i], 1:n[i], delims);
print i words;
end;
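If you need all the words in a single matrix, one possible workaround is to allocate a rectangular character matrix that is wide enough for the longest sentence and leave the unused cells blank. This is only a sketch under assumptions: the element length of 20 characters is arbitrary, and the BLANKSTR function requires a release of SAS/IML that supports it (a literal string of blanks would serve the same purpose).

```sas
proc iml;
s = {"Introduction,to SAS/IML... programming!", "Take it a little further"};
delims = ' ,.!';
n = countw(s, delims);                       /* column vector of word counts */
words = j(nrow(s), max(n), BlankStr(20));    /* blank-padded rectangular matrix */
do i = 1 to nrow(s);
   words[i, 1:n[i]] = scan(s[i], 1:n[i], delims);   /* fill row i with its words */
end;
print words;
quit;
```

Rows that have fewer words than the longest sentence end with blank cells.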
I have an error:
60? %put &words;
WARNING: Apparent symbolic reference WORDS not resolved.
&words
How do I make "words" a variable so I can retrieve it by using %put &words;
I am trying to retrieve the first through nth words of a string and assign them to another macro variable.
Example input: %let variable1 = I want to go to holidays every day;
Desired output when choosing the first through third words: %put &variable2; "I want to"
You can use PUT instead of %PUT to output the value of a DATA step variable to the log. You can use CALL SYMPUT to create a macro variable from the value of a DATA step variable. If that doesn't answer your question, post your question and data to the SAS Support Communities.
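For the example in the question, a minimal sketch might look like the following. It uses CALL SYMPUTX, a variant of CALL SYMPUT that strips leading and trailing blanks from the value; the variable names str and first3 are hypothetical.

```sas
%let variable1 = I want to go to holidays every day;

data _null_;
   length first3 $200;
   str = "&variable1";                  /* macro variable resolves inside double quotes */
   do i = 1 to 3;                       /* first through third words */
      first3 = catx(' ', first3, scan(str, i, ' '));
   end;
   call symputx('variable2', first3);   /* create the macro variable */
run;

%put &variable2;
```

The %PUT statement comes after the RUN statement so that the DATA step has finished and the macro variable exists when it is referenced.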
Does anyone have a solution to the opposite problem, that is, converting a vector of words into sentences without using PROC IML?
I assume you want to use the "CAT" series of functions, which includes CAT, CATS, CATT, and CATX. One method would be to use CATX with a space character as the separator. You can read a blog post, read the documentation, or ask a specific question on the SAS Support Communities.
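As a rough sketch of the CATX approach, assuming the words are stored in a DATA step array (the array name w and its contents are made up for illustration):

```sas
data _null_;
   length sentence $100;
   array w[5] $12 ('Take' 'it' 'a' 'little' 'further');
   sentence = catx(' ', of w[*]);   /* join the words, separated by single spaces */
   put sentence=;
run;
```

CATX strips leading and trailing blanks from each word before joining, so the padding in the $12 array elements does not appear in the result.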
Hi, Rick,
Kudos for yet another practical and to the point post. I've used SCAN before, but needed a quick refresher -- and your post here fits the bill perfectly.
I now have a macro that parses Libnames out of a master SAS program based on the first word being 'LIBNAME' and the second being the name of the Libname I'm trying to allocate. A simple %Master_Lib(name) now allocates files for us based on a "single source of truth" -- and only those needed instead of a list longer than War and Peace as we were doing by %INCLUDEing the master SAS program.
My thanks,
Jim
Very cool! Thanks for writing.