SAS provides an extensive set of tools for data cleansing and preparation – transforming data to a shape suitable for analysis, text mining, reporting, modeling and ultimately decision making.
In this post we will cover one of the common tasks of character data manipulation – inserting a substring into a SAS character string.
See also: Deleting a substring from a SAS string
A diagram below is a visual example of what we are going to achieve by combining two strings, or more precisely by inserting one string within another string at a particular position:
SAS character strings come in two different incarnations: character variables and macro variables. Since these two are quite different SAS language objects, let’s cover them one by one separately.
Inserting a substring into a character variable
Here is our task: we have a SAS character variable (string) and we want to insert in it a value of another character variable (substring) starting at a particular specified position.
Let’s say we have a string BASE in which we want to insert a COUNTRY name right before word "stays" to make different variation of the resultant phrase. Here is an example of how this can be easily done:
data COUNTRIES; length COUNTRY $20; input COUNTRY; datalines; Spain Argentina Slovenia Romania USA Luxembourg Egypt Switzerland ; data NEW (keep=COUNTRY PHRASE); BASE = 'The rain in stays mainly in the plain'; INSPOS = find(BASE,'stays'); set COUNTRIES; length PHRASE $50; PHRASE = catx(' ',substr(BASE,1,INSPOS-1),COUNTRY,substr(BASE,INSPOS)); run; |
This code dynamically creates variable PHRASE out of values of variable BASE and the values of variable COUNTRY, thus making it data-driven.
After this code runs, the data set NEW will look like this:
Code highlights
- We use find() function to find a position of word "stays" in BASE; that is the position where we want to insert our country name.
- We use substr() function to extract two parts of the variable BASE - before and after insertion:
- substr(BASE,1,INSPOS-1) captures the first part of the BASE (before insertion): substring of BASE starting from the position 1 with a length of INSPOS-1. We need to reduce INSPOS by 1 to derive the length of this first piece (since our INSPOS is the position of the first character of word "stays").
- substr(BASE,INSPOS) captures the second part of the BASE (after insertion): substring of BASE starting from the position INSPOS till the end of BASE value (since the third argument, length, is not specified).
- catx() function stitches (concatenates) all three pieces together. It also removes leading and trailing blanks from each piece and separates all three pieces with blanks (as specified by its first argument).
Code note
- Make sure you defined the length of variable PHRASE large enough to accommodate initial value of BASE, plus country name plus any separators you might have. The maximum length of fixed-length character variables is 32,767 bytes. The maximum length of varying-length character variables is 536,870,911 characters (UTF-8 encoding).
Inserting a substring into a SAS macro variable
Let’s solve a similar task, but now instead of SAS variables we will operate with SAS macro variables, since they are strings too.
Here is our problem to solve: we have a SAS macro variable (string) and we want to insert in it a value of another macro variable (substring) starting at a particular specified position.
Let’s say we have a macro variable BASE with value of The rain in stays mainly in the plain in which we want to insert a country name defined by macro variable COUNTRY with value of Spain right before word stays. Here is an example of how this can be done:
%let BASE = The rain in stays mainly in the plain; %let COUNTRY = Spain; %let W = stays; %let INSPOS = %index(&BASE,&W); %let PHRASE = %substr(&BASE,1,%eval(&INSPOS-1))&COUNTRY %substr(&BASE,&INSPOS); %put ***&PHRASE***; |
This code will insert the country name in the appropriate place within the BASE macro variable which will be printed in the SAS log by %put statement:
***The rain in Spain stays mainly in the plain***
Code highlights
- %index() macro function searches its first argument (&BASE) for the first occurrence of the second argument (&W) and returns the position of its first character. That is the position where we want to insert the macro variable COUNTRY’s value (&COUNTRY).
- We use %substr() macro function to extract two parts of its first argument (&BASE) - before and after insertion:
- %substr(&BASE,1,%eval(&INSPOS-1))captures the first part of &BASE (before insertion): substring of &BASE starting from the position 1 with a length of %eval(&INSPOS-1).
- %substr(&BASE,&INSPOS) captures the second part of &BASE (after insertion): substring of &BASE starting from the position &INSPOS till the end of &BASE (since the third argument is not specified).
- In case of macro variables, we don’t need any concatenation functions – we just list the component pieces of the macro variable value in a proper order with desired separators (blanks in this case).
Code notes
- Unlike for SAS variables, you don’t need to assign the length of SAS macro variables which are automatically defined by their assigned values. The maximum length of SAS macro variables is 65,534 bytes.
- Alternatively, you can achieve the same result using your macro variable values in double quotes as character constants (variable values) in a DATA step. There, you do inserts (or other string manipulations) as you would do with character variables, then convert your variables back to macro variables using CALL SYMPUTX(). Here is an example:
%let BASE = The rain in stays mainly in the plain; %let COUNTRY = Spain; %let W = stays; data _null_; INSPOS = find("&BASE","&W"); PHRASE = catx(' ',substr("&BASE",1,INSPOS-1),"&COUNTRY",substr("&BASE",INSPOS)); call symputx('PHRASE',PHRASE); run; %put ***&PHRASE***;
Inserting multiple instances of a substring into a SAS character string
Sometimes you need to insert a substring into several places (positions p1, p2, …, pn) of a character string. In this case you can use the above strategy repeatedly or iteratively with one little caveat: start inserting from the highest position and moving backwards to the lowest position. This will preserve your pre-determined positions because positions are counted from left to right and inserting a substring at a higher position won’t change the lower position number. Otherwise, after insertion of a substring into lower position, all your higher positions will shift by the length of the inserted substring.
Additional Resources for SAS character strings processing
- Removing repeated characters in SAS strings
- How to unquote SAS character variable values
- Expanding lengths of all character variables in SAS data sets
- Finding n-th instance of a substring within a string
Your thoughts?
Have you found this blog post useful? Please share your thoughts and feedback in the comments section below.
12 Comments
Useful to know. Thank you Leonid!
You are welcome, Deb! Thank you for your feedback.
Indeed, SAS RULES! Thank you!
Great catch! Thank you, Anna.
Another great post. Thanks for sharing.
Thank you, Bill, and you are welcome!
And if we need a macro variable created for each country we can do:
So deriving PHRASE in a data step for each country and then having the first argument of CALL SYMPUTX the variable that holds the names of all countries. At each data step iteration, a new macro variable will be created which is considered a data-driven derivation.
Thank you, Hayk, for your comment!
Yes, you can do that, but naming macro variables after country names or any other data in general may put you in a precarious situation when "data-driven" is not the best approach. For example, a country name may be comprised of two words (e.g. New Zealand or South Korea) or even three words (e.g. United Arab Emirates or Bosnia and Herzegovina), which is inconsistent with macro variable naming rules. This will lead to an error:
ERROR: Symbolic variable name New Zealand must contain only letters, digits, and underscores.
I'd rather use macro names COUTRY1, COUNTRY2, ... to make sure macro variable names are not data driven:
call symputx(cats('COUNTRY',_n_),PHRASE);
SAS Rules. 🙂
Nice post, Leonid. I never think to use the find function. That's a nice use case.
Thank you, Nicole, for your feedback. For me, find() is the most frequently used function. It's somewhat similar to index() function, but has more advanced features.
Wow....you can play by SAS whatever you want.
Thank you, Madhurima, for your cute comment. By the way, in this post I "embedded" a "hidden" message (actually, it is not so hidden, it is marked with red color) which somewhat resonates with your comment. Can you find it? 🙂