A trick to combine and split strings

0

Every year, I write a special article for Halloween in which I show a SAS programming TRICK that is a real TREAT! This year, the trick is to concatenate two strings into a single string in a way that guarantees you can always recover the original strings. I learned this trick from @RobinHouston on Twitter.

Why would you want to combine and pull apart strings?

I have combined strings when I wanted to concatenate levels of two categorical variables to form an interaction variable. By combining the strings, you can create a box plot that shows the response for each joint level of two categorical variables. I have not needed to separate the strings later, but I can imagine scenarios in which that capability might be useful.

To give a concrete example, suppose you have two strings: "ghost" and "goblin." The first trick constructs a string such as "ghost/goblin" and shows you how to reconstruct the original values from that string if the first string does not contain the character ('/') that is used to separate the first string from the second. The second trick constructs a more complicated string: "ghost/goblinghost*goblin". You can always reconstruct the original values regardless of the characters in the strings.

The easy way: Use one delimiter

A simple technique for combining strings is to use a single delimiter to concatenate two strings and then pull them apart. That idea works provided that the delimiter is not part of the first string. Let's see how it works by concatenating the Make and Model character variables from the Sashelp.Cars data set. The following DATA step extracts a subset of the full data:

data Have;
set Sashelp.Cars;
by Make;
if First.Make;
keep Make Model;
run;
 
proc print data=Have(obs=5);
run;

Suppose that you want to concatenate the Make and Model variables to form a new string while still being able to determine where the first string ends and the second string begins. If you know that some character such as '/' is not part of the first string, you can use the CAT or CATS function to concatenate the strings. The CATS function strips off leading and trailing blanks from all strings. Because SAS strings are blank-padded, I prefer to use the CATS function, as shown in the following DATA step:

%let DELIM = /;    /* choose any rare character as a delimiter */
data SimpleCombine;
length Join $55;
set Have;
/* Use CAT to preserve leading and trailing blanks.
   Use CATS to strip leading and trailing blanks.   */
Join = cats(Make,"&DELIM",Model);  
run;
 
proc print data=SimpleCombine(obs=5);
run;

The JOIN variable contains the concatenated strings, separated by the '/' character. If you want to recover the original two values (minus any spaces that were stripped off), you can use the FIND function to locate the position of the delimiter, then use the SUBSTR function to extract the substrings before and after the delimiter, as follows:

data SimpleSplit;
length v1 v2 $55;
set SimpleCombine;
loc = find(Join, "&DELIM");       /* location of delimiter */
v1 = substr(Join, 1, loc-1);
v2 = substr(Join, loc+1);
run;
 
proc print data=SimpleSplit(obs=5);
run;

The output shows that the v1 and v2 variables contain the original string values. The v1 and v2 variables do not contain any leading or trailing blanks, but they would if you use the CAT function instead of the CATS function. Thus, if the delimiter is not part of the first string, you can use a single delimiter to combine and split strings. The logic of the program assumes that the position of the first delimiter in the combined string is the location at which to split the string. That logic fails if the first string contains the delimiter as one of its characters.

A trick for combining and splitting any strings

In many situations, you know that the input strings will not contain a certain character, such as '/'. However, if you want a FOOLPROOF method that will work for ANY strings regardless of their contents, the trick from @RobinHouston become relevant. The trick enables you to combine and split any two strings!

The basic idea is to combine the strings by using two different delimiters instead of only one. I will use the delimiters '/' and '*', but the values do not matter. You can use 'A' and 'B' or '0' and '1', if you prefer. The method still works if the delimiter is contained in one or both strings.

Here's how to combine the strings so that they can be split later. If the original strings are v1 and v2, you form the concatenated strings
      s1 = v1 + '/' + v2
and
      s2 = v1 + '*' + v2
Notice that the strings s1 and s2 have the same length. In fact, they are identical except for one location: s1 contains the first delimiter whereas s2 contains the second delimiter. You then store the concatenated string
      s = s1 + s2
which always has an even number of characters.

You can use the SAS DATA step to implement this combination step:

/* Use the characters '/' and '*' to mark the boundary between Make and Model 
   Note: '*'=ASCII 42 precedes '/'=ASCII 47 in the ASCII table */
data Encode;
length s1 s2 $55 Join $110;
set Have;
/* Use CAT to preserve leading and trailing blanks.
   Use CATS to strip leading and trailing blanks.   */
s1 = cats(Make,'/',Model);  
s2 = cats(Make,'*',Model);
Join = cats(s1,s2);        /* Join does not have leading spaces */
drop s1 s2;
run;
 
proc print data=Encode(obs=5);
run;

To recover the original strings from s, you can do the following:

  • Let s1 be the first half of s and s2 be the second half of s. Because s has an even number of characters, this operation is always well defined.
  • Use the COMPARE function to find the position, k, where s1 differs from s2. By construction, this is the location of the delimiter.
  • Recover v1 by extracting the first k-1 characters from s1 and recover v2 by using the characters after the (k+1)st position.

The following DATA step splits the string back into its original values:

/* Split the encoded string in half. Use the COMPARE function to find where the two 
   halves are different. Split the string at that location.  */
data Decode;
set Encode;
L = lengthc(trim(Join));         /* trim any trailing spaces */
s1 = substr(Join, 1, L/2);       /* first half of string */
s2 = substr(Join, L/2 + 1, L/2); /* second half */
/* find the location where strings differ */
diffLoc = abs(compare(s1, s2));
v1 = substr(s1, 1, diffLoc-1);
v2 = substr(s1, diffLoc+1);
drop s1 s2 diffLoc;
run;
 
proc print data=Decode(obs=5);
run;

The output shows that the v1 and v2 variables contain the original strings.

Summary

This article shows how to concatenate two strings so that you can recover the original string values. The first trick uses only one delimiter and requires that you choose a character that is not part of the first string. The second trick is more complicated, but it works for ANY strings, regardless of their contents.

I think both tricks are real treats! You might never have need for the second trick, but it is very clever. I like to write these tricks down in case someone in the future (maybe me!) searches for "how do I combine two strings so that I can recover the original strings?" Post a comment if you use either of these tricks!

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

Leave A Reply

Back to Top