Removing repeated characters in SAS strings

14

Removing duplicate charactersIn this blog post we are going to tackle a data cleansing task of removing unwanted repeated characters in SAS character variables.

Character repetition can stem from various stages of data life cycle: from data collection, to data transmission, to data transformation. It can be accidental or intentional by design. It can be sporadic or consistent. In either case, it needs to be addressed by robust data cleansing processes to ensure adequate data quality that is imperative for the data usability.

Character repetition examples

Example 1. Data entry, especially manual data entry, can be a high-risk factor for accidental character duplication. Have you ever pressed a key on your computer keyboard for a bit longer than intended, so it started automatically typing multiple characters???????????????

Keyboard properties adjustment

 Tip: You can adjust your Keyboard Properties to control “Repeat delay” and “Repeat rate” settings (on Windows computer, search for “Keyboard” and click on Keyboard in Control Panel).

Example 2. Recently, I had to deal with the data that contained multiple consecutive double quotation marks all over the character string values. Even though we don’t know the exact cause of it, still for each of these duplicated quotation marks occurrences we needed to replace them with a single quotation mark.

Removing repeated blanks

There is a very useful COMPBL() function in SAS that removes multiple consecutive blanks from a character string replacing them with a single blank.

Let’s use the COMPBL function as a prototype and create our own user-defined function that will do what COMPBL does to the blanks but extend its functionality to other characters.

Removing any repeated character - prototype

Based on the approach described in a blog post Removing unwanted characters from text strings by Amadeus Software we developed a prototype using TRANWRD() function that replaces any two equal characters with a single one. Instead of creating a function for replacing a specific character and calling it recursively, we placed the TRANWRD function in a loop that iterates while there are any two repeated characters are found. Here is the code illustrating this approach:

data D;
   c = ','; *<- character to un-duplicate;
   cc = c||c; *<- double character;
   string = 'Many,,,,,, commas,,,,, in,,, this,, sentence.,'; *<- source string;
   put 'BEFORE:' string=; *<- output initial string;
   do while (find(string,cc)); *<- loop through while there are doubles;
      string = tranwrd(string,cc,c); *<- replace double with a single character;
   end;
   put 'AFTER: ' string=; *<- output unduplicated string;
run;

This code will produce the following in the SAS log:

BEFORE:string=Many,,,,,, commas,,,,, in,,, this,, sentence.,
AFTER: string=Many, commas, in, this, sentence.,

which shows that this approach correctly un-duplicates the source string removing and replacing all repeated characters (commas in our example) with a single one.

User-defined SAS function for removing any repeated characters

Let’s use PROC FCMP to build a user-defined function UNDUPC (str, clist) where the two required arguments are:

  1. str - specifies a character constant, variable, or expression that you want to remove repeated characters from replacing them with a single character;
  2. clist - is a constant, variable, or character expression that specifies (lists) all the characters that you want to unduplicate in string.

The following code will create such a function:

libname funclib 'c:\projects\functions';
proc fcmp outlib=funclib.userfuncs.package1;
   function undupc(str $, clist $) $;
      length x $32767 c $1 cc $2;
      x = str; 
      do i=1 to length(clist);
         c = char(clist,i);
         cc = c||c;
         do while (find(trim(x),cc));
            x = tranwrd(x,cc,c);
         end;
      end;
      return (x); 
   endfunc; 
run;

Code highlights

  • We introduce an interim variable x to which we will iteratively apply replacing double characters with a single one.
  • We assign length attribute of this variable to be maximum allowable character length of 32767 bytes to accommodate any character length used in the calling program.
  • Outer do-loop loops through the clist containing characters we want to unduplicate.
  • Variable c is assigned a single character from clist, variable cc is assigned double of the cc value.
  • Inner do-loop iterates through trimmed characters in x while doubles are found; using trim(x) is essential as it not only speeds up processing while searching through a shorter string (without trailing blanks), it also prevents from falling into an infinite loop in case clist contains blank character to unduplicate (cc equals to double blanks which are always going to be found among trailing blanks).

Let’s test our newly minted UNDUPC function on the following data:

data SOURCE;
   infile datalines truncover;
   input str $50.;
   datalines;
"""Repeated "double quotes""""
Repeated,,,,,commas,,,,,,,,,,,
[[[""Mixed""]]   characters,,,
;

Since our user-defined function is permanently stored in the location specified in the outlib= option of the proc fcmp, now we can easily use it in any SAS session as in the following example:

options cmplib=funclib.userfuncs;
data TARGET;
   set SOURCE;
   length new_str $50;
   new_str = undupc(str, ' ,"][');
run;

This code will remove and replace all repeated sequences of characters ' ', ',', '"', ']', and '['. The order of these characters listed in the second argument doesn’t matter. Here is what we get:

Duplicate characters removal results
As you can see, we get what we wanted including the functionality of the COMPBL function.

User-defined CALL routine for removing any repeated characters

As much as I love user-defined functions, I have an issue with the above undupc user-defined function implementation. It has to do with how the PROC FCMP handles interim character variables length attribute assignment. It does not implicitly inherit their length attribute from another variable as SAS data step does. For example, if you run the following data step:

data a;
   length x $99;
   y = x;
run;

variable y will have the length attribute $99 implicitly inherited from the x variable.

In PROC CFMP function, you can either assign the length attribute to a character variable explicitly with LENGTH or ATTRIB statement (as we did by using length x $32767 ), or it will be set to $33 if you use any other way of implicit assignment. (I leave it up to you guessing why 33 and why not any other number.) Since we wanted to accommodate SAS character strings of any length, we had to explicitly assign our interim variable x length attribute the maximum valid value of $32767. This will inevitably take tall on the function performance as we will have to process longer strings.

However, we can avoid this issue by using CALL routine instead:

libname funclib 'c:\projects\functions';
proc fcmp outlib=funclib.usercalls.package1;
   subroutine undupc(str $, clist $, x $);
      outargs x;
      length c $1 cc $2;
      x = str;
      do i=1 to length(clist);
         c = char(clist,i);
         cc = c||c;
         do while (find(trim(x),cc));
            x = tranwrd(x,cc,c);
         end;
      end;
   endsub; 
run;

This code is very similar to the user-defined function above with a slight difference. Here, x variable is listed as an argument in the subroutine definition and refers to a SAS data step variable whose length attribute is assigned in the calling data step. Unlike SAS function, SAS subroutine does not return a value; instead, it uses OUTARGS statement that specifies arguments in an argument list that you want a subroutine to update.  Therefore, we don't need to assign the length attribute to x within the subroutine, we just place x in the OUTARGS statement. Here is how you would invoke a user-defined CALL routine instead of a function:

options cmplib=funclib.usercalls;
data TARGET;
   set SOURCE;
   length new_str $50;
   call undupc(str, ' ,"][', new_str);
run;

Notice, that now we control and define the length attribute of the third argument in the calling program, not in the CALL routine itself. And we get the same results as with the UNDUPC function above.

Store user-defined functions and subroutines separately

You can create and have both, user-defined function and call routine with the same name. However, to avoid confusion (and errors) do not store their definitions in the same data table (outlib= option of the PROC FCMP). If they are stored in the same data table, then when used in a DATA step, SAS will pull the latest definition by its name only and that may not be the entity you want.

Performance benchmarking

To compare performances of the UNDUPC function vs UNDUPC subroutine we created a rather large data table (1 Million observations) with randomly generated strings (1000 characters long):

libname SASDL 'C:\PROJECTS\TESTDATA';
 
data SASDL.TESTDATA (keep=str);
   length str $1000;
   do i=1 to 1000000;
      str = '';
      do j=1 to 1000;
         str = cats(str,byte(int(rank(' ')+38*rand('UNIFORM'))));
      end;
      output;
   end;
run;

Then we ran the following 2 data steps, one using the undupc() function, and the other using undupc() call routine:

options cmplib=funclib.userfuncs;
 
data SASDL.TESTDATA_UNDUPC_FUNC;
   set SASDL.TESTDATA;
   length new_str $1000;
   new_str = undupc(str, '#+');
run;
 
options cmplib=subrlib.usercalls;
 
data SASDL.TESTDATA_UNDUPC_CALL;
   set SASDL.TESTDATA;
   length new_str $1000;
   call undupc(str, '#+', new_str);
run;

A quick SAS log inspection reveals that CALL UNDUPC works as much as 3 times faster than UNDUPC function (10 seconds vs. 30 seconds). The time savings may vary depending on your data composition and computing environment, but in any case, if you process high volumes of data you may consider using CALL routine over function. This is not a blanket statement, as it only pertains to this particular algorithm of eliminating character repetitions where we had to explicitly assign the highest possible length attribute value to the interim variable in the function, but not in the CALL routine.

When we reduced declared length of x from $32767 to $1000 within the user-defined function definition its performance became on par with the CALL routine.

Additional Resources for SAS character strings processing

Your thoughts?

Have you found this blog post useful? Would you vote for implementing UNDUPC as a native built-in SAS function? Please share your thoughts and feedback in the comments section below.

Share

About Author

Leonid Batkhan

Leonid Batkhan, Ph.D. in Computer Science and Automatic Control Systems, has been a SAS user for more than 25 years. He came to work for SAS in 1995 and is currently a Senior Consultant with the SAS Federal Data Management and Business Intelligence Practice. During his career, Leonid has successfully implemented dozens of SAS applications and projects in various industries. All posts by Leonid Batkhan >>>

14 Comments

  1. I have used the COMPBL() function in the past and it's possible that I would want a similar function for something other than spaces.

  2. Tatiana Pshenichnikov on

    This is great! I believe this function should to be natively implemented and included in SAS software, similar to COMPBL().

  3. Hi Leonid, really good job, I tried to propose a faster function but failed.
    Just one question: Is "endfunc" a variant of "endsub" ?

    • Leonid Batkhan

      Thank you, Nero. I also tried several different versions and chose the most efficient one. I would love to see a faster implementation though. However, I realize that speed of execution depends on the distribution of repeated characters as well as the length of duplication streaks, so there might not be a universally "better" implementation as it is data-dependent.

      As for "endfunc" vs. "endsub", I think you can use them interchangeably, but I prefer using "endfunc" for functions, and "endsub" for subroutines.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top