In this blog post we are going to tackle a data cleansing task of removing unwanted repeated characters in SAS character variables.
Character repetition can stem from various stages of data life cycle: from data collection, to data transmission, to data transformation. It can be accidental or intentional by design. It can be sporadic or consistent. In either case, it needs to be addressed by robust data cleansing processes to ensure adequate data quality that is imperative for the data usability.
Character repetition examples
Example 1. Data entry, especially manual data entry, can be a high-risk factor for accidental character duplication. Have you ever pressed a key on your computer keyboard for a bit longer than intended, so it started automatically typing multiple characters???????????????
Tip: You can adjust your Keyboard Properties to control “Repeat delay” and “Repeat rate” settings (on Windows computer, search for “Keyboard” and click on Keyboard in Control Panel).
Example 2. Recently, I had to deal with the data that contained multiple consecutive double quotation marks all over the character string values. Even though we don’t know the exact cause of it, still for each of these duplicated quotation marks occurrences we needed to replace them with a single quotation mark.
Removing repeated blanks
There is a very useful COMPBL() function in SAS that removes multiple consecutive blanks from a character string replacing them with a single blank.
Let’s use the COMPBL function as a prototype and create our own user-defined function that will do what COMPBL does to the blanks but extend its functionality to other characters.
Removing any repeated character - prototype
Based on the approach described in a blog post Removing unwanted characters from text strings by Amadeus Software we developed a prototype using TRANWRD() function that replaces any two equal characters with a single one. Instead of creating a function for replacing a specific character and calling it recursively, we placed the TRANWRD function in a loop that iterates while there are any two repeated characters are found. Here is the code illustrating this approach:
data D; c = ','; *<- character to un-duplicate; cc = c||c; *<- double character; string = 'Many,,,,,, commas,,,,, in,,, this,, sentence.,'; *<- source string; put 'BEFORE:' string=; *<- output initial string; do while (find(string,cc)); *<- loop through while there are doubles; string = tranwrd(string,cc,c); *<- replace double with a single character; end; put 'AFTER: ' string=; *<- output unduplicated string; run; |
This code will produce the following in the SAS log:
BEFORE:string=Many,,,,,, commas,,,,, in,,, this,, sentence.,
AFTER: string=Many, commas, in, this, sentence.,
which shows that this approach correctly un-duplicates the source string removing and replacing all repeated characters (commas in our example) with a single one.
User-defined SAS function for removing any repeated characters
Let’s use PROC FCMP to build a user-defined function UNDUPC (str, clist) where the two required arguments are:
- str - specifies a character constant, variable, or expression that you want to remove repeated characters from replacing them with a single character;
- clist - is a constant, variable, or character expression that specifies (lists) all the characters that you want to unduplicate in string.
The following code will create such a function:
libname funclib 'c:\projects\functions'; proc fcmp outlib=funclib.userfuncs.package1; function undupc(str $, clist $) $; length x $32767 c $1 cc $2; x = str; do i=1 to length(clist); c = char(clist,i); cc = c||c; do while (find(trim(x),cc)); x = tranwrd(trim(x),cc,c); end; end; return (x); endfunc; run; |
Code highlights
- We introduce an interim variable x to which we will iteratively apply replacing double characters with a single one.
- We assign length attribute of this variable to be maximum allowable character length of 32767 bytes to accommodate any character length used in the calling program.
- Outer do-loop loops through the clist containing characters we want to unduplicate.
- Variable c is assigned a single character from clist, variable cc is assigned double of the cc value.
- Inner do-loop iterates through trimmed characters in x while doubles are found; using trim(x) is essential as it not only speeds up processing while searching through a shorter string (without trailing blanks), it also prevents from falling into an infinite loop in case clist contains blank character to unduplicate (cc equals to double blanks which are always going to be found among trailing blanks).
Let’s test our newly minted UNDUPC function on the following data:
data SOURCE; infile datalines truncover; input str $50.; datalines; """Repeated "double quotes"""" Repeated,,,,,commas,,,,,,,,,,, [[[""Mixed""]] characters,,, ; |
Since our user-defined function is permanently stored in the location specified in the outlib= option of the proc fcmp, now we can easily use it in any SAS session as in the following example:
options cmplib=funclib.userfuncs; data TARGET; set SOURCE; length new_str $50; new_str = undupc(str, ' ,"]['); run; |
This code will remove and replace all repeated sequences of characters ' ', ',', '"', ']', and '['. The order of these characters listed in the second argument doesn’t matter. Here is what we get:
As you can see, we get what we wanted including the functionality of the COMPBL function.
User-defined CALL routine for removing any repeated characters
As much as I love user-defined functions, I have an issue with the above undupc user-defined function implementation. It has to do with how the PROC FCMP handles interim character variables length attribute assignment. It does not implicitly inherit their length attribute from another variable as SAS data step does. For example, if you run the following data step:
data a; length x $99; y = x; run; |
variable y will have the length attribute $99 implicitly inherited from the x variable.
In PROC CFMP function, you can either assign the length attribute to a character variable explicitly with LENGTH or ATTRIB statement (as we did by using length x $32767 ), or it will be set to $33 if you use any other way of implicit assignment. (I leave it up to you guessing why 33 and why not any other number.) Since we wanted to accommodate SAS character strings of any length, we had to explicitly assign our interim variable x length attribute the maximum valid value of $32767. This will inevitably take tall on the function performance as we will have to process longer strings.
However, we can avoid this issue by using CALL routine instead:
libname funclib 'c:\projects\functions'; proc fcmp outlib=funclib.usercalls.package1; subroutine undupc(str $, clist $, x $); outargs x; length c $1 cc $2; x = str; do i=1 to length(clist); c = char(clist,i); cc = c||c; do while (find(trim(x),cc)); x = tranwrd(trim(x),cc,c); end; end; endsub; run; |
This code is very similar to the user-defined function above with a slight difference. Here, x variable is listed as an argument in the subroutine definition and refers to a SAS data step variable whose length attribute is assigned in the calling data step. Unlike SAS function, SAS subroutine does not return a value; instead, it uses OUTARGS statement that specifies arguments in an argument list that you want a subroutine to update. Therefore, we don't need to assign the length attribute to x within the subroutine, we just place x in the OUTARGS statement. Here is how you would invoke a user-defined CALL routine instead of a function:
options cmplib=funclib.usercalls; data TARGET; set SOURCE; length new_str $50; call undupc(str, ' ,"][', new_str); run; |
Notice, that now we control and define the length attribute of the third argument in the calling program, not in the CALL routine itself. And we get the same results as with the UNDUPC function above.
Store user-defined functions and subroutines separately
You can create and have both, user-defined function and call routine with the same name. However, to avoid confusion (and errors) do not store their definitions in the same data table (outlib= option of the PROC FCMP). If they are stored in the same data table, then when used in a DATA step, SAS will pull the latest definition by its name only and that may not be the entity you want.
Performance benchmarking
To compare performances of the UNDUPC function vs UNDUPC subroutine we created a rather large data table (1 Million observations) with randomly generated strings (1000 characters long):
libname SASDL 'C:\PROJECTS\TESTDATA'; data SASDL.TESTDATA (keep=str); length str $1000; do i=1 to 1000000; str = ''; do j=1 to 1000; str = cats(str,byte(int(rank(' ')+38*rand('UNIFORM')))); end; output; end; run; |
Then we ran the following 2 data steps, one using the undupc() function, and the other using undupc() call routine:
options cmplib=funclib.userfuncs; data SASDL.TESTDATA_UNDUPC_FUNC; set SASDL.TESTDATA; length new_str $1000; new_str = undupc(str, '#+'); run; options cmplib=subrlib.usercalls; data SASDL.TESTDATA_UNDUPC_CALL; set SASDL.TESTDATA; length new_str $1000; call undupc(str, '#+', new_str); run; |
A quick SAS log inspection reveals that CALL UNDUPC works as much as 3 times faster than UNDUPC function (10 seconds vs. 30 seconds). The time savings may vary depending on your data composition and computing environment, but in any case, if you process high volumes of data you may consider using CALL routine over function. This is not a blanket statement, as it only pertains to this particular algorithm of eliminating character repetitions where we had to explicitly assign the highest possible length attribute value to the interim variable in the function, but not in the CALL routine.
When we reduced declared length of x from $32767 to $1000 within the user-defined function definition its performance became on par with the CALL routine.
Additional Resources for SAS character strings processing
- An Introduction to SAS® Character Functions
- How to unquote SAS character variable values
- Finding n-th instance of a substring within a string
Your thoughts?
Have you found this blog post useful? Would you vote for implementing UNDUPC as a native built-in SAS function? Please share your thoughts and feedback in the comments section below.
22 Comments
Leonid This is great. Thanks for sharing!! A question: if the list clist ends with blanks the function doesn't clean duplicate blanks, what is the reason?
Hi Oscar, great observation. The algorithm loops through the characters in CLIST from position 1 to LENGTH(CLIST). The LENGTH(CLIST) function returns an integer that represents the position of the rightmost non-blank character in CLIST. The ending blank is considered as trailing blank and therefore is removed from consideration. If you want to include blank in CLIST you should place it before any non-blank character. If blank is the only character in the CLIST, you should rather use COMPBL() function.
Great explanation on Proc FCMP.
Thank you, Deb. Have you used PROC FCMP before?
Assume your string is in variable var1 then the new string in variable var2.
If your original string does not contain spaces you could also do the following:
var2=translate(compbl(translate(strip(var1)," ","_")),"_"," ");
If your original string contains spaces you would have to replace them by something exotic and replace them back after.
var2=translate(translate(compbl(translate(translate(strip(var1),"§"," ")," ","_")),"_"," ")," ","§");
Thank you, Guido, it's an interesting approach. Indeed, your code will work, but only (as you mentioned) "if your original string does not contain spaces" which is rarely the case.
Replacing spaces by some exotic characters approach is also described in the SAS code Sample 39036: Eliminating duplicate characters in a string. But I believe it's rather dangerous as those "exotic" characters might be legitimate characters in the original data. That is why I would rather rely on a more universal and robust approach presented in this blog post.
Hi Leonid, really good job, I tried to propose a faster function but failed.
Just one question: Is "endfunc" a variant of "endsub" ?
Thank you, Nero. I also tried several different versions and chose the most efficient one. I would love to see a faster implementation though. However, I realize that speed of execution depends on the distribution of repeated characters as well as the length of duplication streaks, so there might not be a universally "better" implementation as it is data-dependent.
As for "endfunc" vs. "endsub", I think you can use them interchangeably, but I prefer using "endfunc" for functions, and "endsub" for subroutines.
Hello. This is potentially very useful, and I vote to add UNDUPC as a SAS native function.
I played around with this, and I think I found a small change that saves about 10% processing time. That is, use TRIM(x) instead of x in the TRANWRD function call. I obtained 9.3% to 9.9% improvement when I used it in a DATA _NULL_ step in order to isolate its contribution from the overhead of writing a new dataset. I would be glad to send a copy of the test, but I don't see a way to attach a document to this comment.
I suppose when implemented as a native function, it would be more efficient than the user-defined version in any case.
Thank you, Van, for your vote, comment and suggestion. Great catch, it seems logical to trim x when using it in tranwrd() function. Interestingly, I stressed the significance of trim under the find function, but neglected to be consistent in using it under tranwrd as well. I tested this on the data and use case provided in the blog post and achieved 2-3% improvement in both, function and call routine. I have updated the blog's code with your suggested change. Thank you again!
Hi Leonid, it is really very helpful. Thank you!
I am glad to hear it. Thank you, Marina, for your feedback.
Excellent function to have, would vote for to be added as SAS function.
Great! Thank you, Rafi.
This is great. Thanks for sharing!!
You are welcome, Bill. And thank you for your feedback.
I agree! This would be a great feature. It's like a generalized COMPBL(). Nice work!
Thank you, Stu, you are exactly right! 3:0.
This is great! I believe this function should to be natively implemented and included in SAS software, similar to COMPBL().
Thank you, Tatiana, for your feedback. Your vote is counted 🙂
I have used the COMPBL() function in the past and it's possible that I would want a similar function for something other than spaces.
Thank you, Brian, for your feedback.