In my previous post, we solved the task of removing specified leading characters from SAS strings. In this post, we tackle the complementary task of removing trailing characters.
While removing trailing blanks is well covered in SAS by the TRIM() and TRIMN() functions, removing non-blank trailing characters remains a bit of a mystery that can pop up during text string processing.
For example, you may need to clean up the following strings by removing all trailing x's from them:
These extra characters can result from data entry errors, prior data manipulations, etc. No matter how you get them in, you want them out.
How to remove any trailing characters
For instance, let’s remove all occurrences of the arbitrary trailing character '*'. The following diagram illustrates what we are going to achieve and how:
In order to remove a specified character (in this example '*') from all trailing positions in a string, we need to search our string from right to left starting from the rightmost non-blank character position and find the position p of the first character in that string that is not equal to the specified character. Note, that despite our right-to-left direction of search the position p=10 is still enumerated from left-to-right. Then we can extract the substring starting from position 1 with the length equal to the found position p.
Unlike in our leading characters removal solution, out of two contenders for our search functionality, VERIFY and FINDC, the VERIFY function has to be dropped from the competition as it does not provide right-to-left search functionality. However, the FINDC function stays on track. Here is a possible solution using the FINDC() function.
Before we proceed to the code implementation of the outlined solution let’s consider the following edge case.
If our string X consists of all '*' characters and nothing else the FINDC() function will find no character (and therefore no position) that is not equal to '*'. In this case it will return 0. However, 0 is not a valid third argument value for the SUBSTR(X, 1, P) function. Valid values are 1 . . . through VLENGTH(X) – the length attribute of X. Having a 0 value for the third argument will trigger the automatic data step variable _ERROR_=1 and the following note generated in the SAS log:
NOTE: Invalid third argument to function SUBSTR at line ## column #.
Therefore, we need to handle this special case separately, conditionally using SUBSTR(X, 1, P) for P>0 and assigning blank ('') otherwise.
Code implementation for removing trailing characters
Now we are ready to put everything together. First, let’s create a test data table:
data TEST; input X $ 1-20; datalines; *It's done*** ********* **01234*ABC** No trailing *'s ;
Then we apply the logic described above. The following DATA step illustrates our coding solution for deleting trailing characters:
data CLEAN (keep=X Y); set TEST; C = '*'; *<- trailing character(s) to be removed; P = findc(X, C, 'K', -length(X)); if P then Y = substr(X, 1, P); else Y = ''; put _n_= / X= / P= / Y= /; run;
The SAS log will show interim and final results by the DATA step iterations:
_N_=1 X=*It's done*** P=10 Y=*It's done _N_=2 X=********* P=0 Y= _N_=3 X=**01234*ABC** P=11 Y=**01234*ABC _N_=4 X=No trailing *'s P=15 Y=No trailing *'s
Here is the output data table CLEAN showing the original and the resulting strings X and Y side by side:
The solution presented in this blog post expands trailing character deletion functionality beyond solely blanks (which are handled by the TRIM and TRIMN functions). Moreover, using this coding technique, we can simultaneously remove a variety of trailing characters. For example, if we have a string X='012345xxx.%' and specify C = 'x.%' (the order of characters listed within the value of C does not matter), then all three characters 'x', '.', and '%' will be removed from all trailing positions of X. The resulting string will be Y='012345'.
In addition, numerous modifiers of the FINDC() function allow specifying many characters in bulk, without explicitly listing them one by one. For example, we may augment a list of characters being removed by adding the D modifier as in P = FINDC(X, C, 'KD', -LENGTH(X)) which will remove all trailing digits in addition to those characters specified in C. Similarly, we may throw in the U modifier as in P = FINDC(X, C, 'KDU', -LENGTH(X)) which adds all uppercase letters to the list of trailing characters to be removed. And so on.
- Removing leading characters from SAS strings
- Deleting a substring from a SAS string
- Inserting a substring into a SAS string
- Removing repeated characters in SAS strings
Questions? Thoughts? Comments?
Do you find this post useful? Do you have questions, concerns, comments? Please share with us below.