Finding n-th instance of a substring within a string

"There's a way to do it better - find it." - Thomas A. Edison

When it comes to SAS coding, this quote by Thomas A. Edison is my best advisor. Time permitting, I love finding better ways of implementing SAS code.

But what code feature means “better” – brevity, clarity or efficiency? It all depends on the purpose of your code. When code is to illustrate a coding concept or technique, clarity is a paramount. However, when processing large data volumes in near real-time, code efficiency becomes critical, not just a luxury or convenience. And brevity won’t hurt in either case. Ideally, your code should be a combination of all three features - brevity, clarity and efficiency.

Parsing a character string

In this blog post we will solve a problem of parsing a character string to find a position of n-th occurrence of a group of characters (substring) in that string.

The closest out-of-the-box solution to this problem is SAS’ FIND() function. Except this function searches only for a single/first instance of specified substring of characters within a character string. Close enough, and with some do-looping we can easily construct what we want.

After some internet and soul searching to find the Nth occurrence of a substring within a string, I came up with the following DATA STEP code snippet:

   p = 0;
   do i=1 to n until(p=0); 
      p = find(s, x, p+1);
   end;

Here, s is a text string (character variable) to be parsed; x is a character variable holding a group of characters that we are searching for within s; p is a position of x value found within s; n is an instance number.

If there is no n-th instance of x within s found, then the code returns p=0.

In this code, each do-loop iteration searches for x within s starting from position p+1 where p is position found in prior iteration: p = find(s,x,p+1);.

Notice, if there is no prior-to n instance of x within s, the do-loop ends prematurely, based on until(p=0) condition, thus cutting the number of loops to the minimal necessary.

Reverse string search

Since find() function allows for a string search in a reverse direction (from right to left) by making the third augment negative, the above code snippet can be easily modified to do just that: find Nth instance (from right to left) of a group of characters within a string. Here is how you can do that:

   p = length(s) + 1;
   do i=1 to n until(p=0); 
      p = find(s, x, -p+1);
   end;

The difference here is that we start from position length(s)+1 instead of 0, and each iteration searches substring x within string s starting from position –(p-1)=-p+1 from right to left.

Testing SAS code

You can run the following SAS code to test and see how these searches work:

data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* from left to right */
   p = 0;
   do i=1 to n until(p=0); 
      p = find(s, x, p+1);
   end;
   put p=;
 
   /* from right to left */
   p = length(s) + 1;
   do i=1 to n until(p=0); 
      p = find(s, x, -p+1);
   end;
   put p=;
run;

FINDNTH() function

We can also combine the above left-to-right and right-to-left searches into a single user-defined SAS function by means of SAS Function Compiler (PROC FCMP) procedure:

proc fcmp outlib=sasuser.functions.findnth;
   function findnth(str $, sub $, n);
      p = ifn(n>=0,0,length(str)+1);
      do i=1 to abs(n) until(p=0);
         p = find(str,sub,sign(n)*p+1);
      end;
      return (p);
   endsub;
run;

We conveniently named it findnth() to match the Tableau FINDNTH(string, substring, occurrence) function that returns the position of the nth occurrence of substring within the specified string, where the occurrence argument defines n.

Except our findnth() function allows for both, positive (for left-to-right searches) as well as negative (for right-to-left searches) third argument while Tableau’s function only allows for left-to-right searches.

Here is an example of the findnth() function usage:

options cmplib=sasuser.functions;
data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* from left to right */
   p=findnth(s,x,n);
   put p=;
 
   /* from right to left */
   p=findnth(s,x,-n);
   put p=;
run;

Using Perl regular expression

As an alternative solution I also implemented SAS code for finding n-th occurrence of a substring within a string using Perl regular expression (regex or prx):

data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* using regex */
   xid = prxparse('/'||x||'/o');
   p = 0;
   do i=1 to n until(p=0);
      from = p + 1;
      call prxnext(xid, p + 1, length(s), s, p, len);
   end;
   put p=;
run;

However, efficiency benchmarking tests demonstrated that the above solutions using FIND() function or FINDNTH() SAS user-written function run roughly twice faster than this regex solution.

Overlapping and performance considerations

The above solutions with FIND() and FINDNTH() functions assume that the searched substring can overlap with its prior instance. For example, if we search for a substring ‘AAA’ within a string ‘ABAAAA’, then the first instance of the ‘AAA’ will be found in position 3, and the second instance – in position 4. That is, the first and second instances are overlapping. For that reason, when we find an instance we increment position p by 1 (p+1) to start the next iteration (instance) of the search.
However, if such overlapping is not a valid case in your searches, and you want to continue search after the end of the previous substring instance, then we should increment p not by 1, but by length of the substring x. That will speed up our search (the more the longer our substring x is) as we will be skipping more characters as we go through the string s. In this case, in our search code we should replace p+1 to p+w, where w=length(x), and also replace p=0 to p=1-w for left-to-right search initial p value:

data a; 
   s='AB bhdf +BA s Ab fs ABC Nfm AB '; 
   x='AB'; 
   n=3; 
   w=length(x); 
 
   /* from left to right */ 
   p = 1 - w; 
   do i=1 to n until(p=0);  
      p = find(s, x, p+w); 
   end; 
   put p=;
 
   /* from right to left */ 
   p = length(s) + 1; 
   do i=1 to n until(p=0);  
      p = find(s, x, -p+w); 
   end; 
   put p=; 
run;

Just make sure, you calculate your w=length(x) before (not within) the do-loop to cut out recalculation of the same over and over.

Challenge

Can you come up with an even better solution to the problem of finding Nth instance of a sub-string within a string? Please share your thoughts and solutions with us. Thomas A. Edison would have been proud of you!

12 Comments

yang on January 29, 2021 10:22 am

many thanks leonid~

- Leonid Batkhan on January 29, 2021 10:31 am
  
  You are welcome, Yang!
  
Jim Barbour on October 8, 2020 7:12 pm

A very useful post, Leonid. Thank you for this. I used it today to parse through some text and pull out parenthetical expressions.

Best regards,

Jim

- Leonid Batkhan on October 8, 2020 9:03 pm
  
  Thank you, Jim. I am very happy that you found it useful and put it to work.
  
Moshiur Rahman on July 30, 2019 11:26 am

Thank you so much for your great posts. For both overlapping and non-overlapping substring within a string the little modified (p = find(s, x); and iterate n-1 times) code works, otherwise existing code start searching from the (p+w)-th position of the string:
data a;
s='AB bhdf AAAAAb fs ABC Nfm AAB ';
x='AA';
n=3;
w=length(x);

/* from left to right */
p = find(s, x);
do i=1 to n-1 until(p=0);
p = find(s, x, p+w);
end;
put p=;
run;

- Leonid Batkhan on July 30, 2019 12:15 pm
  Hi Moshiur, you are absolutely correct, great catch! I think only "from left to right" non-overlapping code modification is needed. It can be as simple as replacing initial p=0 with p=1-w:
```
data a; 
   s='AB bhdf +BA s Ab fs ABC Nfm AB '; 
   x='AB'; 
   n=3; 
   w=length(x); 

   /* from left to right */ 
   p = 1-w; 
   do i=1 to n until(p=0);  
      p = find(s, x, p+w); 
   end; 
   put p=; 
run;
```
  Do you agree?
  
Peter Clemmensen on June 28, 2019 5:37 am

Very cool post Leonid!

- Leonid Batkhan on June 28, 2019 9:08 am
  
  Thank you, Peter, I am glad you liked it.
  
Thomas Abernathy on June 27, 2019 1:04 pm

It might be more used if SAS could come up with something similar to the SASAUTOS concept for functions. Something that allowed it to automatically compile the source code for the function. It is much easier/portable to maintain a set of simple text files than propriety (version specific) formats like SAS catalogs.

- Leonid Batkhan on June 27, 2019 1:15 pm
  
  Hi Thomas,
  Thank you for stopping by and providing your feedback.
  SAS does have "something similar to the SASAUTOS concept for functions". If you look at my example of the FINDNTH() function usage, the very first line is:
  options cmplib=sasuser.functions;
  , where SASUSER.FUNCTIONS is a SAS data table that stores definition of the FINDNTH() function. It is very similar to SASAUTOS or FMTSEARCH options, don't you agree? You can store your function in any library you choose, not just SASUSER. I defined the library in the very first line of PROC FCMP:
  proc fcmp outlib=sasuser.functions.findnth;
  
Tom Gaughan on June 26, 2019 1:48 pm

Great illustration of using the SAS Function Compiler (PROC FCMP) to build on the foundation provided by out-of-the-box SAS functions!

- Leonid Batkhan on June 26, 2019 1:58 pm
  
  Thank you, Tom. Good point. I believe that PROC FCMP is grossly underused by SAS programmers.

Blogs