Help me answer a student question

7

Dear Reader,

Recently, another SAS instructor had a student in class who has a raw data file that looks like this:

Obs    Var

1        15 Harvey Rd    Macon

2        163 McNeil Pl. Kern

3        442 Glen Ave    Miami Beach

4        2491 Brady St. Munger

5        PO Box 2253 Washington DC

6        25 Chesire Pl      Short Hills

7        1725 Airport Rd. Springfield

8        11234 W Hoyt St. Chicago

9        922 Mitchell Circle Chicago

10     34 Clear Sky Rd. Butte

11     1012 Hwy 54 Morrisville

12     1315 Green Valley Ln. Sedona

Her student wanted to know if there was a way to read this data into a SAS data set and retrieve the City name.

The problem, of course, is threefold:

  • the only separator is a blank space
  • some of the city names are two words
  • the number of ‘words’ in the address varies

Cindy, a SAS instructor in our Irvine office, found a solution using the Perl Regular Expression CALL PRXSUBSTR routine available in Base SAS.  Here’s the SAS code:

data test(keep=street city address);
   retain ExpressionID;
 
 /* 1st time through the DATA step, establish the pattern of values */
 /* (between the /s) to look for.  Apply the PRXPARSE function to   */
 /* compile the Perl regular expression (PRX) in Pattern. The "i"   */
 /* after the second / makes the search case insensitive.           */
 
   if _N_ = 1 then do;
      pattern = "/ave|avenue|dr|drive|road|rd|pl|st|circle|ln|box|hwy/i";
      ExpressionID = prxparse(pattern);
   end;
 
 /* Read in the raw data value. Use the PRXSUBSTR call routine to  */
 /* search Street for the compiled PRX and return its starting     */
 /* position and length of the value found.                        */
 
   input Street $80.;
   call prxsubstr(ExpressionID, Street, Position, Length);
 
 /* If any of the PRX values were found, use the ANYPUNCT function */
 /* to search Street for punctuation starting in Position.         */
 
   if Position ^= 0 then do;
      Punctuation = anypunct(Street, Position);
 
 /* If punctuation was found, substring Street starting in one     */
 /* position beyond the location of the punctuation. Left align    */
 /* the resulting City value.                                      */
 
      if Punctuation > 0 then
        City =left(substr(Street, Punctuation + 1));
 
 /* If punctuation was not found, substring Street starting in the */
 /* Position + Length spot. Left align the resulting City value.   */
 
      else
        city = left( substr(street, position + length));
 
 /* Determine the position of the first character of City (Spot).  */
 /* If it is greater than 1, substring City beginning in that      */
 /* position and left align the result. Write the value of City to */
 /* the SAS log.                                                   */      
 
      spot=notdigit(city);
      if spot > 1 then city = left(substr(city, spot));
   end;
   datalines;
15 Harvey Rd  Macon
163 McNeil Pl. Kern
442 Glen Ave    Miami Beach
2491 Brady St. Munger
PO Box 2253 Washington DC
25 Chesire Pl  Short Hills
1725 Airport Rd. Springfield
11234 W Hoyt St. Chicago
922 Mitchell Circle Chicago
34 Clear Sky Rd. Butte
1012 Hwy 54 Morrisville
1315 Green Valley Ln. Sedona
;

Here are the results of the PUT statement in the SAS log:

"Macon" found in "15 Harvey Rd  Macon"

"Kern" found in "163 McNeil Pl. Kern"

"Miami Beach" found in "442 Glen Ave    Miami Beach"

"Munger" found in "2491 Brady St. Munger"

"Washington DC" found in "PO Box 2253 Washington DC"

"Short Hills" found in "25 Chesire Pl  Short Hills"

"Springfield" found in "1725 Airport Rd. Springfield"

"Chicago" found in "11234 W Hoyt St. Chicago"

"Chicago" found in "922 Mitchell Circle Chicago"

"Butte" found in "34 Clear Sky Rd. Butte"

"Morrisville" found in "1012 Hwy 54 Morrisville"

"Sedona" found in "1315 Green Valley Ln. Sedona"

While the sample code has the values for which to search hard-coded, you could populate the PRX list from any SAS data set which contained the primary street suffixes and associated abbreviations.

The one “gotcha” that this solution doesn’t handle is the numbers after the street suffixes (PO Box 2253 or Hwy 54, for example).  While I could propose one, I’d love to see your solution to this.

Feel free to ponder, test, and then post your solution back to this blog.  PLEASE test your solution first!  I’ll be trying each one posted and commenting on the results.

Happy searching!

Miss SAS Answers

Share

About Author

Miss SAS Answers

Technical Training Specialist

Linda Jolley has been a SAS software user since 1980. She has been an instructor for SAS since 1997, and is Base and Advanced SAS Programming certified and working on the Data Integration Developer certification. She has presented papers at several SAS user group venues on various SAS programming efficiency techniques and the SAS Scalable Performance Data Server.

7 Comments

  1. Miss SAS Answers on

    Using Prashant's solution, you can add one line of code to extract the street address:

    address = substr(Street,1,position+length-1);

    Since you know the street address begins in position 1 in the string, and you know where the city name starts (position+length, left justified), you can simply substring from Street beginning in position 1 for position+length-1.

    Thanks for all the great ideas and solutions!

    Miss SAS Answers

  2. bob mcconnaughey on

    well doing something with a real address file would be far more baroque. Usually starting w/ reversing the whole address string, then reversing the component again. But the easiest way to muck w/ this file might be after reading each address in as one character string:
    address = translate(address,"","1234567890.");
    address = strip(address);
    address = lowcase(address);
    * ave|avenue|dr|drive|road|rd|pl|st|circle|ln|box|hwy ;
    cityflag = 0;
    streetchk1 = index(address,"ave");
    if streetchk1 > 0 & cityflag = 0 then do ;
    addressb=substr(address,streetchk1+3);
    cityflag = 1;
    end; ** repeat as needed for each "Street type";
    .......
    city = strip(propcase(addressb));
    if you put the check for "st" at the end, it will also handle the St Petersburg problem.
    proc print listing.
    city

    Macon
    Kern
    Miami Beach
    Munger
    Washington Dc
    Short Hills
    Springfield
    Chicago
    Chicago
    Butte
    Morrisville
    St Petersburg

  3. A slight variation of my above code to account for the Gotcha mentioned by Quention.Ofcourse there may be other gotchase.Also Quention's Gotcha when 'W' is expanded as 'West'.The below code accounts for this.Notice the word boundary for 'st' in the pattern variable '\bst\b'

    Data Temp;

    Input Street $80.;

    pattern = "/(ave|avenue|dr|drive|road|rd|pl|\bst\b|circle|ln|box|hwy)(\s)?(\.|,)?(\d)*/i";
    call prxsubstr(prxparse(pattern), Street, Position, Length);
    City=Left(Substr(STreet,position+length));

    datalines;
    15 Harvey St. St. Paul
    134 West Rd St Petersburg
    163 McNeil Pl. Kern
    442 Glen Ave Miami Beach
    2491 Brady St. Munger
    PO Box 2253 Washington DC
    25 Chesire Pl Short Hills
    1725 Airport Rd. Springfield
    11234 W Hoyt St. Chicago
    922 Mitchell Circle Chicago
    34 Clear Sky Rd. Butte
    1012 Hwy 54 Morrisville
    ;
    Run;

  4. This is one solution.Ofcourse this does not take into account the Gotcha mentioned by Quentin above but accounts for numbers after street suffixes :

    Data Temp;

    Input Street $80.;

    pattern = "/(ave|avenue|dr|drive|road|rd|pl|st|circle|ln|box|hwy)(\s)?(\.|,)? (\d)*/i";
    call prxsubstr(prxparse(pattern), Street, Position, Length);
    City=Left(Substr(STreet,position+length));

    datalines;
    15 Harvey Rd Macon
    163 McNeil Pl. Kern
    442 Glen Ave Miami Beach
    2491 Brady St. Munger
    PO Box 2253 Washington DC
    25 Chesire Pl Short Hills
    1725 Airport Rd. Springfield
    11234 W Hoyt St. Chicago
    922 Mitchell Circle Chicago
    34 Clear Sky Rd. Butte
    1012 Hwy 54 Morrisville
    ;
    Run;

  5. Interesting. Seems like a city like "St. Petersburg" could also be a good gotcha, if the solution relies on treating "ST" as a delimiter....

Leave A Reply

Back to Top