Dear Reader,
Recently, another SAS instructor had a student in class who has a raw data file that looks like this:
Obs Var
1 15 Harvey Rd Macon
2 163 McNeil Pl. Kern
3 442 Glen Ave Miami Beach
4 2491 Brady St. Munger
5 PO Box 2253 Washington DC
6 25 Chesire Pl Short Hills
7 1725 Airport Rd. Springfield
8 11234 W Hoyt St. Chicago
9 922 Mitchell Circle Chicago
10 34 Clear Sky Rd. Butte
11 1012 Hwy 54 Morrisville
12 1315 Green Valley Ln. Sedona
Her student wanted to know if there was a way to read this data into a SAS data set and retrieve the City name.
The problem, of course, is threefold:
- the only separator is a blank space
- some of the city names are two words
- the number of ‘words’ in the address varies
Cindy, a SAS instructor in our Irvine office, found a solution using the Perl Regular Expression CALL PRXSUBSTR routine available in Base SAS. Here’s the SAS code:
data test(keep=street city address); retain ExpressionID; /* 1st time through the DATA step, establish the pattern of values */ /* (between the /s) to look for. Apply the PRXPARSE function to */ /* compile the Perl regular expression (PRX) in Pattern. The "i" */ /* after the second / makes the search case insensitive. */ if _N_ = 1 then do; pattern = "/ave|avenue|dr|drive|road|rd|pl|st|circle|ln|box|hwy/i"; ExpressionID = prxparse(pattern); end; /* Read in the raw data value. Use the PRXSUBSTR call routine to */ /* search Street for the compiled PRX and return its starting */ /* position and length of the value found. */ input Street $80.; call prxsubstr(ExpressionID, Street, Position, Length); /* If any of the PRX values were found, use the ANYPUNCT function */ /* to search Street for punctuation starting in Position. */ if Position ^= 0 then do; Punctuation = anypunct(Street, Position); /* If punctuation was found, substring Street starting in one */ /* position beyond the location of the punctuation. Left align */ /* the resulting City value. */ if Punctuation > 0 then City =left(substr(Street, Punctuation + 1)); /* If punctuation was not found, substring Street starting in the */ /* Position + Length spot. Left align the resulting City value. */ else city = left( substr(street, position + length)); /* Determine the position of the first character of City (Spot). */ /* If it is greater than 1, substring City beginning in that */ /* position and left align the result. Write the value of City to */ /* the SAS log. */ spot=notdigit(city); if spot > 1 then city = left(substr(city, spot)); end; datalines; 15 Harvey Rd Macon 163 McNeil Pl. Kern 442 Glen Ave Miami Beach 2491 Brady St. Munger PO Box 2253 Washington DC 25 Chesire Pl Short Hills 1725 Airport Rd. Springfield 11234 W Hoyt St. Chicago 922 Mitchell Circle Chicago 34 Clear Sky Rd. Butte 1012 Hwy 54 Morrisville 1315 Green Valley Ln. Sedona ; |
Here are the results of the PUT statement in the SAS log:
"Macon" found in "15 Harvey Rd Macon"
"Kern" found in "163 McNeil Pl. Kern"
"Miami Beach" found in "442 Glen Ave Miami Beach"
"Munger" found in "2491 Brady St. Munger"
"Washington DC" found in "PO Box 2253 Washington DC"
"Short Hills" found in "25 Chesire Pl Short Hills"
"Springfield" found in "1725 Airport Rd. Springfield"
"Chicago" found in "11234 W Hoyt St. Chicago"
"Chicago" found in "922 Mitchell Circle Chicago"
"Butte" found in "34 Clear Sky Rd. Butte"
"Morrisville" found in "1012 Hwy 54 Morrisville"
"Sedona" found in "1315 Green Valley Ln. Sedona"
While the sample code has the values for which to search hard-coded, you could populate the PRX list from any SAS data set which contained the primary street suffixes and associated abbreviations.
The one “gotcha” that this solution doesn’t handle is the numbers after the street suffixes (PO Box 2253 or Hwy 54, for example). While I could propose one, I’d love to see your solution to this.
Feel free to ponder, test, and then post your solution back to this blog. PLEASE test your solution first! I’ll be trying each one posted and commenting on the results.
Happy searching!
Miss SAS Answers
7 Comments
Using Prashant's solution, you can add one line of code to extract the street address:
address = substr(Street,1,position+length-1);
Since you know the street address begins in position 1 in the string, and you know where the city name starts (position+length, left justified), you can simply substring from Street beginning in position 1 for position+length-1.
Thanks for all the great ideas and solutions!
Miss SAS Answers
well doing something with a real address file would be far more baroque. Usually starting w/ reversing the whole address string, then reversing the component again. But the easiest way to muck w/ this file might be after reading each address in as one character string:
address = translate(address,"","1234567890.");
address = strip(address);
address = lowcase(address);
* ave|avenue|dr|drive|road|rd|pl|st|circle|ln|box|hwy ;
cityflag = 0;
streetchk1 = index(address,"ave");
if streetchk1 > 0 & cityflag = 0 then do ;
addressb=substr(address,streetchk1+3);
cityflag = 1;
end; ** repeat as needed for each "Street type";
.......
city = strip(propcase(addressb));
if you put the check for "st" at the end, it will also handle the St Petersburg problem.
proc print listing.
city
Macon
Kern
Miami Beach
Munger
Washington Dc
Short Hills
Springfield
Chicago
Chicago
Butte
Morrisville
St Petersburg
A slight variation of my above code to account for the Gotcha mentioned by Quention.Ofcourse there may be other gotchase.Also Quention's Gotcha when 'W' is expanded as 'West'.The below code accounts for this.Notice the word boundary for 'st' in the pattern variable '\bst\b'
Data Temp;
Input Street $80.;
pattern = "/(ave|avenue|dr|drive|road|rd|pl|\bst\b|circle|ln|box|hwy)(\s)?(\.|,)?(\d)*/i";
call prxsubstr(prxparse(pattern), Street, Position, Length);
City=Left(Substr(STreet,position+length));
datalines;
15 Harvey St. St. Paul
134 West Rd St Petersburg
163 McNeil Pl. Kern
442 Glen Ave Miami Beach
2491 Brady St. Munger
PO Box 2253 Washington DC
25 Chesire Pl Short Hills
1725 Airport Rd. Springfield
11234 W Hoyt St. Chicago
922 Mitchell Circle Chicago
34 Clear Sky Rd. Butte
1012 Hwy 54 Morrisville
;
Run;
This solution works nicely for extracting city names like "St. Louis", "St. Petersburg", etc. Thanks!
This is one solution.Ofcourse this does not take into account the Gotcha mentioned by Quentin above but accounts for numbers after street suffixes :
Data Temp;
Input Street $80.;
pattern = "/(ave|avenue|dr|drive|road|rd|pl|st|circle|ln|box|hwy)(\s)?(\.|,)? (\d)*/i";
call prxsubstr(prxparse(pattern), Street, Position, Length);
City=Left(Substr(STreet,position+length));
datalines;
15 Harvey Rd Macon
163 McNeil Pl. Kern
442 Glen Ave Miami Beach
2491 Brady St. Munger
PO Box 2253 Washington DC
25 Chesire Pl Short Hills
1725 Airport Rd. Springfield
11234 W Hoyt St. Chicago
922 Mitchell Circle Chicago
34 Clear Sky Rd. Butte
1012 Hwy 54 Morrisville
;
Run;
Interesting. Seems like a city like "St. Petersburg" could also be a good gotcha, if the solution relies on treating "ST" as a delimiter....
Yes, that would also be a "gotcha".