A common task in data analysis is to locate observations that satisfy multiple criteria. For example, you might want to locate all zip codes in certain counties within specified states.
The SAS DATA step contains the powerful WHERE statement, which enables you to extract a subset of data that satisfy certain criteria. The Sashelp.Zipcode data set contains all US zip codes and the state and county in which they are located. The following DATA step uses the IN operator to create a subset of zip codes for Seminole County (FL), Wake County (NC), Tompkins County (NY), and Hennepin County (MN):
data Subset; set sashelp.Zipcode; where StateCode in ("FL" "NC" "NY" "MN") & CountyNM in ("Seminole" "Wake" "Tompkins" "Hennepin"); run; |
Finding rows that match column values
The SAS/IML language also supports the WHERE clause when reading from a SAS data set, but sometimes the data are in matrices in RAM. What options exist for finding observations in matrices? In this article I'd like to introduce the LOC-ELEMENT technique.
I've previously written about how to use the LOC function in SAS/IML to locate certain elements in a matrix that satisfy a single logical expression. However, when you are searching for observations that satisfy one of several conditions, there is a nice trick that you can use: combine the LOC function with the ELEMENT function.
The ELEMENT function is one of several functions in SAS/IML that enable you to form queries about sets, such as their union, intersection, and difference. The ELEMENT function returns a 0/1 vector that indicates whether each element in a matrix is contained in a specified set of values. The following SAS/IML statements read character variables into a matrix and then use the ELEMENT and LOC functions to find the row numbers that correspond to any observations for which the StateCode variable is in the set {FL, NC, NY, MN}:
proc iml; use sashelp.Zipcode; read all var {"StateCode" "CountyNM"} into C[colname=varNames]; read all var {"Zip"}; close sashelp.Zipcode; myStates = {FL NC NY MN}; /* target set */ StateIdx = loc(element(C[, "StateCode"], myStates)); /* find matches */ |
The StateIdx variable contains the row numbers that correspond to all zip codes in the four specified states. To get the actual zip codes, you can form the expression zip[StateIdx].
The previous statements contain two tricks that are worth studying. The first is that loc(element(A,B)) finds the indices of the elements in the matrix A that are contained in the set B. The second is that when you use the READ INTO statement, the SAS/IML language automatically assigns matrix attributes to the matrix, so that you can use syntax such as C[, "StateCode"] to refer to the column of C that contains the state codes. For more details on this feature, see the article "Access rows or columns of a matrix by names."
In a similar way, you can find all row numbers for the four specified counties:
myCounties = {"Seminole" "Wake" "Tompkins" "Hennepin"}; CountyIdx = loc(element(C[, "CountyNM"], myCounties)); |
Notice that these row numbers are independent of the states. In particular, it turns out that Georgia (GA) and Oklahoma (OK) both have counties named "Seminole." You can call the XSECT function to form the intersection of these two sets of row numbers, as follows:
idx = xsect(StateIdx, CountyIdx); print (C[idx,])[colname=varNames] (zip[idx])[colname="Zip"]; |
The LOC-ELEMENT technique is superior to the alternative, which is to form a long sequence of logical expressions, as follows:
code = C[, "StateCode"]; StateIdx2 = loc( code="FL" | code="NC" | code="NY" | code="MN" ); |
The logical expression isn't too hard to write for four states, but imagine wanting to extract 20 states or hundreds of counties! The LOC-ELEMENT technique is more compact and less error-prone. Try it; I think you'll like it.
3 Comments
Very interesting article.
I tried to apply this on the code below but it doesn't work. In fact what I need to do is a litlle bit different I think. I have a population for which I know the age stored in a vector, and I want to get for each one the mortality rate from a mortality table.
For now, I have written this code, using loc function and apply function.
But when I use this code on large datasets, the code is quite slow because of apply function.
Is there a way to do it differently (and faster) using loc/element functions?
Yes. What you are looking for is the UNIQUE-LOC technique. See "The UNIQUE-LOC trick" and the follow-up article "An efficient alternative to the UNIQUE-LOC technique". In your case, you can skip the UNIQUE step and just use the age groups of the mortality table as your categories, like this:
The reason your code is taking a long time is because you are doing a scalar computation for each element of population_age. Notice that the technique I recomend does a vector computation for each of the age categories, and there are many fewer categories than people.
Thank you Rick, I will try this as soon as possible, and I will let you know how faster it is.