A frequently performed task in data analysis is identifying all the observations in a data set that satisfy certain conditions. For example, you might want to identify all of the female patients in your study or to identify all patients whose systolic blood pressure is greater than 140 mm Hg.
Novice programmers often use loops to find observations that satisfy a criterion. Don't do it! In SAS/IML software, there is almost never a good reason to loop over all observations. Instead, use the LOC function.
The following statements use the LOC function to identify the patients mentioned earlier:
f = loc(gender = "female"); highBP = loc(systolic > 140); |
The LOC function is the most useful function in the SAS/IML language that DATA step programmers have never heard of. The LOC function finds the location of nonzero elements in a vector or matrix. Using the LOC function is much faster than writing a loop.
The LOC function returns a row vector that contains indices that satisfy the specified condition. You can use the indices to subset the data.
Forming Subsets
For example, suppose you have data for some famous witches and wizards in literature:
data MagicUsers; infile datalines dsd; length Name $11 Profession $7 Source $20; input Name Profession Power Source; datalines; Morgana, Witch, 7, Authurian Legend Merlin, Wizard, 10, Authurian Legend Gryffindor, Wizard, 8, Harry Potter Books Hufflepuff, Witch, 8, Harry Potter Books Ravenclaw, Witch, 8, Harry Potter Books Slytherin, Wizard, 8, Harry Potter Books Glinda, Witch, 5, Oz Books Elphaba, Witch, 6, Oz Books Diggs, Wizard, 1, Oz Books ; |
The LOC function can help you determine which witch is which. The following statements determine which names correspond to witches:
proc iml; use MagicUsers; read all var {Name Profession Power}; close MagicUsers; ndx = loc(Profession="Witch"); /** find the indices for witches **/ Names = Name[ndx]; /** subset the names of the witches **/ |
Similarly, if you want to compute the average power of these witches, you can form a subset of the Power variable and compute the mean power of the witches in the subset:
WitchPower = Power[ndx]; /** subset the power variable **/ AvgPower = WitchPower[:]; /** compute the subset's average **/ print ndx[colname=Names], AvgPower; |
14 Comments
Pingback: Removing Observations with Missing Values - The DO Loop
Pingback: Log transformations: How to handle negative data values? - The DO Loop
Pingback: Timing Performance: Looping versus LOC-ing - The DO Loop
Pingback: Tips and Techniques – What’s the Difference? - The DO Loop
Pingback: Reshape data so that each category becomes a new variable - The DO Loop
Pingback: Beware the naked LOC - The DO Loop
l use loc a lot. However , can the argument for loc be a matrix? Suppose I want to identify multiple rows in one pass to create a new matrix i.e., return new matrix={1 2 3, 4 5 6, 7 8 9} This problem is killing me because it seems that I end up having to do nested loops which requires multiple passes.
Yes, the argument to LOC is a matrix. However, the return value is always a row vector (1 row, k columns), so you might need to use the SHAPE function to reshape the result. If you have a specific example in mind that is causing you problems, post it to the SAS/IML Support Community.
Thanks a lot for this post. It is very useful. Something new that I learnt today.
Pingback: Detecting outliers in SAS: Part 2: Estimating scale - The DO Loop
Pingback: Need case-insensitive string comparisons? UPCASE! - The DO Loop
Pingback: Finding data that satisfy a criterion
I am wondering whether you can create a condition that selects witches or wizards. Obviously, in the example above, it will select everything. But if we also had mages and sorcerers than the condition will subset on more than one profession.
Yes. For your simple example, use the "OR" operator (|):
ndx = loc(Profession="Witch" | Profession="Wizard");
For more complex queries, use the LOC-ELEMENT technique.