In my new book, End-to-End Data Science with SAS: A Hands-On Programming Guide, I use the 1.5 IQR rule to adjust multiple variables. This program utilizes a macro that loops through a list of variables to make the necessary adjustments and creates an output data set.
One of the most popular ways to adjust for outliers is to use the 1.5 IQR rule. This rule is very straightforward and easy to understand. For any continuous variable, you can simply multiply the interquartile range by the number 1.5. You then add that number to the third quartile. Any values above that threshold are suspected as being an outlier. You can also perform the same calculation on the low end. You can subtract the value of IQR x 1.5 from the first quartile to find low-end outliers.
The process of adjusting for outliers can be tedious if you have several continuous variables that are suspected as having outliers. You will need to run PROC UNIVARIATE on each variable to identify its median, 25th percentile, 75th percentile, and interquartile range. You would then need to develop a program that identifies values above and below the 1.5 IQR rule thresholds and overwrite those values with new values at the threshold.
The following program is a bit complicated, but it automates the process of adjusting a list of continuous variables according to the 1.5 IQR rule. This program consists of three distinct parts:
- Create a BASE data set that excludes the variables contained in the &outliers global macro. Then create an OUTLIER data set that contains only the unique identifier ROW_NUM and the outlier variables.
- Create an algorithm that loops through each of the outlier variables contained in the global variable &outliers and apply the 1.5 IQR rule to cap each variable’s range according to its unique 1.5 IQR value.
- Merge the newly restricted outlier variable with the BASE data set.
/*Step 1: Create BASE and OUTLIER data sets*/ %let outliers = /*list of variables*/; DATA MYDATA.BASE; SET MYDATA.LOAN_ADJUST (DROP=&outliers.); ROW_NUM = _N_; RUN; DATA outliers; SET MYDATA.LOAN_ADJUST (KEEP=&outliers. ROW_NUM); ROW_NUM = _N_; RUN; /*Step 2: Create loop and apply the 1.5 IQR rule*/ %MACRO loopit(mylist); %LET n = %SYSFUNC(countw(&mylist)); %DO I=1 %TO &n; %LET val = %SCAN(&mylist,&I); PROC UNIVARIATE DATA = outliers ; VAR &val.; OUTPUT OUT=boxStats MEDIAN=median QRANGE=iqr; run; data _NULL_; SET boxStats; CALL symput ('median',median); CALL symput ('iqr', iqr); run; %PUT &median; %PUT &iqr; DATA out_&val.(KEEP=ROW_NUM &val.); SET outliers; IF &val. ge &median + 1.5 * &iqr THEN &val. = &median + 1.5 * &iqr; RUN; /*Step 3: Merge restricted value to BASE data set*/ PROC SQL; CREATE TABLE MYDATA.BASE AS SELECT * FROM MYDATA.BASE AS a LEFT JOIN out_&val. as b on a.ROW_NUM = b.ROW_NUM; QUIT; %END; %MEND; %LET list = &outliers; %loopit(&list);
Notes on the outlier adjustment program:
- A macro variable is created that contains all of the continuous variables that are suspected of having outliers.
- Separate data sets were created: one that contains all of the outlier variables and one that excludes the outlier variables.
- A macro program is developed to contain the process of looping through the list of variables.
- A macro variable (n) is created that counts the number of variables contained in the macro variable.
- A DO loop is created that starts at the first variable and runs the following program on each variable contained in the macro variable.
- PROC UNIVARIATE identifies the variable’s median and interquartile range.
- A macro variable is created to contain the values of the median and interquartile range.
- A DATA step is created to adjust any values that exceed the 1.5 IQR rule on the high end and the low end.
- PROC SQL adds the adjusted variables to the BASE data set.
This program might seem like overkill to you. It could be easier to simply adjust outlier variables one at a time. This is often the case; however, when you have a large number of outlier variables, it is often beneficial to create an algorithm to transform them efficiently and consistently
Why is the the 1.5*IQR being added to the median and not Q3 in the script?
Had you thought of using PROC MEANS to calculate the median and inter-quartile range without a macro loop? You use the VAR sub-statement to identify the variables you have in &outliers. You could use the output data set in a DATA step to generate the code with CALL EXECUTE for another DATA step to do the final calculation. This requires only two passes through the original data set, no macro code and no splitting and joining. While developing the generated DATA step, replace the CALL EXECUTE with output to a file so you can see what is happening and run the generated code manually.
It looks like this is only adjusting outliers on the high end. It seems to be be missing the code to adjust low outliers. Or am I missing something?