Jedi SAS Tricks: FUNC(y) Formats

In the previous episode, we built our own custom SAS function - a masterful trick indeed. Gordon Keener, a developer here at SAS, responded exuberantly "You think that's cool? - try THIS!" and proceeded to demonstrate prodigious powers with the SAS by using a custom function in a custom informat to elegantly resolve a programming problem. And another episode of Jedi SAS Tricks was launched...

First, if you haven't experienced the joy of bringing order to chaotic data using your own custom formats and informats, an excellent place to start is with Jonas Bilenas' SGF paper titled "I Can Do That With PROC FORMAT". But Gordon was talking about a new (SAS9.3) capability - the ability create a custom format that performs a function on a value. And because we know how to build our own user-defined functions with PROC FCMP, can't we build a custom format leveraging a custom function? Yes, we can! So let's dive in. If you want to click along, download the code here and fire up your SAS session. Just FYI, the code includes segments we don't discuss here (like producing the data sets, etc.)

The problem:
Our data is read in from a raw text file. US data has temperatures recorded in °F, World data in °C. How can we compare these values? After reading it in, the original SAS data looks like this:
Listing of the two original data sets

First, we'll build custom functions to convert °F to °C and °C to °F:

proc fcmp outlib=sasuser.MyFunc.MyPack;
function c2f(Tc);
   return(((Tc*9)/5)+32);
endsub;
function f2c(Tf);
   return((5/9)*(Tf-32));
endsub;
run;

Next, we could use these functions to "enrich" the data, adding a temp_c variable to the US data and a temp_f variable to the World data:

options cmplib=sasuser.Myfunc;
data us_both;
   set us;
   Temp_c=f2c(temp_f);
run;
 
data world_both;
   set world;
   Temp_F=c2f(temp_c);
run;

Results:
Listing of the two modified data sets

But - wouldn't it be cool to just print our °F temperatures in °C without having to modify the data? For that, let's write a custom format using our temperature conversion functions and use it in PROC PRINT:

proc format;
  value   c2f (default=5) other=[c2f()]; 
  value   f2c (default=5) other=[f2c()]; 
run;
title 'Printing with Formats';
proc print data=us label noobs;
  label temp_f='Temp (°C)';
  format temp_f f2c5.1;
run;

Results:
Listing of the US data set with the Farenheit temperature column formatted to display as Celsius

Even more intriguing, as Gordon originally suggested, we could build custom informats based on our temperature conversion functions and use them to read the temperature data from the raw data file directly into the units we desired. We could have read the Fahrenheit data in as Celsius or vice-versa, right from the start. For example:

data US_F_and_C;
   input @1 Place $14. @19 Date date7. @15 Temp_f @15 Temp_c f2c3.;
   format date date9.;
   datalines;
Alaska        -80 23jan71
Colorado      -60 01jan79
Idaho         -60 18jan43
North Dakota  -60 15feb36
Wyoming       -63 09feb33
;
run;
 
data World_F_and_C;
   input @1 Place $14. @15 Date date7. @23 Temp_f c2f3. @23 Temp_c ;
   format date date9.;
   datalines;
Antarctica    21jul83 -89
Siberia       06feb33 -68
Greenland     09jan54 -66
Yukon         03feb47 -63
Alaska        23jan71 -67
;
run;

And the results:
Listing of the data sets produced by reading in the raw data using custom informats

And that's a wrap! Until next time, may the SAS be with you!
Mark

Post a Comment

#1 SAS programming tip for 2012

What tiny looking operator packs a mighty punch with large datasets?  Your queries are big. Your reports span multiple columns. To top it all you have to summarize data in multiple ways. Are you tempted to throw in the towel? Do you feel like saying a firm “Can’t be done” to your team?

Hint of Advice: DON’T give up! Read on for a handy tip which might well be the most useful tip of 2012.

The problem
While mentoring recently on SAS, a banking customer asked, “I source data from several detailed tables. I want this report:"


The setup
What if you also need insights on client behavior? You can easily sift through large data with the slick Boolean. Basically the digital computer breaks down everything into Binary, 0s and 1s. Boolean Logic is the heart and core of all programming. It works with simple YES and NO responses. A YES is a value of 1 and a NO is a value of 0. You’ll see how you can create any query combination effortlessly. Once you understand Boolean logic you’ll realize that the best part of it is how surprisingly simple it is!

Without further ado, here’s my code joining data from two tables and summarizing several possible combinations of employee demographics to group report by country.

The Code

proc sql;
 select country,
sum(salary < 60000 and employee_gender='M') 'Males & Salary < 60000 ',
sum(salary >= 60000 and employee_gender='M') 'Males & Salary >= 60000 ',
sum(salary < 60000 and employee_gender='F')  'Females & Salary < 60000',
sum(salary >= 60000 and employee_gender='F') 'Females & Salary >= 60000'
from sales as s, orion.employee_addresses as p
where s.employee_id=p.employee_id
group by country
;
quit;

The Extension
Extend Boolean Logic to any industry data. Below see a report I created using diabetes data where I’m trying to narrow in on high risk individuals, using a combination of bmi (body mass index), age and glucose levels as filters. If you’re curious about the code and data, pop me a note. I’m happy to send it to you. (FYI - this was a great study on diabetes incidence among the Pima Population in Arizona that I presented at a health user group.)


The summary

At its very basic, I would break up Boolean logic into two pieces:

1)      Predicate or Key concepts- this is your filter. Examples: employee_gender=’M’, Salary > $60000.
2)      Logical Operator or Glue – which joins predicates. These are the logical operators AND, OR, NOT.

It might help to remember that we use Boolean logic almost every day in our daily life. You answer questions with YES or NO responses.  Weather check on smartphone apps, calendars help track client appointments. With the help of these YES, or NO type answers we make decisions on what to wear!

To sum up, don’t let the big word BOOLEAN daze you. All I did was take advantage of its capacity to combine multiple conditions for sifting through customer data to get as much intelligence as my heart desires. Wonder why I picked this as my #1 tip? I've used many SAS tips to my advantage. But this one stands out in my mind in its universality of use that I simply had to share with you. Have you used Boolean logic? Do you have a better solution for my customer problem? Write to me. I’d love to hear your experiences.

Shout out to Andrew Macdonald whose question prompted me to write this blog post. And to my colleague Dr. David Yeo with whom I’ve had seemingly endless discussions on Boolean logic.

To learn more, consider these SAS training courses: SAS Programming 1: Essentials, SAS SQL 1: Essentials, Text Analytics Using SAS Text Miner.

Post a Comment

Jedi SAS Tricks – Roll Your Own Function

A SAS user (who lives in the the US) emailed me a question about SAS functions. He was reading UTC (Coordinated Universal Time) datetime values from server logs, and to make future calculations and comparisons easier, he wanted to transform the value to local datetime.  The INTNX() function worked great, but did not account for daylight savings time (DST), and, he wondered, “Does SAS have an interval calculating function that takes DST into account?”  I couldn’t find one, so I finally had to answer "No, it doesn't".  And I really hated having to say that!  I thought “Surely, the power of the SAS can bend this data to my will!”  And off I went to conquer the problem using the SAS Function Compiler procedure, affectionately known as PROC FCMP.

Introduced in SAS 9.1, PROC FCMP lets you to create custom SAS functions and CALL routines.  PROC FCMP syntax is very much like DATA step, and you can leverage most features of Base SAS when defining your routines. The custom functions and CALL routines created are used in subsequent DATA steps or SAS procedures just as you would any standard SAS function or CALL routine.

First I searched for some starter code, and found this excellent tidbit “Sample 24735: Compute daylight saving time”.  I thought I remembered that Canada and the US shared the days and times they shifted to DST. Because the US recently legislated changes to DST, I searched the Internet and found this article which says the US and Canada still shift on the same days. And I read it on the Internet - so it must be true!!  With the truth now firmly on my side, I set out to write a function what would convert DST to local time, and would work for Canada and US time zones.

The first step was writing a data step prototype which would successfully adjust datetime values. Here is the prototype of the function I want to build:

data TEST;
   /* Set values for testing */
   DT='26JUN12:17:00:00'dt;
   FORMAT DT DATETIME. NEWDATE DATETIME.;
   ZONE=-5;
   year=year((datepart(dt)));
   day1=mdy(1,1,year);
   /* Year <=2006 DST starts 1st SUN in April, ends last SUN in October */
   if year <= 2006 then do;
      dst_month_start=intnx('month',day1,3);
      dst_beg=intnx('week.1',dst_month_start,(weekday(dst_month_start) ne 1));
      dst_month_end=intnx('month',day1,9);
      dst_end=intnx('week.1',dst_month_end
                    ,(weekday(dst_month_end) in (6,7))+4);
   end;
   /* Year >2006 DST starts 2nd SUN in March, ends 1st SUN in November. */
   else do;  
      dst_month_start=intnx('month',day1,2);
      dst_beg=intnx('week.1',dst_month_start
                  , (weekday(dst_month_start) in (2,3,4,5,6,7))+1);
      dst_month_end=intnx('month',day1,10);
      dst_end=intnx('week.1',dst_month_end,(weekday(dst_month_end) ne 1));
   end;
   NEWDATE=dt+(zone*60*60);
   if dst_beg <= datepart(dt)<= dst_end then NEWDATE=NEWDATE+(3600);
   putlog _all_;
run;

This worked like a charm, so now it was time to convert my data step code into a function with PROC FCMP. However, it was annoying to have to look up the correct “minus number” for my time zone every time I used the routine while I was testing. I wanted my function to use the text time zone name instead - ‘Eastern’ for example. So I added a new character variable "Z" and a SELECT group do the lookup for the ZONE variable numeric value. The function takes 2 parameters: DT (a datetime value) and Z (the name of the time zone). Here is my code:

proc fcmp outlib=sasuser.myfunc.UTC2LOCAL;
   function UTC2LOC(dt,z $);
      Z=UPCASE(Z);
      select (z);
         when ('NEWFOUNDLAND') ZONE=-3.5;
         when ('ATLANTIC') ZONE=-4;
         when ('EASTERN') ZONE=-5;
         when ('CENTRAL') ZONE=-6;
         when ('MOUNTAIN') ZONE=-7;
         when ('PACIFIC') ZONE=-8;
         when ('SASKATCHEWAN') ZONE=-6;
         when ('ARIZONA') ZONE=-7;
         when ('ALASKA') ZONE=-9;
         when ('HAWAII') ZONE=-11;
         otherwise do;
            zone=0;
            PUT 'Valid zones: NEWFOUNDLAND, ALLANTIC, EASTERN, '
                'CENTRAL, MOUNTAIN, PACIFIC, ALASKA, ARIZONA, '
                'HAWAII or SASKATCHEWAN.';
         end;
      end;
      year=year((datepart(dt)));
      day1=mdy(1,1,year);
      /* Year <=2006 DST from 1st SUN in APR to last SUN in OCT */
      if year <= 2006 then do;
         dst_month_start=intnx('month',day1,3);
         dst_beg=intnx('week.1',dst_month_start
                       ,(weekday(dst_month_start) ne 1));
         dst_month_end=intnx('month',day1,9);
         dst_end=intnx('week.1',dst_month_end
                       ,(weekday(dst_month_end) in (6,7))+4);
      end;
      /* Year >2006 DST from 2nd SUN in MAR to 1st SUN in NOV */
      else do;  
         dst_month_start=intnx('month',day1,2);
         dst_beg=intnx('week.1',dst_month_start
                       ,(weekday(dst_month_start) in (2,3,4,5,6,7))+1);
         dst_month_end=intnx('month',day1,10);
         dst_end=intnx('week.1',dst_month_end,(weekday(dst_month_end) ne 1));
      end;
      NEWDATE=dt+(zone*3600);
      if dst_beg<=datepart(dt)<=dst_end 
         and Z NOT IN ('ARIZONA','HAWAII','SASKATCHEWAN')
         then NEWDATE=NEWDATE+(3600);
      return(NEWDATE);
   endsub;
run;

And now to test my handiwork! This data step will use the new function to convert a UTC datetime value to local time:

options cmplib=sasuser.myfunc;
data TEST;
   do UTC='31JAN12:17:00:00'dt
         ,'31MAR12:17:00:00'dt
         ,'30JUN12:17:00:00'dt
         ,'30NOV12:17:00:00'dt;
      Newfoundland=UTC2LOC(UTC,'Newfoundland');
      Atlantic=UTC2LOC(UTC,'atlantic');
      Eastern=UTC2LOC(UTC,'EASTERN');
      Central=UTC2LOC(UTC,'Central');
      Saskatchewan=UTC2LOC(UTC,'SASKATCHEWAN');
      Mountain=UTC2LOC(UTC,'Mountain');
      Arizona=UTC2LOC(UTC,'Arizona');
      Alaska=UTC2LOC(UTC,'Alaska');
      Hawaii=UTC2LOC(UTC,'Hawaii');
      output;
   end;
   FORMAT UTC -- HAWAII datetime.;
run;
 
PROC PRINT data=test noobs; 
   format UTC -- HAWAII tod.; 
RUN;

Output:Listing of the data created using the custom function.
It worked just like I had hoped! But I wondered – would it work in PROC SQL? Let’s see:

proc sql;
select UTC
      ,UTC2LOC(UTC,'EASTERN') format=datetime. as Eastern
   from test
;
quit;

Output:Output of an SQL query using the custom function.

Beautiful! I've decided I like PROC FCMP!!

That's all for this episode. Until the next time, may the SAS be with you!
Mark

PS: You can download the package for this blog posting here.

Post a Comment

ESTIMATE Statements - the final installment

FINALLY…the simplest ESTIMATE statements to write are for continuous variables not involved in interactions or higher order terms.

Consider a data set containing the 2004 SAT scores for each of the 50 states. The file includes the combined math and verbal SAT scores (TOTAL), the state (STATE) and the percent of students participating in the test (PARTRATE). Suppose you assume a regression model to predict SAT scores as a linear function of the percentage of students in the state taking the test. The syntax for getting the estimated change in the score for a 20% increase in students taking the exam would be EASY for the following model!

proc glm data=sat2004;
model total=partrate;
ESTIMATE 'SAT Change for 20% Higher PartRate' partrate 20;
run;quit;

From the following output, our model estimates that a 20% increase in participation rate results in almost a 40 point drop in the average combined SAT score (with the standard error as shown).

                                                    Standard
Parameter                                Estimate      Error  t Value Pr > |t|
SAT Change for 20% Higher PartRate   -39.6613349  3.11418443  -12.74    <.0001

However, it's best not to assume a linear model without first exploring the data. Let's create a scatter plot of the data using PROC SGPLOT and use the LOESS statement to detect potential trends in the data:

proc sgplot data=sat2004;
scatter y=total x=partrate;
loess y=total x=partrate / nomarkers;
title 'Explore SAT Scores Dataset';
run;


As the percent of students taking the tests increases (PartRate), the total score on the SAT decreases, but not in a linear fashion. Above about a 60% participation rate, the plot for the total score appears flat. To capture the curvature, include both quadratic and linear terms of participation rate in the model:

proc glm data=sat2004;
model total=partrate partrate*partrate;
title 'Quadratic Model';
run;

The output from the Parameter Estimates table indicates that both the linear and quadratic terms are significant in the model:

Parameter           Estimate      Standard Error    t Value          Pr > |t|
Intercept          1185.715281      9.01781421       131.49           <.0001
PartRate             -5.215672      0.62285184        -8.37           <.0001
PartRate*PartRate     0.038318      0.00723291         5.30           <.0001

ESTIMATE statements can answer the question "how much does the predicted SAT score change when the number of students taking the test increases by XX%?" for this model. However, since we don't have a linear relationship, we have to specify where on the curve we would like to estimate our change.

For example, look below at the estimated total scores when 20%, 40%, 60%, and 80% of students in a state take the exam. This is another type of ESTIMATE statement—one where you are asking for an estimate at a specific value of your predictor variable, rather than asking for a comparison. Notice that the intercept must be included with a coefficient of 1 for this type of ESTIMATE statement. (The ODS SELECT statement limits the output to the results of the ESTIMATE statement only.)

ods select 'Estimates';
proc glm data=sat2004;
model total=partrate partrate*partrate;
  ESTIMATE '20%' intercept 1 partrate 20 partrate*partrate 400;
  ESTIMATE '40%' intercept 1 partrate 40 partrate*partrate 1600;
  ESTIMATE '60%' intercept 1 partrate 60 partrate*partrate 3600;
  ESTIMATE '80%' intercept 1 partrate 80 partrate*partrate 6400;
title 'Estimates from Quadratic Model';
run;
quit;

In the code above, notice that in the estimate statement at 20% participation, the coefficient of the linear term is 20, but the coefficient of the quadratic term is 202, or 400. Similarly, for the estimate statement at 40% participation, the coefficients are 40 and 402 (1600), at 60% participation they are 60 and 602 (3600), and at 80% they are 80 and 802 (6400).

The output below indicates that the estimated total scores are approximately 1097 at 20% participation, 1039 at 40% participation, 1011 at 60%, and 1014 at 80% (with standard errors as shown).

Parameter     Estimate     Standard Error        t Value           Pr > |t|
20%         1096.72907      5.20224622           210.82            <.0001
40%         1038.39733      7.26081186           143.01            <.0001
60%         1010.72006      5.99461482           168.60            <.0001
80%         1013.69726      7.43284299           136.38            <.0001 

While the participation rate is increasing at a constant rate of 20%, the estimated scores decrease, but not at a constant rate. So…what coefficients would you use to get an estimated change in total SAT score for a change of 20% in the participation rate?

For all instances of a 20% increase, the linear coefficient will be 20. Since the relationship is not linear, the coefficients for the quadratic term depend on where you are measuring the change. You can obtain the correct coefficient for the quadratic term by taking the difference of the quadratic coefficients (as found in the previous ESTIMATE statements) for our starting and ending participation rates.  For example,

  • an increase from 20% to 40% participation rate would have a quadratic coefficient of 1200 (1600-400);
  • an increase from 40% to 60% would have a quadratic coefficient of 2000 (3600-1600);
  • an increase of 60% to 80% would have a quadratic coefficient of 2800 (6400-3600).

The syntax for the ESTIMATE statements would be:

proc glm data=sat2004;
model total=partrate partrate*partrate;
  ESTIMATE 'Increase from 20% to 40%' partrate 20 partrate*partrate 1200;
  ESTIMATE 'Increase from 40% to 60%' partrate 20 partrate*partrate 2000;
  ESTIMATE 'Increase from 60% to 80%' partrate 20 partrate*partrate 2800;
title 'Estimates for 20% Increases from 20%, 40% and 60% PartRate';
run;
quit;

Parameter                      Estimate Standard Error    t Value      Pr > |t|
Increase from 20% to 40%    -58.3317408     4.32058228     -13.50        <.0001
Increase from 40% to 60%    -27.6772689     3.37111967      -8.21        <.0001
Increase from 60% to 80%      2.9772030     8.42760869       0.35        0.7254

The estimated change in SAT scores is

  1. approximately a 58 point decrease (standard error  4.3) when increasing participation rates from 20% to 40%;
  2. approximately a 28 point decrease (standard error of 3.4) when increasing participation rates  from 40% to 60%;
  3. not significantly different from 0 when increasing participation rates from 60% to 80%.

While linear models certainly have the easiest ESTIMATE statements, getting the correct estimates depends first on getting the appropriate model. By exploring the data before building the model and writing your ESTIMATE statements, you can ensure meaningful results.

This is our last installment on ESTIMATE statements, so feel free to suggest topics for future blogs.

Post a Comment

Two opportunities for students at Analytics 2012

I’m excited to let you know about two opportunities for students at the Analytics 2012 Conference, Oct. 8-9 in Las Vegas. The first is the Student Poster Contest. If you have some research to share with the analytics community, consider submitting an abstract. If your abstract is accepted, then you can attend the conference for free and if your abstract is awarded as one of the top six, you will receive an all-inclusive trip to the conference including airfare, hotel and meals!

The second opportunity is the 2012 Data Mining Shootout where student teams are asked to solve a common business problem using appropriate data mining methods and SAS software. The top three teams will be rewarded with a trip to the conference for one team member, the chance to present at the conference and a donation from SAS to the team’s university.

I hope to see you there!

(The early bird discount ends June 29, so register early. If you are a student or professor, check out our 50% academic discount.)

Post a Comment

Taking the storyboard approach

A storyboard is a blueprint of drawings/ideas in a specific sequence to illustrate a story.  Let’s take the Pixar film Toy Story as an example (I have a two-year-old and admittedly have watched this movie more than the hairs on my head).  Before they ever began animating with the design software, they made simple sketches of the characters and scenes.  Why?  So that everyone was on board with the idea BEFORE getting into the more laborious phase in the process.

Creating reports, dashboards and stored processes works the same way!  As a BI user, we need to take the same approach – storyboard it.  What do you want to learn from the data?  By asking that simple question, you are setting a solid foundation for building your BI reports.

I tuned in to the live webinar at SAS Global Forum 2012 titled “Building Better Business Intelligence with SAS.”  The webcast featured SAS expert and author Tricia Aanderud, co-author Angela Hall and SAS Education Director Eric Rossland.  Stacy Hobson, Director of Customer Retention at SAS, led the discussion on everything hot in Business Intelligence.  The advice that came up most often was, “Start with the whiteboard.”  It’s very easy to get excited about the software and just dive right in, but setting that strong foundation is vital.

How do we do this?

STEP 1:  Know the data.

STEP 2:  Make sure that discussions are happening between the user and the developer.  The user needs to be engaged from the beginning of the process.  What questions do you need answered?  How will these reports be used?  How does the information need to be displayed?  These types of questions not only create a clear line of communication, but can also save a lot of time for the developer.  The goal is for the developer to create a usable system for the user.

So don’t be afraid to take the storyboard approach.  Map out exactly what you want to learn from your data before diving into the software. In Buzz Lightyear’s words, with a strong storyboard, your data can go “to infinity and beyond!”

---

Resources that were discussed during the webinar include Tricia and Angela’s new book titled “Building Business Intelligence Using SAS: Content Development Examples” and the fast track course Creating Business Intelligence for your Organization.  Both are great tools for BI Content Developers who are entering into the wonderful world of BI content development.

Other helpful links:
Peek Inside: Building Business Intelligence with SAS: Content Development Examples
SAS Enterprise Business Intelligence Training

Post a Comment

"Easy button" for ESTIMATE statements

My previous blog demonstrated the most difficult type of ESTIMATE statement to write—a two-way (or higher) ANOVA with interactions. An "easy button" for ESTIMATE statement comes by having a simpler model.

Models with only main effects and no interactions make writing ESTIMATE statements straightforward.  Consider first a one-way ANOVA. A study was conducted at the University of Melbourne[1] exploring the pain thresholds of blonds and brunettes. Subjects were divided into four categories according to hair color: light blond, dark blond, light brunette, and dark brunette. Each person in the experiment was given a pain threshold score based on performance in a pain sensitivity test (the higher the score, the higher the person’s pain tolerance). The variables on the data set are the outcome variable Pain and the four-level predictor variable HairColor.

The code for the ANOVA using PROC GLM would be:

proc glm data=pain;
  class HairColor;
  model Pain=HairColor;
run;

The ANOVA table indicates that the pain scores aren't the same for all hair colors.

Source DF Type III SS Mean Square F Value Pr > F
HairColor 3 1360.72631 453.575439 6.79 0.0041

While the boxplots (default output for PROC GLM in SAS 9.3) allow you to visualize the differences, the only way to ascertain whether differences are significant is to use an ESTIMATE (or CONTRAST) statement.

Perhaps you would like to compare average pain scores of blondes and brunettes. You can obtain the coefficients easily by examining the Class Level Information table. First, notice that the levels of HairColor are sorted alphanumerically—DarkBlond, DarkBrunette, LightBlond, and LightBrunette. This will be important in correctly placing the coefficients.

Class Level Information
Class Levels Values
HairColor 4 DarkBlond DarkBrunette LightBlond LightBrunette

You would average the pain scores for dark and light blondes by applying coefficients of 0.5 to those levels (1st and 3rd); then compare that to the average pain scores of dark and light brunettes by applying coefficients of -0.5 to those levels (2nd and 4th). The coefficients should be applied left to right to the four levels of HairColor, as shown in the Values column, resulting in:   0.5, -0.5, 0.5, -and 0.5.

The syntax would then be:

proc glm data=pain;
  class HairColor;
  model Pain= HairColor;
  estimate 'Compare Blondes to Brunettes'
            HairColor 0.5 -0.5 0.5 -0.5;
run;

The results that follow indicate that blondes have higher pain thresholds than brunettes by 15.25 points, on average.

Parameter

Estimate

Standard Error

Value

Pr > |t|

Compare Blondes to Brunettes

15.250000

3.7672492

4.05

0.0011

While this ESTIMATE statement comes from a one-way ANOVA, the approach will be the same for ANOVA models with more than one factor, as long as no interactions are present.

In the next blog, we'll look at the EASIEST of all ESTIMATE statements—continuous variables not involved in interactions or higher order terms. Until then, enjoy your new "easy button"! (To learn more, take our Statistics 2: ANOVA and Regression training course.)


[1] From the OzDasl website 

Post a Comment

Go home on time with these 5 PROC SQL tips

How can you go home on time? Take a quick tour with these 5 tips that I was able to summarize for my students in the SAS SQL 1: Essentials class yesterday. Since they found it handy and asked if it could become a blog post, I thought I’d share with you. If you’re already an SQL user, hope this helps to review.

1.       Begin with output
Since your SQL query can get pretty complex, its best to start with defining your output. What does your manager or team need? Before you go running away to work on the task, do stop to ask them for a sample of the output. If they’re unable to find one, see if you can draw up the report –once you get confirmation from them you’re good to go looking for the data sources. And then you can define the SQL process to get you from input to output.

2.       So Few Workers Go Home On Time
Sigh, has the world really come to this? Really? The answer is a resounding NO! You can go home on time by remembering that this acronym is just the sequence of statements that SQL expects from you.

  So   SELECT
  Few   FROM
  Workers   WHERE
  Go   GROUP BY
  Home   HAVING
  On Time   ORDER BY

3.       Combining Tables
When to use Joins or SET Operators? Much has been written about this topic. I won’t try to tell you all about it, but just try to summarize a little here. Want to debug a query? Try this real quick visual.

  Definition Joins
stack columns & align rows
Set operators
stack rows & align columns
Visual
Syntax
(Look how even the syntax matches the visual!)
Select * from table1, table2 Select * from table1
SET operator(Except/Intersect/Union/Outer Union)
Select * from table2

4.       Subquery or Inline View
If not carefully presented or monitored, this can become a hair pulling experience for learners. Here’s my summary to hopefully serve as a clear guide: 

Definition Subquery –is just a query within an outer query. Inline View-is a datasource that you add inline into your query.
Syntax       Structure Since it’s a subquery, ask yourself, what statements would I use to query my data? WHERE or HAVING of course. So it makes sense that the Subquery returns values to be used on your WHERE or HAVING. It takes the place of a table. Where does your table definition happen in PROC SQL? In your FROM clause of course, so that’s where you would plug in an inline view.
Usage A subquery is used to do subsetting of the rows returned by the outer query.  In-line views are generally used to do joins where the query data is not needed in a table or view for subsequent processing.
# of Columns,
# of Rows
Can only return 1 column but any number of rows. Given that it replaces a table, it can return any # of columns and any # of rows just like your table can.
Replace Data Source or Adds Data Source  Since it returns a column, you get an additional data stream Since it takes the place of a table it replaces a data stream.
 (I just love this distinction between replace and add – don’t you?)

5.       Know thy Data
If you’ve read my previous blog post, you already know my number one rule for a SAS programmer. Of course, know thy data. But you don't know your data? No problem, use dictionary tables. You can extract a wealth of information about your system, e.g. what titles are in effect, what macros are available and their values, what system options are turned on...you can even go and query your data for all columns that have the word Date in them. Why is this useful? Consider all the joins you do and you need to know what tables have ORDER_DATE. You could choose to run a PROC CONTENTS on your tables one by one. What an enormous manual task that would be! Or you could check out your dictionary tables for columns. One swift piece of code will get you all tables that have ORDER_DATE. For more on dictionary tables, check this useful paper titled A Hands-on Tour Inside the World of PROC SQL by SQL expert, Kirk Paul Lafler.

Hope these 5 tips were able to summarize PROC SQL for you just like it did for my class yesterday.  Try them on for size. Let me know how you make out. Don’t be surprised if you find yourself going home on time. Want more? Check out our SAS SQL 1: Essentials course where we teach you the topics above and much, much more.

Post a Comment

The magical ESTIMATE (and CONTRAST) statements

When asked to select the best (or worst) of something in a business setting, do you wish you had "magic glasses" to see the answer?

PROC GLM and other statistical modeling procedures have their own versions of such an item with their ESTIMATE (and CONTRAST) statements. They allow you to assess whether one scenario is better than another based on your data, and provide a way to make informed decisions.

For example, consider a study to explore what affects memory[1]. One theory is that material is recalled as a function of how much it is processed when first encountered. Fifty younger subjects and 50 older subjects (between 55 and 65 years old) were randomly assigned to one of five learning groups: Counting (read a list of words and count the number of letters in each word); Rhyming (read each word and think of a rhyming word); Adjective (think of modifiers for each word); and Imagery (form vivid images of each word). None of these groups were informed that they would need to recall the words at a later time. The last learning group, Intentional, was told to memorize the words for later recall. After the subjects had gone through the list of 27 items three times, they were asked to write down all the words they could remember.

Of interest is the number of words recalled (Words) and the categorical predictor variables Age (Younger/Older) and Process(Adjective, Counting, Imagery, Intentional, and Rhyming). One interesting hypothesis is that forming vivid images of each word would be more effective for later recall as compared to intentionally memorizing each word; this can be formally tested with an ESTIMATE (or CONTRAST) statement in PROC GLM.

The results of the study can be visualized in the following series of paneled boxplots:

The number of words recalled appears to differ by Process and by Age. When analyzing this data set, it seems reasonable to include an interaction of Process by Age, since the histograms of Process appear to differ by Age:

proc glm data=recall;
  class age process;
  model words=age|process;
run; quit;

The ANOVA table that follows indicates a significant interaction between Process and Age, making an ESTIMATE statement more challenging to write.

Source DF Type III SS Mean Square F Value Pr > F
Age 1 240.250000 240.250000 29.94 <.0001
process 4 1514.94000 378.735000 47.19 <.0001
Age*process 4 190.300000 47.575000 5.93 0.0003
  • To obtain the coefficients for the contrast, set up a two-way table as shown below: use the first variable on the CLASS statement as the ROW variable (Age) and the second variable on the CLASS statement as the COLUMN variable (Process). Sort the levels of the variables alphanumerically, as shown.
Age Process
  Adjective Counting Imagery Intentional Rhyming  
Older
Younger
 
  • Next fill in the body of the chart with coefficients for the comparison of interest: compare the IMAGERY method (averaged over Older and Younger) to the INTENTIONAL method (also averaged over Younger and Older).
Age

Process

  Adjective Counting Imagery Intentional Rhyming  
Older 0.5 -0.5
Younger 0.5 -0.5
 
  • Label the last column and last row as 'Marginal'. Then fill in the blank cells within the body of the chart with zeros.
  • Lastly, sum across the rows and down the columns to obtain the marginal coefficients.
Age

Process

  Adjective Counting Imagery Intentional Rhyming Marginal Age
Older 0 0 0.5 -0.5 0 0
Younger 0 0 0.5 -0.5 0 0
Marginal (Process) 0 0 1 -1 0 0


The marginal coefficients provide the coefficients for the main effects: the row marginal coefficients are for the variable Age[2] and the column marginal coefficients are for the variable Process. (Note that the marginal coefficients sum to 0 in both directions.) The coefficients in the body of the chart provide the coefficients for the Age*Process interaction.

The ESTIMATE (or CONTRAST) statement would follow the MODEL statement in your call to PROC GLM with the syntax shown below. The coefficients for the interaction term are obtained by reading within the body of the table: first across row 1 from left to right, then across row 2 left to right, then row 3 left to right:

estimate 'Compare Imagery to Intentional Memorizing (both averaged over age groups)'
                        Age 0 0
                  	Process 0 0 1 -1 0
                  	Age*Process 0 0 0.5 -0.5 0
                        0 0 0.5 -0.5 0;

The high p-value indicates that not enough evidence is present to reject the null hypothesis that vivid imagery and intentional memorization result in the same number of words recalled. Although the two methods do not differ significantly in the number of words recalled, vivid imagery may result in more interesting study sessions!

Parameter Estimate Standard Error t Value Pr > |t|
Compare Imagery to Intentional Memorizing (both averaged over age) -0.15000000 0.89585465 -0.17 0.8674

To learn more, take our Statistics 2: ANOVA and Regression training course.


[1] (Howell, D. C. (1999). Fundamental Statistics for the Behavioral Sciences, 4th Edition. Duxbury Press, Pacific Grove, California.

[2] Since the coefficients for Age are both zero, Age may be omitted from the ESTIMATE statement.

Post a Comment

125 years of experience can't be wrong

Occasionally we see students in our more advanced courses who have skipped the Programming 1: Essentials class. Usually they are familiar with SAS or other programming languages. Sometimes they are even fluent and proficient in other programming languages. They feel comfortable skipping the basics and moving on to other classes, confident in their abilities to succeed as an efficient SAS programmer. Our SAS Training Instructors offer some valid reasons and explanations as to why you should always start with the foundation….even if you’ve already been a SAS programmer for the past five years!

Cynthia Zender
SAS Programmer: 33 Years
SAS Training Instructor: 20 Years

Programming 1 is the foundation course and prerequisite knowledge for almost every other class we offer. It would sort of be like jumping into James Beard’s cookbook without knowing what the difference is between sauté, fry and blanche. Could you muddle your way through the recipe? Sure. But you’d be more confident if you understood the basics of cooking first.

James Waite
SAS Programmer: 6 Years
SAS Training Instructor: 6 Years

Imagine that I am a Windows user and I need to rename a picture file.  In Windows Explorer, I will right-click on the file, choose “Rename” and provide the new filename.  In this situation, my knowledge of Windows Explorer allows me to perform the task quickly.  Now, imagine that I want to rename all 1,000 pictures in a given folder, so that they have the same prefix.  Suddenly my skills within the very visual environment of Windows Explorer are limited and inefficient.  While I will technically be able to achieve the objective using my current skillset, the process will be cumbersome and inefficient.  However, if I take the time to learn some DOS, which is essentially the foundation for Windows Explorer, then in just a few minutes I would have code to run that renames all of the pictures in just seconds.

For exactly the same reason, new users of SAS should also consider taking SAS Programming 1, which is the Foundation of most of the visual development tools.

Will you need to write a lot of code to achieve your day-to-day objectives?  Not likely. Will the need arise to go beyond the day-to-day uses for the tool?  Quite likely, and you can often address these situations more efficiently if you know how to code in SAS.

Mary Harding
SAS Programmer: 21 Years
SAS Training Instructor: 11 Years

I recently had a student who took the time to tell me how much he was enjoying the SAS Programming 1 course because it was so much easier to understand than the self-teach method he had initially tried.  As instructors we use our knowledge and expertise to explain technical concepts in a way that is difficult to get from just help and online documents.  Online resources are great for looking up a specific issue but not always the fastest way of getting a good understanding of how it all fits together.  In this age of digital information we think we can get all the answers online, but in many situations this is a longer pathway, especially when learning something new.

In recommending SAS Programming 1, other points I would emphasize include:

  • Starting from the beginning provides a faster way to ramp up and be more productive at work.
  • Students receive important reference materials in class as well as access to extended learning resources after class.
  • SAS Programming 1 is a prerequisite for most other programming based SAS courses.
  • Students have the ability to ask a knowledgeable instructor specific SAS questions as well as the access to that instructor after class. Many students continue to seek the help and advice from their instructor weeks, months or even years after taking a class.

Andy Ravenna
SAS Programmer: 27 Years
SAS Training Instructor: 12 Years

I like to think that one of the bonuses of a student coming to our Programming 1 course is that we give them tips, tricks, and hints that are going to benefit them for the remainder of their SAS coding career.  I think each of us has a bucketful of experience that we have picked up along the way and we share that with them, in particular our Programming 1 students.  We emphasize questions like “Why do you think that stupid little period in the format is so important?”  or  “Do I need to worry about the case of these values in the quotation marks?”  or “How will this libname statement change when you get back to your office?”

This kind of learning you can’t find in books, searching a web site, or by watching a recording.  It only comes from that one to many interaction of experienced instructor with classroom.

Mark Jordan (a.k.a. the SAS Jedi)
Computer Programmer: 40 years
SAS Programmer: 18 Years
SAS Training Instructor: 6 Years
(And yes, Mark's birthday in SAS is a large negative number.)

 

I have a personal story to share about this:

My first professional programming gig was as an applications developer on z/OS for a shipbuilding company.  They allowed me all the time I needed to get things done, but they had no training budget at all – none.  So for my first 5 years as a SAS programmer I was completely self-taught.  I became adept at SAS programming, and even published a local SAS User Group paper or two.  I got my next job with a large financial services company based primarily on my reputation.  I remarked to my new boss how I’d never been able to weasel my way into a SAS class during my last job.  During my first week on board, my boss signed me up for SAS training!  I was ecstatic to learn that my new company had plenty of training budget, and actively encouraged professional development.  Then I discovered the class running that week was Programming 1, and I was nonplussed. And had perhaps had just a touch of attitude...  After all, the “SAS Jedi” - in a Programming 1 class?  Harrumph!

But, somewhere during that class the instructor presented the workings of the PDV.  I’d never heard of it before - and suddenly, a light came on. Now I understood how DATA step merges really worked, and I could code more consistently, competently and confidently with a firmer understanding of the fundamental concepts of the SAS language.  I paid much better attention thereafter in class, and picked up many useful tips.  I’ve never again looked askance at training, no matter how basic.

When we teach ourselves a skill, we usually have to focus on completing a particular project or task.  We don’t have or take the time to root through all of the fundamentals – we just get the job done.  Programming 1 can fill in those gaps in your “SAS foundation”, even if you’ve been using SAS for years.

Jim Simon
SAS Programmer: 30 Years
SAS Training Instructor: 23 Years

I tell my Programming  1 students something very similar to Mark's last sentence  above....

“I understand many of you have been using SAS for several weeks already.  That's fine.  You have some SAS knowledge.   But your knowledge is like Swiss cheese.  It's full of holes.  This class fills in those holes to give you a solid foundation going forward.”

Unlike Mark, I was no SAS Jedi going into my first SAS course.  Quite the contrary, I lied my way into my first SAS job (Fake it ‘til you make it!) and was quickly busted and sent to a SAS class as "punishment" ...  March 12, 1982, Denver, CO.   The faded signature on my 30-year-old diploma:  J Larry Stewart (Currently the Vice President of SAS Education).

Post a Comment