Top 5 SAS programming best practices

23

I’m sure you will all agree that a programmer’s favorite task is to maximize efficiency thus enhancing performance. Dear reader, in this blog post I’ll share my best programming practices so that you can benefit from what I typically show my students in class. At the end I’d like to share my book video with you.

1. Camel case variable names to save typing time - Unlike Oracle which doesn’t seem to remember whether a table or column name is lower case, upper case or camel case and other databases (which don’t support mixed-case naming very well), forcing many developers to use an underscore (_) to separate words, SAS remembers. SAS doesn’t need you to use underscores within variable names. This is useful in more than one way.

a) Are you a big fan of The_Finger_Breaking_Underscore_Style? Thought not. This neat SAS feature helps you use variable names like FirstName instead of First_Name.

b) When you print data, the variable names are automatically split in your output at the next uppercase letter. (Note: this depends on the length of the data field.)

Check out this support.sas.com page for more on naming examples for Oracle.

2. Improve performance with the WHERE clause
When possible use the WHERE clause to improve performance, more efficient than the subsetting IF. Take a look at 2 examples below:

Using both WHERE and IF reads only 65 observations

data work.december;
   set orion.sales;
   where Country='AU';
   BonusMonth=month(Hire_Date);
   if BonusMonth=12;
   Bonus=500;
   Compensation=sum(Salary,Bonus);
run;

NOTE: There were 63 observations read from the data set ORION.SALES.
WHERE Country='AU';
NOTE: The data set WORK.DECEMBER has 3 observations and 12 variables.

Using just IF reads in 165 observations

data work.december;
   set orion.sales;
   BonusMonth=month(Hire_Date);
   if Country='AU' and BonusMonth=12;
   Bonus=500;
   Compensation=sum(Salary,Bonus);
run;

NOTE: There were 165 observations read from the data set ORION.SALES.
NOTE: The data set WORK.DECEMBER has 3 observations and 12 variables.

Which one to use? The first program uses both WHERE and IF to read only 65 observations. The second program reads in 165 observations-- I rest my case. The WHERE clause is a pre-processor acting on your source data which makes it highly efficient. The IF statements go through the PDV subsetting one observation at a time. Use a combination—WHERE on existing variables, and IF on new variables to maximize program efficiency.

3. Declare variable type as character to save space
You know numeric variables take up eight bytes of floating point storage. What if your data is an ID type field, e.g. AC460 which is an Air Canada flight code? What are the chances you will ever ask for the average of flight code? Slim, right?! Ok, why not store your ID variable as character so you take up only six bytes of storage. Your next question is bound to be “Why can’t I change the length of numeric variables explicitly with the LENGTH statement?” My answer is you can, but you run into reducing numeric precision. Want more? Well then you’ll have to visit our Programming 3: Advanced Techniques and Efficiencies class.

4. Use the access=readonly option on the libname statement to avoid overwriting shared data
Maybe your system administrator gave everyone privileges to read and write on shared folders. You are not sure this is such a good idea as you are just starting to learn SAS. What’s a rookie to do to avoid overwriting on system wide shared data? Stick the access=readonly option on your libname statement and now SAS will never let you accidentally overwrite your shared data.

Libname Orion 'U:coursesPRG1PRG1 course data' access=readonly;

5. Use the OUT=option on PROC SORT to keep source data intact
I saved you the best for last. Consider your 10 years’ worth of data with a million observations stored in order of customer_id used by your entire team of 20. Friday evening your team lead decides to sort the data by customer_age for a marketing campaign. But he only wants 2010 data so he sticks in a WHERE clause limiting the sort to 2010 data. He’s quite happy with the result and sends out the marketing campaign to these customers.
You return to work on Monday morning and query the Orion.customers data for the period between 2001 and 2009. Surprise! There is no more data. You go knocking on your coworker’s cubicle to ask. Unfortunately the data is gone, vamoose, vanished-never to surface again. Isn’t that the worst thing that could happen at work other than someone stealing your coffee mug?? Just to avoid these sort of frustrating episodes at work, I would strongly recommend putting the OUT=option on your PROC SORT statement. This way your source data is untouched. And everyone can breathe deeply again. I hope I’ve created enough drama here to highlight what a big deal this is. Losing your data because of poor practices is one of the most frustrating feelings ever.


So do yourself a favor and put the OUT= option every time you sort to avoid damage like this:

These are my top programming practices. What about you? I’m sure you have your bag of tricks. I’d love to hear your best programming practices, so drop me a line or comment below.

Dear Reader, do you enjoy my writing? Do you want to know how I landed my SAS job? Click here to view my video for the book competition I’ve entered. Voting is easy. Just register, then vote and pass it on to everyone you know. I’m in third place in this international competition, please help move my book to first place by voting and passing it on. The deadline is February 28. Thank you!

Share

About Author

Charu Shankar

Technical Training Specialist

Charu Shankar has been a Technical Training Specialist with SAS since 2007. She started as a programmer, and has taught computer languages, business and English Language skills. At SAS, Charu teaches the SAS language, SQL, SAS Enterprise guide and Business Intelligence. She interviews clients to recommend the right SAS training to help them meet their needs. She is helping build a center for special needs kids in this project. http://www.handicareintl.org/pankaja/pankaja.swf

Related Posts

23 Comments

  1. hi Charu,

    I read your blogs alot ,they are very good you are a bank full of knowladge , i am a fresher looking for some training do you have any online training clases going on , or someother way i could use your help.

    TIY
    Neha

  2. Pingback: Homepage

  3. Thanks for the post Charu.
    I'm glad that I follow all these top 5 practices @ my work as needed. I would also like to add few more practices which is worth sharing.

    1) Use IF. THEN ..ELSE statements instead of multiple IF. THEN statements if only one condition can be true for observation/record you are evaluating. When using multiple IF... THEN... ELSE statements, use the conditions in descending order of probability( incase you know the trend/pattern of data you are analyzing). This will save CPU time.

    2) For IN / OR operator: order condition most likely to be true first, then condition next most likely to be true second. It saves little time

    3) For AND operator: order condition most likely to be false first and this will reduce number of conditions tested by SAS.

    • Charu Shankar

      Thanks Manikandan for sharing your best practices. I'd like to use them in another post if I may, as they are quite valuable. I didn't know the one about the AND. Thanks for sharing.

  4. Important data should be kept in a read only location with only selected people able to write to it. I don't like using the out paramter as it creates a separate copy. With today's large PC's, I prefer to read the data from source and the process the datasets in the work directory. When the program is done the intermediate files disappear. Assigning variables that are not quantitative (i.e. the value represents a quantity of something) as numberic is bad practice. At one time our company had an inspection code that looked numeric and was treated by all the programs as numeric. When the plant decided to create alphabetic inspection codes, we spent 20,000 hours changing all the programs.

    • Charu Shankar

      I agree. Having a central location for data is really the best practice. With read access provided to everyone who needs to use that data & write access to a select responsible few. The work library is ideal to process data temporarily. When you have to share that data, then creating a library for output datasets becomes useful.
      Great point about non-quantitative data-since qualitative data can be either numeric or character, yet best practice would be to store as character-- to avoid the issue your company ran into when your inspection code came across numeric qualitative data.
      Thanks for your comments Terry, I hope to write a blog post on this topic..

    • Charu Shankar

      GUI's may not suffice in all places. No fault of the developers but the GUI can only do so many radio buttons..It really depends on your needs. While the GUI may suffice for one business analyst, another data developer may need to write data step code or macros & would need the programming language. So tailoring the GUI for your advanced requirements would need code knowledge. Thanks for writing in.

  5. In #5 "Poor Practice" is allowing all users permissions to modify the orion.Customer dataset. If the data is that important then "Best Practice" is to allow only one process to update it and everyone else only gets read access.

    You cannot forget that everyone (regardless of skill and experience) can sometimes get careless so you have to protect the data.

    • Charu Shankar

      You're right Carlos.. Best practice is indeed to provide read access to everyone. And write access to a select few. My recommendation with the readonly access is when an administrator may have provided read & write access to everyone-hopefully that's a very small population :) thanks for writing.

  6. IME there are a couple of other reasons why it's often good to store numeric-looking ID variables as characters:

    - Even in a system where ID codes are USUALLY numeric, you may encounter occasional character values. For instance, I deal with a system that handles a lot of field workloads (ID = 4-digit integer) alongside a small number of training workloads (ID = "TRAIN"). Setting the ID field to char means it can cope with this.

    - ID variables often code specific info in specific digits - for instance, the first digit of my workload records usually indicates the state number. Storing as char means I can refer to these digits easily via SUBSTR - granted you can make it work with nums via floor/remainder calculations but it's ugly.

    - Ability to use 'edit distance' functions when resolving mis-typed IDs.

  7. Thanks for your comment Chris, just to ensure a 2nd opinion, I reached out to Art Carpenter, a SAS expert for his take below:
    Both are right. depends on the situation and the audience. You are addressing a general case; the respondent a specific case.
    IMHO - Most users may not understand the complexities & subtle implications of reducing length for numeric variables. If they do understand AND they know that they are only dealing with integers, then the respondent’s comments are valid.
    Most users and some companies (as company policy) never reduce (or never allow the reduction of) the length of numeric variables – “just in case”. Your statement about storage efficiency, therefore is generally correct when the numeric length is always set to 8 (storage of SSN and EIN are exceptions).
    Storage is generally cheap &access is generally fast, so my rule of thumb is that codes (regardless of content) are text. Only numbers are stored as numeric values. And I only rarely reduce the length of a numeric value.

    • I recently converted certain fields in three data sets from character to numeric.

      Why?
      Because there were fields related to mechanical attributes stored as character and that didn't make sense to a lot of business people (e.g. lengths, capacities etc.).

      What I did?
      I converted the fields from character to numeric using input function with a default length of 8.

      Results?
      Table A :: Size before conv=456 MB :: Size after converting 12 variables=3GB
      Table B :: Size before conv= 1GB :: Size after converting 8 variables=5GB
      Table C :: Size before conv=300 MB :: Size after converting 58 variables=3GB

      I would like to hear some feedback/comments/reasoning ? Thanks!

  8. Re. 3: Declare variable type as character to save space
    I disagree, strings take more room (not to say they should be avoided, just reacting to your point).
    Numeric precision issues dont concern integers, only real numbers.
    Unless you have real number (who does?) or 2-digit IDs, storing them as a character will take more room than storing as a numeric.
    On windows 32 bits:
    3 bytes will store integers up to 8k
    4 bytes will store integers up to 2m (good for dates & times btw)
    5 bytes will store integers up to 500m
    6 bytes will store integers up to 137b (good for dt)
    7 bytes will store integers up to 35t
    I reckon this impacts not only space, but more importantly index speed is the variable is indexed, but I could be wrong. Just a hunch.
    The exact formula is max_int=2**(length*8-11) I think.

  9. Re CamelCase. What do you mean by 'SAS remembers'?
    I find it frustrating that the editor does not remember. That is if I define a variable in the program like
    ThisVar = 1;
    it would be great if the next time I typed 'thisvar' it was automatically converted to the case pattern first defined. (The way that the visual basic editor works)

    • Charu Shankar

      Thanks for your comment Bruce,. Maybe if you send me a note on why you would this feature useful in the editor, I can follow up with the developers to let them know this might be a practical business need.

  10. While on the subject of useful and informational filename, I wish SAS could do something with dataset labels other than display them in the explorer window.

  11. Thanks Stephen! You're right, SAS keeps dataset names in Uppercase. I wanted to make users aware of the possibilities -The advantage of camelcasing variable names is mostly for the splitting of names in your output. Dictionary tables are a powerful & easy way to query variable/dataset names where a user maybe unsure of case-- a topic we teach in our SQL1 class.

  12. thanks Catherine for taking the time to read & to vote.. didn't make it to round 2 of the competition, so a different path of seeking a good publisher..

  13. Glad you liked this post.. Would love to hear if you have any SASsy tips that you return to time & again... thanks!

  14. Thanks for the post!
    (1) Not sure that camel case is the best practice for naming all variables...
    (a) I like to keep a consistent naming convention for ease of programming/readability. Sure you can use camel case for variable names but when it comes time to name data sets that strategy does not work as well especially because SAS will propcase the set name -my opinion is to just be consistent across both.
    (b)Additionally, long camel case names are not as easily read - this can become a problem when others are using the same code for the first time or just scanning through to get the gist
    (5) I agree that there are certain situations(including the one you mention) when creating and OUT set makes sense for sorted data ( ad-hoc analysis, programs in 'development mode', etc). In terms of automated/scheduled programs - the saving of duplicate data can eat of space when it isn't really necessary.
    -Stephen

  15. I attended the SAS Programming 3 course in April 2007 and can't tell you how much that one course has saved me in programming efficiencies! I still clunk around but regularly use #2, 3, and 5 from your list!
    But this one was news to me - #4 - being able to specify access=readonly option on the libname statement! As one of two people who have "admin" privileges to everyone's data, I am scared to death any time someone asks me to look at or use their data to help with something. I will definitely be assigning the access=readonly to myself when I work in someone else's library. FWIW, I usually just make a copy (data xyz; set theirs.xyz) so I won't mess their data up, or else use the out= option, but this access=readonly may more readily do the trick for me!
    Love this blog!!

Leave A Reply

Back to Top