With the pervasiveness of mobile devices, being able to read while “on the go” has been easier than ever. How many times have you found yourself in a situation where you pass the time waiting by reading something on your phone/iPad/tablet etc? With eBooks on my iPad, I find that I am reading more and with the vast array of SAS Publications available and I find that it’s easier to read what I need too.
This got me thinking… reading only what you need is not just applicable to books, magazines, websites, and blog posts; it also relates to reading data. With the volume of information growing at a minimum rate of 59 percent annually, organizations need to consider efficiency when processing data. If an ETL process is written without considering efficiency, what is the impact to the business not only now but also in the future? Think about the effect this one ETL process can have: the resources it consumes (processing time, memory, CPU, disk space), the dependent processes it has, and, in turn, the resources consumed by these dependent processes. Also consider how frequently this ETL process is run – daily, weekly, monthly? How would it help your business delivery if you could streamline the process and provide the results sooner?
In teaching SAS Programming 1 for SAS Education Australia, I often find myself telling students to “read what you need”… What I mean by this is to read only the variables (columns) and observations (rows) that you need for the task at hand. Is there a requirement to read the entire SAS dataset, database table, raw data file and process all the data? One might say, “just in case requirements change in the future I’ll read it all now.” This “future-proofing” approach may seem wise but, if the requirements don’t change, you have an ETL process consuming far more resources than it actually needs.
When I refer to reading what you need, I also mean subsetting the data as close to the source as possible, in the most efficient way. For example, when reading data from a SAS dataset, consider using a where statement instead of an if statement. Both will produce the same results. However, when using a where statement, SAS will only read the observations it needs into the Program Data Vector (PDV). This consumes fewer resources and returns the results sooner. (Note: a where statement can only be used when the variable comes from a SAS dataset.) See the examples below on the SASHELP.CLASS dataset that contains 19 observations.
data work.femalesIf; set sashelp.class; if sex="F"; run; NOTE: There were 19 observations read from the data set SASHELP.CLASS. NOTE: The data set WORK.FEMALESIF has 9 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.04 seconds cpu time 0.03 seconds |
From the log messages, we can see that all the 19 observations were read for processing during the data step to produce a dataset of 9 observations.
data work.femalesWhere; set sashelp.class; where sex="F"; run; NOTE: There were 9 observations read from the data set SASHELP.CLASS. WHERE sex='F'; NOTE: The data set WORK.FEMALESWHERE has 9 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.02 seconds cpu time 0.01 seconds |
From the log messages, we can see that only 9 observations were read for processing during the data step, producing a dataset of 9 observations as above. Also note that the processing time was reduced (this value will vary from environment to environment).
This is just one example of the benefits of reading only what you need. By understanding the way in which SAS processes data behind the scenes, you can considerably improve the efficiency of your ETL processes. As one SAS Programming 1 attendee, who had been programming in SAS for several years, said to me recently “Attending the SAS Programming 1 course was like getting a professional driving lesson after being taught how to drive by friends/family. You are provided with focused information to break substandard habits and tips on the rules for the future.”
I’m interested to hear if this blog post has encouraged you to revisit your data processing and whether you have some examples you would like to share and/or papers/resources that you find useful in this area. Please share in the comments below.
3 Comments
Ever being rethinking on process doing a datastep view as a proc input instead of the real data?
You could avoid the PDV and intermediate files at all.savings in processing time and resource usage.
Absolutely... Being mindful of efficiencies at code creation is key.
Great advice Michelle. This is especially useful when you are working with large amounts of data over a slow connection!