What is a DATA step view and why is it important?

21

Last week I analyzed 12 million records of taxi cab transactions in New York City. As part of that analysis, I used a DATA step view to create a new variable, which was the ratio of the tip amount to the fare amount.

A novice SAS programmer told me that he has never heard of a "DATA step view." He asked, "What is a DATA step view?"

Simply put, a "view" is a SAS DATA step program that is stored for later execution. When the SAS language processor encounters the RUN statement, the program is compiled and saved, but not executed. When a procedure uses a data view, the program runs and serves data to the procedure as if the procedure were reading a regular SAS data set. Thus you can use a view to manipulate data "on the fly."

I like to create views when I want to construct a new variable in a huge data set, but I don't want to physically copy the data. When I analyze the data view by using another procedure, the constructed variable is computed on the fly.

Here's an example. Suppose you have a large data set that includes heights and weights for millions of patients. Some (but perhaps not all) analysts at your company need to analyze the body-mass index (BMI) for these patients. You have two options. The first option is to create a new data set that has the new column in it. This requires that you duplicate the original data and add a new column. The second option is to keep the original data unchanged, but create a view that computes the BMI. Any analyst that needs the BMI can access the data by using the view.

Let's see how this works by using a small data set. The Sashelp.Class data set contains height and weight (in English measurements) for 19 children. The BMI formula for children is slightly more complicated than for adults, but for simplicity the following SAS program simply uses the adult formula. The following DATA step creates a data view by using the VIEW= option on the DATA statement. The MEANS procedure then analyzes the newly calculated BMI variable:

data BMI / view=BMI;                 /* define DATA step view */
   set Sashelp.Class;
   BMI = weight / height**2 * 703;   /* BMI formula for adults (pounds and inches) */
run;
 
proc means data=BMI;
   var BMI;
run;
dataview

As you can see, the syntax for the MEANS procedure does not change. In fact, the only syntax that changes is the VIEW= option in the DATA step.

When SAS encounters the RUN statement in the DATA step, it saves the program. The program is not executed until it is used in the DATA= option in PROC MEANS. At that point, SAS executes the program and computes the BMI variable, which the procedure consumes and analyzes.

There are three main advantage of a data view: reduced storage space, not cluttering up a data set with extra variables, and if the view uses data that is updated regularly (for example, nightly sales data) then the view always reads the current data.

Three advantages to DATA step views #SAStip Click To Tweet

The main disadvantage to DATA step views is that the computed columns must be recomputed every time that a procedure uses the view. If you specify DATA=BMI for additional procedures, the BMI variable is recomputed each time, which is slower than reading pre-computed data.

For more information about SAS data views, see

Do you use DATA step views in your company? Leave a comment.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

21 Comments

  1. Anders Sköllermo on

    Hi! I do agree about the usefulness of Views. Mostly I have been using SAS Data Step views. Very useful. Good that Rick points this out.

    talked with Paul Kent at SAS Institute in Cary about the processing, many years ago. He said that in an example like above, the PROC MEANS is the important "part". (Not the best word, but I can not find any better word right now). Anyway PROC MEANS calls the data step View, and ask for data. So the processing in the View is initiated and "controlled" by the PROC MEANS.

    I my examples I read or fetched a lot of data, but only wanted those observations that fulfilled some complex criteria, like say SEX = 'F' and 30 <= AGE and INCOME >= 10000. (My actual application was far more serious with Very complex data from insurance files).

    Bottom line - If you have not tried Data Step Views do it! Right now!

    / Br Anders (anders.skollermo@one.se)

  2. Yep I know of the data-step view. Trying to promote its usage with "big data" as it is avoiding most of IO overhead. Imagine reading a big "-.txt" data file using SAS into some other system a DBMS or SAS/connect remote machine. No IO is the best performance gain for IO processing.
    Some others are compressing the data (Ross binary) reduction goal 10-20% of the not compressed size. Thinking on DIO usage instead of the system cache. Imagine SAS datasets of a size above 80GB where you system only has some similar amount of memory. The OS memory manager will get a lot of work leaving no time for real work (thrashing).

    All positieve? Not really.
    1/ Many people are learning the way of SQL views. They behave all somehow different. The choice of a SQL-view within a DBMS, a SQL view with SAS or a SAS data-step view.
    2/ Using Enterprise miner the data-step view is used by default. When having some bigger dataset transformations with some bigger data. The effect is a view on a view on a view ... resulting in very slow processing easily demanding more machine resources as available

  3. Anders Sköllermo on

    Hi! A further comment about the computations. Suppose You have a program like:
    data BMI / view=BMI; /* define DATA step view */
    set Sashelp.Class;
    BMI = weight / height**2 * 703; /* BMI formula for adults (pounds and inches) */
    run;

    proc means data=BMI; var BMI; run;
    proc freq data=BMI; tables BMI; run;
    Note that the Data step view is referred to twice.

    The actual calculations are performed as if the program looked like:
    data BMI / view=BMI; /* define DATA step view */
    set Sashelp.Class;
    BMI = weight / height**2 * 703; /* BMI formula for adults (pounds and inches) */
    run;

    proc means data=BMI; var BMI; run;
    The statements above are executed together, in the sense that first the Data Step view is defined,
    then the processing of PROC MEANS step starts - which includes calling the internal code for the Data Step view several times.

    NEXT look at:
    data BMI / view=BMI; /* define DATA step view */
    set Sashelp.Class;
    BMI = weight / height**2 * 703; /* BMI formula for adults (pounds and inches) */
    run;
    proc freq data=BMI; tables BMI; run;
    The statements above are executed together, in the following sense: (the Data Step view is already defined). The processing of PROC FREQ step starts - which includes calling the internal code for the Data Step view several times.

    It is up to the user to decide if tha same Data Step View shall be used, or another one, perhaps with the same name.

  4. Alan Gibson on

    I have a question: is it possible to specify view as the output option in a proc sql statement or enterprise guide query builder? Thanks.

    • Rick Wicklin

      In general, output from SAS/STAT procedures are written as data sets. The procedures take the input numbers and compute output numbers, which they write to disk. To write a view, the procedure would have to generate DATA step statements that give formulas/expression. The only place where I know that happens is if you use the CODE statement in a regression procedure. The CODE statement writes DATA step statements to a file, which could be %INCLUDEd into a view.

      I'll defer the SQL and EG questions to someone who knows more about those topics.

    • Chris Hemedinger
      Chris Hemedinger on

      Yes, PROC SQL supports views with CREATE VIEW, and SAS Enterprise Guide allows you to specify that you want a VIEW on the query options (click the Options for This Query menu button). Like DATA step views, SQL views build very fast, but can take longer to "render" when you open in a data grid or use as input to another procedure.

      • GERALD ZUCKIER on

        Also, caveat: An error in a SQL view (maybe a Dataset view as well?) won't show up when the view is created; only when somebody tries to read it.
        Which might mean the programmer runs the program without incident; at some point down the line, a user tries to do something with the view and gets the error.

        • Rick Wicklin

          Thanks for the tip. Syntax errors are caught when the view compiles, but data-dependent runtime errors can occur when the view runs. For example, the following DATA view does not contain any syntax errors, so it compiles. However, if the data set on the SET statement contains any observations for which Age<=13, the program will report "Invalid argument to LOG" for each observation at runtime:

          data V / view=V;
          set sashelp.class;
          y = Age - 13;     /* y< =0 when Age=13 */
          z = log(y);       /* domain error when y<=0 */
          run;
          
          proc means data=V;  /* get error here because view runs */
          run;
          

  5. Pingback: How to order categories in a two-way table with PROC FREQ - The DO Loop

  6. Pingback: How much do New Yorkers tip taxi drivers? - The DO Loop

  7. Pingback: Overlay plots on a box plot in SAS: Continuous X axis - The DO Loop

  8. Pingback: Use SAS formats to bin numerical variables - The DO Loop

  9. Hi,
    Nice article about data step view and its importance.Can you explain about How steep is the learning curve when transitioning from R to SAS.
    Thanks

  10. Pingback: Visualize an ANOVA with two-way interactions - The DO Loop

  11. Pingback: 5 reasons to use PROC FORMAT to recode variables in SAS - The DO Loop

  12. Pingback: The Farey sequence - The DO Loop

  13. Pingback: The case of the missing blanks: Why SAS output might not show multiple blanks in strings - The DO Loop

  14. Pingback: Reporting statistics for unobserved levels of categorical variables - The DO Loop

Leave A Reply

Back to Top