SAS Viya: Powering smarter decisions at lower cost and in shorter time

You have heard many sayings about time, money, or both. The phrase "Time is money" is frequently cited, as well as the complementary adage "You can get more money, but you cannot get more time." This is particularly true when conducting an analysis, as you are always on a tight schedule. If you are using shared resources, you are on everyone's clock, and those resources are everyone's money.

You know you need to save both time and money. In this post, you will learn a method to save time and possibly money when performing multiple repeated-measures analyses with the LOGSELECT and GENSELECT procedures in SAS Viya. You will do this by using the APPLYROWORDER option and a data set that has already been partitioned by subject.

Repeated measures… one more time

You might have used repeated measures analysis before, but it is always helpful to review the fundamentals. If your analysis involves making measurements of the same subject at separate times, you cannot always assume that these separate observations are independent. For example, it is reasonable to assume that a person has some latent qualities that tend to influence the observed characteristics of that person. When you use a generalized linear model, you can address this intra-subject correlation by using generalized estimating equations (GEE).

SAS Viya supports GEE in the GENSELECT and LOGSELECT procedures. In each of these procedures, you specify a subject effect, a working correlation type, and a within-subject ordering effect in the REPEATED statement. The subject effect identifies the individual subjects. The working correlation type specifies the assumed correlation structure between the repeated measurements. The within-subject effect specifies the order in which the measurements were taken; some correlation structures require this information.

The following SAS statements specify a logistic regression in PROC GENSELECT:

   proc genselect data=mycas.wheeze;
      class smoke subject visit;
      model wheeze(event='1') = age smoke / dist=binary;
      repeated subject=subject / type=UN within=visit;
   run;

This models the binary variable wheeze by using a continuous effect of age and a classification effect of smoke. In the REPEATED statement, the subject variable identifies the distinct subjects in the data set. The TYPE=UN option specifies an unstructured working correlation structure. In this setting, unstructured means that there is no specific pattern in the correlation between pairs of observations. Because the order is important for the unstructured correlation type, the REPEATED statement also specifies that the visit variable identifies the order of the observations within each subject.

This is simple enough so far, but there are some hidden operations at work!

Hidden time

Repeated-measures analysis involves working with a set of observations for each individual subject. For computational efficiency, the procedure organizes the data set so that all observations for an individual subject are contiguous. This organization is called a partition of the data set. The procedure creates this partition for you before the analysis begins.

That is certainly a kind service the procedure provides, but what if you want to run more than one analysis using the same subject definition? For example, you might want to evaluate different working correlation structures, different mean effects, or different response distributions.

You might wonder, “What’s so bad about that?” Well, each analysis requires a re-partition. Because the data set might have observations for each subject scattered across different worker nodes, the partition process can be time-consuming. That time will show up in the overall time needed to complete the analysis.
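To make the repeated cost concrete, here is a minimal sketch of such a series of analyses. The macro name fit_types and its parameters are hypothetical, chosen only for illustration, and the sketch assumes that IND, EXCH, and UN are accepted values of the TYPE= option in your release. Every call reads the unpartitioned wheeze data set, so every call pays for its own partition step:

   /* Hypothetical helper: refit the same GEE model once per          */
   /* working correlation type. The types listed in the call below    */
   /* are assumed to be valid TYPE= values.                           */
   %macro fit_types(data=, types=);
      %local i type;
      %let i = 1;
      %let type = %scan(&types, &i, %str( ));
      %do %while(%length(&type) > 0);
         proc genselect data=&data;
            class smoke subject visit;
            model wheeze(event='1') = age smoke / dist=binary;
            repeated subject=subject / type=&type within=visit;
         run;
         %let i = %eval(&i + 1);
         %let type = %scan(&types, &i, %str( ));
      %end;
   %mend fit_types;

   /* Three analyses, and therefore three separate partition steps */
   %fit_types(data=mycas.wheeze, types=ind exch un);

The subject effect is identical in all three runs, so the partition work is the same each time.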

Table 1 shows the maximum time required to complete the repeated-measures analysis for data sets of increasing size. The input data set wheeze has been randomly shuffled across five worker nodes before the analysis.

Notice how the time needed grows with the number of observations:

Number of Observations	Number of Subjects	Maximum run time (sec)
640,000	160,000	3.8
1,280,000	320,000	7.3
2,560,000	640,000	13.1
5,120,000	1,280,000	25.3
10,240,000	2,560,000	62.2
20,480,000	5,120,000	233.5
40,960,000	10,240,000	382.6

Table 1: Increasing run time as the number of subjects grows

This increase in run time with larger data sets is not surprising because there are more observations to process during the analysis. However, the time shown here also includes the time needed to partition the input data set. If you are performing several different analyses that use the same subject effect, then you are paying the price to repeat the same partition operation during each analysis. That does not feel like an effective use of your time and money – why repeat the partition for each analysis?

Time to relax… with a partition!

Beginning with the 2025.10 release of SAS Viya, you can use a pre-partitioned data set, removing the need for the procedure to re-partition the data set for each analysis. To do this, you will use PROC CAS to create the partition by using the partition action. The following SAS statements create a data set, wheezepart, that is a partitioned version of the wheeze data set:

   proc cas;
      table.partition /
        table={name="wheeze", groupBy={"subject"}, orderBy={"visit"}}, 
        casout={name="wheezepart", replace=true};
      run;
   quit;

The groupBy list specifies the variables that define the partition, and the orderBy list specifies the variables that define the order of the observations within each unique level of the groupBy list. The casout= option defines the output data set, named wheezepart. Notice that the groupBy list includes all the variables that define the subject effect.

Now that you have a pre-partitioned data set, you can proceed with many different analyses, all without having to wait for the partition process. You use the new wheezepart data set, and you specify the APPLYROWORDER option in the PROC statement:

   proc genselect data=mycas.wheezepart applyroworder;
      class smoke subject visit;
      model wheeze(event='1') = age smoke / dist=binary;
      repeated subject=subject / type=un within=visit;
   run;

Behind the scenes, the procedure verifies that the partition information in the wheezepart data set is compatible with the subject effect specified in the REPEATED statement. After the procedure verifies this, the analysis continues without the partitioning step. This feature is also available with the LOGSELECT procedure when you are analyzing repeated measures data in a logistic regression.
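For reference, the LOGSELECT version of the same analysis might look like the following. This is a sketch rather than tested code: it assumes that PROC LOGSELECT accepts the APPLYROWORDER option in its PROC statement and the same REPEATED statement options as the GENSELECT example. No DIST= option appears because LOGSELECT fits a logistic model for a binary response by default:

   /* Sketch: same GEE logistic model, fit with PROC LOGSELECT.     */
   /* Assumes the option placement mirrors the GENSELECT example.   */
   proc logselect data=mycas.wheezepart applyroworder;
      class smoke subject visit;
      model wheeze(event='1') = age smoke;
      repeated subject=subject / type=un within=visit;
   run;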

The results of another timing experiment that illustrates the benefit of pre-partitioning are shown in Table 2. The table compares the maximum time to complete the analysis for an unpartitioned input data set with the maximum time for a partitioned version of that data set. It also includes the maximum time for the partitioning step:

Number of Observations	Number of Subjects	Maximum partition time (sec)	Maximum run time *WITH* pre-partitioned data set (sec)	Maximum run time *WITHOUT* pre-partitioned data set (sec)
640,000	160,000	1.9	2.3	3.8
1,280,000	320,000	3.5	3.9	7.3
2,560,000	640,000	6.4	7.1	13.1
5,120,000	1,280,000	12.4	13.2	25.3
10,240,000	2,560,000	26.0	25.8	62.2
20,480,000	5,120,000	149.4	67.8	233.5
40,960,000	10,240,000	342.0	70.7	382.6

Table 2: Comparison of run time using a pre-partitioned data set to run time using an unpartitioned data set

The experiment aggregates results across multiple trials, so the sum of the partition time and the time for analysis with the pre-partitioned data set does not exactly equal the time for analysis without a pre-partitioned data set. However, the size of the saving is clear. If you are using the data set with 40,960,000 observations and you run a dozen separate analyses, you could save close to an hour: roughly 12 × 382.6 seconds without pre-partitioning, compared with 342.0 + 12 × 70.7 seconds when you partition once and reuse the result. If you are running in a cloud environment where you pay as you go for computation and storage, you might also save the costs associated with the redundant partitioning steps.
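The estimate for a dozen analyses can be reproduced with a short DATA step. The timings come from the last row of Table 2. The variable names are illustrative, and the calculation assumes that each analysis takes the tabulated maximum time:

   /* Back-of-the-envelope saving for 12 analyses on the largest data set */
   data _null_;
      n_analyses      = 12;     /* "a dozen" analyses                      */
      t_unpartitioned = 382.6;  /* run time without pre-partitioning (sec) */
      t_partitioned   = 70.7;   /* run time with pre-partitioning (sec)    */
      t_partition     = 342.0;  /* one-time partition step (sec)           */

      total_without = n_analyses * t_unpartitioned;
      total_with    = t_partition + n_analyses * t_partitioned;
      saved_min     = (total_without - total_with) / 60;

      put total_without= total_with= saved_min= 8.1;
   run;

Under these assumptions, the totals are 4,591.2 seconds without pre-partitioning and 1,190.4 seconds with it, a difference of about 57 minutes.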

Enjoy your time off

In this post, you learned how to save time and resources in repeated-measures analysis by using a pre-partitioned data set and the APPLYROWORDER option for the GENSELECT and LOGSELECT procedures in SAS Viya. Your next assignment is to decide what to do with the time you save!