Create a strip plot in SAS

My colleague, Mike Drutar, recently showed how to create a "strip plot" that shows the distribution of temperatures for each calendar month at a particular location. Mike created the strip plot in SAS Visual Analytics by using a point-and-click interface. This article shows how to create a similar graph by using SAS programming statements and the SGPLOT procedure in Base SAS. Along the way, I'll point out some tips and best practices for creating a strip plot.

Daily temperature data for Albany, NY

The data in this article is 25 years of daily temperatures in Albany, NY, from 1995 to 2019. I have analyzed this data previously when discussing how to model periodic data. The following DATA step downloads the data from the internet and adds a SAS date variable:

/* Read the data directly from the Internet */
filename webfile url "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYALBANY.txt" 
 /* some corporate users might need to add  proxy='http://...:80' */;
data TempData;
   infile webfile;
   input month day year Temperature;
   format Date date9.;
   Date = MDY(month, day, year);
   if Temperature=-99 then delete;   /* -99 is used for missing values */
run;

A basic strip plot

In a basic strip plot, a continuous variable is plotted against the levels of a categorical variable. If the values of the continuous variable are distinct, this technique theoretically can show all data values. In practice, however, markers will overlap, especially for large data sets. You can use jittering and semi-transparent markers to reduce the effect of overplotting. The SCATTER statement in PROC SGPLOT supports the JITTER option and the TRANSPARENCY= option.

Drutar's strip plot displays temperatures for each month of the year. For the Albany temperature data, you might assume that you need to create a new categorical variable that has the values 'Jan', 'Feb', ..., 'Dec'. However, you do not need to create a new variable. You can use the FORMAT statement and the MONNAMEw. format to convert the Date variable into discrete values "on the fly." This technique is described in the article "Use SAS formats to bin numerical variables." If you use the TYPE=DISCRETE option on the XAXIS statement, you obtain a basic strip plot of temperature versus each calendar month.

/* Create a strip plot.
   Use a format to bin dates into months:
   https://blogs.sas.com/content/iml/2016/08/08/sas-formats-bin-numerical-variables.html
*/
proc sgplot data=TempData;
   format Date MONNAME3.;                 /* bin Date into 12 months */
   scatter x=Date y=Temperature / 
           jitter transparency=0.85       /* handle overplotting */
           markerattrs=(symbol=CircleFilled) legendlabel="Daily Temperature";
   xaxis type=discrete display=(nolabel); /* create categorical axis */
   yaxis grid label="Temperature (F)"; 
run;

[Figure: Basic strip plot in SAS]

You can see dark areas in the graph. These indicate high-density regions where the daily temperatures are similar. For some applications, it is useful to further emphasize these regions by overlaying statistical estimates that show the average and range of each strip, as shown in the next section.

Of course, if your data already contains a categorical variable, you can create a strip plot directly. You will not need to use the FORMAT trick.
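
As a minimal sketch of that case, the following call plots a continuous variable against an existing character variable without any FORMAT statement. It assumes the Sashelp.Cars sample data set and its Origin and MPG_City variables; the transparency value is an arbitrary choice for a data set of a few hundred observations.

```sas
/* Sketch: strip plot when the data already contain a categorical variable.
   Assumes the Sashelp.Cars sample data (Origin is a character variable). */
proc sgplot data=sashelp.cars;
   scatter x=Origin y=MPG_City /
           jitter transparency=0.5        /* handle overplotting */
           markerattrs=(symbol=CircleFilled);
   xaxis type=discrete display=(nolabel); /* categorical axis */
   yaxis grid;
run;
```

Because Origin is already discrete, each of its levels becomes one strip.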

Overlay a visualization of the center and variation

To indicate the center of each month's temperature distribution, Drutar displays the median value for each month. He also overlays a line segment that shows the data range (min to max). In my strip plot, I will overlay the median but will use the interquartile range (Q1 to Q3) to display the variation in the data. You can use PROC MEANS to create a SAS data set that contains the statistics for each month:

/* write statistics for each month to a data set */
proc means data=TempData noprint;
   format Date MONNAME3.;            /* bin Date into 12 months */
   class Date;                       /* output the statistics for each month */
   var Temperature;
   output out=MeanOut(where=(_TYPE_=1)) median=Median Q1=Q1 Q3=Q3;
run;
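
If you want to check the statistics before plotting them, a sketch like the following might help. It uses the NWAY option, which for a single CLASS variable keeps the same rows as the WHERE=(_TYPE_=1) data set option, and then prints the result. The data set name MeanOut2 is an illustrative assumption and is not used elsewhere in this article.

```sas
/* Sketch: equivalent computation that uses NWAY, followed by a quick check.
   MeanOut2 is a hypothetical name chosen for this example. */
proc means data=TempData noprint nway;
   format Date MONNAME3.;            /* bin Date into 12 months */
   class Date;
   var Temperature;
   output out=MeanOut2 median=Median Q1=Q1 Q3=Q3;
run;

proc print data=MeanOut2 noobs;
   format Date MONNAME3.;            /* display the month name for each row */
   var Date _FREQ_ Q1 Median Q3;
run;
```

The printed table should contain one row for each calendar month, where _FREQ_ is the number of daily observations in that month.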

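To create a new graph that overlays the statistics, append the statistics to the data. Because the appended rows have missing values for Temperature, and the original rows have missing values for Q1, Q3, and Median, each plot statement draws only the observations that belong to it. You can then use a high-low plot to show the variation in the data and a second SCATTER statement to overlay the median values, as follows: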

data StripPlot;
   set TempData MeanOut;                  /* append statistics to data */
run;
 
proc sgplot data=StripPlot;
   format Date MONNAME3.;                 /* bin Date into 12 months */
   scatter x=Date y=Temperature / 
           jitter transparency=0.85       /* handle overplotting */
           markerattrs=(symbol=CircleFilled) legendlabel="Daily Temperature";
   highlow x=date low=Q1 high=Q3 /        /* use high-low plot to display range of data */
           legendlabel="IQR" lineattrs=GraphData3(thickness=5);
   scatter x=date y=Median /              /* plot the median value for each strip */
           markerattrs=GraphData2(size=12 symbol=CircleFilled);
   xaxis type=discrete display=(nolabel); /* create categorical axis */
   yaxis grid label="Temperature (F)"; 
run;

[Figure: Strip plot in SAS with overlays of median and interquartile range statistics]

In summary, you can use the SCATTER statement in the SGPLOT procedure to create a basic strip plot. One axis displays the continuous variable; the other displays a discrete (categorical) variable. You can use the JITTER option to reduce overplotting. For data that contain thousands of observations, you might also want to use the TRANSPARENCY= option to display semi-transparent markers. Typically, you will use higher transparency values for larger data sets. Finally, you can use PROC MEANS to create an output data set that contains summary statistics for each strip. This article computes and displays the median and interquartile range for each strip, but you could also use the mean and standard deviation.
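
The mean-and-standard-deviation variation might look like the following sketch. It assumes the TempData data set from this article; the names MeanSD, StripPlot2, AvgTemp, SDTemp, Lower, and Upper are illustrative, and the choice of one standard deviation on either side of the mean is an assumption, not a recommendation.

```sas
/* Sketch: overlay mean +/- 1 standard deviation instead of median and IQR.
   All new data set and variable names are hypothetical. */
proc means data=TempData noprint nway;
   format Date MONNAME3.;                 /* bin Date into 12 months */
   class Date;
   var Temperature;
   output out=MeanSD mean=AvgTemp std=SDTemp;
run;

data StripPlot2;
   set TempData MeanSD;                   /* append statistics to data */
   if not missing(AvgTemp) then do;       /* only the appended statistics rows */
      Lower = AvgTemp - SDTemp;
      Upper = AvgTemp + SDTemp;
   end;
run;

proc sgplot data=StripPlot2;
   format Date MONNAME3.;
   scatter x=Date y=Temperature /
           jitter transparency=0.85
           markerattrs=(symbol=CircleFilled) legendlabel="Daily Temperature";
   highlow x=Date low=Lower high=Upper /
           legendlabel="Mean +/- 1 SD" lineattrs=GraphData3(thickness=5);
   scatter x=Date y=AvgTemp / legendlabel="Mean"
           markerattrs=GraphData2(size=12 symbol=CircleFilled);
   xaxis type=discrete display=(nolabel);
   yaxis grid label="Temperature (F)";
run;
```

The structure mirrors the median and interquartile range example: only the statistics and the endpoints of the line segments change.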

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

4 Comments

  1. Very cool, as usual.

    One thing - what if there is no X variable? That is, you want a strip plot of a single variable?

    Well, I figured out one way (maybe there is a better one). I created a dummy X variable that is a constant with some jitter. E.g. suppose we have a variable called pop08millions in a dataset called HAVE. We can then do something like this:
    data want;
    set have;
    x = pop08millions/pop08millions + ranuni(1020101);   /* constant 1 plus uniform jitter */
    run;

    proc sgplot data = want;
    scatter x = x y = pop08millions;
    xaxis values = (100000) label = " ";
    run;

    • Rick Wicklin

      It's easier: Just define a DATA step view and define x=0.

      data Want / view=Want;
      set sashelp.cars;
      x = 0;            /* dummy variable, used for strip plot */
      run;
       
      /* Create a strip plot of a single variable. */
      ods graphics / width=200px height=400px;
      proc sgplot data=Want;
         scatter x=x y=MPG_City / 
                 jitter transparency=0.5       /* handle overplotting */
                 markerattrs=(symbol=CircleFilled);
         xaxis type=discrete display=(nolabel novalues noticks); /* create categorical axis */
         yaxis grid; 
      run;