Sankey Diagrams

Sankey Diagrams have found increasing favor for visualization of data.  This visualization tool has been around for a long time, traditionally used to visualize the flow of energy or materials.

Now to be sure, GTL does have a statement design for a Sankey Diagram which was implemented only in Flex for use in interactive visualization cases.  The GTL Sankey Diagram statement was not implemented for use in MVA visualization cases due to lack of demand.

However, recently a SAS user asked about creating such graphs using SAS MVA graphics tools.  With SAS 9.4 there are sufficient tools in place to create such a diagram using custom coding without use of annotation.  In SAS 9.4M3, more tools are available that make this task easier.  I have outlined the process below.

The diagram created using the SAS 9.4 SGPLOT procedure is shown on the right.  Click on the diagram to see a bigger view.  Since no SANKEY statement is available in SGPLOT, such a diagram requires custom coding.  However, no annotation is required.  The program uses the following statements:

  • Series with SmoothConnect for the curves.
  • HighLow plot for the nodes and link values.
  • Scatter plot with MarkerChar for node labels.
  • Series plot to draw the brackets.
  • Scatter plot with MarkerChar for labels 1,2,3.

A custom data set has to be created to draw the different parts of the diagram as shown in the attached program link at the bottom.

The diagram shown on the right uses the new SPLINE statement to be released soon with SAS 9.4M3.  This makes the process a little easier, as the spline is a smooth curve that does not need to pass through each of the vertex points.  The SAS 9.4M3 SGPLOT also supports varying line thickness for series and spline statements.

Clearly the data is hand-built for this particular diagram.  I believe this process can be converted to a macro to create a Sankey Diagram from a node-link data set with the appropriate information.  Things will get more interesting when the diagram includes links that split or merge at various nodes.
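As a rough illustration of the first technique in the list above, a single link can be drawn as a thick SERIES curve with SMOOTHCONNECT.  This is only a sketch: the data set name link_sketch, the four control points, and the line thickness and color are made up for the example and are not taken from the attached program.

```sas
/* Hypothetical control points for one link between two nodes */
data link_sketch;
  input x y;
  datalines;
0.0 3.0
0.2 3.0
0.8 1.0
1.0 1.0
;
run;

proc sgplot data=link_sketch noautolegend;
  /* SMOOTHCONNECT draws a smooth curve through the vertices */
  series x=x y=y / smoothconnect lineattrs=(thickness=20 color=lightblue);
  xaxis display=none;
  yaxis display=none min=0 max=4;
run;
```

A full diagram would need one such curve per link, plus the HighLow and Scatter overlays for the nodes and labels.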

SAS 9.4 SGPLOT Code:   Sankey_940


A 3D Scatter Plot Animation Macro

In the previous article, I described the process to create a 3D Scatter Plot using a 3D Orthographic View matrix and the SGPLOT procedure.  I posted a macro that can be used to create a 3D scatter plot from any SAS data set, using 3 numeric columns, one each for X, Y and Z (Response) axes.

Visualization of 3D data can be improved by providing interaction or animation.  Here I have described a way to create an animation using the idea described in the previous article.

Class3DScatterAnim

The setup for the animation is as follows:

options papersize=('5 in', '4 in') printerpath=gif animation=start 
        animduration=0.05 animloop=yes noanimoverlay;
ods printer dpi=100 file='C:\Class3DScatterAnim.gif';
 
ods listing image_dpi=200;
ods graphics / reset attrpriority=color width=5in height=4in imagefmt=GIF;
 
%run_anim_macro(data=sashelp.class, start=-30, end=-60, incr=-1);
%run_anim_macro(data=sashelp.class, start=-60, end=-30, incr=1);
 
options printerpath=gif animation=stop;
ods printer close;

I have modified the %Ortho3D_Macro provided in the previous article, and added a loop to render multiple graphs with changing value for the Z-Rotation from -30 to -60 and back by 1 degree.  Here I have created a GIF animation.  An SVG animation can also be created using the code at the bottom of the attached file.
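The loop itself is in the attached file.  A minimal sketch of the idea, assuming a macro %DO loop steps the rotation angle between the START and END values, might look like the following.  The macro name anim_loop_sketch is hypothetical, and the %PUT statement stands in for the step that renders one graph per angle.

```sas
/* Hypothetical stand-in for the loop inside %run_anim_macro */
%macro anim_loop_sketch(start=, end=, incr=);
  %local rot;
  %do rot = &start %to &end %by &incr;
    /* In the real macro, one graph would be rendered here per angle */
    %put NOTE: frame with Z-rotation = &rot;
  %end;
%mend anim_loop_sketch;

%anim_loop_sketch(start=-30, end=-60, incr=-1);
%anim_loop_sketch(start=-60, end=-30, incr=1);
```

Calling the loop twice, once in each direction, gives the back-and-forth motion when the animation repeats.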

3D Animation Macro:  Ortho_3D_Animation

Matrix Multiplication Function:  Matrix_Functions


A 3D Scatter Plot Macro

The SG Procedures do not support creating a 3D scatter plot.  GTL has some support for 3D graphs, including a 3D Bi-variate Histogram and a 3D Surface, but still no 3D point cloud.  The lack of such a feature is not due to any difficulty in doing this, as GTL already supports the LAYOUT OVERLAY3D container, but to the fact that there was no one urgently requesting such a feature.

However, often we do have a need for  visualization of 3D data, and it would be nice to be able to do this.  So, here I have presented a macro that uses the features of the SGPLOT procedure to display 3D data.  This uses SAS 9.4 features to render the walls, axis labels and the  filled "spherical-looking" markers.

%Ortho3D_Macro (Data=sashelp.class, WallData=wall_Axes, 
                X=height, Y=Age, Z=Weight,
                Lblx=Height, Lbly=Age, Lblz=Weight, 
                Group=Sex, Attrmap=attrmap, Tilt=65, Rotate=-55, 
                Title=Plot of Weight by Height and Age);

Note the following items in the macro invocation above:

  • The data set to be viewed is provided.
  • A data set defining the 3D walls is provided.  This is shown in detail in the program code.
  • The three columns to be mapped to each axis are provided.
  • X and Y form the two independent variables, and the response variable is displayed on the vertical Z axis.
  • Labels for each axis can be specified.
  • An Attribute map is used to set the visual attributes of the walls and bounding box of the data.  This is shown in the code.
  • A group variable can be used to color the markers.
  • Viewing parameters Tilt and Rotate can be set.  Values of Tilt from 0 to 90 and Rotate from -15 to -75 work best.
  • Title can be set.

The macro maps the 3D data to a unit cube, and projects the data into the view space using an ORTHOGRAPHIC projection.  This avoids distortion of the data that can happen when using a perspective projection.
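The macro does this with matrix functions that are provided in the attached code.  The sketch below shows the same idea with plain DATA step arithmetic.  It assumes that Rotate is a rotation about the Z axis and that Tilt is a rotation about the X axis, with Tilt=0 looking straight down on the floor and Tilt=90 giving a side view.  These conventions, the output data set name projected, and all new variable names are assumptions for illustration and may differ from the macro.

```sas
%let rotate = -55;   /* assumed: rotation about the Z axis, in degrees */
%let tilt   = 65;    /* assumed: tilt about the X axis, in degrees     */

/* Data ranges used to scale each variable to the unit cube */
proc sql noprint;
  select min(height), max(height),
         min(age),    max(age),
         min(weight), max(weight)
    into :xmin, :xmax, :ymin, :ymax, :zmin, :zmax
    from sashelp.class;
quit;

data projected;
  set sashelp.class;
  r = &rotate * constant('pi') / 180;
  t = &tilt   * constant('pi') / 180;

  /* Scale to [0,1] and center the cube on the origin */
  xu = (height - &xmin) / (&xmax - &xmin) - 0.5;
  yu = (age    - &ymin) / (&ymax - &ymin) - 0.5;
  zu = (weight - &zmin) / (&zmax - &zmin) - 0.5;

  /* Rotate about the Z axis */
  xr = xu*cos(r) - yu*sin(r);
  yr = xu*sin(r) + yu*cos(r);

  /* Tilt about the X axis; the orthographic projection */
  /* keeps the two screen coordinates and drops depth   */
  xs = xr;
  ys = yr*cos(t) + zu*sin(t);
run;

proc sgplot data=projected;
  scatter x=xs y=ys / group=sex markerattrs=(symbol=circlefilled size=10);
  xaxis display=none;
  yaxis display=none;
run;
```

Because no perspective divide is applied, parallel edges of the cube stay parallel on screen, which is the distortion-free property mentioned above.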

Class2

Here are the features of the graph:

  • Spherical looking markers are drawn at each (x, y, z) location.
  • Axis labels are drawn, but not the tick values.  That could be added, but can get messy.   The idea is to really see the shape of the data.
  • Relative positions of the markers can be a challenge to view in a static 3D view.  So, the X-Y, X-Z and Y-Z projections for each point are also displayed to help locate the points in 3D space.
  • Needles are dropped to the floor.
  • View parameters are displayed.

The same macro can be invoked for other data such as all the Sedans in the Cars data as shown below:

%Ortho3D_Macro (Data=sedans, WallData=wall_Axes, 
                X=horsepower, Y=Weight, Z=mpg_city,
                Lblx=Horsepower, Lbly=Weight, Lblz=Mileage, 
                Group=origin, Attrmap=attrmap, Tilt=45, Rotate=-60, 
                Title=Plot of Mileage by Horsepower and Weight);

Sedans2

For those interested in the process for projection of 3D data on to a 2D plane, the View Matrix for the Orthographic Projection is shown below.

Ortho_Projection_Border

We can also use the standard way to create an animated GIF or SVG file that helps in the visualization of the data.  I will include that in the next post.

Note the macro is provided for illustration purposes on what is possible.  I have not rigorously tested all settings and use cases.  If visualization of 3D data is something you feel you need, please chime in with your suggestions for more 3D plots right here, or to SAS Technical Support.

Full SAS 9.4 Code:  Ortho_3D_Macro_94

Matrix Functions:  Matrix_Functions


Margin Plots

Last week a user wanted to view the distribution of data using a Box Plot.  The issue was the presence of a lot of "bad" data.  I got to thinking of ways such data can be visualized.  I also discussed the matter with our resident expert Rick Wicklin who pointed me to a couple of resources including some information on visualization of missing data on the web.

First, my usual disclaimer:  I am only a "Graph Guy", and not a Statistician.  So, my thoughts below are mainly graphical suggestions.  Please feel free to point out pros and cons of the techniques discussed below.

On the issue of visualizing data using Box plots, I simulated some data using sashelp.heart by setting some data to missing, and setting those values to zero in another column.  Then, I used a box plot to view the data, and overlaid a scatter plot to view the values that were set to missing.  Since I put those observations in another column with a value of zero, they all show up at the bottom of the graph.  You can select the appropriate value.  I set the Y axis so zero is not on the axis.
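The simulation step is in the linked program.  A sketch of one way to build such a data set is shown here.  The names heart_Box and chol match the SGPLOT code in this post, but the random 10% selection and the seed are assumptions made only for illustration.

```sas
data heart_Box;
  set sashelp.heart(keep=deathcause cholesterol);
  if _n_ = 1 then call streaminit(12345);
  /* Assumed: about 10% of the known values are treated as "bad" */
  if not missing(cholesterol) and rand('uniform') < 0.10 then do;
    cholesterol = .;   /* removed from the box plot            */
    chol = 0;          /* flagged at zero for the scatter plot */
  end;
run;
```

Observations that were not selected keep their cholesterol value and have a missing chol, so they appear only in the box plot.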

SGPLOT with  SAS 9.40M1  supports overlays of basic plots with a VBOX.   Note how we can see that some of the "Cancer" and "Coronary Heart Disease" data is "bad", in this case, "missing".

title 'Cholesterol by Death Cause';
proc sgplot data=heart_Box noautolegend;
  vbox cholesterol / category=deathcause extreme;
  scatter x=deathcause y=chol / markerattrs=graphdata1(symbol=circlefilled) 
          transparency=0.5  name='s' jitter jitterwidth=0.5 legendlabel='Missing Data';
  keylegend 's' / location=inside position=topleft;
  xaxis display=(noticks nolabel);
  yaxis values=(100 to 500 by 100) min=0 valueshint;
run;

The user also made a comment on how the data was so skewed that a box plot was not possible.  That got me looking for another way to view the same data.  This time, I replaced some values for Cholesterol and Systolic with missing values, copying them into other variables.

Now, I plotted Systolic by Cholesterol, which displayed the cloud of non-missing values.  Then I added a box plot for all the values and a box for just the values where cholesterol was missing.  The graph is shown on the right.  Click on graph for a higher resolution image.

The blue box is of all the observations where cholesterol is non-missing.  The red box is for observations where cholesterol is missing, but systolic has a valid value.  Once again, this is possible with SAS 9.40M1 SGPLOT.  For the VBOX data, I have set the "category" values to 10 and 20.  Since the Scatter plot makes the axes "Linear", this combination is possible.

proc sgplot data=heart_2D noautolegend;
  scatter x=chol y=syst / name='s' legendlabel='Non Missing Data'
          markerattrs=graphdata1(symbol=circlefilled) transparency=0.7;
  vbox systBox / category=cholA extreme group=systgrp fill nooutliers name='b' boxwidth=1;
  keylegend 's' 'b';
  xaxis min=0 values=(100 to 500 by 100) valueshint grid label='Cholesterol';
  yaxis min=0 values=( 50 to 300 by 50) valueshint grid label='Systolic';
run;

Now, this gives us some idea where all the data is, but still this may not work well if the distribution of the data is bi-modal.  We can create the same graph using a scatter plot of the data instead of a box.

Here, I have displayed a scatter of the non-missing data  along with another scatter plot with two groups - All data and Missing Systolic data.  Maybe this view can provide a better visualization of the missing data.  We can certainly add insets to indicate the percentage of the missing data.

Another way may be to use a HISTOGRAM instead of the VBOX or SCATTER to view the distribution of the missing data.  I will take that up in a follow-up post.

Even with SAS 9.40M1, SGPLOT will allow us only to view one distribution at a time.  If we want to plot both the distribution of the Systolic for missing Cholesterol and vice-versa, we will need to use GTL.  Also, if you have a SAS release prior to SAS 9.40M1, you can use GTL to create the VBOX + SCATTER overlay graphs shown above.

The graph with box plots of all and missing data is shown on the right.  This graph is created using GTL.  It uses only one LAYOUT OVERLAY, since the categorical values for the box plots are also numeric.  However, we can use a LAYOUT LATTICE to create other combinations.

 Full SAS9.40M1 Code:  Margin_Plot


Cancer Deaths Averted


Significant progress in reduction of Cancer mortality is shown in a graph that I noticed recently on the Cancer Network web site.  This graph showed the actual and projected cancer mortality by year for males.  The graph is shown on the right.

The graph plots the projected and actual numbers by year, and highlights the difference using the hatched pattern.  The total number of Cancer Deaths Averted is shown.

The graph on the right includes a Y axis data range all the way down to zero, where it is really not necessary.  But, we can use this space that is otherwise wasted to display more information.

Creating the graph is easy, using the following SGPLOT code.  Some options are trimmed to fit the space.  See full code in link at bottom for the details.

title 'Cancer Deaths';
proc sgplot data=mortality nocycleattrs nowall noborder;
  styleattrs datalinepatterns=(solid);
  highlow x=year low=actual high=projected / type=line;
  series x=year y=projected / name='b' legendlabel='Projected';
  series x=year y=actual / name='a' legendlabel='Actual';
  keylegend 'a' 'b' / location=inside position=topleft linelength=20;
  xaxis values=(1975 to 2010 by 5) grid;
  yaxis values=(0 to  450000 by 50000) grid;
  run;

The graph shown on the right is created by the code shown above.  The data is "eye-balled" from the original graph and includes the columns Year, Actual, Projected and Diff.  The total number of deaths averted is saved in a macro variable, and also inserted into the label to be displayed.

Two SERIES plots are used to plot the actual and projected curves.  A HIGHLOW plot is used to draw the vertical hatch marks showing the reduction in the cancer deaths since 1990.  A legend is added to indicate the actual and projected curves.

 

For the graph shown below on the right, a Band plot is added to display the reduction in cancer deaths by year explicitly.   Also, we have used a TEXT plot statement to display the inset indicating the number of deaths averted.

 

Mortality_Averted_Label_2

There are some benefits to this addition.  The empty area at the bottom of the graph is utilized.  The actual deaths averted are drawn from a common baseline, thus removing the distortions in the hatched area due to the varying baseline.  An explicit inset shows the estimated number of deaths averted.

The TEXT plot is a SAS 9.4M2 feature, but one could use an INSET or a SCATTER with MarkerChar to do something similar.

Full SAS code:  CancerDeaths


Displaying Unicode Symbols in Legend

Including special Unicode symbols into the graph is getting more popular.  In general, SG procedures support Unicode strings in places where these strings are coded into the syntax such as TITLE, FOOTNOTE.  These support Unicode characters and also the  special {SUP} and {SUB} commands.  This is because these statements are rendered by the graph using Java string API.

Curve Labels and Axis Labels that are assigned in the procedure syntax can also support Unicode, but not the {SUP} and {SUB} commands.  This is because these items are passed to the graph rendering engine which cannot handle the {SUP} and {SUB} commands.  However, most of the popular numeric sub and super scripts are available in the Unicode fonts, so much of the need is covered.

Recently, a user chimed in on the Communities page, wanting to include Unicode values in the Legend.  The group variable values include strings like "Less than or equals", and the journal preferred usage of the Unicode "≤" symbol, not the "<=" sequence of characters.

With all the releases of SAS to date, the SGPLOT procedure cannot pass Unicode from data or formats into the graph legends or axes.  However, there is a way to do this by restructuring the grouped data into a multi-column format.

A few observations of the original data are shown on the right.  I have added a column based on the level of the Systolic Blood Pressure called "Status".

We could plot a graph of Weight by Height by Status, and get a scatter plot of the data, with the "Status" values displayed in the legend as "GE160" and so on.  However, that is not what the user wants.  The user would rather have the numeric values with the "≤" symbols.

The transformed data set is shown on the right.  Here, I have created four new columns, each containing the appropriate value for weight based on the Status.  So, this results in some missing values in the new columns.
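The full program is in the link at the bottom.  A minimal sketch of this kind of restructuring, assuming the source is sashelp.heart and the cut points are 120, 140 and 160, could look like the following.  The column names match the SGPLOT code in this post; the cut points are inferred from the legend labels.

```sas
data heart_cols;
  set sashelp.heart(keep=height weight systolic);
  /* Copy Weight into exactly one column per observation; */
  /* the other three columns stay missing                 */
  if      systolic >= 160       then ge160 = weight;
  else if systolic >= 140       then ge140 = weight;
  else if systolic >= 120       then ge120 = weight;
  else if not missing(systolic) then lt120 = weight;
run;
```

Each scatter plot then picks up only the observations in its own range, because missing Y values are not drawn.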

Now, instead of using one scatter plot with the GROUP option, we will plot these four columns using four scatter plots as shown below.  None of the scatter plots uses a group variable, and I have used the LEGENDLABEL option to provide the label for each scatter plot.  These labels include Unicode characters.

ods escapechar '~';
title 'Blood Pressure by Weight by Height';
proc sgplot data=heart_cols;
  scatter x=height y=ge160 / legendlabel="160 ~{Unicode '2264'x} Systolic ";
  scatter x=height y=ge140 / legendlabel="140 ~{Unicode '2264'x} Systolic < 160";
  scatter x=height y=ge120 / legendlabel="120 ~{Unicode '2264'x} Systolic < 140";
  scatter x=height y=lt120 / legendlabel="Systolic < 120";
  keylegend / title='' location=inside position=topleft across=1;
  run;

UnicodeinLegend_930

Click on the graph for a higher resolution view.  Note the legend on the top left contains the ranges for the Systolic blood pressure, using the appropriate Unicode symbols.  Each scatter plot in the graph is represented in the legend by the LEGENDLABEL.  The legend label can be assigned Unicode values as shown above.

Now, the legend in the graph can be improved if we can position all the "Systolic" labels in the legend vertically aligned.  To do this, one might want to add some blanks to the front of the text string in the Legend label for the fourth scatter plot.  However, this will not work, as all leading blanks are automatically stripped.  But, the system can be tricked to not strip the leading blanks by first adding a non-breaking space character 'A0'x in the label string followed by the required number of blanks. This is shown in the code and graph below.

ods escapechar '~';
title 'Blood Pressure by Weight by Height';
proc sgplot data=heart_cols;
  scatter x=height y=ge160 / legendlabel="160 ~{Unicode '2264'x} Systolic ";
  scatter x=height y=ge140 / legendlabel="140 ~{Unicode '2264'x} Systolic < 160";
  scatter x=height y=ge120 / legendlabel="120 ~{Unicode '2264'x} Systolic < 140";
  scatter x=height y=lt120 / legendlabel="~{Unicode '00a0'x}         Systolic < 120";
  keylegend / title='' location=inside position=topleft across=1;
  run;

UnicodeinLegend_Aligned_930

In the legend for the graph above, all the "Systolic" terms are correctly aligned, making the legend a bit easier to read.  Note, this process needs custom handling.  Full code is provided in the link below.

The good news is that support for Unicode in the graphs will be included with SAS 9.40M3 release using User Defined Formats.  With this approach, you will be able to format any data value into a string that can include Unicode symbols.  Thus group values or axis tick values can be customized programmatically.

Full SAS 9.3 Code:  LegendSymbols_930

Marker Symbols

There has been much discussion on the SAS Communities page on usage of different symbols in a graph.  The solution can vary based on the SAS release.  New features have been added at SAS 9.4 releases to SG Procedures and GTL that make this very easy.  With SAS 9.4M1, almost any combination is possible.

The user has a relatively simple scatter plot with two class levels.  The graph on the right is easily created using a scatter plot with a group role.  The code is shown below.

Note, starting with SAS 9.3, ODS HTML is the default open destination, using the HTMLBlue style.  This is a "Color" priority style, where each group gets only a color change till all Style Elements are used.  So, you do not see varying marker symbols in the graph on the right.

title 'Mileage by Horsepower by Make'; 
proc sgplot data=cars;
  scatter x=horsepower y=mpg_city / group=make;
  keylegend / location=inside position=topright;
  yaxis grid integer;
  xaxis grid;
  run;
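The cars data set used here is built in the full program.  One plausible way to get a two-level subset, assuming the two makes are the ones named in the image symbols later in this post, is sketched here.  The WHERE clause is an assumption.

```sas
/* Assumed subset: two makes from sashelp.cars */
data cars;
  set sashelp.cars(where=(make in ('BMW', 'Porsche')));
run;
```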

You can run the same graph with a style like LISTING, or set ATTRPRIORITY=none in the ODS Graphics statement to get the graph on the right.  Now, each group gets a different color and a different marker symbol.  These come from the style GraphData1-12 elements.

 ods graphics / reset attrpriority=none;

The user wanted to use the symbols "X" and "Y" instead of the "circle" and "plus" symbols that are the default first two symbols in the GraphData1-12 elements list.  This in itself is very easy, since the "X" and "Y" symbols are included in the list of built-in symbols supported by these procedures.  All you need to do is change the default symbols in the GraphData1-12 elements.

With SAS 9.4, it is very easy to change the group attributes by using the STYLEATTRS statement in SGPLOT.  This feature provides a simple in-line way to modify the list of colors, contrast colors, symbols and line patterns used for the group values, as shown in the code snippet below.

The list of values provided REPLACE the default group list as if this came from the style.  So, now the group cycling uses only the two symbols "X" and "Y" provided in the list.

proc sgplot data=cars;
  styleattrs datasymbols=(X Y);
  scatter x=horsepower y=mpg_city / group=make;
run;

But what if you want to use some special symbols that are not provided in the built-in list of symbols?  You can do that with SAS 9.4M1 using the new statements SYMBOLCHAR and SYMBOLIMAGE.  The SYMBOLCHAR statement supports the ability to use any character from any font as a symbol.  Using a Unicode font gives you thousands of symbols that can be used.

Say you want to use the Greek symbols for "Alpha" and "Beta" as the marker symbols.  You can define a new symbol name using the SYMBOLCHAR statement and then include that in the list of group symbols to be used using the STYLEATTRS statement.  The code snippet is shown below, and the resulting graph is shown on the right.  Click on the graph for a higher resolution view.

proc sgplot data=cars;
  symbolchar name=Alpha char='03b1'x / scale=1.8;
  symbolchar name=Beta char='03b2'x  / scale=1.8;
  styleattrs datasymbols=(Alpha Beta);
  scatter x=horsepower y=mpg_city / group=make markerattrs=(size=9);
  run;

Note the use of the SCALE option above.  Most font glyphs do not occupy all the pixels in the glyph.  So, these symbols may appear small.  The SCALE option allows us to scale them up.

And now, the "pièce de résistance".  In many cases, such as the case here, we can use symbols that not only distinguish between the group values, but by themselves provide information on what they represent.

The SYMBOLIMAGE statement allows you to define new symbols from images.  These can then be used for group values using the STYLEATTRS statement, just like shown above.  Here is a graph using image symbols.  Note, I have removed the legend just to make this point.  The markers do not require any legend to explain what they stand for (for most users).

It helps to make the images have a transparent background, so the shape of the icon is visible, and does not block other markers.  The images must be available on the local file system.

proc sgplot data=cars noautolegend;
  symbolimage name=BMW image="C:\BMWTrans.png";
  symbolimage name=Porsche image="C:\PorscheTrans.png";
  styleattrs datasymbols=(BMW Porsche);
  scatter x=horsepower y=mpg_city / group=make;
  run;

Full SAS 9.4 Code:  Symbol


Custom Labels

Over the Christmas Holidays I saw a graph of agricultural exports to Russia in 2013.  The part that caught my eye was the upper part of the graph, showing the breakdown of the trade with Russia as a horizontal stacked bar with custom labels.

The value for each region / country is labeled individually along the top and bottom of the bar for each segment, as shown on the right.  Each label is at a custom location along the bar with some on top, some at the bottom.  Most labels include the name of the region and the amount, but others have the name in the label, but the amount in the bar (European Union).

Making this graph as a regular stacked horizontal bar with a legend is very simple and also scalable and extensible to other data.  I used the colors from the graph above, but then added a few other colors to distinguish the segments so they can be identified in the legend.  Click on the graph for a more detailed view.

proc sgplot data=russia noborder nocycleattrs;
  styleattrs datacolors=(%rgbhex(207, 49, 36) %rgbhex(225, 100, 50) 
             gold yellow lightgreen);
  hbarparm  category=cat response=value / group=label groupdisplay=stack 
            outlineattrs=(color=lightgray) 
            baselineattrs=(thickness=0) barwidth=0.5 grouporder=data;
  keylegend / title='' noborder location=inside position=top;
  yaxis display=none  colorbands=odd offsetmin=0.3;
  xaxis display=none;
run;
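The %rgbhex macro used in the STYLEATTRS statement is defined in the full program.  A minimal sketch of such a helper, assuming it simply converts three decimal values in the range 0 to 255 into a SAS CXrrggbb color name, is shown here.  This is a guess at its behavior, not the original definition.

```sas
/* Hypothetical helper: decimal R, G, B (0-255) to a CXrrggbb color name */
%macro rgbhex(r, g, b);
CX%sysfunc(putn(&r, hex2.))%sysfunc(putn(&g, hex2.))%sysfunc(putn(&b, hex2.))
%mend rgbhex;

/* Write the generated color name to the log */
%put %rgbhex(207, 49, 36);
```

With this definition, %rgbhex(207, 49, 36) resolves to CXCF3124, which can be used anywhere a color name is accepted.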

The main reason the original graph is interesting is the attempt to "move"  the  legend entries closer to the bar itself.   The benefit of this  is that the values can be read directly and easily and the graph is easier to decode.  In the legend case, one has to move the eye between the legend and the graph.  First, identify the color of the segment in the bar, then find its value from the legend.  Also, the small green segment for Australia could be missed.

Direct labeling is often useful for decoding a graph, especially where the graph is not too complicated.  But, direct labeling in this case also requires custom code, either annotation or something else.  So, there is a balance to be achieved between the two.

Since I try to avoid annotation as much as possible, first I tried to create this graph using other means with SAS 9.4M2.  Here is what I was able to do with some coding.  My goal is to break up the legend and move each individual value closer to the bar segment itself.  I kept the color swatches to avoid the need for a call-out line to each bar segment.

Clearly, the coding is more elaborate, as I have to place each color marker and the text close to where it needs to go, switching between above and below the bar as shown in the code below.  Some appearance options are trimmed to fit.  See full code in the link below.

proc sgplot data=russia_labels noborder noautolegend nocycleattrs;
  styleattrs datacolors=(%rgbhex(207, 49, 36) %rgbhex(225, 100, 50) 
             gold yellow lightgreen)
             datacontrastcolors=(%rgbhex(207, 49, 36) %rgbhex(225, 100, 50) 
             gold yellow lightgreen)
             datasymbols=(squarefilled);
  hbarparm  category=cat response=value / group=label groupdisplay=stack 
            baselineattrs=(thickness=0) barwidth=0.5 grouporder=data;
 
  scatter x=xlbl1 y=cat / discreteoffset=-0.35 group=label;
  text x=tlbl1 y=cat text=label / discreteoffset=-0.35 position=left 
       contributeoffsets=none splitpolicy=splitalways splitchar='=';
 
  scatter x=xlbl2 y=cat / discreteoffset= 0.35 group=label;
  text x=tlbl2 y=cat text=label / discreteoffset= 0.35 position=left 
       contributeoffsets=none splitpolicy=splitalways splitchar='=';
 
  yaxis display=none  colorbands=odd;
  xaxis display=none;
run;

Note, the code is longer because there are 2 pairs of scatter and text plot statements, one for the labels along the top and one for those at the bottom, because of the different values of DiscreteOffset.  The positions for the markers and the text are computed for each value in the code.  Now, each label and value are effectively moved close to the segment, making the graph easier to decode.

In this exercise, I have used the new TEXT plot statement added with SAS 9.4M2.  This statement is customized to draw text strings in the graph, and has many features for handling text.  We did not want to overload the scatter plot (with MarkerChar).  Going forward, you would be better off using the TEXT plot in place of cases where you used MarkerChar.  For earlier releases, you could use the scatter with MarkerChar or DataLabel to do something similar. This exercise is left to the motivated reader.

Alternatively, one could exactly duplicate the original graph by using SG Annotate to do the labeling, including the call out lines from the text to the segment.  In both cases, the code is heavily customized, and not easily scalable to other data.

I have presented my opinion on the pros and cons of each method.   I would love to hear your opinion too.

SAS 9.4M2 Code:  Russia_3

Scatter Plot with Stacked Histograms

Last week a user expressed the need to create a graph like the one shown on the right using SAS.  This seems eminently doable using GTL and I thought I would undertake making this graph using SAS 9.3.

The source data required to create this graph is only the X-Y information in the scatter plot.  Not having access to the original data in this graph, I simulated some data using random functions in three DO loops, one each for the three groups, in a DATA step.  The groups are 'A', 'B' and 'C', in place of the values like 'Center = 0.29' and so on.  See the full program in the link at the bottom.
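A sketch of such a simulation is shown here.  The data set name scatter and the variables x, y and grp match the code in this post, but the distributions, counts and seed are made-up values for illustration and are not those of the original program.

```sas
data scatter;
  call streaminit(2015);
  length grp $1;
  drop i;

  /* One DO loop per group; the means and spreads are assumed */
  grp = 'A';
  do i = 1 to 150;
    x = rand('normal', 1.0, 0.3);
    y = rand('normal', 1.0, 0.4);
    output;
  end;

  grp = 'B';
  do i = 1 to 150;
    x = rand('normal', 1.5, 0.4);
    y = rand('normal', 1.8, 0.5);
    output;
  end;

  grp = 'C';
  do i = 1 to 150;
    x = rand('normal', 2.2, 0.5);
    y = rand('normal', 2.5, 0.4);
    output;
  end;
run;
```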

The graph on the right can be constructed as a LATTICE of four cells with the following contents.

  • The cell on the bottom left is a regular X-Y scatter plot by group.
  • The cell at the top left is a stacked vertical histogram of counts for the x-bins by group.
  • The cell at the bottom right is a stacked horizontal histogram of counts for the y-bins by group.
  • The cell at the top right contains the legend.

SAS 9.3 SGPLOT or GTL does not have a statement to draw a stacked histogram by group.  So, we have to find another way to do this.  We will use the HighLowPlot plot statement, which shows the group segments where we place them, and also supports a numeric x axis.  We now have to build the data set appropriate for the plot.

The good news is that we can leverage the SGPLOT Histogram statement to generate the bins and counts we need for X and BY=group as follows:

ods _all_ close;
ods output sgplot=xa;
proc sgplot data=scatter(where=(x le 5));
  by grp;
  histogram x / scale=count binstart=0 binwidth=0.25; 
  run;

This program will bin the data by X, with BinStart and BinWidth set as needed.  The output is written to the 'XA' data set.  The SGPLOT procedure generates the required bins and count columns using variable names that are based on the original variables.  You can turn off all destinations, so no graph is actually created but the data set is written out.  You can view the data set to find these new variables.

After this step I cleaned up this data set to create a data set of the xBins and the Counts by Group.  A snippet of the data set is shown on the right.

data xBins;
  set xa(where=(Bin_X_Scale_count_Binstart_0___Y ne .));
  drop x;
  rename Bin_X_Scale_count_Binstart_0___Y=count
         Bin_X_Scale_count_Binstart_0___X=xBin;
run;
proc sort data=xBins out=xBinsByBin;
  by xBin;
run;

Now we have the bins and the counts by group.  We need to stack the values so we can use the HighLowPlot to draw the stacked bins.  The data step shown below does just that, by creating the Low and High values for each group in a bin as stacked on the previous value.

The final data set is shown on the right.  We can plot it using the HighLow plot statement in SGPLOT to create just the horizontal stacked Histogram, to see if we have the right data.  I will save that step for later.

data HighLowX;
  drop count;
  retain Low High;
  set xBinsbyBin;
  by xBin;
  if first.xBin then Low=0;
  High=Low+Count; output;
  Low=High;
run;

We go through the same steps above for creating the binned data for the Y axis.  Then, I merge the original X-Y data with the X and Y bin data sets to get the final data set ready for plotting.  I can plot each graph separately from this merged data set to ensure everything is working correctly.  The xBin, Low and High values are in a block of the data where other columns are missing, and so on.  Here is the graph for just the horizontal stacked histogram.

HighLow_X
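
The combining step is not shown in this post, so here is a minimal sketch of one way it could be done.  Since each set of values sits in its own block with the other columns missing, I am assuming a simple concatenation with SET rather than a match-merge.  The names HighLowY (the Y-axis counterpart of HighLowX, with a yBin column) and ScatterBins are hypothetical.

```sas
/*--Sketch only: HighLowY and ScatterBins are assumed names--*/
data ScatterBins;
  set scatter(keep=x y grp)   /* block 1: raw points; xBin, yBin, Low, High are missing */
      HighLowX                /* block 2: xBin, Low, High by group; x and y are missing */
      HighLowY;               /* block 3: yBin, Low, High by group; x and y are missing */
run;

/*--Check one piece on its own: stacked X bins only--*/
proc sgplot data=ScatterBins;
  highlow x=xBin low=Low high=High / group=grp type=bar;
run;
```

Because each plot statement skips observations where its own variables are missing, the three blocks can live in one data set without interfering with each other.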

The next step is to create a GTL template with a 2x2 layout of cells and common uniform axes.  See the program link at the bottom for the full code.  Here is the layout of the template.

proc template;
  define statgraph Scatter_Layout;
    begingraph;
      entrytitle 'Distribution by Group';
      /*--Outermost Lattice Container--*/
      layout lattice / rows=2 columns=2 rowweights=(0.3 0.7) columnweights=(0.7 0.3)
                       columndatarange=union rowdatarange=union
                       rowgutter=5 columngutter=5;
        /*--Common Row axes--*/
        rowaxes;
          rowaxis / offsetmin=0 display=(ticks tickvalues) griddisplay=on;
          rowaxis / label='Mean of Full Rho' griddisplay=on 
                    linearopts=(tickvaluesequence=(start=0 increment=0.5 end=3.5));
        endrowaxes;
        /*--Common Column axes--*/
        columnaxes;
          columnaxis / label='Ratio of Full Rho' griddisplay=on;
          columnaxis / offsetmin=0 display=(ticks tickvalues) griddisplay=on;
        endcolumnaxes;
 
        /*--Upper Left cell with Stacked X Bins counts by group--*/
        layout overlay;
          highlowplot x=xBin low=low high=high / group=grp type=bar;
        endlayout;
        /*--Upper Right cell with Legend--*/
        layout overlay;
          discretelegend 'a';
        endlayout;
        /*--Lower Left cell with X-Y Scatter Plot--*/
        layout overlay;
          scatterplot x=x y=y / group=grp markerattrs=(symbol=circlefilled size=5) 
                      name='a';
        endlayout;
        /*--Lower Right cell with Stacked Y Bins counts by group--*/
        layout overlay;
          highlowplot y=yBin low=low high=high / group=grp type=bar;
        endlayout;
      endlayout;
    endgraph;
  end;
run;
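
Once the template is compiled, it is rendered with the SGRENDER procedure.  A minimal sketch is shown here; the data set name ScatterBins is a hypothetical name for the combined data set holding the x, y, grp, xBin, yBin, Low and High columns.

```sas
/*--ScatterBins is an assumed name for the combined plot data--*/
proc sgrender data=ScatterBins template=Scatter_Layout;
run;
```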

Here is the Graph.  You can adjust the font sizing for the axes if needed.  Click on graph for a high resolution image.  Note, we are using common external Row and Column axes since these are uniform and should not be repeated.

Scatter_Layout

Full SAS 9.3 code:  Scatter_Layout

Dual Response Axis Graphs

Often we need graphs that display two or more responses by the same category values.  In many cases it is useful to plot both responses on the same response (Y) axis.  This can be helpful to understand the data and compare the magnitudes side by side.  This works when the scales of both the response values are comparable and consistent.

However, the scales for the two responses may not be similar or consistent.  One common use case is when we are visualizing the actual and % changes for some categories as shown in the graph on the right.

For this example, I have run the MEANS procedure to compute the revenues by year for all the customers, and selected only the "Residential" customer for the graph.  I have also computed the change in the values for subsequent years relative to the first year (1994).
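
The data preparation is not shown in this post, so here is a sketch of how it might look.  I am assuming the source is the SASHELP.ELECTRIC sample data set with Customer, Year and Revenue columns, and that "Change" is the revenue divided by the first-year revenue (which gives values of 1.0 and above).  The intermediate name ElecRev is hypothetical.

```sas
/*--Assumption: sashelp.electric stands in for the source data--*/
proc means data=sashelp.electric noprint nway;
  class customer year;
  var revenue;
  output out=ElecRev(drop=_type_ _freq_) sum=revenue;
run;

/*--Assumption: change = revenue relative to the first year per customer--*/
data ElecRevChange;
  set ElecRev;
  by customer;                      /* PROC MEANS output is ordered by the class variables */
  retain base;
  if first.customer then base=revenue;
  change=revenue/base;
  zero=0;                           /* baseline for the HIGHLOW bars */
  format change percent8.;
  drop base;
run;
```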

In the graph above, I have plotted the actual revenues in Billions of $ for Residential customers as a bar chart on the default Y (left) axis.  The "Change" values with a PERCENT format are plotted as a Series plot on the Y2 (right) axis.  I have colored the Y axis ticks and label using a color consistent with the bars, and the Y2 axis ticks and label using a color consistent with the line.  This graph displays all the data correctly, in a way that is easy to comprehend.  Note:  I am actually using HIGHLOW instead of VBAR as it allows me to use a linear axis.

title 'Revenues and Growth over Time for Residential Customer';
proc sgplot data=ElecRevChange(where=(customer='Residential'));
  styleattrs datacolors=(orange orange) datacontrastcolors=(cx8f3f00 darkgreen);
  highlow x=year low=zero high=revenue / name='a' legendlabel='Revenue' type=bar 
          nooutline fillattrs=graphdata1 dataskin=pressed;
  series x=year y=change /  name='b' lineattrs=graphdata2(thickness=5) y2axis;
  xaxis integer display=(nolabel);
  yaxis offsetmin=0 min=0 valueattrs=graphdata1 labelattrs=graphdata1 grid;
  y2axis offsetmin=0 min=0 values=(0 .30 .60 .90 1.20 1.50) valueattrs=graphdata2 
         labelattrs=graphdata2;
  keylegend / linelength=20px;
run;

As we can see in the data table on the right, while the "Change" values are shown with a % format, the values themselves are fractions between 1.0 and 2.0.   The Percent format converts the fractional values into a % number.  So, mixing values with Percent and non-Percent formats on the same axis can result in a bad graph.

The axis format is determined by the "Primary" plot, usually the first plot in the list.  In this case, the revenues are plotted first using bars on the default Y axis.  So, the default format for the Y axis comes from the bar.  If the series plot is also plotted on the same axis, those fractional values will be displayed with a non-percent format, and will barely be visible in comparison with the revenue values, as shown in the graph below on the right.

In the graph on the right, the green line showing change is way down near the baseline.  This is because the response values are all fractional numbers between 1 and 2, and are plotted on the same axis as the revenues with an axis range of 100.

Things get even worse if the plot with the % format is primary, causing the axis format to be %.  Plotting data having a non-percent format on the same axis will cause those values to be scaled by 100.

proc sgplot data=ElecRevChange(where=(customer='Residential'));
  styleattrs datacolors=(orange orange) datacontrastcolors=(cx8f3f00 darkgreen);
  highlow x=year low=zero high=revenue / name='a' legendlabel='Revenue' type=bar 
          nooutline fillattrs=graphdata1 dataskin=pressed;
  series x=year y=change /  name='b' lineattrs=graphdata2(thickness=5);
  xaxis integer display=(nolabel);
  yaxis offsetmin=0 min=0 valueattrs=graphdata1 labelattrs=graphdata1 grid;
  keylegend / linelength=20px;
run;

In such cases, it is best to use a graph with two independent response axes, as shown in the graph at the top of this article.  Now, each axis has data with consistent formats, and life is good.  Note, each axis has its own data range.  In order to have nice grid lines, one has to ensure each axis has an equal number of ticks so the grid lines from one axis can work for both.  Otherwise, you will have two sets of grid lines.

So far so good.  But now let us take the next step.  We want to plot the graph for all customers (Commercial, Industrial and Residential) in a panel.  We still want to see both revenues and change, as in the panel shown on the right.

One would think this would be a simple matter of changing from SGPLOT to SGPANEL, using "customer" as the panel variable.  In general, you would be right, except here we have crossed the 80-20 feature balance between SG and GTL.  Supporting dual response axes for SGPANEL is a much harder task, and something not frequently requested by users.  So, what do we do, and how did I make the graph on the right?

Well, here is where we have to step out of the comfort zone of SG Procedures and move into the domain of GTL.  All of the SG features are implemented using GTL programs behind the scenes.  SGPANEL uses the GTL LAYOUT DATAPANEL and LAYOUT DATALATTICE to create the panels.  GTL does support dual response (and category) axes for panels.  So, now I have used the Layout DataPanel container in GTL, along with the HighLowPlot (with TYPE=BAR) and SeriesPlot statements.  The relevant part of the code is shown below, with most of the options stripped.  As you can see, it is not so hard to follow.  Full code is included in the attached program.

layout datapanel classvars=(customer) / rows=1 headerlabeldisplay=value;
  layout prototype / cycleattrs=true;
    highlowplot x=year low=zero high=revenue / name='a' legendlabel='Revenue' type=bar;
    seriesplot x=year y=change /  name='b' lineattrs=graphdata2(thickness=5) yaxis=y2;
  endlayout;
endlayout;
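
For readers who want to try this without the attached program, here is a minimal sketch of how the layout above could be wrapped in a complete template and rendered.  The template name DualAxisPanel, the title text and the bottom legend are my own additions for illustration, and the axis options from the full program are omitted.

```sas
/*--Sketch only: template name, title and legend placement are assumed--*/
proc template;
  define statgraph DualAxisPanel;
    begingraph;
      entrytitle 'Revenues and Growth over Time by Customer';
      layout datapanel classvars=(customer) / rows=1 headerlabeldisplay=value;
        layout prototype / cycleattrs=true;
          highlowplot x=year low=zero high=revenue / name='a' legendlabel='Revenue' type=bar;
          seriesplot x=year y=change / name='b' lineattrs=graphdata2(thickness=5) yaxis=y2;
        endlayout;
        /*--Legend shared by all cells--*/
        sidebar / align=bottom;
          discretelegend 'a' 'b';
        endsidebar;
      endlayout;
    endgraph;
  end;
run;

proc sgrender data=ElecRevChange template=DualAxisPanel;
run;
```

Note that no WHERE clause is used on the data here, since all customers are wanted, one per cell.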

Dual Axis Graphs:   DualAxis

 

 
