Sometimes you want to label only certain observations in a plot. A common use is to label outliers on a scatter plot.
In the SGPLOT procedure, the DATALABEL= option enables you to specify the name of a variable that is used to label observations. For example, the following scatter plot shows the height versus the weight for all 19 children in the Sashelp.Class data set. The Name variable is used to label the observations:
proc sgplot data=Sashelp.Class;
   scatter x=Height y=Weight / datalabel=Name;
run;
Labeling only certain observations
For small data sets such as this one, you can label each observation without the labels colliding. For data sets with thousands of points, this is no longer possible. Instead, the convention is to label only unusual observations.
The trick is to use the fact that only nonmissing values for the DATALABEL= variable are plotted. Therefore, you can create a new variable called Label. The new variable is a copy of the Name variable, but it contains missing values for all but a handful of observations.
In the Sashelp.Class data, suppose you want to highlight the lightest student (Joyce), the heaviest student (Philip), and a student who has recently been sick (Judy). The following DATA step creates a new variable that contains the names of these three students, but otherwise contains a blank string, which is the SAS missing value for character variables:
data Class;
   set Sashelp.Class;
   Label = Name;
   if Name not in ("Joyce", "Judy", "Philip") then Label = " ";
run;
If you specify the Label variable in the DATALABEL= option of the SCATTER statement, labels are shown only for the three students that you specified:
proc sgplot data=Class;
   scatter x=Height y=Weight / datalabel=Label;
run;
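The same technique works when the observations to label are chosen by a rule instead of by name. The following is a minimal sketch, not part of the original example: the data set name ClassExtreme and the weight cutoffs of 60 and 120 are arbitrary choices made only for illustration.

```sas
/* Sketch: label only students whose weight falls outside a chosen range. */
/* ClassExtreme and the cutoffs 60 and 120 are hypothetical choices.      */
data ClassExtreme;
   set Sashelp.Class;
   Label = Name;                             /* copy of Name, same type and length */
   if 60 <= Weight <= 120 then Label = " ";  /* blank = missing, so no label       */
run;

proc sgplot data=ClassExtreme;
   scatter x=Height y=Weight / datalabel=Label;
run;
```

As in the previous DATA step, Label is first assigned from Name and then blanked out for the observations that should not be labeled, so the variable keeps the same type and length as Name.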
Label observations with numerical values
In the previous example, a character variable is used to label the observations. However, you can also use a numerical variable. All nonmissing values are displayed, so if you want to suppress labels for certain observations, set those label values to missing.
The following DATA step creates 1,000 observations from a bivariate normal distribution and computes the distance from each point to the origin. The goal is to label all points that are more than three units from the origin, so observations that are three or fewer units from the origin are assigned a missing value for the dist variable. The dist variable is used as the DATALABEL= variable:
data a;
   call streaminit(12345);
   do i=1 to 1000;
      x = rand("Normal");
      y = rand("Normal");
      dist = euclid(x,y);
      if dist <= 3.0 then dist = .;
      output;
   end;
run;

proc sgplot data=a;
   scatter x=x y=y / datalabel = dist; /* label by distance */
run;
As you can see, each observation that is more than three units from the origin is labeled by its distance. The SGPLOT procedure automatically applies a default format to the dist variable (apparently BEST6.). You can specify your own format by using a FORMAT statement after the PROC SGPLOT statement. For example, to display the distance to only one decimal place, specify format dist 3.1;
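For completeness, here is how the FORMAT statement might be placed in the full procedure call. This sketch assumes that the data set a from the preceding DATA step already exists:

```sas
/* Assumes data set a (with x, y, and dist) was created by the DATA step above. */
proc sgplot data=a;
   format dist 3.1;                    /* show labels with one decimal place */
   scatter x=x y=y / datalabel=dist;   /* label by distance */
run;
```

Because the FORMAT statement appears inside the procedure step, it applies only to this plot and does not change the format that is stored with the data set.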