The post Estimates of doubling time for exponential growth appeared first on The DO Loop.

This article shows one way to estimate the doubling time by using the most recent data. The method uses linear regression to estimate the slope (m) of a curve, then estimates the doubling time as log(2) / m.

The data in this article are the cumulative counts for COVID-19 cases in four countries (Italy, the United States, Canada, and South Korea) for the dates 03Mar2020 to 27Mar2020. You can download the data and SAS program for this article.

The data set contains four variables:

- The Region variable specifies the country.
- The Day variable specifies the number of days since 03Mar2020.
- The Cumul variable specifies the cumulative counts of confirmed cases of COVID-19.
- The Log10Cumul variable specifies the base-10 logarithm of the confirmed cases. In SAS, you can use the LOG10 function to compute the base-10 logarithm.

You can use PROC SGPLOT to visualize these data. The following graph plots the total number of cases, but uses the TYPE=LOG and LOGBASE=10 options to specify a base-10 logarithmic axis for the counts:

```sas
title "Cumulative Counts (log scale)";
proc sgplot data=Virus;
   where Cumul > 0;
   series x=Day y=Cumul / group=Region curvelabel;
   xaxis grid;
   yaxis type=LOG logbase=10 grid
         values=(100 500 1000 5000 10000 50000 100000) valueshint;
run;
```

This graph is sometimes called a semi-log graph because only one axis is displayed on a log scale. A straight line on a semi-log graph indicates exponential growth. However, not all exponential growth is equal. The slope of the line indicates how quickly the growth is occurring, and the doubling time is one way to measure the growth. A line with a steep slope indicates that the underlying quantity (confirmed cases) will double in a short period of time. A line with a flat slope indicates that the underlying quantity is not growing as quickly and will take a long time to double. For these data, the visualization reveals the following facts:

- The curves for the United States and Canada have steep slopes.
- The curve for South Korea is much flatter, which indicates that the number of confirmed cases is growing very slowly in that country.
- The slope of the curve for Italy looks similar to the US curve for days 0–6, but then the Italy curve begins to flatten. Although the US and Italy had the same number of cases on Day 24, the slope of the Italy curve was less than the slope of the US curve. The interpretation is that (on Day 24) the estimated doubling time for US cases is shorter than for Italy.

Some researchers fit a linear regression to *all* values on the curve in order to estimate an average slope.
This is usually not a good idea because interventions (such as stay-at-home orders) cause the curves to bend over time. This is clearly seen in the curves for Italy and South Korea.

You can get a better estimate for the current rate of growth if you fit a regression line by using only recent data values. I suggest using data from several previous days, such as the previous 5 or 7 days. You can use the REG procedure in SAS to estimate the slope of each line based on the five most recent observations:

```sas
%let MaxDay = 24;
proc reg data=Virus outest=Est noprint;
   where Day >= %eval(&MaxDay-4);   /* previous 5 days: Day 20, 21, 22, 23, and 24 */
   by Region notsorted;
   model Log10Cumul = Day;
quit;
```

The Est data set contains estimates of the slope (and intercept) for the line that best fits the recent data. The estimates are shown in a subsequent section.

You can use these estimated slopes to estimate the doubling time for each curve.
If a quantity Y increases from Y_{0} at time t_{0} to 2*Y_{0} at some future time t_{0} + Δt, the value Δt is the doubling time. The next paragraph shows that the doubling time at t_{0} is log(2) / m, where m is an estimate of the slope at t_{0}.

The idea is to use the tangent line at t_{0} to estimate the doubling time.
Let log(Y) = m*t + b
be the equation of the tangent line at t_{0} on the semi-log graph.
When Y increases from Y_{0} to 2*Y_{0}, the logarithm increases from
log(Y_{0}) to log(2*Y_{0}) = log(2)+log(Y_{0}).
Since the slope is "rise over run," the tangent line reaches the doubled value when

m = [(log(2) + log(Y_{0})) - log(Y_{0})] / Δt = log(2) / Δt.

Solving for Δt gives

**Δt = log(2) / m**,

where m is the slope of a regression line for the semi-log curve at t_{0}. This formula estimates the doubling time, which *does not depend on the value of Y*, only on the slope at t_{0}.
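As a numeric sanity check of this formula, the following sketch (written in Python rather than the article's SAS, using synthetic data that doubles exactly every 3 days) fits a least-squares line to the base-10 log of the counts and recovers the 3-day doubling time from log10(2)/m:

```python
import math

def doubling_time(days, counts):
    """Fit a least-squares line to log10(counts) vs. day;
    return the doubling time log10(2) / slope."""
    logs = [math.log10(c) for c in counts]
    n = len(days)
    xbar = sum(days) / n
    ybar = sum(logs) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(days, logs)) \
          / sum((x - xbar) ** 2 for x in days)
    return math.log10(2) / slope

# synthetic cumulative counts that double every 3 days: Y = 100 * 2**(t/3)
days = [20, 21, 22, 23, 24]
counts = [100 * 2 ** (t / 3) for t in days]
print(round(doubling_time(days, counts), 3))  # 3.0
```

Because the slope of the log10 curve already carries the base-10 logarithm, the same base must be used in the numerator; any consistent base gives the same answer.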

The following SAS DATA step estimates the doubling time by using the slope estimates at the end of each curve (Day 24):

```sas
data DoublingTime;
   set Est(rename=(Day=Slope));
   label Slope = "Slope of Line" DoublingTime = "Doubling Time";
   DoublingTime = log10(2) / Slope;
   keep Region Intercept Slope DoublingTime;
run;

proc print data=DoublingTime label noobs;
   format Slope DoublingTime 6.3;
   var Region Slope DoublingTime;
run;
```

For a pandemic, short doubling times are bad and long doubling times are good. Based on the 27Mar2020 data, the table estimates the doubling time for Italy to be 9 days. In contrast, the estimate for the US doubling time is about 3.3 days, and the estimate for Canada is about 2.5 days. The estimate for South Korea is 67 days, but for such a long time period the assumption that "the situation stays the same" is surely not valid.

You can visualize the doubling times by adding an arrow to the end of each curve:

- The base of the arrow is located at the most recent data.
- The direction of the arrow is determined by the estimated slope of the curve.
- The horizontal extent of the arrow is the doubling time.
- The vertical extent of the arrow is twice the current count.

The arrows are shown in the following visualization, which excludes South Korea:

This graph shows how quickly the US and Canadian counts are predicted to double. The tip of each arrow indicates the time at which the number of cases is predicted to double. For the US and Canada, this is only a few days. The arrow for Italy indicates a longer time before the Italian cases double. Again, these calculations assume that the number of cases continues to grow at the estimated rate on Day 24. Because the curve for Italy appears to be flattening, the Italian estimate will (hopefully) be overly pessimistic.

In summary, you can use the slope of a cumulative curve (on a log scale) to estimate the doubling time for the underlying quantity. To find the slope at the most recent observation, you can fit a linear regression line to recent data. The doubling time is given by log(2)/m, where m is the estimate of the slope of the cumulative curve in a semi-log graph. If you want to visualize the doubling time on the graph, you can add an arrow to the end of each curve.


The post Smokestack plots: A visualization technique for comparing cumulative curves appeared first on The DO Loop.

- Total sales of songs, movies, or books, beginning when the item is released.
- Total views of blog posts, beginning when the post is published.
- Total cases of a disease for different countries, beginning when the disease passes some threshold (such as the 100th case) in each country.

If you plot multiple cumulative curves on the same graph, you can compare the different songs, blog posts, or countries. How do the sales of one song compare with another song that was released last week? How does the spread of a disease in one country compare with the spread in a different country that experienced the same disease months ago?

Robert Alison noticed that when you plot multiple cumulative curves on the same graph, the graph looks like a series of tall smokestacks, each billowing smoke. You can imagine a breeze is blowing from the left, which makes the smoke drift to the right. Therefore, he named this graph a "smokestack plot." A smokestack plot overlays several cumulative curves, which might start at different points in time. An example of a smokestack plot is shown to the right. (Click to enlarge.)

This article shows how to use SAS to create a smokestack plot like this one. It discusses how to get data in a suitable form, how to generate the cumulative quantities, and how to plot the cumulative curves. To illustrate the process, this article uses views of blog posts during early 2020.

You can download the data and SAS program for this article.

The data for a smokestack plot should be in "long form." That is, the data should contain three variables:

- An ID variable that identifies each subject. For these data, each post has a unique ID number.
- A date or time variable. For these data, the DATE variable identifies days that the blog post was viewed, beginning with the day it was published.
- A variable that records the number of events for each subject at each time point. For these data, the VALUE variable records the number of views for each blog post for each day.

The data should be sorted by the time variable within each value of the ID variable.

If you are interested in the frequency of views, you can plot the number of views versus time to obtain a frequency plot. For example, the following plot shows the number of views for each day for the blog post with ID=25559, which was published on 13JAN2020.

The frequency plot shows that the post received almost 100 views when it was first published, but the daily views quickly fell to about 10 per day until the end of January, when the blog post was mentioned in a newsletter. That caused a spike in views. The spike eventually faded, and the post settled into a steady state in which it receives a few (3 or 4) views per day. (Diseases and songs also experience spikes. A disease might spike when newly infected persons enter an area. Songs spike if they are featured in a movie or TV show.)

Other blog posts follow a similar trajectory. There is an initial spike in views after publication, and the daily views eventually settle into a steady state. For some posts, the daily views remain relatively high even long after they are published. Others attract only a few views per day after the initial spike.

You can use a smokestack plot of the cumulative views to monitor the steady-state behavior and to compare different subjects.

The first step is to compute the cumulative quantities for each ID. Assuming that the data are sorted by the Date variable within each value of the ID variable, the following DATA step computes the cumulative number of views for each value of the ID variable.

```sas
/* Compute cumulative quantities */
data Cumul;
   set Have;            /* assume sorted by Date for each ID value */
   by ID notsorted;     /* use NOTSORTED option if not sorted by ID */
   if first.ID then do;
      Cumul = 0;        /* initialize the cumulative count */
      output;           /* optional: to display 0, include the OUTPUT statement */
   end;
   Cumul + Value;       /* increment the cumulative count */
   output;
run;
```
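The logic of this DATA step (reset the running total at the start of each BY group, then accumulate) is language-independent. Here is a minimal sketch of the same computation in Python; the (ID, Date, Value) tuples are hypothetical sample data, not the blog-view data used in the article:

```python
from itertools import groupby
from operator import itemgetter

# "long form" records: (ID, Date, Value), sorted by Date within each ID
have = [(1, 1, 5), (1, 2, 3), (1, 3, 2),
        (2, 1, 7), (2, 2, 1)]

cumul = []
for post_id, rows in groupby(have, key=itemgetter(0)):
    total = 0                      # reset at the start of each ID group
    for _, date, value in rows:
        total += value             # increment the cumulative count
        cumul.append((post_id, date, total))

print(cumul)  # [(1, 1, 5), (1, 2, 8), (1, 3, 10), (2, 1, 7), (2, 2, 8)]
```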

You can now plot the cumulative quantity versus time for each ID value. In SAS, you can use the SERIES statement and specify the GROUP= option to obtain a different curve for each ID value:

```sas
ods graphics / reset;
title "Smokestack Plot";
title2 "Overlay Multiple Cumulative Curves";
proc sgplot data=Cumul;
   series x=Date y=Cumul / group=ID lineattrs=(pattern=solid) curvelabel;
   xaxis grid display=(nolabel);
   yaxis grid label="Cumulative Views";
run;
```

The graph is shown at the top of this article. The graph shows that the posts are published at regular intervals (every Monday). They initially get a flurry of views, but after a week or two the views taper off to a rate that is different for each post. The slope of the cumulative curve indicates the rate at which a blog gets new views.

The blog posts in this data set get about 100 views on the first day. That's the "smokestack," which rises vertically. After that, new views look like smoke. During the first few days, the smoke rises quickly, which corresponds to a greater-than-average number of views. As time goes on, there are fewer views, so the trails of smoke become less steep. The long-term views are mostly generated from internet search engines at a post-specific rate. You can see in the graph that some slopes (smoke trails) are steeper than others.

On occasion, a post will be referenced in a newsletter, viral tweet, or subsequent blog post, which can result in a secondary spike in views. This is seen for the post that has ID=25559. In the smokestack plot, this looks like an updraft that lifts the smoke trail dramatically upward before it once again settles into a (possibly new) long-term slope.

You can use a variant of the smokestack plot to compare multiple cumulative curves as if they all started on the same day. To do that, you can create a new variable (DAY), which is the number of days since publication. You can use that variable to plot the cumulative curves with a common origin, as follows:

```sas
data Origin;
   retain BaseDate;
   set Cumul;                /* start with the cumulative data */
   by ID notsorted;          /* use the NOTSORTED option if not sorted by ID */
   if first.ID then
      BaseDate = Date;       /* remember the first day for this ID */
   Day = Date - BaseDate;    /* days since first day */
   drop BaseDate;
run;

title2 "Cumulative Curves with a Common Origin";
proc sgplot data=Origin;
   series x=Day y=Cumul / group=ID lineattrs=(pattern=solid) curvelabel;
   xaxis grid label="Days Since Publication" values=(0 to 100 by 10) valueshint;
   yaxis grid label="Cumulative Views" values=(0 to 1000 by 100) valueshint;
run;
```
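The common-origin transformation is simply "date minus the first date for this ID." As a sketch of the idea (in Python, with hypothetical tuples rather than the article's data):

```python
# cumulative records: (ID, Date, Cumul), sorted by Date within each ID
cumul = [(1, 10, 5), (1, 11, 8), (2, 13, 7), (2, 15, 9)]

origin = []
base = {}
for post_id, date, total in cumul:
    base.setdefault(post_id, date)       # remember the first date per ID
    origin.append((post_id, date - base[post_id], total))

print(origin)  # [(1, 0, 5), (1, 1, 8), (2, 0, 7), (2, 2, 9)]
```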

This graph enables you to compare the number of views for each post *n* days after publication. Notice that the older posts have longer smoke trails, whereas the more recent posts have shorter trails. The "common origin" smokestack plot provides several informative facts:

- ID=25559 (dark blue, top) was popular even before it got a big "updraft" about 15 days after publication.
- Although ID=25609 (dark red, middle) and ID=25079 (brown, middle) have fewer cumulative views than ID=25559, you can see that their long-term slopes are steeper, which indicates that they get more daily views. If these trends hold, they may overtake ID=25559 in total views after a few more months.
- Although ID=25779 (olive green) and ID=25753 (magenta) started out somewhat strong, they are now generating only 1-2 views per day. In contrast, ID=25672 (teal, bottom) started out slowly but generates 4-5 views every day, so it might eventually surpass those other two posts in total views.

You can make other choices for a common origin. For example, if some posts have a slower start than others, you could let "Day 0" be the first day on which each post surpasses 200 views.

I intentionally omitted blog posts that have more than 1000 cumulative views because I wanted to be able to see the trajectories of the less popular items. However, in your data, you might have one or two subjects whose cumulative quantities are an order of magnitude larger than the others. In this case, you might want to plot the cumulative curves on a logarithmic scale, which is equivalent to plotting the logarithm of the cumulative counts.

The SGPLOT procedure makes it easy to plot data on the log scale: simply add TYPE=LOG to either the YAXIS or XAXIS statement.
By default, the scale is the natural logarithm (base e), which is appropriate for most scientific applications. For social science and business applications, you can use LOGBASE=10 to plot log_{10} of the data. The data for a log-scale graph must be strictly positive, so the following call to PROC SGPLOT uses a WHERE statement (Cumul > 0) to drop any zeros in the cumulative counts. You can use a log scale for the original smokestack plot or for the version that has a common origin:

```sas
title2 "Cumulative Curves (log scale)";
proc sgplot data=Origin;
   where Cumul > 0;
   series x=Day y=Cumul / group=ID lineattrs=(pattern=solid) curvelabel;
   xaxis grid label="Days Since Publication" values=(0 to 100 by 10) valueshint;
   yaxis grid type=log logbase=10   /* LOG10 transformation of axis */
         label="Cumulative Views" values=(100 to 1000 by 100) valueshint;
run;
```

As I said, a log scale enables you to see very large values (such as 10,000 or more) and very small values (such as 100 or less) in one plot. For these data, the main advantage of a log scale is that you can more easily see values for the curves near Day=0, where the cumulative counts are small (less than 200).

This article shows how to use SAS to create a smokestack plot, which is a graph that overlays multiple cumulative curves. Often, the cumulative curves will start at different points in time. However, you can translate the curves to a common origin by defining a new variable such as "the number of days since publication." This enables you to directly compare two curves at the same point in their evolution. Lastly, if one curve is much taller than the others, you can use a log-scale axis to visualize all curves on a common scale.

The smokestack plot can be used for many different applications. During the coronavirus pandemic of 2020, smokestack plots were a common way to compare the cumulative totals of infected persons for different countries.


The post How to read a cumulative frequency graph appeared first on The DO Loop.

- New cases for each day (or week). This information is usually shown as a histogram or needle plot. The graph is sometimes called a *frequency graph*.
- The *total* number of cases plotted against time. Usually, this graph is a line graph. The graph is sometimes called a *cumulative frequency graph*.

An example of each graph is shown above. The two graphs are related and actually contain the same information. However, the cumulative frequency graph is less familiar and is harder to interpret. This article discusses how to read a cumulative frequency graph. The shape of the cumulative curve indicates whether the daily number of cases is increasing, decreasing, or staying the same.

For this article, I created five examples that show the spread of a hypothetical disease. The numbers used in this article do not reflect any specific disease or outbreak.

When the underlying quantity is nonnegative (as for new cases of a disease), the cumulative curve never decreases. It either increases (when new cases are reported) or it stays the same (if no new cases are reported).

When the underlying quantity (new cases) is a count, the cumulative curve is technically a step function, but it is usually shown as a continuous curve by connecting each day's cumulative total. A cumulative curve for many days (more than 40) often looks smooth, so you can describe its shape by using the following descriptive terms:

- When the number of new cases is increasing, the cumulative curve is "concave up." In general, a concave-up curve is U-shaped, like this: ∪. Because a cumulative frequency curve is nondecreasing, it looks like the right side of the ∪ symbol.
- When the number of new cases is staying the same, the cumulative curve is linear. The slope of the curve indicates the number of new cases.
- When the number of new cases is decreasing, the cumulative curve is "concave down." In general, a concave-down curve looks like an upside-down U, like this: ∩. Because a cumulative frequency curve is nondecreasing, a concave-down curve looks like the left side of the ∩ symbol.
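These rules follow from the fact that the second difference of the cumulative counts equals the day-to-day change in new cases. The following sketch (in Python, with made-up daily counts, not data from the article) classifies the shape of a cumulative curve accordingly:

```python
from itertools import accumulate

def shape(new_cases):
    """Classify the cumulative curve's shape from daily counts.
    The second difference of the cumulative counts equals the
    change in the daily counts."""
    cum = list(accumulate(new_cases))
    d2 = [cum[i+1] - 2*cum[i] + cum[i-1] for i in range(1, len(cum) - 1)]
    if all(d > 0 for d in d2):
        return "concave up"      # new cases increasing
    if all(d < 0 for d in d2):
        return "concave down"    # new cases decreasing
    if all(d == 0 for d in d2):
        return "linear"          # new cases constant
    return "mixed"

print(shape([1, 2, 4, 8]))   # concave up
print(shape([5, 5, 5, 5]))   # linear
print(shape([8, 4, 2, 1]))   # concave down
```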

A typical cumulative curve is somewhat S-shaped, as shown to the right. The initial portion of the curve (the red region) is concave up, which indicates that the number of new cases is increasing. The cumulative curve is nearly linear between Days 35 and 68 (the yellow region), which indicates that the number of new cases each day is approximately constant. After Day 68, the cumulative curve is concave down, which indicates that the number of daily cases is decreasing. Each interval can be short, long, or nonexistent.

The cumulative curve looks flat near Day 100. When the cumulative curve is exactly horizontal (zero slope), it indicates that there are no new cases.

Sometimes you might see a related graph that displays the *logarithm* of the cumulative cases. Near the end of this article, I briefly discuss how to interpret a log-scale graph.

It is useful to look at the shape of the cumulative frequency curve for five different hypothetical scenarios. This section shows the cases-per-day frequency graphs; the cumulative frequency curves are shown in subsequent sections.

In each scenario, a population experiences a total of 1,000 cases of a disease over a 100-day time period. For the sake of discussion, suppose that the health care system can treat up to 20 new cases per day. The graphs to the right indicate that some scenarios will overwhelm the health care system whereas others will not. The five scenarios are:

- **Constant rate of new cases:** In the top graph, the community experiences about 10 new cases per day for each of the 100 days. Because the number of cases per day is small, the health care system can treat all the infected cases.
- **Early peak:** In the second graph, the number of new cases quickly rises for 10 days before gradually declining over the next 50 days. Because more than 20 new cases develop on Days 5–25, the health care system is overwhelmed on those days.
- **Delayed peak:** In the third graph, the number of new cases gradually rises, levels out, and gradually declines. There are only a few days on which the number of new cases exceeds the capacity of the health care system. Epidemiologists call this scenario "flattening the curve" of the previous scenario. By practicing good hygiene and avoiding social interactions, a community can delay the spread of a disease.
- **Secondary outbreak:** In the fourth graph, the first outbreak is essentially resolved when a second outbreak appears. This can happen, for example, if a new infected person enters a community after the first outbreak ends. To prevent this scenario, public health officials might impose travel bans on certain communities.
- **Exponential growth:** In the fifth graph, the number of new cases increases exponentially. The health care system is eventually overwhelmed, and the graph does not indicate when the spread of the disease might end.

The graphs in this section are frequency graphs. The next sections show and interpret a cumulative frequency graph for each scenario.

In the first scenario, new cases appear at a constant rate. Consequently, the cumulative frequency chart looks like a straight line. The slope of the line is the rate at which new cases appear. For example, in this scenario, the number of new cases each day is approximately 10. Consequently, the cumulative curve has an average slope ("rise over run") that is close to 10.

In the second scenario, new cases appear very quickly at first, then gradually decline. Consequently, the first portion of the cumulative curve is concave up and the second portion is concave down. In this scenario, the number of new cases dwindles to zero, as indicated by the nearly horizontal cumulative curve.

At any moment in time, you can use the slope of the cumulative curve to estimate the number of new cases that are occurring at that moment. Days when the slope of the cumulative curve is large (such as Day 10) correspond to days on which many new cases are reported. Where the cumulative curve is horizontal (zero slope, such as after Day 60), there are very few new cases.

In the third scenario, new cases appear gradually, level out, and then decline. This is reflected in the cumulative curve. Initially, the cumulative curve is concave up. It then straightens out and appears linear for 10–15 days. Finally, it turns concave down, which indicates that the number of new cases is trending down. Near the end of the 100-day period, the cumulative curve is nearly horizontal because very few new cases are being reported.

In the fourth scenario, there are two outbreaks. During the first outbreak, the cumulative curve is concave up or down as the new cases increase or decrease, respectively. The cumulative curve is nearly horizontal near Day 50, but then goes through a smaller concave up/down cycle as the second outbreak appears. Near the end of the 100-day period, the cumulative curve is once again nearly horizontal as the second wave ends.

The fifth scenario demonstrates exponential growth. Initially, the number of new cases increases very gradually, as indicated by the small slope of the cumulative frequency curve. However, between Days 60–70, the number of new cases begins to increase dramatically. The lower and upper curves are both increasing at an exponential rate, but the scale of the vertical axis for the cumulative curve (upper graph) is much greater than for the graph of new cases (lower graph). This type of growth is more likely in a population that does not use quarantines and "social distancing" to reduce the spread of new cases.

This last example demonstrates why it is important to label the vertical axis. At first glance, the upper and lower graphs look very similar. Both exhibit exponential growth. One way to tell them apart is to remember that a cumulative frequency graph never decreases. In contrast, if you look closely at the lower graph, you can see that some bars (Days 71 and 91) are shorter than the previous day's bar.

The previous analysis assumes that the vertical axis plots the cumulative counts on a linear scale. Scientific articles might display the logarithm of the total counts. The graph is on a log scale if the vertical axis says "log scale" or if the tick values are powers of 10 such as 10, 100, 1000, and so forth. If the graph uses a log scale:

- A straight line indicates that new cases are increasing at an exponential rate. The slope of the line indicates how quickly cases will double, with steep lines indicating a short doubling time.
- A concave-down curve indicates that new cases are increasing at a rate that is less than exponential. Log-scale graphs make it difficult to distinguish between a slowly increasing rate and a decreasing rate.

In summary, this article shows how to interpret a cumulative frequency graph. A cumulative frequency graph is provided for five scenarios that describe the spread of a hypothetical disease. In each scenario, the shape of the cumulative frequency graph indicates how the disease is spreading:

- When the cumulative curve is **concave up**, the number of new cases is **increasing**.
- When the cumulative curve is **linear**, the number of new cases is **not changing**.
- When the cumulative curve is **concave down**, the number of new cases is **decreasing**.
- When the cumulative curve is **horizontal**, there are **no new cases** being reported.

Although the application in this article is the spread of a fictitious disease, the ideas apply widely. Anytime you see a graph of a cumulative quantity (sales, units produced, number of traffic accidents, ...), you can use the ideas in this article to interpret the cumulative frequency graph and use its shape to infer the trends in the underlying quantity. Statisticians use these ideas to relate a cumulative distribution function (CDF) for a continuous random variable to its probability density function (PDF).


The post Add custom tick marks to a SAS graph appeared first on The DO Loop.

- Specify evenly spaced values.
- Specify tick values that are not evenly spaced.
- Specify custom labels for the ticks, including symbols such as π.

You can accomplish these tasks by using options on the XAXIS and YAXIS statements. The VALUES= option enables you to specify tick locations; the VALUESDISPLAY= option enables you to specify text strings to display on an axis.

You can use the VALUES= option on the XAXIS or YAXIS statement to specify a set of values at which to display ticks. If you want to specify a set of evenly spaced values, you can use the `(start TO stop BY increment)` syntax. For example, the following call to PROC SGPLOT specifies that the ticks on the Y axis should be placed at the locations -0.5, 0, 0.5, 1, ..., 3.

```sas
/* create data for the graph of the function y = x*exp(x) */
data Function;
   do x = -5 to 1 by 0.05;
      y = x*exp(x);
      output;
   end;
run;

ods graphics / width=480px height=360px;
title "Graph of y = x*exp(x)";
proc sgplot data=Function;
   series x=x y=y;
   yaxis grid values=(-0.5 to 3 by 0.5);   /* specify tick locations */
   xaxis grid;
run;
```

Notice that the Y axis has tick marks at the values that are specified by the VALUES= option.

Sometimes you might want to highlight a specific value on a graph. For example, you might want to indicate where a graph has a maximum, minimum, or inflection point. The graph in the previous section has a minimum value at (x,y) = (-1, -1/e) ≈ (-1, -0.368). You can use the VALUES= option to specify a set of tick locations that includes -0.368. Because nobody who reads the graph will associate -0.368 with the number -1/e, you might want to display the text string "-1/e" on the axis instead of -0.368.
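You can confirm the location of the minimum numerically. This quick check (a Python sketch, separate from the article's SAS program) scans the same interval [-5, 1] used to create the graph:

```python
import math

# scan y = x*exp(x) on [-5, 1]; calculus gives a minimum at x = -1, y = -1/e
xs = [-5 + 0.0001 * i for i in range(60001)]
x_min = min(xs, key=lambda x: x * math.exp(x))
y_min = x_min * math.exp(x_min)
print(round(x_min, 3), round(y_min, 3))  # -1.0 -0.368
```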

The VALUESDISPLAY= option enables you to specify a text string for each value in the VALUES= list. The strings are used as labels for each tick mark. For example, the following call to PROC SGPLOT uses the VALUESDISPLAY= option to display the string "-1/e" at the location y = -0.368. To further emphasize that the value corresponds to the minimum value of the function, I use the DROPLINE statement to display line segments that extend from the point (-1, -0.368) to each axis.

```sas
%let eInv = %sysfunc(exp(-1));   /* 1/e = exp(-1) */
title "Graph of y = x*exp(x)";
proc sgplot data=Function;
   series x=x y=y;
   dropline x=-1 y=-&eInv / dropto=both lineattrs=(color=coral);
   /* draw coordinate axes in dark grey */
   refline 0 / axis=y lineattrs=(color=darkgrey);
   refline 0 / axis=x lineattrs=(color=darkgrey);
   /* ticks on the Y axis include -1/e */
   yaxis grid max=1.5
         values = (-1 -&eInv 0 0.5 1 1.5)                   /* locations */
         valuesdisplay = ("-1" "-1/e" "0" "0.5" "1" "1.5"); /* strings to display */
   xaxis grid;
run;
```

The previous example displays the string "-1/e" at a tick mark. You can also display symbols on the axes. Back in 2011, Dan Heath discussed how to specify symbols by using Unicode values. Since SAS 9.4M3, you can also use Unicode symbols in user-defined formats. The following example defines the Unicode symbol for pi (π) and uses the symbol in the VALUESDISPLAY= option to label tick marks at π/2, π, 3π/2, and 2π.

```sas
/* create the graph of y = sin(x) on [0, 2*pi] */
data Trig;
   pi = constant('pi');
   drop pi;
   do x = 0 to 2*pi by pi/25;
      y = sin(x);
      output;
   end;
run;

%let piChar = (*ESC*){Unicode pi};   /* define pi symbol */
title "Graph of y = sin(x)";
proc sgplot data=Trig noautolegend;
   series x=x y=y;
   refline 0 / axis=y lineattrs=(color=darkgrey);
   xaxis grid values = (0 1.57 3.14 4.71 6.28)   /* 0, pi/2, pi, 3pi/2, 2pi */
         valuesdisplay = ("0" "&piChar/2" "&piChar" "3&piChar/2" "2&piChar");
run;
```

In this example, I used the VALUES= and VALUESDISPLAY= option on the XAXIS statement. Consequently, the X axis displays the symbols π/2, π, 3π/2, and 2π.

In summary, you can use the VALUES= option to specify a list of locations for ticks on an axis.
You can use the `(start TO stop BY increment)` notation to specify evenly spaced ticks.
For unevenly spaced ticks, you can list the ticks manually, such as `(1 5 10 25 50 100)`.
If you want to display text instead of numbers, you can use the VALUESDISPLAY= option, which enables you to specify arbitrary strings. You can use Unicode symbols in the VALUESDISPLAY= list.


The post Print SAS/IML variables with formats appeared first on The DO Loop.

This article shows four ways to print SAS/IML vectors in a table in which each variable has a different format. The methods are:

- Use the SAS/IML PRINT statement and assign a format to each column. Optionally, you can also assign a label to each variable.
- Create a SAS/IML table and use the TablePrint statement. You can use the TableSetVarFormat subroutine to assign formats to columns.
- If you want to format the variables according to a SAS/STAT template, you can use the TablePrint subroutine and specify the name of the template.
- Write the vectors to a data set and use PROC PRINT.

Suppose you have computed a linear regression in the SAS/IML language. You have six vectors (each with three observations) that you want to print in a table that resembles the ParameterEstimates table that is produced by PROC REG and other SAS/STAT procedures. Here are the vectors:

proc iml;
factors = {'Intercept', 'MPG_Highway', 'Weight'};  /* Variable */
df  = {1, 1, 1};                            /* DF = degrees of freedom */
est = {-3.73649, 0.87086, 0.00011747};      /* Estimate */
SE  = {1.25091, 0.02446, 0.0001851};        /* StdErr = standard error */
t   = {-2.99, 35.6, 0.63};                  /* tValue = value of the t test that the parameter is zero */
pvalue = {0.0029803, 1.36E-129, 0.525902};  /* Probt = two-sided p-value for hypothesis test */

The following PRINT statement displays these vectors:

`PRINT factors df est SE t pvalue;`

However, there are two problems with the output:

- The column headers are the names of the variables. You might want to display more meaningful column headers.
- By default, each vector is printed by using the BEST9. format. However, it can be desirable to print each vector in a different format.

For this table, the programmer wanted to use the same formats that PROC REG uses for the ParameterEstimates table, which are as follows:

- Degrees of freedom (DF) uses the 2.0 format.
- Parameter estimates (Estimate) uses the D11.3 format. (Learn more about the D*w.d* format.)
- Standard errors (StdErr) uses the D11.3 format.
- The t statistic (tValue) uses the 7.2 format.
- The p-value (Probt) uses the PVALUE6.4 format. (Learn more about the PVALUE*w.d* format.)

IML programmers would use the PRINT statement to print all six columns in a single table. You can use the LABEL= option (or L=, for short) on the PRINT statement to specify the column header. You can use the FORMAT= option (or F=, for short) to specify the format. Thus, you can use the following PRINT statement to create a table in which each column uses its own format:

/* 1: Use the PRINT statement and assign a format to each column.
      You cannot get a spanning header when you print individual columns. */
print "-------- Parameter Estimates --------";
print factors[L='Variable'] df[L='DF' F=2.0]
      est[L='Estimate' F=D11.3] SE[L='StdErr' F=D11.3]
      t[L='tValue' F=7.2] pvalue[L='Probt' F=PVALUE6.4];

The result is acceptable. The PRINT statement enables you to apply formats to each column. Unfortunately, there is no way to put a spanning header that says "Parameter Estimates" across the top of the table. This example uses a separate PRINT statement to emulate a header.

If you have SAS/IML 14.2 or later (or the iml action in SAS Viya 3.5 or later), you can put the variables into a SAS/IML table and use the TablePrint statement to display the table. The IML language supports the TableSetVarFormat subroutine, which you can use to set the formats for individual columns, as shown in the following statements:

/* 2: Put variables into a table and use the TABLEPRINT statement */
Tbl = TableCreate('Variable', factors);
call TableAddVar(Tbl, {'DF' 'Estimate' 'StdErr' 'tValue' 'Probt'},
                 df || est || SE || t || pvalue);
call TableSetVarFormat(Tbl, {'DF' 'Estimate' 'StdErr' 'tValue' 'Probt'},
                       {'2.0' 'D11.3' 'D11.3' '7.2' 'PVALUE6.4'});
call TablePrint(Tbl) ID='Variable' label='Parameter Estimates';

You can see that the TablePrint routine does an excellent job printing the table. You can use the ID= option to specify that the Variable column should be used as row headers. The LABEL= option puts a spanning header across the top of the table.

In the previous section, I used the TableSetVarFormat function to manually apply formats to specific columns. However, if your goal is to apply the same formats that are used by a SAS/STAT procedure, there is an easier way. You can use the TEMPLATE= option on the TablePrint subroutine to specify the name of an ODS template. You need to make sure that the names of variables in the table are the same as the names that are used in the template, but if you do that then the template will automatically format the variables and display a spanning header for the table.

/* 3: An easier way to duplicate the ParameterEstimates table from a
      SAS/STAT procedure: use the TEMPLATE= option. (Make sure the names
      of the variables are what the template expects.) */
Tbl = TableCreate('Variable', factors);
call TableAddVar(Tbl, {'DF' 'Estimate' 'StdErr' 'tValue' 'Probt'},
                 df || est || SE || t || pvalue);
call TablePrint(Tbl) label='Parameter Estimates'
                     template="Stat.REG.ParameterEstimates";

Remember: You can use ODS TRACE ON or the SAS/STAT documentation to find the name of ODS tables that a procedure creates.

If none of the previous methods satisfy your needs, you can always write the data to a SAS data set and then use Base SAS methods to display the table. For example, the following statements create a SAS data set and use PROC PRINT to display the table. The LABEL statement is used to specify the column headers. The FORMAT statement is used to apply formats to the columns:

/* 4: Write a SAS data set and use PROC PRINT */
create PE var {'factors' 'df' 'est' 'SE' 't' 'pvalue'};
append;
close;
QUIT;

proc print data=PE noobs label;
   label factors='Variable' df='DF' est='Estimate'
         SE='StdErr' t='tValue' pvalue='Probt';
   format df 2.0 est SE D11.3 t 7.2 pvalue PVALUE6.4;
   var factors df est SE t pvalue;
run;
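If you ever need to reproduce this kind of table outside SAS, the same idea (attach a format to each column at print time) exists in other table libraries. The following pandas sketch is mine, not part of the original example: the `formatters` argument of `DataFrame.to_string` plays the role of the FORMAT statement, and the PVALUE-style formatter is a rough approximation of the PVALUE6.4 format.

```python
import pandas as pd

# the same parameter estimates that the IML vectors contain
pe = pd.DataFrame({
    "Variable": ["Intercept", "MPG_Highway", "Weight"],
    "DF":       [1, 1, 1],
    "Estimate": [-3.73649, 0.87086, 0.00011747],
    "StdErr":   [1.25091, 0.02446, 0.0001851],
    "tValue":   [-2.99, 35.6, 0.63],
    "Probt":    [0.0029803, 1.36e-129, 0.525902],
})

# per-column formatters, loosely mimicking the SAS formats
fmt = {
    "DF":       lambda v: f"{v:2.0f}",       # like the 2.0 format
    "Estimate": lambda v: f"{v:11.3e}",      # roughly like D11.3
    "StdErr":   lambda v: f"{v:11.3e}",
    "tValue":   lambda v: f"{v:7.2f}",       # like the 7.2 format
    "Probt":    lambda v: "<.0001" if v < 1e-4 else f"{v:6.4f}",  # like PVALUE6.4
}
print(pe.to_string(index=False, formatters=fmt))
```

As in the PROC PRINT example, the formats are applied only when the table is displayed; the underlying values keep full precision.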

To learn more about SAS/IML tables and other non-matrix data structures, see Wicklin (2017), "More Than Matrices: SAS/IML Software Supports New Data Structures."


The post Predict a random integer: The tradeoff between bias and variance appeared first on The DO Loop.

I will start by briefly summarizing Jeske's result, which uses probability theory to derive the best biased and unbiased estimators. I then present a SAS program that simulates the problem and compares two estimators, one biased and one unbiased.

Suppose a gambler asks you to predict the next roll of a six-sided die. He will reward you based on how close your guess is to the actual value he rolls. No matter what number you pick, you only have a 1/6 chance of being correct. But if the strategy is to be *close to* the value rolled, you can compute the expected value of the six faces, which is 3.5. Assuming that the gambler doesn't let you guess 3.5 (which is not a valid outcome), one good strategy is to round the expected value to the nearest integer. For dice, that means you would guess ROUND(3.5) = 4.
Another good strategy is to randomly guess either 3 or 4 with equal probability.
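You can check this reasoning with a little probability (this cross-check is mine, not part of the original discussion). The MSE of a fixed guess c is Var(X) + (E(X) − c)², and for a random guess that is independent of the roll you add the variance of the guess. The following Python sketch does the arithmetic with exact fractions and shows that, for a fair die, guessing 4 and randomly guessing 3 or 4 have the same MSE because E(X) = 3.5 is exactly halfway between them:

```python
from fractions import Fraction as F

faces = range(1, 7)
p = F(1, 6)                                # fair die: each face has probability 1/6
EX  = sum(p * x for x in faces)            # expected value = 7/2
var = sum(p * (x - EX)**2 for x in faces)  # variance = 35/12

# MSE of the fixed guess c = 4 is Var(X) + (E(X) - c)^2
mse_guess4 = var + (EX - 4)**2

# random guess of 3 or 4 with equal probability (mean 7/2, variance 1/4):
# MSE = Var(X) + Var(guess) + (E(X) - E(guess))^2
mse_random = var + F(1, 4) + (EX - F(7, 2))**2

print(EX, mse_guess4, mse_random)   # 7/2 19/6 19/6
```

Both strategies have MSE 19/6 ≈ 3.17, so for a fair die the two strategies are equally good; the example in the rest of the article shows that they differ when the fractional part of E(X) is not exactly 1/2.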

Jeske's paper generalizes this problem. Suppose a random process produces the integers 1, 2, ..., N, with probabilities p1, p2, ..., pN, where the sum of the probabilities is 1. (This distribution is sometimes called the "table distribution.") If your goal is to minimize the mean squared error (MSE) between your guess and a series of future random values, Jeske shows that the optimal solution is to guess the value that is closest to the expected value of the random variable. I call this method the ROUNDING estimator. In general, this method is biased, but it has the smallest expected MSE. Recall that the MSE is a measure of the quality of an estimator (smaller is better).

An alternative method is to randomly guess either of the two integers that are closest to the expected value, giving extra weight to the integer that is closer to the expected value. I call this method the RANDOM estimator. The random estimator is unbiased, but it has a higher MSE.

The following example is from Jeske's paper. A discrete process generates a random variable, X, which can take the values 1, 2, and 3 according to the following probabilities:

- P(X=1) = 0.2, which means that the value 1 appears with probability 0.2.
- P(X=2) = 0.3, which means that the value 2 appears with probability 0.3.
- P(X=3) = 0.5, which means that the value 3 appears with probability 0.5.

A graph of the probabilities is shown to the right. The expected value of this random variable is E(X) = 1(0.2) + 2(0.3) + 3(0.5) = 2.3. However, your guess must be one of the feasible values of X, so you can't guess 2.3. The best prediction (in the MSE sense) is to round the expected value. Since ROUND(2.3) is 2, the best guess for this example is 2.

Recall that an estimator for X is *biased* if its expected value is different from the expected value of X. Since E(X) ≠ 2, the rounding estimator is biased.

You can construct an unbiased estimator by randomly choosing the values 2 and 3, which are the two closest integers to E(X). Because E(X) = 2.3 is closer to 2 than to 3, you want to choose 2 more often than 3. You can make sure that the guesses average to 2.3 by guessing 2 with probability 0.7 and guessing 3 with probability 0.3. Then the weighted average of the guesses is 2(0.7) + 3(0.3) = 2.3, and this method produces an unbiased estimate. The random estimator is unbiased, but it will have a larger MSE.
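Before running any simulation, you can compute the bias and MSE of both estimators exactly. The following Python sketch (mine, not from the original article) does the arithmetic for this example: the rounding estimator (always guess 2) has MSE 0.70, whereas the unbiased random estimator (guess 2 with probability 0.7 and 3 with probability 0.3) has MSE 0.82.

```python
probs = {1: 0.2, 2: 0.3, 3: 0.5}           # the table distribution P(X=i)
EX = sum(x * p for x, p in probs.items())  # expected value = 2.3

# rounding estimator: always guess round(E(X)) = 2
g = round(EX)
mse_round = sum(p * (x - g)**2 for x, p in probs.items())

# random estimator: guess 2 with prob 0.7 and 3 with prob 0.3 (mean 2.3, unbiased)
guess = {2: 0.7, 3: 0.3}
mean_guess = sum(gv * gp for gv, gp in guess.items())
mse_rand = sum(p * gp * (x - gv)**2
               for x, p in probs.items()
               for gv, gp in guess.items())

print(g, mse_round, mean_guess, mse_rand)
```

The exact values (MSE 0.70 versus 0.82) are useful benchmarks for the simulated MSE estimates that PROC MEANS reports below.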

Jeske proves these facts for an arbitrary table distribution, but let's use SAS to simulate the problem for the previous example. The first step is to compute the expected value of X. This is done by the following DATA step, which puts the expected value into a macro variable named MEAN:

/* Compute the expected value of X where
   P(X=1) = 0.2, P(X=2) = 0.3, P(X=3) = 0.5 */
data _null_;
   array prob[3] _temporary_ (0.2, 0.3, 0.5);
   do i = 1 to dim(prob);
      ExpectedValue + i*prob[i];       /* sum of i*prob[i] */
   end;
   call symputx("Mean", ExpectedValue);
run;
%put &=Mean;

MEAN=2.3

The next step is to predict future values of X. For the rounding estimator, the predicted value is always 2. For the random estimator, let k be the greatest integer less than E(X) and let F = E(X) - k be the fractional part of E(X). To get an unbiased estimator, you can randomly choose k with probability 1-F and randomly choose k+1 with probability F. This is done in the following DATA step, which makes the predictions, generates a realization of X, and computes the residual difference for each method for 1,000 random values of X:

/* If you know mean=E(X)=expected value of X, Jeske (1993) shows that
   round(mean) is the best MSE predictor, but it is biased.
   Randomly guessing the two closest integers is the best UNBIASED MSE predictor.
   https://www.academia.edu/15728006/Predicting_the_value_of_an_integer-valued_random_variable
   Use these two predictors for 1000 future random variates. */
%let NumGuesses = 1000;
data Guess(keep = x PredRound diffRound PredRand diffRand);
   call streaminit(12345);
   array prob[3] _temporary_ (0.2, 0.3, 0.5);  /* P(X=i) */
   /* z = floor(z) + frac(z) where frac(z) >= 0 */
   /* https://blogs.sas.com/content/iml/2020/02/10/fractional-part-of-a-number-sas.html */
   k = floor(&Mean);
   Frac = &Mean - k;                    /* distance from E(X) to k */
   do i = 1 to &NumGuesses;
      PredRound = round(&Mean);         /* guess the nearest integer */
      PredRand = k + rand("Bern", Frac); /* random guess of k or k+1, weighted by Frac */
      /* The guesses are made. Now generate a new instance of X
         and compute the residual difference. */
      x = rand("Table", of prob[*]);
      diffRound = x - PredRound;        /* rounding estimate */
      diffRand  = x - PredRand;         /* unbiased estimate */
      output;
   end;
run;

/* sure enough, ROUND is the best predictor in the MSE sense */
proc means data=Guess n USS mean;
   var diffRound DiffRand;
run;

The output from PROC MEANS shows the results of generating 1,000 random integers from X. The uncorrected sum of squares (USS) column shows the sum of the squared residuals for each estimator. (The MSE estimate is USS / 1000 for these data.) The table shows that the USS (and MSE) for the rounding estimator is smaller than for the random estimator. On the other hand, the mean of the residuals is not close to zero for the rounding method because it is a biased method. In contrast, the mean of the residuals for the random method, which is unbiased, is close to zero.

It might be easier to see the bias of the estimators if you look at the predicted values themselves, rather than at the residuals. The following call to PROC MEANS computes the sample mean for X and the two methods of predicting X:

/* the rounding method is biased; the random guess is unbiased */
proc means data=Guess n mean stddev;
   var x PredRound PredRand;
run;

This output shows that the simulated values of X have a sample mean of 2.34, which is close to the expected value. In contrast, the rounding method always predicts 2, so the sample mean for that column is exactly 2.0. The sample mean for the unbiased random method is 2.32, which is close to the expected value.
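If you want to see the sampling variability shrink toward the exact values (MSE 0.70 for rounding, 0.82 for the random guess), you can rerun the same experiment with a larger sample. The following Python sketch is my translation of the DATA step, not part of the original post; the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 100_000                                    # many more than 1,000 guesses
probs = [0.2, 0.3, 0.5]
mean = 1*0.2 + 2*0.3 + 3*0.5                   # E(X) = 2.3

x = rng.choice([1, 2, 3], size=n, p=probs)     # draws from the "table" distribution

pred_round = np.full(n, round(mean))           # rounding estimator: always guess 2
k, frac = int(np.floor(mean)), mean - np.floor(mean)
pred_rand = k + rng.binomial(1, frac, size=n)  # guess k or k+1, weighted by frac

mse_round = np.mean((x - pred_round)**2)       # should be near 0.70
mse_rand  = np.mean((x - pred_rand)**2)        # should be near 0.82
print(mse_round, mse_rand, pred_rand.mean())   # the last value should be near E(X)=2.3
```

With 100,000 draws, the sample MSEs land within a few thousandths of the exact values, and the mean of the random predictions is close to 2.3, which again shows that it is unbiased.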

In summary, you can use SAS to simulate a simple example that compares two methods of predicting the value of a discrete random process. One method is biased but has the lowest MSE. The other is unbiased but has a larger MSE. In statistics and machine learning, practitioners often choose between an unbiased method (such as ordinary least squares regression) and a biased method (such as ridge regression or LASSO regression). The example in this article provides a very simple situation that you can use to think about these issues.


The post Polygons, pi, and linear approximations appeared first on The DO Loop.

Recently, I saw a graphic on Twitter by @neilrkaye that showed the rapid convergence of a regular polygon to a circle as you increase the number of sides for the polygon. The author remarked that polygons that have 40 or more sides "all look like circles to me." That is, a regular polygon with a large number of sides is visually indistinguishable from a circle.

I had two reactions to the graphic. The first was "Cool! I want to create a similar figure in SAS!" (I have done so on the right; click to enlarge.) The second reaction was that the idea is both mathematically trivial and mathematically sophisticated. It is trivial because it was known to Archimedes more than 2,000 years ago and is readily understood by using high school math. But it is sophisticated because it is a particular instance of a mathematical limit, which is the foundation of calculus and the mathematical field of analysis.

The figure also demonstrates the fact that you can approximate a smooth object (a circle, a curve, a surface, ...) by a large number of small, local, linear approximations. It is not an exaggeration to say that that concept is among the most important concepts in applied mathematics. It is used in numerical methods to solve nonlinear equations, integrals, differential equations, and more. It is used in numerical optimization. It forms the foundations of computer graphics. A standard technique in computational algorithms that involves a nonlinear function is to approximate the function by the first term of the Taylor expansion—a process known as linearization.

Archimedes used this process (local approximation by a linear function) to approximate pi (π), which is the ratio of the circumference of a circle to its diameter. He used two sets of regular polygons: one that is inscribed in the unit circle and the other that circumscribes the unit circle. The circumference of an inscribed polygon is less than the circumference of the circle. The circumference of a circumscribed polygon is greater than the circumference of the circle. They converge to a common value, which is the circumference of the circle. If you apply this process to a unit circle, you approximate the value 2π. Archimedes used the same construction to approximate the area of the unit circle, which is π.

You can use Archimedes's ideas to approximate π by using trigonometry and regular polygons, as shown in the following paragraph.

Suppose that a regular polygon has *n* sides. Then the central angle between adjacent vertices is θ = 2π/*n* radians. The following diagram illustrates the geometry of inscribed and circumscribed polygons. The right triangle that is shown is half of the triangle between adjacent vertices. Consequently,

- The half-length of a side is *b*, where *b* = sin(θ/2) for the inscribed polygon and *b* = tan(θ/2) for the circumscribed polygon.
- The height of the triangle is *h*, where *h* = cos(θ/2) for the inscribed polygon and *h* = 1 for the circumscribed polygon.
- The circumference of the regular polygon is 2*n b*.
- The area of the regular polygon is 2*n* (*b h*/2).

Although Archimedes did not have access to a modern computer, you can easily write a SAS DATA step program to reproduce Archimedes's approximations to the circumference and area of the unit circle, as shown below:

/* area and circumference */
data ApproxCircle;
   pi = constant('pi');  /* Hah! We must know pi in order to evaluate the trig functions! */
   south = 3*pi/2;       /* angle that is straight down */
   do n = 3 to 100;
      angle = 2*pi/n;
      theta = south + angle/2;   /* first vertex for this n-gon */
      /* Circumference and Area for circumscribed polygon */
      b = tan(angle/2);
      h = 1;
      C_Out = 2*n * b;
      A_Out = 2*n * 0.5*b*h;
      /* Circumference and Area for inscribed polygon */
      b = cos(theta);
      h = abs( sin(theta) );
      C_In = 2*n * b;
      A_In = 2*n * 0.5*b*h;
      /* difference between the circumscribed and inscribed polygons */
      CDiff = C_Out - C_In;
      ADiff = A_Out - A_In;
      output;
   end;
   label CDiff="Circumference Difference" ADiff="Area Difference";
   keep n C_In C_Out A_In A_Out CDiff ADiff;
run;

/* graph the circumference and area of the n-gons as a function of n */
ods graphics / reset height=300px width=640px;
%let curveoptions = curvelabel curvelabelpos=min;
title "Circumference of Regular Polygons";
proc sgplot data=ApproxCircle;
   series x=n y=C_Out / &curveoptions;
   series x=n y=C_In  / &curveoptions;
   refline 6.2831853 / axis=y;   /* reference line at 2 pi */
   xaxis grid values=(0 to 100 by 10) valueshint;
   yaxis values=(2 to 10) valueshint label="Regular Polygon Approximation";
run;

title "Area of Regular Polygons";
proc sgplot data=ApproxCircle;
   series x=n y=A_Out / &curveoptions;
   series x=n y=A_In  / &curveoptions;
   refline 3.1415927 / axis=y;   /* reference line at pi */
   xaxis grid values=(0 to 100 by 10) valueshint;
   yaxis values=(2 to 10) valueshint label="Regular Polygon Approximation";
run;

As you can see from the graph, the circumference and area of the regular polygons converge quickly to 2π and π, respectively. After *n* = 40 sides, the curves are visually indistinguishable from their limits, which is the same result that we noticed visually when looking at the grid of regular polygons.

The DATA step also computes the difference between the measurements of the circumscribed and inscribed polygons. You can print out a few of the differences to determine how close these estimates are to the circumference and area of the unit circle:

proc print data=ApproxCircle noobs;
   where n in (40, 50, 56, 60, 80, 100);
   var n CDiff ADiff;
run;

The table shows that the difference between the circumferences of the circumscribed and inscribed polygons is about 0.02 when *n* = 40. For *n* = 56, the difference is less than 0.01, which means that the circumference of a regular polygon approximates the circumference of the unit circle to two decimal places when *n* ≥ 56. If you use a regular 100-gon, the circumference is within 0.003 of the circumference of the unit circle. Although it is not shown, it turns out that you need 177 sides before the difference is within 0.001, meaning that a 177-gon approximates the circumference of the unit circle to three decimal places.
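You can verify these thresholds without running the SAS program because the perimeters have closed forms: with θ = 2π/*n*, the inscribed *n*-gon has circumference 2*n* sin(π/*n*) and the circumscribed *n*-gon has circumference 2*n* tan(π/*n*). The gap between them shrinks like π³/*n*². The following Python sketch (mine, not from the original post) confirms the *n* = 56 and *n* = 177 thresholds:

```python
import math

def perimeters(n):
    """Perimeters of the inscribed and circumscribed regular n-gons of the unit circle."""
    half_angle = math.pi / n               # theta/2, where theta = 2*pi/n
    c_in  = 2 * n * math.sin(half_angle)   # 2*n*b with b = sin(theta/2)
    c_out = 2 * n * math.tan(half_angle)   # 2*n*b with b = tan(theta/2)
    return c_in, c_out

def first_n(tol):
    """Smallest n whose circumscribed-minus-inscribed gap is below tol."""
    n = 3
    while True:
        c_in, c_out = perimeters(n)
        if c_out - c_in < tol:
            return n
        n += 1

print(first_n(0.01), first_n(0.001))   # 56 177
```

Both perimeters bracket 2π for every *n*, so when they agree to a given tolerance, each one approximates the circumference of the circle to that tolerance.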

Similar results hold for the area of the polygons and the area of a unit circle.

In conclusion, not only does a regular *n*-gon look very similar to the circle when *n* is large, but you can also quantify how quickly the circumference and area of an *n*-gon converge to the values 2π and π, respectively. For *n*=56, the polygon values are accurate to two decimal places; for *n*=177, the polygon values are accurate to three decimal places.

Approximating a smooth curve by a series of discrete approximations is the foundation of calculus and modern numerical methods. The idea had its start in ancient Greece, but the world had to wait for Newton, Leibniz, and others to provide the mathematical machinery (limits, convergence, ...) to understand the concepts rigorously.


The post ROC curves for a binormal sample appeared first on The DO Loop.

Of course, in the real world, you do not know the distribution of the population, so you cannot obtain an exact ROC curve. However, you can estimate the ROC curve by using a random sample from the population.

This article draws a random sample from the binormal model and constructs the empirical ROC curve by using PROC LOGISTIC in SAS. The example shows an important truth that is sometimes overlooked: a sample ROC curve is a statistical estimate. Like all estimates, it is subject to sampling variation. You can visualize the variation by simulating multiple random samples from the population and overlaying the true ROC curve on the sample estimates.

The easiest way to simulate data from a binormal model is to simulate the *scores*. Recall the assumptions of the binormal model: all variables are continuous and there is a function that associates a score with each individual. The distribution of scores is assumed to be normal for the positive and negative populations.
Thus, you can simulate the scores themselves, rather than the underlying data.

For this article, the scores for the Negative population (those who do not have a disease or condition) is N(0, 1). The scores for the Positive population (those who do have the disease or condition) is N(1.5, 0.75). The ROC curve for the population is shown to the right. It was computed by using the techniques from a previous article about binormal ROC curves.

The following SAS DATA step simulates a sample from this model. It samples *n*_{N} = 50 scores from the Negative population and *n*_{P} = 25 scores from the Positive population. The distribution of the scores in the sample is graphed below:

%let mu_N    = 0;     /* mean of Negative population */
%let sigma_N = 1;     /* std dev of Negative population */
%let mu_P    = 1.5;   /* mean of Positive population */
%let sigma_P = 0.75;  /* std dev of Positive population */
%let n_N = 50;        /* number of individuals from the Negative population */
%let n_P = 25;        /* number of individuals from the Positive population */

/* simulate one sample from the binormal model */
data BinormalSample;
   call streaminit(12345);
   y = 1;             /* positive population */
   do i = 1 to &n_P;
      x = rand("Normal", &mu_P, &sigma_P);  output;
   end;
   y = 0;             /* negative population */
   do i = 1 to &n_N;
      x = rand("Normal", &mu_N, &sigma_N);  output;
   end;
   drop i;
run;

title "Distribution of Scores for 'Negative' and 'Positive' Samples";
ods graphics / width=480px height=360px subpixel;
proc sgpanel data=BinormalSample noautolegend;
   styleattrs datacolors=(SteelBlue LightBrown);
   panelby y / layout=rowlattice onepanel noheader;
   inset y / position=topleft textattrs=(size=14) separator="=";
   histogram x / group=y;
   rowaxis offsetmin=0;
   colaxis label="Score";
run;
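As a cross-check on the simulation (this aside is mine, not part of the original post), the binormal model has a closed-form area under the ROC curve: AUC = Φ((μ_P − μ_N) / sqrt(σ_N² + σ_P²)), which is Φ(1.2) ≈ 0.885 for these parameters. The following Python sketch estimates the same quantity from simulated scores by using the Mann-Whitney formulation of the AUC:

```python
import numpy as np
from math import erf, sqrt

def Phi(z):                      # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

mu_N, sigma_N = 0.0, 1.0         # Negative scores ~ N(0, 1)
mu_P, sigma_P = 1.5, 0.75        # Positive scores ~ N(1.5, 0.75)

# closed-form AUC for the binormal model
auc_pop = Phi((mu_P - mu_N) / sqrt(sigma_N**2 + sigma_P**2))

rng = np.random.default_rng(12345)
neg = rng.normal(mu_N, sigma_N, size=2000)
pos = rng.normal(mu_P, sigma_P, size=2000)

# Mann-Whitney form: empirical AUC = proportion of (pos, neg) pairs with pos > neg
auc_emp = np.mean(pos[:, None] > neg[None, :])
print(round(auc_pop, 4), round(auc_emp, 4))
```

The empirical AUC converges to the population AUC as the sample sizes grow; for the small samples in this article (50 and 25), the estimate varies considerably from sample to sample, which is the point of the simulation below.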

You can create an ROC curve by first creating a statistical model that classifies each observation into one of the two classes. You can then call PROC LOGISTIC in SAS to create the ROC curve, which summarizes the misclassification matrix (also called the confusion matrix) at various cutoff values for a threshold parameter. Although you can create an ROC curve for any predictive model, the following statements fit a logistic regression model. You can use the OUTROC= option on the MODEL statement to write the values of the sample ROC curve to a SAS data set. You can then overlay the ROC curves for the sample and for the population to see how they compare:

/* use logistic regression to classify individuals */
proc logistic data=BinormalSample noprint;
   model y(event='1') = x / outroc=ROC1;
run;

/* The ROC curve for the binormal population */
data PopROC;
   do t = -3 to 4 by 0.1;   /* cutoff values */
      FPR = 1 - cdf("Normal", t, &mu_N, &sigma_N);
      TPR = 1 - cdf("Normal", t, &mu_P, &sigma_P);
      output;
   end;
run;

/* merge in the population ROC curve */
data ROC2;
   set ROC1 PopROC;
run;

title "Population ROC Curve vs. Estimate";
ods graphics / width=480px height=480px;
proc sgplot data=ROC2 aspect=1 noautolegend;
   step x=_1MSPEC_ y=_SENSIT_ / lineattrs=(color=blue) name="est" legendlabel="Estimate";
   series x=FPR y=TPR / lineattrs=(color=black) name="pop" legendlabel="Population";
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;  yaxis grid;
   keylegend "est" "pop" / location=inside position=bottomright across=1 opaque;
   label _1MSPEC_ = "False Positive Rate (FPR)"
         _SENSIT_ = "True Positive Rate (TPR)";
run;

The graph shows the sample ROC curve (the blue, piecewise-constant curve) and the population ROC curve (the black, smooth curve). The sample ROC curve is an estimate of the population curve. For this random sample, you can see that most estimates of the true positive rate are too large (given a value for the threshold parameter) and most estimates of the false positive rate are too low. In summary, the ROC estimate for this sample is overly optimistic about the ability of the classifier to discriminate between the positive and negative populations.

The beauty of the binormal model is that we know the true ROC curve. We can compare estimates to the truth. This can be useful when evaluating competing models: Instead of comparing sample ROC curves with each other, we can compare them to the ROC curve for the population.

If we simulate many samples from the binormal model and plot many ROC estimates, we can get a feel for the variation in the ROC estimates in random samples that have *n*_{N} = 50 and *n*_{P} = 25 observations. The following DATA step simulates *B* = 100 samples. The subsequent call to PROC LOGISTIC uses BY-group analysis to fit all *B* samples and generate the ROC curves. The ROC curves for five samples are shown below:

/* 100 samples from the binormal model */
%let NumSamples = 100;
data BinormalSim;
   call streaminit(12345);
   do SampleID = 1 to &NumSamples;
      y = 1;          /* sample from positive population */
      do i = 1 to &n_P;
         x = rand("Normal", &mu_P, &sigma_P);  output;
      end;
      y = 0;          /* sample from negative population */
      do i = 1 to &n_N;
         x = rand("Normal", &mu_N, &sigma_N);  output;
      end;
   end;
   drop i;
run;

proc logistic data=BinormalSim noprint;
   by SampleID;
   model y(event='1') = x / outroc=ROCs;
run;

title "5 Sample ROC Curves";
proc sgplot data=ROCs aspect=1 noautolegend;
   where SampleID <= 5;
   step x=_1MSPEC_ y=_SENSIT_ / group=SampleID;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;  yaxis grid;
   label _1MSPEC_ = "False Positive Rate (FPR)"
         _SENSIT_ = "True Positive Rate (TPR)";
run;

There is a moderate amount of variation among these curves. The brown curve looks different from the magenta curve, even though each curve results from fitting the same model to a random sample from the same population. The differences between the curves are due only to random variation in the data (scores).

You can use partial transparency to overlay all 100 ROC curves on the same graph. The following DATA step overlays the empirical estimates and the ROC curve for the population:

/* merge in the population ROC curve */
data AllROC;
   set ROCs PopROC;
run;

title "ROC Curve for Binormal Model";
title2 "and Empirical ROC Curves for &NumSamples Random Samples";
proc sgplot data=AllROC aspect=1 noautolegend;
   step x=_1MSPEC_ y=_SENSIT_ / group=SampleID transparency=0.9
        lineattrs=(color=blue pattern=solid thickness=2);
   series x=FPR y=TPR / lineattrs=(color=black thickness=2);
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;  yaxis grid;
   label _1MSPEC_ = "False Positive Rate (FPR)"
         _SENSIT_ = "True Positive Rate (TPR)";
run;

The graph shows 100 sample ROC curves in the background (blue) and the population ROC curve in the foreground (black). The ROC estimates show considerable variability.

In summary, this article shows how to simulate samples from a binormal model. You can use PROC LOGISTIC to generate ROC curves for each sample. By looking at the variation in the ROC curves, you can get a sense for how these estimates can vary due to random variation in the data.


The post Store pre-computed matrices in a list appeared first on The DO Loop.

This article assumes that the matrices are not known until you read a data set that contains the parameters. The program must be able to handle computing, storing, and accessing an arbitrary number of matrices: 4, 10, or even 100. Obviously, if you know *in advance* the number of matrices that you need, then you can just create those matrices and give them names such as X1, X2, X3, and X4.

To give a concrete example, suppose the following data set contains multiple values for a parameter, *p*:

data Powers;
   input p @@;
   datalines;
9 1 5 2 2 5 9 1 2 5 2 1
;

For each value of *p*, you need to compute X = A^{p} = A*A*...*A for a matrix, A, and then use X in a subsequent computation. Here's how you might perform that computation in a straightforward way in SAS/IML without using lists:

proc iml;
/* use a small matrix, but in a real application the matrix might be large */
A = {2 1,
     1 3};

use Powers;               /* read the parameters from the data set */
read all var "p";
close;

/* The naive method: Compute the power A**p[i] each time you need it. */
do i = 1 to nrow(p);
   power = p[i];
   X = A**power;          /* compute each power as needed */
   /* compute with X ... */
end;

The program computes A^{p} for each value of *p* in the data set. Because the data set contains duplicate values of *p*, the program computes A^{2} four times, A^{5} three times, and A^{9} two times. If A is a large matrix, it is more efficient to compute and store these matrices and reuse them as needed.

If the values of *p* are uniformly likely, the easiest way to store matrix powers in a list is to find the largest value of *p* (call it m) and then create a list that has m items. The first item in the list is A, the second item is A^{2}, and so forth up to A^{m}. This is implemented by using the following statements:

/* Method 1: Find an upper bound for the power, m. Compute all matrices
   A^p for 1 <= p <= m. Store these matrices in a list so that you
   don't need to recompute them. */
m = max(p);               /* largest power in data */
L = ListCreate( m );
do i = 1 to m;
   L$i = A**i;            /* compute all powers up to the maximum; store A^i as the i_th item */
end;

do i = 1 to nrow(p);      /* extract and use the matrices, as needed */
   power = p[i];
   X = L$power;           /* or X = ListGetItem(L, power); */
   /* compute with X ... */
end;

For these data, the list contains nine elements because 9 is the largest value in the data.
The following statements show how to display a compact version of the list, which shows the matrices A^{p} for *p*=1,2,...,9.

package load ListUtil;    /* load the Struct and ListPrint subroutines */
run struct(L);

The table shows the first few elements of every item in the list. In this case, the *p*_th item is storing the 2 x 2 matrix A^{p}. The elements of the matrix are displayed in row-major order.

Notice that the data only contain four values of p: 1, 2, 5, and 9. The method in the previous section computes and stores unnecessary matrices such as A^{3}, A^{4}, and A^{8}. This is inefficient. Depending on the maximum value of *p* (such as max(p)=100), this method might be wasteful in terms of memory and computational resources.

A better solution is to compute and store only the matrices that are needed for the computation. For this example, you only need to compute and store four matrices. You can use the UNIQUE function to find the unique values of *p* (in sorted order). The following statements compute and store A^{p} for only the unique values of *p* in the data set:

/* Method 2: Compute matrices A^p for the unique values of p in the data.
   Store only these matrices in a named list. */
u = unique(p);               /* find the unique values of p */
nu = ncol(u);                /* how many unique values? */
L = ListCreate( nu );
do i = 1 to nu;
   power = u[i];             /* compute only the powers in the data; store in a named list */
   L$i = [#"Power"  = power, /* optional: store as a named list */
          #"Matrix" = A**power];
end;
run struct(L);

For this example, each item in the list L is itself a list. I used a "named list" with items named "Power" and "Matrix" so that the items in L have context. The STRUCT subroutine shows that L contains only four items.

How can you access the matrix for, say, A^{5}? There are several ways, but the easiest is to use the `u` matrix that contains the unique values of *p* in sorted order. You can use the LOC function to look up the index of the item that you want. For example, A^{5} is stored as the third item in the list because 5 is the third element of `u`. Because each item in L is itself a list, you can extract the "Matrix" item as follows:

do i = 1 to nrow(p);
   power = p[i];
   k = loc(u = power);       /* the k_th item is A^p */
   X = L$k$"Matrix";         /* extract it */
   /* compute with X ... */
   *print i power X;
end;

In summary, this article describes how to use a list to store matrices so that you can compute them once and reuse them many times. The article assumes that the number of matrices is not known in advance but is obtained by reading a data set at run time. The example stores powers of a matrix, but this technique is generally applicable.
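The same idea carries over to other languages. The following rough sketch implements the spirit of Method 2 in Python rather than SAS/IML (the names `mat_mul` and `unique_powers` are illustrative, not part of the article's program): it computes A**p once for each unique power and caches the result in a dictionary keyed by the power.

```python
def mat_mul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def unique_powers(A, p_values):
    """Compute A**p once for each unique power in p_values (mirrors Method 2)."""
    cache = {}
    n = len(A)
    current = [[int(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    k = 0
    for p in sorted(set(p_values)):   # unique powers, in ascending order
        while k < p:                  # step up from the previous cached power
            current = mat_mul(current, A)
            k += 1
        cache[p] = current
    return cache

A = [[2, 1], [1, 3]]
p = [2, 5, 9, 2, 5, 2, 1, 9, 2, 5]    # duplicate powers, as in the data set
cache = unique_powers(A, p)           # computes only A**1, A**2, A**5, A**9
X = cache[5]                          # look up A**5 without recomputing it
```

The dictionary plays the role of the named list: the key is the power and the value is the matrix, so the LOC-style lookup in the SAS/IML program becomes a plain `cache[power]`.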

If you have never used lists in SAS/IML, see the article "Create lists by using a natural syntax in SAS/IML" or watch this video that describes the list syntax.

The post Store pre-computed matrices in a list appeared first on The DO Loop.


Let's generalize the problem. Suppose you have a time series and you want to compare the values to a baseline (or reference) value. One way to do this is to visualize the data as deviations from the baseline. Data values that are close to the baseline will be small and almost unnoticeable. The eye will be drawn to values that indicate large deviations from the baseline. A "deviation plot" like this can be used for many purposes. Some applications include monitoring blood glucose relative to a target value, showing expenditures relative to a fixed income amount, and, yes, displaying the temperature relative to some comfortable reference value. Deviation plots sometimes accompany a hypothesis test for a one-way frequency distribution.

My colleague's display shows one year's worth of temperatures by plotting the day of the year along a circle. While this makes for an eye-catching display, there are a few shortcomings to this approach:

- It is difficult to read the data values. It is also difficult to compare values that are on opposite sides of a circle. For example, how does March data compare with October data?
- Although a circle can show data for one year, it is less effective for showing 8 or 14 months of data.
- Even for one year's worth of data, the circular display has a problem: it places December 31 adjacent to January 1. In the temperature graph, the series begins on 01JAN2018, so the graph places 31DEC2018 next to 01JAN2018 even though those values are a year apart.

As mentioned earlier, you can use SAS/GRAPH or the statistical graphics (SG) procedures in SAS to display the data in polar coordinates. Sanjay Matange's article shows how to create a polar plot. For some of my thoughts about circular versus rectangular displays, see "Smoothers for periodic data."

The graph to the right (click to enlarge) shows an example of a deviation plot (or deviation-from-baseline plot). It is similar to a waterfall chart, but in many waterfall charts the values are shown as percentages, whereas for the deviation plot we will show the observed values. You can see that the values are plotted for each day. The high values are plotted in one color (red) whereas low values are plotted in a different color (blue). A reference line (in this case, at 100) is displayed.

To create a deviation plot, you need to perform these three steps:

- Use the SAS DATA step to encode the data as 'High' or 'Low' by using the reference value. Compute the deviations from the reference value.
- Create a discrete attribute map that maps values to colors. This step is optional; if you skip it, SAS assigns colors based on the current ODS style.
- Use a HIGHLOW plot to graph the deviations from the reference value.

Let's implement these steps on a time series of daily blood glucose values. An elderly male takes oral medications to control his blood glucose level. Each morning he measures his fasting blood glucose level and records it. The doctor has advised him to try to keep the blood glucose level below 100 mg/dL, so the reference value is 100. The following DATA step defines the dates and glucose levels for a four-month period (September through December 2019).

data Series;
informat Date date.;
format Date Date.;
input Date y @@;
label y = "Blood Glucose (mg/dL)";
datalines;
01SEP19 100  02SEP19  96  03SEP19  86  04SEP19  93  05SEP19 105  06SEP19 106
07SEP19 123  08SEP19 121  09SEP19 115  10SEP19 108  11SEP19  94  12SEP19  96
13SEP19  95  14SEP19 120  15SEP19 112  16SEP19 104  17SEP19  97  18SEP19 101
19SEP19 108  20SEP19 108  21SEP19 117  22SEP19 103  23SEP19 109  24SEP19  97
25SEP19  93  26SEP19 100  27SEP19  98  28SEP19 122  29SEP19 116  30SEP19  99
01OCT19 102  02OCT19  99  03OCT19  95  04OCT19  99  05OCT19 116  06OCT19 109
07OCT19 106  08OCT19  94  09OCT19 104  10OCT19 112  11OCT19 119  12OCT19 111
13OCT19 104  14OCT19 101  15OCT19  99  16OCT19  92  17OCT19 101  18OCT19 115
19OCT19 109  20OCT19  98  21OCT19  91  22OCT19  92  23OCT19 100  24OCT19 109
25OCT19 102  26OCT19 117  27OCT19 106  28OCT19  98  29OCT19  98  30OCT19  95
31OCT19  97
01NOV19 129  02NOV19 120  03NOV19 117  04NOV19   .  05NOV19 101  06NOV19 105
07NOV19 105  08NOV19 106  09NOV19 118  10NOV19 109  11NOV19 102  12NOV19  98
13NOV19  97  14NOV19   .  15NOV19  92  16NOV19 114  17NOV19 107  18NOV19  98
19NOV19  91  20NOV19  97  21NOV19 109  22NOV19  98  23NOV19  95  24NOV19  95
25NOV19  94  26NOV19   .  27NOV19  98  28NOV19 115  29NOV19 123  30NOV19 114
01DEC19 104  02DEC19  96  03DEC19  97  04DEC19 100  05DEC19  94  06DEC19  93
07DEC19 105  08DEC19   .  09DEC19  88  10DEC19  84  11DEC19 101  12DEC19 122
13DEC19 114  14DEC19 108  15DEC19 103  16DEC19  88  17DEC19  74  18DEC19  92
19DEC19 110  20DEC19 118  21DEC19 106  22DEC19 100  23DEC19 106  24DEC19 107
25DEC19 116  26DEC19 113  27DEC19 113  28DEC19 117  29DEC19 101  30DEC19  96
31DEC19 101
;

The first step is to compute the deviation of each observed value from the reference value. If an observed value is above the reference value, mark it as 'High'; otherwise, mark it as 'Low'. We will plot a vertical bar that goes from the reference level to the observed value. Because we will use a HIGHLOW statement to display the graph, the DATA step computes two new variables, High and Low.

/* 1. Compute the deviation and encode the data as 'High' or 'Low'
      by using the reference value */
%let RefValue = 100;

data Center;
set Series;
if (y > &RefValue) then Group="High";
else Group="Low";
Low  = min(y, &RefValue);    /* lower end of highlow bar */
High = max(y, &RefValue);    /* upper end of highlow bar */
run;
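If you want to sanity-check the min/max encoding outside SAS, the logic of the DATA step can be sketched in a few lines of Python (the names `encode` and `REF_VALUE` are hypothetical, chosen for this illustration):

```python
REF_VALUE = 100   # same reference value as the %let statement

def encode(y, ref=REF_VALUE):
    """Encode one observation as (group, low, high) for a high-low bar.

    The bar always runs between the observed value and the reference
    value, so min/max pick out its lower and upper ends."""
    group = "High" if y > ref else "Low"
    return group, min(y, ref), max(y, ref)

encode(123)   # ('High', 100, 123): bar extends upward from the reference line
encode(86)    # ('Low', 86, 100):   bar extends downward from the reference line
```

Note that a value exactly equal to the reference produces a zero-length bar in the 'Low' group, which matches the strict inequality `y > &RefValue` in the DATA step.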

If you want SAS to assign colors to the two groups, you can skip this step. However, in many cases you might want to choose which color is plotted for the high and low categories. You can map levels of a group to colors by using a discrete attribute map ("DATTR map", for short) in PROC SGPLOT. Because we are going to use a HIGHLOW statement to graph the data, we need to define a map that has the FillColor and LineColor for the vertical bars. The following DATA step maps the 'High' category to red and the 'Low' category to blue:

/* 2. Create a discrete attribute map that maps values to colors */
data DAttrs;
length c FillColor LineColor $16.;
ID = "HighLow";
Value = "High"; c="DarkRed";  FillColor=c; LineColor=c; output;
Value = "Low";  c="DarkBlue"; FillColor=c; LineColor=c; output;
run;

The final step is to create a high-low plot that shows the deviations from the reference value. You can use the DATTRMAP= option to tell PROC SGPLOT how to assign colors for the group values. Because a data set can contain multiple maps, the ATTRID= option specifies which mapping to use.

/* 3. Use a HIGHLOW plot to graph the deviations from the reference value */
title "Deviations from Reference Value (&RefValue)";
title2 "Morning Fasting Blood Glucose";
ods graphics / width=600px height=400px;

proc sgplot data=Center DATTRMAP=DAttrs noautolegend;
   highlow x=Date low=Low high=High / group=Group ATTRID=HighLow;
   refline &RefValue / axis=y;
   yaxis grid label="Blood Glucose Level";
run;

The graph is shown at the top of this section. It is clear that on most days the patient has high blood sugar. With additional investigation, you can discover that the highest levels are associated with weekends and holidays.

Note that these data would not be appropriate to plot on a circular graph because the data are not for a full year. Furthermore, on this graph it is easy to see specific values and days and to compare days in September with days in December.

My colleague's graph displayed daily average temperatures. The following deviation plot shows the daily average temperature in Raleigh, NC, for 2018, relative to a reference value of 65°F:

In this graph, it is easy to find the approximate temperature for any range of dates (such as "mid-October") and to compare the temperature for different time periods, such as March versus October. I think the rectangular deviation plot makes an effective visualization of how these data compare to a baseline value.

You can download the SAS program that creates the graphs in this article.

The post Create a deviation plot to visualize values relative to a baseline appeared first on The DO Loop.
