I attended a seminar last week whose purpose was to inform SAS 9 programmers about SAS Viya. I could tell from the programmer's questions that some programmers were confused about three basic topics: What are the computing environments in Viya, and how should a programmer think about them? What procedures
Author
In a previous article, I showed how to use theCVEXHULL function in SAS/IML to compute the convex hull of a finite set of planar points. The convex hull is a convex polygon, which is defined by its vertices. To visualize the polygon, you need to know the vertices in sequential
Given a cloud of points in the plane, it can be useful to identify the convex hull of the points. The convex hull is the smallest convex set that contains the observations. For a finite set of points, it is a convex polygon that has some of the points as
I was recently asked how to create a frequency polygon in SAS. A frequency polygon is an alternative to a histogram that shows similar information about the distribution of univariate data. It is the piecewise linear curve formed by connecting the midpoints of the tops of the bins. The graph
Recall that the binomial distribution is the distribution of the number of successes in a set of independent Bernoulli trials, each having the same probability of success. Most introductory statistics textbooks discuss the approximation of the binomial distribution by the normal distribution. The graph to the right shows that the
A SAS programmer asked whether it is possible to add reference lines to the categorical axis of a bar chart. The answer is yes. You can use the VBAR statement, but I prefer to use the VBARBASIC (or VBARPARM) statement, which enables you to overlay a wide variety of graphs
A previous article discusses how to use SAS regression procedures to fit a two-parameter Weibull distribution in SAS. The article shows how to convert the regression output into the more familiar scale and shape parameters for the Weibull probability distribution, which are fit by using PROC UNIVARIATE. Although PROC UNIVARIATE
It can be frustrating when the same probability distribution has two different parameterizations, but such is the life of a statistical programmer. I previously wrote an article about the gamma distribution, which has two common parameterizations: one that uses a scale parameter (β) and another that uses a rate parameter
A genetic algorithm (GA) is a heuristic optimization technique. The method tries to mimic natural selection and evolution by starting with a population of random candidates. Candidates are evaluated for "fitness" by plugging them into the objective function. The characteristics of the better candidates are combined to create a new
This article uses an example to introduce to genetic algorithms (GAs) for optimization. It discusses two operators (mutation and crossover) that are important in implementing a genetic algorithm. It discusses choices that you must make when you implement these operations. Some programmers love using genetic algorithms. Genetic algorithms are heuristic
Sometimes we can learn as much from our mistakes as we do from our successes. Recently, I needed to solve an optimization problem for which the solution vector was a binary vector subject to a constraint. I was in a hurry. Without thinking much about what I was doing, I
Many optimization problems in statistics and machine learning involve continuous parameters. For example, maximum likelihood estimation involves optimizing a log-likelihood function over a continuous domain, possibly with constraints. Recently, however, I had to solve an optimization problem for which the solution vector was a 0/1 binary variable. To solve the
In a matrix-vector language such as SAS/IML, it is useful to always remember that the fundamental objects are matrices and that all operations are designed to work on matrices. (And vectors, which are matrices that have only one row or one column.) By using matrix operations, you can often eliminate
A reader asked whether it is possible to find a bootstrap sample that has some desirable properties. I am using the term "bootstrap sample" to refer to the result of randomly resampling with replacement from a data set. Specifically, he wanted to find a bootstrap sample that has a specific
Graphing data is almost always more informative than displaying a table of summary statistics. In a recent article about "dynamite plots," I briefly mentioned that graphs such as box plots and strip plots are better at showing data than graphs that merely show the mean and standard deviation. This article
The field of probability and statistics is full of asymptotic results. The Law of Large Numbers and the Central Limit Theorem are two famous examples. An asymptotic result can be both a blessing and a curse. For example, consider a result that says that the distribution of some statistic converges
The SAS/IML language supports lists, which are containers that store other objects, such as matrices and other lists. A primary use of lists is to pack objects of various types into a single symbol that can be passed to and from modules. A useful feature of using lists is that
I previously wrote about one way to solve the partition problem in SAS. In the partition problem, you divide (or partition) a set of N items into two groups of size k and N-k such that the sum of the items' weights is the same in each group. For example,
The partition problem has many variations, but recently I encountered it as an interactive puzzle on a computer. (Try a similar game yourself!) The player is presented with an old-fashioned pan-balance scale and a set of objects of different weights. The challenge is to divide (or partition) the objects into
A statistical programmer asked how to simulate event-trials data for groups. The subjects in each group have a different probability of experiencing the event. This article describes one way to simulate this scenario. The simulation is similar to simulating from a mixture distribution. This article also shows three different ways
A colleague spent a lot of time creating a panel of graphs to summarize some data. She did not use SAS software to create the graph, but I used SAS to create a simplified version of her graph, which is shown to the right. (The colors are from her graph.)
A colleague spent a lot of time creating a panel of graphs to summarize some data. She did not use SAS software to create the graph, but I used SAS to create a simplified version of her graph, which is shown to the right. (The colors are from her graph.)
The number of possible bootstrap samples for a sample of size N is big. Really big. Recall that the bootstrap method is a powerful way to analyze the variation in a statistic. To implement the standard bootstrap method, you generate B random bootstrap samples. A bootstrap sample is a sample
You can use the bootstrap method to estimate confidence intervals. Unlike formulas, which assume that the data are drawn from a specified distribution (usually the normal distribution), the bootstrap method does not assume a distribution for the data. There are many articles about how to use SAS to bootstrap statistics
For graphing multivariate data, it is important to be able to convert the data between "wide form" (a separate column for each variable) and "long form" (which contains an indicator variable that assigns a group to each observation). If the data are numeric, the wide data can be represented as
This article shows how to create a "sliced survival plot" for proportional-hazards models that are created by using PROC PHREG in SAS. Graphing the result of a statistical regression model is a valuable way to communicate the predictions of the model. Many SAS procedures use ODS graphics to produce graphs
A previous article discusses the geometry of weighted averages and shows how choosing different weights can lead to different rankings of the subjects. As an example, I showed how college programs might rank applicants by using a weighted average of factors such as test scores. "The best" applicant is determined
People love rankings. You've probably seen articles about the best places to live, the best colleges to attend, the best pizza to order, and so on. Each of these is an example of a ranking that is based on multiple characteristics. For example, a list of the best places to
One of the benefits of using the SWEEP operator is that it enables you to "sweep in" columns (add effects to a model) in any order. This article shows that if you use the SWEEP operator, you can compute a SSCP matrix and use it repeatedly to estimate any linear
Do you ever use a permutation matrix to change the order of rows or columns in a matrix? Did you know that there is a more efficient way in matrix-oriented languages such as SAS/IML, MATLAB, and R? Remember the following tip: Never multiply with a large permutation matrix! Instead, use