The post The path of zip codes appeared first on The DO Loop.
]]>Last week I read an interesting article on Robert Kosara's data visualization blog. Kosara connected the geographic centers of the US zip codes in the lower 48 states, in order, from 01001 (Agawam, MA) to 99403 (Clarkston, WA). Since the SasHelp.zipcode data set is one of the sample data sets that is distributed with SAS, it is a simple matter to recreate the graph with SAS. The following SAS statements sort the data and exclude certain American territories before graphing the locations of the US zip code in the contiguous US. (Click on a graph to enlarge it.):
proc sort data=sashelp.zipcode(where=(StateCode NOT IN ("PR", "FM", "GU", "MH", "MP", "PW", "VI") /* exclude territories */ AND ZIP_Class = " ")) /* exclude "special" zip codes */ out=zipcode(keep=ZIP x y City StateCode); by zip; run; title "Path of US Zip Codes in Numerical Order"; proc sgplot data=zipcode noborder; where StateCode NOT IN ("AK", "HI"); /* contiguous US */ series x=X y=Y / group=StateCode; xaxis display=none; yaxis display=none; run; |
Note that the coordinates are "raw" longitudes and latitudes; no projection is used. The graph is interesting. On this scale, the western states clearly show that the zip codes form small clusters within the states. This fact is also true for the Eastern states, but it is harder to see because of the greater density of zip codes in those states. Kosara includes an interactive map that enables you to zoom in on regions of interest. To create a static "zoomed-in" version in SAS, you can use WHERE processing to subset the data. You can also add state boundaries and labels to obtain a close-up map, such as this map of Utah (UT), Colorado (CO), Arizona (AZ), and New Mexico (NM):
This graph shows that there are about a dozen zip-code clusters within each of these states. This reflects the hierarchical nature of zip codes. By design, the first digit in a zip code specifies a region of the country, such as the Southeast or the Midwest. The next two digits determine the Sectional Center Facility (SCF) code, which is a region within a state. From looking at the map, I conjecture that those clusters of jagged line segments are approximations of the SCFs.
You can use the following SAS code to generate the SCF from the zip code. The subsequent call to PROC FREQ counts the number of zip codes in each SCF in Utah. Some SCFs have many postal delivery zones within them (for example, 840xx) whereas others have relatively few (for example, 844xx). Note that the SCFs are not necessarily contiguous: Utah does not (yet) have zip codes of the form 842xx.
data Zip1; length SCF $5.; set zipcode; FloorZip = floor(zip/100); /* round down to nearest 100 */ SCF = putn(FloorZip,"Z3.")||"xx"; /* Sectional Center Facility, eg, 841xx */ keep x y zip StateCode City SCF; run; proc freq data=Zip1; where StateCode='UT'; tables SCF / nocum; run; |
If you choose a point within each Sectional Center Facility and connect those points in order, you can obtain a much less cluttered diagram that shows the basic progression of the hierarchy of zip codes. The SCFs can zig-zag across a state and do not necessarily follow a geographical progression such as north-south or east-west. The following image connects the location of the first zip code in each SCF region in Utah. The individual zip-code centers are shown as markers that are colored by the SCF.
For states that have more than a dozen SCFs, the five-character labels can obscure the path of the SCFs. If you don't care about the actual zip-code prefixes but you just want to visualize the progression, you can label positions along the path by integers. For example, there are 25 SCFs in Florida. The following graph visualizes the regions. The first SCF (320xx) is labeled '1' and the last SCF (349xx) is labeled '25'.
Lastly, the following graph shows the progression of Sectional Center Facilities at the national level. You can see certain large "jumps" across multiple states. These are present in the original map of zip codes but are obscured by the complexity of thousands of crisscrossing line segments. Two large jumps that are apparent are a diagonal line from Montana in the Pacific Northwest (prefix '59') down to Illinois (prefix '60'). Another big jump is from Nebraska in the Midwest (prefix '69') down to Louisiana (prefix '70') in the South-Central region.
In summary, this article uses SAS to reproduce the fascinating image on Robert Kosara's blog. Kosara's image is the path obtained by connecting the geographic centers of each zip code. By zooming in on individual states, you can identify clusters of zip codes, which correspond to Sectional Center Facilities (SCFs). By choosing one zip code from each SCF and connecting those locations in order, you obtain a simplified version of the path that connects major postal zones.
If you would like to explore these data yourself, you can download the SAS program that analyzes the zip code data.
The post The path of zip codes appeared first on The DO Loop.
]]>The post Simulate multivariate clusters in SAS appeared first on The DO Loop.
]]>This article shows how to simulate data from a mixture of multivariate normal distributions, which is also called a Gaussian mixture. You can use this simulation to generate clustered data. The adjacent graph shows three clusters, each simulated from a four-dimensional normal distribution. Each cluster has its own within-cluster covariance, which controls the spread of the cluster and the amount overlap between clusters.
This article is based on an example in Simulating Data with SAS (Wicklin, 2013, p. 138). Previous articles have explained how to simulate from a multivariate normal distribution and how to generate a random sample from a univariate mixture distribution.
The graph at the top of this article shows 100 random observation from a Gaussian mixture distribution. A mixture distribution consists of K component distributions and a set of mixing probabilities that determine the probability that a random observation belongs to each component. For example, if π = {0.35, 0.5, 0.15} is a vector of mixing probabilities, then in large random samples about 35% of the observations are drawn from the first component, about 50% from the second component, and about 15% from the third component.
For a mixture of Gaussian components, you need to specify the mean vector and the covariance matrix for each component. For this example, the means and covariances are approximately the estimates from Fisher's famous iris data set, so the scatter plot matrix might look familiar to statisticians who have previously analyzed the iris data. The means of the three component distributions are μ_{1} = {50, 34, 15, 2}, μ_{2} = {59, 28, 43, 13}, and μ_{3} = {66, 30, 56, 20}. The covariance matrices for the component distributions are shown later.
The SAS/IML language is the easiest way to simulate multivariate data in SAS. To simulate from a mixture of K Gaussian distributions, do the following:
A technical issue is how to pass the mean vectors and covariance matrices into a module that simulates the data. If you are using SAS/IML 14.2 or beyond, you can use a list of lists (Wicklin, 2017, p. 12–14). For compatibility with older versions of SAS, the implementation in this article uses a different technique: each mean and covariance are stored as rows of a matrix. Because the covariance matrices are symmetric, Wicklin (2013) stores them in lower-triangular form. For simplicity, this article stores each 4 x 4 covariance matrix as a row in a 3 x 16 matrix.
proc iml; /* Simulate from mixture of K mulitvariate normal distributions in dimension d. OUTPUT variables: X : a SampleSize x d matrix of simulated observations ID : a column vector that contains the component from which each obs is drawn INPUT variables: pi : 1 x K vector of mixing probabilities, sum(pi)=1 mu : K x d matrix whose i_th row contains the mean vector for the i_th component Cov : K x (d**2) matrix whose i_th row contains the covariance matrix for the i_th component */ start SimMVNMixture(X, ID, /* output arguments */ SampleSize, pi, mu, Cov); /* input arguments */ K = ncol(pi); /* number of components */ d = ncol(mu); /* number of variables */ X = j(SampleSize, d); /* output: each row is observation */ ID = j(SampleSize, 1); /* ID variable */ N = RandMultinomial(1, SampleSize, pi); /* vector of samples sizes for components */ b = 1; /* b = beginning index for group i */ do i = 1 to K; e = b + N[i] - 1; /* e = ending index for group i */ ID[b:e] = i; /* set ID variable */ c = shape(Cov[i,], d, d); /* reshape Cov to square matrix */ X[b:e, ] = RandNormal(N[i], mu[i,], c); /* simulate i_th MVN sample */ b = e + 1; /* next group starts at this index */ end; finish; |
The SimMVNMixture routine allocates a data matrix (X) that is large enough to hold the results. It generates a vector N ={N1, N2,..., NK} to determine the number of observations that will be drawn from each component and calls the RANDNORMAL function to simulate from each Gaussian component. The scalar values b and e keep track of the beginning and ending rows of each sample from each component.
After the module is defined, you can call it for a specific set of parameters. Assume you want to generate K=3 clusters of four-dimensional data (d=4). The following statements specify the mixing probabilities for a three-component model. The mu matrix is a K x d matrix whose rows are the mean vectors for the components. The Cov matrix is a K x (d**2) matrix whose rows are the covariance matrices for the components. The following statements generate a total of 100 observations from the three-component mixture distribution:
/* specify input args; means/cov correspond to sashelp.iris data for species */ pi = {0.35 0.5 0.15}; /* mixing probs for K=3 groups */ mu = {50 34 15 2 , /* means of Group 1 */ 59 28 43 13 , /* means of Group 2 */ 66 30 56 20 }; /* means of Group 3 */ /* specify within-group covariances */ Cov = {12 10 2 1 10 14 1 1 2 1 3 1 1 1 1 1 , /* cov of Group 1 */ 27 9 18 6 9 10 8 4 18 8 22 7 6 4 7 4 , /* cov of Group 2 */ 40 9 30 5 9 10 7 5 30 7 30 5 5 5 5 8 }; /* cov of Group 3 */ /* run the simulation to generate 100 observations */ call randseed(12345); run SimMVNMixture(X, Group, 100, pi, mu, Cov); |
The call to the SimMVNMixture routine returned the simulated random sample in X, which is a 100 x d matrix. The module also returns an ID vector that identifies the component from which each observation was drawn. You can visualize the random sample by writing the data to a SAS data set and using the SGSCATTER procedure to create a paneled scatter plot, as follows:
/* save simulated data to data set */ varNames="x1":"x4"; Y = Group || X; create Clusters from Y[c=("Group" || varNames)]; append from Y; close; quit; ods graphics / attrpriority=none; title "Multivariate Normal Components"; title2 "(*ESC*){unicode pi} = {0.35, 0.5, 0.15}"; proc sgscatter data=Clusters; compare y=(x1 x2) x=(x3 x4) / group=Group markerattrs=(Size=12); run; |
The graph is shown at the top of this article.
Although each component in this example is multivariate normal, the same technique will work for any component distributions. For example, one cluster could be multivariate normal, another multivariate t, and a third multivariate uniform.
In summary, you can create a function module in the SAS/IML language to simulate data from a mixture of Gaussian components. The RandMutinomial function returns a random vector that determines the number of observations to draw from each component. The RandNormal function generates the observations. A technical detail involves how to pass the parameters to the module. This implementation packs the parameters into matrices, but other options, such as a list of lists, are possible.
The post Simulate multivariate clusters in SAS appeared first on The DO Loop.
]]>The post Symbolic derivatives in SAS appeared first on The DO Loop.
]]>In SAS/IML, the nonlinear optimization routines will use analytical derivatives if they are provided; otherwise, they will automatically generate numerical finite-difference approximations to the derivatives. I rarely specify derivatives because it can be time-consuming and prone to error. But recently I realized that PROC NLIN and other SAS procedures automatically generate symbolic derivatives, and you can "trick" the procedure into displaying the derivatives so that they can be used in other applications. To get PROC NLIN to output symbolic derivatives, do the following:
Here is a one-variable example. Suppose you want to take the derivative of
f(x) = x^{3} + x sin(x) + exp( x^{2} ).
The function and its derivative is shown to the right.
The following SAS statements create a "fake" data set that defines a variable named F. The call to PROC NLIN sets up the problem. The LIST option on the PROC NLIN statement generates the 'ProgList' table, which contains the derivative function:
data _dummy; f = 1; /* name of function */ run; proc nlin data=_dummy list; parms x=0; /* list variables */ model f = x**3 + x*sin(x) + exp(x**2); /* define function */ ods select ProgList; /* output symbolic derivatives */ run; |
The output shows the expressions for the function ("MODEL.f") and the derivative with respect to x. The derivative df/dx is written as "@MODEL.f/@x", where the symbol '@' is used for the 'd'. You can see that SAS took the simple derivative of the x^{3} term, applied the product rule to the x*sin(x) term, and applied the chain rule to the exp( x^{2} ) term. You can also see that the expressions are written in SAS notation, with "**" indicating the power operator. SAS functions are written in upper case (SIN, COS, and EXP).
In exactly the same way, you can obtain partial derivatives. In a partial derivative, all variables are held constant except one. Suppose you have the function for the normal density,
f(x) = 1/(sqrt(2 π) σ) exp( -(x - μ)**2 / (2 σ**2) )
Suppose that you want to compute the partial derivatives ∂f/∂μ and ∂f/∂σ. To compute these derivatives, list μ and σ on the PARMS statement and assign values to the other symbols (x and π) that appear in the function. It does not matter what value you assign to the x and π, you are just preventing PROC NLIN from complaining that there are unassigned variables. You could assign the value 1 to both variables, but I prefer to tell the procedure that π equals 3.14. In the following program, note that PROC NLIN reuses the fake data set that defines the F variable:
proc nlin data=_dummy list; parms mu=0 sigma=1; /* list variables */ x=1; pi = constant('pi'); /* assign arbitrary values to constants */ model f = exp( -(x-mu)**2 / (2*sigma**2) ) / (sqrt(2*pi)*sigma); /* define function */ ods select ProgList; /* output symbolic derivatives */ run; |
The derivative with respect to σ requires using the chain rule and the quotient rule. Notice that the derivative includes a call to the original function ("MODEL.f"). Notice also that there is repetition of various expressions such as -(x - μ)**2 / (2 σ**2). If you copy and paste these expressions into a SAS program that computes the derivative, the computation will be slightly inefficient because the program will evaluate the same expression multiple times. One way to improve the efficiency is to collect sub-expressions into a temporary variable, as shown in the next section.
An experienced statistician will recognize that the expression z = (x - μ) / σ arises naturally as the standardized variable. Using such an expression might simplify the derivatives.
The following program computes the same derivatives as before, except this time the programmer defines the expression z = (x - μ) / σ and uses z in the function. When SAS takes the derivative, it computes dz/dμ and dz/dσ as part of the computation, as shown below:
proc nlin data=_dummy list; parms mu=0 sigma=1; /* list variables */ x=1; pi = constant('pi'); /* assign arbitrary values to constants */ const = sqrt(2*pi); z = (x-mu) / sigma; /* intermediate expression */ model f = exp(-z**2 / 2) / (const*sigma); /* define function */ ods select ProgList; /* output symbolic derivatives */ run; |
Notice that the derivatives use several sub-expressions such as dz/dμ and dz/dσ. This computation is more efficient than the previous because it reuses previously computed quantities. For another example, define f1 = 1 / sqrt(2*pi*sigma) and f2 = exp(-z**2 / 2) and define the MODEL as f = f1*f2. You will see that each term in the derivative is efficiently evaluated in terms of previously expressions.
As I said previously, I do not usually specify analytical derivatives for my optimizations. I find that numerical derivatives (which are automatically computed) are fast and accurate for 99% of the optimizations that I perform.
Furthermore, some of the objective functions that I optimize do not have analytical derivatives. Others are matrix computations for which it would be difficult and time-consuming to write down the analytical derivative. (For an example, see the article on the derivative of the bivariate normal cumulative distribution.)
If you decide to specify an analytical derivative in SAS, make sure that the derivative is correct and efficient. The technique in this article will help to ensure that the derivative is correct. By strategically defining sub-expressions as in the previous section, you can help SAS generate derivatives that are computationally efficient.
Be aware that the derivatives are generated by applying the chain rule, product rule, and quotient rule without any algebraic simplifications of the result. Thus the result can be unnecessarily complicated. For an example, ask SAS to display the derivative of log( (1+x)/(1-x) ) / 2. The naive result is complicated. If you rewrite the function as log(1+x)/2 - log(1-x)/2, then the derivative is much simpler. However, in neither case will SAS simplify the derivative to obtain 1 / (1-x**2).
For more of my thoughts on using derivatives in SAS, see "Two hints for specifying derivatives." What about your thoughts? What do you think about using PROC NLIN to generate automatic derivatives of simple functions? Do you think this technique will be useful to you?
The post Symbolic derivatives in SAS appeared first on The DO Loop.
]]>The post Construct polynomial effects in SAS regression models appeared first on The DO Loop.
]]>The EFFECT statement is supported in many SAS procedures, including the GLIMMIX, GLMSELECT, and LOGISTIC procedures. Because those procedures can output a design matrix, you can use a model that is generated by an EFFECT statement in any SAS procedure, even older procedures that do not support the EFFECT statement. I have previously shown how you can use the SPLINE option in the EFFECT statement to generate spline effects. This article deals with polynomial effects, which are effects that are formed by elementwise multiplication of continuous variables, such as x1*x1, x1*x2, and x1*x1*x2*x3.
The following statements rename a few continuous variables in the Sashelp.Heart data set, which will be used in the examples. The new variables (x1, x2, x3, and x4) are easier to type and using these short variable names makes it easier to see how interactions effects are formed.
data Heart / view=Heart; set Sashelp.Heart; rename Height=x1 Weight=x2 Diastolic=x3 Systolic=x4; run; |
Recall that many SAS regression procedures (such as GLM, GENMOD, LOGISTIC, MIXED,...) support the bar operator (|) to specify interactions between effects. For example, the following MODEL statement specifies that the model should include all main effects and all higher-order interactions:
proc logistic data=Heart; model Y = x1 | x2 | x3 | x4; /* all main effects and interactions up to 4-way */ run; |
The previous MODEL statement includes all two-way, three-way, and four-way interaction effects between distinct variables. In practice, fitting a model with so many effects will lead to overfitting, so most analysts restrict the model to two-way interactions. In SAS you can use the "at" operator (@) to specify the highest interaction terms in the model. For example, the following syntax specifies that the model contains only main effects and two-way interactions:
model Y = x1 | x2 | x3 | x4 @2; /* main effects and two-way interactions */ /* equivalent: model Y = x1 x2 x3 x4 x1*x2 x1*x3 x1*x4 x2*x3 x2*x4 x3*x4; */ |
Notice that the bar operator does not generate the interaction of a variable with itself. For example, the terms x1*x1 and x2*x2 are not generated. Notice also that you need to explicitly type out each variable when you use the bar operator. You cannot use "colon notation" (x:) or hyphens (x1-x4) to specify a range of variables. For four variable, this is an inconvenience; for hundreds of variables, this is a serious problem.
The EFFECT statement enables you to create polynomial effects, which have the following advantages:
The POLYNOMIAL option in the EFFECT statement is described in terms of multivariate polynomials. Recall that a multivariate monomial is a product of powers of variables that have nonnegative integer exponents. For example, x^{2} y z is a monomial that contains three variables. The degree of a monomial is the sum of the exponents of the variables. The previous monomial has degree 4 because 4 = 2 + 1 + 1.
The syntax for the EFFECT statement is simple. For example, to generate all main effects and two-way interactions, including second-degree terms like x1*x1 and x2*x2, use the following syntax:
ods select ParameterEstimates(persist); proc logistic data=Heart; effect poly2 = polynomial(x1-x4 / degree=2); model Status = poly2; /* equivalent: model Status = x1 | x2 | x3 | x4 @2 x1*x1 x2*x2 x3*x3 x4*x4; */ run; |
The name of the effect is 'poly2'. It is a polynomial effect that contains all terms that involve first- and second-degree monomials. Thus it contains the main effects, the two-way interactions between variables, and the terms x1*x1, x2*x2, x3*x3, and x4*x4. The equivalent model in "stars and bars" notation is shown in the comment. The models are the same, although the variables are listed in a different order.
You can also use the colon operator to select variables that have a common prefix. For example, to specify all polynomial effects for variables that begin with the prefix 'x', you can use EFFECT poly2 = POLYNOMIAL(x: / DEGREE=2).
You can use the MDEGREE= option to control the maximum degree of any exponent in a monomial. For example, the monomials x1*x1 and x1*x2 are both second-degree monomials, but the maximum exponent that appears in the first monomial is 2 whereas the maximum exponent that appears in the second monomial is 1. The following syntax generates only monomial terms for which the maximum exponent is 1, which means that x1*x1 and x2*x2 will not appear:
proc logistic data=Heart; effect poly21 = polynomial(x1-x4 / degree=2 mdegree=1); /* exclude x1*x1, x2*x2, etc. */ model Status = poly21; /* equivalent: model Status = x1 | x2 | x3 | x4 @2 */ run; |
A novel use of polynomial effects is to generate all two-way interactions between variables in one list and variables in another list. For example, suppose you are interested in the interactions between the lists (x1 x2) and (x3 x4), but you are not interested in within-list interactions such as x1*x2 and x3*x4. By using polynomial effects, you can create the two lists and then use the bar operator to request all main and two-way interactions between elements of the lists, as follows:
proc logistic data=Heart; effect list1 = polynomial(x1-x2); /* first list of variables */ effect list2 = polynomial(x3-x4); /* second list of variables */ model Status = list1 | list2; /* main effects and pairwise interactions between lists */ /* equivalent: model Status = x1 x2 x3 x4 x1*x3 x1*x4 x2*x3 x2*x4; */ run; |
Notice that you can use the EFFECT statement multiple times in the same procedure. This is a powerful technique! By using multiple EFFECT statements, you can name sets of variables that have similar properties, such as demographic effects (age, sex), lifestyle effects (diet, exercise), and physiological effects (blood pressure, cholesterol). You can then easily construct models that contain interactions between these sets, such as LIFESTYLE | PHYSIOLOGY.
In summary, the POLYNOMIAL option in the EFFECT statement enables you to control the interactions between variables in a model. You can form models that are equivalent to the "stars and bars" syntax or specify more complex models. In particular, you can use the EFFECT statement multiple times to construct lists of variables and interactions between elements of the lists. For more information about polynomial effects, see the SAS documentation for the POLYNOMIAL option in the EFFECT STATEMENT.
The post Construct polynomial effects in SAS regression models appeared first on The DO Loop.
]]>The post 7 ways to view correlation appeared first on The DO Loop.
]]>How can we understand these many different interpretations? As the song says, "the answer is easy if you take it logically." These seven ways to view your Pearson correlation are based on the wonderful paper by Rodgers and Nicewander (1988), "Thirteen ways to look at the correlation coefficient," which I recommend for further reading.
The simplest way to visualize correlation is to create a scatter plot of the two variables. A typical example is shown to the right. (Click to enlarge.) The graph shows the heights and weights of 19 students. The variables have a strong linear "co-relation," which is Galton's original spelling for "correlation." For these data, the Pearson correlation is r = 0.8778, although few people can guess that value by looking only at the graph.
For data that are approximately bivariate normal, the points will align northwest-to-southeast when the correlation is negative and will align southwest-to-northeast when the data are positively correlated. If the point-cloud is an amorphous blob, the correlation is close to zero.
In reality, most data are not well-approximated by a bivariate normal distribution. Furthermore, Anscombe's Quartet provides a famous example of four point-clouds that have exactly the same correlation but very different appearances. (See also the images in the Wikipedia article about correlation.) So although a graphical visualization can give you a rough estimate of a correlation, you need computations for an accurate estimate.
In elementary statistics classes, the Pearson sample correlation coefficient between two variables x and y is usually given as a formula that involves sums. The numerator of the formula is the sum of crossproducts of the centered variables. The denominator involves the sums of squared deviations for each variable. In symbols:
The terms in the numerator involve the first central moments and the terms in the denominator involve the second central moments. Consequently, Pearson's correlation is sometimes called the product-moment correlation.
I have a hard time remembering complicated algebraic formulas. Instead, I try to visualize problems geometrically. The way I remember the correlation formula is as the inner (dot) product between two centered and standardized vectors. In vector notation, the centered vectors are x - x̄ and y - ȳ. A vector is standardized by dividing by its length, which in the Euclidean norm is the square root of the sum of the square of the elements. Therefore you can define u = (x - x̄) / || x - x̄ || and v = (y - ȳ) / || y - ȳ ||. Looking back at the equation in the previous section, you can see that the correlation between the vectors x and y is the inner product r = u · v. This formula shows that the correlation coefficient is inavariant under affine transformations of the data.
Linear algebra teaches that the inner product of two vectors is related to the angle between the vectors. Specifically, u · v = ||u|| ||v|| cos(θ), where θ is the angle between the vectors u and v. Dividing both sides by the lengths of u and v and using the equations in the previous section, we find that r = cos(θ), where θ is the angle between the vectors.
This equation gives two important facts about the Pearson correlation. First, the correlation is bounded in the interval [-1, 1]. Second, it gives the correlation for three important geometric cases. When x and y have the same direction (θ = 0), then their correlation is +1. When x and y have opposite directions (θ = π), then their correlation is -1. When x and y are orthogonal (θ = π/2), their correlation is 0.
Recall that the covariance between two variables x and y is
The covariance between two variables depends on the scales of measurement, so the covariance is not a useful statistic for comparing the linear association. However, if you divide by the standard deviation of each variable, then the variables become dimensionless. Recall that the standard deviation is the square root of the variance, which for the x variable is given by
Consequently, the expression s_{xy} / (s_{x} s_{y}) is a standardized covariance. If you expand the terms algebraically, the "n - 1" terms cancel and you are left with the Pearson correlation.
Most textbooks for elementary statistics mention that the correlation coefficient is related to the slope of the regression line that estimates y from x. If b is the slope, then the result is r = b (s_{x} / s_{y}). That's a messy formula, but if you standardize the x and y variables, then the standard deviations are unity and the correlation equals the slope.
This fact is demonstrated by using the same height and weight data for 19 students. The graph to the right shows the data for the standardized variables, along with an overlay of the least squares regression line. The slope of the regression line is 0.8778, which is the same as the correlation between the variables.
The previous section showed a relationship between two of the most important concepts in statistics: correlation and regression. Interestingly, the correlation is also related to another fundamental concept, the mean. Specifically, when two variables are positively correlated, the correlation coefficient is equal to the geometric mean of two regression slopes: the slope of y regressed on x (b_{x}) and the slope of x regressed on y (b_{y}).
To derive this result, start from the equation in the previous section, which is r = b_{x} (s_{x} / s_{y}). By symmetry, it is also true that r = b_{y} (s_{y} / s_{x}). Consequently, r^{2} = b_{x} b_{y} or r = sqrt( b_{x} b_{y} ), which is the geometric mean of the two slopes..
This article shows multiple ways that you can view a correlation coefficient: analytically, geometrically, via linear algebra, and more. The Rodgers and Nicewander (1988) article includes other ideas, some of which are valid only asymptotically or approximately.
If you want to see the computations for these methods applied to real data, you can download a SAS program that produces the computations and graphs in this article.
What is your favorite way to think about the correlation between two variables? Leave a comment.
The post 7 ways to view correlation appeared first on The DO Loop.
]]>The post The singular value decomposition and low-rank approximations appeared first on The DO Loop.
]]>The value of each pixel in a grayscale image can be stored in a matrix where each element of the matrix is a value between 0 (off) and 1 (full intensity). I want to create a small example that is easy to view, so I'll create a small matrix that contains information for a low-resolution image of the capital letters "SVD." Each letter will be five pixels high and three pixels wide, arranges in a 7 x 13 array of 0/1 values. The following SAS/IML program creates the array and displays a heat map of the data:
ods graphics / width=390px height=210px; proc iml; SVD = {0 0 0 0 0 0 0 0 0 0 0 0 0, 0 1 1 1 0 1 0 1 0 1 1 0 0, 0 1 0 0 0 1 0 1 0 1 0 1 0, 0 1 1 1 0 1 0 1 0 1 0 1 0, 0 0 0 1 0 1 0 1 0 1 0 1 0, 0 1 1 1 0 0 1 0 0 1 1 0 0, 0 0 0 0 0 0 0 0 0 0 0 0 0 }; A = SVD; /* ColorRamp: White Gray1 Gray2 Gray3 Black */ ramp = {CXFFFFFF CXF0F0F0 CXBDBDBD CX636363 CX000000}; call heatmapcont(A) colorramp=ramp showlegend=0 title="Original Image" range={0,1}; |
Because the data matrix contains only five non-zero rows, the rank of the A matrix cannot be more than 5. The following statements compute the SVD of the data matrix and create a plot of the singular values. The plot of the singular values is similar to the scree plot in principal component analysis, and you can use the plot to help choose the number of components to retain when approximating the data.
call svd(U, D, V, A); /* A = U*diag(D)*V` */ title "Singular Values"; call series(1:nrow(D), D) grid={x y} xvalues=1:nrow(D) label={"Component" "Singular Value"}; |
For this example, it looks like retaining three or five components would be good choices for approximating the data. To see how low-rank approximations work, let's generate and view the rank-1, rank-2, and rank-3 approximations:
keep = 1:3; /* use keep=1:5 to see all approximations */ do i = 1 to ncol(keep); idx = 1:keep[i]; /* components to keep */ Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`; /* rank k approximation */ Ak = (Ak - min(Ak)) / range(Ak); /* for plotting, stdize into [0,1] */ s = "Rank = " + char(keep[i],1); call heatmapcont(Ak) colorramp=ramp showlegend=0 title=s range={0, 1}; end; |
The rank-1 approximation does a good job of determining the columns and do not contain that contain the letters. The approximation also picks out two rows that contain the horizontal strokes of the capital "S."
The rank-2 approximation refines the image and adds additional details. You can begin to make out the letters "SVD." In particular, all three horizontal strokes for the "S" are visible and you can begin to see the hole in the capital "D."
The rank-3 approximation contains enough details that someone unfamiliar with the message can read it. The "S" is reconstructed almost perfectly and the space inside the "V" and "D" is almost empty. Even though this data is five-dimensional, this three-dimensional approximation is very good.
Not only is a low-rank approximation easier to work with than the original five-dimensional data, but a low-rank approximation represents a compression of the data. The original image contains 7 x 13 = 91 values. For the rank-3 approximation, three columns of the U matrix contain 33 numbers and three columns of V^{T} contain 15 numbers. So the total number of values required to represent the rank-3 approximation is only 48, which is almost half the number of values as for the original image.
You can use the singular value decomposition and low-rank approximations to try to eliminate random noise that has corrupted an image. Every TV detective series has shown an episode in which the police obtain a blurry image of a suspect's face or license plate. The detective asks the computer technician if she can enhance the image. With the push of a button, the blurry image is replaced by a crystal clear image that enables the police to identify and apprehend the criminal.
The image reconstruction algorithms used in modern law enforcement are more sophisticated than the SVD. Nevertheless, the SVD can do a reasonable job of removing small random noise from an image, thereby making certain features easier to see. The SVD has to have enough data to work with, so the following statements duplicate the "SVD" image four times before adding random Gaussian noise to the data. The noise has a standard deviation equal to 25% of the range of the data. The noisy data is displayed as a heat map on the range [-0.25, 1.25] by using a color ramp that includes yellow for negative values and blue for values greater than 1.
call randseed(12345); A = (SVD // SVD) || (SVD // SVD); /* duplicate the image */ A = A + randfun( dimension(A), "Normal", 0, 0.25); /* add Gaussian noise */ /* Yellow White Gray1 Gray2 Gray3 Black Blue */ ramp2 = {CXFFEDA0 CXFFFFFF CXF0F0F0 CXBDBDBD CX636363 CX000000 CX3182BD}; call heatmapcont(A) colorramp=ramp2 showlegend=0 title="Noisy Image" range={-0.5, 1.5}; |
I think most police officers would be able to read this message in spite of the added noise, but let's see if the SVD can clean it up. The following statements compute the SVD and create a plot of the singular values:
call svd(U, D, V, A); /* A = U*diag(D)*V` */ call series(1:nrow(D), D) grid={x y} xvalues=1:nrow(D) label={"Component" "Singular Value"}; |
There are 14 non-zero singular values. In theory, the main signal is contained in the components that correspond to the largest singular values whereas the noise is captured by the components for the smallest singular values. For these data, the plot of the singular values suggests that three or five (or nine) components might capture the main signal while ignoring the noise. The following statements create and display the rank-3 and rank-5 approximations. Only the rank-5 approximation is shown.
keep = {3 5}; /* which components to examine? */ do i = 1 to ncol(keep); idx = 1:keep[i]; Ak = U[ ,idx] * diag(D[idx]) * V[ ,idx]`; Ak = (Ak - min(Ak)) / range(Ak); /* for plotting, standardize into [0,1] */ s = "Rank = " + char(keep[i],2); call heatmapcont(Ak) colorramp=ramp showlegend=0 title=s range={0, 1}; end; |
The denoised low-rank image is not as clear as Hollywood depicts, but it is quite readable. It is certainly good enough to identify a potential suspect. I can hear the detective mutter, "Book 'em, Danno! Murder One."
In summary, the singular value decomposition (SVD) enables you to approximate a data matrix by using a low-rank approximation. This article uses a small example for which the full data matrix is rank-5. A plot of the singular values can help you choose the number of components to retain. For this example, a rank-3 approximation represents the essential features of the data matrix.
For similar analyses and examples that use the singular value decomposition, see
In SAS software, the SVD is heavily used in the SAS Text Miner product. For an overview of how a company can use text mining to analyze customer support issues, see Sanders and DeVault (2004) "Using SAS at SAS: The Mining of SAS Technical Support."
The post The singular value decomposition and low-rank approximations appeared first on The DO Loop.
]]>The post The singular value decomposition: A fundamental technique in multivariate data analysis appeared first on The DO Loop.
]]>Mathematically, the singular value decomposition is a fundamental theorem of linear algebra. )You could argue that it is THE fundamental theorem, but Gil Strang names a different result.) The singular value decomposition says that every n x p matrix can be written as the product of three matrices: A = U Σ V^{T} where
The SVD represents the essential geometry of a linear transformation. It tells us that every linear transformation is a composition of three fundamental actions. Reading the equation from right to left:
Thus the SVD specifies that every linear transformation is fundamentally a rotation or reflection, followed by a scaling, followed by another rotation or reflection. The Strang (1993) article about the fundamental theorem of linear algebra includes the following geometric interpretation of the singular value decomposition of a 2 x 2 matrix:
The diagram shows that the transformation induced by the matrix A (the long arrow across the top of the diagram) is equivalent to the composition of the three fundamental transformations, namely a rotation, a scaling, and another rotation.
Because of its usefulness, the singular value decomposition is a fundamental technique for multivariate data analysis. A common goal of multivariate data analysis is to reduce the dimension of the problem by choosing a small linear subspace that captures important properties of the data. The SVD is used in two important dimension-reducing operations:
In SAS, you can use the SVD subroutine in SAS/IML software to compute the singular value decomposition of any matrix. To save memory, SAS/IML computes a "thin SVD" (or "economical SVD"), which means that the U matrix is an n x p matrix. This is usually what the data analyst wants, and it saves considerable memory in the usual case where the number of observations (n) is much larger than the number of variables (p). Technically speaking, the U for an "economical SVD" is suborthogonal: U^{T} U is the identity when n ≥ p and U U^{T} is the identity when n ≤ p.
As for eigenvectors, the columns of U and V are not unique, so be careful if you compare the results in SAS to the results from MATLAB or R. The following example demonstrates a singular value decomposition for a 3 x 2 matrix A. For the full SVD, the U matrix would be a 3 x 3 matrix and Σ would be a 3 x 2 diagonal matrix. For the "economical" SVD, U is 3 2 and Σ is a 2 x 2 diagonal matrix, as shown:
proc iml; A = {0.3062 0.1768, -0.9236 1.0997, -0.4906 1.3497 }; call svd(U, D, V, A); /* A = U*diag(D)*V` */ print U[f=6.3], D[f=6.3], V[f=6.3]; |
Notice that, to save memory, only the diagonal elements of the matrix Σ are returned in the vector D. You can explicitly form Σ = diag(D) if desired. Geometrically, the linear transformation that corresponds to V rotates a vector in the domain by about 30 degrees, the matrix D scales it by 2 and 0.5 in the coordinate directions, and U inserts it into a three-dimensional space, and applies another rotation.
My next blog post shows how the SVD enables you to reduce the dimension of a data matrix by using a low-rank approximation, which has applications to image compression and de-noising.
The post The singular value decomposition: A fundamental technique in multivariate data analysis appeared first on The DO Loop.
]]>The post The arithmetic-geometric mean appeared first on The DO Loop.
]]>Did you know that there is a hybrid quantity, called the arithmetic-geometric mean, which is defined by combining the two quantities? The arithmetic-geometric mean is only defined for two positive numbers, x and y. It is defined as the limit of an alternating iterative process:
Interestingly, this process always converges. The number that it converges to is called the arithmetic-geometric mean of x and y, which is denoted by AGM(x,y). The AGM is between the geometric mean and the arithmetic mean. I am not aware of a multivariate generalization of the AGM function.
In SAS you can use the SAS/IML language or PROC FCMP to create a user-defined function that computes the arithmetic-geometric mean. Since the AGM is not a vector operation, I show the PROC FCMP approach:
proc fcmp outlib=work.funcs.MathFuncs; function AGM(x, y); if x<0 | y<0 then return( . ); epsilon = 1e-10; /* precision of computation */ a = mean(x, y); g = geomean(x,y); do while (abs(a - g) > epsilon); a_new = mean(a, g); g_new = geomean(a, g); a = a_new; g = g_new; end; return( mean(a, g) ); endsub; quit; /* test the function */ options cmplib=work.funcs; /* define location of function */ data _NULL_; agm = AGM(1, 0.8); put agm 18.14; run; |
0.89721143211504 |
An example of calling the new AGM function is shown for the two numbers 1 and 0.8. The arithmetic mean of 1 and 0.8 is 0.9. The geometric mean of 1 and 0.8 is sqrt(0.8) = 0.8944. The computation shows that the arithmetic-geometric mean is 0.8972, which is between the arithmetic and geometric means.
The following program computes the graph of the AGM function and displays a contour plot of the result. Note that AGM(x, x) = x for any nonnegative value of x.
data grid; do x = 0 to 5 by 0.1; do y = 0 to 5 by 0.1; agm = AGM(x, y); output; end; end; run; /* see http://blogs.sas.com/content/iml/2012/07/02/create-a-contour-plot-in-sas.html */ proc sgrender data=grid template=ContourPlotParm; dynamic _TITLE="Graph of Arithmetic-Geometric Mean AGM(x,y)" _X="x" _Y="y" _Z="AGM"; run; |
So, what is the arithmetic-geometric mean good for? The AGM seems mainly useful in applied mathematics and number theory for computing elliptic functions and elliptic integrals. I am not aware of any applications to statistics, perhaps because the AGM is not defined for a sample that contains n elements when n > 2.
Although the AGM function is not used in statistics, the iterative algorithm that defines the AGM is a technique that I have seen often in computational statistics. When a researcher wants two conditions to hold simultaneously, one possible method is to alternately solve for each condition and prove that this iterative process converges to a solution that satisfies both conditions simultaneously. The iterative process can be an effective way to implement an optimization problem if each sub-optimization is simple.
In multivariate statistics, this technique is used in the method of alternating least squares, which is used in several SAS procedures such as PROC TRANSREG, VARCLUS, and MDS. In linear algebra, this technique is used in the method of alternating projections. I have used this technique to compute the closest correlation matrix to an arbitrary matrix.
One last interesting fact. There is another kind of mean called the harmonic mean. You can compute the arithmetic-harmonic mean by applying the same iterative algorithm, but replace the GEOMEAN function with the HARMEAN function. The process converges and the arithmetic-harmonic mean converges is equal to the geometric mean! Thus although this two-step definition might seem contrived, it is "natural" in the sense that it produces the geometric mean (of two numbers) from the arithmetic and harmonic means.The post The arithmetic-geometric mean appeared first on The DO Loop.
]]>The post 6 tips for timing the performance of algorithms appeared first on The DO Loop.
]]>The following general principles apply regardless of the language that you use to implement the algorithm:
This article is motivated by an R programmer who was trying to find the fastest way (on average) to check whether a vector contains more than one unique value. The programmer was timing the performance of five different methods in R in hopes of finding the fastest. I will use the same example, but will examine only two SAS/IML methods:
The ALL function should be fast to discard non-constant vectors because it a "short-circuiting" operation. That is, as soon as it detects two different values, it returns 0 (false). If you have a vector with 100,000 elements, the ALL function might only examine a small number of elements to determine that the vector is not constant. In contrast, the UNIQUE function should be relatively slow: it always examines all the elements, and it allocates memory to return a sorted list of all unique values.
The following program illustrates many of the best practices. It constructs random binary vectors that contain between 100,000 and 1 million elements. Most elements are 0, but there is a small probability that an element could be 1. Some of the vectors will contain a 1 near the beginning of the vector (the best case for the ALL function), others will contain a 1 near the end (the worst case for ALL).
proc iml; /* TIP: "Burn-in" by calling each important function once before you start timing */ x = randfun(N, "Bernoulli", 1/N); /* Simulate data for size */ r = all(x = x[1]); /* method 1: The ALL function */ r = (ncol(unique(x)) = 1); /* method 2: The UNIQUE function */ /* end burn-in */ sizes = do(2,10,2)*1e5; /* TIP: choose sizes so test runs in reasonable time */ NReps = 300; /* TIP: call functions multiple times */ TotalTime = j(ncol(sizes), 2); /* result matrix: for each size, save average time */ do j = 1 to ncol(sizes); /* TIP: use vectors of different sizes */ N = sizes[j]; x = j(N, 1); /* TRICK: estimate time to generate the random vectors; subtract that time later */ t0 = time(); do i = 1 to NReps; call randgen(x, "Bernoulli", 1/N); end; tRand = time() - t0; /* Method 1: time for NReps calls */ t0 = time(); do i = 1 to NReps; call randgen(x, "Bernoulli", 1/N); /* TIP: simulate data for size */ r = all(x = x[1]); /* Method 1: The ALL function */ end; TotalTime[j,1] = time() - t0 - tRand; /* subtract time to generate random numbers */ /* Method 2: time for NReps calls */ t0 = time(); do i = 1 to NReps; call randgen(x, "Bernoulli", 1/N); /* TIP: simulate data for size */ r = (ncol(unique(x)) = 1); /* Method 2: The UNIQUE function */ end; TotalTime[j,2] = time() - t0 - tRand; /* subtract time to generate random numbers */ end; AvgTime = TotalTime / NReps; /* compute average time per call */ print AvgTime[c={ALL UNIQUE} r=(putn(sizes,'comma9.')) format=6.4]; /* TIP: create a series plot that overlays both curves */ Size = Sizes` // Sizes`; /* convert from wide to long data */ Time = AvgTime[,1] // AvgTime[,2]; Group = j(ncol(sizes), 1, "ALL") // j(ncol(sizes), 1, "UNIQUE"); title; call series(Size, Time) group=Group grid={x y} option="curvelabel" label={"Size", "Average Time per Call"}; |
The results are shown to the right. The graph shows the average time to determine whether the elements of a vector of size N are unique for N in the range [1e5, 1e6]. Notice that the graph shows the average time out of 300 different samples of data. As expected, the average time for the ALL function is less than the average time for the UNIQUE function. For the ALL function, some of the tests run almost instantaneously, whereas others require longer run times. The method that calls the UNIQUE function has less variation, although the variation is not shown in this graph.
In terms of absolute time, both methods require only a few milliseconds. In relative terms, however, the ALL method is much faster, and the relative difference increases as the size of the data increases.
Notice that the program demonstrates a useful trick. The ALL function runs much faster than the time required to generate a random vector with a million elements. Therefore the time required to generate a vector and determine whether it is constant is dominated by generating the data. To estimate ONLY the time spent by the ALL and UNIQUE functions, you can either pre-compute the data or you can estimate how long it takes to generate the data and subtract that estimate from the total time. Because this particular test requires generating 300 vectors with 1 million elements, pre-computing and storing the vectors would require a lot of RAM, therefore I used the estimation trick.
In conclusion, when you are timing an algorithm in a vector-matrix language, follow the best practices in this article. The most important tip is to call the method multiple times and plot the average (as I have done here) or plot the distribution of times (not shown). Plotting the average gives you an estimate for the expected time whereas plotting the distribution enables you to estimate upper and lower bounds on the times.
SAS/IML programmers can find additional examples of timing performance in the following articles:
The post 6 tips for timing the performance of algorithms appeared first on The DO Loop.
]]>The post Use a bar chart to visualize pairwise correlations appeared first on The DO Loop.
]]>Recently a SAS programmer asked how to construct a bar chart that displays the pairwise correlations between variables. This visualization enables you to quickly identify pairs of variables that have large negative correlations, large positive correlations, and insignificant correlations.
In SAS, PROC CORR can computes the correlations between variables, which are stored in matrix form in the output data set. The following call to PROC CORR analyzes the correlations between all pairs of numeric variables in the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. Because of missing values, some pairwise correlations use more observations than others.
ods exclude all; proc corr data=sashelp.Heart; /* pairwise correlation */ var _NUMERIC_; ods output PearsonCorr = Corr; /* write correlations, p-values, and sample sizes to data set */ run; ods exclude none; |
The CORR data set contains the correlation matrix, p-values, and samples sizes. The statistics are stored in "wide form," with few rows and many columns. As I previously discussed, you can use the HEATMAPCONT subroutine in SAS/IML to quickly visualize the correlation matrix:
proc iml; use Corr; read all var "Variable" into ColNames; /* get names of variables */ read all var (ColNames) into mCorr; /* matrix of correlations */ ProbNames = "P"+ColNames; /* variables for p-values are named PX, PY, PZ, etc */ read all var (ProbNames) into mProb; /* matrix of p-values */ close Corr; call HeatmapCont(mCorr) xvalues=ColNames yvalues=ColNames colorramp="ThreeColor" range={-1 1} title="Pairwise Correlation Matrix"; |
The heat map gives an overall impression of the correlations between variables, but it has some shortcomings. First, you can't determine the magnitudes of the correlations with much precision. Second, it is difficult to compare the relative sizes of correlations. For example, which is stronger: the correlation between systolic and diastolic blood pressure or the correlation between weight and MRW? (MRW is a body-weight index.)
These shortcomings are resolved if you present the pairwise correlations as a bar chart. To create a bar chart, it is necessary to convert the output into "long form." Each row in the new data set will represent a pairwise correlation. To identify the row, you should also create a new variable that identifies the two variables whose correlation is represented. Because the correlation matrix is symmetric and has 1 on the diagonal, the long-form data set only needs the statistics for the lower-triangular portion of the correlation matrix.
Let's extract the data in SAS/IML. The following statements construct a new ID variable that identifies each new row and extract the correlations and p-values for the lower-triangular elements. The statistics are written to a SAS data set called CorrPairs. (In Base SAS, you can transform the lower-triangular statistics by using the DATA step and arrays, similar to the approach in this SAS note; feel free to post your Base SAS code in the comments.)
numCols = ncol(mCorr); /* number of variables */ numPairs = numCols*(numCols-1) / 2; length = 2*nleng(ColNames) + 5; /* max length of new ID variable */ PairNames = j(NumPairs, 1, BlankStr(length)); i = 1; do row= 2 to numCols; /* construct the pairwise names */ do col = 1 to row-1; PairNames[i] = strip(ColNames[col]) + " vs. " + strip(ColNames[row]); i = i + 1; end; end; lowerIdx = loc(row(mCorr) > col(mCorr)); /* indices of lower-triangular elements */ Corr = mCorr[ lowerIdx ]; Prob = mProb[ lowerIdx ]; Significant = choose(Prob > 0.05, "No ", "Yes"); /* use alpha=0.05 signif level */ create CorrPairs var {"PairNames" "Corr" "Prob" "Significant"}; append; close; QUIT; |
You can use the HBAR statement in PROC SGPLOT to construct the bar chart. This bar chart contains 45 rows, so you need to make the graph tall and use a small font to fit all the labels without overlapping. The call to PROC SORT and the DISCRETEORDER=DATA option on the YAXIS statement ensure that the categories are displayed in order of increasing correlation.
proc sort data=CorrPairs; by Corr; run; ods graphics / width=600px height=800px; title "Pairwise Correlations"; proc sgplot data=CorrPairs; hbar PairNames / response=Corr group=Significant; refline 0 / axis=x; yaxis discreteorder=data display=(nolabel) labelattrs=(size=6pt) fitpolicy=none offsetmin=0.012 offsetmax=0.012 /* half of 1/k, where k=number of catgories */ colorbands=even colorbandsattrs=(color=gray transparency=0.9); xaxis grid display=(nolabel); keylegend / position=topright location=inside across=1; run; |
The bar chart (click to enlarge) enables you to see which pairs of variables are highly correlated (positively and negatively) and which have correlations that are not significantly different from 0. You can use additional colors or reference lines if you want to visually emphasize other features, such as the correlations that are larger than 0.25 in absolute value.
The bar chart is not perfect. This example, which analyzes 10 variables, is very tall with 45 rows. Among k variables there are k(k-1)/2 correlations, so the number of pairwise correlations (rows) increases quadratically with the number of variables. In practice, this chart would be unreasonably tall when there are 14 or 15 variables (about 100 rows).
Nevertheless, for 10 or fewer variables, a bar chart of the pairwise correlations provides an alternative visualization that has some advantages over a heat map of the correlation matrix. What do you think? Would this graph be useful in your work? Leave a comment.
The post Use a bar chart to visualize pairwise correlations appeared first on The DO Loop.
]]>