Simulate lognormal data with specified mean and variance

[Figure: histogram of the simulated lognormal data with a fitted lognormal density curve]

In my book Simulating Data with SAS, I specify how to generate lognormal data with a shape and scale parameter. The method is simple: you use the RAND function to generate X ~ N(μ, σ), then compute Y = exp(X). The random variable Y is lognormally distributed with parameters μ and σ. This is the standard definition, but notice that the parameters are specified as the mean and standard deviation of X = log(Y).

Recently, a SAS customer asked me an interesting question. What if you know the mean and variance of Y, rather than log(Y)? Can you still simulate lognormal data from a distribution with that mean and variance?

Mathematically, the question is this: if m and v are the mean and variance, respectively, of a lognormally distributed variable Y, can you compute the usual parameters for log(Y)? The answer is yes. In terms of μ and σ, the mean of Y is m = exp(μ + σ²/2) and the variance is v = (exp(σ²) − 1) exp(2μ + σ²). You can invert these formulas to get μ and σ as functions of m and v. Wikipedia includes these formulas in its article on the lognormal distribution, as follows:

μ = ln( m² / sqrt(v + m²) ),     σ = sqrt( ln(1 + v/m²) )
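If you want to see where these formulas come from, the inversion takes only two steps. The following derivation is a sketch that uses nothing except the two moment formulas for m and v stated earlier:

```latex
\begin{align*}
v &= \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}
   = \left(e^{\sigma^2} - 1\right) m^2
  &&\Longrightarrow\quad \sigma^2 = \ln\!\left(1 + \frac{v}{m^2}\right), \\
m &= e^{\mu + \sigma^2/2}
  &&\Longrightarrow\quad \mu = \ln(m) - \frac{\sigma^2}{2}
   = \ln\!\left(\frac{m^2}{\sqrt{v + m^2}}\right).
\end{align*}
```

The first line uses the fact that exp(2μ + σ²) is the square of the mean, m.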

Let's rewrite the expression inside the logarithm. If you let φ = sqrt(v + m²), then the formulas are more simply written as
μ = ln(m² / φ),     σ² = ln(φ² / m²)
Consequently, you can specify the mean and the variance of the lognormal distribution of Y and derive the corresponding (usual) parameters for the underlying normal distribution of log(Y), as follows:

data convert;
m = 80; v = 225;                /* mean and variance of Y */
phi = sqrt(v + m**2);
mu    = log(m**2/phi);          /* mean of log(Y)    */
sigma = sqrt(log(phi**2/m**2)); /* std dev of log(Y) */
run;
 
proc print noobs; run;
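A quick sanity check is to plug μ and σ back into the forward formulas and confirm that you recover the mean and variance that you started with. The following DATA step is a minimal sketch of that round trip; the data set name RoundTrip and the variables m_check and v_check are names that I made up for this illustration:

```sas
data RoundTrip;
m = 80; v = 225;                       /* mean and variance of Y       */
phi = sqrt(v + m**2);
mu    = log(m**2/phi);                 /* mean of log(Y)               */
sigma = sqrt(log(phi**2/m**2));        /* std dev of log(Y)            */
/* forward formulas: these should reproduce m and v */
m_check = exp(mu + sigma**2/2);
v_check = (exp(sigma**2) - 1) * exp(2*mu + sigma**2);
run;

proc print data=RoundTrip noobs;
   var mu sigma m_check v_check;
run;
```

Up to floating-point rounding, m_check should equal 80 and v_check should equal 225.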

For completeness, let's simulate data from a lognormal distribution with a mean of 80 and a variance of 225 (that is, a standard deviation of 15). The previous computation enables you to find the parameters for the underlying normal distribution (μ and σ) and then exponentiate the simulated data:

data lognormal;
call streaminit(1);
keep x y;
m = 80; v = 225;      /* specify mean and variance of Y */
phi = sqrt(v + m**2);
mu    = log(m**2/phi);
sigma = sqrt(log(phi**2/m**2));
do i = 1 to 100000;
   x = rand('Normal', mu, sigma);
   y = exp(x);
   output;
end;
run;

You can use the UNIVARIATE procedure to verify that the program was implemented correctly. The simulated data should have a sample mean that is close to 80 and a sample standard deviation that is close to 15. Furthermore, the LOGNORMAL option on the HISTOGRAM statement enables you to fit a lognormal distribution to the data. The fit should be good and the parameter estimates should be close to the parameter values μ = 4.36475 and σ = 0.18588 (except that PROC UNIVARIATE uses the Greek letter zeta instead of mu):

ods select Moments Histogram ParameterEstimates;
proc univariate data=lognormal;
   var y;
   histogram y / lognormal(zeta=EST sigma=EST);
run;

The histogram with fitted lognormal curve is shown at the top of this article. The mean of the simulated data is very close to 80 and the sample standard deviation is close to 15.


My thanks to the SAS customer who asked this question—and researched most of the solution! It is a question that I had not previously considered.

Is this a good way to simulate lognormal data? It depends. If you have data and you want to simulate lognormal data that "looks just like it," I suggest that you run PROC UNIVARIATE on the real data and produce maximum likelihood estimates of the lognormal parameters μ and σ. You can then use those estimates to simulate more data. However, sometimes the original data is not available. You might have only summary statistics that appear in some journal or textbook. In that case the approach in this article enables you to map the descriptive statistics of the original data to the lognormal parameters μ and σ so that you can simulate the unavailable data.
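The first workflow (fit the real data, then simulate from the fitted model) might look like the following sketch. The data set names Have and SimFromFit, the variable y, and the macro variables mu_hat and sigma_hat are all hypothetical. The Have data are themselves simulated here only so that the example runs on its own, and the values assigned to the macro variables are placeholders that you would replace with the estimates reported in the ParameterEstimates table:

```sas
/* stand-in for "real" data (hypothetical; use your own data set) */
data Have;
call streaminit(54321);
do i = 1 to 200;
   y = exp( rand('Normal', 4.4, 0.2) );
   output;
end;
keep y;
run;

/* Step 1: maximum likelihood estimates of zeta (= mu) and sigma */
ods select ParameterEstimates;
proc univariate data=Have;
   var y;
   histogram y / lognormal(zeta=EST sigma=EST);
run;

/* Step 2: copy the estimates from the table (placeholder values shown) */
%let mu_hat    = 4.4;
%let sigma_hat = 0.2;

data SimFromFit;
call streaminit(1);
do i = 1 to 1000;
   y = exp( rand('Normal', &mu_hat, &sigma_hat) );
   output;
end;
keep y;
run;
```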


Specify formats when you write vectors to a data set

Sometimes you have data in SAS/IML vectors that you need to write to a SAS data set. By default, no formats are associated with the variables that you create from SAS/IML vectors. However, some variables (notably dates, times, and datetimes) should have formats associated with the data values. You can associate a format with a vector by using the MATTRIB statement. Then when you create a SAS data set, the data set variable automatically gets the same format.

As an example, the following SAS/IML program defines data for the height of a hypothetical child. A vector of dates records when the heights were measured, and a vector of percentiles indicates how the child's height compared to the heights of his peers. You can use the FORMAT= option on the PRINT statement to display the data in a readable form:

proc iml;
/* growth chart (cm) for hypothetical child */
Date={'15MAR10'd, '22MAR11'd, '12MAR12'd, '18MAR13'd, '20MAR14'd };
Height = {133.9, 139.6, 144.2, 157.0, 168.1};
Pctl   = { 0.25,  0.30,  0.50,  0.65,  0.75};
print Date[format=DATE9.] Height[format=3.0] Pctl[format=PERCENT7.];

The DATE format is essential for understanding the data, so it would be nice to include that format (and the format for the other variables) when the data are written to a SAS data set. The MATTRIB statement makes this process easy:

mattrib Date format=DATE9.
        Height format=3.0
        Pctl format=PERCENT7.;
 
create Growth var {"Date" "Height" "Pctl"};
append;
close Growth;
quit;
 
ods select Variables;
proc contents data=Growth; 
run;

As a bonus, the PRINT statement in SAS/IML uses the formats that are assigned by the MATTRIB statement. Consequently, if you assign formats by using the MATTRIB statement, you do not need to use the FORMAT= option when you print the vectors in SAS/IML.
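To see that bonus in action, you can print the same vectors before and after the MATTRIB statement. The following is a small sketch that uses a shortened version of the growth data:

```sas
proc iml;
Date = {'15MAR10'd, '22MAR11'd, '12MAR12'd};
Pctl = {0.25, 0.30, 0.50};
print Date Pctl;                 /* no formats: raw SAS date values  */

mattrib Date format=DATE9.
        Pctl format=PERCENT7.;
print Date Pctl;                 /* formats assigned by MATTRIB      */
quit;
```

The first PRINT statement displays the dates as the number of days since 01JAN1960; the second displays them by using the DATE9. format, with no FORMAT= option required.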


Permute elements within each row of a matrix

Bootstrap methods and permutation tests are popular and powerful nonparametric methods for testing hypotheses and approximating the sampling distribution of a statistic. I have described a SAS/IML implementation of a bootstrap permutation test for matched pairs of data (an alternative to a matched-pair t test) in my paper "Modern Data Analysis for the Practicing Statistician" (Wicklin, 2010, pp 11–14).

The matched-pair permutation test enables you to determine whether the means of the two groups are significantly different. Recently, a SAS user asked how to create a permutation test that compares the means of k groups. An excellent overview of permutation tests in the ANOVA context is provided by M. J. Anderson (2001), who says (p. 628):

Suppose the null hypothesis is true and the groups are not really different (in terms of the measured variable). If this is the case, then the observations are exchangeable between the different groups. That is, the labels that are associated with particular values, identifying them as belonging to a particular group, could be randomly shuffled (permuted) and a new value of [a test statistic] could be obtained.

In a matrix language such as SAS/IML, data is often packed into a matrix with n rows and k columns. (That is, the data are stored in "wide form," as opposed to the "long form" that would be used by the ANOVA or GLM procedures.) One way to implement a permutation test for ANOVA is to apply a permutation to the k elements in each row. The purpose of this article is to provide an efficient way to permute elements within each row of a matrix.

How to permute elements within rows?

Let's start by defining some data and reading the data into a SAS/IML matrix:

data test;
input t1-t3;
datalines;
45 50 55
42 42 45
36 41 43
39 35 40
51 55 59
44 49 56
;
 
proc iml;
use test; read all into x; close test;

One approach to permute the elements of each row would be to loop over the rows and apply the RANPERM function to each row. That approach is fine for small data sets, but it is not a vectorized operation. To efficiently permute elements within each row, I will use three facts:

  • The RANPERM function can generate n independent random permutations of a set of k elements. Consequently, the RANPERM function can generate a permutation of the column subscripts {1 2 3} for each row.
  • The ROW function returns the row number for each element in a matrix. (If you do not have SAS/IML 12.3 or later, I defined the ROW function in a previous blog.)
  • A SAS/IML matrix is stored in row-major order. In an n x k matrix, the element with subscript (i,j) is stored in position k(i – 1) + j.

By using these three facts, you can construct a function that independently permutes elements of the rows of a matrix:

/* independently permute elements of each row of a matrix */
start PermuteWithinRows(m);
   n = nrow(m);  k = ncol(m); 
   j = ranperm(1:k, n);          /* each row is permutation of {1 2 ... k} */
   matIdx = k*(row(m) - 1)  + j;      /* matrix position; ROW fcn in 12.3  */
   return( shape(m[matIdx], n) );     /* permute elements of m and reshape */
finish;
 
/* call the function on example data */
call randseed(1234);
p = PermuteWithinRows(x);
print x p;

The PermuteWithinRows function is very efficient and can be used inside a loop as part of a permutation test. You can also use this technique to implement random permutations in an experimental design.
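For comparison, here is a sketch of the loop-based approach that I mentioned earlier, which calls the RANPERM function once for each row. The module name PermuteRowsLoop is one that I made up for this illustration. Because the random numbers are consumed in a different order, do not expect this module to return the same permutations as PermuteWithinRows, even for the same seed:

```sas
proc iml;
/* loop-based alternative: permute one row at a time */
start PermuteRowsLoop(m);
   p = m;                           /* copy so that the input is unchanged */
   do i = 1 to nrow(m);
      p[i, ] = ranperm( m[i, ] );   /* random permutation of the i_th row  */
   end;
   return( p );
finish;

call randseed(1234);
x = {45 50 55,
     42 42 45,
     36 41 43};
p = PermuteRowsLoop(x);
print x p;
```

Each row of p contains the same values as the corresponding row of x, possibly in a different order. For a matrix that has many rows, the vectorized function avoids the overhead of the DO loop.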


An easy way to generate a vector of letters

A little-known but useful feature of SAS/IML 12.3 (which was released with SAS 9.4) is the ability to generate a vector of lowercase or uppercase letters by using the colon operator (:).

Many SAS/IML programmers use the colon operator to generate a vector of sequential integers:

proc iml;
x = 1:5;     /* increasing sequence {1 2 3 4 5} */
y = 9:5;     /* decreasing sequence {9 8 7 6 5} */

The colon operator is also useful for generating variable names that start with a common prefix and are enumerated consecutively:

varNames = "var1":"var5";  /* {"var1" "var2" "var3" "var4" "var5"} */

The new feature of the colon operator is the ability to generate a sequence of consecutive English letters:

lower = "a":"z";    /* 26 lowercase letters */
upper = "A":"Z";    /* 26 uppercase letters */
s = "p":"k";        /* {"p" "o" "n" "m" "l" "k"} */
u = "U":"Z";        /* {U V W X Y Z} */

This feature is useful when you want to quickly name the levels of a categorical variable. For example, the following SAS/IML statements sample with replacement from four categories named A, B, C, and D:

call randseed(1234);
x = sample("A":"D", 10); /* 10 draws with replacement from {A B C D} */
print x;
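Here is one more idea, offered as a sketch: use the letters as row and column headers when you print a matrix. The variable names rn and cn are arbitrary, and the example assumes SAS/IML 12.3 or later:

```sas
proc iml;
m  = {1 2 3,
      4 5 6};
rn = "a":"b";        /* row headers    {"a" "b"}     */
cn = "A":"C";        /* column headers {"A" "B" "C"} */
print m[rowname=rn colname=cn];
```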

Can you think of a way to use this feature in your work? Leave a comment.


Using associativity can lead to big performance improvements in matrix multiplication

In a previous post, I stated that you should avoid matrix multiplication that involves a huge diagonal matrix because that operation can be carried out more efficiently. Here's another tip that sometimes improves the efficiency of matrix multiplication: use parentheses to prevent the creation of large matrices.

Matrix multiplication is associative, which means that a matrix expression like Z = A*B*C can be computed as Z = (A*B)*C or as Z = A*(B*C). The answer is the same, but sometimes the dimensions of the matrices are such that one way is more efficient than the other. Because matrix languages such as the SAS/IML language evaluate matrix expressions from left to right, the expression Z = (A*B)*C is equivalent to not using parentheses at all.

When do parentheses help? Parentheses help when the product A*B is large, but B*C is small. The following SAS/IML statements compute the same product, but the computation that uses parentheses is much faster because it avoids forming a huge intermediate result:

/* Compute Z = A*B*C where
   A is N x p matrix
   B is p x N matrix
   C is N x k matrix, and N is much larger than p and k */
proc iml;
N=10000;  p = 500;  k = 300;
A = j(N,p,1); B = j(p,N,1); C = j(N,k,1);
 
t0 = time();
Z1 = A*B*C;
T1 = time()-t0;
 
t0 = time();
Z2 = A*(B*C);
T2 = time()-t0;
print T1 T2;

This computation was done in SAS 9.4, which includes support for multithreaded matrix multiplication in SAS/IML. If you are running SAS 9.3, use a smaller value for N.

You can see that the first computation required about 30 seconds, whereas the second required less than a second. Both expressions compute the same N x k matrix. Why is the second computation so much faster?

  • The first computation must compute the temporary matrix A*B, which is an N x N matrix. After that huge matrix is formed, it is used to multiply C.
  • The second computation computes the temporary matrix B*C, which is only a small p x k matrix. That matrix is premultiplied by the medium-sized matrix A to form the final result.

So in the second computation there is no "inflation" of the matrix sizes. Each intermediate matrix is the same size or smaller than the original matrices. In contrast, the first computation forms a huge matrix that is much larger than any of the original matrices.
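You can put rough numbers on this argument by counting scalar multiplications. The following sketch assumes the textbook algorithm, in which the product of an a x b matrix and a b x c matrix requires a*b*c multiplications. The variable names cost1, cost2, tmp1, and tmp2 are mine:

```sas
proc iml;
N = 10000;  p = 500;  k = 300;
/* scalar multiplications for each order of evaluation */
cost1 = N*p*N + N*N*k;    /* (A*B)*C : form N x N matrix, then multiply by C    */
cost2 = p*N*k + N*p*k;    /* A*(B*C) : form p x k matrix, then premultiply by A */
ratio = cost1 / cost2;
/* number of elements in each intermediate matrix */
tmp1 = N*N;               /* A*B */
tmp2 = p*k;               /* B*C */
print cost1 cost2 ratio, tmp1 tmp2;
```

For these dimensions, the first order requires 8E10 multiplications and the second requires 3E9, a ratio of about 27. The observed timing ratio will differ because it also depends on memory allocation, threading, and the hardware, but the count explains the direction and the rough size of the effect.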

So next time you need to compute a product of several matrices, think about the dimensions of those matrices and choose an order for the multiplication that will keep intermediate matrices small. The effect can be dramatic!


Never multiply with a large diagonal matrix

I love working with SAS Technical Support because I get to see real problems that SAS customers face as they use SAS/IML software. The other day I advised a customer how to improve the efficiency of a computation that involved multiplying large matrices. In this article I describe an important efficiency tip: Never multiply with a diagonal matrix.

Here is the scenario. The customer needed to compute the matrix Z, which is the symmetric matrix product
     Z = W^(1/2) B R B′ W^(1/2)
where

  • W = diag(d) is an N x N diagonal matrix
  • B is an N x p matrix
  • B′ is the transpose of B
  • R is a p x p symmetric matrix. (The symmetry of R isn't exploited in this article.)
In the customer's scenario, N and p were large. Roughly, N = 10,000 and p = 600. The matrix computation was taking a long time, and because the computation was inside a simulation loop, the entire program required many hours to run.

The brute force approach

The customer implemented the formula in the natural way. Let's time how long the straightforward computation takes. (I am using SAS 9.4, which uses multithreaded matrix multiplication. If you are using SAS 9.3, you might want to use N = 5000.) Since the contents of the matrices don't matter, I'll create random elements. As I've shown in a previous blog post, you can use the SQRVECH function to create the symmetric matrix R:

proc iml;
N = 10000; p = 600;
/* define matrices */
d = j(N,1,1);               /* d is N x 1 */
B = j(N,p,1);               /* B is N x p */
v = j(p*(p+1)/2, 1, 1);     /* allocate vector */
/* fill with random uniform numbers in (0,1) */
call randgen(d,"Uniform"); call randgen(B,"Uniform"); call randgen(v,"Uniform");
R = sqrvech(v);             /* create symmetric p x p matrix */
 
/* straightforward (but slow) computation */
t0 = time();
W = diag(d);                /* N x N diagonal matrix */
Z1 = sqrt(W) * B * R * B` * sqrt(W);
T1 = time() - t0;

On my computer, the naive computation with N = 10000 takes about 24 seconds. I think we can do better with a few small modifications.

Never multiply with a diagonal matrix

The time required to compute this matrix expression can be dramatically shortened by implementing the following improvements:

  • W is a diagonal matrix. Therefore the computation sqrt(W) * B multiplies the ith row of B by the ith element of the diagonal of W^(1/2). You can compute this expression more efficiently by using the elementwise multiplication (#) operator, as I showed in an article about converting a correlation matrix into a covariance matrix. The simpler expression is sqrt(d) # B, which also avoids forming the huge N x N diagonal matrix, W, and avoids taking the square root of N² elements, most of which are 0.
  • The expression sqrt(W) * B appears twice. The expression appears at the beginning of the formula, and the transpose of the expression appears at the end of the formula. Whenever you see a computation repeated twice, you should consider creating a matrix to hold the intermediate result, such as C = sqrt(d) # B.

If you implement these two improvements, the computation executes much quicker. On my computer it now takes less than a second:

free W;                               /* release the W memory */
/* avoid forming diag(d) and store temporary result */
t0 = time();
C = sqrt(d) # B;
Z2 = C * R * C`;
T2 = time() - t0;
print T1 T2;

When you use a huge N x N diagonal matrix to multiply B, most of the time is spent multiplying the off-diagonal elements, which are zero. The naive approach multiplies (and adds) about 100 million zeros! The elementwise multiplication does not multiply any zeros. Getting rid of the diagonal matrix makes a major difference in the speed of the computation, and leads to the following efficiency tip:
Tip: Never, ever, multiply with a large diagonal matrix! Instead, use elementwise multiplication of rows and columns.

Specifically, if d is a column vector:

  • Instead of diag(d) * A, use d # A to multiply the ith row of A by the ith element of d.
  • Instead of A * diag(d), use A # d` to multiply the jth column of A by the jth element of d.
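
If you want to convince yourself that the two forms agree, try them on a tiny example. The following sketch compares each pair of expressions for a 3 x 3 matrix; the names diff1 and diff2 are arbitrary:

```sas
proc iml;
d = {1, 2, 3};                 /* column vector */
A = {1 2 3,
     4 5 6,
     7 8 9};
diff1 = max(abs( diag(d)*A - d#A  ));   /* scale rows of A    */
diff2 = max(abs( A*diag(d) - A#d` ));   /* scale columns of A */
print diff1 diff2;
```

Both differences are zero because each pair of expressions performs the same scaling of the rows or columns.
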
Do you have any useful tips for computing with large matrices? Leave a comment.


Tips for concatenating strings in SAS/IML

Last week, as part of an article on how spammers generate comments for blogs, I showed how to generate random messages by using the CATX function in the DATA step. In that example, the strings were scalar quantities, but you can also concatenate vectors of strings in the SAS/IML language. However, there are some problems that need to be handled when concatenating vectors. In this post I describe the following:
  • how to get rid of spaces when you concatenate strings
  • how to insert spaces (or other delimiters) between strings

A canonical application of concatenation is combining the first and last names of individuals to form the full name. For example, the following SAS/IML statements define vectors that contain the first and last names of three famous mathematicians. You can use the SAS/IML CONCAT function or the string concatenation operator (+) to concatenate the names. This example explicitly concatenates a space between the names:

proc iml;
first = {"C."    "Isaac"  "Leonhard"};
last  = {"Gauss" "Newton" "Euler"};
name = first + " " + last;   /* concatenate with space between first and last */
print name;

As I mentioned in my article on how to understand SAS/IML character vectors, there are actually several blank characters between the first and last names of Gauss and Newton. If you use the PrintChars module from my previous post, you see the following:

/* define or load the PrintChars module here... */
run PrintChars(name);

These blank characters appear because the array of first names (first) is a character array in which all elements have the same length. In this case, all elements have length 8, which is the number of characters in the longest name, "Leonhard." Shorter strings are padded with trailing blanks.

Vectorized approach to trim blanks

Although the SAS/IML language and the SAS DATA step language are similar, string concatenation in SAS/IML has some complexities that are not present in the DATA step. In the DATA step, you can use the TRIM function (or the STRIP function) to get rid of blanks. Unfortunately, when you apply these functions to a matrix, they don't solve the problem. The matrix that is returned by trim(first) is exactly the same as first because after TRIM strips off the trailing blanks, the trimmed strings are assembled into a matrix of length 8, which re-adds the trailing blanks!
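You can check this behavior by comparing the storage length of the vector before and after the call to the TRIM function. This is a minimal sketch; the names t, len1, and len2 are arbitrary:

```sas
proc iml;
first = {"C."    "Isaac"  "Leonhard"};
t = trim(first);          /* trimmed strings are reassembled into a matrix */
len1 = nleng(first);      /* storage length of original vector             */
len2 = nleng(t);          /* storage length after TRIM                     */
print len1 len2;
```

Both lengths are 8 because every element of a character matrix is stored by using the same number of characters.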

So how can you get rid of the trailing blanks? One vectorized approach is to use the RIGHT function to right-align the first names prior to concatenating them with the last names. Of course, that will result in strings with leading spaces, so you then need to use the LEFT function to get rid of leading blanks. This approach gets messy when you are concatenating many vectors.

I scratched my head over this problem for a long time. For a while I even abandoned trying to use a vectorized approach; I just iterated over the elements of the vectors and concatenated scalar quantities. (Mea culpa!) Then one day I remembered that you can call Base SAS functions from SAS/IML and pass in matrices as arguments. Can the CATX function, which solves the problem for scalar quantities, also solve the concatenation problem for vector quantities? Let's see:

name = catx(" ", first, last);   /* insert space delimiter between names */
run PrintChars(name);

Success! The concatenated strings have only trailing blanks, and in every case the first name is separated from the last name by exactly one space. Once again, Base SAS functions help me to solve a problem that involves vectors! From now on I will use the CATX function to concatenate vectors of strings when I want to insert a space between strings.

The first argument to the CATX function specifies the delimiter that is inserted between strings. You can use that argument to insert commas, slashes, or any other delimiters between strings. You can even specify the "null string" (two quotation marks with no space between them) to concatenate strings when you do not want to insert a delimiter.
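For example, the following sketch uses a comma and a slash as delimiters for the same vectors of names. The names name1 and name2 are arbitrary:

```sas
proc iml;
first = {"C."    "Isaac"  "Leonhard"};
last  = {"Gauss" "Newton" "Euler"};
name1 = catx(",", last, first);   /* last name, comma, first name   */
name2 = catx("/", first, last);   /* first name, slash, last name   */
print name1, name2;
```

As before, CATX removes the trailing blanks from each element before it inserts the delimiter, so no blanks appear in the interior of the strings.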


How to create a string of a specified length in SAS/IML

In my recent post on how to understand character vectors in SAS/IML, I left out an important topic: How can you allocate a character vector of a specified length? In this article, "length" means the maximum number of characters in an element, not the number of elements in a vector.

In the SAS DATA step, you can use the LENGTH statement to specify the maximum number of characters that can be stored in a variable. Thus the length of a variable can be greater than the length of any string that it contains. However, the SAS/IML language does not support the LENGTH statement. This leads to an interesting problem: How can you create a vector in SAS/IML such that each element can contain up to k characters?

This problem comes up in statistical programming because sometimes you need to enforce a limit on the length of a character vector. For example, if you are creating strings that will be used to name SAS variables, the strings must not exceed 32 characters.

There are a couple of possible solutions. In SAS/IML 12.3, which was released with SAS 9.4, you can use the BLANKSTR function to create a blank string with a given number of characters. The following statements create a string that contains 10 space characters. You can then use the J function or the REPEAT function to create a character vector.

proc iml;
/* generate blank strings of an arbitrary size */
maxChars = 10;                /* maximum number of characters to store */
Blanks = BlankStr(maxChars);  /* string of length 10 (SAS/IML 12.3)    */
c = j(1, 5, Blanks);          /* allocate 1x5 vector of blank strings  */

If you do not have SAS/IML 12.3, use the SUBPAD function in Base SAS. The following statement duplicates the functionality of the BlankStr function.

Blanks = subpad(" ", 1, maxChars); /* return string of length 10 */

I'll leave it as an exercise to figure out why this call to the SUBPAD function creates a string with 10 blank characters!
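After you allocate the vector, you can use the NLENG and LENGTH functions to confirm that each element holds up to 10 characters. The following sketch assigns strings of various sizes; the strings themselves are made up, and I expect the third string to be truncated because it is longer than the storage length of the vector:

```sas
proc iml;
maxChars = 10;
Blanks = subpad(" ", 1, maxChars);   /* string of 10 blanks              */
c = j(1, 3, Blanks);                 /* 1x3 vector; each element length 10 */
c[1] = "short";
c[2] = "exactly10!";                 /* exactly 10 characters            */
c[3] = "much too long a string";     /* more than 10 characters          */
N = nleng(c);                        /* storage length of the vector     */
trimLen = length(c);                 /* characters in each element       */
print N, trimLen, c;
```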

Have you ever faced the problem of generating a string with a specified number of characters? What was your solution?


This article is actually fastidious: How spammers generate random comments for blogs

Last week Chris Hemedinger posted an article about spam that is sent to SAS blogs and discussed how anti-spam software helps to block spam. No algorithm can be 100% accurate at distinguishing spam from valid comments because of the inherent trade-off between specificity and sensitivity in any statistical test. Therefore, some spam comments slip through the anti-spam filter, and I get the pleasure of reading the comments and deciding whether to allow them to appear on my blog, or whether to manually mark them as spam.

Why spammers submit comments to blogs

When I first started getting spam comments, I wondered what the spammers were trying to achieve. A typical spam comment seems fairly innocuous. Here are some actual spam comments that I have received:

  • Thanks for a marvelous posting! I certainly enjoyed reading it, you can be a great author.
  • Thank you for the good writeup. It in fact was a amusement account it.
  • If all bloggers made good content as you did, the internet will be much more useful than ever before.
  • Wow, this article is actually fastidious.

Yes, some of the grammar and word choices are strange, but these comments are not much different from some legitimate comments that I have received from legitimate readers whose native language is not English. What makes me sure that these are spam comments?

Along with each of these comments, the commenter included a URL link to some web site. The URL in a typical spam comment links to a web site that advertises cheap "name brand" merchandise, Russian brides, or get-rich-quick schemes. The spammers get paid for each link that they can successfully embed somewhere on the web, such as on my blog. As you might know, internet search engines use the number of "incoming links" as a measure of how important a web site is, and therefore how high it should appear in the search results. The goal of the blog spammer is to embed many links in many blog articles so that internet search engines rank their sponsoring web site highly when someone searches for something like "cheap viagra."

The link is not always embedded in the comment itself. When you comment on a blog, you have the option to include your name and to link to your personal web site. In a legitimate comment, the link points to the commenter's blog or business; spammers link to their sponsoring URL.

How spammers create random comments

As you can see from the sample comments, spammers try to construct complimentary but fairly generic messages that they can submit regardless of an article's content.

Suppose that a spammer decides to construct the following generic message: Your blog is truly wonderful. He could write a program that submits this comment to a million blogs. However, he would not be very successful because anti-spam software can block this simple attack by applying simple logic: IF the comment is 'Your blog is truly wonderful' AND the URL field is filled in, THEN classify the comment as spam.

To attempt to defeat anti-spam software, spammers randomly generate comments by using synonyms for the nouns, verbs, and adjectives that appear in the comment. For example, synonyms for "blog" include "post" and "article." A synonym for "wonderful" is "marvelous," and so forth. Thus the spammer could modify his spam program to generate random messages according to the following grammatical template: Your NOUN is ADJECTIVE SUBJECT_COMPLEMENT. The result is like those Mad Libs® stories you and your sibling used to create on long car rides. You can create a huge number of possible sentences, but most of them sound silly.

It is easy to write a SAS DATA step program that generates comments that fit the grammatical template. The following program uses the RAND function to randomly generate elements of a character array and uses the CATX function to concatenate the elements into a sentence:

data SpamComments(keep=msg);
array noun{4} $12 _temporary_                /* nouns */
   ("blog","post","article","commentary");
array adj{7}  $12 _temporary_                /* adjectives */
   (" ","simply","actually","truly","sincerely","honestly","really");
array sc{6}   $12 _temporary_                /* subject complements */
   ("wonderful","marvelous","fastidious","judicious","superb","fantastic");
sp = " ";                                    /* delimiter between words */
call streaminit(12345);
length msg $ 120;
do i = 1 to 20;                              /* generate 20 messages   */
   noun_i = ceil(dim(noun)*rand("uniform")); /* random index into noun array */
   adj_i  = ceil(dim(adj)*rand("uniform"));
   sc_i   = ceil(dim(sc)*rand("uniform"));
   msg = catx(sp, "Your", noun[noun_i], "is", adj[adj_i], sc[sc_i]);
   output;
end;
run;
 
proc print; run;

As the output shows, these randomly generated comments can be amusing. It is often obvious when spammers use a thesaurus to automatically generate dozens of synonyms for each word in their message template. Words have subtle connotations; they cannot be mixed and matched like a verbal closet of Garanimals®. I call comments like this "grammaticals."

Do you have any experience dealing with spammers? Share your experience by leaving a comment. I'm sure it will be fastidious. If all readers make astonishing comments as you do, the web will be much more useful than ever before.


Blanks and lengths: Understanding SAS/IML character vectors

SAS programmers are probably familiar with how SAS stores a character variable in a data set, but how is a character vector stored in the SAS/IML language?

Recall that a character variable is stored by using a fixed-width storage structure. In the SAS DATA step, the maximum number of characters that can be stored in a variable is determined when the variable is initialized, or you can use the LENGTH statement to specify the maximum number of characters. For example, the following statement specifies that the NAME variable can store up to 10 characters:

data A;
length name $ 10;   /* declare that a variable stores 10 characters */
...

The values in a character variable are left aligned. That is, values that have fewer than 10 characters are padded on the right with blanks (space characters).
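To see the difference between the storage length and the number of characters in a value, you can compare the VLENGTH and LENGTH functions in the DATA step. The following is a small sketch; the data set name Pad and the variables storedLen, trimLen, and bracketed are names that I chose for this illustration:

```sas
data Pad;
length name $ 10;                   /* NAME stores up to 10 characters      */
name = "Isaac";
storedLen = vlength(name);          /* storage length: 10                   */
trimLen   = length(name);           /* length without trailing blanks: 5    */
bracketed = "[" || name || "]";     /* brackets reveal the trailing blanks  */
run;

proc print data=Pad noobs; run;
```

The BRACKETED variable contains five blanks between "Isaac" and the closing bracket, although some output destinations compress repeated blanks when they display the value.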

SAS/IML character vectors

The same rules apply to character vectors in the SAS/IML language. A vector has a "length" that determines the maximum number of characters that can be stored in any element. (In this article, "length" means the maximum number of characters, not the number of elements in a vector.) Elements with fewer characters are blank-padded on the right. Consequently, the following two character vectors are equivalent:

proc iml;
c  = {"A",      "B   C",  "  XZ",   "LMNOPQ"}; /* length set at initialization */
c2 = {"A     ", "B   C ", "  XZ  ", "LMNOPQ"}; /* all strings have length 6 */
if c=c2 then print "Character vectors are equal";
else print "Character vectors are not equal";

You can determine the maximum number of characters that can be stored in each element by using the NLENG function in SAS/IML. You can also discover the number of characters in each element of a vector (omitting any blank padding) by using the LENGTH function, as follows:

N = nleng(c);
trimLen = length(c);
print N trimLen c;

In this example, each element of the vector c can hold up to six characters. If you write the c variable to a SAS data set, the corresponding variable will have length 6. However, if you trim off the blanks at the end of the strings, most elements have fewer than six characters. Notice that the LENGTH function counts blanks at the beginning and the middle of a string but not at the end, so that the string "  XZ" counts as four characters.

Where are the blanks?

Notice that the ODS HTML destination is not ideal for visualizing blanks in strings. In HTML, multiple blank characters are compressed into a single blank when the string is rendered, so only one space appears on the displayed output. If you need to view the spaces in the strings, use the ODS LISTING destination, which uses a fixed-width font that preserves spaces. Alternatively, the following SAS/IML function prints each character (not including trailing blanks):

/* convert a string to a row vector of single characters (uses SAS/IML 12.1) */
start Str2Vec(s);
   return (substr(s, 1:length(s), 1));  /* row vector of characters */
finish;
 
/* print characters of all strings in a vector */
start PrintChars(v);
   L = length(v);     /* characters per name, not counting trailing blanks */
   do i = 1 to ncol(L)*nrow(L);
     c = char(1:L[i], 2);
     print (Str2Vec(v[i]))[colname=c];  /* print individual letters */
   end;
finish;
 
run PrintChars(c);

I think the Str2Vec function is very cool. It uses a feature of the SUBSTR function in SAS/IML 12.1 to convert a string into a vector of characters. The PrintChars function simply calls the Str2Vec function for each element of a character matrix and prints the characters with a column header. This makes it easy to see each character's position in a string.

This article provides a short overview of how strings are stored inside SAS/IML character vectors. For more details about SAS/IML character vectors and how you can manipulate strings, see Chapter 2 of Statistical Programming with SAS/IML Software.
