The post The noncentral t distribution in SAS appeared first on The DO Loop.
SAS software supports the noncentrality parameter in the PDF, CDF, and QUANTILE functions. This article shows how to use these functions for the noncentral t distribution. The RAND function in SAS does not directly support the noncentrality parameter, but you can use the definition of a random noncentral t variable to generate random variates.
The classic Student t distribution contains a degree-of-freedom parameter, ν. (That's the Greek letter "nu," which looks like a lowercase "v" in some fonts.) For small values of ν, the t distribution has heavy tails. For larger values of ν, the t distribution resembles the normal distribution. The noncentral t distribution uses the same degree-of-freedom parameter.
The noncentral t distribution also supports a noncentrality parameter, δ. The simplest way to visualize the effect of δ is to look at its probability density function (PDF) for several values of δ. The support of the PDF is all real numbers, but most of the probability is close to x = δ. You can use the PDF function in SAS to compute the PDF for various values of the noncentrality parameter. The fourth parameter for the PDF("t",...) call is the noncentrality value. It is optional and defaults to 0 if not specified.
The following visualization shows the density functions for positive values of δ and positive values of x. In the computer programs, I use DF for the ν parameter and NC for the δ parameter.
/* use the PDF function to visualize the noncentral t distribution */
%let DF = 6;
data ncTPDFSeq;
df = &DF;                      /* degree-of-freedom parameter, nu */
do nc = 4, 6, 8, 12;           /* noncentrality parameter, delta */
   do x = 0 to 20 by 0.1;      /* most of the density is near x=delta */
      PDF = pdf("t", x, df, nc);
      output;
   end;
end;
label PDF="Density";
run;

title "PDF of Noncentral t Distributions";
title2 "DF=&DF";
proc sgplot data=ncTPDFSeq;
   series x=x y=PDF / group=nc lineattrs=(thickness=2);
   keylegend / location=inside across=1 title="Noncentral Param" opaque;
   xaxis grid;
   yaxis grid;
run;
The graph shows the density functions for δ = 4, 6, 8, and 12 for a distribution that has ν=6 degrees of freedom. You can see that the modes of the distributions are close to (but a little less than) δ when δ > 0. For negative values of δ, the functions are reflected across x=0. That is, if f(x; ν, δ) is the pdf of the noncentral t distribution with parameter δ, then f(-x; ν, -δ) = f(x; ν, δ).
If you change the PDF call to a CDF call, you obtain a visualization of the cumulative distribution function for various values of the noncentrality parameter, δ.
The quantile function is important in hypothesis testing. The following DATA step finds the quantile that corresponds to an upper-tail probability of 0.05. This would be a critical value in a one-sided hypothesis test where the test statistic is distributed according to a noncentral t distribution.
%let NC = 4;
data CritVal;
do alpha = 0.1, 0.05, 0.01;
   tCritUpper = quantile("T", 1-alpha, &DF, &NC);
   output;
end;
run;

proc print data=CritVal noobs; run;
The table shows the critical values of a noncentral t statistic for a one-sided hypothesis test at the α significance level for α=0.1, 0.05, and 0.01. A test statistic that is larger than the critical value would lead you to reject the null hypothesis at the given significance level.
Although the RAND function in SAS does not support a noncentrality parameter for the t distribution, it is simple to generate random variates. By definition, a noncentral t random variable, T_{ν δ}, is the ratio of a normal variate that has mean δ and unit variance and a scaled chi-distributed variable. That is, if Z ~ N(δ,1) is a normal random variable and V ~ χ^{2}(ν) is an independent chi-squared random variable with ν degrees of freedom, then the ratio T_{ν δ} = Z / sqrt(V / ν) is a random variable from a noncentral t distribution.
/* Rand("t",df) does not support a noncentrality parameter.
   Use the definition instead. */
data ncT;
df = &DF; nc = &NC;
call streaminit(12345);
do i = 1 to 10000;
   z = rand("Normal", nc);   /* Z ~ N(nc, 1) */
   v = rand("chisq", df);    /* V ~ ChiSq(df) */
   t = z / sqrt(v/df);       /* T ~ NCT(df, nc) */
   output;
end;
keep t;
run;

title "Random Sample from Noncentral t distribution";
title2 "DF=&DF; nc=&NC";
proc sgplot data=ncT noautolegend;
   histogram t;
   density t / type=kernel;
   xaxis max=20;
run;
The graph shows a histogram for 10,000 random variates overlaid with a kernel density estimate. The density is very similar to the earlier graph that showed the PDF for the noncentral t distribution with ν=6 degrees of freedom and δ=4.
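The same construction works in any language that can generate normal variates. The following Python sketch (mine, not part of the original SAS program; the function name rand_nct is hypothetical) draws from the definition T = Z / sqrt(V/ν) and compares the sample mean to the theoretical mean of the noncentral t distribution, which is δ·sqrt(ν/2)·Γ((ν−1)/2)/Γ(ν/2) for ν > 1:

```python
import math
import random

def rand_nct(df, nc, rng):
    """One draw from a noncentral t distribution, using the definition
    T = Z / sqrt(V/df), where Z ~ N(nc, 1) and V ~ chi-square(df)."""
    z = rng.gauss(nc, 1.0)
    # for integer df, a chi-square variate is a sum of df squared N(0,1) variates
    v = sum(rng.gauss(0.0, 1.0)**2 for _ in range(df))
    return z / math.sqrt(v / df)

df, nc, n = 6, 4, 100_000
rng = random.Random(12345)
sample_mean = sum(rand_nct(df, nc, rng) for _ in range(n)) / n

# theoretical mean (exists for df > 1): nc * sqrt(df/2) * Gamma((df-1)/2) / Gamma(df/2)
true_mean = nc * math.sqrt(df / 2) * math.gamma((df - 1) / 2) / math.gamma(df / 2)
print(round(sample_mean, 3), round(true_mean, 3))
```

For ν=6 and δ=4, the theoretical mean is about 4.6, and the mean of 100,000 simulated variates should be very close to it.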
The noncentral t distribution is a probability distribution that is used in power analysis and hypothesis testing. You can think of the noncentral t distribution as a skewed t distribution. SAS software supports the noncentral t distribution by using an optional argument in the PDF, CDF, and QUANTILE functions. You can generate random variates by using the definition of a random variable, which is a ratio of a normal variate and a scaled chi-distributed variable.
The post Generate random ID values for subjects in SAS appeared first on The DO Loop.
A common requirement is that the strings be unique. This presents a potential problem. In a set of random values, duplicate values are expected. Think about rolling a six-sided die several times: you would expect to see a duplicate value appear after a few rolls. Even for a larger set of possible values, duplicates arise surprisingly often. For example, if a room contains 23 people, the Birthday Paradox shows that it is likely that two people in the room have the same birthday!
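You can verify the Birthday Paradox claim with a few lines of arithmetic. This Python sketch (mine, not from the original post) computes the probability that at least two of n people share a birthday, assuming 365 equally likely birthdays:

```python
from math import prod

def p_shared_birthday(n):
    """P(at least one shared birthday among n people), assuming 365 equally likely days."""
    # complement of the probability that all n birthdays are distinct
    return 1 - prod((365 - i) / 365 for i in range(n))

print(p_shared_birthday(23))  # just over 1/2
```

The probability first exceeds 1/2 at n = 23, which is the usual statement of the paradox.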
If you generate N random numbers independently, you will likely encounter duplicates. There are several ways to use SAS to achieve both randomness and unique values. Perhaps the easiest way is to start with a long list of unique values, then randomly select N of these values. You can do the selection in either of two ways:
- Generate a random permutation of the entire list, then use the first N values.
- Use PROC SURVEYSELECT to draw a simple random sample of N values without replacement.
A third way to obtain random strings without duplicates is to use a SAS/IML program.
This article shows how to use SAS to generate random four-character strings. You can use the strings as IDs for subjects in a study that requires hiding the identity of the subjects. The article also discusses an important consideration: some four-character strings are uncomplimentary or obscene. I show how to remove vulgar strings from the list of four-character strings so that no patient is assigned an ID such as 'SCUM' or 'SPAZ'.
I have previously shown that you can associate each string of English characters with a nonnegative integer. For most applications, it suffices to consider four-character strings because there are 26^{4} = 456,976 strings that contain four English characters from the set {A, B, C, ..., Z}. Thus, I will use four-character strings in this article. (If you use five-character strings, you can assign up to 26^{5} = 11,881,376 unique IDs.)
The following SAS DATA step outputs the integers 0 through 26^{4}-1. For each integer, the program also creates a four-character string. Small integers are padded with "leading zeros," so that the integers 0, 1, 2, ..., 456975, are associated with the base 26 strings AAAA, AAAB, AAAC, ..., ZZZZ. See the previous article to understand how the program works.
/* Step 1: Generate all integers in the range [0, 26^k - 1] for k=4.
   See https://blogs.sas.com/content/iml/2022/09/14/base-26-integer-string.html
   Thanks to KSharp for the suggestion to use the RANK and BYTE functions:
      rank('A') = 65 in ASCII order
      byte(rank('A') + j) is the ASCII character for the j_th capital letter, j=0,1,...,25
*/
%let nChar = 4;                /* number of characters in base 26 string */
data AllIDs;
array c[0:%eval(&nChar-1)] _temporary_;  /* integer coefficients c[0], c[1], ... */
length ID $&nChar;             /* string for base b */
b = 26;                        /* base b = 26 */
offset = rank('A');            /* = 65 for ASCII order */
do nID = 0 to b**&nChar - 1;
   /* compute the coefficients that represent nID in Base b */
   y = nID;
   do k = 0 to &nChar - 1;
      c[k] = mod(y, b);        /* remainder when y is divided by b */
      y = floor(y / b);        /* whole part of division */
      substr(ID,&nChar-k,1) = byte(offset+c[k]);  /* represent coefficients as string */
   end;
   /* Some strings are vulgar. Exclude strings on the denylist. */
   if ID not in
   ( 'CRAP','DAMN','DUMB','HELL','PUKE','SCUM','SLUT','SPAZ'  /* add F**K, S**T, etc */
   ) then output;
end;
drop b offset y k;
run;
The table shows the first few and the last few observations in the data set. The ID column contains all four-character strings of English letters, except for words on a denylist. The list of objectionable words can be quite long, so I included only a few words as an example. I left out the F-bomb, other vulgar terms, and racist slurs. The appendix contains a more complete denylist of four-letter words.
The first step created a list of unique four-character ID values. The second step is to randomly select N elements from this list, where N is the number of subjects in your study that need a random ID. This section shows how to perform random selection by permuting the entire list and selecting the first N rows of the permuted data.
The following DATA step generates a random uniform number for each ID value. A call to PROC SORT then sorts the data by the random values. The resulting data set, RANDPERM, has randomly ordered ID values.
/* Option 1: Generate a random permutation of the IDs */
data RandPerm;
set AllIDs;
call streaminit(12345);
_u = rand("uniform");
run;
/* sort by the random variate */
proc sort data=RandPerm;
   by _u;
run;

/* proc print data=RandPerm(obs=10) noobs; var nID ID; run; */
You can use PROC PRINT to see that the ID values are randomly ordered. Because we started with a list of unique values, the permuted IDs are also unique.
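The permute-and-take-the-first-N scheme is equivalent to drawing a simple random sample without replacement. In Python, for example, the standard library does this in one call (the short ID list here is a toy example, not from the post):

```python
import random

ids = ["AAAA", "AAAB", "AAAC", "AAAD", "AAAE", "AAAF", "AAAG", "AAAH"]  # toy list of unique IDs
rng = random.Random(12345)

N = 3
chosen = rng.sample(ids, N)   # N distinct IDs, returned in random order
print(chosen)
```

Because the population contains unique values and the sampling is without replacement, the selected IDs are guaranteed to be unique.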
You can now merge the first N random IDs with your data. For example, suppose I want to assign an ID to each student in the Sashelp.Class data set. The following program counts how many subjects (N) are in the data, and merges the first N random IDs with the data:
/* Count how many rows (N) are in the input data set */
%let dsName = sashelp.class;
proc sql noprint;
   select count(*) into :N from &DSName;
quit;

data All;
merge &DSName RandPerm(keep=ID obs=&N);
run;

proc print data=All(obs=8) noobs label;
   var ID Name _NUMERIC_;
run;
The output shows that each student is assigned a random four-character ID.
The previous section uses only Base SAS routines: The DATA step and PROC SQL. An alternative approach is to extract N random IDs by using PROC SURVEYSELECT in SAS/STAT software. In this approach, you do not need to generate random numbers and manually sort the IDs. Instead, you extract the random values by using PROC SURVEYSELECT. As of SAS 9.4 M5, the SURVEYSELECT procedure supports the OUTRANDOM option, which causes the selected items in a simple random sample to be randomly permuted after they are selected. Thus, an easier way to assign random IDs is to count the number of subjects, randomly select that many ID values from the (unsorted) set of all IDs, and then merge the results:
/* Option 2: Extract random IDs and merge */
%let dsName = sashelp.class;
proc sql noprint;
   select count(*) into :N from &DSName;
quit;

proc surveyselect data=AllIDs out=RandPerm noprint seed=54321
                  method=srs    /* sample w/o replacement */
                  OUTRANDOM     /* SAS 9.4M5 supports the OUTRANDOM option */
                  sampsize=&N;  /* number of observations in sample */
run;

data All;
merge &DSName RandPerm(keep=ID);
run;

proc print data=All(obs=8) noobs label;
   var ID Name _NUMERIC_;
run;
The table shows the ID values that were assigned to the first few subjects.
Both previous methods assign random strings of letters to subjects. However, I want to mention a third alternative because it is so compact. You can write a SAS/IML program that performs the following steps:
- Generate all four-character strings of English letters.
- Remove the strings that appear on a denylist.
- Use the SAMPLE function to select N strings at random, without replacement.
- Write the selected strings to a SAS data set.
As usual, the SAS/IML version is very compact. It requires about a dozen lines of code to generate the IDs and write them to a data set for merging with the subject data:
/* Generate N random 4-character strings (remove words on denylist) */
proc iml;
letters = 'A':'Z';                                    /* all English letters */
L4 = expandgrid(letters, letters, letters, letters);  /* 26^4 combinations */
strings = rowcat(L4);                                 /* concatenate into strings */
free L4;                                              /* done with matrix; delete */
deny = {'CRAP','DAMN','DUMB','HELL','PUKE','SCUM','SLUT','SPAZ'}; /* add F**K, S**T, etc */
idx = loc( ^element(strings, deny) );    /* indices of strings NOT on denylist */
ALLID = strings[idx];
call randseed(1234);
ID = sample(ALLID, &N, "WOR");           /* random sample without replacement */
create RandPerm var "ID"; append; close; /* write IDs to data set */
QUIT;

/* merge data and random ID values */
data All;
merge &DSName RandPerm;
run;

proc print data=All(obs=8) noobs label;
   var ID Name _NUMERIC_;
run;
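If you do not have SAS/IML, the same generate-filter-sample pipeline is easy to sketch in a general-purpose language. The following Python version (mine, not part of the post) uses the same short denylist:

```python
import random
from itertools import product
from string import ascii_uppercase

# build all 26^4 four-character strings, drop the denylist, then sample
deny = {'CRAP','DAMN','DUMB','HELL','PUKE','SCUM','SLUT','SPAZ'}
all_ids = ("".join(t) for t in product(ascii_uppercase, repeat=4))
clean_ids = [s for s in all_ids if s not in deny]

rng = random.Random(1234)
sample = rng.sample(clean_ids, 19)   # e.g., one ID per student in Sashelp.Class
print(len(clean_ids), sample[:3])
```

The filtered list has 26^4 − 8 = 456,968 strings, and the sample is guaranteed to avoid the denylist because the objectionable strings are removed before sampling.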
This article shows how to assign a random string value to subjects in an experiment. The best way to do this is to start with a set of unique strings. Make sure there are no objectionable words in the set! Then you can extract a random set of N strings by permuting the set or by using PROC SURVEYSELECT to perform simple random sampling (SRS). In the SAS/IML language, you can generate all strings and extract a random subset by using only a dozen lines of code.
In a DATA step, I used base 26 to generate the list of all four-character strings. By changing the definition of the nChar macro variable, this program also works for character strings of other lengths. If you set the macro to the value k, the program will generate 26^{k} strings of length k.
Of the three methods, the first (DATA step and PROC SORT) is the simplest. It can also easily handle new subjects who are added to the study after the initial assignment of IDs. You can simply assign the next unused random IDs to the new subjects.
This appendix shows a list of objectionable words. You do not want to assign a patient ID that is obscene, profane, insulting, racist, sexist, or crude. I used several web pages to compile a list of four-letter words that some people might find objectionable. You can use an IF-THEN statement to exclude the following list of four-letter English words:
/* some words on this list are from
   https://www.wired.com/2016/09/science-swear-words-warning-nsfw-af/
   https://en.everybodywiki.com/List_of_profane_words
*/
if ID not in
( 'ANAL','ANUS','ARSE','BOOB','BUNG','BUTT','CLIT','COCK','COON','CRAP',
  'CUMS','CUNT','DAMN','DICK','DONG','DUMB','DUMP','DYKE','FAGS','FUCK',
  'GOOK','HEBE','HELL','HOMO','JEEZ','JERK','JISM','JIZZ','JUGS','KIKE',
  'KNOB','KUNT','MICK','MILF','MOFO','MONG','MUFF','NADS','PAKI','PISS',
  'POON','POOP','PORN','PUBE','PUKE','PUSS','PUTO','QUIM','SCUM','SHAG',
  'SHAT','SHIT','SLAG','SLUT','SMEG','SPAZ','SPIC','SUCK','TARD','THOT',
  'TOSS','TURD','TWIT','TWAT','WANK','WANG'
) then output;
The post Base 26: A mapping from integers to strings appeared first on The DO Loop.
For any base b, you can express an integer as a sum of powers of b, such as
\(\sum\nolimits_{i=0}^p {c_i} b^i\)
where the \(c_i\) are integers \(0 \leq c_i < b\).
From the coefficients, you can build a string that represents the number in the given base.
Traditionally, we use the symbols 0, 1, ..., 9 to represent the first 10 coefficients and use letters of the English alphabet for higher coefficients. However, in base 26, it makes sense to break with tradition and use English characters for all coefficients. This is done in a natural way by associating the symbols {A, B, C, ..., Z} with the coefficient values {0, 1, 2, ..., 25}.
Notice that the symbols for the coefficients in base 26 are zero-based (A=0) rather than one-based (A≠1), which is different from what you might have seen in other applications.
For example, the number 2398 (base 10) can be written as the sum 3*26^{2} + 14*26^{1} + 6*26^{0}. If you use English letters to represent the coefficients, then this number equals DOG (base 26) because 3→D, 14→O, and 6→G. In a similar way, the number 1371 (base 10) can be written as 2*26^{2} + 0*26^{1} + 19*26^{0}, which equals CAT (base 26) because 2→C, 0→A, and 19→T.
Recall that for base 10 numbers, we typically do not write the numbers with leading zeros. For example, when considering three-digit numbers in base 10, we do not write the numbers 0-99. But if we use leading zeros, we can write these integers as three-digit numbers: 000, 001, 002, ..., 099. In a similar way, you can represent all three-character strings in base 26 (such as AAA, ABC, and ANT) if you include one or more leading zeros. In base 26, a "leading zero" means that the string starts with A. Unfortunately, if you include leading zeros, you lose a unique representation of the integers because A = AA = AAA, and similarly Z = AZ = AAZ. However, it is a small price to pay. To represent character strings that start with the letter A, you must allow leading zeros.
It is straightforward to adapt the SAS DATA step program in my previous article to base 26. (See the previous article for an explanation of the algorithm.) In this version, I represent each integer in the range [0, 17575] in base 26 by using a three-digit string. The number 17575 (base 10) is the largest integer that can be represented by using a three-digit string because 17575 (base 10) = 25*26^{2} + 25*26^{1} + 25*26^{0} = ZZZ (base 26).
The following statements put a few integers in the range [0, 17575] into a SAS data set:
/* Example data: Base 10 integers in the range [0, 17575] */
data Base10;
input x @@;    /* x >= 0 */
datalines;
0 25 28 17575 16197 13030 1371 341 11511 903 13030 2398
;
The following DATA step converts these integers to three-character strings in base 26.
/* For simplicity, only consider three-digit strings. The strings will contain
   'leading zeros', which means strings like ABC (base 26) = 28 (base 10).
   Three-digit strings correspond to integers 0 - 17575. */
%let maxCoef = 3;    /* number of characters in string that represents the number */
%let base = 26;      /* base for the representation */
%let valueList = ('A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M'
                  'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z');

data Base26;
array values[&base] $ _temporary_ &valueList;  /* characters to use to encode values */
array c[0:%eval(&maxCoef-1)] _temporary_;      /* integer coefficients c[0], c[1], ... */
length ID $&maxCoef;       /* string for base b */
b = &base;                 /* base for representation */
set Base10;                /* x is a positive integer; represent in base b */
/* compute the coefficients that represent x in Base b */
y = x;
do k = 0 to &maxCoef-1;
   c[k] = mod(y, b);       /* remainder when y is divided by b */
   y = floor(y / b);       /* whole part of division */
   substr(ID,&maxCoef-k,1) = values[c[k]+1];   /* represent coefficients as string */
end;
drop b y k;
run;

proc print data=Base26 noobs label;
   label x="Base 10" ID="Base 26";
   var x ID;
run;
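The same algorithm is compact in other languages. Here is a hypothetical Python version (the function name int_to_base26 is mine, not from the post) that uses repeated division by 26 and pads with the letter A, which plays the role of a leading zero:

```python
def int_to_base26(n, width=3):
    """Represent a nonnegative integer as a base 26 string with A=0, ..., Z=25,
    padded with 'leading zeros' (the letter A) to the given width."""
    chars = []
    for _ in range(width):
        n, r = divmod(n, 26)               # r is the remainder when n is divided by 26
        chars.append(chr(ord('A') + r))    # map the coefficient 0-25 to a letter
    return "".join(reversed(chars))        # coefficients were computed low-order first

print(int_to_base26(2398), int_to_base26(1371), int_to_base26(17575))
```

As in the DATA step, 2398 maps to DOG, 1371 maps to CAT, and 17575 maps to ZZZ.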
After converting a few arbitrary base 10 numbers to base 26, I show a few special numbers that correspond to three-character English words. The last seven numbers in the data set are integers that, when represented in base 26, correspond to the sentence THE CAT AND RAT BIT THE DOG!
In the previous section, I used base 10 numbers that converted to a complete English sentence: THE CAT AND RAT BIT THE DOG. Obviously, I started with the sentence that I wanted to "find," and then figured out which decimal digits would produce the sentence. In other words, I started with the base 26 representation and computed the base 10 (decimal) representation.
For compactness, I will use SAS/IML software to compute the integer value for each base 26 representation. You could use the DATA step, if you prefer. The SAS/IML language supports a little-known feature (the MATTRIB statement) that enables you to index a matrix by using a string instead of an integer. This enables you to put the decimal numbers 0-25 into an array and index them according to the corresponding base 26 characters. This feature is demonstrated in the following example:
proc iml;
n = 0:25;                /* decimal values */
L = "A":"Z";             /* base 26 symbols */
mattrib n[colname=L];    /* enable indexing n by the letters A-Z */
C = n["C"]; A = n["A"]; T = n["T"];
print C A T;
D = n["D"]; O = n["O"]; G = n["G"];
print D O G;
The output shows that an expression like n['D'] results in the numerical value of the symbol 'D'. For any base 26 string, you can use the SUBSTR function to extract each character in the string. You can use the MATTRIB trick to find the corresponding base 10 value. You can use the position of each character and the definition of base 26 to find the integer represented by each string:
/* convert base 26 strings to integers */
str = {"AAA" "AAZ" "ABC" "ZZZ" "XYZ" "THE" "CAT" "AND" "RAT" "BIT" "THE" "DOG"}`;
Decimal = j(nrow(str),1);
do j = 1 to nrow(str);
   s = str[j];                /* get the j_th string */
   k = length(s);             /* how many characters? */
   x = 0;
   do i = 0 to k-1;           /* for each character in the string */
      c = substr(s, k-i, 1);  /* extract the character at position k-i */
      x = x + n[c]*26**i;     /* use n[c] as the coefficient in the base 26 representation */
   end;
   Decimal[j] = x;            /* save the integer value for the j_th string */
end;
print str[L="Base 26"] Decimal;
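The inverse conversion can also be written by processing the characters from left to right and multiplying the accumulated value by 26 at each step (Horner's method). A Python sketch (the function name is mine, not from the post):

```python
def base26_to_int(s):
    """Invert the base 26 mapping, using A=0, B=1, ..., Z=25."""
    x = 0
    for ch in s:
        x = x * 26 + (ord(ch) - ord('A'))   # Horner's method: shift, then add coefficient
    return x

for word in ["AAA", "ABC", "CAT", "DOG", "ZZZ"]:
    print(word, base26_to_int(word))
```

This confirms the values in the article: CAT is 1371 (base 10) and DOG is 2398 (base 10).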
The output shows the base 10 representation for a set of three-digit base 26 strings. These are the values that I used in the first part of this article. (NOTE: You can vectorize this program and eliminate the inner loop! I leave that as an exercise.)
This article shows a fun fact: You can use base 26 to associate an integer to every string of English characters. In base 26, the string 'CAT' is associated with 2398 (base 10) and the string 'DOG' is associated with the number 1371 (base 10). This article uses three-digit character strings to demonstrate the method, but the algorithms apply to character strings that contain an arbitrary number of characters.
The post Convert integers from base 10 to another base appeared first on The DO Loop.
This process is commonly called "converting" an integer from base 10 to base b, but that is a slight misnomer. An integer has an intrinsic value regardless of the base that you use to represent it. We aren't converting from base 10, although we typically use base 10 to input the number into a computer. I prefer to use the term "represent the integer" when discussing the process of writing it in a specified base.
The most common way to represent an integer is to use base 10, which
represents each positive integer as a sum
\(x = \sum\nolimits_{i=0}^p {c_i} 10^i\)
where the \(c_i\) are integers \(0 \leq c_i < 10\).
Notice that this sum of powers looks like a polynomial, so the c_{i} are often called coefficients. (If you require the leading coefficient to be nonzero, the representation is unique.)
For example, the number 675 (base 10) can be represented as 6*10^{2} + 7*10^{1} + 5*10^{0}.
So, the ordered tuple (6,7,5) represents the integer in base 10. We usually concatenate the tuple values and simply write 675, and optionally add "base 10" if the base is unclear.
We call this representation "base 10" because it represents a number as a sum of powers of 10. The phrase "change the base" means that we represent the same number as a sum of powers of some other positive integer. For example, the number 675 (base 10) can be represented as 1243 (base 8) because 675 = 1*8^{3} + 2*8^{2} + 4*8^{1} + 3*8^{0}. Consequently, the tuple (1,2,4,3) represents the integer in base 8. Equivalently, the coefficients in base 8 are (1,2,4,3).
There is a simple iterative algorithm to represent a positive integer, x, in any base b ≥ 2. The steps in the algorithm are indexed by an integer i=0, 1, 2, ..., k-1, where k is the smallest integer such that x < b^{k}.
The algorithm works by rewriting the "sum of powers" by using Horner's method. For example, the sum
1*8^{3} + 2*8^{2} + 4*8^{1} + 3*8^{0}
can be rewritten as
((1*8 + 2)*8 + 4)*8 + 3.
In this representation, each nested term is some number times 8 (the base) plus a remainder.
For a general base, you can write the Horner sum as
(...((c[k-1]*b + c[k-2])*b + c[k-3])*b + ... + c[1])*b + c[0]
The mathematical algorithm exploits Horner's representation, as follows:
- Initialize y = x.
- At step i, compute c[i] = mod(y, b), which is the remainder when y is divided by b. Then replace y with floor(y / b), which is the whole part of the division.
- Repeat until y = 0.
For example, let's represent the number 675 (base 10) in base 8. Use x=675 and b=8 as the values in the formulas. The algorithm is as follows:
- y = 675: c[0] = mod(675, 8) = 3, and y becomes floor(675/8) = 84.
- y = 84: c[1] = mod(84, 8) = 4, and y becomes floor(84/8) = 10.
- y = 10: c[2] = mod(10, 8) = 2, and y becomes floor(10/8) = 1.
- y = 1: c[3] = mod(1, 8) = 1, and y becomes floor(1/8) = 0, so the algorithm stops.
Reading the coefficients from highest to lowest gives (1,2,4,3), which is the representation 1243 (base 8).
This process is summarized in the following table for base 8 and the base-10 integer 675:
From the coefficients, you can build a string (such as '1243') that represents the number in the given base. For b ≤ 10, the symbols 0, 1, ..., b-1 are used to represent the coefficients. For 10 < b ≤ 36, the letters of the English alphabet represent the higher coefficients: A=10, B=11, C=12, ..., Z=35. The SAS program in the next section uses these symbols to represent the coefficients.
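The repeated-division algorithm can be sketched compactly in other languages as well. Here is a hypothetical Python version (the function name to_base is mine) that uses the same symbol set for bases 2-36:

```python
def to_base(x, b, digits="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Represent a nonnegative integer x in base b (2 <= b <= 36) by repeated division."""
    if x == 0:
        return digits[0]
    s = ""
    while x > 0:
        x, r = divmod(x, b)   # r is the remainder when x is divided by b
        s = digits[r] + s     # prepend the symbol for the coefficient
    return s

print(to_base(675, 8), to_base(675, 16), to_base(31, 2), to_base(32, 2))
```

For example, to_base(675, 8) returns the string '1243', which matches the worked example above.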
Let's put a set of base-10 numbers into a data set:
/* Example data: Base 10 integers to convert to other bases */
data Base10;
input x @@;    /* x >= 0 */
datalines;
2 3 4 5 8 15 31 32 33 49 63 675
;
Because we want to store the representation in a character variable, we set the length of the variable by using the macro variable MAXCOEF=32. A string of length 32 is usually sufficient for representing a wide range of integers. The VALUELIST macro variable defines the set of characters to use to represent each integer as a string for bases b=2, 3, ..., 36. If you want to use a larger base, extend this set of values.
%let maxCoef = 32;   /* how many characters in a string that represents the number? */
%let valueList = ('0' '1' '2' '3' '4' '5' '6' '7' '8' '9'
                  'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M'
                  'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z');
The following SAS DATA step implements the previous algorithm to represent the numbers in a new base. Let's begin by representing the numbers in binary (base 2).
%let base = 2;       /* positive integer. Usually 2 <= base <= 36 */
data NewBase;
array values[36] $ _temporary_ &valueList;  /* characters to use to encode values */
array c[0:%eval(&maxCoef-1)] _temporary_;   /* integer coefficients c[0], c[1], ... */
length s $&maxCoef;  /* string for base b */
b = &base;           /* base for representation */
set Base10;          /* x is a positive integer; represent in base b */
/* compute the coefficients that represent x in Base b */
y = x;
do k = 0 to &maxCoef while(y>0);
   c[k] = mod(y, b); /* remainder when y is divided by b */
   y = floor(y / b); /* whole part of division */
   substr(s,&maxCoef-k,1) = values[c[k]+1];  /* represent coefficients as string */
end;
keep s x k b;
run;

proc print data=NewBase noobs label;
   label x="Base 10" s="Base &base";
   var x s k;
run;
For each positive integer, x, the variable k specifies how many characters are required to represent the number in base 2. (This assumes that you do not want to use leading zeros in the representation.) For example, 31 (base 10) = 11111 (base 2) requires five characters, whereas 32 (base 10) = 100000 (base 2) requires six characters. The program sets MAXCOEF = 32, which means that this program can compute the binary representation of any number up to 2^{32} – 1 = 4,294,967,295 (base 10).
To represent a positive integer in a different base, simply redefine the BASE macro variable and rerun the program. For example, the following statements enable you to convert the numbers to base 8:
%let base = 8;    /* positive integer 2 <= base <= 36 */
The last row of the table shows that 675 (base 10) = 1243 (base 8), as was shown previously.
For hexadecimal (base 16), you can use the letters 'A' through 'F' to represent the larger values of the coefficients:
%let base = 16;   /* positive integer 2 <= base <= 36 */
The hexadecimal numbers might be familiar to statistical programmers who use hexadecimal values for RGB colors in statistical graphics. In many programming languages, you can specify a color as a hexadecimal value such as 0x1F313F. The first two digits ('1F') specify the amount of red in the color, the next two digits ('31') specify the amount of green, and the last two digits specify the amount of blue. As an integer, this number is 2,044,223 (base 10). You might have seen an advertisement for a computer monitor that claims that the monitor "displays more than 16 million colors." That number is used because FFFFFF (base 16) = 16,777,215 (base 10). In other words, if each of the red, green, and blue components can display 256 levels, the total number of colors is more than 16 million.
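You can verify this arithmetic directly. The following Python lines (not from the original post) check the integer value of the color 0x1F313F, extract its red, green, and blue bytes, and confirm the 16-million-color claim:

```python
rgb = 0x1F313F
# shift and mask to recover the three color components
r, g, b = rgb >> 16, (rgb >> 8) & 0xFF, rgb & 0xFF
print(rgb)                          # the color as a base 10 integer
print(f"{r:02X} {g:02X} {b:02X}")   # the red, green, and blue bytes
print(0xFFFFFF)                     # the largest 24-bit color value
```

The shifts recover the components 1F, 31, and 3F, and 0xFFFFFF equals 16,777,215 in base 10.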
This article shows an algorithm that enables you to represent a positive integer in any base. This is commonly called "converting from base 10," although that is a misnomer. At each step of the algorithm, you use division by the base to find the integer part and remainder of an integer. This algorithm is demonstrated by using the SAS DATA step. As written, the program supports bases 2–36.
]]>The post A test for monotonic sequences and functions appeared first on The DO Loop.
In a recent project, I needed to determine whether a certain function is monotonic increasing. Because the function is only known at a finite sequence of values, the reduced problem is to determine whether a sequence is increasing. I have previously shown that you can use the DIF function in SAS to test for an increasing sequence. This article shows how to apply the method to the problem of deciding whether a function is increasing.
There are two types of monotonicity. In a weakly monotonic sequence, adjacent terms can be equal. That is, the difference between adjacent terms is allowed to be 0. In a strongly monotonic sequence (also called strictly monotonic), adjacent terms cannot be equal. For a strictly increasing sequence, the difference between adjacent terms is always positive. For a strictly decreasing sequence, the difference between adjacent terms is always negative. By itself, the term monotonic means the sequence is either increasing or decreasing.
Thus, there are four different tests for monotonicity. You can test whether a sequence is weakly increasing, strictly increasing, weakly decreasing, or strictly decreasing.
The following SAS/IML module evaluates a vector and tests whether the elements are increasing. It uses the DIF function to produce a vector that contains the difference between adjacent elements of the input vector. That is, if x = {x1, x2, x3, ..., xn} is the input vector, then d = dif(x,1,1) is the vector {x2-x1, x3-x2, ..., xn-x(n-1)}. The second argument specifies the lag. The third argument (which is available in SAS 9.4M5) is a flag that specifies whether the result is padded with missing values. (See the Appendix for more information.) By default, the function tests for a weakly increasing sequence.
proc iml;
/* Test whether a sequence of elements is monotonic increasing.
   Valid options are
   strict=0 : (Default) Return 1 if a sequence is nondecreasing
   strict=1 : Return 1 if a sequence is strictly increasing
*/
start IsIncr(_x, strict=0);
   x = colvec(_x);
   if nrow(x)=1 then return(1);  /* one element is always monotonic! */
   d = dif(x,1,1);               /* lag=1; delete initial missing value */
   if strict then return( all(d > 0) );
   return( all(d >= 0) );
finish;

/* test whether sequences are increasing */
x = {0,2,2,2,6,7,9};
y = {0,1,3,4,6,7,9};
z = {0,1,3,4,2,7,9};
b1 = IsIncr(x);      /* test weakly increasing */
b2 = IsIncr(x, 1);   /* test strictly increasing */
b3 = IsIncr(y, 1);   /* test strictly increasing */
b4 = IsIncr(z);      /* test weakly increasing */
print b1 b2 b3 b4;
The IsIncr function is called four times:
- b1 = IsIncr(x) is 1 because x is weakly increasing; the repeated value (2) is allowed.
- b2 = IsIncr(x, 1) is 0 because the tied values mean that x is not strictly increasing.
- b3 = IsIncr(y, 1) is 1 because y is strictly increasing.
- b4 = IsIncr(z) is 0 because z is not increasing; the value 2 follows the value 4.
If a sequence {x1, x2, x3, ...} is monotone increasing, then the sequence obtained by multiplying each element by -1 is monotone decreasing. Therefore, it is trivial to write a function that tests whether a sequence is decreasing: simply test whether the negative of the sequence is increasing! This is accomplished by the following SAS/IML function:
/* Test whether a sequence of elements is monotonic decreasing.
   strict=0 : (Default) Return 1 if a sequence is nonincreasing
   strict=1 : Return 1 if a sequence is strictly decreasing
*/
start IsDecr(x, strict=0);
   return IsIncr(-x, strict);
finish;

/* test whether sequence is decreasing */
u = {9,8,7,7,6,2,0};
b5 = IsDecr(u);       /* test weakly decreasing */
b6 = IsDecr(u, 1);    /* test strictly decreasing */
print b5 b6;
The sequence is weakly decreasing but not strictly decreasing. The first call to the IsDecr function returns 1. The second call returns 0.
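The same difference-based logic is easy to express in other languages. As a cross-check of the idea (this is my own illustration, not part of the SAS program), here is a Python sketch that mirrors the IsIncr and IsDecr modules:

```python
def is_incr(x, strict=False):
    """Return True if the sequence x is increasing (weakly by default)."""
    if len(x) <= 1:
        return True                            # one element is always monotonic
    d = [b - a for a, b in zip(x, x[1:])]      # adjacent differences, like DIF(x,1,1)
    if strict:
        return all(v > 0 for v in d)
    return all(v >= 0 for v in d)

def is_decr(x, strict=False):
    """A sequence is decreasing iff its negation is increasing."""
    return is_incr([-v for v in x], strict)

u = [9, 8, 7, 7, 6, 2, 0]
print(is_decr(u))        # weakly decreasing
print(is_decr(u, True))  # not strictly decreasing (repeated values)
```

As in the SAS/IML version, the decreasing test simply negates the sequence and reuses the increasing test.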
I used the IsIncr function to address a general question: Can you detect whether an unknown univariate function is monotonic increasing?
In a recent project, I was modeling the cumulative distribution function (CDF) of an unknown continuous distribution. By definition, a CDF must be increasing, but for some values of the parameters, the model is not an increasing function. I needed to identify these "infeasible" parameter values in a quick and reliable manner.
Mathematically, an increasing function, F, has the property that for every increasing sequence, {x_i}, the image of that sequence, {F(x_i)}, is also increasing. Numerically, you can generate an increasing sequence in the domain and ask whether the image of that sequence is also increasing.
Here's an example. Suppose you want to test whether the following functions are increasing:
F_{1}(x) = (5 - 4*x) log( x/(1-x) )
F_{2}(x) = (5 - 5.2*x) log( x/(1-x) )
The functions are defined on the domain (0, 1). The following program defines an increasing sequence on (0,1) and tests whether the image of the sequence is also increasing for F_{1} and for F_{2}:
start Func1(x);
   return (5 - 4*x)#log( x/(1-x) );
finish;
start Func2(x);
   return (5 - 5.2*x)#log( x/(1-x) );
finish;

dt = 0.005;
x = do(dt, 1-dt, dt);   /* increasing sequence on a fine grid */
y1 = Func1(x);          /* image of sequence under F1 */
y2 = Func2(x);          /* image of sequence under F2 */
b1 = IsIncr(y1);
b2 = IsIncr(y2);
print b1 b2;
The output indicates that the first function is increasing, but the second function is not. The following graph of the second function shows, indeed, that the function is not strictly increasing.
The accuracy of this method depends on choosing a sequence on a fine grid of points in the domain. It assumes that the derivative of the function is bounded so that the function cannot quickly increase and decrease within one of the small gaps between consecutive elements of the sequence.
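For readers who want to try the grid-based test outside of SAS, here is a hedged Python sketch of the same experiment (my own illustration; the function definitions and grid spacing follow the SAS program above):

```python
import math

def is_incr(seq):
    """Weak monotonicity test: all adjacent differences are nonnegative."""
    return all(b >= a for a, b in zip(seq, seq[1:]))

F1 = lambda x: (5 - 4*x)   * math.log(x / (1 - x))
F2 = lambda x: (5 - 5.2*x) * math.log(x / (1 - x))

dt = 0.005
grid = [i * dt for i in range(1, 200)]     # 0.005, 0.010, ..., 0.995
print(is_incr([F1(x) for x in grid]))      # True: F1 appears increasing
print(is_incr([F2(x) for x in grid]))      # False: F2 decreases near x=1
```

As in the SAS program, the test declares F1 to be (weakly) increasing on the grid, whereas F2 fails the test because it decreases for x near 1.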
This article shows a way to test whether a function is monotonic on an interval. A common application is to test for an increasing function, but it is equally easy to test for a decreasing function. You can test whether the function is weakly increasing or strictly increasing.
These numerical tests evaluate a function at a finite set of points. Consequently, they can rule out the hypothesis of monotonicity, but they do not prove that the function is increasing everywhere. Nevertheless, these tests work well in practice if you choose the input sequence on a fine grid in the domain. If you have a mathematical bound on the derivative of the function, you can say more.
By default, the DIF function in SAS/IML returns a vector that is the same size as the input vector. If k is the lag parameter, the first k elements are missing values. The differences between elements are contained in the elements k+1, k+2, and so forth. If you specify the value 1 for the third argument, the DIF function deletes the leading missing values. The result is a vector that has n – k elements, where n is the length of the input vector and k is the lag parameter.
For example, the following example passes in k=1 and deletes the leading missing value for the result:
t = {0,3,3,2,7};    /* input argument has 5 elements */
d = dif(t, 1, 1);   /* LAG=1; delete leading missing value */
print d;            /* result has 4 elements */
The post A test for monotonic sequences and functions appeared first on The DO Loop.
The post Two types of syntax for the SELECT-WHEN statement in SAS appeared first on The DO Loop.
A previous article shows the first syntax for the SELECT-WHEN statement and includes several examples. However, I will provide only a simple example here. The first syntax requires that you specify an expression on the SELECT statement. On the WHEN statement, you specify one or more values that the expression can take. Following the WHEN statement, you specify a statement that you want to perform when the WHEN expression is true. If you want to perform more than one statement, you can use a DO-END block to perform multiple operations.
It sounds more difficult than it is, so let's see an example. The following SAS data set contains dates in 2020. Some of them are well-known religious holidays, others are US holidays, and others are "personal holidays" that I like to celebrate with my friends, such as Leap Day, Pi Day, Cinco de Mayo, and Halloween. The data set contains the date of the holiday and a name. You can use a SELECT-WHEN statement to classify each holiday as Religious, USA, or Fun (personal), as follows:
data Dates;
length Holiday $11;
input Holiday 1-12 Date date9.;
format Date date9.;
datalines;
MLK         20JAN2020
Leap        29FEB2020
Pi          14MAR2020
StPatrick   17MAR2020
Easter      12APR2020
CincoDeMayo 05MAY2020
Memorial    25MAY2020
Labor       07SEP2020
Halloween   31OCT2020
Christmas   25DEC2020
;

data Holidays;
length Type $12;
set Dates;
select(upcase(Holiday));                       /* specify expression on the SELECT statement */
   when('MLK','MEMORIAL','LABOR') Type="USA";  /* handle possible values of expression */
   when('EASTER','CHRISTMAS') Type="Religious";
   otherwise Type="Fun";                       /* handle any value not previously handled */
end;
run;

proc print data=Holidays noobs;
   var Date Holiday Type;
run;
Notice that the SELECT statement branches on the expression UPCASE(Holiday). The new data set has a Type variable that classifies each holiday according to the value of UPCASE(Holiday).
In the SELECT-WHEN syntax, the OTHERWISE statement is optional but is recommended. If you do not have an OTHERWISE statement and all the WHEN statements are false, the DATA step will stop with an error. You can use a null statement after the OTHERWISE keyword to specify that the program should do nothing for unhandled values, as follows:
OTHERWISE ; /* semicolon ==> null statement ==> do nothing */
By the way, if you have only the dates, you can still form the Type variable by using the HOLIDAYNAME function in SAS. This is left as an exercise.
As stated previously, the SELECT-WHEN statement provides the same functionality as an IF-THEN/ELSE statement, but it is a simpler way to handle multiple values. For example, the equivalent IF-THEN/ELSE logic is as follows:
if upcase(Holiday)='MLK' | upcase(Holiday)='MEMORIAL' | upcase(Holiday)='LABOR' then
   Type="USA";
else if ...
The first SELECT-WHEN syntax enables you to perform an action for any value of an expression. Usually, you will branch on values of a discrete variable.
You can use the second SELECT-WHEN syntax when you need more complex logical processing. In the second syntax, you do not specify an expression on the SELECT statement. Instead, you specify a logical expression on each WHEN statement. The WHEN statements are executed in order until one is true. The corresponding action is then executed. If no WHEN statements are true, the OTHERWISE statement is processed.
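For readers more familiar with other languages, the two styles are analogous to the two common ways to write multiway branches: branching on the values of a single expression versus evaluating a separate logical condition in each branch. Here is a hedged Python analogy (my own sketch; the holiday names come from the example above):

```python
def classify(holiday: str) -> str:
    # Style 1 analogue: branch on the values of a single expression,
    # like SELECT(upcase(Holiday)) with WHEN value lists
    mapping = {
        ('MLK', 'MEMORIAL', 'LABOR'): 'USA',
        ('EASTER', 'CHRISTMAS'): 'Religious',
    }
    h = holiday.upper()
    for values, label in mapping.items():
        if h in values:
            return label
    return 'Fun'          # like the OTHERWISE statement

def classify2(holiday: str) -> str:
    # Style 2 analogue: a separate logical condition in each branch,
    # like SELECT; with a logical expression on each WHEN statement
    h = holiday.upper()
    if h in ('MLK', 'MEMORIAL', 'LABOR'):
        return 'USA'
    if h in ('EASTER', 'CHRISTMAS'):
        return 'Religious'
    return 'Fun'

print(classify('Easter'))    # Religious
print(classify2('Pi'))       # Fun
```

The first style is compact when every branch tests the same expression; the second style is more flexible because each branch can test anything.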
The previous example uses the name of a holiday to assign values to the Type variable. Many holidays do not have fixed dates but rather dates that vary from year to year. For example, US Thanksgiving is the fourth Thursday of November, and MLK Day is the third Monday of January. Easter is even more complicated. Luckily, SAS provides the HOLIDAY function, which enables you to obtain the dates of common holidays for each year. For each date in the data set, you can use the HOLIDAY function to determine whether the date corresponds to a religious or USA holiday. The following SELECT-WHEN statement shows how to implement the second syntax:
data Want;
length Type $12;
set Dates;
Year = year(Date);   /* get the year for the date */
select;              /* no expression on SELECT statement */
   when(holiday('MLK',            Year)=Date |   /* logic to determine type of holiday */
        holiday('Memorial',       Year)=Date |
        holiday('USIndependence', Year)=Date |
        holiday('Labor',          Year)=Date |
        mdy(9, 11, Year)               =Date |   /* Patriot Day 9/11 */
        holiday('Veterans',       Year)=Date |
        holiday('Thanksgiving',   Year)=Date)  Type="USA";
   when(holiday('Easter',         Year)=Date |
        holiday('Christmas',      Year)=Date)  Type="Religious";
   otherwise Type="Fun";
end;
run;

proc print data=Want noobs;
   var Date Holiday Type;
run;
This syntax for the SELECT-WHEN statement looks even more like an IF-THEN/ELSE statement.
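The HOLIDAY function computes dates such as "the third Monday in January." If you ever need to verify such a date outside of SAS, the rule is easy to implement. The following Python sketch (my own helper function, not a SAS or standard-library routine) computes the nth occurrence of a weekday in a month:

```python
import datetime

def nth_weekday(year, month, weekday, n):
    """Date of the nth given weekday (Mon=0 ... Sun=6) in a month."""
    first = datetime.date(year, month, 1)
    offset = (weekday - first.weekday()) % 7   # days until the first such weekday
    return first + datetime.timedelta(days=offset + 7*(n - 1))

print(nth_weekday(2020, 1, 0, 3))    # MLK Day 2020: 3rd Monday in January -> 2020-01-20
print(nth_weekday(2020, 11, 3, 4))   # US Thanksgiving 2020: 4th Thursday in November
```

The first result, 20JAN2020, matches the MLK date in the example data set.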
On discussion forums, I see a lot of questions like this example in which programmers are creating a new variable from the values of an existing variable. In simple situations, you can use PROC FORMAT to perform this task. For more complicated situations, you can even define your own function to use as a format.
This article shows two ways to specify the SELECT-WHEN statement in SAS. The SELECT-WHEN statement is an alternative to using multiple IF-THEN/ELSE statements. For simple logical processing, you can specify an expression on the SELECT statement and specify values for the expression on the WHEN statements. For more complicated logic, you can put all the logic into the WHEN statements.
The post Order the bars in a bar chart with PROC SGPLOT appeared first on The DO Loop.
This article shows how PROC SGPLOT in SAS orders categories in a bar chart in three scenarios: alphabetical order, (ascending or descending) frequency order, and a user-specified order.
After you understand these three ways to order categories for a simple bar chart, you can investigate how these orderings work for a stacked bar chart.
For data, let's use vehicles in the Sashelp.Cars data set and create bar charts that visualize the number of SUVs, sports cars, wagons, and trucks. The data also contain the Origin variable, which specifies the region (Asia, USA, or Europe) in which each vehicle was manufactured:
/* create example data */
data Have;
set sashelp.cars;
where Type in ('SUV' 'Sports' 'Truck' 'Wagon');
keep Type Origin;
run;
Let's visualize the number of SUVs, sports cars, wagons, and trucks. You can use PROC SGPLOT to order the categories of a bar chart in three ways: alphabetical order, ascending (or descending) order by frequency, and a user-specified order. Each bar chart shows the same data, but the order of the bars is different.
ods graphics / width=300px height=240px;
/* three ways to order categories in a bar chart */
title "Categories in Alphabetical Order";
proc sgplot data=Have;
   vbar Type;
   xaxis display=(nolabel);
run;

title "Categories in Frequency Order";
proc sgplot data=Have;
   vbar Type / categoryorder=respdesc;
   xaxis display=(nolabel);
run;

title "Categories in Arbitrary Order";
proc sgplot data=Have;
   vbar Type;
   xaxis display=(nolabel) values=('Wagon' 'Sports' 'SUV' 'Truck');
run;
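The CATEGORYORDER=RESPDESC ordering is simply "sort the categories by descending frequency." As a language-neutral illustration of that logic (a hedged Python sketch with made-up counts, not the actual Sashelp.Cars frequencies):

```python
from collections import Counter

# hypothetical vehicle types; the counts are invented for illustration
types = ['SUV']*5 + ['Wagon']*4 + ['Sports']*3 + ['Truck']*2

# most_common() returns categories in descending-frequency order,
# which is the ordering that CATEGORYORDER=RESPDESC applies to a frequency bar chart
order = [cat for cat, _ in Counter(types).most_common()]
print(order)    # ['SUV', 'Wagon', 'Sports', 'Truck']
```

The alphabetical and user-specified orderings correspond to `sorted(set(types))` and to an explicit list, respectively.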
For these data, you can divide each category into subgroups by using the Origin variable. The next section shows how these plots change if you add a GROUP= variable to the VBAR statement.
When you use the GROUP= option on the VBAR statement, you have an option to display the subgroups as a cluster of bars (GROUPDISPLAY=CLUSTER) or as a stacked bar chart (GROUPDISPLAY=STACK). The clustered option is easier to understand, so let's start with that. You can use the GROUP=Origin option to create clustered bar charts that display the number of cars for each category in each of three subsets: vehicles that were manufactured in 'Asia', 'USA', or 'Europe'.
When you introduce groups and different orderings of the data, you need to ensure that the colors of the groups are consistent across graphs. One way to do this is to create and use a discrete attribute map that associates each subgroup with a color: the bars for Origin='Asia' are red, the bars for Origin='USA' are blue, and the bars for Origin='Europe' are gold.
/* create a discrete attribute map to associate a color with values of the Origin variable */
data BarMap;
length ID $10 value $6 linecolor $ 9 fillcolor $ 9;
input ID $ value $ linecolor $ fillcolor $;
datalines;
BarAttr Asia   DarkGray FireBrick
BarAttr USA    DarkGray DarkBlue
BarAttr Europe DarkGray Gold
;
You can now create bar charts that consistently use these colors regardless of the order of the bars. To make the output easy to see, the following program uses the ODS LAYOUT GRIDDED statement to arrange the output in one row that contains three graphs:
/* for GROUP=Origin, examine the three ways to order categories in a bar chart */
%let method = CLUSTER;   /* use CLUSTER or STACK */
ODS LAYOUT GRIDDED columns=3 advance=table column_gutter=8px;

title "Categories in Alphabetical Order";
title2 "GROUPDISPLAY = &method";
proc sgplot data=Have dattrmap=BarMap;
   vbar Type / attrid=BarAttr group=Origin groupdisplay=&method;
   xaxis display=(nolabel);
run;

title "Categories in Frequency Order";
title2 "GROUPDISPLAY = &method";
proc sgplot data=Have dattrmap=BarMap;
   vbar Type / attrid=BarAttr categoryorder=respdesc group=Origin groupdisplay=&method;
   xaxis display=(nolabel);
run;

title "Categories in Arbitrary Order";
title2 "GROUPDISPLAY = &method";
proc sgplot data=Have dattrmap=BarMap;
   vbar Type / attrid=BarAttr group=Origin groupdisplay=&method;
   xaxis display=(nolabel) values=('Wagon' 'Sports' 'SUV' 'Truck');
run;
ODS LAYOUT END;
The graphs show that the ordering method (alphabetical or frequency) also extends to the subgroups within each category.
There is a simple way to understand the order of bars and bar segments in a stacked bar chart in PROC SGPLOT.
First, create a clustered bar chart, as shown in the previous section.
Then, change GROUPDISPLAY=CLUSTER to GROUPDISPLAY=STACK and rerun the program.
In the previous section, I wrote the code so that I only need to change one line:
%let method = STACK; /* use CLUSTER or STACK */
With that change, you can re-run the program to obtain the following graphs:
The graphs show that the order of segments in a stacked bar chart (from bottom to top) is the same as the order of bars in a clustered bar chart (from left to right).
This article shows how to understand how PROC SGPLOT in SAS orders bars in a bar chart. There are essentially three ways to order bars: alphabetically, by frequency, or by specifying the order manually. When you use the GROUP= option, you get either a clustered bar chart or a stacked bar chart. The order of the subgroups is best understood by looking at the clustered bar chart. The order of bars in the clusters (from left to right) is the same as the order of the segments in a stacked bar chart (from bottom to top). Be aware that the CATEGORYORDER= option also orders the subgroups. This can be confusing to the viewer because segments in the stacked bars might "move around" depending on their relative frequencies.
The post How to stagger labels on an axis in PROC SGPLOT appeared first on The DO Loop.
One of the challenges in statistical graphics is that long labels in plots can overlap, which makes the labels difficult to read. There are a few standard tricks for dealing with long or densely packed labels in graphs, such as making the graph wider, decreasing the font size, or rotating the labels.
This article demonstrates a new technique for your graphical toolbox: How to use the FITPOLICY= option on the XAXIS or X2AXIS statements to prevent tick labels from overlapping. Specifically, this article shows the FITPOLICY=STAGGER option.
To understand the problem, let's construct some hypothetical data about a weight-loss study. The analyst wants to display the weekly mean weight loss of the participants in the program along with reference lines and labels that visualize various phases and events in the study. The following program uses the REFLINE statement in PROC SGPLOT to specify the location of the events. By default, the labels for the events are displayed at the top of the scatter plot:
data Have;
label MeanValue = "Mean Weight Loss (kg)";
input Day MeanValue @@;
datalines;
0 0     7 1     14 2.2  21 2.7  28 3.1  35 3.3  42 3.6
49 4.1  56 5.0  63 5.6  70 5.9  77 6.3  84 6.5
;

title "Mean Weight Loss in Program";
title2 "First Attempt";
proc sgplot data=Have noautolegend;
   scatter x=Day y=MeanValue;
   refline (0 21 25 42 60 84) / axis=x
           label=('Phase 1' 'Phase 2' 'Adjustment' 'Phase 3' 'End Treatment' 'End Study');
run;
For this set of labels and for the default size of the plot, the REFLINE statement displays the labels vertically because the 'Phase 2' and 'Adjustment' labels are close to each other and would overlap if they were displayed horizontally. The plot isn't terrible looking, but we can do better. I would prefer that the labels appear horizontally rather than be rotated by 90 degrees.
Sometimes, you can handle long labels on a horizontal axis by simply making the graph wider.
For example, you might try to use
ODS GRAPHICS / width=800px height=400px;
to see whether a wider plot enables the reference labels to display side-by-side without overlapping. For these data, the distance between the 'Phase 2' event (Day=21) and the 'Adjustment' event (Day=25) is very small, so making the plot wider does not fix the problem.
Similarly, you can try to use the LABELATTRS= option to decrease the font size of the labels. The SIZE=6 option is the smallest font size that I can read. However, adding the LABELATTRS=(Size=6) option to the REFLINE statement does not fix the problem for these data.
The problem is that the REFLINE statement has limited support for displaying labels. It checks whether the labels can be displayed horizontally without colliding, and, if not, it rotates the labels 90 degrees. In contrast, PROC SGPLOT provides more support for the tick labels on an axis. The XAXIS and X2AXIS statements support the FITPOLICY= option, which provides more options for controlling how to handle overlapping labels. The next section removes the REFLINE statement and uses two X axes: one to show the days and another to show the events.
As mentioned, the XAXIS and X2AXIS statements support the FITPOLICY= option, which supports more than a dozen ways to control the display of the tick labels. For these data, I will use FITPOLICY=STAGGER, which alternates the placement of the labels in a two-row display. See the documentation for other useful options.
To visualize both the Day values and the events, you can use two axes, one below the scatter plot and one above the plot. In the following graph, the upper axis displays the events, and the lower axis displays the Day. (You could easily make the opposite choice.) To create the plot, an invisible scatter plot (MARKERATTRS=(SIZE=0)) assigns data to the X2 axis, and the X2AXIS statement uses the VALUES=, VALUESDISPLAY=, and FITPOLICY=STAGGER options to place and stagger the event labels:
title2 "Uses FITPOLICY=STAGGER to Stagger Labels";
proc sgplot data=Have noautolegend;
   scatter x=Day y=MeanValue;
   /* add an invisible scatter plot to the X2 axis (set SIZE=0) */
   scatter x=Day y=MeanValue / x2axis markerattrs=(size=0);
   x2axis display=(nolabel) grid FITPOLICY=stagger
          values        = (0 21 28 42 60 84)
          valuesdisplay = ('Phase 1' 'Phase 2' 'Adjustment' 'Phase 3' 'End Treatment' 'End Study');
run;
The SAS analyst was very happy to see this graph. Both the days and the events in the study are apparent. None of the tick labels overlap. The text is displayed horizontally.
This example shows how to use the FITPOLICY=STAGGER option to avoid overlap when you display long tick labels on an axis. The example uses two X axes: one to display the data and another to display related events. To use a second axis (called the X2 axis), you must create a plot that uses the second axis. This article creates an invisible scatter plot, which ensures that the X and X2 axes have the same scale.
In this article, I have hard-coded the locations of the ticks and the labels for each tick mark. However, if this information is in a SAS data set, you can read the data into macro variables and plot the events automatically. This trick is shown in the Appendix.
Sometimes the location and labels for the events are stored in a SAS data set. If so, you can read that information into a SAS macro variable and use the macro as the value for the VALUES= and VALUESDISPLAY= options on the XAXIS or X2AXIS statements. You can use PROC SQL and the SELECT INTO (COLON) statement to create a macro variable that contains the data. There is one trick to learn: The values in the VALUESDISPLAY= option must be strings, so when you read the data you should add quotation marks around the strings, as follows:
data Ticks;
length Label $15;
input value Label 5-18;
Label = "'" || trim(Label) || "'";   /* add single quotes to both sides of each string */
datalines;
0   Phase 1
21  Phase 2
25  Adjustment
42  Phase 3
60  End Treatment
84  End Study
;

/* put the list of values into macro variables */
proc sql noprint;
   select value into :TickList  separated by ' ' from Ticks;
   select Label into :LabelList separated by ' ' from Ticks;
quit;
%put &=TickList;
%put &=LabelList;

/* use the macro variables in the VALUES= and VALUESDISPLAY= options */
proc sgplot data=Have noautolegend;
   scatter x=Day y=MeanValue;
   scatter x=Day y=MeanValue / x2axis markerattrs=(size=0);   /* invisible */
   x2axis display=(nolabel) grid FITPOLICY=stagger
          values        = (&tickList)
          valuesdisplay = (&LabelList);
run;
The graph is the same as the previous example, which hard-coded the tick values and strings.
The post The univariate Box-Cox transformation appeared first on The DO Loop.
Formally, a Box-Cox transformation is a transformation of the dependent variable in a regression model. However, the documentation of the TRANSREG procedure contains an example that shows how to perform a one-variable transformation. The trick is to formulate the problem as an intercept-only regression model. This article shows how to perform a univariate Box-Cox transformation in SAS.
Let's look at the distribution of some data and then try to transform it to become "more normal." The following call to PROC UNIVARIATE creates a histogram and normal quantile-quantile (Q-Q) plot for the MPG_Highway variable in the Sashelp.Cars data set:
%let dsName = Sashelp.Cars;   /* name of data set */
%let YName = MPG_Highway;     /* name of variable to transform */
proc univariate data=&dsName;
   histogram &YName / normal;
   qqplot &YName / normal(mu=est sigma=est);
   ods select histogram qqplot;
run;
The histogram shows that the data are skewed to the right. This is also seen in the Q-Q plot, which shows a point pattern that is concave up.
You can use a Box-Cox transformation to attempt to normalize the distribution of the data, but you must formulate the problem as an intercept-only regression model. To use the TRANSREG procedure for the Box-Cox transformation, create a constant variable to serve as the sole (intercept-only) effect, specify the BOXCOX transformation for the variable that you want to normalize, and specify the constant variable in the IDENTITY transformation on the right side of the MODEL statement.
Notice that the values of the response variable are all positive; therefore, we can perform a Box-Cox transformation without having to shift the data. The following statements use a DATA step view to create a new constant variable named _ZERO. The call to PROC TRANSREG uses an intercept-only regression model to transform the response variable:
data AddZero / view=AddZero;
set &dsName;
_zero = 0;   /* add a constant variable to use for an intercept-only model */
run;

proc transreg data=AddZero details maxiter=0 nozeroconstant;
   model BoxCox(&YName / geometricmean convenient lambda=-2 to 2 by 0.05) = identity(_zero);
   output out=TransOut;   /* write transformed variable to data set */
run;
The graph is fully explained in my previous article about the Box-Cox transformation. In the graph, the horizontal axis represents the value of the λ parameter in the Box-Cox transformation. For this example, the LAMBDA= option in the BOXCOX transformation specifies values for λ in the interval [-2, 2]. The vertical axis shows the value of the normal log-likelihood function for the residuals after the dependent variable is transformed for each value of λ. The maximum value of the log likelihood occurs when λ=0.05. However, because the CONVENIENT option was used and because λ=0 is included in the confidence interval for the optimal value (and is "convenient"), the procedure selects λ=0 for the transformation. The Box-Cox transformation for λ=0 is the logarithmic transformation Y → G*log(Y), where G is the geometric mean of the response variable. (If you do not want to scale by the geometric mean, omit the GEOMETRICMEAN option in the BOXCOX transformation.)
To summarize, the Box-Cox method selects a logarithmic transformation as the power transformation that makes the residuals "most normal." In an intercept-only regression, the distribution of the residuals is the same as the distribution of the centered data. Consequently, this process results in a transformation that makes the response variable as normally distributed as possible, within the family of power transformations.
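If you want to see the machinery behind the Box-Cox plot, you can compute the profile log-likelihood yourself. The following Python sketch (an illustration of the standard Box-Cox log-likelihood, not the TRANSREG implementation) evaluates the log-likelihood on a grid of λ values for simulated lognormal data. Because the log of lognormal data is normal, the maximizing λ should be near 0:

```python
import math, random, statistics

def boxcox(y, lam):
    """Simple Box-Cox transform of a positive scalar."""
    return math.log(y) if lam == 0 else (y**lam - 1) / lam

def boxcox_loglik(data, lam):
    """Profile log-likelihood of the normal model for the transformed data."""
    w = [boxcox(v, lam) for v in data]
    n = len(data)
    # -n/2 * log(variance of transformed data) + Jacobian term (lam-1)*sum(log y)
    return -n/2 * math.log(statistics.pvariance(w)) \
           + (lam - 1) * sum(math.log(v) for v in data)

random.seed(1)
y = [random.lognormvariate(0, 1) for _ in range(500)]   # log(y) is normal
grid = [k/20 for k in range(-40, 41)]                   # lambda = -2 to 2 by 0.05
best = max(grid, key=lambda lam: boxcox_loglik(y, lam))
print(best)    # near 0, so the log transformation is selected
```

Plotting `boxcox_loglik` against the grid reproduces the shape of the curve in the Box-Cox plot.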
You can use PROC UNIVARIATE to visualize the distribution of the transformed variable, as follows:
proc univariate data=TransOut;
   histogram T&YName / normal kernel;
   qqplot T&YName / normal(mu=est sigma=est);
   ods select histogram qqplot;
run;
The distribution of the transformed variable is symmetric. The distribution is not perfectly normal (it never will be). In the Q-Q plot, the point pattern shows quantiles that are below the diagonal line on the left and above the line on the right. This indicates outliers (with respect to normality) at both ends of the data distribution.
This article shows how to perform a Box-Cox transformation of a single variable in SAS. The Box-Cox transformation is intended for regression models, so the trick is to run an intercept-only regression model. To do this, you can use a SAS DATA view to create a constant variable and then use that variable as a regressor in PROC TRANSREG. The procedure produces a Box-Cox plot, which visualizes the normality of the transformed variable for each value of the power-transformation parameter. The parameter value that maximizes the normal log-likelihood function is the best parameter to choose, but in many circumstances, you can use a nearby "convenient" parameter value. The convenient parameter is more interpretable because it favors well-known transformations such as the square-root, logarithmic, and reciprocal transformations.
Upon first glance, it is not clear why the logarithm is the correct limiting form of the Box-Cox power transformations as the parameter λ → 0. The power transformations have the form \((x^\lambda - 1)/\lambda\). If you rewrite \(x^\lambda = \exp(\lambda \log(x))\) and use the Taylor series expansion of \(\exp(z)\) near z=0, you obtain
\(\frac{\exp(\lambda \log(x)) - 1}{\lambda} = \frac{(1 + \lambda \log(x) + \lambda^2/2 \log^2(x) + \cdots) - 1}{\lambda} \approx \log(x)\) as λ → 0.
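A quick numerical check of this limit (my own illustration, not part of the original post) shows that the Box-Cox formula approaches log(x) as λ shrinks, with an error that is roughly (λ/2)·log²(x), as the Taylor expansion predicts:

```python
import math

def boxcox(x, lam):
    """Simple Box-Cox transform; the lam=0 case is the logarithm."""
    return (x**lam - 1) / lam if lam != 0 else math.log(x)

x = 7.3    # an arbitrary positive test value
for lam in [0.1, 0.01, 0.001]:
    # the difference from log(x) shrinks roughly linearly in lambda
    print(lam, boxcox(x, lam) - math.log(x))
```

Each tenfold decrease in λ decreases the difference from log(x) by roughly a factor of ten.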
The post The Box-Cox transformation for a dependent variable in a regression appeared first on The DO Loop.
The Box-Cox family of transformations (Box and Cox, 1964) is a popular way to use the data to suggest a transformation for the dependent variable. Some people think of the Box-Cox transformation as a univariate normalizing transformation, and, yes, it can be used that way. (I discuss the univariate Box-Cox transformation in another article.) However, the primary purpose of the Box-Cox transformation is to transform the dependent variable in a regression model so that the residuals are normally distributed. The Box-Cox transformation attempts to find a transformed variable W = f(Y) such that the residuals (W - X*β) are normally distributed, where f is a power transformation that is chosen to maximize the normality of the residuals. Recall that normally distributed residuals are useful if you intend to make inferential statements about the parameters in the model, such as confidence intervals and hypothesis tests.
It is important to remember that you are normalizing residuals, not the response variable! Linear regression does not require that the variables themselves be normally distributed.
This article shows how to use the TRANSREG procedure in SAS to compute a Box-Cox transformation of Y so that the least-squares residuals are approximately normally distributed.
In the simplest case, the Box-Cox family of transformations is given by the following formula:
\( f_\lambda(y) = \left\{ \begin{array}{l l l} (y^\lambda - 1) / \lambda & & \lambda \neq 0 \\ \log (y) & & \lambda = 0 \end{array} \right. \)
The objective is to use the data to choose a value of the parameter λ that maximizes the normality of the residuals (f_{λ}(Y) - X*β).
In SAS, you can use PROC TRANSREG to perform a regression that includes the BOXCOX transformation.
Before performing a Box-Cox transformation, let's demonstrate why it might be necessary. The following call to PROC GLM in SAS performs a least squares regression of the MPG_Highway variable (response) onto two explanatory variables (Weight and Wheelbase) for vehicles in the Sashelp.Cars data set. The residuals from the model are examined in a histogram and in a Q-Q plot:
%let dsName = Sashelp.Cars;
%let XName = Weight Wheelbase;
%let YName = MPG_Highway;
proc glm data=&dsName plots=diagnostics(unpack);
   model &YName = &XName;
   ods select ResidualHistogram QQPlot;
quit;
The graphs show that the distribution of the residuals for this model deviates from normality in the right tail. Apparently, there are outliers (large residuals) in the data. One way to handle this is to increase the complexity of the model by adding additional effects. Another way to handle it is to transform the dependent variable so that the relationship between variables is more linear. The Box-Cox method is one way to choose a transformation.
To implement the Box-Cox transformation of the dependent variable, use the following syntax in the TRANSREG procedure in SAS: specify the BoxCox transformation (and its options) for the dependent variable on the left side of the MODEL statement, and specify the Identity transformation for the explanatory variables on the right side.
The following call to PROC TRANSREG implements the Box-Cox transformation for these data and saves the residuals to a SAS data set:
proc transreg data=&dsName ss2 details plots=(boxcox);
   model BoxCox(&YName / convenient lambda=-2 to 2 by 0.1) = identity(&XName);
   output out=TransOut residual;
run;
The procedure forms 41 regression models, one for each requested value of λ in the range [-2, 2]. To help you visualize how each value of λ affects the normality of the residuals, the procedure creates the Box-Cox plot, which is shown. The graph is a panel that shows two plots.
The procedure also displays a table that summarizes the optimal value of λ, the value used for the regression, and other information:
You can use PROC UNIVARIATE to plot the distribution of the residuals for the selected model (λ= -0.5). The name of the residual variable is RYName, where YName is the name of the dependent variable.
proc univariate data=TransOut(keep=R&YName);
   histogram R&YName / normal kernel;
   qqplot R&YName / normal(mu=est sigma=est) grid;
   ods select histogram qqplot GoodnessOfFit Moments;
run;
As shown by the histogram of the residuals, the Box-Cox transformation has eliminated the extreme outliers and improved the fit. Be aware that "improving the normality of the residuals" does not mean that the residuals are perfectly normal, especially if you use the CONVENIENT option. A test for normality (not shown) rejects the hypothesis of normality. However, the inferential statistics for linear regression are generally robust to mild departures from normality.
A normal quantile-quantile (Q-Q) plot provides an alternate way to visualize normality. The histogram appears to be slightly left-skewed. In a Q-Q plot, a left-skewed distribution will have a curved appearance, as follows:
There are two mathematical issues with the simple Box-Cox family of transformations: the formula is defined only when the response values are positive, and the transformed values for different values of λ are not on a comparable scale unless you standardize them.
Thus, the general formulation of the Box-Cox transformation incorporates two additional terms: a shift parameter and a scaling parameter. If any of the response values are not positive, you must add a constant, c. (Sometimes c is also used when Y is very large.) The scaling parameter is a power of the geometric mean of Y. The full Box-Cox transformation is
\(
f_\lambda(y) = \left\{
\begin{array}{ll}
((y + c)^\lambda - 1) / (\lambda g) & \lambda \neq 0 \\
\log(y + c) / g & \lambda = 0
\end{array}
\right.
\)
where \(g = G^{\lambda - 1}\), and G is the geometric mean of Y.
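To make the formula concrete, the following is a minimal Python sketch of the full transformation. The function name and the default values c = 0 and G = 1 are my own choices for illustration; they are not part of the SAS implementation.

```python
import math

def full_boxcox(y, lam, c=0.0, G=1.0):
    """Full Box-Cox transform with shift c and geometric mean G.
    The scale factor g = G**(lam - 1) puts the transformed values on
    comparable scales across different values of lam."""
    g = G ** (lam - 1)
    z = y + c
    if lam == 0:
        return math.log(z) / g
    return (z ** lam - 1) / (lam * g)

# With c = 0 and G = 1, lam = 1 is the identity shifted by -1:
print(full_boxcox(5, 1))          # (5 - 1)/1 = 4.0
# As lam -> 0, the power form approaches the log form:
print(full_boxcox(math.e, 1e-9))  # approximately log(e) = 1
```

The λ = 0 case is the limit of the power case, which is why the log transformation belongs to the family.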
Although the response variable in this example is already positive, the following program uses a PROC SQL calculation to compute the shift parameter c = 1 - min(Y) and write the value to a macro variable. You can then use the PARAMETER= option in the BOXCOX transformation to specify the shift parameter. The geometric mean is much simpler to specify: merely use the GEOMETRICMEAN option. Thus, the following statements implement the full Box-Cox transformation in SAS:
/* use full Box-Cox transformation with c and G */
/* compute c = 1 - min(Y) and put it in a macro variable */
proc sql noprint;
   select 1-min(&YName) into :c trimmed from &dsName;
quit;
%put &=c;

proc transreg data=&dsName ss2 details plots=(boxcox);
   model BoxCox(&YName / parameter=&c geometricmean convenient
                lambda=-2 to 2 by 0.05) = identity(&XName);
   output out=TransOut residual;
run;
For these data, the value of the c parameter is -11, so adding c shifts the response variable to the left. For variables that have negative values (the usual situation in which you would use c), the shift moves the response to the right. The geometric mean scales each transformation. The bottom plot in the panel indicates that λ = 0.3 is the value of λ that makes the residuals most normal. For this analysis, the convenient value λ = 0.25 is NOT in the confidence interval for λ, so no "convenient" value is substituted. Instead, the chosen power is λ = 0.3.
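You can check the shift rule with a short Python sketch. The helper name `shift_param` and the data values are my own illustrations; the rule c = 1 - min(Y) comes from the PROC SQL step above.

```python
def shift_param(y):
    """c = 1 - min(y): guarantees that min(y + c) = 1, which is positive."""
    return 1 - min(y)

# Positive data: c is negative, so y + c shifts the values to the left.
print(shift_param([12, 20, 35]))   # 1 - 12 = -11
# Data with negative values: c is positive, so y + c shifts to the right.
print(shift_param([-4, 0, 7]))     # 1 - (-4) = 5
```

In both cases the shifted minimum equals 1, so the power and log transforms are always defined.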
If you want to examine the distribution of the residuals, you can use the following call to PROC UNIVARIATE:
/* optional: plot the distribution of the residuals */
proc univariate data=TransOut(keep=R&YName);
   histogram R&YName / normal kernel;
   qqplot R&YName / normal(mu=est sigma=est) grid;
   ods select histogram qqplot;
run;
Both the histogram (not shown) and the Q-Q plot indicate that the residuals are approximately normal but are left-skewed.
The advantage of the Box-Cox transformation is that it provides an automated way to transform a dependent variable in a regression model so that the residuals for the model are as normal as possible. The transformations are all power transformations, and the logarithmic transformation is a limiting case that is also included.
The Box-Cox transformation also has several disadvantages:
Modern nonparametric regression methods (such as splines and loess curves) might provide a better fit. However, I suggest looking at the Box-Cox transformation to see if the method suggests a simple transformation, such as the inverse, log, square-root, or quadratic transformations (λ = -1, 0, 0.5, 2). If so, you can choose the parametric approach for the transformed response.
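For the simple values of λ mentioned above, the Box-Cox transform is an affine function of the familiar transformation, so either form yields the same residual diagnostics. A quick Python check (illustrative only; the function name is mine):

```python
import math

def boxcox(y, lam):
    """Simple (unshifted, unscaled) Box-Cox transform."""
    return math.log(y) if lam == 0 else (y**lam - 1) / lam

y = 9.0
# lam = 0.5 is an affine function of the square root: 2*(sqrt(y) - 1)
print(boxcox(y, 0.5), 2 * (math.sqrt(y) - 1))   # both 4.0
# lam = -1 is an affine function of the inverse: 1 - 1/y
print(boxcox(y, -1), 1 - 1/y)                   # both 8/9
```

Because affine rescaling does not change the shape of the residual distribution, choosing the simple transform in place of the exact Box-Cox power loses nothing.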
This article shows how to use PROC TRANSREG in SAS to perform a Box-Cox transformation of the response variable in a regression model. The procedure examines a family of power transformations indexed by a parameter, λ. For each value of λ, the procedure transforms the response variable and computes a regression model. The optimal parameter is the one for which the distribution of the residuals is "most normal," as measured by a maximum likelihood computation. You can use the CONVENIENT option to specify that a nearby simple transformation (for example, a square-root transformation) is preferable to a less intuitive transformation.
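As a rough illustration of the maximum-likelihood criterion, the following Python sketch (my own simulation, not TRANSREG's internal algorithm) profiles the Box-Cox log-likelihood over a grid of λ for simulated data whose true transformation is the log:

```python
import math, random

def boxcox_loglik(x, y, lam):
    """Profile log-likelihood of lambda for a simple-regression Box-Cox model:
    llik = -n/2 * log(SSE/n) + (lam - 1) * sum(log y).
    The second term is the Jacobian that makes likelihoods comparable
    across values of lam."""
    n = len(y)
    z = [math.log(v) if lam == 0 else (v**lam - 1) / lam for v in y]
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((xi - xbar)**2 for xi in x)
    sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    b = sxz / sxx                      # least-squares slope
    a = zbar - b * xbar                # least-squares intercept
    sse = sum((zi - (a + b * xi))**2 for xi, zi in zip(x, z))
    return -0.5 * n * math.log(sse / n) + (lam - 1) * sum(math.log(v) for v in y)

# simulate data for which the log transform (lambda = 0) normalizes residuals
random.seed(1)
x = [random.uniform(1, 5) for _ in range(200)]
y = [math.exp(1 + 0.5 * xi + random.gauss(0, 0.1)) for xi in x]

grid = [round(-2 + 0.1 * i, 1) for i in range(41)]   # -2 to 2 by 0.1
best = max(grid, key=lambda lam: boxcox_loglik(x, y, lam))
print(best)   # near 0, the true transformation
```

The grid search over λ and the "pick the λ that maximizes the profile likelihood" step mirror what the BOXCOX transformation in PROC TRANSREG does, although the procedure also computes a confidence interval for λ.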
This article assumes that there are one or more regressors in the model. The TRANSREG procedure can also perform a univariate Box-Cox transformation. The univariate case is discussed in a second article.
The post The Box-Cox transformation for a dependent variable in a regression appeared first on The DO Loop.