Simulate correlated variables by using the Iman-Conover transformation

5

Simulating univariate data is relatively easy. Simulating multivariate data is much harder. The main difficulty is to generate variables that have given univariate distributions but also are correlated with each other according to a specified correlation matrix. However, Iman and Conover (1982, "A distribution-free approach to inducing rank correlation among input variables") proposed a transformation that approximately induces a specified correlation among component variables without changing the univariate distribution of the components. (The univariate distributions are called the marginal distributions.) The Iman-Conover transformation is designed to transform continuous variables.

This article defines the Iman-Conover transformation. You can use the transformation to generate a random sample that has (approximately) a given correlation structure and known marginal distributions. The Iman-Conover transformation uses rank correlation, rather than Pearson correlation. In practice, the rank correlation and the Pearson correlation are often close to each other.

This article shows how to use the Iman-Conover transformation. A separate article explains how the Iman-Conover transformation works. The example and SAS program in this article are adapted from Chapter 9 of my book Simulating Data with SAS (Wicklin, 2013).

Create simulated data with known distributions

To use the Iman-Conover transformation, you must specify two things: A set of data and a target correlation matrix. The transformation will induce the specified correlation in the data by rearranging (or permuting) the columns of the data. In other words, the algorithm permutes the columns independently to induce the specified correlation without changing the marginal distributions.

This example generates data with four variables whose marginal distributions are normal, lognormal, exponential, and uniform, respectively. It does not matter whether the original data are correlated or not. For simplicity, generate each variable independently, so the initial correlation matrix is approximately the identity matrix. The following SAS DATA step simulates four variables from the specified distributions. The call to PROC CORR computes the rank correlation of the original data. It also creates a scatter plot matrix that shows the marginal distributions of the variable in the diagonal cells and the bivariate joint distributions in the off-diagonal cells.

%let N = 100; /* sample size */
/* independently simulate variables from known distributions */
data SimIndep;
call streaminit(12345);
do i = 1 to &N;
   Normal    = rand("Normal");
   Lognormal = rand("Lognormal");
   Expon     = rand("Expo");
   Uniform   = rand("Uniform");
   output;
end;
drop i;
run;
 
proc corr data=SimIndep Spearman noprob plots=matrix(hist);
   var Normal Lognormal Expon Uniform;
run;

For these simulated data, the table shows that the rank correlation between any pair of variables is small. None of the correlations are statistically different from 0, which you can verify by removing the NOPROB option from the PROC CORR statement.

Implement the Iman-Conover transformation in SAS

The Iman-Conover transformation is actually a sequence of transformations of multivariate data. The following program defines the Iman-Conover transformation as a SAS/IML function. An explanation of the function is provided in a separate article.

/* define a store a SAS/IML function that computes the Iman-Conover transformation */
proc iml;
/* Input: X is a data matrix with k columns
          C is a (k x k) rank correlation matrix
   Output: A matrix W. The i_th column of W is a permutation of the i_th
          columns of X. The rank correlation of W is close to the 
          specified correlation matrix, C.
*/          
start ImanConoverTransform(X, C);
   N = nrow(X);
   S = J(N, ncol(X));
   /* T1: Create normal scores of each column */
   do i = 1 to ncol(X);
      ranks = ranktie(X[,i], "mean");          /* tied ranks */
      S[,i] = quantile("Normal", ranks/(N+1)); /* van der Waerden scores */
   end;
   /* T2: apply two linear transformations to the scores */
   CS = corr(S);        /* correlation of scores */
   Q = root(CS);        /* Cholesky root of correlation of scores */
   P = root(C);         /* Cholesky root of target correlation */
   T = solve(Q,P);      /* same as  T = inv(Q) * P; */
   Y = S*T;             /* transform scores: Y has rank corr close to target C */
 
   /* T3: Permute or reorder data in the columns of X to have the same ranks as Y */
   W = X;
   do i = 1 to ncol(Y);
      rank = rank(Y[,i]);          /* use ranks as subscripts, so no tied ranks */
      tmp = W[,i]; call sort(tmp); /* sort column by ranks */
      W[,i] = tmp[rank];           /* reorder the column of X by the ranks of M */
   end;
   return( W );
finish;
store module=(ImanConoverTransform);  /* store definition for later use */
quit;

Unlike many algorithms in the field of simulation, the Iman-Conover transformation is deterministic. Given a matrix of data (X) and a correlation matrix (C), the function returns the same transformed data every time you call it.

Use the Iman-Conover transformation to simulate correlated data

The following SAS/IML program shows how to use the Iman-Conover transformation to simulate correlated data. There are three steps:

  1. Read real or simulated data into a matrix, X. The columns of X define the marginal distributions. For this example, we will use the SimIndep data, which contains four variables whose marginal distributions are normal, lognormal, exponential, and uniform, respectively.
  2. Define a "target" correlation matrix. This specifies the rank correlations that you hope to induce. For this example, the target correlation between the Normal and Lognormal variables is 0.75, the target correlation between the Normal and Exponential variables is -0.7, the target correlation between the Normal and Uniform variables is 0, and so forth.
  3. Call the ImanConoverTransform function. This returns a new data matrix, W. The i_th column of W is a permutation of the i_th column of X. The permutations are chosen so that the rank correlation of W is close to the specified correlation matrix:
proc iml;
load module=(ImanConoverTransform);  /* load the function definition */
 
/* Step 1: Read in the data */
varNames = {'Normal' 'Lognormal' 'Expon' 'Uniform'};
use SimIndep; read all var varNames into X; close;
 
/* Step 2: specify target rank correlation */
/*    X1    X2    X3    X4                 */
C = { 1.00  0.75 -0.70  0,           /* X1 */
      0.75  1.00 -0.95  0,           /* X2 */
     -0.70 -0.95  1.00 -0.3,         /* X3 */
      0     0    -0.3   1.0};        /* X4 */
 
/* Step 3: Strategically reorder the columns of X. 
           The new ordering has the specified rank corr (approx)
           and preserves the original marginal distributions. */
W = ImanConoverTransform(X, C);
RankCorr = corr(W, "Spearman");
print RankCorr[format=6.3 r=varNames c=varNames];

As advertised, the rank correlation of the new data matrix, W, is very close to the specified correlation matrix, C. You can write the data matrix to a SAS data set and use PROC CORR to create a scatter plot matrix that has the marginal histograms on the diagonal:

/* write new data to a SAS data set */
newvarNames = {'newNormal' 'newLognormal' 'newExpon' 'newUniform'};
create SimCorr from W[c=newvarNames];  append from W;  close;
quit;
 
proc corr data=SimCorr Spearman noprob plots=matrix(hist);
   var newNormal newLognormal newExpon newUniform;
run;

If you compare the diagonal histograms in this graph to the histograms in the previous graph, you will see that they are identical. They have to be identical because each new variable contains exactly the same data values as the original variables.

At the same time, the bivariate scatter plots show how permuting the data has induced the prescribed correlations between the variables. For this example, the newNormal and newLognormal variables have a strong positive rank correlation (0.73). The newNormal and NewExponential variables have a strong negative rank correlation (-0.65). The newNormal and newUniform variables are essentially uncorrelated (-0.03), and so forth.

Summary

The Iman-Conover transformation is a remarkable algorithm that induces a specified correlation structure on a data matrix while preserving the marginal distributions. You start with any data matrix (X) and any target correlation matrix. (The columns of X must represent continuous data.) You apply the Iman-Conover transformation to obtain a new matrix (W). The rank correlation of the new matrix is close to the values you specified. The columns of W have the same values as the columns of X, but the values have been reordered. Thus, the new matrix has the same marginal distributions as the original matrix.

You can download the complete SAS program that defines and uses the Iman-Conover transformation.

In a follow-up article, I explain the geometry or the Iman-Conover transformation and how it works.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

5 Comments

  1. Pingback: The geometry of the Iman-Conover transformation - The DO Loop

    • Rick Wicklin

      They are similar: Both use transformations to correlate variables in a multivariate distribution while preserving the marginal distributions. However, a copula uses a different set of transformations than the ones discussed here. I am currently writing a blog post about the transformations that PROC COMPULA uses. Check back in 7-10 days.

  2. Hi Dr. Wicklin,

    I simulated data with 10 samples with known distributions. Now I hope to use the Iman-Conover transformation to simulate correlated data within each SampleId. I tried %macro loop, but seems not working. Could you please give me some suggestions? Thank you in advance! Below are my codes.

    Karen Shore

    %let n_cont=1; /*number of continuous variables*/;

    %let n_class=1; /*number of classification variables*/;
    %let n_level_class=3; /*number of levels for each classification variable*/;

    %let n_binary=1; /*number of binary variables*/;
    %let n_level_binary=2; /*number of levels for each binary variable*/;

    %let n=50; /*sample size*/;

    %let p=0.05; /*missing rate*/;

    %let num_samples=10;

    ***************************************************************************************;
    data dat_overall;

    array x{&n_cont} x1-x&n_cont;
    array c{&n_class} c1-c&n_class;
    array b{&n_binary} b1-b&n_binary;

    call streaminit(1234);

    /*simulate the model*/
    do sampleid=1 to &num_samples;
    do i=1 to &n;
    do j=1 to &n_cont; /*continuous variables for i_th observation*/;
    x{j}=rand("Uniform"); /*uncorrelated uniform*/;
    end;

    do j=1 to &n_class; /*classification variables for i_th observation*/;
    c{j}=ceil(&n_level_class*rand("Uniform")); /*discrete uniform*/;
    end;

    do j=1 to &n_binary; /*binary variables for i_th observation*/;
    b{j}=ceil(&n_level_binary*rand("Uniform")); /*binary*/;
    end;
    output;
    end;
    end;
    drop i j;

    run;

    %macro trial;
    data
    %do k=1 %to &num_samples;
    dat&k
    %end;;

    set dat_overall;

    %do k=1 %to &num_samples;
    if sampleid=&k then output dat&k;
    %end;
    run;
    %mend;
    %trial;

    *computes the Iman-Conover transformation was computed*;

    %macro try;
    proc iml;
    load module=(imanconovertransform); /* load the function definition */

    /* Step 1: Read in the data */
    varnames={'x1' 'c1' 'b1'};
    %do k=1 %to &num_samples;
    use dat&k; read all var varnames into x;
    %end;
    close;

    /* Step 2: specify target rank correlation */
    /* X1 X2 X3 */
    c={1.00 -0.02 0.05, /* X1 */
    -0.02 1.00 0.02, /* X2 */
    0.05 0.02 1.00}; /* X3 */

    /* Step 3: Strategically reorder the columns of X.
    The new ordering has the specified rank corr (approx)
    and preserves the original marginal distributions. */
    w=imanconovertransform(x, c);
    rankcorr=corr(w, "Spearman");
    print rankcorr[format=6.3 r=varnames c=varnames];

    /* write new data to a SAS data set */
    newvarnames={'new_y' 'new_x1' 'new_c1' 'new_b1'};

    %do k=1 %to &num_samples;
    create dat&k from w[c=newvarnames];
    append from w;
    %end;
    close;
    quit;
    %mend try;

    • Rick Wicklin

      Here are two suggestions:
      1. Post code and ask questions at the SAS Support Communities. There are communities for many areas of SAS, including SAS/IML.
      2. Get rid of the macro loops in PROC IML. They are not needed. Use a program like this:

      /* Step 1: Read in the data */
      varnames={'x1' 'c1' 'b1'};
      do k=1 to &num_samples;
         dsIn = 'dat' + strip(char(k));
         use (dsIn); read all var varnames into x; close;
       
         /* do the computations here */
       
         newvarnames={'new_y' 'new_x1' 'new_c1' 'new_b1'};
         dsOut = 'Out' + strip(char(k));
         create (dsOut) from w[c=newvarnames];
         append from w;
         close;
      end;

      In general, it is better to deal with one large data set that has many samples instead of many data sets that each have one sample. See "Simulation in SAS: The slow way or the BY way."

Leave A Reply

Back to Top