Isotonic regression: An application of quadratic optimization

0

Isotonic regression (also called monotonic regression) is a type of regression model that assumes that the response variable is a monotonic function of the explanatory variable(s). The model can be nondecreasing or nonincreasing. Certain physical and biological processes can be analyzed by using an isotonic regression model. For example, a growth curve for plants is assumed to be monotonic. Mathematically, a function, f, is monotonic If the function preserves order. A function is monotonic nondecreasing if f(x1) ≤ f(x2) for all x1 ≤ x2. A function is monotonic nonincreasing if f(x1) ≥ f(x2) for all x1 ≤ x2.

In a simple univariate linear regression model, monotonicity follows from linearity. However, there are regression models that are not linear. The simplest isotonic regression model is a piecewise constant model. This model can have smaller residual errors than an OLS model that is globally linear.

While reading the Wikipedia article about isotonic models, I was reminded that an isotonic model can be solved by using a quadratic program (QP). Although this is not the fastest way to solve the isotonic regression, it is a cool application of quadratic optimization. This article shows how to solve for the parameters in a univariate isotonic regression model by using the NLPQUA subroutine in SAS/IML software in SAS 9.4. SAS Viya customers can solve the same problem by using the QPSOLVE subroutine in SAS IML, which has a simpler syntax. Customers of SAS Optimization can use PROC OPTMODEL and the QP solver.

Isotonic regression and the QP

The general form of a quadratic programming (QP) problem (in matrix form) is as follows. You want to find a vector, c, that minimizes the quadratic polynomial
      \(\frac{1}{2} \mathbf{c}' Q \mathbf{c} + \mathbf{v}' \mathbf{c} + \mbox{const}\)
subject to boundary constraints and linear constraints of the form A*cb. Here Q is a symmetric matrix. Let's see how isotonic regression can be cast as a QP problem.

Suppose you have n pairs of points (x1,y1), (x2,y2), ..., (xn,yn). An isotonic regression searches for a column vector of n values c = (c1, c2, ..., cn) that minimize the quantity \(\sum_{i=1}^n (c_i - y_i)^2\). Although the x values do not appear in the equation, they are used because the isotonic model passes through the points (x1,c1), (x2,c2), ..., (xn,cn).

If you expand the quadratic terms, you get the following quadratic objective function:
      \(\sum_i c_i^2 - 2 \sum_i c_i y_i + \sum_i y_i^2\)
This quadratic polynomial is represented in matrix form as follows:
      \(\frac{1}{2} \mathbf{c}' (2I) \mathbf{c} - 2 \mathbf{y}' \mathbf{c} + \mathbf{y}' \mathbf{y}\).
So, the vector representation of the objective function uses Q = 2*I, v = -2*y, and \(\mbox{const} = \mathbf{y}' \mathbf{y}\).

Notice that the vector y is known. The goal is to find a vector c that minimizes the expression. The parameter is subject to the isotonic constraints. For a nondecreasing regression, the n-1 constraints are c[1] ≤ c[2], c[2] ≤ c[3], ..., c[n-1]& le; c[n]. For a nonincreasing regression, the inequality symbols are reversed. Thus, the linear constraint matrix is a banded matrix that has 1 on the diagonal and -1 on the superdiagonal. The constraints can be written as the matrix system

\[ \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix} \leq \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \]

Specify the quadratic objective function in IML

Let's assume that the variables x and y are read into IML vectors and that neither have missing values. Let n be the size of the vectors.

Whether you use the older NLPQUA subroutine or the newer QPSOLVE subroutine, you must specify the matrix Q and the vector v for the QP. Because Q=2*I is a diagonal matrix, you could express Q as a dense matrix: Q=2*I(nrow(y)). But this is wasteful of memory. Both NLPQUA and QPSOLVE enable you to use a three-column sparse representation for Q. The first column is the set of row indices (1:n), the second is the set of column indices (also 1:n), and the third column is the set of values (here, all 2s).

Although the constant term is not necessary for the optimization, you can optionally append it to the v vector. The IML code to specify the quadratic objective function follows:

/* example data has 19 observations */
data Have;
set sashelp.class;
x = Height;
y = Weight;
keep x y;
run;
 
proc iml;
use Have;
read all var {x y} into Z;
close Have;
 
/* Note: If there are missing values, extract complete cases.
   The COMPLETECASES function is available in SAS Viya:
   https://go.documentation.sas.com/doc/en/pgmsascdc/v_049/casimllang/casimllang_common_sect057.htm
   For SAS 9.4, use this definition from
   https://blogs.sas.com/content/iml/2015/02/23/complete-cases.html
*/
n = nrow(Z);
call sort(Z, 1); /* sort for graphing */
x = Z[,1];       /* the x values aren't used for the computation, only for plotting */
y = Z[,2];
 
/* to specify a QP, specify matrix Q and vector v and solve
   1/2*b`*Q*b + v`*b + c = 0 for the vector b of parameters */
/* specify Q=2*I in three-column sparse form. (Use 2*I b/c of leading 1/2.) */
sparseQ = T(1:n) || T(1:n) || j(n,1,2);    /* 2*I */
/* specify linear vector and constant terms */
v = -2*y`;
c = ssq(y); 
linvec = v || c;

Specify the linear constraints in IML

The specification of constraints for the NLPQUA subroutine is a little complicated. The specification was greatly simplified in the QPSOLVE subroutine. They each use the same building blocks, so I'll show the harder of the two cases, which is the specification for NLPQUA.

/* Specify the linear inequalities as a matrix system.
   For an increasing isotonic model, A*b <= 0.
   In CALL NLPQUA, assign each row as 
   0 0 ..0 0 +1 -1 0 .. 0 | EQCode 0,
   where EQCode= -1 for LEQ and +1 for GEQ. For QPSOLVE, the A matrix, the inequality symbol,
   and the RHS are specified separately.
*/
/* define subscripts of diagonal and superdiagonal elements.
   See https://blogs.sas.com/content/iml/2015/11/25/extract-elements-matrix.html
*/
lincon = j(n-1,n+2,0);
dim = dimension(lincon);
rows = T(1:n-1);               /* row numbers */
diag      =  rows || rows;     /* (i,i) subscripts */
superDiag =  rows || rows+1;   /* (i,i+1) subscripts */
lincon[ sub2ndx(dim,diag     ) ] =  1;  /* +1 on diagonal */
lincon[ sub2ndx(dim,superDiag) ] = -1;  /* -1 on superdiagonal */
/* use correlation to decide whether model is increasing or decreasing */
r = corr(Z);
if r[1,2] > 0 then 
   EQCode = -1;  /* LE for monotone increasing */
else
   EQCode = +1;  /* GE for monotone increasing */
lincon[ ,n+1] =  EQCode;  
lincon[ ,n+2] =  0;       /* RHS */

The (n-1) x n matrix, A, is defined as a bi-diagonal matrix. The (n+1)th column encodes the inequality operation. The (n+2)th column contains the right-hand side of the constraint system, which is 0 in this problem.

For the NLPQUA subroutine, the linear constraint matrix must be appended to a representation for the boundary constraints. There are no boundary constraints for this problem, so you can use a matrix of missing values:

/* specify the parameter bounds, if any */
bounds = j(2, n+2, .);
/* for NLPQUA, the bounds and linear constraint matrix are 
   assembled into one argument. In QPSOLVE, they are separate */
con  = bounds // lincon;  /* for NLPQUA */

The remaining step is to specify an initial guess for the vector of parameters. The following statement sets all parameters to the mean of the response values. The information is then passed to the NLPQUA routine, which solves for the c vector:

p0 = j(1, n, mean(y));   /* initial guess is c[i]=mean(y) */
opt = {0 0};              /* minimize quadratic; no print */
call nlpqua(rc, c, sparseQ, p0, opt) BLC=con lin=linvec;
pred = colvec(c);

Visualize the isotonic regression

You can write the predicted values to a SAS data set and visualize the fit by using the STEP statement in PROC SGPLOT, which graphs piecewise constant functions, as follows:

create want var{'x' 'y' 'pred'};  append;  close;
QUIT;
 
title "Isotonic Regression";
proc sgplot data=Want noautolegend;
   scatter x=x y=y;
   step x=x y=pred / markers markerattrs=(symbol=x) JUSTIFY=center curvelabel="Isotonic Reg";
   yaxis label="y";
run;

For these data, the regression model is nondecreasing. It has 12 unique values at which the model increases. Notice that I have graphed the step function by using the JUSTIFY=CENTER option on the STEP statement. It is equally valid to use JUSTIFY=LEFT or JUSTIFY=RIGHT. That is because the function is only defined at values of the x coordinates in the data. Any piecewise constant function that passes through those points is equally valid.

For that matter, if f is ANY monotonic function for which f(x[i])=c[i], that function is also a valid solution. For example, you can use linear interpolation to form a piecewise linear model that agrees with isotonic values, as shown below.

The piecewise constant model is simpler, but there is no mathematical reason to prefer one model over another. However, there might be domain-specific reasons to prefer a model.

Summary

Isotonic regression can be formulated as a quadratic programming problem. This article shows how to formulate the problem and how to solve it in the SAS IML language. The article shows how to use the old-style NLPQUA subroutine. The syntax for the newer QPSOLVE subroutine is simpler because each mathematical entity (bounds, constraint matrix, vector of inequality symbols) are specified separately, whereas the NLPQUA subroutine combines various entities into a single matrix argument.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

Leave A Reply

Back to Top