Isotonic regression (also called monotonic regression) is a type of regression model that assumes that the response variable is a monotonic function of the explanatory variable(s). The model can be nondecreasing or nonincreasing. Certain physical and biological processes can be analyzed by using an isotonic regression model. For example, a growth curve for plants is assumed to be monotonic. Mathematically, a function, f, is monotonic if it preserves order: f is monotonic nondecreasing if f(x1) ≤ f(x2) for all x1 ≤ x2, and monotonic nonincreasing if f(x1) ≥ f(x2) for all x1 ≤ x2.
In a simple univariate linear regression model, monotonicity follows from linearity. However, there are regression models that are not linear. The simplest isotonic regression model is a piecewise constant model. This model can have smaller residual errors than an OLS model that is globally linear.
While reading the Wikipedia article about isotonic models, I was reminded that an isotonic model can be solved by using a quadratic program (QP). Although this is not the fastest way to solve the isotonic regression problem, it is a cool application of quadratic optimization. This article shows how to solve for the parameters in a univariate isotonic regression model by using the NLPQUA subroutine in SAS/IML software in SAS 9.4. SAS Viya customers can solve the same problem by using the QPSOLVE subroutine in SAS IML, which has a simpler syntax. Customers of SAS Optimization can use PROC OPTMODEL and the QP solver.
Isotonic regression and the QP
The general form of a quadratic programming (QP) problem (in matrix form) is as follows. You want to find a vector, c, that minimizes the quadratic polynomial
\(\frac{1}{2} \mathbf{c}' Q \mathbf{c} + \mathbf{v}' \mathbf{c} + \mbox{const}\)
subject to boundary constraints and linear constraints of the form A*c ≤ b.
Here Q is a symmetric matrix.
Let's see how isotonic regression can be cast as a QP problem.
Suppose you have n pairs of points (x1,y1), (x2,y2), ..., (xn,yn). An isotonic regression searches for a column vector of n values c = (c1, c2, ..., cn) that minimize the quantity \(\sum_{i=1}^n (c_i - y_i)^2\). Although the x values do not appear in the equation, they are used because the isotonic model passes through the points (x1,c1), (x2,c2), ..., (xn,cn).
If you expand the quadratic terms, you get the following quadratic objective function:
\(\sum_i c_i^2 - 2 \sum_i c_i y_i + \sum_i y_i^2\)
This quadratic polynomial is represented in matrix form as follows:
\(\frac{1}{2} \mathbf{c}' (2I) \mathbf{c} - 2 \mathbf{y}' \mathbf{c} + \mathbf{y}' \mathbf{y}\).
So, the vector representation of the objective function uses Q = 2*I, v = -2*y, and
\(\mbox{const} = \mathbf{y}' \mathbf{y}\).
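As a quick sanity check (outside of SAS), you can verify numerically that the matrix form with Q = 2I, v = -2y, and const = y'y reproduces the sum of squared residuals. A minimal Python sketch with made-up values for y and c:

```python
# Verify that 1/2 * c'(2I)c + (-2y)'c + y'y equals sum((c_i - y_i)^2)
y = [2.0, 5.0, 3.0]
c = [1.0, 4.0, 4.0]

# Direct objective: sum of squared residuals
ssr = sum((ci - yi) ** 2 for ci, yi in zip(c, y))

# Matrix form: Q = 2*I (so 1/2 c'Qc = c'c), v = -2*y, const = y'y
quad  = sum(ci * ci for ci in c)                   # 1/2 * c'(2I)c
lin   = sum(-2 * yi * ci for yi, ci in zip(y, c))  # v'c
const = sum(yi * yi for yi in y)                   # y'y
print(quad + lin + const == ssr)
```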
Notice that the vector y is known. The goal is to find a vector c that minimizes the expression, subject to the isotonic constraints. For a nondecreasing regression, the n-1 constraints are c[1] ≤ c[2], c[2] ≤ c[3], ..., c[n-1] ≤ c[n]. For a nonincreasing regression, the inequality symbols are reversed. Thus, the linear constraint matrix is a banded matrix that has 1 on the diagonal and -1 on the superdiagonal. The constraints can be written as the matrix system
\[ \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix} \leq \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]
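For reference, here is how the same (n-1) x n bi-diagonal constraint matrix could be built outside of SAS. This is a short illustrative Python sketch (plain lists, no libraries assumed):

```python
# Build the (n-1) x n constraint matrix A with +1 on the diagonal
# and -1 on the superdiagonal, so that A*c <= 0 encodes c_i <= c_{i+1}
def isotonic_constraint_matrix(n):
    A = [[0] * n for _ in range(n - 1)]
    for i in range(n - 1):
        A[i][i] = 1        # +1 on diagonal
        A[i][i + 1] = -1   # -1 on superdiagonal
    return A

A = isotonic_constraint_matrix(4)
# Each row of A dotted with c gives c_i - c_{i+1}, which must be <= 0
```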
Specify the quadratic objective function in IML
Let's assume that the variables x and y are read into IML vectors and that neither has missing values. Let n be the size of the vectors.
Whether you use the older NLPQUA subroutine or the newer QPSOLVE subroutine, you must specify the matrix Q and the vector v for the QP. Because Q=2*I is a diagonal matrix, you could express Q as a dense matrix: Q=2*I(nrow(y)). But this is wasteful of memory. Both NLPQUA and QPSOLVE enable you to use a three-column sparse representation for Q. The first column is the set of row indices (1:n), the second is the set of column indices (also 1:n), and the third column is the set of values (here, all 2s).
Although the constant term is not necessary for the optimization, you can optionally append it to the v vector. The IML code to specify the quadratic objective function follows:
/* example data has 19 observations */
data Have;
set sashelp.class;
x = Height; y = Weight;
keep x y;
run;

proc iml;
use Have; read all var {x y} into Z; close Have;
/* Note: If there are missing values, extract complete cases.
   The COMPLETECASES function is available in SAS Viya:
   https://go.documentation.sas.com/doc/en/pgmsascdc/v_049/casimllang/casimllang_common_sect057.htm
   For SAS 9.4, use this definition from
   https://blogs.sas.com/content/iml/2015/02/23/complete-cases.html */
n = nrow(Z);
call sort(Z, 1);          /* sort for graphing */
x = Z[,1];                /* the x values aren't used for the computation, only for plotting */
y = Z[,2];
/* to specify a QP, specify matrix Q and vector v and minimize
   1/2*b`*Q*b + v`*b + c over the vector b of parameters */
/* specify Q=2*I in three-column sparse form. (Use 2*I b/c of leading 1/2.) */
sparseQ = T(1:n) || T(1:n) || j(n,1,2);   /* 2*I */
/* specify linear vector and constant terms */
v = -2*y`;
c = ssq(y);
linvec = v || c;
Specify the linear constraints in IML
The specification of constraints for the NLPQUA subroutine is a little complicated. The specification was greatly simplified in the QPSOLVE subroutine. They each use the same building blocks, so I'll show the harder of the two cases, which is the specification for NLPQUA.
/* Specify the linear inequalities as a matrix system.
   For an increasing isotonic model, A*b <= 0.
   In CALL NLPQUA, assign each row as
      0 0 ..0 0 +1 -1 0 .. 0 | EQCode 0
   where EQCode = -1 for LEQ and +1 for GEQ.
   For QPSOLVE, the A matrix, the inequality symbol, and the RHS
   are specified separately. */
/* define subscripts of diagonal and superdiagonal elements. See
   https://blogs.sas.com/content/iml/2015/11/25/extract-elements-matrix.html */
lincon = j(n-1,n+2,0);
dim = dimension(lincon);
rows = T(1:n-1);               /* row numbers */
diag = rows || rows;           /* (i,i) subscripts */
superDiag = rows || rows+1;    /* (i,i+1) subscripts */
lincon[ sub2ndx(dim,diag)      ] =  1;    /* +1 on diagonal */
lincon[ sub2ndx(dim,superDiag) ] = -1;    /* -1 on superdiagonal */
/* use correlation to decide whether model is increasing or decreasing */
r = corr(Z);
if r[1,2] > 0 then EQCode = -1;   /* LE for monotone increasing */
else               EQCode = +1;   /* GE for monotone decreasing */
lincon[ ,n+1] = EQCode;
lincon[ ,n+2] = 0;                /* RHS */
The (n-1) x n matrix, A, is defined as a bi-diagonal matrix. The (n+1)th column encodes the inequality operation. The (n+2)th column contains the right-hand side of the constraint system, which is 0 in this problem.
For the NLPQUA subroutine, the linear constraint matrix must be appended to a representation for the boundary constraints. There are no boundary constraints for this problem, so you can use a matrix of missing values:
/* specify the parameter bounds, if any */
bounds = j(2, n+2, .);
/* for NLPQUA, the bounds and linear constraint matrix are assembled
   into one argument. In QPSOLVE, they are separate */
con = bounds // lincon;   /* for NLPQUA */
The remaining step is to specify an initial guess for the vector of parameters. The following statement sets all parameters to the mean of the response values. The information is then passed to the NLPQUA routine, which solves for the c vector:
p0 = j(1, n, mean(y));    /* initial guess is c[i]=mean(y) */
opt = {0 0};              /* minimize quadratic; no print */
call nlpqua(rc, c, sparseQ, p0, opt) BLC=con lin=linvec;
pred = colvec(c);
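As noted earlier, the QP formulation is not the fastest way to fit an isotonic model. The standard direct method is the pool-adjacent-violators algorithm (PAVA). If you want to cross-check the QP solution outside of SAS, here is a minimal pure-Python PAVA sketch for the unweighted, nondecreasing case (illustrative only; the function name is my own):

```python
def pava(y):
    """Return the nondecreasing least-squares fit to y via
    the pool-adjacent-violators algorithm (unweighted)."""
    # Each block stores [sum of values, count]; merge adjacent
    # blocks whenever their means violate the monotone order
    blocks = []
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and \
              blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)   # each pooled block gets its mean
    return fit

print(pava([1, 3, 2, 4]))   # [1.0, 2.5, 2.5, 4.0]
```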
Visualize the isotonic regression
You can write the predicted values to a SAS data set and visualize the fit by using the STEP statement in PROC SGPLOT, which graphs piecewise constant functions, as follows:
create want var{'x' 'y' 'pred'};
append;
close;
QUIT;

title "Isotonic Regression";
proc sgplot data=Want noautolegend;
   scatter x=x y=y;
   step x=x y=pred / markers markerattrs=(symbol=x)
                     JUSTIFY=center curvelabel="Isotonic Reg";
   yaxis label="y";
run;
For these data, the regression model is nondecreasing. It has 12 unique values at which the model increases. Notice that I have graphed the step function by using the JUSTIFY=CENTER option on the STEP statement. It is equally valid to use JUSTIFY=LEFT or JUSTIFY=RIGHT. That is because the function is only defined at values of the x coordinates in the data. Any piecewise constant function that passes through those points is equally valid.
For that matter, if f is ANY monotonic function for which f(x[i])=c[i], that function is also a valid solution. For example, you can use linear interpolation to form a piecewise linear model that agrees with the isotonic values.
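To illustrate, the following Python sketch performs linear interpolation through fitted points. Because the fitted values are monotone, the piecewise linear interpolant is also monotone. (This is a hypothetical example with made-up values, not the SAS code from the article.)

```python
def interp_linear(x0, xs, cs):
    """Piecewise linear interpolation through the points (xs[i], cs[i]).
    Assumes xs is strictly increasing and xs[0] <= x0 <= xs[-1]."""
    for i in range(len(xs) - 1):
        if xs[i] <= x0 <= xs[i + 1]:
            t = (x0 - xs[i]) / (xs[i + 1] - xs[i])
            return cs[i] + t * (cs[i + 1] - cs[i])
    return cs[-1]

xs = [1.0, 2.0, 4.0]
cs = [0.0, 0.0, 2.0]   # monotone fitted values c[i]
print(interp_linear(3.0, xs, cs))  # 1.0
```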
The piecewise constant model is simpler, but there is no mathematical reason to prefer one model over another. However, there might be domain-specific reasons to prefer a model.
Summary
Isotonic regression can be formulated as a quadratic programming problem. This article shows how to formulate the problem and how to solve it in the SAS IML language. The article shows how to use the old-style NLPQUA subroutine. The syntax for the newer QPSOLVE subroutine is simpler because each mathematical entity (bounds, constraint matrix, vector of inequality symbols) is specified separately, whereas the NLPQUA subroutine combines various entities into a single matrix argument.
Comments
Alternatively, you can use the MONOTONE function in PROC TRANSREG to fit a linear regression model with monotonicity constraints.
Yes, and furthermore, the TRANSREG procedure enables you to obtain confidence limits for the predicted mean. It provides both "liberal" and "conservative" confidence intervals. You can fit both models and overlay the predicted values to compare the PROC TRANSREG output to the QP solution.