Just yesterday, Santa called my cell phone asking for a favor... Yes, Santa has my direct line, and I owe him (he once did me a solid, back in 1984, for Christmas, scoring me an awesome Optimus Prime Transformer). That's me there in the front - sporting plaid duds and an awesome head of hair. Santa has a dilemma. He has a warehouse full of the most popular gifts of the year (the elves have been working hard). In addition, he has a list of each child's gift preference. He wants to assign one gift to each child so that he maximizes the overall happiness of all the children. The problem, of course is, that some gifts are much more popular than others - but Santa has only a limited number of each type of gift. In order to assign gifts to children, he'll have to work hard to figure out the best allocation.

This is where I come into play. Santa recognized the fact that his dilemma is, in fact, an optimization problem that SAS/OR might be able to help with. Santa, as usual, is correct!

To model this, let's first define a set $C$ of children, and a set $G$ of gifts. If we have multiple copies of the same gift, just give each copy a separate identifier. Let $u_{cg}$ be child $c$'s level of unhappiness if he or she were to receive gift $g$. Now, define a binary decision variable $x_{cg}$, which will be chosen to be 1 if child $c$ is assigned gift $g$. Then, we can formulate the problem of minimizing the children's unhappiness (maximize happiness) as the following integer linear programming (ILP) problem:

\begin{align} & \text{minimize} & \sum_{c \in C} \sum_{g \in G} u_{cg} x_{cg} \notag \\ & \text{subject to} & \sum_{c \in C} x_{cg} & = 1 & & c \in C \\ & & \sum_{g \in G} x_{cg} & \leq 1 & & g \in G \\ & & x_{cg} & \in \{0,1\} & & c \in C, g \in G \end{align}

Constraint (1) ensures that each child is assigned exactly one gift. Constraint (2) restricts that each gift can be given to at most one child. This problem is known as the linear assignment problem (LAP).

First, let's generate some random data using a SAS DATA step so that we run some experiments. The real data is top secret and can be found only on a special server located at the North Pole. It is protected behind 128-bit encrypted firewalls and is quite difficult to access. The following code generates, for each child, an unhappiness value for each gift using a random variable. The gifts in this case are ordered by a global perceived value to add some correlation to the data creation. That is, certain gifts will be coveted by many children. We also produce some ineligible assignments using the missing (.) value. These are gifts that the particular child would just refuse (... kids are tough to please these days)!
%let num_children = 2000; %let num_gifts = 4000; data unhappiness_data(drop=c g); call streaminit(123); retain child; array gift[&num_gifts]; do c = 1 to &num_children; child = compress('child'||put(c,best.)); do g = 1 to &num_gifts; gift[g] = g*rand('UNIFORM'); if gift[g] > 0.2*&num_gifts then gift[g] = .; end; output; end; run;

PROC OPTMODEL
Now that we have data, let's read it into a PROC OPTMODEL session as follows:
proc optmodel; /* Read the unhappiness values from a data set and populate the set of gifts and the set of children. */ num dsid = open('unhappiness_data'); num nVars = attrn(dsid,'nvar'); set GIFTS = setof{i in 2..nVars} varname(dsid,i); set <str> CHILDREN; num unhappiness {CHILDREN, GIFTS}; read data unhappiness_data into CHILDREN=[child] {gift in GIFTS} <unhappiness[child,gift]=col(gift)>; set <str,str> ASSIGNMENTS; /* Construct a set of eligible child-gift assignments. */ set ELIGIBLE = {child in CHILDREN, gift in GIFTS: unhappiness[child,gift] ne .};
Before constructing an integer programming model, Santa asked if we could first try a simple idea to attempt to find a good assignment. He asked if we could try assigning (sequentially) each child the gift with the least level of unhappiness that has not yet been assigned. This is known as a greedy heuristic, as it greedily assigns the best gift available to a child in some sequence. Of course, this approach has no guarantee to find the optimal assignment, nor does it even guarantee to find some feasible assignment (since not all gifts are acceptable to all children). The following code shows how to do this in OPTMODEL:
 /* greedy heuristic */ num min; str argmin; num assigned {GIFTS} init 0; num cost init 0; num failed init 0; for {child in CHILDREN} do; min = constant('BIG'); argmin = ''; for {<(child),gift> in ELIGIBLE} do; if min > unhappiness[child,gift] and assigned[gift] = 0 then do; min = unhappiness[child,gift]; argmin = gift; assigned[gift] = 1; end; end; if argmin = '' then do; put 'Greedy heuristic failed.'; failed = 1; leave; end; else do; cost = cost + unhappiness[child,argmin]; end; end; if failed = 0 then put cost=;
In this particular case, the greedy heuristic fails to find a feasible solution. So we will next construct and solve the ILP problem as follows:
 /* Assign[child,gift] = 1 if this child is assigned this gift, 0 otherwise */ var Assign {ELIGIBLE} binary; min TotalUnhappiness = sum {<child,gift> in ELIGIBLE} unhappiness[child,gift] * Assign[child,gift]; con AssignChild {child in CHILDREN}: sum {<(child),gift> in ELIGIBLE} Assign[child,gift] = 1; con AssignGift {gift in GIFTS}: sum {<child,(gift)> in ELIGIBLE} Assign[child,gift] <= 1;   /* call MILP solver */ solve; ASSIGNMENTS = {<child,gift> in ELIGIBLE: Assign[child,gift].sol > 0.5}; create data solution_data_milp from [child gift]=ASSIGNMENTS unhappiness;
This works great. We now have a SAS data set called solution_data_milp that identifies the optimal child-gift assignments to minimize the overall unhappiness of all the children. The log below shows the progress of the MILP solver.
NOTE: The presolved problem has 4175256 variables, 6000 constraints, and
8350512 constraint coefficients.
NOTE: The MILP solver is called.
Node  Active    Sols    BestInteger      BestBound      Gap    Time
0       1       1    706.6848321    706.6848321    0.00%      27
0       0       1    706.6848321    706.6848321    0.00%      27
NOTE: Optimal.
NOTE: Objective = 706.68483205.

The problem took only 27 seconds to solve. Santa is very pleased! However, Santa has also had some coursework in integer linear programming (at North Pole U. in the 60s). So he immediately recognizes the fact that the coefficient matrix defining the feasible region of the model has a nice mathematical property known as total unimodularity (TU). For any matrix that is TU, we know that all of the corner points of the polyhedron that represents its feasible region are integral. Therefore, for this problem, we need not rely on the expensive algorithms embedded in an integer programming solver (MILP). Rather, we can simply solve the model as a linear program by relaxing the integrality restriction on the variables $x$. Then, solving the problem should be much easier and execute faster. To try this, we simply tell OPTMODEL to use the LP solver on the relaxed problem, as follows:
 /* call LP solver */ solve with LP relaxint;
NOTE: The integrality restriction was relaxed for 4172655 variables.
NOTE: The presolved problem has 4175256 variables, 6000 constraints, and
8350512 constraint coefficients.
NOTE: The LP solver is called.
NOTE: The Dual Simplex algorithm is used.
Objective
Phase Iteration        Value         Time
D 1          1    0.000000E+00         0
D 2          2    0.000000E+00         0
D 2       1323    2.162238E+02         0
D 2       2692    4.873860E+02         1
D 2       3442    6.267065E+02         2
D 2       3911    6.660432E+02         3
D 2       4304    6.945121E+02         3
D 2       4649    7.058489E+02         4
D 2       4746    7.066848E+02         5
NOTE: Optimal.
NOTE: Objective = 706.68483205.
NOTE: The Dual Simplex solve time is 5.49 seconds.

Better! Now the problem solves in 5.49 seconds. Santa is getting excited. However, Santa now thinks back to his other coursework in network optimization. He realizes that this model can also be formulated as a minimum-cost network flow (MCF) problem. However, he admits that constructing such a model might be a bit tedious. Luckily, SAS/OR has an LP solver option for this exact situation. Under the hood, the solver searches for an embedded network through translations and reflections of the original matrix. Once the solver puts the problem into this form, it can invoke the very fast network simplex algorithm and solve the problem even faster. To try this, we simply tell OPTMODEL to use the network simplex algorithm on the relaxed problem, as follows:
 /* call LP solver with network simplex algorithm */ solve with LP relaxint / algorithm=network;
NOTE: The integrality restriction was relaxed for 4172655 variables.
NOTE: The Network Simplex algorithm is used.
NOTE: The network has 6000 rows (100.00%), 4172655 columns (100.00%), and
1 component.
The network extraction and setup time is 0.23 seconds.
Primal         Primal           Dual
Iteration      Objective  Infeasibility  Infeasibility     Time
1   7.124332E-02   1.000000E+00   4.806800E+06     0.30
10000   1.124941E+03   0.000000E+00   1.210373E+01     0.37
20000   1.030613E+03   0.000000E+00   5.678750E+00     0.45
30000   9.632152E+02   0.000000E+00   3.236318E+00     0.53
40000   9.178664E+02   0.000000E+00   2.570354E+00     0.62
50000   8.739243E+02   0.000000E+00   2.554968E+00     0.72
60000   8.325805E+02   0.000000E+00   1.197253E+00     0.83
70000   7.816137E+02   0.000000E+00   7.320827E-01     1.05
75708   7.066848E+02   0.000000E+00   0.000000E+00     1.81
NOTE: The Network Simplex solve time is 1.72 seconds.
NOTE: The total Network Simplex solve time is 1.95 seconds.
NOTE: Optimal.
NOTE: Objective = 706.68483205.

Great! Now the problem solves in a speedy 1.95 seconds. Santa is satisfied and wants to run with this approach. But, there is one more thing I'd like to show Santa. A completely different way of thinking about this problem is to consider each child and each gift as nodes in a bipartite graph. The edges between the nodes are eligible assignments, and each edge weight is the level of unhappiness if the edge is chosen. The problem can now be thought of as the minimal-weight bipartite matching problem. Inside PROC OPTMODEL, SAS/OR provides a full suite of network algorithms, including weighted bipartite matching (also known as the linear assignment problem). To try this, we tell OPTMODEL which data attributes define the graph (using the links= and subgraph= options) as follows:
 /* call network solver with linear assignment algorithm */ solve with NETWORK / graph_direction = directed links = (weight=unhappiness) subgraph = (links=ELIGIBLE) linear_assignment out = (assignments=ASSIGNMENTS);
NOTE: The number of nodes in the input graph is 6000.
NOTE: The number of links in the input graph is 4172655.
NOTE: Processing the linear assignment problem.
NOTE: Objective = 706.68483205.
NOTE: Processing the linear assignment problem used 0.25 (cpu: 0.25) seconds.
Eureka - 0.25 seconds! Now that is ludicrous speed!

PROC OPTNET
PROC OPTNET is a separate procedure that provides the same suite of network algorithms provided in OPTMODEL through the solve with NETWORK syntax. OPTNET works directly with data sets in the form of a graph (links/nodes data sets), or in the case of a bipartite graph, it can read directly a matrix. For Santa's problem, the data set unhappiness_data is already in the form of a matrix with children as the rows, and gifts as the columns. We can use this directly and invoke the solver as follows:
proc optnet data_matrix = unhappiness_data; linear_assignment out = solution_data_optnet id = (child); run;
NOTE: Processing the linear assignment problem.
NOTE: Objective = 706.68483205.
NOTE: Processing the linear assignment problem used 0.11 (cpu: 0.11) seconds.

In this case, we also solve the problem extremely fast and save a good deal of memory compared to OPTMODEL, which stores a symbolic representation of the model (or graph). For very large instances of graph or network problems, PROC OPTNET is often a good alternative.

We can run the data generation step with more children and more gifts, and the following log shows PROC OPTNET handling a more realistic problem size (e.g., a small town in Santa's delivery schedule):
%let num_children = 10000; %let num_gifts = 100000;
NOTE: Processing the linear assignment problem.
NOTE: The linear assignment problem has 10000 rows.
NOTE: The linear assignment problem has 100000 columns.
NOTE: The linear assignment problem has 521886931 nonzeros.
NOTE: Starting augmenting row reduction for 10000 rows (pass 1).
NumRows   Complete   CpuTime   RealTime
1000        10%     17.64      17.64
2000        20%     18.02      18.02
3000        30%     18.41      18.41
4000        40%     18.80      18.80
5000        50%     19.20      19.20
6000        60%     19.62      19.63
7000        70%     20.05      20.05
8000        80%     20.47      20.47
9000        90%     20.92      20.92
10000       100%     21.39      21.39
10000       100%     21.39      21.39
NOTE: Objective = 1920.5011207.
NOTE: Processing the linear assignment problem used 21.39 (cpu: 21.39) seconds.

Now - if we could only help Santa with his routing problem!

Linear Assignment Problem
The linear assignment problem (or weighted bipartite matching problem) is useful in a number of other contexts besides what we have presented here.

In 2012, Alvin E. Roth and Lloyd S. Shapley won the Nobel Prize in Economics for their contributions to the theory of stable matchings and the design of markets. An application of their work was for solving the NYC public school system's student assignment problem, where students are assigned to schools based on some preference level. Their approach was much more complicated and relied on game theory. However, the basic version of the model is similar to the one we solved for Santa.

Another example where LAP can be used to model a useful (real-life) problem is in the area of drug clinical trials. The case/control matching problem tries to maximize the similarity of various attributes (age, gender, weight, etc.) between the cases (those that received the drug) to controls (those that did not receive the drug) to better understand the statistical relevance of the effects. Similarly, in epidemiology, the cases are the people who have contracted a particular disease, while the controls have been exposed to the disease but have not contracted it. The Mayo Clinic's Biomedical Statistics and Informatics Division maintains a suite of SAS macros including the %vmatch() macro, which is widely used in the biomedical statistics community. As part of the analysis, the macro solves a weighted bipartite matching problem using PROC OPTNET.

OK, Santa... I think we are even now. Debt repaid!

Acknowledgment: Rob Pratt (OPTMODEL code)
Share