How to calculate Word Mover's Distance with SAS

Word Mover's Distance (WMD) is a distance metric that measures the dissimilarity between two documents. Its application in text analytics was introduced in 2015 by a research group from Washington University in St. Louis. The group's paper, From Word Embeddings To Document Distances, was presented at the 32nd International Conference on Machine Learning (ICML). In this paper, they demonstrated that the WMD metric leads to unprecedentedly low k-nearest neighbor document classification error rates on eight real-world document classification data sets.

They leveraged word embeddings and WMD to classify documents. The biggest advantage of this method over traditional methods is its ability to incorporate the semantic similarity between individual word pairs (e.g. President and Obama) into the document distance metric. Traditionally, one way to handle semantically similar words is to provide a synonym table so that the algorithm can merge words with the same meaning into a representative word before measuring document distance; without it, the dissimilarity result is inaccurate. However, maintaining synonym tables requires continuous effort from human experts, which is time-consuming and very expensive. Additionally, the meaning of a word depends on the domain, so a general synonym table does not work well across varied domains.

Definition of Word Mover's Distance

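WMD defines the distance between two documents as the minimum (weighted) cumulative cost required to move all words from one document to the other. The distance is calculated by solving the following linear programming problem:

\[
\min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij}\, c(i, j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i \in \{1, \dots, n\},
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j \in \{1, \dots, n\}
\]

where

T_ij denotes how much of word i in document d travels to word j in document d';
c(i, j) denotes the cost of “traveling” from word i in document d to word j in document d'; here the cost is the two words' Euclidean distance in the word2vec embedding space;
d_i and d'_j are the normalized weights of word i in document d and word j in document d', and n is the vocabulary size.

If word i appears c_i times in document d, we denote

\[
d_i = \frac{c_i}{\sum_{j=1}^{n} c_j}
\]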

WMD is a special case of the earth mover's distance metric (EMD), a well-known transportation problem.
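
To make the two ingredients of the definition concrete, here is a minimal SAS sketch that computes the normalized word weights d_i and the pairwise Euclidean costs c(i, j). Everything in it is an assumption made for illustration: the data set names (doc1, doc2, doc1_w, doc2_w, wmd_arcs), the word counts, and the 2-dimensional vectors, which are made up and are not real word2vec output.

```sas
/* Hypothetical input: one row per distinct word, with its count and a  */
/* made-up 2-D embedding (illustration only, not real word2vec output). */
data doc1;
   input word :$12. cnt v1 v2;
   datalines;
obama      1  0.9  0.1
speaks     1  0.2  0.8
media      2  0.4  0.5
;

data doc2;
   input word :$12. cnt v1 v2;
   datalines;
president  1  0.8  0.2
greets     1  0.3  0.7
press      1  0.5  0.5
;

proc sql;
   /* d_i = c_i / (sum of all counts in the document).                 */
   /* PROC SQL remerges the summary statistic with the detail rows.    */
   create table doc1_w as
      select word, cnt / sum(cnt) as weight from doc1;
   create table doc2_w as
      select word, cnt / sum(cnt) as weight from doc2;

   /* c(i, j) = Euclidean distance between the two word vectors,       */
   /* computed for every (word in doc1, word in doc2) pair.            */
   create table wmd_arcs as
      select a.word as _tail_, b.word as _head_,
             sqrt((a.v1 - b.v1)**2 + (a.v2 - b.v2)**2) as _cost_
      from doc1 as a, doc2 as b;
quit;
```

In a transportation formulation, tables like doc1_w and doc2_w would supply the node weights and wmd_arcs the arc costs. Because both sets of weights sum to 1, the supply and demand constraints for WMD are both equalities.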

How to Calculate Earth Mover's Distance with SAS?

SAS/OR provides tools for solving transportation problems. Figure-1 shows a transportation example with four nodes and the distances between them, which I copied from this Earth Mover's Distance document. The objective is to find the minimum-cost flow from {x1, x2} to {y1, y2}. Now let's see how to solve this transportation problem using SAS/OR.

The weights of nodes and distances between nodes are given below.

data x_set;
   input _node_ $ _sd_;
   datalines;
x1    0.74
x2    0.26
;
 
data y_set;
   input _node_ $ _sd_;
   datalines;
y1    0.23
y2    0.51
;
 
data arcdata;
   input _tail_ $ _head_ $ _cost_;
   datalines;
x1    y1    155.7
x1    y2    252.3
x2    y1    292.9
x2    y2    198.2
;
 
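proc optmodel;
   /* Supply nodes and their weights */
   set  xNODES;
   num w {xNODES};

   /* Demand nodes and their weights */
   set  yNODES;
   num u {yNODES};

   /* Arcs from supply nodes to demand nodes, with per-unit cost */
   set <str,str> ARCS;
   num arcCost   {ARCS};

   read data x_set into xNODES=[_node_] w=_sd_;
   read data y_set into yNODES=[_node_] u=_sd_;
   read data arcdata into ARCS=[_tail_ _head_] arcCost=_cost_;

   /* Nonnegative flow on each arc */
   var flow {<i,j> in ARCS} >= 0;
   /* Total demand, used to normalize the total cost */
   impvar sumY = sum{j in yNODES} u[j];
   min obj = (sum {<i,j> in ARCS} arcCost[i,j] * flow[i,j])/sumY;

   /* Each demand node must receive exactly its weight */
   con con_y {j in yNODES}: sum {<i,(j)> in ARCS} flow[i,j] = u[j];
   /* Each supply node can send at most its weight */
   con con_x {i in xNODES}: sum {<(i),j> in ARCS} flow[i,j] <= w[i];

   /* Solve the LP with the network simplex algorithm */
   solve with lp / algorithm=ns scale=none logfreq=1;
   print flow;
quit;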

Table-1 shows the solution from SAS/OR; the EMD is the objective value, 203.26756757.

Table-2 shows the flow data I got from SAS/OR, which is the same as the diagram posted in the aforementioned Earth Mover's Distance document.
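
As a sanity check, the objective value can be reproduced by hand. The flows used here are derived from the input data rather than read from Table-2, so treat them as an assumption: y1's demand of 0.23 is served entirely by x1, and y2's demand of 0.51 is served by 0.26 from x2 plus the remaining 0.25 from x1. Dividing the total cost by the total flow gives:

```latex
\[
\mathrm{EMD}
= \frac{0.23 \times 155.7 + 0.25 \times 252.3 + 0.26 \times 198.2}{0.23 + 0.51}
= \frac{35.811 + 63.075 + 51.532}{0.74}
= \frac{150.418}{0.74}
\approx 203.2676
\]
```

This agrees with the objective value of 203.26756757 reported above.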