Transform word vector tables from long to wide

In natural language processing, word vectors play a key role in making technologies such as machine translation and speech recognition possible. A word vector is a row of numeric values, each of which captures one dimension of the word's meaning. Each value represents how closely the word relates to the concept behind that dimension, so the semantics of the word are embedded across the dimensions of the vector. Because similar words have similar vectors, representing words this way simplifies and unifies operations on them.

Word vectors are generated by training on word-word co-occurrence statistics from a large corpus. You can also use pre-trained word vectors such as GloVe, provided by Stanford University.

Let's talk about how to transform word vector tables from long to wide in SAS, so we can potentially get sentence vectors to process further. Suppose we generate word vectors from the following 3 sentences:

Jack went outside.
Jill likes to draw in the afternoon.
Tony is a boy.

Each word has 2 numeric values (Vector1, Vector2); each value represents how closely the word relates to the concept defined by that dimension. The number of values per word (VNUM, here 2) may range from hundreds to thousands in real text analysis scenarios.

Long word vector table

The sample code below generates the sample table above and sorts it for further processing.

data HAVE;
  length Word $ 45;
  input SentenceID Word Vector1-Vector2; /* real tables may have 300+ vector columns */
datalines;
1 Jack       0.24011   0.400996
1 went      -0.047581  0.868716
1 outside   -1.197891  1.162238
2 Jill      -0.199579  0.251252
2 likes     -1.935640 -0.288264
2 to        -0.526053 -1.143420
2 draw      -0.736289 -0.794812
2 in        -2.757234  0.506639
2 the       -0.736289 -0.794812
2 afternoon -0.047581  0.868716
3 Tony       0.34032   0.600983
3 is         0.147531  0.968817
3 a          1.347543  2.568323
3 boy       -3.257891  3.172238
;
run;
proc sort data=HAVE;
  by SentenceID;
run;
proc print data=HAVE; run;

If we want to transform the long table above into a wide table as seen below, how can we do this as efficiently and simply as possible? The 14 words above belong to 3 sentences, so the result has 3 rows and 22 columns (1 + WNUM + WNUM x VNUM = 1 + 7 + 7 x 2 = 22).

Wide word vector table

Please note that we can calculate the maximum number of words in a sentence (WNUM) at runtime with the SAS code below. In the example above, the value of WNUM is 7.

proc sql noprint;
  /* TRIMMED strips the leading blanks that INTO would otherwise store in &wnum */
  select max(cnt) into :wnum trimmed from (
    select count(Word) as cnt from HAVE group by SentenceID
  );
quit;
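
VNUM and the expected width of the wide table could be derived at runtime in a similar way. The following is a minimal sketch, assuming that HAVE lives in the WORK library, that its vector columns are the only columns whose names start with "Vector", and that &wnum has already been set by the PROC SQL step above. The macro variable name ncols is illustrative.

```sas
proc sql noprint;
  /* Count the Vector* columns of WORK.HAVE using the DICTIONARY.COLUMNS metadata table. */
  /* LIBNAME and MEMNAME are stored in upper case there.                                  */
  select count(*) into :vnum trimmed
    from dictionary.columns
    where libname='WORK' and memname='HAVE'
      and upcase(name) like 'VECTOR%';
quit;

/* Width of the wide table: 1 ID column + WNUM word columns + WNUM x VNUM vector columns */
%let ncols=%eval(1 + &wnum + &wnum*&vnum);
%put NOTE: &=vnum &=wnum &=ncols;
```

For the sample table this should report 2 vector columns and 22 wide columns, matching the count worked out above.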

In fact, we don't need any SAS PROC to handle this kind of transformation. A SAS DATA step provides an efficient and convenient way to transform the data. The key is to use an ARRAY to map all word vectors from the source table, and then define two more ARRAYs to store the output words and vectors in wide style. These two output arrays must be RETAINed across iterations of the implicit loop and KEPT for OUTPUT when the step reaches the last row of each SentenceID.

You can see the full SAS code below with detailed comments.

/*Long table to Wide table*/
%let vnum=2; /*number of vector values per word*/
%let wnum=7; /*max number of words in a sentence*/
data WANT;
  set HAVE;
  by SentenceID;
  array _vector_ [*] Vector:;          /*Map to the source vector columns*/

  array _word [%eval(1*&wnum)] $ 45;   /*Array to store WORDs in the wide table*/
  array _vector [%eval(&wnum*&vnum)];  /*Array to store VECTORs in the wide table*/
  retain _word: _vector:;              /*RETAIN across the implicit loop*/

  retain _offset_ 0;                   /*Offset of a WORD in a sentence, base 0*/
  if first.SentenceID then do;         /*New sentence: clear the cached values*/
    call missing(of _word[*]);
    call missing(of _vector[*]);
    _offset_=0;
  end;
  else _offset_=_offset_+1;

  _word[_offset_+1]=Word;              /*Cache the current word at [_offset_+1]*/
  do i=1 to dim(_vector_);             /*Cache each vector value at [_offset_*&vnum+i]*/
    _vector[_offset_*&vnum+i]=_vector_[i];
  end;
  keep SentenceID _word: _vector:;     /*Columns written to the wide table*/

  if last.SentenceID then output;      /*Output the cached WORDs and VECTORs*/
run;

proc print data=WANT; run;
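
The introduction mentions sentence vectors as a possible next step. One simple way to obtain them, which is an assumption here and not something this article prescribes, is to average the word vectors of each sentence dimension by dimension. A minimal sketch against the WANT table, where SENTVEC and the SentVec columns are hypothetical names:

```sas
%let vnum=2; /*number of vector values per word*/
%let wnum=7; /*max number of words in a sentence*/
data SENTVEC;
  set WANT;
  array _v [*] _vector:;             /*Wide vector columns, word by word*/
  array SentVec [&vnum];             /*One averaged value per dimension*/
  do j=1 to &vnum;
    _sum=0;
    _n=0;
    do i=1 to &wnum;
      _x=_v[(i-1)*&vnum+j];          /*Dimension j of word i*/
      if not missing(_x) then do;    /*Skip empty slots of shorter sentences*/
        _sum=_sum+_x;
        _n=_n+1;
      end;
    end;
    if _n>0 then SentVec[j]=_sum/_n; /*Mean over the words actually present*/
  end;
  keep SentenceID SentVec:;
run;
proc print data=SENTVEC; run;
```

Because each sentence already sits on one row of the wide table, no BY-group processing is needed for this step.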

Accordingly, to transform the word vectors back from wide style to long style, we need to generate up to &WNUM rows with &VNUM vector columns for each sentence; this simply reverses the logic above. The full SAS code with detailed comments is listed below:

/*Wide table to Long table*/
data HAVE2;
  set WANT;

  array _word [*] _word:;           /*Array _word[] maps to the WORDs in the wide table*/
  array _vector_ [*] _vector:;      /*Array _vector_[] maps to the VECTORs in the wide table*/

  length Word $ 45;                 /*Output Word in the long table*/
  array Vector[&vnum];              /*Output Vectors in the long table*/
  do i=1 to &wnum;                  /*Unpack each word from array _word[]*/
    Word=_word[i];
    if Word=" " then continue;      /*Skip unused slots of shorter sentences*/
    do j=1 to &vnum;                /*Unpack its vector values from array _vector_[]*/
      Vector[j]=_vector_[j+&vnum*(i-1)];
    end;
    output;                         /*One wide row yields up to &wnum long rows*/
  end;
  keep SentenceID Word Vector:;
run;
proc print data=HAVE2; run;
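
Both DATA steps rely on the same index arithmetic: dimension j of the k-th word (both counted from 1) is stored in wide vector column (k-1) x VNUM + j. The short check below is only an illustrative sketch with made-up variable names; it prints that mapping together with its inverse, so you can confirm in the log that every column maps back to the word and dimension it came from.

```sas
data _null_;
  vnum=2;                                /*vector values per word*/
  wnum=7;                                /*max words per sentence*/
  do k=1 to wnum;                        /*word position in the sentence, base 1*/
    do j=1 to vnum;                      /*vector dimension, base 1*/
      col=(k-1)*vnum+j;                  /*long -> wide column index*/
      k_back=ceil(col/vnum);             /*wide -> word position*/
      j_back=col-(k_back-1)*vnum;        /*wide -> vector dimension*/
      put k= j= col= k_back= j_back=;
    end;
  end;
run;
```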

To wrap the bi-directional transformation above for general reuse in text analysis, we provide two SAS macros, which are called as follows:

%Long2Wide(data=Have, vnum=2, wnum=7, sid=SentenceId, word=Word, out=Want);
proc print data=Want;run;
 
%Wide2Long(data=Want, vnum=2, wnum=7, sid=Sentenceid, out=Have2, outword=Word, outvector=Vector);
proc print data=Have2;run;
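
The macro definitions themselves are not reproduced in this section. As a rough sketch only, assuming each macro simply parameterizes the corresponding DATA step shown earlier, they might look like the following. The parameter names are taken from the calls above; the bodies, the fixed $ 45 word length, and the assumption that the source vector columns start with "Vector" are illustrative, not the published implementation.

```sas
/* Hypothetical sketch: long -> wide */
%macro Long2Wide(data=, vnum=, wnum=, sid=, word=, out=);
  data &out;
    set &data;
    by &sid;
    array _vector_ [*] Vector:;            /*Assumes source columns named Vector*/
    array _word [&wnum] $ 45;
    array _vector [%eval(&wnum*&vnum)];
    retain _word: _vector: _offset_;
    if first.&sid then do;                 /*New sentence: clear the cached values*/
      call missing(of _word[*], of _vector[*]);
      _offset_=0;
    end;
    else _offset_=_offset_+1;
    _word[_offset_+1]=&word;
    do _i=1 to dim(_vector_);
      _vector[_offset_*&vnum+_i]=_vector_[_i];
    end;
    keep &sid _word: _vector:;
    if last.&sid then output;
  run;
%mend Long2Wide;

/* Hypothetical sketch: wide -> long */
%macro Wide2Long(data=, vnum=, wnum=, sid=, out=, outword=, outvector=);
  data &out;
    set &data;
    array _w [*] _word:;
    array _v [*] _vector:;
    length &outword $ 45;
    array &outvector [&vnum];              /*Creates &outvector.1 to &outvector.&vnum*/
    do _i=1 to &wnum;
      &outword=_w[_i];
      if missing(&outword) then continue;  /*Skip unused slots*/
      do _j=1 to &vnum;
        &outvector[_j]=_v[_j+&vnum*(_i-1)];
      end;
      output;
    end;
    keep &sid &outword &outvector:;
  run;
%mend Wide2Long;
```

Like the DATA steps it wraps, this version of %Long2Wide expects the input to be sorted by the sentence ID column.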

We have demonstrated how to transform a word vector table from long style to wide style (and vice versa) efficiently with a SAS DATA step. We have also provided two SAS macros that wrap this logic for general reuse.

About Author

Yinliang Wu

Director of Research and Development at SAS

Yinliang Wu has over 23 years of software industry management and business experience. He focuses on data science, parallel computing and visualization, including AI, BI, big data, data visualization, quantitative trading and web crawlers. Currently he is working on research and development for the SAS Visual Analytics product family. He received his Master of Science degree from Peking University. He is the author of the book SAS 技术内幕 (2018) [Insider of SAS Technology: From Programmer to Data Scientist] and co-author of the book 数据分析实用教程 (2021) [Data Analysis Practical Tutorial].

4 Comments

  1. Besides implementing the transformation itself, this article packs in a lot of tips. It looks simple but implements complex logic, including:
    data reading; PROC SQL and selecting a value into a macro variable; ARRAY, RETAIN, loop and conditional statements; CALL MISSING usage; the DATA step implicit loop...
    So many tips combined in such a short article. Really good!
