The title of this post borrows from Stanley Kubrick’s 1964 comedy “Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.” It stars the great Peter Sellers as the titular character as well as George C. Scott and Slim Pickens. The black and white film is strange and brilliant just like Kubrick was. Similarly, as I was experiencing the issue I outline below and was told of this solution, I thought two things. The first was “what a strange but brilliant solution” and the second one was “I’ll try anything as long as it works.” Thus, a title was born. But enough about that. Why are we here?
You want to add a couple of columns of information to your already large dataset, but each time you try to join the tables you run out of memory! For example, you want to append latitude and longitude values from Table B to an existing list of customer phone numbers in Table A.
You’ve tried this and got nowhere fast:
proc sort data = demo.tablea; by npa nxx; proc sort data = demo.tableb; by npa nxx; run; data demo.aunionb; merge demo.tablea (in=a) demo.tableb (in=b); by npa nxx; if a; run;
And then you tried this and also got nowhere (albeit a little slower):
proc sql; create table demo.aunionb as select (*), from demo.tablea a left join demo.tableb b on (a.npa = b.npa) and (a.nxx = b.nxx); quit;
Solution - Joining tables with PROC FORMAT
Use PROC FORMAT!
First, take Table B and create character equivalents of the fields required in your join (assuming they aren’t characters already). In this example, NPA and NXX are the two fields that you are joining on. They will be your key once you concatenate them. Next, create character equivalents of the fields that you want appended.
data work.tableb (keep = npa_nxx--nxx_c); set demo.tableb; npa_c = compress(put(npa, best10.)); nxx_c = compress(put(nxx, best10.)); npa_nxx = catx('_',npa_c, nxx_c); lat_c = compress(put(latitude, best14.3)); long_c = compress(put(longitude, best14.3)); run;
Next, make sure that you have only unique values of your key. Use PROC SORT with OPT=noduprecs turned on.
Now, create a table that will be used as the input into PROC FORMAT. In this example, you are creating a table that will contain the formats for the latitude column.
proc sort data = work.tableb noduprecs; by npa_nxx; data demo.tableb_lat_fmt(keep=fmtname type start label); retain fmtname 'lat_f' type 'C'; set work.tableb; if npa_nxx = '._.' then start = 'Other '; else start = npa_nxx; label = lat_c; run; proc sort data = demo.tableb_lat_fmt; by start; run;
This step creates a table that includes the format name (lat_f), the format type (C), the key field (start) and its corresponding latitude value (label). Sort this table by the ‘start’ column and then repeat this step for every column you wish to append, with each column getting its own unique format and table.
Now run PROC FORMAT using the CNTLIN option pointing to the tables that you just created in order to create your format.
proc format cntlin=demo.tableb_lat_fmt; run; proc format cntlin=demo.tableb_long_fmt; run;
Now all you have to do is run your data step to create the resultant dataset with the appended values.
data demo.aunionb (drop = npa_nxx); set demo.tablea; npa_nxx = catx('_',compress(put(npa,best10.)),compress(put(nxx, best10.))); latitude = input(put(npa_nxx, $lat_f.), BEST.); longitude = input(put(npa_nxx, $long_f.), BEST.); run;
This step creates 3 columns: npa_nxx, latitude, and longitude. Npa_nxx is the key built from the NPA and NXX values. Latitude and longitude are then populated with the formatted value of npa_nxx, which in this case is the character equivalent of the original latitude or longitude. It also formats the value back into a numeric field.
The result is a clever way to add columns to a dataset, much like a VLOOKUP function works in Microsoft Excel, without the hassle of running out of memory space.
- The author realizes there are other, more boring ways of tackling this issue like indexing and using WHERE statements, but where’s the fun in that?
- This solution may not be right for you. See your doctor if you experience any of the following symptoms: memory loss, headache, frustration, Cartesian rage, cranial-keyboard embedment or memory loss.
Hi, from the archives, i.e SUGI-96 ...
Rapid Record Matching via FORMATS and CNTLIN Data Sets
Today (20 years later), in lieu of MERGE or SQL I'd use a HASH table.
@Mike, +1 to you for the hash suggestion.
Thanks Andrew for a good way to use Proc Format in joining 2 large tables. I detect a small typo?
proc sort data = demo.tableb_fmt;
proc sort data = demo.tableb_lat_fmt;
Corrected! Thank you.