You have the infrastructure, and you know that DATA step would run so much better in DS2, but it’s complicated and you don’t know how to get started. Well, if you have SAS 9.4M5, rejoice! The DStoDS2 procedure is here, and it’s a humdinger! (No, Chris, not a Hemedinger… ;-) It's one of the many new things you can learn about in the new second edition of "Mastering the SAS DS2 Procedure - Advanced Data Wrangling Techniques"
Here are a few important caveats we should be aware of before diving in:
- PROC DStoDS2 can’t translate all possible DATA step syntax, but it does support a huge subset. Any lines of code lines in the traditional DATA step that cannot be translated will appear as comments in the output. Any comments in the original DATA step program are removed.
- PROC DStoDS2 translates only one DATA step at a time. The DATA step to be translated must be saved to its own text file which will serve as input.
- In the traditional DATA step, character variable lengths are measured in bytes, but in DS2 fixed-length character variable lengths are specified in characters. If your data has multi-byte encoding, adjust any LENGTH statements in the original DATA step before running PROC DSTODS2 to specify the variable length in characters, so that the DS2 DATA program will specify the correct lengths upon conversion.
- The resulting DS2 program might not be syntactically complete. You will have to add a RUN statement and any commented sections will have to be converted by hand. Perhaps a few other tweaks will be necessary before it executes and produce the desired results.
Having said all of that, PROC DStoDS2 makes converting traditional DATA step code to DS2 a snap! Let’s try it out on a sample program that as a lot of typical DATA step features:
data audi (drop=make count) bmw (drop=make count) counts (keep=make count); if _n_=1 then do; put 'And so it begins!'; end; set cars; by Make; if first.make then Count=0; Count+1; select (Make); when ("Audi") output audi; when ("BMW") output bmw; otherwise put Make= Model= ' - how did YOU get in here?'; end; if last.make then output counts; run; |
We’ll save that to a file named DStoDS2_data_step.sas. Next, we’ll write and execute a PROC DStoDS2 step to convert the DATA step program to DS2 and save the new program as DStoDS2_data_step_converted.sas:
proc dstods2 in="&path/DStoDS2_data_step.sas" out="&path/DStoDS2_data_step_converted.sas"; run; |
A quick look at the output shows it’s not too pretty, but it seems functional enough. Because I’m working in SAS Studio, a quick click on the Format Code button makes it more legible. For the purposes of the blog, I’ll format it a little by hand:
data AUDI(DROP=(MAKE COUNT) ) BMW(DROP=(MAKE COUNT) ) COUNTS(KEEP=(MAKE COUNT)); method run(); if _N_=1.0 then do; put 'And so it begins!'; end; set CARS; by MAKE; if FIRST.MAKE then COUNT=0.0; COUNT + 1.0; select (MAKE); when ('Audi') output AUDI; when ('BMW') output BMW; otherwise put MAKE=MODEL=' - how did YOU get in here?'; end; if LAST.MAKE then output COUNTS; ; _return: ; end; enddata; |
That’s not too bad, actually. I have to add the PROC DS2 statement and the RUN and QUIT, obviously. I’ll remove that unused _return: label. And for efficiency’s sake, I’ll convert the IF _N_ =1 DO group into an INIT method:
proc ds2; data AUDI_DS2(DROP=(MAKE COUNT)) BMW_DS2(DROP=(MAKE COUNT)) COUNTS_DS2(KEEP=(MAKE COUNT)); method init(); put 'And so it begins!'; end; method run(); set CARS; by MAKE; if FIRST.MAKE then COUNT=0.0; COUNT + 1.0; select (MAKE); when ('Audi') output AUDI_DS2; when ('BMW') output BMW_DS2; otherwise put MAKE=MODEL=' - how did YOU get in here?'; end; if LAST.MAKE then output COUNTS_DS2; end; enddata; run; quit; |
Now, that seemed almost too easy, didn't it? Let's run my shiny new DS2 program, and check the log to make sure everything is running smoothly. Oops! It looks like my DS2 program produced a warning in the log:
WARNING: No DECLARE for assigned-to variable count; creating it as a global variable of type double.
So back I’ll go to the editor, and add a declaration for the COUNT variable. This time it runs without a warning. Now, I’m a suspicious guy, so I’ll sort the outputs from both processes into the same order and run PROC COMPARE to ensure it’s all working as expected. Sure enough, I get that lovely “No unequal values were found. All values compared are exactly equal.” report.
So that’s the scoop – PROC DStoDS2 can ingest some pretty complex DATA step code and do 99% of the work needed to convert it to DS2. Remember Jedi Programming Rule #1? If not, here you go:
“Lazy programmers are great programmers.”
So lazy programmers of the world, rejoice - the DStoDS2 procedure has arrived! And, if you’re too lazy to copy the code in this post by hand, you can download the ZIP file from this link.
Until next time, may the SAS be with you!
Mark
26 Comments
Hello Mark,
the following code failed indicating INTERWRK.AEHF_DE" does not exist ( it is a SAS View table). Can DS2 use view table in the merge ? Thanks,
/******/
%let state=de;
proc ds2;
data INTERWRK.TMP_&STATE._ADDRESSES
INTERWRK.ICF_&STATE._NEWWORKERS_NO_ADDRESS(
KEEP = ( PIK ADDRESS_YEAR ) );
method run();
merge INPUTS.ICF_US_ADDRESSES( IN = IN_A ) INTERWRK.AEHF_&STATE( IN = IN_B
KEEP = ( PIK YEAR ) RENAME = ( YEAR = ADDRESS_YEAR ) );
by PIK ADDRESS_YEAR;
if IN_B;
if IN_A then output INTERWRK.TMP_&STATE._ADDRESSES;
else output INTERWRK.ICF_&STATE._NEWWORKERS_NO_ADDRESS;
;
_return: ;
end;
enddata;
run;
quit;
/*******/
ERROR: Compilation error.
ERROR: BASE driver, Table AEHF_DE does not exist or cannot be accessed or created
ERROR: Table "INTERWRK.AEHF_DE" does not exist or cannot be accessed
ERROR: Line 12: Unable to prepare SELECT statement for table AEHF_de (rc=0x80fff802U).
NOTE: PROC DS2 has set option NOEXEC and will continue to prepare statements.
23 quit;
Cindy,
As noted in this SAS Communities posting, the problem is that the view was created with PROC SQL. DS2 is only compatible PROC FedSQL, so can't read PROC SQL views. But take heart - if the view is created with PROC FedSQL, the process should run fine. Here is a rudimentary mock-up using the code snip you provided:
%let state=DE;
libname INPUTS "%pathname(work)";
libname INTERWRK "%pathname(work)";
data INPUTS.ICF_US_ADDRESSES ViewSource (keep=PIK address_year rename=(address_year=YEAR));
do PIK= 1 to 5;
do ADDRESS_YEAR=2015 to 2020;
Address='From ICF_US_ADDRESSES';
output INPUTS.ICF_US_ADDRESSES;
if not mod(ADDRESS_YEAR,3) and PIK in (1,4) then output ViewSource;
end;
end;
PIK=5;
ADDRESS_YEAR=2021;
output ViewSource;
run;
proc FedSQL;
drop view INTERWRK.AEHF_&STATE force;
create view INTERWRK.AEHF_&STATE as
select * from ViewSource
;
quit;
title "Source Tables";
proc FedSQL;
title2 "INPUTS.ICF_US_ADDRESSES";
select * from INPUTS.ICF_US_ADDRESSES limit 10;
title2 "INTERWRK.AEHF_&STATE";
select * from INTERWRK.AEHF_&STATE limit 10;
quit;
/*Your code*/
proc ds2;
data INTERWRK.TMP_&STATE._ADDRESSES
INTERWRK.ICF_&STATE._NEWWORKERS_NO_ADDRESS(KEEP=(PIK ADDRESS_YEAR))/overwrite=yes;
method run();
merge INPUTS.ICF_US_ADDRESSES(IN=IN_A)
INTERWRK.AEHF_&STATE(IN=IN_B KEEP=(PIK YEAR) RENAME=(YEAR=ADDRESS_YEAR));
by PIK ADDRESS_YEAR;
if IN_B;
if IN_A then output INTERWRK.TMP_&STATE._ADDRESSES;
else output INTERWRK.ICF_&STATE._NEWWORKERS_NO_ADDRESS;
;
_return: ;
end;
enddata;
run;
quit;
title "Result Tables";
proc FedSQL;
title2 "INTERWRK.TMP_&STATE._ADDRESSES ";
select * from INTERWRK.TMP_&STATE._ADDRESSES limit 10;
title2 "INTERWRK.ICF_&STATE._NEWWORKERS_NO_ADDRESS";
select * from INTERWRK.ICF_&STATE._NEWWORKERS_NO_ADDRESS limit 10;
quit;
I hope this helps! May the SAS be with you :-)
Mark
Fernando,
The capacity to STORE a character value is completely dependent on the storage technology - it really has nothing to do with the capabilities of DS2. For example, we are all familiar with the 32,767 (32K) byte limit for character data stored in SAS data sets. If you were to store a 1M character value to a SAS data set, it would be truncated at 32K. Hive has a 65,535 (64K) character limit (for VARCHAR), so if the same value was stored in Hive, it would be truncated at 64K.
As for the very long JSON API responses I discussed in the previous comment, I captured the JSON response to a local VARCHAR variable and used a DS2 JSON package instance to parse out the data I wanted. The extracted data was stored in an output table, but the the raw JSON was discarded.
May the SAS be with you!
Mark
Hello SAS Jedi,
Thank you for sharing proc dstods2.
Is there any way we can convert below proc sql - oracle query to ds2 query
proc sql;
options nosgen;
connect to oracle(user="&oruser." orapw="&orpw." path="@abc" buffsize=15000);
options sgen;
create table sample as
select id,
waiver,
opt
from connection to oracle
(select to_number(ex_id) as id,
waiver,
put(option,z4.) as option
FROM my_table
order by ann_fee
);
disconnect from oracle;
quit;
Rikin,
The FedSQL integration with DS2 makes using explicit pass-through SQL a snap. I'm pretty handy with Orcale SQL, and in your demo code I do not recall an Oracle PUT function, so I'm pretty sure the "put(option,z4.) as option" won't work. However, here is sample code showing how to implement this in DS2 using FedSQL explicit pass-through:
LIBNAME mydb oracle user="&oruser." orapw="&orpw." path="@abc";
proc ds2;
data sample;
method run();
set{select to_number(ex_id) as id,waiver
/* This does NOT look like ORACLE SQL to me - if it's not, this won't work
Consider using the PUT function in the DS2 code to do the conversion */
, put(option,z4.) as option
FROM my_table
order by ann_fee};
end;
enddata;
run;
quit;
You can learn more about this technique in my blog post titled "Jedi SAS Tricks: Explicit SQL Pass-through in DS2"
May the SAS be with you!
Mark
What I've discovered through trial & error is that a SPDE libname that has any spaces in the paths doesn't work. For example, the first libname below works fine, but the second one doesn't work.
libname slib spde 'c:\test\'
datapath=('c:\test\a' 'c:\test\b\')
indexpath=('c:\test\b\');
libname slib spde 'c:\test\'
datapath=('c:\test\a a\' 'c:\test\b b\')
indexpath=('c:\test\b b\');
While this is a great place to discuss general or specific coding issues, it's not a great place to get help for a issues which may depend on your specific SAS installation. SAS Technical Support is the best route to a quick solution for problems like this. I'm going to open a tech support track on this myself.
As a reminder, for urgent, business-critical issues call +1 (919) 677-8008. Otherwise, I'd suggest using the SAS Tech Support Web Form (http://support.sas.com/ctx/supportform/createForm).
Before contacting SAS Tech Support, gather this information and have it handy:
Your company name and the country in which you are operating
Your name, e-mail address and phone number
Site number, operating system and software release (see top of a fresh SAS session’s log)
SAS product involved (for this problem, the product in Base SAS)
Succinct one-line description of the issue (subject line)
Detailed problem description, including ERROR OR WARNING messages
A description of any troubleshooting or research you’ve already done
Copies of your SAS program, SAS log & any other pertinent files.
After submitting the web form, you'll receive an automated email with a tracking number. Reply to that email and attach copies of your SAS program and the SAS log for the session which shows the errors / problem.
I hope this helps!
Mark
Well, it looks like we identified a bug! The response to my tech support track indicated that a fix is planned for a future release which, at the present time, is scheduled for release by the end of 2018. However, as of this writing no fix is available. And because path names cannot be modified for existing SPDE libraries (see http://support.sas.com/kb/14/028.html), no work-around is available either. To avoid this problem going forward, please ensure the paths and file names for components of SPDE libraries do not contain spaces.
Your example works, but when I try it with one of our SPDE libnames, it doesn't work.
269
270 libname YR4_12_1 spde 'h:\studies\PV\data\original\CCW64\2012 - new 5pct - 2012'
271 datapath=('m:\studies\PV\data\original\CCW64\2012 - new 5pct - 2012'
272 'p:\studies\PV\data\original\CCW64\2012 - new 5pct - 2012'
273 'l:\studies\PV\data\original\CCW64\2012 - new 5pct - 2012'
274 'k:\studies\PV\data\original\CCW64\2012 - new 5pct - 2012')
275 indexpath=('j:\studies\PV\data\original\CCW64\2012 - new 5pct - 2012'
276 'i:\studies\PV\data\original\CCW64\2012 - new 5pct - 2012')
277 bysort=no
278 compress=binary;
NOTE: Libref YR4_12_1 was successfully assigned as follows:
Engine: SPDE
Physical Name: h:\studies\PV\data\original\CCW64\2012 - new 5pct - 2012\
279
280
281 proc ds2 libs=(YR4_12_1 work);
ERROR: Client unable to establish connection.
ERROR: The DATAPATH= option specifies physical paths, listed above, that do not exist or are
inaccessible
ERROR: PROC DS2 initialization failed.
282 thread comp /overwrite=yes;
283 dcl double total;
284 method run();
285 set YR4_12_1.BENEFICIARY_SUMMARY_FILE;;
286 end;
287 endthread;
288 run;
289 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE DS2 used (Total process time):
Thanks for the reply. The test code works fine. However, our goal was to try and use the DS2 procedure to speed up data access to some of our large datasets (100+ million obs) that were created using the SPDE libname engine. I guess it wouldn't work on these, either, since the libref is a series of concatenated paths. Do you know if this problem be addressed in a future release?
SPDE libraries are not concatenated libraries. They are distributed libraries using the SPDE engine, which is a completely different beast. SPDE is explicitly listed as a supported DS2 data source and should be no problem to access using DS2. For example, this code runs flawlessly on my SAS 9.4M5 installation (Windows):
libname slib spde 'c:\sasdata\spde\mainpath'
datapath=('c:\sasdata\spde\data1' 'd:\sasdata\spde\data2')
indexpath=('c:\sasdata\spde\index');
proc ds2 libs=(slib work);
thread comp /overwrite=yes;
dcl double total;
method run();
set slib.narrow;
by group;
if first.group then total=0;
total+normal;
if last.group then output;
end;
endthread;
run;
data summary/overwrite=yes ;
dcl thread comp c();
method run();
set from c threads=4;
end;
enddata;
run;
quit;
I tried the following code, but got an error. The SASHELP.CARS dataset works fine in a normal dataset, so I'm not sure what the problem is. Any ideas?
proc ds2;
data TEST;
method run();
set SASHELP.CARS;
end;
enddata;
run;
quit;
177 proc ds2;
ERROR: Client unable to establish connection.
ERROR: The DATAPATH= option specifies physical paths, listed above, that do not exist or are
inaccessible
ERROR: PROC DS2 initialization failed.
178 data TEST;
179 method run();
180 set SASHELP.CARS;
181 end;
182 enddata;
183 run;
184 quit;
I also tried the code from your example and got an error.
17 proc ds2;
18 data AUDI_DS2(DROP=(MAKE COUNT))
19 BMW_DS2(DROP=(MAKE COUNT))
20 COUNTS_DS2(KEEP=(MAKE COUNT))/overwrite=yes;
21 dcl double Count;
22 method init();
23 put 'And so it begins!';
24 end;
25 method run();
26 set CARS;
27 by MAKE;
28 if FIRST.MAKE then COUNT=0.0;
29 COUNT + 1.0;
30 select (MAKE);
31 when ('Audi') output AUDI_DS2;
32 when ('BMW') output BMW_DS2;
33 otherwise put MAKE=MODEL=' - how did YOU get in here?';
34 end;
35 if LAST.MAKE then output COUNTS_DS2;
36 end;
37 enddata;
38 run;
ERROR: Compilation error.
ERROR: BASE driver, Table CARS does not exist or cannot be accessed or created
ERROR: Table "WORK.CARS" does not exist or cannot be accessed
ERROR: Line 26: Unable to prepare SELECT statement for table CARS (rc=0x80fff802U).
39 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE DS2 used (Total process time):
real time 0.20 seconds
cpu time 0.17 seconds
Robert,
DS2 does not support concatenated SAS libraries (see Usage Note 51043: Librefs with concatenated libraries are not supported in DS2 - https://support.sas.com/kb/51/043.html)
Try this:
libname ds2help "!SASROOT\core\sashelp";
proc ds2;
data TEST;
method run();
set DS2HELP.CARS;
end;
enddata;
run;
quit;
May the SAS be with you!
Mark
I've been having trouble getting this procedure to work in SAS University Edition. I tried copying your example and I always get this result:
1 proc dstods2 in='/folders/myfolders/DStoDS2_data_step.sas'
2 out='/folders/myfolders/DStoDS2_data_step_converted.sas';
3 run;
ERROR: Could not generate formats text file.
ERROR: Could not compile source code.
NOTE: The SAS System stopped processing this step because of errors.
The syntax of the procedure is simple enough that it's hard for me to see how I might have gone wrong, maybe it's a problem with the configuration of University Edition somehow?
Matthew,
Thanks for taking the time to point this out. This post was created using PC SAS. I have replicated this error in SAS University Edition and have submitted a tech support track to see what's up. Will respond here if I get more information.
Regards,
Mark
Thanks Mark! Glad to know I'm not crazy.
:-D
10 months later, and it seems still not fixed?!?
The DStoDS2 procedure requires that the in and out files be in the current working directory for SAS. You can't include path information in the in= and out= procedure options. However, it's not hard to change the working directory (in SAS 9.4M4 or higher) using the DLGCDIR function. For example, let's say your program file is C:\temp\datastep.sas and you want the DS2 data program file produced to be named dataprogram.sas. This code would do the trick:
%put %qsysfunc(dlgcdir(c:\temp));
PROC DSTODS2 IN="datastep.sas"
OUT="dataprogram.sas"
;
RUN;
Does DS2 support character variable length longer than 32767?
Tricky question! Yes, DS2 supports character variables longer than 32K. I have personally used VARCHAR variables in a DS2 program to hold JSON responses over 1 million characters long. But they were only used for processing. There's no place I know of where you can STORE a varchar that big. For example, in LATIN encoding, the largest character variable that can be stored in Teradata is 64,000 characters. And for SAS Data sets, the limit for character variables is still 32,767 bytes. SO yes, you can process them - but you probably can't store them anywhere...
I presume HDFS is a candidate for variables: See here for a discussion of Hive string limits https://stackoverflow.com/questions/35030936/is-there-maximum-size-of-string-data-type-in-hive
DS2 does not read from or write to raw HDFS files. actually, DS2 doesn't read or write raw data files of any type, as of this writing. DS2 interacts with Hadoop via Hive, which has a 255 character limit for CHAR and a 65,535 character limit for VARCHAR.
Hello Mark,
Do you believe that it is possible to store the result the result of such long character into a text file using Proc DS2?
You may be able to do it using ODS, something like the techniques used in Jedi SAS Tricks: Writing to Text Files from DS2 (https://blogs.sas.com/content/sastraining/2016/06/06/jedi-sas-tricks-text-files-from-ds2)