In my previous blog post, I talked about using PROC CAS to accomplish various data preparation tasks. Since then, my colleague Todd Braswell and I worked through some interesting challenges implementing an Extract, Transform, Load (ETL) process that continuously updates data in CAS. (Todd is really the brains behind getting this to work – I am just along for the ride.)
In a nutshell, the process goes like this:
- PDF documents drop into a "receiving" directory on the server. Each document has a unique SubmissionID. Some documents are very large – thousands of pages long.
- A Python job runs and converts the PDFs into plain text. Python calls an API that performs Optical Character Recognition (OCR) and saves the output as a CSV file, with one row per page of the PDF document.
- A SAS program, running in batch, loads the CSV file with new records into a CAS table. SubmissionID is passed to the batch program as a macro variable, which is used as part of the CAS table name.
- Records loaded from the CSV file are appended to the Main table. If records with the current SubmissionID already exist in the Main table, they are deleted and replaced with new records.
The Main table is queried by downstream processes, which extract subsets of data, apply model score code, and generate results for the customer.
Due to the volume of data and the amount of time it takes to OCR large PDFs, the ETL process runs in multiple sessions simultaneously. And here is a key requirement: the Main table must always be available, in a promoted state, with all up-to-date records, so that the model score code can pick up the records it needs.
What does "promoted state" mean?
The concept of table scope, which was introduced with the first release of SAS Viya, presents a challenge. CAS tables are in-memory tables that can have one of two "scopes":
- Session scope – the table exists within the scope of your current CAS session and drops from memory as soon as the session ends. Functionally, this is somewhat similar to the data you write to the WORK library in SAS 9 – once you disconnect, the data drops from the WORK library.
- Global scope – the table is available to all sessions. If your session ends, you will still have access to it when you start a new session. Your colleagues also maintain access, assuming they have the necessary permissions. For the table to assume global scope, it needs to be promoted.
You can promote a table with the DATA STEP, PROC CASUTIL, or PROC CAS. For example:
/*promote a table using DATA STEP*/
*this snippet copies a SAS 9 table into CAS and promotes in one step;
data mycaslib.my_table (promote=yes);
    set mylib.my_table;
run;

/*promote using PROC CASUTIL*/
*this snippet promotes a session-scope CAS table to global scope;
proc casutil incaslib='mycaslib' outcaslib='mycaslib';
    promote casdata='my_table';
quit;

/*promote using PROC CAS*/
*same as above, this snippet promotes a session-scope table to global scope;
proc cas;
    table.promote /
        caslib='mycaslib'
        targetcaslib='mycaslib'
        name='my_table'
        target='my_table';
quit;
Fun Facts About Table Promotion
You cannot promote a table that has already been promoted. If you need to promote a new version of the same table, you need to first drop the existing table and promote the new version.
To discover the current table state, use the table.tableInfo action in PROC CAS. For example:
proc cas;
    table.tableinfo / caslib='mycaslib' name='main';
quit;
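If you need the scope as a value you can branch on, rather than reading it from printed output, the table.tableExists action is one option. The sketch below assumes the same mycaslib caslib and main table used in the example above; the result variable name r is arbitrary. The action returns exists=0 when the table is not in memory, 1 for a session-scope table, and 2 for a global-scope (promoted) table.

```sas
proc cas;
    /* r is an arbitrary name for the result dictionary */
    table.tableExists result=r / caslib='mycaslib' name='main';

    /* exists: 0 = not loaded, 1 = session scope, 2 = global scope */
    if r.exists == 2 then
        print "mycaslib.main is promoted (global scope)";
    else if r.exists == 1 then
        print "mycaslib.main is loaded in session scope only";
    else
        print "mycaslib.main is not loaded in memory";
quit;
```

A check like this can guard a batch job so that it only touches the table when the table is in the state the job expects.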
If you append rows to a promoted table using the DATA STEP append option, the new rows are automatically promoted. For example, in this snippet the mycaslib.main table, which is promoted, remains promoted when the rows from mycaslib.new_rows are appended to it:
data mycaslib.main(append=yes);
    set mycaslib.new_rows;
run;
When you manipulate a promoted table using the DATA STEP apart from appending rows, it creates a new, session-scope version of the same table. You will have two versions of the table: the global-scope table, which remains unchanged, and the session-scope version which has the changes you implemented. Even if you don't change anything in the table and simply run:
data mycaslib.my_table;
    set mycaslib.my_table;
run;
in which mycaslib.my_table is promoted, you end up with a promoted and an unpromoted version of this table in the mycaslib library – a somewhat unexpected and hardly desired result. Appendix 1 walks through a quick exercise you can try to verify this.
As you probably guessed, this is where we ran into trouble with our ETL process: the key requirement was for the Main table to remain promoted, yet we needed to continuously update it. The task was simple if we just needed to append the rows; however, we also needed to replace the rows if they already existed. If we tried to delete the existing rows using the DATA STEP, we would have to deal with the changes applied to a session-scope copy of the global-scope table.
Initially, we designed the flow to save off the session-scope table with changes, then drop the (original) global-scope version, and finally reload the up-to-date version. This was an acceptable workaround, but errors started to pop up when a session looked for the Main table to score data, while a different concurrent session reloaded the most up-to-date data. We were uncertain how this would scale as our data grew.
PROC CAS to the rescue!
After much research, we learned the Table action set has an action called deleteRows, which allows you to delete rows directly from a global-scope table. The data is never dropped to session-scope – exactly what we needed. Here's an example:
proc cas;
    table.deleteRows /
        table={caslib="mycaslib", name="Main", where="SubmissionID = 12345"};
quit;
In case you are wondering, the Table action set also has an update action, which conditionally updates rows rather than deleting them.
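As a rough illustration of the update action, the sketch below changes a value in place for the rows of one submission. It reuses the caslib, table name, and WHERE clause from the deleteRows example; the Reviewed column is hypothetical and is not part of the table described in this post. Each entry in the set list names a column and an expression, given as a string, that produces its new value.

```sas
proc cas;
    table.update /
        /* rows to update: same filter style as deleteRows */
        table={caslib="mycaslib", name="Main", where="SubmissionID = 12345"},
        /* Reviewed is a hypothetical numeric column used for illustration */
        set={{var="Reviewed", value="1"}};
quit;
```

Like deleteRows, this acts on the in-memory table directly, so a promoted table stays promoted.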
Complete Flow
Our complete flow on the SAS side of things looked something like this:
/*1. Load new rows. SubmissionID macro variable is a parameter passed to the batch program*/
/*New rows are written to the casuser library, but it does not really matter which caslib you choose because we are not persisting them across sessions*/
proc casutil;
    load file="/path_to_new_rows/New_rows_&SubmissionID..csv"
        outcaslib="casuser" casout="new_rows";
quit;

/*2. Delete rows with the current SubmissionID*/
proc cas;
    table.deleteRows /
        table={caslib="prod", name="Main", where="SubmissionID = &SubmissionID."};
quit;

/*3. Append new rows*/
/*Assumes the prod and casuser librefs are assigned to the caslibs of the same name, for example with caslib _all_ assign*/
data prod.main(append=yes);
    set casuser.new_rows;
run;

/*4. Save the main table to ensure we have a disk backup of in-memory data*/
proc casutil incaslib="prod" outcaslib="prod";
    save casdata="main" replace;
quit;
Conclusion
We learned how to continuously update data in CAS while ensuring the data remains available to all sessions accessing it asynchronously. We learned that the append option in DATA STEP automatically promotes new rows, but that manipulating the data in a global-scope table through DATA STEP in other ways leads to the data being copied to session scope. Finally, we learned that to ensure the table remains promoted while it is updated, we can fall back on PROC CAS.
Together, these techniques enabled implementation of a robust data flow that overcomes concurrency problems due to multiple processes updating and querying the data.
Acknowledgement
We thank Brian Kinnebrew for his generous help in investigating this topic and the technical review.
Appendix 1
Try the following exercise to verify that manipulating a promoted table in DATA STEP leads to two copies of the table – session- AND global-scope.
/*copy and promote a sample SAS 9 table*/
data casuser.cars(promote=yes);
    set sashelp.cars;
run;

/*check the number of rows and confirm that the table is promoted*/
proc cas;
    table.tableinfo / caslib='casuser' name='cars';
quit;
/*The table is promoted and has 428 rows*/

/*delete some rows in the promoted table*/
data casuser.cars;
    set casuser.cars;
    if make='Acura' then delete;
run;

/*check again: how many rows does the table have? Is it promoted?*/
proc cas;
    table.tableinfo / caslib='casuser' name='cars';
quit;
/*The table has 421 rows but it is no longer promoted*/

/*reset your CAS session*/
/*kill your current CAS session*/
cas _all_ terminate;
/*start a new CAS session and assign caslibs*/
cas;
caslib _all_ assign;

/*check again: how many rows does the table have? Is it promoted?*/
proc cas;
    table.tableinfo / caslib='casuser' name='cars';
quit;
/*The table is promoted and has 428 rows*/
What we see here is that manipulating a global-scope table in DATA STEP leads to duplication of data. CAS copies the table to session scope and applies the changes there. The changes go away when you terminate the session. One way around this is, instead of overwriting the promoted table, to create a new table, then drop the old table and promote the new table under the old table's name. Otherwise, use the table.deleteRows or table.update action to delete or update rows in place, as described in this post.
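The create, drop, and promote alternative could look roughly like the sketch below. It assumes you are in a fresh session at the end of the exercise, so that only the promoted casuser.cars exists; the intermediate table name cars_new is made up for illustration. The promote call uses the same parameters as the PROC CAS promotion example earlier in this post.

```sas
/* build the modified copy under a different, session-scope name */
data casuser.cars_new;
    set casuser.cars;
    if make='Acura' then delete;
run;

proc cas;
    /* drop the promoted original */
    table.dropTable / caslib='casuser' name='cars';

    /* promote the new table under the old table's name */
    table.promote /
        caslib='casuser'
        targetcaslib='casuser'
        name='cars_new'
        target='cars';
quit;
```

Note that between the drop and the promote there is a short window in which no table named cars is available to other sessions. That gap is the kind of concurrency problem described in this post, and it is why the in-place actions were the better fit for the Main table.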
4 Comments
Hi, these are really helpful comments. The one confirmation most useful for me is "If you append rows to a promoted table using the DATA STEP append option, the new rows are automatically promoted"
I am working on one such ETL scenario, and I am new to SAS Studio. Can I contact you for help with the issues I am facing?
Arundhati
Hi Arundhati, I encourage you to contact SAS Technical Support through one of the channels listed here: https://www.sas.com/en_us/contact/technical-support.html
Thank you very much for a very informative post! This topic seems easy but has been quite frustrating to grasp before reading this.
This is really helpful, Daria. Thanks for finding this info and putting it together into this post. So much to learn!