Python Integration to SAS® Viya® - Part 23 - Executing SQL on Snowflake was published on SAS Users.
If you're interested in learning more about SAS and Snowflake, feel free to check out this informative communities article, SAS partner page, or watch the SAS Viya release highlights.
Starting off, I've taken the first steps of importing the required packages and establishing a connection between my Python client and the distributed CAS server in SAS Viya, which I've named conn. Now, I'll proceed by executing the about CAS action to validate my connection to CAS and to retrieve information about the version of SAS Viya that I'm working with.
If you're curious about setting up a CAS connection, you can find detailed instructions in one of my earlier blog posts.
import swat
import pandas as pd

## Connect to CAS
conn = swat.CAS(Enter your CAS connection information)

## View the version of SAS Viya
conn.about()['About']['Viya Version']

## and the results
'Stable 2024.01'
The results show the connection was successful and the version of Viya is Stable 2024.01.
After establishing your connection to CAS, you will require a Snowflake account to make a connection. I have stored my Snowflake account information in a JSON file named snowflake_creds.json, structured as follows:
{
    "account_url": "account.snowflakecomputing.com",
    "userName": "user-name",
    "password": "my-password"
}
Please follow any company policy regarding authentication. I'm using a demonstration Snowflake account.
To create a new connection to Snowflake, called a caslib, we'll utilize the Snowflake SAS Viya Data Connector. This will connect the distributed CAS server to the Snowflake data warehouse. To create the caslib connection, you'll require additional information from Snowflake, including the account URL (server), your user name and password, and the database and schema you want to access.
After gathering these details, you can use the table.addCaslib action on your conn connection object to connect to Snowflake. I'll name my caslib my_snow_db.
## Get my Snowflake connection information from my JSON file
import os
import json

my_json_file = open(os.path.join(os.getenv('CAS_CREDENTIALS'), 'snowflake_creds.json'))
snow_creds = json.load(my_json_file)

## Create a caslib to Snowflake using my specified connection information
cr = conn.addcaslib(
    name = 'my_snow_db',
    datasource = dict(
        srctype = 'snowflake',
        server = snow_creds['account_url'],
        userName = snow_creds['userName'],
        password = snow_creds['password'],
        database = "SNOWFLAKE_SAMPLE_DATA",
        schema = "TPCH_SF10"
    )
)

## and the results
NOTE: 'my_snow_db' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'my_snow_db'.
The results show that the Snowflake connection was added to the CAS server and named my_snow_db.
I'll use the fileInfo action to explore the available database tables in the my_snow_db caslib.
conn.fileInfo(caslib = 'my_snow_db')
The results show we have a variety of database tables available in the SNOWFLAKE_SAMPLE_DATA.TPCH_SF10 schema.
Next, I'll check if any tables are loaded in memory on the my_snow_db caslib.
conn.tableInfo(caslib = 'my_snow_db')

## and the results
NOTE: No tables are available in caslib my_snow_db of Cloud Analytic Services.
The results show that there are currently no in-memory items in the caslib, which is expected since we haven't loaded anything in memory yet.
To execute queries using the SWAT package, you first have to load the FedSQL action set. I have more information about executing SQL with the SWAT package in a previous post.
conn.loadActionSet('fedSQL')

## and the results
NOTE: Added action set 'fedSQL'.
actionset fedSQL
I'll start by executing a simple query to count the number of rows in the PART table within Snowflake. I'll create a variable named totalRows to hold my query, and then I'll utilize the execDirect action to execute the query. The query parameter specifies the query as a string. The method parameter is optional, but when set to True, it prints a brief description of the FedSQL query plan.
totalRows = '''
    SELECT count(*)
    FROM my_snow_db.part
'''
conn.execDirect(query = totalRows, method = True)
The results display the expected count from the PART table. However, the method parameter provides us with additional insights. It notes that our SQL statement was slightly modified and fully offloaded to the underlying data source via full pass-through. But what does that mean?
SAS Viya Data Connectors (along with SAS/ACCESS technology) strive to convert SAS SQL queries to run natively in the data source whenever possible. This is called implicit pass-through. In my experience, the more ANSI-standard your queries are, the more likely SAS is to push the processing directly into the database. This approach leverages the database's computational power efficiently, retrieving only the necessary results.
I'll run another query. This time I'll count the number of rows within each P_MFGR group using the PART Snowflake table.
group_p_mfgr = '''
    SELECT P_MFGR, count(*)
    FROM MY_SNOW_DB.PART
    GROUP BY P_MFGR
'''
conn.execDirect(query = group_p_mfgr, method = True)
The results confirm once more that the SAS FedSQL query was fully processed within Snowflake. You can see the entire SELECT statement that was run on Snowflake in the log.
What if a coworker uses Snowflake SQL, while you work within SAS Viya and the Python SWAT package? Your coworker has written a Snowflake SQL query and sent it your way to summarize the data. You want to run it through the Python SWAT package so you can use the results in another process. What can you do? What if we just run the native Snowflake query? Let's try it.
In the Snowflake documentation, there is an example query that addresses the business question: "The Pricing Summary Report Query provides a summary pricing report for all line items shipped as of a given date. The date is within 60-120 days of the greatest ship date contained in the database." I'll execute the Snowflake query through FedSQL as is.
snowflakeSQL = '''
    select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice * (1-l_discount)) as sum_disc_price,
        sum(l_extendedprice * (1-l_discount) * (1+l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
    from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.LINEITEM
    where l_shipdate <= dateadd(day, -90, to_date('1998-12-01'))
    group by l_returnflag, l_linestatus
    order by l_returnflag, l_linestatus;
'''
conn.execDirect(query = snowflakeSQL, method = True)

## and the results
ERROR: Table "SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.LINEITEM" does not exist or cannot be accessed
ERROR: The action stopped due to errors.
The SQL we submitted contains Snowflake-specific syntax that other SQL processors like FedSQL don’t understand, generating an error.
Unlike most other SQL processors, FedSQL can send native SQL code directly to a database for execution, processing only the result set in FedSQL. We call this process “explicit pass-through”. A FedSQL explicit pass-through query is really made up of two separate queries: the inner query that executes in the database, and the outer query that processes the result set in FedSQL.
In the example below, the outer query merely selects all columns and rows from the result set returned by the database connected to the caslib specified by the CONNECTION TO keyword, in this case my_snow_db. A pair of parentheses encloses the inner query. Here, we've simply copied in the native Snowflake query code without any modifications.
snowflakeSQL = '''
    SELECT *
    FROM CONNECTION TO MY_SNOW_DB
    (select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice * (1-l_discount)) as sum_disc_price,
        sum(l_extendedprice * (1-l_discount) * (1+l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
     from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.LINEITEM
     where l_shipdate <= dateadd(day, -90, to_date('1998-12-01'))
     group by l_returnflag, l_linestatus
     order by l_returnflag, l_linestatus);
'''
conn.execDirect(query = snowflakeSQL, method = True)
The log shows that FedSQL did not generate a plan because the entire query was explicitly sent to Snowflake as is and processed. Then, the results are displayed on the client. Instead of merely displaying the result set, I could have stored it in a DataFrame for further processing, or loaded it as a CAS table in the CAS server for additional advanced analytics or visualization using CAS actions, SAS Visual Analytics, or other Viya applications.
Lastly, what if you want to force the native FedSQL query to run in database to avoid transferring large data between CAS and Snowflake?
Let's look at a simple example. I'll view 10 rows from the PART Snowflake table using FedSQL.
myQuery = '''
    SELECT *
    FROM MY_SNOW_DB.PART
    LIMIT 10
'''
conn.execDirect(query = myQuery, method = True)
The log shows that the CAS server performed a serial load of the entire Snowflake table into CAS and then limited the results to 10 rows. When executing a query that can't be implicitly converted to run in-database, all of the data is first automatically loaded into, and then processed by, CAS. You get the desired results, but transferring large data from any data source to CAS can be inefficient and time-consuming and should be avoided where possible.
One solution is to suppress the automatic action, ensuring the query either runs in the database (Snowflake) or returns an error. To do that, I like to use the CNTL parameter. It provides a wide variety of additional parameters to use when executing FedSQL. The parameter of interest here is requireFullPassThrough. Setting this parameter to True stops processing if FedSQL can't implicitly pass through the entire query to the database. In that case, no data is loaded into CAS, no output table or result set is produced, and a note is generated indicating what happened.
myQuery = '''
    SELECT *
    FROM MY_SNOW_DB.PART
    LIMIT 10
'''
conn.execDirect(query = myQuery, method = True, cntl = {'requireFullPassThrough':True})
The log shows that since this FedSQL query could not be passed into Snowflake, the query stops prior to loading any of the Snowflake data into CAS.
If you encounter this and want to ensure your query runs in Snowflake to take advantage of its capabilities, simply modify the query or use explicit pass-through!
myQuery = '''
    SELECT *
    FROM CONNECTION TO MY_SNOW_DB
    (SELECT * exclude(P_COMMENT, P_NAME)
     FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.PART
     LIMIT 10)
'''
conn.execDirect(query = myQuery, method = True)
The log shows that the Snowflake SQL query was pushed directly to Snowflake for execution, and the results are returned to the client.
The SWAT package combines the power of the pandas API, SQL, and CAS actions to handle distributed data processing. In this example, I demonstrated executing implicit and explicit queries in Snowflake through the Snowflake Data Connector.
In the next blog post, I'll illustrate how to load data from Snowflake into CAS to leverage specific Viya analytic actions and other SAS Viya applications.
For more efficiency tips and tricks when working with databases, check out the SAS course Efficiency Tips for Database Programming in SAS®.
Getting Started with Python Integration to SAS Viya for Predictive Modeling - Comparing Logistic Regression and Decision Tree was published on SAS Users.
In part 6 and part 7 of this series we fit a logistic regression and a decision tree. We assessed each model individually, and now we want to determine which of these two models is the better choice based on different criteria like misclassification, area under the curve (AUC), and lift charts.
Using the validation data created in part 4, we will calculate these criteria and decide which model works best based on each criterion.
When comparing predictive models, there are several key considerations that a modeler should use to determine which model performs the best for a particular problem. Here are some important factors to consider:
By carefully considering these factors and selecting the model that best balances performance, interpretability, and practical considerations, a predictive modeler can make informed decisions about which model is the most suitable for their particular application.
Misclassification rate, area under the curve (AUC), and lift charts are all commonly used evaluation metrics in predictive modeling, but they serve different purposes and are best used under different circumstances:
Accuracy is a straightforward metric that measures the proportion of correct predictions made by the model; the misclassification rate is its complement, the proportion of predictions the model got wrong.
When to Use: The misclassification rate is best used when assessing the accuracy of a binary classification model, for example when identifying fraudulent transactions in a banking system, where both false positives and false negatives have significant consequences.
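As a quick numeric illustration (with hypothetical counts, not taken from our models), the misclassification rate is just one minus accuracy, computed from the four confusion-matrix cells:

```python
# Misclassification rate from confusion-matrix counts (hypothetical values)
TP, TN, FP, FN = 347, 388, 62, 53

accuracy = (TP + TN) / (TP + TN + FP + FN)
misclassification = 1 - accuracy

print(round(misclassification, 4))  # proportion of predictions the model got wrong
```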
AUC is a metric that evaluates the ability of a binary classification model to distinguish between positive and negative classes across all possible threshold values.
When to Use: AUC is best used when evaluating the performance of a machine learning model for example when predicting customer churn in a telecommunications company, where class imbalance is prevalent and accurately distinguishing between churners and non-churners is critical for targeted retention strategies.
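AUC can also be approximated directly from ROC points with the trapezoidal rule. Here is a minimal sketch using a handful of made-up (FPR, TPR) pairs, not an actual model's ROC curve:

```python
# AUC via the trapezoidal rule over (FPR, TPR) points (hypothetical ROC curve)
fpr = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.5, 0.75, 0.9, 1.0]

# Sum the area of each trapezoid between consecutive ROC points
auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
          for i in range(len(fpr) - 1))
print(auc)  # 0.5 would be a random classifier; 1.0 a perfect one
```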
Lift charts are used to assess the effectiveness of a predictive model in targeting a specific outcome (e.g., response rates, conversion rates) by ranking observations based on their predicted probabilities.
When to Use: Lift charts are best used when analyzing the effectiveness of a predictive model for targeting high-value customers in a marketing campaign, where identifying segments with the highest response rates can optimize resource allocation and maximize return on investment.
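Lift at a given depth is simply the response rate among the top-scored observations divided by the overall response rate. A toy pandas sketch with fabricated scores and outcomes shows the idea:

```python
import pandas as pd

# Lift at 20% depth: response rate in the top-scored 20% vs the overall rate
# (fabricated scores and outcomes, not from the article's models)
df = pd.DataFrame({
    'p_event': [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05],
    'event':   [1,    1,    1,    0,    1,    0,    0,    1,    0,    0],
})

depth = 0.20
top = df.nlargest(int(len(df) * depth), 'p_event')  # top 20% by predicted probability

lift = top['event'].mean() / df['event'].mean()
print(lift)
```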
Let’s now compare our two models by calculating misclassification, area under the curve and lift for these two models.
To calculate these metrics, we used the percentile action set and the assess action in parts 6 and 7 of this series.
Using Python, assign the model names LogisticReg and DecisionTree to the appropriate assess output data created using the percentile action set for each model.
# Add new variable to indicate type of model
dt_assess["model"] = "DecisionTree"
dt_assess_ROC["model"] = "DecisionTree"
lr_assess["model"] = "LogisticReg"
lr_assess_ROC["model"] = "LogisticReg"
Next, append these data files together to create two data frames.
# Append data
df_assess = pd.concat([lr_assess, dt_assess])
df_assess_ROC = pd.concat([lr_assess_ROC, dt_assess_ROC])
Look at the first few rows of the data from the combined assess action output dataset df_assess. Here you see the values at each depth of data, from 5% incremented by 5%.
Look at the first few rows of data from the combined assess action output data set df_assess_ROC. This data is organized by cutoff value, starting at 0.00 and going to 1.00 in increments of 0.01.
As explained in the last two parts of this series, for decision trees, just like logistic regression, we use a cutoff value to make decisions, kind of like drawing a line in the sand. In this case, if we use a cutoff value of 0.03 from the table above, it means we are using a prediction probability of 0.03 to predict if someone is going to be delinquent on their loan. If we look at 0.03 above, then for our validation data with the logistic regression (it was first in our concatenation statement), the predicted true positives will be 356 and true negatives 72.
The default cutoff value is typically .5 or 50% and we will use it to create the confusion matrix.
Create a confusion matrix, which compares predicted values to actual values. It breaks down these predictions into four categories: true positives, true negatives, false positives, and false negatives.
# create confusion matrix
cutoff_index = round(df_assess_ROC['_Cutoff_'], 2) == 0.5
conf_mat = df_assess_ROC[cutoff_index].reset_index(drop=True)
conf_mat[['model', '_TP_', '_FP_', '_FN_', '_TN_']]
This confusion matrix indicates that the decision tree is doing a better job at predicting the true positives, and the logistic regression is better at predicting the true negatives.
Next let’s calculate and look at the misclassification rate for the two models.
# calculate misclassification rate
conf_mat['Misclassification'] = 1 - conf_mat['_ACC_']
conf_mat[['model', 'Misclassification']]
Remember, the misclassification rate gives us the rate of what the model got wrong. In this case the misclassification rate for the decision tree is lower at 12.8% vs. the logistic regression at 15.7%.
Now let’s calculate the area under the curve (AUC) measurement for each model.
# print Area under the ROC Curve
print("AUC (using validation data)".center(40, '-'))
df_assess_ROC[["model", "_C_"]].drop_duplicates(keep="first").sort_values(by="_C_", ascending=False)
The decision tree again edges out logistic regression with a higher AUC measurement at 0.87 vs 0.80 for logistic regression.
A picture or graph is always a great way to visualize our metrics. Let’s create the ROC chart which will illustrate both the tradeoff of true positives and false positives and show the AUC for both models.
Using the matplotlib package, create the ROC chart with both models.
# Draw ROC charts
from matplotlib import pyplot as plt

plt.figure()
for key, grp in df_assess_ROC.groupby(["model"]):
    plt.plot(grp["_FPR_"], grp["_Sensitivity_"], label=key)
plt.plot([0, 1], [0, 1], "k--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.grid(True)
plt.legend(loc="best")
plt.title("ROC Curve (using validation data)")
plt.show()
The last criterion we will calculate and create is a lift chart.
A lift chart shows how much better a predictive model is at targeting a specific outcome, like response rates, compared to random selection. It does this by plotting the cumulative response rates of different segments of the population, helping to identify the most responsive groups and optimize marketing strategies.
Using the matplotlib package, let's plot our lift chart.
# Draw lift charts
plt.figure()
for key, grp in df_assess.groupby(["model"]):
    plt.plot(grp["_Depth_"], grp["_Lift_"], label=key)
plt.xlabel("_Depth_")
plt.ylabel("_Lift_")
plt.grid(True)
plt.legend(loc="best")
plt.title("Lift Chart (using validation data)")
plt.show()
In the depicted chart, at a depth of 20, the lift is approximately 2.5 for the decision tree model and 1.8 for the logistic regression model. This indicates that at the 20% threshold, the decision tree can identify individuals prone to loan delinquency around 2.5 times more effectively than random selection, compared to 1.8 times for the logistic regression model.
In conclusion, when comparing the performance of logistic regression and decision tree models for predicting outcomes, it's crucial to consider various evaluation metrics such as misclassification rates, area under the curve (AUC), and lift charts. By assessing each model against these criteria using validation data, we gain valuable insights into their effectiveness and suitability for the task at hand.
Throughout this blog series, we've delved into the intricacies of each model, examining their strengths and weaknesses. Now, armed with a deeper understanding of their performance, we're equipped to make an informed decision about which model to choose.
In our analysis, we found that the decision tree model outperformed logistic regression in terms of misclassification rates, AUC, and lift at a depth of 20. This indicates that the decision tree is better at identifying individuals who are likely to default on their loan, making it a more favorable choice for our predictive modeling task.
However, it's essential to remember that the best model selection ultimately depends on the specific objectives of the project and the broader business goals. By carefully weighing the trade-offs between accuracy, interpretability, and practical considerations, we can ensure that our chosen model aligns with our overarching objectives and delivers actionable insights to drive decision-making.
In the next post, we will learn how to fit a random forest model using the decision tree action set.
SAS Help Center: Percentile Action Set
SAS Help Center: assess Action
SAS Tutorial | How to compare models in SAS (youtube.com)
Building Machine Learning Models by Integrating Python and SAS® Viya® - SAS Users
Improving Our Agility was published on SAS Users.
Our overarching goal with this model is to adopt a structure based on function rather than on geography. This structure also benefits everyone involved:
During the transition time, our experienced Tech Support engineers will continue to provide you with the same level of excellence you have come to expect from SAS. As always, you are able to engage with Technical Support using the familiar channels of customer portal, chat, phone, and email. If you encounter any issues, please let us know using one of these channels.
Along with this new model, we are modernizing Tech Support to better leverage technological advancements such as generative AI. And, coming soon, we'll let you know about a couple of big improvements that are underway but not yet ready for launch. In the meantime, know that we are here for you, always striving to provide you with the best support possible!
SAS Technical Support is ensuring tomorrow and supporting today.
What migration path is available for SPDS users? was published on SAS Users.
The good news is that we now have a clear migration path: SAS Viya integrated with SingleStore.
SPDS was developed over 30 years ago and has remained a very strong performance option for customers wanting to optimize access to SAS datasets. We now have an option for SAS to operate directly on SQL tables in SingleStore (no need to move data to a SAS dataset). We refer to this as our SAS with SingleStore solution under SAS Viya.
Utilizing SAS with SingleStore lets us address and improve upon historical SPDS functionality. Below is a chart highlighting the capability in SPDS and how it has been addressed in our SAS with SingleStore solution:
SPDS | SAS with SingleStore |
---|---|
Clustered Tables | S2 Sharding enables similar clustering |
SQL Optimization | Next-gen SQL performance |
Pre-sorting for performance | Pre-sort by setting sort keys on tables |
Hyper Indexing options | Columnar datastore with inherent indexing |
SPDS Partition Files / Data distribution | Multi-tier massively parallel architecture |
Optimized table structure for analytics | Columnar table structure optimized for analytics |
Proprietary SAS format | Open ANSI standard SQL access |
Basic file permissions | Advanced access controls |
By leveraging our SAS with SingleStore solution we can deliver all the capabilities offered by our SPDS solution and at the same time deliver significant benefits...
To summarize, our migration solution enables you to harness the rapid performance of SAS Viya alongside cutting-edge SQL-based technologies. Simultaneously, it streamlines data storage costs by optimizing hot, warm and cold data tiers, enhancing the efficiency of your analytical processes.
If you’re looking for a long-term solution with a strong future vision that evolves your SAS environment, we have the answer: SAS Viya (aka Viya 4), with our SAS with SingleStore embedded process capabilities.
Promise delivered: SAS Enterprise Guide integration with SAS Viya 4 was published on SAS Users.
Once you've defined a connection to your SAS Viya 4 environment in SAS Enterprise Guide, you will be able to use SAS Enterprise Guide as a bridge to leverage your existing projects and code workflows as you adopt SAS Viya.
You can then also extend your projects to take advantage of exciting new SAS Viya 4 capabilities, such as embedding Python routines inside your SAS programs, and running SAS code against SAS Cloud Analytic Services to execute data and AI tasks with a new level of scalability.
Watch the recent SAS Viya Release Highlights show for a short demonstration of how to connect SAS Enterprise Guide to SAS Viya and begin running programs and tasks.
You can also review the detailed steps in this article on SAS Communities.
If you're a current SAS Enterprise Guide user, you can update to the new version simply by using the Help->Check for Updates feature in the application menus. (If the auto-update feature is disabled at your site, talk to your SAS administrator to get the latest release.)
Getting Started with Python Integration to SAS Viya for Predictive Modeling - Fitting a Decision Tree was published on SAS Users.
In Part 6 of this series we took our Home Equity data saved in Part 4 and fit a logistic regression to it. In this post we will use the same data and fit a classification decision tree to predict who is most likely to go delinquent on their home equity loan.
For fitting decision trees we will use the decisionTree action set: the dtreeTrain action to train the model and the dtreeScore action to score our validation data.
A decision tree is a representation of possible outcomes based on a series of decisions or variables. It is commonly used in data analysis and predictive modeling to help understand and interpret complex relationships within the data.
A decision tree is created through a process called recursive partitioning, which uses an algorithm to split the data into smaller and more homogenous groups based on specific variables. This allows for the identification of key patterns and relationships within the data, forming branches and nodes in the decision tree.
With SAS Viya, you can easily build, train, and evaluate decision trees to make accurate predictions and informed decisions based on your data using the Decision Tree Action Set.
There are two main types of decision trees: classification and regression. Classification trees are used for predicting categorical or discrete outcomes, while regression trees are used for predicting continuous outcomes.
For our example we will be fitting a classification tree since BAD is a classification variable with 2 levels. 0 if the person is not delinquent on their home equity loan and 1 if they are delinquent.
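Under the hood, a classification tree evaluates candidate splits by how much they purify the target. Since we will use the information-gain criterion (crit="GAIN") when training, here is a small hand-rolled sketch of that calculation on toy 0/1 labels (not our Home Equity data):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    if p1 in (0.0, 1.0):
        return 0.0
    return -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))

parent = [0, 0, 0, 0, 1, 1, 1, 1]            # 50/50 target -> entropy 1.0
left, right = [0, 0, 0, 1], [0, 1, 1, 1]     # one candidate split of those 8 rows

# Information gain = parent entropy minus the size-weighted child entropies
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
gain = entropy(parent) - weighted
print(round(gain, 3))
```

The split with the largest gain is the one the algorithm would choose at that node; recursive partitioning repeats this on each child.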
Let’s start by loading our data, the sashdat, csv, and parquet files we saved in part 4, into CAS memory. I will load the sashdat file using the code below. The csv and parquet files can be loaded using similar syntax.
conn.loadTable(path = "homeequity_final.sashdat", caslib = "casuser",
               casout = {'name':'HomeEquity',
                         'caslib':'casuser',
                         'replace':True})
The home equity data is now loaded and ready for modeling.
Before fitting a decision tree in SAS Viya we need to load the decisionTree action set.
conn.loadActionSet('decisionTree')
The decisionTree action set contains several actions; let's display them to see what is available for us to use.
conn.help(actionSet='decisionTree')
This action set not only contains actions for fitting and scoring decision trees but also for fitting and scoring random forest models and gradient boosting models. Examples for fitting these two will be included in future posts.
Fit a decision tree with the dtreeTrain action using the HomeEquity training data set (i.e., where _PartInd_=1). Save the model to a file named dt_model. We will use a split criterion of information gain and set prune equal to True to use the C4.5 pruning method. Also specify that variable importance information be generated.
Assign the results to Tree_Output.
Tree_Output = conn.decisionTree.dtreeTrain(
    table = dict(name = 'HomeEquity', where = '_PartInd_ = 1'),
    target = 'BAD',
    inputs = ['LOAN', 'IMP_REASON', 'IMP_JOB', 'REGION', 'IMP_CLAGE',
              'IMP_CLNO', 'IMP_DEBTINC', 'IMP_DELINQ', 'IMP_DEROG',
              'IMP_MORTDUE', 'IMP_NINQ', 'IMP_VALUE', 'IMP_YOJ'],
    nominals = ['BAD', 'IMP_REASON', 'IMP_JOB', 'REGION'],
    casOut = dict(name = 'dt_model', replace = True),
    crit = "GAIN",
    prune = True,
    varImp = True
)
Let’s take a look at what the data looks like in the model file dt_model generated from our decision tree.
conn.CASTable("dt_model").fetch()
What are the keys created as part of the output in Tree_Output?
list(Tree_Output.keys())
Three keys are created.
Let’s look at the variable importance information generated for our decision tree model. 9 of our 14 variables turned out to be important in our model, and the list below shows them in order of importance.
Tree_Output['DTreeVarImpInfo']
Now plot these values by creating a data frame with this table and using the Python package matplotlib.
from matplotlib import pyplot as plt

df = Tree_Output['DTreeVarImpInfo']
df.plot(kind = 'bar', x = 'Variable', y = 'Importance')
Now let’s take the model created (file named dt_model) and score using the dtreeScore action to apply it to the validation data (_PartInd_=0). Create a new dataset called dt_scored to store the scored data.
dt_score_obj = conn.decisionTree.dtreeScore(
    table = dict(name = 'HomeEquity', where = '_PartInd_ = 0'),
    model = "dt_model",
    casout = dict(name = "dt_scored", replace = True),
    copyVars = 'BAD',
    encodename = True,
    assessonerow = True
)
Look at the first 5 rows of the scored data to see the predicted probability for each row.
conn.CASTable('dt_scored').head()
We will assess the performance of a decision tree model using the same metrics we used for the logistic regression in Part 6 of this series: confusion matrix, misclassification rates, and a ROC (Receiver Operating Characteristic) plot. These measures help us determine how well our decision tree model fits the data.
To calculate these metrics, we will use the percentile action set and the assess action. Load the percentile action set and display the actions.
conn.loadActionSet('percentile')
conn.builtins.help(actionSet='percentile')
Assess the decision tree model using the scored data (dt_scored). Two data sets are created, named dt_assess and dt_assess_ROC.
conn.percentile.assess(
    table = "dt_scored",
    inputs = 'P_BAD1',
    casout = dict(name = "dt_assess", replace = True),
    response = 'BAD',
    event = "1"
)
Look at the first 5 rows of the data from the assess action output dataset dt_assess. Here you see the values at each depth of data, from 5% incremented by 5%.
display(conn.table.fetch(table='dt_assess', to=5))
Look at the first 5 rows of data from the assess action output data set dt_assess_ROC. This data is organized by cutoff value, starting at 0.00 and going to 1.00 in increments of 0.01.
conn.table.fetch(table='dt_assess_ROC', to=5)
In decision trees, like in logistic regression, we use a cutoff value to make decisions, kind of like drawing a line in the sand. In this case, if we use a cutoff value of 0.03 from the table above, it means we are using a prediction probability of 0.03 to predict if someone is going to be delinquent on their loan. If we choose 0.03, then for our validation data the predicted true positives will be 347 and true negatives 388. The default cutoff value is 0.5, or 50%.
Bring these results to our client by creating local data frames so we can calculate a confusion matrix, misclassification rate, and ROC plot for our decision tree model.
dt_assess = conn.CASTable(name = "dt_assess").to_frame()
dt_assess_ROC = conn.CASTable(name = "dt_assess_ROC").to_frame()
Create a confusion matrix, which compares predicted values to actual values. It breaks down these predictions into four categories: true positives, true negatives, false positives, and false negatives. Here are the category definitions:
True positive (TP) - the model predicts a delinquent loan, and the loan is delinquent.
True negative (TN) - the model predicts a non-delinquent loan, and the loan is not delinquent.
False positive (FP) - the model predicts a delinquent loan, but the loan is not delinquent.
False negative (FN) - the model predicts a non-delinquent loan, but the loan is delinquent.
These measures help us evaluate the performance of the model and assess its accuracy in predicting the outcome of interest.
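The four counts can be tallied directly from actual and predicted labels. A quick sketch, using made-up labels rather than the HMEQ validation data:

```python
# Tally confusion matrix cells from actual vs. predicted labels (1 = delinquent).
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical actual outcomes and model predictions
actual    = [1, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1]
print(confusion_counts(actual, predicted))
```

The assess action computes these same cells for every cutoff and surfaces them as the _TP_, _TN_, _FP_, and _FN_ columns you'll see below.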
Use a cutoff value of 0.5, which means if the predicted probability is greater than or equal to 0.5 then our model predicts someone will be delinquent on their home equity loan. If the predicted value is less than 0.5 then our model predicts someone will not be delinquent on their home equity loan.
#create confusion matrix cutoff_index = round(dt_assess_ROC['_Cutoff_'],2)==0.5 conf_mat = dt_assess_ROC[cutoff_index].reset_index(drop=True) conf_mat[['_TP_','_FP_','_FN_','_TN_']] |
We can also calculate a misclassification rate, which indicates how often the model makes incorrect predictions.
# calculate misclassification rate conf_mat['Misclassification'] = 1-conf_mat['_ACC_'] miss = conf_mat[round(conf_mat['_Cutoff_'],2)==0.5][['Misclassification']] miss |
Our misclassification rate for the decision tree model is .128635, or about 13%. This means that our model is wrong about 13% of the time and correct about 87% of the time.
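The relationship between accuracy and misclassification is simple arithmetic over the confusion matrix cells. A quick check with hypothetical counts (not the values from dt_assess_ROC):

```python
# Misclassification rate = 1 - accuracy, where accuracy = (TP + TN) / total.
tp, tn, fp, fn = 500, 370, 70, 60   # illustrative cell counts

total = tp + tn + fp + fn            # 1000 predictions
accuracy = (tp + tn) / total         # fraction predicted correctly
misclassification = 1 - accuracy     # fraction predicted incorrectly

print(accuracy, misclassification)
```

This is exactly what the `1 - conf_mat['_ACC_']` line computes from the assess output.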
Additionally, a ROC (Receiver Operating Characteristic) plot visually displays the trade-off between sensitivity and specificity, helping us evaluate the overall performance of the model. The closer the curve is to the top left corner, the higher the overall performance of the model.
Use the Python graphic package matplotlib to plot our ROC Curve.
# plot ROC Curve from matplotlib import pyplot as plt plt.figure(figsize=(8,8)) plt.plot(dt_assess_ROC['_FPR_'],dt_assess_ROC['_Sensitivity_'], label=' (C=%0.2f)'%dt_assess_ROC['_C_'].mean()) plt.xlabel('False Positive Rate', fontsize=15) plt.ylabel('True Positive Rate', fontsize=15) plt.legend(loc='lower right', fontsize=15) plt.show() |
Our curve is somewhat close to the top left corner, which indicates a good fit but also shows room for improvement. The C=0.87 is the Area Under the Curve (AUC) statistic: our model does better than the 0.5 of a random classifier but falls short of a perfect model at 1.
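The AUC is literally the area under the (false positive rate, sensitivity) curve, which can be approximated with the trapezoidal rule. A minimal sketch on two hypothetical curves, not the dt_assess_ROC data:

```python
# Trapezoidal approximation of AUC from (FPR, TPR) pairs sorted by FPR.
def auc_trapezoid(fpr, tpr):
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        avg_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * avg_height
    return area

# A perfect classifier hugs the top-left corner
print(auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))
# The chance diagonal of a random classifier
print(auc_trapezoid([0.0, 1.0], [0.0, 1.0]))
```

Running the same calculation over the _FPR_ and _Sensitivity_ columns would recover the C statistic reported by the assess action.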
The Decision Tree action set in SAS Viya with Python using SWAT makes it simple to create and analyze decision trees for your data. By using dtreeTrain to train our decision tree and dtreeScore to score our validation (holdout) sample, we can evaluate how well our decision tree model fits the data and predicts new data.
In the next post, we will compare the logistic regression model to this decision tree to assess which of the two models is the better fit.
SAS Help Center: Load a SASHDAT File from a Caslib
SAS Help Center: loadTable Action
Getting Started with Python Integration to SAS® Viya® - Part 5 - Loading Server-Side Files into Memory
SAS Help Center: Decision Tree Details
SAS Help Center: dtreeTrain Action
SAS Help Center: dtreeScore Action
SAS Help Center: Percentile Action Set
SAS Help Center: assess Action
Tree-related Models: Supervised Learning in SAS Visual Data Mining and Machine Learning
Decision Tree in Layman’s Terms - SAS Support Communities
Getting Started with Python Integration to SAS Viya for Predictive Modeling - Fitting a Decision Tree was published on SAS Users.
2023 Achievements for SAS Technical Support was published on SAS Users.
By far our biggest accomplishment was the launch of the new customer portal. To achieve this, Technical Support worked with divisions across SAS, as well as vendors, to do the following:
After the customer portal launch in August of 2023, Technical Support’s priority has been focused on addressing software issues and implementing enhancements to improve the customer experience. Despite the expected issues during the first months of use, we are excited about the possibilities available with this new software. For more details, see my blog, Partnership: The SAS customer experience and you.
We want it to be easy for you to engage with us via the channel of your choice—portal, chat, email, or phone. To this end, we unified all of Technical Support onto one global, cloud-based phone system. Although some training is required in order for everyone to be able to use the new system seamlessly, this is a big step toward our goal of providing you the phone support that you need when you need it.
To clarify and streamline SAS product support policy information, we modernized the SAS Technical Support Policies with a landing page that links to more detailed subpages. This is the first step in our effort to improve your understanding of SAS support policies, so stay tuned for future enhancements to this information.
To facilitate more efficient responses and resolutions in customer cases, we created a custom model for how Tech Support and R&D engage with one another on these issues. This model included making updates to internal tools to convey information easily and with transparency between the divisions. We have already seen some gains in efficiency since this model was implemented.
To better enable Technical Support to analyze metrics related to case work, we enhanced our internal reporting software:
These enhancements provide insight that will help all of SAS collaborate to create better software that is easier to use.
We have many things in the works for 2024 that we are excited about, including:
As always, our goal is to bring a modern and methodical approach to providing world-class customer support, which we have done for over 40 years. We are committed to continuous improvement to give you the support needed to achieve the best outcomes for your business. I’ll give you updates throughout the year as we tackle our goals for 2024. Thanks again for being a SAS customer!
9 for SAS9 – Top Tips for SAS 9 Programmers Moving to SAS Viya was published on SAS Users.
Let's address those concerns, get excited, and delve into 9 essential insights for SAS 9 programmers making the transition to SAS Viya. All the programs are available in my GitHub repository.
First and foremost: Almost ALL your SAS9 code runs on SAS Viya. Think of the SAS Viya platform as a car with two engines. The first engine, the SAS Compute server, is designed to run traditional SAS code and provide the expected results. The SAS Compute server is equivalent to the traditional SAS9 engine (sometimes called the SAS workspace server or SPRE). Your DATA step, PROC MEANS, FREQ, favorite statistical procedures, SAS/ACCESS technology, ODS Graphics and more will produce the same results as they did in a SAS 9 environment. Don’t believe me? Go ahead - run this code in either SAS9 or SAS Viya. It doesn’t matter!
The code below uses the following SAS9 procedures and statements:
SAS9_Tips_Viya_01.sas
This program will download the home_equity.csv file from the Example Data Sets for the SAS Viya Platform documentation page and import it as a SAS table. Make sure to specify where to download the CSV file in your SAS environment by modifying the path macro variable. If your SAS environment is in lockdown mode you will have to manually download the CSV file and load it to your environment.
/**********************************/ /* 1 - Run SAS9 Code on SAS Viya! */ /**********************************/ /*****************************************/ /* a. REQUIRED - Specify the folder path */ /*****************************************/ /* This code will dynamically specify the project folder */ /* REQUIRED - SAS program must be saved to the location */ /* REQUIRED - Valid only in SAS Studio */ %let fileName = %scan(&_sasprogramfile,-1,'\/'); %let myPath = %sysfunc(tranwrd(&_sasprogramfile, &fileName,)); %put &=myPath; /* You can also manually specify your path to the location you want to save the downloaded CSV file if the code above does not work */ %let path = &myPath; /*-----Modify your path here if necessary - Example path: C:/user/documents/ */ /* View the path to download the CSV file in the log */ %put &=path; /*********************************/ /* b. Download CSV file from SAS */ /*********************************/ /* SAS Viya documentation data sets URL */ %let download_url = https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/home_equity.csv; /* Download CSV file from the internet and save the CSV file in SAS */ filename out_file "&path/home_equity.csv"; proc http url="&download_url" method="get" out=out_file; run; /* Import the CSV file and create a SAS table in the WORK library */ proc import datafile="&path/home_equity.csv" dbms=csv out=work.home_equity replace; guessingrows=1000; run; /*********************************/ /* c. Run SAS9 in Viya! 
*/ /*********************************/ /* Preview the SAS table */ proc print data=work.home_equity(obs=10); run; /* View column metadata */ ods select Variables; proc contents data=work.home_equity; run; /* View descriptive statistics */ proc means data=work.home_equity; run; /* View number of distinct values in specified columns */ proc sql; select count(distinct BAD) as DistinctBAD, count(distinct REASON) as DistinctREASON, count(distinct JOB) as DistinctJOB, count(distinct NINQ) as DistinctNINQ, count(distinct CLNO) as DistinctCLNO, count(distinct STATE) as DistinctSTATE, count(distinct DIVISION) as DistinctDIVISION, count(distinct REGION) as DistinctREGION from work.home_equity; quit; /* View categorical column frequencies */ proc freq data=work.home_equity order=freq nlevels; tables BAD REASON JOB NINQ CLNO STATE DIVISION REGION / plots=freqplot missing; run; /* View missing values in the table */ /* a. Create a format to group missing and nonmissing */ proc format; value $missfmt ' '='Missing' other='Not Missing'; value missfmt . ='Missing' other='Not Missing'; run; /* b. Apply the format in PROC FREQ */ proc freq data=work.home_equity; format _CHAR_ $missfmt.; /* apply format for the duration of this PROC */ tables _CHAR_ / missing missprint nocum nopercent; format _NUMERIC_ missfmt.; tables _NUMERIC_ / missing missprint nocum nopercent; run; /* Find the mean of the following columns to use to replace missing values using PROC SQL */ proc sql; select round(mean(YOJ)) as MeanYOJ, round(mean(MORTDUE)) as MeanMORTDUE, round(mean(VALUE)) as MeanVALUE, round(mean(DEBTINC)) as MeanDEBTINC into :YOJmean trimmed, :MORTDUEmean trimmed, :VALUEmean trimmed, :DEBTINCmean trimmed from work.home_equity; quit; %put &=YOJmean &=MORTDUEmean &=VALUEmean &=DEBTINCmean; /* Prepare the data */ data work.final_home_equity; set work.home_equity; /* Fix missing values */ if YOJ = . then YOJ = &YOJmean; if MORTDUE = . then MORTDUE = &MORTDUEmean; if VALUE = .
then VALUE = &VALUEmean; if DEBTINC = . then DEBTINC = &DEBTINCmean; /* Round column */ DEBTINC = round(DEBTINC); /* Format columns */ format APPDATE date9.; /* Drop columns */ drop DEROG DELINQ CLAGE NINQ CLNO CITY; run; /* Check the final data for missing values */ proc freq data=work.final_home_equity; format _CHAR_ $missfmt.; /* apply format for the duration of this PROC */ tables _CHAR_ / missing missprint nocum nopercent; format _NUMERIC_ missfmt.; tables _NUMERIC_ / missing missprint nocum nopercent; run; /* Preview final data */ proc print data=work.final_home_equity(obs=10); run; /* Create a visualization */ title height=14pt justify=left "Current vs Default Loans"; proc sgplot data=work.final_home_equity; vbar BAD / datalabel; run; title; /* Create a logistic regression model to predict bad loans */ proc logistic data=work.final_home_equity; class REASON JOB / param=REFERENCE; model BAD(event='1') = LOAN MORTDUE VALUE REASON JOB YOJ DEBTINC; store mymodel; run; /* Score the model on the data */ proc plm restore=mymodel; score data= work.final_home_equity out=work.he_score predicted lclm uclm / ilink; run; /* Preview the scored data */ proc print data=work.he_score(obs=25); run; |
The results demonstrate that the code produces the expected output, regardless of whether it is executed in SAS9 or SAS Viya.
One common issue when moving from SAS9 to SAS Viya is hardcoded paths to your data. If you are transitioning from SAS9, make sure you are pointing to where your data now lives in the new Viya environment. This code uses the CSV file downloaded by the previous program.
For example, let's pretend you were originally using SAS locally, and referencing the CSV file in your local directory. When moving to Viya, make sure the data you need is on the server, then simply modify the path. That's it!
SAS9_Tips_Viya_02.sas
/*****************************************/ /* 2 - Check hardcoded paths */ /*****************************************/ /* Old local path or SAS9 remote server path */ %let old_path = C:\workshop; proc import datafile="C:/users/peter/home_equity.csv" dbms=csv out=work.new_table replace; guessingrows=1000; run; /* New path to data in SAS Viya */ /* You can use the path macro variable from the previous program */ %let path = &path; /* ----- modify path to your data on the Viya server. Example - /new_viya_path/user */ proc import datafile="&path/home_equity.csv" dbms=csv out=work.new_table replace; guessingrows=1000; run; |
Remember, I said SAS Viya is like a car with two engines? Well, the other engine is called Cloud Analytic Services (CAS). CAS is a massively parallel processing engine that processes memory-resident data. That’s right - you now have the ability to load data into memory for extended periods of time, as well as access to a cluster of machines designed for high-speed, parallel processing of big data.
The main difference to remember for CAS processing is that the data must be explicitly loaded into memory before it can be processed. Once loaded, the data remains available in-memory until removed, greatly reducing the impact of I/O in your processing. Here is an example of how to connect to the CAS server from your Compute server client, load a server-side file into memory, view the contents of the distributed in-memory CAS table, and then unload the CAS table.
SAS9_Tips_Viya_03.sas
/******************************************************************/ /* 3 - Loading data into memory in CAS for distributed processing */ /******************************************************************/ /***************************************************************/ /* a. Connect the Compute Server to the distributed CAS Server */ /***************************************************************/ cas conn; /**********************************************/ /* b. View data sources connected to SAS Viya */ /**********************************************/ /* View available libraries (data sources) to the SAS Compute server */ libname _all_ list; /* View available caslibs (data sources) connected to the CAS cluster */ caslib _all_ list; /*********************************************************/ /* c. View available files in a caslib on the CAS server */ /*********************************************************/ /* The samples caslib is available by default. It's similar to the SASHELP library on the Compute server */ proc casutil; list files incaslib = 'samples'; quit; /********************************************************************************************/ /* d. Load a table into the distributed CAS server and view metadata of the in-memory table */ /********************************************************************************************/ proc casutil; /* Explicitly load a server-side file into memory (files can be a database table, or other file formats like CSV,TXT, PARQUET and more) */ load casdata='RAND_RETAILDEMO.sashdat' incaslib = 'samples' casout='RAND_RETAILDEMO' outcaslib = 'casuser'; /* View available in-memory tables in the Casuser caslib */ list tables incaslib = 'casuser'; /* View the contents of the in-memory table */ contents casdata='RAND_RETAILDEMO' incaslib = 'casuser'; quit; /*****************************************/ /* e. 
Drop a distributed in-memory table */ /*****************************************/ proc casutil; droptable casdata='RAND_RETAILDEMO' incaslib = 'casuser'; quit; /*************************************/ /* f. Disconnect from the CAS server */ /*************************************/ cas conn terminate; |
The results display a range of information. The section Detail Information for RAND_RETAILDEMO in Caslib CASUSER(Peter) shows that the RAND_RETAILDEMO.sashdat file has been loaded into memory in CAS and distributed into blocks. This block distribution enables massively parallel processing of the data in CAS for faster results.
For more information about the CAS server, check out the SAS® Cloud Analytic Services: Fundamentals documentation page.
You can also run DATA step code on the distributed CAS server for massively parallel processing. This will dramatically increase the speed of your data preparation on big data. To run DATA step in CAS you must make a library reference to the Caslib on the CAS server. Then simply read from and write to CAS using the SAS DATA step. Most traditional SAS9 DATA step functions and statements are available in CAS.
A few considerations to make when running DATA step on the CAS server:
SAS9_Tips_Viya_04.sas
/**********************************************************/ /* 4 - Run DATA step in on the distributed CAS server */ /**********************************************************/ /* NOTE: The data is small for training purposes */ /**********************************************************/ /***************************************************************/ /* a. Connect the Compute Server to the distributed CAS Server */ /***************************************************************/ cas conn; /**************************************************/ /* b. Explicitly load a file into memory into CAS */ /**************************************************/ /* Load the demo home_equity.csv client-side file from the SAS Viya example data sets website into the CAS server */ filename out_file url "https://support.sas.com/documentation/onlinedoc/viya/exampledatasets/home_equity.csv"; proc casutil; load file=out_file casout='home_equity' outcaslib = 'casuser'; quit; /* Confirm the table was loaded into CAS */ proc casutil; /* View available in-memory distributed tables */ list tables incaslib = 'casuser'; quit; /********************************************************************/ /* c. Create a library reference to a caslib using the CAS engine */ /********************************************************************/ libname casuser cas caslib = 'casuser'; /****************************/ /* d. Preview the CAS table */ /****************************/ title "Original Raw Data"; proc print data=casuser.home_equity(obs=10); run; title; /*************************************************************************/ /* e. 
Run DATA step on the in-memory table in the distributed CAS server */ /* and create a new in-memory CAS table */ /*************************************************************************/ /* Prepare the data using the distributed CAS server */ data casuser.final_home_equity; set casuser.home_equity end=end_of_thread; /* Use the END= option to view the processing by thread */ /* Fix missing values with means */ if YOJ = . then YOJ = 9; if MORTDUE = . then MORTDUE = 73761; if VALUE = . then VALUE = 101776; if DEBTINC = . then DEBTINC = 34; /* Round column */ DEBTINC = round(DEBTINC); /* Format columns */ format APPDATE date9.; /* Drop columns */ drop DEROG DELINQ CLAGE NINQ CLNO CITY; /* View number of rows processed on each thread (demo data, not all threads will be used) */ if end_of_thread=1 then put 'Total Available Threads for Processing: ' _NTHREADS_ ', Processing Thread ID: ' _THREADID_ ', Total Rows Processed by Thread: ' _N_ ; run; /*********************************/ /* f. Preview the new CAS table */ /*********************************/ title 'Prepared CAS table'; proc print data=casuser.final_home_equity(obs=10); run; title; /*****************************************************************/ /* Continue to the next program without disconnecting from CAS */ /*****************************************************************/ |
The results show the new CAS table was successfully prepared as expected. Upon reviewing the log, we can see that this DATA step executed in parallel across multiple threads. Although the dataset in this example is relatively small for training purposes, the CAS server is generally used for larger data. Additionally, for the subsequent steps, let's leave the CAS table in memory and proceed with further processing without having to load it back into memory.
SAS Viya supplies a wide array of new procedures designed to process data in CAS. The key is you must again have data loaded into memory on the distributed CAS server, create a library reference using the CAS engine to the caslib, then utilize the new procedure. Those resource intensive analytics you need to run? They’re really going to fly on SAS Viya! Check out the SAS Procedures and Corresponding CAS Procedures and Actions documentation for more information.
In this example, the data was loaded in the previous tip, and the library reference to the caslib is already set. So I can just continue running my analytical workflow!
SAS9_Tips_Viya_05.sas
The following CAS procedures are used:
/********************************************************************/ /* 5 - New distributed PROCS for the CAS server */ /********************************************************************/ /* NOTE: Continue processing the final_home_equity CAS table from */ /* the previous program. Once the data is loaded in-memory */ /* it stays in-memory until dropped or the CAS session ends. */ /********************************************************************/ /************************************/ /* a. Descriptive statistics in CAS */ /************************************/ proc mdsummary data=casuser.final_home_equity; output out=casuser.home_equity_summary; run; proc print data=casuser.home_equity_summary; run; /************************************************/ /* b. Frequencies in the distributed CAS server */ /************************************************/ proc freqtab data=casuser.final_home_equity; tables BAD REASON JOB STATE DIVISION REGION / plots=freqplot; quit; /************************************************/ /* c. Correlation in the distributed CAS server */ /************************************************/ proc correlation data=casuser.final_home_equity; run; /*********************************************************************************************************/ /* d. View the cardinality of the columns using CAS */ /*********************************************************************************************************/ /* The CARDINALITY procedure determines a variable’s cardinality or limited cardinality in SAS Viya. */ /* The cardinality of a variable is the number of its distinct values, and the limited cardinality of a */ /* variable is the number of its distinct values that do not exceed a specified threshold. 
*/ /*********************************************************************************************************/ proc cardinality data=casuser.final_home_equity outcard=casuser.home_equity_cardinality maxlevels=250; run; proc print data=casuser.home_equity_cardinality; run; /********************************************************/ /* e. Logistic regression in the distributed CAS server */ /********************************************************/ /* Demo: Logistic Regression Modeling Using the LOGSELECT Procedure in SAS Viya */ /* https://video.sas.com/detail/video/5334372288001/logistic-regression-modeling-using-the-logselect-procedure-in-sas-viya */ /* Run a logistic regression using the distributed CAS server */ proc logselect data=casuser.final_home_equity; class REASON JOB / param=REFERENCE; model BAD(event='1') = LOAN MORTDUE VALUE REASON JOB YOJ DEBTINC; store out=casuser.mymodel; run; /* Score the data using your model */ proc astore; score data=casuser.final_home_equity rstore=casuser.mymodel copyvars=BAD out=casuser.home_equity_scored; quit; /* Preview the scored data */ proc print data=casuser.home_equity_scored(obs=10); run; /*****************************************************************/ /* Continue to the next program without disconnecting from CAS */ /*****************************************************************/ |
The results display a range of information, including descriptive statistics, frequency values, visualizations, correlation coefficients, model details, and scoring information. All these analyses were processed on the distributed CAS server, showcasing just a subset of the numerous CAS-enabled procedures available.
Want to run SQL on the distributed CAS server? Use FedSQL! Be aware that PROC SQL does not run in CAS.
SAS9_Tips_Viya_06.sas
/********************************************************************/ /* 6 - Execute SQL in the distributed CAS server */ /********************************************************************/ /* NOTE: Continue processing the final_home_equity CAS table from */ /* the previous program. Once the data is loaded in-memory */ /* it stays in-memory until dropped or the CAS session ends. */ /********************************************************************/ /* To run SQL in the distributed CAS server you must use FedSQL with the sessref = option and the CAS session name */ /* Simple LIMIT */ proc fedsql sessref=conn; select * from casuser.final_home_equity limit 10; quit; /* GROUP BY */ proc fedsql sessref=conn; select BAD, count(*) as TotalLoans, mean(MORTDUE) as avgMORTDUE, mean(YOJ) as avgYOJ from casuser.final_home_equity group by BAD; quit; /*****************************************************************/ /* Continue to the next program without disconnecting from CAS */ /*****************************************************************/ |
The results demonstrate that you can utilize ANSI-standard SQL on the distributed CAS server. However, it's important to note that PROC SQL won't execute in CAS. To leverage distributed processing, use PROC FEDSQL instead.
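Because FedSQL implements ANSI SQL, the GROUP BY pattern above behaves like standard SQL anywhere. As a purely local, non-CAS illustration of the same aggregation shape, here is the equivalent query against Python's built-in sqlite3 on a toy table (all column values are invented, and SQL's AVG stands in for FedSQL's mean function):

```python
import sqlite3

# Toy stand-in for the final_home_equity table; values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE home_equity (BAD INTEGER, MORTDUE REAL, YOJ REAL)")
conn.executemany(
    "INSERT INTO home_equity VALUES (?, ?, ?)",
    [(0, 70000, 10), (0, 80000, 8), (1, 60000, 4), (1, 50000, 2)],
)

# Same shape as the FedSQL query: totals and averages per value of BAD.
rows = conn.execute(
    """SELECT BAD, COUNT(*) AS TotalLoans,
              AVG(MORTDUE) AS avgMORTDUE, AVG(YOJ) AS avgYOJ
       FROM home_equity GROUP BY BAD ORDER BY BAD"""
).fetchall()
print(rows)
```

The FedSQL version distributes this same work across the CAS cluster instead of running it on a single machine.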
If you are comfortable with object-oriented languages like Python, you’ll like CAS’ native language. CAS “speaks” CASL, with the work being done via CAS actions, which are optimized to run distributed on the CAS server. CAS actions perform a single task and are organized with other actions within an action set. You can think of actions as methods or functions. Using actions gives you the utmost control on your processing, and even faster results.
SAS9_Tips_Viya_07.sas
/**********************************************************************************/ /* 7 - Use the native CAS language (CASL) for additional data types and actions */ /**********************************************************************************/ /* NOTE: Continue processing the home_equity_final CAS table from */ /* the previous program. Once the data is loaded in-memory */ /* it stays in-memory until dropped or the CAS session ends. */ /**********************************************************************************/ /* Use CASL data types */ proc cas; /* String */ myString = 'Peter Styliadis'; print myString; describe myString; /* Numeric */ myInt = 35; print myInt; describe myInt; myDouble = 35.5; print myDouble; describe myDouble; /* List */ myList = {'Peter', 'SAS', 37, 'Curriculum Development', {'Owen', 'Eva'}}; print myList; describe myList; /* Dictionary */ myDict = {Name='Peter', Age=37, Job='Curriculum Development', Children={'Owen', 'Eva'}}; print myDict; describe myDict; quit; /* Use native CAS actions on the CAS server for MPP */ proc cas; /* View available files in a caslib */ table.fileInfo / caslib = 'samples'; /* View available in-memory CAS tables in a caslib */ table.tableInfo / caslib = 'casuser'; /* Reference the CAS table using a dictionary */ tbl = {name="FINAL_HOME_EQUITY", caslib="casuser"}; /* Preview 10 rows of the CAS table */ table.fetch / table = tbl, to=10; /* View the number of missing values and distinct values */ simple.distinct / table = tbl; /* View descriptive statistics */ simple.summary / table = tbl; /* View frequency values */ cols_for_freq = {'BAD','REASON','JOB'}; simple.freq / table = tbl, input = cols_for_freq; quit; /***********************************/ /* Disconnect from the CAS server */ /***********************************/ cas conn terminate; |
The log and results show that the CAS procedures and actions ran successfully. Using CASL gives you additional capabilities through new data types like lists and dictionaries, as well as optimized CAS actions that run on the distributed CAS server. There are hundreds of new actions available. Check out the Actions by Name documentation for a list of available actions.
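If you already know Python, CASL's data types will feel familiar: CASL lists and dictionaries behave much like Python lists and dicts, including mixed types and nesting. A rough Python analogue of the CASL values in the PROC CAS example above:

```python
# Python analogues of the CASL string, numeric, list, and dictionary values.
my_string = "Peter Styliadis"
my_int = 35
my_double = 35.5

# CASL lists can mix types and nest, just like Python lists
my_list = ["Peter", "SAS", 37, "Curriculum Development", ["Owen", "Eva"]]

# CASL dictionaries hold named values, like Python dicts
my_dict = {
    "Name": "Peter",
    "Age": 37,
    "Job": "Curriculum Development",
    "Children": ["Owen", "Eva"],
}

print(type(my_list).__name__, type(my_dict).__name__)
```

This is also why action calls like `table.fetch / table=tbl, to=10` read so naturally to Python users: the `tbl` dictionary plays the same role as a Python dict of keyword arguments.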
SAS Viya also enables you to build flows in SAS Studio using custom programs or premade steps to help organize your analytic workflows. This is a great feature for programmers and point-and-click users alike! SAS Studio flows are similar to Enterprise Guide projects.
Check out the quick start video below for a quick demonstration!
SAS Viya integrates open source technology across the analytics life cycle, enabling all types of SAS and open source users to collaborate. If you are a SAS programmer but dabble in Python, SAS Viya enables you to run Python code with the Python editor or PROC PYTHON in SAS Studio. You can also use the SAS Python SWAT package to process data in the massively parallel processing CAS engine using all Python.
Check out the resources below.
OK, one last one. It's not specifically for programmers so it doesn't really count as 10, but if you want a glimpse of some of the point-and-click applications on the SAS Viya platform for creating dashboards, machine learning models, managing models, discovering information assets and more, check out the SAS Viya Quick Start Tutorials playlist!
We learned a lot here, but if you want a fundamentals course on making your SAS code run faster on SAS Viya check out the Accelerating SAS® Code on the SAS® Viya® Platform course! You can also check out the Accelerate Code in SAS Cloud Analytics | SAS Viya Quick Start Tutorial for a demonstration.
2024 SAS Customer Recognition Awards – It’s Time to Vote! was published on SAS Users.
While the online voting is happening, we have a panel of five judges who will score each entry on a 1-5 scale in three categories:
When the voting closes on February 2nd, the top three vote-getters will have bonus points added to their scores from the judges to create a total score for each entry. The top three scores in each category will be the winners. The first-place winner will get a trip to SAS Innovate*!
The first-place winners will be notified during February, and all winners will be publicly announced in April: the first-place winners at SAS Innovate, with the full list of winners posted online right after the event.
Tom Abernathy: Tom has been using SAS since 1983 to analyze all types of clinical data from experiments and clinical trials to Real World data during his last 31 years for Pfizer. The company uses SAS to do data management, analysis and reporting. He has been interacting with SAS users online since the days of BITNET.
Louise Hadden: Louise Hadden presented at her first SAS® conference in 1996 and has never looked back, presenting at multiple conferences across the continent over the years. She supports analytic processing at Abt Associates Inc. as a Science and Research Senior Manager and specializes in reporting and data visualization in the data science capability. Much of her current work portfolio is in the health sciences domain, working with CMS, CDC, and state governments. She is also the Girl with the SAS Tattoo.
Chris Hemedinger: As the Director of SAS User Engagement, Chris’ talented team oversees SAS online communities, SAS user groups, developer experience and GitHub, tech newsletters, expert webinars, and tutorials. He is a recovering software developer who helped build popular SAS products such as SAS® Enterprise Guide®. Inexplicably, Chris is still coasting on the limited fame he earned as an author of SAS For Dummies. You can follow Chris on Twitter/X as @cjdinger.
Fareeza Khurshed: Fareeza is a graduate from the University of Alberta and has worked in the Data Science field for 20 years. She currently works for Alberta Blue Cross, leading the Analytics team and has previously worked for the Government of Alberta in several ministries as well as the BC Cancer Agency and private consulting. SAS has been a significant contributor to her achievements throughout her career and she is a leading contributor on the SAS community’s board.
Udo Sglavo: Udo leads Applied Artificial Intelligence and Modeling Research and Development at SAS. With a 25-year track record of fostering technology innovation and excellence, Udo heads a team of expert developers and data scientists dedicated to pioneering cutting-edge software and leveraging advanced models to transform the way the world works. By imagining and building the next generation of AI-driven models and software solutions, Udo helps organizations harness the power of data and analytics to solve their toughest business challenges and outpace the world around them. Udo’s commitment to pushing the boundaries of what AI can achieve has earned him recognition as a global thought leader shaping the future of technology. In addition, Udo has written three publications and holds four patents in advanced analytics.
*Please visit the SAS Customer Recognition Awards site for all the program details, rules and to cast your vote. Please reach out to Juan Cabanillas with any questions.
2024 SAS Customer Recognition Awards – It’s Time to Vote! was published on SAS Users.
]]>Getting Started with Python Integration to SAS Viya for Predictive Modeling - Fitting a Logistic Regression was published on SAS Users.
]]>In part 4 of this series, we created and saved our modeling data set with all our updates from imputing missing values and assigning rows to training and validation data. Now we will use this data to predict if someone is likely to go delinquent on their home equity loan.
Just like in Part 5, where we fit a linear regression to this data, we will be using the regression action set, but this time with the logistic action. Then we will take our model and score the validation data using the logisticScore action.
Logistic regression is a statistical technique for predicting binary outcomes (e.g., yes or no to being delinquent on a home equity loan) based on one or more independent variables. It works by estimating the probability of an event occurring and classifying it as either 0 or 1, making it useful for predicting categorical outcomes. It is commonly used in fields such as medicine, marketing, and finance to analyze and understand the relationship between predictor variables and a binary outcome.
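Under the hood, logistic regression passes a linear combination of the inputs through the logistic (sigmoid) function to turn it into a probability between 0 and 1. A minimal Python sketch of that transformation (an illustration only, separate from the CAS actions used in this post):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A linear predictor of 0 corresponds to a 50/50 probability,
# and the probability rises toward 1 as the predictor grows.
print(round(sigmoid(0.0), 2))   # 0.5
print(round(sigmoid(2.0), 2))   # 0.88
```

Classifying an observation as 0 or 1 then amounts to comparing this probability against a cutoff, which is exactly what the scoring and assessment steps later in this post do.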
The Regression Action Set in SAS Viya is a collection of procedures and functions designed to perform various types of regression analysis. It includes actions such as Linear Regression, Logistic Regression, General Linear Models (GLM), and more. These actions allow users to build statistical models for predicting outcomes or understanding the relationships between variables. Additionally, the action set provides various options for model selection, diagnostics, and scoring, helping users interpret and validate their results. Overall, the Regression Action Set in SAS Viya offers a comprehensive set of tools for conducting regression analysis efficiently and effectively.
To fit a logistic regression and score data we will use the logistic action and the logisticScore action from the Regression Action Set.
Let’s start by loading the data we saved in part 4 into CAS memory. I will load the sashdat file for my example; the CSV and parquet files can be loaded using similar syntax.
conn.loadTable(path="homeequity_final.sashdat", caslib="casuser",
               casout={'name':'HomeEquity',
                       'caslib':'casuser',
                       'replace':True})
The home equity data is now loaded and ready for modeling.
Before we can fit a logistic regression model, we need to load the regression action set.
conn.loadActionSet('regression')
The regression action set consists of several actions; let’s display them to see what is available.
conn.help(actionSet='regression')
The actions include modeling algorithms like glm, genmod, and logistic as well as corresponding actions where we can score data using the models created.
Fit a logistic regression model using the logistic action on the HomeEquity training data set (i.e., where _PartInd_ = 1), and save the model in a store named lr_model. In the model statement we also specify that the event to model is BAD = 1, that is, someone going delinquent (BAD) on their loan (event='1').
conn.regression.logistic(
    table = dict(name = 'HomeEquity', where = '_PartInd_ = 1'),
    classVars = ['IMP_REASON', 'IMP_JOB', 'REGION'],
    model = dict(depVar = [dict(name = 'BAD', options = dict(event = '1'))],
                 effects = [dict(vars = ['LOAN', 'IMP_REASON', 'IMP_JOB', 'REGION',
                                         'IMP_CLAGE', 'IMP_CLNO', 'IMP_DEBTINC',
                                         'IMP_DELINQ', 'IMP_DEROG', 'IMP_MORTDUE',
                                         'IMP_NINQ', 'IMP_VALUE', 'IMP_YOJ'])]),
    store = dict(name = 'lr_model', replace = True)
)
The default output includes information about the model, the number of observations used, response profile, class level information, convergence status, fit statistics, and parameter estimates.
The data used was the HomeEquity data we loaded in memory and the target variable or Y is BAD, which is the 0, 1 indicator variable on whether someone went delinquent on a home equity loan. 4,172 rows from the training data were used to train the model.
The asterisk in the image below indicates that BAD=1 was the level modeled for this logistic regression.
A partial list of parameters from the fitted (or trained) logistic regression model is shown below.
Now let’s take the model we created (the store named lr_model) and use the logisticScore action to apply it to the validation data (_PartInd_ = 0). A new data set called lr_scored is created to hold the scored data.
lr_score_obj = conn.regression.logisticScore(
    table = dict(name = 'HomeEquity', where = '_PartInd_ = 0'),
    restore = 'lr_model',
    casout = dict(name = 'lr_scored', replace = True),
    copyVars = 'BAD',
    pred = 'P_BAD'
)
Look at the first 5 rows of the scored data to see the predicted probability for each row alongside the actual value of BAD.
conn.CASTable('lr_scored').head()
To assess the performance of a logistic regression model, we can use metrics such as a confusion matrix, misclassification rates, and a ROC (Receiver Operating Characteristic) plot. These measures can help us determine how well our logistic regression model fits the data and make any necessary adjustments to improve its accuracy.
To calculate these metrics, we will use the percentile action set and the assess action. Load the percentile action set.
conn.loadActionSet('percentile')
conn.builtins.help(actionSet='percentile')
Assess the logistic regression model using the scored data (lr_scored). Two data sets are created, named lr_assess and lr_assess_ROC.
conn.percentile.assess(
    table = 'lr_scored',
    inputs = 'P_BAD',
    casout = dict(name = 'lr_assess', replace = True),
    response = 'BAD',
    event = '1'
)
Look at the first 5 rows of data from the assess action output data set lr_assess. Here you see the values at each depth of the data, starting at 5% and incrementing by 5%.
display(conn.table.fetch(table='lr_assess', to=5))
Look at the first 5 rows of data from the assess action output data set lr_assess_ROC. This data is organized by cutoff value, starting at 0.00 and going to 1.00 in increments of 0.01.
conn.table.fetch(table='lr_assess_ROC', to=5)
In logistic regression, we use a cutoff value to make decisions, kind of like drawing a line in the sand.
In this case, choosing a cutoff value of 0.03 in the table below means we use a predicted probability of 0.03 as the threshold for predicting that someone will be delinquent on their loan. With a cutoff of 0.03, the validation data yields 356 predicted true positives and 72 true negatives. The default cutoff value is 0.5, or 50%.
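The cutoff rule itself is simple to state in code. Here is a hypothetical client-side sketch (the probabilities are made up for illustration; this is not the CAS scoring step):

```python
def classify(probabilities, cutoff):
    """Label each predicted probability 1 (delinquent) if it meets the cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]

p_bad = [0.02, 0.10, 0.45, 0.80]   # hypothetical P_BAD values
print(classify(p_bad, 0.5))        # [0, 0, 0, 1]
print(classify(p_bad, 0.03))       # [0, 1, 1, 1] -- a low cutoff flags many more loans
```

Lowering the cutoff catches more true positives at the cost of more false positives, which is exactly the trade-off the ROC analysis below quantifies.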
Let's now bring the results to the client by creating local data frames to calculate a confusion matrix, misclassification rate, and ROC plot for our logistic regression model.
lr_assess = conn.CASTable(name = 'lr_assess').to_frame()
lr_assess_ROC = conn.CASTable(name = 'lr_assess_ROC').to_frame()
Create a confusion matrix, which compares predicted values to actual values. It breaks down these predictions into four categories: true positives, true negatives, false positives, and false negatives. A true positive happens when the actual value is 1 and our model predicts a 1. A false positive is when the actual value is 0 and our model predicts a 1. A true negative is when the actual value is 0 and our model predicts a 0. A false negative is when the actual value is a 1 and our model predicts a 0.
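These four counts can be tallied in a few lines of plain Python. This is an illustrative sketch with made-up labels, not the CAS assess output (the column names mimic the `_TP_`, `_FP_`, `_FN_`, `_TN_` columns used below):

```python
def confusion_matrix(actual, predicted):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return {'_TP_': tp, '_FP_': fp, '_FN_': fn, '_TN_': tn}

actual    = [1, 0, 1, 0, 0, 1]   # hypothetical BAD values
predicted = [1, 0, 0, 0, 1, 1]   # hypothetical model predictions
print(confusion_matrix(actual, predicted))
# {'_TP_': 2, '_FP_': 1, '_FN_': 1, '_TN_': 2}
```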
These measures help us evaluate the performance of the model and assess its accuracy in predicting the outcome of interest.
Use a cutoff value of 0.5, which means if the predicted probability is greater than or equal to 0.5 then our model predicts someone will be delinquent on their home equity loan. If the predicted value is less than 0.5 then our model predicts someone will not be delinquent on their home equity loan.
# create confusion matrix
cutoff_index = round(lr_assess_ROC['_Cutoff_'], 2) == 0.5
conf_mat = lr_assess_ROC[cutoff_index].reset_index(drop=True)
conf_mat[['_TP_', '_FP_', '_FN_', '_TN_']]
We can also calculate a misclassification rate, which indicates how often the model makes incorrect predictions.
# calculate misclassification rate
conf_mat['Misclassification'] = 1 - conf_mat['_ACC_']
miss = conf_mat[round(conf_mat['_Cutoff_'], 2) == 0.5][['Misclassification']]
miss
Our misclassification rate for the logistic regression model is 0.157718, or about 16%. This means our model is wrong about 16% of the time and correct about 84% of the time.
Additionally, a ROC (Receiver Operating Characteristic) plot visually displays the trade-off between sensitivity and specificity, helping us evaluate the overall performance of the model. The closer the curve is to the top left corner, the higher the overall performance of the model.
Use the Python graphic package matplotlib to plot our ROC Curve.
# plot ROC Curve
from matplotlib import pyplot as plt

plt.figure(figsize=(8, 8))
plt.plot(lr_assess_ROC['_FPR_'], lr_assess_ROC['_Sensitivity_'],
         label=' (C=%0.2f)' % lr_assess_ROC['_C_'].mean())
plt.xlabel('False Positive Rate', fontsize=15)
plt.ylabel('True Positive Rate', fontsize=15)
plt.legend(loc='lower right', fontsize=15)
plt.show()
Our curve is somewhat close to the top left corner which indicates a good fit, but also shows room for improvement.
The C=0.8 represents the Area Under the Curve (AUC) statistic, which means our model is doing better than a random classifier (0.5) but not as well as a perfect model (1.0).
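The AUC itself is just the area under the (FPR, TPR) curve, which can be approximated with the trapezoidal rule. A small illustrative sketch (not the CAS computation):

```python
def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule.
    Points must be sorted by increasing false positive rate."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

# A perfect classifier jumps straight to TPR = 1:
print(auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # 1.0
# A random classifier follows the diagonal:
print(auc([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]))  # 0.5
```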
Logistic regression is a powerful tool in SAS Viya for predicting binary outcomes. By using the regression action set with the logistic action, we can build and assess various models to find the best fit for our data. We can also use the logisticScore action to score data, and the percentile action set's assess action to evaluate and compare models with criteria such as confusion matrices, misclassification rates, and ROC curves. This ultimately aids us in making accurate predictions and informed decisions based on our data.
In the next post, we will learn how to fit a decision tree to our Home Equity Data.
SAS Help Center: Load a SASHDAT File from a Caslib
SAS Help Center: loadTable Action
Getting Started with Python Integration to SAS® Viya® - Part 5 - Loading Server-Side Files into Memory
SAS Help Center: Regression Action Set
SAS Help Center: logistic Action
SAS Help Center: logisticScore Action
SAS Help Center: Percentile Action Set
SAS Help Center: assess Action
Getting Started with Python Integration to SAS Viya for Predictive Modeling - Fitting a Logistic Regression was published on SAS Users.
]]>Using the BXOR Function to Encode and Decode Text was published on SAS Users.
]]>Most everyone is familiar with the Boolean OR operator. It returns a TRUE value if either argument is true. It also returns a TRUE value when both arguments are true. An exclusive OR is like the OR operator, except it returns a value of FALSE if both arguments are TRUE. To summarize:
Why is this useful in encoding and decoding text? It turns out that if you perform an exclusive OR to encode text with a key, if you then perform another exclusive OR with the coded text and the key, you get back the original text. The following SAS program demonstrates this:
*Program to demonstrate how the exclusive OR can be used to encode and decode text;
data Cipher;
   Text = rank('A');
   Key = rank('B');
   Code = bxor(Text, Key);
   Decode = bxor(Code, Key);
run;

Title "Listing of Data Set Cipher";
proc print data=Cipher noobs;
   format Text Key Code Decode Binary8.;
run;
The RANK function returns the ASCII value of its argument. In this example, it returns the value 65 when the argument is an ‘A’ (01000001 in binary) and a value of 66 (01000010 in binary) when the argument is a ‘B’. The BXOR function takes two arguments and performs an exclusive OR.
Here is the output:
Notice that the value of Decode is the same as Text.
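The same round trip is easy to verify in Python, where the `^` operator plays the role of SAS's BXOR function. This is just an illustration of the XOR property, not a translation of the SAS program:

```python
# Encode: XOR each character code with the key; decode: XOR again with the same key.
key = ord('B')                       # 66
text = ord('A')                      # 65
code = text ^ key                    # Python's ^ operator, like SAS BXOR
decode = code ^ key
print(decode == text)                # True: XOR with the same key twice restores the text

# The same property round-trips a whole message:
message = "SECRET"
cipher = [ord(c) ^ key for c in message]
plain = ''.join(chr(n ^ key) for n in cipher)
print(plain)                         # SECRET
```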
If you are interested, here are two SAS macros that you can use to encode and decode data:
%macro encode(Dsn=,       /* Name of the SAS data set to hold the encrypted message */
              File_name=, /* The name of the raw data file that holds the plain text */
              Key=        /* A number of your choice which will be the seed for the
                             random number generator. A large number is preferable */
              );
   %let len = 150;
   data &dsn;
      array l[&len] $ 1 _temporary_;  /* each element holds a character of plain text */
      array num[&len] _temporary_;    /* a numerical equivalent for each letter */
      array xor[&len];                /* the coded value of each letter */
      retain key &key;
      infile "&file_name" pad;
      input string $char&len..;
      do i = 1 to dim(l);
         l[i] = substr(string,i,1);
         num[i] = rank(l[i]);
         xor[i] = bxor(num[i],ranuni(key));
      end;
      keep xor1-xor&len;
   run;
%mend encode;

%macro decode(Dsn=, /* Name of the SAS data set that holds the encrypted message */
              Key=  /* A number that must match the key of the enciphered message */
              );
   %let Len = 150;
   data decode;
      array l[&Len] $ 1 _temporary_;
      array num[&Len] _temporary_;
      array xor[&Len];
      retain Key &Key;
      length String $ &Len;
      set &Dsn;
      do i = 1 to dim(l);
         num[i] = bxor(xor[i],ranuni(Key));
         l[i] = byte(num[i]);
         substr(String,i,1) = l[i];
      end;
      drop i;
   run;

   title "Decoding Output";
   proc print data=decode noobs;
      var String;
   run;
%mend decode;
Here is an example of how to call these two macros:
%encode (Dsn=code,
         File_name=c:\books\functions\plaintext.txt,
         Key=17614353)
%decode (Dsn=code, Key=17614353)
Many of you have read one or more of my SAS and/or statistics books, but did you know I also write fiction? I recently published a novel, The Enigma Terrorists, on Amazon.
The story centers around a college professor who is asked to help the NSA break a code to stop terrorists from blowing up nuclear reactors in France. Of course, the coding programs are written in SAS!
CHECK IT OUT | RON CODY'S AMAZON AUTHOR PAGE
Using the BXOR Function to Encode and Decode Text was published on SAS Users.
]]>Getting Started with Python Integration to SAS Viya for Predictive Modeling - Fitting a Linear Regression was published on SAS Users.
]]>In part 4 of this series, we created our modeling dataset by including a column to identify the rows to be used for training and validating our model. Here, we will create our first model using this data.
In this example, we will fit a linear model on the training data using the regression action set with the glm action. Then, we will take our model and score the validation data using the glmScore action.
Linear regression is a statistical method used to identify and analyze the relationship between one or more independent variables (x or features) and a dependent variable (y, target, or label). It is often used in data analysis to predict numeric values based on historical trends and patterns, making it a valuable tool for decision-making.
By using linear regression, we can better understand the impact of different factors on our desired outcome and make more informed decisions. Additionally, it allows us to quantify the strength and direction of relationships between variables, providing valuable insights into our data. So, in summary, linear regression is an essential tool for understanding and making sense of complex data sets.
The regression action set in SAS Viya is a collection of procedures and functions designed to perform various types of regression analysis. It includes actions such as linear regression, logistic regression, general linear models (GLM), and more. These actions allow users to build statistical models that predict outcomes or understand the relationships between variables.
Additionally, the action set provides various options for model selection, diagnostics, and scoring to help users interpret and validate their results. Overall, the regression action set in SAS Viya offers a comprehensive set of tools for conducting regression analysis efficiently and effectively.
GLM stands for general linear models, and it is a type of regression analysis that fits linear regression models using the method of least squares.
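As a plain-Python illustration of the least-squares idea behind GLM (one predictor only, not the CAS glm action itself), the closed-form fit for a simple linear regression looks like this:

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x (simple linear regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of x and y divided by variance of x
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
         / sum((xi - mean_x) ** 2 for xi in x)
    # Intercept: the fitted line passes through the point of means
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Data lying exactly on y = 3 + 2x is recovered exactly:
b0, b1 = fit_line([0, 1, 2, 3], [3, 5, 7, 9])
print(round(b0, 6), round(b1, 6))  # 3.0 2.0
```

The CAS glm action solves the same minimization, generalized to many effects and classification variables at once.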
Let’s start by loading the data we saved in part 4 into CAS memory. I will load the sashdat file for my example; the CSV and parquet files can be loaded using similar syntax.
conn.loadTable(path="homeequity_final.sashdat", caslib="casuser",
               casout={'name':'HomeEquity',
                       'caslib':'casuser',
                       'replace':True})
The home equity data is now loaded and ready for modeling.
Before we can fit a linear regression model, we need to load the regression action set.
conn.loadActionSet('regression')
The regression action set consists of several actions; let’s display them to see what is available.
conn.help(actionSet='regression')
The actions include modeling algorithms like glm, genmod, and logistic as well as corresponding actions where we can score data using the models created.
Fit a linear regression model using the glm action on the HomeEquity training data set (i.e., where _PartInd_=1). Save the model to a file named reg_model.
conn.regression.glm(
    table = dict(name = 'HomeEquity', where = '_PartInd_ = 1'),
    classVars = ['IMP_REASON', 'IMP_JOB', 'REGION', 'BAD'],
    model = {'depVar': 'LOAN',
             'effects': ['IMP_REASON', 'IMP_JOB', 'REGION', 'BAD',
                         'IMP_CLAGE', 'IMP_CLNO', 'IMP_DEBTINC', 'IMP_DELINQ',
                         'IMP_DEROG', 'IMP_MORTDUE', 'IMP_NINQ', 'IMP_VALUE',
                         'IMP_YOJ']},
    store = dict(name = 'reg_model', replace = True)
)
The output includes information about the model, the number of observations used, classification variables, ANOVA, fit statistics, and parameter estimates.
The HomeEquity data we loaded earlier into memory was used and the target variable or Y is LOAN, which is the amount of the home equity loan. The 4,172 rows from the training data were used to train the model.
The classification variables IMP_REASON and BAD each have 2 levels (or values), REGION has 4, and IMP_JOB has 6. The model has 14 effects (or inputs), and 24 parameters were estimated.
The Analysis of Variance (ANOVA) table shows overall statistics about the linear model created.
The fit statistics give us an overall understanding on how well the model fits the data and allows us to compare this model to other models to determine which is better at fitting the data.
This is a partial list of parameter estimates for our model.
Now take the model we created (the store named reg_model) and use the glmScore action to apply it to the validation data (_PartInd_ = 0). A new data set called regscoredata is created to store the scored data.
conn.regression.glmScore(
    table = dict(name = 'HomeEquity', where = '_PartInd_ = 0'),
    restore = 'reg_model',
    casOut = dict(name = 'regscoredata', replace = True),
    fitData = True,
    copyVars = 'All',
    pred = 'pred',
    resid = 'resid',
    rstudent = 'rstudent'
)
This creates a new table with the scored data that includes the 1,788 rows from the validation data.
Print out the first five rows of the scored data to see the predicted value of LOAN for each row; the variable PRED holds this value. When we scored the data, we also calculated the residual and rstudent values as columns in the data.
regscoredata = conn.CASTable("regscoredata", caslib="casuser")
regscoredata.head()
To determine the most important variables and simplify our model, use the selection option to do a forward selection of the variables.
conn.regression.glm(
    table = dict(name = 'HomeEquity', where = '_PartInd_ = 1'),
    classVars = ['IMP_REASON', 'IMP_JOB', 'REGION', 'BAD'],
    model = {'depVar': 'LOAN',
             'effects': ['IMP_REASON', 'IMP_JOB', 'REGION', 'BAD',
                         'IMP_CLAGE', 'IMP_CLNO', 'IMP_DEBTINC', 'IMP_DELINQ',
                         'IMP_DEROG', 'IMP_MORTDUE', 'IMP_NINQ', 'IMP_VALUE',
                         'IMP_YOJ']},
    selection = {'method': 'forward'}
)
The forward selection method in linear regression is a step-by-step process for selecting the most relevant and significant variables to build the best possible model. It starts with an empty model and adds variables one at a time until none of the remaining variables meet the entry criterion.
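The forward-selection loop itself can be sketched in a few lines of Python. The `score` function below is a toy stand-in for a real criterion such as SBC, and the "useful" variable names are purely illustrative, not the actual selection result:

```python
def forward_select(candidates, score):
    """Greedy forward selection: repeatedly add the candidate that most
    improves score(selected); stop when no addition improves it."""
    selected = []
    current = score(selected)
    remaining = list(candidates)
    while remaining:
        best_score, best_var = max((score(selected + [v]), v) for v in remaining)
        if best_score <= current:   # no remaining variable helps; stop
            break
        selected.append(best_var)
        remaining.remove(best_var)
        current = best_score
    return selected

# Toy scoring function: each "useful" variable raises model fit,
# and every variable carries a small complexity penalty (SBC-like).
useful = {'IMP_DEBTINC', 'IMP_DELINQ'}
def sbc_like(vars_in):
    return sum(1.0 for v in vars_in if v in useful) - 0.1 * len(vars_in)

picked = forward_select(['BAD', 'IMP_DEBTINC', 'IMP_NINQ', 'IMP_DELINQ'], sbc_like)
print(sorted(picked))  # ['IMP_DEBTINC', 'IMP_DELINQ']
```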
Below are the default selection criteria.
For this model 8 of our 14 variables were added to the model. The model selected is indicated by the 1 in the OptSBC column.
The analysis of variance table (ANOVA) indicates the overall statistics for the model.
The model fit statistics for the reduced model indicate the new smaller model is very similar in fit and accuracy as the full model.
Here is a complete listing of the included variables with the parameter estimates.
The GLM action in the regression action set for SAS Viya provides a powerful tool for performing linear regression. By carefully selecting and adding variables to our model, we can create accurate representations of real-world relationships and make informed decisions based on data analysis. Building a reliable model takes time and effort, but the results are worth it.
With the GLM action and the forward selection method, we can continue to improve our understanding and predictive ability, making linear regression in SAS Viya a valuable tool for any data scientist or analyst.
In the next post, we will learn how to fit a logistic regression.
SAS Help Center: Load a SASHDAT File from a Caslib
SAS Help Center: loadTable Action
Getting Started with Python Integration to SAS® Viya® - Part 5 - Loading Server-Side Files into Memory
SAS Help Center: Regression Action Set
SAS Help Center: glm Action
SAS Help Center: glmScore Action
Getting Started with Python Integration to SAS Viya for Predictive Modeling - Fitting a Linear Regression was published on SAS Users.
]]>New SAS Training Course: Statistics You Need to Know for Machine Learning was published on SAS Users.
]]>Statistics is a core component of data analytics and machine learning. Despite the "bigness" of the data, statistics still has a lot of application. The role of statistics remains what it has always been and is even more important now. Perhaps the core statistical task in (traditional) statistics is inductive inference from data to models and scientific conclusions. This core task is still very relevant in the advent of massive data sets.
Replicability, stability, heterogeneity, causality, and uncertainty are the five basic principles of statistics, and they all hold equally well with big data.
Ideally, in a big data scenario too, the conclusions and findings are replicable and generalizable. If you imagine running the analysis again on a new data set, would the outcome be similar, meaning the model is stable? How would you define similarity of outcomes, evaluate accuracy, and quantify uncertainty? Understanding heterogeneity in large-scale data sets is even more crucial, and comprehending causality and its connection to robust prediction remains interesting.
Are you interested in machine learning and want to grow your career in it? The key to machine learning is using the right data preprocessing techniques, understanding the algorithm, cutting through the equations and Greek letters, and making sense out of complex results.
Developing an accurate understanding of statistics will help you build robust machine learning models that are optimized for a given business problem. SAS launched a new course that provides a comprehensive overview of the fundamentals of statistics that you'll need to start your data science journey. This course is also a prerequisite to many courses in the SAS data science curriculum.
In this course, you learn how to:
It also gives you the opportunity for hands-on practice using SAS Studio tasks to perform your data analysis. This course is available in three formats: face-to-face classroom training, remotely connected live web, and self-paced e-learning.
New SAS Training Course: Statistics You Need to Know for Machine Learning was published on SAS Users.
]]>Announcing the 2024 SAS Customer Recognition Awards was published on SAS Users.
]]>First-place winners will be announced at SAS Innovate 2024 in Las Vegas April 16-19, 2024. Each winner will receive one trip to the event*, a SAS trophy and SAS swag. Winners will create a short video to describe their winning submission for a compilation video to show during the conference.
*Please visit the SAS Customer Recognition Awards site for the program details, rules and to submit an entry.
Peruse the roster of previous award seekers and 2023 winners. Think your project's a winner? Keep reading, then complete an entry!
The SAS Customer Recognition Awards live on the SAS Support Community which allows customers to submit photos, documents and videos with their entry and allows us to do online voting. Winners will be chosen based on a combination of public votes and scores from a judging panel.
Submit Entry Dates: December 4, 2023 – January 19, 2024
Voting Dates: January 22, 2024 – February 2, 2024
Panel Judging Dates: January 22, 2024 – February 2, 2024
Winners Notified: Week of February 12, 2024
First-Place Winners Announced: April 16-19, 2024 at SAS Innovate (complete winner list posted in the SAS Support Community)
Winners: First place, second place, third place (pending quantity of entries)
(Votes by customers combined with scores from a judging panel determine winners.)
Community Uplift Award: Awarded to a SAS customer who has made an impact in their community at large using SAS products.
Curious Thinker: Awarded to a SAS customer that has an inspiring career story to share and highlights how SAS played a part in that journey.
Innovative Problem Solver: Awarded to a SAS customer who uses SAS in innovative ways to solve a business problem.
ROI Rock Star: Awarded to a SAS customer demonstrating the greatest business benefit and return on investment (ROI) using SAS products.
SAS Analytics Explorers Advocate: Awarded to a SAS customer who is leveraging the SAS Analytics Explorers program to grow their skills, their career and/or their network.
(One winner selected for each category.)
Customer Impact Award – Public Sector: Awarded to a public sector customer who has had the most impact through a willingness to share their analytics journey, successes and lessons learned with others.
Customer Impact Award – Private Sector: Awarded to a private sector customer who has had the most impact through a willingness to share their analytics journey, successes and lessons learned with others.
SAS Support Communities Hero: A customer that answers many questions in Communities and goes above and beyond helping other users on the SAS Support Communities.
SAS Users Group MVP: Awarded to a SAS Users Group leader demonstrating a dedicated passion for the success of users group members.
User Feedback Award: Awarded to a customer who provides valuable feedback on SAS products and has been an essential reference for product improvements.
Please reach out to Juan Cabanillas with any questions.
SUBMIT YOUR ENTRY | 2024 SAS Customer Recognition Awards
Announcing the 2024 SAS Customer Recognition Awards was published on SAS Users.
]]>Getting Started with Python Integration to SAS Viya for Predictive Modeling - Creating Training and Validation Data Sets was published on SAS Users.
]]>In part 3 of this series, we replaced the missing values with imputed values. Our final step in preparing the data for modeling is to split the data into a training and validation data set. This is important so that we have a set of data to train the model and a holdout or separate set of data to validate the model and make sure it performs well with new data.
For this step we will use the sampling action set and will show examples using both the srs (simple random sampling) and stratified actions.
When it comes to data modeling, having both training and validation datasets is essential for ensuring the accuracy and reliability of our models. The training dataset is used for training the model on a portion of the data, while the validation dataset is used to assess the performance of the trained model on a holdout sample of the data. This process of splitting data into two separate sets helps prevent overfitting and helps us understand how well the model generalizes to new data.
Having a validation dataset allows us to evaluate our models objectively and adjust if necessary. It also helps us avoid the mistake of assuming that a model that performs well on the training data will also perform well on a holdout sample of the data. Without proper validation, we may end up with a model that is biased towards the training data and performs poorly on new data.
Furthermore, having a separate validation dataset allows us to fine-tune our model and select the best performing one before deploying it in a real-world scenario. This can save time, resources, and potential consequences of using an inadequate model. In summary, incorporating both training and validation datasets into our modeling process is crucial for enhancing accuracy, reliability, and generalization of our models.
The sampling action set allows us to sample data in SAS Viya. It consists of four actions: srs, stratified, oversampling, and k-fold. Let's look at examples for srs and stratified in this post.
The sampling actions in our example will create an output table that includes a variable identifying the partition (or sample) each row belongs to. The partition variable is named _PartInd_.
The training data is identified by _PartInd_ = 1 and the validation data by _PartInd_ = 0.
Simple random sampling (SRS) is like picking names out of a hat. Just like how you give everyone an equal chance to be picked when playing a game, SRS gives every item in a group an equal chance of being selected for the sample. This helps us get a fair representation of the whole group without leaving anyone out.
Stratified sampling is like sorting candy into different groups like chocolate vs non-chocolate before picking a few from each group to eat. Just like how we might want to try all the different types of candy, stratified sampling allows us to get a fair representation of each group in a population by selecting some from each group for our sample. This helps us get a more accurate understanding of the whole population.
For the Home Equity data, we want training data that includes 70% of the rows, with the remaining 30% used for validation. So we want _PartInd_ randomly assigned the value 1 for 70% of the rows and the value 0 for the remaining 30%.
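Conceptually, the 70/30 split just assigns a random 1/0 flag to each row. As a client-side illustration only (in this post the actual split is done inside CAS by the srs action), a plain-Python sketch:

```python
import random

def srs_partition(n_rows, samppct, seed):
    """Randomly flag samppct% of rows as training (1) and the rest as validation (0)."""
    rng = random.Random(seed)
    n_train = round(n_rows * samppct / 100)
    part_ind = [1] * n_train + [0] * (n_rows - n_train)
    rng.shuffle(part_ind)   # every row has an equal chance of landing in training
    return part_ind

part = srs_partition(5960, 70, seed=919)
print(sum(part))               # 4172 rows flagged for training
print(len(part) - sum(part))   # 1788 rows left for validation
```

The counts match the Home Equity data: 70% of 5,960 rows is 4,172 training rows.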
Using the srs action in the sampling action set, our code looks like this. The samppct parameter specifies that 70 percent of the data should be used for training, and partind = True requests the _PartInd_ indicator variable that identifies which rows belong to the training versus validation data. Finally, seed is a number we select (it can be any integer value) so that we can replicate the results.
conn.loadActionSet('sampling')

conn.sampling.srs(
    table = 'HomeEquity2',
    samppct = 70,
    seed = 919,
    partind = True,
    output = dict(casOut = dict(name = 'HomeEquity3', replace = True),
                  copyVars = 'ALL')
)
For our data, the srs action has taken the original 5,960 rows and assigned 4,172 to the training data, so for those 4,172 rows _PartInd_ = 1.
Let’s look at the count in each partition for training and validation using the freq action in the simple action set.
conn.simple.freq(
    table = 'HomeEquity3',
    inputs = ["BAD", "_PartInd_"]
)
Here we can see that the _PartInd_ indicator variable has the correct assignment: 4,172 rows in the training partition.
Does this sample fairly represent the number of people who went delinquent on their home equity loan in both the training and validation data? Let's use the freqTab action to look at a crosstab of delinquency (BAD) by _PartInd_ and compare the percentages between the groups.
conn.freqTab(
    table = 'HomeEquity3',
    tabulate = ['BAD', '_PartInd_',
                {'vars': {'BAD', '_PartInd_'}}]
)
From the highlighted column percentages in the output table we see that the percentages are close but not exact. To have BAD equally represented in the training and validation data we would need to use stratified sampling instead of SRS.
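To see why stratification keeps the class proportions aligned, here is a plain pandas sketch using simulated data. This is a local analogy for the idea, not the CAS stratified action, and the 20% delinquency rate is an assumption for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(919)
# Simulated binary target: roughly 20% delinquent, like BAD
df = pd.DataFrame({'BAD': rng.choice([0, 1], size=5960, p=[0.8, 0.2])})

# Simple random 70% sample: the BAD rate can drift from the population rate
srs_train = df.sample(frac=0.70, random_state=919)

# Stratified 70% sample: sample within each BAD group separately
strat_train = df.groupby('BAD', group_keys=False).sample(frac=0.70,
                                                         random_state=919)

# The stratified sample's BAD rate matches the population almost exactly
print(abs(df['BAD'].mean() - strat_train['BAD'].mean()) < 0.001)  # True
```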
Using the stratified action in the sampling action set, we also need the groupby option on the input table to specify BAD as our stratification variable.
conn.sampling.stratified(
    table = {'name': 'HomeEquity2', 'groupby': 'bad'},
    samppct = 70,
    seed = 919,
    partind = True,
    output = dict(casOut = dict(name = 'HomeEquity4', replace = True),
                  copyVars = 'ALL')
)
In our output for the stratified sample, we still have a 70% sample or 4,172 rows assigned to our training data but now the proportion of BAD is better represented.
As you can see in the freqTab output below (highlighted in red), the percentages are now almost identical.
We are happy with our data and are now ready to create some models. Let’s save the data so we have all our updates available for our modelers.
We will not use all the columns in the data, so to be efficient for storage and modeling we will specify the columns to keep. For example, we will be using the imputed columns and will not need the original columns for our models.
For the next step, determine the columns to include in the modeling dataset. Use the alterTable action in the table action set to keep only the modeling columns and to set their order in the data.
Then verify the desired outcome by printing the first 5 rows.
## Reference the new CAS table
finalTbl = conn.CASTable('HomeEquity4', caslib = 'casuser')

## Specify the columns to keep and their order
newColumnOrder = ['BAD', 'LOAN', 'REGION', 'IMP_CLAGE', 'IMP_CLNO',
                  'IMP_DEBTINC', 'IMP_DELINQ', 'IMP_DEROG', 'IMP_MORTDUE',
                  'IMP_NINQ', 'IMP_VALUE', 'IMP_YOJ', 'IMP_JOB',
                  'IMP_REASON', '_PartInd_']

finalTbl.alterTable(keep = newColumnOrder)
finalTbl.alterTable(columnOrder = newColumnOrder)

## Preview the CAS table
finalTbl.head()
The first 5 rows of the data look great!
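The keep-and-reorder step has a familiar pandas analogy: selecting columns by a list keeps and orders them in one step. This is a local sketch with placeholder columns and values, not the CAS alterTable action:

```python
import pandas as pd

# Placeholder rows in an arbitrary column order
df = pd.DataFrame({'IMP_VALUE': [101000.0], 'BAD': [0], 'LOAN': [1100]})

new_order = ['BAD', 'LOAN', 'IMP_VALUE']
df = df[new_order]  # keep only these columns, in this order
print(list(df.columns))  # ['BAD', 'LOAN', 'IMP_VALUE']
```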
Our data table is now ready for us to create models. Let’s save it to a file using the table.save CAS action so we can load it into memory later for our modeling activities.
You can decide if you want to save the data as a .sashdat file
finalTbl.save(name = 'homeequity_final.sashdat', caslib = 'casuser')
or a CSV file
finalTbl.save(name = 'homeequity_final.csv', caslib = 'casuser')
or a parquet file.
finalTbl.save(name = 'homeequity_final.parquet', caslib = 'casuser')
For more information about saving files, check out the Getting Started with Python Integration to SAS® Viya® - Part 17 - Saving CAS tables blog post.
We have learned the importance of preparing our data before beginning the modeling process. Through the first four parts of this series by exploring and understanding our data, handling missing values, and creating training and validation datasets, we have set ourselves up for success in our modeling endeavors. By following these steps, we can confidently move on to the next part of this series and begin the exciting task of building models to gain insights and make predictions.
In the next post, we will learn how to fit a Linear Model using the data we prepared and saved.
Getting Started with Python Integration to SAS Viya for Predictive Modeling - Creating Training and Validation Data Sets was published on SAS Users.