The post Maximize product quality with Optimization and Machine Learning models appeared first on The SAS Data Science Blog.
To illustrate with a simplified example, I will describe a couple of relevant metrics and settings from automotive airbag production. The process owners need to decide how much sodium azide and oxidizer to use as propellant (among many other manufacturing settings) in order to produce the required amount of gas at a given rate, ensuring proper airbag inflation. Quality metrics (such as airbag inflation) typically have an associated tolerance, allowing upper and lower bounds to be derived. The goal is to find the right combination of manufacturing settings (sodium azide and oxidizer) that minimizes costs (or maximizes yield) while keeping the key quality metric (gas production rate) within the required bounds.
To address this problem, we need to understand how the settings affect the key metric. Traditionally this relationship has been explained with linear regression models, partly because they fit naturally within linear optimization formulations. Now the industry is exploring more sophisticated models (in an attempt to increase accuracy), such as neural networks or gradient boosting models, which in turn require pushing some boundaries in optimization formulations and solution methodologies.
Incorporating non-closed-form, nonlinear models (such as neural networks or gradient boosting) into an optimization problem means traditional exact algorithms (such as branch-and-bound or simplex) no longer apply. Fortunately, SAS offers the capability to solve this nonlinear optimization model with cutting-edge solvers such as the black-box solver, while still using OPTMODEL, the modeling language beloved by OR practitioners.
In this post, I will walk you through the right coding syntax to formulate and solve a nonlinear optimization problem, where constraint and objective function equations are non-closed-form Machine Learning models. The following SAS functionalities will be used:
Please note I will not be discussing optimization convergence or ML model accuracy in this blog. Instead, I will keep it focused on the code syntax to help SAS users explore the incorporation of ML models as constraints or objectives in an optimization formulation.
To illustrate the syntax, I have set up an oversimplified example with two products, four manufacturing settings (one of them binary, making this a mixed integer nonlinear optimization problem), one quality metric (kpi) with a lower bound, and an overall yield to be maximized. Both the kpi and the yield are explained by the manufacturing settings as regressors through gradient boosting models.
Decision Variables:
\(\textrm{Setting}_j\): Value of the manufacturing setting \(\mathit{j}\), where \(j\in\{1,\dots,4\}\)
Constraints:
\(f_i(\textrm{Setting}_1, ... ,\textrm{Setting}_4)\geq 100\) for \(i \in \{1,2\}\)
Constraint for each product \(i\) that sets a lower bound of \(100\) for \(f()\), where \(f()\) is the non-closed-form ML model depending on the value of \(\textrm{Setting}_1,\dots,\textrm{Setting}_4\)
Objective Function:
maximize \(g(\textrm{Setting}_1,\dots,\textrm{Setting}_4)\)
Maximizes the non-closed-form ML model \(g()\) that depends on the value of \(\textrm{Setting}_1,\dots,\textrm{Setting}_4\)
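Before turning to the SAS syntax, the shape of this problem can be sketched in a few lines of generic code. This is an illustrative stand-in only: `g_yield` and `f_kpi` are toy functions playing the role of the trained ML models \(g()\) and \(f()\), and a crude random search stands in for a real black-box solver.

```python
# Toy stand-ins for the trained ML models (NOT the gradboost astores):
# g_yield plays the role of g(), f_kpi the role of f().
import random

def g_yield(s):
    # yield peaks at settings (3, 7) in this made-up surrogate
    return 100 - (s[0] - 3) ** 2 - (s[1] - 7) ** 2

def f_kpi(s):
    # quality metric that must stay at or above 100
    return 50 + 10 * s[0] + 5 * s[1]

def solve(n_iter=5000, seed=0):
    # crude black-box search: sample settings, keep the best feasible one
    rng = random.Random(seed)
    best, best_val = None, float("-inf")
    for _ in range(n_iter):
        s = [rng.uniform(0, 10), rng.uniform(0, 10)]
        if f_kpi(s) >= 100 and g_yield(s) > best_val:
            best, best_val = s, g_yield(s)
    return best, best_val

best, best_val = solve()
```

A dedicated black-box solver explores the settings space far more intelligently than this, but the contract is the same: it only ever evaluates \(f()\) and \(g()\), never inspects their algebraic form.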
Set up a CAS session:
proc options option=(CASHOST CASPORT);
run;

cas mysess;
libname mycas cas sessref=mysess;
Generate mock data for demonstration purposes:
data mycas.prod_history;
   input item yield setting1-setting4 kpi;
   datalines;
101 100 1 4 3.5 6 140
101 180 0 2.3 6 10 120
102 69 1 1.3 5 12 163
102 79 1 1.6 10 23 203
;
Generate two gradient boosting models to predict yield and kpi based on the four settings. Note that setting1 is treated as a nominal variable.
proc gradboost data=mycas.prod_history;
   input setting2 setting3 setting4 / level=interval;
   input setting1 item / level=nominal;
   target yield / level=interval;
   savestate rstore=mycas.stored_gb_yield;
run;

proc gradboost data=mycas.prod_history;
   input setting2 setting3 setting4 / level=interval;
   input setting1 item / level=nominal;
   target kpi / level=interval;
   savestate rstore=mycas.stored_gb_kpi;
run;
Save the astores (analytic stores for the yield and kpi models) locally:
proc astore;
   download rstore=mycas.stored_gb_yield
      store="/r/sanyo.unx.sas.com/vol/vol920/u92/navikt/casuser/gp/stored_gb_yield";
   download rstore=mycas.stored_gb_kpi
      store="/r/sanyo.unx.sas.com/vol/vol920/u92/navikt/casuser/gp/stored_gb_kpi";
quit;
Create user-defined functions, calling the analytical store defined above:
proc fcmp outlib=work.score.funcs;
   function astore_yield(item, setting1, setting2, setting3, setting4);
      declare object myscore(astore);
      call myscore.score("/r/sanyo.unx.sas.com/vol/vol920/u92/navikt/casuser/gp/stored_gb_yield");
      return(P_yield);
   endsub;
   function astore_kpi(item, setting1, setting2, setting3, setting4);
      declare object myscore(astore);
      call myscore.score("/r/sanyo.unx.sas.com/vol/vol920/u92/navikt/casuser/gp/stored_gb_kpi");
      return(P_kpi);
   endsub;
run;
quit;
Point to the previously stored compiled functions:
options cmplib=work.score;
Define the decision variables and the implicit variables in OPTMODEL. The implicit variables Kpi and Yield, which typically are defined with closed-form equations, will now call the user-defined functions that include the analytical stores for the gradboost models.
proc optmodel;
   set ITEMS = {101,102};
   var Setting1 binary;
   var Setting2 >= 0;
   var Setting3 >= 0;
   var Setting4 >= 0;
   impvar Kpi{i in ITEMS} = astore_kpi(i, Setting1, Setting2, Setting3, Setting4);
   impvar Yield{i in ITEMS} = astore_yield(i, Setting1, Setting2, Setting3, Setting4);
Define the constraints:
con KPI_1_UB{i in ITEMS}: Kpi[i] >= 100;
Define the objective:
max TotalYield = sum{i in ITEMS} Yield[i];
Call the black-box solver:
solve with blackbox;
Create output data:
create data mycas.prod_out from setting1 setting2 setting3 setting4 TotalYield;
create data mycas.kpi_out from [ITEMS] Kpi;
quit;
Using this syntax, we obtain optimal values for the four settings that maximize yield:
| Setting1 | Setting2 | Setting3 | Setting4 | TotalYield |
|---|---|---|---|---|
| 0 | 92.43697479 | 88.235294118 | 82.352941176 | 228.3719873 |
while satisfying the requirement to keep the quality kpi above 100 for each product:
| ITEMS | Kpi |
|---|---|
| 101 | 156.35430437 |
| 102 | 156.35430437 |
There is an increasing need to incorporate non-closed-form models within optimization formulations. With the syntax described above, SAS provides an easy and intuitive way to combine state-of-the-art technology such as machine learning models and black-box optimization solvers. Enjoy!
LEARN MORE | SAS Visual Data Mining and Machine Learning
For additional information regarding Operations Research, be sure to visit our SAS Community and other Operations Research blog posts.
The post Accessing text analytics from a chatbot: Sentiment appeared first on The SAS Data Science Blog.
Here are some initial thoughts on likely customer objectives that involve text analytics accessed from a chatbot.
A high-level dialogue is provided. The scenario is a situation when, typically at the end of an interaction, the customer is asked to provide feedback regarding their experience. While the technical steps can be better explored in a different blog or provided as an example, here is the sequence of activities that ensue:
“Thanks for the interaction so far. If you would like to provide some feedback, please feel free to enter the same.”
Customers provide their feedback to the bot. Other, more complex cases need to be handled in production scenarios (such as when a customer chooses to provide feedback over a series of chats), so this is a simplification. The customer’s feedback in this case is captured through a Text Input node. A parameter within the Text Input node (call it textByUser) stores this.
As an optional step, the feedback provided by the customer is also stored in a variable for easy retrieval (Feedback is logged). A Modify Context Data Provider node is used for this purpose. In this case, the same variable textByUser was assigned as a string. This aids reuse of the variable in succeeding dialogues and nodes:
Feedback is analyzed – During this stage, the bot connects to Visual Text Analytics. The connection is made through a SAS Code Data Provider node, where a CAS (the analytical engine of SAS Viya) session is started and the feedback is analyzed for sentiment. The response from the SAS Code Data Provider node is then returned to the bot in JSON form. The following is the code used inside the SAS Code Data Provider node:
Logic (through a Logic node) is applied over the JSON to decide and execute an appropriate response. A positive sentiment (the If branch) may lead to the bot thanking the user and ending the conversation on a positive note. A negative sentiment (the Else branch) may, however, need empathy from the bot, and mitigation or remediation if possible. This can be accomplished through another flow (or redirecting to another dialogue). Other data variables surrounding the sentiment analysis' final output (e.g., in addition to sentiment, return the matches indicating negative sentiment) can also be provided.
The following is the Logic for the IF condition (the node labeled “If Sentiment is positive”). The JSON returned from the previous SAS Code Data Provider node is called sentOut.
$sentOut["SASTableData+OUT_SENT"][0]["_sentiment_"] == "Positive"
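Outside of Velocity, the same lookup is a plain JSON path. Here is a hypothetical Python equivalent: only the table key and column name come from the expression above, while the payload values are made up.

```python
# Hypothetical payload mirroring the JSON the SAS Code Data Provider
# node returns; only the key names come from the expression above.
sent_out = {
    "SASTableData+OUT_SENT": [
        {"_sentiment_": "Positive", "_probability_": 0.87}
    ]
}

def is_positive(payload):
    # same path as the Velocity condition: table -> first row -> _sentiment_
    return payload["SASTableData+OUT_SENT"][0]["_sentiment_"] == "Positive"
```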
When using the SAS Code Data Provider node, you need to maintain realistic expectations about speed and latency. Accounting for performance is important: remember that the chatbot makes a connection to another application, so you should allow some time for the handoff. The ‘art’ of designing a chatbot gains importance here, and it is recommended that you choose situations where the user understands that the process might take time. One tactic is to inform the user, “Hold on while our system analyzes this…” to passively engage them.
Another key development to look out for, would be to access sentiment analysis (and other text analytics models) through future APIs or web services. There are plans to make text analytics capabilities callable through APIs and provide output more directly and synchronously. Upon such improvements, a call to text analytics through the Web Service Data Provider node can be considered.
Designer skills also need to be considered. The choice of whether to use the SAS Code Data Provider node or the Web services Data Provider node depends on whether the chatbot designer happens to be a data scientist or analyst who is familiar with coding in the SAS language, as opposed to a designer who is comfortable with web services and APIs.
This post outlined a broad approach to accessing other analytical capabilities through SAS Conversation Designer, taking text analytics as an example, along with an overview of the steps required. In a future blog, we shall also look at the possibilities of calling search actions from a chatbot. Click here to learn more, and request a demo of the many other ways you can leverage a SAS Conversation Designer chatbot to obtain answers to questions, query data and reports, and access artificial intelligence through a conversational experience!
The post Automated linearization in SAS Optimization appeared first on The SAS Data Science Blog.
Sometimes a model that is not quite linear can be transformed to an equivalent linear model to reduce the overall computational time. Experienced modelers have learned a bag of tricks to accomplish such linearization by introducing new variables and constraints. A recently added feature automates some of these tricks by performing the necessary transformations on your behalf.
Since December 2020, in SAS Optimization in SAS Viya, OPTMODEL has included a LINEARIZE option to automate linearization for several common use cases:
In this blog post, I'll demonstrate the LINEARIZE option in the context of a maximum dispersion problem. For example, where should you build a fixed number of new stores from a set of candidate locations? Or where should you build new distribution centers to improve a regional supply chain?
In his Yet Another Mathematical Programming Consultant blog, Erwin Kalvelagen described the maximum dispersion problem as follows:
Given \(n\) points with their distances \(d_{i,j}\), select \(k\) points such that the sum of the distances between the selected points is maximized.
A natural way to model this problem is to introduce a binary decision variable \(s_i\) to indicate whether point \(i\) is selected and then maximize the quadratic function \(\sum_{1 \le i < j \le n} d_{i,j} s_i s_j\) subject to a linear constraint \(\sum_{i=1}^n s_i = k\). This nonconvex mixed integer quadratic programming (MIQP) problem is difficult to solve. Kalvelagen shows how it can be linearized and solved quickly with a MILP solver.
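The core trick behind that linearization is standard: each product \(s_i s_j\) of binary variables is replaced by a new variable \(z_{i,j}\) constrained by \(z_{i,j} \le s_i\), \(z_{i,j} \le s_j\), and \(z_{i,j} \ge s_i + s_j - 1\). A quick brute-force check (illustrative code, not part of the original post) confirms these three linear constraints pin \(z\) to the product for every binary combination:

```python
# Verify that the three linear constraints admit exactly z = si*sj
# for binary si, sj (the standard product-of-binaries linearization).
from itertools import product

def feasible_z(si, sj):
    return [z for z in (0, 1)
            if z <= si and z <= sj and z >= si + sj - 1]

for si, sj in product((0, 1), repeat=2):
    assert feasible_z(si, sj) == [si * sj]
```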
The following DATA step contains the input data for \(n=50\) locations, with coordinates \(x\) and \(y\):
/* store the input data in a data set */
data indata;
   input point $ x y;
   datalines;
i1 1.717 8.433
i2 5.504 3.011
i3 2.922 2.241
i4 3.498 8.563
i5 0.671 5.002
i6 9.981 5.787
i7 9.911 7.623
i8 1.307 6.397
i9 1.595 2.501
i10 6.689 4.354
i11 3.597 3.514
i12 1.315 1.501
i13 5.891 8.309
i14 2.308 6.657
i15 7.759 3.037
i16 1.105 5.024
i17 1.602 8.725
i18 2.651 2.858
i19 5.940 7.227
i20 6.282 4.638
i21 4.133 1.177
i22 3.142 0.466
i23 3.386 1.821
i24 6.457 5.607
i25 7.700 2.978
i26 6.611 7.558
i27 6.274 2.839
i28 0.864 1.025
i29 6.413 5.453
i30 0.315 7.924
i31 0.728 1.757
i32 5.256 7.502
i33 1.781 0.341
i34 5.851 6.212
i35 3.894 3.587
i36 2.430 2.464
i37 1.305 9.334
i38 3.799 7.834
i39 3.000 1.255
i40 7.489 0.692
i41 2.020 0.051
i42 2.696 4.999
i43 1.513 1.742
i44 3.306 3.169
i45 3.221 9.640
i46 9.936 3.699
i47 3.729 7.720
i48 3.967 9.131
i49 1.196 7.355
i50 0.554 5.763
;
You can use the following PROC SGPLOT statements to plot the locations:
/* plot the input data */
title 'Locations';
proc sgplot data=indata aspect=1;
   scatter x=x y=y;
run;
Now declare a SAS macro variable \(k\) to specify that we want to select 10 points.
/* specify the number of points to select */
%let k = 10;
The first several PROC OPTMODEL statements declare parameters and read the input data:
proc optmodel;
   /* declare parameters and read data */
   set POINTS;
   num x {POINTS};
   num y {POINTS};
   read data indata into POINTS=[point] x y;
   set PAIRS = {i in POINTS, j in POINTS: i < j};
   num d {<i,j> in PAIRS} = sqrt((x[i] - x[j])^2 + (y[i] - y[j])^2);
The next few statements declare the decision variables, objective, and constraint:
/* Select[i] = 1 if point i is selected; 0 otherwise */
var Select {POINTS} binary;

/* maximize the sum of distances between pairs of selected points */
max QuadraticObjective = sum {<i,j> in PAIRS} d[i,j] * Select[i] * Select[j];

/* select exactly &k points */
con Cardinality: sum {i in POINTS} Select[i] = &k;
As observed by Kalvelagen, the following optional constraint, obtained by multiplying both sides of the cardinality constraint by \(\text{Select}[j]\), reduces the solve time:
/* optional constraint that greatly reduces solve time */
con RLT {j in POINTS}: sum {i in POINTS} Select[i] * Select[j] = &k * Select[j];
You can optionally use the EXPAND statement with the LINEARIZE option to display the linearized model:
expand / linearize;
The log shows that \(\binom{50}{2}=1225\) variables (one for each pair of locations) and \(3\binom{50}{2}=3675\) constraints were added to the model:
NOTE: The problem has 50 variables (0 free, 0 fixed).
NOTE: The problem has 50 binary and 0 integer variables.
NOTE: The problem has 1 linear constraints (0 LE, 1 EQ, 0 GE, 0 range).
NOTE: The problem has 50 linear constraint coefficients.
NOTE: The problem has 50 nonlinear constraints (0 LE, 50 EQ, 0 GE, 0 range).
NOTE: The OPTMODEL presolver removed 0 variables, 0 linear constraints, and 0 nonlinear constraints.
NOTE: The OPTMODEL presolver replaced 50 nonlinear constraints, 1 objectives, and 0 implicit variables.
NOTE: The OPTMODEL presolver added 1225 variables and 3675 linear constraints.
NOTE: The OPTMODEL presolved problem has 1275 variables, 3726 linear constraints, and 0 nonlinear constraints.
NOTE: The OPTMODEL presolver added 11075 linear constraint coefficients, resulting in 11125.
Var _ADDED_VAR_[1] BINARY
...
Var _ADDED_VAR_[1225] BINARY
...
Constraint _ADDED_CON_[1]: _ADDED_VAR_[1] - Select[i1] <= 0
Constraint _ADDED_CON_[2]: _ADDED_VAR_[1] - Select[i50] <= 0
Constraint _ADDED_CON_[3]: - _ADDED_VAR_[1] + Select[i1] + Select[i50] >= 1
Constraint RLT[i1]: - 9*Select[i1] + _ADDED_VAR_[1] + _ADDED_VAR_[2] + _ADDED_VAR_[3] + _ADDED_VAR_[4] + ... _ADDED_VAR_[44] + _ADDED_VAR_[45] + _ADDED_VAR_[46] + _ADDED_VAR_[47] + _ADDED_VAR_[48] + _ADDED_VAR_[49] = 0
/* call MILP solver, automatically linearizing the products of binary
   variables in the objective and RLT constraint */
solve linearize;

/* output selected points to data set */
create data plotdata1 from [point] x y Select=(round(Select[point]));
Kalvelagen mentions an alternative objective, which is to maximize the minimum distance (rather than the sum of distances) between pairs of selected points. He then introduces a new variable \(\Delta\) and a linear constraint to linearize the \(\min\) operator that appears in the objective. With the LINEARIZE option, OPTMODEL automatically performs this linearization on your behalf. First declare the nonlinear objective:
/* maximize the minimum distance between pairs of selected points */
num bigM = max {<i,j> in PAIRS} d[i,j];
max MaxMinObjective = min {<i,j> in PAIRS} (d[i,j] + bigM * (1 - Select[i] * Select[j]));
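For reference, the transformation applied here is the textbook big-M treatment of a minimum over affine terms: a free variable \(\Delta\) is introduced for the objective, along with product variables \(z_{i,j}\) for \(s_i s_j\). Sketched in the notation of this post, with \(M=\max_{i<j} d_{i,j}\) (matching bigM above):

\[
\begin{aligned}
\max \quad & \Delta \\
\text{s.t.} \quad & \Delta \le d_{i,j} + M\,(1 - z_{i,j}) && \forall\, i<j \\
& z_{i,j} \le s_i, \quad z_{i,j} \le s_j, \quad z_{i,j} \ge s_i + s_j - 1 && \forall\, i<j
\end{aligned}
\]

This accounts for the \(1225+1=1226\) added variables and \(1225+3675=4900\) added constraints reported in the log.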
As before, you can optionally expand the linearized model:
expand / linearize;
The log now shows \(\binom{50}{2}+1=1226\) added variables and \(4\binom{50}{2}=4900\) added constraints:
NOTE: The problem has 50 nonlinear constraints (0 LE, 50 EQ, 0 GE, 0 range).
NOTE: The OPTMODEL presolver removed 0 variables, 0 linear constraints, and 0 nonlinear constraints.
NOTE: The OPTMODEL presolver replaced 50 nonlinear constraints, 1 objectives, and 0 implicit variables.
NOTE: The OPTMODEL presolver added 1226 variables and 4900 linear constraints.
NOTE: The OPTMODEL presolved problem has 1276 variables, 4951 linear constraints, and 0 nonlinear constraints.
Var _ADDED_VAR_[1226]
Maximize MaxMinObjective=_ADDED_VAR_[1226]
...
Constraint _ADDED_CON_[3676]: - 11.197402065*_ADDED_VAR_[365] - _ADDED_VAR_[1226] >= -14.62148223
/* call MILP solver, automatically linearizing the MIN operator and
   products of binary variables */
solve linearize;

/* output selected points to data set */
create data plotdata2 from [point] x y Select=(round(Select[point]));
quit;
The following statements plot the first solution, with the \(k=10\) selected points displayed in gold:
/* plot the first solution */
proc sort data=plotdata1;
   by Select;
run;

title 'Quadratic Objective';
proc sgplot data=plotdata1 aspect=1 noautolegend;
   scatter x=x y=y / group=Select;
run;
The selected points are along the border, but some pairs of points are close to each other. The following statements plot the second solution:
/* plot the second solution */
proc sort data=plotdata2;
   by Select;
run;

title 'Maxmin Objective';
proc sgplot data=plotdata2 aspect=1 noautolegend;
   scatter x=x y=y / group=Select;
run;
You can see that no two selected points are close to each other.
This example illustrates two of the automated linearization transformations (products of binary variables, and maximizing a minimum) that are now available in OPTMODEL just by adding the LINEARIZE keyword to the SOLVE statement. OPTMODEL introduces the required additional variables and constraints on your behalf and returns the optimal solution in terms of the original variables. Please try the other automated linearizations and let us know about any additional features that you would like to see introduced.
The post Video: Reinforcement learning using Deep-Q Networks with SAS Viya appeared first on The SAS Data Science Blog.
Reinforcement learning (RL) agents have famously learned to play games such as checkers, backgammon, and Othello, and most recently to land a lunar lander in OpenAI Gym. The long-term reward is to win the game, but getting there can take many different sequences of moves. Furthermore, nothing is hidden, so complete information is available to all players. Lastly, anyone can learn to play by following the rules, since the problem space is fully defined. Expertise develops through the experience of playing the game repeatedly. Games with win-lose-draw outcomes are not too dissimilar from real-world problems in robotics, process control, health care, trading, finance, and much more.
The RL technique featured for scoring a model in the video below is the Deep-Q Network (DQN), which attempts to model, in real time, which actions perform best in each state. Think of this as a player trying to determine which move in a game will lead to a win. A user-defined neural network outputs a value for each possible action that assesses that action’s quality. These values are often identified with a function Q, so the family of algorithms that rely on them has become collectively known as Q-learning. Using the output Q-values, an agent can determine an optimal policy by choosing the highest-quality action at each time step. In this example, Deep-Q learning is performed through the application of a DQN.
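The Q-learning idea underneath a DQN can be shown with a toy tabular example (illustrative Python, not the SAS Viya API; a DQN replaces the table below with a neural network):

```python
# Tabular Q-learning on a made-up 2-state, 2-action problem.
# A DQN approximates this Q table with a neural network instead.

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    # Bellman target: reward plus discounted best future value
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def greedy(Q, s):
    # the learned policy: pick the highest-quality action
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(50):
    q_update(Q, s=0, a=1, r=1.0, s_next=1)   # action 1 pays off
    q_update(Q, s=0, a=0, r=0.0, s_next=1)   # action 0 does not
```

After training, `greedy(Q, 0)` returns action 1, the one with the higher learned Q-value.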
The task that the agent is trying to learn is known as CartPole-v0. This environment simulates a cart on a track trying to balance a pole upright. The objective is to keep the pole balanced upright. Rewards, states, and actions are the following:
Reward: +1 for every time step the pole remains upright
State: cart position, cart velocity, pole angle, and pole angular velocity
Actions: push the cart to the left or to the right
The video will walk you through the simple steps needed to create an RL model for Deep-Q Learning in SAS Viya, using a Jupyter Notebook. Do you have a problem with a complex sequence of decisions for which you want to maximize the outcome? Then check out reinforcement learning (RL) in SAS Viya. Not just for fun and games anymore, RL can be used to solve a variety of real-world problems.
Learn More: Application of reinforcement learning to control traffic signals
The post Chatbots as a means of building confidence in AI appeared first on The SAS Data Science Blog.
Let's suppose I am a financial advisor responsible for high net worth clients. Every month, the data science department gives me updated churn propensities for each client, and I have to intervene to ensure that they remain clients. Is it a solo ride with a bunch of numbers? Or is it a draining conversation between business and data science, with one side talking about clients and contracts and the other talking about neural networks and stochastic calculus?
Chatbots use Conversational AI to enable humans to interact with machines using natural language and instantly get a human-like, intelligent response that is tailored to the user. In the SAS world, chatbots offer another user-friendly conversational interface to the entire Viya ecosystem, bringing together reporting capabilities, analytics and artificial intelligence.
Now, back to that client churn scenario. Let's see how I can interact with the chatbot to make interpretable analytics-driven decisions out of churn probabilities numbers:
The interaction with the chatbot allowed me to review the at-risk clients:
In summary, the chatbot proved to be a useful tool for accessing analytical insights and results, providing me with the exact bits of information I was looking for instead of making me navigate through multiple reports and visualizations. Moreover, with the combination of graphs, numbers, and explanations that the data science team designed for me, I get insights instead of raw numbers, so I can make faster, data-driven decisions.
Now I have a better understanding of what impacts the clients and their likelihood to churn. Thus, I can take this feedback to my marketing team to adjust the campaigns we run and have a better understanding of my clients to provide better-personalized service.
I am not a financial advisor (even if my parents did push for that when I was younger), but I am a data scientist and would like to write a few comments for my fellows.
What you saw in the video was the output of a gradient boosting model for evaluating clients' churn propensity. On top of this, I ran interpretability techniques such as LIME, to get local interpretable explanations, and ICE, to evaluate the dependency of the churn propensity on one variable. Finally, I applied clustering techniques to group clients by similarity.
As for chatbot development, SAS Conversation Designer allowed me to build one in a visual interface that requires little to no code and can trigger code execution or API calls to gather the results.
Thank you for reading.
The post 2021 trends data scientists should follow appeared first on The SAS Data Science Blog.
The recently released 2021 Gartner MQ for Data Science and Machine Learning contains a wealth of information and here are my takes on key market trends from that report for data scientists. This evaluation features SAS Viya with its SAS Data Science offerings.
You must often push the boundaries of innovation when asked to solve key business problems. That’s because the problems that you are asked to solve are complex and often require both structured and unstructured data to solve, calling for the application of different AI techniques or composite AI. That’s where SAS Viya comes in, by providing machine learning, deep learning, NLP, computer vision, forecasting, and optimization capabilities that can easily be used together to solve the most complex of business problems.
Ultimately, you create models to help businesses make better decisions. In many cases, these models can help automate decision-making in real-time when combined with business rules and embedded in a decision process. Gartner acknowledges the decision intelligence in SAS Viya as a strength.
Data science and machine learning platforms must support model operationalization in addition to model building. This includes model performance monitoring, model governance, and lineage. Why is MLOps so important? On average, only half the analytics models built ever make it into production. That’s right, 50%! That can be disheartening to those of you who pour your time and energy into modeling only to never have those models see the light of day. MLOps is another strength of SAS’ – but don’t take just our word for it, Gartner says so too.
You want instant access to the latest innovations and enhancements in your modeling toolkit. The most likely way to achieve this is through applications running on the cloud. The integration between Microsoft Azure and the SAS Viya analytics platform empowers organizations to stand up SAS analytics in their cloud environments with ease and quickly gives users access to the latest and greatest.
You look for the most innovative tools and technologies to help you solve business challenges in the best way. Often this is a modeling melting pot of open-source and commercial analytics tools, which needs governance to manage the disparate code base and processes. SAS Viya is praised for its innate integration with open source, supporting models in different languages and moving them from sandbox to production in a centralized, governed manner that meets any scalability requirement.
Think about the tasks that you perform across the analytics life cycle: data access, data prep, feature engineering, building models, training models, tuning models, and deploying models. Now imagine if those could be automated. How many more models could be built? The end game is to build as many models as needed, as easily as possible, to find the one that solves the business problem at hand. SAS Viya is praised by Gartner for its automated pipeline generation and its hyperparameter autotuning to facilitate the experimentation process.
This report is a great way to educate yourself about the trends in data science and machine learning. No registration is required to get the report. Happy reading!
The post SAS Conversation Designer: interacting with APIs appeared first on The SAS Data Science Blog.
For instance, if you are making a bot that takes feedback from a user, you can use sentiment analysis on the feedback to help determine the next appropriate action. The feedback could also be classified by a model, and depending on the classification, a new dialogue could start to help the user if they are encountering an issue. Other uses include calling a model for a loan approval process using information from a user, answering questions about flights by querying current flight statuses, or informing the user how their order is progressing by checking the status of a transaction. All of these are potential applications of using APIs with SAS Conversation Designer.
To interact with APIs, we will be using the Web Service Data Provider node. With it, your bot can execute HTTP methods to access RESTful services. Requests can also be made through SAS Code nodes, but we will focus on the Web Service Data Provider node.
In case you are new to working with APIs, let's describe what a call looks like. A REST API call is delivered to a resource via a URL and is composed of a method, which declares what action will be taken on the resource; a body, which is the data that will be sent to the resource; and headers, a list of HTTP header keys and their associated values. Headers provide a simple means of passing additional information with the request and are often used for functions like authentication.
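For readers who prefer code to prose, the method/body/headers anatomy maps directly onto any HTTP client. A generic sketch (the endpoints and payload here are fictitious examples, not real services):

```python
# Assemble a REST request from its three parts: method, body, headers.
# The URLs and payload below are made-up examples.
import json
from urllib import request

def build_request(url, method, body=None, headers=None):
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = request.Request(url, data=data, method=method)
    for key, value in (headers or {}).items():
        req.add_header(key, value)
    return req

# a GET with an Accept header
get_req = build_request(
    "https://example.com/api/orders/42",
    method="GET",
    headers={"Accept": "application/json"},
)

# a POST carrying a JSON body
post_req = build_request(
    "https://example.com/api/score",
    method="POST",
    body={"inputs": [{"name": "petal_length", "value": 1.4}]},
    headers={"Content-Type": "application/json"},
)
```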
You can find the Web Service Data Provider Node under the Run Code tab:
Here are the Web Service Data Provider Node's properties as per the documentation:
For a more detailed look, read the documentation. Let's go through an example that uses both the GET and POST to see how to use this in practice.
This bot helps identify the species of an iris flower and provides some basic information about them. It helps identify the species by asking the user for measurements about the iris' petal width and length and then using this knowledge to score a model. It provides information about the species by performing a REST call to the Wikipedia page on that specific species.
Aligning with this bot’s two goals we'll focus on two dialogues, one for scoring the model, and one for querying Wikipedia.
Note on the Model: The model is a decision tree trained on the iris flower data set and published to SAS Micro Analytic Service (MAS) with the name dtree_iris.
To acquire the data for the model prediction, we use multiple HTML Response nodes followed by Text Input nodes. To keep things simple, we won't be verifying the inputs, though for real applications Logic nodes and clever use of Apache Velocity will help make the experience more user-friendly. Once the input data has been acquired, we can run the node to score the model. By omitting the first part of the URL (the hostname), SAS Conversation Designer uses the same host that it's already running on, which lets it handle authorization for us.
These are the parameters of the web service data provider node used to score the model:
Web Service Data Provider
{"inputs": [ { "name": "petal_length", "value": $petal_length }, { "name": "petal_width", "value": $petal_width } ]}
Accept: application/json; application/vnd.sas.microanalytic.module.step.output+json
Content-Type: application/json; application/vnd.sas.microanalytic.module.step.input+json
With the model prediction done, all that is left is to share it with the user. The result of the call is stored in the variable we chose to name $score_out, though you can choose whatever naming scheme makes the most sense when creating your own bots. The content of $score_out is JSON, so we'll use a Modify Context Data Provider node to create a variable named $species by extracting the prediction:
Modify Context Data Provider
The output has a space and quotation marks we don't need, so we'll use one last Modify Context Data Provider node to clean that up:
Modify Context Data Provider
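Outside of Conversation Designer, the same extraction and cleanup can be sketched in a few lines of Python. The response shape below is a hypothetical stand-in for what a scoring call might return; the real field names depend on the model published to MAS:

```python
import json

# Hypothetical stand-in for the JSON a scoring call might return; the real
# field names depend on the model published to MAS.
score_out = json.loads('{"outputs": [{"name": "I_Species", "value": " \\"Setosa\\""}]}')

species = score_out["outputs"][0]["value"]  # what the first Modify node extracts
species = species.strip().strip('"')        # what the second Modify node cleans up
```

The two steps mirror the two Modify Context Data Provider nodes: one to pull the value out of the JSON, one to strip the stray space and quotation marks.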
With all preparation completed, we can simply use an HTML Response node that displays the model prediction:
HTML Response
Which will render in the chat as:
This part will use Wikipedia as a source for more information about the predicted species. Wikipedia has an API for interacting with many of its features. For a given page, we can use the GET method to retrieve just the extract. This request requires neither headers nor a body, which also means you can copy and paste the URL into your browser to see the exact JSON it will return.
Conversation Designer lets you use variables within the URLs of a Web Service Data Provider node, allowing us to have one flow for all three species. To let the user choose which species they want to learn about, we'll use a Buttons node followed by a Text Input node; this sets $page_title according to the predefined button choices. On the Buttons node, Label is what the user will see, and Display Text is what will be passed to the Text Input node.
These are the parameters of the web service data provider node used for the query:
Web Service Data Provider
With the query completed, we can display the results. The JSON that's returned has a slightly odd structure, so you'll see a bit of code using Apache Velocity to extract the pageId.
HTML Response
#foreach($key in $wiki.get("query").get("pages").keySet()) #set($pageId = $key) #end According to Wikipedia: $wiki.get("query").get("pages").get($pageId).get("extract")
Which renders as such:
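For readers more comfortable in Python than Velocity, here is the same key-lookup idea against a sample (abbreviated, hypothetical) response. The point is that the extract is nested under a page ID we don't know ahead of time, so we iterate over the keys to find it:

```python
import json

# Abbreviated, hypothetical Wikipedia query response: the extract is nested
# under a page ID we don't know ahead of time.
wiki = json.loads(
    '{"query": {"pages": {"53916": {"title": "Iris setosa",'
    ' "extract": "Iris setosa is a species of flowering plant."}}}}'
)

pages = wiki["query"]["pages"]
page_id = next(iter(pages))          # the single, unknown key (Velocity's #foreach)
extract = pages[page_id]["extract"]
```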
This shows how to interact with REST APIs using SAS Conversation Designer. There are many possible use cases involving REST calls, so hopefully this example has sparked some ideas.
For the call that queries Wikipedia, a similar setup could be used for an FAQ bot that not only shares a link to a resource or documentation but also retrieves the relevant information and shares it within the chat. A solution like this leverages pre-existing work and has the potential to save time, because when the resource is updated, the bot's response updates with it.
Similarly, being able to interact with the whole of Viya's APIs opens up many more opportunities for integrating resources besides scoring models, like being able to trigger a decision flow in SAS Intelligent Decisioning.
Thank you for reading. Any feedback or questions are welcome!
The post SAS Conversation Designer: interacting with APIs appeared first on The SAS Data Science Blog.
The post Spatial econometric modeling unleashes the geographic potential of your data appeared first on The SAS Data Science Blog.
Due to the increasing popularity of spatial data, the demand for spatial analysis has surged in the past few decades. Geographical information in spatial data provides you with new perspectives for understanding how events occurring in one location are affected by events occurring in neighboring locations.
In this post, I will present an overview of spatial econometric analysis. I will also introduce the SPATIALREG procedure in SAS/ETS® and the CSPATIALREG procedure in SAS® Econometrics in SAS® Viya®. Both can be used for analyzing a wide range of spatial econometric models.
Typically, spatial data refers to any data that contains information about specific locations in space. Just as observations in a time series are referenced by time, observations in spatial data are geographically referenced. For example, geographic information such as longitude-latitude coordinates, zip codes, street addresses, and census tract codes allows us to identify different points or regions on Earth.
From an analytic point of view, spatial data invalidate the underlying assumption of independence between observations in standard linear regression models. This is because data collected over different locations or regions in space are often spatially correlated. The strength of spatial correlation is determined by the proximity of two spatial units: two observations are more correlated when the spatial units are closer together.
Ignoring spatial dependence in the data can lead to biased parameter estimates and flawed inference. Spatial data analysis aims to account for spatial dependence in the data, ensuring that the resulting parameter estimates and inference are correct. Combining spatial analysis and econometrics, spatial econometrics extends standard regression models by explicitly incorporating spatial effects for cross-sectional and panel data. These extended regression models deal with two specifications of spatial effects: spatial interaction and spatial heterogeneity. They are often referred to as spatial econometric models.
As illustrated by the flowchart below, spatial econometric analysis often involves three steps. You begin spatial econometric analysis with data preparation and exploratory data analysis prior to the model fitting. In the second step of model fitting, the actions in sequence are:
The third step relates to post-model-fitting inference such as computing fitted values and marginal effects.
To prepare data for your analysis, some dedicated tools are required to import, project, aggregate, and visualize spatial data. Depending on which operations you need when processing your data, various SAS mapping procedures can meet your needs.
Although spatial data come in various formats, the shapefile format is widely used to store geometry and attribute information of spatial features such as points, lines, and polygons. You can use the MAPIMPORT procedure to read shapefiles in SAS. To visualize your data, you can use the GMAP and SGMAP procedures to show the variation of a variable across geographic areas on a map. For example, Figure 2 displays median log-transformed home values for 506 census tracts in Boston from the 1970 census (Harrison 1978) using PROC GMAP.
If your data contains address information, the GEOCODE procedure can be used to convert an address to geographic coordinates (longitude and latitude). Based on geographic coordinates, you can compute the distance between two locations and project the longitude-latitude coordinates to a 2-dimensional plane with several map projection techniques provided in the GPROJECT procedure. You use the GREMOVE and GREDUCE procedures to combine unit areas and to reduce the number of points in a map data set, respectively. SAS® Visual Analytics also provides Location Analytics capabilities to leverage the geographic potential of your data, including data visualization, geocoding, geographic selection, geo-searching, and much more.
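As an aside, the great-circle distance underlying such computations is easy to sketch. This haversine implementation in Python is a generic illustration (with approximate coordinates), not the internals of any SAS procedure:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Cary, NC to Boston, MA: roughly a thousand kilometers
d = haversine_km(35.79, -78.78, 42.36, -71.06)
```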
After some data cleansing and exploratory data analysis, you are ready to fit some spatial econometric models to your data. The two key components are constructing spatial weights matrices and choosing a model.
Spatial weights matrices play an important role in spatial econometric modeling. They are used to describe the proximity of spatial units and to formulate spatial econometric models. In its simplest form, a spatial weights matrix W is an \(n \times n\) binary matrix with the \((i,j)\)th entry \(W_{ij}\) being
\[W_{ij} = \begin{cases} 1, &\text{if units $i$ and $j$ are neighbors} \\0, &\text{if units $i$ and $j$ are not neighbors} \end{cases}\]
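A small sketch in Python shows how such a matrix is built from a neighbor list for four regions arranged in a line, along with the common row-standardization step. This is purely an illustration of the definition above, not SPATIALREG syntax:

```python
# Four regions on a line (0-1-2-3): each region's neighbor list defines W.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
n = len(neighbors)

# Binary contiguity matrix: W[i][j] = 1 if i and j are neighbors, else 0
W = [[1.0 if j in neighbors[i] else 0.0 for j in range(n)] for i in range(n)]

# Row-standardize so each row sums to 1, a common convention before modeling
W_std = [[w / sum(row) for w in row] for row in W]
```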
For continuous data, the standard linear regression model is useful for modeling the linear relationship between a response variable and some explanatory variables. For instance, in a vector form, the linear regression model can be described as
\(\mathbf{y} = \mathbf{X}_1 \boldsymbol{\beta} + \boldsymbol{\epsilon}.\)
Spatial econometric models extend the standard linear regression model by incorporating spatial dependence arising from three different interaction effects:
The linear regression model can be extended by including the spatially lagged dependent variable Wy as an additional regressor to account for the endogenous interaction effect. This leads to a spatial autoregressive (SAR) model of the form:
\(\mathbf{y} = \rho \mathbf{W} \mathbf{y} + \mathbf{X}_1 \boldsymbol{\beta} + \boldsymbol{\epsilon}.\)
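To see the feedback this model encodes, the following Python sketch simulates a tiny SAR system by fixed-point iteration. Because W is row-standardized and |ρ| < 1, the iteration converges to y = (I − ρW)⁻¹(Xβ + ε); all numbers here are made up for illustration:

```python
rho = 0.5
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]          # row-standardized weights for three units
xb_eps = [1.0, 0.0, -1.0]      # X*beta + eps, held fixed for the demo

# Fixed-point iteration: y <- rho*W*y + X*beta + eps
y = [0.0, 0.0, 0.0]
for _ in range(200):
    y = [rho * sum(W[i][j] * y[j] for j in range(3)) + xb_eps[i] for i in range(3)]
```

For this toy system the fixed point is y = (1, 0, −1): each unit's outcome depends on its neighbors' outcomes, which is exactly the endogenous interaction effect.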
Similarly, spatially lagged explanatory variables in the form of WX are included in the linear regression model to address exogenous interaction effects. In this case, you end up with the following model:
\(\mathbf{y} = \mathbf{X}_1 \boldsymbol{\beta} + \mathbf{W} \mathbf{X}_2 \boldsymbol{\gamma} + \boldsymbol{\epsilon}.\)
For spatial dependence in the error terms, the linear regression model can be extended by assuming the error terms follow specific structures, such as the spatial autoregressive and spatial moving average structure.
Due to the presence of interaction effects, the regression coefficients \(\boldsymbol{\beta}\) in spatial econometric models do not necessarily have the same interpretation as in the standard linear regression model. As a result, marginal effects are often computed to quantify how much change you would expect in the dependent variable for changes in the explanatory variables. To this end, three impact estimators are provided to summarize the direct, indirect, and total impacts of changes in the explanatory variables on the dependent variable.
Both the SPATIALREG procedure and the CSPATIALREG procedure can be used for spatial econometric modeling. The CSPATIALREG procedure is designed to run on a cluster of machines that distribute the data and the computations. The SPATIALREG procedure runs on a single machine. Table 1 provides a complete list of spatial regression models in the CSPATIALREG procedure. The CSPATIALREG procedure is capable of handling large spatial data. In addition, it supports features such as parameter estimation, hypothesis testing, marginal effects computation, and many more. The SPATIALREG procedure supports similar features to the CSPATIALREG procedure except for conditional autoregressive models and impact estimation.
The modeling capabilities in the SPATIALREG and CSPATIALREG procedures give you the analytical power to unleash the geographic potential of your data and gain actionable insights from it.
This is the ninth post in our series about statistics and analytics bringing peace of mind during the pandemic.
The post Mathematical optimization at SAS appeared first on The SAS Data Science Blog.
In his recent blog post, "What is optimization? And why it matters for your decisions," Rob Pratt, Senior Manager in Scientific Computing R&D, wrote: "If the differences in outcomes are significant and the options are numerous, especially if multiple decisions are interdependent, you have a good opportunity to apply analytics." The following post provides additional background on mathematical optimization and its availability in SAS software.
Mathematical optimization is one of the most valuable disciplines in analytics, with applications in every industry. It is used to rigorously search for the best way to use resources to maximize or minimize some metric while respecting business rules that must be satisfied. The four main ingredients to define an optimization problem are input parameters, decision variables, objectives, and constraints:
The goal of optimization is to find an optimal solution, that is, an assignment of values to decision variables that satisfies all constraints and attains the best possible value for the objective function. An optimization algorithm searches for an optimal solution. An optimization solver is a collection of algorithms for a specific problem type (characterized by the types of functions that are used to define the objective and constraints and the values the decision variables can take on).
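These ingredients can be made concrete with a deliberately tiny knapsack-style example, solved here by brute force in Python rather than by a real solver (the data are invented):

```python
from itertools import product

# Input parameters: item values, item weights, and a knapsack capacity
values = [10, 13, 18, 31]
weights = [2, 3, 4, 7]
capacity = 10

# Decision variables: x[i] in {0, 1}; brute force enumerates every assignment
best_x, best_value = None, -1
for x in product([0, 1], repeat=len(values)):
    if sum(w * xi for w, xi in zip(weights, x)) <= capacity:   # the constraint
        value = sum(v * xi for v, xi in zip(values, x))        # the objective
        if value > best_value:
            best_x, best_value = x, value
```

Brute force is fine for 16 candidate solutions; real solvers such as the MILP solver exist precisely because practical problems have astronomically many candidates and need a far smarter search.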
Optimization plays a key role in providing optimal or good solutions for many real-world business problems. Notable applications include airline crew scheduling, inventory optimization, revenue management, and price optimization. Recently, optimization has become a core component in other modern analytics areas. Hyperparameter optimization, meta learning, and optimization algorithms for deep learning are among the most active research areas. The integration between optimization and machine learning, statistics, forecasting, and econometrics has become very tight as new theories and algorithms advance.
SAS® Optimization in SAS® Viya® provides access to a wide array of optimization solvers, with each solver specialized to suit a problem type. For example, the linear programming (LP) solver handles problems where all decision variables are continuous and both the objective and constraints are linear functions of these variables. If some of the variables must take integer values, the mixed integer linear programming (MILP) solver is used instead. Other supported solvers include quadratic programming (QP), nonlinear programming (NLP), constraint programming, black-box optimization, and network optimization.
The OPTMODEL procedure and the corresponding runOptmodel action provide an algebraic modeling language that enables you to build and solve optimization problems, with access to all the solvers listed above. A common use case involves the following steps:
For more complicated problems, you might need to write a customized solution algorithm that calls a solver in a loop or calls multiple solvers, with the output of one solver providing input for another solver.
OPTMODEL includes a programming language, with access to almost all DATA step functions, that enables you to write such customized algorithms. The following examples that were mentioned in the earlier post all use this functionality:
SAS Optimization includes several features that employ threaded and distributed computation. OPTMODEL provides a COFOR loop to solve independent problems concurrently, and the runOptmodel action supports BY-group processing for the common use case of building and solving the same problem multiple times with different input data.
For the NLP solver, the multistart feature increases the likelihood of finding a globally optimal solution for highly nonconvex problems that have many local optima. For the MILP solver, the default branch-and-cut algorithm threads the dynamic tree search and, in distributed mode, processes tree nodes on different workers and communicates new global lower and upper bounds back to the controller.
The LP and MILP solvers both include a decomposition algorithm that exploits block-angular structure; for MILP problems that consist of loosely coupled subproblems, this algorithm often yields dramatic performance improvements over branch-and-cut. The network solver contains both threaded and distributed implementations for selected algorithms, and it also supports generic BY-group processing.
Besides the direct use of the optimization solvers by SAS Optimization procedures and actions, several SAS solutions, including SAS® Pack Optimization, SAS® Promotion Optimization, and SAS® Marketing Optimization, use our solvers under the hood. The optimization engine in SAS Marketing Optimization provides a specialized algorithm that uses the LP and MILP solvers to solve problems with millions of decision variables and constraints. Customer success stories include Akbank and Scotiabank.
Many procedures in other products also use various optimization solvers. For example, the PSMATCH procedure uses the network solver to solve the underlying linear assignment and minimum-cost network flow problems. The CAUSALGRAPH procedure also uses the network solver for cycle enumeration, connected components, and vertex separation. The HPFRECONCILE procedure uses the QP solver to solve a least-squares problem, and many procedures call the NLP solver. SAS Visual Data Mining and Machine Learning uses the black-box solver to tune hyperparameters for a number of machine learning models, including decision tree, gradient boosting, support vector machine, neural network, and logistic regression.
For Python users, the sasoptpy modeling package also provides a modeling interface for the LP, MILP, QP, NLP, and black-box solvers in SAS Optimization. You can use native Python structures such as dictionaries, tuples, and lists to define an optimization problem and then call one of the solvers.
I hope this blog post has helped you learn about some applications of mathematical optimization and how it is used in SAS software. For more information, here are some useful links:
This is the eighth post in our series about statistics and analytics bringing peace of mind during the pandemic.
The post The road to modern econometrics appeared first on The SAS Data Science Blog.
There are many definitions of econometrics. Going to its origins, the word econometrics originated from two Greek words: oikonomia, meaning the study of household activity and management, and metriks, which stands for measurement. Modernizing the definition, we arrive at econo-metrics: the measurement and testing of economic theory by using mathematics, statistics, and computer science knowledge. Discussing econometrics, I jokingly say that it can be defined as Economics = Mathematically Checked and Conveyed (E=MC²). Development of economic theory and its applications has spanned centuries. Let's take a look at the history of this specialty.
In this post, I will discuss some applications of modern econometrics. The road to forming and establishing modern econometrics was long and complicated. Many things changed over hundreds of years due to advancements in statistical, mathematical, and computer science theory. However, the dominant driver of change was technology. Regression remained the workhorse of econometrics, but it became much more complex due to attempts at large-scale modeling and the availability of big data. Technology, including grid and cloud computing, enables us to crunch a lot of data, but it is often not enough to rely on technology alone. When there is a lot of data, TECHNOLOGY itself is not enough: you need "tricks," which I will talk about next.
We can use spatial regression as an example of large data and large models. Spatial regression is based on the first law of geography (Tobler 1970): "Everything is related to everything else, but near things are more related than distant things."
Using regression, you can construct a model projecting the target variable of interest at one location on regressors at the same location and, at the same time, on regressors at neighboring locations. It is still a regression, but one enhanced by neighboring relationships that play an essential role. These neighboring relationships are recorded in a matrix that can be quite large. In spatial econometrics, we call it the matrix of spatial neighbors (weights); its size depends directly on the number of locations that enter the analysis.
For example, an analysis involving US census data at the tract level includes 64,999 locations (tracts). Not a large number on its own, but the problem becomes large if each of these locations can potentially interact with every other location. If we record these relationships fully, the spatial weight matrix has dimension 64,999 by 64,999. Reading this matrix directly into memory requires about 34GB. Even if you have enough memory to accommodate a matrix of that size on your computer, there is no doubt that computation speed will suffer greatly when you work with such a large matrix. You might not have enough hours in the day to solve your model!
Considering the difficulties that might arise when dealing with large matrices, it is time to think about some tricks and answer questions that can help make the problem much simpler computationally.
The first thing to consider is the sparsity of the matrix. We can likely take advantage of the fact that the weight matrix is sparse: there are not that many possible neighbors for each spatial unit (not everyone can be a neighbor of everyone else), so you only store the relevant entries. The sparsity trick gets you past the problem of storing the matrix; with a sparse representation, you are able to keep it in memory.
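A Python sketch of the idea, with illustrative numbers (the 6 neighbors per tract is an assumption, not a census fact): store only the nonzero weights, and operations such as matrix-vector products then touch only the stored entries:

```python
# Dense storage for 64,999 tracts would need n*n 8-byte floats:
n, avg_neighbors = 64_999, 6            # 6 neighbors per tract is illustrative
dense_gb = n * n * 8 / 1e9              # roughly 34 GB
sparse_entries = n * avg_neighbors      # ~390,000 stored weights instead

# Toy sparse representation, {row: {col: weight}}, and a matrix-vector
# product that touches only the stored nonzeros:
W = {0: {1: 0.5, 2: 0.5}, 1: {0: 1.0}, 2: {0: 1.0}}

def matvec(W, x):
    return [sum(w * x[j] for j, w in W.get(i, {}).items()) for i in range(len(x))]

y = matvec(W, [1.0, 2.0, 3.0])
```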
But you likely still need to work with this N×N matrix in your calculations. For example, you might need to take its inverse, possibly more than once. Even when computationally feasible, inverting a matrix with billions of elements would take many CPU hours. It is time for another trick: alter your calculation through an approximation that avoids the inversion altogether. Both tricks (and many more) are applied in the CSPATIALREG procedure in SAS Econometrics, so you can get accurate estimation results quickly for very large problems.
Figures 1(a) and 1(b) below show true values versus predicted values for the analysis of census tract data with a very large matrix of spatial neighbors. PROC CSPATIALREG uses the Taylor approximation and finishes the estimation in about one minute, instead of the days or months that the full estimation method would take. The predicted values are very close to the true values.
One big branch of econometrics is time series analysis. Some scholars even think the econometrics field started with time series analysis. I like to call the goals of time series analysis “UFO”:
There are many applications of time series analysis in different fields. One example from the stock market is fairly difficult to model. You might be familiar with the concepts of a bull market and a bear market. If you knew when the market would be bull or bear, trading might be much easier. Unfortunately, the market states cannot be directly observed. What you can observe is stock prices. How do you decode the hidden states from what you can observe? Hidden Markov models (HMMs) to the rescue!
Why do I call HMM a big model? Because HMM explores the probability space in an exponential way. For example, from 1926 to 1999, there were about 4,000 business weeks. If an HMM uses two hidden states to model weekly data, then there are 2^4,000 possible paths of hidden scenarios from combining these two hidden states along 4,000 weeks! In such an insanely large modeling space, HMM can find the most likely path by applying a trick (the Viterbi algorithm) to decode the hidden states so you can "observe" them.
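A minimal Viterbi decoder in Python makes the trick concrete: instead of enumerating 2^T paths, it finds the most likely one in time linear in T. The two states and the transition and emission probabilities below are invented for illustration, not estimated from market data:

```python
import math

# Invented two-state HMM over weekly "up"/"down" observations
states = ["bull", "bear"]
start = {"bull": 0.6, "bear": 0.4}
trans = {"bull": {"bull": 0.8, "bear": 0.2}, "bear": {"bull": 0.3, "bear": 0.7}}
emit = {"bull": {"up": 0.7, "down": 0.3}, "bear": {"up": 0.2, "down": 0.8}}

def viterbi(obs):
    # Log probabilities avoid underflow on sequences thousands of steps long
    v = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        v_new, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] + math.log(trans[p][s]))
            ptr[s] = prev
            v_new[s] = v[prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
        v, back = v_new, back + [ptr]
    path = [max(states, key=lambda s: v[s])]   # best final state
    for ptr in reversed(back):                 # backtrack to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

path = viterbi(["up", "up", "down", "down", "down"])
```

The dynamic program keeps only the best score per state per week, which is why 4,000 weeks is easy even though the raw path space is astronomically large.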
Figure 2(a) describes three hidden market states: bull, bear, and in-between. The estimation and model selection are another huge task; in this example, it takes about 10,000 CPU hours to find the best HMM. Once the best model is found, you can apply it to new data (from 2000 to 2017 in this example) to forecast next week's market state for each time window. Then you can define your trading strategies according to such forecasts: optimizing the present. The forecasts and wealth curves of different trading strategies are shown in Figures 2(b) and 2(c), respectively. In this example, by using HMMs to design your trading strategies, you can beat the market with a 70% better return, or a 70% better Sharpe ratio! For more information, see Example 14.1, Discovering the Hidden Market States by Using the Regime-Switching Autoregression Model, in the chapter The HMM Procedure.
This example covers a modern and important simulation method in the area of Bayesian analysis and MCMC (Markov chain Monte Carlo). I'm going to talk briefly about Bayesian Analysis of Time Series (BATS), and eventually about something even more complex than MCMC: SMC, the sequential Monte Carlo method, also known as the particle filter. The concept is related to the Kalman filter, a well-known technique for (linear Gaussian) state space models (SSMs). The particle filter extends the filtering framework to nonlinear, non-Gaussian state space models.
SSMs play a critical role in time series analysis because many time series models can be written in state space form, such as ARIMA, UCM, VARMA, stochastic volatility models, and so on. In this example, I will use the stochastic susceptible-infected-recovered (SSIR) model for the COVID-19 pandemic to demonstrate modern SMC analysis.
In epidemic theory, the effective reproduction number, R, is an extremely important parameter: the average number of secondary cases per primary case. R is time-varying due to the stage of epidemic spread, interventions, and other factors. Once R < 1, the epidemic might be in decline and under control. Tracing R over time is not an easy task, and the SSIR model is one way to do so. However, due to the nonlinearity in the model, there is no closed-form solution, so we need an approximation method, and SMC certainly comes to mind. Based on real COVID-19 data for Pennsylvania and through millions or even billions of simulations, R can be traced, as shown in Figure 3(a), and the future number of new cases can be forecast and compared with actual values, as shown in Figure 3(b). The effective reproduction number is successfully traced, complete with confidence intervals, and the forecasts are accurate: the true testing data fall within the forecast confidence intervals.
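To connect the pieces, here is a deterministic SIR sketch in Python; the stochastic version adds noise to the same recursion. The rates are invented (chosen so that R0 = 2), and the point is just how the effective number R_t = (β/γ)·S_t/N drifts below 1 as the susceptible pool is depleted:

```python
# Deterministic daily SIR recursion; beta and gamma are invented so that R0 = 2.
N = 1_000_000
beta, gamma = 0.30, 0.15
S, I, R = N - 100.0, 100.0, 0.0

r_eff = []
for day in range(300):
    r_eff.append((beta / gamma) * S / N)   # effective reproduction number R_t
    new_inf = beta * S * I / N             # new infections this day
    new_rec = gamma * I                    # new recoveries this day
    S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
```

The particle filter's job in the SSIR setting is the reverse of this sketch: given noisy observed case counts, infer the hidden S, I, and time-varying transmission rate that generated them.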
For more information, see Example 18.3 Estimating the Effective Reproduction Number of COVID-19 Cases in Pennsylvania in the chapter The SMC Procedure.
I often say that there are no “small” problems in econometrics. Some problems might be larger than others. However, if you have the right toolbox you can solve them all. SAS Econometrics, including SAS/ETS, offers a lot of value. I encourage you to explore the SAS Econometrics offering for your daily modeling needs.
Procedures Doc | SAS Econometrics
For more information about spatial regression analysis, watch the video Spatial Econometric Modeling for Big Data Using SAS Econometrics or the tutorial SAS Introduction to Spatial Econometric Modeling.
This is the seventh post in our series about statistics and analytics bringing peace of mind during the pandemic.