The post How to plan an optimal tour using Network Optimization in SAS Viya appeared first on The SAS Data Science Blog.

Various algorithms can be used to evaluate and understand a public transportation network. The *Minimum Spanning Tree* algorithm can reveal the most critical routes needed to maintain the same level of accessibility in the network: it specifies which stations (train, metro, tram, and bus) must be maintained so that the reachability of the original network stays the same. The *Minimum-Cost Network Flow* algorithm can describe an optimal way to move the population throughout the city, allowing better commuting plans based on the available routes and their capacities across all types of transportation. The *Path* algorithm can reveal all possible routes between any pair of locations, which is particularly important in a multimodal transportation network: it gives transportation authorities a set of alternate routes in case of unexpected events. The *Shortest Path* algorithm can reveal an optimal route between any two locations in the city, allowing security agencies to establish faster emergency routes. The *Transitive Closure* algorithm identifies which pairs of locations are joined by some route, helping public transportation agencies account for reachability in the city.

The algorithm emphasized here is the *Traveling Salesman Problem*. Its solution requires finding the minimum-cost tour within a network. In my case, the cost is the distance traveled over a set of locations to visit, based on all types of transportation available in the network. More specifically, the cost is the walking distance: I will try to minimize the distance I need to walk to visit all the places on my tour list.

Open public transportation data allows both companies and individuals to create applications that can help residents, tourists, and even government agencies plan and deploy public services efficiently. In this case, I am going to use open data provided by two transportation agencies in Paris: RATP (Régie Autonome des Transports Parisiens) and SNCF (Société Nationale des Chemins de fer Français).

The first step is to collect and evaluate the appropriate data and create a transportation network. The data contains information about 18 metro lines, 8 tram lines, and 2 major train lines. It comprises information about all lines, stations, timetables, and coordinates, among other attributes. This data builds the transportation network, identifying all possible stops and the sequence of steps performed when traveling through the city on its public transportation system.

I particularly want to show the traveling salesman problem algorithm, and to do that, I am going to visit Paris. I've selected a set of 42 places: a hotel in Les Halles (the starting and ending point) and 41 places of interest, including the most popular tourist locations in Paris and my preferred cafés and restaurants. I would visit all the cafés and restaurants if I could! I am going to compute two optimal tours. The first tour consists of just walking. Paris is a beautiful city, and there's nothing better than just walking through the City of Light. The second tour considers the multimodal public transportation system. This tour doesn't help me enjoy the wonderful views as much, but it will definitely help me enjoy the delicious cafés and restaurants for longer... Oh, and the tourist spots, too!


Without further ado, let’s get started. The first step is to set up the 42 places based on x and y coordinates.

data places;
   length name $20;
   infile datalines delimiter=",";
   input name $ x y;
   datalines;
Novotel,48.860886,2.346407
Tour Eiffel,48.858093,2.294694
Louvre,48.860819,2.33614
Jardin des Tuileries,48.86336,2.327042
Trocadero,48.861157,2.289276
...
Luigi Pepone,48.841696,2.308398
Les Negociants,48.837129,2.351927
Au Trappiste,48.858295,2.347485
;
run;

An HTML file is created to show all these places on a map, using the open-source Leaflet library.

filename arq "&dm/parisplaces.htm";

data _null_;
   set places end=eof;
   file arq;
   length line $1024.;
   k+1;
   if k=1 then do;
      put '<!DOCTYPE html>';
      put '<html>';
      put '<head>';
      put '<title>SAS Network Optimization</title>';
      put '<meta charset="utf-8" />';
      put '<meta name="viewport" content="width=device-width, initial-scale=1.0">';
      put '<link rel="stylesheet" href="https://unpkg.com/leaflet@1.5.1/dist/leaflet.css" integrity="sha512-xwE/Az9zrjBIphAcBb3F6JVqxf46+CDLwfLMHloNu6KEQCAWi6HcDUbeOfBIptF7tcCzusKFjFw2yuvEpDL9wQ==" crossorigin=""/>';
      put '<script src="https://unpkg.com/leaflet@1.5.1/dist/leaflet.js" integrity="sha512-GffPMF3RvMeYyc1LWMHtK8EbPv0iNZ8/oTtHPx9/cc2ILxQ+u905qIwdpULaqDkyBKgOaB57QTMg7ztg8Jm2Og==" crossorigin=""></script>';
      put '<style>body{padding:0;margin:0;}html,body,#mapid{height:100%;width:100%;}</style>';
      put '</head>';
      put '<body>';
      put '<div id="mapid"></div>';
      put '<script>';
      put 'var mymap=L.map("mapid").setView([48.856358, 2.351632],14);';
      put 'L.tileLayer("https://api.tiles.mapbox.com/v4/{id}/{z}/{x}/{y}.png?access_token=pk.eyJ1IjoibWFwYm94IiwiYSI6ImNpejY4NXVycTA2emYycXBndHRqcmZ3N3gifQ.rJcFIG214AriISLbB6B5aw",{maxZoom:20,id:"mapbox.streets"}).addTo(mymap);';
   end;
   line='L.marker(['||x||','||y||']).addTo(mymap).bindTooltip("'||name||'",{permanent:true,opacity:0.7}).openTooltip();';
   if name = 'Novotel' then do;
      /* Highlight the hotel: write a fully opaque marker, then a circle around it */
      line='L.marker(['||x||','||y||']).addTo(mymap).bindTooltip("'||name||'",{permanent:true,opacity:1}).openTooltip();';
      put line;
      line='L.circle(['||x||','||y||'],{radius:75,color:"'||'blue'||'"}).addTo(mymap).bindTooltip("'||name||'",{permanent:true,opacity:1}).openTooltip();';
   end;
   put line;
   if eof then do;
      put '</script>';
      put '</body>';
      put '</html>';
   end;
run;

Here is the map with all the places I want to visit.

There are so many places to visit... so many cafés. I am not sure I can make it to all of them, or at least walk all of this and still have working legs by the end. Plus, there are so many possible decisions to make along this tour that I would get lost easily. Combining all ordered pairs of the 42 locations, I have 42 × 41 = 1,722 possible steps.

OK, this is too much. I definitely need help from **proc network** and **proc optnetwork**, the network analysis and optimization procedures for **SAS Viya**. The *Traveling Salesman Problem* (TSP) is one of the multiple network optimization algorithms included in this package. The TSP algorithm aims to find the minimum-cost tour in a graph. A path in a graph is a sequence of nodes linked to each other. An elementary cycle is a path whose start node and end node are the same. A tour is essentially an elementary cycle in which every node is visited exactly once (no matter how much you like the place!). TSP aims to find the tour with the minimum total cost; the cost here is the distance traveled.
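To make those definitions concrete, here is a tiny brute-force sketch in Python (not the SAS procedure used below): it enumerates every tour of a four-node graph with made-up distances and keeps the cheapest elementary cycle that visits each node exactly once.

```python
from itertools import permutations

# Illustrative only: exhaustively search every tour of a tiny weighted graph.
# A tour visits every node exactly once and returns to the start node.
dist = {
    ("A", "B"): 2, ("B", "A"): 2,
    ("A", "C"): 9, ("C", "A"): 9,
    ("A", "D"): 10, ("D", "A"): 10,
    ("B", "C"): 6, ("C", "B"): 6,
    ("B", "D"): 4, ("D", "B"): 4,
    ("C", "D"): 3, ("D", "C"): 3,
}

def tour_cost(tour):
    # Sum link weights along the cycle, including the closing leg back home.
    legs = zip(tour, tour[1:] + tour[:1])
    return sum(dist[(u, v)] for u, v in legs)

nodes = ["A", "B", "C", "D"]
# Fix node A as the start to avoid counting rotations of the same cycle.
best = min((["A"] + list(p) for p in permutations(nodes[1:])), key=tour_cost)
print(best, tour_cost(best))  # → ['A', 'B', 'D', 'C'] 18
```

Exhaustive search only works for a handful of nodes, since there are (n-1)! tours; for the 42 places in this post, an optimization solver such as the TSP algorithm in proc optnetwork is required.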

I calculate the Euclidean distance for all possible links between the locations I want to visit so I can use the TSP algorithm to search for the optimal tour.
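As a sketch of that step, the pairwise link table can be built like this in Python. The three places and coordinates are taken from the table above; latitude/longitude are treated as planar coordinates, which is a rough but workable proxy for ranking distances at city scale.

```python
import math

# Build one row per ordered pair of places with the straight-line distance
# between their coordinates (a sketch of the links table fed to the solver).
places = {
    "Novotel": (48.860886, 2.346407),
    "Louvre": (48.860819, 2.336140),
    "Tour Eiffel": (48.858093, 2.294694),
}

links = []
for org, (x1, y1) in places.items():
    for dst, (x2, y2) in places.items():
        if org != dst:
            links.append((org, dst, math.hypot(x2 - x1, y2 - y1)))

# 3 places -> 3 * 2 = 6 directed links
print(len(links))  # 6
```

With all 42 places, the same double loop yields the 1,722 directed links mentioned earlier.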

proc optnetwork
   direction = directed
   links = mycas.placesdist
   out_nodes = mycas.placesnodes;
   linksvar
      from = org
      to = dst
      weight = distance;
   tsp
      cutstrategy = none
      heuristics = none
      milp = true
      out = mycas.placesTSP;
run;

Here is my optimal tour on the map.

The best tour to visit all 41 locations requires 19.4 miles of walking; for me, that is almost a marathon. This walking tour would take me 6 hours and 12 minutes. It is too much!

A feasible approach to reduce the walking distance is to use the public transportation system. In Paris, this network includes trains, trams, metro, and of course buses (not counting scooters, bicycles, motorcycles, electric cars, etc.). Our tour does not use the bus network, just trains, trams, and metro. The open data provided by the RATP is used to create a public transportation network comprising 27 lines (16 metro lines, 9 of the existing 11 tram lines, and 2 of the existing 5 train lines) and 518 stations. Here I will take advantage of my **Navigo** card to travel as many times as I need.

Now I can enhance my optimal tour by considering the multimodal transportation system: not just walking, but also the public lines. The rule that upgrades my initial tour is simple: public transportation is worth taking between two points of interest only if the walking it requires is shorter than walking the whole way. For example, if the distance of walking from the origin place **A** to the station closest to **A**, plus the distance from the station closest to **B** to the destination place **B**, is greater than the distance of simply walking from **A** to **B**, there is no reason to take public transportation. If that combined distance is less than the distance from **A** to **B**, then I will take public transportation.
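The decision rule above can be sketched as a small Python function; the distances passed in below are hypothetical examples, not values from the Paris data.

```python
# Take public transportation from A to B only when walking to A's nearest
# station plus walking from B's nearest station to B is shorter than
# walking the whole way from A to B.
def should_take_transit(walk_a_to_b, walk_a_to_station, walk_station_to_b):
    return walk_a_to_station + walk_station_to_b < walk_a_to_b

# Example: a 2.0 km walk vs 0.3 km + 0.4 km of walking around a metro ride.
print(should_take_transit(2.0, 0.3, 0.4))  # True

# Example: places so close that walking directly wins.
print(should_take_transit(0.5, 0.3, 0.4))  # False
```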

Then I need to calculate the closest station to each point of interest I will visit (which stations serve each place, and which one is closest). For each possible step in my best tour, I need to decide whether it is better to walk or to take public transportation.

Once I calculate the distances between all the places to visit and the places to the nearest stations, I can compare all possible steps in my path in terms of shortest distances to see if I will take public transportation or just walk. I will then execute the TSP algorithm again to search for the optimal tour considering both walking and public transportation.

Now I know my optimal sequence of places to visit to minimize my walking distance, and I know when to walk and when to take public transportation. The last step in this multimodal tour is to find the shortest path between any two stations in my optimal tour. Paris has a very dense public transportation network, which means that there are many options to go from one place to another considering not just multiple lines, but also multiple types of transportation like metro, train, or tram. I need to find the shortest one for each pair of stations considered in my tour. To do this, I need to run another algorithm, the *Shortest Path*.

The shortest path problem aims to find a path between two nodes *u* and *v* in a graph such that the sum of the weights of its links is minimized, considering all possible paths starting at *u* and ending at *v*. The link weight here is the distance.
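For illustration, here is a compact Dijkstra implementation in Python that computes such a minimum-weight path on a toy undirected network; the station names and distances below are made up and are not the RATP data used in the post.

```python
import heapq

# Toy undirected network: node -> {neighbor: distance}.
graph = {
    "Corvisart": {"Place d'Italie": 0.5, "Glaciere": 0.6},
    "Place d'Italie": {"Corvisart": 0.5, "Denfert": 1.1},
    "Glaciere": {"Corvisart": 0.6, "Denfert": 0.7},
    "Denfert": {"Place d'Italie": 1.1, "Glaciere": 0.7, "Edgar Quinet": 0.9},
    "Edgar Quinet": {"Denfert": 0.9},
}

def shortest_path(graph, source, sink):
    # Priority queue of (distance so far, node, path taken).
    queue = [(0.0, source, [source])]
    seen = set()
    while queue:
        d, node, path = heapq.heappop(queue)
        if node == sink:
            return d, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph[node].items():
            if nbr not in seen:
                heapq.heappush(queue, (d + w, nbr, path + [nbr]))
    return float("inf"), []

d, path = shortest_path(graph, "Corvisart", "Edgar Quinet")
print(round(d, 1), path)  # 2.2 ['Corvisart', 'Glaciere', 'Denfert', 'Edgar Quinet']
```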

proc optnetwork
   direction = undirected
   links = mycas.metrolinks;
   linksvar
      from = org
      to = dst
      weight = dist;
   shortestpath
      outpaths = mycas.shortpathmetrotour;
run;

Now I know the best sequence of places to visit, when to walk and when to take public transportation, and which route to take when I do.

A tour can start and end at any place, but I want mine to start and end at the same place. I will start my tour at Novotel Les Halles and (hopefully) finish there at the end of the day. A minor adjustment to my results makes this happen: I simply duplicate the tour sequence and select the stretch that starts and ends at Novotel.
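The duplication trick can be sketched in Python: concatenating the tour with itself makes every rotation a contiguous slice, so the slice beginning at the hotel is the same tour re-anchored at Novotel. The step names here are hypothetical.

```python
# A tour is a cycle, so any rotation of it is the same tour. Doubling the
# sequence lets us pick the rotation that starts (and ends) at the hotel.
tour = ["Louvre", "Trocadero", "Novotel", "Au Trappiste", "Josselin"]

doubled = tour + tour
start = doubled.index("Novotel")
anchored = doubled[start:start + len(tour)]
print(anchored)  # ['Novotel', 'Au Trappiste', 'Josselin', 'Louvre', 'Trocadero']
```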

data stationplacestep;
   set mycas.stationplacetsp mycas.stationplacetsp;
run;

data stationplacestepstart;
   set stationplacestep;
   if plorg = 'Novotel' then k+1;
   if k = 1 then do;
      order+1;
      drop k;
      output;
      if pldst = 'Novotel' then k+1;
   end;
run;

Now I finally have my optimal multimodal tour.

In step 21 of my tour, I stop at Les Cailloux, an Italian restaurant in the Butte aux Cailles district, an art deco architectural heritage neighborhood. From there, I walk to the Corvisart station and take line 6 to Edgar Quinet. From there, I walk to Josselin and grab a crepe (one of the best in the area). Then I walk to Café Gaité and enjoy a beer, just watching people walk by. The concept of a shot of coffee that lasts for hours at outside tables applies to a beer as well. It is a very nice area with lots of bars and restaurants. Then I walk to the Montparnasse Tower, probably the best view of Paris, because from there you can see the Eiffel Tower. It beats the Eiffel Tower because you can’t see the Eiffel Tower while you're on the Eiffel Tower! You don’t need to buy a ticket (18€) for the Observation Deck: go to the restaurant Ciel de Paris on the 56th floor and enjoy a coffee or a glass of wine. You may be a floor or two lower, but you will save 13€ or 9€, depending on what you pick. From there, I walk to the Financier, an honest pub, because you know I must stop by a pub. From there I walk to the Montparnasse station and take line 6 again. I get off at the Pasteur station and switch to line 12. I get off at the Volontaires station and walk to Pizzeria Luigi Pepone (the best pizza in Paris – ask for Richard). And from there my tour continues. I am still on step 26 and have 14 more steps to go. Now you understand why I need an optimization algorithm to search for the best tour. There are just too many places to visit!

If you recall (after so many bars and restaurants), the walking tour would cost me 19.4 miles and **6 hours and 12 minutes** of walking. Now, my multimodal tour will cover 27.6 miles (more than the first one), but I will walk just 2.8 miles, and the entire tour will take just **2 hours and 30 minutes**. Here is the new dilemma: since I'm stopping at so many restaurants and bars, I should walk more to burn off the calories. But by walking more around the city, I can't help stopping at even more bars and restaurants. Perhaps there is another optimization algorithm that can help me solve this...

Merci for reading! To see optimal tours done in different cities, please check out the accompanying videos.


The post What is optimization? And why it matters for your decisions appeared first on The SAS Data Science Blog.

- *Is what we're trying to accomplish possible?*
- *What's the best we can do?*
- *What happens if conditions change?*

*So how can we do better? In this post, Rob Pratt, Senior Manager in Scientific Computing R&D, provides us with a whirlwind tour of the many facets of SAS Optimization.*

You make decisions every day: what time to get up, what to wear, what to eat, what route to drive to work (well, not so much lately), when to schedule a meeting, which check-out line to join, and so on. Often, you make these decisions with little thought, based on instinct or what you did the last time you faced a similar situation. If the reasonable options are few and the consequences of the decisions do not vary widely, then it doesn’t really matter much what choice you make.

But if the differences in outcomes are significant and the options are numerous, especially if multiple decisions are interdependent, you have a good opportunity to apply analytics.

Mathematical optimization is one of the most valuable disciplines in analytics, with applications in every industry. It is used to rigorously search for the best way to use resources to maximize or minimize some metric while respecting business rules that must be satisfied.

Often, optimization is applied to business problems that are easily described but difficult to solve. Let’s review some examples that meet that description. Each of these problems was solved using the advanced features of SAS Optimization, and many were implemented by the SAS Analytics Center of Excellence.

**How can you safely meet oil well service levels with lower costs for the company and better hours for technicians?**

Using SAS/OR® to Optimize Scheduling and Routing of Service Vehicles describes the use of the mixed integer linear programming (MILP) solver and the network solver to assign service technicians to oil wells in a way that minimizes travel costs while satisfying service frequency requirements and respecting limits on working hours per day.

**How can you divide a geographic region into equal zones?**

Using the OPTMODEL Procedure in SAS/OR® to Solve Complex Problems explains how to use the MILP, constraint programming, and network optimization solvers to solve a political districting problem that partitions a geographic region into a specified number of smaller contiguous subregions in a way that minimizes the differences in populations between regions.

**How can you help more sports fans return to the stadium while maintaining social distancing guidelines?**

Why Venue Optimization is Critical and How It Works, by Sertalp Cay, discusses a COVID-19 project that uses our optimization solvers to determine which stadium seats to sell in order to maximize revenue while respecting social distancing guidelines. This post also mentions a fun seating optimization game that challenges you to find an optimal seating arrangement and then compares your choices against what the MILP solver finds.

**How can you improve production levels while meeting all quality requirements in manufacturing?**

One project for a large manufacturer and distributor of pulp, paper, and building products develops an analytical flow process to support scoring of the predictive models, optimization, and visualization of the wallboard manufacturing process. In the optimization phase, the objective is to maximize yield subject to constraints that enforce business rules and keep key performance indicators (for quality and waste measures) within their expected ranges. The optimization output then provides recommendations for controllable settings of the wallboard manufacturing process. The mathematical formulation of this project is a nonlinear optimization problem that is formulated and solved by using SAS Optimization.

**How can you produce the best laundry detergent at the lowest cost?**

A laundry portfolio optimization project for Procter & Gamble sets portfolio strategy for a multi-billion-dollar laundry business. The mathematical formulation of this project is a mixed integer nonlinear optimization problem. The objective is to minimize the total cost of the recommended ingredient levels while meeting quality constraints and business rules. The solution approach uses a COFOR loop to solve multiple independent nonlinear programming (NLP) subproblems concurrently and then uses the resulting solutions as input to the MILP solver. Earlier work related to this ongoing project led to a joint team from Procter & Gamble and SAS being named by INFORMS as finalists for the 2014 Daniel H. Wagner Prize for Excellence in Operations Research Practice.

**How can you prevent power outages by reducing contact between electric lines and trees?**

Another project for Honeywell concerns tree contact with transmission lines, a leading cause of electric power outages and a common cause of past regional blackouts. The objective of the model is to minimize the risk of failure of a power circuit, which is defined by user-provided metrics, information regarding priority of the network, population affected if the network experiences an outage, the cost of bringing a system back up after failure, and so on. Using estimated tree growth projections, the idea is to provide a schedule of when a circuit should be serviced and by which vendor. It is a simple assignment problem that ensures that the recommended schedule cost does not exceed the predefined budget. The problem is solved by using the MILP solver in the runOptmodel action.

**How can you improve the bussing experience for students with disabilities?**

For Boston Public Schools, an important problem is to optimally assign monitors or supervisors to accompany students with disabilities on school buses. Several rules need to be respected in assigning monitors to students, with a goal of maximizing the number of routes within each monitor’s package. The solution approach uses the network solver to enumerate paths, the MILP solver to solve an integer multicommodity flow problem, and the network solver again to decompose the resulting solution into directed cycles. More details are available in this SAS Global Forum 2020 poster.

These projects exemplify how the era of big data and big computing power has made it possible to construct larger and more detailed optimization models that capture both the relationships among decision variables and their contributions to the metric being optimized. To solve these increasingly complex problems, sometimes even a set of models is needed where the output of one model becomes the input for a subsequent model. As optimization becomes one step of many in the modeling processes, data scientists and other modelers expect to solve these problems using their favorite language as part of an integrated workflow.

SAS® Optimization in SAS® Viya includes several distinguishing features that support these needs. In addition to traditional mathematical optimization solvers for linear programming (LP), mixed integer linear programming (MILP), quadratic programming (QP), and nonlinear programming (NLP), SAS Optimization includes constraint programming, black-box optimization, and network optimization. All of these are accessible from the same algebraic modeling language, OPTMODEL.

Like the rest of SAS Viya, optimization actions make the various solvers available from SAS, Java, Lua, Python, R, and REST APIs. For many years, OPTMODEL has supported a Coroutine FOR (COFOR) loop to solve independent problems concurrently, either on a single machine or in distributed mode. By design, the syntax is minimal, in many cases requiring only a single keyword change from FOR to COFOR.

For the NLP solver, the multistart feature increases the likelihood of finding a globally optimal solution for highly nonconvex problems that have many local optima. This feature is available in both single-machine and distributed modes. The runOptmodel action now supports BY-group processing for the common use case of building and solving the same problem multiple times with different input data. This functionality does not require any explicit looping, and both problem generation and solver execution are automatically parallelized.

The network solver contains a large suite of algorithms, many of which are threaded and distributed. It also supports generic BY-group processing. The newest algorithm added solves the capacitated vehicle routing problem. The latest release contains automated linearization techniques that introduce new variables and constraints to transform several common nonlinear structures to linear form. This improvement enables you to make broader use of the fast linear optimization solvers in SAS Optimization without needing to explicitly modify your models to use only linear functions.

For the MILP solver, the default branch-and-cut algorithm threads the dynamic tree search. In distributed mode, the solver processes tree nodes on different workers and communicates new global lower and upper bounds back to the controller. The LP and MILP solvers both include a threaded and distributed Dantzig-Wolfe decomposition algorithm that exploits block-angular structure in the constraint matrix. For MILP problems that consist of loosely coupled subproblems, this algorithm often yields dramatic performance improvements over branch-and-cut.

I hope this blog post has helped you learn about some applications of mathematical optimization and how you can use SAS software to solve optimization problems. We continue to add new features that make it easier for users to model complex optimization problems, and in every release, we make performance improvements to solve those problems more quickly. For more information:

- The SAS Optimization documentation is the definitive source of information for the procedures and actions in SAS Optimization. SAS/OR® 15.2 User's Guide: Mathematical Programming Examples includes 29 examples that illustrate various features and demonstrate best practices.
- The Mathematical Optimization, Discrete-Event Simulation, and OR SAS Support Community provides assistance for SAS/OR, SAS Optimization, and SAS Simulation Studio.
- The Operations Research SAS blog explores the use of operations research modeling methods and solution algorithms in optimization, simulation, scheduling, and related areas.
- The SAS Software Statistics and Operations Research YouTube channel includes over one hundred short videos.

This is the seventh post in our series about statistics and analytics bringing peace of mind during the pandemic.


The post Application of reinforcement learning to control traffic signals appeared first on The SAS Data Science Blog.

With the emergence of urbanization and the increase in household car ownership, traffic congestion has been one of the major challenges in many highly populated cities. Traffic congestion can be mitigated by road expansion/correction, sophisticated road allowance rules, or improved traffic signal control. Although any of these solutions could decrease travel times and fuel costs, optimizing the traffic signals is the most convenient due to limited funding resources and the opportunity to find more effective strategies. Here we introduce a new framework for learning a general traffic control policy that can be deployed at an intersection of interest to ease its traffic flow.

Let’s first define the traffic signal control problem (TSCP). Consider the intersection in the following figure. There are some lanes entering and some leaving the intersection, shown with \(l_1^{in}, \dots, l_6^{in}\) and \(l_1^{out}, \dots, l_6^{out}\), respectively. Also, six sets *v^{1}* ...

There are two main approaches to controlling signalized intersections: conventional and adaptive methods. In the former, rule-based fixed cycles and phase times are determined a priori, offline, based on historical measurements as well as some assumptions about the underlying problem structure. However, traffic behavior changes dynamically, which makes most conventional methods highly inefficient. In adaptive methods, decisions are made based on the current state of the intersection. In this category, methods like Self-Organizing Traffic Light control (SOTL) and MaxPressure brought considerable improvements in traffic signal control; nonetheless, they are short-sighted and do not consider the long-term effects of their decisions on the traffic. Besides, these methods do not use feedback from previous actions to make more efficient decisions.

Consider an environment and an agent interacting with each other over several time-steps. At each time-step *t*, the agent observes the state of the system, *s_t*, takes an action, *a_t*, and receives a reward, *r_t*, while the environment transitions to the next state, *s_{t+1}*. The agent's goal is to choose actions that maximize the cumulative reward over time.
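That interaction loop can be sketched in a few lines of Python; the four-phase "environment" and the random policy below are toy stand-ins for illustration, not the model discussed in this post.

```python
import random

# Minimal agent-environment loop: observe state, act, receive reward, repeat.
random.seed(0)

def env_step(state, action):
    # Toy dynamics: reward is higher when the chosen phase matches the
    # (randomly evolving) busiest approach, encoded as the state.
    reward = 1.0 if action == state else 0.0
    next_state = random.randrange(4)
    return next_state, reward

state = 0
total_reward = 0.0
for t in range(100):
    action = random.randrange(4)             # agent takes an action a_t
    state, reward = env_step(state, action)  # environment returns s_{t+1}, r_t
    total_reward += reward                   # agent accumulates reward
```

A learning agent would replace the random action choice with a policy that is updated from the observed rewards.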

Several reinforcement learning (RL) models have been proposed to address these shortcomings. However, they need to train a new policy for every new intersection or new traffic pattern. For example, if a policy *π* is trained for an intersection with 12 lanes, it cannot be used in an intersection with 13 lanes. Similarly, if the number of phases differs between two intersections, even when the number of lanes is the same, the policy of one does not work for the other.

Similarly, a policy trained for the noon traffic peak does not work at other times of the day.

The main reason is that different intersections have different numbers of inputs and outputs, so a model trained for one intersection does not work for another.

We propose AttendLight to train a single universal model that can be used for any intersection with **any number** of roads, lanes, phases, and any traffic flow. To achieve this, we use two attention models: (i) State-Attention, which handles different numbers of roads/lanes by extracting a meaningful phase representation \(z_p^t\) for every phase *p*; and (ii) Action-Attention, which selects the next phase in an intersection with any number of phases. As a result, AttendLight does not need to be retrained for new intersections or traffic data.
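As a rough illustration of why attention removes the fixed-input-size restriction, here is a minimal Python sketch of attention pooling: each lane's features are scored, the scores are softmax-normalized, and the lanes are averaged into a representation whose size does not depend on how many lanes there are. The weights and features are made-up numbers, not AttendLight's parameters.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_pool(lane_features, score_weight):
    # One scalar score per lane (a dot product with a shared weight vector),
    # then a weighted average of the lane feature vectors.
    scores = [sum(w * f for w, f in zip(score_weight, lane)) for lane in lane_features]
    alphas = softmax(scores)
    dim = len(lane_features[0])
    return [sum(a * lane[d] for a, lane in zip(alphas, lane_features)) for d in range(dim)]

w = [0.5, -0.2]
z3 = attention_pool([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]], w)  # 3 lanes
z5 = attention_pool([[1.0, 0.0]] * 5, w)                      # 5 lanes
print(len(z3) == len(z5) == 2)  # True: same output size either way
```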

We explored 11 intersection topologies, with real-world traffic data from Atlanta and Hangzhou and synthetic traffic data with different congestion rates, resulting in 112 intersection instances. We followed two training regimes: (i) a single-env regime, in which we train and test on single intersections, with the goal of comparing the performance of AttendLight against current state-of-the-art algorithms; and (ii) a multi-env regime, where the goal is to train a **single universal** policy that works for any new intersection and traffic data with no re-training. For the multi-env regime, we train on 42 training instances and test on 70 unseen instances.

AttendLight achieves the best result in 107 of the 112 cases (96%). Also, averaged over the 112 cases, AttendLight yields improvements of 46%, 39%, 34%, 16%, and 9% over FixedTime, MaxPressure, SOTL, DQTSC-M, and FRAP, respectively. The following figure shows the comparison of results on four intersections.

There is no RL algorithm in the literature with the same capability, so we compare the AttendLight multi-env regime against single-env policies. Averaged over the 112 cases, AttendLight yields improvements of 39%, 32%, 26%, 5%, and -3% over FixedTime, MaxPressure, SOTL, DQTSC-M, and FRAP, respectively. Note that here we compare the single policy obtained by the AttendLight model, trained on 42 intersection instances and tested on 70 testing instances, whereas SOTL, DQTSC-M, and FRAP each use up to 112 optimized policies (where applicable), one for each intersection. The improvement on each instance *m* is measured as \(\rho_m = \frac{a_m - b_m}{\max(a_m, b_m)}\).

With AttendLight, we train a single policy that can be used for any new intersection with any new configuration and traffic data. In addition, this framework can be applied to *Assemble-to-Order Systems, Dynamic Matching Problems, and Wireless Resource Allocation* with little or no modification. See the paper for more details!

- Full text of the paper
- Free trial: SAS Visual Data Mining and Machine Learning
- Product: SAS Visual Data Mining and Machine Learning


The post Which machine learning algorithm should I use? appeared first on The SAS Data Science Blog.

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including:

- The size, quality, and nature of data.
- The available computational time.
- The urgency of the task.
- What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform best before trying different algorithms. We are not advocating a one-and-done approach, but we do hope to provide some guidance on which algorithms to try first depending on some clear factors.

The **machine learning algorithm cheat sheet** helps you to choose from a variety of machine learning algorithms to find the appropriate algorithm for your specific problems. This article walks you through the process of how to use the sheet.

Since the cheat sheet is designed for beginner data scientists and analysts, we will make some simplified assumptions when talking about the algorithms.

The algorithms recommended here result from compiled feedback and tips from several data scientists, machine learning experts, and developers. There are several issues on which we have not reached an agreement, and for those issues we try to highlight the commonalities and reconcile the differences.

Additional algorithms will be added later as our library grows to encompass a more complete set of available methods.

Read the path and algorithm labels on the chart as "If *<path label>* then use *<algorithm>*." For example:

- If you want to perform dimension reduction then use principal component analysis.
- If you need a numeric prediction quickly, use decision trees or linear regression.
- If you need a hierarchical result, use hierarchical clustering.

Sometimes more than one branch will apply, and other times none of them will be a perfect match. It’s important to remember these paths are intended to be rule-of-thumb recommendations, so some of the recommendations are not exact. Several data scientists I talked with said that the only sure way to find the very best algorithm is to try all of them.

This section provides an overview of the most popular types of machine learning. If you’re familiar with these categories and want to move on to discussing specific algorithms, you can skip this section and go to “When to use specific algorithms” below.

Supervised learning algorithms make predictions based on a set of examples. For example, historical sales can be used to estimate future prices. With supervised learning, you have an input variable that consists of labeled training data and a desired output variable. You use an algorithm to analyze the training data to learn the function that maps the input to the output. This inferred function maps new, unknown examples by generalizing from the training data to anticipate results in unseen situations.

- **Classification:** When the data are being used to predict a categorical variable, supervised learning is also called classification. This is the case when assigning a label or indicator, such as dog or cat, to an image. When there are only two labels, this is called binary classification. When there are more than two categories, the problem is called multi-class classification.
- **Regression:** When predicting continuous values, the problem becomes a regression problem.
- **Forecasting:** This is the process of making predictions about the future based on past and present data. It is most commonly used to analyze trends. A common example might be estimating next year's sales based on the sales of the current year and previous years.

The challenge with supervised learning is that labeling data can be expensive and time consuming. If labels are limited, you can use unlabeled examples to enhance supervised learning. Because the machine is not fully supervised in this case, we say the machine is semi-supervised. With semi-supervised learning, you use unlabeled examples with a small amount of labeled data to improve the learning accuracy.

When performing unsupervised learning, the machine is presented with totally unlabeled data. It is asked to discover the intrinsic patterns that underlie the data, such as a clustering structure, a low-dimensional manifold, or a sparse tree or graph.

- **Clustering:** Grouping a set of data examples so that examples in one group (or one cluster) are more similar (according to some criteria) than those in other groups. This is often used to segment the whole dataset into several groups. Analysis can be performed in each group to help users to find intrinsic patterns.
- **Dimension reduction:** Reducing the number of variables under consideration. In many applications, the raw data have very high-dimensional features and some features are redundant or irrelevant to the task. Reducing the dimensionality helps to find the true, latent relationship.

Reinforcement learning analyzes and optimizes the behavior of an agent based on the feedback from the environment. Machines try different scenarios to discover which actions yield the greatest reward, rather than being told which actions to take. Trial and error and delayed reward distinguish reinforcement learning from other techniques.

When choosing an algorithm, always take these aspects into account: accuracy, training time and ease of use. Many users put the accuracy first, while beginners tend to focus on algorithms they know best.

When presented with a dataset, the first thing to consider is how to obtain results, no matter what those results might look like. Beginners tend to choose algorithms that are easy to implement and can obtain results quickly. This works fine, as long as it is just the first step in the process. Once you obtain some results and become familiar with the data, you may spend more time using more sophisticated algorithms to strengthen your understanding of the data, hence further improving the results.

Even in this stage, the best algorithms might not be the methods that have achieved the highest reported accuracy, as an algorithm usually requires careful tuning and extensive training to obtain its best achievable performance.

Looking more closely at individual algorithms can help you understand what they provide and how they are used. These descriptions provide more details and give additional tips for when to use specific algorithms, in alignment with the cheat sheet.

Linear regression is an approach for modeling the relationship between a continuous dependent variable \(y\) and one or more predictors \(X\). The relationship between \(y\) and \(X\) can be linearly modeled as \(y=\beta^TX+\epsilon\). Given the training examples \(\{x_i,y_i\}_{i=1}^N\), the parameter vector \(\beta\) can be learned.
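To make the least-squares idea concrete, here is a minimal pure-Python sketch with made-up data, using a single predictor rather than the general vector form above:

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = b0 + b1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

# Noise-free data generated from y = 1 + 2x, so OLS recovers it exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_simple_ols(xs, ys)   # -> (1.0, 2.0)
```

With noisy data the recovered coefficients would only approximate the generating ones, but the closed form is the same.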

If the dependent variable is not continuous but categorical, linear regression can be transformed to logistic regression using a logit link function. Logistic regression is a simple, fast, yet powerful classification algorithm. Here we discuss the binary case, where the dependent variable \(y\) only takes binary values \(\{y_i\in\{-1,1\}\}_{i=1}^N\) (this can be easily extended to multi-class classification problems).

In logistic regression we use a different hypothesis class to try to predict the probability that a given example belongs to the "1" class versus the probability that it belongs to the "-1" class. Specifically, we will try to learn a function of the form \(p(y_i=1|x_i )=\sigma(\beta^T x_i )\) and \(p(y_i=-1|x_i )=1-\sigma(\beta^T x_i )\). Here \(\sigma(x)=\frac{1}{1+exp(-x)}\) is a sigmoid function. Given the training examples \(\{x_i,y_i\}_{i=1}^N\), the parameter vector \(\beta\) can be learned by maximizing the log-likelihood of \(\beta\) given the data set.
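Here is an illustrative sketch of that maximum-likelihood fit, using plain gradient ascent on a made-up one-dimensional data set (a sketch of the idea, not a production implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=200):
    """Gradient ascent on the log-likelihood sum(log sigmoid(y_i*(b0 + b1*x_i)))."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            s = sigmoid(-y * (b0 + b1 * x))   # 1 - p(correct label)
            g0 += y * s
            g1 += y * s * x
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Made-up 1-D data: negative class below zero, positive class above.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [-1, -1, -1, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
preds = [1 if sigmoid(b0 + b1 * x) > 0.5 else -1 for x in xs]
```

On separable toy data like this, the fitted slope is positive and every training point is classified correctly.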

A support vector machine (SVM) training algorithm finds the classifier represented by the normal vector \(w\) and bias \(b\) of the hyperplane. This hyperplane (boundary) separates different classes by as wide a margin as possible. The problem can be converted into a constrained optimization problem:

\begin{equation*}
\begin{aligned}
& \underset{w}{\text{minimize}} & & ||w|| \\
& \text{subject to} & & y_i(w^T X_i-b) \geq 1, \; i = 1, \ldots, n.
\end{aligned}
\end{equation*}


When the classes are not linearly separable, a kernel trick can be used to map a non-linearly separable space into a higher dimension linearly separable space.
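To illustrate the objective above, here is a toy subgradient-descent sketch of a soft-margin linear SVM (the hinge-loss formulation, with invented one-dimensional data; real solvers use far more sophisticated optimization):

```python
def fit_linear_svm(xs, ys, lam=0.01, lr=0.1, steps=500):
    """Subgradient descent on lam/2 * w^2 + mean(max(0, 1 - y*(w*x - b)))."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        gw, gb = lam * w, 0.0
        for x, y in zip(xs, ys):
            if y * (w * x - b) < 1:     # point is inside the margin
                gw -= y * x / n
                gb += y / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Made-up linearly separable 1-D data.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]
w, b = fit_linear_svm(xs, ys)
preds = [1 if w * x - b > 0 else -1 for x in xs]
```

The \(\lambda\) term keeps \(||w||\) small (a wide margin) while the hinge term penalizes margin violations; on this toy data the learned boundary separates the classes perfectly.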

When most dependent variables are numeric, logistic regression and SVM should be the first try for classification. These models are easy to implement, their parameters are easy to tune, and their performance is also pretty good. So these models are appropriate for beginners.

Decision trees, random forest, and gradient boosting are all algorithms based on decision trees. There are many variants of decision trees, but they all do the same thing – subdivide the feature space into regions with mostly the same label. Decision trees are easy to understand and implement. However, they tend to overfit data when we exhaust the branches and go very deep with the trees. Random forest and gradient boosting are two popular ways to use tree algorithms to achieve good accuracy as well as overcome the overfitting problem.
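The bagging idea behind random forests can be sketched with one-split trees (decision stumps) fit on bootstrap samples and combined by majority vote; the data and the number of stumps here are made up:

```python
import random

def fit_stump(xs, ys):
    """One-split tree: predict +1 when x >= t; pick the t with fewest errors."""
    best_err, best_t = len(xs) + 1, None
    for t in sorted(set(xs)):
        err = sum(1 for x, y in zip(xs, ys) if (1 if x >= t else -1) != y)
        if err < best_err:
            best_err, best_t = err, t
    return best_t

def forest_predict(stumps, x):
    """Majority vote across all bagged stumps."""
    votes = sum(1 if x >= t else -1 for t in stumps)
    return 1 if votes > 0 else -1

random.seed(0)
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]   # made-up feature values
ys = [-1, -1, -1, 1, 1, 1]            # class labels
stumps = []
for _ in range(25):                   # 25 bootstrap samples -> 25 stumps
    idx = [random.randrange(len(xs)) for _ in range(len(xs))]
    stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
```

Each stump alone is a weak, high-bias learner; averaging many of them over resampled data reduces variance, which is the essence of why random forests resist overfitting.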

Neural networks flourished in the mid-1980s due to their parallel and distributed processing ability. But research in this field was impeded by the ineffectiveness of the back-propagation training algorithm that is widely used to optimize the parameters of neural networks. Support vector machines (SVM) and other simpler models, which can be easily trained by solving convex optimization problems, gradually replaced neural networks in machine learning.

In recent years, new and improved training techniques such as unsupervised pre-training and layer-wise greedy training have led to a resurgence of interest in neural networks. Increasingly powerful computational capabilities, such as graphical processing unit (GPU) and massively parallel processing (MPP), have also spurred the revived adoption of neural networks. The resurgent research in neural networks has given rise to the invention of models with thousands of layers.

In other words, shallow neural networks have evolved into deep learning neural networks. Deep neural networks have been very successful for supervised learning. When used for speech and image recognition, deep learning performs as well as, or even better than, humans. Applied to unsupervised learning tasks, such as feature extraction, deep learning also extracts features from raw images or speech with much less human intervention.

A neural network consists of three parts: input layer, hidden layers and output layer. The training samples define the input and output layers. When the output layer is a categorical variable, then the neural network is a way to address classification problems. When the output layer is a continuous variable, then the network can be used to do regression. When the output layer is the same as the input layer, the network can be used to extract intrinsic features. The number of hidden layers defines the model complexity and modeling capacity.
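A minimal sketch of such a network (one input, two hidden tanh units, and one linear output), trained by backpropagation on a made-up regression task; this is illustrative only, not how production frameworks implement training:

```python
import math, random

def mlp_forward(x, params):
    """One hidden layer (2 tanh units) feeding one linear output."""
    w1, b1, w2, b2 = params
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(2)]
    return w2[0] * h[0] + w2[1] * h[1] + b2, h

def train_mlp(data, lr=0.05, epochs=500):
    random.seed(1)
    params = [[random.uniform(-1, 1), random.uniform(-1, 1)],  # w1: input->hidden
              [0.0, 0.0],                                      # b1
              [random.uniform(-1, 1), random.uniform(-1, 1)],  # w2: hidden->output
              0.0]                                             # b2
    for _ in range(epochs):
        for x, y in data:
            out, h = mlp_forward(x, params)
            err = out - y                          # gradient of 0.5*(out - y)**2
            w1, b1, w2, _ = params
            for j in range(2):
                dh = err * w2[j] * (1 - h[j] ** 2)  # backprop through tanh
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * dh * x
                b1[j] -= lr * dh
            params[3] -= lr * err                  # output bias
    return params

data = [(x / 4.0, math.sin(x / 4.0)) for x in range(-8, 9)]  # regress y = sin(x)
params = train_mlp(data)
loss = sum((mlp_forward(x, params)[0] - y) ** 2 for x, y in data) / len(data)
```

After training, the mean squared error is well below that of the trivial predict-zero baseline, showing that even a two-unit hidden layer captures the nonlinearity.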

K-means/k-modes and GMM (Gaussian mixture model) clustering aim to partition n observations into k clusters. K-means defines a hard assignment: each sample is associated with one and only one cluster. GMM, however, defines a soft assignment: each sample has a probability of being associated with each cluster. Both algorithms are simple and fast enough for clustering when the number of clusters k is given.
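The hard-versus-soft distinction can be sketched in a few lines of Python; the data, starting centers, and the shared standard deviation here are all made up:

```python
import math

def kmeans_1d(points, centers, iters=10):
    """K-means hard assignment: each point belongs only to its nearest center."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def soft_assignment(p, centers, sigma=1.0):
    """GMM-style responsibilities: a probability for every cluster (equal weights)."""
    dens = [math.exp(-((p - m) ** 2) / (2 * sigma ** 2)) for m in centers]
    return [d / sum(dens) for d in dens]

points = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]       # two obvious groups
centers = kmeans_1d(points, [0.0, 5.0])        # hard assignment -> [0.2, 9.8]
probs = soft_assignment(5.1, centers)          # in-between point: split membership
```

K-means forces the in-between point wholly into one cluster, while the soft assignment reports a nonzero probability for both.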


When the number of clusters k is not given, DBSCAN (density-based spatial clustering of applications with noise) can be used; it forms clusters by connecting samples through density diffusion.

Hierarchical clustering partitions can be visualized by using a tree structure (a dendrogram). It does not need the number of clusters as an input, and the partitions can be viewed at different levels of granularity (that is, clusters can be refined or coarsened) by cutting the tree at different levels.
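A toy sketch of the agglomerative (single-linkage) idea: repeatedly merge the two closest clusters and record the merge distances, which are exactly what a dendrogram draws (the points here are invented):

```python
def single_linkage(points):
    """Repeatedly merge the two closest clusters; record each merge distance."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None                    # (distance, i, j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Two tight groups and one outlier: large jumps in merge distance mark
# the natural cluster boundaries.
merges = single_linkage([0.0, 0.3, 5.0, 5.4, 20.0])
```

Cutting the dendrogram just below the largest merge distance yields two clusters plus the outlier; cutting lower yields finer partitions.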

We generally do not want to feed a large number of features directly into a machine learning algorithm, since some features may be irrelevant or the “intrinsic” dimensionality may be smaller than the number of features. Principal component analysis (PCA), singular value decomposition (SVD), and latent Dirichlet allocation (LDA) can all be used to perform dimension reduction.

PCA is an unsupervised dimension reduction method which maps the original data space into a lower-dimensional space while preserving as much information as possible. PCA basically finds a subspace that best preserves the data variance, with the subspace defined by the dominant eigenvectors of the data’s covariance matrix.

The SVD is related to PCA in the sense that SVD of the centered data matrix (features versus samples) provides the dominant left singular vectors that define the same subspace as found by PCA. However, SVD is a more versatile technique as it can also do things that PCA may not do. For example, the SVD of a user-versus-movie matrix is able to extract the user profiles and movie profiles which can be used in a recommendation system. In addition, SVD is also widely used as a topic modeling tool, known as latent semantic analysis, in natural language processing (NLP).
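The connection between PCA and the dominant eigenvectors can be illustrated with power iteration on a small, made-up covariance matrix (an illustrative sketch, not how PCA or SVD is computed in practice at scale):

```python
import math

def dominant_eigenvector(cov, iters=100):
    """Power iteration: repeatedly multiply by the matrix and renormalize."""
    v = [1.0, 0.0]
    for _ in range(iters):
        w = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    return v

# Made-up covariance of 2-D data stretched along the y = x direction.
cov = [[2.0, 1.8],
       [1.8, 2.0]]
v = dominant_eigenvector(cov)   # first principal direction, ~(0.707, 0.707)
```

The recovered direction points along y = x, the axis of greatest variance, which is what the first principal component captures.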

A related technique in NLP is latent Dirichlet allocation (LDA). LDA is a probabilistic topic model that decomposes documents into topics in a similar way as a Gaussian mixture model (GMM) decomposes continuous data into Gaussian densities. Unlike the GMM, LDA models discrete data (words in documents), and it constrains the topics to be *a priori* distributed according to a Dirichlet distribution.

The workflow is easy to follow. The takeaway messages when trying to solve a new problem are:

- Define the problem. What problems do you want to solve?
- Start simple. Be familiar with the data and the baseline results.
- Then try something more complicated.

SAS Visual Data Mining and Machine Learning provides a good platform for beginners to learn machine learning and apply machine learning methods to their problems. Sign up for a free trial today!


The post 3 steps to better models with SAS and Azure Synapse appeared first on The SAS Data Science Blog.

The platform combines the power of Azure Synapse and SAS® Viya® to offer a complete data and analytics solution. Azure Synapse is a unified platform for analytics, blending big data, data warehousing and data integration into a single cloud native service. It eliminates the silos between databases and data lakes and empowers customers to analyze any data at any scale.

The integrated web studio development environment enables developers to ingest, prepare, manage and serve data for scenarios ranging from descriptive reports to predictive machine learning. SAS Viya is a cloud-native AI, analytics and data management platform that runs on a modern, scalable architecture. It’s designed to deliver better decisions, maximum value and trusted outcomes, regardless of the size or type of data, the algorithm used, or how the analytics are deployed.

SAS’ integration with Azure Synapse starts with connectivity and extends to native in-engine operationalization of models within the Synapse SQL engine. SAS Viya addresses the entire scope of analytics requirements, including machine learning, text analytics, computer vision, forecasting, econometrics and optimization.

Running natively on Microsoft Azure, SAS Viya scales to fit the scope of all analytics challenges, from experimental to mission critical. When combined with Azure Synapse, it’s easy to rapidly operationalize insights across the entire organization, enabling everyone to be more productive with data. SAS Viya and Azure Synapse empower everyone – data scientists, business analysts, developers and executives alike – to collaborate and realize innovative results faster.

The alignment of SAS Viya and Azure Synapse provides the data scientist community with a comprehensive analytical cloud-native environment to create, facilitate and manage the entire analytics life cycle.

**Step 1:**

First, data must be identified, accessed and consolidated for use. Azure Synapse has robust data integration capabilities, including over 90 connectors to relational and non-relational databases as well as SaaS applications, making it easy to load data. This is where the magic of the partnership begins.

**Step 2:**

Once a data pipeline has been completed in Azure Synapse, SAS Viya can seamlessly access the data set inside of the Azure Synapse environment.

**Step 3:**

With SAS Viya, data scientists can build and generate automatic model pipelines. They can also easily weave open source R and Python models into the modeling pipelines and consider them in the modeling comparison exercise to identify the champion model. SAS provides natural language explanations of the model assessments, including model interpretability, so it’s easy for data scientists and business analysts to understand why one model was chosen over another, which provides transparency and trust in the outcome.

Perhaps the most important part of any analytics effort is getting models into production so that they can be used to drive decisions. And once they’re in production, it’s critical to understand the model health and performance. SAS Model Manager on Viya is the perfect solution for registering, deploying and monitoring the well-being of these models. It provides for back testing and model tracking over time to ensure that when a model begins to decay, it’s refreshed, retrained or replaced to maintain optimal performance. Open source models get the same treatment, providing governance of all models.

When the best model has been identified it can be quickly published into production via model scoring APIs. This is key because the analytical scoring can be executed in-engine with Microsoft Azure Synapse, providing a highly scalable solution for calculating millions of predictions without the overhead of external API calls. Users with basic SQL skills are now empowered with analytical predictions. And because it’s just SQL, downstream systems or applications can also easily integrate to consume these predictions.

The combination of SAS and Microsoft in Azure Synapse provides data scientists with more options for methods, governance and scalability. The alignment brings a superior unified analytical platform not seen anywhere else in the market.

To learn more, watch the SAS and Azure demo below or visit our SAS and Microsoft partner site.


The post Using data visualization to solve a global cybersecurity incident appeared first on The SAS Data Science Blog.

This exercise was part of the IEEE Visual Analytics Science and Technology (VAST) Challenge. A team of SAS volunteers submitted a solution for Mini Challenge 1, which required us to analyze synthetic network data related to a worldwide cyber event. As part of the challenge, we were asked to use Center for Global Cyber Strategy (CGCS) data to identify candidate groups that authorities could approach for assistance in restoring the internet. Provided data included a very large main graph (>100 million edges) and smaller seed graphs to be used to find matches. So we assembled a team and deployed a SAS® Viya® environment to take on the challenge.

We used SAS® Visual Analytics and the NETWORK procedure from SAS® Visual Data Mining and Machine Learning to tackle the problem. Our process involved combining visualizations and network algorithms to uncover patterns that discriminate between candidate matches and the template graph of interest to locate these patterns in the full graph. We describe the techniques and strategies for solving this problem in more detail and discuss potential future improvements here and in the video below.

Pattern matching, the process of finding instances of a subgraph (query graph) in a larger network, is a problem that has applications in many areas, including social network analysis, fraud detection, and biology. For this challenge, we utilized the pattern matching capabilities in SAS Visual Data Mining and Machine Learning to help identify groups that most closely resemble the hacker profile template graph.

A core part of our solution leveraged network analytics from SAS Visual Data Mining and Machine Learning 8.5. Network Analytics offers a wide range of functionality for analyzing networks, and we utilized different algorithms for this challenge. Most significantly, we relied on the PATTERNMATCH statement, which can be used to find all the subgraphs that are the same as or similar to a given pattern graph in a data graph. Through the use of the SAS Function Compiler (FCMP), it gives the user the ability to specify a set of functions to add user-defined conditions that the subgraph must satisfy to be considered a match.

This functionality enables exact node and link attribute matching, such as requiring that the weight of a link be greater than some specified value. In addition, global conditions can be specified, such as requiring that the timestamp on a communication link be within a week of the timestamp on a travel link.

We examined the network structure and observed the initial differences between the candidate graphs and the template graph with SAS Visual Analytics. To quantify these differences, we systematically compared the candidate and template graphs by using the PATTERNMATCH statement. We created a set of relatively simple subgraphs for the different link types (for example, a buy/sell action between two people) and incrementally added complexity to these patterns. Then, we used the PATTERNMATCH statement to find the number of matches to these patterns in the candidate and template graphs.

*Video submission VAST 2020 (YouTube)*

As expected, we were able to determine that the large graph contained no subgraph that exactly matched the template graph. So, to locate potential template matches in the full graph, we took patterns that occurred in the template graph, searched for those patterns in the full graph, and then examined the network around those patterns. To make subsequent visualization and analysis more tractable, we focused on patterns that were rare in the full graph.

To complete the challenge, we answered all questions provided and submitted a comprehensive document explaining every step in our analysis, including the final identification of the group responsible for the outage.

Our work earned us an honorable mention certificate, and we were invited to present our solution at this year's VAST Challenge 2020 workshop - giving us the opportunity to highlight our approach and answer questions from the panel.

The VAST Challenge provides a great opportunity to validate our software against real-world scenarios and complex data sets. Not only do we learn from these projects, but we also send feedback to our development teams to further improve product capabilities for customers.

Finding a solution to this problem wouldn't have been possible without the commitment and technical expertise of each individual. In particular, **Steven Harenberg** and **Matthew Galati** spent countless hours analyzing the graph data and making use of SAS Viya's excellent network analysis capabilities. **Falko Schulz** used SAS Visual Analytics to explore and visualize the data in order to tell a complete story and focus on the Mini Challenge related questions. Thanks to **Riley Benson** and **Rajiv Ramarajan** for their guidance during the project, which involved compiling results, writing papers and presenting the solution. Also, huge thanks to **Rachel Nisbet**, **Shaun Kurian** and **Jesse Olley** for their willingness and effort in compiling a beautiful video summary. None of this would have been possible without each person.

Thanks again to the entire SAS team!

- VAST Challenge 2020
- Visual Analytics Benchmark Repository
- YouTube - Submission Video
- SAS Institute Inc. SAS Visual Data Mining and Machine Learning. (Online.) 2020.
- Matthew Galati and Steve Harenberg. “Introducing Pattern Matching for Graph Queries in SAS Viya 3.4.”
*Proceedings of the SAS Global Forum 2019 Conference*. SAS Institute Inc., Cary, NC. 2019.
- SAS Institute Inc. SAS Visual Analytics. (Online.) 2020.


The post The art and science of finding answers in connected data appeared first on The SAS Data Science Blog.

Network science is a mature but growing field that provides insights based on the known or inferred connectivity in data. In this post, I will introduce the *NETWORK* and *OPTNETWORK* procedures, SAS® Viya software’s toolkit for working with networks, including hands-on examples. While these examples use social media data, network analytics can be used to analyze networks of every type - solving crimes, understanding the spread of disease, building community structures, and much more.

For the full working example code, please visit this GitHub link.

The first step in analyzing connected data is often data modeling. This step consists primarily of two tasks. The first is identifying which entities in your data to represent as nodes. The second task is identifying which associations in your data to represent as links. Furthermore, any data fields that provide additional information about the nodes or links that will be relevant to your analysis can be added to the graph as attributes.

The example presented above illustrates the format of some raw social network data. For this analysis, the nodes are chosen to be social media pages of government-related entities (A through E). For links, we consider two nodes to be associated if one or more sampled users "like" both pages. In our chosen data model, these page-page links (the dashed lines) are the links to be considered in the subsequent analysis.

In the data set available for download here, the data has already been prepared in two files. The nodes are listed in file “fb-pages-government.nodes” and the links are listed in the file “fb-pages-government.edges”. When your data is in a format similar to these, with nodes and their attributes in a delimited file or database table, and edges and their attributes in another delimited file or database table, you can import and directly run various types of network analytics by using SAS® Viya.

Let’s take a look at the network data and explore some of the analysis possible with PROC NETWORK (or, equivalently, the Network action set).

The first step is to import the links and nodes data tables by using the DATA step.

Next, we show a rendering of the entire network:

With over 7,000 nodes and 89,000 edges, it is very difficult to gather insights from the whole raw graph. A plausible first step might be to perform community detection, one of many network science capabilities included with SAS® Visual Data Mining and Machine Learning. By using community detection, you can break a graph down into more manageable subgraphs. The goal is to define communities that have a dense count of links within each community relative to the count of links between disparate communities.

You can use the COMMUNITY statement in PROC NETWORK to perform community detection. Here are the nine largest communities visualized by separate node colors overlaid on the whole graph:

And here are the same nine communities visualized separately:

Once we start looking at individual communities, the network visualizations can become far more meaningful.

One useful technique for quantifying the relative importance of each node in a network is to use any of the many centrality metrics, one of which is PageRank centrality. This algorithm produces rank values for each node that are proportional to the sum of neighboring nodes’ rank values. With PROC NETWORK, you can compute the PageRank by using the CENTRALITY statement.
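The idea behind PageRank can be sketched with a few lines of power iteration; the graph below is a tiny invented stand-in for the community discussed next (PROC NETWORK computes this at scale):

```python
def pagerank(links, d=0.85, iters=50):
    """links: node -> list of outbound neighbors. Returns a rank per node."""
    n = len(links)
    rank = {u: 1.0 / n for u in links}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in links}
        for u, outs in links.items():
            share = d * rank[u] / len(outs)   # split rank among out-links
            for v in outs:
                new[v] += share
        rank = new
    return rank

# Invented hub-and-spoke graph: every local page links to the hub and back.
links = {
    "fema": ["local1", "local2", "local3"],
    "local1": ["fema"],
    "local2": ["fema"],
    "local3": ["fema"],
}
rank = pagerank(links)   # "fema" ends up with the highest rank
```

Because each node's rank is fed by its neighbors' ranks, the hub that everyone links to accumulates the most, mirroring the FEMA example below.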

Here is a visualization of a single community that appears to have an “emergency response” theme. The size of the nodes displayed is proportional to their PageRank centrality.

Note how the most central node, that is, the node with the highest PageRank centrality, is FEMA, a large-scale emergency response organization. On the other hand, many of the peripheral nodes are local emergency response organizations.

Here is another community that is mainly focused on Tunisian government pages:

Cliques are subgraphs that are even denser than communities. By definition, all nodes in a clique are linked to all other nodes within that clique. You can use cliques to find very strongly associated clusters in the government social media pages graph. With PROC NETWORK, you can find cliques containing eight or more nodes by using the CLIQUE statement.

Here is a visualization of one clique that connects nineteen European-Union-related pages.

And here is another clique, representing pages related to the US Armed Forces:

When comparing two nodes, it is often useful to consider similarities between the neighborhoods surrounding each node. Several methods for quantifying node similarity are available by using the PROC NETWORK NODESIMILARITY statement. For example, you can use node similarity to determine the pages that are most similar to “NOAA NWS National Hurricane Center”.

Here are the top 10 most similar nodes by Jaccard node similarity.
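Jaccard node similarity itself is simple to state: the size of the intersection of two nodes' neighborhoods divided by the size of their union. A sketch with invented page names:

```python
def jaccard_similarity(adj, u, v):
    """|N(u) & N(v)| / |N(u) | N(v)| over the two nodes' neighbor sets."""
    union = adj[u] | adj[v]
    if not union:
        return 0.0
    return len(adj[u] & adj[v]) / len(union)

# Invented pages and the sets of pages each is linked to.
adj = {
    "hurricane_center": {"noaa", "nws", "fema", "red_cross"},
    "storm_prediction": {"noaa", "nws", "fema", "usgs"},
    "tourism_board": {"museum", "airport"},
}
s_close = jaccard_similarity(adj, "hurricane_center", "storm_prediction")  # 3/5
s_far = jaccard_similarity(adj, "hurricane_center", "tourism_board")       # 0.0
```

Pages with heavily overlapping neighborhoods score near 1, while pages with nothing in common score 0.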

One capability that sets SAS® Viya apart is the ability to perform complex analysis and optimization on networks at scale. The next example considers a larger data set, which can be downloaded from here. In this data set, links represent posts on the Reddit platform from one subreddit community to another. Each link also has sentiment attributes that indicate whether the post has a positive or negative connotation. This data set contains approximately 54,000 nodes and 600,000 links but is light work for SAS Viya, as both the upcoming patternMatch and minimum spanning tree snippets run in under one second.

To analyze the Reddit sentiment network, you can use patternMatch to search for instances of patterns of interest (subgraphs) within the entire network. The illustrated pattern, or query graph, represents a topology of three subreddit nodes, A, B, and C, connected by six directed links. The subreddit A was linked in negative-sentiment posts from both B and C. The other links forming the subgraph, however, have positive sentiment.

Let’s search for this pattern throughout the entire network. Perhaps the subreddits that appear more frequently as node A in this pattern represent topics that typically receive heavy criticism.

You can invoke the pattern matching algorithm by using the PATTERNMATCH statement.

One of the matches found when searching the pattern of interest is depicted here:

Here, a link weight of positive one represents a post with positive sentiment, and negative one represents negative sentiment.

In all, 3,351 matches were found by patternMatch, and the query took just 0.6 seconds. In order to quantify the subreddits that are most prone to receive negative criticism, let’s define a score as \({C_{A}\over C}\), where \(C_{A}\) is the number of times a node appears in a match as node A, and \(C\) is the number of times a node appears in any match. The top ten scoring nodes (read: topics that we love to hate) are given in this table:
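The score \({C_{A}\over C}\) is just a ratio of counts over the match set. A sketch with invented subreddit names and three hypothetical matches:

```python
from collections import Counter

def node_a_scores(matches):
    """Score = (# matches where the node plays role A) / (# matches containing it)."""
    as_a = Counter(m["A"] for m in matches)
    in_any = Counter(node for m in matches for node in m.values())
    return {node: as_a[node] / in_any[node] for node in in_any}

# Three hypothetical matches of the three-node A-B-C pattern.
matches = [
    {"A": "politics", "B": "news", "C": "memes"},
    {"A": "politics", "B": "sports", "C": "news"},
    {"A": "sports", "B": "memes", "C": "news"},
]
scores = node_a_scores(matches)   # politics: 1.0, sports: 0.5, news: 0.0
```

A score near 1 means the node almost always plays the criticized role A whenever it appears in a match.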

Now you’ve seen a variety of ways to analyze and draw insight from connected data, so now what? To demonstrate optimization over the Reddit sentiment data set, imagine you want to deploy a word-of-mouth marketing campaign.

Let’s say studies have shown that subscribers to a given subreddit community are more likely to engage with ads that come from another subreddit when the subreddits are linked by posts with positive sentiment. You can use the minimum spanning tree algorithm to determine how to show the word-of-mouth ad to all subreddits, maximizing the total overall subreddit-to-subreddit sentiment. By choosing weights that are inversely proportional to overall sentiment, you can solve this problem with a minimization algorithm.

You can compute a minimum spanning tree by using the MINSPANTREE statement in PROC OPTNETWORK.
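Under the hood, a classic algorithm such as Kruskal's solves this kind of problem; here is an illustrative pure-Python sketch with invented subreddits and inverse-sentiment weights (not the PROC OPTNETWORK implementation):

```python
def kruskal_mst(nodes, edges):
    """edges: (weight, u, v) tuples. Returns the minimum spanning tree edges."""
    parent = {n: n for n in nodes}
    def find(n):                        # union-find with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    tree = []
    for w, u, v in sorted(edges):       # consider cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                    # edge joins two components: keep it
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Invented subreddits; weight = 1 / (overall positive sentiment), so the
# minimum-weight tree favors the most positive connections.
nodes = ["ads", "gaming", "music", "movies"]
edges = [(1 / 4.0, "ads", "gaming"), (1 / 2.0, "ads", "music"),
         (1 / 1.0, "gaming", "music"), (1 / 5.0, "music", "movies"),
         (1 / 3.0, "gaming", "movies")]
tree = kruskal_mst(nodes, edges)        # 3 edges spanning all 4 nodes
```

Because the weights are inversely proportional to sentiment, minimizing total weight maximizes the overall sentiment along the chosen posting routes.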

Here are the results, shown only for a single community, because the whole graph is too large to easily visualize. Each link represents a recommended word-of-mouth advertisement post linking from one subreddit community to another. For the entire Reddit data set, the minimum spanning tree action took about 0.15 seconds to complete.

These are just a handful of the SAS Viya capabilities that allow you to analyze network connections in your data. For more details and a complete list of algorithms, head over to the documentation pages for Network and Optnetwork.

For more detailed illustrations of the possibilities of network science, check out these recent articles, which highlight how SAS is helping customers combat the coronavirus pandemic using network analysis.

- Mobility tracing: Helping local authorities in the fight against COVID-19
- Jump-start COVID-19 research with text analytics


This is the sixth post in our series about statistics and analytics bringing peace of mind during the pandemic.

The post The art and science of finding answers in connected data appeared first on The SAS Data Science Blog.

The post Enhancing your Natural Language Processing: Intro to Conversational AI appeared first on The SAS Data Science Blog.

Conversational AI applies many of the techniques shared in our previous natural language processing articles like rules-based machine learning and the hybrid approach as well as text discovery tools like topic/concept extraction and sentiment analysis. Simply put, conversational AI enables a frictionless, 2-way dialogue with a machine where a human is able to receive a quick answer to a question or complete a task by using their natural language – either voice or text.

With recent advancements in conversational AI, chatbots and personal digital assistants have become mainstream. Many of you reading this likely have a smart speaker or another digital assistant in your home or with you at all times on your mobile phone. Along with these innovations, customer behavior has also shifted. Consumers expect “now” and personalized service. A poor user experience can have a real business impact, deterring a customer from using a product or service.

Conversational AI can offer an always-on, 24/7, fast, convenient experience that can go anywhere (your phone, computer, smart speakers, even your car). It can provide a human-like experience through real-time, personalized interaction with AI running in the background. This technology is being applied across many industries for a variety of use cases, both customer-facing and internal to an organization. Chatbots have emerged in telecom and financial services in the form of account management and real-time customer service. In IoT, we see personal assistants emerge in smart home devices and wearables.

Since chatbots can automatically query and describe large corporate or public data sets, organizations in the E-commerce and Retail industries use chatbots to provide personalized messaging and offers to consumers online.

Similarly, business users can request summarized or analyzed results by saying or typing through a conversational interface. For instance, “Which marketing campaigns are generating the most leads this quarter?” The chatbot can provide the answer and then offer additional information, a data visualization, or even suggest a related report to view based on patterns in the data or past queries.

At the heart of every bot is an NLP engine. Let’s take a quick look behind the scenes at what makes bots able to understand and interact with human language.

NLP is foundational to the conversational AI process. **Linguistic analysis** helps a machine understand text – whether written or spoken – essentially helping the machine recognize and understand the construct of a language. Through data mining and machine learning algorithms, the machine automatically extracts key features and relational concepts. Human input from linguistic rules adds to the process, enabling contextual comprehension or **Natural Language Understanding (NLU)** of content such as slang, sarcasm, and sentiment.

**Natural Language Interaction (NLI)** converts written or spoken natural language text into application-specific, executable code. In other words, it automatically maps a user’s command to the correct action or intent.

To help you get started, here are some key terms that are foundational to what comprises a chatbot:

- A **conversational experience** includes **dialogues** or **conversational flows**. Dialogues are a set of various routes and directions that a chatbot can take.
- An **intent** corresponds to a single goal that the user wants the bot to accomplish. A chatbot tries to match what you've asked to an intent that it understands. The more a chatbot communicates with you, the more it learns and understands.
- An intent can be expressed in many forms and manners. That is the beauty of language. The various unique cases of how these intents are expressed are known as **utterances**.

Examples of utterances include the different ways an intent is expressed – such as “What’s the 411”, or “I want info on the campaign”, or “Tell me the numbers”. In the image above we are looking for campaign data, but there are many ways of asking for the same thing.
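As a toy illustration of many utterances mapping onto one intent (far simpler than a real NLU engine, and with made-up intent names and patterns):

```python
import re

# Hypothetical intents, each signaled by a few keyword patterns. A real NLU
# engine learns these from training utterances rather than hand-written rules.
INTENTS = {
    "get_campaign_data": [r"\b411\b", r"\binfo\b", r"\bnumbers\b", r"\bcampaign\b"],
    "greet": [r"\bhello\b", r"\bhi\b"],
}

def match_intent(utterance):
    """Map an utterance to the intent whose patterns match it most often."""
    text = utterance.lower()
    best, hits = "unknown", 0
    for intent, patterns in INTENTS.items():
        n = sum(bool(re.search(p, text)) for p in patterns)
        if n > hits:
            best, hits = intent, n
    return best
```

All three example utterances ("What's the 411", "I want info on the campaign", "Tell me the numbers") land on the same get_campaign_data intent despite sharing almost no words.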

Regardless of how these utterances are expressed, they contain certain common pieces of information that are crucial to satisfying the intent.

These are the essential pieces that are needed for the chatbot to understand what you mean when you are asking it a question. If you are interested in learning more about Conversational AI and would like to see some demos of this technique in action, check out the final webinar in our series Enhancing your Natural Language Processing: Conversational AI. Thank you for reading!


The post Why data scientists should be looking at analyst evaluations appeared first on The SAS Data Science Blog.

There are a plethora of commercially available data science platforms in the market, but guidance on how to create a shortlist for consideration during the buying cycle is often lacking. That's when it becomes time to turn to analyst firms for help in researching data science platforms: they do all the hard work for you!

An analyst report offers an unbiased, side-by-side, third-party evaluation of the technology in the market. These analysts know how to put the vendors through their paces and require proof of any claims that are made. Interacting with analysts is hard work on both sides, as they require completion of an extensive RFI that makes my Ph.D. look like child's play. They also require demos to validate what was said in the RFI, as well as external points of view such as commentary from customers.

Below are summaries of four key analyst reports that I think should be on the reading list of data scientists and AI/analytics leaders who are considering the adoption of a commercially available data science platform.

The key message in this report is that data science and machine learning platforms must support model operationalization in addition to model building. This is also known as ModelOps. Why is this so important? On average, only half the analytics models built ever make it into production. That's right: 50%! That can be disheartening to analytics teams who pour their time and energy into modeling, only to never have those models see the light of day. In this report, SAS is acknowledged for its model operationalization and management platform, which includes model performance monitoring, model governance, and lineage.

Furthermore, Gartner included vendors who could support a diverse analytics team such as expert data scientists who code their models (in SAS or open-source) in notebooks, citizen data scientists who prefer a visual drag and drop interface to construct pipelines, and others such as data engineers, developers, and machine learning engineers. SAS is also called out for ease of use for its automated capabilities such as AutoML and automated suggestions for data quality and data prep.

What stands out in this report is the inclusion of a scorecard comparing vendors on their current offering, strategy, and market presence. (I also like the reference to the mother lode of intelligence that models can infuse into processes and applications.) Key messages in this report include augmenting data science via AutoML and ModelOps, and reviewing vendor roadmaps in terms of integration with other tools and technologies, as well as the ability to serve ever-expanding analytics teams. SAS is praised for its AutoML, its guided analytics, and its support for open-source programming languages so that users can take advantage of the powerful SAS engine.

If you are just getting started with AI, then this would be a good report to read first. IDC offers advice for organizations just getting started with developing AI applications. This report includes a discussion of a platform’s ability to analyze both structured and unstructured data, which is important for AI capabilities such as NLP and computer vision. IDC also touches on knowledge representation and machine learning at a high level. IDC recognizes SAS for its ability to support the creation of models that use both structured data and NLP or computer vision. IDC also mentions SAS’ visual interface to better support non-data scientists with AI and machine learning.

This report gets into detail about the specific tasks that data scientists perform, such as data access, data prep, feature engineering, building models, training models, tuning models, and deploying models. It also includes a thorough discussion of model management and model performance, topics that are getting more attention in recent years due to the need to deploy more and more models into production (which is not an easy feat, as mentioned previously). IDC recognizes SAS for supporting the end-to-end machine learning process with both a visual and a programming interface, which is appealing for organizations that wish to develop an analytics team with differing skill sets, including data scientists, business analysts, and application developers who want to take advantage of REST APIs.

These four evaluations feature SAS Viya with SAS Visual Data Mining and Machine Learning and SAS Model Manager as the key products.

These reports are a great way to educate yourself about the trends, vendors in these spaces, and their product capabilities. So, sit back, get a cup of coffee, and start reading! Did I mention that these reports are free reading provided to you by SAS? See the links in the sidebar. No registration is required to get the reports. Happy reading!


The post Your recipe for unlimited capacity starts in the cloud appeared first on The SAS Data Science Blog.

Regardless of how you store your data or how much there is, the goal is to build powerful analytical models using that data to provide answers quickly for better decision making. Today I'm fortunate to talk to Josh Griffin, a Senior Manager in Analytics R&D, about how we accomplish parallelization and optimal computation in SAS software.

**Udo: **Josh, you and your team are working on optimization routines, performance considerations, and parallelization of our software stack. Tell us a little bit about yourself and your responsibility at SAS.

**Josh:** My background before coming to SAS was in large-scale nonlinear nonconvex optimization and in derivative-free optimization. The former tends to come into play when doing things like training machine learning models, while the latter is critical for modern auto-tuning algorithms. Currently, my team and I work on building and maintaining many supporting routines ultimately surfaced in SAS statistical and machine learning products.

I believe that ultimately, all analytics vector to some specific optimization problem class. If that problem happens to be nonlinear and/or nonconvex, my team can help. In addition to nonlinear optimization, my team also has experts in linear algebra. This is key, as arguably all optimization further reduces to a series of linear algebra operations. How fast these operations can be executed plays a pivotal role in how fast the optimization and analytics can be performed. And of course, we are very interested in exploiting available cloud architecture. We want to make maximal use of available worker nodes, CPUs, and GPUs, while minimizing cache misses, memory transfers, and so on.

It is critical to consider and tailor code to modern architectures that we expect customers to have now or one day. Like a symphony, high-speed analytics needs all of these components synchronized and in harmony to do what we call "light up the grid": make optimal use of the customer's cloud resources by stepping up to this boundary without overprescribing or locking out any one node. I will admit this is tricky.

Over half of my team has had calls from our grid admin for "lighting up the grid" too long or too aggressively. As much as I like to keep our grid admin happy, I am proud my team knows how to do this. Just because a customer gives a routine 100 worker nodes doesn't mean that routine can use them. My team clearly can. So it is a bit of a balancing act, but thoroughly fun, like solving a never-ending stream of really challenging puzzles.

**Udo: **We have heard so much about AI and smart machines taking on the world. What’s your take on such claims?

**Josh:** Current AI approaches are all based on machine learning. Under the surface, machine learning models are found using various optimization routines. Thus models can be found more readily if you can parallelize these routines while leveraging new GPU and CPU chip architectures. Modern AI has indeed accomplished feats that are seemingly impossible. Part of the AI revolution comes from breakthrough theory and methods. Still, almost all of these methods would be hypothetical without the power that new chip architectures and cloud computing offer.

One of the most popular machine learning methods today, stochastic gradient descent (SGD), is around 70 years old. It is one of the simplest algorithms to describe and, from an optimization perspective, trivial to implement sequentially. Why is it so popular now? SGD beats more intricate optimization methods in part because modern cloud computing environments are very friendly to SGD iterations. This allows the algorithm to iterate remarkably quickly. Note that when moving from sequential SGD to parallel, its simplicity is utterly lost. Making it work well in parallel is like being a symphony conductor: a lot of different elements must work harmoniously together.
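To see just how simple sequential SGD is, here is a minimal sketch that fits a one-parameter linear model; the synthetic data and learning rate are chosen purely for illustration:

```python
import random

random.seed(0)
# Synthetic data drawn from y = 3x; SGD should recover the slope 3.
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(200))]

w = 0.0    # model parameter (slope)
lr = 0.1   # learning rate
for _ in range(5):                       # epochs
    random.shuffle(data)
    for x, y in data:                    # one gradient step per observation
        grad = 2.0 * (w * x - y) * x     # d/dw of the squared error on (x, y)
        w -= lr * grad
# w converges toward 3.0
```

Each update touches a single observation, which is exactly why cloud hardware can drive these iterations so fast; the conductor's job Josh describes is coordinating many such updates running at once.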

I completely agree with Jensen Huang's 2017 quote, "Software is eating the world, but AI is going to eat software." I think it is already doing so, but not in the way people think.

I had a mentor in undergraduate school once say to a fellow student frustrated by one of the coding exercises, "Computers do not do what you want them to; they do [exactly] what you tell them to." In a way, machine learning is changing that. For example, I can mistype every word in a search query and obtain identical results compared to perfect grammar and spelling. The underlying machine algorithm is very good at guessing my intentions. I guarantee you that the computer is still doing exactly what the developers told it.

So for me, Jensen’s quote simply means that future developers will be using AI to create powerful new products. Thus without an intimate knowledge of modern machine learning tools, developers will ultimately be left behind. Much like how the automobile started as a fantastic invention, but now we drive it daily without a thought.

**Udo: **We learned from an earlier post that cloud computing accelerates transformation. How do we make sure that our customers and users can take advantage of most modern architectures?

**Josh:** Having been at SAS for over a decade, I was a developer when our exciting transformation to fully distributed computing began. Since those days, we have worked very hard to ensure our software runs universally in modern cloud computing environments. We continually seek new and innovative ways to fully leverage the allotted cloud computing resources. In the early days, distributed computing was a niche area, relevant to an elite subgroup of well-funded university departments, national labs, and large-cap companies that had access to private compute clusters.

The advent of cloud computing has genuinely democratized distributed computing, making it available to almost anyone. There is, of course, an inherent time-based cost now with cloud computing resources. Users no longer necessarily pay a one-time fee and then own the resources indefinitely. Now they may "rent" computing power for a transient amount of time. In the old days, you wanted to reduce run time to make the customer happy. Now reducing run time saves the customer actual dollars: the faster you run, the more efficiently you run, and that translates directly into cost reduction. Modern cloud computing architectures open the door for everyone to benefit, and SAS developers have worked extremely hard in recent years to ensure our users can leverage these same opportunities.

**Udo: **SAS has invested quite some efforts in the parallelization of our algorithms. Can you share some insights on how we go about this?

**Josh:** I love working in the world of distributed and threaded computing – both the challenges and the possibilities. But I feel it is a mistake to think cloud computing and big data are conjoined entities. Yes, cloud computing is an excellent answer to big data, but it is also a perfect coupling with small data. It is easy to get tunnel vision. It can take a lot of effort to change the way you think and be open to new algorithms and ways of doing things. Indeed, there are only so many ways to slice a cake before it becomes impractical to divide calculations further. And of course, certain calculations are inherently sequential and thus cannot be computed simultaneously.

This circles back to the concept of tunnel vision; most problems we solve do not live in a vacuum. They are not solved as an end in themselves (as developers, we can forget this). They typically are part of a larger analysis pipeline. Cloud computing enables you to solve many variants of the pipeline in parallel to find what works best. That is, if you cannot divide the cake any further, stack the cake. You can parallelize job stacking very efficiently.

For example, say a customer has a small-data statistics problem to solve. Our analytics and statistics products come equipped with a lot of flexibility in how to explore data. What combinations are best? In the past, it was the user's job to find the needle in the haystack manually. We now have tools that can automate this search for them, and do it all in parallel.
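The "stack the cake" idea maps naturally onto parallel evaluation of pipeline variants. A schematic Python sketch follows; the scoring function is a stand-in for fitting a real pipeline, and the parameter names are invented:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_pipeline(params):
    """Stand-in for fitting and scoring one pipeline variant."""
    depth, lr = params
    # Pretend depth=4, lr=0.1 is the sweet spot; higher score is better.
    return -((depth - 4) ** 2) - (lr - 0.1) ** 2, params

grid = list(product([2, 4, 6, 8], [0.01, 0.1, 0.5]))      # 12 variants
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_pipeline, grid))          # evaluated concurrently
best_score, best_params = max(results)
```

Each variant is independent of the others, so the wall-clock time shrinks roughly in proportion to the number of workers, regardless of how small each individual problem is.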

So I would say the biggest effort is learning to look at the world of analytics through a separate lens. Once you do, you can see parallelism and cloud computing opportunities everywhere, regardless of data size or problem type. Of course, it takes effort to make these opportunities a reality, but that is the fun part for developers.


**Udo**: Can you provide some examples of problems we can solve today, which were unthinkable some years ago?

**Josh:** Two examples pop into my head. The first was when I solved my first support-vector machine (SVM) problem to global optimality with more than 2.2 billion observations spanning over a terabyte of data. We could handle many more observations than this, but the 2.2 billion threshold is special. Internally, data dimensions are often stored as 32-bit ints, and the largest such int is 2,147,483,647. I always held my breath when making the jump from 2 billion observations to 3 billion. When we passed this threshold, I breathed a sigh of relief.
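The wraparound behind that held breath is easy to demonstrate with a 32-bit counter; this is a pure-Python illustration of the failure mode, not SAS internals:

```python
import ctypes

INT32_MAX = 2_147_483_647            # 2**31 - 1, the largest signed 32-bit int

counter = ctypes.c_int32(INT32_MAX)  # a 32-bit observation counter at the limit
counter.value += 1                   # one observation past the boundary...
wrapped = counter.value              # ...and the counter wraps negative
```

ctypes does no overflow checking, so the assignment silently truncates to 32 bits and the counter lands at -2,147,483,648: a catastrophic index for a table with 2.2 billion rows, and exactly the bug a 64-bit count avoids.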

Of course, there is traffic on blogs asking: just because you can solve a problem at that scale to arbitrary accuracy, should you? Maybe with better models and data processing, smaller samples of the data work just as well. As a developer, it is my job to be ready for whatever sized problems the customer throws at me. This brings me to my second example, about which I am still excited. My team recently developed an action called solveBlackbox. No matter what size data problem you are solving, this tool makes the cloud computing environment an exciting option.

It is a tool that I dreamed about creating for several years at SAS. SAS Viya opened the door for this dream to become a reality. Though we just released this action, I have been testing a prototype for several years now. Like a Swiss Army knife, it is a tool I find a use for wherever I go. A lot of what data scientists and analysts do is trial and error, trying many combinations of options and ideas to find what works best. Rather than you spending your weekend sifting through a haystack of possibilities, SAS can automate this process for you and search asynchronously using multiple levels of parallelism.
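solveBlackbox itself is a SAS Viya action, but the trial-and-error loop it automates can be sketched as a simple derivative-free random search. This toy stand-in uses an invented objective; the real action adds asynchronous, multi-level parallelism:

```python
import random

def blackbox(x, y):
    """Stand-in for an expensive pipeline we can score but not differentiate."""
    return (x - 2.0) ** 2 + (y + 1.0) ** 2

random.seed(42)
best_val, best_point = float("inf"), None
for _ in range(5000):                      # each trial is independent, so this
    cand = (random.uniform(-5, 5),         # loop parallelizes trivially
            random.uniform(-5, 5))
    val = blackbox(*cand)
    if val < best_val:
        best_val, best_point = val, cand
# best_point ends up near the true optimum at (2, -1)
```

No gradients, no structure assumptions: the search only needs to evaluate candidates and keep the best, which is what makes the black-box approach so broadly applicable.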

I've read that Thomas Edison tested thousands of different materials before finding a stable material for the light bulb. I imagine the idea of the light bulb and the physics behind it required a revolutionary mind like Edison. But performing thousands of repetitive experiments manually would be a waste of his time and talent. How awesome would it have been if he could have flipped a switch, gone away for the weekend, and returned to a table of all the thousands of tests and their results?

I think it will be some time before users fully grasp the scope of the power they now have available at their fingertips. The solveBlackbox action is just one small example of what SAS can now do for our users.


**Udo: **SAS is moving toward the cloud. What does the future hold for you and your team?

**Josh:** When it comes to research and development, I often think of the Ouroboros, the snake eating itself which symbolizes eternal cyclic renewal. It perfectly captures how the world of software works. We create new and exciting tools for our customers. Then we realize that a tool can be used as a base to create a new product. I often think about how today's tools might be used as stepping stones to build new and innovative products for our customers. Each new product may open a door previously thought shut. In the past, you would hit a computational limit of what you could do on one computer. Now with the cloud at our disposal, almost anything is possible to reconsider.

One goal I have is simplifying the customer's life through automation. The cloud creates the opportunity to explore more potential cases in parallel. The mere fact of running these cases creates new metadata sets that we can use to detect patterns and find new solutions. AI opens exciting doors to learn and then automate the boring, repetitive components of our jobs. This frees us to focus on the parts of the problem that require innovation and creativity.

Part of the reason I became a manager was so I could have more involvement in the different areas of work that our awesome SAS developers do each day. All of these pieces fit nicely together when viewed from the perspective of the cloud computing architecture. SAS is unique in that we have a legacy of more than 40 years of analytic computing expertise, all positioned together in one unique ecosystem. We have all the pieces to solve any problem we might encounter on the cloud. I look forward to putting more and more of these puzzle pieces together.

**Udo**: Many thanks for all that you do for SAS, our customers, and our partners. Keep up the great work.

This is the fifth post in our series about statistics and analytics bringing peace of mind during the pandemic.

The post Your recipe for unlimited capacity starts in the cloud appeared first on The SAS Data Science Blog.
