The post Data for Good: Galapagos sea turtle recognition appeared first on The SAS Data Science Blog.
Researchers in the field routinely and safely capture sea turtles to inspect their health. This involves checking fin tags to verify identity, then measuring weight, size, carapace, and beak dimensions. They also make sure there haven’t been any harmful interactions with plastic or vessels in the water. While doing so, they take photographs of the turtles, which we can use to train CV models. However, not every photo contains the information we need. For example, there might not be a side profile of the turtle where the facial scales are clearly visible. Figure 1 shows some images that include the scales, alongside the many other types of photographs captured.
There’s also the added complexity of handheld photography on mobile devices compared with CCTV scenarios in fixed locations with predictable lighting. Daylight conditions vary, and photographing in and around the sea is difficult. Issues include under- and overexposure, motion blur, noise (underwater artifacts, sand, and so on), and turtles being captured from many different distances and angles.
Thanks to the efforts of the Data for Good volunteer team, over 5,000 of the initial 12,000 images were marked as ‘usable’ for our model training. In each of these 5,000 images, we have the face of the turtle and, crucially, we know its identity. The remaining images were those where researchers had not identified a turtle. But for CV tasks to start, we generally require labeled data.
A SAS Visual Analytics dashboard was created while the image data was being processed. This gave researchers the ability to view their data in a way they had never been able to before. Held within each image is a series of metadata objects that enabled us to see the “Who, When, and Where” for each image. Over 6,000 images contained latitude and longitude data, which were plotted with the image capture timestamp. Combined with the turtle ID provided by researchers, this created data points in space and time. An example is shown in Figure 2. Every data point on the map represents a single turtle image, with interactive filters enabling users to drill down to an individual turtle or species across the time period.
Suddenly, researchers were able to track the population of the 538 turtles they had been monitoring for over ten years via an intuitive and easy-to-use dashboard interface. This was all from their photograph data. In their terminology, this represented an ‘Index of Health’ which they can use to monitor the location and well-being of turtles over time. Extra data, such as measurements and weights recorded by the researchers, supplement the location data via drill-down tables within the dashboard. This is shown in Figure 3. We can also ingest freely available data sources, for example, weather data from NOAA which allows for monitoring of unusual weather events and how that interacts with the turtle data.
A CV pipeline was developed to have the following repeatable steps to score a new, ‘unseen’ image:
Many CV techniques were implemented throughout the pipeline. SAS DLPy provided a Python interface to develop, train, test, and deploy CV models. SAS tools included YOLO object detection to extract the scales region, Darknet image classification for morphotype and species prediction, and U-Net instance segmentation to remove as much background noise as possible around the scales. Applying these models gave us a close crop on the scale area, which we could use to recognize the individual.
SAS also provides an action for feature recognition, or template matching, between images. This action works well for precise matches (such as logo detection) in which the target object retains the same shape, form, and detail.
Figure 4 shows how we applied preprocessing (grayscale, blur, binarization, and contour analysis) to isolate scales for template matching between images. However, the noise, rotation, and variability in the appearance of scales were too high for our data.
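The binarization step of that preprocessing can be sketched in a few lines. Our pipeline used OpenCV; the pure-NumPy Otsu thresholding below is only an illustrative stand-in, run on a synthetic image rather than a real turtle photograph:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold for a uint8 grayscale image by maximizing
    the between-class variance over all candidate cut points."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = gray.size
    cum = np.cumsum(hist)                       # pixels below each level
    cum_mean = np.cumsum(hist * np.arange(256)) # intensity mass below each level
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum[t - 1]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mean[t - 1] / w0
        mu1 = (cum_mean[-1] - cum_mean[t - 1]) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# Synthetic stand-in image: dark background (~40) with a bright patch (~200)
rng = np.random.default_rng(0)
img = rng.normal(40, 5, (64, 64))
img[20:40, 20:40] = rng.normal(200, 5, (20, 20))
img = np.clip(img, 0, 255).astype(np.uint8)

t = otsu_threshold(img)
binary = img > t   # binarized mask isolating the bright region
```

In the real pipeline this step sits between the blur and the contour analysis, with OpenCV doing the heavy lifting.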
The state of the art in feature recognition continues to advance. In June 2023, an open source framework and paper titled LightGlue: Local Feature Matching at Light Speed were published. In them, the model is trained on photographs of man-made architecture. The authors even mention that the model might be suitable for use in recognizing fish scales. The model detects pairs of keypoints from images A and B, so it does not provide a ‘classification’ but rather a quantitative result of the number of keypoints. So, we must decide on a threshold for MATCH or NO MATCH.
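Because LightGlue reports a keypoint count rather than a class label, the MATCH/NO MATCH decision comes down to choosing a threshold. One simple way to pick it is to scan candidate thresholds over labeled image pairs; the counts and labels in this sketch are invented for illustration:

```python
def best_threshold(pairs):
    """pairs: list of (matched_keypoint_count, same_individual: bool).
    Scan every observed count as a candidate threshold and return the one
    that maximizes accuracy of the MATCH / NO MATCH rule."""
    candidates = sorted({count for count, _ in pairs})
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((count >= t) == same for count, same in pairs)
        acc = correct / len(pairs)
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc

# Hypothetical labeled pairs: (matched keypoints, same turtle?)
pairs = [(120, True), (95, True), (60, True),
         (20, False), (15, False), (8, False)]
t, acc = best_threshold(pairs)  # with this toy data, t=60 separates perfectly
```

In practice you would tune the threshold on held-out pairs and weigh the cost of false positives against false negatives.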
Figure 5 shows how LightGlue compares two images of the Sagrada Familia and iteratively assesses keypoints between them. In each layer, the points are evaluated in similarity and kept/discarded to produce a final list of common keypoints in the images over the confidence threshold.
The model architecture helps tackle many of the complexities in recognizing keypoints from two different images that might be subject to changes in perspective, object rotation, object distance, noise, and lighting. In our context, this gives us a robust method to find common keypoints in the turtle scales area which can then be used to decide whether two images do or do not belong to the same turtle.
In Figure 6, we test the model on two images of the same individual, GAL340. The images were captured in quick succession, but there are slight differences in the angle of the head, occlusions above the eye, and in the scales hidden by the neck. We can see the keypoints that were assessed, discarded, and selected as a match.
The model does a thorough job of detecting keypoints along the contours between scales. The top and far-right scales are occluded in the first image; the clean boundaries of the second image’s keypoints show exactly which scales are not fully visible.
We must also test the model on two different individuals to check if false positives could be an issue. In this case, we compare images of two turtles of the same species (lower contrast between scales implies higher difficulty) and in the same orientation for a fair assessment, as shown in Figure 7.
We learned from tests like these that eyes and beaks are not as unique as the scales, so keypoints in those regions can be matched across different individuals. Crucially though, the model does an excellent job of confirming that the scale area (or fingerprint) of the individual is not a match, with barely a dozen matching keypoints in this area. So, some interpretation of the keypoints metric is important. Equally, we could consider preprocessing with segmentation to extract only the scales and nothing else on the face.
In Figure 8, we randomly select an image of GAL212 and carry out a population-wide search covering 446 unique turtles and 782 images where the scales are visible. The turtle identities are hidden from the model. The goal is to find a different image belonging to the same turtle with enough matching keypoints to decide that it is a match.
We see that a different image of GAL212 returns the highest matched keypoints by a considerable margin. Data challenges remain, but these tests show us that for future research, our pipeline is capable of detecting matches and results that are highly transparent and explainable.
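The population-wide search amounts to ranking every gallery image by its matched-keypoint count against the query and checking the margin between the top hit and the runner-up. A toy sketch (the image IDs and counts are invented):

```python
def rank_matches(scores):
    """scores: dict of gallery_image_id -> matched keypoint count vs. the query.
    Return the gallery ranked best-first and the margin between the top hit
    and the runner-up, a simple confidence signal for the identification."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    margin = ranked[0][1] - ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return ranked, margin

# Hypothetical keypoint counts from comparing one GAL212 query image
# against a small gallery:
scores = {"GAL212_b": 184, "GAL199_a": 31, "GAL340_c": 27, "GAL005_a": 12}
ranked, margin = rank_matches(scores)
```

A large margin, as in the GAL212 search, is what makes the result transparent: the decision is visible directly in the ranked counts.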
The life cycle of the project serves as a template for how you can tackle a CV problem with a combination of open source tools and SAS software. This is demonstrated in Figure 9. With SAS it is easy to combine typical CV technologies like Python, OpenCV, PyTorch, and CVAT with the SAS Viya platform supporting data management, preprocessing, modeling, and deployment.
An application developed by SAS Solutions Factory was featured at SAS Innovate and SAS Explore. Conference attendees could get hands-on experience with the image data to help label thousands of images containing unknown turtles. Results were gathered and analyzed after the events to help us and the researchers identify unknown turtles from the collection of over 20,000 images captured without knowing the individual. These were then fed into a SAS Visual Analytics dashboard.
In this post, we covered the story of how SAS and the Galapagos Science Center are collaborating to help save sea turtles by using CV. An end-to-end pipeline was created by using SAS and open source technologies. This combination extracts image metadata for an Index of Health in SAS Visual Analytics and processes the image to predict the species, morphotype, and identity of a turtle. This process is the first of its kind and can be repeated for sea turtles worldwide. It could also be used for other animals with unique features to help researchers and conservationists in their efforts to monitor and protect endangered species that are vital to the ecosystem.
“Projects like this are proof we can change the world one algorithm at a time”
Reggie Townsend, SAS Vice President of Data Ethics
What comes next is delivering the results and solution to the researchers so they can use it on their own images and those submitted by locals, tourists, or turtle enthusiasts!
The post Mastering panel data regression: Robust analysis with the CPANEL procedure appeared first on The SAS Data Science Blog.
In panel data regression, the default classical standard errors are computed under the assumption of homoscedasticity. This means that the conditional variance of the error term is assumed to be constant and does not depend on the observed variables. However, this assumption is rarely valid in real-world scenarios. It can result in underestimated standard errors and thus spuriously high significance due to an inflated t-statistic. For instance, in our cigarette demand analysis, US states with a higher per capita income tend to exhibit greater fluctuations in per capita cigarette sales. In this context, assuming uniform volatility across all the US states becomes unreasonable.
By using PROC CPANEL, you can leverage the HCCME= option in the MODEL statement to obtain heteroscedasticity-consistent (HC) standard errors. HCCME stands for HC Covariance Matrix Estimation. The option takes values from 0 to 3, which roughly indicate how conservative the standard errors are. HCCME=1 is widely employed in practice: it adjusts for the loss of degrees of freedom, making it a robust choice. Meanwhile, the HCCME=2 and HCCME=3 standard errors take a more conservative approach by incorporating standardized and predicted residuals, respectively. As a result, they tend to be larger than HCCME=0 standard errors. For more technical details, please refer to the HCCME documentation.
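The four HCCME variants differ only in how the squared residuals are weighted inside the "sandwich" covariance. The NumPy sketch below implements the textbook HC0–HC3 formulas for plain OLS; it is not SAS's implementation, just an illustration of what the option controls:

```python
import numpy as np

def hc_standard_errors(X, y, hc=1):
    """OLS with heteroscedasticity-consistent (HC0-HC3) standard errors,
    analogous in spirit to PROC CPANEL's HCCME=0..3 settings."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # leverage (hat-matrix diagonal)
    if hc == 0:
        w = e**2                          # raw squared residuals
    elif hc == 1:
        w = e**2 * n / (n - k)            # degrees-of-freedom correction
    elif hc == 2:
        w = e**2 / (1 - h)                # standardized residuals
    else:
        w = e**2 / (1 - h)**2             # "predicted" residuals (HC3)
    meat = (X * w[:, None]).T @ X
    cov = XtX_inv @ meat @ XtX_inv        # sandwich covariance
    return beta, np.sqrt(np.diag(cov))

# Simulated data with noise variance that grows with the regressor
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + np.abs(x))
beta, se0 = hc_standard_errors(X, y, hc=0)
_, se1 = hc_standard_errors(X, y, hc=1)
_, se2 = hc_standard_errors(X, y, hc=2)
_, se3 = hc_standard_errors(X, y, hc=3)
```

The weighting makes the conservatism ordering concrete: HC1, HC2, and HC3 each inflate the HC0 weights, so their standard errors can only grow.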
For our cigarette demand analysis, Figure 1 provides the SAS code. This code estimates a one-way fixed-effects model by using HC standard errors and cluster-robust standard errors. These will be discussed in the following section. Figure 2 offers an output table showcasing these robust estimates. These options can be applied similarly to other models, yielding valuable insights into your data.
proc cpanel data = mycas.Cigar;
   id State Year;
   classical: model LSales = LSales_1 LPrice LDisp LMin / fixone;
   hccme_0:   model LSales = LSales_1 LPrice LDisp LMin / fixone hccme=0;
   hccme_1:   model LSales = LSales_1 LPrice LDisp LMin / fixone hccme=1;
   hccme_2:   model LSales = LSales_1 LPrice LDisp LMin / fixone hccme=2;
   hccme_3:   model LSales = LSales_1 LPrice LDisp LMin / fixone hccme=3;
   cluster_1: model LSales = LSales_1 LPrice LDisp LMin / fixone hccme=1 cluster;
   compare / mstat(none) pstat(estimate stderr probt);
run;
Figure 1: SAS PROC CPANEL code using robust standard errors
From Figure 2, it is intriguing to note that the HCCME=1 standard errors turn out to be the largest for each coefficient estimate, whereas the classical standard errors prove to be the smallest. According to Stock and Watson (2008), the HC standard errors are generally biased in panel models unless the errors are homoscedastic. In practice, this is rarely the case. This underscores the necessity for more resilient techniques to accommodate potential correlations in panel data. We refer specifically to the cluster-robust standard errors we present in the following section.
With panel data, observations for the same individual are often correlated over time. In the previous post, the first-difference model was introduced as a technique that can, to some extent, address this issue, particularly when the errors follow a random walk. However, for scenarios where you need to account for arbitrary correlations within a given unit, there is another valuable tool at your disposal: the cluster-robust standard error.
By combining the CLUSTER option with HCCME= in the MODEL statement, you can obtain robust standard errors that are clustered at the individual level. Interestingly, when we delve into column CLUSTER_1 in Figure 2, we discover that clustering does not necessarily lead to significantly larger standard errors. In fact, for this specific example, they only exhibit a slight increase compared to the classical standard errors. However, they are smaller than all the other HC standard errors without clustering.
In the context of cluster-robust inference, it is crucial to recognize that the number of clusters represents the effective sample size. In our specific example, this translates to the number of states we included, which amounts to a mere 46. This underscores the importance of having a reasonably sized number of clusters when considering cluster-robust methods. It is also worth emphasizing that cluster-robust techniques are not universally preferable. Their utility varies, especially when confronted with a large number of regressors, which can result in reduced degrees of freedom and imprecise covariance estimates. In such cases, the applicability of cluster-robust methods may not be as clear-cut. For a deeper understanding of cluster inference, Econometrics by Bruce Hansen is recommended. Additionally, if you are seeking a practical guide, A Practitioner's Guide to Cluster-Robust Inference offers valuable insights on this subject.
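For intuition, here is a minimal NumPy sketch of the cluster-robust "sandwich" estimator with the common small-sample correction. The data are simulated with a shared shock per cluster, and this is an illustration of the formula rather than SAS's exact implementation:

```python
import numpy as np

def cluster_robust_se(X, y, groups):
    """Cluster-robust standard errors for OLS: residuals may be arbitrarily
    correlated within each cluster (e.g., within a state), independent across."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    ids = np.unique(groups)
    G = len(ids)                      # number of clusters = effective sample size
    meat = np.zeros((k, k))
    for g in ids:
        Xg = X[groups == g]
        s = Xg.T @ e[groups == g]     # cluster score
        meat += np.outer(s, s)
    c = (G / (G - 1)) * ((n - 1) / (n - k))  # common finite-sample correction
    cov = c * XtX_inv @ meat @ XtX_inv
    return np.sqrt(np.diag(cov))

# 20 clusters of 10 observations, each cluster sharing a common shock
rng = np.random.default_rng(2)
groups = np.repeat(np.arange(20), 10)
n = groups.size
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(size=20)[groups]       # cluster-level shock
y = 1.0 + 2.0 * x + u + rng.normal(size=n) * 0.5
se = cluster_robust_se(X, y, groups)
```

Note the correction factor blows up as G shrinks, which is exactly why a reasonably sized number of clusters matters.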
In panel data analysis, the importance of robust standard errors cannot be overstated. Cluster-robust standard errors have become increasingly prevalent in contemporary economic applications. They play a pivotal role in safeguarding the reliability of our statistical inferences, shielding them from practical challenges like heteroscedasticity and correlated observations. If you are eager to explore additional functionalities within PROC CPANEL, online documentation is available. There, you will find a wealth of information to further enhance your panel data analysis skills and elevate your data-driven decision-making.
The post Mastering panel data regression: Using the CPANEL procedure to analyze cigarette demand appeared first on The SAS Data Science Blog.
Panel data, also called longitudinal or cross-sectional time series data, is a type of data set that gathers information from multiple observational units over time. Think of it as a collection of snapshots that track changes in various aspects of life, such as annual household income, company performance, or economic conditions like state-level GDP and housing prices.
Panel data regression harnesses this unique data structure to uncover hidden individual (or entity) and time-specific effects. This method helps address issues like causal relationships without needing complicated statistical tools like instrumental variables. It is much more versatile than traditional cross-sectional data analysis techniques. It can even capture broader forms of heterogeneity and dynamic interactions, depending on the specific models used. Next, you will see how to conduct panel regressions for analyzing cigarette demand by using PROC CPANEL.
The analysis of cigarette demand, originally explored in Baltagi and Levin (1992), serves as a use case for dynamic panel estimation in Example 10.5 of the PROC CPANEL documentation. This investigation utilizes data from a panel consisting of 46 American states, spanning the years from 1963 to 1992.
The primary variable of interest is the logarithm of real per capita cigarette sales, denoted as LSales. Several factors are considered as potential influencers of sales, including the lag of the outcome variable (LSales_1), the log of the average retail price of a pack of cigarettes (LPrice), the log of real per capita disposable income (LDisp), and the log of the minimum real price in neighboring states (LMin). This last variable serves as a proxy for potential smuggling effects across state borders.
Since all the variables are in logarithmic form, the primary focus of this study lies in identifying the cigarette short-term own-price elasticity (β_{LPrice}), the income elasticity (β_{LDisp}), and the cross-price elasticity of cigarette sales within a state concerning the prices in neighboring states, often referred to as the neighboring price elasticity (β_{LMin}). Furthermore, we can infer the long-term own-price elasticity by using the formula β_{LPrice}/(1 − β_{LSales_1}), where β_{LSales_1} represents a time discounting factor.
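That long-run formula is simple enough to verify by hand. Using the pooled-model figures reported later in this post (treated here purely as illustrative inputs):

```python
def long_run_price_elasticity(beta_lprice, beta_lsales_1):
    """Long-run own-price elasticity implied by the dynamic specification:
    beta_LPrice / (1 - beta_LSales_1), where beta_LSales_1 is the
    coefficient on the lagged outcome (the time discounting factor)."""
    return beta_lprice / (1 - beta_lsales_1)

# Pooled-model estimates quoted in the results section of this post:
lr = long_run_price_elasticity(-0.106, 0.956)
```

The closer the lag coefficient is to 1, the more the short-run effect is amplified in the long run.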
To estimate these elasticities, you can employ PROC CPANEL, which offers a variety of panel data regression techniques. PROC CPANEL is highly performant, thanks to its design to run on a cluster of machines that distribute the data and the computations while exploiting all available cores and concurrent threads.
It's important to note that the selection of these models is primarily for illustrative purposes. In practice, their utilization necessitates sound economic modeling support and appropriate diagnostic tests. Figure 1 provides the SAS code for estimating these models. Figure 2 offers a side-by-side comparison of the results. We will examine these results more closely in the upcoming section.
proc cpanel data = mycas.Cigar;
   id State Year;
   model_1: model LSales = LSales_1 LPrice LDisp LMin / pooled;
   model_2: model LSales = LSales_1 LPrice LDisp LMin / fixone;
   model_3: model LSales = LSales_1 LPrice LDisp LMin / fixtwo;
   model_4: model LSales = LSales_1 LPrice LDisp LMin / ranone;
   model_5: model LSales = LSales_1 LPrice LDisp LMin / btwng;
   model_6: model LSales = LSales_1 LPrice LDisp LMin / fdone;
   compare / mstat(nobs ncs nts dfe f probf m probm) pstat(estimate stderr probt);
run;
Figure 1: SAS PROC CPANEL code estimating various panel data models
Figure 2: SAS PROC CPANEL regression output comparison
The one-way error component model is a fundamental framework in panel data analysis:
\(y_{it} = x'_{it}\beta + u_i + \epsilon_{it}\)
This is where i indexes the individual and t indexes the time period. \(y_{it}\) is the outcome variable, \(x_{it}\) is a vector of regressors, \(u_i\) is the individual-specific effect, and \(\epsilon_{it}\) is an idiosyncratic error term. What sets this model apart from the cross-sectional model is the inclusion of unobservable individual-specific effects, represented by \(u_i\). For instance, in a wage regression, \(u_i\) could represent an individual worker’s unobserved ability. In a production model, \(u_i\) might correspond to a firm-specific productivity factor. In our cigarette demand example, \(u_i\) could signify unobservable state-specific factors that remain relatively constant over time, such as regional cultures, demographic characteristics, or geographic attributes. These effects account for variations between individuals that might not be captured by the observed variables. This makes the model particularly suitable for analyzing panel data.
The cross-sectional model is indeed a special case of the one-way error component model when the individual-specific effects are assumed to be zero. In panel data analysis, these models are often referred to as pooled regression models. The pooled regression estimator is essentially the traditional Ordinary Least-Squares (OLS) estimator used in cross-sectional models. These models are suitable only when there is compelling evidence that the individual-specific effects are negligible. In our example, this would imply that factors specific to individual states that are constant over time do not significantly influence cigarette sales. This is unlikely to be true.
In Figure 2, you can find the statistics for the pooled regression model under the column labeled MODEL 1. These statistics encompass elasticity estimates and their associated significance levels. Despite neglecting the state and year effects, the signs of these elasticity estimates align with expectations. However, it is worth noting that the short-run price elasticity, which is -0.106, is considerably lower compared to models that take state (and time) effects into account. The long-run price elasticity, -0.106/(1 - 0.956) = -2.410, is much more substantial.
Fixed-effects models are widely favored by economists in panel data analysis because they effectively account for unobservable individual-specific effects without assuming that these effects are uncorrelated with the covariates of interest. The estimator for FE models is often referred to as the within estimator. It essentially applies the OLS estimator to a within-transformation of the one-way error component model, meaning, subtracting the individual means over all the time periods. When there is reason to believe that temporal trends, seasonality, or any unobservable time-varying factors influence the outcome variable, you should consider adding time-specific effects \(v_t\) to the model. This leads to a two-way error component model.
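The within transformation itself is just per-unit demeaning, which can be sketched in a few lines of NumPy (a two-way model would additionally demean over time periods):

```python
import numpy as np

def within_transform(values, groups):
    """Subtract each unit's time-mean from its observations: the 'within'
    transformation that removes the individual-specific effect u_i before OLS."""
    out = np.empty_like(values, dtype=float)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = values[mask] - values[mask].mean()
    return out

# Two units with two periods each; after demeaning, each unit averages zero
groups = np.array([0, 0, 1, 1])
vals = np.array([1.0, 3.0, 10.0, 20.0])
demeaned = within_transform(vals, groups)  # [-1, 1, -5, 5]
```

Applying OLS to the demeaned outcome and regressors yields the within estimator; anything constant within a unit, including u_i, is wiped out.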
In Figure 2, you can find the one-way FE estimates in column MODEL 2 and the two-way FE estimates in column MODEL 3. Compared to the pooled regression estimates, incorporating control for state (and time) effects results in substantially higher short-run price elasticities, -0.248 and -0.292, for one-way and two-way FE, respectively. However, the long-run price elasticities, -1.312 and -1.718, are lower. This suggests that unless the unobserved fixed effects are properly controlled for, they will indeed confound the analysis.
The joint significance of these fixed effects can be further confirmed by observing an F-statistic of 7.57 and a very low p-value in column MODEL 3. Additionally, the F-statistic for the significance of the state effects by themselves is 6.06 with a nearly zero p-value. These results underscore the critical importance of accounting for state and timing factors in the cigarette demand equation.
Random-effects models are generally more efficient than FE models when the fixed effects are believed to be uncorrelated with the covariates. This efficiency gain makes them particularly popular in microeconomic studies where there are many individuals but relatively few time periods. In such cases, FE models might struggle to capture meaningful time variations due to their reliance on within-individual variation. However, in practice, robustness is frequently favored over efficiency. The strict assumptions required by RE models can limit their applicability. The two-way RE model can also be estimated by using the RANTWO option. We did not include it here since it usually requires even more stringent exogeneity conditions than one-way RE. Consequently, FE models remain more commonly used.
The Hausman test can serve as a specification test to determine whether the assumptions of RE models are appropriate for a given data set. If the test supports the use of RE models, it can provide evidence in favor of this approach. However, it is generally unwise to solely rely on the Hausman test as a decision rule for choosing between RE or FE models. This is because the procedure itself can be biased, as discussed in Hansen (2022). In this example, the Hausman test statistic in column MODEL 4 is 267.81 with an almost zero p-value. This result strongly suggests that FE models are more appropriate for your data set, aligning with the conventional preference for FE models in applied research.
In contrast to the within estimator used in FE models, which explores variations within individuals, the between-effects estimator focuses solely on variations between individuals. It essentially applies an OLS estimator to the individual averages and assesses the effects of the covariates when they vary between individuals. While the BE estimator is less sensitive to issues related to serial correlation, it is not as commonly employed as FE or RE estimators because it discards all the information across time. As demonstrated in column MODEL 5, the BE model has far fewer degrees of freedom, only 41, and is consequently less informative.
In addition to the within transformation, first-differencing is another crucial transformation that effectively removes the individual-specific effects. Unlike the within estimator, FD is typically employed when there is a belief that a serial correlation issue exists between the errors. It proves to be efficient when the errors follow a random walk. In column MODEL 6, you can see the estimated own-price and income elasticities from the FD model. Interestingly, these elasticities are even more substantial in magnitude compared to the within estimators, but the neighboring price elasticity is smaller.
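First-differencing is equally easy to sketch: difference each unit's series and drop its first observation, which eliminates the individual effect \(u_i\) just as demeaning does:

```python
import numpy as np

def first_difference(values, groups):
    """First-difference within each unit, dropping each unit's first
    observation. Like the within transform, this removes u_i; it is the
    efficient choice when the idiosyncratic errors follow a random walk."""
    diffs = []
    for g in np.unique(groups):
        v = values[groups == g]
        diffs.append(np.diff(v))
    return np.concatenate(diffs)

# Unit 0 has three periods, unit 1 has two; one observation per unit is lost
groups = np.array([0, 0, 0, 1, 1])
vals = np.array([1.0, 2.0, 4.0, 10.0, 7.0])
fd = first_difference(vals, groups)  # [1, 2, -3]
```

Applying OLS to the differenced outcome and regressors gives the FD estimator.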
Given the results of the joint FE tests and the Hausman test, it seems that the two-way FE model is more suitable for the cigarette demand analysis. Other advanced panel data models can be exceptionally useful for addressing specific challenges in empirical research. IV models are useful when the covariates of interest are endogenous and we have appropriate IVs on hand. Dynamic linear models address the endogeneity issue when lagged outcome variables are included as covariates. Hybrid models, such as Hausman-Taylor and Amemiya-MaCurdy estimations, aim to strike a balance between the consistency of FE models and the efficiency of RE models. More details about these panel models will be discussed in a future post.
In practice, using pooled regression serves as a useful initial step to gain a preliminary understanding of whether the model is appropriately specified. This is accomplished by examining the signs and magnitudes of the estimates. Conventional practice favors FE models when regressors are suspected to be correlated with unit FEs. Alternatively, RE models can be used for improved efficiency, contingent on the Hausman test not yielding a rejection.
Beyond what we explored so far, PROC CPANEL offers a wealth of features such as restricted estimation and linear hypothesis testing. If you would like to learn more about panel data analysis, how to implement these advanced panel methods, and the full capabilities of PROC CPANEL, please refer to the SAS documentation. Moreover, exploring the documented examples can offer valuable hands-on experience. This will significantly assist you on your journey to mastering panel data analysis with PROC CPANEL.
The post %FiniteHMM Macro for finite Hidden Markov Models appeared first on The SAS Data Science Blog.
The HMM procedure supports HMMs which have been widely applied in economics, finance, science, and engineering. A finite HMM is one type supported by PROC HMM. In theory, a finite HMM assumes that the response variable is discrete (whether its value is numeric or categorical). PROC HMM assumes that the response variable is composed of consecutive natural numbers starting with 1. However, for most real-world business data, the variable of interest could be recorded in other formats, such as categorical levels, decimals, negative numbers, and so on. If the response variable contains values other than natural numbers, PROC HMM will register errors. In other cases, a response variable could be recorded as a nonconsecutive natural number. It could also be recorded as a consecutive natural number, but not starting from 1. In the last two cases, PROC HMM will not generate an error, but the results will be incorrect. In these cases, PROC HMM still assumes the response variable has consecutive values from 1 to the maximum value of the response variable. Consequently, the output tables will contain unnecessary parameters. Therefore, it is likely that you will need to preprocess the data before using PROC HMM to perform finite HMMs.
For example, if you are interested in modeling a dynamic feature of promotional channels for a buyer, the response variable y1 could be recorded with values such as ‘aa email’, ‘bb mail’, ‘cc phone’, ‘dd flyer’, ‘ee seminar’, and ‘ff forum’. Applying PROC HMM directly on y1 will cause an error. You would need to preprocess the data by creating a consecutively natural-number-valued variable y corresponding to the levels of y1, and then apply PROC HMM on y. Assuming y1 is in the data set finite1, the following SAS program demonstrates the preprocessing:
data finite2;
   set finite1;
   if y1='aa email'   then y=1;
   if y1='bb mail'    then y=2;
   if y1='cc phone'   then y=3;
   if y1='dd flyer'   then y=4;
   if y1='ee seminar' then y=5;
   if y1='ff forum'   then y=6;
run;
The variable y in the finite2 data set has values of the consecutive natural numbers 1 through 6. After you load the data set finite2 to a defined CAS library mycas, you can use PROC HMM on y for a finite HMM as shown:
ods output CPM=my_cpm_0 ParameterEstimates=my_est_0;
proc hmm data=mycas.finite2;
   id time=t section=section;
   model y / type=finite nstate=3 nCategory=6 method=MAP;
   estimate out=mycas.est_0 outall=mycas.estall_0;
   forecast out=mycas.fcst_0 lead=4;
run;
Because y has values 1 through 6, the output tables have information about the levels of y only, not the levels of y1. To understand and analyze the results in terms of the original y1, you need to postprocess by manually matching the y levels to the levels of y1. This can be tedious if y1 has many levels. Hence, it would be helpful to have a tool that automatically handles the preprocessing and postprocessing in such cases. The SAS macro %FiniteHMM is that tool.
The %FiniteHMM macro is a wrapper for finite HMMs with PROC HMM. The macro treats the response variable as a categorical variable. It automatically preprocesses the input data, applies PROC HMM to the internally created response variable, and postprocesses the results in terms of the original variable.
In the preprocessing stage, the macro will:
Then, it will apply PROC HMM to the internally created response variable for finite HMMs.
In the postprocessing stage, the macro will:
The parameters of the %FiniteHMM macro are defined to correspond to the statements in PROC HMM. The options in each statement of PROC HMM become the parameter values in the macro. This macro is useful in these three cases:
Finally, here are several examples that show how to use the %FiniteHMM macro. First, you need to simulate a data set. You can use the simulation SAS code in examples of %FiniteHMM to simulate a data set finite1. The data set finite1 has three variables: y, t, and section. Among them, y is the response variable with values 1 to 6, t is the time with values 1 to 500, and section is the section variable with values 1 to 3. This SAS code creates three response variables y1, y2, y3 for our examples.
data finite2;
   length y1 $15.;
   set finite1;
   if y=1 then y1='aa email';
   if y=2 then y1='bb mail';
   if y=3 then y1='cc phone';
   if y=4 then y1='dd flyer';
   if y=5 then y1='ee seminar';
   if y=6 then y1='ff forum';
   if y<=3 then y2=2;
   else y2=6;
   y3=y+2;
run;
Here in data set finite2, y1 is a categorical variable with values in the set ['aa email', 'bb mail', 'cc phone', 'dd flyer', 'ee seminar', 'ff forum'], y2 is a natural-number-valued variable with values in the set [2, 6], and y3 has consecutive natural number values in the set [3, 4, 5, 6, 7, 8].
Applying PROC HMM to y1 generates errors because y1 contains values other than natural numbers. There is no error when applying PROC HMM to y2 and y3, but the results are not correct: PROC HMM still assumes that y2 has consecutive integer values from 1 to 6 and that y3 has consecutive integer values from 1 to 8. So for both y2 and y3, the output tables contain redundant parameters, and the results are incorrect. By using %FiniteHMM, you can get the desired results for all three variables.
Assuming that CAS library mycas has been defined, the following SAS code uploads the finite2 data set to the default library of your current active CAS session:
proc casutil; load data=finite2 casout='finite2' replace; run; quit; |
To apply %FiniteHMM on y1, you can use this SAS code:
%FiniteHMM(
   data     = %str(mycas.finite2),
   proc     = %str(labelSwitch=(sort=asc(cpm))),
   id       = %str(time=t section=section),
   model    = %str(y1 / type=finite nstate=3 nCategory=6 method=MAP),
   estimate = %str(out=mycas.est_1 outall=mycas.estall_1),
   forecast = %str(out=mycas.fcst_1 lead=4),
   odsout   = %str(CPM=my_cpm_1 ParameterEstimates=my_est_1));
You can then print out the ODS table my_cpm_1, the ODS table my_est_1, and the CAS output table mycas.fcst_1. The labels of Category1 to Category6 in the CPM and forecast tables are linked to the levels of response variable y1. The column RespVarLevel in the parameter estimates table shows the levels of y1.
proc print data=my_cpm_1 label noobs; run;
proc print data=my_est_1 noobs; run;
proc print data=mycas.fcst_1 label noobs; run;
To apply %FiniteHMM on y2, you can use this SAS code:
%FiniteHMM(
   data     = %str(mycas.finite2),
   proc     = %str(labelSwitch=(sort=asc(cpm))),
   id       = %str(time=t section=section),
   model    = %str(y2 / type=finite nstate=3 nCategory=6 method=MAP),
   estimate = %str(out=mycas.est_2 outall=mycas.estall_2),
   forecast = %str(out=mycas.fcst_2 lead=4),
   odsout   = %str(CPM=my_cpm_2 ParameterEstimates=my_est_2));
Similarly, you can then print out the ODS table my_cpm_2, the ODS table my_est_2, and the CAS output table mycas.fcst_2.
proc print data=my_cpm_2 label noobs; run;
proc print data=my_est_2 noobs; run;
proc print data=mycas.fcst_2 label noobs; run;
To apply %FiniteHMM on y3, you can use the SAS code:
%FiniteHMM(
   data     = %str(mycas.finite2),
   proc     = %str(labelSwitch=(sort=asc(cpm))),
   id       = %str(time=t section=section),
   model    = %str(y3 / type=finite nstate=3 nCategory=6 method=MAP),
   estimate = %str(out=mycas.est_3 outall=mycas.estall_3),
   forecast = %str(out=mycas.fcst_3 lead=4),
   odsout   = %str(CPM=my_cpm_3 ParameterEstimates=my_est_3));
You can then print out the ODS table my_cpm_3, the ODS table my_est_3, and the CAS output table mycas.fcst_3.
proc print data=my_cpm_3 label noobs; run;
proc print data=my_est_3 noobs; run;
proc print data=mycas.fcst_3 label noobs; run;
Three examples of using %FiniteHMM have been presented. In each case, directly applying PROC HMM to the finite HMM either generates error messages or produces incorrect results because of unnecessary parameters in the output tables. Applying %FiniteHMM generates the correct results in all these cases.
When you apply PROC HMM to a finite HMM, PROC HMM expects the response variable to be composed of consecutive natural numbers starting from 1. If the values of the response variable in the input data set do not meet this expectation, you might get errors or incorrect results. In such cases, the %FiniteHMM macro automates the preprocessing and postprocessing. When you apply %FiniteHMM, the ODS output table of the CPM and the CAS output table of forecasts have labels that link to the levels of the original response variable. The ODS output table and CAS output table of parameter estimates have a column called RespVarLevel, whose values link to the levels of the original response variable. Hence these tables are ready for analysis of the original response variable.
%FiniteHMM treats the response variable as a categorical variable. This is consistent with the theoretical assumption of finite HMM. Using %FiniteHMM can save you a lot of time when you have a big data set, when the response variable has many levels of values, or when you are building many models.
The post %FiniteHMM Macro for finite Hidden Markov Models appeared first on The SAS Data Science Blog.
]]>The post MLOps for Pirates and Snakes: The Sasctl Packages for R and Python appeared first on The SAS Data Science Blog.
]]>SAS Model Manager supports the registration, comparison, scoring, publishing, and monitoring of SAS, Python, and R models. SAS Model Manager does not translate these models into another language, but rather utilizes Python and R environments to run these models. Python and R models can also be published into their own containers that can be leveraged in Docker, Azure, AWS, and GCP. To bridge the handoff between data scientists developing their models in Python or R and the MLOps engineers deploying models, SAS has released the sasctl open-source packages.
The first public release of python-sasctl was on July 15^{th}, 2019. Python-sasctl is now fully supported by the SAS Model Manager team with dedicated developers continuing to add new features and working with customers to address new use cases. You can stay up to date on Python-sasctl releases through our GitHub page or the SAS Model Manager community.
Python-sasctl can be installed and leveraged just like any other open source package. Through Python-sasctl, data scientists can take their models developed using Python packages, such as sklearn or xgboost, and automatically generate the scoring code, model pickle file, input variables metadata, output variables metadata, package and version requirements, training performance metadata, and model properties in just a few lines of code. Next, these files are directly pushed into SAS Model Manager. With the model and metadata in hand, MLOps engineers have what they need to manage, monitor, and deploy Python models. The handoff between Data Scientist and MLOps engineer has gotten far easier!
To get started using python-sasctl, first import the package and start a session:
from sasctl import Session
import sasctl.pzmm as pzmm   # used in the model-registration examples below

sess = Session(hostname, username, password)
Next, generate your modeling metadata. The following example is for a binary classification model, but you can find additional examples here.
pzmm.JSONFiles.calculate_model_statistics(target_value=1, prob_value=0.5, train_data=train_data, test_data=test_data, json_path=model_path)

pzmm.JSONFiles.write_var_json(input_data=input_data, is_input=True, json_path=model_path)

output_var = pd.DataFrame(columns=score_metrics, data=[["A", 0.5]])
pzmm.JSONFiles.write_var_json(input_data=output_var, is_input=False, json_path=model_path)

pzmm.JSONFiles.write_model_properties_json(model_name=model_prefix, target_variable=target, target_values=["1", "0"], json_path=model_path, model_algorithm=algorithm, modeler=modeler)

pzmm.JSONFiles.write_file_metadata_json(model_prefix=model_prefix, json_path=model_path)
Finally, pickle your model and then send it off to SAS Model Manager:
pzmm.PickleModel.pickle_trained_model(model_prefix=model_prefix, trained_model=model, pickle_path=model_path)

pzmm.ImportModel.import_model(model_files=model_path, model_prefix=model_prefix, project=mm_project, input_data=input_data, predict_method=[model.predict_proba, [float, float]], score_metrics=score_metrics, overwrite_model=True, target_values=["1", "0"], model_file_name=model_prefix + ".pickle")
Additionally, users with publishing privileges can publish and score models without needing to leave their Python environment. Examples for the most common use cases are currently available to help organizations get started now.
R-sasctl was released in January of this year and announced in our SAS Model Manager community. R-sasctl supports the registration of PMML models, in addition to R models, to SAS Model Manager. Like Python-sasctl, R-sasctl automatically generates the input variables metadata, output variables metadata, training performance metadata, and model properties in just a few lines of code. R-sasctl recently released an experimental code-generation function that creates the R scoring code for a few common R models. Models can also be published from R-sasctl to supported destinations. Moreover, R-sasctl has several neat additional functions, including one that formats input data correctly for various publishing destinations. To learn more about using R-sasctl and view an end-to-end example, see Eduardo Hellas’s article as well as the R-sasctl GitHub page.
SAS Model Manager and the sasctl packages aim to create a seamless ModelOps and MLOps process for Python and R models. Python and R models are not second-class citizens within SAS Model Manager. SAS, Python, and R models can be easily managed using our no-code/low-code interface. This is an interface that can be extended to support a variety of use cases.
Ready to see sasctl in action? We walk through the MLOps lifecycle for Python models using Python-sasctl and SAS Model Manager in just 20 minutes during the MLOps Uncoiled: Python’s Path on SAS® Viya® With SAS Model Manager session during SAS Explore. We also have two additional super-demos highlighting SAS’s MLOps capabilities for Python and R models. SAS Explore will be live in Las Vegas September 11^{th} – 14^{th}. The ModelOps team will be showing off their groundbreaking work for open source models, so you don’t want to miss it!
]]>The post FINE flow analytics: locality and simplicity eat centrality and complexity for breakfast appeared first on The SAS Data Science Blog.
]]>In The Unicorn Project, the five ideals are Locality and Simplicity; Focus, Flow, and Joy; Improvement of Daily Work; Psychological Safety; and Customer Focus.
These ideals are the five areas that Kim says he gravitated towards during his career in the IT industry. He suggests that they “seem to underpin what is required to create better value sooner, safer, and happier.” Kim further offers that these ideals are required to help organizations survive and win in the marketplace.
This first ideal from The Unicorn Project is based on the idea that local optimizations are often more effective than global ones. It suggests that the more teams and individuals are empowered to do things for themselves, the more focus, flow, and joy they will have (as observed by the second ideal). Locality and simplicity mean that we need to design things so that we have locality in our systems and the organizations that design them.
The ideal considers the degree to which software teams can make code changes without impacting other teams. It also suggests that we need to ensure that internal complexity is avoided in our code, organizations, and processes. The ideal is based on insight that Kim drew from data gathered by observing tens of thousands of organizations across the globe through a rigorous survey. This research is described in the book Accelerate and was used to create the State of DevOps reports, DORA, and the DevOps Handbook.
Locality and simplicity help create an efficient flow of value to the end consumer by reducing impediments and meeting the required user needs. All this without expending too much energy. By using the FINE flow equations, in combination with graph theory, we can observe the impact of locality and simplicity in a simple example.
Figure 1 depicts a situation where a resource (team, individual, or system) is shared centrally between two consumers. The shared resource (node C) sits between the two consumers (node A and node B) and in turn consumes the resources of a downstream supplier (node D). The directed edges of this graph depict the dependencies that are in opposition to the flow of value. The complexity of the centralized node will be increased as it tries to jointly collaborate with both consumers while at the same time receiving the services of the downstream node.
The network of this graph is organizational and models the interactions between software development teams. For this example, we state that the interaction style between nodes A and C, and between nodes B and C is that of collaboration. Additionally, the interaction style between nodes C and D is that of as-a-service. These interaction styles are important when determining the cognitive load (energy) that is required at each node. They are mapped in the FINE analysis by attaching relative values to the edges of the graph and computing an average value seen at each node.
Additionally, PageRank centrality is used from graph theory to compute the relative value that each node has in the potential to impede flow. With values for energy and impediments computed, it is possible to mathematically determine the other two dimensions of flow and needs. An open source project is available on GitHub that can be used to compute these values.
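As an aside on the mechanics, PageRank for the small dependency graph of Figure 1 (edges A to C, B to C, and C to D) can be computed with a plain power iteration. The sketch below is our own illustration, not the GitHub project's code, and the raw scores are not the impediment values in Table 1; the ordering, however, matches the table's pattern (node D highest, then C, with A and B tied).

```python
import numpy as np

# dependency edges from Figure 1: A -> C, B -> C, C -> D
nodes = ["A", "B", "C", "D"]
edges = [("A", "C"), ("B", "C"), ("C", "D")]

def pagerank(nodes, edges, d=0.85, iters=200):
    n = len(nodes)
    ix = {v: i for i, v in enumerate(nodes)}
    out = {v: [] for v in nodes}
    for u, v in edges:
        out[u].append(v)
    r = np.full(n, 1.0 / n)                 # start from the uniform vector
    for _ in range(iters):
        new = np.full(n, (1.0 - d) / n)     # teleportation term
        dangling = sum(r[ix[u]] for u in nodes if not out[u])
        new += d * dangling / n             # spread dangling-node mass uniformly
        for u in nodes:
            for v in out[u]:
                new[ix[v]] += d * r[ix[u]] / len(out[u])
        r = new
    return dict(zip(nodes, r))

ranks = pagerank(nodes, edges)
```

Nodes with no incoming dependencies (the consumers A and B) receive only the teleportation mass, while the shared and downstream nodes accumulate rank, which is why they carry the greater potential to impede flow.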
The FINE values for the graph with centralized sharing are shown in Table 1.
Node | Flow | Impediments | Needs | Energy |
A | 2.3333 | 0.1378 | 0.3214 | 0.75 |
B | 2.3333 | 0.1378 | 0.3214 | 0.75 |
C | 1.0554 | 0.505 | 0.533 | 0.5625 |
D | 1.0201 | 0.8408 | 0.8578 | 0.875 |
Table 1: FINE values with a centralized shared node
From this analysis, we can see that flow is the greatest and evenly shared in the two consumer nodes (nodes A and B). These two nodes have the same FINE values, as you might expect. The potential to impede flow is greatest in the downstream node (node D) but is also high in the centralized shared resource node (node C). These nodes with higher impediment values will have the greatest impact on flow if they are unable to support the needs of the upstream nodes. We see that energy (cognitive load) is highest in the downstream node and lowest in the centralized shared node. The lower effort expended by the centralized node is likely a key indicator of the reduced flow in the consuming nodes.
To improve local efficiency, we can subdivide the centralized shared resource into two separate dedicated nodes (nodes C1 and C2) that each support one of the upstream consumers. This results in a change to the graph where only the downstream node is now a shared resource. The new graph with local dedicated resources is shown in Figure 2.
By using the FINE flow analysis, we can ascertain the updated values for flow, impediments, needs, and energy that are observed for this new graph. The FINE values for the graph with localized dedicated resources are shown in Table 2.
Node | Flow | Impediments | Needs | Energy |
A | 2.5393 | 0.1163 | 0.2954 | 0.75 |
B | 2.5393 | 0.1163 | 0.2954 | 0.75 |
C1 | 1.4416 | 0.2807 | 0.4046 | 0.5833 |
C2 | 1.4416 | 0.2807 | 0.4046 | 0.5833 |
D | 0.9607 | 0.903 | 0.8675 | 0.8333 |
Table 2: FINE values with local dedicated nodes
We can see from the FINE values in the updated graph that the flow, in the consumer nodes (A and B), has increased from 2.3333 to 2.5393. In addition, the flow for the dedicated resource nodes has also increased when compared to the centralized and complex node of the previous graph (1.0554 to 1.4416). The flow has dropped slightly for the downstream node (node D). We can see the impediments for the downstream node increasing, which reflects its new importance to the total graph.
Overall, the flow of the graph has improved when localized and dedicated resources are used. This shows how sharing centralized and complex resources causes a reduction in the flow.
From this example, we can see that the first ideal of locality and simplicity is indeed aimed toward the efficient flow of value to the end consumer. We can reason how this first ideal from Kim’s book has a parsimonious explanation that is more than just an intuitive notion. By using the methods of FINE Flow Analytics, we have shown the causal relationship of why locality and simplicity will eat centrality and complexity for breakfast.
]]>The post Automating a shift-left CI/CD security workflow to track, report, and analyze software security vulnerabilities appeared first on The SAS Data Science Blog.
]]>Before SAS implemented an automated shift-left CI/CD security workflow, it wasn’t easy to know what the top OWASP vulnerabilities were without spending large amounts of time searching for and gathering information. Even then, the information might not be accurate or up to date. Additionally, there wasn't a way to know whether other applications have completed the DAST scans, nor was there a cohesive way to report and track vulnerabilities. To overcome those challenges, one of our design goals was to collect the DAST scan results, then use a SAS product stack to help report and analyze the security vulnerability status and trends. The eventual goal was to have the automated process learn and optimize false positives automatically over time. As we walk you through the implementation, we will also share how certain steps improved our work process. To learn more about the shift-left testing practice, visit Dzone.
In our automated security workflow, we use OWASP ZAP to run security scans. In short, ZAP is an open source security tool that is widely accepted within communities around the world. It has strict compliance and security standards. We begin the process with the following steps:
1. Design a schema and a view to store the security data in a shared Postgres database. Our internal name for this database is Continuous Quality Metrics (CQM). The CQM tables store, among other things, the alert records produced by each security scan.
2. Create a Python script to parse the contents of the security scan results into an XML format.
3. The Python script reads in an alert filter, a file of known vulnerabilities that should be ignored. Next, the script processes any differences between the alert filter and the report. If a new vulnerability is found, a new Jira ticket is automatically generated and assigned to a team member to triage. This aligns with our standard security requirements. Subsequently, the filter is updated with the newly identified alert so that it will be recognized in the next scan. At the end of the script, all alert records are sent to our CQM database. This tracking mechanism benefits our practice: for example, it allows us to use Jira for cross-referencing and accountability while minimizing manual intervention. An example of an automated Jira ticket generated when the automation detected a new security vulnerability is shown in Figure 2.
4. SAS Studio is used to develop code to query the security data from the CQM database to produce a SAS data set with the results.
5. The SAS code is integrated into the SAS Job Execution Service. This is a REST service that calls the SAS Job definition to run the SAS code and then loads the table into CAS (Cloud Analytic Services).
6. A SAS Visual Analytics report is designed to aggregate the security data collected from the security scans. This gives a visual analytical view of the security data of the products.
7. Finally, to tie everything together, a Jenkins pipeline is built that runs these stages end to end.
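The heart of step 3 above is a set difference between the alerts parsed from a scan report and the known-alert filter; anything left over is new and triggers a ticket. The sketch below illustrates that diff logic with a simplified XML layout; the element names are illustrative, not the actual ZAP report schema or our production script.

```python
import xml.etree.ElementTree as ET

def parse_alerts(report_xml):
    """Collect (plugin id, alert name) pairs from a simplified scan report."""
    root = ET.fromstring(report_xml)
    return {(item.findtext("pluginid"), item.findtext("name"))
            for item in root.iter("alertitem")}

def new_alerts(report_xml, known_filter):
    """Alerts not covered by the filter; each of these would open a Jira ticket."""
    return parse_alerts(report_xml) - known_filter
```

After triage, the newly identified pairs would be appended to the filter file so the next scan recognizes them.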
A well-known saying is “Knowledge is power,” and we believe having the right data builds that knowledge. This shift-left automated effort gives us the security data that we need to process, share, and analyze for a better understanding of our products, and that data can help many of us in a variety of situations.
A picture is worth a thousand words. Below are some snapshots of our SAS Visual Analytics report. The data have been sanitized from our test scan results.
Our shift-left automated effort to track and report security DAST vulnerabilities received positive feedback in many ways. It alleviates the manual responsibilities so security champions can focus on other work priorities. The Jenkins job was shared with other teams within our organization. The same process is repeated for both the Application Programming Interface (API) and User Interface (UI) test containers. We continue to collect security data from different applications and deployable units regularly. The SAS Visual Analytics report is current with security data findings and trends and holds keys to many more opportunities. This makes our DevOps journey more relevant than ever because of what we learned and how we improved.
That being said, the benefits outweighed the nuisances, and our DevOps journey continues. We would love to hear your feedback about your CI/CD experience!
]]>The post The Empirical Mode Decomposition for handling non-stationary time series appeared first on The SAS Data Science Blog.
]]>Empirical Mode Decomposition (EMD) is a powerful time-frequency analysis technique that allows for the decomposition of a non-stationary and non-linear signal into a series of intrinsic mode functions (IMFs). The method was first introduced by Huang et al. in 1998 and has since been widely used in various fields, such as signal processing, image analysis, and biomedical engineering.^{[1]}
In EMD, the key concepts include non-stationary signals, non-linear signals, and intrinsic mode functions (IMFs). Non-stationary signals are those whose statistical properties, such as mean and variance, change over time. These signals are often encountered in real-world applications and pose challenges for traditional signal processing techniques. Non-linear signals, on the other hand, exhibit complex behavior that cannot be modeled by linear systems. Examples of linear systems for stationary time series include the classical Fourier transform for frequency domain analysis and linear regression, AR, MA, ARMA, and ARIMAX models for prediction and forecasting in the time domain.
Intrinsic mode functions (IMFs) are the building blocks of the signal obtained through the EMD process. They represent local oscillatory modes that satisfy specific conditions, allowing for a more intuitive and adaptive representation of the underlying signal. The decomposition into IMFs enables the analysis of non-stationary and non-linear signals in both time and frequency domains, providing valuable insights into their structure and dynamics. An oscillatory mode is a periodic fluctuation or wave-like pattern that appears within a signal. It can be thought of as a continuous function that oscillates between its maximum and minimum values, representing the peaks and troughs of a wave. Visualizing an oscillatory mode, one would see a series of crests and troughs that may vary in amplitude and frequency, yet maintain a consistent oscillatory behavior.
The concept of an envelope is closely related to oscillatory modes. In the context of EMD, an envelope is a smooth curve that surrounds the oscillatory mode, encapsulating the extreme points (peaks and troughs) of the oscillations. There are two types of envelopes: the upper envelope, which connects the local maxima, and the lower envelope, which connects the local minima. Envelopes play a crucial role in the EMD process, as they help to identify and extract the intrinsic mode functions (IMFs) from the original signal. By analyzing the oscillatory modes and their envelopes, it is possible to gain a better understanding of the signal’s structure, its dominant frequencies, and how these components evolve over time.
The EMD algorithm can be described in the following steps: (1) identify all local maxima and minima of the signal; (2) interpolate the local maxima to form the upper envelope and the local minima to form the lower envelope; (3) compute the mean of the two envelopes; (4) subtract that mean from the signal to obtain a candidate IMF; (5) repeat this sifting process on the candidate until it satisfies the IMF conditions; and (6) subtract the accepted IMF from the signal and repeat the whole procedure on the residual until the residual is negligible.
In the Empirical Mode Decomposition (EMD) algorithm, determining whether the residual is negligible is typically based on either a predefined stopping criterion or a user-defined threshold. Common ways to decide include a threshold on the residual's energy or amplitude relative to the original signal, a check that the residual has become monotone (that is, it contains too few extrema to form further oscillations), and a preset maximum number of extracted IMFs.
These methods can be used individually or in combination to determine whether the residual is negligible. It is important to note that the choice of the stopping criterion may influence the results of the EMD algorithm, and it is often problem-specific.
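To make the sifting loop and a simple stopping rule concrete, here is a minimal, self-contained Python sketch. It is an illustration only: it uses linear interpolation for the envelopes (production EMD implementations typically use cubic splines) and a fixed sifting count, and all function names are our own.

```python
import numpy as np

def _envelope(x, idx):
    # interpolate through the extrema at positions idx, pinning the
    # endpoints to the signal so the envelope spans the whole record
    pts = np.concatenate(([0], idx, [len(x) - 1]))
    return np.interp(np.arange(len(x)), pts, x[pts])

def _sift_once(x):
    # one sifting pass: subtract the mean of the upper and lower envelopes
    up = np.array([i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]])
    dn = np.array([i for i in range(1, len(x) - 1) if x[i - 1] > x[i] <= x[i + 1]])
    if len(up) < 2 or len(dn) < 2:
        return None          # too few extrema: the residual is (near-)monotone
    return x - 0.5 * (_envelope(x, up) + _envelope(x, dn))

def emd(x, max_imf=8, n_sift=10):
    x = np.asarray(x, dtype=float)
    imfs, residual = [], x.copy()
    for _ in range(max_imf):
        h = _sift_once(residual)
        if h is None:                  # stopping rule: no more oscillations
            break
        for _ in range(n_sift - 1):    # fixed number of sifting iterations
            h_next = _sift_once(h)
            if h_next is None:
                break
            h = h_next
        imfs.append(h)
        residual = residual - h
    return imfs, residual
```

By construction, the extracted IMFs plus the final residual sum exactly back to the input signal, which is the defining property of the decomposition.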
Because of EMD's ability to analyze non-stationary and non-linear signals, it has been applied in many contexts and domains. Three examples are the automated detection of epilepsy from EEG signals, fault diagnosis of rolling bearings, and financial time series forecasting.
The applications discussed above demonstrate the versatility and effectiveness of the Empirical Mode Decomposition (EMD) technique in various fields, such as automated detection of epilepsy using EEG signals, fault diagnosis of rolling bearings, and financial time series forecasting. In light of these successful applications, it is worth exploring the application of EMD to other time series data as well. The SAS/IML CALL EMD routine provides a convenient tool for computing the empirical mode decomposition of a time series, enabling you to further investigate the potential benefits of EMD in different contexts. In the following example, we investigate the use of EMD on simulated oil price data, specifically West Texas Intermediate (WTI) oil. The time series of the WTI oil prices is visualized in the figure below.
In this example, the Empirical Mode Decomposition (EMD) method analyzes simulated West Texas Intermediate (WTI) oil price data. Specifically, the CALL EMD subroutine within the SAS/IML software performs the decomposition process.^{[2]} This technique enables you to extract Intrinsic Mode Functions (IMFs) from the data, which can then be used to uncover underlying trends, cycles, and other significant components present in the WTI oil price time series.
In addition to the decomposition process, we hint at methodologies for forecasting future values of the Intrinsic Mode Functions (IMFs) obtained through EMD. By generating accurate forecasts for the IMFs, you can subsequently predict future oil prices, thereby enabling more informed decision-making in the energy market. The insights gained from this analysis could prove valuable for industry stakeholders, policymakers, and financial investors as they navigate the complexities of the oil market and its potential future trends. In this example, the following steps are discussed: filling and interpolating missing values in the price series, decomposing the series into IMFs with the CALL EMD subroutine, forecasting the individual IMFs, and using the historical and forecasted IMFs as inputs for forecasting future oil prices.
The full source code for this article can be found on GitHub.
The SAS code below performs two main operations on a dataset containing daily WTI oil prices. The first part of the code uses the TIMESERIES procedure to process the dataset wti_oil_prices and add rows for any missing dates, creating an output dataset called wti_oil_price_fill. It sets the ID variable DATE at daily intervals, ensuring that the dataset is consistently spaced in time. The TIMESERIES procedure specifies no accumulated values, sets missing values for added rows, and formats the DATE variable with the DATE9. format. The variable of interest in this step is PRICE, which represents the WTI oil prices.
The second part of the code uses the EXPAND procedure to interpolate missing values in the dataset created in the previous step. The input dataset is wti_oil_price_fill, and the output dataset is wti_oil_price_fill_int. The EXPAND procedure uses the JOIN method to interpolate the missing values for the PRICE variable and creates a new variable, PRICE_INT, containing the interpolated values. The ID variable for this step is DATE, which helps the procedure understand the structure of the data. Overall, the code handles missing values and interpolates the WTI oil prices dataset to create a more complete and consistent dataset for further analysis.
proc timeseries data=wti_oil_prices out=wti_oil_price_fill;
   id date interval=day accumulate=none setmiss=missing format=date9.;
   var price;
run;

proc expand data=wti_oil_price_fill out=wti_oil_price_fill_int;
   convert price=price_INT / method=join;
   id DATE;
run;
In the PROC IML code, the input data (dates and interpolated oil prices) are read from the wti_oil_price_fill_int dataset into two vectors, DATES and OIL_PRICES. Afterward, the CALL EMD subroutine is called with the specified options to decompose the oil prices into Intrinsic Mode Functions (IMFs) and a residual component. The obtained IMFs are illustrated in a panel series plot to visualize the IMFs over time. The code then combines the dates, IMFs, original oil prices, and the residual component into a single matrix called OUTPUT, which is subsequently used to create a new dataset wti_oil_price_IMF.
options orientation = portrait;
ods graphics on / reset width=600px height=1000px border=off ANTIALIASMAX=50100;

proc iml;
   use wti_oil_price_fill_int;
   read all var {DATE} into dates;
   read all var {price_INT} into oil_prices;
   close wti_oil_price_fill_int;

   optn = {10 1 .01 .001};
   call EMD(IMF, residual, oil_prices, optn);

   title "IMFs of West Texas Intermediate Crude Prices";
   call panelSeries(dates, IMF, {'IMF1','IMF2','IMF3',...,'IMF7'}) grid="y" label={"DATE" "IMF"} NROWS=7;

   ndates = nrow(dates);
   output = shape(1, ndates, 10);
   output[,1] = dates;
   output[,2:8] = IMF;
   output[,9] = oil_prices;
   output[,10] = residual;

   varnames = {'DATE' 'IMF1' 'IMF2' 'IMF3' ... 'IMF7' 'price_INT' 'RESIDUAL'};
   create wti_oil_price_IMF from output[colname=varnames];
   append from output;
   close;
quit;
You can visualize the historical IMFs in the following panel:
The TSMODEL code below fits a moving average (MA) model on the first intrinsic mode function (IMF1) extracted from the decomposed West Texas Intermediate (WTI) oil price series, and then generates forecasts. The MA orders are defined using an array (ma) with three elements: 1, 2, and 3. The ARIMA model specification (mySpec) is then configured with these MA orders using the AddMAPoly() method, while the NOINT and METHOD options are set to 1 (no intercept) and ’CLS’ (conditional least squares), respectively. Lastly, the model forecasts, parameter estimates, and statistics are collected using TSMFor, TSMPEst, and TSMSTAT objects, respectively.
proc tsmodel data=CASUSER.TRAIN
   outobj=(outStat=CASUSER.OUTEST_IMF1(replace=YES)
           outFcast=_tmpcas_.outFcastTemp(replace=YES)
           parEst=CASUSER.OUTFOR_IMF1(replace=YES))
   seasonality=7;
   id DATE interval=Day FORMAT=_DATA_ nlformat=YES;
   var IMF1;
   require tsm;
   submit;
      declare object myModel(TSM);
      declare object mySpec(ARIMASpec);
      rc=mySpec.Open();
      array ma[3]/nosymbols;
      ma[1]=1; ma[2]=2; ma[3]=3;
      rc=mySpec.AddMAPoly(ma);
      rc=mySpec.SetOption('noint', 1);
      rc=mySpec.SetOption('method', 'CLS');
      rc=mySpec.Close();
      rc=myModel.Initialize(mySpec);
      rc=myModel.SetY(IMF1);
      rc=myModel.SetOption('lead', &lead);
      rc=myModel.SetOption('alpha', 0.05);
      rc=myModel.Run();
      declare object outFcast(TSMFor);
      rc=outFcast.Collect(myModel);
      declare object parEst(TSMPEst);
      rc=parEst.Collect(myModel);
      declare object outStat(TSMSTAT);
      rc=outStat.Collect(myModel);
   endsubmit;
run;
The next step in forecasting WTI oil prices involves leveraging the intrinsic mode functions (IMFs) extracted from the empirical mode decomposition (EMD) of historical data. By considering both the actual historical IMFs and forecasted future IMFs, you can incorporate the IMFs as exogenous regressors in a predictive model. This approach allows for a comprehensive understanding of the underlying components and trends that drive oil price fluctuations. Because EMD decomposes the original time series into a set of IMFs, it isolates the intrinsic oscillations and trends present in the data. This allows for the identification and separation of various time-varying patterns, which are often responsible for the non-stationarity in oil price series. By incorporating these IMFs as exogenous inputs, the model is better equipped to capture the nonlinear and non-stationary patterns that traditional time series models struggle with. As a result, the forecasting model becomes more robust and reliable, successfully handling non-stationarity and providing more accurate predictions.
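To make the component-wise forecasting idea concrete outside of SAS, the sketch below fits a small autoregression to each component and sums the component forecasts. The components here are stand-in sinusoid-and-trend series rather than actual IMFs, and the AR-per-component choice is one simple illustrative option, not the post's prescribed model.

```python
import numpy as np

def fit_ar(x, p=2):
    """Least-squares AR(p) fit without intercept: x[t] ~ sum_k a[k] * x[t-1-k]."""
    X = np.column_stack([x[p - k - 1:len(x) - k - 1] for k in range(p)])
    coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return coef

def forecast_ar(x, coef, h):
    """Recursive h-step-ahead forecast from the fitted AR coefficients."""
    hist = list(x)
    for _ in range(h):
        hist.append(sum(c * hist[-k - 1] for k, c in enumerate(coef)))
    return np.array(hist[len(x):])

# stand-in "IMFs": two oscillations plus a slow trend component
t = np.arange(300.0)
components = [np.sin(2 * np.pi * t / 12), np.sin(2 * np.pi * t / 50), 0.01 * t]

h = 10
total_forecast = sum(forecast_ar(c, fit_ar(c), h) for c in components)
```

Because a noiseless sinusoid (and a linear trend) satisfies an exact AR(2) recursion, the summed component forecast tracks the true continuation closely in this toy setting; real IMFs would carry noise and require more careful per-component model selection.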
The post The Empirical Mode Decomposition for handling non-stationary time series appeared first on The SAS Data Science Blog.
The post Poisson HMM: The model of count time series appeared first on The SAS Data Science Blog.
Count time series are ill-suited for most traditional time series analysis techniques, which assume that the time series values are continuously distributed. This can present unique challenges for organizations that need to model and forecast counts. Although the Poisson distribution and the mixed Poisson distribution are popular choices for modeling count data, they are not always suitable for count time series. This is because both assume that events occur independently of each other and at a constant rate. In time series data, however, the occurrence of an event at one point in time might be related to the occurrence of an event at another point, and the rates at which events occur might vary over time.
The hidden Markov model (HMM) is a valuable tool that can handle overdispersion and serial dependence in the data. This makes it an effective solution for modeling and forecasting count time series. We will explain how the Poisson HMM can handle count time series by modeling different states by using distinct Poisson distributions while considering the probability of transitioning between them.
HMMs are a class of models where the distribution that generates an observation is dependent on the state of an underlying, unobserved Markov process. Figure 1 illustrates how demand varies based on the hidden state. In State 1 (S_{1}), the demand is two units; in State 2 (S_{2}), the demand is one unit; in State t (S_{t}), the demand is three units, and in State t+1 (S_{t+1}), the demand is zero units. Of course, you might already know the power of HMM through my blog post on the SAS Batting Lab. The Poisson HMM is a specific type of HMM where different states are modeled by using distinct Poisson distributions while considering the probability of transitioning between them.
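The generative story above can be sketched in plain Python. This is a hypothetical two-state Poisson HMM with illustrative numbers (not the post's fitted model): a hidden Markov chain switches between a low-demand and a high-demand state, and each state emits counts from its own Poisson distribution.

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's method: count uniform draws until their product falls below exp(-lam).
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

# Illustrative parameters: state 0 = low demand, state 1 = high demand.
lambdas = [0.2, 5.0]
trans = [[0.9, 0.1],   # P(next state | current state 0): sticky low state
         [0.2, 0.8]]   # P(next state | current state 1): sticky high state

rng = random.Random(7)
state, series = 0, []
for _ in range(500):
    series.append(sample_poisson(lambdas[state], rng))   # emit from current state
    state = 0 if rng.random() < trans[state][0] else 1   # then transition

print(series[:10])
```

The resulting series shows exactly the features the post describes: runs of zeros while the chain sits in the low state, bursts of larger counts in the high state, and overall variance well above the overall mean.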
To demonstrate the effectiveness of the Poisson HMM, a simulated example was created in the context of the retail industry. Specifically, the example involved generating a set of count data with small, discrete values that mimicked the demand pattern of a product with a slow-moving inventory. This type of data is frequently encountered in inventory control problems, especially for items with low demand rates. The objective of this example was to illustrate the challenges associated with analyzing such data to help make inventory plans.
To begin with, Figure 2 displays the first 200 of the total 1000 observations of the generated time series. This gives insight into the overall pattern of the data. Additionally, Figure 3 demonstrates that the values of the time series are discrete, ranging from 0 to 18. Moreover, 27.3% of the time series consists of zeroes. This indicates that the time series is a count time series with a considerable number of zero values.
The mean and variance shown in Table 1 indicate there is overdispersion of the data.
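The overdispersion check behind Table 1 is a one-liner: for a Poisson distribution the variance equals the mean, so a dispersion index (variance divided by mean) greater than 1 signals overdispersion. A minimal sketch on a hypothetical demand sample (not the post's data):

```python
# Hypothetical demand counts mixing mostly-zero periods with busier ones.
counts = [0, 0, 0, 1, 0, 6, 4, 5, 0, 0, 2, 7, 0, 1, 9, 0, 3, 5, 0, 8]

n = len(counts)
mean = sum(counts) / n
var = sum((c - mean) ** 2 for c in counts) / (n - 1)  # sample variance
dispersion = var / mean  # == 1 for a Poisson; > 1 signals overdispersion

print(round(mean, 2), round(var, 2), round(dispersion, 2))
```

A dispersion index well above 1, as here, is exactly the situation where a single Poisson distribution underfits and a Poisson HMM becomes attractive.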
One of the challenges in model training is to determine the appropriate number of hidden states to use in the model. One approach is to search for models with different numbers of states. You can then compare their information criteria to identify the best model. In this study, we used the Akaike information criterion (AIC) to select the optimal model. After comparing the AIC values of the models with 1 to 10 states in Table 2, we found that the model with three hidden states had the smallest AIC value, making it the most suitable choice. The Poisson HMM model allows for multiple hidden states, each of which might represent times of varying customer demand.
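The selection rule is simple to state in code: AIC = 2k - 2 log L, where k is the number of free parameters, and the model with the smallest AIC wins. The log-likelihoods below are made-up placeholders chosen only to mirror the post's outcome (three states win); the parameter count is one common parameterization of a K-state Poisson HMM, not necessarily the one PROC HMM reports.

```python
def poisson_hmm_num_params(K):
    # K Poisson means plus K*(K-1) free transition probabilities
    # (initial-state probabilities are ignored in this simplified count).
    return K + K * (K - 1)

def aic(log_lik, k):
    return 2 * k - 2 * log_lik

# Hypothetical log-likelihoods per number of states K.
candidates = {1: -2470.0, 2: -2120.0, 3: -2045.0, 4: -2043.0}
scores = {K: aic(ll, poisson_hmm_num_params(K)) for K, ll in candidates.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 1))
```

Note how K=4 has a slightly better likelihood than K=3 but loses on AIC: the extra parameters of the fourth state are not worth their penalty, which is the whole point of using an information criterion rather than raw likelihood.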
The parameter estimates shown in Table 3 reveal that the means (lambda) of the Poisson distribution for the three hidden states are 0.148569, 4.141552, and 8.127111. The state with a mean of 0.148569 is particularly effective at generating zero values in the observations. We can categorize these three states as low, normal, and high-demand states in the retail industry. This would be helpful in making inventory plans.
The plot in Figure 4 overlays the decoded hidden states (blue line) of the Poisson HMM onto the count time series (red line). The predominance of zero values is reflected by the troughs in the blue line, which correspond to the low-demand state that generates mostly zeros.
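Decoding of this kind is typically done with the Viterbi algorithm: a dynamic program in log space that finds the single most likely hidden-state path given the observations. A self-contained sketch for a hypothetical two-state Poisson HMM (illustrative parameters, not the post's three-state fit, and not PROC HMM's internals):

```python
import math

def poisson_logpmf(k, lam):
    # Log of the Poisson probability mass function.
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def viterbi_poisson(obs, lambdas, log_trans, log_init):
    """Most likely hidden-state path for a Poisson HMM (log-space DP)."""
    K = len(lambdas)
    delta = [log_init[s] + poisson_logpmf(obs[0], lambdas[s]) for s in range(K)]
    back = []
    for y in obs[1:]:
        ptrs, nxt = [], []
        for s in range(K):
            prev = max(range(K), key=lambda r: delta[r] + log_trans[r][s])
            ptrs.append(prev)
            nxt.append(delta[prev] + log_trans[prev][s] + poisson_logpmf(y, lambdas[s]))
        back.append(ptrs)
        delta = nxt
    path = [max(range(K), key=lambda s: delta[s])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])   # follow backpointers
    return path[::-1]

# State 0 = low demand (lambda 0.2), state 1 = high demand (lambda 6.0).
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.2), math.log(0.8)]]
log_init = [math.log(0.5), math.log(0.5)]
obs = [0, 0, 1, 0, 7, 5, 8, 6, 0, 0]
states = viterbi_poisson(obs, [0.2, 6.0], log_trans, log_init)
print(states)
```

On this toy series the decoder assigns the run of zeros and small counts to the low-demand state and the burst of large counts to the high-demand state, which is the same behavior Figure 4 shows for the fitted model.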
The HMM procedure can also be used to forecast count time series. Table 4 shows the one-step forecast of the Poisson HMM. The state with low demand has a probability of 0.10884, the state with normal demand has a probability of 0.82668, and the state with high demand has a probability of 0.064485. The expected mean of the mixed Poisson distribution for the next period is 3.96397. The table also lists the quantiles of the predicted distribution. The 0.025 quantile equals 0, the median equals 4, and the 0.975 quantile equals 10.
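The forecast mean in Table 4 follows directly from the mixture structure: the expected value of a mixture of Poissons is the probability-weighted sum of the state means. Using the state probabilities and lambdas quoted above reproduces the reported figure up to rounding of the published estimates.

```python
# Figures quoted from Tables 3 and 4 of the post.
state_probs = [0.10884, 0.82668, 0.064485]   # low, normal, high demand
lambdas = [0.148569, 4.141552, 8.127111]     # estimated Poisson means

# Mixture mean: sum over states of P(state) * lambda(state).
forecast_mean = sum(p * lam for p, lam in zip(state_probs, lambdas))
print(round(forecast_mean, 5))
```

This also explains why the predictive distribution's 0.025 quantile is 0 despite a mean near 4: the low-demand state keeps meaningful probability mass at zero.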
In this post, we introduced the Poisson HMM. It is a useful tool for modeling discrete time series and dealing with overdispersed and serially correlated data. The HMM procedure implements a powerful technique that can estimate parameters, decode the hidden states, and forecast the series. For further information on HMMs, visit The HMM Procedure. Here you will find more types of HMMs, more algorithms, and more applications in different fields. The SAS code for this example can be downloaded from GitLab.
The post Using SAS Viya Machine Learning to classify COVID from non-COVID appeared first on The SAS Data Science Blog.
In this post, we will demonstrate how to utilize SAS Viya Machine Learning to train a convolutional neural network that can accurately detect patients with COVID-19 by using the transfer learning technique. As a reference, we will follow the methodology established in the study by Tuan D. Pham.
For this demonstration, we will use CT images from 80 COVID and 542 non-COVID subjects from Harvard Dataverse, Mendeley Data, and the Cancer Imaging Archive. All the COVID subjects have a confirmed positive COVID-19 diagnosis. The CT images for the COVID subjects and for six of the non-COVID subjects were originally 3-D DICOM files with 100+ image slices. The CT images for the remaining 536 non-COVID subjects are in PNG format, so all 3-D DICOM files were converted to 2-D PNG format. Images that did not include enough lung regions were removed from further analysis. In total, 1392 COVID and 1120 non-COVID 2-D CT images were used to train the classification model. Figure 1 shows an example of CT images for a COVID and a non-COVID subject.
To begin our pipeline, we first import all the images by using the image action set. Note that ‘decode’ must be set to ‘False’ for further deep learning analysis.
inputimg = s.CASTable(name='inputimg', replace=True)
s.image.loadimages(casout=inputimg,
                   path='COVID_chest_CT/covid_classify/',
                   caslib='dlib',
                   recurse=True,
                   labellevels=-1,
                   decode=False,
                   addcolumns=['width', 'height'])
Next, all the input images are resized to 224x224. They are normalized to a range of 0 to 255 by using the MINMAX normalization. These functionalities are available in the processImages action.
processed = s.CASTable(name="processed")
s.image.processImages(images=dict(table=inputimg),
                      steps=[
                          dict(step=dict(steptype='RESIZE', type='BASIC',
                                         height=224, width=224)),
                          dict(step=dict(steptype='NORMALIZE', type='MINMAX',
                                         alpha=0, beta=255))
                      ],
                      decode=False,
                      casOut=processed)
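The MINMAX step maps each image's minimum intensity to alpha and its maximum to beta. A toy sketch of that arithmetic on a short pixel list (illustrative only, not the image action's internals):

```python
def minmax_normalize(pixels, alpha=0, beta=255):
    # Linearly rescale so min(pixels) -> alpha and max(pixels) -> beta.
    lo, hi = min(pixels), max(pixels)
    scale = (beta - alpha) / (hi - lo) if hi != lo else 0.0
    return [alpha + (p - lo) * scale for p in pixels]

out = minmax_normalize([12, 80, 200])
print([round(v, 1) for v in out])
```

Rescaling every image onto the same 0-255 range keeps inputs comparable across the differing acquisition settings of the source datasets.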
Further, the processed images are converted to ImageTable by using the DLPy ImageTable class, as shown here. In this way, all the processed images can be sent to DLPy for further deep learning analysis.
my_images = ImageTable.from_table(processed, image_col='_image_', label_col='_label_')
Once the processed images are converted to ImageTable, they are then split into the training, validation, and test sets. Here, 64%, 16%, and 20% of the images are used for training, validation, and testing, respectively. The DLPy library is applied to implement the data splitting in a Pythonic way.
train_val, test_table = two_way_split(my_images, test_rate=20, seed=123)
train_table, val_table = two_way_split(train_val, test_rate=20, seed=123)
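The 64/16/20 proportions come from applying two 80/20 splits in sequence: 20% is held out for testing, then 20% of the remaining 80% (16% overall) becomes validation. The arithmetic, using the post's total of 1392 + 1120 images:

```python
total = 2512                   # 1392 COVID + 1120 non-COVID CT images
test = round(total * 0.20)     # first split: 20% test
train_val = total - test
val = round(train_val * 0.20)  # second split: 20% of remainder = 16% overall
train = train_val - val        # what is left: 64% overall
print(train, val, test)
```

Exact counts may differ by one or two images depending on how the split rounds, but the proportions are fixed by construction.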
Given that our classification task requires learning complex representations of COVID-19 from CT images, it is necessary to utilize a deeper model while avoiding overfitting. In this regard, the ResNet-50 model has been shown to be effective, achieving 93% accuracy according to Pham. However, training a model from scratch often requires a large amount of data to achieve satisfactory results. To mitigate this issue, we can utilize a model pretrained on the ImageNet data set to leverage its learned representations.
To load a pretrained ResNet50, we can utilize the DLPy library and specify certain model specific parameters such as the number of classes, the channels of the input data, and offsets, for example. The pretrained weights should be stored in an h5 file and specified when instantiating the ResNet50_Caffe class.
pre_train_weight_file = os.path.join(PRE_TRAIN_WEIGHT_LOC, 'ResNet-50-model.caffemodel.h5')

from dlpy.applications import resnet
from dlpy.model import AdamSolver, VanillaSolver, Optimizer
from dlpy.lr_scheduler import ReduceLROnPlateau, CyclicLR, PolynomialLR

resModel = resnet.ResNet50_Caffe(s,
                                 model_table='ResNet50_Caffe',
                                 n_classes=2,
                                 n_channels=3,
                                 width=224, height=224, scale=1,
                                 random_crop=None,
                                 pre_trained_weights=True,
                                 pre_trained_weights_file=pre_train_weight_file,
                                 include_top=False)
Finally, the fit function from the DLPy library is applied to fit the model. Here, VanillaSolver was specified along with a learning rate scheduler of the CyclicLR class. The optimizer was defined with the VanillaSolver, a log level of 2, a max_epochs of 20, and a mini_batch_size of 4.
lrs = CyclicLR(s, train_table, 4, 1.0, 1E-4, 0.01)
solver = VanillaSolver(lr_scheduler=lrs)
optimizer = Optimizer(algorithm=solver, log_level=2, max_epochs=20, mini_batch_size=4)
resModel.fit(data=train_table,
             optimizer=optimizer,
             gpu=dict(devices=[2]),
             n_threads=4,
             valid_table=val_table,
             log_level=2)
Now that we have built and trained the model, we can use the model to score the test data set. This is done by using the evaluate function from the DLPy library. The response is shown in Figure 2. For the given test images, our model classifies subjects with COVID with 99.4% accuracy. It classifies non-COVID subjects with 96.3% accuracy.
We can further visualize the performance of our classification model by creating a confusion matrix by using the valid_conf_mat function from the DLPy library. The response is shown in Figure 3. The confusion matrix is used to compare the predicted classes of true positives (column 1, row 1), false negatives (column 2, row 1), false positives (column 1, row 2), and true negatives (column 2, row 2). In this case, out of the validation data set, one subject with COVID was misclassified as non-COVID. Nine non-COVID subjects were misclassified as COVID subjects.
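Per-class accuracy falls directly out of the confusion matrix: divide each diagonal count by its row total. The totals below are hypothetical placeholders chosen to reproduce the reported rates; only the 1 and 9 misclassification counts come from the post.

```python
# Rows = actual class, columns = predicted class, as in Figure 3.
conf = [[166, 1],    # actual COVID:     166 correct, 1 predicted non-COVID
        [9, 234]]    # actual non-COVID: 9 predicted COVID, 234 correct

covid_acc = conf[0][0] / sum(conf[0])      # ~0.994
noncovid_acc = conf[1][1] / sum(conf[1])   # ~0.963
print(round(covid_acc, 3), round(noncovid_acc, 3))
```

Reading the off-diagonal cells this way also makes the error asymmetry explicit: nearly all mistakes are non-COVID subjects flagged as COVID, which for a screening model is usually the preferable direction.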
The image classification results can be plotted as shown in Figures 4 and 5. Here, we utilize the plot_evaluate_res function from DLPy to display a plot with an image that was correctly classified (actual class = predicted class, or img_type = ‘C’), along with a predicted probability bar chart for this given image.
By specifying the img_type to ‘M’, the code displays a plot with an image that was incorrectly classified (actual class not equal to predicted class, or img_type = ‘M’), along with a predicted probability bar chart for this image as shown in Figure 5.
Further, the heat_map_analysis function from DLPy can be applied to use color to indicate the regions of interest that provide the most useful information for the model to distinguish between the classes. We can use these regions of interest to understand how reliable our model’s predictions are and why the model struggles with misclassified images. The resulting figures are shown in Figures 6 and 7.
Figure 6 shows two subjects with correctly classified classes. The top row shows a COVID subject with the disease region mostly focused on the posterior lung. The heat map also highlights the posterior lung as the region of interest (yellow and red regions in the heat map). The bottom row shows a non-COVID subject with no diseased region appearing in the lung. The heat map highlights the central part of the lung without focusing on any specific areas.
Figure 7 shows two subjects with misclassified classes. The top row shows a healthy subject that is misclassified as a COVID subject. Although this subject does not have apparent disease areas in the lung, the classification model looked through the entire lung area (yellow and red regions covering nearly the entire heat map) and decided there is a higher likelihood that this subject is a COVID subject. The predicted probability is 51.24% for COVID and 48.76% for non-COVID.
The bottom row in Figure 7 shows another healthy subject that is misclassified as a COVID subject. Similar to the example in the top row, no apparent disease areas are present in the lung, so the heat map covers the whole image without focusing on a specific lung region. Like the subject in the top row, the classification model determined there is a higher chance that this subject is a COVID subject. The predicted probability is 54.75% for COVID and 45.25% for non-COVID.
The use of the optimizer and learning rate scheduler was crucial in achieving exceptional results after only three epochs. Non-adaptive optimizers (vanilla solver in this work), as demonstrated in the study by Wilson et al., tend to generalize better than adaptive optimizers. This is likely because non-adaptive optimizers do not make changes to their learning rate based on the current state of the model, which can prevent overfitting. On the other hand, adaptive optimizers adjust the learning rate based on the model's performance. This can lead to better performance on the training data but potentially poorer generalization to unseen data.
In addition to the use of a non-adaptive optimizer, the inclusion of a cyclic learning rate scheduler has been shown to significantly increase the speed of training in computer vision tasks. This is accomplished by periodically varying the learning rate over the course of training. In turn, this can help prevent the model from getting stuck in suboptimal configurations and facilitate faster convergence. Given the importance of both training speed and generalization capabilities for the model, the combination of these two elements is an optimal choice for achieving exceptional results.
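The "periodically varying" schedule can be made concrete with the triangular policy from Smith's cyclical learning rate work: the rate ramps linearly from a base value up to a maximum and back down, over and over. This sketch is a generic illustration of that policy; the function and its parameter names are our own, not DLPy's CyclicLR API.

```python
def triangular_lr(iteration, base_lr, max_lr, step_size):
    # Triangular cyclic schedule: one full cycle spans 2 * step_size iterations.
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)   # 1 at cycle edges, 0 at peak
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# One full cycle with step_size=4: ramp 1e-4 -> 1e-2 -> 1e-4.
lrs = [triangular_lr(i, 1e-4, 1e-2, step_size=4) for i in range(9)]
print([round(v, 4) for v in lrs])
```

The brief excursions to higher rates are what help the optimizer escape flat or suboptimal regions, while the returns to the base rate let it settle into good minima.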
In this post, we examined the process of training a pretrained convolutional neural network to classify COVID-19 CT scans by using SAS Viya Machine Learning, leveraging its learned representations to achieve exceptional results. The classification model in this work was built upon a pretrained ResNet-50 model. It achieved a 99.4% accuracy rate in classifying COVID subjects and a 96.3% accuracy rate in classifying non-COVID subjects. The classification model can accurately focus on the lung regions with disease infection on the COVID CT images.
While examining the misclassified subjects, it was found that most of them are non-COVID subjects that were misclassified as COVID subjects. None of these misclassified non-COVID subjects have apparent disease regions in the lung. Although the classification model assigns a COVID class to these subjects, each has only a slightly higher predicted probability for the COVID class than for the non-COVID class. Overall, SAS Viya Machine Learning provides a high degree of experimental design freedom and versatility in achieving our COVID classification model in this work.
Special thanks to Sebastian Alberto Neri for his contribution to this work. Sebastian is a rising senior student at Monterrey Institute of Technology in Mexico. He joined the Computer Vision team in August 2022 as an intern. This work is part of Sebastian’s internship project.