Blogs

Tag: Data Analysis

Analytics | Data Visualization | Programming Tips

Rick Wicklin | September 3, 2019

Cosine similarity of vectors

An important application of the dot product (inner product) of two vectors is to determine the angle between the vectors. If u and v are two vectors, then cos(θ) = (u ⋅ v) / (|u| |v|). You could apply the inverse cosine function if you wanted to find θ in

Read More
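The formula in the teaser above can be sketched in a few lines. This is a minimal illustration in plain Python, not the SAS code from the article; the function names `cosine_similarity` and `angle_between` are hypothetical and chosen only for this example.

```python
import math

def cosine_similarity(u, v):
    """Return cos(theta) = (u . v) / (|u| |v|) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def angle_between(u, v):
    """Return the angle theta (in radians) by applying the inverse cosine."""
    # Clamp to [-1, 1] so that floating-point rounding cannot push the
    # argument of acos outside its domain.
    c = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(c)
```

For example, `cosine_similarity([1, 0], [0, 1])` is 0 because the vectors are orthogonal, and parallel vectors such as `[1, 2, 3]` and `[2, 4, 6]` give a value of 1 up to rounding.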

Programming Tips

Rick Wicklin | August 26, 2019

Conditionally append observations to a SAS data set

Most SAS programmers know how to use PROC APPEND or the SET statement in the DATA step to unconditionally append new observations to an existing data set. However, sometimes you need to scan the data to determine whether or not to append observations. In this situation, many SAS programmers choose one

Read More

Programming Tips

Rick Wicklin | August 21, 2019

Two tips for optimizing a function that has a restricted domain

An important application of nonlinear optimization is finding parameters of a model that fit data. For some models, the parameters are constrained by the data. A canonical example is the maximum likelihood estimation of a so-called "threshold parameter" for the three-parameter lognormal distribution. For this distribution, the objective function is

Read More

Learn SAS | Programming Tips

Rick Wicklin | August 19, 2019

Timing performance in SAS/IML: Built-in functions versus Base SAS functions

One of my friends likes to remind me that "there is no such thing as a free lunch," which he abbreviates as "TINSTAAFL" (or TANSTAAFL). The TINSTAAFL principle applies to computer programming because you often end up paying a cost (in performance) when you call a convenience function that simplifies

Read More

Learn SAS | Programming Tips

Rick Wicklin | August 7, 2019

The essential guide to binning in SAS

Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques

Read More
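The two binning schemes named in the teaser above differ only in how the cut points are chosen. The sketch below uses plain Python rather than SAS and is not taken from the article. The helper names are hypothetical, and the quantile cut points rely on the default method of `statistics.quantiles`, which may differ from the quantile definition that a SAS procedure uses.

```python
import bisect
import statistics

def equal_width_cutpoints(values, k):
    """Return k-1 cut points that split [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def quantile_cutpoints(values, k):
    """Return k-1 cut points so that each bin holds roughly the same count."""
    return statistics.quantiles(values, n=k)

def assign_bin(x, cutpoints):
    """Return the 0-based bin index of x; a value equal to a cut point
    is placed in the bin to the right of that cut point."""
    return bisect.bisect_right(cutpoints, x)
```

Equal-width bins are easy to interpret but can be nearly empty when the data are skewed; quantile bins balance the counts at the cost of unequal widths.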

Learn SAS | Machine Learning | Programming Tips

Rick Wicklin | August 5, 2019

How to use PROC HPBIN to bin numerical variables

Binning transforms a continuous numerical variable into a discrete variable with a small number of values. When you bin univariate data, you define cut points that define discrete groups. I've previously shown how to use PROC FORMAT in SAS to bin numerical variables and give each group a meaningful name

Read More

Data Visualization | Programming Tips

Rick Wicklin | July 10, 2019

Find the center of each cell in a mosaic plot

I recently showed how to create an annotation data set that will overlay cell counts or percentages on a mosaic plot. A mosaic plot is a visual representation of a cross-tabulation of observed frequencies for two categorical variables. The mosaic plot with cell counts is shown to the right. The

Read More

Analytics | Learn SAS | Programming Tips

Rick Wicklin | June 26, 2019

Jump-start PROC LOGISTIC by using parameter estimates from PROC HPLOGISTIC

SAS/STAT software contains a number of so-called HP procedures for training and evaluating predictive models. ("HP" stands for "high performance.") A popular HP procedure is HPLOGISTIC, which enables you to fit logistic models on Big Data. A goal of the HP procedures is to fit models quickly. Inferential statistics such

Read More

Analytics | Data Visualization | Learn SAS

Rick Wicklin | June 24, 2019

Add loess smoothers to residual plots

When fitting a least squares regression model to data, it is often useful to create diagnostic plots of the residuals versus the explanatory variables. If the model fits the data well, the plots of the residuals should not display any patterns. Systematic patterns can indicate that you need to include

Read More

Analytics | Data Visualization | Learn SAS

Rick Wicklin | June 19, 2019

Influential observations in a linear regression model: The DFFITS and Cook's D statistics

A previous article describes the DFBETAS statistics for detecting influential observations, where "influential" means that if you delete the observation and refit the model, the estimates for the regression coefficients change substantially. Of course, there are other statistics that you could use to measure influence. Two popular ones are the

Read More

Analytics | Data Visualization | Learn SAS

Rick Wicklin | June 17, 2019

Influential observations in a linear regression model: The DFBETAS statistics

My article about deletion diagnostics investigated how influential an observation is to a least squares regression model. In other words, if you delete the i_th observation and refit the model, what happens to the statistics for the model? SAS regression procedures provide many tables and graphs that enable you to

Read More

Advanced Analytics | Programming Tips

Rick Wicklin | June 12, 2019

Leave-one-out statistics and a formula to update a matrix inverse

For linear regression models, there is a class of statistics that I call deletion diagnostics or leave-one-out statistics. These observation-wise statistics address the question, "If I delete the i_th observation and refit the model, what happens to the statistics for the model?" For example: The PRESS statistic is similar to

Read More

Learn SAS | Programming Tips

Rick Wicklin | June 10, 2019

5 reasons to use PROC FORMAT to recode variables in SAS

Recoding variables can be tedious, but it is often a necessary part of data analysis. Almost every SAS programmer has written a DATA step that uses IF-THEN/ELSE logic or the SELECT-WHEN statements to recode variables. Although creating a new variable is effective, it is also inefficient because you have to

Read More

Data Visualization | Learn SAS | Programming Tips

Rick Wicklin | June 3, 2019

Graph wide data and long data in SAS

Statistical programmers and analysts often use two kinds of rectangular data sets, popularly known as wide data and long data. Some analytical procedures require that the data be in wide form; others require long form. (The "long format" is sometimes called "narrow" or "tall" data.) Fortunately, the statistical graphics procedures

Read More

Analytics | Programming Tips

Rick Wicklin | May 28, 2019

The Theil-Sen robust estimator for simple linear regression

Modern statistical software provides many options for computing robust statistics. For example, SAS can compute robust univariate statistics by using PROC UNIVARIATE, robust linear regression by using PROC ROBUSTREG, and robust multivariate statistics such as robust principal component analysis. Much of the research on robust regression was conducted in the

Read More

Analytics | Learn SAS

Rick Wicklin | May 15, 2019

What is Kolmogorov's D statistic?

Have you ever run a statistical test to determine whether data are normally distributed? If so, you have probably used Kolmogorov's D statistic. Kolmogorov's D statistic (also called the Kolmogorov-Smirnov statistic) enables you to test whether the empirical distribution of data is different from a reference distribution. The reference distribution

Read More

Analytics | Programming Tips

Rick Wicklin | May 8, 2019

Discrimination, accuracy, and stability in binary classifiers

At SAS Global Forum 2019, Daymond Ling presented an interesting discussion of binary classifiers in the financial industry. The discussion is motivated by a practical question: If you deploy a predictive model, how can you assess whether the model is no longer working well and needs to be replaced? Daymond

Read More

Analytics | Programming Tips

Rick Wicklin | April 24, 2019

A CUSUM test for autoregressive models

The CUSUM test has many incarnations. Different areas of statistics use different assumptions and test for different hypotheses. This article presents a brief overview of CUSUM tests and gives an example of using the CUSUM test in PROC AUTOREG for autoregressive models in SAS. A CUSUM test uses the cumulative

Read More

Programming Tips

Rick Wicklin | April 22, 2019

The CUSUM test for randomness of a binary sequence

Many statistical tests use a CUSUM statistic as part of the test. It can be confusing when a researcher refers to "the CUSUM test" without providing details about exactly which CUSUM test is being used. This article describes a CUSUM test for the randomness of a binary sequence. You start

Read More

Learn SAS | Programming Tips

Rick Wicklin | April 3, 2019

Convergence in mixed models: When the estimated G matrix is not positive definite

I've previously written about how to deal with nonconvergence when fitting generalized linear regression models. Most generalized linear and mixed models use an iterative optimization process, such as maximum likelihood estimation, to fit parameters. The optimization might not converge, either because the initial guess is poor or because the model

Read More

Learn SAS | Programming Tips

Rick Wicklin | April 1, 2019

Matrix operations and BY groups

Many SAS procedures support the BY statement, which enables you to perform an analysis for subgroups of the data set. Although the SAS/IML language does not have a built-in "BY statement," there are various techniques that enable you to perform a BY-group analysis. The two I use most often are

Read More

Programming Tips

Schematic diagram of outliers in bivariate normal data. The point 'A' has large univariate z scores but a small Mahalanobis distance. The point 'B' has a large Mahalanobis distance. Only 'B' is a multivariate outlier.

Rick Wicklin | March 25, 2019

The geometry of multivariate versus univariate outliers

An important concept in multivariate statistical analysis is the Mahalanobis distance. The Mahalanobis distance provides a way to measure how far away an observation is from the center of a sample while accounting for correlations in the data. The Mahalanobis distance is a good way to detect outliers in multivariate

Read More

Data Visualization | Learn SAS

Rick Wicklin | March 20, 2019

Truncate response surfaces

An analyst was using SAS to analyze some data from an experiment. He noticed that the response variable is always positive (such as volume, size, or weight), but his statistical model predicts some negative responses. He posted the data and asked if it is possible to modify the graph so

Read More

Programming Tips

Rick Wicklin | March 18, 2019

Interpolation vs extrapolation: the convex hull of multivariate data

Statisticians often emphasize the dangers of extrapolating from a univariate regression model. A common exercise in introductory statistics is to ask students to compute a model of population growth and predict the population far in the future. The students learn that extrapolating from a model can result in a nonsensical

Read More

Data Visualization | Learn SAS

Rick Wicklin | March 6, 2019

Use PROC BOXPLOT to display hundreds of box plots

A previous article shows how to use a scatter plot to visualize the average SAT scores for all high schools in North Carolina. The schools are grouped by school districts and ranked according to the median value of the schools in the district. For the school districts that have many

Read More

Analytics | Data Visualization

Rick Wicklin | March 4, 2019

Visualize SAT scores in North Carolina

Standardized tests like the SAT and ACT can cause stress for both high school students and their parents, but according to a Wall Street Journal article, the SAT and ACT "provide an invaluable measure of how students are likely to perform in college and beyond." Naturally, students wonder how their

Read More

Analytics | Data Visualization

Rick Wicklin | February 20, 2019

An easier way to create a calibration plot in SAS

Last year I published a series of blog posts about how to create a calibration plot in SAS. A calibration plot is a way to assess the goodness of fit for a logistic model. It is a diagnostic graph that enables you to qualitatively compare a model's predicted probability of

Read More

Programming Tips

Rick Wicklin | February 18, 2019

An easier way to perform regression with restricted cubic splines in SAS

"Maybe if we think and wish and hope and pray / It might come true. / Oh, wouldn't it be nice?" (The Beach Boys) Months ago, I wrote about how to use the EFFECT statement in SAS to perform regression with restricted cubic splines. This is the modern way to use splines

Read More

Analytics | Learn SAS

Rick Wicklin | February 11, 2019

4 reasons to use PROC PLM for linear regression models in SAS

Have you ever run a regression model in SAS but later realized that you forgot to specify an important option or run some statistical test? Or maybe you intended to generate a graph that visualizes the model, but you forgot? Years ago, your only option was to modify your program

Read More

Analytics | Machine Learning | Programming Tips

Partition data into training, validation, and testing in SAS

Rick Wicklin | January 21, 2019

Create training, validation, and test data sets in SAS

In machine learning and other model building techniques, it is common to partition a large data set into three segments: training, validation, and testing. Training data is used to fit each model. Validation data is a random sample that is used for model selection. These data are used to select

Read More
