Video: Using Recode in JMP for data preparation

Data preparation before modeling is an unavoidable chore. One of the most time-consuming tasks can be cleaning up categorical data that may have misspellings, inconsistent capitalization and abbreviations, and the like. The Recode tool in JMP makes data prep a lot easier.

Watch this video by my colleague Ryan DeWitt to learn about the Group Similar Values option, which groups categories that are almost the same, ignoring such things as case, non-printable characters, whitespace and punctuation.

In the video, he also covers these other Recode options: Convert to Titlecase, Convert to Uppercase, Convert to Lowercase, Trim Whitespace, Collapse Whitespace, First Word and Last Word, All But First Word, and All But Last Word.

If you have JMP 12 or the trial version, you can follow along using the same sample data set that Ryan uses for the example in the video.

Read more about Recode right here in the JMP Blog.

Subscribe to the JMP channel on YouTube to see the latest videos.

Post a Comment

Submit an abstract to present in Prague

Discovery Summit Europe may not be until next March, but we’re already thinking about the conference agenda. The call for papers is now open to those who’d like to present.

If you’ve tackled an interesting problem with JMP, the conference steering committee wants to hear from you. Submit an abstract describing your work – it doesn’t have to be long, about 150-200 words. If your abstract is selected, you’ll present at the event in Prague, March 20-23, 2017.

As always, much of the agenda will be dedicated to user-led breakout sessions. Each year, our breakout presenters are a source of inspiration for the Discovery Summit – posing and challenging analytic theories, benchmarking best practices and conceiving innovative concepts.

Presenters will:

  • Share the Discovery Summit agenda with progressive analytic minds, including our keynote speakers, who are thought leaders in statistics, technology and innovation.
  • Shape the conference conversation about how to apply analytics in forward-looking companies around the globe.
  • Gather feedback from other attendees so you can refine your own analyses.
  • Demonstrate to your colleagues and managers how analytics benefits your organization.

Oh, and you get a discount, too! Paper and poster presenters receive 50 percent off conference admission, and student presenters receive complimentary conference admission.

If you aren’t interested in giving a talk or your application is better suited for a smaller, niche audience, consider presenting a poster. Posters can depict a class assignment, a research project or a business application. Posters will be judged based on their originality, innovative application and/or the use of visualization to express the data.

Not sure what to present? Take a look at past Discovery Summit presentation materials in the JMP User Community.

Post a Comment

Banner-fying your images

These days, customizing your social media profiles is crucial. Everyone can find images to populate their banners, walls and timelines. But sometimes a single banner image doesn't quite cut it. If you're anything like me, you aren't satisfied with only one picture for your LinkedIn profile banner (particularly if you have multiple interests you want to show off). So I used JMP Scripting Language to find a solution.

For example, I love art and computer science, and have these two images:

Art    ProgrammingSmall

I wanted something smoother than a vertical line to separate the two images on my banner.

Perhaps, I'd prefer something more like this to place on my profile page:

finalbannerSmall

This is the image I created for my LinkedIn profile, and you can do it, too! I wrote a short segment of JMP code that can blend two images of any size (though images in landscape work best), forming a picture with the correct LinkedIn banner height-to-width ratio. The usage isn't limited to LinkedIn, either. Whenever you need to combine pictures for a smoother image, you can use it.
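The original script is written in JSL and isn't shown here, but the core idea, replacing a hard vertical seam with a linear alpha ramp, is easy to sketch. Here's a minimal Python illustration; the function name, the image representation (lists of rows of RGB tuples) and the band width are my own assumptions, not details from the original script.

```python
def blend_images(left, right, blend_frac=0.25):
    """Blend two same-sized RGB images (lists of rows of (r, g, b) tuples)
    with a horizontal linear alpha ramp instead of a hard vertical seam.
    blend_frac is the fraction of the width used for the transition band
    (an illustrative choice; the original JSL script may differ)."""
    height, width = len(left), len(left[0])
    band = max(1, int(width * blend_frac))
    start = (width - band) // 2          # center the transition band
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            if x < start:
                row.append(left[y][x])
            elif x >= start + band:
                row.append(right[y][x])
            else:
                # alpha goes from 0 to 1 across the band
                a = (x - start) / band
                row.append(tuple(round((1 - a) * lc + a * rc)
                                 for lc, rc in zip(left[y][x], right[y][x])))
        out.append(row)
    return out

# Tiny demo: a 1x8 all-red strip blended into an all-blue one.
red = [[(255, 0, 0)] * 8]
blue = [[(0, 0, 255)] * 8]
banner = blend_images(red, blue, blend_frac=0.5)
```

Real images would of course be loaded and saved with an image library (in JSL, via its image functions), but the per-pixel weighted average is the whole trick.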

Read More »

Post a Comment

Video: Subsetting data from a JMP Distribution report

Let's say you are in the Distribution platform in JMP, and you have created a report that you wish to drill down into. Well, the Local Data Filter can help with that.

But perhaps you also want to share a portion of the data with a co-worker, and not just the report itself.

So how do you do that?

You could subset the data in the report. In JMP, you can do that visually and very easily with a few clicks. Check out this short video by my co-worker Ryan DeWitt, who demonstrates subsetting (and mentions the Local Data Filter) using some sample data included in JMP.

Thanks for watching!

Subscribe to the JMP channel on YouTube to see the latest videos.

Post a Comment

Video: Joining tables in JMP

Joining tables that are open in JMP is a task we often need to do when collecting and combining data from different sources or observation dates.

Plus, you may want to customize your join further, for example, by matching specific columns or even leaving out a few columns.

Watch the demo below to see how to customize joining data tables in JMP.

And then try it out yourself with sample data sets provided by my colleague, Ryan DeWitt, who created this video tip. The data sets are in the JMP User Community, which is the best place to ask questions about JMP and share your own knowledge. Join the community, if you haven't already!

Subscribe to the JMP channel on YouTube to see the latest videos.

Post a Comment

Visual Six Sigma: A practical approach to data analysis and process improvement

Do you want to discover new and useful knowledge in your data using interactive, dynamic graphical displays? Would you like to be able to make sound decisions faster by understanding the patterns of variation in your data and separating it into useful signal and random noise? You can, with the help of Visual Six Sigma: Making Data Analysis Lean, Second Edition!

What is Visual Six Sigma?
It is "a practical and pragmatic approach to data analysis and process improvement.... In the typical business environment of process improvement, people are looking for simple-to-use tools that can be used by everyone at all levels to rapidly explore and interpret data, and then use that understanding to drive improvement. By making these tools highly visual and engaging, we can accelerate the process of analysis and eliminate the need for advanced statistical analysis in all but the most complex of situations," wrote Andrew Ruddick, Andy Liddle and Malcolm Moore in a 2008 white paper.

Using the principles, concepts and detailed road map outlined in the book Visual Six Sigma – along with JMP – you can broaden and deepen data analysis and process improvement in your organization by making the tools intuitive and easy to use. Plus, the results are easy to interpret! You will be able to quickly see the important and useful patterns in your data – enabling you to improve processes, connect with customer needs and expectations, react to emerging market trends, and seize opportunities for growth.

This second edition incorporates ways to take advantage of developments that make the implementation of Visual Six Sigma even easier, further increasing the scope and efficiency of its application. The book was updated using JMP 12.2.0 (with detailed instructions and illustrative screenshots demonstrating the latest functionalities in JMP and JMP Pro). It also includes two new chapters: "Managing Data and Data Quality" and "Beyond 'Point and Click' with JMP."

Visual Six Sigma is a powerful way to help you focus on the relevant and important data you have and to use this data effectively. According to Ruddick, Liddle and Moore, visual approaches facilitate rapid exploration of the data to quickly find the " 'hot Xs,' the process inputs responsible for driving variation in product quality or associated with variation in product quality."

Now for some fun…

If you are among the first seven people to comment on this blog post describing your strategies for making data analysis lean in your organization, you could win a hardcover copy of Visual Six Sigma: Making Data Analysis Lean. (This is the first edition of the book – only seven of them are left!)

Be sure to enter your e-mail address when you write your comment so we can contact you if you are a winner. Only one book per commenter and for U.S. addresses only.

In addition, SAS is turning 40, and we're celebrating you, our users! Every Friday in July, we'll feature a special offer for SAS and JMP users. The discounts include specials on training, certification, books and SAS events.

How do you get these rewards?

  • Follow us on Twitter @SASSoftware or Facebook.
  • Watch these social channels each Friday in July to see the special offers.

Post a Comment

Helping clinical trials run better, faster

JMP Clinical enables everyone involved in clinical trials to work better and faster together -- and to help improve safety and data quality.

As you read this post over your afternoon coffee, scientists all over the world are hard at work trying to prevent the spread of deadly viruses, and cure and treat debilitating illnesses like cancer, HIV and Alzheimer’s.

When a breakthrough happens and one of those scientists puts her finger on a potentially helpful drug, her laboratory faces a new obstacle: the clinical trial.

With JMP Clinical, medical monitors, medical writers, clinical operations and reviewers can evaluate clinical trials efficiently and effectively. The latest version, JMP Clinical 6, which was released June 24, makes it possible for all involved to perform their jobs better and faster.

Designed for all organizations involved in clinical trials, including clinical research organizations (CROs), pharmaceutical and biotechnology companies, regulatory agencies, and medical universities, JMP Clinical has been on the market for seven years and is the gold standard in the industry, said Geoffrey Mann, JMP Life Sciences Product Manager. The new release makes an already-popular product much easier to use, and includes new tools that will help save time and money, and ultimately, produce better drugs.

Organizations of all sizes find value in the software. “While large pharmaceutical companies already have more than 100 copies of our software, the little company that has five employees is able to behave like a large pharmaceutical company when it uses JMP Clinical,” Mann said.

What’s new about JMP Clinical 6?

Every new drug that comes to market has to go through three rigorous phases of clinical trials. Such trials produce an enormous amount of data that has to be sorted and analyzed to answer questions such as:

  • What are the most common side effects of the drug?
  • Do side effects occur in certain populations more than others?
  • Does the drug do what it’s supposed to do?
  • What is the best dosage?

The all-new user interface of JMP Clinical was built to make it easier to answer these questions quickly and correctly. “It reduces the work by fivefold to tenfold,” said Mann.

The new user interface allows for both a tabulation and a visualization of the data. “Users can generate any table they want, and it’s interactively filtering according to the reviewer’s specifications,” he said. Users can also print or download static views of tables or visualizations as PDFs or PowerPoint slides.

JMP Clinical 6 facilitates collaboration across various divisions and between multiple users. “These configurations let you share all of your data and reviews with anyone in the world,” said Mann.

The new risk-based monitoring tools in JMP Clinical 6 are especially important, as data quality has always been a major issue in clinical trials. Mann tells a story about a data scientist at a CRO who was found to have been falsifying data for years. If that company had been using JMP Clinical, the problem would have been uncovered immediately, saving time and money, and also would have prevented bad drugs from going to market.

“The software runs all kinds of algorithms to find data quality issues, and it will discover things humans alone could never find,” said Mann.

The software helps to ensure that all parties involved in clinical trials stay honest. “It’s a check on everybody,” said Mann.

JMP Clinical helps regulatory agencies hold pharmaceutical companies and CROs to the highest standard. It helps pharmaceutical companies bring lifesaving drugs to the market faster. And it helps CROs ensure that their studies are free of falsification. This means that when that miracle drug is finally released – whether it’s a cure for cancer or a Zika vaccine – you’ll be able to trust it.

Learn more about JMP Clinical at the JMP website.

Post a Comment

Video: Red triangles in JMP

The little red triangles in JMP are ubiquitous, hard-working and powerful!

Here's a quick video by my colleague Ryan DeWitt on these drop-down menus that some users call "hot spots," "inverted triangles" or just plain old "triangles."

Subscribe to the JMP channel on YouTube to see the latest videos.

Post a Comment

Graph Makeover: Bars on a log scale

Every once in a while, I run across a bar chart on a log scale, and it always feels wrong. At first glance, I compare the bar lengths and start making comparisons. But eventually, I notice the log scale on the axis and try to convince my brain to forget everything it just saw and just compare the tops of the bars against the axis scale. In that sense, bars on a log scale are a special case of bars without a meaningful baseline.

Here’s a recent example I saw, comparing speeds for reading CSV files (comma-separated value text files).

csv_logbars

The source of the comparison is a white paper from the vendor for the coral-colored tool, ParaText from wise.io, showing how fast it is. The company can hardly be accused of deception in the visualization since using a log scale only makes the competitor speeds look closer to its own speeds. It's about 10x faster than R readr but looks only 2x faster. The only advantage ParaText gets from the log scale is that its speed looks very close to the black I/O bandwidth bar (the upper limit) when, in fact, the speeds are about half the I/O bandwidth.

Like any other non-trivial endeavor, data visualization often involves conflicting constraints that must be balanced. Yes, using bars on a log scale certainly interferes with gaining insight from the graph, but it’s possible that all the alternatives are worse. That’s why I always look at alternatives when making assessments of data visualizations.

Log scales are most useful when the underlying data is very skewed or varies by many orders of magnitude. This speed data is both skewed and varied, but not terribly so. The maximum variation is about 200:1, which is only two orders of magnitude. Immediately, we can try two variations on this chart:

  1. Keep the bars and change the scale to linear.
  2. Keep the log scale and change the bars.

Here’s a straightforward conversion to a linear scale. Using JMP, I’ve scaled all the values to be relative to the I/O bandwidth, so the black bars are not shown since they would all be at 100%.

csv_vbars
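That rescaling step is just a ratio. As a minimal sketch, with made-up speeds rather than the white paper's actual measurements (only readr and ParaText are tool names from the post; "pandas" and every number here are illustrative):

```python
# Hypothetical read speeds in MB/s -- illustrative values only,
# not the white paper's benchmarks.
io_bandwidth = 800.0
speeds = {"ParaText": 400.0, "readr": 40.0, "pandas": 25.0}

# Express each tool's speed as a percentage of the I/O bandwidth.
# The upper-limit bar can then be dropped: it would always sit at 100%.
relative = {tool: 100.0 * s / io_bandwidth for tool, s in speeds.items()}
```

With these toy numbers, ParaText plots at 50% of bandwidth, readr at 5%, and so on, which is exactly what the linear-scale bars above encode.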

I haven’t labeled the bars with values. I don’t think all the bars need labeling with exact values (I'd rather have a supporting table for that). But if I were sharing this in a report, I would try labeling the highest bar or two in each category for some grounding. I find all the rotated labels in the original to detract from the visual representation of the data and take too much effort to read.

The linear scale is not bad, and I already like it better than the original in that it portrays the speed differences among the products directly. One weakness of both charts is that the product labels are separated from their bars. Rotating the bars at least puts the bars and the legend labels into the same arrangement.

csv_hbars

A different grouping hierarchy lets us label the product bars directly.

Comparing across tests is now less direct, but I’m thinking that’s a less important comparison.

Now let’s go back to the beginning and try keeping the log scale and changing the data elements. Here’s a view using points and lines instead of bars.

csv_logline1

The points themselves are enough to carry the position information, but the lines add connection information, which helps simplify the labeling. In general, line segments carry three connotations:

  1. Interpolation (continuous)
  2. Connection (categorical)
  3. Pattern recognition (continuous or categorical)

Interpolation doesn’t make much sense here since our x-axis is categorical, so that’s a drawback. But connection is very valuable, and pattern recognition is informative, too. For instance, we notice that a couple of products have the same up-down-up pattern.

With the lines labeled in place, the color is not as necessary. While the color does help distinguish intersecting lines and help the data lines stand out from the grid lines, there is enough separation that we can try using color for technology group (R, Python or specialty) rather than individual labels.

csv_logline2

That makes the chart less busy but keeps the advantages of color.

The chart looks nice, but does it work? We still have a log scale, which still requires more thinking. But at least now the data elements are not in such conflict with the scale, and we have more room to show grid lines that reinforce the non-linearity. The log scale makes it easier to understand the differences between values across the entire range. In particular, we can see how the low values differ from each other better than we can on a linear scale.

It’s interesting to me that the data itself makes such a big difference in the usefulness of each chart option. The linear scale is at its limit of usefulness with differences around 10x. If the differences were more like 1000x, a linear scale would be useless. And if the values were too similar across products, the points would be obscuring each other and less useful.

Having seen a few possibilities, which is most effective for understanding the performance? Or would something else entirely be better?

Post a Comment

A statistical history of the Cy Young Award

“That’s all baseball is, is numbers; it’s run by numbers, averages, percentages and odds. Managers make their decision based on the numbers.” -- Rollie Fingers, 1981 AL Cy Young Award Winner

The Cy Young Award has been given to the best pitcher in baseball since 1956. The American and National leagues have had separate awards since 1967. Members of the Baseball Writers Association of America vote on the winners. And as Rollie Fingers said, many different numbers are taken into account when determining which pitcher is most worthy of the award.

Throughout the history of the Cy Young Award, the writers have relied upon the traditional statistics, such as earned run average (ERA), strikeouts (Ks), wins, saves and innings pitched (IP). Today’s Cy Young-voting writers have even more statistics to consider, with the development of sabermetric statistics like wins above replacement (WAR), fielding independent pitching, and batting average on balls in play.

This blog entry will look at the history of the Cy Young Award winners from a statistical standpoint, examining both traditional statistics and WAR, which offers a comprehensive measure of how valuable a player is to his team. For more information about how WAR is calculated, Baseball Reference has a good explanation.

The average stat line for the 110 Cy Young Award winners is shown below:

Cy Young Table 1

These stats are a little skewed since they include nine relief pitchers who have won the award. If we filter out relief pitchers because they pitch relatively few innings, here is the average stat line for the remaining pitchers:

Cy Young Table 2

During the 60-year period of the Cy Young Award, many of these statistics have remained relatively constant for the winners. ERA, win percentage, Ks and WAR have not shown a statistically significant change throughout the history of the award.

Other statistics such as wins, innings pitched and Ks/IP have shown a clear trend. These trends in IPs and wins can be attributed to pitchers getting fewer starts each year and managers paying more attention to pitch counts. For example, last year’s National League Cy Young Award winner, Jake Arrieta, won 22 games in 33 starts while pitching 229 innings. Arrieta had the most wins, second most starts and third most innings pitched in the MLB in 2015. If we look at the 1963 Cy Young winner, Sandy Koufax, he won 25 games in 40 starts while pitching 311 innings that year. Koufax had the most wins, third most starts and third most innings pitched in the MLB in 1963.

Up until the 1970s, many teams used a four-man starting pitcher rotation. This led to more starts, more innings pitched and the chance for more wins. Since the late 1970s, teams have moved to a five-man starting pitcher rotation. This has given pitchers less chance to win large numbers of games. The last pitcher to have more than 36 starts in a single season was Greg Maddux in 1991. Only two pitchers have had more than 40 starts since 1980. These trends can be seen in the JMP graph and table below:

CY Young graph 1

Cy Young Table 3

While the numbers of innings and wins have been trending down for Cy Young Award winners, the number of strikeouts per inning pitched has been trending upward. Before 1979, only one Cy Young-winning pitcher, Sandy Koufax in 1965, averaged more than one strikeout per inning pitched. In contrast, since 2000 over half of the Cy Young winners have averaged more than one strikeout per inning. This could be due to the negative correlation between number of innings pitched and strikeouts per inning pitched. This correlation is shown in the graph below.

There is a statistically significant difference in strikeouts per nine innings between pitchers in four-man and five-man rotations, which you can see below as well. While more innings pitched and fewer rest days appear to be contributing factors, another potential factor might be a different philosophy on pitching. Sandy Koufax was quoted as saying, "I became a good pitcher when I stopped trying to make them miss the ball and started trying to make them hit it."

CY Young graph 2

According to the wins above replacement (WAR) statistic, the five best seasons by Cy Young winners are listed in the table below. Although we will continue to debate what was the best season that a pitcher has ever had, this is a pretty good list:

Cy Young Table 4

This got me thinking: Were there any pitchers who did not win the Cy Young Award who had comparable WAR statistics to these top five? Looking at pitchers’ seasons back to 1956, I found one WAR statistic that would be in the top five from a pitcher who did not win the award: Wilbur Wood in 1971 had a WAR of 11.7. Wood finished third in American League Cy Young voting that year. You can hardly blame the writers, because the Cy Young winner Vida Blue had more wins, fewer losses, a lower ERA and more strikeouts than Wood. Also, the WAR statistic was not developed until 2004, so it did not factor into the voting.

Looking back through all of the Cy Young Award winners, 54 of the 110 led their league or MLB in WAR for pitchers. Excluding the relief pitchers who won, 54 of the 101 starters who won the Cy Young Award also led the league in WAR.

That led me to ask: Which Cy Young Award winners were least deserving by WAR standards? Five pitchers won the award with a WAR at least 4 lower than that of another pitcher.

Cy Young Table 5

WAR is not a perfect statistic, and the Cy Young Award should not default to the pitcher with the best WAR. But when there are large discrepancies, I hope the Cy Young-voting baseball writers will take notice. In 1990, Roger Clemens was clearly a better pitcher than Bob Welch. Not only did Clemens have the gap in WAR, he also had a much lower ERA (1.93 vs. 2.95), more strikeouts (209 vs. 127) and fewer home runs allowed (7 vs. 26). It is difficult to feel much sympathy for the seven-time Cy Young-winning (and allegedly steroid-using) Roger Clemens, but clearly he was more deserving of the award in 1990.

For a case of a player more deserving of sympathy for not receiving a Cy Young Award, look to Dave Stieb. Stieb led the AL in WAR for pitchers from 1982-1984 but never finished higher than fourth in the Cy Young voting. While Stieb is a seven-time All-Star, most baseball fans do not think of him when they think of the best pitchers of the 1980s. Perhaps a Cy Young Award or two would have changed that.

Following the creation of the WAR statistic in 2004, the writers have been pretty good about taking it into account. Since 2004, a difference of greater than 1 win above replacement between the Cy Young winner and the leader in the WAR (for pitchers) statistic has happened only four times, or about 17 percent of the time. Before 2004, such a gap occurred 41 times, or approximately 48 percent of the time. This trend is captured in the graph below:

CY Young graph 3
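The percentages above follow from the counts of winners in each era. The denominators below are my own back-of-the-envelope inference from the 110 total winners mentioned earlier (24 winners from 2004 onward, 86 before), not figures stated in the post:

```python
# Gap counts are from the post; the denominators (24 winners since 2004,
# 86 before) are an inference from the 110 total and are an assumption.
gaps_since_2004, winners_since_2004 = 4, 24
gaps_before_2004, winners_before_2004 = 41, 86

pct_since = 100 * gaps_since_2004 / winners_since_2004    # roughly 17%
pct_before = 100 * gaps_before_2004 / winners_before_2004  # roughly 48%
```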

Cy Young Award voting remains subjective, and that is part of the charm of baseball. The Cy Young voters seem to be paying attention to the newer sabermetric statistics when considering which pitcher is truly the most deserving of the award. That is a good thing for baseball now and into the future. While it is too late for Dave Stieb to win a Cy Young Award, I feel like today he would have a better chance than he did during his playing career.

Post a Comment