If you are an instructor who teaches large-enrollment introductory statistics courses and wishes to teach a modern data-driven course, read on.
You know about the challenges of assessing student mastery in courses where there are hundreds – or even thousands! – of students and little or no support for grading homework or exams. Online tools are available, but often these scalable tools for assessment are limited to multiple-choice formats emphasizing procedural technique and/or hand computation. Instructors who want to adopt modern approaches that emphasize concepts, application and use of data analysis software may find these online tools less than adequate for assessing student performance. So how might an instructor of a high-volume course who is constrained by grading resources assess students on their statistical reasoning abilities?
JMP is pleased to announce a new partnership with WebAssign, a provider of online assessment tools used by more than 2,600 colleges, universities and high schools. The result of this partnership is a collection of new assessment items integrating interactive and dynamic JMP graphs with questions directed at interpretation and reasoning. Best of all, these assessment items are free to WebAssign users and cover many of the concepts introduced in introductory statistics courses. While the assessment items look and act like JMP graphs and output, they are actually JMP HTML5 output embedded into the WebAssign application. So, the items completely stand alone and don’t require JMP to be installed or accessed.
The HTML5 output from JMP provides interactivity within a browser environment.
Check out the video below that illustrates these and other assessment capabilities with JMP. We hope you will give these new assessment items a try and consider adopting JMP for your course. For more information about accessing these items and using them in your own course, contact WebAssign.
One more thing...
In a previous post, we shared with you that JMP has integrated learning “applets,” or what we have referred to as interactive learning modules, into the core JMP product. These are concept demonstration tools that help students “see” the nature of core statistical concepts in an interactive and visual manner. These JMP applets are similar to Java applets that have been available for many years, but with the added benefit of a single interface and the ability to both analyze and explore concepts from the same data of interest.
With the partnership with WebAssign, JMP can now support three critical needs in your introductory statistics class: data analysis and visualization, concept demonstration and now assessment. While these foundational functions are fairly standard in the modern course, they have largely been provided by different products, which can add complexity to the learning process. Feel free to contact us at firstname.lastname@example.org if you have any questions or would like to see JMP in action.
In my previous post, I showed how we explored the eras of baseball using a simple scatterplot that helped us generate questions and analytical direction. The next phase was figuring out how I might use analytics to aid the “subject matter knowledge” that had been applied to the data. Could I confirm what had been surmised regarding the eras of the game? And, might we use analytics to surface more periods of time in the game that we should investigate, or at least, attempt to explain?
Generalized Regression is a method that we might use to build a predictive model, but in this case I wasn’t seeking to build one. I wanted to take advantage of the variable selection capabilities available in the Lasso option of this method: to see which variables were selected as “significant and important,” and in what order. I thought that whatever was selected would ultimately be the years I was interested in – that is, I expected to confirm my assessment of the eras of the game and probably find some interesting information that my “eyeball” method had overlooked.
Longtime SAS associate and friend Tony Cooper is one of the people who talks baseball with me. Besides being a sports fan in his own right, Tony is an expert JSL (JMP Scripting Language) programmer. In mere minutes, he created a script to help me develop a set of dummy variables that enabled better differentiation of run production through the years. It was a key to the results!
In Generalized Regression, I was looking for a method that would help me surface “change,” and I had seen this tool in action in other areas. My expectations were high, and this method did not disappoint. In fact, the results were exciting, indicating that this method could uncover “change” not only in baseball data but also in other areas!
Using the Generalized Regression platform in JMP enabled me to walk through the years of the game and confirm suspicions about its eras. It also surfaced what might not necessarily be an “era” but rather a demonstration of the effects of expansion in specific years, or (for example) the act of raising the mound in 1968 and then lowering it in 1969.
The graph and table shown below tell a story: the Solution Path shows the order in which variables enter the model. And in the accompanying table, the estimates tell us whether RPG (Runs Per Game) went up or down in the associated year.
Just like the scatterplot, this visualization led to questions as each year was presented. For example:
What was going on in 1993? Answer: It could have been the start of the “Steroid Era,” or it could be due to expansion (two new teams joined the league) and the resulting dilution of pitching talent as runs went up, or both!
What was happening in 1942? Answer: Runs went down as major league players left the league to join the military. The period from 1942 to 1946 is generally referred to as the “War Years.”
What was happening in 1920 and 1921? Answer: Baseball was leaving the “Dead Ball Era” for the “Live Ball Era.” Both players (like Babe Ruth) and events affected the game.
What about 1963? Answer: Baseball historians traditionally point to this year as the beginning of the period of expansion, and some call the period from 1963 to present the “Expansion Era.”
And finally, why were runs down in 2010? Answer: Baseball had cracked down hard on steroid users, and it may have caused a decrease in overall runs.
So, while having some fun playing with data from a favorite game, we were also able to demonstrate the effectiveness of generating questions and analytical direction using the simple scatterplot. We then showed how we could surface information at even greater levels using Generalized Regression. Potential applications to areas outside the arena of sports are endless. How might you use Generalized Regression?
Interested in seeing more?
You can learn more about the JSL script Tony Cooper created to aid in the analysis (and download it for yourself) by visiting the JMP File Exchange.
Follow the analysis using the Generalized Regression method by watching this step-by-step video:
I work with some brilliant people – there’s no doubt about that. Just around the corner and down the hall, you’ll find one of the most brilliant of them: Russ Wolfinger.
Russ is the JMP Director of Scientific Discovery and Genomics here at SAS, leading an R&D team for genomics and clinical research. He’s also a thought leader in linear and nonlinear mixed models, multiple testing and density automation.
Over the past year, I’ve heard rumblings of Russ’ involvement with various data science competitions like Kaggle and DREAM. These competitions are an excellent way to crowdsource ideas to solve some of the most complicated and pressing problems. They are open to data scientists around the world who want to lend their expertise and develop their skills in the process.
A couple of weeks ago, I had the chance to hear Russ talk about his obsession with these competitions. He is passionate about his work in this arena and in awe of the sharp, innovative minds in the data science and predictive modeling community, minds that are investing hundreds of hours developing effective models to more usefully deal with complex problems posed by big corporations and nonprofits alike.
Just ask him about it, and you’ll feel his enthusiasm right away. Even if you’re like me and don’t completely understand everything he’s throwing out there, you’ll be inspired by the excitement and implications of harnessing the cognitive talents of the top data scientists.
Of course, Russ is using JMP, SAS, Python and R to work on these challenges. He finds JMP software especially well-suited for the discovery and exploration phase of model building, saving him loads of time.
Earth Day focuses attention on big questions: What’s the future for Homo sapiens? How can we coexist sustainably with other species?
To stem the loss of biodiversity and ensure continued provision of essential ecosystem services, world leaders adopted the 20 Aichi Biodiversity Targets in 2010, to be fulfilled by 2020. One key target (Target 11) prescribes an expansion of the global protected area system to at least 17 percent of land surface and 10 percent of oceans.
One of the most revered conservation biologists of our time, E.O. Wilson, a Professor at Harvard and Duke, has an even bolder vision: We should set aside half the planet for the other species we share the Earth with.
That’s music to the ears of many who care about the future of our planet. But how to decide what to put aside to allow the majority of species to flourish? We’ll need not only political will, but more robust data on the numbers and distribution of the 16,000 endangered species we know (possibly 0.1 percent of all species).
Currently, species distribution maps more closely represent where conservation biologists range than the species they study.
Take one of the best studied species in the world: the cheetah. Much of its distribution across Africa (for example, Angola and Sudan) is completely unknown.
The cheetah (Acinonyx jubatus) is Africa's most endangered felid and listed as Vulnerable with a declining population trend by the IUCN Red List of Threatened Species. There are only between 7,000 and 10,000 cheetah globally, and Namibia has around a third of them.
The challenge of spotting cheetah
Spotting cheetah is not as easy as it might seem. You’d be forgiven for thinking such an iconic large cat, especially one that’s active during the day, should be pretty easy to count and geotag. Not so fast! Cheetah are surprisingly elusive, and generally have huge ranges, up to 1,400 sq km in Namibia. Roaming across commercial farmland in Namibia, they are often persecuted for raiding domestic livestock and have learned to keep a low profile. As a result, population estimates range from 2,905 to 13,520 and distribution maps show huge country-sized gaps over Africa.
Shortly after we’d formed WildTrack to monitor endangered species, we were approached by the N/a’an ku sê Wildlife Foundation, based near Windhoek in Namibia. The conservationists there were trying to make peace with local landowners – who were mostly tough, no-nonsense Afrikaners – to mitigate human/cheetah conflict. Unless the Foundation could relocate troublesome cheetahs to new areas, the farmers would simply shoot them. At that point, the then-Director of Research, Dr. Florian Weise, was convinced by his own trackers that footprints offered a new key to censusing and managing Namibian cheetah populations, helping understand distributions to keep cheetah and livestock separate. He reached out to ask us whether FIT could help.
A 'training data set' of cheetah footprints
Our first step in developing the FIT algorithm for the cheetah was to collect a "training data set" of footprints from cheetah of known ID, age-class and sex. Drawing on lessons learned from tiger footprinting at Carolina Tiger Rescue in North Carolina, we thought about adapting the technique to Namibian conditions. Florian and his expert team at N/a’an ku sê engaged the capable help of Chester Zoo in the UK, Cheetah Conservation Botswana, AfriCat and Foundation SPOTS in the Netherlands, where the training data set of cheetah footprints would be collected.
We had some initial challenges finding the right substrate texture to hold clear prints. We tried different sand/water mixtures and experimented with coaxing cheetah to walk at a "natural" pace over the sand trails we laid. Before long, we had demonstrated that captive cheetah can contribute directly to the conservation of their wild counterparts through their footprints, collecting 781 footprints (male:female ratio 395:386) from 110 trails and 38 cheetah.
Processing pictures of footprints
The digital images of footprints were then ready to be processed in the FIT software that runs as an add-in to JMP data analysis and visualisation software from SAS. FIT first takes the prints and optimises their presentation and orientation to a standardised format. It then allows measurements of distances, angles and areas to be made between anatomical points on the footprint image.
These are fed into a robust cross-validated discriminant analysis and the output processed by Ward’s clustering. The resulting output is a dendrogram that allows us to tweak the algorithm to provide classification for the number of individuals, sex and age-class. For cheetah, we have consistent accuracies of >90 percent for individual identification.
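FIT runs inside JMP, so the following is only a rough Python sketch of the same two stages, cross-validated discriminant analysis followed by Ward’s clustering, on simulated footprint measurements (all names and numbers here are stand-ins):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from scipy.cluster.hierarchy import linkage, fcluster

# Simulated stand-in for footprint geometry: 3 cheetah, 10 prints each,
# 5 distance/angle measurements per print
rng = np.random.default_rng(1)
centers = rng.normal(0, 3, size=(3, 5))
X = np.vstack([c + rng.normal(0, 0.5, size=(10, 5)) for c in centers])
ids = np.repeat([0, 1, 2], 10)

# Cross-validated discriminant analysis of known-ID prints
lda = LinearDiscriminantAnalysis()
acc = cross_val_score(lda, X, ids, cv=5).mean()

# Ward's clustering on the discriminant projection; cutting the
# dendrogram estimates how many individuals left the prints
proj = lda.fit(X, ids).transform(X)
clusters = fcluster(linkage(proj, method="ward"), t=3, criterion="maxclust")
print(acc, len(set(clusters)))
```

On well-separated prints like these, the cross-validated accuracy is high and the dendrogram cut recovers the three individuals.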
Last year, we were approached by the Journal of Visualized Experiments (JoVE) to publish an article. JoVE is the world’s first peer-reviewed scientific video journal and offered us an ideal opportunity to communicate about FIT in JMP for cheetah to a wide audience.
Our paper in JoVE details the whole process of FIT for cheetah, with video footage from Namibia. The paper will be available online in early May, and this is the full reference to it: Jewell, Z. C., Alibhai, S. K., Weise, F., Munro, S., Van Vuuren, M., Van Vuuren, R. Spotting Cheetahs: Identifying Individuals by Their Footprints. J. Vis. Exp., e54034, doi:10.3791/54034 (2016).
Let the field monitoring begin!
The FIT algorithm for cheetah is up and running in the field. Duke University Master's student Kelly Laity, supervised by Professor Stuart Pimm, N/a’an ku sê and WildTrack, has categorized the quality of footprints that can be used by FIT. We’re now going to apply the technique to monitoring cheetah at N/a’an ku sê fieldwork sites and are confident that it will begin to shed light on true numbers and distribution of cheetah.
FIT algorithms for many other species are in the pipeline, and together with many emerging technologies in conservation, engineering, forensic science, computer science and other disciplines, can begin to map species better.
Moving forward, if we are to set aside land for other species, we’ll need to make more efficient use of the land we have. We’ll need new technologies to intensify agriculture and develop innovative architectural engineering strategies to make cities healthier and better places to live.
Earth Day is a day to think about the challenges we face. How we approach them will, quite simply, determine our future on Earth. E.O. Wilson’s clear vision is a wonderful start.
To learn more about WildTrack, here's a quick look at our mission:
In consulting with companies about building models with their data, I always talk to them about how their data may differentiate itself over time. For instance, are there seasons in which you might expect a rise in flu cases per day, or is there an economic environment in which you might expect more loan defaults than normal? These are examples of key pieces of information that come with a challenge: How do you identify these periods in your data where change occurs? And, can you explain the change?
This topic is always top of mind when I work with customers. This week, the game of baseball is also on my mind, as the season is now underway.
Recent discussions with some of my friends who are baseball fans have centered on the history of the game, and how various rules and events affected the game over time. Or had the game been affected? Some thought the rule changes and events could not have had a significant impact, while others were noncommittal.
As a statistician with data and tools to analyze it, I decided to do a bit of research. It occurred to me that this was a nice opportunity to illustrate how we might discern the periods of time that have been affected by events, policy or rules. We could have fun with baseball data while keeping in mind that the same approach could apply to businesses in other industries.
Major League Baseball (MLB) makes its data publicly available through Sean Lahman (SeanLahman.com). From his robust set of files, I built a comprehensive database of SAS data sets featuring baseball data.
The approach to discerning different periods of time (or in the case of baseball, eras) was twofold: First, I would rely on the expert opinion of … myself. And second, I would explore an analytical technique to see if the result would agree and support expert opinion – and also, would it surface more periods of interest? You'll see this second part in a follow-up post.
I like to develop my data using the SAS DATA step, and did so from within SAS Enterprise Guide. In doing so, I developed a simple metric, Runs Per Game (RPG), believing it would be the metric best suited to reflect rule and event changes over time. It’s been said that “runs are the currency of baseball,” and if a rule or event disrupted the normal production of runs, then we should discuss it! I built the data set and seamlessly sent it to JMP.
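The actual prep used a SAS DATA step; a rough pandas equivalent, assuming the shape of the Lahman Teams table with its R (runs scored) and G (games) columns, and using illustrative numbers, might look like:

```python
import pandas as pd

# Illustrative rows in the shape of the Lahman "Teams" table:
# one row per team-season, R = runs scored, G = games played
teams = pd.DataFrame({
    "yearID": [1968, 1968, 1969, 1969],
    "teamID": ["DET", "SLN", "DET", "SLN"],
    "R": [671, 583, 701, 595],
    "G": [164, 162, 162, 162],
})

# Runs Per Game for each team-season, then the league mean per year
teams["RPG"] = teams["R"] / teams["G"]
mean_rpg = teams.groupby("yearID")["RPG"].mean().round(2)
print(mean_rpg.to_dict())
```

The resulting year-by-year mean RPG series is exactly what gets plotted against Year in the scatterplot.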
A graph spurs discussion
Using Graph Builder in JMP, I quickly created one of my favorite means of analytical communication: the scatterplot. This one featured the mean RPG versus Year. And, as soon as I built the graph (and shared it), the questions and observations from my friends started to flow:
Why were there so many runs before 1900?
Why were there so few runs between 1900 and 1920?
Why did runs fall off in the early 1940s?
Runs didn’t rise as much as I had expected in the 2000s…
What era are we in now?
The graph evolved a bit as we discussed these questions. Here’s the scatterplot of Mean Runs Per Game Through the History of Baseball that triggered these questions and many more.
I added the colors and references lines as the eras of the game were differentiated in our discussions. The majority of the questions directly related to eras as identified by baseball historians.
Some of the questions (and answers) were as follows:
Why were there so many runs scored before 1900?
Until 1887, the batter could essentially call the pitch (i.e., “high or low”), and the pitcher was obligated to comply.
Until 1885, “flat” bats were used.
Until 1883, pitches were launched below the waist and had less velocity.
Until 1877, there were “fair foul hits”: balls that landed in fair territory and “kicked out” before first or third base were considered hits (today they are called foul balls).
This era was known as the “19th Century Era.”
Why were there so few runs scored from 1900 to 1920?
Many manufacturers produced baseballs with poor and inconsistent specifications.
Teams used the same ball literally until the cover came off – it became dirty and difficult to see.
This era was known as the “Dead Ball Era.”
What happened to increase run production after 1920?
After Ray Chapman was hit and killed by a pitch, baseball began using clean balls. Witnesses stated that Chapman didn’t even flinch, which led most to believe that he hadn’t seen the ball approaching.
Home run hitters like Babe Ruth emerged.
Consistent manufacturing (with consistent rubber cores) made the baseballs come off the bats more readily.
This era was known as the “Live Ball Era.”
What happened in the early 1940s? It appears runs fell off again.
Replacement players played the games as “the regulars” joined the military during World War II.
This period is not always called an era, but referred to as the “War Years.”
Questions continued to bubble up… The discussion continued in many interesting directions, for example:
The era from 1947 to 1962 is referred to as the “Integration Years,” as Jackie Robinson joined the Dodgers on April 15, 1947.
The era that begins in 1963 still perplexes baseball historians as to what to call it, or even how many eras might exist from 1963 to the present. Here are some of the events and rule changes that have affected the game since 1963:
The league expanded from 16 to 30 teams, effectively diluting the talent among teams and prompting many to refer to this entire period from 1963 to present as the “Expansion Era.”
The American League instituted the “Designated Hitter” into the game in 1973, leading some to refer to the period from 1973 to present time as the “Designated Hitter Era.”
Rumors of players using performance-enhancing drugs surfaced in the mid-1990s, resulting in some calling 1995 to 2009 the “Steroid Era.”
What’s cool about all this “discovery” is that it emerged from the initial scatterplot and the identification of what appear to be clusters of years with similar RPG. As we identified clusters of years with similar run production, we either explained the reason behind the cluster, or noted it as a period of change due to an unknown cause (and looked forward to researching it further!).
Next week, we'll use analytics to try to confirm these eras of the game and possibly uncover more periods worth investigating.
Interested in seeing more? This step-by-step video shows how I created the graph in Graph Builder:
Designed experiments, especially small DOEs, are a perfect place to practice model visualization. Another term for this could be “analysis by Graph Builder.”
I am not suggesting that you don’t build numerical models and look at parameter estimates, p-values and residuals. However, I am saying, “Plot the data.” I often will plot the data, build the model, revisit the plots and revisit the models. That is, graphics are part of my model-building process.
For example, I was sent the results of a four-factor fractional factorial design by a client. In Graph Builder, I plotted the four main effects and found that not only did factor X_4 have a strong effect, but I could also see something fun at the high level of X_4. The two groupings of data immediately suggest an interaction.
With the interactivity of JMP, you can highlight the data and quickly figure out that the groupings at the high level of X_4 are most likely due to the level of X_2.
So it took four factors, eight runs, data collection and measurements, followed by a quick analysis in Graph Builder to show that the outcome appears to be driven by X_4, with the impact of X_4 dependent on X_2.
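The client’s data isn’t shown here, but the pattern described above can be reproduced on simulated data: an eight-run 2^(4-1) design where the effect of X_4 depends on X_2 (column names and effect sizes are made up for illustration):

```python
import itertools
import numpy as np
import pandas as pd

# A 2^(4-1) fractional factorial: 8 runs, generator X4 = X1*X2*X3
runs = pd.DataFrame(list(itertools.product([-1, 1], repeat=3)),
                    columns=["X1", "X2", "X3"])
runs["X4"] = runs.X1 * runs.X2 * runs.X3

# Simulated response in the pattern described: a strong X4 main
# effect whose size depends on X2 (an X2*X4 interaction)
rng = np.random.default_rng(3)
runs["Y"] = 10 + 3 * runs.X4 + 2 * runs.X2 * runs.X4 + rng.normal(0, 0.2, 8)

# The "two groupings" at the high level of X4: mean Y split by X2
print(runs[runs.X4 == 1].groupby("X2")["Y"].mean().round(1).to_dict())
```

Plotting Y against X_4 and coloring by X_2 shows the same two groupings at high X_4 that Graph Builder revealed interactively.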
This Graph Builder analysis was all that was needed for this small designed experiment. To me, this is the beauty of DOE – and model visualization.
I used JMP to explore a recent FAA drone data set, inspired by a weekly data visualization challenge called 52Vis. The data set contains "reports of unmanned aircraft (UAS) sightings from pilots, citizens and law enforcement." I decided to focus on exploring the time data. I'll describe how I prepared the time data, noticed some quality issues in it and categorized it accordingly. Fortunately, JMP makes it easy to do visual data cleanup.
Preparing the data
The data is provided in two separate Excel files, and JMP's Excel import UI had no problem importing them with the default settings. However, the column names were a bit messy and didn't even match each other.
I cleaned up the column names to match and then used Concatenate under the Tables menu to append one data table to the other. Concatenate even handled the columns being in different orders.
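For readers outside JMP, a pandas analogue of that cleanup-and-concatenate step might look like this (the messy column names are invented for illustration):

```python
import pandas as pd

# Stand-ins for the two imported Excel tables, with mismatched and
# differently ordered column names
a = pd.DataFrame({"DateTime": ["2015-08-01 14:00"], "City ": ["Denver"]})
b = pd.DataFrame({"city": ["Reno"], "Date Time": ["2015-09-02 09:30"]})

# Clean the names to match, then append one table to the other
# (the equivalent of Tables > Concatenate in JMP)
a.columns = ["DateTime", "City"]
b = b.rename(columns={"city": "City", "Date Time": "DateTime"})
combined = pd.concat([a, b], ignore_index=True)
print(combined.shape)
```

`pd.concat` aligns by column name, so the differing column order is handled automatically, just as Concatenate did.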
Finding quality issues
I started playing with the data and noticed some quality issues with the time field. I created a new column that extracts the time of day from the DateTime column and made a histogram for it. I called it Time of Day Table because I'll be computing a different time of day later.
The most noticeable feature is the spike at midnight. It turns out there are a lot of times reported as exactly 12:00 AM, apparently as a missing value code. A second feature is the smaller but significant dip at noon. Likely, some of the noon times have had the AM/PM part miscoded, and some of the noon-hour times are reported in the midnight-hour slot (in addition to the missing value times).
While the distribution looks nice otherwise, it does seem to be shifted more toward the evening hours than expected. If only there were some way to quality-check the time values...
In fact, there is a way to quality-check the time values! The Report column contains free text describing the incident and often includes a report time in 24-hour GMT (note the Z suffix).
So let's extract the GMT time and compare with the time in the table. I'll make a new formula column based on a regular expression to extract the digits between "Time:" and "Z" in the report text.
And another one to convert it to a JMP time value (number of seconds).
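Those two formula columns can be approximated in one small Python function, assuming the report time always appears as four digits between "Time:" and "Z":

```python
import re

def gmt_seconds(report):
    """Extract the 24-hour GMT time between 'Time:' and 'Z' in the
    report text and convert it to seconds past midnight; None if absent.
    Assumes a four-digit HHMM form, e.g. 'Time: 1345Z'."""
    m = re.search(r"Time:\s*(\d{2})(\d{2})Z", report)
    if not m:
        return None
    hours, minutes = int(m.group(1)), int(m.group(2))
    return hours * 3600 + minutes * 60

# Hypothetical report text in the style of the FAA data
print(gmt_seconds("PRELIM INFO ... Time: 1345Z ... ALT 2000 FT"))
```

Reports lacking the "Time: ...Z" fragment simply come back as missing, which feeds directly into the categorization below.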
The new time column is called Time of Day Extracted Z, and we can compare it with the table time in a scatterplot to reveal some interesting patterns.
I added two diagonal reference lines. The orange one corresponds to EST, and the diagonal clusters near it are other US time zones. The gray diagonal is where the time in the table is the same as GMT. Lots of those! Strangely, few of those occur before 1 PM (1300Z), which seems reasonable for GMT sighting times in the US but not for local times.
At this point, we can think of the times as falling into three categories:
Missing -- there was no GMT time within the free text
Identical -- the GMT time was the same as the "local" time
Offset -- the GMT time was at an offset from the local time, usually something reasonable for the US
I added a couple columns to compute that classification and then plotted a histogram for each category.
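The classification rule itself is simple; a Python sketch of it (with both times as seconds past midnight) could be:

```python
def classify(table_secs, gmt_secs):
    """Categorize a sighting's time pair into the three categories
    described above. gmt_secs is None when no GMT time was found
    in the free text."""
    if gmt_secs is None:
        return "Missing"
    if table_secs == gmt_secs:
        return "Identical"
    return "Offset"

# 49500 seconds = 1:45 PM; 31500 seconds = 8:45 AM
print(classify(49500, None), classify(49500, 49500), classify(31500, 49500))
```

A finer version could subdivide "Offset" by whether the offset matches a plausible US time zone.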
Now we can see how the Identical category was skewing the local times into the evening. The Offset category has a more expected peak during daylight hours and matches well with the times where the GMT times were Missing. Both of the latter suffer from the 12:00 AM spike and 12:00 PM drop, due to coding issues. It's unclear why the Identical category has a mostly uniform distribution. It could be a combination of two offset normal distributions, one for the East Coast and one for the West Coast.
By considering the extracted GMT time, we can filter out suspicious time values for further time-based analysis. That is, we have more trust in the local times that are offset from GMT. And since those missing a GMT time have a similar distribution, we can use those with similar confidence.
The 52Vis challenge is geared toward scripting solutions, and fortunately JMP offers both styles of interfaces: interactive and scripting. Though my initial explorations used the interactive interface, I made a script in the JMP Scripting Language (JSL) for the challenge that reproduces all of my steps. If you're interested, you can see the JSL in an earlier version of this post on GitHub.
A couple more things...
Even with only a few columns of data, there's still more to explore. You can use the geocoder add-in to convert the city/state to longitude and latitude for mapping.
And did you notice the altitude data in the report text I showed? A good number of the reports have altitude data and sometimes information on the appearance or size of the unmanned aircraft.
Just days after the baseball season has started, two 10th-graders from a public high school in Delaware and their math teacher are headed to Furman University for the Carolinas Sports Analytics Meeting on Saturday to talk baseball.
The students, Umar Khan and Tyler Schanzenbach, both 15 years old, are presenting a poster on “Rating Offensive Production in Baseball.”
The students analyzed and visualized real data from a collegiate summer baseball team, the Martha’s Vineyard Sharks, a team in a league that attracts major league scouts' attention.
Math teacher Jesse McNulty advises the Sports Analytics Club at the William Penn High School in New Castle, Delaware. Umar and Tyler are two of the club’s founding members, and they were recently featured in a public radio story about their baseball analysis.
Thanks to the efforts of McNulty, the students received play-by-play data from the Sharks in the summer of 2015. From that data, they sought to understand who the stronger and weaker hitters on the team were. McNulty received game sheets from coaching staff via email each night following games. He sent the game sheets to Tyler and Umar for tracking and upload, and then the analysis was sent back to the coaching staff -- typically, an eight-to-10-hour turnaround time.
"The data was used to support team personnel/lineup decisions made by the staff," McNulty says.
Did it help? Well, the Sharks had seven more wins in 2015, the year they received analysis from the students, than the year before.
When they started the project, the students had no idea how good any of the players were. So they used metrics such as offensive production index, hard hit percentage and quality plate appearance to find out which players contributed the most to their team’s success.
They worked more than 50 hours last summer on the project. Yes, during summer vacation.
On Saturday, the teenagers get to present their work at a sports analytics conference. You can take a closer look at their poster by clicking on the image below. And they just found out that they will also get to present this poster at JSM in Chicago! Congratulations!
I asked McNulty some questions about this project and others he’s working on with students at his school.
Q. How did you present the results to the Sharks? McNulty: Data was presented with a variety of charts and visuals created with the use of Google Docs in the first half of the season and using JMP in the end of the season.
Q. What was the response of the team to the analysis? McNulty: It's important to note that it was not a perfect system on day one. There was a lot of learning that our students took in to correct, adapt and improve to make the data analysis and presentation to the team effective. We really needed our students to understand what information was important for the staff, and we needed to present that to the staff in a way that was simple and easy to understand.
Q. What other projects are the students working on? Any other sports? McNulty: Our students have been working on developing some systems for analysis with a professional ultimate disc team, the Los Angeles Aviators of the AUDL. They have also been building some visualizations of the team's 2015 data and using JMP Student Edition to determine what player contribution data affects team success. We also have our students working on a variety of projects for sports teams at our high school and using JMP to provide data analysis using Distributions.
Q. What do you hope students get out of working on such projects? McNulty: My goal is to immerse our students in genuine learning experiences with real teams. I feel that not only will it help our students build their networks and resumes, but it will also extend their learning beyond the classroom and their textbooks.
Q. Why did you start the sports analytics club? McNulty: The group was started out of a collective student interest to evaluate data in sports. Umar and Tyler were students in my ninth grade Integrated Mathematics I class and asked if we could use some time after school a couple of days a week to tackle patterns in data in sports. I was happy to oblige and used my connections in the sports industry to find our first connected learning opportunity with a team.
Q. What are your hopes for this club years from now? McNulty: We are looking to build in opportunities and projects to promote women in STEM. I would love to have this program be a feeder pattern for sports internships once our students are in college. I would love for this program to showcase the great work our students are doing, and extend their reach to colleges and universities.
Model visualization? Data visualization has gained traction in the past few years, with numerous interesting books and talks focusing on improving our data visualization skills. JMP’s own Xan Gregg recently spoke about data visualization on Analytically Speaking. Model visualization is simply applying data visualization to models.
When I can “see” the model, my confidence and my ability to share and explain the model also increase. While the Profiler is one model visualization tool in JMP not to be overlooked, here I will look at a simple data plot for a specific type of model as an illustration of model visualization.
One area that I work in is medical diagnostics (e.g., a blood test for some condition). We often start with a large number (say, 1,000) of potential model factors (often referred to as markers, as in a biological marker for a disease, condition or state) and then work to reduce this to a small number (say, 10) of factors to build an algorithm (model) that can be implemented on a piece of lab equipment.
Throughout this process, we have candidate models to evaluate. ROC curves and the area under the ROC curve (AUC) are one standard way to evaluate such diagnostic models. However, while these are helpful, they don’t tell the whole story. Different-shaped curves can have similar AUC, and often one area of the curve (say where sensitivity is high) may be of more clinical interest than the whole curve.
In addition to ROC curves, I have found that a plot of the disease state by the model outcome using data points as well as violin plots (as shown below) aids in the evaluation and understanding of the model. This plot provides a sense of how well the model can separate the data and provides an immediate feel for how shifting the cut-off (the value used to distinguish positive from negative) impacts the diagnostic performance.
Such a plot is also useful in understanding whether the diagnostic test may be more useful as a three-way test where you can identify with confidence negatives and positive subjects but have subjects in the middle for whom additional information is warranted before a clinical judgment is made.
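To make the two views concrete, here is a minimal sketch in Python using simulated marker scores (all data, names and settings below are illustrative, not from an actual diagnostic study): it computes the ROC curve and AUC with scikit-learn, then draws the score-by-disease-state violin plot with the raw points overlaid.

```python
# Two evaluation views of a diagnostic model, on synthetic data.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Simulated model scores: disease-positive subjects score higher on average.
neg = rng.normal(loc=0.0, scale=1.0, size=200)   # disease-negative
pos = rng.normal(loc=1.5, scale=1.0, size=200)   # disease-positive
scores = np.concatenate([neg, pos])
labels = np.concatenate([np.zeros(200), np.ones(200)])

# Standard summary: ROC curve and its AUC.
auc = roc_auc_score(labels, scores)
fpr, tpr, thresholds = roc_curve(labels, scores)

# Complementary view: score distribution by disease state.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.plot(fpr, tpr)
ax1.set(xlabel="1 - specificity", ylabel="sensitivity",
        title=f"ROC (AUC = {auc:.2f})")
ax2.violinplot([neg, pos], showmedians=True)
for i, grp in enumerate([neg, pos], start=1):
    # Jittered raw points make the overlap region visible.
    ax2.scatter(np.full_like(grp, i) + rng.normal(0, 0.04, grp.size),
                grp, s=6, alpha=0.3)
ax2.set(xticks=[1, 2], xticklabels=["negative", "positive"],
        ylabel="model score", title="Separation by disease state")
fig.savefig("model_visualization.png")
```

The left panel gives the familiar one-number AUC summary; the right panel shows directly how much the two score distributions overlap, which is exactly what moving the cut-off trades away.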
Model visualization is an integral part of my model building strategy. Do you have a favorite model visualization?
In 2011, my colleague Prof. Chris Nachtsheim and I introduced Definitive Screening Designs (DSDs) with a paper in the Journal of Quality Technology. A year later, I wrote a JMP Blog post describing these designs using correlation cell plots. Since their introduction, DSDs have found applications in areas as diverse as paint manufacturing, biotechnology, green energy and laser etching.
When a new and exciting methodology comes along, there is a natural inclination for leading-edge investigators to try it out. When these investigators report positive results, it encourages others to give the new method a try as well.
I am a big fan of DSDs, of course, but as a co-inventor I feel a responsibility to the community of practitioners of design of experiments (DOE) to be clear about their intended use and possible misuse.
So when should I use a DSD?
As the name suggests, DSDs are screening designs. Their most appropriate use is in the earliest stages of experimentation when there are a large number of potentially important factors that may affect a response of interest and when the goal is to identify what is generally a much smaller number of highly influential factors.
Since they are screening experiments, I would use a DSD only when I have four or more factors. Moreover, if I had only four factors and wanted to use a DSD, I would create a DSD for six factors and drop the last two columns. The resulting design can fit the full quadratic model in any three of the four factors.
DSDs work best when most of the factors are continuous. That is because each continuous factor has three levels, allowing an investigator to fit a curve rather than a straight line for each continuous factor.
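The fold-over structure behind a DSD is easy to sketch. The block below builds the 13-run, six-factor DSD as D = [C; −C; 0] from a 6×6 conference matrix, then drops the last two columns to get the recommended four-factor design. The particular conference matrix shown is one valid choice, not necessarily the one JMP's DOE platform generates.

```python
# Fold-over construction of a 6-factor Definitive Screening Design:
# D = [C; -C; 0], where C is a 6x6 conference matrix (zero diagonal,
# +/-1 elsewhere, orthogonal columns). This C is one valid choice.
import numpy as np

C = np.array([
    [ 0,  1,  1,  1,  1,  1],
    [ 1,  0,  1, -1, -1,  1],
    [ 1,  1,  0,  1, -1, -1],
    [ 1, -1,  1,  0,  1, -1],
    [ 1, -1, -1,  1,  0,  1],
    [ 1,  1, -1, -1,  1,  0],
])
assert np.array_equal(C.T @ C, 5 * np.eye(6))  # conference-matrix property

D6 = np.vstack([C, -C, np.zeros((1, 6), dtype=int)])  # fold-over + center run
print(D6.shape)  # (13, 6): 2m+1 runs for m = 6 factors

# For a 4-factor experiment, build the 6-factor DSD and drop the last two
# columns; the result still supports the full quadratic model in any
# three of the four remaining factors.
D4 = D6[:, :4]
print(D4.shape)  # (13, 4)
```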
When is using a DSD inappropriate?
[Figure: The optimal split-plot design dramatically outperforms the Definitive Screening Design created by sorting the hard-to-change factor, wp. See point 4) below.]
1) When there are constraints on the design region
An implicit assumption behind the use of DSDs is that it is possible to set the levels of any factor independently of the level of any other factor. This assumption is violated if a constraint on the design region makes certain factor combinations infeasible. For example, if I am cooking popcorn, I do not want to set the power at its highest setting while using a long cooking time. I know that if I do that, I will end up with a charred mess.
It might be tempting to draw the ranges of the factors inward to avoid such problems, but this practice reduces the DSD’s power to detect active effects. It is better to use the entire feasible region even if the shape of that region is not cubic or spherical.
2) When some of the factors are ingredients in a mixture
Similarly, using a DSD is inappropriate if two or more factors are ingredients in a mixture. If I raise the percentage of one ingredient, I must lower the percentage of some other ingredient, so these factors cannot vary independently by their very nature.
3) When there are categorical factors with more than two levels
DSDs can handle a few categorical factors at two levels, but if most of the factors are categorical, using a DSD is inefficient. DSDs are also generally an undesirable choice if categorical factors have more than two levels. A recent discussion in the Design of Experiments (DOE) LinkedIn group involved trying to modify a DSD to accommodate a three-level categorical factor. Though this is possible, it requires using the Custom Design tool in JMP, treating the factors of the DSD as covariate factors and adding the three-level categorical factor as the only factor whose levels are chosen by the Custom Design algorithm.
4) When the DSD is run as a split-plot design
It is also improper to alter a DSD by sorting the settings of one factor so that the resulting design is a split-plot design. For the six-factor DSD, the sorted factor would have only three settings: five runs at the low setting, three runs at the middle setting and five runs at the high setting. Using such a design would mean that inference about the effect of the sorted factor would be statistically invalid.
5) When the a priori model of interest has higher order effects
For DSDs, cubic terms are confounded with main effects, so identifying a cubic effect is impossible.
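The aliasing is exact, not approximate: a DSD sets each continuous factor only to −1, 0 or +1, and x³ = x at each of those values, so the cubic column of the model matrix is identical to the main-effect column. A quick check:

```python
# Why cubic effects are aliased with main effects in a DSD: at the only
# levels a DSD uses (-1, 0, +1), x**3 == x exactly, so the cubic model
# column equals the main-effect column for every factor.
import numpy as np

levels = np.array([-1, 0, 1])
assert np.array_equal(levels**3, levels)

# The same holds column-wise for any three-level (-1/0/+1) design matrix:
X = np.array([[ 1, -1,  0],
              [-1,  0,  1],
              [ 0,  1, -1]])        # a few illustrative runs
assert np.array_equal(X**3, X)      # cubic columns == main-effect columns
```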
Regular two-level fractional factorial designs and Plackett-Burman designs are also inappropriate for most of the above cases. So, they are not a viable alternative.
What is the alternative to using a DSD in the above cases?
For users of JMP, the answer is simple: Use the Custom Design tool.
The Custom Design tool in JMP can generate a design that is built to accommodate any combination of the scenarios listed above. The guiding principle behind the Custom Design tool is
“Designs should fit the problem rather than changing the problem to suit the design.”
DSDs are extremely useful designs in the scenarios for which they were created. As screening designs they have many desirable characteristics:
1) Main effects are orthogonal.
2) Main effects are orthogonal to two-factor interactions (2FIs) and quadratic effects.
3) All the quadratic effects of continuous factors are estimable.
4) No 2FI is confounded with any other 2FI or quadratic effect although they may be correlated.
5) For DSDs with 13 or more runs, it is possible to fit the full quadratic model in any three-factor subset.
6) DSDs can accommodate a few categorical factors having two levels.
7) Blocking DSDs is very flexible. If there are m factors, you can have any number of blocks between 2 and m.
8) DSDs are inexpensive to field, requiring a minimum of only 2m+1 runs for m factors.
9) You can add runs to a DSD by creating a DSD with more factors than necessary and dropping the extra factors. The resulting design has all of the first seven properties above and has more power as well as the ability to identify more second-order effects.
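Several of these properties can be checked numerically from the fold-over construction. The sketch below builds a six-factor DSD from one valid conference matrix (not necessarily the matrix JMP generates) and verifies properties 1), 2) and 8), plus the five/three/five level split per column mentioned in point 4) earlier.

```python
# Numeric check of DSD properties for the 13-run, 6-factor design
# D = [C; -C; 0]; the conference matrix C here is one valid choice.
import numpy as np
from itertools import combinations

C = np.array([
    [ 0,  1,  1,  1,  1,  1],
    [ 1,  0,  1, -1, -1,  1],
    [ 1,  1,  0,  1, -1, -1],
    [ 1, -1,  1,  0,  1, -1],
    [ 1, -1, -1,  1,  0,  1],
    [ 1,  1, -1, -1,  1,  0],
])
m = 6
D = np.vstack([C, -C, np.zeros((1, m), dtype=int)])  # fold-over + center run

assert D.shape[0] == 2 * m + 1                  # 8) minimum of 2m+1 runs
assert np.array_equal(D.T @ D, 10 * np.eye(m))  # 1) main effects orthogonal

for j in range(m):                              # 2) main effects orthogonal
    for k in range(m):                          #    to quadratic effects...
        assert D[:, j] @ (D[:, k] ** 2) == 0
for j in range(m):                              #    ...and to 2FIs
    for k, l in combinations(range(m), 2):
        assert D[:, j] @ (D[:, k] * D[:, l]) == 0

# Each column has five -1s, three 0s and five +1s (the split in point 4).
assert ((D == -1).sum(axis=0) == 5).all()
assert ((D == 0).sum(axis=0) == 3).all()
assert ((D == 1).sum(axis=0) == 5).all()
```

If any assertion failed, the matrix would not be a valid DSD; the same checks can be run on a design table exported from JMP.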
In my opinion, the above characteristics make DSDs the best choice for any screening experiment where most of the factors are continuous.
However, I want to make it clear that using a DSD is not a panacea. In other words, a DSD is not the solution to every experimental design problem.
The official blog about JMP software from SAS, which links rich statistics with graphics, in memory and on the desktop. Contributors to the blog are members of the extended JMP family: from R&D, marketing, training, technical support and sales, as well as guest bloggers. Join us also in the JMP User Community, where you can ask your JMP questions, share your expertise, and get scripts and sample data: community.jmp.com