Monday, March 30. 2009Happy Birthday ... I'll Take That $1
I’m sure most of you reading this have encountered the birthday probability problem:
Given a group of n people, what is the probability that at least two people have the same birthday? The answer is given by the formula ![]() This formula is simplistic because it assumes birthdays are equally spread across all 365 days of the year, which according to some studies is not true. It also ignores leap year. But, for our purposes the formula works fine. Let’s have some fun with it! Surprisingly, with only 23 people, the probability of at least one match is about 0.5073. With 40 people, the probability is about 0.8912, and with 60 people, the probability is about 0.9941. A plot of the formula is shown here: ![]() For our little example, pretend I’m a teacher who teaches short courses. On the first day of class, I ask the students if they are willing to go along with a little wager. If there is at least one birthday match, everyone in the class has to give me $1. If not, I give everyone $1. What is the average amount of money I will have at the end of a year? The answer is easy to work out, and depends on the number of students in a class, and how many classes I teach during the year. The average amount at the end of the year is where n is the number of people in the class, Prob is the probability from the formula above, and Classes is how many classes I teach during the year. I’m also interested in the standard deviation of the annual amount, given by the following: ![]() where Mean is the average given by the first formula. Hint, Variance(x) = E(x^2)-E(x)^2. Note, these two formulas hold only if the number of classes taught each year is the same, and if each class has the same number of students. If Classes is a random variable (can change from year to year), then the formulas are different, particularly the standard deviation. So, for simplicity, we’ll assume n is the same for each class and the number of Classes is the same each year. I want to estimate the probability that the annual amount will be $0 or less. I can use JMP’s Profiler and Simulator to do this. Furthermore, I can use the Simulator to study how the probability is affected by different combinations of n and Classes. I created a JMP file with all these formulas included, ready to Profile. It’s called BirthdayModel.jmp and can be downloaded from JMP’s File Exchange. The column we want to simulate is called Total. The Simulator not only allows you to place distributions on the model inputs, but it also allows you to add random noise to the simulated response value. I can feed the standard deviation of the annual amount into the Simulator. Simply specify the amount as shown below: ![]() But, there’s a problem. This feature only allows you to enter a single value for the variability in the response. In our example, the standard deviation of the annual amount changes as a function of n and Classes. So, what can I do? Look at the formula for Total: Mean + Normal*StdDev where Mean is the formula for average annual amount, and StdDev is the formula for standard deviation of the annual amount. It includes a factor called Normal. When I use the Simulator, I’ll make that factor have a Normal (0,1) distribution. Therefore, the resulting simulated values for Total will have the right standard deviation, and that standard deviation will change as n or Classes changes. Just what I want! Also, this forces the sampling distribution of Total (for given n and Classes) to be Normal. According to other simulations I’ve done, this is a reasonable approximation, as long as n is between 10 and 40. Run the “Profiler” script attached to the data table. This launches the Profiler and Simulator. Economic conditions are uncertain, so I can’t predict n and Classes precisely. Instead, I can simulate for a variety of combinations of n and Classes. Therefore, I make n a Poisson variate with mean 35. And as we discussed before, Normal is a Normal (0,1) distribution. But what about Classes? I think I’ll be able to teach 1 or 2 classes per month, which equates to a min of 12 and max of 24 per year. But, how do I simulate this? I use the Expression feature, which is essentially JSL code. The expression I use creates 12 random variables (each a random integer either 1 or 2), and then sums the 12 variables to get the simulated number of classes in the year. Add a lower Spec Limit of 0, and run the Simulator to get results similar to the following: ![]() Incorporating the uncertainty in n and Classes, at the end of a year I will have on average $393, with a 3.7% chance of being less than $0. Should I play that game? What happens if I ignore the uncertainty in n and Classes? Change the variables to Fixed and set n=35 and Classes=18. ![]() The chance of being less than $0 drops to 0.03%. But ignoring uncertainty is not good risk management policy, so I’ll stick with the first result. This example is not as complete as I would like. As I said before, this method doesn’t choose a new class size for each class. It assigns the same class size for each class in a year. To get closer to reality, I needed to use JSL. The data table contains an attached script called “Birthday Script.” Copy the script contents to a script window and close the data table before running the script. The script chooses a random class size for each class and allows for the number of classes to change each year. All you do is provide the averages you want for n and Classes. The result is a histogram of simulated annual amount. On a different but similar note, I recently learned that two other people on my floor here at work have the same birthday as me. My floor has 37 people. You might think, "Wow, that’s surprising! That doesn’t happen too often." You might ask, "What is the probability of at least three people out of 37 having the same birthday?" The formula I gave before is for at least 2 people having the same birthday, so it doesn’t work. I searched the Internet and couldn’t find any formula for at least 3 people. But, no worries, I simulated it with JSL. For a group of 37 people, the probability that at least three have the same birthday is about 0.053. My company has about 11,200 employees, which is about 302 groups of 37. Therefore, using the Binomial distribution with p=0.053 and n=302, I predict that 9-23 of the 302 groups have at least three people with the same birthday. The JSL code I used to estimate the probability is attached to the data table and is called “At Least 3”. Maybe it happening on my floor is not so surprising after all. Wednesday, December 17. 2008Set the Right Selling Price for Christmas Cookies
In my last blog post, I found the best tasting recipe for the Christmas cookies I want to sell. In this post, I will use JMP's Profiler and Simulator to decide the selling price of the cookies and to investigate profitability. If the price is set too high, then fewer cookies will be sold. If the price is set too low, then lots of cookies will be sold, but the revenue will be lower than it could be. (The example I present here is adapted from one I encountered in Making Hard Decisions: An Introduction to Decision Analysis, by Robert T. Clemen.)
Let’s consider a simple model for Profit: Profit = Market*Proportion*(Price - Variable) - Fixed. Market represents the daily size of the market for cookies. Proportion represents the proportion of that market I will capture. Price is the selling price of a cookie. Variable is the variable cost associated with each cookie. Fixed is the fixed costs. Consider the following example. Assume that the daily market size for cookies is 100,000, the proportion I will capture is 20%, the price of a cookie is $0.50, the variable cost per cookie is $0.10, and the fixed costs are $8000. The Profit formula evaluates to Profit = 100,000*0.20*(0.75 - 0.10) - 8000 = $2000. Those numbers are based on my best guesses of the variables in the formula. What if the actual values are different? How will that affect Profit? Using JMP’s Profiler, I can visualize the affect of the variables on Profit. Furthermore, using the Simulator, I can estimate profitability under the uncertain business conditions. Before I do that, let’s consider one thing. It’s obvious that, all else being equal, charging a higher price is better. But, economics dictates that all else will not be equal. I can anticipate less sales as the price increases. This means the proportion will decrease. I choose to model that in the following way: For every $0.10 increase in price above $0.50, the proportion decreases by 3.5%. And likewise, if the price goes below $0.50, the proportion increases. I’ve incorporated this model into a JMP data table, called Cookies.jmp, and you can download it from the JMP File Exchange. The table has a column called Profit, which contains the formula for profit. Run the attached script to get the Profiler and Simulator shown below. ![]() All of the factors in the Profit model are shown. The current value (shown in red under each plot) of the factors is listed, as well as the resulting Profit (shown in red on the left). Each factor has a vertical dashed red line that you can grab and move with the mouse to change the value of the factor to see its effect on Profit, as well as the interaction with other factors. As you can see from the Profiles, when either Market or Prop goes up, the predicted Profit goes up. And likewise, when either of the costs go up, the Profit goes down. The Price profile is most interesting: It is curved, and there is an optimal value for Price. Profit goes down if Price is above or below the optimal value. I can’t precisely predict market conditions (the model factor values), which means I can't know with certainty whether my business will be profitable. Therefore, instead of guessing at the values of the factors in the model, I will assume a distribution (a range of values). For example, instead of just assuming Market size will be 100,000 cookies, I will assume that Market will be somewhere in the range of 70,000 to 130,000. More precisely, I model Market size as a Normal variable with mean of 100,000 and standard deviation of 10,000. Proportion is modeled as a triangular variable between 18% and 26%, with the most likely value at 22%. Variable costs are uniform between $0.08 and $0.12. And finally, Fixed costs are triangular between $6,500 and $9,000, with the most likely value at $8,000. I'm using distributions to represent the uncertain market conditions I'm faced with. The Simulator draws random values from these distributions, calculates Profit, and repeats many thousand times. This results in a distribution for Profit. In other words, I've assessed Profit under thousands of different market conditions. The important question is: Given the uncertain conditions, how likely is my cookie business to be profitable? The Simulator calculates this as well. Of the thousands of simulated conditions, JMP tracks how many result in a negative profit. With Price at $0.50, I click the Simulate button to estimate the probability that Profit is greater than or equal to 0. JMP produces a histogram of simulated Profit values and gives the probability as shown below. ![]() As shown, if I set price at $0.50, my average profit will be $971. But, because I'm uncertain about the model factors, that uncertainty gets transferred to uncertainty in Profit. Notice that part of the distribution of Profit is below 0 (the red line on the histogram). In this case, the probability of being profitable is 77.5%. But Price is not at its optimal value. Let’s move Price to the top of its profile, say 0.61. After making the change, I click Simulate again. ![]() Notice the average Profit increased to $1,448, and the probability of being profitable increased to 86%. I think I’ll charge $0.61 per cookie for now at least. If in the future the market changes, or my costs change, I can rerun the simulation to investigate what Price to charge. Using JMP’s Profiler I was able to see that Price does affect Profit and also that there is an optimal Price. Using the Simulator, I was able to decide on a reasonable cost and quantify the risks associated with that cost. The Profiler and Simulator are much richer than what I have demonstrated. This example is the simplest of models, with only one response, Profit. You can model multiple responses and have JMP solve for optimal solutions across all responses. Thursday, December 11. 2008Mixing the Right Recipe for Christmas Cookies Using JMP 8
When I think back to Christmas times of my youth, I recall the cookies my grandmother used to make. They were great! Now she wants to mass-produce the cookies and sell them. Unfortunately, the recipe has been lost. We remember all of the recipe except the “secret sauce” she added to the dough. The only thing we know is that the recipe called for 1 cup of secret sauce, which is a mixture of three ingredients, call them A, B, and C. But, as luck would have it, we forgot the proportion of the ingredients in the secret sauce. To have the optimal taste, the 1 cup worth of secret sauce has to be composed of the right proportions of the ingredients -- or else tragedy could result.
What can I do? I can run an experiment to study the “taste” of cookies made by sauce with varying proportions of the ingredients. This is a mixture experiment. Since the secret sauce is 1 cup, I know the ingredients have to add up to 1 cup. The factors in the experiment are the proportions of the ingredients, and the response is the goodness of the cookies. Luckily, I have JMP 8. Below I use a ternary plot to show the design points of the mixture experiment. ![]() The point in the middle corresponds to equal proportions of the three ingredients. In other words, the secret sauce is composed of 1/3 cup of each ingredient. The point to the lower left of the middle corresponds to the following proportions: 2/3 cup of A, 1/6 cup of B, and 1/6 cup of C. Points on the edges of the ternary plot correspond to secret sauce for which only two ingredients are used, and the proportion of the third ingredient is set to 0. We proceed to make secret sauce and cookies corresponding to the design points of the experiment, and have people taste test the cookies. The cookies are ranked on a scale from 1 to 100, with 100 being the best cookie ever. Using the data from the experiment, we fit a model that predicts goodness (taste of cookie) as a function of the varying proportion of the ingredients. We can use this model to find the proportions of ingredients in the secret sauce that lead to the best tasting cookie. In other word, we can find the spot with a Taste value as close as possible to 100. The data and model are found in Ingredient Data.jmp, and can be downloaded from JMP’s file exchange. New to JMP 8 is the Mixture Profiler, which allows the user to interactively visualize a response surface for a model with mixture factors. ![]() The platform generates a contour (the red line) plot. The value of Taste at the contour shown is 54.44. Any recipe with ingredient proportions that fall on that contour will produce cookies with average Taste value of 54.44. For example, I placed the three-way cross hairs on the contour. The ingredient proportions at that spot are read from the top of the window under Current X. In this case, the values are 15% for ingredient A, 27.5% for ingredient B, and 57.5 % for ingredient C. Notice the proportions add to 1 as they should: 15% + 27.5% + 57.5% = 100%. The little dots running along the contour represent the direction of increasing Taste. I can’t show you in a screenshot, but you can use the Response slider to move the contour to different parts of the response surface. Doing this allows you to locate the spot with a value of Taste as close to 100 as possible, which is what we want. To visualize many contours at once, we can add a contour grid to the plot, as shown below. ![]() The three-way cross hairs can be used to investigate the value of Taste at various contours. I use the Response slider to study how the contours change as I move them, and as the dots show, the highest value of Taste occurs in the region of the center of the smallest contour. As shown below, place the cross hairs in the middle of the contour and read the ingredient proportions from the top of the window. The ideal secret sauce (the one that yields the best tasting cookie) has the following proportions of the ingredients: 41% for A, 34% for B, and 25% for C. ![]() Now that we know the best recipe for the cookies, we are ready mass-produce them and sell them. What price should we charge for a cookie? Stay tuned -- in my next blog post I will use JMP’s Simulator to investigate what price I should charge for the cookies. The functionality of the Mixture Profiler goes far beyond what I have shown here. Multiple responses can be contoured at once, and more than three ingredients can be investigated at the same time. Constraints can be added, so that unfeasible regions are identified on the plot. Data referenced in this entry can be found in the JMP File Exchange. Monday, October 27. 2008What Does a JMP Fanatic Do for Fun During Election Season? Part 2
In my last blog post I started the analysis of fictitious survey data concerning what percent of people plan to vote for candidate A in the upcoming presidential election. By fitting the Beta Binomial distribution, I was able to discover that the percent (p) of people varies between cities, and can be modeled with the Beta(27.64, 28.63) distribution. This distribution has an average of 0.491. But this doesn’t necessarily mean candidate A will get 49.1% of the total vote when combined across those cities I sampled. Cities with a higher population have more weight when the popular vote is totaled.
If all the cities have close to equal populations, then the popular vote total will be around 49.1%. But some cities are larger than others in reality. If the smaller values of p come from the larger cities, then the result will be less than 49.1%. If the larger values of p come from the larger cities, then the result will be greater than 49.1%. If the range of p is scattered across small and large cities, then the result will likely be close to 49.1%. In my fictitious example, the populations of the cities I sampled are included in the file Voting.com. I retrieved the data from Wikipedia, but the original source is the US Census Bureau. By the way, it’s easy to retrieve data from a Web page with JMP by using the Internet Open command on the File menu. Simply provide the URL, and JMP imports any tabular data it finds at that URL. How do we estimate the overall proportion of votes for candidate A for the 100 cities I sampled? Use the survey results as estimates of the percent in a city, multiply by the population of the city, and then sum across cities to get overall totals. But not so fast -- we need to adjust the population numbers down to account for the fact that not all the population is of voting age. Also, not everyone of voting age actually votes, so a further adjustment is needed. Again turning to the Census Bureau, I found data on the proportion of a population that is of voting age. The numbers vary from location to location. Instead of using one number (like the average) to summarize all the proportions, I fit a distribution. Since proportions are between 0 and 1, the Beta distribution is a great candidate. Using JMP’s distribution fitting capabilities, the fitted distribution turns out to be the Beta(397,126). Below is a graph of the distribution. The bulk is between 0.7 and 0.8, which means that for most locations the percent of the population that is of voting age is 70% - 80%. ![]() I also found data on what percentage of a voting age population actually vote. Again fitting a Beta distribution, I get the Beta(30,32). See the graph below. The bulk is between 0.3 and 0.7, which means that for most locations the percent of a voting age population that actually vote is 30% - 70%. ![]() I’m a fan of Monte Carlo simulation. So let's use the JMP scripting language to run a simulation. The script is attached with the data if you want to try it yourself. The script is shown below: ![]() Running the simulation 10,000 times (which takes less than 10 seconds) gives the following distribution for the proportion of the popular vote going to candidate A in the 100 cities. ![]() The average of the distribution is 50.37%. For an interval estimate, use the 2.5th and 97.5th percentiles. I estimate candidate A will get between 50.12% and 50.61% of the popular vote for those 100 cities I sampled. 50% is in the lower tail of the distribution. What is the probability Candidate A will get at least 50% of the vote in the 100 cities? We can fit a distribution and estimate the probability. If you are thinking the distribution looks Normal, then you are right. Recall the central limit theorem, which says the distribution of an average is asymptotically normal, as the number of items in the average increases. The overall proportion is a weighted average of individual city proportions; thus it is normal. Using a Normal distribution with mean = 0.5037 and stdev = 0.00124, the probability that candidate A receives at least 50% of the vote in the 100 cities is 99.85%. That is good news if you like candidate A. Remember that the distribution of p has a mean of 49.1%. This value is more than 10 standard deviations below the mean in the simulation results. In fact, it’s not even shown on the histogram. This means there is virtually no chance the overall % for Candidate A will be as low as 49.1%. This would have been the average assuming the populations are the same. Thus, accounting for the city populations certainly made a difference when totaling across cities. Maybe not much though, since 50.3% - 49.1% = 1.2%. This example is simple and doesn’t account for a number of factors. For example, how do you handle within-city variation? A different sample of people could yield different results, and people could very well change their minds between now and election day. A good idea might be to model the within-city variation with a Beta distribution as well. And what about cities with less than 100,000 in population? Is it safe to extrapolate these results to smaller cities? And what about the electoral college? For an interesting answer to the electoral college question, see this article about what a BYU professor is doing to predict the election outcome. Tuesday, October 21. 2008What Does a JMP Fanatic Do for Fun During Election Season? Part 1
Among the new features in JMP 8 is the Beta-Binomial distribution. This distribution arrives from assuming x|p ~ Binomial(n,p), and p ~ Beta(a,b). In other words, if you want to simulate a Beta Binomial variate, first randomly select a value of p from the Beta(a,b) distribution, and then use that p to generate a Binomial x of size n.
Using JMP 8, you can calculate Beta-Binomial probabilities, generate random data, and fit the distribution to data. As an example of fitting the Beta-Binomial distribution to data, consider the following. This is election season in the US, and many eagerly await for Nov. 4 to arrive so they can vote. But I’m impatient, I want to know now who will be elected president. So, suppose I do a little survey sampling of my own. I randomly select 100 cities (among cities that have at least 100,000 people) and survey 1,000 people in each. For each city, I record the number of people who say they will vote for candidate A. (I didn’t really do it. This is a fictitious example, with fictitious data.) You can download the data, called Voting.jmp, from JMP’s file exchange. The file and a script I use in my next blog post are both part of a zip file called Voting.zip. The histogram below (created with JMP’s Distribution platform) shows the distribution of the results. The average number of people (out of 1,000) who say they will vote for candidate A is 491, about 49%. ![]() If the support rate for candidate A is constant for all cities, then the observed variation is explained by the binomial distribution. Fitting a Binomial distribution results in the following: ![]() Notice that JMP overlays the fitted distribution density curve. In this case, the range in the data extends beyond the fitted curve, so there is more variation than is explained by the Binomial distribution. Perhaps the support rate, p, is different from city to city. If true, the Beta-Binomial distribution is more appropriate. Fitting a Beta-Binomial results in the following: ![]() Notice the Beta-Binomial fit (green curve) encompasses all the variation in the data, and therefore it is a better fit than the Binomial. The estimated parameters need some explaining. Remember the Beta-Binomial arises from assuming x|p ~ Binomial and p ~ Beta(a,b). JMP estimates the following parameters: p = a/(a+b), the mean of the Beta distribution. and delta = 1/(a+b+1), an overdispersion parameter. If the overdispersion parameter is 0, the Beta-Binomial reduces to the Binomial (p). This means p is constant for all cities. Of most interest to me is the distribution of the rate. I want estimates of the Beta parameters, a and b so I can quantify and visualize the distribution of p. To obtain estimates of a and b, use the following: ![]() and ![]() Doing this for our data, you get a = 27.64 and b = 28.63. The Beta(27.64, 28.63) distribution looks like the following: ![]() The bulk of the distribution is between 0.30 to 0.70. The average of this distribution is 0.491. Does this mean candidate A will get 49.1% of the popular vote when combined across those cities I sampled? Not necessarily. You need to account for the population of the cities. Cities with a higher population have more weight when the popular vote is totaled. But how different would the results be? Stay tuned. In my next blog post, I will continue the analysis and address the population question! By the way, JMP 8 also introduces the Gamma Poisson distribution, which results from assuming that x|lamda ~ Poisson(lamda), and lamda ~ Gamma(alpha, beta). Friday, August 8. 2008What Would Be the World Record?
With the 2008 Summer Olympics happening in China right now, I am reminded of when I used to run the 200-meter (1/2 lap) and 400-meter (1 lap) events on my high school track team. I wasn’t that good, but I enjoyed it. My best time in the 200m was about 25 seconds (current world record is 19.32 seconds), and my best time in the 400m was -- now don’t laugh -- about 58 seconds (current world record is 43.18 seconds). By the way, both those world records are held by the same man: Michael Johnson.
Consider the world record for every major running event, sprinting and distance. The IAAF’s (International Association of Athletic Federations) Web site has all the data. I entered the data into a JMP table, everything from 100m to 42,194.988m, which is a marathon. Times are recorded in seconds. The data can also be downloaded from JMP's file exchange. ![]() Can the record be modeled/predicted as a function of distance? I make a scatterplot of Seconds vs. Meters using JMP’s Fit Y by X platform: ![]() The data seems to be cramped together for the shorter races. I take the natural log of both Seconds and Meters and replot: ![]() The relationship looks very linear and very strong! Let’s fit a simple linear regression using the Fit Line command: ![]() The RSquare is .999. What model am I actually fitting? Consider the following power model: Seconds = a*Meters^b, where a and b are parameters. I take the log of both sides to obtain: ln(Seconds) = ln(a) + b*ln(Meters). That is the same model I fit above. The relationship between Seconds and Meters can be approximated by a power model, but it can be linearized by taking logs. The parameters of the power model can be computed. The b parameter is given above as 1.1065, and the a parameter is exp(-2.8081) = .060319. So, the final prediction model is: Seconds = .060319*Meters^1.1065. How well does it predict? The RSquare is strong as we've shown. Also, for each row, I compute the absolute value of (Actual-Predicted)/Actual. The mean result is .036, meaning a 3.6% difference between actual and predicted. The largest percent difference is for the 200m race, with a predicted value of 21.21. Now to the title of this blog entry, “What would be the world record?” Pick any number between 100 and 42,195. Let’s use 6731 meters. The model predicts the current world record for that would be 1038.06 seconds. The model fits fairly well, so we might be close. But, we’ll never know because it’s not a real race. I could have fit this model directly on Seconds and Meters, using the Fit Special command. It allows you to specify a transformation on the Y and X. The RSquare and parameters turn out to be the same. Take note of the exponent in the model, the b parameter, 1.1065. Because that is so close to 1, the model is nearly linear. Will a simple straight line predict just as well as the power model? A linear model indicates that the meters per second (mps) is the same for every distance. Computing the mps from the data gives the following: ![]() The mps clearly decreases as the distance increases. I fit the linear model to the data. It fits well for the long races (stable mps), but it fits terribly for the short races (mps changing a lot). After all, I don't know anyone who can maintain a full sprint pace much longer than 400m. If you are wondering, I also fit a similiar model to the women’s records. But I leave it to you to get the data and try it on your own.
(Page 1 of 1, totaling 6 entries)
|
ABOUT THIS BLOG
JMP Statistical Discovery Software from SAS
is proud to bring you this blog on all things related to
data visualization, visual Six Sigma, design of experiments
and other statistical topics.
The blog content appearing on this site does not necessarily represent the opinions of SAS. Your use of this blog is governed by the Terms of Use. CategoriesQuicksearchSyndicate This Blog |

