My colleague Robert Allison has a knack for finding fascinating data. Last week he did it again by locating data about how blood types and Rh factors vary among countries.
He produced a series of eight world maps, each showing the prevalence of a blood type (A+, A-, B+, B-, AB+, AB-, O+, and O-) in various countries around the world. As I studied his maps, I noticed that the distribution of blood types in certain ethnic groups (Chinese, Japanese, Indians, and so on) was different from the distribution in Western Europe and former European colonies.
When dealing with multivariate data, a single visualization is rarely sufficient to answer all the questions that you can ask. Robert's maps answer the question, "What is the spatial distribution of each blood type?" I was curious about a different question: "Within each country, what is the distribution of blood types?" To answer my question, I needed a different visualization of the same multivariate data.
My attempt is shown to the left. The graph is a stacked bar chart of the percentages of blood types for 63 countries, sorted by the percentage of types that have positive Rh factors. Blood types with positive Rh factors are plotted on the right; negative Rh factors are plotted on the left. The A+ and A- types are plotted closest to the 0% reference line. The next types are AB, B, and O, at increasing distances from the 0% reference line.
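The layout described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SAS program that produced the graph: the country names and percentages are made up, the helper names `rh_positive_total` and `bar_segments` are hypothetical, and the ascending sort assumes that the countries with the most Rh-negative blood appear at the top of the chart.

```python
# Hypothetical percentages for illustration only; NOT values from the real data.
countries = {
    "Country X": {"A+": 30.0, "A-": 6.0, "AB+": 4.0, "AB-": 1.0,
                  "B+": 10.0, "B-": 2.0, "O+": 40.0, "O-": 7.0},
    "Country Y": {"A+": 27.0, "A-": 0.5, "AB+": 7.0, "AB-": 0.1,
                  "B+": 25.0, "B-": 0.4, "O+": 39.5, "O-": 0.5},
}

# Stacking order: A is nearest the 0% line, then AB, B, and O.
RH_POS = ["A+", "AB+", "B+", "O+"]
RH_NEG = ["A-", "AB-", "B-", "O-"]

def rh_positive_total(pct):
    """Sort key: total percentage of Rh-positive blood types."""
    return sum(pct[t] for t in RH_POS)

def bar_segments(pct):
    """Return (type, start, end) for each bar segment.

    Rh-positive segments extend to the right of 0;
    Rh-negative segments extend to the left of 0.
    """
    segments = []
    right = 0.0
    for t in RH_POS:
        segments.append((t, right, right + pct[t]))
        right += pct[t]
    left = 0.0
    for t in RH_NEG:
        segments.append((t, left - pct[t], left))
        left -= pct[t]
    return segments

# Assumed ordering: smallest Rh-positive share first (top of the chart).
ranked = sorted(countries, key=lambda name: rh_positive_total(countries[name]))

for name in ranked:
    print(name, rh_positive_total(countries[name]), bar_segments(countries[name]))
```

Each tuple gives the horizontal extent of one colored segment, so any plotting tool that can draw horizontal bars from a start and an end value can render the chart.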
A few ethnic differences are apparent. At the top of the chart are Western European countries and former European colonies such as Brazil, Australia, and New Zealand. A little lower on the list are countries in Eastern Europe and Scandinavia.
After that, the list starts to get geographically jumbled. The United States, Canada, and South Africa were all settled by people of multiple ethnicities. The middle of the list is dominated by countries from the Middle East, Northern Africa, and the Near East.
The next set of countries includes South American countries such as Argentina and Bolivia, Caribbean countries, India, and African countries.
Finally, the bottom of the list features Asian populations such as China, Japan, Mongolia, and the Philippines. These populations have almost no negative Rh factors in their blood. The distributions of blood types in those countries are similar to each other, although regional differences appear, such as the relatively low prevalence of A+ blood in Thailand.
This one-dimensional ranking of countries by blood types reflects historical connections between peoples as a result of conquest, trade, and colonization.
A few countries seem "out of place" in their list order. Lebanon, Ireland and Iceland, and Peru and Chile are some of the countries whose distributions of blood types differ from those of the countries adjacent to them in the list.
Relationships between countries
Some of the "out of place" countries are probably a result of the fact that it is hard to linearly order the countries when there are eight variables to consider. Principal component analysis (PCA) is a statistical technique that can group observations according to similar characteristics. In SAS software, you can use the PRINCOMP procedure to conduct a principal component analysis.
The analysis reveals that 81% of the variation in the data can be explained by the first two principal components. About 92% can be explained by using three principal components, which means that the eight variables (percentages of each blood type) fit well into these lower-dimensional linear subspaces.
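For readers who want to see where such percentages come from: the proportion of variation explained by the first k components is the sum of the k largest eigenvalues divided by the sum of all eigenvalues. The following Python sketch shows that arithmetic. The eigenvalues are made up for illustration and are not the ones that PROC PRINCOMP reports for the blood-type data; the function name `cumulative_proportions` is also hypothetical.

```python
def cumulative_proportions(eigenvalues):
    """Cumulative proportion of variance explained by the leading components."""
    total = sum(eigenvalues)
    running = 0.0
    result = []
    # Components are ordered from the largest eigenvalue to the smallest.
    for ev in sorted(eigenvalues, reverse=True):
        running += ev
        result.append(running / total)
    return result

# Hypothetical eigenvalues for eight variables; NOT the values from the article.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.125, 0.0]

cumulative = cumulative_proportions(eigenvalues)
for k, proportion in enumerate(cumulative, start=1):
    print(f"First {k} component(s): {proportion:.1%}")
```

With these made-up eigenvalues, the first two components account for 75% of the variation and the first three for 87.5%. The statement in the text is the same calculation applied to the eigenvalues from the real analysis.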
The score plot from a two-dimensional PCA is shown to the left. I added colors to the data to indicate a geographical region for the countries; the regions came from the United Nations list of countries and geographic regions. This plot shows the relationships between countries based on similarities in the distribution of blood types.
The middle of the plot contains African and West Asian nations. (West Asia is the UN name for the region that many people call the Middle East.) The right side of the plot is dominated by European countries and their former colonies. The upper left quadrant contains the Asian countries. The lower left quadrant includes Caribbean, Central American, and South American countries. This presentation once again shows that the distributions of blood types in Peru and Chile differ from those in other countries but are similar to each other.
You can download the data and the SAS program that analyzes it and do additional analyses.
What interesting features can you find in these data? Are there other ways to view these data? Leave a comment.