Who paid $500k for a US visa? Over 10,000 people!

Having spent many years in graduate school, and living in the Research Triangle Park (RTP) in North Carolina, I have a lot of friends from other countries. Therefore when I recently saw some stories & graphs about EB-5 visas (where you invest a cool half-million US $ to bypass the long lines) they caught my attention...

The EB-5 Immigrant Investor Program was created in 1990 to stimulate the U.S. economy through job creation and capital investment by foreign investors. EB-5 investors must invest in a new commercial enterprise, which creates at least 10 full-time jobs, and make a capital investment of $1,000,000 ... or $500,000 in a high-unemployment or rural area (which is usually the case).

And now for the big question - who paid $500k for these EB-5 visas last year?!?

I thought I had found the answer in an infographic on dadaviz.com where they showed a custom chart made up of bubbles (see below). It was an interesting graphic, but a little confusing. Were all the blue dots sub-regions within China? I guess Taiwan and Hong Kong might be, but definitely not Japan, etc. What countries were represented by the 'Other' bubble? Did China really have an order of magnitude more than any other country, or was that just a quirk of the way I was interpreting the graphic? I had more questions than answers...


I did some digging and found the raw data on the Department of State's web page, and finagled it into a SAS dataset (it was in a PDF file rather than simple text, so I couldn't import it directly). I then experimented with several different ways to plot the data, such as a bar chart - but there were just too many countries, and too big a spread from the minimum to the maximum values, for traditional graphs to work well and show all the data.

What I finally came up with was a SAS bubble map, followed by a simple sorted table. The map allows me to represent both the quantities and the geographic locations, and the table allows me to quickly see the actual values and determine which countries have higher/lower values than the others.

Click the map below to see the full-size version, with html hover-text for each country. With this map, you can easily see which countries the people getting EB-5 visas were from, and that China had way more than any other country (so many, that they actually reached the limit before the end of FY-2014).


If you were going to invest $500,000 for a US EB-5 visa, what enterprise would you invest in, and where would you locate?


Spotting a misleading chart

Everyone loves a good conspiracy theory - hopefully you'll enjoy mine about the number of US E1 visas!

I was perusing some of the US government charts, and found one on US immigration visas that caught my attention. It was a 3D bar chart, and since I always mistrust 3D charts, I immediately assumed there was something misleading about it. I noticed that the number of E1 visas was very close to the 40,000 reference line (see circled in red below), and I wondered whether it was in fact above or below the line. It looked like it was below the line, but you know how 3D graphs are difficult to read when you're looking at them from a 3D perspective angle.


Luckily their chart had a table below it, so in theory I could just easily glance at the table to see the exact number of E1 visas, and know whether or not they were above or below 40,000. But I was thwarted again! The table showed the number of E1 visas broken down into 2 groups (corresponding to the red & blue bar segments), but not the total! See the E1 row marked in the table below.


When art and analytics collide

The best graphs are both beautiful and informative - a smooth blend of art and analytics. But more often than not, the two collide rather than blending smoothly...

Here is a link to an artistic infographic I recently saw posted by Vendavo on Twitter. Their message (80% of your profit is generated by 20% of your customers) seemed 'plausible' ... but something just didn't seem quite right about their infographic. Upon closer scrutiny, I noticed that the slices in their pie charts did not seem to accurately represent the numbers (80% and 20%) in the text.

So, of course, I decided to make a SAS version that was both beautiful and informative (... with correctly sized pie slices!) Here's what I came up with, to show that an infographic can be both artistic and accurate!



Euro vs Dollar exchange rate: An historic event?

I recently read a Washington Post article about the euro versus the dollar, and I wanted to analyze the data myself to see whether the article was simply stating the facts, or "sensationalizing" things.

The washingtonpost.com article started with the headline, "This is historic: The dollar will soon be worth more than the euro." And the article had the following graph showing the value of the euro dropping:


Based solely on the title and the graph (which is probably all that most people look at), I assumed that the exchange rate had always been about 1.25, and had recently started dropping towards 1.00, and that this was an unprecedented historic event. But as I read the details in the article, I started to become a bit more skeptical, and decided to find the actual data, and plot it myself.

What's your opinion on daylight saving time?

Is daylight saving time the ultimate in efficiency, or is it living a lie? Here are some graphs that might help facilitate a discussion on this topic ...

With daylight saving time (DST), a whole geo/political area (such as a country) decides to set their clocks forward an hour during the 'summer' months (when the sun rises earlier and sets later) so that they can take advantage of the extra sunlight hours, without all the factories/stores/etc having to change their hours of operation.

Not all countries honor DST, and in some cases not even all the areas within a country agree to honor it. Here is a world map I created with SAS (similar to one I saw on dadaviz) that shows which areas do and don't honor DST. In general, it looks like most of North America and Europe honor DST, and countries that are close to the equator or in Asia tend not to.


Of course, it's not as easy as saying a country does or doesn't use DST -- different countries can also choose when they want to start and stop DST! For this part of my graphical analysis, I've created some graphs only for the US DST. But even plotting the data for just the US is a bit tricky, because when we start & stop DST has changed over the years. For example, in 2007 the date to go on DST moved from the first Sunday in April to the second Sunday in March, and the date to go off DST moved from the last Sunday in October to the first Sunday in November. Here's the calendar for the current year (2015) showing which days are/aren't DST days:


Looking at that calendar chart, it appears that we (in the US) are now spending over 50% of the year with our clocks adjusted forward in DST. Let's use a different chart that will make it even easier to see the percentages. We could use a bar chart with 2 bars, but I think a pie chart is more intuitive (a lot of people like to bash pie charts, but I think they are a good/intuitive way to show the data when comparing part-to-whole with a 2-slice pie). From this chart, it's evident that we're spending almost 2/3 of the year in DST!
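As a rough cross-check of that fraction, here is a minimal SAS sketch (not the code behind the charts in this post) that counts the DST days from the post-2007 rule described above. It assumes the NWKDOM function is available (SAS 9.2 or later) and treats every day from the start date up to, but not including, the end date as a DST day. The dataset and variable names are made up for illustration.

```sas
/* Sketch: count US DST days in a year, using the post-2007 rule */
%let year=2015;

data dst_summary;
   /* NWKDOM(n, weekday, month, year): nth given weekday of the month; weekday 1 = Sunday */
   dst_start = nwkdom(2, 1, 3, &year);    /* second Sunday in March   */
   dst_end   = nwkdom(1, 1, 11, &year);   /* first Sunday in November */

   days_in_year = mdy(12, 31, &year) - mdy(1, 1, &year) + 1;
   n_dst_days   = dst_end - dst_start;    /* start date counted, end date not */
   pct_dst      = n_dst_days / days_in_year;

   format dst_start dst_end date9. pct_dst percent8.1;
run;

proc print data=dst_summary noobs;
run;
```

For 2015 this works out to March 8 through October 31, which is 238 of 365 days (about 65%), in line with the "almost 2/3" reading of the pie chart.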


Personally, I have mixed feelings about DST. I can see the advantages of using it in the summer when days are very long, but I think the US might have gone a bit overboard if we're spending over 1/2 of the year (actually about 2/3 of the year) living a lie and adhering to a fake time.

So, what's your opinion on DST? Does your area honor it? If your area doesn't honor DST, do your factories/stores/etc change their hours in summer and winter?



Data Super Savers vs Data Science

“Dear Cat,
I got an email from my IT department that says:
[We are nearing capacity on the Flotsam Drive. Please clear data from any folders you are no longer using so we can save disk space.
The IT Department]

Doesn’t this strike you as a bit old-fashioned? I mean, isn’t disk space practically free now?”

Dear DataLover,

My first reaction to this is that yes! You’re right! Disk space is practically free. Why are we worried about storing some extra files? I am a bit of a data hoarder, though, so perhaps my views require some analysis.

Certainly, the message coming out of providers of software and services for the Hadoop ecosystem is that a good data science citizen keeps everything. Long gone are the days when we had to carefully scrub the data, roll the files up to something compact, and get rid of the excess to free up storage space. There might be untapped value in unstructured logs, transactional databases, and other “clutter” files.

So, when is it better to keep versus eliminate data? I have a few thoughts about this.

  • If you can gain more from using the data than you spend to keep the data, then by all means, keep the data. Sources might be surprising. Data scientists make billions of dollars for their companies annually by making data products out of log files and other data that has historically been considered garbage or exhaust.
  • If the data get no use, and are old enough that data products would not benefit from them, then it is best to delete. But I would ask, if the data are not used, should they be? There could be value there.
  • If there is historical information about your company’s performance that can be tied to specific initiatives, then keep the data. There is something to learn here. As an example, if you can track marketing campaigns, staffing decisions, acquisitions and merger information, etc. then you can see which activities were followed by changes in revenue, customer reach, profit, market share, etc. This is not causal information, but it can direct you to your next business experiment in a hurry.
  • If the data can place your organization at risk, then it is prudent to eliminate. This is the case with personally identifiable information (PII), financial records that are no longer needed for audit trails, email records that may contain proprietary conversations with clients, and so on. In this case, there is more to lose from keeping the data than there is to gain from it.
  • And finally, if the data include pictures of your boss at the last company picnic wearing that Hello Kitty costume and dancing the electric slide, it’s probably best to just let it go. Nobody needs to see that.

Our experiences are different, so I’d love to hear your thoughts about this in the comments below. And, if you would like to spend some quality time talking data hoarding with my colleagues and me, consider coming to one of our data scientist training courses. See you in class!

Strategies and Concepts for Data Scientists and Business Analysts

Data Science: Building Recommender Systems with SAS and Hadoop


Tracking billionaires with beauty and accuracy

Which is more important - having beautiful graphs, or accurate graphs? Let's explore this question using the locations of the world's richest billionaires...

I recently saw a beautiful map on dadaviz.com that purported to show the cities with the most billionaires. Here's a screen-capture of that map:


I decided to try creating a similar map using SAS. I found the data source on the forbes.com website, entered the data into a SAS dataset, programmatically looked up the lat/long of the cities, and plotted them on a map.


I was happy & satisfied that I had reproduced their map, until I noticed that "one of these things is not like the other..." My Shenzhen city bubble was near Hong Kong, whereas in the dadaviz map it was up north of Seoul and Beijing. I researched this (and even confirmed it with my co-workers in China), and determined that the position of Shenzhen in the SAS map is correct ... and the beautiful map on dadaviz was not accurate!

I'm not sure how the dadaviz map ended up wrong (maybe the bubbles were positioned by hand, or using hard-coded values that contained a typo?). But this is one of the reasons I prefer to use SAS and position my markers in a data-driven way, which allows me to create output that is both beautiful and accurate!

Technical Details:

To create the background map, I created a grid of gray dots, across all the possible lat/long locations of the world, and then used Proc Ginside to see which dots fell within a country (and discarded the dots that weren't in a country). I looked up the lat/long positions of all the listed cities in the mapsgfk.world_cities dataset (which ships with SAS/Graph), and created annotated pies at those lat/long positions, with the area of the pie proportional to the number of billionaires. I got a little 'tricky' with annotate pie commands to draw the line from the pie and then write a label at the end of the line. Here's a link to the full SAS code.
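The full code is linked above; what follows is only a minimal sketch of the dot-grid idea, not the original program. It assumes that mapsgfk.world carries unprojected long/lat values in degrees (east-positive) alongside its projected x/y, and the 2-degree spacing and all dataset names are arbitrary choices for illustration.

```sas
/* Assumption: use the unprojected long/lat from mapsgfk.world as x/y */
data world_map;
   set mapsgfk.world (drop=x y rename=(long=x lat=y));
run;

/* Grid of candidate dots covering the globe (2-degree spacing is arbitrary) */
data grid;
   do y = -90 to 90 by 2;
      do x = -180 to 180 by 2;
         output;
      end;
   end;
run;

/* Tag each dot with the ID of the map polygon that contains it */
proc ginside data=grid map=world_map out=grid_tagged;
   id id;
run;

/* Discard the dots that did not fall inside any country */
data land_dots;
   set grid_tagged;
   if not missing(id);
run;
```

The surviving dots in land_dots could then be drawn as small gray markers (for example with annotate) to form the background. A finer grid gives a smoother-looking map at the cost of more points to test.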

If you were a billionaire, which city would you live in, and why?...


I can see clearly now thanks to this SAS student

With any software program, there are always new tips and tricks to learn, and nobody can know them all. Sometimes I even pick up tips or techniques from my students while they’re learning broader programming tips from me.

Like fine wine, instructors only get better with age. Every customer interaction we encounter, every software upgrade that happens, every SAS training course we get sent on, all add to the length and breadth of our SAS teaching knowledge.

But the absolute best learning experience hands down has got to be the SAS classroom setting. Customers come up with real life problems, and ask probing and insightful questions. It makes us instructors pull deep down into the recesses of our minds and see how far we can extend the use of SAS in creative ways.

Here’s an Enterprise Guide tip I recently learned from a student.

Open up the advanced expression builder to write an expression for a computed column. Hold down the CTRL key while you use the scroll wheel on the mouse: scroll up to increase the font size, scroll down to reduce it.

What tips do you know that you think might be new to me or those in my class? Send them along or comment here, and I’ll spread them around!


How to use the SAS pi constant

With Pi Day coming up on 3/14, I wanted to make sure all you SAS programmers know how to use the pi constant in your SAS code...

All you have to do is use constant("pi") in a data step, and you've got the value of pi out to a good many decimal places (probably enough for most any practical scenario). In this example, I let the user specify the value for a circle's radius as a macro variable, and then get the value of the pi constant in a data step, and use that value to calculate the circumference and area of a circle with that radius. I convert those calculated dataset values into macro variables, and use the calculated macro variables in various ways in a GPlot and annotated circle.

Here's the code for doing the calculations and creating the macro variables - hopefully a light bulb is going off in your head right now, and you are thinking of all kinds of ways you could reuse this code!  

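%let radius=7.0;
/* Data step body reconstructed from the description above: pi, circumference, area */
data foo;
radius=&radius;
pi=constant("pi");
circumference=2*pi*radius;
area=pi*radius**2;
run;
/* Copy the calculated values into macro variables */
proc sql;
select unique pi format=comma20.18 into :pi separated by ' ' from foo;
select unique circumference format=comma8.3 into :circum separated by ' ' from foo;
select unique area format=comma8.3 into :area separated by ' ' from foo;
quit;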

And here's what my graph looks like (here's the code if you'd like to see all the details).



Have a happy pi day!


So what should ‘the data’ have told England cricket coach Peter Moores?

England’s shambolic early exit from the Cricket World Cup has stirred up a hornet’s nest about the team’s supposed over-reliance on data. In the aftermath of their defeat to Bangladesh, coach Peter Moores said: ‘We thought 275 (runs) was chaseable. We’ll have to look at the data.’ It prompted outrage from fans and pundits alike, who felt this showed a team and management that were slaves to data rather than one focused on freeing up the players’ minds so they could play without any inhibitions – as many other teams in the tournament seem to be doing.

The reality of course lies somewhere between two extremes. Being over-reliant on poor quality data or placing too much emphasis on the results of data analysis can be counter-productive. Equally, ignoring data completely will put you at a disadvantage compared to teams that are using data sensibly to contribute to decision-making. Data is not the devil here – what matters is being able to understand what the data can tell you, analysing it to derive insights and then using these insights to inform decision-making based on other factors, including the team’s cricketing knowledge and expertise. It’s the same with any business that is using analytics to help them make decisions. The business leaders are experts in their field. Analytics can help reinforce decisions they would have taken anyway, as well as point them in another direction they may not previously have considered. Other sports know only too well that they’re at a competitive disadvantage if they ignore data entirely – the GB Rowing Team is just one example where it’s accepted that data analytics can contribute fractions of a second in improved performance, which can ultimately mean a gold medal rather than silver.

SAS plot showing runs over time plus a linear regression line


So what should ‘the data’ have told England about scores in One Day Internationals at the Adelaide Oval? We’ve done some very simple analysis based on scores at the ground stretching back over several years. Cricket commentators and pundits regularly talk about “a par score”, the implication being that there’s some sort of magical number or average score that a team needs to beat. The arithmetic average score from 50 overs at the ground over the period was 255 runs. However, as any cricket fan will tell you, in recent years there has been a trend towards teams making bigger totals. What this requires is a weighting applied to the more recent scores, as they better reflect a “par score” today. Our ‘weighted’ mean was 260, which funnily enough was the score England managed to record.
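To make the difference between the two averages concrete, here is a minimal SAS sketch. The scores are placeholders invented for illustration, not the Adelaide Oval figures, and the linear recency weighting is an assumption: the weighting scheme actually used in the analysis above is not stated.

```sas
/* Placeholder scores for illustration only; NOT the real Adelaide Oval data */
data odi_scores;
   input year runs;
   /* Assumed weighting: 1 for the oldest year, increasing by 1 per year */
   recency_wt = year - 2009;
   datalines;
2010 240
2011 235
2012 250
2013 265
2014 270
2015 290
;
run;

title "Unweighted mean";
proc means data=odi_scores mean maxdec=1;
   var runs;
run;

title "Recency-weighted mean";
proc means data=odi_scores mean maxdec=1;
   var runs;
   weight recency_wt;   /* later matches count more */
run;
title;
```

With scores that trend upwards like these, the weighted mean comes out higher than the plain mean because the later matches count for more, which is the same direction of shift described above.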

So was Bangladesh’s score of 275 clearly above par? Not necessarily. What also emerged from our analysis is a greater variation in runs scored over recent matches. Put simply, for the period up to 2013, a total of 273 was enough to be confident of winning three-quarters of your matches. From 2013 to date, the total needed to achieve the same win rate becomes significantly higher – 307. So, Moores was right to say 275 was ‘chaseable’. However, as this analysis shows, the greater variation in runs scored means a significantly higher total is needed to be ‘confident’ of success. This won’t come as any surprise to cricket fans who know that scores have increased and something over 300 is usually needed on a good pitch these days to have a good chance of victory.

This demonstrates how teams can use data to give them a good indication of the sort of score they should aim to make, taking into account other factors on the day such as the opposition, conditions etc. It also shows how important it is to be able to explain statistics and analytics to non-mathematicians. This can now be achieved through data visualisation technology, where the data is displayed graphically via multiple graphs and charts. It can even be used by those who aren’t experts but simply want to get meaningful answers from their data straight away.

Of course none of this is possible without having the expertise somewhere to design this technology, as well as deliver more complex, advanced analytical solutions – e.g. one that helps a global investment bank understand its real-time risk exposure across the business. Recent research by SAS and the Tech Partnership has shown the rising demand for people with data science skills. They’re needed not just in cricket, but pretty well any scenario where you’re trying to make sense of data.
