- In 1887, Herman Hollerith invented punch cards and a reader to organize census data.
- In 1937, the US government had a punch-card reading machine created to keep track of 26 M Americans and 3 M employers as a result of the Social Security Act.
- In 1943, Colossus was created to decipher Nazi codes during World War II.
- In 1952, the National Security Agency was created to take on the decryption of intelligence signals during the Cold War.
- In 1965, the US Government built the first data center to store 742 M tax returns and 175 M sets of fingerprints.
- In 1989, British computer scientist Tim Berners-Lee coined the phrase "World Wide Web" combining hypertext with the Internet.
- In 1995, the first supercomputer was built.
- In 2005, Roger Mougalas from O'Reilly Media coined the term Big Data.
- In 2006, Hadoop was created.
The story goes on: 90 percent of the data available today was created in the last two years!
As SAS (and the computing world) moves to the cloud, the question "How do I deal with my data (Big and otherwise), which used to be on-prem, in the cloud?" is at the forefront for many organizations. I ran across a series of relevant articles by my colleague, Nicolas Robert, on the SAS Support Communities about SAS data access and storage on Google Cloud Storage (GCS). This post organizes the articles so you can quickly get an overview of the various options for SAS to access data in GCS.
As the title suggests, this is an overview of the series. Some basic SAS terminology and capabilities are discussed, followed by an overview of GCS data options for SAS. Options include:
- gsutil - the "indirect" way
- REST API - the "web" way
- gcsfuse - the "dark" way
- BigQuery - the "smart" way
In the overview Nicolas provides the pros and cons of each offering to help you decide which option works best for your situation. Below is a list of subsequent articles providing technical details, specific steps for usage, and sample code for each option.
The Google Cloud Platform (GCP) provides an API for manipulating objects in Google Cloud Storage. In this article, Nicolas provides step-by-step instructions on using this API to access GCS files from SAS.
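To give a feel for what the "web" way involves, here is a minimal Python sketch that builds the download request for a single object through the Cloud Storage JSON API. The bucket name, object name, and access token are placeholders, and the article itself drives the API from SAS rather than Python, so treat this only as an illustration of the shape of the call.

```python
from urllib.parse import quote
from urllib.request import Request

# Base URL of the Cloud Storage JSON API.
GCS_API = "https://storage.googleapis.com/storage/v1"


def build_download_request(bucket: str, object_name: str, token: str) -> Request:
    """Build a GET request that downloads one object's contents."""
    # Object names must be percent-encoded, including any "/" separators.
    encoded = quote(object_name, safe="")
    # alt=media asks for the object's data rather than its metadata.
    url = f"{GCS_API}/b/{bucket}/o/{encoded}?alt=media"
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical bucket, object, and token, for illustration only.
req = build_download_request("my-bucket", "data/class.csv", "ACCESS_TOKEN")
print(req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` with a valid OAuth 2.0 access token would return the file's bytes. The sketch stops short of that so it runs without credentials.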
Cloud Storage FUSE provides a command-line utility, named “gcsfuse”, which helps you mount a GCS bucket to a local directory so the bucket’s contents are visible and accessible locally like any other file. In this article, Nicolas presents rules for CLI usage, options for mounting a GCS bucket to a local directory, and SAS code for accessing the data.
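As a rough sketch of the mounting step, the Python snippet below assembles a `gcsfuse` command line. The bucket name and mount directory are hypothetical, and the options the article recommends may differ. The only flag used here is `--implicit-dirs`, which makes folders that exist only as object-name prefixes visible in the mounted directory.

```python
import shlex


def gcsfuse_mount_command(bucket: str, mount_dir: str, implicit_dirs: bool = True) -> list:
    """Return the argument list for mounting a GCS bucket with gcsfuse."""
    cmd = ["gcsfuse"]
    if implicit_dirs:
        # Show "folders" that exist only as prefixes in object names.
        cmd.append("--implicit-dirs")
    # gcsfuse takes the bucket name followed by the local mount point.
    cmd += [bucket, mount_dir]
    return cmd


# Hypothetical bucket and mount point, for illustration only.
cmd = gcsfuse_mount_command("my-bucket", "/mnt/gcs")
print(shlex.join(cmd))
```

On a machine with gcsfuse installed and credentials configured, `subprocess.run(cmd, check=True)` would perform the mount. After that, anything under the mount directory can be read with ordinary file paths, which is what lets SAS treat the bucket's contents like local files.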
In this article, Nicolas provides the results of a performance test of GCS integrated with SAS when accessed from cloud instances. Newer releases of SAS should further ease integration and improve performance.
Google BigQuery naturally interacts with Google Cloud Storage using popular big data file formats (Avro, Parquet, ORC) as well as commodity file formats like CSV and JSON. And since SAS can access Google BigQuery, SAS can access those GCS resources under the covers. In the final article, Nicolas debunks the myth that using Google BigQuery as middleware between SAS and GCS is cumbersome, indirect, and dependent on data duplication.
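One way BigQuery can read GCS files in place is an external table, which points at the files instead of loading a copy. Whether the article uses exactly this mechanism is an assumption on my part. The Python sketch below only generates the BigQuery DDL statement, and the dataset, table, and bucket names are hypothetical.

```python
def external_table_ddl(table: str, file_format: str, uris: list) -> str:
    """Build a BigQuery DDL statement for an external table over GCS files."""
    # Each URI is a gs:// path; a single "*" wildcard is allowed.
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"OPTIONS (format = '{file_format}', uris = [{uri_list}])"
    )


# Hypothetical dataset, table, and bucket, for illustration only.
ddl = external_table_ddl(
    "my_dataset.sales_ext", "PARQUET", ["gs://my-bucket/sales/*.parquet"]
)
print(ddl)
```

Once such a table exists, a client that can query BigQuery can query the underlying GCS files through it without a separate load step.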
Being able to access a wide variety of data on the major cloud providers' object storage technologies has become essential, if not mandatory. I encourage you to browse through the various articles, find your specific area of interest, and try out some of the detailed concepts.
* Big Data history compiled from A Short History Of Big Data, by Dr Mark van Rijmenam.