Dataset too big for PROC PRINT? One weird trick solves your problem!
proc print data=bigdata (obs=10);
run;
The OBS= dataset option specifies the last observation to process from an input dataset. In the above example, regardless of dataset size, only the first 10 observations are printed; an easy way to take a quick peek at your data, or preview your PROC PRINT report.
The OBS= dataset option is also useful on the SET statement, to test your datastep on a small number of observations.
data test;
set bigdata (obs=10);
<datastep code>
run;
Watch the SAS Training Post for more SASĀ tricks!
8 Comments
SAS has many different ways to export data to excel, and each of these has serious drawbacks.
Suppose I have 200,000 rows and 10 columns. Suppose I want to print BOTH data labels AND variable labels. So, for example, I want to print 'ETHNICITY', a column containing values 1-4, with data labels such that 1=White, 2=Black, 3=Hispanic, and 4=All others.
I cannot use ODS because SAS runs out of memory at about 5,000 rows.
I cannot use PROC EXPORT because it does not export the data labels.
I cannot use "Send to Excel" because that function does not print data labels for blank cells. So, for example, suppose I use a PROC MEANS with types `YEAR' and `YEAR'*`ETHNICITY', and I want to print 'All ethnicities' to the resulting blanks. No can do.
So, is there another function I'm missing?
My current workaround is to re-code the blank fields produced by the PROC MEANS statement, and then assign data labels to those codes. So, 'ETHNICITY' = 97 = "All ethnicities".
Hello,
What is the limit of number lines printed ? I try to print more than 100.000 lines with 23 variables and it take too much time for my online session of my SAS.
Thanks for your response.
Thanks to those who posted for excellent comments regarding sampling. Great examples of PROC SURVEYSELECT and DATA STEP techniques can be found here ...
http://support.sas.com/kb/24/802.html
Simple random sample with replacement (an observation can be chosen more than once).
... and here ...
http://support.sas.com/kb/24/722.html
Simple random sample without replacement (an observation cannot be chosen more than once).
Andrew pointed out a very valid point about how knowledge of an individual data set can be very crucial when viewing it.
Those problems aside, my favorite method for getting a quick peak at a set (caveat: I work in display manager on Windows systems) is to use the SAS File viewer which displays the data in a tabular (spread sheet) format. I will warn, though, that if the data are on a server, you can put yourself in a "wait" mode if you try to do a scroll and it starts having to move a lot of data.
Some further advantages to this approach are that you can click on a column and then see the variable name, label, format, and informat. If you have data that have come from a database where the column names had invalid characters and, perhaps, multiple spaces, it can be very useful to compare the label with the "real" variable name. Also, you can play with the format if you so choose.
Hence, you get the combined functions of Proc Contents (albeit somewhat limited) and Print in a "live" setting. And, you can "hold" an initial column(s) so as you scroll to the right in a very wide set, you can see id variables. Also, you can hide columns.
Two comments/observations about Jim's post.
First, you want to take a "peek" at your data, not a "peak." (grin)
Second, while the OBS= option in PROC PRINT and the OBS= data set option (it is also a SAS System option, by the way) have their benefits, there are also significant drawbacks to them in the context of examining your large data files.
Here's an example:
A few years ago I did a SAS training class for employees of a state prisons/corrections agency. All data in their files were pre-sorted by offender / prisoner number, which is assigned sequentially to each person when they became an involuntary "guest" of their state prison system. Data sets with information about currently incarcerated individuals sorted from lowest-to-highest offender number always meant the first N rows in the file had the data on the longest serving prisoners. These folks were typically the oldest inmates serving life-without-parole sentences or who had been on death row for many years awaiting execution, and were therefore a distinct subset of the overall inmate population, most of whom would be released from custody in the future. So, the OBS= options always displayed reports with these oldest (both in age and in length of incarceration, usually).
Using a random number generator such the RANUNI or UNIFORM programming functions or the CALL STREAMINIT and RAND Functions will do a better job, I think, of giving you a "peek" at a more representative subset of data sets with too many rows to examine individually. See, "How to Generate Random Numbers in SAS" by Rick Wicklin, at http://blogs.sas.com/content/iml/2011/08/24/how-to-generate-random-numbers-in-sas.html for more information and insight about these tools
Another useful approach for this task is PROC SURVEYSELECT in the SAS/STAT module. Among its many features is the ability to create stratified random samples (e.g., 50 customers at random within each value of geographic region) with a minimum of coding.
Andrew Karp
Sierra Data Science
http://www.SierraDataScience.com
Sometimes data is not "well behaved" and looking at the first ten observations may not be helpful or useful. In such situations, I use firstobs= and obs= to go farther into the dataset.
proc print data=work.bigdata(firstobs=10001 obs=10010);
run;
Ten rows (not 10010) will be printed starting at row 10001.
Using DICTIONARY.TABLES or other means, you can programmatically select a firstobs and obs value that would be dynamic, rather than assuming that you will always have 10010 rows.
You can also specify firstobs=11,obs=10 to get the _next_ 10 obs.
Or WHERE=(ranuni(0) < .01) to randomly pluck 1% of the data (I think there are more statistically correct ways to subsample datasets, but this is at least easy).