Reading hierarchical data - Part 1

9

FamilyA family and its members represent a simple hierarchy.  For example, the Jones family has four members:

Family_JS

A text file might represent this hierarchy with family records followed by family members' records, like this:

Family2_JS

 

The PROC FORMAT step below defines the codes in Column 1:

proc format; 
   value $type
	 'F'='Family'
	 'E'='Employee'
	 'S'='Spouse'
	 'C'='Child';
run;

A hierarchical file is defined by a header record (such as a family record) followed by any number of detail records (such as family members records).   While our simple family hierarchy has only two levels, in the business world, a hierarchy can have any number of levels.  For example, medical claims data has at least five levels:

  1. Insured person
  2. Insured person's dependents
  3. Patient (insured person or dependent)
  4. Provider (physician, lab, hospital, etc.)
  5. Service (date, DX, $, etc.)

This blog will use our simple family hierarchy, with only two levels, but the illustrated techniques can be generalized to any number of levels.  The Family record (TYPE='F') represents a header record.  All other records represent detail records.

While a text file can represent a hierarchy, modern databases are relational and rectangular.  Therefore, a data step must "flatten" the hierarchy to create a rectangular SAS dataset.

This blog is the first in a series of blogs which will illustrate three different solutions to flattening a hierarchical file.

Solution 1 creates one observation per header record, summarizing the detail data with a COUNT variable, like this:

Summary Approach: One observation per header record
 
Obs    Family     Count
 
 1     Jones        4
 2     Sanchez      1
 3     Smith        2
                  =====
                    7

The above solution is concise, but detail data is lost.  Solution 1 is illustrated in today's blog.

Solution 2 also creates one observation per header record, but includes detail data, in a wide format, like this:

Detail Approach: One observation per header record
 
Obs    Family     Employee    Spouse    Child1    Child2    Child3
 
 1     Jones       Bob        Carol     Sally     Alice
 2     Sanchez     Mary
 3     Smith       Nancy      Harold

The above solution is readable for a small number of detail items, but is impractical for larger numbers of detail items.  This example allows for up to three children, an arbitrary number, and obviously unrealisic.  How many children should we allow for?  10?  15?   With mostly missing values?  You see the problem.

Solution 3 contains identical data to Solution 2, but in a tall format, with one observation per detail record, like this:

Detail Approach: One observation per detail record
 
Obs    Family     Person    Type
 
 1     Jones      Bob       Employee
 2     Jones      Carol     Spouse
 3     Jones      Sally     Child
 4     Jones      Alice     Child
 5     Sanchez    Mary      Employee
 6     Smith      Nancy     Employee
 7     Smith      Harold    Spouse

The above solution is practical for any number of detail items, with no arbitrary limit to the number of children, and no missing values.

All three solutions "flatten" the hierarchical text file to create a rectangular SAS dataset.

All three solutions are efficient because:

  1. Sorting is not required
  2. A single DATA step is used

The following DATA step imports the text file into a SAS dataset.  This is not a solution:

data not;
   infile 'c:\temp\families.txt';
   input Type: $1. Name $;
run;
 
title 'Not a Solution';
proc print data=not;
   var name type;
   format type $type.;
run;
Not a Solution
 
Obs    Name       Type
 
  1    Jones      Family
  2    Bob        Employee
  3    Carol      Spouse
  4    Sally      Child
  5    Alice      Child
  6    Sanchez    Family
  7    Mary       Employee
  8    Smith      Family
  9    Nancy      Employee
 10    Harold     Spouse

The above approach is not a solution because the meaning of an observation is ambiguous.  What do the 10 observations represent?

10 families? No!
10 family members? No again!

Some observations represent a family.  Other observations represent family members.  Worse, family data are not linked to family members' data.  For example, "Bob" and "Jones" appear on separate observations, when they should appear on the same observation.

Today's blog will present Solution 1, a summary approach with one observation per header record, without detail data, like this:

Summary Approach
 
Obs    Family     Count
 
 1     Jones        4
 2     Sanchez      1
 3     Smith        2
                  =====
                    7

Subsequent blogs will present solutions 2 and 3, which preserve detail data.

As a reminder, the data step for our non-solution looked like this:

data not;
   infile 'c:\temp\families.txt';
   input Type: $1. Name $;
run;

Let's modify the non-solution above to output one observation per header record, by adding a KEEP statement and a sub setting IF statement, like this:

The saslog below shows a read count of 10 and a write count of 3.  Perfect!

1186  data summary1;
1187     keep Name;
1188     infile 'c:\temp\families.txt';
1189     input Type: $1. Name $;
1190     if type='F';  * Header records only;
1191  run;
 
NOTE: 10 records were read from the infile 'c:\temp\families.txt'.
      The minimum record length was 5.
      The maximum record length was 9.
NOTE: The data set WORK.SUMMARY1 has 3 observations and 1 variables.

The results look like this:

Summary Approach
 
Obs     Name
 
 1     Jones
 2     Sanchez
 3     Smith

That was easy!  We just need to add a COUNT variable and we're done for today!

Initializing the COUNT variable and controlling the output is tricky without a BY statement and FIRST. / LAST. variables, but nothing we can't handle!

Here's Solution 1.  The log below shows a read count of 10 and a write count of 3, identical to our simple version above.

930  data summary2;
931     keep Family Count;
932     retain Family;
933     infile 'c:\temp\families.txt' end=eof;
934     input Type $ Name $;
935     if type='F' then do; * Family record;
936        if _N_ > 1 then output;
937        Family=Name;
938        Count=0;
939     end;
940     else Count+1;        * Family member record;
941     if eof then output;  * Last record;
942  run;
 
NOTE: 10 records were read from the infile 'c:\temp\families.txt'.
      The minimum record length was 5.
      The maximum record length was 9.
NOTE: The data set WORK.SUMMARY has 3 observations and 2 variables.

The RETAIN statement holds the value of the FAMILY variable in memory until the next  "F" record, which causes the DO group to:

  1. output one observation per header (family) record
  2. overwrite the retained value of the FAMILY variable with the NAME of the next family
  3. initialize the retained value of the COUNT variable back to zero for the next family.

The value of the COUNT variable is also retained because it appears in a sum statement

ELSE COUNT+1;

where it counts the detail records (family members) for each family when TYPE is not 'F';

Since there is no final 'F' record to trigger the DO group for the last family, the final statement

IF EOF THEN OUTPUT;

outputs the last family.  EOF is an end-of-file flag, created by the END= option on the INFILE statement.

The table below illustrates how the above program works.

Family_3_JS

This blog has gotten long enough, so I will stop here, for today, but watch this space for my next blog, which will illustrate Solution 2, which, like Solution 1, creates one observation per header record, but includes detail data, in a wide format, like this:

Detail Approach: One observation per header record
 
Obs    Family     Employee    Spouse    Child1    Child2    Child3
 
 1     Jones       Bob        Carol     Sally     Alice
 2     Sanchez     Mary
 3     Smith       Nancy      Harold
Share

About Author

Jim Simon

Principal Technical Training Consultant

Jim Simon is a principal instructor and course developer for SAS Education. Jim has a bachelor's degree from UCLA and a master's degree from California State University at Northridge. Prior to joining the SAS Irvine office in 1988, Jim was an instructor at Ventura College and a SAS programmer at The Medstat Group in Santa Barbara. Jim's areas of specialization include the DATA step, application development, web enablement, and the SAS macro language. A native of Southern California, Jim enjoys anything in the warm California sun. On weekends, Jim loves jumping in his Corvette, turning up the stereo, and cruising Pacific Coast Highway, top down, South to Laguna Beach or North to his old home town, Santa Barbara.

9 Comments

  1. Hi Jim,

    How can I read these kind of data in sas, any example? I have schema and I have tried mapping too, but I am having hard time.
    Please help with any example.

    Thanks in advance.
    Hemant

  2. Kathleen Hanrahan on

    Jim,

    I think that this set of blogs on reading hierarchical data is an excellent series. Thank you! I've got one question - how is the total count (7) generated and printed? Or is that left as an exercise for the student?!

    Kathleen

    • Jim Simon

      Hi Kathleen. Thank you for posting. The total count (7) is generated by adding SUM COUNT; to the PROC PRINT step (not shown).

  3. Pingback: Reading Hierarchical Data - Part 3

  4. Pingback: Reading hierarchical data (Part 2) | SAS Training Post

  5. bob mcconnaughey on

    And..the 1987 edition of the SAS Applications Guide has an example of how to read in the Census Public Use hierarchical files. (An earlier edition included JCL for reading the Public Use files from tape!)

    • Maggie Miller

      Hi Daria,
      Thank you for noticing the &gt. That happens when I'm editing the posts in different views. I have fixed the code. -Maggie, SAS Training Post Editor

Back to Top