Reading hierarchical data (Part 2)

This post is the second in a series that illustrates three different solutions to "flattening" hierarchical data.

Solution 1, from my previous post, created one observation per header record, summarizing the detail data with a COUNT variable, like this:

Summary Approach: One observation per header record
 
Obs    Family     Count
 
 1     Jones        4
 2     Sanchez      1
 3     Smith        2
                  =====
                    7

Solution 2, illustrated in today's blog, creates one observation per header record, like Solution 1, but replaces the COUNT variable with detail data, in a wide format, like this:

Detail Approach: One observation per header record
 
Obs    Family     Employee    Spouse    Child1    Child2    Child3
 
 1     Jones       Bob        Carol     Sally     Alice
 2     Sanchez     Mary
 3     Smith       Nancy      Harold

Here is the Solution 1 program from my previous post:

data summary2;
	keep Family Count;
	retain Family;
	infile 'c:\temp\families.txt' end=eof;
	input Type $ Name $;
	if type='F' then do;	* Family record;
		if _N_ > 1 then output;
		Family=Name;
		Count=0;				
	end;
	else Count+1;  		* Family member record;
	if eof then output; 	* Last record;
run;

The Solution 2 program below begins similarly to the Solution 1 program.
The read and write counts are identical.

There are also several differences:

1. The number of variables has increased, affecting the KEEP and RETAIN statements.
2. LENGTH and ARRAY statements are new.
3. The CALL MISSING statement initializes the retained values of the detail variables back to missing values for the next family.
4. Conditional assignment statements populate the variables EMPLOYEE and SPOUSE.
5. The N variable counts children.
6. The CHILD1-CHILD3 variables are populated thru an array.

 
data wide;
 
	keep   Family Employee Spouse Child1-Child3;
	retain Family Employee Spouse Child1-Child3;
        length Family Employee Spouse Child1-Child3 $ 8;
 
	array C{3} $ 8 Child1-Child3;
 
	infile 'c:\temp\families.txt' end=eof;
 
	input Type $ Name $;
 
	if type='F' then do;		  
		if _N_ > 1 then output;  
		Family=Name;
		N=0;			* Number of children;
		call missing(of Employee Spouse Child1-Child3);
	end;
 
	else if type='E' then Employee=Name;
	else if type='S' then Spouse=Name;
 
        else if type='C' then do;         * Child record;
		N+1;			  * Count children;
		C{N}=Name;		  * Assign each child into array;
	end;
 
	if eof then output; 	  
 
run;

As I said in my previous post, today's solution works for a small number of detail items, but is impractical for larger numbers of detail items. The example above allows for up to three children, an arbitrary number, and obviously unrealisic. How many children should we allow for? 10? 15? With mostly missing values? You see the problem.

Watch this space for Solution 3 in my next blog, which, unlike today's Solution 2, has no arbitrary limit to the number of detail items, because it stores the detail data in a tall, rather than wide, format, as shown below, with one observation per detail record, rather than one observation per header record.

Detail Approach: One observation per detail record
 
Obs    Family     Person    Type
 
 1     Jones      Bob       Employee
 2     Jones      Carol     Spouse
 3     Jones      Sally     Child
 4     Jones      Alice     Child
 5     Sanchez    Mary      Employee
 6     Smith      Nancy     Employee
 7     Smith      Harold    Spouse

Solution 3, as shown above, contains identical data to Solution 2, but is practical for any number of detail items. Be sure to watch for my next post!

John on October 14, 2015 10:04 am

There are advantages to both the wide data set (from solution 2) and the tall data set (from solution 3). However, in solution 2 the use of an arbitrary array size as in the above example is not an insoluble problem. You can use the code from solution 1 as the basis for a data set to determine the array size so that the program dynamically sizes the array so that it is always large enough to hold your data but never larger than necessary as follows:

data _null_;
retain MaxChild;
infile 'c:\temp\families.txt' end=eof;
input Type $ Name $;
if _n_=1 then MaxChild=1 * Minimum array size;
if type='F' then do; * Family record;
MaxChild=max(NumChild,MaxChild);
NumChild=0;
end;
else if type='C' then NumChild+1; * Child record;
if eof then do; * Last record;
call symput('ChildRec',strip(put(MaxChild,8.)));
end;
run;

data wide;

keep Family Employee Spouse Child1-Child&ChildRec;
retain Family Employee Spouse Child1-Child&ChildRec;
length Family Employee Spouse Child1-Child&ChildRec $ 8;

array C{&ChildRec} $ 8 Child1-Child&ChildRec;

infile 'c:\temp\families.txt' end=eof;

input Type $ Name $;

if type='F' then do;
if _N_ > 1 then output;
Family=Name;
N=0; * Number of children;
call missing(of Employee Spouse Child1-Child&ChildRec);
end;

else if type='E' then Employee=Name;
else if type='S' then Spouse=Name;

else if type='C' then do; * Child record;
N+1; * Count children;
C{N}=Name; * Assign each child into array;
end;

if eof then output;

run;

3 Comments

John on October 14, 2015 10:04 am

There are advantages to both the wide data set (from solution 2) and the tall data set (from solution 3). However, in solution 2 the use of an arbitrary array size as in the above example is not an insoluble problem. You can use the code from solution 1 as the basis for a data set to determine the array size so that the program dynamically sizes the array so that it is always large enough to hold your data but never larger than necessary as follows:

data _null_;
retain MaxChild;
infile 'c:\temp\families.txt' end=eof;
input Type $ Name $;
if _n_=1 then MaxChild=1 * Minimum array size;
if type='F' then do; * Family record;
MaxChild=max(NumChild,MaxChild);
NumChild=0;
end;
else if type='C' then NumChild+1; * Child record;
if eof then do; * Last record;
call symput('ChildRec',strip(put(MaxChild,8.)));
end;
run;

data wide;

keep Family Employee Spouse Child1-Child&ChildRec;
retain Family Employee Spouse Child1-Child&ChildRec;
length Family Employee Spouse Child1-Child&ChildRec $ 8;

array C{&ChildRec} $ 8 Child1-Child&ChildRec;

infile 'c:\temp\families.txt' end=eof;

input Type $ Name $;

if type='F' then do;
if _N_ > 1 then output;
Family=Name;
N=0; * Number of children;
call missing(of Employee Spouse Child1-Child&ChildRec);
end;

else if type='E' then Employee=Name;
else if type='S' then Spouse=Name;

else if type='C' then do; * Child record;
N+1; * Count children;
C{N}=Name; * Assign each child into array;
end;

if eof then output;

run;
- John on October 14, 2015 1:58 pm
  
  I just realize that the symput statement in the first data step should be:
  call symput('ChildRec',strip(put(max(NumChild,MaxChild),8.)));
  otherwise the array in the second data step will be too small if the last family is the one with the most children.
Pingback: Reading Hierarchical Data - Part 3

Blogs

Blogs

Reading hierarchical data - Part 2

About Author

3 Comments