Visualize repetition in song lyrics


One of my favorite magazines, Significance, printed an intriguing image of a symmetric matrix that shows repetition in a song's lyrics. The image was created by Colin Morris, who has created many similar images. When I saw these images, I knew that I wanted to duplicate the analysis in SAS!

Visualize repetition in lyrics: A simple example

The analysis is easy. Suppose that a song (or any text source) contains N words. Define the repetition matrix to be the N x N matrix where the (i,j)th cell has the value 1 if the i_th word is the same as the j_th word. Otherwise, the (i,j)th cell equals 0. Now visualize the matrix by using a heat map: Black indicates cells where the matrix is 1 and white for 0. A SAS program that performs this analysis is available at the end of this article.

To illustrate this algorithm, consider the nursery rhyme, "Row, Row, Row the Boat":

Row, row, row your boat
Gently down the stream
Merrily, merrily, merrily, merrily
Life is but a dream.

There are 18 words in this song. Words 1–3 are repeated, as are words 10-–13. You can use the SAS DATA steps to read the words of the song into a variable and use other SAS functions to strip out any punctuation. You can then use SAS/IML software to construct and visualize the repetition matrix. The details are shown at the end of this article.

Visualize repetition in song lyrics: 'Row, Row, Row Your Boat'

The repetition matrix for the song "Row, Row, Row, Your Boat" is shown to the right. For this example I could have put the actual words along the side and bottom of the matrix, but that is not feasible for songs that have hundreds of words. Instead, the matrix has a numerical axis where the number indicates the position of each word in the song.

Every repetition matrix has 1s on the diagonal. In this song, the words "row" and "merrily" are repeated. Consequently, there is a 3 x 3 block of 1s at the top left and a 4 x 4 block of 1s in the middle of the matrix. (Click to enlarge.)

As mentioned, this song has very little repetition. One way to quantify the amount of repetition is to compute the proportion of 1s in the upper triangular portion of the repetition matrix. The upper triangular portion of an N x N matrix has N(N–1)/2 elements. For this song, N=18, so there are 153 cells and 9 of them are 1s. Therefore the "repetition score" is 9 / 153 = 0.059.

Another simple example of repetition in song lyrics

I wrote a SAS/IML function that creates and visualizes the repetition matrix and returns the repetition score. In order to visualize songs that might have hundreds of words, I suppress the outlines (the grid) in the heat map. To illustrate the output of the function, the following image visualizes the words of the song "Here We Go Round the Mulberry Bush":

Here we go round the mulberry bush,
The mulberry bush,
The mulberry bush.
Here we go round the mulberry bush
So early in the morning.
Visualize repetition in song lyrics: 'Here We Go Round the Mulberry Bush'

The repetition score for this song is 0.087. You can see diagonal "stripes" that correspond to the repeating phrases "here we go round" and "the mulberry bush". In fact, if you study only the first seven rows, you can "see" almost the entire structure of the song. The first seven words contain all lyrics except for four words ("so", "early", "in", "morning").

Visualize Classic Song Lyrics

Let's visualize the repetitions in the lyrics of several classic songs.

Hey Jude (The Beatles)

When I saw Morris's examples, the first song I wanted to visualize was "Hey Jude" by the Beatles. Not only does the title phrase repeat throughout the song, but the final chorus ("Nah nah nah nah nah nah, nah nah nah, hey Jude") repeats more than a dozen times. This results in a very dense block in the lower right corner of the repetition matrix and a very high repetition score of 0.183. The following image visualizes "Hey Jude":

Visualize repetition in song lyrics: 'Hey Jude'

Love Shack (The B-52s)

The second song that I wanted to visualize was "Love Shack" by The B-52s. In addition to a title that repeats almost 40 times, the song contains a sequence near the end in which the phrase "Bang bang bang on the door baby" is alternated with various interjections. The following visualization of the repetition matrix indicates that there is a lot of variation interspersed with regular repetition. The repetition score is 0.035.

Visualize repetition in song lyrics: 'Love Shack'

Call Me (Blondie)

Lastly, I wanted to visualize the song "Call Me" by Blondie. This classic song has only 241 words, yet the title is repeated 41 times! In other words, about 1/3 of the song consists of those two words! Furthermore, there is a bridge in the middle of the song in which the phrase "oh oh oh oh oh" is alternated with other phrases (some in Italian and French) that appear only once in the song. The repetition score is 0.077. The song is visualized below:

Visualize repetition in song lyrics: 'Call Me'

How to create a repetition matrix in SAS

If you think this is a fun topic, you can construct these images yourself by using SAS. If you discover a song that has an interesting repetition matrix, post a comment!

Here's the basic idea of how to construct and visualize a repetition matrix. First, use the DATA step to read each word, use the COMPRESS function to remove any punctuation, and standardize the input by transforming all words to lowercase:
data Lyrics;
length word $20;
input word @@;
word = lowcase( compress(word, ,'ps') ); /* remove punctuation and spaces */
Here we go round the mulberry bush,
The mulberry bush,
The mulberry bush.
Here we go round the mulberry bush
So early in the morning.

In SAS/IML software you can use the ELEMENT function to find the locations in the i_th row that have the value 1. After you construct a repetition matrix, you can use the HEATMAPDISC subroutine to display it. For example, the following SAS/IML program reads the words of the song into a vector and visualizes the repetition matrix. It also returns the repetition score, which is the proportion of 1s in the upper triangular portion of the matrix.

ods graphics / width=500 height=500; /* for 9.4M5 you might need NXYBINSMAX=1000000 */
proc iml;
/* define a function that creates and visualizes the repetition matrix */
start VizLyrics(DSName, Title);
   use (DSName); read all var _CHAR_ into Word;  close;
   N = nrow(Word);
   M = j(N,N,0);                       /* allocate N x N matrix */
   do i = 1 to N;
      M[,i] = element(Word, Word[i]);  /* construct i_th row */
   run heatmapdisc(M) title=Title
       colorramp={white black} displayoutlines=0 showlegend=0;
   /* compute the proportion of 1s in the upper triangular portion of the matrix */
   upperIdx = loc(col(M)>row(M));
   return ( M[upperIdx][:] );  /* proportion of words that are repeated */
score = VizLyrics("Lyrics", "Here We Go Round the Mulberry Bush");
print score;

If you want to reproduce the images in this post, you can download the SAS program for this article. In addition, the program creates repetition matrices for "We Didn't Start the Fire" (Billy Joel) and a portion of Martin Luther King Jr.'s "I Have a Dream" speech. You can modify the program and enter lyrics for your favorite songs.


About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.


  1. Dr. Biljana Belamaric Wilsey

    Hi Rick, what an interesting idea! You have inspired me to use SAS's Text Analytics products to do an analysis of most frequent content (not function) terms and topics across popular music genres. Stay tuned for my results! :)

  2. Chris Brooks on

    This is a really interesting idea! I have one suggestion though - in running a few random songs through the code I found I sometimes needed to change the DATALINES statement in the data step to DATALINES4 as the punctuation in some lyrics can cause issues otherwise.

    I also wondered whether it would be possible to determine merely from the pattern what genre of song (or what artist) was being processed - for example could you tell a rap number from a Mowtown song or distinguish between Eminem and Dr Dre merely by the pattern or repetition score? This may be asking to much of the concept but it could be an interesting project...

    • Rick Wicklin

      Thanks for the comment. Regarding the genre, repetition is probably not a primary indicator of the genre because the same song can be arranged as pop, R&B, reggae, etc. However, I think that hip-hop might use the "refrain-verse-refrain" structure less often than other genres.

  3. Great Visualization, thanks Rick. I was reading "Where The Wild Things Are" to my son last night, and was thinking, I wonder what this would like like. “And the wild things roared their terrible roars and gnashed their terrible teeth and rolled their terrible eyes and showed their terrible claws...”

    • Rick Wicklin

      :-) I've fond memories of reading that book to my kids, too. The genre of Early Reading books is filled with repetition. How about "Green Eggs and Ham" by Dr. Seuss?

Leave A Reply

Back to Top