`data.table` and Text Analysis: Analyzing the Four Gospels
06 Jul 2019

I have really been enjoying quantitative text analysis lately, sparked by a project I helped on regarding Autoscore. I’ve also wanted to use `data.table` more often in my work. A post showing `data.table` being used with pipes (which I tweeted about here) got me really going.
This post, then, shows data manipulation using `%>%`, `data.table`, and other useful functions from the `tidyverse` to analyze textual data.
I picked the text of the Holy Bible, provided by Andrew Heiss on GitHub here and on CRAN. I also pull from the `quanteda` and `tm` packages for some of the textual analyses. Finally, I can’t forget the `tidytext` package for preparing and analyzing the text.
Below I grab the `new_testament` data set (King James Version), create a data table with it, and keep just the four gospels.[1] The next part of the pipe creates the `tokens` variable, which holds each individual word in each verse of the scriptures. The `purrr::map(...)` call does the heavy lifting and ultimately creates a list column (a variable whose entries are themselves lists). To get it into a more useful form, we unnest it.
## book_title chapter_number verse_number tokens
## 1: Matthew 1 1 THE
## 2: Matthew 1 1 book
## 3: Matthew 1 1 of
## 4: Matthew 1 1 the
## 5: Matthew 1 1 generation
## 6: Matthew 1 1 of
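The tokenize-then-unnest step described above can be sketched on toy data. This is a minimal sketch, not the code used in this post: the two verses, the `strsplit()` tokenizer, and the names `verses` and `gospels_toy` are assumptions made for illustration, and base R stands in for the `purrr::map(...)` call.

```r
library(data.table)

# Toy stand-in for the verse-level data. Column names mirror the output above;
# the rows themselves are made up for illustration.
verses <- data.table(
  book_title     = c("Matthew", "Matthew"),
  chapter_number = c(1L, 1L),
  verse_number   = c(1L, 2L),
  text           = c("THE book of the generation", "Abraham begat Isaac")
)

# Build a list column: one character vector of words per verse.
verses[, tokens := strsplit(text, "[^[:alpha:]']+")]

# Unnest: one row per token, keeping the verse identifiers.
gospels_toy <- verses[, .(tokens = unlist(tokens)),
                      by = .(book_title, chapter_number, verse_number)]
```

Grouping by the verse identifiers and calling `unlist()` is one way to flatten a list column in plain `data.table`; `tidyr::unnest()` gives the same shape.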
The Four Gospels
The first part of our analysis looks at general themes of the gospels, including the number of references to God (and synonyms), as well as to Satan (and synonyms). We also look at general sentiment, readability, and common word choices.
First, we look at the references to God and Satan. We start by creating a vector of synonyms for God and then for Satan.
Then we take `gospels`, convert the `tokens` variable to lower case, and create the `god` variable: if the token is in the God synonym vector, the variable gets a 1, otherwise a 0. We do the same for the `satan` variable.
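As a rough sketch of this flagging step: the synonym vectors below (`god_words`, `satan_words`) are hypothetical and much shorter than any real list would be, and the five tokens are made up.

```r
library(data.table)

# Hypothetical synonym vectors, for illustration only.
god_words   <- c("god", "lord", "father")
satan_words <- c("satan", "devil")

toy <- data.table(tokens = c("The", "LORD", "said", "Satan", "God"))

# Lower-case the tokens so matching is case-insensitive.
toy[, tokens := tolower(tokens)]

# 1 if the token is in the synonym vector, 0 otherwise.
toy[, god   := as.integer(tokens %in% god_words)]
toy[, satan := as.integer(tokens %in% satan_words)]
```

`%in%` returns a logical vector, so `as.integer()` gives the same 1/0 coding an `ifelse()` would.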
This next step determines, by book, chapter, and verse, whether any of the tokens reference God, then computes the proportion of verses within each book and chapter that reference God. Finally, the pipe plots it using the `ggplot2` package.
Overall, the four books consistently reference God in about 5-15% of verses across the chapters. John appears to reference God the most, especially toward the end of the book, with the last few chapters referencing God in more than 10% of verses.
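The two-stage aggregation behind the plot (tokens to verses, then verses to chapters) might look like the following. It is a sketch on made-up rows; `verse_level`, `chapter_level`, `god_verse`, and `prop_god` are names chosen here for illustration.

```r
library(data.table)

# Toy token-level data with a 0/1 `god` flag (values are made up).
toy <- data.table(
  book_title     = "John",
  chapter_number = c(1L, 1L, 1L, 1L, 2L, 2L),
  verse_number   = c(1L, 1L, 2L, 2L, 1L, 1L),
  god            = c(0L, 1L, 0L, 0L, 0L, 0L)
)

# Stage 1: does any token in the verse reference God?
verse_level <- toy[, .(god_verse = as.integer(any(god == 1L))),
                   by = .(book_title, chapter_number, verse_number)]

# Stage 2: proportion of verses in each chapter that reference God.
chapter_level <- verse_level[, .(prop_god = mean(god_verse)),
                             by = .(book_title, chapter_number)]
```

Taking the mean of a 0/1 indicator gives the proportion directly. A table like `chapter_level` could then be passed to `ggplot()` with chapter on the x-axis and one panel or line per book.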
Now for the references to Satan.
These are much, much lower than the references to God. The spikes are the chapters that present Satan’s temptations of Jesus. For the most part, though, there are not many references to Satan in the gospels.
Sentiment
The next question deals with the sentiment of the writing. Using the “NRC” sentiment lexicon available through `tidytext`, we get the overall sentiments of each book.
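Lexicon-based sentiment scoring comes down to joining tokens to a word list and counting. Here is a minimal sketch with a hypothetical four-word lexicon standing in for NRC (the real lexicon is far larger and has more categories); the tokens are made up as well.

```r
library(data.table)

# Hypothetical mini-lexicon (word -> sentiment), for illustration only.
lexicon <- data.table(
  word      = c("joy", "fear", "love", "weep"),
  sentiment = c("positive", "negative", "positive", "negative")
)

toy <- data.table(
  book_title = c("Mark", "Mark", "John", "John", "John"),
  tokens     = c("fear", "joy", "love", "love", "word")
)

# Inner join: tokens that are not in the lexicon drop out.
scored <- merge(toy, lexicon, by.x = "tokens", by.y = "word")

# Count sentiment-bearing tokens per book and sentiment.
sentiment_counts <- scored[, .N, by = .(book_title, sentiment)]
```

Because the books differ in length, dividing these counts by each book's total token count would make them easier to compare.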
Readability
The Flesch reading ease score attempts to quantify how difficult text is to understand. This site demonstrates that:
Score | Notes |
---|---|
90 – 100 | easily understood by an average 11-year-old student |
60 – 70 | easily understood by 13-15-year-old students |
0 – 30 | best understood by university graduates |
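For reference, the standard Flesch reading ease formula is given below. I am assuming this is the measure computed in the analysis that follows; the exact measure requested from the readability function is not shown here.

$$
\text{FRE} = 206.835 - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right) - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right)
$$

Longer sentences and longer words (more syllables per word) both lower the score, which is why higher scores mean easier text.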
We run the statistic using the `quanteda::textstat_readability()` function and then summarize with the mean and standard deviation of the reading score.
## book_title Mean SD
## 1: Matthew 73.41289 14.58498
## 2: Mark 74.95304 13.84790
## 3: Luke 74.80256 14.05367
## 4: John 77.34579 13.03790
This suggests that each book is written at roughly an 11- to 13-year-old reading level. This may be somewhat misleading: the words themselves are not particularly difficult, but the archaic English of the King James Version may make the text harder to read than the score implies.
Word Choice
First we look at book length. Mark was the shortest; Luke was the longest (not too surprising if you’ve read these books before).
## book_title N
## 1: Mark 15186
## 2: John 19125
## 3: Matthew 23727
## 4: Luke 25985
Next, we look at the most common three- and five-word sequences in each book. We go back to our original `gospels` data table to do this (without the unnesting).
The following lines are a bit convoluted, but I’ll attempt to explain them. First, we unnest the `gospels` data table by our new variable `gram3`, which houses all of the three-word combinations. Next, we remove the underscores and then remove the words that we are not very interested in (e.g. tell, said, saith). Then we get the counts (using `.N`) by book and three-word combination. We sort the counts in descending order and grab the top 10 sequences for each book using `.SD[1:10]`. We then create a new variable that is the average count of each three-word combination across books and order by that variable. Finally, we make the three-word sequences a factor ordered by average count, and then we plot it.
This shows the most common three-word sequences. Several are common across the books: kingdom of God (heaven), Son of man, and other references to heaven.
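The count-and-rank pipeline for three-word sequences can be sketched on toy text. This is not the post's code: `make_ngrams()` is a hypothetical base-R helper, the three sentences are made up, and the stop-word and underscore clean-up is left out for brevity.

```r
library(data.table)

# Hypothetical helper: all n-word sequences in one string.
make_ngrams <- function(text, n = 3L) {
  words <- strsplit(tolower(text), "[^[:alpha:]']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1L)
  vapply(starts, function(i) paste(words[i:(i + n - 1L)], collapse = " "),
         character(1))
}

# Toy verse text (made up for illustration).
toy <- data.table(
  book_title = c("Matthew", "Matthew", "Luke"),
  text = c("the kingdom of heaven is at hand",
           "for the kingdom of heaven is like",
           "the kingdom of God is come")
)

# One row per (book, three-word sequence).
grams <- toy[, .(gram3 = unlist(lapply(text, make_ngrams, n = 3L))),
             by = book_title]

# Count each sequence per book, sort descending, keep the top 2 per book.
counts <- grams[, .N, by = .(book_title, gram3)]
setorder(counts, book_title, -N)
top <- counts[, head(.SD, 2L), by = book_title]

# Average count of each sequence across books, used to order the factor.
top[, avg_n := mean(N), by = gram3]
top[, gram3 := factor(gram3, levels = unique(gram3[order(avg_n)]))]
```

The sketch uses `head(.SD, 2L)` rather than `.SD[1:2]` because `head()` does not pad with `NA` rows when a book has fewer sequences than requested.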
We follow the same procedure for the five word sequences.
This shows the most common five-word sequences. Interestingly, this is quite different from the three-word sequences. This is partly due to the nature of longer strings of words: repeated five-word sequences are rarer than repeated three-word ones. Still, there is reference to the kingdom of God (heaven) in both Luke and Matthew. In John, the quote “Simon, son of Jonas, lovest thou me?” shows up as two of the five-word sequences. It also shows that the phrase “I will raise him up at the last day” is common in John. Matthew, on the other hand, has four of the top 10 referencing heaven. In Mark, the phrase “If any man have ears to hear, let him hear” is common.
The Words of Jesus Upcoming
In a future blog post, I want to analyze the words of Jesus as quoted in the Gospels. I’m interested in using the New International Version or another modern translation for this. If you are aware of a corpus that would allow this, I would love to hear from you!
[1] For those unfamiliar, the first four books of the New Testament are about the mortal life of Jesus and are often referred to as the “gospels”. They may be the most oft read books in all of Christianity.