Stylometry and the Pauline Epistles

Stylometry is the statistical analysis of literary style. It includes the study of documents’ quantitative attributes like average sentence length or relative frequencies of certain words. Authorship attribution is an important application of stylometry. Individual authors have stylometric fingerprints, and those fingerprints can help to determine who wrote (or didn’t write) certain documents.
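As a toy illustration of the kinds of quantitative attributes stylometry works with, here is a short Python sketch; the example text and the features computed (average sentence length, relative word frequencies) are made up purely for demonstration:

```python
from collections import Counter

text = ("I returned, and saw under the sun, that the race is not to the swift. "
        "But time and chance happeneth to them all.")

# Average sentence length, measured in words.
sentences = [s.split() for s in text.split('. ') if s]
avg_len = sum(len(s) for s in sentences) / len(sentences)

# Relative frequency of each word, ignoring case and punctuation.
words = text.lower().replace('.', '').replace(',', '').split()
rel_freq = {w: c / len(words) for w, c in Counter(words).items()}

print(avg_len)          # average words per sentence
print(rel_freq['the'])  # relative frequency of "the"
```

Features like these, computed over many texts, are the raw material for the authorship analyses below.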

A famous authorship attribution problem concerns The Federalist Papers, a collection of essays by Alexander Hamilton, John Jay, and James Madison. All three men published essays under the common pseudonym “Publius.” Because of this, it was unknown for some time which of them had written some of the essays. In 1963, statisticians Frederick Mosteller and David Wallace analyzed word frequencies to determine the authorship of the disputed papers. They made the striking observation that Hamilton used the word “upon” about 18 times more frequently than Madison did. This feature, among many others, allowed Mosteller and Wallace to conclude that Madison had written twelve of the disputed essays.

The New Testament is another interesting candidate for stylometric analysis. Its twenty-seven books were written by several authors: some known, some anonymous, and some contested. Of special interest are the Pauline epistles: thirteen letters to churches or individuals that bear the Apostle Paul’s name. Scholarly consensus regards seven of these epistles (Romans, 1 and 2 Corinthians, Galatians, Philippians, 1 Thessalonians, and Philemon) as genuinely Pauline. The authenticity of the others (Ephesians, Colossians, 2 Thessalonians, 1 and 2 Timothy, and Titus) is disputed. Some church traditions consider Hebrews to be Pauline, but it is anonymous and scholars agree that Paul didn’t write it.

Burrows’s Delta

Burrows’s Delta is a statistic that measures similarity between two texts. Roughly speaking, it compares the relative frequencies of certain words in each text. It is defined as follows.

Let D and D' be two documents within a larger database of documents and let w_1, w_2, \dots, w_n be the n most frequently occurring words in the database. We will denote by f_i(D) the number of times that w_i appears in D. In order to treat commonly appearing words with the same weight as less common words, Burrows used z-scores to normalize word counts.

    \[z_i(D) = \frac{f_i(D)-\mu_i}{\sigma_i}\]

Here \mu_i and \sigma_i are the average and standard deviation, respectively, of f_i over all documents in the database. Delta is then defined as follows.

    \[\Delta^{(n)}(D,D') = \sum_{i=1}^n \frac{|z_i(D)-z_i(D')|}{n}\]

Burrows originally used his Delta to match a disputed text to one author from a fixed set of possible authors. Our situation is a bit different: we’re trying to determine whether or not a single person is likely to be the author of certain texts; there is no fixed list of candidate authors. Even though our implementation will be different from Burrows’s, we should still be able to extract some meaningful information.
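To make the definition concrete, here is a minimal sketch of Delta computed directly from the formulas above, using made-up relative frequencies for four hypothetical documents (none of this is actual New Testament data):

```python
import statistics

# Toy relative frequencies of the n = 3 most frequent words (w_1, w_2, w_3)
# in four hypothetical documents.
counts = {
    'doc_A': [0.050, 0.030, 0.010],
    'doc_B': [0.048, 0.031, 0.012],
    'doc_C': [0.060, 0.020, 0.015],
    'doc_D': [0.055, 0.025, 0.011],
}
n = 3

def z(D, i):
    # z-score of word i in document D, relative to the whole database.
    # statistics.stdev is the sample standard deviation, matching pandas' default.
    col = [counts[doc][i] for doc in counts]
    return (counts[D][i] - statistics.mean(col)) / statistics.stdev(col)

def delta(D1, D2, n):
    # Mean absolute difference of z-scores over the n most frequent words.
    return sum(abs(z(D1, i) - z(D2, i)) for i in range(n)) / n

print(delta('doc_A', 'doc_B', n))  # small: similar word-frequency profiles
print(delta('doc_A', 'doc_C', n))  # larger: dissimilar profiles
```

Documents with similar word-frequency profiles yield small Delta values; dissimilar documents yield large ones.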

Our database of documents will be the collection of New Testament books (or, in some cases, groups of New Testament books by the same author). The most important document in our database will be a single Pauline “training text” to which other New Testament books can be compared. I randomly chose four of Paul’s undisputed epistles (Romans, 1 Corinthians, Galatians, and Philemon) to serve as the Pauline training text. Those books, together, will be our baseline for evaluating Paul’s writing style. His remaining three undisputed epistles (2 Corinthians, Philippians, and 1 Thessalonians) will be left as a sort of control group. By measuring Delta between the control group epistles and our Pauline training text, we’ll get an idea of what Delta values to expect from authentic Pauline literature. We’ll also measure Delta between non-Pauline New Testament works and the Pauline training text to get an idea of what Delta values to expect from other authors.

Preparing the Text for Analysis

A complete Greek New Testament is available online as a plain-text file, with one verse per line. Let’s see how it looks.

In [1]:
lines = []
with open('ugntdat.txt', encoding='utf8') as f:
    for line in f:
        lines.append(line.strip())
In [2]:
for line in lines[:10]:
    print(line)
Mat|1|1|βίβλος γενέσεως ἰησοῦ χριστοῦ υἱοῦ δαυὶδ υἱοῦ ἀβραάμ.
Mat|1|2|ἀβραὰμ ἐγέννησεν τὸν ἰσαάκ, ἰσαὰκ δὲ ἐγέννησεν τὸν ἰακώβ, ἰακὼβ δὲ ἐγέννησεν τὸν ἰούδαν καὶ τοὺς ἀδελφοὺς αὐτοῦ,
Mat|1|3|ἰούδας δὲ ἐγέννησεν τὸν φάρες καὶ τὸν ζάρα ἐκ τῆς θαμάρ, φάρες δὲ ἐγέννησεν τὸν ἑσρώμ, ἑσρὼμ δὲ ἐγέννησεν τὸν ἀράμ,
Mat|1|4|ἀρὰμ δὲ ἐγέννησεν τὸν ἀμιναδάβ, ἀμιναδὰβ δὲ ἐγέννησεν τὸν ναασσών, ναασσὼν δὲ ἐγέννησεν τὸν σαλμών,
Mat|1|5|σαλμὼν δὲ ἐγέννησεν τὸν βόες ἐκ τῆς ῥαχάβ, βόες δὲ ἐγέννησεν τὸν ἰωβὴδ ἐκ τῆς ῥούθ, ἰωβὴδ δὲ ἐγέννησεν τὸν ἰεσσαί,
Mat|1|6|ἰεσσαὶ δὲ ἐγέννησεν τὸν δαυὶδ τὸν βασιλέα. δαυὶδ δὲ ἐγέννησεν τὸν σολομῶνα ἐκ τῆς τοῦ οὐρίου,
Mat|1|7|σολομὼν δὲ ἐγέννησεν τὸν ῥοβοάμ, ῥοβοὰμ δὲ ἐγέννησεν τὸν ἀβιά, ἀβιὰ δὲ ἐγέννησεν τὸν ἀσάφ,
Mat|1|8|ἀσὰφ δὲ ἐγέννησεν τὸν ἰωσαφάτ, ἰωσαφὰτ δὲ ἐγέννησεν τὸν ἰωράμ, ἰωρὰμ δὲ ἐγέννησεν τὸν ὀζίαν,
Mat|1|9|ὀζίας δὲ ἐγέννησεν τὸν ἰωαθάμ, ἰωαθὰμ δὲ ἐγέννησεν τὸν ἀχάζ, ἀχὰζ δὲ ἐγέννησεν τὸν ἑζεκίαν,
Mat|1|10|ἑζεκίας δὲ ἐγέννησεν τὸν μανασσῆ, μανασσῆς δὲ ἐγέννησεν τὸν ἀμώς, ἀμὼς δὲ ἐγέννησεν τὸν ἰωσίαν,

Each verse is conveniently formatted as (abbreviated book title)|(chapter number)|(verse number)|(verse text). We only need the book titles and the individual words in each book, so we’ll make a dictionary that associates titles with texts. Each dictionary value will be a list of words. Python’s string module will help us get rid of all the punctuation.

In [3]:
import string
In [4]:
# Make a list of New Testament book abbreviations.
books = []
for line in lines:
    if line[:3] not in books:
        books += [line[:3]]

# Make a dictionary of word lists from New Testament books.
book_dict = {}
for line in lines:
    if line[:3] not in book_dict:
        book_dict[line[:3]] = []
    book_dict[line[:3]] += line.split('|')[-1].translate(str.maketrans('', '', string.punctuation+'·')).split()

# Preview the first ten words of Matthew.
book_dict['Mat'][:10]
['βίβλος', 'γενέσεως', 'ἰησοῦ', 'χριστοῦ', 'υἱοῦ', 'δαυὶδ', 'υἱοῦ', 'ἀβραάμ', 'ἀβραὰμ', 'ἐγέννησεν']


Now that the data is in a usable format, we can start counting words. We’ll make a list of features, that is, distinct words. Burrows’s Delta uses only the n most frequent words, so we’ll only include words that appear in the New Testament at least twice. That’s probably more words than we need, but we can worry about choosing n later.

In [5]:
# Make a dictionary of word counts of distinct New Testament words.
word_count_dict = {}
for book, text in book_dict.items():
    for word in text:
        if word in word_count_dict:
            word_count_dict[word] += 1
        else:
            word_count_dict[word] = 1
In [6]:
import nltk
In [7]:
NT_word_list = []

for book in books:
    NT_word_list += book_dict[book]

# Make a list of every word in the New Testament that appears at least twice.
features = [x[0] for x in nltk.FreqDist(NT_word_list).most_common() if x[1] >= 2]

for feature in features[:10]:
    print(feature)
Now we’ll group some New Testament works together. Romans, 1 Corinthians, Galatians, and Philemon were chosen to be our Pauline training text, so we’ll group those books together. I also chose to group John’s three epistles together. 2 John and 3 John are very short (less than 250 words each), and shorter documents are more likely to have uncharacteristic word frequencies. Combining them with 1 John avoids this problem.

In [8]:
# Make a dictionary that groups together some books by author.
group_dict = {'Rom-Co1-Gal-Plm':[], 'Jo1-2-3':[]}
for book in books:
    if book in ['Rom', 'Co1', 'Gal', 'Plm']:
        group_dict['Rom-Co1-Gal-Plm'] += book_dict[book]
    elif book in ['Jo1', 'Jo2', 'Jo3']:
        group_dict['Jo1-2-3'] += book_dict[book]
    else:
        group_dict[book] = book_dict[book]

Now we can make a Pandas DataFrame with word frequencies for each book or group of books.

In [9]:
import pandas as pd
In [10]:
# Make a dataframe with relative word frequencies for each group of New Testament books.
df = pd.DataFrame(columns=features)

for group, text in group_dict.items():
    group_len = len(text)
    for feature in features:
        df.loc[group, feature] = text.count(feature)/group_len

καὶ ἐν δὲ τοῦ εἰς τὸ τὸν τὴν αὐτοῦ ... ψευδοπροφήτης ἔζησαν ἐκρίθησαν γεγραμμένων ἀληθινοί ὑψηλόν ἔδειξέν πυλῶσιν ἐμέτρησεν ἴασπις
Rom-Co1-Gal-Plm 0.0390185 0.0170251 0.0238716 0.023508 0.0199334 0.0114511 0.0153287 0.00751287 0.00908816 0.00418055 ... 0 0 0 0 0 0 0 0 0 0
Jo1-2-3 0.0610365 0.0333973 0.0345489 0.00383877 0.0261036 0.0049904 0.00921305 0.0230326 0.0103647 0.0230326 ... 0 0 0 0 0 0 0 0 0 0
Mat 0.0640467 0.0268723 0.0159708 0.0256732 0.0160253 0.0118827 0.0123733 0.0120462 0.0110651 0.0144991 ... 0 0 0 0 0 0 0 0 0 0
Mar 0.0959837 0.020966 0.0119427 0.013712 0.0116773 0.014862 0.0115888 0.0132696 0.0111465 0.0153043 ... 0 0 0 0 0 0 0 0 0 0
Luk 0.0752489 0.0204804 0.0184786 0.026332 0.0195052 0.0115491 0.0113951 0.0110872 0.00877733 0.013089 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 7985 columns

Note that the word “καὶ” accounts for about 3.9% of all the words in our Pauline training text. Other books use the word much more frequently. Burrows’s Delta will be able to detect this and other differences between texts. Next, we’ll normalize the data and write a function to compute Burrows’s Delta.

In [11]:
z_df = (df-df.mean())/df.std()

καὶ ἐν δὲ τοῦ εἰς τὸ τὸν τὴν αὐτοῦ ... ψευδοπροφήτης ἔζησαν ἐκρίθησαν γεγραμμένων ἀληθινοί ὑψηλόν ἔδειξέν πυλῶσιν ἐμέτρησεν ἴασπις
Rom-Co1-Gal-Plm -1.31842 0.0499249 -0.271917 1.09761 0.186104 -0.330525 0.767821 -0.369101 -0.523802 -0.810344 ... -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201
Jo1-2-3 -0.0133103 2.01642 0.596179 -1.49046 1.16666 -1.72501 -0.509269 2.76741 -0.222828 2.72213 ... -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201
Mat 0.165118 1.23269 -0.914268 1.3825 -0.434948 -0.237362 0.150659 0.547079 -0.0576906 1.12313 ... -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201
Mar 2.05818 0.523272 -1.24176 -0.19135 -1.12592 0.405696 -0.0131526 0.794328 -0.038495 1.27401 ... -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201
Luk 0.829131 0.464947 -0.710377 1.46919 0.118061 -0.309362 -0.0535985 0.353253 -0.597087 0.85891 ... -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201 -0.213201

5 rows × 7985 columns

In [12]:
def Delta(A, B, n):
    # A and B must be z_df indices, e.g., A='Mat'.
    return sum(abs(z_df.loc[A].iloc[:n]-z_df.loc[B].iloc[:n]))/n

Choosing the Number of Words

Burrows’s Delta depends on the number n of most frequent words. Choosing n is an inexact science. The best we can do is pick one that seems to work. To this end, we’ll try out many different values of n and see if any are able to distinguish between Pauline and non-Pauline texts.

In [13]:
Authentic_Paul = ['Co2', 'Phi', 'Th1']
Not_Paul = ['Mat', 'Mar', 'Luk', 'Joh', 'Act', 'Heb', 'Jam', 'Pe1', 'Pe2', 'Jo1-2-3', 'Jde', 'Rev']
Disputed_Paul = ['Eph', 'Col', 'Th2', 'Ti1', 'Ti2', 'Tit']
In [14]:
Deltas_df = pd.DataFrame(columns=Authentic_Paul+Not_Paul+Disputed_Paul)

for D in Authentic_Paul+Not_Paul+Disputed_Paul:
    for n in range(100, len(features), 100):
        Deltas_df.loc[n, D] = Delta('Rom-Co1-Gal-Plm', D, n)

Co2 Phi Th1 Mat Mar Luk Joh Act Heb Jam ... Pe2 Jo1-2-3 Jde Rev Eph Col Th2 Ti1 Ti2 Tit
100 0.590386 0.816021 0.955953 1.03529 1.09072 1.04807 1.07619 1.00563 0.793676 0.836294 ... 0.968213 1.14745 1.26673 1.26473 0.90666 0.916783 1.10959 0.876764 0.964832 1.14005
200 0.568355 0.790985 0.86594 0.98531 1.00992 0.952059 1.11247 0.879832 0.696403 0.830722 ... 0.88469 1.10087 1.14812 1.18726 0.770047 0.862214 0.981993 0.796074 0.901726 1.10793
300 0.628555 0.816867 0.848256 0.955302 1.02531 0.912605 1.13462 0.89879 0.745772 0.82532 ... 0.88617 1.07506 1.11814 1.05962 0.771954 0.862535 0.952238 0.837269 0.930475 1.10863
400 0.627058 0.815613 0.887769 0.970455 1.04463 0.900084 1.07989 0.898804 0.774535 0.801521 ... 0.884029 1.02122 1.07225 0.997758 0.768453 0.824525 0.952448 0.833202 0.910145 1.08566
500 0.620599 0.798926 0.83607 0.948579 1.04083 0.910582 1.04385 0.931416 0.766229 0.809598 ... 0.892747 0.998436 1.03329 1.01532 0.74009 0.805912 0.930678 0.79667 0.8816 1.02297

5 rows × 21 columns

In [15]:
import plotly.plotly as py
import cufflinks as cf
In [16]:
dashes = ['solid' for D in Authentic_Paul] + ['dash' for D in Not_Paul]

# Plot Delta against n for the authentic and non-Pauline works.
# (Solid lines: authentic Pauline epistles; dashed lines: non-Pauline works.)
Deltas_df[Authentic_Paul+Not_Paul].iplot(
    title='Burrows\'s Delta between New Testament Books and a Pauline Training Text',
    xTitle='Number of Most Frequent Words', yTitle='Burrows\'s Delta',
    filename='pauline_epistles_fig1', dash=dashes)

There’s a lot going on in that graph, so let’s take some time to break it down. First of all, notice that Delta values for the four gospels (Matthew, Mark, Luke, and John) along with Acts and Revelation all begin to diverge at around 1200 words. Those six books are the only New Testament books that are not epistles, so Delta may be detecting a difference in content or genre between epistles and non-epistles. Second, it appears that Delta is able to distinguish between Pauline and non-Pauline epistles when n is between about 2000 and 2600. You can zoom in on the graph to see this up close. Note that Deltas for authentic Pauline epistles are plotted with solid lines and Deltas for non-Pauline works are plotted with dashed lines.


It looks like 2300 is a good choice of n, so let’s compare Delta values for the disputed Pauline epistles to Delta values for other books using the 2300 most frequent New Testament words.

In [17]:
Deltas_df.loc[2300, Authentic_Paul].sort_values()
Co2     0.58551
Th1    0.656688
Phi    0.656939
Name: 2300, dtype: object
In [18]:
Deltas_df.loc[2300, Not_Paul].sort_values()
Pe1        0.674724
Pe2        0.678179
Jde        0.686343
Heb        0.709118
Jam        0.715614
Jo1-2-3     0.73883
Act        0.895024
Luk        0.920648
Rev        0.931996
Joh        0.934643
Mat         1.02118
Mar         1.05384
Name: 2300, dtype: object
In [19]:
Deltas_df.loc[2300, Disputed_Paul].sort_values()
Eph    0.650448
Col    0.654166
Th2    0.660886
Tit    0.665933
Ti2    0.683839
Ti1    0.686919
Name: 2300, dtype: object

Ephesians and Colossians are both more similar to our Pauline training text than 1 Thessalonians and Philippians are. This suggests that Ephesians and Colossians were likely written by Paul. 2 Thessalonians and Titus are also more similar to the training text than any non-Pauline work, so Pauline authorship appears plausible for those two books as well. 1 and 2 Timothy, on the other hand, are less similar to the training text than 1 and 2 Peter (and, in the case of 1 Timothy, even less similar than Jude). This is evidence against Pauline authorship, although it’s far from conclusive.

Further Discussion

There are some drawbacks to the methods employed here. First of all, I do not know Greek. Naturally, this limits my ability to analyse Greek text. One issue that I don’t know how to handle is the presence of different diacritics on words that appear to be otherwise identical. For example, the name “Abraham” appears in Matthew 1:1 (“ἀβραάμ”) and Matthew 1:2 (“ἀβραὰμ”), but with a different diacritic above the final α in each case. Since these are distinct strings in Python, my analysis considers them to be distinct words. I have no idea if these kinds of differences are helpful or unhelpful for authorship determination.
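One possible workaround, offered only as a sketch since I can’t judge whether merging these forms is linguistically appropriate for polytonic Greek, would be to strip combining marks with Python’s unicodedata module before counting words:

```python
import unicodedata

def strip_diacritics(word):
    # Decompose each character (NFD), drop combining marks (category 'Mn'),
    # and recompose. Note this removes breathing marks as well as accents,
    # which may conflate words that Greek readers would distinguish.
    decomposed = unicodedata.normalize('NFD', word)
    stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    return unicodedata.normalize('NFC', stripped)

# The two spellings of "Abraham" from Matthew 1:1-2 now compare equal.
print(strip_diacritics('ἀβραάμ') == strip_diacritics('ἀβραὰμ'))  # True
```

Applying this before building the word lists would merge such variants into a single feature, for better or worse.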

Other complications arise from the messiness of New Testament authorship. In an ideal world, I would have grouped New Testament books by author (as I did with 1, 2, and 3 John). This would have prevented a single author from exerting undue influence on word frequency averages, and it would have avoided the problem of comparing any one author to Paul more than once. However, it is impossible to make perfect delineations between New Testament authors for at least two reasons. First, there is debate over authorship of 1 and 2 Peter and over authorship of John’s Gospel, John’s Epistles, and Revelation. Second, some New Testament books blur authorship lines by reusing material from other books. Matthew and Luke, for example, both copied content from Mark, and Jude contains material that is very close to parts of 2 Peter (hence the similar Deltas between those two books and our Pauline training text).

Despite these difficulties, my naive approach was sensitive enough to distinguish between known categories (the undisputed Pauline epistles and the non-Pauline epistles). Hopefully this demonstrates the power of stylometric analysis.
