McAnulty College and Graduate School of Liberal Arts
Authorship attribution, Email classification, Enron, EVL Lab, JGAAP, RandomForest
In this paper I present authorship attribution experiments on an email corpus, the Enron Email Corpus (Cohen, 2009). After reformatting, the emails were divided into four test sets by length: Tiny (≤ 99 characters), Small (100 to 500 characters), Medium (501 to 999 characters), and Large (≥ 1000 characters). The tests were run with the Java Graphical Authorship Attribution Program (JGAAP) from our Evaluating Variations in Language Laboratory (EVL Lab). Three analysis methods were used: WEKA RandomForest, WEKA SMO, and Centroid with Cosine Distance. The Large test set gave the best authorship classification, followed by the Medium, Small, and Tiny test sets, in that order. WEKA SMO gave better authorship classification than WEKA RandomForest.
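The length thresholds in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's own preprocessing code: the function names `length_bucket` and `group_by_length` are hypothetical, and the sketch assumes that length means the character count of the email text as Python's `len` reports it. Whether the thesis counted whitespace, headers, or quoted text is not stated here.

```python
from typing import Dict, Iterable, List


def length_bucket(text: str) -> str:
    """Return the test-set name for one email, using the abstract's thresholds.

    Assumption: length is the number of characters reported by len().
    """
    n = len(text)
    if n <= 99:
        return "Tiny"    # at most 99 characters
    if n <= 500:
        return "Small"   # 100 to 500 characters
    if n <= 999:
        return "Medium"  # 501 to 999 characters
    return "Large"       # 1000 characters or more


def group_by_length(emails: Iterable[str]) -> Dict[str, List[str]]:
    """Split an iterable of email texts into the four test sets (hypothetical helper)."""
    buckets: Dict[str, List[str]] = {"Tiny": [], "Small": [], "Medium": [], "Large": []}
    for email in emails:
        buckets[length_bucket(email)].append(email)
    return buckets
```

Calling `group_by_length` on a list of email strings returns a dictionary with one list per test set, which could then be written out as separate document sets for an attribution tool.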
Li, X. (2013). Authorship Attribution on the Enron Email Corpus (Master's thesis, Duquesne University). Retrieved from https://dsc.duq.edu/etd/823