Defense Date
3-26-2013
Graduation Date
2013
Availability
Immediate Access
Submission Type
thesis
Degree Name
MS
Department
Computational Mathematics
School
McAnulty College and Graduate School of Liberal Arts
Committee Chair
Patrick Juola
Committee Member
Abhay Gaur
Keywords
Authorship attribution, Email classification, Enron, EVL Lab, JGAAP, RANDOMFOREST
Abstract
In this thesis I present an authorship attribution study on an email corpus, the Enron Email Corpus (Cohen, 2009). After reformatting, the emails were divided into four test sets by length: Tiny (≤ 99 characters), Small (100 to 500 characters), Medium (501 to 999 characters), and Large (≥ 1000 characters). The tests were run with the Java Graphical Authorship Attribution Program (JGAAP) from our Evaluating Variations in Language Laboratory (EVL Lab), using three analysis methods: WEKA RandomForest, WEKA SMO, and Centroid with Cosine Distance. The Large test set gave the best authorship classification, followed by the Medium, Small, and Tiny test sets in that order. WEKA SMO gave better authorship classification than WEKA RandomForest.
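The length thresholds in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual preprocessing code: the function names `size_category` and `group_by_size` are hypothetical, and it assumes that length means the raw character count of the email body, since the abstract does not say whether headers or whitespace are counted.

```python
from typing import Dict, Iterable, List


def size_category(email_body: str) -> str:
    """Return the test-set name for an email, based on its character count.

    Thresholds follow the abstract: Tiny (<= 99), Small (100 to 500),
    Medium (501 to 999), Large (>= 1000).
    Assumption: length is the raw character count of the body text.
    """
    n = len(email_body)
    if n <= 99:
        return "Tiny"
    if n <= 500:
        return "Small"
    if n <= 999:
        return "Medium"
    return "Large"


def group_by_size(emails: Iterable[str]) -> Dict[str, List[str]]:
    """Split a collection of email bodies into the four test sets."""
    groups: Dict[str, List[str]] = {
        "Tiny": [], "Small": [], "Medium": [], "Large": [],
    }
    for body in emails:
        groups[size_category(body)].append(body)
    return groups
```

For example, `group_by_size(["ok", "x" * 750])` would place the first message in the Tiny set and the second in the Medium set.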
Language
English
Recommended Citation
Li, X. (2013). Authorship Attribution on the Enron Email Corpus (Master's thesis, Duquesne University). Retrieved from https://dsc.duq.edu/etd/823