Author

Xuan Li

Defense Date

3-26-2013

Graduation Date

2013

Availability

Immediate Access

Submission Type

thesis

Degree Name

MS

Department

Computational Mathematics

School

McAnulty College and Graduate School of Liberal Arts

Committee Chair

Patrick Juola

Committee Member

Abhay Gaur

Keywords

Authorship attribution, Email classification, Enron, EVL Lab, JGAAP, RANDOMFOREST

Abstract

In this paper I present authorship attribution on an email corpus. The source I used was the Enron Email Corpus (Cohen, 2009). By reformatting these emails, four test sets were categorized based on the length of each email: Tiny (≤ 99 characters), Small (100 to 500 characters), Medium (501 to 999 characters), and Large (≥ 1000 characters). The Java Graphical Authorship Attribution Program (JGAAP software) from our Evaluating Variations in Language Laboratory (EVL Lab) was used to perform these tests. Three analysis methods: WEKA RandomForest, WEKA SMO, and Centroid with Cosine Distance were used. Results showed that the Large test set gave the best authorship classification, followed by the Medium, then the Small and the Tiny test sets. WEKA SMO gave better authorship classification than WEKA RandomForest.

Format

PDF

Language

English

Share

COinS