Distinguishing 19th Century British Novels by Women Authors Using Natural Language Processing


  • Phoebe M. Xu, North Hollywood High School Highly Gifted Magnet, North Hollywood, CA


Authorship, Jane Austen, Mary Shelley, Mary Brunton, Natural Language Processing, BERT model, Binary Logistic Regression


This paper used the BERT model and binary logistic regression to distinguish novels written by 19th-century British women, exploring how well AI can tell the authors apart and identify the keywords characteristic of each book. Two books each by Jane Austen, Mary Shelley, and Mary Brunton were divided into uniformly sized sections to train and test the BERT model. The model was trained on the author-labeled training set and then assigned author labels to the sections of a separate test set, achieving 84.44% accuracy. A z-test yielded a z-score of 35.63 and a p-value approaching 0. Binary logistic regression was then used to pinpoint the most distinctive words in each book, clarifying how the books differ.
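As an illustration of how a z-score of this kind can be computed, the following is a minimal sketch of a one-proportion z-test on classification accuracy. The abstract does not state the null hypothesis or the test-set size, so both are assumptions here: the null is taken to be chance-level accuracy of 1/3 (three authors), and the test set is taken to be a hypothetical 1080 sections. The function name `one_proportion_z_test` is likewise illustrative and is not taken from the paper.

```python
import math


def one_proportion_z_test(accuracy, n, p0):
    """Return (z, one-sided p-value) for H0: true accuracy equals p0."""
    # Standard error of the sample proportion under the null hypothesis.
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (accuracy - p0) / se
    # Upper-tail probability of the standard normal distribution,
    # computed with the complementary error function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Assumptions, not stated in the abstract: chance level 1/3 for three
# authors, and a hypothetical test set of 1080 sections.
z, p = one_proportion_z_test(accuracy=0.8444, n=1080, p0=1 / 3)
print(f"z = {z:.2f}, p = {p:.3g}")
```

Under these assumed values the z-score comes out close to the reported 35.63, and the p-value is far too small to distinguish from 0 in practice. A different null hypothesis or test-set size would give a different z-score.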