© 2019 IEEE. Text mining is one of the core tasks of machine learning (ML). Authorship identification (AI) is a standard research subject in text mining and natural language processing (NLP) that has evolved considerably in recent years. The task is to identify the actual author of an anonymous text on the basis of a set of writing samples. Standard text classification often relies on many handcrafted features, such as dictionaries, knowledge bases, and various stylometric characteristics, which often leads to very high dimensionality. Unlike traditional approaches, this paper proposes an authorship identification approach based on automatic feature engineering using word2vec word embeddings, taking into account each author's writing style. The system includes two learning stages. The first stage generates the semantic representation of each author by using word2vec to learn and extract the most relevant characteristics of the raw document. The second stage applies a multilayer perceptron (MLP) classifier, trained with the backpropagation learning algorithm, to learn the classification rules. Experiments show that the MLP classifier with the word2vec model achieves an accuracy of 95.83% on an English corpus, suggesting that the word2vec word embedding model can improve identification accuracy compared to classical models such as n-gram frequencies and bag of words.
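The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the hand-set 2-d vectors in `TOY_VECTORS` stand in for embeddings a trained word2vec model would supply, averaging word vectors is one common way to build a document representation, and the names `document_vector`, `TinyMLP`, and `run_demo` are hypothetical.

```python
import math
import random

# Hypothetical 2-d vectors standing in for embeddings that a trained
# word2vec model would supply; the words and values are made up.
TOY_VECTORS = {
    "whilst": [0.9, 0.1], "hence": [0.8, 0.2], "thus": [1.0, 0.0],
    "gonna": [0.1, 0.9], "kinda": [0.2, 0.8], "yeah": [0.0, 1.0],
}


def document_vector(tokens, vectors=TOY_VECTORS):
    """Stage 1 (assumed form): average the vectors of the known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has an embedding")
    return [sum(col) / len(known) for col in zip(*known)]


class TinyMLP:
    """Stage 2: one tanh hidden layer and a sigmoid output unit."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, 1.0 / (1.0 + math.exp(-logit))

    def loss(self, xs, ys):
        """Mean binary cross-entropy over the given examples."""
        total = 0.0
        for x, y in zip(xs, ys):
            p = self.forward(x)[1]
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        return total / len(xs)

    def train_epoch(self, xs, ys, lr=0.5):
        """One pass of per-example gradient descent via backpropagation."""
        for x, y in zip(xs, ys):
            hidden, p = self.forward(x)
            d_logit = p - y  # derivative of cross-entropy w.r.t. the logit
            for j, h in enumerate(hidden):
                # Propagate through the output weight and the tanh unit
                # before that output weight is updated.
                d_pre = d_logit * self.w2[j] * (1.0 - h * h)
                self.w2[j] -= lr * d_logit * h
                self.b1[j] -= lr * d_pre
                for i, xi in enumerate(x):
                    self.w1[j][i] -= lr * d_pre * xi
            self.b2 -= lr * d_logit

    def predict(self, x):
        return 1 if self.forward(x)[1] >= 0.5 else 0


def run_demo(epochs=300):
    docs = [["whilst", "hence"], ["thus", "hence", "whilst"],
            ["gonna", "yeah"], ["kinda", "gonna", "yeah"]]
    labels = [0, 0, 1, 1]  # 0 = hypothetical author A, 1 = author B
    xs = [document_vector(d) for d in docs]
    model = TinyMLP(n_in=2, n_hidden=3)
    loss_before = model.loss(xs, labels)
    for _ in range(epochs):
        model.train_epoch(xs, labels)
    return model, xs, labels, loss_before, model.loss(xs, labels)
```

Calling `run_demo()` trains the toy classifier on four made-up documents from two hypothetical authors. In practice the embeddings would be learned from the corpus and the classifier would have one output per candidate author; libraries such as gensim and scikit-learn provide word2vec and MLP implementations for that purpose.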
- Authorship Identification
- MLP classifier
- Natural Language Processing
- Text Mining
- Word2Vec
Available at: http://works.bepress.com/monther-aldwairi/33/