Article
A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research
Proceeding WebSci '18 Proceedings of the 10th ACM Conference on Web Science
  • Mohammadreza Rezvan, Wright State University - Main Campus
  • Saeedeh Shekarpour
  • Lakshika Balasuriya, Wright State University - Main Campus
  • Valerie L Shalin, Wright State University - Main Campus
  • Amit P. Sheth, Wright State University - Main Campus
Document Type
Conference Proceeding
Publication Date
1-1-2018
Abstract

A quality annotated corpus is essential to research. Despite the recent focus of the Web science community on cyberbullying research, the community lacks standard benchmarks. This paper provides both a quality annotated corpus and an offensive words lexicon capturing different types of harassment content: (i) sexual, (ii) racial, (iii) appearance-related, (iv) intellectual, and (v) political. We first crawled data from Twitter using this content-tailored offensive lexicon. As the mere presence of an offensive word is not a reliable indicator of harassment, human judges annotated tweets for the presence of harassment. Our corpus consists of 25,000 annotated tweets for the five types of harassment content and is available on the Git repository.

DOI
10.1145/3201064.3201103
Citation Information
Mohammadreza Rezvan, Saeedeh Shekarpour, Lakshika Balasuriya, Valerie L Shalin, et al. "A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research" Proceeding WebSci '18 Proceedings of the 10th ACM Conference on Web Science (2018) pp. 33-36 ISBN: 978-1-4503-5563-6
Available at: http://works.bepress.com/valerie_shalin/68/