"A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research" by Mohammadreza Rezvan

Selected Works of Valerie Shalin

Article

A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research

Proceeding WebSci '18 Proceedings of the 10th ACM Conference on Web Science

Mohammadreza Rezvan, Wright State University - Main Campus
Saeedeh Shekarpour
Lakshika Balasuriya, Wright State University - Main Campus
Valerie L Shalin, Wright State University - Main Campus
Amit P. Sheth, Wright State University - Main Campus

Document Type

Conference Proceeding

Publication Date

1-1-2018

Abstract

A quality annotated corpus is essential to research. Despite the recent focus of the Web science community on cyberbullying research, the community lacks standard benchmarks. This paper provides both a quality annotated corpus and an offensive words lexicon capturing different types of harassment content: (i) sexual, (ii) racial, (iii) appearance-related, (iv) intellectual, and (v) political. We first crawled data from Twitter using this content-tailored offensive lexicon. As mere presence of an offensive word is not a reliable indicator of harassment, human judges annotated tweets for the presence of harassment. Our corpus consists of 25,000 annotated tweets for the five types of harassment content and is available on the Git repository.

DOI

10.1145/3201064.3201103

Citation Information

Mohammadreza Rezvan, Saeedeh Shekarpour, Lakshika Balasuriya, Valerie L Shalin, et al. "A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research" Proceeding WebSci '18 Proceedings of the 10th ACM Conference on Web Science (2018) pp. 33-36 ISBN: 978-1-4503-5563-6
Available at: http://works.bepress.com/valerie_shalin/68/