Written by Maite Taboada, Simon Fraser University

The SFU Opinion and Comments Corpus (SOCC) is a corpus for the analysis of online news comments. Our corpus contains comments and the articles from which the comments originated. The articles are all opinion articles, not hard news articles. The corpus is larger than any other currently available comments corpus, and has been collected with attention to preserving reply structures and other metadata. In addition to the raw corpus, we also present annotations for four different phenomena: constructiveness, toxicity, negation and its scope, and Appraisal.

The articles include all the opinion pieces published in the Canadian newspaper The Globe and Mail in the five-year period between 2012 and 2016, a total of 10,339 articles and 663,173 comments. The corpus is part of a project that investigates the linguistic characteristics of online comments. It can be used to study, among other aspects:

  • the connections between articles and comments
  • the connections of comments to each other
  • the types of topics discussed in comments
  • the nice (constructive) or mean (toxic) ways in which commenters respond to each other
  • how language is used to convey very specific types of evaluation
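
As an illustration of the second point, the connections of comments to each other, the following is a minimal sketch of how reply structure might be recovered from comment records. The field names (`comment_id`, `parent_id`, `article_id`, `text`) and the sample records are hypothetical and are not taken from the corpus; the actual file layout is documented at the GitHub link further down.

```python
from collections import defaultdict

# Hypothetical records: these field names and values are illustrative
# assumptions, not the corpus's actual column names or data.
comments = [
    {"comment_id": "c1", "parent_id": None, "article_id": "a1", "text": "Top-level comment"},
    {"comment_id": "c2", "parent_id": "c1", "article_id": "a1", "text": "Reply to c1"},
    {"comment_id": "c3", "parent_id": "c2", "article_id": "a1", "text": "Reply to c2"},
    {"comment_id": "c4", "parent_id": None, "article_id": "a1", "text": "Another top-level comment"},
]


def build_reply_index(records):
    """Map each parent comment id to the ids of its direct replies."""
    children = defaultdict(list)
    for record in records:
        children[record["parent_id"]].append(record["comment_id"])
    return children


def thread_depths(records):
    """Return the reply depth of every comment (0 for top-level comments)."""
    children = build_reply_index(records)
    depths = {}
    # Iterative depth-first walk starting from the top-level comments,
    # which are the ones whose parent id is None.
    stack = [(cid, 0) for cid in children[None]]
    while stack:
        cid, depth = stack.pop()
        depths[cid] = depth
        stack.extend((child, depth + 1) for child in children.get(cid, []))
    return depths


depths = thread_depths(comments)
```

With a depth per comment, one could, for example, compare the language of top-level comments with that of deeply nested replies.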

Our current focus is the study of constructiveness and evaluation in the comments. To that end, we have annotated a subset of the large corpus (1,043 comments) with three layers of annotations: constructiveness, negation, and Appraisal, following existing theories (Martin & White 2005). While our focus is comments posted in response to opinion news articles, the phenomena in this corpus are likely to be present in many commenting platforms: other news comments, comments and replies in fora such as Reddit, feedback on blogs, or YouTube comments. A number of research questions related to journalism, online discourse, the dialogic structure of online comments, and evaluative language can be explored from this resource.

Our preliminary analysis of the annotated data shows that constructiveness is an interplay between a variety of other phenomena of interest in computational linguistics, such as argumentation, relevance of the comment to the article, and the tone of the comment. We believe that we may obtain better-quality annotations if we ask specific questions leading to constructiveness (e.g., whether the comment is relevant to the article or whether the claims made in the article are supported by evidence) instead of asking a single binary question. We are pursuing this research direction in our current work.

With respect to the negation annotation, we developed extensive and detailed guidelines for the annotation of negative keywords, scope and focus. We used the guidelines to annotate the chosen subset of the comments corpus, producing a completely annotated corpus for negation, including its scope and focus. This corpus has been curated to provide the most accurate annotations according to the guidelines. We have also achieved reasonable results for agreement between annotators on these annotations.

With the Appraisal annotations, we have shown that it is possible to achieve favourable rates of agreement using the system, though agreement requires a high degree of familiarity with the guidelines and can still be hindered by ambiguity in comments. Aside from the guidelines, we have provided a novel, extensively annotated corpus of online comments that will be used to investigate the relationship between negation and Appraisal, yet which has the potential for other avenues of research as well.

The annotations were carefully curated, and interannotator agreement suggests that they are reliable and replicable. In an article under review (see references below), we have provided extensive detail about the corpus, because we think data collection and curation are important processes that should be well documented and accountable. Our corpus is freely available for non-commercial use.
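
For readers unfamiliar with interannotator agreement, one common chance-corrected measure for categorical labels is Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance. The sketch below is only an illustration: this description does not state which agreement measure was used for the corpus, and the labels are invented toy data, not corpus annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items on which the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance that both annotators pick the same label,
    # given each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented toy labels (e.g. "is this comment constructive?"), not corpus data.
annotator_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
annotator_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.50"
```

In this toy case the annotators agree on 6 of 8 items (p_o = 0.75) and chance agreement is p_e = 0.5, which gives kappa = 0.5.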

Full description of the corpus and structure: https://github.com/sfu-discourse-lab/SOCC

Direct link to the data: https://researchdata.sfu.ca/islandora/object/islandora:9109

This work was supported by the Social Sciences and Humanities Research Council of Canada (Insight Grant 435-2014-0171). We thank all the members of the Discourse Processing Lab at Simon Fraser University for their help in testing annotation questions, and especially Erin Jastrzebski and Sarah Mulhall for annotating the data.


Varada Kolhatkar (vkolhatk@sfu.ca)
Maite Taboada (mtaboada@sfu.ca)


Kolhatkar, V., H. Wu, L. Cavasso, E. Francis, K. Shukla and M. Taboada (under review) The SFU Opinion and Comments Corpus: A corpus for the analysis of online news comments. Journal paper under review.

Kolhatkar, V. and M. Taboada (2017) Using New York Times Picks to identify constructive comments. Proceedings of the Workshop Natural Language Processing Meets Journalism, Conference on Empirical Methods in Natural Language Processing. Copenhagen. September 2017.

Kolhatkar, V. and M. Taboada (2017) Constructive language in news comments. Proceedings of the 1st Abusive Language Online Workshop, 55th Annual Meeting of the Association for Computational Linguistics. Vancouver. August 2017, pp. 11-17.

Martin, J. R. and P. R. R. White (2005) The Language of Evaluation. New York: Palgrave.