CORPUS SOFTWARE

As a corpus linguist, you will find that the effectiveness of your analysis often depends on the capabilities of the software you use. We have put together a list of some of the most widely used corpus software and highlighted the tools each one offers.

AntConc

AntConc is a freeware corpus analysis toolkit for concordancing and text analysis that was designed by Professor Laurence Anthony.  

AntConc is one of several specialist linguistics tools designed by Anthony. Further information about AntConc, as well as Anthony’s other tools, can be found on his personal website.

Wmatrix

Wmatrix is a software tool for corpus analysis and comparison that was initially developed by Dr Paul Rayson.

Wmatrix provides a web interface to the English USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains.
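To make the two standard methods mentioned above more concrete, here is a minimal Python sketch of a frequency list and a keyword-in-context (KWIC) concordance. It is only an illustration and does not reflect how Wmatrix or any other tool listed here is implemented; the function names (`tokenize`, `frequency_list`, `concordance`) and the very simple tokenisation rule are our own assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, node, window=3):
    """Return KWIC lines: up to `window` tokens of context on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


sample = "The cat sat on the mat. The dog sat on the cat."
tokens = tokenize(sample)

# The three most frequent words in the sample
print(frequency_list(tokens)[:3])

# One aligned line per occurrence of the node word "sat"
for left, node, right in concordance(tokens, "sat"):
    print(f"{left:>20}  {node}  {right}")
```

Dedicated corpus software adds sorting, filtering and annotation-aware searching on top of this basic idea.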

WordSmith Tools

Initially developed by Dr Mike Scott, WordSmith Tools is a suite of modules for searching for patterns in language.

WordSmith’s three modules (Concord, KeyWords, WordList) allow for the analysis of concordances, keyness and frequency. Additionally, some of WordSmith’s more advanced features let you carry out functions that are not available in most other corpus software (e.g. extracting data based on XML tags).
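Keyness compares how often a word occurs in your corpus with how often it occurs in a reference corpus. The sketch below uses the two-term log-likelihood statistic, which is one commonly used keyness measure; it is an illustration under our own assumptions, not a description of WordSmith’s internal calculation, and the function names are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for a word seen `a` times in a study corpus of `c` tokens
    and `b` times in a reference corpus of `d` tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, reference_tokens):
    """Rank words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        # Keep only words overused in the study corpus relative to the reference
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


study = "tea tea tea milk".split()
reference = "milk milk milk tea".split()
print(keywords(study, reference))
```

In this toy example only "tea" is returned, because "milk" is relatively less frequent in the study corpus than in the reference corpus. Real analyses use far larger corpora and usually apply a significance threshold to the score.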

Sketch Engine

Sketch Engine was originally developed by Dr Adam Kilgarriff and Dr Pavel Rychly. Its name is derived from a feature within the software that produces word sketches, which summarise a word’s grammatical and collocational behaviour.

Sketch Engine comes with a number of built-in corpora and also allows you to upload your own corpus. You can use it to analyse your corpus by examining frequency lists, keywords and n-grams, as well as through a number of other methods of corpus analysis.
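An n-gram is simply a run of n consecutive words. As a rough illustration of what an n-gram list contains (and not of how Sketch Engine computes it), the following Python sketch counts two-word sequences in a short sample; the `ngrams` helper is a name we have chosen for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be".split()

# Count each two-word sequence (bigram) in the sample
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

Here the bigram "to be" occurs twice and every other bigram occurs once.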

Online corpora

If you do not plan to create your own corpus and are more interested in investigating language use within an existing corpus, several online corpora are currently available for you to use.

Corpus.byu.edu

The BYU corpus site contains a number of corpora that were created by Professor Mark Davies. According to the site, these are probably the most widely used corpora online, with more than 130,000 users each month.

The corpora are drawn from various sources, such as Wikipedia, proceedings of the UK Houses of Parliament and American soap operas. The News on the Web (NOW) corpus is the largest on the site and consists of over 5.5 billion words.

CQPweb

CQPweb is a web-based corpus analysis system that is maintained by Dr Andrew Hardie and provides a user-friendly interface to the Corpus Workbench (CWB) system.

A large number of corpora are available on the CQPweb system, including the British National Corpus (BNC) and the recently compiled Spoken BNC2014. Additionally, users without a licence now have restricted access to corpora that were previously off limits.

CLiC (Corpus Linguistics in Context)

CLiC (Corpus Linguistics in Context) has been specifically designed to support the study of literary texts. In addition to standard corpus tool functionalities, CLiC allows the user to restrict searches to text within or outside of quotation marks. Concordance searches can also be refined through ‘KWIC grouping’ of results. CLiC currently contains over 130 books categorised into four corpora: the corpus of Dickens’s Novels (DNov), a 19th Century Reference Corpus (19C), a corpus of 19th Century Children’s Literature (ChiLit) and Additional Requested Texts (ArTs).
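To give a sense of what restricting a search to text inside or outside quotation marks involves, here is a simplified Python sketch. It is not CLiC’s implementation: it assumes that speech is marked with straight double quotation marks, and the helper names `split_by_quotes` and `count_word` are invented for this example.

```python
import re

# Assumption: quoted speech is enclosed in straight double quotation marks
QUOTE = re.compile(r'"[^"]*"')


def split_by_quotes(text):
    """Return (inside, outside): the text within quotation marks and the rest."""
    inside = [match.group(0)[1:-1] for match in QUOTE.finditer(text)]
    outside = QUOTE.sub(" ", text)
    return " ".join(inside), outside


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


text = 'He said, "I will go," and then he did go. "Go now," she replied.'
inside, outside = split_by_quotes(text)

print(count_word(inside, "go"))   # occurrences within quotation marks
print(count_word(outside, "go"))  # occurrences outside quotation marks
```

Literary texts often use single quotation marks, which are harder to separate from apostrophes, so a real tool needs considerably more careful handling than this.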

CLiC was created as part of an AHRC-funded project (grant reference AH/P504634/1) led by Professor Michaela Mahlberg. It is hosted at the Centre for Corpus Research (CCR), University of Birmingham.

Don't panic! More corpus software to come.

We are well aware that the software tools and online corpora listed above are not the only ones available.
However, due to time constraints, we are not able to feature all available software and corpora just yet.
If you know of any tools or websites that we may not have heard about, we would be happy to hear from you.