
BYU TV and movie corpora

Published: 2019-02-16

We are pleased to announce two new corpora from the BYU suite of corpora:

 

-- The TV Corpus : 325 million words in 75,000 very informal TV episodes (e.g. comedies and dramas) from 1950-2018

-- The Movie Corpus: 200 million words in 25,000 movies from 1930-2018

As psycholinguistic and corpus-based research by Brysbaert and others has shown (e.g. 1, 2, 3), TV and movie subtitles often agree better with native speaker intuitions about common, informal English than actual spoken corpora do. And while there are other corpora of subtitles, we believe that the BYU corpora allow a much wider range of searches of these subtitles than is available elsewhere. As with the other BYU corpora, users can search by word, phrase, lemma, PoS, synonym, and customized wordlists. They can see the frequency of matching strings, the frequency in different sections of the corpora, collocates, and re-sortable concordance lines.

 

The TV and Movie corpora also allow users to examine frequency and usage over time (1930-2018 for movies, 1950-2018 for TV shows), as well as to compare different dialects of English (for example, British vs. American English).

 

Users can also quickly and easily create and search their own "Virtual Corpora" and generate keyword lists from them. Examples include (for TV) all episodes of Dr Who, Star Trek Next Generation, The Office, or The Good Place, or (for movies) all James Bond movies, or all American sci-fi movies from 1990 to the present that have a certain movie rating or IMDB score and a given keyword in the IMDB plot summary.

 

Finally, all 75,000 episodes from TV shows and all 25,000 movies are linked directly to their IMDB entry and OpenSubtitles page. As a result, if you find some interesting data in the corpus and want to see the original subtitles file or find out more about the TV show or movie (actors, rating, extended plot summary, etc), it's just one click away.

 

In summary, we believe that the new TV Corpus and Movie Corpus are the largest, most searchable corpora of very informal English, and we hope that they are of value to you in your research and teaching.

 

Brief overview (PDF) of the TV and Movie corpora

 

P.S. We're glad to announce that "one click" comparisons in the BYU corpora are back. They allow you to move seamlessly between the different BYU corpora and compare results across them (e.g. TV, Movies, Soap Operas, iWeb, COCA, COHA, GloWbE, BYU-BNC, NOW, Wikipedia, and others).

 

Mark Davies

BYU Corpora