Authors
Ann Taylor, Mitchell Marcus, Beatrice Santorini
Publication date
2003
Source
Treebanks: Building and using parsed corpora
Pages
5-22
Publisher
Springer Netherlands
Description
The Penn Treebank, in its eight years of operation (1989–1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production. All available Penn Treebank materials are distributed by the Linguistic Data Consortium (http://www.ldc.upenn.edu).