The Penn treebank: an overview

Authors

Ann Taylor, Mitchell Marcus, Beatrice Santorini

Publication date

2003

Source

Treebanks: Building and using parsed corpora

Pages

5-22

Publisher

Springer Netherlands

Description

The Penn Treebank, in its eight years of operation (1989–1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production. All available Penn Treebank materials are distributed by the Linguistic Data Consortium (http://www.ldc.upenn.edu).

Total citations

Cited by 625

Citations per year: 2002: 3; 2003: 7; 2004: 5; 2005: 4; 2006: 5; 2007: 8; 2008: 6; 2009: 8; 2010: 8; 2011: 4; 2012: 8; 2013: 10; 2014: 17; 2015: 23; 2016: 34; 2017: 47; 2018: 52; 2019: 58; 2020: 55; 2021: 72; 2022: 69; 2023: 66; 2024: 42; 2025: 8
