CS506/606: Txt Nrmlztn

Richard Sproat

Steven Bedrick

TA: Emily Tucker Prud'hommeaux

Fall 2011

TR 9-10:30, WCC403

Office Hours: By Appointment

Synopsis


Text normalization is a prerequisite for a variety of downstream speech and language processing tasks. Text normalization can include classification of text entities (dates, times, numbers, currency amounts, etc.), as well as normalization of those entities into words ("$2.50" → "two dollars and fifty cents"). Different downstream tasks require different levels and types of normalizations, and approaches that work well on one textual domain may not work on another.
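To make the two steps above concrete (classifying a text entity, then normalizing it into words), here is a minimal Python sketch. It is only an illustration: the regular expressions, the function names, and the restriction to amounts under $1,000 are assumptions made for this example, not part of the course materials, and the class project itself builds its modules as Thrax grammars rather than hand-written Python.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical, deliberately narrow patterns; real text needs far more cases.
CURRENCY = re.compile(r"\$(\d{1,3})(?:\.(\d{2}))?")
TIME = re.compile(r"\d{1,2}:\d{2}")
DATE = re.compile(r"\d{1,2}/\d{1,2}(?:/\d{2,4})?")
NUMBER = re.compile(r"\d+")


def number_to_words(n):
    """Verbalize an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def classify(token):
    """Step 1: assign a coarse entity class to a single token."""
    patterns = (("currency", CURRENCY), ("time", TIME),
                ("date", DATE), ("number", NUMBER))
    for label, pattern in patterns:
        if pattern.fullmatch(token):
            return label
    return "word"


def verbalize_currency(token):
    """Step 2: expand a dollar amount such as "$2.50" into words."""
    match = CURRENCY.fullmatch(token)
    if match is None:
        raise ValueError("not a currency amount this sketch understands")
    dollars = int(match.group(1))
    cents = int(match.group(2) or 0)
    parts = []
    # Say the dollar part unless the amount is cents only (e.g. "$0.99").
    if dollars or not cents:
        unit = " dollar" if dollars == 1 else " dollars"
        parts.append(number_to_words(dollars) + unit)
    if cents:
        unit = " cent" if cents == 1 else " cents"
        parts.append(number_to_words(cents) + unit)
    return " and ".join(parts)


if __name__ == "__main__":
    print(classify("$2.50"))            # currency
    print(verbalize_currency("$2.50"))  # two dollars and fifty cents
```

Even this toy version shows why the task depends on the domain: the same token "11/29" is a date on a course schedule but could be a fraction or a score elsewhere, so the classifier has to be chosen to fit the text it will see.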

This course will cover the literature on text normalization in a seminar style with participants presenting sets of papers. We will also have a class project, which will be to replicate portions of the 1999 Hopkins Workshop on Text Normalization (Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. "Normalization of non-standard words." Computer Speech and Language, 15(3), 287-333, 2001), using tools from OpenGrm, and working with raw data from social media such as Twitter. Text derived from social media sources tends to be extremely noisy and irregular, and therefore requires extensive normalization. At the end of the course, we will have produced a suite of tools that will be made publicly available, and students taking the course will have gained hands-on experience acquiring and working with data from "real world" sources.

Structure of the Course

This course will combine a few lectures, discussion of papers from the literature, and a lab component in which the class, working as a team, will build a set of text normalization modules using the Thrax open-source finite-state grammar toolkit. Most class sessions will pair discussion of the readings with discussion of progress on the project.

See here for instructions on how to install OpenFst and Thrax.

We also have a class wiki.


Grades

Your grade will depend upon the following components:


Syllabus

Week 1

Week 2

Detailed plans for development of the system.

Week 3

Week 4

Week 5

Week 6

Week 7

Week 8

Week 9

Week 10

11/29 and 12/1

Final presentation of work done. Tests and system integration.

© 2011, Richard Sproat, Steven Bedrick