Text normalization is a prerequisite for a variety of downstream
speech and language processing tasks. Text normalization can include
classification of text entities (dates, times, numbers, currency
amounts, etc.), as well as normalization of those entities into words
("$2.50" → "two dollars and fifty cents"). Different downstream
tasks require different levels and types of normalizations, and
approaches that work well on one textual domain may not work on
another.
This course will cover the literature on text normalization in a
seminar style with participants presenting sets of papers. We will
also have a class project, which will be to replicate portions of the
1999 Hopkins Workshop on Text Normalization (Richard Sproat, Alan
Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher
Richards. "Normalization of non-standard words." Computer Speech
and Language, 15(3), 287-333, 2001), using tools from
OpenGrm, and working
with raw data from social media such as Twitter. Text derived from
social media sources tends to be extremely noisy and irregular, and
therefore requires extensive normalization. At the end of the course,
we will have produced a suite of tools that will be made publicly
available, and students taking the course will have gained hands-on
experience acquiring and working with data from "real world" sources.
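As a rough illustration of the two steps described above (classifying an entity, then expanding it into words), here is a minimal Python sketch for the currency example. It is not taken from Sproat et al. or from OpenGrm. The names number_to_words, classify, verbalize_currency, and normalize are hypothetical, and the sketch assumes whitespace tokenization and US dollar amounts below $1000.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Assumption: a currency token is "$" plus 1-3 digits and optional ".dd" cents.
CURRENCY = re.compile(r"\$(\d{1,3})(?:\.(\d{2}))?")


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def classify(token: str) -> str:
    """Step 1: label a token as 'currency' or 'plain'."""
    return "currency" if CURRENCY.fullmatch(token) else "plain"


def verbalize_currency(token: str) -> str:
    """Step 2: expand a currency token into words."""
    match = CURRENCY.fullmatch(token)
    if match is None:
        raise ValueError(f"not a currency token: {token!r}")
    dollars = int(match.group(1))
    cents = int(match.group(2) or 0)
    parts = []
    # Say the dollar part unless the amount is cents only (e.g. "$0.05").
    if dollars or not cents:
        unit = "dollar" if dollars == 1 else "dollars"
        parts.append(f"{number_to_words(dollars)} {unit}")
    if cents:
        unit = "cent" if cents == 1 else "cents"
        parts.append(f"{number_to_words(cents)} {unit}")
    return " and ".join(parts)


def normalize(text: str) -> str:
    """Classify each whitespace-separated token and expand currency tokens."""
    return " ".join(
        verbalize_currency(tok) if classify(tok) == "currency" else tok
        for tok in text.split()
    )


print(normalize("It costs $2.50 today"))
```

A token with trailing punctuation, such as "$2.50.", is left unchanged by this sketch, which hints at why a practical system needs a more careful tokenizer and grammar than a single regular expression.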
Structure of the Course
This course will consist of a combination of a few lectures,
discussion of papers from the literature, and a lab component in which
the class as a team will build a set of modules for text normalization
using an open-source finite-state grammar toolkit. Most classes will
combine discussion of the readings with discussion of
progress on the project.
See here for instructions on how to install OpenFst.
We also have a class wiki.
Your grade will depend upon the following components:
Participation in discussion: 50%
Contribution to the class project: 50%
Intro to course, discussion of Sproat et al. text norm paper, intro to
Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari
Ostendorf, and Christopher Richards. "Normalization of non-standard
words." Computer Speech and Language, 15(3), 287-333, 2001.
Assignment of readings.
Homework on annotation of
Detailed plans for development of the system.
Cannon, Garland. 1989. "Abbreviations and acronyms in English
word-formation." American Speech,
David Yarowsky. "Homograph disambiguation in text-to-speech
synthesis." In Jan van Santen, Richard Sproat, Joseph Olive, and Julia
Hirschberg, editors, Progress in Speech Synthesis, pages
157--172. Springer, New York,
Chapters 1-3 (pp. 1-98) from: Hurford, James. 1975. The Linguistic
Theory of Numerals. Cambridge University Press,
J.T. Chang, H. Schütze, and R.B. Altman. 2002. "Creating an Online Dictionary
of Abbreviations from MEDLINE." JAMIA,
Neil Rowe and Kari Laitinen. "Semiautomatic disabbreviation of
technical text." Information Processing and Management,
Andrei Mikheev. 2000. "Document centered approach to text
normalization." Research and Development in Information Retrieval
(Proceedings of SIGIR),
Andrew Golding and Dan Roth. "A Winnow-based approach to spelling
correction." Machine Learning, 1999.
Michael Collins and Yoram Singer. 1999. "Unsupervised Models for Named
Entity Classification." EMNLP/VLC-99.
D. Bikel, Richard Schwartz, and Ralph Weischedel. "An algorithm that
learns what's in a name." Machine Learning, 34(1/3):221--231,
Olinsky, C. and Black, A. (2000) "Non-Standard Word and Homograph
Resolution for Asian Language Text Analysis", ICSLP 2000, Beijing,
S. Schwarm and M. Ostendorf. "Text normalization with varied data
sources for conversational speech language modeling."
In Proc. ICASSP,
pages I:789--792, 2002.
K. F. Wong and Y. Xia, "Normalization of Chinese chat language,"
Language Resources and Evaluation, 42, 219-242, 2008.
(accessible on campus).
Ju, Yun-Cheng and Odell, Julian. 2008. "A Language-Modeling
Approach to Inverse Text Normalization and Data Cleanup for Multimodal
Voice Search Applications." Interspeech 2008, Brisbane.
Shugrina, Maria. 2010. "Formatting Time-Aligned ASR
Transcripts for Readability." NAACL 2010, Los Angeles.
Willis, Tim; Pain, Helen; and Trewin, Shari. 2005.
"A Probabilistic Flexible Abbreviation Expansion System for Users With Motor Disabilities."
Accessible Design in the Digital World Conference.
Jonnalagadda and Topham. "NEMO: Extraction and normalization of
organization names from PubMed affiliations." J Biomed Discov Collab
(2010) vol. 5 pp. 50-75.
Cook, P. & Stevenson, S., 2009.
"An unsupervised model for text message normalization". CALC '09.
Samuel Brody and Nicholas Diakopoulos. 2011.
"Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using
Word Lengthening to Detect Sentiment in Microblogs."
Fei Liu, et al. 2011.
"Insertion, Deletion, or Substitution? Normalizing Text Messages
without Pre-categorization nor Supervision." ACL 2011.
Bo Han and Timothy Baldwin. 2011. "Lexical normalisation of short text
messages: Makn sens a #twitter." ACL 2011.
Choudhury et al. 2007. "Investigation and modeling of the structure of
texting language." Int. J. Doc. Anal. Recognit. 10,
pp. 157-174.
Pennell and Liu. 2011. "Toward text message normalization: Modeling
Proceedings of the IEEE. pp. 5364-5367.
Deana L. Pennell and Yang Liu. 2011. "A Character-Level Machine Translation
Approach for Normalization of SMS Abbreviations." IJCNLP.
Aw, et al. 2006.
"A phrase-based statistical model for SMS text normalization."
ACL 2006.
"Syntactic Normalization of Twitter Messages". Int'l Conference on
11/29 and 12/1
Final presentation of work done. Tests and system integration.
© 2011, Richard Sproat, Steven Bedrick