CSE 554/654 - Text-Based Language Processing Systems

Instructor: Brian Roark

Class time: M/W 11:00 AM - 12:30 PM    Mar. 31 - June 9, 2008

Class location: Center for Health & Healing 12181, videoconf'd to Wilson Clark Center 403

Office hours: Th 10-12, Central Building 115, or by appointment

Required textbooks:

None; readings will come from papers available on-line.


Goals

The goal of this course, which focuses on bio-medical text, is to present current best practices in building systems that cluster, label or transform raw text to improve information access. Such systems are often chained together within larger applications that retrieve documents, extract information, summarize, answer questions and translate to other languages. This course will provide a hands-on, project-oriented introduction to such applications.

Prerequisites

There is no official programming language for this course, but some amount of scripting or programming will be required to complete assignments; hence, facility with some programming language (or willingness to acquire such facility) is assumed.

Homework and term projects

The course will be structured around an end-to-end query-directed text processing system that retrieves documents, performs query-directed summarization or question answering, and translates the results automatically. A baseline system will be in place with a simple baseline version of each component, and each component can be improved independently. For homework projects, students will select particular components, try to improve performance over the baseline using various techniques, and evaluate the impact on system performance. For the term project, students will be given more leeway in selecting a topic for further investigation.

Grading

10% of your grade will depend on in-class discussion, 15% on in-class presentations, 15% each on 3 homework projects and 30% on a term project and presentation.

What we'll cover and an approximate schedule (in progress, may change)

Date | Topic | Reading | Lecture video | Slides
Mar.31 | Overview of class structure; introduction to the text processing "pipeline", including IR, IE, QA, summarization and MT; homework and term project options | | vid | pdf
Apr.2 | Introduction to Information Retrieval (IR) and Information Extraction (IE) | | vid | pdf
Apr.7 | Introduction to Question Answering (QA) and Automatic Summarization | Tutorial | vid | pdf
Apr.9 | Introduction to Machine Translation (MT) | | vid | pdf
Apr.14 | Statistical methods and knowledge-based methods; finite-state automata and transducers; pipelining systems | HR07 | vid | pdf
Apr.16 | Raw text processing; text normalization; domain specific text processing; key issues in bio-medical text processing | Norm01 | vid | pdf
Apr.21 | Topics in text normalization and IR; student HW project presentations | SPNorm02, SH03 | vid |
Apr.23 | Topics in IE; student HW project presentations | GKM05 | vid |
Apr.28 | Topics in QA; student HW project presentations | RH02 | vid |
Apr.30 | Topics in QA | DFL07 | vid |
May 5 | Topics in IE (guest lecture: Aaron Cohen, DMICE) | CoHer05, CoHun08 | vid |
May 7 | Topics in Summarization; student HW project presentations | RHNYSB06 | vid |
May 12 | Topics in Summarization (guest lecture: Seeger Fisher) | OER05, Mil05 | vid | pdf
May 14 | Topics in MT; student HW project presentations | CZ05 | vid |
May 19 | Topics in MT (guest lecture: Kristy Hollingshead) | Chi05 | vid | pdf
May 21 | Topics in MT; student HW project presentations | | vid |
May 26 | No class, Memorial Day | | |
May 28 | Topics in natural language processing (NLP) for text-based applications; student HW project presentations | | vid | pdf
Jun.2 | Generalizing methods for use with uncertain input (e.g., spoken language) | | | pdf
Jun.4 | Surveying the state-of-the-art: large research programs and system competitions; open problems; likely future directions | | |
Jun.9,11 | Term project presentations | | |


References:
Chi05   David Chiang. A Hierarchical Phrase-Based Model for Statistical Machine Translation. Proceedings of the Annual Meeting of the ACL, pp. 263-270, 2005.
CZ05   Vincent Claveau and Pierre Zweigenbaum. Translating Biomedical Terms by Inferring Transducers. Proceedings of the 10th conference on artificial intelligence in medicine in Europe. AIME, pp. 236-240, 2005.  (Mieszko Kruger, lead)
CoHer05   Aaron M. Cohen and William Hersh. A Survey of Current Work in Biomedical Text Mining. Briefings in Bioinformatics, 6(1):57-71, 2005.
CoHun08 K. Bretonnel Cohen and Lawrence Hunter. Getting started in text mining. PLoS Computational Biology, 4(1), 2008.
DFL07   Dina Demner-Fushman and Jimmy Lin. Answering Clinical Questions with Knowledge-Based and Statistical Techniques. Computational Linguistics, 33(1):63-103, 2007.  (Seeger Fisher, lead)
GKM05   Trond Grenager, Dan Klein and Chris Manning. Unsupervised Learning of Field Segmentation Models for Information Extraction. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 371-378, 2005.  (Youngjun Kim, lead)
HR07   Kristy Hollingshead and Brian Roark. Pipeline Iteration. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 952-959, 2007.  (Kristy Hollingshead, lead)
Mil05   Rada Mihalcea. Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling. Proceedings of HLT-EMNLP, 2005.
OER05   Jahna Otterbacher, Gunes Erkan and Dragomir R. Radev. Using Random Walks for Question-focused Sentence Retrieval. Proceedings of HLT-EMNLP, 2005.
SPNorm02   Serguei Pakhomov. Semi-Supervised Maximum Entropy Based Approach to Acronym and Abbreviation Normalization in Medical Texts. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 160-167, 2002.  (Aaron Dunlop, lead)
RH02   Deepak Ravichandran and Eduard Hovy. Learning Surface Text Patterns for a Question Answering System. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 41-47, 2002.  (Emily Tucker, lead)
RHNYSB06   Lawrence Reeve, Hyoil Han, Saya V. Nagori, Jonathan C. Yang, Tamara A. Schwimmer, and Ari D. Brooks. Concept Frequency Distribution in Biomedical Text Summarization. Proceedings of the 15th Conference on Information and Knowledge Management, 2006.  (Glenn Diviney, lead)
SH03   Ariel Schwartz and Marti Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. Proceedings of the 8th Pacific Symposium on Biocomputing, pp. 451-462, 2003.  (Matt MacNaughton, lead)
Norm01   Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. Normalization of non-standard words. Computer Speech and Language, 15(3):287-333, 2001.   (Steven Bedrick, lead)