CSE 555/655 - Biological and Linguistic Sequence Analysis

Instructor: Brian Roark

Class time: M/W 11:00 AM - 12:30 PM    Jan. 7 - Mar. 19, 2008

Class location: Center for Health & Healing 12181, videoconf'd to Wilson Clark Center 403

Office hours: Th 10-12, Central Building 115, or by appointment

Required textbooks:

Dan Gusfield Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology
Richard Durbin, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

Skip to overview of lectures.

Goals

The goal of this course is to give a broad but detailed introduction to the key algorithms and modeling techniques used for sequence processing in both biological and linguistic applications, with an emphasis on exact and approximate sequence matching problems.

Prerequisites

There is no official programming language for this course, but there will be a some amount of programming required to complete assignments, hence facility with some programming language (or willingness to acquire such facility) is assumed.

Grading

10% of your grade will depend on in-class discussion, 50% on the homeworks and 20% each on the midterm and final.

What we'll cover and an approximate schedule

Date     Topic Reading AssignmentFAQs
Jan.7 Introduction to biological and linguistic strings/sequences; formal representation; overview of main problems Gusfield Ch.10
Durbin Ch.1
   
Jan.9 Introduction to string edit distance, dynamic programming and approximate alignment; motivation for efficient exact match Gusfield Ch.11
Durbin Ch.2
 
Jan.14 Deterministic exact string matching (a): simple approaches;
intro to Knuth-Morris-Pratt and Boyer-Moore algorithms
Gusfield
Ch.1
   
Jan.16 Deterministic exact string matching (b): Knuth-Morris-Pratt and Boyer-Moore algorithms Gusfield
Ch.2
HW 1
Due Jan.30
 
Jan.21 No class, MLK Jr. Day    
 
Jan.23 Deterministic exact string matching (c): Aho-Corasick algorithm for sets of patterns; regular expression patterns Gusfield
Ch.3
   
Jan.28 Suffix trees: introduction and linear-time construction algorithms; some applications (exact string matching) Gusfield Ch.5-6 HW 2
Due Feb.11
Jan.30 Suffix automata; Suffix arrays; Lempel Ziv; Lowest common ancester retrieval Gusfield Ch.7-8, 9.1  
Feb.4 Efficient approximate matching: linear space, bounded approximate matching and exclusion methods. Brief introduction to HMM alignment models Gusfield Ch.12
Durbin Ch.3
  
Feb.6 Hidden Markov models for tagging, bracketing, segmentation and pairwise alignment; dynamic programming; finite-state transducers Durbin Ch.4 HW 3
Due Feb.25
 
Feb.11 Machine Translation alignment: HMM and IBM models Och and Ney (2003)    
Feb.13 HMM parameter re-estimation; forward-backward for Expectation Maximization; Learning HMM alignment models; pronunciation alignments Ristad and Yianilos (1998)   
Feb.18 No class, Presidents' Day Take home midterm
Due Mar.3
 
 
Feb.20 Introduction to multiple sequence alignments; families; profile HMMs; Perceptron Algorithm for profiles Durbin Ch.5    
Feb.25 Aligning multiple sequences; Minimum sum-of-pairs alignment; higher dimensional dynamic programming; iterative pairwise alignment Durbin Ch.6
Gusfield Ch.14
HW 4
Due Mar.10
 
Feb.27 Introduction to phylogenic tree building; Ultrametric and additive distance trees; distance-based tree construction; parsimony Durbin Ch.7
Gusfield Ch.17
   
Mar.3 Probabilistic models of phylogeny Durbin Ch.8 HW 5
Due Mar.19
 
Mar.5 Protein and RNA secondary structure; context-free models Durbin Ch.9
Searls (2002)
  
Mar.10 Context free inference; RNA structure prediction; Protein folding Durbin Ch.10
Hockenmaier et al. (2006)
   
Mar.12 Mildly context-sensitive models for secondary structure Chiang et al. (2006a)
(2006b)
   
Mar.17 Machine Translation alignment (2); Transduction grammars Wu (1997)    
Mar.19 In class final presentations: informal presentations of extensions to homework 5      


References:

Brown et al. (1993)   Peter E Brown, Vincent J. Della Pietra, Stephen A. Della Pietra and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311, 1993.
Chiang et al. (2006a)    David Chiang, Aravind K. Joshi and David B. Searls. Grammatical representations of macromolecular structure. Journal of Computational Biology, 13(5):1077-1100, 2006.
Chiang et al. (2006b)    David Chiang, Aravind K. Joshi and Ken A. Dill. A Grammatical Theory for the Conformational Changes of Simple Helix Bundles. Journal of Computational Biology, 13(1):21-42, 2006.
Hockenmaier et al. (2006)    Julia Hockenmaier, Aravind K. Joshi and Ken A. Dill. Routes are trees: The parsing perspective on protein folding. Proteins: Structure, Function, and Bioinformatics, 66(1):1-15, 2006.
Och and Ney (2003)    Franz Josef Och and Hermann Ney. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51, 2003.
Ristad and Yianilos (1998)    Eric Sven Ristad and Peter N. Yianilos. Learning String Edit Distance. IEEE Transactions on Pattern Recognition and Machine Intelligence, 20(5):522-532, 1998.
Searls (2002)   David B. Searls. The language of genes. Nature, 420:211-217.
Wu (1997)   Dekai Wu. Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora. Computational Linguistics, 23(3):377-403, 1997. 2003.