CSE 506/606 - Topics in Information Retrieval

Instructors: Steven Bedrick, Emily T. Prud'hommeaux and Brian Roark

Class time: Tu/Th 11:00-12:30pm

Class location: WCC 403

Required textbooks:

None. Readings will come from papers available on-line or from on-line textbooks, principally Manning, Raghavan and Schütze (2008) and Hearst (2009); see the references below. We will add additional readings to this list as needed.

Other Resources

NIST trec_eval software (used to calculate MAP, NDCG, and other standard evaluation metrics; please don't waste time implementing your own NDCG calculator unless you desperately want to). It's not exactly the most intuitive piece of software ever, so don't be afraid to ask for help.
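
For intuition about what trec_eval is computing, below is a minimal NDCG@k sketch in Python. It uses one common formulation of DCG (graded relevance divided by log2 of rank + 1); trec_eval's own implementation and options differ in their details, and the relevance grades here are invented, so treat this purely as an illustration of the metric, not a replacement for the tool (whose basic invocation takes a qrels file followed by a run file).

    # Minimal, illustrative NDCG@k for a single query.
    # Not a substitute for trec_eval; the numbers below are made up.
    import math

    def dcg(relevances):
        """Discounted cumulative gain over a ranked list of graded relevances."""
        # rank is 0-based, so the discount is log2(rank + 2) = log2(position + 1)
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

    def ndcg_at_k(ranked_rels, judged_rels, k=10):
        """DCG of the system's top-k ranking divided by DCG of the ideal ranking."""
        ideal = sorted(judged_rels, reverse=True)[:k]
        ideal_dcg = dcg(ideal)
        return dcg(ranked_rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

    # Hypothetical query: the system ranked documents with relevance grades
    # 3, 2, 3, 0, 1; the judged pool for the query contains grades 3, 3, 2, 1, 0.
    print(ndcg_at_k([3, 2, 3, 0, 1], [3, 3, 2, 1, 0], k=5))  # roughly 0.97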

See official OHSU Grade Policy and Disability Statement below




Goals

The course will cover a variety of topics in the general area of information retrieval (IR). An initial series of lectures will cover the fundamentals of IR, with a particular emphasis on applications of modern NLP techniques in the field and on evaluation. The remainder of the course will be taught seminar-style, and will consist of a review of selected recent papers from the IR literature. To provide a historical perspective, we will also include several classic papers in this review. Topics will include (but are not limited to): practical issues related to web crawling; indexing of raw text and other data, such as word lattices produced by speech recognizers; and query expansion and suggestion methods. Issues in IR evaluation will be covered throughout the course. Students will be expected to actively participate in discussions of research papers, and to lead the discussions in several sessions. The course will also include one or more homework assignments and a term project, which will involve implementing and evaluating IR systems.
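
As a concrete taste of these fundamentals, here is a deliberately tiny sketch of an in-memory inverted index with a simple tf-idf-style ranking. The toy corpus, names, and scoring details are invented for illustration only; real engines (and the systems discussed in the readings) handle tokenization, compression, and scale very differently.

    # Toy in-memory inverted index with a simple tf-idf ranking (illustrative only).
    import math
    from collections import Counter, defaultdict

    docs = {  # hypothetical toy corpus
        1: "information retrieval with inverted indexes",
        2: "ranking documents by tf idf weights",
        3: "speech recognition lattices and retrieval",
    }

    # term -> {doc_id: term frequency}
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index[term][doc_id] = tf

    def search(query, k=3):
        """Rank documents by the summed tf-idf of the query terms they contain."""
        scores = Counter()
        for term in query.split():
            postings = index.get(term, {})
            if not postings:
                continue
            idf = math.log(len(docs) / len(postings))
            for doc_id, tf in postings.items():
                scores[doc_id] += tf * idf
        return scores.most_common(k)

    print(search("retrieval lattices"))  # doc 3 first, then doc 1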

Prerequisites

There is no official programming language for this course, but some amount of scripting or programming will be required to complete assignments, so facility with a programming language (or willingness to acquire such facility) and with Linux is assumed.

Homework and term projects

Specifics to be determined.

Grading

See the OHSU Grade Policy below. Your grade will be based 10% on in-class discussion, 30% on in-class presentations, 10% on each of two homework projects, and 40% on a term project and presentation.

What we'll cover and an approximate schedule

Each entry below lists the date, topic, tentative reading, lecture slides, and (where applicable) the student assigned to lead the discussion.

Sept. 25: Information behavior, browsing vs. seeking, types of search, history of IR. Reading: Hearst Ch. 3. Slides: lec1.pdf.
Sept. 27: IR basics: inverted index, query and document representations, boolean retrieval, simple tf/idf and other ranking schemes. Homework 0 (getting set up) assigned. Reading: Manning, Raghavan and Schütze, Ch. 1-2. Slides: lec2.pdf.
Oct. 2: IR models: boolean, vector space, language models. Homework 1 assigned. Reading: Manning et al. Ch. 6, 7, 12, and 18; don't worry, the last two are short chapters, and large parts of them will probably be review. If you thought the LSA/LSI material was interesting, I suggest checking out Furnas et al.'s 1988 SIGIR paper on using SVD for IR. Slides: lec3.pdf.
Oct. 4: Index construction, optimization, and compression. Reading: Manning et al. Ch. 4 and 5. Slides: lec4.pdf.
Oct. 9: Experimental evaluation. Reading: Manning et al. Ch. 8; Hearst Ch. 2. Of historical interest, and highly recommended, is Cleverdon (1991); also highly recommended is Käki & Aula (2007). Slides: lec5.pdf.
Oct. 11: Relevance feedback. Reading: Manning, Raghavan and Schütze, Ch. 9; White et al. (2006); Lee et al. (2008). Slides: lec6.pdf. Assigned to: Brian Roark (slides).
Oct. 16: Search UI/UX. Homework 2 assigned. Reading: Clarke et al. (2007); Wu et al. (2012). Assigned to: Khoa Pham (slides).
Oct. 18: Web search, PageRank. Reading: Manning et al. Ch. 19 and 21; Page et al. (1998); Kurland and Lee (2010). Slides: lec8.pdf. Assigned to: Tomer Meshorer (slides).
Oct. 23: Parallel and MapReduce approaches. Reading: Lin (2009); Lin et al. (2009); Pantel et al. (2009). Assigned to: Masoud Rouhizadeh (slides).
Oct. 25: Query suggestion/reformulation. Reading: Jain et al. (2011); Ozertem et al. (2012). Assigned to: Golnar Sheikhshabbafghi (slides).
Oct. 30: Guest lecture: Brooke Cowan, Information Extraction. Reading: Banko et al. (2007); Carlson et al. (2010); McClosky et al. (2011); Jurafsky Ch. 22. Slides available.
Nov. 1: Student project proposals.
Nov. 6: Weighted lattice indexing. Reading: Saraclar and Sproat (2004); Allauzen, Mohri and Saraclar (2004); Chelba and Acero (2005). Assigned to: Andrew Fowler (slides).
Nov. 8: Spoken term detection. Reading: Hori et al. (2007); Mamou et al. (2007); Parada et al. (2010). Assigned to: Maider Lehr and Meysam Asgari (slides).
Nov. 13: Guest lecture: Bill Hersh, Biomedical information retrieval. Slides and handout available.
Nov. 15: Guest lecture: Amanda Jones, E-discovery. Reading: Grossman and Cormack (2011); Barnett and Godjevac (2011). Slides available.
Nov. 20: Learning to rank. Reading: Manning et al. Ch. 15; Sculley (2010). Slides available. Assigned to: Alireza Bayestehtashk.
Nov. 22: Thanksgiving, no class.
Nov. 27: Multimedia retrieval. Reading: Müller et al. (2004); Zhou et al. (2012); Mitchell et al. (2012). Slides available. Assigned to: Hamidreza Mohammadi.
Nov. 29: Prediction and sentiment analysis. Reading: O'Connor et al. (2010); Chahuneau et al. (2012); Gayo-Avello (2012). Slides available. Assigned to: Mahsa Langarani.
Dec. 4 and 6: Term project presentations. Tuesday, Dec. 4: Andrew, Tomer, Hamid. Thursday, Dec. 6: Golnar, Khoa, Masoud.


References:

C. Allauzen, M. Mohri and M. Saraclar. 2004. General Indexation of Weighted Automata - Application to Spoken Utterance Retrieval. In HLT-NAACL 2004 Workshop: Interdisciplinary Approaches to Speech Indexing and Retrieval, pp. 33-40.

M. Banko, et al. 2007. Open Information Extraction from the Web. In IJCAI '07, pp. 2670-2676.

T. Barnett and S. Godjevac. 2011. Faster, better, cheaper legal document review, pipe dream or reality? In Proceedings of the ICAIL 2011 Workshop on Setting Standards for Searching Electronically Stored Information in Discovery Proceedings.

A. Carlson, et al. 2010. Coupled Semi-Supervised Learning for Information Extraction. In WSDM '10: Proceedings of the third ACM international conference on Web search and data mining, pp. 101-110.

V. Chahuneau, K. Gimpel, B.R. Routledge, L. Scherlis and N.A. Smith. 2012. Word Salad: Relating Food Prices and Descriptions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Natural Language Learning (EMNLP 2012).

C. Chelba and A. Acero. 2005. Position Specific Posterior Lattices for Indexing Speech. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 443-450.

C.L.A. Clarke, E. Agichtein, S. Dumais and R.W. White. 2007. The influence of caption features on clickthrough patterns in web search. In Proceedings of SIGIR, pp. 135-142.

C. Cleverdon. 1991. The significance of the Cranfield tests on index languages. In SIGIR '91: Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 3-12.

G. Furnas, et al. 1988. Information Retrieval Using a Singular Value Decomposition Model of Latent Semantic Structure. In Proceedings of SIGIR '88.

D. Gayo-Avello. 2012. "I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" -- A Balanced Survey on Election Prediction using Twitter. In Computing Research Repository.

M. Grossman and G. Cormack. 2011. Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review. Richmond Journal of Law and Technology, 17:3.

M. Hearst. 2009. Search User Interfaces. Cambridge University Press.

T. Hori, I. L. Hetherington, T. J. Hazen and J. Glass. 2007. Open-Vocabulary Spoken Utterance Retrieval Using Confusion Networks. In Proceedings of ICASSP.

A. Jain, U. Ozertem and E. Velipasaoglu. 2011. Synthesizing High Utility Suggestions for Rare Web Search Queries. In Proceedings of SIGIR.

D. Jurafsky and J. Martin. 2009. Chapter 22: Information Extraction. In Speech and Language Processing, 2nd edition. Prentice Hall.

M. Käki and A. Aula. 2007. Controlling the complexity in comparing search user interfaces via user studies. Information Processing & Management, 44(1):81-91.

O. Kurland and L. Lee. 2010. PageRank without hyperlinks: Structural reranking using links induced by language models. ACM Transactions on Information Systems (TOIS), 28(4).

K.S. Lee, W.B. Croft and J. Allan. 2008. A cluster-based resampling method for pseudo-relevance feedback. In Proceedings of SIGIR, pp. 235-242.

J. Lin. 2009. Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce. In Proceedings of SIGIR, pp. 155-162.

J. Lin, D. Metzler, T. Elsayed and L. Wang. 2009. Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search. In Proceedings of the Eighteenth Text REtrieval Conference (TREC).

J. Mamou, B. Ramabhadran and O. Siohan. 2007. Vocabulary Independent Spoken Term Detection. In Proceedings of the 30th Annual International ACM SIGIR conference, pp. 615-622.

C. Manning, P. Raghavan and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

D. McClosky, M. Surdeanu and C. Manning. 2011. Event Extraction as Dependency Parsing. In Proceedings of ACL, pp. 1626-1635.

M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos, X. Han, A. Mensch, A. Berg, T. Berg and H. Daumé III. 2012. Midge: Generating Image Descriptions From Computer Vision Detections. In Proceedings of EACL, pp. 747-756.

H. Müller, N. Michoux, D. Bandon and A. Geissbuhler. 2004. A review of content-based image retrieval systems in medical applications-clinical benefits and future directions. International Journal of Medical Informatics, 73(1):1-23.

B. O'Connor, R. Balasubramanyan, B.R. Routledge and N.A. Smith. 2010. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM 2010), pp. 122–129.

U. Ozertem, O. Chapelle, P. Donmez and E. Velipasaoglu. 2012. Learning to suggest: a machine learning framework for ranking query suggestions. In Proceedings of SIGIR.

L. Page, S. Brin, R. Motwani and T. Winograd. 1998. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford University.

P. Pantel, E. Crestan, A. Borkovsky, A.M. Popescu and V. Vyas. 2009. Web-Scale Distributional Similarity and Entity Set Expansion. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 938-947.

C. Parada, A. Sethy and B. Ramabhadran. 2010. Balancing false alarms and hits in Spoken Term Detection. In Proceedings of ICASSP, pp. 5286-5289.

M. Saraclar and R. Sproat. 2004. Lattice-Based Search for Spoken Utterance Retrieval. In Proceedings of HLT-NAACL, pp. 129-136.

D. Sculley. 2010. Combined Regression and Ranking. In KDD 2010: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

R.W. White, J.M. Jose and I. Ruthven. 2006. An implicit feedback approach for interactive information retrieval. Information Processing and Management, 42(1):166-190.

W.C. Wu, D. Kelly and K. Huang. 2012. User evaluation of query quality. In Proceedings of SIGIR, pp. 215-224.

X. Zhou, R. Stern and H. Müller. 2012. Case-based fracture image retrieval. International Journal of Computer Assisted Radiology and Surgery, 7(3):401-411.


OHSU Grade Policy

OHSU SoM Graduate Studies Grade Submission Policy
Approved by SoM Graduate Council April 8, 2008

Graduate Studies in the OHSU School of Medicine is committed to providing grades to students in a timely manner. Course instructors will provide students with information in writing at the beginning of each course that describes the grading policies and procedures including but not limited to evaluation criteria, expected time needed to grade individual student examinations and type of feedback they will provide.

Class grades are due to the Registrar by the Friday following the week of finals. However, on those occasions when a grade has not been submitted by the deadline, the following procedure shall be followed:

1) The Program Coordinator will immediately contact the Instructor requesting the missing grade, with a copy to the Program Director and Registrar.

2) If the grade is still overdue by the end of next week, the Program Coordinator will email the Department Chair directly, with a copy to the Instructor and Program Director requesting resolution of the missing grade.

3) If, after an additional week the grade is still outstanding, the Coordinator may petition the Office of Graduate Studies for final resolution.


OHSU Disability Statement

Our program is committed to all students achieving their potential. If you have a disability or think you may have a disability (physical, learning, hearing, vision, psychological) that may require a reasonable accommodation, please contact Student Access at (503) 494-0082 or e-mail orchards@ohsu.edu to discuss your needs. You can also find more information at www.ohsu.edu/student-access. Because accommodations can take time to implement, it is important to have this discussion as soon as possible. All information regarding a student's disability is kept in accordance with relevant state and federal laws.