Computational Linguistics
This webpage serves as a syllabus for this course (Fall 2008), and also a repository of links of interesting computational linguistics that I send people to. For some items you need a password to gain access.
Computer Science CMSC 25020-1 and CMSC 35030-1
Fall 2008
John Goldsmith ja-goldsmith@uchicago.edu. Office in CS: Ryerson 162. Also in Classics 307 (office phone there 702-8525), where I have office hours.
Note that links in what follows will light up in yellow when you run the cursor over them.
This is a course in the Computer Science department, intended for upper-level undergraduates, or graduate students, who have a good programming background. In general, we face the same kind of negotiation over choice of language that you might expect. If you want to submit code in C++, perl, or Python, that should be no problem; other choices are discussable, and the decision will have to be made by the instructor and the TA jointly.
We will look at several current topics in natural language processing, dividing our time between discussing material in the textbook and material in current research papers, and getting our hands dirty with code and corpora. In line with most current work, our emphasis will be on systems that draw conclusions from training data rather than relying on the encoding of generalizations obtained by humans studying the data. As a consequence of that, in part, and also because we will stand to learn more about natural language if we do so, we will make an effort not to focus on English, but to look at a range of human languages in our treatments.
My notes for this course are a work in progress; their current state can be found at this-here link.
In the past, the primary textbook has been Manning and Schütze, Foundations of Statistical Natural Language Processing, which also has a website which you can (and should) peruse. I've decided this year not to ask you to buy the book. It's still as good as it was before, and I do recommend it; but it seemed to me that we did not use enough of it to justify asking you to spend the money on it now. I will therefore attempt to find free material in the internet that will suffice for our purposes. So while I recommend the book, I do not insist that you buy it for this course.
The grade for this course will be based on class attendance and participation (which means reading the material before the class discussion of it),
including a few quizzes on Fridays (10 points), and three programming projects, worth 30 points each.
I will give several quizzes on Fridays at the beginning of class. They will simply be to write down certain mathematical definitions that we have used, or in one or two cases a proof: these will not be hard, but you have to learn the definitions well enough that you simply know them (and don't have to look them up). I will even tell you in advance what I expect to ask you on the quizzes.
Each project will take 3 weeks, and there will be 3 things to hand in, 1 per week:
| Section | Topics | Reading, etc. |
|---|---|---|
| Introduction |
Relationship between linguistics, computational linguistics, and machine learning. Methods of evaluation: precision and recall scores. Corpora. |
Lillian Lee: "I'm sorry Dave,
I'm afraid I can't do that": Linguistics, statistics, and natural language
processing circa 2001.
Fernando Pereira: Formal grammar and information theory: together again? In The Legacy of Zellig Harris: Language and information into the 21st century, volume 2, pp.13-32. Ed. Bruce E. Nevin and Stephen M. Johnson Steven Abney. Statistical methods and linguistics. The Balancing Act, Judith Klavans and Philip Resnik, eds, MIT Press. Steven Abney. Statistical Methods.Encyclopedia of Cognitive Science, 2000. Alan M. Turing, 1950. Computing machinery and intelligence. In Mind LIX(236), pp. 433-460. |
Probability and information theory | Basic ideas. Distribution, frequency, probability. Letters, phonemes, words. Entropy, cross-entropy, K-L divergence. Example: language identification. Phonological complexity webpage Example: calculating the effect of taking "th" as a chunk. Bayes' Rule. Looking at Phonological Structure program. Language models: word-based (n-gram) models.Language model: phonology. Using Hidden Markov model to find simple categories.
|
Shannon: Prediction and entropy of printed English.(1951) Bell Systems Technical Journal, 30, 50-64. Goldsmith: Probability for linguists. Chapter 2 of Manning and Schütze. A spreadsheet for entropy and cross-entropy. Recommended: Church and Ward 1989: Word association norms, mutual information, and lexicography Excerpts from Silicon Dreams. Robert Lucky. |
| Word segmentation and Zipfian distributions. |
What is a Zipfian distribution? The problem of word-segmentation or "chunking." Minimum Description Length (MDL) analysis for word segmentation. The relationship between length of encoding with a prefix-encoding system and the inverse log probability. Sequitur: a non-probabilistic approach. Example: Finding compounds in English and elsewhere. |
Arithmetic encoding excerpt, from Text Compression, by Bell, Cleary, and Witten. Ray Solomonoff: The discovery of algorithmic probability: a guide for the programming of true creativity. This story starts at the College of the University of Chicago. Composition and perturbation, Carl de Marcken, ACL 1996. Graduate reading: Unsupervised Language Acquisition, at the same site Link to my implementation of de Marcken's algorithm. Sequitur: a non-probabilistic approach. Nevill-Manning and Witten: Identifying structure in sequences. |
| Morpheme segmentation and morphology induction | Successor Frequency. Linguistica/Europa. Notes. String edit distance.
|
Zellig Harris, From Phoneme to Morpheme. Notes on Hafer and Weiss evaluation of Harris's SF proposals. Margaret Hafer and Stephen Weiss Word segmentation by letter successor varieties John Goldsmith 2001 Unsupervised learning of natural language morphology; and also see the Linguistica website Durban et al, Biological Sequence Analysis, is very good (on this as well as on HMMs which we will get to.) This is discussion is a lot more detailed than you need to have at this point; but the whole chapter is clear and interesting, and these are algorithms that you may be using your whole life. |
| Word-classes and Grammar induction | Context vectors; clustering of context-vectors. |
Chapter 3 of Manning and Schütze, and also 15.2 ("The vector space model"). Brown et al.: Class-based n-gram models of natural language Redington, Chater, and Finch Distributional information: a powerful technique for acquiring syntactic categories. Cognitive Science 22(4), 425-469, 1998. Eric Brill: Unsupervised learning of disambiguation rules for Part of Speech Tagging Alexander Clark: Inducing syntactic categories by context distribution clustering. |
| Word sense disambiguation. | Unsupervised (and nearly unsupervised) methods | Unsupervised word sense disambiguation rivaling supervised methods. David Yarowsky. 1995. |
| Machine translation. | Time permitting -- but it's unlikely. | Slides from class |
Very good introduction to probability by David Forsyth.
Finite Automata, by Mark V. Lawon. There is a partial version of this online in pdf format that you can download.
See Roni Rosenfeld's notes on probability and information theory.
M. Li and P. M. B. Vitányi. Introduction to Kolmogorov Complexity and its Applications. Springer Verlag 1997 (second edition).
Kit, Chunyu and Yorick Wilks. Unsupervised learning of word boundary with description length gain.
M.R. Brent. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning. 34:71-106. 1999.
M.R. Brent and T.A. Cartwright. Distributional regualrlity and phonological constraints are useful for segmentation. Cognition. 61:93-125. 1996.
G. Chaitin. 1966. On the length of programs for computing finite binary sequences. J. Assoc. Comput. Math. 13: 547-569.
A. N. Kolmogorov. Three approaches for defining the concept of 'information quantity.' Problem of Information Transmission 1:4-7. 1965.
Jorma Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific. 1989.
A. Venkataraman. A statistical model for word discovery in transcribed speech. Computational Linguistics. 27(3): 352-372. 2001.
Trost, Harald. Computational Morphology.
Yarowsky, D. and R. Wicentowski, Minimally supervised morphological analysis by multimodal alignment. Proceedings of ACL-2000, Hong Kong, pages 207-216, 2000.
Herve Dejean Morphemes as Necessary Concept
Edit distance and dialect proximity, John Nerbonne et al.
Biological Sequence Analysis, by R. Durbin et al. Chapter 2: Pairwise alignment.
Linda Van Guilder's overview of part of speech tagging.
See Roni Rosenfield's notes on Biological Sequence Analysis, by R. Durbin et al. Chapters 3 and 4: Markov chains and HMMs; Pairwise alignment using HMMs.
John Hutchins: Machine translation: a brief history. Webpage on MT.
Catherine Ball's excellent MT page.
Sylvia Wong: Machine translation techniques.
Berger, Brown et al: The Candide system for machine translation.
Segmentation, or word-discovery, or collocation-discovery, or text segmentation (see separate topic for related bioinformatic work)
Nevill-Manning, Craig, Ian Witten and Gordon W. Paynter. Browsing in digital libraries: a phrase-based approach.
Steier, Amy M. and Richard K. Belew. Exporting phrases: a statistical analysis of topical language.
Utiyama, Masao and Hitoshi Isahara. A statistical model for domain-independent text segmentation.
Sentence segmentation:
Applications:
A
free CPAN Perl module for sentence splitting.
Shlomo Yona maintains another perl-based sentence splitter.
Ghassan Mourad
(1999)
La
segmentation de textes par l'étude de la ponctuation
Ghassan Mourad
La
segmentation de textes par exploration contextuelle automatique, présentation du
module SegATex Ghassan.Mourad@paris4.sorbonne.fr
Greg Grefenstette and Past Tapanainen. "What is a word, what is a sentence? Problems of tokenization."
Tibor Kiss and Jan
Strunk
Scaled log
likelihood ratios for the detection of abbreviations in text corpora
Tibor Kiss and Jan
Strunk
Multilingual
Least-Effort Sentence Boundary Disambiguation
Andrei Mikheev. "Text Segmentation." In R. Mitkov (ed.) Oxford Handbook of Computational Linguistics, OUP, 2003.
Andrei Mikheev
Tagging Sentence
Boundaries (2000)
Andrei Mikheev
Periods, Capitalized
Words, etc (1999)
David D. Palmer
(2000)
Tokenisation and Sentence Segmentation,
Robert Dale, Hermann Moisl
and Harold Somers (Eds)
in A Handbook of Natural Language Processing, Marcel
Dekker
David D. Palmer and Marti
A. Hearst,
Adaptive
Multilingual Sentence Boundary Disambiguation
J. Reynar and A.
Ratnaparkhi,
A
Maximum Entropy Approach to Identifying Sentence Boundaries
Word sense disambiguation, word distributions
Bookstein, Abraham, S.T.Klein, and T. Raita. Clumping Properties of Content-bearing Words.
Lin, Dekan. Automatic retrieval and clustering of similar words.
Yarowsky, D. Hierarchical Decision Lists for Word Sense Disambiguation Computers and the Humanities, 34(2):179-186, 2000.
Yarowsky, D. ``Word Sense Disambiguation.'' In R. Dale, H. Moisl and H. Somers (eds.) The Handbook of Natural Language Processing. New York: Marcel Dekker, pp. 629-654, 2000.
Yarowsky, D. `` Unsupervised Word Sense Disambiguation Rivaling Supervised Methods.'' In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA, pp. 189-196, 1995.
Latent semantic indexing: dimensionality-reduction on the space of word-senses
Patterns in unstructured data: a presentation to the Andrew W. Mellon Foundation from Middlebury College
Original paper by Deerwester, Dumais et al. Indexing by Latent Semantic Analysis (1990).
Morphology and morphology learning
Named Entity Recogntion
Cucerzan, S. and D. Yarowsky. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings, 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, pp. 90-99, 1999.
Michael Collins and Yoram Singer 1999. Unsupervised Models for Named Entity Classification . In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
Word Alignment
Reducing Parameter Space for Word Alignment Herve Dejean, Eric Gaussier, Cyril Goutte, and Kenji Yamada
Grammar induction
Atwell, Eric. 2004. Clustering of Word types and Unification of WordTokens into Gramamtical Word-Classes. TALN 2004.
Eric Brill. 1993. Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach. ACL 1993.
Stanley F. Chen. Bayesian
Grammar Induction for Language Modeling. In Proceedings of the 33rd Annual
Meeting of the Association for Computational Linguistics, pages 228-235, 1995.
Alexander Clark (2003) Pre-processing very noisy text, Proceedings of Workshop on Shallow Processing of Large Corpora, Corpus Linguistics 2003, Lancaster. (to appear) [Postscript][PDF]
Alexander Clark (2003) Combining Distributional and Morphological Information for Part of Speech Induction, Proceedings of EACL 2003, [Postscript][PDF]
Franck Thollard and Alexander Clark (2002) Shallow Parsing using Probabilistic Grammatical Inference, Proceedings of the 6th International Colloquium on Grammatical Inference, (ICGI-2002) [Postscript][PDF] Amsterdam, the Netherlands
Franck Thollard and Alexander Clark (2002) Apprentissage d'Automates Probabilistes Déterministes, Proceedings of CAp2002 [Postscript][PDF]
Franck Thollard and Alexander Clark (2002) Détection de Groupes Nominaux par Inférence Grammaticale Probabiliste, Proceedings of CAp2002 [Postscript][PDF]
Alexander Clark (2002) Memory-Based Learning of Morphology with Stochastic Transducers, Proceedings of ACL 2002, Philadelphia, USA (to appear) [Postscript][PDF]
Alexander Clark (2001) Unsupervised Language Acquisition: Theory and Practice, DPhil Thesis, University of Sussex, December 2001 [PDF]
Alexander Clark (2001) Learning Morphology with Pair Hidden Markov Models, Proceedings of the Student Workshop at ACL 2001, July 2001, Toulouse, France. [Postscript][PDF]
Alexander Clark (2001) Unsupervised Induction of Stochastic Context-Free Grammars using Distributional Clustering, Proceedings of CoNLL 2001, July 2001, Toulouse, France. [Postscript][PDF]
Alexander Clark (2001) Partially Supervised Learning of Morphology with Stochastic Transducers, Proceedings of Natural Language Processing Pacific Rim Symposium (NLPRS) 2001, November 2001, Tokyo. [Postscript][PDF]
Alexander Clark (2000) Inducing Syntactic Categories by Context Distribution Clustering, Proceedings of CoNLL 2000, September 2000, Lisbon. [Postscript][PDF,]
Alexander Clark (1998) Inducing fields in NLP. MSc Dissertation at
University of Sussex. [
Compressed postscript]
Mennon M. van Zaanen. Bootstrapping structure into language: Alignment-based learning. 2001. PhD dissertation, University of Leeds.
Pieter W. Adriaans and Erik de Haas Grammar Induction as Substructural Inductive Logic Programming, . In James Cussens and Saso Dzeroski, editors, Learning Language in Logic, volume 1925 of Lecture Notes in Computer Science, pages 127--142. Springer-Verlag, June 2000.Dan Klein and Christopher D. Manning, "Natural Language Grammar Induction Using a Constituent-Context Model", Advances in Neural Information Processing Systems 14 (NIPS-2001), 2001. [ps] [pdf] [bib]
Dan Klein and Christopher D. Manning, "Distributional Phrase Structure
Induction", Proceedings of the Fifth Conference on Natural Language Learning
(CoNLL-2001), 2001. [ps]
[pdf]
[bib]
Phonology
Belz, Anja. An approach to the automatic acquisition of phonotactic constraints.
Ellison, T. Mark. The Machine Learning of Phonological Structure. 1992. Dissertation, University of Western Australia.
Grammar
Grünwald, Peter. A Minimum Description Length approach to grammar inference.
Bioinformatics
Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. Regulatory element detection using a probabilistic segmentation model.
Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis.
Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. 2001. Nature Genetics. Regulatory element detection using correlation with expression.
Mathematical background
Language modeling, parsing
Eugene Charniak, 1997. Statistical Parsing with a Context-free Grammar and Word Statistics Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), pp 598-603.
Michael Collins,
1997. Three
generative, lexicalised models for statistical parsing 35th Annual Meeting
of the Association for Computational Linguistics and 8th Conference of the
European Chapter of the Association for Computational Linguistics: Proceedings
of the Conference (ACL/EACL '97), pp 16-23.
Grammar makes a
comeback: Ciprian Chelba and Frederick Jelinek, 1998. Exploiting
syntactic structure for language modeling COLING-ACL '98 36th Annual Meeting
of the Association for Computational Linguistics and 17th International
Conference on Computational Linguistics: Proceedings of the Conference, Vol 1,
pp 225--231
Week 3 Shallow Parsing
Machine translation
Bonnie Dorr, Pamela Jordan, and J. Benoit, 1999, A Survey of Current Paradigms in Machine Translation Advances in Computers, edited by Marvin V. Zelkowitz, Vol 49, Academic Press
Peter E Brown; Vincent J. Della Pietra; Stephen A. Della Pietra; Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, Volume 19, Number 2.
Kevin Knight, 1999. A Statistical MT Tutorial Workbook.
John Hutchins: Machine translation: a brief history.
Sylvia Wong: Machine translation techniques.
Adam L. Berger, Peter
F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett,
John D. Lafferty, Robert L. Mercer, Harry Printz, and Lubos Ures, 1994. The
Candide System for Machine Translation. Proceedings of the 1994 ARPA Workshop on
Human Language Technology.
Other
Co-training: Blum, A. and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. Proceedings of the 11th annual Conference on Computational Learning Theory. New York: ACM Press.
Gibbs (Boltzmann) modeling
Gibbs-Markov models. John Lafferty.
Other syllabi
Mark Johnson (Brown)
Steven Bird (Penn)
James Pustejovsky (Brandeis)
Ray Dougherty (NYU)
Mark Gawron (SDSU)
Lillian Lee (Cornell)