Computational Linguistics
This webpage serves as a syllabus for this course (Fall 2008), and also a repository of links of interesting computational linguistics that I send people to. For some items you need a password to gain access.
Computer Science CMSC 25020-1 and CMSC 35030-1
Fall 2007
John Goldsmith ja-goldsmith@uchicago.edu. Office in CS: Ryerson 258 x4-7152
(look inside the lab). Also in Classics 307 (office phone there 702-8525), where
I have office hours.
Note that links in what follows will light up in yellow when you run the cursor over them.
This is a course in the Computer Science department, intended for upper-level undergraduates, or graduate students, who have a good programming background. In general, we face the same kind of negotiation over choice of language that you might expect. If you want to submit code in C++, perl, or Python, that should be no problem; other choices are discussable, and the decision will have to be made by the instructor and the TA jointly.
We will look at several current topics in natural language processing, dividing our time between discussing material in the textbook and material in current research papers, and getting our hands dirty with code and corpora. In line with most current work, our emphasis will be on systems that draw conclusions from training data rather than relying on the encoding of generalizations obtained by humans studying the data. As a consequence of that, in part, and also because we will stand to learn more about natural language if we do so, we will make an effort not to focus on English, but to look at a range of human languages in our treatments.
I'll be writing up my class notes as I go along, and you are welcome to download them from here.
In the past, the primary textbook has been Manning and Schütze, Foundations of Statistical Natural Language Processing, which also has a website which you can (and should) peruse. I've decided this year not to ask you to buy the book. It's still as good as it was before, and I do recommend it; but it seemed to me that we did not use enough of it to justify asking you to spend the money on it now. I will therefore attempt to find free material in the internet that will suffice for our purposes. So while I recommend the book, I do not insist that you buy it for this course.
Assignments will be due on the Monday following the week in which we discuss the material (roughly, the Monday following the week number indicated on the assignment). I've given 7 assignments below (the first week's assignment, gathering corpora, doesn't count! it's just a warm-up): Students enrolled for 25020 must do 6 of them, and students enrolled for 35030 must do all of them. You are responsible for writing code that compiles, and providing any relevant data files, and the relevant output file. Zip all this together, and email it to me and to the TA.
Two items that I would like you to read: they are relatively non-technical, but not necessarily easy to read. Read them at your leisure over the first couple of weeks:Lillian Lee: "I'm sorry Dave, I'm afraid I can't do that": Linguistics, statistics, and natural language processing circa 2001.
Fernando Pereira: Formal grammar and information theory: together again? In The Legacy of Zellig Harris: Language and information into the 21st century, volume 2, pp.13-32. Ed. Bruce E. Nevin and Stephen M. Johnson
Each topic below will take a little more than a week, so I don't expect that we will get past topic 7.
Link to the Brown Corpus.|
Topic |
Topic | Reading for this week |
|
1 |
Probability and information theorya. probabilistic approach to letters and letter sequences; likewise for phonemes. Bayes' rule. Phonotactics, conditional probability, mutual information. Phonotactics as redundancy in the lexicon at the neighborhood level. Slides. c. K-L divergence as positive semi-definite; calculating the probability of a string, effect on the calculation if we add 10 e's, effect if we treat "th" as a single chunk. |
Shannon: Prediction and entropy of printed English.(1951) Bell Systems Technical Journal, 30, 50-64. Goldsmith: Probability for linguists. |
|
2 |
PhonologyLanguage identification using Bayes' rule. Categories: vowels, consonants. How do we infer such categories? First introduction to hidden Markov models (HMMs). Looking at Phonological Structure program. |
Arithmetic encoding excerpt, from Text Compression, by Bell, Cleary, and Witten.
Excerpt from Information theoretical approaches to phonological structure. |
|
3 |
Words: Zipf's law; and the problem of word segmentationa. Zipf's Law b. Re-doing the
effect of "chunking"
bigrams on string probability;
c. beginning the discussion
of de Marcken's dissertation. Introduction to MDL: probability of data
as a sum, over all models, of the product of the model's own probability
and its assigned probability to the data; approximation of this with a
best model, and a prior distribution over models. d. de
Marcken's dissertation, part 2. e. Sequitur: a non-probabilistic approach. Nevill-Manning and Witten: Identifying structure in sequences. f. Applications of this algorithm to natural language and genomic
sequences. |
Reading: Composition and perturbation, Carl de Marcken, ACL 1996. Graduate reading: Unsupervised Language Acquisition, at the same site. Link to my implementation of de Marcken's algorithm. Nevill-Manning and Witten: Identifying structure in sequences. |
|
4 |
Morpheme segmentationb. signatures (Greenberg rectangles) as
indicator of morpheme status. MDL model with signatures. Notes.
c.
Distinguishing between MDL model's objective function and heuristics used
navigating in search space. Exploring English. |
Zellig Harris, From Phoneme to Morpheme. Margaret Hafer and Stephen Weiss Word segmentation by letter successor varieties John Goldsmith 2001 Unsupervised learning of natural language morphology and see the Linguistica website Ray Solomonoff: The discovery of algorithmic probability: a guide for the programming of true creativity. This story starts at the College of the University of Chicago. 2 sets of slides used (Week 6, Monday) on learning morphology in general, and on finer corrections using MDL. |
|
5 |
String edit distanceFinding strings that differ in many ways (except reordering).a. Basic algorithm b. Using word-pair alignments to find common morphemes in languages with rich morphologies. c. Using word-pair alignments to find related words and morphemes: phonological rules. |
Durban et al, Biological Sequence Analysis, is very good (on this as well as on HMMs which we will get to.) This is discussion is a lot more detailed than you need to have at this point; it's the discussion on p. 19ff that you need for our assignment, but the whole chapter is clear and interesting, and these are algorithms that you may be using your whole life.
Pekka Kilpeläinen's slides are good, too . |
|
6 |
Word-classes and distributional similarities among wordsa. traditional word-classes:
Parts of Speech. Penn treebank categories.
Powerpoint slides. |
Chapter 3 of Manning and Schütze, and also 15.2 ("The vector space model"). Brown et al.: Class-based n-gram models of natural language |
|
7 |
Hidden Markov models and Part of Speech (PoS) taggingThis is more than 3 classes' worth of material! |
A good discussion from bioinformatics: Durban et al. on HMMs.
Chapter 9 of Manning and Schütze Graduate reading: A
Practical Part-of-Speech Tagger (1992) by Doug Cutting, Julian Kupiec,
Jan Pedersen, Penelope Sibun |
|
8 |
Word sense disambiguationa. the problem; its importance for machine-translation; topic-dependent
vs. topic-independent disambiguation; the pseudoword method (merge
banana and door, then split them); Bayes classifier. |
Unsupervised word sense disambiguation rivaling supervised methods. David Yarowsky. 1995. |
|
9 |
Alignment and Machine translation |
Slides from class Chapter 13 of Manning and Schütze |
|
|
Caveat: I expect the material given in the chart above for the first 7 or 8 weeks to expand to fill the whole quarter: it's very possible that we will not reach the last two or three topics. The pace with which we will go through these topics depends in large part on the background and interests of the students in the class.
Assignments
I think you know what I mean by ``pseudocode'' in the assignments below. Here is a link to a Latex style file for pseudocode which gives some good examples. Please feel free to use it in writing up your answers.
| Week | |
| 1 | None. |
| 2 | Given the data regarding letter frequencies in English, French, German, and Swahili, identify the mystery language from among that set. Your data here. |
| 3 | Problem: An algorithm for finding compounds. |
| 4 | Problem: Using successor frequency to find suffixes. |
| 5 | Problem: String edit distance algorithm. |
| 6 | Problem: Context vectors. |
| 7 | Problem: Hidden Markov models. |
| 9 | See last week's assignment |
| 10 |
In depth explanation of assignments to avoid confusion. 15 minutes of discussion on Mondays about homework would help.
Either include just a bit less material or be just a bit longer.
Different room (people were falling asleep). Perhaps a little more speech analysis.
More time, then we can study more aspects.
Possibly add non-programming assignments to solidify understanding of concepts for which there isn't time for programming assignments. More interactive! I am never really a fan of slide based lectures.
Cover more topics. Suggest students use a language like Perl or Python and not C++ for assignments (they were designed with text processing in mind and make the assignments much easier.)
I would add another textbook and do less powerpoint.
Not much. There is so much material to cover and not enough time to cover it all, but that can't be helped. There was a little too much math for my taste—a lot of the math I still don't understand well.
| Week |
Suggested additional
readings |
| 1 |
Steven Abney. Statistical methods and linguistics. The Balancing Act, Judith Klavans and Philip Resnik, eds, MIT Press Alan M. Turing, 1950. Computing machinery and intelligence. In Mind LIX(236), pp. 433-460. |
| 2 |
Very good introduction to probability by David Forsyth. Finite Automata, by Mark V. Lawon. There is a partial version of this online in pdf format that you can download. See Roni Rosenfeld's notes on probability and information theory. M. Li and P. M. B. Vitányi. Introduction to Kolmogorov Complexity and its Applications. Springer Verlag 1997 (second edition). |
| 3 |
Kit, Chunyu and Yorick Wilks. Unsupervised learning of word boundary with description length gain. M.R. Brent. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning. 34:71-106. 1999. M.R. Brent and T.A. Cartwright. Distributional regualrlity and phonological constraints are useful for segmentation. Cognition. 61:93-125. 1996. G. Chaitin. 1966. On the length of programs for computing finite binary sequences. J. Assoc. Comput. Math. 13: 547-569. A. N. Kolmogorov. Three approaches for defining the concept of 'information quantity.' Problem of Information Transmission 1:4-7. 1965. Jorma Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific. 1989. A. Venkataraman. A statistical model for word discovery in transcribed speech. Computational Linguistics. 27(3): 352-372. 2001. |
| 4 |
Trost, Harald. Computational Morphology. Yarowsky, D. and R. Wicentowski, Minimally supervised morphological analysis by multimodal alignment. Proceedings of ACL-2000, Hong Kong, pages 207-216, 2000. Herve Dejean Morphemes as Necessary Concept |
| 5 |
Edit distance and dialect proximity, John Nerbonne et al. Biological Sequence Analysis, by R. Durbin et al. Chapter 2: Pairwise alignment. |
| 6 | |
| 7-8 |
Linda Van Guilder's overview of part of speech tagging. See Roni Rosenfield's notes on Biological Sequence Analysis, by R. Durbin et al. Chapters 3 and 4: Markov chains and HMMs; Pairwise alignment using HMMs. |
| 8-9 | |
| 9-10 |
John Hutchins: Machine translation: a brief history. Webpage on MT. Catherine Ball's excellent MT page. Sylvia Wong: Machine translation techniques. Berger, Brown et al: The Candide system for machine translation. |
| 10 |
|
What follows is just notes from previous explorations for syllabi: notes toward the construction of a syllabus for another 10 weeks of computational linguistics.
1. Counts; Zipf's law. High frequency of rare events if we focus on words.
2. Stickiness of words: collocations. Mutual information. Types of collocations.
3. Predicting next word: bigram model. Trigram model. Sparsity problem. Back-off and model mixture. Context vectors.
4. The distribution of context vectors (CVs) in CV-space is not uniform: words tend to clump with regard to their CVs. These clusters are a major part of the motivation for grammatical categories. IBM model for sentence generation. What kind of categories are these? Relation to traditional linguistic categories.
5. Letters and phonemes. Distributions. Sonority waves (syllables). MI between phonemes. Phonotactics. Learning sonority with HMMs. Vowel harmony.
6. Between letters and phonemes: morphemes. Learning morphemes. MDL.
7. Rich morphologies: string edit distance.
8. Discovering words and morphemes from continuous string: Brent, de Marcken.
9. Constituency.
10. Machine translation.
2.
Words
a. Words and corpora (chapter 4). Defining words, sentences. Zipf's law. Basics of natural language morphology.
b. Collocations. Mutual information (and probability theory, a bit of information theory).
c. Vector space models of word-contexts
d. Bigram models over sparse data (chapter 6)
e. Word-sense disambiguation (chapter 7)
f. Cross entropy and author identification. (discussion of letter-model cross-entropy and language identification)
3. Markov models of part-of-speech
a. Hidden Markov models (HMMs) (chapter 9): Viterbi parsing, E/M algorithm
b. POS tagging (chapter 10)
6. Machine learning
a. Tranformation based learning
b. Decision trees
7. Document retrieval, Webpage ranking (google ranking)?
8. Phonology: bigram models; vowel harmony and Gibbs model. Learning syllable structure and sonority with HMMs.
Segmentation, or word-discovery, or collocation-discovery, or text segmentation (see separate topic for related bioinformatic work)
Nevill-Manning, Craig, Ian Witten and Gordon W. Paynter. Browsing in digital libraries: a phrase-based approach.
Steier, Amy M. and Richard K. Belew. Exporting phrases: a statistical analysis of topical language.
Utiyama, Masao and Hitoshi Isahara. A statistical model for domain-independent text segmentation.
Sentence segmentation:
Applications:
A
free CPAN Perl module for sentence splitting.
Shlomo Yona maintains another perl-based sentence splitter.
Ghassan Mourad
(1999)
La
segmentation de textes par l'étude de la ponctuation
Ghassan Mourad
La
segmentation de textes par exploration contextuelle automatique, présentation du
module SegATex Ghassan.Mourad@paris4.sorbonne.fr
Greg Grefenstette and Past Tapanainen. "What is a word, what is a sentence? Problems of tokenization."
Tibor Kiss and Jan
Strunk
Scaled log
likelihood ratios for the detection of abbreviations in text corpora
Tibor Kiss and Jan
Strunk
Multilingual
Least-Effort Sentence Boundary Disambiguation
Andrei Mikheev. "Text Segmentation." In R. Mitkov (ed.) Oxford Handbook of Computational Linguistics, OUP, 2003.
Andrei Mikheev
Tagging Sentence
Boundaries (2000)
Andrei Mikheev
Periods, Capitalized
Words, etc (1999)
David D. Palmer
(2000)
Tokenisation and Sentence Segmentation,
Robert Dale, Hermann Moisl
and Harold Somers (Eds)
in A Handbook of Natural Language Processing, Marcel
Dekker
David D. Palmer and Marti
A. Hearst,
Adaptive
Multilingual Sentence Boundary Disambiguation
J. Reynar and A.
Ratnaparkhi,
A
Maximum Entropy Approach to Identifying Sentence Boundaries
Word sense disambiguation, word distributions
Bookstein, Abraham, S.T.Klein, and T. Raita. Clumping Properties of Content-bearing Words.
Lin, Dekan. Automatic retrieval and clustering of similar words.
Yarowsky, D. Hierarchical Decision Lists for Word Sense Disambiguation Computers and the Humanities, 34(2):179-186, 2000.
Yarowsky, D. ``Word Sense Disambiguation.'' In R. Dale, H. Moisl and H. Somers (eds.) The Handbook of Natural Language Processing. New York: Marcel Dekker, pp. 629-654, 2000.
Yarowsky, D. `` Unsupervised Word Sense Disambiguation Rivaling Supervised Methods.'' In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA, pp. 189-196, 1995.
Latent semantic indexing: dimensionality-reduction on the space of word-senses
Patterns in unstructured data: a presentation to the Andrew W. Mellon Foundation from Middlebury College
Original paper by Deerwester, Dumais et al. Indexing by Latent Semantic Analysis (1990).
Morphology and morphology learning
Named Entity Recogntion
Cucerzan, S. and D. Yarowsky. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings, 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, pp. 90-99, 1999.
Michael Collins and Yoram Singer 1999. Unsupervised Models for Named Entity Classification . In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
Word Alignment
Reducing Parameter Space for Word Alignment Herve Dejean, Eric Gaussier, Cyril Goutte, and Kenji Yamada
Grammar induction
Atwell, Eric. 2004. Clustering of Word types and Unification of WordTokens into Gramamtical Word-Classes. TALN 2004.
Eric Brill. 1993. Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach. ACL 1993.
Stanley F. Chen. Bayesian
Grammar Induction for Language Modeling. In Proceedings of the 33rd Annual
Meeting of the Association for Computational Linguistics, pages 228-235, 1995.
Alexander Clark (2003) Pre-processing very noisy text, Proceedings of Workshop on Shallow Processing of Large Corpora, Corpus Linguistics 2003, Lancaster. (to appear) [Postscript][PDF]
Alexander Clark (2003) Combining Distributional and Morphological Information for Part of Speech Induction, Proceedings of EACL 2003, [Postscript][PDF]
Franck Thollard and Alexander Clark (2002) Shallow Parsing using Probabilistic Grammatical Inference, Proceedings of the 6th International Colloquium on Grammatical Inference, (ICGI-2002) [Postscript][PDF] Amsterdam, the Netherlands
Franck Thollard and Alexander Clark (2002) Apprentissage d'Automates Probabilistes Déterministes, Proceedings of CAp2002 [Postscript][PDF]
Franck Thollard and Alexander Clark (2002) Détection de Groupes Nominaux par Inférence Grammaticale Probabiliste, Proceedings of CAp2002 [Postscript][PDF]
Alexander Clark (2002) Memory-Based Learning of Morphology with Stochastic Transducers, Proceedings of ACL 2002, Philadelphia, USA (to appear) [Postscript][PDF]
Alexander Clark (2001) Unsupervised Language Acquisition: Theory and Practice, DPhil Thesis, University of Sussex, December 2001 [PDF]
Alexander Clark (2001) Learning Morphology with Pair Hidden Markov Models, Proceedings of the Student Workshop at ACL 2001, July 2001, Toulouse, France. [Postscript][PDF]
Alexander Clark (2001) Unsupervised Induction of Stochastic Context-Free Grammars using Distributional Clustering, Proceedings of CoNLL 2001, July 2001, Toulouse, France. [Postscript][PDF]
Alexander Clark (2001) Partially Supervised Learning of Morphology with Stochastic Transducers, Proceedings of Natural Language Processing Pacific Rim Symposium (NLPRS) 2001, November 2001, Tokyo. [Postscript][PDF]
Alexander Clark (2000) Inducing Syntactic Categories by Context Distribution Clustering, Proceedings of CoNLL 2000, September 2000, Lisbon. [Postscript][PDF,]
Alexander Clark (1998) Inducing fields in NLP. MSc Dissertation at
University of Sussex. [
Compressed postscript]
Mennon M. van Zaanen. Bootstrapping structure into language: Alignment-based learning. 2001. PhD dissertation, University of Leeds.
Pieter W. Adriaans and Erik de Haas Grammar Induction as Substructural Inductive Logic Programming, . In James Cussens and Saso Dzeroski, editors, Learning Language in Logic, volume 1925 of Lecture Notes in Computer Science, pages 127--142. Springer-Verlag, June 2000.Dan Klein and Christopher D. Manning, "Natural Language Grammar Induction Using a Constituent-Context Model", Advances in Neural Information Processing Systems 14 (NIPS-2001), 2001. [ps] [pdf] [bib]
Dan Klein and Christopher D. Manning, "Distributional Phrase Structure
Induction", Proceedings of the Fifth Conference on Natural Language Learning
(CoNLL-2001), 2001. [ps]
[pdf]
[bib]
Phonology
Belz, Anja. An approach to the automatic acquisition of phonotactic constraints.
Ellison, T. Mark. The Machine Learning of Phonological Structure. 1992. Dissertation, University of Western Australia.
Grammar
Grünwald, Peter. A Minimum Description Length approach to grammar inference.
Bioinformatics
Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. Regulatory element detection using a probabilistic segmentation model.
Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis.
Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. 2001. Nature Genetics. Regulatory element detection using correlation with expression.
Mathematical background
Language modeling, parsing
Eugene Charniak, 1997. Statistical Parsing with a Context-free Grammar and Word Statistics Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), pp 598-603.
Michael Collins,
1997. Three
generative, lexicalised models for statistical parsing 35th Annual Meeting
of the Association for Computational Linguistics and 8th Conference of the
European Chapter of the Association for Computational Linguistics: Proceedings
of the Conference (ACL/EACL '97), pp 16-23.
Grammar makes a
comeback: Ciprian Chelba and Frederick Jelinek, 1998. Exploiting
syntactic structure for language modeling COLING-ACL '98 36th Annual Meeting
of the Association for Computational Linguistics and 17th International
Conference on Computational Linguistics: Proceedings of the Conference, Vol 1,
pp 225--231
Week 3 Shallow Parsing
Machine translation
Bonnie Dorr, Pamela Jordan, and J. Benoit, 1999, A Survey of Current Paradigms in Machine Translation Advances in Computers, edited by Marvin V. Zelkowitz, Vol 49, Academic Press
Peter E Brown; Vincent J. Della Pietra; Stephen A. Della Pietra; Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, Volume 19, Number 2.
Kevin Knight, 1999. A Statistical MT Tutorial Workbook.
John Hutchins: Machine translation: a brief history.
Sylvia Wong: Machine translation techniques.
Adam L. Berger, Peter
F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett,
John D. Lafferty, Robert L. Mercer, Harry Printz, and Lubos Ures, 1994. The
Candide System for Machine Translation. Proceedings of the 1994 ARPA Workshop on
Human Language Technology.
Other
Co-training: Blum, A. and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. Proceedings of the 11th annual Conference on Computational Learning Theory. New York: ACM Press.
Gibbs (Boltzmann) modeling
Gibbs-Markov models. John Lafferty.
Other syllabi
Mark Johnson (Brown)
Steven Bird (Penn)
James Pustejovsky (Brandeis)
Ray Dougherty (NYU)
Mark Gawron (SDSU)
Lillian Lee (Cornell)