Computational Linguistics
Computer Science CMSC
25020-1 and CMSC 35030-1
Spring 2004
John Goldsmith
jagoldsmith@uchicago.edu. Office in CS: Ryerson 258, x47152 (look inside the lab). Also in Classics 307 (office phone there 702-8525), where I have office hours Tuesday 10-11. Otherwise email me and we'll get together.
Teaching assistant: Irina Matveeva (matveeva@cs.uchicago.edu). Office hours: Friday 4-6 in Ryerson 177, and Monday 4-6 for students who arrange ahead of time.
About this course
This is a new course in the Computer Science department, intended for upper-level undergraduates, or graduate students, who have a good C++ background.
We will look at several current topics in natural language processing, dividing our time between discussing material in the textbook and material in current research papers, and getting our hands dirty with code and corpora. In line with most current work, our emphasis will be on systems that draw conclusions from training data rather than relying on the encoding of generalizations obtained by humans studying the data. As a consequence of that, in part, and also because we will stand to learn more about natural language if we do so, we will make an effort not to focus on English, but to look at a range of human languages in our treatments.
Reading
The primary textbook is Manning and Schütze, Foundations of Statistical Natural Language Processing, which also has a website which you should peruse. We will also be reading a number of papers whose links I've given below.
Assignments
Assignments will be due on the Monday following the week in which we discuss the material (roughly, the Monday following the week number indicated on the assignment). I've given 7 assignments below (the first week's assignment, gathering corpora, doesn't count; it's just a warm-up). Students enrolled in 25020 must do 6 of them, and students enrolled in 35030 must do all of them. You are responsible for writing C++ code that compiles (or Perl code that runs), and for providing any relevant data files and the relevant output file. Zip all this together, and email it to Irina and me.
The schedule below roughly corresponds to weeks in the quarter, but only roughly; the quarter is likely to be interrupted by events such as the meeting of the Chicago Linguistic Society.
Virtual week 1
Topic: Background and overview. (a) Overview of the course.
Reading: Chapters 1 and 4 of Manning and Schütze; also section 15.1.2 ("Evaluation measures"), and look at section 8.1, which treats a similar topic. Lillian Lee, "I'm sorry Dave, I'm afraid I can't do that": Linguistics, statistics, and natural language processing circa 2001. Fernando Pereira, Formal grammar and information theory: together again? In The Legacy of Zellig Harris: Language and information into the 21st century, volume 2, pp. 13-32, ed. Bruce E. Nevin and Stephen M. Johnson.
Virtual week 2
Topic: Probability and information theory. (a) The probabilistic approach to letters and letter sequences, and likewise for phonemes. Bayes' rule. Phonotactics, conditional probability, mutual information. Phonotactics as redundancy in the lexicon at the neighborhood level. Slides.
Reading: Chapters 2 and 6 of Manning and Schütze.
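The letter-level quantities in this unit can be sketched in C++ along these lines. This is a toy illustration: the class and method names are my own, and the handling of punctuation, smoothing, and the log base are all choices the assignment leaves open.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <utility>

// Unigram and bigram letter counts from a training string, and the pointwise
// mutual information pmi(a,b) = log2( p(ab) / (p(a) p(b)) ).
// No smoothing: unseen pairs get probability 0, so pmi of an unseen pair
// would be -infinity (a sketch, not production code).
struct LetterModel {
    std::map<char, double> uni;
    std::map<std::pair<char, char>, double> bi;
    double n = 0, nbi = 0;
    explicit LetterModel(const std::string& text) {
        for (char c : text) { uni[c] += 1; n += 1; }
        for (size_t i = 0; i + 1 < text.size(); ++i) {
            bi[{text[i], text[i + 1]}] += 1;
            nbi += 1;
        }
    }
    double p(char a) const {                       // unigram probability
        auto it = uni.find(a);
        return it == uni.end() ? 0 : it->second / n;
    }
    double p2(char a, char b) const {              // bigram probability
        auto it = bi.find({a, b});
        return it == bi.end() ? 0 : it->second / nbi;
    }
    double pmi(char a, char b) const { return std::log2(p2(a, b) / (p(a) * p(b))); }
};
```

A positive pmi('a','b') means the letters attract each other (the pair is more frequent than chance predicts), which is the phonotactics-as-redundancy idea in miniature.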
Virtual week 3
Topic: The problem of word segmentation, and Minimum Description Length (MDL) analysis. (a) Redoing the effect of "chunking" bigrams on string probability; looking at the Phonological Structure program. (b) Beginning the discussion of de Marcken's dissertation. Introduction to MDL: the probability of the data as a sum, over all models, of the product of the model's own probability and the probability it assigns to the data; approximation of this by a single best model, with a prior distribution over models. (c) de Marcken's dissertation, part 2. (d) Nevill-Manning notes; Sequitur homepage. Applications of this algorithm to natural language and genomic sequences.
Reading: Composition and perturbation, Carl de Marcken, ACL 1996. Graduate reading: Unsupervised Language Acquisition, at the same site. Link to my implementation of de Marcken's algorithm.
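The MDL idea sketched in (b) can be stated compactly in standard notation (a generic statement, not de Marcken's exact formulation): the probability of the data D marginalizes over models M, and is approximated by a single best model chosen under a prior over models, which is equivalent to minimizing a two-part description length.

```latex
P(D) \;=\; \sum_{M} P(M)\,P(D \mid M) \;\approx\; P(M^{*})\,P(D \mid M^{*}),
\qquad
M^{*} \;=\; \arg\max_{M}\, P(M)\,P(D \mid M)
      \;=\; \arg\min_{M}\,\bigl[-\log P(M) \;-\; \log P(D \mid M)\bigr]
```

The two terms of the minimization are the length of the model and the length of the data encoded with the model's help, which is why a richer lexicon is only adopted when it pays for itself in compression of the corpus.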
Virtual week 4
Topic: Segmenting a word into morphemes.
Reading: John Goldsmith, 2001, Unsupervised learning of the morphology of a natural language; and see the Linguistica website.
Virtual week 5
Topic: Finding strings that are similar in unpredictable ways: the string edit distance algorithm and its applications. (a) The basic algorithm.
Reading: I'm still looking for a good public description. Pekka Kilpeläinen's slides are good. Durbin et al., Biological Sequence Analysis, is very good (on this as well as on HMMs), and I may put it on reserve.
Virtual week 6
Topic: Word classes and distributional similarities between words. (a) Traditional word classes: parts of speech. Penn Treebank categories. PowerPoint slides.
Reading: Chapter 3 of Manning and Schütze, and also section 15.2 ("The vector space model").
Virtual weeks 7-8
Topic: Hidden Markov models and part-of-speech (PoS) tagging. This is more than three classes' worth of material!
Reading: Chapters 9 and 10 of Manning and Schütze. Graduate reading: A Practical Part-of-Speech Tagger (1992), by Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun.
Virtual weeks 8-9
Topic: Word sense disambiguation. (a) The problem; its importance for machine translation; topic-dependent vs. topic-independent disambiguation; the pseudo-word method (merge banana and door, then split them); the Bayes classifier.
Reading: Chapter 7 of Manning and Schütze. Graduate reading: Unsupervised word sense disambiguation rivaling supervised methods, David Yarowsky, 1995.
Virtual weeks 9-10
Topic: Alignment and machine translation.
Reading: Chapter 13 of Manning and Schütze.

Caveat: I expect the material given in the chart above for the first 7 or 8 weeks to expand to fill the whole quarter: it's very possible that we will not reach the last two or three topics. The pace with which we will go through these topics depends in large part on the background and interests of the students in the class.
Assignments
Week 1: Preparation for upcoming assignments. Find and download files at least 25K in length in at least 3 languages that use an alphabetic writing system.
Week 2: Write a program that calculates the distribution of all of the symbols (letters plus punctuation) in your corpora from last week. Call those corpora your training corpus. Write a program that takes a sample from one of these languages (but not part of the training corpus) and calculates the cross-entropy between the sample and the distribution you have calculated for each of the training corpora. Use those results to determine automatically which language the sample comes from, and explain why this should work.
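A C++ sketch of how this assignment might be set up. The names are my own, and the small probability floor stands in for a real smoothing scheme (the assignment does not prescribe one).

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Estimate a unigram letter distribution from a training corpus, then score a
// test sample by per-letter cross-entropy:
//   H(sample, q) = -(1/N) * sum over sample letters of log2 q(letter).
using Dist = std::map<char, double>;

Dist train(const std::string& corpus) {
    Dist q;
    for (char c : corpus) q[c] += 1.0;
    for (auto& kv : q) kv.second /= corpus.size();   // counts -> probabilities
    return q;
}

double cross_entropy(const std::string& sample, const Dist& q) {
    const double floor = 1e-6;   // stand-in for smoothing: avoids log(0) on unseen letters
    double h = 0;
    for (char c : sample) {
        auto it = q.find(c);
        double p = (it == q.end()) ? floor : it->second;
        h -= std::log2(p);
    }
    return h / sample.size();    // bits per letter
}
```

The language whose training distribution gives the lowest cross-entropy on the sample is the program's guess: cross-entropy measures how many bits per letter that model needs to encode the sample, so the closest model wins.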
Week 3: Write a program, using dynamic programming, to find the best parse of a line of text, on the assumptions that (i) you possess a lexicon of the language, each word of which is assigned an inverse log probability independent of context, and (ii) the best parse of a line is the division of the line into words such that the sum of the inverse log probabilities is a minimum. At each point i in the string, where you are analyzing the first i letters of the line, you will have to keep track of the best parse for the L preceding points, where L is the length of the longest word in the lexicon; that is, the best parse up to each point from i-L to i-1. That will allow you to decide what the best parse is up to point i. Illustrate the success of your program on several interesting lines.
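One way to organize this dynamic program in C++ (a sketch under my own naming; the cost table plus back-pointer layout is one choice among several):

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <map>
#include <string>
#include <vector>

// lexicon maps each word to its inverse log probability (a cost).
// best[i] is the cost of the cheapest parse of the first i letters;
// back[i] records the length of the last word in that cheapest parse.
std::vector<std::string> best_parse(const std::string& line,
                                    const std::map<std::string, double>& lexicon) {
    const double INF = std::numeric_limits<double>::infinity();
    size_t maxlen = 0;                         // L: length of the longest lexicon word
    for (auto& kv : lexicon) maxlen = std::max(maxlen, kv.first.size());
    std::vector<double> best(line.size() + 1, INF);
    std::vector<size_t> back(line.size() + 1, 0);
    best[0] = 0;
    for (size_t i = 1; i <= line.size(); ++i)
        for (size_t len = 1; len <= std::min(maxlen, i); ++len) {
            // candidate last word: positions i-len .. i-1
            auto it = lexicon.find(line.substr(i - len, len));
            if (it != lexicon.end() && best[i - len] + it->second < best[i]) {
                best[i] = best[i - len] + it->second;
                back[i] = len;
            }
        }
    if (best[line.size()] == INF) return {};   // no parse covers the whole line
    std::vector<std::string> words;            // recover the parse right to left
    for (size_t i = line.size(); i > 0; i -= back[i])
        words.insert(words.begin(), line.substr(i - back[i], back[i]));
    return words;
}
```

Because each point i only looks back at most L letters, the whole line is parsed in O(n * L) lexicon lookups, which is the point of keeping the best-parse table.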
Week 4: Write a program that determines the successor frequency for each word-prefix, given a corpus. As you should know by now, the successor frequency of a word-prefix is the number of distinct orthographic ways that the prefix continues in a word in the corpus. Figure out a way to use this to identify some stem-suffix boundaries. What successes does this method encounter, and what failures? Do you think you could overcome these failures with a little more work, or are they systemic?
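A minimal C++ sketch of the counting step (my own naming; a trie would be the efficient version, but a map of sets shows the idea):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// For every prefix of every word, record the set of distinct letters that can
// follow it; the successor frequency is the size of that set. A prefix with a
// high successor frequency is a candidate stem, since many different suffixes
// begin right after it.
std::map<std::string, int> successor_frequencies(const std::vector<std::string>& words) {
    std::map<std::string, std::set<char>> next;   // prefix -> continuing letters
    for (const auto& w : words)
        for (size_t i = 0; i < w.size(); ++i)
            next[w.substr(0, i)].insert(w[i]);
    std::map<std::string, int> sf;
    for (auto& kv : next) sf[kv.first] = kv.second.size();
    return sf;
}
```

On a toy vocabulary like jump / jumps / jumped / jumping, the prefix "jump" has successor frequency 3 while "ju" has 1, which is exactly the spike one would use to propose a stem-suffix boundary after "jump".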
Week 5: Write a program to find the string-edit distance between two words. Assume that a deletion or insertion costs 1, a vowel-to-vowel association costs 0.5, a consonant-to-consonant association costs 0.6, and identity costs 0. Illustrate your program with some interesting examples, such as the best alignment between washington and mastication. Extra credit: take two corpora from different languages that describe the same subject (preferably a news report, so that there will be many names in the text) and attempt to automatically find cognate words and words that refer to the same person (e.g., Vladimir Putin in English, Poutine in French).
Week 6: Write a program to model each word w's distribution in the corpus by a vector ("w's context vector") of length 2V, where V is the size of the vocabulary of the corpus; the first V coordinates give the counts of the words found to the immediate left of w, and the last V coordinates give the counts of the words found to the immediate right of w. For each of the 500 most frequent words in the corpus, find the 15 most similar words, where the measure of similarity is the cosine of the angle between the two context vectors.
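A C++ sketch of the two core pieces, context-vector construction and the cosine measure (names and the fixed-vocabulary representation are my own choices; ranking the 500 most frequent words is left out):

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// cosine(a, b) = (a . b) / (|a| |b|): 1 for parallel vectors, 0 for orthogonal.
double cosine(const std::vector<double>& a, const std::vector<double>& b) {
    double dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    if (na == 0 || nb == 0) return 0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Build w's context vector over a fixed vocabulary (vocab maps word -> index):
// coordinates 0..V-1 count left neighbors, V..2V-1 count right neighbors.
std::vector<double> context_vector(const std::vector<std::string>& tokens,
                                   const std::string& w,
                                   const std::map<std::string, size_t>& vocab) {
    std::vector<double> v(2 * vocab.size(), 0.0);
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] != w) continue;
        if (i > 0) v[vocab.at(tokens[i - 1])] += 1;                               // left
        if (i + 1 < tokens.size()) v[vocab.size() + vocab.at(tokens[i + 1])] += 1; // right
    }
    return v;
}
```

Cosine rather than raw dot product matters here: it keeps frequent and rare words comparable, since only the direction of the context vector, not its magnitude, enters the similarity.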
Week 7: HMM part 1. Write a program that analyzes letters as produced by two states (labeled "C" and "V") in an arc-emission hidden Markov model. Each of the 4 arcs (transitions between states) will produce each of the 27 letters (including space) with a non-zero probability. (1) Build the basic structure of the program, and write a function to assign initial probabilities to these transitions in such a way that all numbers that should add up to 1 (i.e., form a distribution) do so. Explain what you have done. (2) Write a function that calculates the forward probability alpha_i at each position, as defined in equation (9.10). (3) Write a function that calculates the backward probability beta_i at each position, as in (9.11). So that you do not encounter values that are too small, you may have to ensure that the lines of your corpus are not too long. You may use single words.
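A minimal C++ sketch of the forward computation of step (2), for an arc-emission HMM in the spirit of equation (9.10). The parameter layout and names are mine; the scaling or log-space arithmetic that long corpus lines would require is omitted, matching the assignment's warning about small values.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// alpha[t][i] is the probability of emitting the first t symbols and being in
// state i. In an arc-emission HMM each arc i->j both changes state and emits
// a symbol, with probability a[i][j] * b[i][j][o]. Symbols (letters) are
// mapped to integer indices by the caller; pi is the initial state distribution.
double forward_total(const std::vector<int>& obs,                             // symbol indices
                     const std::vector<double>& pi,                           // pi[i]
                     const std::vector<std::vector<double>>& a,               // a[i][j]
                     const std::vector<std::vector<std::vector<double>>>& b)  // b[i][j][o]
{
    size_t N = pi.size();
    std::vector<std::vector<double>> alpha(obs.size() + 1, std::vector<double>(N, 0));
    alpha[0] = pi;
    for (size_t t = 0; t < obs.size(); ++t)
        for (size_t j = 0; j < N; ++j)
            for (size_t i = 0; i < N; ++i)
                alpha[t + 1][j] += alpha[t][i] * a[i][j] * b[i][j][obs[t]];
    double total = 0;                                  // P(observation sequence)
    for (double x : alpha[obs.size()]) total += x;
    return total;
}
```

The backward function of step (3) has exactly the same triple loop run in the other direction, with beta[t][i] summing over outgoing rather than incoming arcs.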
Week 8: HMM part 2. (4) Write a Viterbi function that calculates the sequence of state transitions that assigns the highest probability to a given input string. (5) Write the crucial soft-count identification function (the "third problem"): for each letter-occurrence (letter L at position t) in a training corpus, given a current set of parameters for the model, calculate the probability of generating an L in a transition from state i to state j: that is, the probability of generating the first t-1 letters and ending in state i, followed by L and a transition to state j, followed by the letters from position t+1 to the end. Calculate those probabilities for the four combinations of i and j. Use those four probabilities to distribute the count of letter L over the four state transitions. (6) With these state-to-state emission counts, do the Maximization step: recalculate the probabilities based on the counts you calculated. (7) Step 5 constitutes the Expectation part of the algorithm; make sure you implement steps 5 and 6 in such a way that you can perform EM in an iterative loop. (8) Take a training corpus from English and another language and calculate the probabilities using your HMM. You may need to let the EM algorithm iterate several hundred times. (9) What letters has your HMM put into one category, and what letters has it put into the other? (10) Make a 3-state HMM, and find out what the three categories are for letters, given a training corpus from English.
Weeks 9-10: (none listed.)
Suggested additional readings, by week:
Week 1: Steven Abney. Statistical methods and linguistics. In The Balancing Act, Judith Klavans and Philip Resnik, eds., MIT Press. Alan M. Turing. 1950. Computing machinery and intelligence. Mind LIX(236), pp. 433-460.
Week 2: Very good introduction to probability by David Forsyth. See Roni Rosenfeld's notes on probability and information theory. M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, 1997 (second edition).
Week 3: Kit, Chunyu and Yorick Wilks. Unsupervised learning of word boundary with description length gain. M. R. Brent. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34:71-106, 1999. M. R. Brent and T. A. Cartwright. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61:93-125, 1996. G. Chaitin. 1966. On the length of programs for computing finite binary sequences. J. Assoc. Comput. Mach., 13:547-569. A. N. Kolmogorov. Three approaches to defining the concept of 'information quantity.' Problems of Information Transmission, 1:4-7, 1965. Jorma Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 1989. A. Venkataraman. A statistical model for word discovery in transcribed speech. Computational Linguistics, 27(3):352-372, 2001.
Week 4: Trost, Harald. Computational Morphology. Yarowsky, D. and R. Wicentowski. Minimally supervised morphological analysis by multimodal alignment. Proceedings of ACL-2000, Hong Kong, pages 207-216, 2000. Herve Dejean. Morphemes as Necessary Concept.
Week 5: Edit distance and dialect proximity, John Nerbonne et al. Biological Sequence Analysis, by R. Durbin et al., Chapter 2: Pairwise alignment.
Week 6: (none listed.)
Weeks 7-8: Linda Van Guilder's overview of part-of-speech tagging. See Roni Rosenfeld's notes. Biological Sequence Analysis, by R. Durbin et al., Chapters 3 and 4: Markov chains and HMMs; pairwise alignment using HMMs.
Weeks 8-9: (none listed.)
Weeks 9-10: John Hutchins: Machine translation: a brief history. Webpage on MT. Catherine Ball's excellent MT page. Sylvia Wong: Machine translation techniques. Berger, Brown et al.: The Candide system for machine translation.

What follows is just notes from previous explorations for syllabi: notes toward the construction of a syllabus for another 10 weeks of computational linguistics.
2.
Words
a. Words and corpora (chapter 4). Defining words, sentences. Zipf's law. Basics of natural language morphology.
b. Collocations. Mutual information (and probability theory, a bit of information theory).
c. Vector space models of word contexts
d. Bigram models over sparse data (chapter 6)
e. Word-sense disambiguation (chapter 7)
f. Cross-entropy and author identification (discussion of letter-model cross-entropy and language identification)
3. Markov models of partofspeech
a. Hidden Markov models (HMMs) (chapter 9): Viterbi parsing, E/M algorithm
b. POS tagging (chapter 10)
6. Machine learning
a. Transformation-based learning
b. Decision trees
7. Document retrieval, web-page ranking (Google ranking)?
General:
Segmentation, or word-discovery, or collocation-discovery, or text segmentation (see separate topic for related bioinformatic work)
Nevill-Manning, Craig, Ian Witten and Gordon W. Paynter. Browsing in digital libraries: a phrase-based approach.
Steier, Amy M. and Richard K. Belew. Exporting phrases: a statistical analysis of topical language.
Utiyama, Masao and Hitoshi Isahara. A statistical model for domain-independent text segmentation.
Sentence segmentation:
Applications:
A free CPAN Perl module for sentence splitting.
Shlomo Yona maintains another Perl-based sentence splitter.
Ghassan Mourad (1999). La segmentation de textes par l'étude de la ponctuation.
Ghassan Mourad. La segmentation de textes par exploration contextuelle automatique, présentation du module SegATex. Ghassan.Mourad@paris4.sorbonne.fr
Greg Grefenstette and Pasi Tapanainen. "What is a word, what is a sentence? Problems of tokenization."
Tibor Kiss and Jan Strunk. Scaled log likelihood ratios for the detection of abbreviations in text corpora.
Tibor Kiss and Jan Strunk. Multilingual Least-Effort Sentence Boundary Disambiguation.
Andrei Mikheev. "Text Segmentation." In R. Mitkov (ed.) Oxford Handbook of Computational Linguistics, OUP, 2003.
Andrei Mikheev. Tagging Sentence Boundaries (2000).
Andrei Mikheev. Periods, Capitalized Words, etc. (1999).
David D. Palmer (2000). Tokenisation and Sentence Segmentation. In A Handbook of Natural Language Processing, Robert Dale, Hermann Moisl and Harold Somers (eds.), Marcel Dekker.
David D. Palmer and Marti A. Hearst. Adaptive Multilingual Sentence Boundary Disambiguation.
J. Reynar and A. Ratnaparkhi. A Maximum Entropy Approach to Identifying Sentence Boundaries.
Word sense disambiguation, word distributions
Bookstein, Abraham, S. T. Klein, and T. Raita. Clumping Properties of Content-bearing Words.
Lin, Dekang. Automatic retrieval and clustering of similar words.
Yarowsky, D. Hierarchical Decision Lists for Word Sense Disambiguation. Computers and the Humanities, 34(2):179-186, 2000.
Yarowsky, D. "Word Sense Disambiguation." In R. Dale, H. Moisl and H. Somers (eds.), The Handbook of Natural Language Processing. New York: Marcel Dekker, pp. 629-654, 2000.
Yarowsky, D. "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods." In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA, pp. 189-196, 1995.
Latent semantic indexing: dimensionality reduction on the space of word senses
Patterns in unstructured data: a presentation to the Andrew W. Mellon Foundation from Middlebury College
Original paper by Deerwester, Dumais et al. Indexing by Latent Semantic Analysis (1990).
Morphology and morphology learning
Named Entity Recognition
Cucerzan, S. and D. Yarowsky. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings, 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, pp. 90-99, 1999.
Michael Collins and Yoram Singer. 1999. Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
Word Alignment
Reducing Parameter Space for Word Alignment Herve Dejean, Eric Gaussier, Cyril Goutte, and Kenji Yamada
Grammar induction
Eric Brill. 1993. Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach. ACL 1993.
Stanley F. Chen. Bayesian Grammar Induction for Language Modeling. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 228-235, 1995.
Alexander Clark (2003) Preprocessing very noisy text, Proceedings of Workshop on Shallow Processing of Large Corpora, Corpus Linguistics 2003, Lancaster. (to appear) [Postscript][PDF]
Alexander Clark (2003) Combining Distributional and Morphological Information for Part of Speech Induction, Proceedings of EACL 2003, [Postscript][PDF]
Franck Thollard and Alexander Clark (2002) Shallow Parsing using Probabilistic Grammatical Inference, Proceedings of the 6th International Colloquium on Grammatical Inference, (ICGI2002) [Postscript][PDF] Amsterdam, the Netherlands
Franck Thollard and Alexander Clark (2002) Apprentissage d'Automates Probabilistes Déterministes, Proceedings of CAp2002 [Postscript][PDF]
Franck Thollard and Alexander Clark (2002) Détection de Groupes Nominaux par Inférence Grammaticale Probabiliste, Proceedings of CAp2002 [Postscript][PDF]
Alexander Clark (2002) Memory-Based Learning of Morphology with Stochastic Transducers, Proceedings of ACL 2002, Philadelphia, USA (to appear) [Postscript][PDF]
Alexander Clark (2001) Unsupervised Language Acquisition: Theory and Practice, DPhil Thesis, University of Sussex, December 2001 [PDF]
Alexander Clark (2001) Learning Morphology with Pair Hidden Markov Models, Proceedings of the Student Workshop at ACL 2001, July 2001, Toulouse, France. [Postscript][PDF]
Alexander Clark (2001) Unsupervised Induction of Stochastic Context-Free Grammars using Distributional Clustering, Proceedings of CoNLL 2001, July 2001, Toulouse, France. [Postscript][PDF]
Alexander Clark (2001) Partially Supervised Learning of Morphology with Stochastic Transducers, Proceedings of Natural Language Processing Pacific Rim Symposium (NLPRS) 2001, November 2001, Tokyo. [Postscript][PDF]
Alexander Clark (2000) Inducing Syntactic Categories by Context Distribution Clustering, Proceedings of CoNLL 2000, September 2000, Lisbon. [Postscript][PDF]
Alexander Clark (1998) Inducing fields in NLP. MSc Dissertation, University of Sussex. [Compressed postscript]
Menno M. van Zaanen. Bootstrapping structure into language: Alignment-based learning. 2001. PhD dissertation, University of Leeds.
Pieter W. Adriaans and Erik de Haas. Grammar Induction as Substructural Inductive Logic Programming. In James Cussens and Saso Dzeroski, editors, Learning Language in Logic, volume 1925 of Lecture Notes in Computer Science, pages 127-142. Springer-Verlag, June 2000.
Dan Klein and Christopher D. Manning, "Natural Language Grammar Induction Using a Constituent-Context Model", Advances in Neural Information Processing Systems 14 (NIPS 2001), 2001. [ps] [pdf] [bib]
Dan Klein and Christopher D. Manning, "Distributional Phrase Structure Induction", Proceedings of the Fifth Conference on Natural Language Learning (CoNLL-2001), 2001. [ps] [pdf] [bib]
Phonology
Belz, Anja. An approach to the automatic acquisition of phonotactic constraints.
Ellison, T. Mark. The Machine Learning of Phonological Structure. 1992. Dissertation, University of Western Australia.
Grammar
Grünwald, Peter. A Minimum Description Length approach to grammar inference.
Bioinformatics
Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. Regulatory element detection using a probabilistic segmentation model.
Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis.
Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. 2001. Nature Genetics. Regulatory element detection using correlation with expression.
Mathematical background
Language modeling, parsing
Eugene Charniak, 1997. Statistical Parsing with a Context-free Grammar and Word Statistics. Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), pp. 598-603.
Michael Collins, 1997. Three generative, lexicalised models for statistical parsing. 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of the Conference (ACL/EACL '97), pp. 16-23.
Grammar makes a comeback: Ciprian Chelba and Frederick Jelinek, 1998. Exploiting syntactic structure for language modeling. COLING-ACL '98: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics: Proceedings of the Conference, Vol. 1, pp. 225-231.
Week 3 Shallow Parsing
Machine translation
Bonnie Dorr, Pamela Jordan, and J. Benoit, 1999. A Survey of Current Paradigms in Machine Translation. Advances in Computers, edited by Marvin V. Zelkowitz, Vol. 49, Academic Press.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, Volume 19, Number 2.
Kevin Knight, 1999. A Statistical MT Tutorial Workbook.
John Hutchins: Machine translation: a brief history.
Sylvia Wong: Machine translation techniques.
Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett, John D. Lafferty, Robert L. Mercer, Harry Printz, and Lubos Ures, 1994. The Candide System for Machine Translation. Proceedings of the 1994 ARPA Workshop on Human Language Technology.
Other
Co-training: Blum, A. and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. Proceedings of the 11th Annual Conference on Computational Learning Theory. New York: ACM Press.
Other syllabi
Mark Johnson (Brown)
Steven Bird (Penn)
James Pustejovsky (Brandeis)
Ray Dougherty (NYU)
Mark Gawron (SDSU)
Lillian Lee (Cornell)