Computational Linguistics

Computer Science CMSC 25020-1 and CMSC 35030-1
Spring 2004
John Goldsmith Office in CS: Ryerson 258 x4-7152 (look inside the lab). Also in Classics 307 (office phone there 702-8525), where I have office hours Tuesday 10-11. Otherwise email me and we'll get together.
Teaching assistant: Irina Matveeva Office Hours Friday 4-6 in Ryerson 177, and Monday 4-6 for students who arrange ahead of time.

About this course

This is a new course in the Computer Science department, intended for upper-level undergraduates, or graduate students, who have a good C++ background.

We will look at several current topics in natural language processing, dividing our time between discussing material in the textbook and material in current research papers, and getting our hands dirty with code and corpora. In line with most current work, our emphasis will be on systems that draw conclusions from training data rather than relying on the encoding of generalizations obtained by humans studying the data. As a consequence of that, in part, and also because we will stand to learn more about natural language if we do so, we will make an effort not to focus on English, but to look at a range of human languages in our treatments.


The primary textbook is Manning and Schütze, Foundations of Statistical Natural Language Processing, which also has a website which you should peruse. We will also be reading a number of papers whose links I've given below.


Assignments will be due on the Monday following the week in which we discuss the material (roughly, the Monday following the week number indicated on the assignment). I've given 7 assignments below (the first week's assignment, gathering corpora, doesn't count! it's just a warm-up): Students enrolled for 25020 must do 6 of them, and students enrolled for 35030 must do all of them. You are responsible for writing C++ code that compiles (or Perl code that runs), and providing any relevant data files, and the relevant output file. Zip all this together, and email it to Irina and me.

The schedule below roughly corresponds to weeks in the quarter, but only roughly; the quarter is likely to be interrupted by events such as the meeting of the Chicago Linguistics Society.

Virtual week
Topic Reading for this week

Background and overview

a. overview of course
b. modern linguistics and contemporary computational linguistics: a bit of recent history. Testing the 'poverty of the stimulus' argument.
c. corpora, resources, tokenization (word, sentence), speech vs text vs bioinformatic resources, PoS tagging, morphology, syntax: a brief overview. Precision and recall. Notes.

Chapters 1 and 4 of Manning and Schütze, and also section 15.1.2 ("Evaluation measures") and look at section 8.1 on a similar topic.

Lillian Lee: "I'm sorry Dave, I'm afraid I can't do that": Linguistics, statistics, and natural language processing circa 2001.

Fernando Pereira: Formal grammar and information theory: together again? In The Legacy of Zellig Harris: Language and information into the 21st century, volume 2, pp.13-32. Ed. Bruce E. Nevin and Stephen M. Johnson


Probability and information theory

a. probabilistic approach to letters and letter sequences; likewise for phonemes. Bayes' rule. Phonotactics, conditional probability, mutual information. Phonotactics as redundancy in the lexicon at the neighborhood level. Slides.
b. entropy as average (positive) log probability; cross-entropy; K-L divergence as a reasonable measure of the relationship between two distributions.
c. K-L divergence as positive semi-definite; calculating the probability of a string, effect on the calculation if we add 10 e's, effect if we treat "th" as a single chunk.

Chapters 2 and 6 of Manning and Schütze

Goldsmith: Probability for linguists.


The problem of word segmentation, and Minimum Description Length (MDL) analysis

a. Re-doing the effect of "chunking" bigrams on string probability; looking at Phonological Structure program;

b. beginning the discussion of de Marcken's dissertation. introduction to MDL: probability of data as a sum, over all models, of the product of the model's own probability and its assigned probability to the data; approximation of this with a best model, and a prior distribution over models.
Viterbi algorithm applied to finding the best parse of a string, given probabilities for a lexicon. Learning such probabilities using Expectation/Maximization.

c. de Marcken's dissertation, part 2.
Slides on Viterbi parsing and E/M

d. Neville-Manning notes; Sequitur homepage.

. Applications of this algorithm to natural language and genomic sequences.

Reading: Composition and perturbation, Carl de Marcken, ACL 1996. Graduate reading: Unsupervised Language Acquisition, at the same site.

Link to my implementation of de Marcken's algorithm.


Segmenting a word into morphemes

a. Morphology as the capturing of redundancies in the lexicon at the level of strings of several phonemes; Zellig Harris' successor frequencies; its strengths and weaknesses. Backwards and forwards; Hafer and Weiss evaluation.
b. signatures (Greenberg rectangles) as indicator of morpheme status. MDL model with signatures. Notes.
c. Distinguishing between MDL model's objective function and heuristics used navigating in search space. Exploring English.

John Goldsmith 2001 Unsupervised learning of natural language morphology and see the Linguistica website

Finding strings that are similar in unpredictable ways: the string edit distance algorithm, and its applications.

a. Basic algorithm
b. Using word-pair alignments to find common morphemes in languages with rich morphologies.

c. Using word-pair alignments to find related words and morphemes: phonological rules.

I'm still looking for a good public description. Pekka Kilpeläinen's slides are good. Durban et al, Biological Sequence Analysis, is very good (on this as well as on HMMs, and I may put that on reserve).

Word-classes and distributional similarities between words.

a. traditional word-classes: Parts of Speech. Penn treebank categories. Powerpoint slides.
b. Context vectors; difference between wide and narrow context-vectors.

c. Reducing the dimensionality of context vectors; inducing word-classes

Chapter 3 of Manning and Schütze, and also 15.2 ("The vector space model").

Inserting a string into a trie.


Hidden Markov models and Part of Speech (PoS) tagging

This is more than 3 classes' worth of material!

Chapter 9 and 10 of Manning and Schütze

Graduate reading: A Practical Part-of-Speech Tagger (1992) by Doug Cutting, Julian Kupiec, Jan Pedersen, Penelope Sibun


Word sense disambiguation

a. the problem; its importance for machine-translation; topic-dependent vs. topic-independent disambiguation; the pseudoword method (merge banana and door, then split them); Bayes classifier.
b. MI-based approach (section 7.2.2), similar to decision trees; dictionary- and thesaurus-based methods (7.3); unsupervised disambiguation using EM.

Chapter 7 of Manning and Schütze
Graduate reading: Unsupervised word sense disambiguation rivaling supervised methods. David Yarowsky. 1995.

Alignment and Machine translation

Chapter 13 of Manning and Schütze




Caveat: I expect the material given in the chart above for the first 7 or 8 weeks to expand to fill the whole quarter: it's very possible that we will not reach the last two or three topics. The pace with which we will go through these topics depends in large part on the background and interests of the students in the class.

1 Preparation for upcoming assignments: Find and download files at least 25K in length in at least 3 languages using an alphabetic writing system.
2 Write a program that calculates the distribution of all of the symbols (letters plus punctuation) in your corpora from last week. Call those corpora your training corpus. Write a program that takes a sample from one of these languages (but not part of the training corpus) and calculates the cross-entropy between the sample and the distributions you have calculated for each of the training corpora. Use those results to determine automatically what language the sample comes from, and explain why this should work.
3 Write a program, using dynamic programming, to find the best parse of a line of text, on the assumptions that: you possess a lexicon of the language, each word of which is assigned an inverse log probability independent of context; and the best parse of a line is the parsing of the line into words such that the sum of the inverse log probabilities is a minimum. At each point i in the string, where you are analyzing the first i letters of the line, you will have to keep track of the best parse for the L preceding points, where L is the longest member of the lexicon -- that is, the best parse up to each point from i-L to i-1 -- and that will allow you to decide what the best parse is up to that point. Illustrate the success of your program on several interesting lines.
4 Write a program that determines the successor frequency for each word-prefix, given a corpus. As you should know by now, the successor frequency of a word-prefix is the number of distinct orthographic ways that the prefix continues in a word in the corpus. Figure out a way to use this to identify some stem-suffix boundaries. What successes does this method encounter, and what failures? Do you think you could overcome these failures with a little more work, or are they systemic?
5 Write a program to find the string-edit distance between two words. Assume that a deletion or insertion costs 1, a vowel-to-vowel association costs 0.5, a consonant-to-consonant association costs 0.6, and identity costs 0. Illustrate your program with some interesting examples, such as the best alignment between washington and mastication. Extra credit: Take two corpora from different languages that describe the same subject (preferably a news report, so that there will be many names in the text) and attempt to automatically find cognate words and words that refer to the same person (e.g., Vladimir Putin in English, Poutine in French).
6 Write a program to model each word w's distribution in the corpus by a vector ("w's context vector") of length 2V, where V is the size of the vocabulary in the corpus; the first V coordinates give the counts of the words found to the immediate left of w, and the last V coordinates give the counts of the words found to the immediate right of w. For each of the 500 most frequent words in the corpus, find the 15 most similar words, where the measure of similarity is the cosine of the angle between between the two context vectors.
7 HMM part 1: Write a program that analyzes letters as produced by two states (labeled "C" and "V") in an arc-emision hidden Markov model. Each of the 4 arcs (transitions between states) will produce each of the 27 letters (including space) with a non-zero probability. (1) Build the basic structure of the program, and write a function to assign initial probabilities to these transitions in such a way that all numbers that should add up to 1 (i.e., form a distribution) do so. Explain what you have done. (2) Write a function that calculates the forward probability for each alpha_i, as defined in equation (9.10). (3) Write a function that calculates the backward probability for each beta_i, as in (9.11). In order that you not encounter values that are too small, you may have to ensure that the total length of the lines of your corpus are not too large. You may use single words.
8 HMM part 2: (4) Write a Viterbi function that calculates the sequence of state transitions that assigns the highest probability to a given input string. (5) Write the crucial soft count identification function ("third problem"): for each letter-occurrence (letter L at position i) in a training corpus, given a current set of parameters for the model, calculate the probability of generating an L in a transition from state i to state j. With that, calculate the probability of generating the first i-1 letters and ending in state i, followed by L and a transition to state j, followed by the letters from position i+1 to the end. Calculate those probabilities for the four combinations of i and j. Use those four probabilities to distribute the count of letter L over the four state-transitions. (6) With these state-to-state-emission counts, do Maximization: recalculate the probabilities based on the counts you calculated. (7) Steps 5 and 6 constitute the Expectation part of the algorithm; make sure you implement them in such a way that you can perform EM in an iterative loop. (8) Take a training corpus from English and another language and calculate the probabilities using your HMM. You may need to let the EM algorithm iterate several hundred times. (9) What letters has your HMM put into one category, and what letters has it put into the other? (10) Make a 3-state HMM, and find out what the three categories are for letters, given training corpus from English.

Suggested additional readings

Steven Abney. Statistical methods and linguistics. The Balancing Act, Judith Klavans and Philip Resnik, eds, MIT Press

Alan M. Turing, 1950. Computing machinery and intelligence. In Mind LIX(236), pp. 433-460.


Very good introduction to probability by David Forsyth.

See Roni Rosenfeld's notes on probability and information theory.

M. Li and P. M. B. Vitányi. Introduction to Kolmogorov Complexity and its Applications. Springer Verlag 1997 (second edition).


Kit, Chunyu and Yorick Wilks. Unsupervised learning of word boundary with description length gain.

M.R. Brent. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning. 34:71-106. 1999.

M.R. Brent and T.A. Cartwright. Distributional regualrlity and phonological constraints are useful for segmentation. Cognition. 61:93-125. 1996.

G. Chaitin. 1966. On the length of programs for computing finite binary sequences. J. Assoc. Comput. Math. 13: 547-569.

A. N. Kolmogorov. Three approaches for defining the concept of 'information quantity.' Problem of Information Transmission 1:4-7. 1965.

Jorma Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific. 1989.

A. Venkataraman. A statistical model for word discovery in transcribed speech. Computational Linguistics. 27(3): 352-372. 2001.


Trost, Harald. Computational Morphology.

Yarowsky, D. and R. Wicentowski, Minimally supervised morphological analysis by multimodal alignment. Proceedings of ACL-2000, Hong Kong, pages 207-216, 2000.

Herve Dejean Morphemes as Necessary Concept


Edit distance and dialect proximity, John Nerbonne et al.

Biological Sequence Analysis, by R. Durbin et al. Chapter 2: Pairwise alignment.


Linda Van Guilder's overview of part of speech tagging.

See Roni Rosenfield's notes on

Biological Sequence Analysis, by R. Durbin et al. Chapters 3 and 4: Markov chains and HMMs; Pairwise alignment using HMMs.


John Hutchins: Machine translation: a brief history. Webpage on MT.

Catherine Ball's excellent MT page.

Sylvia Wong: Machine translation techniques.

Berger, Brown et al: The Candide system for machine translation.



What follows is just notes from previous explorations for syllabi: notes toward the construction of a syllabus for another 10 weeks of computational linguistics.

2. Words

a. Words and corpora (chapter 4). Defining words, sentences. Zipf's law. Basics of natural language morphology.
b. Collocations. Mutual information (and probability theory, a bit of information theory).
c. Vector space models of word-contexts
d. Bigram models over sparse data (chapter 6)
e. Word-sense disambiguation (chapter 7)
f. Cross entropy and author identification. (discussion of letter-model cross-entropy and language identification)

3. Markov models of part-of-speech

a. Hidden Markov models (HMMs) (chapter 9): Viterbi parsing, E/M algorithm
b. POS tagging (chapter 10)

6. Machine learning

a. Tranformation based learning
b. Decision trees

7. Document retrieval, Webpage ranking (google ranking)?


Segmentation, or word-discovery, or collocation-discovery, or text segmentation (see separate topic for related bioinformatic work)

Nevill-Manning, Craig, Ian Witten and Gordon W. Paynter. Browsing in digital libraries: a phrase-based approach.

Steier, Amy M. and Richard K. Belew. Exporting phrases: a statistical analysis of topical language.

Utiyama, Masao and Hitoshi Isahara. A statistical model for domain-independent text segmentation.

Sentence segmentation:

A free CPAN Perl module for sentence splitting.

Shlomo Yona maintains another perl-based sentence splitter.

Ghassan Mourad (1999)
La segmentation de textes par l'étude de la ponctuation

Ghassan Mourad
La segmentation de textes par exploration contextuelle automatique, présentation du module SegATex

Greg Grefenstette and Past Tapanainen. "What is a word, what is a sentence? Problems of tokenization."

Tibor Kiss and Jan Strunk
Scaled log likelihood ratios for the detection of abbreviations in text corpora

Tibor Kiss and Jan Strunk
Multilingual Least-Effort Sentence Boundary Disambiguation

Andrei Mikheev. "Text Segmentation." In R. Mitkov (ed.) Oxford Handbook of Computational Linguistics, OUP, 2003.

Andrei Mikheev
Tagging Sentence Boundaries (2000)

Andrei Mikheev
Periods, Capitalized Words, etc (1999)

David D. Palmer (2000)
Tokenisation and Sentence Segmentation,
Robert Dale, Hermann Moisl and Harold Somers (Eds)
in A Handbook of Natural Language Processing, Marcel Dekker

David D. Palmer and Marti A. Hearst,
Adaptive Multilingual Sentence Boundary Disambiguation

J. Reynar and A. Ratnaparkhi,
A Maximum Entropy Approach to Identifying Sentence Boundaries

Word sense disambiguation, word distributions

Bookstein, Abraham, S.T.Klein, and T. Raita. Clumping Properties of Content-bearing Words.

Lin, Dekan. Automatic retrieval and clustering of similar words.

Yarowsky, D. Hierarchical Decision Lists for Word Sense Disambiguation Computers and the Humanities, 34(2):179-186, 2000.

Yarowsky, D. ``Word Sense Disambiguation.'' In R. Dale, H. Moisl and H. Somers (eds.) The Handbook of Natural Language Processing. New York: Marcel Dekker, pp. 629-654, 2000.

Yarowsky, D. `` Unsupervised Word Sense Disambiguation Rivaling Supervised Methods.'' In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA, pp. 189-196, 1995.

Latent semantic indexing: dimensionality-reduction on the space of word-senses

Patterns in unstructured data: a presentation to the Andrew W. Mellon Foundation from Middlebury College

LSI homepage

Original paper by Deerwester, Dumais et al. Indexing by Latent Semantic Analysis (1990).

Morphology and morphology learning


Named Entity Recogntion

Cucerzan, S. and D. Yarowsky. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings, 1999 Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, pp. 90-99, 1999.

Michael Collins and Yoram Singer 1999. Unsupervised Models for Named Entity Classification . In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

Word Alignment

Reducing Parameter Space for Word Alignment Herve Dejean, Eric Gaussier, Cyril Goutte, and Kenji Yamada

Grammar induction

Eric Brill. 1993. Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach. ACL 1993.

Stanley F. Chen. Bayesian Grammar Induction for Language Modeling. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 228-235, 1995.

Alexander Clark (2003) Pre-processing very noisy text, Proceedings of Workshop on Shallow Processing of Large Corpora, Corpus Linguistics 2003, Lancaster. (to appear) [Postscript][PDF]

Alexander Clark (2003) Combining Distributional and Morphological Information for Part of Speech Induction, Proceedings of EACL 2003, [Postscript][PDF]

Franck Thollard and Alexander Clark (2002) Shallow Parsing using Probabilistic Grammatical Inference, Proceedings of the 6th International Colloquium on Grammatical Inference, (ICGI-2002) [Postscript][PDF] Amsterdam, the Netherlands

Franck Thollard and Alexander Clark (2002) Apprentissage d'Automates Probabilistes Déterministes, Proceedings of CAp2002 [Postscript][PDF]

Franck Thollard and Alexander Clark (2002) Détection de Groupes Nominaux par Inférence Grammaticale Probabiliste, Proceedings of CAp2002 [Postscript][PDF]

Alexander Clark (2002) Memory-Based Learning of Morphology with Stochastic Transducers, Proceedings of ACL 2002, Philadelphia, USA (to appear) [Postscript][PDF]

Alexander Clark (2001) Unsupervised Language Acquisition: Theory and Practice, DPhil Thesis, University of Sussex, December 2001 [PDF]

Alexander Clark (2001) Learning Morphology with Pair Hidden Markov Models, Proceedings of the Student Workshop at ACL 2001, July 2001, Toulouse, France. [Postscript][PDF]

Alexander Clark (2001) Unsupervised Induction of Stochastic Context-Free Grammars using Distributional Clustering, Proceedings of CoNLL 2001, July 2001, Toulouse, France. [Postscript][PDF]

Alexander Clark (2001) Partially Supervised Learning of Morphology with Stochastic Transducers, Proceedings of Natural Language Processing Pacific Rim Symposium (NLPRS) 2001, November 2001, Tokyo. [Postscript][PDF]

Alexander Clark (2000) Inducing Syntactic Categories by Context Distribution Clustering, Proceedings of CoNLL 2000, September 2000, Lisbon. [Postscript][PDF,]

Alexander Clark (1998) Inducing fields in NLP. MSc Dissertation at University of Sussex. [ Compressed postscript]

Mennon M. van Zaanen. Bootstrapping structure into language: Alignment-based learning. 2001. PhD dissertation, University of Leeds.

Pieter W. Adriaans and Erik de Haas Grammar Induction as Substructural Inductive Logic Programming, . In James Cussens and Saso Dzeroski, editors, Learning Language in Logic, volume 1925 of Lecture Notes in Computer Science, pages 127--142. Springer-Verlag, June 2000.

Dan Klein and Christopher D. Manning, "Natural Language Grammar Induction Using a Constituent-Context Model", Advances in Neural Information Processing Systems 14 (NIPS-2001), 2001. [ps] [pdf] [bib]

Dan Klein and Christopher D. Manning, "Distributional Phrase Structure Induction", Proceedings of the Fifth Conference on Natural Language Learning (CoNLL-2001), 2001. [ps] [pdf] [bib]

Parsing (Project Page) (Download Parser)



Belz, Anja. An approach to the automatic acquisition of phonotactic constraints.

Ellison, T. Mark. The Machine Learning of Phonological Structure. 1992. Dissertation, University of Western Australia.


Grünwald, Peter. A Minimum Description Length approach to grammar inference.



Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. Regulatory element detection using a probabilistic segmentation model.

Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis.

Bussemaker, Harmen J., Hao Li, and Eric D. Siggia. 2001. Nature Genetics. Regulatory element detection using correlation with expression.

Mathematical background

Principal components analysis

Language modeling, parsing

Eugene Charniak, 1997. Statistical Parsing with a Context-free Grammar and Word Statistics Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), pp 598-603.

Michael Collins, 1997. Three generative, lexicalised models for statistical parsing 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics: Proceedings of the Conference (ACL/EACL '97), pp 16-23.

Grammar makes a comeback: Ciprian Chelba and Frederick Jelinek, 1998. Exploiting syntactic structure for language modeling COLING-ACL '98 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics: Proceedings of the Conference, Vol 1, pp 225--231
Week 3 Shallow Parsing

Machine translation

Bonnie Dorr, Pamela Jordan, and J. Benoit, 1999, A Survey of Current Paradigms in Machine Translation Advances in Computers, edited by Marvin V. Zelkowitz, Vol 49, Academic Press

Peter E Brown; Vincent J. Della Pietra; Stephen A. Della Pietra; Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, Volume 19, Number 2.

Kevin Knight, 1999. A Statistical MT Tutorial Workbook.

John Hutchins: Machine translation: a brief history.

Sylvia Wong: Machine translation techniques.

Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett, John D. Lafferty, Robert L. Mercer, Harry Printz, and Lubos Ures, 1994. The Candide System for Machine Translation. Proceedings of the 1994 ARPA Workshop on Human Language Technology.


Co-training: Blum, A. and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. Proceedings of the 11th annual Conference on Computational Learning Theory. New York: ACM Press.


Other syllabi

Mark Johnson (Brown)

Steven Bird (Penn)

James Pustejovsky (Brandeis)

Ray Dougherty (NYU)

Mark Gawron (SDSU)

Lillian Lee (Cornell)