An Overview of Industrial Linguistics

CS 324 583 01

Ling 260 391 01
John Goldsmith and Sean Fulop

ja-goldsmith@uchicago.edu (http://humanities.uchicago.edu/faculty/goldsmith)
sfulop@midway.uchicago.edu
(http://people.cs.uchicago.edu/~sfulop/index.htm)

 

Tuesdays 5:30 to 8:30 pm Spring quarter 2001

Teaching assistant: Derrick Higgins   dchiggin@midway.uchicago.edu 


General requirements: The grade will be based on homework assignments, listed below, and a term project, which will be a computational project selected from the list below (though we will be willing to consider a different project if you want to make the case). [List does not presently appear here.] You will notice that there is no homework assigned after Week 6; this is so that you can concentrate on your term project. 

Programming skills: Perl is the language of choice for many of the projects involved in this course. If you don't already know Perl, but know C, we think it might very well be worth your time to spend a long evening and learn enough Perl to write code for these projects. 


Readings and links

Two kinds of reading: Some of the assigned readings below are readings for background; others are straight readings. The difference is that reading for background is material that you should read to get the big picture and so that you can go back there later if you find you need to understand a concept in detail. You are expected to know the material in the reading-for-background sections. Straight reading is material that you are expected to study carefully and learn. Assignments that are not marked "reading for background" are intended as straight reading assignments.

Principal textbook: Daniel Jurafsky and James H. Martin (2000). Speech and Language Processing. Prentice Hall.

Textbook website: http://www.cs.colorado.edu/~martin/SLP/slp-web-resources.html

Other readings: Other assigned readings will be distributed gratis to registered students.

 

Assignments and suggested readings from:
Charniak, Eugene (1993). Statistical Language Learning. Cambridge, MA: MIT Press.
Jelinek, Frederick (1997). Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press.
Keller, Eric (1994). Fundamentals of Speech Synthesis and Speech Recognition. John Wiley & Sons.
Manning, Christopher D., and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
Osherson, D. N., Stob, M., and Weinstein, S. (1986). Systems that Learn. Cambridge, MA: MIT Press.
Sproat, Richard (1992). Morphology and Computation. Cambridge, MA: MIT Press.

An online (free) resource:  Survey of the State of the Art in Human Language Technology (1996) http://cslu.cse.ogi.edu/HLTsurvey/



Week 1             Overview of the course    Fulop and Goldsmith. 

 

Introduction: who we are and what our backgrounds are. Organization: readings, assignments, meetings, office hours and appointments. Programming expectations; a few important words about Perl. A quick spin through the whole syllabus.

Reading for next week: Keller, Chapter 1. Reading for background: Jurafsky and Martin, Chapters 1 and 2; you will need some of this material to follow Chapter 3. Also read for next week: Jurafsky and Martin, Chapter 4, pp. 91-110 and 120-130, and Chapter 5, pp. 141-184. Note that parts of Chapter 5 require some knowledge of probability, which some of you may not yet have (we will cover it in Week 6); do the best you can. The Viterbi algorithm is extremely important, and is used widely both in computational linguistics and in other computational areas. 

Read "Comparative Evaluation of Letter-to-Sound Conversion Techniques for English Text-to-Speech Synthesis," R.I.Damper et al. Damper et al

Assignment for next week: Download the Brown corpus. Write a program that produces a frequency-sorted list of the words in the corpus. Decide how to treat punctuation and the distinction between capitalized and non-capitalized words. Submit the code and the output, plus any thoughts you have on significant decisions you needed to make in writing the program. If you are programming in Perl, you can use its built-in hashes ("associative arrays"). If you are writing in C++, you will need to learn to use a "map" class (hash); with luck you won't have to write one yourself, but that option is always there. 
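A minimal sketch of the core of such a program in Perl (the naive whitespace tokenization is a placeholder; deciding how to handle punctuation and capitalization is your part of the assignment):

    #!/usr/bin/perl -w
    # Sketch: print a frequency-sorted word list from a corpus file
    # given on the command line. Tokenization is deliberately naive.
    use strict;

    my %freq;
    while (my $line = <>) {
        chomp $line;
        for my $word (split /\s+/, $line) {
            next if $word eq '';
            $freq{$word}++;
        }
    }
    # Sort by descending frequency, breaking ties alphabetically.
    for my $word (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
        print "$freq{$word}\t$word\n";
    }

Run it as, e.g., perl wordfreq.pl Browncorpus.txt > freqlist.txt (the filenames are placeholders).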

On the Brown corpus: http://www.hit.uib.no/icame/brown/bcm.html#bc3  There are many places to download it; one is http://humanities.uchicago.edu/faculty/goldsmith/data/Browncorpus.txt 


Week 2     Phonetics and phonology         Fulop and Goldsmith

Phonology slides (in PowerPoint format). We take "phonology" in a very broad sense, including spelling and punctuation.

Assignment: Letter-to-sound relationships in English. Download the labeled corpus Nettalk.data.gz at ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/dictionaries. Using this as your data source, write a program that determines the phonemic realization of each letter in English, associating a proportion with each phoneme. E.g., the letter L is realized 91% of the time (or .91) as the phoneme L, and 9% (.09) as NULL (e.g., in calm). Then write a program to do the inverse, showing for each phoneme what letters can represent it, along with frequencies. 
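A minimal sketch of the counting half of the assignment in Perl. It assumes the NETtalk convention that the phoneme field has one symbol per letter of the word, with '-' marking a silent letter (NULL); check the corpus documentation, since the field layout assumed here may differ:

    #!/usr/bin/perl -w
    # Sketch: for each letter, tabulate how often it is realized as each
    # phoneme, and report the proportions. Assumes NETtalk-style
    # alignment: word and phoneme string have equal length, '-' = silent.
    use strict;

    my %count;   # $count{$letter}{$phoneme} = frequency
    my %total;   # $total{$letter} = total occurrences of the letter

    while (my $line = <>) {
        next if $line =~ /^\s*(#|$)/;              # skip comments, blanks
        my ($word, $phones) = split /\s+/, $line;  # assumed field order
        next unless defined $phones && length($word) == length($phones);
        for my $i (0 .. length($word) - 1) {
            my $letter  = lc substr($word,   $i, 1);
            my $phoneme = substr($phones, $i, 1);
            $count{$letter}{$phoneme}++;
            $total{$letter}++;
        }
    }
    for my $letter (sort keys %count) {
        for my $ph (sort { $count{$letter}{$b} <=> $count{$letter}{$a} }
                    keys %{ $count{$letter} }) {
            printf "%s -> %s  %.3f\n",
                   $letter, $ph, $count{$letter}{$ph} / $total{$letter};
        }
    }

The inverse direction (phoneme to letters) is the same tabulation with the roles of letter and phoneme swapped.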

Some on-line resources:
On punctuation: Say, Bilge and Akman, Varol (1997) Current Approaches to Punctuation in Computational Linguistics. Computers and the Humanities 30(6):457-469  http://cogprints.soton.ac.uk/documents/disk0/00/00/01/98/index.html
Letter-to-sound (LTS): Issues in Building General Letter to Sound Rules (Black et al.)

Reading for next week: Jurafsky and Martin, Chapter 3: read pp. 57-71. Reading for background: pp. 71-82. Straight reading: pp. 82-88. Read about the Viterbi algorithm, which is very important, and which we'll encounter three times during the quarter. Jurafsky and Martin cover it on pp. 177ff. and 244ff.; read those passages. 

Suggested: Sproat, Morphology and Computation. 


Week 3            English morphology     Goldsmith

We will begin with a discussion of the Viterbi algorithm, in connection with minimum string edit distance and probabilistic letter-to-sound conversion. PowerPoint slides. Any-browser-readable format.
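For reference, here is a minimal dynamic-programming sketch of minimum string edit distance in Perl, with unit costs for insertion, deletion, and substitution (Jurafsky and Martin's version charges 2 for substitution, an easy change). The same fill-in-the-table logic underlies the Viterbi algorithm:

    #!/usr/bin/perl -w
    # Sketch: minimum string edit (Levenshtein) distance by dynamic
    # programming, unit costs for insertion, deletion, substitution.
    use strict;

    sub min { my $m = shift; for (@_) { $m = $_ if $_ < $m } return $m }

    sub edit_distance {
        my ($s, $t) = @_;
        my @s = split //, $s;
        my @t = split //, $t;
        my @d;                           # $d[$i][$j] = distance between
        $d[$_][0] = $_ for 0 .. @s;      #   s[0..i) and t[0..j)
        $d[0][$_] = $_ for 0 .. @t;
        for my $i (1 .. @s) {
            for my $j (1 .. @t) {
                my $sub = ($s[$i-1] eq $t[$j-1]) ? 0 : 1;
                $d[$i][$j] = min($d[$i-1][$j]   + 1,     # deletion
                                 $d[$i][$j-1]   + 1,     # insertion
                                 $d[$i-1][$j-1] + $sub); # substitution
            }
        }
        return $d[@s][@t];
    }

    print edit_distance("intention", "execution"), "\n";   # prints 5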

Morphology Powerpoint slides  

Reading for next week: Jurafsky and Martin: read Chapter 8 up to p. 298, all of Chapter 9, and the beginning of Chapter 10. 

Assignment for next week: Write a program to find compounds in a corpus of English. Run it on a large corpus (e.g., the Brown corpus), and determine (by sampling, if necessary) how well it works. Submit the code, the output, and its score, and explain your scoring method. Hint: one natural way to look for compounds is to find words that can be spelled as the concatenation of two independently existing words in the corpus. Hint 2: that strategy will also pick up false compounds like "mean" ("me" + "an") and "meat" ("me" + "at"); be sure to deal with that problem.
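A minimal sketch of the basic strategy in Perl, assuming the input is a word list, one word per line (e.g., extracted from your Week 1 frequency list). The minimum-part length $MIN below is one simple, adjustable guard against false splits; doing better than that is the interesting part of the assignment:

    #!/usr/bin/perl -w
    # Sketch: find words spellable as the concatenation of two other
    # words in the corpus. Input (assumed): one word per line.
    use strict;

    my %is_word;
    while (my $w = <>) {
        chomp $w;
        $is_word{lc $w} = 1 if $w ne '';
    }

    my $MIN = 3;   # crude guard against false splits like "me" + "an"
    for my $word (sort keys %is_word) {
        for my $i ($MIN .. length($word) - $MIN) {
            my $left  = substr($word, 0, $i);
            my $right = substr($word, $i);
            if ($is_word{$left} && $is_word{$right}) {
                print "$word = $left + $right\n";
                last;   # report each candidate once
            }
        }
    }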


Week 4    Introduction to natural language syntax Fulop 

Syntax is the arrangement of words in sentences; most current theories of natural language syntax specify the organization of a sentence as a hierarchy of subconstituents in a syntactic structure.

Reading for next week: Jurafsky and Martin: the part of Chapter 8 that you haven't yet read, and Chapter 10.

Assignment for next week: Do J&M Exercises 9.1, 9.2, 9.3


Week 5 Current approaches to syntax Fulop

This week we consider theoretically motivated ways of computing syntactic structures and recognizing the sentences that have them.

Reading for next week:

Read: Introduction to probability for linguists (PDF format; the sigmas aren't visible). Also available in Word and HTML formats.

Assignment for next week: Do J&M Exercise 10.2


Week 6 Basics of probability and information theory Goldsmith  
Good additional resources:
Charniak, Eugene  Statistical Language Learning.
Manning and Schütze

Assignment: do the exercises in Introduction to probability for linguists (the reading for this week).

Reading for next week: Reading for background: Goldsmith, "Unsupervised learning of the morphology of a natural language" (to appear in Computational Linguistics). Read: Systems that Learn, Chapter 1. 


Week 7 Learnability and some aspects of machine learning Fulop and Goldsmith   

Reading for next week: Jurafsky and Martin, Chapter 6, and reread Chapter 8. A good additional resource, which we recommend, is Manning and Schütze, Chapter 6.


Week 8 N-gram language models and the data-sparseness problem Goldsmith

PowerPoint slides
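As a concrete way to see the data-sparseness problem, here is a minimal Perl sketch (the two filenames and the whitespace tokenization are placeholders): it collects bigram counts from a training file and then reports how many bigram tokens in a held-out test file never occurred in training, and so would receive probability zero under maximum-likelihood estimation. Smoothing methods exist precisely to reassign probability mass to such unseen events:

    #!/usr/bin/perl -w
    # Sketch: count training bigrams, then measure how many test bigrams
    # are unseen -- a direct look at data sparseness.
    use strict;

    my ($train_file, $test_file) = @ARGV;   # placeholders

    my (%unigram, %bigram);
    my @train = read_words($train_file);
    for my $i (0 .. $#train - 1) {
        $unigram{ $train[$i] }++;                 # MLE denominator:
        $bigram{ "$train[$i] $train[$i+1]" }++;   # P(w2|w1) = c(w1 w2)/c(w1)
    }

    my @test = read_words($test_file);
    my ($seen, $unseen) = (0, 0);
    for my $i (0 .. $#test - 1) {
        $bigram{ "$test[$i] $test[$i+1]" } ? $seen++ : $unseen++;
    }
    printf "test bigrams: %d seen, %d unseen (%.1f%% get probability zero)\n",
           $seen, $unseen, 100 * $unseen / ($seen + $unseen || 1);

    sub read_words {
        my ($file) = @_;
        open my $fh, '<', $file or die "cannot open $file: $!";
        local $/;                     # slurp the whole file
        return grep { $_ ne '' } split /\s+/, lc <$fh>;
    }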

Reading for next week: Jurafsky and Martin, Chapter 5 and Chapter 7 (partial review).


Week 9 Speech recognition; Hidden Markov models. Fulop

Reading for next week: Jurafsky and Martin, pp. 130-133; Keller, Chapter 6, "Formant synthesis." 

Also, Goldsmith, John. 1999. Dealing with prosody in a Text to Speech system. International Journal of Speech Technology 3: 51-63.

Good additional resources:
Charniak; Manning and Schütze; Jelinek.


Week 10 Speech synthesis and intonation Fulop and Goldsmith

Some on-line resources:
A Short Introduction to Text-to-Speech Synthesis (Thierry Dutoit)


Term Projects

1. Read Jurafsky & Martin Chapter 11; implement the modified Earley algorithm for unification parsing on p. 431, and test it on a toy example.

2. Do Jurafsky & Martin Exercise 7.3, and implement the resulting version of the Viterbi algorithm. Show that it works by providing some toy inputs.

3. Develop a letter-to-phoneme conversion system, and a method for testing how well it works. This could be done for English, or for another language.

4. Develop a finite-state morphology along the lines described in Jurafsky and Martin.