Linguistica
Download software now (latest: v3.2, December 2003 - older versions temporarily unavailable)

John Goldsmith (email, homepage)
Departments of Linguistics and Computer Science
University of Chicago

Areas of this page:
  • What is Linguistica?
  • Understanding Linguistica
  • Using Linguistica

    What is Linguistica?
    Linguistica is a program which can be used to explore the unsupervised learning of natural language, with primary focus on morphology, which is to say, word-structure. It runs under Windows, and is written in C++. Its demands on memory depend on the size of the corpus analyzed. We are currently developing a Linux and a Macintosh version.

    Unsupervised learning refers to the computational task of making inferences (or acquiring knowledge) about the structure that lies behind some set of data without any direct access to that structure. In the case of unsupervised learning of morphology, the task is to infer the morphemes, and the possibilities of morpheme-combinations, for a set of words, based on no knowledge whatsoever of the language from which the words are drawn.

    Segmentation is the first task of this process: figuring out where the morpheme breaks are in the words, and what are the stems, what are the suffixes, and so forth. Most of Linguistica's functionality, at this point, goes into making these decisions. It has some limited capabilities for learning allomorphy, which is to say, the ways in which stems or affixes are modified in particular contexts (for example, most stems that end in -e in English will drop the -e before various vowel-initial suffixes, such as -ed, -ing, and -ity).

    This document presents a brief description of how to use the program, and links to other documents which explain the ideas incorporated in Linguistica.

    Understanding Linguistica
    This section attempts to present a bit of background about Linguistica, to help better understand the function of the program.

  • A technical description of the Minimum Description Length model that motivates this work has appeared in Computational Linguistics 27:2 (2001) pp. 153-198, under the title Unsupervised Learning of the Morphology of a Natural Language.
  • Some PowerPoint slides describing the ideas behind this program can be accessed here. Slides from a talk at Microsoft on Nov 9 2001 are available here.
  • Linguistica relies on quality text files in an interesting language. If you don't have any, there are some good links here.
  • This document is likely incomplete. I am more than pleased to answer inquiries by email.

    Using Linguistica

    First, a couple of things to note.

  • The details of these screens change frequently as I modify the program. I'll do my best to keep the graphics and documentation up-to-date, but there will undoubtedly be slippage from time to time.
  • Linguistica is best viewed if your display is set at 1280x1024 (not all systems and displays are capable of this resolution - these system-wide settings can be changed in your Control Panel's Display Properties). If you have your display set at a lower resolution (like 1024x768), the program may not look exactly as it should. If something appears to be missing (such as the Text area), you can try grabbing at the bottom of the Linguistica window with the mouse and pulling the edge of that area up to a size that suits you.
    The screen is divided into four main parts: on the left, the Upper Tree, the Lower Tree, and on the right, the Collections, and the Text.



    Most of the user input occurs through the drop-down menus. In earlier versions, this was largely accomplished through the Lower Tree on the left, and some of that functionality remains there; it is being phased out. Most of the program's output occurs in the Upper Tree and Collections areas. The Upper Tree shows general information about the given corpus and works in tandem with the Collections area, which shows more detailed information about the corpus. The Text area is used for feedback to the user in a few cases. The other major source of information for the user is the optional Log File. If the user turns on the Log File, the program will write a detailed description of its operations to a text file, at a location specified by the user. The Log File must be turned on for each individual operation that the user wants logged.

    How to Begin
    The first operation is to read in a corpus, or a part of a corpus. The default setting for Linguistica's corpus input is 5,000 words: this is the number of words from a corpus that the program will read. If you wish to change this setting, select "Words requested" in the Lower Tree on the left. A pop-up window will appear in which you can specify a different number of words to be read from the corpus. This number refers to the total number of words (tokens) read, not word types.
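
    The difference between tokens and types can be illustrated with a small sketch. It is not Linguistica's own code: the function name read_tokens and the whitespace-based tokenization are assumptions made for illustration only.

```python
from collections import Counter

def read_tokens(text, words_requested=5000):
    """Return at most `words_requested` word tokens from `text`.

    Assumption: tokens are whitespace-separated; Linguistica's actual
    tokenization is not described here.
    """
    return text.split()[:words_requested]

sample = "the dog saw the cat and the cat saw the dog"
tokens = read_tokens(sample, words_requested=8)
counts = Counter(tokens)      # one entry per word type, with its corpus count

print(len(tokens))            # number of tokens read
print(len(counts))            # number of distinct word types among them
print(counts.most_common(1))  # the most frequent type and its count
```

    Requesting 8 words stops after 8 tokens, however many distinct types they contain.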

    To read a corpus, click on the third menu item, "Reading", and then click on "Read corpus". A window will appear in which you identify the location of the text file (not a word processor document) which you wish to read. If you have already run Linguistica and have previously read in a corpus, it will remember the location of the file, and you can simply click "Reread corpus" to reread the same old corpus. Shortcut key: Type Control-D to reread the same corpus.

    When the reading is complete, "Words" will appear in the Upper Tree area at the top left of the screen (which you saw earlier), under "Lexicon". The Lexicon contains many collections of information, including Words, Stems, Suffixes, Prefixes, and Signatures. When these collections are empty, they do not appear in the Upper Tree area. You may click on Words, and the words of the corpus will appear as a list in the Collections area. The columns in the Collections area may be too narrow or too wide for your purposes. You can change their widths by grabbing an edge at the top of a column with your mouse and moving it to the left or the right. You can also sort by any of the column values by clicking on the title of that column. Clicking on the "Corpus Count" column is particularly useful, since it brings the most frequent words to the top of the list. You can return to an alphabetical display of the words by clicking on the top of the first column, "Words". If you wish to see the words organized into a "trie", you can click on the "Forward trie" line, also under "Lexicon" in the Upper Tree area.
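
    For readers unfamiliar with tries, here is a minimal sketch of a forward trie: words that share their first letters share a path from the root. The nested-dictionary layout and the names build_trie and words_with_prefix are assumptions for illustration, not the program's internal data structure.

```python
def build_trie(words):
    """Build a forward trie as nested dicts, one level per letter."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["#"] = {}  # end-of-word marker (assumes "#" never occurs in a word)
    return root

def words_with_prefix(trie, prefix):
    """List every word in the trie that begins with `prefix`."""
    node = trie
    for letter in prefix:
        if letter not in node:
            return []
        node = node[letter]
    found = []

    def walk(current, so_far):
        for key, child in sorted(current.items()):
            if key == "#":
                found.append(so_far)
            else:
                walk(child, so_far + key)

    walk(node, prefix)
    return found

trie = build_trie(["love", "loves", "loved", "loving", "sane"])
print(words_with_prefix(trie, "lov"))
```

    The point at which a shared path branches (here, after lov) is a natural place to look for a morpheme break.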

     

    Finding a suffixal system: signature-based analysis
    Let us suppose you have chosen a language such as English in which you wish to discover the suffixal system. Click menu: Find suffix system, and click item Run all. (Keyboard shortcut: type Control-S.) We will return to other such actions you may take; for now, let's look at what results you may obtain if you perform this operation. You may see something like this in the Upper Tree:

    In the Lexicon, then, we have 29 suffixes, and 105 signatures built up out of them, along with 819 stems. Of the 3,049 words, 1,103 were analyzed. You should click on each of these groups in turn, and see that each is displayed in the Collections area on the right as you do so. When the collections get large, it may take a while to display a collection (10 seconds or more if there are many more than 5,000 members).
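
    To make the notion of a signature concrete, the following sketch groups stems by the set of suffixes they occur with. It takes the stem/suffix splits as given and does not show how Linguistica discovers them; the function name signatures and the toy input are assumptions for illustration.

```python
from collections import defaultdict

def signatures(analyses):
    """Map each signature string to the sorted list of stems that take it.

    `analyses` is an iterable of (stem, suffix) pairs; the empty suffix
    is written "NULL". Suffixes are joined with "." in sorted order.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix)

    stems_of = defaultdict(list)
    for stem, suffixes in suffixes_of.items():
        stems_of[".".join(sorted(suffixes))].append(stem)
    return {sig: sorted(stems) for sig, stems in stems_of.items()}

# Toy input, not taken from a real run of the program.
toy = [
    ("jump", "NULL"), ("jump", "ed"), ("jump", "ing"), ("jump", "s"),
    ("walk", "NULL"), ("walk", "ed"), ("walk", "ing"), ("walk", "s"),
    ("lov", "e"), ("lov", "ed"), ("lov", "es"), ("lov", "ing"),
]
result = signatures(toy)
for sig, stems in sorted(result.items()):
    print(sig, stems)
```

    Stems that share exactly the same suffixes end up under one signature, which is why the number of signatures is much smaller than the number of stems.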

     

    Saving to file
    To save an analysis to a text file that you can open in a text editor or a spreadsheet, click menu item File and then Save as. A window will open, and you select a folder and a description of your project (e.g., WarrenCommissionReport_50K). Linguistica will save a set of about 12 different text files: a list of words, stems, affixes, signatures, etc. Most of the information in them is transparent; I will provide further documentation below on the less transparent values.

     

    Prefixes
    You can find prefixes either in the entire set of words in the lexicon or, if you have performed a suffix analysis as described above, in the stems (that is, in the suffixal stems that you have discovered). The latter analysis is generally more extensive and accurate. To do this, choose the menu Find prefixes, and then either prefixes of words or prefixes of suffixal stems.

     

    Allomorphy
    At present, Linguistica is capable of determining a limited amount of allomorphy in stems. In many languages (including English), stem-final material is deleted in front of certain suffixes. For example, stem-final -e is deleted in English before a number of suffixes: thus love, but lov-ing and not love-ing; sane, but sanity and not saneity. This can be discovered automatically with the menu Allomorphy and the item Find allomorphy (keyboard shortcut: Control-A).

    Linguistica looks for reanalyses in which certain material that had previously been analyzed as a suffix will be reintegrated into the stem, and other suffixes will be informed that they are capable of deleting that material when it appears before them. For example, the words love, loves, loved, and loving, which had been analyzed as lov + signature e.ed.es.ing, will be reanalyzed with the stem love and the suffixes NULL, ed, s, and ing. The suffixes ed and ing will be informed that they are capable of deleting a preceding e, and this is indicated by placing an e in angle brackets before the suffix, thus: <e>ing and <e>ed. Thus the new signature for love is NULL.<e>ed.<e>ing.s, and this signature correctly deals with stems that end with -e and those that do not.

    You may note that Linguistica treats y-final nouns and verbs this way: thus academy/academies is treated as based on the stem academy and the suffixes NULL and <y>ies.
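
    The angle-bracket notation can be read as a small rule for joining a stem and a suffix. The sketch below is one way to interpret it, under the assumption that the bracketed material is removed from the end of the stem when present and ignored otherwise; the function name attach is hypothetical.

```python
import re

def attach(stem, suffix):
    """Join a stem and a suffix written in the NULL / <x>suffix notation."""
    if suffix == "NULL":
        return stem
    match = re.fullmatch(r"<([^>]+)>(.*)", suffix)
    if match:
        deleted, rest = match.groups()
        # Delete the bracketed material only if the stem actually ends in it.
        if stem.endswith(deleted):
            stem = stem[: -len(deleted)]
        return stem + rest
    return stem + suffix

signature = "NULL.<e>ed.<e>ing.s"
for stem in ("love", "jump"):
    print([attach(stem, suffix) for suffix in signature.split(".")])
print(attach("academy", "<y>ies"))
```

    The same signature produces the right forms for love (with deletion) and for jump (without), which is the sense in which it covers both kinds of stem.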

     

    Rich morphologies
    As we describe elsewhere, the signature-based analysis that we have sketched above makes certain assumptions about morphology that will fail in many languages. The primary assumption boils down to this: a sufficient number of words can be divided into a stem plus one affix for the resulting generalizations to extend over the whole lexicon. This may be too strong an assumption, and we are currently working on a more general family of algorithms. The user may access some of these functions under the Rich morphologies menu item. If a lexicon has been read in, and the submenu item Find morphemes is selected, then a function is called that analyzes words into two or three pieces, and seeks subgeneralizations. This is significantly slower than the operations described above. These subgeneralizations appear in two collections, listed under "Words" in the Lexicon of the Upper Tree. The first is called Initial templates, and the second is called Templates.

    Consider a typical initial template, arising in a 15,000 word corpus of Swahili:

  • This initial template consists of #a in the first column (a is the 3rd person singular verbal prefix; note that a word boundary # is prefixed to all words in this function), a second column of 16 items of various functions, and a final verb stem fanya, Swahili for "do". In the second column, we find various items: some tense markers (ka, ki, li, me, na, ta, taka) and in some cases, a following object marker (ji, ki, zi) or a morpheme used in a relative or adverbial construction (lo, po, zo, cho). Many of the initial templates are spurious generalizations. Initial templates consist of two or three columns, in which exactly one column has more than one member; all selections across the template are forms actually found in the corpus.
  • Templates: In the same Swahili analysis, the following templates are among those found:

  • Templates are formed by collapsing initial templates, and have the property that two columns have more than one member. Again, as with initial templates, all selections across the template (cartesian products, if we think of each column as a set) are forms actually found in the corpus. In the first Swahili example, the first column has the 3rd person sg subject marker, the second column has two tense markers (past and present), and the third column has some morphologically complex verb stems (the verb stem proper preceded in some cases by relative clause markers). In the second example, the suffix -ni is correctly identified, and in the third example, the negative and affirmative infinitive prefixes are identified in the first column.
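
    The defining property of a template, that every selection across its columns is an attested word, can be sketched as a cartesian product check. The function names and the three-word toy corpus are assumptions for illustration; the word boundary # is omitted for simplicity, and nothing here shows how Linguistica finds the columns.

```python
from itertools import product

def template_words(columns):
    """Spell out every selection across the columns of a template."""
    return ["".join(parts) for parts in product(*columns)]

def is_attested(columns, corpus_words):
    """True if every word the template generates occurs in the corpus."""
    return all(word in corpus_words for word in template_words(columns))

def multi_member_columns(columns):
    """Count columns with more than one member (1: initial template, 2: template)."""
    return sum(1 for column in columns if len(column) > 1)

# Toy data built from morphemes mentioned above; not output from the program.
columns = [["a"], ["li", "na", "ta"], ["fanya"]]
toy_corpus = {"alifanya", "anafanya", "atafanya"}

print(template_words(columns))
print(is_attested(columns, toy_corpus))
print(multi_member_columns(columns))
```

    Adding a member to a column is only legitimate if all the new products are also in the corpus, which keeps templates from overgenerating.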

    Signatures
    Returning to the signature-based analysis: if the user selects Signatures, they will be displayed in the collection window (the keyboard shortcut Shift-S brings the suffix signatures up, or one may use the menu Display followed by Display suffix signatures). If the user clicks on a signature in the first column, the stems associated with that signature will be displayed in the Text box below.

    By default, signatures are ranked by their robustness, which is roughly the number of letters saved by the analysis, compared with the total number of letters in the original words which are analyzed in the signature. That is, the robustness of a signature is (roughly) the number of letters in the original words minus the number of letters in the signature. The signatures can be re-sorted by clicking on the header at the top of various columns of the display. "Remarks" gives an indication of which function was responsible for the identification of the signature.
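
    One way to read this rough definition is sketched below: count the letters needed to spell out every word in full, and subtract the letters needed to list each stem and each suffix once. This is an interpretation of the description above, not the program's exact formula; the function name robustness is hypothetical, NULL is assumed to cost zero letters, and the angle-bracket notation is not handled.

```python
def robustness(stems, suffixes):
    """Letters saved by listing stems and suffixes once instead of every word."""
    def letters(suffix):
        return 0 if suffix == "NULL" else len(suffix)

    # Letters needed if every stem + suffix combination is spelled out in full.
    spelled_out = sum(len(stem) + letters(suffix)
                      for stem in stems for suffix in suffixes)
    # Letters needed to list each stem once and each suffix once.
    analyzed = sum(len(stem) for stem in stems) + sum(letters(s) for s in suffixes)
    return spelled_out - analyzed

print(robustness(["jump", "walk"], ["NULL", "ed", "ing", "s"]))
```

    Under this reading, a signature with many stems and many suffixes saves more letters than one with few, so it ranks higher.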



    Now you can click, successively, on the rest of the items within "For first-time users" in the Tree: (1) Successor Freq 1, (2) Known stems and suffixes, (3) Loose fit, (4) Check signatures, and (5) Find prefixes (of suffixal stems). You will find the resulting data of these actions under "Lexicon" in the Statistics area.

    If you click on one of these items, details will appear in the Collections area. These steps have now provided you with a morphological analysis of the suffixal system of the language. We will discuss later what these steps consist of.





    Thanks to Mike LeBeau for work on this webpage.