a Windows program to produce an automatic morphological analysis of a corpus.


Updated March 12, 1999

This is old hat now. Go to the Linguistica page for current work.

There is a new version of Automorphology for Windows (version 1.1.1) now available on this page. Because it is compiled under the newest version of Microsoft's C++, I'm posting the version that doesn't require any other programs to be on your computer. Thus this program is a bit larger -- it's 536K, still not too large, all things considered. Two copies of it will fit on a floppy.

{ You can still download the older version, if you wish: WinAutomorphology Lite: just click on that underlined hyperlink. (This is version 2 [Sept 28 1998]; it runs faster, and requires less memory, than version 1. It's less than 100K in length.) If your computer tells you it can't find MFC42.DLL, you'll want to download a larger version of WinAutomorphology, which doesn't require that program (most people will have a copy of that file in their Windows/System directory). Click here for that larger version (400K). }

WinAutomorphology accepts as input a text file in any language and performs a morphological analysis. This version (version 1) requires that the average number of suffixes per word be not much more than 2, so it works well on Indo-European languages, and not so well on many other language families. It gives reasonable results with 5,000 words as input, but the results begin to be really interesting when it is given 100,000 or more.

Usage: First open a text file, using the same menu item that you would in a word processor. On your first try, leave the "maximum number of words" set to 5000. Click the Analyze icon (or select Analyze, under Analyze, in the menu). In a few seconds, you will get a morphological analysis of the corpus.

Then try increasing the number of words. You will find the analysis takes longer -- naturally -- as you analyze more words. It will also require more memory, and you will almost certainly run up against your computer's memory limit as you increase the number of words. I'd be grateful to know at what number of words you hit that glass ceiling, so to speak.

The Stems and Suffixes boxes on the screen are self-explanatory. The Signatures box indicates the morphological groupings that WinAutomorphology finds. A signature is a list of all the suffixes that appear with a given stem (where "NULL" stands for the zero-suffix). Thus "NULL-ed-ing-s" is the signature found on many regular English verbs, and "NULL-s-'s" is the signature found on many English nouns.
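To make the idea of a signature concrete, here is a minimal Python sketch. It is not the code inside WinAutomorphology: it assumes the stem/suffix split of each word is already known, and the function name `signatures` and the ordering of suffixes within a label (NULL first, then alphabetical) are choices made only for this illustration.

```python
from collections import defaultdict


def signatures(analyses):
    """Group stems by the set of suffixes they occur with.

    `analyses` is an iterable of (stem, suffix) pairs. An empty suffix
    is treated as the zero-suffix and written "NULL". The splits are
    assumed to be given; finding them is the hard part and is not
    attempted here.
    """
    suffixes_by_stem = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_by_stem[stem].add(suffix or "NULL")

    stems_by_signature = defaultdict(list)
    for stem, suffixes in suffixes_by_stem.items():
        # Illustrative ordering only: NULL first, then alphabetical.
        ordered = sorted(suffixes, key=lambda s: (s != "NULL", s))
        stems_by_signature["-".join(ordered)].append(stem)

    return {sig: sorted(stems) for sig, stems in stems_by_signature.items()}


# Hypothetical toy input, not taken from any real corpus.
toy_pairs = [
    ("jump", "NULL"), ("jump", "ed"), ("jump", "ing"), ("jump", "s"),
    ("walk", ""), ("walk", "ed"), ("walk", "ing"), ("walk", "s"),
    ("dog", "NULL"), ("dog", "s"),
]
for signature, stems in signatures(toy_pairs).items():
    print(signature, stems)
```

On this toy input, jump and walk share one signature and dog gets another, which is the kind of grouping the Signatures box displays.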

The Rules box lists the rules of allomorphy detected by Automorphology; finding them generally requires corpora of 50,000 words or more. It will note, for example, that e-final stems in English alternate with stems lacking that e (e.g., like and lik, as in lik-ing, lik-ed), or that a alternates with umlauted a in German, etc. It will list the stems that illustrate each such pattern.
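The e-final alternation can be illustrated with a short Python sketch. This is only a guess at the general shape of such a check, not the procedure Automorphology actually uses: it assumes a set of stems has already been found, and the name `e_final_alternations` is hypothetical.

```python
def e_final_alternations(stems):
    """Return (e_stem, short_stem) pairs such as ("like", "lik").

    `stems` is any collection of stem strings already produced by some
    analysis. A pair is reported when a stem ending in "e" and the same
    stem without that "e" are both present.
    """
    stem_set = set(stems)
    pairs = []
    for stem in sorted(stem_set):
        # Require at least one letter before the final e.
        if len(stem) > 1 and stem.endswith("e") and stem[:-1] in stem_set:
            pairs.append((stem, stem[:-1]))
    return pairs


# Hypothetical toy stems, not taken from any real corpus.
toy_stems = ["like", "lik", "move", "mov", "walk", "tree"]
print(e_final_alternations(toy_stems))
```

A real rule finder would also have to confirm that the two stems take complementary suffixes (like with NULL and -s, lik with -ing and -ed); this sketch checks only the spelling.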

If you don't have any text files, you can get some English, French, or German right here. Help yourself. Download by clicking, then save it to your disk.

You can see some results of running the program on the 1,000,000-word Brown corpus by clicking here.



If you would like to read about the algorithm embodied by the program, and some of the consequences, you may download a paper that I have written: Word 1997 version; RTF version; a brief overview on a web page. However, the newest version of the program (Version 1.1.1) includes a number of revisions since that paper was written, especially regarding prefixation -- the prefixation algorithm is quite different from what is described in that paper.


John Goldsmith

My homepage