Some Natural Language Processing Techniques

 

CORRUPTION IDENTIFIER

I chose this example because it’s a nice, neat little means to talk about statistical vs. linguistic methods. What lies at the root of the contention is that linguists don’t know enough about how computers work and programmers don’t know enough about how languages work. This is exacerbated by the fact that linguistics has gone through an era, dating from about my birth year (1958), in which linguists spent much of their time building castles in the air, relatively unconcerned with whether their theories were actually falsifiable; attempts to implement those fairly useless theories produced fairly useless results.

 

It’s not the case that this contention started with Google Translate. Programmers have always looked down on linguists. In the world of computational linguistics, programmers design the software, and hence define the parameters within which linguists must operate. Linguists write rules for the tools that programmers designed, and they get half the pay. It’s their own fault, too, because it’s easier to learn fluent C++ than to learn fluent Cantonese. Linguists can’t quite be eliminated from the equation, because they are needed to verify that the software is working for Cantonese; but when it doesn’t quite work, they start getting ideas, thinking, “What’s going on is X, and a solution would be Y,” which is not their province.

 

So when I start on a project, I almost always start with statistics and not linguistics, because:

  1. It’s easier, more scalable, less work, less thinking.
  2. It’s guaranteed to be true to the data. I use it to create my baseline; when my theorizing takes issue with my statistics, the statistics win. The more data you have, the more accurate a statistical tool will be, and for this reason, before the Internet and distributed processing, linguistic software was in greater demand.

 

The following is essentially a simple statistical tool. What I want to exemplify with this is that even when what you’re doing is essentially statistical, a little linguistic priming of the pump can go a long way. This tool takes a word list and an alphabet for a language and analyzes the word list statistically to determine which words in it deviate from the list’s own patterns.
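
 

Either way the analysis is primed, the tool has the same overall shape: one pass over the word list to accumulate counts, and a second pass to flag the words whose pieces are rare. Here is a rough sketch of that outer loop; it assumes the word list lives in a file called wordlist.txt (an arbitrary name) and that analyze_word is defined as in Method I or Method II below. The flagging pass is sketched further down.

# First pass: accumulate counts over the whole word list.
open(my $fh, '<', 'wordlist.txt') or die "can't open word list: $!";
my @words;
while(my $line = <$fh>) {
      chomp $line;
      next unless length $line;
      my $word = lc $line;
      push @words, $word;
      analyze_word($word);      # fills %ngrams (Method I) or %letter_sequences (Method II)
}
close($fh);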

 

The purely statistical version of analyze_word is ngram-based. Something like this:

 

Method I: statistical

 

my $MAX_NGRAM = 3;
my %ngrams;

# Count every character ngram of length 1 through $MAX_NGRAM in the word.
sub analyze_word
{
      my ($word) = @_;
      for(my $i = 0; $i < length($word); $i++) {
            for(my $j = 0; $j < $MAX_NGRAM; $j++) {
                  if($i + $j < length($word)) {
                        $ngrams{substr($word, $i, $j + 1)}++;
                  }
            }
      }
}
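
 

That only does the counting, of course. One rough way to turn %ngrams into a corruption score, using the @words array from the outer loop above: score each word by the count of its rarest ngram and sort ascending, so the words that deviate most from the list’s own patterns float to the top.

# Score a word by the count of its rarest ngram; corrupted words tend
# to contain at least one ngram that almost never occurs in the list.
sub word_score
{
      my ($word) = @_;
      my $min;
      for(my $i = 0; $i < length($word); $i++) {
            for(my $j = 0; $j < $MAX_NGRAM && $i + $j < length($word); $j++) {
                  my $count = $ngrams{substr($word, $i, $j + 1)} || 0;
                  $min = $count if !defined($min) || $count < $min;
            }
      }
      return defined($min) ? $min : 0;
}

# Lowest scores first: the head of this list is where the garbage is.
my @suspects = sort { word_score($a) <=> word_score($b) } @words;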

 

But you can get far better results by complicating this just a bit: tell the software which letters of the alphabet are consonants and which are vowels, and count linguistically significant components, namely onsets, vowel clusters, and final consonant clusters. This works much better because what happens at the boundary between consonants and vowels is, with few exceptions, pretty random, whereas vowel sequences and consonant sequences are tightly constrained; the ngram model therefore introduces a lot of noise.

 

Method II: a little linguistics

 

my %letter_sequences;
my $CONSONANTS = "bcdfghjklmnpqrstvwxyz";
my $VOWELS = "aeiou";

# Count the initial consonant cluster (the onset), each vowel cluster,
# and the final consonant cluster; skip the medial consonant clusters.
sub analyze_word
{
      my ($word) = @_;

      # Initial consonant cluster.
      $word =~ /^([$CONSONANTS]*)(.*)$/;
      $letter_sequences{$1}++;
      my $subword = $2;

      # Vowel clusters; remember the most recent consonant cluster seen.
      my $last_consonants = "";
      while(length($subword) > 0 && $subword =~ /^([$VOWELS]+)([$CONSONANTS]*)(.*)$/) {
            $letter_sequences{$1}++;
            $last_consonants = $2;
            $subword = $3;
      }

      # Final consonant cluster, provided the word was consumed cleanly.
      $letter_sequences{$last_consonants}++ if length($subword) == 0;
}

 

(I think that will work. I don’t have Perl installed on this machine.) Notice I only add consonant clusters at the beginning and end of the word. This is because what happens at the syllable boundary is also random. We want to include ‘br’ and ‘x’ from ‘breadbox’, but not ‘db’; ‘db’ is not typical of English.
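
 

To make that concrete, here is what I would expect a single call for ‘breadbox’ to produce; it is easy to verify by dumping %letter_sequences after the call.

analyze_word("breadbox");
# increments these keys in %letter_sequences:
#   "br"   initial consonant cluster (the onset)
#   "ea"   vowel cluster
#   "o"    vowel cluster
#   "x"    final consonant cluster
# The medial cluster "db" is deliberately never counted.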

 

Already this works much better than the ngram approach. You can convince yourself of this by setting a lower threshold on the counts in %letter_sequences: the words that match only the patterns in the lower range mostly contain ‘qu’ (because the consonant-vowel boundary isn’t random there) or ‘y’ (because it acts as both consonant and vowel). That is to say, the linguistic approach was correct, and the failures are rooted in inaccurate linguistic assumptions, namely that the consonant/vowel boundary is always random and that the consonant and vowel inventories are exactly the ones listed. Also very cool: %letter_sequences is much smaller than %ngrams, and it is more informative.
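
 

A quick way to do that inspection, just as a sketch (the threshold value is arbitrary):

# Dump the patterns whose counts fall below a threshold; grepping the
# word list for these sequences turns up the suspect words.
my $THRESHOLD = 3;      # arbitrary cutoff, tune it to the size of the list
foreach my $seq (sort { $letter_sequences{$a} <=> $letter_sequences{$b} }
                 keys %letter_sequences) {
      last if $letter_sequences{$seq} > $THRESHOLD;
      print "$seq\t$letter_sequences{$seq}\n";
}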

 

So then you’re hooked, and you modify your code to handle the ‘qu’ edge case and the ‘y’ edge case, and that is very dangerous!!! You have just turned into a linguistic programmer. That process of picking nits and modifying your code to look like English, rather than trying to stuff English into your code, is not scalable or understandable to your average programmer. It is frowned upon.

 

But if you continue down this path, and recognize next that most of the vowel sequences that fail occur at the boundary between a prefix and a root, and so use a statistical method to find the prefixes and lop them off before you apply Method II (a rough sketch of that prefix pass follows), and so forth, you will find that you end up with a very effective, inexpensive (in processing time and memory) word-list corruption checker. Even Method II alone, augmented by ‘qu’ and ‘y’ heuristics, will find a lot of garbage in the lower counts. This method works nicely for a language identifier as well.
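
 

Here is one cheap way to do that prefix pass. This is only a sketch under my own assumptions, not a finished method: it reuses the @words list from the outer loop above, the cutoff of 20 is arbitrary, and it treats an initial chunk as a prefix candidate only when what remains after removing it is itself a word in the list.

# Crude statistical prefix finder.
my %is_word = map { $_ => 1 } @words;
my %prefix_count;
foreach my $word (@words) {
      foreach my $len (2 .. 4) {
            next if length($word) <= $len + 2;      # leave a plausible root behind
            my $rest = substr($word, $len);
            $prefix_count{substr($word, 0, $len)}++ if $is_word{$rest};
      }
}
my %is_prefix = map { $_ => 1 }
                grep { $prefix_count{$_} >= 20 }    # arbitrary cutoff
                keys %prefix_count;

# Strip one recognized prefix before handing the word to Method II,
# e.g. analyze_word(strip_prefix($word)) in the counting pass.
sub strip_prefix
{
      my ($word) = @_;
      foreach my $len (4, 3, 2) {                   # longest candidates first
            next if length($word) <= $len + 2;
            return substr($word, $len) if $is_prefix{substr($word, 0, $len)};
      }
      return $word;
}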