Some Natural Language Processing Techniques



I thought it might be helpful not to have to read a whole book on top down and bottom up parser design. Here is an example of a basic parser design which I have used for many different purposes and for many different businesses. Some of it is standard stuff, and some is not as indicated.


A parser can be used for many different things. Some things that this style of parser has been used for are the following:

* To identify various types of phrases – headlines, proper nouns, technological terminology, short verbs, synonyms of word X, passive voice, missing conjunctions and relative pronouns, misspellings, etc., etc.

* To identify whether a sentence is complete and to give it the best analysis

* To find a translation for a given phrase in an English document in a German translation of that document.

* To choose among a variety of candidates for a word the one which best fits what has gone before.


I observe people tend to think of parsers as having only one purpose, the prototypical use case being machine translation, but if you broaden your view a little of what a parser really is, it can apply in many different places.


One design I have found very effective is to model a parse tree with a finite state machine. The parser itself is simply a database table which has a series of rules executed in order. When a string matches a rule, it is stored as a hash/map by rule number in a ‘phrase bucket’ in RAM. You don’t try to determine ‘the right tree’ until the sentence is fully parsed, except that rules that are known to be dead ends can be pruned to conserve RAM, depending on the purpose for the parser. There are several reasons for this:

  1. You can’t usually make the best decision until you’re done.
  2. The purpose of the parser may not be to find the best parse state. Its purpose may be to find phrases matching a particular description, such as proper names of businesses. You may want several alternative parses of a single sequence of words for some reason.
  3. Retaining several different phrases matching different rules allows you to use them for other purposes as well, such as using a match of one phrase to block the match of another or boost the probability that the match of another is the one you want. You can write rules more efficiently if every match of a particular string doesn’t have to fit in the same parse rule. (This latter observation is, I believe, my own invention. I’ve never seen it mentioned anywhere.)


For example:



1: Noun_Sg_Nom       #book

2: Noun_Pl_Nom        #books

3: Noun_Sg_Gen        #book’s

4: Noun_Pl_Gen         #books’

5: V_Infinitive                        #run

6: V_Pres                    #run

7: V_Sg_3rd_Pres       #runs



Wild Cards – as in Perl

* none or more

+ one or more

1 once



The Parser

Rule     Position           Inflection        WildCard        Param1(e.g. Pl)            Param2 (e.g. Trans)     Head

100      0                      1                      +                     

100      1                      1:2                   1                                                                                              -1

101      0                      5:6:7                1                                                          -1                                 -1

101      1                      2:100               1                      -1


This parser will match an infinitive or present tense transitive verb followed by a noun or plural compound noun:

Use thumb tacks

Uses polyester mouse pads

Use books.


Some points to note:

  1. Notice in rule 51, position 2, that rule number 50 appears side by side with an inflection. A rule can match either. For this reason, parse rule numbers start higher than the highest inflection number, so there’s no ambiguity. (This too is my own invention, I think. I haven’t seen it anywhere.)
  2. Notice that in order to know that rule #50 is plural, something has to be done internally. Certain parameters assigned to a single position get assigned internally to the whole phrase. In general, these will be the parameters assigned to the head, and that’s why it’s flagged.


Now how about the code to process this?

The parse tree is modeled as a finite state machine. For what I call an incremental parser (namely one where you don’t get the whole sentence at once, but get only one word at a time and have to modify the tree gradually), the state might look like this. (A use case for such a parser might be a spelling checker. When your word is misspelled, you might have several candidates for what was really meant, and such a parser might help you decide.):


struct ParseState


      ParseState *prev;

      set<ParseState *> next;

      ulong level;

      str word;

      ulong rule;

      ulong element;

      ulong launcher_rule;

      ulong launcher_element;

      bool initial;

bool invalidate;



For this type of parser, the ‘level’ is how far you are into the sentence. Level 0 is the first word, and so forth. Your parser is given several different candidates for a single position, and evaluates each candidate against the parse engine to see if it fits with previous information. Once a word is decided on for a particular level, you make a pruning call, and delete everything in the Phrase Bucket that was generated for a candidate which was rejected. A launcher rule is for situations like that in rule 51, position 1 in the example above, where you need to know where you came from.


For an incremental parser (which I’m using as an example, because it’s a more general case of the problem where you get the whole sentence in at once), the Parser class might have 3 basic calls:


void PruneStates(ulong new_level, const str &word);

void ResetStates();

void ParseWord(const str &word, ulong inflection);


When you reach the end of sentence, you call ResetStates() to empty the Phrase Bucket. When a word is decided upon for a particular position (i.e. the user chooses a word as a repacement for the misspelling), you choose PruneStates, and the system kills all branches of the tree which were not matched by that word.


And EvaluateWord is the primary routine, which checks a particular word candidate against the existing parse tree/finite state machine. It might look like this. Notice this doesn’t actually return anything. I leave that up to the reader. And I’m not worrying here about memory management and edge cases. You have to take responsibility for that yourself as well:


void ParseWord(const str &word, ulong inflection)


      CheckExistingStates(word, inflection);

      StartNewRules(word, inflection);




void CheckExistingStates(const str &word, const str &inflection)



      Grammar *pG = dynamic_cast<Grammar*>(GrammarBank::GetGrammar(m_langNum));

      //states are keyed by the rule number that matched

      map<ulong, set<ParseState *> > tmp = m_state_map;

//each state that matched the previous position in the sentence

      map<ulong, set<ParseState *> >::iterator CS = tmp.find(m_level - 1);

      set<ParseState *>::iterator iCS = CS->second.begin();

      for(int index = 0; iCS != CS->second.end(); iCS++, index++) {

        vector<ParseState *> new_states;

        //A parse rule match need not necessarily be exported to the outside world.

        //It might be just a subbranch or it might serve some internal purpose

        bool export_phr = false;

        bool match = false;

        //Start at the position where the last word left off

        ParseRule *PR = pG->GetParseRule((*iCS)->rule);

        vector<ParseElement>::iterator iPE = PR->the_rule.begin() + (*iCS)->element;

        //For rule elements with a wild card that allow it to match more than once

        if(!iPE->repeat && (iPE != PR->the_rule.end()))

            iPE++;   //move to the next ParseElement in the rule unless it's a repeating element

        bool initial_element = true;

        while(!match && (iPE != PR->the_rule.end())) {

            //Standard stuff. I leave this up to the imagination of the reader

            new_states = MatchWordWithParseElement(iPE, word, inflection);

            vector<ParseState *>::iterator iNS = new_states.begin();

            for( ; iNS != new_states.end(); iNS++) {

                    match = true;

                //for subrules with more than one element, transfer the launcher rule information.

                if(((*iCS)->launcher_rule != -1) && ((*iNS)->rule == (*iCS)->rule)) {

                    (*iNS)->launcher_rule = (*iCS)->launcher_rule;

                    (*iNS)->launcher_element = (*iCS)->launcher_element;


                (*iNS)->prev = *iCS;


                //if you've come to the end of the referenced rule, revert to the higher rule

                if((*iNS)->launcher_rule != -1) {

                    ParseRule *PRLower = pG->GetParseRule((*iNS)->rule);

                    vector<ParseElement>::iterator iPELower;

  iPELower = PRLower->the_rule.begin() + (*iNS)->element;

                    if(iPELower->final) {

                        (*iNS)->rule = (*iNS)->launcher_rule;

                        (*iNS)->element = (*iNS)->launcher_element;

                        (*iNS)->launcher_rule = -1;

                        (*iNS)->launcher_element = -1;




            //if the first element doesn't match

            if(new_states.empty()) {

                //repeating elements, you just tried the same element again

                //That did't work, so try the next

                if(initial_element && iPE->repeat)


                //non optional elements give up

                else if(!iPE->optional)


                //skip optional elements and try again




            initial_element = false;




void StartNewRules(const str &word, ulong inflection)


      Grammar *pG = dynamic_cast<Grammar*>(GrammarBank::GetGrammar(m_langNum));

      vector<ParseRule>::iterator iPR = pG->m_parser.m_rules.begin();

      for( ; iPR != pG->m_parser.m_rules.end(); iPR++) {

        while(new_states.empty()) {




            if(iPE == iPR->the_rule.end())


              new_states = MatchWordWithParseElement(iPE, word, inflection);


        vector<ParseState*>::const_iterator iNS = new_states.begin();

        for( ; iNS != new_states.end(); iNS++) {

            ParseRule *PR = pG->GetParseRule((*iNS)->rule);

              vector<ParseElement>::iterator iPEtmp = PR->the_rule.begin() + (*iNS)->element;

            if(iPEtmp->final && ((*iNS)->launcher_element != -1)) {

                (*iNS)->rule = (*iNS)->launcher_rule;

                (*iNS)->element = (*iNS)->launcher_element;

                (*iNS)->launcher_rule = -1;

                (*iNS)->launcher_element = -1;