Some Natural Language Processing Techniques
PARSER
I thought it might be helpful not to have to read a whole book on top down and bottom up parser design. Here is an example of a basic parser design which I have used for many different purposes and for many different businesses. Some of it is standard stuff, and some is not as indicated.
A parser can be used for many different things. Some things that this style of parser has been used for are the following:
* To identify various types of phrases – headlines, proper nouns, technological terminology, short verbs, synonyms of word X, passive voice, missing conjunctions and relative pronouns, misspellings, etc., etc.
* To identify whether a sentence is complete and to give it the best analysis
* To find a translation for a given phrase in an English document in a German translation of that document.
* To choose among a variety of candidates for a word the one which best fits what has gone before.
I observe people tend to think of parsers as having only one purpose, the prototypical use case being machine translation, but if you broaden your view a little of what a parser really is, it can apply in many different places.
One design I have found very effective is to model a parse tree with a finite state machine. The parser itself is simply a database table which has a series of rules executed in order. When a string matches a rule, it is stored as a hash/map by rule number in a ‘phrase bucket’ in RAM. You don’t try to determine ‘the right tree’ until the sentence is fully parsed, except that rules that are known to be dead ends can be pruned to conserve RAM, depending on the purpose for the parser. There are several reasons for this:
For example:
Inflections:
1: Noun_Sg_Nom #book
2: Noun_Pl_Nom #books
3: Noun_Sg_Gen #book’s
4: Noun_Pl_Gen #books’
5: V_Infinitive #run
6: V_Pres #run
7: V_Sg_3rd_Pres #runs
Etc.
Wild Cards – as in Perl
* none or more
+ one or more
1 once
Etc.
The Parser
Rule Position Inflection WildCard Param1(e.g. Pl) Param2 (e.g. Trans) Head
100 0 1 +
100 1 1:2 1 -1
101 0 5:6:7 1 -1 -1
101 1 2:100 1 -1
This parser will match an infinitive or present tense transitive verb followed by a noun or plural compound noun:
Use thumb tacks
Uses polyester mouse pads
Use books.
Some points to note:
Now how about the code to process this?
The parse tree is modeled as a finite state machine. For what I call an incremental parser (namely one where you don’t get the whole sentence at once, but get only one word at a time and have to modify the tree gradually), the state might look like this. (A use case for such a parser might be a spelling checker. When your word is misspelled, you might have several candidates for what was really meant, and such a parser might help you decide.):
struct ParseState
{
ParseState *prev;
set<ParseState *> next;
ulong level;
str word;
ulong rule;
ulong element;
ulong launcher_rule;
ulong launcher_element;
bool initial;
bool invalidate;
};
For this type of parser, the ‘level’ is how far you are into the sentence. Level 0 is the first word, and so forth. Your parser is given several different candidates for a single position, and evaluates each candidate against the parse engine to see if it fits with previous information. Once a word is decided on for a particular level, you make a pruning call, and delete everything in the Phrase Bucket that was generated for a candidate which was rejected. A launcher rule is for situations like that in rule 51, position 1 in the example above, where you need to know where you came from.
For an incremental parser (which I’m using as an example, because it’s a more general case of the problem where you get the whole sentence in at once), the Parser class might have 3 basic calls:
void PruneStates(ulong new_level, const str &word);
void ResetStates();
void ParseWord(const str &word, ulong inflection);
When you reach the end of sentence, you call ResetStates() to empty the Phrase Bucket. When a word is decided upon for a particular position (i.e. the user chooses a word as a repacement for the misspelling), you choose PruneStates, and the system kills all branches of the tree which were not matched by that word.
And EvaluateWord is the primary routine, which checks a particular word candidate against the existing parse tree/finite state machine. It might look like this. Notice this doesn’t actually return anything. I leave that up to the reader. And I’m not worrying here about memory management and edge cases. You have to take responsibility for that yourself as well:
void ParseWord(const str &word, ulong inflection)
{
CheckExistingStates(word, inflection);
StartNewRules(word, inflection);
}
void CheckExistingStates(const str &word, const str &inflection)
{
//CHECK EXISTING STATES
Grammar *pG = dynamic_cast<Grammar*>(GrammarBank::GetGrammar(m_langNum));
//states are keyed by the rule number that matched
map<ulong, set<ParseState *> > tmp = m_state_map;
//each state that matched the previous position in the sentence
map<ulong, set<ParseState *> >::iterator CS = tmp.find(m_level - 1);
set<ParseState *>::iterator iCS = CS->second.begin();
for(int index = 0; iCS != CS->second.end(); iCS++, index++) {
vector<ParseState *> new_states;
//A parse rule match need not necessarily be exported to the outside world.
//It might be just a subbranch or it might serve some internal purpose
bool export_phr = false;
bool match = false;
//Start at the position where the last word left off
ParseRule *PR = pG->GetParseRule((*iCS)->rule);
vector<ParseElement>::iterator iPE = PR->the_rule.begin() + (*iCS)->element;
//For rule elements with a wild card that allow it to match more than once
if(!iPE->repeat && (iPE != PR->the_rule.end()))
iPE++; //move to the next ParseElement in the rule unless it's a repeating element
bool initial_element = true;
while(!match && (iPE != PR->the_rule.end())) {
//Standard stuff. I leave this up to the imagination of the reader
new_states = MatchWordWithParseElement(iPE, word, inflection);
vector<ParseState *>::iterator iNS = new_states.begin();
for( ; iNS != new_states.end(); iNS++) {
match = true;
//for subrules with more than one element, transfer the launcher rule information.
if(((*iCS)->launcher_rule != -1) && ((*iNS)->rule == (*iCS)->rule)) {
(*iNS)->launcher_rule = (*iCS)->launcher_rule;
(*iNS)->launcher_element = (*iCS)->launcher_element;
}
(*iNS)->prev = *iCS;
(*iCS)->next.insert(*iNS);
//if you've come to the end of the referenced rule, revert to the higher rule
if((*iNS)->launcher_rule != -1) {
ParseRule *PRLower = pG->GetParseRule((*iNS)->rule);
vector<ParseElement>::iterator iPELower;
iPELower = PRLower->the_rule.begin() + (*iNS)->element;
if(iPELower->final) {
(*iNS)->rule = (*iNS)->launcher_rule;
(*iNS)->element = (*iNS)->launcher_element;
(*iNS)->launcher_rule = -1;
(*iNS)->launcher_element = -1;
}
}
}
//if the first element doesn't match
if(new_states.empty()) {
//repeating elements, you just tried the same element again
//That did't work, so try the next
if(initial_element && iPE->repeat)
iPE++;
//non optional elements give up
else if(!iPE->optional)
break;
//skip optional elements and try again
else
iPE++;
}
initial_element = false;
}
}
void StartNewRules(const str &word, ulong inflection)
{
Grammar *pG = dynamic_cast<Grammar*>(GrammarBank::GetGrammar(m_langNum));
vector<ParseRule>::iterator iPR = pG->m_parser.m_rules.begin();
for( ; iPR != pG->m_parser.m_rules.end(); iPR++) {
while(new_states.empty()) {
if(!iPE->optional)
break;
iPE++;
if(iPE == iPR->the_rule.end())
break;
new_states = MatchWordWithParseElement(iPE, word, inflection);
}
vector<ParseState*>::const_iterator iNS = new_states.begin();
for( ; iNS != new_states.end(); iNS++) {
ParseRule *PR = pG->GetParseRule((*iNS)->rule);
vector<ParseElement>::iterator iPEtmp = PR->the_rule.begin() + (*iNS)->element;
if(iPEtmp->final && ((*iNS)->launcher_element != -1)) {
(*iNS)->rule = (*iNS)->launcher_rule;
(*iNS)->element = (*iNS)->launcher_element;
(*iNS)->launcher_rule = -1;
(*iNS)->launcher_element = -1;
}
}
}