From the research of machine translation and artificial intelligence in the 1950s, NLP (Nature
Language processing (natural language processing) has a history of half a century. exist
In this process, many important theories and methods have been put forward by academic circles, and rich achievements have been made.
. The author believes that in the past two decades, the landmark contributions in this field are as follows:
(1) Complex feature set and unified grammar; (2) Lexicalism in linguistic research; ( 3)
Corpus approach and statistical language model. These three achievements will continue to exert great influence on linguistics and computational linguistics.
And NLP research had a far-reaching impact. In order to better understand the significance of these achievements, first introduce and
Two related facts.
2. Two facts
2. 1 One of the facts-phrase structure grammar cannot effectively describe natural language.
In natural language processing, in order to identify the syntactic structure of input sentences, it is necessary to
Cut out the words in the sentence one by one, then look them up in the dictionary and give each word a reference.
Send appropriate parts of speech; Then use syntactic rules to wrap the sentence.
Identify the syntactic components including noun phrases, verb phrases and clauses one by one. enter
And judge the syntactic function of each phrase, such as subject, predicate and object. , and its semantic role,
Finally, the meaning expression of the sentence is obtained, such as logical and semantic expression. This is a syntactic analysis.
The whole process.
The first thing I want to mention in this paper is: phrase structure grammar (phrase structure)
Grammar (PSG) cannot effectively describe natural language. Research on PSG in Chomsky Language
Theory plays an important role in the syntactic description of natural language.
. However, it has some fundamental weaknesses, mainly because it uses parts of speech and phrases.
Class, so it can't effectively express and explain the structural ambiguity in natural language.
Look at the combination of "V+N" in Chinese. If we put "strike, entrust, investigate" and so on.
This word is designated as the verb (v); Take the words "strength, method, piracy and Party A" as nouns (
N), and agree that "strike strength" and "entrustment mode" are noun phrases (NP), "strike"
Piracy "and" Principal A "are both verb phrases (VP), so there are two differences as follows.
Syntactic rules of meaning:
( 1)NP→VN
(2) Verb phrase → Verb phrase
In other words, when the computer observes the adjacent sequence of the part of speech "V+N" in the text, it is still
Not sure if they are NP or VP. We call this ambiguity "phrase class"
Type ambiguity ". For example:
The company is recruiting sales staff.
The earth is constantly [changing the V-shape n] VP.
Looking at the combination of "n+v" again, it will also produce the rule of vague phrase types.
Yes, for example:
(3)NP→NV case: market survey; Political influence.
(4)S→NV example: price rises; The situation is stable.
Where the symbol s stands for a clause.
Not only that, sometimes when the machine observes the adjacent sequence of "n+v" parts of speech, even
It is impossible to tell whether they are in the same phrase. That is to say, "n+v" part-of-speech sequence
It may form a noun phrase NP or clause S, or it may not be in the same phrase at all. After ...
This ambiguity is called "phrase boundary ambiguity". Here are two related examples:
China's [Railway N Building V] NP has developed rapidly.
[China Railway N] NP builds V quickly.
In the previous example, "railway construction" constitutes a NP; In the latter example, these two
Two adjacent words belong to two different phrases. This is enough to show that based on a single mark,
PSG can't fully describe syntactic ambiguity in natural language. Let's take a look at some of them.
Examples.
(5)NP→V N 1 DE N2
(6)VP→V N 1 DE N2
Where de stands for the structural auxiliary word "de". For example, "VP's knife for peeling apples" is NP; but
"Peeling apples" NP is VP. There are both ambiguity of phrase types and ambiguity of phrases.
The boundary is blurred. For example, two adjacent words "peel V apple N" may form one word.
VP may also be in two adjacent phrases.
(7)NP→P N 1 N2
(8)PP→P N 1 N2
P and PP in the rules represent prepositions and prepositional phrases respectively. For example, the seal of PP [to Shanghai]
Elephant "is NP; And "for[ Shanghai student] NP" is PP. The adjacent word "Duihun".
It may form a PP or two phrases.
(9)NP→NumP N 1 de N2
Where NumP stands for quantitative phrase. Although rule (9) represents an NP, it can be replaced separately.
Table 2 Structural meaning:
For example, five [employees of the company] NP.
(9b) [NUMP N 1] NP Den2 For example, NP employees of [five companies].
( 10)NP→N 1 N2 N3
The rule (10) also represents an NP, but "N 1+N2" or "N2+N3" is combined first.
First of all, there will be two different structural ways and meanings, namely:
(10a) [n1n 2] npn3 such as [Modern Chinese] NP dictionary.
(10b) N 1 [N2n3] NP, for example, the new Chinese Dictionary NP.
The first fact discussed above shows that:
Due to the lack of binding force, the PSG rule of single mark can not completely solve the problem of phrase type and
Phrase boundaries are blurred. In mathematical terms, PSG rule is necessary, but not enough.
. So the machine only judges whether a rule is short according to a part-of-speech sequence on the right.
Language, or any phrase, has some uncertainties.
Reconstructing the grammatical system of natural language by using complex feature set and lexicalism method is
The most important efforts made by global linguists in the past twenty years.
2.2 Fact 2- Limited coverage of phrase structure rules
Through the investigation of large-scale corpus, it is found that the distribution of phrase rules in a language is consistent
Ziff's law. Zipf is a statistician and linguist. He proposed, such as
If you calculate a language unit (whether it is a letter or a word), put it in.
The frequency of a corpus is recorded as f, which is sorted in descending order of frequency.
Each cell is assigned an integer rank r. The result is that the product of r and f is about
A constant. that is
F…w│w│w)。
.. p (w [,n] │ w...w │ w conditional probability, and so on. no
It is hard to see that in order to predict the word W │ W [,1]) II [,I = 3, …, n]P(w[, i].
│w[,i-2]w[,- 1]) (5)
The method of statistical language model is a bit like weather forecast. Large-scale estimation of probability parameters
Corpus is like a meteorological record accumulated in a region for many years, and it uses ternary model to create the sky.
Weather forecast is like predicting the weather of the day according to the weather conditions of the previous two days. When is the weather forecast?
However, it is impossible to be 100% correct. This is also a feature of probability and statistics method.
3.3. 1 speech recognition
As a way to input Chinese characters instead of computer keyboard, speech recognition is more and more trusted by people.
Interest of people from all walks of life. The so-called dictation machine is such a commodity. It is reported that China's mobile phone
With the popularity of mobile phones and personal digital assistants (PDA), the number of telephone users has exceeded 1 billion, especially
When these portable devices can access the Internet wirelessly, it is more urgent for users.
I hope to input short text information by voice recognition or handwriting board instead of keyboard.
In fact, the task of speech recognition can be regarded as a problem of calculating the maximum of the following conditional probabilities:
W[*]=argmax[, W]P(W│ voice signal)
=argmax[, W]P (voice signal │W)P(W)/
voice signal
=argmax[, W]P (voice signal │W)P(W) (6)
The mathematical symbol argmax[, w] in the formula indicates that the conditional probability P (W) is calculated for different candidate word sequences W.
│ voice signal), making W[*] the one with the largest conditional probability value.
Word sequence, which is the recognition result selected by the computer. In other words, by Formula (6)
Through calculation, the computer found the word string W[ 1] which is most suitable for the current input speech signal.
*]。
The second line of equation (6) is the result of transliteration of Bayesian law, because the conditional probability p (
Speech signal │W) is easier to estimate. Denominator p (speech signal) pair of the formula
The given speech signal is constant and does not affect the calculation of the maximum value, so it can be deleted from the formula.
Except ... In the result shown in the third line, P(W) is the statistical language model mentioned above, that is
Generally, the ternary model shown in Formula (5) is adopted; P (speech signal │W) is called acoustic model.
At this point, readers may have understood that the Pinyin-Chinese character conversion in the Chinese Pinyin input method is arbitrary.
In fact, the service is also implemented in the same way, and the Chinese language model used by both is binary.
Or ternary model) is the same model.
At present, dictation machine products and Microsoft Pinyin Input Method (Version 3.0) on the market all use words.
The realization of ternary model hardly needs syntactic and semantic analysis. Because according to comparable comments,
The test results show that the error rate of pinyin-Chinese character conversion system based on ternary model is higher than other products.
Reduce by about 50%.
3.3.2 Part of speech tagging
About 14% of word types in thesaurus have more than one part of speech. In the corpus,
There are more than one part of speech for words that account for about 30% of the total number of words. So for each text
Part-of-speech tagging of a word is to resolve part-of-speech ambiguity through the constraints of context. calendar
There have been two automatic part-of-speech tagging systems in history. One is to use context-sensitive rules.
Then it is called TAGGIT( 197 1), and another binary model with part of speech is called CLAWS (
1987) (see Garside et al. 1989). The two systems are used to evaluate English with 654.38+00000 words respectively.
Part-of-speech tagging for unrestricted text. The results show that paw and statistical language model
The labeling accuracy of this system is much higher than that of TAGGIT system based on rule method. Please see the table below.
Than:
System Name Taggit (1971) CLAWS (1987) Label No.86 133 Method 3000 CSG Rules Hidden Markov Model Label Accuracy 77% 96% Test Corpus Brown LOB.
Let c and w represent the order of part-of-speech markers and the order of words respectively, then the problem of part-of-speech tagging can be regarded as a scheme.
Calculate the maximum of the following conditional probabilities:
C[*]=argmax[,C]P(C│W)
=argmax[,C]P(W│C)P(C)/P(W)
≈argmax[,C]ⅱ[,i= 1,…,n]P(w[,i]│c[,i])P(c[,i]│c[,I
- 1]) (7)
Where P(C│W) is the bar where the part-of-speech marker sequence C appears when the word sequence W is known to be input.
Piece rate. The mathematical symbol argmax[, C] represents different candidate parts of the speech mark sequence C by checking.
To find the part-of-speech marker sequence C[*] that maximizes conditional probability. The latter should be
The result of w's part-of-speech tagging.
The second line of the formula is the result of transliteration of Bayesian law, because the denominator P(W) is given.
W is a constant, which does not affect the calculation of the maximum value and can be deleted from the formula. And then face the public
Types of approximate analysis. Firstly, the independence hypothesis is introduced, which holds that any word w[, i] is out.
The present probability approximation is only related to the part-of-speech marker c[, i] of the current word, but to the surrounding (context).
Part of speech markers are irrelevant. Then the lexical probability can be calculated as follows:
P(W│C)≈ⅱ[,i= 1,…,n]P(w[,i]│c[,i]) (8)
Secondly, the binary hypothesis is adopted, that is, the occurrence probability of any part-of-speech marker c[, i] is approximately considered.
It is only related to its previous part-of-speech marker c[, i- 1]. rule
P(C)≈P(c[, 1])ⅱ[,i=2,…,n]P(c[,i]│c[,i- 1]) (9)
P(c[, i]│c[, i- 1]) is the transition probability of the part-of-speech marker, which is also called duality based on part-of-speech.
Model.
These two probability parameters can be estimated by corpus with part-of-speech tags:
P(w[,i]│c[,i])≈count(w[,i],c[,i])/count(c[,i])(
10)
P(c[,i]│c[,i- 1])≈count(c[,i- 1]c[,i])/count(c[,i- 1]
) ( 1 1)
According to literature reports, using statistical language model method, Chinese and English part-of-speech tagging is correct.
The rate can reach about 96% (Bai Shuaihu 1992).
3.3.3 Attachment ambiguity of prepositional phrase PP
In English, whether a prepositional phrase is attached to the preceding noun or verb is a sentence.
Common structural ambiguity in legal analysis. The following example illustrates how to solve this problem with corpus method.
A question, how high the correct rate can be achieved by this method.
For example, Pierre Wenken, 6 1 year old, joined the board of directors as a director.
Non-executive director.
Let a = 1 denote noun attachment and a = 0 denote verb attachment, then the problem of PP attachment in the above example can be expressed.
Used for:
(A=0, V = joined, n 1 = board, P=as, N2 = controller)
Let v, N 1 and N2 represent the head words of verb phrases, object phrases and object phrases respectively.
And the probability of the following quadruple is counted in the corpus with syntactic tags (also called tree library).
P[,r]:
P[,r]=(A= 1│V=v,N 1=n 1,P=p,N2=n2) ( 10)
The algorithm for judging the attachment of the input sentence PP is as follows:
If p [,r] = (1 │ V, n 1, p, n2)≥0.5,
It is judged that PP is attached to n 1
Otherwise, it is determined that PP is attached to V.
Collins company. The corpus used in Brooks( 1995) experiment is marked by the University of Pennsylvania.
WSJ tree library, including: 20,801quad training sets, testing.
Try to set 3097 quadrilaterals. They put forward the following points for the upper and lower limits of the automatic determination accuracy of PP accessories.
Analysis:
All of them are regarded as noun addition (i.e. A ≡ 1) 59.0%.
Only 72.2% of the most common attachment of preposition P is considered.
The three experts only judged 88.2% according to the four words of the center.
Three experts judged 93.2% according to the whole sentence.
Obviously, the lower limit of automatic judgment accuracy is 72.2%, because the machine is no better than just considering sentences.
The most common attachment of preposition p is worse; The upper limit is 88.2%, because the machine can't compare with three.
Experts make better judgments based on these four central words.
This paper reports that among the 3097 quadrangles tested, the system correctly judges the quadrangles.
It is 2606, so the average accuracy rate is 84. 1%. This is different from the above-mentioned upper limit of 88.2%
In contrast, it should be said that it is quite a good result.
4. Conclusion
The efforts of linguists, whether using complex function sets and unified grammar, or lexicalism.
Methods are all great contributions made under the original so-called rationalism framework. Lexical method
This method is especially commendable, because it not only puts forward a finer-grained language knowledge representation method.
At the same time, it also embodies a new idea of gradual development and language knowledge accumulation. Particularly worthy of attention.
It seems that corpus and statistical methods have played a great role in the development of many vocabulary resources.
The role of. This is also a welcome beginning of the integration of empiricism and rationalism. pen
Researchers believe that corpus method and statistical language model are the mainstream of natural language processing technology at present.
Their practical value has been proved in many application systems. Research on statistical language model,
Especially in the statistical modeling of structured objects, there is still a broad space for development.
References:
Alzheimer, Jane & William Myers (editor. ). 1990. Corpus Linguistics:
Theory and practice [c] Amsterdam: Rodopi.
Collins, m. and j. brooks. 1995. Prepositional phrases
Connection through backward model [p]. In the minutes of the meeting
The third symposium on very large corpora. Cambridge, Massachusetts.
Garside, R., G. Leech and G. Sampson (eds.). 1989. This
Computational analysis of English: a corpus-based approach.
London: Longman.
Hudson, Rhode Island 199 1. English word grammar [m] Cambridge,
Quality. : Basil Blackwell.
Bai Shuaihu, 1992, Research on Chinese Part of Speech Automatic Tagging System [[Ma]]. Tsinghua University calculation
Master's degree thesis of mechanical science and technology department.
Dong Zhendong and Dong Qiang, 1997, hownet [J]. The third phase of language application.
Yu et al., 1998, Dictionary of Modern Chinese Grammatical Information [M]. Beijing:
Tsinghua University Publishing House.