Posted by: Tony.DING | 2011年11月1日

Lovins 算法

Intro

Lovins 算法是构词学中最早提出的一种Stemming(截词)词根化的算法,它由 Lovins JB  1968 发表。 这个算法的目的是寻求一个单词的”词根”, 这里racine 词根更多的指的是信息学角度的,并不一定是语言学上的词源。和另一种Stemming算法Porter相比,Lovins更加庞大,但是它更快。

Strategy

Lovins算法的主要策略如下,首先 Lovins 研究总结出294 种词尾 (terminaison),29种构词条件(Condition)和35种转化规则(Transformation rules),然后展开如下算法:

  1. 在294个词尾中,我们寻找符合29个条件的最长匹配词尾
  2. 删除找到的词尾
  3. 测试截尾后的词是否符合35种转化规则,若符合进行转化

Example

nationally 这个单词,我们可以在294个词尾中找到它的词尾 ationally,截去词尾后词根长度为1,不符合构词条件 B(最短词根长度为3),所以我们不作此操作;然后我们同样在词尾表中找到 ionally 这个词尾,根据条件A 我们移去这个词尾 我们找到了我们的词根nat.

转化规则主要用来对付那些不规则单词,比如 sitting -> sit 或者 matrix and matrices 双写动词结尾或者不规则单复数变化。

Appendix

下面列出 Lovins 总结的词尾,构词条件和转化规则:

附录 A 294 个词尾和相关条件

.11.
alistically B arizability A izationally B
.10.
antialness A arisations A arizations A entialness A
.09.
allically C antaneous A antiality A arisation A
arization A ationally B ativeness A eableness E
entations A entiality A entialize A entiation A
ionalness A istically A itousness A izability A
izational A
.08.
ableness A arizable A entation A entially A
eousness A ibleness A icalness A ionalism A
ionality A ionalize A iousness A izations A
lessness A
.07.
ability A aically A alistic B alities A
ariness E aristic A arizing A ateness A
atingly A ational B atively A ativism A
elihood E encible A entally A entials A
entiate A entness A fulness A ibility A
icalism A icalist A icality A icalize A
ication G icianry A ination A ingness A
ionally A isation A ishness A istical A
iteness A iveness A ivistic A ivities A
ization F izement A oidally A ousness A
.06.
aceous A acious B action G alness A
ancial A ancies A ancing B ariser A
arized A arizer A atable A ations B
atives A eature Z efully A encies A
encing A ential A enting C entist A
eously A ialist A iality A ialize A
ically A icance A icians A icists A
ifully A ionals A ionate D ioning A
ionist A iously A istics A izable E
lessly A nesses A oidism A
.05.
acies A acity A aging B aical A
alist A alism B ality A alize A
allic BB anced B ances B antic C
arial A aries A arily A arity B
arize A aroid A ately A ating I
ation B ative A ators A atory A
ature E early Y ehood A eless A
elity A ement A enced A ences A
eness E ening E ental A ented C
ently A fully A ially A icant A
ician A icide A icism A icist A
icity A idine I iedly A ihood A
inate A iness A ingly B inism J
inity CC ional A ioned A ished A
istic A ities A itous A ively A
ivity A izers F izing F oidal A
oides A otide A ously A
.04.
able A ably A ages B ally B
ance B ancy B ants B aric A
arly K ated I ates A atic B
ator A ealy Y edly E eful A
eity A ence A ency A ened E
enly E eous A hood A ials A
ians A ible A ibly A ical A
ides L iers A iful A ines M
ings N ions B ious A isms B
ists A itic H ized F izer F
less A lily A ness A ogen A
ward A wise A ying B yish A
.03.
acy A age B aic A als BB
ant B ars O ary F ata A
ate A eal Y ear Y ely E
ene E ent C ery E ese A
ful A ial A ian A ics A
ide L ied A ier A ies P
ily A ine M ing N ion Q
ish C ism B ist A ite AA
ity A ium A ive A ize F
oid A one R ous A
.02.
ae A al BB ar X as B
ed E en F es E ia A
ic A is A ly B on S
or T um U us V yl R
s’ A ‘s A
.01.
a A e A i A o A
s W y B

附录 B 29种条件 (与附录 A 中的条件序号相符)

A No restrictions on stem
B Minimum stem length = 3
C Minimum stem length = 4
D Minimum stem length = 5
E Do not remove ending after e
F Minimum stem length = 3 and do not remove ending after e
G Minimum stem length = 3 and remove ending only after f
H Remove ending only after t or ll
I Do not remove ending after o or e
J Do not remove ending after a or e
K Minimum stem length = 3 and remove ending only after l, i or u*e
L Do not remove ending after u, x or s, unless s follows o
M Do not remove ending after a, c, e or m
N Minimum stem length = 4 after s**, elsewhere = 3
O Remove ending only after l or i
P Do not remove ending after c
Q Minimum stem length = 3 and do not remove ending after l or n
R Remove ending only after n or r
S Remove ending only after dr or t, unless t follows t
T Remove ending only after s or t, unless t follows o
U Remove ending only after l, m, n or r
V Remove ending only after c
W Do not remove ending after s or u
X Remove ending only after l, i or u*e
Y Remove ending only after in
Z Do not remove ending after f
AA Remove ending only after d, f, ph, th, l, er, or, es or t
BB Minimum stem length = 3 and do not remove ending after met or ryst
CC Remove ending only after l

附录 C 35种转化条件

1 remove one of double b, d, g, l, m, n, p, r, s, t
2 iev -> ief
3 uct -> uc
4 umpt -> um
5 rpt -> rb
6 urs -> ur
7 istr -> ister
7a metr -> meter
8 olv -> olut
9 ul -> l except following a, o, i
10 bex -> bic
11 dex -> dic
12 pex -> pic
13 tex -> tic
14 ax -> ac
15 ex -> ec
16 ix -> ic
17 lux -> luc
18 uad -> uas
19 vad -> vas
20 cid -> cis
21 lid -> lis
22 erid -> eris
23 pand -> pans
24 end -> ens except following s
25 ond -> ons
26 lud -> lus
27 rud -> rus
28 her -> hes except following p, t
29 mit -> mis
30 ent -> ens except following m
31 ert -> ers
32 et -> es except following n
33 yt -> ys
34 yz -> ys

Conclusion

总的来讲,Lovins算法是一种类似枚举的算法,更多的是从构词学的角度入手,研究构词规则变化等,优点是因为遍历查表速度会很快,缺点是没有拓展性 比如换一种语言就要从头研究语言规则了。

参考文献:  http://snowball.tartarus.org/algorithms/lovins/stemmer.html

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Categories

Follow

Get every new post delivered to your Inbox.