Al Hoang : NLP techniques applied to Vietnamese

I’ve been getting interested in computational linguistics lately and have
been trying to read up on the fundamentals. It’s been quite a bit of
probability, which I’m not sure I really understand that well, but I imagine
it’ll get easier once I work with it a bit more.

Most of the examples in the textbook are focused on English so I thought it
would be interesting to apply them to Vietnamese. While looking around for
a Vietnamese grammar that I could use as a model for tagging a Vietnamese
corpus, I stumbled across the website for the Vietnam Lexicography Centre.
It looks like a great resource for Vietnamese computational linguistics.
There’s a paper describing an effort to define the tags for part-of-speech
(POS) tagging in Vietnamese, titled Lexical descriptions for Vietnamese
language processing. It’s a very good description of the problems facing
computational linguists trying to tackle Vietnamese (hint: the number
is very low).

Here are some brief highlights:

The classification methods that would serve as the model for POS tagging
are still under discussion among Vietnamese linguists.

There have been some other efforts, but without an implementation available
it’s impossible to check how effective they are.

They are trying to adhere to the MULTEXT model (which is itself an effort
to standardize parts of linguistic databases so researchers can exchange
their data more easily).

Few tools for Vietnamese text analysis are available in the public
research domain.
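
To make the tagging idea a bit more concrete, here is a minimal sketch of a
most-frequent-tag baseline tagger in Python. It is not the approach from the
paper: the tiny corpus, the tag names (P, V, N) and the function names are
all made up for illustration, and it assumes the text has already been split
into words.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences. The tags (P = pronoun, V = verb,
# N = noun) are illustrative placeholders, not the tagset from the paper.
TAGGED_CORPUS = [
    [("tôi", "P"), ("ăn", "V"), ("cơm", "N")],
    [("tôi", "P"), ("đọc", "V"), ("sách", "N")],
    [("mẹ", "N"), ("ăn", "V"), ("cơm", "N")],
]


def train_unigram_tagger(corpus):
    """Map each word to the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="N"):
    """Tag a list of words; unseen words fall back to the default tag."""
    return [(word, model.get(word, default)) for word in words]


model = train_unigram_tagger(TAGGED_CORPUS)
print(tag(["mẹ", "đọc", "sách"], model))
# → [('mẹ', 'N'), ('đọc', 'V'), ('sách', 'N')]
```

Even a baseline this simple needs an agreed tagset and a tagged corpus to
train on, which is exactly what the highlights above say is hard to come by.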

I think the authors of the paper make a good point. Publicly available
resources for linguistic processing research are very important. I think
the problem is magnified for languages that are not ‘economic powerhouse’
languages such as English, French, or Japanese. What I mean by ‘economic
powerhouse’ is that the cultures that use these languages have a much more
stable financial base on which to build research efforts, and can derive
some sort of financial benefit from the tools that get built. For
Vietnamese or Cambodian, however, I’d argue that the economic incentives
for doing NLP research are quite low. To advance research in these
languages I feel it is even MORE necessary to have the tools and as much of
the data as possible available for public research use. Without such
efforts I get the feeling much of the research will not make progress,
since people will be too busy rebuilding the fundamental building blocks
rather than slowly stacking them on top of one another to find out what
works and what doesn’t.

One last note: from a cursory inspection, it seems that to become a
competent Vietnamese linguist you’ll need a decent command of Vietnamese
(not surprising) and, perhaps slightly more surprising, some literacy in
French to catch research that hasn’t been translated into English yet
(if it ever will be).

Resources I found:

Lexical descriptions for Vietnamese language processing (PDF)

Vietnamese Linguistic and Cultural Information (a little light, imo)

Wikipedia article on the Vietnamese language

Vietnamese Grammar Project (seems dead and definitely far from complete)

Mon-khmer.com’s description of Vietnamese