Transliterating Sanskrit and Pali [updated]

Transliterating Sanskrit, and its derivatives such as Pali, remains an annoying problem. The problem isn’t with the language itself; Sanskrit’s wonderfully precise and clear about sounds and letters. Likewise, there’s no issue with scripts or alphabets. You might think that there is some mystical connection between the script that a language is written in and the language itself but that’s really not the case. Sanskrit in India is written in Devanagari but there’s no special reason to use Devanagari for Sanskrit instead of the Latin alphabet or another one. Plus, Sanskrit’s only been written in Devanagari for a comparatively short period of time.

(Surprisingly, the alphabet was only invented once and all alphabets are genetically related to each other, branches from this one root. Devanagari is linked to Latin letters via Brahmi and Aramaic.)

But in order to write Sanskrit correctly, you need some Latin letters not used in English. This is a common-enough situation; think of accent marks, or the French and Portuguese cedilla — ç — or the Spanish enye — ñ — or even Motörhead’s heavy metal umlaut. So, for example, “Devanagari” ought to be written “Devanāgarī” and “Pali” should be “Pāḷi.” The complete set of diacritics for Pali is: ā, ī, ū, ṁ, ṇ, ñ, ṭ, ḍ, ṅ, ḷ .

There’s another, separate but related, issue about when to use these ‘extra’ letters and marks; for native English readers, the argument goes, these ‘extra’ letters and marks — called diacritics — are distracting and make the words harder to read.

Specialists typically prefer to preserve diacritics, because losing them changes the meaning of the word in its original language. The question comes down to: “when do these foreign words become English words?” There’s an active debate going on now about this very topic on H-Buddhism, an academic Buddhist studies mailing list. Dictionaries are split on the issue, with some words preserving diacritics and others losing them: for more on this, see the list of Buddhist Terms Found in English Print Dictionaries and Gerald Jackson’s series on fonts and diacritics in academic publishing.
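
To make the "losing them changes the meaning" point concrete, here is a minimal Python sketch of what stripping diacritics does to a word. It is only an illustration, not a description of how any dictionary or publisher actually works: the function name `strip_diacritics` is made up for this example, and it relies on Unicode decomposition (NFD) to split each letter from its combining mark. The sample words are the ones used in this post and in the comments below.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose each letter (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Words from this post, plus two Sanskrit forms quoted in the comments.
for word in ["Devanāgarī", "Pāḷi", "nāgam", "nagām"]:
    print(word, "->", strip_diacritics(word))
```

The last two words come out identical once the macrons are gone, which is exactly the ambiguity specialists worry about.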

The problem arises when you need to write diacritics in your friendly word processing application. This immediately leads to a technical conversation about Unicode and Unicode fonts. Which is when things start to get hairy.

The best starting point for Unicode issues is Alan Wood’s page. It’s worth reading for the introduction, which gives an overview of digital transcription. More specifically, for the topic at hand, the Tibetan & Himalayan Digital Library has a good survey of Unicode fonts for transliterating “Indo-Tibetan” languages.

(By Indo-Tibetan they mean Indian languages for Buddhist studies, including Sanskrit, Pali, Gandhari, and so on, plus Tibetan. “Indo-Tibetan” isn’t a language family like Indo-European but the term points to the very close relationship between Tibet and India. Buddhist Tibetan is a specialized language unreadable to a native Tibetan, optimized a thousand years ago to translate Buddhist Sanskrit into Tibetan. Smart people have been dealing with these issues for a long time.)

They make the point that not all Unicode fonts contain the necessary characters, so simply choosing a Unicode font isn’t enough: “To properly display all the diacritic marks used in Indo-Tibetan studies, a Unicode font must contain the following character ranges:

Basic Latin: U+0000 – U+007F
Latin-1 Supplement: U+0080 – U+00FF
Latin Extended-A: U+0100 – U+017F
Latin Extended-B: U+0180 – U+024F
Latin Extended Additional: U+1E00 – U+1EFF”
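
As a rough check on that list, the following Python sketch looks up which of the quoted ranges each Pali diacritic falls into. The ranges are copied from the quotation above; the names `REQUIRED_RANGES`, `PALI_DIACRITICS` and `block_of` are invented for this example. It only inspects code points, so it says nothing about whether a particular font file actually contains the glyphs.

```python
import unicodedata

# Ranges copied from the list quoted above: (block name, first, last).
REQUIRED_RANGES = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A", 0x0100, 0x017F),
    ("Latin Extended-B", 0x0180, 0x024F),
    ("Latin Extended Additional", 0x1E00, 0x1EFF),
]

# The Pali diacritics listed earlier, forced into precomposed (NFC) form
# so that each one is a single code point.
PALI_DIACRITICS = unicodedata.normalize("NFC", "āīūṁṇñṭḍṅḷ")


def block_of(ch: str):
    """Return the name of the quoted range containing ch, or None."""
    cp = ord(ch)
    for name, first, last in REQUIRED_RANGES:
        if first <= cp <= last:
            return name
    return None


for ch in PALI_DIACRITICS:
    print(f"{ch}  U+{ord(ch):04X}  {block_of(ch)}")
```

For Pali alone, the letters land in Latin-1 Supplement (ñ), Latin Extended-A (the macron vowels) and Latin Extended Additional (the dotted consonants), which is why a font that stops at Latin-1 is not enough.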


For Pali, this is the Unicode set:

character	ASCII rendering	character name	Unicode number	key combination	HTML code
ā	aa	a macron	61580	Alt+A	&#257;
ī	ii	i macron	61620	Alt+I	&#299;
ū	uu	u macron	61672	Alt+U	&#363;
ṁ	.m	m dot-under			&#7745;
ṇ	.n	n dot-under	61686	Alt+N	&#7751;
ñ	~n	n tilde	61590	Alt+Ctrl+N	&#241;
ṭ	.t	t dot-under	61642	Alt+T	&#7789;
ḍ	.d	d dot-under	61622	Alt+D	&#7693;
ṅ	"n	n dot-over	61626	Ctrl+N	&#7749;
ḷ	.l	l dot-under	61634	Alt+L	&#7735;

From Wikipedia’s “Pali transliteration on computers.”
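
The relationship between a character, its code point and its HTML code can be worked out mechanically. The Python sketch below is one way to do it, along with a deliberately naive converter from the ASCII renderings in the table to the real letters. The names `ASCII_TO_PALI`, `describe` and `from_ascii` are invented for this example. Note that the code points it prints are standard Unicode values (the decimal form matches the 7751 entity shown for ṇ), which are not the same numbers as the “Unicode number” column above.

```python
import unicodedata

# ASCII renderings taken from the table above, mapped to the Pali letter.
ASCII_TO_PALI = {
    "aa": "ā", "ii": "ī", "uu": "ū",
    ".m": "ṁ", ".n": "ṇ", "~n": "ñ",
    ".t": "ṭ", ".d": "ḍ", '"n': "ṅ", ".l": "ḷ",
}


def describe(ch: str):
    """Return (code point, decimal HTML entity, Unicode name) for one letter."""
    ch = unicodedata.normalize("NFC", ch)  # make sure it is one code point
    cp = ord(ch)
    return f"U+{cp:04X}", f"&#{cp};", unicodedata.name(ch)


def from_ascii(text: str) -> str:
    """Naively replace every ASCII rendering with its letter.

    This is a sketch: it will also convert a genuine "aa", or a full stop
    followed by "m", so it is only safe on text known to use this scheme.
    """
    for ascii_form, letter in ASCII_TO_PALI.items():
        text = text.replace(ascii_form, letter)
    return unicodedata.normalize("NFC", text)


for letter in ASCII_TO_PALI.values():
    print(letter, *describe(letter))

print(from_ascii("Paa.li"))  # Pāḷi
```

A table-driven replacement like this keeps the mapping in one place, so extending it to the additional Sanskrit letters would only mean adding entries to the dictionary.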

The Tibetan Digital Library people also have a good chart of relevant diacritics.

(In the distant past, like five years ago, there were various gnarly work-arounds, including the now-deprecated Times Norman / Normyn font. Nobumi Iyanaga has written a useful library of scripts to convert from Times Norman / Normyn to ‘good’ Unicode.)

So, what are the practical options for a good font for transliterating Sanskrit and Pali today? It seems to me that there are at least five good choices:

Times Ext Roman

The Tibetan Digital Library people really like Times Ext Roman. But the only source for it is the Bahá’i Computer & Communication Association, and it’s not clear to me what license it’s published under, so I would be reluctant to recommend it even though I trust that it’s technically valid.

Gentium

If you can get past SIL’s Christian missionary agenda, they do outstanding linguistics work and their Gentium font is well regarded, seems complete for the purposes of transliterating Sanskrit and Pali, is widely accepted, is under active development and is licensed under a good, if idiosyncratic, open source license. It’s a nice-looking typeface, in my opinion.

IndUni

John Smith has recently updated this font family. It’s exactly designed for the topic under discussion, “the representation of Indian-language (and similar) material in Roman script using the Unicode character set.” But he’s just one, albeit committed, guy and I don’t know what license he’s publishing these under, so I worry about its long-term supportability. But worth mentioning; sort of in the same category to me as Times Ext Roman.

TransIndic Transliterator

There’s also a commercial product, TransIndic Transliterator in Unicode, from Linguist’s Software, that seems to do the job, although I don’t know much about it. It costs $100 per typeface (they have Times, Palatino, Arial, etc.) or $250 for the whole thing. Commercial license. Paying for it has the advantage of having someone on the hook to help you with it, not a small thing.

Gandhari Unicode

This nice-looking typeface was originally designed to transcribe the newly discovered Buddhist manuscripts from Afghanistan. (Gandhari is another Middle Indic prakrit like Pali.) Gandhari Unicode is under active development, which is good, and seems widely accepted.

The license status of Gandhari Unicode is a little bit troubling; it’s based on work licensed under the “Aladdin Free Public License” which isn’t, despite the name, a free public license. The Free Software Foundation considers it a non-free license. Other parts of Gandhari Unicode are GPL-derived but I don’t understand which takes precedence. Note that the link in Andrew Glass’s documentation to the Aladdin license (at Wisconsin) is out of date.

[26 August 2009 update: According to reliable reports, there are issues with Gandhari Unicode’s spacing, especially italics, when printed.]

2 thoughts on “Transliterating Sanskrit and Pali [updated]”

Benct Philip Jonsson on 5 October 2009 at 12:49 am said:

The “diacritics are distracting” argument is just so daft for us whose native languages use diacritics, and/or who regularly read and write foreign languages which use them. I have on occasion had to read passages in my native language Swedish where the diacritics had been ‘stripped away’, and in fact a native reader cannot fluently read such a passage because it is distracting to have to mentally insert the diacritics from context, e.g. to decide whether the letters really are the word _har_ ‘has’ or the word _här_ ‘here’, which neither sound nor mean the same; these words are as different from each other as either of them is from _hur_ ‘how’! Furthermore the context is not always of help: it would for example be relatively easy to write a passage in Sanskrit where a word would be ambiguous between _nāgam_ ‘the snake’, _nagam_ ‘him who doesn’t go’ and _nagām_ ‘her who doesn’t go’!
I’d like to ask those English speakers who consider diacritics mere adornments to consider vhat it vould pe like to pe presented vith a pook vhere the vriter hadnt pothered to distinguish p, p and p, v and v, or to vrite out any apostropes; it vouldnt be impossiple to read, since most vords vouldnt be ampiguous anyvay, put it vould pe distracting. Omitting diacritics in languages which use them amounts to the same thing!
John Whalen-Bridge on 25 March 2011 at 1:22 am said:

Hello. “Satori” should be on the list, even if dictionaries forgot to put it in. It’s in everyday use.
