CAASL3
Third Workshop on
Computational Approaches to Arabic Script-based Languages

Machine Translation Summit XII, Ottawa, Ontario, Canada
August 26, 2009

Session 1

STeP-1: Standard Text Preparation for Persian Language
Mehrnoush Shamsfard, Soheila Kiani, and Yaser Shahedi (Shahid Beheshti University)

Many NLP applications need a preprocessing step that converts the input into an appropriate form or format. Preprocessing may include segmenting text into sentences, words and phrases, checking and correcting spelling, performing lexical and morphological analysis, and so on. The output of this phase should be a list of correct, standard tokens with a unique encoding, spelling and written form. In this paper we introduce a Persian text preprocessor called STeP-1. STeP-1 performs a combination of tokenization, spell checking and morphological analysis. It converts Persian texts written in different prescribed forms into a series of tokens in the standard style introduced by the Academy of Persian Language and Literature (APLL). Experimental results show very good performance.
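
As a rough illustration of what "unique coding" and tokenization can involve for Persian text, here is a minimal sketch. It is not STeP-1's implementation: the character mapping, the function names and the token pattern are assumptions chosen for the example, and spell checking and morphological analysis are left out entirely.

```python
import re

# Assumed mapping: Arabic code points that are commonly unified with
# their Persian counterparts so that each letter has a single encoding.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
_TRANSLATION = str.maketrans(CHAR_MAP)

# A token is a run of word characters, possibly containing the zero-width
# non-joiner (U+200C) that links the parts of one Persian word, or a
# single punctuation character.
TOKEN_RE = re.compile(r"[\w\u200C]+|[^\w\s]")


def normalize(text):
    """Replace the mapped Arabic code points with Persian ones."""
    return text.translate(_TRANSLATION)


def tokenize(text):
    """Normalize the text, then split it into word and punctuation tokens."""
    return TOKEN_RE.findall(normalize(text))
```

Keeping the zero-width non-joiner inside a token means that a word written with an attached suffix stays one token instead of being split in two.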

Construction of a Persian letter-to-sound conversion system based on classification and regression tree
Ali Azimizadeh and Mohammad Mehdi Arab (Azad University, Mashhad)

The Persian writing system, like those of other Arabic script-based languages, is special because some vowels are omitted in its standard orthography. The lack of these vowels causes problems in Text-To-Speech systems, because a full transcription of each word is needed for synthesis. A Letter-To-Sound conversion system is therefore necessary for Text-To-Speech systems, since it is not possible to list all the words of a language with their corresponding pronunciations in a lexicon. In this paper, we present a Persian Letter-To-Sound conversion system based on Classification and Regression Trees (CART). The training data is a lexicon of 32,000 words with their corresponding pronunciations, extracted from Persian linguistic database corpora. The CART is built with Wagon, a tool from the Edinburgh Speech Tools for constructing decision trees in Festival. The final accuracy of this system is 93.614%, which means that it predicts the pronunciation of Persian words with accuracy comparable to that of the corresponding system for English in Festival, which is 94.6% accurate. The accuracy of the implemented Persian Letter-To-Sound system in Festival is also higher than that of previous systems implemented outside Festival.
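
Decision-tree letter-to-sound models are usually trained on one example per letter, where each example describes the letter by its neighbours. The sketch below shows only that feature-extraction step, under assumed choices (the function name, the window width and the padding symbol are all hypothetical); it does not reproduce the authors' feature set or call Wagon.

```python
def letter_windows(word, width=2, pad="#"):
    """Return one feature tuple per letter of `word`.

    Each tuple holds the letter itself with `width` neighbouring letters
    on each side; positions beyond the word boundary are filled with `pad`.
    """
    padded = pad * width + word + pad * width
    span = 2 * width + 1
    return [tuple(padded[i:i + span]) for i in range(len(word))]
```

For training, each window would be paired with the phoneme (or an empty symbol) aligned to its centre letter, and the tree would learn to predict that label from the surrounding letters.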

Session 2

Corpus-based analysis for multi-token units in Persian
Massoud Sharifi-Atashgah and Mahmood Bijankhan (Tehran University)

Morphological and syntactic annotation of multi-token units confronts several problems due to the concatenating nature of Persian script and the resulting orthographic variation. In the present paper, the compositional, non-compositional and semi-compositional constructions are described through an analysis of the different collocation types of the tokens. To explain these constructions, static and dynamic multi-token units are then introduced for the non-generative and generative structures of verbs, infinitives, prepositions, conjunctions, adverbs, adjectives and nouns. Defining the multi-token unit templates for these categories is one of the important results of this research. The findings can serve as input to the Persian Treebank generator system. Machine translation systems that use rule-based methods to parse texts can also utilize the results in text segmentation and parsing.

Automatic translation between English and Persian texts
Chakaveh Saedi and Yasaman Motazadi (Islamic Azad University), and Mehrnoush Shamsfard (Shahid Beheshti University)

PEnTrans is an automatic bidirectional English/Persian text translator. It contains two main modules, PEnT1 and PEnT2. PEnT1 translates English sentences into Persian and PEnT2 performs translation from Persian to English. Word sense disambiguation (WSD), an important part of translation, is done in both systems: PEnT1 employs a combination of extended dictionary and corpus-based approaches, and PEnT2 employs a combination of rule-based, knowledge-based and corpus-based approaches. In this paper, we introduce PEnTrans and its components and propose a new WSD method that uses a hybrid measure to score the different senses of a word, and that also scores the senses of a word according to its condition in a sentence.

Automatic extraction of lemma-based bilingual dictionaries for morphologically rich languages
Nizar Habash (Columbia University) and Ibrahim Saleh (Georgetown University)

We present an approach for automatic extraction and filtering of a lemma-based comprehensive up-to-date Arabic-English machine-readable dictionary from parallel corpora. Comparing the results of our system to a manually built dictionary shows a high degree of coverage complementarity and faster creation time.

Session 3

NP subject detection in verb-initial Arabic clauses
Spence Green, Conal Sathi, and Christopher D. Manning (Stanford University)

Phrase re-ordering is a well-known obstacle to robust machine translation for language pairs with significantly different word orderings. For Arabic-English, two languages that usually differ in the ordering of subject and verb, the subject and its modifiers must be accurately moved to produce a grammatical translation. This operation requires more than base phrase chunking and often defies current phrase-based statistical decoders. We present a conditional random field sequence classifier that detects the full scope of Arabic noun phrase subjects in verb-initial clauses at the Fβ=1 = 61.3% level, a 5.0% absolute improvement over a statistical parser baseline. We suggest methods for integrating the classifier output with a statistical decoder and present preliminary machine translation results.
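
The F-score quoted in the abstract combines precision and recall; with β = 1 it is their harmonic mean. A minimal sketch of the standard formula follows. The function name is an assumption, and the abstract does not report the precision and recall values behind the 61.3% figure, so none are supplied here.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-measure: (1 + beta^2) * P * R / (beta^2 * P + R).

    With beta = 1 this is the harmonic mean of precision and recall.
    Returns 0.0 when both precision and recall are zero.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator
```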

A unification-based approach to the morphological analysis and generation of Arabic
Selçuk Köprü and Jude Miller (AppTek)

In this paper, we present a powerful Arabic morphological analyzer and generator. The approach employs finite state machines enriched with unification capability. The presented system is used as a component in both statistical and rule-based machine translation systems. We give detailed illustrations of how we handle nominal and verbal morphology in Arabic. Issues regarding derivational morphology and morphological generation are also addressed. Along the way, we explore stimulating problems particular to Arabic and our solutions to them. An evaluation of the system is presented at the end.

Endoclitics in Pashto: Can they really do that?
Craig Kopris (AppTek)

A cross-linguistically very rare type of clitic, the endoclitic, occurs in Pashto. Like infixes, endoclitics can be inserted inside of a word, splitting words apart into separate non-adjacent pieces which themselves might not have any meaning. Unlike infixes, however, endoclitics are not inflections; their meaning is unrelated to that of their host word. This paper discusses some of the problems endoclitics cause for processing Pashto, both written and spoken.

Session 4

Developing English-Urdu Machine Translation via Hindi
R. Mahesh K. Sinha (Indian Institute of Technology)

The paper presents a strategy for deriving English-to-Urdu translation using an English-to-Hindi MT system. The English-Hindi lexical database is used to collect all possible Hindi words and phrases. These are further augmented by including their morphological variations and attaching all possible postpositions. This list is used to provide a mapping from Hindi to Urdu. There may be a change in gender, and a word or word group may belong to multiple parts of speech. These ambiguities are resolved using information available from the English-Hindi MT system. As Urdu is structurally very close to Hindi and uses similar postpositions, the output obtained is as acceptable as the Hindi translation.

Investigations on standard Arabic geographical classification
Ahmed Abdelali and Steve Helmreich (New Mexico State University), and Ron Zacharski (University of Mary Washington)

This paper reports on a series of studies focused on the geographical classification of Standard Arabic. The studies examined documents from newspapers in five countries: Egypt, Libya, Sudan, Syria, and the U.K. Methods used were over 99% accurate in geographically classifying these documents.


Other papers: Some papers are included in the Proceedings but were not presented at the workshop.