September 24, 2014

The University of Edinburgh (United Kingdom) and the Technology Development for Indian Languages (TDIL) of the Indian government have released a Hindi-Punjabi machine translation system developed by a faculty member from the Punjabi University, Mr. Ajit Singh, an assistant professor at the MM Modi College, with help from the university’s Computer Science department. The software has recently been tested at both organizations.

Vishal Goyal, assistant professor at the department of Computer Science at the Punjabi University, said: “the software has been made available online on the servers of the Edinburgh University and the TDIL.” He added that the software is around 94% accurate, a much higher figure than for English-based systems.

Punjabi University vice-chancellor Jaspal Singh congratulated the department and the faculty on their efforts, which brought international recognition to the University.

Development work

Initially, Mr Singh installed both the Moses server and the Web server on a single machine running Linux. He then tested the system on localhost and confirmed that it worked.

However, once the system was installed on a Web server for general public use, only part of it worked: for most input, no translation came back. Instead, the input text came back merely transliterated by the post-processing script, transliterate.pl.

The development is unique because it uses Moses with a pair of non-Western languages, testing Unicode compatibility to its limits. It took several months to develop the system, as the developers ran into problems because translate.cgi expects many copies of daemon.pl to be running, all listening on different ports and each wrapping a different instance of Moses. The web-based translation scripts had been written before Moses had threads, so multi-threading was not an option. The Hindi-Punjabi machine translation system therefore had to be multi-process.
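The multi-process layout described above can be pictured with a small sketch. It is written in Python for illustration only and is not the project's translate.cgi or daemon.pl code; the port numbers, the "hi-pa" label and the PortDispatcher class are all assumptions. It shows just the routing idea: a front end that spreads requests across several single-threaded worker processes, each listening on its own port.

```python
import itertools


class PortDispatcher:
    """Round-robin choice of a worker port for each language pair.

    Assumption: every port belongs to one separate daemon process that
    wraps its own Moses instance, so spreading requests across ports
    gives concurrency without threads.
    """

    def __init__(self, ports_by_pair):
        # ports_by_pair maps a language-pair label to a list of ports.
        self._cycles = {}
        for pair, ports in ports_by_pair.items():
            if not ports:
                raise ValueError(f"no ports configured for {pair!r}")
            self._cycles[pair] = itertools.cycle(list(ports))

    def next_port(self, pair):
        """Return the port of the next worker for this language pair."""
        if pair not in self._cycles:
            raise KeyError(f"no workers configured for {pair!r}")
        return next(self._cycles[pair])


# Hypothetical configuration: three workers for Hindi -> Punjabi.
dispatcher = PortDispatcher({"hi-pa": [9001, 9002, 9003]})
```

A front-end script would call dispatcher.next_port("hi-pa") for each request and open a connection to that port. How the text is then exchanged with the worker depends on the daemon's own protocol, which is not covered here.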

The development is based on the direct approach. It includes a Preprocessing module (Text Normalization, Replacing Collocations, Replacing Proper Nouns), a Translation Engine (Identifying Surnames, Identifying Titles, Lexicon Lookup, Word Sense Disambiguation, Inflection Analysis, Transliteration) and a Post-processing module. The developers of the software claim an accuracy of about 94% on the basis of an intelligibility test (human evaluation). They are also working to improve the system's accuracy further.
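As a rough illustration of the direct approach, the sketch below chains three of the stages named above: text normalization, lexicon lookup, and transliteration as a fallback. It is a toy in Python, not the developers' implementation. The two lexicon entries are illustrative, and the transliteration step is a deliberately crude stand-in that relies on the Devanagari (U+0900 to U+097F) and Gurmukhi (U+0A00 to U+0A7F) Unicode blocks sharing a parallel layout. All function names are hypothetical.

```python
import unicodedata

# Toy Hindi -> Punjabi lexicon (illustrative entries only).
LEXICON = {
    "पानी": "ਪਾਣੀ",  # water
    "घर": "ਘਰ",      # house
}

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
GURMUKHI_OFFSET = 0x0100  # distance between the two Unicode blocks


def normalize(text):
    """Preprocessing: Unicode NFC normalization and whitespace cleanup."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def transliterate(word):
    """Crude fallback: shift each Devanagari code point into Gurmukhi.

    Some shifted code points are unassigned in Gurmukhi, so a real
    system needs hand-written rules; this only shows the idea.
    """
    out = []
    for ch in word:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            out.append(chr(cp + GURMUKHI_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def translate_word(word):
    """Translation engine: lexicon lookup first, then transliteration."""
    return LEXICON.get(word) or transliterate(word)


def translate(text):
    """Run the toy pipeline word by word, keeping the source word order."""
    return " ".join(translate_word(w) for w in normalize(text).split())
```

Translating word by word in source order is only plausible because the two languages are so closely related. The lexicon matters even so: the entry for "water" differs from its plain transliteration by one consonant, which the code point shift alone would get wrong.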

Why is a Machine Translation Hindi-Punjabi System Important?

Punjabi is an Indo-Aryan language, a descendant of the Shauraseni language, which was the chief language of mediaeval northern India. Today it has 130 million speakers worldwide, which makes it the 10th most widely spoken language in the world (data from 2013). Punjabi is the native language of the historical Punjab region, now divided between Pakistan and India. It is the most widely spoken language in Pakistan, where half of the population speak it as their mother tongue. However, Punjabi has not been granted official status at the national level there, despite being the most spoken language and the provincial language of Punjab, the second largest and most densely inhabited province of Pakistan.

In India, Punjabi is a minority language of the north, the first official language of the Indian State of Punjab and one of the 22 scheduled languages. Nationwide, it is the 11th most spoken language in the country.

Out of a population of 1.2 billion Indians, over 33 million people spoke Punjabi in India in 2011, which represents 2.73% of the population. Across the border, the latest official figures point to close to 82 million speakers in Pakistan in 2012. Outside the Indian subcontinent, Punjabi is the 4th most common language in the United Kingdom (1.5 million in 2012) and Canada (around 500,000 in 2011), countries with large immigrant communities.


The Punjab and India showing Punjabi and Hindi names in local language – Courtesy of Google Maps

Hindi also belongs to the Indo-Aryan group, part of the Indo-Iranian branch of the Indo-European language family. It is the 4th most widely spoken language in the world, a figure that includes not only Hindu speakers of Hindustani but also people in the so-called Hindi belt who identify as native speakers of related languages and consider their speech to be a dialect of Hindi. Hindi is the mother tongue of 425 million people and a second language to some 120 million more in India. In India, most government documentation is prepared in three languages: English, Hindi, and the primary official language of the local state, when the majority of the inhabitants of the state do not speak Hindi or English.

The dialect upon which Standard Hindi is based is Khari boli, the vernacular speech of Delhi and the surrounding western Uttar Pradesh and southern Uttaranchal region. This dialect acquired linguistic prestige during the Mughal Empire (1600s) and became known as Urdu, “the language of the court”.

We thus face a historic, “very Indian” tongue in Punjabi, once spoken by Aryan conquerors, whose speakers are now found mostly in neighboring Pakistan. Hindi is the national language of India. Hindi and Punjabi are a closely related language pair, which points to former unity. There is a sense of unity, and many links unite the Punjabi communities on either side of the border, which has been subject to conflict for many decades. There is thus a case for machine translation between Punjabi and the national tongue, Hindi.

Hindi-Punjabi Writing Systems

Majhi-Standard Punjabi is the written standard for Punjabi in both parts of Punjab. In Pakistan, Punjabi is generally written using the Shahmukhī script, created from a modification of the Persian Nastaʿlīq script. In India, Punjabi is most often rendered in the Gurmukhī script, though it is also written in the Devanagari or Latin scripts due to influence from Hindi and English, India’s two main official languages.

There are thus two main ways to write Punjabi: Gurmukhi and Shahmukhi. The word Gurmukhi translates as “from the Guru’s mouth”, while Shahmukhi means “from the King’s mouth”. You can see the two different writing systems in the map above: in the Punjab province of Pakistan, the script used is Shahmukhi, which differs from the Urdu alphabet and has four more letters. East Punjab, located in India, is divided into three states. In the state of Punjab, the Gurmukhī script is generally used for writing Punjabi.

Linguistically, Hindi and Urdu are the same language, but there are historical reasons why they are treated as separate. Hindi is written in the Devanagari script and uses more Sanskrit words, whereas Urdu is written in the Persian script and uses more Persian words.

Internet and Technology

Hindi has a very strong presence on the internet. India is home to many SEO specialists and its software designers are reputed all over the world. However, due to lack of standard encoding, many search engines cannot locate text written in Hindi. Nevertheless, Hindi is one of the seven languages of India that can be used to make web addresses.

Interestingly, Hindi has lent words to technology: ‘avatar’ means “a spirit taking a new form”. The word has been used in computer science, films, artificial intelligence and even robotics.

