Due to time constraints, I am posting this draft in English. The idea of a grammar-based spell checker for Tamil was proposed by Mauran and drafted by me in the Thamizha! group, but it has yet to materialise. I welcome all Tamil computing enthusiasts to help realise this project.
I see an immediate possibility of building a grammar-based spell checker for Tamil. Similar approaches exist for other languages, but they usually rely on the statistical occurrence of common mistakes rather than on rules. Given our current human resources and the urgency of the issue, we can take a rule-based approach grounded in grammar.
This approach could eliminate virtually all typos in Tamil, if not all spelling mistakes.
Here is what we need to do.
1. Compile a database of letter sequences that Tamil grammar does not allow.
It can be done as follows:
1a. Compile a 247×247 matrix of Tamil letters, remove the allowed combinations, and upload the rest to the database.
For example:
அஅ – spelling mistake
அஆ – spelling mistake
அஇ – spelling mistake
அஈ – spelling mistake
...and so on for அ against all 247 Tamil letters. Upload the spelling mistakes to the database and leave out the rest.
Repeat this procedure for each Tamil letter against all 247 Tamil letters.
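The matrix in 1a could be generated by a script rather than by hand. The following is only a minimal sketch: the names `tamil_letters`, `is_allowed` and `forbidden_pairs` are made up for illustration, and the rule inside `is_allowed` is a placeholder assumption that merely generalises the அஅ, அஆ, அஇ, அஈ examples above (a standalone vowel as the second letter). The real rule set still has to be written and checked by people who know the grammar.

```python
from itertools import product

VOWELS = list("அஆஇஈஉஊஎஏஐஒஓஔ")            # 12 vowels
CONSONANTS = list("கஙசஞடணதநபமயரலவழளறன")    # 18 consonants (base forms)
# Vowel signs written as escapes; "" stands for the inherent "a" (க, ச, ...).
VOWEL_SIGNS = ["", "\u0bbe", "\u0bbf", "\u0bc0", "\u0bc1", "\u0bc2",
               "\u0bc6", "\u0bc7", "\u0bc8", "\u0bca", "\u0bcb", "\u0bcc"]
PULLI = "\u0bcd"    # the dot that marks a pure consonant such as க்
AYTHAM = "\u0b83"   # ஃ


def tamil_letters():
    """Return the 247 letters: 12 vowels + 18 pure consonants + 216 compounds + aytham."""
    pure = [c + PULLI for c in CONSONANTS]
    compound = [c + s for c in CONSONANTS for s in VOWEL_SIGNS]
    return VOWELS + pure + compound + [AYTHAM]


def is_allowed(first, second):
    """PLACEHOLDER rule (an assumption, not the real grammar):
    reject any pair whose second letter is a standalone vowel."""
    return second not in VOWELS


def forbidden_pairs(letters, allowed=is_allowed):
    """Walk the full letters x letters matrix and keep only the disallowed pairs."""
    return [a + b for a, b in product(letters, repeat=2) if not allowed(a, b)]


letters = tamil_letters()
forbidden = forbidden_pairs(letters)
print(len(letters), len(letters) ** 2, len(forbidden))
```

Keeping the rule in a separate function means the matrix walk stays the same while the grammar rules are refined or replaced.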
1b. To address the issue of சந்தி (sandhi), we create rules like the ones below.
ச்+space+(ச, சா, …சௌ) may not be a typo or spelling mistake, so we do not add it to the database.
ச்+space+any letter other than (ச, சா, …சௌ) is a spelling mistake, so we add it to the database.
Repeat the procedure for க், த், ப்.
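The rule in 1b can also be checked directly on running text instead of being expanded into database rows. Here is a rough sketch under that assumption; the function name `sandhi_violations` and the shape of its return value are hypothetical, and it only handles a single ordinary space between words.

```python
import re

PULLI = "\u0bcd"              # the dot that marks a pure consonant
SANDHI_CONSONANTS = "கசதப"    # base letters of க், ச், த், ப் from the rule above

# A word-final க்/ச்/த்/ப், one space, then a lookahead at the next word's first character.
_BOUNDARY = re.compile(rf"([{SANDHI_CONSONANTS}]){PULLI} (?=(\S))")


def sandhi_violations(text):
    """Return (position, final consonant, next character) for every place where
    a word ends in க்/ச்/த்/ப் but the next word does not start with the same
    consonant series (ச, சா, ... for ச், and so on)."""
    hits = []
    for m in _BOUNDARY.finditer(text):
        final, nxt = m.group(1), m.group(2)
        # The character after the next word's first letter: a pulli there means
        # the word starts with the pure consonant itself, which is not in the series.
        after = text[m.end() + 1 : m.end() + 2]
        if nxt != final or after == PULLI:
            hits.append((m.start(), final + PULLI, nxt))
    return hits


print(sandhi_violations("அதைப் பார்"))    # ப் followed by பா
print(sandhi_violations("அதைப் சொல்"))    # ப் followed by சொ
```

Because a compound letter is stored as the base consonant plus a vowel sign, comparing the first character of the next word with the final consonant covers the whole series in one test.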
1c. To address the issue of ஒற்றெழுத்து (pure consonant letters) and தொடக்க எழுத்து (word-initial letters), we create rules as follows:
space+உயிரெழுத்து = allowed
space+மெய்யெழுத்து (pure consonant) = spelling mistake, so we add it to the database. Note that space+மெய்யெழுத்து can occur in non-Tamil words and names, so we mention this as a hint in the error message.
We then create rules for உயிரெழுத்து+space, மெய்யெழுத்து+space, உயிர்மெய்யெழுத்து+space and space+உயிர்மெய்யெழுத்து on a case-by-case basis. Note that உயிரெழுத்து+space can occur in poetry.
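As an illustration of the space+மெய்யெழுத்து rule in 1c, the sketch below flags words that begin with a pure consonant and attaches the hint about non-Tamil words and names. The function names and the wording of the hint are assumptions for the example; the other boundary cases would need their own rules once they are decided.

```python
PULLI = "\u0bcd"   # the dot that marks a pure consonant


def is_tamil_consonant(ch):
    """True for code points in the Tamil consonant range (க to ஹ, which also
    covers the Grantha letters such as ஜ and ஸ)."""
    return "\u0b95" <= ch <= "\u0bb9"


def word_initial_warnings(text):
    """Flag every word that starts with a pure consonant (consonant + pulli)."""
    hint = ("word starts with a pure consonant; "
            "this may be a non-Tamil word or a name")
    warnings = []
    for word in text.split():
        if len(word) >= 2 and is_tamil_consonant(word[0]) and word[1] == PULLI:
            warnings.append((word, hint))
    return warnings


for word, message in word_initial_warnings("ப்ரியா வந்தாள்"):
    print(word, "->", message)
```

The name ப்ரியா in the usage line is exactly the kind of case the hint is meant for: the rule fires, but the message tells the writer it may be intentional.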
People like Mauran and me can take responsibility for creating the matrix and rules above and for checking them, or I can do it all myself if Mauran has time constraints. What I need guidance on is the file format and text format in which this database should be created so that a Firefox extension or any word processor can read it. I need technical guidelines on this aspect.
2. Now this database is to be converted into a Firefox extension.
Things to do here:
Regular dictionaries in Firefox or word processors look up the input word in the database and, if it is not there, flag it as a spelling mistake. Our approach should do the inverse: flag a word only if it contains a sequence that is in the database. If this requires a completely new dictionary script to be written, technical people like Mugunth, Sethu and Sundar can help.
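To make the difference between the two lookups concrete, here is a small sketch of both checks side by side. It is not tied to Firefox or to any existing dictionary format; all function names are hypothetical, and `FORBIDDEN` holds only the four example pairs from 1a rather than a real database.

```python
import unicodedata


def split_letters(word):
    """Split a word into Tamil letters by attaching each combining mark
    (vowel sign or pulli) to the base character before it."""
    letters = []
    for ch in word:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def forbidden_hits(word, forbidden):
    """Return every adjacent letter pair in the word that is in the forbidden set."""
    letters = split_letters(word)
    return [a + b for a, b in zip(letters, letters[1:]) if a + b in forbidden]


def is_misspelled_conventional(word, dictionary):
    """Usual approach: a word is wrong if it is NOT in the word list."""
    return word not in dictionary


def is_misspelled_inverse(word, forbidden):
    """Proposed approach: a word is wrong only if it CONTAINS a forbidden sequence."""
    return bool(forbidden_hits(word, forbidden))


FORBIDDEN = {"அஅ", "அஆ", "அஇ", "அஈ"}   # illustration only: the examples from 1a

print(is_misspelled_inverse("அம்மா", FORBIDDEN))
print(forbidden_hits("அஅம்மா", FORBIDDEN))
```

With the inverse check, a word that is absent from every word list still passes as long as it breaks no rule, which is what lets the database stay small compared with a full dictionary.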
After creating such a script, the database should be embedded in it, and the extension should be uploaded to the Firefox repository or, initially, distributed from a private site.
Besides creating a Firefox-supported dictionary, we should take care to do this in a way that can support any word processing application.