"One who lives among Kannada-speaking people and earns his bread among them, yet tries to keep all his knowledge, thought, and ideas in some other language, is a worse and far subtler exploiter than the American exploiters who grew rich on the labour of Negro slaves."
Re: Is there an OCR for Kannada?
Here is a bit of information I just came across. See http://www.valaconf.org.au/vala2006/papers2006/91_Ganapathiraju_Final.pd...
OCR in Indian Languages - Kannada
Designing an accurate OCR for Indian languages is one of the greatest challenges in
computer science. Unlike European languages, Indian languages have more than 300
characters to distinguish, a task an order of magnitude harder than distinguishing 26
characters. This also means that the training set needed is significantly larger for Indian
languages. It is estimated that a corpus of at least ten million words would be needed in
any font to recognise text with acceptable accuracy in Indian languages. The Digital
Library of India (DLI) is expected to provide such a phenomenally large amount of data for
training and testing OCRs in Indian languages. For this purpose, much of the content has
been entered manually in addition to the scanned images. Using this extremely large
repertoire of data, a Kannada OCR has been developed. The accuracy we currently get is
around 96-97% on clean documents scanned at 400 dots per inch, and 40-50% if the image is
of bad quality. This OCR is currently being improved and is also being extended to other
Indian languages, including Tamil.
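
For anyone wondering what a figure like 96-97% could mean in practice, here is a rough sketch of one common way to score OCR output: character accuracy based on edit distance against a manually entered reference text. This is only an illustration. The excerpt does not say how its accuracy was measured, and the function names edit_distance and character_accuracy are made up for this example. Note also that it counts Unicode code points, which is an assumption; a Kannada akshara such as "ನ್ನ" spans several code points, so a per-akshara score could differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match if cost is 0
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """1 - (edit errors / reference length), clamped at 0 (hypothetical metric)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    reference = "ಕನ್ನಡ"   # manually entered text (5 code points)
    ocr_text = "ಕನ್ನಢ"    # OCR output with the last letter misread
    print(f"{character_accuracy(reference, ocr_text):.2f}")
```

With a score like this, 96-97% would mean roughly three or four wrong characters in every hundred, while 40-50% would mean that more than half of the text has to be retyped.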