He had been training his model on clean, printed English documents. Real-world text, however, was messy, inconsistent, and multilingual. Arjun needed OCR datasets that reflected the chaos of the real world — including handwritten scripts, multilingual fonts, and images with poor lighting. Determined to change this, Arjun began building his own OCR dataset. He gathered thousands of documents — from street signs in Hindi to grocery bills in Tamil, historical letters in Urdu to shop boards in English. He and his team used annotation tools to mark text regions, transcribe content, and tag language types. Soon, the OCR dataset grew into a multilingual, multi-format goldmine. With this rich training data, his AI system began to improve — reading not just perfectly printed text, but faded ink, cursive handwriting, skewed receipts, and even overlapping words. The results were astonishing. Government departments approached him to digitize records, schools used the system to convert handwritten notes into digital textbooks, and historians used it to preserve ancient scripts.