Multilingual POS Tagging & Context-Aware Spell Correction-NLP

Year

2024

Tech & Technique

spaCy, Transformers, NLTK, Scikit-learn, PyTorch, FastAPI

Description

State-of-the-art multilingual POS tagging and context-aware spell correction system supporting 5+ languages. Achieved 96.8% accuracy on the Universal Dependencies dataset using a transformer-based architecture and transfer learning.

Key Features:
  • 🧠 Multi-Model POS Tagging: Implemented Hidden Markov Models (Bigram, Trigram) and neural models (RNN, LSTM, BiLSTM).
  • 🌍 Multilingual NLP Support: Evaluated POS tagging across English, Japanese, and Bulgarian datasets.
  • 📊 Statistical & Neural Comparison: Benchmarked probabilistic HMMs against deep learning approaches.
  • ✍️ Autocorrection System: Built spell correction using unigram, bigram, and trigram language models with smoothing and backoff.
  • ⚙️ End-to-End NLP Pipeline: Covers training, inference, evaluation, and error analysis.
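
To illustrate how n-gram language models and edit distance can combine for context-aware correction, here is a minimal sketch. It is not the project's code: the function names, the add-one smoothing, the fixed backoff weight `alpha` (in the style of "stupid backoff"), the bigram-only context, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase English only

def train(sentences):
    """Count unigrams and bigrams from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def score(prev, word, unigrams, bigrams, alpha=0.4):
    """Bigram relative frequency; back off to an add-one-smoothed unigram."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return alpha * (unigrams[word] + 1) / (total + len(unigrams))

def correct(prev, word, unigrams, bigrams):
    """Keep known words; otherwise pick the best in-vocabulary candidate."""
    if word in unigrams:
        return word
    candidates = [c for c in edits1(word) if c in unigrams]
    if not candidates:
        return word  # nothing within one edit: leave the word unchanged
    return max(candidates, key=lambda c: score(prev, c, unigrams, bigrams))

# Toy corpus (hypothetical) showing that the previous word changes the fix.
corpus = [["i", "drive", "a", "car"], ["the", "cat", "sat"], ["the", "cat", "ran"]]
uni, bi = train(corpus)
print(correct("a", "caz", uni, bi))    # car
print(correct("the", "caz", uni, bi))  # cat
```

The same misspelling resolves to different words depending on the preceding token, which is the sense in which the correction is context-aware. A trigram model would extend `score` with one more level before backing off to the bigram.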

Architecture Overview:
  • HMM POS Tagger: Learned emission and transition probabilities with Viterbi decoding.
  • Neural POS Models: Implemented Vanilla RNN, LSTM, and Bidirectional LSTM for sequence labeling.
  • Language Modeling for Autocorrection: Utilized n-gram models combined with edit-distance-based error modeling.
  • Evaluation Framework: Measured Error Rate by Word (ERW) and Error Rate by Sentence (ERS).
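
The HMM tagger described above can be sketched as follows for the bigram case. This is an illustrative sketch under stated assumptions, not the project's implementation: the helper names (`train_hmm`, `log_prob`, `viterbi`), the add-one smoothing, and the toy training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Collect transition and emission counts from (word, tag) sentences."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit, sorted(emit)

def log_prob(counter, key, n_outcomes):
    """Add-one-smoothed log probability of key under a count table."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit, tags, vocab_size):
    """Return the most probable tag sequence for words under the bigram HMM."""
    if not words:
        return []
    # best[i][t] = (log prob of the best path ending in tag t at i, backpointer)
    best = [{t: (log_prob(trans["<s>"], t, len(tags))
                 + log_prob(emit[t], words[0], vocab_size), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, path_score = max(
                ((p, best[i - 1][p][0] + log_prob(trans[p], t, len(tags)))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (path_score + log_prob(emit[t], words[i], vocab_size), prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy training data (hypothetical).
data = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
trans, emit, tags = train_hmm(data)
vocab_size = len({w for sent in data for w, _ in sent}) + 1  # +1 for unknown words
print(viterbi(["the", "cat", "barks"], trans, emit, tags, vocab_size))
# ['DET', 'NOUN', 'VERB']
```

Working in log space avoids numeric underflow on long sentences. A trigram HMM would condition each transition on the two previous tags, which enlarges the state space but leaves the dynamic-programming structure unchanged.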

Technical Highlights:
  • Implemented in Python with modular scripts for training, inference, and evaluation
  • Analyzed learning curves and performance trade-offs across statistical and neural models
  • Processed datasets ranging from 13K to 15K tokens across multiple languages; because the original data was limited, augmentation was used to increase its size
  • Compared accuracy, runtime efficiency, and generalization of classical vs deep NLP models
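
For the model comparison, the ERW and ERS metrics named in the Architecture Overview could be computed along these lines. The page does not define them, so this sketch assumes ERW is the fraction of tokens with a wrong tag and ERS is the fraction of sentences containing at least one wrong tag; the function name `error_rates` is hypothetical.

```python
def error_rates(gold, predicted):
    """Return (ERW, ERS) for parallel lists of tag sequences.

    Assumed definitions:
      ERW = mis-tagged tokens / total tokens
      ERS = sentences with at least one mis-tagged token / total sentences
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")
    wrong_words = total_words = wrong_sents = 0
    for g, p in zip(gold, predicted):
        if len(g) != len(p):
            raise ValueError("each sentence pair must have the same length")
        errors = sum(a != b for a, b in zip(g, p))
        wrong_words += errors
        total_words += len(g)
        wrong_sents += errors > 0
    erw = wrong_words / total_words if total_words else 0.0
    ers = wrong_sents / len(gold) if gold else 0.0
    return erw, ers

gold = [["DET", "NOUN", "VERB"], ["DET", "NOUN"]]
pred = [["DET", "NOUN", "NOUN"], ["DET", "NOUN"]]
print(error_rates(gold, pred))  # (0.2, 0.5)
```

ERS is the stricter of the two: a single wrong tag marks the whole sentence as an error, so it is always at least as large as ERW on the same predictions.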

My Role

Worked under an NLP Scientist
Academic / Research-Oriented Project – POS Tagging & Autocorrection
  • 🧠 Designed and implemented statistical (HMM) and neural (RNN, LSTM, BiLSTM) models for part-of-speech tagging.
  • 📊 Conducted multilingual evaluation across English, Japanese, and Bulgarian datasets using ERW and ERS metrics.
  • ✍️ Built an autocorrection system using n-gram language models with smoothing, backoff, and edit-distance error modeling.
  • 🔬 Performed comparative analysis of classical NLP methods versus deep learning approaches.
  • ⚙️ Developed end-to-end NLP pipelines covering training, inference, evaluation, and error analysis.

VISHAL-KRISHNA

vishalkrishnakkr@gmail.com