Vishwajeet Kumar
👨‍🔬 Vishwajeet Kumar
🏛️ Position: Staff Research Scientist
📚 Citations: 4,600+ on Scholar
🔬 IBM Research AI · Bengaluru, India

Vishwajeet Kumar

Staff Research Scientist — Speech & Language Group
IBM Research Lab, India

Researcher focused on Large Language Models for RAG and text embedding for information retrieval. Joint Ph.D. from IIT Bombay & Monash University. Co-author of IBM's open-source Granite Embedding models.

30+ Papers · 10+ Patents · 4.6K+ Citations
Research Focus

What I Work On

My research sits at the intersection of language models, structured reasoning, and efficient information retrieval for enterprise AI.

🔍 Retrieval Augmented Generation
❓ Question Answering & Generation
🌐 Multilingual NLP · Indic Languages
🏗️ Knowledge Graphs & Ontologies
🧠 Large Language Models
📊 Table Understanding
🔗 Code-Mixed NLP (Hinglish)
⚡ Embedding Models & Dense Retrieval
🤖
Latest Highlight
IBM Granite Embedding R2 — Aug 2025
Co-authored the Granite Embedding model family — open-source encoder models for dense/sparse retrieval released under Apache 2.0, outperforming similar-sized public models on IBM retrieval benchmarks.
R1 Paper → R2 Paper →
Live Feed

Research Updates (10)

Papers, patents, models, and milestones — as they happen.

📄
New Paper
LMK > CLS: Landmark Pooling for Dense Embeddings
Introduces Landmark (LMK) pooling for sequence encoders — partitions input into chunks with interleaved landmark tokens, improving long-context retrieval substantially while matching existing methods on short-context tasks. A scalable alternative to [CLS] and mean pooling.
arXiv:2601.21525 · Jan 2026 Read Paper →
JAN 2026
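As a rough illustration of the landmark pooling idea described above, the following Python sketch inserts a landmark token after each fixed-size chunk and pools by averaging the hidden states at landmark positions. The token name `[LMK]`, the helper names, the fixed chunk size, and the choice to average the landmark states are all assumptions made for illustration; the paper's actual procedure may differ.

```python
from typing import List, Sequence

LMK = "[LMK]"  # hypothetical landmark token name (an assumption)


def insert_landmarks(tokens: Sequence[str], chunk_size: int) -> List[str]:
    """Append one landmark token after every chunk of `chunk_size` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    out: List[str] = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(tokens[start:start + chunk_size])
        out.append(LMK)
    return out


def landmark_pool(tokens: Sequence[str],
                  hidden: Sequence[Sequence[float]]) -> List[float]:
    """Average the hidden states found at landmark positions.

    `hidden[i]` stands in for the encoder output at position i; a real
    model would produce these vectors.
    """
    rows = [h for t, h in zip(tokens, hidden) if t == LMK]
    if not rows:
        raise ValueError("no landmark tokens found")
    dim = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]


# Toy usage with made-up hidden states (position index, constant 1.0).
seq = insert_landmarks(["a", "b", "c", "d", "e"], chunk_size=2)
fake_hidden = [[float(i), 1.0] for i in range(len(seq))]
embedding = landmark_pool(seq, fake_hidden)
```

In this toy run the landmarks land at positions 2, 5, and 7, so the pooled vector is the mean of those three rows. By contrast, [CLS] pooling would read only position 0, and mean pooling would average every row.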
📄
New Paper
Influence Guided Sampling for Domain Adaptation of Text Retrievers
Proposes Inf-DDS, an RL-driven training data sampling framework for embedding models that adaptively reweights datasets using influence-based rewards. Achieves up to a 5.03 NDCG@10 improvement on multilingual bge-m3 at 1.5–4× lower GPU cost than gradient-based baselines.
arXiv:2601.21759 · Jan 2026 Read Paper →
JAN 2026
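For readers unfamiliar with the metric quoted above, NDCG@10 can be sketched as follows. This is a minimal version using linear gains and the standard log2 rank discount; the function names are illustrative, and it assumes the input list holds the graded relevance of every judged document for the query, in ranked order.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG@k divided by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents: define the score as 0
    return dcg_at_k(ranked_relevances, k) / ideal


# One relevant document, ranked second instead of first.
score = ndcg_at_k([0, 1, 0, 0])
```

Here `score` equals 1/log2(3), roughly 0.63, because the only relevant document sits at rank 2. Scores are often reported multiplied by 100; whether that scale applies to the figure above is not stated here.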
🔒
Patent Granted
US Patent 12639351 Granted — Table-of-Contents in Differential Search Index
New US patent granted for a system and methods that induce Table-of-Contents knowledge into a Differential Search Index — advancing structured document understanding and retrieval for enterprise AI.
US 12639351 · May 26, 2026 View Patent →
MAY 2026
🤖
Model Released
Granite Embedding R2 Models
Co-authored the next-generation IBM Granite R2 embedding models — ModernBERT-based architecture trained on 2T tokens, with extended context length and improved inference optimizations. Released under Apache 2.0 for commercial & research use.
arXiv:2508.21085 Read Paper →
AUG 2025
📄
Technical Report
Granite Embedding Models — IBM's Open Retrieval Family
Released the Granite Embedding family — encoder models for dense and sparse retrieval with English and multilingual capabilities. Uses retrieval-oriented pretraining, contrastive finetuning, knowledge distillation, and model merging.
arXiv:2502.20204 · Feb 2025 Read Paper →
FEB 2025
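The contrastive finetuning step mentioned above can be illustrated with an in-batch-negative loss of the InfoNCE type. This is one common form of contrastive objective, not necessarily the one used for the Granite models; the function name, the temperature value, and the toy vectors are assumptions for illustration.

```python
import math
from typing import List, Sequence


def info_nce(query_vecs: Sequence[Sequence[float]],
             doc_vecs: Sequence[Sequence[float]],
             temperature: float = 0.05) -> float:
    """Mean contrastive loss where doc_vecs[i] is the positive for
    query_vecs[i] and every other document in the batch is a negative.
    Vectors must be non-zero; they are L2-normalised before scoring."""

    def dot(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    def normalize(v: Sequence[float]) -> List[float]:
        norm = math.sqrt(dot(v, v))
        return [x / norm for x in v]

    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]

    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, dj) / temperature for dj in d]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax of the positive
    return total / len(q)


# Toy batch: each query is aligned with its own document.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same batch with the documents swapped, so positives are mismatched.
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each query is closest to its own document (`aligned`) and large when it is closer to another document in the batch (`swapped`), which is the signal that pulls matching query-document pairs together during training.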
📄
Conference Paper
MILU: A Multi-task Indic Language Understanding Benchmark
Accepted at NAACL 2025. A comprehensive multi-task benchmark for evaluating understanding across major Indic languages, advancing multilingual NLP evaluation beyond English.
NAACL 2025
2025
📄
Short Paper
Benchmarking Zero-Shot Hindi Retrieval with Hindi-BEIR & NLLB-E5
Accepted at NAACL 2025. Introduces Hindi-BEIR for Hindi information retrieval benchmarking alongside a zero-shot retrieval model for low-resource Hindi search.
NAACL 2025
2025
🔒
Patent Granted
US Patent 12411875 Granted
US patent granted, covering NLP-based question answering systems; part of a growing IP portfolio of 10+ patents and disclosures in AI language systems.
US 12411875 · Sep 8, 2025
SEP 2025
📄
Demo Paper
PrimeQA: State-of-the-Art Multilingual Question Answering
Accepted at ACL 2023. IBM's open-source unified framework for multilingual QA research, enabling researchers to build, train, and evaluate QA models across languages at scale.
ACL 2023 Read Paper →
2023
📄
Conference Paper
Hinglish Code-Mixed NLP via Meta-Learning & Embedding Resources
Accepted at EMNLP 2022. A meta-learning framework that leverages Hindi and English language resources to significantly improve performance on Hinglish NLP tasks from the GLUECoS benchmark.
EMNLP 2022
2022
Research Output

Selected Publications

30+ peer-reviewed papers across ACL, EMNLP, NAACL, SIGIR, PAKDD, ISWC, IJCAI and more.

2027
🚀
Upcoming in 2027
Papers and research outputs in progress. Check back here for updates — or follow on Google Scholar for the latest.
🔍 RAG Research ⚡ Embedding Models 🌐 Multilingual NLP
2026
arXiv 2026
LMK > CLS: Landmark Pooling for Dense Embeddings
Meet Doshi, Aashka Trivedi, Vishwajeet Kumar, Parul Awasthy, Yulong Li, Jaydeep Sen, Radu Florian, Sachindra Joshi
Introduces Landmark (LMK) pooling — a new sequence encoding strategy that partitions input into chunks with landmark tokens, improving long-context retrieval while preserving local salient signals.
arXiv 2026
Influence Guided Sampling for Domain Adaptation of Text Retrievers
Meet Doshi, Vishwajeet Kumar, Yulong Li, Jaydeep Sen
Proposes Inf-DDS, a reinforcement learning framework that adaptively reweights training datasets using influence-based reward signals — achieving up to 5.03 NDCG@10 improvement with 1.5–4× lower GPU cost.
2025
NAACL 2025
MILU: A Multi-task Indic Language Understanding Benchmark
Sshubam Verma, Mohammed Safi Ur Rahman, et al. (incl. V. Kumar)
NAACL 2025
Benchmarking Zero-Shot Hindi Retrieval with Hindi-BEIR and NLLB-E5
Arkadeep Acharya, Rudra Murthy, et al. (incl. V. Kumar)
arXiv 2025
Granite Embedding Models (R1)
Parul Awasthy, Aashka Trivedi, ..., Vishwajeet Kumar, ..., Radu Florian
arXiv 2025
Granite Embedding R2 Models
Parul Awasthy, Aashka Trivedi, ..., Vishwajeet Kumar, ..., Radu Florian
2023
ACL 2023
PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering
Avi Sil, Jaydeep Sen, et al. (incl. V. Kumar)
2022
EMNLP 2022
On Utilizing Matrix and Embedding Language Resources to Improve Downstream Tasks in Hinglish
Vishwajeet Kumar, Rudra Murthy Venkataramana, Tejas Indulal Dhamecha
2021
EMNLP 2021
Topic Transferable Table Question Answering
Saneem Ahmed Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, et al.
SIGIR 2021
Select, Substitute, Search: Knowledge-Augmented Visual QA Benchmark
Aman Jain, Mayank Kothyari, V. Kumar, Preethi Jyothi, et al.
NAACL 2021
Capturing Row and Column Semantics in Transformer QA over Tables
Michael Glass, Mustafa Canim, Alfio Gliozzo, Saneem Chemmengath, V. Kumar, et al.
EACL 2021
Meta-Learning for Effective Multi-task and Multilingual Modelling
Ishan Tarunesh, Sushil Khyalia, V. Kumar, Ganesh Ramakrishnan, Preethi Jyothi
2019
ACL 2019
Cross-Lingual Training for Automatic Question Generation
Vishwajeet Kumar, Nitish Joshi, Arijit Mukherjee, Ganesh Ramakrishnan, Preethi Jyothi
IJCAI 2019
Neural Program Induction for KBQA Without Gold Programs
Amrita Saha, Ghulam Ansari, V. Kumar, Karthik Sankaranarayanan, et al.
EMNLP 2019
ParaQG: A System for Generating Questions and Answers from Paragraphs
Vishwajeet Kumar et al.
ISWC 2019
Difficulty-Controllable Multi-hop Question Generation from Knowledge Graphs
Vishwajeet Kumar et al.
Earlier
NAACL 2018
Entity Resolution & Location Disambiguation in Ancient Hindu Temples Domain
Ayush Maheshwari, V. Kumar, Ganesh Ramakrishnan, Sakethanath Jagarlapudi
PAKDD 2018
Automating Reading Comprehension by Generating Question and Answer Pairs
Vishwajeet Kumar, Kreeti Boorla, Ganesh Ramakrishnan, Yuan-Fang Li
EMNLP 2016
Towards Semi-Automatic Generation of Proposition Banks for Low-Resource Languages
Alan Akbik, V. Kumar, Yunyao Li
VLDB 2016
Civique: Using Social Media to Detect Urban Emergencies
Diptesh Kanojia, V. Kumar, Krithi Ramamritham
IP Portfolio

Patents & Disclosures

11+ patents and disclosures in NLP, structured data reasoning, and AI language systems.

US 12639351 · May 26, 2026
System and Methods for Inducing Table-of-Contents Knowledge in Differential Search Index
Granted View →
US 12411875 · Sep 8, 2025
Question Generation & Answering System (NLP)
Granted 🇺🇸 United States
US 12210538 · Jan 27, 2025
Multi-instance, Multi-answer Training For Table And Text Question Answering
Granted View →
US 12182508 · Dec 30, 2024
Natural Language Question Answering Using Non-relational Tables
Granted View →
US 12050877 · Jul 29, 2024
Contextual Dialogue Framework Over Dynamic Tables
Granted View →
IBM Disclosure · NLP Systems
Question Generation System for Structured Tables and Passages
Disclosed 🔒 IBM
Full Portfolio
11+ Patents & Disclosures
Spanning NLP, table reasoning, question answering, knowledge graphs, and enterprise AI systems.
Presentations

Talks & Workshops

Invited talks, tutorials, and workshop contributions at leading AI/NLP venues worldwide.

2023
PrimeQA: Multilingual Question Answering at Scale
ACL 2023 Demo Session · Toronto, Canada
2022
Code-switched NLP and Meta-Learning for Hinglish
EMNLP 2022 · Abu Dhabi, UAE
2021
Table Question Answering with Topic Transfer
EMNLP 2021 · Punta Cana, Dominican Republic
2021
Transformer-based Reasoning over Structured Tables
NAACL 2021 · Online
2019
Cross-Lingual Training for Automatic Question Generation
ACL 2019 · Florence, Italy
2019
Multi-hop Question Generation from Knowledge Graphs
ISWC 2019 · Auckland, New Zealand
2017
Dictionary Generalization Across Languages
Amazon India AI Summit 2017 · Poster Presentation
Get in Touch

Contact

Open to research collaborations, academic discussions, and opportunities in NLP and AI.

📍
Location
IBM Research India · Bengaluru, Karnataka
✉️
Email
[firstname][lastname]86 [at] gmail [dot] com
(firstname + lastname + 86)
🎓
Education
Ph.D. — IIT Bombay & Monash University (2020)
Computer Science & Engineering
🏢
Affiliation
IBM Research India · Speech & Language Group
Research Interests
🔍 Retrieval Augmented Generation (RAG)
Question Answering & Generation
🌐 Multilingual & Indic NLP
🧠 Large Language Models
Embedding Models & Dense Retrieval
🏗️ Knowledge Graphs & Structured Reasoning