Vishwajeet Kumar — Staff Research Scientist, IBM Research AI

Research Focus

What I Work On

My research sits at the intersection of language models, structured reasoning, and efficient information retrieval for enterprise AI.

🔍 Retrieval Augmented Generation
❓ Question Answering & Generation
🌐 Multilingual NLP · Indic Languages
🏗️ Knowledge Graphs & Ontologies
🧠 Large Language Models
📊 Table Understanding
🔗 Code-Mixed NLP (Hinglish)
⚡ Embedding Models & Dense Retrieval

🤖

Latest Highlight

IBM Granite Embedding R2 — Aug 2025

Co-authored the Granite Embedding model family — open-source encoder models for dense/sparse retrieval released under Apache 2.0, outperforming similar-sized public models on IBM retrieval benchmarks.

R1 Paper → R2 Paper →

Live Feed

Research Updates (10)

Papers, patents, models, and milestones — as they happen.

📄

New Paper

LMK > CLS: Landmark Pooling for Dense Embeddings

Introduces Landmark (LMK) pooling for sequence encoders — partitions input into chunks with interleaved landmark tokens, improving long-context retrieval substantially while matching existing methods on short-context tasks. A scalable alternative to [CLS] and mean pooling.

arXiv:2601.21525 · Jan 2026 Read Paper →

JAN 2026
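To make the pooling idea concrete, here is a minimal sketch of the chunk-and-landmark scheme described above. It is not the paper's implementation: the helper names, the chunk size, the landmark token id, and the choice to average the hidden states at landmark positions are all assumptions made for illustration.

```python
# Illustrative sketch only: names, chunk size, and the final averaging step
# are assumptions, not details taken from the LMK paper.

def insert_landmarks(token_ids, chunk_size, lmk_id):
    """Split token_ids into chunks and append a landmark token after each chunk.

    Returns the new token sequence and the positions of the landmark tokens.
    """
    out, positions = [], []
    for start in range(0, len(token_ids), chunk_size):
        out.extend(token_ids[start:start + chunk_size])
        positions.append(len(out))  # index the landmark will occupy
        out.append(lmk_id)
    return out, positions


def landmark_pool(hidden_states, positions):
    """Average the encoder hidden states found at the landmark positions."""
    dim = len(hidden_states[0])
    return [
        sum(hidden_states[p][d] for p in positions) / len(positions)
        for d in range(dim)
    ]


# Hypothetical input: five token ids, chunks of two, landmark id -1.
tokens, lmk_positions = insert_landmarks([1, 2, 3, 4, 5], chunk_size=2, lmk_id=-1)

# Stand-in for encoder output: one 2-dimensional vector per token.
fake_hidden = [[float(i), 1.0] for i in range(len(tokens))]
embedding = landmark_pool(fake_hidden, lmk_positions)
```

By contrast, [CLS] pooling would read a single position and mean pooling would average every position; the sketch reads only the landmark positions, one per chunk.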

📄

New Paper

Influence Guided Sampling for Domain Adaptation of Text Retrievers

Proposes Inf-DDS, a novel RL-driven training data sampling framework for embedding models that adaptively reweights datasets using influence-based rewards. Achieves an improvement of up to 5.03 NDCG@10 on multilingual bge-m3 at 1.5–4× lower GPU cost than gradient-based baselines.

arXiv:2601.21759 · Jan 2026 Read Paper →

JAN 2026
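For readers unfamiliar with the metric quoted above, NDCG@10 scores a ranked list by discounting relevant documents that appear lower in the top 10 and normalizing by the best possible ordering. The sketch below assumes linear gain and a log2 discount, which is one common formulation; the function names and toy labels are illustrative and not taken from the paper.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of an ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query: labels of the returned documents, in rank order.
ranked = [0, 1, 0, 1]
# All judged labels for the same query (illustrative).
judged = [1, 1, 0, 0]
score = ndcg_at_k(ranked, judged)
```

A perfect ranking scores 1.0, and a ranking that pushes relevant documents down the list scores lower.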

🔒

Patent Granted

US Patent 12639351 Granted — Table-of-Contents in Differential Search Index

New US patent granted for a system and methods that induce Table-of-Contents knowledge into a Differential Search Index — advancing structured document understanding and retrieval for enterprise AI.

US 12639351 · May 26, 2026 View Patent →

MAY 2026

🤖

Model Released

Granite Embedding R2 Models

Co-authored the next-generation IBM Granite R2 embedding models — ModernBERT-based architecture trained on 2T tokens, with extended context length and improved inference optimizations. Released under Apache 2.0 for commercial & research use.

arXiv:2508.21085 Read Paper →

AUG 2025

📄

Technical Report

Granite Embedding Models — IBM's Open Retrieval Family

Released the Granite Embedding family — encoder models for dense and sparse retrieval with English and multilingual capabilities. Uses retrieval-oriented pretraining, contrastive finetuning, knowledge distillation, and model merging.

arXiv:2502.20204 · Feb 2025 Read Paper →

FEB 2025

📄

Conference Paper

MILU: A Multi-task Indic Language Understanding Benchmark

Accepted at NAACL 2025. A comprehensive multi-task benchmark for evaluating language understanding across major Indic languages, advancing multilingual NLP evaluation beyond English.

NAACL 2025

2025

📄

Short Paper

Benchmarking Zero-Shot Hindi Retrieval with Hindi-BEIR & NLLB-E5

Accepted at NAACL 2025. Introduces Hindi-BEIR, a benchmark for Hindi information retrieval, alongside NLLB-E5, a zero-shot retrieval model for low-resource Hindi search.

NAACL 2025

2025

🔒

Patent Granted

US Patent 12411875 Granted

US patent granted, covering NLP-based question answering systems. Part of a growing IP portfolio of 10+ patents and disclosures in AI language systems.

US 12411875 · Sep 8, 2025

SEP 2025

📄

Demo Paper

PrimeQA: State-of-the-Art Multilingual Question Answering

Accepted at ACL 2023. IBM's open-source unified framework for multilingual QA research — enabling researchers to build, train, and evaluate across languages at scale.

ACL 2023 Read Paper →

2023

📄

Conference Paper

Hinglish Code-Mixed NLP via Meta-Learning & Embedding Resources

Accepted at EMNLP 2022. A meta-learning framework that leverages Hindi and English language resources to significantly improve performance on Hinglish NLP tasks, evaluated on the GLUECoS benchmark.

EMNLP 2022

2022

Research Output

Selected Publications

30+ peer-reviewed papers across ACL, EMNLP, NAACL, SIGIR, PAKDD, ISWC, IJCAI and more.

2027

🚀

Upcoming in 2027

Papers and research outputs in progress. Check back here for updates — or follow on Google Scholar for the latest.

🔍 RAG Research ⚡ Embedding Models 🌐 Multilingual NLP

2026

arXiv 2026

LMK > CLS: Landmark Pooling for Dense Embeddings

Meet Doshi, Aashka Trivedi, Vishwajeet Kumar, Parul Awasthy, Yulong Li, Jaydeep Sen, Radu Florian, Sachindra Joshi

Introduces Landmark (LMK) pooling — a new sequence encoding strategy that partitions input into chunks with landmark tokens, improving long-context retrieval while preserving local salient signals.

arXiv →

arXiv 2026

Influence Guided Sampling for Domain Adaptation of Text Retrievers

Meet Doshi, Vishwajeet Kumar, Yulong Li, Jaydeep Sen

Proposes Inf-DDS, a reinforcement learning framework that adaptively reweights training datasets using influence-based reward signals — achieving up to 5.03 NDCG@10 improvement with 1.5–4× lower GPU cost.

arXiv →

2025

NAACL 2025

MILU: A Multi-task Indic Language Understanding Benchmark

Sshubam Verma, Mohammed Safi Ur Rahman, et al. (incl. V. Kumar)

Paper →

NAACL 2025

Benchmarking Zero-Shot Hindi Retrieval with Hindi-BEIR and NLLB-E5

Arkadeep Acharya, Rudra Murthy, et al. (incl. V. Kumar)

Paper →

arXiv 2025

Granite Embedding Models (R1)

Parul Awasthy, Aashka Trivedi, ..., Vishwajeet Kumar, ..., Radu Florian

PDF →

arXiv 2025

Granite Embedding R2 Models

Parul Awasthy, Aashka Trivedi, ..., Vishwajeet Kumar, ..., Radu Florian

PDF →

2023

ACL 2023

PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering

Avi Sil, Jaydeep Sen, et al. (incl. V. Kumar)

Paper →

2022

EMNLP 2022

On Utilizing Matrix and Embedding Language Resources to Improve Downstream Tasks in Hinglish

Vishwajeet Kumar, Rudra Murthy Venkataramana, Tejas Indulal Dhamecha

Paper →

2021

EMNLP 2021

Topic Transferable Table Question Answering

Saneem Ahmed Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, et al.

arXiv →

SIGIR 2021

Select, Substitute, Search: Knowledge-Augmented Visual QA Benchmark

Aman Jain, Mayank Kothyari, V. Kumar, Preethi Jyothi, et al.

PDF →

NAACL 2021

Capturing Row and Column Semantics in Transformer QA over Tables

Michael Glass, Mustafa Canim, Alfio Gliozzo, Saneem Chemmengath, V. Kumar, et al.

PDF →

EACL 2021

Meta-Learning for Effective Multi-task and Multilingual Modelling

Ishan Tarunesh, Sushil Khyalia, V. Kumar, Ganesh Ramakrishnan, Preethi Jyothi

PDF →

2019

ACL 2019

Cross-Lingual Training for Automatic Question Generation

Vishwajeet Kumar, Nitish Joshi, Arijit Mukherjee, Ganesh Ramakrishnan, Preethi Jyothi

PDF →

IJCAI 2019

Neural Program Induction for KBQA Without Gold Programs

Amrita Saha, Ghulam Ansari, V. Kumar, Karthik Sankaranarayanan, et al.

PDF →

EMNLP 2019

ParaQG: A System for Generating Questions and Answers from Paragraphs

Vishwajeet Kumar et al.

PDF →

ISWC 2019

Difficulty-Controllable Multi-hop Question Generation from Knowledge Graphs

Vishwajeet Kumar et al.

Earlier

NAACL 2018

Entity Resolution & Location Disambiguation in Ancient Hindu Temples Domain

Ayush Maheshwari, V. Kumar, Ganesh Ramakrishnan, Sakethanath Jagarlapudi

PDF →

PAKDD 2018

Automating Reading Comprehension by Generating Question and Answer Pairs

Vishwajeet Kumar, Kreeti Boorla, Ganesh Ramakrishnan, Yuan Fang Li

PDF →

EMNLP 2016

Towards Semi-Automatic Generation of Proposition Banks for Low-Resource Languages

Alan Akbik, V. Kumar, Yunyao Li

VLDB 2016

Civique: Using Social Media to Detect Urban Emergencies

Diptesh Kanojia, V. Kumar, Krithi Ramamritham

IP Portfolio

Patents & Disclosures

11+ patents and disclosures in NLP, structured data reasoning, and AI language systems.

US 12639351 · May 26, 2026

System and Methods for Inducing Table-of-Contents Knowledge in Differential Search Index

Granted View →

US 12411875 · Sep 8, 2025

Question Generation & Answering System (NLP)

Granted 🇺🇸 United States

US 12210538 · Jan 27, 2025

Multi-instance, Multi-answer Training For Table And Text Question Answering

Granted View →

US 12182508 · Dec 30, 2024

Natural Language Question Answering Using Non-relational Tables

Granted View →

US 12050877 · Jul 29, 2024

Contextual Dialogue Framework Over Dynamic Tables

Granted View →

IBM Disclosure · NLP Systems

Question Generation System for Structured Tables and Passages

Disclosed 🔒 IBM

Full Portfolio

11+ Patents & Disclosures

Spanning NLP, table reasoning, question answering, knowledge graphs, and enterprise AI systems.

Presentations

Talks & Workshops

Invited talks, tutorials, and workshop contributions at leading AI/NLP venues worldwide.

2023

PrimeQA: Multilingual Question Answering at Scale

ACL 2023 Demo Session · Toronto, Canada

2022

Code-switched NLP and Meta-Learning for Hinglish

EMNLP 2022 · Abu Dhabi, UAE

2021

Table Question Answering with Topic Transfer

EMNLP 2021 · Punta Cana, Dominican Republic

2021

Transformer-based Reasoning over Structured Tables

NAACL 2021 · Online

2019

Cross-Lingual Training for Automatic Question Generation

ACL 2019 · Florence, Italy

2019

Multi-hop Question Generation from Knowledge Graphs

ISWC 2019 · Auckland, New Zealand

2017

Dictionary Generalization Across Languages

Amazon India AI Summit 2017 · Poster Presentation
