🇻🇳 Vietnamese NLP Tasks: Benchmark & SOTA Overview
This page tracks major Vietnamese NLP datasets and models for Dependency Parsing, Intent Detection, Machine Translation, NER, POS Tagging, Semantic Parsing, and Word Segmentation.
Dependency Parsing
VnDT v1.1/v1.0: benchmark dependency treebank with more than 10K sentences.
Split: 1,020 test sentences (v1.1), 200 dev sentences; the remaining sentences are used for training.
VnDT v1.1
VnDT v1.0 (Gold POS)
Intent Detection & Slot Filling
PhoATIS Dataset (flight booking domain): Train: 4,478, Dev: 500, Test: 893
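As a quick sanity check on the split sizes listed above, the following minimal sketch computes the total number of PhoATIS examples and each split's share. The helper name `split_shares` is hypothetical and only used for illustration; the counts are the ones given on this page.

```python
# Split sizes as listed on this page for PhoATIS.
splits = {"train": 4478, "dev": 500, "test": 893}


def split_shares(sizes):
    """Return each split's share of the total, in percent (1 decimal place)."""
    n = sum(sizes.values())
    return {name: round(100 * count / n, 1) for name, count in sizes.items()}


total = sum(splits.values())   # 5871
shares = split_shares(splits)  # train 76.3, dev 8.5, test 15.2
print(total, shares)
```

The same helper can be applied to any other train/dev/test counts on this page.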
Machine Translation
PhoMT Dataset: 3.02M sentence pairs | 6 domains (TED, WikiHow, MediaWiki, OpenSubtitles, News, Blog)
Named Entity Recognition (NER)
PhoNER_COVID19: 10 types, 34,984 entities, 10,027 sentences
VLSP 2016 NER: 16,861 train/dev, 2,831 test sentences.
Part-of-Speech Tagging
VLSP 2013: 27,870 train/dev sentences, 2,120 test sentences
Semantic Parsing
ViText2SQL: 10K question/SQL pairs, the first public Text-to-SQL dataset for Vietnamese.
Model | Exact Match Acc. | Paper | Code | Note |
IRNet (2019) | 53.2 | ViText2SQL | Link | Using PhoBERT encoder |
EditSQL (2019) | 52.6 | ViText2SQL | Link | Using PhoBERT encoder |
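To give a rough feel for what an exact-match accuracy score measures, here is a simplified sketch. It is an assumption-laden illustration, not the benchmark's evaluation script: Text-to-SQL evaluations often compare parsed query components, whereas this version only compares normalized query strings. The function names and the sample queries are hypothetical.

```python
import re


def normalize_sql(query: str) -> str:
    """Lowercase, drop trailing semicolons, and collapse runs of whitespace.

    Note: lowercasing also affects string literals, which a real
    evaluator would treat more carefully.
    """
    query = query.strip().rstrip(";").lower()
    return re.sub(r"\s+", " ", query).strip()


def exact_match_accuracy(predictions, references):
    """Percentage of predictions whose normalized text equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize_sql(p) == normalize_sql(r)
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


# Hypothetical queries, for illustration only.
preds = ["SELECT name FROM singer;", "select count(*) from song"]
refs = ["select name  from singer", "select count(*) from album"]
print(exact_match_accuracy(preds, refs))  # 50.0
```

String comparison is stricter than component-level matching (for example, it penalizes a different but equivalent column order), so scores from this sketch are not comparable to the numbers in the table.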
Word Segmentation
VLSP 2013: 75k train, 2,120 test sentences (manually word-segmented)
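Word-segmented Vietnamese text is commonly written with underscores joining the syllables of a multi-syllable word (for example, `sinh_viên`). Assuming that convention holds for the data at hand, the sketch below converts a segmented sentence into per-syllable B/I labels, a typical way to frame segmentation as sequence labeling. The function name `syllable_tags` and the example sentence are hypothetical.

```python
def syllable_tags(segmented_sentence: str):
    """Map an underscore-segmented sentence to (syllable, tag) pairs.

    "B" marks the first syllable of a word, "I" a continuation syllable.
    Assumes words are separated by whitespace and syllables by "_".
    """
    pairs = []
    for word in segmented_sentence.split():
        syllables = [s for s in word.split("_") if s]
        for i, syllable in enumerate(syllables):
            pairs.append((syllable, "B" if i == 0 else "I"))
    return pairs


example = "Tôi là sinh_viên"
print(syllable_tags(example))
# [('Tôi', 'B'), ('là', 'B'), ('sinh', 'B'), ('viên', 'I')]
```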