🇻🇳 Vietnamese NLP Tasks: Benchmark & SOTA Overview

📈 This page tracks major Vietnamese NLP datasets and models for Dependency Parsing, Intent Detection, Machine Translation, NER, POS Tagging, Semantic Parsing, and Word Segmentation.

Dependency Parsing

πŸ—‚οΈ VnDT v1.1/v1.0: Benchmark treebank >10K sentences.
Test: 1,020 (v1.1), Dev: 200, Rest: Train.

VnDT v1.1

Model | LAS | UAS | Paper | Code
--- | --- | --- | --- | ---
PhoNLP (2021) | 79.11 | 85.47 | PhoNLP | Official
PhoBERT-base (2020) | 78.77 | 85.22 | PhoBERT | Official
Biaffine (2017) | 74.99 | 81.19 | Biaffine Parsing | -
VnCoreNLP (2018) | 71.38 | 77.35 | VnCoreNLP | Official
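
The LAS and UAS columns can be read as token-level accuracies: UAS counts tokens whose predicted head is correct, and LAS additionally requires the dependency label to match. The following is a minimal sketch under that standard definition, not the evaluation script used for the numbers above. The function name `attachment_scores`, the toy sentence, and its labels are illustrative assumptions, and conventions such as whether punctuation is scored vary between papers.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent over aligned token lists.

    Assumption: every token is scored, including punctuation.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and aligned")
    total = len(gold)
    # UAS: only the head has to match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: head and label both have to match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Hypothetical 4-token sentence; labels are illustrative, not VnDT labels.
gold_arcs = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "nmod")]
uas, las = attachment_scores(gold_arcs, pred_arcs)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=75.00 LAS=50.00
```

Because LAS is the stricter criterion, it can never exceed UAS, which matches the ordering of the two columns in every row.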

VnDT v1.0 (Gold POS)

Model | LAS | UAS | Paper | Code
--- | --- | --- | --- | ---
VnCoreNLP (2018) | 73.39 | 79.02 | VnCoreNLP | Official
BIST BiLSTM graph (2016) | 73.17 | 79.39 | BIST Parser | Official
MSTparser (2006) | 70.29 | 76.47 | MSTparser | -

Intent Detection & Slot Filling

🛫 PhoATIS dataset (flight booking domain). Train: 4,478, Dev: 500, Test: 893.

Model | Intent Acc. | Slot F1 | Sent. Acc. | Paper | Code
--- | --- | --- | --- | --- | ---
JointIDSF (2021) | 97.62 | 94.98 | 86.25 | JointIDSF | Official
JointBERT+PhoBERT | 97.40 | 94.75 | 85.55 | JointIDSF | Official
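
Sentence accuracy is noticeably lower than intent accuracy because, under the usual definition for joint intent detection and slot filling, an utterance counts as correct only if its intent and every one of its slot tags are correct. The sketch below assumes that definition; the function name `joint_metrics` and the toy intents and BIO slot tags are made up for illustration and are not taken from PhoATIS.

```python
from typing import List, Sequence, Tuple


def joint_metrics(
    gold_intents: Sequence[str],
    pred_intents: Sequence[str],
    gold_slots: Sequence[List[str]],
    pred_slots: Sequence[List[str]],
) -> Tuple[float, float]:
    """Return (intent accuracy, sentence accuracy) in percent."""
    n = len(gold_intents)
    if n == 0 or not (len(pred_intents) == len(gold_slots) == len(pred_slots) == n):
        raise ValueError("inputs must be non-empty and aligned")
    intent_hits = 0
    sentence_hits = 0
    for gi, pi, gs, ps in zip(gold_intents, pred_intents, gold_slots, pred_slots):
        intent_ok = gi == pi
        slots_ok = list(gs) == list(ps)  # every slot tag must match
        intent_hits += int(intent_ok)
        sentence_hits += int(intent_ok and slots_ok)
    return 100.0 * intent_hits / n, 100.0 * sentence_hits / n


# Hypothetical utterances (labels are illustrative only).
gold_i = ["flight", "airfare", "flight", "airline"]
pred_i = ["flight", "airfare", "airfare", "airline"]
gold_s = [["O", "B-toloc", "I-toloc"], ["O", "B-fromloc"], ["O", "O"], ["B-airline_name"]]
pred_s = [["O", "B-toloc", "I-toloc"], ["O", "O"], ["O", "O"], ["B-airline_name"]]

intent_acc, sent_acc = joint_metrics(gold_i, pred_i, gold_s, pred_s)
print(intent_acc, sent_acc)  # 75.0 50.0
```

Here one utterance fails on the intent and another only on a slot tag, so both count against sentence accuracy while only the first lowers intent accuracy.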

Machine Translation

🌐 PhoMT dataset: 3.02M sentence pairs | 6 domains (TED, WikiHow, MediaWiki, OpenSubtitles, News, Blog)

Model | EN→VI (BLEU) | VI→EN (BLEU) | Paper | Code
--- | --- | --- | --- | ---
mBART (2020) | 43.46 | 39.78 | mBART | Link
Transformer-big | 42.94 | 37.83 | Transformer | Link

📋 IWSLT2015: 150K sentence pairs (EN↔VI) | Data & Scripts

Model | BLEU | Paper | Code
--- | --- | --- | ---
Nguyen & Salazar (2019) | 32.8 | Transformers w/o Tears | Official
Provilkov et al. (2019) | 33.27 (uncased) | BPE-Dropout | -
Xu et al. (2019) | 31.4 | Layer Norm | Official
Transformer (2017) | 28.9 | Transformer | Link

Named Entity Recognition (NER)

🩺 PhoNER_COVID19: 10 entity types, 34,984 entities, 10,027 sentences

Model | F1 | Paper | Code
--- | --- | --- | ---
PhoBERT-large | 94.5 | PhoBERT | Official
XLM-R-large | 93.8 | XLM-R | Official
BiLSTM-CRF + CNN-char | 91.0 | BiLSTM-CRF | Link

📄 VLSP 2016 NER: 16,861 sentences for training and development, 2,831 test sentences.

Model | F1 | Paper | Code
--- | --- | --- | ---
PhoBERT-large | 94.7 | PhoBERT | Official
PhoNLP | 94.41 | PhoNLP | Official
vELECTRA | 94.07 | vELECTRA | Official
VnCoreNLP | 91.30 | VnCoreNLP | Official
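
NER F1 is conventionally computed over whole entity spans rather than individual tokens: a predicted entity counts only if its boundaries and type both match a gold entity. The sketch below assumes BIO-tagged input and that span-level convention. It is not the official scorer for either dataset; the helper names and the `PER`/`LOC` tags in the example are illustrative.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, entity type)


def bio_spans(tags: List[str]) -> Set[Span]:
    """Extract entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == etype
        if not continues:
            if etype is not None:
                spans.add((start, i, etype))
            # Assumption: a stray I- tag opens a new span, like a B- tag.
            start, etype = (i, tag[2:]) if tag != "O" else (None, None)
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged span-level F1 in percent."""
    correct = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_spans(g_tags), bio_spans(p_tags)
        correct += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if correct == 0:
        return 0.0
    precision, recall = correct / n_pred, correct / n_gold
    return 100.0 * 2 * precision * recall / (precision + recall)


gold_tags = [["B-PER", "O", "B-LOC", "I-LOC", "O"]]
pred_tags = [["B-PER", "O", "B-LOC", "O", "O"]]  # LOC span is cut short
ner_f1 = span_f1(gold_tags, pred_tags)
print(ner_f1)  # 50.0
```

The truncated `LOC` prediction earns no partial credit, which is why span-level F1 is stricter than token-level accuracy.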

Part-of-Speech Tagging

🔀 VLSP 2013: 27,870 sentences for training and development, 2,120 test sentences

Model | Accuracy | Paper | Code
--- | --- | --- | ---
PhoBERT-large | 96.8 | PhoBERT | Official
vELECTRA | 96.77 | vELECTRA | Official
PhoNLP | 96.76 | PhoNLP | Official
PhoBERT-base | 96.7 | PhoBERT | Official
VnCoreNLP-VnMarMoT | 95.88 | VnMarMoT | Official
BiLSTM-CRF + CNN-char | 95.40 | BiLSTM-CRF | Official
RDRPOSTagger | 95.11 | RDRPOSTagger | Official

Semantic Parsing

πŸ—ƒοΈ ViText2SQL: 10K question/SQL pairs, the first public Text-to-SQL dataset for Vietnamese.
ModelExact Match Acc.PaperCodeNote
IRNet (2019)53.2 ViText2SQL Link Using PhoBERT encoder
EditSQL (2019)52.6 ViText2SQL Link Using PhoBERT encoder

Word Segmentation

βœ‚οΈ VLSP 2013: 75k train, 2,120 test sentences (manually word-segmented)
ModelF1PaperCode
UITws-v1 (2019)98.06 UITws-v1 Official
VnCoreNLP-RDRsegmenter (2018)97.90 VnCoreNLP Official
UETsegmenter (2016)97.87 UETsegmenter Official
vnTokenizer (2008)97.33 vnTokenizer
JVnSegmenter (2006)97.06 JVnSegmenter
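
Word segmentation F1 is typically computed over word boundaries: a predicted word is correct only if it covers exactly the same syllables as a gold word. The sketch below assumes input where the syllables of a multi-syllable word are joined by underscores (for example `sinh_viên`) and words are separated by spaces. The function names and the example sentence are illustrative, and this is not the scorer behind the table above.

```python
from typing import List, Set, Tuple


def word_spans(segmented: str) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to (start, end) syllable-index spans."""
    spans = set()
    start = 0
    for word in segmented.split():
        n_syllables = len(word.split("_"))
        spans.add((start, start + n_syllables))
        start += n_syllables
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) in percent over all sentences."""
    correct = n_gold = n_pred = 0
    for g_sent, p_sent in zip(gold, pred):
        g, p = word_spans(g_sent), word_spans(p_sent)
        correct += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return 100.0 * precision, 100.0 * recall, 100.0 * f1


# The prediction wrongly splits "sinh_viên" into two one-syllable words.
gold_seg = ["Tôi là sinh_viên đại_học"]
pred_seg = ["Tôi là sinh viên đại_học"]
seg_p, seg_r, seg_f1 = segmentation_prf(gold_seg, pred_seg)
print(f"P={seg_p:.2f} R={seg_r:.2f} F1={seg_f1:.2f}")  # P=60.00 R=75.00 F1=66.67
```

A single wrong split costs one gold word and adds two spurious predicted words, so both precision and recall drop.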