
Article Information

  • Title: Zero-resource Multi-dialectal Arabic Natural Language Understanding
  • Authors: Muhammad Khalifa; Hesham Hassan; Aly Fahmy
  • Journal: International Journal of Advanced Computer Science and Applications (IJACSA)
  • Print ISSN: 2158-107X
  • Electronic ISSN: 2156-5570
  • Year: 2021
  • Volume: 12
  • Issue: 3
  • Pages: 577-591
  • DOI: 10.14569/IJACSA.2021.0120369
  • Publisher: Science and Information Society (SAI)
  • Abstract: A reasonable amount of annotated data is required for fine-tuning pre-trained language models (PLMs) on downstream tasks. However, obtaining labeled examples for different language varieties can be costly. In this paper, we investigate the zero-shot performance on Dialectal Arabic (DA) when fine-tuning a PLM on modern standard Arabic (MSA) data only, identifying a significant performance drop when evaluating such models on DA. To remedy this drop, we propose self-training with unlabeled DA data and apply it in the context of named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD) on several DA varieties. Our results demonstrate the effectiveness of self-training with unlabeled DA data: it improves zero-shot MSA-to-DA transfer by as much as ~10% F₁ (NER), 2% accuracy (POS tagging), and 4.5% F₁ (SRD). We conduct an ablation experiment and show that the observed performance boost results directly from the unlabeled DA examples used for self-training. Our work opens up opportunities for leveraging the relatively abundant labeled MSA datasets to develop DA models for zero- and low-resource dialects. We also report new state-of-the-art performance on all three tasks and open-source our fine-tuned models for the research community. (A minimal sketch of the self-training loop appears after the keyword list below.)
  • Keywords: Natural language processing; natural language understanding; low-resource learning; semi-supervised learning; named entity recognition; part-of-speech tagging; sarcasm detection; pre-trained language models
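
The abstract describes a classic self-training recipe: fine-tune a model on labeled MSA data, pseudo-label unlabeled DA data with it, and retrain on the union. The sketch below is a minimal, generic illustration of that loop, not the authors' implementation: a scikit-learn classifier on synthetic data stands in for the fine-tuned PLM so the example runs end to end, and the confidence threshold and round count are illustrative choices.

```python
# Minimal self-training sketch (illustrative; not the paper's exact pipeline).
# A real reproduction would fine-tune a pre-trained Arabic LM on labeled MSA
# data; here LogisticRegression on toy features stands in for the PLM.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins: labeled "MSA" examples and unlabeled "DA" examples.
X_msa = rng.normal(size=(200, 16))
y_msa = (X_msa[:, 0] > 0).astype(int)
X_da = rng.normal(loc=0.3, size=(300, 16))  # shifted: a different "dialect"

confidence_threshold = 0.9  # illustrative: keep only confident pseudo-labels
n_rounds = 3                # illustrative: number of self-training rounds

X_train, y_train = X_msa, y_msa
model = LogisticRegression(max_iter=1000)

for round_idx in range(n_rounds):
    # 1) Train (stands in for fine-tuning the PLM on the current labeled set).
    model.fit(X_train, y_train)

    # 2) Pseudo-label the unlabeled dialectal data.
    proba = model.predict_proba(X_da)
    confident = proba.max(axis=1) >= confidence_threshold
    if not confident.any():
        break

    # 3) Add confident pseudo-labeled DA examples to the training set.
    X_train = np.vstack([X_train, X_da[confident]])
    y_train = np.concatenate([y_train, proba[confident].argmax(axis=1)])
    print(f"round {round_idx}: added {confident.sum()} pseudo-labeled examples")
```

In the paper's setting, the pseudo-labels would be NER/POS tags or sarcasm labels rather than a toy binary target, and the selection criterion (confidence thresholding, top-k, or using all predictions) is a design choice that varies across self-training variants.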