MultiCoNER

Description

We release MultiCoNER, a large multilingual dataset for Named Entity Recognition (NER) that covers 3 domains (Wiki sentences, questions, and search queries) and 11 languages, and also includes multilingual and code-mixed subsets. MultiCoNER is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. MultiCoNER is described in more detail in [1], where we also report the performance of several approaches, including a novel GEMNET approach (see [2,3]).

License

CC BY 4.0

How to Download

The dataset is stored in the public Amazon S3 bucket multiconer; see Open Data on AWS for more information. To access it, install the AWS Command Line Interface (AWS CLI). For example, to download the entire dataset into the current directory, run:

aws s3 cp s3://multiconer ./ --recursive --no-sign-request
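
To keep the dataset out of the current directory, the same command can be wrapped in a small shell function. This is a minimal sketch, not part of the dataset release: the function name download_multiconer and the default destination ./multiconer are hypothetical, and only the bucket name and the flags come from the command above.

```shell
#!/usr/bin/env bash
# Sketch: download the MultiCoNER bucket into a chosen directory.
# download_multiconer is a hypothetical helper name, not part of the release.
download_multiconer() {
  # Hypothetical default destination; pass a path as the first argument to override.
  local dest_dir="${1:-./multiconer}"

  # The AWS CLI must be installed; stop early with a clear message otherwise.
  if ! command -v aws >/dev/null 2>&1; then
    echo "aws CLI not found; install the AWS Command Line Interface first" >&2
    return 1
  fi

  # Create the destination directory if it does not exist yet.
  mkdir -p "$dest_dir" || return 1

  # --recursive copies every object in the bucket.
  # --no-sign-request allows anonymous access, so no AWS credentials are needed.
  aws s3 cp s3://multiconer "$dest_dir" --recursive --no-sign-request
}

# Example call (commented out so that sourcing this file has no side effects):
# download_multiconer ./multiconer
```

Sourcing the file and running download_multiconer with a target path should place the bucket contents under that path. The function returns a non-zero status if the AWS CLI is missing or the copy fails.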

Reference

[1] MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity Recognition. 2022. Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar and Oleg Rokhlenko.
[2] GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input. 2021. Tao Meng, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi. In Proceedings of NAACL.
[3] Gazetteer Enhanced Named Entity Recognition for Code-Mixed Web Queries. 2021. Besnik Fetahu, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi. In Proceedings of SIGIR.