We release MultiCoNER, a large multilingual dataset for Named Entity Recognition (NER) that covers 3 domains (Wiki sentences, questions, and search queries) and 11 languages, and includes multilingual and code-mixing subsets. MultiCoNER is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. The dataset is described in more detail in [1], where we also report the performance of several approaches, including a novel GEMNET approach (see [2,3]).
The dataset is stored in the public Amazon S3 bucket: multiconer. See more in Open Data on AWS.
You will need to install the AWS Command Line Interface (AWS CLI) to access the dataset. For example, to download the entire dataset into the current directory, use:
aws s3 cp s3://multiconer ./ --recursive --no-sign-request
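Before copying everything, it can be useful to inspect the bucket or preview the transfer. The following is a minimal sketch, assuming the AWS CLI is installed and on the PATH. The function names and the default ./multiconer target directory are illustrative assumptions and are not part of the dataset's documentation.

```shell
# Bucket name as given above.
BUCKET="s3://multiconer"

# List every object with its size, without downloading anything.
# --no-sign-request allows anonymous access to the public bucket.
multiconer_list() {
  aws s3 ls "$BUCKET/" --recursive --human-readable --summarize --no-sign-request
}

# Show which files a full download would copy, without writing them.
# $1 is an optional target directory (hypothetical default: ./multiconer).
multiconer_preview() {
  aws s3 cp "$BUCKET" "${1:-./multiconer}" --recursive --dryrun --no-sign-request
}
```

After sourcing these definitions, running multiconer_list prints the bucket contents, and multiconer_preview ./data shows what would be copied into ./data. Dropping --dryrun from the second command performs the actual download.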