arxiv.org/abs/2206.15147

Preview meta tags from the arxiv.org website.

Search Engine Appearance

Google

https://arxiv.org/abs/2206.15147

esCorpius: A Massive Spanish Crawling Corpus

In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages, or present a low quality derived from sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license and is available on HuggingFace.

Bing

esCorpius: A Massive Spanish Crawling Corpus

https://arxiv.org/abs/2206.15147

DuckDuckGo

https://arxiv.org/abs/2206.15147

esCorpius: A Massive Spanish Crawling Corpus

General Meta Tags
19
- title
  [2206.15147] esCorpius: A Massive Spanish Crawling Corpus
- title
  open search
- title
  open navigation menu
- title
  contact arXiv
- title
  subscribe to arXiv mailings
Open Graph Meta Tags
10
- og:type
  website
- og:site_name
  arXiv.org
- og:title
  esCorpius: A Massive Spanish Crawling Corpus
- og:url
  https://arxiv.org/abs/2206.15147v2
- og:image
  /static/browse/0.3.4/images/arxiv-logo-fb.png
Twitter Meta Tags
6
- twitter:site
  @arxiv
- twitter:card
  summary
- twitter:title
  esCorpius: A Massive Spanish Crawling Corpus
- twitter:description
  In recent years, transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be...
- twitter:image
  https://static.arxiv.org/icons/twitter/arxiv-logo-twitter-square.png
Link Tags
12
- apple-touch-icon
  /static/browse/0.3.4/images/icons/apple-touch-icon.png
- canonical
  /abs/2206.15147
- icon
  /static/browse/0.3.4/images/icons/favicon-32x32.png
- icon
  /static/browse/0.3.4/images/icons/favicon-16x16.png
- manifest
  /static/browse/0.3.4/images/icons/site.webmanifest
