Danish corpus

From Brede Wiki




Corpora

  • Danish Wikipedia. Reasonably easy to access, but with much markup.
  • Danish Wikisource
  • ADL. Redistribution not allowed. Strange URL.
  • Runeberg (Danish part). Can be downloaded. Old language with old spelling. OCR errors are a problem, and a major one for works set in Gothic script. Labels can be constructed through Wikidata.
  • Gutenberg has some Danish works, e.g., [1]
  • The Danish part of NLTK's europarl_raw. Contains 22476 "sentences", 563358 tokens and 27920 unique tokens. No labels. Easy to access. The sentence tokenization is not done well: many sentences are split at the periods in abbreviations such as "hr." and "f.eks.".
  • DanNet. Danish wordnet which contains sentences as examples for the items.
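
The sentence, token and unique-token counts quoted for europarl_raw can be recomputed with a few lines of Python. The following is a minimal sketch: the function name corpus_stats is hypothetical, the toy data merely stands in for the real corpus, and whether the figure of 27920 unique tokens was computed case-sensitively is an assumption (this sketch counts case-sensitively).

```python
def corpus_stats(sentences):
    """Return (sentence count, token count, unique token count).

    `sentences` is an iterable of token lists, the shape returned by
    NLTK's `sents()` methods. Tokens are compared case-sensitively,
    which is an assumption about how the quoted figures were obtained.
    """
    n_sentences = 0
    n_tokens = 0
    vocabulary = set()
    for sentence in sentences:
        n_sentences += 1
        n_tokens += len(sentence)
        vocabulary.update(sentence)
    return n_sentences, n_tokens, len(vocabulary)


# Toy data standing in for europarl_raw.danish.sents()
toy = [["Det", "er", "en", "test", "."], ["Det", "er", "godt", "."]]
print(corpus_stats(toy))  # (2, 9, 6)
```

With NLTK and the corpus data installed, passing europarl_raw.danish.sents() instead of the toy list should give the corpus-wide counts.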

Python

Access to Danish europarl sentences via NLTK

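from nltk.corpus import europarl_raw
# Requires the corpus data, e.g. via nltk.download('europarl_raw')
# Each sentence is returned as a list of tokens
sentences = europarl_raw.danish.sents()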
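
Because the europarl_raw sentence splitting breaks at abbreviations such as "hr." and "f.eks.", one possible workaround is to glue a fragment that ends in such an abbreviation onto the fragment that follows it. The sketch below is an illustration only, not part of NLTK: the names ABBREVIATIONS, ends_with_abbreviation and merge_split_sentences are hypothetical, the abbreviation set covers just the two cases mentioned here, and since it is an assumption whether the corpus stores the period as part of the token ("hr.") or as a separate token ("hr", "."), both forms are checked.

```python
# Abbreviations (without the trailing period) known to cause false splits.
# Only the two examples from the text; extend as needed.
ABBREVIATIONS = {"hr", "f.eks"}


def ends_with_abbreviation(tokens):
    """True if the token list ends in a listed abbreviation plus period."""
    if not tokens:
        return False
    last = tokens[-1]
    # Case 1: period attached to the abbreviation, e.g. "hr."
    if last.endswith(".") and last[:-1].lower() in ABBREVIATIONS:
        return True
    # Case 2: period as its own token, e.g. "hr", "."
    return (len(tokens) >= 2 and last == "."
            and tokens[-2].lower() in ABBREVIATIONS)


def merge_split_sentences(sentences):
    """Join fragments that end in an abbreviation with the next fragment."""
    merged = []
    carry = []
    for sentence in sentences:
        current = carry + list(sentence)
        if ends_with_abbreviation(current):
            carry = current  # keep accumulating until a real sentence end
        else:
            merged.append(current)
            carry = []
    if carry:  # trailing fragment with nothing left to attach to
        merged.append(carry)
    return merged


# Toy fragments standing in for europarl_raw.danish.sents()
fragments = [["Jeg", "takker", "hr", "."], ["Hansen", "."],
             ["Det", "er", "godt", "."]]
for sentence in merge_split_sentences(fragments):
    print(" ".join(sentence))
```

A sentence that genuinely ends in one of these abbreviations would be merged with its successor by mistake, so the result is a heuristic repair rather than a proper sentence tokenization.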