Corpora » History » Version 1
Redmine Admin, 01/04/2017 05:24 PM
h1. Corpora

* [[Corpora conversion workflow]] (proposal)
* [[Corpora conversion]]
* [[Corpora compilation]]
* [[Conversion benchmarks]]
* "List of available corpora":https://docs.google.com/spreadsheets/d/1K0ZpJNVRcd5Yt1Ti1p2zr3lrxRxyKAZwdzx8fOrjkH0/edit?usp=sharing

h2. Introduction

Corpora are converted from vertical text to binary format by the compilecorp tool, which is provided by the manatee package.
Two files are needed for the conversion: the vertical text (i.e. the corpus itself) and a corpus configuration file that describes the contents of the corpus in detail.

The vertical text is documented here: http://www.sketchengine.co.uk/documentation/wiki/SkE/PrepareText
The corpus configuration file is documented here: http://www.sketchengine.co.uk/documentation/wiki/SkE/Config/FullDoc

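
The two-file requirement described in the introduction can be checked before a conversion is started. The following is a minimal sketch, not the project's actual script: the helper name @build_compilecorp_command@ is hypothetical, and the argument order (configuration file first, vertical file second) is an assumption that should be confirmed with @compilecorp --help@ on the target server.

```python
from pathlib import Path


def build_compilecorp_command(registry_file, vertical_file):
    """Return the argument list for a compilecorp run (hypothetical helper).

    Raises FileNotFoundError if either of the two required inputs is missing.
    """
    registry = Path(registry_file)
    vertical = Path(vertical_file)

    # Both inputs named in the introduction must exist before conversion.
    required = (
        ("corpus configuration file", registry),
        ("vertical text", vertical),
    )
    for label, path in required:
        if not path.is_file():
            raise FileNotFoundError(f"{label} not found: {path}")

    # Assumption: compilecorp accepts the configuration file followed by
    # the vertical file; verify this against `compilecorp --help`.
    return ["compilecorp", str(registry), str(vertical)]
```

The returned list can be passed to @subprocess.run(..., check=True)@, so a missing input is reported before the tool is invoked at all.
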
h2. Directory structure

The directory structure on kontext-dev (kontext) servers is as follows:

<pre>
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/registry # configuration files (no subdirectories)
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/data # compiled corpora
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/speech # mp3 files
/opt/project/lindat-services/devel/data/corpora/conversion # conversion of corpora (data and scripts)
/opt/project/lindat-services/devel/data/corpora/vert # vertical text files (corpora data)
</pre>
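
The layout above can be expressed as a small path helper, which makes it explicit that the registry, data and speech directories depend on @$ENVIRONMENT@ while the conversion and vert directories exist only under @devel@. This is an illustrative sketch; the function name @corpora_dirs@ and the dictionary keys are hypothetical and not part of any existing tooling.

```python
from pathlib import PurePosixPath

# Common prefix taken from the directory listing above.
BASE = PurePosixPath("/opt/project/lindat-services")


def corpora_dirs(environment):
    """Map each corpora directory role to its path (hypothetical helper)."""
    # Directories that exist once per environment ($ENVIRONMENT).
    per_env = BASE / environment / "data" / "corpora"
    # Directories that are listed only under the devel environment.
    devel_only = BASE / "devel" / "data" / "corpora"
    return {
        "registry": per_env / "registry",          # configuration files
        "data": per_env / "data",                  # compiled corpora
        "speech": per_env / "speech",              # mp3 files
        "conversion": devel_only / "conversion",   # conversion data and scripts
        "vert": devel_only / "vert",               # vertical text files
    }
```

For example, @corpora_dirs("devel")["registry"]@ gives the registry directory for the devel environment, and the @vert@ entry is the same for every value of the argument.
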