Redmine Admin, 01/04/2017 05:26 PM

h1. Corpora compilation

h2. Compilation

*Set the MANATEE_REGISTRY environment variable to the directory containing the registry files:*

<pre>
export MANATEE_REGISTRY=/opt/projects/lindat-services-kontext/devel/data/corpora/registry
</pre>

*Compile the corpus:*

<pre>
cd $MANATEE_REGISTRY
compilecorp --no-sketches --recompile-corpus <corpus config file>
</pre>
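
The two steps above can be wrapped in a small shell function that checks the registry before compiling. This is only a sketch: @compile_corpus@ and the @COMPILECORP@ override are hypothetical names introduced here (the override only makes dry runs possible), and the only @compilecorp@ flags used are the ones shown above.

```shell
# Hypothetical wrapper around the manual steps; not part of Manatee or KonText.
compile_corpus() {
    corpus="$1"
    if [ -z "${MANATEE_REGISTRY:-}" ] || [ ! -d "$MANATEE_REGISTRY" ]; then
        echo "MANATEE_REGISTRY is not set or is not a directory" >&2
        return 1
    fi
    if [ ! -f "$MANATEE_REGISTRY/$corpus" ]; then
        echo "no registry file for corpus: $corpus" >&2
        return 1
    fi
    # Run from the registry directory, as in the manual steps. The subshell
    # keeps the caller's working directory unchanged.
    # COMPILECORP is an assumed override, e.g. COMPILECORP=echo for a dry run.
    ( cd "$MANATEE_REGISTRY" && "${COMPILECORP:-compilecorp}" --no-sketches --recompile-corpus "$corpus" )
}
```

Usage: @compile_corpus syn2013pub@, or @COMPILECORP=echo compile_corpus syn2013pub@ to print the arguments without compiling.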

h2. Troubleshooting

*The corpus config file in MANATEE_REGISTRY must have a lowercase name*

This is probably a bug in KonText.

*The name of the corpus config file in MANATEE_REGISTRY must consist only of alphanumeric characters*

The name of the config file is used in the @within@ clause of CQL queries, and special characters in it cause CQL syntax errors.
This always happens when searching two or more parallel corpora at the same time.
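
Both naming rules can be checked before compiling. The following is a minimal sketch, assuming that "alphanumeric" means ASCII letters and digits only; the function names @is_valid_registry_name@ and @check_registry_names@ are hypothetical helpers, not KonText or Manatee tools.

```shell
# Succeeds only if the name consists of lowercase ASCII letters and digits.
# LC_ALL=C keeps the a-z range from matching uppercase or accented letters.
is_valid_registry_name() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]+$'
}

# Prints every offending file name in a registry directory
# (defaults to $MANATEE_REGISTRY) and returns non-zero if any was found.
check_registry_names() {
    reg_dir="${1:-${MANATEE_REGISTRY:-}}"
    if [ ! -d "$reg_dir" ]; then
        echo "not a directory: $reg_dir" >&2
        return 2
    fi
    rc=0
    for f in "$reg_dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        if ! is_valid_registry_name "$name"; then
            echo "invalid registry file name: $name"
            rc=1
        fi
    done
    return $rc
}
```

Usage: @check_registry_names@ checks @$MANATEE_REGISTRY@; pass a directory as the first argument to check another location.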

*Compilation of large corpora (e.g. syn2013pub) will fail with an error like this:*

<pre>
[20140612-09:47:00] Processed 288000000 lines, 248604749 positions.
[20140612-09:47:04] encodevert error: File too large for FD_FD, use FD_FGD
Writing log to /opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/log/compilecorp_2014-06-12_0917.log
</pre>

For large corpora, the type of the basic attributes (word, lemma, ...) must be changed to @FD_FGD@ (see http://www.sketchengine.co.uk/documentation/wiki/SkE/Config/FullDoc#Attributestypes).
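
As a rough illustration, assuming the usual registry syntax described in the linked documentation, the attribute definitions in the corpus config file might look like this after the change. Only @word@ and @lemma@ are taken from the text above; any other options your attributes already carry stay as they are.

```
# Sketch only: check the exact syntax against the linked configuration docs.
ATTRIBUTE word {
    TYPE "FD_FGD"
}
ATTRIBUTE lemma {
    TYPE "FD_FGD"
}
```

The corpus has to be recompiled after the type is changed.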

*Computing sizes can fail if the doc structure does not define the attribute wordcount:*

You will see the following message at the beginning of compilation:

<pre>
Reading corpus configuration...
corpinfo: CorpInfoNotFound (wordcount)
...
</pre>

Add @ATTRIBUTE wordcount@ to the @doc@ structure in the corpus config file:

<pre>
STRUCTURE doc {
    ATTRIBUTE wordcount
}
</pre>

*Compilation of large corpora (e.g. syn2013pub) can also fail with a "Too many open files" error:*

<pre>
[20140612-18:33:28] lexicon (/opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/s.id) make_lex_srt_file
[20140612-18:33:29] encodevert error: FileAccessError (/opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/s.id.rev.idx) in ToFile: fopen [Too many open files]
</pre>

In this case the system limit on open files is too low. See the *adjust limits* section on the [[Installation|Installation page]].
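
Before starting a long compilation, the open-files limit of the current shell can be inspected with @ulimit -n@. The sketch below wraps that in a hypothetical helper, @check_open_files_limit@; the default threshold of 4096 is an illustrative assumption, not a value taken from the Manatee or KonText documentation.

```shell
# Hypothetical helper: warns if the soft limit on open files is below a threshold.
# The default of 4096 is an assumed example value; adjust it to your corpus.
check_open_files_limit() {
    required="${1:-4096}"
    current=$(ulimit -n)
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    if [ "$current" -lt "$required" ]; then
        echo "open files limit is $current, below $required"
        return 1
    fi
    return 0
}
```

If the check fails, @ulimit -n <number>@ raises the soft limit for the current shell, up to the hard limit; permanent changes are covered by the Installation page.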