Redmine Admin, 01/04/2017 05:22 PM
h1. Configuration

h2. Overview

A basic but up-to-date description of the configuration can be found in the INSTALL.md file of the original ÚČNK project (see https://bitbucket.org/ucnk/kontext).

This section covers know-how that is unclear in, or missing entirely from, that documentation.

h2. Speech corpora

A speech corpus is a written corpus enriched with audio recordings of the text. It requires some special setup.

h3. Preparing audio

* split the audio recordings into pieces, one per speech segment that should be played at a time
* assign each segment a unique identifier (its value is referred to below as $ID)
* convert each audio segment to MP3
* store each segment in a separate file named $ID.mp3 in a flat directory $SPEECH_FILES_PATH/$CORPUS_ID

<pre>
/opt
  /data
    /speech
      /speech_corpus1
        file1.mp3
        file2.mp3
        file3.mp3
        ...
</pre>

In the case above, $SPEECH_FILES_PATH is /opt/data/speech and $CORPUS_ID is speech_corpus1.

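The naming rule above can be expressed in a few lines of Python. This is only a sketch: the helper names segment_audio_path and find_duplicate_ids are hypothetical and are not part of KonText; only the directory layout comes from the text.

```python
from pathlib import Path


def segment_audio_path(speech_files_path, corpus_id, segment_id):
    """Return the expected location of one segment's MP3 file.

    Layout taken from the text: $SPEECH_FILES_PATH/$CORPUS_ID/$ID.mp3
    (hypothetical helper, not a KonText function).
    """
    return Path(speech_files_path) / corpus_id / f"{segment_id}.mp3"


def find_duplicate_ids(segment_ids):
    """Return the IDs that occur more than once; segment IDs must be unique."""
    seen, duplicates = set(), set()
    for seg_id in segment_ids:
        if seg_id in seen:
            duplicates.add(seg_id)
        seen.add(seg_id)
    return sorted(duplicates)


if __name__ == "__main__":
    # Values from the example directory tree above.
    print(segment_audio_path("/opt/data/speech", "speech_corpus1", "file1"))
    print(find_duplicate_ids(["file1", "file2", "file1"]))
```

Running a duplicate check before copying files into the flat directory helps avoid one segment silently overwriting another.
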
h3. Preparing the corpus vertical file

* delimit the speech segments like this:

<pre>
<doc id="1">
<s id="1">
<seg soundfile="file1.mp3">
word1
word2
word3
...
</seg>
</s>
<s id="2">
<seg soundfile="file2.mp3">
word1
word2
word3
...
</seg>
</s>
...
</doc>
</pre>

* the names *seg* and *soundfile* can be chosen arbitrarily
* recompile the corpus in the standard way

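Before recompiling, it can be useful to check that every sound file named in the vertical file actually exists. The following is a minimal sketch of such a check. It assumes that each structure tag sits on its own line, as in the example above, and that the attribute value is the file name inside the corpus audio directory; the helper names are hypothetical.

```python
import re
from pathlib import Path


def referenced_sound_files(vertical_text, struct="seg", attr="soundfile"):
    """Collect the attribute values of all opening <struct attr="..."> tags.

    Assumes one tag per line, as in the example vertical file.
    """
    pattern = re.compile(
        r'<{}\s[^>]*?\b{}="([^"]*)"'.format(re.escape(struct), re.escape(attr))
    )
    found = []
    for line in vertical_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            found.append(match.group(1))
    return found


def missing_sound_files(vertical_text, audio_dir, struct="seg", attr="soundfile"):
    """Return referenced file names that do not exist in audio_dir."""
    names = referenced_sound_files(vertical_text, struct, attr)
    return [name for name in names if not (Path(audio_dir) / name).is_file()]


if __name__ == "__main__":
    sample = '<doc id="1">\n<s id="1">\n<seg soundfile="file1.mp3">\nword1\n</seg>\n</s>\n</doc>'
    # The directory is the example path from this page; adjust as needed.
    print(missing_sound_files(sample, "/opt/data/speech/speech_corpus1"))
```

The struct and attr parameters default to the names used on this page, so a corpus that uses other names can be checked by passing them explicitly.
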
h3. Updating config.xml

* to <corpora> add the following elements:
** <speech_files_path>$SPEECH_FILES_PATH</speech_files_path>
** <speech_segment_struct_attr>$SPEECH_SEGMENT_STRUCT_ATTR</speech_segment_struct_attr>
* to <corplist><corpus> add the attribute:
** speech_segment="$SPEECH_SEGMENT"

* for the vertical file shown above:
** $SPEECH_SEGMENT_STRUCT_ATTR should be set to: seg
** $SPEECH_SEGMENT should be set to: seg.soundfile

The whole snippet should look like this:

<pre>
<corpora>
    <speech_files_path>/opt/data/speech</speech_files_path>
    <speech_segment_struct_attr>seg</speech_segment_struct_attr>
    <corplist title="">
        <corplist title="ÚFAL speech corpora">
            <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
        </corplist>
    </corplist>
</corpora>
</pre>
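
To see how these values fit together, the sketch below reads the snippet with Python's standard xml.etree.ElementTree module and derives the audio directory of each speech corpus. It is an illustration only, not KonText's own loading code: the function name speech_corpora is hypothetical, and the directory rule ($SPEECH_FILES_PATH/$CORPUS_ID) is the one described in the Preparing audio section.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# The <corpora> snippet from this page, used as sample input.
SAMPLE = """<corpora>
    <speech_files_path>/opt/data/speech</speech_files_path>
    <speech_segment_struct_attr>seg</speech_segment_struct_attr>
    <corplist title="">
        <corplist title="ÚFAL speech corpora">
            <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
        </corplist>
    </corplist>
</corpora>"""


def speech_corpora(corpora_xml):
    """Map each speech corpus ID to (audio directory, speech_segment value).

    Hypothetical helper: corpora without a speech_segment attribute are skipped.
    """
    root = ET.fromstring(corpora_xml)
    base = root.findtext("speech_files_path")
    if base is None:
        raise ValueError("<speech_files_path> is missing from <corpora>")
    result = {}
    for corpus in root.iter("corpus"):  # also finds corpora in nested corplists
        segment = corpus.get("speech_segment")
        if segment:
            corpus_id = corpus.get("id")
            result[corpus_id] = (Path(base) / corpus_id, segment)
    return result


if __name__ == "__main__":
    for corpus_id, (audio_dir, segment) in speech_corpora(SAMPLE).items():
        print(corpus_id, audio_dir, segment)
```

A check like this makes it easy to spot a corpus whose ID does not match the name of its audio directory.
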