Version 1 - History - Configuration - KonText - UFAL Redmine

Configuration » History » Version 1

Redmine Admin, 01/04/2017 05:22 PM

 h1. Configuration
 h2. Overview
 A basic but up-to-date description of the configuration can be found in the INSTALL.md file of the original ÚČNK project (see https://bitbucket.org/ucnk/kontext).
 This page covers know-how that is unclear in, or completely missing from, that documentation.
 h2. Speech corpora
 A speech corpus is a written corpus enriched with audio recordings of the text. It requires some special setup.
 h3. Preparing audio
 * split the audio recordings into pieces corresponding to speech segments that should be played at a time
 * assign each segment a unique identifier (its value is referred to below as $ID)
 * convert each audio segment to MP3
 * store each segment in a separate file named $ID.mp3 in a flat directory $SPEECH_FILES_PATH/$CORPUS_ID
 <pre>
 /opt
    /data
        /speech
            /speech_corpus1
                file1.mp3
                file2.mp3
                file3.mp3
                ...
 </pre>
 In the case above, $SPEECH_FILES_PATH is /opt/data/speech and $CORPUS_ID is speech_corpus1.
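The naming rule above can be sketched in Python. This is only an illustration: the helper names @segment_path@ and @ffmpeg_command@ and the input file @segment_0001.wav@ are hypothetical, and the use of ffmpeg with the libmp3lame encoder is an assumption (any tool that produces MP3 files would do). The command is only built, not executed.

```python
from pathlib import PurePosixPath
from typing import List


def segment_path(speech_files_path: str, corpus_id: str, segment_id: str) -> PurePosixPath:
    """Return $SPEECH_FILES_PATH/$CORPUS_ID/$ID.mp3 for one speech segment."""
    return PurePosixPath(speech_files_path) / corpus_id / f"{segment_id}.mp3"


def ffmpeg_command(source: str, target: PurePosixPath) -> List[str]:
    """Build (but do not run) an ffmpeg call that encodes `source` as MP3.

    Assumption: ffmpeg with the libmp3lame encoder is installed.
    """
    return ["ffmpeg", "-i", source, "-codec:a", "libmp3lame", str(target)]


# Values taken from the directory listing above; the .wav name is made up.
target = segment_path("/opt/data/speech", "speech_corpus1", "file1")
print(target)
print(ffmpeg_command("segment_0001.wav", target))
```

The resulting list can be passed to @subprocess.run@ once the target directory exists.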
 h3. Preparing corpus vertical file
 * delimit the speech segments like this:
 <pre>
 <doc id="1">
 <s id="1">
 <seg soundfile="file1.mp3">
 word1
 word2
 word3
 ...
 </seg>
 </s>
 <s id="2">
 <seg soundfile="file2.mp3">
 word1
 word2
 word3
 ...
 </seg>
 </s>
 ...
 </doc>
 </pre>
 * the names *seg* and *soundfile* can be chosen arbitrarily
 * recompile the corpus in the standard way
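If the segments are already available as token lists, the markup above could be generated along the following lines. This is a minimal sketch: the function @render_doc@ is hypothetical, it assumes one speech segment per sentence as in the example, and it does no escaping of attribute values.

```python
from typing import Iterable, List, Tuple


def render_doc(doc_id: str, segments: Iterable[Tuple[str, List[str]]]) -> str:
    """Render one <doc> whose sentences each wrap a single speech segment.

    `segments` yields (soundfile, tokens) pairs; sentences are numbered from 1.
    """
    lines = [f'<doc id="{doc_id}">']
    for number, (soundfile, tokens) in enumerate(segments, start=1):
        lines.append(f'<s id="{number}">')
        lines.append(f'<seg soundfile="{soundfile}">')
        lines.extend(tokens)  # one token per line, as in any vertical file
        lines.append("</seg>")
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


vertical = render_doc("1", [
    ("file1.mp3", ["word1", "word2", "word3"]),
    ("file2.mp3", ["word1", "word2", "word3"]),
])
print(vertical, end="")
```

The structure and attribute names are hard-coded here as *seg* and *soundfile*; if other names are chosen, the values in config.xml have to match them.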
 h3. Updating config.xml
 * to <corpora> add the following elements:
 ** <speech_files_path>$SPEECH_FILES_PATH</speech_files_path>
 ** <speech_segment_struct_attr>$SPEECH_SEGMENT_STRUCT_ATTR</speech_segment_struct_attr>
 * to <corplist><corpus> add attribute:
 ** speech_segment="$SPEECH_SEGMENT"
 * for the vertical file shown above:
 ** $SPEECH_SEGMENT_STRUCT_ATTR should be set to: seg
 ** $SPEECH_SEGMENT should be set to: seg.soundfile
 The whole snippet should look like this:
 <pre>
 <corpora>
     <speech_files_path>/opt/data/speech</speech_files_path>
     <speech_segment_struct_attr>seg</speech_segment_struct_attr>
     <corplist title="">
         <corplist title="ÚFAL speech corpora">
             <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
         </corplist>
     </corplist>
 </corpora>
 </pre>
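A quick consistency check of these settings can be sketched with Python's standard library. This is not how KonText itself reads config.xml; the function @check_speech_config@ is hypothetical and only verifies that the structure part of @speech_segment@ matches @speech_segment_struct_attr@, then reports the directory where the MP3 files are expected.

```python
import xml.etree.ElementTree as ET

# The <corpora> fragment from above, embedded so the sketch is self-contained.
CONFIG_SNIPPET = """
<corpora>
    <speech_files_path>/opt/data/speech</speech_files_path>
    <speech_segment_struct_attr>seg</speech_segment_struct_attr>
    <corplist title="">
        <corplist title="ÚFAL speech corpora">
            <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
        </corplist>
    </corplist>
</corpora>
"""


def check_speech_config(xml_text):
    """Return (corpus_id, audio_dir) for every corpus that sets speech_segment."""
    corpora = ET.fromstring(xml_text.strip())
    files_path = corpora.findtext("speech_files_path")
    struct = corpora.findtext("speech_segment_struct_attr")
    found = []
    for corpus in corpora.iter("corpus"):
        segment = corpus.get("speech_segment")
        if segment is None:
            continue  # not a speech corpus
        structure, _, attribute = segment.partition(".")
        if structure != struct or not attribute:
            raise ValueError(f"speech_segment {segment!r} does not match structure {struct!r}")
        corpus_id = corpus.get("id")
        found.append((corpus_id, f"{files_path}/{corpus_id}"))
    return found


print(check_speech_config(CONFIG_SNIPPET))
```

For the snippet above the check reports speech_corpus1 together with /opt/data/speech/speech_corpus1, which is the directory prepared in the first step.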
