Corpora conversion workflow » History » Version 1
Redmine Admin, 01/04/2017 05:25 PM
h1. Corpora conversion workflow

h2. Prerequisites

* a website with a limited list of supported corpus formats and their detailed descriptions
* packaging guidelines for submitters
* a corpus metadata format specification (XML): a superset of the KonText configuration, the Manatee vertical-file metadata, and the Treex configuration
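
The metadata format above is only named, not specified, so the sketch below assumes a hypothetical layout: a single XML file with one section per consumer tool (KonText, Manatee, Treex). All element and attribute names here are illustrative, not a real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical example of the corpus metadata format; the element
# names are illustrative stand-ins, not an actual specification.
METADATA = """\
<corpus id="example_corpus">
  <kontext>
    <name>Example Corpus</name>
    <language>cs</language>
  </kontext>
  <manatee>
    <vertical>example.vert</vertical>
    <attributes>word,lemma,tag</attributes>
  </manatee>
  <treex>
    <scenario>conll2vert.scen</scenario>
  </treex>
</corpus>
"""

def read_metadata(xml_text):
    """Parse the metadata and return each tool's section as a dict,
    so later stages can pick out only the part they need."""
    root = ET.fromstring(xml_text)
    return {
        section.tag: {child.tag: child.text for child in section}
        for section in root
    }

meta = read_metadata(METADATA)
```

A "superset" format like this lets each stage (conversion, compilation, KonText configuration) read the same file and ignore the sections it does not consume.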
h2. Packaging guidelines for submitters

# make sure the data are in a well-defined (supported) format
# package the data in a standard way (e.g. use tar.gz or zip, and make sure the directory structure contains only text or gzipped text data files; all other information should be packaged separately)
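
The second guideline is mechanically checkable at submission time. A minimal sketch, assuming tar.gz packaging and assuming that "text or gzipped text data files" means the `.txt`/`.txt.gz` suffixes (both the helper name and the suffix list are hypothetical):

```python
import tarfile

# Assumption: plain text or gzipped text data files only.
ALLOWED_SUFFIXES = (".txt", ".txt.gz")

def check_package(path):
    """Return the names of archive members that violate the packaging
    guideline (anything that is not a regular text/gzipped-text file).
    An empty list means the package passes the check."""
    bad = []
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isdir():
                continue  # directory structure itself is allowed
            if not member.isfile() or not member.name.endswith(ALLOWED_SUFFIXES):
                bad.append(member.name)
    return bad
```

Running such a check in the submission stage gives the submitter immediate feedback instead of failing much later in the conversion job.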
h2. Workflow

This is a summary of the steps necessary to convert a user-submitted corpus for use in KonText:
# Submission stage (DSpace)
## submit the data to the repository
## specify the data format (CoNLL, Treex, PML?, ..., other)
## describe the data format in more detail using a wizard (analogous to the license selector; can be repository-independent)
## optionally tag the data as "Include the corpus in KonText"
## trigger the validation stage upon submission
# Validation stage (DSpace/Manatee server/?)
## validate the data against the metadata in a restricted environment (can be very space- and CPU-intensive)
## trigger the prepare-conversion stage
# Prepare conversion stage (Manatee server)
## download the user-provided metadata
## generate a conversion config file based on the user-provided metadata
## copy the conversion config file to the cluster
## trigger the conversion stage
# Conversion job (cluster)
## check for new conversion config files
## download the data from the repository
## unpack the downloaded data
## split the data into parts of reasonable size
## perform the conversion on the cluster
## monitor the status of the conversion job
## collect the data generated by the cluster nodes and assemble the vertical file
## delete the data, splits, and conversion config file from the cluster
## trigger the compilation stage
# Compilation stage (Manatee server)
## download the vertical file back from the cluster
## download the user-provided metadata
## generate a Manatee metadata file based on the user-provided metadata
## compile the corpus based on the Manatee metadata file
## trigger the delete-cluster-data-files stage
## trigger the update-KonText-configuration stage
# Delete cluster data files stage (cluster)
## delete the cluster data files and the user-provided metadata
# Update KonText configuration stage (Manatee server)
## check for new user metadata files
## generate a KonText partial config file based on the user-provided metadata
## update the KonText config file based on the partial config file
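
The split and collect steps of the conversion job can be sketched as follows. The line-based splitting heuristic, the part naming, and the helper names are all assumptions; "reasonable size" would in practice be tuned to the cluster's node capacity.

```python
import os

def split_for_cluster(src_path, out_dir, lines_per_part=100_000):
    """Split a large text file into numbered parts of at most
    `lines_per_part` lines each, returning the part paths in order."""
    os.makedirs(out_dir, exist_ok=True)
    parts, buf = [], []
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            buf.append(line)
            if len(buf) >= lines_per_part:
                parts.append(_write_part(out_dir, len(parts), buf))
                buf = []
    if buf:  # flush the final, possibly shorter, part
        parts.append(_write_part(out_dir, len(parts), buf))
    return parts

def _write_part(out_dir, index, lines):
    # Zero-padded names keep lexicographic order equal to split order.
    path = os.path.join(out_dir, f"part_{index:05d}.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.writelines(lines)
    return path

def assemble_vertical(part_paths, dest_path):
    """Concatenate the converted parts, in order, into the vertical file."""
    with open(dest_path, "w", encoding="utf-8") as dest:
        for path in sorted(part_paths):
            with open(path, encoding="utf-8") as part:
                dest.write(part.read())
```

Splitting on line boundaries matters here: a vertical file is line-oriented (one token per line), so cutting mid-line would corrupt the parts handed to the cluster nodes.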
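
The final stage merges a partial config into the main KonText config. A minimal sketch, assuming a hypothetical `<corplist>`/`<corpus id=...>` XML layout (not the real KonText schema): entries with a matching id are replaced, new ones are appended.

```python
import xml.etree.ElementTree as ET

def merge_partial_config(main_xml, partial_xml):
    """Merge corpus entries from a partial config into the main config.
    The element layout here is an illustrative assumption: corpora are
    assumed to live under <corplist> and carry an `id` attribute."""
    main = ET.fromstring(main_xml)
    partial = ET.fromstring(partial_xml)
    corplist = main.find("corplist")
    existing = {c.get("id"): c for c in corplist.findall("corpus")}
    for corpus in partial.findall("corpus"):
        old = existing.get(corpus.get("id"))
        if old is not None:
            corplist.remove(old)  # re-submission replaces the old entry
        corplist.append(corpus)
    return ET.tostring(main, encoding="unicode")
```

Replace-by-id rather than blind append matters for note 2 below: when a submitter corrects their metadata and the whole process is repeated, the regenerated partial config must overwrite the stale entry instead of duplicating it.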
h2. Notes

# Automatic deployment of corpora to the production environment without testing in a staging environment is unfortunate
# User metadata might be entered incorrectly, and any modification of the metadata means repeating the whole process
# Updating a corpus must be done in a separate environment (if we want 99% uptime)
# The whole process is *VERY* error-prone