Corpora conversion workflow » History » Version 1
Redmine Admin, 01/04/2017 05:25 PM
h1. Corpora conversion workflow

h2. Prerequisites

* a website with a limited list of supported corpus formats and their detailed descriptions
* packaging guidelines for submitters
* a corpus metadata format specification (XML): a superset of the KonText configuration, the Manatee vertical-file metadata, and the Treex configuration
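
The metadata format above is only named, not specified, so the sketch below assumes a hypothetical layout: a single XML file with one section per consumer tool (KonText, Manatee, Treex). All element and attribute names here are illustrative, not a real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical example of the corpus metadata format; the element
# names are illustrative stand-ins, not an actual specification.
METADATA = """\
<corpus id="example_corpus">
  <kontext>
    <name>Example Corpus</name>
    <language>cs</language>
  </kontext>
  <manatee>
    <vertical>example.vert</vertical>
    <attributes>word,lemma,tag</attributes>
  </manatee>
  <treex>
    <scenario>conll2vert.scen</scenario>
  </treex>
</corpus>
"""

def read_metadata(xml_text):
    """Parse the metadata and return each tool's section as a dict,
    so later stages can pick out only the part they need."""
    root = ET.fromstring(xml_text)
    return {
        section.tag: {child.tag: child.text for child in section}
        for section in root
    }

meta = read_metadata(METADATA)
```

A "superset" format like this lets each stage (conversion, compilation, KonText configuration) read the same file and ignore the sections it does not consume.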
h2. Packaging guidelines for submitters

# make sure the data are in a well-defined (supported) format
# package the data in a standard way (e.g. use tar.gz or zip, and make sure the directory structure contains only text or gzipped text data files; all other information should be packaged separately)
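
The second guideline is mechanically checkable at submission time. A minimal sketch, assuming tar.gz packaging and assuming that "text or gzipped text data files" means the `.txt`/`.txt.gz` suffixes (both the helper name and the suffix list are hypothetical):

```python
import tarfile

# Assumption: plain text or gzipped text data files only.
ALLOWED_SUFFIXES = (".txt", ".txt.gz")

def check_package(path):
    """Return the names of archive members that violate the packaging
    guideline (anything that is not a regular text/gzipped-text file).
    An empty list means the package passes the check."""
    bad = []
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isdir():
                continue  # directory structure itself is allowed
            if not member.isfile() or not member.name.endswith(ALLOWED_SUFFIXES):
                bad.append(member.name)
    return bad
```

Running such a check in the submission stage gives the submitter immediate feedback instead of failing much later in the conversion job.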
h2. Workflow

This is a summary of the steps necessary to convert a user-submitted corpus for use in KonText:
# Submission stage (DSpace)
## submit the data to the repository
## specify the data format (CoNLL, Treex, PML?, ..., other)
## describe the data format in more detail using a wizard (analogous to the license selector; can be repository-independent)
## optionally tag the data as "Include the corpus in KonText"
## trigger the validation stage upon submission
# Validation stage (DSpace/Manatee server/?)
## validate the data against the metadata in a restricted environment (can be very space- and CPU-intensive)
## trigger the prepare-conversion stage
# Prepare conversion stage (Manatee server)
## download the user-provided metadata
## generate a conversion config file based on the user-provided metadata
## copy the conversion config file to the cluster
## trigger the conversion stage
# Conversion job (cluster)
## check for new conversion config files
## download the data from the repository
## unpack the downloaded data
## split the data into parts of reasonable size
## perform the conversion on the cluster
## monitor the status of the conversion job
## collect the data generated by the cluster nodes and assemble the vertical file
## delete the data, splits, and conversion config file from the cluster
## trigger the compilation stage
# Compilation stage (Manatee server)
## download the vertical file back from the cluster
## download the user-provided metadata
## generate a Manatee metadata file based on the user-provided metadata
## compile the corpus based on the Manatee metadata file
## trigger the delete-cluster-data-files stage
## trigger the update-KonText-configuration stage
# Delete cluster data files stage (cluster)
## delete the cluster data files and the user-provided metadata
# Update KonText configuration stage (Manatee server)
## check for new user metadata files
## generate a KonText partial config file based on the user-provided metadata
## update the KonText config file based on the partial config file
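
The split and collect steps of the conversion job can be sketched as follows. The line-based splitting heuristic, the part naming, and the helper names are all assumptions; "reasonable size" would in practice be tuned to the cluster's node capacity.

```python
import os

def split_for_cluster(src_path, out_dir, lines_per_part=100_000):
    """Split a large text file into numbered parts of at most
    `lines_per_part` lines each, returning the part paths in order."""
    os.makedirs(out_dir, exist_ok=True)
    parts, buf = [], []
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            buf.append(line)
            if len(buf) >= lines_per_part:
                parts.append(_write_part(out_dir, len(parts), buf))
                buf = []
    if buf:  # flush the final, possibly shorter, part
        parts.append(_write_part(out_dir, len(parts), buf))
    return parts

def _write_part(out_dir, index, lines):
    # Zero-padded names keep lexicographic order equal to split order.
    path = os.path.join(out_dir, f"part_{index:05d}.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.writelines(lines)
    return path

def assemble_vertical(part_paths, dest_path):
    """Concatenate the converted parts, in order, into the vertical file."""
    with open(dest_path, "w", encoding="utf-8") as dest:
        for path in sorted(part_paths):
            with open(path, encoding="utf-8") as part:
                dest.write(part.read())
```

Splitting on line boundaries matters here: a vertical file is line-oriented (one token per line), so cutting mid-line would corrupt the parts handed to the cluster nodes.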
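
The final stage merges a partial config into the main KonText config. A minimal sketch, assuming a hypothetical `<corplist>`/`<corpus id=...>` XML layout (not the real KonText schema): entries with a matching id are replaced, new ones are appended.

```python
import xml.etree.ElementTree as ET

def merge_partial_config(main_xml, partial_xml):
    """Merge corpus entries from a partial config into the main config.
    The element layout here is an illustrative assumption: corpora are
    assumed to live under <corplist> and carry an `id` attribute."""
    main = ET.fromstring(main_xml)
    partial = ET.fromstring(partial_xml)
    corplist = main.find("corplist")
    existing = {c.get("id"): c for c in corplist.findall("corpus")}
    for corpus in partial.findall("corpus"):
        old = existing.get(corpus.get("id"))
        if old is not None:
            corplist.remove(old)  # re-submission replaces the old entry
        corplist.append(corpus)
    return ET.tostring(main, encoding="unicode")
```

Replace-by-id rather than blind append matters for note 2 below: when a submitter corrects their metadata and the whole process is repeated, the regenerated partial config must overwrite the stale entry instead of duplicating it.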
h2. Notes

# Automatic deployment of corpora to the production environment without testing in a staging environment is unfortunate
# User metadata might be entered incorrectly, and any modification of the metadata means repeating the whole process
# Updating a corpus must be done in a separate environment (if we want 99% uptime)
# The whole process is *VERY* error-prone