Basic Annotation Workflow

Text file

At this point, you may follow one of two paths:

  1. Use the Natural Language Processing (NLP) Service online
  2. Process using our NLP tools individually on your local machine

These two paths are outlined in detail in the following sections.

NLP Service Online Workflow

Copy your transcribed text into the NLP Service. This service provides SGML output of your text in a format that is tokenized, normalized, and tagged for part-of-speech, lemma, and language of origin.

Import the SGML into a spreadsheet.
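The exact SGML format emitted by the NLP Service is not reproduced here. As a rough sketch, assuming a hypothetical fragment in which norm_group elements wrap norm elements carrying norm, pos, lemma, and lang attributes, the output can be flattened into tab-separated rows ready for pasting into a spreadsheet:

```python
import csv
import io
import re

# Hypothetical SGML fragment; the real NLP Service output may differ.
sgml = """<norm_group group="auw">
<norm norm="auw" pos="CONJ" lemma="auw" lang="Coptic">auw</norm>
</norm_group>"""

rows = []
group = None
for line in sgml.splitlines():
    # Opening tag of a bound group: remember its text for the group column.
    m = re.match(r'<norm_group group="([^"]+)">', line)
    if m:
        group = m.group(1)
        continue
    # A norm element: one spreadsheet row per token.
    m = re.match(
        r'<norm norm="([^"]+)" pos="([^"]+)" lemma="([^"]+)" '
        r'lang="([^"]+)">([^<]*)</norm>', line)
    if m:
        norm, pos, lemma, lang, tok = m.groups()
        rows.append([tok, norm, group, pos, lemma, lang])

# Write tab-separated values, one column per annotation layer.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["tok", "norm", "norm_group", "pos", "lemma", "lang"])
writer.writerows(rows)
print(buf.getvalue())
```

The column names here follow the layer names used later in this workflow; adjust the regular expressions to whatever attribute names the service actually emits.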

Rename the existing layers according to the Annotation layer names guidelines. (Not all layers in the guidelines will exist in your file at this point.)

Proofread the tokenization of the bound groups.

  • Add or delete rows if necessary to create additional tokens.
  • You may wish to use Google Refine if working with a large file.
  • WARNING: Since the spreadsheet now contains many annotation layers, be careful when adding or deleting rows:
    • Make sure that the spans of the tok, norm, and at least one of the group layers are all accurately aligned.
    • Make sure that any annotation layers (e.g., hi@rend, gap@, etc.) still annotate the correct token(s).
    • Make sure that the pos, lemma, and lang layers are accurately aligned.
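These alignment checks can be done mechanically. In the sketch below, each layer is a list of hypothetical (start, end) character spans; a coarser layer (such as a group layer) is aligned with a finer one (such as tok) when every one of its boundaries falls on a boundary of the finer layer:

```python
def boundaries(spans):
    """Collect all start/end offsets of a layer's spans."""
    points = set()
    for start, end in spans:
        points.add(start)
        points.add(end)
    return points

def is_aligned(fine, coarse):
    """True if every span boundary in `coarse` coincides with a
    boundary in `fine` (e.g., group edges fall on token edges)."""
    return boundaries(coarse) <= boundaries(fine)

# Hypothetical spans: three tokens forming two bound groups.
tok = [(0, 3), (3, 5), (5, 9)]
norm = [(0, 3), (3, 5), (5, 9)]
norm_group = [(0, 5), (5, 9)]

assert is_aligned(tok, norm_group)          # group edges sit on token edges
assert boundaries(tok) == boundaries(norm)  # tok and norm match 1:1 here
```

If adding or deleting a row shifts offsets in one layer but not another, a check like this fails immediately, which is much easier than spotting the misalignment by eye in a large file.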

Note: the following steps are a guide to the kinds of work you will be doing, organized around the principle of editing one annotation layer at a time. Some people who use the NLP pipeline to annotate everything at once find it easier instead to correct all annotation layers for one row at the same time and to go through the file row by row.

Create an original text ("orig") layer. (You may do this last; you may even find it easier to do this last.)

Create a new layer, or clean up an existing one, for original text in bound groups ("orig_group"). (You may do this last; you may even find it easier to do this last.)

Proofread the normalized (norm) layer.

  • You may wish to use Google Refine.
  • You do not need to simultaneously proofread the norm_group layer; we can reconstruct norm_group using the data in norm.

Reconstruct the norm_group layer.
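Because norm_group is derived from norm, this reconstruction can be scripted. A minimal sketch, assuming hypothetical data in which each row carries a norm value and a bound-group id, joins the members of each group without internal spaces (Coptic bound groups are written solid):

```python
from itertools import groupby

def rebuild_norm_groups(norm_tokens, group_ids):
    """Rejoin per-token norm values into bound groups.
    Consecutive tokens sharing a group id are concatenated."""
    groups = []
    for _, members in groupby(zip(group_ids, norm_tokens),
                              key=lambda pair: pair[0]):
        groups.append("".join(tok for _, tok in members))
    return groups

# Hypothetical rows: three norm tokens forming two bound groups.
norms = ["a", "f", "sōtm"]
ids = [1, 1, 2]
print(rebuild_norm_groups(norms, ids))  # -> ['af', 'sōtm']
```

In a spreadsheet the same effect is a concatenation formula keyed on the group column; the point is that norm_group never needs hand-editing once norm is correct.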

Proofread the part of speech (pos) and lemma (lemma) layers.

Annotate for sub-word morphemes and create a morph layer.

Proofread the language of origin (lang) layer.

  • You may wish to use Google Refine.
  • Coptic SCRIPTORIUM annotates for language of origin on the morph level, not the word (norm) level. Be sure the language of origin tags align with the content and span in the morph layer rather than the norm layer.
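A quick script can flag lang annotations that span a whole norm unit instead of a single morph. This sketch assumes hypothetical (start, end) offsets for the morph and lang layers:

```python
def lang_spans_off_morphs(lang_spans, morph_spans):
    """Return lang spans that do not coincide exactly with a morph
    span (e.g., spans mistakenly aligned with a whole norm unit)."""
    morphs = set(morph_spans)
    return [span for span in lang_spans if span not in morphs]

# Hypothetical offsets: one norm (0, 6) split into morphs (0, 2) and (2, 6).
morph = [(0, 2), (2, 6)]
lang_ok = [(2, 6)]    # loanword tagged on the morph: correct
lang_bad = [(0, 6)]   # tagged on the whole norm unit: flagged

assert lang_spans_off_morphs(lang_ok, morph) == []
assert lang_spans_off_morphs(lang_bad, morph) == [(0, 6)]
```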

Add Metadata

Validate the file using the validation Excel Add-in.

Process using our NLP tools individually on your local machine

Tokenizer

Import into spreadsheet

You now have an Excel file with tokenized morphemes aligned with bound groups and normalized morphemes. (If you are working with a Sahidica document, you may have translations and verses as well; with a diplomatic transcription, line breaks, column breaks, and other manuscript annotations are aligned.)

Proofread the tokenization of the bound groups. Add or delete rows if necessary. You may wish to use Google Refine.

Normalization

Create a normalized bound group layer

Create a morph layer

Ensure the orig and norm layers are the same span.

Part of speech tagging and lemmatization. (Read the tagger instructions for parameters to lemmatize.)

Language of origin tagger

Metadata

Validate the file using the validation Excel Add-in.

basic_annotation_workflow.1456955314.txt.gz · Last modified: 2016/03/02 14:48 by admin