Basic Annotation Workflow

Transcribe your text

Transcribe your text in GitDox or, alternatively, into a text file. Make sure the transcription divides the text into bound groups.

At this point, you will want to use the Natural Language Processing (NLP) Service online. Veteran users will remember processing texts with individual NLP tools on a local machine and then editing the annotations in an Excel spreadsheet. The local process is outlined at the end of this page, but it is no longer recommended.

Running the NLP Service in GitDox

Run the NLP Service on your transcribed text in GitDox.

You will see an NLP button below the text window. Click it.

Note for veteran GitDox users: you do not need to proofread tokenization as part of the NLP Service process. The NLP Service now works better on text that has not been tokenized first.

Clicking this button will annotate your data and change the format from running text to a spreadsheet, in which each item of textual data (known as a “token”) is a row in the spreadsheet, with each annotation as a column.
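The row-and-column layout described above can be pictured with a minimal Python sketch. It is only an illustration of the data shape: the class name TokenRow, the helper column, and all token and annotation values are hypothetical placeholders, not real NLP Service output or GitDox internals.

```python
from dataclasses import dataclass, field


@dataclass
class TokenRow:
    """One spreadsheet row: a token plus its annotations, keyed by column name."""
    token: str
    annotations: dict = field(default_factory=dict)


def column(rows, name):
    """Read one annotation column top to bottom; None where a row has no value."""
    return [row.annotations.get(name) for row in rows]


# Placeholder data (assumption): three tokens with made-up annotation values.
rows = [
    TokenRow("tok1", {"pos": "POS_A", "lemma": "lemma1"}),
    TokenRow("tok2", {"pos": "POS_B", "lemma": "lemma2"}),
    TokenRow("tok3", {"pos": "POS_A"}),
]

print(column(rows, "pos"))    # ['POS_A', 'POS_B', 'POS_A']
print(column(rows, "lemma"))  # ['lemma1', 'lemma2', None]
```

Editing an annotation then amounts to changing one cell: one column value in one token's row.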

Editing the annotations in the GitDox spreadsheet

You can then edit the annotations manually.

Tips:

Add verse and chapter layers following our guidelines on chapter divisions and versification.

Add a translation layer (even if you are not providing a translation at this time). Translation layers are usually the same length as the verse layer.

Check all the column names to be sure they conform to our guidelines for annotation layers.

Click the “validation” button to validate the data, that is, to ensure its structure is valid for publication. If you are having trouble fixing validation errors, please contact a senior editor.

While data is saved automatically and frequently in the spreadsheet mode, we strongly recommend annotators commit their changes often using the commit log under the spreadsheet. Please use a commit message you will understand when you return to your work days or weeks later. (For example, “checked rows 101-200” is more detailed and more understandable than “continued checking annotations”.)
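The tip above about translation layers usually matching the verse layer can be spot-checked with a short script once the spans are exported. The sketch below assumes a hypothetical representation in which each span is a (first_row, last_row) pair; GitDox itself does not necessarily expose spans this way, and the function name and row numbers are illustrative only.

```python
def span_mismatches(verse_spans, translation_spans):
    """Return translation spans that do not cover exactly the same rows as some verse span."""
    verses = set(verse_spans)
    return [span for span in translation_spans if span not in verses]


# Illustrative row ranges (assumption): the second translation stops one row early.
verse_spans = [(1, 12), (13, 30)]
translation_spans = [(1, 12), (13, 29)]

print(span_mismatches(verse_spans, translation_spans))  # [(13, 29)]
```

Because translation spans are only usually the same length as verses, a reported mismatch is something to review, not necessarily an error.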

Using our NLP tools individually on your local machine

You will need to download the tools from our GitHub site. We no longer recommend this process, as the most up-to-date tools are in our online NLP tool suite, which is available in three ways:

  1. on our website (cut and paste or type in Coptic text)
  2. an API (see https://corpling.uis.georgetown.edu/coptic-nlp/ for contact information)
  3. using our annotation environment GitDox (see above)

If you are using our standalone tools, here are the steps:

Import the SGML into a spreadsheet.

Rename the existing layers according to the Annotation layer names guidelines. (Not all layers in the guidelines will exist in your file at this point.)

Remove any redundant columns. These may be hi (keep hi@rend); supplied (keep supplied@reason etc.); gap (keep gap@reason etc.).

Add missing information to existing layers. For instance, replace lb and cb placeholders in lb@n and cb@n columns with line and column numbers from original manuscript.
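The two clean-up steps above (removing redundant columns and filling in break numbers) can be sketched in Python. This is an illustration under several assumptions: the spreadsheet is represented as a dictionary mapping column names to lists of cells, a bare column counts as redundant only when a matching attribute column such as hi@rend is present, and manuscript lines are numbered consecutively. Real line and column numbers must come from the manuscript itself. The function names are hypothetical.

```python
def drop_redundant(table, redundant=("hi", "supplied", "gap")):
    """Drop a bare element column only if an attribute column (e.g. 'hi@rend') exists for it."""
    names = set(table)
    kept = {}
    for name, cells in table.items():
        has_attribute_column = any(other.startswith(name + "@") for other in names)
        if name in redundant and has_attribute_column:
            continue  # information is carried by the attribute column
        kept[name] = cells
    return kept


def number_breaks(cells, placeholder, start=1):
    """Replace each placeholder with a running number (assumes consecutive numbering)."""
    numbered, n = [], start
    for cell in cells:
        if cell == placeholder:
            numbered.append(str(n))
            n += 1
        else:
            numbered.append(cell)
    return numbered


# Placeholder table (assumption): column name -> one cell per row.
table = {
    "hi": ["hi", "", ""],
    "hi@rend": ["red", "", ""],
    "gap": ["", "gap", ""],
    "lb@n": ["lb", "", "lb"],
}

cleaned = drop_redundant(table)
cleaned["lb@n"] = number_breaks(cleaned["lb@n"], "lb")

print(list(cleaned))    # ['hi@rend', 'gap', 'lb@n']
print(cleaned["lb@n"])  # ['1', '', '2']
```

In this sketch the gap column is kept because no gap@reason column exists yet; dropping it would lose information.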

Note: the following steps are a guide to the kinds of work you will be doing. They are organized around the principle of editing one annotation layer at a time. Some editors who use the NLP pipeline to annotate everything at once, however, prefer to correct all annotation layers for one row together and to work through the file row by row.

Create an original text ("orig") layer. (You may find it easier to do this last.)

Create a new layer, or clean up an existing one, for original text in bound groups ("orig_group"). (You may find it easier to do this last.)

Proofread the normalized (norm) layer.

Reconstruct the norm_group layer.

Proofread the part of speech (pos), lemma (lemma), and morpheme (morph) layers. Part of speech and lemma are annotated on the norm level.

Proofread the language of origin (lang) layer.

Add translation, paragraph, and other layers as necessary following the annotation layer names guidelines.

Add Metadata.

Validate the file using the validation Excel Add-in.

Process using our NLP tools individually on your local machine

Tokenizer

Import into spreadsheet

You now have an Excel file in which tokenized morphemes are aligned with bound groups and normalized morphemes. (If you are working with a Sahidica document, you may have translations and verses as well; in a diplomatic transcription, line breaks, column breaks, and other manuscript annotations are aligned too.)

Proofread the tokenization of the bound groups. Add or delete rows if necessary. You may wish to use Google Refine.

Normalization

Create a normalized bound group layer

Create a morph layer

Ensure the orig and norm layers have the same span

Part of speech tagging and lemmatization. (Read the tagger instructions for parameters to lemmatize.)

Language of origin tagger

Metadata

Validate the file using the validation Excel Add-in.
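The step above on keeping the orig and norm layers on the same span can be checked mechanically before validation. The sketch below assumes each layer has been exported as one cell per row, where a non-empty cell starts a new span and an empty cell continues the span above it. That export format, the function names, and the sample values are assumptions for illustration, not part of the project's tools.

```python
def span_starts(cells):
    """Row indices where a new span begins; empty cells continue the span above."""
    return [i for i, cell in enumerate(cells) if cell]


def misaligned_rows(orig, norm):
    """Rows where exactly one of the two layers starts a new span."""
    return sorted(set(span_starts(orig)) ^ set(span_starts(norm)))


# Placeholder columns (assumption): orig starts a span on row 3, norm does not.
orig = ["a", "", "b", "c"]
norm = ["a", "", "b", ""]

print(misaligned_rows(orig, norm))  # [3]
```

An empty result means the two layers break at the same rows; any listed row is a place to inspect by hand.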