Basic Annotation Workflow

Text file

At this point, you may follow one of two paths:

  1. Use the Natural Language Processing (NLP) Service online
  2. Process the text with each of our NLP tools individually on your local machine

NLP Service Online Workflow

Copy your transcribed text into the NLP Service. The service returns your text as SGML that is tokenized, normalized, and tagged for part of speech, lemma, and language of origin.

Import the SGML into a spreadsheet.
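If you prefer to script the import, the following is a minimal sketch of turning tagged SGML into spreadsheet rows. It assumes a line-oriented layout (one tag or one token per line) and uses hypothetical element and attribute names (norm_group, norm, pos, lemma) with placeholder values; the actual output of the NLP Service may differ, so adjust the names to match what you receive. The helper names sgml_to_rows and rows_to_csv are likewise illustrative only.

```python
import csv
import io
import re

# Hypothetical sample: element names, attribute names, and values are
# placeholders, not guaranteed to match the real NLP Service output.
SAMPLE = """\
<norm_group norm_group="g1">
<norm norm="n1" pos="P1" lemma="l1">
t1
</norm>
<norm norm="n2" pos="P2" lemma="l2">
t2
</norm>
</norm_group>
<norm_group norm_group="g2">
<norm norm="n3" pos="P3" lemma="l3">
t3
</norm>
</norm_group>
"""

OPEN_TAG = re.compile(r'<(\w+)((?:\s+[\w:@-]+="[^"]*")*)\s*(/?)>')
CLOSE_TAG = re.compile(r"</(\w+)>")
ATTR = re.compile(r'([\w:@-]+)="([^"]*)"')


def sgml_to_rows(text):
    """Return one dict per token line, carrying the attributes of all open elements."""
    open_attrs = {}  # element name -> attributes of the currently open element
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        closing = CLOSE_TAG.fullmatch(line)
        if closing:
            open_attrs.pop(closing.group(1), None)
            continue
        opening = OPEN_TAG.fullmatch(line)
        if opening:
            if not opening.group(3):  # skip self-closing tags
                open_attrs[opening.group(1)] = dict(ATTR.findall(opening.group(2)))
            continue
        # Any other line is treated as a token.
        row = {"tok": line}
        for attrs in open_attrs.values():
            row.update(attrs)
        rows.append(row)
    return rows


def rows_to_csv(rows, columns):
    """Serialize rows to CSV text; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="",
        extrasaction="ignore", lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(sgml_to_rows(SAMPLE), ["tok", "norm", "norm_group", "pos", "lemma"]))
```

The resulting CSV text can be saved to a file and opened in a spreadsheet program. Note that this sketch repeats span values such as norm_group on every token row rather than merging cells.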

Proofread the tokenization of the bound groups.

  • Add or delete rows if necessary to create additional tokens.
  • You may wish to use Google Refine.
  • WARNING: Since the spreadsheet now contains many annotation layers, be careful when adding or deleting rows:
    • Make sure that the spans of the tok, norm, and at least one of the group layers are all accurately aligned.
    • Make sure that any annotation layers (e.g., hi@rend, gap@, etc.) still annotate the correct token(s).
    • Make sure that the pos, lemma, and lang layers are accurately aligned.
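Part of this alignment check can be automated after adding or deleting rows. The sketch below assumes the spreadsheet has been exported to CSV with one row per token and columns named tok, norm, pos, and lemma; these column names and the sample values are assumptions for illustration. It flags any row in which some of the required token-level cells are filled and others are empty. Layers whose spans cover several rows (such as the group layers) or that are legitimately empty for many tokens are left out of the default check.

```python
import csv
import io

# Hypothetical export: column names and values are placeholders.
SAMPLE_CSV = (
    "tok,norm,pos,lemma,lang\n"
    "t1,n1,P1,l1,\n"
    "t2,n2,,l2,\n"
    ",n3,P3,l3,x\n"
    "t4,n4,P4,l4,\n"
)


def find_misaligned_rows(rows, required=("tok", "norm", "pos", "lemma")):
    """Return (spreadsheet_row_number, missing_columns) for partially filled rows."""
    problems = []
    # Start at 2 because row 1 of the spreadsheet holds the header.
    for number, row in enumerate(rows, start=2):
        filled = {name: bool((row.get(name) or "").strip()) for name in required}
        if any(filled.values()) and not all(filled.values()):
            missing = [name for name in required if not filled[name]]
            problems.append((number, missing))
    return problems


if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
    for number, missing in find_misaligned_rows(list(reader)):
        print(f"row {number}: missing {', '.join(missing)}")
```

A check like this only catches empty cells; it cannot tell whether a value sits one row too high or too low, so manual proofreading is still needed.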

Create an "orig" layer

Create an "orig_group" layer

Proofread the norm layer. (You may wish to use Google Refine.)

Reconstruct

Tokenizer

Import into spreadsheet

You now have an Excel file in which the tokenized morphemes are aligned with their bound groups and with the normalized morphemes. (If you are working with a Sahidica document, you may have translations and verses as well; if you are working with a diplomatic transcription, line breaks, column breaks, and other manuscript annotations are aligned too.)

Proofread the tokenization of the bound groups. Add or delete rows if necessary. You may wish to use Google Refine.

Normalization

Create a normalized bound group layer

Create a morph layer

Ensure the orig and norm layers cover the same span

Part of speech tagging

Language of origin tagger

Metadata

basic_annotation_workflow.1444844915.txt.gz · Last modified: 2015/10/14 11:48 by admin