==== Basic Annotation Workflow ====
=== Transcribe your text ===

Transcribe your text in [[gitdox_workflow|GitDox]]. Alternatively, transcribe it in a [[transcribe_a_text|text file]].
At this point, you may follow one of two paths:
=== NLP Service Online Workflow ===
[[natural_language_processing_service_online|Run the NLP Service]].
  * If your text is in a text file, copy and paste it into the GitDox text editor. (See the [[gitdox_workflow|GitDox]] page for more information on using the GitDox text editor.)
  * If your text is already transcribed into the GitDox text editor and validated (see [[gitdox_workflow|GitDox]]), you can proceed directly to the NLP step.

You will see an NLP button below the text window. Click it.

//Note for veteran GitDox users: you do not need to proofread tokenization as part of the NLP Service process. The NLP service now works better without tokenizing first.//
[[import_macro|Import the SGML into a spreadsheet.]]
Rename the existing layers according to the [[annotation_layer_names|Annotation layer names guidelines]]. (Not all layers in the guidelines will exist in your file at this point.)
Remove any redundant columns. These may include hi (keep hi@rend) and supplied (keep supplied@reason).
Add missing information to existing layers. For instance, replace lb and cb placeholders in lb@n and cb@n columns with line and column numbers from the original.
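To make the placeholder-replacement step concrete, here is a hedged sketch. The function name, the bare "lb" placeholder string, and the flat-list representation of a spreadsheet column are all assumptions for illustration, not the actual GitDox or spreadsheet format.

```python
# Illustrative sketch (column layout and placeholder string are assumptions):
# replace each bare "lb" placeholder in an lb@n column with a running
# line number, leaving all other cells untouched.
def number_placeholders(column, placeholder="lb", start=1):
    n = start
    out = []
    for cell in column:
        if cell == placeholder:
            out.append(str(n))  # substitute the next line number
            n += 1
        else:
            out.append(cell)
    return out

print(number_placeholders(["lb", "word", "lb", "word", "lb"]))
# → ['1', 'word', '2', 'word', '3']
```

The same sketch works for a cb@n column by swapping in a "cb" placeholder and the appropriate starting column number.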
Note: the following steps are a guide to the kinds of work you will be doing.

[[create_a_normalized_bound_group_layer|Create an original text ("orig") layer.]]

Create a new layer, or clean up an existing one, for the [[create_a_normalized_bound_group_layer|original text in bound groups ("orig_group")]].
Proofread the normalized (norm) layer.
  * You do not need to simultaneously proofread the norm_group layer; we can reconstruct norm_group using the data in norm.
[[create_a_normalized_bound_group_layer|Reconstruct the norm_group layer]].
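As a loose illustration of what reconstructing norm_group from norm involves, here is a hedged Python sketch. It assumes each norm token is paired with a bound-group id; the real data lives in spreadsheet columns, and the (group id, token) pair representation here is hypothetical.

```python
# Illustrative sketch: rebuild norm_group values by concatenating, in
# order, the norm tokens that share a bound-group id. Coptic bound
# groups are written without internal spaces, so plain concatenation
# is used. The (group_id, token) input format is an assumption.
def reconstruct_norm_groups(rows):
    groups = {}
    order = []
    for group_id, norm in rows:
        if group_id not in groups:
            groups[group_id] = []
            order.append(group_id)
        groups[group_id].append(norm)
    return [(gid, "".join(groups[gid])) for gid in order]

rows = [(1, "ϩⲙ"), (1, "ⲡ"), (1, "ϩⲟⲟⲩ"), (2, "ⲉⲧ"), (2, "ⲙⲙⲁⲩ")]
print(reconstruct_norm_groups(rows))
# → [(1, 'ϩⲙⲡϩⲟⲟⲩ'), (2, 'ⲉⲧⲙⲙⲁⲩ')]
```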
Proofread the part of speech (pos), lemma (lemma), and morpheme (morph) layers.
Proofread the language of origin (lang) layer.
  * You may wish to use [[Google Refine]].
  * Coptic SCRIPTORIUM annotates for language of origin on the **morph** level, not the **word (norm)** level.
Add a translation layer, if applicable.

Add [[Metadata]].
Validate the file using the [[https://...]] validator.
=== Process using our NLP tools individually on your local machine ===
[[tokenizer|Tokenizer]]
[[Ensuring orig and norm layers are the same span]]
[[part_of_speech_tagging_using_tree-tagger|Part of speech tagging]]
[[language_of_origin_tagging|Language of origin tagger]]
Validate the file using the [[https://...]] validator.
basic_annotation_workflow.txt · Last modified: 2020/08/03 18:08 by admin