===== Basic Annotation Workflow =====
==== Transcribe your text ====
Transcribe your text in [[gitdox_workflow|GitDox]]. Alternatively, transcribe your text in a [[transcribe_a_text|text file]].

At this point, you will want to use the Natural Language Processing (NLP) Service online.

==== Running the NLP Service ====
[[natural_language_processing_service_online|Run the NLP Service]] on your transcribed text in GitDox.
* If your text is in a text file, copy and paste it into the GitDox text editor. (See the [[gitdox_workflow|GitDox]] page for more information on using the GitDox text editor.)
* If your text is already transcribed into the GitDox text editor and validated (see [[gitdox_workflow|GitDox]]), you are ready to run the NLP Service.

You will see an NLP button below the text window. Click it.

//Note for veteran GitDox users: you do not need to proofread tokenization as part of the NLP Service process. The NLP Service now works better without tokenizing first.//

Clicking this button will annotate your data and change the format from running text to a spreadsheet.

==== Editing the annotations in the GitDox spreadsheet ====

You can then edit the annotations manually. For example:
* You can add "…"
* You can correct the lemma, part-of-speech tag, or normalization
* You can add an editorial note in a note_note column

Tips:
* You may need to add or subtract rows if the tokenization needs correcting. When doing so, **be careful of all the annotation layers, especially the spans for pages, columns, and lines. Ensure the spans all begin and end where they are supposed to.**
* In the upper right of the spreadsheet, you can click the small bar above the scrolling column and pull it down one row to freeze the row of labels in place as you scroll.
* You do not need to manually edit the columns for parsing, and you do not need to worry about preserving the spans in these columns. Parsing will be done at a later step (whether manually or automatically).
* Use the [[https://coptic-dictionary.org|online dictionary]] to look up word forms and their lemmas.
* Use ANNIS to search for word forms and see how we usually tag/lemmatize them.
* Prefacing a layer label (column label) with "…"
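The span-alignment warning above can be made concrete with a small script. This is an illustrative sketch, not part of GitDox: it assumes each layer is a list of (start, end) token spans, 1-based and inclusive (an assumption about the data layout, not the actual GitDox format), and checks that the spans tile the token range with no gaps or overlaps.

```python
# Illustrative only: verify that a layer's spans cover every token exactly once.

def check_spans(n_tokens, spans):
    """spans: (start, end) token indices, 1-based and inclusive."""
    expected_start = 1
    for start, end in spans:
        # Each span must begin right where the previous one ended.
        if start != expected_start or end < start:
            return False
        expected_start = end + 1
    # The final span must end on the last token.
    return expected_start == n_tokens + 1

# A hypothetical page layer over 120 tokens, split into two pages:
print(check_spans(120, [(1, 60), (61, 120)]))  # True
print(check_spans(120, [(1, 60), (62, 120)]))  # False: gap after token 60
```

A check like this catches the most common damage from adding or deleting rows: a span layer that no longer begins or ends where it should.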

Add verse and chapter layers following our [[versification|guidelines on chapter divisions and versification]].

Add a translation layer (even if you are not providing a translation at this time). Translation layers are usually the same length as the verse layer.
* To add a translation layer, position …
* If you are not providing an English translation right now, fill all translation spans with "…"

Check all the column names to be sure they conform to [[annotation_layer_names|our guidelines for annotation layers]].
* In most cases, verse and translation spans should be identical in length, because typically verses are a sentence long. See [[versification|our guidelines on versification]].
* You can delete unnecessary layers. If you are unsure about whether a layer/column is needed, ask before deleting it.

Click the "…" button.

While data is saved automatically and frequently in the spreadsheet mode, we strongly recommend annotators commit their changes **often** using the commit log under the spreadsheet.

==== Using our NLP tools individually on your local machine ====

You will need to download the tools from our GitHub site. We no longer recommend this process, as the most up-to-date tools are in our online NLP tool suite, available in three ways:
- the [[https://corpling.uis.georgetown.edu/coptic-nlp/|NLP Service]] web interface
- an API (see [[https://…]])
- using our annotation environment GitDox (see above)
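If you use the API, a call from a script might look like the following sketch. The web interface lives at the URL above, but the exact API endpoint path and the form-field names ("data", "format") used here are assumptions for illustration; consult the API documentation for the real interface.

```python
# Hypothetical sketch of calling the online NLP service from a script.
# The endpoint path and form-field names are assumptions, not the
# documented interface.
import urllib.parse
import urllib.request

API_URL = "https://corpling.uis.georgetown.edu/coptic-nlp/api"  # assumed path

def build_payload(text, output_format="sgml"):
    # Encode the transcription as POST form data.
    fields = {"data": text, "format": output_format}
    return urllib.parse.urlencode(fields).encode("utf-8")

def annotate(text):
    # Send the transcription and return the annotated document as text.
    req = urllib.request.Request(API_URL, data=build_payload(text))
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8")
```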

If you are using our standalone tools, here are the steps:
[[import_macro|Import the SGML into a spreadsheet.]]

Rename the existing layers according to the [[annotation_layer_names|Annotation layer names guidelines]]. (Not all layers in the guidelines will exist in your file at this point.)

Remove any redundant columns. These may be hi (keep hi@rend); supplied (keep supplied@reason).
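Removing redundant columns can also be scripted if you export the spreadsheet. A minimal sketch, assuming a simple list-of-rows representation; the column names follow the examples above, but the data values and the drop_columns helper are hypothetical, not one of our tools:

```python
# Illustrative sketch: keep only the columns whose names are not redundant.

def drop_columns(header, rows, redundant):
    keep = [i for i, name in enumerate(header) if name not in redundant]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows

header = ["tok", "hi", "hi@rend", "supplied", "supplied@reason"]
rows = [["ⲛⲟⲩⲧⲉ", "hi", "red", "supplied", "lacuna"]]  # hypothetical values
new_header, new_rows = drop_columns(header, rows, {"hi", "supplied"})
print(new_header)  # ['tok', 'hi@rend', 'supplied@reason']
```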

Add missing information to existing layers. For instance, replace the lb and cb placeholders in the lb@n and cb@n columns with line and column numbers from the original manuscript.

Note: the following steps are a guide to the kinds of work you will be doing.

[[create_a_normalized_bound_group_layer|Create an original text ("orig") layer.]]

Create a new or clean up an existing layer for [[create_a_normalized_bound_group_layer|original text in bound groups ("orig_group")]].
Proofread the normalized (norm) layer.
* You do not need to simultaneously proofread the norm_group layer; we can reconstruct norm_group using the data in norm.

[[create_a_normalized_bound_group_layer|Reconstruct the norm_group layer]].
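The norm_group reconstruction can be pictured as a simple join: each bound group spans some number of norm units, and its text is the concatenation of those units. A sketch under that assumption (the function and data here are hypothetical, not one of our tools):

```python
# Illustrative only: rebuild norm_group values by joining the norm units
# that fall inside each bound-group span.

def rebuild_norm_groups(norms, group_sizes):
    """norms: normalized units in order; group_sizes: how many norm
    units each bound group spans."""
    groups, i = [], 0
    for size in group_sizes:
        groups.append("".join(norms[i:i + size]))
        i += size
    return groups

# One bound group made of three norm units (hypothetical example):
print(rebuild_norm_groups(["ⲁ", "ϥ", "ⲥⲱⲧⲙ"], [3]))  # ['ⲁϥⲥⲱⲧⲙ']
```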

Proofread the part of speech (pos), lemma (lemma), and morpheme (morph) layers.

Proofread the language of origin (lang) layer.
* You may wish to use [[Google Refine]].
* Coptic SCRIPTORIUM annotates for language of origin on the **morph** level, not the **word (norm)** level.

Add translation, verse, and chapter layers.

Add [[Metadata]].

Validate the file using the [[https://…]] validator.
basic_annotation_workflow.1444847188.txt.gz · Last modified: 2015/10/14 12:26 by admin