--- natural_language_processing_service_online [2020/08/03 17:27] – admin
+++ natural_language_processing_service_online [2020/08/06 08:33] (current) – amirzeldes
@@ Line 1: / Line 1: @@
-===Natural Language Processing (NLP) Pipeline===
+===== Natural Language Processing (NLP) Pipeline =====
+This page describes how to use the NLP tools on the public Coptic Scriptorium website.
   * Access the [[https://corpling.uis.georgetown.edu/coptic-nlp/|Natural Language Processing Service Online]].
-  * Copy the digitized text into NLP Service.
+  * Copy the digitized text into the NLP Service text box.
-  * Be sure "My data contains meaningful linebreaks" is selected, assuming that your text has been transcribed according to Coptic SCRIPTORIUM [[http://copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf|transcription guidelines]], and that line breaks are indicated using the "enter" or "return" key. If your text already includes </lb> tags, select "ignore linebreaks in my data."
+  * Be sure "My data contains meaningful linebreaks" is selected, assuming that your text has been transcribed according to Coptic SCRIPTORIUM [[https://github.com/CopticScriptorium/tagger-part-of-speech/raw/master/scriptorium-transcription-guidelines.pdf|transcription guidelines]], and that line breaks are indicated using the "enter" or "return" key. If your text already includes </lb> tags, select "ignore linebreaks in my data."
-The NLP can either tokenize Coptic as part of the entire NLP SGML pipeline (select "SGML pipeline" in the Service) or produce tokenization as a separate step. As of 2020, Coptic SCRIPTORIUM annotators no longer need to tokenize and proof the output before running the rest of the SGML pipeline.
-  - Select “Just piped and dashed morphemes” and run the Service.  (Pipes indicate segmentation into words; dashes indicate smaller morphs.)
+The NLP tools can either tokenize Coptic as part of the entire NLP SGML pipeline (select "SGML pipeline" in the Service) or produce tokenization as a separate step. **As of 2020, Coptic SCRIPTORIUM annotators no longer need to tokenize and proof the output before running the rest of the pipeline; you can now skip to step 4 if you are a project annotator using the web interface for NLP.  Note: Most annotators choose to use the GitDox annotation tool instead of the public website. Visit the [[gitdox_workflow|GitDox and GitHub]] page for more information about GitDox.**
+  - To proofread tokens: Select “Just piped and dashed morphemes” and run the Service.  (Pipes indicate segmentation into words; dashes indicate smaller morphs.)
   - Cut and paste the SGML output into a text file and proofread the automatic tokenization, editing as necessary.
   - Copy the proofread SGML back into the NLP Service input window.  Under “Tokenize,” select “From pipes in input.”
-  - Select all annotations desired (usually all except “parse”), and run the Service.
+  - Select all annotations desired, and run the Service.
-  - Copy and convert the SGML output into a multilayer spreadsheet format using the [[import_macro|project’s converter]].
+  - Copy the SGML output for your desired use.
-  - Manually proofread and edit data in existing layers.
+      * Note: Coptic SCRIPTORIUM annotators can copy the output, paste it into a plain text file, save it, and then import that file into GitDox. The spreadsheet mode in GitDox has an option to import SGML. Most annotators will never use this feature; do not be concerned if you are not familiar with it.
-  - Add any missing layers manually or using other existing tools.
-  - Check layer names to ensure they conform to project standards for the [[annotation_layer_names|data model]]
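
The pipe-and-dash notation mentioned in the tokenization step can be illustrated with a short sketch. This is not part of the NLP Service; it only shows how a proofreader might inspect "piped and dashed" text locally. It assumes that whitespace separates groups, pipes separate words within a group, and dashes separate morphs within a word. The function names and the sample string (Latin placeholder letters, not real Coptic output) are hypothetical.

```python
def parse_piped_line(line):
    """Split one line into groups -> words -> morphs.

    Assumptions (not confirmed by the service documentation):
    whitespace separates groups, '|' separates words, '-' separates morphs.
    """
    groups = []
    for group in line.split():
        # Each word becomes a list of its morphs.
        words = [word.split("-") for word in group.split("|")]
        groups.append(words)
    return groups


def to_piped_line(groups):
    """Rebuild the piped-and-dashed string from the nested structure."""
    return " ".join(
        "|".join("-".join(morphs) for morphs in words) for words in groups
    )


# Placeholder input: two groups, the first containing a two-morph word.
sample = "a|b-c|d e|f"
parsed = parse_piped_line(sample)
print(parsed)
print(to_piped_line(parsed) == sample)
```

A round trip like this can help confirm that manual edits left the pipes and dashes well-formed before the text is pasted back into the input window with "From pipes in input" selected. Text that contains literal hyphens or pipes as content would need extra handling that this sketch does not attempt.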