Coptic SCRIPTORIUM Wiki

Natural Language Processing (NLP) Pipeline

This page describes how to use the NLP tools on the public Coptic Scriptorium website.

Access the Natural Language Processing Service Online.
Copy the digitized text into the NLP Service text box.
Be sure “My data contains meaningful linebreaks” is selected, assuming that your text has been transcribed according to Coptic SCRIPTORIUM transcription guidelines, and that line breaks are indicated using the “enter” or “return” key. If your text already includes </lb> tags, select “ignore linebreaks in my data.”
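
The linebreak choice above can be expressed as a small check. This is only a sketch: the function name `linebreak_option` is hypothetical, and it simply looks for the `</lb>` tag mentioned above and returns the matching option label as quoted on this page. It does not talk to the NLP Service.

```python
def linebreak_option(text: str) -> str:
    """Return the label of the linebreak option that fits the input text.

    Assumption: the presence of a literal "</lb>" tag is treated as the sign
    that line breaks are already marked up in the text.
    """
    if "</lb>" in text:
        # Line breaks are already tagged, so raw newlines carry no meaning.
        return "ignore linebreaks in my data"
    # Otherwise newlines typed with "enter"/"return" mark the line breaks.
    return "My data contains meaningful linebreaks"
```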

The NLP tools can tokenize Coptic either as part of the entire NLP SGML pipeline (select “SGML pipeline” in the Service) or as a separate step. As of 2020, Coptic SCRIPTORIUM annotators no longer need to tokenize and proof the output before running the rest of the pipeline; you can now skip to step 4 if you are a project annotator using the web interface for NLP. Note: Most annotators choose to use the GitDox annotation tool instead of the public website. Visit the GitDox and GitHub page for more information about GitDox.

To proofread tokens: Select “Just piped and dashed morphemes” and run the Service. (Pipes indicate segmentation into words; dashes indicate smaller morphs.)
Cut and paste the SGML output into a text file and proofread the automatic tokenization, editing as necessary.
Copy the proofread SGML back into the NLP Service input window. Under “Tokenize,” select “From pipes in input.”
Select all annotations desired, and run the Service.
Copy the SGML output for your desired use.
- Note: Coptic SCRIPTORIUM annotators can copy the output, paste and save it in a plain text file, and then import that text file into GitDox. The spreadsheet mode in GitDox has an option to import SGML. Most annotators will never use this feature; do not be concerned if you are not familiar with it.
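
When proofreading the piped and dashed tokenization described above, it can help to see the segmentation as nested lists and to catch stray separators left behind by hand edits. The following is a minimal sketch, not part of the Coptic SCRIPTORIUM tools. The function names are hypothetical, the sample strings are Latin-letter placeholders rather than real Coptic, and it assumes that the text is plain (no SGML tags), that whitespace separates larger groups, and that pipes and dashes appear only as separators.

```python
def parse_piped(text: str) -> list[list[list[str]]]:
    """Split piped/dashed text into groups -> words -> morphs.

    Assumptions: whitespace separates groups, "|" separates words,
    and "-" separates the smaller morphs inside a word.
    """
    groups = []
    for group in text.split():
        words = [word.split("-") for word in group.split("|")]
        groups.append(words)
    return groups


def find_empty_segments(text: str) -> list[str]:
    """Return the groups that contain an empty word or morph.

    An empty segment usually means a doubled, leading, or trailing
    pipe or dash was left behind while editing.
    """
    problems = []
    for group in text.split():
        for word in group.split("|"):
            if any(morph == "" for morph in word.split("-")):
                problems.append(group)
                break  # report each group at most once
    return problems


if __name__ == "__main__":
    sample = "ab|cd-ef gh"  # placeholder text, not real Coptic
    print(parse_piped(sample))
    print(find_empty_segments("ab||cd ef-"))
```

Running `find_empty_segments` on the edited text before pasting it back into the Service is one way to spot a doubled pipe or a dangling dash; an empty result means no empty segments were found under the assumptions above.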
