Natural Language Processing (NLP) Pipeline

  • Copy the digitized text into the NLP Service input window.
  • Select “My data contains meaningful linebreaks” if your text has been transcribed according to the Coptic SCRIPTORIUM transcription guidelines, with line breaks entered using the “enter” or “return” key. If your text already includes </lb> tags, select “ignore linebreaks in my data” instead.
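
The choice between the two linebreak options can be checked mechanically before pasting. The following is a minimal sketch, not part of the NLP Service itself: the function name `linebreak_option` is hypothetical, and the regular expression assumes that linebreak tags look like `<lb>`, `<lb n="1">`, `<lb/>` or `</lb>`.

```python
import re

# Assumption: linebreak tags are written as <lb>, <lb n="1">, <lb/> or </lb>.
LB_TAG = re.compile(r"</?lb\b[^>]*>")


def linebreak_option(text: str) -> str:
    """Return the name of the Service option suggested by the rule above."""
    if LB_TAG.search(text):
        # The text already carries explicit linebreak markup.
        return "ignore linebreaks in my data"
    # Otherwise, physical line breaks in the transcription carry the information.
    return "My data contains meaningful linebreaks"
```

For example, `linebreak_option(open("transcription.txt", encoding="utf-8").read())` would report which option to select for a file named `transcription.txt` (a placeholder filename).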

The NLP Service can either tokenize Coptic as part of the full SGML pipeline (select “SGML pipeline” in the Service) or produce the tokenization as a separate step. Coptic SCRIPTORIUM annotators will generally want to tokenize first and proofread that output before running the rest of the SGML pipeline.

  • Select “Just piped and dashed morphemes” and run the Service. (Pipes indicate segmentation into words; dashes indicate smaller morphs.)
  • Cut and paste the SGML output into a text file and proofread the automatic tokenization, editing as necessary.
  • Copy the proofread SGML back into the NLP Service input window. Under “Tokenize,” select “From pipes in input.”
  • Select all desired annotations (usually all except “parse”), and run the Service.
  • Copy and convert the SGML output into a multilayer spreadsheet format using the project’s converter.
  • Manually proofread and edit data in existing layers.
  • Add any missing layers manually or using other existing tools.
  • Check layer names to ensure they conform to project standards for the data model.
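
The piped and dashed format used in the tokenization steps above can be read with a few lines of Python, which may help when proofreading. This is a minimal sketch under stated assumptions, not the project's own tooling: it assumes that units are separated by whitespace, that pipes and dashes occur only as segmentation marks, and the function names and Latin-letter sample strings are invented for illustration.

```python
def parse_segmentation(line: str) -> list[list[list[str]]]:
    """Split a line into whitespace-separated units, then words, then morphs.

    Assumption: "|" separates words and "-" separates morphs within a word.
    """
    return [
        [word.split("-") for word in unit.split("|")]
        for unit in line.split()
    ]


def strip_segmentation(text: str) -> str:
    """Remove pipes and dashes so only the transcribed characters remain."""
    return text.replace("|", "").replace("-", "")


def only_segmentation_changed(automatic: str, proofread: str) -> bool:
    """Check that proofreading moved boundaries without altering the text."""
    return strip_segmentation(automatic) == strip_segmentation(proofread)
```

With the placeholder input `"a|f|sotm n|t|mnt-rome"`, `parse_segmentation` returns two units, the second of which ends in a word made of the two morphs `mnt` and `rome`. Running `only_segmentation_changed` on the automatic output and the proofread file before pasting the text back into the Service is one way to catch accidental edits to the transcription itself.
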
natural_language_processing_service_online.txt · Last modified: 2016/08/28 23:12 by eplatte