Tokenizer

The Coptic SCRIPTORIUM tokenizer, tokenize_coptic.pl, is located at https://github.com/CopticScriptorium/tokenizers/releases/latest. Instructions for use, including parameters, are in the documentation (the README file) there.

Tokenization guidelines are in sections 3 & 4 of the Transcription Guidelines at http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf

With its default parameters, the tokenizer expects Coptic bound groups to be separated by spaces and breaks each bound group into its constituent morphemes.

Use the -l parameter to tokenize a text that contains line breaks within bound groups (for example, a transcription that follows the line breaks of a manuscript).
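As a rough illustration of the two modes, the sketch below builds one command line for each. Only the script name and the -l parameter come from this page. The input and output file names are hypothetical, and the sketch assumes that the script takes the input file as its last argument and writes its result to standard output. Check the README for the actual usage before relying on it.

```shell
#!/bin/sh
# Hypothetical file names; replace them with your own.
TOKENIZER="tokenize_coptic.pl"
INPUT="my_text.txt"

# Assumption: the input file is the last argument and the
# tokenized text is written to standard output.
DEFAULT_CMD="perl $TOKENIZER $INPUT"
LINEBREAK_CMD="perl $TOKENIZER -l $INPUT"

# Print both command lines so they can be checked before running.
echo "Default mode:    $DEFAULT_CMD"
echo "Line-break mode: $LINEBREAK_CMD"

# Run the default mode only if both files are in the current folder.
# The unquoted expansion relies on the file names containing no spaces.
if [ -f "$TOKENIZER" ] && [ -f "$INPUT" ]; then
    $DEFAULT_CMD > my_text_tokenized.txt
fi
```

To tokenize a manuscript transcription instead, the same run step would use the line-break command line in place of the default one.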

Ensure the Coptic text file:

The next steps describe how to process the text file with the tool. These instructions are written for users who are not familiar with Perl and the Terminal utility.