Tokenizer

The Coptic SCRIPTORIUM tokenizer, tokenize_coptic.pl, is located at https://github.com/CopticScriptorium/tokenizers/releases/latest. Instructions for use, including parameters, are in the documentation (the README file) there.

Tokenization guidelines are in sections 3 & 4 of the Transcription Guidelines at http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf

By default, the tokenizer splits space-separated Coptic bound groups into their constituent morphemes.

Use the -l parameter to tokenize a text that contains line breaks within bound groups (such as a transcription of a manuscript).

Ensure the Coptic text file:

  • is saved with UTF-8 character encoding
  • uses the Coptic Unicode character set (and not an older font that displays Latin characters as Coptic glyphs; if you have a transcription in a legacy font, we have converters, and there are others elsewhere online)
  • uses Windows line breaks/carriage returns. Most text editors have an option for toggling between Windows/Mac/Unix line breaks.
  • is well-formed if it contains any XML tags. (In other words, every tag is closed and the tags are nested in the appropriate hierarchy. A program such as Oxygen will highlight any tags that are not properly nested, if you wrap the whole text in an XML tag first.)
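The first and third requirements above can be checked from a Unix-style shell (for example, Terminal on a Mac). The following is only a minimal sketch: the function names is_utf8, has_crlf and to_crlf are hypothetical helpers, not part of the tokenizer, and the sketch assumes the standard utilities iconv, grep, tr and awk are installed. It does not check for legacy fonts or XML problems.

```shell
#!/bin/sh
# Pre-flight checks for a text file before tokenizing.
# All function names here are hypothetical; they are not part of tokenize_coptic.pl.

# Succeeds if the file is valid UTF-8 (iconv fails on malformed byte sequences).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Succeeds if the file is non-empty and every line ends with CR LF (Windows line breaks).
has_crlf() {
    cr=$(printf '\r')
    total=$(grep -c '' "$1" || true)         # number of lines in the file
    with_cr=$(grep -c "${cr}\$" "$1" || true) # number of lines ending in a carriage return
    [ "$total" -gt 0 ] && [ "$total" -eq "$with_cr" ]
}

# Writes a copy of the file with Windows line breaks to standard output.
to_crlf() {
    tr -d '\r' < "$1" | awk '{ printf "%s\r\n", $0 }'
}
```

For example, with a hypothetical file named inputfilename.txt, running is_utf8 inputfilename.txt && has_crlf inputfilename.txt && echo ok prints ok only when both checks pass, and to_crlf inputfilename.txt > converted.txt writes a converted copy without changing the original.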

The next steps describe how to process the text file with the tool. These instructions are written for users who are not familiar with Perl and the Terminal utility.

  • Put a copy of the text file in the same directory (or folder) on the computer as the tokenizer and the lexicon.
  • Windows users: open a Command Prompt window; Mac users: open a new Terminal window. (Terminal is found in the Utilities folder on a Mac.)
  • Change the directory on the command line/in Terminal to the directory containing your working files. On both Mac and PC, type the command “cd” followed by a space and the path to the relevant directory (e.g., home/documents/directorycontainingworkingfiles). On a Mac, a shortcut is to type cd followed by a space, then go to the Finder window for the relevant directory/folder, click on the little folder icon on the top bar next to the name of the directory, and drag it into the Terminal window. The correct path should appear.
  • In Terminal or on the command line, enter the command to process the file, being sure to provide a name for the new output file. Your command will look something like: perl tokenize_coptic.pl inputfilename.txt > outputfilename.txt. The > sends the tokenizer's output to the named file. You may wish to add one of the parameters, for example if you are tokenizing a diplomatic transcription of a manuscript: perl tokenize_coptic.pl -l inputfilename.txt > outputfilename.txt
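For users on a Unix-style shell, the last step above can be wrapped in a small function that first confirms the working files are in the current directory. This is a sketch under stated assumptions: tokenize_file is a hypothetical name, and only tokenize_coptic.pl and its -l parameter come from the instructions above. The lexicon file is not checked because its name is given in the README rather than here.

```shell
#!/bin/sh
# Hypothetical wrapper around the command shown above.
# Usage: tokenize_file [-l] inputfile outputfile
tokenize_file() {
    flag=""
    if [ "$1" = "-l" ]; then
        flag="-l"   # text has line breaks inside bound groups
        shift
    fi
    input=$1
    output=$2

    # The tokenizer and the input file must both be in the current directory.
    if [ ! -f tokenize_coptic.pl ]; then
        echo "tokenize_coptic.pl not found in $(pwd)" >&2
        return 1
    fi
    if [ ! -f "$input" ]; then
        echo "input file not found: $input" >&2
        return 1
    fi

    # $flag is left unquoted on purpose so that it disappears when empty.
    perl tokenize_coptic.pl $flag "$input" > "$output"
}
```

After changing to the working directory with cd, a call such as tokenize_file -l inputfilename.txt outputfilename.txt runs the same command as the one typed by hand, and stops with a message if either file is missing.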
tokenizer.txt · Last modified: 2015/09/09 04:31 by ctschroeder