
Procedure for processing a text for inclusion in our corpora: March 2015 Workshop Demo

We are happy to help you with this process. Please contact a member of the SCRIPTORIUM team if you have a text you want to include in our corpora. Sometimes there are copyright or intellectual property issues with a text (especially if you are working from a published edition), and we can talk to you about the options in these cases.

The following documentation describes processing a demo text (one saying from the Sahidic Apophthegmata Patrum).

What you will need

Installations

Perl programming language (Windows users only; Perl comes pre-installed on Mac OS)
Text editor such as TextWrangler
Antinoou Coptic font for Unicode (UTF-8)
Microsoft Excel

Downloads

All of the following are available in a zip file
Sample file of Coptic text (in UTF-8)
SCRIPTORIUM Tokenizer
Zeldes Excel add-in to import text files into a spreadsheet for annotation (Mac OS | Windows)
SCRIPTORIUM Normalizer
Tree-tagger program for Windows or Mac OS
SCRIPTORIUM Coptic models for Tree-tagger
Language of origin tagger
Also please download this additional file (the lexicon for the tokenizer). This file is not in the big zip file. After you download it, open the zip, and move the file to the main workshop folder (the same folder that contains the tokenize_coptic.pl file).

Overview of process
1. Text file

Take a text file of Coptic keyed in the Unicode Coptic character set. (SCRIPTORIUM uses the Antinoou font and keyboard.)

Save the file as a plain text file (.txt) encoded in UTF-8. (Most text editors have an option in Preferences for opening and saving texts in UTF-8.)

Be sure there is a space between each Coptic bound group. (You do not need to segment all the morphemes; we follow Layton's conventions for segmenting bound groups. For example, ⲁϥⲟⲩⲱϣⲃ ⲛⲁϥ, not ⲁ ϥ ⲟⲩⲱϣⲃ ⲛⲁ ϥ, nor ⲁϥⲟⲩⲱϣⲃⲛⲁϥ. Our Transcription Guidelines have more details.)

Mac users: save your text file using Unix line breaks/carriage returns. Most text editors have an option for toggling between Windows/Mac/Unix line breaks; we recommend the free text editor TextWrangler (http://www.barebones.com/products/textwrangler/). This formatting option is displayed at the bottom of the window.
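
If you prefer to work in the terminal, you can check the encoding and convert the line breaks there. The following commands are a sketch assuming standard Unix utilities and the sample filename; substitute your own filename:

  file AP.006.n196.worms_utf8.txt
  # should report UTF-8 (or plain ASCII) text
  perl -pi -e 's/\r\n?/\n/g' AP.006.n196.worms_utf8.txt
  # converts Windows (\r\n) or classic Mac (\r) line breaks to Unix (\n) in place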

Advanced: If you have a text in ASCII, in one of the old, pre-Unicode legacy fonts, such as Lasercoptic, it may be possible to convert your text using one of our character converters. Check with us after the workshop.

2. Tokenizer

Put a copy of the text file in the folder housing the tokenizer (tokenize_coptic.pl).

Open a terminal window in the directory containing the tokenizer and text file.
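
If you are new to the terminal, you change into a directory with the cd command; for example, assuming you unzipped the workshop folder onto your Desktop (the path here is only an example):

  cd ~/Desktop/workshop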

Type the command perl tokenize_coptic.pl AP.006.n196.worms_utf8.txt > AP.006.n196.tokenized.txt
(Note: the basic command format is perl nameofscript.pl inputfile.txt > outputfile.txt. You need to choose a new filename for your output file so that you don't overwrite any existing file. We suggest AP.006.n196.tokenized.txt, but you can use anything.)

Advanced: There is more than one parameter for the Tokenizer. We will be using the defaults. See the documentation in our GitHub repository for the tokenizer. (E.g., manuscript transcriptions that include words broken across a line should use the -l parameter: perl tokenize_coptic.pl -l input.txt > output.txt)
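
Before moving on, you can glance at the first lines of the output in the terminal to confirm the tokenizer ran; head is a standard Unix utility:

  head AP.006.n196.tokenized.txt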

3. Import into spreadsheet

Install the Excel add-in to import your tokenized file into an Excel spreadsheet. Install the add-in appropriate for your OS; the Mac and Windows add-ins differ significantly. To install, in Excel select Tools > Add-Ins and browse to the appropriate add-in on your computer to enable it.

Windows: run the add-in (see documentation)

Mac: Open the tokenized text file. Select and copy all. Paste the contents into the first column in an empty Excel spreadsheet. Run the Import add-in.

Check the tokenization and the bound groups to make sure they are correct. [Note: in our file, there will probably be errors in rows 2, 15, and 17, and a few others.]

4. Normalize text

In your spreadsheet, select the column labeled “orig”. “orig” is short for “original,” signifying the original text transcription. Copy.

Paste into a new text file and save it in the folder where the normalizer (auto_norm.pl) is located. Mac users: save with Windows carriage returns.
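
If your text editor cannot switch line-break styles, a Perl one-liner can do the conversion; this is a sketch assuming the file currently has Unix line breaks (substitute your own filename):

  perl -pi -e 's/\n/\r\n/' nameofyourorigtextfile.txt
  # rewrites the file in place with Windows (\r\n) line breaks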

Open a terminal window in the directory containing the auto_norm script, or change the directory of your terminal window if necessary.

Run the command perl auto_norm.pl nameofyourorigtextfile.txt > AP.006.nau196.norm.txt.

Open the new file (AP.006.nau196.norm.txt). Copy all the text. Paste it into a new column in your Excel file; label that column “norm”.

Note: the sample text is fairly “clean” since it is from an edition, not a manuscript, so there will not be many – if any – changes in the demo. Type in a random Coptic supralinear stroke, nomen sacrum, or under-dot to see fuller functionality.

Advanced: At this point, you can use a macro to make normalized bound groups. Check with us for instructions after the workshop if you want to do this.

5. Tag for part-of-speech

To tag for part of speech, you will need the Tree-tagger package and our training corpus.

Copy the “norm” column into a text file. Save the file into the “bin” folder in the Tree-tagger directory. Make sure it is in UTF-8. Mac users: save with Unix carriage returns.

Make sure the coptic_fine.par file from the larger tree-tagger folder is also in the “bin” folder.
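
From a terminal already in the “bin” folder, this is a one-line copy; the source path below is only an example, so use the actual location of the .par file on your machine:

  cp ../coptic_fine.par .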

Open a terminal window in that directory or change the directory of your terminal window.

Run the tree-tagger program: enter ./tree-tagger coptic_fine.par -token AP.006.nau196.norm.txt AP.006.nau196.tagged.txt
(Note: the basic command format is ./tree-tagger coptic_fine.par -parameter inputfile.txt outputfile.txt. More information on the parameters for the tagger is in the documentation on our GitHub repository. For the output file, we suggest AP.006.nau196.tagged.txt, but you can use anything.)
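
Because you will paste these files into spreadsheet columns side by side, it is worth confirming that the tagger produced one output line per input token before importing; for example:

  wc -l AP.006.nau196.norm.txt AP.006.nau196.tagged.txt
  # the two line counts should match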

Open the output file. Select all. Copy and paste into Excel. Delete your extra norm column. Rename the part-of-speech column “pos.”

Advanced: We are showing the simplest method for tagging for part of speech. The workflow requires a few more steps if you are working with a diplomatic edition of a manuscript (a manuscript transcription), or with a text that has compound words or morphemes such as mnt or ref segmented. Contact us in these cases.

6. Tag for language of origin

The language tagger annotates loan words for language of origin (Greek, Hebrew, Aramaic, Latin, etc.).
Note: While we recognize that some of these words (especially in biblical texts) are sometimes called Greco-Hebrew, we annotate for the earliest language of origin. Deciding between Hebrew and Greco-Hebrew, Aramaic and Greco-Aramaic, etc., leads to discrepancies between editors or annotators. Researchers searching for loanwords in ANNIS may wish to search for multiple languages in order to find all the “hits” they need.

To tag text for language of origin you will need two files: the lexicon file for language of origin (lexicon.txt), and the script (_enrich.pl). These files need to be in the same directory.
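
You can confirm both files are in place from a terminal in that directory:

  ls lexicon.txt _enrich.pl
  # both names should be listed without a “No such file” error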

Move the normalized text file into the folder with the language of origin tagger.
Mac users: Be sure the file is UTF-8 with Unix carriage returns.

Open a terminal window in that directory or change the directory of your terminal window.

Type the command perl _enrich.pl AP.006.nau196.norm.txt > AP.006.nau196.lang.txt.

Open the new file. Select all and copy. Paste into the Excel file. Delete the extra norm column. Name the new column “lang.”

7. Additional annotation layers

You may wish to provide additional annotation layers. Examples include annotations for manuscript information (line breaks, column breaks, page breaks, color of ink, etc.), or segmenting words with morphs such as mnt or ref, or segmenting compound words. Please contact us after the workshop for more details.

8. Proofread

Proofread the annotation layers for human or machine errors in the Excel file.
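
One machine-assisted check: if you export the sheet as tab-delimited text (a hypothetical export.txt below), a quick terminal command can flag rows with a missing or extra cell; adjust the expected column count to match your sheet:

  awk -F'\t' 'NF != 4 {print NR ": " NF " columns"}' export.txt
  # lists the row number and column count for any row that does not have 4 columns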

9. Metadata

Click on the second sheet in the Excel file. Add metadata according to the standards described on our data model documentation (http://corpling.uis.georgetown.edu/wiki/doku.php?id=annotation_layer_names).

Name this sheet “meta”.

10. Update SCRIPTORIUM

Help us update our tools and technology. Tell us how it went. For example, please let us know if the tokenizer or language tagger is missing items, or if it is tagging incorrectly.

