This is an old revision of the document!


Table of Contents

Normalization

SCRIPTORIUM Tokenizer

  • Duplicate the “orig” column (the segmented morphemes and punctuation) and copy its contents into a blank text file. Save that file in the folder where the normalizer script, auto_norm.pl, is located. Mac users: save the file with Windows carriage returns. Then run the normalizing script on the file, following the steps after this item, to produce a normalized version of the text, and copy the result back into the spreadsheet.
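The note for Mac users can also be handled outside the text editor. The following is a minimal sketch in Python and is not part of the SCRIPTORIUM tools: the function names to_crlf and convert_file are hypothetical, and the sketch assumes the text file is saved as UTF-8. It rewrites a file so that every line ends with a Windows carriage return plus line feed.

```python
def to_crlf(text: str) -> str:
    """Return text with every line ending converted to CRLF."""
    # Unify existing CRLF and bare CR endings to LF first so nothing is doubled.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", "\r\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, writing Windows line endings.

    Assumption: the file is UTF-8 encoded.
    """
    # newline="" turns off Python's own newline translation on read and write.
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(to_crlf(text))
```

A call such as convert_file("nameofyourorigtextfile.txt", "orig_crlf.txt") would write a converted copy (the second file name is only an illustration) and leave the original file untouched.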

Open a terminal window in the directory containing the auto_norm.pl script, or change your terminal's working directory to that directory if necessary.

Run the following command, replacing nameofyourorigtextfile.txt with the name of your own text file: perl auto_norm.pl nameofyourorigtextfile.txt > AP.006.nau196.norm.txt

Open the new file (AP.006.nau196.norm.txt). Copy all the text. Paste it into a new column in your Excel file; label that column “norm”.

[Note: If you haven’t been trained in using the normalizing script, then Schroeder or Zeldes will also return to you an Excel file with a normalized layer of text.] Proofread the normalization, making a note of errors. [Google Refine can help with this, especially if you are normalizing several chapters at once. See https://docs.google.com/document/d/1Ddks6p-EmvWYqH5PgYjrxkvYn1BlCMtMD-uhNhLrZy8/edit?usp=sharing ]
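Proofreading can be narrowed to the rows where the two layers disagree. The sketch below is one possible helper, not a SCRIPTORIUM script: the function name differing_rows is hypothetical, and it assumes the “orig” and “norm” columns have been read into two Python lists with one token per row, in the same order. Row numbers count data rows from 1 and do not account for a header row.

```python
def differing_rows(orig, norm):
    """Return (row_number, orig_token, norm_token) for every row that differs.

    Assumption: orig and norm are lists of strings, one token per
    spreadsheet row, aligned row for row.
    """
    if len(orig) != len(norm):
        # A length mismatch means the columns are misaligned, so a
        # row-by-row comparison would be misleading.
        raise ValueError(
            f"column lengths differ: {len(orig)} orig rows, {len(norm)} norm rows"
        )
    return [
        (row, o, n)
        for row, (o, n) in enumerate(zip(orig, norm), start=1)
        if o != n
    ]
```

The returned list only points to rows that changed. Whether each change is correct still has to be judged by a reader, and unchanged rows may still contain errors the normalizer missed.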

  • When normalizing and cleaning the normalized data, you will need to account for morphs and compound words. Morphs include mnt, at, ref, and the verbal auxiliary r (only when r is bound to a noun with no article). (See 4.4 of the transcription guidelines.) The entire word (e.g., ⲙⲛⲧⲁⲧⲥⲱⲧⲙ or ⲣⲉϥⲣⲛⲟⲃⲉ) should appear in both the orig layer and the norm layer. In the orig layer, the word keeps all the original orthography and, if the text is not from Sahidica (e.g., a diplomatic text or another edition), supralinear strokes and similar marks. In the norm layer, the same word appears with regularized spelling and no supralinear strokes. With Sahidica data, the norm layer will usually be identical to the orig layer; in a diplomatic transcription, the content of the two layers will usually differ.
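The rule that the norm layer carries no supralinear strokes can be checked mechanically. The sketch below is only an illustration under a stated assumption: that supralinear strokes are encoded as Unicode combining characters (for example U+0305 COMBINING OVERLINE), which this page does not specify. The function names are hypothetical. Removing the marks does not regularize spelling, so this is a check on the norm layer, not a replacement for the normalizer.

```python
import unicodedata


def has_combining_marks(token: str) -> bool:
    """Return True if the token contains any Unicode combining character.

    Assumption: supralinear strokes are encoded as combining characters.
    """
    return any(unicodedata.combining(ch) != 0 for ch in token)


def strip_combining_marks(token: str) -> str:
    """Return the token with all combining characters removed."""
    return "".join(ch for ch in token if unicodedata.combining(ch) == 0)


def norm_tokens_with_marks(norm):
    """Return (row_number, token) for norm-layer tokens that still carry marks."""
    return [
        (row, tok)
        for row, tok in enumerate(norm, start=1)
        if has_combining_marks(tok)
    ]
```

Any row reported by norm_tokens_with_marks is a candidate for correction in the norm column, while the same token in the orig column should keep its strokes.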
normalization.1441800775.txt.gz · Last modified: 2015/09/09 06:12 by ctschroeder