Normalization

SCRIPTORIUM Tokenizer

  • Duplicate the “orig” column (the segmented morphemes and punctuation) and copy it into a blank text file. Save that file in the folder where the normalizer script, auto_norm.pl, is located. Mac users: save with Windows carriage returns. You will run the normalizing script on this file to produce a normalized version of the text, then copy the result back into the spreadsheet.

  • Open a terminal window in the directory containing the auto_norm.pl script, or change the directory of your terminal window if necessary.

  • Run the command: perl auto_norm.pl nameofyourorigtextfile.txt > AP.006.nau196.norm.txt

  • Open the new file (AP.006.nau196.norm.txt). Copy all the text. Paste it into a new column in your Excel file; label that column “norm”.

  • [Note: If you have not been trained in using the normalizing script, Schroeder or Zeldes will return an Excel file to you that already includes a normalized layer of text.] Proofread the normalization, making a note of errors. [Google Refine can help with this, especially if you are normalizing several chapters at once. See https://docs.google.com/document/d/1Ddks6p-EmvWYqH5PgYjrxkvYn1BlCMtMD-uhNhLrZy8/edit?usp=sharing ]

  • When normalizing and cleaning the normalized data, you will need to account for morphs and compound words. Morphs include mnt, at, ref, and the verbal auxiliary r (only when r is bound to a noun with no article); see 4.4 of the transcription guidelines. The entire word (e.g., ⲙⲛⲧⲁⲧⲥⲱⲧⲙ or ⲣⲉϥⲣⲛⲟⲃⲉ) should appear in both the orig layer and the norm layer. In the orig layer it keeps all the original orthography (e.g., in a diplomatic manuscript transcription), including supralinear strokes and the like; in the norm layer the same word appears with regularized spelling and no supralinear strokes or other diacritics.
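
The file-handling and proofreading steps above can be partly scripted. The following Python sketch is not part of the SCRIPTORIUM tools; it is an illustration under stated assumptions: that the text file holds one "orig" cell per line, that auto_norm.pl writes one output line per input line, and that leftover supralinear strokes or other diacritics appear as Unicode combining marks. All function names are hypothetical.

```python
import unicodedata
from pathlib import Path


def write_orig_file(tokens, path):
    """Write one token per line using Windows (CRLF) line endings."""
    # newline="\r\n" makes Python translate every "\n" written into "\r\n".
    with open(path, "w", encoding="utf-8", newline="\r\n") as handle:
        for token in tokens:
            handle.write(token + "\n")


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without line endings."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def has_combining_marks(token):
    """True if the token still carries a combining mark (e.g. a supralinear stroke)."""
    return any(unicodedata.combining(ch) for ch in token)


def check_norm(orig_tokens, norm_tokens):
    """Return a list of human-readable problems found in the norm column."""
    problems = []
    # Assumption: the norm column should have exactly one row per orig row.
    if len(orig_tokens) != len(norm_tokens):
        problems.append(
            f"row count differs: {len(orig_tokens)} orig vs {len(norm_tokens)} norm"
        )
    # The norm layer should contain no supralinear strokes or other diacritics.
    for row, token in enumerate(norm_tokens, start=1):
        if has_combining_marks(token):
            problems.append(f"row {row}: combining mark left in {token!r}")
    return problems
```

One possible use: call write_orig_file on the copied "orig" cells before running perl auto_norm.pl, then pass read_lines of the input file and of the output file (for example AP.006.nau196.norm.txt) to check_norm before pasting the "norm" column into Excel. An empty list means neither check found a problem; it does not replace proofreading the normalization itself.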
normalization.txt · Last modified: 2015/09/09 06:16 by ctschroeder