Language of origin tagger
The language of origin tagger annotates loanwords for language of origin (Greek, Hebrew, Aramaic, Latin, etc.).
The language of origin tagger tags on the morph level. (Note: In Coptic SCRIPTORIUM we create a layer specifically for tagging, which we usually name the ignore:morph layer, since our morph layer of annotations contains only the words annotated on this level and omits all the words that are not further annotated on this level.)
While we recognize that some of these words (especially in biblical texts) are sometimes called Greco-Hebrew, Coptic SCRIPTORIUM annotates for the earliest language of origin. Deciding between Hebrew and Greco-Hebrew, Aramaic and Greco-Aramaic, etc., leads to more discrepancies between editors or annotators. Researchers searching for loanwords in ANNIS may wish to search for multiple languages in order to find all the “hits” they need.
To tag text for language of origin, you will need two files: the lexicon file for language of origin (lexicon.txt) and the script (_enrich.pl). Both files need to be in the same directory.
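As a rough illustration of what lexicon-based tagging involves, here is a minimal Python sketch. It is not the code of _enrich.pl. The lexicon format shown (one "word<TAB>language" entry per line), the function names, and the sample entries are all assumptions made for this example; check lexicon.txt itself for the real format.

```python
def load_lexicon(lines):
    """Build a word -> language map.

    ASSUMPTION: each line has the form 'word<TAB>language'. The real
    lexicon.txt may be laid out differently.
    """
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        word, lang = line.split("\t", 1)
        lexicon[word.strip()] = lang.strip()
    return lexicon


def tag_tokens(tokens, lexicon):
    """Pair each token with its language tag, or "" if it is not in the lexicon."""
    return [(tok, lexicon.get(tok, "")) for tok in tokens]


# Hypothetical lexicon entries and normalized tokens, for illustration only.
sample_lexicon = load_lexicon(["ⲁⲅⲅⲉⲗⲟⲥ\tGreek\n", "ⲁⲙⲏⲛ\tHebrew\n"])
tagged = tag_tokens(["ⲡ", "ⲁⲅⲅⲉⲗⲟⲥ", "ⲁⲙⲏⲛ"], sample_lexicon)

# One 'norm<TAB>lang' row per token, ready to paste into a spreadsheet.
rows = "\n".join(f"{tok}\t{lang}" for tok, lang in tagged)
```

Tokens that are not in the lexicon get an empty tag, so native words end up with a blank lang value.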
- Move the normalized text file into the folder with the language of origin tagger. Mac users: Be sure the file is UTF-8 with Unix line endings (LF).
- Open a terminal window in that directory or change the directory of your terminal window.
- Type the command perl _enrich.pl AP.006.nau196.norm.txt > AP.006.nau196.lang.txt (substitute the names of your own normalized file and output file).
- Open the new file. Select all and copy. Paste into the Excel file. Delete the extra norm column. Name the new column “lang.”
- Use Google Refine to check your language tags. You can facet search the norm layer and then click on foreign words that appear in the list. Or you can facet search the norm layer and then go to the lang layer: Facet > Customized facets > Facet by blank. Click on “false” to get all the norms that have a lang tag (i.e., false means the lang tag is not blank); click on “true” to get all the norms that don’t have a lang tag (i.e., true means the lang tag is blank).
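The first step above asks for a UTF-8 file with Unix line endings. One way to check this before running the tagger is a short Python script along the following lines. This is a sketch and not part of the tagger; the function names are made up for this example.

```python
def check_norm_bytes(data):
    """Return a list of problems found in the raw bytes of a text file."""
    problems = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append(f"not valid UTF-8 (byte offset {err.start})")
    if b"\r" in data:
        problems.append("contains carriage returns (Windows or classic Mac line endings)")
    return problems


def to_unix_newlines(data):
    """Convert CRLF and bare CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Illustrative input: two tokens with Windows and classic Mac line endings.
sample = "ⲁⲙⲏⲛ\r\nⲁⲅⲅⲉⲗⲟⲥ\r".encode("utf-8")
before = check_norm_bytes(sample)                    # reports the carriage returns
after = check_norm_bytes(to_unix_newlines(sample))   # no problems remain
```

To check a real file, read it in binary mode (for example open(path, "rb").read()) and pass the bytes to check_norm_bytes. If the file is not valid UTF-8, re-save it as UTF-8 from a text editor rather than converting the bytes by hand.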