Differences

This shows you the differences between two versions of the page.

--- annotating_sub-word_morphemes [2015/09/09 07:01] – ctschroeder
+++ annotating_sub-word_morphemes [2015/10/14 12:11] (current) – admin
@@ Line 1: / Line 1: @@
-=== Language of origin tagger ===
+=== Create a morph layer ===
-The language tagger annotates loan words for language of origin (Greek, Hebrew, Aramaic, Latin, etc.) \\ \\
+There are multiple ways of creating a morph layer.
-The language of origin tagger tags on the morph level.  (Note:  In Coptic SCRIPTORIUM we create a layer specifically for tagging, which we usually name the ignore:morph layer, since our morph layer of annotations contains only those words annotated on this level and not the words that are not further annotated on this level.)\\ \\
-While we recognize that some of these words (especially in biblical texts) are sometimes called Greco-Hebrew, Coptic SCRIPTORIUM annotates for the earliest language of origin.  Deciding between Hebrew and Greco-Hebrew, Aramaic and Greco-Aramaic, etc., leads to more discrepancies between editors or annotators.  Researchers searching for loanwords in ANNIS may wish to search for multiple languages in order to find all the "hits" they need.\\ \\
+== Follow these steps if you have no morph layer at all ==
-To tag text for language of origin you will need two files:  the [[https://github.com/CopticScriptorium/lexical-taggers/releases/tag/1.3|lexicon file]] for language of origin (lexicon.txt), and the script (_enrich.pl).  These files need to be in the same directory.\\ \\
-  *Move the normalized text file into the folder with the language of origin tagger.  Mac users:  Be sure the file is UTF-8 with Unix carriage returns.\\ \\
+Duplicate the norm layer. Name the new layer "ignore:morph".
-  *Open a terminal window in that directory or change the directory of your terminal window.\\ \\
-  *Type the command perl _enrich.pl AP.006.nau196.norm.txt > AP.006.nau196.lang.txt.\\ \\
+Manually, or [[google_refine|using Google Refine]], identify the normalized words that need to be annotated on the morph level.  (See the [[http://copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf|Transcription Guidelines]] for more information about normalized words vs. morphs.)
-Open the new file.  Select all and copy.  Paste into the Excel file.  Delete the extra norm column.  Name the new column "lang."\\ \\
-  *Use [[http://wiki.copticscriptorium.org/doku.php?id=google-refine|Google refine]] to check your language tags.      You can facet search the norm layer and then click on foreign words that appear in the list.  Or you can facet search the norm layer and then go to the lang layer:  facet>customized facets>facet by blank.  Click on “false” to get all the norms that have a lang tag (e.g., false means the lang tag is not blank); click on “true” to get all the norms that don’t have a lang tag (e.g., true means the lang tag is blank).
+Split the words you have identified into the requisite number of tokens.
+  * Be sure to break up the words in the **tok** and the **ignore:morph** layers.  (Therefore, you may need to add rows to your spreadsheet.)
+  * Ensure the norm, orig, norm_group, orig_group, pos, lemma, hi@rend, and other annotation layers stay aligned with the correct tokens.
+  * Do not break up the words in the norm or orig layers; split only the tok and ignore:morph layers.
+Consider using [[google_refine|Google Refine]] to check your work (for example, check whether compound words or words containing mnt-, at-, ref-, etc., are still unsplit in the ignore:morph layer).
+Complete the steps in the next section to create the morph layer.
+== Follow these steps if/when your file has an ignore:morph layer ==
+Many of these steps are demonstrated in this [[https://www.youtube.com/watch?v=ptexY-zaNcI|video]].
+You need to create a clean morph layer that contains only unique data.  About 80-90% of the data in ignore:morph is identical to the data in norm, and that clutter makes it difficult for a human to spot compound words or morphs.  The morph layer in our published annotated corpora therefore contains only data that differs from the word-level layers.  (Word-level layers in Coptic SCRIPTORIUM are usually named "orig" and "norm.")
+. Insert a new, empty column for the morph layer (as in the video).
+. In the first cell of data, type a conditional function that checks whether the ignore:morph cell is identical to the norm cell on that row. If they are identical, the formula leaves the cell blank; if they are not identical, morph will contain the morphemes found in ignore:morph.  The formula should look something like this:  \\
+=IF(E2=F2,"",F2)\\
+where E2 is the norm layer and F2 is the ignore:morph layer. Press "return" when you have finished typing; the cell then displays the result rather than the formula.
+. Select the cell containing the formula and extend the selection down the column to the end of the layer data. Use the “Edit>Fill>Down” menu item to fill the column with the formula.  You should now have a clean morph layer that contains morphs only on the rows where ignore:morph differs from norm.
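
For readers who prefer to script this step, the spreadsheet formula =IF(E2=F2,"",F2) can be sketched in Python as follows. This is a minimal illustration, not part of the Coptic SCRIPTORIUM tooling: the function name `clean_morph_layer` and the sample rows are hypothetical, and the layers are assumed to be already exported as two equal-length lists with one value per spreadsheet row. Note one difference: Excel's `=` comparison ignores letter case, while this sketch compares strings exactly.

```python
def clean_morph_layer(norm, ignore_morph):
    """Blank out every ignore:morph value that merely repeats norm."""
    if len(norm) != len(ignore_morph):
        raise ValueError("norm and ignore:morph must have one value per row")
    # Same decision as =IF(E2=F2,"",F2), applied row by row.
    return ["" if n == m else m for n, m in zip(norm, ignore_morph)]


# Hypothetical rows (made-up transliterated strings, not corpus data).
# The second word has been split into two tokens in ignore:morph.
norm = ["auw", "mntero", "mntero", "pe"]
ignore_morph = ["auw", "mnt", "ero", "pe"]

morph = clean_morph_layer(norm, ignore_morph)
print(morph)  # ['', 'mnt', 'ero', '']
```

The resulting list can be pasted back as the new morph column; rows whose value is an empty string correspond to the blank cells the formula produces.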