


Annotation layer names for Coptic SCRIPTORIUM

Data

This document supersedes the Google doc used previously. (Note: in the layer names, @ and _ are interchangeable; @ is converted to _ when the file is converted.)
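The note on @ and _ can be illustrated with a minimal sketch. The helper names below are hypothetical and are not part of any Coptic SCRIPTORIUM tool; the sketch only assumes that @ is the sole character rewritten during conversion.

```python
def normalize_layer_name(name: str) -> str:
    """Return the underscore form of a layer name, e.g. 'hi@rend' -> 'hi_rend'.

    Assumption: only '@' is rewritten; other characters (such as the colon
    in 'ignore:note') are left untouched.
    """
    return name.strip().replace("@", "_")


def same_layer(a: str, b: str) -> bool:
    """Treat 'lb@n' and 'lb_n' as names for the same layer."""
    return normalize_layer_name(a) == normalize_layer_name(b)
```

With this, `same_layer("lb@n", "lb_n")` is true, which is handy when comparing spreadsheet column headers against the names listed on this page.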

tok tokens, smallest possible unit to be annotated; MAY BE SMALLER THAN THE MORPHEMES IN ORIG
orig smallest unit of LANGUAGE (morpheme or word level; smaller than the bound group level); orthography is from the original text (diplomatic, edition, whatever); includes supralinear strokes and other markings from the manuscript (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf). Spans in this layer must match those in norm exactly in their length.
orig_group bound groups using the original orthography, including supralinear strokes and other markings. Spans in this layer must match those in norm_group exactly in their length.
norm_group bound groups (same structure as orig_group but with normalized spelling, etc., so content is based on norm). Spans in this layer must match those in orig_group exactly in their length.
norm normalized version of orig. Spans in this layer must match those in orig exactly in their length.
pos part of speech tags. Spans in this layer must match those in norm exactly in their length (i.e., norm units are the units that carry parts of speech).
lang language of origin tags (Hebrew, Greek, Latin, Aramaic, etc.)
morph morphs that are below the word level; this is where words containing mnt, at, ref are annotated a second time (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf sections 4.3 & 4.4). Note that morph units DO NOT receive parts of speech.
lemma lemma (dictionary head word); annotates on the normalized words (“norm” layer)
note notes that normally would go in a TEI XML <note note=“xxx”> tag
hi@rend text renderings (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf sections 4.2 & 5). Use spaces not commas between elements (e.g., red large not red, large); to render spaces, be sure to place underscores in the phrase (e.g., 1_space_right not 1 space right); validates to TEI XML only if there are five or fewer elements
gap Annotates for lacunae. Corresponds to the EpiDoc TEI-XML element gap. Uses attributes such as @reason, @unit, @quantity, and @extent. With attributes, each element+attribute annotation generates a new layer in the multi-layer data model
supplied Annotates for supplied text where text is missing from the original for a variety of reasons. Corresponds to the EpiDoc TEI-XML element supplied. Uses attributes such as @evidence and @reason. With attributes, each element+attribute annotation generates a new layer in the multi-layer data model.
lb@n line breaks – numbered according to the original manuscript
cb@n column breaks – numbered according to the original manuscript
pb_xml_id Page numbers of original manuscript (not the current repository numbering); be sure column label does not include a colon (e.g. pb_xml_id not pb_xml:id); be sure page numbers do not include spaces (e.g. EG202 not EG 202) (TEI XML <pb xml:id=“xxx”>)
ignore:note notes that will NOT be imported into ANNIS or exported as TEI or PAULA XML; private notations from annotators/encoders/editors
translation English translation; necessary for analytic view: if the text has no translation, use “…” in logical spans where a translation would go. Use small sentences or partial sentences. Must be aligned with “p” paragraph breaks and not cut across the paragraph breaks. If you use quotation marks, do not use double straight quotation marks; you can use round smart quotes or single quotes.
p paragraph breaks for translation and normalization; necessary for the normalized view. If the paragraphs are not numbered, put “p” in each span. Be sure p spans are equal to or contain translation spans; they cannot break across translation spans. Typically provide paragraph breaks for every ekthesis.
verse verse of text written as number (always use in Bible of any kind, including Sahidica)
chapter chapter of text recorded as number; currently used only in corpora in which there are canonical or disciplinary-standard chapter divisions (not a required annotation; for Bible this information is typically in the metadata, as well)
sbl_greek The Greek New Testament text. Annotation for the New Testament corpora only. Aligned by verse with the Coptic. Source is the XML Greek New Testament created by the Society of Biblical Literature and Logos Software.
sbl_apparatus The apparatus for the Greek New Testament text. Annotation for the New Testament corpora only. Aligned by verse with the Coptic and Greek New Testament text. Source is the XML Greek New Testament created by the Society of Biblical Literature and Logos Software.
div@type For EpiDoc TEI compatibility, this element needs to be present in div spans. Usually our number (n) in the div tag corresponds to the verse in the Bible corpora. In the spreadsheet file, this layer should be present and equal in size to the verse span, and every cell should contain the text “textpart” (no quotes). EpiDoc TEI conversion should produce <div type=“textpart” n=“1”> for the div wrapping verse one in any chapter of the Sahidica corpus.
vid formerly verse@id (Sahidica)
chapter@cname chapter of text written as text and number (not necessary – in other data)
chapter@cid chapter id (Sahidica; not necessary)
verse@vname verse of text written as text and number (e.g. 1 Corinthians 1:10) (not necessary – in other data)
add_place
vid_n
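The span constraints stated in the list above (orig and norm, orig_group and norm_group, pos and norm must match exactly; p spans must equal or contain translation spans) can be checked mechanically. The following is a minimal sketch under stated assumptions: each layer is represented as a list of (first token index, last token index) pairs, which is a hypothetical data model chosen for illustration and not the format of the project's actual spreadsheets or converters.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (first token index, last token index), inclusive

# Layer pairs whose spans must coincide, per the descriptions above.
ALIGNED_PAIRS = [("orig", "norm"), ("orig_group", "norm_group"), ("pos", "norm")]


def misaligned_spans(
    layers: Dict[str, List[Span]]
) -> Dict[Tuple[str, str], List[Span]]:
    """For each constrained pair present, list spans found in only one layer."""
    problems = {}
    for a, b in ALIGNED_PAIRS:
        if a not in layers or b not in layers:
            continue  # skip pairs this document does not annotate
        diff = sorted(set(layers[a]) ^ set(layers[b]))
        if diff:
            problems[(a, b)] = diff
    return problems


def translations_outside_paragraphs(
    p_spans: List[Span], translation_spans: List[Span]
) -> List[Span]:
    """Return translation spans not fully contained in a single p span."""
    return [
        (ts, te)
        for ts, te in translation_spans
        if not any(ps <= ts and te <= pe for ps, pe in p_spans)
    ]
```

An empty result from both functions means the checked constraints hold; any span returned points at a cell range to inspect in the spreadsheet.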

Preferred order of layers

tok orig orig_group norm_group norm pos morph lang translation lb_n cb_n pb_xml_id p hi_rend supplied_reason ignore:note
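As an illustration of applying the preferred order, here is a minimal sketch that sorts a set of column names accordingly. The function name is hypothetical, and placing layers that are not in the list at the end (in their existing order) is an assumption; this page does not say where they belong.

```python
from typing import Iterable, List

# The preferred order given above, in underscore form.
PREFERRED_ORDER = [
    "tok", "orig", "orig_group", "norm_group", "norm", "pos", "morph", "lang",
    "translation", "lb_n", "cb_n", "pb_xml_id", "p", "hi_rend",
    "supplied_reason", "ignore:note",
]


def sort_layers(names: Iterable[str]) -> List[str]:
    """Sort layer names by the preferred order; unlisted layers go last."""
    rank = {name: i for i, name in enumerate(PREFERRED_ORDER)}

    def key(name: str) -> int:
        # '@' and '_' are interchangeable in layer names.
        return rank.get(name.replace("@", "_"), len(PREFERRED_ORDER))

    # sorted() is stable, so unlisted layers keep their relative order.
    return sorted(names, key=key)
```

For example, `sort_layers(["pos", "lb@n", "tok", "lemma", "norm"])` puts tok first, then norm, pos and lb@n, and leaves lemma at the end because it is not in the preferred list.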

Metadata

(Note: see also the corpus-level metadata documentation for adding metadata for the entire corpus.)

corpus
Coptic_edition if the text has been published before, include publication information here
Greek_source optional, information about the Greek version of the text if it exists (e.g., Greek Alphabetical or Systematic Apophthegmata Patrum)
title title of this document (unique)
msItem_title the name or title of the conceptual work, e.g. Abraham Our Father, To Thieving Nuns; in the TEI export, if this is not available, the main title of the document is used (metadata field “title”)
author author of the conceptual work
language language in which the text is written
annotation names of annotators (transcribers, editors, annotators) in comma delimited sequence
project name of project supporting the transcription/annotation/publication (e.g., Coptic SCRIPTORIUM, KoMET, etc.)
translation use “none” if no translation; if English translation published by Coptic SCRIPTORIUM then name(s) of translator(s) inserted here in comma delimited sequence
msName use CMCL code (e.g., MONB.YA); optional but must use msName, pages_from, pages_to all three or none at all
pages_from beginning of page sequence of document (original page number of scribe but written in Arabic numerals); optional but must use msName, pages_from, pages_to all three or none at all
pages_to optional but must use msName, pages_from, pages_to all three or none at all
msContents_title@type used for things like Shenoute's Canons or Discourses; optional
msContents_title@n volume number of the thing in msContents_title@type; this field is output to the TEI XML only if msContents_title@type also has data
repository museum, library, etc. where the manuscript currently resides
collection collection or department in the current repository
idno catalogue # of the manuscript in the current repository
version@n version of this Coptic SCRIPTORIUM data
version@date version date of this Coptic SCRIPTORIUM data in YYYY-MM-DD format; format the cell as text, not as date, in Excel
source_info
license use for copyright in Sahidica, CC-BY for everything else. If using CC-BY, enter <a href='https://creativecommons.org/licenses/by/4.0/'>CC-BY 4.0</a> to produce a link
document_cts_urn urn that applies to the document following data model created by Bridget Almas
Trismegistos [enter the trismegistos # if it exists/is known for the manuscript]; optional but should be included if a TM number exists
objectType codex, papyrus, ostracon, etc; optional
country country of origin of the text object; optional
placeName city or village or place name of the original location of the text object; should be recognizable name in gazetteer (note, see TEI guidelines and EpiDoc for how this can be linked to Pleiades); optional
origPlace name of the place of the text object, not necessarily city/town/village name (e.g., White Monastery); optional
origDate prose about the date of the text object (e.g., Between 900 and 1200 C.E.); optional
origDate_precision likelihood that the dating is accurate; usually “low”, “medium”, or “high”; optional
origDate_notBefore date of the terminus post quem (in four digits with leading zeros, e.g., 0900); be sure the format of the cell in Excel is text not number or date; optional
origDate_notAfter date of the terminus ante quem (in four digits with leading zeros, e.g., 1200); be sure the format of the cell in Excel is text not number or date; optional
source if the digitized text comes from another source, the editors of that source are listed here (used for Sahidica and other donated texts); optional
note optional
witness prose note about parallels; optional
redundant required: “yes” or “no” – tied to parallels: yes = file marked redundant (a parallel is the primary witness); no=this file is the primary witness (whether or not it has a parallel); any file with NO parallel witness should be marked redundant=no
previous contains the CTS urn for the previous document in the corpus; optional
next contains the CTS urn for the next document in the corpus; optional
endnote contains a note about the document that will appear in the HTML visualizations at the bottom of the visualization; optional
order contains a number that orders the documents in a list, with the preferred first document numbered 01. For this project, the list should begin in the same order dictated by the next and previous metadata fields. After this list of documents (which almost always contains only non-redundant documents; see the “redundant” metadatum above), the redundant parallel witness documents appear (if any). These redundant parallel witnesses should also be numbered according to the order of the running text of the conceptual work(s), for ease of reading and search. Optional
parsing describes whether parsing has been reviewed. Available values are automatic, checked, or gold; required
segmentation describes whether segmentation and tokenization have been reviewed. Available values are automatic, checked, or gold; required
tagging describes whether tagging has been reviewed. Available values are automatic, checked, or gold; required
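Several of the metadata rules above are mechanical enough to check before import. The following is a minimal sketch, assuming the metadata for one document has been read into a plain dictionary of strings; the function name, the dictionary representation, and the error wording are hypothetical, and the date check only tests the YYYY-MM-DD shape, not calendar validity.

```python
import re
from typing import Dict, List

REVIEW_FIELDS = ("parsing", "segmentation", "tagging")
REVIEW_VALUES = {"automatic", "checked", "gold"}
MS_FIELDS = ("msName", "pages_from", "pages_to")


def check_metadata(raw: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one document's metadata."""
    # '@' and '_' are interchangeable in names, so compare underscore forms.
    meta = {key.replace("@", "_"): value for key, value in raw.items()}
    errors = []

    # Required fields with a closed set of values.
    for field in REVIEW_FIELDS:
        if meta.get(field) not in REVIEW_VALUES:
            errors.append(f"{field}: expected automatic, checked, or gold")
    if meta.get("redundant") not in {"yes", "no"}:
        errors.append("redundant: expected yes or no")

    # msName, pages_from, pages_to: all three or none at all.
    present = [f for f in MS_FIELDS if meta.get(f)]
    if present and len(present) != len(MS_FIELDS):
        errors.append("msName, pages_from, pages_to: use all three or none")

    # version date must look like YYYY-MM-DD (shape only).
    date = meta.get("version_date")
    if date and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        errors.append("version_date: expected YYYY-MM-DD")

    # Date bounds are four digits with leading zeros, e.g. 0900.
    for field in ("origDate_notBefore", "origDate_notAfter"):
        value = meta.get(field)
        if value and not re.fullmatch(r"\d{4}", value):
            errors.append(f"{field}: expected four digits, e.g. 0900")

    # The volume number is only exported when the type is also given.
    if meta.get("msContents_title_n") and not meta.get("msContents_title_type"):
        errors.append("msContents_title_n: requires msContents_title_type")

    return errors
```

A value such as 900 in origDate_notBefore, which Excel produces when the cell is formatted as a number, is reported by this check; that is the situation the "format the cell as text" instructions above are meant to prevent.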
annotation_layer_names.1539906953.txt.gz · Last modified: 2018/10/18 17:55 by admin