Annotation layer names for Coptic SCRIPTORIUM

Data

This document supersedes the Google doc used previously. (Note: in the layer names, @ and _ are interchangeable; @ is replaced with _ when the file is converted.)
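The @/_ equivalence described in the note can be sketched as a one-line helper. This is only an illustration: the function name normalize_layer_name is hypothetical and does not come from any Coptic SCRIPTORIUM tool, and it assumes the conversion is a plain character replacement.

```python
def normalize_layer_name(name: str) -> str:
    """Return the column-header form of a layer name ('@' replaced by '_')."""
    return name.replace("@", "_")


# Layer names taken from the list in this document.
for layer in ["hi@rend", "lb@n", "cb@n", "div@type"]:
    print(layer, "->", normalize_layer_name(layer))
```

Names that already use the underscore form (e.g. lb_n) pass through unchanged.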

tok tokens, smallest possible unit to be annotated; MAY BE SMALLER THAN THE MORPHEMES IN ORIG
orig smallest unit of LANGUAGE (morpheme or word level; smaller than the bound group level); orthography is from the original text (diplomatic, edition, whatever); includes supralinear strokes and other markings from the manuscript (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf). Spans in this layer must match those in norm exactly in their length.
orig_group bound groups using the original orthography, including supralinear strokes and other markings. Spans in this layer must match those in norm_group exactly in their length.
norm_group bound groups (same structure as orig_group but with normalized spelling, etc., so content is based on norm). Spans in this layer must match those in orig_group exactly in their length.
norm normalized version of orig. Spans in this layer must match those in orig exactly in their length.
pos part of speech tags. Spans in this layer must match those in norm exactly in their length (i.e., norm units are the units that carry parts of speech).
lang language of origin tags (Hebrew, Greek, Latin, Aramaic, etc.)
morph morphs that are below the word level – this is where words containing mnt, at, ref are annotated a second time (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf sections 4.3 & 4.4). Note that morph units DO NOT receive parts of speech.
lemma lemma (dictionary head word); annotates on the normalized words (“norm” layer)
note notes that normally would go in a TEI XML <note note=“xxx”> tag
hi@rend usually appears as hi_rend in the column name in spreadsheet mode; for text renderings (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf sections 4.2 & 5). Use spaces not commas between elements (e.g., red large not red, large); to render spaces, be sure to place underscores in the phrase (e.g., 1_space_right not 1 space right); validates to TEI XML only if there are five or fewer elements
gap Annotates for lacunae. Corresponds to the EpiDoc TEI-XML element gap. Uses attributes such as @reason, @unit, @quantity, and @extent. With attributes, each element+attribute annotation generates a new layer in the multi-layer data model.
supplied Annotates for supplied text where text is missing from the original for a variety of reasons. Corresponds to the EpiDoc TEI-XML element supplied. Uses attributes such as @evidence and @reason. With attributes, each element+attribute annotation generates a new layer in the multi-layer data model.
lb@n usually appears as lb_n in column header in spreadsheet mode; line breaks – numbered according to the original manuscript
cb@n usually appears as cb_n in column header in spreadsheet mode; column breaks – numbered according to the original manuscript
pb_xml_id Page numbers of original manuscript (not the current repository numbering); be sure column label does not include a colon (e.g. pb_xml_id not pb_xml:id); be sure page numbers do not include spaces (e.g. EG202 not EG 202) (TEI XML <pb xml:id=“xxx”>)
ignore:note notes that will NOT be imported into ANNIS or exported as TEI or PAULA XML; private notations from annotators/encoders/editors
translation English translation; necessary for analytic view: if the text has no translation, use “…” in logical spans where a translation would go. Use small sentences or partial sentences. Must be aligned with “p” paragraph breaks and not cut across the paragraph breaks. If you use quotation marks, do not use double straight quotation marks; you can use round smart quotes or single quotes.
p paragraph breaks for translation and normalization; necessary for the normalized view. If the paragraphs are not numbered, put “p” in each span. Be sure p spans are equal to or contain translation spans; they cannot break across translation spans. Typically provide paragraph breaks for every ekthesis.
verse_n verse of text written as number (always use in Bible of any kind, including Sahidica)
chapter_n chapter of text recorded as number; currently used only in corpora in which there are canonical or disciplinary-standard chapter divisions (not a required annotation; for Bible this information is typically in the metadata, as well)
sbl_greek The Greek New Testament text. Annotation for the New Testament corpora only. Aligned by verse with the Coptic. Source is the XML Greek New Testament created by the Society of Biblical Literature and Logos Software.
sbl_apparatus The apparatus for the Greek New Testament text. Annotation for the New Testament corpora only. Aligned by verse with the Coptic and Greek New Testament text. Source is the XML Greek New Testament created by the Society of Biblical Literature and Logos Software.
div@type For EpiDoc TEI compatibility, this element needs to be present in div spans. Usually our number (n) in the div tag corresponds to the verse in the Bible corpora. In the spreadsheet file, this layer should be present and equal in size to the verse span, and every cell should contain the text “textpart” (no quotes). EpiDoc TEI conversion should produce <div type=“textpart” n=“1”> for the div wrapping verse one in any chapter of the Sahidica corpus.
vid formerly verse@id (Sahidica)
chapter@cname chapter of text written as text and number (not necessary – in other data)
chapter@cid chapter id (Sahidica – not necessary)
verse@vname verse of text written as text and number (e.g. 1 Corinthians 1:10) (not necessary – in other data)
add_place
vid_n CTS URN for the verse (e.g., urn:cts:copticLit:johannes.canons.monbfa:1.1)
ed_page_n page number of a text as it appears in an edition
ed_line_n line number of a text as it appears in an edition
entity one of the ten entity types (e.g. person, place) see entity guidelines. Note that there can be multiple entity columns due to nested entities. These columns are typically not edited manually in the spreadsheet, but are added by the graphical entity editing interface in entities mode.
identity this annotation stores linked entry identifiers for named entities; it is populated automatically during export by GitDox if named entities have been added using the entity annotation interface. Annotators do not need to add this column manually.
arabic Arabic translation. Spans should follow translation and verse layers
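The span constraints stated in the layer descriptions above (orig/norm, orig_group/norm_group, and pos/norm must match exactly; p spans must contain translation spans) can be checked mechanically. The following is a minimal sketch, not the project's own validator. It assumes a hypothetical in-memory representation in which each layer is a list of (first token index, last token index) pairs; all function names are illustrative.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (first token index, last token index), inclusive

# Layer pairs whose spans must match exactly, per the layer descriptions.
PAIRED_LAYERS = [("orig", "norm"), ("orig_group", "norm_group"), ("pos", "norm")]


def mismatched_spans(layers: Dict[str, List[Span]], a: str, b: str) -> List[Span]:
    """Return the spans that occur in only one of the two layers, sorted."""
    return sorted(set(layers.get(a, [])) ^ set(layers.get(b, [])))


def translations_outside_paragraphs(layers: Dict[str, List[Span]]) -> List[Span]:
    """Return translation spans that are not fully contained in any p span."""
    paragraphs = layers.get("p", [])
    return [
        (start, end)
        for start, end in layers.get("translation", [])
        if not any(p_start <= start and end <= p_end for p_start, p_end in paragraphs)
    ]


def check_document(layers: Dict[str, List[Span]]) -> List[str]:
    """Collect human-readable messages for every violated constraint."""
    problems = []
    for a, b in PAIRED_LAYERS:
        for span in mismatched_spans(layers, a, b):
            problems.append(f"{a}/{b}: span {span} is not present in both layers")
    for span in translations_outside_paragraphs(layers):
        problems.append(f"translation: span {span} is not contained in a p span")
    return problems


# Hypothetical document: orig merges tokens 1-2 while norm splits them,
# and the second translation span runs past the end of the only paragraph.
example = {
    "orig": [(0, 0), (1, 2)],
    "norm": [(0, 0), (1, 1), (2, 2)],
    "orig_group": [(0, 2)],
    "norm_group": [(0, 2)],
    "pos": [(0, 0), (1, 1), (2, 2)],
    "p": [(0, 2)],
    "translation": [(0, 1), (2, 3)],
}

for message in check_document(example):
    print(message)
```

In this sketch a missing layer is treated as an empty one, so every span of its partner layer is reported. Whether that is the desired behavior depends on which layers a given corpus requires.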

NOT columns in the spreadsheet

The following information should NOT be annotated manually in the spreadsheet, as it is added by other processes:

Metadata

(Note: see also the corpus-level metadata documentation for adding metadata for the entire corpus.)

Download a checklist of the following fields (does not yet include arabic_translation and possibly other newer fields; the list below is the most accurate)

annotation names of annotators (transcribers, editors, annotators) in comma-delimited sequence
arabic_translation names of people who translated the text into Arabic in comma-delimited sequence
attributed_author optional. attributed author of a conceptual work who may or may not be the historical author
author author of the conceptual work
collection collection or department in the current repository
Coptic_edition if the text has been published before, include publication information here
copyist optional. copyist or scribe of the text on the manuscript or text-bearing object
corpus the corpus name in ANNIS (in the Coptic SCRIPTORIUM corpus architecture, e.g. shenoute.abraham)
country country of origin of the text object; optional
document_cts_urn urn that applies to the document following data model created by Bridget Almas
endnote contains a note about the document that will appear in the HTML visualizations at the bottom of the visualization; optional
entities describes whether entity annotation has been reviewed. Available values are automatic, checked, or gold; required
Greek_source optional, information about the Greek version of the text if it exists (e.g., Greek Alphabetical or Systematic Apophthegmata Patrum)
identities describes whether named entity linking has been reviewed. Available values are automatic, checked, or gold; required
idno catalogue # of the manuscript in the current repository
language language in which the text is written
license use for copyright in Sahidica, CC-BY for everything else. If using CC-BY, enter <a href='https://creativecommons.org/licenses/by/4.0/'>CC-BY 4.0</a> to produce a link
msContents_title@n volume number of the item named in msContents_title@type; this field is output to the TEI XML only if msContents_title@type also has data
msContents_title@type used for things like Shenoute's Canons or Discourses; optional
msItem_title the name or title of the conceptual work, e.g. Abraham Our Father, To Thieving Nuns; in the TEI export, if this is not available, the main title of the document is used (metadata field “title”)
msName use CMCL code (e.g., MONB.YA); optional but must use msName, pages_from, pages_to all three or none at all
next contains the CTS urn for the next document in the corpus; optional
note optional
objectType codex, papyrus, ostracon, etc; optional
order contains a number that orders the documents in a list, with the preferred first document numbered 01. For this project, the list should begin by corresponding to the same order dictated by the next and previous metadata fields. After this list of documents (which almost always contains only non-redundant documents; see the “redundant” metadatum below), the redundant parallel witness documents appear (if any). These redundant parallel witnesses also should be numbered according to the order of the running text of the conceptual work(s), for ease of reading and search; optional
origDate prose about the date of the text object (e.g., Between 900 and 1200 C.E.); optional
origDate_notAfter date of the terminus ante quem (in four digits with leading zeros, e.g., 1200); be sure the format of the cell in Excel is text not number or date; optional
origDate_notBefore date of the terminus post quem (in four digits with leading zeros, e.g., 0900); be sure the format of the cell in Excel is text not number or date; optional
origDate_precision likelihood that the dating is accurate – usually “low”, “medium”, or “high”; optional
origPlace name of the place of the text object, not necessarily city/town/village name (e.g., White Monastery); optional
pages_from beginning of page sequence of document (original page number of scribe but written in Arabic numerals); optional but must use msName, pages_from, pages_to all three or none at all
pages_to optional but must use msName, pages_from, pages_to all three or none at all
parsing describes whether parsing has been reviewed. Available values are automatic, checked, or gold; required
paths_authors PATHs project stable id from PATHs project for the author entered as a link, e.g. <a href='http://paths.uniroma1.it/atlas/authors/80'>80</a>. Optional but preferred if available.
paths_manuscripts PATHs project Coptic Literary Manuscript stable id entered as a link, e.g. <a href='http://paths.uniroma1.it/atlas/manuscripts/195'>195</a>; optional but preferred if available
paths_works PATHs Project stable ID for conceptual works recorded and identified by the Clavis Coptica (CC). CC entries are composed of a 4-digit number preceded by CC, for example, CC 0599; PATHs stable ids contain only the number, e.g., <a href='http://paths.uniroma1.it/atlas/works/246'>246</a>; optional but preferred if available
placeName city or village or place name of the original location of the text object; should be recognizable name in gazetteer (note, see TEI guidelines and EpiDoc for how this can be linked to Pleiades); optional
previous contains the CTS urn for the previous document in the corpus; optional
project name of project supporting the transcription/annotation/publication (e.g., Coptic SCRIPTORIUM, KoMET, etc.)
redundant required: “yes” or “no” – tied to parallels: yes = file marked redundant (a parallel is the primary witness); no=this file is the primary witness (whether or not it has a parallel); any file with NO parallel witness should be marked redundant=no
repository museum, library, etc. where the manuscript currently resides
segmentation describes whether segmentation and tokenization have been reviewed. Available values are automatic, checked, or gold; required
source if the digitized text comes from another source, the editors of that source are listed here (used for Sahidica and other donated texts); optional
source_info
tagging describes whether tagging has been reviewed. Available values are automatic, checked, or gold; required
title title of this document (unique)
translation use “none” if no translation; if English translation published by Coptic SCRIPTORIUM then name(s) of translator(s) inserted here in comma delimited sequence
Trismegistos [enter the trismegistos # if it exists/is known for the manuscript]; optional but should be included if a TM number exists
version@date version date of this Coptic SCRIPTORIUM data in YYYY-MM-DD format; format the cell as text, not as date, in Excel
version@n version of this Coptic SCRIPTORIUM data
witness prose note about parallels; optional
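Several of the metadata rules above are mechanical enough to check automatically: the five review-status fields accept only automatic, checked, or gold; redundant is yes or no; the origDate bounds are four digits; version@date is YYYY-MM-DD; msName, pages_from, and pages_to are used together or not at all; and msContents_title@n needs msContents_title@type. The following is a minimal sketch under the assumption that a document's metadata is available as a plain dictionary of strings. The function name and the sample values are hypothetical.

```python
import re
from typing import Dict, List

REVIEW_FIELDS = ["entities", "identities", "parsing", "segmentation", "tagging"]
REVIEW_VALUES = {"automatic", "checked", "gold"}
YEAR_FIELDS = ["origDate_notBefore", "origDate_notAfter"]
PAGE_GROUP = ["msName", "pages_from", "pages_to"]


def check_metadata(meta: Dict[str, str]) -> List[str]:
    """Return a list of messages for metadata rules that are violated."""
    problems = []
    # Required review-status fields with a closed set of values.
    for field in REVIEW_FIELDS:
        if meta.get(field) not in REVIEW_VALUES:
            problems.append(f"{field}: expected automatic, checked, or gold")
    if meta.get("redundant") not in {"yes", "no"}:
        problems.append("redundant: expected yes or no")
    # Optional year bounds: exactly four digits, leading zeros included.
    for field in YEAR_FIELDS:
        value = meta.get(field)
        if value and not re.fullmatch(r"\d{4}", value):
            problems.append(f"{field}: expected four digits, e.g. 0900")
    # Version date is only checked for format when present.
    date = meta.get("version@date")
    if date and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        problems.append("version@date: expected YYYY-MM-DD")
    # msName, pages_from, pages_to: all three or none.
    present = [field for field in PAGE_GROUP if meta.get(field)]
    if 0 < len(present) < len(PAGE_GROUP):
        problems.append("msName, pages_from, pages_to: use all three or none")
    # The volume number is only meaningful alongside its type.
    if meta.get("msContents_title@n") and not meta.get("msContents_title@type"):
        problems.append("msContents_title@n: requires msContents_title@type")
    return problems


# Hypothetical metadata with two deliberate errors: a three-digit year
# and msName given without pages_from and pages_to.
sample = {
    "entities": "gold",
    "identities": "checked",
    "parsing": "automatic",
    "segmentation": "gold",
    "tagging": "gold",
    "redundant": "no",
    "origDate_notBefore": "900",
    "origDate_notAfter": "1200",
    "msName": "MONB.YA",
}

for message in check_metadata(sample):
    print(message)
```

The sketch only tests formats and co-occurrence; it does not verify that a date is a real calendar date or that a CMCL code exists.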

Automatic metadata

GitDox will automatically generate semicolon-separated lists of named entities in the following metadata fields during export. They will not show up in the GitDox table, and you should not add or edit them manually:

people named people identifiers for people mentioned in the document (separated by “; ”)
places named place identifiers for places mentioned in the document (separated by “; ”)
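For readers who consume the exported metadata, the “; ”-separated values in these two fields can be split back into individual identifiers. The following is a small sketch; the function name and the sample identifiers are hypothetical, and the exact form of the identifiers GitDox writes is not specified here.

```python
from typing import List


def split_identifiers(value: str) -> List[str]:
    """Split a semicolon-separated metadata value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(";") if item.strip()]


# Hypothetical value of the "people" field.
print(split_identifiers("Abraham; Isaac"))
```

Splitting on the bare semicolon and trimming each item tolerates missing or extra spaces around the separator.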