Annotation layer names for Coptic SCRIPTORIUM
Data
This document supersedes the Google doc used previously. (Note: in the layer names, @ and _ are interchangeable; @ is replaced by _ when the file is converted.)
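As a minimal sketch of the @/_ equivalence noted above, the following Python helper treats both spellings of a layer name as the same. The function names are hypothetical and are not part of GitDox or any Coptic SCRIPTORIUM tool.

```python
def normalize_layer_name(name: str) -> str:
    """Return the converted form of a layer name, with '@' replaced by '_'."""
    return name.strip().replace("@", "_")


def same_layer(a: str, b: str) -> bool:
    """True if two column headers refer to the same layer after conversion."""
    return normalize_layer_name(a) == normalize_layer_name(b)
```

For example, `hi@rend` and `hi_rend` compare as the same layer, while `lb@n` and `cb_n` do not.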
tok | tokens, smallest possible unit to be annotated; MAY BE SMALLER THAN THE MORPHEMES IN ORIG |
orig | smallest unit of LANGUAGE (morpheme or word level; smaller than the bound group level); orthography is from the original text (diplomatic, edition, etc.); includes supralinear strokes and other markings from the manuscript (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf). Spans in this layer must match those in norm exactly in their length. |
orig_group | bound groups using the original orthography, including supralinear strokes and other markings. Spans in this layer must match those in norm_group exactly in their length. |
norm_group | bound groups (same structure as orig_group but with normalized spelling, etc., so content is based on norm). Spans in this layer must match those in orig_group exactly in their length. |
norm | normalized version of orig. Spans in this layer must match those in orig exactly in their length. |
pos | part of speech tags. Spans in this layer must match those in norm exactly in their length (i.e., norm units are the units that carry parts of speech). |
lang | language of origin tags (Hebrew, Greek, Latin, Aramaic, etc.) |
morph | morphs that are below the word level – this is where words containing mnt, at, ref are annotated a second time (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf sections 4.3 & 4.4). Note that morph units DO NOT receive parts of speech. |
lemma | lemma (dictionary head word); annotates on the normalized words (“norm” layer) |
note | notes that normally would go in a TEI XML <note note=“xxx”> tag |
hi@rend | usually appears as hi_rend in the column name in spreadsheet mode; for text renderings (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf sections 4.2 & 5). Use spaces not commas between elements (e.g., red large not red, large); to render spaces, be sure to place underscores in the phrase (e.g., 1_space_right not 1 space right); validates to TEI XML only if there are five or fewer elements |
gap | Annotates for lacunae. Corresponds to the EpiDoc TEI-XML element gap. Uses attributes such as @reason, @unit, @quantity, and @extent. With attributes, each element+attribute annotation generates a new layer in the multi-layer data model |
supplied | Annotates for supplied text where text is missing from the original for a variety of reasons. Corresponds to the EpiDoc TEI-XML element supplied. Uses attributes such as @evidence and @reason. With attributes, each element+attribute annotation generates a new layer in the multi-layer data model. |
lb@n | usually appears as lb_n in column header in spreadsheet mode; line breaks – numbered according to the original manuscript |
cb@n | usually appears as cb_n in column header in spreadsheet mode; column breaks – numbered according to the original manuscript |
pb_xml_id | Page numbers of original manuscript (not the current repository numbering); be sure column label does not include a colon (e.g. pb_xml_id not pb_xml:id); be sure page numbers do not include spaces (e.g. EG202 not EG 202) (TEI XML <pb xml:id=“xxx”>) |
ignore:note | notes that will NOT be imported into ANNIS or exported as TEI or PAULA XML; private notations from annotators/encoders/editors |
translation | English translation; necessary for analytic view: if the text has no translation, use “…” in logical spans where a translation would go. Use small sentences or partial sentences. Must be aligned with “p” paragraph breaks and not cut across the paragraph breaks. If you use quotation marks, do not use double straight quotation marks; you can use round smart quotes or single quotes. |
p | paragraph breaks for translation and normalization; necessary for the normalized view. If the paragraphs are not numbered, put “p” in each span. Be sure p spans are equal to or contain translation spans; they cannot break across translation spans. Typically provide paragraph breaks for every ekthesis. |
verse_n | verse of text written as number (always use in Bible of any kind, including Sahidica) |
chapter_n | chapter of text recorded as number; currently used only in corpora in which there are canonical or disciplinary-standard chapter divisions (not a required annotation; for Bible this information is typically in the metadata, as well) |
sbl_greek | The Greek New Testament text. Annotation for the New Testament corpora only. Aligned by verse with the Coptic. Source is the XML Greek New Testament created by the Society of Biblical Literature and Logos Software. |
sbl_apparatus | The apparatus for the Greek New Testament text. Annotation for the New Testament corpora only. Aligned by verse with the Coptic and Greek New Testament text. Source is the XML Greek New Testament created by the Society of Biblical Literature and Logos Software. |
div@type | For EpiDoc TEI compatibility, this element needs to be present in div spans. Usually our number (n) in the div tag corresponds to the verse in the Bible corpora. In the spreadsheet file, this layer should be present and equal in size to the verse span, and every cell should contain the text “textpart” (no quotes). EpiDoc TEI conversion should produce <div type=“textpart” n=“1”> for the div wrapping verse one in any chapter of the Sahidica corpus. |
vid | formerly verse@id (Sahidica) |
chapter@cname | chapter of text written as text and number (not necessary – in other data) |
chapter@cid | chapter id (Sahidica– not necessary) |
verse@vname | verse of text written as text and number (e.g. 1 Corinthians 1:10) (not necessary – in other data) |
add_place | |
vid_n | CTS URN for the verse (e.g., urn:cts:copticLit:johannes.canons.monbfa:1.1) |
ed_page_n | page number of a text as it appears in an edition |
ed_line_n | line number of a text as it appears in an edition |
entity | one of the ten entity types (e.g. person, place) see entity guidelines. Note that there can be multiple entity columns due to nested entities. These columns are typically not edited manually in the spreadsheet, but are added by the graphical entity editing interface in entities mode. |
identity | this annotation stores linked entry identifiers for named entities; it is populated automatically during export by GitDox if named entities have been added using the entity annotation interface. Annotators do not need to manually add this column |
arabic | Arabic translation. Spans should follow translation and verse layers |
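The span-matching rules in the table above (orig with norm, orig_group with norm_group, pos with norm, and translation inside p) can be checked mechanically. The sketch below is illustrative only: it assumes each layer has already been read into a list of (first_row, last_row) tuples, which is a hypothetical representation and not the GitDox data model, and it reads "match exactly in their length" as "cover exactly the same rows".

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a span is (first_row, last_row), inclusive.
Span = Tuple[int, int]

# Layer pairs whose spans must match exactly, per the table above.
ALIGNED_PAIRS = [("orig", "norm"), ("orig_group", "norm_group"), ("pos", "norm")]


def misaligned_spans(layers: Dict[str, List[Span]]) -> List[str]:
    """Report spans in one layer of a pair that have no identical span in the other."""
    problems = []
    for a, b in ALIGNED_PAIRS:
        if a not in layers or b not in layers:
            continue  # skip pairs where a layer is absent from this document
        for span in sorted(set(layers[a]) - set(layers[b])):
            problems.append(f"{a} span {span} has no matching {b} span")
        for span in sorted(set(layers[b]) - set(layers[a])):
            problems.append(f"{b} span {span} has no matching {a} span")
    return problems


def translation_crosses_p(layers: Dict[str, List[Span]]) -> List[Span]:
    """Return translation spans that are not contained in any single p span."""
    p_spans = layers.get("p", [])
    return [
        t
        for t in layers.get("translation", [])
        if not any(start <= t[0] and t[1] <= end for start, end in p_spans)
    ]
```

Both functions only report problems; deciding how to repair a misaligned span remains an editorial choice.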
NOT columns in the spreadsheet
The following information should NOT be annotated manually in the spreadsheet, as it is added by other processes:
- identity - this is the ANNIS annotation corresponding to named entity linking (Wikification). This information comes from the entity identification annotations in entities mode (after clicking “List named entities”). It should not be entered manually into the spreadsheet itself.
- func / head - this is syntactic information from automatic or gold parsing. It is never done in spreadsheet mode, but added during publication by an automatic parser, or annotated manually in the Arborator interface (but NOT in GitDox)
- multiword - multiword expression annotation is also added automatically during publication based on the current state of multiword entries in the Coptic Dictionary Online. It is not edited manually and should not be included in the spreadsheet.
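A short sketch of how the list above could be enforced before a spreadsheet is saved. It assumes the column headers are available as a list of strings; the function and constant names are hypothetical.

```python
# Layers that are added by other processes and should not be spreadsheet columns.
AUTOMATIC_LAYERS = {"identity", "func", "head", "multiword"}


def unexpected_columns(headers):
    """Return headers naming layers that should not be annotated manually."""
    return [
        h for h in headers
        if h.strip().replace("@", "_") in AUTOMATIC_LAYERS
    ]
```

An empty result means none of the automatically added layers appear among the columns.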
Metadata
(Note: see also the corpus-level metadata documentation for adding metadata for the entire corpus.)
Download a checklist of the following fields (does not yet include arabic_translation and possibly other newer fields; the list below is the most accurate)
annotation | names of annotators (transcribers, editors, annotators) in comma delimited sequence |
arabic_translation | names of people who translated the text into Arabic in comma-delimited sequence |
attributed_author | optional. attributed author of a conceptual work who may or may not be the historical author |
author | author of the conceptual work |
collection | collection or department in the current repository |
Coptic_edition | if the text has been published before, include publication information here |
copyist | optional. copyist or scribe of the text on the manuscript or text-bearing object |
corpus | the corpus name in ANNIS (in the Coptic SCRIPTORIUM corpus architecture, e.g. shenoute.abraham) |
country | country of origin of the text object; optional |
document_cts_urn | urn that applies to the document following data model created by Bridget Almas |
endnote | contains a note about the document that will appear in the HTML visualizations at the bottom of the visualization; optional |
entities | describes whether entity annotation has been reviewed. Available values are automatic, checked, or gold; required |
Greek_source | optional, information about the Greek version of the text if it exists (e.g., Greek Alphabetical or Systematic Apophthegmata Patrum) |
identities | describes whether named entity linking has been reviewed. Available values are automatic, checked, or gold; required |
idno | catalogue # of the manuscript in the current repository |
language | language in which the text is written |
license | use for copyright in Sahidica, CC-BY for everything else. If using CC-BY, enter <a href='https://creativecommons.org/licenses/by/4.0/'>CC-BY 4.0</a> to produce a link |
msContents_title@n | volume number of the thing in msContents_title@type; if this field has data, msContents_title@type must also have data for it to be output to the TEI XML |
msContents_title@type | used for things like Shenoute's Canons or Discourses; optional |
msItem_title | the name or title of the conceptual work, e.g. Abraham Our Father, To Thieving Nuns; in the TEI export, if this is not available, the main title of the document is used (metadata field “title”) |
msName | use CMCL code (e.g., MONB.YA); optional but must use msName, pages_from, pages_to all three or none at all |
next | contains the CTS urn for the next document in the corpus; optional |
note | optional |
objectType | codex, papyrus, ostracon, etc; optional |
order | contains a number that orders the documents in a list, with the preferred first document numbered 01. For this project, the list should begin in the same order dictated by the next and previous metadata fields. After this list of documents (which almost always contains only non-redundant documents; see the “redundant” metadatum below), the redundant parallel witness documents appear (if any). These redundant parallel witnesses should also be numbered according to the order of the running text of the conceptual work(s), for ease of reading and search; optional |
origDate | prose about the date of the text object (e.g., Between 900 and 1200 C.E.); optional |
origDate_notAfter | date of the terminus ante quem (in four digits with leading zeros, e.g., 1200); be sure the format of the cell in Excel is text not number or date; optional |
origDate_notBefore | date of the terminus post quem (in four digits with leading zeros, e.g., 0900); be sure the format of the cell in Excel is text not number or date; optional |
origDate_precision | likelihood that the dating is accurate – usually “low”, “medium”, or “high”; optional |
origPlace | name of the place of the text object, not necessarily city/town/village name (e.g., White Monastery); optional |
pages_from | beginning of page sequence of document (original page number of scribe but written in Arabic numerals); optional but must use msName, pages_from, pages_to all three or none at all |
pages_to | optional but must use msName, pages_from, pages_to all three or none at all |
parsing | describes whether parsing has been reviewed. Available values are automatic, checked, or gold; required |
paths_authors | PATHs project stable id from PATHs project for the author entered as a link, e.g. <a href='http://paths.uniroma1.it/atlas/authors/80'>80</a>. Optional but preferred if available. |
paths_manuscripts | PATHs project Coptic Literary Manuscript stable id entered as a link, e.g. <a href='http://paths.uniroma1.it/atlas/manuscripts/195'>195</a>; optional but preferred if available |
paths_works | PATHs project stable ID for conceptual works, which are recorded and identified by Clavis Coptica (CC) entries composed of a 4-digit number preceded by CC, for example, CC 0599. PATHs stable IDs contain only the number, e.g., <a href='http://paths.uniroma1.it/atlas/works/246'>246</a>; optional but preferred if available |
placeName | city or village or place name of the original location of the text object; should be recognizable name in gazetteer (note, see TEI guidelines and EpiDoc for how this can be linked to Pleiades); optional |
previous | contains the CTS urn for the previous document in the corpus; optional |
project | name of project supporting the transcription/annotation/publication (e.g., Coptic SCRIPTORIUM, KoMET, etc.) |
redundant | required: “yes” or “no” – tied to parallels: yes = file marked redundant (a parallel is the primary witness); no=this file is the primary witness (whether or not it has a parallel); any file with NO parallel witness should be marked redundant=no |
repository | museum, library, etc. where the manuscript currently resides |
segmentation | describes whether segmentation and tokenization has been reviewed. Available values are automatic, checked, or gold; required |
source | if the digitized text comes from another source, the editors of that source are listed here (used for Sahidica and other donated texts); optional |
source_info | |
tagging | describes whether tagging has been reviewed. Available values are automatic, checked, or gold; required |
title | title of this document (unique) |
translation | use “none” if no translation; if English translation published by Coptic SCRIPTORIUM then name(s) of translator(s) inserted here in comma delimited sequence |
Trismegistos | [enter the trismegistos # if it exists/is known for the manuscript]; optional but should be included if a TM number exists |
version@date | version date of this Coptic SCRIPTORIUM data in YYYY-MM-DD format; format cell as text not as date format in Excel |
version@n | version of this Coptic SCRIPTORIUM data |
witness | prose note about parallels; optional |
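Several constraints in the metadata table above lend themselves to an automated check: the allowed review values, the yes/no value of redundant, the all-or-none rule for msName, pages_from and pages_to, and the date formats. The following is a minimal sketch, assuming the metadata has been loaded into a plain dictionary of strings keyed by the field names above; `check_metadata` is a hypothetical helper and not an existing GitDox function.

```python
import re
from typing import Dict, List

REVIEW_FIELDS = ("entities", "identities", "parsing", "segmentation", "tagging")
REVIEW_VALUES = {"automatic", "checked", "gold"}
PAGE_GROUP = ("msName", "pages_from", "pages_to")


def check_metadata(meta: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    # Required review-status fields with a closed set of values.
    for field in REVIEW_FIELDS:
        if meta.get(field) not in REVIEW_VALUES:
            problems.append(f"{field} must be automatic, checked, or gold")
    if meta.get("redundant") not in {"yes", "no"}:
        problems.append("redundant must be yes or no")
    # msName, pages_from, pages_to: all three or none at all.
    present = [f for f in PAGE_GROUP if meta.get(f)]
    if 0 < len(present) < len(PAGE_GROUP):
        problems.append("msName, pages_from, pages_to: use all three or none")
    # Dates are four digits with leading zeros, stored as text.
    for field in ("origDate_notBefore", "origDate_notAfter"):
        value = meta.get(field)
        if value and not re.fullmatch(r"\d{4}", value):
            problems.append(f"{field} must be four digits, e.g. 0900")
    date = meta.get("version@date")
    if date and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        problems.append("version@date must be YYYY-MM-DD")
    # A volume number is only exported if the title type is also given.
    if meta.get("msContents_title@n") and not meta.get("msContents_title@type"):
        problems.append("msContents_title@n needs msContents_title@type")
    return problems
```

The check covers only the constraints stated explicitly in the table; fields described as free prose are not validated.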
Automatic metadata
GitDox will automatically generate semi-colon separated lists of named entities in the following metadata fields during export. They will not show up in the GitDox table, and you should not add or edit these manually:
people | named people identifiers for people mentioned in the document (separated by “; ”) |
places | named place identifiers for places mentioned in the document (separated by “; ”) |
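For anyone consuming the exported people and places fields, a small sketch of splitting the semicolon-separated lists back into individual identifiers. The function name is hypothetical and the identifiers in the usage note are made up for illustration.

```python
def split_entities(value: str) -> list:
    """Split a '; '-separated metadata value into a list of identifiers."""
    return [part.strip() for part in value.split(";") if part.strip()]
```

For instance, a value such as `Abraham; Sarah` yields two identifiers, and an empty field yields an empty list.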