Coptic SCRIPTORIUM Wiki

Annotation layer names for Coptic SCRIPTORIUM

Data

This document supersedes the Google doc used previously. (Note: in the layer names, @ and _ are interchangeable; @ is replaced with _ when the file is converted.)
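
The @-to-_ equivalence noted above can be pictured with a minimal sketch. The function name is hypothetical and is not part of any Coptic SCRIPTORIUM tool; it only assumes that every @ in a layer name becomes _ on conversion, as stated in the note.

```python
def spreadsheet_column_name(layer_name: str) -> str:
    """Return the column name for a layer, with every "@" replaced by "_".

    Hypothetical helper: it mirrors the rule in the note above, not the
    actual conversion code.
    """
    return layer_name.replace("@", "_")


# Layer names taken from the table below.
for name in ["hi@rend", "lb@n", "cb@n", "norm_group"]:
    print(name, "->", spreadsheet_column_name(name))
```

Names that already use underscores, such as norm_group, pass through unchanged.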

tok	tokens, smallest possible unit to be annotated; MAY BE SMALLER THAN THE MORPHEMES IN ORIG
orig	smallest unit of LANGUAGE (morpheme or word level; smaller than the bound group level); orthography is from the original text (diplomatic, edition, whatever); includes supralinear strokes and other markings from the manuscript (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf). Spans in this layer must match those in norm exactly in their length.
orig_group	bound groups using the original orthography, including supralinear strokes and other markings. Spans in this layer must match those in norm_group exactly in their length.
norm_group	bound groups (same structure as orig_group but with normalized spelling, etc., so content is based on norm). Spans in this layer must match those in orig_group exactly in their length.
norm	normalized version of orig. Spans in this layer must match those in orig exactly in their length.
pos	part of speech tags. Spans in this layer must match those in norm exactly in their length (i.e., norm units are the units that carry parts of speech).
lang	language of origin tags (Hebrew, Greek, Latin, Aramaic, etc.)
morph	morphs that are below the word level – this is where words containing mnt, at, ref are annotated a second time (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf sections 4.3 & 4.4). Note that morph units DO NOT receive parts of speech.
lemma	lemma (dictionary head word); annotates on the normalized words (“norm” layer)
note	notes that normally would go in a TEI XML <note note=“xxx”> tag
hi@rend	usually appears as hi_rend in the column name in spreadsheet mode; for text renderings (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf sections 4.2 & 5). Use spaces not commas between elements (e.g., red large not red, large); to render spaces, be sure to place underscores in the phrase (e.g., 1_space_right not 1 space right); validates to TEI XML only if there are five or fewer elements
gap	Annotates for lacunae. Corresponds to the EpiDoc TEI-XML element gap. Uses attributes such as @reason, @unit, @quantity, and @extent. With attributes, each element+attribute annotation generates a new layer in the multi-layer data model
supplied	Annotates for supplied text where text is missing from the original for a variety of reasons. Corresponds to the EpiDoc TEI-XML element supplied. Uses attributes such as @evidence and @reason. With attributes, each element+attribute annotation generates a new layer in the multi-layer data model.
lb@n	usually appears as lb_n in column header in spreadsheet mode; line breaks – numbered according to the original manuscript
cb@n	usually appears as cb_n in column header in spreadsheet mode; column breaks – numbered according to the original manuscript
pb_xml_id	Page numbers of original manuscript (not the current repository numbering); be sure column label does not include a colon (e.g. pb_xml_id not pb_xml:id); be sure page numbers do not include spaces (e.g. EG202 not EG 202) (TEI XML <pb xml:id=“xxx”>)
ignore:note	notes that will NOT be imported into ANNIS or exported as TEI or PAULA XML; private notations from annotators/encoders/editors
translation	English translation; necessary for analytic view: if the text has no translation, use “…” in logical spans where a translation would go. Use small sentences or partial sentences. Must be aligned with “p” paragraph breaks and not cut across the paragraph breaks. If you use quotation marks, do not use double straight quotation marks; you can use round smart quotes or single quotes.
p	paragraph breaks for translation and normalization; necessary for the normalized view. If the paragraphs are not numbered, put “p” in each span. Be sure p spans are equal to or contain translation spans; they cannot break across translation spans. Typically provide paragraph breaks for every ekthesis.
verse_n	verse of text written as number (always use in Bible of any kind, including Sahidica)
chapter_n	chapter of text recorded as number; currently used only in corpora in which there are canonical or disciplinary-standard chapter divisions (not a required annotation; for Bible this information is typically in the metadata, as well)
sbl_greek	The Greek New Testament text. Annotation for the New Testament corpora only. Aligned by verse with the Coptic. Source is the XML Greek New Testament created by the Society of Biblical Literature and Logos Software.
sbl_apparatus	The apparatus for the Greek New Testament text. Annotation for the New Testament corpora only. Aligned by verse with the Coptic and Greek New Testament text. Source is the XML Greek New Testament created by the Society of Biblical Literature and Logos Software.
div@type	For EpiDoc TEI compatibility, this element needs to be present in div spans. Usually our number (n) in the div tag corresponds to the verse in the bible corpora. In the spreadsheet file, this layer should be present, it should be equal in size to the verse span, and every cell should contain the text “textpart” (no quotes). EpiDoc TEI conversion should produce <div type=“textpart” n=“1”> for the div wrapping verse one in any chapter of the Sahidica corpus.
vid	formerly verse@id (Sahidica)
chapter@cname	chapter of text written as text and number (not necessary – in other data)
chapter@cid	chapter id (Sahidica– not necessary)
verse@vname	verse of text written as text and number (e.g. 1 Corinthians 1:10) (not necessary – in other data)
add_place
vid_n	CTS URN for the verse (e.g., urn:cts:copticLit:johannes.canons.monbfa:1.1)
ed_page_n	page number of a text as it appears in an edition
ed_line_n	line number of a text as it appears in an edition
entity	one of the ten entity types (e.g. person, place) see entity guidelines. Note that there can be multiple entity columns due to nested entities. These columns are typically not edited manually in the spreadsheet, but are added by the graphical entity editing interface in entities mode.
identity	this annotation stores linked entry identifiers for named entities; it is populated automatically during export by GitDox if named entities have been added using the entity annotation interface. Annotators do not need to manually add this column
arabic	Arabic translation. Spans should follow translation and verse layers
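
Several rows above require that spans in one layer match those in another (orig and norm, orig_group and norm_group, pos and norm) and that translation spans never cross p spans. The following is a minimal sketch of how such checks could be expressed. It assumes each layer has already been read from the spreadsheet into a list of (first row, last row) pairs; that representation and both function names are hypothetical and do not describe how GitDox validates documents.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (first token row, last token row), inclusive

# Layer pairs whose spans must coincide exactly, per the table above.
MATCHING_PAIRS = [("orig", "norm"), ("orig_group", "norm_group"), ("pos", "norm")]


def mismatched_pairs(layers: Dict[str, List[Span]]) -> List[Tuple[str, str]]:
    """Return the layer pairs whose spans differ; absent layers are skipped."""
    problems = []
    for a, b in MATCHING_PAIRS:
        if a in layers and b in layers and sorted(layers[a]) != sorted(layers[b]):
            problems.append((a, b))
    return problems


def translation_crosses_p(p: List[Span], translation: List[Span]) -> List[Span]:
    """Return translation spans that are not fully inside a single p span."""
    return [
        (ts, te)
        for ts, te in translation
        if not any(ps <= ts and te <= pe for ps, pe in p)
    ]


# Invented example document with four token rows.
layers = {
    "orig": [(0, 0), (1, 2), (3, 3)],
    "norm": [(0, 0), (1, 2), (3, 3)],
    "pos": [(0, 0), (1, 1), (2, 2), (3, 3)],  # second norm unit wrongly split
    "orig_group": [(0, 2), (3, 3)],
    "norm_group": [(0, 2), (3, 3)],
}
print(mismatched_pairs(layers))  # [('pos', 'norm')]
print(translation_crosses_p([(0, 2), (3, 3)], [(0, 1), (2, 3)]))  # [(2, 3)]
```

In the example, the pos layer splits a norm unit in two, and the second translation span starts in one paragraph and ends in the next, so both are reported.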

NOT columns in the spreadsheet

The following information should NOT be annotated manually in the spreadsheet, as it is added by other processes:

identity - this is the ANNIS annotation corresponding to named entity linking (Wikification). This information comes from the entity identification annotations in entities mode (after clicking “List named entities”). It should not be entered manually into the spreadsheet itself.
func / head - this is syntactic information from automatic or gold parsing. It is never done in spreadsheet mode, but added during publication by an automatic parser, or annotated manually in the Arborator interface (but NOT in GitDox)
multiword - multiword expression annotation is also added automatically during publication based on the current state of multiword entries in the Coptic Dictionary Online. It is not edited manually and should not be included in the spreadsheet.

Metadata

(Note: see also the corpus-level metadata documentation for adding metadata for the entire corpus.)

Download a checklist of the following fields (does not yet include arabic_translation and possibly other newer fields; the list below is the most accurate)

annotation	names of annotators (transcribers, editors, annotators) in comma delimited sequence
arabic_translation	names of people who translated the text into Arabic in comma-delimited sequence
attributed_author	optional. attributed author of a conceptual work who may or may not be the historical author
author	author of the conceptual work
collection	collection or department in the current repository
Coptic_edition	if the text has been published before, include publication information here
copyist	optional. copyist or scribe of the text on the manuscript or text-bearing object
corpus	the corpus name in ANNIS (in the Coptic SCRIPTORIUM corpus architecture, e.g. shenoute.abraham)
country	country of origin of the text object; optional
document_cts_urn	urn that applies to the document following data model created by Bridget Almas
endnote	contains a note about the document that will appear in the HTML visualizations at the bottom of the visualization; optional
entities	describes whether entity annotation has been reviewed. Available values are automatic, checked, or gold; required
facsimile_graphic_url	optional. url of manuscript images (if available)
Greek_source	optional, information about the Greek version of the text if it exists (e.g., Greek Alphabetical or Systematic Apophthegmata Patrum)
identities	describes whether named entity linking has been reviewed. Available values are automatic, checked, or gold; required
idno	catalogue # of the manuscript in the current repository
language	language in which the text is written
license	use for copyright in Sahidica, CC-BY for everything else. If using CC-BY, enter <a href='https://creativecommons.org/licenses/by/4.0/'>CC-BY 4.0</a> to produce a link
msContents_title@n	volume number of the thing in msContents_title@type; if this field has data then in order for it to be outputted to the TEI XML then msContents_title@type must have data
msContents_title@type	used for things like Shenoute's Canons or Discourses; optional
msItem_title	the name or title of the conceptual work, e.g. Abraham Our Father, To Thieving Nuns; in the TEI export, if this is not available, the main title of the document is used (metadata field “title”)
msName	use CMCL code (e.g., MONB.YA); optional but must use msName, pages_from, pages_to all three or none at all
next	contains the CTS urn for the next document in the corpus; optional
note	optional
objectType	codex, papyrus, ostracon, etc; optional
ocr	used only for documents that are OCR'd from print editions; use gold/checked/automatic
order	contains a number that orders the documents in a list, with the preferred first document numbered 01. For this project, the list should begin by following the order dictated by the next and previous metadata fields. After this list of documents (which almost always contains only non-redundant documents; see the “redundant” metadatum below), the redundant parallel witness documents appear (if any). These redundant parallel witnesses should also be numbered according to the order of the running text of the conceptual work(s), for ease of reading and search. Optional
origDate	prose about the date of the text object (e.g., Between 900 and 1200 C.E.); optional
origDate_notAfter	date of the terminus ante quem (in four digits with leading zeros, e.g., 1200); be sure the format of the cell in Excel is text not number or date; optional
origDate_notBefore	date of the terminus post quem (in four digits with leading zeros, e.g., 0900); be sure the format of the cell in Excel is text not number or date; optional
origDate_precision	likelihood that the dating is accurate – usually “low”, “medium”, or “high”; optional
origPlace	name of the place of the text object, not necessarily city/town/village name (e.g., White Monastery); optional
pages_from	beginning of page sequence of document (original page number of scribe but written in arabic numerals) optional but must use msName, pages_from, pages_to all three or none at all
pages_to	optional but must use msName, pages_from, pages_to all three or none at all
parsing	describes whether parsing has been reviewed. Available values are automatic, checked, or gold; required
paths_authors	PATHs project stable id for the author, entered as a link, e.g. <a href='http://paths.uniroma1.it/atlas/authors/80'>80</a>; optional but preferred if available
paths_manuscripts	PATHs project Coptic Literary Manuscript stable id entered as a link, e.g. <a href='http://paths.uniroma1.it/atlas/manuscripts/195'>195</a>; optional but preferred if available
paths_works	PATHs Project stable ID for conceptual works. Works are recorded and identified by Clavis Coptica (CC) entries composed of a 4-digit number preceded by CC, for example, CC 0599. PATHs stable ids contain only the number. E.g., <a href='http://paths.uniroma1.it/atlas/works/246'>246</a>; optional but preferred if available
placeName	city or village or place name of the original location of the text object; should be recognizable name in gazetteer (note, see TEI guidelines and EpiDoc for how this can be linked to Pleiades); optional
previous	contains the CTS urn for the previous document in the corpus; optional
project	name of project supporting the transcription/annotation/publication (e.g., Coptic SCRIPTORIUM, KoMET, etc.)
redundant	required: “yes” or “no” – tied to parallels: yes = file marked redundant (a parallel is the primary witness); no=this file is the primary witness (whether or not it has a parallel); any file with NO parallel witness should be marked redundant=no
repository	museum, library, etc. where the manuscript currently resides
segmentation	describes whether segmentation and tokenization have been reviewed. Available values are automatic, checked, or gold; required
source	if the digitized text comes from another source, the editors of that source are listed here (used for Sahidica and other donated texts); optional
source_info
tagging	describes whether tagging has been reviewed. Available values are automatic, checked, or gold; required
title	title of this document (unique)
translation	use “none” if no translation; if English translation published by Coptic SCRIPTORIUM then name(s) of translator(s) inserted here in comma delimited sequence
Trismegistos	[enter the trismegistos # if it exists/is known for the manuscript]; optional but should be included if a TM number exists
version@date	version date of this Coptic SCRIPTORIUM data in YYYY-MM-DD format; format cell as text not as date format in Excel
version@n	version of this Coptic SCRIPTORIUM data
witness	prose note about parallels; optional
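
Some of the constraints in the table above are mechanical enough to check automatically. The sketch below is one possible way to do so, assuming the metadata has been loaded into a plain dictionary of field name to string value; the function name and the sample values are invented for illustration, and this is not the validation that GitDox itself performs.

```python
import re
from typing import Dict, List

REVIEW_FIELDS = ["entities", "identities", "parsing", "segmentation", "tagging"]
REVIEW_VALUES = {"automatic", "checked", "gold"}
MS_TRIO = ["msName", "pages_from", "pages_to"]


def check_metadata(meta: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means no rule was broken."""
    problems = []
    # Required review-status fields.
    for field in REVIEW_FIELDS:
        if meta.get(field) not in REVIEW_VALUES:
            problems.append(field + ": expected automatic, checked, or gold")
    if meta.get("redundant") not in {"yes", "no"}:
        problems.append("redundant: expected yes or no")
    # msName, pages_from, pages_to: all three or none at all.
    present = [f for f in MS_TRIO if meta.get(f)]
    if present and len(present) != len(MS_TRIO):
        problems.append("msName, pages_from, pages_to: use all three or none")
    # msContents_title@n is only output when msContents_title@type has data.
    if meta.get("msContents_title@n") and not meta.get("msContents_title@type"):
        problems.append("msContents_title@n: requires msContents_title@type")
    # Dates: four digits with leading zeros, and YYYY-MM-DD for the version.
    for field in ("origDate_notBefore", "origDate_notAfter"):
        value = meta.get(field)
        if value and not re.fullmatch(r"[0-9]{4}", value):
            problems.append(field + ": expected four digits, e.g. 0900")
    date = meta.get("version@date")
    if date and not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", date):
        problems.append("version@date: expected YYYY-MM-DD")
    return problems


# Invented sample values, for illustration only.
sample = {
    "entities": "gold", "identities": "checked", "parsing": "automatic",
    "segmentation": "gold", "tagging": "gold", "redundant": "no",
    "msName": "MONB.YA", "pages_from": "1",  # pages_to is missing
    "origDate_notBefore": "900",             # should be 0900
    "version@date": "2020-01-31",
}
for problem in check_metadata(sample):
    print(problem)
```

The sample triggers two messages: one for the incomplete msName, pages_from, pages_to group and one for the three-digit origDate_notBefore, which is what Excel produces when the cell is formatted as a number rather than text.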

Automatic metadata

GitDox will automatically generate semi-colon separated lists of named entities in the following metadata fields during export. They will not show up in the GitDox table, and you should not add or edit these manually:

people	named people identifiers for people mentioned in the document (separated by “; ”)
places	named place identifiers for places mentioned in the document (separated by “; ”)
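
For anyone consuming exported documents, the two fields above can be split back into individual identifiers. This is a minimal sketch; the function name is hypothetical and the sample identifiers are invented, with only the “; ” separator taken from the descriptions above.

```python
from typing import List


def split_entity_list(value: str) -> List[str]:
    """Split a semicolon-separated metadata value into identifiers."""
    return [item.strip() for item in value.split(";") if item.strip()]


# Invented identifiers, for illustration only.
print(split_entity_list("Abraham; Shenoute of Atripe"))
```

Splitting on the bare semicolon and then trimming whitespace also tolerates values where the space after the separator is missing.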
