Coptic SCRIPTORIUM Wiki


Annotation layer names for Coptic SCRIPTORIUM

Data

This document supersedes the Google doc used previously. (Note: in the layer names, @ and _ are interchangeable; @ is replaced with _ when the file is converted.)
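
The @ to _ equivalence noted above can be illustrated with a minimal sketch. The function name normalize_layer_name is hypothetical and is not part of any Coptic SCRIPTORIUM tool; it only shows the substitution that the note describes.

```python
def normalize_layer_name(name: str) -> str:
    """Return the layer name with every '@' replaced by '_'.

    Hypothetical helper: it mirrors the substitution described in the
    note above and does nothing else.
    """
    return name.replace("@", "_")


if __name__ == "__main__":
    for layer in ["hi@rend", "lb@n", "cb@n", "norm_group"]:
        print(layer, "->", normalize_layer_name(layer))
```

Names that already use underscores, such as norm_group, pass through unchanged.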

tok	tokens, smallest possible unit to be annotated; MAY BE SMALLER THAN THE MORPHEMES IN ORIG
orig	smallest unit of LANGUAGE (morpheme or word level; smaller than the bound group level); orthography is from the original text (diplomatic, edition, etc.); includes supralinear strokes and other markings from the manuscript (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf). Spans in this layer must match those in norm exactly in their length.
orig_group	bound groups using the original orthography, including supralinear strokes and other markings. Spans in this layer must match those in norm_group exactly in their length.
norm_group	bound groups (same structure as orig_group but with normalized spelling, etc., so content is based on norm). Spans in this layer must match those in orig_group exactly in their length.
norm	normalized version of orig. Spans in this layer must match those in orig exactly in their length.
pos	part of speech tags. Spans in this layer must match those in norm exactly in their length (i.e., norm units are the units that carry parts of speech).
lang	language of origin tags (Hebrew, Greek, Latin, Aramaic, etc.)
morph	morphs that are below the word level – this is where words containing mnt, at, ref are annotated a second time (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf sections 4.3 & 4.4). Note that morph units DO NOT receive parts of speech.
lemma	lemma (dictionary head word); annotates on the normalized words (“norm” layer)
note	notes that normally would go in a TEI XML <note note=“xxx”> tag
hi@rend	text renderings (see http://www.copticscriptorium.org/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf sections 4.2 & 5). Use spaces not commas between elements (e.g., red large not red, large); to render spaces, be sure to place underscores in the phrase (e.g., 1_space_right not 1 space right); validates to TEI XML only if there are five or fewer elements
gap	Annotates for lacunae. Corresponds to the EpiDoc TEI-XML element gap. Uses attributes such as @reason, @unit, @quantity, and @extent. With attributes, each element+attribute annotation generates a new layer in the multi-layer data model.
supplied	Annotates for supplied text where text is missing from the original for a variety of reasons. Corresponds to the EpiDoc TEI-XML element supplied. Uses attributes such as @evidence and @reason. With attributes, each element+attribute annotation generates a new layer in the multi-layer data model.
lb@n	line breaks – numbered according to the original manuscript
cb@n	column breaks – numbered according to the original manuscript
pb_xml_id	Page numbers of original manuscript (not the current repository numbering); be sure column label does not include a colon (e.g. pb_xml_id not pb_xml:id); be sure page numbers do not include spaces (e.g. EG202 not EG 202) (TEI XML <pb xml:id=“xxx”>)
ignore:note	notes that will NOT be imported into ANNIS or exported as TEI or PAULA XML; private notations from annotators/encoders/editors
translation	English translation; necessary for analytic view: if the text has no translation, use “…” in logical spans where a translation would go. Use small sentences or partial sentences. Must be aligned with “p” paragraph breaks and not cut across the paragraph breaks. If you use quotation marks, do not use double straight quotation marks; you can use round smart quotes or single quotes.
p	paragraph breaks for translation and normalization; necessary for the normalized view. If the paragraphs are not numbered, put “p” in each span. Be sure p spans are equal to or contain translation spans; they cannot break across translation spans. Typically provide paragraph breaks for every ekthesis.
verse	verse of text written as number (always use in Bible of any kind, including Sahidica)
chapter	chapter of text recorded as number; currently used only in corpora in which there are canonical or disciplinary-standard chapter divisions (not a required annotation; for Bible this information is typically in the metadata, as well)
sbl_greek	The Greek New Testament text. Annotation for the New Testament corpora only. Aligned by verse with the Coptic. Source is the XML Greek New Testament created by the Society of Biblical Literature and Logos Software.
sbl_apparatus	The apparatus for the Greek New Testament text. Annotation for the New Testament corpora only. Aligned by verse with the Coptic and Greek New Testament text. Source is the XML Greek New Testament created by the Society of Biblical Literature and Logos Software.
div@type	For EpiDoc TEI compatibility, this element needs to be present in div spans. Usually our number (n) in the div tag corresponds to the verse in the Bible corpora. In the spreadsheet file, this layer should be present and equal in size to the verse span, and every cell should contain the text “textpart” (no quotes). EpiDoc TEI conversion should produce <div type=“textpart” n=“1”> for the div wrapping verse one in any chapter of the Sahidica corpus.
vid	formerly verse@id (Sahidica)
chapter@cname	chapter of text written as text and number (not necessary – in other data)
chapter@cid	chapter id (Sahidica – not necessary)
verse@vname	verse of text written as text and number (e.g. 1 Corinthians 1:10) (not necessary – in other data)
add_place
vid_n
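
Some of the cell-value rules in the list above (hi@rend and pb_xml_id) lend themselves to a mechanical check. The following is a minimal sketch only: the function names and messages are hypothetical, and it covers just the rules stated in those two rows (spaces rather than commas, five or fewer rendering elements, no spaces in page numbers).

```python
import re
from typing import List

# The hi@rend row says TEI validation needs five or fewer elements.
MAX_REND_ELEMENTS = 5


def check_hi_rend(value: str) -> List[str]:
    """Return a list of problems found in one hi@rend cell (hypothetical helper)."""
    problems = []
    if "," in value:
        problems.append("use spaces, not commas, between elements")
    # Count elements after treating commas as separators too, so that
    # "red, large" is still counted as two elements.
    elements = value.replace(",", " ").split()
    if len(elements) > MAX_REND_ELEMENTS:
        problems.append(
            f"{len(elements)} elements; more than {MAX_REND_ELEMENTS} "
            "will not validate as TEI XML"
        )
    return problems


def check_page_id(value: str) -> List[str]:
    """Return a list of problems found in one pb_xml_id cell (hypothetical helper)."""
    problems = []
    if re.search(r"\s", value):
        problems.append("page numbers must not contain spaces")
    return problems


if __name__ == "__main__":
    print(check_hi_rend("red large"))    # no problems expected
    print(check_hi_rend("red, large"))   # comma problem expected
    print(check_page_id("EG 202"))       # space problem expected
```

A value such as 1_space_right counts as a single element because the underscores keep it together when the cell is split on whitespace.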

Preferred order of layers

tok	orig	orig_group	norm_group	norm	pos	morph	lang	translation	lb_n	cb_n	pb_xml_id	p	hi_rend	supplied_reason	ignore:note
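
As an illustration, spreadsheet columns can be arranged in the preferred order above with a short sort. This is a sketch under assumptions: the function name sort_layers is hypothetical, and placing layers that are not in the preferred list at the end (in their existing order) is a choice made here, not a rule from this page.

```python
from typing import List

# Preferred order as listed above.
PREFERRED_ORDER = [
    "tok", "orig", "orig_group", "norm_group", "norm", "pos", "morph",
    "lang", "translation", "lb_n", "cb_n", "pb_xml_id", "p", "hi_rend",
    "supplied_reason", "ignore:note",
]


def sort_layers(columns: List[str]) -> List[str]:
    """Return the columns in preferred order (hypothetical helper).

    '@' and '_' are treated as interchangeable when looking up a name.
    Layers not in PREFERRED_ORDER go last, keeping their relative order
    (an assumption of this sketch).
    """
    rank = {name: i for i, name in enumerate(PREFERRED_ORDER)}
    unknown = len(PREFERRED_ORDER)
    # sorted() is stable, so ties (unknown layers) keep their input order.
    return sorted(columns, key=lambda c: rank.get(c.replace("@", "_"), unknown))


if __name__ == "__main__":
    print(sort_layers(["pos", "hi@rend", "tok", "lemma", "norm", "orig"]))
```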

Metadata

(Note: see also the corpus-level metadata documentation for adding metadata for the entire corpus.)

corpus
Coptic_edition	if the text has been published before, include publication information here
Greek_source	optional, information about the Greek version of the text if it exists (e.g., Greek Alphabetical or Systematic Apophthegmata Patrum)
title	title of this document (unique)
msItem_title	the name or title of the conceptual work, e.g. Abraham Our Father, To Thieving Nuns; in the TEI export, if this is not available, the main title of the document is used (metadata field “title”)
author	author of the conceptual work
language	language in which the text is written
annotation	names of annotators (transcribers, editors, annotators) in comma delimited sequence
project	name of project supporting the transcription/annotation/publication (e.g., Coptic SCRIPTORIUM, KoMET, etc.)
translation	use “none” if no translation; if English translation published by Coptic SCRIPTORIUM then name(s) of translator(s) inserted here in comma delimited sequence
msName	use CMCL code (e.g., MONB.YA); optional but must use msName, pages_from, pages_to all three or none at all
pages_from	beginning of page sequence of document (original page number of scribe but written in Arabic numerals); optional but must use msName, pages_from, pages_to all three or none at all
pages_to	optional but must use msName, pages_from, pages_to all three or none at all
msContents_title@type	used for things like Shenoute's Canons or Discourses; optional
msContents_title@n	volume number of the thing in msContents_title@type; this field is output to the TEI XML only if msContents_title@type also has data
repository	museum, library, etc. where the manuscript currently resides
collection	collection or department in the current repository
idno	catalogue # of the manuscript in the current repository
version@n	version of this Coptic SCRIPTORIUM data
version@date	version date of this Coptic SCRIPTORIUM data in YYYY-MM-DD format; format cell as text not as date format in Excel
source_info
license	use for copyright in Sahidica, CC-BY for everything else. If using CC-BY, enter <a href='https://creativecommons.org/licenses/by/4.0/'>CC-BY 4.0</a> to produce a link
document_cts_urn	urn that applies to the document following data model created by Bridget Almas
Trismegistos	[enter the trismegistos # if it exists/is known for the manuscript]; optional but should be included if a TM number exists
objectType	codex, papyrus, ostracon, etc; optional
country	country of origin of the text object; optional
placeName	city or village or place name of the original location of the text object; should be recognizable name in gazetteer (note, see TEI guidelines and EpiDoc for how this can be linked to Pleiades); optional
origPlace	name of the place of the text object, not necessarily city/town/village name (e.g., White Monastery); optional
origDate	prose about the date of the text object (e.g., Between 900 and 1200 C.E.); optional
origDate_precision	likelihood that the dating is accurate – usually “low”, “medium”, or “high”; optional
origDate_notBefore	date of the terminus post quem (in four digits with leading zeros, e.g., 0900); be sure the format of the cell in Excel is text not number or date; optional
origDate_notAfter	date of the terminus ante quem (in four digits with leading zeros, e.g., 1200); be sure the format of the cell in Excel is text not number or date; optional
source	if the digitized text comes from another source, the editors of that source are listed here (used for Sahidica and other donated texts); optional
note	optional
witness	prose note about parallels; optional
redundant	required: “yes” or “no” – tied to parallels: yes = file marked redundant (a parallel is the primary witness); no = this file is the primary witness (whether or not it has a parallel); any file with NO parallel witness should be marked redundant=no
previous	contains the CTS urn for the previous document in the corpus; optional
next	contains the CTS urn for the next document in the corpus; optional
endnote	contains a note about the document that will appear in the HTML visualizations at the bottom of the visualization; optional
order	contains a number that orders the documents in a list, with the preferred first document numbered 01. For this project, the list should begin by following the order dictated by the next and previous metadata fields. After this list of documents (which almost always contains only non-redundant documents; see the “redundant” metadatum above), the redundant parallel witness documents appear (if any). These redundant parallel witnesses should also be numbered according to the order of the running text of the conceptual work(s), for ease of reading and search; optional
parsing	describes whether parsing has been reviewed. Available values are automatic, checked, or gold; required
segmentation	describes whether segmentation and tokenization have been reviewed. Available values are automatic, checked, or gold; required
tagging	describes whether tagging has been reviewed. Available values are automatic, checked, or gold; required
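
Several of the metadata rules above can be checked mechanically. The sketch below is illustrative only: the function name check_metadata and its messages are hypothetical, the metadata is assumed to arrive as a plain dictionary keyed by the field names as written on this page (e.g. version@date), and the date checks test only the shape of the value, not whether it is a real calendar date.

```python
import re
from typing import Dict, List

REVIEW_FIELDS = ("parsing", "segmentation", "tagging")
REVIEW_VALUES = {"automatic", "checked", "gold"}
PAGE_TRIPLE = ("msName", "pages_from", "pages_to")
YEAR_FIELDS = ("origDate_notBefore", "origDate_notAfter")


def check_metadata(meta: Dict[str, str]) -> List[str]:
    """Return a list of problems for one document's metadata (hypothetical helper)."""
    problems = []

    # Required fields with a fixed vocabulary.
    for field in REVIEW_FIELDS:
        if meta.get(field) not in REVIEW_VALUES:
            problems.append(f"{field} must be automatic, checked, or gold")
    if meta.get("redundant") not in {"yes", "no"}:
        problems.append("redundant must be yes or no")

    # msName, pages_from, pages_to: all three or none at all.
    present = [f for f in PAGE_TRIPLE if meta.get(f)]
    if 0 < len(present) < len(PAGE_TRIPLE):
        problems.append("msName, pages_from, pages_to: use all three or none")

    # version@date must look like YYYY-MM-DD (shape only).
    date = meta.get("version@date")
    if date and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        problems.append("version@date must be in YYYY-MM-DD format")

    # Optional year fields need four digits with leading zeros.
    for field in YEAR_FIELDS:
        value = meta.get(field)
        if value and not re.fullmatch(r"\d{4}", value):
            problems.append(f"{field} must be four digits, e.g. 0900")

    return problems


if __name__ == "__main__":
    example = {
        "parsing": "gold", "segmentation": "checked", "tagging": "automatic",
        "redundant": "no", "version@date": "2016-01-15",
        "msName": "MONB.YA", "origDate_notBefore": "900",
    }
    for problem in check_metadata(example):
        print(problem)
```

A value such as 900 in origDate_notBefore is typically what results when Excel treats the cell as a number and drops the leading zero, which is why the rows above ask for text-formatted cells.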