Checklist for Publishing and Releasing Corpora

1. New and revised documents should be reviewed by a senior editor.

  • Check questions from annotators in the document and/or pull request.
  • Read through the document.
  • Use Google Refine to see whether errors in tokenization, pos-tagging, lang-tagging, morph annotation, and normalization stand out.
  • Ensure layer names conform to standards (see layer annotation documentation).
  • Use the validation add-in (when available) to confirm that normalized annotation spans cover the same spans as the original column spans, that group layer spans are the same size, etc.
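
The span check in the last item can be sketched in Python. This is only an illustration of the idea, not the validation add-in itself. It assumes each layer has been exported as a list of (first_row, last_row) pairs with inclusive row numbers; that representation and the function name misaligned_spans are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: a span is (first_row, last_row), inclusive.
Span = Tuple[int, int]


def misaligned_spans(orig_spans: List[Span], norm_spans: List[Span]) -> List[Span]:
    """Return normalized spans whose edges do not line up with original spans.

    A normalized span is treated as aligned when it starts on a row where
    some original span starts and ends on a row where some original span ends.
    """
    orig_starts = {start for start, _ in orig_spans}
    orig_ends = {end for _, end in orig_spans}
    return [
        (start, end)
        for start, end in norm_spans
        if start not in orig_starts or end not in orig_ends
    ]


if __name__ == "__main__":
    orig = [(1, 2), (3, 3), (4, 6)]
    norm = [(1, 3), (4, 5)]
    # The second normalized span stops one row short of the original span.
    print(misaligned_spans(orig, norm))
```

An empty result means every normalized span begins and ends on an original span boundary; any span returned should be inspected by hand.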

2. Add/correct metadata on new documents

  • Confirm that all metadata conforms to the standards in the layer annotation documentation.
  • Pay close attention to names of annotators, version number, and version date for documents. We now use the SAME version # and date on new documents as on corpus metadata. Give the new document the same version # and date as the updated version # and date going in the corpus metadata (see step 5 below). (This means a NEW document will only have a 1.0.0 version number if the corpus is also brand new.)

3. Check the Issues list for each corpus to be released (whether new or revised versions of documents) on GitHub.

Each corpus may have a list of errors noticed by users or team members. (E.g., https://github.com/CopticScriptorium/ap-dev/issues/35). Make corrections, and note on the issues list that the corrections have been made.

4. Add/correct metadata on edited, previously published documents.

  • Confirm that all metadata conforms to the standards in the layer annotation documentation.
  • Pay close attention to names of annotators, version number, and version date for documents. Versioning:

We now use the SAME version # and date on revised documents as on corpora. Give the revised documents the same version # and date as the updated version # and date going in the corpus metadata (see step 5 below). Note: an annotator may have made a minor change a while back and changed the version # and version date, even though the revised document has not yet been published. We do not republish a corpus every time we make a minor revision to one document. You may wish to check the document's version # against the number in ANNIS.
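
A quick way to spot documents whose version fields have drifted from the corpus metadata is sketched below. It assumes the metadata has already been read into plain dictionaries; the field names version_n and version_date and the function name stale_documents are assumptions made for this example.

```python
from typing import Dict, List

# Assumed metadata field names; adjust to the names used in the project.
VERSION_KEYS = ("version_n", "version_date")


def stale_documents(corpus_meta: Dict[str, str],
                    docs: Dict[str, Dict[str, str]]) -> List[str]:
    """Return names of documents whose version # or date differs from the corpus."""
    return [
        name
        for name, meta in docs.items()
        if any(meta.get(key) != corpus_meta.get(key) for key in VERSION_KEYS)
    ]


if __name__ == "__main__":
    corpus = {"version_n": "2.1.0", "version_date": "2017-01-27"}
    documents = {
        "doc_a": {"version_n": "2.1.0", "version_date": "2017-01-27"},
        "doc_b": {"version_n": "2.0.0", "version_date": "2016-06-01"},
    }
    print(stale_documents(corpus, documents))
```

Documents that are listed need their version # and date updated before release; a document with a missing field is also reported.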

5. Add/correct the corpus metadata.

Corpus metadata appears on the first document in a corpus.

  • Confirm that all metadata conforms to the standards in the layer annotation documentation.
  • Pay close attention to names of annotators: the names of all annotators of all documents in a corpus should be in the corpus metadata; if someone has edited one document, be sure that person's name appears in the corpus metadata.
  • Version date should be the date of re-release.
  • Version #: +1.0.0 for a major change to data and/or structure (entirely new layer annotation, entirely new tokenization method applied, etc.); +0.1.0 for significant edits that remain structurally compatible with previous versions; +0.0.1 for minor edits (e.g., fixing reported errors in transcription or pos-tags).
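
The version rule above can be expressed as a small Python helper. This is a sketch under one stated assumption: when a higher component is incremented, the lower components are reset to zero, as is common practice. The checklist itself only gives the increments, so adjust if the project adds them literally. The function name bump_version and the change labels are hypothetical.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next version string for a 'major', 'significant', or 'minor' change.

    Assumption: components below the one being incremented are reset to 0.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":        # new layer annotation, new tokenization method, etc.
        return f"{major + 1}.0.0"
    if change == "significant":  # significant edits, still structurally compatible
        return f"{major}.{minor + 1}.0"
    if change == "minor":        # e.g. fixing reported transcription or pos-tag errors
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


if __name__ == "__main__":
    print(bump_version("1.2.3", "minor"))
```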

6. Validate the file.

7. Convert to TEI, PAULA, and relANNIS and publish on the SCRIPTORIUM ANNIS server.

Typically performed by AZ.

8. Check ANNIS visualizations to be certain there are no obvious bugs in the corpora or stylesheet.

  • Edit files as necessary.
  • If files have been published on the public server, be sure to update the version number and version date for corpora and files.
  • Repeat steps 7 & 8 if there are significant problems and/or edits.

9. Convert to TEI XML.

  • Each document in the corpus will be converted to TEI XML using the converter program developed by Amir Zeldes. (AZ typically does this step.)
  • Confirm that the document validates against the EpiDoc TEI schema: http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng
  • Edit as necessary if validation fails.
  • If files have been published on the public server, be sure to update the version number and version date for corpora and files edited post-conversion.
  • Re-convert to TEI XML after editing, check validation, update versioning; repeat as necessary.

10. Convert to PAULA & relANNIS and publish on ANNIS server

Typically performed by AZ.

11. Post TEI, relANNIS and PAULA files to GitHub public repository in their respective directories.

E.g., https://github.com/CopticScriptorium/corpora/tree/master/AP/apophthegmata.patrum_PAULA for the PAULA XML files of the Apophthegmata Patrum (AP) corpus.

12. Create a new release of the GitHub corpora repository, posting information about the latest changes in the release.

At https://github.com/CopticScriptorium/corpora/releases, click “Draft New Release.” Give it a new version number. (It should be the same number as the new corpus and document version #s.) Describe the corpus and the changes/additions in the description.

13. New ingest at data.copticscriptorium.org to account for new data.

Create new corpora, visualizations, etc., if necessary (see documentation in wiki for this application).

checklist_for_publishing_corpora.txt · Last modified: 2017/01/27 14:36 by admin