Part of speech tagging

The norm layer is tagged with the tree-tagger part-of-speech (pos) tagger. (The morph layer is tagged by the language-of-origin tagger.)

To tag for part of speech, you will need the Tree-tagger package and our training corpus.

To tag the norm layer:

  • Select the norm column, copy it, and paste it into a text file.
  • Make sure the text file uses Unix line endings and UTF-8 encoding.
  • Save the text file in the directory that contains the tree-tagger script (for Macs, the tree-tagger-MacOSX-3.2-intel/bin/ directory).
  • Make sure the coptic_fine.par file from the larger tree-tagger folder is also in the “bin” folder.
  • Open a terminal window at that directory.
  • Run the tree-tagger, e.g., by typing something like ./tree-tagger coptic_fine.par -token inputFileName outputFileName
  • Open outputFileName, then copy and paste the data into empty columns in the Excel file.
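The file preparation described above can also be scripted. The following is a minimal Python sketch, not part of the original workflow: the function names write_tagger_input and tagger_command are hypothetical, and the command simply repeats the example invocation quoted on this page.

```python
def write_tagger_input(column_text, path):
    """Save pasted column text as UTF-8 with Unix (LF) line endings."""
    # Convert Windows (CRLF) and old Mac (CR) line endings to LF.
    text = column_text.replace("\r\n", "\n").replace("\r", "\n")
    if not text.endswith("\n"):
        text += "\n"
    # newline="\n" stops Python from translating line endings on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)


def tagger_command(input_name, output_name, par_file="coptic_fine.par"):
    """Build the example command from this page as an argument list."""
    # Assumption: the argument order mirrors the example invocation above.
    return ["./tree-tagger", par_file, "-token", input_name, output_name]
```

The list returned by tagger_command could be passed to subprocess.run from inside the “bin” directory; the sketch only builds the command and does not run it.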

If you have an additional morph annotation layer, the tree-tagger will not have respected the resulting spans in the norm layer. You will need to search your ORIGINAL norm layer for spans and make sure they are aligned properly with the tagger output. You can check this manually in a variety of ways; one example of manual correction of the data is shown in a video. Watch the video before running the tagger.

  • Select the ORIGINAL norm column (not the one you just pasted in; to be safe, you might rename the new one ignore:norm or something like that).
  • Click the “unmerge cells” button to unmerge the spans.
  • Using the Find function, find the next empty cell. (If the norm column is selected, Find will only look for empty cells in that column.)
  • In the norm column, select the empty cell and the cell above it; merge the two cells.
  • Make sure the pos is aligned with the proper norm cell. You may need to select the relevant cells in ignore:norm and pos and use Insert>Cell to insert a cell above each of them.
  • Merge the cell in pos with the blank cell so that it corresponds to the span in norm. (You do not have to merge the cell in ignore:norm; you will delete this layer soon.)
  • Select the ORIGINAL norm column again, and repeat until you are done.
  • Select ignore:norm and delete the column.
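As a rough aid for the Find step, the two columns can be compared programmatically to locate where they first fall out of step. This is a minimal Python sketch under the assumption that both columns have been exported as lists of cell values, with an empty string for each blank cell; first_misaligned_row is a hypothetical helper name.

```python
def first_misaligned_row(norm_cells, tagged_tokens):
    """Return the 0-based index of the first row where the columns differ.

    norm_cells: the ORIGINAL norm column, unmerged ("" for blank cells).
    tagged_tokens: the token column pasted back from the tagger output.
    Returns None when the two columns agree row for row.
    """
    for row, (norm, token) in enumerate(zip(norm_cells, tagged_tokens)):
        if norm != token:
            return row
    # All shared rows agree; a length difference is still a misalignment.
    if len(norm_cells) != len(tagged_tokens):
        return min(len(norm_cells), len(tagged_tokens))
    return None
```

After each manual correction the check can be repeated to find the next row that needs attention.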

You can avoid this manual proofreading by pre-processing your data before running the tagger, as follows:

  • Make a copy of norm in a new sheet
  • Unmerge all spans
  • Add a new column with a serial ID (1, 2, 3…): put a “1” in the first cell, select the cells of the new column, and go to Edit>Fill>Series in the menu bar. When selecting, use only as many rows as have data, not the whole column.
  • Sort by norm to get all the blanks together
  • Sort the non-blanks by ID
  • Tag the sorted non-blank rows
  • Paste the tags back in
  • Sort everything back by ID again
  • Auto-stretch spans down
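The pre-processing steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's own tooling: the norm column is given as a list with an empty string for every blank (unmerged) cell, tag_tokens is a hypothetical stand-in for running the tagger on a list of tokens, and "stretching spans down" is modelled by repeating the tag of the row above.

```python
def tag_with_spans(norm_cells, tag_tokens):
    """Tag only the non-blank norm cells, then realign tags with the rows.

    norm_cells: unmerged norm column; "" marks a row continuing a span.
    tag_tokens: callable taking a list of tokens and returning one tag each
                (hypothetical stand-in for the tagger run).
    """
    # Serial ID per row; here the 0-based row index plays that role.
    non_blank = [(row_id, cell) for row_id, cell in enumerate(norm_cells)
                 if cell.strip()]
    # Tag the non-blank rows in their original order.
    tags = tag_tokens([cell for _, cell in non_blank])
    if len(tags) != len(non_blank):
        raise ValueError("tagger returned a different number of tags")
    # Paste the tags back at the rows they belong to.
    pos = [""] * len(norm_cells)
    for (row_id, _), tag in zip(non_blank, tags):
        pos[row_id] = tag
    # Stretch spans down: a blank norm row inherits the tag above it.
    for row_id in range(1, len(pos)):
        if not norm_cells[row_id].strip():
            pos[row_id] = pos[row_id - 1]
    return pos
```

Filtering by row index replaces the two sorting steps: the blanks are set aside, and the non-blank rows stay in ID order.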

You can also use Google Refine’s facet search to check your pos-tags.
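A similar check can be approximated outside Google Refine. The sketch below is an assumption-laden Python illustration (the function names are hypothetical): it counts how often each tag occurs and lists the distinct tokens that received each tag, so that rare or unexpected tags stand out.

```python
from collections import Counter


def tag_counts(tags):
    """Count how often each pos tag occurs, like a text facet on the pos column."""
    return Counter(tags)


def tokens_by_tag(tokens, tags):
    """Map each tag to the sorted list of distinct tokens that received it."""
    grouped = {}
    for token, tag in zip(tokens, tags):
        grouped.setdefault(tag, set()).add(token)
    return {tag: sorted(values) for tag, values in grouped.items()}
```

Tags that occur only once or twice, or tokens that appear under more than one tag, are good candidates for a manual look.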

part_of_speech_tagging_using_tree-tagger.txt · Last modified: 2015/09/09 06:30 by ctschroeder