Introduction

The TEI format is the standard format for transcribing scholarly texts in the humanities. Unlike common word-processing formats, TEI focuses on the semantics of text elements instead of their look and feel. This is great for research purposes, as it enables software to answer questions like "give me all citations in Shakespeare where Hamlet uses the past tense" (provided, of course, that somebody has encoded that data), but it makes it harder to produce nicely formatted output from the same source. This is where the tei2html stylesheets come in.

tei2html is a collection of XSLT stylesheets that transform a properly marked-up TEI file into an attractive HTML, ePub or PDF document. It is intended to work out of the box with unadorned TEI files, but will perform better if certain conventions are followed.

These guidelines describe the conventions for preparing TEI files such that they can successfully be converted by tei2html. They assume a working knowledge of the TEI Lite documentation. They are inspired by, and often follow, the Wisconsin University Guidelines for Markup of Electronic Texts, which provide excellent examples of numerous issues. For absolute beginners, a very gentle introduction to the TEI markup language is available.

Whether TEI is useful for you depends on your needs. If you only occasionally produce a text and need to have it formatted, TEI and tei2html are probably not the way to go. If you have a large collection of texts, need to maintain them for a long time, and would like to add numerous types of scholarly annotations to them, TEI certainly warrants serious consideration.

Originally, tei2html was developed to produce beautiful new editions of public domain texts for Project Gutenberg, but it can also be used for other TEI files.

Guiding Principles

The design of tei2html is based on a small set of guiding principles and design decisions. These are assumptions that I believe are reasonable when digitizing text for research and preservation.

The guiding principles are:

  • Tags supplement the plain text content of the transcribed work. They do not replace content. When all tags are removed from the file, the remaining text should reflect the original source text as much as possible. As a corollary to this principle, tei2html does not supply much content itself. Unless specifically asked to do so, it will not insert tables of contents, headers, labels, etc.

  • Tags are semantic: they reflect as much as possible the function of a certain part of the text, not its appearance. This also means that tei2html needs to rely on a number of defaults and formatting hints to decide what things should ultimately look like in the output.

  • The @rend, @style and @rendition attributes in tags are intended as formatting indications only. Ignoring them fully or partially should not render a text incomprehensible. (Note that the TEI standard itself prescribes that these attributes should be used to indicate the formatting of the source, rather than the desired output.)

The design decisions are:

  • The @rend attribute values are "rendition ladders" designed to map relatively directly to CSS format statements in the HTML version. However, they use a different syntax and are also used for effects that CSS does not support, so they are not copied verbatim.

  • The @rend attribute can also be used to refer to HTML classes, which can be defined as <rendition> elements in a <tagsDecl> section.

  • CSS can be imported from an external source, and attached to IDs or classes in the generated output. To make this workable, IDs in the source are passed to the output unchanged (when reasonable), and the generated classes are documented. Also, a class to be output can be specified in the @rend attribute.

  • Similarly, the value of @style attributes is copied verbatim to the CSS output, as is the content of <rendition> elements. Note that currently no translation of the @selector attribute of the <rendition> element takes place.

As a result of these design decisions, a document can be neatly formatted with CSS with minimal rendering information 'polluting' the TEI master file. Note that to meet ePub restrictions on CSS, no inline (that is, in an attribute on an HTML element) CSS is generated.
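As a rough illustration of these design decisions, the fragment below sketches three places where formatting information can live in a TEI file. It is a sketch under assumptions, not taken from the tei2html sources: the rendition id sc, the CSS declarations and the sample text are invented, and the P5-style xml:id and #sc pointer syntax is assumed (P4 files would use a plain id). Only font(fraktur) is a rendition-ladder value that appears elsewhere in this document.

```xml
<!-- Fragment only: these elements live in different parts of a TEI file. -->

<!-- In the header: a rendition element whose content is CSS.
     The id "sc" and the CSS declaration are hypothetical. -->
<tagsDecl>
    <rendition xml:id="sc">font-variant: small-caps;</rendition>
</tagsDecl>

<!-- In the text: three ways to attach formatting hints. -->
<p>
    <hi rendition="#sc">Small capitals</hi> via a pointer to the rendition element,
    <hi style="letter-spacing: 0.2em;">spaced text</hi> via CSS in the style attribute, and
    <hi rend="font(fraktur)">Fraktur</hi> via a rendition ladder in the rend attribute.
</p>
```

If all three attributes were ignored, the sentence would still read correctly, which is what the third guiding principle asks for.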

SGML versus XML

Originally, TEI was defined in SGML. Since version P4, TEI has moved to XML, a considerably simplified reincarnation of SGML that does away with much of its complexity. However, TEI (and TEI Lite) was originally developed for SGML, and I started using TEI even before XML was conceived.

Although XML is simpler, and far more tools can deal with it, SGML is somewhat easier to type due to its more relaxed syntax rules. You don’t need to use quotes on all attribute values, nor do you need to provide close tags for all elements. For this reason, I normally produce my files in SGML. The conversion to XML is straightforward, and can (almost always) be fully automated. Since I am used to SGML, the examples in this document will be valid SGML snippets, but not necessarily well-formed XML.
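To make the difference concrete, here is a sketch of one fragment in both forms. The content and the id ch1 are invented, and the SGML form assumes that the DTD in use permits omitting the paragraph end tag. The SGML form is shown inside a comment because it is not well-formed XML.

```xml
<!-- SGML form: unquoted attribute values, paragraph end tags omitted:

     <div1 id=ch1 n=1>
     <head>Chapter One</head>
     <p>First paragraph.
     <p>Second paragraph.
     </div1>
-->

<!-- The same fragment as well-formed XML. -->
<div1 id="ch1" n="1">
    <head>Chapter One</head>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
</div1>
```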

Of course, you can work directly in XML instead of SGML, and the stylesheets are supposed to work with the P4 and P5 incarnations of TEI as well (although not all elements, even those of P3, are supported). Working in XML directly also has a number of benefits:

  1. Directly viewable in most modern browsers, which can apply an XSLT transform on the fly. In practice, however, the tei2html stylesheets are too complex for a browser, and no current browser supports XSLT 3.0 directly. JavaScript implementations do exist, but I have no experience with them.

  2. No initial conversion tool required.

  3. Can use embedded namespaces for other XML schemas.

  4. Can use one or more of the various XML authoring tools available.

Considerations for Output Formats

TEI files are typically prepared from pre-existing sources, that is, printed books, magazines or other textual works. In those cases, it may be desirable not only to include the logical content of the work being transcribed, but also to describe the physical appearance of some aspects of the text. The @rend, @style and @rendition attributes can be used to record those aspects.

For many uses, a TEI file is not the most desirable format. For reading, it is far more convenient to have it available in some more presentation-oriented format. With tei2html, a TEI document can be converted to various output formats, which are:

  • HTML for viewing in a browser on a PC.

  • ePub for reading on dedicated eBook readers.

  • Plain text for reading on simple devices (supported using Perl, but only in a fairly limited way that requires some manual cleanup).

  • PDF for printing on paper (supported using Prince-XML, using specially adjusted HTML output).

These output formats place some restrictions on the structure of the TEI file, described in the following usage conventions. These usage conventions are not enforced by the DTD.

References to page numbers. In printed media, the page numbers produced in the output will differ from the source. To correctly replace page numbers in the source, each page number should be tagged as a reference (ref element) with @type=pageref. The tag should only enclose the page number itself, not any of the surrounding material. The transformation will replace the content of this tag with the actual number of the page the material referred to appears on. Almost always, the reference is not to the page itself, but to some element appearing on this page. For this reason, it is better not to link to the pb element, but to the element actually referred to. (In HTML output, the original page numbers will be shown in the margin and used in the text.)
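A page reference tagged along these lines might look like the sketch below. The id fig12, the page number and the wording are invented for illustration, and the P4-style target (a bare id without a leading #) is an assumption; P5 files would write target="#fig12".

```xml
<!-- The element actually referred to; the id "fig12" is hypothetical. -->
<figure id="fig12">
    <head>Map of the island.</head>
</figure>

<!-- Elsewhere in the text: only the page number itself is enclosed in the
     ref element, and the target is the figure, not the pb element. -->
<p>See the map on page <ref target="fig12" type="pageref">143</ref>.</p>
```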

Nesting of cross-references and anchors. In HTML, it is impossible to nest anchors and cross-references. As a result, certain elements in the TEI source should not be nested, as these will result in invalid nested anchors in HTML. For example, the ref element should not contain corr elements (as the latter generates an anchor for the automatically generated errata list). The proper way to resolve this is to place the ref element inside the corr element.
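The nesting rule for ref and corr can be sketched as follows. The id ch2 and the misprint are invented, and the P4-style corr element with a sic attribute is assumed.

```xml
<!-- Avoid: corr inside ref leads to an anchor nested inside a link in HTML. -->
<p>See <ref target="ch2">Chapter <corr sic="Tow">Two</corr></ref>.</p>

<!-- Prefer: ref inside corr, with the correction covering the whole reference. -->
<p>See <corr sic="Chapter Tow"><ref target="ch2">Chapter Two</ref></corr>.</p>
```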

Nesting of paragraphs. In HTML, it is impossible to nest paragraphs. Usually, the transformation software will take this into account, and close (and re-open) open paragraphs in HTML when needed.

In a few other cases, the transformation may not produce entirely correct output. Always validate the result with tools such as tidy or epubcheck.

Special Characters in SGML files

For XML files actually used by the XSLT stylesheets, any character encoding supported by XML/XSLT works. For SGML, the options are more limited.

Also, the pre-processing scripts in Perl can only deal with either pure 7-bit ASCII or the ISO 8859-1 character set. All characters outside those ranges are to be represented by character entities.

Use entities from the following sets:

  • The standard SGML ISO 8859 entity sets.

  • The other declared entity sets that come with the TEI DTD.

  • Entities of your own invention, named with a descriptive abbreviation. Always provide the Unicode code point for a character (if it exists) in the entity declaration, and provide the Unicode character name or a description in a comment. Please follow the pattern used by ISO where possible.

  • Numeric character entities, based on Unicode.
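A declaration for a self-invented entity might look like the sketch below. The entity name schwa is chosen purely for illustration (check the declared sets before inventing a name), and the declaration is shown in XML form inside an internal subset; the exact form expected by the SGML tool chain may differ.

```xml
<!DOCTYPE TEI.2 [
    <!-- U+0259 LATIN SMALL LETTER SCHWA -->
    <!ENTITY schwa "&#601;">
]>
<!-- In the text, the character is then written as &schwa; -->
```

The decimal reference 601 equals hexadecimal 0259, so the code point in the comment and the replacement text agree.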

For longer fragments in a non-Latin script, I normally use an ASCII-based transliteration scheme, and supply tools (called patc) to convert these to Unicode. For documents completely in a non-Latin script, it will probably be better to work with Unicode in a suitable editor (and to use XML directly).

Fraction entities

Fractions of one figure above and below.

&frac12;

Fractions with more than one figure above or below.

&frac3_16; &frac23_100;

Special filling characters for leaders

Future plan

&dotfil; &dashfil; &linefil; &spacefil;

These are roughly equivalent to TeX’s special glue values.

Note: a better approach will be to use the <space> tag, with appropriate rendering information, e.g.

<space rend="leader(dotted)"/>

or

<space dim="horizontal" rend="leader(' &ndash; ')"/>

As long as CSS leaders are not directly supported by browsers, we can use the method outlined here to render them in HTML: https://www.w3.org/Style/Examples/007/leaders.en.html

Prerequisites and Installation

See the page on installation.

Note that currently no installer is present, so you will need to get the code from the repository directly.

Running from the command line

Most of the interaction on the command line is with small Perl scripts that start the actual XSLT processor, and some other scripts.

Before you can run the Perl scripts, you will need to make sure the paths to the various executables are configured correctly. Perl should be able to find the location of your scripts (when running perl with the -S option).

The various .jar files (for Saxon and epubcheck) and the SGML catalog files need to be located in a specific subdirectory of the directory where the Perl scripts reside.

Running interactively

No easy window-based interface is yet available. I am working on a small HTML application to run the scripts interactively.

Running with Oxygen

You can use the stylesheets with Oxygen, provided your sources are in XML format. Make sure you select an XSLT 3.0 processor to perform the transformation.

Required Downloads

The tei2html code has a range of dependencies on third-party software. This page lists those.

Saxon

An XSLT 3.0 processor is required for tei2html. Any XSLT 3.0 processor should work; however, I’ve developed these stylesheets with Saxon (using the freely available version, Saxon-HE).

You should download a reasonably recent version of the Saxon-HE product from saxonica.com. I will take care that tei2html continues to work with the free versions of Saxon, no matter how tempting the additional features in the paid versions are (such as higher-order functions, binary file handling, etc.).

Java

Saxon-HE requires Java.

Make sure the java executables can be found on the path.

If you do not have Java, you can download it from http://java.com/en/

Perl

If you are planning to use the provided Perl scripts to glue things together, you will need a Perl interpreter.

For Windows, my advice is to download Strawberry Perl. Use either the 32-bit or the 64-bit version, to match your system.

Note: Upgrading Strawberry Perl does not work properly. Please safeguard local installations in your site directory before upgrading, as the uninstaller will throw those files away.

Packages used

  • Image::Size

  • Lingua::BO::Wylie (Only for Tibetan support: download from www.thlib.org)

  • HTML::Entities

  • Text::Levenshtein::XS

  • Statistics::Descriptive

  • File::Basename

  • Getopt::Long

  • Image::Info

  • MIME::Base64

  • XML::XPath

  • Unicode::Normalize

  • Roman

If you are missing a package, it can easily be installed using CPAN: cpan install <package>.

After installation, copy the files LanguageNames.pm and SgmlSupport.pm to the local site library (e.g., Strawberry\perl\site\lib).

SX, NSGML

Optional, only needed if you want to use SGML as source format.

SX is an SGML to XML translator, and NSGML is an SGML validator; both are part of the SP package. Since XSLT only supports XML, you will need both tools to be able to work with SGML. They can be downloaded from James Clark’s website. For Windows, get sp 1.3.4.

To enable SX and NSGML to understand your document types, you need to configure a catalog of DTDs (which maps public DTDs to local resources containing their definitions). The scripts assume this is located in a file named dtd/CATALOG.

You will need to add the teilite.dtd to the CATALOG. This DTD can be found here: http://www.tei-c.org/Vault/P4/Lite/DTD/

A short explanation of Catalog files can be found in the SP documentation on James Clark’s website referenced above.

As an alternative to SX and NSGML, you can use osx and onsgml from the OpenJade Project. These are slightly more up-to-date. You will have to update the names in the main tei2html.pl script in that case.

ZIP

To compress ePub files, you will need a zip utility. tei2html uses info-zip to handle the peculiar requirements of the ePub format. (See e.g., this blog entry on creating ePub files.)

Node.js

Optional, only needed when you use mathematical formulas in TeX notation.

To convert mathematical formulas in TeX format to a format that can be included in static HTML pages, you will need to install the following:

Patc

Optional, only needed when you use the transcription schemes for non-Latin scripts I use in SGML.

Patc (pattern changer) is a small utility written in C that performs multiple find-and-replace actions in a single, efficient pass. I mostly use it to convert the transliterations of non-Roman scripts I’ve used; if you don’t use those, you will not need it. You will need a C compiler to build it. (I’ve successfully compiled it on a variety of platforms, including Unix and Windows; a solution file for Visual Studio 7.0 and a makefile for Linux are included. Contact me if you need a Windows binary.)

epubcheck

Optional, only needed if you want to check generated ePubs.

Epubcheck is a tool to validate ePub books; it can be obtained from: https://github.com/IDPF/epubcheck. Note that tei2html doesn’t automatically generate correct ePubs: you can still do a lot of things that make ePubs non-conformant, for example by including CSS3 constructs, or by referring to resources in CSS and not including them in the ePub spine manually.

This tool also requires Java; the scripts assume you use epubcheck-3.0.1.jar, placed in the tools/lib subdirectory.

Prince XML

Optional, only needed if you want to generate PDF output.

Note that Prince is a commercial product; a free version can be used for strictly private purposes, and downloaded from: https://www.princexml.com/. The free version does include a small icon on the first page to promote Prince. This should not be a big nuisance.

XSLTdoc

Optional, only needed if you want to browse through the documentation of the code.

You can clone XSLTdoc (at time of writing version 1.3.3) from GitHub, where you can also find the documentation.

Configuration

Environment variables

To run tei2html from the command line, it is practical to configure the following environment variables:

  • set TEI2HTML_HOME to the location of the checked-out tei2html directory.

  • set SAXON_HOME to the location where Saxon is installed.

  • (optional) set PRINCE_HOME to the location where Prince is installed.

Processing Steps

For my internal processing, I follow a number of steps to go from my master TEI file to HTML or ePub output. These are described here.

Step 1: Convert from SGML to XML (Optional)

Most of my TEI files are in (old-school) SGML format with the '.tei' extension. These need to be converted to XML before an XSLT processor can do anything with them.

Step 1.1: Convert transcriptions to SGML entities.

Some of my TEI files include snippets of non-Latin script in various transcription formats. These need to be converted to SGML entities (either named or numeric) before we can actually transform to XML. For this, I use a 'patc' tool that will change such transcriptions. For each transcription, a separate run is needed. The tei2html.pl script will check for such transcriptions, and run the right patc tool as needed. (For most texts, this is irrelevant.)

Step 1.2: Convert to XML.

The actual SGML to XML conversion is achieved by running James Clark’s SX tool.

Step 1.3: Normalize case.

Use Sebastian Rahtz' tei2tei.xsl stylesheet to use the proper casing for TEI elements. After this step, the TEI should be valid TEI (according to TEI P4) in XML form. The tei2html stylesheets should be able to deal with the text now.

Step 1.5: Convert to TEI P5. (Optional)

Use Sebastian Rahtz' p4top5.xsl stylesheet to convert the TEI P4 to TEI P5. The tei2html stylesheets should still be able to deal with the text, although I’ve not extensively tested this. The main distinction is that div0 elements will have been changed to div1 elements, and all underlying numbered divX elements have been 'elevated,' which impacts the appearance and breaking up into chunks when generating ePubs.
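The renumbering of divisions described here can be sketched as follows. The fragment is invented for illustration and shows only the change in division levels; whatever else p4top5.xsl changes (namespaces, attribute names and so on) is deliberately left out.

```xml
<!-- Before conversion (P4, numbered divisions starting at div0). -->
<div0 type="part">
    <div1 type="chapter">
        <p>Text of the chapter.</p>
    </div1>
</div0>

<!-- After conversion, as described above: each level shifts by one. -->
<div1 type="part">
    <div2 type="chapter">
        <p>Text of the chapter.</p>
    </div2>
</div1>
```

Because the ePub output is split into chunks based on division level, a chapter that was a div1 before conversion is handled as a div2 afterwards.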

Step 2: Run Saxon to generate HTML and ePub

Now that we have XML, we can apply the XSLT transformation on it. This will result in a single HTML output file. Here, again a few intermediate steps are needed.

Step 2.1: Normalize tables

Table handling in TEI and HTML is complicated. To deal with tables correctly, we have a separate XSLT stylesheet (normalize-table.xsl) that adds row and column numbers to cells in tables (leaving the rest of the XML untouched). With those numbers present, styling can be applied to tables more reliably. This stylesheet will also warn you when you have tables where the effective number of columns (after dealing with cells that span multiple rows or columns) differs between rows.
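As an illustration of what "effective" columns means, consider the invented table below. The first two rows both count as three columns, because the cell with cols="2" occupies two positions; the last row has only two and is the kind of row the warning is about. The attributes that normalize-table.xsl itself adds are not shown, since their names are not given here.

```xml
<table>
    <row>
        <cell cols="2">Spans two columns</cell>
        <cell>Third column</cell>
    </row>
    <row>
        <cell>First</cell>
        <cell>Second</cell>
        <cell>Third</cell>
    </row>
    <!-- Only two effective columns: inconsistent with the rows above. -->
    <row>
        <cell>First</cell>
        <cell>Second</cell>
    </row>
</table>
```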

Step 2.2: Transform to HTML

After the tables have been normalised, we’re ready to transform to HTML. This is done by the main XSLT stylesheet tei2html.xsl, which pulls in the various partial stylesheets.

Step 2.3: Transform to ePub

Similar to the conversion to HTML, the conversion to ePub 3.0 is done with the XSLT stylesheet tei2epub.xsl, which re-uses most of the stylesheets used by tei2html.xsl. The results of this transform are placed in a directory which mirrors the internal structure of an ePub archive.

Step 2.4: Package ePub

In a final step, the ePub files are compressed into a single zip file, following the conventions of an ePub container.
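For orientation, an ePub container always holds a file META-INF/container.xml that points to the package document. The sketch below follows the general ePub container format and is not copied from tei2html output; in particular, the path OEBPS/content.opf is an assumption, and the directory layout actually generated may differ.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- META-INF/container.xml: tells a reading system where the package document is. -->
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
    <rootfiles>
        <!-- The full-path value is hypothetical. -->
        <rootfile full-path="OEBPS/content.opf"
                  media-type="application/oebps-package+xml"/>
    </rootfiles>
</container>
```

One of the peculiar requirements of the format is that a file named mimetype must be the first entry in the zip archive and must be stored uncompressed.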

Configuration

Currently, there are several options to change the generated output. These can be configured by using a file named tei2html.config in which various parameters can be set. If you add such a file to the directory where your TEI file is located, the tei2html.pl Perl script will automatically pick it up.

It is only necessary to specify those items that differ from the default in your configuration file.
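A minimal tei2html.config might therefore look like the sketch below. The element names are taken from the default configuration; the chosen values are only examples, and the language code nl-NL is an assumption modelled on the en-US default. It also assumes that any general setting can be overridden inside an output element.

```xml
<tei2html.config>
    <!-- Main language of the text (value is an example). -->
    <language>nl-NL</language>

    <!-- Do not number the generated TOC entries. -->
    <toc.numberEntries>false</toc.numberEntries>

    <!-- Count footnotes over the whole text instead of per chapter. -->
    <notes.foot.counter>text</notes.foot.counter>

    <!-- Override a setting for ePub output only. -->
    <output format="epub">
        <pageNumbers.show>true</pageNumbers.show>
    </output>
</tei2html.config>
```

All settings not mentioned keep their default values.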

Current Status

The default configuration file is given below. Its actual value can be found in modules/configuration.xsl.

<tei2html.config>
    <debug>false</debug>                                            <!-- Use debug mode (uses CSS to color various elements in output HTML). -->
    <logLevel>INFO WARNING ERROR DEBUG</logLevel>                   <!-- Log levels: DEBUG, INFO, WARNING, ERROR -->

    <debug.facsimile>false</debug.facsimile>                        <!-- Insert links to the PGDP proofing images in the right margin -->

    <language>en-US</language>                                      <!-- Main language of text (if not specified explicitly with the @lang attribute on the text element). -->
    <defaultLanguage>en-US</defaultLanguage>                        <!-- Default language for localization. -->

    <drama.inline.speaker>false</drama.inline.speaker>              <!-- Inline the speaker (default the speaker is a separate paragraph) -->

    <lb.preserve>true</lb.preserve>                                 <!-- Preserve line breaks indicated with the lb element. -->
    <lb.hyphen.remove>false</lb.hyphen.remove>                      <!-- Remove hyphens before line-breaks. -->
    <lb.removable.hyphen>&not;</lb.removable.hyphen>                <!-- Character used for removable hyphen before a line-break (DTA convention). -->
    <lb.hyphen>-</lb.hyphen>                                        <!-- Character used for non-removable hyphen before line-break. -->

    <toc.numberEntries>true</toc.numberEntries>                     <!-- Provide numbers with generated TOC entries. -->
    <toc.defaultEntries>false</toc.defaultEntries>                  <!-- Use generic heads in entries in the TOC, if no head is present. -->

    <pg.includeHeaders>false</pg.includeHeaders>                    <!-- Include Project Gutenberg headers and footers. -->
    <pg.includeComments>false</pg.includeComments>                  <!-- Include references to Project Gutenberg in comments. -->
    <pg.compliant>false</pg.compliant>                              <!-- Only use HTML and CSS constructs that are compliant with Project Gutenberg guidelines. -->

    <showParagraphNumbers>false</showParagraphNumbers>              <!-- Output paragraph numbers, using the value of the @n attribute. -->
    <includeAlignedDivisions>true</includeAlignedDivisions>         <!-- Include divisions indicated by "align-with-document()" -->
    <useRegularizedUnits>false</useRegularizedUnits>                <!-- Use the regularized units specified in the measure-tag. (false: both are shown, the original in the text, the regularized units in a
                                                                         pop-up; true: regularized in text, original in pop-up) -->
    <xref.show>always</xref.show>                                   <!-- Method used to generate external links, possible values:
                                                                         - always:   external links are active at the location in the text.
                                                                         - never:    external links are not shown (only the anchor text is).
                                                                         - colophon: external links are active in the colophon (including in the external-links table, if generated).
                                                                      -->
    <xref.table>false</xref.table>                                  <!-- Collect all external links in a separate table in the colophon. -->
    <xref.exceptions>https://www.pgdp.net/; https://www.gutenberg.org/; pg:; music/; images/</xref.exceptions>  <!-- Semicolon-separated list of external URLs that can always be used. -->

    <punctuation.hanging>false</punctuation.hanging>                <!-- Use hanging punctuation (by generating the relevant CSS classes. This requires tweaking, depending on the font used). -->

    <ditto.enable>true</ditto.enable>                               <!-- Use ditto marks in ditto (deprecated) or seg[@copyOf] elements. -->
    <ditto.mark>,,</ditto.mark>                                     <!-- The symbol to use as a ditto mark. May also be overridden by rend attribute ditto-mark() -->
    <ditto.repeat>word</ditto.repeat>                               <!-- How often to use a ditto mark, possible values: word | segment. May also be overridden by rend attribute ditto-repeat() -->

    <pageNumbers.show>true</pageNumbers.show>                       <!-- Show page numbers in the right margin. -->
    <pageNumbers.before>[</pageNumbers.before>                      <!-- String to place before the page number in the right margin. -->
    <pageNumbers.after>]</pageNumbers.after>                        <!-- String to place after the page number in the right margin. -->

    <facsimile.enable>false</facsimile.enable>                      <!-- Output section with and links to facsimile images if required information is present. -->
    <facsimile.wrapper.enable>true</facsimile.wrapper.enable>       <!-- Generate HTML wrapper files for the images, and link to these instead of to the image. -->
    <facsimile.path>page-images</facsimile.path>                    <!-- Path where the HTML wrapper files will be generated. -->
    <facsimile.target></facsimile.target>                           <!-- Value of the target attribute of generated links in HTML (leave empty for default; _blank, _top, _parent, _self). -->

    <notes.foot.returnArrow>true</notes.foot.returnArrow>           <!-- Place a small up-arrow at the end of a footnote to return to the source location in the text. -->
    <notes.foot.counter>chapter</notes.foot.counter>                <!-- At what level to count footnotes, possible values: chapter or text. -->
    <notes.apparatus.noteMarker>&deg;</notes.apparatus.noteMarker>  <!-- Note marker used with text-critical notes (coded with place=apparatus) used at location in text. -->
    <notes.apparatus.returnMarker>&deg;</notes.apparatus.returnMarker> <!-- Note marker used with text-critical notes (coded with place=apparatus) used before note, to return to text. -->
    <notes.apparatus.format>block</notes.apparatus.format>          <!-- How to format text-critical notes: as separate paragraphs or as a single block. Possible values: paragraphs | block. -->

    <images.path></images.path>                                     <!-- Prefix of path to images, relative to the HTML file -->
    <images.include>true</images.include>                           <!-- Include images in the generated output. -->
    <images.requireInfo>true</images.requireInfo>                   <!-- Require image-info to be present for an image (otherwise they won't be included in output) [TODO]. -->
    <images.scale>1.0</images.scale>                                <!-- Image scale factor: 1.0 is normal size; 0.5 is half size; 2.0 is double size. -->
    <images.maxSize>100</images.maxSize>                            <!-- Warn if image is larger than this number of kilobytes. -->
    <images.maxWidth>720</images.maxWidth>                          <!-- Warn if image is wider than this number of pixels (after applying images.scale). -->
    <images.maxHeight>720</images.maxHeight>                        <!-- Warn if image is taller than this number of pixels (after applying images.scale). -->

    <audio.useControls>false</audio.useControls>                    <!-- Use controls for links to local audio (MP3, Midi, Ogg) formats (HTML5 only). -->

    <text.parentheses>()[]{}</text.parentheses>                     <!-- Pairs of parentheses, first opening, then closing. -->
    <text.quotes>&ldquo;&rdquo;&lsquo;&rsquo;&laquo;&raquo;&bdquo;&rdquo;</text.quotes> <!-- Pairs of quotation marks, first opening, then closing. -->
    <text.curlyApos>true</text.curlyApos>                           <!-- Replace a plain apostrophe (') with a right single quote. -->
    <text.spaceQuotes>true</text.spaceQuotes>                       <!-- Insert a hair space between consecutive quotation marks. -->
    <text.useEllipses>true</text.useEllipses>                       <!-- Replace three consecutive periods with an ellipsis character. -->
    <text.useIJLigature>false</text.useIJLigature>                  <!-- Replace ij with the ij-ligature (Dutch and letter-spaced text only). -->
    <text.normalizeUnicode>true</text.normalizeUnicode>             <!-- Normalize Unicode to NFC (may break Hebrew or Tibetan text in some rare cases) -->
    <text.abbr>i.e.; I.e.; e.g.; E.g.; A.D.; B.C.; P.M.; A.M.</text.abbr> <!-- Common abbreviations, list separated by semicolons. -->

    <table.classifyContent>false</table.classifyContent>            <!-- Attempt to determine the content-type of cells in a table; add relevant classes in the HTML output. -->

    <q.insertQuotes>false</q.insertQuotes>                          <!-- Insert quotation marks around <q> markup based on first two pairs in setting <text.quotes>. -->
    <q.asDiv>true</q.asDiv>                                         <!-- Render the <q> element with a div if true, as a span otherwise. -->

    <beta.convert>false</beta.convert>                              <!-- Interpret beta-codes if the language is classical Greek (i.e., @xml:lang="grc"). -->
    <beta.caseSensitive>false</beta.caseSensitive>                  <!-- Beta-code is case-sensitive (i.e., not using the * notation for capital letters) -->

    <css.stylesheet>style/arctic.css</css.stylesheet>               <!-- Default CSS stylesheet(s) to include; these are distributed with tei2html in the style directory. -->
    <css.useCommon>true</css.useCommon>                             <!-- Use the built-in stylesheets (for screen) -->
    <css.useCommonPrint>true</css.useCommonPrint>                   <!-- Use the built-in stylesheets (for print media) -->
    <css.useCommonAural>false</css.useCommonAural>                  <!-- Use the built-in stylesheets (for aural support) -->
    <css.inline>true</css.inline>                                   <!-- use an inline (embedded in HTML) stylesheet; ignored for ePub. -->
    <css.support>2</css.support>                                    <!-- Level of support for CSS: used to filter out newer features. Possible values: 2 | 3. -->
    <css.frakturFont>Walbaum-Fraktur</css.frakturFont>              <!-- The font to use when font(fraktur) is specified. -->
    <css.blackletterFont>UnifrakturMaguntia</css.blackletterFont>   <!-- The font to use when font(blackletter) is specified. -->

    <rendition.id.prefix></rendition.id.prefix>                     <!-- Prefix used for rendition IDs. -->

    <colophon.showEditDistance>true</colophon.showEditDistance>     <!-- Show the Levenshtein edit distance in the list of corrections made in the colophon. -->
    <colophon.showCorrections>true</colophon.showCorrections>       <!-- Show a list of corrections in the colophon. -->
    <colophon.showSuggestedCorrections>false</colophon.showSuggestedCorrections> <!-- Show a list of suggested (but not applied) corrections in the colophon. -->
    <colophon.showMinorCorrections>true</colophon.showMinorCorrections> <!-- Include minor corrections in the colophon. -->
    <colophon.showAbbreviations>true</colophon.showAbbreviations>   <!-- Show a list of abbreviations in the colophon. -->
    <colophon.showExternalReferences>true</colophon.showExternalReferences>   <!-- Show a section on external references in the colophon. -->
    <colophon.maxCorrectionCount>20</colophon.maxCorrectionCount>   <!-- Maximum number of identical corrections that will be listed individually in the list of corrections. -->

    <math.decimalSeparator>.</math.decimalSeparator>
    <math.thousandsSeparator>,</math.thousandsSeparator>
    <math.numberPattern>^[0-9]{1,3}(,[0-9]{3})*(\.[0-9]+)?$</math.numberPattern>
    <math.label.position>right</math.label.position>
    <math.label.before>(</math.label.before>
    <math.label.after>)</math.label.after>
    <math.keepTexInComment>true</math.keepTexInComment>
    <math.filePath>formulas</math.filePath>                         <!-- Path where tei2html will write tex files and read SVG files. -->
    <math.htmlPath>formulas</math.htmlPath>                         <!-- Path the generated HTML (and ePub) will use as location for included SVG or PNG files. -->

    <math.mathJax.format>SVG+IMG</math.mathJax.format>                      <!-- Options: MathJax; MML; SVG; SVG+IMG -->
    <math.mathJax.configuration>TeX-MML-AM_SVG</math.mathJax.configuration> <!-- Options for MathJax format, e.g.: TeX-MML-AM_SVG TeX-MML-AM_CHTML, see https://docs.mathjax.org/en/latest/config-files.html#common-configurations -->

    <!-- Output-format specific settings: these override the general settings defined above for a specific output format. Supported formats: "html", "html5" and "epub". -->
    <output format="html">
        <useMouseOverPopups>true</useMouseOverPopups>           <!-- Use mouse-over pop-ups on various items (links, etc). -->
    </output>
    <output format="html5">
        <useMouseOverPopups>true</useMouseOverPopups>
    </output>
    <output format="epub">
        <useMouseOverPopups>false</useMouseOverPopups>
        <xref.show>always</xref.show>
        <xref.table>true</xref.table>

        <pageNumbers.show>false</pageNumbers.show>
        <includeAlignedDivisions>false</includeAlignedDivisions>

        <math.mathJax.format>MML</math.mathJax.format>
    </output>
</tei2html.config>

This can also be found in configuration.xsl.

Future Ideas

  • Use mouse-over pop-ups (for showing corrections, etc.)

  • Include images (Y/N/All/Important)

  • Image path (<path>)

  • Footnote location (Page/Chapter/Work)

  • Generate colophon (Y/N)

  • Generate a table of contents (Front/Back/None)

  • Additional CSS stylesheets (<name>)

  • Generate marginal page-numbers (Y/N)

  • Generate links to page-images (Y/N)

Things that can be handled via CSS

  • Default table alignment (Left/Right/Center)

  • Default verse alignment (Left/Right/Center)

The TEI Header

The TEI header (<teiHeader>) contains information about the text and the sources it is derived from. This metadata is grouped in various sections, in which it is often possible to distinguish information related to the electronic file from that related to the original source.

tei2html uses the metadata in the TEI header to construct the metadata in the target formats, as well as the optional colophon, so it is important to specify it correctly.

Preparing a TEI header is less work than it might seem, as you can normally work from a template, prepared with the most common elements already in place.

The TEI header contains four main sections:

    <teiHeader>
        <fileDesc>...</fileDesc>
        <encodingDesc>...</encodingDesc>
        <profileDesc>...</profileDesc>
        <revisionDesc>...</revisionDesc>
    </teiHeader>

The File Description

The first element of the TEI header is the file description, <fileDesc>. This contains the most important metadata in three main elements: the <titleStmt>, <publicationStmt>, and <sourceDesc>. Our template looks like this:

    <fileDesc>
        <titleStmt>
            <title>TITLE</title>
            <author>AUTHOR</author>
        </titleStmt>
        <publicationStmt>
            <publisher>Project Gutenberg</publisher>
            <pubPlace>Urbana, Illinois, USA.</pubPlace>
            <idno type="PGNum">12345</idno>
            <date>2010-01-31</date>
            <availability>
                <p>Some statements on the availability and copyright status of the work.</p>
            </availability>
        </publicationStmt>
        <sourceDesc>
            <bibl>
                <author>AUTHOR</author>
                <title>TITLE</title>
                <date>YEAR</date>
            </bibl>
        </sourceDesc>
    </fileDesc>

You might notice that the TEI header contains the title and author information twice. This is intentional, as one refers to the title of the file, and one to the title of the source. For the title given in the <titleStmt>, some TEI recommendations suggest appending "an electronic transcription" to the original title. I think that is unnecessary, and will use the original title as given on the title page, or perhaps normalized in some fashion. For personal names, I always give them in the natural order, that is, without a comma, and optionally I will supply the vital dates in parentheses, for example, <author>John Doe (1833-1901)</author>. You may add other elements allowed by TEI in this statement if needed.

In the publication statement, you can give some information on the publisher and the availability of the book. For typical Project Gutenberg publications, the information in the template is fine.

The following specific values can be used in the <publicationStmt> element to register the various ways the text is known at Project Gutenberg and elsewhere.

  • <idno type="PGNum">12345</idno>: the Project Gutenberg eBook number.

  • <idno type="PGDPProjectID">project1234</idno>: the PGDP project number(s).

  • <idno type="PGClearance">1234name</idno>: the PG Clearance number.

  • <idno type="OCLC"></idno>: the OCLC catalog number (WorldCat).

  • <idno type="OLN"></idno>: the Open Library catalog number.

  • <idno type="LCCN"></idno>: the Library of Congress Control Number.

  • <idno type="ISBN"></idno>: the ISBN. Since ISBNs refer to specific manifestations of a work, this number typically refers to the ISBN of the source digitized. ISBNs have been in use since 1972, so it is unlikely many Project Gutenberg books have such a number. (You are not supposed to use the ISBN of the source here!)

The <availability> element should include a short reference to the copyright status of the work. For most texts in Project Gutenberg, the following phrase will be appropriate: Not copyrighted in the United States. If you live elsewhere please check the laws of your country before downloading this ebook.

In the <sourceDesc>, the title and author information appears again. This time, you should use the exact title and author names as given on the title page of the original work used to prepare your Project Gutenberg text. If the title on the spine or cover of the book differs, this may be noted, but the title page takes precedence.

Extensions to TEI

The @nfc attribute on <title> elements indicates the number of non-filing characters, which are skipped when sorting a title. Count all characters, including spaces, that precede the character on which the title should be sorted.

    <title nfc="4">The Return of the King</title>
    <title nfc="3">An Introduction to Mathematics</title>

The @key attribute on <author> and related elements is used to supply a sort key for a name. Typically, such sort keys would drop accents, such that they will sort in the expected place in lists.

I also use the @ref attribute to provide a link to the viaf.org authority file.

    <author key="Rijn, Rembrandt Harmenszoon van" ref="http://viaf.org/viaf/64013650">Rembrandt van Rijn (1606-1669)</author>
    <author key="MacKenzee, John">John McKenzee</author>
    <author key="Hotel, Desire l'">Desiré l'Hôtel</author>

Cover Pages

Older books were often sold in simple paper covers, often just repeating the information on the title page. It was expected that the buyer of the book would bind it (or more likely have it bound by a professional bookbinder). As a result, covers for older books are not standard, and often do not contain more than a short title on the spine. The practice of selling books in nice decorative covers started in the late nineteenth century. Including the original cover of the book as an illustration in the TEI version is often nice, and is also a great feature for ePub devices, which use cover images to generate thumbnails.

When generating ePubs, tei2html expects the cover image to be indicated in a specific way in the front matter, using the exact @id values shown below:

<div1 id="cover" type="Cover">
<p><figure id="cover-image" rend="image(images/front.jpg)"/></p>
</div1>

To make this work with the various ePub readers, make this image 600 by 800 pixels, preferably as a full-color JPEG.

When a cover image is missing, a title page image can be used instead, as long as it is encoded following this convention (again, the @id values matter, not the file names):

<div1 id="titlepage" type="Titlepage">
<p><figure id="titlepage-image" rend="image(images/titlepage.png)"/></p>
</div1>

Cover Thumbnail

Sometimes it makes sense to also have a cover thumbnail which isn’t just a reduction of the (original) cover page. To do this, just specify (as a hack) cover-thumbnail(imagefile) in the rend attribute of the text element. Similarly, cover-image(imagefile) can be used to specify a cover image that is not part of the original book, for those who like to have both.
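
As a rough illustration of the two rendition keys just described, the fragment below sets both on the <text> element. The image file names are hypothetical placeholders, and combining both keys in one @rend value is an assumption based on the way rendition ladders are combined elsewhere in these guidelines.

    <!-- File names are made up for this sketch. -->
    <text rend="cover-image(images/new-cover.jpg) cover-thumbnail(images/cover-thumbnail.jpg)">
        <front>
            <!-- front matter, possibly including the original cover division -->
        </front>
        <body>
            <!-- main text -->
        </body>
    </text>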

Favicon

Simply drop an image file named favicon.jpg in the images directory, and tei2html will generate the metadata in the HTML to use it. Use a square image of up to 192 by 192 pixels.

Those Pesky Page-Breaks

Printed books have pages, and thus page-breaks every couple of hundred words or so. Electronic files do not have this concept, but since pages and page numbers have traditionally been used to refer to passages in books, they can be captured in <pb> elements, which, for good reason, can appear almost anywhere in a TEI document. Here is where the trouble starts.

Reference:

  • TEI P5 Documentation pb

Encoding

When a page-break occurs between two divisions, it can be encoded in various ways, e.g., before the closing tag:

<pb n="123"/>
</div1>
<div1>

Between the closing and opening tag:

</div1>
<pb n="123"/>
<div1>

After the opening tag:

</div1>
<div1>
<pb n="123"/>

Various guidelines for TEI propose different (conflicting) conventions, each favoring one of these options. The tei2html scripts all assume the first option (before the closing tag). This way, the generated tables of contents, etc., refer to the page number that is current when the <div1> tag occurs.

It should be quite possible to write a small XSLT script to normalize this usage, avoiding complications in the other XSLT scripts.
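
Such a normalization could look roughly like the sketch below. It is not part of tei2html; it assumes un-namespaced TEI, handles only <div1> divisions, and only considers a <pb> that sits directly between two divisions or is the first element inside a division that follows another one.

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

        <!-- Identity transform: copy everything unchanged by default. -->
        <xsl:template match="@*|node()">
            <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
            </xsl:copy>
        </xsl:template>

        <!-- Copy each div1 and append the page-break that follows it. -->
        <xsl:template match="div1">
            <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
                <!-- A pb placed between this div1 and the next element. -->
                <xsl:copy-of select="following-sibling::*[1][self::pb]"/>
                <!-- A pb placed directly after the opening tag of the next div1. -->
                <xsl:copy-of select="following-sibling::*[1][self::div1]/*[1][self::pb]"/>
            </xsl:copy>
        </xsl:template>

        <!-- Remove the moved page-breaks from their original positions. -->
        <xsl:template match="pb[preceding-sibling::*[1][self::div1]]"/>
        <xsl:template match="div1[preceding-sibling::*[1][self::div1]]/pb[not(preceding-sibling::*)]"/>

    </xsl:stylesheet>

The same pattern would have to be repeated for other division levels (div0, div2, and so on) if page-breaks occur between those as well.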

Page Images

Cross-References

As mentioned, cross-references in printed books often refer to page numbers. The easiest way to encode this is to do something like this:

For details <ref target="pb123">see page 123</ref>.

However, there are several problems with this approach:

  1. This reference refers to a <pb id="pb123" n="123"/> somewhere in the text. Of course, the reference is almost never to the page-break itself, but to some content that appears on that page (a picture, a paragraph, or a phrase). Identifying and linking to the content actually referred to can be a time-consuming task, so it is often postponed (think, for example, of a pre-existing index). So we often have texts that just link like this.

  2. The reference contains more than just the page number. This makes it harder to replace the page number with the actual page number in paged output, if such is generated.

  3. The reference does not make clear that it is indeed to a page number (which could hint the rendering process to replace it when rendering on paged media).

For these reasons, tei2html specifies that references to page-numbers are tagged as:

For details see page <ref target="pb123" type="pageref">123</ref>.

That is, directly surrounding the page number, and the additional @type attribute set to pageref. This indicates to the rendering application that the content of the <ref> element needs to be replaced with the actual page number on which the details appear.

A reference to a <pb> element can end up pointing to the page before or after the one on which the details actually appear in newly paged output. To prevent this, replace the target with the non-<pb> element the details are really in (most likely a paragraph).
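
A minimal sketch of such a retargeted reference follows; the @id value and the sample text are made up for illustration.

    <!-- The paragraph that actually contains the details gets its own id. -->
    <p id="p123.boiler">The construction of the boiler is as follows...</p>

    <!-- Elsewhere, the page reference points at that paragraph instead of at pb123. -->
    <p>For details see page <ref target="p123.boiler" type="pageref">123</ref>.</p>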

When creating new paged output, we will have to deal with yet another set of issues:

First, exact values for cross-references can only be established when the final pagination is known. In some cases, filling in the page numbers changes the page layout, which in turn changes the page numbers themselves ("see page 100" is wider than "see page 99"). Most current layout tools deal with this type of issue reasonably well, although it still pops up in some extreme cases.

Second, when generating an index, you want to avoid things like

common entry, 12, 12, 12, 23, 24, 25, 45.

but summarize them as

common entry 12, 23-25, 45.

which is not supported by most current index generating tools, and is pretty hard to do as well. (tei2html does this when referring to pre-existing page-numbers, but this is not very useful for paged output.)

Figures

Figures according to TEI P3

Currently, tei2html supports figures following the TEI P3-model, with some modifications.

A figure element describes a single image, and optionally a heading, legend text, and a description.

The way the image itself was specified was left somewhat implementation dependent. Within the structure of SGML, it could be specified as an entity, which was hardly supported anywhere. tei2html worked around this by deriving the image file from the @id, using an additional attribute @url, or placing the image file name in a rendition ladder in the @rend attribute.
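
As an illustration, a P3-style figure that names its image in a rendition ladder might be encoded as follows. The @id, file name, and texts are placeholders; the child elements are those of the TEI P3 figure model.

    <figure id="fig12" rend="image(images/fig12.jpg)">
        <head>Fig. 12. The harbour at dawn.</head>
        <p>Legend text printed below the heading.</p>
        <figDesc>A short description of what the image shows.</figDesc>
    </figure>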

Figures according to TEI P5

TEI P5 revised the figure model, introducing a graphic element, and allowing multiple graphics to be part of a single figure. This better matches with practices in books, but requires a revision of the code.

The new model allows multiple graphics in a figure and specifies the attribute @url as the way to specify the image.

Adjustments to Code

The current code should remain functional for existing texts.

  • When a single graphic element appears in a figure, behavior should be the same as for a P3-style figure that specifies its image in an attribute, that is, the width of the block is the width of the image. (DONE.)

  • When multiple graphic elements appear, the images should be rendered under each other; the width of the entire block being the widest of these images.
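
For reference, a P5-style figure with two graphics would look something like the fragment below. The file names are hypothetical, and the stacked rendering described in the list above is the intended behavior rather than something this sketch guarantees.

    <figure id="plate3">
        <!-- Each graphic names its image in the @url attribute. -->
        <graphic url="images/plate3-upper.jpg"/>
        <graphic url="images/plate3-lower.jpg"/>
        <head>Plate III.</head>
    </figure>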

Tables

Tables are supported in tei2html with the following features:

  • cells spanning multiple rows and columns, with the @rows and @cols attributes.

  • table headers, when rows are marked as label or unit with the @role attribute.

  • formatting of borders via predefined CSS classes. The following classes are defined, to be applied (in the @rend attribute) on the table element:

    • borderOutside: the table is surrounded by a single box.

    • verticalBorderInside: the columns are separated by vertical lines.

    • borderAll: all cells are given a border. Outside borders and the border between the table head and table body are thicker.
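
The fragment below sketches how these features combine in a small table. The cell contents are invented placeholder data.

    <table rend="borderAll">
        <!-- Two header rows; the first cell spans both of them. -->
        <row role="label">
            <cell rows="2">Year</cell>
            <cell cols="2">Population</cell>
        </row>
        <row role="label">
            <cell>Town</cell>
            <cell>District</cell>
        </row>
        <row>
            <cell>1850</cell>
            <cell>1,200</cell>
            <cell>3,400</cell>
        </row>
    </table>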

Columns

In addition to the TEI model, tei2html supports column definitions. These can be used to apply styling to all cells in a column. The following elements are defined:

column: a column definition, which can have the following attributes: @cols the number of columns this column definition applies to, by default only a single column; @rend the rendering that should be applied to each cell in the column matching the position of the column definition.

columnGroup: a group of column definitions, that can be repeated multiple times. A columnGroup should always have the attribute @repeat which indicates the number of times the column-group should be repeated.

Note that the @rend value align(decimal) triggers some special processing when generating HTML. Since HTML does not directly support alignment on the decimal separator, any column with align(decimal) will be split into two columns, with in the first the integer part of the numbers, and in the second the fractional part. CSS is used to make those parts align as expected.
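
A possible use of column definitions is sketched below. That the <column> and <columnGroup> elements are placed inside the table before the first row, and that <column> is written as an empty element, are assumptions; the cell contents are placeholders.

    <table>
        <!-- First column: no special rendering. -->
        <column/>
        <!-- The next two pairs of columns share the same definitions. -->
        <columnGroup repeat="2">
            <column rend="align(decimal)"/>
            <column rend="font-style(italic)"/>
        </columnGroup>
        <row role="label">
            <cell>Item</cell>
            <cell>Price 1900</cell><cell>Unit</cell>
            <cell>Price 1910</cell><cell>Unit</cell>
        </row>
        <row>
            <cell>Flour</cell>
            <cell>1.25</cell><cell>kg</cell>
            <cell>12.5</cell><cell>kg</cell>
        </row>
    </table>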

Braces

Sometimes, rows are preceded by a tall brace. tei2html will automatically generate HTML to include an image of a tall brace when a cell spans more than one row and only includes a single brace character.

Table processing

The main tei2html transform cannot process tables correctly by itself. A number of pre-processing steps are required to deal with the complexities tables introduce. These are handled in the XSLT file normalize-table.xsl, which includes code to determine the correct column of each cell (taking into consideration spanned rows and columns) and the special handling of decimal alignment. The tei2html.pl Perl script normally applies this transform to the XML before applying the main transform.

Nested tables are supported.

Notes

tei2html supports various types of notes, that is:

  • footnotes

  • marginal notes

  • endnotes

  • notes to tables

  • apparatus notes

Each type is handled in a slightly different way.

Footnotes

Notes, given in the element <note>, are considered footnotes by default. That can be made explicit by adding @place="foot".

Footnotes appear at the bottom (or foot) of the page. Since HTML does not have the concept of a page, tei2html collects all footnotes at the end of the div0 or div1 element they appear in.

If this is undesired, you can also explicitly call for the footnotes to be rendered by inserting a <divGen type="footnotes">, which will generate the footnotes in a generated section. If only the body of that section is desired, use <divGen type="footnotebody">.

Footnotes will be automatically numbered per division. In the text, a small superscript number will link to the actual footnote. This same number is rendered in front of the footnote, and will link back to the location in the text. A small up-arrow at the end of the footnote will also link back to the text.

Sometimes, multiple footnote references on a page refer to the same footnote. This can be encoded using the @sameAs attribute, linking it to the @id of the referenced footnote. This way, multiple references can be made to the same footnote (using the same number). The up-arrow after such a duplicated note will be followed by lower-case letters, each linking back to an instance of the note reference in the text.

If you want to override the automatically assigned note number, you can provide an alternative marker with @rend="note-marker(*)". Note that this will not change the generated number of other notes, that is, the note will still be counted [TODO: fix this].
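
The following fragment sketches these footnote features together. The texts are placeholders; writing the repeated reference as an empty <note> element and giving @sameAs the bare @id value are assumptions.

    <p>The first claim<note place="foot" id="n1">The text of the note, numbered automatically.</note>
    and a later claim<note sameAs="n1"/> share a single footnote, while this
    remark<note rend="note-marker(*)">A note that uses an asterisk as its marker.</note>
    carries its own marker.</p>

    <!-- Explicitly request the footnotes at this location. -->
    <divGen type="footnotes"/>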

Marginal notes.

tei2html supports various types of marginal notes. The way they are placed in the margin in the HTML output is defined by various CSS rules. To indicate a note is a marginal note, use the @place attribute. This can have the following values:

  • margin

  • left

  • right

  • cut-in-left

  • cut-in-right

The first three options will place the note outside the text-block. The cut-in variants will place the note as a floating box inside the text-block of the main text.

Some care is needed to prevent marginal notes from overlapping each other. Simply don’t place them too close to each other (although that may not be an option when digitizing pre-existing books). It may help to widen the margin if needed.

Table notes.

Notes that appear inside a table can be marked with @place="table". In that case, the notes will be rendered directly below the table, and numbered with lower-case letters, starting anew for each table.

Notes in tables not marked as table notes will be treated as footnotes to the division the table appears in.
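
For example (with placeholder content), a table note is encoded inside the cell it belongs to:

    <table>
        <row role="label">
            <cell>Crop</cell>
            <cell>Yield</cell>
        </row>
        <row>
            <!-- Rendered directly below the table, lettered a, b, c, ... -->
            <cell>Rice<note place="table">Husked.</note></cell>
            <cell>120</cell>
        </row>
    </table>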

Apparatus notes.

Text-critical works often include numerous notes linked to words or lines in the source text, indicating variant readings of the text in one or more editions being compared.

You can indicate apparatus notes with @place="apparatus". Notes marked thus will not be rendered in the output automatically; you must request their rendering using <divGen type="apparatus">. At that location, all apparatus notes preceding this divGen will be rendered in a single paragraph (unless an apparatus note itself contains multiple paragraphs).

All apparatus notes will be given the same reference symbol (by default a degree sign), which will be used to link the notes with their point of reference.

If you use multiple instances of <divGen type="apparatus">, multiple blocks of apparatus notes will be generated, each containing the notes before the block, but after the previously generated block.
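
A small sketch of this mechanism, with invented verse lines and variant readings:

    <lg>
        <l>The first line of the poem<note place="apparatus">poem] song MS. B</note></l>
        <l>and the second line that follows<note place="apparatus">follows] followed 1623</note></l>
    </lg>

    <!-- Renders the two notes above in a single paragraph. -->
    <divGen type="apparatus"/>

    <lg>
        <l>A later line<note place="apparatus">later] latter MS. B</note></l>
    </lg>

    <!-- Renders only the note that follows the previous divGen. -->
    <divGen type="apparatus"/>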

TODO: support a better model for apparatus notes, as described in the TEI guidelines.

Mathematical formulas

Requirements

  • Use the suggested TEI notation: <formula id="f1" notation="TeX">$E = mc^2$</formula> (see TEI guidelines).

  • Meet Project Gutenberg rules, that is convert to HTML and ePub without any dynamic content (that is, no javascript, no external fonts or dependencies on non-standard software, etc.)

  • Look as good as normal (dynamic) MathJax output.

  • Fully automated processing on command line (that is, it can be integrated in a build process).

Implementation

The Project Gutenberg prohibition on active content makes the direct use of MathJax impossible, so we need to follow a more complicated route to include mathematical formulas in texts intended for PG.

The process consists of three steps:

  1. Run tei2html to let XSLT export all formulas into separate .tex files in a directory formulas. This is done during normal processing.

  2. Run the Perl script convertFormulas.pl to convert all these files to SVG files (and MathML).

  3. Run tei2html again (if needed with the -f option) to include the generated SVG files as images (or inline) in the resulting file, and use metrics from the SVG files to generate proper CSS.

The resulting output files are placed in a folder formulas, and named according to the following naming scheme (inline|display)-<ID>.tex. The name is important, as the conversion script uses that to determine the correct MathJax option to use when converting the file to SVG, etc.

The script convertFormulas.pl will convert these files to files named (inline|display)-<ID>.svg. This script will use node.js with mathjax-node-cli to produce HTML, SVG, and MathML files from the exported formulas.

Configuration

The following configuration options are available:

  • math.mathJax.format determines the way mathJax is used to generate math. Possible values:

    • MathJax: use MathJax dynamically (loading javascript libraries)

    • MML: use inline MathML notation (generated by mathjax-node-cli)

    • SVG: use inline SVG (generated by mathjax-node-cli)

    • SVG+IMG: use SVG files, included using an HTML img tag.

  • math.mathJax.configuration determines the MathJax configuration to use (only when the MathJax format is used).

  • math.label.position determines the place where equation numbers are placed (valid values: left and right).

  • math.label.before string to be placed before the equation number (default a left parenthesis).

  • math.label.after string to be placed after the equation number (default a right parenthesis).

Installation of tools

  1. Install Node.js. Make sure to install both Node.js and npm.

  2. Install mathjax-node using npm install -g mathjax-node.

  3. Install mathjax-node-cli using npm install -g mathjax-node-cli.

Implementation details

Baseline correction

The baseline is stored in the SVG files generated by MathJax. It can be read from the SVG file using the XPath /svg/@style, and then parsing the result for the CSS property vertical-align. Similarly, the width and height can be retrieved from /svg/@width and /svg/@height, and need to be set in CSS. All this CSS should apply to the <img> tag, to make sure that the SVG image is rendered at the correct size and on the proper baseline.

The spoken version is also encoded in the MathJax-generated SVG file (XPath /svg/title). It needs to be extracted and set in the @title attribute of the image (or a surrounding span) to be visible as a pop-up.

Some articles about finding the baseline (that didn’t contain the obvious solution that the information is available in the SVG file itself):

Deduplication

Often, the same formula appears multiple times in the same document. It makes sense to only generate a single SVG file for each formula, and include it multiple times.

Trivial math

Very often, a single letter or digit is (correctly) marked as an equation. To avoid needless use of MathJax, we can simply render those trivial expressions directly in HTML, and thus for some texts reduce the number of invocations of MathJax dramatically. Be aware that in this case, the appearance of letters in such trivial equations may differ from those in more complex equations, due to differences in font rendering and the way SVG is handled by browsers.

Labeled equations

Numbered equations have a label (between parentheses), normally set flush-right. Here we use the label encoded on the @n attribute, format it, and place it at the edge of the text. The configuration setting math.label.position (with possible values left and right) determines whether the label goes to the left or right edge of the text. (Note that to maintain centering, a label will be generated on both sides of the display formula in any case, and either the right or the left one will be made invisible.)

Since we cannot have further mark-up in the values of attributes, we use a 'light' markup syntax for italics and bold in labels. To get 1a with the a in italics, type 1_a_; bold text is marked in a similar fashion. Parentheses are supplied based on the settings math.label.before and math.label.after.
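
As an example of these settings, the configuration fragment below would place labels at the left edge, between square brackets instead of parentheses. The setting names are those of the configuration file; the values are chosen only for illustration.

    <math.label.position>left</math.label.position>
    <math.label.before>[</math.label.before>
    <math.label.after>]</math.label.after>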

Open Issues

  1. Unexpected Line breaks. Currently, we see linebreaks directly after the HTML <span>, even when punctuation marks directly follow the math. We will probably have to wrap the entire formula and the following punctuation in a no-wrap <span> to prevent this. A simple work-around is to put the final punctuation inside the equation.

  2. Punctuation after labeled display math. Some books place punctuation after the label to end a sentence or phrase. With the current implementation, this is very tricky to emulate. Placing the punctuation after the formula is the most logical from a semantic point of view, but will require a lot of effort. As a work-around, place this punctuation within the displayed equation.

  3. No line-breaks. Since browsers rendering HTML cannot break SVG into lines, math formulas will not be broken into lines. You may consider breaking very long formulas into multiple fragments as a measure of last resort.

Using the @rend attribute

The @rend attribute is a simple hook into the TEI structure to document or specify how features are rendered. Within tei2html, it can be used to achieve certain formatting effects. This feature should be used in moderation. Several ways of using the @rend attribute can be distinguished.

Note that according to the TEI Guidelines, the rendering attributes are intended to describe the presentation of the source material, and are not to prescribe any presentation of the output. Since the intention here is to faithfully reproduce text, this distinction is not really important for tei2html.

Simple

Simple rendering attribute values provide single keywords to provide a rendering hint. This type of usage is sufficient most of the time. The pre-defined rendering keywords are often element specific, and should be considered as hints only, that is, ignoring the rend attribute should not render the document illegible.

Recognized @rend values per element:

  • hi: italic (default), bold, bi, sc, asc, sup, sub, ex

  • figure: center (default), left, right, inline

  • p: block, center, left (default), right, indent, noindent

  • q: block

  • list: number, bullet, none (default)

  • any element: rtl, ltr (see the section on directionality below)

  • any element: any other value (these will be added to the class attribute of the output element in HTML)

Rendition Ladders

A slightly more powerful mechanism is provided by so-called 'rendition ladders', in which a number of key-value pairs are provided. These take the form key(value), and several can be present in a single @rend attribute. They are either translated to CSS, or given a specific meaning, sometimes depending on the element they are applied to.

CSS equivalents

In most cases, rendition ladders are translated one-to-one to equivalent CSS rules. For example, the following snippet of TEI:

<p rend="font-style(italic) text-align(left)">A left-aligned paragraph in italics.</p>

Will be translated to the following HTML:

<p class="x12345">A left-aligned paragraph in italics.</p>

And the following CSS rule:

.x12345 { font-style: italic; text-align: left; }

Note that if the same value for the @rend attribute is used multiple times, only one CSS rule will be generated matching all occurrences (and the same class attribute will be used on each of them).

Special interpretations

In a number of places, rendition ladders are interpreted by tei2html directly.

The following keys and values are supported: (Note that this list is not exhaustive, and the full list of options will be indicated with each element.)

Each entry below gives the key, the elements it applies to, the accepted values, and an example.

  • stylesheet (on text): the name of a CSS stylesheet; multiple stylesheets can be specified, separated by commas. Example value: style/classic.css

  • font (on any element): italic, bold, fsc (full caps and smallcaps), smallcaps, underlined, gothic (note the difference from fall-through CSS). Example: <hi rend="font(bold)">

  • align (on p, q): right, left, center, block (that is, justified). Example: <p rend="align(block)">

  • indent (on p, l): the number of characters to indent. The size of a character is not fixed, but is roughly the size of the letter m. Example: <l rend="indent(2)">

  • link (on any element): any URL, rendered as a link to the indicated URL. Example: <figure rend="link(images/a.jpg)">

  • image (on figure, head, cell): any URL, rendered as an in-line image obtained from the indicated URL. When used on a head element, the image appears above the head; when used on a cell element, the image appears in the table cell (typically used to pull in large braces spanning cells). Example: <figure rend="image(images/a.jpg)">

  • float (on figure): the side to which the image, table, etc. floats. Possible values: left, right. Example: <figure rend="float(left)">

  • hover-overlay (on figure): in HTML output, when the mouse hovers over the image, the alternative image is shown (using CSS only). Example: <figure rend="hover-overlay(images/overlay.jpg)">

  • image-frame (on figure): in HTML output, the image is placed centered on top of the image specified as frame (using CSS positioning), such that the frame surrounds the image. For this, of course, the image with the frame should be larger than the image being framed. Example: <figure rend="image-frame(images/frame.jpg)">

  • columns (on table, list): set the element in multiple columns. Example: <list rend="columns(2)">…</list>

  • stars (on milestone): render a milestone element with the indicated number of stars. Example: <milestone rend="stars(7)"/>

  • class (on any element): sets a class attribute in the corresponding HTML output. This can be used in combination with custom CSS stylesheets to achieve special effects. (Note: just naming the class in the rend attribute, without the surrounding class(…), is now sufficient.) Example: <p rend="class(myClass)">

  • hemistich (on l): indents the current line by a certain space. When the value starts with a ^ followed by a number n, the content of the line n lines before is used; when the value starts with a # followed by an id, the content of the element with that id is used; otherwise, the literal content is used. Examples: <l rend="hemistich(^1)">, <l rend="hemistich(#vs21)">, <l rend="hemistich(Content)">

Using @style and @rendition

As an alternative to the @rend attribute, the current TEI guidelines also provide @style and @rendition to define presentation in a formally defined language. tei2html assumes that this language is CSS. See the TEI guidelines on rendition attributes. Unlike the values of @rend, the specified CSS values are not interpreted at all, but passed to the output CSS directly.

Directionality

Since ePub does not allow CSS to be used for directionality, but requires that the HTML @dir attribute is used, the following @rend values are translated to a @dir attribute in HTML.

  1. direction(…​)

  2. class(rtl) and class(ltr) and the bare equivalents.

Implementation notes:
  1. Handle the @style attribute, and output it as a CSS rule.

    • generate a unique class name for the CSS fragment.

    • output the value of the @style attribute verbatim.

    • remove duplicates, such that identical @style attributes are only output once.

    • apply the generated class-name to the relevant output element in HTML.

  2. Handle the @rendition attribute.

    • apply the given class name(s) to the relevant output element in HTML.

    • verify <rendition> elements for the given class names are present in the <tagsDecl> of the TEI file.

    • warn if this is not the case.

  3. Handle the <rendition> tags in the <tagsDecl>.

    • verify the rendition id is used in the file.

    • output the corresponding CSS verbatim.
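The first implementation note (handling @style) can be illustrated with a few lines of Python. This is only a sketch of the idea, not the XSLT implementation; the function name styles_to_css and the class-name prefix x are assumptions made for this example.

```python
def styles_to_css(styles, prefix="x"):
    """Map inline style strings to generated class names.

    Returns (class_for_style, css_text): a dictionary from each distinct
    style string to its generated class name, and the CSS rules to output.
    The prefix and numbering scheme are hypothetical.
    """
    class_for_style = {}
    rules = []
    for style in styles:
        if style in class_for_style:
            continue  # identical @style values are output only once
        name = f"{prefix}{len(class_for_style) + 1}"
        class_for_style[style] = name
        # The style value itself is copied verbatim into the rule.
        rules.append(f".{name} {{ {style} }}")
    return class_for_style, "\n".join(rules)


classes, css = styles_to_css(["color: red;", "font-weight: bold;", "color: red;"])
```

The generated class name in classes is then applied to the output element that carried the @style attribute.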

Aligned Bilingual Text

When you have a bilingual text, it is nice to show the two versions side-by-side, such that the translations are aligned. Books sometimes place such bilingual texts on facing pages, but in an unpaged medium, that is not an option.

To present bilingual text side-by-side with tei2html, you can do the following:

  1. Encode the two texts, each in a separate division (div1, div2, etc.).

  2. Give each of the divisions an @id attribute.

  3. On one of the divisions, add the following @rend attribute:

    rend="align-with(id-of-div-to-align)"

  4. Give each element in the first division a unique @n attribute value. These can be numbers. The order is not really relevant, but it is easy to just number them sequentially.

  5. Give the matching elements in the second division the same @n attribute value as their counterparts in the first division. If an element has no counterpart, no @n attribute is needed.

When running tei2html, it will generate a table, with the text of the first division on the left, and the text of the second division on the right. Table cells will be used to make sure matching elements are aligned. The code understands sub-divisions and will recurse into them, still using a single table.
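To make the pairing by @n more concrete, the following Python sketch builds table rows from two lists of (n, text) pairs. It is an illustration under stated assumptions, not the algorithm tei2html implements: the name align_rows is made up, and the choice to give an element without counterpart a row of its own with an empty cell opposite is an assumption of this sketch.

```python
def align_rows(left, right):
    """Pair up two lists of (n, text) items into (left_text, right_text) rows.

    Items with the same n value end up in the same row; items without a
    counterpart get a row with an empty cell on the other side (assumed).
    """
    rows = []
    position = {n: i for i, (n, _) in enumerate(right) if n is not None}
    done = 0  # number of right-hand items already placed in a row
    for n, text in left:
        if n is not None and position.get(n, -1) >= done:
            match = position[n]
            # First place right-hand items that precede the counterpart.
            for _, right_text in right[done:match]:
                rows.append(("", right_text))
            rows.append((text, right[match][1]))
            done = match + 1
        else:
            rows.append((text, ""))
    # Remaining right-hand items without counterpart.
    for _, right_text in right[done:]:
        rows.append(("", right_text))
    return rows


rows = align_rows(
    [("1", "A"), ("2", "B"), (None, "C")],
    [("1", "a"), (None, "x"), ("2", "b")],
)
```

In this example the rows come out as A next to a, then x alone on the right, then B next to b, then C alone on the left.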

Excluding content from alignment

Sometimes, you want to exclude certain elements from the alignment (e.g., non-translated sections, headings, or figures). To do so, you have to include the element on both sides with the same @n attribute, and give one the following @rend attribute:

rend="span-alignment(both)"

The other element should remain empty (as it will be skipped).

Aligned text from a separate file

It is also possible to align texts kept in separate files. To do so, use the following @rend attribute:

rend="align-with-document(filename.xml#id-of-div-to-align)"

When using the Perl scripts that come with tei2html, the to-be-aligned sections are included in an intermediate document in a pre-processing step (in include.xsl). This step replaces align-with-document with align-with and places the to-be-aligned section directly after the section it will be aligned with. As a result, generated tables of contents, lists of issues in the colophon, and footnotes are processed correctly. The tei2html XSLT stylesheets can also do this directly, but with limited support for those additional items.

Footnotes in aligned texts

Footnotes in aligned text will be placed at the bottom of the column they appear in. If footnotes are present in both aligned texts, both sides will have footnotes.

Aligned verse

When aligning verses (or actually line-groups, encoded with the lg element), there is no need to number matching lines with an @n attribute. However, the number of items in the two line-groups needs to match. Again, to align two line-groups, add the following @rend attribute to one of them:

rend="align-with(id-of-lg-to-align)"

This will generate a table, in which all lines are placed in rows side-by-side. The code will recurse into nested lg elements, still using a single table.

Facsimile Support

TEI P5 contains a facsimile element to represent scanned documents, as well as some other features to better support facsimiles. These features can be used to link to the original scans.

Often, scans are already available somewhere on the web, and we may wish to point to them; sometimes, we wish to include them with our ebook. We thus have two types of facsimiles: external images and internal images.

TEI offers two ways of referring to scanned facsimile images, one is using the @facs attribute on the pb element, the other is using graphic elements in a separate facsimile section. When using the latter option, the @facs attributes on the pb elements can also (and should) refer to those graphic elements.

The facsimile files themselves can be either hosted locally (on a file-system under our control) or remotely (with a third party). In the former case, we can generate self-contained ePubs with all the files on-board; in the latter case, we have to make do with external references.

Finally, we can refer directly to the facsimiles, or generate an HTML wrapper that includes the facsimile (assumed to be an image in a supported format here). The HTML wrapper will be required if we include the facsimile images in an ePub file.

All combined, this gives several options to take into account when dealing with facsimile editions.

Facsimile Element

The facsimile element is a top-level element that describes a series of page images, and can stand independently of the transcribed text. This makes it possible to specify just the metadata (in the header) and the scans, to produce a digital facsimile.

   <facsimile>
      <graphic id="facs123" url="p123.png"/>
   </facsimile>

@facs Attribute

The @facs attribute on pb elements can be used to point to scanned images of transcribed pages. This can be used to either link to some external source of page images (for example in the Internet Archive), or to link to an internal set of images (kept in a page-images subdirectory, for example).

  <pb n="123" facs="p123.png"/>

Let’s call these "direct" facsimile links.

Alternatively, it can link to an element in the facsimile element, for example:

  <pb n="123" facs="#facs123"/>

Let’s call these "indirect" facsimile links.

This latter case also allows pointing to zones within a page, but that is currently out-of-scope for tei2html.

The two ways should not be combined. Currently, the code only allows the "direct" way of linking to page images.
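The difference between direct and indirect links comes down to whether the @facs value starts with a #. The Python sketch below shows one way to resolve either kind of link to an image URL. It is illustrative only (tei2html does this in XSLT); the function name resolve_facs is made up, and the samples leave out namespaces for brevity.

```python
import xml.etree.ElementTree as ET


def resolve_facs(pb, facsimile=None):
    """Return the image URL a pb element points to, or None if there is none.

    Hypothetical helper: 'facsimile' is the facsimile element to search
    when the link is indirect.
    """
    facs = pb.get("facs")
    if facs is None:
        return None
    if not facs.startswith("#"):
        return facs  # direct link: the value is the image URL itself
    if facsimile is None:
        return None
    # Indirect link: look up the graphic element with the matching id.
    for graphic in facsimile.iter("graphic"):
        if graphic.get("id") == facs[1:]:
            return graphic.get("url")
    return None


facsimile = ET.fromstring('<facsimile><graphic id="facs123" url="p123.png"/></facsimile>')
direct = ET.fromstring('<pb n="123" facs="p123.png"/>')
indirect = ET.fromstring('<pb n="123" facs="#facs123"/>')

direct_url = resolve_facs(direct)
indirect_url = resolve_facs(indirect, facsimile)
```

Both calls resolve to the same image file, p123.png, which is the point of the indirection.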

Comparison of direct and indirect linking to facsimiles

Direct

Generate output wrapper file for each pb-element. [DONE]

Generate links from HTML output to wrapper file. [DONE]

Use location in text version defined by that pb-element to generate structural navigation aid (breadcrumbs) [DONE]

Non-transcribed pages cannot be included (or we should encode additional pb-elements for those pages).

Indirect

Generate output wrapper file for each graphic-element. [DONE]

Look-up referring pb-element to find location in text version to generate structural navigation aid. This may not be present, in which case we cannot use structural navigation guides (or use the next graphic element that does have a matching pb) [DONE]

Generate links from HTML output to wrapper file, taking this into account. [DONE]

Not all graphic-elements might be referred to by a pb-element. Need to decide what to do in this case, but probably using the next graphic-element that does have a matching pb-element is a good default strategy for producing navigational aids on those pages. [TODO]

Need to deal with case that no text is present at all (that is, we have a TEI file with just a teiHeader and a facsimile element) [TODO]

Generated Output

HTML

The output consists of a series of HTML pages, one per page, with some metadata in the heading of the page, and some navigational aids to conveniently jump to another page.

By convention, the facsimile images and wrapper pages will go into a directory page-images, and will look like this:

  <html>
    <head>
      <title>Document title, page x</title>
    </head>
    <body>
      <div class="facsimile-header">
         <h1>Document Title, by Document Author, Page x</h1>
      </div>
      <div class="facsimile-navigation">
         <!-- Buttons to go to previous and next page, and back to text -->
      </div>
      <div class="facsimile-page">
         <img src="/images/pages/p123.gif"/>
      </div>
    </body>
  </html>

For each pb-element for which a @facs attribute is present, a link will be generated to the wrapper file, decorated with a facsimile page-image icon.

Things to show

In header

  • Title of document

  • Name of author

  • Page number (if encoded in @n attribute)

  • Title(s) of current division at bottom of page.

Titles are indicated as follows as bread-crumbs:

(Front | Body | Back |) > Title Level 1 > Title Level 2 > ... > Page 123

Each of these elements is an active link, and will link back to those pages (content or facsimile view, to be decided).

Issue: current code removes unused anchors in HTML post-XSLT-transform, using a Perl script, that needs to be modified, as anchors used by those 'external' HTML files are not recognized.

Issue: our current conventions put the pb just before the div# elements. This will lead to the wrong header above the page-image for the first page of a section. Need to add a check for the case that the pb element is (almost) the last element of a div#.

In Navigation

  • Link to Previous Page. [DONE]

  • Link to Next Page. [DONE]

  • (Optional) Links to all pages. [TODO]

  • Link back to location in transcribed text. [DONE]

ePub

Similar to HTML, taking into account additions to Spine, metadata, etc.

  • Add generated wrapper files to spine [TODO]

  • Add page-images to spine [TODO]

If no text element is present, the page-images should become the primary structure of the text.

ePub3 Support

The ePub 3 standard is a broad overhaul of ePub, and adds a lot of important features to ePub, including support for HTML5 and CSS3, better support for multilingual documents, etc.

To be able to use the new features introduced in ePub 3.0, we will need to make a number of changes to tei2html. The driving forces for this effort from the tei2html perspective are:

  • Better support for HTML5 and CSS3

  • Better support for metadata (including reliable way to specify cover images, etc.)

  • Better way to specify a table of contents.

  • Better ability to embed fonts

  • SVG and MathML support.

All such new features of course need support from readers. Such support can be expected quickly on PC-based viewers, but may take considerable time to trickle down to dedicated reader hardware.

For proper development, we also need to be able to test the resulting ePub3 files with a number of supporting readers, which are not available at this time.

Features not within current scope are JavaScript support (we like to keep our books passive) and media overlays.

Things to do

For tei2html to support ePub 3.0, we need to add some new information to our ePub files. In most cases, that information can easily be derived from the existing TEI files, so no changes to data-files will be required.

Also see ePub3 changes.

Many of the things to-do can be aligned with over-all support for HTML5/CSS3. Specific for ePub3 are:

Changes to the .OPF file

Metadata

New metadata uses the <meta> tag (samples taken from draft ePub 3.0 specification).

<meta> elements with the @about attribute give further information on the metadata.

<metadata>
    <meta property="dcterms:identifier" id="dcterms-id">urn:uuid:54dc9f06-3174-4b6b-a29a-0dd1fa0969e4</meta>
    <meta about="#pub-id" property="scheme">uuid</meta>

    <meta property="dcterms:identifier" id="isbn-id">urn:isbn:9780101010101</meta>
    <meta about="#isbn-id" property="scheme">isbn</meta>

    <meta property="source-identifier" id="src-id">urn:isbn:9780375704024</meta>
    <meta about="#src-id" property="scheme">isbn</meta>

    <meta property="dcterms:title" id="title">Norwegian Wood</meta>
    <meta about="#title" property="alternate-script" xml:lang="ja">ノルウェイの森</meta>
    <meta about="#title" property="title-type">primary</meta>

    <meta property="dcterms:modified">2011-01-01T12:00:00Z</meta>
    <meta property="dcterms:language">en</meta>
    <meta property="page-progression-direction">ltr</meta>

    <meta property="dcterms:creator" id="creator">Haruki Murakami</meta>
    <meta about="#creator" property="alternate-script" xml:lang="ja">村上 春樹</meta>
    <meta about="#creator" property="file-as">Murakami, Haruki</meta>
    <meta about="#creator" property="role">aut</meta>
</metadata>

Old-fashioned metadata can be linked to their 3.0 equivalents using the @prefer attribute. (I think it is unlikely readers will actually use this; library management software might do so.)

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id" prefer="uuid">urn:uuid:54dc9f06-3174-4b6b-a29a-0dd1fa0969e4</dc:identifier>
    <meta property="dcterms:identifier" id="uuid">urn:uuid:54dc9f06-3174-4b6b-a29a-0dd1fa0969e4</meta>
</metadata>

Note that the identifier indicated by the @unique-identifier attribute on the <package> element combined with the last modification date is used to generate a package id.

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:A1B0D67E-2E81-4DF5-9E67-A64CBE366809</dc:identifier>

    <meta property="dcterms:identifier" id="dcterms-id">urn:uuid:A1B0D67E-2E81-4DF5-9E67-A64CBE366809</meta>
    <meta about="#pub-id" property="scheme">uuid</meta>

    <meta property="dcterms:modified">2011-01-01T12:00:00Z</meta>
</metadata>

Results in a Package ID: urn:uuid:A1B0D67E-2E81-4DF5-9E67-A64CBE366809@2011-01-01T12:00:00Z.
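The derivation of the package id can be spelled out in a few lines of Python. This is a sketch for illustration only: the function name package_id is made up, and the sample leaves out the OPF default namespace that a real package file declares.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"


def package_id(package):
    """Combine the unique identifier and the last-modified date of a package."""
    wanted = package.get("unique-identifier")
    identifier = next(
        el.text for el in package.iter(DC + "identifier") if el.get("id") == wanted
    )
    modified = next(
        el.text for el in package.iter("meta")
        if el.get("property") == "dcterms:modified"
    )
    return f"{identifier}@{modified}"


# Simplified sample: no OPF default namespace (an assumption of this sketch).
OPF = """<package unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:A1B0D67E-2E81-4DF5-9E67-A64CBE366809</dc:identifier>
    <meta property="dcterms:modified">2011-01-01T12:00:00Z</meta>
  </metadata>
</package>"""

result = package_id(ET.fromstring(OPF))
```

For the sample, result is the identifier and the modification date joined by an @ sign, as in the package id shown above.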

Manifest

The manifest now has manifest item properties to specify the roles of the items listed. These can be used to indicate which item is the navigation document and which is the cover image. Properties are also required if certain features are used in an item (e.g., mathml, scripted, remote-resources, svg).

<item properties="nav" id="toc" href="contents.xhtml" media-type="application/xhtml+xml"/>
...
<item properties="cover-image" id="cover" href="cover.svg" media-type="image/svg+xml"/>

Navigation Document

The navigation document is now a valid XHTML page, with the table of contents wrapped in a <nav> element.

XHTML

Will need to generate valid XHTML, which means:

  • Removal of some attributes, such as summary, valign, align, width, etc. from elements.

Media-Overlays in ePub3

The following is based on a little reverse engineering of a working book with the freely available Azardi ePub reader. The output was also tested with Tobi, which correctly picks up the files, and with the Readium app for Chrome. Other readers remain to be tested.

SMIL Files

SMIL files have a structure as given below, and must be placed in the same directory as the referenced HTML file.

<smil
    xmlns="http://www.w3.org/ns/SMIL"
    xmlns:epub="http://www.idpf.org/2007/ops" version="3.0">
  <body>
    <seq id="sequence_id" epub:textref="chapter.xhtml" epub:type="bodymatter chapter">
      <par id="par1">
        <text src="chapter.xhtml#p1"/>
        <audio clipBegin="0:00:00" clipEnd="0:00:02.162" src="audio/chapter.mp3"/>
      </par>
      <par id="par2">
        <text src="chapter.xhtml#p2"/>
        <audio clipBegin="0:00:02.162" clipEnd="0:00:05.718" src="audio/chapter.mp3"/>
      </par>
    </seq>
  </body>
</smil>

MP3 or OGG Files

The actual audio files are in a subdirectory called audio.

OPF File

Metadata section

Add a number of media overlay metadata items; they reference items in the manifest.

<meta property="media:duration" refines="#ch1">5:30:15</meta>
<meta property="media:duration">5:30:15</meta>
<meta property="media:narrator">Narrator Name</meta>
<meta property="media:active-class">-epub-media-overlay-active</meta>

Except for the narrator, these items are derived from the .smil files related to each section.
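As an illustration of how a media:duration value can be derived from a .smil file, the Python sketch below sums the lengths of all audio clips. It is not the tei2html implementation; the function names are made up, only colon-separated clock values (as in the sample above) are handled, and Decimal arithmetic is used here as one possible way to sidestep rounding issues.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

SMIL_NS = "{http://www.w3.org/ns/SMIL}"


def clock_to_seconds(value):
    """Convert a clock value such as 0:00:02.162 to seconds (as a Decimal)."""
    seconds = Decimal(0)
    for part in value.split(":"):
        seconds = seconds * 60 + Decimal(part)
    return seconds


def smil_duration(smil_text):
    """Sum the lengths of all audio clips in a SMIL document."""
    total = Decimal(0)
    for audio in ET.fromstring(smil_text).iter(SMIL_NS + "audio"):
        begin = clock_to_seconds(audio.get("clipBegin"))
        end = clock_to_seconds(audio.get("clipEnd"))
        total += end - begin
    return total


def seconds_to_clock(seconds):
    """Format a number of seconds as h:mm:ss.mmm."""
    milliseconds = int(seconds * 1000)
    hours, rest = divmod(milliseconds, 3600000)
    minutes, rest = divmod(rest, 60000)
    secs, millis = divmod(rest, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{millis:03d}"


SMIL = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <seq id="s1">
      <par id="par1">
        <text src="chapter.xhtml#p1"/>
        <audio clipBegin="0:00:00" clipEnd="0:00:02.162" src="audio/chapter.mp3"/>
      </par>
      <par id="par2">
        <text src="chapter.xhtml#p2"/>
        <audio clipBegin="0:00:02.162" clipEnd="0:00:05.718" src="audio/chapter.mp3"/>
      </par>
    </seq>
  </body>
</smil>"""

duration = seconds_to_clock(smil_duration(SMIL))
```

The duration for the whole publication would then be the sum of the durations of the individual .smil files.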

Manifest section

Add the .smil and audio files included in the ePub container. The example also shows how to indicate fall-back media in an alternative format:

<item id="ch1" href="chapter.smil" media-type="application/smil+xml"/>

<item id="audio01" href="audio/chapter.mp3" fallback="audio02" media-type="audio/mpeg"/>
<item id="audio02" href="audio/chapter.ogg" media-type="audio/ogg"/>

Then also refer to the media overlay in the entry for the original (text) item:

<item id="chapter" href="chapter.xhtml" media-type="application/xhtml+xml" media-overlay="ch1"/>

Implementation in tei2html

Currently, there is no support for SMIL types of overlay in TEI. We can solve this most conveniently by adding a rend attribute to a division, indicating the .smil file to use as an overlay.

<div1 id="ch1" rend="media-overlay(chapter1.smil)">

We then parse the .smil file for the entries that need to be added to the OPF manifest (that is, the .smil file itself and the associated audio files). This also gives us a way to obtain the required overlay metadata.
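Collecting the manifest entries from a .smil file could look roughly like the following Python sketch. It only illustrates the idea; the actual work is done in XSLT. The function name manifest_items and the generated ids (overlay1, audio01, ...) are made up for this example.

```python
import os.path
import xml.etree.ElementTree as ET

SMIL_NS = "{http://www.w3.org/ns/SMIL}"
MEDIA_TYPES = {".mp3": "audio/mpeg", ".ogg": "audio/ogg"}


def manifest_items(smil_href, smil_text):
    """Return (id, href, media-type) tuples for a .smil file and its audio files.

    The ids are hypothetical; a real implementation must make them unique
    across the whole manifest.
    """
    items = [("overlay1", smil_href, "application/smil+xml")]
    sources = []
    for audio in ET.fromstring(smil_text).iter(SMIL_NS + "audio"):
        src = audio.get("src")
        if src not in sources:  # each audio file is listed only once
            sources.append(src)
    for number, src in enumerate(sources, start=1):
        extension = os.path.splitext(src)[1].lower()
        items.append((f"audio{number:02d}", src, MEDIA_TYPES[extension]))
    return items


SMIL = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <seq id="s1">
      <par id="par1">
        <text src="chapter.xhtml#p1"/>
        <audio clipBegin="0:00:00" clipEnd="0:00:02.162" src="audio/chapter.mp3"/>
      </par>
      <par id="par2">
        <text src="chapter.xhtml#p2"/>
        <audio clipBegin="0:00:02.162" clipEnd="0:00:05.718" src="audio/chapter.mp3"/>
      </par>
    </seq>
  </body>
</smil>"""

items = manifest_items("chapter.smil", SMIL)
```

Although the sample refers to the audio file twice, it appears only once in the resulting list.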

All this is handled in the file tei2opf.xml.

TODO

  • Copy .smil files to ePub output directory. DONE

  • Better handling of media metadata. (Generate them from the smil file). DONE; need to look at minor rounding issues in calculation.

Implementation observations

  • Azardi doesn’t seem to like ids with periods in them.

  • Azardi needs fallback media to Ogg.

  • Azardi requires the body element of the .smil file to contain a seq element; the standard also allows par elements directly in the body (as Tobi generates them).

  • Casing of file names is important.

  • Now works fine in Readium, which seems to have a better interface than Azardi.

Coding Conventions

XSLT Domain

In the XSLT domain, the main naming convention uses dashes, so we use dashes for names of templates, modes, etc., except where such names directly reflect elements from the TEI domain that use camelCase.

TEI Domain

In the TEI domain, the main naming convention is camelCase, so for names primarily used in a TEI context, we follow that.

Use of the @type attribute

The @type attribute can indicate a subclass or special use of a certain element.

Special values

The following table gives special values for the @type attribute in various contexts. These are handled in a special way when processing the TEI file.

TODO use lower camel case convention for these names.

TEI element   @type value           Notes
ab            divNum                Indicates a division number (typically within a head element).
ab            figNum                Indicates a figure number (typically within a head element).
ab            flushright            Indicates the ab should be set flush right. (Deprecated: prefer semantic mark-up)
ab            itemNum               Indicates an item number (typically within an item element).
ab            lineNum               Indicates a line number (typically within a p or l element).
ab            lineNumRef            Indicates a reference to a line number.
ab            parNum                Indicates a paragraph number.
ab            tocDivNum             Indicates a division number within a toc.
ab            tocFigNum             Indicates a figure number within a list of illustrations.
ab            tocPageNum            Indicates a page number within a toc.
div1          TranscriberNote       The division is a note by the transcriber (rendered in a different color).
div2          SubToc                The division is a toc for the div1 it appears in (content will be replaced by generated content).
divGen        apparatus             Generate a section with apparatus notes (can be used multiple times, all apparatus notes between this and the previous instance will be included).
divGen        Colophon              Generate a colophon (based on information in the TEI header).
divGen        ColophonBody          Generate the body of a colophon (based on information in the TEI header).
divGen        Footnotes             Generate a section with footnotes.
divGen        FootnotesBody         Just the body.
divGen        gallery               Generate a gallery of illustrations (requires availability of thumbnail images).
divGen        Inclusion             Include an external file here; using @url attribute.
divGen        index                 Generate an index.
divGen        IndexToc              Generate a one-line toc for an Index (displaying single-letter links).
divGen        LanguageFragments     Generate a list of fragments in foreign languages (i.e., not the main language of the text).
divGen        loi                   Generate a list of illustrations.
divGen        pgfooter              Generate the Project Gutenberg boilerplate footer.
divGen        pgheader              Generate the Project Gutenberg header.
divGen        toc                   Generate a table of contents. Additional @rend attribute value tocMaxLevel(n) can be used to control the depth.
divGen        toca                  Generate a table of contents (including chapter arguments).
divGen        tocBody               Just the body.
head          label                 The (division) heading is a label indicating its type and number. (Typically, 'Chapter IX') Omit when the label is the only head of a chapter.
head          sub                   The (division) heading is a subtitle.
head          super                 The (division) heading is the title of a higher level division. (Typically, the title of the book repeated above the first chapter.)
idno          epub-id               The idno gives a unique identifier for the generated ePub file.
idno          ISBN                  The idno gives the ISBN for the edition (don’t use for the ISBN of the source!).
idno          LCCN                  The idno gives the Library of Congress control number.
idno          LibThing              The idno gives the Library Thing catalog number for the edition.
idno          OCLC                  The idno gives the WorldCat catalog number for the edition.
idno          OLN                   The idno gives the Open Library catalog number for the (source) edition.
idno          OLW                   The idno gives the Open Library catalog number for the work.
idno          PGnum                 The idno gives the Project Gutenberg ebook number.
list          determinationTable    Convert the list to a (potentially nested) table as used for determination in biological works.
list          tocList               Convert the list to a (potentially nested) table of contents.
p             figBottom             The paragraph will be placed on the bottom-center of a figure.
p             figBottomLeft         The paragraph will be placed on the bottom-left of a figure.
p             figBottomRight        The paragraph will be placed on the bottom-right of a figure.
p             figTop                The paragraph will be placed on the top-center of a figure.
p             figTopLeft            The paragraph will be placed on the top-left of a figure.
p             figTopRight           The paragraph will be placed on the top-right of a figure.
ref           endnoteref            The reference refers to an end-note.
ref           noteref               The reference refers to a footnote (The generated footnote number of the note referred to is used in the output; this is intended to be used when a note reference marker is used multiple times to refer to the same footnote, not when referring to a footnote otherwise).
ref           pageref               The reference refers to a page (by number; the ref is supposed to only include the actual number referred to).
title         pgshort               The title is a short title for Project Gutenberg purposes.
titlePart     main                  The title part is the main title.
titlePart     series                The title part is a series title.
titlePart     sub                   The title part is a subtitle.
titlePart     volume                The title part is a volume label (e.g., 'Volume II').

TEI element   @place value                Notes
note          apparatus                   The note is part of a critical apparatus.
note          foot                        The note is a footnote (default).
note          margin, left, right         The note is a marginal note (set to the left or right of the text block).
note          cut-in-left, cut-in-right   The note is a cut-in note (set inside the text block; the main text flows around it).
note          table                       The note appears directly under the table it appears in.

TEI element   @unit value   Notes
milestone     tb            The milestone is a thematic break.

Tools

Auxiliary Tools

Besides the main XSLT scripts, tei2html includes Perl scripts for various tasks. Many of these are quick hacks to get things done.

tei2html.pl

The main 'glue' script. This will run the various commands to actually run the tei2html XSLT scripts.

Syntax: perl -S tei2html.pl [options] filename

For an overview of options, run perl -S tei2html.pl --help

ucwords.pl

Perl script to list word-frequencies in an XML file. This script will recognize the @lang attribute, and write a word-frequency list per language encountered. Words in the output will be color-coded, based on frequency and appearance in a spell-check dictionary.

Syntax: perl -S ucwords.pl filename

Tools to generate plain text files

extractNotes.pl

Extract TEI tagged notes from a text file.

Usage:

perl extractNotes.pl [-h] <filename>

Options:

-h	Prepare for HTML output.

This produces two output files:

<filename>.out	The original text without the notes
<filename>.not	The notes

If the -h option is used, hyperlinks between the two files are inserted. Otherwise, the notes are indicated with a sequence number in square brackets.

tei2txt.pl

Convert a TEI file to plain vanilla ASCII.

Usage:

perl tei2txt.pl <filename>

Notes:

  1. This script assumes extractNotes.pl has been run before.

  2. Accented and special letters are converted to their nearest ASCII equivalents.

tei2txt.bat

Convert a TEI file to plain vanilla ASCII.

This batch file runs extractNotes.pl, concatenates the two resulting files, and then runs tei2txt.pl on the resulting file. The result is a single ASCII file.

Tools for post-processing PGDP output

fixpb.pl

Perl script to fix page numbers, as recorded in the @n attribute in <pb> elements. This will also set the @id attribute to pb#, where # is the current page number. Page-numbers will be changed after the page indicated, with an offset indicated. To fix mismatched page-numbers, always start with the first mismatch, and run the script as many times as needed (this typically occurs if the original source had unnumbered pages with illustrations.)

For an overview of options, run perl -S fixpb.pl --help

catpars.pl

Perl script to unwrap paragraphs, such that they appear on one line. Handle with care, as it might unwrap entire tables or lists if it deems them part of a paragraph.

quotes.pl

Perl script to replace ASCII single and double-quotes with curly quotes. May need some manual verification and correction where quotes and tags interfere with the simple algorithm used.

Issues:

  • Open quotes before SGML tags and non-letters may go wrong (they become the default: a close quote).

  • Manually check for ASCII single-quotes, and apply the correct one. Use tei2html.pl with the -v option to find cases where quotes are still not balanced.
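The first issue is easy to reproduce with a deliberately naive rule. The Python sketch below is not the algorithm quotes.pl uses; it is a guess at a simple rule that shows the same behaviour, namely that a quote directly followed by a letter opens, and any other quote falls back to the default, a closing quote. The function name is made up.

```python
OPEN_DOUBLE = "\u201c"   # left double quotation mark
CLOSE_DOUBLE = "\u201d"  # right double quotation mark


def curl_double_quotes(text):
    """Replace ASCII double quotes by curly quotes using a naive rule."""
    result = []
    for index, char in enumerate(text):
        if char != '"':
            result.append(char)
            continue
        following = text[index + 1] if index + 1 < len(text) else ""
        # Opening quote only when a letter follows; otherwise closing.
        result.append(OPEN_DOUBLE if following.isalpha() else CLOSE_DOUBLE)
    return "".join(result)


good = curl_double_quotes('"Hello," she said.')
bad = curl_double_quotes('"<hi>Word</hi>"')
```

In good, both quotes come out right. In bad, the opening quote precedes a tag and therefore wrongly becomes a closing quote, which is exactly the kind of case that needs manual correction.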

pgpp.pl

Script run initially on the output from PG Distributed Proofreaders, when converting that to TEI (The actual conversion is a manual process, as the resulting output is still very far removed from being an actual TEI file).

Tools for preparing texts for PGDP

pgprepare.pl

Script run to convert a bunch of text files to UTF-8, normalize spacing, and remove unwanted characters.

Merging Texts

Larger works were often issued in multi-volume sets and, as a result, are often also encoded as separate TEI files. However, digital files do not have the restrictions of physically bound volumes, and it is often beneficial to merge two or more TEI files into a single TEI file. This helps in particular when there are many cross-references between the volumes, as happens with an index (which often appears in the last volume).

Steps

1. Create an XSLT 'galley' file

The idea is to create an XSLT stylesheet that will merge the individual volumes, and provide them with a new TEI-header (which in turn can also be put together from elements pulled from the TEI-header of one or more of the source TEI files). Some things to consider:

  • Combined TEI-header. Often using the contents of the first TEI-header, with some minor changes will be enough.

  • Combined tables-of-contents. Here, we may need to have some smart xpath queries to extract the various fragments of the table-of-contents from the various volumes (or simply generate the entire thing anew).

  • Combined body content, which may require the introduction of a new level of divisions ("parts" or "volumes"), which in turn may require adjusting the levels of numbered div elements (e.g. div1 may need to become div2, etc.).

  • Combined material in the front- or back-matter (such as Appendices), which may require some re-numbering.

2. Run the XSLT galley file

Here we run XSLT over the galley file to create a new TEI file. In this step we will collect the content, and put it together to form a new single TEI file.

3. Use the combined file as source

Here we run XSLT again to perform the conversion to the final output format, on the output of step 2 above.

Galley File Support Functions

Most of the hard work happens in the creation of the galley file. For this a number of support functions are needed.

Change Ids

To prevent id-clashes, the ids in all source TEI files need to be changed. This is best done by prefixing them with a unique string for each volume.

However, not all ids should be treated as such, as that would break inter-volume cross-references. To prevent this, such inter-volume cross-references should already have the prefix when they are referred to in other volumes, and the software should be smart enough not to add the prefix once more.

Furthermore, a number of ids have special value to the rendering process, and should be kept verbatim as well. The ids should be preserved (at most once) as-is in one of the source volumes.

Finally, some elements might no longer be needed in the combined volume (such as textual references to another volume, or labels giving the volume number.) They should be filtered out.

To make this easy, we will need to supplement the standard XSLT import function with a smart import that will take care of those changes. See the function f:import-document in merge-documents.xsl for the implementation.
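The id-handling rules described above can be summarized in a small Python sketch. This is not the code of f:import-document; the function name prefix_id, the prefixes v1- and v2-, and the preserved id toc are all made up for the illustration.

```python
def prefix_id(identifier, prefix, known_prefixes, preserved):
    """Return the id to use in the merged file for an id from one volume.

    known_prefixes: prefixes of all volumes (ids that already carry one are
    inter-volume references and are left alone).
    preserved: ids with a special meaning that must be kept verbatim.
    """
    if identifier in preserved:
        return identifier
    if any(identifier.startswith(p) for p in known_prefixes):
        return identifier  # already prefixed: do not add the prefix again
    return prefix + identifier


PREFIXES = {"v1-", "v2-"}   # hypothetical volume prefixes
PRESERVED = {"toc"}         # hypothetical id with special meaning

local = prefix_id("ch1", "v1-", PREFIXES, PRESERVED)
cross = prefix_id("v2-ch3", "v1-", PREFIXES, PRESERVED)
special = prefix_id("toc", "v1-", PREFIXES, PRESERVED)
```

The same mapping has to be applied to the attributes that refer to ids, so that references and their targets stay in step.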

Checking a TEI file

XSLT style-sheet to check various TEI conventions and textual issues.

Metadata

Verify metadata is present and correct.

Rendering

Verify valid rendering attributes are used:

  • valid CSS-based rendition ladders

  • valid tei2html extensions.

Types

Verify types for elements are valid according to the conventions.

Numbering

Verify whether numbered items are in sequence.

  • page-breaks <pb>

  • divisions <div#>

Spacing and Punctuation

Verify punctuation marks are properly spaced.

Quotation Marks

Verify whether quotation marks are applied correctly.

Quotation marks need to be nested correctly, and for each opening mark, there should be a corresponding closing mark.

This is slightly complicated by several facts:

  • different types of usage

    • US: outer double, inner single quotation marks

    • UK: outer single, inner double quotation marks

  • different conventions for languages and countries.

  • usage around other punctuation might vary.

  • non-closed marks are to be re-opened in next paragraph.
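A basic nesting check for curly quotation marks within a single paragraph can be sketched with a stack, as in the Python fragment below. This is an illustration only, not the check the stylesheet performs: the function names are made up, it treats a right single quote between two letters as an apostrophe (so word-final apostrophes will be misread), and it reports marks left open at the end of a paragraph without trying to decide whether they are legitimately re-opened in the next one.

```python
PAIRS = {"\u201c": "\u201d", "\u2018": "\u2019"}  # opening mark -> closing mark


def _is_apostrophe(text, position):
    """A right single quote between two letters is taken to be an apostrophe."""
    before = text[position - 1] if position > 0 else ""
    after = text[position + 1] if position + 1 < len(text) else ""
    return before.isalpha() and after.isalpha()


def check_quotes(paragraph):
    """Return a list of problems with quotation marks in one paragraph."""
    problems = []
    stack = []
    for position, char in enumerate(paragraph):
        if char in PAIRS:
            stack.append(char)
        elif char in PAIRS.values():
            if char == "\u2019" and _is_apostrophe(paragraph, position):
                continue
            if stack and PAIRS[stack[-1]] == char:
                stack.pop()
            else:
                problems.append(f"unexpected {char} at position {position}")
    for char in stack:
        problems.append(f"unclosed {char}")
    return problems


ok = check_quotes("\u201cHe said \u2018no\u2019 and left.\u201d")
open_at_end = check_quotes("\u201cDon\u2019t go.")
```

The first call finds no problems; the second reports the opening double quote that is never closed.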

Generating a KWIC index

For quite some time, I wanted to be able to generate a KWIC (Key Word In Context) index of TEI files. With a KWIC, you can quickly inspect the usage of certain words in a context, which helps to determine how words are used.

Requirements

The KWIC generator should:

  • Show all words in their context.

  • Indicate the page on which they appear.

  • Indicate the (tagged) language of the keyword.

  • Be able to show words in italics (and other font-styles).

  • Break up a text into segments to have meaningful contexts.

  • Ignore other tags in the text.

  • Handle text in footnotes correctly.

  • Present the resulting KWIC in HTML.

  • Work with all Unicode supported scripts.

  • Ignore case and accent variants (but they should be preserved and signalled).

  • Ignore differences between common ligatures and loose characters (such as æ and ae)

  • Properly handle a mixture of left-to-right and right-to-left scripts (such as Hebrew and Arabic).

  • Optionally ignore differences between look-alike characters or sequences (such as b and h, or rn and m)

  • Optionally only show words of which more than one variant appears after ignoring minor differences.
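The requirements about ignoring case, accents and ligatures amount to comparing words by a normalized key while keeping the original spelling for display. A possible key function is sketched below in Python. It is not taken from xml2kwic.xsl; the name match_key and the short ligature table are assumptions of this example.

```python
import unicodedata

# Ligatures that Unicode normalization does not split (illustrative list).
LIGATURES = {"\u00e6": "ae", "\u0153": "oe"}


def match_key(word):
    """Return a key that ignores case, accents and common ligatures."""
    folded = word.casefold()
    for ligature, loose in LIGATURES.items():
        folded = folded.replace(ligature, loose)
    # Decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


keys = [match_key(w) for w in ["\u00c6ther", "aether", "Caf\u00e9", "CAFE"]]
```

Words with the same key are grouped together; the variants themselves remain available, so that differences can still be signalled.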

Development

For a long time, I have felt that XSLT 2.0 was the right tool for the job. XSLT 1.0 lacked the required regular expression facilities to break text into words effectively, while other programming languages, such as Perl or Python, are much less suited to handling XML textual data.

The first idea was that you can easily use xsl:analyze-string to split up fragments of text into words, using a regular expression (the expression is a variant of: [\p{L}\p{N}\p{M}-]+ ). Given those words, you can then traverse the text, looking for the word you are interested in, and determine the preceding and following words, by following the preceding-sibling and following-sibling axes.
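
For readers who do not know XSLT, the word-splitting idea can be sketched in Python. This is an illustration only, not the stylesheet's code. Python's standard re module has no \p{L} classes, so the sketch tests Unicode general categories directly; the function names are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_word_char(ch: str) -> bool:
    """True for letters, numbers, combining marks, and the hyphen."""
    return ch == "-" or unicodedata.category(ch)[0] in "LNM"


def split_words(text: str) -> list[str]:
    """Split text into maximal runs of word characters."""
    return ["".join(run) for is_word, run in groupby(text, key=is_word_char) if is_word]


def neighbours(words: list[str], keyword: str, size: int = 2):
    """Yield (preceding words, keyword, following words) for each occurrence."""
    for i, word in enumerate(words):
        if word == keyword:
            yield words[max(0, i - size):i], word, words[i + 1:i + 1 + size]


words = split_words("The well-known naïve café, 42 times!")
print(words)  # ['The', 'well-known', 'naïve', 'café', '42', 'times']
```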

This worked fine, but quickly showed that not all context is equally relevant. Basically, you need not look further than paragraph boundaries and their equivalents in various parts of a document. So, before splitting into words, we divide the text into segments, and flatten that segment structure. At the same time, we need to take care to lift non-sequential items, such as footnotes, out of the context (and handle them elsewhere).
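
A rough Python sketch of this segmenting step follows. The set of segment-level tags is an assumption chosen for illustration, the input is assumed to have no XML namespace, and nested segments are simply flattened into their outermost segment.

```python
import xml.etree.ElementTree as ET

SEGMENT_TAGS = {"p", "head", "l", "item", "cell"}  # assumed paragraph equivalents
LIFTED_TAGS = {"note"}                             # handled as separate segments


def flatten(elem, lifted: list) -> str:
    """Return the text of elem without tags; notes are moved to `lifted`."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag in LIFTED_TAGS:
            lifted.append(child)
        else:
            parts.append(flatten(child, lifted))
        parts.append(child.tail or "")  # text after the child belongs to elem
    return "".join(parts)


def segments(elem, out=None) -> list[str]:
    """Collect flattened segments in document order, each note after its host segment."""
    out = [] if out is None else out
    if elem.tag in SEGMENT_TAGS or elem.tag in LIFTED_TAGS:
        lifted = []
        text = " ".join(flatten(elem, lifted).split())
        if text:
            out.append(text)
        for note in lifted:
            segments(note, out)
    else:
        for child in elem:
            segments(child, out)
    return out


doc = ET.fromstring(
    '<body><div1><head>Title</head>'
    '<p>Some <hi>italic</hi> text<note n="1">A footnote.</note> continues.</p>'
    '</div1></body>'
)
print(segments(doc))  # ['Title', 'Some italic text continues.', 'A footnote.']
```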

Having done this for one word, it is fairly easy to iterate over all words in the document and build a KWIC for each of them. However, this is a very inefficient way to do it if we want a KWIC index for all words. So instead, we make a single pass through the text, starting with the first word and collecting contexts as we go, and then sort them into the desired order afterwards (by word, and then either by the following context or by the preceding context read backwards, that is, retrograde).
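
The single-pass approach can be sketched in Python as follows. This illustrates the idea under simplified assumptions (segments are already split into words, and sorting ignores case); it does not reflect the actual code in the stylesheet.

```python
def build_kwic(segments: list[list[str]], size: int = 3) -> list[tuple]:
    """Collect (keyword, preceding, following) for every word, then sort once."""
    entries = []
    for words in segments:  # context never crosses a segment boundary
        for i, word in enumerate(words):
            entries.append((word, words[max(0, i - size):i], words[i + 1:i + 1 + size]))
    # Order by keyword, then by the following context.
    entries.sort(key=lambda e: (e[0].casefold(), [w.casefold() for w in e[2]]))
    return entries


kwic = build_kwic([["the", "cat", "sat"], ["the", "dog", "sat"]], size=1)
for keyword, before, after in kwic:
    print(" ".join(before).rjust(10), keyword.center(8), " ".join(after))
```

For the retrograde ordering, the second part of the sort key would be the preceding words in reverse order instead of the following words.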

Generating a KWIC unfortunately consumes a considerable amount of memory, so some tuning is required to be able to handle large texts. KWIC indexes tend to get fairly big: a 2 megabyte text can result in an 80 megabyte KWIC index, especially if no stop-word-list is used to remove the most common words.

The result is xml2kwic.xsl, about 500 lines of XSLT code (including documentation comments), that produces a KWIC index from a TEI file in a few seconds. This can be used in two modes. In the first, a KWIC is generated for every word in the document (including such common words as 'the' or 'a'). In the second, a word or list of words is provided to the script, and a KWIC of only those words is built, which is handy, for example, to compare the usage of two different spellings or synonyms.

The XSLT script currently accepts the following parameters:

  • keyword: one or more keywords to generate a KWIC for.

  • select-language: one or more (space separated) codes of languages to generate a KWIC for.

  • case-sensitive: whether case is significant; when false, casing is folded to lowercase (one of true or false).

  • min-variant-count: the minimum number of variants required for a word to be reported upon.

  • mixup: a number of (space separated) symbols that need to be treated as equal.

  • context-size: the size of the context being shown in the report-out.

Calling the script has been integrated into tei2html.pl.

A particularly useful way of generating a KWIC is with:

tei2html -k --kwicvariants=2 --kwiccasesensitive=true

Having fun with KWIC

This little KWIC tool has already helped me to locate numerous small issues in Project Gutenberg texts that would otherwise have escaped the attention of proofreaders. In particular, variant spellings of names and inconsistent usage of hyphenated versus non-hyphenated words are often difficult to catch, while in a KWIC, they simply jump out at you.

Some issues in texts I’ve found with this script:

  • Variant ways of referring to cited works.

  • Spelling variants in names. Personal names are normally not known to spelling checkers, and can be spelled in various ways. An added complication is that another person may have a similar name, so you really need the context to make that out.

  • Checking a (preexisting) index. The index entry will show up in the KWIC with the page number it lists, together with the referenced phrase (and page number).

  • Missing periods after common abbreviations.

  • Inconsistent styling. Foreign words that are mostly given in italics, but sometimes not.

  • Hyphenated versus non-hyphenated words.

Future Ideas

CSS stylesheet redesign [TODO]

The CSS stylesheets with tei2html have grown over time, and as a result are not very consistent or easy to maintain. To remedy this, I will redesign them using a more structured approach.

This will involve the following:

  1. reset.css – reset all browser dependent settings, to achieve better consistency across browsers.

  2. use the BEM philosophy, to make the stylesheets easier to maintain.

BEM for ePub

BEM (Block Element Modifier) was developed to make the maintenance of CSS for websites easier. This does not always translate one-to-one to ebooks, but most of the principles can be applied without much trouble.

The main blocks that can easily be identified:

  • Cover (either newly designed or original, or both)

  • Title page (either newly designed or original, or both)

  • Text (no distinction made between front matter, main body, and back matter, as they are normally typographically treated the same.)

  • Advertisements (reproduced advertisements from the original book)

  • Colophon (our own metadata regarding the book)

Common elements in books:

  • Figures and plates

  • Tables of Contents

  • Tables

  • Lists

  • Index

Parsing CSS [Experimental]

If we could easily parse CSS with XSLT, this would make it far easier to generate optimized HTML. Currently, we include the full CSS stylesheets with even the simplest texts, which increases clutter. If we parse a CSS stylesheet, and only retain those CSS rules that actually apply, both file size and clutter are reduced.

Current Implementation

The current implementation is in css2xml.xsl, which is incomplete: it does not yet parse all valid CSS 2.1 stylesheets. In particular, @media groups are not supported. The CSS is parsed in a number of passes:

  1. tokenize with a simple regular expression based 'lexer'

  2. group things between braces using 'sibling recursion' to match open braces with closing braces.

  3. group properties, based on semicolons.

  4. determine key-value pairs, based on the colon.

  5. determine selectors.

  6. clean-up parsing artifacts.
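
To show what these passes amount to, here is a much-simplified Python sketch (not the XSLT in css2xml.xsl). It assumes flat rule sets only: nested groups such as @media, and braces or semicolons inside strings, are not handled. All function names are hypothetical.

```python
import re


def tokenize(css: str) -> list[str]:
    """Pass 1: drop comments, then split into braces and the text between them."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)
    return [t.strip() for t in re.findall(r"[{}]|[^{}]+", css) if t.strip()]


def parse_declarations(body: str) -> dict[str, str]:
    """Passes 3 and 4: split on semicolons, then split each property on its first colon."""
    declarations = {}
    for part in body.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip()] = value.strip()
    return declarations


def parse(css: str) -> list[dict]:
    """Passes 2 and 5: pair each selector list with the block that follows it."""
    rules = []
    selector, body, inside = "", "", False
    for token in tokenize(css):
        if token == "{":
            inside = True
        elif token == "}":
            rules.append({
                "selectors": [s.strip() for s in selector.split(",")],
                "declarations": parse_declarations(body),
            })
            selector, body, inside = "", "", False
        elif inside:
            body = token
        else:
            selector = token
    return rules


css = "/* headings */ h1, h2.title { color: red; margin: 0 auto } a:hover { text-decoration: underline; }"
for rule in parse(css):
    print(rule)
```

Splitting on the first colon only matters for values that themselves contain a colon, such as URLs.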

Items remaining to do are:

  • Follow the tokenization defined in CSS 2.1 more closely.

  • Properly understand all valid CSS 2.1 features.

  • Implement a way to translate CSS selectors to XPath expressions.

Usage

Having CSS in XML makes it easier to determine from within XSLT which CSS rules can be removed or rewritten in a simpler form.

  • Determine which CSS rules are actually used in an HTML document.

  • Determine which CSS rules are overruled within the same CSS stylesheet.

As a result, the CSS stylesheet can be pruned, and only the relevant rules emitted in the output.
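
The pruning idea can be sketched in Python as follows. This is a deliberately conservative approximation, not an implementation of CSS selector matching: a rule is dropped only when it names an element or class that does not occur anywhere in the document, and IDs, attributes, and pseudo-classes are ignored. The names and the (selector, declarations) rule format are assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class UsageCollector(HTMLParser):
    """Record every element name and class name used in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags, self.classes = set(), set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def selector_may_match(selector: str, tags: set, classes: set) -> bool:
    """False only if the selector names an element or class absent from the document."""
    for compound in re.split(r"[\s>+~]+", selector.strip()):
        compound = re.sub(r":{1,2}[\w-]+(\([^)]*\))?", "", compound)  # ignore pseudo-classes
        element = re.match(r"[A-Za-z][\w-]*", compound)
        if element and element.group(0).lower() not in tags:
            return False
        if any(c not in classes for c in re.findall(r"\.([\w-]+)", compound)):
            return False
    return True


def prune(rules: list[tuple[str, str]], html: str) -> list[tuple[str, str]]:
    """Keep only the (selector, declarations) pairs that may apply to the document."""
    usage = UsageCollector()
    usage.feed(html)
    return [(sel, decl) for sel, decl in rules
            if selector_may_match(sel, usage.tags, usage.classes)]


html = '<div class="figure left"><p>Text</p></div>'
rules = [("p", "margin: 0"), ("div.figure", "float: left"), ("table td", "padding: 2px"),
         (".figure .caption", "font-size: small"), ("p:first-child", "text-indent: 0")]
print([sel for sel, _ in prune(rules, html)])  # ['p', 'div.figure', 'p:first-child']
```

A complete solution would translate each selector to an XPath expression and evaluate it against the transformation result, as noted in the items remaining to do.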

References

  • Some great implementation ideas are in json2xml

  • Need to translate CSS selectors to XPath.

  • Need to keep transformation result in variable to do tests against.