Foreign words and phrases

Encoding foreign-language words and phrases using the lang or xml:lang attribute on existing elements, and the foreign element when necessary

Encoding of foreign words and phrases can be useful for two fairly distinct reasons. The first and more important is to allow these words to be treated differently (by search engines, spell-checkers, speech synthesis software), in a way that may be specific to the language in question. This function is fulfilled by the lang attribute in P4, and by the xml:lang attribute in P5. Each is available on every P4 and P5 element, respectively, and can be used to identify the language of any chunk of text, however large. While it may not be necessary to specify the base language of every text (unless you are working with a multilingual collection), it can be very useful to identify the language of any significant chunk of text in a language other than the base language. Where possible, this identification (using the lang or xml:lang attribute) should be placed on the encoding that is already present in the text (div, p, quote, etc.). Where no element is already present (e.g. for single words or phrases), the foreign element should be used.
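
As a minimal sketch of the placement rule just described, the P5 fragment below marks the language once on an element that is already present (a div) and falls back to foreign for a single word that has no element of its own. The sentences are invented for illustration and are not drawn from any particular text.

```xml
<body xml:lang="en">
  <!-- An element is already present, so the language is marked on it -->
  <div xml:lang="la">
    <p>Gallia est omnis divisa in partes tres.</p>
  </div>

  <!-- No element already surrounds the word, so foreign is used -->
  <div>
    <p>She signed the letter with a cheerful
      <foreign xml:lang="it">ciao</foreign>.</p>
  </div>
</body>
```

In P4 the same pattern would apply, with lang in place of xml:lang.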

The second, less important reason for identifying foreign-language words and phrases is purely presentational: such words are often rendered distinctively (for example, in italics) simply because they are in another language. The foreign element serves this purpose.

Language identification in P4

The lang attribute does not identify the language being used all by itself; it does so by pointing to a language element stored in the TEI header, inside langUsage. There should be a language element for every language used in the document (including the base language). The language element carries an id attribute, whose value should be the appropriate language tag (not to be confused with an XML tag used to delimit an element) for the language in question. For details on how to construct such a language tag, see Language tags.

The lang attribute is an IDREF that points to this id value. This indirection lets you describe each language used in the document once, in a central location: the content of the language element may describe the language in plain prose, in whatever detail is desired (see the examples).

Note that the value of the id attribute of language, and thus of the lang attribute of almost any element, may be arbitrary, but we strongly recommend using the appropriate language tag. Besides having a clear and obvious meaning to human readers, language tags make conversion to XHTML and migration to TEI P5 easier.
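
To illustrate the difference, here is a small P4 sketch in which one language element uses a language tag as its id and another uses an arbitrary id. The id value lang2 and the sample sentence are hypothetical, chosen only to show that both forms resolve in the same way.

```xml
<!-- in the TEI header -->
<langUsage>
  <!-- Recommended: the id is itself a language tag -->
  <language id="de">German</language>
  <!-- Permitted but not recommended: an arbitrary id (hypothetical) -->
  <language id="lang2">Latin</language>
</langUsage>

<!-- in the text -->
<p>The <foreign lang="de">Zeitgeist</foreign> favoured
  <foreign lang="lang2">ad hoc</foreign> solutions.</p>
```

Both lang values resolve to a language element, but only the first can be carried over unchanged as an xml:lang value in P5.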

Language identification in P5

In general, the xml:lang attribute identifies the language directly: its value is, by definition, a language tag, which identifies a language, possibly with sublanguage and script information. For details on how to construct such a language tag, see Language tags.
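
The following fragment is a sketch of what such tags can look like in practice. The tags en-GB (language plus region), el-Latn (language plus script), and fr (language alone) are chosen for illustration, and the sentence is invented.

```xml
<!-- language + region subtag: English as used in Great Britain -->
<div xml:lang="en-GB">
  <!-- language + script subtag: Greek written in Latin characters -->
  <p>He dismissed the crowd as
    <foreign xml:lang="el-Latn">hoi polloi</foreign>
    and went back to his
    <!-- language subtag alone: French -->
    <foreign xml:lang="fr">café au lait</foreign>.</p>
</div>
```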

The above mechanism works for the vast majority of the world’s languages. However, some languages have not been registered and thus have no language tag. Furthermore, there may be dialects, or details about a language, that you find important to express but for which no proper language tag can be constructed. In these cases special private use language tags may be constructed. (They are described in Language tags.)

Whenever a private use language tag is used, a language element with an ident attribute that matches the private use tag must be present in the langUsage element in the TEI header. The content of this element should describe, in prose, the language identified.

When normal (i.e. non-private use) language tags are used, a corresponding language element may be used to further describe the language, but it is not required.
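
A minimal sketch of both cases follows. The private use tag x-utopian, the invented phrase, and the prose descriptions are all hypothetical. The tag assumes the common convention of an x- prefix for private use tags; see Language tags for the authoritative rules.

```xml
<!-- in the TEI header -->
<langUsage>
  <!-- Required: x-utopian is a private use tag (hypothetical) -->
  <language ident="x-utopian">The invented language spoken by the
    islanders in this text; it has no registered language tag.</language>
  <!-- Optional: la is a normal tag, described here only to add detail -->
  <language ident="la">Latin, with medieval spellings</language>
</langUsage>

<!-- in the text -->
<p>The islanders greeted us with
  <foreign xml:lang="x-utopian">velu ahma</foreign>, which our guide
  glossed as <foreign xml:lang="la">pax vobiscum</foreign>.</p>
```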

Examples

Example 1 (P4): a document with multiple languages

<TEI.2>
  <!-- in the TEI header -->
  <langUsage>
    <language id="en-US">English (American, circa 1850)</language>
    <language id="fr">French</language>
  </langUsage>
  
  <!-- in the text -->
  <text lang="en-US">
    <div>
      <p>There was no doubt that <quote lang="fr">plus &ccedil;a change,
        plus c’est la m&ecirc;me chose</quote>, and yet he doubted he
        would ever have the <foreign lang="fr">sang-froid</foreign> to say
        so.</p>
    </div>
  </text>
</TEI.2>

Example 1 (P5): a document with multiple languages

<TEI xml:lang="en-US"
     xmlns="http://www.tei-c.org/ns/1.0">

  <!-- in the TEI header -->
  <langUsage>
    <language ident="en-US">English (American, circa 1850)</language>
  </langUsage>

  <!-- in the text -->
  <text>
    <div>
      <p>There was no doubt that <quote xml:lang="fr">plus &ccedil;a change,
        plus c’est la m&ecirc;me chose</quote>, and yet he doubted he
        would ever have the <foreign xml:lang="fr">sang-froid</foreign> to say
        so.</p>
    </div>
  </text>
</TEI>

Note that no language element is needed for French, as there is nothing more to say about it.