Discussion of specific letterforms in the WWP collection, including long s and the disambiguation of I and J and of U and V
Older texts often use forms of letters which are no longer in use and which may be unfamiliar or ambiguous. The following is a list of difficult letterforms and how we encode them.
Long s and long s ligatures:
The old-style long s in roman type looks like an f without the crossbar, or with a very short crossbar extending only to the left. In italics it looks like an integral sign. In both cases it is encoded with the entity &s;. A long s ligatured to a normal s looks somewhat like a distorted upper-case italic B, or like the German s-zed ligature (ß). We do not encode this as a special character; we transcribe it as a long s (&s;) followed by a short s. More generally, we do not use entity references for any ligatured characters (e.g. st, sc, sf); we simply transcribe the two characters involved as ordinary characters.
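The long s rule above can be sketched in a few lines of Python. This is only an illustration, not a WWP tool: it assumes the input is a Unicode transcription in which long s appears as U+017F and some ligatures appear as precomposed Unicode characters, and the names `LONG_S`, `LIGATURES`, and `encode_long_s` are hypothetical.

```python
# Hypothetical helper, not part of any WWP software: a minimal sketch of
# the long s and ligature rules, assuming Unicode input.

LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S

# Precomposed ligatures are split into their component letters.
# Only a few Unicode ligature characters are listed here as examples.
LIGATURES = {
    "\ufb05": LONG_S + "t",  # LATIN SMALL LIGATURE LONG S T
    "\ufb06": "st",          # LATIN SMALL LIGATURE ST
    "\ufb00": "ff",          # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",          # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",          # LATIN SMALL LIGATURE FL
}


def encode_long_s(text: str) -> str:
    """Split ligatures into ordinary characters, then encode long s as &s;."""
    for ligature, parts in LIGATURES.items():
        text = text.replace(ligature, parts)
    # Every remaining long s, roman or italic, becomes the same entity.
    return text.replace(LONG_S, "&s;")


if __name__ == "__main__":
    # A long s followed by a short s is transcribed as &s; plus s.
    print(encode_long_s("ble\u017fsing"))
```

The ligatures are expanded before the entity substitution so that a long s inside a ligature is also caught. The sketch deliberately leaves ß untouched, since in Unicode text that character may be a genuine German letter rather than a long s ligature; deciding between the two needs a human reading of the source.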
I and J:
In lower case, the difference between i and j is clear and unambiguous. In upper-case italics, there is often only one letter-form, which looks like a long J with a cross-bar in the middle. Despite appearances, this is an I and should be encoded as such (with vuji as necessary). Some texts in which this character appears also contain instances of the more familiar italic capital I, which looks like a roman I with a slant. In such texts both characters should be encoded with I, using vuji as necessary. This seems counterintuitive, but is necessary to preserve consistency across the textbase.
Similarly, in blackletter, the character which looks like a capital I with a slight leftward curve at the bottom and a crossbar in the middle is an I and should be encoded using I (with vuji as necessary).
U and V:
There are several forms of the lower-case letter v: with a pointed bottom, with a rounded bottom, and with a decorative swash. There are also several forms of the upper-case letter V: with a pointed bottom or with a rounded bottom. All of these should be encoded as v or V, with vuji as necessary. They can be distinguished from u and U by the fact that u and U have a tail on the right-hand side.
It is important to maintain consistency from text to text in how we encode a given letter-form. Thus in texts which have two forms of v, but no u, both forms of v should still be encoded as v, rather than assigning a u to one and a v to the other. The WWP does not preserve different forms of the same letter in other cases (for instance, curly versus rectilinear upper-case E) and hence there is no reason to make an exception for v.