<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="../stylesheets/yaps-tei.css"?>
<?oxygen RNGSchema="../schema/yaps.rnc" type="compact"?>
<?oxygen SCHSchema="../schema/yaps.sch"?>
<TEI xmlns="http://www.wwp.brown.edu/ns/yaps/1.0" xmlns:xi="http://www.w3.org/2001/XInclude" version="5.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Introduction to XML</title>
        <author>Syd Bauman</author>
      </titleStmt>
      <xi:include href="./boilerplate_publicationStmt.xml">
        <xi:fallback>
          <publicationStmt status="restricted">
            <note type="auto">WARNING: XInclude processing failed &#x2014; this file should not be copied or
            used (and is invalid) as a result.</note>
          </publicationStmt>
        </xi:fallback>
      </xi:include>
      <sourceDesc>
        <p>Based on the same talk from the 2005-06 U. Victoria TEI
          workshop.</p>
      </sourceDesc>
    </fileDesc>
    <revisionDesc>
      <change who="#sbauman.emt" when="2008-06-18">made updates based
      on recent changes to xml_intro for Univ. WA</change>
      <change who="#sbauman.emt" when="2008-02-10">updated based on recent Buffalo talk</change>
      <change>automatically converted from conforming to
        presentation.odd to conforming to yaps.odd, using
        p2y.xslt and p2y.perl</change>
    </revisionDesc>
  </teiHeader>
  <text>
    <presentation>

      <section>
        <head>XML</head>
        <slide>
          <p>
            <soCalled>eXtensible Markup Language</soCalled>
          </p>
          <p>A markup language that is extensible (unlike HTML).</p>
          <p>Extensible because it is not really a markup language.</p>
        </slide>
      </section>

      <section>
        <head>Markup Language</head>
        <slide>
          <list>
            <head>A markup language has</head>
            <item>a vocabulary, and</item>
            <item>a grammar</item>
          </list>
        </slide>
        <lectureNote>
          <p>By <q>markup language</q> I mean a vocabulary (i.e., the
          set of elements that have meaning) and a grammar (i.e., how
          those elements relate to one another).</p>
        </lectureNote>
      </section>

      <section>
        <head>Examples</head>
        <slide>
          <p>DocBook is a markup language: <list rend="class(nobul)">
              <item>vocabulary includes: <soCalled>article</soCalled>, <soCalled>title</soCalled>,
                and <soCalled>paragraph</soCalled></item>
              <item>grammar states: <soCalled>paragraph</soCalled> is not allowed inside
                  <soCalled>title</soCalled></item>
            </list></p>
          <p>OFX (Open Financial Exchange) is a markup language: <list rend="class(nobul)">
              <item>vocabulary includes: <soCalled>BANKACCTINFO</soCalled> (bank account
                information), and <soCalled>SUPTXDL</soCalled> (supports transaction download)</item>
              <item>grammar states: <soCalled>SUPTXDL</soCalled> is required inside
                  <soCalled>BANKACCTINFO</soCalled></item>
            </list></p>
        </slide>
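        <lectureNote>
          <p>To make the DocBook case concrete, here is a minimal sketch of an
          instance. It assumes DocBook's own element names
          (<soCalled>para</soCalled> is the actual spelling of what the slide
          calls <soCalled>paragraph</soCalled>), leaves out any version or
          namespace declarations, and uses made-up text:
            <eg><![CDATA[<article>
  <title>On Wines</title>
  <para>White wines ... </para>
  <para>Red wines ... </para>
</article>]]></eg>
          </p>
          <p>Moving one of the <soCalled>para</soCalled> elements inside the
          <soCalled>title</soCalled> would still use only words from the
          vocabulary, but it would break the grammar.</p>
        </lectureNote>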
      </section>

      <section>
        <head>Extensible</head>
        <slide>
          <p>Not a language, but rather a meta-language<list>
              <item>methods of defining markup languages</item>
              <item>syntax for expressing markup languages</item>
            </list>
          </p>
          <p>XML has no tags of its own, but instead defines the
          syntax of tags; it defines no vocabulary or grammar of its
          own, but does tell you how to define a vocabulary and
          grammar.</p>
        </slide>
        <lectureNote>
          <list>
            <head>methods of defining markup languages</head>
            <item>rules for declaring what the vocabulary of a markup language is, and for writing
              the rules of its grammar</item>
            <item>e.g., rules for saying that <soCalled>title</soCalled> is a thing in my markup
              language, and that <soCalled>paragraphs</soCalled> are not allowed inside it.</item>
          </list>
          <list>
            <head>syntax for expressing markup languages</head>
            <item>rules for how to incorporate a markup language into a document, a stream of text.</item>
            <item>i.e., rules for differentiating markup from content</item>
            <item>e.g., <q>when you see <tag>title</tag> that means a <soCalled>title</soCalled>
                thing has begun</q></item>
          </list>
        </lectureNote>
      </section>

      <section>
	<head><soCalled>Boxes in Boxes</soCalled> Representation</head>
	<slide>
	  <figure>
	    <graphic url="./gfx/boxes_book2.png" height="100%"/>
	    <figDesc>Classic boxes-inside-boxes representation of a mythical
	    book that contains an introduction, two chapters, and an index,
	    where each chapter contains a heading and two sections</figDesc>
	  </figure>
	</slide>
      </section>

      <section>
	<head>Tree Representation</head>
	<slide>
	  <figure>
	    <graphic url="./gfx/tree_book1.png" width="100%"/>
	    <figDesc>Classic tree representation of a mythical
	    book that contains an introduction, two chapters, and an index,
	    where each chapter contains a heading and two sections</figDesc>
	  </figure>
	</slide>
      </section>

      <section>
	<head>XML Representation</head>
	<slide>
          <eg><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<book>
  <introduction>Blah blah blah ... </introduction>
  <chapter>
    <heading>Wines</heading>
    <section>White wines ... </section>
    <section>Red wines ... </section>
  </chapter>
  <chapter>
    <heading>Beers</heading>
    <section>Ales ... </section>
    <section>Lagers ... </section>
  </chapter>
  <index> stuff ... </index>
</book>]]></eg>
	</slide>
      </section>

      <section>
        <head>XML languages vary greatly</head>
        <slide>
          <list>
            <item>Many different purposes (financial data,
              linguistics, literary texts, technical manuals&#x2026;)</item>
            <item>Many different kinds of markup (structure,
              content, interpretation, appearance&#x2026;)</item>
            <item>Many different user communities (IRS, airplane
              manufacturers, literary scholars, librarians&#x2026;)</item>
          </list>
        </slide>
      </section>

      <section>
        <head>Why XML?</head>
        <slide>
          <p>XML is
	  <list>
	    <item>easy to understand;</item>
	    <item>non-proprietary plain-text:
	    <list>
	      <item>human readable,</item>
	      <item>software independent,</item>
	      <item>hardware independent;</item>
	    </list></item>
	    <item>(relatively) easy to write a parser for;</item>
	    <item>widespread: very well supported by both commercial
	    and open source software.</item>
	  </list></p>
        </slide>
	<lectureNote>
	  <p>I have to admit that when I say <said>XML is easy</said>, I
	  am really referring to <emph>XML</emph> alone. In order to really use
	  the XML universe you need a lot more.
	  <cit>
	    <quote>When people say "XML is hard", they usually do not
	    mean "XML 1.0 is hard" but "XML 1.0 + namespaces in XML +
	    XPath + DOM + XSLT + W3C XML Schema + XML Base + xml:id +
	    XInclude + XPointer + ... is hard" and the proportion of
	    criticism that goes to XML 1.0 itself is usually pretty
	    low.</quote>
	    <ref target="http://www.cafeconleche.org/quotes2008.html#quote2008February13">
	      <name>Eric van der Vlist</name> on the
	      <title>xml-dev</title> mailing list, <date when="2008-02-12">Tuesday, 12 Feb 2008
	      08:28:05</date></ref>
	  </cit>
	  </p>
	</lectureNote>
      </section>

      <section>
        <head>XML Basics</head>
        <slide>
          <p>XML is a metalanguage
	  <list>
	    <item>No tags or attributes of its own</item>
	    <item>Instead, a set of rules for defining tags and
	    attributes</item>
	    <item>Imposes no constraints on elements and attributes in
	    document</item>
	    <item>Instead, defines how rules for such constraints are
	    written</item>
	  </list>
          </p>
        </slide>
      </section>

      <section>
        <head>Everything is Delimited</head>
        <slide>
          <p>Text is divided into <term>elements</term> (the
              <soCalled>nouns</soCalled> of the encoding —
              <term>content objects</term>).</p>
          <list>
            <item>elements by <term>start-tags</term> and <term>end-tags</term><eg><hi rend="class(current)">&lt;heading&gt;</hi><![CDATA[Wines]]><hi rend="class(current)">&lt;/heading&gt;</hi></eg></item>
            <item>start-tags by <code>&lt; &#x2026; &gt;</code><eg><hi rend="class(current)">&lt;</hi><![CDATA[heading]]><hi rend="class(current)">&gt;</hi></eg></item>
            <item>end-tags by <code>&lt;/ &#x2026; &gt;</code><eg><hi rend="class(current)">&lt;/</hi><![CDATA[heading]]><hi rend="class(current)">&gt;</hi></eg></item>
            <item>special case: short-hand for an element with no content<eg><hi rend="class(current)">&lt;</hi><![CDATA[anchor]]><hi rend="class(current)">/&gt;</hi><![CDATA[ = <anchor></anchor>]]></eg></item>
          </list>
        </slide>
      </section>

      <section>
        <head>Example Elements</head>
        <slide>
          <p>
            <eg><![CDATA[<name>Melinda Van Wingen</name>]]></eg>
            <eg><![CDATA[<p>Call me Ishmael. Some years ago—never mind how
long precisely —having little or no money in my purse,
and nothing particular to interest me on shore … </p>]]></eg>
            <eg><![CDATA[<l>To be or not to be, that is the question</l>
<l>Whether 'tis nobler in the mind to suffer</l>
<l>the slings and arrows of outrageous fortune</l>]]></eg>
            <eg><![CDATA[<p>Owl lived at The Chestnuts, an old-world residence
<lb/>of great charm, which was grander than anybody
<lb/>else's, or seemed so to Bear, because it had both a
<lb/>knocker <emph>and</emph> a bell-pull … </p>]]></eg>
          </p>
        </slide>
      </section>

      <section>
        <head>Everything's Delimited: attributes</head>
        <slide>
          <p>Elements have attributes (much as nouns have
            adjectives). </p>
          <eg><![CDATA[<name type="person">]]></eg>
          <eg><![CDATA[<lg type="stanza"
    rhyme="abab"
    rend="slant(italic)">]]></eg>
          <list>
            <item>feature-value pairs</item>
            <item>syntax: <code> name="value"</code> or <code> name='value'</code></item>
            <item>always specified in the start-tag (or the
	  empty-element tag)</item>
          </list>
        </slide>
      </section>

      <section>
        <head>Example Elements with Attributes</head>
        <slide>
          <p>
            <eg><![CDATA[<name type='city'>Victoria</name>]]></eg>
            <eg><![CDATA[<measure quantity="76" unit="L" commodity="gasoline">20 gals</measure>]]></eg>
            <eg><![CDATA[<l n="Ham.1710">To be or not to be, that is the question</l>
<l n="Ham.1711">Whether 'tis nobler in the mind to suffer</l>
<l n="Ham.1712">the slings and arrows of outrageous fortune</l>]]></eg>
            <eg><![CDATA[<book title="Better Living Through TEI"
      author="Mark Upgood"
      cost="CAD 8.11"
      stock='12' />]]></eg>
          </p>
        </slide>
      </section>

      <section>
	<head>Anatomy of an Element</head>
	<slide>
	  <figure>
	    <graphic url="./gfx/anatomy-element.png" width="960px"/>
	  </figure>
	</slide>
      </section>

      <section>
	<head>Sample text</head>
	<slide>
	  <eg>&#xA0;&#xA0;&#xA0;&#xA0;<hi rend="class(interest)">Warp Speed, Ms Bright!</hi>
There was a young lady named Bright,
Who travelled much faster than light,
She departed one day,
In a relative way,
And returned on the previous night. </eg>
	</slide>
      </section>

      <section>
	<head>Sample Document Instance</head>
	<slide>
          <eg><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<lg type="limerick" rhyme="aabba" n="3">
  <head>Warp Speed, Ms Bright!</head>
  <l>There was a young lady named <rhyme label="a">Bright</rhyme>,</l>
  <l>Who travelled much faster than <rhyme label="a">light</rhyme>,</l>
  <l>She departed one <rhyme label="b">day</rhyme>,</l>
  <l>In a <term xml:id="t17">relative</term> <rhyme label="b">way</rhyme>,</l>
  <l>And returned on the previous <rhyme label="a">night</rhyme>.</l>
  <note target="#t17">See
    <ptr target="http://en.wikipedia.org/wiki/Theory_of_relativity"/>.</note>
</lg>]]></eg>
	</slide>
      </section>

      <section>
	<head>Sample (simplified) Tree</head>
	<slide>
	  <figure>
	    <graphic url="./gfx/tree_limerick1.png" width="100%"/>
	    <figDesc>Classic simplified tree diagram of the limerick
	    <title>Warp Speed, Ms Bright!</title> that uses a dashed
	    curved arrow to represent the link from the <gi>note</gi>
	    to the <gi>term</gi>, and a dotted arrow to indicate the
	    link from the <gi>ptr</gi> to the Wikipedia article. It is
	    <soCalled>simplified</soCalled> because it does not
	    contain attribute (or text) nodes, and is missing an
	    <gi>l</gi>.</figDesc>
	  </figure>
	</slide>
      </section>

      <section>
        <head>Everything's Delimited: character references</head>
        <slide>
          <p>To refer to a character that is not on your keyboard,
            delimit its ISO 10646 (or Unicode) code-point with <list>
<item><code>&amp;#</code> and <code>;</code> for
                decimal values, or</item>
<item><code>&amp;#x</code> and <code>;</code> for
                hexadecimal values.</item></list>
            <eg>&lt;l&gt;Whether <hi>&amp;#x2019;</hi>tis nobler in the mind to suffer&lt;/l&gt;</eg>
          </p>
        </slide>
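        <lectureNote>
          <p>The two forms are interchangeable: hexadecimal 2019 is decimal
          8217, so both lines in this small sketch (which reuses the line
          from the slide) refer to the same character, the right single
          quotation mark:
            <eg><![CDATA[<l>Whether &#x2019;tis nobler in the mind to suffer</l>
<l>Whether &#8217;tis nobler in the mind to suffer</l>]]></eg>
          </p>
          <p>Which form to use is a matter of convenience; Unicode code
          charts list code-points in hexadecimal, so the
          <code>&amp;#x</code> form is often easier to check.</p>
        </lectureNote>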
      </section>

      <section>
        <head>Everything's Delimited: entity references</head>
        <slide>
          <p>A way to handle hard-to-type characters (e.g., <q>&#x897F;</q>
            and <q>&#x20AC;</q> are not on an American keyboard) and
            boilerplate text</p>
          <list>
            <item>Delimited by <code>&amp;</code> and
              <code>;</code></item>
            <item>five are pre-declared: <code>amp</code>,
              <code>lt</code>, <code>gt</code>, <code>apos</code>,
              and <code>quot</code></item>
            <item>some have standard names: e.g. &amp;eacute; =
              &#xE9;</item>
            <item>some are locally defined: e.g. &amp;copyright;
              = "this text is copyrighted by the Women Writers
              Project&#x2026;"</item>
          </list>
          <eg>&lt;l&gt;Whether <hi>&amp;apos;</hi>tis nobler in the mind to suffer&lt;/l&gt;</eg>
        </slide>
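        <lectureNote>
          <p>A minimal sketch of how a locally defined entity might be
          declared and used, alongside three of the pre-declared ones. The
          declaration sits in the document's internal DTD subset; the entity
          name <code>copyright</code> comes from the slide, while the
          replacement text and the sentence in the <gi>p</gi> are placeholders
          for illustration only:
            <eg><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE p [
  <!ENTITY copyright "(boilerplate text goes here)">
]>
<p>Q&amp;A: is 3 &lt; 5? It&apos;s true. &copyright;</p>]]></eg>
          </p>
          <p>Only the five pre-declared entities can be used with no
          declaration at all. Standard names such as <code>eacute</code> are
          not built into XML itself; they work only when a DTD that declares
          them is in effect.</p>
        </lectureNote>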
      </section>

      <section>
        <head>Everything's Delimited: comments</head>
        <slide>
          <p>Comments are sections of text that are ignored by the
            XML processor.</p>
          <eg><![CDATA[      <bibl>
        ]]><hi rend="class(interest)">&lt;!-- Famous lost book --></hi><![CDATA[
        <author> <name>Rey, Margret</name> </author>
        <title>Whiteblack the Penguin Sees the World</title>
        <respStmt>
          <name>Rey, H. A.</name>
          <resp>illustrator</resp>
        </respStmt>
        <date when="2000"/>
        <pubPlace>Boston</pubPlace>
        <publisher>Houghton Mifflin Company</publisher>
      </bibl>]]></eg>
          <p>Comments start with <code>&lt;!--</code> and are terminated by <code>--&gt;</code>. </p>
          <p>Any character sequence, including markup, is allowed
            inside a comment, <emph>except <hi>&#x201C;</hi><code>--</code><hi>&#x201D;</hi></emph></p>
        </slide>
      </section>

      <section>
        <head>Well-formedness</head>
        <slide>
          <p>Simple set of rules on document syntax:
	  <list>
	    <item>single <soCalled>root</soCalled> element</item>
	    <item>every element has a start-tag and an end-tag (or is
	    an empty-element tag)</item>
	    <item>all elements, attributes, and references are
	    properly delimited</item>
	    <item>no elements overlap</item>
	  </list></p>
	</slide>
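        <lectureNote>
          <p>As a rough illustration of these rules, the following fragment
          (element names borrowed from the earlier book example) is
          well-formed: it has a single root, every start-tag has a matching
          end-tag, and the elements nest cleanly.
            <eg><![CDATA[<chapter>
  <heading>Wines</heading>
  <section>White wines ... </section>
</chapter>]]></eg>
          </p>
          <p>The next version differs only in that the <gi>heading</gi>
          start-tag has no matching end-tag. That is enough to make it
          ill-formed, and a conforming XML parser must reject it rather than
          guess where the heading was meant to end.
            <eg type="ill-formed"><![CDATA[<chapter>
  <heading>Wines
  <section>White wines ... </section>
</chapter>]]></eg>
          </p>
        </lectureNote>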
      </section>

      <section>
        <head>Example of Overlap</head>
        <slide>
          <p>The following example has <gi>s</gi> elements that overlap <gi>l</gi> elements. Thus it
            is not well-formed and not XML.</p>
          <eg type="ill-formed"><![CDATA[<lg type="stanza">
  ]]><hi rend="CSS( color: green; )">&lt;l&gt;</hi><hi rend="CSS( color: blue; )">&lt;s&gt;</hi><![CDATA[A bird came down the walk]]><hi rend="CSS( color: green; )">&lt;/l&gt;</hi><![CDATA[
  ]]><hi rend="CSS( color: green; )">&lt;l&gt;</hi><![CDATA[He did not know I saw.]]><hi rend="CSS( color: blue; )">&lt;/s&gt;</hi><hi rend="CSS( color: green; )">&lt;/l&gt;</hi><![CDATA[
  ]]><hi rend="CSS( color: green; )">&lt;l&gt;</hi><hi rend="CSS( color: blue; )">&lt;s&gt;</hi><![CDATA[He bit an angleworm in half,]]><hi rend="CSS( color: green; )">&lt;/l&gt;</hi><![CDATA[
  ]]><hi rend="CSS( color: green; )">&lt;l&gt;</hi><![CDATA[And ate the fellow raw.]]><hi rend="CSS( color: blue; )">&lt;/s&gt;</hi><hi rend="CSS( color: green; )">&lt;/l&gt;</hi><![CDATA[
</lg>]]></eg>
        </slide>
      </section>

      <section>
	<head><soCalled>Boxes</soCalled> Representation of Overlap</head>
	<slide>
	  <figure>
	    <graphic url="./gfx/boxes_overlap1.png" width="100%"/>
	    <figDesc>Nesting-doll (NOT!) representation of overlap</figDesc>
	  </figure>
	</slide>
      </section>

      <section>
        <head>One Solution to Overlap</head>
        <slide>
          <p>The following example uses the TEI <att>part</att>
          attribute to work around this problem &#x2014; <val>I</val>
          is for initial, <val>F</val> is for final.</p>
          <eg><![CDATA[<lg type="stanza">
  ]]><hi rend="CSS( color: green; )">&lt;l&gt;</hi><hi rend="CSS( color: blue; )">&lt;s part="I"&gt;</hi><![CDATA[A bird came down the walk]]><hi rend="CSS( color: blue; )">&lt;/s&gt;</hi><hi rend="CSS( color: green; )">&lt;/l&gt;</hi><![CDATA[
  ]]><hi rend="CSS( color: green; )">&lt;l&gt;</hi><hi rend="CSS( color: blue; )">&lt;s part="F"&gt;</hi><![CDATA[He did not know I saw.]]><hi rend="CSS( color: blue; )">&lt;/s&gt;</hi><hi rend="CSS( color: green; )">&lt;/l&gt;</hi><![CDATA[
  ]]><hi rend="CSS( color: green; )">&lt;l&gt;</hi><hi rend="CSS( color: aqua; )">&lt;s part="I"&gt;</hi><![CDATA[He bit an angleworm in half,]]><hi rend="CSS( color: aqua; )">&lt;/s&gt;</hi><hi rend="CSS( color: green; )">&lt;/l&gt;</hi><![CDATA[
  ]]><hi rend="CSS( color: green; )">&lt;l&gt;</hi><hi rend="CSS( color: aqua; )">&lt;s part="F"&gt;</hi><![CDATA[And ate the fellow raw.]]><hi rend="CSS( color: aqua; )">&lt;/s&gt;</hi><hi rend="CSS( color: green; )">&lt;/l&gt;</hi><![CDATA[
</lg>]]></eg>
        </slide>
	<lectureNote>
	  <p>Of course, software that comes along and counts how many
	  sentences you have in your poem will now need to be smart
	  enough not to count this as 4 sentences.</p>
	  <p>There is an example of how to do this in my stylesheet to
	  <ref target="http://www.tei-c.org/wiki/index.php/Count_Metrical_Lines_P5.xslt">count
	  metrical lines</ref> that is on the TEI wiki.</p>
	  <p>Software will also need to know the sentence isn't
	  <quote>... down the walkHe did not ...</quote>.</p>
	</lectureNote>
      </section>

      <section>
        <head>Validity</head>
        <slide>
          <p>A valid XML document follows the rules of a schema that
            describes a particular markup language:
	    <list>
	      <item>lexicon or available vocabulary: elements &amp;
	      attributes </item>
	      <item>grammar for how the lexicon is used: rules for
	      nesting, sequencing, etc.
	      <list>
		<item>e.g., a paragraph can be inside a chapter, but a
		chapter cannot be inside a paragraph</item>
		<item>e.g., a chapter must begin with a heading followed
		by at least one paragraph</item>
	      </list>
	      </item>
	      <item>There exist various schema languages with which
	      you can describe an XML grammar, each with advantages
	      and disadvantages.</item>
	    </list>
          </p>
          <list>
            <item>In order to be valid, an instance must be
	    well-formed;</item>
	    <item>A well-formed document need not be valid.</item>
          </list>
        </slide>
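        <lectureNote>
          <p>A small sketch of the second grammar rule on the slide, written
          as a DTD in the document's internal subset. DTD is only one of the
          available schema languages; it is used here simply because it can
          sit inside the instance itself. The element names
          <gi>chapter</gi>, <gi>heading</gi>, and <gi>paragraph</gi> are
          illustrative, not taken from any particular markup language.
            <eg><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter [
  <!ELEMENT chapter   (heading, paragraph+)>
  <!ELEMENT heading   (#PCDATA)>
  <!ELEMENT paragraph (#PCDATA)>
]>
<chapter>
  <heading>Wines</heading>
  <paragraph>White wines ... </paragraph>
</chapter>]]></eg>
          </p>
          <p>This instance is both well-formed and valid. Moving the
          <gi>heading</gi> after the <gi>paragraph</gi>, or dropping the
          <gi>paragraph</gi> altogether, would leave it well-formed but no
          longer valid against these declarations.</p>
        </lectureNote>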
      </section>

      <section>
        <head>Namespaces</head>
        <slide>
          <list>
            <item>A way to use tag vocabularies from different markup
            languages</item>
            <item>Allows for specialization of markup languages (by
            discipline, by function)</item>
            <item>Good for metadata: can use TEI header in a METS
            record</item>
            <item>Good for specialized markup: e.g. MathML,
            MusicML</item>
             <item>No need for every markup language to handle
            everything</item>
	  </list>
          <p>Instance syntax:
            <eg><![CDATA[<document
  xmlns:wwp="http://www.wwp.brown.edu/ns/documentation/1.0"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- ... -->
  <dc:creator>Gene Roddenberry</dc:creator>
</document>]]></eg>
          </p>
        </slide>
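        <lectureNote>
          <p>A sketch of the specialized-markup case: a TEI paragraph that
          borrows MathML for a formula. It uses the TEI and MathML namespace
          names, declares TEI as the default namespace (so its elements need
          no prefix), and binds the prefix <code>m</code> to MathML. The
          prefix itself is arbitrary, and whether a given TEI schema permits
          MathML inside <gi>formula</gi> depends on how that schema was
          customized.
            <eg><![CDATA[<p xmlns="http://www.tei-c.org/ns/1.0"
   xmlns:m="http://www.w3.org/1998/Math/MathML">
  The area of a circle is
  <formula>
    <m:math>
      <m:mi>&#x3C0;</m:mi>
      <m:msup><m:mi>r</m:mi><m:mn>2</m:mn></m:msup>
    </m:math>
  </formula>
</p>]]></eg>
          </p>
          <p>Software identifies an element by its namespace name plus its
          local name, not by the prefix, so <code>m:math</code> here and
          <code>mml:math</code> in another document can be the same element
          type.</p>
        </lectureNote>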
      </section>

      <section>
        <head>Some XML pluses and minuses</head>
        <slide>
	<p>XML thinks of texts as trees (with links).</p>
          <list>
            <head>some advantages</head>
	    <item>represents hierarchical text structures very
	    naturally</item>
            <item>many texts actually have a native tree-like
	    hierarchical structure</item>
            <item>use of tree structures makes processing easy</item>
            <item>yields some natural advantages, e.g.
            context-sensitive searching, or variable granularity of
            text retrieval</item>
          </list>
          <list>
            <head>some disadvantages</head>
	    <item>does not represent non-hierarchical structures
	    naturally</item>
            <item>many textual features do not naturally fit into a
	    tree-like hierarchical structure</item>
            <item>use of tree structures is sslloooww</item>
            <item>yields some natural disadvantages, e.g. verbosity</item>
          </list>
        </slide>
	<lectureNote>
          <p>Because elements are always nested inside each other, XML
          can be thought of as representing a <soCalled>boxes inside
          boxes</soCalled> model of text, or a
          <soCalled>tree</soCalled> structure.</p>
          <list>
            <head>some advantages</head>
	    <item>represents hierarchical text structures very
	    naturally</item>
            <item>many texts actually have a native tree-like
	    hierarchical structure (think sentences inside paragraphs
	    inside sections inside chapters)</item>
            <item>use of tree structures makes processing easy
	    (computer scientists know how to deal with trees)</item>
            <item>yields some natural advantages, e.g.
	    context-sensitive searching, or variable granularity of
	    text retrieval</item>
          </list>
          <list>
            <head>some disadvantages</head>
	    <item>does not represent non-hierarchical structures
	    naturally</item>
            <item>many textual features do not naturally fit into a
	    tree-like hierarchical structure (think of a verse line
	    split between two characters in a play)</item>
            <item>use of tree structures is sslloooww</item>
            <item>yields some natural disadvantages, e.g. verbosity</item>
          </list>
	</lectureNote>	  
      </section>
      
    </presentation>
  </text>
</TEI>
