The document below was prepared before the UTC meeting in early February.
A second document responding to the revised proposal for encoding cuneiform
is available by clicking here. This
second document gives additional examples and shows how lines of evidence
converge to confirm what are single signs of the script, and to confirm
that the traditional sign lists *do* already correctly distinguish these
in general with few exceptions (thus including UL, AR, etc. as single
signs, excluding ENSI as a combination of more than one sign, etc.).
Fitting Cuneiform Encoding to Cuneiform Script
Lloyd Anderson 29 January, 2004, Ecological Linguistics, PO Box 15156,
Washington DC 20003
For Unicode Technical Committee meeting 3 February, 2004 Document L2/04-041,
with very minor editorial revisions, and lacking a few graphics. Those
who wish to see the first illustrations can either use the version on
the Unicode web site, or can request a copy of a font to be sent to
them as an email attachment. If you wish to request the font, please
click here and state that you want
to receive the "CharSplits" font as an email attachment. Standard
.ttf file (16K) unless you specify Mac .suit file.
If those who proposed N2664 withdraw their proposal, then this paper
constitutes a re-introduction of it with changes to the text and tables
as outlined below. Changed text will be provided promptly.
A group which has been preparing a proposal for Cuneiform encoding
went through several stages. Decisions included the encoding of signs
in the sense traditionally understood in the field.
1. Encode signs not readings (script not language)
2. Encode signs not sequences of signs
3. Encode signs not variants
4. Encode signs not fragments of signs
5. Include sufficient distinctions for each stage covered (currently
mostly UrIII and later)
6. Unify those signs which are primary relatives in lineal historical
descent, encode them the same.
At a later time, the decision was taken to encode as sequences those
elements of text which are referred to as SIGN.SIGN with a period between
them, treating them as compounds of those existing signs which are the
parts of their names. This is obviously consistent with the decision
to not encode sequences of signs. But it also turned out to contradict
other decisions taken previously, and the members had not anticipated
some of the results. It was also somewhat vague, as it would cover both
sequences of signs, and also single signs referred to in this same way
for various historical reasons, such as lack of a known single-word
reading. In other words, the naming pattern "SIGN.SIGN" was
a glyph description language at the same time as it sometimes represented
sequences of signs, without any easy distinction between the two. The
group decided to go ahead without yet attempting to consider all of
the consequences. Some consider that the decision to split "SIGN.SIGN"
superseded all earlier decisions.
When the first results came back, a majority of the active participants
were unhappy with some of the exclusions, such as that of the fundamental syllabary
signs. They were also unhappy with encoded units which are fragments,
not ever occurring independently. Some of those have been suggested
for encoding in N2664R. But they are mere band-aids on a system which
systematically disregards both the long-established scholarly tradition
on what are signs, and the empirical evidence on what are the units
of the script, which most of the participants in the small group have
not discussed in any detail.
The question: what is an appropriate encoding for Cuneiform?
I argue that the present proposal would be very damaging to the
field of cuneiform studies. The consequences should actually be examined,
not shoved under the rug. Analogies will help to make clear what is
being proposed for Cuneiform. Then I will survey those consequences
which have not been presented systematically by the individuals whose
proposal is document N2664 and revision.
Han Characters vs. Components.
CJK Han Characters are not split into fragments in encoding. The decision
about what is a character is of course much easier for Han characters
than for Cuneiform, because Han characters all fit a standard square
block. Not having this tool in Cuneiform means that we must work hard
to discover what are the distinctive units of the script. (Or accept
that the long scholarly tradition has already done most of that!) But
the Han analogy is close in that we clearly know the difference between
full characters and components of characters. The long scholarly tradition
of Cuneiform studies is also fully aware of the difference between Signs
and Components of signs. The current proposal for Cuneiform violates
that tradition in mixing the two, and omitting many standard catalogued
signs from the encoding when there is no reason to do so. I here turn
the characters sideways, partly also as a reminder that such a rotation
occurred early in the history of Cuneiform. [The large size "b"
and "ac" following should be rendered using the "CharSplits"
font. These are U+67F4, which is not encoded as U+6B64 <ligature>
U+6728. Rather than showing a vertical column, the Han characters are
here rotated 90 degrees. That keeps the sequence relation the same (sequence
of components vs. sequence of characters).]
b is not encoded as ac with
ligaturing etc.
Analogies from Latin script are closer in some other
respects.
Latin Historical Ligatures which are now Simple Letters
æ is not encoded as ae
with ligaturing etc.
w is not encoded as vv with
ligaturing etc.
The last of these is very close to what is being proposed for Cuneiform,
the encoding of single characters as parts which they may historically
have arisen from, or which in the Cuneiform case they may later have
dissolved into, but which are in the use of the script distinct from
those. The <æ> digraph also raises an issue which affects
any script of this kind. Whether or not Unicode favors this, implementers
may possibly encode it as the sequence <a> <e> and render
that via ligaturing as a surface glyph <æ>.
This possibility is no argument that the Danish single letter
<æ> (which looks the same but which is definitely not a
ligature) should not have been encoded.
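The Latin and Han analogies can be checked against the Unicode character data. The following is a minimal sketch in Python using only the standard `unicodedata` module; the dictionary name `singles` is an illustrative choice, not anything taken from a proposal. It shows that each of these characters is a single code point with no decomposition into its historical or visual parts.

```python
import unicodedata

# Each key is encoded as one character, not as the sequence of parts
# given as its value.
singles = {
    "\u00e6": "ae",            # LATIN SMALL LETTER AE vs. <a> <e>
    "w": "vv",                 # w vs. <v> <v>
    "\u67f4": "\u6b64\u6728",  # U+67F4 vs. U+6B64 U+6728
}

for char, parts in singles.items():
    # One code point for the character, several for the spelled-out parts.
    assert len(char) == 1 and len(parts) > 1
    # No decomposition mapping exists, so normalization never turns the
    # character into the sequence of parts.
    assert unicodedata.decomposition(char) == ""
    assert unicodedata.normalize("NFKD", char) == char
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")
```

An implementer remains free to render <a> <e> with a ligature glyph, but that is a font matter and leaves the encoded identity of <æ> untouched.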
This next example has been withdrawn in the revised proposal N2664R,
but the fact that it could ever have been proposed shows how far off
the track the interpretation went both from deliberations and consensus
in the working group and from the reality of cuneiform script. There
are many more examples of similar kinds, and N2664R has only touched
the tip of the iceberg in correcting erroneous analyses from N2664.
[The next example also needs to be rendered with the "CharSplits"
font; the letters "w" and "u v" stand in for cuneiform glyphs and are
not intended literally. "w" stands for the sign called MASHGI or BARGI,
and "u v" for two non-existent signs.]
w is not to be encoded as u v
with ligaturing etc.
In this case an existing sign MASHGI (by default, a single character)
was split into two fragments, neither one of which exists as
a sign on its own. The sign which does exist was not proposed
for encoding. This is not a unique example. In some way it must reveal
the thinking which went into proposal N2664. I can only estimate that
the ideas were something like: split any sign if vertical white space
can be seen between fragments which would result, and rename any sign
in terms of component parts, disregarding traditional names. A procedure
based on white space certainly does not represent any reasonable interpretation
of a consensus reached by the working group to encode signs named "SIGN.SIGN"
as their parts. This sign was not traditionally named that way. The
artificial creation in proposal N2664R itself of a new name (U OVER
U U REVERSED OVER U REVERSED), the sequence of names of the two artificial
fragments which were substituted for it in N2664, does not cause this
to become a sign named "SIGN.SIGN"; otherwise that criterion would have no
meaning whatsoever. The sign was split, based not on its actual name but on
a theory reflected by this artificially created name. Quite a circular
proceeding. It makes it clear that what is being proposed is in an important
sense an encoding of a newly invented glyph description language,
not an encoding of the units of cuneiform script.
Our goal is a valid encoding for Cuneiform,
so if we find empirical data refutes the assumptions or procedures of
a claimed consensus, we must pay attention to the facts. The smallest
group seems to have locked itself into a tunnel.
There is another issue raised by the long history of the Cuneiform
script, and very real changes which occurred in it. Some characters
which were original single characters in the understanding of all of
us have dissolved into an apparent sequence, as scribes used familiar
elements. A wonderful example is that for the sign UMBIN, used to represent
among other things 'talon'. It is composed originally of a leg with
a superimposed turned hand which is used in meanings 'attach, join,
knot' and similar (Labat 'nouer, attacher'), and went through this evolution.
Intermediates exist between the last two not shown here. (See Labat
#92b, or use the "CharSplits" font to render the signs in
the next line.)
f g de
The first of these could be named (in a glyph description
language, a component description language) something like "leg"
x TAG4. The second would be named GAD.(DU x TAG4).
The third would be named GAD.TAG4.DU. It appears that
the cuneiform writing system of at least the last of these three stages
may have changed its set of significant units. But we cannot be sure
merely from these three illustrations. We simply cannot infer
status as sign vs. sign sequence merely by thinking of components in
later forms. That is deceiving oneself. The initial GAD of
the middle example may never have been separated from the part which
followed; they may have been merely components of one
sign, not two separate signs. In that case the sign would more revealingly
be named (GAD.DU) x TAG4. In fact that name would work
for both of the last two illustrations, since the TAG4 part is infixed
between the GAD and the DU parts! (In this instance, the TAG4 is not
reduced in size, but the visual form of infixed signs is specified in
fonts, not in encodings – see the web page http://www.CuneiformSigns.org/ContainerTypes.htm
) I suspect that by the Neo-Assyrian period, the last of the three illustrations
of UMBIN, the three components may possibly have been separable.
But that would have to be verified empirically; it is not appropriate
merely to speculate. How can we tell? There are ways, there is evidence.
And that evidence strongly correlates with and thus confirms
the long tradition in assyriology which is embodied in the sign catalogs,
carefully worked on with each contributor building on what went before.
We can question particular entries in those catalogs, but their compilers
were fully aware of the difference between components, signs,
and sequences of signs. They did not very often assign numbers
to mere text units, but treated them as lexical entries with a status
distinct from that of head entries (single signs).
The importance of the full historical range
Even without attempting to figure out which sequences of components
are single signs, and which sequences of components are sequences of
signs, for any texts, another point is already relevant here. The existence
of the first of these signs for 'talon' means that we do need
an encoding for it, whatever the analysis of later forms. Examples
of this kind exist even within the artificially narrow time range to which
the majority of the small working group wishes to limit our encoding
efforts. None of Labat's citations shown above are from the earliest
Uruk period. The first two illustrations are from the Fara period (LAK#289),
with six attestations like the first illustration, and three like the
second. This is surely a secure identification of a sign, by normal
standards. In addition there are Middle Babylonian and Middle Assyrian
sign forms which are not visually decomposable (Labat illustrates these).
There is a great resistance to including evidence from the full range
of cuneiform in preparing the present proposal, yet that inclusion can
precisely warn us against mistakes, not merely omissions of what can
be added later, but wrong analyses. We will more likely make an error
by not considering all of the available information
than by considering it. For quite a number of signs, proposal N2664
has in effect tended to focus its attention on later
forms which use a far smaller number of glyphic sign components, in
the extreme focusing on Neo-Assyrian, as for the sign UMBIN.
Sign Identity Is Stable Through Time,
Where Components And Glyph Fragments Are Not
Since one of our goals is to unify Cuneiform encodings across time periods,
it can be seen that artificial splits into glyphic fragments will hinder
that goal. Single signs may have their components arranged differently
at different times, which does not itself constitute evidence that the
combination of components is more than one sign. For Cuneiform, please
see the web page http://www.CuneiformSigns.org/InfixFluctuation.htm
and pages linked to from there. The field of Han CJK characters provides
ample analogies for this statement. Please see the web page http://www.CuneiformSigns.org/CJKAnalogies.htm
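The contrast between stable sign identity and unstable component descriptions can be sketched with the UMBIN example above. This is only an illustration in Python: the three descriptions are the glyph-description names given earlier in this paper, while the function names and the idea of an "index" are assumptions made for the sketch, not part of any proposal.

```python
# Component descriptions of UMBIN at the three stages illustrated above,
# in the order given in the text. These are descriptions, not encodings.
UMBIN_STAGES = [
    '"leg" x TAG4',
    "GAD.(DU x TAG4)",
    "GAD.TAG4.DU",
]

def index_by_components(stages):
    """Key each attestation by its component description."""
    return {description: "UMBIN" for description in stages}

def index_by_sign(stages, sign="UMBIN"):
    """Key all attestations by the identity of the sign."""
    return {sign: list(stages)}

by_components = index_by_components(UMBIN_STAGES)
by_sign = index_by_sign(UMBIN_STAGES)

# Three unrelated keys for one sign: a search for one stage misses the others.
print(len(by_components))  # 3
# One key, with the stage-specific shapes left to fonts.
print(len(by_sign))        # 1
```

Keyed by components, the same sign falls under three different entries across periods; keyed by sign identity, it stays one entry, which is what unification across time periods requires.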
How can we determine sign boundaries?
The first answer is to respect the accumulated tradition of assyriology.
We can easily check that tradition against the facts. Two default
manifestations of character boundaries are available for cuneiform just
as for most other scripts -- spacing and line breaks. Since many Cuneiform
words are spelled via a sequence of signs, line breaks between signs
of one word in Cuneiform are quite analogous to line breaks between
letters of the Latin script. Both can be regulated by special implementations,
but there are also important default behaviors on which such implementations
rely.
The first full-page figure accompanying this paper is from Gudea Statue
F column 4 (as published by Bord and Mugnaioni 2002). [You can request
a copy of this figure by post or as an email attachment here.]
"Register" 6 of that column begins with the single sign MASH2,
which is acknowledged by all to be a single sign, given such status
as U+12239 in proposal N2664. Yet it consists of two parts which have
some white space between them. White space of this kind is simply not
diagnostic of sign boundaries, as shown above for the split of MASHGI
into artificial fragments. Attempting to rely on it makes one's methods
invalid, one's results insecure, sensitive to the wrong things.
Spacing signals sign boundaries?
If you look a bit more carefully at this
example, however, you see that this register is nicely spaced, and that
it has two lines (as most of us would refer to them), one with three
signs MASH2 ZI MU- and the second line with three signs NI SHAR2 SHAR2.
The spacing within the single sign MASH2 is different from the spacing
between signs. (There are three words in this register, MASH2 is the
first. ZI is the second, and the third word is MU-NI-SHAR2-SHAR2, according
to the transcription in Bord and Mugnaioni's 2002 publication of it.
The third word is broken across lines at a sign boundary.)
Now compare two other lines, as they are
usually referred to: line 3 and line 7. (Here we do not have to worry
about the confusion we moderns would have in talking about a "line"
containing several "lines", or a line containing an "indent",
etc.) In line 3, we have text transliterated by the authors as sipa-bi
'leur pasteur' 'their shepherd' (or similar). It here consists of two
signs, SIPA and BI. When our smaller group started dividing things named
"SIGN.SIGN" into single signs, I of course assumed this was
a correct decision for all true sign sequences. I even thought SIPA
was probably a good candidate to treat that way. I have however discovered
that not merely the standard sign catalogs but also an important text
with nice typography which I first examined treat this as a single
sign. This is so far confirmed by parts of a second important text,
the "Codex Hammurabi".
The single sign SIPA has within it the
same amount of white space which occurs in line 6 previously discussed
within the agreed single sign MASH2. The scribe felt there were only
two signs in this line, and rendered them accordingly, in the process
leaving a gigantic white space the width of half the entire line. In
line 7, by contrast, the single sign SIPA no longer appears. The scribe
used instead the individual signs PA, LU (= UDU), and BI. This made
for a more evenly spaced appearance of the line, perhaps. The reading
and the context are the same.
Some might argue that this fluctuation
shows the units are really PA.LU.BI, just spaced differently, and that
the appropriate treatment is to add a zero-width joiner of some kind
between the signs to keep them together. This badly misunderstands the
nature of cuneiform script. The treatment in line 7 is abnormal in the
texts of the ten Gudea statues, and I think probably unique there. It appears
that an absence of split forms may characterize the law code of Hammurabi
as well. The other examples of SIPA which I found in the Gudea statues
wrote the components not merely closely together, as in Statue F at
4.3, but actually touching, so there is no white space whatsoever between
parts. These were on statues B and D, at locations B.2.8 and D.1.11.
What the "joiner" approach is
doing is applying band-aids to fix what would be done wrong in fragmenting
single signs, treating their components as if they were independent
signs. It reverses the relation of normal and exceptional, imposing
the burden in the normal cases, not in the exceptional ones. For a component
of a sign to look like an independent sign merely as a glyph is in
no way evidence of any kind that the sign in question is a
sequence of independent signs. No more than it would be for CJK Han
characters.
Evidence and Traditional Sign Catalogs Agree
A small survey of the spacing of some candidates for single signs in
the Gudea statues, and whether they are or are not split across line-breaks
(or indent breaks) within a register, yields a very strong correlation
between the spacing and line-break treatments, on the one hand, and
the standard sign catalogs, on the other hand. This is summarized in
table form on the web page http://www.CuneiformSigns.org/SignSpacingCorrelate.htm,
included as part of this paper.
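The kind of check described here can be sketched for the one candidate discussed in detail above, SIPA. The sketch below is in Python and covers only the four attestations mentioned in this paper, not the full survey table on the web page. The simple-majority rule in `normal_usage` is an assumption made for illustration; the actual survey may weigh the evidence differently.

```python
from collections import Counter

# Attestations of SIPA in the Gudea statues as described in the text:
# (statue.column.line, written as one unit?)
SIPA_ATTESTATIONS = [
    ("F.4.3", True),    # components written closely together
    ("B.2.8", True),    # components touching
    ("D.1.11", True),   # components touching
    ("F.4.7", False),   # written out as PA LU BI
]

def normal_usage(attestations):
    """Return True if most attestations treat the candidate as one unit."""
    counts = Counter(as_unit for _, as_unit in attestations)
    return counts[True] > counts[False]

# Per the standard sign catalogs, as stated above.
CATALOG_SAYS_SINGLE_SIGN = True

evidence = normal_usage(SIPA_ATTESTATIONS)
print(evidence, evidence == CATALOG_SAYS_SINGLE_SIGN)  # True True
```

The single split writing at F.4.7 is outvoted by normal usage, and the result agrees with the catalogs.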
What are the Consequences?
The two approaches to encoding Cuneiform differ greatly in the degree
to which they respect the empirically determinable significant sign
units of the script (different both from components and from sign sequences).
This contrast is made clear on the web page
http://www.CuneiformSigns.org/TwoApproaches.htm
included as part of this paper.
I believe there is simply no contest, and that proposal N2664(R) would
do considerable damage to the encoding of cuneiform, by loading large
amounts of extra complexity onto many aspects of implementations, and
making users and those who serve them needlessly dependent on implementers.
The cause of these disadvantages is demonstrable errors in the attempt
to identify what are the productive functioning units of the script.
Lists of Signs to Add
Also included as part of this paper are three web pages listing signs
which need to be replaced or added (in addition to the changes made
in revision N2664R). These web pages are
http://www.CuneiformSigns.org/ReplaceSigns.htm
and
http://www.CuneiformSigns.org/AddSigns.htm
and
http://www.CuneiformSigns.org/BorgerAdds.htm
[this one was preliminary, will be upgraded in mid-February –
currently it mainly serves to link N2664R proposed code points with
sign ID numbers in Borger's in-press Mesopotamisches Zeichenlexikon]
A full set of signs with images will be provided in February.
Moving Right Along
Doing it right need not interfere with getting a Cuneiform encoding
proposal approved in June 2004. Most of the text of N2664 is well written
and can be used as is, except where the analysis of this present paper
would require changes to it. Most of the very good work in extending
sign lists which is manifest in N2664 and N2664R, itself building on
the long traditions of the field, stands without need for change. That
includes work by Steve Tinney, the CDLI, and Miguel Civil. Only artificial
fragments need to be eliminated, and traditional signs added except
in individual instances where they can be shown to be errors perpetuated
in the traditional lists.
I propose that we stick to our foundations, keep our feet on the ground,
and proceed in the following manner.
A. Maintain the solid encoding principles we started with:
1. Encode signs not readings (script not language)
2. Encode signs not sequences of signs
3. Encode signs not variants
4. Encode signs not fragments of signs
5. Include sufficient distinctions for each stage covered (currently
mostly UrIII and later)
6. Unify those signs which are primary relatives in lineal historical
descent, encode them the same.
7. The standard sign catalogs, as extended by the work of PSL, CDLI,
and Civil, and with any additional whole signs found in N2664 and its
revision, should be the default list we start with. We can eliminate
signs only as we can show in exceptional instances that the identifications
are not secure, or that the traditional catalogs made some kind of error.
In addition to sign catalogs, we will of course use the best available
published work by the recognized authorities in each field of cuneiform,
and more recent and specific information from experts when it is available.
8. Where we have evidence on spacing or line breaking, we use that
judiciously to confirm or call into question status as single sign vs.
as sequence of signs
9. In cases of fluctuation, we go usually with normal usage, not with
exceptional instances.
B. Keep traditional sign names; names need not be tied to component
analyses.
10. Use traditional highly-recognizable sign names (MUL rather than
AN OVER AN AN), and for signs for which no reading or alphabetic name
is available, the catalog number with an initial letter to identify
the catalog the sign is taken from, as in "C372".
11. Encoding order can reflect recognized components of signs. Alternate
names which represent the components of single signs (and to a degree
their arrangement) can be used to help our thinking, and even as a basis
for encoding order, but with clear awareness that the componential decomposition
of signs is not as stable across time as is the identity of the signs
as wholes.
12. Componential analysis of signs should reflect full historical knowledge
without limitation, so as to avoid implications for unification which
turn out to be false. For example, the two names "SIGN x SHE3"
and "SIGN x TUG2" are not distinguishable at a late stage
where the components SHE3 and TUG2 merge as KU and we have only "SIGN
x KU". Evidence from older time periods can resolve this in particular
instances (Steve Tinney has made use of some of this, from Krebernik,
as has this writer.)
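The merger described in principle 12 can be made concrete with a small sketch. It is written in Python; the table `LATE_COMPONENT` and the helper `late_name` are hypothetical names introduced only for this illustration, and "SIGN" is the same placeholder used in the text.

```python
# Late-stage merger described in principle 12: SHE3 and TUG2 both appear as KU.
LATE_COMPONENT = {"SHE3": "KU", "TUG2": "KU"}

def late_name(name):
    """Rewrite a component-based name as it would look at the late stage."""
    container, sep, infix = name.partition(" x ")
    return f"{container}{sep}{LATE_COMPONENT.get(infix, infix)}"

early_names = ["SIGN x SHE3", "SIGN x TUG2"]
late_names = [late_name(n) for n in early_names]

print(late_names)  # ['SIGN x KU', 'SIGN x KU']
print(len(set(early_names)), len(set(late_names)))  # 2 1
```

Two distinct names collapse into one at the late stage, so a component analysis based only on late forms would wrongly imply that the two signs should be unified.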
C. The only criterion is whether we have securely identified signs
distinctive from each other. In cases of limited knowledge, we should
be explicit about the consequences of each kind of error which we can
anticipate. That is done below. We should encode what we can now. There
is to be no artificial limitation of time periods covered. Although
a few of the following general principles are phrased in terms of older
and later signs for which we may consider unification, they apply more
broadly to any question whether two signs are the same or are distinctive.
More specifically:
13. The fact that we certainly will later discover additional distinctions
in no way argues against encoding the distinctions we are already securely
aware of.
14. If we have securely identified a distinctive cuneiform sign, it
matters not at all if we do not know its exact "reading" or
meaning, or even any "reading" or meaning. To be most useful
to cuneiform specialists, we provide encodings precisely for signs whose
meanings are not yet known, or not fully known, just as for Linear B
(Unicode U+10040 to U+1005D). Having them encoded will assist analysis of
texts which use them.
15. For the large bodies of cuneiform texts, we expect those entering
the data on computers to be trained professional experts, able to recognize
distinctions and make choices as needed. As with any technical field,
advances may lead to the correction of readings and even sign identifications
in particular texts, but this is simply normal progress of science.
It has no implications for our encoding.
16. If we have a sign from an earlier time period which can be securely
unified with a sign from a later time period which is its primary lineal
descendant, then as with all other unifications, no additional encoded
character is appropriate. (Possible error: failure to encode a sign
which turns out to be distinctive. Such a newly discovered distinction
can be added later. But we do want to avoid the generation of encoded
data which has later to be changed, whenever reasonably possible, so
if a distinction is highly probable, we should encode it now.)
17. If a catalog listing of a sign does not make a distinction where
it should, if it merges what we already know to be two distinct signs,
then we make the distinction (by 5. above). If some of the instances
lumped under one catalog listing are known to be unifiable with a later
or earlier sign, then (by the preceding paragraph) we do unify them.
If other instances lumped under one catalog listing are known to be
distinct from other signs in our list, then we encode them separately,
devising some practical workable new sign name as needed. (Possible
error if we fail to recognize a distinction – as in the preceding
item.) Example: ZATU catalog sign Z565 called "U2". According
to a discussion by expert Cale Johnson, this catalog listing conflates
two distinct signs, one of them indeed unifiable with the later sign
"U2", the other distinct from that and not continued in later
signs. So the newer sign might be called Z565b or Z565a, as the experts
prefer.
18. We do not let ourselves be confused by mere *names*. Giving an
old sign the same name as a known later sign does not constitute evidence
that the two are lineal descendants. If we have evidence that two signs
are not lineal descendants, we do not unify them. If the older sign
is securely attested and clear in at least some of its instances, unless
the older sign can be identified with *some* later sign, we must seriously
consider adding a distinct encoded sign to our list. (For examples from
the early Uruk stages, please after late 1st February see the web page
http://www.CuneiformSigns.org/ZATUSignTriage.htm . )
19. If an identification of an earlier sign with a later sign is
probably false, and there is no other known valid unification with another
later sign, then we can usefully consider encoding it separately. Quite
a number of examples of this will be noted on the web page just mentioned.
(Possible error: two encoded signs are later found to be mere variants
of each other. Over-distinction in the encoded data brings with it no
information loss. At most, a tiny number of encoded signs would later
go out of active new use. Older data using them, to the extent not corrected
by its expert custodians, is still readable.)
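The asymmetry between the two kinds of error noted in items 16 and 19 can be illustrated with a short sketch in Python. The labels SIGN_A and SIGN_B are hypothetical, and `merge_signs` is an assumed helper written only for this illustration.

```python
def merge_signs(text, table):
    """Replace each sign label by its unified label where the table has one."""
    return [table.get(sign, sign) for sign in text]

# SIGN_A and SIGN_B are hypothetical labels for two separately encoded signs.
text_with_both = ["SIGN_A", "AN", "SIGN_B", "SIGN_A"]

# Over-distinction: if A and B prove to be mere variants, one fixed table
# unifies existing data mechanically.
unified = merge_signs(text_with_both, {"SIGN_B": "SIGN_A"})
print(unified)  # ['SIGN_A', 'AN', 'SIGN_A', 'SIGN_A']

# Under-distinction: a different original text collapses to the same result,
# so the unified data alone cannot tell us which occurrences were SIGN_B.
other_text = ["SIGN_B", "AN", "SIGN_A", "SIGN_A"]
print(merge_signs(other_text, {"SIGN_B": "SIGN_A"}) == unified)  # True
```

Merging is a fixed table applied to existing data, while separating signs that were encoded as one requires going back to the tablets. That is why a highly probable distinction is better encoded now.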
No Serious Practical or Time Limitations
The task laid out in this paper is already nearly complete. Lists of
signs which need to be added are generally already complete. For my
own contributions, I am mostly in the process of eliminating some mere variant
signs and others which are too insecure to encode now, using the available
published tools and any expert comments available. I will complete these
contributions without fail by the end of February, 2004, and most of
them by February 15th. Any expert contributions will of course be reflected
in modified lists.
With materials already so fully sorted
and controlled for quality through the combined efforts of the entire
assyriological tradition, including additions by participants in our
current activities, it will be simple for experts to review a nearly-final
list, as Steve Tinney has pointed out. They look for the items of most
interest to them, items to which their specialized knowledge is most
relevant. To the extent that experts in certain time periods can find
even as much as a day free in the next four months, they can warn us
of any errors they know in the sources we have available, can tell us
of additional distinctions needed, or perhaps in a very few cases tell
us of distinctions we have made that are very probably not warranted.
Many of the issues of fact and principle, and many of the signs which
are documented in this paper were proposed via general statements and
in part via lists of particular signs already in October and November
2003. This current paper is new in its comprehensiveness and in listing
signs in a format with pseudo-code-point labels added for easier comparability
with N2664(R).
One illustration and accompanying tables:
Gudea F.4
Pages from the web site http://www.CuneiformSigns.org, as linked to
above.
and http://www.CuneiformSigns.org/ZATUSignTriage.htm
(after mid-February)