Cuneiform Signs

Analysis and reports to support an international standard for computer encoding of the Cuneiform writing system

Research on the development of Cuneiform signs

 

The document below was prepared before the UTC meeting in early February. A second document, responding to the revised proposal for encoding cuneiform, is available separately. That second document gives additional examples and shows how lines of evidence converge to confirm which units are single signs of the script, and to confirm that the traditional sign lists *do* already distinguish these correctly in general, with few exceptions (thus including UL, AR, etc. as single signs, excluding ENSI as a combination of more than one sign, etc.).

Fitting Cuneiform Encoding to Cuneiform Script
Lloyd Anderson 29 January, 2004, Ecological Linguistics, PO Box 15156, Washington DC 20003
For Unicode Technical Committee meeting, 3 February, 2004. Document L2/04-041, with very minor editorial revisions, and lacking a few graphics. Those who wish to see the first illustrations can either use the version on the Unicode web site or request a copy of a font to be sent to them as an email attachment. If you wish to request the font, please state that you want to receive the "CharSplits" font as an email attachment. Standard .ttf file (16K) unless you specify Mac .suit file.

If those who proposed N2664 withdraw their proposal, then this paper constitutes a re-introduction of it with changes to the text and tables as outlined below. Changed text will be provided promptly.

A group which has been preparing a proposal for Cuneiform encoding went through several stages. Decisions included the encoding of signs in the sense traditionally understood in the field.

1. Encode signs not readings (script not language)
2. Encode signs not sequences of signs
3. Encode signs not variants
4. Encode signs not fragments of signs
5. Include sufficient distinctions for each stage covered (currently mostly UrIII and later)
6. Unify those signs which are primary relatives in lineal historical descent, encode them the same.

At a later time, the decision was taken to encode as sequences those elements of text which are referred to as SIGN.SIGN with a period between them, treating them as compounds of those existing signs which are the parts of their names. This is obviously consistent with the decision not to encode sequences of signs. But it also turned out to contradict other decisions taken previously, and the members had not anticipated some of the results. It was also somewhat vague, as it would cover both sequences of signs and single signs referred to in this same way for various historical reasons, such as lack of a known single-word reading. In other words, the naming pattern "SIGN.SIGN" was a glyph description language at the same time as it sometimes represented sequences of signs, without any easy distinction between the two. The group decided to go ahead without yet attempting to consider all of the consequences. Some consider that the decision to split "SIGN.SIGN" superseded all earlier decisions.

When the first results came back, a majority of the active participants were unhappy with some of the exclusions, such as those of fundamental syllabary signs. They were also unhappy with encoded units which are fragments, never occurring independently. Some of the excluded signs have been suggested for encoding in N2664R. But these additions are mere band-aids on a system which systematically disregards both the long-established scholarly tradition on what are signs and the empirical evidence on what are the units of the script, evidence which most of the participants in the small group have not discussed in any detail.

The question: what is an appropriate encoding for Cuneiform?
I argue that the present proposal would be very damaging to the field of cuneiform studies. The consequences should actually be examined, not shoved under the rug. Analogies will help to make clear what is being proposed for Cuneiform. Then I will survey those consequences which have not been presented systematically by the individuals whose proposal is document N2664 and revision.

Han Characters vs. Components.
CJK Han Characters are not split into fragments in encoding. The decision about what is a character is of course much easier for Han characters than for Cuneiform, because Han characters all fit a standard square block. Not having this tool in Cuneiform means that we must work hard to discover what are the distinctive units of the script. (Or accept that the long scholarly tradition has already done most of that!) But the Han analogy is close in that we clearly know the difference between full characters and components of characters. The long scholarly tradition of Cuneiform studies is also fully aware of the difference between Signs and Components of signs. The current proposal for Cuneiform violates that tradition in mixing the two, and omitting many standard catalogued signs from the encoding when there is no reason to do so. I here turn the characters sideways, partly also as a reminder that such a rotation occurred early in the history of Cuneiform. [The large size "b" and "ac" following should be rendered using the "CharSplits" font. These are U+67F4, which is not encoded as U+6B64 <ligature> U+6728. Rather than showing a vertical column, the Han characters are here rotated 90 degrees. That keeps the sequence relation the same (sequence of components vs. sequence of characters).]
b is not encoded as ac with ligaturing etc.

Analogies from Latin script are closer in some other respects.

Latin Historical Ligatures which are now Simple Letters
æ is not encoded as ae with ligaturing etc.
w is not encoded as vv with ligaturing etc.
The last of these is very close to what is being proposed for Cuneiform, the encoding of single characters as parts which they may historically have arisen from, or which in the Cuneiform case they may later have dissolved into, but which are in the use of the script distinct from those. The <æ> digraph also raises an issue which affects any script of this kind. Whether or not Unicode favors this, implementers may possibly encode it as the sequence <a> <e> and render that via ligaturing as a surface glyph <æ>. This possibility is no argument that the Danish single letter <æ> (which looks the same but which is definitely not a ligature) should not have been encoded.
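The Han and Latin analogies can be checked against the Unicode character database. The following is a minimal sketch in Python using only the standard `unicodedata` module; the helper name `is_atomic` is an assumption made for illustration and is not part of any proposal.

```python
import unicodedata

def is_atomic(ch: str) -> bool:
    """True if ch is one code point with no decomposition mapping at all."""
    return len(ch) == 1 and unicodedata.decomposition(ch) == ""

# U+67F4 is encoded as a whole character, not as U+6B64 followed by U+6728.
han = "\u67f4"
parts = "\u6b64\u6728"

for ch in (han, "\u00e6", "w"):
    print(f"U+{ord(ch):04X} atomic: {is_atomic(ch)}")

# Normalization never turns the whole character into its components,
# and never fuses the component sequence into the whole character.
assert unicodedata.normalize("NFKD", han) == han
assert unicodedata.normalize("NFKC", parts) == parts
```

The point of the check is only that a character and the sequence of its visual components are distinct encoded entities; nothing in the standard relates one to the other.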
This next example has been withdrawn in the revised proposal N2664R, but the fact that it could ever have been proposed shows how far off the track the interpretation went both from deliberations and consensus in the working group and from the reality of cuneiform script. There are many more examples of similar kinds, and N2664R has only touched the tip of the iceberg in correcting erroneous analyses from N2664. [The next example also needs to be rendered with the "CharSplits" font; "w" and "u v" are not intended here.] Approximations would be:

<>      (the sign called MASHGI or BARGI)
<>

not rendered as

<          >   with ligaturing etc. (non-existent signs)
<          >

w is not to be encoded as u v with ligaturing
In this case an existing sign MASHGI (by default, a single character) was split into two fragments, neither one of which exists as a sign on its own. The sign which does exist was not proposed for encoding. This is not a unique example. In some way it must reveal the thinking which went into proposal N2664. I can only estimate that the ideas were something like: split any sign if vertical white space can be seen between fragments which would result, and rename any sign in terms of component parts, disregarding traditional names. A procedure based on white space certainly does not represent any reasonable interpretation of a consensus reached by the working group to encode signs named "SIGN.SIGN" as their parts. This sign was not traditionally named that way. The artificial creation in proposal N2664R itself of a new name (U OVER U U REVERSED OVER U REVERSED), the sequence of names of the two artificial fragments which were substituted for it in N2664, does not cause this to become a sign named "SIGN.SIGN". Or else that has no meaning whatsoever. The sign was split, based not on its actual name but on a theory reflected by this artificially created name. Quite a circular proceeding. It makes it clear that what is being proposed is in an important sense an encoding of a newly invented glyph description language, not an encoding of the units of cuneiform script.
    Our goal is a valid encoding for Cuneiform, so if we find that empirical data refute the assumptions or procedures of a claimed consensus, we must pay attention to the facts. The smallest group seems to have locked itself into a tunnel.

There is another issue raised by the long history of the Cuneiform script, and by the very real changes which occurred in it. Some characters which were originally single characters in the understanding of all of us have dissolved into an apparent sequence, as scribes used familiar elements. A wonderful example is the sign UMBIN, used to represent among other things 'talon'. It is composed originally of a leg with a superimposed turned hand which is used in meanings 'attach, join, knot' and similar (Labat 'nouer, attacher'), and went through the evolution shown below. Intermediates between the last two exist but are not shown here. (See Labat #92b, or use the "CharSplits" font to render the signs in the next line.)
f g de

The first of these could be named (in a glyph description language, a component description language) something like "leg" x TAG4. The second would be named GAD.(DU x TAG4). The third would be named GAD.TAG4.DU. It appears that the cuneiform writing system of at least the last of these three stages may have changed its set of significant units. But we cannot be sure merely from these three illustrations. We simply cannot infer status as sign vs. sign sequence merely by thinking of components in later forms. That is deceiving oneself. The initial GAD of the middle example may never have been separated from the part which followed; they may have been merely components of one sign, not two separate signs. In that case the sign would more revealingly be named (GAD.DU) x TAG4. In fact that name would work for both of the last two illustrations, since the TAG4 part is infixed between the GAD and the DU parts! (In this instance, the TAG4 is not reduced in size, but the visual form of infixed signs is specified in fonts, not in encodings – see the web page http://www.CuneiformSigns.org/ContainerTypes.htm ) I suspect that by the Neo-Assyrian period, the last of the three illustrations of UMBIN, the three components may possibly have been separable.
But that would have to be verified empirically; it is not appropriate merely to speculate. How can we tell? There are ways, and there is evidence. That evidence strongly correlates with, and thus confirms, the long tradition in assyriology which is embodied in the sign catalogs, carefully worked on with each contributor building on what went before. We can question particular entries in those catalogs, but their compilers were fully aware of the difference between components, signs, and sequences of signs. They did not very often assign numbers to mere text units, but treated them as lexical entries with a status distinct from that of head entries (single signs).

The importance of the full historical range
Even without attempting to figure out which sequences of components are single signs, and which sequences of components are sequences of signs, for any texts, another point is already relevant here. The existence of the first of these signs for 'talon' means that we do need an encoding for it, whatever the analysis of later forms. Examples of this kind exist even within the artificially narrow time range to which the majority of the small working group wishes to limit our encoding efforts. None of Labat's citations shown above are from the earliest Uruk period. The first two illustrations are from the Fara period (LAK#289), with six attestations like the first illustration, and three like the second. This is surely a secure identification of a sign, by normal standards. In addition there are Middle Babylonian and Middle Assyrian sign forms which are not visually decomposable (Labat illustrates these). There is a great resistance to including evidence from the full range of cuneiform in preparing the present proposal, yet that inclusion can precisely warn us against mistakes, not merely omissions of what can be added later, but wrong analyses. We will more likely make an error by not considering all of the available information than by considering it. For quite a number of signs, proposal N2664 has in effect tended to focus its attention on later forms which use a far smaller number of glyphic sign components, in the extreme focusing on Neo-Assyrian, as for the sign UMBIN.

Sign Identity Is Stable Through Time,
Where Components And Glyph Fragments Are Not

Since one of our goals is to unify Cuneiform encodings across time periods, it can be seen that artificial splits into glyphic fragments will hinder that goal. Single signs may have their components arranged differently at different times, which does not itself constitute evidence that the combination of components is more than one sign. For Cuneiform, please see the web page http://www.CuneiformSigns.org/InfixFluctuation.htm and pages linked to from there. The field of Han CJK characters provides ample analogies for this statement. Please see the web page http://www.CuneiformSigns.org/CJKAnalogies.htm

How can we determine sign boundaries?
The first answer is: by respecting the accumulated tradition of assyriology. We can easily check that tradition against the facts. Two default manifestations of character boundaries are available for cuneiform just as for most other scripts -- spacing and line breaks. Since many Cuneiform words are spelled via a sequence of signs, line breaks between signs of one word in Cuneiform are quite analogous to line breaks between letters of the Latin script. Both can be regulated by special implementations, but there are also important default behaviors on which such implementations rely.
The first full-page figure accompanying this paper is from Gudea Statue F column 4 (as published by Bord and Mugnaioni 2002). [A copy of this figure can be requested by post or as an email attachment.] "Register" 6 of that column begins with the single sign MASH2, which is acknowledged by all to be a single sign, given such status as U+12239 in proposal N2664. Yet it consists of two parts which have some white space between them. White space of this kind is simply not diagnostic of sign boundaries, as shown above for the split of MASHGI into artificial fragments. Attempting to rely on it makes one's methods invalid, one's results insecure, sensitive to the wrong things.

Spacing signals sign boundaries?
     If you look a bit more carefully at this example, however, you see that this register is nicely spaced, and that it has two lines (as most of us would refer to them), one with three signs MASH2 ZI MU- and the second line with three signs NI SHAR2 SHAR2. The spacing within the single sign MASH2 is different from the spacing between signs. (There are three words in this register: MASH2 is the first, ZI is the second, and the third word is MU-NI-SHAR2-SHAR2, according to the transcription in Bord and Mugnaioni's publication of it (2002). The third word is broken across lines at a sign boundary.)
     Now compare two other lines, as they are usually referred to: line 3 and line 7. (Here we do not have to worry about the confusion we moderns would have in talking about a "line" containing several "lines", or a line containing an "indent", etc.) In line 3, we have text transliterated by the authors as sipa-bi 'leur pasteur' 'their shepherd' (or similar). It here consists of two signs, SIPA and BI. When our smaller group started dividing things named "SIGN.SIGN" into single signs, I of course assumed this was a correct decision for all true sign sequences. I even thought SIPA was probably a good candidate to treat that way. I have however discovered that not merely the standard sign catalogs but also an important text with nice typography which I first examined treats this as a single sign. This is so far confirmed by parts of a second important text, the "Codex Hammurabi".
     The single sign SIPA has within it the same amount of white space which occurs in line 6 previously discussed within the agreed single sign MASH2. The scribe felt there were only two signs in this line, and rendered them accordingly, in the process leaving a gigantic white space the width of half the entire line. In line 7, by contrast, the single sign SIPA no longer appears. The scribe used instead the individual signs PA, LU (= UDU), and BI. This made for a more evenly spaced appearance of the line, perhaps. The reading and the context are the same.
     Some might argue that this fluctuation shows the units are really PA.LU.BI, just spaced differently, and that the appropriate treatment is to add a zero-width joiner of some kind between the signs to keep them together. This badly misunderstands the nature of cuneiform script. The treatment in line 7 is abnormal in the texts of the ten Gudea statues; I think it is probably unique there. It appears that an absence of split forms may characterize the law code of Hammurabi as well. The other examples of SIPA which I found in the Gudea statues wrote the components not merely closely together, as in Statue F at 4.3, but actually touching, so there is no white space whatsoever between parts. These were on statues B and D, at locations B.2.8 and D.1.11.
     What the "joiner" approach is doing is applying band-aids to fix what would be done wrong in fragmenting single signs, treating their components as if they were independent signs. It reverses the relation of normal and exceptional, imposing the burden in the normal cases, not in the exceptional ones. For a component of a sign to look like an independent sign merely as a glyph is in no way evidence of any kind that the sign in question is a sequence of independent signs. No more than it would be for CJK Han characters.
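The reasoning about SIPA can be laid out as a small tally. This is only a sketch in Python: the four attestations are the ones cited in the text above, while the function name `normal_usage` and the use of a simple majority count to stand for "normal usage" are assumptions made for illustration.

```python
from collections import Counter

# Attestations of SIPA in the Gudea statues as described in the text:
# (location, how the scribe wrote it)
attestations = [
    ("F.4.3", "single sign, components close"),
    ("B.2.8", "single sign, components touching"),
    ("D.1.11", "single sign, components touching"),
    ("F.4.7", "separate signs PA LU BI"),
]

def normal_usage(rows):
    """Return the majority treatment and the counts behind it."""
    counts = Counter(
        "single sign" if how.startswith("single sign") else "separate signs"
        for _, how in rows
    )
    return counts.most_common(1)[0][0], counts

treatment, counts = normal_usage(attestations)
print(treatment, dict(counts))
```

On these four attestations the normal treatment is the single sign, and the split writing at F.4.7 is the exception that would need special handling, not the other way round.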

Evidence and Traditional Sign Catalogs Agree
A small survey of the spacing of some candidates for single signs in the Gudea statues, and whether they are or are not split across line-breaks (or indent breaks) within a register, yields a very strong correlation between the spacing and line-break treatments, on the one hand, and the standard sign catalogs, on the other hand. This is summarized in table form on the web page http://www.CuneiformSigns.org/SignSpacingCorrelate.htm, included as part of this paper.

What are the Consequences?
The two approaches to encoding Cuneiform differ greatly in the degree to which they respect the empirically determinable significant sign units of the script (different both from components and from sign sequences). This contrast is made clear on the web page
http://www.CuneiformSigns.org/TwoApproaches.htm included as part of this paper.

I believe there is simply no contest, and that proposal N2664(R) would do considerable damage to the encoding of cuneiform, by loading large amounts of extra complexity onto many aspects of implementations, and making users and those who serve them needlessly dependent on implementers. The cause of these disadvantages is demonstrable errors in the attempt to identify what are the productive functioning units of the script.

Lists of Signs to Add
Also included as part of this paper are three web pages listing signs which need to be replaced or added (in addition to the changes made in revision N2664R). These web pages are
http://www.CuneiformSigns.org/ReplaceSigns.htm and
http://www.CuneiformSigns.org/AddSigns.htm and
http://www.CuneiformSigns.org/BorgerAdds.htm [this one was preliminary, will be upgraded in mid-February – currently it mainly serves to link N2664R proposed code points with sign ID numbers in Borger's in-press Mesopotamisches Zeichenlexikon]
A full set of signs with images will be provided in February.

Moving Right Along
Doing it right need not interfere with getting a Cuneiform encoding proposal approved in June 2004. Most of the text of N2664 is well written and can be used as is, except where the analysis of this present paper would require changes to it. Most of the very good work in extending sign lists which is manifest in N2664 and N2664R, itself building on the long traditions of the field, stands without need for change. That includes work by Steve Tinney, the CDLI, and Miguel Civil. Only artificial fragments need to be eliminated, and traditional signs added except in individual instances where they can be shown to be errors perpetuated in the traditional lists.

I propose that we stick to our foundations, keep our feet on the ground, and proceed in the following manner.

A. Maintain the solid encoding principles we started with:

1. Encode signs not readings (script not language)
2. Encode signs not sequences of signs
3. Encode signs not variants
4. Encode signs not fragments of signs
5. Include sufficient distinctions for each stage covered (currently mostly UrIII and later)
6. Unify those signs which are primary relatives in lineal historical descent, encode them the same.

7. The standard sign catalogs, as extended by the work of PSL, CDLI, and Civil, and with any additional whole signs found in N2664 and its revision, should be the default list we start with. We can eliminate signs only as we can show in exceptional instances that the identifications are not secure, or that the traditional catalogs made some kind of error. In addition to sign catalogs, we will of course use the best available published work by the recognized authorities in each field of cuneiform, and more recent and specific information from experts when it is available.

8. Where we have evidence on spacing or line breaking, we use that judiciously to confirm or call into question status as single sign vs. as sequence of signs.

9. In cases of fluctuation, we usually go with normal usage, not with exceptional instances.

 

B. Keep traditional sign names; names need not be tied to component analyses.

10. Use traditional highly-recognizable sign names (MUL rather than AN OVER AN AN). For signs for which no reading or alphabetic name is available, use the catalog number with an initial letter to identify the catalog the sign is taken from, as in "C372".

11. Encoding order can reflect recognized components of signs. Alternate names which represent the components of single signs (and to a degree their arrangement) can be used to help our thinking, and even as a basis for encoding order, but with clear awareness that the componential decomposition of signs is not as stable across time as is the identity of the signs as wholes.

12. Componential analysis of signs should reflect full historical knowledge without limitation, so as to avoid implications for unification which turn out to be false. For example, the two names "SIGN x SHE3" and "SIGN x TUG2" are not distinguishable at a late stage where the components SHE3 and TUG2 merge as KU and we have only "SIGN x KU". Evidence from older time periods can resolve this in particular instances. (Steve Tinney has made use of some of this, from Krebernik, as has this writer.)
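The SHE3/TUG2 example in item 12 can be made concrete with a short sketch in Python. The mapping `LATE_FORM` and the helper `late_name` are hypothetical names introduced here; "SIGN" is the placeholder used in the text, not a real sign name.

```python
# Late-stage merger described in the text: SHE3 and TUG2 both appear as KU.
LATE_FORM = {"SHE3": "KU", "TUG2": "KU"}

def late_name(name: str) -> str:
    """Rewrite a componential name using late-stage component names."""
    return " ".join(LATE_FORM.get(tok, tok) for tok in name.split())

early_names = ["SIGN x SHE3", "SIGN x TUG2"]
late_names = [late_name(n) for n in early_names]

print(late_names)  # both names collapse to the same late-stage name
print(len(set(early_names)), "->", len(set(late_names)))
```

Two names that are distinct on early evidence become identical when only late components are considered, which is why a componential name built from late forms alone cannot be trusted to decide unification.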

 

C. The only criterion is whether we have securely identified signs distinctive from each other. In cases of limited knowledge, we should be explicit about the consequences of each kind of error which we can anticipate. That is done below. We should encode what we can now. There is to be no artificial limitation of time periods covered. Although a few of the following general principles are phrased in terms of older and later signs for which we may consider unification, they apply more broadly to any question whether two signs are the same or are distinctive. More specifically:

13. The fact that we certainly will later discover additional distinctions in no way argues against encoding the distinctions we are already securely aware of.

14. If we have securely identified a distinctive cuneiform sign, it matters not at all if we do not know its exact "reading" or meaning, or even any "reading" or meaning. To be most useful to cuneiform specialists, we provide encodings precisely for signs whose meanings are not yet known, or not fully known, just as for Linear B (U+10040 to U+1005D). Having them encoded will assist analysis of texts which use them.

15. For the large bodies of cuneiform texts, we expect those entering the data on computers to be trained professional experts, able to recognize distinctions and make choices as needed. As with any technical field, advances may lead to the correction of readings and even sign identifications in particular texts, but this is simply normal progress of science. It has no implications for our encoding.

16. If we have a sign from an earlier time period which can be securely unified with a sign from a later time period which is its primary lineal descendant, then as with all other unifications, no additional encoded character is appropriate. (Possible error: failure to encode a sign which turns out to be distinctive. Such a newly discovered distinction can be added later. But we do want to avoid the generation of encoded data which has later to be changed, whenever reasonably possible, so if a distinction is highly probable, we should encode it now.)

17. If a catalog listing of a sign does not make a distinction where it should, if it merges what we already know to be two distinct signs, then we make the distinction (by 5. above). If some of the instances lumped under one catalog listing are known to be unifiable with a later or earlier sign, then (by the preceding paragraph) we do unify them. If other instances lumped under one catalog listing are known to be distinct from other signs in our list, then we encode them separately, devising some practical workable new sign name as needed. (Possible error if we fail to recognize a distinction – as in the preceding item.) Example: ZATU catalog sign Z565 called "U2". According to a discussion by expert Cale Johnson, this catalog listing conflates two distinct signs, one of them indeed unifiable with the later sign "U2", the other distinct from that and not continued in later signs. So the newer sign might be called Z565b or Z565a, as the experts prefer.

18. We do not let ourselves be confused by mere *names*. Giving an old sign the same name as a known later sign does not constitute evidence that the two are lineal descendants. If we have evidence that two signs are not lineal descendants, we do not unify them. If the older sign is securely attested and clear in at least some of its instances, unless the older sign can be identified with *some* later sign, we must seriously consider adding a distinct encoded sign to our list. (For examples from the early Uruk stages, please see, after late on 1 February, the web page http://www.CuneiformSigns.org/ZATUSignTriage.htm . )

19. If an identification of an earlier sign with a later sign is probably false, and there is no other known valid unification with another later sign, then we can usefully consider encoding it separately. Quite a number of examples of this will be noted on the web page just mentioned. (Possible error: two encoded signs are later found to be mere variants of each other. Over-distinction in the encoded data brings with it no information loss. At most, a tiny number of encoded signs would later go out of active new use. Older data using them, to the extent not corrected by its expert custodians, is still readable.)
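Items 13 to 19 amount to a short decision procedure, which can be sketched as follows in Python. This is a simplified illustration under stated assumptions, not a tool used by the working group: the class `Candidate`, its field names, the function `triage`, and the working labels for the two Z565 signs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    label: str                               # working label only; names are not evidence (item 18)
    securely_identified: bool                # distinctive and clear in at least some instances
    lineal_descendant: Optional[str] = None  # later sign it securely unifies with, if any

def triage(c: Candidate) -> str:
    """Decide what to do with a candidate sign; readings and meanings play no part (item 14)."""
    if not c.securely_identified:
        return "defer"                       # too insecure to encode now
    if c.lineal_descendant is not None:
        return f"unify with {c.lineal_descendant}"   # item 16
    return "encode separately"               # items 17 to 19

# The ZATU Z565 example: one catalog number conflating two distinct signs.
z565_first = Candidate("Z565 (first sign)", True, "U2")
z565_second = Candidate("Z565 (second sign)", True, None)

for cand in (z565_first, z565_second):
    print(cand.label, "->", triage(cand))
```

The sketch deliberately never consults the label when deciding, which mirrors item 18: a shared name does not establish lineal descent.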

No Serious Practical or Time Limitations
The task laid out in this paper is already nearly complete. Lists of signs which need to be added are generally already complete. For my own contributions, I am mostly in the process of eliminating some mere variant signs and others which are too insecure to encode now, using the available published tools and any expert comments available. I will complete these contributions without fail by the end of February, 2004, and most of them by February 15th. Any expert contributions will of course be reflected in modified lists.
     With materials already so fully sorted and controlled for quality through the combined efforts of the entire assyriological tradition, including additions by participants in our current activities, it will be simple for experts to review a nearly-final list, as Steve Tinney has pointed out. They look for the items of most interest to them, items to which their specialized knowledge is most relevant. To the extent that experts in certain time periods can find even as much as a day free in the next four months, they can warn us of any errors they know in the sources we have available, can tell us of additional distinctions needed, or perhaps in a very few cases tell us of distinctions we have made that are very probably not warranted.

Many of the issues of fact and principle, and many of the signs which are documented in this paper were proposed via general statements and in part via lists of particular signs already in October and November 2003. This current paper is new in its comprehensiveness and in listing signs in a format with pseudo-code-point labels added for easier comparability with N2664(R).

One illustration and accompanying tables:
Gudea F.4
Pages from the web site http://www.CuneiformSigns.org, as linked to above.
and http://www.CuneiformSigns.org/ZATUSignTriage.htm (after mid-February)