Cuneiform Signs

Analysis and reports to support an international standard for computer encoding of the Cuneiform writing system
Research on the development of Cuneiform signs

Default Sort Orders

Any proposal for a standard encoding involves two kinds of sorting order. The first is the absolute default: the order in which characters sort without any special support from the operating system, simply by what computer specialists call a "binary sort" (comparison by numeric code value). This matters because support for Cuneiform within operating systems may be slow in coming.

The second-level, relative default is the order produced by a table-driven sort. The table for this sort is expected to find its way into operating systems, after a delay that may be large or small. Such a table can specify any sort whatsoever, but there is no reason for the "binary sort" to differ from the default sort which the Assyriological community most prefers, except where a simple listing cannot do the job and only a table can.

There is a third level, implementation-specific sorts, which can of course be whatever a user's implementation prefers. Not all users will have the ability to specify such a sort.

On the second and third levels, some characters can be sorted as if they were identical (unless they are the only differences between two strings being compared). Somewhat similar questions arise concerning equivalences in searching: which sequences of signs shall be searched as if they were identical? This is possible only via the second, table-driven level or via third-level programming support. On the binary sort level, all characters are searched as distinct.

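
The difference between the binary level and the table-driven level can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sign labels `A`, `B`, `C` and `A_VARIANT`, the code values in `CODE`, and the weights in `WEIGHT` are all invented for the example and are not taken from any encoding proposal. A "string" is modeled as a tuple of sign labels.

```python
# Hypothetical code values: A_VARIANT happens to be encoded after C.
CODE = {"A": 1, "B": 2, "C": 3, "A_VARIANT": 4}

# Hypothetical collation table: A and A_VARIANT share one primary weight,
# so they sort (and search) as if identical.
WEIGHT = {"A": 1, "A_VARIANT": 1, "B": 2, "C": 3}


def binary_key(signs):
    """Level 1: compare raw code values only."""
    return tuple(CODE[s] for s in signs)


def table_key(signs):
    """Level 2: compare table weights first; fall back to code values
    only when the weights of two strings are completely equal."""
    return (tuple(WEIGHT[s] for s in signs), binary_key(signs))


def equivalent(a, b):
    """Search equivalence at the table-driven level."""
    return [WEIGHT[s] for s in a] == [WEIGHT[s] for s in b]


strings = [("A_VARIANT", "B"), ("A", "C"), ("A", "B"), ("B", "A")]

binary_order = sorted(strings, key=binary_key)
table_order = sorted(strings, key=table_key)

print(binary_order)
print(table_order)
```

In the binary order the string beginning with `A_VARIANT` falls to the end, because its code value is highest. In the table-driven order it sits directly after `("A", "B")`: the two compare as identical on weights, and the code values decide only that final tie.
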
At the meeting of the Unicode Technical Committee of 4 November, Steve Tinney presented reasoning in favor of an absolute default sort order (the listing order in the Unicode / ISO standard) based on character names rather than on wedge types and sequences. Here is my longer version of that reasoning, with a bit of additional information included for context and history, to promote public discussion of the proposal. (My purpose on this page is not to represent my own views, though I agree with most of what appears here.)

[Added note, 9 February 2004: The basic argument of what follows is in favor of a sort order based on the components of signs, and many signs are proposed to be renamed in order to accomplish this. This entails such things as dropping the name MUL in the encoding standard and substituting "AN OVER AN AN" for it. But these two issues are separable. We can still retain the traditional and well-known name MUL, yet encode in a default order which responds to the components, known to all to be AN. The virtue of the ordering is that users can find a sign based on its components (in part as for Chinese characters). But the use of non-standard names will not enhance this.]

Orders Based on Glyph Shapes

The most common dictionary lookup orders (indexing order, sort order) have been based on Neo-Assyrian shapes of signs, partly because of the history of the field: the first extensive corpus catalogued was Neo-Assyrian. This is the de facto standard tradition for a large range of uses. The algorithm is based on wedge types and arrangements: horizontal (single before offset double before even double before triple); then angled (upper left head towards lower right tail) single, double, etc.; then "Winkelhaken" (point to the left, with two equal short tails or points); then vertical (head at top, tail at bottom) single, double, etc. For other eras or styles of cuneiform, the shapes of signs are of course different, so that a different sort order will result from the same algorithm.
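
A shape-based order of this kind can be sketched as a ranking of wedge types. The sketch below rests on several assumptions: the token names in `WEDGE_ORDER` are invented labels for the wedge types listed above, the series for angled and vertical wedges are cut off at "double" for brevity, the signs `sign_1` to `sign_5` are made-up wedge sequences rather than real signs, and comparing sequences wedge by wedge from the first wedge onward is an assumed reading of how the algorithm proceeds.

```python
# Wedge types in the rank order described in the text (labels are invented).
WEDGE_ORDER = [
    "horizontal-1", "horizontal-2-offset", "horizontal-2-even", "horizontal-3",
    "angled-1", "angled-2",
    "winkelhaken",
    "vertical-1", "vertical-2",
]
RANK = {name: i for i, name in enumerate(WEDGE_ORDER)}


def shape_key(wedges):
    """Turn a sign's wedge sequence into a list of ranks for comparison."""
    return [RANK[w] for w in wedges]


# Hypothetical signs, each described by its wedges in writing order.
signs = {
    "sign_1": ["vertical-1"],
    "sign_2": ["horizontal-1", "vertical-1"],
    "sign_3": ["horizontal-2-even"],
    "sign_4": ["horizontal-1", "winkelhaken"],
    "sign_5": ["angled-1"],
}

ordered = sorted(signs, key=lambda name: shape_key(signs[name]))
print(ordered)
```

Because the key depends only on the wedge sequence, giving the same sign a different sequence for another period or style changes its position, which is the period-dependence noted above.
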
Such a sort order is used, for example, in Rüster and Neu's Hittitisches Zeichen-Lexikon.

The third-millennium forms of Ur III are those which will be used for a Unicode standard, because they permit us to make necessary distinctions which are not possible using later first-millennium glyph shapes, where many mergers have occurred. We could use a sort order based on these shapes, where we could not use first-millennium glyphs. (Implicit here was, I think, the point that many older signs do not have distinct corresponding first-millennium forms, in the case of mergers etc.) But beyond that, this proposal attempts to transcend the differences between glyphs of different periods by ordering alphabetically according to sign names (default readings) which virtually all Assyriologists know.

When this proposal was first raised, the main question was whether Sumerologists and Assyriologists could agree on a set of names. Tinney explained earlier that this seemed not to be a substantial barrier. So we turn now to sign names.

Orders Based on Sign Names

Ordering signs based on sign names has some interesting properties. If all complex signs SIGN x SIGN (container sign x infixed component signs) are named analytically, by their parts, then all signs having the same container will be sorted together, which is intuitively reasonable. Thus all signs LAGAB x SIGN would be ordered together. (Using atomic, indivisible names for some signs of the type SIGN x SIGN would break such groups apart and scatter them to various places in the sort, at random from the point of view of the signs.)

Further tweaks are possible. For example, Tinney mentioned that all signs with single LU and modified LU could be kept together, followed by all signs with double LU and modified double LU. This was merely to point out that such refinements are possible; Tinney was not trying to propose particular tweaks or survey all the possibilities.

Decomposing Signs into Parts

What is a single sign?
Since sequences of independent signs (so-called "compound" signs) do not receive independent encodings as if they were single characters, the naming of signs is closely linked to the question of which signs are to be encoded. (It is of course also of importance for how an encoding proposal will be received by the Assyriological community.) Tinney pointed to the example of the sign numbered 12278 in the draft proposal, PAD 'rations', which he said *looks like* U + GAR, but that is a late reanalysis. Rather, it was originally a single sign, a sign like GAR but with a convex top in addition, for a heap of food or something like that (my notes do not contain Tinney's exact wording here; the point is that it was not originally a compound).

Sign Names As If Decomposed

The first draft of the merged PSL sign list was prepared under a regime promoting extensive decomposition of signs into lookalike parts. At least three of us involved in this effort have commented that some of these divisions seem artificial, and that some would even create sign "components" which do not occur independently or do not occur elsewhere at all. (See comments previously posted to the Cuneiform discussion list by Jerry Cooper, Cale Johnson, and Lloyd Anderson.) Tinney invited others to work on fixing the list of sign names in these respects. In principle, no sign name should be of the form SIGN.SIGN, since if that implies it is really a sequence of two signs, then it cannot be encoded as if it were a single character.

A few "Exceptions"?

[Here I report from the meeting of 3 November.] Assyriologists participating in our effort pointed out that blindly applying the principles by which we are so far working would lead to the exclusion of about 13 signs from the most basic syllabary list of 80 signs, a list which was taught first in the ancient scribal school for which we have evidence. That is because they would be treated as compounds.
One example of this type is AR, Labat #451 (which some might equate with IGI.RI, Labat #449.#86). [Lloyd added comment: I am on the lookout for kinds of evidence which might suggest that these 13 signs, or some of them, were regarded by the ancients as single signs for reasons beyond their use in the basic syllabary.]

Line Break Evidence

This is potentially a decisive source of information on what constitute single signs. But we still do not have a nuanced public statement on where line breaks can occur. We have had statements from Assyriologists that any sign *can* abnormally be split across line breaks. We have had statements from Assyriologists that what are traditionally regarded as single signs normally were *not* split across line breaks. We have had statements from Assyriologists that what are normally regarded as compounds could *readily* be split across line breaks (that is, presumably normally, not abnormally). All of these together might help us to decide what are single signs and what are compounds.

Can these be integrated in a more precise way? Can we distinguish *most of the time* between normal breaks and abnormal breaks? (An absolutely sharp dividing line is not necessary; all we need is a distinction that is sometimes clearly applicable, and to try to find more cases which are relatively clear.) This is important for a good proposal: in all scripts a line break is a normal consequence of the boundary between one character and the next, and we don't want implementers to have to "patch up" problems created by our fragmentation or lumping of signs by altering default behavior even in normal texts.
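
To return briefly to the name-based ordering described under "Orders Based on Sign Names": its grouping effect, and the separability of naming from ordering mentioned in the added note of 9 February 2004, can be sketched as follows. This is an illustration under assumptions, not a proposed implementation. The names "LAGAB x A" and "LAGAB x U" merely follow the LAGAB x SIGN pattern, the `COMPONENTS` table is a hypothetical lookup containing only the MUL example from the text, and comparing names word by word is an assumed comparison rule.

```python
def name_key(name):
    """Compare names component by component, so that a container sign
    sorts immediately before the signs built on it."""
    return tuple(name.split())


# Analytic naming: MUL appears under its component name.
analytic = ["LU", "LAGAB x U", "GAR", "AN OVER AN AN", "LAGAB", "AN", "LAGAB x A"]
analytic_order = sorted(analytic, key=name_key)

# Atomic naming: the same sign under its traditional name MUL.
atomic = ["LU", "LAGAB x U", "GAR", "MUL", "LAGAB", "AN", "LAGAB x A"]
atomic_order = sorted(atomic, key=name_key)

# Hypothetical table: keep the traditional name, but order by components.
COMPONENTS = {"MUL": "AN OVER AN AN"}


def component_key(name):
    """Look up a sign's components if listed, else use the name itself."""
    return tuple(COMPONENTS.get(name, name).split())


component_order = sorted(atomic, key=component_key)

print(analytic_order)
print(atomic_order)
print(component_order)
```

With analytic names, the LAGAB signs stay together and "AN OVER AN AN" follows AN. Sorted purely by the atomic name, MUL lands at the end of the list, far from AN. With the lookup table, MUL keeps its traditional name and still sorts directly after AN.
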
Copyright © 2003. All Rights Reserved. Much of the analytical material on this web site will be included in an etymological study and concordance to cuneiform signs, to be published shortly, and may be used to validate the sign list, but should not be cited in any detail until it is published (guaranteed 2004, probably spring). Permission is granted for others to use the information on these web pages for preparation of a proposal to Unicode for a standard encoding of Cuneiform. The proposed sign list itself is free of any restrictions.