Plan for unicode support

This is a plan for switching to unicode with utf8 encoding, to allow larger fonts, unlimited font icons, and painless support of non-Latin scripts.

In memory, strings will be stored in UTF8, which is very convenient. Rather little code will need to be updated, because string concatenation, MID, INSTR, FORMAT, strprintf, and most of our utility functions work unchanged on utf8 strings. Code that iterates over strings byte-by-byte but only cares about characters < 128, such as the code that processes embed codes, doesn't need to be updated either.
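
As a minimal illustration (not engine code, and the "${" marker below is only a stand-in for whatever delimiter the embed-code parser really looks for): byte-wise scanning for ASCII markers stays correct on UTF-8 because every byte of a multi-byte sequence is &h80 or above, so it can never equal an ASCII character.

 ' Sketch only: find the next occurrence of an ASCII marker in a utf8 string.
 ' Safe because no byte of a multi-byte UTF-8 sequence is below &h80.
 Function find_embed_start(byref text as string, byval startoff as integer = 0) as integer
     for i as integer = startoff to len(text) - 2
         if text[i] = asc("$") andalso text[i + 1] = asc("{") then return i
     next
     return -1   ' not found
 End Function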

Using UTF8, characters 128-255 take two bytes to encode, which means strings that currently fit in a fixed-width text field may no longer fit. Therefore, to avoid having to replace all lump file formats with RELOAD-based ones, I propose that fixed-width string fields be able to store either 8-bit (extended ascii) or utf8 strings.
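
To make "whichever fits" concrete, here is a hedged sketch (the helper name is hypothetical, not existing engine code, and valid UTF-8 input is assumed): store as 8-bit if every codepoint is below 256 and the codepoint count fits in the field, store as raw UTF-8 if the byte length fits, otherwise the string doesn't fit at all.

 ' Hypothetical helper: decide how a UTF-8 string can be stored in a
 ' fixed-width field of field_bytes bytes. Assumes valid UTF-8 input.
 ' Returns 0 = store as 8-bit extended ascii, 1 = store as UTF-8, -1 = doesn't fit.
 Function pick_field_encoding(byref utf8 as string, byval field_bytes as integer) as integer
     dim codepoints as integer = 0
     dim fits_8bit as boolean = true
     dim i as integer = 0
     while i < len(utf8)
         dim lead as integer = utf8[i]
         if lead < &h80 then
             i += 1                 ' plain ASCII byte
         elseif lead <= &hC3 then
             i += 2                 ' 2-byte sequence encoding a codepoint in &h80-&hFF
         else
             fits_8bit = false      ' codepoint >= 256: has no 8-bit form
             if lead < &hE0 then
                 i += 2
             elseif lead < &hF0 then
                 i += 3
             else
                 i += 4
             end if
         end if
         codepoints += 1
     wend
     if fits_8bit andalso codepoints <= field_bytes then return 0
     if len(utf8) <= field_bytes then return 1
     return -1
 End Function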

UTF-8 implementation plan

  • Check that all code that uses strings is utf8-aware, where needed
    • strings which are not utf8 (eg binary data) should probably be annotated with TYPE aliases like bytestring, zbytestring or string8, zstring8
    • Any code that uses LEN (aside from just IF LEN(x) THEN), or LEFT, RIGHT, lpad, rpad, rlpad, MID, and maybe others, needs to be checked
      • To support variable-width fonts, these typically need to be converted to measure widths in pixels instead.
    • Any code that iterates over the characters might need to iterate over codepoints instead (codepoint helpers are sketched after this list)
      • Password hashing needs special attention, so the hashes don't change
  • RELOAD documents will gain a new node type for utf8 encoded strings.
    • Code that stores binary data in a RELOAD doc needs to call separate functions, not GetString.
  • Update all other file formats with fixed-length text data fields to encode the text either as 8-bit (extended ascii) or utf8, whichever fits, indicated with an extra bit somewhere
    • For read/writebinstring (one byte per char), the encoding bit can be in the length field (a sketch of this follows the list)
    • For read/writebadbinstring (two bytes per char), as above, but utf-8 encoding can use one byte per char instead, so up to twice as many characters can fit!
      • Item names are stored as these badbinstrings
    • A few strings are stored zero-padded with no length; the encoding bit either needs to be stored separately, or a one-byte prefix used to indicate utf8
  • strgrabber needs to be updated to treat maxlength as the number of bytes available, whether encoding as 8-bit or utf8.
  • Check every other way that strings enter or leave the engine
    • filenames. See decode_filename.
      • On Unix, filenames are already assumed to be utf8.
      • On Windows, filenames are 8-bit, which will apparently need to be converted to/from utf8 losslessly for consistency
        • OPENFILE and findfiles can handle this, and we should just replace all remaining instances of OPEN with OPENFILE.
        • Unfortunately it probably isn't practical to support unicode (UTF16) filenames, because FB's file routines would need to be modified
    • reading/writing .txt files
      • Writing UTF8 text files is fine, no changes needed
      • Reading text files is trickier, since we want to support Latin-1, UTF8 and UTF16. (UTF32 is used so little that even many kitchen-sink text editors don't support it.) A sketch of sniffing the encoding follows this list.
        • FB's file functions support reading/writing unicode files, but OPENFILE doesn't expose this yet
    • Scripts (strings and script names). HSpeak already supports unicode, but dumbs it down to ascii when writing a .hs file
    • gfx backends (eg setwindowtitle)
    • The os_* modules (mainly winapi calls).
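
The codepoint iteration mentioned above could use helpers along these lines (illustrative names, not existing engine functions, and valid UTF-8 input is assumed). Note they don't help with the variable-width-font cases, where the right fix is to measure rendered width in pixels rather than count characters.

 ' Sketch of codepoint-level helpers. Continuation bytes always match the
 ' bit pattern 10xxxxxx (&h80-&hBF), which makes both of these trivial.

 ' Number of codepoints in a utf8 string (LEN counts bytes instead).
 Function utf8_len(byref s as string) as integer
     dim n as integer = 0
     for i as integer = 0 to len(s) - 1
         if (s[i] and &hC0) <> &h80 then n += 1   ' count non-continuation bytes
     next
     return n
 End Function

 ' Advance a byte offset to the start of the following codepoint.
 Function utf8_next(byref s as string, byval bytepos as integer) as integer
     bytepos += 1
     while bytepos < len(s) andalso (s[bytepos] and &hC0) = &h80
         bytepos += 1
     wend
     return bytepos
 End Function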
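
The "encoding bit in the length field" idea might be packed and unpacked like the sketch below; the flag position is made up here, and the real binstring layout and which bit is actually free would need checking against read/writebinstring.

 ' Sketch only: pack/unpack a 16-bit length word whose top bit says whether
 ' the bytes that follow are utf8 (set) or 8-bit extended ascii (clear).
 Const UTF8_LENGTH_FLAG = &h8000

 Function pack_length(byval bytelen as integer, byval is_utf8 as boolean) as ushort
     dim lenword as ushort = bytelen
     if is_utf8 then lenword or= UTF8_LENGTH_FLAG
     return lenword
 End Function

 Sub unpack_length(byval lenword as ushort, byref bytelen as integer, byref is_utf8 as boolean)
     is_utf8 = (lenword and UTF8_LENGTH_FLAG) <> 0
     bytelen = lenword and not UTF8_LENGTH_FLAG
 End Sub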
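
Finally, a hedged sketch of guessing a .txt file's encoding from a sample of its leading bytes, as mentioned above: check for BOMs first, then fall back to UTF-8 validation. One simplification is that a sample cut off in the middle of a multi-byte sequence would be misjudged as Latin-1.

 ' Sketch: guess the encoding of a text file from a sample of its first bytes.
 ' Returns "utf16le", "utf16be", "utf8" or "latin1".
 Function guess_text_encoding(byref head as string) as string
     ' Byte order marks first
     if len(head) >= 2 andalso head[0] = &hFF andalso head[1] = &hFE then return "utf16le"
     if len(head) >= 2 andalso head[0] = &hFE andalso head[1] = &hFF then return "utf16be"
     if len(head) >= 3 andalso head[0] = &hEF andalso head[1] = &hBB andalso head[2] = &hBF then return "utf8"
     ' No BOM: if the sample validates as UTF-8, assume UTF-8, otherwise Latin-1
     dim i as integer = 0
     while i < len(head)
         dim continuations as integer
         if head[i] < &h80 then
             continuations = 0
         elseif (head[i] and &hE0) = &hC0 then
             continuations = 1
         elseif (head[i] and &hF0) = &hE0 then
             continuations = 2
         elseif (head[i] and &hF8) = &hF0 then
             continuations = 3
         else
             return "latin1"            ' not a legal UTF-8 lead byte
         end if
         for j as integer = 1 to continuations
             if i + j >= len(head) orelse (head[i + j] and &hC0) <> &h80 then return "latin1"
         next
         i += 1 + continuations
     wend
     return "utf8"
 End Function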

Unicode implications

Converting Unicode text to lower, upper or titlecase is hard, but I see no reason why we should need to attempt it.

Unicode characters can have multiple canonical forms; for example, an accented letter can be encoded either as a single precomposed character, or as a base letter followed by a combining accent.

It will actually be possible to draw the accent modifiers in a font as zero-width characters which overlap the previous character, but positioning may be poor. Therefore it's probably better to combine modifier characters with their base characters as much as possible. We actually already have code to do a simple version of this.
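
As an illustration only (a tiny hard-coded table, nowhere near full NFC normalization, and not the engine's existing code), a simple composition step could fold a base letter plus a following combining accent into a precomposed codepoint:

 ' Illustration only: fold (base letter, combining accent) into a precomposed
 ' codepoint for a handful of cases; return 0 if no precomposed form is known.
 Function compose_pair(byval base_cp as integer, byval combining_cp as integer) as integer
     if combining_cp = &h301 then          ' U+0301 COMBINING ACUTE ACCENT
         select case base_cp
             case asc("a"): return &hE1    ' á
             case asc("e"): return &hE9    ' é
             case asc("i"): return &hED    ' í
             case asc("o"): return &hF3    ' ó
             case asc("u"): return &hFA    ' ú
         end select
     end if
     return 0
 End Function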

Fallback if we can't use a TTF or OTF Unicode font, or if a particular such font lacks a certain Unicode character

We can use the latest version of the GNU Unifont bitmap font as a fallback mechanism to cover most Unicode characters. See BabelMap for MS Windows for an advanced charmap app (it also has an online version!).