Author Topic: Handling wchar.h, utf-8 and so on  (Read 3262 times)

Offline nyteschayde (Topic starter)

  • VIP / Donor - Lifetime Member
  • Hero Member
  • *****
  • Join Date: Mar 2002
  • Posts: 643
    • http://www.nyteshade.com
Handling wchar.h, utf-8 and so on
« on: September 11, 2019, 07:01:37 AM »
So I have a few programming environments set up, including gcc 2.95.3, SAS/C 6.58, Amiga E and more. Trying to figure out a way to port programs that rely on wide character support or UTF character support. What is usually done when a program to be ported relies on these things? Any thoughts, experience? Point a girl in the right direction?
Senior MTS Software Engineer with PayPal
Amigas: A1200T 060/603e PPC • A1200T 060 • A4000D 040 • A3000 (x2) • A2000 Vamp/V2 • A1200 (x4) • A1000 (x3) • A600 Vamp/V1 • A500
 

Offline Rotzloeffel

Re: Handling wchar.h, utf-8 and so on
« Reply #1 on: September 11, 2019, 12:33:06 PM »
I am not a Programmer, but normally codesets.library should be your friend :)

http://aminet.net/package/util/libs/codesets-6.21

Save Planet Earth! It is the only one in the galaxy with fresh and cold beer :laughing:
 

Offline Joloo

Re: Handling wchar.h, utf-8 and so on
« Reply #2 on: September 14, 2019, 08:30:54 AM »
Quote
So I have a few programming environments set up, including gcc 2.95.3, SAS/C 6.58, Amiga E and more. Trying to figure out a way to port programs that rely on wide character support or UTF character support.

Normally, wide char support is only available with newer gcc versions (gcc 3.x and later), and then only in conjunction with clib2; I don't know whether clib2 supports gcc 2.95.x.
By UTF, do you mean UTF-7, UTF-8, UTF-16 and/or UTF-32 (for LE/BE machines), or all of them?


Quote
What is usually done when a program to be ported relies on these things? Any thoughts, experience?

Depends on...
If the encoding scheme is only used for disk operations, you can drop the entire Unicode support and use the Standard C Library function equivalents instead.

If the internal representation of characters relies on a multi-byte encoding scheme, you have to support it natively, be it UTF-16 or UTF-8.
Using codesets.library, as Rotzloeffel suggested, doesn't help much here. It is just a converter between the different formats, mainly used to transcode a multi-byte character set to a single-byte ISO encoding. That is because the internal representation of character codes on AmigaOS is bound to single-byte character sets, and the only functions available are those offered by e.g. the Standard C Library, which supports only single-byte encodings (strictly speaking, not even these but purely ASCII). What can help you with UTF-16, in contrast, is clib2. It offers functions that deal with this multi-byte encoding scheme.

I have to confess that when I port software from Windows/Linux to AmigaOS, it is my own software anyway, in which I use UTF-8 internally. But I am biased, because I've written a lot of UTF-8 code and I don't like UTF-16: it is damn slow unless one limits oneself to Unicode Standard 3. Today we're using Unicode Standard 12.1, so each 16-bit code unit must be checked to see whether it is a surrogate - and that is a time-consuming process.
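To make the surrogate check concrete, here is a minimal sketch in C89-style C (so it should suit older compilers too). The function name `utf16_next` and the choice to return U+FFFD for unpaired surrogates are my own assumptions for illustration; this is not a clib2 or AmigaOS API, and it assumes the code units are already in host byte order.

```c
#include <stddef.h>

/* Hypothetical helper, not part of clib2 or any AmigaOS library. */
#define REPLACEMENT_CHAR 0xFFFDUL

/* Decode one code point from UTF-16 code units (host byte order).
   The caller must ensure *pos < len. Advances *pos by one unit,
   or by two when a valid surrogate pair is consumed. */
unsigned long utf16_next(const unsigned short *s, size_t len, size_t *pos)
{
    unsigned long u = s[(*pos)++] & 0xFFFFUL;

    if (u >= 0xD800UL && u <= 0xDBFFUL) {        /* high surrogate */
        if (*pos < len) {
            unsigned long lo = s[*pos] & 0xFFFFUL;
            if (lo >= 0xDC00UL && lo <= 0xDFFFUL) {
                (*pos)++;                        /* consume the low half */
                return 0x10000UL + ((u - 0xD800UL) << 10) + (lo - 0xDC00UL);
            }
        }
        return REPLACEMENT_CHAR;                 /* unpaired high surrogate */
    }
    if (u >= 0xDC00UL && u <= 0xDFFFUL)
        return REPLACEMENT_CHAR;                 /* stray low surrogate */
    return u;                                    /* ordinary BMP code unit */
}
```

Every code unit goes through the two range comparisons, which is the per-unit cost referred to above. Code that only ever sees characters from the Basic Multilingual Plane could skip them, at the price of mishandling everything above U+FFFF.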

If I have to deal with UTF-16 strings, I first transcode them to UTF-8 and then apply the necessary operations (replacing text and so on). But as I already stated, I am biased.

Multi-byte encoding: Range from 0x000000 to 0x10FFFF (21 bits) with ~135000 characters or control codes.
Single-byte encoding: Range from 0x00 to 0xFF (8 bits) corresponding to 256 characters or control codes.
ASCII: Range from 0x00 to 0x7F (7 bits) corresponding to 128 characters or control codes.
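
The 21-bit range is the reason UTF-8 needs between one and four bytes per character. A rough sketch of the encoding side, assuming a made-up helper name `utf8_encode` (not taken from any library mentioned in this thread):

```c
#include <stddef.h>

/* Hypothetical helper: encode one code point as UTF-8.
   Writes 1 to 4 bytes to out and returns the count, or 0 if cp
   is a surrogate (0xD800..0xDFFF) or lies above 0x10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80UL) {                       /* up to 7 bits: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                      /* up to 11 bits */
        out[0] = (unsigned char)(0xC0UL | (cp >> 6));
        out[1] = (unsigned char)(0x80UL | (cp & 0x3FUL));
        return 2;
    }
    if (cp < 0x10000UL) {                    /* up to 16 bits */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;                        /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0UL | (cp >> 12));
        out[1] = (unsigned char)(0x80UL | ((cp >> 6) & 0x3FUL));
        out[2] = (unsigned char)(0x80UL | (cp & 0x3FUL));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {                  /* up to 21 bits */
        out[0] = (unsigned char)(0xF0UL | (cp >> 18));
        out[1] = (unsigned char)(0x80UL | ((cp >> 12) & 0x3FUL));
        out[2] = (unsigned char)(0x80UL | ((cp >> 6) & 0x3FUL));
        out[3] = (unsigned char)(0x80UL | (cp & 0x3FUL));
        return 4;
    }
    return 0;                                /* outside the Unicode range */
}
```

ASCII text stays byte-for-byte identical under this scheme, which is one reason UTF-8 data often passes through single-byte string functions unharmed as long as nothing tries to interpret the bytes above 0x7F.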

If one now transcodes multi-byte sequences into a single-byte encoding, it must be clear that one runs the danger of truncating (invalidating) strings if the multi-byte sequences cannot be mapped to the single-byte set - and that happens quite often if you port from a Unicode-aware system (Linux/BSD/Windows) to a non-aware system (AmigaOS 1 to 3).
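
To illustrate where the loss happens, here is a small sketch that converts UTF-8 to ISO-8859-1 and counts what could not be mapped. It is only an illustration under my own assumptions: `utf8_to_latin1` is a made-up name, it is not how codesets.library works internally, and substituting '?' is just one possible policy.

```c
#include <stddef.h>

#define BAD_CP 0xFFFFFFFFUL  /* marker for malformed input (hypothetical) */

/* Hypothetical helper, not the codesets.library API.
   Converts n bytes of UTF-8 in src to ISO-8859-1 in dst; dst needs
   room for n bytes. Code points above U+00FF and malformed or
   overlong sequences become '?', and *lost counts them.
   Returns the number of bytes written to dst. */
size_t utf8_to_latin1(const unsigned char *src, size_t n,
                      unsigned char *dst, size_t *lost)
{
    size_t i = 0, o = 0;

    *lost = 0;
    while (i < n) {
        unsigned long cp = src[i++];
        unsigned long min = 0;   /* smallest value allowed for this length */
        size_t extra = 0, k;

        if (cp < 0x80UL)                  { /* ASCII, nothing to do */ }
        else if ((cp & 0xE0UL) == 0xC0UL) { cp &= 0x1FUL; extra = 1; min = 0x80UL; }
        else if ((cp & 0xF0UL) == 0xE0UL) { cp &= 0x0FUL; extra = 2; min = 0x800UL; }
        else if ((cp & 0xF8UL) == 0xF0UL) { cp &= 0x07UL; extra = 3; min = 0x10000UL; }
        else                              { cp = BAD_CP; }  /* invalid lead byte */

        for (k = 0; k < extra; k++) {
            if (i >= n || (src[i] & 0xC0) != 0x80) {
                cp = BAD_CP;              /* truncated or malformed sequence */
                break;
            }
            cp = (cp << 6) | (src[i++] & 0x3FUL);
        }
        if (cp != BAD_CP && cp < min)
            cp = BAD_CP;                  /* overlong encoding */

        if (cp <= 0xFFUL) {
            dst[o++] = (unsigned char)cp; /* Latin-1 equals U+0000..U+00FF */
        } else {
            dst[o++] = '?';               /* cannot be represented */
            (*lost)++;
        }
    }
    return o;
}
```

With the input "Grüße €" in UTF-8, the umlaut and the sharp s survive because they exist in ISO-8859-1, while the euro sign (U+20AC) does not and ends up as '?'. Checking the `lost` counter afterwards lets a port decide whether the converted string is still usable.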

If you are really interested in porting software using the "Universal multiple-octet coded character set Transformation Format" (UTF), you should treat it as a learning exercise that will demand much time. It can't be achieved without effort unless the strings are used exclusively for disk operations, like e.g. "_wfopen() / _wfopen_s()" (both Windows-specific; Standard C has no wide-character fopen).