rejetto forum

2.4 beta unicode problem

NaitLee · 11 · 12845


Offline NaitLee

  • Tireless poster
  • ****
    • Posts: 203
  • Computer-brain boy
    • View Profile
I had already tested with various browsers before, but all of them have the problem...

Your uploads were fine; I can see the file 哲学.ppt is now fine in my browser too...

But the file 生活处处有哲学.ppt, which is the original filename from the problem test, goes bad. Could you try that one too, please?
Sorry for offering a filename that doesn't show the problem...

A discovery:
 Multi-byte ANSI characters have something interesting:
 these chars almost always take 2 bytes in ANSI,
 but in UTF-8 they are expressed in 3 bytes.
 So I found that if the number of UTF-8 chars is odd, the upload fails with an orphan non-printable byte. If it's even, it succeeds.
"Computation is not forbidden magic."
Takeback Template | PHFS
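
The byte counts behind this observation can be checked with a short Python sketch. It assumes that the "ANSI" code page on a Chinese Windows is GBK (code page 936), which the thread does not state explicitly; the helper name `byte_lengths` is made up for illustration.

```python
def byte_lengths(text):
    """Return how many bytes `text` needs in each encoding discussed here."""
    return {
        "gbk": len(text.encode("gbk")),           # assumed "ANSI" code page (936)
        "utf-8": len(text.encode("utf-8")),       # what goes over the network
        "utf-16": len(text.encode("utf-16-le")),  # Windows API form, no BOM
    }


for stem in ("哲学", "生活处处有哲学"):
    print(stem, len(stem), "chars", byte_lengths(stem))
```

哲学 has 2 characters (6 UTF-8 bytes) and 生活处处有哲学 has 7 characters (21 UTF-8 bytes), which lines up with the even case uploading fine and the odd case failing.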


Offline rejetto

  • Administrator
  • Tireless poster
  • *****
    • Posts: 13523
    • View Profile
ok, now i can see the problem on your hfs.
Still not on mine, but at least will be able to make some more investigation.
I'll be out few hours now.


Offline NaitLee

  • Tireless poster
  • ****
    • Posts: 203
  • Computer-brain boy
    • View Profile
rejetto,

Now let's turn to UTF-16, which you once mentioned as the format of filenames on Windows.
(It's my fault for mentioning ANSI every time; the main problem might not be there.)
The Unicode Standard PDF might be useful.

I referred to its Figure 2.11 and made a draft to simulate the conversion from UTF-8 to UTF-16, then found something suspicious.
I'll send the draft tomorrow.
My computer will stay on tonight for your further testing.

Edit:
I attached that draft.

Figure 2.11:
          A      Ω      語
UTF-8     41     CE A9  E8 AA 9E
UTF-16    0041   03A9   8A9E
I did the conversion in the reverse direction, from UTF-8 to UTF-16:
We can see the Chinese character takes 3 bytes in UTF-8, but 2 in UTF-16.
The omega always takes 2 bytes.
So, if there are only Greek symbols (2 bytes) in the filename, it will be fine;
if there is an odd number of Chinese characters (3 bytes), it will be bad, even if the number of multi-byte chars in a chunk is even.

I think the problem is in a byte counter or something similar, leaving the last odd byte orphaned and attached to the following single-byte char, so that both get corrupted.

The above is not 100% certain; it's only for reference.

Filenames in draft:
语文.txt
语Ω.txt
语文书.txt
« Last Edit: June 02, 2020, 12:25:11 AM by NaitLee »
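
The orphan-byte hypothesis above can be illustrated with a deliberately simplified model. This is only a sketch of the suspected mechanism, not the actual HFS bug: it assumes a consumer that treats every byte >= 0x80 as the lead byte of a 2-byte character, as a double-byte code page roughly does. The function name `swallowed_ascii` is hypothetical.

```python
def swallowed_ascii(name):
    """List the ASCII chars that get eaten when UTF-8 bytes are paired up 2 by 2.

    Simplified double-byte model (an assumption, not the real HFS code):
    any byte >= 0x80 is a lead byte and consumes the byte that follows it.
    """
    data = name.encode("utf-8")
    swallowed = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            # Lead byte: if its partner is plain ASCII, that char is lost.
            if i + 1 < len(data) and data[i + 1] < 0x80:
                swallowed.append(chr(data[i + 1]))
            i += 2
        else:
            i += 1
    return swallowed


for name in ("语文.txt", "语Ω.txt", "语文书.txt"):
    print(name, swallowed_ascii(name))
```

Under this model 语文.txt (6 non-ASCII bytes) survives, while 语Ω.txt (5 bytes) and 语文书.txt (9 bytes) both lose the "." that follows, even though 语Ω contains an even number of multi-byte characters.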


Offline rejetto

  • Administrator
  • Tireless poster
  • *****
    • Posts: 13523
    • View Profile
Quote from NaitLee:
Now let's turn to utf-16, once you mentioned, as the format of filenames on Windows.

Windows uses UTF-16 for its API. There's nothing to "turn to".
HFS then transmits over the net using UTF-8.
I'm trying to understand more about the problem based on the little I have.
Your intuition about the number of chars being odd is correct.
« Last Edit: June 01, 2020, 03:25:04 PM by rejetto »
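
As a rough illustration of the two forms mentioned here, the following Python sketch converts a filename between UTF-16LE (the form the Windows API uses) and UTF-8 (the form sent over the network). The function names are invented for this example and say nothing about how HFS does it internally.

```python
def utf16_to_wire(raw_utf16le):
    """Re-encode UTF-16LE bytes (Windows API form) as UTF-8 for transmission."""
    return raw_utf16le.decode("utf-16-le").encode("utf-8")


def wire_to_utf16(raw_utf8):
    """Re-encode UTF-8 bytes received from the network as UTF-16LE."""
    return raw_utf8.decode("utf-8").encode("utf-16-le")


name = "生活处处有哲学.ppt"
api_bytes = name.encode("utf-16-le")
wire_bytes = utf16_to_wire(api_bytes)
print(len(api_bytes), "bytes as UTF-16LE,", len(wire_bytes), "bytes as UTF-8")
print("round trip intact:", wire_to_utf16(wire_bytes) == api_bytes)
```

The conversion is lossless in both directions as long as each side decodes with the encoding the other side actually used.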


Offline rejetto

  • Administrator
  • Tireless poster
  • *****
    • Posts: 13523
    • View Profile
it took me hours but now i have a VM with XP in chinese, and the problem is reproduced there.
Of course I can't read any of the prompts XP shows me, so I'm going on blindly.


Offline rejetto

  • Administrator
  • Tireless poster
  • *****
    • Posts: 13523
    • View Profile
i had to split the topic because this bug has nothing to do with the translation of HFS.

Anyway, after hours of work i finally found the point where the bug is and fixed it.
You'll see it in the next release.

I could do this only after testing many, many builds on the Chinese Windows. I don't know how long it would have taken otherwise -_-
I really appreciated your help anyway, and be happy because the bug is gone :)


Offline NaitLee

  • Tireless poster
  • ****
    • Posts: 203
  • Computer-brain boy
    • View Profile
Quote from rejetto:
I really appreciated your help anyway, and be happy because the bug is gone :)

Glad to hear that! Good job :D


Offline NaitLee

  • Tireless poster
  • ****
    • Posts: 203
  • Computer-brain boy
    • View Profile
The macro {.exec.} has some problems:
When run as {.exec|php.exe index.php|out=x.}, the PHP result is saved to the variable ^x.
But when the original text from the file (read by PHP) is UTF-8, the result read by HFS is still shown and transmitted as ANSI. (When run directly in cmd.exe, the result is shown as Unicode.)
After converting ^x with {.convert|utf-8|ansi|{.^x.}.}, the last-char-corrupted problem I described before appears again. It seems the last char is already corrupted at the ANSI stage.

I'd offer some chars for testing. The top 4 lines are Chinese (compatible with rejetto's Chinese XP VM; 3 bytes per char in UTF-8, 2 bytes per char in UTF-16), the middle 4 lines are Korean (usually cannot be shown in a non-Unicode environment), and the bottom 4 lines are Greek (2 bytes per char in both UTF-8 and UTF-16).
Code:

二三
四五六
七八九十


각간
갈감갑
강개객갱

Α
ΒΓ
ΔΕΖ
ΗΘΙΚ
Save them to 1234.txt in HFS folder and run with php to see effect. See screenshot.
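
What happens when UTF-8 output is read as a single-byte code page can be sketched in Python. Latin-1 is used here only as a stand-in for "ANSI" because it maps every byte value; the real code page on the affected machine is different, and both function names are hypothetical.

```python
def misread_as_single_byte(data):
    """Interpret raw bytes as Latin-1: one displayed char per byte."""
    return data.decode("latin-1")


def repair(shown):
    """Undo the misreading: recover the bytes, then decode them as UTF-8."""
    return shown.encode("latin-1").decode("utf-8")


for line in ("二三", "각간", "ΔΕΖ"):
    raw = line.encode("utf-8")
    shown = misread_as_single_byte(raw)
    print(len(line), "chars ->", len(raw), "bytes ->", len(shown), "chars shown")
    print("recovered correctly:", repair(shown) == line)
```

The bytes themselves are intact at this stage; only their interpretation is wrong. Corruption becomes permanent once a later step converts or truncates the misread text.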


Offline rejetto

  • Administrator
  • Tireless poster
  • *****
    • Posts: 13523
    • View Profile
i started studying the problem, but it's not trivial.
apparently the output of consoles can be "oem" or unicode.


Offline Mars

  • Operator
  • Tireless poster
  • *****
    • Posts: 2063
    • View Profile
It is the same when you want to attach an external text file: it can be in ANSI, UTF-8 with or without BOM, or UTF-16, and the result can show distorted characters. Almost everything should be converted into one single format.
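
One possible way to bring such files into a single format is sketched below in Python: check for a BOM first, then try strict UTF-8, and only then fall back to a legacy code page. The function name, the fallback order, and the default `cp1252` are assumptions made for illustration, not what HFS does.

```python
import codecs


def decode_text_file(data, ansi="cp1252"):
    """Decode bytes of unknown encoding into a str (hypothetical helper).

    Order of checks: BOM, then strict UTF-8, then the legacy `ansi` code page.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    try:
        # UTF-8 without BOM: strict decoding rejects most legacy-encoded text.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(ansi, errors="replace")


sample = "语文 ΑΒΓ"
print(decode_text_file(codecs.BOM_UTF8 + sample.encode("utf-8")) == sample)
print(decode_text_file(codecs.BOM_UTF16_LE + sample.encode("utf-16-le")) == sample)
```

UTF-16 without a BOM is not handled here and would need an extra heuristic.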


Offline rejetto

  • Administrator
  • Tireless poster
  • *****
    • Posts: 13523
    • View Profile
i'm going to use a standard Windows function, "IsTextUnicode", that guesses the format of the output.
it's not 100% reliable but that's it.

For example, this is a "normal" dir listing
{.exec|cmd /c dir tmp\*.txt|out=x.}{.^x.}

Code:
08/06/2020  11:27             1.004 post-data.txt
07/07/2020  14:58               607 swipe panels.txt
10/02/2004  17:01             3.100 ?.txt
10/02/2004  17:01             3.100 ?.txt
10/02/2004  17:01             3.100 ?è?.txt
               7 File         16.524 byte


and this is with /u, which makes cmd produce unicode output
{.exec|cmd /u /c dir tmp\*.txt|out=x.}{.^x.}
Code:
08/06/2020  11:27             1.004 post-data.txt
07/07/2020  14:58               607 swipe panels.txt
10/02/2004  17:01             3.100 會.txt
10/02/2004  17:01             3.100 生.txt
10/02/2004  17:01             3.100 生è生.txt
               7 File         16.524 byte

this result will be possible only from the next release
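
The kind of guess involved can be sketched in Python. This is not the algorithm used by IsTextUnicode, only a crude stand-in: it assumes that mostly-ASCII text encoded as UTF-16LE has a zero in most odd-positioned bytes, and that anything else is in an OEM code page. The function names and the `cp850` default are assumptions.

```python
def looks_like_utf16le(data):
    """Crude guess: in mostly-ASCII UTF-16LE text, every second byte is 0."""
    if len(data) < 2 or len(data) % 2:
        return False
    high_bytes = data[1::2]
    return high_bytes.count(0) * 2 >= len(high_bytes)


def decode_console_output(data, oem="cp850"):
    """Decode captured console output as UTF-16LE or as an OEM code page."""
    if looks_like_utf16le(data):
        return data.decode("utf-16-le", errors="replace")
    return data.decode(oem, errors="replace")


line = "10/02/2004  17:01             3.100 生.txt\r\n"
print(decode_console_output(line.encode("utf-16-le")) == line)
```

Like the real function, a guess of this kind can be wrong, for example on very short output or on text that is mostly non-ASCII.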