Just wrote:
> I didn't know that. Thanks. :)
Yeah, I found that out the hard way in BCB6, where UTF8Encode() and UTF8Decode()
use a hand-written UTF-8 implementation that does not handle multi-byte sequences
correctly and can therefore corrupt data that uses higher codepoints. That
implementation still exists in D7 with some small differences (maybe they were
trying to fix it?). Eventually they ditched it and switched to Microsoft's
MultiByteToWideChar() and WideCharToMultiByte() functions instead (and later
added iconv support for cross-platform use).
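If you are stuck on D7, one way around the RTL decoder is to call MultiByteToWideChar() yourself. The following is only a minimal sketch, assuming a Win32 target: the function name Utf8ToWide and the constant UTF8_CODE_PAGE are made up for this example (the constant is declared locally rather than relying on the Windows unit to provide one), and error handling is reduced to returning an empty string.
{code}
program Utf8DecodeSketch;

{$APPTYPE CONSOLE}

uses
  Windows;

const
  // Assumption: declared locally for this sketch; 65001 is the UTF-8 code page ID.
  UTF8_CODE_PAGE = 65001;

// Hypothetical helper: decodes UTF-8 bytes to a WideString using the Win32 API
// instead of the RTL's UTF8Decode().
function Utf8ToWide(const S: AnsiString): WideString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then Exit;
  // First call: ask how many WideChars the decoded text needs.
  Len := MultiByteToWideChar(UTF8_CODE_PAGE, 0, PAnsiChar(S), Length(S), nil, 0);
  if Len <= 0 then Exit;
  SetLength(Result, Len);
  // Second call: perform the actual conversion into the allocated buffer.
  MultiByteToWideChar(UTF8_CODE_PAGE, 0, PAnsiChar(S), Length(S),
    PWideChar(Result), Len);
end;

var
  S: AnsiString;
  W: WideString;
begin
  // U+00E9 (e with acute accent) is the two bytes $C3 $A9 in UTF-8.
  SetLength(S, 2);
  S[1] := AnsiChar($C3);
  S[2] := AnsiChar($A9);
  W := Utf8ToWide(S);
  Writeln(Length(W));   // 1
  Writeln(Ord(W[1]));   // 233
end.
{code}
The two-call pattern (query the required length, then convert) avoids having to guess the buffer size up front.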
--
Remy Lebeau (TeamB)
On Mon, 15 Dec 2014 11:37:55 -0800, Remy Lebeau wrote:
>
> Also, UTF8Decode() is broken in D7, it doesn't implement UTF-8 correctly
> (that was fixed in a later version).
I didn't know that. Thanks. :)
Just wrote:
> The text is UTF-8 encoded, so use UTF8Decode(). e.g.:
>
> line:= UTF8Decode(MyStringList[i])
>
> i.e. UTF8Decode() will decode the UTF-8 encoded text into Unicode.
Just keep in mind that UTF8Decode() returns a WideString, which is not as
efficient as AnsiString (or UnicodeString in D2009+). The rest of his code
is using AnsiString since it is D7, so the overhead of converting individual
lines from AnsiString to WideString back to AnsiString might not be desirable.
Also, UTF8Decode() is broken in D7, it doesn't implement UTF-8 correctly
(that was fixed in a later version).
--
Remy Lebeau (TeamB)
Gerrit wrote:
> Relatively simple, yes. It seems the csv uses a UTF-8 encoding. You'll
> need to read that as a UTF-8 string into memory and then convert that to
> an Ansi string using a (the default) code page. Note that this may be a
> lossy conversion: any UTF-8 encoded character that does not have a mapping
> in the Ansi code page is lost. The function Utf8ToAnsi() is what you need here.
>
> When reading the file, you may also need to skip the UTF8 BOM if there's
> one.
For example:
{code}
const
  Utf8Bom: array[0..2] of Byte = ($EF, $BB, $BF);
var
  utf8: UTF8String;
  ms: TMemoryStream;
  ptr: PAnsiChar;
  len: Integer;
begin
  ms := TMemoryStream.Create;
  try
    // Load the raw bytes of the file.
    ms.LoadFromFile(csvfile);
    ms.Position := 0;
    ptr := PAnsiChar(ms.Memory);
    len := ms.Size;
    // Skip the UTF-8 BOM, if present.
    if len >= 3 then
    begin
      if CompareMem(ptr, @Utf8Bom[0], 3) then
      begin
        Inc(ptr, 3);
        Dec(len, 3);
      end;
    end;
    // Copy the remaining bytes into a UTF-8 string.
    SetString(utf8, ptr, len);
  finally
    ms.Free;
  end;
  // Convert to the default Ansi code page (may be lossy).
  MyStringList.Text := Utf8ToAnsi(utf8);
end;
{code}
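On the lossy-conversion point: if you need to know whether anything was lost, WideCharToMultiByte() can report it through its lpUsedDefaultChar parameter. This is just a rough sketch under some assumptions: the helper name WideToAnsiChecked is made up for illustration, it expects text that has already been decoded to a WideString by whatever means, and it targets the default Ansi code page (CP_ACP). Keep in mind that best-fit mapping can still substitute a similar-looking character without setting the flag, so treat the result as a hint rather than a guarantee.
{code}
program LossyCheckSketch;

{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical helper: converts W to the default Ansi code page and sets Lossy
// to True if at least one character was replaced by the default character.
function WideToAnsiChecked(const W: WideString; out Lossy: Boolean): AnsiString;
var
  Len: Integer;
  UsedDefault: BOOL;
begin
  Result := '';
  Lossy := False;
  if W = '' then Exit;
  // First call: ask how many bytes the converted text needs.
  Len := WideCharToMultiByte(CP_ACP, 0, PWideChar(W), Length(W), nil, 0, nil, nil);
  if Len <= 0 then Exit;
  SetLength(Result, Len);
  UsedDefault := False;
  // Second call: convert, and let the API report any use of the default character.
  WideCharToMultiByte(CP_ACP, 0, PWideChar(W), Length(W),
    PAnsiChar(Result), Len, nil, @UsedDefault);
  Lossy := UsedDefault;
end;

var
  A: AnsiString;
  Lossy: Boolean;
begin
  // Plain ASCII exists in every Ansi code page, so nothing should be lost here.
  A := WideToAnsiChecked('abc', Lossy);
  Writeln(A);
  Writeln(Lossy);
end.
{code}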
And of course, if you ever do upgrade to D2009 or later, TStringList.LoadFromFile()
can handle all of those details for you via the TEncoding class:
{code}
MyStringList.LoadFromFile(csvfile, TEncoding.UTF8);
{code}
--
Remy Lebeau (TeamB)