Unicode UTF-8 coding question  
News Group: embarcadero.public.delphi.language.delphi.general

I would like to know the following about UTF-8 coding:
When a file contains UTF-8 text (with or without a BOM) I understand
it such that the first 128 byte values ($00-$7F) represent standard
ASCII and the remaining 128 ($80-$FF) are used for encoded characters.
Apparently the encoding uses one lead byte with the high bit set,
followed by 1 or more extra bytes.

My question concerns these extra bytes: can any of them be $0D or $0A?

The reason I ask is that we use CVS as version control system and it
manages line endings such that on the server they are stored as $0A
and on a PC client as $0D$0A and on a Mac as $0D.
So the CVS system modifies the files concerning these bytes on commits
and checkouts.

If the extra bytes following a lead byte can never contain either
of these we will be OK to use UTF-8, but otherwise we may be in for
surprises down the road and have to use UTF-16, I guess...

I would much prefer UTF-8 if possible.

Date Posted: 17-Jan-2015, at 7:20 AM EST
From: Bo Berglund
 
Re: Unicode UTF-8 coding question  
On Sat, 17 Jan 2015 11:01:54 -0800, Remy Lebeau (TeamB) wrote:

Remy,
thank you for your clear answer! Much obliged!

Date Posted: 17-Jan-2015, at 2:21 PM EST
From: Bo Berglund
 
Re: Unicode UTF-8 coding question  
Bo wrote:

> When a file contains UTF-8 text (with or without a BOM) I understand
> it such that the first 128 byte values ($00-$7F) represent standard
> ASCII and the remaining 128 ($80-$FF) are used for encoded characters.

Yes.

> Apparently the coding uses one lead byte with the high bit set and
> then 1 or more extra bytes.

Yes.  The high bits of the lead byte specify how many bytes are used in the
sequence.  When the high bit is 0, the byte stands alone (1 byte, plain ASCII).
When the lead byte starts with 2, 3 or 4 one-bits followed by a zero bit
(110xxxxx, 1110xxxx, 11110xxx), there are 2, 3 or 4 bytes in the sequence,
respectively.  The extra bytes all have the form 10xxxxxx, so their high bit
is always set to 1.  The original UTF-8 design allowed sequences of up to
6 bytes, but RFC 3629 restricts it to 4 bytes (code points up to U+10FFFF)
to maintain full compatibility with UTF-16.
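
To make the bit patterns concrete, here is a minimal Python sketch (not Delphi,
and not taken from any library) that reads the sequence length off the lead
byte.  The function name utf8_sequence_length is made up for illustration, and
it only classifies the lead byte; it does not validate the rest of the sequence.

```python
def utf8_sequence_length(lead):
    """Return the UTF-8 sequence length implied by a lead byte (0..255).

    Hypothetical helper for illustration only: it inspects the high bits
    of the lead byte and does not check the continuation bytes.
    """
    if lead & 0b10000000 == 0b00000000:   # 0xxxxxxx: plain ASCII, 1 byte
        return 1
    if lead & 0b11100000 == 0b11000000:   # 110xxxxx: 2-byte sequence
        return 2
    if lead & 0b11110000 == 0b11100000:   # 1110xxxx: 3-byte sequence
        return 3
    if lead & 0b11111000 == 0b11110000:   # 11110xxx: 4-byte sequence
        return 4
    # 10xxxxxx is an extra (continuation) byte; 11111xxx is not allowed
    # under RFC 3629.
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead)


# One sample character for each sequence length.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print("U+%04X" % ord(ch), encoded.hex(), utf8_sequence_length(encoded[0]))
```

Passing an extra byte such as $80 raises ValueError, since a byte of the form
10xxxxxx can never start a sequence.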

> My question concerns these extra bytes: can any of them be $0D or $0A?

No.  Since the lead byte and the extra bytes of a multi-byte sequence always
have their high bit set to 1, they can never contain values < 128 ($0D is 13,
$0A is 10).  Remember, UTF-8 was designed to be 100% backwards compatible with
ASCII.  $0D and $0A are covered by ASCII, so they are encoded as-is in UTF-8.
Only non-ASCII characters (> 127) are encoded using additional bytes in UTF-8.
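
If you want to convince yourself, a brute-force check is easy to write.  The
following Python sketch (the function name code_points_containing is
hypothetical) encodes every Unicode code point except the surrogates, which
cannot be encoded in UTF-8, and lists the ones whose encoding contains a given
byte value.  It takes a second or two to run.

```python
def code_points_containing(byte_value):
    """List every code point whose UTF-8 encoding contains byte_value."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid in UTF-8
        if byte_value in chr(cp).encode("utf-8"):
            hits.append(cp)
    return hits


# The only code points that produce $0A or $0D are LF and CR themselves.
print(code_points_containing(0x0A))
print(code_points_containing(0x0D))
```

Each call should list only the control character itself, which is the
ASCII-compatibility guarantee described above.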

> The reason I ask is that we use CVS as version control system and it
> manages line endings such that on the server they are stored as $0A
> and on a PC client as $0D$0A and on a Mac as $0D. So the CVS system
> modifies the files concerning these bytes on commits and checkouts.

CVS has an option to disable that behavior.
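
For UTF-8 you should not need it, though.  As a rough illustration, here is a
Python sketch that imitates the conversion.  It assumes the conversion is a
plain byte-wise replacement of $0A with $0D$0A on checkout and the reverse on
commit; that is a simplification for the example, not a description of CVS
internals, and to_client and to_server are made-up names.

```python
def to_client(data):
    """Imitate a checkout on a PC: LF becomes CR LF (assumed byte-wise)."""
    return data.replace(b"\n", b"\r\n")


def to_server(data):
    """Imitate a commit from a PC: CR LF becomes LF (assumed byte-wise)."""
    return data.replace(b"\r\n", b"\n")


text = "Gr\u00fc\u00dfe\nfr\u00e5n Bo\n"   # non-ASCII text with two line breaks

# UTF-8: the client copy is still valid text and the round trip is lossless.
server_copy = text.encode("utf-8")
client_copy = to_client(server_copy)
assert client_copy.decode("utf-8") == text.replace("\n", "\r\n")
assert to_server(client_copy) == server_copy

# UTF-16: a line feed is the byte pair $0A $00, so a byte-wise conversion
# inserts a single $0D byte and shifts everything after it by one byte.
utf16_copy = text.encode("utf-16-le")
damaged = to_client(utf16_copy)
assert damaged.decode("utf-16-le", errors="replace") != text.replace("\n", "\r\n")
```

So under that assumption it is UTF-16, not UTF-8, that a line-ending
conversion would damage.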

> I would much prefer UTF-8 if possible.

Please do.

-- 
Remy Lebeau (TeamB)

Date Posted: 17-Jan-2015, at 11:01 AM EST
From: Remy Lebeau (TeamB)