Representing carriage return and newline using Char type  
News Group: embarcadero.public.delphi.language.delphi.general

I have a memory-mapped file where I access the data using a pointer to
a char. I increment the pointer until I encounter a carriage return or
a new line character.

This has been working fine with AnsiChar files but I want to be able to
handle files containing Unicode as well. So instead of defining my
pointers as pointers to AnsiChar I switched them to pointer to Char.

But now the test for return and newline are failing.

How do I represent carriage return and newline values as Char constants
so that my pointer to Char variables will match them?

const
  cReturn = Char(#$0D);
  cNewline = Char(#$0A);
var
  vCharPtr: PChar;
begin
  ...
  if (vCharPtr^=cReturn) or (vCharPtr^=cNewline) then  // This is failing!

Date Posted: 14-Jan-2015, at 12:01 PM EST
From: Doug Chamberlin
 
Re: Representing carriage return and newline using Char type  
Remy Lebeau (TeamB) wrote:

> > {quote:title=Doug Chamberlin wrote:}{quote}
> > 
> > Though I said unicode before I really meant UTF8 only. I will never
> > accept 16-bit or 32-bit encodings or create them.
> 
> In that case, things are much simpler.  UTF-8 is 8bit data, and ASCII
> is a subset of UTF-8, and CR and LF are both represented the EXACT
> SAME WAY in UTF-8 as they are in ANSI, so you can go back to your
> original code and keep using PAnsiChar, don't switch to PChar at all.
> You did not have to change anything at all in your line-break
> detection code to support UTF-8 files.  The only thing you would have
> to change is how you decode the data in between the line breaks, if
> you are converting that data to UnicodeString for processing.  ANSI
> and UTF-8 are decoded to UTF-16 differently, but you can use
> TEncoding to handle that logic.
> 
> > I plan to create two versions of my class, one for AnsiChar and one
> > for UTF8.
> 
> You don't need to do that.  PAnsiChar works just fine with UTF-8
> (Delphi's UTF8String type is an AnsiString, after all).

OK. That all makes sense. I think I'll stick with one class but add a
new method to return a line from the file as UTF8 string.

Thanks again for sharing your experience! Reminds me of the good old
days when these forums were alive with many developers and much good
info.

Date Posted: 15-Jan-2015, at 10:35 AM EST
From: Doug Chamberlin
 
Re: Representing carriage return and newline using Char type  
> {quote:title=Doug Chamberlin wrote:}{quote}
>
> Though I said unicode before I really meant UTF8 only. I will never
> accept 16-bit or 32-bit encodings or create them.

In that case, things are much simpler.  UTF-8 is 8-bit data, ASCII is a subset of UTF-8, and CR and LF are both represented the EXACT SAME WAY in UTF-8 as they are in ANSI, so you can go back to your original code and keep using PAnsiChar; don't switch to PChar at all.  You did not have to change anything at all in your line-break detection code to support UTF-8 files.  The only thing you would have to change is how you decode the data in between the line breaks, if you are converting that data to UnicodeString for processing.  ANSI and UTF-8 are decoded to UTF-16 differently, but you can use TEncoding to handle that logic.
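
As a rough illustration of that point (a minimal sketch, not code from the original class): line breaks are found byte by byte with PAnsiChar, and only the bytes in between are handed to TEncoding. The DecodeLine helper and the sample bytes are made up for this example; passing TEncoding.Default instead of TEncoding.UTF8 would decode the same bytes as ANSI.

```pascal
program DecodeLineSketch;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: decodes Len bytes starting at Start into a
// UnicodeString using the given encoding.
function DecodeLine(Start: PAnsiChar; Len: Integer; Encoding: TEncoding): string;
var
  Bytes: TBytes;
begin
  SetLength(Bytes, Len);
  if Len > 0 then
    Move(Start^, Bytes[0], Len);   // one copy into a TBytes buffer
  Result := Encoding.GetString(Bytes);
end;

const
  // Invented sample: 'caf' + U+00E9 as UTF-8 (C3 A9), CR LF, 'ok', LF
  Data: array[0..9] of Byte =
    ($63, $61, $66, $C3, $A9, $0D, $0A, $6F, $6B, $0A);
var
  P, LineStart, DataEnd: PAnsiChar;
begin
  P := PAnsiChar(@Data[0]);
  DataEnd := P + Length(Data);
  LineStart := P;
  while P < DataEnd do
  begin
    // CR and LF are single bytes in both ANSI and UTF-8
    if (P^ = #13) or (P^ = #10) then
    begin
      // Empty runs are skipped, so CR LF counts as one break
      // (this simplification also drops genuinely blank lines)
      if P > LineStart then
        Writeln(Length(DecodeLine(LineStart, P - LineStart, TEncoding.UTF8)));
      LineStart := P + 1;
    end;
    Inc(P);
  end;
end.
```

The program prints the decoded length of each line in UTF-16 code units, which is smaller than the byte count whenever a line contains multi-byte UTF-8 sequences.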

> I plan to create two versions of my class, one for AnsiChar and one
> for UTF8.

You don't need to do that.  PAnsiChar works just fine with UTF-8 (Delphi's UTF8String type is an AnsiString, after all).

--
Remy Lebeau (TeamB)

Date Posted: 15-Jan-2015, at 9:07 AM EST
From: Remy Lebeau (TeamB)
 
Re: Representing carriage return and newline using Char type  
Thanks, Remy! As usual your help is quite valuable and really
appreciated.

Though I said Unicode before I really meant UTF8 only. I will never
accept 16-bit or 32-bit encodings or create them.

I plan to create two versions of my class, one for AnsiChar and one for
UTF8. I already have the Ansi one. I just have to work on the other one.

Streams look to me to be too inefficient. Lots of copying of data.
These classes are meant to be ultra efficient with no extra copying of
anything. That's why I'm memory mapping the files to read them and
using pointers. So the essential function looks like it will build and
return a String starting from a PChar and scanning until an end of line
is encountered.
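
A minimal sketch of such a function, assuming 8-bit data (Ansi or UTF-8) so that PAnsiChar can be used for the scan. NextLine and the sample data are hypothetical names invented for this example; the only copy made is the one that builds the returned string.

```pascal
program ReadLineSketch;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: copies the bytes from Cur up to the next CR or LF
// (or DataEnd) into Line, then advances Cur past the line break.
// CR LF is treated as a single break. Returns False when no data is left.
function NextLine(var Cur: PAnsiChar; DataEnd: PAnsiChar;
  out Line: UTF8String): Boolean;
var
  Start: PAnsiChar;
begin
  Result := Cur < DataEnd;
  if not Result then
  begin
    Line := '';
    Exit;
  end;
  Start := Cur;
  while (Cur < DataEnd) and (Cur^ <> #13) and (Cur^ <> #10) do
    Inc(Cur);
  SetString(Line, Start, Cur - Start);   // the only copy of the line data
  if (Cur < DataEnd) and (Cur^ = #13) then
    Inc(Cur);
  if (Cur < DataEnd) and (Cur^ = #10) then
    Inc(Cur);
end;

const
  // Stand-in for the memory-mapped view; a real view would supply
  // its own start pointer and size.
  Sample: AnsiString = 'alpha'#13#10'beta'#10#10'gamma';
var
  Cur, DataEnd: PAnsiChar;
  Line: UTF8String;
begin
  Cur := PAnsiChar(Sample);
  DataEnd := Cur + Length(Sample);
  while NextLine(Cur, DataEnd, Line) do
    Writeln(Length(Line));   // length in bytes, not characters
end.
```

Because the scan stops at DataEnd, the mapped data does not need a terminating #0.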

I guess the prototype code you provided will be a good starting point,
but I don't think it gets it done as shown.

Date Posted: 15-Jan-2015, at 5:39 AM EST
From: Doug Chamberlin
 
Re: Representing carriage return and newline using Char type  
Doug wrote:

> This has been working fine with AnsiChar files but I want to be able
> to handle files containing unicode as well. So instead of defining my
> pointers as pointers to AnsiChar I switched them to pointer to Char.

That is fine for UCS-2/UTF-16 encoded files, but will not work for Ansi encoded 
files, since (P)Char is now an alias for (P)WideChar in D2009+.  You need 
to know ahead of time whether a given file is encoded as Ansi or Unicode 
and then use the appropriate logic for each one.  That means either writing 
separate reading functions for each encoding, or using something more generic 
like this:

const
  cReturn = 13;
  cNewline = 10;
var
  vCharPtr: PByte;
  vReturn: array of Byte;
  vNewline: array of Byte;
  vCharSize: Integer;
begin
  ...
  if (file is Unicode) then
  begin
    vCharSize := SizeOf(WideChar);
    SetLength(vReturn, vCharSize);
    SetLength(vNewline, vCharSize);
    PWideChar(vReturn)^ := WideChar(cReturn);
    PWideChar(vNewline)^ := WideChar(cNewline);
  end else
  begin
    vCharSize := SizeOf(AnsiChar);
    SetLength(vReturn, vCharSize);
    SetLength(vNewline, vCharSize);
    PAnsiChar(vReturn)^ := AnsiChar(cReturn);
    PAnsiChar(vNewline)^ := AnsiChar(cNewline);
  end;
  ...
  // CompareMem takes pointers, so the buffers are passed without dereferencing
  if CompareMem(vCharPtr, Pointer(vReturn), vCharSize) or
     CompareMem(vCharPtr, Pointer(vNewline), vCharSize) then
  begin
    ...
  end;
  ...
  if (file is Unicode) then
  begin
    // use PWideChar(vCharPtr)^ as needed...
  end else begin
    // use PAnsiChar(vCharPtr)^ as needed...
  end;
  ...
  Inc(vCharPtr, vCharSize);
  ...
end;
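
For anyone who wants to try the generic approach, here is a small self-contained variant. It is only a sketch: CountBreaks is a made-up helper, the sample strings stand in for file data, and UTF-16 data is assumed to be little-endian.

```pascal
program CharSizeScanSketch;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: counts CR and LF code units in a buffer whose code
// units are CharSize bytes wide (1 for Ansi/UTF-8, 2 for UTF-16LE).
function CountBreaks(Data: PByte; Size, CharSize: Integer): Integer;
var
  vReturn, vNewline: array of Byte;
  Remaining: Integer;
begin
  Result := 0;
  SetLength(vReturn, CharSize);    // new elements are zero-filled
  SetLength(vNewline, CharSize);
  vReturn[0] := 13;                // low byte first; any high byte stays 0
  vNewline[0] := 10;
  Remaining := Size;
  while Remaining >= CharSize do
  begin
    // Compare one whole code unit, never a single byte of a wider unit
    if CompareMem(Data, Pointer(vReturn), CharSize) or
       CompareMem(Data, Pointer(vNewline), CharSize) then
      Inc(Result);
    Inc(Data, CharSize);
    Dec(Remaining, CharSize);
  end;
end;

const
  AnsiData: AnsiString = 'a'#13#10'b'#10;
  WideData: UnicodeString = 'a'#13#10'b'#10;
begin
  Writeln(CountBreaks(PByte(PAnsiChar(AnsiData)), Length(AnsiData),
    SizeOf(AnsiChar)));
  Writeln(CountBreaks(PByte(PWideChar(WideData)),
    Length(WideData) * SizeOf(WideChar), SizeOf(WideChar)));
end.
```

Comparing a full code unit matters for UTF-16 data: a character such as U+010D has $0D as its low byte, so a byte-wise test would report a carriage return that is not there.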

On the other hand, this situation goes back to the argument that you should 
be doing your processing using Unicode only, which will greatly simplify your 
logic.  If the file is UCS-2/UTF-16 encoded, you can process it as-is.  If 
the file is Ansi encoded, decode the raw file data to Unicode as you are 
reading it, and then process only the decoded Unicode data as needed.  You 
can easily wrap your memory mapped data with a small decoder that parses 
the Ansi data and provides Unicode data to the rest of your code.  In D2009+, 
you could wrap your memory mapped pointer in a TCustomMemoryStream descendant 
and then use TStreamReader to read from it.  TStreamReader takes a TStream 
and a SysUtils.TEncoding as input and outputs only Unicode data that it 
decodes dynamically as it reads from the TStream.  
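
A rough sketch of the TStreamReader idea, under some assumptions: TPointerStream is a hypothetical descendant written for this example (SetPointer is protected in TCustomMemoryStream, so a descendant is needed to reach it), and an AnsiString stands in for the memory-mapped view.

```pascal
program StreamReaderSketch;
{$APPTYPE CONSOLE}
uses
  SysUtils, Classes;

type
  // Hypothetical helper: exposes an existing memory block as a read-only
  // stream without copying it.
  TPointerStream = class(TCustomMemoryStream)
  public
    constructor Create(AData: Pointer; ASize: NativeInt);
    function Write(const Buffer; Count: Longint): Longint; override;
  end;

constructor TPointerStream.Create(AData: Pointer; ASize: NativeInt);
begin
  inherited Create;
  SetPointer(AData, ASize);   // point the stream at the caller's memory
end;

function TPointerStream.Write(const Buffer; Count: Longint): Longint;
begin
  Result := 0;                // read-only: nothing is ever written
end;

const
  // Stand-in for the mapped file contents
  Sample: AnsiString = 'first'#13#10'second'#10'third';
var
  Stream: TPointerStream;
  Reader: TStreamReader;
begin
  Stream := TPointerStream.Create(PAnsiChar(Sample), Length(Sample));
  try
    // TEncoding.UTF8 here; TEncoding.Default would decode as ANSI instead
    Reader := TStreamReader.Create(Stream, TEncoding.UTF8);
    try
      while not Reader.EndOfStream do
        Writeln(Reader.ReadLine);
    finally
      Reader.Free;
    end;
  finally
    Stream.Free;
  end;
end.
```

The stream itself does not copy the mapped data, but TStreamReader reads it into an internal buffer and builds a new string for every line, so this route trades some copying for simpler code.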

--
Remy Lebeau (TeamB)

Date Posted: 14-Jan-2015, at 12:31 PM EST
From: Remy Lebeau (TeamB)