I have a message that is obviously malformed and cannot be decoded properly (although Outlook decodes SUBJECT, FROM and TO fine), but in Indy it causes an infinite loop. I'm OK with the message not being decoded well because it is malformed (even though the subject and from lines could probably be decoded well enough), but it should not send the decoder into an infinite loop. I have Indy 5231, which is not the latest, but it is possible that this is a problem that has not been addressed yet. Here goes:
{code}
Subject: abc, =?UTF-8?B?TmVl?= =?UTF-8?B?ZCBDYQ==?= =?UTF-8?B?c2ggUXU=?= =?UTF-8?B?aWNrIEdl?= =?UTF-8?B?dCB1cHRvIFVT?= =?UTF-8?B?RDEw?= =?UTF-8?B?MDAgTm8=?= =?UTF-8?B?dw==?=
From: =?UTF-8?B?TmU=?= =?UTF-8?B?eHRQ?= =?UTF-8?B?YXlk?= =?UTF-8?B?YXlBZHY=?= =?UTF-8?B?YW5j?= =?UTF-8?B?ZQ==?=
To: "549b69b9a719d"abc@def.com.
Content-Type: text/html; charset= CP1026
msg content
{code}
I have tried some online decoders, which also decode the subject and from lines fine. As for the TO: line, Outlook decodes it as
{code}
"549b69b9a719d" as sender name and
"abc@def.com." as email address
{code}
And Outlook Express decodes the same as:
{code}
"549b69b9a719d" as sender name and
"abc@def.com" as email address (no last dot)
{code}
What Indy does I cannot say because it loops forever.
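For reference, the Subject and From headers in the sample use RFC 2047 encoded-words (`=?charset?B?base64?=`), and whitespace between adjacent encoded-words is dropped when they are decoded. The following is a minimal sketch using Python's standard `email.header` module. It is not Indy code; the helper name `decode_rfc2047` is made up for illustration, and the sketch only shows what a decoder is expected to produce for these two headers.

```python
from email.header import decode_header, make_header

# Header values copied from the sample message in the question.
RAW_SUBJECT = (
    "abc, =?UTF-8?B?TmVl?= =?UTF-8?B?ZCBDYQ==?= =?UTF-8?B?c2ggUXU=?= "
    "=?UTF-8?B?aWNrIEdl?= =?UTF-8?B?dCB1cHRvIFVT?= =?UTF-8?B?RDEw?= "
    "=?UTF-8?B?MDAgTm8=?= =?UTF-8?B?dw==?="
)
RAW_FROM = (
    "=?UTF-8?B?TmU=?= =?UTF-8?B?eHRQ?= =?UTF-8?B?YXlk?= "
    "=?UTF-8?B?YXlBZHY=?= =?UTF-8?B?YW5j?= =?UTF-8?B?ZQ==?="
)


def decode_rfc2047(value: str) -> str:
    """Decode all encoded-words in a header value and join the pieces."""
    return str(make_header(decode_header(value)))


if __name__ == "__main__":
    print(decode_rfc2047(RAW_SUBJECT))
    print(decode_rfc2047(RAW_FROM))
```

Note that these headers are decoded with the charset named inside each encoded-word (UTF-8), independently of the `charset=` parameter on the Content-Type line.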
*An update:*
After testing this more I found that Indy does indeed manage to decode the message. However, it takes a huge amount of time to do so (a few minutes), and during that time the memory usage of the process *increases*! Why exactly I cannot tell; the two programs above do it instantly. After it finally decodes the message (in about 5 or 10 minutes on my PC), memory usage is back to what it was before, so I do not think it is a memory leak. The subject, from and to lines look OK. Sometimes it may end up with an "out of memory" error, though, if left for a long time.
Date Posted: 1-Jan-2015, at 8:07 AM EST
From: John May
Re: a message causing Indy mesasage decoder to enter infinit
> due to the bugs I mentioned above. I have to figure out how to get around
> that, without breaking Indy in the process.
OK, thank you for your work on this. Please read your private messages.
Date Posted: 5-Jan-2015, at 10:57 AM EST
From: John May
Re: a message causing Indy mesasage decoder to enter infinit
John wrote:
> ReadLn() is supposed to read a line right?
It reads until the specified terminator has been read, where LF is the default
terminator, thus ReadLn() effectively reads a line by default, yes.
> It looks for $0a as line end.
That is the intent. However, ReadLn() encodes the specified terminator to
a byte array using the specified byte encoding so it matches the rest of
the data belonging to the line being read. ReadLn() searches the raw data
of the IOHandler.InputBuffer looking for the encoded terminator. ReadLn()
does not give any other special considerations to the encoded terminator,
so it does not enforce the terminator being in a specific encoding, such
as LF always being encoded as $0A. In the majority of charsets you are likely
to encounter in the wild, a LF will always encode to byte $0A as most charsets
are ASCII compatible for characters #00-#127. But it turns out that cp1026
(and probably other EBCDIC-based charsets) does not do that. Which is why
ReadLn() is getting stuck.
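The ASCII-compatibility point can be checked directly. The sketch below uses Python's built-in codecs rather than Indy's IIdTextEncoding, so it assumes that Python's `cp1026` codec matches the codepage Indy uses; the function name `ascii_mismatches` is hypothetical. It lists which of the characters #0-#127 do not encode to the byte with the same value.

```python
def ascii_mismatches(encoding: str) -> list:
    """Return the code points 0..127 that `encoding` does not map to the same single byte."""
    mismatches = []
    for code_point in range(128):
        try:
            encoded = chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            encoded = b""  # character is not representable at all
        if encoded != bytes([code_point]):
            mismatches.append(code_point)
    return mismatches


if __name__ == "__main__":
    for name in ("utf-8", "latin-1", "cp1252", "cp1026"):
        bad = ascii_mismatches(name)
        status = "ASCII-compatible" if not bad else f"{len(bad)} mismatches"
        print(f"{name}: {status}, LF encodes to {chr(10).encode(name)!r}")
```

A check like this could be used to decide whether a charset is safe to apply to protocol syntax such as line terminators, as opposed to message text.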
> How about adding that it gives up and delivers a single line if it reaches
> end of file?
ReadLn() has no concept of files or streams, so it can't detect end-of-file.
That is not its job anyway. That is the job of lower-level code to handle.
That being said, in the case of loading a file/stream into a TIdMessage,
Indy actually does detect end-of-file and stop reading. The problem with
that in this particular situation is that cp1026 exposed some bugs I found
in the internal parsing:
1) TIdIOHandlerStreamMsg.Readable() returns False instead of True when EOF
is reached, causing ReadLn() to return with ReadLnTimedOut=True instead of
detecting a "disconnect" condition so the caller knows that no more data
can be read.
2) TIdMessageClient (which TIdMessage uses internally) does not handle the
case where ReadLn() "times out", so it just keeps reading expecting more
data, thus gets stuck in a timeout loop. A lot of code (and not just Indy
code, but end user code as well) that uses ReadLn() tends not to be timeout-aware,
as that requires checking the TIdIOHandler.ReadLnTimedOut property when ReadLn()
exits. There is a feature request in Indy's issue trackers to add a new
ReadLnTimeoutAction property to TIdIOHandler so ReadLn() can raise an exception
when a timeout occurs, which is more in line with how Indy normally handles
errors.
3) There is also the issue that TIdMessageClient is expecting to read the
end-of-email terminator, which TIdIOHandlerStreamMsg synthesizes when EOF
is reached, but cp1026 is messing up that terminator before TIdMessageClient
sees it, so TIdMessageClient does not know that EOF has been reached and
keeps reading. That goes back to the timeout bug above.
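The difference between a timeout-aware read loop and one that keeps reading (point 2 above) can be sketched with a toy model. This is Python, not Indy code: `ToyLineReader`, its `timed_out` flag (standing in for ReadLnTimedOut) and `read_body` are all hypothetical names, and the model only assumes that a read which finds no terminator returns an empty line with the flag set.

```python
class ToyLineReader:
    """Hypothetical stand-in for an IOHandler that reads from a fixed buffer."""

    def __init__(self, data: bytes):
        self._data = data
        self.timed_out = False  # plays the role of ReadLnTimedOut

    def read_line(self) -> str:
        pos = self._data.find(b"\n")
        if pos < 0:
            # No terminator available: report a timeout instead of a line.
            self.timed_out = True
            return ""
        self.timed_out = False
        line, self._data = self._data[:pos], self._data[pos + 1:]
        return line.decode("ascii").rstrip("\r")


def read_body(reader: ToyLineReader) -> list:
    """Collect lines up to the lone '.' terminator, failing fast on a timeout."""
    lines = []
    while True:
        line = reader.read_line()
        if reader.timed_out:
            # A loop that skips this check would spin here forever.
            raise TimeoutError("no line terminator and no end-of-message marker")
        if line == ".":
            return lines
        lines.append(line)
```

With the check in place, input that never yields the '.' line ends in an exception rather than an endless loop, which is roughly what the requested ReadLnTimeoutAction behavior would give callers.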
> The message would still be decoded incorrectly but it would not get into
> infinite loop,
> which is not a preferable situation for any software.
Actually, I think it would still get stuck, and I explained why earlier.
It is not just the line break detection at fault, there are other factors
at play.
> The message is not decoded properly in email clients as well, that is
> not the problem
> but they do not freeze like Indy.
They are likely not affected by cp1026 messing up their line break and EOF
detection logic.
> They probably look until the end of file if nothing is found they give
> up and deliver what they
> can. So why not looking for end of file (or file buffer) and if ReadLn()
> finds it, it stops there?
That is not the problem. The problem is that Indy *is* already detecting
EOF and stops reading, but the email parser doesn't know that EOF was reached
due to the bugs I mentioned above. I have to figure out how to get around
that, without breaking Indy in the process.
--
Remy Lebeau (TeamB)
Date Posted: 5-Jan-2015, at 9:54 AM EST
From: Remy Lebeau (TeamB)
Re: a message causing Indy mesasage decoder to enter infinit
> Remy Lebeau (TeamB) wrote:
> #$0A when converting bytes to characters). Thus, ReadLn() gets stuck in
> an endless loop waiting for byte $25, which it will never see.
ReadLn() is supposed to read a line, right? It looks for $0a as the line end. How about making it give up and deliver a single line if it reaches end of file? The message would still be decoded incorrectly, but it would not get into an infinite loop, which is not a desirable situation for any software. The message is not decoded properly in email clients either; that is not the problem, but they do not freeze like Indy. They probably look until the end of the file, and if nothing is found they give up and deliver what they can. So why not look for end of file (or file buffer), and if ReadLn() finds it, stop there? In other words, have 2 terminators for ReadLn: $0a or end of file, whichever comes first.
Date Posted: 3-Jan-2015, at 6:14 AM EST
From: John May
Re: a message causing Indy mesasage decoder to enter infinit
Remy wrote:
> The main problem is that Indy is using cp1026 to process syntax
> elements that should not be processed using the email's charset.
> Changing that would require a rewrite of Indy's parser, and that is
> not going to happen in Indy 10. Maybe this will be addressed in Indy
> 11.
I have opened tickets in Indy's issue trackers for this problem.
--
Remy Lebeau (TeamB)
Date Posted: 2-Jan-2015, at 3:13 PM EST
From: Remy Lebeau (TeamB)
Re: a message causing Indy mesasage decoder to enter infinit
John wrote:
> Content-Type: text/html; charset= CP1026
Charset cp1026 is the cause of your problem. As soon as I remove that, the
message decodes just fine.
After Indy has read the headers and attempts to read the 'msg content' line,
it calls TIdIOHandler.ReadLn() with the ATerminator parameter set to character
#10 (LF). ReadLn() is being passed an IIdTextEncoding that represents codepage
1026, which is converting the LF into byte $25 instead of the expected byte
$0A (it also converts byte $0A into character #$8E instead of the expected
#$0A when converting bytes to characters). Thus, ReadLn() gets stuck in
an endless loop waiting for byte $25, which it will never see.
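The terminator mismatch can be reproduced outside Indy. The sketch below is Python and assumes that Python's built-in `cp1026` codec agrees with the codepage Indy was given; `find_terminator` is a hypothetical helper that mimics the idea of encoding the terminator and searching the raw buffer for it, not ReadLn()'s actual implementation.

```python
def find_terminator(buffer: bytes, encoding: str, terminator: str = "\n") -> int:
    """Return the index of the encoded terminator in `buffer`, or -1 if absent."""
    encoded = terminator.encode(encoding)
    return buffer.find(encoded)


# Raw bytes of the body line as they would sit in the input buffer.
RAW = b"msg content\r\n"

if __name__ == "__main__":
    print("\n".encode("cp1026"))              # b'%', i.e. byte $25
    print(repr(b"\x0a".decode("cp1026")))     # '\x8e'
    print(find_terminator(RAW, "us-ascii"))   # 12
    print(find_terminator(RAW, "cp1026"))     # -1: byte $25 never appears
```

Because the search for $25 never succeeds on ASCII-encoded data, a reader built this way has nothing to return until some other condition (timeout, disconnect, end of data) stops it.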
Even if I were to change Indy to force ReadLn() to look for $0A instead of
$25 when ATerminator=LF, TIdMessage would still not decode the email correctly
when using cp1026. When the email terminator '.' is read (Indy synthesizes
it because your email is missing it), cp1026 converts it to character #6
instead of '.', so it does not match the terminator that TIdMessage is expecting,
which will cause more blockage issues.
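The same effect on the '.' terminator can be shown with a short Python sketch, again assuming Python's `cp1026` codec matches the codepage in use. The helper `is_end_of_message` is hypothetical and only models "decode the raw line, then compare it with '.'".

```python
def is_end_of_message(raw_line: bytes, encoding: str) -> bool:
    """Decode a raw line with `encoding` and compare it to the '.' terminator."""
    return raw_line.decode(encoding) == "."


if __name__ == "__main__":
    print(is_end_of_message(b".", "us-ascii"))  # True
    print(is_end_of_message(b".", "cp1026"))    # False
    print(repr(b".".decode("cp1026")))          # '\x06', i.e. character #6
```

Comparing the raw byte $2E before any charset decoding would sidestep the mismatch, which is in line with the point that syntax elements should not be run through the email's charset.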
The main problem is that Indy is using cp1026 to process syntax elements
that should not be processed using the email's charset. Changing that would
require a rewrite of Indy's parser, and that is not going to happen in Indy
10. Maybe this will be addressed in Indy 11.
--
Remy Lebeau (TeamB)