On Friday, August 29, 2014 7:16:45 PM UTC+1, August Karlstrom wrote:
> On 2014-08-29 14:52, Tim Rentsch wrote:
>
> > The biggest problem is that deblank() is way longer and way more
>
> > complicated than it needs to be. All your other questions go
>
> > away if deblank() is coded more simply:
>
> >
>
> > void
>
> > deblank( char string[] ){
>
> > char c, *p = string, *q = p;
>
> >
>
> > while( c = *p++ = *q++ ){
>
> > if( c == ' ' || c == '\t' ) q += strspn( q, " \t" );
>
> > }
>
> > }
>
> >
>
>
>
> Here is a less terse (KISS style) version of the same algorithm:
>
>
>
> void deblank(char s[])
>
> {
>
> int i, j;
>
>
>
> i = 0;
>
> j = 0;
>
> while (s[i] != '\0') {
>
> if ((s[i] == ' ') || (s[i] == '\t')) {
>
> j += 1 + strspn(s + j + 1, " \t");
>
> } else {
>
> j++;
>
> }
>
> i++;
>
> s[i] = s[j];
>
> }
>
> }
>
>
>
>
>
> -- August
We can write it without the function call without too much complexity
void detab(char *str)
{
int i = 0;
int j = 0;
while(str[i])
{
str[j] = str[i++];
if(str[j] == ' ' || str[j] == '\t')
{
while(str[i] == ' ' || str[i] == '\t')
i++;
}
j++;
}
str[j] = 0;
}
The fiddle is that the first character of the run has to be included,
so we need to test on j, then increment j separately.
Macros to test for space or tab are of rather dubious stylistic benefit,
it does make clear that you are deliberately excluding other whitespace,
but you're adding a symbol.
The function has no dependencies, you can cut and paste and run it
in any environment that supports C, including ones with vandalised
standard string libraries.
"BartC" writes:
| That seems strange. If strlen() is applied to the buffer filled in by
| fgets(), then the buffer always has a terminator appended, so it can't
| be running off the end of it.
|
| How many strlen()s are executed before it goes wrong? What is the
| result of those strlens() (should be 8191 on your system)?
I seem to have found something like a heisenbug. A shell transcript:
$ ./deblank < million-spaces
strlen returns 8191
strlen returns 8191
[repeat 119 times]
strlen returns 8191
strlen returns 699
$ ./deblank < million-spaces
strlen returns 8191
Segmentation fault
$ ./deblank < million-spaces
strlen returns 8191
Segmentation fault
$ ./deblank < million-spaces
strlen returns 8191
Segmentation fault
$ ./deblank < million-spaces
strlen returns 8191
Segmentation fault
$ ./deblank < million-spaces
strlen returns 8191
Segmentation fault
$ ./deblank < million-spaces
strlen returns 8191
strlen returns 8191
[repeat 119 times]
strlen returns 8191
strlen returns 699
$
There are a total of 122 calls to strlen() that return 8191. That adds
up to 999302 characters. The last strlen() gives another 699, bringing
the quantity to 1000001 characters.
I used this source:
/* deblank.c begins */
/* Remove runs of excess spaces and tabs from standard input.
* Version 2
*/
#include
#include
#include
void deblank (char string []);
int main (void)
{
char buffer [BUFSIZ];
while (fgets (buffer, sizeof (buffer), stdin) != NULL) {
deblank (buffer);
printf ("%s", buffer);
}
return 0;
}
void deblank (char string [])
{
int length_string = strlen (string);
int excess_ws;
int i, j;
printf ("strlen returns %d\n", length_string);
for (i = 0; i < length_string; i++) {
if (isblank (string [i])) {
excess_ws = 0;
for (j = i + 1; isblank (string [j]); j++)
excess_ws++;
for (j = i + 1; string [j]; j++)
string [j] = string [j + excess_ws];
}
}
}
/* deblank.c ends */
I compiled it with:
gcc -Wall -Wextra -g -O0 -std=c99 -pedantic -o deblank deblank.c
--
Udyant Wig
GitHub: https://github.com/udyant
Poetry: http://www.writing.com/main/profile/biography/frosthrone
G G writes:
> just a question:
>
> why did you or why would one not place the #define after the
> #include? if it will be in the same file...
>
> is the thinking to define it close to where it will be used...
I am sure that was the thinking. In general, it's better to avoid doing
this, but moving this particular macros is not a good idea because it
refers to a name in scope only where it is used. Now that's also a bad
idea, but since it does so, defining in where it's used is better (and
I'd un-define it after the block in question).
The best solution is, in my opinion, to change both: make the names it
depends on macro parameters, and then move it to where most people
expect macros to reside.
For reference:
> #define IS_SPACETAB(index) (string [(index)] == ' ' || string [(index)] == '\t')
--
Ben.
On 2014-08-29 14:52, Tim Rentsch wrote:
> The biggest problem is that deblank() is way longer and way more
> complicated than it needs to be. All your other questions go
> away if deblank() is coded more simply:
>
> void
> deblank( char string[] ){
> char c, *p = string, *q = p;
>
> while( c = *p++ = *q++ ){
> if( c == ' ' || c == '\t' ) q += strspn( q, " \t" );
> }
> }
>
Here is a less terse (KISS style) version of the same algorithm:
void deblank(char s[])
{
int i, j;
i = 0;
j = 0;
while (s[i] != '\0') {
if ((s[i] == ' ') || (s[i] == '\t')) {
j += 1 + strspn(s + j + 1, " \t");
} else {
j++;
}
i++;
s[i] = s[j];
}
}
-- August
just a question:
why did you or why would one not place the #define after the #include? if it will be in the same file...
is the thinking to define it close to where it will be used...
sorry this question is kind of off point, i'm sure.
the following program written by Udyant Wig. my poor one and a half cents on formatting.
/* deblank.c begins */
/* Remove runs of excess spaces and tabs from standard input. */
#include
#include
#include
#define IS_SPACETAB(index) (string [(index)] == ' ' || string [(index)] == '\t')
void deblank (char string []);
int main (void)
{
char buffer [BUFSIZ];
while (fgets (buffer, sizeof (buffer), stdin) != NULL)
{
deblank (buffer);
printf ("%s", buffer);
} /* end while (fgets...) */
exit (EXIT_SUCCESS);
} /* end main(void) */
void deblank (char string [])
{
int length_string = strlen (string);
int excess_ws;
int i;
int j;
for (i = 0; i < length_string; i++)
{
if ( IS_SPACETAB(i) )
{
excess_ws = 0;
for (j = i + 1; IS_SPACETAB(j); j++)
excess_ws++;
for (j = i + 1; string [j]; j++)
string [j] = string [j + excess_ws];
} /* end if ( IS_SPACETAB ...) */
} /* end for (i = 0; ...) */
} /* end void deblank(...) */
Udyant Wig wrote:
> I was reading the Linux Kernel Coding Style document found, among other
> places, at
>
> https://www.kernel.org/doc/Documentation/CodingStyle
>
> It was cited by various people as being good even for non-kernel C
> programming. I tried some of it to see how it worked for me. I quote
> from the parts pertaining to indentation:
>
> Tabs are 8 characters, and thus indentations are also 8
> characters. There are heretic movements that try to make
> indentations 4 (or even 2!) characters deep, and that is akin
> to trying to define the value of PI to be 3.
By the same argument, tabs are whatever the editor defines them to be, and
thus indentations are whatever the editor defines them to be. There are
heretic movements that try to make indentations exactly 8 characters wide
(note: not deep!), and that is akin to trying to define everybody's bank
balance to be exactly £8.
It's a lousy analogy to justify a lousy position. Pick a layout style and
stick to it. I prefer two characters, with the editor replacing tab
keypresses with spaces. Your mileage may vary (YMMV). Anyone who tries to
bully you into a particular layout style just "because I say so" should be
quietly ignored. Cultural diversity is a marvellous thing.
That is not to say that there cannot be arguments for and against particular
settings.
I always thought 8 was too wide, so I switched to 4. That was okay, but it
still seemed a bit wide, so I tried 3. That was ghastly. It's hard to say
why. I think the fact that it wasn't a power of 2 had a lot to do with it!
(Why that should matter to me, I don't know.) But 2 worked out very well
indeed. For me. As I said, YMMV.
>
> Rationale: The whole idea behind indentation is to clearly
> define where a block of control starts and ends. Especially
> when you've been looking at your screen for 20 straight hours,
> you'll find it a lot easier to see how the indentation works if
> you have large indentations.
That, at least, is an attempt at an argument. It has not, however, been my
experience that a narrower indent has any effect on readability (except in
the extreme case, 0, and the nearly-extreme case, 1).
> Now, some people will claim that having 8-character indentations
> makes the code move too far to the right, and makes it hard to
> read on a 80-character terminal screen.
They will indeed. And they are right.
> The answer to that is
> that if you need more than 3 levels of indentation, you're
> screwed anyway, and should fix your program.
Was this document written by Malcolm McLean, by any chance? :-)
Three levels of indentation is really, truly not a problem. By the time you
get to seven (plus or minus a couple), you might want to start about
refactoring, but three is chickenfeed.
>
> In short, 8-char indents make things easier to read, and have
> the added benefit of warning you when you're nesting your
> functions too deep. Heed that warning.
Not true, not true, and don't bother (in that order).
--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within
On 2014-08-29, Udyant Wig wrote:
> raltbos@xs4all.nl (Richard Bos) writes:
>| More specifically, why don't you just use 's isblank()!?
>
> Do pardon this digression. There has been a curious sequence of
> incidents.
>
> I mentioned the book I am reading: /Pointers on C/ by Kenneth Reek. It
> covers ISO C 1990. Until I had come across the problem in the book,
> there had been no mention of the isblank() function. But compiling my
> program with it worked. Then, I checked the GLIBC manual for
> isblank(), which says:
>
> -- Function: int isblank (int C)
> Returns true if C is a blank character; that is, a space or a tab.
> This function was originally a GNU extension, but was added in
> ISO C99.
>
> But, compiling with this command,
>
> gcc -std=c90 -pedantic -o deblank deblank.c
>
> worked perfectly. Did I miss something?
Yes: you missed, firstly, that GCC is not a full implementation of any dialect
of ISO C; it does not implement the majority of the Library part.
Whether or not -std=c90 has any bearing on the library depends on how
the library responds to special preprocessor symbols from GCC.
You did not enable any diagnostics: a good start would be -Wall. With
-Wall, gcc will give you a warning about implicit function declarations. When
you use -std=c90 (or, equivalently, -ansi), perhaps your library header ctype.h
does in fact hide the isblank function. But the program can still call it and
no diagnostic is required. Even if the identifier is not declared, it is still
available for linkage. With -Wall you can be alerted to the fact that ctype.h
did not provide a declaration for isblank.
You're also missing that the C90 standard does not say that if a program calls
an external function called isblank (or any other function which is not in the
C90 language), and the program itself does not provide a translation unit which
defines taht function, that linking and executing the program must fail.
The behavior is in fact undefined; such a program may well succeed,
and behave in a documented manner, characteristic of the implementation
(one of the explicitly listed possibilities for "undefined behavior").
Your implementation happens provides a documented isblank function in the
library, which you can call whether you use a declaration in the ctype.h
header, an implicit declaration, or your own declaration. The external
name exists no matter what language dialect you select from gcc.
Your program is no different from something like this:
#include
int main(void)
{
puts(getpass("foo"));
return 0;
}
You can also compile this with and have it it work on many systems, even though
the getpass function isn't in any C standard. On some other systems, it
will bail with "unresolved reference to getpass" or some such message.
And on some other system still, it might crash because there is a getpass
symbol, but it's a different kind of function, or perhaps a data object
like an array of characters. Just like on some C90 systems, a program which
calls isblank will not link: because they do not have the GNU extension,
nor C99.
From the perspective of C90, isblank and getpass are the same category of
thing: library extensions to the C language. From the C99 perspective,
isblank is a language feature. Either way, it is an external name in your
environment that you can link to; historic versions of your environment
had the name before C99 as an extension, and now they have it as a matter of
conformance.
On 08/29/2014 09:55 AM, Stefan Ram wrote:
> ram@zedat.fu-berlin.de (Stefan Ram) writes:
>> #include
>> #include
>> int main( void )
>> { int c; int spacemode = 0;
>> while( ( c = getchar() )!= EOF )if( spacemode )
>> { if( !isspace( c ))
>> { putchar( ' ' ); putchar( c ); spacemode = 0; }} else
>> if( isspace( c ))spacemode = 1; else putchar( c );
>> if( spacemode )putchar( ' ' ); }
>
> The above code calls »isspace« without checking that c is
> representable as an unsigned char value!
You're right - strictly speaking, such a check is needed - but only in
the unlikely case that UCHAR_MAX > INT_MAX. Otherwise, any value
returned by getchar() automatically satisfies the requirements for an
argument of the is*() functions.
Udyant Wig writes:
> "BartC" writes:
> | (BTW, when tested on a file consisting of a million spaces (no
> | newlines), your version produced an output file of 1957 spaces. RH's
> | version correctly resulted in a file containing a single space.)
>
> I tried just that and it blew up with a segmentation fault. I then ran
> my program under valgrind. The output from valgrind said that strlen()
> tried to access invalid memory.
Great that you are using valgrind; it's often a big help. But I would
not worry about the details of the message. The general rule is to fix the
bugs you know. One has been pointed (sorry, I forget who by) that
causes a buffer overrun. Fix that first, because bugs that can cause all
sorts of collateral damage.
--
Ben.
raltbos@xs4all.nl (Richard Bos) writes:
| More specifically, why don't you just use 's isblank()!?
Do pardon this digression. There has been a curious sequence of
incidents.
I mentioned the book I am reading: /Pointers on C/ by Kenneth Reek. It
covers ISO C 1990. Until I had come across the problem in the book,
there had been no mention of the isblank() function. But compiling my
program with it worked. Then, I checked the GLIBC manual for
isblank(), which says:
-- Function: int isblank (int C)
Returns true if C is a blank character; that is, a space or a tab.
This function was originally a GNU extension, but was added in
ISO C99.
But, compiling with this command,
gcc -std=c90 -pedantic -o deblank deblank.c
worked perfectly. Did I miss something?
In sum, isblank() had not been covered up till where I was in the book
on C90 I had been reading.
As an aside, looking up isblank() in the index to Harbison and Steele,
5th ed. failed, i.e., it was not listed in the index. But, it is
covered on page 341.
| Indentations of one whole tab are excessive and make the code hard to
| read. Four characters is acceptable. Personally, I prefer the look of
| two. One is certainly too little, three is idiosyncratic but OK, more
| than four is unnecessary, eight is daftly and illegibly large, and an
| actual tab character (especially, but not exclusively, when mixed with
| spaces) is The Wrong Thing.
| (Besides, I often use more than four for "this is a continuation of the
| previous line, not an indent". See below. IMO, this makes the two
| situations easier to distinguish. This is also why Python is Evil.)
I was reading the Linux Kernel Coding Style document found, among other
places, at
https://www.kernel.org/doc/Documentation/CodingStyle
It was cited by various people as being good even for non-kernel C
programming. I tried some of it to see how it worked for me. I quote
from the parts pertaining to indentation:
Tabs are 8 characters, and thus indentations are also 8
characters. There are heretic movements that try to make
indentations 4 (or even 2!) characters deep, and that is akin
to trying to define the value of PI to be 3.
Rationale: The whole idea behind indentation is to clearly
define where a block of control starts and ends. Especially
when you've been looking at your screen for 20 straight hours,
you'll find it a lot easier to see how the indentation works if
you have large indentations.
Now, some people will claim that having 8-character indentations
makes the code move too far to the right, and makes it hard to
read on a 80-character terminal screen. The answer to that is
that if you need more than 3 levels of indentation, you're
screwed anyway, and should fix your program.
In short, 8-char indents make things easier to read, and have
the added benefit of warning you when you're nesting your
functions too deep. Heed that warning.
| There's another way: don't keep copying the whole string, just keep
| another pointer at the "next to be copied" character and copy from
| there only once. In logic:
|
| Set pointer to start of destination and one to start of source
| Set space flag to false
| Loop until source runs out of characters:
| If not space flag:
| Copy one character from source to destination
| Increase destination pointer
| If character is space:
| Set space flag
| If character is not space:
| Unset space flag
| (Or, in idiomatic C rather than a logical schema:
| set space flag to "isspace(*source)"
| or if you wish to "isspace(source[sourceindex]")
| Increase source pointer
|
| This is more time-efficient, but it requires both a source and a
| destination buffer. This has the extra advantage that it keeps the
| source buffer intact (useful in many cases, required in some), but the
| disadvantage of using more buffer memory (though at least you know
| that the dest buffer need never be larger than the source), and the
| major disadvantage that your program only has the one buffer right
| now...
I'll see if can work this into an implementation.
--
Udyant Wig
GitHub: https://github.com/udyant
Poetry: http://www.writing.com/main/profile/biography/frosthrone