Author Topic: My C homework (Read 19320 times)

mel_zoom · « **on:** February 24, 2007, 11:14:22 AM »

Hi!

So Ive been plodding along learning basic C. Ive been set what I am told is a fairly simple challenge but that has many possible solutions.

The challenge is to write a small program, using only the standard ANSI C headers that reads a text file - the name of which is given as a parameter on the command line - and performs a basic word count.

Stage two is to then modify the program so that it reports a breakdown of the word count into the following ranges

words of less than 4 characters
words of 4 characters
words of 5 characters
words of 6 characters
words of 7 characters
words of 8 characters
words of 9 characters
words of 10 characters
words of more than 10 characters

Stage three - which is supposed to be a lot more difficult to do well - is to create a new program which counts all the unique words in the text file.

For the first stage I thought that I could write a loop which reads the characters one at a time. It would see if the first character was whitespace or not and use it to set a "type" variable which might be 0 for whitespace and 1 for non whitespace. If it is non-whitespace it would set the word count to 1 otherwise set it to 0.

Then as it reads each character from the file it could see if it is whitespace or not. As long as the character is the same type as the previous one (that is whitespace or not) it would just keep looping. When the type of character changes it could perform an action:

If it changed from non whitespace to whitespace just change the "type" variable to reflect the change.

If it changed from whitespace to non whitespace change the "type" variable and increase the word count by one.

When the loop reaches end of file then close it and print out the word count.

I have no intention of looking at stage 2 until I got this right :-)

Is it looking valid so far?

mel_zoom · « **Reply #1 on:** February 25, 2007, 09:36:05 PM »

Well, here is the first program :-)

It seems to work fine on all the text files I gave it so far - including some "control" ones not containing anything some containing only space or 1 word and so on.

Whilst writing it I realised that I could simplify the starting state of the word counting loop by assuming we start with whitespace and no words. That way you can let the normal loop logic decide what to do with the first character read straight away.

Code: [Select]


/*
  Word counting program, stage 1
*/

#include <stdio.h>

/* prototype */
int countwords(FILE* file);

/* main program */
int main(int argc, char** argv)
{
	const char* filename = 0;
	FILE* file = 0;
	int numwords = 0;

	/* get the filename from the command line*/
	if (argc<2) {
		puts(&quot;Usage: wordcount <file name>&quot;);
		return 1;
	}

	/* try to open the file */
	filename = argv[1];
	file = fopen(filename, &quot;r&quot;);
	if (file == NULL) {
		printf(&quot;Error: couldn't open file '%s' for input\n&quot;, filename);
		return 1;
	}

	/* count and display */
	numwords = countwords(file);
	printf(&quot;Counted %d word(s) in file '%s'\n&quot;, numwords, filename);

	/* all done */
	fclose(file);
	return 0;
}

int countwords(FILE* file)
{
	/*
		Looks for whitespace <-> non whitespace changes.
		Every whitespace -> non whitespace change is considered a
		new word.
	*/
	int numwords = 0;
	int whitespace = 1;
	int character;

	character = fgetc(file);
	while(character != EOF) {
		switch (character) {
			case ' ':
			case '\t':
			case '\n':
			case '\r':
				whitespace = 1;
				break;

			default:
				if (whitespace == 1) {
					numwords++;
					whitespace = 0;
				}
				break;
		}
		character = fgetc(file);
	}
	return numwords;
}

Indentation too! - easy once you know how!

mel_zoom · « **Reply #2 on:** February 25, 2007, 09:45:55 PM »

I think I can already see how to go about solving stage 2.

If you put another variable "wordlength" initially set to 0 before the loop you could increment it in the "default" case of the switch.
Then in the whitespace part of the switch you could put an "if (whitespace==0)" test to see if we just entered the whitespace now. If that is true then whatever the current value of "wordlength" is would be the length of the last word read. You would then use this to increase the appropriate word counter and then reset it back to zero.

:idea:

mel_zoom · « **Reply #3 on:** February 26, 2007, 10:03:46 AM »

Wain:

"What I essentially mean by this is to put your switch statement into a separate function called something like is_whitespace(); to be called inside of your while-loop. Of course, you would also need to mildly restructure the implementation of your solution to do something like this."

For stage 2 it might be better. I could have a function that returns the length of the next word in the file. The switch statement could then contain all the punctuation characters to ignore in the same case range as the whitespace.

mel_zoom · « **Reply #4 on:** February 26, 2007, 08:48:43 PM »

I dont think this is as clean as it probably could be but it seems to work.

Code: [Select]


/*

  Simple word count program stage 2

  counts words in text file passed as argument
  presents summary of word lengths

*/

#include <stdio.h>
#include <ctype.h>

/* prototype */
int countwords(FILE* file, int* lengths);

/* main program */
int main(int argc, char** argv)
{
	const char* filename = 0;
	FILE* file = 0;
	int numwords = 0;
	int lengths[9] = {0};
	int n;

	/* get the filename from the command line*/
	if (argc<2) {
		puts(&quot;Usage: wordcount <file name>&quot;);
		return 1;
	}

	/* try to open the file */
	filename = argv[1];
	file = fopen(filename, &quot;r&quot;);
	if (file == NULL) {
		printf(&quot;Error: couldn't open file '%s' for input\n&quot;, filename);
		return 1;
	}

	/* count and display */
	numwords = countwords(file, lengths);
	printf(&quot;Counted a total of %d word(s) in file '%s'\n&quot;, numwords, filename);

	printf(&quot;Words <  4 chars: %d\n&quot;, lengths[0]);
	for (n=1; n<7; n++) {
		printf(&quot;Words of %d chars: %d\n&quot;, n+3, lengths[n]);
	}
	printf(&quot;Words > 10 chars: %d\n&quot;, lengths[8]);

	/* all done */
	fclose(file);
	return 0;
}

int countwords(FILE* file, int* lengths)
{
	int numwords = 0;
	int notinaword = 1;
	int length = 0;
	int character;

	while( (character = fgetc(file)) != EOF ) {
		if (isalnum(character)) {
			/* character is in a word so increment the length */
			if (notinaword) {
				notinaword = 0;
			}
			length++;
		}
		else {
			/* if we just got here then update our counters */
			if (!notinaword) {
				notinaword = 1;
				numwords++;

				/* fill appropriate length bucket */
				if (length<4) {
					/* for lengths < 4 fill bucket 0 */
					lengths[0]++;
				}
				else if (length<11) {
					/* for lengths 4...10 fill buckets 1-7*/
					lengths[length-3]++;
				} else {
					/* for lengths > 10 fill bucket 8*/
					lengths[8]++;
				}
				length = 0;
			}
		}
	}
	return numwords;
}

No switch-case either :-)

mel_zoom · « **Reply #5 on:** February 27, 2007, 07:37:03 PM »

Hmm.

I hit a snag. It seems isalnum() only returns non zero for the characters 0...9 a...z A...Z

When the source text file contains words that might have some european characters it causes the program to think one word is several.

I think perhaps the logic should be that it looks for characters that are not space and are not punctuation?

mel_zoom · « **Reply #6 on:** February 27, 2007, 07:44:37 PM »

Ok that didnt work either :-(

:idea:

Since I already know how to load a file character by character, how hard can it be to write a replacement isword() that uses a table built by reading a different text file?

If there is a table with room for all 256 codes - all set to zero at first - you could load the "character set" file and fill in a 1 for every character found in that file.

Or perhaps the file only need contain extra characters not already recognised by filling the table with the isalnum() for each character first...

mel_zoom · « **Reply #7 on:** February 27, 2007, 08:52:38 PM »

Ok here is the final version for stage 2 - Im not planning to do anything more to it for now.

I improved it by moving more of the code from main() into functions and Ive made a custom isword() thats really a macro. It behaves exactly like isalnum() except that it uses a table that you can add characters to by specifying a character set file as the second parameter on the shell.

I jumped ahead of myself a bit using a macro for that but it seemed easy enough for this exercise.

I also redid the count loop the way falemagn suggested. His modification seems obvious in hindsight. I guess I have a while to go yet :-)

Code: [Select]


/*

  Simple word count program stage 2 revised

  counts words in text file passed as argument
  presents summary of word lengths

  uses custom character list for determining words

*/

#include <stdio.h>
#include <ctype.h>

/* data */
static char chartypetable[256] = {0};

/* prototypes */
void maketypetable(const char* charset);
int countwords(const char* filename, int* lengths);
void printsummary(const char* filename, int numwords, int* lengths);

#define isword(c) (chartypetable[(c)])


/* main program */
int main(int argc, char** argv)
{
	const char* filename = 0;
	const char* charsetfilename = 0;
	int numwords = 0;
	int lengths[9] = {0};

	/* get the filename from the command line */
	if (argc<2) {
		puts(&quot;Usage: wordcount <file name> [character set]&quot;);
		return 1;
	}
	filename = argv[1];

	/* get any additional character set filename */
	if (argc>2) {
		charsetfilename = argv[2];
	}

	/* make our character table */
	maketypetable(charsetfilename);

	/* count and show summary */
	numwords = countwords(filename, lengths);
	printsummary(filename, numwords, lengths);

	return 0;
}



void printsummary(const char* filename, int numwords, int* lengths)
{
	int n;
	printf(&quot;Counted a total of %d word(s) in file '%s'\n&quot;, numwords, filename);
	printf(&quot;Words <  4 chars: %d\n&quot;, lengths[0]);
	for (n=1; n<7; n++) {
		printf(&quot;Words of %d chars: %d\n&quot;, n+3, lengths[n]);
	}
	printf(&quot;Words > 10 chars: %d\n&quot;, lengths[8]);
}



int countwords(const char* filename, int* lengths)
{
	int numwords = 0;
	int length = 0;
	int character;
	FILE*	file;

	file = fopen(filename, &quot;r&quot;);
	if (!file) {
		printf(&quot;Error: couldn't open file '%s' for input\n&quot;, filename);
		exit(1);
	}

	while( (character = getc(file)) != EOF ) {
		if (isword(character)) {
			/* character is in a word so increment the length */
			length++;
		}
		else {
			/* if we just got here then update our counters */
			if (length>0) {
				numwords++;

				/* fill a range bucket based on the word length*/
				if (length<4) {
					/* for lengths < 4 fill bucket 0 */
					lengths[0]++;
				}
				else if (length<11) {
					/* for lengths 4...10 fill buckets 1-7*/
					lengths[length-3]++;
				} else {
					/* for lengths > 10 fill bucket 8*/
					lengths[8]++;
				}
				length = 0;
			}
		}
	}

	fclose(file);
	return numwords;
}



void maketypetable(const char* charset)
{
	/*
		make own character type table
	*/
	int c;

	/* first fill with isalnum() */
	for (c=0; c<256; c++) {
		if (isalnum(c)) {
			chartypetable[c]=1;
		}
	}

	/* try to add any user defined characters too - excluding spaces */
	if (charset) {
		FILE* file = fopen(charset, &quot;r&quot;);
		if (file) {
			while ( (c = getc(file)) != EOF ) {
				if (!isspace(c)) {
					chartypetable[c] = 1;
				}
			}
			fclose(file);

			/* show the recognised &quot;isword()&quot; character set */
			printf(&quot;Added character set definition from file '%s'\nThe following characters will be treat as valid within words:\n&quot;, charset);
			for (c=0; c<256; c++) {
				if (chartypetable[c]) {
					printf(&quot;%c &quot;, c);
				}
			}
			putchar('\n');
		}
		else {
			printf(&quot;Couldn't open character set definition file '%s'\n&quot;, charset);
		}
	}
	printf(&quot;\n&quot;);
}

I think thats enough for one evening!

mel_zoom · « **Reply #8 on:** March 01, 2007, 07:09:03 PM »

guru-666:

"wow so many ppl interested in some average chicks c homework, amazing.... You guy's know that python has this kind of function built in already."

I guess the point that this is all about learning C and not about which language is better for the task is wasted on some average python progammers :-P

It is probably even easier in Perl or PHP.

mel_zoom · « **Reply #9 on:** March 01, 2007, 08:10:12 PM »

People keep saying that C (and C++) are obsolete and that some newer language does it all so much better.

What I cant understand is how people can think that when most of these languages require C in order to be compiled and maintained. How many of them can be genuinely built using only their own language?

As far as I understand Java, Perl, Python, PHP and all of the .net languages are all implemented in C at some level and cannot be built without it!

When everybody has forgotten C who will actually maintain the languages they have migrated to?

mel_zoom · « **Reply #10 on:** March 01, 2007, 10:07:39 PM »

Well Ive been trying to take in the hints from earlier about stage 3.

Its a bit confusing so far. I am guessing that the basic idea of building a list of words is no good?

I had thought of this:

Code: [Select]


struct uniqueword {
    struct uniqueword* last;
    struct uniqueord* next;
    const char* text;
    int count;
};

I think this solves the problem of not knowing how many unique words there are as you can add more uniqueword structures as you go along. I thought about having a list like this. Whenever you get a word you would walk along the list to see if you find it.

I thought that stricmp() could compare the word with the one in a structure until it finds the structure that contains the word already (or not). However it struck me that since stricmp() actually returns a value less than zero or greater than zero if one string is "less" than another or not that it could also be used to detect where in the list to add a new word. As long as the word in the structure is "less" than your current word then keep going along the list. If you then get to a word that is "greater" than your current one - assuming it didnt match anything - you could put your new word at that point in the list.

However I think I see the problem you are all hinting at. The more unique words there are the longer it takes to check if any word you are currently testing is there already. If there were very many unique words the program would run slower and slower.

Back to the drawing board?

mel_zoom · « **Reply #11 on:** March 02, 2007, 10:25:47 AM »

I have a good guru.

mel_zoom · « **Reply #12 on:** March 02, 2007, 10:56:27 AM »

How strange - I was really annoyed at the time that picture was taken as I wrestling with a particularly annoying earpiece prior to a product release event!

I thought the look of perplexed frustration would be apt for when I start getting into more realistic C code :-)

mel_zoom · « **Reply #13 on:** March 02, 2007, 11:10:15 AM »

Ok I changed the title. Better?

mel_zoom · « **Reply #14 on:** March 06, 2007, 10:20:34 PM »

Noooo! I just wrote a really long post and got logged out!

Luckily I was writing most of it in a text editor - phew!

...

Its been a few days now and I read up on malloc(). I realised that malloc doesnt actually care what you are allocating memory for so I changed the structure a little bit so that the text becomes a part of it.

Anyway, this is what I am working towards now:

Code: [Select]



int addword(const char* wordtext); /* main prototype */

struct uniqueword {
    struct uniqueword* last;
    struct uniqueord* next;
    int count;
    char wordtext[1]; /* copy of the text */
};

static struct uniqueword* firstword;

static struct uniqueword* createuniqueword(const char* wordtext)
{
    /* create a new uniqueword structure */
    int wordlength;
    struct uniqueword* unique;

    wordlength = strlen(wordtext);
    unique = (struct uniqueword*)malloc(sizeof(struct uniqueword)+wordlength);

    /* fill in data if malloc() ok */
    if (unique != NULL) {
       unique->last = 0;
       unique->prev = 0;
       unique->count = 1;
       strcpy(unique->wordtext, wordtext);
    }
    return unique;
}

int addword(const char* wordtext)
{
   /* returns 1 if the word was added or zero if it existed or -1 on failure */

   struct uniqueword* newword;
   struct uniqueword* last;
   struct uniqueword* uw;
   int worddiff;

   if (!firstword) {
      newword = createuniqueword(wordtext);
      if (newword) {
         firstword = newword;
         return 1;
      }
      else {
         return -1;
      }
   }

   for (uw = firstword; uw != NULL; uw = uw->next) {
      worddiff = stricmp(wordtext, uw->wordtext);

      /* found it! */
      if (worddiff == 0) {
         uw->count++;
         return 0;
      }

      /* as soon as word > uw->word we insert it immediately before uw */
      if (worddiff > 0) {
         newword = createuniqueword(wordtext);
         if (newword != NULL) {
             newword->prev = uw->prev;
             newword->next = uw;
             uw->prev = newword;
             newword->prev->next = newword;
             return 1;
         } else {
             return -1;
         }
      }

      /* remember the last uniqueword */
      last = uw;
   }

   /* we reached the end of the list without finding it */
   newword = createuniqueword(wordtext);
   if (newword) {
      last->next = newword;
      return 1;
   }
   return -1;
}

Im not really happy with the loop though. Im trying to keep the words sorted based on the greater than / less than return of stricmp() so that the entire list isnt always searched before deciding to add the word.

I think its quite ugly though (having to have a "last" pointer just because the normal loop finishes when uw is null) and probably broken too.

Would "(uw != NULL) && (uw->next != NULL)" in the loop be better? It looks like it would always skip the last one on the list regardless which would only mean having to check if that one matches the wordtext outside the loop which also seems a bit stupid.

Author Topic: My C homework (Read 19320 times)

mel_zoom

My C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework