Amiga.org

Operating System Specific Discussions => Amiga OS => Amiga OS -- Development => Topic started by: mel_zoom on February 24, 2007, 11:14:22 AM

Title: My C homework
Post by: mel_zoom on February 24, 2007, 11:14:22 AM
Hi!

So Ive been plodding along learning basic C. Ive been set what I am told is a fairly simple challenge but that has many possible solutions.

The challenge is to write a small program, using only the standard ANSI C headers that reads a text file - the name of which is given as a parameter on the command line - and performs a basic word count.

Stage two is to then modify the program so that it reports a breakdown of the word count into the following ranges

words of less than 4 characters
words of 4 characters
words of 5 characters
words of 6 characters
words of 7 characters
words of 8 characters
words of 9 characters
words of 10 characters
words of more than 10 characters

Stage three - which is supposed to be a lot more difficult to do well - is to create a new program which counts all the unique words in the text file.


For the first stage I thought that I could write a loop which reads the characters one at a time. It would see if the first character was whitespace or not and use it to set a "type" variable which might be 0 for whitespace and 1 for non whitespace. If it is non-whitespace it would set the word count to 1 otherwise set it to 0.

Then as it reads each character from the file it could see if it is whitespace or not. As long as the character is the same type as the previous one (that is whitespace or not) it would just keep looping. When the type of character changes it could perform an action:

If it changed from non whitespace to whitespace just change the "type" variable to reflect the change.

If it changed from whitespace to non whitespace change the "type" variable and increase the word count by one.

When the loop reaches end of file then close it and print out the word count.

I have no intention of looking at stage 2 until I got this right :-)

Is it looking valid so far?
Title: Re: Mel's C homework
Post by: Hodgkinson on February 24, 2007, 11:37:59 AM
Déjà Vu…

I don’t suppose this is to help with some kind of statistics/maths assignment, by any chance?

Hodgkinson.
Title: Re: Mel's C homework
Post by: Speelgoedmannetje on February 24, 2007, 11:38:14 AM
The lazy bum in me says regex :lol:
Title: Re: Mel's C homework
Post by: Cymric on February 24, 2007, 11:50:16 AM
Quote
Is it looking valid so far?

Only one way to find out, right?  :-P
Title: Re: Mel's C homework
Post by: mgerics on February 24, 2007, 12:12:04 PM
Sonds like an algorithm! Be careful at the end of file, though - treat end of filelike a non-whitespace so the algorithm will count the last word:
e.g. <> represents a whitespace, (eof) is end of file
...<>last<>word(eof) <- treat eof like a non-whitepsace so 'word' gets counted...
...<>last<>word<>(eof) <- the last whitepsace will count 'word'
Have fun programming.
Title: Re: Mel's C homework
Post by: amiga_3k on February 24, 2007, 01:02:37 PM
Don't forget to filter line-breaks.
Title: Re: Mel's C homework
Post by: Karlos on February 24, 2007, 01:10:07 PM
@Mel

As an algorithm it looks fine. Mgerics has a point, however. Depending on your main loop, the end of file character might be intercepted as the start of a new word should there be any whitespace at the end of the file. Look out for that when designing your loop.

As I said, the objective in this case isn't to see how well you can factor the code into lots of utility functions, it is an exercise in solving a problem. You can probably manage the entire thing within main() relatively well.

The second exercise is more likely to benefit from factoring into functions and you should give some thought to that.

The third exercise is for once you have an understanding of structures and memory allocation / deallocation.

Hint: have a structure that at least consists of a char* for the word, a count value. You will almost certainly need to factor that code into functions that perform specific actions on these structures, such initializing them (and allocating space to store the word) and finalizing them (deallocating the space used to store the word). You will need to manage a collection of structures, one for each unique word.

A second hint: comparing integer values is much, much faster than comparing strings with strcmp()/stricmp(). Consider extending your structure to contain a second integer that is some value computed from the word itself (look up hash functions) and base your comparisons on that. Who knows, you might even take that to its logical conclusion and actually implement your collection of word count structures as a hash map ;-)

That's some time off yet though.
Title: Re: Mel's C homework
Post by: Piru on February 24, 2007, 02:20:54 PM
Also, there are various things to consider for "good" implementation of the stage three:

- It should not be limited to some fixed size maximum word length.
- It should not get exponentally slower when number of different words grows.

However, these are points for the advanced class. I have such implementation ready and I'll provide it later.
Title: Re: Mel's C homework
Post by: Karlos on February 24, 2007, 02:27:43 PM
Quote

Piru wrote:
- It should not be limited to some fixed size maximum word length.
- It should not get exponentally slower when number of different words grows.


A hash map satisfies the second point nicely. That would be trivial to implement in C++/STL but it requires some thinking about to do from first principles in C. But then, that is the aim of the exercise.
Certainly you don't want to be iterating over the entire list of known words for each word read.
Title: Re: Mel's C homework
Post by: Piru on February 24, 2007, 02:35:18 PM
@Karlos
Quote
A hash map satisfies the second point nicely. That would be trivial to implement in C++/STL but it requires some thinking about to do from first principles in C. But then, that is the aim of the exercise.
Certainly you don't want to be iterating over the entire list of known words for each word read.

True. There's IMO a better data structure for this particular case: a binary tree. Hash tables are boring... ;-)
Title: Re: Mel's C homework
Post by: falemagn on February 24, 2007, 02:50:39 PM
Quote

Piru wrote:
@Karlos
Quote
A hash map satisfies the second point nicely. That would be trivial to implement in C++/STL but it requires some thinking about to do from first principles in C. But then, that is the aim of the exercise.
Certainly you don't want to be iterating over the entire list of known words for each word read.

True. There's IMO a better data structure for this particular case: a binary tree. Hash tables are boring... ;-)


Or even better, a trie (http://en.wikipedia.org/wiki/Trie).
Title: Re: Mel's C homework
Post by: Piru on February 24, 2007, 02:56:08 PM
Ah indeed, this would be the perfect use for a trie ;-)
Title: Re: Mel's C homework
Post by: Karlos on February 24, 2007, 03:00:19 PM
True enough, there are lots of containers that would cope with this scenario.

Never underestimate the raw speed of the basic hash map though. It so happens I have to do a lot of lookup stuff at work and you'd be surprised how much slower the binary tree search was for this class of application ;-)

Of course, it depends on your hash function. If it takes longer to compute than it does to iterate n levels down the tree (or trie), chances are it will be slower. But that depends on many factors, such as data locality, cost of multiple pointer accesses and so on. Basically all beyond the scope of the exercise :-D

The trie, however would be an excellent choice for this (I use this algorithm, slightly modified, to find the best match prefix for a telephone number in order to determine which price band it belongs to).
Title: Re: Mel's C homework
Post by: falemagn on February 24, 2007, 03:10:05 PM
Quote

Of course, it depends on your hash function. If it takes longer to compute than it does to iterate n levels down the tree (or trie), chances are it will be slower.


Unless you factor in cache locality, which may very well have an inpact, a hash map will always be slower than a trie as you not only have to computer the hash, but also still compare the strings.
Title: Re: Mel's C homework
Post by: Karlos on February 24, 2007, 03:19:55 PM
Quote

but also still compare the strings.


Actually, the fast hashmap implementation compares only the hash keys and not the strings (or whatever object you are using), on the understanding that they are as unique as the entity from which they are created. The modulus of the hash by the container size is used to determine the first location of the entry. Each entry also contains the entity's hash as well as the entity. Due to the hash%size operation, you will tend to hit something unrelated once it gets too full.

If you are lucky, the first entry you hit is the one you are looking for (which you determine by comparing your hash with the entity's hash).

If you are unlucky, you might have to step a few left and right until you find it. This is given the assuming storage algorithm will try to find the closest empty slot to the initial target.

This type of algorihm is discussed in Bjarne Stroustrup's C++ 3rd ed as an exercise in container design ;-)

On every system I've tried, this method of hashed lookup outperforms the trie for longish strings when the hash map is generally less than 60% full. This can be accounted for simply though cacheing alone. Once you get past 60%, efficiency starts to fall off. You'd generally design the container to expand when it hits a load of say 75%.

Of course, the fundamental flaw in this is that you can get into trouble if two keys collide :-D

In our experiment here, the trie is likely to be faster since we probably arent looking for excessively long strings. The place you'd most want to ensure you optimised is in node allocation. You really dont want to be allocating thousands of small nodes individually for the trie.
Title: Re: Mel's C homework
Post by: MskoDestny on February 24, 2007, 03:45:34 PM
There's also the mighty ternary search tree: http://www.ddj.com/dept/windows/184410528
Title: Re: Mel's C homework
Post by: SamuraiCrow on February 24, 2007, 04:15:00 PM
The binary and ternary search trees would be slower than the hash or the trie.  The ternary search tree would be O(log n) time complexity while the trie and hash have typical time complexities of O(1) with the hash degrading to a O(n) as collisions increase.  That said, I'd go with the trie.
Title: Re: Mel's C homework
Post by: MskoDestny on February 24, 2007, 07:14:32 PM
Quote

SamuraiCrow wrote:
The binary and ternary search trees would be slower than the hash or the trie.  The ternary search tree would be O(log n) time complexity while the trie and hash have typical time complexities of O(1) with the hash degrading to a O(n) as collisions increase.  That said, I'd go with the trie.

A ternary search tree has lookup times O(log n) character comparisons whereas a bst has lookup times O(log n) actual full key comparisons so they are no where near as slow as bsts. According to Dr. Dobbs Journal (link provided in my last post) real world performance for insertions and successful lookups is roughly equivalent to a hash table, but it offers these benefits:
Quote

Ternary trees are usually substantially faster than hashing for unsuccessful searches.
Ternary trees gracefully grow and shrink; hash tables need to be rebuilt after large size changes.
Ternary trees support advanced searches, such as partial-match and near-neighbor search.
Ternary trees support many other operations, such as traversal to report items in sorted order.

They also use substantially less storage than an array trie (9 times less according to this paper: http://users.info.unicaen.fr/~daireaux/ANADY/PUBLIS/ClementFlajoletVallee2.ps.gz) and will probably achieve better cache utilization as a result.
Title: Re: Mel's C homework
Post by: Karlos on February 24, 2007, 07:19:30 PM
Perhaps we should keep the optimal container design chatter down a bit, lest we deter the student ;-)

We should keep this for after a successful attempt of stage 3, where discussing the structure and performance of rival containers would make more sense.
Title: Re: Mel's C homework
Post by: Cymric on February 24, 2007, 08:06:30 PM
Quote
Perhaps we should keep the optimal container design chatter down a bit, lest we deter the student ;-)

I agree. It's nice and all to be discussing algorithms, fascinating these might be, but really, this for when you're looking for fast ways to solve the excercise.
Title: Re: Mel's C homework
Post by: Vincent on February 25, 2007, 03:29:11 AM
Sorry to go off topic, but everytime I see this thread's title I keep misreading it as "Mel C's homework" :-P
Title: Re: Mel's C homework
Post by: Karlos on February 25, 2007, 10:55:23 AM
Quote

Vincent wrote:
Sorry to go off topic, but everytime I see this thread's title I keep misreading it as "Mel C's homework" :-P


That would be learning how to sing. Preferably, in tune.
Title: Re: Mel's C homework
Post by: mel_zoom on February 25, 2007, 09:36:05 PM
Well, here is the first program :-)

It seems to work fine on all the text files I gave it so far - including some "control" ones not containing anything some containing only space or 1 word and so on.

Whilst writing it I realised that I could simplify the starting state of the word counting loop by assuming we start with whitespace and no words. That way you can let the normal loop logic decide what to do with the first character read straight away.

Code: [Select]

/*
  Word counting program, stage 1
*/

#include <stdio.h>

/* prototype */
int countwords(FILE* file);

/* main program */
int main(int argc, char** argv)
{
const char* filename = 0;
FILE* file = 0;
int numwords = 0;

/* get the filename from the command line*/
if (argc<2) {
puts(&quot;Usage: wordcount <file name>&quot;);
return 1;
}

/* try to open the file */
filename = argv[1];
file = fopen(filename, &quot;r&quot;);
if (file == NULL) {
printf(&quot;Error: couldn't open file '%s' for input\n&quot;, filename);
return 1;
}

/* count and display */
numwords = countwords(file);
printf(&quot;Counted %d word(s) in file '%s'\n&quot;, numwords, filename);

/* all done */
fclose(file);
return 0;
}

int countwords(FILE* file)
{
/*
Looks for whitespace <-> non whitespace changes.
Every whitespace -> non whitespace change is considered a
new word.
*/
int numwords = 0;
int whitespace = 1;
int character;

character = fgetc(file);
while(character != EOF) {
switch (character) {
case ' ':
case '\t':
case '\n':
case '\r':
whitespace = 1;
break;

default:
if (whitespace == 1) {
numwords++;
whitespace = 0;
}
break;
}
character = fgetc(file);
}
return numwords;
}


Indentation too! - easy once you know how!
Title: Re: Mel's C homework
Post by: mel_zoom on February 25, 2007, 09:45:55 PM
I think I can already see how to go about solving stage 2.

If you put another variable "wordlength" initially set to 0 before the loop you could increment it in the "default" case of the switch.
Then in the whitespace part of the switch you could put an "if (whitespace==0)" test to see if we just entered the whitespace now. If that is true then whatever the current value of "wordlength" is would be the length of the last word read. You would then use this to increase the appropriate word counter and then reset it back to zero.

:idea:
Title: Re: Mel's C homework
Post by: Wain on February 26, 2007, 12:58:22 AM
This looks like a really solid beginner implementation of such an idea.  

I would like to make a recommendation from my experience that you eventually may want to try to keep your program flow separate from your conditional code, while this is absolutely unimportant in a program as simple as your phase 1, keeping them together in phase 3 may result in a rather unwieldy bit of code that could wind up being difficult to read/follow by other programmers, or yourself 2 months later.

What I essentially mean by this is to put your switch statement into a separate function called something like is_whitespace();  to be called inside of your while-loop.  Of course, you would also need to mildly restructure the implementation of your solution to do something like this.


Also...
What you may want to try at a more advanced stage is to rewrite it so that it loads a large portion of the file (or all of it) into memory and then processes it from there.

If you were working on large text files it would inevitably save time and reduce disk access...beyond being a good way to get comfy with malloc and basic pointer work.



Regardless of all of this...you've done some really good work so far!  :-)  :-)  :-)


as far as phase 2 is concerned...
Don't forget to filter punctuation from your word-lengths.


Title: Re: Mel's C homework
Post by: Piru on February 26, 2007, 01:28:08 AM
That's not bad at all.
Title: Re: Mel's C homework
Post by: mel_zoom on February 26, 2007, 10:03:46 AM
Wain:

"What I essentially mean by this is to put your switch statement into a separate function called something like is_whitespace(); to be called inside of your while-loop. Of course, you would also need to mildly restructure the implementation of your solution to do something like this."

For stage 2 it might be better. I could have a function that returns the length of the next word in the file. The switch statement could then contain all the punctuation characters to ignore in the same case range as the whitespace.
Title: Re: Mel's C homework
Post by: Karlos on February 26, 2007, 10:35:47 AM
Simple and clean. The character by character IO could be improved upon, as Wain suggests, but there are no obvious logical errors in the code, except one. If the text file contained any spurious punctuation that was surrounded by whitespace, they'd appear to be words.

Eg:

Hello!

would appear to be 2 words, rather than one.

Of course, this error depends on what your specification defines as a word. As I didn't actually say, I'll have to let  you off ;-)
Title: Re: Mel's C homework
Post by: falemagn on February 26, 2007, 01:11:53 PM
Quote

Also...
What you may want to try at a more advanced stage is to rewrite it so that it loads a large portion of the file (or all of it) into memory and then processes it from there.


That's unnecessary, as she's using buffered I/O file access which basically does what you're suggesting she should do.

One thing I would suggest, instead, is to use ctype.h's isspace() function (or macro, depending on its implementation).
Title: Re: Mel's C homework
Post by: Jupp3 on February 26, 2007, 01:23:26 PM
Quote
if (whitespace == 1)

If 0 and 1 (non-zero) are the only different values you need, you can always just do if(whitespace) (which in this case pretty much equals if(whitespace=1)) - on many (most?) CPU's it's faster and more compact to check whether or not a value is zero, rather than checking if it equals some other value. Basically this is also true for if(a==0), but probably many compilers can optimize that to if(!a)

As you probably have already noticed, this is often used as a way of error checking. Consider result=malloc(size); - if the allocation failed, you get NULL pointer. Personally whenever possible, I also use the same approach in my code. Another variant is to use zero as success, and non-zero as error code.

Also don't forget to avoid doing for(i=0; i
Title: Re: Mel's C homework
Post by: Karlos on February 26, 2007, 01:26:08 PM
@falemagn

Even though the stdio buffers the file, the overhead of repeatedly calling fgetc() can make a difference compared to iterating over an array of characters that you effectively have direct access to. You might notice the benefit on an 030 IDE system like Mel's.

Of course, this is only if you wanted to optimize the code as much as possible - but then, a good implementation might have an inlinable version of fgetc().
Title: Re: Mel's C homework
Post by: Karlos on February 26, 2007, 01:40:41 PM
Quote

falemagn wrote:

One thing I would suggest, instead, is to use ctype.h's isspace() function (or macro, depending on its implementation).


Actually, I think the idea of differentiating "word" characters from "non-word" characters would be more effective. You'd probably want to use isalnum() for that one as it only returns non-zero for letters or numbers.
Title: Re: Mel's C homework
Post by: falemagn on February 26, 2007, 02:14:18 PM
Quote

Karlos wrote:
@falemagn

Even though the stdio buffers the file, the overhead of repeatedly calling fgetc() can make a difference compared to iterating over an array of characters that you effectively have direct access to. You might notice the benefit on an 030 IDE system like Mel's.


The standard allows getc() to be implemented as a macro, and it often is. Getc() is otherwise equivalent to fgetc().

Title: Re: Mel's C homework
Post by: falemagn on February 26, 2007, 02:15:35 PM
Quote

Karlos wrote:
Quote

falemagn wrote:

One thing I would suggest, instead, is to use ctype.h's isspace() function (or macro, depending on its implementation).


Actually, I think the idea of differentiating "word" characters from "non-word" characters would be more effective. You'd probably want to use isalnum() for that one as it only returns non-zero for letters or numbers.


I agree with you, but I was merely suggesting a way to do without the 'case' construct she used.
Title: Re: Mel's C homework
Post by: Karlos on February 26, 2007, 02:58:37 PM
@falemagn

Quote
I was merely suggesting a way to do without the 'case' construct she used.


I agree that's the way forward but I also thought the case construct showed an exellent understanding of the underlying logic (for someone not acquainted with all the features of the c standard library) which is what I was looking for when I set the exercise :-)
Title: Re: Mel's C homework
Post by: mel_zoom on February 26, 2007, 08:48:43 PM
I dont think this is as clean as it probably could be but it seems to work.

Code: [Select]

/*

  Simple word count program stage 2

  counts words in text file passed as argument
  presents summary of word lengths

*/

#include <stdio.h>
#include <ctype.h>

/* prototype */
int countwords(FILE* file, int* lengths);

/* main program */
int main(int argc, char** argv)
{
const char* filename = 0;
FILE* file = 0;
int numwords = 0;
int lengths[9] = {0};
int n;

/* get the filename from the command line*/
if (argc<2) {
puts(&quot;Usage: wordcount <file name>&quot;);
return 1;
}

/* try to open the file */
filename = argv[1];
file = fopen(filename, &quot;r&quot;);
if (file == NULL) {
printf(&quot;Error: couldn't open file '%s' for input\n&quot;, filename);
return 1;
}

/* count and display */
numwords = countwords(file, lengths);
printf(&quot;Counted a total of %d word(s) in file '%s'\n&quot;, numwords, filename);

printf(&quot;Words <  4 chars: %d\n&quot;, lengths[0]);
for (n=1; n<7; n++) {
printf(&quot;Words of %d chars: %d\n&quot;, n+3, lengths[n]);
}
printf(&quot;Words > 10 chars: %d\n&quot;, lengths[8]);

/* all done */
fclose(file);
return 0;
}

int countwords(FILE* file, int* lengths)
{
int numwords = 0;
int notinaword = 1;
int length = 0;
int character;

while( (character = fgetc(file)) != EOF ) {
if (isalnum(character)) {
/* character is in a word so increment the length */
if (notinaword) {
notinaword = 0;
}
length++;
}
else {
/* if we just got here then update our counters */
if (!notinaword) {
notinaword = 1;
numwords++;

/* fill appropriate length bucket */
if (length<4) {
/* for lengths < 4 fill bucket 0 */
lengths[0]++;
}
else if (length<11) {
/* for lengths 4...10 fill buckets 1-7*/
lengths[length-3]++;
} else {
/* for lengths > 10 fill bucket 8*/
lengths[8]++;
}
length = 0;
}
}
}
return numwords;
}


No switch-case either :-)
Title: Re: Mel's C homework
Post by: falemagn on February 27, 2007, 04:59:50 PM
Mel,

it looks good, but you don't really need the notinaword variable, you can recode the countwords function like this:

Code: [Select]

int countwords(FILE* file, int* lengths)
{
    int numwords = 0;
    int length   = 0;
    int character;

    while ((character = getc(file)) != EOF) {
        if (isalnum(character)) {
            /* character is in a word so increment the length */
            length++;
        }
        else
        if (length > 0) {
            /* if we just got here then update our counters */
            numwords++;

            /* fill appropriate length bucket */
            if (length < 4) {
                /* for lengths < 4 fill bucket 0 */
                lengths[0]++;
            }
            else
            if (length < 11) {
                /* for lengths 4...10 fill buckets 1-7*/
                lengths[length - 3]++;
            }
            else {
                /* for lengths > 10 fill bucket 8*/
                lengths[8]++;
            }
           
            length = 0;
        }
    }
   
    return numwords;
}


Notice also the use of getc instead of fgetc: it's likely to make the code faster, for the implementation can chose to implement it as an inline function or macro accessing the buffering data structures directly.
Title: Re: Mel's C homework
Post by: mel_zoom on February 27, 2007, 07:37:03 PM
Hmm.

I hit a snag. It seems isalnum() only returns non zero for the characters 0...9 a...z A...Z

When the source text file contains words that might have some european characters it causes the program to think one word is several.

I think perhaps the logic should be that it looks for characters that are not space and are not punctuation?
Title: Re: Mel's C homework
Post by: mel_zoom on February 27, 2007, 07:44:37 PM
Ok that didnt work either :-(

:idea:

Since I already know how to load a file character by character, how hard can it be to write a replacement isword()  that uses a table built by reading a different text file?

If there is a table with room for all 256 codes - all set to zero at first - you could load the "character set" file and fill in a 1 for every character found in that file.

Or perhaps the file only need contain extra characters not already recognised by filling the table with the isalnum() for each character first...
Title: Re: Mel's C homework
Post by: Piru on February 27, 2007, 08:04:37 PM
Quote
When the source text file contains words that might have some european characters it causes the program to think one word is several.

Assuming character set ISO 8859-1 (latin1), here is a routine that should work relatively well:
Code: [Select]

int isalpha_int(unsigned char c)
{
        c &= ~0x20;
        return (c >= 'A' && c <= 'Z') ||
               (c >= 192 && c <= 222 && c != 215);
}
Title: Re: Mel's C homework
Post by: mel_zoom on February 27, 2007, 08:52:38 PM
Ok here is the final version for stage 2 - Im not planning to do anything more to it for now.

I improved it by moving more of the code from main() into functions and Ive made a custom isword() thats really a macro. It behaves exactly like isalnum() except that it uses a table that you can add characters to by specifying a character set file as the second parameter on the shell.

I jumped ahead of myself a bit using a macro for that but it seemed easy enough for this exercise.

I also redid the count loop the way falemagn suggested. His modification seems obvious in hindsight. I guess I have a while to go yet :-)

Code: [Select]

/*

  Simple word count program stage 2 revised

  counts words in text file passed as argument
  presents summary of word lengths

  uses custom character list for determining words

*/

#include <stdio.h>
#include <ctype.h>

/* data */
static char chartypetable[256] = {0};

/* prototypes */
void maketypetable(const char* charset);
int countwords(const char* filename, int* lengths);
void printsummary(const char* filename, int numwords, int* lengths);

#define isword(c) (chartypetable[(c)])


/* main program */
int main(int argc, char** argv)
{
const char* filename = 0;
const char* charsetfilename = 0;
int numwords = 0;
int lengths[9] = {0};

/* get the filename from the command line */
if (argc<2) {
puts(&quot;Usage: wordcount <file name> [character set]&quot;);
return 1;
}
filename = argv[1];

/* get any additional character set filename */
if (argc>2) {
charsetfilename = argv[2];
}

/* make our character table */
maketypetable(charsetfilename);

/* count and show summary */
numwords = countwords(filename, lengths);
printsummary(filename, numwords, lengths);

return 0;
}



void printsummary(const char* filename, int numwords, int* lengths)
{
int n;
printf(&quot;Counted a total of %d word(s) in file '%s'\n&quot;, numwords, filename);
printf(&quot;Words <  4 chars: %d\n&quot;, lengths[0]);
for (n=1; n<7; n++) {
printf(&quot;Words of %d chars: %d\n&quot;, n+3, lengths[n]);
}
printf(&quot;Words > 10 chars: %d\n&quot;, lengths[8]);
}



int countwords(const char* filename, int* lengths)
{
int numwords = 0;
int length = 0;
int character;
FILE* file;

file = fopen(filename, &quot;r&quot;);
if (!file) {
printf(&quot;Error: couldn't open file '%s' for input\n&quot;, filename);
exit(1);
}

while( (character = getc(file)) != EOF ) {
if (isword(character)) {
/* character is in a word so increment the length */
length++;
}
else {
/* if we just got here then update our counters */
if (length>0) {
numwords++;

/* fill a range bucket based on the word length*/
if (length<4) {
/* for lengths < 4 fill bucket 0 */
lengths[0]++;
}
else if (length<11) {
/* for lengths 4...10 fill buckets 1-7*/
lengths[length-3]++;
} else {
/* for lengths > 10 fill bucket 8*/
lengths[8]++;
}
length = 0;
}
}
}

fclose(file);
return numwords;
}



void maketypetable(const char* charset)
{
/*
make own character type table
*/
int c;

/* first fill with isalnum() */
for (c=0; c<256; c++) {
if (isalnum(c)) {
chartypetable[c]=1;
}
}

/* try to add any user defined characters too - excluding spaces */
if (charset) {
FILE* file = fopen(charset, &quot;r&quot;);
if (file) {
while ( (c = getc(file)) != EOF ) {
if (!isspace(c)) {
chartypetable[c] = 1;
}
}
fclose(file);

/* show the recognised &quot;isword()&quot; character set */
printf(&quot;Added character set definition from file '%s'\nThe following characters will be treat as valid within words:\n&quot;, charset);
for (c=0; c<256; c++) {
if (chartypetable[c]) {
printf(&quot;%c &quot;, c);
}
}
putchar('\n');
}
else {
printf(&quot;Couldn't open character set definition file '%s'\n&quot;, charset);
}
}
printf(&quot;\n&quot;);
}


I think thats enough for one evening!
Title: Re: Mel's C homework
Post by: AmireX on February 27, 2007, 09:51:19 PM
Quote

mel_zoom wrote:
Code: [Select]

/* fill a range bucket based on the word length*/
if (length<4) {
/* for lengths < 4 fill bucket 0 */
lengths[0]++;
}
else if (length<11) {
/* for lengths 4...10 fill buckets 1-7*/
lengths[length-3]++;
} else {
/* for lengths > 10 fill bucket 8*/
lengths[8]++;




Not even faster, but more compact :-)
Code: [Select]

/* fill a range bucket based on the word length*/
                                length = min(max(length-3,0),8);
                                lengths[length]++;


with
Code: [Select]

#define min(a, b)  (((a) < (b)) ? (a) : (b))
#define max(a, b)  (((a) > (b)) ? (a) : (b))

Regards, ReX
Title: Re: Mel's C homework
Post by: falemagn on February 28, 2007, 08:45:35 AM
Quote

mel_zoom wrote:
Hmm.

I hit a snag. It seems isalnum() only returns non zero for the characters 0...9 a...z A...Z


That's because the locale is set to the default "C". Libc's setlocale() function will change that behaviour, provided the underlying implementation supports it. AmigaOS' locale.library provides equivalent functionalities, but then it's not ANSI C anymore.
Title: Re: Mel's C homework
Post by: falemagn on February 28, 2007, 08:49:08 AM
Quote

AmireX wrote:
Not even faster, but more compact :-)
Code: [Select]

    /* fill a range bucket based on the word length*/
    length = min(max(length-3,0),8);
    lengths[length]++;

with
Code: [Select]

    #define min(a, b)  (((a) < (b)) ? (a) : (b))
    #define max(a, b)  (((a) > (b)) ? (a) : (b))

Regards, ReX


If she's using gcc, then the following might be a better choice:

Code: [Select]

    /* fill appropriate length bucket */
    switch (length)
    {
        case 1 ... 3:  lengths[0]++;        break;
        case 4 ... 10: lengths[length-3]++; break;
        default:       lengths[8]++;
    }
Title: Re: Mel's C homework
Post by: motorollin on February 28, 2007, 08:56:40 AM
Every time I see the subject line of this thread I think it says "Mel C's homework" :lol:
(http://www.allpop.com/AllPopImages-MelC/melc_212x270.jpg)
Girl Power!

--
moto
Title: Re: Mel's C homework
Post by: mgerics on February 28, 2007, 11:22:11 AM
Link to fre books on c and the like:
http://www.computer-books.us/c.php
Title: Re: Mel's C homework
Post by: Karlos on February 28, 2007, 06:28:04 PM
@Mel

Since your 030 has only 256 bytes of data cache, you might want to redesign your character table like so:

Code: [Select]

static unsigned long chartypedata[8] = { 0 };

#define isword(c) (chartypedata[(c)>>5]&(1UL<<((c)&31UL)))
#define setbit_isword(c) (chartypedata[(c)>>5] |= (1UL<<((c)&31UL)))
#define clrbit_isword(c) (chartypedata[(c)>>5] &= ~(1UL<<((c)&31UL)))


When you are filling your table initially, use the setbit_isword(character) macro.

If, for whatever reason, you need to clear an entry, use the clrbit_isword(character) macro.

The point of this code is that it uses just one bit to represent the "isword" status for each character. You therefore get a table that is only 32 bytes rather than 256.

One slight difference to be wary of is that you can only depend on this version of isword() to return 0 if false and something else if true (it could be any power of 2 value from 1 to 2^31) but that wont affect your logic.
Title: Re: Mel's C homework
Post by: Smack on March 01, 2007, 12:18:56 PM
@Karlos

Quote
Since your 030 has only 256 bytes of data cache, you might want to redesign your character table like so:

May I express my doubts that this will make the table lookups any faster.

While the bit array needs less memory and can therefore fit into the 68030's tiny cache, the extra instructions required for the bitwise access will probably nullify the performance boost achieved by the cache hits. (I haven't counted the instruction cycles, yet)

Here's some example code in M68k assembly to support my claim:
Code: [Select]

;  input: d0 = character (values 0...255)
;  input: a0 = address of lookup table
; output: status bit Z (Zero)

; alternative (1) byte array lookup
tst.b   (a0,d0.w)       ;(cache-miss likely on 68030)

; alternative (2) bit array lookup (supposed C compiler output)
move.w  d0,d1
lsr.w   #5,d1           ;d1=table offset (long words)
move.l  (a0,d1.w*4),d1  ;d1=long word value
and.l   #31,d0          ;d0=bit number (modulo 32)
moveq   #1,d2
lsl.l   d0,d2           ;d2=bit mask
and.l   d2,d1

; alternative (3) bit arry lookup (asm optimized)
move.w  d0,d1
lsr.w   #3,d1           ;d1=table offset (bytes)
btst    d0,(a0,d1.w)    ;(implied: d0=bit number modulo 8)

Depending on the number of wait states caused by the cache-miss during byte array lookup (alternative 1), even the hand-optimised bit array lookup (alternative 3) may be slower due to the extra instructions, let alone the C version (alternative 2). If the code runs on a 68040 or 060 then things will look even worse for the bit array...

I think this kind of performance tweaking is way off-topic for a beginners session in C, anyway.

Sorry for nagging.  ;-)
Title: Re: Mel's C homework
Post by: Karlos on March 01, 2007, 01:30:04 PM
Quote

Smack wrote:

Depending on the number of wait states caused by the cache-miss during byte array lookup (alternative 1), even the hand-optimised bit array lookup (alternative 3) may be slower due to the extra instructions, let alone the C version (alternative 2). If the code runs on a 68040 or 060 then things will look even worse for the bit array...


It was faster than the basic char array on my 040. I reasoned that it may well be faster on the 030 but as you say, mileage varies ;-)

Quote
I think this kind of performance tweaking is way off-topic for a beginners session in C, anyway.


Absolutely :-D
Title: Re: Mel's C homework
Post by: Smack on March 01, 2007, 03:17:08 PM
Quote
It was faster than the basic char array on my 040. I reasoned that it may well be faster on the 030 but as you say, mileage varies ;-)

Interesting! Why's that? Can the longword memory access be that much faster than the byte memory access?
Title: Re: Mel's C homework
Post by: SamuraiCrow on March 01, 2007, 03:46:07 PM
Quote

Smack wrote:
Quote
It was faster than the basic char array on my 040. I reasoned that it may well be faster on the 030 but as you say, mileage varies ;-)

Interesting! Why's that? Can the longword memory access be that much faster than the byte memory access?


No...  The '040 has a 4k cache on the die instead of a 256 byte cache.
Title: Re: Mel's C homework
Post by: Smack on March 01, 2007, 04:32:49 PM
Quote
SamuraiCrow wrote:
No...  The '040 has a 4k cache on the die instead of a 256 byte cache.

Yeah, I know. The byte array (256 bytes) should easily fit into that larger cache, which should give the byte lookup a boost. But Karlos observed a better performance of the bit lookup with its smaller table (32 bytes) but more instructions.

Why? Cache-misses due to lookup table of 256 bytes not being in 68040 data cache? Longword lookups faster than byte lookups? Hm...
Title: Re: Mel's C homework
Post by: Karlos on March 01, 2007, 04:54:27 PM
I'll try a hand optimised assembler implementation of each one later.

The speed was based on testing a very large string (several MB worth) in memory. The speed increase wasn't accurately measurable when reading from a file (even with buffering) since IO overhead tends to be the limiting factor in both cases.

Perhaps you are looking at the problem in too small a scope. The source data also enters the cache. In this case, you are reading something that is effectively streaming (from memory) which will involve cache line transfers to boost the speed of successive character accesses, but fundamentally, this will pollute the cache with values from the string that are only ever used once (this is a common problem with stream style processing).
Therefore, perhaps the smaller bit-table is less affected due to the fact that it's both smaller and fact that a lot of the time the same 32-bit word will be repeatedly hit for many different characters.

Its also possible that the compiler didnt generate code for the byte table lookup that was especially optimal compared to the bit lookup.
Title: Re: Mel's C homework
Post by: Smack on March 01, 2007, 05:28:21 PM
Quote
Karlos wrote:
The source data also enters the cache.
...
Therefore, perhaps the smaller bit-table is less affected due to the fact that it's both smaller and fact that a lot of the time the same 32-bit word will be repeatedly hit for many different characters.

Yes, that's probably the explanation of the results.

But as you said, in the presence of file I/O it seems useless to optimise this part of the program. It's just a nice technical exercise with no practical benefit. Have fun.  ;-)
Title: Re: Mel's C homework
Post by: guru-666 on March 01, 2007, 05:31:51 PM
wow so many ppl interested in some average chicks c homework, amazing....  You guy's know that python has this kind of function built in already.
Title: Re: Mel's C homework
Post by: AmireX on March 01, 2007, 06:58:10 PM
Quote

guru-666 wrote:
wow so many ppl interested in some average chicks c homework, amazing....  You guy's know that python has this kind of function built in already.


Yes python is the better choice :-), could look like this:

Code: [Select]

import sys

argc = len(sys.argv)
if argc < 2:
  print &quot;Usage: wordcount <file name>&quot;
  sys.exit(1)

word_l = open(sys.argv[0]).read().split()
len_d = {}
for word in word_l:
  size = len(word)
  if size in range(1,4): size = 3
  if size > 10:          size = 10
  if len_d.has_key(size): len_d[size] += 1
  else:                   len_d[size] = 1
print len_d

Title: Re: Mel's C homework
Post by: mel_zoom on March 01, 2007, 07:09:03 PM
guru-666:

"wow so many ppl interested in some average chicks c homework, amazing....  You guy's know that python has this kind of function built in already."

I guess the point that this is all about learning C and not about which language is better for the task is wasted on some average python progammers :-P

It is probably even easier in Perl or PHP.
Title: Re: Mel's C homework
Post by: guru-666 on March 01, 2007, 07:45:40 PM
that's cool, can't realy find fault in ppl learning new things!  I just wonder why you choose C, as that a language is old and on it's way out.  I perfer object oriented stuff.  Anyway I'm not even average I'm piss poor at coding.
I'm getting to old to learn hard things with out a goal.  I use (MEL, maya's scripts language) and now python for my 3d animation needs.... to me coding is sensless unless you have a goal.  When I told the folk here at work that I was going to go back and learn C, they laughed at me and told me to just use python and move forward..... So if your looking for work...
anyway, good luck!


PERL!  that syntax sucks..... but them regular expression, wow.
Title: Re: Mel's C homework
Post by: mel_zoom on March 01, 2007, 08:10:12 PM
People keep saying that C (and C++) are obsolete and that some newer language does it all so much better.

What I cant understand is how people can think that when most of these languages require C in order to be compiled and maintained. How many of them can be genuinely built using only their own language?

As far as I understand Java, Perl, Python, PHP and all of the .net languages are all implemented in C at some level and cannot be built without it!

When everybody has forgotten C who will actually maintain the languages they have migrated to?
Title: Re: Mel's C homework
Post by: Piru on March 01, 2007, 08:30:21 PM
Quote
I just wonder why you choose C, as that a language is old and on it's way out.

How is age of the language relevant? Regardless, C is not going anywhere.

IMO C is probably the best choice for overall programming skills. When you know C, it is easy to extend your knowlege to other languages.
Title: Re: Mel's C homework
Post by: Linde on March 01, 2007, 08:42:11 PM
Quote

IMO C is probably the best choice for overall programming skills. When you know C, it is easy to extend your knowlege to other languages.

I agree. With C behind me I could almost instantly pick up PHP and Javascript, and I bet that no "curly bracket" language is really out of my reach. I just need to read up on object orientation :)
Title: Re: Mel's C homework
Post by: Piru on March 01, 2007, 09:09:39 PM
@guru-666
Quote
You guy's know that python has this kind of function built in already.

That doesn't help when learning C though.
Title: Re: Mel's C homework
Post by: Dietmar on March 01, 2007, 09:24:11 PM
> I just wonder why you choose C, as that a language is old

As is English.
Title: Re: Mel's C homework
Post by: guru-666 on March 01, 2007, 09:39:40 PM
yes, english is very old, and a huge part of the world still uses it.   However it is evolving all the time, so learning old english first does not help you talk with people.
I guess learning C is a cool way to spend you free time if you have nothing else to do.  I still kind of want to learn ASM for 68000, but put it off becasue other than having fun there is no use I can think of.
Like I said before I suck at coding, but if I have an issue I can usualy get it resolved.  
Enjoy Mel's c homework, wonder if there would be half of the attention is she where a fat guy. LOL!
here is my take on it.
C, if you want to learn the ART of programing
Python, if you need to get the job done.

Title: Re: Mel's C homework
Post by: Dietmar on March 01, 2007, 09:41:41 PM
> wonder if there would be half of the attention is she where a fat guy

Well, give it a try ;)
Title: Re: Mel's C homework
Post by: guru-666 on March 01, 2007, 09:46:47 PM
not fat.....sorry....maybe not even a guy, LOL
aber ich kann auch deutsch schreiben.
Title: Re: Mel's C homework
Post by: AmireX on March 01, 2007, 09:59:33 PM
Quote

mel_zoom wrote:
As far as I understand Java, Perl, Python, PHP and all of the .net languages are all implemented in C at some level and cannot be built without it!

When everybody has forgotten C who will actually maintain the languages they have migrated to?


That's true. I use python a lot, but to integrate new functions into the language I often need to extend the interpreter with so called python dlls .. written in plain C ;-)

But at the other hand I try to not reinvent the wheel (is this correct english ??), so different languages for different tasks. But I like C/C++ too, it's like a good sudoku.
Title: Re: Mel's C homework
Post by: guru-666 on March 01, 2007, 10:05:47 PM
perfect english! well said.
Title: Re: Mel's C homework
Post by: mel_zoom on March 01, 2007, 10:07:39 PM
Well Ive been trying to take in the hints from earlier about stage 3.

Its a bit confusing so far. I am guessing that the basic idea of building a list of words is no good?

I had thought of this:

Code: [Select]

struct uniqueword {
    struct uniqueword* last;
    struct uniqueord* next;
    const char* text;
    int count;
};


I think this solves the problem of not knowing how many unique words there are as you can add more uniqueword structures as you go along. I thought about having a list like this. Whenever you get a word you would walk along the list to see if you find it.

I thought that stricmp() could compare the word with the one in a structure until it finds the structure that contains the word already (or not). However it struck me that since stricmp() actually returns a value less than zero or greater than zero if one string is "less" than another or not that it could also be used to detect where in the list to add a new word. As long as the word in the structure is "less" than your current word then keep going along the list. If you then get to a word that is "greater" than your current one - assuming it didnt match anything - you could put your new word at that point in the list.

However I think I see the problem you are all hinting at. The more unique words there are the longer it takes to check if any word you are currently testing is there already. If there were very many unique words the program would run slower and slower.

Back to the drawing board?
Title: Re: Mel's C homework
Post by: AmireX on March 01, 2007, 10:07:58 PM
Quote

guru-666 wrote:
perfect english! well said.


Danke, I'm learning.
Title: Re: Mel's C homework
Post by: Piru on March 01, 2007, 10:22:56 PM
@mel_zoom

You got it right there. I think you should first implement it using the slow (list) method. When that works you can look into some faster ways to manage the words, such as trie, binary tree or hash table.

If you keep your program well structured, and separate the "lookup" and "insert" into separate routines you should have no problems changing the implementation when/if desired.

I do this quite a lot: I get all the simpler parts going first, and those work okay I tackle the bigger problem. It is easy to overwhelm yourself with tons of work, it helps if you keep things organized, and don't try to solve everything at once.

In my solution to stage 3 I have both binary tree and trie routines, toggleable with a #define. I could code the simple binary tree out of memory, and since I had set of simple functions for it, it was trivial to add conditional trie routines later on.

I guess what I am trying to say is: Don't be too hard on yourself and don't demand perfection from the beginning. Allow yourself to use simple(r) solutions, it is enough to recognize that better solutions exist. You can always improve the code after the basics work. (And in case you can't get the advanced solution working, at least you have the simpler one that DOES ;-))
Title: Re: Mel's C homework
Post by: AmireX on March 01, 2007, 10:24:07 PM
To solve it in C (and not C++) you can use a hash function. Such a function calculates a special hash value (in a defined range) for every string. This value you can use to put the strings into an array. To avoid conflicts of equal hash values for different strings you built an array of linked lists and appends the equal strings to this lists.

Depending of the size of your "string list array" you have an access near to 1 O(1) to find a string.  But this is a hash map and always implemented in the standard template library for example.  :-?
Title: Re: Mel's C homework
Post by: falemagn on March 02, 2007, 07:06:30 AM
@ mel_zoom

Piru said all that I had in mind to reply myself, so... just wanted to say that I wish my coworkers were as skilled as a newbee like you is. ;-)
Title: Re: Mel's C homework
Post by: mel_zoom on March 02, 2007, 10:25:47 AM
I have a good guru.
Title: Re: Mel's C homework
Post by: A4000_Mad on March 02, 2007, 10:34:04 AM
Thanks Mel, I'm really getting to like this C stuff :-)

I must not mention the beautiful new avatar......
I must not mention the beautiful new avatar......
I must not mention the beautiful new avatar......
I must not mention the beautiful new avatar......
I must not mention the beautiful new avatar......

But wow!!


A4000 Mad
Title: Re: Mel's C homework
Post by: mel_zoom on March 02, 2007, 10:56:27 AM
How strange - I was really annoyed at the time that picture was taken as I wrestling with a particularly annoying earpiece prior to a product release event!

I thought the look of perplexed frustration would be apt for when I start getting into more realistic C code :-)
Title: Re: Mel's C homework
Post by: falemagn on March 02, 2007, 11:01:17 AM
Quote

mel_zoom wrote:

I thought the look of perplexed frustration would be apt for when I start getting into more realistic C code :-)


Except that you don't look either perplexed nor frustrated, in that picture. ;-)
Title: Re: Mel's C homework
Post by: madsjm on March 02, 2007, 11:07:33 AM
I read "Mel C's homework", and I thought one of the Spice Girls was among us.

My face is red.  :roll:
Title: Re: Mel's C homework
Post by: mel_zoom on March 02, 2007, 11:10:15 AM
Ok I changed the title. Better?
Title: Re: Mel's C homework
Post by: madsjm on March 02, 2007, 11:13:15 AM
BTW: Seeing the number of replies Mel get in these forums.. She could work very well as the new marketing campaign for Amiga!  :lol:
Title: Re: Mel's C homework
Post by: Jupp3 on March 02, 2007, 11:35:07 AM
Quote
I just wonder why you choose C, as that a language is old and on it's way out.

You know it's been going out since it was first published, still going (out) strong :-)

Also html is going out, and will be totally replaced with flash. Or is it still flash? At least it was >5 years ago...
Quote
I perfer object oriented stuff.

Me too. That's why I code C in object oriented way. imho, "being object oriented" is mostly the attitude of a programmer, more than a feature of a programming language.
Quote
yes, english is very old, and a huge part of the world still uses it. However it is evolving all the time, so learning old english first does not help you talk with people.

Luckily C is evolving aswell. There have been few newer standard revisions, but most notably there's steady development for various libraries, that you can use to do, well, just about anything.
Title: Re: Mel's C homework
Post by: Cymric on March 02, 2007, 01:51:38 PM
Quote
Piru wrote:
I guess what I am trying to say is: Don't be too hard on yourself and don't demand perfection from the beginning. Allow yourself to use simple(r) solutions, it is enough to recognize that better solutions exist. You can always improve the code after the basics work. (And in case you can't get the advanced solution working, at least you have the simpler one that DOES ;-))

In addition, it always makes a big impression on people when you're able to brag you've rewritten the code to make it 1000% faster. (Everyone goes 'Wow!' and 'That's HUMONGOUS!' and 'There's a coder who knows where his towel is!'.) What you don't say is how terrible your first solution was  :-P
Title: Re: Mel's C homework
Post by: Karlos on March 02, 2007, 08:51:19 PM
Quote

falemagn wrote:
Quote

mel_zoom wrote:

I thought the look of perplexed frustration would be apt for when I start getting into more realistic C code :-)


Except that you don't look either perplexed nor frustrated, in that picture. ;-)


It is a lot better than the look of sullen boredom I was wearing at that presentation :lol:
Title: Re: Mel's C homework
Post by: mel_zoom on March 06, 2007, 10:20:34 PM
Noooo! I just wrote a really long post and got logged out!

Luckily I was writing most of it in a text editor - phew!

...

Its been a few days now and I read up on malloc(). I realised that malloc doesnt actually care what you are allocating memory for so I changed the structure a little bit so that the text becomes a part of it.

Anyway, this is what I am working towards now:

Code: [Select]


int addword(const char* wordtext); /* main prototype */

struct uniqueword {
    struct uniqueword* last;
    struct uniqueord* next;
    int count;
    char wordtext[1]; /* copy of the text */
};

static struct uniqueword* firstword;

static struct uniqueword* createuniqueword(const char* wordtext)
{
    /* create a new uniqueword structure */
    int wordlength;
    struct uniqueword* unique;

    wordlength = strlen(wordtext);
    unique = (struct uniqueword*)malloc(sizeof(struct uniqueword)+wordlength);

    /* fill in data if malloc() ok */
    if (unique != NULL) {
       unique->last = 0;
       unique->prev = 0;
       unique->count = 1;
       strcpy(unique->wordtext, wordtext);
    }
    return unique;
}

int addword(const char* wordtext)
{
   /* returns 1 if the word was added or zero if it existed or -1 on failure */

   struct uniqueword* newword;
   struct uniqueword* last;
   struct uniqueword* uw;
   int worddiff;

   if (!firstword) {
      newword = createuniqueword(wordtext);
      if (newword) {
         firstword = newword;
         return 1;
      }
      else {
         return -1;
      }
   }

   for (uw = firstword; uw != NULL; uw = uw->next) {
      worddiff = stricmp(wordtext, uw->wordtext);

      /* found it! */
      if (worddiff == 0) {
         uw->count++;
         return 0;
      }

      /* as soon as word > uw->word we insert it immediately before uw */
      if (worddiff > 0) {
         newword = createuniqueword(wordtext);
         if (newword != NULL) {
             newword->prev = uw->prev;
             newword->next = uw;
             uw->prev = newword;
             newword->prev->next = newword;
             return 1;
         } else {
             return -1;
         }
      }

      /* remember the last uniqueword */
      last = uw;
   }

   /* we reached the end of the list without finding it */
   newword = createuniqueword(wordtext);
   if (newword) {
      last->next = newword;
      return 1;
   }
   return -1;
}



Im not really happy with the loop though. Im trying to keep the words sorted based on the greater than / less than return of stricmp() so that the entire list isnt always searched before deciding to add the word.

I think its quite ugly though (having to have a "last" pointer just because the normal loop finishes when uw is null) and probably broken too.

Would "(uw != NULL) && (uw->next != NULL)" in the loop be better? It looks like it would always skip the last one on the list regardless which would only mean having to check if that one matches the wordtext outside the loop which also seems a bit stupid.
Title: Re: Mel's C homework
Post by: falemagn on March 07, 2007, 07:16:50 AM
@ Mel

Since you're never traversing the list backwards, you don't really need the "prev" pointer. Besides, you mispelled "prev" with "last" in the uniqueword structure definition. ;-)

About the loop issue: the "trick" is to not add the new word while within the loop, but do it always outside, trying to unify the cases in which it's the last word in the chain or one in between. You can state the loop's invariant property that when it finishes looping, the "last" pointer always points to the word after which you need to insert yours, then factor the loop in such a way as to make that the case.

In other words, consider the loop as the implementation of a "find" function that returns the pointer to the element in the list which is the upper bound of all elements that are less than the one you want to add.

If you need code hints, I can give them to you, but I'm sure you'll prefer to work that out by yourself. ;-)
Title: Re: Mel's C homework
Post by: mel_zoom on March 07, 2007, 09:58:55 AM
falemagn:

Thanks for the hints. The code there was put together but not yet compiled so I expect there to be those sorts of bugs.

I thought about having a search that ran from both ends towards the middle that could terminate once the pointers "cross over". Thats why I thought about having prev/next.

As you can tell I am still some time from making it work :-)
Title: Re: Mel's C homework
Post by: falemagn on March 07, 2007, 12:11:23 PM
Quote

mel_zoom wrote:
falemagn:

Thanks for the hints. The code there was put together but not yet compiled so I expect there to be those sorts of bugs.


I guessed as much, but still apart from that mistake the code as it is will work just fine. It's kind of amazing how you are so good at such stuff... the dream girl of any geek, I'd say. ;-)

Quote

I thought about having a search that ran from both ends towards the middle that could terminate once the pointers "cross over". Thats why I thought about having prev/next.


That may sound a clever idea, but you'd not get better performances than the one-way search unless you were thinking of parallelizing the algorithm... which would be quite an overengineering. :-) In both cases, the average compexity would be O(n/2).

Quote

As you can tell I am still some time from making it work :-)


There's still another improvement you could make, namely structure the code in such a way as to allocate the new uniqueword only in one place rather than in three like you do now. But we'll keep this for later... ;-)
Title: Re: Mel's C homework
Post by: mgerics on March 07, 2007, 12:20:34 PM
All right, I've figured it out - Mel Zoom is NOT a woman! HE is a lazy young brat who wants the 'cool' aspect of being able to program a computer, but is too lazy to do it HIMSELF. HE is taking code found on the web and plagiarizing it and taking advantage of the people on Amiga.org by asking for help with HIS 'pet' project!

Admit it, Mel! You found that avatar, who must be a local model from your home town, and decided 'those nerdy geeks at a.org will fall in love with this person and give me all sorts of free advice..'



Um, you know I'm just kidding, right? Best of luck in your endeavors. Makes me want to get back to C coding and delete Visual Basic from my computer.
Title: Re: Mel's C homework
Post by: mel_zoom on March 07, 2007, 07:29:38 PM
falemagn:

I see. One direction it is then! Is the stricmp() based 'sorting' still useful or not in your opinion?

I ask because a "hash code" was suggested earlier - perhaps I could replace the "prev" with a hash of the string and compare that but I expect any greater than / less than information would be lost.

I got given the following example hash function by Karlos:

Code: [Select]

unsigned long hash32(const char* text)
{
    unsigned long result = 0;
    unsigned long mask = 0x80000000;
    char *pc = (char*)text;
    while (*pc) {
       result = (result<<1) | ((result&mask) ? 1 : 0);
       result ^= (unsigned long)(*pc++);
    }
    return result;
}


Im not entirely sure yet how that works but it generates a "relatively unique" value from the string given. His suggestion was that I should use the hash value of the word being searched for and compare it to the hash value of the existing uniquewords in the list. Only when the hash values match call strcmp() to ensure they are the really same. Im told this should be faster than calling strcmp() each word in turn especially as the list grows.

Unlike my origial idea though it has to check the whole list before deciding the word is not present.

He did also point out that Id need to handle the case insensitivity issue for the above idea to work too.
Title: Re: Mel's C homework
Post by: CannonFodder on March 14, 2007, 04:01:48 PM
Quote

madsjm wrote:
I read "Mel C's homework", and I thought one of the Spice Girls was among us.

My face is red.  :roll:


Dirty boy! ;-)