Author Topic: My C homework (Read 6457 times)

mel_zoom · « **on:** February 24, 2007, 11:14:22 AM »

Hi!

So Ive been plodding along learning basic C. Ive been set what I am told is a fairly simple challenge but that has many possible solutions.

The challenge is to write a small program, using only the standard ANSI C headers that reads a text file - the name of which is given as a parameter on the command line - and performs a basic word count.

Stage two is to then modify the program so that it reports a breakdown of the word count into the following ranges

words of less than 4 characters
words of 4 characters
words of 5 characters
words of 6 characters
words of 7 characters
words of 8 characters
words of 9 characters
words of 10 characters
words of more than 10 characters

Stage three - which is supposed to be a lot more difficult to do well - is to create a new program which counts all the unique words in the text file.

For the first stage I thought that I could write a loop which reads the characters one at a time. It would see if the first character was whitespace or not and use it to set a "type" variable which might be 0 for whitespace and 1 for non whitespace. If it is non-whitespace it would set the word count to 1 otherwise set it to 0.

Then as it reads each character from the file it could see if it is whitespace or not. As long as the character is the same type as the previous one (that is whitespace or not) it would just keep looping. When the type of character changes it could perform an action:

If it changed from non whitespace to whitespace just change the "type" variable to reflect the change.

If it changed from whitespace to non whitespace change the "type" variable and increase the word count by one.

When the loop reaches end of file then close it and print out the word count.

I have no intention of looking at stage 2 until I got this right :-)

Is it looking valid so far?

Hodgkinson · « **Reply #1 on:** February 24, 2007, 11:37:59 AM »

Déjà Vu…

I don’t suppose this is to help with some kind of statistics/maths assignment, by any chance?

Hodgkinson.

Speelgoedmannetje · « **Reply #2 on:** February 24, 2007, 11:38:14 AM »

The lazy bum in me says regex :lol:

Cymric · « **Reply #3 on:** February 24, 2007, 11:50:16 AM »

Quote

Is it looking valid so far?

Only one way to find out, right? :-P

mgerics · « **Reply #4 on:** February 24, 2007, 12:12:04 PM »

Sonds like an algorithm! Be careful at the end of file, though - treat end of filelike a non-whitespace so the algorithm will count the last word:
e.g. <> represents a whitespace, (eof) is end of file
...<>last<>word(eof) <- treat eof like a non-whitepsace so 'word' gets counted...
...<>last<>word<>(eof) <- the last whitepsace will count 'word'
Have fun programming.

amiga_3k · « **Reply #5 on:** February 24, 2007, 01:02:37 PM »

Don't forget to filter line-breaks.

Karlos · « **Reply #6 on:** February 24, 2007, 01:10:07 PM »

@Mel

As an algorithm it looks fine. Mgerics has a point, however. Depending on your main loop, the end of file character might be intercepted as the start of a new word should there be any whitespace at the end of the file. Look out for that when designing your loop.

As I said, the objective in this case isn't to see how well you can factor the code into lots of utility functions, it is an exercise in solving a problem. You can probably manage the entire thing within main() relatively well.

The second exercise is more likely to benefit from factoring into functions and you should give some thought to that.

The third exercise is for once you have an understanding of structures and memory allocation / deallocation.

Hint: have a structure that at least consists of a char* for the word, a count value. You will almost certainly need to factor that code into functions that perform specific actions on these structures, such initializing them (and allocating space to store the word) and finalizing them (deallocating the space used to store the word). You will need to manage a collection of structures, one for each unique word.

A second hint: comparing integer values is much, much faster than comparing strings with strcmp()/stricmp(). Consider extending your structure to contain a second integer that is some value computed from the word itself (look up hash functions) and base your comparisons on that. Who knows, you might even take that to its logical conclusion and actually implement your collection of word count structures as a hash map ;-)

That's some time off yet though.

Piru · « **Reply #7 on:** February 24, 2007, 02:20:54 PM »

Also, there are various things to consider for "good" implementation of the stage three:

- It should not be limited to some fixed size maximum word length.
- It should not get exponentally slower when number of different words grows.

However, these are points for the advanced class. I have such implementation ready and I'll provide it later.

Karlos · « **Reply #8 on:** February 24, 2007, 02:27:43 PM »

Quote

Piru wrote:
- It should not be limited to some fixed size maximum word length.
- It should not get exponentally slower when number of different words grows.

A hash map satisfies the second point nicely. That would be trivial to implement in C++/STL but it requires some thinking about to do from first principles in C. But then, that is the aim of the exercise.
Certainly you don't want to be iterating over the entire list of known words for each word read.

Piru · « **Reply #9 on:** February 24, 2007, 02:35:18 PM »

@Karlos

Quote

A hash map satisfies the second point nicely. That would be trivial to implement in C++/STL but it requires some thinking about to do from first principles in C. But then, that is the aim of the exercise.
Certainly you don't want to be iterating over the entire list of known words for each word read.

True. There's IMO a better data structure for this particular case: a binary tree. Hash tables are boring... ;-)

falemagn · « **Reply #10 on:** February 24, 2007, 02:50:39 PM »

Quote

Piru wrote:
@Karlos
Quote
A hash map satisfies the second point nicely. That would be trivial to implement in C++/STL but it requires some thinking about to do from first principles in C. But then, that is the aim of the exercise.
Certainly you don't want to be iterating over the entire list of known words for each word read.

True. There's IMO a better data structure for this particular case: a binary tree. Hash tables are boring... ;-)

Or even better, a trie.

Piru · « **Reply #11 on:** February 24, 2007, 02:56:08 PM »

Ah indeed, this would be the perfect use for a trie ;-)

Karlos · « **Reply #12 on:** February 24, 2007, 03:00:19 PM »

True enough, there are lots of containers that would cope with this scenario.

Never underestimate the raw speed of the basic hash map though. It so happens I have to do a lot of lookup stuff at work and you'd be surprised how much slower the binary tree search was for this class of application ;-)

Of course, it depends on your hash function. If it takes longer to compute than it does to iterate n levels down the tree (or trie), chances are it will be slower. But that depends on many factors, such as data locality, cost of multiple pointer accesses and so on. Basically all beyond the scope of the exercise :-D

The trie, however would be an excellent choice for this (I use this algorithm, slightly modified, to find the best match prefix for a telephone number in order to determine which price band it belongs to).

falemagn · « **Reply #13 on:** February 24, 2007, 03:10:05 PM »

Quote

Of course, it depends on your hash function. If it takes longer to compute than it does to iterate n levels down the tree (or trie), chances are it will be slower.

Unless you factor in cache locality, which may very well have an inpact, a hash map will always be slower than a trie as you not only have to computer the hash, but also still compare the strings.

Karlos · « **Reply #14 on:** February 24, 2007, 03:19:55 PM »

Quote

but also still compare the strings.

Actually, the fast hashmap implementation compares only the hash keys and not the strings (or whatever object you are using), on the understanding that they are as unique as the entity from which they are created. The modulus of the hash by the container size is used to determine the first location of the entry. Each entry also contains the entity's hash as well as the entity. Due to the hash%size operation, you will tend to hit something unrelated once it gets too full.

If you are lucky, the first entry you hit is the one you are looking for (which you determine by comparing your hash with the entity's hash).

If you are unlucky, you might have to step a few left and right until you find it. This is given the assuming storage algorithm will try to find the closest empty slot to the initial target.

This type of algorihm is discussed in Bjarne Stroustrup's C++ 3rd ed as an exercise in container design ;-)

On every system I've tried, this method of hashed lookup outperforms the trie for longish strings when the hash map is generally less than 60% full. This can be accounted for simply though cacheing alone. Once you get past 60%, efficiency starts to fall off. You'd generally design the container to expand when it hits a load of say 75%.

Of course, the fundamental flaw in this is that you can get into trouble if two keys collide :-D

In our experiment here, the trie is likely to be faster since we probably arent looking for excessively long strings. The place you'd most want to ensure you optimised is in node allocation. You really dont want to be allocating thousands of small nodes individually for the trie.

Author Topic: My C homework (Read 6457 times)

mel_zoom

My C homework

Hodgkinson

Re: Mel's C homework

Speelgoedmannetje

Re: Mel's C homework

Cymric

Re: Mel's C homework

mgerics

Re: Mel's C homework

amiga_3k

Re: Mel's C homework

Karlos

Re: Mel's C homework

Piru

Re: Mel's C homework

Karlos

Re: Mel's C homework

Piru

Re: Mel's C homework

falemagn

Re: Mel's C homework

Piru

Re: Mel's C homework

Karlos

Re: Mel's C homework

falemagn

Re: Mel's C homework

Karlos

Re: Mel's C homework