Author Topic: My C homework (Read 6434 times)

Karlos · « **Reply #14 on:** February 24, 2007, 03:19:55 PM »

Quote

but also still compare the strings.

Actually, the fast hashmap implementation compares only the hash keys and not the strings (or whatever object you are using), on the understanding that they are as unique as the entity from which they are created. The modulus of the hash by the container size is used to determine the first location of the entry. Each entry also contains the entity's hash as well as the entity. Due to the hash%size operation, you will tend to hit something unrelated once it gets too full.

If you are lucky, the first entry you hit is the one you are looking for (which you determine by comparing your hash with the entity's hash).

If you are unlucky, you might have to step a few left and right until you find it. This is given the assuming storage algorithm will try to find the closest empty slot to the initial target.

This type of algorihm is discussed in Bjarne Stroustrup's C++ 3rd ed as an exercise in container design ;-)

On every system I've tried, this method of hashed lookup outperforms the trie for longish strings when the hash map is generally less than 60% full. This can be accounted for simply though cacheing alone. Once you get past 60%, efficiency starts to fall off. You'd generally design the container to expand when it hits a load of say 75%.

Of course, the fundamental flaw in this is that you can get into trouble if two keys collide :-D

In our experiment here, the trie is likely to be faster since we probably arent looking for excessively long strings. The place you'd most want to ensure you optimised is in node allocation. You really dont want to be allocating thousands of small nodes individually for the trie.

MskoDestny · « **Reply #15 on:** February 24, 2007, 03:45:34 PM »

There's also the mighty ternary search tree: http://www.ddj.com/dept/windows/184410528

SamuraiCrow · « **Reply #16 on:** February 24, 2007, 04:15:00 PM »

The binary and ternary search trees would be slower than the hash or the trie. The ternary search tree would be O(log n) time complexity while the trie and hash have typical time complexities of O(1) with the hash degrading to a O(n) as collisions increase. That said, I'd go with the trie.

MskoDestny · « **Reply #17 on:** February 24, 2007, 07:14:32 PM »

Quote

SamuraiCrow wrote:
The binary and ternary search trees would be slower than the hash or the trie. The ternary search tree would be O(log n) time complexity while the trie and hash have typical time complexities of O(1) with the hash degrading to a O(n) as collisions increase. That said, I'd go with the trie.

A ternary search tree has lookup times O(log n) character comparisons whereas a bst has lookup times O(log n) actual full key comparisons so they are no where near as slow as bsts. According to Dr. Dobbs Journal (link provided in my last post) real world performance for insertions and successful lookups is roughly equivalent to a hash table, but it offers these benefits:

Quote

Ternary trees are usually substantially faster than hashing for unsuccessful searches.
Ternary trees gracefully grow and shrink; hash tables need to be rebuilt after large size changes.
Ternary trees support advanced searches, such as partial-match and near-neighbor search.
Ternary trees support many other operations, such as traversal to report items in sorted order.

They also use substantially less storage than an array trie (9 times less according to this paper: http://users.info.unicaen.fr/~daireaux/ANADY/PUBLIS/ClementFlajoletVallee2.ps.gz) and will probably achieve better cache utilization as a result.

Karlos · « **Reply #18 on:** February 24, 2007, 07:19:30 PM »

Perhaps we should keep the optimal container design chatter down a bit, lest we deter the student ;-)

We should keep this for after a successful attempt of stage 3, where discussing the structure and performance of rival containers would make more sense.

Cymric · « **Reply #19 on:** February 24, 2007, 08:06:30 PM »

Quote

Perhaps we should keep the optimal container design chatter down a bit, lest we deter the student ;-)

I agree. It's nice and all to be discussing algorithms, fascinating these might be, but really, this for when you're looking for fast ways to solve the excercise.

Vincent · « **Reply #20 on:** February 25, 2007, 03:29:11 AM »

Sorry to go off topic, but everytime I see this thread's title I keep misreading it as "Mel C's homework" :-P

Karlos · « **Reply #21 on:** February 25, 2007, 10:55:23 AM »

Quote

Vincent wrote:
Sorry to go off topic, but everytime I see this thread's title I keep misreading it as "Mel C's homework" :-P

That would be learning how to sing. Preferably, in tune.

mel_zoom · « **Reply #22 on:** February 25, 2007, 09:36:05 PM »

Well, here is the first program :-)

It seems to work fine on all the text files I gave it so far - including some "control" ones not containing anything some containing only space or 1 word and so on.

Whilst writing it I realised that I could simplify the starting state of the word counting loop by assuming we start with whitespace and no words. That way you can let the normal loop logic decide what to do with the first character read straight away.

Code: [Select]


/*
  Word counting program, stage 1
*/

#include <stdio.h>

/* prototype */
int countwords(FILE* file);

/* main program */
int main(int argc, char** argv)
{
	const char* filename = 0;
	FILE* file = 0;
	int numwords = 0;

	/* get the filename from the command line*/
	if (argc<2) {
		puts(&quot;Usage: wordcount <file name>&quot;);
		return 1;
	}

	/* try to open the file */
	filename = argv[1];
	file = fopen(filename, &quot;r&quot;);
	if (file == NULL) {
		printf(&quot;Error: couldn't open file '%s' for input\n&quot;, filename);
		return 1;
	}

	/* count and display */
	numwords = countwords(file);
	printf(&quot;Counted %d word(s) in file '%s'\n&quot;, numwords, filename);

	/* all done */
	fclose(file);
	return 0;
}

int countwords(FILE* file)
{
	/*
		Looks for whitespace <-> non whitespace changes.
		Every whitespace -> non whitespace change is considered a
		new word.
	*/
	int numwords = 0;
	int whitespace = 1;
	int character;

	character = fgetc(file);
	while(character != EOF) {
		switch (character) {
			case ' ':
			case '\t':
			case '\n':
			case '\r':
				whitespace = 1;
				break;

			default:
				if (whitespace == 1) {
					numwords++;
					whitespace = 0;
				}
				break;
		}
		character = fgetc(file);
	}
	return numwords;
}

Indentation too! - easy once you know how!

mel_zoom · « **Reply #23 on:** February 25, 2007, 09:45:55 PM »

I think I can already see how to go about solving stage 2.

If you put another variable "wordlength" initially set to 0 before the loop you could increment it in the "default" case of the switch.
Then in the whitespace part of the switch you could put an "if (whitespace==0)" test to see if we just entered the whitespace now. If that is true then whatever the current value of "wordlength" is would be the length of the last word read. You would then use this to increase the appropriate word counter and then reset it back to zero.

:idea:

Wain · « **Reply #24 on:** February 26, 2007, 12:58:22 AM »

This looks like a really solid beginner implementation of such an idea.

I would like to make a recommendation from my experience that you eventually may want to try to keep your program flow separate from your conditional code, while this is absolutely unimportant in a program as simple as your phase 1, keeping them together in phase 3 may result in a rather unwieldy bit of code that could wind up being difficult to read/follow by other programmers, or yourself 2 months later.

What I essentially mean by this is to put your switch statement into a separate function called something like is_whitespace(); to be called inside of your while-loop. Of course, you would also need to mildly restructure the implementation of your solution to do something like this.

Also...
What you may want to try at a more advanced stage is to rewrite it so that it loads a large portion of the file (or all of it) into memory and then processes it from there.

If you were working on large text files it would inevitably save time and reduce disk access...beyond being a good way to get comfy with malloc and basic pointer work.

Regardless of all of this...you've done some really good work so far! :-) :-) :-)

as far as phase 2 is concerned...
Don't forget to filter punctuation from your word-lengths.

Piru · « **Reply #25 on:** February 26, 2007, 01:28:08 AM »

That's not bad at all.

mel_zoom · « **Reply #26 on:** February 26, 2007, 10:03:46 AM »

Wain:

"What I essentially mean by this is to put your switch statement into a separate function called something like is_whitespace(); to be called inside of your while-loop. Of course, you would also need to mildly restructure the implementation of your solution to do something like this."

For stage 2 it might be better. I could have a function that returns the length of the next word in the file. The switch statement could then contain all the punctuation characters to ignore in the same case range as the whitespace.

Karlos · « **Reply #27 on:** February 26, 2007, 10:35:47 AM »

Simple and clean. The character by character IO could be improved upon, as Wain suggests, but there are no obvious logical errors in the code, except one. If the text file contained any spurious punctuation that was surrounded by whitespace, they'd appear to be words.

Eg:

Hello!

would appear to be 2 words, rather than one.

Of course, this error depends on what your specification defines as a word. As I didn't actually say, I'll have to let you off ;-)

falemagn · « **Reply #28 on:** February 26, 2007, 01:11:53 PM »

Quote

Also...
What you may want to try at a more advanced stage is to rewrite it so that it loads a large portion of the file (or all of it) into memory and then processes it from there.

That's unnecessary, as she's using buffered I/O file access which basically does what you're suggesting she should do.

One thing I would suggest, instead, is to use ctype.h's isspace() function (or macro, depending on its implementation).

Jupp3 · « **Reply #29 from previous page:** February 26, 2007, 01:23:26 PM »

Quote

if (whitespace == 1)

If 0 and 1 (non-zero) are the only different values you need, you can always just do if(whitespace) (which in this case pretty much equals if(whitespace=1)) - on many (most?) CPU's it's faster and more compact to check whether or not a value is zero, rather than checking if it equals some other value. Basically this is also true for if(a==0), but probably many compilers can optimize that to if(!a)

As you probably have already noticed, this is often used as a way of error checking. Consider result=malloc(size); - if the allocation failed, you get NULL pointer. Personally whenever possible, I also use the same approach in my code. Another variant is to use zero as success, and non-zero as error code.

Also don't forget to avoid doing for(i=0; i

Author Topic: My C homework (Read 6434 times)

Karlos

Re: Mel's C homework

MskoDestny

Re: Mel's C homework

SamuraiCrow

Re: Mel's C homework

MskoDestny

Re: Mel's C homework

Karlos

Re: Mel's C homework

Cymric

Re: Mel's C homework

Vincent

Re: Mel's C homework

Karlos

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

Wain

Re: Mel's C homework

Piru

Re: Mel's C homework

mel_zoom

Re: Mel's C homework

Karlos

Re: Mel's C homework

falemagn

Re: Mel's C homework

Jupp3

Re: Mel's C homework