Welcome, Guest. Please login or register.

Author Topic: My C homework  (Read 14303 times)

Description:

0 Members and 1 Guest are viewing this topic.

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« on: February 24, 2007, 01:10:07 PM »
@Mel

As an algorithm it looks fine. Mgerics has a point, however. Depending on your main loop, the end of file character might be intercepted as the start of a new word should there be any whitespace at the end of the file. Look out for that when designing your loop.

As I said, the objective in this case isn't to see how well you can factor the code into lots of utility functions, it is an exercise in solving a problem. You can probably manage the entire thing within main() relatively well.

The second exercise is more likely to benefit from factoring into functions and you should give some thought to that.

The third exercise is for once you have an understanding of structures and memory allocation / deallocation.

Hint: have a structure that at least consists of a char* for the word, a count value. You will almost certainly need to factor that code into functions that perform specific actions on these structures, such initializing them (and allocating space to store the word) and finalizing them (deallocating the space used to store the word). You will need to manage a collection of structures, one for each unique word.

A second hint: comparing integer values is much, much faster than comparing strings with strcmp()/stricmp(). Consider extending your structure to contain a second integer that is some value computed from the word itself (look up hash functions) and base your comparisons on that. Who knows, you might even take that to its logical conclusion and actually implement your collection of word count structures as a hash map ;-)

That's some time off yet though.
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #1 on: February 24, 2007, 02:27:43 PM »
Quote

Piru wrote:
- It should not be limited to some fixed size maximum word length.
- It should not get exponentally slower when number of different words grows.


A hash map satisfies the second point nicely. That would be trivial to implement in C++/STL but it requires some thinking about to do from first principles in C. But then, that is the aim of the exercise.
Certainly you don't want to be iterating over the entire list of known words for each word read.
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #2 on: February 24, 2007, 03:00:19 PM »
True enough, there are lots of containers that would cope with this scenario.

Never underestimate the raw speed of the basic hash map though. It so happens I have to do a lot of lookup stuff at work and you'd be surprised how much slower the binary tree search was for this class of application ;-)

Of course, it depends on your hash function. If it takes longer to compute than it does to iterate n levels down the tree (or trie), chances are it will be slower. But that depends on many factors, such as data locality, cost of multiple pointer accesses and so on. Basically all beyond the scope of the exercise :-D

The trie, however would be an excellent choice for this (I use this algorithm, slightly modified, to find the best match prefix for a telephone number in order to determine which price band it belongs to).
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #3 on: February 24, 2007, 03:19:55 PM »
Quote

but also still compare the strings.


Actually, the fast hashmap implementation compares only the hash keys and not the strings (or whatever object you are using), on the understanding that they are as unique as the entity from which they are created. The modulus of the hash by the container size is used to determine the first location of the entry. Each entry also contains the entity's hash as well as the entity. Due to the hash%size operation, you will tend to hit something unrelated once it gets too full.

If you are lucky, the first entry you hit is the one you are looking for (which you determine by comparing your hash with the entity's hash).

If you are unlucky, you might have to step a few left and right until you find it. This is given the assuming storage algorithm will try to find the closest empty slot to the initial target.

This type of algorihm is discussed in Bjarne Stroustrup's C++ 3rd ed as an exercise in container design ;-)

On every system I've tried, this method of hashed lookup outperforms the trie for longish strings when the hash map is generally less than 60% full. This can be accounted for simply though cacheing alone. Once you get past 60%, efficiency starts to fall off. You'd generally design the container to expand when it hits a load of say 75%.

Of course, the fundamental flaw in this is that you can get into trouble if two keys collide :-D

In our experiment here, the trie is likely to be faster since we probably arent looking for excessively long strings. The place you'd most want to ensure you optimised is in node allocation. You really dont want to be allocating thousands of small nodes individually for the trie.
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #4 on: February 24, 2007, 07:19:30 PM »
Perhaps we should keep the optimal container design chatter down a bit, lest we deter the student ;-)

We should keep this for after a successful attempt of stage 3, where discussing the structure and performance of rival containers would make more sense.
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #5 on: February 25, 2007, 10:55:23 AM »
Quote

Vincent wrote:
Sorry to go off topic, but everytime I see this thread's title I keep misreading it as "Mel C's homework" :-P


That would be learning how to sing. Preferably, in tune.
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #6 on: February 26, 2007, 10:35:47 AM »
Simple and clean. The character by character IO could be improved upon, as Wain suggests, but there are no obvious logical errors in the code, except one. If the text file contained any spurious punctuation that was surrounded by whitespace, they'd appear to be words.

Eg:

Hello!

would appear to be 2 words, rather than one.

Of course, this error depends on what your specification defines as a word. As I didn't actually say, I'll have to let  you off ;-)
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #7 on: February 26, 2007, 01:26:08 PM »
@falemagn

Even though the stdio buffers the file, the overhead of repeatedly calling fgetc() can make a difference compared to iterating over an array of characters that you effectively have direct access to. You might notice the benefit on an 030 IDE system like Mel's.

Of course, this is only if you wanted to optimize the code as much as possible - but then, a good implementation might have an inlinable version of fgetc().
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #8 on: February 26, 2007, 01:40:41 PM »
Quote

falemagn wrote:

One thing I would suggest, instead, is to use ctype.h's isspace() function (or macro, depending on its implementation).


Actually, I think the idea of differentiating "word" characters from "non-word" characters would be more effective. You'd probably want to use isalnum() for that one as it only returns non-zero for letters or numbers.
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #9 on: February 26, 2007, 02:58:37 PM »
@falemagn

Quote
I was merely suggesting a way to do without the 'case' construct she used.


I agree that's the way forward but I also thought the case construct showed an exellent understanding of the underlying logic (for someone not acquainted with all the features of the c standard library) which is what I was looking for when I set the exercise :-)
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #10 on: February 28, 2007, 06:28:04 PM »
@Mel

Since your 030 has only 256 bytes of data cache, you might want to redesign your character table like so:

Code: [Select]

static unsigned long chartypedata[8] = { 0 };

#define isword(c) (chartypedata[(c)>>5]&(1UL<<((c)&31UL)))
#define setbit_isword(c) (chartypedata[(c)>>5] |= (1UL<<((c)&31UL)))
#define clrbit_isword(c) (chartypedata[(c)>>5] &= ~(1UL<<((c)&31UL)))


When you are filling your table initially, use the setbit_isword(character) macro.

If, for whatever reason, you need to clear an entry, use the clrbit_isword(character) macro.

The point of this code is that it uses just one bit to represent the "isword" status for each character. You therefore get a table that is only 32 bytes rather than 256.

One slight difference to be wary of is that you can only depend on this version of isword() to return 0 if false and something else if true (it could be any power of 2 value from 1 to 2^31) but that wont affect your logic.
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #11 on: March 01, 2007, 01:30:04 PM »
Quote

Smack wrote:

Depending on the number of wait states caused by the cache-miss during byte array lookup (alternative 1), even the hand-optimised bit array lookup (alternative 3) may be slower due to the extra instructions, let alone the C version (alternative 2). If the code runs on a 68040 or 060 then things will look even worse for the bit array...


It was faster than the basic char array on my 040. I reasoned that it may well be faster on the 030 but as you say, mileage varies ;-)

Quote
I think this kind of performance tweaking is way off-topic for a beginners session in C, anyway.


Absolutely :-D
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #12 on: March 01, 2007, 04:54:27 PM »
I'll try a hand optimised assembler implementation of each one later.

The speed was based on testing a very large string (several MB worth) in memory. The speed increase wasn't accurately measurable when reading from a file (even with buffering) since IO overhead tends to be the limiting factor in both cases.

Perhaps you are looking at the problem in too small a scope. The source data also enters the cache. In this case, you are reading something that is effectively streaming (from memory) which will involve cache line transfers to boost the speed of successive character accesses, but fundamentally, this will pollute the cache with values from the string that are only ever used once (this is a common problem with stream style processing).
Therefore, perhaps the smaller bit-table is less affected due to the fact that it's both smaller and fact that a lot of the time the same 32-bit word will be repeatedly hit for many different characters.

Its also possible that the compiler didnt generate code for the byte table lookup that was especially optimal compared to the bit lookup.
int p; // A
 

Offline Karlos

  • Sockologist
  • Global Moderator
  • Hero Member
  • *****
  • Join Date: Nov 2002
  • Posts: 16878
  • Country: gb
  • Thanked: 5 times
    • Show all replies
Re: Mel's C homework
« Reply #13 on: March 02, 2007, 08:51:19 PM »
Quote

falemagn wrote:
Quote

mel_zoom wrote:

I thought the look of perplexed frustration would be apt for when I start getting into more realistic C code :-)


Except that you don't look either perplexed nor frustrated, in that picture. ;-)


It is a lot better than the look of sullen boredom I was wearing at that presentation :lol:
int p; // A