Author Topic: AROS SMP Research: Technical Discussion (Read 11335 times)

psxphill · « **on:** August 22, 2013, 06:12:55 PM »

Quote from: Ezrec;745841

In that case, you (the programmer) need to update your code anyway, since you could have gotten pre-empted right before the SendMsg()/Signal() and lost that port, even on AmigaOS 3.x

You need a Forbid() around the port lookup and the SendMsg(), but you don't want the Forbid() to wait for all the other CPUs to finish their quantum. When the Forbid() happens the other CPUs need to stop what they are doing immediately.
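A minimal sketch of that lookup-then-send pattern. So that it builds anywhere, the `sim_*` functions are hypothetical stand-ins for exec's Forbid(), Permit(), FindPort() and PutMsg(); on a real system you would call those instead. The point is only that the lookup and the send sit inside one forbidden region.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for exec.library calls, not the real API. */
struct sim_port { const char *name; int queued; };

static int sim_forbid_nest = 0;
static struct sim_port *sim_public_port = NULL;  /* at most one public port */

static void sim_forbid(void) { sim_forbid_nest++; }
static void sim_permit(void) { assert(sim_forbid_nest > 0); sim_forbid_nest--; }

static struct sim_port *sim_find_port(const char *name)
{
    if (sim_public_port && strcmp(sim_public_port->name, name) == 0)
        return sim_public_port;
    return NULL;
}

static void sim_put_msg(struct sim_port *port) { port->queued++; }

/* Look the port up and send while task switching is forbidden, so the
   port's owner cannot remove it between the lookup and the send.
   Returns 1 if the message was queued, 0 if the port was not found. */
static int safe_put_to_port(const char *name)
{
    int sent = 0;
    struct sim_port *port;

    sim_forbid();
    port = sim_find_port(name);
    if (port) {
        sim_put_msg(port);
        sent = 1;
    }
    sim_permit();
    return sent;
}
```

Calling `safe_put_to_port("demo.port")` returns 1 only while a port of that name is registered, and the nesting count is back to zero afterwards.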

psxphill · « **Reply #1 on:** August 22, 2013, 06:57:01 PM »

Quote from: Ezrec;745846

And I think *I* have something wrong. Michal Shulz did some rough performance calculations, and even though my method (wait for quantum to expire) is semantically correct, the performance penalty is terrifying.

Yeah that was my point. Making forbid wait for the other cpus will mean that these four lines of code will take over 1 task quantum.

Forbid();
Permit();
Forbid();
Permit();

The first forbid() will take anywhere from nothing to 1 task quantum depending on how it aligns with the other cpu's tasks.

Stopping the other CPUs immediately will have some performance penalty. Even though that is much higher than the overhead in AOS 3.1, it should be nowhere near a quantum.

You also don't want the other CPUs' tasks to lose their quantum when another CPU does a Forbid(). Their quantum should be extended by the length of time they were suspended.

Rather than signalling the other cpu, it might be enough to actually stop them. The performance might depend on architecture, plus I don't know how you're abstracting all this stuff, so either way might make more sense.

psxphill · « **Reply #2 on:** August 22, 2013, 07:44:41 PM »

Quote from: Ezrec;745855

Michal's planning on using IPI to signal the other CPUs to stop (on x86 SMP, there isn't some "magic register" you can use to stop other CPUs, you have to ask them nicely), but it's a lot faster than waiting until they reach a Switch()/Dispatch() point.

Cool, that should work better.

Do you think it makes sense for the time the CPU is suspended not to count towards the current task's quantum? Otherwise the fairness will depend on what is running on the other CPUs, and you could get one task that is permanently starved in pathological cases. If it's got its own timer that fires when the quantum is up then it might just be a case of pausing it, but if you can only stop it you'd need to keep track of the current time left and use that when you start the CPU again.
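A minimal sketch of the "keep track of the time left" variant. The struct and function names are hypothetical, and times are plain microsecond counters rather than anything read from real timer hardware.

```c
#include <assert.h>

/* Hypothetical per-CPU bookkeeping; all times are in microseconds. */
struct cpu_slice {
    unsigned long quantum_end;  /* absolute time the current quantum expires */
    unsigned long remaining;    /* time left, valid only while suspended */
    int suspended;
};

/* A task starts a fresh quantum on this CPU. */
static void slice_start(struct cpu_slice *s, unsigned long now,
                        unsigned long quantum)
{
    s->quantum_end = now + quantum;
    s->remaining = 0;
    s->suspended = 0;
}

/* Another CPU did a Forbid(): remember how much quantum was left. */
static void slice_suspend(struct cpu_slice *s, unsigned long now)
{
    assert(!s->suspended);
    s->remaining = (now < s->quantum_end) ? s->quantum_end - now : 0;
    s->suspended = 1;
}

/* The Forbid() ended: the task gets back exactly what it had left. */
static void slice_resume(struct cpu_slice *s, unsigned long now)
{
    assert(s->suspended);
    s->quantum_end = now + s->remaining;
    s->suspended = 0;
}
```

With a 20000 us quantum started at time 0, a suspend at 5000 and a resume at 9000, the quantum now ends at 24000, so the 4000 us spent halted is not charged to the task.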

psxphill · « **Reply #3 on:** August 23, 2013, 12:12:19 AM »

Quote from: Zac67;745872

Sorry if I'm a bit naive here - but what would be the problem with leaving the other cores running on a Forbid() as long as they stay in userland?

Because Forbid() is used to protect userland data structures shared between tasks too. As those tasks might be running on different CPUs, you have no choice but to stop the other CPUs.

psxphill · « **Reply #4 on:** August 23, 2013, 12:24:31 AM »

Quote from: minator;745861

It might be possible to show a nice speedup on some long running highly parallelisable benchmark but that's it.

It doesn't have to be parallelisable, you could have two independent algorithms running on each cpu core.

Of course it needs to be long running, because if it ran in a short amount of time then an 8 MHz 68000 would be enough.

If you only have one task that is CPU bound then SMP can't help you, it doesn't matter what OS you use.

Quote from: minator;745861

In any real system apps will be constantly stalling the system and you don't need to be Gene Amdahl to know what the result will be.

If one of the tasks spends all its time in a Forbid() then there will be no point in using SMP, but I don't believe this is a common case.

If it spends 1% of its time in Forbid() then you will lose 1% of each CPU core; if it spends 20% of its time in Forbid() you will lose 20% of each CPU core.
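As a worked version of that arithmetic, here is a small sketch. It reads "each CPU core" as each core other than the one holding the Forbid(), which is an assumption on top of the post: the forbidding core keeps working while the others idle. The function name is made up for illustration.

```c
#include <assert.h>

/* Usable capacity in "percent of one core", assuming (hypothetically) that
   the core holding the Forbid() keeps working while every other core sits
   idle for forbid_pct percent of the time. */
static unsigned usable_capacity_pct(unsigned cores, unsigned forbid_pct)
{
    assert(cores >= 1 && forbid_pct <= 100);
    return 100u + (cores - 1u) * (100u - forbid_pct);
}
```

Under that model four cores with 20% of the time in Forbid() still give 340% of one core, while 100% of the time in Forbid() collapses to the single-core figure of 100%.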

The forbid issue is not ideal, but I think you're overestimating the amount of time it will spend in forbid state.

Quote from: Ezrec;745870

I'm investigating a spinlock-style SignalSemaphore that has a lower latency for protecting frequently used internal data structures in Exec.

I'd get this working first, once you've got it working properly then you can see what effect the Forbid() has. Changing how the Exec structures are protected will mean changing applications too.

SMP 68000 is more likely to happen in an emulator than it is in hardware.

psxphill · « **Reply #5 on:** August 23, 2013, 02:32:55 AM »

Quote from: matthey;745867

How are you handling the ENABLE/DISABLE FORBID/PERMIT macros (ables.i) that increment and decrement the ExecBase IDNestCnt and TDNestCnt?

That won't work anymore.

AFAICT that is from Commodore's includes and isn't in AROS, so no software built for AROS should legitimately be manipulating that field in ExecBase already.

If someone creates a SMP 68k machine then they'll need to decide how to support binaries that do that. They could add hardware that checks for a write to that location and make it store the value in another register and cause an interrupt on the relevant cpu, in the interrupt handler it can check the value and call forbid or permit.

It's not a problem that needs solving yet anyway.

psxphill · « **Reply #6 on:** August 23, 2013, 02:57:54 AM »

Quote from: bloodline;745901

It would be a little bit weird and amazing to have Carl Sassenrath contribute to AROS

He's unlikely to have thought about exec in nearly 30 years. If he has any sense he's forgotten everything he knew about it.

There is unlikely to be any major design work left right now, although there might be some minor design work depending on what is found during coding/testing.

Coding, testing and fixing the current design and then testing the speed to see whether any changes are required is the current goal.

Even if it wastes 10% of each core then SMP could still have a big win.

However moving data between cpu cores might have an overhead, so sharing tasks across cpu's might not be the best strategy. It might make more sense to saturate a cpu and only spin up another cpu if there are still more tasks ready.
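A minimal sketch of that "saturate one CPU first" policy, purely as an illustration. The function name and the one-task-per-core rule are assumptions, not anything from AROS.

```c
#include <assert.h>

/* Hypothetical policy: keep one core busy and wake one more core for each
   additional ready task, up to the number of cores present. */
static unsigned cores_to_activate(unsigned ready_tasks, unsigned total_cores)
{
    assert(total_cores >= 1);
    if (ready_tasks <= 1)
        return 1;  /* a single ready task never needs a second core */
    return ready_tasks < total_cores ? ready_tasks : total_cores;
}
```

A real scheduler would also have to weigh the cost of moving a task's data to another core, which is exactly the part that needs measuring.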

Only when the simple implementation is done can you get enough information to make those decisions. It's complex enough that guessing isn't easy.

Quote from: takemehomegrandma;745915

I merely asked matthey to clarify his "You have already proved some people wrong with your experiments" statement.

Some people said you couldn't do SMP with exec. Technically he hasn't proved them wrong as he's moved fields out of execbase, which was the only reason you can't do SMP with exec. His plan is to avoid the theoretical discussions of the implications of that and just try to code it, often this is the only way to solve an argument & it's actually how AROS came to exist.

Once you have an implementation then you have a baseline. After evaluating it for any drawbacks you can try to address those, and when you test again you can know whether it's better or worse. The problem with arguing over technical concepts is that it is very difficult to judge their merit. Until you see the code running it's unlikely you'll have any idea what the cache implications are of using SMP etc.

psxphill · « **Reply #7 on:** August 23, 2013, 12:35:10 PM »

Quote from: itix;745940

Oh, btw... you have to consider that other CPUs can call forbid not just that first one. When adding more CPUs chances to be in forbid state increases.

Tell me how you will find out what the percentage of time various software will spend in forbid without trying it?

Quote from: itix;745940

Problem could be demonstrated using a silly pingpong task sending a message back and forth constantly. Because sending a message requires forbid, that task could easily render other cores useless. Even when running at low priority it would disrupt higher priority tasks on other cores, due to forbid/disable semantics.

You have always been able to write software for AmigaOS which disturbs high priority tasks. If it turns out that this is a problem that needs solving then you could try changing it so that the CPUs will only ever be running tasks that have the same priority. As high priority tasks are supposed to run for a short period of time, wasting the other cores during that time may not be a big deal. If you have a high priority task on AmigaOS that takes a long time then the system becomes unusable (standard priority tasks like Workbench won't be allowed to run at all). The priority in AmigaOS is quite fine grained (-128 to 127 IIRC), which means you could also derail this by spreading tasks across as many priority levels as possible. But there is no reason why software that can take advantage of SMP shouldn't have limitations (like all tasks that run at the same time having to run at the same priority).
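A minimal sketch of the "same priority only" rule, assuming the scheduler is handed a plain array of ready-task priorities. The function name is hypothetical; the signed 8-bit type matches the -128 to 127 range mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many ready tasks may run at once under the (hypothetical) rule
   that every running task shares the highest ready priority. */
static size_t pick_same_priority(const signed char *pri, size_t n, size_t cpus)
{
    size_t i, count = 0;
    signed char top;

    if (n == 0 || cpus == 0)
        return 0;
    assert(pri != NULL);

    /* Find the highest priority among the ready tasks. */
    top = pri[0];
    for (i = 1; i < n; i++)
        if (pri[i] > top)
            top = pri[i];

    /* Hand out cores only to tasks at that priority. */
    for (i = 0; i < n && count < cpus; i++)
        if (pri[i] == top)
            count++;
    return count;
}
```

With priorities {0, 0, 5, 0, -1} and four cores only one task runs, which is the wasted-cores cost the rule accepts while a high priority task is active.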

I understand about the Forbid() overhead, if something spins in a Forbid()/Permit() call then it could cause problems. But is that something that any software should need/want to do? We pretty much have the source to all AROS software at this point & this isn't going to affect AROS 68k.

There is no reason why the number of cores in use at a time couldn't be dynamic & when you're only using 1 core then the Forbid()/Permit() overhead could be reduced to current levels. If it can detect situations where the SMP implementation will help and which will hurt then you could always end up benefiting.
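A minimal sketch of that single-core fast path. Everything here is hypothetical: `ipis_sent` merely counts the inter-processor interrupts a real implementation would have to issue, and none of the names come from AROS.

```c
#include <assert.h>

/* Hypothetical scheduler state, for illustration only. */
struct smp_state {
    unsigned active_cores;  /* cores currently running tasks */
    int      forbid_nest;   /* nesting depth of Forbid() */
    unsigned ipis_sent;     /* stand-in for real inter-processor interrupts */
};

static void smp_forbid(struct smp_state *s)
{
    assert(s->active_cores >= 1);
    /* Only the outermost Forbid() has to halt anyone, and only when other
       cores are actually running. With one active core this is just a
       counter increment, as on a non-SMP system. */
    if (s->forbid_nest == 0 && s->active_cores > 1)
        s->ipis_sent += s->active_cores - 1;
    s->forbid_nest++;
}

static void smp_permit(struct smp_state *s)
{
    assert(s->forbid_nest > 0);
    s->forbid_nest--;
    /* A real implementation would release the halted cores here. */
}
```

With one active core no interrupts are counted at all; with four, a nested pair of Forbid() calls costs three, once.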

Technologies like Intel Turbo Boost benefit from only using as many cores as necessary, i.e. when you're only using 1 core it can boost the clock speed, but when you're saturating all CPU cores it drops back to the default (some chips can sustain constant boosting, but in a laptop you'd want to minimize it for power usage).

psxphill · « **Reply #8 on:** August 23, 2013, 02:31:54 PM »

Quote from: ChaosLord;745953

New software written for a new OS, such as a New AROS or new MorphOS can use semaphores to access the various protected OS structures.

I don't think this should even be considered unless as an absolute last resort. Making that compromise without knowing what the benefits are would be a mistake.

Quote from: warpdesign;745959

@Itix: why do we need to halt multitask by using enable/disable ?

Blame Carl Sassenrath.

Quote from: warpdesign;745959

How do OS that support real SMP work ? I mean: what's the main difference with AmigaOS and "modern" OS ?

AmigaOS has a lot of design mistakes in it, which didn't matter so much on a games console from the early 1980s that would be around for a few years.

Worrying about not being able to use every ounce of CPU power when using SMP is a mistake. Windows/Linux has a high latency on a lot of its API calls.

Quote from: itix;745954

Getting accurate results can be difficult but it could be profiled. At least how many calls to Forbid() or Disable() there are per minute...

The number of calls is not the metric you need. It's how long it spends in Forbid(). You could make one call and stay in Forbid() for 99% of time, or 10 calls and only stay in Forbid() for 1% of time. This affects how much of each CPU you'll lose. The overhead of stopping and starting each cpu would also need to be taken into account, however this becomes even more of a problem to calculate because unless you've written the code and tested it you don't even know what the overhead will be. Plus just counting instructions doesn't help as modern CPU's are way too complex.
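A minimal sketch of measuring time held rather than calls made. The counters and function names are hypothetical, and timestamps are passed in as microseconds so the example does not depend on any particular clock.

```c
#include <assert.h>

/* Hypothetical counters; timestamps are microseconds from any monotonic clock. */
struct forbid_stats {
    unsigned long calls;       /* number of Forbid() calls seen */
    unsigned long held_us;     /* total time spent with switching forbidden */
    unsigned long entered_at;  /* when the outermost Forbid() was taken */
    int nest;                  /* current nesting depth */
};

static void stats_forbid(struct forbid_stats *st, unsigned long now)
{
    st->calls++;
    if (st->nest++ == 0)
        st->entered_at = now;  /* time only the outermost region */
}

static void stats_permit(struct forbid_stats *st, unsigned long now)
{
    assert(st->nest > 0);
    if (--st->nest == 0)
        st->held_us += now - st->entered_at;
}

/* Share of the elapsed time spent forbidden, as a whole percentage. */
static unsigned long held_percent(const struct forbid_stats *st,
                                  unsigned long elapsed_us)
{
    assert(elapsed_us > 0);
    return st->held_us * 100ul / elapsed_us;
}
```

One call held for 990000 us out of a second reports 99%, while ten calls of 1000 us each report 1%, so the call count alone says nothing about the cost.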

Quote from: itix;745954

Code: [Select]
sillypseudocode()
{
   SetTaskPri(SysBase->ThisTask, -128);
   PutMsg(port, msg);
   while (true)
      PutMsg(port, GetMsg(port));   /* bounce the message straight back */
}

I can play too.

Code: [Select]

sillypseudocode()
{
   Forbid();
   while (true);
}

Sure, there are pathological cases. The easiest way to speed up your program is for the user to not run it.

Quote from: itix;745954

Anyway, I just wanted to point out that biggest culprit is the OS itself and write some silly example

But you don't have any idea what the overhead of the biggest culprit is. It depends on how many messages are being processed, what work is done on each message.

The whole point of coding it was to avoid the constant arguments based on contrived examples & be able to see how real software that people might want to run will behave. It doesn't matter if it's not perfect, it's research. It could be derailed by something that nobody has considered.

psxphill · « **Reply #9 on:** August 23, 2013, 03:16:50 PM »

Quote from: wawrzon;745965

wait a minute. couldnt you come up with what it was? there is tremendous overhead on some aros68k operations as i see on a slow system and it would be great to identify most frequently called functions while it happens without doing profiling job, which im not able to.

You could, but it might lead you down the wrong path. If you have one function that is called 1000 times and takes 10ms per call, and another that is called once and takes 100s, then counting calls won't help.
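The same numbers as a small sketch, with made-up function names and a hypothetical profile record. Ranking by call count and ranking by total time pick different functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical profile record for one function. */
struct fn_profile {
    const char   *name;
    unsigned long calls;
    unsigned long ms_per_call;
};

static unsigned long total_ms(const struct fn_profile *f)
{
    return f->calls * f->ms_per_call;
}

/* Index of the function with the most calls. */
static size_t busiest_by_calls(const struct fn_profile *f, size_t n)
{
    size_t i, best = 0;
    assert(n > 0);
    for (i = 1; i < n; i++)
        if (f[i].calls > f[best].calls)
            best = i;
    return best;
}

/* Index of the function with the largest total time. */
static size_t busiest_by_time(const struct fn_profile *f, size_t n)
{
    size_t i, best = 0;
    assert(n > 0);
    for (i = 1; i < n; i++)
        if (total_ms(&f[i]) > total_ms(&f[best]))
            best = i;
    return best;
}
```

For 1000 calls at 10 ms (10 s in total) against one call at 100 s, the call count points at the first function and the total time points at the second.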

Profiling is the key, often bottlenecks show up in completely unexpected parts of the code. I've seen people spend time optimising code that when they'd finished made no perceivable difference, even though they could measure a 2x speed up in the function they sped up.

Some of the aros68k problems are caused by adding a level of abstraction to the graphics library, which wasn't designed to be as fast as it possibly could be, because an x86 was fast enough that you wouldn't care.

Also, due to the small or non-existent caches on 68k hardware, it's actually very hard to guess where the delays are going to be. For example, what you consider a good algorithm choice could end up thrashing the cache, while a less optimal design could end up being faster if its memory access patterns suit the cache better. Making aros68k faster will take a lot of research and effort. Just counting calls and then spending ten minutes rewriting the function with the most calls is quite dangerous: it might be slower in all cases except the one you tested, and even if you speed it up it could end up being broken. I'm cynical, though, after watching people do it repeatedly and fail (although they generally get to claim the credit before anyone finds out).

psxphill · « **Reply #10 on:** August 24, 2013, 01:43:17 AM »

Quote from: vidarh;746023

It certainly makes sense to prioritise *where* to clean up Forbid()/Disable() calls first based on profiling which ones actually hurt the most.

Removing Forbid() calls is a compromise because it will break software in ways that are difficult to detect, but if the speed up is huge and can't be achieved in any better way then there may be some justification.

If you're allowed to break software then it becomes a whole lot easier.

Quote from: itix;746016

You missed one very important difference. My example would run just fine on any traditional non-SMP Amiga OS system. Your example wouldnt.

Change Forbid() to SetTaskPri(task, 127) and you would have an example where SMP is superior to non-SMP system.

Your software doesn't do anything useful and neither does mine, I don't see a difference. However, when Windows started supporting multiple CPU cores there was software that ran worse on SMP systems than on single CPU systems. Until we can benchmark it, we can't tell if it's worth worrying about.

psxphill · « **Reply #11 on:** August 24, 2013, 09:12:14 AM »

Quote from: matthey;746038

There is hardware overhead in parallel processing and that's not even counting the software overhead.

My point is that there is always a software overhead. Saying forbid will have to go because of an unmeasured software overhead is putting the cart before the horse.

It's not like an ABI change where just a recompile is needed; it would require code audits and potentially redesigns. So changing from Forbid() to a semaphore should be avoided until it's known what the advantage is (and if there is indeed a visible advantage).

psxphill · « **Reply #12 on:** August 24, 2013, 06:25:04 PM »

Quote from: Karlos;746058

Of course. That said, since it's the OS you'd be working on, one would hope these are the ones that are easiest to find and engineer replacements for

Changing the OS will break the user programs too.

Quote from: Blizz1220;746075

I remember that when AMD went dual core it made a press release saying that they are working on single core simulator/emulator in hardware to fight the lack of software that only used single core ...

What happened to that project ?

Wasn't that a joke/fake? http://www.theinquirer.net/inquirer/news/1009078/reverse-hyperthreading-exist
http://forums.anandtech.com/archive/index.php/t-2198930.html
