@whoosh777
Heh, we're not arguing, just comparing chip implementations. I still say, considering your enthusiasm for the idea, go for it! :-D
Anyhow, back to that circle :-D
1. small number of volatile scratch registers,
2. the first function arguments in non-scratch registers: if function arguments
are in scratch registers then those registers are no longer scratch!
(scratch registers are short-term breathing space),
3. further arguments to the stack,
eg on 68k:
a0,a1,d0,d1 as scratch, and function arguments f(d2,a2,d3,a3)
if you have too many function arguments in registers you run out of breathing space: the called function has to start using the stack to free up registers for internal use,
also, the calling function may already be using those registers for something else, so it has to back them up somewhere,
and if you have too many scratch registers it's wasteful,
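Just to make sure we're picturing the same scheme, here's how I read it in C terms (the register assignments are from your 68k example; the compiler behaviour in the comments is my assumption):

int callee(int x, int *p, int y, int *q)  /* x -> d2, p -> a2, y -> d3, q -> a3 per your scheme */
{
    /* only d0,d1,a0,a1 are scratch: anything needing more than a couple of
       temporaries forces the callee to spill to the stack or to save and
       restore some of d2-d7/a2-a6 itself */
    return x + y + *p + *q;
}

void caller(void)
{
    /* and if the caller already has live values in d2,a2,d3,a3 it has to
       back them up somewhere before loading the arguments */
    static int m = 3, n = 4;
    callee(1, &m, 2, &n);
}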
I sincerely suggest you read the PPC open ABI specification to get a better understanding of how it works. The issues you raise almost never occur because of how the ABI is laid out. Stack-based calling is rare (unless you have more than 8 integer/pointer parameters). Those registers never need to be backed up before a call, because the compiler won't use them to hold anything that needs to survive a call.
As you yourself point out, 4 registers are usually enough for most purposes. With 32 registers, some of which are reserved for the stack, the global data section etc., the compiler can almost always find somewhere to store a variable without using the stack.
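To make that concrete, here's a trivial call the way a typical PPC SysV-style compiler treats it (the register numbers are from the ABI; the rest is my sketch of typical code generation):

int sum3(int a, int b, int c)   /* arguments arrive in r3, r4, r5 - all volatile */
{
    return a + b + c;           /* result goes back in r3, no stack frame needed */
}

int wrapper(int x)
{
    /* x can sit in a non-volatile register (r14-r31), so it survives the
       call to sum3() without any shuffling around the call site */
    return sum3(x, x + 1, x + 2) + x;
}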
Even then, the compiler can spot which volatile registers don't get hammered across calls in the same translation unit. Same with function parameters: it can see which ones 'survive' the call, e.g. const arguments etc.
When I look at well optimised PPC code generated from a good compiler, I see very little stack usage, very little register backup etc.
Now, if you think back to your "stack cache" idea: a large register file, of which the programming standards say "this half is volatile" etc., actually gives you the same functionality.
Large register files are fantastically useful for breaking down complex expressions. All the examples you have posted so far tend to deal with linear code, doing fairly simple arithmetic.
x = a + b * (c + (d*d)); etc.
Chuck in function calls (some of which may be inlined), multidimensional pointer dereferences (which may be used several times in one expression), etc., and more and more volatile registers become useful for holding temporaries that may be needed more than once.
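A made-up fragment of the kind of thing I mean (all the names are invented for the example):

double weight(int i, int j);    /* imagine it defined elsewhere, possibly inlined */

void relax(double **grid, double **out, int i, int j)
{
    /* grid[i][j] and grid[i][j+1] are each needed twice; with registers to
       spare, the compiler keeps them (and the call result) in registers
       rather than re-reading memory or spilling to the stack */
    double cell  = grid[i][j];
    double right = grid[i][j + 1];
    double w     = weight(i, j);
    out[i][j] = w * (cell + right) + (1.0 - w) * cell * right;
}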
For a different example of why large register sets are handy: a small interpretive emulator I wrote as a thought experiment (for a theoretical bytecode CPU) uses several function lookup tables. One for opcodes, one for effective address calculation and one for debugging traps etc.
There is a data structure for the "core", containing a register file, stacks (separate ones for data, register backup and function calls) and a code pointer (the PC, but expressed as an absolute address) etc.
Code is executed in blocks, until either a certain number of statements has been executed or a particular trap (or break signal) has been invoked.
Careful design of the code allowed each function table base, the virtual register file, the virtual stack pointers and the code pointer to persist in the same registers throughout a call to execute(), without needing to be moved, saved etc. across all the calls incurred during execution of the bytecode. Given that we are talking about possibly millions of calls, that saving is considerable.
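To give a flavour of the shape of it (everything here is heavily cut down and the names are invented for this post):

#include <stdint.h>

struct core {
    uint32_t  reg[16];           /* virtual register file                     */
    uint32_t *dsp, *rsp, *csp;   /* data, register-backup and call stacks     */
    uint8_t  *pc;                /* code pointer, held as an absolute address */
    int       trapped;
};

typedef void (*op_fn)(struct core *);

extern op_fn opcode_tab[256];    /* one handler per opcode          */
extern op_fn ea_tab[16];         /* effective address calculation   */
extern op_fn trap_tab[32];       /* debugging traps etc.            */

void execute(struct core *c, int max_ops)
{
    /* the core pointer, the table bases and the counter can all live in
       non-volatile registers for the whole loop, so the millions of
       indirect calls below never force them to be saved or reloaded */
    while (max_ops-- > 0 && !c->trapped)
        opcode_tab[*c->pc++](c);
}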
Regarding the x86 internal RISC issue, I never said that it was better or worse than "up front" RISC.
But we can infer 2 things:
1) A RISC style core clearly makes sense as virtually every CPU manufacturer is using it.
2) Your assumption that the "external CISC" style approach of x86 could be better based on code density is very difficult to judge. You have to consider that the code decomposition into the internal micro-op language is far from simple to achieve. The design of these cores is fantastically complicated, gobbling up silicon like nobody's business.
The problem is, it's not a simple linear process where x86 instruction "X" is always decomposed into micro-op codes "a b c". The early stage of decode may work this way, but once it has to start processing "a b c", it almost works like a mini compiler, looking to see which rename registers are/will become free, which instructions have dependencies etc. All this takes clock cycles - which is partially why modern x86 CPUs have such very long pipelines and "time of flight" for instructions.
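Purely as an illustration (a toy picture, nothing like the real hardware structures), here is what a single read-modify-write x86 instruction decomposes into, and the dependency chain the scheduler then has to juggle at run time:

/* add [ebx], eax  ->  t1 = load [ebx];  t2 = t1 + eax;  store [ebx], t2 */
enum uop_kind { UOP_LOAD, UOP_ADD, UOP_STORE };

struct uop {
    enum uop_kind kind;
    int dst;         /* rename register written, -1 if none */
    int src1, src2;  /* rename registers read,   -1 if none */
};

static const struct uop add_mem_eax[3] = {
    { UOP_LOAD,   1, -1, -1 },  /* t1 <- memory                            */
    { UOP_ADD,    2,  1,  0 },  /* t2 <- t1 + eax (eax already renamed t0) */
    { UOP_STORE, -1,  2, -1 },  /* memory <- t2                            */
};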
The above work is largely non-existent for up-front RISC designs because it's a compile-time issue. They still have to worry about rename registers and such, but compile-time instruction scheduling has made their life a lot easier.
In fact, part of the whole point of RISC is that it makes the CPU's life easier by making the compiler's life more difficult :-D
Now, the present x86 designs use the above internal RISC approach not because they thought "Hmm, this is better than those young whippersnapper RISC CPUs", but because they *had* to keep x86 object code compatibility and the newer RISC-style processors emerging were seriously threatening them.
You only have to look back at the time when x86 introduced these RISC-style cores and DEC still made the Alpha AXP. We had several Windows NT workstations in our spectroscopy labs at Uni; one was running the latter at 266 MHz, and we had a newer P-II at 300 MHz with its spanking new "internal RISC-style core" - and the Alpha still stuffed it :-D