Author Topic: Vampire V500 V2+ is now ready for preoders ! (Read 10675 times)

Niding · « **on:** December 05, 2016, 01:33:47 PM »

Quote from: adonay;817336

I tried to register for the Vampire 600 once, but when I try to check, it does not recognize any of my email addresses. If I try again, it says that I have already registered interest, probably based on street address. Since I have no confirmation email, there is no way of telling whether I have registered interest or not. This system could have been slightly better, to put it mildly. What if I misspelled my email address?

Do like I did, join their IRC channel and query. If Kipper is around, he usually responds to such queries.

irc.freenode.org
port 6667

Channelname: Apollo-Team

Niding · « **Reply #1 on:** December 06, 2016, 06:55:36 PM »

http://forum.apollo-accelerators.com/viewtopic.php?f=10&t=25

Depending on available time, I intend to outline in this thread what the SIMD extensions to the Apollo Core offer and how the functionality might apply to one coding problem or another. Please be aware that the extensions are a work in progress and might change without notice before an official release of a finished core (working title: Gold2).

AMMX, as Gunnar named it, is a 64 Bit SIMD extension. Apart from the fact that it shares the 64 Bit width with the MMX of a well-known company, the concept we followed is geared more towards the SIMD extensions in RISC architectures (AltiVec, Wireless MMX). In the current state of development, 32 registers are available for SIMD usage. These 32 registers include the well-known D0-D7 (extended to 64 Bit) and 24 new registers which are SIMD exclusive. This way, a lot of work can be done in registers, reducing the strain on memory reads and writes considerably.

Most instructions follow a 3 operand logic D = A op B, where the result of the operation between A and B is stored in any destination register D. It must be noted at this point that the input operand A doesn't have to be a register. Any effective address in 68k notation is allowed, including immediates. Allow me to show some examples at this point.

PADDW D0,D1,D2 ; 4x16 Bit addition D0+D1=D2
PADDW (A0),D1,D2 ; same, from memory (unaligned)
PADDW #$8100810081008100,D1,D2 ; add 4x16 Bit constant
PADDW.W #$8100,D1,D2 ; same as above, with implicit splat

The latter two code lines above demonstrate a convenient feature in AMMX. You can specify immediates in SIMD code as well, something you don't easily find elsewhere. The constants can be given in full 64 Bit. While this may be useful for some applications, the 64 Bit immediates result in instruction words of 12 Bytes. As an alternative, we added a second way of specifying constants. The .W syntax in the last of the example mnemonics triggers the implicit distribution of the immediate data word to all four 16 Bit slots. This way, the latter two instructions are identical in their arithmetic operation. The difference with implicit splat is a reduction of the instruction word to 6 Bytes.
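For anyone who wants to check the arithmetic on a PC, here is a small Python model of the two immediate forms. It is only a sketch: the helper names `splat16` and `paddw` are made up for illustration, and wraparound on overflow is an assumption, since the text above does not say how PADDW handles overflow.

```python
MASK16 = 0xFFFF

def splat16(word):
    """Replicate one 16-bit value into all four lanes of a 64-bit value."""
    word &= MASK16
    return (word << 48) | (word << 32) | (word << 16) | word

def paddw(a, b):
    """Lane-wise 4x16-bit add of two 64-bit values.

    Assumption: each lane wraps around modulo 2**16 on overflow.
    """
    result = 0
    for shift in (48, 32, 16, 0):
        lane = ((a >> shift) + (b >> shift)) & MASK16
        result |= lane << shift
    return result

d1 = 0x0001_0002_0003_0004
full = paddw(0x8100_8100_8100_8100, d1)   # models PADDW   #$8100810081008100,D1,D2
short = paddw(splat16(0x8100), d1)        # models PADDW.W #$8100,D1,D2
print(hex(full), hex(short))
```

Both calls produce the same 64-bit value, which is the point of the implicit splat: same arithmetic, shorter instruction word.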

These two concepts of 3 operand logic and immediates can help to save a number of move instructions that were common to 68k code.

In terms of data movement, two basic operations are supported: LOAD and STORE. While input data can be fetched through the <ea> given for one of the operands of an arithmetic operation, the destination is a register in the majority of instructions. Therefore, movement to memory needs to be done by STORE. Example:

LOAD (A0)+,D1 ;D1=64 bit from any memory location, A0=A0+8
PAVGB (A1)+,D1,D1 ;8x unsigned byte average (a+b+1)>>1
STORE D1,(A2)+ ;write result
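The rounding average in the PAVGB comment can be modelled in a few lines of Python. This is a sketch of the formula (a+b+1)>>1 only, not of the instruction's timing or encoding; `pavgb` is a hypothetical helper name.

```python
def pavgb(a, b):
    """Rounded average of eight unsigned bytes: (a + b + 1) >> 1 per lane."""
    return bytes((x + y + 1) >> 1 for x, y in zip(a, b))

src_a = bytes([0, 1, 10, 255, 100, 7, 8, 254])
src_b = bytes([0, 2, 20, 255, 101, 7, 9, 255])
result = pavgb(src_a, src_b)
print(list(result))
```

The +1 makes the average round up on ties, and the intermediate sum needs 9 bits, which is why the model uses plain integers before narrowing back to bytes.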

A special case of STORE is also provided, one that can selectively write individual bytes. STOREM Rn,Rm,<ea> will only write those bytes whose corresponding mask bit in Rm is set (both in MSB to LSB notation).

moveq #4,d3 ;yes yes, this will stall in the following calculation
LOAD 4(A0,D3.l*4),D1 ;D1=64 bit from any memory location
moveq #%01010101,D2 ;D2.b=bit mask which bytes (bit=1) are to be written
STOREM D1,D2,(A2)+ ;write every second byte from D1
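A minimal Python sketch of the masked store follows. The bit-to-byte mapping is my reading of the "MSB to LSB" remark (mask bit 7 guards the most significant byte, bit 0 the least significant one) and should be treated as an assumption; `storem` is a hypothetical helper name.

```python
def storem(reg, mask, memory, address):
    """Write only those bytes of reg whose mask bit is set.

    Assumption: mask bit 7 guards byte 0 (the most significant byte)
    and mask bit 0 guards byte 7.
    """
    for i in range(8):
        if mask & (0x80 >> i):
            memory[address + i] = reg[i]

memory = bytearray([0xEE] * 8)  # pre-filled so untouched bytes are visible
d1 = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])
storem(d1, 0b01010101, memory, 0)
print(memory.hex())
```

With the mask %01010101, every second byte is replaced and the bytes in between keep their old memory contents.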

The third special STORE variant is targeted at 8 Bit pixel data. Typical operations in image/video processing result in intermediate results exceeding the 8 Bit range, which implies clipping before going back to 8 Bit. The Apollo features its own interpretation of PACKUSWB for this purpose. Clipping is done to (0,255). Example:

LOAD (A0)+,D1 ;4 signed words: a0.w a1.w a2.w a3.w
LOAD (A1)+,D2 ;4 signed words: b0.w b1.w b2.w b3.w
PACKUSWB D1,D2,(A2)+ ;8 unsigned bytes: a0 a1 a2 a3 b0 b1 b2 b3
; operation: vn.b = ( vn.w < 0 ) ? 0 : ( ( vn.w > 255 ) ? 255 : vn.w ); // n=0...7
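The clipping rule in the last comment line translates directly into Python. This sketch models only the saturation and the a-then-b byte order shown above; `packuswb` here is a hypothetical helper, not an existing library call.

```python
def packuswb(a_words, b_words):
    """Clip eight signed words to 0..255 and pack them as bytes (a first, then b)."""
    return bytes(min(max(w, 0), 255) for w in list(a_words) + list(b_words))

packed = packuswb([-5, 0, 128, 300], [255, 256, -32768, 32767])
print(list(packed))
```

Negative intermediates clamp to 0 and anything above 255 clamps to 255, which is what you want when going back to 8 Bit pixels.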

One catch with SIMD is that you cannot always guarantee that you are able to lay out your data as needed by the arithmetic. That's why coders have been fond of the permute instruction, introduced with Motorola's PPC7400 (aka G4) series. The Apollo core offers one, too. Two input registers Ra and Rb can be permuted by a given permutation constant into the destination Rd. Example:

;byte permutation key semantics for Rm,Rn
; Rm m0 m1 m2 m3 m4 m5 m6 m7 = 0 1 2 3 4 5 6 7
; Rn n0 n1 n2 n3 n4 n5 n6 n7 = 8 9 a b c d e f
;
; ex1: word interleaving
LOAD (A0)+,D1 ;4 signed words: m0.w m1.w m2.w m3.w
LOAD (A1)+,D2 ;4 signed words: n0.w n1.w n2.w n3.w
VPERM #$018923ab,D1,D2,D3 ;D3: m0.w n0.w m1.w n1.w
; ex2: unsigned byte to words
LOAD (A0),D4 ;8 unsigned bytes m0 m1 m2 m3 m4 m5 m6 m7
moveq #0,d5 ;0.l
VPERM #$F0F1F2F3,D4,D5,D6 ; first four bytes as words m0.w m1.w m2.w m3.w
VPERM #$F4F5F6F7,D4,D5,D6 ; second four bytes as words m4.w m5.w m6.w m7.w
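Reading the two examples together, each hex digit of the 32 Bit key appears to select one of the 16 source bytes (0-7 from the first register, 8-f from the second). Under that assumption, which is inferred from the listing rather than stated outright, a Python model looks like this; `vperm` is a hypothetical helper name.

```python
def vperm(key, rm, rn):
    """Build 8 result bytes; each nibble of key indexes the 16-byte pool rm+rn.

    Assumption: nibbles are read from most to least significant,
    indices 0-7 pick from rm and 8-15 pick from rn.
    """
    pool = bytes(rm) + bytes(rn)
    nibbles = [(key >> shift) & 0xF for shift in range(28, -1, -4)]
    return bytes(pool[n] for n in nibbles)

# ex1: word interleaving
m = bytes(range(0x10, 0x18))
n = bytes(range(0x20, 0x28))
interleaved = vperm(0x018923AB, m, n)

# ex2: unsigned bytes to words, second register is all zero
pixels = bytes([1, 2, 3, 4, 5, 6, 7, 8])
zero = bytes(8)
low_words = vperm(0xF0F1F2F3, pixels, zero)
high_words = vperm(0xF4F5F6F7, pixels, zero)
print(interleaved.hex(), low_words.hex(), high_words.hex())
```

In ex2 the digit F picks the last byte of the zeroed register, so every source byte gets a zero high byte in front of it and becomes a 16 Bit word.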

Let's come to arithmetics. Bit-wise operations are:
PAND <ea>,Rb,Rd
POR <ea>,Rb,Rd
PEOR <ea>,Rb,Rd
PANDN <ea>,Rb,Rd

Addition/Subtraction can be done on 8 Bit or 16 Bit.

PADDB <ea>,Rb,Rd ;Rd = Rb + <ea>
PADDW <ea>,Rb,Rd ;
PSUBB <ea>,Rb,Rd ;Rd = Rb - <ea>
PSUBW <ea>,Rb,Rd ;

One special case of add/sub is the BFLYW. A common recurrence in signal transforms (FFT, DCT, DWT) is the butterfly, an operation where both the sum and the difference of two operands are required. In order to speed up such transforms, AMMX offers BFLYW <ea>,Rb,Rd:Rd+1. Please note that the destination is actually a consecutive register pair (with an even index for the first one).

BFLYW D0,D1,D2:D3 ; D2 = D1 + D0 , D3 = D1 - D0 (4 words each)
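As a quick sanity check of the operand order in the comment above, here is a Python sketch of the butterfly on four word lanes. Treating the lanes as signed 16 Bit values that wrap on overflow is an assumption; `bflyw` and `to_s16` are hypothetical helper names.

```python
def to_s16(x):
    """Wrap an integer into the signed 16-bit range (wraparound is assumed)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def bflyw(a, b):
    """Model BFLYW a,b,Rd:Rd+1: return (b + a, b - a) per 16-bit lane."""
    sums = [to_s16(y + x) for x, y in zip(a, b)]
    diffs = [to_s16(y - x) for x, y in zip(a, b)]
    return sums, diffs

d0 = [1, 2, 3, 4]
d1 = [10, 20, 30, 40]
d2, d3 = bflyw(d0, d1)   # D2 = D1 + D0, D3 = D1 - D0
print(d2, d3)
```

One call replaces what would otherwise be a separate add and subtract on the same inputs.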

As a side note, we replaced 28 add+sub combinations by butterflies in an 8x8 iDCT, roughly 15% of the total instructions in that function block.

Multiplies are currently offered by the PMUL88 <ea>,Rb,Rd instruction. It multiplies four words with the given operand and shifts down by 8 Bits after the multiply (Rd = (Rb*<ea>)>>8). Example:

PMUL88.W #16,D0,D1 ; D1 = (D0*16)>>8 = D0/16
PMUL88.W #1024,D0,D1 ; D1 = (D0*1024)>>8 = D0*4
PMUL88.W D2,D3,D4 ; D4 = (D3*D2)>>8

The multiply is implemented with full throughput. With the implemented logic (>>8), it can serve as a short-range shift replacement.
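The shift-replacement idea can be checked with a small Python model of Rd = (Rb*operand)>>8. Signed 16 Bit lanes, an arithmetic shift, and truncation of the result to 16 Bit are all assumptions here; `pmul88` and `to_s16` are hypothetical helper names.

```python
def to_s16(x):
    """Wrap an integer into the signed 16-bit range (truncation is assumed)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def pmul88(operand, rb):
    """Per-lane (rb * operand) >> 8 on four words, as an 8.8 fixed-point multiply."""
    return [to_s16((x * k) >> 8) for x, k in zip(rb, operand)]

d0 = [256, 512, -256, 100]
divided = pmul88([16] * 4, d0)       # factor 16/256: same as dividing by 16
scaled = pmul88([1024] * 4, d0)      # factor 1024/256: same as multiplying by 4
print(divided, scaled)
```

A factor below 256 acts as a right shift and a factor above 256 as a left shift, so one instruction covers short shifts in both directions.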
