|
Post by gtoal on Aug 13, 2021 17:00:01 GMT -5
Some time ago, for amusement value primarily, I hacked up a generator to create macros for doing constant divisions. Today, with some idle time, I took it a step farther and wrote a procedure to do arbitrary 8-bit divides using those macros: gtoal.com/vectrex/divm.c

OK, I know this isn't particularly useful compared to 16-bit and 32-bit divides, but still, I was wondering how fast it might be compared to the built-in "/" operator in C. Which brings me here... to ask if anyone has any smart ideas for how to time a comparison of the run-times of two pieces of code. I have a suspicion that a timer function can be made out of the T2 timer that is used to count down the length of a frame, but I'm unsure how to write it without breaking the refresh rate. That might be a valid price to pay, but if anyone already has something that could do the job and lets me compare runtime A to runtime B (whether in timing units or just as a pure ratio), please pipe up!

(The code above could actually be faster in the cases where there is a right shift of 8 bits or more, because gcc6809 actually does 8 ASRs or LSRs in those cases, rather than just fetching the high byte directly; but the macros were complicated enough by that point, and I didn't have the enthusiasm to write a version that I could test on an architecture that has the other byte sex, since I was writing it on an Intel Linux system.)

Next question: I know how to construct 16- and 32-bit multiplies out of an 8-bit multiply primitive, but am I right in thinking there's no analogous trick for making larger divides out of an 8-bit primitive? I haven't bothered looking at anything larger than 8-bit divides because I assume the gcc library code will be using the standard shift-and-subtract algorithm, which is what I would be writing myself and therefore unlikely to be faster than the existing hand-coded assembler.
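For anyone who hasn't opened divm.c: the usual shape of this trick (a minimal sketch, not gtoal's actual macros) is to replace a constant divide with a multiply by a scaled reciprocal followed by a shift. For example, dividing an 8-bit value by 10:

```c
#include <stdint.h>

/* Hypothetical sketch of constant division by reciprocal multiply.
   205/2048 ~= 1/10, and for every 8-bit input the accumulated error
   is small enough that the floored result matches x/10 exactly. */
static uint8_t udiv10_u8(uint8_t x)
{
    return (uint8_t)(((uint16_t)x * 205u) >> 11);
}
```

Each constant divisor needs its own magic multiplier and shift count; the macro generator's job is to pick them so the rounding error never crosses an integer boundary over the input range.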
(I really must write the standard shift-and-subtract divide routine some day, just to complete the set. I did once try writing a divide that used multiplies, but it was a bit silly: it used the same number of steps as a shift-and-subtract implementation would, with a multiply being more expensive than a subtract, so why bother? Nevertheless, if you like a good laugh, it's near the end of this file, which also has my extended multiplies in it: gtoal.com/vectrex/mul8.c - look for the procedures named udiv16*, udiv32*, and udiv64*.)

[ADDENDUM: I was able to speed up the shift & subtract algorithm by doing a binary chop to locate the highest shift needed. gtoal.com/vectrex/vectrex-muldiv.c.html - this code now handles all sizes of division. Tested on Linux so far, where all the results check out against the hardware divide. I'll test on the Vectrex itself soon.]

Btw, taken together, those two files would let you do multiplies and divides on hardware with neither multiply nor divide instructions, though I can't see anyone inventing a new chip that's that stupid in this day and age. As I said, this was just a programming exercise to keep my hand in. But it would be nice to know if any of it would actually speed up a Vectrex program...

G
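The shift-and-subtract routine referred to above isn't shown in the thread; for reference, a textbook restoring version for 16-bit operands (an assumption of what the standard algorithm looks like, not the code in gtoal's files) is:

```c
#include <stdint.h>

/* Textbook restoring shift-and-subtract divide, 16/16 -> 16-bit
   quotient and remainder. d must be non-zero. */
static uint16_t udiv16_shift_sub(uint16_t n, uint16_t d, uint16_t *rem)
{
    uint16_t q = 0;
    uint32_t r = 0;                       /* needs one bit of headroom before the compare */
    for (int i = 15; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next numerator bit */
        if (r >= d) {                     /* trial subtract succeeds: set quotient bit */
            r -= d;
            q |= (uint16_t)(1u << i);
        }
    }
    if (rem) *rem = (uint16_t)r;
    return q;
}
```

One iteration per quotient bit, which is exactly why the post's binary-chop refinement (skipping straight to the highest shift that can possibly succeed) pays off for small quotients.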
|
|
|
Post by Malban on Aug 15, 2021 13:52:04 GMT -5
|
|
|
Post by gtoal on Aug 17, 2021 9:55:16 GMT -5
So the code above is close to optimal for divides by any constant, *but* I did notice that gcc6809 generates poor code for large right shifts by a constant...
I was thinking about it yesterday when I was stuck in a doctor's waiting room for 2 hrs with no electronics, so I wrote this in my head to keep me from getting bored. Caveat - I'm assuming the gcc innards have to look something like this, I haven't actually dived into the gcc sources to confirm it yet.
The code for something like >>10 looks something like this:
ASR A : ROR B
(repeated a total of 10 times)
I wrote a test and examined the code. For this test at least, the data is on the stack:
$03F8 67 E4  asr ,s        6  $FF
$03FA 66 61  ror <<$01,s   7  $FF
$03FC 67 E4  asr ,s        6  $FF
$03FE 66 61  ror <<$01,s   7  $FF
$0400 67 E4  asr ,s        6  $FF
$0402 66 61  ror <<$01,s   7  $FF
$0404 67 E4  asr ,s        6  $FF
$0406 66 61  ror <<$01,s   7  $FF
$0408 67 E4  asr ,s        6  $FF
$040A 66 61  ror <<$01,s   7  $FF
$040C 67 E4  asr ,s        6  $FF
$040E 66 61  ror <<$01,s   7  $FF
$0410 67 E4  asr ,s        6  $FF
$0412 66 61  ror <<$01,s   7  $FF
$0414 67 E4  asr ,s        6  $FF
$0416 66 61  ror <<$01,s   7  $FF
$0418 67 E4  asr ,s        6  $FF
$041A 66 61  ror <<$01,s   7  $FF
$041C 67 E4  asr ,s        6  $FF
$041E 66 61  ror <<$01,s   7  $FF
So at some point in gcc, the compiler knows that the value is in D or in store and runs a loop generating up to 15 single-bit shifts.
if (data in D) {
    while (rshifts_left-- > 0) { gen("ASR A"); gen("ROR B"); }
} else if (data on stack) {
    while (rshifts_left-- > 0) { gen("ASR ,S"); gen("ROR 1,S"); }
} else if (data in memory) {
    while (rshifts_left-- > 0) { gen("ASR mem"); gen("ROR mem+1"); }
} else {
    // whatever gcc currently does in that situation
}
I believe that can be replaced by:
if (data in D) {
    if (rshifts_left >= 8) {  // ADDENDUM: gcc gets this one right, when a static is loaded into D
        // >> 8
        gen("TFR A,B"); gen("SEX");
        rshifts_left -= 8;
        // bits in A now 0 or 1 and match top bit in B:
        while (rshifts_left-- > 0) {  // 0 to 7 shifts. (Could >>7 be optimised with a ROL and an AND? Maybe also >>6?)
            gen("ASR B");
        }
    } else {
        while (rshifts_left-- > 0) { gen("ASR A"); gen("ROR B"); }
    }
} else if (data on stack && D is free) {
    if (rshifts_left >= 8) {
        gen("PUL D"); gen("TFR A,B"); gen("SEX");
        rshifts_left -= 8;
        while (rshifts_left-- > 0) { gen("ASR B"); }
        gen("PSH D");
    } else {
        while (rshifts_left-- > 0) { gen("ASR ,S"); gen("ROR 1,S"); }
    }
} else if (data on stack && D is NOT free) {
    if (rshifts_left >= 8) {
        gen("PSH D"); gen("LDB 2,S"); gen("SEX"); gen("STA 2,S");
        rshifts_left -= 8;
        while (rshifts_left-- > 0) { gen("ASR B"); }
        gen("STB 3,S"); gen("PUL D");
    } else {
        while (rshifts_left-- > 0) { gen("ASR ,S"); gen("ROR 1,S"); }
    }
} else if (data in memory && D is free) {
    if (rshifts_left >= 8) {
        gen("LDB mem+1"); gen("SEX");
        rshifts_left -= 8;
        while (rshifts_left-- > 0) { gen("ASR B"); }
        gen("STD mem");
    } else {
        while (rshifts_left-- > 0) { gen("ASR mem"); gen("ROR mem+1"); }
    }
} else if (data in memory && D is NOT free) {
    if (rshifts_left >= 8) {
        gen("PSH D"); gen("LDB mem+1"); gen("SEX");
        rshifts_left -= 8;
        while (rshifts_left-- > 0) { gen("ASR B"); }
        gen("STD mem"); gen("PUL D");
    } else {
        while (rshifts_left-- > 0) { gen("ASR mem"); gen("ROR mem+1"); }
    }
} else {
    // whatever gcc currently does in that situation
}
Whatever the details of GCC are, I think this structure can be mapped to it.
Could a 6809 assembler coder please check that the alternative version of the code above is correct, as well as being shorter and faster?
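The register-path transformation can at least be sanity-checked in portable C on a host machine (assuming arithmetic right shifts on signed types, as gcc provides): the claim is that a signed 16-bit >>10 equals "move the high byte into B and sign-extend" (TFR A,B / SEX) followed by two ASR Bs.

```c
#include <stdint.h>

/* Host-side model (hypothetical helper name) of the proposed code
   sequence for x >> 10 on a value held in D. */
static int16_t shift10_via_highbyte(int16_t x)
{
    int8_t  hi = (int8_t)((uint16_t)x >> 8);  /* TFR A,B: high byte into B */
    int16_t d  = hi;                          /* SEX: sign-extend B into D */
    d >>= 2;                                  /* ASR B twice (2 = 10 - 8);
                                                 A already holds the sign bits */
    return d;
}
```

After the SEX, D is in the range -128..127, so shifting only B arithmetically keeps A a valid sign extension, which is why the per-bit ASR A / ROR B pair can be dropped.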
G
|
|
|
Post by gtoal on Aug 17, 2021 10:17:37 GMT -5
PS I had not even looked at left shifts until just now - I assumed they would use multiplies by powers of 2. They don't, so there's scope for a similar optimization to the above, plus substituting a multiply in some cases.
Looking at the timings for MUL vs ASL again, I have to correct an assumption I had made. I had it at the back of my mind that after 2 shifts, MUL was faster. Turns out that's only when shifting indexed memory locations. If the data is in a register, 5 ASLs beat a MUL. Then it gets down to the overhead of loading/unloading memory to register. Everything is about context...
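A back-of-envelope check of those crossovers, using the standard 6809 cycle counts (ASLB: 2 cycles, ASL with indexed addressing: 6 cycles, MUL: 11 cycles; the helper names are made up):

```c
/* Cycle-cost arithmetic for left-shifting a byte n places,
   using standard 6809 data-sheet timings. */
static int shifts_in_reg(int n) { return 2 * n; }  /* n x ASLB, 2 cycles each */
static int shifts_in_mem(int n) { return 6 * n; }  /* n x ASL ,X, 6 cycles each */
static int mul_cost(void)       { return 11; }     /* one MUL */
```

So in a register, five 2-cycle ASLs (10 cycles) still beat an 11-cycle MUL, while on an indexed memory location MUL already wins at two shifts (12 cycles), matching the observation above.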
|
|
|
Post by gtoal on Aug 17, 2021 10:21:40 GMT -5
Wow - interestingly, gcc gets it right for statics:
$040B FC C8 80  ldd >_i.2924   6  $C880 $FF
$040E 1F 89     tfr a,b        6  $FF
$0410 1D        sex            2  extendqihi2: R:b -> R:d: $FF
$0411 57        asrb           2  $FF
$0412 57        asrb           2  $FF
$0413 FD C8 80  std >_i.2924   6  $C880 $FF
|
|
|
Post by gtoal on Sept 5, 2021 11:45:23 GMT -5
Turned out getting an exact cycle count on the Vectrex itself wasn't too hard - basically just save the T2 timer before and after the code to be tested. (There are some subtleties involved in making GCC do what you want even with full optimizations turned on, but no show-stoppers.) So after timing my 'soft' 16-bit divide procedures of various flavours (not looking at 8-bit here, which is a different ball game) I get these results...

t1 = dp_VIA_t2;
//=======================================================================================
//asm(" nop ");                                                  // 4 cycles (not 3 like https://atjs.mbnet.fi/mc6809/Information/6809.htm )
//asm(" adda #0 ");                                              // 4 cycles
//p1 = sdiv16_by_16(p1,p2);                                      // 17626 cycles
//p1 = sdiv16_by_8(p1,(int)p2);                                  // 17634 cycles
//p1 = (long)udiv16_by_16((unsigned long)p1,(unsigned long)p2);  // 17180 cycles
//p1 = (long)udiv16_by_8((unsigned long)p1,10);                  // 17138 cycles
//p1 = (long)udiv16((unsigned long)p1,(unsigned long)p2);        // 9760 cycles (uses multiply rather than subtract)
//p1 = p1/10L;                                                   // 2364 cycles
//p1 = p1/p2;                                                    // 2360 cycles
//p1 = sdiv10_16(p1);                                            // 352 cycles
//=======================================================================================
t2 = dp_VIA_t2;

(t1 and t2 have to be sex-changed, and the result tweaked to (long)((t1-t2-13L)<<1) to account for loading the timer (13 ticks) and converting from ticks to cycles (*2).)

... all my general divide routines in C were much worse than GCC's code, *except* for an optimised divide by 10, which was between 6 and 7 times faster than "x/10L" in C. Code is in gtoal.com/vectrex/cycle-counting.c.html

Graham
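The measurement arithmetic described above can be sketched host-side (bswap16_ and ticks_to_cycles are made-up names; dp_VIA_t2, the 13-tick load overhead, and the x2 ticks-to-cycles factor are taken from the post):

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit word - the "sex change" applied
   to each raw T2 reading before subtracting. */
static uint16_t bswap16_(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Convert two raw (byte-swapped) T2 samples into a cycle count:
   T2 counts DOWN, so t1 - t2 is the elapsed ticks; subtract the
   13-tick read/store overhead and double, per the post. */
static long ticks_to_cycles(uint16_t raw_t1, uint16_t raw_t2)
{
    long t1 = bswap16_(raw_t1);
    long t2 = bswap16_(raw_t2);
    return (t1 - t2 - 13L) << 1;
}
```

On the real hardware the two raw values would of course come straight from dp_VIA_t2 immediately before and after the code under test.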
|
|
|
Post by Peer on Sept 6, 2021 7:03:31 GMT -5
Just a quick suggestion regarding a snippet of your code:
t1 = dp_VIA_t2;
…
t2 = dp_VIA_t2;

t1 = (t1<<8) | ((t1>>8)&255);
t2 = (t2<<8) | ((t2>>8)&255);
Although these are just very few and rather short lines of C code, this relies on gcc6809 figuring out that in fact it is not necessary to do any bit-shifting or logical operations here, but to simply swap the high-byte and the low-byte of the 16-bit words.
If gcc6809 does not figure this out on its own (which likely depends on the optimization settings used), the generated assembly code can be rather large, ugly, and inefficient. Though it is more code to write, I much prefer coding such things explicitly:
union word_t {
    unsigned int word;            /* 16 bits with gcc6809 */
    struct {
        unsigned char high_byte;  /* 8 bits each */
        unsigned char low_byte;
    };
};
union word_t t1,t2, t1_swapped, t2_swapped;
t1.word = dp_VIA_t2;
…
t2.word = dp_VIA_t2;
t1_swapped.high_byte = t1.low_byte;
t1_swapped.low_byte  = t1.high_byte;
t2_swapped.high_byte = t2.low_byte;
t2_swapped.low_byte  = t2.high_byte;
unsigned long diff = t2_swapped.word - t1_swapped.word;

The generated assembly instructions are simple and efficient 8-bit and 16-bit loads and stores. This might not be the optimum (that would be gcc6809 using the d register and then swapping a and b), but it works nicely and independently of compiler optimization, and the C code shows what you want to do here: accessing the same memory location either as a complete 16-bit word, or as two separate 8-bit bytes.
Many Cheers, Peer
|
|
|
Post by hcmffm on Sept 6, 2021 17:14:59 GMT -5
To be honest, I don't understand two bytes (a word) of this. But this compiler code-generation optimization/control sounds interesting. One day I'll start coding, and then all this will be of big value.
|
|
|
Post by gtoal on Nov 13, 2021 23:24:48 GMT -5
Peer wrote: "Although these are just very few and rather short lines of C code, this relies on gcc6809 figuring out that in fact it is not necessary to do any bit-shifting or logical operations here, but to simply swap the high-byte and the low-byte of the 16-bit words."

You are of course perfectly correct, and I'll use the overlaid-struct trick when needed in time-critical code. But the example you found was just for calculating the time used by the sequence being tested, so it didn't contribute to the result at all and wasn't worth the effort of writing the long way.

It would really be nice if you could rely on GCC to do these tricks for you, since then you could write portable code - unfortunately the solution using structs is usually byte-sex dependent and requires manually coded ifdefs to handle efficiently. (Swapping is OK, since that works in either sex, but if you want to specifically extract a high byte or a low byte then it gets ugly.) Anyway, that was the point of these experiments - to see what GCC does for you and what it doesn't, with a view maybe to adding some tricks to GCC at some point.

Though of course you have to be wary of trying to be too clever, as these sorts of optimisations are the kind of thing that can break a compiler. I don't know if we've discussed it anywhere, but I remember a few years ago, when I started using GCC under Vide, we discovered a bug caused by the use of the external peephole optimiser - being external, it wasn't aware of a 'volatile' qualifier on one of the device addresses. So ideally this sort of thing has to be added to GCC itself, which is a lot harder than adding it to a peepholer and something I don't think any of us are in a hurry to hack on.

(Talking of which, I did notice that someone has been doing some optimisation maintenance of gcc6809 relatively recently ... gitlab.com/dfffffff/gcc6809 )
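The byte-sex-dependent overlay alluded to above might look like this (a hypothetical sketch using gcc's predefined byte-order macros; word16_t is a made-up name):

```c
#include <stdint.h>

/* Overlay a 16-bit word with its two bytes. Which field is the
   high byte depends on the target's byte sex, hence the ifdef:
   the 6809 is big-endian, an Intel Linux host is little-endian. */
typedef union {
    uint16_t word;
    struct {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        uint8_t high_byte, low_byte;  /* 6809 / big-endian layout */
#else
        uint8_t low_byte, high_byte;  /* little-endian layout */
#endif
    } bytes;
} word16_t;
```

With the guard in place, w.bytes.high_byte names the same byte on either architecture, which is exactly the manually coded ifdef handling the post describes; a pure swap, by contrast, works unguarded in either sex.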
|
|
|
Post by Peer on Nov 14, 2021 2:49:34 GMT -5
gtoal wrote: "(Talking of which I did notice that someone has been doing some optimisation maintenance of gcc6809 relatively recently ... gitlab.com/dfffffff/gcc6809 )"

"Recently" is a bit optimistic, unless I am missing something here. About 2 years ago, Malban and I reported several issues, of which only very few have been fixed. There are still 5 open issues. Malban also did some fixes of his own to various arithmetic routines. I am afraid those did not make it into the gitlab, but they are contained in the gcc6809 version that we use, which comes with VIDE.
Cheers, Peer
|
|