|
Post by gtoal on Oct 1, 2024 11:19:40 GMT -5
I'm working on a new Vectrex game and I've hit that problem we've seen in the past where GCC coughs up its guts while trying to optimise some expression. I spent an hour trying to recode the statements in a way that did not trigger the fault, without success, so I'm thinking the next thing to try is to turn off optimisations around that (short) procedure. By default I'm compiling with -O3 (which is absolutely required for speed), but if I turn off -O altogether it compiles OK. I know in modern GCC's there are pragma's you can use to modify optimisation over a short range of code, but I don't know if that was ever present in the ancient GCC that we use, and if it was, what exactly the options would be? I suppose a last-ditch solution would be to recode this procedure in asm but although I said it was short it is *way* too long for that to be a tempting way out Thanks, Graham #define N 4 // Number of ghosts and their target destinations
// Table of distances between each ghost and each target.
// Initial values are for testing only.
static int distanceMatrix[N][N] = {
{9, 2, 7, 8},
{6, 4, 3, 7},
{5, 8, 1, 8},
{7, 6, 9, 4}
};
#define PERMS (4*3*2*1)
// There are only 24 possible ways to assign ghosts to targets so why not evalulate them all and pick the best one?
int order[PERMS][N] = {
{0, 1, 2, 3}, {0, 1, 3, 2}, {0, 2, 1, 3}, {0, 2, 3, 1}, {0, 3, 1, 2}, {0, 3, 2, 1},
{1, 0, 2, 3}, {1, 0, 3, 2}, {1, 2, 0, 3}, {1, 2, 3, 0}, {1, 3, 0, 2}, {1, 3, 2, 0},
{2, 0, 1, 3}, {2, 0, 3, 1}, {2, 1, 0, 3}, {2, 1, 3, 0}, {2, 3, 0, 1}, {2, 3, 1, 0},
{3, 0, 1, 2}, {3, 0, 2, 1}, {3, 1, 0, 2}, {3, 1, 2, 0}, {3, 2, 0, 1}, {3, 2, 1, 0}
};
// Having identified the 4 targets (choke points) we are going to use, this evaluation
// function should tell us which is the best assignment of a ghost to a destination
// (using whichever criterion we can think of that gives us a 'best'). Initial criterion
// is to arrange for the arrival of the ghosts at the choke points to happen as close to
// simultaneously as possible.
// Our initial cost function minimizes the differences between each ghost's travel time:
uint8_t balanced(int thisPermutation) { // lower value of cost function means better fit
// This procedure causes a compiler failure (either version) when compiled with -O3
// - it works when no -O is given.
#ifdef ORIGINAL
int targetDistance;
// The rather verbose arithmetic expressions below ensure that we do not assign a value
// in excess of the largest number that can be stored in a distance variable, which will
// be relevant when we move the code to the Vectrex and want to store distances in a
// single byte in order to reduce RAM usage.
int D0 = distanceMatrix[0][order[thisPermutation][0]],
D1 = distanceMatrix[1][order[thisPermutation][1]],
D2 = distanceMatrix[2][order[thisPermutation][2]],
D3 = distanceMatrix[3][order[thisPermutation][3]];
targetDistance = (D0>>2) + (D1>>2) + (D2>>2) + (D3>>2) + (((D0&3) + (D1&3) + (D2&3) + (D3&3) + 2/*rounding*/)>>2);
D0 -= targetDistance; D1 -= targetDistance; D2 -= targetDistance; D3 -= targetDistance;
return (uint8_t)(D0*D0 + D1*D1 + D2*D2 + D3*D3); // Calculate the squared differences from the target distance
#else
int targetDistance;
uint16_t R;
// The rather verbose arithmetic expressions below ensure that we do not assign a value
// in excess of the largest number that can be stored in a distance variable, which will
// be relevant when we move the code to the Vectrex and want to store distances in a
// single byte in order to reduce RAM usage.
int8_t D0, D1, D2, D3;
D0 = distanceMatrix[0][order[thisPermutation][0]],
D1 = distanceMatrix[1][order[thisPermutation][1]],
D2 = distanceMatrix[2][order[thisPermutation][2]],
D3 = distanceMatrix[3][order[thisPermutation][3]];
targetDistance = (D0&3);
targetDistance += (D1&3);
targetDistance += (D2&3);
targetDistance += (D3&3);
targetDistance += 2/*rounding*/;
targetDistance >>= 2;
targetDistance += (D0>>2);
targetDistance += (D1>>2);
targetDistance += (D2>>2);
targetDistance += (D3>>2);
D0 -= targetDistance;
D1 -= targetDistance;
D2 -= targetDistance;
D3 -= targetDistance;
D0 *= D0;
D1 *= D1;
D2 *= D2;
D3 *= D3;
R = (uint16_t)D0;
R += (uint16_t)D1;
R += (uint16_t)D2;
R += (uint16_t)D3;
return (uint8_t)(R & 0xFF);
// GGG6809 BUG:
// return D0*D0 + D1*D1 + D2*D2 + D3*D3; // Calculate the squared differences from the target distance
#endif
}
|
|
|
Post by Peer on Oct 1, 2024 13:32:41 GMT -5
Yes, for each enabling optimization option there is also a corresponding disabling option. Any -Ox level can thus be fine-tuned. For details, see here: gcc.gnu.org/onlinedocs/gcc-4.3.6/gcc/Optimize-Options.html#Optimize-OptionsRegarding speed, -O2 usually gives me better results than -O3. I also use a standard set of -fnoxxxx options which disable several optimizations which are still buggy in gcc6809. These are also the standard settings in Vide and are mentioned somewhere in the Vide documentation. Hope this helps. Cheers, Peer
|
|
|
Post by Peer on Oct 1, 2024 13:40:13 GMT -5
Btw, if those matrices are unchanging lookup-tables, they should better be declared "const", so that they end up in ROM space. Otherwise they will unnecessarily eat up precious RAM space.
|
|
|
Post by gtoal on Oct 1, 2024 15:20:53 GMT -5
Yes, for each enabling optimization option there is also a corresponding disabling option. Any -Ox level can thus be fine-tuned. For details, see here: gcc.gnu.org/onlinedocs/gcc-4.3.6/gcc/Optimize-Options.html#Optimize-OptionsRegarding speed, -O2 usually gives me better results than -O3. I also use a standard set of -fnoxxxx options which disable several optimizations which are still buggy in gcc6809. These are also the standard settings in Vide and are mentioned somewhere in the Vide documentation. Hope this helps. Cheers, Peer Well, what I was looking for was a way to apply options to a small part of a source file rather than to the whole program, eg what in modern GCC compilers would be done with either this: #pragma GCC push_options
#pragma GCC optimize ("-O0")
uint8_t balanced(int thisPermutation) { int targetDistance;
int D0 = distanceMatrix[0][order[thisPermutation][0]],
D1 = distanceMatrix[1][order[thisPermutation][1]],
D2 = distanceMatrix[2][order[thisPermutation][2]],
D3 = distanceMatrix[3][order[thisPermutation][3]];
targetDistance = (D0>>2) + (D1>>2) + (D2>>2) + (D3>>2) + (((D0&3) + (D1&3) + (D2&3) + (D3&3) + 2)>>2);
D0 -= targetDistance; D1 -= targetDistance; D2 -= targetDistance; D3 -= targetDistance;
return (uint8_t)(D0*D0 + D1*D1 + D2*D2 + D3*D3);
}
#pragma GCC pop_options or this: uint8_t balanced(int thisPermutation) __attribute__ ((optimize(0))) {
int targetDistance;
int D0 = distanceMatrix[0][order[thisPermutation][0]],
D1 = distanceMatrix[1][order[thisPermutation][1]],
D2 = distanceMatrix[2][order[thisPermutation][2]],
D3 = distanceMatrix[3][order[thisPermutation][3]];
targetDistance = (D0>>2) + (D1>>2) + (D2>>2) + (D3>>2) + (((D0&3) + (D1&3) + (D2&3) + (D3&3) + 2)>>2);
D0 -= targetDistance; D1 -= targetDistance; D2 -= targetDistance; D3 -= targetDistance;
return (uint8_t)(D0*D0 + D1*D1 + D2*D2 + D3*D3);
} However I've tried those and they don't work with the old gcc6809 code. Fortunately, ... some other change I must have made since this morning has caused the original GCC failure to go away! All I can do is keep my fingers crossed (and keep good RCS backups) and hope it doesn't come back. But I'm interested in your list of optimisations to turn off, if they differ from what is currently supplied by Vide - it would be good to specifically turn off anything that was known to generate bad code (as opposed to merely causing the compilation to fail, which can be worked around eventually when it happens). One critical gcc flag in that vein that I'm aware of is -Wno-strict-aliasing because GCC's default handling of type punning through pointers or record unions is just plain broken, I don't care what the standard says about undefined behaviour. At least on modern x86 and arm compilers; I've not explicitly checked gcc6809's generated code, but I'm taking no chances! Thanks for the comments, Graham PS Yes, I know all about the const ram vs rom issues and indeed spent some time recently working out how to handle a particularly awkward case where the const array's members were pointers to other const arrays. I haven't bothered doing it to all the arrays in my WIP yet but will do as and when needed.
|
|
|
Post by Peer on Oct 2, 2024 7:30:15 GMT -5
Regarding optimizations, I currently use the following three options to suppress optimizations techniques which are buggy and can in some cases result in incorrect assembly code:
-fno-ipa-reference -ftree-ter -fno-gcse
Those are also added in Vide by default.
|
|
|
Post by Malban on Oct 7, 2024 1:04:11 GMT -5
Only now reading this….
The easiest way might be to use two different source files - you can use different gcc comand line options per file…
I used this in Akalabeth too- to disable loop unrolling in places where I did not need it to save space…
|
|
|
Post by gtoal on Oct 19, 2024 16:22:07 GMT -5
Only now reading this…. The easiest way might be to use two different source files - you can use different gcc comand line options per file… I used this in Akalabeth too- to disable loop unrolling in places where I did not need it to save space… You can? In the same project? How? I've only found the main setting that applies to all files in the project. Also aren't the calling conventions slightly different between optimised and non-optimised code? What I ended up doing was compiling the procedure unoptimised, then extracting the generated assembly code and inserting it into a C procedure which is compiled optimised. (But I mess around with the stack to allow the call to succeed and return properly). I'd much rather avoid that hack if I can though! (btw the Pacman project is advancing nicely. I've made a lot of algorithmic optimisations trading off eprom space to get more speed - including some things I didn't think could be precomputed such as performing a breadth-first search between the ghosts and the player character. I'm putting off lower-level optimisations until I have all the code doing everything it should. When I get there I expect to come back here for some help in getting the cycles down to 30K if possible. Currently around 45-50K I've been keeping the WIP updated online at gtoal.com/vectrex/mspac if you're interested in following along.) G
|
|
|
Post by D-Type on Oct 19, 2024 16:35:18 GMT -5
You'll be back for smartlists then, if you're not already using them, they save lots of cycles.
|
|
|
Post by Malban on Oct 20, 2024 3:34:12 GMT -5
> You can? In the same project? How? Right click on the file, in the sub menu chose properties. Under "Flags" enter something you like - like: "-Os" (save space I believe). Then these flags are added to the compiler invoke. These flags are always added last so they take precidence over any "global" project flags. I have never encountered problems by linking them. I never tried different calling conventions with different files though.
(It was Telengard I used this last - not Akalabeth.)
|
|
|
Post by Peer on Oct 20, 2024 3:51:24 GMT -5
I have never encountered calling conventions being altered by optimization. Unless a function is inlined. Never had any linking problems either.
|
|
|
Post by Peer on Oct 20, 2024 4:13:29 GMT -5
Mixing C code and inline assembly in the same function has tons of pitfalls. Especially if the inlined assembly code manipulates the stack. Safest way is to write functions either completely in C, or completely in asm. Just use separate C files and asm files in your project, and add respective function header declarations to the C code, and leave the rest to the compiler and the linker. I found this to work very well.
|
|
|
Post by gtoal on Oct 22, 2024 22:35:54 GMT -5
Thanks for all the comments. Appreciated as always. I have to take a break until the weekend to handle some stuff at home so I thought this would be a good time to bring you all up to date with this project. It started a few weeks back as a project to convert the arcade Ms Pacman to the Pitrex. I started with some experiments to determine the best scale factor to use with smart lists to draw the mazes, which was reasonably successful. Then it dawned on me that with the mazes drawn, that was basically the hard part done and it might actually be possible to write a pac-game from scratch for the plain Vectrex in 6809 code, so that was when I started the project for real. (But boy was I wrong about the maze drawing being the hard part!) Now, this wasn't my first Pac Rodeo as I'd previously written a Pac game in Scratch, and in that game I generated new mazes on the fly, and used a breadth-first search for the ghosts rather than the original Pacman heuristics, and I was pretty pleased with the way that came out, so I decided to incorporate some of that logic in this new game. Yes, it would use the layout of the 4 Ms Pac mazes, but the game play would be quite different. Some changes being by choice, some being forced upon me. Well, when I started adding gameplay, the CPU time taken shot up. Way up. Ridiculously high up. So high up that there was no way that optimising the BFS algo would *ever* get it back into the realms of reality. So after a few days in the slough of despondency, it finally came to me that there was a way to pre-compute the data for the breadth-first search - and in a way that didn't require megabytes of rom space. I'll skip the tedious details for the moment, but let me just say there is some real computer science embedded in this game now, and although the frame rate is unacceptable (at around 60K cycles), it's within range of being optimised down to something usable. Until now I have *only* optimised the maze drawing and the high-level algorithms. I have made zero attempt to do low-level optimisation because I needed to know if the high-level optimisation could get it out of the 100's of K's of cycles range first. The WIP is at gtoal.com/vectrex/mspac/ and there's even a bin there that's the current proof-of-concept, albeit slow. I'ld really appreciate if anyone with a few hours spare this week were to look at it and suggest optimisations that might speed it up. The code is littered with comments that were primarily for my own benefit while getting it working; a few may be wrong and a few no longer relevant after huge chunks of code were deleted and replaced with other ways of doing things, but I think there's enough there to explain how it works. I'm well aware of some of the things that need to be done and that the ghost algorithm is incomplete - it's *only* the BFS that has been implemented, not the details of how the BFS will be applied to gameplay, or when a less effective algorithm will be used so that the difficulty increases as the game progresses. I have a road map in my head of what still needs to be done and a couple of months in which to do it - but for now what I'm really looking for are good hints on how to claw 50% of the cycles back. I'll drop back in at the weekend when my domestic chores are done; I'm not expecting to be online much until then, so if I don't reply or acknowledge posts it's not because I'm being rude - I'm being busy! Chat among yourselves until I get back. And thanks again for the recent advice, I've already applied some of it. (By the way, you can compile the same source file with -DLINUX and run it using ncurses from the command line. It works pretty well and saves a lot of time in debugging.) Graham PS A proper arcade Ms Pac implementation for the Pitrex is definitely going to happen and it won't be nearly as difficult to create. But this one first... I really want to get this out there in time for Christmas. (and yes, it will remain open source and free to download.)
|
|
|
Post by Malban on Oct 23, 2024 5:56:26 GMT -5
I did a very short analysis:
I have divided the main function into some seperate parts with subfunction. That way the Vide profiler can better display the cycles each part needs.
I compiled with -O3. Overall (when not doing special ghost tests): ~56000 cycles
Drawing the maze: ~37200 Drawing the "Dots & Pills": ~9200 Drawing the "Pacman": 1600 Drawing and Movement Ghost: ~7000 Player handling: ~400
(Ghosts were drawn twice! - I removed one ghost drawing loop)
The special pathfinding for the ghosts boost the cycle count by up to 20000 cycles. Don't know what you are doing there - and I am not getting into it.
It might be one hell of an algorythm. But speed wise it might be faster to do a couple of "if then else" - and have them move dumbly... instead of "wasting" tenth of thousands of cycles :-). All the calculations I see are not terribly time consuming - yes they probably can be optimized.
But before I would start with that - doing special maze drawing routines will probably save you at least 20000 cycles. If you then have the pacman/ghosts as smartlist or something similar fast - will save you another 5000 cycles.
Perhaps drawing the dots "consecutavely" might save some more cycles (most times dots will be near each other - no need to do a full zero reset?).
If worse comes to worst - your movement scale is 127 - with a little bit of clever thinking and implementation - you could use these "over 100 idle cycles" to do some calculation in between.
I'd say - keep up the good work. "more or less" finish it - and then we can do a graphics enhancement round. If the result is too slow we go deeper.
But saving a couple of hundred cycles now with some array/loop improvements seem a bit far fetched - keeping in mind what other options are still available.
Cheers Malban
|
|
|
Post by Peer on Oct 23, 2024 6:52:08 GMT -5
Malban, great that you already took a look at the code. Right now, life is interfering, and I will not have the time to do so. gtoal, I do not know exactly what hard computer science stuff is used here, and how the BFS is applied. And I also do not know how the original Ms Pac-Man ghost AI works. But just as a side note: When I wrote Vec-Man, I used some very basic and simple logic as one strategy of the ghosts. At each junction, if Vec-Man is below you, then move downwards. If Vec-Man is left of you, go to the left. And so on. Just some very simple and fast x and y coordinate comparisons. Use four of such ghosts, and the game becomes incredibly hard, on the verge of impossible to beat. So I had to find ways to actually make the ghosts dumber. I never considered using BFS and the like, but I am pretty sure that this would have been overkill. For Vec-Man, at least. So keep up your great work with your project!
|
|
|
Post by Malban on Oct 23, 2024 8:46:12 GMT -5
Played around a little and just replaced "graphics" - I have not tested this on a real machine - might look bad. But in Vide it is Ok'ish.
I have not put too much effort in it - since it is more a proof of concept. The grpahics do not aligh 100% correct:
(This image inserting does not work for me... follow the link...)
|
|