--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100015
Date: 04/30/97  From: ANTHONY TIBBS  Time: 03:49pm
\/To: CAREY BLOODWORTH  (Read 4 times)
Subj: DJGPP OPTIMIZATIONS 2/3

On Apr 29 22:18, 1997, Carey Bloodworth of 1:3822/1 wrote:

CB> (ranging in price from $100 C64's to quarter million dollar VAXes
CB> (back when $250K was a lot of money)), you could call me an "Old
CB> Timer". I haven't been around as long as some people in these echos,
CB> but I think it's safe to say I'm in the longest 10%.

Yep, by the sounds of it, you are!

Anthony

--- TriED 0.7a Narrow Gamma
 * Origin: World of Power BBS (1:163/545.15)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100016
Date: 04/28/97  From: OREN WEIL  Time: 12:10am
\/To: ALL  (Read 4 times)
Subj: File Shareing in C++

Hello All!

How can I make file sharing work in C++ with fstream?

Oren Weil (CoSysOp of ??? BBS, 972-2-6724473): orenw@geocities.com
http://orenw.home.ml.org
orenw@poboxes.com
orenw@hotmail.com [No Files], orenw@mail.snunit.k12.il [No Files]

---
 * Origin: -=> I ♥ Oren Weil's Point <=- (2:402/423.1)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100017
Date: 04/29/97  From: STEVE WESTCOT  Time: 01:07pm
\/To: ANDREAS NEUKOETTER  (Read 4 times)
Subj: C++++++++++C+++++

On <23 Apr, 21:02>, Andreas Neukoetter wrote to Todd Coup:

AN> DJGPP - Dos & OS/2
AN> GCC - Linux

Does DJGPP contain OS/2 libraries?

--- FMail 1.0g
 * Origin: Where It's At BBS, Kimberly, WI (414)788-8050 (1:139/615)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100018
Date: 04/30/97  From: MICHAEL LARSEN  Time: 10:50pm
\/To: ALL  (Read 4 times)
Subj: Bitmap in FrameWnd,- How...?

Hello All.

Is there a simple way to put a background bitmap in the white area of
the FrameWnd? I am using MS Visual C++ and my program works fine, but I
would love to include this little pep-up :) The bitmap is included in
the resource file.

Anyone?

Michael

---
 * Origin: This line is really a waste of precious bandwidth... (2:237/39.89)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100019
Date: 04/30/97  From: CAREY BLOODWORTH  Time: 09:55pm
\/To: HERMAN SCHONFELD  (Read 4 times)
Subj: DJGPP OPTIMIZATIONS 1/2

(talking about *256 vs <<8)

HS>That basic example does not get optimized by most compilers, Darin.

Such as what compilers? Most compilers do. You'd be hard pressed to find
an even vaguely new compiler that doesn't. Strength reduction, constant
folding, etc. are classic optimizations.

Even my 8-year-old Microsoft Quick C, a compiler hardly known for
aggressive optimization, and designed to run on even an XT, will do that
conversion. I used to have a 1983 C compiler designed to run on a 64k
computer, and even it would do something like that, when it thought it
was appropriate.

Things like that are simply part of what's called 'an optimizing
compiler'. The optimizations they make are not always appropriate, but
they at least try to do things like that.
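For example, take a fragment like this (a made-up example; any multiply
by a constant power of two will do):

  unsigned scale(unsigned x)
  {
      /* a constant power-of-two multiply... */
      return x * 256;
      /* ...which an optimizing compiler will typically emit as
         x << 8, since the two are equivalent for unsigned values */
  }

The compiler is free to pick whichever form is cheaper on the target
CPU; that substitution is exactly what strength reduction means.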
HS>Most compilers optimize code well, but just about all don't optimize
HS>the fastest.

They _can't_. There are simply too many variations in the CPUs. An Intel
486 is going to behave very differently than a Cyrix 486, or a 486DLC,
etc. They may all be '486 compatible', but they all behave differently
and run code with different cycles per instruction. There is simply no
way they can optimize for all of the possible clones, plus differences
in cache size, performance, write back vs. write through, etc.

All they can do is either aim towards the middle, or optimize for a very
specific CPU/platform combination. If you are compiling for a line with
only a few variations (like the 68k line), then optimization is much
simpler and more consistent. But with the x86 line, with the Intel and
clone chips, there are 3 or 4 different 8086s, a couple of 8088s,
several 80286s, 7 or 8 80386 versions, perhaps a dozen different 486s,
4 or 5 '586'-class chips, a couple of 686-class chips, and dozens of
different motherboard designs, each with different cache behavior and
size. You just can't optimize very well because there is no single point
to aim towards. It's like what a shotgun does compared to a rifle.

Even if you limit yourself to just a single 'number', such as a 386 or
486, there is still such an incredibly wide variety of behavior and
performance that it simply is not possible to optimize the code for all
versions. A Cyrix 486, an Intel 486, and an AMD 486 are all going to
have different cycle times for their instructions. Code that runs
optimally on one can suffer a major performance penalty on another.

As an example, my Cyrix 486 multiplies much, much faster than a standard
Intel 486. (There are other cases, of course, and there are cases where
mine is slower than a standard 486.) Mine can do a 32 bit multiply in 7
cycles (and an 8 or 16 bit one in 3 cycles). A standard 486 takes a
minimum of 13 cycles for any size multiply, and as many as 42 cycles for
a 32 bit mul.

If I optimize my code to do multiplications, in some program of some
sort, then on other computers the performance could be significantly
slower. Perhaps I use multi-dimensional arrays whose sizes aren't powers
of two, so the indices have to be multiplied. Or maybe there is some
calculation where I could get reasonable results if I did things as
powers of two (for shifts), but instead I know my CPU can do multiplies
fast and go ahead and do it directly. Or maybe I'm performing graphics
calculations and tune them for my CPU's ability to do quick multiplies.

HS>That's the kind of code that I often see in source code.
HS>Not very fast, is it?

Unless it's in a loop, the odds are very good that it's good enough. In
my testing, with my compiler and CPU, it runs at 4/11ths the speed of
the best one. If this is only done once, or even a few hundred times,
the speed improvement is likely to be negligible compared to the rest of
the program's execution time.

HS>Watcom results :-
HS>---------------------------+
HS>LOOP  |   386   |   486    |
HS>------|---------|----------+
HS>loop1 | 81.1 cy | 35.3 cy  |
HS>loop2 | 38.7 cy | 29.7 cy  |
HS>loop3 | 15.9 cy |  9.5 cy  |
HS>---------------------------+

Well.... I ran your examples of optimizations. Rather than doing a cycle
count, I just did bytes per second, which is easier to calculate and
shows relative performance just as well. If you want cycles per second,
I'm running a 66MHz 486, so feel free...

I used DJGPP 2.6.3. (Because of timing resolution, the bytes/sec results
do vary between runs, but stay within 1% of the reported results.)

Example 1 (with the array indexing) did 4,000,983 bytes per second.
That's about 16.5 cycles per byte.

Example 2 (with the pointers) did 3,792,598 bytes per second. This is
about 18 cycles per byte.

Notice that in this case, using pointers instead of the simpler array
indexing is _SLOWER_, even though your timings above show it to be
faster. This is an excellent example of how the classic 'wisdom' that
'always using a pointer is faster' simply isn't true.
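For reference, the two styles being timed are loops of essentially this
shape (my reconstruction; they match the memcpy() versions quoted in my
LOW LEVEL OPTIMIZATIONS reply):

  /* Example 1: array indexing */
  void copy_index(char to[], char from[], int len)
  {
      while (len--)
          to[len] = from[len];
  }

  /* Example 2: pointers */
  void copy_ptr(char *to, char *from, int len)
  {
      while (len--)
          *to++ = *from++;
  }

Which one wins depends on what the compiler does with the address
calculations on your particular CPU, not on any general rule about
pointers being faster.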
Example 3 (with pointers unrolled 4 times) got 5,688,898 bytes per
second. That's about 11.8 cycles per byte.

HS>Now that I have taught you, you may actually want to read the thread
HS>and then come back with your apology.

I then did three of my own improvements.

The first one: I felt it was 'unfair' of you to unroll the pointer
example 4 times but not the indexing one. So, I unrolled it 4 times, and
I got 6,150,160 bytes per second. That's about 10.5 cycles per byte. And
again, the index version is faster than the pointer version.

The second optimization: I noticed that you were still transferring a
byte at a time. That's rather slow. So, I changed your unrolled pointer
loop to 32 bit integers and got 11,168,389 bytes transferred per second.
That's double your best optimization on my computer, and it works out to
about 6 cycles per byte transferred. (You would also need to deal with
making sure the data is aligned properly, since misalignment is a major
performance killer, and on some computers even fatal. But it can be
dealt with.)
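The idea is nothing more than this (a sketch: it assumes len is a
multiple of 4 and that both pointers are already 4-byte aligned; a real
version has to fix up misalignment and copy any leftover bytes):

  /* move 32 bits (one long, under DJGPP) per iteration instead of
     one byte, so there is a quarter of the loop overhead per byte */
  void copy_long(char *to, char *from, int len)
  {
      long *t = (long *)to;
      long *f = (long *)from;
      int n = len / 4;

      while (n--)
          *t++ = *f++;
  }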
The next optimization is, of course, the most obvious one.... I used
memcpy(). I did it just as a way of benchmarking the performances. And,
within my timing resolution, I got very similar results to my unrolled
loop above. (There were enough cases where the integer unrolled loop was
faster than the inline rep movsd of memcpy() that I'm not entirely sure
it was timing resolution problems.) (And when the data can fit entirely
into the cache, this drops down to about 1.5 cycles per byte. On a
Pentium, it would be around 1/3rd of a cycle per byte.) When the data
fit entirely into the L2, and especially the L1 cache, memcpy() was
faster. The rest of the time, they were the same.

(I could have made a couple of additional optimizations. First, pre-warm
the cache, by making sure the data we are about to copy is actually in
the cache. That can reduce memory latency. The second might have been to
use the floating point registers (or the MMX registers), since they can
safely transfer 8 bytes at a time instead of just 4. Optimizations such
as those have been shown to give significant improvements in real code.
I didn't bother doing them because it wasn't worth the effort for an
example.)

There are several key points to this reply. And none of them are 'my
code / compiler is better than yours'. (I'm not at all questioning the
timing results you got on your system. I'm just pointing out that you
wouldn't get your results on my computer or compiler.)

(Continued to next message)

--- QScan/PCB v1.19b / 01-0162
 * Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100020
Date: 04/30/97  From: CAREY BLOODWORTH  Time: 09:55pm
\/To: HERMAN SCHONFELD  (Read 4 times)
Subj: DJGPP OPTIMIZATIONS 2/2

(Continued from previous message)

1) Although on your computer and compiler pointers are faster than
direct indexing, they may not be on other people's computers and
compilers. This will depend on the compiler optimizations available, the
register assignment the compiler makes, and the individual CPU itself.

2) You can become so involved in the low level details of your
particular implementation that you miss a better one (like switching to
moving words instead of bytes, or optimizing a bubble sort instead of
simply switching to a quicksort).

3) You can become so involved in the 'low level' stuff that you can miss
the obvious. This includes things like: using memcpy() (which is likely
to be optimized anyway), or reducing the number of times something needs
to be done rather than improving the code (like in finding a prime
number, where you only have to test up to the square root of the number
instead of all the way; see the sketch just after this list). Or,
continuing #2 above, maybe your data doesn't even need to be sorted
anyway.

4) If you aren't careful, an optimization that works fine on one
computer could be fatal on another (such as misaligned data on non-x86
computers; the word at a time copy is an example where that could
occur).

5) It's very easy to waste your time optimizing something that doesn't
need to be optimized, or where simply using a standard library function
is already better than what you have.

6) When I look at my data showing the results for different size buffers
and the cases where they are small enough to fit entirely into the L2
cache, and then the L1 cache, and write back vs. write through, etc., I
can see that much of the variation in results comes from data latency
rather than code efficiency.

7) And when you do finally do lower level optimizations, you have to
first ask yourself 1) what platform, 2) what CPU, and 3) what version of
that CPU.
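To illustrate the prime number example from point 3, here is a sketch
(the i <= n / i test is just a way of stopping at the square root
without calling sqrt() or risking overflow in i * i):

  /* any factor larger than the square root of n must pair with one
     smaller than it, so trial division can stop at the square root */
  int is_prime(unsigned long n)
  {
      unsigned long i;

      if (n < 2)
          return 0;
      for (i = 2; i <= n / i; i++)
          if (n % i == 0)
              return 0;
      return 1;
  }

Cutting the loop bound from n down to its square root gains far more
than any amount of hand-tuning inside the loop ever could.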
Wanting your code to run fast is an admirable trait in a programmer.
It's just that optimizing these days is simply not an absolute. You
can't depend on cycle counting, because the counts vary from processor
to processor, and with newer processors there are so many conditions and
notes in the programmer's manual about what can affect those counts that
there is no easy or sure way to know what the code will do in every
situation. What can be an improvement in one situation can be harmful in
another.

This is especially true when you consider 1) all the compilers that are
around and the different type and quality of code they generate, 2) the
wide variety of x86 clones and the different levels of performance they
have, and 3) the wide variety of CPUs (68k, SPARC, x86, PowerPC, etc.)
that the code may be run on.

It just isn't simple anymore. If you limit the code to the x86 line,
then you can make some more improvements. If you further limit it to
some particular version of an x86 CPU, then you can make even more.
Otherwise you have to aim somewhere towards the middle.

That's why I keep saying you are better off doing algorithmic
improvements. With everything else, there are _NO_ absolutes. It'll
depend very heavily on the compiler, the CPU, the cache, etc. And all of
those vary from person to person. You could even give me an executable
compiled and tuned for your 486 system, but it'll run differently on
mine.

--- QScan/PCB v1.19b / 01-0162
 * Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100021
Date: 04/30/97  From: CAREY BLOODWORTH  Time: 09:55pm
\/To: HERMAN SCHONFELD  (Read 4 times)
Subj: DJGPP OPTIMIZATIONS

HS>First of all, when I release the software I've been working on
HS>lately, there won't be one executable to suit all; there will be
HS>many, each compiled specifically for a targeted machine.

Few developers are willing to distribute and maintain a half dozen
different x86 executables, and the source changes for those
optimizations. Let's see.... you'd have to keep separate versions for
the 8086, 286, 386, original 486, later 486, Pentium, 586 clones, and
686/PPros.

HS>specifically for a targeted machine. It's absurd to think that not
HS>increasing about 20% overhead on a 486 because it might slow down
HS>execution ratio on Pentium.

It depends on your target. Considering that Pentium-class computers
outnumber 486-class computers, you are generally better off aiming
towards what most people actually _have_. That's why nobody optimizes
code targeted towards XT, 286, or 386 class computers anymore. All 3
combined make up less than 1% of all PCs in use.

--- QScan/PCB v1.19b / 01-0162
 * Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100022
Date: 04/30/97  From: CAREY BLOODWORTH  Time: 09:55pm
\/To: HERMAN SCHONFELD  (Read 4 times)
Subj: LOW LEVEL OPTIMIZATIO 1/2

HS>CB>void memcpy(char to[], char from[], int len)
HS>CB>{ while (len--) to[len]=from[len];}
HS>First of all, try unrolling the loop, and use pointer referencing.

Unrolling can help. Pointers slow it down. I've timed it. Contrary to
popular convention, pointers are _not_ automatically faster than
indexing.

HS>But according to you, unrolling is bad, then use Duff's Device.

In some cases, yes. In other cases, no. It depends a lot on how much the
loop does. This is especially true when you tell the compiler to unroll
loops with an optimization switch such as -O3 for DJGPP.

HS>CB>void memcpy(char *to, char *from, int len)
HS>CB>{ while (len--) *to++=*from++;}
HS>len need not be used like that in a while statement.
HS>It uses a register (which could be used for something else), so
HS>remove it and restructure your loop.

Yes, it does use a register. But even on the register-poor x86
architecture, most compilers will keep all three variables in registers.
Only my 8-year-old QC20 doesn't bother to keep _any_ of them in
registers.

--- QScan/PCB v1.19b / 01-0162
 * Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)
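For reference, the Duff's Device mentioned above is the classic
interleaving of a switch with an unrolled loop. A sketch, adapted here
for a memory-to-memory copy (Tom Duff's original wrote to a fixed output
register, and count must be greater than zero):

  void duff_copy(char *to, char *from, int count)
  {
      int n = (count + 7) / 8;      /* passes through the 8-way body */

      switch (count % 8) {          /* jump into the unrolled loop to
                                       handle the leftover bytes */
      case 0: do { *to++ = *from++;
      case 7:      *to++ = *from++;
      case 6:      *to++ = *from++;
      case 5:      *to++ = *from++;
      case 4:      *to++ = *from++;
      case 3:      *to++ = *from++;
      case 2:      *to++ = *from++;
      case 1:      *to++ = *from++;
              } while (--n > 0);
      }
  }

Because the case labels sit inside the do-while, the switch disposes of
the count % 8 stragglers on the first pass, so no separate cleanup loop
is needed.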