--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100015
Date: 04/30/97  From: ANTHONY TIBBS  Time: 03:49pm
\/To: CAREY BLOODWORTH  (Read 4 times)
Subj: DJGPP OPTIMIZATIONS 2/3

On Apr 29 22:18, 1997, Carey Bloodworth of 1:3822/1 wrote:

CB> (ranging in price from $100 C64's to quarter million dollar VAXes
CB> (back when $250K was a lot of money)), you could call me an "Old
CB> Timer". I haven't been around as long as some people in these echos,
CB> but I think it's safe to say I'm in the longest 10%.

Yep, by the sounds of it, you are!

Anthony

--- TriED 0.7a Narrow Gamma
 * Origin: World of Power BBS (1:163/545.15)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100016
Date: 04/28/97  From: OREN WEIL  Time: 12:10am
\/To: ALL  (Read 4 times)
Subj: File Shareing in C++

Hello All!

How can I make file sharing work in C++ with fstream?

Oren Weil (CoSysOp of ??? BBS, 972-2-6724473): orenw@geocities.com
http://orenw.home.ml.org
orenw@poboxes.com
orenw@hotmail.com [No Files], orenw@mail.snunit.k12.il [No Files]

---
 * Origin: -=> I ♥ Oren Weil's Point <=- (2:402/423.1)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100017
Date: 04/29/97  From: STEVE WESTCOT  Time: 01:07pm
\/To: ANDREAS NEUKOETTER  (Read 4 times)
Subj: C++++++++++C+++++

On <23 Apr, 21:02>, Andreas Neukoetter wrote to Todd Coup:

AN> DJGPP - Dos & OS/2
AN> GCC - Linux

Does DJGPP contain OS/2 libraries?

--- FMail 1.0g
 * Origin: Where It's At BBS, Kimberly, WI (414)788-8050 (1:139/615)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100018
Date: 04/30/97  From: MICHAEL LARSEN  Time: 10:50pm
\/To: ALL  (Read 4 times)
Subj: Bitmap in FrameWnd,- How...?

Hello All.

Is there a simple way to put a background bitmap in the white area of
the FrameWnd? I am using MS Visual C++ and my program works fine, but I
would love to include this little pep-up :) The bitmap is included in
the resource file.

Anyone?

Michael

---
 * Origin: This line is really a waste of precious bandwidth... (2:237/39.89)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100019
Date: 04/30/97  From: CAREY BLOODWORTH  Time: 09:55pm
\/To: HERMAN SCHONFELD  (Read 4 times)
Subj: DJGPP OPTIMIZATIONS 1/2

(talking about *256 vs <<8)

HS>That basic example does not get optimized by most compilers, Darin.

Such as what compilers? Most compilers do. You'd be hard pressed to find
an even vaguely new compiler that doesn't. Strength reduction, constant
folding, etc. are classic optimizations.

Even my 8-year-old Microsoft Quick C, a compiler hardly known for
aggressive optimization, and designed to run on even an XT, will do that
conversion. I used to have a 1983 C compiler designed to run on a 64k
computer, and even it would do something like that, when it thought it
was appropriate.

Things like that are simply part of what's called 'an optimizing
compiler'. The optimizations they make are not always appropriate, but
they at least try to do things like that.
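For example, take a fragment like this (a made-up example; any multiply
by a constant power of two will do):

  unsigned scale(unsigned x)
  {
      /* a constant power-of-two multiply... */
      return x * 256;
      /* ...which an optimizing compiler will typically emit as
         x << 8, since the two are equivalent for unsigned values */
  }

The compiler is free to pick whichever form is cheaper on the target
CPU; that substitution is exactly what strength reduction means.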
HS>Most compilers optimize code well, but just about all don't optimize
HS>the fastest.

They _can't_. There are simply too many variations in the CPUs. An Intel
486 is going to behave very differently than a Cyrix 486, or a 486DLC,
etc. They may all be '486 compatible', but they all behave differently
and run code with different cycles per instruction. There is simply no
way they can optimize for all of the possible clones, plus differences
in cache size, performance, write back vs. write through, etc.

All they can do is either aim towards the middle, or optimize for a very
specific CPU/platform combination. If you are compiling for a line with
only a few variations (like the 68k line), then optimization is much
simpler and more consistent. But with the x86 line, with the Intel and
clone chips, there are 3 or 4 different 8086s, a couple of 8088s,
several 80286s, 7 or 8 80386 versions, perhaps a dozen different 486s,
4 or 5 '586'-class chips, a couple of 686-class chips, and dozens of
different motherboard designs, each with different cache behavior and
size. You just can't optimize very well because there is no single point
to aim towards. It's like what a shotgun does compared to a rifle.

Even if you limit yourself to just a single 'number', such as a 386 or
486, there is still such an incredibly wide variety of behavior and
performance that it simply is not possible to optimize the code for all
versions. A Cyrix 486, an Intel 486, and an AMD 486 are all going to
have different cycle times for their instructions. Code that runs
optimally on one can suffer a major performance penalty on another.

As an example, my Cyrix 486 multiplies much, much faster than a standard
Intel 486. (There are other cases, of course, and there are cases where
mine is slower than a standard 486.) Mine can do a 32 bit multiply in 7
cycles (and an 8 or 16 bit one in 3 cycles). A standard 486 takes a
minimum of 13 cycles for any size multiply, and as many as 42 cycles for
a 32 bit mul.

If I optimize my code to do multiplications, in some program of some
sort, then on other computers the performance could be significantly
slower. Perhaps I use multi-dimensional arrays whose sizes aren't powers
of two, so the indices have to be multiplied. Or maybe there is some
calculation where I could get reasonable results if I did things as
powers of two (for shifts), but instead I know my CPU can do multiplies
fast and go ahead and do it directly. Or maybe I'm performing graphics
calculations and tune them for my CPU's ability to do quick multiplies.

HS>That's the kind of code that I often see in source code.
HS>Not very fast, is it?

Unless it's in a loop, the odds are very good that it's good enough. In
my testing, with my compiler and CPU, it runs at 4/11ths the speed of
the best one. If this is only done once, or even a few hundred times,
the speed improvement is likely to be negligible compared to the rest of
the program's execution time.

HS>Watcom results :-
HS>---------------------------+
HS>LOOP  |   386   |   486    |
HS>------|---------|----------+
HS>loop1 | 81.1 cy | 35.3 cy  |
HS>loop2 | 38.7 cy | 29.7 cy  |
HS>loop3 | 15.9 cy |  9.5 cy  |
HS>---------------------------+

Well.... I ran your examples of optimizations. Rather than doing a cycle
count, I just did bytes per second, which is easier to calculate and
shows relative performance just as well. If you want cycles per second,
I'm running a 66MHz 486, so feel free...

I used DJGPP 2.6.3. (Because of timing resolution, the bytes/sec results
do vary between runs, but stay within 1% of the reported results.)

Example 1 (with the array indexing) did 4,000,983 bytes per second.
That's about 16.5 cycles per byte.

Example 2 (with the pointers) did 3,792,598 bytes per second. This is
about 18 cycles per byte.

Notice that in this case, using pointers instead of the simpler array
indexing is _SLOWER_, even though your timings above show it to be
faster. This is an excellent example of how the classic 'wisdom' that
'always using a pointer is faster' simply isn't true.
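For reference, the two styles being timed are loops of essentially this
shape (my reconstruction; they match the memcpy() versions quoted in my
LOW LEVEL OPTIMIZATIONS reply):

  /* Example 1: array indexing */
  void copy_index(char to[], char from[], int len)
  {
      while (len--)
          to[len] = from[len];
  }

  /* Example 2: pointers */
  void copy_ptr(char *to, char *from, int len)
  {
      while (len--)
          *to++ = *from++;
  }

Which one wins depends on what the compiler does with the address
calculations on your particular CPU, not on any general rule about
pointers being faster.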
Example 3 (with pointers unrolled 4 times) got 5,688,898 bytes per
second. That's about 11.8 cycles per byte.

HS>Now that I have taught you, you may actually want to read the thread
HS>and then come back with your apology.

I then did three of my own improvements.

The first one: I felt it was 'unfair' of you to unroll the pointer
example 4 times but not the indexing one. So, I unrolled it 4 times, and
I got 6,150,160 bytes per second. That's about 10.5 cycles per byte. And
again, the index version is faster than the pointer version.

The second optimization: I noticed that you were still transferring a
byte at a time. That's rather slow. So, I changed your unrolled pointer
loop to 32 bit integers and got 11,168,389 bytes transferred per second.
That's double your best optimization on my computer, and it works out to
about 6 cycles per byte transferred. (You would also need to deal with
making sure the data is aligned properly, since misalignment is a major
performance killer, and on some computers even fatal. But it can be
dealt with.)
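The idea is nothing more than this (a sketch: it assumes len is a
multiple of 4 and that both pointers are already 4-byte aligned; a real
version has to fix up misalignment and copy any leftover bytes):

  /* move 32 bits (one long, under DJGPP) per iteration instead of
     one byte, so there is a quarter of the loop overhead per byte */
  void copy_long(char *to, char *from, int len)
  {
      long *t = (long *)to;
      long *f = (long *)from;
      int n = len / 4;

      while (n--)
          *t++ = *f++;
  }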
The next optimization is, of course, the most obvious one.... I used
memcpy(). I did it just as a way of benchmarking the performances. And,
within my timing resolution, I got very similar results to my unrolled
loop above. (There were enough cases where the integer unrolled loop was
faster than the inline rep movsd of memcpy() that I'm not entirely sure
it was timing resolution problems.) (And when the data can fit entirely
into the cache, this drops down to about 1.5 cycles per byte. On a
Pentium, it would be around 1/3rd of a cycle per byte.) When the data
fit entirely into the L2, and especially the L1 cache, memcpy() was
faster. The rest of the time, they were the same.

(I could have made a couple of additional optimizations. First, pre-warm
the cache, by making sure the data we are about to copy is actually in
the cache. That can reduce memory latency. The second might have been to
use the floating point registers (or the MMX registers), since they can
safely transfer 8 bytes at a time instead of just 4. Optimizations such
as those have been shown to give significant improvements in real code.
I didn't bother doing them because it wasn't worth the effort for an
example.)

There are several key points to this reply. And none of them are 'my
code / compiler is better than yours'. (I'm not at all questioning the
timing results you got on your system. I'm just pointing out that you
wouldn't get your results on my computer or compiler.)

(Continued to next message)

--- QScan/PCB v1.19b / 01-0162
 * Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100020
Date: 04/30/97  From: CAREY BLOODWORTH  Time: 09:55pm
\/To: HERMAN SCHONFELD  (Read 4 times)
Subj: DJGPP OPTIMIZATIONS 2/2

(Continued from previous message)

1) Although on your computer and compiler pointers are faster than
direct indexing, they may not be on other people's computers and
compilers. This will depend on the compiler optimizations available, the
register assignment the compiler makes, and the individual CPU itself.

2) You can become so involved in the low level details of your
particular implementation that you miss a better one (like switching to
moving words instead of bytes, or optimizing a bubble sort instead of
simply switching to a quicksort).

3) You can become so involved in the 'low level' stuff that you can miss
the obvious. This includes things like: using memcpy() (which is likely
to be optimized anyway), or reducing the number of times something needs
to be done rather than improving the code (like in finding a prime
number, where you only have to test up to the square root of the number
instead of all the way; see the sketch just after this list). Or,
continuing #2 above, maybe your data doesn't even need to be sorted
anyway.

4) If you aren't careful, an optimization that works fine on one
computer could be fatal on another (such as misaligned data on non-x86
computers; the word at a time copy is an example where that could
occur).

5) It's very easy to waste your time optimizing something that doesn't
need to be optimized, or where simply using a standard library function
is already better than what you have.

6) When I look at my data showing the results for different size buffers
and the cases where they are small enough to fit entirely into the L2
cache, and then the L1 cache, and write back vs. write through, etc., I
can see that much of the variation in results comes from data latency
rather than code efficiency.

7) And when you do finally do lower level optimizations, you have to
first ask yourself 1) what platform, 2) what CPU, and 3) what version of
that CPU.
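To illustrate the prime number example from point 3, here is a sketch
(the i <= n / i test is just a way of stopping at the square root
without calling sqrt() or risking overflow in i * i):

  /* any factor larger than the square root of n must pair with one
     smaller than it, so trial division can stop at the square root */
  int is_prime(unsigned long n)
  {
      unsigned long i;

      if (n < 2)
          return 0;
      for (i = 2; i <= n / i; i++)
          if (n % i == 0)
              return 0;
      return 1;
  }

Cutting the loop bound from n down to its square root gains far more
than any amount of hand-tuning inside the loop ever could.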
Wanting your code to run fast is an admirable trait in a programmer.
It's just that optimizing these days is simply not an absolute. You
can't depend on cycle counting, because the counts vary from processor
to processor, and with newer processors there are so many conditions and
notes in the programmer's manual about what can affect those counts that
there is no easy or sure way to know what the code will do in every
situation. What can be an improvement in one situation can be harmful in
another.

This is especially true when you consider 1) all the compilers that are
around and the different type and quality of code they generate, 2) the
wide variety of x86 clones and the different levels of performance they
have, and 3) the wide variety of CPUs (68k, SPARC, x86, PowerPC, etc.)
that the code may be run on.

It just isn't simple anymore. If you limit the code to the x86 line,
then you can make some more improvements. If you further limit it to
some particular version of an x86 CPU, then you can make even more.
Otherwise you have to aim somewhere towards the middle.

That's why I keep saying you are better off doing algorithmic
improvements. With everything else, there are _NO_ absolutes. It'll
depend very heavily on the compiler, the CPU, the cache, etc. And all of
those vary from person to person. You could even give me an executable
compiled and tuned for your 486 system, but it'll run differently on
mine.

--- QScan/PCB v1.19b / 01-0162
 * Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100021
Date: 04/30/97  From: CAREY BLOODWORTH  Time: 09:55pm
\/To: HERMAN SCHONFELD  (Read 4 times)
Subj: DJGPP OPTIMIZATIONS

HS>First of all, when I release the software I've been working on
HS>lately, there won't be one executable to suit all; there will be
HS>many, each compiled specifically for a targeted machine.

Few developers are willing to distribute and maintain a half dozen
different x86 executables, and the source changes for those
optimizations. Let's see.... you'd have to keep separate versions for
the 8086, 286, 386, original 486, later 486, Pentium, 586 clones, and
686/PPros.

HS>specifically for a targeted machine. It's absurd to think that not
HS>increasing about 20% overhead on a 486 because it might slow down
HS>execution ratio on Pentium.

It depends on your target. Considering that Pentium-class computers
outnumber 486-class computers, you are generally better off aiming
towards what most people actually _have_. That's why nobody optimizes
code targeted towards XT, 286, or 386 class computers anymore. All 3
combined make up less than 1% of all PCs in use.

--- QScan/PCB v1.19b / 01-0162
 * Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)

--------------- FIDO MESSAGE AREA==> TOPIC: 203   C++   Ref: E5100022
Date: 04/30/97  From: CAREY BLOODWORTH  Time: 09:55pm
\/To: HERMAN SCHONFELD  (Read 4 times)
Subj: LOW LEVEL OPTIMIZATIO 1/2

HS>CB>void memcpy(char to[], char from[], int len)
HS>CB>{ while (len--) to[len]=from[len];}
HS>First of all, try unrolling the loop, and use pointer referencing.

Unrolling can help. Pointers slow it down. I've timed it. Contrary to
popular convention, pointers are _not_ automatically faster than
indexing.

HS>But according to you, unrolling is bad, then use Duff's Device.

In some cases, yes. In other cases, no. It depends a lot on how much the
loop does. This is especially true when you tell the compiler to unroll
loops with an optimization switch such as -O3 for DJGPP.

HS>CB>void memcpy(char *to, char *from, int len)
HS>CB>{ while (len--) *to++=*from++;}
HS>len need not be used like that in a while statement.
HS>It uses a register (which could be used for something else), so
HS>remove it and restructure your loop.

Yes, it does use a register. But even on the register-poor x86
architecture, most compilers will keep all three variables in registers.
Only my 8-year-old QC20 doesn't bother to keep _any_ of them in
registers.

--- QScan/PCB v1.19b / 01-0162
 * Origin: Jackalope Junction 501-785-5381 Ft Smith AR (1:3822/1)
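For reference, the Duff's Device mentioned above is the classic
interleaving of a switch with an unrolled loop. A sketch, adapted here
for a memory-to-memory copy (Tom Duff's original wrote to a fixed output
register, and count must be greater than zero):

  void duff_copy(char *to, char *from, int count)
  {
      int n = (count + 7) / 8;      /* passes through the 8-way body */

      switch (count % 8) {          /* jump into the unrolled loop to
                                       handle the leftover bytes */
      case 0: do { *to++ = *from++;
      case 7:      *to++ = *from++;
      case 6:      *to++ = *from++;
      case 5:      *to++ = *from++;
      case 4:      *to++ = *from++;
      case 3:      *to++ = *from++;
      case 2:      *to++ = *from++;
      case 1:      *to++ = *from++;
              } while (--n > 0);
      }
  }

Because the case labels sit inside the do-while, the switch disposes of
the count % 8 stragglers on the first pass, so no separate cleanup loop
is needed.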