Part II
An NT example: li wb_maf_full_replays
Problem:
the SPEC benchmark "li" encounters many wb_maf_full_replays.
Note the different cache structures of the systems:
   | pca56              | ev56
L1 | 16+8KB I+D on chip | 8+8KB I+D on chip
L2 | 1MB off chip       | 96KB on chip
L3 | none               | 8MB off chip
Here's a sample portion of the pca56 output:
Cycle=cycles
PDry=pipe_dry
MfRe=wb_maf_full_replays
IMis=icache_miss
DMi=dcache_miss
Ld=loads
LMe=loads_merged
BWri=bcache_write
BWHi=bcache_write_hit
xlsave:
Address Instruction Cycle PDry MfRe IMis DMi Ld LMe BWri BWHi
00406480 xor sp,zero,sp 4024 520 5 5 3 411 950
00406484 ldq t2,0x18(sp) 7353 936 6 2 195 8325 7 662 2337
00406488 stl v0,0(s0) 8417 2967 2534 1 486 723 2811
0040648C stl t1,0(v0) 14511 6781 5173 5 1 712 1 1216 5079
Note the large number of wb_maf_full_replays on the 3rd and 4th instructions above.
Install Perl
On the ev56, the first order of business is to install the NT resource kit,
to get perl:
E:\li> dir e:\ntreskit\perl\perl.exe
Volume in drive E is nt4sp2vc5rc12
Volume Serial Number is 9457-96EB
Directory of e:\ntreskit\perl
03/01/96 12:00a 124,928 PERL.EXE
1 File(s) 124,928 bytes
1,196,438,528 bytes free
E:\li> perl -e "print 123456;"
The name specified is not recognized as an
internal or external command, operable program or batch file.
New windows have the path automatically included, but windows
that existed before the kit was installed do not, so add the Perl directory to the path by hand:
E:\li> set path=%path%;e:\ntreskit\perl
E:\li> perl -e "print 123456;"
123456
E:\li>
Finding the Resource Kit
The NT Resource Kit used here was from the book
"Microsoft Windows NT Workstation Resource Kit"
Microsoft Press
ISBN 1-57231-343-9
$69.95 at Barnes & Noble
$47 when ordered via "Stream" part # 276206
VTX PCSOFTWARE
choice 6 US Orders
<return> 13 times to save instructions
Save it, fill it out, email it.
------------------------------------------------------------------
|STREAM PART # | DESCRIPTION |QUANTITY|
------------------------------------------------------------------
276206 MS Win NT Resource Kit V4.0 Workstation 4
Other useful utilities on the NT resource kit include tools such as "sleep", "kill", "ls", "wc", "vi", and "timethis":
E:\li> timethis sleep 2
The name specified is not recognized as an
internal or external command, operable program or batch file.
E:\li> set path=%path%;e:\ntreskit
E:\li> timethis sleep 1
TimeThis : Command Line : sleep 1
TimeThis : Start Time : Mon Jul 07 10:26:04 1997
TimeThis : Command Line : sleep 1
TimeThis : Start Time : Mon Jul 07 10:26:04 1997
TimeThis : End Time : Mon Jul 07 10:26:05 1997
TimeThis : Elapsed Time : 00:00:01.086
Populate a directory
For the benchmark to be tested, start with an empty directory
E:\> cd li
E:\li> dir
Volume in drive E is nt4sp2vc5rc12
Volume Serial Number is 9457-96EB
Directory of E:\li
07/07/97 09:44a <DIR> .
07/07/97 09:44a <DIR> ..
2 File(s) 0 bytes
1,236,467,200 bytes free
Add the benchmark, its input files (*.lsp), and a batch control file
E:\li> dir
Volume in drive E is nt4sp2vc5rc12
Volume Serial Number is 9457-96EB
Directory of E:\li
07/07/97 10:23a <DIR> .
07/07/97 10:23a <DIR> ..
07/07/97 10:22a 1,919 au.lsp
07/07/97 10:22a 13,498 boyer.lsp
07/07/97 10:22a 4,060 browse.lsp
07/07/97 10:22a 573 ctak.lsp
07/07/97 10:22a 2,289 dderiv.lsp
07/07/97 10:22a 1,209 deriv.lsp
07/07/97 10:22a 1,411 destru0.lsp
07/07/97 10:22a 1,411 destru1.lsp
07/07/97 10:22a 1,411 destru2.lsp
07/07/97 10:22a 1,563 destrum0.lsp
07/07/97 10:22a 1,563 destrum1.lsp
07/07/97 10:22a 1,563 destrum2.lsp
07/07/97 10:22a 1,279 div2.lsp
07/07/97 10:25a 158 do_li.bat
07/07/97 10:23a 231,936 li.exe
07/07/97 10:22a 5,192 puzzle0.lsp
07/07/97 10:22a 5,192 puzzle1.lsp
07/07/97 10:22a 623 tak0.lsp
07/07/97 10:22a 623 tak1.lsp
07/07/97 10:22a 623 tak2.lsp
07/07/97 10:22a 20,808 takr.lsp
07/07/97 10:22a 2,072 triang.lsp
07/07/97 10:22a 29 xit.lsp
25 File(s) 301,005 bytes
1,196,128,256 bytes free
Build it the right way…
We are using an existing executable, and not rebuilding it. But it is important to notice how it was built:
/link /debug /debugtype:coff
which is essential in order to be able to pick up addresses from the image. Support for other image types is a highly likely feature for later this summer, but in the meantime remember to use coff.
E:\li> type do_li.bat
timethis ".\li.exe *.lsp 1>tmp.out 2>tmp.err" > time.this
echo off
echo d
echo o
echo n
echo e
echo on
The funny printing of "done" is so I can see it from 6 feet away.
Try it, to get a base time:
E:\li>
E:\li>timethis ".\li.exe *.lsp 1>tmp.out 2>tmp.err" 1>time.this
E:\li>echo off
d
o
n
e
E:\li> move time.this no_iprobe.times
1 file(s) moved.
E:\li> type no_iprobe.times
TimeThis : Command Line : .\li.exe *.lsp 1>tmp.out 2>tmp.err
TimeThis : Start Time : Mon Jul 07 10:26:45 1997
TimeThis : Command Line : .\li.exe *.lsp 1>tmp.out 2>tmp.err
TimeThis : Start Time : Mon Jul 07 10:26:45 1997
TimeThis : End Time : Mon Jul 07 10:29:21 1997
TimeThis : Elapsed Time : 00:02:36.094
Get IPROBE
Install IPROBE (using the newest experimental version, which has the "-command" feature):
E:\li> cd \
E:\> mkdir iprobe_newest_7jul
E:\> cd iprobe_newest_7jul
E:\iprobe_newest_7jul> ftp
ftp> op 16.31.144.83
Connected to 16.31.144.83.
220 perf.zko.dec.com FTP server (Digital UNIX Version 5.60) ready.
User (16.31.144.83:(none)): anonymous
331 Guest login ok, send ident as password.
Password:
ftp> cd pub
ftp> dir
total 3
drwxr-xr-x 2 9246 512 512 Jul 7 09:37 IprobeKits
drwxr-xr-x 2 9139 512 512 Apr 11 07:12 gaertner
drwxr-xr-x 2 6562 15 512 Jun 17 05:45 henning
ftp> cd IprobeKits
ftp> dir
total 7111
-rw-r--r-- 1 9246 512 57856 Apr 10 14:49 Api.doc
-rw-r--r-- 1 9246 512 42242 Apr 3 09:03 IprNew.mod
-rw-r--r-- 1 9246 512 272356 Apr 3 08:59 Iprobe.ps
-rw-r--r-- 1 9246 512 580608 Apr 3 08:59 Iprobe020.a
-rw-r--r-- 1 9246 512 96768 Apr 3 09:03 Iprobe020ProgrammingKit.bck
-rw-r--r-- 1 9246 512 1075614 Apr 3 08:59 Iprobe021Osf.tar.Z
-rw-r--r-- 1 9246 512 692920 Apr 3 08:59 Iprobe0221Unix40.tar.Z
-rw-r--r-- 1 9246 512 548352 Apr 3 09:01 IprobeVms021.a
-rw-r--r-- 1 9246 512 677376 May 5 13:02 IprobeVms022.a
-rw-r--r-- 1 9246 512 735003 Apr 3 09:01 Nt35IprobeT21.zip
-rw-r--r-- 1 9246 512 304899 Apr 3 09:01 Nt35IprobeT21Update.zip
-rw-r--r-- 1 9246 512 496452 Apr 3 09:02 Nt40IprobeT22Ev4.zip
-rw-r--r-- 1 9246 512 505099 Jun 12 12:37 Nt40IprobeT23Ev5.zip
-rw-r--r-- 1 9246 512 346312 Jun 10 06:42 TurboLaserBusMonitorUnix.tar.Z
-rw-r--r-- 1 9246 512 185344 Apr 3 09:02 UnzipAxp.exe
-rw-r--r-- 1 9246 512 4629 Jun 10 07:08 WhatsHere.txt
-rw-r--r-- 1 9246 512 506252 Jul 7 09:37 nt40_iprobet2_3_ev5_experimental.zip
ftp> bin
200 Type set to I.
ftp> get UnzipAxp.exe
ftp> get nt40_iprobet2_3_ev5_experimental.zip
506252 bytes received in 0.69 seconds (735.83 Kbytes/sec)
ftp> bye
221 Goodbye.
Install it
E:\iprobe_newest_7jul> dir
Directory of E:\iprobe_newest_7jul
07/07/97 10:46a <DIR> .
07/07/97 10:46a <DIR> ..
07/07/97 10:46a 506,252 nt40_iprobet2_3_ev5_experimental.zip
07/07/97 10:45a 185,344 UnzipAxp.exe
4 File(s) 691,596 bytes
1,195,430,400 bytes free
E:\iprobe_newest_7jul> unzipaxp *zip
Archive: nt40_iprobet2_3_ev5_experimental.zip
inflating: install.bat
inflating: ipreduce.exe
inflating: iprkrnl.ini
inflating: iprkrnl.sys
inflating: iprobe.exe
inflating: iprobe_ps.exe
inflating: iprshr.dll
inflating: iprshr.exp
inflating: iprshr.lib
inflating: PSAPI.DLL
inflating: read1st.txt
inflating: regini.exe
inflating: rep.exe
E:\iprobe_newest_7jul> install
Copying Iprobe device driver (IprKrnl.sys) to E:\WINNT\system32\drivers ...
1 file(s) copied.
Copying Iprobe API library (IprShr.dll) to E:\WINNT\system32 ...
1 file(s) copied.
Copying Iprobe command and control user interface (Iprobe) to E:\WINNT\system32...
1 file(s) copied.
Copying Iprobe data reduction program (Ipreduce) to E:\WINNT\system32 ...
1 file(s) copied.
Copying Iprobe read entry points program (Rep) to E:\WINNT\system32 ...
1 file(s) copied.
Copying Iprobe show running processes program (Iprobe_PS) to E:\WINNT\system32 ...
1 file(s) copied.
Copying support libraries to E:\WINNT\system32 ...
1 file(s) copied.
Installing IprKrnl service in the registry ...
**********************************************************************
* *
* Installation complete. *
* *
* You need to reboot the system for the installation to take effect. *
* *
**********************************************************************
E:\iprobe_newest_7jul>
I think the message above is in error: ever since the IPROBE driver was made loadable, you don't really need to reboot. But just to be obedient, we'll reboot and then try iprobe.
Test That IPROBE was Installed
E:\li> iprobe -h
Unable to load library "E:\WINNT\System32\iprshr.dll"
Error = 45a
We forgot to wake up the driver.
Test again
E:\li> iprobe -help
...
Events defined on the current system -- select up to one event from each column:
issues single_issue_cycles long_stalls
cycles dual_issue_cycles branch_mispr
triple_issue_cycles pc_mispr
quad_issue_cycles icache_miss
split_issue_cycles dcache_miss
pipe_dry dtb_miss
pipe_frozen loads_merged
replay_trap ldu_replays
branches cycles
cond_branches itb_miss
jsr_ret wb_maf_full_replays
integer_ops external
float_ops mem_barrier_cycles
loads load_locked
stores scache_write
icache_access scache_miss
dcache_access scache_read_miss
scache_access scache_write_miss
scache_read scache_sh_write
scache_write bcache_miss
scache_victim sys_inv
bcache_hit sys_read_req
bcache_victim
sys_req
Note that the events defined here are NOT the same as the events on the pca56. So we'll measure a slightly different set of events. Here's a batch file to do a whole bunch of measurements:
E:\li> type do_a_bunch.bat
iprobe -method sample -command "do_li.bat" -output cyc_dry_maf.dat cycles pipe_dry wb_maf_full_replays
move time.this cyc_dry_maf.times
iprobe -method sample -command "do_li.bat" -output icache_miss.dat icache_miss
move time.this icache_miss.times
iprobe -method sample -command "do_li.bat" -output dcache_miss.dat loads dcache_miss
move time.this dcache_miss.times
iprobe -method sample -command "do_li.bat" -output loads.dat loads loads_merged
move time.this loads.times
iprobe -method sample -command "do_li.bat" -output bcache.dat bcache_hit bcache_miss
move time.this bcache.times
iprobe -method sample -command "do_li.bat" -output scache.dat scache_write scache_write_miss
move time.this scache.times
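The batch file above repeats one pattern per measurement: run iprobe with a group of events, then rename time.this so the next run does not overwrite it. As a rough sketch only (the table name RUNS and the helper batch_lines are my own, hypothetical names, and nothing here is part of IPROBE or the harness), the same file could be generated from a small table, which makes it harder to mistype an output name:

```python
# Measurement groups copied from do_a_bunch.bat: output stem -> events.
# RUNS and batch_lines are illustrative names, not part of IPROBE.
RUNS = [
    ("cyc_dry_maf", ["cycles", "pipe_dry", "wb_maf_full_replays"]),
    ("icache_miss", ["icache_miss"]),
    ("dcache_miss", ["loads", "dcache_miss"]),
    ("loads", ["loads", "loads_merged"]),
    ("bcache", ["bcache_hit", "bcache_miss"]),
    ("scache", ["scache_write", "scache_write_miss"]),
]


def batch_lines(runs, command="do_li.bat"):
    """Return the batch-file lines for each (stem, events) pair."""
    lines = []
    for stem, events in runs:
        # Only the iprobe switches that appear in the transcript are used.
        lines.append(
            f'iprobe -method sample -command "{command}" '
            f"-output {stem}.dat {' '.join(events)}"
        )
        # Keep the timethis output for this run under a matching name.
        lines.append(f"move time.this {stem}.times")
    return lines


if __name__ == "__main__":
    print("\n".join(batch_lines(RUNS)))
```

Redirecting the printed lines into a .bat file should reproduce the file shown above.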
Do the measurements
E:\li> do_a_bunch
E:\li>iprobe -method sample -command "do_li.bat" -output cyc_dry_maf.dat cycles pipe_dry wb_maf_full_replays
Node name : PRF07
OS : Microsoft Windows NT Version 4.0
CPU count : 1
Model : DEC-00Alcor
Page count : 81916
Pagelength : 8192
Counter count : 3
cycles : Low frequency skip 0 interrupts between samples
pipe_dry : Low frequency skip 0 interrupts between samples
wb_maf_full_replays : Low frequency skip 0 interrupts between samples
Current time : Mon Jul 07 12:16:22 1997
Start time: : immediate
Duration : 0 (until user interrupts)
Interval : 1
Method : sample
Measured Modes : all modes
Measured Data : ps pc pid ctr
Output file : cyc_dry_maf.dat
Buffer_count : 3
Buffer_size : 8192
Start of sampling
Started PID 0x0000006f via command line
E:\li>timethis ".\li.exe *.lsp 1>tmp.out 2>tmp.err" 1>time.this
E:\li>echo off
d
o
n
e
End of sampling
Buffers written : 3092
Partial buffers written: 1
E:\li>move time.this cyc_dry_maf.times
...
Confirm file sizes and run times
E:\li> dir *.dat
Volume in drive E is nt4sp2vc5rc12
Volume Serial Number is 9457-96EB
Directory of E:\li
07/07/97 12:29p 73,844 bcache.dat
07/07/97 12:19p 25,329,780 cyc_dry_maf.dat
07/07/97 12:24p 5,242,996 dcache_miss.dat
07/07/97 12:21p 1,024,116 icache_miss.dat
07/07/97 12:26p 4,218,996 loads.dat
07/07/97 12:32p 1,417,332 scache.dat
6 File(s) 37,307,064 bytes
1,156,445,184 bytes free
E:\li> findstr laps *times
bcache.times:TimeThis : Elapsed Time : 00:02:34.422
cyc_dry_maf.times:TimeThis : Elapsed Time : 00:02:44.922
dcache_miss.times:TimeThis : Elapsed Time : 00:02:38.429
icache_miss.times:TimeThis : Elapsed Time : 00:02:35.344
loads.times:TimeThis : Elapsed Time : 00:02:39.852
no_iprobe.times:TimeThis : Elapsed Time : 00:02:36.094
scache.times:TimeThis : Elapsed Time : 00:02:36.039
The times running with and without IPROBE are quite close to each other.
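To put a number on "quite close", here is a small sketch that turns the findstr output above into seconds and compares each run against the no_iprobe baseline. The function names are my own and the parsing assumes exactly the line layout shown in the transcript. By my arithmetic the slowest run (cyc_dry_maf, the one with three counters) comes out a bit under 6% slower than the baseline, and the rest are within about 2.5% either way.

```python
import re

# Elapsed-time lines copied from the findstr output in the transcript.
FINDSTR_OUTPUT = """\
bcache.times:TimeThis : Elapsed Time : 00:02:34.422
cyc_dry_maf.times:TimeThis : Elapsed Time : 00:02:44.922
dcache_miss.times:TimeThis : Elapsed Time : 00:02:38.429
icache_miss.times:TimeThis : Elapsed Time : 00:02:35.344
loads.times:TimeThis : Elapsed Time : 00:02:39.852
no_iprobe.times:TimeThis : Elapsed Time : 00:02:36.094
scache.times:TimeThis : Elapsed Time : 00:02:36.039
"""

LINE = re.compile(
    r"^(\w+)\.times:TimeThis\s*:\s*Elapsed Time\s*:\s*(\d+):(\d+):(\d+\.\d+)$"
)


def elapsed_seconds(text):
    """Map each run name to its elapsed time in seconds."""
    times = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            name, hours, minutes, seconds = match.groups()
            times[name] = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return times


def overhead_percent(times, baseline="no_iprobe"):
    """Percent slowdown of each run relative to the baseline run."""
    base = times[baseline]
    return {
        name: 100.0 * (t - base) / base
        for name, t in times.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, pct in sorted(overhead_percent(elapsed_seconds(FINDSTR_OUTPUT)).items()):
        print(f"{name:12s} {pct:+.2f}%")
```

A few runs come out slightly faster than the baseline, which suggests the differences are on the order of ordinary run-to-run noise.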
Pick up John's data reduction harness...
E:\li> ftp 16.31.144.83
Connected to 16.31.144.83.
220 perf.zko.dec.com FTP server (Digital UNIX Version 5.60) ready.
User (16.31.144.83:(none)): anonymous
331 Guest login ok, send ident as password.
Password:
230 Guest login ok, access restrictions apply.
ftp> cd pub
250 CWD command successful.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls (16.31.144.16,1036).
total 3
drwxr-xr-x 2 9246 512 512 Jul 7 09:37 IprobeKits
drwxr-xr-x 2 9139 512 512 Apr 11 07:12 gaertner
drwxr-xr-x 2 6562 15 512 Jun 17 05:45 henning
226 Transfer complete.
202 bytes received in 0.27 seconds (0.74 Kbytes/sec)
ftp> cd henning
250 CWD command successful.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls (16.31.144.16,1037).
total 24
-rwxr-xr-x 1 6562 15 23686 Jul 7 11:56 harness.pl
226 Transfer complete.
76 bytes received in 0.02 seconds (3.30 Kbytes/sec)
ftp> get harness.pl
200 PORT command successful.
150 Opening ASCII mode data connection for harness.pl (16.31.144.16,1038) (23686 bytes).
226 Transfer complete.
24422 bytes received in 0.13 seconds (195.38 Kbytes/sec)
ftp> bye
221 Goodbye.
E:\li>
Hanging rep: same workaround as Unix
Unfortunately the harness hangs when the rep command is issued:
E:\li> perl harness.pl -x li.exe -e bcache_miss -d bcache -s
Os=NT
Running rep to create addresses.resolved
rep li.exe
^C
at harness.pl line 70.
As in Unix, the workaround is to rep by PID:
E:\li> start .\li.exe *.lsp 1>tmp.out 2>tmp.err
E:\li> rep -pid 117
PID: 0x00000075 Base: 00400000 Size: 0x00038000 Name: li.exe
** VA range: 0x00400000:0x00438000 (000229376) Symbols: 0000757 Name: E:\li\li.exe
** VA range: 0x77f00000:0x77f96000 (000614400) Symbols: 0000990 Name: E:\WINNT\System32\ntdll.dll
** VA range: 0x77e60000:0x77ef4000 (000606208) Symbols: 0000682 Name: E:\WINNT\system32\KERNEL32.dll
** Base VA 0x80080000 Symbols: 0001035 Name: \WINNT\System32\ntoskrnl.exe
. . .
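The rep listing is what lets a sampled PC be tied back to an image. A minimal sketch of that lookup, using the three user-mode VA ranges printed above; I am assuming the end address is exclusive, since end minus start equals the size rep prints in parentheses, and the names RANGES and image_for_pc are hypothetical, not anything rep or ipreduce exposes:

```python
# (start, end, image) triples copied from the rep output above.
# Treating "end" as exclusive is an assumption: end - start matches
# the decimal size rep prints in parentheses.
RANGES = [
    (0x00400000, 0x00438000, r"E:\li\li.exe"),
    (0x77F00000, 0x77F96000, r"E:\WINNT\System32\ntdll.dll"),
    (0x77E60000, 0x77EF4000, r"E:\WINNT\system32\KERNEL32.dll"),
]


def image_for_pc(pc, ranges=RANGES):
    """Return the image whose VA range contains pc, or None."""
    for start, end, name in ranges:
        if start <= pc < end:
            return name
    return None


if __name__ == "__main__":
    # rep was given the PID in decimal and echoes it in hex.
    print(hex(117))
    # An instruction address from the pca56 xlsave listing earlier on.
    print(image_for_pc(0x00406488))
    # Sizes implied by the ranges, for comparison with rep's figures.
    for start, end, name in RANGES:
        print(name, end - start)
```

The same check explains why `rep -pid 117` reports PID 0x00000075: 117 decimal is 0x75.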
Reduce a small event
Let's try the harness on a sample event - picking the smallest .dat file for a quick test.
Use "-s" to only produce a summary report
E:\li> perl harness.pl -x li.exe -e bcache_miss -d bcache -s
Os=NT
Generating top-level report for bcache_miss
ipreduce -input_file bcache.dat -output_file bcache_miss.rpt -event bcache_miss
-pthresh 1
No reports for counter 2 (Bcache Miss)
Reason: no samples matched with specified command line switches
Filtered sample count is zero, all reports suppressed
can't open bcache_miss.rpt at harness.pl line 265.
Apparently there were no Bcache misses. OK, let's look at hits-
E:\li> perl harness.pl -x li.exe -e bcache_hit -d bcache -s
Os=NT
Generating top-level report for bcache_hit
ipreduce -input_file bcache.dat -output_file bcache_hit.rpt -event bcache_hit
-pthresh 1
E:\li> dir *hot*
Volume in drive E is nt4sp2vc5rc12
Volume Serial Number is 9457-96EB
Directory of E:\li
07/07/97 02:32p 593 bcache_hit.hot_routines
1 File(s) 593 bytes
1,154,143,744 bytes free
E:\li> cat *hot*
Hot Routines for bcache_hit -pthresh 1
Events % Routine Image Addr
1890 44 mark E:\li\li.exe 405590:40580F
1180 27 xlminit E:\li\li.exe 405AC0:405E1F
233 5 xlsave E:\li\li.exe 4063F0:40681F
195 5 xleval E:\li\li.exe 405E20:405EFF
142 3 cons E:\li\li.exe 4048B0:40495F
91 2 xlabind E:\li\li.exe 406150:4063EF
53 1 xlygetvalue E:\li\li.exe 40E750:40E7AF
43 1 consd E:\li\li.exe 404A00:404A9F
Try to look at routines
The summary level information looks good. Re-enter the same command without the "-s", and the harness proceeds to disassemble the benchmark and add detailed reports by routines.
E:\li> perl harness.pl -x li.exe -e bcache_hit -d bcache
Os=NT
ipreduce -o bcache_hit_mark.rpd -d pc -event bcache_hit -input_file bcache.dat
-pc 405590:40580F
Generating disassembly of E:\li\li.exe
dumpbin/disasm E:\li\li.exe > E:\li\li.asm
Extracting mark from E:\li\li.asm
Annotating mark
ipreduce -o bcache_hit_xlminit.rpd -d pc -event bcache_hit -input_file
bcache.dat -pc 405AC0:405E1F
Extracting xlminit from E:\li\li.asm
Annotating xlminit
ipreduce -o bcache_hit_xlsave.rpd -d pc -event bcache_hit -input_file
bcache.dat -pc 4063F0:40681F
Extracting xlsave from E:\li\li.asm
Annotating xlsave
ipreduce -o bcache_hit_xleval.rpd -d pc -event bcache_hit -input_file
bcache.dat -pc 405E20:405EFF
Extracting xleval from E:\li\li.asm
Annotating xleval
ipreduce -o bcache_hit_cons.rpd -d pc -event bcache_hit -input_file
bcache.dat -pc 4048B0:40495F
Extracting cons from E:\li\li.asm
Annotating cons
ipreduce -o bcache_hit_xlabind.rpd -d pc -event bcache_hit -input_file
bcache.dat -pc 406150:4063EF
Extracting xlabind from E:\li\li.asm
Annotating xlabind
ipreduce -o bcache_hit_xlygetvalue.rpd -d pc -event bcache_hit -input_file
bcache.dat -pc 40E750:40E7AF
Extracting xlygetvalue from E:\li\li.asm
Annotating xlygetvalue
ipreduce -o bcache_hit_consd.rpd -d pc -event bcache_hit -input_file
bcache.dat -pc 404A00:404A9F
Extracting consd from E:\li\li.asm
Annotating consd
Throw away .dis files without fear
Everything seems to be in working order. But the annotated disassemblies will now have bcache_hit as the first column, since that was the first event we reduced. It seems preferable to have cycles be the first event, so throw away the annotated disassemblies.
E:\li> del *dis
Note: when the harness says "Extracting mumble", it creates mumble.dib, which contains the instructions for that routine WITHOUT any event annotations. When events are added, the .dis files are created or updated. We are not deleting the extracts, just the annotations.
E:\li> dir *.dib
Volume in drive E is nt4sp2vc5rc12
Volume Serial Number is 9457-96EB
Directory of E:\li
07/07/97 02:33p 1,857 cons.dib
07/07/97 02:33p 1,693 consd.dib
07/07/97 02:33p 6,612 mark.dib
07/07/97 02:33p 6,944 xlabind.dib
07/07/97 02:33p 2,348 xleval.dib
07/07/97 02:33p 2,188 xlminit.dib
07/07/97 02:33p 2,184 xlsave.dib
07/07/97 02:33p 1,041 xlygetvalue.dib
8 File(s) 24,867 bytes
1,152,613,888 bytes free
Let's go for the big event: cycles
E:\li> perl harness.pl -x li -e cycles -d cyc_dry_maf
Os=NT
Generating top-level report for cycles
ipreduce -input_file cyc_dry_maf.dat -output_file cycles.rpt -event cycles
-pthresh 1
100 buffers...
200 buffers...
300 buffers...
400 buffers...
500 buffers...
600 buffers...
700 buffers...
800 buffers...
900 buffers...
1000 buffers...
1100 buffers...
1200 buffers...
1300 buffers...
1400 buffers...
1500 buffers...
1600 buffers...
1700 buffers...
1800 buffers...
1900 buffers...
2000 buffers...
2100 buffers...
2200 buffers...
2300 buffers...
2400 buffers...
2500 buffers...
2600 buffers...
2700 buffers...
2800 buffers...
2900 buffers...
3000 buffers...
ipreduce -o cycles_mark.rpd -d pc -event cycles -input_file cyc_dry_maf.dat
-pc 405590:40580F
Annotating mark
ipreduce -o cycles_xlsave.rpd -d pc -event cycles -input_file cyc_dry_maf.dat
-pc 4063F0:40681F
Annotating xlsave
ipreduce -o cycles_xlygetvalue.rpd -d pc -event cycles -input_file
cyc_dry_maf.dat -pc 40E750:40E7AF
^C
at harness.pl line 359, <HOT> line 5.
Oops. Note that the first ipreduce command used "-pthresh 1". That's the default, to include routines that generate at least 1% of the events; but it seems preferable to use 5% this time. Kill the script, throw away the top level (1%) reports and start again.
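To see what the threshold does, here is a sketch that applies a percentage cutoff to the bcache_hit counts from the earlier 1% listing. The total of 4300 events is my own assumption, a round figure picked because it is consistent with the percentages printed there; ipreduce does not show the total in that listing, and its exact comparison and rounding rules are also assumed here.

```python
# Per-routine event counts copied from the bcache_hit hot-routines listing.
BCACHE_HIT = [
    ("mark", 1890),
    ("xlminit", 1180),
    ("xlsave", 233),
    ("xleval", 195),
    ("cons", 142),
    ("xlabind", 91),
    ("xlygetvalue", 53),
    ("consd", 43),
]

# ASSUMPTION: the real total is not printed in the listing. 4300 is a
# hypothetical round figure consistent with the printed percentages.
ASSUMED_TOTAL = 4300


def hot_routines(counts, total, pthresh):
    """Keep routines with at least pthresh percent of all events.

    Returns (events, rounded percent, routine) rows in input order.
    The >= comparison and the rounding are assumptions about ipreduce.
    """
    rows = []
    for name, events in counts:
        pct = 100.0 * events / total
        if pct >= pthresh:
            rows.append((events, round(pct), name))
    return rows


if __name__ == "__main__":
    for pthresh in (1, 5):
        print(f"-pthresh {pthresh}")
        for events, pct, name in hot_routines(BCACHE_HIT, ASSUMED_TOTAL, pthresh):
            print(f"{events:6d} {pct:3d} {name}")
```

Under these assumptions xleval, shown as 5% in the 1% listing, works out to about 4.5% and so drops off at a 5% threshold, leaving mark, xlminit and xlsave.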
Reduce all counters - from the top with a new pthresh
E:\li> del *hot*
E:\li> perl harness.pl -x li -p 5 -e cycles -d cyc_dry_maf
Os=NT
Generating top-level report for cycles
ipreduce -input_file cyc_dry_maf.dat -output_file cycles.rpt -event cycles
-pthresh 5
ipreduce -o cycles_mark.rpd -d pc -event cycles -input_file cyc_dry_maf.dat
-pc 405590:40580F
Annotating mark
ipreduce -o cycles_xlsave.rpd -d pc -event cycles -input_file cyc_dry_maf.dat
-pc 4063F0:40681F
Annotating xlsave
ipreduce -o cycles_xlygetvalue.rpd -d pc -event cycles -input_file
cyc_dry_maf.dat -pc 40E750:40E7AF
Annotating xlygetvalue
ipreduce -o cycles_xlminit.rpd -d pc -event cycles -input_file cyc_dry_maf.dat
-pc 405AC0:405E1F
Annotating xlminit
ipreduce -o cycles_xleval.rpd -d pc -event cycles -input_file cyc_dry_maf.dat
-pc 405E20:405EFF
Annotating xleval
ipreduce -o cycles_xlxgetvalue.rpd -d pc -event cycles -input_file
cyc_dry_maf.dat -pc 40E690:40E74F
Extracting xlxgetvalue from E:\li\li.asm
Annotating xlxgetvalue
E:\li> perl harness.pl -x li -p 5 -d cyc_dry_maf -e pipe_dry
Os=NT
Generating top-level report for pipe_dry
ipreduce -input_file cyc_dry_maf.dat -output_file pipe_dry.rpt -event pipe_dry
-pthresh 5
ipreduce -o pipe_dry_mark.rpd -d pc -event pipe_dry -input_file cyc_dry_maf.dat
-pc 405590:40580F
Annotating mark
ipreduce -o pipe_dry_xlsave.rpd -d pc -event pipe_dry -input_file
cyc_dry_maf.dat -pc 4063F0:40681F
Annotating xlsave
ipreduce -o pipe_dry_xlygetvalue.rpd -d pc -event pipe_dry -input_file
cyc_dry_maf.dat -pc 40E750:40E7AF
Annotating xlygetvalue
ipreduce -o pipe_dry_xlxgetvalue.rpd -d pc -event pipe_dry -input_file
cyc_dry_maf.dat -pc 40E690:40E74F
Annotating xlxgetvalue
ipreduce -o pipe_dry_xleval.rpd -d pc -event pipe_dry -input_file
cyc_dry_maf.dat -pc 405E20:405EFF
Annotating xleval
ipreduce -o pipe_dry_xlminit.rpd -d pc -event pipe_dry -input_file
cyc_dry_maf.dat -pc 405AC0:405E1F
Annotating xlminit
E:\li> perl harness.pl -x li -p 5 -d icache_miss
Os=NT
Generating top-level report for icache_miss
ipreduce -input_file icache_miss.dat -output_file icache_miss.rpt -event
icache_miss -pthresh 5
ipreduce -o icache_miss_xlsave.rpd -d pc -event icache_miss -input_file
icache_miss.dat -pc 4063F0:40681F
Annotating xlsave
ipreduce -o icache_miss_xlxgetvalue.rpd -d pc -event icache_miss -input_file
icache_miss.dat -pc 40E690:40E74F
Annotating xlxgetvalue
ipreduce -o icache_miss_xlobgetvalue.rpd -d pc -event icache_miss -input_file
icache_miss.dat -pc 40A930:40AB6F
Extracting xlobgetvalue from E:\li\li.asm
Annotating xlobgetvalue
ipreduce -o icache_miss_xleval.rpd -d pc -event icache_miss -input_file
icache_miss.dat -pc 405E20:405EFF
Annotating xleval
ipreduce -o icache_miss_xlevarg.rpd -d pc -event icache_miss -input_file
icache_miss.dat -pc 40DEB0:40DF2F
Extracting xlevarg from E:\li\li.asm
Annotating xlevarg
ipreduce -o icache_miss_xlygetvalue.rpd -d pc -event icache_miss -input_file
icache_miss.dat -pc 40E750:40E7AF
Annotating xlygetvalue
E:\li> perl harness.pl -x li -p 5 -d dcache_miss
Os=NT
Generating top-level report for dcache_miss
ipreduce -input_file dcache_miss.dat -output_file dcache_miss.rpt -event
dcache_miss -pthresh 5
ipreduce -o dcache_miss_mark.rpd -d pc -event dcache_miss -input_file
dcache_miss.dat -pc 405590:40580F
Annotating mark
ipreduce -o dcache_miss_xlminit.rpd -d pc -event dcache_miss -input_file
dcache_miss.dat -pc 405AC0:405E1F
Annotating xlminit
ipreduce -o dcache_miss_xlsave.rpd -d pc -event dcache_miss -input_file
dcache_miss.dat -pc 4063F0:40681F
Annotating xlsave
ipreduce -o dcache_miss_xleval.rpd -d pc -event dcache_miss -input_file
dcache_miss.dat -pc 405E20:405EFF
Annotating xleval
ipreduce -o dcache_miss_xlygetvalue.rpd -d pc -event dcache_miss -input_file
dcache_miss.dat -pc 40E750:40E7AF
Annotating xlygetvalue
ipreduce -o dcache_miss_xlxgetvalue.rpd -d pc -event dcache_miss -input_file
dcache_miss.dat -pc 40E690:40E74F
Annotating xlxgetvalue
ipreduce -o dcache_miss_xlobgetvalue.rpd -d pc -event dcache_miss -input_file
dcache_miss.dat -pc 40A930:40AB6F
Annotating xlobgetvalue
E:\li> perl harness.pl -x li -p 5 -d loads
Os=NT
Generating top-level report for loads
ipreduce -input_file loads.dat -output_file loads.rpt -event loads -pthresh 5
ipreduce -o loads_xlsave.rpd -d pc -event loads -input_file loads.dat -pc
4063F0:40681F
Annotating xlsave
ipreduce -o loads_mark.rpd -d pc -event loads -input_file loads.dat -pc
405590:40580F
Annotating mark
ipreduce -o loads_xlygetvalue.rpd -d pc -event loads -input_file loads.dat
-pc 40E750:40E7AF
Annotating xlygetvalue
ipreduce -o loads_xlxgetvalue.rpd -d pc -event loads -input_file loads.dat
-pc 40E690:40E74F
Annotating xlxgetvalue
ipreduce -o loads_xleval.rpd -d pc -event loads -input_file loads.dat -pc
405E20:405EFF
Annotating xleval
E:\li> perl harness.pl -x li -p 5 -d loads -e loads_merged
Os=NT
Generating top-level report for loads_merged
ipreduce -input_file loads.dat -output_file loads_merged.rpt -event
loads_merged -pthresh 5
ipreduce -o loads_merged_xlxgetvalue.rpd -d pc -event loads_merged
-input_file loads.dat -pc 40E690:40E74F
Annotating xlxgetvalue
ipreduce -o loads_merged_xlobgetvalue.rpd -d pc -event loads_merged
-input_file loads.dat -pc 40A930:40AB6F
Annotating xlobgetvalue
ipreduce -o loads_merged_xleval.rpd -d pc -event loads_merged -input_file
loads.dat -pc 405E20:405EFF
Annotating xleval
ipreduce -o loads_merged_xlsave.rpd -d pc -event loads_merged -input_file
loads.dat -pc 4063F0:40681F
Annotating xlsave
ipreduce -o loads_merged_xlevlist.rpd -d pc -event loads_merged -input_file
loads.dat -pc 406050:40611F
Extracting xlevlist from E:\li\li.asm
Annotating xlevlist
ipreduce -o loads_merged_mark.rpd -d pc -event loads_merged -input_file
loads.dat -pc 405590:40580F
Annotating mark
E:\li> perl harness.pl -x li -p 5 -d bcache -e bcache_hit
Os=NT
Generating top-level report for bcache_hit
ipreduce -input_file bcache.dat -output_file bcache_hit.rpt -event bcache_hit
-pthresh 5
ipreduce -o bcache_hit_mark.rpd -d pc -event bcache_hit -input_file
bcache.dat -pc 405590:40580F
Annotating mark
ipreduce -o bcache_hit_xlminit.rpd -d pc -event bcache_hit -input_file
bcache.dat -pc 405AC0:405E1F
Annotating xlminit
ipreduce -o bcache_hit_xlsave.rpd -d pc -event bcache_hit -input_file
bcache.dat -pc 4063F0:40681F
Annotating xlsave
E:\li> perl harness.pl -x li -p 5 -d bcache -e bcache_miss
Os=NT
Generating top-level report for bcache_miss
ipreduce -input_file bcache.dat -output_file bcache_miss.rpt -event
bcache_miss -pthresh 5
No reports for counter 2 (Bcache Miss)
Reason: no samples matched with specified command line switches
Filtered sample count is zero, all reports suppressed
can't open bcache_miss.rpt at harness.pl line 265.
E:\li> perl harness.pl -x li -p 5 -d scache -e scache_write
Os=NT
Generating top-level report for scache_write
ipreduce -input_file scache.dat -output_file scache_write.rpt -event
scache_write -pthresh 5
ipreduce -o scache_write_xlsave.rpd -d pc -event scache_write -input_file
scache.dat -pc 4063F0:40681F
Annotating xlsave
ipreduce -o scache_write_mark.rpd -d pc -event scache_write -input_file
scache.dat -pc 405590:40580F
Annotating mark
ipreduce -o scache_write_xleval.rpd -d pc -event scache_write -input_file
scache.dat -pc 405E20:405EFF
Annotating xleval
ipreduce -o scache_write_xlminit.rpd -d pc -event scache_write -input_file
scache.dat -pc 405AC0:405E1F
Annotating xlminit
E:\li> perl harness.pl -x li -p 5 -d scache -e scache_write_miss
Os=NT
Generating top-level report for scache_write_miss
ipreduce -input_file scache.dat -output_file scache_write_miss.rpt -event
scache_write_miss -pthresh 5
ipreduce -o scache_write_miss_xlminit.rpd -d pc -event scache_write_miss
-input_file scache.dat -pc 405AC0:405E1F
Annotating xlminit
ipreduce -o scache_write_miss_xlsave.rpd -d pc -event scache_write_miss
-input_file scache.dat -pc 4063F0:40681F
Annotating xlsave
ipreduce -o scache_write_miss_xlevlist.rpd -d pc -event scache_write_miss
-input_file scache.dat -pc 406050:40611F
Annotating xlevlist
ipreduce -o scache_write_miss_xlabind.rpd -d pc -event scache_write_miss
-input_file scache.dat -pc 406150:4063EF
Annotating xlabind
ipreduce -o scache_write_miss_xleval.rpd -d pc -event scache_write_miss
-input_file scache.dat -pc 405E20:405EFF
Annotating xleval
ipreduce -o scache_write_miss_cons.rpd -d pc -event scache_write_miss
-input_file scache.dat -pc 4048B0:40495F
Annotating cons
ipreduce -o scache_write_miss_xlbind.rpd -d pc -event scache_write_miss
-input_file scache.dat -pc 40E5E0:40E62F
Extracting xlbind from E:\li\li.asm
Annotating xlbind
Files created by harness.pl
E:\li>ls *asm
E:\li\*asm
li.asm
1434624 (1434374) bytes in 1 files
E:\li> ls *rpt
E:\li\*rpt
bcache_hit.rpt loads.rpt scache_write_miss.rpt
cycles.rpt loads_merged.rpt wb_maf_full_replays.rpt
dcache_miss.rpt pipe_dry.rpt
icache_miss.rpt scache_write.rpt
29184 (26539) bytes in 10 files
E:\li> ls *hot*
E:\li\*hot*
bcache_hit.hot_routines loads_merged.hot_routines
cycles.hot_routines pipe_dry.hot_routines
dcache_miss.hot_routines scache_write.hot_routines
icache_miss.hot_routines scache_write_miss.hot_routines
loads.hot_routines wb_maf_full_replays.hot_routines
6144 (4416) bytes in 10 files
E:\li> ls *xlsave*
E:\li\*xlsave*
bcache_hit_xlsave.rpd pipe_dry_xlsave.rpd
cycles_xlsave.rpd scache_write_miss_xlsave.rpd
dcache_miss_xlsave.rpd scache_write_xlsave.rpd
icache_miss_xlsave.rpd wb_maf_full_replays_xlsave.rpd
loads_merged_xlsave.rpd xlsave.dib
loads_xlsave.rpd xlsave.dis
62976 (60038) bytes in 12 files
Harness.pl creates:
Hot Routines
Done. Let's have a look at the hot routines for the various events (using the NT-resource-kit supplied utility "cat"):
E:\li> cat *hot*
Hot Routines for bcache_hit -pthresh 5
Events % Routine Image Addr
1890 44 mark E:\li\li.exe 405590:40580F
1180 27 xlminit E:\li\li.exe 405AC0:405E1F
233 5 xlsave E:\li\li.exe 4063F0:40681F
Hot Routines for cycles -pthresh 5
Events % Routine Image Addr
289813 23 mark E:\li\li.exe 405590:40580F
213822 17 xlsave E:\li\li.exe 4063F0:40681F
164149 13 xlygetvalue E:\li\li.exe 40E750:40E7AF
115423 9 xlminit E:\li\li.exe 405AC0:405E1F
89642 7 xleval E:\li\li.exe 405E20:405EFF
86298 7 xlxgetvalue E:\li\li.exe 40E690:40E74F
Hot Routines for dcache_miss -pthresh 5
Events % Routine Image Addr
17495 25 mark E:\li\li.exe 405590:40580F
8218 12 xlminit E:\li\li.exe 405AC0:405E1F
7556 11 xlsave E:\li\li.exe 4063F0:40681F
6628 10 xleval E:\li\li.exe 405E20:405EFF
5647 8 xlygetvalue E:\li\li.exe 40E750:40E7AF
5112 7 xlxgetvalue E:\li\li.exe 40E690:40E74F
4520 7 xlobgetvalue E:\li\li.exe 40A930:40AB6F
Hot Routines for icache_miss -pthresh 5
Events % Routine Image Addr
11227 18 xlsave E:\li\li.exe 4063F0:40681F
9401 15 xlxgetvalue E:\li\li.exe 40E690:40E74F
5497 9 xlobgetvalue E:\li\li.exe 40A930:40AB6F
5326 9 xleval E:\li\li.exe 405E20:405EFF
4676 7 xlevarg E:\li\li.exe 40DEB0:40DF2F
3373 5 xlygetvalue E:\li\li.exe 40E750:40E7AF
Hot Routines for loads -pthresh 5
Events % Routine Image Addr
48619 19 xlsave E:\li\li.exe 4063F0:40681F
48561 19 mark E:\li\li.exe 405590:40580F
45121 18 xlygetvalue E:\li\li.exe 40E750:40E7AF
25131 10 xlxgetvalue E:\li\li.exe 40E690:40E74F
20119 8 xleval E:\li\li.exe 405E20:405EFF
Hot Routines for loads_merged -pthresh 5
Events % Routine Image Addr
2180 32 xlxgetvalue E:\li\li.exe 40E690:40E74F
1929 29 xlobgetvalue E:\li\li.exe 40A930:40AB6F
623 9 xleval E:\li\li.exe 405E20:405EFF
396 6 xlsave E:\li\li.exe 4063F0:40681F
364 5 xlevlist E:\li\li.exe 406050:40611F
338 5 mark E:\li\li.exe 405590:40580F
Hot Routines for pipe_dry -pthresh 5
Events % Routine Image Addr
57989 18 mark E:\li\li.exe 405590:40580F
56530 18 xlsave E:\li\li.exe 4063F0:40681F
41431 13 xlygetvalue E:\li\li.exe 40E750:40E7AF
24491 8 xlxgetvalue E:\li\li.exe 40E690:40E74F
19962 6 xleval E:\li\li.exe 405E20:405EFF
15910 5 xlminit E:\li\li.exe 405AC0:405E1F
Hot Routines for scache_write -pthresh 5
Events % Routine Image Addr
30027 35 xlsave E:\li\li.exe 4063F0:40681F
11698 14 mark E:\li\li.exe 405590:40580F
8585 10 xleval E:\li\li.exe 405E20:405EFF
7003 8 xlminit E:\li\li.exe 405AC0:405E1F
Hot Routines for scache_write_miss -pthresh 5
Events % Routine Image Addr
605 43 xlminit E:\li\li.exe 405AC0:405E1F
151 11 xlsave E:\li\li.exe 4063F0:40681F
122 9 xlevlist E:\li\li.exe 406050:40611F
116 8 xlabind E:\li\li.exe 406150:4063EF
96 7 xleval E:\li\li.exe 405E20:405EFF
82 6 cons E:\li\li.exe 4048B0:40495F
70 5 xlbind E:\li\li.exe 40E5E0:40E62F
Hot Routines for wb_maf_full_replays -pthresh 5
Events % Routine Image Addr
497 46 xleval E:\li\li.exe 405E20:405EFF
177 16 xlsave E:\li\li.exe 4063F0:40681F
75 7 mark E:\li\li.exe 405590:40580F
68 6 cons E:\li\li.exe 4048B0:40495F
55 5 xlminit E:\li\li.exe 405AC0:405E1F
Let's look at the routine of interest:
E:\li>
type xlsave.dis
Cycle=cycles
Cycle=cycles
PDry=pipe_dry
MfR=wb_maf_full_replays
IMi=icache_miss
DMis=dcache_miss
Ld=loads
LMe=loads_merged
BHi=bcache_hit
SWri=scache_write
SWM=scache_write_miss
xlsave:
Address Instruction Cycle Cycle PDry MfR IMi DMis Ld LMe BHi SWri SWM
004063F0 lda sp,0xFFA0(sp) 2248 2248 433 84 10 2 133
004063F4 stq a0,0x30(sp) 2039 2039 126 14 1 8 31
004063F8 stq a1,0x38(sp) 2022 2022 463 20 8 54
004063FC stq a2,0x40(sp) 1943 1943 49 21 2 47
00406400 stq a3,0x48(sp) 2055 2055 69 17 8 1 1 181 1
00406404 stq a4,0x50(sp) 1900 1900 29 1 1 3 248
00406408 stq s0,0(sp) 1839 1839 24 1 1 887
0040640C stq s1,8(sp) 2209 2209 60 28 6 405 2
. . .
Well there's a pain in the neck - "Cycles" are reported in both the first and second event columns because we ran the harness twice for that event (remember the ^C and switch to a 5% threshold?). But this gives us a chance to demonstrate a feature of the harness. Throw away the annotated disassemblies and regenerate them:
E:\li>
del *.dis
E:\li>
perl harness.pl -x li -p 5 -d cyc_dry_maf -e cycles
Os=NT
Annotating mark
Annotating xlsave
Annotating xlygetvalue
Annotating xlminit
Annotating xleval
Annotating xlxgetvalue
Note that ALL ipreduce calls are skipped, because the reports already exist. The .dib files also are not re-generated; only the .dis files, which are created very quickly.
E:\li>
perl harness.pl -x li -p 5 -d cyc_dry_maf -e pipe_dry
Os=NT
Annotating mark
Annotating xlsave
Annotating xlygetvalue
Annotating xlxgetvalue
Annotating xleval
Annotating xlminit
E:\li>
perl harness.pl -x li -p 5 -d cyc_dry_maf -e wb_maf_full_replays
Os=NT
Annotating xleval
Annotating xlsave
Annotating mark
Annotating cons
Annotating xlminit
E:\li>
perl harness.pl -x li -p 5 -d icache_miss
Os=NT
Annotating xlsave
Annotating xlxgetvalue
Annotating xlobgetvalue
Annotating xleval
Annotating xlevarg
Annotating xlygetvalue
E:\li>
perl harness.pl -x li -p 5 -d dcache_miss
Os=NT
Annotating mark
Annotating xlminit
Annotating xlsave
Annotating xleval
Annotating xlygetvalue
Annotating xlxgetvalue
Annotating xlobgetvalue
E:\li>
perl harness.pl -x li -p 5 -d loads
Os=NT
Annotating xlsave
Annotating mark
Annotating xlygetvalue
Annotating xlxgetvalue
Annotating xleval
E:\li>
perl harness.pl -x li -p 5 -d loads -e loads_merged
Os=NT
Annotating xlxgetvalue
Annotating xlobgetvalue
Annotating xleval
Annotating xlsave
Annotating xlevlist
Annotating mark
E:\li>
perl harness.pl -x li -p 5 -d bcache -e bcache_hit
Os=NT
Annotating mark
Annotating xlminit
Annotating xlsave
E:\li>
perl harness.pl -x li -p 5 -d bcache -e bcache_miss
Os=NT
Generating top-level report for bcache_miss
ipreduce -input_file bcache.dat -output_file bcache_miss.rpt -event bcache_miss -pthresh 5
No reports for counter 2 (Bcache Miss)
Reason: no samples matched with specified command line switches
Filtered sample count is zero, all reports suppressed
can't open bcache_miss.rpt at harness.pl line 265.
E:\li>
perl harness.pl -x li -p 5 -d scache -e scache_write
Os=NT
Annotating xlsave
Annotating mark
Annotating xleval
Annotating xlminit
E:\li>
perl harness.pl -x li -p 5 -d scache -e scache_write_miss
Os=NT
Annotating xlminit
Annotating xlsave
Annotating xlevlist
Annotating xlabind
Annotating xleval
Annotating cons
Annotating xlbind
Done. Here's the annotated ev56 xlsave disassembly:
E:\li>
type xlsave.dis
Cycle=cycles PDry=pipe_dry MfR=wb_maf_full_replays
IMi=icache_miss DMis=dcache_miss Ld=loads
LMe=loads_merged BHi=bcache_hit SWri=scache_write
SWM=scache_write_miss
Address Instruction Cycle PDry MfR IMi DMis Ld LMe BHi SWri SWM
004063F0 lda sp,0xFFA0(sp) 2248 433 84 10 2 133
004063F4 stq a0,0x30(sp) 2039 126 14 1 8 31
004063F8 stq a1,0x38(sp) 2022 463 20 8 54
004063FC stq a2,0x40(sp) 1943 49 21 2 47
00406400 stq a3,0x48(sp) 2055 69 17 8 1 1 181 1
00406404 stq a4,0x50(sp) 1900 29 1 1 3 248
00406408 stq s0,0(sp) 1839 24 1 1 887
0040640C stq s1,8(sp) 2209 60 28 6 405 2
00406410 stq a5,0x58(sp) 1883 28 8 5 386
00406414 stq ra,0x10(sp) 1819 23 3 5 705 2
00406418 ldah s0,0x43
0040641C lda s0,0xF6C8(s0) 2083 26 2 40
00406420 lda t0,0x30(sp) 1995 30 9 1 15 25
00406424 ldah s1,0x43
00406428 ldl a0,0(s0) 1876 17 4 1778
0040642C ldl v0,0x30(sp)
00406430 stl t0,0x18(sp) 2015 55 15 4 4 5 9 1
00406434 mov 8,t0
00406438 stl t0,0x1C(sp) 2088 30 3 984 8
0040643C lda s1,0xF790(s1)
00406440 stl a0,0x20(sp) 2538 40 7 323 150 3 204 3
00406444 beq v0,004064A8 901 7 407 169 7 238 5
00406448 ldl a0,0(s0) 14929 1676 2 15 4858 4 11 4810 1
0040644C ldl t0,0(s1) 4 1 4 3
00406450 cmpule a0,t0,t0 8875 3807 1 707 496 3 3350 17
00406454 beq t0,00406464
00406458 ldah a0,0x43
0040645C lda a0,0xA3B8(a0)
00406460 bsr ra,xlabort
00406464 ldl t0,0x1C(sp) 3448 1609 1 616 2
00406468 ldl v0,0(s0)
0040646C ldl t1,0x30(sp) 3817 1640 1 93 1
00406470 addl t0,8,t0 5540 855 1 826 7132 3 3 1757 2
00406474 lda v0,0xFFFC(v0) 3
00406478 stl t0,0x1C(sp) 3805 78 5 2 1 29
0040647C nop
00406480 xor sp,zero,sp 3965 81 254
00406484 ldq t2,0x18(sp) 7330 111 1 197 9952 8 1 820
00406488 stl v0,0(s0) 3628 40 1 2 197
0040648C stl t1,0(v0) 4123 91 21 39 1 2 73
00406490 addl t2,t0,t0 3440 40 1 1 61 1
00406494 ldl t3,0x30(sp) 7237 65 15 3586 2 1 79
00406498 stl zero,0(t3) 7510 147 32 1 13 3 886 3
0040649C ldl t0,0xFFF8(t0) 3595 36 2299
004064A0 stl t0,0x30(sp) 7420 198 7 99 3479 1 75
004064A4 bne t0,00406448
004064A8 ldq ra,0x10(sp) 4831 568 3 529 1016
004064AC ldq s0,0(sp)
004064B0 ldq s1,8(sp) 3423 561 3 1395 1 1 1188 1
004064B4 ldl v0,0x20(sp)
004064B8 lda sp,0x60(sp) 2024 560 1583 2
004064BC ret 1
And here's the (previously generated) pca56 xlsave disassembly:
Cycle=cycles PDry=pipe_dry MfRe=wb_maf_full_replays
IMis=icache_miss DMi=dcache_miss Ld=loads
LMe=loads_merged BWri=bcache_write BWHi=bcache_write_hit
Address Instruction Cycle PDry MfRe IMis DMi Ld LMe BWri BWHi
004063F0 lda sp,0xFFA0(sp) 3943 3565 16 92 14 166 393
004063F4 stq a0,0x30(sp) 2747 911 239 1 1 78 390
004063F8 stq a1,0x38(sp) 5149 3150 1732 1 270 1057
004063FC stq a2,0x40(sp) 2237 324 2 26 191
00406400 stq a3,0x48(sp) 8009 4775 3046 27 1 425 1904
00406404 stq a4,0x50(sp) 2205 607 5 52 186
00406408 stq s0,0(sp) 1881 588 7 76 224
0040640C stq s1,8(sp) 17143 9729 7568 30 1 1 802 4876
00406410 stq a5,0x58(sp) 1369 1328 5 2 82 850
00406414 stq ra,0x10(sp) 5524 3239 1632 586 1913
00406418 ldah s0,0x43
0040641C lda s0,0xF6C8(s0) 1655 1239 6 255 662
00406420 lda t0,0x30(sp) 2417 1304 26 233 572
00406424 ldah s1,0x43
00406428 ldl a0,0(s0) 1652 419 4 1 1 92 604
0040642C ldl v0,0x30(sp)
00406430 stl t0,0x18(sp) 10004 4626 3965 30 247 1407 3 819 2889
00406434 mov 8,t0
00406438 stl t0,0x1C(sp) 1362 767 7 1 170 1365
0040643C lda s1,0xF790(s1)
00406440 stl a0,0x20(sp) 6172 987 14 20 219 109 428 1319
00406444 beq v0,004064A8 4547 129 14 4 243 96 342 1386
00406448 ldl a0,0(s0) 14281 4842 27 3 3236 2 861 5190
0040644C ldl t0,0(s1) 1
00406450 cmpule a0,t0,t0 14648 4200 37 13 683 456 1 1381 4470
00406454 beq t0,00406464
00406458 ldah a0,0x43
0040645C lda a0,0xA3B8(a0)
00406460 bsr ra,xlabort
00406464 ldl t0,0x1C(sp) 4256 1722 9 10 6 309 686
00406468 ldl v0,0(s0) 82 16 8 7
0040646C ldl t1,0x30(sp) 3425 1284 6 1 331 1359
00406470 addl t0,8,t0 10876 911 28 11 738 7201 10 625 1936
00406474 lda v0,0xFFFC(v0) 9 7
00406478 stl t0,0x1C(sp) 7281 1794 1992 1 1620 1 525 3797
0040647C nop
00406480 xor sp,zero,sp 4024 520 5 5 3 411 950
00406484 ldq t2,0x18(sp) 7353 936 6 2 195 8325 7 662 2337
00406488 stl v0,0(s0) 8417 2967 2534 1 486 723 2811
0040648C stl t1,0(v0) 14511 6781 5173 5 1 712 1 1216 5079
00406490 addl t2,t0,t0 3505 1154 6 402 2120
00406494 ldl t3,0x30(sp) 5448 274 10 2287 1 397 1881
00406498 stl zero,0(t3) 14369 6426 3792 6 843 1131 4915
0040649C ldl t0,0xFFF8(t0) 3950 1363 7 682 1021
004064A0 stl t0,0x30(sp) 18845 7422 5722 6 78 4115 5 1755 7247
004064A4 bne t0,00406448
004064A8 ldq ra,0x10(sp) 5006 1956 4 1 501 300 1376
004064AC ldq s0,0(sp)
004064B0 ldq s1,8(sp) 3526 2213 2 563 290 1736
004064B4 ldl v0,0x20(sp)
004064B8 lda sp,0x60(sp) 1802 1343 5 176 1181
004064BC ret 4 1 1
Observations - 1
Using Map Network Drive, the ev56 reports are currently mounted on drive G: and the pca56 reports on drive H:. Comparing the two, we can make the following observations and hypotheses:
E:\>
findstr "Cycles.*[0-9]" g:\li\cycles.rpt h:\li\cycles.rpt
g:\li\cycles.rpt: Cycles 1254830
h:\li\cycles.rpt: Cycles 1984890
E:\>
findstr "Dry.*[0-9]" g:\li\pipe_dry.rpt h:\li\pipe_dry.rpt
g:\li\pipe_dry.rpt: Pipe Dry 314420
h:\li\pipe_dry.rpt: Pipe Dry 743731
The findstr regular expression looks for the summary line in the top-level IPROBE report, searching for the event name and a count.
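For readers who want to pull the same numbers out with a script, here is a minimal Python sketch of that search. It assumes the summary line looks like the findstr output shown above (event name, then the count as the last number on the line); the function name summary_count is made up for illustration and is not part of IPROBE or the harness.

```python
import re

def summary_count(report_lines, event_label):
    """Return the count from the first line the findstr pattern would match."""
    # Same idea as findstr "Cycles.*[0-9]": the label, anything, then a digit.
    pattern = re.compile(re.escape(event_label) + r".*[0-9]")
    for line in report_lines:
        if pattern.search(line):
            # Assumption: the sample count is the last number on the line.
            return int(re.findall(r"[0-9]+", line)[-1])
    return None

# Lines shaped like the summary lines in the findstr output above.
print(summary_count([" Cycles 1254830"], "Cycles"))
print(summary_count([" Pipe Dry 314420"], "Dry"))
```

Like findstr without /I, the match is case-sensitive.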
The pca56 system incurs about 48B more cycles than the ev56 ((1984890-1254830) cycle counter overflows * 2^16 cycles per overflow), or about 1.6x the ev56 total.
The pca56 spends about 28B more cycles dry than the ev56 ((743731-314420)*2^16).
The ev56 is about 25% dry (314 thousand pipe dry counter overflows / 1255 thousand cycle counter overflows) and the pca56 is 37% dry (744/1985).
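The arithmetic in these three points can be checked with a few lines of Python. This is only a sketch: the sample counts are copied from the findstr output above, the 2^16 scale is the one quoted in the text, and the variable names are illustrative.

```python
# Sample counts from the top-level IPROBE reports (findstr output above).
EVENTS_PER_SAMPLE = 2 ** 16  # cycles and pipe_dry: one sample = 65536 events

cycles = {"ev56": 1254830, "pca56": 1984890}
pipe_dry = {"ev56": 314420, "pca56": 743731}

# Extra cycles and extra dry cycles on the pca56, in real events.
extra_cycles = (cycles["pca56"] - cycles["ev56"]) * EVENTS_PER_SAMPLE
extra_dry = (pipe_dry["pca56"] - pipe_dry["ev56"]) * EVENTS_PER_SAMPLE
ratio = cycles["pca56"] / cycles["ev56"]

# Fraction of all cycles that were dry; both counters share the same scale.
dry_fraction = {cpu: pipe_dry[cpu] / cycles[cpu] for cpu in cycles}

print(f"extra cycles: {extra_cycles / 1e9:.1f}B ({ratio:.2f}x)")
print(f"extra dry cycles: {extra_dry / 1e9:.1f}B")
for cpu, fraction in dry_fraction.items():
    print(f"{cpu}: {fraction:.0%} dry")
```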
Observations - 2
2. MAF replays contribute about 40% of the additional pca56 dry time.
E:\>
findstr "Maf.*[0-9]" g:\li\*maf*.rpt h:\li\*maf*.rpt
g:\li\wb_maf_full_replays.rpt: Wb Maf Full Replays 1078
h:\li\wb_maf_full_replays.rpt: Wb Maf Full Replays 100751
MAF replays are expected to consume 7 cycles each. Each reported pipe dry counter overflow represents 2^16 events and each reported MAF replay sample represents 2^14 events:
G:\li>
findstr /c:"One sample " *rpt
bcache_hit.rpt: * One sample = 65536 events *
cycles.rpt: * One sample = 65536 events *
dcache_miss.rpt: * One sample = 16384 events *
icache_miss.rpt: * One sample = 16384 events *
loads.rpt: * One sample = 65536 events *
loads_merged.rpt: * One sample = 16384 events *
pipe_dry.rpt: * One sample = 65536 events *
scache_write.rpt: * One sample = 65536 events *
scache_write_miss.rpt: * One sample = 16384 events *
wb_maf_full_replays.rpt: * One sample = 16384 events *
Therefore we have (743731-314420)*2^16 = 28B extra dry cycles, which include 7*(100751-1078)*2^14 = 11B cycles for MAF replays. About 40% (11/28) of the extra dry time is due to MAF replays.
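The same calculation as a Python sketch, with the two different sample scales kept explicit. The 7-cycle replay cost is the expectation stated above, not a measured value, and the names are illustrative.

```python
MAF_REPLAY_CYCLES = 7  # expected cost of one replay, as stated in the text

maf_samples = {"ev56": 1078, "pca56": 100751}    # one sample = 2**14 events
dry_samples = {"ev56": 314420, "pca56": 743731}  # one sample = 2**16 events

# Extra pca56 cycles attributable to MAF replays.
extra_maf_cycles = (
    MAF_REPLAY_CYCLES * (maf_samples["pca56"] - maf_samples["ev56"]) * 2 ** 14
)
# Extra pca56 dry cycles overall.
extra_dry_cycles = (dry_samples["pca56"] - dry_samples["ev56"]) * 2 ** 16

share = extra_maf_cycles / extra_dry_cycles
print(f"MAF replays: {extra_maf_cycles / 1e9:.1f}B of "
      f"{extra_dry_cycles / 1e9:.1f}B extra dry cycles ({share:.0%})")
```

Using unrounded values, the share comes out a point or so above the 11/28 figure, which is still "about 40%".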
Observations - 3
g:\li\icache_miss.rpt: Icache Miss 63000
h:\li\icache_miss.rpt: Icache Miss 100406
If icache misses each cost, say, 8 cycles on an ev56 and 16 cycles on this pca56, then the above would account for an additional 18B cycles on the pca56 (100406*2^14*16 - 63000*2^14*8).
A latency estimate of 8 cycles for ev56's on-chip S-cache is based on http://www.digital.com/info/DTJH09/DTJH09SC.TXT. The pca56 estimate of 16 cycles is based on the fact that the pca56 had its bcache latency set to 8 cycles, which is assumed to be added to on-chip time of about the same amount as ev56. Empirical evidence for 8 and 16 might be indicated by the Dependent Load measurements with the Nix memtest (gem-alpha-perf note 331.33).
This roughly accounts for the other 60% (18/28) of the extra dry time.
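Since the 8- and 16-cycle miss costs are estimates, it is worth seeing how much the result moves when they change. The sketch below wraps the calculation in a function; the function name is made up, and the second call uses a purely hypothetical 12-cycle pca56 latency just to show the sensitivity.

```python
ICACHE_EVENTS_PER_SAMPLE = 2 ** 14  # icache_miss.rpt: one sample = 16384 events
ICACHE_SAMPLES = {"ev56": 63000, "pca56": 100406}

def extra_icache_cycles(ev56_latency, pca56_latency):
    """Extra pca56 cycles spent on icache misses for the given per-miss costs."""
    ev56 = ICACHE_SAMPLES["ev56"] * ICACHE_EVENTS_PER_SAMPLE * ev56_latency
    pca56 = ICACHE_SAMPLES["pca56"] * ICACHE_EVENTS_PER_SAMPLE * pca56_latency
    return pca56 - ev56

# The latency estimates used in the text.
print(f"{extra_icache_cycles(8, 16) / 1e9:.1f}B extra cycles")
# Hypothetical alternative latency, only to show how the estimate shifts.
print(f"{extra_icache_cycles(8, 12) / 1e9:.1f}B extra cycles")
```

A few cycles' difference in the assumed pca56 latency changes the answer by billions of cycles, so the 60% figure should be read as a rough hypothesis.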
Observations - 4
E:\>
findstr "cache.*[0-9]" g:\li\*scache_write.rpt h:\li\*bcache_write.rpt
g:\li\scache_write.rpt: Scache Write 86159
g:\li\scache_write.rpt: Scache Write Miss 1418
h:\li\bcache_write.rpt: Bcache Write 70251
h:\li\bcache_write.rpt: Bcache Write Hit 280824
Remember to always check the units for the counters you are using:
E:\>
findstr /c:"One sample" g:\li\*scache*.rpt h:\li\*bcache*.rpt
g:\li\scache_write.rpt: * One sample = 65536 events *
g:\li\scache_write_miss.rpt: * One sample = 16384 events *
h:\li\bcache_write.rpt: * One sample = 65536 events *
h:\li\bcache_write_hit.rpt: * One sample = 16384 events *
Therefore we find that the ev56 writes miss the scache less than 1% of the time ((1418*16384)/(86159*65536)) and the pca56 writes hit the bcache over 99% of the time ((280824*16384)/(70251*65536)).
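A small Python sketch makes the units point concrete: the raw sample counts are converted to events before dividing. The helper name events is illustrative only.

```python
def events(samples, events_per_sample):
    """Convert a report's sample count into an event count."""
    return samples * events_per_sample

# ev56: scache_write_miss samples are 2**14 events, scache_write are 2**16.
ev56_miss_rate = events(1418, 2 ** 14) / events(86159, 2 ** 16)

# pca56: bcache_write_hit samples are 2**14 events, bcache_write are 2**16.
pca56_hit_rate = events(280824, 2 ** 14) / events(70251, 2 ** 16)

# Dividing the raw sample counts instead gives a nonsensical hit rate.
naive_hit_rate = 280824 / 70251

print(f"ev56 scache write miss rate: {ev56_miss_rate:.2%}")
print(f"pca56 bcache write hit rate: {pca56_hit_rate:.2%}")
print(f"pca56 hit rate ignoring units: {naive_hit_rate:.0%}")
```

The last line is the mistake to avoid: without the unit conversion the pca56 appears to hit the bcache about four times per write.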
Observations - 5 (uh-oh)
Begin    End                Sample  Image  Total
Address  Address   Name      Count    Pct    Pct
-------  -------   ----      -----    ---    ---
004063F0-0040681F  xlsave    21809   21.8   21.7
But no instruction in the pca56 xlsave.dis is listed as getting more than 92 icache_miss events! The problem is probably that the harness recognizes routine boundaries in the .asm disassembly by the occurrence of a label followed by a colon, so it stops extracting lines at 4064bc instead of continuing on to 40681f.
004064B4: A01E0020 ldl v0,0x20(sp)
004064B8: 23DE0060 lda sp,0x60(sp)
004064BC: 6BFA8001 ret
evalhook:
004064C0: 23DEFFE0 lda sp,0xFFE0(sp)
004064C4: 47FF0413 clr a3
004064C8: B75E0000 stq ra,0(sp)
Look for a future version of harness.pl to extract by address rather than by label. In the meantime, hand-extracts have been posted to http://tlg-www.zko.dec.com/~henning/li, where it can be noted that the icache misses are happening in evform and evfun.
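Extraction by address might look something like the following Python sketch. This is not the harness.pl implementation; it assumes the .asm listing has exactly the shape shown in the excerpt above (8-digit hex address, colon, 8-digit encoding, instruction), and the function name is hypothetical.

```python
import re

# An instruction line in the .asm listing, as in the excerpt above:
#   004064B4: A01E0020 ldl v0,0x20(sp)
ADDR_LINE = re.compile(r"^\s*([0-9A-Fa-f]{8}):\s+[0-9A-Fa-f]{8}\s")

def extract_by_address(lines, begin, end):
    """Return the listing lines whose address lies in [begin, end].

    Lines without an address (labels such as "evalhook:") are kept only
    when the next instruction is in range, so a label inside the range
    no longer ends the extraction early.
    """
    selected = []
    pending = []  # label lines waiting for the next instruction
    for line in lines:
        match = ADDR_LINE.match(line)
        if match is None:
            pending.append(line)
            continue
        if begin <= int(match.group(1), 16) <= end:
            selected.extend(pending)
            selected.append(line)
        pending = []
    return selected

listing = [
    "004064B4: A01E0020 ldl v0,0x20(sp)",
    "004064B8: 23DE0060 lda sp,0x60(sp)",
    "004064BC: 6BFA8001 ret",
    "evalhook:",
    "004064C0: 23DEFFE0 lda sp,0xFFE0(sp)",
    "004064C4: 47FF0413 clr a3",
    "004064C8: B75E0000 stq ra,0(sp)",
]

# The range from the icache_miss report: 004063F0-0040681F.
for line in extract_by_address(listing, 0x4063F0, 0x40681F):
    print(line)
```

With the report's address range, the lines after the evalhook label are included instead of being cut off at the ret.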
Observations - 6
Cycle=cycles IMi=icache_miss PDry=pipe_dry MfR=wb_maf_full_replays
Address Instruction Cycle IMi PDry MfR
ev56 evfun:
00406720 lda sp,0xFFC0(sp) 1395 698 613
00406724 clr a3
00406728 stq s0,0(sp) 192 168
0040672C stq s1,8(sp) 206 212
00406730 stq ra,0x10(sp) 194 3 184
00406734 stl a0,0x24(sp) 218 34 185
00406738 lda a0,0x20(sp)
0040673C stl a1,0x28(sp) 210 2
00406740 stl a2,0x2C(sp) 228 5 26
00406744 lda a1,0x18(sp)
00406748 lda a2,0x1C(sp) 188 1
0040674C bsr ra,xlsave
00406750 ldl t0,0x24(sp) 1413 4 835
00406754 stl v0,0x30(sp) 209 193 1
00406758 ldl s0,8(t0) 230 2 206
0040675C beq s0,0040676C 738 2 532
00406760 ldbu t2,0(s0) 182 1 69
pca56 evfun:
00406720 lda sp,0xFFC0(sp) 4997 691 4194 9
00406724 clr a3
00406728 stq s0,0(sp) 390 210
0040672C stq s1,8(sp) 62 3 235 25
00406730 stq ra,0x10(sp) 367 758 350 1
00406734 stl a0,0x24(sp) 461 10 25 8
00406738 lda a0,0x20(sp)
0040673C stl a1,0x28(sp) 25 2 2
00406740 stl a2,0x2C(sp) 516 724 256 2
00406744 lda a1,0x18(sp)
00406748 lda a2,0x1C(sp) 404 11
0040674C bsr ra,xlsave
00406750 ldl t0,0x24(sp) 1593 2 882 2
00406754 stl v0,0x30(sp) 96 196 1
00406758 ldl s0,8(t0) 518 238 2
0040675C beq s0,0040676C 1537 5 619 7
00406760 ldbu t2,0(s0) 434 1 146 1
00406764 cmpeq t2,3,t2 1244 215 2
Both processors start off evfun with similar icache misses charged to the first instruction (698 events on ev56, 691 on pca56). Both processors are presumably prefetching the following instructions, but ev56 prefetches from the on-chip S-cache and pca56 prefetches from the off-chip B-cache. The pca56 continues to incur about 700 icache miss events at each of two more fetch blocks (758 at 00406730 and 724 at 00406740), whereas ev56 has successfully prefetched them and avoided most of the misses.