Part II

An NT example: li wb_maf_full_replays

 

Problem: the SPEC benchmark "li" encounters many
"wb_maf_full_replays" on the pca56.
It has already been measured on the pca56;
for comparison, we want to measure it on an ev56.

 

Note the different cache structures of the systems:

        pca56                  ev56
L1      16+8KB I+D on chip     8+8KB I+D on chip
L2      1MB off chip           96KB on chip
L3      none                   8MB off chip

 

Here's a sample portion of the pca56 output:

 

Cycle=cycles

PDry=pipe_dry

MfRe=wb_maf_full_replays

IMis=icache_miss

DMi=dcache_miss

Ld=loads

LMe=loads_merged

BWri=bcache_write

BWHi=bcache_write_hit

xlsave:

Address Instruction Cycle PDry MfRe IMis DMi Ld LMe BWri BWHi

00406480 xor sp,zero,sp 4024 520 5 5 3 411 950

00406484 ldq t2,0x18(sp) 7353 936 6 2 195 8325 7 662 2337

00406488 stl v0,0(s0) 8417 2967 2534 1 486 723 2811

0040648C stl t1,0(v0) 14511 6781 5173 5 1 712 1 1216 5079

 

Note the large number of wb_maf_full_replays on the 3rd and 4th instructions above.
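To put a rough number on that, the replay samples can be expressed as a fraction of the cycle samples on each instruction. This is only a sketch: it assumes the first three numbers on each row are the Cycle, PDry and MfRe columns, and it leaves out the remaining columns because their alignment is ambiguous in this listing. The two counters need not share a sampling period, so read the result as a relative indicator, not as replays per cycle.

```python
# Samples for the four xlsave instructions shown above.
# Assumption: the first three numbers on each row are Cycle, PDry, MfRe.
rows = [
    ("00406480", "xor sp,zero,sp", 4024, 520, 5),
    ("00406484", "ldq t2,0x18(sp)", 7353, 936, 6),
    ("00406488", "stl v0,0(s0)", 8417, 2967, 2534),
    ("0040648C", "stl t1,0(v0)", 14511, 6781, 5173),
]


def replay_ratio(cycle_samples, replay_samples):
    """Replay samples per cycle sample; 0.0 when there are no cycle samples."""
    return replay_samples / cycle_samples if cycle_samples else 0.0


ratios = {addr: replay_ratio(cyc, maf) for addr, _insn, cyc, _dry, maf in rows}

for addr, insn, _cyc, _dry, _maf in rows:
    print(f"{addr}  {insn:<16} {100 * ratios[addr]:5.1f}% MfRe per Cycle sample")
```

On these numbers the two stl instructions stand far apart from the xor and ldq that precede them.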

Install Perl

 

On the ev56, the first order of business is to install the NT Resource Kit, to get Perl:

 

 

 

 

 

E:\li> dir e:\ntreskit\perl\perl.exe

Volume in drive E is nt4sp2vc5rc12

Volume Serial Number is 9457-96EB

 

Directory of e:\ntreskit\perl

 

03/01/96 12:00a 124,928 PERL.EXE

1 File(s) 124,928 bytes

1,196,438,528 bytes free

 

E:\li> perl -e "print 123456;"

The name specified is not recognized as an

internal or external command, operable program or batch file.

 

Windows opened after the kit was installed have the path included automatically, but windows that already existed do not. Add the Perl directory to the path by hand:

 

E:\li> set path=%path%;e:\ntreskit\perl

 

E:\li> perl -e "print 123456;"

123456

E:\li>
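The same check can be made without trial and error by looking for the directory in the path string. A minimal sketch follows; the helper name `dir_on_path` and the starting path value are made up for illustration, and only the `e:\ntreskit\perl` directory comes from the session above.

```python
def dir_on_path(directory, path_value, sep=";"):
    """Return True if directory is an entry of a PATH-style string.

    The comparison ignores case and trailing slashes, as the NT shell does.
    """
    wanted = directory.rstrip("\\/").lower()
    entries = [entry.strip().rstrip("\\/").lower() for entry in path_value.split(sep)]
    return wanted in entries


# Hypothetical path of a window opened before the kit was installed.
old_path = r"E:\WINNT\system32;E:\WINNT"
# Effect of: set path=%path%;e:\ntreskit\perl
new_path = old_path + r";e:\ntreskit\perl"

print(dir_on_path(r"E:\NTRESKIT\PERL", old_path))
print(dir_on_path(r"E:\NTRESKIT\PERL", new_path))
```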

 

Finding the Resource Kit

 

The NT Resource Kit used here was from the book

"Microsoft Windows NT Workstation Resource Kit"

Microsoft Press

ISBN 1-57231-343-9

$69.95 at Barnes & Noble

$47 when ordered via "Stream" part # 276206

 

VTX PCSOFTWARE

choice 6 US Orders

<return> 13 times to save instructions

Save it, fill it out, email it.

------------------------------------------------------------------

|STREAM PART # | DESCRIPTION |QUANTITY|

------------------------------------------------------------------

276206 MS Win NT Resource Kit V4.0 Workstation 4

 

Other useful utilities in the NT Resource Kit include "sleep", "kill", "ls", "wc", "vi", and "timethis":

 

E:\li> timethis sleep 2

The name specified is not recognized as an

internal or external command, operable program or batch file.

 

E:\li> set path=%path%;e:\ntreskit

 

E:\li> timethis sleep 1

 

TimeThis : Command Line : sleep 1

TimeThis : Start Time : Mon Jul 07 10:26:04 1997

 

 

TimeThis : Command Line : sleep 1

TimeThis : Start Time : Mon Jul 07 10:26:04 1997

TimeThis : End Time : Mon Jul 07 10:26:05 1997

TimeThis : Elapsed Time : 00:00:01.086

 

 

Populate a directory

 

For the benchmark to be tested, start with an empty directory

 

E:\> cd li

 

E:\li> dir

Volume in drive E is nt4sp2vc5rc12

Volume Serial Number is 9457-96EB

 

Directory of E:\li

 

07/07/97 09:44a <DIR> .

07/07/97 09:44a <DIR> ..

2 File(s) 0 bytes

1,236,467,200 bytes free

 

Add the benchmark, its input files (*.lsp), and a batch control file

 

E:\li> dir

Volume in drive E is nt4sp2vc5rc12

Volume Serial Number is 9457-96EB

 

Directory of E:\li

 

07/07/97 10:23a <DIR> .

07/07/97 10:23a <DIR> ..

07/07/97 10:22a 1,919 au.lsp

07/07/97 10:22a 13,498 boyer.lsp

07/07/97 10:22a 4,060 browse.lsp

07/07/97 10:22a 573 ctak.lsp

07/07/97 10:22a 2,289 dderiv.lsp

07/07/97 10:22a 1,209 deriv.lsp

07/07/97 10:22a 1,411 destru0.lsp

07/07/97 10:22a 1,411 destru1.lsp

07/07/97 10:22a 1,411 destru2.lsp

07/07/97 10:22a 1,563 destrum0.lsp

07/07/97 10:22a 1,563 destrum1.lsp

07/07/97 10:22a 1,563 destrum2.lsp

07/07/97 10:22a 1,279 div2.lsp

07/07/97 10:25a 158 do_li.bat

07/07/97 10:23a 231,936 li.exe

07/07/97 10:22a 5,192 puzzle0.lsp

07/07/97 10:22a 5,192 puzzle1.lsp

07/07/97 10:22a 623 tak0.lsp

07/07/97 10:22a 623 tak1.lsp

07/07/97 10:22a 623 tak2.lsp

07/07/97 10:22a 20,808 takr.lsp

07/07/97 10:22a 2,072 triang.lsp

07/07/97 10:22a 29 xit.lsp

25 File(s) 301,005 bytes

1,196,128,256 bytes free

 

Build it the right way…

 

We are using an existing executable, and not rebuilding it. But it is important to notice how it was built:

 

/link /debug /debugtype:coff

 

This is essential in order to pick up addresses from the image. Support for other image types is very likely to arrive later this summer, but in the meantime remember to use coff.

 

E:\li>

E:\li> type do_li.bat

timethis ".\li.exe *.lsp 1>tmp.out 2>tmp.err" > time.this

echo off

echo d

echo o

echo n

echo e

echo on

 

The funny printing of "done" is so I can see it from 6 feet away.

 

Try it, to get a base time:

 

E:\li> do_li

E:\li>timethis ".\li.exe *.lsp 1>tmp.out 2>tmp.err" 1>time.this

E:\li>echo off

d

o

n

e

E:\li> move time.this no_iprobe.times

1 file(s) moved.

 

E:\li> type no_iprobe.times

 

TimeThis : Command Line : .\li.exe *.lsp 1>tmp.out 2>tmp.err

TimeThis : Start Time : Mon Jul 07 10:26:45 1997

 

 

TimeThis : Command Line : .\li.exe *.lsp 1>tmp.out 2>tmp.err

TimeThis : Start Time : Mon Jul 07 10:26:45 1997

TimeThis : End Time : Mon Jul 07 10:29:21 1997

TimeThis : Elapsed Time : 00:02:36.094

Get IPROBE

 

 

Install IPROBE (using the newest experimental version, which has the "-command" feature):

 

 

E:\li> cd \

E:\> mkdir iprobe_newest_7jul

E:\> cd iprobe_newest_7jul

E:\iprobe_newest_7jul>ftp

ftp> op 16.31.144.83

Connected to 16.31.144.83.

220 perf.zko.dec.com FTP server (Digital UNIX Version 5.60) ready.

User (16.31.144.83:(none)): anonymous

331 Guest login ok, send ident as password.

Password:

ftp> cd pub

ftp> dir

total 3

drwxr-xr-x 2 9246 512 512 Jul 7 09:37 IprobeKits

drwxr-xr-x 2 9139 512 512 Apr 11 07:12 gaertner

drwxr-xr-x 2 6562 15 512 Jun 17 05:45 henning

ftp> cd IprobeKits

ftp> dir

total 7111

-rw-r--r-- 1 9246 512 57856 Apr 10 14:49 Api.doc

-rw-r--r-- 1 9246 512 42242 Apr 3 09:03 IprNew.mod

-rw-r--r-- 1 9246 512 272356 Apr 3 08:59 Iprobe.ps

-rw-r--r-- 1 9246 512 580608 Apr 3 08:59 Iprobe020.a

-rw-r--r-- 1 9246 512 96768 Apr 3 09:03 Iprobe020ProgrammingKit.bck

-rw-r--r-- 1 9246 512 1075614 Apr 3 08:59 Iprobe021Osf.tar.Z

-rw-r--r-- 1 9246 512 692920 Apr 3 08:59 Iprobe0221Unix40.tar.Z

-rw-r--r-- 1 9246 512 548352 Apr 3 09:01 IprobeVms021.a

-rw-r--r-- 1 9246 512 677376 May 5 13:02 IprobeVms022.a

-rw-r--r-- 1 9246 512 735003 Apr 3 09:01 Nt35IprobeT21.zip

-rw-r--r-- 1 9246 512 304899 Apr 3 09:01 Nt35IprobeT21Update.zip

-rw-r--r-- 1 9246 512 496452 Apr 3 09:02 Nt40IprobeT22Ev4.zip

-rw-r--r-- 1 9246 512 505099 Jun 12 12:37 Nt40IprobeT23Ev5.zip

-rw-r--r-- 1 9246 512 346312 Jun 10 06:42 TurboLaserBusMonitorUnix.tar.Z

-rw-r--r-- 1 9246 512 185344 Apr 3 09:02 UnzipAxp.exe

-rw-r--r-- 1 9246 512 4629 Jun 10 07:08 WhatsHere.txt

-rw-r--r-- 1 9246 512 506252 Jul 7 09:37 nt40_iprobet2_3_ev5_experimental.zip

ftp> bin

200 Type set to I.

ftp> get UnzipAxp.exe

ftp> get nt40_iprobet2_3_ev5_experimental.zip

506252 bytes received in 0.69 seconds (735.83 Kbytes/sec)

ftp> bye

221 Goodbye.

Install it

 

E:\iprobe_newest_7jul>dir

 

Directory of E:\iprobe_newest_7jul

 

07/07/97 10:46a <DIR> .

07/07/97 10:46a <DIR> ..

07/07/97 10:46a 506,252 nt40_iprobet2_3_ev5_experimental.zip

07/07/97 10:45a 185,344 UnzipAxp.exe

4 File(s) 691,596 bytes

1,195,430,400 bytes free

 

E:\iprobe_newest_7jul> unzipaxp *zip

Archive: nt40_iprobet2_3_ev5_experimental.zip

inflating: install.bat

inflating: ipreduce.exe

inflating: iprkrnl.ini

inflating: iprkrnl.sys

inflating: iprobe.exe

inflating: iprobe_ps.exe

inflating: iprshr.dll

inflating: iprshr.exp

inflating: iprshr.lib

inflating: PSAPI.DLL

inflating: read1st.txt

inflating: regini.exe

inflating: rep.exe

 

E:\iprobe_newest_7jul> install

Copying Iprobe device driver (IprKrnl.sys) to E:\WINNT\system32\drivers ...

1 file(s) copied.

Copying Iprobe API library (IprShr.dll) to E:\WINNT\system32 ...

1 file(s) copied.

Copying Iprobe command and control user interface (Iprobe) to E:\WINNT\system32...

1 file(s) copied.

Copying Iprobe data reduction program (Ipreduce) to E:\WINNT\system32 ...

1 file(s) copied.

Copying Iprobe read entry points program (Rep) to E:\WINNT\system32 ...

1 file(s) copied.

Copying Iprobe show running processes program (Iprobe_PS) to E:\WINNT\system32 ...

1 file(s) copied.

Copying support libraries to E:\WINNT\system32 ...

1 file(s) copied.

Installing IprKrnl service in the registry ...

**********************************************************************

* *

* Installation complete. *

* *

* You need to reboot the system for the installation to take effect. *

* *

**********************************************************************

E:\iprobe_newest_7jul>

 

I think the message above is in error: ever since the IPROBE driver was made loadable, a reboot should not really be needed. But just to be obedient, we'll reboot, and then try iprobe.

Test That IPROBE was Installed

 

 

E:\li>iprobe -h

Unable to load library "E:\WINNT\System32\iprshr.dll"

Error = 45a

 

 

 

We forgot to wake up the driver.

 

 

Test again

 

E:\li>iprobe -help

...

Events defined on the current system -- select up to one event from each column:

issues    single_issue_cycles    long_stalls
cycles    dual_issue_cycles      branch_mispr
          triple_issue_cycles    pc_mispr
          quad_issue_cycles      icache_miss
          split_issue_cycles     dcache_miss
          pipe_dry               dtb_miss
          pipe_frozen            loads_merged
          replay_trap            ldu_replays
          branches               cycles
          cond_branches          itb_miss
          jsr_ret                wb_maf_full_replays
          integer_ops            external
          float_ops              mem_barrier_cycles
          loads                  load_locked
          stores                 scache_write
          icache_access          scache_miss
          dcache_access          scache_read_miss
          scache_access          scache_write_miss
          scache_read            scache_sh_write
          scache_write           bcache_miss
          scache_victim          sys_inv
          bcache_hit             sys_read_req
          bcache_victim
          sys_req

 

Note that the events defined here are NOT the same as the events on the pca56. So we'll measure a slightly different set of events. Here's a batch file to do a whole bunch of measurements:

 

E:\li>type do_a_bunch.bat

iprobe -method sample -command "do_li.bat" -output cyc_dry_maf.dat cycles pipe_dry wb_maf_full_replays
move time.this cyc_dry_maf.times

iprobe -method sample -command "do_li.bat" -output icache_miss.dat icache_miss
move time.this icache_miss.times

iprobe -method sample -command "do_li.bat" -output dcache_miss.dat loads dcache_miss
move time.this dcache_miss.times

iprobe -method sample -command "do_li.bat" -output loads.dat loads loads_merged
move time.this loads.times

iprobe -method sample -command "do_li.bat" -output bcache.dat bcache_hit bcache_miss
move time.this bcache.times

iprobe -method sample -command "do_li.bat" -output scache.dat scache_write scache_write_miss
move time.this scache.times
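Since every run in the batch file has the same shape, the lines could also be generated from a small table. This is a sketch, not part of the IPROBE kit: it uses only the iprobe switches that appear in the batch file, and the helper name `batch_lines` and the limit of three events per run (the counter count iprobe reports on this system) are assumptions made for illustration.

```python
MAX_EVENTS = 3  # assumption: one event per hardware counter, three counters

# Output file stem -> events measured together in one run (from do_a_bunch.bat).
runs = [
    ("cyc_dry_maf", ["cycles", "pipe_dry", "wb_maf_full_replays"]),
    ("icache_miss", ["icache_miss"]),
    ("dcache_miss", ["loads", "dcache_miss"]),
    ("loads", ["loads", "loads_merged"]),
    ("bcache", ["bcache_hit", "bcache_miss"]),
    ("scache", ["scache_write", "scache_write_miss"]),
]


def batch_lines(stem, events, command="do_li.bat"):
    """Return the two batch-file lines for one measurement run."""
    if len(events) > MAX_EVENTS:
        raise ValueError(f"{stem}: more than {MAX_EVENTS} events in one run")
    iprobe = (f'iprobe -method sample -command "{command}" '
              f'-output {stem}.dat {" ".join(events)}')
    # do_li.bat leaves its timing in time.this; keep one copy per run.
    return [iprobe, f"move time.this {stem}.times"]


script = []
for stem, events in runs:
    script.extend(batch_lines(stem, events))

print("\n".join(script))
```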

Do the measurements

 

 

E:\li>do_a_bunch

 

E:\li>iprobe -method sample -command "do_li.bat" -output cyc_dry_maf.dat cycles

pipe_dry wb_maf_full_replays

Node name : PRF07

OS : Microsoft Windows NT Version 4.0

CPU count : 1

Model : DEC-00Alcor

Page count : 81916

Pagelength : 8192

Counter count : 3

cycles : Low frequency skip 0 interrupts between samples

pipe_dry : Low frequency skip 0 interrupts between samples

wb_maf_full_replays : Low frequency skip 0 interrupts between samples

Current time : Mon Jul 07 12:16:22 1997

Start time: : immediate

Duration : 0 (until user interrupts)

Interval : 1

Method : sample

Measured Modes : all modes

Measured Data : ps pc pid ctr

Output file : cyc_dry_maf.dat

Buffer_count : 3

Buffer_size : 8192

Start of sampling

 

Started PID 0x0000006f via command line

 

E:\li>timethis ".\li.exe *.lsp 1>tmp.out 2>tmp.err" 1>time.this

 

E:\li>echo off

d

o

n

e

End of sampling

Buffers written : 3092

Partial buffers written: 1

 

E:\li>move time.this cyc_dry_maf.times

 

...

 

Confirm file sizes and run times

 

E:\li> dir *.dat

Volume in drive E is nt4sp2vc5rc12

Volume Serial Number is 9457-96EB

 

Directory of E:\li

 

07/07/97 12:29p 73,844 bcache.dat

07/07/97 12:19p 25,329,780 cyc_dry_maf.dat

07/07/97 12:24p 5,242,996 dcache_miss.dat

07/07/97 12:21p 1,024,116 icache_miss.dat

07/07/97 12:26p 4,218,996 loads.dat

07/07/97 12:32p 1,417,332 scache.dat

6 File(s) 37,307,064 bytes

1,156,445,184 bytes free
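One way to confirm the sizes is to compare them with the buffer counts iprobe printed. The sketch below assumes the 8192-byte buffer size from the iprobe banner applies to every run. What the leftover bytes hold (a header, the partial buffer, or something else) is not stated here, so the code only reports them.

```python
BUFFER_SIZE = 8192  # "Buffer_size : 8192" in the iprobe banner

# Sizes in bytes from the "dir *.dat" listing above.
dat_sizes = {
    "bcache.dat": 73_844,
    "cyc_dry_maf.dat": 25_329_780,
    "dcache_miss.dat": 5_242_996,
    "icache_miss.dat": 1_024_116,
    "loads.dat": 4_218_996,
    "scache.dat": 1_417_332,
}


def split_size(size_bytes, buffer_size=BUFFER_SIZE):
    """Return (whole buffers, leftover bytes) for a .dat file size."""
    return divmod(size_bytes, buffer_size)


for name, size in sorted(dat_sizes.items()):
    buffers, leftover = split_size(size)
    print(f"{name:<16} {buffers:>5} full buffers + {leftover} bytes")
```

For cyc_dry_maf.dat this gives 3092 whole buffers, which matches the "Buffers written : 3092" line from that run.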

 

 

E:\li> findstr laps *times

bcache.times:TimeThis : Elapsed Time : 00:02:34.422

cyc_dry_maf.times:TimeThis : Elapsed Time : 00:02:44.922

dcache_miss.times:TimeThis : Elapsed Time : 00:02:38.429

icache_miss.times:TimeThis : Elapsed Time : 00:02:35.344

loads.times:TimeThis : Elapsed Time : 00:02:39.852

no_iprobe.times:TimeThis : Elapsed Time : 00:02:36.094

scache.times:TimeThis : Elapsed Time : 00:02:36.039

 

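The times running with and without IPROBE are quite close to each other: the slowest sampled run (cyc_dry_maf, 2:44.9) is about 6% longer than the 2:36.1 baseline, and every other run is within about 2.5% of it.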
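The comparison can be made explicit by converting the TimeThis elapsed times to seconds and expressing each sampled run relative to the run without IPROBE. A small sketch, using only the times from the findstr output above; the function name `elapsed_seconds` is ours.

```python
def elapsed_seconds(hms):
    """Convert a TimeThis 'HH:MM:SS.mmm' string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Elapsed times from "findstr laps *times".
elapsed = {
    "no_iprobe": "00:02:36.094",
    "bcache": "00:02:34.422",
    "cyc_dry_maf": "00:02:44.922",
    "dcache_miss": "00:02:38.429",
    "icache_miss": "00:02:35.344",
    "loads": "00:02:39.852",
    "scache": "00:02:36.039",
}

baseline = elapsed_seconds(elapsed["no_iprobe"])
overhead_pct = {
    run: 100.0 * (elapsed_seconds(hms) - baseline) / baseline
    for run, hms in elapsed.items()
    if run != "no_iprobe"
}

# Print the runs from least to most overhead.
for run, pct in sorted(overhead_pct.items(), key=lambda item: item[1]):
    print(f"{run:<12} {pct:+5.1f}%")
```

A single run per configuration says nothing about run-to-run variation, so small negative values should not be read as IPROBE making the benchmark faster.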

 

 

Pick up John's data reduction harness...

 

E:\li> ftp 16.31.144.83

Connected to 16.31.144.83.

220 perf.zko.dec.com FTP server (Digital UNIX Version 5.60) ready.

User (16.31.144.83:(none)): anonymous

331 Guest login ok, send ident as password.

Password:

230 Guest login ok, access restrictions apply.

ftp> cd pub

250 CWD command successful.

ftp> dir

200 PORT command successful.

150 Opening ASCII mode data connection for /bin/ls (16.31.144.16,1036).

total 3

drwxr-xr-x 2 9246 512 512 Jul 7 09:37 IprobeKits

drwxr-xr-x 2 9139 512 512 Apr 11 07:12 gaertner

drwxr-xr-x 2 6562 15 512 Jun 17 05:45 henning

226 Transfer complete.

202 bytes received in 0.27 seconds (0.74 Kbytes/sec)

ftp> cd henning

250 CWD command successful.

ftp> dir

200 PORT command successful.

150 Opening ASCII mode data connection for /bin/ls (16.31.144.16,1037).

total 24

-rwxr-xr-x 1 6562 15 23686 Jul 7 11:56 harness.pl

226 Transfer complete.

76 bytes received in 0.02 seconds (3.30 Kbytes/sec)

ftp> get harness.pl

200 PORT command successful.

150 Opening ASCII mode data connection for harness.pl (16.31.144.16,1038) (23686 bytes).

226 Transfer complete.

24422 bytes received in 0.13 seconds (195.38 Kbytes/sec)

ftp> bye

221 Goodbye.

 

E:\li>

 

Hanging rep: same workaround as Unix

 

Unfortunately the harness hangs when the rep command is issued:

 

E:\li>perl harness.pl -x li.exe -e bcache_miss -d bcache -s

Os=NT

 

Running rep to create addresses.resolved

rep li.exe

^C at harness.pl line 70.

 

As in Unix, the workaround is to rep by PID: start the benchmark, then point rep at the running process.

E:\li>start .\li.exe *.lsp 1>tmp.out 2>tmp.err

E:\li>rep -pid 117

 

PID: 0x00000075 Base: 00400000 Size: 0x00038000 Name: li.exe

 

** VA range: 0x00400000:0x00438000 (000229376) Symbols: 0000757 Name: E:\li\li.exe

** VA range: 0x77f00000:0x77f96000 (000614400) Symbols: 0000990 Name: E:\WINNT\System32\ntdll.dll

** VA range: 0x77e60000:0x77ef4000 (000606208) Symbols: 0000682 Name: E:\WINNT\

system32\KERNEL32.dll

** Base VA 0x80080000 Symbols: 0001035 Name: \WINNT\System32\ntoskrnl.exe

. . .
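The rep output is easy to sanity-check: the decimal number in parentheses should be the size of the VA range, and the PID that rep echoes in hex should be the decimal PID given on the command line. The sketch below parses the three "VA range" lines shown above, with the wrapped KERNEL32 line joined. The regular expression is our reading of the line layout, not a documented format.

```python
import re

# Lines copied from the rep output above.
rep_lines = [
    r"** VA range: 0x00400000:0x00438000 (000229376) Symbols: 0000757 Name: E:\li\li.exe",
    r"** VA range: 0x77f00000:0x77f96000 (000614400) Symbols: 0000990 Name: E:\WINNT\System32\ntdll.dll",
    r"** VA range: 0x77e60000:0x77ef4000 (000606208) Symbols: 0000682 Name: E:\WINNT\system32\KERNEL32.dll",
]

VA_RANGE = re.compile(
    r"VA range: 0x([0-9a-fA-F]+):0x([0-9a-fA-F]+) \((\d+)\) Symbols: (\d+) Name: (.+)"
)


def parse_va_range(line):
    """Return (start, end, size, symbols, image name) from one 'VA range' line."""
    match = VA_RANGE.search(line)
    if match is None:
        raise ValueError(f"not a VA range line: {line!r}")
    start, end = int(match.group(1), 16), int(match.group(2), 16)
    return start, end, int(match.group(3)), int(match.group(4)), match.group(5)


for line in rep_lines:
    start, end, size, symbols, name = parse_va_range(line)
    # The decimal size printed by rep should equal end - start.
    print(f"{name}: {end - start} bytes, {symbols} symbols, "
          f"size matches: {end - start == size}")

# "rep -pid 117" takes a decimal PID; rep echoes it as 0x00000075.
print(hex(117))
```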

Reduce a small event

 

 

Let's try the harness on a sample event - picking the smallest .dat file for a quick test.

 

Use "-s" to only produce a summary report

 

E:\li> perl harness.pl -x li.exe -e bcache_miss -d bcache -s

Os=NT

 

Generating top-level report for bcache_miss

ipreduce -input_file bcache.dat -output_file bcache_miss.rpt -event bcache_miss

-pthresh 1

 

No reports for counter 2 (Bcache Miss)

Reason: no samples matched with specifiedcommand line switches

 

Filtered sample count is zero, all reports suppressed

 

can't open bcache_miss.rpt at harness.pl line 265.

 

Apparently there were no Bcache misses. OK, let's look at hits:

 

E:\li> perl harness.pl -x li.exe -e bcache_hit -d bcache -s

Os=NT

 

Generating top-level report for bcache_hit

ipreduce -input_file bcache.dat -output_file bcache_hit.rpt -event bcache_hit

-pthresh 1

 

E:\li> dir *hot*

Volume in drive E is nt4sp2vc5rc12

Volume Serial Number is 9457-96EB

 

Directory of E:\li

 

07/07/97 02:32p 593 bcache_hit.hot_routines

1 File(s) 593 bytes

1,154,143,744 bytes free

 

E:\li> cat *hot*

Hot Routines for bcache_hit -pthresh 1

Events   %  Routine      Image         Addr
  1890  44  mark         E:\li\li.exe  405590:40580F
  1180  27  xlminit      E:\li\li.exe  405AC0:405E1F
   233   5  xlsave       E:\li\li.exe  4063F0:40681F
   195   5  xleval       E:\li\li.exe  405E20:405EFF
   142   3  cons         E:\li\li.exe  4048B0:40495F
    91   2  xlabind      E:\li\li.exe  406150:4063EF
    53   1  xlygetvalue  E:\li\li.exe  40E750:40E7AF
    43   1  consd        E:\li\li.exe  404A00:404A9F
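The .hot_routines file is regular enough to pick apart with a few lines of script, which is handy for comparing events later. The sketch below is our own reading of the layout shown above (five whitespace-separated fields per routine), not code from harness.pl. It would need adjusting for image paths that contain spaces, and it treats the second address as the last byte of the routine, which is an inference from adjacent routines such as xlminit and xleval.

```python
hot_routines_text = r"""Hot Routines for bcache_hit -pthresh 1
Events % Routine Image Addr
1890 44 mark E:\li\li.exe 405590:40580F
1180 27 xlminit E:\li\li.exe 405AC0:405E1F
233 5 xlsave E:\li\li.exe 4063F0:40681F
195 5 xleval E:\li\li.exe 405E20:405EFF
142 3 cons E:\li\li.exe 4048B0:40495F
91 2 xlabind E:\li\li.exe 406150:4063EF
53 1 xlygetvalue E:\li\li.exe 40E750:40E7AF
43 1 consd E:\li\li.exe 404A00:404A9F
"""


def parse_hot_routines(text):
    """Yield one dict per routine line of a .hot_routines file."""
    for line in text.splitlines():
        fields = line.split()
        # Data lines have five fields and start with an event count.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        events, percent, routine, image, addr = fields
        first, last = (int(part, 16) for part in addr.split(":"))
        yield {
            "events": int(events),
            "percent": int(percent),
            "routine": routine,
            "image": image,
            "first_pc": first,
            "last_pc": last,
            # Assumption: the end address is the routine's last byte, so add 1.
            "bytes": last - first + 1,
        }


records = list(parse_hot_routines(hot_routines_text))
for rec in records:
    # Alpha instructions are 4 bytes each.
    print(f'{rec["routine"]:<12} {rec["events"]:>5} events, '
          f'{rec["bytes"] // 4} instructions')
```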

 

Try to look at routines

 

The summary level information looks good. Re-enter the same command without the "-s", and the harness proceeds to disassemble the benchmark and add detailed reports by routines.

 

E:\li>perl harness.pl -x li.exe -e bcache_hit -d bcache

Os=NT

 

ipreduce -o bcache_hit_mark.rpd -d pc -event bcache_hit -input_file bcache.dat

-pc 405590:40580F

 

Generating disassembly of E:\li\li.exe

dumpbin/disasm E:\li\li.exe > E:\li\li.asm

Extracting mark from E:\li\li.asm

Annotating mark

 

ipreduce -o bcache_hit_xlminit.rpd -d pc -event bcache_hit -input_file

bcache.dat -pc 405AC0:405E1F

Extracting xlminit from E:\li\li.asm

Annotating xlminit

 

ipreduce -o bcache_hit_xlsave.rpd -d pc -event bcache_hit -input_file

bcache.dat -pc 4063F0:40681F

Extracting xlsave from E:\li\li.asm

Annotating xlsave

 

ipreduce -o bcache_hit_xleval.rpd -d pc -event bcache_hit -input_file

bcache.dat -pc 405E20:405EFF

Extracting xleval from E:\li\li.asm

Annotating xleval

 

ipreduce -o bcache_hit_cons.rpd -d pc -event bcache_hit -input_file

bcache.dat -pc 4048B0:40495F

Extracting cons from E:\li\li.asm

Annotating cons

 

ipreduce -o bcache_hit_xlabind.rpd -d pc -event bcache_hit -input_file

bcache.dat -pc 406150:4063EF

Extracting xlabind from E:\li\li.asm

Annotating xlabind

 

ipreduce -o bcache_hit_xlygetvalue.rpd -d pc -event bcache_hit -input_file

bcache.dat -pc 40E750:40E7AF

Extracting xlygetvalue from E:\li\li.asm

Annotating xlygetvalue

 

ipreduce -o bcache_hit_consd.rpd -d pc -event bcache_hit -input_file

bcache.dat -pc 404A00:404A9F

Extracting consd from E:\li\li.asm

Annotating consd

 

Throw away .dis files without fear

Everything seems to be in working order. But the annotated disassemblies will now have bcache_hit as the first column, since that was the first event we reduced. It seems preferable to have cycles be the first event, so throw away the annotated disassemblies.

 

E:\li>del *dis

 

Note: when the harness says "Extracting mumble", it creates mumble.dib, which contains the instructions for that routine WITHOUT any event annotations. When events are added, the .dis files are created or updated. We are not deleting the extracts, just the annotations.

 

E:\li>dir *.dib

Volume in drive E is nt4sp2vc5rc12

Volume Serial Number is 9457-96EB

 

Directory of E:\li

 

07/07/97 02:33p 1,857 cons.dib

07/07/97 02:33p 1,693 consd.dib

07/07/97 02:33p 6,612 mark.dib

07/07/97 02:33p 6,944 xlabind.dib

07/07/97 02:33p 2,348 xleval.dib

07/07/97 02:33p 2,188 xlminit.dib

07/07/97 02:33p 2,184 xlsave.dib

07/07/97 02:33p 1,041 xlygetvalue.dib

8 File(s) 24,867 bytes

1,152,613,888 bytes free
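The reason "del *dis" is safe comes down to the wildcard: it selects the annotated .dis files and nothing else. The sketch below illustrates this with Python's fnmatch, which only approximates the NT shell's wildcard matching. The file list is a hypothetical sample that follows the naming seen in this directory.

```python
from fnmatch import fnmatch

# Hypothetical sample of the files in E:\li at this point.
files = ["xlsave.dib", "xlsave.dis", "mark.dib", "mark.dis",
         "li.asm", "bcache_hit_mark.rpd"]


def matched(pattern, names):
    """Names a wildcard pattern would select, ignoring case like the NT shell."""
    return [name for name in names if fnmatch(name.lower(), pattern.lower())]


print(matched("*dis", files))   # what "del *dis" removes
print(matched("*.dib", files))  # what "dir *.dib" still shows afterwards
```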

 

Let's go for the big event: cycles

 

E:\li>perl harness.pl -x li -e cycles -d cyc_dry_maf

Os=NT

 

Generating top-level report for cycles

ipreduce -input_file cyc_dry_maf.dat -output_file cycles.rpt -event cycles

-pthresh 1

 

100 buffers...

200 buffers...

300 buffers...

400 buffers...

500 buffers...

600 buffers...

700 buffers...

800 buffers...

900 buffers...

1000 buffers...

1100 buffers...

1200 buffers...

1300 buffers...

1400 buffers...

1500 buffers...

1600 buffers...

1700 buffers...

1800 buffers...

1900 buffers...

2000 buffers...

2100 buffers...

2200 buffers...

2300 buffers...

2400 buffers...

2500 buffers...

2600 buffers...

2700 buffers...

2800 buffers...

2900 buffers...

3000 buffers...

 

ipreduce -o cycles_mark.rpd -d pc -event cycles -input_file cyc_dry_maf.dat

-pc 405590:40580F

Annotating mark

 

ipreduce -o cycles_xlsave.rpd -d pc -event cycles -input_file cyc_dry_maf.dat

-pc 4063F0:40681F

Annotating xlsave

 

ipreduce -o cycles_xlygetvalue.rpd -d pc -event cycles -input_file

cyc_dry_maf.dat -pc 40E750:40E7AF

^C at harness.pl line 359, <HOT> line 5.

 

Oops. Note that the first ipreduce command used "-pthresh 1". That's the default, to include routines that generate at least 1% of the events; but it seems preferable to use 5% this time. Kill the script, throw away the top level (1%) reports and start again.
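The effect of the threshold is easy to see on the bcache_hit summary produced earlier. The sketch below is an illustration under stated assumptions: the report does not print the total event count, so 4300 is a hypothetical total chosen to be consistent with the rounded percentages, and the "at least pthresh percent" comparison is our guess at how ipreduce applies the switch.

```python
# Event counts from the bcache_hit summary at -pthresh 1, shown earlier.
bcache_hit = {
    "mark": 1890, "xlminit": 1180, "xlsave": 233, "xleval": 195,
    "cons": 142, "xlabind": 91, "xlygetvalue": 53, "consd": 43,
}

# Assumption: the true total is not printed in the report; 4300 is a
# hypothetical value consistent with "1890 events = 44%".
ASSUMED_TOTAL = 4300


def hot(counts, total, pthresh):
    """Routines whose share of total events is at least pthresh percent."""
    return [name for name, n in counts.items() if 100.0 * n / total >= pthresh]


print(hot(bcache_hit, ASSUMED_TOTAL, 1))
print(hot(bcache_hit, ASSUMED_TOTAL, 5))
```

Note that a routine printed as a rounded 5% can still fall under a 5% threshold: with this assumed total, xleval's 195 events come to about 4.5%.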

Reduce all counters - from the top with a new pthresh



E:\li> del *rpt

E:\li> del *hot*

 

E:\li> perl harness.pl -x li -p 5 -e cycles -d cyc_dry_maf

Os=NT

 

Generating top-level report for cycles

ipreduce -input_file cyc_dry_maf.dat -output_file cycles.rpt -event cycles

-pthresh 5

 

ipreduce -o cycles_mark.rpd -d pc -event cycles -input_file cyc_dry_maf.dat

-pc 405590:40580F

Annotating mark

 

ipreduce -o cycles_xlsave.rpd -d pc -event cycles -input_file cyc_dry_maf.dat

-pc 4063F0:40681F

Annotating xlsave

 

ipreduce -o cycles_xlygetvalue.rpd -d pc -event cycles -input_file

cyc_dry_maf.dat -pc 40E750:40E7AF

Annotating xlygetvalue

 

ipreduce -o cycles_xlminit.rpd -d pc -event cycles -input_file cyc_dry_maf.dat

-pc 405AC0:405E1F

Annotating xlminit

 

ipreduce -o cycles_xleval.rpd -d pc -event cycles -input_file cyc_dry_maf.dat

-pc 405E20:405EFF

Annotating xleval

 

ipreduce -o cycles_xlxgetvalue.rpd -d pc -event cycles -input_file

cyc_dry_maf.dat -pc 40E690:40E74F

Extracting xlxgetvalue from E:\li\li.asm

Annotating xlxgetvalue

 

 

E:\li> perl harness.pl -x li -p 5 -d cyc_dry_maf -e pipe_dry

Os=NT

 

Generating top-level report for pipe_dry

ipreduce -input_file cyc_dry_maf.dat -output_file pipe_dry.rpt -event pipe_dry

-pthresh 5

 

ipreduce -o pipe_dry_mark.rpd -d pc -event pipe_dry -input_file cyc_dry_maf.dat

-pc 405590:40580F

Annotating mark

 

ipreduce -o pipe_dry_xlsave.rpd -d pc -event pipe_dry -input_file

cyc_dry_maf.dat -pc 4063F0:40681F

Annotating xlsave

 

ipreduce -o pipe_dry_xlygetvalue.rpd -d pc -event pipe_dry -input_file

cyc_dry_maf.dat -pc 40E750:40E7AF

Annotating xlygetvalue

 

ipreduce -o pipe_dry_xlxgetvalue.rpd -d pc -event pipe_dry -input_file

cyc_dry_maf.dat -pc 40E690:40E74F

Annotating xlxgetvalue

 

ipreduce -o pipe_dry_xleval.rpd -d pc -event pipe_dry -input_file

cyc_dry_maf.dat -pc 405E20:405EFF

Annotating xleval

 

ipreduce -o pipe_dry_xlminit.rpd -d pc -event pipe_dry -input_file

cyc_dry_maf.dat -pc 405AC0:405E1F

Annotating xlminit

 

 

E:\li> perl harness.pl -x li -p 5 -d icache_miss

Os=NT

 

Generating top-level report for icache_miss

ipreduce -input_file icache_miss.dat -output_file icache_miss.rpt -event

icache_miss -pthresh 5

 

ipreduce -o icache_miss_xlsave.rpd -d pc -event icache_miss -input_file

icache_miss.dat -pc 4063F0:40681F

Annotating xlsave

 

ipreduce -o icache_miss_xlxgetvalue.rpd -d pc -event icache_miss -input_file

icache_miss.dat -pc 40E690:40E74F

Annotating xlxgetvalue

 

ipreduce -o icache_miss_xlobgetvalue.rpd -d pc -event icache_miss -input_file

icache_miss.dat -pc 40A930:40AB6F

Extracting xlobgetvalue from E:\li\li.asm

Annotating xlobgetvalue

 

ipreduce -o icache_miss_xleval.rpd -d pc -event icache_miss -input_file

icache_miss.dat -pc 405E20:405EFF

Annotating xleval

 

ipreduce -o icache_miss_xlevarg.rpd -d pc -event icache_miss -input_file

icache_miss.dat -pc 40DEB0:40DF2F

Extracting xlevarg from E:\li\li.asm

Annotating xlevarg

 

ipreduce -o icache_miss_xlygetvalue.rpd -d pc -event icache_miss -input_file

icache_miss.dat -pc 40E750:40E7AF

Annotating xlygetvalue

 

E:\li> perl harness.pl -x li -p 5 -d dcache_miss

Os=NT

 

Generating top-level report for dcache_miss

ipreduce -input_file dcache_miss.dat -output_file dcache_miss.rpt -event

dcache_miss -pthresh 5

 

ipreduce -o dcache_miss_mark.rpd -d pc -event dcache_miss -input_file

dcache_miss.dat -pc 405590:40580F

Annotating mark

 

ipreduce -o dcache_miss_xlminit.rpd -d pc -event dcache_miss -input_file

dcache_miss.dat -pc 405AC0:405E1F

Annotating xlminit

 

ipreduce -o dcache_miss_xlsave.rpd -d pc -event dcache_miss -input_file

dcache_miss.dat -pc 4063F0:40681F

Annotating xlsave

 

ipreduce -o dcache_miss_xleval.rpd -d pc -event dcache_miss -input_file

dcache_miss.dat -pc 405E20:405EFF

Annotating xleval

 

ipreduce -o dcache_miss_xlygetvalue.rpd -d pc -event dcache_miss -input_file

dcache_miss.dat -pc 40E750:40E7AF

Annotating xlygetvalue

 

ipreduce -o dcache_miss_xlxgetvalue.rpd -d pc -event dcache_miss -input_file

dcache_miss.dat -pc 40E690:40E74F

Annotating xlxgetvalue

 

ipreduce -o dcache_miss_xlobgetvalue.rpd -d pc -event dcache_miss -input_file

dcache_miss.dat -pc 40A930:40AB6F

Annotating xlobgetvalue

 

 

E:\li> perl harness.pl -x li -p 5 -d loads

Os=NT

 

Generating top-level report for loads

ipreduce -input_file loads.dat -output_file loads.rpt -event loads -pthresh 5

 

ipreduce -o loads_xlsave.rpd -d pc -event loads -input_file loads.dat -pc

4063F0:40681F

Annotating xlsave

 

ipreduce -o loads_mark.rpd -d pc -event loads -input_file loads.dat -pc

405590:40580F

Annotating mark

 

ipreduce -o loads_xlygetvalue.rpd -d pc -event loads -input_file loads.dat

-pc 40E750:40E7AF

Annotating xlygetvalue

 

ipreduce -o loads_xlxgetvalue.rpd -d pc -event loads -input_file loads.dat

-pc 40E690:40E74F

Annotating xlxgetvalue

 

ipreduce -o loads_xleval.rpd -d pc -event loads -input_file loads.dat -pc

405E20:405EFF

Annotating xleval

 

 

 

E:\li> perl harness.pl -x li -p 5 -d loads -e loads_merged

Os=NT

 

Generating top-level report for loads_merged

ipreduce -input_file loads.dat -output_file loads_merged.rpt -event

loads_merged -pthresh 5

 

ipreduce -o loads_merged_xlxgetvalue.rpd -d pc -event loads_merged

-input_file loads.dat -pc 40E690:40E74F

Annotating xlxgetvalue

 

ipreduce -o loads_merged_xlobgetvalue.rpd -d pc -event loads_merged

-input_file loads.dat -pc 40A930:40AB6F

Annotating xlobgetvalue

 

ipreduce -o loads_merged_xleval.rpd -d pc -event loads_merged -input_file

loads.dat -pc 405E20:405EFF

Annotating xleval

 

ipreduce -o loads_merged_xlsave.rpd -d pc -event loads_merged -input_file

loads.dat -pc 4063F0:40681F

Annotating xlsave

 

ipreduce -o loads_merged_xlevlist.rpd -d pc -event loads_merged -input_file

loads.dat -pc 406050:40611F

Extracting xlevlist from E:\li\li.asm

Annotating xlevlist

 

ipreduce -o loads_merged_mark.rpd -d pc -event loads_merged -input_file

loads.dat -pc 405590:40580F

Annotating mark

 

 

E:\li>perl harness.pl -x li -p 5 -d bcache -e bcache_hit

Os=NT

 

Generating top-level report for bcache_hit

ipreduce -input_file bcache.dat -output_file bcache_hit.rpt -event bcache_hit

-pthresh 5

 

ipreduce -o bcache_hit_mark.rpd -d pc -event bcache_hit -input_file

bcache.dat -pc 405590:40580F

Annotating mark

 

ipreduce -o bcache_hit_xlminit.rpd -d pc -event bcache_hit -input_file

bcache.dat -pc 405AC0:405E1F

Annotating xlminit

 

ipreduce -o bcache_hit_xlsave.rpd -d pc -event bcache_hit -input_file

bcache.dat -pc 4063F0:40681F

Annotating xlsave

 

E:\li> perl harness.pl -x li -p 5 -d bcache -e bcache_miss

Os=NT

 

Generating top-level report for bcache_miss

ipreduce -input_file bcache.dat -output_file bcache_miss.rpt -event

bcache_miss -pthresh 5

 

No reports for counter 2 (Bcache Miss)

Reason: no samples matched with specifiedcommand line switches

 

Filtered sample count is zero, all reports suppressed

 

can't open bcache_miss.rpt at harness.pl line 265.

 

 

 

E:\li> perl harness.pl -x li -p 5 -d scache -e scache_write

Os=NT

 

Generating top-level report for scache_write

ipreduce -input_file scache.dat -output_file scache_write.rpt -event

scache_write -pthresh 5

 

ipreduce -o scache_write_xlsave.rpd -d pc -event scache_write -input_file

scache.dat -pc 4063F0:40681F

Annotating xlsave

 

ipreduce -o scache_write_mark.rpd -d pc -event scache_write -input_file

scache.dat -pc 405590:40580F

Annotating mark

 

ipreduce -o scache_write_xleval.rpd -d pc -event scache_write -input_file

scache.dat -pc 405E20:405EFF

Annotating xleval

 

ipreduce -o scache_write_xlminit.rpd -d pc -event scache_write -input_file

scache.dat -pc 405AC0:405E1F

Annotating xlminit

 

 

E:\li> perl harness.pl -x li -p 5 -d scache -e scache_write_miss

Os=NT

 

Generating top-level report for scache_write_miss

ipreduce -input_file scache.dat -output_file scache_write_miss.rpt -event

scache_write_miss -pthresh 5

 

ipreduce -o scache_write_miss_xlminit.rpd -d pc -event scache_write_miss

-input_file scache.dat -pc 405AC0:405E1F

Annotating xlminit

 

ipreduce -o scache_write_miss_xlsave.rpd -d pc -event scache_write_miss

-input_file scache.dat -pc 4063F0:40681F

Annotating xlsave

 

ipreduce -o scache_write_miss_xlevlist.rpd -d pc -event scache_write_miss

-input_file scache.dat -pc 406050:40611F

Annotating xlevlist

 

ipreduce -o scache_write_miss_xlabind.rpd -d pc -event scache_write_miss

-input_file scache.dat -pc 406150:4063EF

Annotating xlabind

 

ipreduce -o scache_write_miss_xleval.rpd -d pc -event scache_write_miss

-input_file scache.dat -pc 405E20:405EFF

Annotating xleval

 

ipreduce -o scache_write_miss_cons.rpd -d pc -event scache_write_miss

-input_file scache.dat -pc 4048B0:40495F

Annotating cons

 

ipreduce -o scache_write_miss_xlbind.rpd -d pc -event scache_write_miss

-input_file scache.dat -pc 40E5E0:40E62F

Extracting xlbind from E:\li\li.asm

Annotating xlbind

 

Files created by harness.pl

 

 

E:\li>ls *asm

E:\li\*asm

li.asm

1434624 (1434374) bytes in 1 files

 

E:\li> ls *rpt

E:\li\*rpt

bcache_hit.rpt loads.rpt scache_write_miss.rpt

cycles.rpt loads_merged.rpt wb_maf_full_replays.rpt

dcache_miss.rpt pipe_dry.rpt

icache_miss.rpt scache_write.rpt

29184 (26539) bytes in 10 files

 

E:\li> ls *hot*

E:\li\*hot*

bcache_hit.hot_routines loads_merged.hot_routines

cycles.hot_routines pipe_dry.hot_routines

dcache_miss.hot_routines scache_write.hot_routines

icache_miss.hot_routines scache_write_miss.hot_routines

loads.hot_routines wb_maf_full_replays.hot_routines

6144 (4416) bytes in 10 files

 

 

E:\li> ls *xlsave*

E:\li\*xlsave*

bcache_hit_xlsave.rpd pipe_dry_xlsave.rpd

cycles_xlsave.rpd scache_write_miss_xlsave.rpd

dcache_miss_xlsave.rpd scache_write_xlsave.rpd

icache_miss_xlsave.rpd wb_maf_full_replays_xlsave.rpd

loads_merged_xlsave.rpd xlsave.dib

loads_xlsave.rpd xlsave.dis

62976 (60038) bytes in 12 files

 

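Harness.pl creates:

li.asm                  disassembly of the image, from dumpbin/disasm
<event>.rpt             top-level ipreduce report for the event
<event>.hot_routines    summary of the routines that meet the pthresh percentage
<event>_<routine>.rpd   ipreduce report by pc for one routine
<routine>.dib           instructions extracted for the routine, WITHOUT annotations
<routine>.dis           disassembly of the routine annotated with event counts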
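The naming pattern visible in the listings above can be written down as a small helper, which is useful for checking that a reduction finished. This is inferred from the file names in this directory, not taken from harness.pl itself; the function name `harness_outputs` and its parameters are hypothetical.

```python
def harness_outputs(event, routines, image_stem="li"):
    """File names the harness appears to create for one event.

    Inferred from the directory listings; not taken from harness.pl.
    """
    names = [f"{image_stem}.asm", f"{event}.rpt", f"{event}.hot_routines"]
    for routine in routines:
        names.append(f"{event}_{routine}.rpd")  # per-routine ipreduce report
        names.append(f"{routine}.dib")          # extract without annotations
        names.append(f"{routine}.dis")          # annotated disassembly
    return names


for name in harness_outputs("cycles", ["mark", "xlsave"]):
    print(name)
```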

 

 

Hot Routines

Done. Let's have a look at the hot routines for the various events (using the "cat" utility supplied with the NT Resource Kit):

 

E:\li>cat *hot*

 

Hot Routines for bcache_hit -pthresh 5

Events   %  Routine       Image         Addr
  1890  44  mark          E:\li\li.exe  405590:40580F
  1180  27  xlminit       E:\li\li.exe  405AC0:405E1F
   233   5  xlsave        E:\li\li.exe  4063F0:40681F

Hot Routines for cycles -pthresh 5

Events   %  Routine       Image         Addr
289813  23  mark          E:\li\li.exe  405590:40580F
213822  17  xlsave        E:\li\li.exe  4063F0:40681F
164149  13  xlygetvalue   E:\li\li.exe  40E750:40E7AF
115423   9  xlminit       E:\li\li.exe  405AC0:405E1F
 89642   7  xleval        E:\li\li.exe  405E20:405EFF
 86298   7  xlxgetvalue   E:\li\li.exe  40E690:40E74F

Hot Routines for dcache_miss -pthresh 5

Events   %  Routine       Image         Addr
 17495  25  mark          E:\li\li.exe  405590:40580F
  8218  12  xlminit       E:\li\li.exe  405AC0:405E1F
  7556  11  xlsave        E:\li\li.exe  4063F0:40681F
  6628  10  xleval        E:\li\li.exe  405E20:405EFF
  5647   8  xlygetvalue   E:\li\li.exe  40E750:40E7AF
  5112   7  xlxgetvalue   E:\li\li.exe  40E690:40E74F
  4520   7  xlobgetvalue  E:\li\li.exe  40A930:40AB6F

Hot Routines for icache_miss -pthresh 5

Events   %  Routine       Image         Addr
 11227  18  xlsave        E:\li\li.exe  4063F0:40681F
  9401  15  xlxgetvalue   E:\li\li.exe  40E690:40E74F
  5497   9  xlobgetvalue  E:\li\li.exe  40A930:40AB6F
  5326   9  xleval        E:\li\li.exe  405E20:405EFF
  4676   7  xlevarg       E:\li\li.exe  40DEB0:40DF2F
  3373   5  xlygetvalue   E:\li\li.exe  40E750:40E7AF

Hot Routines for loads -pthresh 5

Events   %  Routine       Image         Addr
 48619  19  xlsave        E:\li\li.exe  4063F0:40681F
 48561  19  mark          E:\li\li.exe  405590:40580F
 45121  18  xlygetvalue   E:\li\li.exe  40E750:40E7AF
 25131  10  xlxgetvalue   E:\li\li.exe  40E690:40E74F
 20119   8  xleval        E:\li\li.exe  405E20:405EFF

Hot Routines for loads_merged -pthresh 5

Events   %  Routine       Image         Addr
  2180  32  xlxgetvalue   E:\li\li.exe  40E690:40E74F
  1929  29  xlobgetvalue  E:\li\li.exe  40A930:40AB6F
   623   9  xleval        E:\li\li.exe  405E20:405EFF
   396   6  xlsave        E:\li\li.exe  4063F0:40681F
   364   5  xlevlist      E:\li\li.exe  406050:40611F
   338   5  mark          E:\li\li.exe  405590:40580F

 

 

Hot Routines for pipe_dry -pthresh 5

Events % Routine Image Addr

57989 18 mark E:\li\li.exe 405590:40580F

56530 18 xlsave E:\li\li.exe 4063F0:40681F

41431 13 xlygetvalue E:\li\li.exe 40E750:40E7AF

24491 8 xlxgetvalue E:\li\li.exe 40E690:40E74F

19962 6 xleval E:\li\li.exe 405E20:405EFF

15910 5 xlminit E:\li\li.exe 405AC0:405E1F

 

Hot Routines for scache_write -pthresh 5

Events % Routine Image Addr

30027 35 xlsave E:\li\li.exe 4063F0:40681F

11698 14 mark E:\li\li.exe 405590:40580F

8585 10 xleval E:\li\li.exe 405E20:405EFF

7003 8 xlminit E:\li\li.exe 405AC0:405E1F

 

Hot Routines for scache_write_miss -pthresh 5

Events % Routine Image Addr

605 43 xlminit E:\li\li.exe 405AC0:405E1F

151 11 xlsave E:\li\li.exe 4063F0:40681F

122 9 xlevlist E:\li\li.exe 406050:40611F

116 8 xlabind E:\li\li.exe 406150:4063EF

96 7 xleval E:\li\li.exe 405E20:405EFF

82 6 cons E:\li\li.exe 4048B0:40495F

70 5 xlbind E:\li\li.exe 40E5E0:40E62F

 

Hot Routines for wb_maf_full_replays -pthresh 5

Events % Routine Image Addr

497 46 xleval E:\li\li.exe 405E20:405EFF

177 16 xlsave E:\li\li.exe 4063F0:40681F

75 7 mark E:\li\li.exe 405590:40580F

68 6 cons E:\li\li.exe 4048B0:40495F

55 5 xlminit E:\li\li.exe 405AC0:405E1F

 

Let's look at the routine of interest:

 

E:\li> type xlsave.dis

Cycle=cycles

Cycle=cycles

PDry=pipe_dry

MfR=wb_maf_full_replays

IMi=icache_miss

DMis=dcache_miss

Ld=loads

LMe=loads_merged

BHi=bcache_hit

SWri=scache_write

SWM=scache_write_miss

xlsave:

Address Instruction Cycle Cycle PDry MfR IMi DMis Ld LMe BHi SWri SWM

004063F0 lda sp,0xFFA0(sp) 2248 2248 433 84 10 2 133

004063F4 stq a0,0x30(sp) 2039 2039 126 14 1 8 31

004063F8 stq a1,0x38(sp) 2022 2022 463 20 8 54

004063FC stq a2,0x40(sp) 1943 1943 49 21 2 47

00406400 stq a3,0x48(sp) 2055 2055 69 17 8 1 1 181 1

00406404 stq a4,0x50(sp) 1900 1900 29 1 1 3 248

00406408 stq s0,0(sp) 1839 1839 24 1 1 887

0040640C stq s1,8(sp) 2209 2209 60 28 6 405 2

. . .

 

Well, there's a pain in the neck: "Cycles" is reported in both the first and second event columns because we ran the harness twice for that event (remember the ^C and the switch to a 5% threshold?). But this gives us a chance to demonstrate a feature of the harness. Throw away the annotated disassemblies and regenerate them:

 

E:\li> del *.dis

 

E:\li> perl harness.pl -x li -p 5 -d cyc_dry_maf -e cycles

Os=NT

Annotating mark

Annotating xlsave

Annotating xlygetvalue

Annotating xlminit

Annotating xleval

Annotating xlxgetvalue

 

Note that ALL ipreduce calls are skipped, because the reports already exist. The .dib files are not regenerated either; only the .dis files are rebuilt, and that is very quick.

 

E:\li> perl harness.pl -x li -p 5 -d cyc_dry_maf -e pipe_dry

Os=NT

Annotating mark

Annotating xlsave

Annotating xlygetvalue

Annotating xlxgetvalue

Annotating xleval

Annotating xlminit

 

E:\li>perl harness.pl -x li -p 5 -d cyc_dry_maf -e wb_maf_full_replays

Os=NT

Annotating xleval

Annotating xlsave

Annotating mark

Annotating cons

Annotating xlminit

 

E:\li>perl harness.pl -x li -p 5 -d icache_miss

Os=NT

Annotating xlsave

Annotating xlxgetvalue

Annotating xlobgetvalue

Annotating xleval

Annotating xlevarg

Annotating xlygetvalue

 

E:\li>perl harness.pl -x li -p 5 -d dcache_miss

Os=NT

Annotating mark

Annotating xlminit

Annotating xlsave

Annotating xleval

Annotating xlygetvalue

Annotating xlxgetvalue

Annotating xlobgetvalue

 

E:\li>perl harness.pl -x li -p 5 -d loads

Os=NT

Annotating xlsave

Annotating mark

Annotating xlygetvalue

Annotating xlxgetvalue

Annotating xleval

 

E:\li>perl harness.pl -x li -p 5 -d loads -e loads_merged

Os=NT

Annotating xlxgetvalue

Annotating xlobgetvalue

Annotating xleval

Annotating xlsave

Annotating xlevlist

Annotating mark

 

E:\li>perl harness.pl -x li -p 5 -d bcache -e bcache_hit

Os=NT

Annotating mark

Annotating xlminit

Annotating xlsave

 

E:\li>perl harness.pl -x li -p 5 -d bcache -e bcache_miss

Os=NT

 

Generating top-level report for bcache_miss

ipreduce -input_file bcache.dat -output_file bcache_miss.rpt -event bcache_miss -pthresh 5

 

No reports for counter 2 (Bcache Miss)

Reason: no samples matched with specified command line switches

 

Filtered sample count is zero, all reports suppressed

 

can't open bcache_miss.rpt at harness.pl line 265.

 

E:\li>perl harness.pl -x li -p 5 -d scache -e scache_write

Os=NT

Annotating xlsave

Annotating mark

Annotating xleval

Annotating xlminit

 

E:\li>perl harness.pl -x li -p 5 -d scache -e scache_write_miss

Os=NT

Annotating xlminit

Annotating xlsave

Annotating xlevlist

Annotating xlabind

Annotating xleval

Annotating cons

Annotating xlbind

 

 

Done. Here's the annotated ev56 xlsave disassembly:

E:\li> type xlsave.dis

Cycle=cycles PDry=pipe_dry MfR=wb_maf_full_replays

IMi=icache_miss DMis=dcache_miss Ld=loads

LMe=loads_merged BHi=bcache_hit SWri=scache_write

SWM=scache_write_miss

Address Instruction Cycle PDry MfR IMi DMis Ld LMe BHi SWri SWM

004063F0 lda sp,0xFFA0(sp) 2248 433 84 10 2 133

004063F4 stq a0,0x30(sp) 2039 126 14 1 8 31

004063F8 stq a1,0x38(sp) 2022 463 20 8 54

004063FC stq a2,0x40(sp) 1943 49 21 2 47

00406400 stq a3,0x48(sp) 2055 69 17 8 1 1 181 1

00406404 stq a4,0x50(sp) 1900 29 1 1 3 248

00406408 stq s0,0(sp) 1839 24 1 1 887

0040640C stq s1,8(sp) 2209 60 28 6 405 2

00406410 stq a5,0x58(sp) 1883 28 8 5 386

00406414 stq ra,0x10(sp) 1819 23 3 5 705 2

00406418 ldah s0,0x43

0040641C lda s0,0xF6C8(s0) 2083 26 2 40

00406420 lda t0,0x30(sp) 1995 30 9 1 15 25

00406424 ldah s1,0x43

00406428 ldl a0,0(s0) 1876 17 4 1778

0040642C ldl v0,0x30(sp)

00406430 stl t0,0x18(sp) 2015 55 15 4 4 5 9 1

00406434 mov 8,t0

00406438 stl t0,0x1C(sp) 2088 30 3 984 8

0040643C lda s1,0xF790(s1)

00406440 stl a0,0x20(sp) 2538 40 7 323 150 3 204 3

00406444 beq v0,004064A8 901 7 407 169 7 238 5

00406448 ldl a0,0(s0) 14929 1676 2 15 4858 4 11 4810 1

0040644C ldl t0,0(s1) 4 1 4 3

00406450 cmpule a0,t0,t0 8875 3807 1 707 496 3 3350 17

00406454 beq t0,00406464

00406458 ldah a0,0x43

0040645C lda a0,0xA3B8(a0)

00406460 bsr ra,xlabort

00406464 ldl t0,0x1C(sp) 3448 1609 1 616 2

00406468 ldl v0,0(s0)

0040646C ldl t1,0x30(sp) 3817 1640 1 93 1

00406470 addl t0,8,t0 5540 855 1 826 7132 3 3 1757 2

00406474 lda v0,0xFFFC(v0) 3

00406478 stl t0,0x1C(sp) 3805 78 5 2 1 29

0040647C nop

00406480 xor sp,zero,sp 3965 81 254

00406484 ldq t2,0x18(sp) 7330 111 1 197 9952 8 1 820

00406488 stl v0,0(s0) 3628 40 1 2 197

0040648C stl t1,0(v0) 4123 91 21 39 1 2 73

00406490 addl t2,t0,t0 3440 40 1 1 61 1

00406494 ldl t3,0x30(sp) 7237 65 15 3586 2 1 79

00406498 stl zero,0(t3) 7510 147 32 1 13 3 886 3

0040649C ldl t0,0xFFF8(t0) 3595 36 2299

004064A0 stl t0,0x30(sp) 7420 198 7 99 3479 1 75

004064A4 bne t0,00406448

004064A8 ldq ra,0x10(sp) 4831 568 3 529 1016

004064AC ldq s0,0(sp)

004064B0 ldq s1,8(sp) 3423 561 3 1395 1 1 1188 1

004064B4 ldl v0,0x20(sp)

004064B8 lda sp,0x60(sp) 2024 560 1583 2

004064BC ret 1

And here's the (previously generated) pca56 xlsave disassembly:

 

Cycle=cycles PDry=pipe_dry MfRe=wb_maf_full_replays

IMis=icache_miss DMi=dcache_miss Ld=loads

LMe=loads_merged BWri=bcache_write BWHi=bcache_write_hit

Address Instruction Cycle PDry MfRe IMis DMi Ld LMe BWri BWHi

004063F0 lda sp,0xFFA0(sp) 3943 3565 16 92 14 166 393

004063F4 stq a0,0x30(sp) 2747 911 239 1 1 78 390

004063F8 stq a1,0x38(sp) 5149 3150 1732 1 270 1057

004063FC stq a2,0x40(sp) 2237 324 2 26 191

00406400 stq a3,0x48(sp) 8009 4775 3046 27 1 425 1904

00406404 stq a4,0x50(sp) 2205 607 5 52 186

00406408 stq s0,0(sp) 1881 588 7 76 224

0040640C stq s1,8(sp) 17143 9729 7568 30 1 1 802 4876

00406410 stq a5,0x58(sp) 1369 1328 5 2 82 850

00406414 stq ra,0x10(sp) 5524 3239 1632 586 1913

00406418 ldah s0,0x43

0040641C lda s0,0xF6C8(s0) 1655 1239 6 255 662

00406420 lda t0,0x30(sp) 2417 1304 26 233 572

00406424 ldah s1,0x43

00406428 ldl a0,0(s0) 1652 419 4 1 1 92 604

0040642C ldl v0,0x30(sp)

00406430 stl t0,0x18(sp) 10004 4626 3965 30 247 1407 3 819 2889

00406434 mov 8,t0

00406438 stl t0,0x1C(sp) 1362 767 7 1 170 1365

0040643C lda s1,0xF790(s1)

00406440 stl a0,0x20(sp) 6172 987 14 20 219 109 428 1319

00406444 beq v0,004064A8 4547 129 14 4 243 96 342 1386

00406448 ldl a0,0(s0) 14281 4842 27 3 3236 2 861 5190

0040644C ldl t0,0(s1) 1

00406450 cmpule a0,t0,t0 14648 4200 37 13 683 456 1 1381 4470

00406454 beq t0,00406464

00406458 ldah a0,0x43

0040645C lda a0,0xA3B8(a0)

00406460 bsr ra,xlabort

00406464 ldl t0,0x1C(sp) 4256 1722 9 10 6 309 686

00406468 ldl v0,0(s0) 82 16 8 7

0040646C ldl t1,0x30(sp) 3425 1284 6 1 331 1359

00406470 addl t0,8,t0 10876 911 28 11 738 7201 10 625 1936

00406474 lda v0,0xFFFC(v0) 9 7

00406478 stl t0,0x1C(sp) 7281 1794 1992 1 1620 1 525 3797

0040647C nop

00406480 xor sp,zero,sp 4024 520 5 5 3 411 950

00406484 ldq t2,0x18(sp) 7353 936 6 2 195 8325 7 662 2337

00406488 stl v0,0(s0) 8417 2967 2534 1 486 723 2811

0040648C stl t1,0(v0) 14511 6781 5173 5 1 712 1 1216 5079

00406490 addl t2,t0,t0 3505 1154 6 402 2120

00406494 ldl t3,0x30(sp) 5448 274 10 2287 1 397 1881

00406498 stl zero,0(t3) 14369 6426 3792 6 843 1131 4915

0040649C ldl t0,0xFFF8(t0) 3950 1363 7 682 1021

004064A0 stl t0,0x30(sp) 18845 7422 5722 6 78 4115 5 1755 7247

004064A4 bne t0,00406448

004064A8 ldq ra,0x10(sp) 5006 1956 4 1 501 300 1376

004064AC ldq s0,0(sp)

004064B0 ldq s1,8(sp) 3526 2213 2 563 290 1736

004064B4 ldl v0,0x20(sp)

004064B8 lda sp,0x60(sp) 1802 1343 5 176 1181

004064BC ret 4 1 1

Observations - 1

 

Using Map Network Drive the ev56 reports are currently mounted on drive G: and the pca56 reports on drive H:. Comparing the two, we can make the following observations and hypotheses:

 

 

  1. The pca56 takes about 60% more cycles than the ev56, and its pipeline is drier.

 

E:\>findstr "Cycles.*[0-9]" g:\li\cycles.rpt h:\li\cycles.rpt

g:\li\cycles.rpt: Cycles 1254830

h:\li\cycles.rpt: Cycles 1984890

 

E:\>findstr "Dry.*[0-9]" g:\li\pipe_dry.rpt h:\li\pipe_dry.rpt

g:\li\pipe_dry.rpt: Pipe Dry 314420

h:\li\pipe_dry.rpt: Pipe Dry 743731

 

The findstr regular expression looks for the summary line in the top-level IPROBE report, searching for the event name and a count.
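For readers who want to script the same lookup instead of running findstr by hand, the pattern can be sketched in Python. This is only an illustration: the helper name `summary_count` is hypothetical, and the sample text is modelled on the findstr output above rather than copied from a real IPROBE report, whose exact layout is assumed here.

```python
import re

def summary_count(report_text, event_name):
    """Return the count from the first line naming event_name, or None.

    Mirrors the findstr pattern "<event>.*[0-9]": the event name, then
    anything, then digits. Assumes the count is the last integer on the line.
    """
    pattern = re.compile(re.escape(event_name) + r".*?(\d+)\s*$")
    for line in report_text.splitlines():
        match = pattern.search(line)
        if match:
            return int(match.group(1))
    return None

# Hypothetical report excerpt; only the summary line carries a count.
sample = "Event summary\n  Cycles            1254830\n"
print(summary_count(sample, "Cycles"))
```

Anchoring on the event name plus a trailing number is what keeps header lines, which mention the event but carry no count, out of the result.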

 

The pca56 system incurs about 48B more cycles than the ev56 ((1984890-1254830) cycle counter overflows * 2^16 cycles per counter overflow), which makes its total about 1.6x that of the ev56.

 

The pca56 spends about 28B more cycles dry than the ev56 ((743731-314420)*2^16).

The ev56 is about 25% dry (314 thousand pipe dry counter overflows / 1255 thousand cycle counter overflows) and the pca56 is 37% dry (744/1985).
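The arithmetic in this observation can be checked with a few lines of Python. The sample counts are the ones reported by findstr above; the variable names are illustrative only.

```python
SAMPLE_PERIOD = 2 ** 16  # events per sample for both cycles and pipe_dry

# Sample (counter overflow) counts from the top-level reports.
cycles = {"ev56": 1254830, "pca56": 1984890}
pipe_dry = {"ev56": 314420, "pca56": 743731}

# Convert sample differences into raw cycle counts.
extra_cycles = (cycles["pca56"] - cycles["ev56"]) * SAMPLE_PERIOD
extra_dry = (pipe_dry["pca56"] - pipe_dry["ev56"]) * SAMPLE_PERIOD

# Both counters share a sample period, so ratios can use samples directly.
cycle_ratio = cycles["pca56"] / cycles["ev56"]
dry_fraction = {cpu: pipe_dry[cpu] / cycles[cpu] for cpu in cycles}

print(f"extra cycles on pca56: {extra_cycles / 1e9:.1f}B")
print(f"extra dry cycles:      {extra_dry / 1e9:.1f}B")
print(f"cycle ratio:           {cycle_ratio:.2f}x")
for cpu, fraction in dry_fraction.items():
    print(f"{cpu} dry fraction: {fraction:.0%}")
```

Because cycles and pipe_dry are sampled at the same period, the dry fractions need no unit conversion; that is not true for the event pairs compared later.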

 

 

Observations - 2

2. MAF replays contribute about 40% of the additional pca56 dry time.

 

E:\>findstr "Maf.*[0-9]" g:\li\*maf*.rpt h:\li\*maf*.rpt

g:\li\wb_maf_full_replays.rpt: Wb Maf Full Replays 1078

h:\li\wb_maf_full_replays.rpt: Wb Maf Full Replays 100751

 

Each MAF replay is expected to cost 7 cycles. Each reported pipe dry counter overflow represents 2^16 events, and each reported Maf Replay counter overflow represents 2^14 events:

 

G:\li>findstr /c:"One sample " *rpt

bcache_hit.rpt: * One sample = 65536 events *

cycles.rpt: * One sample = 65536 events *

dcache_miss.rpt: * One sample = 16384 events *

icache_miss.rpt: * One sample = 16384 events *

loads.rpt: * One sample = 65536 events *

loads_merged.rpt: * One sample = 16384 events *

pipe_dry.rpt: * One sample = 65536 events *

scache_write.rpt: * One sample = 65536 events *

scache_write_miss.rpt: * One sample = 16384 events *

wb_maf_full_replays.rpt: * One sample = 16384 events *

 

Therefore we have (743731-314420)*2^16 = 28B extra dry cycles which include 7*(100751-1078)*2^14 = 11B cycles for Maf Replays. About 40% (11/28) of the extra dry time is due to Maf Replays.
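As a quick check of the unit handling, here is the same estimate in Python. The 7-cycle replay cost is the expectation quoted in the text, not a measured value, and the constant names are illustrative.

```python
PIPE_DRY_PERIOD = 2 ** 16  # events per pipe_dry sample
MAF_PERIOD = 2 ** 14       # events per wb_maf_full_replays sample
CYCLES_PER_REPLAY = 7      # expected cost of one MAF replay (from the text)

# Extra dry cycles on the pca56 relative to the ev56.
extra_dry_cycles = (743731 - 314420) * PIPE_DRY_PERIOD

# Extra cycles attributable to MAF replays on the pca56.
extra_replay_cycles = CYCLES_PER_REPLAY * (100751 - 1078) * MAF_PERIOD

share = extra_replay_cycles / extra_dry_cycles
print(f"replay cycles: {extra_replay_cycles / 1e9:.1f}B")
print(f"share of extra dry time: {share:.0%}")
```

Note that the two counters use different sample periods, so the sample counts must be scaled before they are compared.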

 

Observations - 3

  3. Icache misses are higher on the pca56, even though it has a larger Icache, and these misses account for the other 60% of the increase in dry time.

 

g:\li\icache_miss.rpt: Icache Miss 63000

h:\li\icache_miss.rpt: Icache Miss 100406

If icache misses each cost, say, 8 cycles on an ev56 and 16 cycles on this pca56, then the above would account for an additional 18B cycles on the pca56 (100406*2^14*16 - 63000*2^14*8).

 

A latency estimate of 8 cycles for ev56's on-chip S-cache is based on http://www.digital.com/info/DTJH09/DTJH09SC.TXT. The pca56 estimate of 16 cycles is based on the fact that the pca56 had its bcache latency set to 8 cycles, which is assumed to be added to an on-chip time of about the same amount as on the ev56. Empirical evidence for 8 and 16 might be indicated by the Dependent Load measurements with the Nix memtest (gem-alpha-perf note 331.33).

 

This roughly accounts for the other 60% (18/28) of the extra dry time.
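The estimate can be reproduced as follows. The per-miss costs of 8 and 16 cycles are the assumptions stated above, not measurements, so the result is only as good as those guesses.

```python
ICACHE_PERIOD = 2 ** 14    # events per icache_miss sample
PIPE_DRY_PERIOD = 2 ** 16  # events per pipe_dry sample

EV56_MISS_COST = 8    # assumed cycles per icache miss (on-chip S-cache)
PCA56_MISS_COST = 16  # assumed cycles per icache miss (off-chip B-cache)

ev56_miss_cycles = 63000 * ICACHE_PERIOD * EV56_MISS_COST
pca56_miss_cycles = 100406 * ICACHE_PERIOD * PCA56_MISS_COST
extra_miss_cycles = pca56_miss_cycles - ev56_miss_cycles

extra_dry_cycles = (743731 - 314420) * PIPE_DRY_PERIOD
share = extra_miss_cycles / extra_dry_cycles

print(f"extra icache miss cycles: {extra_miss_cycles / 1e9:.1f}B")
print(f"share of extra dry time:  {share:.0%}")
```

Changing either cost constant shows how sensitive the 60% figure is to the latency assumptions.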

 

Observations - 4

 

  4. Write misses are similar between the two systems.

 

E:\>findstr "cache.*[0-9]" g:\li\*scache_write.rpt h:\li\*bcache_write.rpt

g:\li\scache_write.rpt: Scache Write 86159

g:\li\scache_write.rpt: Scache Write Miss 1418

h:\li\bcache_write.rpt: Bcache Write 70251

h:\li\bcache_write.rpt: Bcache Write Hit 280824

 

Remember to always check the units for the counters you are using:

 

E:\>findstr /c:"One sample" g:\li\*scache*.rpt h:\li\*bcache*.rpt

g:\li\scache_write.rpt: * One sample = 65536 events *

g:\li\scache_write_miss.rpt: * One sample = 16384 events *

h:\li\bcache_write.rpt: * One sample = 65536 events *

h:\li\bcache_write_hit.rpt: * One sample = 16384 events *

 

Therefore we find that the ev56 writes miss the scache less than 1% of the time ((1418*16384)/(86159*65536)) and the pca56 writes hit the bcache over 99% of the time ((280824*16384)/(70251*65536)).
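Since the write counters and their miss or hit counters use different sample periods, it helps to convert samples to events before dividing. A minimal sketch, with an illustrative helper name:

```python
def events(samples, period):
    """Convert a sample count to an event count."""
    return samples * period

# ev56: scache_write sampled every 2^16 events, scache_write_miss every 2^14.
scache_miss_rate = events(1418, 2 ** 14) / events(86159, 2 ** 16)

# pca56: bcache_write sampled every 2^16 events, bcache_write_hit every 2^14.
bcache_hit_rate = events(280824, 2 ** 14) / events(70251, 2 ** 16)

print(f"ev56 S-cache write miss rate: {scache_miss_rate:.2%}")
print(f"pca56 B-cache write hit rate: {bcache_hit_rate:.2%}")
```

Dividing the raw sample counts without this scaling would overstate both ratios by a factor of four.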

 

Observations - 5 (uh-oh)

 

  5. The data reduction harness isn't picking up enough of the routines when it creates the .dis files. Note that ipreduce sees xlsave as the hot routine for imisses:

 

Begin End Sample Image Total

Address Address Name Count Pct Pct

------- ------- ---- ----- --- ---

004063F0-0040681F xlsave 21809 21.8 21.7

 

But no instruction in the pca56 xlsave.dis is listed as getting more than 92 icache_miss events! The problem is probably that the harness recognizes routine boundaries in the .asm disassembly by the occurrence of a label followed by a colon, so it stops extracting lines at 4064BC instead of continuing on to 40681F.

 

004064B4: A01E0020 ldl v0,0x20(sp)

004064B8: 23DE0060 lda sp,0x60(sp)

004064BC: 6BFA8001 ret

evalhook:

004064C0: 23DEFFE0 lda sp,0xFFE0(sp)

004064C4: 47FF0413 clr a3

004064C8: B75E0000 stq ra,0(sp)

 

Look for a future version of harness.pl to extract by address rather than by label. In the meantime, hand-extracts have been posted to http://tlg-www.zko.dec.com/~henning/li, where it can be noted that the icache misses are happening in evform and evfun.
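Extraction by address might look roughly like the sketch below. It is written in Python for illustration and is not a patch to harness.pl, whose source is not shown here; the function name `extract_by_address` is hypothetical, and the line format is assumed to match the .asm excerpt above, with each label on a line of its own.

```python
import re

# An instruction line starts with an 8-digit hex address and a colon, e.g.
# "004064B4: A01E0020 ldl v0,0x20(sp)". Assumes no label is itself 8 hex digits.
ADDRESS_LINE = re.compile(r"^\s*([0-9A-Fa-f]{8}):")

def extract_by_address(asm_lines, start, end):
    """Return the lines whose address lies in [start, end].

    Labels are kept when the instruction that follows them is in range,
    so interior labels such as "evalhook:" no longer end the extract.
    """
    selected = []
    pending = []  # labels and other lines waiting for the next instruction
    for line in asm_lines:
        match = ADDRESS_LINE.match(line)
        if not match:
            pending.append(line)
            continue
        address = int(match.group(1), 16)
        if start <= address <= end:
            selected.extend(pending)
            selected.append(line)
        pending = []
    return selected

asm = [
    "004064B8: 23DE0060 lda sp,0x60(sp)",
    "004064BC: 6BFA8001 ret",
    "evalhook:",
    "004064C0: 23DEFFE0 lda sp,0xFFE0(sp)",
    "004064C4: 47FF0413 clr a3",
]
# Address range for xlsave as reported by ipreduce.
extract = extract_by_address(asm, 0x4063F0, 0x40681F)
print(len(extract))
```

The range passed in is the one ipreduce already reports for the routine, so the annotated listing and the sample counts would cover the same instructions.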

 

Observations - 6

 

  6. The reason for the added icache misses is probably bcache latency. Note differences such as the following in the listings posted for observation 5:

 

Cycle=cycles IMi=icache_miss PDry=pipe_dry MfR=wb_maf_full_replays

 

Address Instruction Cycle IMi PDry MfR

ev56 evfun:

00406720 lda sp,0xFFC0(sp) 1395 698 613

00406724 clr a3

00406728 stq s0,0(sp) 192 168

0040672C stq s1,8(sp) 206 212

00406730 stq ra,0x10(sp) 194 3 184

00406734 stl a0,0x24(sp) 218 34 185

00406738 lda a0,0x20(sp)

0040673C stl a1,0x28(sp) 210 2

00406740 stl a2,0x2C(sp) 228 5 26

00406744 lda a1,0x18(sp)

00406748 lda a2,0x1C(sp) 188 1

0040674C bsr ra,xlsave

00406750 ldl t0,0x24(sp) 1413 4 835

00406754 stl v0,0x30(sp) 209 193 1

00406758 ldl s0,8(t0) 230 2 206

0040675C beq s0,0040676C 738 2 532

00406760 ldbu t2,0(s0) 182 1 69

pca56 evfun:

00406720 lda sp,0xFFC0(sp) 4997 691 4194 9

00406724 clr a3

00406728 stq s0,0(sp) 390 210

0040672C stq s1,8(sp) 62 3 235 25

00406730 stq ra,0x10(sp) 367 758 350 1

00406734 stl a0,0x24(sp) 461 10 25 8

00406738 lda a0,0x20(sp)

0040673C stl a1,0x28(sp) 25 2 2

00406740 stl a2,0x2C(sp) 516 724 256 2

00406744 lda a1,0x18(sp)

00406748 lda a2,0x1C(sp) 404 11

0040674C bsr ra,xlsave

00406750 ldl t0,0x24(sp) 1593 2 882 2

00406754 stl v0,0x30(sp) 96 196 1

00406758 ldl s0,8(t0) 518 238 2

0040675C beq s0,0040676C 1537 5 619 7

00406760 ldbu t2,0(s0) 434 1 146 1

00406764 cmpeq t2,3,t2 1244 215 2

Both processors start off evfun with similar icache misses charged to the first instruction. Both processors are presumably prefetching the following instructions, but the ev56 prefetches from the on-chip S-cache and the pca56 prefetches from the off-chip B-cache. The pca56 continues to incur about 700 icache miss events for each of the next two fetch blocks, whereas the ev56 has successfully prefetched them and avoided most of the misses.