Professional Documents
Culture Documents
1
Some Hard Problems in Translation Static vs. Dynamic Translation
• Day-to-day problems • Static
– Floating-point representation and operations
+ May have source information (or at least have object code)
– Precise exceptions and interrupts
+ Can spend as much time as you need (days to months)
• More obscure problems
- Isn’t always safe or possible
(Executables compiled from high-level languages tend - Not transparent to users
not to have these kind of problems) • Dynamic
– Self-modifying code
A program can construct an instruction (in the old format) as a - Translation time is part of program execution time
data word, store it to memory, and jump to it ⇒ Can’t do very complex analysis / optimization
– Self-referential code ⇒ Infrequently used code sections cost as much to translate
A program can checksum the code segment and compare it as frequently used code sections
against a stored value based on the original executable - No source-level information
– Register Indirect jumps to computed addresses + Has runtime information (dynamic profiling and optimization)
A program might compute a jump target that is only + Can fall back to interpretation if all else fails
appropriate for the original binary format and layout + Can be completely transparent to users
A program can jump to the middle of an x86 inst on purpose
2
Transmeta Crusoe & Code Morphing Crusoe VLIW Processor
128-bit molecule
x86 applications
FADD ADD LD BRCC
Complete
x86 x86 OS
Abstraction x86 BIOS
FPU ALU #0 Load/Store
Native Code Morphing Branch
10-stage 7-stage Unit
SW Dynamic Binary Translation
Parallel FUs
Out-of-Order
Translate
P4 uOP
Dispatch
In-Order
Retire
x86 uOP
check Translation
point restore Cache
VLIW FUs
Dispatch
shadow
VLIW
3
Code Morphing Software (CMS) Cost of Translation
• The only software written natively for Crusoe processors • Translation time is part of execution time!
– begins execution at power-up Translation cost has to be amortized over repeat use
– fetches previously unseen x86 basic block from memory • 1st pass translation must be fast and safe
– translates a block of x86 instructions at a time into Crusoe VLIW – almost like interpretation
– caches the translation for future use – x86 instructions are examined and translated byte-by-byte
– jumps to the generated Crusoe code for execution, execution can – CMS constructs a function that is equivalent to the basic block
continue directly into other blocks if translation is cached
– CMS jumps to the function and regain control when the fxn returns
– regains control when execution reaches a unknown basic block
– collects statistics, i.e. execution frequency, branch histories
– interprets the execution of “unsafe” x86 instructions
• Re-translate an often “repeated” basic block (after ~50 times)
– retranslates a block after collecting profiling information
– examines execution profile
• CMS uses a separate region of memory that cannot be – applies full-blown analysis and optimization
touched by code translated from x86 – builds inlined Crusoe code that can run directly out of the translation
• Crusoe processors do not need to be binary compatible cache without intervention by CMS
between generations – can do cross-basic block optimizations, such as speculative code
motion and trace scheduling
⇒ can make different design trade-offs but needs a new
• Caches translation for reuse to amortize translation cost
translator with a new processor
4
Branch Prediction
Example of Scheduling
2nd Pass Optimized Crusoe Atoms • Static prediction based on dynamic profiling
ld %r30, [%esp]
add %eax, %eax, %r30 • Translation can favor the more frequent traversed
add %ebx, %ebx, %r30 arm of an if-then-else statement by making that arm
ld %esi, [%ebp] the fall through (not-taken) path
sub.c %ecx, %ecx, 5
• Trace scheduling
– construct traces such that the most frequently traversed
VLIW Scheduling control flow paths encounters no branches at all
– enlarged scope of ILP scheduling
Final Pass Scheduled Crusoe Molecules – needs compensation code when falling off trace
{ ld %r30, [%esp] ; sub.c %ecx, %ecx, 5 }
{ ld %esi, [%ebp] ; add %eax, %eax, %r30 ; add %ebx, %ebx, %r30 }
• “select” instruction
– “SEL CC, Rd, Rs, Rt” means if (CC) Rd=Rs else Rd=Rt
– a limited variant of predicated execution
In-order execution of scheduled molecules on a Crusoe processor mimics – supports if-conversion, i.e change control-flow to data-flow
the dynamic superscalar execution of uOPs in Pentium’s
5
Precise Exception Handling Check Pointing x86 Machine
• CMS and Crusoe must emulate x86 behavior exactly,
Register File State
- a special “commit” instruction
including precise exception makes a copy of x86 register x86
Temporary registers for
Code Morphing Software &
registers
• But, an x86 instruction maps to several atoms and can contents in the shadow registers translated code
be reordered with atoms of other x86 instructions and - shadow registers is not touched by
commit
can be dispersed over a large code block after program execution restore
6
Crusoe vs. Pentium: Heat