[Note by Jay Sage: I am always actively recruiting new columnists for TCJ. Rick Charnes, whose excellent articles are now appearing in every issue, was my first success. With this issue I would like to introduce the second, Lee Hart. Lee has been carrying forward the development of Poor Person Software's Write-Hand-Man, the 8-bit "SideKick". That kind of program, like BGii, cannot tolerate wasteful coding, and in this, the first of two articles, Lee shares with us some of his tricks.]

TCJ Issue #39

PROGRAMMING FOR PERFORMANCE
by Lee A. Hart

Over the years, the ancient masters of the software arts have meticulously crafted the tools of structured programming. They have eloquently preached the virtues to body and soul that come from writing clean, healthy code, free from the evils of self-modifying code or the dreaded GOTO.

Many programmers have seen the light. They write exclusively in structured high-level languages, and avoid BASIC as if it carried AIDS. Assembly language is just that unreadable stuff the compiler generates as an intermediate step before linking. Memory and processor speed are viewed as infinite resources. Who cares if it takes 100K for a pop-up calculator program? And if it's not fast enough, use turbo mode, or a 386.

But a REAL pocket calculator doesn't have a 16-bit processor, or 100K of RAM; it typically runs a primitive 4-bit CPU at 1 MHz or less, with perhaps 2K of memory. Yet it can out-perform a PC clone having 10 times the speed and memory!

How can this be? Special hardware? Tricky instruction sets? On the contrary; CPU registers and instructions have instead been removed to cut cost. No; the surprising performance comes from clever, efficient programming with an extreme attention to detail. Such techniques are essential to the success of every high-volume micro-based product. But they aren't widely known and so are rarely applied to general-purpose microcomputers.

Suppose your micro doesn't provide the luxury of unlimited (or even adequate) resources. Your program absolutely has to fit in a certain space, such as a ROM. You're stuck with a slow CPU but must handle a hardware device with particularly severe timing requirements. Your C compiler just turned out a program that misses the mark by a megabyte. Don't give up! I'll show you some techniques that are particularly effective at "running light without overbyte," as Dr. Dobb's used to say.

I'll demonstrate these techniques with the Z80. With over 30 million sold last year alone, it remains the number-one-selling micro and is widely used in cost-effective designs when performance counts. However, the principles used apply to almost any microcomputer.

In the beginning

Novice Z80 programmers soon spot peculiarities in the instruction set; arcane rules restrict data movement between registers. For instance, the stack pointer can be loaded but not examined. Flags are not set automatically and must be explicitly updated by a math or logical instruction. The carry flag can be set or inverted but not reset. Of the six flags, only four can be tested by jumps and calls (and only two by relative jumps).

These limitations are no accident. They represent an artful compromise between cost, complexity, performance, and compatibility with the earlier 8080 instruction set. To get the most out of any micro, you must discover how its designer expected you to use the architecture. Get "inside" his head; become part of the machine.

The Z80 is register-oriented; it manipulates data in registers efficiently but deals rather clumsily with memory. Registers are specialized, with each having an intended purpose. Here are some rules I've found useful:

A = Accumulator: first choice for 8-bit data; best selection of load/store instructions; source and destination for most math, logical, and comparisons.

HL = High/Low address: first choice for 16-bit data/addresses; source and destination for 16-bit math; second choice for 8-bit data; pointer when one math/logical/compare operand is in memory; source address for block operations.

DE = DEstination: second choice for 16-bit data/addresses; third choice for 8-bit data; destination address for block operations.

BC = Byte Counter: third choice for 16-bit data/addresses; I/O port addresses; 8/16 bit counter for loops and block operations.

F = Flag byte (6 bits used): updated by math/logical/compare instructions; Zero, Carry, Sign, and Parity tested by conditional jumps, calls, or returns; Zero and Carry by relative jumps; block operations use Parity; bit tests use Zero; shifts use Carry; only decimal adjust tests Half-carry and Add/Subtract flags.

A',BC',DE',F',HL' = twins of A, BC, DE, F, HL; can be quickly swapped with main set; use for frequently used variables, fast interrupt handlers, task switching.

R = Refresh counter for dynamic RAM: also counts instructions for diagnostics, debuggers, copy-protection schemes; pseudorandom number generator; interrupt detection.

I = Interrupt vector: page address for interrupts in mode 2; otherwise, an extra 8-bit register that updates flags when read.

IX,IY = Index registers X and Y: Two 16-bit registers, used like HL as an indirect memory pointer, except instructions can include a relative offset.

SP = Stack Pointer: 16-bit memory pointer for LIFO (last-in first-out) stack to hold interrupt and subroutine return address, pushed/popped register data; stack-oriented data structures.

Naturally, some instructions get used a lot more than others. But frequency-of-use studies reveal that many programs NEVER use large portions of the instruction set. Sometimes there are good reasons, like sticking to 8080 opcodes so your code runs on an 8080/8085/V20 etc.
More often the programmer is simply unfamiliar with the entire instruction set, and so restricts himself to what he knows.

This is fine for noncritical uses but suicidal when performance counts. It's like running a racehorse with a gag in its throat. Take some time to go over the entire instruction set, one by one. Devise an example for each instruction that puts it to good use. I only know of nine turkeys with no use besides "trick" NOPs (can you find them?).

Here's a routine that might be written by a rather inept programmer (or an unusually efficient compiler). It outputs a string of characters ending in 0 to the console. It generally follows good programming practices; it's well structured, has clearly-defined entry and exit points, and carefully saves and restores all registers used.

            ...
            ld   de,string   ; point to message
            call outstr      ; output it
            ...
    outstr: push af          ; save registers
            push bc
            push de
            push hl
            ld   a,(de)      ; get next character
            cp   0           ; compare it to 0
            jp   z,outend    ; if not last char,
            push de          ; ..save registers
            ld   e,a
            ld   c,conout    ; output character to console
            call bdos
            pop  de          ; restore registers
            inc  de          ; advance to next
            jp   outstr      ; repeat until done
    outend: pop  hl          ; else 0 is last char
            pop  de
            pop  bc          ; restore registers
            pop  af
            ret              ; return
    string: db   'message'   ; display our message
            db   0           ; end of string marker

Now let's see how it can be improved. First, note that over half the instructions are PUSHes or POPs. This is the consequence of saving every register before use, a common compiler strategy. Though safe and simple, it's the single worst performance-killer I know.

The alternative is to push/pop only as necessary. This is easier said than done; miss one, and you've got a nasty bug to find. A good strategy helps. I initially define my routines to minimize the registers used; only push/pop as needed within the routine itself; and restore nothing on exit. In OUTSTR, this eliminates all but the PUSH DE/POP DE around the CALL BDOS.

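As a rough sketch of where that leaves OUTSTR (same logic as the listing above, still with the CP 0 and absolute jumps), assuming the usual CP/M 2.2 values for the BDOS and CONOUT equates, which the original listing does not show:

```asm
bdos    equ  5           ; assumed: CP/M BDOS entry point
conout  equ  2           ; assumed: BDOS console-output function number

outstr: ld   a,(de)      ; get next character
        cp   0           ; compare it to 0
        jp   z,outend    ; if 0, all done
        push de          ; save string pointer across the BDOS call
        ld   e,a         ; BDOS expects the character in E
        ld   c,conout    ; and the function number in C
        call bdos
        pop  de          ; restore string pointer
        inc  de          ; advance to next character
        jp   outstr      ; repeat until done
outend: ret              ; note: AF, BC, DE, HL are not preserved
```

The routine now returns with DE pointing at the terminating 0 and the other registers in whatever state the BDOS left them, so a caller that needs any of them must save them itself.
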
This shifts the save/restore burden to the calling routine. Since the caller also follows the rule of minimal register usage and push/pops only as necessary, it will probably not push/pop as many registers; thus we have increased speed by eliminating redundant push/pops. We have also made it explicitly clear which registers a caller really needs preserved.

Now I move the remaining push/pops to the called routines to save memory. If every caller saves a particular register, it obviously should be saved/restored by the subroutine itself. If two or more callers save it, speed is the deciding factor; preserve that register in the subroutine if the extra overhead is not a problem for callers that don't need that register preserved.

Push/pops are sloooww; at 21 to 29 T-states per pair, they make wonderful low-byte time killers. If possible, either use, or save to, a register that isn't killed by the called routine. In our example, try IX or IY instead of DE; the index registers aren't trashed by the BDOS call (except, see Jay Sage's column. Ed). This saves 5 T-states/loop but adds 2 bytes (see why?). The instruction EX DE,HL (8 T-states per pair) is often useful, but not here; the BIOS eats both HL and DE. The ultimate speed demon is a fast-n-drastic pair of EXX instructions to replace the PUSH DE/POP DE. They save 13 T-states with no size increase, and even preserve BC so we don't have to reload it for every loop.

Comparisons

A CP 0 instruction was used to test for 0, an obvious choice. But it takes 2 bytes and 7 T-states to execute. The Z80's Zero flag makes the special case of testing for zero easy; all we have to do is update the flags to match the byte loaded. This is most easily done with an OR A instruction, which takes only 1 byte and 4 T-states. You'll find this trick often in Z80 code.

Note that OR A has no effect on A; we just used it to update the flags because it's smaller and faster than CP 0.
This illustrates a basic principle of assembly languages; the side effects of an instruction are often more important than the main effect. Here are some other not-so-obvious instructions:

    and  a       ; update flags and clear Carry
    xor  a       ; set A=0, update flags, P/V flag=1
    sub  a       ; same, but P/V flag=0
    sbc  a,a     ; set all bits in A to Carry (00 or FF)
    add  a,a     ; A*2, or shift A left & set lsb=0
    add  hl,hl   ; HL*2, or shift HL left & set lsb=0
    adc  hl,hl   ; shift HL left & lsb=Carry
    sbc  hl,hl   ; set all bits in HL to Carry (0000 or FFFF)
    ld   hl,0    ; \_load SP into HL so it can be examined
    add  hl,sp   ; /

Using DE as the string pointer is a weak choice. It forces us to load the character into A, then move it to E. If we use HL, IX, or IY instead, we can load E directly and save a byte. But this makes it harder to test for 0.

An INC E, DEC E updates the Z flag without changing E. Or mark the end of the string with 80h, and use BIT 7,E to test for end. Both are as efficient as the OR A trick but don't need A. If you are REALLY desperate, add 1 to every byte in the string, so a single DEC E restores the character and sets the Z flag; kinky, but short and fast.

Jumps

This example used 3-byte absolute jump instructions. We can save memory by using the Z80's 2-byte relative jumps instead; each use saves a byte. Since jumps are among the most common instructions, this adds up fast.

Relative jumps have a limited range, so it pays to arrange your code carefully to maximize their use. I've found that about half the jumps in a well structured program can be relative. When most of the jumps are out of range, it's often a sign of structural weaknesses, "spaghetti-code" or excessively complex subroutines.

How about execution speed? An absolute jump always takes 10 T-states; a relative jump takes 12 to jump, or 7 to continue. So if speed counts, use absolute jumps when the branch is normally taken, and relative jumps when it is not.
In the example, this means changing the JP Z,OUTEND to JR Z,OUTEND but keeping the JP at the end.

But wait a minute! The JR Z,OUTEND merely jumps to the RET at the end of the subroutine. It would be more efficient still to replace it with RET Z, a 1-byte conditional return that is only 5 T-states if the return is not taken. This illustrates another difference between assembler and high-level languages; entry and exit points are often not at the beginning and end of a routine.

We can speed up unconditional jumps within a loop. On entry, load HL with the start address of the loop, and replace JP LABEL by JP (HL). It takes 1 byte and 4 T-states, saving 6 T-states per loop. This scheme costs us a byte (+3 to set HL; -2 for JP (HL)). But if used more than once in the routine, we save 2 bytes per occurrence. If HL is unavailable (as is the case here; the BDOS trashes it), IX or IY can be used instead. However, the JP (IX) and JP (IY) instructions take 2 bytes and 8 T-states, making the savings marginal.

Can we do better yet? Yes, if we carefully rethink the structure of our program. Notice it has two jump instructions per loop; yet only one test is performed (test for 0). This is a hint that one conditional jump should be all we need. Think of the instructions in the loop as links in a chain. Rotate the chain to put the test-for-0 link at the bottom, and LD C,CONOUT on top (which we'll label OUTNXT). The JP OUTSTR is now unnecessary, and can be removed. JP NZ,OUTNXT performs the test and loops until 0 (remember, absolute for speed, relative for size). The entry point is still OUTSTR, though (horrors!) it's now in the middle of the routine.

We've also made a subtle change in the logic. Presumably we wouldn't call OUTSTR unless there was at least one character to output. But what would happen if we did?

Another way is to use DJNZ to close the loop. Make the first byte of the string its length (1-256).
Load this value into B as part of the initialization. The resulting program takes 34 T-states per loop (not counting the CALL).

STILL faster? OK, you twisted my arm. If you're absolutely sure the string won't cross a page boundary, you can use INC L instead of INC HL to save 2 T-states. The 8-bit INC/DEC instructions are faster than their 16-bit counterparts, but should only be used if you're positive the address will never require a carry. This brings us to 32 T-states/loop, which is the best I can do within this routine itself. Or can you do better?

            ...
            ld   hl,string   ; point to message
            call outstr      ; output it
            ...
    outstr: ld   b,(hl)      ; get length of message
            ld   c,conout    ; output to console
            inc  hl          ; step past the length byte
            ld   e,(hl)      ; get 1st char
    outnxt: exx              ; save registers,
            call bdos        ; output char,
            exx              ; and restore
            inc  hl          ; advance to next
            ld   e,(hl)      ; get next character
            djnz outnxt      ; loop until end
            ret
    string: db   strend - strbeg ; message length
    strbeg: db   'message'   ; message itself
    strend:

Parameter Passing

In the above example, parameters were passed to the subroutine via registers (string address in HL). This is fast and easy, but each call to OUTSTR takes 6 bytes. Now let's look at methods that save memory at the expense of speed.

Parameters can be passed to a subroutine as "data" bytes immediately following the CALL. Let's define the two bytes after CALL OUTSTR as the address of the string. The following code then picks up this pointer, saving us a byte per call. The penalty is in making OUTSTR 4 bytes longer and 38 T-states/loop slower; thus it doesn't pay until we use it 5 or more times.

            ...
            call outstr      ; output message
            dw   string      ; beginning here
            ...
    outstr: pop  hl          ; get pointer to "DW STRING"
            ld   e,(hl)      ; E=low byte of string addr
            inc  hl
            ld   d,(hl)      ; D=high byte of string addr
            inc  hl          ; skip over "DW STRING" & push
            push hl          ; corrected return address
    outnxt: ld   a,(de)      ; get next character
            or   a           ; if 0,
            ret  z           ; all done, return
            push de
            ld   e,a
            ld   c,conout    ; output character to console
            call bdos
            pop  de
            inc  de          ; advance to next
            jr   outnxt

We also had to rethink our choice of registers. If we tried to use HL or IX as the string pointer, OUTSTR would have been larger and slower (try it yourself). This demonstrates the consequences of inappropriate register choices.

The more parameters that must be passed, the more efficient this technique becomes. A further refinement is to put the string itself immediately after CALL. This saves an additional two bytes per call, and shortens OUTSTR by 6 bytes.

            ...
            call outstr      ; output message
            db   'message',0 ; which immediately follows
            ...
    outstr: pop  de          ; get pointer to message
            ld   a,(de)      ; get next character
            inc  de          ; advance to next
            push de          ; & save as return address
            or   a           ; if char=0, all done
            ret  z           ; pointer is return addr
            ld   e,a
            ld   c,conout    ; else output char to console
            call bdos
            jr   outstr      ; & repeat

Constants and Variables

Constants and variables are part of every program. Constants are usually embedded within the program itself, as "immediate" bytes. Variables on the other hand are usually separated, grouped into a common region perhaps at the end of the program. This makes sense for programs in ROM, where the variables obviously must be stored elsewhere. But it is not a requirement for programs in RAM.

If your program executes from RAM, performance can be improved by treating variables as in-line constants; storage for the variable is in the last byte (or two) of an immediate instruction.
For example, here is a routine that creates a new stack, toggles a variable FLAG between two states, and then restores the original stack:

    toggle: ld   (stack),sp  ; save old stack pointer
            ld   sp,mystack  ; setup my stack
            ld   a,(flag)    ; get Yes/No flag
            cp   'Y'         ; if "Y",
            ld   a,'N'       ; set it to "N"
            jr   z,setno     ; else "N",
            ld   a,'Y'       ; set it to "Y"
    setno:  ld   (flag),a    ; save new state
            ld   sp,(stack)  ; restore stack pointer
            ret
    stack:  dw   0           ; old stack pointer
    flag:   db   'Y'         ; value of flag

The LD A,(FLAG) instruction takes 13 T-states and 4 bytes of RAM (3 for the instruction, 1 to store FLAG). It can be replaced by LD A,'Y' where 'Y' is the initial value of the variable FLAG, the 2nd byte of the instruction. Speed and memory are improved 2:1, to 7 T-states and 2 bytes respectively.

It works for 16-bit variables as well. Replace LD SP,(STACK) with LD SP,0 where 0 is a placeholder for the 2-byte variable STACK. This saves 3 bytes and 10 T-states.

    toggle: ld   (stack+1),sp ; save old stack pointer
            ld   sp,mystack  ; setup my stack
    flag:   ld   a,'Y'       ; get Y/N flag (byte 2=var)
            cp   'Y'         ; if "Y",
            ld   a,'N'       ; set it to "N"
            jr   z,setno     ; else "N",
            ld   a,'Y'       ; set it to "Y"
    setno:  ld   (flag+1),a  ; save new state
    stack:  ld   sp,0        ; restore stack (byte 2,3=var)
            ret

There is another advantage to this technique -- versatility. Any immediate-mode instruction can have variable data; loads, math, compares, logical, even jumps and calls. Try changing our first example so a variable OUTDEV selects the output device; console or printer. Now see how simple it is if OUTDEV is the 2nd byte of the LD C,CONOUT instruction.

It even creates new instructions. For instance, the Z80's indexed instructions don't allow a variable offset. This makes it awkward to load the "n"th byte of a table, where we would like LD A,(IX+b) where "b" is a variable. But it can be done if the variable offset is stored in the last byte of the indexed instruction itself.

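A minimal sketch of that variable-offset load, assuming the code runs from RAM. The labels GETNTH and INDEX are hypothetical, and the offset is assumed to arrive in E. LD A,(IX+d) assembles to three bytes (DD 7E d), so the displacement to patch sits at INDEX+2:

```asm
getnth: ld   a,e           ; offset "b" (a signed displacement, -128..+127)
        ld   (index+2),a   ; store it in the 3rd byte of the next instruction
index:  ld   a,(ix+0)      ; 0 is only a placeholder for the variable offset
        ret                ; A = byte at IX+b
```

The patch costs one LD (nn),A (3 bytes, 13 T-states) per lookup; whether that beats adding the offset to a copy of the base address depends on how many registers are free at that point.
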
Storing variables in the address field of a jump or call instruction can do some weird and wonderful things. There is no faster way to perform a conditional branch based on a variable. But remember you are treading on the thin ice of self-modifying code; debugging and relocation become much more difficult, and you must ensure that the variable NEVER has an unexpected value. Also, in microprocessors with instruction caches (fast memory containing copies of the contents of regular memory), there can be problems if the cache data are not updated.

I put a LABEL at each instruction with an immediate variable, then use LABEL+1 for all references to it. This serves as a reminder that something odd is going on. Be sure to document what you're doing, or you'll drive some poor soul (probably yourself) batty.

Exclusive OR

The XOR operator is a powerful tool for manipulating data. Since anything XOR'd with itself is 0, use XOR A instead of LD A,0. To toggle a variable between two values, XOR it with the exclusive OR of the two values, that is, a mask of the bits in which they differ. Our last example can be performed much more efficiently by:

    toggle: ld   (stack+1),sp ; save old stack pointer
            ld   sp,mystack  ; setup my stack
    flag:   ld   a,'Y'       ; get Y/N flag (byte 2=var)
            xor  17h         ; toggle "Y" <-> "N" (17h = 59h xor 4Eh)
            ld   (flag+1),a  ; save new state
    stack:  ld   sp,0        ; restore stack (byte 2,3=var)
            ret

XOR eliminated the jump, for a 2:1 improvement in size and speed. This illustrates a generally useful rule. Almost any permutation can be performed faster, without jumps, by XOR and the other math and logical operators. Consider the following routine to convert the ASCII character in A to uppercase: it's both shorter and faster than the traditional method using jumps.

    convert: ld   b,a         ; save a copy of the char in B
             sub  'a'         ; if char is lowercase (a thru z),
             cp   'z'-'a'+1   ; then carry=1, else carry=0
             sbc  a,a         ; fill A with carry
             and  'a'-'A'     ; difference between upper/lower
             xor  b           ; convert to uppercase
             ret              ; result in A

Data Compaction

Programs frequently include large blocks of text, data tables, and other non-program data. Careful organization of such information can produce large savings in memory and speed of execution.

ASCII is a 7-bit code. The 8th bit of each byte is either unused or just marks the end of a string. You can bit-pack 8 characters into 7 bytes with a suitable routine. If upper case alone is sufficient, 6 bits are enough. For dedicated applications, don't overlook older but more memory-efficient codes like Baudot (5 bits), EBCD (4 bits), or even International Morse (2-10 bits, with frequent characters the shortest).

If your text is destined for a CRT or printer, it may be heavy on control characters and ESC sequences. I've found the following algorithm useful. Bytes whose msb=0 are normal ASCII characters: output as-is. Printable characters whose msb=1 are preceded by ESC, so "A"+80h=C1h sends "ESC A". Control codes whose msb=1 are a "repeat" prefix to output the next byte between 2 and 32 times. For example, linefeed+80h=8Ah repeats the next character 11 times. The value 80h, which otherwise would be "repeat once", is reserved as the marker for the end of the string.

Programs can be compacted, too. One technique is to write your program in an intermediate language (IL) better suited to the task at hand. It might be a high-level language, the instruction set of another CPU, or a unique creation specifically for the job at hand. The rest of your program is then an interpreter to execute this language. Tom Pittman's Tiny BASIC is an excellent example of this technique. His intermediate language implemented BASIC in just 384 bytes; the IL interpreter in turn took about 2K.

Another approach is threaded code, made popular by the FORTH language. A tight, well-structured program will probably use lots of CALLs. At the highest levels, the code may in fact be nothing but long sequences of CALLs:

    main:   call getname
            call openfile
            call readfile
            call expandtabs
            call writefile
            call closefile
            ret

Every 3rd byte is a CALL; large programs will have 1000s of them. So let's eliminate the CALL opcodes, making our program just a list of addresses:

    main:   ld   (stack),sp  ; save stack pointer
            ld   sp,first    ; point it to first address in the list
            ret              ; and go execute it
    first:  dw   getname
            dw   openfile
            dw   readfile
            dw   expandtabs
            dw   writefile
            dw   closefile
            dw   return      ; end of list
    return: ld   sp,(stack)  ; restore stack pointer
            ret              ; and return (to MAIN's caller)

The stack pointer is pointed to the address of the first subroutine in the list to execute. RET then loads this address into the program counter and advances the stack pointer to the next address. Since each subroutine also ends with a RET, it automatically jumps directly to the next routine in the list to be executed. This is called directly threaded code.

RETURN is always the last subroutine in a list. It restores the stack pointer and returns to the caller of MAIN.

Directly threaded code can cut program size up to 30%, while actually increasing execution speed. However, it has some rather drastic limitations. During execution of the machine-code subroutines in the address list, the Z80's one and only stack pointer is tied up as an address pointer. That means the stack can't be used; no interrupts, calls, pushes, or pops are allowed without first switching to a local stack.

The solution to this is called indirectly threaded code, made famous (or infamous) by the FORTH language.
Rather than have each subroutine directly chain into the next, they are linked by a tiny interpreter, called NEXT:

    main:   call next
            dw   getname
            dw   openfile
            dw   readfile
            dw   expandtabs
            dw   writefile
            dw   closefile
            dw   return      ; end of list

    next:   pop  ix          ; make IX our next-subroutine pointer
    next1:  ld   hl,next1    ; push address so RET comes back here
            push hl
            ld   l,(ix+0)    ; get address to "call": low byte
            inc  ix
            ld   h,(ix+0)    ; high byte
            inc  ix          ; point IX to addr for next time
            jp   (hl)        ; call address

    return: pop  hl          ; end of list; discard NEXT addr
            ret              ; and return to MAIN's caller

Now IX is our pointer into the address list; it points to the next subroutine to be executed. Subroutines can use the stack normally within, but must preserve IX and can't pass parameters in HL. When they exit via RET, it returns them to NEXT1.

Though the example executes the address list as straight-line code, subroutines can be written to perform jumps and calls via IX as well. NEXT can provide special handling for commonly-used routines as well; words with the high byte=0 could jump IX by a relative offset if A=0. If there are fewer than 256 subroutines, each address can be replaced by a single byte, which NEXT converts into an address via a lookup table.

Indirectly threaded code can reduce size up to 2:1 in return for a similar loss in execution speed. The decrease in program size is often remarkable. I learned this in 1975 designing a sound-level dosimeter. This cigarette-pack sized gadget rode around in a shirt pocket all day, logging the noise a person was exposed to. It then did various statistical computations to report the high, low, mean, and RMS noise levels versus time.

In those dark ages, a BIG memory chip was 256x4. Cost, power, and size forced us into an RCA 1802 CMOS microprocessor, with just 512 bytes of program memory (bytes, not K!). Try as we might, we couldn't do it. In desperation, we tried Charlie Moore's FORTH. Incredibly, it bettered even our heavily optimized code by 30%!

Of course, very little of FORTH itself wound up in the final product; it just showed us the way. Once you know HOW it's done, you can apply the same techniques to any assembly-language program without becoming a born-again FORTH zealot.

Shortcuts

Here are some "quickies" that didn't fit in elsewhere. Keep in mind what is actually in all the registers as you program. Do you really need to clear carry, or is it already cleared as the result of a previous operation? Before you load a register, are you sure it's necessary? Perhaps it's already there, or sitting in another register.

Many routines return "leftovers" that can be very useful, such as HL, DE, and BC=0 after a block move. Perhaps an INC or DEC will produce the value you want. Variables can be grouped so you needn't reload the entire address for each. If the high byte of a register is correct, just load the lower half.

Keep an index register pointed to your frequently-used variables. This makes them easier to access (up to 256 bytes) and opens the door to memory- (rather than register-) oriented manipulations. The indexed instructions are slower and less memory-efficient, but the versatility sometimes makes up for it (store an immediate byte to memory, load/save to memory from registers other than A, etc.).

The Z80's bit test/set/reset instructions add considerable versatility if you define your flags as bits rather than bytes. Bit flags can be accessed in any register, or even directly in memory, without loading them into a register.

If the last two instructions of a subroutine are CALL FOO and RET, you could just as well end with JP FOO and let FOO do the return for you. If the entry point of FOO is at the top, even the JP is unnecessary; you can locate FOO immediately after and "fall in" to it.

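A small sketch of the CALL/RET replacement. REPORT1, REPORT2, and MSG are hypothetical names, and OUTSTR stands for any subroutine that takes its argument in HL and ends in RET:

```asm
; Before: the CALL/RET pair takes 4 bytes and 27 T-states
report1: ld   hl,msg
         call outstr     ; 17 T-states, pushes a return address
         ret             ; 10 T-states

; After: the JP takes 3 bytes and 10 T-states
report2: ld   hl,msg
         jp   outstr     ; OUTSTR's own RET returns to REPORT2's caller
```

The jump version also needs two fewer bytes of stack, since no return address into REPORT2 is ever pushed. If OUTSTR is within range, a JR saves one more byte at a cost of 2 T-states.
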
If you have a large number of jumps to a particular label (like the start of your MAIN program), it may be more efficient to push the address of MAIN onto the stack at the top of the routine. Each JP MAIN can then be replaced by a 1-byte RET.

SKIP instructions are a short, fast way to jump a fixed distance. The Z80 has no skips, but you can simulate a 1- or 2-byte skip with a 2- or 3-byte do-nothing instruction: JR or JP on a condition that is never true, for instance. If the flags aren't in a known state, load to an unused register. For example:

    clear1:   ld   a,1       ; clear 1 byte
              db   21h       ; skip next two bytes (21h = ld hl,nn)
    clear80:  ld   a,80      ; clear 80 bytes (and "nn" for ld hl,nn)
              db   26h       ; skip next byte (26h = ld h,n)
    clear256: xor  a         ; clear 256 bytes (and "n" for ld h,n)
    clear:    ld   b,a       ; clear #bytes in A to zero
              ld   hl,buffer ; beginning at buffer
    loop:     ld   (hl),0
              inc  hl
              djnz loop
              ret

The stack pointer is the Z80's only auto-increment/decrement register. This makes it uniquely suitable for fast block operations. For instance, the fastest way to clear a block of RAM is to make it the stack and push the desired data. At 11 T-states per 2 bytes, it is 3 times faster than two LDD instructions. Remember to disable interrupts or to allow for them; if an interrupt routine pushes onto the stack while you are using it for this special purpose, the results may not be what you intended.

That is all for this time. Next time we will continue the discussion with a look at the interplay between software and hardware.