Guide: About the 80386 architecture

Introduction

This section presents an overview of the 80386 processor as it applies to assembly language programming. Most books you will find on this subject are geared toward PC-based assemblers that use Intel syntax for the 80x86 instruction set. The GNU assembler (GAS), which DJGPP uses, uses AT&T syntax instead. This user guide presents the 80x86 instruction set in AT&T syntax, which can be used to create DJGPP GAS programs that run on 80x86 PCs.

The information in this text should apply to any 32-bit x86-based processor. This includes the 386, 486, 586 and 686 processors from Intel, AMD and Cyrix. The Pentium has a 386 compatibility mode, but how much of this would work with it, I do not know.

This overview should give you enough information about the 80386 to start programming using GAS. However, advanced 80386 topics such as protected mode, virtual memory, and multitasking will not be presented. A reference section at the end of this page lists sources of more in-depth information.

If you have any comments, suggestions, or questions, feel free to contact the author of this section via email: schwarz@escmail.orl.mmc.com

GAS 80386 Syntax

There are some minor differences between normal 80386 assembly language (e.g. MASM, TASM, NASM) and the GNU Assembler (GAS). These are listed in the 80386 Machine Dependent Information section of the GAS on-line info docs. I repeat them here for completeness. In the course of the text below, I use the GAS syntax in all assembly language examples. You can also use this information to translate examples when referencing 80386 assembly manuals.

In instructions containing two operands, one is called the source operand, and the other is called the destination operand. In GAS syntax the order of the operands on the instruction line is source, destination. This is the opposite of normal 80386 assembly.
Any variable or constant in an instruction that is not prefixed with a "$" or "%" refers to a memory location. In normal 80386 assembly the assembler usually infers this from context; GAS does not.
Register operands must have the prefix "%", e.g. %ax refers to a register, while ax refers to memory.
Immediate operands are prefixed with a "$" sign, e.g. $0x9 refers to the number 9, while 0x9 refers to a memory offset of 0x9. This also applies to symbols: for example, if the symbol doscall is set to 0x21, then $doscall refers to the number 0x21, while doscall refers to memory offset 0x21.
The size of memory operands must be specified by appending a single character to the end of the opcode. The character must be one of "b" (8-bit), "w" (16-bit), or "l" (32-bit). The suffix may also be written when the operands are registers. For example, the command to move a 16-bit word from register bx to register cx would be "movw %bx, %cx". To move a 32-bit doubleword from register eax to ebx would be "movl %eax, %ebx".
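
The rules above can be sketched in a few lines of GAS source. This is only an illustration: the label count is a name I made up, and the entry symbol _main assumes the DJGPP convention of prefixing C symbol names with an underscore (other toolchains may want main instead).

```asm
        .data
count:  .long   7               # a 32-bit variable in memory

        .text
        .globl  _main           # assumed entry symbol (DJGPP/COFF naming)
_main:
        movl    $5, %eax        # immediate 5 -> register EAX (source first)
        movl    count, %ecx     # contents of memory at 'count' -> ECX
        movl    $count, %edx    # address of 'count' (an immediate) -> EDX
        addl    %ecx, %eax      # EAX = EAX + ECX (destination is last)
        movw    %ax, %cx        # 16-bit register-to-register move
        movb    $0x9, %al       # 8-bit immediate -> AL
        movl    $0, %eax        # return 0 from main
        ret
```

Under DJGPP, saving this as a .s file and passing it to gcc should assemble and link it like any other source file.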

Memory Model

The 80386 has two available memory models to choose from: a "flat" model and a segmented model. The "flat" memory model is basically a single segment, which can address up to 4 GB (2^32 bytes) of memory. The segmented model can consist of up to 16,383 segments. The "flat" memory model is similar to the memory model of the Motorola 680x0 processors, which the GNU tools were originally designed to support. The rest of this section describes a little more about the "flat" model. The segmented model will not be discussed.

The "flat" model presents the programmer with a single array of up to 2^32 bytes (4 GB). A program running under the "flat" model runs in the 80386's protected mode under DJGPP. This basically means that while you can access any memory address within your program's linear address space, you cannot directly access memory locations outside that address space.

Data Types

Bytes, words, and double words are the fundamental data types within the 80386. A byte is eight contiguous bits starting at any logical address. Bits are numbered 0 through 7, with bit zero being the least significant bit.

A word is two contiguous bytes, or 16 bits, starting at any byte address. The bits are numbered 0 through 15, with bit 0 being the least significant. Each byte within a word has its own address, with the lower byte's address representing the word's address.

A doubleword is two contiguous words, or 32 bits, starting at any byte address. Bits are numbered 0 through 31, with bit 0 being the least significant.

The two bytes in a word can be referred to as the low byte and the high byte. Similarly, the two words in a doubleword can be referred to as the low word and the high word, each having a low byte and a high byte. The low byte, or low word, is at a lower memory location than the high byte, or high word.

The 80386 is a little endian processor. That is, the low byte of a word is stored at a lower memory location than the high byte. Bits within a byte do not have addresses of their own; they are simply numbered from the least significant bit (bit 0) upward.

Note that there is no requirement for aligning words on even-numbered addresses, nor do doublewords need to be aligned on addresses evenly divisible by four. However, you should attempt to do this when designing data structures. If you don't, the 80386 may need extra memory cycles to read or write the misaligned data.
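
As a small sketch of the byte ordering and alignment described above, the fragment below stores one doubleword and reads its pieces back. The label value is hypothetical, _main assumes DJGPP's underscore-prefixed symbol names, and I use the .balign directive on the assumption that your version of GAS provides it.

```asm
        .data
        .balign 4               # place the doubleword on a 4-byte boundary
value:  .long   0x12345678      # stored in memory as 78 56 34 12

        .text
        .globl  _main
_main:
        xorl    %eax, %eax      # clear EAX
        movb    value, %al      # byte at the lowest address:  AL = 0x78
        movb    value+3, %ah    # byte at the highest address: AH = 0x12
        movw    value+2, %cx    # high word: CX = 0x1234
        ret                     # returns with EAX = 0x1278
```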

The 80386 also supports additional data types, depending on the instruction used. The following types are understood:

Integer: A signed binary numeric value in a doubleword, word or byte. An integer is stored in 2's complement form. The sign bit is the most significant bit. It is 0 for positive and 1 for negative integers. The range of an 8-bit integer is -128 to +127, a 16-bit integer is -32,768 to +32,767 and a 32-bit integer is -2^31 to +2^31 - 1.
Ordinal: Also known as an unsigned integer. The range of an 8-bit ordinal is 0 to 255, a 16-bit ordinal is 0 to 65,535 and a 32-bit ordinal is 0 to 2^32 - 1.
Near Pointer: A 32-bit logical address consisting of a 32-bit offset within a segment. It is the primary pointer used in the flat memory model.
Far Pointer: A 48-bit logical address consisting of two components, a 16-bit segment selector and a 32-bit offset. It is used in the segmented memory model.
String: A contiguous sequence of bytes. It may contain from zero to 2^32 - 1 bytes (4 GB).
Bit field: A contiguous sequence of bits. It may begin at any bit position of any byte and may contain up to 32 bits.
Bit string: A contiguous sequence of bits. A bit string may begin at any bit position of any byte and may contain up to 2^32 - 1 bits.

Packed and unpacked BCD are also supported.

Hexadecimal numbers are represented by prefixing the number with 0x. For example, decimal 15 would be hex 0xF. Note that this is different from standard 80386 assembly language, which normally marks hexadecimal numbers with an h suffix (0Fh).
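
To make the difference between an integer and an ordinal concrete, here is a rough sketch that declares a few data items and loads the same bit pattern both ways. All of the labels are made up for the example, and _main assumes DJGPP's underscore-prefixed symbol names.

```asm
        .data
sbyte:  .byte   -1              # integer -1, stored as 0xFF
ubyte:  .byte   0xFF            # ordinal 255, the same bit pattern
wmax:   .word   0x7FFF          # largest 16-bit integer, +32,767
dmax:   .long   0xFFFFFFFF      # largest 32-bit ordinal, 2^32 - 1
msg:    .ascii  "DJGPP"         # a 5-byte string

        .text
        .globl  _main
_main:
        movsbl  sbyte, %eax     # sign-extend: EAX = 0xFFFFFFFF (-1)
        movzbl  ubyte, %ecx     # zero-extend: ECX = 0x000000FF (255)
        addl    %ecx, %eax      # -1 + 255 = 254
        ret                     # returns with EAX = 254
```

The byte in memory is identical in both cases; only the instruction chosen to load it decides whether it is treated as signed or unsigned.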

Processor Registers

There are sixteen registers that are of use to general purpose programmers. (There are several other registers for system level programming that are not discussed in this guide.)

General Registers

There are 8 general purpose, 32-bit registers in the 80386. They are EAX, EBX, ECX, EDX, EBP, ESP, ESI, and EDI. Each register can hold a doubleword containing any of the data types listed above.

A backwards compatible feature is built into the 80386. The lower word of each register can be addressed as a separate unit. This is useful for handling 16-bit data items and for compatibility with older 8086 and 80286 programs. The registers are named AX, BX, CX, DX, BP, SP, SI and DI. Note, the upper word of each register cannot be addressed separately.

Furthermore, each byte of the four 16-bit registers AX, BX, CX, and DX can be separately addressed. The high bytes are named AH, BH, CH and DH. The lower bytes are named AL, BL, CL and DL.
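
The following sketch shows how the 16-bit and 8-bit names overlay a 32-bit register. The constants are arbitrary, and _main assumes DJGPP's underscore-prefixed symbol names.

```asm
        .text
        .globl  _main
_main:
        movl    $0x11223344, %eax   # EAX = 0x11223344
        movw    $0xAAAA, %ax        # low word only:  EAX = 0x1122AAAA
        movb    $0xBB, %ah          # bits 8-15 only: EAX = 0x1122BBAA
        movb    $0xCC, %al          # bits 0-7 only:  EAX = 0x1122BBCC
        shrl    $16, %eax           # the upper word has no name; shift it down
        ret                         # returns with EAX = 0x1122
```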

All of the general purpose registers are available for addressing calculations and for the results of most calculations. However, a number of instructions expect their data to be in specific registers. This allows for more compact and efficient instructions.

Segment Registers

There are six segment registers: CS, DS, SS, ES, FS, and GS. They identify the six segments currently in use by a program. The CS register identifies the currently running code segment. The DS register identifies the currently accessible data segment. The SS register identifies the current stack segment. The ES, FS, and GS registers identify additional data segments as required by the program.

The flat memory model used by GAS has only one segment, and therefore programmers don't normally need to worry about the segment registers. They are usually loaded up with selectors for descriptors that cover the entire 32-bit linear address space. Once loaded, there is no need to change them, and 32-bit pointers can address the entire program. You don't need to worry about this initialization since defaults are selected when your program is loaded. (Selectors and descriptors are part of protected mode programming.)

While GAS does not use the segmentation model of the 80386, it has its own segmentation model using sections. As a minimum there are 3 sections: the text section containing code, the data section containing initialized data, and the bss section, which contains uninitialized data. You can also create your own sections. All sections are contained within the same segment. Sections with the same name are combined together during linking.
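
A minimal sketch of the three standard sections follows. The names counter and buffer are invented for the example, _main assumes DJGPP's underscore-prefixed symbol names, and I use .lcomm to reserve bss space since that directive does not depend on the object format.

```asm
        .data                   # data section: initialized data
counter:
        .long   10

        .lcomm  buffer, 64      # reserve 64 uninitialized bytes in the bss section

        .text                   # text section: code
        .globl  _main
_main:
        movl    counter, %eax   # read from the data section
        movl    %eax, buffer    # write to the bss section
        ret                     # returns with EAX = 10
```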

Stack Implementation

The 80386 allows for multiple stacks, with each stack being a separate segment. The stack pointer (ESP) register points to the top of the stack. The stack is a push-down stack, and is referenced implicitly by PUSH and POP instructions, subroutine calls and returns and interrupt operations. When an item is pushed on the stack, ESP is first decremented, then the data is written to the new ESP location. The opposite occurs when data is popped off the stack. The data is first copied out of the stack, then the ESP is incremented. The stack grows down in memory toward lesser addresses.
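
The push and pop behaviour described above can be watched directly by comparing ESP before and after. This is only a sketch, and _main assumes DJGPP's underscore-prefixed symbol names.

```asm
        .text
        .globl  _main
_main:
        movl    %esp, %edx      # remember ESP on entry
        pushl   $1              # ESP = ESP - 4, then 1 is stored at (%esp)
        pushl   $2              # ESP = ESP - 4, then 2 is stored at (%esp)
        movl    %edx, %eax
        subl    %esp, %eax      # EAX = 8: the stack grew down by 8 bytes
        popl    %ecx            # ECX = 2 (last in, first out), ESP = ESP + 4
        popl    %ecx            # ECX = 1, ESP is back to its entry value
        ret                     # returns with EAX = 8
```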

The stack frame base pointer (EBP) register is the best register to use for accessing data within the stack. It typically identifies the base address of the current stack frame in use by the current procedure. When this register is used in an offset calculation, the data is automatically fetched from the stack segment. This means the stack segment does not have to be included in the instruction, making for a more compact instruction.
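
Here is a rough sketch of EBP used as a frame pointer. The routine add2 is hypothetical, the arguments are passed on the stack in the way C compilers normally do it (an assumption, not something the 80386 requires), and _main assumes DJGPP's underscore-prefixed symbol names.

```asm
        .text
        .globl  _main

# add2(a, b): returns a + b in EAX; the caller pushes both arguments
add2:
        pushl   %ebp            # save the caller's frame pointer
        movl    %esp, %ebp      # EBP = base of this stack frame
        movl    8(%ebp), %eax   # first argument (above saved EBP and return address)
        addl    12(%ebp), %eax  # plus the second argument
        popl    %ebp            # restore the caller's frame pointer
        ret

_main:
        pushl   $30             # second argument
        pushl   $12             # first argument
        call    add2
        addl    $8, %esp        # caller removes the two arguments
        ret                     # returns with EAX = 42
```

Because EBP stays fixed while the routine runs, the offsets 8 and 12 remain valid even if the routine pushes more data and ESP moves.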

Flags Register

The flags register is a 32-bit register named EFLAGS. The low-order 16 bits of EFLAGS are named FLAGS for compatibility with older 8086 and 80286 code. There are three basic groups of flags: status flags, control flags and system flags. The flags are as follows:

  16  15                0
----  ---- ---- ---- ---X  CF    Carry Flag
----  ---- ---- ---- -X--  PF    Parity Flag
----  ---- ---- ---X ----  AF    Auxiliary Carry Flag
----  ---- ---- -X-- ----  ZF    Zero Flag
----  ---- ---- X--- ----  SF    Sign Flag
----  ---- ---X ---- ----  TF    Trap Flag
----  ---- --X- ---- ----  IF    Interrupt Flag
----  ---- -X-- ---- ----  DF    Direction Flag
----  ---- X--- ---- ----  OF    Overflow Flag
----  --XX ---- ---- ----  IOPL  I/O Privilege Level
----  -X-- ---- ---- ----  NT    Nested Task Flag
---X  ---- ---- ---- ----  RF    Resume Flag
--X-  ---- ---- ---- ----  VM    Virtual 8086 Mode

The remaining bits are reserved for future Intel use. A flag is considered cleared when it is zero, set when it is 1.

The status flags used by application programmers are CF, PF, AF, ZF, SF, and OF. These flags hold the results of various instructions that are then used by later instructions. What follows is a brief description of each flag.

Carry Flag: Set by arithmetic instructions when the operation produces a carry out of, or a borrow into, the high-order bit. It is cleared otherwise.
Parity Flag: Set if the low-order 8 bits of a result contain an even number of 1 bits; cleared if they contain an odd number.
Auxiliary Carry (Adjust) Flag: Set when an operation produces a carry out of, or a borrow into, the low-order 4 bits of AL; cleared otherwise. It is used by the decimal (BCD) arithmetic instructions.
Zero Flag: Set when an instruction produces a zero result; cleared otherwise. The repeated string instructions and some loop instructions also test it to decide when to stop.
Sign Flag: Set equal to the high-order bit of the result. For a signed integer, set means the result is negative and cleared means it is positive.
Overflow Flag: Set when a signed result is too large a positive number, or too small a negative number, to fit in the destination operand; cleared otherwise.
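
As a sketch of how the status flags are set by one instruction and tested by the next, the fragment below performs two 8-bit additions and checks CF, ZF, OF and SF with conditional jumps. The label done is my own, and _main assumes DJGPP's underscore-prefixed symbol names.

```asm
        .text
        .globl  _main
_main:
        xorl    %eax, %eax      # EAX = 0
        movb    $0xFF, %cl
        addb    $1, %cl         # CL = 0x00: CF = 1 (carry out), ZF = 1 (zero)
        jnc     done            # not taken, because CF is set
        jnz     done            # not taken, because ZF is set
        movb    $0x7F, %cl      # MOV does not change the flags
        addb    $1, %cl         # CL = 0x80: OF = 1 (signed overflow), SF = 1
        jno     done            # not taken, because OF is set
        jns     done            # not taken, because SF is set
        movl    $1, %eax        # reached only if all four flags were as expected
done:
        ret                     # returns with EAX = 1
```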

The only control flag at this time is the Direction Flag. String instructions use it to decide whether to process a string from its end toward lower addresses (flag set, auto-decrement) or from its beginning toward higher addresses (flag cleared, auto-increment).
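
The effect of the Direction Flag can be sketched with a short block copy. The labels src and dst are invented, _main assumes DJGPP's underscore-prefixed symbol names, and the code relies on ES and DS referring to the same flat segment, which I believe is the normal state of a DJGPP program.

```asm
        .data
src:    .ascii  "hello"
dst:    .space  5               # 5 bytes to receive the copy

        .text
        .globl  _main
_main:
        pushl   %esi            # ESI and EDI are preserved for the caller
        pushl   %edi
        movl    $src, %esi      # source address (DS:ESI)
        movl    $dst, %edi      # destination address (ES:EDI)
        movl    $5, %ecx        # number of bytes to copy
        cld                     # DF = 0: ESI and EDI auto-increment
        rep
        movsb                   # copy ECX bytes from src to dst
        movzbl  dst+4, %eax     # last byte copied: 'o' = 0x6F
        popl    %edi
        popl    %esi
        ret                     # returns with EAX = 0x6F
```

Using std instead of cld would make ESI and EDI count downward, so both pointers would have to start at the last byte of each buffer.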

The other flags are system flags. Refer to the Intel 80386 Programmer's Reference Manual for further information on these flags.

Instruction Pointer

The instruction pointer register (EIP) contains the offset into the current code segment. It is a 32-bit value. The lower 16-bit field is addressable as a single unit for backwards compatibility, and is named IP. The programmer should not fiddle with this register. It is controlled by instructions that transfer control of the program such as jump instructions, interrupts, and exceptions.

Assembly Instruction

The basic format of an instruction in 80386 assembly programming is:

Label: Opcode Operands # Comments

All of these are optional. For example, a line can contain only a label or a comment, or it can have an opcode that requires no operands on a line by itself. Note that GAS for the 80386 starts a comment with "#"; the ";" used for comments by most other 80386 assemblers is treated by GAS as a statement separator. Here are a couple of examples:

    start:                   # This is a label
          ret                # An instruction with no operands
                             # Comment line
    here: jmp start          # Instruction with a label, opcode, and a
                             # single operand
          movw %ax, %bx      # Instruction with opcode and two
                             # operands.

It should be noted that you cannot have an operand by itself in an instruction. There must be an opcode.

In the case where there are two operands, one is considered the source operand, and one is considered the destination operand. In GAS the source operand is first, the destination operand is second. This happens to be opposite of normal 80386 assembly language. For example, the command movw %ax, %bx will move the data from register ax and place it in register bx. The data in the destination operand is usually overwritten with the results of the instruction.

An instruction can include any of the following elements:

Prefixes - one or more bytes preceding the opcode that modify the operation of an instruction. For example, the scan string instruction (scasb) can be repeated up to 12 times, stopping early if a match is found, with the following:
```
movl $12, %ecx
repne
scasb
```
The following prefixes are available within GAS:
- Segment Override - explicitly specifies which segment register an instruction should use, thereby overriding the default segment-register selection used for the instruction. The override prefixes are 'cs', 'ds', 'ss', 'es', 'fs', and 'gs'. These are automatically added by specifying memory references using the SECTION:MEMORY-OPERAND form.
- Operand/Address size prefixes 'data16' and 'addr16' change 32-bit operands/addresses into 16-bit operands/addresses. Note that 16-bit addressing modes (i.e. 8086 and 80286 addressing modes) are not supported (yet).
- Bus Lock - The 'lock' prefix asserts the processor's bus lock signal while the instruction it precedes executes, so that its memory access cannot be interleaved with accesses from other bus masters. It is only valid for certain instructions.
- Wait - The 'wait' prefix waits for the coprocessor to complete the current instruction. You really shouldn't ever need this prefix with the 80386/80387 combination.
- Repeat - The repeat prefixes 'rep', 'repe', and 'repne' are added to string instructions to make them repeat '%ecx' times.
Opcode - Specifies the operation to be performed by the instruction. Some operations have several different opcodes, each specifying a different variant of the operation. Opcodes that operate on memory operands must be modified to indicate the size of the memory operands.
Register Operands - an instruction may specify one or two register operands.
Memory Operands - an instruction may specify one or two memory location operands. The 80386 provides several ways to address memory locations. The address can be held in a register, or it can be computed from a combination of several components, which can include a base register, an index register, a displacement and a scaling factor. This will be explained in more depth in the section titled "Operand Selection".
Immediate Operand - Directly provides the value in the instruction. Immediate operands may be 8, 16, or 32 bit values. In cases where an 8-bit immediate operand is combined in some way with a 16, or 32 bit operand, the processor extends the size of the 8-bit operand, taking into account the sign bit if required.

Operand Selection

As mentioned above, an instruction can include zero or more operands. The operands represent the data being operated on by the instruction. An operand can be in any of these locations:

In the instruction itself. (an immediate operand)
In a register.
In memory.
At an I/O port.

Immediate operands and operands in registers are accessed faster than operands in memory, since memory operands have to be fetched from memory. Register operands are available within the CPU. Immediate operands are also available within the CPU since they are pre-fetched as part of the instruction.

Of instructions that have operands, some specify the operands implicitly, requiring the operand to be in a predefined location, usually a register or the stack. Others specify the operands explicitly, requiring the operands to be encoded in the instruction, after the opcode.

Instructions that contain two explicitly specified operands generally overwrite one of the two operands with the result. The operand that is overwritten is called the destination operand. The other is called the source operand. For most instructions, either the source or the destination operand can be in a register or in memory. The other operand must be in a register or be an immediate source operand. Therefore, two-operand instructions permit operations of the following kinds:

Register to register
Register to memory
Memory to register
Immediate to register
Immediate to memory

Some string instructions, and stack manipulation instructions transfer data from memory to memory. Both operands of some string instructions are in memory and implicitly specified. Stack operations allow the transfer of data between memory operands and the stack, which is located in memory.
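
The operand combinations listed above might look like this in practice. The variables var_a and var_b are hypothetical, and _main assumes DJGPP's underscore-prefixed symbol names.

```asm
        .data
var_a:  .long   100
var_b:  .long   0

        .text
        .globl  _main
_main:
        movl    $5, %eax        # immediate to register
        movl    %eax, %ecx      # register to register
        addl    var_a, %ecx     # memory to register:  ECX = 105
        movl    %ecx, var_b     # register to memory:  var_b = 105
        addl    $1, var_b       # immediate to memory: var_b = 106
        # "movl var_a, var_b" is not allowed: two explicit memory operands.
        movl    var_a, %edx     # so copy memory to memory through a register
        movl    %edx, var_b     # var_b = 100
        pushl   var_a           # stack operation: memory operand to stack
        popl    var_b           # stack to memory operand: var_b = 100 again
        movl    var_b, %eax
        ret                     # returns with EAX = 100
```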

Immediate Operands

As mentioned earlier, immediate operands are data embedded in the instruction itself. Immediate operands can be 8, 16, or 32 bits long. Some examples of these operands follow:

movw $0x24, %ax - In this instruction the hex number 0x24 is the immediate operand.
movw $footer, %dx - In this instruction, the value of the symbol footer is the immediate operand.

Notice the use of the "$" sign to indicate the immediate operand. This is a GAS requirement that is different than general 80386 assembly language.

Register Operands

Operands may be located in any one of the general registers. In the two examples above, ax, and dx are the Register operands. Notice the use of the "%" sign to indicate the register operand. This again is a GAS requirement that is different than general 80386 assembly language.

Memory Operands

Data-manipulation instructions that address operands in memory must specify the segment that contains the operand and the offset of the operand within the segment. Since GAS uses the flat memory model, you will not ordinarily worry about the segment, since your whole program is in a single, O/S determined, segment. There are two general methods for specifying the offset of a memory operand:

The most flexible method computes an effective address, which can be composed of any of the following values: a base register, an index register, a scaling factor, and a displacement.
A few data-manipulation instructions use specialized addressing methods:
- A few short forms of the MOV command implicitly use the EAX register; the offset is coded in the instruction. No base register, index register, or scaling factor is used.
- String operations implicitly address memory via DS:ESI and ES:EDI.
- Stack operations implicitly address operands via SS:ESP registers.

Effective Address Computation

The effective address is calculated by taking the sum of up to three components:

A displacement element in the instruction.
A base register.
An index register with an optional scale factor of 1, 2, 4, or 8.

The general form for the operand is:

SECTION:DISP(BASE, INDEX, SCALE)

Note that this is GAS's format, and is different from general 80386 assembly languages.

The offset that results is useful for accessing various kinds of arrays and data structures. The displacement component is useful for fixed aspects of addressing:

Location of simple scalar operands.
Beginning of a statically allocated array.
Offset of an item within a record.

The base and index components have similar functions. Both can be used for dynamic addressing, for example:

Location of procedure parameters and local variables from stack.
The beginning of one dimension in a multiple dimensioned array.
The beginning of a dynamically allocated array.

The uses of the general registers have the following restrictions:

ESP cannot be used as an index register.
When ESP or EBP is used as the base register, the default segment is SS. In all other cases DS is the default.

The scaling factor permits efficient indexing into an array whose elements are 2, 4, or 8 bytes wide.

The base, index, and displacement components may be used in any combination. Any of the components may be null. A scale factor can only be used with an index. The following are several examples of effective addresses:

mtable(,%ebx) - In this case the disp is mtable, which can be an array of data elements. The register EBX is the index into the array. Base and Scale are not used.
My2dArray(%ebx, %eax) - In this case My2dArray is the location of a 2-dimensional array. EBX is the base and EAX is the index; between them they can select the row and the column.
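
In every form the processor computes the effective address as DISP + BASE + (INDEX * SCALE), with any missing component counted as zero (or a scale of 1). The sketch below walks through several combinations on a small array; the label table is invented for the example and _main assumes DJGPP's underscore-prefixed symbol names.

```asm
        .data
table:  .long   10, 20, 30, 40  # array of four 32-bit elements

        .text
        .globl  _main
_main:
        movl    $2, %ecx                # index 2
        movl    table(,%ecx,4), %eax    # disp + index*4:        EAX = 30
        movl    $table, %edx            # base register = address of the array
        addl    (%edx), %eax            # base only:             EAX = 30 + 10
        addl    4(%edx), %eax           # disp + base:           EAX = 40 + 20
        addl    4(%edx,%ecx,4), %eax    # disp + base + index*4: EAX = 60 + 40
        ret                             # returns with EAX = 100
```

A scale of 4 matches the 4-byte element size, so the index register can hold a plain element number rather than a byte offset.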

Interrupts and Exceptions

The 80386 has two mechanisms for interrupting program execution:

Exceptions are synchronous events that occur when the CPU detects an error or special condition while executing an instruction.
Interrupts are asynchronous events typically triggered by external devices needing attention.

In most cases the various exceptions are associated with particular 80386 instructions. In addition to servicing hardware interrupts, there is an INT instruction that programmers can use to access various hardware and operating system services. The following shows the 80386 reserved exceptions and interrupts of use to application programmers:

    0  Divide Error
    1  Debug Exceptions
    2  NMI Interrupt
    3  Breakpoint
    4  INTO Detected Overflow
    5  BOUND Range Exceeded
    6  Invalid Opcode
    7  Coprocessor Not Available
    8  Double Fault
    9  Coprocessor Segment Overrun
   10  Invalid Task State Segment
   11  Segment Not Present
   12  Stack Fault
   13  General Protection Fault
   14  Page Fault
   15  Reserved
   16  Coprocessor Error
   17  Alignment Check (80486 and later; reserved on the 80386)
18-31  Reserved

Instruction Set

And finally, the 80386 instructions. Sorry, but there are way too many instructions to include them all here. If you're going to do a lot of 80386 assembly language programming, you should get a good 80386 assembly book.

What I will show in this section are several commands that are written differently in GAS than they are in general 80386 assembly. They perform basically the same operation, but the calling form is different.

CALL/JMP FAR SECTION:OFFSET - In GAS use LCALL/LJMP $SECTION, $OFFSET.
CBW - Convert Byte to Word - Use CBTW.
CDQ - Convert Doubleword to Quadword - Use CLTD.
CWD - Convert Word to Doubleword - Use CWTD.
CWDE - Convert Word to Extended Doubleword - Use CWTL.
MOVSX - Move with Sign Extension - Use MOVSxy, where x is the source size suffix and y is the destination size suffix: MOVSBW, MOVSBL, or MOVSWL.
MOVZX - Move with Zero Extension - Use MOVZxy in the same way: MOVZBW, MOVZBL, or MOVZWL.
RET FAR STACK-ADJUST - Return from Subroutine, case of FAR call/jmp - Use LRET $STACK-ADJUST.
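
A few of these GAS spellings are sketched below, with the usual Intel name in each comment. The constant is arbitrary and _main assumes DJGPP's underscore-prefixed symbol names.

```asm
        .text
        .globl  _main
_main:
        movb    $-5, %al        # AL = 0xFB
        cbtw                    # Intel CBW:  sign-extend AL  into AX
        cwtl                    # Intel CWDE: sign-extend AX  into EAX (-5)
        cltd                    # Intel CDQ:  sign-extend EAX into EDX:EAX
        movsbl  %al, %ecx       # Intel MOVSX ecx, al: ECX = -5
        movzbl  %al, %edx       # Intel MOVZX edx, al: EDX = 0xFB (251)
        addl    %ecx, %edx      # 251 + (-5) = 246
        movl    %edx, %eax
        ret                     # returns with EAX = 246
```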

References

I first learned assembly programming on the 80286 and have used several books over the years. These are the ones I used for this section:

Intel 80386 Programmer's Reference Manual 1986 - I used this mostly to be sure all the information in this paper was accurate. I'm not sure if the book is still available, but there is a text copy at http://www.x86.org. Note this is not an Intel site.
Using Assembly Language 3rd Edition. Allen L. Wyatt, Sr. QUE Publishing - I have found it to be a very informative book on X86 Services, Accessing BIOS Services and the 80386 Instruction Set.
There is a WWW page on DJ's site named "DJGPP QuickAsm Guide". I used this when I first learned DJGPP/GAS syntax.

On a final note, if you don't want to deal with the GAS syntax issues, there is a freeware 80386 assembler named NASM available on the net. I don't know where, but I think it's listed on DJ's site. A quick search of the web should find the home page. NASM is supposed to be a MASM/TASM compatible assembler.

This section was provided by Jim Schwarz (a.k.a J.E. Schwarz, Jr. PE)

Copyright © 1998	Updated Jan 1998