### **Embedded Computing Platform Microprocessor: Architecture and Instruction Set**

Ingo Sander ingo@kth.se

### **Microprocessor - A central part of** the embedded platform

• A platform is the basic hardware and software architecture needed for an embedded application



### Microprocessor

- The microprocessor is a key component in embedded systems
- Modern cars include around 40 to 100 microprocessors
  - Advanced functions (engine control, brake system) need powerful processors (32-bit)
  - Simple functions (window control) need less powerful processors (8-bit)

IL2206 Embedded Systems

3

### Microprocessor

### Microprocessors are

- flexible and easy to program
- cheap
- optimized
- come in several variations
  - 8-bit, 16-bit, 32-bit
  - General purpose microcontroller or DSP
  - Customized processor

### but

 include overhead in form of instruction decoding and memory access (compared to pure hardware)

IL2206 Embedded Systems

4

A microprocessor is

found in almost any

embedded system!

### Thus ...

- Microprocessors are used
  - as key components in an embedded design
- Programmable Logic and ASICs are used
  - for critical parts in a design
- An objective for an embedded system designer is to find the cheapest solution that meets the requirements

Do not use a powerful 32-bit processor, when you only want to control a simple battery-driven device

IL2206 Embedded Systems

5

### **Embedded System Processors**

- There is a huge amount of microprocessors that are used for embedded systems
  - different sizes and performance
  - processors may either come as
    - integrated circuit component (like Intel i5 or i7)
    - soft core, which can be instantiated on ASIC or FPGA

IL2206 Embedded Systems

### **Embedded System Processors**

- Embedded microprocessors follow general concepts
  - similar architecture
  - similar instruction set
- Embedded programs are usually written in C or another high-level programming language, only very critical parts are programmed in assembler
- Course uses Nios II-processor for the laboratories

IL2206 Embedded Systems

7

### **Nios II Overview**

- The Altera Nios II is a soft microprocessor core for Altera FPGAs
  - Microblaze is the corresponding soft processor for Xilinx FPGAs
- A soft processor cannot be bought as an integrated circuit, like the Pentium, but can be instantiated in hardware as part of an FPGA or ASIC design
- Soft microprocessor cores can be customized in order to meet the special requirements of a design

IL2206 Embedded Systems

## Field Programmable Gate Arrays

- An FPGA consists of programmable logic cells and a programmable interconnect
- The logic cells are configurable digital hardware blocks
- If large enough, the FPGA can be used to implement any digital functionality



IL2206 Embedded Systems

9

# Cyclone II Logic Element Register Chain Floring From Privious L. LAB Wide Synchronous Floring From Privious L. LAB Wide Synchronous Floring From Privious L. LAB Wide Synchronous Floring From Privious L. LAB Wide Register Select Register Select Register Select Register Select Register Chain Floring Row, Column, Register Chain Floring Row, Column, Register Chain Colpy Vide Reset Local Routing Reset Local Routing Register Chain Colpy Clura LAB Carry-Out LAB Carry-Out

# Nios II Cores can be implemented on an FPGA

- The Nios II processor can be implemented on an FPGA, if enough logic cells and interconnection resources are available
- More than one processor core can be implemented on an FPGA



Powerful parallel computer systems become possible!

IL2206 Embedded Systems

11

### **Nios II Processor Overview**

- 32-bit instruction set, data path, and address space
- 32 general-purpose registers
- Instruction and data caches
- Single instruction 32x32 bit multiply and divide instruction, result is 32 bit
- Configurable processor
- Custom Instructions: Additional instructions can be defined using extra hardware

IL2206 Embedded Systems



### **Nios II cores**

- When creating a Nios II core the designer can
  - choose a special type of Nios II processor (II/e, II/s, II/f)
  - choose the amount of cache memory and on-chip memory
  - choose peripheral circuits and external memories that will be integrated into the core
  - code is in general portable from one Nios core to another

IL2206 Embedded Systems

### The Nios II/f processor core

- The "fast" Nios core is designed for high execution performance
- <1800 LEs, max 1.16 DMIPS/MHz (DMIPS = Dhrystone Million Instructions per Second)
- Characteristics:
  - Separate Instruction and Data Caches
  - 2 GByte of external address space
  - Optional tightly-coupled memory
  - 6-stage pipeline
  - Dynamic branch prediction
  - Hardware multiply/divide
  - Possibility to add custom instructions

IL2206 Embedded Systems

15

### Nios II/s processor core

- The "standard" Nios core is designed for a small core size. On-chip logic and memory resources are conserved at the expense of execution performance
- <1400 LEs, 0.74 DMIPS</li>
- Instruction cache, but no data cache
- 5-stage pipeline

IL2206 Embedded Systems

### Nios II/e processor core

- The "economy" Nios core is designed to achieve the smallest possible core size
- <700 LEs, 0.15 DMIPS/MHz
- Compatible with Nios II instruction set
- No pipeline
- No caches
- No branch prediction

IL2206 Embedded Systems

17

### **Overview Nios II family (Extract)**

|                    | Feature                   | Core                     |                           |                           |  |  |  |
|--------------------|---------------------------|--------------------------|---------------------------|---------------------------|--|--|--|
|                    |                           | Nios II/e                | Nios II/s                 | Nios II/f                 |  |  |  |
| Objective          |                           | Minimal core size        | Small core size           | Fast execution speed      |  |  |  |
| Performance        | DMIPS/MHz (1)             | 0.15                     | 0.74                      | 1.16                      |  |  |  |
|                    | Max. DMIPS (2)            | 31                       | 127                       | 218                       |  |  |  |
|                    | Max. f <sub>MAX</sub> (2) | 200 MHz                  | 165 MHz                   | 185 MHz                   |  |  |  |
| Area               | •                         | < 700 LEs;<br>< 350 ALMs | < 1400 LEs;<br>< 700 ALMs | < 1800 LEs;<br>< 900 ALMs |  |  |  |
| Pipeline           |                           | 1 Stage                  | 5 Stages                  | 6 Stages                  |  |  |  |
| External Add       | ress Space                | 2 Gbytes                 | 2 GBytes                  | 2 GBytes                  |  |  |  |
| Instruction<br>Bus | Cache                     | -                        | 512 bytes to 64<br>kbytes | 512 bytes to 64 kbytes    |  |  |  |
|                    | Pipelined Memory Access   | -                        | Yes                       | Yes                       |  |  |  |
|                    | Branch Prediction         | -                        | Static                    | Dynamic                   |  |  |  |
|                    | Tightly Coupled Memory    | -                        | Optional                  | Optional                  |  |  |  |

(2) Numbers are for the fastest device in the Stratix II family

IL2206 Embedded Systems



### **The Laboratory Environment**

- DE2 Board
  - Cyclone II EP2C35 FPGA
  - 4 Mbytes of flash memory
  - 512 Kbytes of static RAM
  - 8 Mbytes of SDRAM
  - Several I/O-Devices
  - 50 MHz oscillator



IL2206 Embedded Systems

### **Design Flow - Nios II**

- Design a Nios II core with peripherals using the tool SOPC builder (alternatively you can use an existing core)
- Write your software application using the symbolic addresses and values that have been defined in SOPC builder
- 3. Download the Nios core to the FPGA
- 4. Download the software program to the memory

IL2206 Embedded Systems

21

# **Customizing a Nios II Core The Tool SOPC Builder**



- Designer can select and configure
  - CPU(s)
  - peripherals
- SOPC builder creates core that can be instantiated on FPGA
- Symbolic names are accessible for software designers

IL2206 Embedded Systems



### The system.h file

- The system.h file provides a complete software description of the Nios II system hardware
- The system.h file reflects the actual Nios II hardware, which is given by the \*.ptf file.
- A new core means a new \*.ptf file and a new system.h file

IL2206 Embedded Systems

### The system.h file

- The system.h file describes each peripheral and provides the following details:
  - The hardware configuration of the peripheral
  - The base address
  - The IRQ (interrupt request) priority
  - A symbolic name for the peripheral
- If the hardware changes, the source code is still valid, it will only use another system.h file

NEVER edit the system.h file!!!

IL2206 Embedded Systems

25

### **Quick Question**



 Which of the following architectures <u>cannot</u> be realized on the FPGA of the DE2 board (EP2C35)?

| Table 1–1. Cyclone                                 | II FPGA F | amily Feature | s          |            |         |         |           |
|----------------------------------------------------|-----------|---------------|------------|------------|---------|---------|-----------|
| Feature                                            | EP2C5     | EP2C8 (2)     | EP2C15 (1) | EP2C20 (2) | EP2C35  | EP2C50  | EP2C70    |
| LEs                                                | 4,608     | 8,256         | 14,448     | 18,752     | 33,216  | 50,528  | 68,416    |
| M4K RAM blocks (4<br>Kbits plus<br>512 parity bits | 26        | 36            | 52         | 52         | 105     | 129     | 250       |
| Total RAM bits                                     | 119,808   | 165,888       | 239,616    | 239,616    | 483,840 | 594,432 | 1,152,000 |
| Embedded<br>multipliers (3)                        | 13        | 18            | 26         | 26         | 35      | 86      | 150       |
| PLLs                                               | 2         | 2             | 4          | 4          | 4       | 4       | 4         |
| Maximum user<br>I/O pins                           | 158       | 182           | 315        | 315        | 475     | 450     | 622       |

- 1) 4 Nios II/f processors (1800LEs) with 4 kB I-Cache and 4 kB D-cache
- 2) 2 Nios II/f processors (1800LEs) with 16 kB I-Cache and 8 kB D-cache
- 3) 1 Nios II/f processor (1800LEs) with 32 kB I-Cache and 32 kB D-cache

IL2206 Embedded Systems

### **Quick Question**



 Which of the following architectures <u>cannot</u> be realized on the FPGA of the DE2 board (EP2C35)?

| Table 1–1. Cyclone II FPGA Family Features         |         |           |            |            |         |        |          |       |
|----------------------------------------------------|---------|-----------|------------|------------|---------|--------|----------|-------|
| Feature                                            | EP2C5   | EP2C8 (2) | EP2C15 (1) | EP2C20 (2) | EP2C35  | EP2C50 | EP2C70   |       |
| LEs                                                | 4,608   | 8,256     | 14,448     | 18,752     | 33,216  | 50,528 | 68,416   |       |
| M4K RAM blocks (4<br>Kbits plus<br>512 parity bits | 26      | 36        | 52         | 52         | 105     | 129    | 250      |       |
| Total RAM bits                                     | 119,808 | 165,888   | 239,616    | 239,616    | 483,840 | 483,84 | 0 bits < | 64 kE |
| Embedded<br>multipliers (3)                        | 13      | 18        | 26         | 26         | 35      | 86     | 150      |       |
| PLLs                                               | 2       | 2         | 4          | 4          | 4       | 4      | 4        |       |
| Maximum user<br>I/O pins                           | 158     | 182       | 315        | 315        | 475     | 450    | 622      |       |

- 1) 4 Nios II/f processors (1800LEs) with 4 kB I-Cache and 4 kB D-cache
- 2) 2 Nios II/f processors (1800LEs) with 16 kB I-Cache and 8 kB D-cache
- 3) 1 Nios II/f processor (1800LEs) with 32 kB I-Cache and 32 kB D-cache

IL2206 Embedded Systems

2

### **Stratix III Family (announced 2009)**

|             | Device/<br>Feature | ALMs | LEs    | M9K<br>Blocks | M144K<br>Blocks | MLAB<br>Blocks | Total<br>Embedded<br>RAM Kbits | MLAB<br>RAM<br>Kbits(2) | Total<br>RAM<br>Kbits(3) | 18×18-bit<br>Multipliers<br>(FIR Mode) | PLLs |
|-------------|--------------------|------|--------|---------------|-----------------|----------------|--------------------------------|-------------------------|--------------------------|----------------------------------------|------|
| Stratix III | EP3SL50            | 19K  | 47.5K  | 108           | 6               | 950            | 1,836                          | 297                     | 2,133                    | 216                                    | 4    |
| Logic       | EP3SL70            | 27K  | 67.5K  | 150           | 6               | 1,350          | 2,214                          | 422                     | 2,636                    | 288                                    | 4    |
| Family      | EP3SL110           | 43K  | 107.5K | 275           | 12              | 2,150          | 4.203                          | 672                     | 4.875                    | 288                                    | 8    |
|             | EP3SL150           | 57K  | 142.5K | 355           | 16              | 2,850          | 5,499                          | 891                     | 6,390                    | 384                                    | 8    |
|             | EP3SL200           | 80K  | 200K   | 468           | 36              | 4,000          | 9,396                          | 1,250                   | 10,646                   | 576                                    | 12   |
|             | EP3SE260           | 102K | 255K   | 864           | 48              | 5,100          | 14,688                         | 1,594                   | 16,282                   | 768                                    | 12   |
|             | EP3SL340           | 135K | 337.5K | 1,040         | 48              | 6,750          | 16,272                         | 2,109                   | 18,381                   | 576                                    | 12   |
| Stratix III | EP3SE50            | 19K  | 47.5K  | 400           | 12              | 950            | 5,328                          | 297                     | 5,625                    | 384                                    | 4    |
| Enhanced    | EP3SE80            | 32K  | 80K    | 495           | 12              | 1,600          | 6,183                          | 500                     | 6,683                    | 672                                    | 8    |
| Family      | EP3SE110           | 43K  | 107.5K | 639           | 16              | 2,150          | 8,055                          | 672                     | 8,727                    | 896                                    | 8    |
|             | EP3SE260           | 102K | 255K   | 864           | 48              | 5,100          | 14,688                         | 1,594                   | 16,282                   | 768                                    | 12   |

DE3 Board

Todays FPGAs are much larger !!! Check!

IL2206 Embedded Systems

### **Programming the Nios II**

- Altera assumes that the Nios II is programmed in C/C++, using the Hardware Abstraction Layer (HAL)
- Reasons are
  - Maintenance
  - Time-to-Market
  - Flexibility



IL2206 Embedded Systems

29

30

### **Nios II Assembly Language**

- The assembly language reflects the instruction set (almost one to one)
  - One instruction per line
  - Labels provide names for addresses (usually in first column)
  - Instructions often start in later columns.
  - Columns run to end of line
- Example:



### **Von Neumann Architecture**

- Consists of CPU and one single memory
- Memory holds instructions and data



The von Neumann architecture

- Memory holds data, instructions.
- Central processing unit (CPU) fetches instructions from memory.
  - Separate CPU and memory distinguishes programmable computer.
- CPU registers help out: program counter (PC), instruction register (IR), general-purpose registers, etc.

IL2206 Embedded Systems

32

### **Harvard Architecture**

- Consists of CPU and two single memories
- In the original Harvard, one memory holds instructions and the other data



IL2206 Embedded Systems

33

# **Comparison von Neumann and Harvard**

- Harvard allows two simultaneous memory fetches.
- Most DSPs use Harvard architecture for streaming data:
  - greater memory bandwidth
  - more predictable bandwidth
- Additional hardware, since two address and data busses are needed

IL2206 Embedded Systems





# Be Aware of Endianess... • An embedded system has several devices, which communicate with each other • Non-Awareness of differences in "endian-policies" can cause large problems! • CPU 1 writes 0x01020304 into a memory location • CPU 2 reads 0x01020304 from this memory location, but interpretes it as ...? CPU 1 (little-endian) CPU 2 (big-endian) Memory





### **Nios II Programming Model**

- The Nios programming model is a model of the hardware that helps the designer to implement applications on Nios II
  - General-purpose registers
  - Control registers
  - Addressing Modes
  - Instruction Set
  - Memory Access
  - Processor Modes
  - Exceptions

Reference: Nios II Processor Handbook, Chapter 3 (Programming Model), and Chapter 8 (Instruction Set) Tutorial: Introduction to the Altera Nios II Soft Processor

IL2206 Embedded Systems

### **Nios II Programming Model**

- Students shall study the programming model on their own
  - only few slides on the programming model are presented in the lecture and are left to self-studies
- Please read the tutorial "Introduction to the Altera Nios II Soft Processor" (available via the course web page)

IL2206 Embedded Systems

41

# Nios II Programming Model Addressing Modes

- The Nios II architecture supports the following addressing modes:
  - Register addressing: all operands are registers, e.g. add r6, r7, r8
  - Displacement addressing: address is calculated as the sum of a register and a signed, 16-bit immediate value, e.g. ldw r6, 100(r5)
  - Immediate addressing: the operand is a constant within the instruction itself, e.g. movi r6, -30
  - Register indirect addressing: as in displacement adressing, but the displacement is the constant 0, e.g. ldw r6, (r5)
  - Absolute addressing: as in displacement addressing, but register r0 (zero) is used, e.g. ldw r6, 100(r0)

IL2206 Embedded Systems

# **Nios II Programming Model Instruction Set**

### **Data Transfer**

- Nios II has a RISC (Reduced Instruction Set Computer) load-store architecture
- Load and Store instructions handle all data movement between registers, memory, and peripherals
- There are instructions for both cached (ldw, stw) and uncached access (ldwio, stwio)
- There are instructions that support half-word and byte transfers
- Example: stw r5, (r9) stores the value of word in r5 in the memory location that is given by r9

IL2206 Embedded Systems

43

# **Nios II Programming Model Instruction Set**

### **Custom Instructions**

- Nios reserves instruction codes for custom instruction that designers create by defining the hardware for these functions
- Critical functions can be executed much faster.

IL2206 Embedded Systems

### **Co-processor**

- Many processors provide co-processors for special functionality
- A typical co-processor is a floating-point unit
- The co-processor is "integrated" in the programming model and called by special instructions
- In the Nios II concept co-processors can be easily created as custom instructions

IL2206 Embedded Systems

45

### **Pipelining**

- · Idea of pipeline is to increase execution speed
- Several instructions are executed simultaneously at different stages of completion.
- Various conditions can cause pipeline stalls that reduce execution speed
  - branches
  - memory system delays (result arrives late)
  - etc.

Concept of pipeline is not only used for processors, but can be applied in many design situations

IL2206 Embedded Systems

### Pipeline example **No Pipeline Pipeline** Processor will simultaneously Processor will either fetch an instruction (F), fetch an instruction (F) decode an instruction (D), or decode an instruction (D), and execute an instruction (E) execute an instruction (E) Processor produces a result Processor produces a result every third cycle every cycle E |i+1 D E i+2 i+2 D DE DE DE Time

IL2206 Embedded Systems

### **Nios II pipelines**

- Nios II/f has a 6-stage pipeline and Nios II/s a 5stage pipeline, where fetch, decode and execute are the first three stages
- Nios II/e has no pipeline

IL2206 Embedded Systems

### Latency and throughput

- Latency: time (or number of cycles) from entering the pipelining until it is completes
- Throughput: number of instructions executed per time period

Pipeline is used to increase throughput

IL2206 Embedded Systems

49

### **Pipeline stalls**

- If every step cannot be completed in the same amount of time, pipeline stalls
- Bubbles introduced by stall increase latency, reduce throughput



### **Execution Time For Branch Is Not Constant!**

| Instruction                                                            | Cycles | Penalties     |
|------------------------------------------------------------------------|--------|---------------|
| Normal ALU instructions (e.g., add, cmplt)                             | 1      |               |
| Combinatorial custom instructions                                      | 1      |               |
| Multi-cycle custom instructions                                        | 11     | Late result   |
| Branch (correctly predicted, taken)  Branch Performance                | 2      |               |
| Branch (correctly predicted, not taken)                                | 1      |               |
| Branch (mis-predicted)                                                 | 4      | Pipeline flus |
| trap, break, eret, bret, flushp, wrctl, and unimplemented instructions | 4      | Pipeline flus |
| call                                                                   | 2      |               |
| jmp, ret, callr                                                        | 3      |               |
| rdctl                                                                  | 1      | Late result   |
| load (without Avalon transfer)                                         | 1      | Late result   |
| load (with Avalon transfer)                                            | > 1    | Late result   |

IL2206 Embedded Systems

51

# Multiplication/Division heavily depends on underlying hardware!

| ALU Option                                                   | Hardware Details                                           | Cycles per<br>Instruction | Result Latency<br>Cycles | Supported<br>Instructions               |
|--------------------------------------------------------------|------------------------------------------------------------|---------------------------|--------------------------|-----------------------------------------|
| No hardware multiply<br>or divide                            | Multiply & divide<br>instructions generate an<br>exception | -                         | -                        | None                                    |
| LE-based multiplier                                          | ALU includes 32 x 4-bit multiplier                         | 11                        | +2                       | mul, muli                               |
| Embedded multiplier<br>on Stratix and Stratix II<br>families | ALU includes 32 x 32-bit multiplier                        | 1                         | +2                       | mul, muli,<br>mulxss, mulxsu,<br>mulxuu |
| Embedded multiplier on Cyclone II family                     | ALU includes 32 x 16-bit multiplier                        | 5                         | +2                       | mul, muli                               |
| Hardware divide                                              | ALU includes multicycle divide circuit                     | 4 – 66                    | +2                       | div, divu                               |

IL2206 Embedded Systems

### **Execution Performance Nios II/f**

| > 1    |             |
|--------|-------------|
| 1      |             |
|        |             |
| 1      |             |
| (1)    | Late result |
| (1)    | Late result |
| 1      | Late result |
| 2      | Late result |
| 1 - 32 | Late result |
| 1      |             |
|        | (1)         |

IL2206 Embedded Systems

53

### **Latency Cycles**

- Latency Cycles can decrease performance, if data dependency exist
- Execution of the addi instruction is delayed by two additional cycles until the multiply operation completes

```
mul r1, r2, r3 \# addi r1, r1, 100 \# (Depends on result of mul)
```

In contrast, the code below does not stall the processor

```
mul r1, r2, r3
or r5, r5, r6  # No dependency on previous results
or r7, r7, r8  # No dependency on previous results
addi r1, r1, 100 # (Depends on result of mul)
```

IL2206 Embedded Systems

# Some words about the ARM Microprocessor Core

- ARM is a family of RISC architectures, which share the same design principles and a common instruction set
- ARM does not manufacture the CPU itself, but licenses it to other manufacturers to integrate them into their own system
- The ARM core is used in very many embedded products and can be seen as market leader

 The ARM core as part of a system-on-chip (in the same way as Nios II)



ASIC

IL2206 Embedded Systems

55

### **Further Reading**

- <u>Lee and Seshia, Introduction to Embedded</u>
   <u>Systems (Second Edition)</u>
  - Chapter 8: Embedded Processors

IL2206 Embedded Systems

### **Summary**

- Microprocessor is a key component in an embedded system
- Nios II is a family of soft microprocessor cores
  - Load/store architecture
  - Most instructions are RISCy, operate in single cycle
  - Nios II is customizable

IL2206 Embedded Systems