The OpenWrt Flash Layout

The embedded devices (routers and the like) that OpenWrt/LEDE (Linux Embedded Development Environment) has mainly targeted since its inception use flash memory as non-volatile storage for the firmware and its configuration.

Moving parts are prone to wear and to all sorts of mechanical failure. But how can a non-moving part break? Possibly through electromigration, whisker growth, and similar effects.

Non-mechanical wear does not only occur when flash memory is erased!

1. Flash memory is more likely to fail than a hard disk drive (the kind with platters rotating at 5400–15000 RPM).
2. Some types of flash memory seem to experience more non-mechanical wear than other types.
3. How do we deal with failure?

Based on how the flash memory chip is connected to the SoC (i.e. the “host”), we at OpenWrt distinguish between “raw flash” (or “host-managed” flash) and “FTL (Flash Translation Layer) flash” (or “self-managed” flash). If the flash memory chip is connected directly to the SoC, we call it “raw flash” / “host-managed”. If there is an additional controller chip between the flash memory chip and the SoC, we call it “FTL flash” / “self-managed”. The controller chip primarily does wear-leveling and manages known bad blocks, but it may perform other tasks as well. In that case the flash memory cannot be accessed directly, only through the controller, which has to be considered a black box.

Embedded systems almost exclusively use “raw flash”, while solid-state drives (SSDs) and USB memory sticks almost exclusively use “FTL flash”!

Additionally we at OpenWrt distinguish between the two basic types of flash memory: NOR flash and NAND flash.

“Raw NOR flash” in typical routers is generally small (4 MiB – 16 MiB) and error-free, i.e. there cannot be bad erase blocks. Because raw NOR flash is error-free, the installed file system(s) do not need to take bad erase blocks into account, and neither SquashFS nor JFFS2 does! The combination of OverlayFS with SquashFS and JFFS2 has been the default OpenWrt setup since the beginning, and it works flawlessly on “raw NOR flash”.

“Raw NAND flash” in typical routers is generally larger (32 MiB – 256 MiB) and not error-free, i.e. it may contain bad erase blocks. A solution for dealing with bad erase blocks comprises three provisions:

  1. The manufacturer of the “raw NAND flash” has to guarantee that certain erase blocks are error-free:
    • namely the one(s) to which the bootloader is written,
    • but also the ones to which the Linux kernel and the SquashFS are written, because the firmware image file is generated on a desktop computer that cannot know which erase blocks of the device's “raw NAND flash” are bad.
  2. The Image Generator has to be constrained to build only images whose size is equal to or smaller than the area of the “raw NAND flash” that consists of guaranteed error-free erase blocks.
  3. OpenWrt would replace JFFS2 with UBIFS, and the entire area of the “raw NAND flash” that consists of potentially bad erase blocks would be written to exclusively from an installed OpenWrt system through UBIFS.

Older routers generally have “raw NOR flash”, but many newer routers have “raw NAND flash”.

The main difference between SLC and MLC is durability. Single-level cell (SLC) flash memory may have a lifetime of about 50,000 to 100,000 program/erase cycles, while multi-level cell (MLC) flash may have a lifetime of about 1,000 to 10,000 program/erase cycles.

Note that it is NOT correct to estimate the lifetime of NAND flash in embedded devices using the same method as for SSDs!

Almost all embedded systems contain “raw flash” chips. The available storage is not partitioned in the traditional way, where the partition data is stored in the MBR and PBRs. Instead, partitioning is done in the Linux kernel (and sometimes independently in the bootloader as well!). It is simply defined that “partition kernel starts at offset x and ends at offset y”. Giving the partitions names allows them to be addressed conveniently by name instead of by repeating the start offset over and over again.

The generic flash layout is:

Layer0  raw flash
Layer1  bootloader partition(s) | optional SoC-specific partition(s) | OpenWrt firmware partition | optional SoC-specific partition(s)
Layer2  (inside the firmware partition) Linux kernel | rootfs, mounted: “/”, OverlayFS with /overlay
Layer3  (inside rootfs) /dev/root, mounted: “/rom”, SquashFS, size depends on selected packages | rootfs_data, mounted: “/overlay”, JFFS2, “free” space

Many newer devices share this scheme, but the flash layout can differ between devices! Mostly it is minor details concerning U-Boot and SoC-specific firmware images that differ. Please see the wiki pages for each SoC and device for information about a particular layout. If the flash layout differs for your device, please update the wiki pages.
Here are some examples of how it looks on actual devices:

Qualcomm Atheros-based TL-WR1043ND. Somebody also provided a LibreOffice Calc ODS.

SquashFS images are suitable for devices with “raw NOR flash memory” chips, and installing them onto devices with “raw NAND flash memory” chips is not recommended. SquashFS images comprise both a SquashFS partition and a JFFS2 partition. JFFS2 images omit the SquashFS partition.

TP-Link WR1043ND Flash Layout
Layer0  raw NOR flash memory chip (m25p80 spi0.0: m25p64), 8192 KiB
Layer1  mtd0 u-boot 128 KiB | mtd5 firmware 8000 KiB | mtd4 art 64 KiB
Layer2  (inside mtd5) mtd1 kernel 1280 KiB | mtd2 rootfs 6720 KiB, mountpoint /, filesystem OverlayFS
Layer3  (inside mtd2) unnamed SquashFS part 1536 KiB | mtd3 rootfs_data 5184 KiB

Name         u-boot  kernel  (unnamed)  rootfs_data  art
Size in KiB  128     1280    1536       5184         64
mountpoint   none    none    /rom       /overlay     none
filesystem   none    none    SquashFS   JFFS2        none
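
As a quick sanity check of the table above, the sizes of the innermost partitions should add up layer by layer to the size of the whole chip. The following is only a minimal sketch; the sizes are copied from the table and the variable names are chosen for illustration.

```shell
#!/bin/sh
# Sizes in KiB, copied from the WR1043ND table above.
uboot=128
kernel=1280
squashfs=1536        # unnamed SquashFS part of rootfs
rootfs_data=5184
art=64

rootfs=$((squashfs + rootfs_data))    # Layer2: mtd2
firmware=$((kernel + rootfs))         # Layer1: mtd5
total=$((uboot + firmware + art))     # Layer0: whole chip

echo "rootfs=${rootfs} KiB firmware=${firmware} KiB total=${total} KiB"
```

With the numbers from the table this yields 6720 KiB for rootfs, 8000 KiB for firmware and 8192 KiB for the chip.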

Another Flash layout example

The Linux kernel treats “raw flash memory” (no matter whether NOR or NAND) chips as an MTD (Memory Technology Device) and employs filesystems developed for this purpose on top of the MTD layer.

Since the partitions are nested we look at this whole thing in layers:

  1. Layer0: We have the flash chip, 8 MiB in size, which is soldered to the PCB and connected to the SoC over SPI (Serial Peripheral Interface Bus).
  2. Layer1: We “partition” the space into mtd0 for the bootloader, mtd5 for OpenWrt and, in this case, mtd4 for the ART (Atheros Radio Test), which contains calibration data for the Wi-Fi (EEPROM). If it is missing or corrupt, ath9k (the wireless driver) won't come up anymore. The bootloader partition (128 KiB) consists of the 64 KiB u-boot block AND a data section which contains the MAC, WPS-PIN and type description. If no MAC is configured, ath9k will not work correctly due to a faulty MAC.
  3. Layer2: We subdivide mtd5 (firmware) into mtd1 (kernel) and mtd2 (rootfs). In the generation process of the firmware (see imagebuilder), the kernel binary file is first packed with LZMA, then the obtained file is packed with gzip, and then this file is written onto the raw flash (mtd1) without being part of any filesystem! During boot, u-boot copies this entire section into RAM and executes it. From there on, the Linux kernel bootstraps itself…
  4. Layer3: We subdivide rootfs even further into mtd3 for rootfs_data and the rest for an unnamed partition which accommodates the SquashFS.

Mount Points

  • / this is your entire root filesystem, it comprises /rom and /overlay. Please ignore /rom and /overlay and use exclusively / for your daily routines!
  • /rom contains all the basic files, like busybox, dropbear or iptables. It also includes the default configuration files used when booting into OpenWrt Failsafe mode. It does not contain the Linux kernel. All files in this directory are located on the SquashFS partition, and thus cannot be altered or deleted. However, because we use the overlay_fs filesystem, overlay-whiteout symlinks can be created on the JFFS2 partition.
  • /overlay is the writable part of the file system that gets merged with /rom to create a uniform /-tree. It contains anything that was written to the router after installation, e.g. changed configuration files, additional packages installed with opkg, etc. It is formatted with JFFS2.

Whenever the system is asked to look for an existing file in /, it first looks in /overlay, and if not there, then in /rom. In this way /overlay overrides /rom and creates the effect of a writable / while much of the content is safely and efficiently stored in the read-only /rom.

When the system is asked to delete a file that is in /rom, it instead creates a corresponding entry in /overlay, a whiteout. A whiteout is a symlink to (overlay-whiteout) that mostly behaves like a file that doesn't exist. In newer versions, the whiteout is created as a character device with 0/0 device number instead.

#!/bin/sh
# Show all overlay whiteouts.
# Newer versions create whiteouts as character devices (observed on CC
# in 2018), which 'find /overlay -type c' lists; older versions use
# symlinks that point to "(overlay-whiteout)".
# See https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt

find /overlay -type c
find /overlay -type l -exec sh -c \
    'for x; do [ "$(readlink -n -- "$x")" = "(overlay-whiteout)" ] && printf %s\\n "$x"; done' -- {} +
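
The lookup order and the whiteout behaviour described above can be imitated in user space. The function below is only a sketch of the idea, not how the kernel implements it; the name overlay_lookup and its arguments are made up for illustration, and the 0/0 device number of newer whiteouts is not checked.

```shell
#!/bin/sh
# overlay_lookup UPPER LOWER PATH
# Prints the backing file that would supply PATH, or returns 1 if PATH is
# hidden by a whiteout or missing from both layers.
overlay_lookup() {
    upper=$1
    lower=$2
    path=$3
    # Older whiteout style: a symlink pointing to "(overlay-whiteout)".
    if [ -L "$upper/$path" ] &&
       [ "$(readlink -n -- "$upper/$path")" = "(overlay-whiteout)" ]; then
        return 1
    fi
    # Newer whiteout style: a character device.
    if [ -c "$upper/$path" ]; then
        return 1
    fi
    # The writable layer wins over the read-only layer.
    if [ -e "$upper/$path" ] || [ -L "$upper/$path" ]; then
        printf '%s\n' "$upper/$path"
    elif [ -e "$lower/$path" ] || [ -L "$lower/$path" ]; then
        printf '%s\n' "$lower/$path"
    else
        return 1
    fi
}
```

On a router it could be called as `overlay_lookup /overlay /rom etc/config/network`, assuming the upper layer lives directly in /overlay as described on this page.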

Example 2: Hoo Too HT-TM02

Ralink RT5350F-based Hoo Too HT-TM02.

Layer0  raw flash, 8192 KiB
Layer1  mtd0 u-boot 192 KiB | mtd1 u-boot-env 64 KiB | mtd2 factory 64 KiB | mtd3 firmware 7872 KiB (= FlashSize-(192+64+64))
Layer2  (inside mtd3) mtd4 kernel, about 1 MiB | mtd5 rootfs
Layer3  (inside mtd5) /dev/root, around 2 MiB | mtd6 rootfs_data, around 4.5 MiB

For some devices, the OpenWrt firmware partition may not exist at all. The DIR-300 flash layout is such an example.

UBIFS-Images are suitable for devices with “raw NAND flash memory”-chips.

TODO

The Linux kernel treats “raw/host-managed” flash memory (NOR and NAND alike) as an MTD (Memory Technology Device). An MTD is different to a block device or a character device.

On a common block device such as a hard drive, the storage space is split up into “blocks”, also called “sectors”, with a size of 512 bytes or 4096 bytes. Blocks do not get corrupted during normal operation, only in exceptional cases. In the very rare case that this happens, the LBA hard disk controller makes sure that accesses to such a bad block are redirected to a replacement block. Block devices support two main operations: read a whole block and write a whole block. When a block device is partitioned, the partition information is stored in the MBR or the GPT.

Flash memory using MTD is different from this.

The storage space of an MTD is split up into “erase blocks” with a size of e.g. 64 KiB, 128 KiB or much more, which are themselves split up into “blocks”, more correctly called “pages”, of smaller sizes.

A single “page” can be written to, but it cannot be overwritten. Instead, the entire “erase block” that the page is part of has to be erased before its “pages” can be rewritten. Erase blocks wear out after some number of erase cycles, typically on the order of 100K for SLC NAND and NOR flash and 1K-10K for MLC NAND flash. Erase blocks may also become bad (NAND only). In the case of “FTL flash”, the controller should notice and avoid further access to bad erase blocks. In the case of “raw flash”, the operating system should deal with such cases.

MTD devices support three main operations: read from some offset within an erase block, write to some offset within an erase block, and erase a whole erase block.

The utility program mtd can be used to manage MTD devices.

The MTD device is often subdivided into logical chunks of memory called partitions. Each partition starts at the beginning of an erase block and ends at the end of an erase block.
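
The alignment rule can be checked with plain shell arithmetic. This is a minimal sketch; the function name is hypothetical, and the offsets and the 64 KiB erase size in the usage line are the values of the kernel partition in the WR1043ND example on this page.

```shell
#!/bin/sh
# is_erase_block_aligned START END ERASESIZE
# Succeeds if both the start and the end offset of a partition fall on an
# erase block boundary. Values may be decimal or 0x-prefixed hex.
is_erase_block_aligned() {
    start=$1
    end=$2
    erasesize=$3
    [ $(($start % $erasesize)) -eq 0 ] && [ $(($end % $erasesize)) -eq 0 ]
}

# kernel partition: 0x020000-0x160000, erase size 0x10000 (64 KiB)
is_erase_block_aligned 0x20000 0x160000 0x10000 && echo "kernel partition is aligned"
```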

The partitioning of MTD devices is not stored in an MBR or GPT. Instead, it is done in the Linux kernel by MTD-specific partition parsers that determine the location and size of these partitions (sometimes the partitioning is implemented independently in the bootloader as well!).

The kernel boot process involves discovering the partitions within the NOR flash, which can be done by various target-dependent means:

  • some bootloaders store a partition table at a known location
  • some pass the partition layout via kernel command line
  • some pass the partition layout using Device Tree
  • some targets require specifying the kernel command line at the compile time (thus overriding the one provided by the bootloader).
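
For the kernel command line case, the mainline command-line partition parser understands an `mtdparts=` parameter of the form `mtdparts=<mtd-id>:<size>(<name>),...`. The sketch below only builds such a string from the WR1043ND sizes shown elsewhere on this page. Whether a given target actually uses this mechanism is device-specific, and the helper name build_mtdparts is made up for illustration.

```shell
#!/bin/sh
# build_mtdparts MTD_ID SIZE_KIB:NAME...
# Prints an mtdparts= string with consecutive partitions of the given sizes.
build_mtdparts() {
    id=$1
    shift
    parts=
    for spec in "$@"; do
        size=${spec%%:*}    # text before the first colon
        name=${spec#*:}     # text after the first colon
        parts="${parts:+$parts,}${size}k(${name})"
    done
    printf 'mtdparts=%s:%s\n' "$id" "$parts"
}

build_mtdparts spi0.0 128:u-boot 1280:kernel 6720:rootfs 64:art
```

The sketch describes a flat layout only; nested partitions such as firmware or rootfs_data are not expressed here.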

Some of these schemes but not all are implemented in the mainline Linux kernel. The standard kernel can usually detect the top level coarse partitioning scheme, but not the more fine-grained sub-partitions.

In order to deal with some of the custom flash partitioning schemes directly in the kernel, OpenWrt has developed mtdsplit which is a set of patches currently maintained separately from the mainline kernel, but used in OpenWrt to parse different flash layouts and split them into further “logical” partitions.

This is done recursively so that further split of a new “child” partition may be attempted. Whether an attempt is made to split a partition depends on the partition name.

  • rootfs is hardcoded to be split.
  • CONFIG_MTD_SPLIT_FIRMWARE can be used to control whether an attempt is made on the firmware partition. The most common split here is kernel, followed by padding, followed by the SquashFS root filesystem, followed by padding, followed by free space.

During splitting, the kernel walks the erase blocks and detects magic bytes via parsers. Each partition type (usually determined from the name) has its own list of parsers.

New partitions usually start at some offset from the start of the original partition. The size and number of the “children” depend on what is detected. For example, if a SquashFS image is found, the rootfs partition is added. For a SquashFS image the splitter also automatically adds rootfs_data to the list of available MTD partitions, setting this partition's beginning to the first appropriate address after the end of the SquashFS and its size to the remainder of the rootfs partition.

The resulting list of split-off partitions is stored in RAM only, so no partition table of any kind actually gets modified. This also applies to the detection and creation of the ubi partition and others, as well as to vendor-specific layouts.
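
How rootfs_data is placed can be sketched with shell arithmetic. The assumption here is that “the first appropriate address” means the next erase block boundary after the end of the SquashFS; the actual mtdsplit code may apply further rules. The rootfs offsets and the erase size are taken from the WR1043ND example on this page, while the SquashFS length is a hypothetical value.

```shell
#!/bin/sh
# round_up_to_erase_block OFFSET ERASESIZE
# Prints the smallest multiple of ERASESIZE that is >= OFFSET.
round_up_to_erase_block() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

rootfs_start=$((0x160000))
rootfs_end=$((0x7f0000))
erasesize=$((0x10000))
squashfs_len=$((0x15a3c2))    # hypothetical SquashFS length in bytes

data_start=$(round_up_to_erase_block $((rootfs_start + squashfs_len)) "$erasesize")
data_len=$((rootfs_end - data_start))
printf 'rootfs_data: ofs=%X, len=%X\n' "$data_start" "$data_len"
```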

For more details please refer to the code for the mtdsplit: https://github.com/openwrt/openwrt/tree/master/target/linux/generic/files/drivers/mtd/mtdsplit

For overlaying, a special mini_fo filesystem is used. The README is available from the sources at https://dev.openwrt.org/browser/trunk/target/linux/generic/patches-2.6.37/209-mini_fo.patch

Unsorted Block Images (UBI) is an erase block management layer in the Linux kernel for raw NAND flash memory chips. It is a layer on top of the MTD layer. UBI is used by UBIFS.

UBI serves two purposes: tracking “bad erase blocks” of a raw NAND flash memory chip and providing wear-leveling. To accomplish this, UBI maps logical erase blocks to physical erase blocks and presents the former to higher layers.

cat /proc/mtd
dev:    size   erasesize  name
mtd0: 00020000 00010000 "u-boot"
mtd1: 00140000 00010000 "kernel"
mtd2: 00690000 00010000 "rootfs"
mtd3: 00530000 00010000 "rootfs_data"
mtd4: 00010000 00010000 "art"
mtd5: 007d0000 00010000 "firmware"

The erasesize is the erase block size of the flash, in this case 64 KiB. The size is a hexadecimal value in bytes. To convert it, switch a calculator to hex mode, enter for example 02 0000, and switch back to decimal mode. Then guess how the partitions are nested into each other. Or execute dmesg after a fresh boot and look for something like:

Creating 5 MTD partitions on "spi0.0":
0x000000000000-0x000000020000 : "u-boot"
0x000000020000-0x000000160000 : "kernel"
0x000000160000-0x0000007f0000 : "rootfs"
mtd: partition "rootfs" set to be root filesystem
mtd: partition "rootfs_data" created automatically, ofs=2C0000, len=530000
0x0000002c0000-0x0000007f0000 : "rootfs_data"
0x0000007f0000-0x000000800000 : "art"
0x000000020000-0x0000007f0000 : "firmware"

These are the start and end offsets of the partitions as hexadecimal values in bytes. Now you don't have to guess which is nested in which. E.g. 02 0000 = 131,072 bytes = 128 KiB.
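
Instead of using a calculator, the hexadecimal sizes can be converted by the shell itself. The following is a small sketch that reads lines in the /proc/mtd format from standard input; the function name mtd_sizes_kib is chosen for illustration.

```shell
#!/bin/sh
# mtd_sizes_kib: reads /proc/mtd-style lines on stdin and prints
# "<dev> <name> <size> KiB" for each partition.
mtd_sizes_kib() {
    while read -r dev size erasesize name; do
        case $dev in
            mtd[0-9]*:) ;;      # a partition line such as "mtd0: ..."
            *) continue ;;      # skip the column header
        esac
        echo "$dev $name $((0x$size / 1024)) KiB"
    done
}

# On a router: mtd_sizes_kib < /proc/mtd
```

For the /proc/mtd listing shown on this page, rootfs_data (00530000) comes out as 5312 KiB. This differs from the 5184 KiB in the layout table further up, which is plausible because the SquashFS size depends on the selected packages.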

The flash chip can be represented as a large block of continuous space:

start of flash ................. end of flash

There is no ROM to boot from; at power up the CPU begins executing the code at the very start of the flash. Luckily this isn't the firmware or we'd be in real danger every time we reflashed. Boot is actually handled by a section of code we tend to refer to as the bootloader (the BIOS of your PC is a bootloader).

          Boot Loader Partition  Firmware Partition  Special Configuration Data
Atheros   U-Boot                 firmware            ART
Broadcom  CFE                    firmware            NVRAM
Atheros   RedBoot                firmware            FIS recovery, RedBoot config, boardconfig

The partition or partitions containing so-called Special Configuration Data differ considerably from each other. For example, the ART partition, which you will meet in conjunction with Atheros wireless and U-Boot, contains only data for the wireless driver, while the NVRAM partition of Broadcom devices is used for much more than that. There are special utilities to access and modify special configuration partitions. For Broadcom devices this is the nvram utility. To find out what is written in NVRAM, you can run nvram show.

Note that clearing special configuration data partitions like ART, NVRAM and FIS does not clear much of OpenWrt's configuration, unlike with other router software, which keeps configuration data solely in e.g. NVRAM. Instead, as a consequence of using the overlay_fs filesystem with a JFFS2 flash partition, the whole file system is writable, which gives you the flexibility to extend your OpenWrt installation in any way you want. OpenWrt's main configuration is therefore simply kept in the root file system, using UCI configuration files. For convenience, many other packages are made UCI compatible. If you want to reset your complete installation, you should use OpenWrt's built-in functionality, such as sysupgrade, to restore settings by clearing the JFFS2 partition. Or, if you cannot boot normally, you can wipe or change the JFFS2 partition using OpenWrt's failsafe mode (look in your device's dedicated page for information on how to boot into failsafe).

If you dig into the “firmware” section you'll find a trx. A trx is just an encapsulation, which looks something like this:

trx-header
HDR0 | length | crc32 | flags | pointers | data

“HDR0” is a magic value that indicates a trx header; the rest of the header consists of 4-byte unsigned values, followed by the actual contents. In short, it's a block of data with a length and a checksum. So, our flash usage actually looks something like this:

CFE | trx containing firmware | NVRAM

Except that the firmware is generally pretty small and doesn't use the entire space between CFE and NVRAM:

CFE | trx [ firmware | unused ] | NVRAM

(NOTE: The <model>.bin files are nothing more than the generic *.trx file with an additional header prepended to identify the model. The model information gets verified by the vendor's upgrade utilities, and only the remaining data, the trx, gets written to the flash. When upgrading from within OpenWrt, remember to use the *.trx file.)
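
The trx header described above can be inspected with standard tools. The sketch below checks the “HDR0” magic and prints the length field that follows it. It assumes that the 4-byte values are stored little-endian, which is not stated on this page, and the function name trx_length is made up for illustration.

```shell
#!/bin/sh
# trx_length FILE
# Prints the length field of a trx header, or returns 1 if FILE does not
# start with the "HDR0" magic.
# Assumption: the 4-byte fields are stored little-endian.
trx_length() {
    magic=$(dd if="$1" bs=1 count=4 2>/dev/null)
    [ "$magic" = "HDR0" ] || return 1
    # Read bytes 4..7 as four unsigned decimal byte values.
    set -- $(od -An -tu1 -j4 -N4 "$1")
    [ $# -eq 4 ] || return 1
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

A call such as `trx_length openwrt.trx` (hypothetical file name) would print the total length recorded in the header, in bytes.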

So what exactly is the firmware?

The boot loader really has no concept of filesystems; it pretty much assumes that the start of the trx data section is executable code. So, at the very start of our firmware is the kernel. But just putting a kernel directly onto flash is quite boring and consumes a lot of space, so we compress the kernel with a heavy compression known as LZMA. Now the start of the firmware is code for an LZMA decompressor:

lzma decompress | lzma compressed kernel

Now, the boot loader boots into an LZMA program which decompresses the kernel into RAM and executes it. It adds one second to the bootup time, but it saves a large chunk of flash space. (And if that wasn't amusing enough, it turns out the boot loader does know gzip compression, so we gzip-compressed the LZMA decompression program.)

Immediately following the kernel is the filesystem. We use SquashFS for this because it's a highly compressed readonly filesystem -- remember that altering the contents of the trx in any way would invalidate the crc, so we put our writable data in a JFFS2 partition, which is outside the trx. This means that our firmware looks like this:

trx [ gzip'd lzma decompress | lzma'd kernel | (SquashFS filesystem) ]

And the entire flash usage looks like this:

CFE | trx [ gz'd lzma | lzma'd kernel | SquashFS ] | JFFS2 filesystem | NVRAM

That's about as tight as we can possibly pack things into flash.

An image file is a byte-by-byte copy of the data contained in a file system. If you installed Debian or Windows in the usual way onto one or two hard disk partitions and afterwards copied the whole content byte by byte from the hard disk into one file:

dd if=/dev/sda of=/media/sdb3/backup.dd

the obtained backup file, /media/sdb3/backup.dd, could be used in exactly the same manner as an OpenWrt image file.

The difference is that OpenWrt image files are not created that way ;-) They are generated with the Image Generator (formerly called Image Builder).

  • Last modified: 2023/03/09 09:32
  • by brlin