Write Amplification is arguably the single most useful parameter to look at if one wants to inspect both the performance and the health of an SSD. Write Amplification is defined as the ratio of the data written to flash to the data written by the host.
Write Amplification (WA) = FLASH DATA WRITTEN / HOST DATA WRITTEN.
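As a quick illustration, the ratio above can be computed from two counters. This is a minimal sketch: the function name and the counter values are hypothetical, and real drives expose such counters in vendor-specific ways.

```python
def write_amplification(flash_bytes_written: int, host_bytes_written: int) -> float:
    """Return flash data written divided by host data written."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return flash_bytes_written / host_bytes_written


# Hypothetical counters: the host wrote 100 GiB, the flash absorbed 250 GiB.
GIB = 1024 ** 3
print(write_amplification(250 * GIB, 100 * GIB))  # 2.5
```

A value of 1 means every host write caused exactly one flash write; anything above 1 is extra internal work done by the drive.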
The performance of a storage drive is measured in terms of IOPS and throughput. The health of a storage drive is ascertained from the number of bad blocks, the rate of ECC errors and, in SSDs, the PE count (Program-Erase count). We have observed that Write Amplification affects not only health but also performance: when a drive shows high Write Amplification, its IO performance is lower as well.
Let’s understand the basics of SSDs first and then we will get into WA. SSD stands for Solid State Drive. It is called so because it stores data in Floating Gate MOSFETs, unlike the magnetic platters used in an HDD. A Floating Gate MOSFET might store 1 bit of data as in SLC (Single Level Cell), 2 bits as in MLC (Multi Level Cell) or even 3 bits as in TLC (Triple Level Cell).
The figures below show a Floating Gate MOSFET and its programming and erasing configurations:
Programming of a Floating Gate MOSFET: When a large voltage is applied between the gate and the substrate, it induces an electric field strong enough to tunnel some electrons through the dielectric into the floating gate. Because the floating gate is insulated by dielectric on both sides, the tunnelled electrons are trapped there, and that trapped charge is the stored, non-volatile data. The trapped electrons shift the threshold voltage of the FGMOSFET up, so a larger gate voltage is now needed to switch it ON. The gate voltage at which the FGMOSFET switches ON tells us the value stored in it.
The number of threshold levels that can be programmed on an FG-MOSFET depends on whether it’s SLC, MLC or TLC. A cell that stores n bits must distinguish 2^n threshold voltage levels: an SLC has two (erased and programmed), an MLC has four and a TLC has eight.
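To make the relationship between bits per cell and voltage levels concrete, here is a small sketch. The helper name `threshold_states` is made up for illustration; the only fact it encodes is that n bits need 2^n distinguishable states, one of which is the erased state.

```python
def threshold_states(bits_per_cell: int) -> int:
    """Number of distinguishable threshold-voltage states for a cell."""
    if bits_per_cell < 1:
        raise ValueError("a cell stores at least one bit")
    return 2 ** bits_per_cell


for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3)):
    states = threshold_states(bits)
    # One state is the erased state; the rest are reached by programming.
    print(f"{name}: {bits} bit(s), {states} states, {states - 1} programmed levels")
```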
A threshold voltage is not programmed in one shot, not even in SLCs. A circuit applies the programming voltage to the cell for some time to tunnel electrons, and then reads the voltage level of the cell to confirm that it has reached the required level. If it has not, the process is repeated until the cell reaches the required voltage level. Also, the voltage inside a cell can only be raised by programming; it cannot be reduced. It can only be reduced by erasing, and erasing happens at block granularity.
Erasing of a Floating Gate MOSFET: When a large voltage is applied to the substrate instead of the gate, a high electric field is induced in the reverse direction, which pulls the trapped electrons out of the floating gate and thus erases the cell. Erasing happens at block granularity because all the transistors (cells) in a block share their substrate; remember, we apply the erasing voltage to the substrate.
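The pulse-and-verify loop described above can be sketched roughly as follows. This is a toy model, not controller firmware: the function name, the millivolt units and the fixed 200 mV gain per pulse are all assumptions chosen for illustration.

```python
def program_cell(target_mv: int, start_mv: int = 0, step_mv: int = 200,
                 max_pulses: int = 100) -> tuple[int, int]:
    """Raise a cell's threshold voltage in pulses until it reaches target_mv.

    Returns (final voltage in mV, number of pulses applied).
    """
    if target_mv < start_mv:
        # Programming can only raise the voltage; lowering needs a block erase.
        raise ValueError("target below current level: erase the block first")
    voltage_mv = start_mv
    pulses = 0
    while voltage_mv < target_mv:      # verify: read back and compare
        if pulses >= max_pulses:
            raise RuntimeError("cell failed to reach the target level")
        voltage_mv += step_mv          # program pulse: tunnel some more electrons
        pulses += 1
    return voltage_mv, pulses


print(program_cell(1000))  # (1000, 5)
print(program_cell(900))   # (1000, 5): the last pulse overshoots the target
```

The second call shows why the step size matters: a coarse step reaches the target in fewer pulses but can overshoot, and an overshoot cannot be undone without erasing.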
Below is the diagram that shows a typical BLOCK of an SSD FLASH MEMORY.
Let’s list down the shortcomings of Flash Memory which add to the complexity:
1. Writes happen at page granularity (a WL, or word line, is considered a page; typically 4K, 8K or 16K in size) whereas erases happen at block granularity (a block might have 64, 128 or 256 pages). This doesn’t seem like an issue; until...
2. Writes can only happen on erased pages. Thus, when a block is full (no empty pages to write to) and we need to edit a page or two, we must either erase that whole block or use another block. This doesn’t seem like an issue either; until...
3. The number of Program/Erase cycles (PE count) is limited, i.e. we can program and erase a block only a limited number of times. DAMNED!
The limited PE count of blocks makes a fixed one-to-one mapping between host LBAs and flash physical pages impossible: we cannot keep writing to the same LBAs and wear one block past its PE limit while others sit unused. SSDs try to wear every block equally. This is called wear-leveling, and it is done with the help of the FTL (Flash Translation Layer). So when a page is edited, it is not erased immediately; rather, the new data is written to some other page and the present page is marked invalid. Invalid pages are erased when the need arises, and some complex algorithm runs to minimise the wear. The process of making invalid pages available for writing again is called Garbage Collection (GC).
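The out-of-place update performed by the FTL can be sketched as a small lookup table. This is a deliberately minimal model under stated assumptions: the class name `TinyFTL` and its fields are hypothetical, pages are handed out in simple ascending order, and there is no wear-leveling or GC here at all.

```python
class TinyFTL:
    """Toy page-mapped translation layer: host LBA -> physical page."""

    def __init__(self, num_pages: int):
        self.l2p = {}                       # host LBA -> physical page index
        self.state = ["free"] * num_pages   # each page: free / valid / invalid
        self.next_free = 0                  # next erased page to program

    def write(self, lba: int) -> int:
        if self.next_free >= len(self.state):
            raise RuntimeError("no free pages left: garbage collection needed")
        old = self.l2p.get(lba)
        if old is not None:
            self.state[old] = "invalid"     # the old copy is stale, not erased
        page = self.next_free
        self.state[page] = "valid"
        self.l2p[lba] = page
        self.next_free += 1
        return page


ftl = TinyFTL(num_pages=4)
ftl.write(7)          # first write of LBA 7 lands in page 0
ftl.write(7)          # editing LBA 7 goes to page 1; page 0 becomes invalid
print(ftl.state)      # ['invalid', 'valid', 'free', 'free']
```

Note how editing one LBA twice consumed two physical pages. The invalid page stays occupied until its whole block is erased.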
There are two types of GC: Foreground GC and Background GC. Foreground GC gets triggered when there are writes (flash writes) waiting and free space is running short. Background GC gets triggered when the SSD is idle. It is GC that causes WA: not all the pages in a block being reclaimed are invalid, and those that are still valid must be moved to another block before the erase. These flash writes of valid pages are extra writes that the host never asked for.
We have observed that when WA is high, the SSD controller is so busy handling internal flash writes that the throughput of the drive starts decreasing.
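To see how relocating valid pages inflates flash writes, here is a toy simulation of foreground GC. It is a sketch under several assumptions that are not from any real drive: a page-mapped FTL, a greedy policy that always reclaims the full block with the fewest valid pages, 20% spare capacity, and no wear-leveling. The function name and all parameters are hypothetical.

```python
import random


def simulate_wa(num_blocks=64, pages_per_block=32, user_fraction=0.8,
                host_writes=20_000, sequential=False, seed=0):
    """Toy SSD with greedy foreground GC; returns flash writes / host writes."""
    rng = random.Random(seed)
    num_lbas = int(num_blocks * pages_per_block * user_fraction)
    owner = [[None] * pages_per_block for _ in range(num_blocks)]  # LBA in each page
    fill = [0] * num_blocks     # pages programmed so far in each block
    valid = [0] * num_blocks    # pages still holding live data
    l2p = {}                    # host LBA -> (block, page)
    free_blocks = list(range(1, num_blocks))
    active = 0                  # block currently being filled
    flash_writes = 0

    def program(lba):
        nonlocal active, flash_writes
        if fill[active] == pages_per_block:
            active = free_blocks.pop()      # open a fresh erased block
        if lba in l2p:                      # invalidate the stale copy
            b, p = l2p[lba]
            owner[b][p] = None
            valid[b] -= 1
        p = fill[active]
        owner[active][p] = lba
        fill[active] += 1
        valid[active] += 1
        l2p[lba] = (active, p)
        flash_writes += 1

    def collect():
        full = [b for b in range(num_blocks) if fill[b] == pages_per_block]
        victim = min(full, key=lambda b: valid[b])   # greedy victim choice
        if valid[victim] == pages_per_block:
            raise RuntimeError("nothing to reclaim; more spare capacity needed")
        for lba in [x for x in owner[victim] if x is not None]:
            program(lba)                    # relocate valid pages: extra writes
        owner[victim] = [None] * pages_per_block     # erase the whole block
        fill[victim] = 0
        free_blocks.append(victim)

    for i in range(host_writes):
        lba = i % num_lbas if sequential else rng.randrange(num_lbas)
        # Foreground GC: reclaim space only when a write is waiting for it.
        while fill[active] == pages_per_block and len(free_blocks) < 2:
            collect()
        program(lba)
    return flash_writes / host_writes


print("sequential WA:", simulate_wa(sequential=True))
print("random WA:    ", simulate_wa(sequential=False))
```

In this model sequential overwrites invalidate whole blocks at a time, so the victim never has valid pages to move and WA stays at 1. Random overwrites leave valid pages scattered across every block, so each reclaim has to copy some of them and WA rises above 1. The exact random figure depends on the spare capacity and the seed.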
Below is the figure that shows a plot of Throughput and WA against varying Block-Size (here Block-Size means the size of each write IO, not the flash erase block). Details: 100% ENTROPY, ALIGNED 8K, RANDOM writes except for the first bar.
Here we can observe that as the Block-Size increases, throughput increases and WA decreases, so WA does affect throughput. The 16KiB sequential writes show no amplification at all (WA = 1), and their throughput is also higher than that of the 16KiB random aligned writes, because the latter show a higher WA value. When it comes to SSDs, there is no difference in the method of writing for sequential and random writes, as even random writes get converted to sequential writes on the flash; so all of the decrease in random-write performance is due to WA. We can also observe the 512B_al bar, which shows the measurement for 512-byte random aligned writes: there we have a large WA giving rise to the lowest throughput.
When the drive is freshly erased, random-write throughput will be equal to sequential-write throughput, as there won’t be any GC or WA. On a used drive, sequential writes show lower WA and higher throughput because GC for sequential writes is simple: large chunks of pages go invalid together.
WA affects performance along with life, so it can be considered a performance parameter too. In SSDs both life and IO performance are of major significance, and WA gives us a single parameter with which to comment on both. So if WA is lowest for a particular Block-Size or some other test case, then the drive can be expected to give its best performance and longest life for that test.