Disks in general

Among the standard and important storage technologies is the magnetic disk. Disks are characterized by large amounts of storage at low cost, and by data that is permanent and stable: a disk does not need continuous power or operation to retain what is stored on it. From the software point of view, the model for large, permanent storage is the file system, and the disk is the major technological gizmo responsible for realizing file systems.

Disks are in fact disk shaped: magnetic platters, often several stacked on the same center spindle, that are rotated at a fixed (or variable) speed. Magnetic pickups, the heads, are positioned over the top and bottom surfaces of each platter. They are grouped together and move together radially, towards and away from the spindle shared by the platters. The heads are stepped from one concentric track of magnetization to the next. Each track is divided into evenly spaced segments called sectors or blocks. Each block typically holds 512 bytes of data, that is, 4096 bits, each recorded as a position magnetized to represent either a 1 or a 0.

The collection of all tracks at a given head position is called a cylinder. A given 512-byte sector can therefore be identified by the cylinder number, the head number within the cylinder, and the sector number within the track. This is called the CHS address and is now of historical interest. Modern drives number sectors from 0 up to the maximum available on the drive and call this number the Logical Block Address (LBA) of the sector.

The mapping between LBA and CHS is a secret of the drive. As drives contain more and more computational power, the drive takes on greater responsibility for delivering any LBA in the shortest time after the request is made; this delay is the latency. Factors affecting latency include the rotational speed of the disk and the distance, in steps, that the head assembly must move. Disks contain buffers so that they can expedite requests, reordering the actual work on the disk for the best overall performance.

Disks can only be read or written in entire blocks. If a single bit needs to be changed, the entire block must be transferred to the computer for modification and eventual write-back. It is generally assumed that block writes are atomic: a block write either succeeds fully or does not occur. Even a power failure during the write is assumed not to interfere, as the time to write a block is too brief to be affected by a power failure. Journaling file systems take this as an axiom.
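The read-modify-write cycle described above can be sketched in Python, with an in-memory buffer standing in for the disk. The helper name set_bit and the constant BLOCK_SIZE are made up for this illustration; a real disk would be reached through a device file or a driver.

```python
import io

BLOCK_SIZE = 512  # bytes per block, the typical value mentioned above


def set_bit(disk, block_number, bit_offset):
    """Set one bit inside a block by reading, changing and rewriting the whole block."""
    disk.seek(block_number * BLOCK_SIZE)
    block = bytearray(disk.read(BLOCK_SIZE))         # transfer the entire block
    if len(block) != BLOCK_SIZE:
        raise ValueError("short read: block is past the end of the disk")
    block[bit_offset // 8] |= 1 << (bit_offset % 8)  # change a single bit in memory
    disk.seek(block_number * BLOCK_SIZE)
    disk.write(block)                                # write the entire block back


# An in-memory stand-in for a 4-block disk (hypothetical, for illustration only).
disk = io.BytesIO(bytes(4 * BLOCK_SIZE))
set_bit(disk, block_number=2, bit_offset=9)
```

Even though only one bit changes, 512 bytes travel in each direction.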

Drives compete on size and speed and also other features: size of cache, resilience to breakage, mean-time-to-failure. New drives have accelerometers to retract heads if the drive is dropped, to avoid the head touching down and scratching the surface of the platters. They are capable of using the rotational inertia in the spinning platters to power the electronics and complete the write of all buffered data in case the power is cut.

Drives are connected to the motherboard by wires, of course, and by a set of protocols for the signals that flow over the wires. There are two standards: ATA and SCSI. Typically SCSI is more sophisticated, more expensive, and aimed more at servers. ATA now has a serial version called SATA, which is faster. There are several versions of both ATA and SCSI; the biggest difference is that the newer the version, the faster it transfers data.

LBA, CHS and the 137 GB limit

The concepts of LBA versus CHS are clear enough. Early CHS numbering probably really meant specifying the cylinder, head and sector as they physically existed on the hard drive. This was probably a result of limited electronics on the drive: the CPU had to do all the thinking. With IDE, Integrated Drive Electronics, drives had smarts enough to figure these things out for themselves, and I would imagine that telling such a drive exactly where data physically sits on the disk no longer makes sense. However, there are things that everyone had to live with.

One such thing is that the INT 13H mechanism for accessing drives, as well as the drive geometry description that we shall find in the partition table, defined a 3-byte scheme for specifying CHS: 10 bits of cylinder, 8 bits of head, and 6 bits of sector. Furthermore, the sector numbering started from 1 rather than 0, so there would be at most 63 sectors per track, not 64 as I would have imagined. For 512-byte sectors, this limits drives to about 8 GB (1024 x 256 x 63 x 512 bytes).

This makes it somewhat difficult to convert between LBA and CHS, as the 3-byte CHS numbering has a plenitude of "impossible" values: every value in which the 6 sector bits are all zeros.

The reality of LBA addressing is that the first ATA standard, X3.221-1994, defined a 28-bit LBA. This was updated by ATA-6, NCITS 361-2002, to 48 bits. For the unfortunates among you who use Windows, support for this arrived with SP1 of Windows XP. The 28-bit LBA limits drives to 137 GB, assuming 512-byte sectors.

ATA-1 also reused the 28-bit field for CHS addressing, however its breakdown was 16 bits of cylinder, 4 bits of head, and 8 bits of sector. The sector was still a one-based address, with values 1-255. This meant that CHS as expressed by ATA would not be compatible with CHS as expressed by the BIOS (INT 13H and, consequently, the MBR). In fact, the least common denominator would be 10/4/6, or a mere 504 MB of disk.
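The limits quoted in the last few paragraphs follow directly from the field widths. The short Python sketch below recomputes them; the function and variable names are invented for this illustration, and 512-byte sectors are assumed throughout.

```python
SECTOR_BYTES = 512  # assumed sector size


def chs_capacity(cyl_bits, head_bits, sector_bits):
    """Bytes addressable by a CHS scheme whose sector numbers start at 1."""
    cylinders = 2 ** cyl_bits
    heads = 2 ** head_bits
    sectors = 2 ** sector_bits - 1   # sector value 0 is never used
    return cylinders * heads * sectors * SECTOR_BYTES


def lba_capacity(lba_bits):
    """Bytes addressable by a linear block address of the given width."""
    return 2 ** lba_bits * SECTOR_BYTES


bios_limit = chs_capacity(10, 8, 6)      # INT 13H CHS: about 8 GB
ata_chs_limit = chs_capacity(16, 4, 8)   # ATA-1 CHS breakdown
common_limit = chs_capacity(10, 4, 6)    # what both BIOS and ATA CHS can express
lba28_limit = lba_capacity(28)           # 28-bit LBA: about 137 GB

for name, value in [("INT 13H CHS", bios_limit), ("ATA CHS", ata_chs_limit),
                    ("10/4/6 CHS", common_limit), ("28-bit LBA", lba28_limit)]:
    print(f"{name:12s} {value:>16,d} bytes")
```

Note that the 504 MB figure counts megabytes as 2**20 bytes, while the 8 GB and 137 GB figures count gigabytes as 10**9 bytes.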

See page 523 and following of Upgrading and Repairing PCs by Scott Mueller. Described there are additional conversions between the CHS reported by a drive and the INT 13H CHS conventions, including a work-around for a Microsoft bug. The first is called ECHS, Extended CHS. The second is called LBA-assist.

ECHS converts from a reported CHS to an INT 13H compatible CHS by maintaining the sector count but moving bits from the cylinder count into the head count, since 10 bits is too few for real cylinder counts and 8 bits is too many for real head counts. There seems to be a Microsoft bug to work around, however, but I don't have the details.

The LBA-assist method simply makes 63 sectors canonical, selects between 16 and 255 heads according to the total number of sectors on the disk, and then reports the number of cylinders required to get the total sector count correct.
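A rough Python sketch of the LBA-assist idea follows. The text above does not say exactly how the head count is chosen, so the rule used here is an assumption: take the smallest of 16, 32, 64, 128 or 255 heads that keeps the cylinder count within the 10 bits INT 13H allows. The function name is likewise made up.

```python
def lba_assist_geometry(total_sectors):
    """Pick a (cylinders, heads, sectors_per_track) triple for INT 13H.

    Sectors per track is fixed at 63. The list of candidate head counts
    and the cap at 1024 cylinders are assumptions, not taken from the text.
    """
    sectors_per_track = 63
    for heads in (16, 32, 64, 128, 255):
        cylinders = total_sectors // (heads * sectors_per_track)
        if cylinders <= 1024:
            break
    cylinders = min(cylinders, 1024)  # larger disks saturate the 10-bit field
    return cylinders, heads, sectors_per_track
```

Under this rule a disk of 2,000,000 sectors would be reported with 32 heads, and anything beyond the 8 GB INT 13H limit ends up with 255 heads and a saturated cylinder count.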

He also gives these formulas:

	LBA = logical block address
	C = cylinder
	H = head
	S = sector
	HPC = heads per cylinder
	SPT = sectors per track
	
	then
	
	LBA = (((C * HPC) + H) * SPT) + S - 1
	C = int ( LBA/SPT/HPC ) 
	H = int (LBA/SPT) mod HPC
	S = (LBA mod SPT) + 1 
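These formulas translate almost directly into Python. The sketch below uses the same symbols as above; the function names and the range check are additions for illustration.

```python
def chs_to_lba(c, h, s, hpc, spt):
    """Convert cylinder/head/sector (sector counted from 1) to an LBA."""
    if not (0 <= h < hpc and 1 <= s <= spt):
        raise ValueError("head or sector out of range for this geometry")
    return (c * hpc + h) * spt + (s - 1)


def lba_to_chs(lba, hpc, spt):
    """Convert an LBA back to (cylinder, head, sector) for the given geometry."""
    c = lba // (spt * hpc)
    h = (lba // spt) % hpc
    s = (lba % spt) + 1
    return c, h, s
```

For example, with 255 heads per cylinder and 63 sectors per track, chs_to_lba(0, 1, 1, 255, 63) gives 63, and lba_to_chs(208844, 255, 63) gives (12, 254, 63). The two functions are inverses of each other for any in-range input.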

Master Boot Record, Partitions

Booting is the process by which the operating system is loaded when the computer starts up. At some point, sector zero of the primary disk is read and the computer jumps to the first byte in the sector and begins running the code found there. In less than 512 bytes, this code must read the disk again to pull in a more complete loader, which will complete the boot sequence. This sector is called the Master Boot Record (MBR). A bootable disk needs a correct MBR.

For DOS-based computers, the MBR contains a partition table. This table divides the disk into contiguous runs of blocks. Each run is a partition and can contain its own file system, possibly belonging to a different operating system. Once the operating system is running, each partition acts like its own separate disk. The partition table contains four entries, each 16 bytes long, each describing one partition. The table sits at the end of the MBR, just before the very last two bytes of the sector, where the value 0xaa55 is written little-endian. This value is called the signature and is generally pointless.
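Given that layout, the table must start at byte 512 - 2 - 4 x 16 = 446 of the sector. A minimal Python sketch that checks the signature and slices out the four raw entries might look like this; the function and constant names are made up for the example.

```python
MBR_SIZE = 512
ENTRY_SIZE = 16
TABLE_OFFSET = MBR_SIZE - 2 - 4 * ENTRY_SIZE   # 446, i.e. 0x1be


def split_mbr(sector):
    """Return the four raw 16-byte partition entries of an MBR sector."""
    if len(sector) != MBR_SIZE:
        raise ValueError("an MBR is exactly one 512-byte sector")
    # The last two bytes are 0x55 0xaa on disk, which is 0xaa55 little-endian.
    signature = int.from_bytes(sector[510:512], "little")
    if signature != 0xAA55:
        raise ValueError("missing 0xaa55 signature")
    return [sector[TABLE_OFFSET + i * ENTRY_SIZE:
                   TABLE_OFFSET + (i + 1) * ENTRY_SIZE]
            for i in range(4)]
```

The signature may be pointless for humans, but it gives a program a cheap sanity check before it trusts the 64 bytes in front of it.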

Each of the four partition table entries describes a partition in two formats: the CHS of the start and end of the partition, and the LBA of the start of the partition together with its length in sectors. Also in the entry are the type of the partition (DOS, Linux, etc.) and a byte for the bootable flag. The LBA is a four-byte linear address. The CHS is a 3-byte value comprising:

  1. 10 bit cylinder number
  2. 8 bit head number
  3. and 6 bit sector number

It is unclear which of the two is used. However, CHS addresses are insufficient for any current disk. I have in front of me the specifications for the Hitachi Travelstar 2.5 inch SATA drive with 5600 rpm spindle speed. The 250 gigabyte model uses 512-byte sectors and has 2 platters and 4 heads. The disk collects its cylinders into 24 zones, with 1368 sectors per track for cylinders in the outermost zone, reducing zone by zone down to 704 sectors per track for cylinders in the innermost zone. However, it presents itself as a disk with 16 heads and a constant 63 sectors per track. The number of cylinders is determined by dividing the total number of sectors by 16 times 63. The 250 gigabyte model reports 488,397,168 sectors, so it presents 484,521 cylinders.
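The arithmetic for the reported geometry is easy to check. The few lines of Python below redo it with the figures quoted above; the variable names are only for illustration.

```python
total_sectors = 488_397_168          # as reported by the 250 gigabyte model
heads, sectors_per_track = 16, 63    # the geometry the drive presents

cylinders, leftover = divmod(total_sectors, heads * sectors_per_track)
capacity_bytes = total_sectors * 512

# A 10-bit cylinder field can count at most 1024 cylinders.
fits_in_mbr_chs = cylinders <= 1024

print(cylinders, leftover, capacity_bytes, fits_in_mbr_chs)
```

The division comes out exact, and the cylinder count is several hundred times larger than a 10-bit field can hold.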

The CHS values for a DOS partition cannot address so large a space. With LBA addressing and 512 byte sectors, 2 Terabyte drives are accommodated. However, a 32 bit LBA is neither the 28 bit ATA-1 standard nor the 48 bit ATA-6 standard. One imagines that the field intends to be the 32 low order bits of the 48 bit standard, with the impossible bits ignored, at the peril of future users.

The DOS partition table format is essentially folklore. It was not designed, documented, then implemented. It just happened. There are other partition table formats, but the DOS partition table is used on PCs running Windows and by any operating system that might want to dual boot with Windows. This includes the free Unices such as Linux and FreeBSD.

The MBR has four partition entries. For a disk to have more than four partitions, a primary extended partition is created. Inside this extended partition there can be many additional partitions. These are described by a partition table in the zero block of the primary extended partition, which describes one partition and possibly a secondary extended partition. The secondary extended partition has a partition table in its zero block that describes a partition and possibly, as its second entry, another secondary extended partition, and so on. In this way the secondary extended partitions form a linked list.

Each partition table entry gives the location (and length) of the partition and a type. Extended partitions have type extended. The up to four partitions in the disk MBR (LBA 0) are called primary partitions. The rest are called logical partitions, maybe, who knows. We will call them that. The primary extended partition should have a size that covers all logical partitions. (Essentially, it should cover the entire disk, sans primary partitions.) However, the size of a secondary extended partition should cover only the logical partition contained in that extended partition (and the boot block/partition table sector just ahead of the logical partition).

Also, the partition tables inside the extended partitions use two different conventions for starting block offsets. For the logical partition, the starting address is an offset from the current boot block; for the secondary extended partition, the starting address is an offset from the start of the primary extended partition.
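The linked list and its two offset conventions can be sketched in Python as follows. This is an illustration under assumptions rather than a reference implementation: read_sector is a hypothetical callable that returns the 512 bytes at a given LBA, the type codes 0x05 and 0x0f are assumed to mark extended partitions, and the table is assumed to start at byte 446 of each boot block.

```python
import struct

EXTENDED_TYPES = {0x05, 0x0F}   # assumed type codes for extended partitions


def parse_entry(raw):
    """Return (type, lba_start, sector_count) from a 16-byte table entry."""
    ptype = raw[4]
    lba_start, count = struct.unpack_from("<II", raw, 8)
    return ptype, lba_start, count


def logical_partitions(read_sector, primary_ext_start):
    """Follow the chain of boot blocks inside the primary extended partition.

    Returns a list of (absolute_start_lba, sector_count, type) tuples.
    """
    found = []
    seen = set()                 # guards against a corrupt, looping chain
    boot_block = primary_ext_start
    while boot_block not in seen:
        seen.add(boot_block)
        sector = read_sector(boot_block)
        first = parse_entry(sector[446:462])
        second = parse_entry(sector[462:478])
        if first[0] != 0:
            # Logical partition: offset counts from the current boot block.
            found.append((boot_block + first[1], first[2], first[0]))
        if second[0] not in EXTENDED_TYPES:
            break                # no further link in the list
        # Next link: offset counts from the start of the primary extended partition.
        boot_block = primary_ext_start + second[1]
    return found
```

With a dictionary mapping LBAs to sectors as a stand-in disk, one could call logical_partitions(disk.__getitem__, start), where start is the LBA of the primary extended partition taken from the MBR.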

Example MBRs

Using the Unix dd (data dump) command, I look at LBA zero. I obviously need to be superuser to do this, and I open the disk device, /dev/sda, rather than a device representing a partition on the disk, such as /dev/sda1 or /dev/sda2. These naming conventions vary gently from Unix to Unix. I use the Unix command df (disk freespace) to find the names of the devices.

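Here is what you see. Each 16-byte partition table entry is laid out as follows, where F is the bootable flag and T is the partition type: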


       +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
       |  F |  CHS start   |  T |   CHS end    |      LBA start    |   number sectors  |
       +----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+ 
       

Try to convert the LBA and CHS addresses and see if they work out.


[burt@davis ~]$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2            186894676   5224068 172023576   3% /
/dev/sda1               101086     15962     79905  17% /boot
/dev/shm               1037648         0   1037648   0% /dev/shm
/dev/sdb1            246087720  77284704 156302440  34% /exp
/dev/sdb2            370851780  80424812 275356392  23% /huge
/dev/sdb3            151187172  35699376 107807924  25% /space
/space/irina/FC-6-i386-DVD.iso
                       3442574   3442574         0 100% /space/irina/disk
[burt@davis ~]$ su
[root@davis burt]# dd if=/dev/sda of=mbr.out bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0143895 seconds, 35.6 kB/s
[root@davis burt]# chown burt mbr.out 
[burt@davis ~]$ hexdump -C mbr.out 
00000000  eb 48 90 8e c0 8e d8 8e  d0 bc 00 7c 89 e6 bf 00  |.H.........|....|
00000010  06 b9 00 01 f3 a5 89 fd  b1 08 f3 ab fe 45 f2 e9  |.............E..|
00000020  00 8a f6 46 bb 20 75 08  84 d2 78 07 80 4e bb 40  |...F. u...x..N.@|
00000030  8a 56 ba 88 56 00 e8 fa  00 52 bb c2 07 31 03 02  |.V..V....R...1..|
00000040  80 00 00 80 43 c4 00 00  00 08 fa 90 90 f6 c2 80  |....C...........|
00000050  75 02 b2 80 ea 59 7c 00  00 31 c0 8e d8 8e d0 bc  |u....Y|..1......|
00000060  00 20 fb a0 40 7c 3c ff  74 02 88 c2 52 be 7f 7d  |. ..@|<.t...R..}|
00000070  e8 34 01 f6 c2 80 74 54  b4 41 bb aa 55 cd 13 5a  |.4....tT.A..U..Z|
00000080  52 72 49 81 fb 55 aa 75  43 a0 41 7c 84 c0 75 05  |RrI..U.uC.A|..u.|
00000090  83 e1 01 74 37 66 8b 4c  10 be 05 7c c6 44 ff 01  |...t7f.L...|.D..|
000000a0  66 8b 1e 44 7c c7 04 10  00 c7 44 02 01 00 66 89  |f..D|.....D...f.|
000000b0  5c 08 c7 44 06 00 70 66  31 c0 89 44 04 66 89 44  |\..D..pf1..D.f.D|
000000c0  0c b4 42 cd 13 72 05 bb  00 70 eb 7d b4 08 cd 13  |..B..r...p.}....|
000000d0  73 0a f6 c2 80 0f 84 ea  00 e9 8d 00 be 05 7c c6  |s.............|.|
000000e0  44 ff 00 66 31 c0 88 f0  40 66 89 44 04 31 d2 88  |D..f1...@f.D.1..|
000000f0  ca c1 e2 02 88 e8 88 f4  40 89 44 08 31 c0 88 d0  |........@.D.1...|
00000100  c0 e8 02 66 89 04 66 a1  44 7c 66 31 d2 66 f7 34  |...f..f.D|f1.f.4|
00000110  88 54 0a 66 31 d2 66 f7  74 04 88 54 0b 89 44 0c  |.T.f1.f.t..T..D.|
00000120  3b 44 08 7d 3c 8a 54 0d  c0 e2 06 8a 4c 0a fe c1  |;D.}<.T.....L...|
00000130  08 d1 8a 6c 0c 5a 8a 74  0b bb 00 70 8e c3 31 db  |...l.Z.t...p..1.|
00000140  b8 01 02 cd 13 72 2a 8c  c3 8e 06 48 7c 60 1e b9  |.....r*....H|`..|
00000150  00 01 8e db 31 f6 31 ff  fc f3 a5 1f 61 ff 26 42  |....1.1.....a.&B|
00000160  7c be 85 7d e8 40 00 eb  0e be 8a 7d e8 38 00 eb  ||..}.@.....}.8..|
00000170  06 be 94 7d e8 30 00 be  99 7d e8 2a 00 eb fe 47  |...}.0...}.*...G|
00000180  52 55 42 20 00 47 65 6f  6d 00 48 61 72 64 20 44  |RUB .Geom.Hard D|
00000190  69 73 6b 00 52 65 61 64  00 20 45 72 72 6f 72 00  |isk.Read. Error.|
000001a0  bb 01 00 b4 0e cd 10 ac  3c 00 75 f4 c3 00 00 00  |........<.u.....|
000001b0  00 00 00 00 00 00 00 00  b1 00 80 0f b6 00 80 01  |................|
000001c0  01 00 83 fe 3f 0c 3f 00  00 00 8e 2f 03 00 00 00  |....?.?..../....|
000001d0  01 0d 83 fe ff ff cd 2f  03 00 d4 14 00 17 00 fe  |......./........|
000001e0  ff ff 82 fe ff ff a1 44  03 17 3f 82 3e 00 00 00  |.......D..?.>...|
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
00000200
[burt@davis ~]$ 
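The partition table in the dump above (offsets 0x1be through 0x1fd) can be decoded with a few lines of Python. The bytes below are copied from that dump. The function names are made up, and the way the 10-bit cylinder is split across the second and third CHS bytes (top two bits of the sector byte, then a full byte) is the usual convention, stated here as an assumption since the text above does not spell it out.

```python
import struct

# Bytes 0x1be-0x1fd of the dump above: four 16-byte partition entries.
TABLE = bytes.fromhex(
    "80 01 01 00 83 fe 3f 0c 3f 00 00 00 8e 2f 03 00 "
    "00 00 01 0d 83 fe ff ff cd 2f 03 00 d4 14 00 17 "
    "00 fe ff ff 82 fe ff ff a1 44 03 17 3f 82 3e 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
)


def decode_chs(b):
    """Unpack a 3-byte CHS field into (cylinder, head, sector)."""
    head = b[0]
    sector = b[1] & 0x3F                     # low 6 bits
    cylinder = ((b[1] & 0xC0) << 2) | b[2]   # top 2 bits plus the third byte
    return cylinder, head, sector


def decode_entry(raw):
    """Decode one entry: flag, CHS start, type, CHS end, LBA start, length."""
    flag, chs_start, ptype, chs_end, lba_start, count = struct.unpack("<B3sB3sII", raw)
    return {
        "bootable": flag == 0x80,
        "type": ptype,
        "chs_start": decode_chs(chs_start),
        "chs_end": decode_chs(chs_end),
        "lba_start": lba_start,
        "sectors": count,
    }


entries = [decode_entry(TABLE[i:i + 16]) for i in range(0, 64, 16)]
for entry in entries:
    print(entry)
```

The first entry is bootable, has type 0x83, starts at CHS (0, 1, 1) and LBA 63, and ends at CHS (12, 254, 63). If one assumes a geometry of 255 heads and 63 sectors per track, those CHS values convert to LBA 63 and LBA 208,844, which agrees with the start plus the length of 208,782 sectors, minus one. The second and third entries lie beyond what 10 bits of cylinder can express, so their CHS fields are simply saturated at (1023, 254, 63) and only the LBA fields carry information.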

Another machine, with just the Linux partition and swap.

Script started on Fri 12 Oct 2007 10:11:33 AM EDT
[root@matawan burt]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      17775388   4597280  12260604  28% /
/dev/hda1               101086     14242     81625  15% /boot
/dev/shm                257568         0    257568   0% /dev/shm
[root@matawan burt]# dd if=/dev/hda count=1 of=bb.1.txt
1+0 records in
1+0 records out
[root@matawan burt]# hexdump -C bb.1.txt
00000000  eb 48 90 8e c0 8e d8 8e  d0 bc 00 7c 89 e6 bf 00  |.H.........|....|
00000010  06 b9 00 01 f3 a5 89 fd  b1 08 f3 ab fe 45 f2 e9  |.............E..|
00000020  00 8a f6 46 bb 20 75 04  84 d2 78 03 8a 56 ba 88  |...F. u...x..V..|
00000030  56 00 e8 fc 00 52 bb c2  07 31 d2 88 6f fc 03 02  |V....R...1..o...|
00000040  80 00 00 80 41 f8 00 00  00 08 fa 80 ca 80 ea 53  |....A..........S|
00000050  7c 00 00 31 c0 8e d8 8e  d0 bc 00 20 fb a0 40 7c  ||..1....... ..@||
00000060  3c ff 74 02 88 c2 52 be  79 7d e8 34 01 f6 c2 80  |<.t...R.y}.4....|
00000070  74 54 b4 41 bb aa 55 cd  13 5a 52 72 49 81 fb 55  |tT.A..U..ZRrI..U|
00000080  aa 75 43 a0 41 7c 84 c0  75 05 83 e1 01 74 37 66  |.uC.A|..u....t7f|
00000090  8b 4c 10 be 05 7c c6 44  ff 01 66 8b 1e 44 7c c7  |.L...|.D..f..D|.|
000000a0  04 10 00 c7 44 02 01 00  66 89 5c 08 c7 44 06 00  |....D...f.\..D..|
000000b0  70 66 31 c0 89 44 04 66  89 44 0c b4 42 cd 13 72  |pf1..D.f.D..B..r|
000000c0  05 bb 00 70 eb 7d b4 08  cd 13 73 0a f6 c2 80 0f  |...p.}....s.....|
000000d0  84 f0 00 e9 8d 00 be 05  7c c6 44 ff 00 66 31 c0  |........|.D..f1.|
000000e0  88 f0 40 66 89 44 04 31  d2 88 ca c1 e2 02 88 e8  |..@f.D.1........|
000000f0  88 f4 40 89 44 08 31 c0  88 d0 c0 e8 02 66 89 04  |..@.D.1......f..|
00000100  66 a1 44 7c 66 31 d2 66  f7 34 88 54 0a 66 31 d2  |f.D|f1.f.4.T.f1.|
00000110  66 f7 74 04 88 54 0b 89  44 0c 3b 44 08 7d 3c 8a  |f.t..T..D.;D.}<.|
00000120  54 0d c0 e2 06 8a 4c 0a  fe c1 08 d1 8a 6c 0c 5a  |T.....L......l.Z|
00000130  8a 74 0b bb 00 70 8e c3  31 db b8 01 02 cd 13 72  |.t...p..1......r|
00000140  2a 8c c3 8e 06 48 7c 60  1e b9 00 01 8e db 31 f6  |*....H|`......1.|
00000150  31 ff fc f3 a5 1f 61 ff  26 42 7c be 7f 7d e8 40  |1.....a.&B|..}.@|
00000160  00 eb 0e be 84 7d e8 38  00 eb 06 be 8e 7d e8 30  |.....}.8.....}.0|
00000170  00 be 93 7d e8 2a 00 eb  fe 47 52 55 42 20 00 47  |...}.*...GRUB .G|
00000180  65 6f 6d 00 48 61 72 64  20 44 69 73 6b 00 52 65  |eom.Hard Disk.Re|
00000190  61 64 00 20 45 72 72 6f  72 00 bb 01 00 b4 0e cd  |ad. Error.......|
000001a0  10 ac 3c 00 75 f4 c3 00  00 00 00 00 00 00 00 00  |..<.u...........|
000001b0  00 00 00 00 00 00 00 00  00 00 80 0f b6 00 80 01  |................|
000001c0  01 00 83 fe 3f 0c 3f 00  00 00 8e 2f 03 00 00 00  |....?.?..../....|
000001d0  01 0d 8e fe ff ff cd 2f  03 00 35 77 51 02 00 00  |......./..5wQ...|
000001e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
00000200
[root@matawan burt]# 
Script done on Fri 12 Oct 2007 10:13:35 AM EDT

References