The quantum leap within Hard Disk Drives

Magnetic storage: The innovation of the “giant magneto-resistive” head—the advancements that boosted the capacity of hard-drives from a few gigabytes to 500 gigabytes.

Modern Hard disk drive

  • Drive construction
  • Basic HDD malfunctions
  • Technologies used for maintaining HDD reliability
  • Fundamentals of searching for malfunctions
  • Some typical malfunctions of hard drives and methods of their repair
  • Most frequent typical malfunctions in various HDD families
  • Conclusion
  • Introduction

    Brief architecture description, the main problems of modern hard disk drives, methods of HDD servicing and repair of simple malfunctions, SMART, passwords.
     

    Drive construction

    A hard disk drive consists of a mechanical part, the head-and-disk assembly (HDA), and a printed circuit board (PCB). The HDA acts as a case for all mechanical parts of a drive and also contains one chip performing the functions of a preamplifier/commutator. The PCB carries a series of chips which control the mechanical parts, encode and decode data on the magnetic surfaces of the platters, and transfer the data through an external interface. Generally, the PCB is located outside the HDA, on its lower side. In certain hard disk drives, such as the Seagate Barracuda series, the controller has an additional metal cover protecting the electronic components from damage.
     

    Mechanics

    The whole construction is based on the drive case, which protects sensitive mechanical parts from environmental influence. The inside is filled with dust-free air, though the air is not specifically purified; instead, the mechanical part is assembled in a special workshop where the air contains fewer than one hundred dust particles (0.5 micron or larger) per cubic foot, i.e. in a so-called "class 100 clean room".
    The HDA case has an opening blocked by a tight air filter. It is used to equalize the air pressure inside and outside the HDD. Unfortunately, if a drive falls into water, the water penetrates the inner space through that opening. Rotation of the disks creates an air flow which circulates inside the case and constantly passes through one more filter that traps dust if it somehow appears inside.
    Drive case accommodates a pack of magnetic disks driven by a spindle motor, magnetic heads with their positioning system and a preamplifier/commutator enhancing the signal from the heads and switching between them.
    A magnetic disk is a circular aluminum (rarely ceramic or made of special glass) plate with surface polished in accordance with the highest precision class for the sole exception of the parking zone, if it is present. In fact, high precision of disk surfaces and the heads causes them to "stick" to each other because of molecular attraction forces. To prevent that effect, manufacturers use special laser serrations in the zone of contact between drive heads and disks.
    The disks demonstrate specific magnetic properties owing to their chrome oxide based coating (magnetically active substance) or cobalt layer applied using vacuum deposition. Such coating is characterized by high hardness and much greater wear resistance compared to previous models coated with a layer of soft varnish based on ferric oxides which could be easily damaged unlike modern coatings.
    The disks are rotated by a special 3-phase electric motor. The stationary part contains three windings connected in a "star" configuration with a center tap, and the rotating part is a permanent sectional magnet made of rare-earth metals. The need to reduce runout and to reach high rotational speeds forces manufacturers to use special bearings in the spindle motor; these can be either ball bearings or improved fluid bearings (which use a special oil that dampens impact loads and thus increases motor durability). Fluid bearings are characterized by a lower noise level and produce practically no heat during operation. The rotational speed of modern IDE drives is 5400 RPM or 7200 RPM; for modern SCSI drives it is 10000 RPM or 15000 RPM.
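    To put the rotational speeds above into perspective, the short sketch below converts RPM into the time of one revolution and into the average rotational delay (half a revolution). The function names are illustrative only and do not come from any drive documentation.

```python
def rotation_time_ms(rpm: int) -> float:
    """Time of one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


def average_rotational_latency_ms(rpm: int) -> float:
    """On average the wanted sector is half a revolution away from the head."""
    return rotation_time_ms(rpm) / 2.0


# Speeds quoted in the text for IDE and SCSI drives.
for rpm in (5400, 7200, 10000, 15000):
    print(f"{rpm:>5} RPM: revolution {rotation_time_ms(rpm):.2f} ms, "
          f"average rotational latency {average_rotational_latency_ms(rpm):.2f} ms")
```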
    A magnetic head is also a sophisticated construction composed of numerous parts. Those parts are so small that they are manufactured by photolithography, just like chips. The working surface of the head's ceramic case is polished with the same precision as the disk itself. The head actuator is a flat solenoid coil of copper wire suspended between the poles of a permanent magnet and fixed at one end of a lever rotating around a bearing. The other end of the lever is connected to a bracket carrying the magnetic heads. The bracket is spring-loaded with a force which allows the heads to "fly" at a definite height above the disk surface; that height is usually equal to tenths of a micron.
    The whole transport system moving the head stack is called a voice coil by analogy with a loudspeaker cone. Its functional principle is similar to that of a common dynamic loudspeaker (i.e. a copper coil in a static magnetic field). The positioner's coil is surrounded by a stator acting as a permanent magnet. When an electric current of a certain magnitude and polarity flows through the coil, the positioner starts turning to the corresponding side with the respective acceleration; thus dynamic modification of the coil current allows positioning of the magnetic heads at any location above the disk surface.
    Drive heads are fixed when a drive is powered-off (in the parking zone) with special latches. Magnetic and pneumatic latches are two most widely used types. A magnetic latch is a small permanent magnet fixed within drive case and attracting ferrous lug on the voice coil in the heads' parking position. Pneumatic latch (or air lock) also fixes a positioner in the parking zone preventing its further movement. When the magnetic disks begin rotation the air flow thus generated deflects the "sail" of an air latch and unblocks the positioning system.
    The electronic components inside HDA are limited to the preamplifier/commutator for the signal received from drive heads. It is located closer to the heads to minimize interference of external noise, right over the flexible cable from the heads to drive's electronics. The same cable is connected to the voice coil and, sometimes, to the spindle motor; however, in most cases power supply of the spindle motor is implemented via a separate cable.
    A HDA is usually linked to the PCB with two connectors. One of them is a three-phase center-tapped connector for the spindle motor while the other delivers signals from the preamplifier/commutator and voice coil.
     

    Printed circuit board

    The circuit design of modern drives is characterized by the use of a few highly-integrated chips.

    As one can see in the picture, the whole layout is based upon four chips:
      • system controller chip including the read/write channel, disk controller and RISC control processor (microcontroller);
      • Flash ROM chip containing drive firmware;
      • chip controlling the spindle motor and voice coil;
      • RAM chip used as a cache buffer.
    Further increase of integration is impossible due to some basic differences in the operational modes of the above functional parts.
    The first system controller used in hard drives was a chip manufactured by Cirrus Logic. Its obvious breakthrough was manifested in the read/write channel, processor and disk controller integrated within one chip; however insufficiently developed methods of using such a microcircuit caused frequent malfunctions of Fujitsu drives belonging to series MPF3xxxAT and MPG.

    The microcontroller has a RISC architecture. As soon as power is switched on, or after the /RESET interface signal, the drive reset circuit sends a RESET signal to the microcontroller, which executes its program from ROM: it runs self-diagnostics, clears the working data area in memory and programs the disk controller and all programmable chips connected to the internal data bus of the HDD. Then the microcontroller polls the internal signals used during drive operation and, if it detects no emergency alerts, starts the spindle motor. The next stage of firmware operation is internal testing of the HDD, which checks the data buffer RAM, the disk controller and the status of the signals arriving at the microcontroller ports. Then the microcontroller begins analyzing the frequency of pulses, waiting until the spindle motor reaches the defined rotational speed. As soon as the necessary speed is reached, the microcontroller begins to manipulate the positioning circuit and disk controller, moving the magnetic heads to the area containing recorded firmware data, and transfers that data to buffer RAM for further operation. Then the microcontroller switches to the ready state and awaits commands from the HOST. In that mode a command received from the central processor initiates a whole chain of actions performed by all the electronic components in the HDD.
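    The start-up sequence described above can be pictured as a simple chain of stages in which any failed check stops the drive before it reports readiness. The sketch below is only an illustration: the stage names and the `run_startup` helper are hypothetical and do not correspond to any real firmware.

```python
from enum import Enum, auto


class Stage(Enum):
    RESET = auto()
    SELF_DIAGNOSTICS = auto()          # ROM program, memory clearing, chip programming
    SPINDLE_START = auto()             # only if no emergency alerts were polled
    INTERNAL_TEST = auto()             # buffer RAM, disk controller, port signals
    WAIT_FOR_SPEED = auto()            # pulse frequency analysis
    LOAD_FIRMWARE_FROM_DISK = auto()   # heads move to the firmware area
    READY = auto()
    FAULT = auto()


STARTUP_ORDER = [
    Stage.RESET,
    Stage.SELF_DIAGNOSTICS,
    Stage.SPINDLE_START,
    Stage.INTERNAL_TEST,
    Stage.WAIT_FOR_SPEED,
    Stage.LOAD_FIRMWARE_FROM_DISK,
]


def run_startup(checks):
    """Walk through the stages in order.

    `checks` maps a Stage to False when that stage fails; stages that are
    not listed are assumed to pass. Returns the list of stages visited.
    """
    trace = []
    for stage in STARTUP_ORDER:
        trace.append(stage)
        if not checks.get(stage, True):
            trace.append(Stage.FAULT)
            return trace
    trace.append(Stage.READY)
    return trace


print([s.name for s in run_startup({})])
print([s.name for s in run_startup({Stage.WAIT_FOR_SPEED: False})])
```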

    HDD read/write channel consists of a preamplifier/commutator (located inside HDA), read circuit, write circuit and a synchronizing clock.
    Drive preamplifier has several channels, each being connected to its respective head. The channels are switched by signals from the drive’s microprocessor. Preamplifier also contains a recording current switch and recording error sensor, which emits an error signal if a short circuit or break occurs in a magnetic head.
    Integrated reading/writing channel operating in the recording mode receives data from disk controller simultaneously with the recording clock frequency, performs data encoding, precompensation and transfers the data to preamplifier for writing to a disk. In the reading mode signal from preamplifier/commutator is transmitted to the automatic control circuit and then passes a programmable filter, adaptive compensatory circuit and pulse detector while being converted into data pulses sent to the disk controller for decoding and transfer through an external interface.

    Disk controller is the most complicated drive component which determines the speed of data exchange between a HDD and HOST.
    Disk controller has four ports used for connection to a HOST, microcontroller, buffer RAM and data exchange channel between it and HDD. Disk controller is an automatic device driven by microcontroller; from HOST side only standard registers of task file are accessible. Disk controller is programmed at the initialization stage by microcontroller, during the procedure it sets up the data encoding methods, selects the polynomial method of error correction, defines flexible or hard partitioning into sectors, etc.
    The buffer manager is a functional part of the disk controller governing the operation of buffer RAM. The capacity of the latter in modern HDDs ranges from 512 KB to 8 MB. The buffer manager splits the whole buffer RAM into separate segment buffers. Special registers accessible from the microcontroller contain the start addresses of those segments. While the HOST exchanges data with one of the segments, the read/write channel can exchange data with another one. Thus the system overlaps reading/writing data from/to the disk with data exchange with the HOST.
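    A minimal sketch of the segmented buffer idea follows. It assumes equal-sized segments and a strictly alternating schedule, which real buffer managers need not use; both function names are hypothetical.

```python
def segment_start_addresses(buffer_bytes, segments):
    """Start address of each segment when buffer RAM is split evenly."""
    if segments <= 0 or buffer_bytes % segments:
        raise ValueError("buffer must divide evenly into segments")
    size = buffer_bytes // segments
    return [i * size for i in range(segments)]


def overlapped_schedule(transfers, segments):
    """Show which segment the read channel fills while the HOST drains another.

    Returns (step, segment_filled_from_disk, segment_sent_to_host) tuples;
    None means that side is idle during the step.
    """
    schedule = []
    for step in range(transfers + 1):
        fill = step % segments if step < transfers else None
        drain = (step - 1) % segments if step >= 1 else None
        schedule.append((step, fill, drain))
    return schedule


print(segment_start_addresses(2 * 1024 * 1024, 4))
for row in overlapped_schedule(3, 2):
    print(row)
```

    From the second step onwards the disk side and the HOST side always work on different segments, which is the overlap the text describes.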

    The spindle motor controller regulates the motion of the 3-phase motor. It is programmed by the drive microcontroller. There are three control modes of spindle motor operation: the start mode, the acceleration mode and the stable rotation mode. Let us review the start mode. At power-up a reset signal is sent to the control microprocessor, which performs initialization, programming the internal registers of the spindle motor controller for a start. The spindle motor controller generates phase switching signals; the spindle motor then rotates at low speed, generating a self-induced electromotive force (EMF). The controller detects the EMF and notifies the microprocessor, which uses that signal for rotation control. In the acceleration mode the microprocessor speeds up phase switching and measures the rotational speed of the spindle motor until the speed reaches its rated value. As soon as the rated rotational speed is reached, the controller switches to the stable rotation mode. In that mode the microprocessor calculates the time required for one revolution of the spindle motor based on the phase signal and adjusts the rotational speed accordingly. After the magnetic heads leave the parking zone, the drive electronics begins tracking the stability of rotation using servo marks.
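    The three control modes can be summarized as a small decision function. This is a sketch under stated assumptions: the 1% speed tolerance and the function names are invented for illustration, and the only facts taken from the text are the mode names, the role of the detected EMF and the per-revolution timing.

```python
def spindle_mode(rpm_measured, rpm_rated, emf_detected, tolerance=0.01):
    """Pick the control mode from the current motor state.

    `tolerance` (fraction of rated speed) is a hypothetical parameter.
    """
    if not emf_detected:
        return "start"          # phases are switched blindly until EMF appears
    if rpm_measured < rpm_rated * (1.0 - tolerance):
        return "acceleration"   # phase switching is being sped up
    return "stable"


def rpm_from_revolution_time(seconds_per_revolution):
    """Speed derived from the measured time of one revolution."""
    return 60.0 / seconds_per_revolution


def speed_error_rpm(seconds_per_revolution, rpm_rated):
    """Positive result means the motor runs too slowly and must be sped up."""
    return rpm_rated - rpm_from_revolution_time(seconds_per_revolution)


print(spindle_mode(0, 7200, emf_detected=False))
print(spindle_mode(3000, 7200, emf_detected=True))
print(spindle_mode(7200, 7200, emf_detected=True))
```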

    The voice coil controller generates the control current which moves the drive positioner and stabilizes it over a defined track. The current value is calculated by the microcontroller on the basis of a digital error signal describing the head position relative to the track (Position Error Signal, or PES). The current value is transmitted in digital form to a digital-to-analog converter (DAC); the analog signal thus obtained is amplified and supplied to the voice coil.
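    The text only states that the coil current is computed from the PES. As an illustration, the sketch below assumes a simple proportional-derivative law and a 10-bit signed DAC; the gains, the DAC width and the function name are all hypothetical and are not taken from any real drive.

```python
def coil_current_code(pes, previous_pes, kp=0.8, kd=0.3, dac_bits=10):
    """Return a signed DAC code that pushes the head back towards the track.

    pes          -- current position error, in arbitrary servo units
    previous_pes -- error measured at the previous servo field
    kp, kd       -- hypothetical proportional and derivative gains
    """
    command = -(kp * pes + kd * (pes - previous_pes))
    limit = 2 ** (dac_bits - 1) - 1          # clamp to the DAC range
    return max(-limit, min(limit, round(command)))


print(coil_current_code(0, 0))        # on track: no correction
print(coil_current_code(100, 100))    # steady offset: current opposes the error
print(coil_current_code(10_000, 0))   # large error: output saturates
```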
     

    Firmware data (service information)

    Firmware data is necessary for functioning of internal HDD circuits and as a rule it remains hidden from users. Firmware data can be subdivided into the following types:
    • Servo information or servo fields;
    • Low-level format;
    • Resident firmware microcode (operational programs);
    • Configuration tables and settings;
    • Tables of defects.

    Servo fields are necessary for operation of a servo system used by the driving assembly of magnetic heads in a HDD; they serve for heads’ positioning and keeping them precisely over a defined track. Servo fields are recorded during the manufacturing process to an already assembled HDA through special service openings in its case. The openings are subsequently closed with sticky labels that read: Warning! DO NOT OPEN. The recording is actually performed using drive’s own heads in a special high-precision instrument – servo writer. Relocation of heads’ positioner is achieved through a motion of a special pusher of the servo writer using steady steps much smaller than the intervals between tracks.

    Firmware (microcode) of the control microprocessor is a collection of programs required for operation of HDD components. Here belong the programs used for initial diagnostics, control of spindle motor rotation, data exchange with disk controller, buffer RAM, etc. In most HDD models firmware microcode is stored within internal microcontroller ROM; some models employ external Flash ROM. In some HDD models a part of firmware programs is recorded to magnetic disk in a special firmware zone while ROM contains the programs used for initialization, and positioning together with primary loader reading the firmware data from magnetic disk to RAM. Since actual firmware modules are first loaded to RAM before execution they have been called resident modules.
    Manufacturers of hard drives record some firmware portions on disk surface not only for purposes of ROM space saving, but also to enable its easy replacement if the manufacturing process or drive operation reveal any errors in a microcode. Internet pages of most manufacturers contain links to utilities used for such updates. Overwriting disk firmware is much easier than unsoldering of hard-programmed microcontrollers. We can remember how Western Digital had to recall a large number of its drives back to factory several years ago…

    Low-level format. Track beginning is identified by an index pulse. Each track is subdivided into data sectors and servo fields. Format of each sector consists of an ID field, data field, synchronization zones and spaces. The beginning of each sector contains a synchronization zone used for phasing and synchronization of data strobe. ID field contains an address marker, physical sector address, flag byte and CRC bytes.
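    The ID field layout listed above (address marker, physical sector address, flag byte, CRC bytes) can be sketched as a packed record. Field widths, the marker value and the choice of CRC-CCITT (via Python's `binascii.crc_hqx`) are assumptions made for the example; real drives use vendor-specific layouts.

```python
import binascii
import struct

ADDRESS_MARK = 0xA1          # hypothetical marker value
ID_LAYOUT = ">BHBBB"         # marker, cylinder, head, sector, flags (assumed widths)


def build_id_field(cylinder, head, sector, flags=0):
    """Pack the ID field and append a 16-bit CRC over its contents."""
    body = struct.pack(ID_LAYOUT, ADDRESS_MARK, cylinder, head, sector, flags)
    crc = binascii.crc_hqx(body, 0xFFFF)
    return body + struct.pack(">H", crc)


def parse_id_field(raw):
    """Verify the CRC and return the decoded physical address and flags."""
    body = raw[:-2]
    (crc,) = struct.unpack(">H", raw[-2:])
    if binascii.crc_hqx(body, 0xFFFF) != crc:
        raise ValueError("ID field CRC mismatch")
    mark, cylinder, head, sector, flags = struct.unpack(ID_LAYOUT, body)
    if mark != ADDRESS_MARK:
        raise ValueError("address marker not found")
    return {"cylinder": cylinder, "head": head, "sector": sector, "flags": flags}


field = build_id_field(cylinder=100, head=1, sector=17)
print(field.hex(), parse_id_field(field))
```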
    A format without identifiers has become popular recently. When manufacturers employ this method of data placement along a track, ID fields are not used at all (thus increasing available drive capacity). Instead, a system of servo fields is used to locate physical sectors on a track. In that case all sectors on a track are read or written at once (in one disk revolution) from or to RAM containing an image of the track. Thus, to read just one sector a drive copies a whole track to RAM, and all subsequent sectors (if necessary) are read from drive RAM instead of the disk surface. Recording works the same way: to write a sector a drive reads the track, modifies it in RAM and writes the whole track back to disk.

    Configuration tables and settings of hard drives contain information about the logical and physical structure of disk space. Those tables enable PCBs, which are identical for the whole drive family, to adjust themselves to a certain drive model. As a matter of fact, the design of one model, for example an 80 GB drive based on two disks, automatically makes it possible to produce a “half-size” model with 40 GB capacity based on one disk and a “quarter-size” model with 20 GB capacity based on one side only. Thus a manufacturer can offer a greater number of models with varied capacity without considerable R&D expenses. Besides, junior models can use disks which for some reason are unsuitable for full-size models. E.g. “half-size” models can successfully use magnetic disks with defects on one of their surfaces, etc.
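    The arithmetic behind such a drive family is straightforward if every surface is assumed to hold the same amount of data (a simplification made for this sketch). The function name is hypothetical.

```python
def family_capacities(full_capacity_gb, full_heads):
    """Capacity of every model obtainable by enabling 1..full_heads surfaces."""
    per_surface = full_capacity_gb / full_heads
    return {heads: per_surface * heads for heads in range(1, full_heads + 1)}


# The 80 GB example from the text: two disks, i.e. four surfaces (heads).
for heads, capacity in family_capacities(80, 4).items():
    print(f"{heads} head(s): {capacity:.0f} GB")
```

    With these numbers one surface gives the 20 GB "quarter-size" model and two surfaces give the 40 GB "half-size" model mentioned above.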

    Tables of defects. Modern technology of magnetic disks production does not allow their defect-free manufacture. Heterogeneity of media material, polishing defects, admixtures during magnetic layer application, etc. result in appearance of areas, where data recording or reading end in errors.
    Earlier drives with the ST506/412 interface carried the table of defective tracks as a label on the HDA case, and any drive had some reserved space, e.g. the ST225 HDD (20 MB) had an actual capacity of 21.5 MB, i.e. an extra 1.5 MB was allocated for defective sectors and tracks. Modern HDDs also have extra capacity, but it is hidden from users and only the drive microcontroller can access it. A portion of that extra space is allocated to HDD firmware, configuration tables, S.M.A.R.T. counters, factory information about the HDD, tables of defects, etc. The remaining part is held in reserve for substitution of defective sectors with reserved ones.
    Tables of defects are filled by the manufacturer during internal factory testing. Numbers of all discovered BAD sectors are added into a table. Such procedure is called updating (relocation) of defects (UPDATE DEFECT). After that if a defective sector is addressed during work with a HDD, the drive itself will redirect the request to a reserved sector. Therefore all modern drives newly arriving from the manufacturing factories have no defective sectors.
    Most HDD models have two tables of defects: the Primary list (P-List) and the Grown list (G-List). The primary table is filled at the factory during internal testing, SELFSCAN (intelligent burn-in). The grown list is not filled at the factory; it is designed for the addition of defects which appear during drive operation. To enable that functionality, the list of user commands in practically all HDDs contains the “assign” command, which replaces a defective sector with a reserved one. The command is used by numerous test utilities, including those recommended by the manufacturers for work with drives that have BAD sectors. Western Digital drives have a Data Lifeguard system, which performs automatic substitution of defective sectors while a drive is idle. To perform the procedure, a drive self-tests its surfaces and transfers user data to a reserved sector, marking the defective sector as BAD; the mechanism of defect relocation is identical to the “assign” command. Fujitsu, Quantum, Maxtor and IBM implemented a mechanism of automatic defect relocation during the recording process in their drives. Thus, if data is recorded to a defective sector, the drive itself redirects the request to the reserved zone, marking the defective sector as BAD and adding its number to the G-List. Among specialized utilities used for relocation of BAD sectors we can note FUJFMT.EXE for Fujitsu drives, WDDIAG.EXE for Western Digital drives, ShDiag.exe offered by Samsung, etc.
     

    Two mechanisms of defect relocation

    When the substitution (Assign) mechanism is used in a drive the latter records to the ID field of a BAD sector the flag of the relocated sector and writes to the data field the number of the reserved sector, i.e. the one, which should be accessed for data recording or reading. As a rule, it is the first available sector after user data area.

    During data read/write operations accessing the defective sector drive controller will read the flag and assigned address and reposition the heads to the reserved zone in order to perform reading/writing from/to a good sector. Defective sectors in that case will disappear, but the drive will perform positioning to the reserved area each time it has to address a defective sector. The procedure is accompanied with clicking sounds and slight slow-down. The “Assign” procedure allows relocation only for defects in data fields. Errors pertaining to corruption of ID fields or servo fields cannot be relocated using the “Assign” method.
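    The "Assign" mechanism can be modelled as a lookup table that redirects a defective sector to the first free sector after the user data area. The class below is a simplified sketch: it keeps the mapping in a dictionary instead of writing flags into ID fields, and all names are hypothetical.

```python
class AssignRemapper:
    """Toy model of G-List style relocation by substitution."""

    def __init__(self, user_sectors, spare_sectors):
        self.next_spare = user_sectors                  # first sector after the user area
        self.spare_end = user_sectors + spare_sectors
        self.g_list = {}                                # bad sector -> reserved sector

    def assign(self, bad_sector):
        """Relocate a defective sector; repeated calls return the same spare."""
        if bad_sector in self.g_list:
            return self.g_list[bad_sector]
        if self.next_spare >= self.spare_end:
            raise RuntimeError("reserved area exhausted")
        self.g_list[bad_sector] = self.next_spare
        self.next_spare += 1
        return self.g_list[bad_sector]

    def resolve(self, sector):
        """Sector actually accessed; a relocated sector costs an extra seek."""
        return self.g_list.get(sector, sector)


drive = AssignRemapper(user_sectors=1000, spare_sectors=4)
drive.assign(10)
print(drive.resolve(10), drive.resolve(11))
```

    The jump from sector 10 to the reserved zone and back is what produces the clicking and the slight slow-down described above.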
    Another mechanism used for hiding defective sectors at manufacturing factories is skipping of defective sectors. When that method is used, the defective sector is skipped, its number is assigned to the following sector (and so on), and the last sector is shifted to the reserved zone.

    Such method of sector hiding disrupts the continuous integrity of low-level format; the system of LBA conversion to PCHS should also take into account BAD sectors while skipping them. Therefore the method requires obligatory recalculation of translator tables and low-level formatting making it impossible to preserve user data if the method is employed. Exactly for that reason the said method of relocation is applied only in special factory mode of drive operation. It is used in the FUJFMT.EXE utility designed for relocation of defects in FUJITSU drives.
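    The effect of sector skipping on the translator can be shown with a one-dimensional model in which physical sectors are simply numbered in sequence. This is a sketch under that assumption (real translators also handle zones, heads and cylinders), and `build_translator` is a hypothetical name.

```python
import bisect


def build_translator(p_list):
    """Return a function mapping a logical sector to a physical sector,
    skipping every physical sector listed in the defect table."""
    defects = sorted(set(p_list))
    # The k-th defect (0-based) shifts every logical sector >= defect - k.
    thresholds = [defect - k for k, defect in enumerate(defects)]

    def logical_to_physical(lba):
        return lba + bisect.bisect_right(thresholds, lba)

    return logical_to_physical


translate = build_translator([7, 3])
print([translate(lba) for lba in range(7)])
```

    Adding one more defect changes the physical location of every logical sector behind it, which is why this method requires recalculating the translator and reformatting, and cannot preserve user data.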
     

    Logical structure of disk space

    Considerable part of disk space in modern drives is hidden from users; it contains service data and an area reserved for substitution instead of defective sectors in a HDD. In normal operation mode it is accessible by drive microcontroller only. Users may access the working area frequently called logical disk space and it is exactly the same capacity as the value indicated in the characteristics of a certain model. Access to the working area represented by a continuous chain of logical sectors is performed in LBA notation from 0 to N. Connection between the logical disk space and physical disk format is established through a special program, i.e. a translator, which takes into account physical format, zone allocation as well as defective sectors and tracks to be skipped during operation.
    Access to firmware zone is possible only in a special drive operation mode, i.e. factory mode. A drive is switched into that mode by a key command opening access to an additional set of factory commands. Those commands are used for such operations as reading/writing of firmware zone sectors, obtaining a map with locations of modules and tables in firmware zone, access to zone allocation table, conversion of LBA into PCHS and vice versa, launch of low-level format, reading/writing to/from Flash ROM and some other actions.
    In the process of HDD design developers define firmware data required for drive operation as well as the number of cylinders occupied by firmware; therefore zero logical cylinder is the first free cylinder following the last cylinder occupied by firmware area. The structure of disk space may vary with different HDD models.

     

    Basic HDD malfunctions

    "Nothing is eternal" – that expression applies also to hard disk drives. No matter how reliable a HDD is still it is degraded with time by destructive processes.
    First, a drive is a mechanical and electronic device, and all mechanical parts gradually wear out. With time the connections between mechanical parts become slack. The numerous take-offs and landings of magnetic heads, which occur during each start and stop of disk rotation, destroy the protective layer coating the heads. However, modern manufacturing technology guarantees a rather long life for hard drives. Thus, according to the technical manual for Western Digital drives (Caviar BB/JB family), the drive withstands at least 50000 contacts between the magnetic heads and the disk surface during start/stop (Contact Start/Stop Cycles, CSS), while unrecoverable reading errors (Error Rate - Unrecoverable) appear less frequently than once per 10^14 bytes read. If we translate those figures into generally understandable terms, we get the following: provided that the drive is switched on and off ten times daily, the minimum time before any deterioration in the quality of heads or surfaces because of their contacts is about 14 years (50000 cycles / 10 cycles per day = 5000 days); and one error will occur during reading of more than 32 TB of data (which approximately corresponds to viewing movies in MP4 format non-stop for 7 - 10 years).
    Still, in real life we frequently face a totally different situation, when a brand new drive fails after a few months of operation. Numerous drives do not even last through the warranty period defined by their manufacturer. We have to note that all manufacturers except Samsung have decreased that period from three years to one. What are the reasons for this situation?
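    The start/stop lifetime estimate above is a simple division, reproduced here as a sketch. The function name is illustrative; the 50000-cycle rating and the ten cycles per day are the figures quoted in the text.

```python
def css_lifetime_years(rated_cycles, cycles_per_day):
    """Years until the rated number of contact start/stop cycles is used up."""
    days = rated_cycles / cycles_per_day
    return days / 365.25


years = css_lifetime_years(rated_cycles=50_000, cycles_per_day=10)
print(f"{years:.1f} years")   # roughly the 14 years quoted in the text
```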
     

    Normal HDD ageing malfunctions

    During correct operation of a properly assembled drive, in conformity with all requirements of its Technical Reference Manual, you can observe a normal ageing process over time. It affects magnetic disks worst of all. First, with time the magnetization of the smallest magnetic “prints”, the dibits, decreases, and a drive has to re-read some portions of disks which used to read flawlessly, or they even begin to produce reading errors. Second, the magnetic layer on disks also deteriorates, gathering scratches, chips, cracks, etc. All of the above cause the appearance of BAD sectors.
    The process of normal drive ageing is quite long and usually takes 3-5 years. We have to note that non-stop operation is even more favourable for an HDD than a mode in which the drive starts and stops frequently. Thus drives function for quite a long time in dedicated servers operating round the clock and located in a separate room or enclosure with proper climate control.
     

    Malfunctions resulting from incorrect mode of operation

    The most frequent cause of HDD malfunctions is an incorrect manner of operation; its main destructive factors include overheating, mechanical impacts and voltage surges in the HDD power supply.
    Overheating is caused by insufficient cooling of the drive case and PCB. According to the technical reference manual for Western Digital drives (Caviar BB/JB family), the allowed operational drive temperature ranges from 5 °C to 55 °C provided that air circulates around the drive all the time. The latter condition is determined by the fact that some chips on the control board (motor controllers, etc.) become much warmer than the above temperature, and heat dissipation must be arranged for them. Now let us imagine that it is summer time and the room temperature reaches 30 °C; within the computer case it will grow by another 20 - 25 °C, to the extreme values, while there is no normal air circulation because there is only one exhaust fan, in the power supply, clogged with dust, the flat cables inside form a tight knot and the drive is squeezed between a CD drive and an FDD. An open computer case does not remedy the situation because it does not facilitate air flow around the HDD.
    Another important temperature value is its gradient, which should not exceed 20 °C per hour during operation and 30 °C per hour during downtime. Exceeding the gradient is very dangerous for drive mechanics; that phenomenon is called thermal shock. Thus, if in winter you bring an HDD from a store or from a friend (where you had to read some necessary data), and it is frosty outside and 20 °C inside, then powering up the drive immediately causes sudden local heating of separate mechanical HDA parts, which may cause micro deformations of the precise drive mechanics. Such a drastic temperature change is very harmful for electronic components, too.
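    The two temperature limits discussed above (the 5 to 55 °C operating range and the 20 °C per hour operating gradient) can be checked against a log of readings. The sketch below assumes the readings come from some external sensor as (hours, °C) pairs; the function name and the log format are hypothetical.

```python
OPERATING_RANGE_C = (5.0, 55.0)     # limits quoted in the text
MAX_GRADIENT_C_PER_HOUR = 20.0      # operating gradient quoted in the text


def temperature_warnings(samples):
    """samples: time-ordered list of (hours, celsius) readings.

    Returns a list of (hours, kind) tuples, where kind is 'range' when the
    reading is outside the operating range and 'gradient' when the change
    since the previous reading is too fast.
    """
    warnings = []
    low, high = OPERATING_RANGE_C
    previous = None
    for hours, celsius in samples:
        if not low <= celsius <= high:
            warnings.append((hours, "range"))
        if previous is not None:
            dt = hours - previous[0]
            if dt > 0 and abs(celsius - previous[1]) / dt > MAX_GRADIENT_C_PER_HOUR:
                warnings.append((hours, "gradient"))
        previous = (hours, celsius)
    return warnings


# A drive carried in from the frost and powered up 15 minutes later.
print(temperature_warnings([(0.0, -10.0), (0.25, 20.0)]))
```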
    The same holds true regarding mechanical influence over HDA, i.e. impacts which are also very dangerous for precise mechanical parts of a drive. During operation as described in the previous article, spring-loaded magnetic heads fly at a low height above disks rotating at a rather high speed. An impact against HDA in that situation will cause inevitable vibration of heads which will produce a series of hits against disks, which in turn are sure to cause chipping both on disk surface and on the surface of magnetic heads.
    A very serious danger for HDD electronics comes from the power supply units powering the whole PC and, accordingly, the drive. In order to lower the price, manufacturers frequently do not install filtering circuitry either in the primary 220 V circuit or in the secondary circuits. Very frequently the rated power does not correspond to the actual values and the stabilized voltage turns out to be not so stable, although those parameters are strictly regulated for disk drives. Thus, according to the technical reference manual for Western Digital drives (Caviar BB/JB family), the allowed power supply voltage is +5 V ± 5% and +12 V ± 10%, and the allowed ripple is 100 mV in the +5 V circuit and 200 mV in the +12 V circuit. Most specialists servicing computer equipment use only voltmeters while testing power supply units, but one should keep in mind that voltage ripple, which is an important parameter, can be checked with an oscilloscope only.
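    Given a series of voltage samples (for example exported from an oscilloscope), the limits quoted above can be checked as follows. The sketch treats the allowed fluctuation as a peak-to-peak value, which is an assumption, and the names `RAILS` and `rail_ok` are hypothetical.

```python
# nominal voltage, relative tolerance, allowed fluctuation in volts
RAILS = {
    "+5V": (5.0, 0.05, 0.100),
    "+12V": (12.0, 0.10, 0.200),
}


def rail_ok(name, samples):
    """True when both the average level and the fluctuation are within limits."""
    nominal, tolerance, max_ripple = RAILS[name]
    mean = sum(samples) / len(samples)
    ripple = max(samples) - min(samples)      # peak-to-peak (assumed definition)
    return abs(mean - nominal) <= nominal * tolerance and ripple <= max_ripple


print(rail_ok("+5V", [5.00, 5.02, 4.98]))    # level and ripple both fine
print(rail_ok("+5V", [5.00, 5.10, 4.90]))    # average is fine, ripple is not
```

    The second call illustrates the point made in the text: a voltmeter showing the average 5.0 V would not reveal the excessive ripple.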
     

    Construction-related malfunctions

    Quality of HDDs has decreased lately; that fact is confirmed by reduction of warranty period by many manufacturers. To some extent it is caused by stiff competition between them and the resulting race for production of cheap drives. It is also connected with growing technological standards, a sort of a race for density increase and achievement of higher capacity per disk. As a consequence vendors frequently use in their HDDs solutions, materials and technologies, which have not been thoroughly tested and verified; thus imperfect products appear in the market and then in possession of end users. After some time manufacturers analyze malfunctions of drives returned during their warranty period and attempt to eliminate drawbacks in their construction, but those attempts are not always successful.
    Theoretically such approach to drive design and production may cause problems with any drive part. We can single out the most frequent troubles:
    Bad contact in pin connector between PCB and preamplifier chip connected to magnetic heads' assembly. The consequences of a poor contact may be quite numerous. First of all, it causes appearance of bad sectors. But those sectors differ from common defects caused by poor surface quality. The difference manifests itself in the fact that the surface remains intact but bad contact causes recording of invalid data to service bytes of some sectors, e.g. to the field containing CRC code of the sector. The problem may also lead to corruption of firmware data, which cannot be restored by the drive itself during the next power-up; besides, there is no user mode for such restoration. Firmware data of a drive can be restored in the factory mode only.
    Poor quality of chip soldering at the factory. Such a workmanship flaw as a rule becomes obvious after approximately a year of drive operation. It usually manifests itself as a loss of contact, i.e. after some period of normal operation a drive either switches off and does not start again (“hangs”) or begins to produce knocking sounds with its heads; the latter situation may result in damage to its mechanical parts. Just like the previous flaw, it may also cause firmware corruption.
    Insufficient quality of chips becoming defective even at heating values, which do not exceed allowed limits. The fault can be repaired by replacing the defective chip with an identical operational one.
    Imperfect construction of fluid dynamic bearings, which causes accumulation of wear particles in the lubricant and results in spindle motor seizure.
    There are also cases when disks are not fixed on the spindle properly; as a result, disk runout (wobble) grows steadily and destroys the bearing in the spindle motor. Considerable noise begins to accompany drive operation, and after some time defective sectors appear because the runout leads to incorrect reading of some tracks.
    Poor quality of Flash ROM chips, which may lose the firmware code stored therein because of charge leakage when heated. ROM can be overwritten either in a special ROM chip programmer or using the drive itself in the factory mode.
    Errors in drive firmware microcode. Manufacturers keep the nature of such errors secret; however, firmware updates are issued quite regularly. It would be a mistake to believe that the errors do not influence a drive's operability, because in some cases they may result in damage to drive mechanics.

    Technologies used for maintaining HDD reliability

    With all the complications HDD manufacturers are constantly trying to make user data storage more reliable. To accomplish that they use various methods and technologies in their drives.

    S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) is intended to inform hard drive users about the status of the drive's main parameters. Many motherboard BIOSes analyze those parameters at computer power-up, and if some critical parameter crosses its failure threshold an informational message is displayed during start-up. Of course, it does not mean that the drive will stop functioning, but the user should take some steps in that situation, for example, prepare a backup copy of valuable data. If the computer BIOS does not contain an analyzer of S.M.A.R.T. attributes, you can use an external diagnostic utility launched from within the operating system.
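    The threshold check performed by such an analyzer can be sketched as follows. This is a minimal illustration, not the code of any particular BIOS or utility: the attribute list is made-up sample data, and the rule that an attribute fails when its normalized value drops to or below a non-zero threshold is the common S.M.A.R.T. convention assumed here.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int      # normalized current value reported by the drive
    threshold: int  # vendor-defined failure threshold (0 = informational only)


def failing_attributes(attrs):
    """Return the attributes whose normalized value has reached the threshold."""
    return [a for a in attrs if a.threshold > 0 and a.value <= a.threshold]


# Hypothetical sample data, for illustration only.
sample = [
    SmartAttribute(5, "Reallocated_Sector_Ct", 36, 36),
    SmartAttribute(9, "Power_On_Hours", 95, 0),
    SmartAttribute(194, "Temperature_Celsius", 110, 0),
]

for a in failing_attributes(sample):
    print(f"WARNING: attribute {a.attr_id} ({a.name}) "
          f"is at {a.value}, threshold {a.threshold}; back up your data")
```

    A warning from such a check is a reason to make a backup, not proof that the drive has already failed.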

    For greater reliability practically all drives use a technology that hides and relocates defects as they occur during operation. Details of its implementation vary between drive models; however, they are all based upon the same principle. If the operating system attempts to access a sector that cannot be read or written, the drive replaces it, if sufficient reserved space remains, with a sector from the reserved zone (an operation called "assign"). The table of sectors substituted in this way is stored in the drive firmware zone, and the drive loads it into controller RAM at power-up.
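    The principle can be illustrated with a toy model of the substitution table. The class and method names below are hypothetical and the real table format is vendor-specific; the sketch only shows a bad sector being assigned a spare from a limited reserve and later accesses being redirected.

```python
class RemapTable:
    """Toy model of a defect substitution table (illustrative only)."""

    def __init__(self, reserve_start, reserve_count):
        # Spare sectors available in the reserved zone.
        self.free = list(range(reserve_start, reserve_start + reserve_count))
        # Maps a defective sector number to its spare sector.
        self.table = {}

    def assign(self, bad_sector):
        """Substitute a spare for bad_sector; return None if the reserve is empty."""
        if bad_sector in self.table:
            return self.table[bad_sector]
        if not self.free:
            return None
        spare = self.free.pop(0)
        self.table[bad_sector] = spare
        return spare

    def resolve(self, sector):
        """Return the sector actually accessed for a requested sector number."""
        return self.table.get(sector, sector)


table = RemapTable(reserve_start=1000, reserve_count=2)
table.assign(42)
print(table.resolve(42), table.resolve(43))
```

    Once the reserve is exhausted, `assign` returns `None`; in a real drive that is the point at which defects become visible to the user as BAD sectors.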

    Impact sensors found in all drives also belong to the technologies used for protection against malfunctions. Such a sensor is a piezoelectric element producing an electric pulse at mechanical shock. Filtering of sensor pulses allows identification of obvious impacts. When a drive detects a shock, it parks the magnetic heads. One peculiarity of impact sensor installation is the angle of its mounting relative to the front case line: it is equal to 45°.

    In recent models manufacturers have begun to make wide use of temperature sensors on the PCB and in the head block. Temperature information is monitored by the drive processor, and the drive stops operation if the allowed value is exceeded. In some drive models temperature is output as a S.M.A.R.T. attribute value, and there are programs (usually available from the web pages of HDD manufacturers) which allow viewing it.

    Fundamentals of searching for malfunctions

    The description above should demonstrate that a HDD is a sophisticated software and hardware device combining electronic and mechanical parts and utilizing the most recent achievements of microelectronics, micromechanics, automatic control theory, magnetic recording theory, and coding theory. HDD repair is impossible without specialized knowledge, special equipment, instruments and tools, and without a specifically equipped location (clean room). However, an expert in computer hardware can perform primary diagnostics of HDD and repair simple failures, perform operations over BAD sectors using software offered by HDD manufacturers.
    In the absence of special diagnostic equipment and software HDD diagnostics should begin with connection to an individual PC power supply unit. Operator’s hearing is the diagnostic tool in that case. At power-up a HDD spins up the spindle motor, sound level increases for 4 - 7 sec., then a click follows (heads are moved from the parking zone) and very specific recalibration crackling noise that lasts 1-2 sec. It is easy to get used to such drive behaviour by connecting a known good HDD to a power supply unit.
    Recalibration procedure performed by a drive demonstrates at least operability of the reset circuit, its clock, microcontroller, spindle motor control circuit and positioning system, data conversion channel, normal status of magnetic heads (at least one of them, the one used for the initialization process) and drive firmware data.
    For further diagnostics a HDD has to be connected to the Secondary IDE port and automatically detected in BIOS through the SetUp procedure. If the model of the HDD being checked is recognized, the operating system loads and computer starts diagnostic software. OS can be started from a working HDD connected to Primary IDE port or from a floppy disk. The easiest diagnostics would be an attempt to create a partition on the drive being checked using FDISK procedure and subsequent formatting procedure with Format d:/u command. Formatting in DOS or Windows OS does not accomplish the actual “formatting”, instead the OS performs surface verification, creating in the end a file system structure selected for the partition. If formatting (verification) reveals any defects, they will be displayed on-screen as BAD sectors. Of course, such diagnostics is primitive and aimed rather towards checking HDD operability than discovery of malfunction causes or, moreover, their elimination. More detailed diagnostics can be performed using utilities recommended by manufacturers and available from their web pages.
    All the above utilities perform testing in regular user mode and do not switch drives to factory mode; therefore their features are rather limited. Specialized diagnostic utilities are not offered for free; instead they are distributed to special service centers and dealers of drive manufacturers.
    Let us show an example of searching for malfunction in the spindle motor control circuit of a Caviar HDD manufactured by Western Digital.
    The layout scheme below is used in WDAC32500 and WDAC33100 drive families and takes into account all ratings and serial numbers of components, but it is also applicable for repair of WDAC2340, WDAC2420, WDAC2540, WDAC2700, WDAC2850, WDAC33100, WDAC31200, WDAC21200, and WDAC31600 drive families if you ignore serial numbers of components and assume that some ratings differ from the values shown in the layout scheme.
    If the spindle motor does not start at HDD power-up, you should first make sure that the HDA is operational by connecting it to a known good PCB. If there is no such opportunity, check the resistance of the coils (phases) of the spindle motor; it should be about 2 Ohm relative to the common (middle) output. Then continue to look for the malfunction on the PCB. (Inability to start a spindle motor frequently results from magnetic heads sticking to the disks.)
    In order to check a PCB for failed components, you should remove it from the HDA, connect to an external power supply and position it on the worktable with electronic components facing up. Further operations will require an oscilloscope with sweep frequency up to 50 MHz.
    First of all, switch on power and check the +5 V and +12 V supply voltages at the outputs of the U3 and U6 chips (see layout scheme), then check that the quartz resonator is oscillating at pins 24 and 33 of the U6 chip. Then check for the presence of clock pulses supplied to the U9 control microprocessor and the U11 reading channel at pins 57 and 13 respectively. After that make sure that there is no RESET signal (active level 0). If all the requirements are met, the control microprocessor will start and perform the initialization procedure, programming all chips connected to the internal data bus. You can check microprocessor operability indirectly, judging by the presence of control pulses: ALE, RD#, WR#, data bus pulses, etc.
    To check the spindle motor control circuit, set the oscilloscope to a 10 ms/div sweep with 2 V/div amplification (it is advisable to use a 1:10 probe). After power-up check for the presence of motor start pulses with 11 - 12 V amplitude on all three phases (connections J14, J13, J12). The control circuit will try to start the motor for 1 - 2 min., then it will discontinue the attempts. After that you should switch power off and on, or send a RESET command by shorting lines 1 and 2 in the IDE interface connector with tweezers. If the voltage is lower than 10 V for any phase, then the U3 chip is malfunctioning. As a result of such a failure the spindle motor most likely spins up but remains unable to reach its rated rotational speed and, consequently, the magnetic heads cannot be moved from the parking zone. The rotational speed of the spindle motor can be monitored using the INDEX pulses at the E35 control point (if the PCB is connected to the HDA). The period of the INDEX pulses is ~12 ms; their width is about 140 nanoseconds. The U3 chip is controlled by the U6 synchronization controller chip and the SPINDLE START signal of the spindle motor. For motor start SPINDLE START = 1, for motor stop it is 0.
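    The relation between the INDEX interval read off the oscilloscope and the rotational speed is simple arithmetic, sketched below under the assumption that INDEX is issued once per disk revolution. The function names, the rated speed passed in the example and the 1% tolerance are illustrative choices, not values from a drive specification.

```python
def rpm_from_index_period(period_ms):
    """Rotational speed in RPM, assuming one INDEX pulse per revolution."""
    if period_ms <= 0:
        raise ValueError("INDEX period must be positive")
    return 60_000.0 / period_ms


def at_rated_speed(period_ms, rated_rpm, tolerance=0.01):
    """True if the measured speed is within the given fraction of rated_rpm."""
    rpm = rpm_from_index_period(period_ms)
    return abs(rpm - rated_rpm) <= tolerance * rated_rpm


# An interval of about 12 ms between INDEX pulses corresponds to about 5000 RPM.
print(round(rpm_from_index_period(12.0)))
```

    A spindle that spins up but never reaches rated speed would show a noticeably longer INDEX interval than a known good drive of the same model.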
    Phase distribution is controlled by the U6 chip through its Fc1 - Fc6 outputs; it uses the TTL range of control signals. Feedback of rotational speed is accomplished through the 32P4910A U11 reading channel chip using the SERVO READ DATA line. In its turn, the U6 synchronization controller chip generates the signal for servo field search (SERVO GATE) for the U11 chip.
    The signals can be viewed more conveniently using an oscilloscope with a bandwidth of 100 MHz or greater, since INDEX pulses and the servo marker last for about 140 nanoseconds (it is also advisable to use a 1:10 probe). Monitoring should be performed using two sources, synchronizing the oscilloscope by INDEX or by the servo marker. It may be interesting to watch not only servo signals at the E37 control point but also data reading signals in general at the E13 and E7 control points, where one can see all synchronization fields, sectors, etc.

    Details on the functioning of the control microprocessor, data reading channel and spindle motor control chip are available from Intel, Silicon Systems Inc. and SGS-Thomson respectively: www.intel.com and www.st.com

    Some typical malfunctions of hard drives and methods of their repair

    Hard drive repair usually requires special, complicated equipment, but sometimes a desoldering station and a chip programmer are all you need. In the last part of our descriptive survey we would like to address some typical malfunctions of hard drives and methods of their repair.

    As we have mentioned in our previous articles devoted to problems with hard disk drives, a drive consists of 2 main parts: a mechanical part (heads-and-disk assembly) and electronics (control printed circuit board). Those two components are supplemented with internal firmware, which is partially stored in ROM on PCB and partially resides within firmware zone of a drive (that latter portion is loaded to RAM of HDD microcontroller during its initialization). Those three components interact very closely and normal HDD operation is possible only when all of them function properly. Consequently a drive malfunction may result with equal probability from failure of any of the mentioned components, and that can be observed in real life. Moreover, in various HDD models from different manufacturers the frequency and degree of damage to different components is not the same. When a HDD has to be repaired in conditions offered by a regular (not specialized) laboratory we have to decline some repair orders. In the first place, it pertains to the repair of HDD mechanics – HDA, secondly – to the service data in the firmware area of a drive.

    The difficulty of HDA repair is connected, first of all, with the exceptional purity of the air contained under normal pressure inside the case (no more than 100 dust particles per 1 cubic meter of air). Opening a case in ordinary premises or in common laboratory conditions will inevitably let dust inside (in ordinary rooms 1 cubic meter of air contains approximately 600 dust particles), and that is sure to damage the precision mechanics. The few companies that repair drive mechanics use special clean rooms or clean worktables (tables equipped with a special “aquarium” with sleeves inside for performing the necessary work). In addition, a whole set of specialized tools is required, including Torx (T-type) screwdrivers from T3 to T9, hex screwdrivers, mounting supports that hold an HDA firmly during work, and various lifters for the head blocks of different HDD types. To that list we should add the requirements for the engineering personnel who perform such jobs: they should be accurate, move precisely and, certainly, have experience. One incorrect motion with a tool, or a finger touching the magnetic disks, will render drive repair impossible at once or will make it more complicated by at least an order of magnitude. It is because of those pitfalls that most companies possessing specialized equipment for HDD repair do not undertake work on mechanical parts.

    The simplest drive repair consists in restoring software modules in the firmware zone. Module corruption is one of three possible HDD malfunctions; it renders a drive inoperable although all mechanical and electronic parts remain completely intact. As a rule, a drive with such a defect is not visible in the computer BIOS and any attempt to access it ends with an ABRT error (the command cannot be executed). Repairing such a malfunction requires only overwriting the corrupted module, after which the drive becomes operational again. The procedure takes 5-10 minutes on average. However, the seeming simplicity of the solution hides a complicated implementation. Modules can be written only in a special factory mode of drive operation. A drive is switched into that mode by special commands (the so-called key), which differ not only between manufacturers but also between drive families of one manufacturer, and those commands are kept secret. Firmware structure may also differ considerably. Modules can be overwritten with copies obtained from identical models, taking into account firmware version and module type. Incorrect module overwriting or recording of an incompatible module version may damage a drive once and for all. For example, erroneous recording of a configuration module with information about the number of magnetic heads may make the firmware address a non-existent head during initialization at power-up. The drive will then knock endlessly, hitting its positioner against the limiting stop, and will eventually damage its magnetic surfaces if it is not switched off in time; the problem will recur at every subsequent power-up. Therefore operations on the firmware zone should be as careful and accurate as work on the drive mechanics, i.e. the HDA. That is why drive manufacturers password-protect access to it and keep it secret.
    Thus, for all the simplicity of repairing drives with damaged firmware data, such procedures are not possible without special software, and frequently not without a whole hardware and software complex. Such a complex may include a host of technological utilities (an individual utility exists for each drive family), and users also need documentation: a clear methodology for testing and restoring a failing firmware zone, which is likewise individual for each drive. The high cost of such equipment puts it out of reach for many, so we shall describe methods of HDD repair that do not require specialized tools, devices and software.

    One of the basic principles of any repair reads "do not make it any worse". That is why it is important to diagnose the malfunction accurately and, if it is caused by the HDD mechanics or corrupted firmware data, to decline the repair and send the customer to a specialized service centre. As an example we shall discuss the analysis of a very widespread malfunction, "HDD knocking".

    If at power-up a drive produces periodic knocking sounds (hitting its positioner against the limiting stop), it means that the drive is unable to read servo information from disks’ surfaces. There may be a lot of reasons for that:
    • malfunctioning magnetic heads;
    • malfunctioning preamplifier/commutator located inside HDA in the immediate vicinity of the heads;
    • malfunctioning PCB, namely:
      • reading/data conversion channel;
      • positioner controller microchip;
      • supply circuits (stabilizers, filters, generators of negative voltages).
    In addition to the above list, such a malfunction may be caused by incorrect recording of firmware modules, when a non-existent head is selected and, as a result, the stream of servo data is missing. Precise diagnostics of that malfunction is difficult even for an experienced specialist in HDD repair, but a few tricks can simplify the task a little. First of all, you need to identify where the cause of the malfunction is located: in the HDA or on the control board. To do so, remove the drive’s PCB and replace it with a known good board from the same model with an identical firmware version. A plain swap is not possible for all models: recent Seagate models and Fujitsu MPG3xxxAT drives keep unique adaptive parameters in ROM, so during a PCB swap the original ROM must be transferred to the replacement board as well. If knocking stops and the drive reports readiness, check the original board for the cause of the malfunction. If the drive keeps knocking with a known good board, the cause is inside the HDA, and in that case it is time to give up the repair. Under no circumstances should you open the HDA just to see what has happened inside. Most likely you will not see any visible faults, but the damage from opening will be considerable. Thus, of all the possible types of HDD malfunctions, only repair of the electronics board can be recommended for a regular laboratory without special equipment.
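    The board-swap test described above amounts to a short decision procedure, sketched here. The function and parameter names are hypothetical; the sketch only restates the reasoning of the text and is no substitute for hands-on diagnostics.

```python
def diagnose_knocking(knocks_with_good_pcb,
                      rom_has_adaptive_data=False,
                      rom_transferred=False):
    """Suggest the next step after testing a knocking drive with a donor PCB."""
    # Some families keep unique adaptive parameters in ROM, so the test
    # is only meaningful if the original ROM moved to the donor board.
    if rom_has_adaptive_data and not rom_transferred:
        return "inconclusive: transfer the original ROM to the donor board and retest"
    if knocks_with_good_pcb:
        return "fault inside the HDA: do not open it, refer to a specialized service centre"
    return ("fault on the original PCB: check the read channel, "
            "positioner controller and supply circuits")


print(diagnose_knocking(knocks_with_good_pcb=False))
print(diagnose_knocking(knocks_with_good_pcb=True))
```

    Note that a drive that keeps knocking with a good board may also suffer from incorrectly recorded firmware modules, which a regular laboratory cannot repair either.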

    Most frequent typical malfunctions in various HDD families
     

    Manufacturer: Quantum
    Quantum Fireball drive families: EL, EX, CR, CX, lct08, lct10, lct15

    Malfunction signs: A drive operates normally for some time (from 15 minutes to several hours), then it begins to hit its positioner against the limiting stop.

    It is a very frequent malfunction in those drive families. It is caused by the chip controlling the spindle motor and positioner (please see the table): because of poor factory soldering the chip overheats and stops functioning normally.

    Table
    HDD family                Spindle motor and positioner control chip    Possible replacement part
    Quantum Fireball EL       Philips TDA5147BH                            Panasonic AN8427FBP
    Quantum Fireball EX       Philips TDA5147BH                            Panasonic AN8427FBP
    Quantum Fireball CR       Philips TDA5147BH                            Panasonic AN8427FBP
    Quantum Fireball CX       Philips TDA5247HT                            Panasonic AN8428NGAR
    Quantum Fireball lct08    Philips TDA5247HT                            Panasonic AN8428NGAR
    Quantum Fireball lct10    Philips TDA5247HT                            Panasonic AN8428NGAR
    Quantum Fireball lct15    Philips TDA5247HT                            Panasonic AN8428NGAR

    One peculiarity of the TDA5247HT (AN8428NGAR) microchip is a solderable area on the underside of its case, which also acts as its heatsink. It draws heat away from the chip and dissipates it across the board. For that reason mounting and removal of the chip should be performed with a hot-air station.
    To repair that malfunction, you should unsolder the chip, broaden the soldering pad (a scalpel can be used to remove a portion of the protective layer), tin the pad and the underside of the chip, and solder the chip back, pressing its case gently during soldering so that solder shows through the board openings on the other side. Then carefully wash the soldered area, because the chip has high-impedance analog pins and flux residue may disturb its normal operation.

    That method undoubtedly improves the thermal conditions of the chip, but it does not always yield positive results. If a chip has been overheated for a long time, resoldering does not remedy the situation. In that case the chip should be replaced. It is advisable to replace it with the equivalent Panasonic part, which has better thermal characteristics. Such chips can be purchased at stores selling electronic components; the price may vary from $5 to $10.
     

    Manufacturer: Fujitsu
    M1638TAU drive family

    Malfunction signs: The spindle motor does not start

    The connection scheme of the VCM (Voice Coil Motor) & SPM (Spindle Motor) controller is practically identical for the following drive families: M1614TAU, M1638TAU, MPA30xxAT, MPB30xxAT and MPC30xxAT.

    The VCM & SPM controller drives a 3-phase motor; it is programmed by the MB9004 processor produced by Fujitsu. There are three modes of spindle motor control: start mode, acceleration mode and stable rotation mode. In the start mode, at power-up, the Power Monitor (MB3771) sends a “reset” signal to the microprocessor (MB9004) and the VCM & SPM controller. The microprocessor uses a serial channel to program the internal registers of the VCM & SPM controller for a start and charges the pump capacitor of the controller using the “Charge pump” signal. The amount of charge determines the current which will flow to the spindle motor. As soon as the capacitor is charged sufficiently, the microprocessor programs the SPM controller for the start mode, and a current of ~1.3 A flows to the spindle motor. The controller generates phase switching signals. The spindle motor then begins to rotate, generating back EMF. The controller detects the EMF and notifies the microprocessor, which uses the signal for rotation control. In the acceleration mode the microprocessor speeds up phase switching and measures the spindle motor rotational speed until it reaches 5400 RPM. When that speed is reached, the controller switches to stable rotation. In that mode the microprocessor calculates the time required for one spindle motor revolution on the basis of the phase signal and adjusts the rotational speed by charging or discharging the pump capacitor. Adjustment (charge/discharge) is performed every 1/6 of a spindle revolution.
    Diagnostics is complicated by the fact that the SPM controller monitors the EMF generated during spindle rotation, and during a spin-up attempt it makes just 2 - 3 phase switches, which are difficult to track with an oscilloscope. If the spindle does not begin to rotate (for whatever reason), the controller, as a rule, either switches off or retries after some time. Thus, with a regular oscilloscope you can see only the presence of pulses falling within a certain range, which is insufficient for complete diagnostics. Ideally we would recommend a 3-channel storage oscilloscope operating in the automatic recorder mode, but such a device is hardly commonplace. In practice, therefore, it is possible only to check for the presence of pulses on the motor phases.
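    The three control modes can be pictured as a small state machine, and the adjustment interval follows directly from the rated speed. The sketch below is only an illustration: the transition conditions are a simplified reading of the description above, and the names are hypothetical rather than taken from Fujitsu firmware.

```python
from enum import Enum


class SpindleMode(Enum):
    START = "start"
    ACCELERATION = "acceleration"
    STABLE = "stable rotation"


RATED_RPM = 5400  # rated speed given in the text


def next_mode(mode, back_emf_detected, rpm):
    """Return the next control mode for the current measurements."""
    # Assumed rule: acceleration begins once back EMF shows the rotor is turning.
    if mode is SpindleMode.START and back_emf_detected:
        return SpindleMode.ACCELERATION
    # Stable rotation begins when the rated speed has been reached.
    if mode is SpindleMode.ACCELERATION and rpm >= RATED_RPM:
        return SpindleMode.STABLE
    return mode


def adjustment_interval_ms(rpm=RATED_RPM):
    """Time between speed adjustments, made every 1/6 of a revolution."""
    return 60_000.0 / rpm / 6


print(f"{adjustment_interval_ms():.2f} ms between adjustments at {RATED_RPM} RPM")
```

    At 5400 RPM one revolution takes about 11.1 ms, so the pump capacitor is adjusted roughly every 1.85 ms.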
    The VCM & SPM controller is quite a reliable microchip and it rarely fails. More frequently a spindle motor does not start because of other malfunctions. Still, if the chip does fail, the failure is usually caused by overheating, with clearly visible traces on the chip case. During repair of the start circuit you should check the Stop Spindle signal from the MB3771 chip. The signal forces parking of the magnetic heads and stops the spindle motor through the Q8 and Q9 transistor switches. The active level of that signal in the parking mode is “1”, in the operational drive mode it is “0”. If a spindle motor begins to spin up, you can check the operation of the output switches of the HA13525A chip controlling the phase signals with an oscilloscope. To do so select a 10 ms/div sweep with 2 V/div amplification (it is advisable to use a 1:10 probe). A phase may be shunted by a broken-down Q8 or Q9 switch. The HA13525A and HA13525B chips are compatible in one direction only: in models belonging to the M1638TAU and MPA drive families either chip can be used, while in the MPB and MPC drive families only the HA13525B is allowed.
     

    Manufacturer: Fujitsu
    drive families: MPB, MPC

    Malfunction signs: A drive begins to report a capacity higher than its actual rated value, the so-called "megalomania".

    That malfunction is quite frequent in the above-mentioned drive families; it is caused by corruption of firmware in the Flash ROM chip on the control board of the drive. Those drive families employ Flash ROM chips with a 64K x 16-bit word organization; the programming voltage is 5 or 12 V and the package type is PLCC44.
    Elimination of that malfunction requires just reprogramming the Flash chip with known good firmware of the corresponding version. The version number in Fujitsu drives is indicated in the lower right corner of the label on the HDA, below the bar code, and has the form xyy-zzzz, where x is the month of manufacture in hexadecimal notation, yy is the version prefix and zzzz is the actual firmware version, e.g. C02-2009. For version compatibility in the MPB and MPC drive families a match of the actual version is sufficient; the prefix and month of manufacture are not important.
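    The label format and the compatibility rule can be expressed in a few lines. The sketch assumes the label has exactly the form xyy-zzzz described above; the function names are hypothetical, and the second label in the usage example is made up for illustration.

```python
import re
from typing import NamedTuple


class FirmwareLabel(NamedTuple):
    month: int     # month of manufacture, decoded from one hexadecimal digit
    prefix: str    # version prefix (yy)
    version: str   # actual firmware version (zzzz)


# One hex digit, two prefix characters, a hyphen, four version characters.
_LABEL = re.compile(r"^([0-9A-Fa-f])(\w{2})-(\w{4})$")


def parse_label(label):
    """Split a label of the form xyy-zzzz into its three fields."""
    m = _LABEL.match(label.strip())
    if m is None:
        raise ValueError(f"unexpected label format: {label!r}")
    return FirmwareLabel(int(m.group(1), 16), m.group(2), m.group(3))


def compatible_mpb_mpc(label_a, label_b):
    """For MPB/MPC drives only the actual version (zzzz) has to match."""
    return parse_label(label_a).version == parse_label(label_b).version


print(parse_label("C02-2009"))
print(compatible_mpb_mpc("C02-2009", "A05-2009"))
```

    The check says nothing about other drive families, for which the text gives no compatibility rule.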
     

    Manufacturer: Fujitsu
    MPG3xxxAT/AH drive family

    Malfunction signs: Quite unexpectedly for user and user data a drive is no longer identified in PC BIOS.

    We should note that this drive model has set a record for failures, which in most cases happen after a year of operation, just after the warranty period ends. The main cause of the malfunction is the Cirrus Logic CL-SH8671-450E chip. It can hardly be replaced with a working chip, because those microcircuits were produced to a special Fujitsu order and manufacture of that drive family was discontinued long ago. There is, however, a method of “reviving” a malfunctioning chip which extends HDD life a little. If you ignore drive “hangs” and do not take due steps (at the very least, back up valuable data), the table of S.M.A.R.T. logs in the firmware zone will gradually overflow and the drive will additionally corrupt its modules in the firmware zone, which cannot be restored without specialized software.

    One of the theories explaining the problems with those chips is the use of a new polymer compound in production of the chip case. The compound decomposes under the influence of increased temperature in humid conditions, producing phosphoric acid. But it is just a theory; we may never learn whether it is so or not. One thing is known for sure: if you unsolder that chip, remove the old solder from its pins and from the contact pads on the board, wash the chip location and then solder the chip back, the drive will begin to work properly.
     

    Manufacturer: IBM
    drive families: DJNA, DPTA, DTLA, AVER, AVVA

    Malfunction signs: A drive spins up the spindle motor, recalibrates itself, reports readiness and is identified correctly by BIOS, but at a reading attempt the drive produces "scratching" sounds and reveals numerous BAD sectors on its surfaces.

    That malfunction is connected with a mismatch between the cyclic redundancy check code in the data fields and the information recorded in the sector service field. Such a situation appears when recording to a sector is left unfinished. That may result from lost contact at the connector between the PCB and HDA. The connector consists of needle-like pins touching tinned pads on the PCB. With time the soft solder is pierced by the pins and contact quality deteriorates.

    In order to repair that malfunction you should remove the control board, clean the old solder off the contact pads and tin them again using silver-based solder, then carefully wash the soldered area. Install the board back on the HDA. Then you will have to clear the whole disk surface, overwriting it with any data pattern using freely available software (please see part 4); that will record correct CRC codes.
     

    Manufacturer: Seagate
    drive families: Seagate Barracuda IV, V and 7200.7

    A very common flaw is breakdown of the protective diode on the +12 V circuit, which causes the computer power supply unit to shut down. The appearance of the component does not reveal the damage, because its case remains unaffected. An attempt to connect a drive so damaged to a working power supply for diagnostics will most likely result in failure of the latter. Therefore, if such a drive is brought for repair, you should first of all probe between the 0 V and +12 V lines with an ordinary tester to check for a short circuit.

    The protective diode, originally designed using the "Transil" technology at SGS-Thomson, is intended to protect electronic circuitry from short power supply peaks lasting no more than 10 - 20 microseconds. Its frequent failures show that HDD designers did not expect to encounter power supply units of such poor quality. Drive operation can be resumed simply by removing the damaged element from the circuit, but we cannot guarantee flawless HDD operation without that component.
     

    Conclusion
     

    In conclusion we would like to warn regular users of personal computers who have to restore their data from a damaged hard drive: "You have only one recovery attempt". The probability of data recovery falls with the square of the number of visits to service centres. Many very serious specialized service centres performing data recovery refuse to work with drives whose head-and-disk assembly has been opened; others multiply the cost of their work.
    For beginning experts willing to devote their time to HDD repair and data recovery and wishing to increase their knowledge in that sphere we would recommend reading more specialized documentation at our website. In addition we would recommend the forum of iXBT web site at http://www.ixbt.com/, and the following literature:

      1. J. Goodman. "Hard Disk Mysteries Revealed", translated by V.L. Grigoriev. Kiev: "Dialektika", 1994. 256 p., illustrated.

      2. M. Gouk. "PC disk subsystem". St. Petersburg: "Piter", 2001. 336 p., illustrated.

      3. "Office informational security". Practical scientific collection of works. First issue: "Technical means for data protection". K.: "TID DS" LLC, 2003. 216 p.

      V.Morozov & S.Yatsenko