The Boitho search engine now runs on 7 independent servers. Each has 4 SATA disks, for a total of 28 disks. The problem is that we are having a lot of disk crashes; we have lost some 6 disks in total now.
Every time that happens, we have a system that can work out which pages were on the failed disk and recrawl them.
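Roughly the idea, as a simplified sketch (the flat placement log mapping each URL to a disk is made up here for illustration, not our actual format):

    # Sketch: find every URL whose page data lived on a failed disk,
    # assuming a plain-text log of "url disk" pairs (hypothetical format).
    def urls_on_disk(placement_log, failed_disk):
        with open(placement_log) as f:
            for line in f:
                url, disk = line.rsplit(None, 1)
                if disk == failed_disk:
                    yield url  # hand this back to the crawl queue

The URLs this yields are fed back into the crawler, which is what makes the recrawl so time consuming.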
The recrawl is time consuming, so we are thinking about switching to RAID 5.
I have never really tested RAID in a high-performance system. According to http://www.pcguide.com/ref/hdd/perf/raid/levels/singleLevel5-c.html, “the overhead necessary in dealing with the parity continues to bog down writes”.
How bad is this slowdown in practice?
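To put a rough number on it: a small random write on RAID 5 costs 4 disk I/Os (read old data, read old parity, write new data, write new parity). A back-of-the-envelope comparison against driving the 4 disks independently, assuming a ballpark 100 random IOPS per SATA disk:

    # RAID 5 small-write penalty, back of the envelope.
    # Assumption: ~100 random IOPS per SATA disk (illustrative figure).
    disks = 4
    iops_per_disk = 100

    jbod_write_iops = disks * iops_per_disk        # 4 independent disks
    raid5_write_iops = disks * iops_per_disk // 4  # 4 I/Os per logical write

    print("independent disks:", jbod_write_iops)   # 400
    print("RAID 5           :", raid5_write_iops)  # 100

So for small random writes we would expect roughly a 4x drop; large sequential (full-stripe) writes fare much better, since the parity can be computed from the new data alone without the extra reads.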
Disk I/O is a big bottleneck today. We work around this by running 4 processes in parallel, each indexing data on one disk, thereby using all 4 disks at once. If we change to RAID 5, this method will not work.
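In a simplified sketch (the mount points and the per-partition indexing routine are placeholders for our real setup):

    # One indexer process per disk, so all 4 spindles work at once.
    from multiprocessing import Process

    def index_partition(mount_point):
        # placeholder: walk mount_point and build its index shard
        pass

    if __name__ == "__main__":
        procs = [Process(target=index_partition, args=("/disk%d" % i,))
                 for i in range(1, 5)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

With RAID 5 there is only one logical volume, so the 4 indexers would compete for the same array instead of each owning a spindle.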
Has anyone seen any research on this?
When we grow bigger we will use a “redundant array of inexpensive nodes”, where all data resides on at least 3 independent servers. If one fails, we can just add another and copy the data from the two remaining servers. Google uses this approach in the “Google file system”.
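A toy sketch of the recovery side of that idea (node and shard names are purely illustrative):

    # Each item lives on 3 nodes; losing one node leaves two copies
    # that a replacement node can be filled from.
    placement = {"shard-a": ["n1", "n2", "n3"],
                 "shard-b": ["n2", "n4", "n5"]}

    def recovery_sources(placement, failed_node):
        for item, nodes in placement.items():
            if failed_node in nodes:
                yield item, [n for n in nodes if n != failed_node]

    for item, sources in recovery_sources(placement, "n2"):
        print("restore %s onto the new node from %s" % (item, sources))

Since every item still has two live copies, the copying can even be load-balanced across both survivors.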