
Introduction

General information

HOWTO content

This HOWTO describes methodology and tips for (networked) storage performance measurements.
As members of TERENA's TF-Storage, we have collected general information on HOW TO
prepare and perform storage performance testing, based on our own experience and
knowledge as well as on material available on the Internet.

This HOWTO re-uses the presentation prepared by IBBT/Ghent University,
Storage Benchmarking Cookbook.

HOWTO limitations

Obviously, the world is too complicated to be described in a single work.
However, we hope that both the background and the practical, hands-on information included in this HOWTO
can be useful for the community.
The HOWTO does not cover all possible aspects, problems and approaches related to the
storage benchmarking process. We try to describe the most important issues and the
most popular approaches and tools used to evaluate storage systems.

HOWTO organisation

The organisation of the sections of this HOWTO follows the layered layout of real-world networked storage systems (see the table below).

We provide general and layer-by-layer comments on how to perform storage benchmarking,
along with some background information, practical benchmarking tips and examples,
as well as links to external sources.

System layers

A typical storage environment is composed of layers similar to those presented above.
It is important to understand that each of these layers can have a significant influence on data access performance.

In the following subsections we briefly discuss the layers and mention the main factors
that have the greatest impact on the efficiency of the whole storage system.

Clients and applications layer

Client applications run on the client hosts. These hosts are the NAS clients. NAS services include file system-level
services, basic user authentication and access control.

...

Note that client hosts can use the block-level services provided by the storage system, i.e. DAS (Directly Attached Storage) or SAN (Storage Area Network), instead of NAS services. This case is discussed in detail in one of
the following sections.

Client network

The client network interconnects the clients and the filesystem cluster, also called the 'NAS cluster'.

...

As the client-NAS server network can introduce data transmission delays and
significant performance limits, its features must be carefully examined
in the process of benchmarking a networked storage solution.

Filesystem and cluster nodes... and block-level storage clients

Filesystem and cluster nodes implement the NAS (Network Attached Storage) functionality,
including filesystem services, user authentication and data access control.

NAS services implementation

Physically, there are two typical ways of implementing NAS services:

...

  • a lack of scalability of data processing performance,
  • the overhead of file-level services compared to block-level storage,
  • the specifics of the NAS gateway hardware and software, including:
      • file system settings: caching, journaling, prefetching, redundancy,
      • operating system limits on the NAS gateways,
      • transport protocol parameters in both the client and the storage front-end network.

Block-level storage clients

Note that, while being NAS servers, NAS gateways act as block-level services clients at the same time. Block-level services are provided by storage controllers accessible through the storage front-end network, e.g.
an FC-based SAN.

...

In typical networked storage setups the application/client systems can also be connected to the storage network directly, exploiting block-level storage. In such a case, storage benchmarking can be focused
on the performance evaluation of SAN components, SAN network interfaces (so-called HBAs, Host Bus Adapters)
and SAN driver characteristics.

Storage front-end connectivity (SAN or DAS)

Block-level storage can be served directly, locally (DAS - Direct Attached Storage) or by a Storage Area Network. In both cases, the performance evaluation of the block-level storage can be performed using similar methods and tools, as the virtualisation mechanisms implemented in modern operating systems hide the fact that SAN storage is not local.

...

Besides the FC technology, the iSCSI, FCIP, iFCP and SRP (over IB) protocols are used to implement data transport in front-end storage networks.

Storage controllers and RAID structures

Storage controllers implement the basic virtualisation functions such as RAID structures, volumes (LUNs), volume
mapping to hosts and volume access control. They are crucial elements of the storage system, so their features may determine its performance characteristics.

...

A significant amount of time should be spent in order to get a realistic picture
of a given controller's performance characteristics.

Storage back-end connectivity

The back-end storage network interconnects the storage controllers with the disk drives.
Various types of technologies are used for that purpose. Traditionally, Fibre Channel
loops span the disk drives and the storage controller back-end ports. In enterprise solutions,
switched back-end topologies are used. Currently, SAS technology is often used
for the back-end connectivity, as it provides high bandwidth at optimal cost.

The storage back-end links can be a bottleneck, especially in sustained transfer
operations. Therefore their characteristics, such as latency and bandwidth, should be taken into account while
running the performance tests.

Disk drives

As the disk drives hold the actual data, their characteristics are a very important
factor in storage system performance.

...

As the data processed in storage systems ultimately goes to the disk drives, it might be useful to examine the performance of individual disk drives (if possible), in order to determine the realistic performance that can be achieved after combining multiple disks into RAID arrays.

Storage system layers summary

In the sections above we showed that networked storage systems are typically composed of multiple layers.
Each of these layers has its own features, complexity and performance characteristics.
Solid storage benchmarking should take these facts into account. An appropriate testing methodology
should be used in order to reveal the real characteristics of particular components
or to find the source of an actual or potential bottleneck observed in the system.

An overview of the benchmarking methodologies is provided in the next section of the HOWTO.

Storage Benchmarking methodologies overview

The complexity of networked storage systems makes solid storage benchmarking a difficult task.
One of the basic decisions that must be made while preparing the benchmarking procedure
concerns the PURPOSE of running the benchmark and the EVALUATION CRITERIA for the test results.

Benchmarking purposes and evaluation criteria

There are many possible benchmarking purposes. Here are some examples:

...

The benchmarking procedure and tools should be able to reveal the characteristics of the examined
storage system that match the defined criteria.

Benchmarking tactics and organisation

Depending on the purpose, the benchmarking process should be organised in different ways.

...

  • plan the correct sequence of the benchmarking actions, e.g.:
      • work bottom-up to learn the efficiency of each system layer and the overhead it introduces compared to the lower layers;
      • work top-down to determine the limit at the highest level and try to find its source in the lower layers;
  • choose the appropriate set of examined configuration parameters and determine the set of values used for them,
  • eliminate the influence of unwanted optimisation mechanisms (if we want to), e.g.:
      • storage controller-side caching,
      • NAS gateway filesystem-side caching,
      • client-level, filesystem-side and application-side caching,
  • make sure that the benchmarking load reflects the real load of the target applications and the target environment.

Benchmarking tactics: bottom up or top-down?

Both approaches have some advantages and disadvantages. The chosen tactics should fit the purpose
of the test.
If we want to quickly determine the overall performance of the storage system, we can
start from the top, e.g. by running a file-system level benchmark (more on benchmarks in the BENCHMARKING
TOOLS section), and go down in case the performance is lower than expected. This is most probably the less
time-consuming approach; however, we may get an incomplete picture of the actual system efficiency.
Going bottom-up is more costly, but gives us knowledge about the efficiency of each system layer and the overhead it introduces compared to the lower layers. This approach is suitable if we want to find a bottleneck in the
system or to determine the realistic maximum performance of a tuned setup.

Benchmarking parameters selection

Having in mind the complexity of a multi-layer storage system, we may be forced to:

  • use heuristics,
  • trust our experience and intuition (smile)

in order to determine (and limit!) the list of the configuration parameters modified
during the benchmarking process and their examined values. Choosing the correct parameters
and their values is a difficult task and can only be performed based on deep knowledge
of the architectures and characteristics of storage systems, networks and computer systems.

Measuring the correct bottle-neck

When benchmarking a multi-layered system, we have to make sure that we are measuring the correct
component of the system.

Data caching influence

A typical issue faced while benchmarking at the file-system level is the influence of data
caching, which can take place on the client side (file system level, operating system level), on the NAS gateway side,
or in the storage controller.

...

Another side effect of data caching is that the speed of writing to the buffer
can be measured instead of the actual transmission speed. This can be avoided by using a data set large enough to fill up the buffers
and transmit enough data over the real communication channels. We may also use external monitoring tools,
such as Gigabit Ethernet or SAN switch performance analysers and host port usage monitoring tools, to
analyse the actual data traffic taking place in the network.

Monitoring the system compoments

While running the benchmark we have to make sure that we are able to determine the reason for the
performance bottleneck visible in the results. On the other hand, we should try to avoid
incorrect interpretation of the results in case some caching/buffering effect makes the results look
better than the actual system characteristics. We can use monitoring tools in addition to the benchmarking tools for that purpose.

...

  • network interface monitoring tools: ntop, ethereal/wireshark, tcpdump,
  • CPU monitoring tools: top, dstat, vmstat,
  • system statistics collectors: vmstat, sar, dstat,
  • the /proc filesystem in Linux,
  • virtual machine monitoring tools: xm top, virt-top, virt-manager for Xen,
  • and many others.
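
For instance (a sketch; the sampling interval and option set are arbitrary choices), system statistics can be sampled once per second during a test round:

 # CPU, memory and I/O statistics, one sample per second
 vmstat 1
 # CPU, disk and network counters in one view
 dstat -cdn 1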

Avoiding the artificial setup

Note, however, that the decision to switch off or eliminate the effects of caching should be taken carefully.
Disabling the cache can prevent some optimisation technologies implemented in storage system elements from
operating. For instance, if we disable the write cache in a RAID controller, the full-stripe write technique cannot
be exploited.

By switching the caching off, we also make the testing environment a bit artificial: the results we get
in such a setup can be useless in the real-life configuration.

Benchmarking workloads

Getting a realistic picture of storage system performance requires applying
a benchmark that generates a workload matching our needs. Again, the kind
of testing workload should fit the purpose of performing the benchmark.

...

  • by finding a benchmark that generates a workload similar to the application's data access pattern (microbenchmarks or macrobenchmarks); the similarity should include:
      • access pattern: sequential, random or mixed,
      • read/write ratio,
      • temporal and spatial locality of storage access requests,
      • number of simultaneous access requests (concurrent application/benchmark threads);
  • by choosing a test that focuses on the storage characteristics that are crucial for the application:
      • throughput-intensive (performance measured e.g. in MB/s),
      • I/O-intensive (performance measured e.g. in IOPS or in request serving time).

Workload generators and benchmark types

There are many benchmark types available for free or under license.
They differ in:

...

The table below provides some examples of benchmarks along with the links to Internet resources related to them.

Level              | Workload generator / benchmark            | Auxiliary monitoring tools
-------------------|-------------------------------------------|--------------------------------
applications-level | real application (FTP, NFS client, DBMS)  | top
                   | SPC (seq/random R/W)                      |
                   | SPECsfs2008 (CIFS, NFS)                   |
                   | DVDstore (SQL)                            |
                   | TPC (transactions)                        |
network level      | iperf                                     | dstat, ethereal/wireshark, ntop
                   | smartbits appliance                       | optiview link analyser
filesystem level   | dd                                        |
                   | xdd                                       |
                   | iozone                                    |
                   | bonnie/bonnie++                           |
device level       | dd                                        | dstat
                   | iometer                                   | iostat, vmstat
                   | xdd                                       |
                   | diskspeed, hdtune, hdtach, zcav           | Linux's procfs directories
                   | own tools                                 |

Selected benchmarks discussion

In this section we discuss some benchmarks in detail. We selected them based on our experience
and interests; therefore, the selection may not be optimal for every situation.

SPC benchmarks

The Storage Performance Council (SPC) tries to standardize storage system evaluation.
The organisation defines industry-standard storage workloads.
This "forces" vendors to publish standardized performance results for their storage systems.

...

The table below summarizes SPC-1 and SPC-2 specifics.

 

                     | SPC-1                                        | SPC-2
---------------------|----------------------------------------------|----------------------------------------------
Typical applications | database operations, mail servers, OLTP, ... | large file processing, large database queries, video on demand, ...
Workload             | random I/O                                   | sequential I/O (1+ streams)
Workload variations  | address request distribution (uniform + sequential), R/W ratio, ... | transfer size, R/W ratio, number of outstanding I/O requests, ...
Reported metrics     | I/O rate (IOPS), total storage capacity, price-performance, ... | data rate (MB/s), total storage capacity, price-performance, ...

Good practices and tips for benchmarking

Storage system components benchmarking

If we decide to perform layer-by-layer testing of the storage system, or we want to find the source
of a bottleneck observed in the system, we may want to examine a single element of the system and
avoid the influence of the other elements.

Network-only benchmarking

An example problem is to examine the network's ability to carry data traffic using a given protocol,
without the influence of the disk access latency. To do so, we may use some tricks, for instance:

  • measure the network link features with a dedicated tool that performs RAM-to-RAM transfers, e.g. iperf,
  • configure a RAM disk and export it, e.g. using NFS (say we want to test NFS transmission efficiency); a sketch follows below.
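
A minimal sketch of the RAM disk trick, assuming a Linux host with an NFS server running (the size, paths and client subnet are examples only):

 # create a RAM-backed filesystem (tmpfs) and mount it
 mkdir -p /mnt/ramdisk
 mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk
 # export it over NFS and reload the export table
 echo "/mnt/ramdisk 10.0.0.0/24(rw,sync,no_root_squash)" >> /etc/exports
 exportfs -ra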

Network benchmarking tips

As mentioned before, the client network can be a source of significant performance bottlenecks.
However, the degree to which network parameters impact the performance observed by the application
depends on what the application is sensitive to.
If the application is bandwidth-intensive, we should examine mainly the bandwidth of the network, while
for an I/O-intensive application the network delay should be carefully tested.

The parameters of the network link we should evaluate also depend on the kind of protocol used
for data transmission. For instance, in the case of NFS, both bandwidth and delay of the network link matter,
as NFS typically performs synchronous data transmission. Another example is the GridFTP protocol,
which can exploit multiple parallel transmission streams. Therefore, the network benchmark should
be able to examine the network bandwidth using multiple parallel transmission streams (for instance, we
can use the iperf benchmark with multiple TCP/IP streams, as shown below).
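
For example (the server address and stream count are placeholders), iperf opens parallel TCP streams with the -P option:

 # on the server
 iperf -s
 # on the client: 8 parallel TCP streams for 60 seconds
 iperf -c 10.0.1.1 -P 8 -t 60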

Benchmarking the storage systems compoments

This section of the HOTWO contains the more detailed informations about benchmarking selected storage system
compoments. Each of the subsections contains some background informations as well as practical informations
on testing the storage system element.

Clients network

This part of the HOWTO describes how to perform client network testing. Some background information is
provided along with practical information related to TCP network testing using the iperf tool.
Iperf is an open-source benchmark working under both the Linux and Windows operating systems.
In this HOWTO we focus on testing the network under Linux, i.e. between two machines running Red Hat Enterprise Linux.

Background information

As mentioned in the introduction, the efficiency of data access performed over the client network may depend on
the network topology, technology, transmission protocol and network delay, which in turn can result from both the physical distance between the communicating parties and the features of the communication equipment that processes the network traffic.

As the most popular client network in storage systems is IP connectivity implemented over Gigabit Ethernet,
we focus on this technology in this part of the HOWTO. We will show how to examine the data transmission delays
present in the network and the bandwidth available on the link. As some data transmission protocols are able to exploit
multiple parallel data streams (TCP connections), we also show how to examine the influence of the number
of TCP streams used in parallel on the actual bandwidth observed by the application.

Test preparation

Because iperf is a client-server application, you have to install it on both machines involved in the tests. Make sure that you use iperf 2.0.2 built with pthreads or newer, due to multi-threading issues in older versions. You can check the version of the installed tool with the following command:

 [user@hostname ~]$ iperf -v
 iperf version 2.0.2 (03 May 2005) pthreads

Because in many cases low network performance is caused by high CPU load, you should measure the CPU usage at both link ends during every test round. In this HOWTO we use the open-source vmstat tool, which you probably already have installed on your machines.

Link properties

Before we start the tests, we should take a look at our network link setup. First, we should check whether we can use an MTU larger than the standard Ethernet MTU; we should try to use MTU 9000. Using jumbo frames is recommended especially in reliable and fast networks, as bigger frames boost network performance due to a better header-to-payload ratio. But we should remember that it is possible to use MTU 9000 only if all network hardware between the tested hosts (routers, switches, NICs etc.) supports jumbo frames.

In order to enable MTU 9000 on the machine's network interfaces you may use the ifconfig command.

 [root@hostname ~]$ ifconfig eth1 mtu 9000

Alternatively, you can put these settings into the interface configuration scripts, e.g. /etc/sysconfig/network-scripts/ifcfg-eth1 (on RHEL, CentOS, Fedora etc.), as sketched below.
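
A minimal sketch of such a configuration file (the device name is an example; the MTU line is the relevant addition):

 # /etc/sysconfig/network-scripts/ifcfg-eth1 (fragment)
 DEVICE=eth1
 MTU=9000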

If jumbo frames are working properly, you should be able to ping one host from another using a large packet size:

 [root@hostname ~]$ ping 10.0.0.1 -s 8960

In the example above, we use 8960 instead of 9000 because the ping option -s takes the payload size, i.e. the frame size minus the headers, whose length here equals 40 bytes. (Adding the -M do option additionally prohibits fragmentation, so the ping fails instead of being silently fragmented when jumbo frames are not supported along the path.) If you cannot use jumbo frames, set the MTU to the default value of 1500.

To tune your link you should measure the average Round Trip Time (RTT) between the machines; the RTT is the time reported directly by the ping command. When you have the RTT measured, you can set the TCP read and write buffer sizes. There are three values you can set: the minimum, initial and maximum buffer size. The theoretical value (in bytes) for the initial buffer size is BPS / 8 * RTT (the bandwidth-delay product), where BPS is the link bandwidth in bits per second and RTT is expressed in seconds. Example commands that set these values for the whole operating system are:

 [root@hostname ~]# sysctl -w net.ipv4.tcp_rmem="4096 500000 1000000"
 [root@hostname ~]# sysctl -w net.ipv4.tcp_wmem="4096 500000 1000000"

It is probably best to start with values computed using the formula mentioned above and then tune them according to the test results. A worked example follows.
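
A worked example of the formula (the link speed and RTT are assumed values): for a 1 Gbit/s link with a measured RTT of 4 ms, the initial buffer size would be 10^9 / 8 * 0.004 = 500000 bytes, which matches the middle value used in the example commands above. The same computation on the shell:

 # 1 Gbit/s link, 4 ms RTT (example values); prints 500000.000...
 echo "1000000000 / 8 * 0.004" | bc -l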

You can also experiment with maximum socket buffer sizes:

 [root@hostname ~]# sysctl -w net.core.rmem_max=1000000
 [root@hostname ~]# sysctl -w net.core.wmem_max=1000000

Other options that may boost performance are:

 [root@hostname ~]# sysctl -w net.ipv4.tcp_no_metrics_save=1
 [root@hostname ~]# sysctl -w net.ipv4.tcp_moderate_rcvbuf=1
 [root@hostname ~]# sysctl -w net.ipv4.tcp_window_scaling=1
 [root@hostname ~]# sysctl -w net.ipv4.tcp_sack=1
 [root@hostname ~]# sysctl -w net.ipv4.tcp_fack=1
 [root@hostname ~]# sysctl -w net.ipv4.tcp_dsack=1

COMMENT: the meaning of these parameters is explained in the Linux documentation (sysctl command).

Iperf tool description

After setting up the network link parameters, we are ready to run the test. Obviously, configuring
the network settings can be an iterative process, where we check different settings by running the
tests and evaluating their results.

To perform the test, we should run iperf in server mode on one host:

 [root@hostname ~]# iperf -s -M $mss

On the other host we should run a command like this:

 [root@hostname ~]# iperf -c $serwer -M $mss -P $threads -w ${window} -i $interval -t $test_time

Here is a description of the names and symbols used in the command line:

  • -s - run in server mode,
  • -c - run in client mode,
  • $serwer - the address of the machine on which the iperf server is running,
  • -M - the maximum segment size; MSS = MTU - 40,
  • -P - the number of threads sending data through the tested link simultaneously,
  • -w - the initial TCP buffer size,
  • -i - the interval between periodic bandwidth reports, in seconds,
  • -t - the test time in seconds.
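
For instance, with all symbols substituted (the values below are examples only; MSS 8960 corresponds to an MTU of 9000):

 [root@hostname ~]# iperf -c 10.0.1.1 -M 8960 -P 4 -w 1000000 -i 1 -t 60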

Testing methodology

We want to perform several actions before, during and after a single test. These actions include:

...

To generate a key pair we use the following command:

 [root@hostname ~]# ssh-keygen -t dsa

Then we copy the public key to the remote server and add it to the authorized keys file:

 [root@hostname ~]# cat id_dsa.pub >> /home/sarevok/.ssh/authorized_keys

Now we can log in from one server to the other without a password, so we can also run programs remotely, e.g. from a bash script.
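
For example (the address is a placeholder), a remote monitor can now be started from a script just like a local command:

 # run vmstat on the remote host in the background, logging to a file there
 ssh root@10.0.1.1 'vmstat 1 > /tmp/vmstat.out' &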

...

Here is a simple shell script to run the iperf test:

 #!/bin/sh
 # iperf test driver: for each thread count, start the remote iperf server
 # and vmstat monitors, run the client, collect the results, then clean up.

 file_size=41
 dst_path=/home/stas/iperf_results
 script_path=/root
 curr_date=`date +%m-%d-%y-%H-%M-%S`
 serwer="10.0.1.1"
 user="root"
 test_time=60
 interval=1
 mss=1460
 window=1000000
 min_threads=1
 max_threads=128

 for threads in 1 2 4 8 16 32 64 80 96 112 128 ; do
 	# start the iperf server and a vmstat monitor on the remote machine
 	ssh $user@$serwer $script_path/run_iperf.sh -s -w ${window} -M $mss &
 	ssh $user@$serwer $script_path/run_vmstat.sh 1 vmstat-$window-$threads-$mss-$curr_date &
 	# local CPU/memory statistics for this round
 	vmstat 1 > $dst_path/vmstat-$window-$threads-$mss-$curr_date &
 	# run the client and append the results
 	iperf -c $serwer -M $mss -P $threads -w ${window} -i $interval -t $test_time >> $dst_path/iperf-$window-$threads-$mss-$curr_date
 	# stop the local monitor, then the remote iperf and vmstat
 	ps ax | grep vmstat | awk '{print $1}' | xargs -i kill {} 2>/dev/null
 	ssh $user@$serwer $script_path/kill_iperf_vmstat.sh
 done

The script run_iperf.sh can look like this:

 #!/bin/sh
 # pass all command-line arguments through to iperf, run in the background
 iperf "$@" &

The run_vmstat.sh script can contain:

 #!/bin/sh
 vmstat $1 > $2 &

kill_iperf_vmstat.sh may look like this:

 #!/bin/sh
 ps -elf | egrep "iperf" | egrep -v "egrep" | awk '{print $4}' | xargs -i kill -9 {}
 ps -elf | egrep "vmstat" | egrep -v "egrep" | awk '{print $4}' | xargs -i kill -9 {}

To start the test script so that it ignores hangup signals, you can use the nohup command.

 [stas@worm ~]$ nohup script.sh &

This command keeps the test running when you close the session with the server.

...

To present the obtained results you may use the open-source gnuplot program, as a graphical presentation of the results may help you to interpret them. Generating the plots can also be automated, as sketched below.
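
A minimal gnuplot sketch (the data file name and column layout are assumptions: column 1 holds the number of streams, column 2 the measured throughput):

 gnuplot <<EOF
 set terminal png
 set output 'iperf_results.png'
 set xlabel 'parallel TCP streams'
 set ylabel 'throughput [Mbit/s]'
 plot 'iperf_summary.dat' using 1:2 with linespoints title 'iperf'
 EOF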

RAID structures

...

RAID background information

One of the best-known and most important storage virtualisation techniques is the RAID (Redundant Array of Independent Disks) technology. It allows combining independent disk resources into structures that can provide advanced reliability and performance features, which are not possible to deliver using individual disk resources (e.g. individual drives).
In this section we describe the standard RAID levels, including RAID 0, 1, 5 and 6, as well as nested RAID structures such as 10, 50 and 60. At the end of the section we summarize the fault tolerance, performance and storage efficiency characteristics of the particular RAID structures.

Standard RAID structures

RAID0 (striping) does not provide any data redundancy. The main purpose of using RAID0 structures is to distribute the data traffic load over the RAID components. Each file is split into blocks of a certain size, called chunks, which are distributed over the various drives. A set of data chunks for which one parity chunk is computed (in parity RAID levels) is called a stripe. The chunk size is a user-defined parameter of the array. The chunks are sent to all disks in the array in such a way that each chunk is written to exactly one disk. Distributing the I/O operations among multiple drives allows the performance of the particular RAID components to accumulate.
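
For example, with a 64 kB chunk size on a 4-drive RAID0, a 256 kB request is split into four 64 kB chunks, one per drive, so all four drives serve the request in parallel.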

...

Parity information can be written to one dedicated disk (RAID3 and RAID4 structures) or can be spread across all the drives in the array (RAID5 and RAID6). The latter approach has some performance advantages, as the parity-writing load is distributed over all RAID components, in contrast to the former case, in which writing the parity can be a bottleneck. The known limitation of the parity mechanism is that calculating the parity data can affect the write performance. RAID5 uses a single parity chunk per stripe, while RAID6 uses two, which provides extra fault tolerance: the array can survive two broken drives. However, the double-parity calculation can affect the write performance.

Nested RAID structures

Standard RAID structures have contradictory performance and redundancy features; e.g. mirroring provides high data redundancy while limiting the write performance. In order to provide both redundancy and performance, nested RAID structures are used. RAID10 combines a number of RAID1 structures by striping the data over them (RAID0). In that way, superior fault tolerance can be achieved: the array can survive 50% of the drives failing, provided that for every broken drive its mirror drive is still working. RAID10 can achieve performance similar to or even better (in the random read case) than RAID0. Another commonly used nested RAID structure is RAID50, which is a RAID0 made of a number of RAID5 structures. RAID50 improves the performance of RAID5 thanks to the fact that the I/O traffic is distributed over the particular RAID5 structures. This approach is effective especially for write operations. It also provides better fault tolerance than the single RAID level does. The drawback of nested RAID structures is that they require a relatively high number of drives to provide a given storage space.

...

The table below contains a RAID levels comparison. The scores range from 0 (the worst) to 5 (the best). The scores are based on http://www.pcguide.com/ref/hdd/perf/raid/levels/comp.htm and modified according to the experience we gathered and the fact that we consider the same number of drives in every RAID structure. To compare performance we assume that each RAID structure is built from the same number of drives (N drives of capacity S each) and that one thread is used to read or write data from or to the RAID structure.

RAID Level | Capacity          | Storage efficiency | Fault tolerance | Sequential read perf | Sequential write perf
-----------|-------------------|--------------------|-----------------|----------------------|----------------------
RAID0      | S * N             | 100%               | 0               | 5                    | 5
RAID1      | S * N/2           | 50%                | 4               | 2                    | 2
RAID5      | S * (N - 1)       | (N - 1)/N          | 3               | 4                    | 3
RAID6      | S * (N - 2)       | (N - 2)/N          | 4,5             | 4                    | 2,5
RAID10     | S * N/2           | 50%                | 4               | 3                    | 4
RAID50     | S * N0 * (N5 - 1) | (N5 - 1)/N5        | 3,5             | 3                    | 3,5

(For RAID50, N0 denotes the number of RAID5 groups striped over and N5 the number of drives in each RAID5 group.)

RAID benchmarking assumptions

This part of the HOWTO explains the methodology and tips for measuring the performance of a client's block-level storage configured on certain types of RAID storage.

...

  • we have the same number of drives to build each RAID structure we examine,
  • we create all tested RAID structures using the same pool of disks,
  • we use the same type and size of filesystem for the filesystem-level testing.

In that way we ensure that we measure and compare only the performance differences between the RAID structures.

RAID structures preparation

Here we present how to make a software RAID structure using the Linux md tool. To create a simple RAID level from devices sda1, sda2, sda3 and sda4 you should use the following command:

 mdadm --create --verbose /dev/md1 --spare-devices=0 --level=0 --raid-devices=4 /dev/sda1 /dev/sda2 /dev/sda3 /dev/sda4

Where:

  • /dev/md1 - the name of the created RAID device,
  • --spare-devices - the number of drives to be used as spares,
  • --level - the simple RAID level you want to create (currently, Linux supports LINEAR (disk concatenation) md devices, RAID0 (striping), RAID1 (mirroring), RAID4, RAID5, RAID6 and RAID10),
  • --raid-devices - the number of devices you want to use to make the RAID structure.

When you have created a RAID structure, you should be able to see RAID details similar to the information shown below:

 [root@sarevok bin]# mdadm --detail /dev/md1
 /dev/md1:
          Version : 00.90.03
    Creation Time : Mon Apr  6 17:41:43 2009
       Raid Level : raid0
       Array Size : 6837337472 (6520.59 GiB 7001.43 GB)
     Raid Devices : 4
    Total Devices : 4
  Preferred Minor : 3
      Persistence : Superblock is persistent

      Update Time : Mon Apr  6 17:41:43 2009
            State : clean
   Active Devices : 4
  Working Devices : 4
   Failed Devices : 0
    Spare Devices : 0

       Chunk Size : 64K

   Rebuild Status : 10% complete

             UUID : 19450624:f6490625:aa77982e:0d41d013
           Events : 0.1

      Number   Major   Minor   RaidDevice State
         0      65       16        0      active sync   /dev/sda1
         1      65       32        1      active sync   /dev/sda2
         2      65       48        2      active sync   /dev/sda3
         3      65       64        3      active sync   /dev/sda4

When performance is considered, the chunk size of the md device may be an important parameter to tune. The -c (--chunk) option of the mdadm command can be used to specify the chunk size in kilobytes. The default is 64 kB; however, it should be set according to factors such as the following (an example of setting the chunk size is shown after the list):

...
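
For instance (a sketch; the device names, RAID level and 256 kB chunk are examples), the chunk size is set at array creation time:

 mdadm --create /dev/md1 --level=5 --raid-devices=4 --chunk=256 /dev/sda1 /dev/sda2 /dev/sda3 /dev/sda4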

To create a file system on the md device and then mount it in some directory, we use commands like these:

 mkfs.ext3 /dev/md1
 mount /dev/md1 /mnt/md1

Again, there are some filesystem parameters that are interesting from the performance point of view. One of them is
the block size. It should be set taking into account the application features and the underlying storage components.
One rule of thumb is to use a block size equal to the size of the RAID stripe. You can set the block size
using mkfs's -b parameter. It is also possible to influence the filesystem behaviour by using mount command options, as sketched below.
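
For example (a sketch; the 4096-byte block size and the noatime option are illustrative choices):

 mkfs.ext3 -b 4096 /dev/md1
 mount -o noatime /dev/md1 /mnt/md1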

Methods and tools for RAID levels benchmarking

To examine the performance of an md device we normally use the iozone tool. However, for a quick test (for example, to get results fast) we may use the dd tool.

dd tool

The idea of dd is to copy a file from the 'if' location to the 'of' location. Using this tool to measure disk devices requires a small trick: to measure the write speed you read data from /dev/zero and write it to a file on the tested device; to measure the read performance you read the data from a file on the tested device and write it to /dev/zero.
In that way we avoid measuring more than one storage system at a time. To measure the time of reading or writing the file we use the time tool. Example commands to write and read 32 GB of data are:

for write performance (please note the use of the sync command before and during the benchmark, so that you are not measuring your operating system's cache performance):

 [root@sarevok ~]# sync; time (dd if=/dev/zero of=/mnt/md1/test_file.txt bs=1024M count=32; sync)

and for read performance:

 [root@sarevok ~]# time dd if=/mnt/md1/test_file.txt of=/dev/zero bs=1024M count=32

where:

  • if - the input file/device path,
  • of - the output file/device path,
  • bs - the size of a single chunk of data to copy,
  • count - how many chunks defined by bs are copied.

iozone tool

For more precise tests we use the iozone tool. Iozone allows running tests in many modes, including:

...

To perform one round of the test we can use the following command:

 iozone -T -t $threads -r ${blocksize}k -s ${file_size}G -i 0 -i 1 -i 2 -c -e

where:

  • -T - use POSIX pthreads for throughput tests,
  • -t - how many threads to use for the test,
  • -r - the record (chunk) size used in the test,
  • -s - the test file size. Important: this is the file size PER THREAD, because each thread writes to or reads from its own file,
  • -i - the test modes; we choose 0 - write/rewrite, 1 - read/reread and 2 - random write/read,
  • -c - include close() in the timing calculations,
  • -e - include flush (fsync, fflush) in the timing calculations.

...

To automate the testing we can write a simple sh script like this:

 #!/bin/sh
 # iozone test driver: sweep the record size (blocksize) and the number of
 # threads (queuedepth), recording vmstat output for every round.

 dst_path=/home/sarevok/wyniki_test_iozone
 curr_date=`date +%m-%d-%y-%H-%M-%S`

 file_size=128
 min_blocksize=1
 max_blocksize=32

 min_queuedepth=1
 max_queuedepth=16

 mkdir $dst_path
 cd /mnt/sdaw/

 init_file_size=$file_size
 blocksize=$min_blocksize
 while [ $blocksize -le $max_blocksize ]; do
 	queuedepth=$min_queuedepth
 	# restart from the initial per-thread file size for every blocksize sweep
 	file_size=$init_file_size
 	while [ $queuedepth -le $max_queuedepth ]; do
 		# system statistics for this round, collected in the background
 		vmstat 1 > $dst_path/vmstat-$blocksize-$queuedepth-$curr_date &
 		/root/iozone -T -t $queuedepth -r ${blocksize}k -s ${file_size}G -i 0 -i 1 -c -e > $dst_path/iozone-$blocksize-$queuedepth-$curr_date
 		ps ax | grep vmstat | awk '{print $1}' | xargs -i kill {} 2>/dev/null
 		# double the thread count while halving the per-thread file size,
 		# so that the total amount of data stays constant
 		queuedepth=`expr $queuedepth \* 2`
 		file_size=`expr $file_size \/ 2`
 	done
 	blocksize=`expr $blocksize \* 2`
 done

To start the test script so that it ignores hangup signals, you can use the nohup command:

 [root@sarevok ~]$ nohup script.sh &

This command keeps the test running when we close the session with the server. To present the obtained results we use the open-source gnuplot program.

Remarks

When you perform any disk device or file system benchmark you should bear in mind that there are many levels of cache in the system, e.g. the file system cache, the operating system cache, the disk drive cache etc. The simplest way to avoid cache influence is to use an amount of data large enough to fill all cache levels. To do this, we use an amount of data at least equal to twice the machine's RAM size. Such a data size should successfully eliminate the caching influence on the measured md device performance.
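
For example (a sketch; the drop_caches interface is available on Linux 2.6.16 and newer):

 # check the installed RAM, then use a data set of at least twice that size
 free -g
 # optionally flush the page cache between test rounds
 sync; echo 3 > /proc/sys/vm/drop_caches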

After every round of tests, when we want to change some RAID (md) or file system parameters, it is recommended to make a fresh file system on the md device, in order to avoid the influence of the previous filesystem state on the test results.
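
A minimal sketch of such a reset between test rounds (paths as in the earlier examples):

 umount /mnt/md1
 mkfs.ext3 /dev/md1
 mount /dev/md1 /mnt/md1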

Benchmarking tools discussion

This section of the HOWTO discusses the details of particular benchmarking tools and provides
practical information about their usage, automation, interpretation of the results and so on.

TO BE EXPANDED.

Links:

Practical information:

Real life benchmark requirements in RFPs:

One of the most common uses of storage benchmarking is making sure that the storage system you buy meets your requirements. As always, there are practical limits on how complex the benchmark can be. This section lists benchmark procedures actually used in tenders.

CESNET - ~400TB disk array for HSM system using both FC and SATA disks (2011)

Brief disk array requirements

  • Two types of disks in one disk array; no automatic tiering within the array required (there was an HSM system doing this at the file level)
  • Tier 1 - FC, SAS or SCSI drives, min. 15k RPM, in total min. 50TB, consisting of 120x 600GB drives + 3 hot spares
  • Tier 2 - SATA drives, min. 7.2k RPM, in total min. 300TB, min. 375x 1TB + 10 hot spares OR 188x 2TB + 5 hot spares

Performance requirements

  • Sequential: there will be a 10TB cluster filesystem on the disk array using RAID5 or RAID6; this file system will be part of the HSM system. The filesystem will be connected to one of the front-end servers (the technical solution of the connection is up to the candidates, e.g. MPIO, number of FC channels, etc., but the solution must be identical to the one used in the proposal). The following benchmark will be run using iozone v3.347:

    iozone -Mce -t200 -s15g -r512k -i0 -i1 -F path_to_files

    The result of the test is the average value over three runs of the abovementioned command, reported as "Children see throughput for 200 initial writers" and "Children see throughput for 200 readers".
    Minimum read speed: 1600MB/s; minimum write speed: 1200MB/s.
  • Random: 

    Same setup of the volume as in the sequential test, but for this test it will be connected without any filesystem (at the block level). The following test will be run on the connected LUN using fio v1.4.1 with this test definition:

    [global]
    description=CESNET_test
    [cesnet]
    # change it to name of the block device used
    filename=XXXX
    rw=randrw
    # 70% rand read, 30% rand write
    rwmixread=70
    size=10000g
    ioengine=libaio
    bs=2k
    runtime=8h
    time_based
    numjobs=32
    group_reporting
    # --- end ---
    

    The result of the test is the sum of write and read I/O operations divided by the total elapsed time of the test in seconds.

    Minimum required performance: 9000 IOPS.

Results of the tests required as a part of proposal:  YES

Notes after evaluation: the tests themselves were OK, but the test architecture could have been defined a bit better. The tests actually measured only the performance of the FC disks (candidates obviously configured the volume in such a way that it was faster); the performance of the SATA volumes was not evaluated at all. Also, the winner used RAID5 as required, but there was a big RAID0 volume above the 20 individual RAID5s (thus creating RAID50), which was allowed but not used in production afterwards.

File system benchmarking examples:

Independent storage benchmarking organisations:

Storage Performance Council (SPC)

Enterprise Strategy Group (ESG)

Howto authors:

The text re-uses the material presented by Stijn Eeckhaut in Espoo, Finland,
during the 1st TF-Storage Meeting, 8 April 2008.
It also includes material prepared by Maciej Brzezniak from Poznan Supercomputing
and Networking Centre, Poland, and Stanislaw Jankowski, a student at Poznan University of Technology.

...