aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/md/md.c
Commit message (Collapse)AuthorAgeFilesLines
* md: revert the recent addition of a call to the BLKRRPART ioctl.NeilBrown2008-11-061-6/+0
| | | | | | | | | | | | | | It turns out that it is only safe to call blkdev_ioctl when the device is actually open (as ->bd_disk is set to NULL on last close). And it is quite possible for do_md_stop to be called when the device is not open. So discard the call to blkdev_ioctl(BLKRRPART) which was added in commit 934d9c23b4c7e31840a895ba4b7e88d6413c81f3 It is just as easy to call this ioctl from userspace when needed (on mdadm -S) so leave it out of the kernel Signed-off-by: NeilBrown <neilb@suse.de>
* md: destroy partitions and notify udev when md array is stopped.NeilBrown2008-10-281-0/+7
| | | | | | | | | | | | | | | md arrays are not currently destroyed when they are stopped - they remain in /sys/block. Last time I tried this I tripped over locking too much. A consequence of this is that udev doesn't remove anything from /dev. This is rather ugly. As an interim measure until proper device removal can be achieved, make sure all partitions are removed using the BLKRRPART ioctl, and send a KOBJ_CHANGE when an md array is stopped. Signed-off-by: NeilBrown <neilb@suse.de>
* Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds2008-10-261-23/+33
|\ | | | | | | | | | | | | * 'for-linus' of git://neil.brown.name/md: md: allow extended partitions on md devices. md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state md: use sysfs_notify_dirent to notify changes to md/array_state
| * md: allow extended partitions on md devices.NeilBrown2008-10-211-0/+5
| | | | | | | | | | | | | | | | | | | | The new extended partition support provides a much nicer was to have partitions on md devices that the 'mdp' alternate major. We cannot really get rid of 'mdp' at this time, but we can enable extended partitions as that will probably make life easier for sysadmins. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: use sysfs_notify_dirent to notify changes to md/dev-xxx/stateNeilBrown2008-10-211-9/+12
| | | | | | | | | | | | | | | | | | | | The 'state' file for a device reports, for example, when the device has failed. Changes should be reported to userspace ASAP without the possibility of blocking on low-memory. sysfs_notify does have that possibility (as it takes a mutex which can be held across a kmalloc) so use sysfs_notify_dirent instead. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: use sysfs_notify_dirent to notify changes to md/array_stateNeilBrown2008-10-211-14/+16
| | | | | | | | | | | | | | | | | | Now that we have sysfs_notify_dirent, use it to notify changes to md/array_state. As sysfs_notify_dirent can be called in atomic context, we can remove the delayed notify and the MD_NOTIFY_ARRAY_STATE flag. Signed-off-by: NeilBrown <neilb@suse.de>
* | [PATCH] pass fmode_t to blkdev_put()Al Viro2008-10-211-2/+2
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | [PATCH] switch mdAl Viro2008-10-211-10/+10
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | [PATCH] beginning of methods conversionAl Viro2008-10-211-3/+3
|/ | | | | | | | | | | | | | | | | | | | | | | | To keep the size of changesets sane we split the switch by drivers; to keep the damn thing bisectable we do the following: 1) rename the affected methods, add ones with correct prototypes, make (few) callers handle both. That's this changeset. 2) for each driver convert to new methods. *ALL* drivers are converted in this series. 3) kill the old (renamed) methods. Note that it _is_ a flagday; all in-tree drivers are converted and by the end of this series no trace of old methods remain. The only reason why we do that this way is to keep the damn thing bisectable and allow per-driver debugging if anything goes wrong. New methods: open(bdev, mode) release(disk, mode) ioctl(bdev, mode, cmd, arg) /* Called without BKL */ compat_ioctl(bdev, mode, cmd, arg) locked_ioctl(bdev, mode, cmd, arg) /* Called with BKL, legacy */ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* md: fix input truncation in safe_delay_store()Dan Williams2008-10-161-5/+3
| | | | | | | | | | safe_delay_store() currently truncates the last character of input since it tells strlcpy that the buffer can only hold 'len' characters, off by one. sysfs already null terminates the buffer, so just increase the last argument to strlcpy. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: build failure due to missing delay.hStephen Rothwell2008-10-151-0/+1
| | | | | | | | | | | | | | | | | | | | | Today's linux-next build (powerpc ppc64_defconfig) failed like this: drivers/md/raid1.c: In function 'sync_request': drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible' make[3]: *** [drivers/md/raid1.o] Error 1 make[3]: *** Waiting for unfinished jobs.... drivers/md/raid10.c: In function 'sync_request': drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible' make[3]: *** [drivers/md/raid10.o] Error 1 drivers/md/md.c: In function 'md_do_sync': drivers/md/md.c:5915: error: implicit declaration of function 'msleep' Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove unnecessary #includes, #defines, and function declarations"). I added the following patch. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NeilBrown <neilb@suse.de>
* md: Relax minimum size restrictions on chunk_size.NeilBrown2008-10-131-6/+1
| | | | | | | | | | | | | | | | | | Currently, the 'chunk_size' of an array must be at-least PAGE_SIZE. This makes moving an array to a machine with a larger PAGE_SIZE, or changing the kernel to use a larger PAGE_SIZE, can stop an array from working. For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync process works on whole pages at a time, and assumes them to be wholly within a stripe. For other raid personalities, this restriction is not needed at all and can be dropped. So remove the test on chunk_size from common can, and add it in just the places where it is needed: raid10 and raid4/5/6. Signed-off-by: NeilBrown <neilb@suse.de>
* md: remove space after function name in declaration and call.NeilBrown2008-10-131-13/+13
| | | | | | | | | | | | Having function (args) instead of function(args) make is harder to search for calls of particular functions. So remove all those spaces. Signed-off-by: NeilBrown <neilb@suse.de>
* md: Remove unnecessary #includes, #defines, and function declarations.NeilBrown2008-10-131-17/+4
| | | | | | A lot of cruft has gathered over the years. Time to remove it. Signed-off-by: NeilBrown <neilb@suse.de>
* md: Don't try to set an array to 'read-auto' if it is already in that state.NeilBrown2008-10-131-2/+2
| | | | | | | | | | 'read-auto' is a variant of 'readonly' which will switch to writable on the first write attempt. Calling do_md_stop to set the array readonly when it is already readonly returns an error. So make sure not to do that. Signed-off-by: NeilBrown <neilb@suse.de>
* md: Allow metadata_version to be updated for externally managed metadata.NeilBrown2008-10-131-1/+7
| | | | | | | | | | | | | | For externally managed metadata, the 'metadata_version' sysfs attribute is really just a channel for user-space programs to communicate about how the array is being managed. It can be useful for this to be changed while the array is active. Normally changes to metadata_version are not permitted while the array is active. Change that so that if the metadata is externally managed, the metadata_version can be changed to a different flavour of external management. Signed-off-by: NeilBrown <neilb@suse.de>
* md: Fix rdev_size_store with size == 0Chris Webb2008-10-131-4/+2
| | | | | | | | | | | | | Fix rdev_size_store with size == 0. size == 0 means to use the largest size allowed by the underlying device and is used when modifying an active array. This fixes a regression introduced by commit d7027458d68b2f1752a28016dcf2ffd0a7e8f567 Cc: <stable@kernel.org> Signed-off-by: Chris Webb <chris@arachsys.com> Signed-off-by: NeilBrown <neilb@suse.de>
* block: move stats from disk to part0Tejun Heo2008-10-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that... * part_stat_*() now updates part0 together if the specified partition is not part0. ie. part_stat_*() are now essentially all_stat_*(). * {disk|all}_stat_*() are gone. * part_round_stats() is updated similary. It handles part0 stats automatically and disk_round_stats() is killed. * part_{inc|dec}_in_fligh() is implemented which automatically updates part0 stats for parts other than part0. * disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary. * Separate stats show code paths for disk are collapsed into part stats show code paths. * Rename disk_stat_lock/unlock() to part_stat_lock/unlock() While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: always set bdev->bd_partTejun Heo2008-10-091-4/+1
| | | | | | | | | Till now, bdev->bd_part is set only if the bdev was for parts other than part0. This patch makes bdev->bd_part always set so that code paths don't have to differenciate common handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: implement and use {disk|part}_to_dev()Tejun Heo2008-10-091-5/+5
| | | | | | | | | | | | Implement {disk|part}_to_dev() and use them to access generic device instead of directly dereferencing {disk|part}->dev. To make sure no user is left behind, rename generic devices fields to __dev. This is in preparation of unifying partition 0 handling with other partitions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* md: Don't wait UNINTERRUPTIBLE for other resync to finishNeilBrown2008-09-191-1/+7
| | | | | | | | | | | | | | When two md arrays share some block device (e.g each uses different partitions on the one device), a resync of one array will wait for the resync on the other to finish. This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE, the softlockup code notices and complains. So use TASK_INTERRUPTIBLE instead and make sure to flush signals before calling schedule. Signed-off-by: NeilBrown <neilb@suse.de>
* Remove invalidate_partition call from do_md_stop.NeilBrown2008-09-011-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When stopping an md array, or just switching to read-only, we currently call invalidate_partition while holding the mddev lock. The main reason for this is probably to ensure all dirty buffers are flushed (invalidate_partition calls fsync_bdev). However if any dirty buffers are found, it will almost certainly cause a deadlock as starting writeout will require an update to the superblock, and performing that updates requires taking the mddev lock - which is already held. This deadlock can be demonstrated by running "reboot -f -n" with a root filesystem on md/raid, and some dirty buffers in memory. All other calls to stop an array should already happen after a flush. The normal sequence is to stop using the array (e.g. umount) which will cause __blkdev_put to call sync_blockdev. Then open the array and issue the STOP_ARRAY ioctl while the buffers are all still clean. So this invalidate_partition is normally a no-op, except for one case where it will cause a deadlock. So remove it. This patch possibly addresses the regression recored in http://bugzilla.kernel.org/show_bug.cgi?id=11460 and http://bugzilla.kernel.org/show_bug.cgi?id=11452 though it isn't yet clear how it ever worked. Signed-off-by: NeilBrown <neilb@suse.de>
* md: cancel check/repair requests when recovery is neededDan Williams2008-08-071-1/+3
| | | | | | | | | | If a 'repair' is requested when an array is in a position to 'recover' raid1 will perform the repair while md believes a recovery is happening. Address this at both ends, i.e. cancel check/repair requests upon detecting a recover condition and do not call ->spare_active after completing a check/repair. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Allow faulty devices to be removed from a readonly array.NeilBrown2008-08-051-1/+12
| | | | | | | | | | | | | | | Removing faulty devices from an array is a two stage process. First the device is moved from being a part of the active array to being similar to a spare device. Then it can be removed by a request from user space. The first step is currently not performed for read-only arrays, so the second step can never succeed. So allow readonly arrays to remove failed devices (which aren't blocked). Signed-off-by: NeilBrown <neilb@suse.de>
* Fail safely when trying to grow an array with a write-intent bitmap.NeilBrown2008-08-051-0/+5
| | | | | | | | | | | | We cannot currently change the size of a write-intent bitmap. So if we change the size of an array which has such a bitmap, it tries to set bits beyond the end of the bitmap. For now, simply reject any request to change the size of an array which has a bitmap. mdadm can remove the bitmap and add a new one after the array has changed size. Signed-off-by: NeilBrown <neilb@suse.de>
* Restore force switch of md array to readonly at reboot time.NeilBrown2008-08-051-1/+5
| | | | | | | | | | | | | | | A recent patch allowed do_md_stop to know whether it was being called via an ioctl or not, and thus where to allow for an extra open file descriptor when checking if it is in use. This broke then switch to readonly performed by the shutdown notifier, which needs to work even when the array is still (apparently) active (as md doesn't get told when the filesystem becomes readonly). So restore this feature by pretending that there can be lots of file descriptors open, but we still want do_md_stop to switch to readonly. Signed-off-by: NeilBrown <neilb@suse.de>
* Make writes to md/safe_mode_delay immediately effective.NeilBrown2008-08-051-0/+5
| | | | | | | | | | | | If we reduce the 'safe_mode_delay', it could still wait for the old delay to completely expire before doing anything about safe_mode. Thus the effect if the change is delayed. To make the effect more immediate, run the timeout function immediately if the delay was reduced. This may cause it to run slightly earlier that required, but that is the safer option. Signed-off-by: NeilBrown <neilb@suse.de>
* md: do not count blocked devices as sparesDan Williams2008-07-281-1/+2
| | | | | | | | remove_and_add_spares() assumes that failed devices have been hot-removed from the array. Removal is skipped in the 'blocked' case so do not count a device in this state as 'spare'. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* md: delay notification of 'active_idle' to the recovery threadDan Williams2008-07-231-1/+4
| | | | | | sysfs_notify might sleep, so do not call it from md_safemode_timeout. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* md: Protect access to mddev->disks list using RCUNeilBrown2008-07-211-12/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All modifications and most access to the mddev->disks list are made under the reconfig_mutex lock. However there are three places where the list is walked without any locking. If a reconfig happens at this time, havoc (and oops) can ensue. So use RCU to protect these accesses: - wrap them in rcu_read_{,un}lock() - use list_for_each_entry_rcu - add to the list with list_add_rcu - delete from the list with list_del_rcu - delay the 'free' with call_rcu rather than schedule_work Note that export_rdev did a list_del_init on this list. In almost all cases the entry was not in the list anymore so it was a no-op and so safe. It is no longer safe as after list_del_rcu we may not touch the list_head. An audit shows that export_rdev is called: - after unbind_rdev_from_array, in which case the delete has already been done, - after bind_rdev_to_array fails, in which case the delete isn't needed. - before the device has been put on a list at all (e.g. in add_new_disk where reading the superblock fails). - and in autorun devices after a failure when the device is on a different list. So remove the list_del_init call from export_rdev, and add it back immediately before the called to export_rdev for that last case. Note also that ->same_set is sometimes used for lists other than mddev->list (e.g. candidates). In these cases rcu is not needed. Signed-off-by: NeilBrown <neilb@suse.de>
* md: only count actual openers as access which prevent a 'stop'NeilBrown2008-07-211-3/+6
| | | | | | | | | Open isn't the only thing that increments ->active. e.g. reading /proc/mdstat will increment it briefly. So to avoid false positives in testing for concurrent access, introduce a new counter that counts just the number of times the md device it open. Signed-off-by: NeilBrown <neilb@suse.de>
* md: Make mddev->array_size sector-based.Andre Noll2008-07-211-5/+7
| | | | | | | | | This patch renames the array_size field of struct mddev_s to array_sectors and converts all instances to use units of 512 byte sectors instead of 1k blocks. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
* md: Make super_type->rdev_size_change() take sector-based sizes.Andre Noll2008-07-211-21/+19
| | | | | | | | Also, change the type of the size parameter from unsigned long long to sector_t and rename it to num_sectors. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
* md: Fix check for overlapping devices.Andre Noll2008-07-211-2/+3
| | | | | | | | | | | | | The checks in overlaps() expect all parameters either in block-based or sector-based quantities. However, its single caller passes two rdev->data_offset arguments as well as two rdev->size arguments, the former being sector counts while the latter are measured in 1K blocks. This could cause rdev_size_store() to accept an invalid size from user space. Fix it by passing only sector-based quantities to overlaps(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
* md: Tidy up rdev_size_store a bit:Neil Brown2008-07-211-9/+8
| | | | | | | | | | | | | | | - used strict_strtoull in place of simple_strtoull - use my_mddev in place of rdev->mddev (they have the same value) and more significantly, - don't adjust mddev->size to fit, rather reject changes which make rdev->size smaller than mddev->size Adjusting mddev->size is a hangover from bind_rdev_to_array which does a similar thing. But it really is a better design to insist that mddev->size is set as required, then the rdev->sizes are set to allow for that. The previous way invites confusion. Signed-off-by: NeilBrown <neilb@suse.de>
* md: Turn rdev->sb_offset into a sector-based quantity.Andre Noll2008-07-111-43/+38
| | | | | | | Rename it to sb_start to make sure all users have been converted. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: Make calc_dev_sboffset() return a sector count.Andre Noll2008-07-111-6/+7
| | | | | | | | | | | | As BLOCK_SIZE_BITS is 10 and MD_NEW_SIZE_SECTORS(2 * x) = 2 * NEW_SIZE_BLOCKS(x), the return value of calc_dev_sboffset() doubles. Fix up all three callers accordingly. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: Replace calc_dev_size() by calc_num_sectors().Andre Noll2008-07-111-11/+7
| | | | | | | | | Number of sectors is the preferred unit for sizes of raid devices, so change calc_dev_size() so that it returns this unit instead of the number of 1K blocks. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: Make update_size() take the number of sectors.Andre Noll2008-07-111-18/+18
| | | | | | | | | | | | | | Changing the internal representations of sizes of raid devices from 1K blocks to sector counts (512B units) is desirable because it allows to get rid of many divisions/multiplications and unnecessary casts that are present in the current code. This patch is a first step in this direction. It replaces the old 1K-based "size" argument of update_size() by "num_sectors" and fixes up its two callers. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: Better control of when do_md_stop is allowed to stop the array.Neil Brown2008-07-111-14/+15
| | | | | | | | | | | | | do_md_stop check the number of active users before allowing the array to be stopped. Two problems: 1/ it assumes the request is coming through an open file descriptor (via ioctl) so it allows for that. This is not always the case. 2/ it doesn't do the check it the array hasn't been activated. This is not good for cases when we use an inactive array to hold some devices in a container. Signed-off-by: Neil Brown <neilb@suse.de>
* md: get_disk_info(): Don't convert between signed and unsigned and back.Andre Noll2008-07-111-4/+1
| | | | | | | | | The current code copies a signed int from user space, converts it to unsigned and passes the unsigned value to find_rdev_nr() which expects a signed value. Simply pass the signed value from user space directly. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: Simplify restart_array().Andre Noll2008-07-111-32/+17
| | | | | Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: alloc_disk_sb(): Return proper error value.Andre Noll2008-07-111-1/+1
| | | | | | | | If alloc_page() fails, ENOMEM is a more suitable error value than EINVAL. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: Simplify sb_equal().Andre Noll2008-07-111-5/+1
| | | | | | | | The only caller of sb_equal() tests the return value against zero, so it's OK to return the negated return value of memcmp(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: Simplify uuid_equal().Andre Noll2008-07-111-9/+4
| | | | | Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: sb_equal(): Fix misleading printk.Andre Noll2008-07-081-1/+1
| | | | | Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: Fix a typo in the comment to cmd_match().Andre Noll2008-07-081-1/+1
| | | | | Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: Fix typo in array_state comment.Andre Noll2008-07-081-1/+1
| | | | | Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: sync_speed_show(): Trivial cleanups.Andre Noll2008-07-081-4/+4
| | | | | | | | - Remove superfluous parentheses. - Make format string match the type of the variable that is printed. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
* md: do_md_run(): Fix misleading error message.Andre Noll2008-07-081-2/+3
| | | | | | | | | | In case pers->run() succeeds but creating the bitmap fails, we print an error message stating that pers->run() has failed. Print this message only if pers->run() really failed. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>