aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/md/dm-table.c
Commit message (Collapse)AuthorAgeFilesLines
* dm: improve block integrity supportMike Snitzer2011-04-051-34/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current block integrity (DIF/DIX) support in DM is verifying that all devices' integrity profiles match during DM device resume (which is past the point of no return). To some degree that is unavoidable (stacked DM devices force this late checking). But for most DM devices (which aren't stacking on other DM devices) the ideal time to verify all integrity profiles match is during table load. Introduce the notion of an "initialized" integrity profile: a profile that was blk_integrity_register()'d with a non-NULL 'blk_integrity' template. Add blk_integrity_is_initialized() to allow checking if a profile was initialized. Update DM integrity support to: - check all devices with _initialized_ integrity profiles match during table load; uninitialized profiles (e.g. for underlying DM device(s) of a stacked DM device) are ignored. - disallow a table load that would result in an integrity profile that conflicts with a DM device's existing (in-use) integrity profile - avoid clearing an existing integrity profile - validate all integrity profiles match during resume; but if they don't all we can do is report the mismatch (during resume we're past the point of no return) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* block: Require subsystems to explicitly allocate bio_set integrity mempoolMartin K. Petersen2011-03-171-2/+5
| | | | | | | | | | | | | | | | | | MD and DM create a new bio_set for every metadevice. Each bio_set has an integrity mempool attached regardless of whether the metadevice is capable of passing integrity metadata. This is a waste of memory. Instead we defer the allocation decision to MD and DM since we know at metadevice creation time whether integrity passthrough is needed or not. Automatic integrity mempool allocation can then be removed from bioset_create() and we make an explicit integrity allocation for the fs_bio_set. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Acked-by: Mike Snitzer <snizer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* block: remove per-queue pluggingJens Axboe2011-03-101-24/+0
| | | | | | | | Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* block: restore multiple bd_link_disk_holder() supportTejun Heo2011-01-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Commit e09b457b (block: simplify holder symlink handling) incorrectly assumed that there is only one link at maximum. dm may use multiple links and expects block layer to track reference count for each link, which is different from and unrelated to the exclusive device holder identified by @holder when the device is opened. Remove the single holder assumption and automatic removal of the link and revive the per-link reference count tracking. The code essentially behaves the same as before commit e09b457b sans the unnecessary kobject reference count dancing. While at it, note that this facility should not be used by anyone else than the current ones. Sysfs symlinks shouldn't be abused like this and the whole thing doesn't belong in the block layer at all. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Milan Broz <mbroz@redhat.com> Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* dm: per target unplug callback supportNeilBrown2011-01-131-0/+5
| | | | | | | | | | Add per-target unplug callback support. Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: introduce target callbacks and congestion callbackNeilBrown2011-01-131-0/+14
| | | | | | | | | | | | | | | | | | DM currently implements congestion checking by checking on congestion in each component device. For raid456 we need to also check if the stripe cache is congested. Add per-target congestion checker callback support. Extending the target_callbacks structure with additional callback functions allows for establishing multiple callbacks per-target (a callback is also needed for unplug). Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2011-01-131-9/+11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits) block: ensure that completion error gets properly traced blktrace: add missing probe argument to block_bio_complete block cfq: don't use atomic_t for cfq_group block cfq: don't use atomic_t for cfq_queue block: trace event block fix unassigned field block: add internal hd part table references block: fix accounting bug on cross partition merges kref: add kref_test_and_get bio-integrity: mark kintegrityd_wq highpri and CPU intensive block: make kblockd_workqueue smarter Revert "sd: implement sd_check_events()" block: Clean up exit_io_context() source code. Fix compile warnings due to missing removal of a 'ret' variable fs/block: type signature of major_to_index(int) to major_to_index(unsigned) block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) cfq-iosched: don't check cfqg in choose_service_tree() fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors cdrom: export cdrom_check_events() sd: implement sd_check_events() sr: implement sr_check_events() ...
| * block: clean up blkdev_get() wrappers and their usersTejun Heo2010-11-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After recent blkdev_get() modifications, open_by_devnum() and open_bdev_exclusive() are simple wrappers around blkdev_get(). Replace them with blkdev_get_by_dev() and blkdev_get_by_path(). blkdev_get_by_dev() is identical to open_by_devnum(). blkdev_get_by_path() is slightly different in that it doesn't automatically add %FMODE_EXCL to @mode. All users are converted. Most conversions are mechanical and don't introduce any behavior difference. There are several exceptions. * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no reason to OR it explicitly on blkdev_put(). * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in sb->s_mode. * With the above changes, sb->s_mode now always should contain FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect errors. The new blkdev_get_*() functions are with proper docbook comments. While at it, add function description to blkdev_get() too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Joern Engel <joern@lazybastard.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: reiserfs-devel@vger.kernel.org Cc: xfs-masters@oss.sgi.com Cc: Alexander Viro <viro@zeniv.linux.org.uk>
| * block: make blkdev_get/put() handle exclusive accessTejun Heo2010-11-131-12/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Over time, block layer has accumulated a set of APIs dealing with bdev open, close, claim and release. * blkdev_get/put() are the primary open and close functions. * bd_claim/release() deal with exclusive open. * open/close_bdev_exclusive() are combination of open and claim and the other way around, respectively. * bd_link/unlink_disk_holder() to create and remove holder/slave symlinks. * open_by_devnum() wraps bdget() + blkdev_get(). The interface is a bit confusing and the decoupling of open and claim makes it impossible to properly guarantee exclusive access as in-kernel open + claim sequence can disturb the existing exclusive open even before the block layer knows the current open if for another exclusive access. Reorganize the interface such that, * blkdev_get() is extended to include exclusive access management. @holder argument is added and, if is @FMODE_EXCL specified, it will gain exclusive access atomically w.r.t. other exclusive accesses. * blkdev_put() is similarly extended. It now takes @mode argument and if @FMODE_EXCL is set, it releases an exclusive access. Also, when the last exclusive claim is released, the holder/slave symlinks are removed automatically. * bd_claim/release() and close_bdev_exclusive() are no longer necessary and either made static or removed. * bd_link_disk_holder() remains the same but bd_unlink_disk_holder() is no longer necessary and removed. * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev() and blkdev_get(). It also has an unexpected extra bdev_read_only() test which probably should be moved into blkdev_get(). * open_by_devnum() is modified to take @holder argument and pass it to blkdev_get(). Most of bdev open/close operations are unified into blkdev_get/put() and most exclusive accesses are tested atomically at the open time (as it should). This cleans up code and removes some, both valid and invalid, but unnecessary all the same, corner cases. open_bdev_exclusive() and open_by_devnum() can use further cleanup - rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop special features. Well, let's leave them for another day. Most conversions are straight-forward. drbd conversion is a bit more involved as there was some reordering, but the logic should stay the same. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Philipp Reisner <philipp.reisner@linbit.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Alex Elder <aelder@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: dm-devel@redhat.com Cc: drbd-dev@lists.linbit.com Cc: Leo Chen <leochen@broadcom.com> Cc: Scott Branden <sbranden@broadcom.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Joern Engel <joern@logfs.org> Cc: reiserfs-devel@vger.kernel.org Cc: Alexander Viro <viro@zeniv.linux.org.uk>
| * block: simplify holder symlink handlingTejun Heo2010-11-131-6/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Code to manage symlinks in /sys/block/*/{holders|slaves} are overly complex with multiple holder considerations, redundant extra references to all involved kobjects, unused generic kobject holder support and unnecessary mixup with bd_claim/release functionalities. Strip it down to what's necessary (single gendisk holder) and make it use a separate interface. This is a step for cleaning up bd_claim/release. This patch makes dm-table slightly more complex but it will be simplified again with further changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Brown <neilb@suse.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com
* | block: max hardware sectors limit wrapperMike Snitzer2010-12-171-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement blk_limits_max_hw_sectors() and make blk_queue_max_hw_sectors() a wrapper around it. DM needs this to avoid setting queue_limits' max_hw_sectors and max_sectors directly. dm_set_device_limits() now leverages blk_limits_max_hw_sectors() logic to establish the appropriate max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was incorrectly setting max_sectors rather than max_hw_sectors (which caused dm_merge_bvec()'s max_hw_sectors check to be ineffective). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits insteadMartin K. Petersen2010-12-171-5/+0
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | When stacking devices, a request_queue is not always available. This forced us to have a no_cluster flag in the queue_limits that could be used as a carrier until the request_queue had been set up for a metadevice. There were several problems with that approach. First of all it was up to the stacking device to remember to set queue flag after stacking had completed. Also, the queue flag and the queue limits had to be kept in sync at all times. We got that wrong, which could lead to us issuing commands that went beyond the max scatterlist limit set by the driver. The proper fix is to avoid having two flags for tracking the same thing. We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the block layer merging functions. The queue_limit 'no_cluster' is turned into 'cluster' to avoid double negatives and to ease stacking. Clustering defaults to being enabled as before. The queue flag logic is removed from the stacking function, and explicitly setting the cluster flag is no longer necessary in DM and MD. Reported-by: Ed Lin <ed.lin@promise.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* Consolidate min_not_zeroMartin K. Petersen2010-09-101-5/+0
| | | | | | | | We have several users of min_not_zero, each of them using their own definition. Move the define to kernel.h. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@carl.home.kernel.dk>
* dm: linear support discardMike Snitzer2010-08-121-0/+44
| | | | | | | | | | | | | | | | | Allow discards to be passed through to linear mappings if at least one underlying device supports it. Discards will be forwarded only to devices that support them. A target that supports discards should set num_discard_requests to indicate how many times each discard request must be submitted to it. Verify table's underlying devices support discards prior to setting the associated DM device as capable of discards (via QUEUE_FLAG_DISCARD). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm ioctl: refactor dm_table_completeWill Drewry2010-08-121-3/+52
| | | | | | | | | This change unifies the various checks and finalization that occurs on a table prior to use. By doing so, it allows table construction without traversing the dm-ioctl interface. Signed-off-by: Will Drewry <wad@chromium.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: remove unused dm_get_device range parametersNikanth Karthikesan2010-03-061-6/+4
| | | | | | | | Remove unused parameters(start and len) of dm_get_device() and fix the callers. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: remove dm_get from dm_table_get_mdKiyoshi Ueda2010-03-061-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the dm_get() in dm_table_get_md() because dm_table_get_md() could be called from presuspend/postsuspend, which are called while mapped_device is in DMF_FREEING state, where dm_get() is not allowed. Justification for that is the lifetime of both objects: As far as the current dm design/implementation, mapped_device is never freed while targets are doing something, because dm core waits for targets to become quiet in dm_put() using presuspend/postsuspend. So targets should be able to touch mapped_device without holding reference count of the mapped_device, and we should allow targets to touch mapped_device even if it is in DMF_FREEING state. Backgrounds: I'm trying to remove the multipath internal queue, since dm core now has a generic queue for request-based dm. In the patch-set, the multipath target wants to request dm core to start/stop queue. One of such start/stop requests can happen during postsuspend() while the target waits for pg-init to complete, because the target stops queue when starting pg-init and tries to restart it when completing pg-init. Since queue belongs to mapped_device, it involves calling dm_table_get_md() and dm_put(). On the other hand, postsuspend() is called in dm_put() for mapped_device which is in DMF_FREEING state, and that triggers BUG_ON(DMF_FREEING) in the 2nd dm_put(). I had tried to solve this problem by changing only multipath not to touch mapped_device which is in DMF_FREEING state, but I couldn't and I came up with a question why we need dm_get() in dm_table_get_md(). Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* DM: Fix device mapper topology stackingMartin K. Petersen2010-01-111-15/+5
| | | | | | | | | | | | | | Make DM use bdev_stack_limits() function so that partition offsets get taken into account when calculating alignment. Clarify stacking warnings. Also remove obsolete clearing of final alignment_offset and misalignment flag. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair G. Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dmLinus Torvalds2009-12-151-0/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (80 commits) dm snapshot: use merge origin if snapshot invalid dm snapshot: report merge failure in status dm snapshot: merge consecutive chunks together dm snapshot: trigger exceptions in remaining snapshots during merge dm snapshot: delay merging a chunk until writes to it complete dm snapshot: queue writes to chunks being merged dm snapshot: add merging dm snapshot: permit only one merge at once dm snapshot: support barriers in snapshot merge target dm snapshot: avoid allocating exceptions in merge dm snapshot: rework writing to origin dm snapshot: add merge target dm exception store: add merge specific methods dm snapshot: create function for chunk_is_tracked wait dm snapshot: make bio optional in __origin_write dm mpath: reject messages when device is suspended dm: export suspended state to targets dm: rename dm_suspended to dm_suspended_md dm: swap target postsuspend call and setting suspended flag dm crypt: add plain64 iv ...
| * dm: bind new table before destroying oldAlasdair G Kergon2009-12-101-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When replacing a mapped device's table during a 'resume', delay the destruction of the old table until the new one is successfully in place. This will make it easier for a later patch to transfer internal state information from the old table to the new one (something we do not currently support) while giving us more options for reversion if a later part of the operation fails. Devices are always in the suspended state during dm_swap_table(). This patch reinforces the requirement that all I/O must have been flushed from the table targets while in this state (including any in workqueues). In the case of 'noflush' suspending, unprocessed I/O should have been 'pushed back' to the dm core prior to this point, for resubmission after the new table is in place. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | tree-wide: convert open calls to remove spaces to skip_spaces() lib functionAndré Goddard Rosa2009-12-151-4/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Makes use of skip_spaces() defined in lib/string.c for removing leading spaces from strings all over the tree. It decreases lib.a code size by 47 bytes and reuses the function tree-wide: text data bss dec hex filename 64688 584 592 65864 10148 (TOTALS-BEFORE) 64641 584 592 65817 10119 (TOTALS-AFTER) Also, while at it, if we see (*str && isspace(*str)), we can be sure to remove the first condition (*str) as the second one (isspace(*str)) also evaluates to 0 whenever *str == 0, making it redundant. In other words, "a char equals zero is never a space". Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below, and found occurrences of this pattern on 3 more files: drivers/leds/led-class.c drivers/leds/ledtrig-timer.c drivers/video/output.c @@ expression str; @@ ( // ignore skip_spaces cases while (*str && isspace(*str)) { \(str++;\|++str;\) } | - *str && isspace(*str) ) Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Julia Lawall <julia@diku.dk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Neil Brown <neilb@suse.de> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: David Howells <dhowells@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Samuel Ortiz <samuel@sortiz.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* dm stripe: expose correct io hintsMike Snitzer2009-09-041-0/+4
| | | | | | | | | | | | | Set sensible I/O hints for striped DM devices in the topology infrastructure added for 2.6.31 for userspace tools to obtain via sysfs. Add .io_hints to 'struct target_type' to allow the I/O hints portion (io_min and io_opt) of the 'struct queue_limits' to be set by each target and implement this for dm-stripe. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: add more context to terse warning messagesMike Snitzer2009-09-041-7/+18
| | | | | | | | | | | | | | | | | | A couple of recent warning messages make it difficult for the reader to determine exactly what is wrong. This patch adds more information to those messages. The messages were added by these commits: 5dea271b6d87bd1d79a59c1d5baac2596a841c37 ("dm table: pass correct dev area size to device_area_is_valid") ea9df47cc92573b159ef3b4fda516c32cba9c4fd ("dm table: fix blk_stack_limits arg to use bytes not sectors") The patch also corrects references to logical_block_size in printk format strings from %hu to %u. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: fix queue_limit checking device iteratorMikulas Patocka2009-09-041-11/+11
| | | | | | | | | | | | | | | | | | | The logic to check for valid device areas is inverted relative to proper use with iterate_devices. The iterate_devices method calls its callback for every underlying device in the target. If any callback returns non-zero, iterate_devices exits immediately. But the callback device_area_is_valid() returns 0 on error and 1 on success. The overall effect without is that an error is issued only if every device is invalid. This patch renames device_area_is_valid to device_area_is_invalid and inverts the logic so that one invalid device is sufficient to raise an error. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: pass correct dev area size to device_area_is_validMike Snitzer2009-07-231-5/+5
| | | | | | | | | | | | | | | | Incorrect device area lengths are being passed to device_area_is_valid(). The regression appeared in 2.6.31-rc1 through commit 754c5fc7ebb417b23601a6222a6005cc2e7f2913. With the dm-stripe target, the size of the target (ti->len) was used instead of the stripe_width (ti->len/#stripes). An example of a consequent incorrect error message is: device-mapper: table: 254:0: sdb too small for target Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: remove queue next_ordered workaround for barriersMike Snitzer2009-07-231-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes DM's bio-based vs request-based conditional setting of next_ordered. For bio-based DM the next_ordered check is no longer a concern (as that check is now in the __make_request path). For request-based DM the default of QUEUE_ORDERED_NONE is now appropriate. bio-based DM was changed to work-around the previously misplaced next_ordered check with this commit: 99360b4c18f7675b50d283301d46d755affe75fd request-based DM does not yet support barriers but reacted to the above bio-based DM change with this commit: 5d67aa2366ccb8257d103d0b43df855605c3c086 The above changes are no longer needed given Neil Brown's recent fix to put the next_ordered check in the __make_request path: db64f680ba4b5c56c4be59f0698000df89ff0281 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: NeilBrown <neilb@suse.de> Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: fix blk_stack_limits arg to use bytes not sectorsMike Snitzer2009-06-301-1/+1
| | | | | | | | | | | The offset passed to blk_stack_limits() must be in bytes not sectors. Fixes false warnings like the following: device-mapper: table: 254:1: target device sda6 is misaligned Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reported-by: Frans Pop <elendil@planet.nl> Tested-by: Frans Pop <elendil@planet.nl> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: do not set QUEUE_ORDERED_DRAIN if request basedKiyoshi Ueda2009-06-221-0/+5
| | | | | | | | | | | | Request-based dm doesn't have barrier support yet. So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm. Since the device type is decided at the first table loading time, the flag set is deferred until then. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: enable request based optionKiyoshi Ueda2009-06-221-0/+111
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch enables request-based dm. o Request-based dm and bio-based dm coexist, since there are some target drivers which are more fitting to bio-based dm. Also, there are other bio-based devices in the kernel (e.g. md, loop). Since bio-based device can't receive struct request, there are some limitations on device stacking between bio-based and request-based. type of underlying device bio-based request-based ---------------------------------------------- bio-based OK OK request-based -- OK The device type is recognized by the queue flag in the kernel, so dm follows that. o The type of a dm device is decided at the first table binding time. Once the type of a dm device is decided, the type can't be changed. o Mempool allocations are deferred to at the table loading time, since mempools for request-based dm are different from those for bio-based dm and needed mempool type is fixed by the type of table. o Currently, request-based dm supports only tables that have a single target. To support multiple targets, we need to support request splitting or prevent bio/request from spanning multiple targets. The former needs lots of changes in the block layer, and the latter needs that all target drivers support merge() function. Both will take a time. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: prepare for request based optionKiyoshi Ueda2009-06-221-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds core functions for request-based dm. When struct mapped device (md) is initialized, md->queue has an I/O scheduler and the following functions are used for request-based dm as the queue functions: make_request_fn: dm_make_request() pref_fn: dm_prep_fn() request_fn: dm_request_fn() softirq_done_fn: dm_softirq_done() lld_busy_fn: dm_lld_busy() Actual initializations are done in another patch (PATCH 2). Below is a brief summary of how request-based dm behaves, including: - making request from bio - cloning, mapping and dispatching request - completing request and bio - suspending md - resuming md bio to request ============== md->queue->make_request_fn() (dm_make_request()) calls __make_request() for a bio submitted to the md. Then, the bio is kept in the queue as a new request or merged into another request in the queue if possible. Cloning and Mapping =================== Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()), when requests are dispatched after they are sorted by the I/O scheduler. dm_request_fn() checks busy state of underlying devices using target's busy() function and stops dispatching requests to keep them on the dm device's queue if busy. It helps better I/O merging, since no merge is done for a request once it is dispatched to underlying devices. Actual cloning and mapping are done in dm_prep_fn() and map_request() called from dm_request_fn(). dm_prep_fn() clones not only request but also bios of the request so that dm can hold bio completion in error cases and prevent the bio submitter from noticing the error. (See the "Completion" section below for details.) After the cloning, the clone is mapped by target's map_rq() function and inserted to underlying device's queue using blk_insert_cloned_request(). Completion ========== Request completion can be hooked by rq->end_io(), but then, all bios in the request will have been completed even error cases, and the bio submitter will have noticed the error. To prevent the bio completion in error cases, request-based dm clones both bio and request and hooks both bio->bi_end_io() and rq->end_io(): bio->bi_end_io(): end_clone_bio() rq->end_io(): end_clone_request() Summary of the request completion flow is below: blk_end_request() for a clone request => blk_update_request() => bio->bi_end_io() == end_clone_bio() for each clone bio => Free the clone bio => Success: Complete the original bio (blk_update_request()) Error: Don't complete the original bio => blk_finish_request() => rq->end_io() == end_clone_request() => blk_complete_request() => dm_softirq_done() => Free the clone request => Success: Complete the original request (blk_end_request()) Error: Requeue the original request end_clone_bio() completes the original request on the size of the original bio in successful cases. Even if all bios in the original request are completed by that completion, the original request must not be completed yet to keep the ordering of request completion for the stacking. So end_clone_bio() uses blk_update_request() instead of blk_end_request(). In error cases, end_clone_bio() doesn't complete the original bio. It just frees the cloned bio and gives over the error handling to end_clone_request(). end_clone_request(), which is called with queue lock held, completes the clone request and the original request in a softirq context (dm_softirq_done()), which has no queue lock, to avoid a deadlock issue on submission of another request during the completion: - The submitted request may be mapped to the same device - Request submission requires queue lock, but the queue lock has been held by itself and it doesn't know that The clone request has no clone bio when dm_softirq_done() is called. So target drivers can't resubmit it again even error cases. Instead, they can ask dm core for requeueing and remapping the original request in that cases. suspend ======= Request-based dm uses stopping md->queue as suspend of the md. For noflush suspend, just stops md->queue. For flush suspend, inserts a marker request to the tail of md->queue. And dispatches all requests in md->queue until the marker comes to the front of md->queue. Then, stops dispatching request and waits for the all dispatched requests to complete. After that, completes the marker request, stops md->queue and wake up the waiter on the suspend queue, md->wait. resume ====== Starts md->queue. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: calculate queue limits during resume not loadMike Snitzer2009-06-221-83/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, device-mapper maintains a separate instance of 'struct queue_limits' for each table of each device. When the configuration of a device is to be changed, first its table is loaded and this structure is populated, then the device is 'resumed' and the calculated queue_limits are applied. This places restrictions on how userspace may process related devices, where it is often advantageous to 'load' tables for several devices at once before 'resuming' them together. As the new queue_limits only take effect after the 'resume', if they are changing and one device uses another, the latter must be 'resumed' before the former may be 'loaded'. This patch moves the calculation of these queue_limits out of the 'load' operation into 'resume'. Since we are no longer pre-calculating this struct, we no longer need to maintain copies within our dm structs. dm_set_device_limits() now passes the 'start' of the device's data area (aka pe_start) as the 'offset' to blk_stack_limits(). init_valid_queue_limits() is replaced by blk_set_default_limits(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: martin.petersen@oracle.com Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: establish queue limits by copying table limitsMike Snitzer2009-06-221-10/+2
| | | | | | | | | | | | | | | | | Copy the table's queue_limits to the DM device's request_queue. This properly initializes the queue's topology limits and also avoids having to track the evolution of 'struct queue_limits' in dm_table_set_restrictions() Also fixes a bug that was introduced in dm_table_set_restrictions() via commit ae03bf639a5027d27270123f5f6e3ee6a412781d. In addition to establishing 'bounce_pfn' in the queue's limits blk_queue_bounce_limit() also performs an allocation to setup the ISA DMA pool. This allocation resulted in "sleeping function called from invalid context" when called from dm_table_set_restrictions(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: replace struct io_restrictions with struct queue_limitsMike Snitzer2009-06-221-95/+43
| | | | | | | | Use blk_stack_limits() to stack block limits (including topology) rather than duplicate the equivalent within Device Mapper. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: validate device logical_block_sizeMike Snitzer2009-06-221-0/+69
| | | | | | | | | Impose necessary and sufficient conditions on a devices's table such that any incoming bio which respects its logical_block_size can be processed successfully. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: ensure targets are aligned to logical_block_sizeMike Snitzer2009-06-221-14/+44
| | | | | | | | | | | Ensure I/O is aligned to the logical block size of target devices. Rename check_device_area() to device_area_is_valid() for clarity and establish the device limits including the logical block size prior to calling it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: improve warning message when devices not freed before destructionJonthan Brassow2009-06-221-5/+3
| | | | | | | Report any devices forgotten to be freed before a table is destroyed. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: use i_size_readMikulas Patocka2009-06-221-1/+2
| | | | | | | | | | | | | | | Use i_size_read() instead of reading i_size. If someone changes the size of the device simultaneously, i_size_read is guaranteed to return a valid value (either the old one or the new one). i_size can return some intermediate invalid value (on 32-bit computers with 64-bit i_size, the reads to both halves of i_size can be interleaved with updates to i_size, resulting in garbage being returned). Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* Revert "block: Fix bounce limit setting in DM"Jens Axboe2009-06-091-1/+1
| | | | | | | | This reverts commit a05c0205ba031c01bba33a21bf0a35920eb64833. DM doesn't need to access the bounce_pfn directly. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: Fix bounce limit setting in DMMartin K. Petersen2009-06-031-1/+1
| | | | | | | | | | blk_queue_bounce_limit() is more than a wrapper about the request queue limits.bounce_pfn variable. Introduce blk_queue_bounce_pfn() which can be called by stacking drivers that wish to set the bounce limit explicitly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: Use accessor functions for queue limitsMartin K. Petersen2009-05-221-14/+14
| | | | | | | | Convert all external users of queue limits to using wrapper functions instead of poking the request queue variables directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: Do away with the notion of hardsect_sizeMartin K. Petersen2009-05-221-5/+7
| | | | | | | | | | | | | | Until now we have had a 1:1 mapping between storage device physical block size and the logical block sized used when addressing the device. With SATA 4KB drives coming out that will no longer be the case. The sector size will be 4KB but the logical block size will remain 512-bytes. Hence we need to distinguish between the physical block size and the logical ditto. This patch renames hardsect_size to logical_block_size. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* dm: remove limited barrier supportMikulas Patocka2009-04-091-19/+0
| | | | | | | Prepare for full barrier implementation: first remove the restricted support. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: add integrity supportMartin K. Petersen2009-04-091-0/+40
| | | | | | | | | | | | | | | | This patch provides support for data integrity passthrough in the device mapper. - If one or more component devices support integrity an integrity profile is preallocated for the DM device. - If all component devices have compatible profiles the DM device is flagged as capable. - Handle integrity metadata when splitting and cloning bios. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: fix upgrade mode raceAlasdair G Kergon2009-04-021-12/+14
| | | | | | | | | | | | | | upgrade_mode() sets bdev to NULL temporarily, and does not have any locking to exclude anything from seeing that NULL. In dm_table_any_congested() bdev_get_queue() can dereference that NULL and cause a reported oops. Fix this by not changing that field during the mode upgrade. Cc: stable@kernel.org Cc: Neil Brown <neilb@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: rework reference countingMikulas Patocka2009-01-061-5/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rework table reference counting. The existing code uses a reference counter. When the last reference is dropped and the counter reaches zero, the table destructor is called. Table reference counters are acquired/released from upcalls from other kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all). If the reference counter reaches zero in one of the upcalls, the table destructor is called from almost random kernel code. This leads to various problems: * dm_any_congested being called under a spinlock, which calls the destructor, which calls some sleeping function. * the destructor attempting to take a lock that is already taken by the same process. * stale reference from some other kernel code keeps the table constructed, which keeps some devices open, even after successful return from "dmsetup remove". This can confuse lvm and prevent closing of underlying devices or reusing device minor numbers. The patch changes reference counting so that the table destructor can be called only at predetermined places. The table has always exactly one reference from either mapped_device->map or hash_cell->new_map. After this patch, this reference is not counted in table->holders. A pair of dm_create_table/dm_destroy_table functions is used for table creation/destruction. Temporary references from the other code increase table->holders. A pair of dm_table_get/dm_table_put functions is used to manipulate it. When the table is about to be destroyed, we wait for table->holders to reach 0. Then, we call the table destructor. We use active waiting with msleep(1), because the situation happens rarely (to one user in 5 years) and removing the device isn't performance-critical task: the user doesn't care if it takes one tick more or not. This way, the destructor is called only at specific points (dm_table_destroy function) and the above problems associated with lazy destruction can't happen. Finally remove the temporary protection added to dm_any_congested(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: support barriers on simple devicesAndi Kleen2009-01-061-0/+19
| | | | | | | | | | | | | | | | | Implement barrier support for single device DM devices This patch implements barrier support in DM for the common case of dm linear just remapping a single underlying device. In this case we can safely pass the barrier through because there can be no reordering between devices. NB. Any DM device might cease to support barriers if it gets reconfigured so code must continue to allow for a possible -EOPNOTSUPP on every barrier bio submitted. - agk Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* block: fix setting of max_segment_size and seg_boundary maskMilan Broz2008-12-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix setting of max_segment_size and seg_boundary mask for stacked md/dm devices. When stacking devices (LVM over MD over SCSI) some of the request queue parameters are not set up correctly in some cases by default, namely max_segment_size and and seg_boundary mask. If you create MD device over SCSI, these attributes are zeroed. Problem become when there is over this mapping next device-mapper mapping - queue attributes are set in DM this way: request_queue max_segment_size seg_boundary_mask SCSI 65536 0xffffffff MD RAID1 0 0 LVM 65536 -1 (64bit) Unfortunately bio_add_page (resp. bio_phys_segments) calculates number of physical segments according to these parameters. During the generic_make_request() is segment cout recalculated and can increase bio->bi_phys_segments count over the allowed limit. (After bio_clone() in stack operation.) Thi is specially problem in CCISS driver, where it produce OOPS here BUG_ON(creq->nr_phys_segments > MAXSGENTRIES); (MAXSEGENTRIES is 31 by default.) Sometimes even this command is enough to cause oops: dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10 This command generates bios with 250 sectors, allocated in 32 4k-pages (last page uses only 1024 bytes). For LVM layer, it allocates bio with 31 segments (still OK for CCISS), unfortunatelly on lower layer it is recalculated to 32 segments and this violates CCISS restriction and triggers BUG_ON(). The patch tries to fix it by: * initializing attributes above in queue request constructor blk_queue_make_request() * make sure that blk_queue_stack_limits() inherits setting (DM uses its own function to set the limits because it blk_queue_stack_limits() was introduced later. It should probably switch to use generic stack limit function too.) * sets the default seg_boundary value in one place (blkdev.h) * use this mask as default in DM (instead of -1, which differs in 64bit) Bugs related to this: https://bugzilla.redhat.com/show_bug.cgi?id=471639 http://bugzilla.kernel.org/show_bug.cgi?id=8672 Signed-off-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Tejun Heo <htejun@gmail.com> Cc: Mike Miller <mike.miller@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdevLinus Torvalds2008-10-231-8/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits) [PATCH] kill the rest of struct file propagation in block ioctls [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET [PATCH] get rid of blkdev_locked_ioctl() [PATCH] get rid of blkdev_driver_ioctl() [PATCH] sanitize blkdev_get() and friends [PATCH] remember mode of reiserfs journal [PATCH] propagate mode through swsusp_close() [PATCH] propagate mode through open_bdev_excl/close_bdev_excl [PATCH] pass fmode_t to blkdev_put() [PATCH] kill the unused bsize on the send side of /dev/loop [PATCH] trim file propagation in block/compat_ioctl.c [PATCH] end of methods switch: remove the old ones [PATCH] switch sr [PATCH] switch sd [PATCH] switch ide-scsi [PATCH] switch tape_block [PATCH] switch dcssblk [PATCH] switch dasd [PATCH] switch mtd_blkdevs [PATCH] switch mmc ...
| * [PATCH] pass fmode_t to blkdev_put()Al Viro2008-10-211-2/+2
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * [PATCH] introduce fmode_t, do annotationsAl Viro2008-10-211-6/+6
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>