path: root/fs/reiserfs/inode.c
Commit message [Author, Age, Files, Lines]
* reiserfs: Move quota calls out of write lock [Jan Kara, 2012-11-26, 1 file, -3/+7]
| commit 7af11686933726e99af22901d622f9e161404e6b upstream.
|
| Calls into highlevel quota code cannot happen under the write lock. These calls take dqio_mutex which ranks above write lock. So drop write lock before calling back into quota code.
|
| Signed-off-by: Jan Kara <jack@suse.cz>
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
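The ordering rule in the commit above can be sketched in userspace C. This is a toy model, not the reiserfs patch: the ranked_lock type, take(), drop(), quota_call() and both fs_op_* functions are hypothetical names, and the "rank" check only imitates what a lock-order validator would complain about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock with a rank: a lock may only be taken while every lock already
 * held has a smaller rank number (ranks "above" it). Hypothetical model. */
struct ranked_lock {
    int rank;
    bool held;
};

static struct ranked_lock dqio_mutex = { 1, false }; /* ranks above write_lock */
static struct ranked_lock write_lock = { 2, false };
static struct ranked_lock *all_locks[] = { &dqio_mutex, &write_lock };

/* Largest rank number among the locks currently held, 0 if none. */
static int deepest_held_rank(void)
{
    int deepest = 0;
    for (size_t i = 0; i < sizeof(all_locks) / sizeof(all_locks[0]); i++)
        if (all_locks[i]->held && all_locks[i]->rank > deepest)
            deepest = all_locks[i]->rank;
    return deepest;
}

/* Take the lock, or report an ordering violation without taking it. */
static bool take(struct ranked_lock *l)
{
    if (l->rank <= deepest_held_rank())
        return false;
    l->held = true;
    return true;
}

static void drop(struct ranked_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Stand-in for a high-level quota call that needs dqio_mutex. */
static int quota_call(void)
{
    if (!take(&dqio_mutex))
        return -1;              /* would invert the lock order */
    /* ... quota bookkeeping would happen here ... */
    drop(&dqio_mutex);
    return 0;
}

/* Before the fix: quota code is entered with the write lock held. */
static int fs_op_under_write_lock(void)
{
    int err;
    take(&write_lock);
    err = quota_call();
    drop(&write_lock);
    return err;
}

/* After the fix: drop the write lock around the quota call, then retake it. */
static int fs_op_lock_dropped(void)
{
    int err;
    take(&write_lock);
    /* ... filesystem work under the write lock ... */
    drop(&write_lock);
    err = quota_call();
    take(&write_lock);
    /* ... remaining filesystem work ... */
    drop(&write_lock);
    return err;
}
```

In this model fs_op_under_write_lock() reports the inversion while fs_op_lock_dropped() succeeds. The model is single-threaded on purpose; it shows the ordering argument only, not real mutual exclusion.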
* tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking [Hugh Dickins, 2012-10-21, 1 file, -1/+5]
| commit 35c2a7f4908d404c9124c2efc6ada4640ca4d5d5 upstream.
|
| Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
|
|     u64 inum = fid->raw[2];
|
| which is unhelpfully reported as at the end of shmem_alloc_inode():
|
|     BUG: unable to handle kernel paging request at ffff880061cd3000
|     IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
|     Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
|     Call Trace:
|      [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
|      [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
|      [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
|      [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6
|
| Right, tmpfs is being stupid to access fid->raw[2] before validating that fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may fall at the end of a page, and the next page not be present.
|
| But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and could oops in the same way: add the missing fh_len checks to those.
|
| Reported-by: Sasha Levin <levinsasha928@gmail.com>
| Signed-off-by: Hugh Dickins <hughd@google.com>
| Cc: Al Viro <viro@zeniv.linux.org.uk>
| Cc: Sage Weil <sage@inktank.com>
| Cc: Steven Whitehouse <swhiteho@redhat.com>
| Cc: Christoph Hellwig <hch@infradead.org>
| Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
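The kind of check this commit adds can be illustrated with a small userspace sketch. It is not the kernel code: demo_fh_to_inum() and its parameters are hypothetical, and fh_len is assumed here to count the 32-bit words of raw[] that the caller supplied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical decoder: reads the word at raw[2] only after confirming
 * that the handle is long enough to contain it.
 * Returns 0 and stores the value in *inum, or -1 if the handle is too short.
 */
static int demo_fh_to_inum(const uint32_t *raw, int fh_len, uint64_t *inum)
{
    assert(inum != NULL);

    /* raw[2] exists only when at least three words were supplied. */
    if (fh_len < 3)
        return -1;

    *inum = raw[2];
    return 0;
}
```

Without the length test, a two-word handle would make the function read past the end of the caller's buffer, which is the out-of-bounds access the oops above was reporting.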
* Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block [Linus Torvalds, 2011-03-24, 1 file, -1/+0]
|\
| | * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
| |   Documentation/iostats.txt: bit-size reference etc.
| |   cfq-iosched: removing unnecessary think time checking
| |   cfq-iosched: Don't clear queue stats when preempt.
| |   blk-throttle: Reset group slice when limits are changed
| |   blk-cgroup: Only give unaccounted_time under debug
| |   cfq-iosched: Don't set active queue in preempt
| |   block: fix non-atomic access to genhd inflight structures
| |   block: attempt to merge with existing requests on plug flush
| |   block: NULL dereference on error path in __blkdev_get()
| |   cfq-iosched: Don't update group weights when on service tree
| |   fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
| |   block: Require subsystems to explicitly allocate bio_set integrity mempool
| |   jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
| |   jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
| |   fs: make fsync_buffers_list() plug
| |   mm: make generic_writepages() use plugging
| |   blk-cgroup: Add unaccounted time to timeslice_used.
| |   block: fixup plugging stubs for !CONFIG_BLOCK
| |   block: remove obsolete comments for blkdev_issue_zeroout.
| |   blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
| |   ...
| |
| | Fix up conflicts in fs/{aio.c,super.c}
| * block: remove per-queue plugging [Jens Axboe, 2011-03-10, 1 file, -1/+0]
| | Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So let's kill off the old plugging along with aops->sync_page().
| |
| | Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | exportfs: Return the minimum required handle size [Aneesh Kumar K.V, 2011-03-14, 1 file, -1/+6]
|/
| The exportfs encode handle function should return the minimum required handle size. This helps the user find out the handle size by passing a handle size of 0 in the first step and then redoing the call with the returned handle size value.
|
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
| Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
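The two-step "ask for the size, then retry" pattern described above might look roughly like this in userspace C. Everything here is a hypothetical stand-in: demo_encode_handle(), DEMO_HANDLE_WORDS and the -1 return value are assumptions for illustration and do not reproduce the kernel's encode_fh signature or return codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HANDLE_WORDS 3   /* assumed minimum handle size, in 32-bit words */

/*
 * Hypothetical encoder. On entry *len is the capacity of buf in words.
 * On return *len always holds the minimum size the handle needs, so a
 * caller that passed too small a buffer learns how much to provide.
 * Returns 0 on success, -1 if the buffer was too small.
 */
static int demo_encode_handle(uint32_t *buf, int *len, uint32_t ino, uint32_t gen)
{
    assert(len != NULL);

    if (*len < DEMO_HANDLE_WORDS) {
        *len = DEMO_HANDLE_WORDS;   /* report the required size */
        return -1;
    }

    buf[0] = ino;
    buf[1] = gen;
    buf[2] = 0;
    *len = DEMO_HANDLE_WORDS;
    return 0;
}
```

A caller would first call with *len set to 0, read back the required size, allocate that many words, and call again.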
* BKL: remove extraneous #include <smp_lock.h> [Arnd Bergmann, 2010-11-17, 1 file, -1/+0]
| The big kernel lock has been removed from all these files at some point, leaving only the #include.
|
| Remove this too as a cleanup.
|
| Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 [Linus Torvalds, 2010-10-26, 1 file, -13/+11]
|\
| | * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
| |   split invalidate_inodes()
| |   fs: skip I_FREEING inodes in writeback_sb_inodes
| |   fs: fold invalidate_list into invalidate_inodes
| |   fs: do not drop inode_lock in dispose_list
| |   fs: inode split IO and LRU lists
| |   fs: switch bdev inode bdi's correctly
| |   fs: fix buffer invalidation in invalidate_list
| |   fsnotify: use dget_parent
| |   smbfs: use dget_parent
| |   exportfs: use dget_parent
| |   fs: use RCU read side protection in d_validate
| |   fs: clean up dentry lru modification
| |   fs: split __shrink_dcache_sb
| |   fs: improve DCACHE_REFERENCED usage
| |   fs: use percpu counter for nr_dentry and nr_dentry_unused
| |   fs: simplify __d_free
| |   fs: take dcache_lock inside __d_path
| |   fs: do not assign default i_ino in new_inode
| |   fs: introduce a per-cpu last_ino allocator
| |   new helper: ihold()
| |   ...
| * fs: kill block_prepare_write [Christoph Hellwig, 2010-10-25, 1 file, -13/+11]
| | __block_write_begin and block_prepare_write are identical except for slightly different calling conventions. Convert all callers to the __block_write_begin calling conventions and drop block_prepare_write.
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | writeback: remove nonblocking/encountered_congestion references [Wu Fengguang, 2010-10-26, 1 file, -1/+1]
|/
| This removes more dead code that was somehow missed by commit 0d99519efef (writeback: remove unused nonblocking and congestion checks). There is no behavior change except for the removal of two entries from one of the ext4 tracing interfaces.
|
| The nonblocking checks in ->writepages are no longer used because the flusher now prefers to block on get_request_wait() rather than skip inodes on IO congestion. The latter will lead to more seeky IO.
|
| The nonblocking checks in ->writepage are no longer used because they are redundant with the WB_SYNC_NONE check.
|
| We no longer set ->nonblocking in VM page out and page migration, because
|
| a) it's effectively redundant with WB_SYNC_NONE in current code
| b) its old semantic of "Don't get stuck on request queues" is mis-behavior: that would skip some dirty inodes on congestion and page out others, which is unfair in terms of LRU age.
|
| Inspired by Christoph Hellwig. Thanks!
|
| Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
| Cc: Theodore Ts'o <tytso@mit.edu>
| Cc: David Howells <dhowells@redhat.com>
| Cc: Sage Weil <sage@newdream.net>
| Cc: Steve French <sfrench@samba.org>
| Cc: Chris Mason <chris.mason@oracle.com>
| Cc: Jens Axboe <axboe@kernel.dk>
| Cc: Christoph Hellwig <hch@infradead.org>
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fix reiserfs_evict_inode end_writeback second call [Sergey Senozhatsky, 2010-08-18, 1 file, -0/+1]
| reiserfs_evict_inode calls end_writeback two times, hitting kernel BUG at fs/inode.c:298 because inode->i_state is I_CLEAR already.
|
| Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
| Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 [Linus Torvalds, 2010-08-10, 1 file, -54/+80]
|\
| | * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
| |   no need for list_for_each_entry_safe()/resetting with superblock list
| |   Fix sget() race with failing mount
| |   vfs: don't hold s_umount over close_bdev_exclusive() call
| |   sysv: do not mark superblock dirty on remount
| |   sysv: do not mark superblock dirty on mount
| |   btrfs: remove junk sb_dirt change
| |   BFS: clean up the superblock usage
| |   AFFS: wait for sb synchronization when needed
| |   AFFS: clean up dirty flag usage
| |   cifs: truncate fallout
| |   mbcache: fix shrinker function return value
| |   mbcache: Remove unused features
| |   add f_flags to struct statfs(64)
| |   pass a struct path to vfs_statfs
| |   update VFS documentation for method changes.
| |   All filesystems that need invalidate_inode_buffers() are doing that explicitly
| |   convert remaining ->clear_inode() to ->evict_inode()
| |   Make ->drop_inode() just return whether inode needs to be dropped
| |   fs/inode.c:clear_inode() is gone
| |   fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
| |   ...
| |
| | Fix up trivial conflicts in fs/nilfs2/super.c
| * convert reiserfs to ->evict_inode() [Al Viro, 2010-08-09, 1 file, -3/+10]
| | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * always call inode_change_ok early in ->setattr [Christoph Hellwig, 2010-08-09, 1 file, -4/+4]
| | Make sure we call inode_change_ok before doing any changes in ->setattr, and make sure to call it even if our fs wants to ignore normal UNIX permissions, but use the ATTR_FORCE to skip those.
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * remove inode_setattr [Christoph Hellwig, 2010-08-09, 1 file, -45/+52]
| | Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence.
| |
| | In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant:
| |
| |   spufs: explicitly checks for ATTR_SIZE earlier
| |   btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
| |   ufs: contains an opencoded simple_setattr + truncate that sets the filesize just above
| |
| | In addition to that, ncpfs called inode_setattr with handcrafted iattrs, which allowed trimming down the opencoded variant.
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * introduce __block_write_begin [Christoph Hellwig, 2010-08-09, 1 file, -2/+1]
| | Split up the block_write_begin implementation - __block_write_begin is a new trivial wrapper for block_prepare_write that always takes an already allocated page and can be either called from block_write_begin or filesystem code that already has a page allocated. Remove the handling of already allocated pages from block_write_begin after switching all callers that do it to __block_write_begin.
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * sort out blockdev_direct_IO variants [Christoph Hellwig, 2010-08-09, 1 file, -1/+16]
| | Move the call to vmtruncate to get rid of excessive blocks to the callers in preparation of the new truncate calling sequence. This was only done for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant was not needed anyway. Get rid of blockdev_direct_IO_no_locking and its _newtrunc variant while at it, as just opencoding the two additional parameters is shorter than the name suffix.
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * Fix reiserfs_file_release() [Al Viro, 2010-08-09, 1 file, -2/+0]
| | a) count file openers correctly; i_count use was completely wrong
| | b) use new mutex for exclusion between final close/open/truncate, to protect tailpacking logics. i_mutex use was wrong and resulted in deadlocks.
| |
| | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | fix typos concerning "initiali[zs]e" [Uwe Kleine-König, 2010-06-16, 1 file, -1/+1]
|/
| Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
| Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* quota: unify quota init condition in setattr [Dmitry Monakhov, 2010-05-21, 1 file, -1/+2]
| Quota must be initialized if size or uid/gid changes are requested. But initialization is performed in two different places: in the case of i_size the file system is responsible for dquot init, but in the case of uid/gid, init will be called internally in dquot_transfer(). This ambiguity makes the code harder to understand. Let's move this logic to one common helper function.
|
| Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
| Signed-off-by: Jan Kara <jack@suse.cz>
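A common helper of the kind this commit describes could be sketched as follows. The struct layouts, DEMO_ATTR_* flags and function names are all hypothetical userspace stand-ins, and comparing the requested values against the current ones is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "which fields are being set" flags. */
#define DEMO_ATTR_UID  (1u << 0)
#define DEMO_ATTR_GID  (1u << 1)
#define DEMO_ATTR_SIZE (1u << 2)

struct demo_inode { unsigned uid, gid; long long size; };
struct demo_attr  { unsigned valid; unsigned uid, gid; long long size; };

/* Single place that decides whether a setattr request touches quota. */
static bool demo_is_quota_modification(const struct demo_inode *inode,
                                       const struct demo_attr *attr)
{
    return ((attr->valid & DEMO_ATTR_SIZE) && attr->size != inode->size) ||
           ((attr->valid & DEMO_ATTR_UID)  && attr->uid  != inode->uid)  ||
           ((attr->valid & DEMO_ATTR_GID)  && attr->gid  != inode->gid);
}

static int quota_inits;   /* counts runs of the stand-in quota initializer */

/* Stand-in setattr: initialize quota once, up front, for every case. */
static void demo_setattr(struct demo_inode *inode, const struct demo_attr *attr)
{
    assert(inode != NULL && attr != NULL);

    if (demo_is_quota_modification(inode, attr))
        quota_inits++;    /* the real code would initialize dquots here */

    if (attr->valid & DEMO_ATTR_SIZE)
        inode->size = attr->size;
    if (attr->valid & DEMO_ATTR_UID)
        inode->uid = attr->uid;
    if (attr->valid & DEMO_ATTR_GID)
        inode->gid = attr->gid;
}
```

The point of the design is that size changes and ownership changes go through the same test, so no caller has to remember which of the two paths initializes quota on its own.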
* include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h [Tejun Heo, 2010-03-30, 1 file, -0/+1]
| percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies.
|
| percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities to include those headers directly instead of assuming availability. As this conversion needs to touch a large number of source files, the following script is used as the basis of conversion.
|
|   http://userweb.kernel.org/~tj/misc/slabh-sweep.py
|
| The script does the following.
|
| * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h.
|
| * When the script inserts a new include, it looks at the include blocks and tries to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order.
|
| * If the script can't find a place to put a new include (mostly because the file doesn't have a fitting include block), it prints out an error message indicating which .h file needs to be added to the file.
|
| The conversion was done in the following steps.
|
| 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files.
|
| 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files.
|
| 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind.
|
| 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually.
|
| 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually widely available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary.
|
| 6. percpu.h was updated not to include slab.h.
|
| 7. Build tests were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq).
|
|    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
|    * powerpc and powerpc64 SMP allmodconfig
|    * sparc and sparc64 SMP allmodconfig
|    * ia64 SMP allmodconfig
|    * s390 SMP allmodconfig
|    * alpha SMP allmodconfig
|    * um on x86_64 SMP allmodconfig
|
| 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point.
|
| Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable on most builds of the specific arch.
|
| Signed-off-by: Tejun Heo <tj@kernel.org>
| Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
| Cc: Ingo Molnar <mingo@redhat.com>
| Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 [Linus Torvalds, 2010-03-05, 1 file, -8/+12]
|\
| | * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
| |   quota: stop using QUOTA_OK / NO_QUOTA
| |   dquot: cleanup dquot initialize routine
| |   dquot: move dquot initialization responsibility into the filesystem
| |   dquot: cleanup dquot drop routine
| |   dquot: move dquot drop responsibility into the filesystem
| |   dquot: cleanup dquot transfer routine
| |   dquot: move dquot transfer responsibility into the filesystem
| |   dquot: cleanup inode allocation / freeing routines
| |   dquot: cleanup space allocation / freeing routines
| |   ext3: add writepage sanity checks
| |   ext3: Truncate allocated blocks if direct IO write fails to update i_size
| |   quota: Properly invalidate caches even for filesystems with blocksize < pagesize
| |   quota: generalize quota transfer interface
| |   quota: sb_quota state flags cleanup
| |   jbd: Delay discarding buffers in journal_unmap_buffer
| |   ext3: quota_write cross block boundary behaviour
| |   quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
| |   quota: split out compat_sys_quotactl support from quota.c
| |   quota: split out netlink notification support from quota.c
| |   quota: remove invalid optimization from quota_sync_all
| |   ...
| |
| | Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
| * dquot: cleanup dquot initialize routine [Christoph Hellwig, 2010-03-05, 1 file, -3/+3]
| | Get rid of the initialize dquot operation - it is now always called from the filesystem and if a filesystem really needs its own (which none currently does) it can just call into its own routine directly.
| |
| | Rename the now static low-level dquot_initialize helper to __dquot_initialize and vfs_dq_init to dquot_initialize to have a consistent namespace.
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Jan Kara <jack@suse.cz>
| * dquot: move dquot initialization responsibility into the filesystem [Christoph Hellwig, 2010-03-05, 1 file, -0/+5]
| | Currently various places in the VFS call vfs_dq_init directly. This means we tie the quota code into the VFS. Get rid of that and make the filesystem responsible for the initialization. For most metadata operations this is a straightforward move into the methods, but for truncate and open it's a bit more complicated.
| |
| | For truncate we currently only call vfs_dq_init for the sys_truncate case because open already takes care of it for ftruncate and open(O_TRUNC) - the new code causes an additional vfs_dq_init for those which is harmless.
| |
| | For open the initialization is moved from do_filp_open into the open method, which means it happens slightly earlier now, and only for regular files. The latter is fine because we don't need to initialize it for operations on special files, and we already do it as part of the namespace operations for directories.
| |
| | Add a dquot_file_open helper that filesystems that support generic quotas can use to fill in ->open.
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Jan Kara <jack@suse.cz>
| * dquot: cleanup dquot drop routine [Christoph Hellwig, 2010-03-05, 1 file, -1/+1]
| | Get rid of the drop dquot operation - it is now always called from the filesystem and if a filesystem really needs its own (which none currently does) it can just call into its own routine directly.
| |
| | Rename the now static low-level dquot_drop helper to __dquot_drop and vfs_dq_drop to dquot_drop to have a consistent namespace.
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Jan Kara <jack@suse.cz>
| * dquot: cleanup dquot transfer routine [Christoph Hellwig, 2010-03-05, 1 file, -2/+1]
| | Get rid of the transfer dquot operation - it is now always called from the filesystem and if a filesystem really needs its own (which none currently does) it can just call into its own routine directly.
| |
| | Rename the now static low-level dquot_transfer helper to __dquot_transfer and vfs_dq_transfer to dquot_transfer to have a consistent namespace, and make the new dquot_transfer return a normal negative errno value which all callers expect.
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Jan Kara <jack@suse.cz>
| * dquot: cleanup inode allocation / freeing routines [Christoph Hellwig, 2010-03-05, 1 file, -5/+5]
| | Get rid of the alloc_inode and free_inode dquot operations - they are always called from the filesystem and if a filesystem really needs its own (which none currently does) it can just call into its own routine directly.
| |
| | Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always call the lowlevel dquot_alloc_inode / dquot_free_inode routines directly, which now lose the number argument which is always 1.
| |
| | Signed-off-by: Christoph Hellwig <hch@lst.de>
| | Signed-off-by: Jan Kara <jack@suse.cz>
* | pass writeback_control to ->write_inode [Christoph Hellwig, 2010-03-05, 1 file, -2/+2]
|/
| This gives the filesystem more information about the writeback that is happening. Trond requested this for the NFS unstable write handling, and other filesystems might benefit from this too by being able to distinguish between the different callers in more detail.
|
| Signed-off-by: Christoph Hellwig <hch@lst.de>
| Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* reiserfs: Fix softlockup while waiting on an inode [Frederic Weisbecker, 2010-02-14, 1 file, -0/+2]
| When we wait for an inode through reiserfs_iget(), we hold the reiserfs lock. And waiting for an inode may imply waiting for its writeback. But the inode writeback path may also require the reiserfs lock, which leads to a deadlock.
|
| We just need to release the reiserfs lock from reiserfs_iget() to fix this.
|
| Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
| Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
| Tested-by: Christian Kujau <lists@nerdbynature.de>
| Cc: Chris Mason <chris.mason@oracle.com>
* Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing [Linus Torvalds, 2010-01-08, 1 file, -4/+15]
|\
| | * 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
| |   reiserfs: Relax reiserfs_xattr_set_handle() while acquiring xattr locks
| |   reiserfs: Fix unreachable statement
| |   reiserfs: Don't call reiserfs_get_acl() with the reiserfs lock
| |   reiserfs: Relax lock on xattr removing
| |   reiserfs: Relax the lock before truncating pages
| |   reiserfs: Fix recursive lock on lchown
| |   reiserfs: Fix mistake in down_write() conversion
| * reiserfs: Relax the lock before truncating pages [Frederic Weisbecker, 2010-01-05, 1 file, -1/+10]
| | While truncating a file, reiserfs_setattr() calls inode_setattr() that will truncate the mapping for the given inode, but for that it needs the pages locks.
| |
| | In order to release these, the owners need the reiserfs lock to complete their jobs. But they can't, as we don't release it before calling inode_setattr().
| |
| | We need to do that to fix the following softlockups:
| |
| | INFO: task flush-8:0:2149 blocked for more than 120 seconds.
| | "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
| | flush-8:0 D f51af998 0 2149 2 0x00000000
| |  f51af9ac 00000092 00000002 f51af998 c2803304 00000000 c1894ad0 010f3000
| |  f51af9cc c1462604 c189ef80 f51af974 c1710304 f715b450 f715b5ec c2807c40
| |  00000000 0005bb00 c2803320 c102c55b c1710304 c2807c50 c2803304 00000246
| | Call Trace:
| |  [<c1462604>] ? schedule+0x434/0xb20
| |  [<c102c55b>] ? resched_task+0x4b/0x70
| |  [<c106fa22>] ? mark_held_locks+0x62/0x80
| |  [<c146414d>] ? mutex_lock_nested+0x1fd/0x350
| |  [<c14640b9>] mutex_lock_nested+0x169/0x350
| |  [<c1178cde>] ? reiserfs_write_lock+0x2e/0x40
| |  [<c1178cde>] reiserfs_write_lock+0x2e/0x40
| |  [<c11719a2>] do_journal_end+0xc2/0xe70
| |  [<c1172912>] journal_end+0xb2/0x120
| |  [<c11686b3>] ? pathrelse+0x33/0xb0
| |  [<c11729e4>] reiserfs_end_persistent_transaction+0x64/0x70
| |  [<c1153caa>] reiserfs_get_block+0x12ba/0x15f0
| |  [<c106fa22>] ? mark_held_locks+0x62/0x80
| |  [<c1154b24>] reiserfs_writepage+0xa74/0xe80
| |  [<c1465a27>] ? _raw_spin_unlock_irq+0x27/0x50
| |  [<c11f3d25>] ? radix_tree_gang_lookup_tag_slot+0x95/0xc0
| |  [<c10b5377>] ? find_get_pages_tag+0x127/0x1a0
| |  [<c106fa22>] ? mark_held_locks+0x62/0x80
| |  [<c106fcd4>] ? trace_hardirqs_on_caller+0x124/0x170
| |  [<c10bc1e0>] __writepage+0x10/0x40
| |  [<c10bc9ab>] write_cache_pages+0x16b/0x320
| |  [<c10bc1d0>] ? __writepage+0x0/0x40
| |  [<c10bcb88>] generic_writepages+0x28/0x40
| |  [<c10bcbd5>] do_writepages+0x35/0x40
| |  [<c11059f7>] writeback_single_inode+0xc7/0x330
| |  [<c11067b2>] writeback_inodes_wb+0x2c2/0x490
| |  [<c1106a86>] wb_writeback+0x106/0x1b0
| |  [<c1106cf6>] wb_do_writeback+0x106/0x1e0
| |  [<c1106c18>] ? wb_do_writeback+0x28/0x1e0
| |  [<c1106e0a>] bdi_writeback_task+0x3a/0xb0
| |  [<c10cbb13>] bdi_start_fn+0x63/0xc0
| |  [<c10cbab0>] ? bdi_start_fn+0x0/0xc0
| |  [<c105d1f4>] kthread+0x74/0x80
| |  [<c105d180>] ? kthread+0x0/0x80
| |  [<c100327a>] kernel_thread_helper+0x6/0x10
| | 3 locks held by flush-8:0/2149:
| |  #0: (&type->s_umount_key#30){+++++.}, at: [<c110676f>] writeback_inodes_wb+0x27f/0x490
| |  #1: (&journal->j_mutex){+.+...}, at: [<c117199a>] do_journal_end+0xba/0xe70
| |  #2: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1178cde>] reiserfs_write_lock+0x2e/0x40
| |
| | INFO: task fstest:3813 blocked for more than 120 seconds.
| | "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
| | fstest D 00000002 0 3813 3812 0x00000000
| |  f5103c94 00000082 f5103c40 00000002 f5ad5450 00000007 f5103c28 011f3000
| |  00000006 f5ad5450 c10bb005 00000480 c1710304 f5ad5450 f5ad55ec c2907c40
| |  00000001 f5ad5450 f5103c74 00000046 00000002 f5ad5450 00000007 f5103c6c
| | Call Trace:
| |  [<c10bb005>] ? free_hot_cold_page+0x1d5/0x280
| |  [<c1462d64>] io_schedule+0x74/0xc0
| |  [<c10b5a45>] sync_page+0x35/0x60
| |  [<c146325a>] __wait_on_bit_lock+0x4a/0x90
| |  [<c10b5a10>] ? sync_page+0x0/0x60
| |  [<c10b59e5>] __lock_page+0x85/0x90
| |  [<c105d660>] ? wake_bit_function+0x0/0x60
| |  [<c10bf654>] truncate_inode_pages_range+0x1e4/0x2d0
| |  [<c10bf75f>] truncate_inode_pages+0x1f/0x30
| |  [<c10bf7cf>] truncate_pagecache+0x5f/0xa0
| |  [<c10bf86a>] vmtruncate+0x5a/0x70
| |  [<c10fdb7d>] inode_setattr+0x5d/0x190
| |  [<c1150117>] reiserfs_setattr+0x1f7/0x2f0
| |  [<c1464569>] ? down_write+0x49/0x70
| |  [<c10fde01>] notify_change+0x151/0x330
| |  [<c10e6f3d>] do_truncate+0x6d/0xa0
| |  [<c10f4ce2>] do_filp_open+0x9a2/0xcf0
| |  [<c1465aec>] ? _raw_spin_unlock+0x2c/0x50
| |  [<c10fec50>] ? alloc_fd+0xe0/0x100
| |  [<c10e602d>] do_sys_open+0x6d/0x130
| |  [<c1002cfb>] ? sysenter_exit+0xf/0x16
| |  [<c10e615e>] sys_open+0x2e/0x40
| |  [<c1002ccc>] sysenter_do_call+0x12/0x32
| | 3 locks held by fstest/3813:
| |  #0: (&sb->s_type->i_mutex_key#4){+.+.+.}, at: [<c10e6f33>] do_truncate+0x63/0xa0
| |  #1: (&sb->s_type->i_alloc_sem_key#3){+.+.+.}, at: [<c10fdf07>] notify_change+0x257/0x330
| |  #2: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1178c8e>] reiserfs_write_lock_once+0x2e/0x50
| |
| | Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
| | Cc: Christian Kujau <lists@nerdbynature.de>
| | Cc: Alexander Beregalov <a.beregalov@gmail.com>
| | Cc: Chris Mason <chris.mason@oracle.com>
| | Cc: Ingo Molnar <mingo@elte.hu>
| * reiserfs: Fix recursive lock on lchown [Frederic Weisbecker, 2010-01-05, 1 file, -3/+5]
| | On chown, reiserfs will call reiserfs_setattr() to change the owner of the given inode, but it may also recursively call reiserfs_setattr() to propagate the owner change to the private xattr files for this inode.
| |
| | Hence, the reiserfs lock may be acquired twice which is not wanted as reiserfs_setattr() calls journal_begin() that is going to try to relax the lock in order to safely acquire the journal mutex.
| |
| | Using reiserfs_write_lock_once() from reiserfs_setattr() solves the problem.
| |
| | This fixes the following warning, that precedes a lockdep report.
| |
| | WARNING: at fs/reiserfs/lock.c:95 reiserfs_lock_check_recursive+0x3f/0x50()
| | Hardware name: MS-7418
| | Unwanted recursive reiserfs lock!
| | Pid: 4189, comm: fsstress Not tainted 2.6.33-rc2-tip-atom+ #195
| | Call Trace:
| |  [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50
| |  [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50
| |  [<c103f7ac>] warn_slowpath_common+0x6c/0xc0
| |  [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50
| |  [<c103f84b>] warn_slowpath_fmt+0x2b/0x30
| |  [<c1178bff>] reiserfs_lock_check_recursive+0x3f/0x50
| |  [<c1172ae3>] do_journal_begin_r+0x83/0x350
| |  [<c1172f2d>] journal_begin+0x7d/0x140
| |  [<c106509a>] ? in_group_p+0x2a/0x30
| |  [<c10fda71>] ? inode_change_ok+0x91/0x140
| |  [<c115007d>] reiserfs_setattr+0x15d/0x2e0
| |  [<c10f9bf3>] ? dput+0xe3/0x140
| |  [<c1465adc>] ? _raw_spin_unlock+0x2c/0x50
| |  [<c117831d>] chown_one_xattr+0xd/0x10
| |  [<c11780a3>] reiserfs_for_each_xattr+0x113/0x2c0
| |  [<c1178310>] ? chown_one_xattr+0x0/0x10
| |  [<c14641e9>] ? mutex_lock_nested+0x2a9/0x350
| |  [<c117826f>] reiserfs_chown_xattrs+0x1f/0x60
| |  [<c106509a>] ? in_group_p+0x2a/0x30
| |  [<c10fda71>] ? inode_change_ok+0x91/0x140
| |  [<c1150046>] reiserfs_setattr+0x126/0x2e0
| |  [<c1177c20>] ? reiserfs_getxattr+0x0/0x90
| |  [<c11b0d57>] ? cap_inode_need_killpriv+0x37/0x50
| |  [<c10fde01>] notify_change+0x151/0x330
| |  [<c10e659f>] chown_common+0x6f/0x90
| |  [<c10e67bd>] sys_lchown+0x6d/0x80
| |  [<c1002ccc>] sysenter_do_call+0x12/0x32
| | ---[ end trace 7c2b77224c1442fc ]---
| |
| | Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
| | Cc: Christian Kujau <lists@nerdbynature.de>
| | Cc: Alexander Beregalov <a.beregalov@gmail.com>
| | Cc: Chris Mason <chris.mason@oracle.com>
| | Cc: Ingo Molnar <mingo@elte.hu>
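Why a "lock once" variant helps with the nested reiserfs_setattr() call can be shown with a toy, single-threaded model of a recursive lock. The demo_* names, the owner/depth layout and the boolean "taken" token are hypothetical; the real reiserfs_write_lock_once() interface is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy recursive lock: owner 0 means nobody holds it. */
struct demo_fs_lock {
    int owner;
    int depth;
};

/* Plain recursive acquire: a nested call by the owner only bumps depth. */
static void demo_lock(struct demo_fs_lock *l, int task)
{
    assert(l->owner == 0 || l->owner == task); /* toy model: no contention */
    l->owner = task;
    l->depth++;
}

static void demo_unlock(struct demo_fs_lock *l)
{
    assert(l->depth > 0);
    if (--l->depth == 0)
        l->owner = 0;
}

/* "Once" variant: take the lock only if this task does not already hold
 * it, and tell the caller whether it was actually taken. */
static bool demo_lock_once(struct demo_fs_lock *l, int task)
{
    if (l->owner == task)
        return false;
    demo_lock(l, task);
    return true;
}

/* Undo demo_lock_once(): unlock only if that call really took the lock. */
static void demo_unlock_once(struct demo_fs_lock *l, bool taken)
{
    if (taken)
        demo_unlock(l);
}

/* Models "relaxing" the lock to acquire another mutex safely: a single
 * unlock frees the lock only when depth is 1. Returns true if the lock
 * was really released during the relax window. */
static bool demo_relax_releases(struct demo_fs_lock *l)
{
    int task = l->owner;
    bool released;

    demo_unlock(l);
    released = (l->owner == 0);
    demo_lock(l, task);
    return released;
}
```

With two plain demo_lock() calls the depth reaches 2 and demo_relax_releases() cannot free the lock; with demo_lock_once() on both the outer and the nested call the depth stays at 1 and the relax step works.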
* | Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing [Linus Torvalds, 2010-01-02, 1 file, -2/+3]
|\ \
| |/
| | * 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
| |   reiserfs: Safely acquire i_mutex from xattr_rmdir
| |   reiserfs: Safely acquire i_mutex from reiserfs_for_each_xattr
| |   reiserfs: Fix journal mutex <-> inode mutex lock inversion
| |   reiserfs: Fix unwanted recursive reiserfs lock in reiserfs_unlink()
| |   reiserfs: Relax lock before open xattr dir in reiserfs_xattr_set_handle()
| |   reiserfs: Relax reiserfs lock while freeing the journal
| |   reiserfs: Fix reiserfs lock <-> i_mutex dependency inversion on xattr
| |   reiserfs: Warn on lock relax if taken recursively
| |   reiserfs: Fix reiserfs lock <-> i_xattr_sem dependency inversion
| |   reiserfs: Fix remaining in-reclaim-fs <-> reclaim-fs-on locking inversion
| |   reiserfs: Fix reiserfs lock <-> inode mutex dependency inversion
| |   reiserfs: Fix reiserfs lock and journal lock inversion dependency
| |   reiserfs: Fix possible recursive lock
| * reiserfs: Fix reiserfs lock and journal lock inversion dependencyFrederic Weisbecker2009-12-141-2/+3
When we were using the bkl, we didn't care about dependencies against other locks, but the mutex conversion created new ones, which is why we have reiserfs_mutex_lock_safe(), which unlocks the reiserfs lock before acquiring another mutex.

But this trick actually fails if we have acquired the reiserfs lock recursively: we try to unlock it to acquire the new mutex without an inverted dependency, but we end up only decreasing its depth.

This happens in the case of a nested inode creation/deletion. Say we have no space left on the device: we create an inode and take the lock but fail to create its entry, then we release the inode using iput(), which calls reiserfs_delete_inode(), which takes the reiserfs lock recursively. The path eventually ends up in journal_begin(), where we try to take the journal safely but fail because of the reiserfs lock recursion:

[ INFO: possible circular locking dependency detected ]
2.6.32-06486-g053fe57 #2
-------------------------------------------------------
vi/23454 is trying to acquire lock:
 (&journal->j_mutex){+.+...}, at: [<c110dac4>] do_journal_begin_r+0x64/0x2f0
but task is already holding lock:
 (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c11106a8>] reiserfs_write_lock+0x28/0x40
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is: -> #1 (&REISERFS_SB(s)->lock){+.+.+.}: [<c104f8f3>] validate_chain+0xa23/0xf70 [<c1050325>] __lock_acquire+0x4e5/0xa70 [<c105092a>] lock_acquire+0x7a/0xa0 [<c134c78f>] mutex_lock_nested+0x5f/0x2b0 [<c11106a8>] reiserfs_write_lock+0x28/0x40 [<c110dacb>] do_journal_begin_r+0x6b/0x2f0 [<c110ddcf>] journal_begin+0x7f/0x120 [<c10f76c2>] reiserfs_remount+0x212/0x4d0 [<c1093997>] do_remount_sb+0x67/0x140 [<c10a9ca6>] do_mount+0x436/0x6b0 [<c10a9f86>] sys_mount+0x66/0xa0 [<c1002c50>] sysenter_do_call+0x12/0x36 -> #0 (&journal->j_mutex){+.+...}: [<c104fe38>] validate_chain+0xf68/0xf70 [<c1050325>] __lock_acquire+0x4e5/0xa70 [<c105092a>] lock_acquire+0x7a/0xa0 [<c134c78f>] mutex_lock_nested+0x5f/0x2b0 [<c110dac4>] do_journal_begin_r+0x64/0x2f0 [<c110ddcf>] journal_begin+0x7f/0x120 [<c10ef52f>] reiserfs_delete_inode+0x9f/0x140 [<c10a55fc>] generic_delete_inode+0x9c/0x150 [<c10a56ed>] generic_drop_inode+0x3d/0x60 [<c10a4607>] iput+0x47/0x50 [<c10e915c>] reiserfs_create+0x16c/0x1c0 [<c109a9c1>] vfs_create+0xc1/0x130 [<c109dbec>] do_filp_open+0x81c/0x920 [<c109004f>] do_sys_open+0x4f/0x110 [<c1090179>] sys_open+0x29/0x40 [<c1002c50>] sysenter_do_call+0x12/0x36 other info that might help us debug this: 2 locks held by vi/23454: #0: (&sb->s_type->i_mutex_key#5){+.+.+.}, at: [<c109d64e>] do_filp_open+0x27e/0x920 #1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c11106a8>] reiserfs_write_lock+0x28/0x40 stack backtrace: Pid: 23454, comm: vi Not tainted 2.6.32-06486-g053fe57 #2 Call Trace: [<c134b202>] ? printk+0x18/0x1e [<c104e960>] print_circular_bug+0xc0/0xd0 [<c104fe38>] validate_chain+0xf68/0xf70 [<c104ca9b>] ? trace_hardirqs_off+0xb/0x10 [<c1050325>] __lock_acquire+0x4e5/0xa70 [<c105092a>] lock_acquire+0x7a/0xa0 [<c110dac4>] ? do_journal_begin_r+0x64/0x2f0 [<c134c78f>] mutex_lock_nested+0x5f/0x2b0 [<c110dac4>] ? do_journal_begin_r+0x64/0x2f0 [<c110dac4>] ? do_journal_begin_r+0x64/0x2f0 [<c110ff80>] ? 
delete_one_xattr+0x0/0x1c0
 [<c110dac4>] do_journal_begin_r+0x64/0x2f0
 [<c110ddcf>] journal_begin+0x7f/0x120
 [<c11105b5>] ? reiserfs_delete_xattrs+0x15/0x50
 [<c10ef52f>] reiserfs_delete_inode+0x9f/0x140
 [<c10a55bf>] ? generic_delete_inode+0x5f/0x150
 [<c10ef490>] ? reiserfs_delete_inode+0x0/0x140
 [<c10a55fc>] generic_delete_inode+0x9c/0x150
 [<c10a56ed>] generic_drop_inode+0x3d/0x60
 [<c10a4607>] iput+0x47/0x50
 [<c10e915c>] reiserfs_create+0x16c/0x1c0
 [<c1099a5d>] ? inode_permission+0x7d/0xa0
 [<c109a9c1>] vfs_create+0xc1/0x130
 [<c10e8ff0>] ? reiserfs_create+0x0/0x1c0
 [<c109dbec>] do_filp_open+0x81c/0x920
 [<c104ca9b>] ? trace_hardirqs_off+0xb/0x10
 [<c134dc0d>] ? _spin_unlock+0x1d/0x20
 [<c10a6eea>] ? alloc_fd+0xba/0xf0
 [<c109004f>] do_sys_open+0x4f/0x110
 [<c1090179>] sys_open+0x29/0x40
 [<c1002c50>] sysenter_do_call+0x12/0x36

To fix this, use reiserfs_write_lock_once() in reiserfs_delete_inode(), which avoids adding reiserfs lock recursion.

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
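The failure mode described in this commit message can be illustrated with a small userspace sketch. This is not the kernel code: the names `fs_write_lock()`, `fs_write_unlock()` and `mutex_lock_safe()` are hypothetical, the write lock is modelled as a single global pthread mutex, and the recursion depth is kept in a thread-local counter rather than in superblock fields. The helper reports whether the write lock was really dropped while the second mutex was being acquired.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model: one global write lock and one "journal" mutex. */
static pthread_mutex_t write_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t journal_mutex = PTHREAD_MUTEX_INITIALIZER;
static _Thread_local int write_depth; /* recursion depth of this thread */

/* Recursive acquire: only the outermost call touches the mutex. */
static void fs_write_lock(void)
{
    if (write_depth++ == 0)
        pthread_mutex_lock(&write_mutex);
}

/* Returns true only when the underlying mutex was really released. */
static bool fs_write_unlock(void)
{
    assert(write_depth > 0);
    if (--write_depth > 0)
        return false;           /* depth decreased, lock still held */
    pthread_mutex_unlock(&write_mutex);
    return true;
}

/*
 * Drop the write lock, take `other`, then retake the write lock.
 * The return value tells whether the write lock was truly released
 * while `other` was acquired; false means the lock ordering was
 * still inverted, which is the situation lockdep complains about.
 */
static bool mutex_lock_safe(pthread_mutex_t *other)
{
    bool released = fs_write_unlock();
    pthread_mutex_lock(other);
    fs_write_lock();
    return released;
}
```

With the write lock held once, `mutex_lock_safe(&journal_mutex)` returns true in this model; with the lock held twice by the same thread it returns false, because the "unlock" only lowered the depth.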
* | reiserfs: truncate blocks not used by a writeJan Kara2009-12-171-4/+14
|/ | | | | | | | | | | | It can happen that write does not use all the blocks allocated in write_begin either because of some filesystem error (like ENOSPC) or because page with data to write has been removed from memory. We truncate these blocks so that we don't have dangling blocks beyond i_size. Cc: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kill-the-bkl/reiserfs: turn GFP_ATOMIC flag to GFP_NOFS in reiserfs_get_block()Frederic Weisbecker2009-11-201-1/+1
| | | | | | | | | | | | | | | GFP_ATOMIC was used in reiserfs_get_block to not lose the Bkl so that nobody can modify the tree in the middle of its work. Now that we kicked out the bkl, we can use a more friendly flag. We use GFP_NOFS here because we already hold the reiserfs lock. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Laurent Riffard <laurent.riffard@free.fr> Cc: Thomas Gleixner <tglx@linutronix.de>
* kill-the-bkl/reiserfs: drop the fs race watchdog from _get_block_create_0()Frederic Weisbecker2009-10-141-9/+2
| | | | | | | | | | | | | | | We had a watchdog in _get_block_create_0() that jumped to a fixup retry path in case the bkl got relaxed while calling kmap(). This is not necessary anymore since we now have a reiserfs lock that is not implicitly relaxed while sleeping. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Laurent Riffard <laurent.riffard@free.fr> Cc: Thomas Gleixner <tglx@linutronix.de>
* kill-the-bkl/reiserfs: fix recursive reiserfs write lock in ↵Frederic Weisbecker2009-09-141-9/+2
reiserfs_commit_write()

reiserfs_commit_write() is always called with the write lock held. Thus the current calls to reiserfs_write_lock() in this function acquire the lock recursively. We can safely drop them. This also satisfies later code that assumes the lock is really released when reiserfs_write_unlock() is called.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Laurent Riffard <laurent.riffard@free.fr>
* kill-the-bkl/reiserfs: factorize the locking in reiserfs_write_end()Frederic Weisbecker2009-09-141-10/+15
reiserfs_write_end() is a hot path in reiserfs. It contains two wasteful write lock acquire/release pairs that can be merged without changing the code logic. This patch factors them into a single protected section, reducing lock contention in this path.

[ Impact: reduce lock contention in a reiserfs hotpath ]

Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* kill-the-bkl/reiserfs: lock only once on reiserfs_get_block()Frederic Weisbecker2009-09-141-8/+11
reiserfs_get_block() is one of the sites where the write lock might be acquired recursively. It's a particular problem because this function is called very often. It's a hot spot which needs to call schedule() periodically while converting direct items to indirect ones, because that can take some time. If we apply the write lock release/reacquire pattern around schedule() here, it may not produce the desired effect, since we may hold the lock at more than one depth.

The solution is to use reiserfs_write_lock_once(), which won't try to reacquire the lock recursively. Then the lock will be *really* released before schedule().

Also, we only release the lock if TIF_NEED_RESCHED is set, to avoid creating needless contention.

[ Impact: fix a case where the lock is held too long in reiserfs_get_block() ]

Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* kill-the-BKL/reiserfs: lock only once in reiserfs_truncate_fileFrederic Weisbecker2009-09-141-3/+7
Impact: fix a deadlock

reiserfs_truncate_file() can be called from multiple contexts where the write lock may or may not already be held. This function also acquires (possibly recursively) the write lock. Subsequent releases before sleeping will not actually release the lock because we may be more than one lock depth deep.

A typical case is:

reiserfs_file_release {
    acquire_the_lock()
    reiserfs_truncate_file()
        reacquire_the_lock()
        journal_begin() {
            do_journal_begin_r() {
                reiserfs_wait_on_write_block() {
                    /*
                     * Not released because still one
                     * depth owned
                     */
                    release_lock()
                    wait_for_event()

At this stage the event never happens, because the task that provides it needs the write lock.

We use reiserfs_write_lock_once() here to ensure that we don't acquire the write lock recursively.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@texware.it>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
LKML-Reference: <1239680065-25013-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
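The "lock only once" idea used in the commits above can be sketched in userspace as follows. This is a minimal illustration under stated assumptions, not the kernel implementation: the function names mirror the commit message but take no superblock argument, the write lock is a single global pthread mutex, and the depth is a thread-local counter. The "once" variant never deepens the recursion; it only reports whether this call took the lock, so the matching unlock knows whether to release it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model: one global write lock plus a per-thread depth. */
static pthread_mutex_t write_mutex = PTHREAD_MUTEX_INITIALIZER;
static _Thread_local int write_depth;

/* Recursive acquire: only the outermost call touches the mutex. */
static void write_lock_recursive(void)
{
    if (write_depth++ == 0)
        pthread_mutex_lock(&write_mutex);
}

/* Recursive release: only the outermost call really unlocks. */
static void write_unlock_recursive(void)
{
    assert(write_depth > 0);
    if (--write_depth == 0)
        pthread_mutex_unlock(&write_mutex);
}

/*
 * "Once" acquire: take the lock only if this thread does not hold it
 * yet, and never increase the depth otherwise. Returns true if this
 * call took the lock.
 */
static bool write_lock_once(void)
{
    if (write_depth > 0)
        return false;
    write_lock_recursive();
    return true;
}

/* Release only if the matching write_lock_once() took the lock. */
static void write_unlock_once(bool took_lock)
{
    if (took_lock)
        write_unlock_recursive();
}
```

Because the depth stays at one in the nested case, a later release before sleeping drops the mutex for real, which is the property the truncate and get_block paths rely on.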
* reiserfs: kill-the-BKLFrederic Weisbecker2009-09-141-5/+18
This patch is an attempt to remove the BKL-based locking scheme from reiserfs. It is partly inspired by an old attempt by Peter Zijlstra: http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html

The BKL is heavily used in this filesystem to prevent concurrent write accesses to the filesystem.

Reiserfs makes heavy use of the specific properties of the BKL:

- It can be acquired recursively by the same task
- It is released on schedule() calls and reacquired when schedule() returns

The two properties above are a roadmap for the reiserfs write locking, so it's very hard to simply replace it with a common mutex.

- We need a lock that can be taken recursively unless we want to restructure several blocks of the code.
- We need to identify the sites where the BKL was implicitly relaxed (schedule, wait, sync, etc...) so that we can in turn release and reacquire our new lock explicitly. Such implicit releases of the lock are often required to let other resource producers/consumers do their job; otherwise we can suffer unexpected starvations or deadlocks.

So the new lock that replaces the BKL here is a per-superblock mutex with a specific property: it can be acquired recursively by the same task, like the BKL. For this purpose, we add a lock owner and a lock depth field to the superblock information structure.

The first axis of this patch is to turn the reiserfs_write_(un)lock() functions into wrappers that manage this mutex. Also, some explicit calls to lock_kernel() have been converted to reiserfs_write_lock() helpers.

The second axis is to find the important blocking sites (schedule...(), wait_on_buffer(), sync_dirty_buffer(), etc...) and then apply an explicit release of the write lock at these locations before blocking.
Then we can safely wait for those who can give us resources or those who need some. Typically this is a fight between the current writer, the reiserfs workqueue (aka the async committer) and the pdflush threads.

The third axis is a consequence of the second. The write lock is usually on top of a lock dependency chain which can include the journal lock, the flush lock or the commit lock. So it's dangerous to release and try to reacquire the write lock while we still hold other locks.

This is fine with the BKL:

      T1                       T2

lock_kernel()
    mutex_lock(A)
    unlock_kernel()
    // do something
                            lock_kernel()
                                mutex_lock(A) -> already locked by T1
                                schedule() (and then unlock_kernel())
    lock_kernel()
    mutex_unlock(A)
    ....

This is not fine with a mutex:

      T1                       T2

mutex_lock(write)
    mutex_lock(A)
    mutex_unlock(write)
    // do something
                            mutex_lock(write)
                                mutex_lock(A) -> already locked by T1
                                schedule()
    mutex_lock(write) -> already locked by T2
    deadlock

The solution in this patch is to provide a helper which releases the write lock and sleeps a bit if we can't lock a mutex that depends on it. It's another simulation of the BKL behaviour.

The last axis is to locate the fs callbacks that are called with the BKL held, according to Documentation/filesystems/Locking. Those are:

- reiserfs_remount
- reiserfs_fill_super
- reiserfs_put_super

Reiserfs didn't need to lock explicitly because of the context of these callbacks. But now we must take care of that with the new locking.

After this patch, reiserfs suffers from a slight performance regression (for now). On UP, a high volume write with dd reports an average of 27 MB/s instead of 30 MB/s without the patch applied.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Bron Gondwana <brong@fastmail.fm> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> LKML-Reference: <1239070789-13354-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
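The per-superblock mutex with a lock owner and a lock depth field, as described in this commit message, can be sketched in userspace with pthreads. This is only an illustrative model, not the reiserfs code: `struct fs_lock`, `fs_lock_init()`, `fs_write_lock()` and `fs_write_unlock()` are hypothetical names, and the small `guard` mutex protecting the owner and depth fields is an assumption added here so that the sketch is free of data races in portable C.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lock fields kept per superblock. */
struct fs_lock {
    pthread_mutex_t mutex;  /* the actual write lock */
    pthread_mutex_t guard;  /* protects owner, has_owner and depth */
    pthread_t owner;        /* valid only while has_owner is true */
    bool has_owner;
    int depth;              /* how many times the owner has locked */
};

static void fs_lock_init(struct fs_lock *l)
{
    pthread_mutex_init(&l->mutex, NULL);
    pthread_mutex_init(&l->guard, NULL);
    l->has_owner = false;
    l->depth = 0;
}

/* Acquire the write lock; a task that already owns it just goes deeper. */
static void fs_write_lock(struct fs_lock *l)
{
    pthread_mutex_lock(&l->guard);
    bool mine = l->has_owner && pthread_equal(l->owner, pthread_self());
    if (mine)
        l->depth++;
    pthread_mutex_unlock(&l->guard);
    if (mine)
        return;

    pthread_mutex_lock(&l->mutex);  /* may block on another owner */
    pthread_mutex_lock(&l->guard);
    l->owner = pthread_self();
    l->has_owner = true;
    l->depth = 1;
    pthread_mutex_unlock(&l->guard);
}

/* Release one level; the mutex is dropped only at depth zero. */
static void fs_write_unlock(struct fs_lock *l)
{
    pthread_mutex_lock(&l->guard);
    assert(l->has_owner && pthread_equal(l->owner, pthread_self()));
    bool last = (--l->depth == 0);
    if (last)
        l->has_owner = false;
    pthread_mutex_unlock(&l->guard);
    if (last)
        pthread_mutex_unlock(&l->mutex);
}
```

Unlike the BKL, nothing in this model releases the lock when the owner sleeps, which is why the blocking sites listed in the message need an explicit unlock and relock around them.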
* switch reiserfs to inode->i_aclAl Viro2009-06-241-4/+0
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'reiserfs-updates' from Jeff MahoneyLinus Torvalds2009-03-301-104/+99
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * reiserfs-updates: (35 commits) reiserfs: rename [cn]_* variables reiserfs: rename p_._ variables reiserfs: rename p_s_tb to tb reiserfs: rename p_s_inode to inode reiserfs: rename p_s_bh to bh reiserfs: rename p_s_sb to sb reiserfs: strip trailing whitespace reiserfs: cleanup path functions reiserfs: factor out buffer_info initialization reiserfs: add atomic addition of selinux attributes during inode creation reiserfs: use generic readdir for operations across all xattrs reiserfs: journaled xattrs reiserfs: use generic xattr handlers reiserfs: remove i_has_xattr_dir reiserfs: make per-inode xattr locking more fine grained reiserfs: eliminate per-super xattr lock reiserfs: simplify xattr internal file lookups/opens reiserfs: Clean up xattrs when REISERFS_FS_XATTR is unset reiserfs: remove IS_PRIVATE helpers reiserfs: remove link detection code ... Fixed up conflicts manually due to: - quota name cleanups vs variable naming changes: fs/reiserfs/inode.c fs/reiserfs/namei.c fs/reiserfs/stree.c fs/reiserfs/xattr.c - exported include header cleanups include/linux/reiserfs_fs.h
| * reiserfs: rename p_s_inode to inodeJeff Mahoney2009-03-301-21/+22
| | | | | | | | | | | | | | | | | | This patch is a simple s/p_s_inode/inode/g to the reiserfs code. This is the third in a series of patches to rip out some of the awful variable naming in reiserfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * reiserfs: strip trailing whitespaceJeff Mahoney2009-03-301-26/+26
| | | | | | | | | | | | | | This patch strips trailing whitespace from the reiserfs code. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * reiserfs: add atomic addition of selinux attributes during inode creationJeff Mahoney2009-03-301-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | Some time ago, some changes were made to make security inode attributes be atomically written during inode creation. ReiserFS fell behind in this area, but with the reworking of the xattr code, it's now fairly easy to add. The following patch adds the ability for security attributes to be added automatically during inode creation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * reiserfs: journaled xattrsJeff Mahoney2009-03-301-2/+1
Deadlocks are possible in the xattr code between the journal lock and the xattr sems.

This patch implements journalling for xattr operations. The benefit is twofold:

* It gets rid of the deadlock possibility by always ensuring that xattr write operations are initiated inside a transaction.
* It corrects the problem where xattr backing files aren't treated any differently from normal files, despite the fact that they are metadata.

I discussed the added journal load with Chris Mason, and we decided that since xattr activity (versus other journal activity) is fairly rare, the introduction of larger transactions to support journaled xattrs wouldn't be too big a deal.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * reiserfs: eliminate per-super xattr lockJeff Mahoney2009-03-301-13/+1
| | | | | | | | | | | | | | | | | | | | With the switch to using inode->i_mutex locking during lookups/creation in the xattr root, the per-super xattr lock is no longer needed. This patch removes it. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * reiserfs: remove IS_PRIVATE helpersJeff Mahoney2009-03-301-3/+2
There are a number of helper functions for marking a reiserfs inode private that were left over from when reiserfs did its own thing with respect to private inodes. S_PRIVATE has been in the kernel for some time, so this patch removes the helpers and uses IS_PRIVATE instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * reiserfs: use reiserfs_error()Jeff Mahoney2009-03-301-24/+21
This patch converts many paths that currently only issue warnings so that they handle the error instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>