aboutsummaryrefslogtreecommitdiffstats
path: root/fs/jbd/commit.c
Commit message (Collapse)AuthorAgeFilesLines
* jbd: Fix assertion failure in commit code due to lacking transaction creditsJan Kara2012-10-211-11/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 09e05d4805e6c524c1af74e524e5d0528bb3fef3 upstream. ext3 users of data=journal mode with blocksize < pagesize were occasionally hitting assertion failure in journal_commit_transaction() checking whether the transaction has at least as many credits reserved as buffers attached. The core of the problem is that when a file gets truncated, buffers that still need checkpointing or that are attached to the committing transaction are left with buffer_mapped set. When this happens to buffers beyond i_size attached to a page stradding i_size, subsequent write extending the file will see these buffers and as they are mapped (but underlying blocks were freed) things go awry from here. The assertion failure just coincidentally (and in this case luckily as we would start corrupting filesystem) triggers due to journal_head not being properly cleaned up as well. Under some rare circumstances this bug could even hit data=ordered mode users. There the assertion won't trigger and we would end up corrupting the filesystem. We fix the problem by unmapping buffers if possible (in lots of cases we just need a buffer attached to a transaction as a place holder but it must not be written out anyway). And in one case, we just have to bite the bullet and wait for transaction commit to finish. Reviewed-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* jbd/jbd2: remove obsolete summarise_journal_usage.Tao Ma2011-05-171-6/+0
| | | | | | | | | summarise_journal_usage seems to be obsolete for a long time, so remove it. Cc: Jan Kara <jack@suse.cz> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz>
* jbd: Fix forever sleeping process in do_get_write_access()Jan Kara2011-05-171-2/+7
| | | | | | | | | | | | | | In do_get_write_access() we wait on BH_Unshadow bit for buffer to get from shadow state. The waking code in journal_commit_transaction() has a bug because it does not issue a memory barrier after the buffer is moved from the shadow state and before wake_up_bit() is called. Thus a waitqueue check can happen before the buffer is actually moved from the shadow state and waiting process may never be woken. Fix the problem by issuing proper barrier. CC: stable@kernel.org Reported-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz>
* Fix common misspellingsLucas De Marchi2011-03-311-1/+1
| | | | | | Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
* jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit pluggingJens Axboe2011-03-171-11/+11
| | | | | | | | 'write_op' was still used, even though it was always WRITE_SYNC now. Add plugging around the cases where it submits IO, and flush them before we end up waiting for that IO. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* block: kill off REQ_UNPLUGJens Axboe2011-03-101-1/+1
| | | | | | | | | | With the plugging now being explicitly controlled by the submitter, callers need not pass down unplugging hints to the block layer. If they want to unplug, it's because they manually plugged on their own - in which case, they should just unplug at will. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* Merge branch 'for_linus' of ↵Linus Torvalds2010-10-271-4/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (24 commits) quota: Fix possible oops in __dquot_initialize() ext3: Update kernel-doc comments jbd/2: fixed typos ext2: fixed typo. ext3: Fix debug messages in ext3_group_extend() jbd: Convert atomic_inc() to get_bh() ext3: Remove misplaced BUFFER_TRACE() in ext3_truncate() jbd: Fix debug message in do_get_write_access() jbd: Check return value of __getblk() ext3: Use DIV_ROUND_UP() on group desc block counting ext3: Return proper error code on ext3_fill_super() ext3: Remove unnecessary casts on bh->b_data ext3: Cleanup ext3_setup_super() quota: Fix issuing of warnings from dquot_transfer quota: fix dquot_disable vs dquot_transfer race v2 jbd: Convert bitops to buffer fns ext3/jbd: Avoid WARN() messages when failing to write the superblock jbd: Use offset_in_page() instead of manual calculation jbd: Remove unnecessary goto statement jbd: Use printk_ratelimited() in journal_alloc_journal_head() ...
| * jbd: Convert atomic_inc() to get_bh()Namhyung Kim2010-10-281-1/+1
| | | | | | | | | | | | | | Convert atomic_inc(&bh->b_count) to get_bh(bh) for consistency. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * jbd: Convert bitops to buffer fnsNamhyung Kim2010-10-281-3/+3
| | | | | | | | | | | | | | | | Convert set/clear_bit(BH_JWrite, ...) to set/clear_buffer_jwrite() for consistency. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
* | Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2010-10-221-27/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits) xen-blkfront: disable barrier/flush write support Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c block: remove BLKDEV_IFL_WAIT aic7xxx_old: removed unused 'req' variable block: remove the BH_Eopnotsupp flag block: remove the BLKDEV_IFL_BARRIER flag block: remove the WRITE_BARRIER flag swap: do not send discards as barriers fat: do not send discards as barriers ext4: do not send discards as barriers jbd2: replace barriers with explicit flush / FUA usage jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier jbd: replace barriers with explicit flush / FUA usage nilfs2: replace barriers with explicit flush / FUA usage reiserfs: replace barriers with explicit flush / FUA usage gfs2: replace barriers with explicit flush / FUA usage btrfs: replace barriers with explicit flush / FUA usage xfs: replace barriers with explicit flush / FUA usage block: pass gfp_mask and flags to sb_issue_discard dm: convey that all flushes are processed as empty ...
| * | jbd: replace barriers with explicit flush / FUA usageChristoph Hellwig2010-09-101-27/+3
| |/ | | | | | | | | | | | | | | | | | | Switch to the WRITE_FLUSH_FUA flag for journal commits and remove the EOPNOTSUPP detection for barriers. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | cfq: improve fsync performance for small filesCorrado Zoccolo2010-09-201-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | Fsync performance for small files achieved by cfq on high-end disks is lower than what deadline can achieve, due to idling introduced between the sync write happening in process context and the journal commit. Moreover, when competing with a sequential reader, a process writing small files and fsync-ing them is starved. This patch fixes the two problems by: - marking journal commits as WRITE_SYNC, so that they get the REQ_NOIDLE flag set, - force all queues that have REQ_NOIDLE requests to be put in the noidle tree. Having the queue associated to the fsync-ing process and the one associated to journal commits in the noidle tree allows: - switching between them without idling, - fairness vs. competing idling queues, since they will be serviced only after the noidle tree expires its slice. Acked-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* kill BH_Ordered flagChristoph Hellwig2010-08-181-24/+25
| | | | | | | | | Instead of abusing a buffer_head flag just add a variant of sync_dirty_buffer which allows passing the exact type of write flag required. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* jbd: Provide function to check whether transaction will issue data barrierJan Kara2010-05-211-1/+7
| | | | | | | | Provide a function which returns whether a transaction with given tid will send a barrier to the filesystem device. The function will be used by ext3 to detect whether fsync needs to send a separate barrier or not. Signed-off-by: Jan Kara <jack@suse.cz>
* include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo2010-03-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* jbd: Delay discarding buffers in journal_unmap_bufferJan Kara2010-03-051-5/+5
| | | | | | | | | | | | Delay discarding buffers in journal_unmap_buffer until we know that "add to orphan" operation has definitely been committed, otherwise the log space of committing transation may be freed and reused before truncate get committed, updates may get lost if crash happens. This patch is a backport of JBD2 fix by dingdinghua <dingdinghua@nrchpc.ac.cn>. Signed-off-by: Jan Kara <jack@suse.cz>
* jbd: Journal block numbers can ever be only 32-bit use unsigned int for themJan Kara2009-09-161-1/+1
| | | | | | | | It does not make sense to store block number for journal as unsigned long since they can be only 32-bit (because of on-disk format limitation). So change in-memory structures and variables to use unsigned int instead. Signed-off-by: Jan Kara <jack@suse.cz>
* jbd: fix race in buffer processing in commit codeJan Kara2009-06-091-2/+4
| | | | | | | | | | | | | | | | | | | In commit code, we scan buffers attached to a transaction. During this scan, we sometimes have to drop j_list_lock and then we recheck whether the journal buffer head didn't get freed by journal_try_to_free_buffers(). But checking for buffer_jbd(bh) isn't enough because a new journal head could get attached to our buffer head. So add a check whether the journal head remained the same and whether it's still at the same transaction and list. This is a nasty bug and can cause problems like memory corruption (use after free) or trigger various assertions in JBD code (observed). Signed-off-by: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke recordsTheodore Ts'o2009-04-141-1/+1
| | | | | | | | The revoke records must be written using the same way as the rest of the blocks during the commit process; that is, either marked as synchronous writes or as asynchornous writes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* jbd: use WRITE_SYNC_PLUG instead of WRITE_SYNCJens Axboe2009-04-061-1/+6
| | | | | | | | | | | When you are going to be submitting several sync writes, we want to give the IO scheduler a chance to merge some of them. Instead of using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG and rely on sync_buffer() doing the unplug when someone does a wait_on_buffer()/lock_buffer(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext3: Use WRITE_SYNC for commits which are caused by fsync()Theodore Ts'o2009-03-271-8/+15
| | | | | | | | | If a commit is triggered by fsync(), set a flag indicating the journal blocks associated with the transaction should be flushed out using WRITE_SYNC. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz>
* jbd: improve fsync batchingJosef Bacik2009-01-081-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a flaw with the way jbd handles fsync batching. If we fsync() a file and we were not the last person to run fsync() on this fs then we automatically sleep for 1 jiffie in order to wait for new writers to join into the transaction before forcing the commit. The problem with this is that with really fast storage (ie a Clariion) the time it takes to commit a transaction to disk is way faster than 1 jiffie in most cases, so sleeping means waiting longer with nothing to do than if we just committed the transaction and kept going. Ric Wheeler noticed this when using fs_mark with more than 1 thread, the throughput would plummet as he added more threads. This patch attempts to fix this problem by recording the average time in nanoseconds that it takes to commit a transaction to disk, and what time we started the transaction. If we run an fsync() and we have been running for less time than it takes to commit the transaction to disk, we sleep for the delta amount of time and then commit to disk. We acheive sub-jiffie sleeping using schedule_hrtimeout. This means that the wait time is auto-tuned to the speed of the underlying disk, instead of having this static timeout. I weighted the average according to somebody's comments (Andreas Dilger I think) in order to help normalize random outliers where we take way longer or way less time to commit than the average. I also have a min() check in there to make sure we don't sleep longer than a jiffie in case our storage is super slow, this was requested by Andrew. I unfortunately do not have access to a Clariion, so I had to use a ramdisk to represent a super fast array. I tested with a SATA drive with barrier=1 to make sure there was no regression with local disks, I tested with a 4 way multipathed Apple Xserve RAID array and of course the ramdisk. I ran the following command fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i where $i was 2, 4, 8, 16 and 32. I mkfs'ed the fs each time. Here are my results type threads with patch without patch sata 2 24.6 26.3 sata 4 49.2 48.1 sata 8 70.1 67.0 sata 16 104.0 94.1 sata 32 153.6 142.7 xserve 2 246.4 222.0 xserve 4 480.0 440.8 xserve 8 829.5 730.8 xserve 16 1172.7 1026.9 xserve 32 1816.3 1650.5 ramdisk 2 2538.3 1745.6 ramdisk 4 2942.3 661.9 ramdisk 8 2882.5 999.8 ramdisk 16 2738.7 1801.9 ramdisk 32 2541.9 2394.0 Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: Andreas Dilger <adilger@sun.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ric Wheeler <rwheeler@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ext3: add an option to control error handling on file dataHidehiro Kawai2008-10-201-0/+2
| | | | | | | | | | | | | | | | | | | | If the journal doesn't abort when it gets an IO error in file data blocks, the file data corruption will spread silently. Because most of applications and commands do buffered writes without fsync(), they don't notice the IO error. It's scary for mission critical systems. On the other hand, if the journal aborts whenever it gets an IO error in file data blocks, the system will easily become inoperable. So this patch introduces a filesystem option to determine whether it aborts the journal or just call printk() when it gets an IO error in file data. If you mount a ext3 fs with data_err=abort option, it aborts on file data write error. If you mount it with data_err=ignore, it doesn't abort, just call printk(). data_err=ignore is the default. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* jbd: don't dirty original metadata buffer on abortHidehiro Kawai2008-10-201-1/+4
| | | | | | | | | | | | | | | | | | | | | | | Currently, original metadata buffers are dirtied when they are unfiled whether the journal has aborted or not. Eventually these buffers will be written-back to the filesystem by pdflush. This means some metadata buffers are written to the filesystem without journaling if the journal aborts. So if both journal abort and system crash happen at the same time, the filesystem would become inconsistent state. Additionally, replaying journaled metadata can overwrite the latest metadata on the filesystem partly. Because, if the journal aborts, journaled metadata are preserved and replayed during the next mount not to lose uncheckpointed metadata. This would also break the consistency of the filesystem. This patch prevents original metadata buffers from being dirtied on abort by clearing BH_JBDDirty flag from those buffers. Thus, no metadata buffers are written to the filesystem without journaling. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* jbd: abort when failed to log metadata buffersHidehiro Kawai2008-10-201-0/+3
| | | | | | | | | | | | | | | If we failed to write metadata buffers to the journal space and succeeded to write the commit record, stale data can be written back to the filesystem as metadata in the recovery phase. To avoid this, when we failed to write out metadata buffers, abort the journal before writing the commit record. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: rename buffer trylockNick Piggin2008-08-041-1/+1
| | | | | | | | Like the page lock change, this also requires name change, so convert the raw test_and_set bitop to a trylock. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: rename page trylockNick Piggin2008-08-041-2/+2
| | | | | | | | | | | | | | | Converting page lock to new locking bitops requires a change of page flag operation naming, so we might as well convert it to something nicer (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked). This also facilitates lockdeping of page lock. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* jbd: don't abort if flushing file data failedHidehiro Kawai2008-07-251-7/+28
| | | | | | | | | | | | | | | | | | | | | | | | | In ordered mode, the current jbd aborts the journal if a file data buffer has an error. But this behavior is unintended, and we found that it has been adopted accidentally. This patch undoes it and just calls printk() instead of aborting the journal. Additionally, set AS_EIO into the address_space object of the failed buffer which is submitted by journal_do_submit_data() so that fsync() can get -EIO. Missing error checkings are also added to inform errors on file data buffers to the user. The following buffers are targeted. (a) the buffer which has already been written out by pdflush (b) the buffer which has been unlocked before scanned in the t_locked_list loop [akpm@linux-foundation.org: improve grammar in a printk] Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* jbd: positively dispose the unmapped data buffers in ↵Toshiyuki Okajima2008-07-251-9/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | journal_commit_transaction() After ext3-ordered files are truncated, there is a possibility that the pages which cannot be estimated still remain. Remaining pages can be released when the system has really few memory. So, it is not memory leakage. But the resource management software etc. may not work correctly. It is possible that journal_unmap_buffer() cannot release the buffers, and the pages to which they belong because they are attached to a commiting transaction and journal_unmap_buffer() cannot release them. To release such the buffers and the pages later, journal_unmap_buffer() leaves it to journal_commit_transaction(). (journal_unmap_buffer() puts the mark 'BH_Freed' to the buffers so that journal_commit_transaction() can identify whether they can be released or not.) In the journalled mode and the writeback mode, jbd does with only metadata buffers. But in the ordered mode, jbd does with metadata buffers and also data buffers. Actually, journal_commit_transaction() releases only the metadata buffers of which release is demanded by journal_unmap_buffer(), and also releases the pages to which they belong if possible. As a result, the data buffers of which release is demanded by journal_unmap_buffer() remain after a transaction commits. And also the pages to which they belong remain. Such the remained pages don't have mapping any longer. Due to this fact, there is a possibility that the pages which cannot be estimated remain. The metadata buffers marked 'BH_Freed' and the pages to which they belong can be released at 'JBD: commit phase 7'. Therefore, by applying the same code into 'JBD: commit phase 2' (where the data buffers are done with), journal_commit_transaction() can also release the data buffers marked 'BH_Freed' and the pages to which they belong. As a result, all the buffers marked 'BH_Freed' can be released, and also all the pages to which these buffers belong can be released at journal_commit_transaction(). So, the page which cannot be estimated is lost. <<Excerpt of code at 'JBD: commit phase 7'>> > spin_lock(&journal->j_list_lock); > while (commit_transaction->t_forget) { > transaction_t *cp_transaction; > struct buffer_head *bh; > > jh = commit_transaction->t_forget; >... > if (buffer_freed(bh)) { > ^^^^^^^^^^^^^^^^^^^^^^^^ > clear_buffer_freed(bh); > ^^^^^^^^^^^^^^^^^^^^^^^^ > clear_buffer_jbddirty(bh); > } > > if (buffer_jbddirty(bh)) { > JBUFFER_TRACE(jh, "add to new checkpointing trans"); > __journal_insert_checkpoint(jh, commit_transaction); > JBUFFER_TRACE(jh, "refile for checkpoint writeback"); > __journal_refile_buffer(jh); > jbd_unlock_bh_state(bh); > } else { > J_ASSERT_BH(bh, !buffer_dirty(bh)); > ... > JBUFFER_TRACE(jh, "refile or unfile freed buffer"); > __journal_refile_buffer(jh); > if (!jh->b_transaction) { > jbd_unlock_bh_state(bh); > /* needs a brelse */ > journal_remove_journal_head(bh); > release_buffer_page(bh); > ^^^^^^^^^^^^^^^^^^^^^^^^ > } else > } **************************************************************** * Apply the code of "^^^^^^" lines into 'JBD: commit phase 2' * **************************************************************** At journal_commit_transaction() code, there is one extra message in the series of jbd debug messages. ("JBD: commit phase 2") This patch fixes it, too. Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* jbd: need to hold j_state_lock to updates to transaction t_state to T_COMMITMingming Cao2008-05-141-0/+2
| | | | | | | | | | Updating the current transaction's t_state is protected by j_state_lock. We need to do the same when updating the t_state to T_COMMIT. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Acked-by: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* jbd: fix possible journal overflow issuesJosef Bacik2008-04-281-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several cases where the running transaction can get buffers added to its BJ_Metadata list which it never dirtied, which makes its t_nr_buffers counter end up larger than its t_outstanding_credits counter. This will cause issues when starting new transactions as while we are logging buffers we decrement t_outstanding_buffers, so when t_outstanding_buffers goes negative, we will report that we need less space in the journal than we actually need, so transactions will be started even though there may not be enough room for them. In the worst case scenario (which admittedly is almost impossible to reproduce) this will result in the journal running out of space. The fix is to only refile buffers from the committing transaction to the running transactions BJ_Modified list when b_modified is set on that journal, which is the only way to be sure if the running transaction has modified that buffer. This patch also fixes an accounting error in journal_forget, it is possible that we can call journal_forget on a buffer without having modified it, only gotten write access to it, so instead of freeing a credit, we only do so if the buffer was modified. The assert will help catch if this problem occurs. Without these two patches I could hit this assert within minutes of running postmark, with them this issue no longer arises. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: <linux-ext4@vger.kernel.org> Acked-by: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* jbd: fix the way the b_modified flag is clearedJosef Bacik2008-04-281-16/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently at the start of a journal commit we loop through all of the buffers on the committing transaction and clear the b_modified flag (the flag that is set when a transaction modifies the buffer) under the j_list_lock. The problem is that everywhere else this flag is modified only under the jbd lock buffer flag, so it will race with a running transaction who could potentially set it, and have it unset by the committing transaction. This is also a big waste, you can have several thousands of buffers that you are clearing the modified flag on when you may not need to. This patch removes this code and instead clears the b_modified flag upon entering do_get_write_access/journal_get_create_access, so if that transaction does indeed use the buffer then it will be accounted for properly, and if it does not then we know we didn't use it. That will be important for the next patch in this series. Tested thoroughly by myself using postmark/iozone/bonnie++. Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: <linux-ext4@vger.kernel.org> Acked-by: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] jbd: Remove useless loop when writing commit recordJan Kara2008-02-011-8/+6
| | | | | | | | | | Commit block was intended to have several copies of the header. But due to a bug it never had them and actually, nobody checks that. So just remove the useless loop. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext3 can fail badly when device stops accepting BIO_RW_BARRIER requestsNeil Brown2008-02-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some devices - notably dm and md - can change their behaviour in response to BIO_RW_BARRIER requests. They might start out accepting such requests but on reconfiguration, they find out that they cannot any more. ext3 (and other filesystems) deal with this by always testing if BIO_RW_BARRIER requests fail with EOPNOTSUPP, and retrying the write requests without the barrier (probably after waiting for any pending writes to complete). However there is a bug in the handling for this for ext3. When ext3 (jbd actually) decides to submit a BIO_RW_BARRIER request, it sets the buffer_ordered flag on the buffer head. If the request completes successfully, the flag STAYS SET. Other code might then write the same buffer_head after the device has been reconfigured to not accept barriers. This write will then fail, but the "other code" is not ready to handle EOPNOTSUPP errors and the error will be treated as fatal. This can be seen without having to reconfigure a device at exactly the wrong time by putting: if (buffer_ordered(bh)) printk("OH DEAR, and ordered buffer\n"); in the while loop in "commit phase 5" of journal_commit_transaction. If it ever prints the "OH DEAR ..." message (as it does sometimes for me), then that request could (in different circumstances) have failed with EOPNOTSUPP, but that isn't tested for. My proposed fix is to clear the buffer_ordered flag after it has been used, as in the following patch. Signed-off-by: Neil Brown <neilb@suse.de> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* spinlock: lockbreak cleanupNick Piggin2008-01-301-1/+1
| | | | | | | | | | | | | | | | | | | | The break_lock data structure and code for spinlocks is quite nasty. Not only does it double the size of a spinlock but it changes locking to a potentially less optimal trylock. Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a __raw_spin_is_contended that uses the lock data itself to determine whether there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is not set. Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to decouple it from the spinlock implementation, and make it typesafe (rwlocks do not have any need_lockbreak sites -- why do they even get bloated up with that break_lock then?). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* jbd: Fix assertion failure in fs/jbd/checkpoint.cJan Kara2007-12-051-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before we start committing a transaction, we call __journal_clean_checkpoint_list() to cleanup transaction's written-back buffers. If this call happens to remove all of them (and there were already some buffers), __journal_remove_checkpoint() will decide to free the transaction because it isn't (yet) a committing transaction and soon we fail some assertion - the transaction really isn't ready to be freed :). We change the check in __journal_remove_checkpoint() to free only a transaction in T_FINISHED state. The locking there is subtle though (as everywhere in JBD ;(). We use j_list_lock to protect the check and a subsequent call to __journal_drop_transaction() and do the same in the end of journal_commit_transaction() which is the only place where a transaction can get to T_FINISHED state. Probably I'm too paranoid here and such locking is not really necessary - checkpoint lists are processed only from log_do_checkpoint() where a transaction must be already committed to be processed or from __journal_clean_checkpoint_list() where kjournald itself calls it and thus transaction cannot change state either. Better be safe if something changes in future... Signed-off-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* jbd: fix commit code to properly abort journalJan Kara2007-10-191-4/+4
| | | | | | | | | | | | We should really call journal_abort() and not __journal_abort_hard() in case of errors. The latter call does not record the error in the journal superblock and thus filesystem won't be marked as with errors later (and user could happily mount it without any warning). Signed-off-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* JBD: JBD slab allocation cleanupsMingming Cao2007-10-171-3/+3
| | | | | | | | | | JBD: Replace slab allocations with page allocations JBD allocate memory for committed_data and frozen_data from slab. However JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
* jbd commit: fix transaction droppingJan Kara2007-07-161-1/+2
| | | | | | | | | | | | | We have to check that also the second checkpoint list is non-empty before dropping the transaction. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Chuck Ebbert <cebbert@redhat.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: <linux-ext4@vger.kernel.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* header cleaning: don't include smp_lock.h when not usedRandy Dunlap2007-05-081-1/+0
| | | | | | | | | | | | Remove includes of <linux/smp_lock.h> where it is not used/needed. Suggested by Al Viro. Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc, sparc64, and arm (all 59 defconfigs). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] jbd: wait for already submitted t_sync_datalist buffer to completeHisashi Hifumi2006-12-221-2/+6
| | | | | | | | | | | | | | | | | | | | | In the current jbd code, if a buffer on BJ_SyncData list is dirty and not locked, the buffer is refiled to BJ_Locked list, submitted to the IO and waited for IO completion. But the fsstress test showed the case that when a buffer was already submitted to the IO just before the buffer_dirty(bh) check, the buffer was not waited for IO completion. Following patch solves this problem. If it is assumed that a buffer is submitted to the IO before the buffer_dirty(bh) check and still being written to disk, this buffer is refiled to BJ_Locked list. Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Cc: Jan Kara <jack@ucw.cz> Cc: "Stephen C. Tweedie" <sct@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* fix file specification in commentsUwe Zeisberger2006-10-031-1/+1
| | | | | | | Many files include the filename at the beginning, serveral used a wrong one. Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>
* [PATCH] jbd: fix commit of ordered data buffersJan Kara2006-09-261-69/+113
| | | | | | | | | | | | | | | | | | | | | | | | | Original commit code assumes, that when a buffer on BJ_SyncData list is locked, it is being written to disk. But this is not true and hence it can lead to a potential data loss on crash. Also the code didn't count with the fact that journal_dirty_data() can steal buffers from committing transaction and hence could write buffers that no longer belong to the committing transaction. Finally it could possibly happen that we tried writing out one buffer several times. The patch below tries to solve these problems by a complete rewrite of the data commit code. We go through buffers on t_sync_datalist, lock buffers needing write out and store them in an array. Buffers are also immediately refiled to BJ_Locked list or unfiled (if the write out is completed). When the array is full or we have to block on buffer lock, we submit all accumulated buffers for IO. [suitable for 2.6.18.x around the 2.6.19-rc2 timeframe] Signed-off-by: Jan Kara <jack@suse.cz> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Manage jbd allocations from its own slabsBadari Pulavarty2006-08-271-3/+3
| | | | | | | | | | | | | | | | JBD currently allocates commit and frozen buffers from slabs. With CONFIG_SLAB_DEBUG, its possible for an allocation to cross the page boundary causing IO problems. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=200127 So, instead of allocating these from regular slabs - manage allocation from its own slabs and disable slab debug for these slabs. [akpm@osdl.org: cleanups] Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] jbd: fix BUG in journal_commit_transaction()Jan Kara2006-06-231-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix possible assertion failure in journal_commit_transaction() on jh->b_next_transaction == NULL (when we are processing BJ_Forget list and buffer is not jbddirty). !jbddirty buffers can be placed on BJ_Forget list for example by journal_forget() or by __dispose_buffer() - generally such buffer means that it has been freed by this transaction. Freed buffers should not be reallocated until the transaction has committed (that's why we have the assertion there) but they *can* be reallocated when the transaction has already been committed to disk and we are just processing the BJ_Forget list (as soon as we remove b_committed_data from the bitmap bh, ext3 will be able to reallocate buffers freed by the committing transaction). So we have to also count with the case that the buffer has been reallocated and b_next_transaction has been already set. And one more subtle point: it can happen that we manage to reallocate the buffer and also mark it jbddirty. Then we also add the freed buffer to the checkpoint list of the committing trasaction. But that should do no harm. Non-jbddirty buffers should be filed to BJ_Reserved and not BJ_Metadata list. It can actually happen that we refile such buffers during the commit phase when we reallocate in the running transaction blocks deleted in committing transaction (and that can happen if the committing transaction already wrote all the data and is just cleaning up BJ_Forget list). Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: "Stephen C. Tweedie" <sct@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] jbd: revert checkpoint list changesMark Fasheh2006-02-141-2/+1
| | | | | | | | | | | | | | | | | | | | | This patch reverts commit f93ea411b73594f7d144855fd34278bcf34a9afc: [PATCH] jbd: split checkpoint lists This broke journal_flush() for OCFS2, which is its method of being sure that metadata is sent to disk for another node. And two related commits 8d3c7fce2d20ecc3264c8d8c91ae3beacdeaed1b and 43c3e6f5abdf6acac9b90c86bf03f995bf7d3d92 with the subjects: [PATCH] jbd: log_do_checkpoint fix [PATCH] jbd: remove_transaction fix These seem to be incremental bugfixes on the original patch and as such are no longer needed. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Cc: Jan Kara <jack@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] jbd: remove_transaction fixJan Kara2006-01-181-1/+2
| | | | | | | | | We have to check that also the second checkpoint list is non-empty before dropping the transaction. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kfree cleanup: fsJesper Juhl2005-11-071-4/+2
| | | | | | | | | | This is the fs/ part of the big kfree cleanup patch. Remove pointless checks for NULL prior to calling kfree() in fs/. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Change ll_rw_block() calls in JBDJan Kara2005-09-071-2/+2
| | | | | | | | | We must be sure that the current data in buffer are sent to disk. Hence we have to call ll_rw_block() with SWRITE. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Fix JBD race in t_forget list handlingJan Kara2005-09-071-10/+24
| | | | | | | | | | | | | Fix race between journal_commit_transaction() and other places as journal_unmap_buffer() that are adding buffers to transaction's t_forget list. We have to protect against such places by holding j_list_lock even when traversing the t_forget list. The fact that other places can only add buffers to the list makes the locking easier. OTOH the lock ranking complicates the stuff... Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>