| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
| |
[tytso@mit.edu: Fix compilation with CONFIG_JBD_DEBUG enabled]
Acked-by: tytso@mit.edu
cc: linux-ext4@vger.kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
|
|
|
|
|
|
|
|
|
| |
The various quota operations check for any quota beeing active on
a superblock, and the inode not having the noquota flag.
Merge these two checks into a dquot_active check and move that
into dquot.c as that's the only place where it's needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
|
|
|
|
|
|
|
|
| |
Almost all identifiers use the FS_* namespace, so rename the missing few
XFS_* ones to FS_* as well. Without this some people might get upset
about having too many XFS names in generic code.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
|
|
|
|
|
|
|
|
| |
Reserved space must being claimed before remove_dquot_ref() for a
given inode. Filesystem is responsible for performing force blocks
allocation in case of dealloc in ->quota_off. Let's add sanity check
for that case. Do it similar to add_dquot_ref().
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Silence gcc warning in ocfs2_write_zero_page().
jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node
ocfs2: Don't duplicate pages past i_size during CoW.
ocfs2: tighten up strlen() checking
ocfs2: Make xattr reflink work with new local alloc reservation.
ocfs2: make xattr extension work with new local alloc reservation.
ocfs2: Remove the redundant cpu_to_le64.
ocfs2/dlm: don't access beyond bitmap size
ocfs2: No need to zero pages past i_size.
ocfs2: Zero the tail cluster when extending past i_size.
ocfs2: When zero extending, do it by page.
ocfs2: Limit default local alloc size within bitmap range.
ocfs2: Move orphan scan work to ocfs2_wq.
fs/ocfs2/dlm: Add missing spin_unlock
|
| |
| |
| |
| |
| |
| |
| | |
ocfs2_write_zero_page() has a loop that won't ever be skipped, but gcc
doesn't know that. Set ret=0 just to make gcc happy.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
OCFS2 uses t_commit trigger to compute and store checksum of the just
committed blocks. When a buffer has b_frozen_data, checksum is computed
for it instead of b_data but this can result in an old checksum being
written to the filesystem in the following scenario:
1) transaction1 is opened
2) handle1 is opened
3) journal_access(handle1, bh)
- This sets jh->b_transaction to transaction1
4) modify(bh)
5) journal_dirty(handle1, bh)
6) handle1 is closed
7) start committing transaction1, opening transaction2
8) handle2 is opened
9) journal_access(handle2, bh)
- This copies off b_frozen_data to make it safe for transaction1 to commit.
jh->b_next_transaction is set to transaction2.
10) jbd2_journal_write_metadata() checksums b_frozen_data
11) the journal correctly writes b_frozen_data to the disk journal
12) handle2 is closed
- There was no dirty call for the bh on handle2, so it is never queued for
any more journal operation
13) Checkpointing finally happens, and it just spools the bh via normal buffer
writeback. This will write b_data, which was never triggered on and thus
contains a wrong (old) checksum.
This patch fixes the problem by calling the trigger at the moment data is
frozen for journal commit - i.e., either when b_frozen_data is created by
do_get_write_access or just before we write a buffer to the log if
b_frozen_data does not exist. We also rename the trigger to t_frozen as
that better describes when it is called.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set
before sending DLM_MIG_LOCKRES_MSG message to the target. We are using
dlm_migration_can_proceed() for that purpose. However, if the node is
down, dlm_migration_can_proceed() will also return "go ahead". In this
rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove
the BUG_ON() that trips over this condition.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
During CoW, the pages after i_size don't contain valid data, so there's
no need to read and duplicate them.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This function is only called from one place and it's like this:
dlm_register_domain(conn->cc_name, dlm_key, &fs_version);
The "conn->cc_name" is 64 characters long. If strlen(conn->cc_name)
were equal to O2NM_MAX_NAME_LEN (64) that would be a bug because
strlen() doesn't count the NULL character.
In fact, if you look how O2NM_MAX_NAME_LEN is used, it mostly describes
64 character buffers. The only exception is nd_name from struct
o2nm_node.
Anyway I looked into it and in this case the domain string comes from
osb->uuid_str in ocfs2_setup_osb_uuid(). That's 32 characters and NULL
which easily fits into O2NM_MAX_NAME_LEN. This patch doesn't change how
the code works, but I think it makes the code a little cleaner.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The new reservation code in local alloc has add the limitation
that the caller should handle the case that the local alloc
doesn't give use enough contiguous clusters. It make the old
xattr reflink code broken.
So this patch udpate the xattr reflink code so that it can
handle the case that local alloc give us one cluster at a time.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The old ocfs2_xattr_extent_allocation is too optimistic about
the clusters we can get. So actually if the file system is
too fragmented, ocfs2_add_clusters_in_btree will return us
with EGAIN and we need to allocate clusters once again.
So this patch change it to a while loop so that we can allocate
clusters until we reach clusters_to_add.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In ocfs2_block_group_alloc, we set c_blkno by bg->bg_blkno.
But actually bg->bg_blkno is already changed to little endian
in ocfs2_block_group_fill. So remove the extra cpu_to_le64.
Reported-by: Marcos Matsunaga <Marcos.Matsunaga@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
dlm->recovery_map is defined as
unsigned long recovery_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
We should treat O2NM_MAX_NODES as the bit map size in bits.
This patches fixes a bit operation that takes O2NM_MAX_NODES + 1 as bitmap size.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When ocfs2 fills a hole, it does so by allocating clusters. When a
cluster is larger than the write, ocfs2 must zero the portions of the
cluster outside of the write. If the clustersize is smaller than a
pagecache page, this is handled by the normal pagecache mechanisms, but
when the clustersize is larger than a page, ocfs2's write code will zero
the pages adjacent to the write. This makes sure the entire cluster is
zeroed correctly.
Currently ocfs2 behaves exactly the same when writing past i_size.
However, this means ocfs2 is writing zeroed pages for portions of a new
cluster that are beyond i_size. The page writeback code isn't expecting
this. It treats all pages past the one containing i_size as left behind
due to a previous truncate operation.
Thankfully, ocfs2 calculates the number of pages it will be working on
up front. The rest of the write code merely honors the original
calculation. We can simply trim the number of pages to only cover the
actual file data.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
ocfs2's allocation unit is the cluster. This can be larger than a block
or even a memory page. This means that a file may have many blocks in
its last extent that are beyond the block containing i_size. There also
may be more unwritten extents after that.
When ocfs2 grows a file, it zeros the entire cluster in order to ensure
future i_size growth will see cleared blocks. Unfortunately,
block_write_full_page() drops the pages past i_size. This means that
ocfs2 is actually leaking garbage data into the tail end of that last
cluster. This is a bug.
We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
when a write or truncate is past i_size. They will use
ocfs2_zero_extend() to ensure the data is properly zeroed.
Older versions of ocfs2_zero_extend() simply zeroed every block between
i_size and the zeroing position. This presumes three things:
1) There is allocation for all of these blocks.
2) The extents are not unwritten.
3) The extents are not refcounted.
(1) and (2) hold true for non-sparse filesystems, which used to be the
only users of ocfs2_zero_extend(). (3) is another bug.
Since we're now using ocfs2_zero_extend() for sparse filesystems as
well, we teach ocfs2_zero_extend() to check every extent between
i_size and the zeroing position. If the extent is unwritten, it is
ignored. If it is refcounted, it is CoWed. Then it is zeroed.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
ocfs2_zero_extend() does its zeroing block by block, but it calls a
function named ocfs2_write_zero_page(). Let's have
ocfs2_write_zero_page() handle the page level. From
ocfs2_zero_extend()'s perspective, it is now page-at-a-time.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: stable@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In commit 6b82021b9e91cd689fdffadbcdb9a42597bbe764, we increase
our local alloc size and calculate how much megabytes we can
get according to group size and volume size.
But we also need to check the maximum bits a local alloc block
bitmap can have. With a bs=512, cs=32K, local volume with 160G,
it calculate 96MB while the maximum local alloc size is only
76M. So the bitmap will overflow and corrupt the system truncate
log file. See bug
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1262
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We used to let orphan scan work in the default work queue,
but there is a corner case which will make the system deadlock.
The scenario is like this:
1. set heartbeat threadshold to 200. this will allow us to have a
great chance to have a orphan scan work before our quorum decision.
2. mount node 1.
3. after 1~2 minutes, mount node 2(in order to make the bug easier
to reproduce, better add maxcpus=1 to kernel command line).
4. node 1 do orphan scan work.
5. node 2 do orphan scan work.
6. node 1 do orphan scan work. After this, node 1 hold the orphan scan
lock while node 2 know node 1 is the master.
7. ifdown eth2 in node 2(eth2 is what we do ocfs2 interconnection).
Now when node 2 begins orphan scan, the system queue is blocked.
The root cause is that both orphan scan work and quorum decision work
will use the system event work queue. orphan scan has a chance of
blocking the event work queue(in dlm_wait_for_node_death) so that there
is no chance for quorum decision work to proceed.
This patch resolve it by moving orphan scan work to ocfs2_wq.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add a spin_unlock missing on the error path. Unlock as in the other code
that leads to the leave label.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression E1;
@@
* spin_lock(E1,...);
<+... when != E1
if (...) {
... when != E1
* return ...;
}
...+>
* spin_unlock(E1,...);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch fixes a kernel Oops in the GFS2 rename code.
The problem was in the way the gfs2 directory code was trying
to re-use sentinel directory entries.
In the failing case, gfs2's rename function was renaming a
file to another name that had the same non-trivial length.
The file being renamed happened to be the first directory
entry on the leaf block.
First, the rename code (gfs2_rename in ops_inode.c) found the
original directory entry and decided it could do its job by
simply replacing the directory entry with another. Therefore
it determined correctly that no block allocations were needed.
Next, the rename code deleted the old directory entry prior to
replacing it with the new name. Therefore, the soon-to-be
replaced directory entry was temporarily made into a directory
entry "sentinel" or a place holder at the start of a leaf block.
Lastly, it went to re-add the replacement directory entry in
that leaf block. However, when gfs2_dirent_find_space was
looking for space in the leaf block, it used the wrong value
for the sentinel. That threw off its calculations so later
it decides it can't really re-use the sentinel and therefore
must allocate a new leaf block. But because it previously decided
to re-use the directory entry, it didn't waste the time to
grab a new block allocation for the inode. Therefore, the
inode's i_alloc pointer was still NULL and it crashes trying to
reference it.
In the case of sentinel directory entries, the entire dirent is
reused, not just the "free space" portion of it, and therefore
the function gfs2_dirent_find_space should use the value 0
rather than GFS2_DIRENT_SIZE(0) for the actual dirent size.
Fixing this calculation enables the reproducer programs to work
properly.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
HighMem pages on i686 do not get mapped to the buffer_heads and this was
causing a NULL pointer dereference when we were trying to memset page buffers
to zero.
We now use zero_user() that kmaps the page and directly manipulates page data.
This patch also fixes a boundary condition that was incorrect.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch fixes a problem in an error path when looking
up dinodes. There are two sister-functions, gfs2_inode_lookup
and gfs2_process_unlinked_inode. Both functions acquire and
hold the i_iopen glock for the dinode being looked up. The last
thing they try to do is hold the i_gl glock for the dinode.
If that glock fails for some reason, the error path was
incorrectly calling gfs2_glock_put for the i_iopen glock twice.
This resulted in the glock being prematurely freed. The
"minimum hold time" usually kept the glock in memory, but the
lock interface to dlm (aka lock_dlm) freed its memory for the
glock. In some circumstances, it would cause dlm's dlm_astd daemon
to try to call the bast function for the freed lock_dlm memory,
which resulted in a NULL pointer dereference.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch fixes bugzilla bug #590878: GFS2: recovery stuck on
transaction lock. We set the frozen flag on the glock when we receive
a completion that cannot be delivered due to blocked locks. At that
point we check to see whether the first waiting holder has the noexp
flag set. If the noexp lock is queued later, then we need to unfreeze
the glock at that point in time, namely, in the glock work function.
This patch was originally written by Steve Whitehouse, but since
he's on holiday, I'm submitting it. It's been well tested with a
complex recovery test called revolver.
Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch replaces a statement that got dropped out by accident.
Without the patch, truncates on stuffed (very small) files cause
those files to have an unpredictable size.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
writeback: simplify the write back thread queue
writeback: split writeback_inodes_wb
writeback: remove writeback_inodes_wbc
fs-writeback: fix kernel-doc warnings
splice: check f_mode for seekable file
splice: direct_splice_actor() should not use pos in sd
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
First remove items from work_list as soon as we start working on them. This
means we don't have to track any pending or visited state and can get
rid of all the RCU magic freeing the work items - we can simply free
them once the operation has finished. Second use a real completion for
tracking synchronous requests - if the caller sets the completion pointer
we complete it, otherwise use it as a boolean indicator that we can free
the work item directly. Third unify struct wb_writeback_args and struct
bdi_work into a single data structure, wb_writeback_work. Previous we
set all parameters into a struct wb_writeback_args, copied it into
struct bdi_work, copied it again on the stack to use it there. Instead
of just allocate one structure dynamically or on the stack and use it
all the way through the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The case where we have a superblock doesn't require a loop here as we scan
over all inodes in writeback_sb_inodes. Split it out into a separate helper
to make the code simpler. This also allows to get rid of the sb member in
struct writeback_control, which was rather out of place there.
Also update the comments in writeback_sb_inodes that explain the handling
of inodes from wrong superblocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This was just an odd wrapper around writeback_inodes_wb. Removing this
also allows to get rid of the bdi member of struct writeback_control
which was rather out of place there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Fix kernel-doc to match the function's changed args.
Warning(fs/fs-writeback.c:190): No description found for parameter 'args'
Warning(fs/fs-writeback.c:190): Excess function parameter 'sb' description in 'bdi_queue_work_onstack'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
check f_mode for seekable file
As a seekable file is allowed without a llseek function, so the old way isn't
work any more.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
----
fs/splice.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
direct_splice_actor() shouldn't use sd->pos, as sd->pos is for file reading,
file->f_pos should be used instead.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
----
fs/splice.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix crush device 'out' threshold to 1.0, not 0.1
ceph: fix caps usage accounting for import (non-reserved) case
ceph: only release clean, unused caps with mds requests
ceph: fix crush CHOOSE_LEAF when type is already a leaf
ceph: fix crush recursion
ceph: fix caps debugfs entry
ceph: delay umount until all mds requests drop inode+dentry refs
ceph: handle splice_dentry/d_materialize_unique error in readdir_prepopulate
ceph: fix crush map update decoding
ceph: fix message memory leak, uninitialized variable
ceph: fix map handler error path
ceph: some endianity fixes
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Fix a typo that made any OSD weighted between 0.1 and 1.0 effectively
weighted as 1.0 (fully in).
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We need to increase the total and used counters when allocating a new cap
in the non-reserved (cap import) case.
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We can drop caps with an mds request. Ensure we only drop unused AND
clean caps, since the MDS doesn't support cap writeback in that context,
nor do we track it. If caps are dirty, and the MDS needs them back, we
it will revoke and we will flush in the normal fashion.
This fixes a possibly loss of metadata.
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We may not recurse for CHOOSE_LEAF if we start with a leaf node. When
that happens, the out2 vector needs to be filled in with the result.
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
There was a longstanding problem with recursion through intervening
bucket types on complex hierarchies.
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The ceph client structure was not set correctly.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This fixes a race between handle_reply finishing an mds request, signalling
completion, and then dropping the request structing and its dentry+inode
refs, and pre_umount function waiting for requests to finish before
letting the vfs tear down the dcache. If umount was delayed waiting for
mds requests, we could race and BUG in shrink_dcache_for_umount_subtree
because of a slow dput.
This delays umount until the msgr queue flushes, which means handle_reply
will exit and will have dropped the ceph_mds_request struct. I'm assuming
the VFS has already ensured that its calls have all completed and those
request refs have thus been dropped as well (I haven't seen that race, at
least).
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Handle a splice_dentry failure (due to a d_materialize_unique error)
without crashing. (Also, report the error code.)
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If the incremental osdmap has a new crush map, advance the position after
decoding so that we can parse the rest of the osdmap properly.
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We need to properly initialize skip, as not all alloc_msg op instances
set it.
Also, BUG if someone says skip but also allocates a message.
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Don't leak message if we receive an unexpected message type.
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | |/
| |/|
| | |
| | |
| | |
| | |
| | | |
Fix some problems that came up with sparse.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: remove block number from inode lookup code
xfs: rename XFS_IGET_BULKSTAT to XFS_IGET_UNTRUSTED
xfs: validate untrusted inode numbers during lookup
xfs: always use iget in bulkstat
xfs: prevent swapext from operating on write-only files
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The block number comes from bulkstat based inode lookups to shortcut
the mapping calculations. We ar enot able to trust anything from
bulkstat, so drop the block number as well so that the correct
lookups and mappings are always done.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Inode numbers may come from somewhere external to the filesystem
(e.g. file handles, bulkstat information) and so are inherently
untrusted. Rename the flag we use for these lookups to make it
obvious we are doing a lookup of an untrusted inode number and need
to verify it completely before trying to read it from disk.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When we decode a handle or do a bulkstat lookup, we are using an
inode number we cannot trust to be valid. If we are deleting inode
chunks from disk (default noikeep mode), then we cannot trust the on
disk inode buffer for any given inode number to correctly reflect
whether the inode has been unlinked as the di_mode nor the
generation number may have been updated on disk.
This is due to the fact that when we delete an inode chunk, we do
not write the clusters back to disk when they are removed - instead
we mark them stale to avoid them being written back potentially over
the top of something that has been subsequently allocated at that
location. The result is that we can have locations of disk that look
like they contain valid inodes but in reality do not. Hence we
cannot simply convert the inode number to a block number and read
the location from disk to determine if the inode is valid or not.
As a result, and XFS_IGET_BULKSTAT lookup needs to actually look the
inode up in the inode allocation btree to determine if the inode
number is valid or not.
It should be noted even on ikeep filesystems, there is the
possibility that blocks on disk may look like valid inode clusters.
e.g. if there are filesystem images hosted on the filesystem. Hence
even for ikeep filesystems we really need to validate that the inode
number is valid before issuing the inode buffer read.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The non-coherent bulkstat versionsthat look directly at the inode
buffers causes various problems with performance optimizations that
make increased use of just logging inodes. This patch makes bulkstat
always use iget, which should be fast enough for normal use with the
radix-tree based inode cache introduced a while ago.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|