From 03d6a74b5f85ff46f20e1382982b7f4860f5fec6 Mon Sep 17 00:00:00 2001 From: "J. Bruce Fields" Date: Tue, 22 Sep 2009 11:09:12 -0400 Subject: nfsd: fix Documentation typo Caught by Benny, thanks! Signed-off-by: J. Bruce Fields --- Documentation/filesystems/nfs41-server.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/nfs41-server.txt b/Documentation/filesystems/nfs41-server.txt index 5920fe2..1f95e77 100644 --- a/Documentation/filesystems/nfs41-server.txt +++ b/Documentation/filesystems/nfs41-server.txt @@ -41,7 +41,7 @@ interoperability problems with future clients. Known issues: conformant with the spec (for example, we don't use kerberos on the backchannel correctly). - no trunking support: no clients currently take advantage of - trunking, but this is a mandatory failure, and its use is + trunking, but this is a mandatory feature, and its use is recommended to clients in a number of places. (E.g. to ensure timely renewal in case an existing connection's retry timeouts have gotten too long; see section 8.3 of the draft.) -- cgit v1.1 From ddc04fd4d5163aee9ebdb38a56c365b602e2b7b7 Mon Sep 17 00:00:00 2001 From: Andy Adamson Date: Wed, 23 Sep 2009 21:32:21 -0400 Subject: nfsd41: use sv_max_mesg for forechannel max sizes ca_maxresponsesize and ca_maxrequest size include the RPC header. sv_max_mesg is sv_max_payolad plus a page for overhead and is used in svc_init_buffer to allocate server buffer space for both the request and reply. Note that this means we can service an RPC compound that requires ca_maxrequestsize (MAXWRITE) or ca_max_responsesize (MAXREAD) but that we do not support an RPC compound that requires both ca_maxrequestsize and ca_maxresponsesize. Signed-off-by: Andy Adamson [bfields@citi.umich.edu: more documentation updates] Signed-off-by: J. Bruce Fields --- Documentation/filesystems/nfs41-server.txt | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/nfs41-server.txt b/Documentation/filesystems/nfs41-server.txt index 1f95e77..1bd0d0c 100644 --- a/Documentation/filesystems/nfs41-server.txt +++ b/Documentation/filesystems/nfs41-server.txt @@ -213,3 +213,10 @@ The following cases aren't supported yet: DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID. * DESTROY_SESSION MUST be the final operation in the COMPOUND request. +Nonstandard compound limitations: +* No support for a sessions fore channel RPC compound that requires both a + ca_maxrequestsize request and a ca_maxresponsesize reply, so we may + fail to live up to the promise we made in CREATE_SESSION fore channel + negotiation. +* No more than one IO operation (read, write, readdir) allowed per + compound. -- cgit v1.1 From dc7a08166f3a5f23e79e839a8a88849bd3397c32 Mon Sep 17 00:00:00 2001 From: "J. Bruce Fields" Date: Tue, 27 Oct 2009 14:41:35 -0400 Subject: nfs: new subdir Documentation/filesystems/nfs We're adding enough nfs documentation that it may as well have its own subdirectory. Acked-by: Randy Dunlap Signed-off-by: J. Bruce Fields --- Documentation/filesystems/00-INDEX | 10 +- Documentation/filesystems/Exporting | 147 -------------- Documentation/filesystems/nfs-rdma.txt | 271 ------------------------- Documentation/filesystems/nfs.txt | 98 --------- Documentation/filesystems/nfs/00-INDEX | 12 ++ Documentation/filesystems/nfs/Exporting | 147 ++++++++++++++ Documentation/filesystems/nfs/nfs-rdma.txt | 271 +++++++++++++++++++++++++ Documentation/filesystems/nfs/nfs.txt | 98 +++++++++ Documentation/filesystems/nfs/nfs41-server.txt | 222 ++++++++++++++++++++ Documentation/filesystems/nfs/nfsroot.txt | 270 ++++++++++++++++++++++++ Documentation/filesystems/nfs41-server.txt | 222 -------------------- Documentation/filesystems/nfsroot.txt | 270 ------------------------ Documentation/filesystems/porting | 2 +- Documentation/kernel-parameters.txt | 6 +- 14 files changed, 1026 insertions(+), 1020 deletions(-) delete mode 100644 Documentation/filesystems/Exporting delete mode 100644 Documentation/filesystems/nfs-rdma.txt delete mode 100644 Documentation/filesystems/nfs.txt create mode 100644 Documentation/filesystems/nfs/00-INDEX create mode 100644 Documentation/filesystems/nfs/Exporting create mode 100644 Documentation/filesystems/nfs/nfs-rdma.txt create mode 100644 Documentation/filesystems/nfs/nfs.txt create mode 100644 Documentation/filesystems/nfs/nfs41-server.txt create mode 100644 Documentation/filesystems/nfs/nfsroot.txt delete mode 100644 Documentation/filesystems/nfs41-server.txt delete mode 100644 Documentation/filesystems/nfsroot.txt (limited to 'Documentation') diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index f15621e..482151c 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -1,7 +1,5 @@ 00-INDEX - this file (info on some of the filesystems supported by linux). -Exporting - - explanation of how to make filesystems exportable. Locking - info on locking rules as they pertain to Linux VFS. 9p.txt @@ -66,12 +64,8 @@ mandatory-locking.txt - info on the Linux implementation of Sys V mandatory file locking. ncpfs.txt - info on Novell Netware(tm) filesystem using NCP protocol. -nfs41-server.txt - - info on the Linux server implementation of NFSv4 minor version 1. -nfs-rdma.txt - - how to install and setup the Linux NFS/RDMA client and server software. -nfsroot.txt - - short guide on setting up a diskless box with NFS root filesystem. +nfs/ + - nfs-related documentation. nilfs2.txt - info and mount options for the NILFS2 filesystem. ntfs.txt diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/Exporting deleted file mode 100644 index 87019d2..0000000 --- a/Documentation/filesystems/Exporting +++ /dev/null @@ -1,147 +0,0 @@ - -Making Filesystems Exportable -============================= - -Overview --------- - -All filesystem operations require a dentry (or two) as a starting -point. Local applications have a reference-counted hold on suitable -dentries via open file descriptors or cwd/root. However remote -applications that access a filesystem via a remote filesystem protocol -such as NFS may not be able to hold such a reference, and so need a -different way to refer to a particular dentry. As the alternative -form of reference needs to be stable across renames, truncates, and -server-reboot (among other things, though these tend to be the most -problematic), there is no simple answer like 'filename'. - -The mechanism discussed here allows each filesystem implementation to -specify how to generate an opaque (outside of the filesystem) byte -string for any dentry, and how to find an appropriate dentry for any -given opaque byte string. -This byte string will be called a "filehandle fragment" as it -corresponds to part of an NFS filehandle. - -A filesystem which supports the mapping between filehandle fragments -and dentries will be termed "exportable". - - - -Dcache Issues -------------- - -The dcache normally contains a proper prefix of any given filesystem -tree. This means that if any filesystem object is in the dcache, then -all of the ancestors of that filesystem object are also in the dcache. -As normal access is by filename this prefix is created naturally and -maintained easily (by each object maintaining a reference count on -its parent). - -However when objects are included into the dcache by interpreting a -filehandle fragment, there is no automatic creation of a path prefix -for the object. This leads to two related but distinct features of -the dcache that are not needed for normal filesystem access. - -1/ The dcache must sometimes contain objects that are not part of the - proper prefix. i.e that are not connected to the root. -2/ The dcache must be prepared for a newly found (via ->lookup) directory - to already have a (non-connected) dentry, and must be able to move - that dentry into place (based on the parent and name in the - ->lookup). This is particularly needed for directories as - it is a dcache invariant that directories only have one dentry. - -To implement these features, the dcache has: - -a/ A dentry flag DCACHE_DISCONNECTED which is set on - any dentry that might not be part of the proper prefix. - This is set when anonymous dentries are created, and cleared when a - dentry is noticed to be a child of a dentry which is in the proper - prefix. - -b/ A per-superblock list "s_anon" of dentries which are the roots of - subtrees that are not in the proper prefix. These dentries, as - well as the proper prefix, need to be released at unmount time. As - these dentries will not be hashed, they are linked together on the - d_hash list_head. - -c/ Helper routines to allocate anonymous dentries, and to help attach - loose directory dentries at lookup time. They are: - d_alloc_anon(inode) will return a dentry for the given inode. - If the inode already has a dentry, one of those is returned. - If it doesn't, a new anonymous (IS_ROOT and - DCACHE_DISCONNECTED) dentry is allocated and attached. - In the case of a directory, care is taken that only one dentry - can ever be attached. - d_splice_alias(inode, dentry) will make sure that there is a - dentry with the same name and parent as the given dentry, and - which refers to the given inode. - If the inode is a directory and already has a dentry, then that - dentry is d_moved over the given dentry. - If the passed dentry gets attached, care is taken that this is - mutually exclusive to a d_alloc_anon operation. - If the passed dentry is used, NULL is returned, else the used - dentry is returned. This corresponds to the calling pattern of - ->lookup. - - -Filesystem Issues ------------------ - -For a filesystem to be exportable it must: - - 1/ provide the filehandle fragment routines described below. - 2/ make sure that d_splice_alias is used rather than d_add - when ->lookup finds an inode for a given parent and name. - Typically the ->lookup routine will end with a: - - return d_splice_alias(inode, dentry); - } - - - - A file system implementation declares that instances of the filesystem -are exportable by setting the s_export_op field in the struct -super_block. This field must point to a "struct export_operations" -struct which has the following members: - - encode_fh (optional) - Takes a dentry and creates a filehandle fragment which can later be used - to find or create a dentry for the same object. The default - implementation creates a filehandle fragment that encodes a 32bit inode - and generation number for the inode encoded, and if necessary the - same information for the parent. - - fh_to_dentry (mandatory) - Given a filehandle fragment, this should find the implied object and - create a dentry for it (possibly with d_alloc_anon). - - fh_to_parent (optional but strongly recommended) - Given a filehandle fragment, this should find the parent of the - implied object and create a dentry for it (possibly with d_alloc_anon). - May fail if the filehandle fragment is too small. - - get_parent (optional but strongly recommended) - When given a dentry for a directory, this should return a dentry for - the parent. Quite possibly the parent dentry will have been allocated - by d_alloc_anon. The default get_parent function just returns an error - so any filehandle lookup that requires finding a parent will fail. - ->lookup("..") is *not* used as a default as it can leave ".." entries - in the dcache which are too messy to work with. - - get_name (optional) - When given a parent dentry and a child dentry, this should find a name - in the directory identified by the parent dentry, which leads to the - object identified by the child dentry. If no get_name function is - supplied, a default implementation is provided which uses vfs_readdir - to find potential names, and matches inode numbers to find the correct - match. - - -A filehandle fragment consists of an array of 1 or more 4byte words, -together with a one byte "type". -The decode_fh routine should not depend on the stated size that is -passed to it. This size may be larger than the original filehandle -generated by encode_fh, in which case it will have been padded with -nuls. Rather, the encode_fh routine should choose a "type" which -indicates the decode_fh how much of the filehandle is valid, and how -it should be interpreted. diff --git a/Documentation/filesystems/nfs-rdma.txt b/Documentation/filesystems/nfs-rdma.txt deleted file mode 100644 index e386f7e..0000000 --- a/Documentation/filesystems/nfs-rdma.txt +++ /dev/null @@ -1,271 +0,0 @@ -################################################################################ -# # -# NFS/RDMA README # -# # -################################################################################ - - Author: NetApp and Open Grid Computing - Date: May 29, 2008 - -Table of Contents -~~~~~~~~~~~~~~~~~ - - Overview - - Getting Help - - Installation - - Check RDMA and NFS Setup - - NFS/RDMA Setup - -Overview -~~~~~~~~ - - This document describes how to install and setup the Linux NFS/RDMA client - and server software. - - The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server - was first included in the following release, Linux 2.6.25. - - In our testing, we have obtained excellent performance results (full 10Gbit - wire bandwidth at minimal client CPU) under many workloads. The code passes - the full Connectathon test suite and operates over both Infiniband and iWARP - RDMA adapters. - -Getting Help -~~~~~~~~~~~~ - - If you get stuck, you can ask questions on the - - nfs-rdma-devel@lists.sourceforge.net - - mailing list. - -Installation -~~~~~~~~~~~~ - - These instructions are a step by step guide to building a machine for - use with NFS/RDMA. - - - Install an RDMA device - - Any device supported by the drivers in drivers/infiniband/hw is acceptable. - - Testing has been performed using several Mellanox-based IB cards, the - Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter. - - - Install a Linux distribution and tools - - The first kernel release to contain both the NFS/RDMA client and server was - Linux 2.6.25 Therefore, a distribution compatible with this and subsequent - Linux kernel release should be installed. - - The procedures described in this document have been tested with - distributions from Red Hat's Fedora Project (http://fedora.redhat.com/). - - - Install nfs-utils-1.1.2 or greater on the client - - An NFS/RDMA mount point can be obtained by using the mount.nfs command in - nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils - version with support for NFS/RDMA mounts, but for various reasons we - recommend using nfs-utils-1.1.2 or greater). To see which version of - mount.nfs you are using, type: - - $ /sbin/mount.nfs -V - - If the version is less than 1.1.2 or the command does not exist, - you should install the latest version of nfs-utils. - - Download the latest package from: - - http://www.kernel.org/pub/linux/utils/nfs - - Uncompress the package and follow the installation instructions. - - If you will not need the idmapper and gssd executables (you do not need - these to create an NFS/RDMA enabled mount command), the installation - process can be simplified by disabling these features when running - configure: - - $ ./configure --disable-gss --disable-nfsv4 - - To build nfs-utils you will need the tcp_wrappers package installed. For - more information on this see the package's README and INSTALL files. - - After building the nfs-utils package, there will be a mount.nfs binary in - the utils/mount directory. This binary can be used to initiate NFS v2, v3, - or v4 mounts. To initiate a v4 mount, the binary must be called - mount.nfs4. The standard technique is to create a symlink called - mount.nfs4 to mount.nfs. - - This mount.nfs binary should be installed at /sbin/mount.nfs as follows: - - $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs - - In this location, mount.nfs will be invoked automatically for NFS mounts - by the system mount command. - - NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed - on the NFS client machine. You do not need this specific version of - nfs-utils on the server. Furthermore, only the mount.nfs command from - nfs-utils-1.1.2 is needed on the client. - - - Install a Linux kernel with NFS/RDMA - - The NFS/RDMA client and server are both included in the mainline Linux - kernel version 2.6.25 and later. This and other versions of the 2.6 Linux - kernel can be found at: - - ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ - - Download the sources and place them in an appropriate location. - - - Configure the RDMA stack - - Make sure your kernel configuration has RDMA support enabled. Under - Device Drivers -> InfiniBand support, update the kernel configuration - to enable InfiniBand support [NOTE: the option name is misleading. Enabling - InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)]. - - Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or - iWARP adapter support (amso, cxgb3, etc.). - - If you are using InfiniBand, be sure to enable IP-over-InfiniBand support. - - - Configure the NFS client and server - - Your kernel configuration must also have NFS file system support and/or - NFS server support enabled. These and other NFS related configuration - options can be found under File Systems -> Network File Systems. - - - Build, install, reboot - - The NFS/RDMA code will be enabled automatically if NFS and RDMA - are turned on. The NFS/RDMA client and server are configured via the hidden - SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The - value of SUNRPC_XPRT_RDMA will be: - - - N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client - and server will not be built - - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M, - in this case the NFS/RDMA client and server will be built as modules - - Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client - and server will be built into the kernel - - Therefore, if you have followed the steps above and turned no NFS and RDMA, - the NFS/RDMA client and server will be built. - - Build a new kernel, install it, boot it. - -Check RDMA and NFS Setup -~~~~~~~~~~~~~~~~~~~~~~~~ - - Before configuring the NFS/RDMA software, it is a good idea to test - your new kernel to ensure that the kernel is working correctly. - In particular, it is a good idea to verify that the RDMA stack - is functioning as expected and standard NFS over TCP/IP and/or UDP/IP - is working properly. - - - Check RDMA Setup - - If you built the RDMA components as modules, load them at - this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel - card: - - $ modprobe ib_mthca - $ modprobe ib_ipoib - - If you are using InfiniBand, make sure there is a Subnet Manager (SM) - running on the network. If your IB switch has an embedded SM, you can - use it. Otherwise, you will need to run an SM, such as OpenSM, on one - of your end nodes. - - If an SM is running on your network, you should see the following: - - $ cat /sys/class/infiniband/driverX/ports/1/state - 4: ACTIVE - - where driverX is mthca0, ipath5, ehca3, etc. - - To further test the InfiniBand software stack, use IPoIB (this - assumes you have two IB hosts named host1 and host2): - - host1$ ifconfig ib0 a.b.c.x - host2$ ifconfig ib0 a.b.c.y - host1$ ping a.b.c.y - host2$ ping a.b.c.x - - For other device types, follow the appropriate procedures. - - - Check NFS Setup - - For the NFS components enabled above (client and/or server), - test their functionality over standard Ethernet using TCP/IP or UDP/IP. - -NFS/RDMA Setup -~~~~~~~~~~~~~~ - - We recommend that you use two machines, one to act as the client and - one to act as the server. - - One time configuration: - - - On the server system, configure the /etc/exports file and - start the NFS/RDMA server. - - Exports entries with the following formats have been tested: - - /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash) - /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash) - - The IP address(es) is(are) the client's IPoIB address for an InfiniBand - HCA or the cleint's iWARP address(es) for an RNIC. - - NOTE: The "insecure" option must be used because the NFS/RDMA client does - not use a reserved port. - - Each time a machine boots: - - - Load and configure the RDMA drivers - - For InfiniBand using a Mellanox adapter: - - $ modprobe ib_mthca - $ modprobe ib_ipoib - $ ifconfig ib0 a.b.c.d - - NOTE: use unique addresses for the client and server - - - Start the NFS server - - If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in - kernel config), load the RDMA transport module: - - $ modprobe svcrdma - - Regardless of how the server was built (module or built-in), start the - server: - - $ /etc/init.d/nfs start - - or - - $ service nfs start - - Instruct the server to listen on the RDMA transport: - - $ echo rdma 20049 > /proc/fs/nfsd/portlist - - - On the client system - - If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in - kernel config), load the RDMA client module: - - $ modprobe xprtrdma.ko - - Regardless of how the client was built (module or built-in), use this - command to mount the NFS/RDMA server: - - $ mount -o rdma,port=20049 :/ /mnt - - To verify that the mount is using RDMA, run "cat /proc/mounts" and check - the "proto" field for the given mount. - - Congratulations! You're using NFS/RDMA! diff --git a/Documentation/filesystems/nfs.txt b/Documentation/filesystems/nfs.txt deleted file mode 100644 index f50f26c..0000000 --- a/Documentation/filesystems/nfs.txt +++ /dev/null @@ -1,98 +0,0 @@ - -The NFS client -============== - -The NFS version 2 protocol was first documented in RFC1094 (March 1989). -Since then two more major releases of NFS have been published, with NFSv3 -being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April -2003). - -The Linux NFS client currently supports all the above published versions, -and work is in progress on adding support for minor version 1 of the NFSv4 -protocol. - -The purpose of this document is to provide information on some of the -upcall interfaces that are used in order to provide the NFS client with -some of the information that it requires in order to fully comply with -the NFS spec. - -The DNS resolver -================ - -NFSv4 allows for one server to refer the NFS client to data that has been -migrated onto another server by means of the special "fs_locations" -attribute. See - http://tools.ietf.org/html/rfc3530#section-6 -and - http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00 - -The fs_locations information can take the form of either an ip address and -a path, or a DNS hostname and a path. The latter requires the NFS client to -do a DNS lookup in order to mount the new volume, and hence the need for an -upcall to allow userland to provide this service. - -Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual -/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps: - - (1) The process checks the dns_resolve cache to see if it contains a - valid entry. If so, it returns that entry and exits. - - (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent' - (may be changed using the 'nfs.cache_getent' kernel boot parameter) - is run, with two arguments: - - the cache name, "dns_resolve" - - the hostname to resolve - - (3) After looking up the corresponding ip address, the helper script - writes the result into the rpc_pipefs pseudo-file - '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel' - in the following (text) format: - - " \n" - - Where is in the usual IPv4 (123.456.78.90) or IPv6 - (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format. - is identical to the second argument of the helper - script, and is the 'time to live' of this cache entry (in - units of seconds). - - Note: If is invalid, say the string "0", then a negative - entry is created, which will cause the kernel to treat the hostname - as having no valid DNS translation. - - - - -A basic sample /sbin/nfs_cache_getent -===================================== - -#!/bin/bash -# -ttl=600 -# -cut=/usr/bin/cut -getent=/usr/bin/getent -rpc_pipefs=/var/lib/nfs/rpc_pipefs -# -die() -{ - echo "Usage: $0 cache_name entry_name" - exit 1 -} - -[ $# -lt 2 ] && die -cachename="$1" -cache_path=${rpc_pipefs}/cache/${cachename}/channel - -case "${cachename}" in - dns_resolve) - name="$2" - result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )" - [ -z "${result}" ] && result="0" - ;; - *) - die - ;; -esac -echo "${result} ${name} ${ttl}" >${cache_path} - diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX new file mode 100644 index 0000000..6ff3d21 --- /dev/null +++ b/Documentation/filesystems/nfs/00-INDEX @@ -0,0 +1,12 @@ +00-INDEX + - this file (nfs-related documentation). +Exporting + - explanation of how to make filesystems exportable. +nfs.txt + - nfs client, and DNS resolution for fs_locations. +nfs41-server.txt + - info on the Linux server implementation of NFSv4 minor version 1. +nfs-rdma.txt + - how to install and setup the Linux NFS/RDMA client and server software +nfsroot.txt + - short guide on setting up a diskless box with NFS root filesystem. diff --git a/Documentation/filesystems/nfs/Exporting b/Documentation/filesystems/nfs/Exporting new file mode 100644 index 0000000..87019d2 --- /dev/null +++ b/Documentation/filesystems/nfs/Exporting @@ -0,0 +1,147 @@ + +Making Filesystems Exportable +============================= + +Overview +-------- + +All filesystem operations require a dentry (or two) as a starting +point. Local applications have a reference-counted hold on suitable +dentries via open file descriptors or cwd/root. However remote +applications that access a filesystem via a remote filesystem protocol +such as NFS may not be able to hold such a reference, and so need a +different way to refer to a particular dentry. As the alternative +form of reference needs to be stable across renames, truncates, and +server-reboot (among other things, though these tend to be the most +problematic), there is no simple answer like 'filename'. + +The mechanism discussed here allows each filesystem implementation to +specify how to generate an opaque (outside of the filesystem) byte +string for any dentry, and how to find an appropriate dentry for any +given opaque byte string. +This byte string will be called a "filehandle fragment" as it +corresponds to part of an NFS filehandle. + +A filesystem which supports the mapping between filehandle fragments +and dentries will be termed "exportable". + + + +Dcache Issues +------------- + +The dcache normally contains a proper prefix of any given filesystem +tree. This means that if any filesystem object is in the dcache, then +all of the ancestors of that filesystem object are also in the dcache. +As normal access is by filename this prefix is created naturally and +maintained easily (by each object maintaining a reference count on +its parent). + +However when objects are included into the dcache by interpreting a +filehandle fragment, there is no automatic creation of a path prefix +for the object. This leads to two related but distinct features of +the dcache that are not needed for normal filesystem access. + +1/ The dcache must sometimes contain objects that are not part of the + proper prefix. i.e that are not connected to the root. +2/ The dcache must be prepared for a newly found (via ->lookup) directory + to already have a (non-connected) dentry, and must be able to move + that dentry into place (based on the parent and name in the + ->lookup). This is particularly needed for directories as + it is a dcache invariant that directories only have one dentry. + +To implement these features, the dcache has: + +a/ A dentry flag DCACHE_DISCONNECTED which is set on + any dentry that might not be part of the proper prefix. + This is set when anonymous dentries are created, and cleared when a + dentry is noticed to be a child of a dentry which is in the proper + prefix. + +b/ A per-superblock list "s_anon" of dentries which are the roots of + subtrees that are not in the proper prefix. These dentries, as + well as the proper prefix, need to be released at unmount time. As + these dentries will not be hashed, they are linked together on the + d_hash list_head. + +c/ Helper routines to allocate anonymous dentries, and to help attach + loose directory dentries at lookup time. They are: + d_alloc_anon(inode) will return a dentry for the given inode. + If the inode already has a dentry, one of those is returned. + If it doesn't, a new anonymous (IS_ROOT and + DCACHE_DISCONNECTED) dentry is allocated and attached. + In the case of a directory, care is taken that only one dentry + can ever be attached. + d_splice_alias(inode, dentry) will make sure that there is a + dentry with the same name and parent as the given dentry, and + which refers to the given inode. + If the inode is a directory and already has a dentry, then that + dentry is d_moved over the given dentry. + If the passed dentry gets attached, care is taken that this is + mutually exclusive to a d_alloc_anon operation. + If the passed dentry is used, NULL is returned, else the used + dentry is returned. This corresponds to the calling pattern of + ->lookup. + + +Filesystem Issues +----------------- + +For a filesystem to be exportable it must: + + 1/ provide the filehandle fragment routines described below. + 2/ make sure that d_splice_alias is used rather than d_add + when ->lookup finds an inode for a given parent and name. + Typically the ->lookup routine will end with a: + + return d_splice_alias(inode, dentry); + } + + + + A file system implementation declares that instances of the filesystem +are exportable by setting the s_export_op field in the struct +super_block. This field must point to a "struct export_operations" +struct which has the following members: + + encode_fh (optional) + Takes a dentry and creates a filehandle fragment which can later be used + to find or create a dentry for the same object. The default + implementation creates a filehandle fragment that encodes a 32bit inode + and generation number for the inode encoded, and if necessary the + same information for the parent. + + fh_to_dentry (mandatory) + Given a filehandle fragment, this should find the implied object and + create a dentry for it (possibly with d_alloc_anon). + + fh_to_parent (optional but strongly recommended) + Given a filehandle fragment, this should find the parent of the + implied object and create a dentry for it (possibly with d_alloc_anon). + May fail if the filehandle fragment is too small. + + get_parent (optional but strongly recommended) + When given a dentry for a directory, this should return a dentry for + the parent. Quite possibly the parent dentry will have been allocated + by d_alloc_anon. The default get_parent function just returns an error + so any filehandle lookup that requires finding a parent will fail. + ->lookup("..") is *not* used as a default as it can leave ".." entries + in the dcache which are too messy to work with. + + get_name (optional) + When given a parent dentry and a child dentry, this should find a name + in the directory identified by the parent dentry, which leads to the + object identified by the child dentry. If no get_name function is + supplied, a default implementation is provided which uses vfs_readdir + to find potential names, and matches inode numbers to find the correct + match. + + +A filehandle fragment consists of an array of 1 or more 4byte words, +together with a one byte "type". +The decode_fh routine should not depend on the stated size that is +passed to it. This size may be larger than the original filehandle +generated by encode_fh, in which case it will have been padded with +nuls. Rather, the encode_fh routine should choose a "type" which +indicates the decode_fh how much of the filehandle is valid, and how +it should be interpreted. diff --git a/Documentation/filesystems/nfs/nfs-rdma.txt b/Documentation/filesystems/nfs/nfs-rdma.txt new file mode 100644 index 0000000..e386f7e --- /dev/null +++ b/Documentation/filesystems/nfs/nfs-rdma.txt @@ -0,0 +1,271 @@ +################################################################################ +# # +# NFS/RDMA README # +# # +################################################################################ + + Author: NetApp and Open Grid Computing + Date: May 29, 2008 + +Table of Contents +~~~~~~~~~~~~~~~~~ + - Overview + - Getting Help + - Installation + - Check RDMA and NFS Setup + - NFS/RDMA Setup + +Overview +~~~~~~~~ + + This document describes how to install and setup the Linux NFS/RDMA client + and server software. + + The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server + was first included in the following release, Linux 2.6.25. + + In our testing, we have obtained excellent performance results (full 10Gbit + wire bandwidth at minimal client CPU) under many workloads. The code passes + the full Connectathon test suite and operates over both Infiniband and iWARP + RDMA adapters. + +Getting Help +~~~~~~~~~~~~ + + If you get stuck, you can ask questions on the + + nfs-rdma-devel@lists.sourceforge.net + + mailing list. + +Installation +~~~~~~~~~~~~ + + These instructions are a step by step guide to building a machine for + use with NFS/RDMA. + + - Install an RDMA device + + Any device supported by the drivers in drivers/infiniband/hw is acceptable. + + Testing has been performed using several Mellanox-based IB cards, the + Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter. + + - Install a Linux distribution and tools + + The first kernel release to contain both the NFS/RDMA client and server was + Linux 2.6.25 Therefore, a distribution compatible with this and subsequent + Linux kernel release should be installed. + + The procedures described in this document have been tested with + distributions from Red Hat's Fedora Project (http://fedora.redhat.com/). + + - Install nfs-utils-1.1.2 or greater on the client + + An NFS/RDMA mount point can be obtained by using the mount.nfs command in + nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils + version with support for NFS/RDMA mounts, but for various reasons we + recommend using nfs-utils-1.1.2 or greater). To see which version of + mount.nfs you are using, type: + + $ /sbin/mount.nfs -V + + If the version is less than 1.1.2 or the command does not exist, + you should install the latest version of nfs-utils. + + Download the latest package from: + + http://www.kernel.org/pub/linux/utils/nfs + + Uncompress the package and follow the installation instructions. + + If you will not need the idmapper and gssd executables (you do not need + these to create an NFS/RDMA enabled mount command), the installation + process can be simplified by disabling these features when running + configure: + + $ ./configure --disable-gss --disable-nfsv4 + + To build nfs-utils you will need the tcp_wrappers package installed. For + more information on this see the package's README and INSTALL files. + + After building the nfs-utils package, there will be a mount.nfs binary in + the utils/mount directory. This binary can be used to initiate NFS v2, v3, + or v4 mounts. To initiate a v4 mount, the binary must be called + mount.nfs4. The standard technique is to create a symlink called + mount.nfs4 to mount.nfs. + + This mount.nfs binary should be installed at /sbin/mount.nfs as follows: + + $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs + + In this location, mount.nfs will be invoked automatically for NFS mounts + by the system mount command. + + NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed + on the NFS client machine. You do not need this specific version of + nfs-utils on the server. Furthermore, only the mount.nfs command from + nfs-utils-1.1.2 is needed on the client. + + - Install a Linux kernel with NFS/RDMA + + The NFS/RDMA client and server are both included in the mainline Linux + kernel version 2.6.25 and later. This and other versions of the 2.6 Linux + kernel can be found at: + + ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ + + Download the sources and place them in an appropriate location. + + - Configure the RDMA stack + + Make sure your kernel configuration has RDMA support enabled. Under + Device Drivers -> InfiniBand support, update the kernel configuration + to enable InfiniBand support [NOTE: the option name is misleading. Enabling + InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)]. + + Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or + iWARP adapter support (amso, cxgb3, etc.). + + If you are using InfiniBand, be sure to enable IP-over-InfiniBand support. + + - Configure the NFS client and server + + Your kernel configuration must also have NFS file system support and/or + NFS server support enabled. These and other NFS related configuration + options can be found under File Systems -> Network File Systems. + + - Build, install, reboot + + The NFS/RDMA code will be enabled automatically if NFS and RDMA + are turned on. The NFS/RDMA client and server are configured via the hidden + SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The + value of SUNRPC_XPRT_RDMA will be: + + - N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client + and server will not be built + - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M, + in this case the NFS/RDMA client and server will be built as modules + - Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client + and server will be built into the kernel + + Therefore, if you have followed the steps above and turned no NFS and RDMA, + the NFS/RDMA client and server will be built. + + Build a new kernel, install it, boot it. + +Check RDMA and NFS Setup +~~~~~~~~~~~~~~~~~~~~~~~~ + + Before configuring the NFS/RDMA software, it is a good idea to test + your new kernel to ensure that the kernel is working correctly. + In particular, it is a good idea to verify that the RDMA stack + is functioning as expected and standard NFS over TCP/IP and/or UDP/IP + is working properly. + + - Check RDMA Setup + + If you built the RDMA components as modules, load them at + this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel + card: + + $ modprobe ib_mthca + $ modprobe ib_ipoib + + If you are using InfiniBand, make sure there is a Subnet Manager (SM) + running on the network. If your IB switch has an embedded SM, you can + use it. Otherwise, you will need to run an SM, such as OpenSM, on one + of your end nodes. + + If an SM is running on your network, you should see the following: + + $ cat /sys/class/infiniband/driverX/ports/1/state + 4: ACTIVE + + where driverX is mthca0, ipath5, ehca3, etc. + + To further test the InfiniBand software stack, use IPoIB (this + assumes you have two IB hosts named host1 and host2): + + host1$ ifconfig ib0 a.b.c.x + host2$ ifconfig ib0 a.b.c.y + host1$ ping a.b.c.y + host2$ ping a.b.c.x + + For other device types, follow the appropriate procedures. + + - Check NFS Setup + + For the NFS components enabled above (client and/or server), + test their functionality over standard Ethernet using TCP/IP or UDP/IP. + +NFS/RDMA Setup +~~~~~~~~~~~~~~ + + We recommend that you use two machines, one to act as the client and + one to act as the server. + + One time configuration: + + - On the server system, configure the /etc/exports file and + start the NFS/RDMA server. + + Exports entries with the following formats have been tested: + + /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash) + /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash) + + The IP address(es) is(are) the client's IPoIB address for an InfiniBand + HCA or the cleint's iWARP address(es) for an RNIC. + + NOTE: The "insecure" option must be used because the NFS/RDMA client does + not use a reserved port. + + Each time a machine boots: + + - Load and configure the RDMA drivers + + For InfiniBand using a Mellanox adapter: + + $ modprobe ib_mthca + $ modprobe ib_ipoib + $ ifconfig ib0 a.b.c.d + + NOTE: use unique addresses for the client and server + + - Start the NFS server + + If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in + kernel config), load the RDMA transport module: + + $ modprobe svcrdma + + Regardless of how the server was built (module or built-in), start the + server: + + $ /etc/init.d/nfs start + + or + + $ service nfs start + + Instruct the server to listen on the RDMA transport: + + $ echo rdma 20049 > /proc/fs/nfsd/portlist + + - On the client system + + If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in + kernel config), load the RDMA client module: + + $ modprobe xprtrdma.ko + + Regardless of how the client was built (module or built-in), use this + command to mount the NFS/RDMA server: + + $ mount -o rdma,port=20049 :/ /mnt + + To verify that the mount is using RDMA, run "cat /proc/mounts" and check + the "proto" field for the given mount. + + Congratulations! You're using NFS/RDMA! diff --git a/Documentation/filesystems/nfs/nfs.txt b/Documentation/filesystems/nfs/nfs.txt new file mode 100644 index 0000000..f50f26c --- /dev/null +++ b/Documentation/filesystems/nfs/nfs.txt @@ -0,0 +1,98 @@ + +The NFS client +============== + +The NFS version 2 protocol was first documented in RFC1094 (March 1989). +Since then two more major releases of NFS have been published, with NFSv3 +being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April +2003). + +The Linux NFS client currently supports all the above published versions, +and work is in progress on adding support for minor version 1 of the NFSv4 +protocol. + +The purpose of this document is to provide information on some of the +upcall interfaces that are used in order to provide the NFS client with +some of the information that it requires in order to fully comply with +the NFS spec. + +The DNS resolver +================ + +NFSv4 allows for one server to refer the NFS client to data that has been +migrated onto another server by means of the special "fs_locations" +attribute. See + http://tools.ietf.org/html/rfc3530#section-6 +and + http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00 + +The fs_locations information can take the form of either an ip address and +a path, or a DNS hostname and a path. The latter requires the NFS client to +do a DNS lookup in order to mount the new volume, and hence the need for an +upcall to allow userland to provide this service. + +Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual +/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps: + + (1) The process checks the dns_resolve cache to see if it contains a + valid entry. If so, it returns that entry and exits. + + (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent' + (may be changed using the 'nfs.cache_getent' kernel boot parameter) + is run, with two arguments: + - the cache name, "dns_resolve" + - the hostname to resolve + + (3) After looking up the corresponding ip address, the helper script + writes the result into the rpc_pipefs pseudo-file + '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel' + in the following (text) format: + + " \n" + + Where is in the usual IPv4 (123.456.78.90) or IPv6 + (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format. + is identical to the second argument of the helper + script, and is the 'time to live' of this cache entry (in + units of seconds). + + Note: If is invalid, say the string "0", then a negative + entry is created, which will cause the kernel to treat the hostname + as having no valid DNS translation. + + + + +A basic sample /sbin/nfs_cache_getent +===================================== + +#!/bin/bash +# +ttl=600 +# +cut=/usr/bin/cut +getent=/usr/bin/getent +rpc_pipefs=/var/lib/nfs/rpc_pipefs +# +die() +{ + echo "Usage: $0 cache_name entry_name" + exit 1 +} + +[ $# -lt 2 ] && die +cachename="$1" +cache_path=${rpc_pipefs}/cache/${cachename}/channel + +case "${cachename}" in + dns_resolve) + name="$2" + result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )" + [ -z "${result}" ] && result="0" + ;; + *) + die + ;; +esac +echo "${result} ${name} ${ttl}" >${cache_path} + diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt new file mode 100644 index 0000000..1bd0d0c --- /dev/null +++ b/Documentation/filesystems/nfs/nfs41-server.txt @@ -0,0 +1,222 @@ +NFSv4.1 Server Implementation + +Server support for minorversion 1 can be controlled using the +/proc/fs/nfsd/versions control file. The string output returned +by reading this file will contain either "+4.1" or "-4.1" +correspondingly. + +Currently, server support for minorversion 1 is disabled by default. +It can be enabled at run time by writing the string "+4.1" to +the /proc/fs/nfsd/versions control file. Note that to write this +control file, the nfsd service must be taken down. Use your user-mode +nfs-utils to set this up; see rpc.nfsd(8) + +(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and +"-4", respectively. Therefore, code meant to work on both new and old +kernels must turn 4.1 on or off *before* turning support for version 4 +on or off; rpc.nfsd does this correctly.) + +The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based +on the latest NFSv4.1 Internet Draft: +http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29 + +From the many new features in NFSv4.1 the current implementation +focuses on the mandatory-to-implement NFSv4.1 Sessions, providing +"exactly once" semantics and better control and throttling of the +resources allocated for each client. + +Other NFSv4.1 features, Parallel NFS operations in particular, +are still under development out of tree. +See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design +for more information. + +The current implementation is intended for developers only: while it +does support ordinary file operations on clients we have tested against +(including the linux client), it is incomplete in ways which may limit +features unexpectedly, cause known bugs in rare cases, or cause +interoperability problems with future clients. Known issues: + + - gss support is questionable: currently mounts with kerberos + from a linux client are possible, but we aren't really + conformant with the spec (for example, we don't use kerberos + on the backchannel correctly). + - no trunking support: no clients currently take advantage of + trunking, but this is a mandatory feature, and its use is + recommended to clients in a number of places. (E.g. to ensure + timely renewal in case an existing connection's retry timeouts + have gotten too long; see section 8.3 of the draft.) + Therefore, lack of this feature may cause future clients to + fail. + - Incomplete backchannel support: incomplete backchannel gss + support and no support for BACKCHANNEL_CTL mean that + callbacks (hence delegations and layouts) may not be + available and clients confused by the incomplete + implementation may fail. + - Server reboot recovery is unsupported; if the server reboots, + clients may fail. + - We do not support SSV, which provides security for shared + client-server state (thus preventing unauthorized tampering + with locks and opens, for example). It is mandatory for + servers to support this, though no clients use it yet. + - Mandatory operations which we do not support, such as + DESTROY_CLIENTID, FREE_STATEID, SECINFO_NO_NAME, and + TEST_STATEID, are not currently used by clients, but will be + (and the spec recommends their uses in common cases), and + clients should not be expected to know how to recover from the + case where they are not supported. This will eventually cause + interoperability failures. + +In addition, some limitations are inherited from the current NFSv4 +implementation: + + - Incomplete delegation enforcement: if a file is renamed or + unlinked, a client holding a delegation may continue to + indefinitely allow opens of the file under the old name. + +The table below, taken from the NFSv4.1 document, lists +the operations that are mandatory to implement (REQ), optional +(OPT), and NFSv4.0 operations that are required not to implement (MNI) +in minor version 1. The first column indicates the operations that +are not supported yet by the linux server implementation. + +The OPTIONAL features identified and their abbreviations are as follows: + pNFS Parallel NFS + FDELG File Delegations + DDELG Directory Delegations + +The following abbreviations indicate the linux server implementation status. + I Implemented NFSv4.1 operations. + NS Not Supported. + NS* unimplemented optional feature. + P pNFS features implemented out of tree. + PNS pNFS features that are not supported yet (out of tree). + +Operations + + +----------------------+------------+--------------+----------------+ + | Operation | REQ, REC, | Feature | Definition | + | | OPT, or | (REQ, REC, | | + | | MNI | or OPT) | | + +----------------------+------------+--------------+----------------+ + | ACCESS | REQ | | Section 18.1 | +NS | BACKCHANNEL_CTL | REQ | | Section 18.33 | +NS | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | + | CLOSE | REQ | | Section 18.2 | + | COMMIT | REQ | | Section 18.3 | + | CREATE | REQ | | Section 18.4 | +I | CREATE_SESSION | REQ | | Section 18.36 | +NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | + | DELEGRETURN | OPT | FDELG, | Section 18.6 | + | | | DDELG, pNFS | | + | | | (REQ) | | +NS | DESTROY_CLIENTID | REQ | | Section 18.50 | +I | DESTROY_SESSION | REQ | | Section 18.37 | +I | EXCHANGE_ID | REQ | | Section 18.35 | +NS | FREE_STATEID | REQ | | Section 18.38 | + | GETATTR | REQ | | Section 18.7 | +P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | +P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | + | GETFH | REQ | | Section 18.8 | +NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | +P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | +P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | +P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | + | LINK | OPT | | Section 18.9 | + | LOCK | REQ | | Section 18.10 | + | LOCKT | REQ | | Section 18.11 | + | LOCKU | REQ | | Section 18.12 | + | LOOKUP | REQ | | Section 18.13 | + | LOOKUPP | REQ | | Section 18.14 | + | NVERIFY | REQ | | Section 18.15 | + | OPEN | REQ | | Section 18.16 | +NS*| OPENATTR | OPT | | Section 18.17 | + | OPEN_CONFIRM | MNI | | N/A | + | OPEN_DOWNGRADE | REQ | | Section 18.18 | + | PUTFH | REQ | | Section 18.19 | + | PUTPUBFH | REQ | | Section 18.20 | + | PUTROOTFH | REQ | | Section 18.21 | + | READ | REQ | | Section 18.22 | + | READDIR | REQ | | Section 18.23 | + | READLINK | OPT | | Section 18.24 | +NS | RECLAIM_COMPLETE | REQ | | Section 18.51 | + | RELEASE_LOCKOWNER | MNI | | N/A | + | REMOVE | REQ | | Section 18.25 | + | RENAME | REQ | | Section 18.26 | + | RENEW | MNI | | N/A | + | RESTOREFH | REQ | | Section 18.27 | + | SAVEFH | REQ | | Section 18.28 | + | SECINFO | REQ | | Section 18.29 | +NS | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | + | | | layout (REQ) | Section 13.12 | +I | SEQUENCE | REQ | | Section 18.46 | + | SETATTR | REQ | | Section 18.30 | + | SETCLIENTID | MNI | | N/A | + | SETCLIENTID_CONFIRM | MNI | | N/A | +NS | SET_SSV | REQ | | Section 18.47 | +NS | TEST_STATEID | REQ | | Section 18.48 | + | VERIFY | REQ | | Section 18.31 | +NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | + | WRITE | REQ | | Section 18.32 | + +Callback Operations + + +-------------------------+-----------+-------------+---------------+ + | Operation | REQ, REC, | Feature | Definition | + | | OPT, or | (REQ, REC, | | + | | MNI | or OPT) | | + +-------------------------+-----------+-------------+---------------+ + | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | +P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | +NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | +P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | +NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 | +NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | + | CB_RECALL | OPT | FDELG, | Section 20.2 | + | | | DDELG, pNFS | | + | | | (REQ) | | +NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 | + | | | DDELG, pNFS | | + | | | (REQ) | | +NS | CB_RECALL_SLOT | REQ | | Section 20.8 | +NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 | + | | | (REQ) | | +I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | + | | | DDELG, pNFS | | + | | | (REQ) | | +NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | + | | | DDELG, pNFS | | + | | | (REQ) | | + +-------------------------+-----------+-------------+---------------+ + +Implementation notes: + +DELEGPURGE: +* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or + CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that + persist across client reboots). Thus we need not implement this for + now. + +EXCHANGE_ID: +* only SP4_NONE state protection supported +* implementation ids are ignored + +CREATE_SESSION: +* backchannel attributes are ignored +* backchannel security parameters are ignored + +SEQUENCE: +* no support for dynamic slot table renegotiation (optional) + +nfsv4.1 COMPOUND rules: +The following cases aren't supported yet: +* Enforcing of NFS4ERR_NOT_ONLY_OP for: BIND_CONN_TO_SESSION, CREATE_SESSION, + DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID. +* DESTROY_SESSION MUST be the final operation in the COMPOUND request. + +Nonstandard compound limitations: +* No support for a sessions fore channel RPC compound that requires both a + ca_maxrequestsize request and a ca_maxresponsesize reply, so we may + fail to live up to the promise we made in CREATE_SESSION fore channel + negotiation. +* No more than one IO operation (read, write, readdir) allowed per + compound. diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt new file mode 100644 index 0000000..3ba0b94 --- /dev/null +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -0,0 +1,270 @@ +Mounting the root filesystem via NFS (nfsroot) +=============================================== + +Written 1996 by Gero Kuhlmann +Updated 1997 by Martin Mares +Updated 2006 by Nico Schottelius +Updated 2006 by Horms + + + +In order to use a diskless system, such as an X-terminal or printer server +for example, it is necessary for the root filesystem to be present on a +non-disk device. This may be an initramfs (see Documentation/filesystems/ +ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a +filesystem mounted via NFS. The following text describes on how to use NFS +for the root filesystem. For the rest of this text 'client' means the +diskless system, and 'server' means the NFS server. + + + + +1.) Enabling nfsroot capabilities + ----------------------------- + +In order to use nfsroot, NFS client support needs to be selected as +built-in during configuration. Once this has been selected, the nfsroot +option will become available, which should also be selected. + +In the networking options, kernel level autoconfiguration can be selected, +along with the types of autoconfiguration to support. Selecting all of +DHCP, BOOTP and RARP is safe. + + + + +2.) Kernel command line + ------------------- + +When the kernel has been loaded by a boot loader (see below) it needs to be +told what root fs device to use. And in the case of nfsroot, where to find +both the server and the name of the directory on the server to mount as root. +This can be established using the following kernel command line parameters: + + +root=/dev/nfs + + This is necessary to enable the pseudo-NFS-device. Note that it's not a + real device but just a synonym to tell the kernel to use NFS instead of + a real device. + + +nfsroot=[:][,] + + If the `nfsroot' parameter is NOT given on the command line, + the default "/tftpboot/%s" will be used. + + Specifies the IP address of the NFS server. + The default address is determined by the `ip' parameter + (see below). This parameter allows the use of different + servers for IP autoconfiguration and NFS. + + Name of the directory on the server to mount as root. + If there is a "%s" token in the string, it will be + replaced by the ASCII-representation of the client's + IP address. + + Standard NFS options. All options are separated by commas. + The following defaults are used: + port = as given by server portmap daemon + rsize = 4096 + wsize = 4096 + timeo = 7 + retrans = 3 + acregmin = 3 + acregmax = 60 + acdirmin = 30 + acdirmax = 60 + flags = hard, nointr, noposix, cto, ac + + +ip=:::::: + + This parameter tells the kernel how to configure IP addresses of devices + and also how to set up the IP routing table. It was originally called + `nfsaddrs', but now the boot-time IP configuration works independently of + NFS, so it was renamed to `ip' and the old name remained as an alias for + compatibility reasons. + + If this parameter is missing from the kernel command line, all fields are + assumed to be empty, and the defaults mentioned below apply. In general + this means that the kernel tries to configure everything using + autoconfiguration. + + The parameter can appear alone as the value to the `ip' + parameter (without all the ':' characters before). If the value is + "ip=off" or "ip=none", no autoconfiguration will take place, otherwise + autoconfiguration will take place. The most common way to use this + is "ip=dhcp". + + IP address of the client. + + Default: Determined using autoconfiguration. + + IP address of the NFS server. If RARP is used to determine + the client address and this parameter is NOT empty only + replies from the specified server are accepted. + + Only required for NFS root. That is autoconfiguration + will not be triggered if it is missing and NFS root is not + in operation. + + Default: Determined using autoconfiguration. + The address of the autoconfiguration server is used. + + IP address of a gateway if the server is on a different subnet. + + Default: Determined using autoconfiguration. + + Netmask for local network interface. If unspecified + the netmask is derived from the client IP address assuming + classful addressing. + + Default: Determined using autoconfiguration. + + Name of the client. May be supplied by autoconfiguration, + but its absence will not trigger autoconfiguration. + + Default: Client IP address is used in ASCII notation. + + Name of network device to use. + + Default: If the host only has one device, it is used. + Otherwise the device is determined using + autoconfiguration. This is done by sending + autoconfiguration requests out of all devices, + and using the device that received the first reply. + + Method to use for autoconfiguration. In the case of options + which specify multiple autoconfiguration protocols, + requests are sent using all protocols, and the first one + to reply is used. + + Only autoconfiguration protocols that have been compiled + into the kernel will be used, regardless of the value of + this option. + + off or none: don't use autoconfiguration + (do static IP assignment instead) + on or any: use any protocol available in the kernel + (default) + dhcp: use DHCP + bootp: use BOOTP + rarp: use RARP + both: use both BOOTP and RARP but not DHCP + (old option kept for backwards compatibility) + + Default: any + + + + +3.) Boot Loader + ---------- + +To get the kernel into memory different approaches can be used. +They depend on various facilities being available: + + +3.1) Booting from a floppy using syslinux + + When building kernels, an easy way to create a boot floppy that uses + syslinux is to use the zdisk or bzdisk make targets which use zimage + and bzimage images respectively. Both targets accept the + FDARGS parameter which can be used to set the kernel command line. + + e.g. + make bzdisk FDARGS="root=/dev/nfs" + + Note that the user running this command will need to have + access to the floppy drive device, /dev/fd0 + + For more information on syslinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + N.B: Previously it was possible to write a kernel directly to + a floppy using dd, configure the boot device using rdev, and + boot using the resulting floppy. Linux no longer supports this + method of booting. + +3.2) Booting from a cdrom using isolinux + + When building kernels, an easy way to create a bootable cdrom that + uses isolinux is to use the isoimage target which uses a bzimage + image. Like zdisk and bzdisk, this target accepts the FDARGS + parameter which can be used to set the kernel command line. + + e.g. + make isoimage FDARGS="root=/dev/nfs" + + The resulting iso image will be arch//boot/image.iso + This can be written to a cdrom using a variety of tools including + cdrecord. + + e.g. + cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + +3.2) Using LILO + When using LILO all the necessary command line parameters may be + specified using the 'append=' directive in the LILO configuration + file. + + However, to use the 'root=' directive you also need to create + a dummy root device, which may be removed after LILO is run. + + mknod /dev/boot255 c 0 255 + + For information on configuring LILO, please refer to its documentation. + +3.3) Using GRUB + When using GRUB, kernel parameter are simply appended after the kernel + specification: kernel + +3.4) Using loadlin + loadlin may be used to boot Linux from a DOS command prompt without + requiring a local hard disk to mount as root. This has not been + thoroughly tested by the authors of this document, but in general + it should be possible configure the kernel command line similarly + to the configuration of LILO. + + Please refer to the loadlin documentation for further information. + +3.5) Using a boot ROM + This is probably the most elegant way of booting a diskless client. + With a boot ROM the kernel is loaded using the TFTP protocol. The + authors of this document are not aware of any no commercial boot + ROMs that support booting Linux over the network. However, there + are two free implementations of a boot ROM, netboot-nfs and + etherboot, both of which are available on sunsite.unc.edu, and both + of which contain everything you need to boot a diskless Linux client. + +3.6) Using pxelinux + Pxelinux may be used to boot linux using the PXE boot loader + which is present on many modern network cards. + + When using pxelinux, the kernel image is specified using + "kernel ". The nfsroot parameters + are passed to the kernel by adding them to the "append" line. + It is common to use serial console in conjunction with pxeliunx, + see Documentation/serial-console.txt for more information. + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + + + +4.) Credits + ------- + + The nfsroot code in the kernel and the RARP support have been written + by Gero Kuhlmann . + + The rest of the IP layer autoconfiguration code has been written + by Martin Mares . + + In order to write the initial version of nfsroot I would like to thank + Jens-Uwe Mager for his help. diff --git a/Documentation/filesystems/nfs41-server.txt b/Documentation/filesystems/nfs41-server.txt deleted file mode 100644 index 1bd0d0c..0000000 --- a/Documentation/filesystems/nfs41-server.txt +++ /dev/null @@ -1,222 +0,0 @@ -NFSv4.1 Server Implementation - -Server support for minorversion 1 can be controlled using the -/proc/fs/nfsd/versions control file. The string output returned -by reading this file will contain either "+4.1" or "-4.1" -correspondingly. - -Currently, server support for minorversion 1 is disabled by default. -It can be enabled at run time by writing the string "+4.1" to -the /proc/fs/nfsd/versions control file. Note that to write this -control file, the nfsd service must be taken down. Use your user-mode -nfs-utils to set this up; see rpc.nfsd(8) - -(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and -"-4", respectively. Therefore, code meant to work on both new and old -kernels must turn 4.1 on or off *before* turning support for version 4 -on or off; rpc.nfsd does this correctly.) - -The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based -on the latest NFSv4.1 Internet Draft: -http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29 - -From the many new features in NFSv4.1 the current implementation -focuses on the mandatory-to-implement NFSv4.1 Sessions, providing -"exactly once" semantics and better control and throttling of the -resources allocated for each client. - -Other NFSv4.1 features, Parallel NFS operations in particular, -are still under development out of tree. -See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design -for more information. - -The current implementation is intended for developers only: while it -does support ordinary file operations on clients we have tested against -(including the linux client), it is incomplete in ways which may limit -features unexpectedly, cause known bugs in rare cases, or cause -interoperability problems with future clients. Known issues: - - - gss support is questionable: currently mounts with kerberos - from a linux client are possible, but we aren't really - conformant with the spec (for example, we don't use kerberos - on the backchannel correctly). - - no trunking support: no clients currently take advantage of - trunking, but this is a mandatory feature, and its use is - recommended to clients in a number of places. (E.g. to ensure - timely renewal in case an existing connection's retry timeouts - have gotten too long; see section 8.3 of the draft.) - Therefore, lack of this feature may cause future clients to - fail. - - Incomplete backchannel support: incomplete backchannel gss - support and no support for BACKCHANNEL_CTL mean that - callbacks (hence delegations and layouts) may not be - available and clients confused by the incomplete - implementation may fail. - - Server reboot recovery is unsupported; if the server reboots, - clients may fail. - - We do not support SSV, which provides security for shared - client-server state (thus preventing unauthorized tampering - with locks and opens, for example). It is mandatory for - servers to support this, though no clients use it yet. - - Mandatory operations which we do not support, such as - DESTROY_CLIENTID, FREE_STATEID, SECINFO_NO_NAME, and - TEST_STATEID, are not currently used by clients, but will be - (and the spec recommends their uses in common cases), and - clients should not be expected to know how to recover from the - case where they are not supported. This will eventually cause - interoperability failures. - -In addition, some limitations are inherited from the current NFSv4 -implementation: - - - Incomplete delegation enforcement: if a file is renamed or - unlinked, a client holding a delegation may continue to - indefinitely allow opens of the file under the old name. - -The table below, taken from the NFSv4.1 document, lists -the operations that are mandatory to implement (REQ), optional -(OPT), and NFSv4.0 operations that are required not to implement (MNI) -in minor version 1. The first column indicates the operations that -are not supported yet by the linux server implementation. - -The OPTIONAL features identified and their abbreviations are as follows: - pNFS Parallel NFS - FDELG File Delegations - DDELG Directory Delegations - -The following abbreviations indicate the linux server implementation status. - I Implemented NFSv4.1 operations. - NS Not Supported. - NS* unimplemented optional feature. - P pNFS features implemented out of tree. - PNS pNFS features that are not supported yet (out of tree). - -Operations - - +----------------------+------------+--------------+----------------+ - | Operation | REQ, REC, | Feature | Definition | - | | OPT, or | (REQ, REC, | | - | | MNI | or OPT) | | - +----------------------+------------+--------------+----------------+ - | ACCESS | REQ | | Section 18.1 | -NS | BACKCHANNEL_CTL | REQ | | Section 18.33 | -NS | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | - | CLOSE | REQ | | Section 18.2 | - | COMMIT | REQ | | Section 18.3 | - | CREATE | REQ | | Section 18.4 | -I | CREATE_SESSION | REQ | | Section 18.36 | -NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | - | DELEGRETURN | OPT | FDELG, | Section 18.6 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS | DESTROY_CLIENTID | REQ | | Section 18.50 | -I | DESTROY_SESSION | REQ | | Section 18.37 | -I | EXCHANGE_ID | REQ | | Section 18.35 | -NS | FREE_STATEID | REQ | | Section 18.38 | - | GETATTR | REQ | | Section 18.7 | -P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | -P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | - | GETFH | REQ | | Section 18.8 | -NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | -P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | -P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | -P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | - | LINK | OPT | | Section 18.9 | - | LOCK | REQ | | Section 18.10 | - | LOCKT | REQ | | Section 18.11 | - | LOCKU | REQ | | Section 18.12 | - | LOOKUP | REQ | | Section 18.13 | - | LOOKUPP | REQ | | Section 18.14 | - | NVERIFY | REQ | | Section 18.15 | - | OPEN | REQ | | Section 18.16 | -NS*| OPENATTR | OPT | | Section 18.17 | - | OPEN_CONFIRM | MNI | | N/A | - | OPEN_DOWNGRADE | REQ | | Section 18.18 | - | PUTFH | REQ | | Section 18.19 | - | PUTPUBFH | REQ | | Section 18.20 | - | PUTROOTFH | REQ | | Section 18.21 | - | READ | REQ | | Section 18.22 | - | READDIR | REQ | | Section 18.23 | - | READLINK | OPT | | Section 18.24 | -NS | RECLAIM_COMPLETE | REQ | | Section 18.51 | - | RELEASE_LOCKOWNER | MNI | | N/A | - | REMOVE | REQ | | Section 18.25 | - | RENAME | REQ | | Section 18.26 | - | RENEW | MNI | | N/A | - | RESTOREFH | REQ | | Section 18.27 | - | SAVEFH | REQ | | Section 18.28 | - | SECINFO | REQ | | Section 18.29 | -NS | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | - | | | layout (REQ) | Section 13.12 | -I | SEQUENCE | REQ | | Section 18.46 | - | SETATTR | REQ | | Section 18.30 | - | SETCLIENTID | MNI | | N/A | - | SETCLIENTID_CONFIRM | MNI | | N/A | -NS | SET_SSV | REQ | | Section 18.47 | -NS | TEST_STATEID | REQ | | Section 18.48 | - | VERIFY | REQ | | Section 18.31 | -NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | - | WRITE | REQ | | Section 18.32 | - -Callback Operations - - +-------------------------+-----------+-------------+---------------+ - | Operation | REQ, REC, | Feature | Definition | - | | OPT, or | (REQ, REC, | | - | | MNI | or OPT) | | - +-------------------------+-----------+-------------+---------------+ - | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | -P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | -NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | -P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | -NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 | -NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | - | CB_RECALL | OPT | FDELG, | Section 20.2 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS | CB_RECALL_SLOT | REQ | | Section 20.8 | -NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 | - | | | (REQ) | | -I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | - | | | DDELG, pNFS | | - | | | (REQ) | | -NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | - | | | DDELG, pNFS | | - | | | (REQ) | | - +-------------------------+-----------+-------------+---------------+ - -Implementation notes: - -DELEGPURGE: -* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or - CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that - persist across client reboots). Thus we need not implement this for - now. - -EXCHANGE_ID: -* only SP4_NONE state protection supported -* implementation ids are ignored - -CREATE_SESSION: -* backchannel attributes are ignored -* backchannel security parameters are ignored - -SEQUENCE: -* no support for dynamic slot table renegotiation (optional) - -nfsv4.1 COMPOUND rules: -The following cases aren't supported yet: -* Enforcing of NFS4ERR_NOT_ONLY_OP for: BIND_CONN_TO_SESSION, CREATE_SESSION, - DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID. -* DESTROY_SESSION MUST be the final operation in the COMPOUND request. - -Nonstandard compound limitations: -* No support for a sessions fore channel RPC compound that requires both a - ca_maxrequestsize request and a ca_maxresponsesize reply, so we may - fail to live up to the promise we made in CREATE_SESSION fore channel - negotiation. -* No more than one IO operation (read, write, readdir) allowed per - compound. diff --git a/Documentation/filesystems/nfsroot.txt b/Documentation/filesystems/nfsroot.txt deleted file mode 100644 index 3ba0b94..0000000 --- a/Documentation/filesystems/nfsroot.txt +++ /dev/null @@ -1,270 +0,0 @@ -Mounting the root filesystem via NFS (nfsroot) -=============================================== - -Written 1996 by Gero Kuhlmann -Updated 1997 by Martin Mares -Updated 2006 by Nico Schottelius -Updated 2006 by Horms - - - -In order to use a diskless system, such as an X-terminal or printer server -for example, it is necessary for the root filesystem to be present on a -non-disk device. This may be an initramfs (see Documentation/filesystems/ -ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a -filesystem mounted via NFS. The following text describes on how to use NFS -for the root filesystem. For the rest of this text 'client' means the -diskless system, and 'server' means the NFS server. - - - - -1.) Enabling nfsroot capabilities - ----------------------------- - -In order to use nfsroot, NFS client support needs to be selected as -built-in during configuration. Once this has been selected, the nfsroot -option will become available, which should also be selected. - -In the networking options, kernel level autoconfiguration can be selected, -along with the types of autoconfiguration to support. Selecting all of -DHCP, BOOTP and RARP is safe. - - - - -2.) Kernel command line - ------------------- - -When the kernel has been loaded by a boot loader (see below) it needs to be -told what root fs device to use. And in the case of nfsroot, where to find -both the server and the name of the directory on the server to mount as root. -This can be established using the following kernel command line parameters: - - -root=/dev/nfs - - This is necessary to enable the pseudo-NFS-device. Note that it's not a - real device but just a synonym to tell the kernel to use NFS instead of - a real device. - - -nfsroot=[:][,] - - If the `nfsroot' parameter is NOT given on the command line, - the default "/tftpboot/%s" will be used. - - Specifies the IP address of the NFS server. - The default address is determined by the `ip' parameter - (see below). This parameter allows the use of different - servers for IP autoconfiguration and NFS. - - Name of the directory on the server to mount as root. - If there is a "%s" token in the string, it will be - replaced by the ASCII-representation of the client's - IP address. - - Standard NFS options. All options are separated by commas. - The following defaults are used: - port = as given by server portmap daemon - rsize = 4096 - wsize = 4096 - timeo = 7 - retrans = 3 - acregmin = 3 - acregmax = 60 - acdirmin = 30 - acdirmax = 60 - flags = hard, nointr, noposix, cto, ac - - -ip=:::::: - - This parameter tells the kernel how to configure IP addresses of devices - and also how to set up the IP routing table. It was originally called - `nfsaddrs', but now the boot-time IP configuration works independently of - NFS, so it was renamed to `ip' and the old name remained as an alias for - compatibility reasons. - - If this parameter is missing from the kernel command line, all fields are - assumed to be empty, and the defaults mentioned below apply. In general - this means that the kernel tries to configure everything using - autoconfiguration. - - The parameter can appear alone as the value to the `ip' - parameter (without all the ':' characters before). If the value is - "ip=off" or "ip=none", no autoconfiguration will take place, otherwise - autoconfiguration will take place. The most common way to use this - is "ip=dhcp". - - IP address of the client. - - Default: Determined using autoconfiguration. - - IP address of the NFS server. If RARP is used to determine - the client address and this parameter is NOT empty only - replies from the specified server are accepted. - - Only required for NFS root. That is autoconfiguration - will not be triggered if it is missing and NFS root is not - in operation. - - Default: Determined using autoconfiguration. - The address of the autoconfiguration server is used. - - IP address of a gateway if the server is on a different subnet. - - Default: Determined using autoconfiguration. - - Netmask for local network interface. If unspecified - the netmask is derived from the client IP address assuming - classful addressing. - - Default: Determined using autoconfiguration. - - Name of the client. May be supplied by autoconfiguration, - but its absence will not trigger autoconfiguration. - - Default: Client IP address is used in ASCII notation. - - Name of network device to use. - - Default: If the host only has one device, it is used. - Otherwise the device is determined using - autoconfiguration. This is done by sending - autoconfiguration requests out of all devices, - and using the device that received the first reply. - - Method to use for autoconfiguration. In the case of options - which specify multiple autoconfiguration protocols, - requests are sent using all protocols, and the first one - to reply is used. - - Only autoconfiguration protocols that have been compiled - into the kernel will be used, regardless of the value of - this option. - - off or none: don't use autoconfiguration - (do static IP assignment instead) - on or any: use any protocol available in the kernel - (default) - dhcp: use DHCP - bootp: use BOOTP - rarp: use RARP - both: use both BOOTP and RARP but not DHCP - (old option kept for backwards compatibility) - - Default: any - - - - -3.) Boot Loader - ---------- - -To get the kernel into memory different approaches can be used. -They depend on various facilities being available: - - -3.1) Booting from a floppy using syslinux - - When building kernels, an easy way to create a boot floppy that uses - syslinux is to use the zdisk or bzdisk make targets which use zimage - and bzimage images respectively. Both targets accept the - FDARGS parameter which can be used to set the kernel command line. - - e.g. - make bzdisk FDARGS="root=/dev/nfs" - - Note that the user running this command will need to have - access to the floppy drive device, /dev/fd0 - - For more information on syslinux, including how to create bootdisks - for prebuilt kernels, see http://syslinux.zytor.com/ - - N.B: Previously it was possible to write a kernel directly to - a floppy using dd, configure the boot device using rdev, and - boot using the resulting floppy. Linux no longer supports this - method of booting. - -3.2) Booting from a cdrom using isolinux - - When building kernels, an easy way to create a bootable cdrom that - uses isolinux is to use the isoimage target which uses a bzimage - image. Like zdisk and bzdisk, this target accepts the FDARGS - parameter which can be used to set the kernel command line. - - e.g. - make isoimage FDARGS="root=/dev/nfs" - - The resulting iso image will be arch//boot/image.iso - This can be written to a cdrom using a variety of tools including - cdrecord. - - e.g. - cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso - - For more information on isolinux, including how to create bootdisks - for prebuilt kernels, see http://syslinux.zytor.com/ - -3.2) Using LILO - When using LILO all the necessary command line parameters may be - specified using the 'append=' directive in the LILO configuration - file. - - However, to use the 'root=' directive you also need to create - a dummy root device, which may be removed after LILO is run. - - mknod /dev/boot255 c 0 255 - - For information on configuring LILO, please refer to its documentation. - -3.3) Using GRUB - When using GRUB, kernel parameter are simply appended after the kernel - specification: kernel - -3.4) Using loadlin - loadlin may be used to boot Linux from a DOS command prompt without - requiring a local hard disk to mount as root. This has not been - thoroughly tested by the authors of this document, but in general - it should be possible configure the kernel command line similarly - to the configuration of LILO. - - Please refer to the loadlin documentation for further information. - -3.5) Using a boot ROM - This is probably the most elegant way of booting a diskless client. - With a boot ROM the kernel is loaded using the TFTP protocol. The - authors of this document are not aware of any no commercial boot - ROMs that support booting Linux over the network. However, there - are two free implementations of a boot ROM, netboot-nfs and - etherboot, both of which are available on sunsite.unc.edu, and both - of which contain everything you need to boot a diskless Linux client. - -3.6) Using pxelinux - Pxelinux may be used to boot linux using the PXE boot loader - which is present on many modern network cards. - - When using pxelinux, the kernel image is specified using - "kernel ". The nfsroot parameters - are passed to the kernel by adding them to the "append" line. - It is common to use serial console in conjunction with pxeliunx, - see Documentation/serial-console.txt for more information. - - For more information on isolinux, including how to create bootdisks - for prebuilt kernels, see http://syslinux.zytor.com/ - - - - -4.) Credits - ------- - - The nfsroot code in the kernel and the RARP support have been written - by Gero Kuhlmann . - - The rest of the IP layer autoconfiguration code has been written - by Martin Mares . - - In order to write the initial version of nfsroot I would like to thank - Jens-Uwe Mager for his help. diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 92b888d..a7e9746 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -140,7 +140,7 @@ Callers of notify_change() need ->i_mutex now. New super_block field "struct export_operations *s_export_op" for explicit support for exporting, e.g. via NFS. The structure is fully documented at its declaration in include/linux/fs.h, and in -Documentation/filesystems/Exporting. +Documentation/filesystems/nfs/Exporting. Briefly it allows for the definition of decode_fh and encode_fh operations to encode and decode filehandles, and allows the filesystem to use diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 9107b38..dab0f04 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1017,7 +1017,7 @@ and is between 256 and 4096 characters. It is defined in the file No delay ip= [IP_PNP] - See Documentation/filesystems/nfsroot.txt. + See Documentation/filesystems/nfs/nfsroot.txt. ip2= [HW] Set IO/IRQ pairs for up to 4 IntelliPort boards See comment before ip2_setup() in @@ -1538,10 +1538,10 @@ and is between 256 and 4096 characters. It is defined in the file going to be removed in 2.6.29. nfsaddrs= [NFS] - See Documentation/filesystems/nfsroot.txt. + See Documentation/filesystems/nfs/nfsroot.txt. nfsroot= [NFS] nfs root filesystem for disk-less boxes. - See Documentation/filesystems/nfsroot.txt. + See Documentation/filesystems/nfs/nfsroot.txt. nfs.callback_tcpport= [NFS] set the TCP port on which the NFSv4 callback -- cgit v1.1 From 3d8e3ad879441ae14c5957b933028daf39d252b0 Mon Sep 17 00:00:00 2001 From: Frans Pop Date: Mon, 26 Oct 2009 08:39:02 +0100 Subject: thermal: add sanity check for the passive attribute Values below 1000 milli-celsius don't make sense and can cause the system to go into a thermal heart attack: the actual temperature will always be lower and thus the system will be throttled down to its lowest setting. An additional problem is that values below 1000 will show as 0 in /proc/acpi/thermal/TZx/trip_points:passive. cat passive 0 echo -n 90 >passive bash: echo: write error: Invalid argument echo -n 90000 >passive cat passive 90000 Signed-off-by: Frans Pop Acked-by: Zhang Rui Signed-off-by: Andrew Morton Signed-off-by: Len Brown --- Documentation/thermal/sysfs-api.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/thermal/sysfs-api.txt b/Documentation/thermal/sysfs-api.txt index a87dc27..cb3d15b 100644 --- a/Documentation/thermal/sysfs-api.txt +++ b/Documentation/thermal/sysfs-api.txt @@ -206,6 +206,7 @@ passive passive trip point for the zone. Activation is done by polling with an interval of 1 second. Unit: millidegrees Celsius + Valid values: 0 (disabled) or greater than 1000 RW, Optional ***************************** -- cgit v1.1 From ea4878a24d7e6a467d369b962bab95bd6a12cbe0 Mon Sep 17 00:00:00 2001 From: "J. Bruce Fields" Date: Fri, 6 Nov 2009 13:59:43 -0500 Subject: nfs: move more to Documentation/filesystems/nfs Oops: I missed two files in the first commit that created this directory. Signed-off-by: J. Bruce Fields --- Documentation/filesystems/00-INDEX | 2 - Documentation/filesystems/knfsd-stats.txt | 159 -------------------- Documentation/filesystems/nfs/00-INDEX | 4 + Documentation/filesystems/nfs/knfsd-stats.txt | 159 ++++++++++++++++++++ Documentation/filesystems/nfs/rpc-cache.txt | 202 ++++++++++++++++++++++++++ Documentation/filesystems/rpc-cache.txt | 202 -------------------------- 6 files changed, 365 insertions(+), 363 deletions(-) delete mode 100644 Documentation/filesystems/knfsd-stats.txt create mode 100644 Documentation/filesystems/nfs/knfsd-stats.txt create mode 100644 Documentation/filesystems/nfs/rpc-cache.txt delete mode 100644 Documentation/filesystems/rpc-cache.txt (limited to 'Documentation') diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index 482151c..658154f 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -84,8 +84,6 @@ relay.txt - info on relay, for efficient streaming from kernel to user space. romfs.txt - description of the ROMFS filesystem. -rpc-cache.txt - - introduction to the caching mechanisms in the sunrpc layer. seq_file.txt - how to use the seq_file API sharedsubtree.txt diff --git a/Documentation/filesystems/knfsd-stats.txt b/Documentation/filesystems/knfsd-stats.txt deleted file mode 100644 index 64ced51..0000000 --- a/Documentation/filesystems/knfsd-stats.txt +++ /dev/null @@ -1,159 +0,0 @@ - -Kernel NFS Server Statistics -============================ - -This document describes the format and semantics of the statistics -which the kernel NFS server makes available to userspace. These -statistics are available in several text form pseudo files, each of -which is described separately below. - -In most cases you don't need to know these formats, as the nfsstat(8) -program from the nfs-utils distribution provides a helpful command-line -interface for extracting and printing them. - -All the files described here are formatted as a sequence of text lines, -separated by newline '\n' characters. Lines beginning with a hash -'#' character are comments intended for humans and should be ignored -by parsing routines. All other lines contain a sequence of fields -separated by whitespace. - -/proc/fs/nfsd/pool_stats ------------------------- - -This file is available in kernels from 2.6.30 onwards, if the -/proc/fs/nfsd filesystem is mounted (it almost always should be). - -The first line is a comment which describes the fields present in -all the other lines. The other lines present the following data as -a sequence of unsigned decimal numeric fields. One line is shown -for each NFS thread pool. - -All counters are 64 bits wide and wrap naturally. There is no way -to zero these counters, instead applications should do their own -rate conversion. - -pool - The id number of the NFS thread pool to which this line applies. - This number does not change. - - Thread pool ids are a contiguous set of small integers starting - at zero. The maximum value depends on the thread pool mode, but - currently cannot be larger than the number of CPUs in the system. - Note that in the default case there will be a single thread pool - which contains all the nfsd threads and all the CPUs in the system, - and thus this file will have a single line with a pool id of "0". - -packets-arrived - Counts how many NFS packets have arrived. More precisely, this - is the number of times that the network stack has notified the - sunrpc server layer that new data may be available on a transport - (e.g. an NFS or UDP socket or an NFS/RDMA endpoint). - - Depending on the NFS workload patterns and various network stack - effects (such as Large Receive Offload) which can combine packets - on the wire, this may be either more or less than the number - of NFS calls received (which statistic is available elsewhere). - However this is a more accurate and less workload-dependent measure - of how much CPU load is being placed on the sunrpc server layer - due to NFS network traffic. - -sockets-enqueued - Counts how many times an NFS transport is enqueued to wait for - an nfsd thread to service it, i.e. no nfsd thread was considered - available. - - The circumstance this statistic tracks indicates that there was NFS - network-facing work to be done but it couldn't be done immediately, - thus introducing a small delay in servicing NFS calls. The ideal - rate of change for this counter is zero; significantly non-zero - values may indicate a performance limitation. - - This can happen either because there are too few nfsd threads in the - thread pool for the NFS workload (the workload is thread-limited), - or because the NFS workload needs more CPU time than is available in - the thread pool (the workload is CPU-limited). In the former case, - configuring more nfsd threads will probably improve the performance - of the NFS workload. In the latter case, the sunrpc server layer is - already choosing not to wake idle nfsd threads because there are too - many nfsd threads which want to run but cannot, so configuring more - nfsd threads will make no difference whatsoever. The overloads-avoided - statistic (see below) can be used to distinguish these cases. - -threads-woken - Counts how many times an idle nfsd thread is woken to try to - receive some data from an NFS transport. - - This statistic tracks the circumstance where incoming - network-facing NFS work is being handled quickly, which is a good - thing. The ideal rate of change for this counter will be close - to but less than the rate of change of the packets-arrived counter. - -overloads-avoided - Counts how many times the sunrpc server layer chose not to wake an - nfsd thread, despite the presence of idle nfsd threads, because - too many nfsd threads had been recently woken but could not get - enough CPU time to actually run. - - This statistic counts a circumstance where the sunrpc layer - heuristically avoids overloading the CPU scheduler with too many - runnable nfsd threads. The ideal rate of change for this counter - is zero. Significant non-zero values indicate that the workload - is CPU limited. Usually this is associated with heavy CPU usage - on all the CPUs in the nfsd thread pool. - - If a sustained large overloads-avoided rate is detected on a pool, - the top(1) utility should be used to check for the following - pattern of CPU usage on all the CPUs associated with the given - nfsd thread pool. - - - %us ~= 0 (as you're *NOT* running applications on your NFS server) - - - %wa ~= 0 - - - %id ~= 0 - - - %sy + %hi + %si ~= 100 - - If this pattern is seen, configuring more nfsd threads will *not* - improve the performance of the workload. If this patten is not - seen, then something more subtle is wrong. - -threads-timedout - Counts how many times an nfsd thread triggered an idle timeout, - i.e. was not woken to handle any incoming network packets for - some time. - - This statistic counts a circumstance where there are more nfsd - threads configured than can be used by the NFS workload. This is - a clue that the number of nfsd threads can be reduced without - affecting performance. Unfortunately, it's only a clue and not - a strong indication, for a couple of reasons: - - - Currently the rate at which the counter is incremented is quite - slow; the idle timeout is 60 minutes. Unless the NFS workload - remains constant for hours at a time, this counter is unlikely - to be providing information that is still useful. - - - It is usually a wise policy to provide some slack, - i.e. configure a few more nfsds than are currently needed, - to allow for future spikes in load. - - -Note that incoming packets on NFS transports will be dealt with in -one of three ways. An nfsd thread can be woken (threads-woken counts -this case), or the transport can be enqueued for later attention -(sockets-enqueued counts this case), or the packet can be temporarily -deferred because the transport is currently being used by an nfsd -thread. This last case is not very interesting and is not explicitly -counted, but can be inferred from the other counters thus: - -packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) - - -More ----- -Descriptions of the other statistics file should go here. - - -Greg Banks -26 Mar 2009 diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX index 6ff3d21..2f68cd6 100644 --- a/Documentation/filesystems/nfs/00-INDEX +++ b/Documentation/filesystems/nfs/00-INDEX @@ -2,6 +2,8 @@ - this file (nfs-related documentation). Exporting - explanation of how to make filesystems exportable. +knfsd-stats.txt + - statistics which the NFS server makes available to user space. nfs.txt - nfs client, and DNS resolution for fs_locations. nfs41-server.txt @@ -10,3 +12,5 @@ nfs-rdma.txt - how to install and setup the Linux NFS/RDMA client and server software nfsroot.txt - short guide on setting up a diskless box with NFS root filesystem. +rpc-cache.txt + - introduction to the caching mechanisms in the sunrpc layer. diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.txt new file mode 100644 index 0000000..64ced51 --- /dev/null +++ b/Documentation/filesystems/nfs/knfsd-stats.txt @@ -0,0 +1,159 @@ + +Kernel NFS Server Statistics +============================ + +This document describes the format and semantics of the statistics +which the kernel NFS server makes available to userspace. These +statistics are available in several text form pseudo files, each of +which is described separately below. + +In most cases you don't need to know these formats, as the nfsstat(8) +program from the nfs-utils distribution provides a helpful command-line +interface for extracting and printing them. + +All the files described here are formatted as a sequence of text lines, +separated by newline '\n' characters. Lines beginning with a hash +'#' character are comments intended for humans and should be ignored +by parsing routines. All other lines contain a sequence of fields +separated by whitespace. + +/proc/fs/nfsd/pool_stats +------------------------ + +This file is available in kernels from 2.6.30 onwards, if the +/proc/fs/nfsd filesystem is mounted (it almost always should be). + +The first line is a comment which describes the fields present in +all the other lines. The other lines present the following data as +a sequence of unsigned decimal numeric fields. One line is shown +for each NFS thread pool. + +All counters are 64 bits wide and wrap naturally. There is no way +to zero these counters, instead applications should do their own +rate conversion. + +pool + The id number of the NFS thread pool to which this line applies. + This number does not change. + + Thread pool ids are a contiguous set of small integers starting + at zero. The maximum value depends on the thread pool mode, but + currently cannot be larger than the number of CPUs in the system. + Note that in the default case there will be a single thread pool + which contains all the nfsd threads and all the CPUs in the system, + and thus this file will have a single line with a pool id of "0". + +packets-arrived + Counts how many NFS packets have arrived. More precisely, this + is the number of times that the network stack has notified the + sunrpc server layer that new data may be available on a transport + (e.g. an NFS or UDP socket or an NFS/RDMA endpoint). + + Depending on the NFS workload patterns and various network stack + effects (such as Large Receive Offload) which can combine packets + on the wire, this may be either more or less than the number + of NFS calls received (which statistic is available elsewhere). + However this is a more accurate and less workload-dependent measure + of how much CPU load is being placed on the sunrpc server layer + due to NFS network traffic. + +sockets-enqueued + Counts how many times an NFS transport is enqueued to wait for + an nfsd thread to service it, i.e. no nfsd thread was considered + available. + + The circumstance this statistic tracks indicates that there was NFS + network-facing work to be done but it couldn't be done immediately, + thus introducing a small delay in servicing NFS calls. The ideal + rate of change for this counter is zero; significantly non-zero + values may indicate a performance limitation. + + This can happen either because there are too few nfsd threads in the + thread pool for the NFS workload (the workload is thread-limited), + or because the NFS workload needs more CPU time than is available in + the thread pool (the workload is CPU-limited). In the former case, + configuring more nfsd threads will probably improve the performance + of the NFS workload. In the latter case, the sunrpc server layer is + already choosing not to wake idle nfsd threads because there are too + many nfsd threads which want to run but cannot, so configuring more + nfsd threads will make no difference whatsoever. The overloads-avoided + statistic (see below) can be used to distinguish these cases. + +threads-woken + Counts how many times an idle nfsd thread is woken to try to + receive some data from an NFS transport. + + This statistic tracks the circumstance where incoming + network-facing NFS work is being handled quickly, which is a good + thing. The ideal rate of change for this counter will be close + to but less than the rate of change of the packets-arrived counter. + +overloads-avoided + Counts how many times the sunrpc server layer chose not to wake an + nfsd thread, despite the presence of idle nfsd threads, because + too many nfsd threads had been recently woken but could not get + enough CPU time to actually run. + + This statistic counts a circumstance where the sunrpc layer + heuristically avoids overloading the CPU scheduler with too many + runnable nfsd threads. The ideal rate of change for this counter + is zero. Significant non-zero values indicate that the workload + is CPU limited. Usually this is associated with heavy CPU usage + on all the CPUs in the nfsd thread pool. + + If a sustained large overloads-avoided rate is detected on a pool, + the top(1) utility should be used to check for the following + pattern of CPU usage on all the CPUs associated with the given + nfsd thread pool. + + - %us ~= 0 (as you're *NOT* running applications on your NFS server) + + - %wa ~= 0 + + - %id ~= 0 + + - %sy + %hi + %si ~= 100 + + If this pattern is seen, configuring more nfsd threads will *not* + improve the performance of the workload. If this patten is not + seen, then something more subtle is wrong. + +threads-timedout + Counts how many times an nfsd thread triggered an idle timeout, + i.e. was not woken to handle any incoming network packets for + some time. + + This statistic counts a circumstance where there are more nfsd + threads configured than can be used by the NFS workload. This is + a clue that the number of nfsd threads can be reduced without + affecting performance. Unfortunately, it's only a clue and not + a strong indication, for a couple of reasons: + + - Currently the rate at which the counter is incremented is quite + slow; the idle timeout is 60 minutes. Unless the NFS workload + remains constant for hours at a time, this counter is unlikely + to be providing information that is still useful. + + - It is usually a wise policy to provide some slack, + i.e. configure a few more nfsds than are currently needed, + to allow for future spikes in load. + + +Note that incoming packets on NFS transports will be dealt with in +one of three ways. An nfsd thread can be woken (threads-woken counts +this case), or the transport can be enqueued for later attention +(sockets-enqueued counts this case), or the packet can be temporarily +deferred because the transport is currently being used by an nfsd +thread. This last case is not very interesting and is not explicitly +counted, but can be inferred from the other counters thus: + +packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) + + +More +---- +Descriptions of the other statistics file should go here. + + +Greg Banks +26 Mar 2009 diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt new file mode 100644 index 0000000..8a382be --- /dev/null +++ b/Documentation/filesystems/nfs/rpc-cache.txt @@ -0,0 +1,202 @@ + This document gives a brief introduction to the caching +mechanisms in the sunrpc layer that is used, in particular, +for NFS authentication. + +CACHES +====== +The caching replaces the old exports table and allows for +a wide variety of values to be caches. + +There are a number of caches that are similar in structure though +quite possibly very different in content and use. There is a corpus +of common code for managing these caches. + +Examples of caches that are likely to be needed are: + - mapping from IP address to client name + - mapping from client name and filesystem to export options + - mapping from UID to list of GIDs, to work around NFS's limitation + of 16 gids. + - mappings between local UID/GID and remote UID/GID for sites that + do not have uniform uid assignment + - mapping from network identify to public key for crypto authentication. + +The common code handles such things as: + - general cache lookup with correct locking + - supporting 'NEGATIVE' as well as positive entries + - allowing an EXPIRED time on cache items, and removing + items after they expire, and are no longer in-use. + - making requests to user-space to fill in cache entries + - allowing user-space to directly set entries in the cache + - delaying RPC requests that depend on as-yet incomplete + cache entries, and replaying those requests when the cache entry + is complete. + - clean out old entries as they expire. + +Creating a Cache +---------------- + +1/ A cache needs a datum to store. This is in the form of a + structure definition that must contain a + struct cache_head + as an element, usually the first. + It will also contain a key and some content. + Each cache element is reference counted and contains + expiry and update times for use in cache management. +2/ A cache needs a "cache_detail" structure that + describes the cache. This stores the hash table, some + parameters for cache management, and some operations detailing how + to work with particular cache items. + The operations requires are: + struct cache_head *alloc(void) + This simply allocates appropriate memory and returns + a pointer to the cache_detail embedded within the + structure + void cache_put(struct kref *) + This is called when the last reference to an item is + dropped. The pointer passed is to the 'ref' field + in the cache_head. cache_put should release any + references create by 'cache_init' and, if CACHE_VALID + is set, any references created by cache_update. + It should then release the memory allocated by + 'alloc'. + int match(struct cache_head *orig, struct cache_head *new) + test if the keys in the two structures match. Return + 1 if they do, 0 if they don't. + void init(struct cache_head *orig, struct cache_head *new) + Set the 'key' fields in 'new' from 'orig'. This may + include taking references to shared objects. + void update(struct cache_head *orig, struct cache_head *new) + Set the 'content' fileds in 'new' from 'orig'. + int cache_show(struct seq_file *m, struct cache_detail *cd, + struct cache_head *h) + Optional. Used to provide a /proc file that lists the + contents of a cache. This should show one item, + usually on just one line. + int cache_request(struct cache_detail *cd, struct cache_head *h, + char **bpp, int *blen) + Format a request to be send to user-space for an item + to be instantiated. *bpp is a buffer of size *blen. + bpp should be moved forward over the encoded message, + and *blen should be reduced to show how much free + space remains. Return 0 on success or <0 if not + enough room or other problem. + int cache_parse(struct cache_detail *cd, char *buf, int len) + A message from user space has arrived to fill out a + cache entry. It is in 'buf' of length 'len'. + cache_parse should parse this, find the item in the + cache with sunrpc_cache_lookup, and update the item + with sunrpc_cache_update. + + +3/ A cache needs to be registered using cache_register(). This + includes it on a list of caches that will be regularly + cleaned to discard old data. + +Using a cache +------------- + +To find a value in a cache, call sunrpc_cache_lookup passing a pointer +to the cache_head in a sample item with the 'key' fields filled in. +This will be passed to ->match to identify the target entry. If no +entry is found, a new entry will be create, added to the cache, and +marked as not containing valid data. + +The item returned is typically passed to cache_check which will check +if the data is valid, and may initiate an up-call to get fresh data. +cache_check will return -ENOENT in the entry is negative or if an up +call is needed but not possible, -EAGAIN if an upcall is pending, +or 0 if the data is valid; + +cache_check can be passed a "struct cache_req *". This structure is +typically embedded in the actual request and can be used to create a +deferred copy of the request (struct cache_deferred_req). This is +done when the found cache item is not uptodate, but the is reason to +believe that userspace might provide information soon. When the cache +item does become valid, the deferred copy of the request will be +revisited (->revisit). It is expected that this method will +reschedule the request for processing. + +The value returned by sunrpc_cache_lookup can also be passed to +sunrpc_cache_update to set the content for the item. A second item is +passed which should hold the content. If the item found by _lookup +has valid data, then it is discarded and a new item is created. This +saves any user of an item from worrying about content changing while +it is being inspected. If the item found by _lookup does not contain +valid data, then the content is copied across and CACHE_VALID is set. + +Populating a cache +------------------ + +Each cache has a name, and when the cache is registered, a directory +with that name is created in /proc/net/rpc + +This directory contains a file called 'channel' which is a channel +for communicating between kernel and user for populating the cache. +This directory may later contain other files of interacting +with the cache. + +The 'channel' works a bit like a datagram socket. Each 'write' is +passed as a whole to the cache for parsing and interpretation. +Each cache can treat the write requests differently, but it is +expected that a message written will contain: + - a key + - an expiry time + - a content. +with the intention that an item in the cache with the give key +should be create or updated to have the given content, and the +expiry time should be set on that item. + +Reading from a channel is a bit more interesting. When a cache +lookup fails, or when it succeeds but finds an entry that may soon +expire, a request is lodged for that cache item to be updated by +user-space. These requests appear in the channel file. + +Successive reads will return successive requests. +If there are no more requests to return, read will return EOF, but a +select or poll for read will block waiting for another request to be +added. + +Thus a user-space helper is likely to: + open the channel. + select for readable + read a request + write a response + loop. + +If it dies and needs to be restarted, any requests that have not been +answered will still appear in the file and will be read by the new +instance of the helper. + +Each cache should define a "cache_parse" method which takes a message +written from user-space and processes it. It should return an error +(which propagates back to the write syscall) or 0. + +Each cache should also define a "cache_request" method which +takes a cache item and encodes a request into the buffer +provided. + +Note: If a cache has no active readers on the channel, and has had not +active readers for more than 60 seconds, further requests will not be +added to the channel but instead all lookups that do not find a valid +entry will fail. This is partly for backward compatibility: The +previous nfs exports table was deemed to be authoritative and a +failed lookup meant a definite 'no'. + +request/response format +----------------------- + +While each cache is free to use it's own format for requests +and responses over channel, the following is recommended as +appropriate and support routines are available to help: +Each request or response record should be printable ASCII +with precisely one newline character which should be at the end. +Fields within the record should be separated by spaces, normally one. +If spaces, newlines, or nul characters are needed in a field they +much be quoted. two mechanisms are available: +1/ If a field begins '\x' then it must contain an even number of + hex digits, and pairs of these digits provide the bytes in the + field. +2/ otherwise a \ in the field must be followed by 3 octal digits + which give the code for a byte. Other characters are treated + as them selves. At the very least, space, newline, nul, and + '\' must be quoted in this way. diff --git a/Documentation/filesystems/rpc-cache.txt b/Documentation/filesystems/rpc-cache.txt deleted file mode 100644 index 8a382be..0000000 --- a/Documentation/filesystems/rpc-cache.txt +++ /dev/null @@ -1,202 +0,0 @@ - This document gives a brief introduction to the caching -mechanisms in the sunrpc layer that is used, in particular, -for NFS authentication. - -CACHES -====== -The caching replaces the old exports table and allows for -a wide variety of values to be caches. - -There are a number of caches that are similar in structure though -quite possibly very different in content and use. There is a corpus -of common code for managing these caches. - -Examples of caches that are likely to be needed are: - - mapping from IP address to client name - - mapping from client name and filesystem to export options - - mapping from UID to list of GIDs, to work around NFS's limitation - of 16 gids. - - mappings between local UID/GID and remote UID/GID for sites that - do not have uniform uid assignment - - mapping from network identify to public key for crypto authentication. - -The common code handles such things as: - - general cache lookup with correct locking - - supporting 'NEGATIVE' as well as positive entries - - allowing an EXPIRED time on cache items, and removing - items after they expire, and are no longer in-use. - - making requests to user-space to fill in cache entries - - allowing user-space to directly set entries in the cache - - delaying RPC requests that depend on as-yet incomplete - cache entries, and replaying those requests when the cache entry - is complete. - - clean out old entries as they expire. - -Creating a Cache ----------------- - -1/ A cache needs a datum to store. This is in the form of a - structure definition that must contain a - struct cache_head - as an element, usually the first. - It will also contain a key and some content. - Each cache element is reference counted and contains - expiry and update times for use in cache management. -2/ A cache needs a "cache_detail" structure that - describes the cache. This stores the hash table, some - parameters for cache management, and some operations detailing how - to work with particular cache items. - The operations requires are: - struct cache_head *alloc(void) - This simply allocates appropriate memory and returns - a pointer to the cache_detail embedded within the - structure - void cache_put(struct kref *) - This is called when the last reference to an item is - dropped. The pointer passed is to the 'ref' field - in the cache_head. cache_put should release any - references create by 'cache_init' and, if CACHE_VALID - is set, any references created by cache_update. - It should then release the memory allocated by - 'alloc'. - int match(struct cache_head *orig, struct cache_head *new) - test if the keys in the two structures match. Return - 1 if they do, 0 if they don't. - void init(struct cache_head *orig, struct cache_head *new) - Set the 'key' fields in 'new' from 'orig'. This may - include taking references to shared objects. - void update(struct cache_head *orig, struct cache_head *new) - Set the 'content' fileds in 'new' from 'orig'. - int cache_show(struct seq_file *m, struct cache_detail *cd, - struct cache_head *h) - Optional. Used to provide a /proc file that lists the - contents of a cache. This should show one item, - usually on just one line. - int cache_request(struct cache_detail *cd, struct cache_head *h, - char **bpp, int *blen) - Format a request to be send to user-space for an item - to be instantiated. *bpp is a buffer of size *blen. - bpp should be moved forward over the encoded message, - and *blen should be reduced to show how much free - space remains. Return 0 on success or <0 if not - enough room or other problem. - int cache_parse(struct cache_detail *cd, char *buf, int len) - A message from user space has arrived to fill out a - cache entry. It is in 'buf' of length 'len'. - cache_parse should parse this, find the item in the - cache with sunrpc_cache_lookup, and update the item - with sunrpc_cache_update. - - -3/ A cache needs to be registered using cache_register(). This - includes it on a list of caches that will be regularly - cleaned to discard old data. - -Using a cache -------------- - -To find a value in a cache, call sunrpc_cache_lookup passing a pointer -to the cache_head in a sample item with the 'key' fields filled in. -This will be passed to ->match to identify the target entry. If no -entry is found, a new entry will be create, added to the cache, and -marked as not containing valid data. - -The item returned is typically passed to cache_check which will check -if the data is valid, and may initiate an up-call to get fresh data. -cache_check will return -ENOENT in the entry is negative or if an up -call is needed but not possible, -EAGAIN if an upcall is pending, -or 0 if the data is valid; - -cache_check can be passed a "struct cache_req *". This structure is -typically embedded in the actual request and can be used to create a -deferred copy of the request (struct cache_deferred_req). This is -done when the found cache item is not uptodate, but the is reason to -believe that userspace might provide information soon. When the cache -item does become valid, the deferred copy of the request will be -revisited (->revisit). It is expected that this method will -reschedule the request for processing. - -The value returned by sunrpc_cache_lookup can also be passed to -sunrpc_cache_update to set the content for the item. A second item is -passed which should hold the content. If the item found by _lookup -has valid data, then it is discarded and a new item is created. This -saves any user of an item from worrying about content changing while -it is being inspected. If the item found by _lookup does not contain -valid data, then the content is copied across and CACHE_VALID is set. - -Populating a cache ------------------- - -Each cache has a name, and when the cache is registered, a directory -with that name is created in /proc/net/rpc - -This directory contains a file called 'channel' which is a channel -for communicating between kernel and user for populating the cache. -This directory may later contain other files of interacting -with the cache. - -The 'channel' works a bit like a datagram socket. Each 'write' is -passed as a whole to the cache for parsing and interpretation. -Each cache can treat the write requests differently, but it is -expected that a message written will contain: - - a key - - an expiry time - - a content. -with the intention that an item in the cache with the give key -should be create or updated to have the given content, and the -expiry time should be set on that item. - -Reading from a channel is a bit more interesting. When a cache -lookup fails, or when it succeeds but finds an entry that may soon -expire, a request is lodged for that cache item to be updated by -user-space. These requests appear in the channel file. - -Successive reads will return successive requests. -If there are no more requests to return, read will return EOF, but a -select or poll for read will block waiting for another request to be -added. - -Thus a user-space helper is likely to: - open the channel. - select for readable - read a request - write a response - loop. - -If it dies and needs to be restarted, any requests that have not been -answered will still appear in the file and will be read by the new -instance of the helper. - -Each cache should define a "cache_parse" method which takes a message -written from user-space and processes it. It should return an error -(which propagates back to the write syscall) or 0. - -Each cache should also define a "cache_request" method which -takes a cache item and encodes a request into the buffer -provided. - -Note: If a cache has no active readers on the channel, and has had not -active readers for more than 60 seconds, further requests will not be -added to the channel but instead all lookups that do not find a valid -entry will fail. This is partly for backward compatibility: The -previous nfs exports table was deemed to be authoritative and a -failed lookup meant a definite 'no'. - -request/response format ------------------------ - -While each cache is free to use it's own format for requests -and responses over channel, the following is recommended as -appropriate and support routines are available to help: -Each request or response record should be printable ASCII -with precisely one newline character which should be at the end. -Fields within the record should be separated by spaces, normally one. -If spaces, newlines, or nul characters are needed in a field they -much be quoted. two mechanisms are available: -1/ If a field begins '\x' then it must contain an even number of - hex digits, and pairs of these digits provide the bytes in the - field. -2/ otherwise a \ in the field must be followed by 3 octal digits - which give the code for a byte. Other characters are treated - as them selves. At the very least, space, newline, nul, and - '\' must be quoted in this way. -- cgit v1.1 From 347a26860e2293b1347996876d3550499c7bb31f Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Wed, 9 Dec 2009 01:36:24 +0000 Subject: thinkpad-acpi: issue backlight class events Take advantage of the new events capabilities of the backlight class to notify userspace of backlight changes. This depends on "backlight: Allow drivers to update the core, and generate events on changes", by Matthew Garrett. Signed-off-by: Henrique de Moraes Holschuh Cc: Matthew Garrett Cc: Richard Purdie Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 51 ++++++++------------------------- 1 file changed, 12 insertions(+), 39 deletions(-) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index aafcaa6..f5056c7 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -460,6 +460,8 @@ event code Key Notes For Lenovo ThinkPads with a new BIOS, it has to be handled either by the ACPI OSI, or by userspace. + The driver does the right thing, + never mess with this. 0x1011 0x10 FN+END Brightness down. See brightness up for details. @@ -582,46 +584,15 @@ with hotkey_report_mode. Brightness hotkey notes: -These are the current sane choices for brightness key mapping in -thinkpad-acpi: +Don't mess with the brightness hotkeys in a Thinkpad. If you want +notifications for OSD, use the sysfs backlight class event support. -For IBM and Lenovo models *without* ACPI backlight control (the ones on -which thinkpad-acpi will autoload its backlight interface by default, -and on which ACPI video does not export a backlight interface): - -1. Don't enable or map the brightness hotkeys in thinkpad-acpi, as - these older firmware versions unfortunately won't respect the hotkey - mask for brightness keys anyway, and always reacts to them. This - usually work fine, unless X.org drivers are doing something to block - the BIOS. In that case, use (3) below. This is the default mode of - operation. - -2. Enable the hotkeys, but map them to something else that is NOT - KEY_BRIGHTNESS_UP/DOWN or any other keycode that would cause - userspace to try to change the backlight level, and use that as an - on-screen-display hint. - -3. IF AND ONLY IF X.org drivers find a way to block the firmware from - automatically changing the brightness, enable the hotkeys and map - them to KEY_BRIGHTNESS_UP and KEY_BRIGHTNESS_DOWN, and feed that to - something that calls xbacklight. thinkpad-acpi will not be able to - change brightness in that case either, so you should disable its - backlight interface. - -For Lenovo models *with* ACPI backlight control: - -1. Load up ACPI video and use that. ACPI video will report ACPI - events for brightness change keys. Do not mess with thinkpad-acpi - defaults in this case. thinkpad-acpi should not have anything to do - with backlight events in a scenario where ACPI video is loaded: - brightness hotkeys must be disabled, and the backlight interface is - to be kept disabled as well. This is the default mode of operation. - -2. Do *NOT* load up ACPI video, enable the hotkeys in thinkpad-acpi, - and map them to KEY_BRIGHTNESS_UP and KEY_BRIGHTNESS_DOWN. Process - these keys on userspace somehow (e.g. by calling xbacklight). - The driver will do this automatically if it detects that ACPI video - has been disabled. +The driver will issue KEY_BRIGHTNESS_UP and KEY_BRIGHTNESS_DOWN events +automatically for the cases were userspace has to do something to +implement brightness changes. When you override these events, you will +either fail to handle properly the ThinkPads that require explicit +action to change backlight brightness, or the ThinkPads that require +that no action be taken to work properly. Bluetooth @@ -1465,3 +1436,5 @@ Sysfs interface changelog: and it is always able to disable hot keys. Very old thinkpads are properly supported. hotkey_bios_mask is deprecated and marked for removal. + +0x020600: Marker for backlight change event support. -- cgit v1.1 From 196f0082d8e77f647edd21ea27b5a9bba8116396 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Sun, 6 Dec 2009 12:32:47 +0100 Subject: powerpc/fsl: try to explain why the interrupt numbers are off by 16 Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Kumar Gala --- Documentation/powerpc/dts-bindings/fsl/mpic.txt | 42 +++++++++++++++++++++++++ 1 file changed, 42 insertions(+) create mode 100644 Documentation/powerpc/dts-bindings/fsl/mpic.txt (limited to 'Documentation') diff --git a/Documentation/powerpc/dts-bindings/fsl/mpic.txt b/Documentation/powerpc/dts-bindings/fsl/mpic.txt new file mode 100644 index 0000000..71e39cf --- /dev/null +++ b/Documentation/powerpc/dts-bindings/fsl/mpic.txt @@ -0,0 +1,42 @@ +* OpenPIC and its interrupt numbers on Freescale's e500/e600 cores + +The OpenPIC specification does not specify which interrupt source has to +become which interrupt number. This is up to the software implementation +of the interrupt controller. The only requirement is that every +interrupt source has to have an unique interrupt number / vector number. +To accomplish this the current implementation assigns the number zero to +the first source, the number one to the second source and so on until +all interrupt sources have their unique number. +Usually the assigned vector number equals the interrupt number mentioned +in the documentation for a given core / CPU. This is however not true +for the e500 cores (MPC85XX CPUs) where the documentation distinguishes +between internal and external interrupt sources and starts counting at +zero for both of them. + +So what to write for external interrupt source X or internal interrupt +source Y into the device tree? Here is an example: + +The memory map for the interrupt controller in the MPC8544[0] shows, +that the first interrupt source starts at 0x5_0000 (PIC Register Address +Map-Interrupt Source Configuration Registers). This source becomes the +number zero therefore: + External interrupt 0 = interrupt number 0 + External interrupt 1 = interrupt number 1 + External interrupt 2 = interrupt number 2 + ... +Every interrupt number allocates 0x20 bytes register space. So to get +its number it is sufficient to shift the lower 16bits to right by five. +So for the external interrupt 10 we have: + 0x0140 >> 5 = 10 + +After the external sources, the internal sources follow. The in core I2C +controller on the MPC8544 for instance has the internal source number +27. Oo obtain its interrupt number we take the lower 16bits of its memory +address (0x5_0560) and shift it right: + 0x0560 >> 5 = 43 + +Therefore the I2C device node for the MPC8544 CPU has to have the +interrupt number 43 specified in the device tree. + +[0] MPC8544E PowerQUICCTM III, Integrated Host Processor Family Reference Manual + MPC8544ERM Rev. 1 10/2007 -- cgit v1.1 From 728900f6fa7142e07a67d10d862bcb774d7a3493 Mon Sep 17 00:00:00 2001 From: Corentin Chary Date: Thu, 3 Dec 2009 07:45:17 +0000 Subject: asus-laptop: schedule display_get and lcd_switch for removal Signed-off-by: Corentin Chary Signed-off-by: Len Brown --- Documentation/feature-removal-schedule.txt | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) (limited to 'Documentation') diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index bc693ff..d1c9e17 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -489,3 +489,23 @@ Why: With the recent innovations in CPU hardware acceleration technologies Who: Alok N Kataria ---------------------------- + +What: Support for lcd_switch and display_get in asus-laptop driver +When: March 2010 +Why: These two features use non-standard interfaces. There are the + only features that really need multiple path to guess what's + the right method name on a specific laptop. + + Removing them will allow to remove a lot of code an significantly + clean the drivers. + + This will affect the backlight code which won't be able to know + if the backlight is on or off. The platform display file will also be + write only (like the one in eeepc-laptop). + + This should'nt affect a lot of user because they usually know + when their display is on or off. + +Who: Corentin Chary + +---------------------------- -- cgit v1.1 From f7111821e51a58ee0f548f256743121ce1b365ae Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Wed, 9 Dec 2009 14:21:36 -0800 Subject: IB: Fix typo in ipoib.txt Delete extra words in "is to takes advantage of". Signed-off-by: Bart Van Assche Signed-off-by: Roland Dreier --- Documentation/infiniband/ipoib.txt | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt index 6d40f00..64eeb55 100644 --- a/Documentation/infiniband/ipoib.txt +++ b/Documentation/infiniband/ipoib.txt @@ -36,11 +36,11 @@ Datagram vs Connected modes fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes. In connected mode, the IB RC (Reliable Connected) transport is used. - Connected mode is to takes advantage of the connected nature of the - IB transport and allows an MTU up to the maximal IP packet size of - 64K, which reduces the number of IP packets needed for handling - large UDP datagrams, TCP segments, etc and increases the performance - for large messages. + Connected mode takes advantage of the connected nature of the IB + transport and allows an MTU up to the maximal IP packet size of 64K, + which reduces the number of IP packets needed for handling large UDP + datagrams, TCP segments, etc and increases the performance for large + messages. In connected mode, the interface's UD QP is still used for multicast and communication with peers that don't support connected mode. In -- cgit v1.1 From d698aa4500aa3ca9559142060caf0f79da998744 Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Thu, 10 Dec 2009 23:52:30 +0000 Subject: dm snapshot: add merge target The snapshot-merge target allows a snapshot to be merged back into the snapshot's origin device. One anticipated use of snapshot merging is the rollback of filesystems to back out problematic system upgrades. This patch adds snapshot-merge target management to both dm_snapshot_init() and dm_snapshot_exit(). As an initial place-holder, snapshot-merge is identical to the snapshot target. Documentation is provided. Signed-off-by: Mikulas Patocka Signed-off-by: Mike Snitzer Signed-off-by: Alasdair G Kergon --- Documentation/device-mapper/snapshot.txt | 60 +++++++++++++++++++++++++++++--- 1 file changed, 55 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/device-mapper/snapshot.txt b/Documentation/device-mapper/snapshot.txt index a5009c8..e3a77b2 100644 --- a/Documentation/device-mapper/snapshot.txt +++ b/Documentation/device-mapper/snapshot.txt @@ -8,13 +8,19 @@ the block device which are also writable without interfering with the original content; *) To create device "forks", i.e. multiple different versions of the same data stream. +*) To merge a snapshot of a block device back into the snapshot's origin +device. +In the first two cases, dm copies only the chunks of data that get +changed and uses a separate copy-on-write (COW) block device for +storage. -In both cases, dm copies only the chunks of data that get changed and -uses a separate copy-on-write (COW) block device for storage. +For snapshot merge the contents of the COW storage are merged back into +the origin device. -There are two dm targets available: snapshot and snapshot-origin. +There are three dm targets available: +snapshot, snapshot-origin, and snapshot-merge. *) snapshot-origin @@ -40,8 +46,25 @@ The difference is that for transient snapshots less metadata must be saved on disk - they can be kept in memory by the kernel. -How this is used by LVM2 -======================== +* snapshot-merge + +takes the same table arguments as the snapshot target except it only +works with persistent snapshots. This target assumes the role of the +"snapshot-origin" target and must not be loaded if the "snapshot-origin" +is still present for . + +Creates a merging snapshot that takes control of the changed chunks +stored in the of an existing snapshot, through a handover +procedure, and merges these chunks back into the . Once merging +has started (in the background) the may be opened and the merge +will continue while I/O is flowing to it. Changes to the are +deferred until the merging snapshot's corresponding chunk(s) have been +merged. Once merging has started the snapshot device, associated with +the "snapshot" target, will return -EIO when accessed. + + +How snapshot is used by LVM2 +============================ When you create the first LVM2 snapshot of a volume, four dm devices are used: 1) a device containing the original mapping table of the source volume; @@ -72,3 +95,30 @@ brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base + +How snapshot-merge is used by LVM2 +================================== +A merging snapshot assumes the role of the "snapshot-origin" while +merging. As such the "snapshot-origin" is replaced with +"snapshot-merge". The "-real" device is not changed and the "-cow" +device is renamed to -cow to aid LVM2's cleanup of the +merging snapshot after it completes. The "snapshot" that hands over its +COW device to the "snapshot-merge" is deactivated (unless using lvchange +--refresh); but if it is left active it will simply return I/O errors. + +A snapshot will merge into its origin with the following command: + +lvconvert --merge volumeGroup/snap + +we'll now have this situation: + +# dmsetup table|grep volumeGroup + +volumeGroup-base-real: 0 2097152 linear 8:19 384 +volumeGroup-base-cow: 0 204800 linear 8:19 2097536 +volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16 + +# ls -lL /dev/mapper/volumeGroup-* +brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real +brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow +brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base -- cgit v1.1 From a1a541d86f50a9957beeedb122a035870d602647 Mon Sep 17 00:00:00 2001 From: Zhang Rui Date: Wed, 2 Dec 2009 13:31:00 +0800 Subject: ACPI: support customizing ACPI control methods at runtime Introduce a new debugfs I/F (/sys/kernel/debug/acpi/custom_method) for ACPI, which can be used to customize the ACPI control methods at runtime. We can use this to debug the AML code level bugs instead of overriding the whole DSDT table, without rebuilding/rebooting kernel any more. Detailed description about how to use this debugfs I/F is stated in Documentation/acpi/method-customizing.txt Signed-off-by: Zhang Rui Signed-off-by: Len Brown --- Documentation/acpi/method-customizing.txt | 66 +++++++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) create mode 100644 Documentation/acpi/method-customizing.txt (limited to 'Documentation') diff --git a/Documentation/acpi/method-customizing.txt b/Documentation/acpi/method-customizing.txt new file mode 100644 index 0000000..e628cd2 --- /dev/null +++ b/Documentation/acpi/method-customizing.txt @@ -0,0 +1,66 @@ +Linux ACPI Custom Control Method How To +======================================= + +Written by Zhang Rui + + +Linux supports customizing ACPI control methods at runtime. + +Users can use this to +1. override an existing method which may not work correctly, + or just for debugging purposes. +2. insert a completely new method in order to create a missing + method such as _OFF, _ON, _STA, _INI, etc. +For these cases, it is far simpler to dynamically install a single +control method rather than override the entire DSDT, because kernel +rebuild/reboot is not needed and test result can be got in minutes. + +Note: Only ACPI METHOD can be overridden, any other object types like + "Device", "OperationRegion", are not recognized. +Note: The same ACPI control method can be overridden for many times, + and it's always the latest one that used by Linux/kernel. + +1. override an existing method + a) get the ACPI table via ACPI sysfs I/F. e.g. to get the DSDT, + just run "cat /sys/firmware/acpi/tables/DSDT > /tmp/dsdt.dat" + b) disassemble the table by running "iasl -d dsdt.dat". + c) rewrite the ASL code of the method and save it in a new file, + d) package the new file (psr.asl) to an ACPI table format. + Here is an example of a customized \_SB._AC._PSR method, + + DefinitionBlock ("", "SSDT", 1, "", "", 0x20080715) + { + External (ACON) + + Method (\_SB_.AC._PSR, 0, NotSerialized) + { + Store ("In AC _PSR", Debug) + Return (ACON) + } + } + Note that the full pathname of the method in ACPI namespace + should be used. + And remember to use "External" to declare external objects. + e) assemble the file to generate the AML code of the method. + e.g. "iasl psr.asl" (psr.aml is generated as a result) + f) mount debugfs by "mount -t debugfs none /sys/kernel/debug" + g) override the old method via the debugfs by running + "cat /tmp/psr.aml > /sys/kernel/debug/acpi/custom_method" + +2. insert a new method + This is easier than overriding an existing method. + We just need to create the ASL code of the method we want to + insert and then follow the step c) ~ g) in section 1. + +3. undo your changes + The "undo" operation is not supported for a new inserted method + right now, i.e. we can not remove a method currently. + For an overrided method, in order to undo your changes, please + save a copy of the method original ASL code in step c) section 1, + and redo step c) ~ g) to override the method with the original one. + + +Note: We can use a kernel with multiple custom ACPI method running, + But each individual write to debugfs can implement a SINGLE + method override. i.e. if we want to insert/override multiple + ACPI methods, we need to redo step c) ~ g) for multiple times. -- cgit v1.1 From 12458ea06efd7b44281e68fe59c950ec7d59c649 Mon Sep 17 00:00:00 2001 From: Anatolij Gustschin Date: Fri, 11 Dec 2009 21:24:44 -0700 Subject: ppc440spe-adma: adds updated ppc440spe adma driver This patch adds new version of the PPC440SPe ADMA driver. Signed-off-by: Yuri Tikhonov Signed-off-by: Anatolij Gustschin Signed-off-by: Dan Williams --- .../powerpc/dts-bindings/4xx/ppc440spe-adma.txt | 93 ++++++++++++++++++++++ 1 file changed, 93 insertions(+) create mode 100644 Documentation/powerpc/dts-bindings/4xx/ppc440spe-adma.txt (limited to 'Documentation') diff --git a/Documentation/powerpc/dts-bindings/4xx/ppc440spe-adma.txt b/Documentation/powerpc/dts-bindings/4xx/ppc440spe-adma.txt new file mode 100644 index 0000000..515ebcf --- /dev/null +++ b/Documentation/powerpc/dts-bindings/4xx/ppc440spe-adma.txt @@ -0,0 +1,93 @@ +PPC440SPe DMA/XOR (DMA Controller and XOR Accelerator) + +Device nodes needed for operation of the ppc440spe-adma driver +are specified hereby. These are I2O/DMA, DMA and XOR nodes +for DMA engines and Memory Queue Module node. The latter is used +by ADMA driver for configuration of RAID-6 H/W capabilities of +the PPC440SPe. In addition to the nodes and properties described +below, the ranges property of PLB node must specify ranges for +DMA devices. + + i) The I2O node + + Required properties: + + - compatible : "ibm,i2o-440spe"; + - reg : + - dcr-reg : + + Example: + + I2O: i2o@400100000 { + compatible = "ibm,i2o-440spe"; + reg = <0x00000004 0x00100000 0x100>; + dcr-reg = <0x060 0x020>; + }; + + + ii) The DMA node + + Required properties: + + - compatible : "ibm,dma-440spe"; + - cell-index : 1 cell, hardware index of the DMA engine + (typically 0x0 and 0x1 for DMA0 and DMA1) + - reg : + - dcr-reg : + - interrupts : . + - interrupt-parent : needed for interrupt mapping + + Example: + + DMA0: dma0@400100100 { + compatible = "ibm,dma-440spe"; + cell-index = <0>; + reg = <0x00000004 0x00100100 0x100>; + dcr-reg = <0x060 0x020>; + interrupt-parent = <&DMA0>; + interrupts = <0 1>; + #interrupt-cells = <1>; + #address-cells = <0>; + #size-cells = <0>; + interrupt-map = < + 0 &UIC0 0x14 4 + 1 &UIC1 0x16 4>; + }; + + + iii) XOR Accelerator node + + Required properties: + + - compatible : "amcc,xor-accelerator"; + - reg : + - interrupts : + - interrupt-parent : for interrupt mapping + + Example: + + xor-accel@400200000 { + compatible = "amcc,xor-accelerator"; + reg = <0x00000004 0x00200000 0x400>; + interrupt-parent = <&UIC1>; + interrupts = <0x1f 4>; + }; + + + iv) Memory Queue Module node + + Required properties: + + - compatible : "ibm,mq-440spe"; + - dcr-reg : + + Example: + + MQ0: mq { + compatible = "ibm,mq-440spe"; + dcr-reg = <0x040 0x020>; + }; + -- cgit v1.1 From 9367858dd08caf4e6ebd511abd2fca0a2d87b648 Mon Sep 17 00:00:00 2001 From: Sam Ravnborg Date: Sat, 17 Oct 2009 21:33:34 +0200 Subject: dontdiff: add generated We move more and more stuff to include/generated - so lets ignore the content for users of plain diff. Signed-off-by: Sam Ravnborg Signed-off-by: Michal Marek --- Documentation/dontdiff | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/dontdiff b/Documentation/dontdiff index e151b2a..3ad6ace 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -103,6 +103,7 @@ gconf gen-devlist gen_crc32table gen_init_cpio +generated genheaders genksyms *_gray256.c -- cgit v1.1 From 264a26838056fc2d759f58bec2e720e01fcb1bdb Mon Sep 17 00:00:00 2001 From: Sam Ravnborg Date: Sun, 18 Oct 2009 00:49:24 +0200 Subject: kbuild: move autoconf.h to include/generated Signed-off-by: Sam Ravnborg Signed-off-by: Michal Marek --- Documentation/kbuild/kconfig.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/kbuild/kconfig.txt b/Documentation/kbuild/kconfig.txt index 849b5e5..ab8dc35 100644 --- a/Documentation/kbuild/kconfig.txt +++ b/Documentation/kbuild/kconfig.txt @@ -106,7 +106,8 @@ This environment variable can be set to specify the path & name of the KCONFIG_AUTOHEADER -------------------------------------------------- This environment variable can be set to specify the path & name of the -"autoconf.h" (header) file. Its default value is "include/linux/autoconf.h". +"autoconf.h" (header) file. +Its default value is "include/generated/autoconf.h". ====================================================================== -- cgit v1.1 From bc081dd6e9f622c73334dc465359168543ccaabf Mon Sep 17 00:00:00 2001 From: Michal Marek Date: Mon, 7 Dec 2009 16:38:33 +0100 Subject: kbuild: generate modules.builtin To make it easier for module-init-tools and scripts like mkinitrd to distinguish builtin and missing modules, install a modules.builtin file listing all builtin modules. This is done by generating an additional config file (tristate.conf) with tristate options set to uppercase 'Y' or 'M'. If we source that config file, the builtin modules appear in obj-Y. Signed-off-by: Michal Marek --- Documentation/kbuild/kbuild.txt | 14 ++++++++++++++ Documentation/kbuild/kconfig.txt | 5 +++++ 2 files changed, 19 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kbuild/kbuild.txt b/Documentation/kbuild/kbuild.txt index bb3bf38..6f8c1ca 100644 --- a/Documentation/kbuild/kbuild.txt +++ b/Documentation/kbuild/kbuild.txt @@ -1,3 +1,17 @@ +Output files + +modules.order +-------------------------------------------------- +This file records the order in which modules appear in Makefiles. This +is used by modprobe to deterministically resolve aliases that match +multiple modules. + +modules.builtin +-------------------------------------------------- +This file lists all modules that are built into the kernel. This is used +by modprobe to not fail when trying to load something builtin. + + Environment variables KCPPFLAGS diff --git a/Documentation/kbuild/kconfig.txt b/Documentation/kbuild/kconfig.txt index ab8dc35..49efae7 100644 --- a/Documentation/kbuild/kconfig.txt +++ b/Documentation/kbuild/kconfig.txt @@ -103,6 +103,11 @@ KCONFIG_AUTOCONFIG This environment variable can be set to specify the path & name of the "auto.conf" file. Its default value is "include/config/auto.conf". +KCONFIG_TRISTATE +-------------------------------------------------- +This environment variable can be set to specify the path & name of the +"tristate.conf" file. Its default value is "include/config/tristate.conf". + KCONFIG_AUTOHEADER -------------------------------------------------- This environment variable can be set to specify the path & name of the -- cgit v1.1 From 86ad53f8aa051f5cf8783e28a804adc3ebc391f7 Mon Sep 17 00:00:00 2001 From: Albert Herranz Date: Sat, 12 Dec 2009 06:31:35 +0000 Subject: powerpc: gamecube: device tree Add a device tree source file for the Nintendo GameCube video game console. Signed-off-by: Albert Herranz Acked-by: Segher Boessenkool Signed-off-by: Grant Likely --- .../powerpc/dts-bindings/nintendo/gamecube.txt | 109 +++++++++++++++++++++ 1 file changed, 109 insertions(+) create mode 100644 Documentation/powerpc/dts-bindings/nintendo/gamecube.txt (limited to 'Documentation') diff --git a/Documentation/powerpc/dts-bindings/nintendo/gamecube.txt b/Documentation/powerpc/dts-bindings/nintendo/gamecube.txt new file mode 100644 index 0000000..b558585 --- /dev/null +++ b/Documentation/powerpc/dts-bindings/nintendo/gamecube.txt @@ -0,0 +1,109 @@ + +Nintendo GameCube device tree +============================= + +1) The "flipper" node + + This node represents the multi-function "Flipper" chip, which packages + many of the devices found in the Nintendo GameCube. + + Required properties: + + - compatible : Should be "nintendo,flipper" + +1.a) The Video Interface (VI) node + + Represents the interface between the graphics processor and a external + video encoder. + + Required properties: + + - compatible : should be "nintendo,flipper-vi" + - reg : should contain the VI registers location and length + - interrupts : should contain the VI interrupt + +1.b) The Processor Interface (PI) node + + Represents the data and control interface between the main processor + and graphics and audio processor. + + Required properties: + + - compatible : should be "nintendo,flipper-pi" + - reg : should contain the PI registers location and length + +1.b.i) The "Flipper" interrupt controller node + + Represents the interrupt controller within the "Flipper" chip. + The node for the "Flipper" interrupt controller must be placed under + the PI node. + + Required properties: + + - compatible : should be "nintendo,flipper-pic" + +1.c) The Digital Signal Procesor (DSP) node + + Represents the digital signal processor interface, designed to offload + audio related tasks. + + Required properties: + + - compatible : should be "nintendo,flipper-dsp" + - reg : should contain the DSP registers location and length + - interrupts : should contain the DSP interrupt + +1.c.i) The Auxiliary RAM (ARAM) node + + Represents the non cpu-addressable ram designed mainly to store audio + related information. + The ARAM node must be placed under the DSP node. + + Required properties: + + - compatible : should be "nintendo,flipper-aram" + - reg : should contain the ARAM start (zero-based) and length + +1.d) The Disk Interface (DI) node + + Represents the interface used to communicate with mass storage devices. + + Required properties: + + - compatible : should be "nintendo,flipper-di" + - reg : should contain the DI registers location and length + - interrupts : should contain the DI interrupt + +1.e) The Audio Interface (AI) node + + Represents the interface to the external 16-bit stereo digital-to-analog + converter. + + Required properties: + + - compatible : should be "nintendo,flipper-ai" + - reg : should contain the AI registers location and length + - interrupts : should contain the AI interrupt + +1.f) The Serial Interface (SI) node + + Represents the interface to the four single bit serial interfaces. + The SI is a proprietary serial interface used normally to control gamepads. + It's NOT a RS232-type interface. + + Required properties: + + - compatible : should be "nintendo,flipper-si" + - reg : should contain the SI registers location and length + - interrupts : should contain the SI interrupt + +1.g) The External Interface (EXI) node + + Represents the multi-channel SPI-like interface. + + Required properties: + + - compatible : should be "nintendo,flipper-exi" + - reg : should contain the EXI registers location and length + - interrupts : should contain the EXI interrupt + -- cgit v1.1 From 7a09116c016611effe7037c8346c5f42a7416718 Mon Sep 17 00:00:00 2001 From: Albert Herranz Date: Sat, 12 Dec 2009 06:31:44 +0000 Subject: powerpc: wii: device tree Add a device tree source file for the Nintendo Wii video game console. Signed-off-by: Albert Herranz Acked-by: Segher Boessenkool Acked-by: Benjamin Herrenschmidt Signed-off-by: Grant Likely --- .../powerpc/dts-bindings/nintendo/wii.txt | 184 +++++++++++++++++++++ 1 file changed, 184 insertions(+) create mode 100644 Documentation/powerpc/dts-bindings/nintendo/wii.txt (limited to 'Documentation') diff --git a/Documentation/powerpc/dts-bindings/nintendo/wii.txt b/Documentation/powerpc/dts-bindings/nintendo/wii.txt new file mode 100644 index 0000000..a7e155a --- /dev/null +++ b/Documentation/powerpc/dts-bindings/nintendo/wii.txt @@ -0,0 +1,184 @@ + +Nintendo Wii device tree +======================== + +0) The root node + + This node represents the Nintendo Wii video game console. + + Required properties: + + - model : Should be "nintendo,wii" + - compatible : Should be "nintendo,wii" + +1) The "hollywood" node + + This node represents the multi-function "Hollywood" chip, which packages + many of the devices found in the Nintendo Wii. + + Required properties: + + - compatible : Should be "nintendo,hollywood" + +1.a) The Video Interface (VI) node + + Represents the interface between the graphics processor and a external + video encoder. + + Required properties: + + - compatible : should be "nintendo,hollywood-vi","nintendo,flipper-vi" + - reg : should contain the VI registers location and length + - interrupts : should contain the VI interrupt + +1.b) The Processor Interface (PI) node + + Represents the data and control interface between the main processor + and graphics and audio processor. + + Required properties: + + - compatible : should be "nintendo,hollywood-pi","nintendo,flipper-pi" + - reg : should contain the PI registers location and length + +1.b.i) The "Flipper" interrupt controller node + + Represents the "Flipper" interrupt controller within the "Hollywood" chip. + The node for the "Flipper" interrupt controller must be placed under + the PI node. + + Required properties: + + - #interrupt-cells : <1> + - compatible : should be "nintendo,flipper-pic" + - interrupt-controller + +1.c) The Digital Signal Procesor (DSP) node + + Represents the digital signal processor interface, designed to offload + audio related tasks. + + Required properties: + + - compatible : should be "nintendo,hollywood-dsp","nintendo,flipper-dsp" + - reg : should contain the DSP registers location and length + - interrupts : should contain the DSP interrupt + +1.d) The Serial Interface (SI) node + + Represents the interface to the four single bit serial interfaces. + The SI is a proprietary serial interface used normally to control gamepads. + It's NOT a RS232-type interface. + + Required properties: + + - compatible : should be "nintendo,hollywood-si","nintendo,flipper-si" + - reg : should contain the SI registers location and length + - interrupts : should contain the SI interrupt + +1.e) The Audio Interface (AI) node + + Represents the interface to the external 16-bit stereo digital-to-analog + converter. + + Required properties: + + - compatible : should be "nintendo,hollywood-ai","nintendo,flipper-ai" + - reg : should contain the AI registers location and length + - interrupts : should contain the AI interrupt + +1.f) The External Interface (EXI) node + + Represents the multi-channel SPI-like interface. + + Required properties: + + - compatible : should be "nintendo,hollywood-exi","nintendo,flipper-exi" + - reg : should contain the EXI registers location and length + - interrupts : should contain the EXI interrupt + +1.g) The Open Host Controller Interface (OHCI) nodes + + Represent the USB 1.x Open Host Controller Interfaces. + + Required properties: + + - compatible : should be "nintendo,hollywood-usb-ohci","usb-ohci" + - reg : should contain the OHCI registers location and length + - interrupts : should contain the OHCI interrupt + +1.h) The Enhanced Host Controller Interface (EHCI) node + + Represents the USB 2.0 Enhanced Host Controller Interface. + + Required properties: + + - compatible : should be "nintendo,hollywood-usb-ehci","usb-ehci" + - reg : should contain the EHCI registers location and length + - interrupts : should contain the EHCI interrupt + +1.i) The Secure Digital Host Controller Interface (SDHCI) nodes + + Represent the Secure Digital Host Controller Interfaces. + + Required properties: + + - compatible : should be "nintendo,hollywood-sdhci","sdhci" + - reg : should contain the SDHCI registers location and length + - interrupts : should contain the SDHCI interrupt + +1.j) The Inter-Processsor Communication (IPC) node + + Represent the Inter-Processor Communication interface. This interface + enables communications between the Broadway and the Starlet processors. + + - compatible : should be "nintendo,hollywood-ipc" + - reg : should contain the IPC registers location and length + - interrupts : should contain the IPC interrupt + +1.k) The "Hollywood" interrupt controller node + + Represents the "Hollywood" interrupt controller within the + "Hollywood" chip. + + Required properties: + + - #interrupt-cells : <1> + - compatible : should be "nintendo,hollywood-pic" + - reg : should contain the controller registers location and length + - interrupt-controller + - interrupts : should contain the cascade interrupt of the "flipper" pic + - interrupt-parent: should contain the phandle of the "flipper" pic + +1.l) The General Purpose I/O (GPIO) controller node + + Represents the dual access 32 GPIO controller interface. + + Required properties: + + - #gpio-cells : <2> + - compatible : should be "nintendo,hollywood-gpio" + - reg : should contain the IPC registers location and length + - gpio-controller + +1.m) The control node + + Represents the control interface used to setup several miscellaneous + settings of the "Hollywood" chip like boot memory mappings, resets, + disk interface mode, etc. + + Required properties: + + - compatible : should be "nintendo,hollywood-control" + - reg : should contain the control registers location and length + +1.n) The Disk Interface (DI) node + + Represents the interface used to communicate with mass storage devices. + + Required properties: + + - compatible : should be "nintendo,hollywood-di" + - reg : should contain the DI registers location and length + - interrupts : should contain the DI interrupt + -- cgit v1.1 From 7a92263705435d046d37a0990d0edfcb517f7ad3 Mon Sep 17 00:00:00 2001 From: Jan Engelhardt Date: Mon, 14 Dec 2009 14:52:10 +0100 Subject: netfilter: xtables: document minimal required version For both .33 and .32-stable. Signed-off-by: Jan Engelhardt Cc: stable@kernel.org Signed-off-by: Patrick McHardy --- Documentation/Changes | 2 ++ 1 file changed, 2 insertions(+) (limited to 'Documentation') diff --git a/Documentation/Changes b/Documentation/Changes index 6d0f1ef..f08b313 100644 --- a/Documentation/Changes +++ b/Documentation/Changes @@ -49,6 +49,8 @@ o oprofile 0.9 # oprofiled --version o udev 081 # udevinfo -V o grub 0.93 # grub --version o mcelog 0.6 +o iptables 1.4.1 # iptables -V + Kernel compilation ================== -- cgit v1.1 From c3813d6af177fab19e322f3114b1f64fbcf08d71 Mon Sep 17 00:00:00 2001 From: Jean Delvare Date: Mon, 14 Dec 2009 21:17:25 +0100 Subject: i2c: Get rid of struct i2c_client_address_data Struct i2c_client_address_data only contains one field at this point, which makes its usefulness questionable. Get rid of it and pass simple address lists around instead. Signed-off-by: Jean Delvare Tested-by: Wolfram Sang --- Documentation/i2c/writing-clients | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/i2c/writing-clients b/Documentation/i2c/writing-clients index 7860aaf..0a74603 100644 --- a/Documentation/i2c/writing-clients +++ b/Documentation/i2c/writing-clients @@ -44,7 +44,7 @@ static struct i2c_driver foo_driver = { /* if device autodetection is needed: */ .class = I2C_CLASS_SOMETHING, .detect = foo_detect, - .address_data = &addr_data, + .address_list = normal_i2c, .shutdown = foo_shutdown, /* optional */ .suspend = foo_suspend, /* optional */ -- cgit v1.1 From aaa225e0aa96691964a5f0518c3c16b954d400cb Mon Sep 17 00:00:00 2001 From: Michael Hennerich Date: Wed, 14 Oct 2009 13:43:53 +0000 Subject: Blackfin: punt cache lock documentation The cache lock code was unused and punted, so punt the documentation too. Signed-off-by: Michael Hennerich Signed-off-by: Mike Frysinger --- Documentation/blackfin/00-INDEX | 3 -- Documentation/blackfin/cache-lock.txt | 48 -------------------------------- Documentation/blackfin/cachefeatures.txt | 10 ------- 3 files changed, 61 deletions(-) delete mode 100644 Documentation/blackfin/cache-lock.txt (limited to 'Documentation') diff --git a/Documentation/blackfin/00-INDEX b/Documentation/blackfin/00-INDEX index d6840a9..c34e124 100644 --- a/Documentation/blackfin/00-INDEX +++ b/Documentation/blackfin/00-INDEX @@ -1,9 +1,6 @@ 00-INDEX - This file -cache-lock.txt - - HOWTO for blackfin cache locking. - cachefeatures.txt - Supported cache features. diff --git a/Documentation/blackfin/cache-lock.txt b/Documentation/blackfin/cache-lock.txt deleted file mode 100644 index 88ba1e6..0000000 --- a/Documentation/blackfin/cache-lock.txt +++ /dev/null @@ -1,48 +0,0 @@ -/* - * File: Documentation/blackfin/cache-lock.txt - * Based on: - * Author: - * - * Created: - * Description: This file contains the simple DMA Implementation for Blackfin - * - * Rev: $Id: cache-lock.txt 2384 2006-11-01 04:12:43Z magicyang $ - * - * Modified: - * Copyright 2004-2006 Analog Devices Inc. - * - * Bugs: Enter bugs at http://blackfin.uclinux.org/ - * - */ - -How to lock your code in cache in uClinux/blackfin --------------------------------------------------- - -There are only a few steps required to lock your code into the cache. -Currently you can lock the code by Way. - -Below are the interface provided for locking the cache. - - -1. cache_grab_lock(int Ways); - -This function grab the lock for locking your code into the cache specified -by Ways. - - -2. cache_lock(int Ways); - -This function should be called after your critical code has been executed. -Once the critical code exits, the code is now loaded into the cache. This -function locks the code into the cache. - - -So, the example sequence will be: - - cache_grab_lock(WAY0_L); /* Grab the lock */ - - critical_code(); /* Execute the code of interest */ - - cache_lock(WAY0_L); /* Lock the cache */ - -Where WAY0_L signifies WAY0 locking. diff --git a/Documentation/blackfin/cachefeatures.txt b/Documentation/blackfin/cachefeatures.txt index 0fbec23..75de51f 100644 --- a/Documentation/blackfin/cachefeatures.txt +++ b/Documentation/blackfin/cachefeatures.txt @@ -41,16 +41,6 @@ icplb_flush(); dcplb_flush(); - - Locking the cache. - - cache_grab_lock(); - cache_lock(); - - Please refer linux-2.6.x/Documentation/blackfin/cache-lock.txt for how to - lock the cache. - - Locking the cache is optional feature. - - Miscellaneous cache functions. flush_cache_all(); -- cgit v1.1 From 4b60779d5ea76908c3bc82d93280b733335fce48 Mon Sep 17 00:00:00 2001 From: Mike Frysinger Date: Wed, 21 Oct 2009 21:22:35 +0000 Subject: Blackfin: add an example showing how to use the gptimers API Signed-off-by: Mike Frysinger --- Documentation/blackfin/Makefile | 6 +++ Documentation/blackfin/gptimers-example.c | 83 +++++++++++++++++++++++++++++++ 2 files changed, 89 insertions(+) create mode 100644 Documentation/blackfin/Makefile create mode 100644 Documentation/blackfin/gptimers-example.c (limited to 'Documentation') diff --git a/Documentation/blackfin/Makefile b/Documentation/blackfin/Makefile new file mode 100644 index 0000000..773dbb1 --- /dev/null +++ b/Documentation/blackfin/Makefile @@ -0,0 +1,6 @@ +obj-m := gptimers-example.o + +all: modules + +modules clean: + $(MAKE) -C ../.. SUBDIRS=$(PWD) $@ diff --git a/Documentation/blackfin/gptimers-example.c b/Documentation/blackfin/gptimers-example.c new file mode 100644 index 0000000..b1bd634 --- /dev/null +++ b/Documentation/blackfin/gptimers-example.c @@ -0,0 +1,83 @@ +/* + * Simple gptimers example + * http://docs.blackfin.uclinux.org/doku.php?id=linux-kernel:drivers:gptimers + * + * Copyright 2007-2009 Analog Devices Inc. + * + * Licensed under the GPL-2 or later. + */ + +#include +#include + +#include +#include + +/* ... random driver includes ... */ + +#define DRIVER_NAME "gptimer_example" + +struct gptimer_data { + uint32_t period, width; +}; +static struct gptimer_data data; + +/* ... random driver state ... */ + +static irqreturn_t gptimer_example_irq(int irq, void *dev_id) +{ + struct gptimer_data *data = dev_id; + + /* make sure it was our timer which caused the interrupt */ + if (!get_gptimer_intr(TIMER5_id)) + return IRQ_NONE; + + /* read the width/period values that were captured for the waveform */ + data->width = get_gptimer_pwidth(TIMER5_id); + data->period = get_gptimer_period(TIMER5_id); + + /* acknowledge the interrupt */ + clear_gptimer_intr(TIMER5_id); + + /* tell the upper layers we took care of things */ + return IRQ_HANDLED; +} + +/* ... random driver code ... */ + +static int __init gptimer_example_init(void) +{ + int ret; + + /* grab the peripheral pins */ + ret = peripheral_request(P_TMR5, DRIVER_NAME); + if (ret) { + printk(KERN_NOTICE DRIVER_NAME ": peripheral request failed\n"); + return ret; + } + + /* grab the IRQ for the timer */ + ret = request_irq(IRQ_TIMER5, gptimer_example_irq, IRQF_SHARED, DRIVER_NAME, &data); + if (ret) { + printk(KERN_NOTICE DRIVER_NAME ": IRQ request failed\n"); + peripheral_free(P_TMR5); + return ret; + } + + /* setup the timer and enable it */ + set_gptimer_config(TIMER5_id, WDTH_CAP | PULSE_HI | PERIOD_CNT | IRQ_ENA); + enable_gptimers(TIMER5bit); + + return 0; +} +module_init(gptimer_example_init); + +static void __exit gptimer_example_exit(void) +{ + disable_gptimers(TIMER5bit); + free_irq(IRQ_TIMER5, &data); + peripheral_free(P_TMR5); +} +module_exit(gptimer_example_exit); + +MODULE_LICENSE("BSD"); -- cgit v1.1 From 3428838d8e88fb766143c913cf3e777b000f8e07 Mon Sep 17 00:00:00 2001 From: Tommi Rantala Date: Mon, 14 Dec 2009 17:57:48 -0800 Subject: page-types: constify read only arrays Signed-off-by: Tommi Rantala Cc: Randy Dunlap Cc: Wu Fengguang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/page-types.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c index ea44ea5..59a4614 100644 --- a/Documentation/vm/page-types.c +++ b/Documentation/vm/page-types.c @@ -100,7 +100,7 @@ #define BIT(name) (1ULL << KPF_##name) #define BITS_COMPOUND (BIT(COMPOUND_HEAD) | BIT(COMPOUND_TAIL)) -static char *page_flag_names[] = { +static const char *page_flag_names[] = { [KPF_LOCKED] = "L:locked", [KPF_ERROR] = "E:error", [KPF_REFERENCED] = "R:referenced", @@ -173,7 +173,7 @@ static int kpageflags_fd; static int opt_hwpoison; static int opt_unpoison; -static char *hwpoison_debug_fs = "/debug/hwpoison"; +static const char hwpoison_debug_fs[] = "/debug/hwpoison"; static int hwpoison_inject_fd; static int hwpoison_forget_fd; @@ -885,7 +885,7 @@ static void parse_bits_mask(const char *optarg) } -static struct option opts[] = { +static const struct option opts[] = { { "raw" , 0, NULL, 'r' }, { "pid" , 1, NULL, 'p' }, { "file" , 1, NULL, 'f' }, -- cgit v1.1 From f1327bf18c8737aeb85226f02f102dfa065fddde Mon Sep 17 00:00:00 2001 From: Roel Kluin Date: Mon, 14 Dec 2009 17:57:49 -0800 Subject: page-types: unsigned cannot be less than 0 in add_page() If not signed, testing of the read() return value in this function will not work. Signed-off-by: Roel Kluin Cc: Wu Fengguang Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/page-types.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c index 59a4614..2279b61 100644 --- a/Documentation/vm/page-types.c +++ b/Documentation/vm/page-types.c @@ -560,7 +560,7 @@ static void walk_pfn(unsigned long voffset, { uint64_t buf[KPAGEFLAGS_BATCH]; unsigned long batch; - unsigned long pages; + long pages; unsigned long i; while (count) { -- cgit v1.1 From dcfe730c60030fb76e2794a8960c6bd84c6c6163 Mon Sep 17 00:00:00 2001 From: Alex Chiang Date: Mon, 14 Dec 2009 17:57:52 -0800 Subject: page-types: learn to describe flags directly from command line Teach page-types to describe page flags directly from the command line. Why is this useful? For instance, if you're using memory hotplug and see this in /var/log/messages: kernel: removing from LRU failed 3836dd0/1/1e00000000000010 It would be nice to decode those page flags without staring at the source. Example usage and output: # Documentation/vm/page-types -d 0x10 0x0000000000000010 ____D_____________________________ dirty # Documentation/vm/page-types -d anon 0x0000000000001000 ____________a_____________________ anonymous # Documentation/vm/page-types -d anon,0x10 0x0000000000001010 ____D_______a_____________________ dirty,anonymous [achiang@hp.com: documentation] Signed-off-by: Alex Chiang Signed-off-by: Wu Fengguang Cc: Andi Kleen Cc: Haicheng Li Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/page-types.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c index 2279b61..8c28be6 100644 --- a/Documentation/vm/page-types.c +++ b/Documentation/vm/page-types.c @@ -674,6 +674,7 @@ static void usage(void) printf( "page-types [options]\n" " -r|--raw Raw mode, for kernel developers\n" +" -d|--describe flags Describe flags\n" " -a|--addr addr-spec Walk a range of pages\n" " -b|--bits bits-spec Walk pages with specified bits\n" " -p|--pid pid Walk process address space\n" @@ -686,6 +687,10 @@ static void usage(void) " -X|--hwpoison hwpoison pages\n" " -x|--unpoison unpoison pages\n" " -h|--help Show this usage message\n" +"flags:\n" +" 0x10 bitfield format, e.g.\n" +" anon bit-name, e.g.\n" +" 0x10,anon comma-separated list, e.g.\n" "addr-spec:\n" " N one page at offset N (unit: pages)\n" " N+M pages range from N to N+M-1\n" @@ -884,6 +889,15 @@ static void parse_bits_mask(const char *optarg) add_bits_filter(mask, bits); } +static void describe_flags(const char *optarg) +{ + uint64_t flags = parse_flag_names(optarg, 0); + + printf("0x%016llx\t%s\t%s\n", + (unsigned long long)flags, + page_flag_name(flags), + page_flag_longname(flags)); +} static const struct option opts[] = { { "raw" , 0, NULL, 'r' }, @@ -891,6 +905,7 @@ static const struct option opts[] = { { "file" , 1, NULL, 'f' }, { "addr" , 1, NULL, 'a' }, { "bits" , 1, NULL, 'b' }, + { "describe" , 1, NULL, 'd' }, { "list" , 0, NULL, 'l' }, { "list-each" , 0, NULL, 'L' }, { "no-summary", 0, NULL, 'N' }, @@ -907,7 +922,7 @@ int main(int argc, char *argv[]) page_size = getpagesize(); while ((c = getopt_long(argc, argv, - "rp:f:a:b:lLNXxh", opts, NULL)) != -1) { + "rp:f:a:b:d:lLNXxh", opts, NULL)) != -1) { switch (c) { case 'r': opt_raw = 1; @@ -924,6 +939,10 @@ int main(int argc, char *argv[]) case 'b': parse_bits_mask(optarg); break; + case 'd': + opt_no_summary = 1; + describe_flags(optarg); + break; case 'l': opt_list = 1; break; -- cgit v1.1 From 9fdcd886ab407dbe3f854579417e6b036222a468 Mon Sep 17 00:00:00 2001 From: Alex Chiang Date: Mon, 14 Dec 2009 17:57:53 -0800 Subject: page-types: whitespace alignment Align the output when page-type -h is invoked. Signed-off-by: Alex Chiang Acked-by: Wu Fengguang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/page-types.c | 46 +++++++++++++++++++++---------------------- 1 file changed, 23 insertions(+), 23 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c index 8c28be6..19ca23c 100644 --- a/Documentation/vm/page-types.c +++ b/Documentation/vm/page-types.c @@ -673,35 +673,35 @@ static void usage(void) printf( "page-types [options]\n" -" -r|--raw Raw mode, for kernel developers\n" +" -r|--raw Raw mode, for kernel developers\n" " -d|--describe flags Describe flags\n" -" -a|--addr addr-spec Walk a range of pages\n" -" -b|--bits bits-spec Walk pages with specified bits\n" -" -p|--pid pid Walk process address space\n" +" -a|--addr addr-spec Walk a range of pages\n" +" -b|--bits bits-spec Walk pages with specified bits\n" +" -p|--pid pid Walk process address space\n" #if 0 /* planned features */ -" -f|--file filename Walk file address space\n" +" -f|--file filename Walk file address space\n" #endif -" -l|--list Show page details in ranges\n" -" -L|--list-each Show page details one by one\n" -" -N|--no-summary Don't show summay info\n" -" -X|--hwpoison hwpoison pages\n" -" -x|--unpoison unpoison pages\n" -" -h|--help Show this usage message\n" +" -l|--list Show page details in ranges\n" +" -L|--list-each Show page details one by one\n" +" -N|--no-summary Don't show summay info\n" +" -X|--hwpoison hwpoison pages\n" +" -x|--unpoison unpoison pages\n" +" -h|--help Show this usage message\n" "flags:\n" -" 0x10 bitfield format, e.g.\n" -" anon bit-name, e.g.\n" -" 0x10,anon comma-separated list, e.g.\n" +" 0x10 bitfield format, e.g.\n" +" anon bit-name, e.g.\n" +" 0x10,anon comma-separated list, e.g.\n" "addr-spec:\n" -" N one page at offset N (unit: pages)\n" -" N+M pages range from N to N+M-1\n" -" N,M pages range from N to M-1\n" -" N, pages range from N to end\n" -" ,M pages range from 0 to M-1\n" +" N one page at offset N (unit: pages)\n" +" N+M pages range from N to N+M-1\n" +" N,M pages range from N to M-1\n" +" N, pages range from N to end\n" +" ,M pages range from 0 to M-1\n" "bits-spec:\n" -" bit1,bit2 (flags & (bit1|bit2)) != 0\n" -" bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1\n" -" bit1,~bit2 (flags & (bit1|bit2)) == bit1\n" -" =bit1,bit2 flags == (bit1|bit2)\n" +" bit1,bit2 (flags & (bit1|bit2)) != 0\n" +" bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1\n" +" bit1,~bit2 (flags & (bit1|bit2)) == bit1\n" +" =bit1,bit2 flags == (bit1|bit2)\n" "bit-names:\n" ); -- cgit v1.1 From bb86a7338b7a864c03e1736a8d370b10254b0300 Mon Sep 17 00:00:00 2001 From: Alex Chiang Date: Mon, 14 Dec 2009 17:57:54 -0800 Subject: page-types: exit early when invoked with -d|--describe On a system with large amount of memory (256GB), invoking page-types can take quite a long time, which is unreasonable considering the user only wants a description of the flags: # time ./page-types -d 0x10 0x0000000000000010 ____D_____________________________ dirty real 0m34.285s user 0m1.966s sys 0m32.313s This is because we still walk the entire address range. Exiting early seems like a reasonble solution: # time ./page-types -d 0x10 0x0000000000000010 ____D_____________________________ dirty real 0m0.007s user 0m0.001s sys 0m0.005s Signed-off-by: Alex Chiang Cc: Andi Kleen Cc: Haicheng Li Acked-by: Wu Fengguang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/page-types.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c index 19ca23c..7a7d9ba 100644 --- a/Documentation/vm/page-types.c +++ b/Documentation/vm/page-types.c @@ -940,9 +940,8 @@ int main(int argc, char *argv[]) parse_bits_mask(optarg); break; case 'd': - opt_no_summary = 1; describe_flags(optarg); - break; + exit(0); case 'l': opt_list = 1; break; -- cgit v1.1 From 267b4c281b4a43c8f3d965c791d3a7fd62448733 Mon Sep 17 00:00:00 2001 From: Lee Schermerhorn Date: Mon, 14 Dec 2009 17:58:30 -0800 Subject: hugetlb: update hugetlb documentation for NUMA controls Update the kernel huge tlb documentation to describe the numa memory policy based huge page management. Additionaly, the patch includes a fair amount of rework to improve consistency, eliminate duplication and set the context for documenting the memory policy interaction. Signed-off-by: Lee Schermerhorn Acked-by: David Rientjes Acked-by: Mel Gorman Reviewed-by: Andi Kleen Cc: KAMEZAWA Hiroyuki Cc: Lee Schermerhorn Cc: Randy Dunlap Cc: Nishanth Aravamudan Cc: Adam Litke Cc: Andy Whitcroft Cc: Eric Whitney Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/hugetlbpage.txt | 261 ++++++++++++++++++++++++++------------- 1 file changed, 176 insertions(+), 85 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index 82a7bd1..01c3108 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -11,23 +11,21 @@ This optimization is more critical now as bigger and bigger physical memories (several GBs) are more readily available. Users can use the huge page support in Linux kernel by either using the mmap -system call or standard SYSv shared memory system calls (shmget, shmat). +system call or standard SYSV shared memory system calls (shmget, shmat). First the Linux kernel needs to be built with the CONFIG_HUGETLBFS (present under "File systems") and CONFIG_HUGETLB_PAGE (selected automatically when CONFIG_HUGETLBFS is selected) configuration options. -The kernel built with huge page support should show the number of configured -huge pages in the system by running the "cat /proc/meminfo" command. +The /proc/meminfo file provides information about the total number of +persistent hugetlb pages in the kernel's huge page pool. It also displays +information about the number of free, reserved and surplus huge pages and the +default huge page size. The huge page size is needed for generating the +proper alignment and size of the arguments to system calls that map huge page +regions. -/proc/meminfo also provides information about the total number of hugetlb -pages configured in the kernel. It also displays information about the -number of free hugetlb pages at any time. It also displays information about -the configured huge page size - this is needed for generating the proper -alignment and size of the arguments to the above system calls. - -The output of "cat /proc/meminfo" will have lines like: +The output of "cat /proc/meminfo" will include lines like: ..... HugePages_Total: vvv @@ -53,59 +51,63 @@ HugePages_Surp is short for "surplus," and is the number of huge pages in /proc/filesystems should also show a filesystem of type "hugetlbfs" configured in the kernel. -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb -pages in the kernel. Super user can dynamically request more (or free some -pre-configured) huge pages. -The allocation (or deallocation) of hugetlb pages is possible only if there are -enough physically contiguous free pages in system (freeing of huge pages is -possible only if there are enough hugetlb pages free that can be transferred -back to regular memory pool). +/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge +pages in the kernel's huge page pool. "Persistent" huge pages will be +returned to the huge page pool when freed by a task. A user with root +privileges can dynamically allocate more or free some persistent huge pages +by increasing or decreasing the value of 'nr_hugepages'. -Pages that are used as hugetlb pages are reserved inside the kernel and cannot -be used for other purposes. +Pages that are used as huge pages are reserved inside the kernel and cannot +be used for other purposes. Huge pages cannot be swapped out under +memory pressure. -Once the kernel with Hugetlb page support is built and running, a user can -use either the mmap system call or shared memory system calls to start using -the huge pages. It is required that the system administrator preallocate -enough memory for huge page purposes. +Once a number of huge pages have been pre-allocated to the kernel huge page +pool, a user with appropriate privilege can use either the mmap system call +or shared memory system calls to use the huge pages. See the discussion of +Using Huge Pages, below. -The administrator can preallocate huge pages on the kernel boot command line by -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages -requested. This is the most reliable method for preallocating huge pages as -memory has not yet become fragmented. +The administrator can allocate persistent huge pages on the kernel boot +command line by specifying the "hugepages=N" parameter, where 'N' = the +number of huge pages requested. This is the most reliable method of +allocating huge pages as memory has not yet become fragmented. -Some platforms support multiple huge page sizes. To preallocate huge pages +Some platforms support multiple huge page sizes. To allocate huge pages of a specific size, one must preceed the huge pages boot command parameters with a huge page size selection parameter "hugepagesz=". must be specified in bytes with optional scale suffix [kKmMgG]. The default huge page size may be selected with the "default_hugepagesz=" boot parameter. -/proc/sys/vm/nr_hugepages indicates the current number of configured [default -size] hugetlb pages in the kernel. Super user can dynamically request more -(or free some pre-configured) huge pages. - -Use the following command to dynamically allocate/deallocate default sized -huge pages: +When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages +indicates the current number of pre-allocated huge pages of the default size. +Thus, one can use the following command to dynamically allocate/deallocate +default sized persistent huge pages: echo 20 > /proc/sys/vm/nr_hugepages -This command will try to configure 20 default sized huge pages in the system. +This command will try to adjust the number of default sized huge pages in the +huge page pool to 20, allocating or freeing huge pages, as required. + On a NUMA platform, the kernel will attempt to distribute the huge page pool -over the all on-line nodes. These huge pages, allocated when nr_hugepages -is increased, are called "persistent huge pages". +over all the set of allowed nodes specified by the NUMA memory policy of the +task that modifies nr_hugepages. The default for the allowed nodes--when the +task has default memory policy--is all on-line nodes. Allowed nodes with +insufficient available, contiguous memory for a huge page will be silently +skipped when allocating persistent huge pages. See the discussion below of +the interaction of task memory policy, cpusets and per node attributes with +the allocation and freeing of persistent huge pages. The success or failure of huge page allocation depends on the amount of -physically contiguous memory that is preset in system at the time of the +physically contiguous memory that is present in system at the time of the allocation attempt. If the kernel is unable to allocate huge pages from some nodes in a NUMA system, it will attempt to make up the difference by allocating extra pages on other nodes with sufficient available contiguous memory, if any. -System administrators may want to put this command in one of the local rc init -files. This will enable the kernel to request huge pages early in the boot -process when the possibility of getting physical contiguous pages is still -very high. Administrators can verify the number of huge pages actually -allocated by checking the sysctl or meminfo. To check the per node +System administrators may want to put this command in one of the local rc +init files. This will enable the kernel to allocate huge pages early in +the boot process when the possibility of getting physical contiguous pages +is still very high. Administrators can verify the number of huge pages +actually allocated by checking the sysctl or meminfo. To check the per node distribution of huge pages in a NUMA system, use: cat /sys/devices/system/node/node*/meminfo | fgrep Huge @@ -113,45 +115,47 @@ distribution of huge pages in a NUMA system, use: /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are requested by applications. Writing any non-zero value into this file -indicates that the hugetlb subsystem is allowed to try to obtain "surplus" -huge pages from the buddy allocator, when the normal pool is exhausted. As -these surplus huge pages go out of use, they are freed back to the buddy -allocator. +indicates that the hugetlb subsystem is allowed to try to obtain that +number of "surplus" huge pages from the kernel's normal page pool, when the +persistent huge page pool is exhausted. As these surplus huge pages become +unused, they are freed back to the kernel's normal page pool. -When increasing the huge page pool size via nr_hugepages, any surplus +When increasing the huge page pool size via nr_hugepages, any existing surplus pages will first be promoted to persistent huge pages. Then, additional huge pages will be allocated, if necessary and if possible, to fulfill -the new huge page pool size. +the new persistent huge page pool size. -The administrator may shrink the pool of preallocated huge pages for +The administrator may shrink the pool of persistent huge pages for the default huge page size by setting the nr_hugepages sysctl to a smaller value. The kernel will attempt to balance the freeing of huge pages -across all on-line nodes. Any free huge pages on the selected nodes will -be freed back to the buddy allocator. - -Caveat: Shrinking the pool via nr_hugepages such that it becomes less -than the number of huge pages in use will convert the balance to surplus -huge pages even if it would exceed the overcommit value. As long as -this condition holds, however, no more surplus huge pages will be -allowed on the system until one of the two sysctls are increased -sufficiently, or the surplus huge pages go out of use and are freed. +across all nodes in the memory policy of the task modifying nr_hugepages. +Any free huge pages on the selected nodes will be freed back to the kernel's +normal page pool. + +Caveat: Shrinking the persistent huge page pool via nr_hugepages such that +it becomes less than the number of huge pages in use will convert the balance +of the in-use huge pages to surplus huge pages. This will occur even if +the number of surplus pages it would exceed the overcommit value. As long as +this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is +increased sufficiently, or the surplus huge pages go out of use and are freed-- +no more surplus huge pages will be allowed to be allocated. With support for multiple huge page pools at run-time available, much of -the huge page userspace interface has been duplicated in sysfs. The above -information applies to the default huge page size which will be -controlled by the /proc interfaces for backwards compatibility. The root -huge page control directory in sysfs is: +the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs. +The /proc interfaces discussed above have been retained for backwards +compatibility. The root huge page control directory in sysfs is: /sys/kernel/mm/hugepages For each huge page size supported by the running kernel, a subdirectory -will exist, of the form +will exist, of the form: hugepages-${size}kB Inside each of these directories, the same set of files will exist: nr_hugepages + nr_hugepages_mempolicy nr_overcommit_hugepages free_hugepages resv_hugepages @@ -159,6 +163,101 @@ Inside each of these directories, the same set of files will exist: which function as described above for the default huge page-sized case. + +Interaction of Task Memory Policy with Huge Page Allocation/Freeing + +Whether huge pages are allocated and freed via the /proc interface or +the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA +nodes from which huge pages are allocated or freed are controlled by the +NUMA memory policy of the task that modifies the nr_hugepages_mempolicy +sysctl or attribute. When the nr_hugepages attribute is used, mempolicy +is ignored. + +The recommended method to allocate or free huge pages to/from the kernel +huge page pool, using the nr_hugepages example above, is: + + numactl --interleave echo 20 \ + >/proc/sys/vm/nr_hugepages_mempolicy + +or, more succinctly: + + numactl -m echo 20 >/proc/sys/vm/nr_hugepages_mempolicy + +This will allocate or free abs(20 - nr_hugepages) to or from the nodes +specified in , depending on whether number of persistent huge pages +is initially less than or greater than 20, respectively. No huge pages will be +allocated nor freed on any node not included in the specified . + +When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any +memory policy mode--bind, preferred, local or interleave--may be used. The +resulting effect on persistent huge page allocation is as follows: + +1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt], + persistent huge pages will be distributed across the node or nodes + specified in the mempolicy as if "interleave" had been specified. + However, if a node in the policy does not contain sufficient contiguous + memory for a huge page, the allocation will not "fallback" to the nearest + neighbor node with sufficient contiguous memory. To do this would cause + undesirable imbalance in the distribution of the huge page pool, or + possibly, allocation of persistent huge pages on nodes not allowed by + the task's memory policy. + +2) One or more nodes may be specified with the bind or interleave policy. + If more than one node is specified with the preferred policy, only the + lowest numeric id will be used. Local policy will select the node where + the task is running at the time the nodes_allowed mask is constructed. + For local policy to be deterministic, the task must be bound to a cpu or + cpus in a single node. Otherwise, the task could be migrated to some + other node at any time after launch and the resulting node will be + indeterminate. Thus, local policy is not very useful for this purpose. + Any of the other mempolicy modes may be used to specify a single node. + +3) The nodes allowed mask will be derived from any non-default task mempolicy, + whether this policy was set explicitly by the task itself or one of its + ancestors, such as numactl. This means that if the task is invoked from a + shell with non-default policy, that policy will be used. One can specify a + node list of "all" with numactl --interleave or --membind [-m] to achieve + interleaving over all nodes in the system or cpuset. + +4) Any task mempolicy specifed--e.g., using numactl--will be constrained by + the resource limits of any cpuset in which the task runs. Thus, there will + be no way for a task with non-default policy running in a cpuset with a + subset of the system nodes to allocate huge pages outside the cpuset + without first moving to a cpuset that contains all of the desired nodes. + +5) Boot-time huge page allocation attempts to distribute the requested number + of huge pages over all on-lines nodes. + +Per Node Hugepages Attributes + +A subset of the contents of the root huge page control directory in sysfs, +described above, has been replicated under each "node" system device in: + + /sys/devices/system/node/node[0-9]*/hugepages/ + +Under this directory, the subdirectory for each supported huge page size +contains the following attribute files: + + nr_hugepages + free_hugepages + surplus_hugepages + +The free_' and surplus_' attribute files are read-only. They return the number +of free and surplus [overcommitted] huge pages, respectively, on the parent +node. + +The nr_hugepages attribute returns the total number of huge pages on the +specified node. When this attribute is written, the number of persistent huge +pages on the parent node will be adjusted to the specified value, if sufficient +resources exist, regardless of the task's mempolicy or cpuset constraints. + +Note that the number of overcommit and reserve pages remain global quantities, +as we don't know until fault time, when the faulting task's mempolicy is +applied, from which node the huge page allocation will be attempted. + + +Using Huge Pages + If the user applications are going to request huge pages using mmap system call, then it is required that system administrator mount a file system of type hugetlbfs: @@ -206,9 +305,11 @@ map_hugetlb.c. * requesting huge pages. * * For the ia64 architecture, the Linux kernel reserves Region number 4 for - * huge pages. That means the addresses starting with 0x800000... will need - * to be specified. Specifying a fixed address is not required on ppc64, - * i386 or x86_64. + * huge pages. That means that if one requires a fixed address, a huge page + * aligned address starting with 0x800000... will be required. If a fixed + * address is not required, the kernel will select an address in the proper + * range. + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained. * * Note: The default shared memory limit is quite low on many kernels, * you may need to increase it via: @@ -237,14 +338,8 @@ map_hugetlb.c. #define dprintf(x) printf(x) -/* Only ia64 requires this */ -#ifdef __ia64__ -#define ADDR (void *)(0x8000000000000000UL) -#define SHMAT_FLAGS (SHM_RND) -#else -#define ADDR (void *)(0x0UL) +#define ADDR (void *)(0x0UL) /* let kernel choose address */ #define SHMAT_FLAGS (0) -#endif int main(void) { @@ -302,10 +397,12 @@ int main(void) * example, the app is requesting memory of size 256MB that is backed by * huge pages. * - * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages. - * That means the addresses starting with 0x800000... will need to be - * specified. Specifying a fixed address is not required on ppc64, i386 - * or x86_64. + * For the ia64 architecture, the Linux kernel reserves Region number 4 for + * huge pages. That means that if one requires a fixed address, a huge page + * aligned address starting with 0x800000... will be required. If a fixed + * address is not required, the kernel will select an address in the proper + * range. + * Other architectures, such as ppc64, i386 or x86_64 are not so constrained. */ #include #include @@ -317,14 +414,8 @@ int main(void) #define LENGTH (256UL*1024*1024) #define PROTECTION (PROT_READ | PROT_WRITE) -/* Only ia64 requires this */ -#ifdef __ia64__ -#define ADDR (void *)(0x8000000000000000UL) -#define FLAGS (MAP_SHARED | MAP_FIXED) -#else -#define ADDR (void *)(0x0UL) +#define ADDR (void *)(0x0UL) /* let kernel choose address */ #define FLAGS (MAP_SHARED) -#endif void check_bytes(char *addr) { -- cgit v1.1 From 9b5e5d0fdc91b73bba8cf5e0fbe3521a953e4e4d Mon Sep 17 00:00:00 2001 From: Lee Schermerhorn Date: Mon, 14 Dec 2009 17:58:32 -0800 Subject: hugetlb: use only nodes with memory for huge pages Register per node hstate sysfs attributes only for nodes with memory. Global replacement of 'all online nodes" with "all nodes with memory" in mm/hugetlb.c. Suggested by David Rientjes. A subsequent patch will handle adding/removing of per node hstate sysfs attributes when nodes transition to/from memoryless state via memory hotplug. NOTE: this patch has not been tested with memoryless nodes. Signed-off-by: Lee Schermerhorn Reviewed-by: Andi Kleen Cc: KAMEZAWA Hiroyuki Cc: Mel Gorman Cc: Randy Dunlap Cc: Nishanth Aravamudan Acked-by: David Rientjes Cc: Adam Litke Cc: Andy Whitcroft Cc: Eric Whitney Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/hugetlbpage.txt | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index 01c3108..6a8e466 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -90,11 +90,11 @@ huge page pool to 20, allocating or freeing huge pages, as required. On a NUMA platform, the kernel will attempt to distribute the huge page pool over all the set of allowed nodes specified by the NUMA memory policy of the task that modifies nr_hugepages. The default for the allowed nodes--when the -task has default memory policy--is all on-line nodes. Allowed nodes with -insufficient available, contiguous memory for a huge page will be silently -skipped when allocating persistent huge pages. See the discussion below of -the interaction of task memory policy, cpusets and per node attributes with -the allocation and freeing of persistent huge pages. +task has default memory policy--is all on-line nodes with memory. Allowed +nodes with insufficient available, contiguous memory for a huge page will be +silently skipped when allocating persistent huge pages. See the discussion +below of the interaction of task memory policy, cpusets and per node attributes +with the allocation and freeing of persistent huge pages. The success or failure of huge page allocation depends on the amount of physically contiguous memory that is present in system at the time of the @@ -226,7 +226,7 @@ resulting effect on persistent huge page allocation is as follows: without first moving to a cpuset that contains all of the desired nodes. 5) Boot-time huge page allocation attempts to distribute the requested number - of huge pages over all on-lines nodes. + of huge pages over all on-lines nodes with memory. Per Node Hugepages Attributes -- cgit v1.1 From 4faf8d950ec438c49ae4526b897c30f8a2cad741 Mon Sep 17 00:00:00 2001 From: Lee Schermerhorn Date: Mon, 14 Dec 2009 17:58:35 -0800 Subject: hugetlb: handle memory hot-plug events Register per node hstate attributes only for nodes with memory. As suggested by David Rientjes. With Memory Hotplug, memory can be added to a memoryless node and a node with memory can become memoryless. Therefore, add a memory on/off-line notifier callback to [un]register a node's attributes on transition to/from memoryless state. N.B., Only tested build, boot, libhugetlbfs regression. i.e., no memory hotplug testing. Signed-off-by: Lee Schermerhorn Reviewed-by: Andi Kleen Acked-by: David Rientjes Cc: KAMEZAWA Hiroyuki Cc: Lee Schermerhorn Cc: Mel Gorman Cc: Randy Dunlap Cc: Nishanth Aravamudan Cc: Adam Litke Cc: Andy Whitcroft Cc: Eric Whitney Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/hugetlbpage.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index 6a8e466..bc31636 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -231,7 +231,8 @@ resulting effect on persistent huge page allocation is as follows: Per Node Hugepages Attributes A subset of the contents of the root huge page control directory in sysfs, -described above, has been replicated under each "node" system device in: +described above, will be replicated under each the system device of each +NUMA node with memory in: /sys/devices/system/node/node[0-9]*/hugepages/ -- cgit v1.1 From dee5d0d518defd0337a41f1a504428c9acc87be5 Mon Sep 17 00:00:00 2001 From: Alex Chiang Date: Mon, 14 Dec 2009 17:59:05 -0800 Subject: mm: add numa node symlink for memory section in sysfs Commit c04fc586c (mm: show node to memory section relationship with symlinks in sysfs) created symlinks from nodes to memory sections, e.g. /sys/devices/system/node/node1/memory135 -> ../../memory/memory135 If you're examining the memory section though and are wondering what node it might belong to, you can find it by grovelling around in sysfs, but it's a little cumbersome. Add a reverse symlink for each memory section that points back to the node to which it belongs. Signed-off-by: Alex Chiang Cc: Gary Hade Cc: Badari Pulavarty Cc: Ingo Molnar Acked-by: David Rientjes Cc: Greg KH Cc: Randy Dunlap Cc: David Rientjes Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-devices-memory | 14 +++++++++++++- Documentation/memory-hotplug.txt | 11 +++++++---- 2 files changed, 20 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-devices-memory b/Documentation/ABI/testing/sysfs-devices-memory index 9fe91c0..bf1627b 100644 --- a/Documentation/ABI/testing/sysfs-devices-memory +++ b/Documentation/ABI/testing/sysfs-devices-memory @@ -60,6 +60,19 @@ Description: Users: hotplug memory remove tools https://w3.opensource.ibm.com/projects/powerpc-utils/ + +What: /sys/devices/system/memoryX/nodeY +Date: October 2009 +Contact: Linux Memory Management list +Description: + When CONFIG_NUMA is enabled, a symbolic link that + points to the corresponding NUMA node directory. + + For example, the following symbolic link is created for + memory section 9 on node0: + /sys/devices/system/memory/memory9/node0 -> ../../node/node0 + + What: /sys/devices/system/node/nodeX/memoryY Date: September 2008 Contact: Gary Hade @@ -70,4 +83,3 @@ Description: memory section directory. For example, the following symbolic link is created for memory section 9 on node0. /sys/devices/system/node/node0/memory9 -> ../../memory/memory9 - diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index bbc8a6a..57e7e9c 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -160,12 +160,15 @@ Under each section, you can see 4 files. NOTE: These directories/files appear after physical memory hotplug phase. -If CONFIG_NUMA is enabled the -/sys/devices/system/memory/memoryXXX memory section -directories can also be accessed via symbolic links located in -the /sys/devices/system/node/node* directories. For example: +If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed +via symbolic links located in the /sys/devices/system/node/node* directories. + +For example: /sys/devices/system/node/node0/memory9 -> ../../memory/memory9 +A backlink will also be created: +/sys/devices/system/memory/memory9/node0 -> ../../node/node0 + -------------------------------- 4. Physical memory hot-add phase -------------------------------- -- cgit v1.1 From cba5dd7fa535b7684cba68e17ac8be5b0083dc3d Mon Sep 17 00:00:00 2001 From: Alex Chiang Date: Mon, 14 Dec 2009 17:59:09 -0800 Subject: Documentation: ABI: /sys/devices/system/cpu/cpu#/node Describe NUMA node symlink created for CPUs when CONFIG_NUMA is set. Signed-off-by: Alex Chiang Cc: Greg KH Cc: Randy Dunlap Cc: Gary Hade Cc: Badari Pulavarty Cc: Ingo Molnar Cc: David Rientjes Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-devices-system-cpu | 14 ++++++++++++++ 1 file changed, 14 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu index 2aae06f..84a710f 100644 --- a/Documentation/ABI/testing/sysfs-devices-system-cpu +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu @@ -92,6 +92,20 @@ Description: Discover NUMA node a CPU belongs to /sys/devices/system/cpu/cpu42/node2 -> ../../node/node2 +What: /sys/devices/system/cpu/cpu#/node +Date: October 2009 +Contact: Linux memory management mailing list +Description: Discover NUMA node a CPU belongs to + + When CONFIG_NUMA is enabled, a symbolic link that points + to the corresponding NUMA node directory. + + For example, the following symlink is created for cpu42 + in NUMA node 2: + + /sys/devices/system/cpu/cpu42/node2 -> ../../node/node2 + + What: /sys/devices/system/cpu/cpu#/topology/core_id /sys/devices/system/cpu/cpu#/topology/core_siblings /sys/devices/system/cpu/cpu#/topology/core_siblings_list -- cgit v1.1 From d0f209f68f80f9a152799760c230019e7f270b2a Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 14 Dec 2009 17:59:34 -0800 Subject: ksm: remove unswappable max_kernel_pages Now that ksm pages are swappable, and the known holes plugged, remove mention of unswappable kernel pages from KSM documentation and comments. Remove the totalram_pages/4 initialization of max_kernel_pages. In fact, remove max_kernel_pages altogether - we can reinstate it if removal turns out to break someone's script; but if we later want to limit KSM's memory usage, limiting the stable nodes would not be an effective approach. Signed-off-by: Hugh Dickins Cc: Izik Eidus Cc: Andrea Arcangeli Cc: Chris Wright Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/ksm.txt | 22 +++++++--------------- 1 file changed, 7 insertions(+), 15 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt index 262d8e6..b392e49 100644 --- a/Documentation/vm/ksm.txt +++ b/Documentation/vm/ksm.txt @@ -16,9 +16,9 @@ by sharing the data common between them. But it can be useful to any application which generates many instances of the same data. KSM only merges anonymous (private) pages, never pagecache (file) pages. -KSM's merged pages are at present locked into kernel memory for as long -as they are shared: so cannot be swapped out like the user pages they -replace (but swapping KSM pages should follow soon in a later release). +KSM's merged pages were originally locked into kernel memory, but can now +be swapped out just like other user pages (but sharing is broken when they +are swapped back in: ksmd must rediscover their identity and merge again). KSM only operates on those areas of address space which an application has advised to be likely candidates for merging, by using the madvise(2) @@ -44,20 +44,12 @@ includes unmapped gaps (though working on the intervening mapped areas), and might fail with EAGAIN if not enough memory for internal structures. Applications should be considerate in their use of MADV_MERGEABLE, -restricting its use to areas likely to benefit. KSM's scans may use -a lot of processing power, and its kernel-resident pages are a limited -resource. Some installations will disable KSM for these reasons. +restricting its use to areas likely to benefit. KSM's scans may use a lot +of processing power: some installations will disable KSM for that reason. The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, readable by all but writable only by root: -max_kernel_pages - set to maximum number of kernel pages that KSM may use - e.g. "echo 100000 > /sys/kernel/mm/ksm/max_kernel_pages" - Value 0 imposes no limit on the kernel pages KSM may use; - but note that any process using MADV_MERGEABLE can cause - KSM to allocate these pages, unswappable until it exits. - Default: quarter of memory (chosen to not pin too much) - pages_to_scan - how many present pages to scan before ksmd goes to sleep e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan" Default: 100 (chosen for demonstration purposes) @@ -75,7 +67,7 @@ run - set 0 to stop ksmd from running but keep merged pages, The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/: -pages_shared - how many shared unswappable kernel pages KSM is using +pages_shared - how many shared pages are being used pages_sharing - how many more sites are sharing them i.e. how much saved pages_unshared - how many pages unique but repeatedly checked for merging pages_volatile - how many pages changing too fast to be placed in a tree @@ -87,4 +79,4 @@ pages_volatile embraces several different kinds of activity, but a high proportion there would also indicate poor use of madvise MADV_MERGEABLE. Izik Eidus, -Hugh Dickins, 24 Sept 2009 +Hugh Dickins, 17 Nov 2009 -- cgit v1.1 From ea637639591def87a54cea811cbac796980cb30d Mon Sep 17 00:00:00 2001 From: Jie Zhang Date: Mon, 14 Dec 2009 18:00:02 -0800 Subject: nommu: fix malloc performance by adding uninitialized flag The NOMMU code currently clears all anonymous mmapped memory. While this is what we want in the default case, all memory allocation from userspace under NOMMU has to go through this interface, including malloc() which is allowed to return uninitialized memory. This can easily be a significant performance penalty. So for constrained embedded systems were security is irrelevant, allow people to avoid clearing memory unnecessarily. This also alters the ELF-FDPIC binfmt such that it obtains uninitialised memory for the brk and stack region. Signed-off-by: Jie Zhang Signed-off-by: Robin Getz Signed-off-by: Mike Frysinger Signed-off-by: David Howells Acked-by: Paul Mundt Acked-by: Greg Ungerer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/nommu-mmap.txt | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) (limited to 'Documentation') diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt index b565e82..8e1ddec 100644 --- a/Documentation/nommu-mmap.txt +++ b/Documentation/nommu-mmap.txt @@ -119,6 +119,32 @@ FURTHER NOTES ON NO-MMU MMAP granule but will only discard the excess if appropriately configured as this has an effect on fragmentation. + (*) The memory allocated by a request for an anonymous mapping will normally + be cleared by the kernel before being returned in accordance with the + Linux man pages (ver 2.22 or later). + + In the MMU case this can be achieved with reasonable performance as + regions are backed by virtual pages, with the contents only being mapped + to cleared physical pages when a write happens on that specific page + (prior to which, the pages are effectively mapped to the global zero page + from which reads can take place). This spreads out the time it takes to + initialize the contents of a page - depending on the write-usage of the + mapping. + + In the no-MMU case, however, anonymous mappings are backed by physical + pages, and the entire map is cleared at allocation time. This can cause + significant delays during a userspace malloc() as the C library does an + anonymous mapping and the kernel then does a memset for the entire map. + + However, for memory that isn't required to be precleared - such as that + returned by malloc() - mmap() can take a MAP_UNINITIALIZED flag to + indicate to the kernel that it shouldn't bother clearing the memory before + returning it. Note that CONFIG_MMAP_ALLOW_UNINITIALIZED must be enabled + to permit this, otherwise the flag will be ignored. + + uClibc uses this to speed up malloc(), and the ELF-FDPIC binfmt uses this + to allocate the brk and stack region. + (*) A list of all the private copy and anonymous mappings on the system is visible through /proc/maps in no-MMU mode. -- cgit v1.1 From 4614a696bd1c3a9af3a08f0e5874830a85b889d4 Mon Sep 17 00:00:00 2001 From: john stultz Date: Mon, 14 Dec 2009 18:00:05 -0800 Subject: procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm Setting a thread's comm to be something unique is a very useful ability and is helpful for debugging complicated threaded applications. However currently the only way to set a thread name is for the thread to name itself via the PR_SET_NAME prctl. However, there may be situations where it would be advantageous for a thread dispatcher to be naming the threads its managing, rather then having the threads self-describe themselves. This sort of behavior is available on other systems via the pthread_setname_np() interface. This patch exports a task's comm via proc/pid/comm and proc/pid/task/tid/comm interfaces, and allows thread siblings to write to these values. [akpm@linux-foundation.org: cleanups] Signed-off-by: John Stultz Cc: Andi Kleen Cc: Arjan van de Ven Cc: Mike Fulton Cc: Sean Foley Cc: Darren Hart Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 94b9f20..220cc63 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -38,6 +38,7 @@ Table of Contents 3.3 /proc//io - Display the IO accounting fields 3.4 /proc//coredump_filter - Core dump filtering settings 3.5 /proc//mountinfo - Information about mounts + 3.6 /proc//comm & /proc//task//comm ------------------------------------------------------------------------------ @@ -1409,3 +1410,11 @@ For more information on mount propagation see: Documentation/filesystems/sharedsubtree.txt + +3.6 /proc//comm & /proc//task//comm +-------------------------------------------------------- +These files provide a method to access a tasks comm value. It also allows for +a task to set its own or one of its thread siblings comm value. The comm value +is limited in size compared to the cmdline value, so writing anything longer +then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated +comm value. -- cgit v1.1 From 4eb174bee6f8623fed1af0072f1bebfc3b513a52 Mon Sep 17 00:00:00 2001 From: Michael Hennerich Date: Mon, 14 Dec 2009 18:00:15 -0800 Subject: ad525x_dpot: new driver for AD525x digital potentiometers This driver supports the non-volatile digital potentiometers via I2C: AD5258, AD5259, AD5251, AD5252, AD5253, AD5254, and AD5255 It provides a sysfs interface to each device for reading/writing which is documented in Documentation/misc-devices/ad525x_dpot.txt. Signed-off-by: Michael Hennerich Signed-off-by: Chris Verges Signed-off-by: Mike Frysinger Cc: Jean Delvare Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/misc-devices/ad525x_dpot.txt | 57 ++++++++++++++++++++++++++++++ 1 file changed, 57 insertions(+) create mode 100644 Documentation/misc-devices/ad525x_dpot.txt (limited to 'Documentation') diff --git a/Documentation/misc-devices/ad525x_dpot.txt b/Documentation/misc-devices/ad525x_dpot.txt new file mode 100644 index 0000000..0c9413b --- /dev/null +++ b/Documentation/misc-devices/ad525x_dpot.txt @@ -0,0 +1,57 @@ +--------------------------------- + AD525x Digital Potentiometers +--------------------------------- + +The ad525x_dpot driver exports a simple sysfs interface. This allows you to +work with the immediate resistance settings as well as update the saved startup +settings. Access to the factory programmed tolerance is also provided, but +interpretation of this settings is required by the end application according to +the specific part in use. + +--------- + Files +--------- + +Each dpot device will have a set of eeprom, rdac, and tolerance files. How +many depends on the actual part you have, as will the range of allowed values. + +The eeprom files are used to program the startup value of the device. + +The rdac files are used to program the immediate value of the device. + +The tolerance files are the read-only factory programmed tolerance settings +and may vary greatly on a part-by-part basis. For exact interpretation of +this field, please consult the datasheet for your part. This is presented +as a hex file for easier parsing. + +----------- + Example +----------- + +Locate the device in your sysfs tree. This is probably easiest by going into +the common i2c directory and locating the device by the i2c slave address. + + # ls /sys/bus/i2c/devices/ + 0-0022 0-0027 0-002f + +So assuming the device in question is on the first i2c bus and has the slave +address of 0x2f, we descend (unrelated sysfs entries have been trimmed). + + # ls /sys/bus/i2c/devices/0-002f/ + eeprom0 rdac0 tolerance0 + +You can use simple reads/writes to access these files: + + # cd /sys/bus/i2c/devices/0-002f/ + + # cat eeprom0 + 0 + # echo 10 > eeprom0 + # cat eeprom0 + 10 + + # cat rdac0 + 5 + # echo 3 > rdac0 + # cat rdac0 + 3 -- cgit v1.1 From 948c1e2521979c332b21b623414cf258150f214e Mon Sep 17 00:00:00 2001 From: Amerigo Wang Date: Mon, 14 Dec 2009 18:00:22 -0800 Subject: kallsyms: remove deprecated print_fn_descriptor_symbol() According to feature-removal-schedule.txt, it is the time to remove print_fn_descriptor_symbol(). And a quick grep shows that it no longer has any callers. Signed-off-by: WANG Cong Cc: Bjorn Helgaas Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/feature-removal-schedule.txt | 9 --------- 1 file changed, 9 deletions(-) (limited to 'Documentation') diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index eb2c138..21ab935 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -291,15 +291,6 @@ Who: Michael Buesch --------------------------- -What: print_fn_descriptor_symbol() -When: October 2009 -Why: The %pF vsprintf format provides the same functionality in a - simpler way. print_fn_descriptor_symbol() is deprecated but - still present to give out-of-tree modules time to change. -Who: Bjorn Helgaas - ---------------------------- - What: /sys/o2cb symlink When: January 2010 Why: /sys/fs/o2cb is the proper location for this information - /sys/o2cb -- cgit v1.1 From 41e9a062361de204d3710038925ae7f356ebb40d Mon Sep 17 00:00:00 2001 From: Daniel J Blueman Date: Mon, 14 Dec 2009 18:01:37 -0800 Subject: hwmon: w83627ehf updates Add control of fan minimum turn-on output levels, decoupling it from the fan turn-off output level. Add control of rate of change of fan output level. These in turn allow lower turn-off rotor speed and smoother transitions for better thermal and acoustic control authority. Add support for constant fan speed and proportional-response operations modes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Daniel J Blueman Cc: Jean Delvare Cc: David Hubbard Cc: Hans de Goede Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/hwmon/w83627ehf | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/hwmon/w83627ehf b/Documentation/hwmon/w83627ehf index 02b7489..b7e42ec 100644 --- a/Documentation/hwmon/w83627ehf +++ b/Documentation/hwmon/w83627ehf @@ -81,8 +81,14 @@ pwm[1-4] - this file stores PWM duty cycle or DC value (fan speed) in range: 0 (stop) to 255 (full) pwm[1-4]_enable - this file controls mode of fan/temperature control: - * 1 Manual Mode, write to pwm file any value 0-255 (full speed) - * 2 Thermal Cruise + * 1 Manual mode, write to pwm file any value 0-255 (full speed) + * 2 "Thermal Cruise" mode + * 3 "Fan Speed Cruise" mode + * 4 "Smart Fan III" mode + +pwm[1-4]_mode - controls if output is PWM or DC level + * 0 DC output (0 - 12v) + * 1 PWM output Thermal Cruise mode ------------------- -- cgit v1.1 From bc62c1471773fc32adcfc05100abd16fa2b6e126 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=C3=89ric=20Piel?= Date: Mon, 14 Dec 2009 18:01:39 -0800 Subject: lis3: update documentation and comments MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Most of the documentation and comments were written when the driver was only supporting one type of chip, only via ACPI/HP. Update the info to the much clearer understanding that we have now. Signed-off-by: Éric Piel Signed-off-by: Samu Onkalo Cc: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/hwmon/lis3lv02d | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/hwmon/lis3lv02d b/Documentation/hwmon/lis3lv02d index effe949..21f0902 100644 --- a/Documentation/hwmon/lis3lv02d +++ b/Documentation/hwmon/lis3lv02d @@ -3,7 +3,8 @@ Kernel driver lis3lv02d Supported chips: - * STMicroelectronics LIS3LV02DL and LIS3LV02DQ + * STMicroelectronics LIS3LV02DL, LIS3LV02DQ (12 bits precision) + * STMicroelectronics LIS302DL, LIS3L02DQ, LIS331DL (8 bits) Authors: Yan Burman @@ -13,13 +14,12 @@ Authors: Description ----------- -This driver provides support for the accelerometer found in various HP -laptops sporting the feature officially called "HP Mobile Data -Protection System 3D" or "HP 3D DriveGuard". It detects automatically -laptops with this sensor. Known models (for now the HP 2133, nc6420, -nc2510, nc8510, nc84x0, nw9440 and nx9420) will have their axis -automatically oriented on standard way (eg: you can directly play -neverball). The accelerometer data is readable via +This driver provides support for the accelerometer found in various HP laptops +sporting the feature officially called "HP Mobile Data Protection System 3D" or +"HP 3D DriveGuard". It detects automatically laptops with this sensor. Known +models (full list can be found in drivers/hwmon/hp_accel.c) will have their +axis automatically oriented on standard way (eg: you can directly play +neverball). The accelerometer data is readable via /sys/devices/platform/lis3lv02d. Sysfs attributes under /sys/devices/platform/lis3lv02d/: @@ -33,12 +33,16 @@ rate - reports the sampling rate of the accelerometer device in HZ This driver also provides an absolute input class device, allowing the laptop to act as a pinball machine-esque joystick. +On HP laptops, if the led infrastructure is activated, support for a led +indicating disk protection will be provided as /sys/class/leds/hp::hddprotect. + Another feature of the driver is misc device called "freefall" that acts similar to /dev/rtc and reacts on free-fall interrupts received from the device. It supports blocking operations, poll/select and fasync operation modes. You must read 1 bytes from the device. The result is number of free-fall interrupts since the last successful -read (or 255 if number of interrupts would not fit). +read (or 255 if number of interrupts would not fit). See the hpfall.c +file for an example on using the device. Axes orientation @@ -55,7 +59,7 @@ the accelerometer are converted into a "standard" organisation of the axes * If the laptop is put upside-down, Z becomes negative If your laptop model is not recognized (cf "dmesg"), you can send an -email to the authors to add it to the database. When reporting a new +email to the maintainer to add it to the database. When reporting a new laptop, please include the output of "dmidecode" plus the value of /sys/devices/platform/lis3lv02d/position in these four cases. -- cgit v1.1 From e956e6b77054f91580deb81ba6389cc8c8ce67ef Mon Sep 17 00:00:00 2001 From: Samu Onkalo Date: Mon, 14 Dec 2009 18:01:46 -0800 Subject: lis3: update documentation to match latest changes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit [akpm@linux-foundation.org: s/q/g/ (Randy)] Signed-off-by: Samu Onkalo Acked-by: Éric Piel Cc: Pavel Machek Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/hwmon/lis3lv02d | 31 ++++++++++++++++++++++++------- 1 file changed, 24 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/hwmon/lis3lv02d b/Documentation/hwmon/lis3lv02d index 21f0902..06534f2 100644 --- a/Documentation/hwmon/lis3lv02d +++ b/Documentation/hwmon/lis3lv02d @@ -20,18 +20,35 @@ sporting the feature officially called "HP Mobile Data Protection System 3D" or models (full list can be found in drivers/hwmon/hp_accel.c) will have their axis automatically oriented on standard way (eg: you can directly play neverball). The accelerometer data is readable via -/sys/devices/platform/lis3lv02d. +/sys/devices/platform/lis3lv02d. Reported values are scaled +to mg values (1/1000th of earth gravity). Sysfs attributes under /sys/devices/platform/lis3lv02d/: position - 3D position that the accelerometer reports. Format: "(x,y,z)" -calibrate - read: values (x, y, z) that are used as the base for input - class device operation. - write: forces the base to be recalibrated with the current - position. -rate - reports the sampling rate of the accelerometer device in HZ +rate - read reports the sampling rate of the accelerometer device in HZ. + write changes sampling rate of the accelerometer device. + Only values which are supported by HW are accepted. +selftest - performs selftest for the chip as specified by chip manufacturer. This driver also provides an absolute input class device, allowing -the laptop to act as a pinball machine-esque joystick. +the laptop to act as a pinball machine-esque joystick. Joystick device can be +calibrated. Joystick device can be in two different modes. +By default output values are scaled between -32768 .. 32767. In joystick raw +mode, joystick and sysfs position entry have the same scale. There can be +small difference due to input system fuzziness feature. +Events are also available as input event device. + +Selftest is meant only for hardware diagnostic purposes. It is not meant to be +used during normal operations. Position data is not corrupted during selftest +but interrupt behaviour is not guaranteed to work reliably. In test mode, the +sensing element is internally moved little bit. Selftest measures difference +between normal mode and test mode. Chip specifications tell the acceptance +limit for each type of the chip. Limits are provided via platform data +to allow adjustment of the limits without a change to the actual driver. +Seltest returns either "OK x y z" or "FAIL x y z" where x, y and z are +measured difference between modes. Axes are not remapped in selftest mode. +Measurement values are provided to help HW diagnostic applications to make +final decision. On HP laptops, if the led infrastructure is activated, support for a led indicating disk protection will be provided as /sys/class/leds/hp::hddprotect. -- cgit v1.1 From eac8ea536aded07004bde917f05a2329902c64b0 Mon Sep 17 00:00:00 2001 From: Laurent Pinchart Date: Fri, 27 Nov 2009 13:56:50 -0300 Subject: V4L/DVB (13549): v4l: Add video_device_node_name function Many drivers access the device number (video_device::v4l2_devnode::num) in order to print the video device node name. Add and use a helper function to retrieve the video_device node name. Signed-off-by: Laurent Pinchart Signed-off-by: Mauro Carvalho Chehab --- Documentation/video4linux/v4l2-framework.txt | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/video4linux/v4l2-framework.txt b/Documentation/video4linux/v4l2-framework.txt index b806eda..74d677c 100644 --- a/Documentation/video4linux/v4l2-framework.txt +++ b/Documentation/video4linux/v4l2-framework.txt @@ -561,6 +561,8 @@ video_device helper functions There are a few useful helper functions: +- file/video_device private data + You can set/get driver private data in the video_device struct using: void *video_get_drvdata(struct video_device *vdev); @@ -575,8 +577,7 @@ struct video_device *video_devdata(struct file *file); returns the video_device belonging to the file struct. -The final helper function combines video_get_drvdata with -video_devdata: +The video_drvdata function combines video_get_drvdata with video_devdata: void *video_drvdata(struct file *file); @@ -584,6 +585,17 @@ You can go from a video_device struct to the v4l2_device struct using: struct v4l2_device *v4l2_dev = vdev->v4l2_dev; +- Device node name + +The video_device node kernel name can be retrieved using + +const char *video_device_node_name(struct video_device *vdev); + +The name is used as a hint by userspace tools such as udev. The function +should be used where possible instead of accessing the video_device::num and +video_device::minor fields. + + video buffer helper functions ----------------------------- -- cgit v1.1 From 116dc5c50aff62064bfcceaa15abf44e9277578a Mon Sep 17 00:00:00 2001 From: Jean-Francois Moine Date: Tue, 1 Dec 2009 07:20:34 -0300 Subject: V4L/DVB (13562): gspca - doc: Update webcam list. Signed-off-by: Jean-Francois Moine Signed-off-by: Mauro Carvalho Chehab --- Documentation/video4linux/gspca.txt | 34 ++++++++++++++++++++++++---------- 1 file changed, 24 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/video4linux/gspca.txt b/Documentation/video4linux/gspca.txt index 319d983..1800a62 100644 --- a/Documentation/video4linux/gspca.txt +++ b/Documentation/video4linux/gspca.txt @@ -12,6 +12,7 @@ m5602 0402:5602 ALi Video Camera Controller spca501 040a:0002 Kodak DVC-325 spca500 040a:0300 Kodak EZ200 zc3xx 041e:041e Creative WebCam Live! +ov519 041e:4003 Video Blaster WebCam Go Plus spca500 041e:400a Creative PC-CAM 300 sunplus 041e:400b Creative PC-CAM 600 sunplus 041e:4012 PC-Cam350 @@ -168,10 +169,14 @@ sunplus 055f:c650 Mustek MDC5500Z zc3xx 055f:d003 Mustek WCam300A zc3xx 055f:d004 Mustek WCam300 AN conex 0572:0041 Creative Notebook cx11646 +ov519 05a9:0511 Video Blaster WebCam 3/WebCam Plus, D-Link USB Digital Video Camera +ov519 05a9:0518 Creative WebCam ov519 05a9:0519 OV519 Microphone ov519 05a9:0530 OmniVision +ov519 05a9:2800 OmniVision SuperCAM ov519 05a9:4519 Webcam Classic ov519 05a9:8519 OmniVision +ov519 05a9:a511 D-Link USB Digital Video Camera ov519 05a9:a518 D-Link DSB-C310 Webcam sunplus 05da:1018 Digital Dream Enigma 1.3 stk014 05e1:0893 Syntek DV4000 @@ -187,7 +192,7 @@ ov534 06f8:3002 Hercules Blog Webcam ov534 06f8:3003 Hercules Dualpix HD Weblog sonixj 06f8:3004 Hercules Classic Silver sonixj 06f8:3008 Hercules Deluxe Optical Glass -pac7311 06f8:3009 Hercules Classic Link +pac7302 06f8:3009 Hercules Classic Link spca508 0733:0110 ViewQuest VQ110 spca501 0733:0401 Intel Create and Share spca501 0733:0402 ViewQuest M318B @@ -199,6 +204,7 @@ sunplus 0733:2221 Mercury Digital Pro 3.1p sunplus 0733:3261 Concord 3045 spca536a sunplus 0733:3281 Cyberpix S550V spca506 0734:043b 3DeMon USB Capture aka +ov519 0813:0002 Dual Mode USB Camera Plus spca500 084d:0003 D-Link DSC-350 spca500 08ca:0103 Aiptek PocketDV sunplus 08ca:0104 Aiptek PocketDVII 1.3 @@ -236,15 +242,15 @@ pac7311 093a:2603 Philips SPC 500 NC pac7311 093a:2608 Trust WB-3300p pac7311 093a:260e Gigaware VGA PC Camera, Trust WB-3350p, SIGMA cam 2350 pac7311 093a:260f SnakeCam -pac7311 093a:2620 Apollo AC-905 -pac7311 093a:2621 PAC731x -pac7311 093a:2622 Genius Eye 312 -pac7311 093a:2624 PAC7302 -pac7311 093a:2626 Labtec 2200 -pac7311 093a:2628 Genius iLook 300 -pac7311 093a:2629 Genious iSlim 300 -pac7311 093a:262a Webcam 300k -pac7311 093a:262c Philips SPC 230 NC +pac7302 093a:2620 Apollo AC-905 +pac7302 093a:2621 PAC731x +pac7302 093a:2622 Genius Eye 312 +pac7302 093a:2624 PAC7302 +pac7302 093a:2626 Labtec 2200 +pac7302 093a:2628 Genius iLook 300 +pac7302 093a:2629 Genious iSlim 300 +pac7302 093a:262a Webcam 300k +pac7302 093a:262c Philips SPC 230 NC jeilinj 0979:0280 Sakar 57379 zc3xx 0ac8:0302 Z-star Vimicro zc0302 vc032x 0ac8:0321 Vimicro generic vc0321 @@ -259,6 +265,7 @@ vc032x 0ac8:c002 Sony embedded vimicro vc032x 0ac8:c301 Samsung Q1 Ultra Premium spca508 0af9:0010 Hama USB Sightcam 100 spca508 0af9:0011 Hama USB Sightcam 100 +ov519 0b62:0059 iBOT2 Webcam sonixb 0c45:6001 Genius VideoCAM NB sonixb 0c45:6005 Microdia Sweex Mini Webcam sonixb 0c45:6007 Sonix sn9c101 + Tas5110D @@ -318,8 +325,10 @@ sn9c20x 0c45:62b3 PC Camera (SN9C202 + OV9655) sn9c20x 0c45:62bb PC Camera (SN9C202 + OV7660) sn9c20x 0c45:62bc PC Camera (SN9C202 + HV7131R) sunplus 0d64:0303 Sunplus FashionCam DXG +ov519 0e96:c001 TRUST 380 USB2 SPACEC@M etoms 102c:6151 Qcam Sangha CIF etoms 102c:6251 Qcam xxxxxx VGA +ov519 1046:9967 W9967CF/W9968CF WebCam IC, Video Blaster WebCam Go zc3xx 10fd:0128 Typhoon Webshot II USB 300k 0x0128 spca561 10fd:7e50 FlyCam Usb 100 zc3xx 10fd:8050 Typhoon Webshot II USB 300k @@ -332,7 +341,12 @@ spca501 1776:501c Arowana 300K CMOS Camera t613 17a1:0128 TASCORP JPEG Webcam, NGS Cyclops vc032x 17ef:4802 Lenovo Vc0323+MI1310_SOC pac207 2001:f115 D-Link DSB-C120 +sq905c 2770:9050 sq905c +sq905c 2770:905c DualCamera +sq905 2770:9120 Argus Digital Camera DC1512 +sq905c 2770:913d sq905c spca500 2899:012c Toptro Industrial +ov519 8020:ef04 ov519 spca508 8086:0110 Intel Easy PC Camera spca500 8086:0630 Intel Pocket PC Camera spca506 99fa:8988 Grandtec V.cap -- cgit v1.1 From 007701e2c4de94109864c8e85e39b6d8fc474170 Mon Sep 17 00:00:00 2001 From: Muralidharan Karicheri Date: Thu, 3 Dec 2009 01:13:17 -0300 Subject: V4L/DVB (13572): v4l2-spec: Digital Video Timings API documentation This patch updates the v4l2-dvb documentation for the new video timings API added. Also updated the document based on comments from Hans Verkuil. Signed-off-by: Muralidharan Karicheri Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab --- Documentation/DocBook/media-entities.tmpl | 18 ++ Documentation/DocBook/media-indices.tmpl | 4 + Documentation/DocBook/v4l/common.xml | 35 +++ Documentation/DocBook/v4l/v4l2.xml | 4 + Documentation/DocBook/v4l/videodev2.h.xml | 116 +++++++++- .../DocBook/v4l/vidioc-enum-dv-presets.xml | 238 +++++++++++++++++++++ Documentation/DocBook/v4l/vidioc-enuminput.xml | 36 +++- Documentation/DocBook/v4l/vidioc-enumoutput.xml | 36 +++- Documentation/DocBook/v4l/vidioc-g-dv-preset.xml | 111 ++++++++++ Documentation/DocBook/v4l/vidioc-g-dv-timings.xml | 224 +++++++++++++++++++ .../DocBook/v4l/vidioc-query-dv-preset.xml | 85 ++++++++ 11 files changed, 903 insertions(+), 4 deletions(-) create mode 100644 Documentation/DocBook/v4l/vidioc-enum-dv-presets.xml create mode 100644 Documentation/DocBook/v4l/vidioc-g-dv-preset.xml create mode 100644 Documentation/DocBook/v4l/vidioc-g-dv-timings.xml create mode 100644 Documentation/DocBook/v4l/vidioc-query-dv-preset.xml (limited to 'Documentation') diff --git a/Documentation/DocBook/media-entities.tmpl b/Documentation/DocBook/media-entities.tmpl index bb5ab74..c725cb8 100644 --- a/Documentation/DocBook/media-entities.tmpl +++ b/Documentation/DocBook/media-entities.tmpl @@ -23,6 +23,7 @@ VIDIOC_ENUMINPUT"> VIDIOC_ENUMOUTPUT"> VIDIOC_ENUMSTD"> +VIDIOC_ENUM_DV_PRESETS"> VIDIOC_ENUM_FMT"> VIDIOC_ENUM_FRAMEINTERVALS"> VIDIOC_ENUM_FRAMESIZES"> @@ -30,6 +31,8 @@ VIDIOC_G_AUDOUT"> VIDIOC_G_CROP"> VIDIOC_G_CTRL"> +VIDIOC_G_DV_PRESET"> +VIDIOC_G_DV_TIMINGS"> VIDIOC_G_ENC_INDEX"> VIDIOC_G_EXT_CTRLS"> VIDIOC_G_FBUF"> @@ -53,6 +56,7 @@ VIDIOC_QUERYCTRL"> VIDIOC_QUERYMENU"> VIDIOC_QUERYSTD"> +VIDIOC_QUERY_DV_PRESET"> VIDIOC_REQBUFS"> VIDIOC_STREAMOFF"> VIDIOC_STREAMON"> @@ -60,6 +64,8 @@ VIDIOC_S_AUDOUT"> VIDIOC_S_CROP"> VIDIOC_S_CTRL"> +VIDIOC_S_DV_PRESET"> +VIDIOC_S_DV_TIMINGS"> VIDIOC_S_EXT_CTRLS"> VIDIOC_S_FBUF"> VIDIOC_S_FMT"> @@ -118,6 +124,7 @@ v4l2_audio"> v4l2_audioout"> +v4l2_bt_timings"> v4l2_buffer"> v4l2_capability"> v4l2_captureparm"> @@ -128,6 +135,9 @@ v4l2_dbg_chip_ident"> v4l2_dbg_match"> v4l2_dbg_register"> +v4l2_dv_enum_preset"> +v4l2_dv_preset"> +v4l2_dv_timings"> v4l2_enc_idx"> v4l2_enc_idx_entry"> v4l2_encoder_cmd"> @@ -243,6 +253,10 @@ + + + + @@ -333,6 +347,10 @@ + + + + diff --git a/Documentation/DocBook/media-indices.tmpl b/Documentation/DocBook/media-indices.tmpl index 9e30a23..78d6031 100644 --- a/Documentation/DocBook/media-indices.tmpl +++ b/Documentation/DocBook/media-indices.tmpl @@ -36,6 +36,7 @@ enum v4l2_preemphasis struct v4l2_audio struct v4l2_audioout +struct v4l2_bt_timings struct v4l2_buffer struct v4l2_capability struct v4l2_captureparm @@ -46,6 +47,9 @@ struct v4l2_dbg_chip_ident struct v4l2_dbg_match struct v4l2_dbg_register +struct v4l2_dv_enum_preset +struct v4l2_dv_preset +struct v4l2_dv_timings struct v4l2_enc_idx struct v4l2_enc_idx_entry struct v4l2_encoder_cmd diff --git a/Documentation/DocBook/v4l/common.xml b/Documentation/DocBook/v4l/common.xml index b1a81d2..c65f0ac 100644 --- a/Documentation/DocBook/v4l/common.xml +++ b/Documentation/DocBook/v4l/common.xml @@ -716,6 +716,41 @@ if (-1 == ioctl (fd, &VIDIOC-S-STD;, &std_id)) { } +
+ Digital Video (DV) Timings + + The video standards discussed so far has been dealing with Analog TV and the +corresponding video timings. Today there are many more different hardware interfaces +such as High Definition TV interfaces (HDMI), VGA, DVI connectors etc., that carry +video signals and there is a need to extend the API to select the video timings +for these interfaces. Since it is not possible to extend the &v4l2-std-id; due to +the limited bits available, a new set of IOCTLs is added to set/get video timings at +the input and output: + + DV Presets: Digital Video (DV) presets. These are IDs representing a +video timing at the input/output. Presets are pre-defined timings implemented +by the hardware according to video standards. A __u32 data type is used to represent +a preset unlike the bit mask that is used in &v4l2-std-id; allowing future extensions +to support as many different presets as needed. + + + Custom DV Timings: This will allow applications to define more detailed +custom video timings for the interface. This includes parameters such as width, height, +polarities, frontporch, backporch etc. + + + + To enumerate and query the attributes of DV presets supported by a device, +applications use the &VIDIOC-ENUM-DV-PRESETS; ioctl. To get the current DV preset, +applications use the &VIDIOC-G-DV-PRESET; ioctl and to set a preset they use the +&VIDIOC-S-DV-PRESET; ioctl. + To set custom DV timings for the device, applications use the +&VIDIOC-S-DV-TIMINGS; ioctl and to get current custom DV timings they use the +&VIDIOC-G-DV-TIMINGS; ioctl. + Applications can make use of the and + flags to decide what ioctls are available to set the +video timings for the device. +
&sub-controls; diff --git a/Documentation/DocBook/v4l/v4l2.xml b/Documentation/DocBook/v4l/v4l2.xml index 937b415..77b7dd8 100644 --- a/Documentation/DocBook/v4l/v4l2.xml +++ b/Documentation/DocBook/v4l/v4l2.xml @@ -411,6 +411,7 @@ and discussions on the V4L mailing list. &sub-encoder-cmd; &sub-enumaudio; &sub-enumaudioout; + &sub-enum-dv-presets; &sub-enum-fmt; &sub-enum-framesizes; &sub-enum-frameintervals; @@ -421,6 +422,8 @@ and discussions on the V4L mailing list. &sub-g-audioout; &sub-g-crop; &sub-g-ctrl; + &sub-g-dv-preset; + &sub-g-dv-timings; &sub-g-enc-index; &sub-g-ext-ctrls; &sub-g-fbuf; @@ -441,6 +444,7 @@ and discussions on the V4L mailing list. &sub-querybuf; &sub-querycap; &sub-queryctrl; + &sub-query-dv-preset; &sub-querystd; &sub-reqbufs; &sub-s-hw-freq-seek; diff --git a/Documentation/DocBook/v4l/videodev2.h.xml b/Documentation/DocBook/v4l/videodev2.h.xml index 3e282ed..0683259 100644 --- a/Documentation/DocBook/v4l/videodev2.h.xml +++ b/Documentation/DocBook/v4l/videodev2.h.xml @@ -734,6 +734,99 @@ struct v4l2_standard { }; /* + * V I D E O T I M I N G S D V P R E S E T + */ +struct v4l2_dv_preset { + __u32 preset; + __u32 reserved[4]; +}; + +/* + * D V P R E S E T S E N U M E R A T I O N + */ +struct v4l2_dv_enum_preset { + __u32 index; + __u32 preset; + __u8 name[32]; /* Name of the preset timing */ + __u32 width; + __u32 height; + __u32 reserved[4]; +}; + +/* + * D V P R E S E T V A L U E S + */ +#define V4L2_DV_INVALID 0 +#define V4L2_DV_480P59_94 1 /* BT.1362 */ +#define V4L2_DV_576P50 2 /* BT.1362 */ +#define V4L2_DV_720P24 3 /* SMPTE 296M */ +#define V4L2_DV_720P25 4 /* SMPTE 296M */ +#define V4L2_DV_720P30 5 /* SMPTE 296M */ +#define V4L2_DV_720P50 6 /* SMPTE 296M */ +#define V4L2_DV_720P59_94 7 /* SMPTE 274M */ +#define V4L2_DV_720P60 8 /* SMPTE 274M/296M */ +#define V4L2_DV_1080I29_97 9 /* BT.1120/ SMPTE 274M */ +#define V4L2_DV_1080I30 10 /* BT.1120/ SMPTE 274M */ +#define V4L2_DV_1080I25 11 /* BT.1120 */ +#define V4L2_DV_1080I50 12 /* SMPTE 296M */ +#define V4L2_DV_1080I60 13 /* SMPTE 296M */ +#define V4L2_DV_1080P24 14 /* SMPTE 296M */ +#define V4L2_DV_1080P25 15 /* SMPTE 296M */ +#define V4L2_DV_1080P30 16 /* SMPTE 296M */ +#define V4L2_DV_1080P50 17 /* BT.1120 */ +#define V4L2_DV_1080P60 18 /* BT.1120 */ + +/* + * D V B T T I M I N G S + */ + +/* BT.656/BT.1120 timing data */ +struct v4l2_bt_timings { + __u32 width; /* width in pixels */ + __u32 height; /* height in lines */ + __u32 interlaced; /* Interlaced or progressive */ + __u32 polarities; /* Positive or negative polarity */ + __u64 pixelclock; /* Pixel clock in HZ. Ex. 74.25MHz->74250000 */ + __u32 hfrontporch; /* Horizpontal front porch in pixels */ + __u32 hsync; /* Horizontal Sync length in pixels */ + __u32 hbackporch; /* Horizontal back porch in pixels */ + __u32 vfrontporch; /* Vertical front porch in pixels */ + __u32 vsync; /* Vertical Sync length in lines */ + __u32 vbackporch; /* Vertical back porch in lines */ + __u32 il_vfrontporch; /* Vertical front porch for bottom field of + * interlaced field formats + */ + __u32 il_vsync; /* Vertical sync length for bottom field of + * interlaced field formats + */ + __u32 il_vbackporch; /* Vertical back porch for bottom field of + * interlaced field formats + */ + __u32 reserved[16]; +} __attribute__ ((packed)); + +/* Interlaced or progressive format */ +#define V4L2_DV_PROGRESSIVE 0 +#define V4L2_DV_INTERLACED 1 + +/* Polarities. If bit is not set, it is assumed to be negative polarity */ +#define V4L2_DV_VSYNC_POS_POL 0x00000001 +#define V4L2_DV_HSYNC_POS_POL 0x00000002 + + +/* DV timings */ +struct v4l2_dv_timings { + __u32 type; + union { + struct v4l2_bt_timings bt; + __u32 reserved[32]; + }; +} __attribute__ ((packed)); + +/* Values for the type field */ +#define V4L2_DV_BT_656_1120 0 /* BT.656/1120 timing type */ + +/* * V I D E O I N P U T S */ struct v4l2_input { @@ -744,7 +837,8 @@ struct v4l2_input { __u32 tuner; /* Associated tuner */ v4l2_std_id std; __u32 status; - __u32 reserved[4]; + __u32 capabilities; + __u32 reserved[3]; }; /* Values for the 'type' field */ @@ -775,6 +869,11 @@ struct v4l2_input { #define V4L2_IN_ST_NO_ACCESS 0x02000000 /* Conditional access denied */ #define V4L2_IN_ST_VTR 0x04000000 /* VTR time constant */ +/* capabilities flags */ +#define V4L2_IN_CAP_PRESETS 0x00000001 /* Supports S_DV_PRESET */ +#define V4L2_IN_CAP_CUSTOM_TIMINGS 0x00000002 /* Supports S_DV_TIMINGS */ +#define V4L2_IN_CAP_STD 0x00000004 /* Supports S_STD */ + /* * V I D E O O U T P U T S */ @@ -785,13 +884,19 @@ struct v4l2_output { __u32 audioset; /* Associated audios (bitfield) */ __u32 modulator; /* Associated modulator */ v4l2_std_id std; - __u32 reserved[4]; + __u32 capabilities; + __u32 reserved[3]; }; /* Values for the 'type' field */ #define V4L2_OUTPUT_TYPE_MODULATOR 1 #define V4L2_OUTPUT_TYPE_ANALOG 2 #define V4L2_OUTPUT_TYPE_ANALOGVGAOVERLAY 3 +/* capabilities flags */ +#define V4L2_OUT_CAP_PRESETS 0x00000001 /* Supports S_DV_PRESET */ +#define V4L2_OUT_CAP_CUSTOM_TIMINGS 0x00000002 /* Supports S_DV_TIMINGS */ +#define V4L2_OUT_CAP_STD 0x00000004 /* Supports S_STD */ + /* * C O N T R O L S */ @@ -1626,6 +1731,13 @@ struct v4l2_dbg_chip_ident { #endif #define VIDIOC_S_HW_FREQ_SEEK _IOW('V', 82, struct v4l2_hw_freq_seek) +#define VIDIOC_ENUM_DV_PRESETS _IOWR('V', 83, struct v4l2_dv_enum_preset) +#define VIDIOC_S_DV_PRESET _IOWR('V', 84, struct v4l2_dv_preset) +#define VIDIOC_G_DV_PRESET _IOWR('V', 85, struct v4l2_dv_preset) +#define VIDIOC_QUERY_DV_PRESET _IOR('V', 86, struct v4l2_dv_preset) +#define VIDIOC_S_DV_TIMINGS _IOWR('V', 87, struct v4l2_dv_timings) +#define VIDIOC_G_DV_TIMINGS _IOWR('V', 88, struct v4l2_dv_timings) + /* Reminder: when adding new ioctls please add support for them to drivers/media/video/v4l2-compat-ioctl32.c as well! */ diff --git a/Documentation/DocBook/v4l/vidioc-enum-dv-presets.xml b/Documentation/DocBook/v4l/vidioc-enum-dv-presets.xml new file mode 100644 index 0000000..1d31427 --- /dev/null +++ b/Documentation/DocBook/v4l/vidioc-enum-dv-presets.xml @@ -0,0 +1,238 @@ + + + ioctl VIDIOC_ENUM_DV_PRESETS + &manvol; + + + + VIDIOC_ENUM_DV_PRESETS + Enumerate supported Digital Video presets + + + + + + int ioctl + int fd + int request + struct v4l2_dv_enum_preset *argp + + + + + + Arguments + + + + fd + + &fd; + + + + request + + VIDIOC_ENUM_DV_PRESETS + + + + argp + + + + + + + + + Description + + To query the attributes of a DV preset, applications initialize the +index field and zero the reserved array of &v4l2-dv-enum-preset; +and call the VIDIOC_ENUM_DV_PRESETS ioctl with a pointer to this +structure. Drivers fill the rest of the structure or return an +&EINVAL; when the index is out of bounds. To enumerate all DV Presets supported, +applications shall begin at index zero, incrementing by one until the +driver returns EINVAL. Drivers may enumerate a +different set of DV presets after switching the video input or +output. + + + struct <structname>v4l2_dv_enum_presets</structname> + + &cs-str; + + + __u32 + index + Number of the DV preset, set by the +application. + + + __u32 + preset + This field identifies one of the DV preset values listed in . + + + __u8 + name[24] + Name of the preset, a NUL-terminated ASCII string, for example: "720P-60", "1080I-60". This information is +intended for the user. + + + __u32 + width + Width of the active video in pixels for the DV preset. + + + __u32 + height + Height of the active video in lines for the DV preset. + + + __u32 + reserved[4] + Reserved for future extensions. Drivers must set the array to zero. + + + +
+ + + struct <structname>DV Presets</structname> + + &cs-str; + + + Preset + Preset value + Description + + + + + + + + V4L2_DV_INVALID + 0 + Invalid preset value. + + + V4L2_DV_480P59_94 + 1 + 720x480 progressive video at 59.94 fps as per BT.1362. + + + V4L2_DV_576P50 + 2 + 720x576 progressive video at 50 fps as per BT.1362. + + + V4L2_DV_720P24 + 3 + 1280x720 progressive video at 24 fps as per SMPTE 296M. + + + V4L2_DV_720P25 + 4 + 1280x720 progressive video at 25 fps as per SMPTE 296M. + + + V4L2_DV_720P30 + 5 + 1280x720 progressive video at 30 fps as per SMPTE 296M. + + + V4L2_DV_720P50 + 6 + 1280x720 progressive video at 50 fps as per SMPTE 296M. + + + V4L2_DV_720P59_94 + 7 + 1280x720 progressive video at 59.94 fps as per SMPTE 274M. + + + V4L2_DV_720P60 + 8 + 1280x720 progressive video at 60 fps as per SMPTE 274M/296M. + + + V4L2_DV_1080I29_97 + 9 + 1920x1080 interlaced video at 29.97 fps as per BT.1120/SMPTE 274M. + + + V4L2_DV_1080I30 + 10 + 1920x1080 interlaced video at 30 fps as per BT.1120/SMPTE 274M. + + + V4L2_DV_1080I25 + 11 + 1920x1080 interlaced video at 25 fps as per BT.1120. + + + V4L2_DV_1080I50 + 12 + 1920x1080 interlaced video at 50 fps as per SMPTE 296M. + + + V4L2_DV_1080I60 + 13 + 1920x1080 interlaced video at 60 fps as per SMPTE 296M. + + + V4L2_DV_1080P24 + 14 + 1920x1080 progressive video at 24 fps as per SMPTE 296M. + + + V4L2_DV_1080P25 + 15 + 1920x1080 progressive video at 25 fps as per SMPTE 296M. + + + V4L2_DV_1080P30 + 16 + 1920x1080 progressive video at 30 fps as per SMPTE 296M. + + + V4L2_DV_1080P50 + 17 + 1920x1080 progressive video at 50 fps as per BT.1120. + + + V4L2_DV_1080P60 + 18 + 1920x1080 progressive video at 60 fps as per BT.1120. + + + +
+
+ + + &return-value; + + + + EINVAL + + The &v4l2-dv-enum-preset; index +is out of bounds. + + + + +
+ + diff --git a/Documentation/DocBook/v4l/vidioc-enuminput.xml b/Documentation/DocBook/v4l/vidioc-enuminput.xml index 414856b..71b868e 100644 --- a/Documentation/DocBook/v4l/vidioc-enuminput.xml +++ b/Documentation/DocBook/v4l/vidioc-enuminput.xml @@ -124,7 +124,13 @@ current input. __u32 - reserved[4] + capabilities + This field provides capabilities for the +input. See for flags. + + + __u32 + reserved[3] Reserved for future extensions. Drivers must set the array to zero. @@ -261,6 +267,34 @@ flag is set Macrovision has been detected. + + + + Input capabilities + + &cs-def; + + + V4L2_IN_CAP_PRESETS + 0x00000001 + This input supports setting DV presets by using VIDIOC_S_DV_PRESET. + + + V4L2_OUT_CAP_CUSTOM_TIMINGS + 0x00000002 + This input supports setting custom video timings by using VIDIOC_S_DV_TIMINGS. + + + V4L2_IN_CAP_STD + 0x00000004 + This input supports setting the TV standard by using VIDIOC_S_STD. + + + +
diff --git a/Documentation/DocBook/v4l/vidioc-enumoutput.xml b/Documentation/DocBook/v4l/vidioc-enumoutput.xml index e8d16dc..a281d26 100644 --- a/Documentation/DocBook/v4l/vidioc-enumoutput.xml +++ b/Documentation/DocBook/v4l/vidioc-enumoutput.xml @@ -114,7 +114,13 @@ details on video standards and how to switch see __u32 - reserved[4] + capabilities + This field provides capabilities for the +output. See for flags. + + + __u32 + reserved[3] Reserved for future extensions. Drivers must set the array to zero. @@ -147,6 +153,34 @@ CVBS, S-Video, RGB. + + + Output capabilities + + &cs-def; + + + V4L2_OUT_CAP_PRESETS + 0x00000001 + This output supports setting DV presets by using VIDIOC_S_DV_PRESET. + + + V4L2_OUT_CAP_CUSTOM_TIMINGS + 0x00000002 + This output supports setting custom video timings by using VIDIOC_S_DV_TIMINGS. + + + V4L2_OUT_CAP_STD + 0x00000004 + This output supports setting the TV standard by using VIDIOC_S_STD. + + + +
+
&return-value; diff --git a/Documentation/DocBook/v4l/vidioc-g-dv-preset.xml b/Documentation/DocBook/v4l/vidioc-g-dv-preset.xml new file mode 100644 index 0000000..3c6784e --- /dev/null +++ b/Documentation/DocBook/v4l/vidioc-g-dv-preset.xml @@ -0,0 +1,111 @@ + + + ioctl VIDIOC_G_DV_PRESET, VIDIOC_S_DV_PRESET + &manvol; + + + + VIDIOC_G_DV_PRESET + VIDIOC_S_DV_PRESET + Query or select the DV preset of the current input or output + + + + + + int ioctl + int fd + int request + &v4l2-dv-preset; +*argp + + + + + + Arguments + + + + fd + + &fd; + + + + request + + VIDIOC_G_DV_PRESET, VIDIOC_S_DV_PRESET + + + + argp + + + + + + + + + Description + To query and select the current DV preset, applications +use the VIDIOC_G_DV_PRESET and VIDIOC_S_DV_PRESET +ioctls which take a pointer to a &v4l2-dv-preset; type as argument. +Applications must zero the reserved array in &v4l2-dv-preset;. +VIDIOC_G_DV_PRESET returns a dv preset in the field +preset of &v4l2-dv-preset;. + + VIDIOC_S_DV_PRESET accepts a pointer to a &v4l2-dv-preset; +that has the preset value to be set. Applications must zero the reserved array in &v4l2-dv-preset;. +If the preset is not supported, it returns an &EINVAL; + + + + &return-value; + + + + EINVAL + + This ioctl is not supported, or the +VIDIOC_S_DV_PRESET,VIDIOC_S_DV_PRESET parameter was unsuitable. + + + + EBUSY + + The device is busy and therefore can not change the preset. + + + + + + struct <structname>v4l2_dv_preset</structname> + + &cs-str; + + + __u32 + preset + Preset value to represent the digital video timings + + + __u32 + reserved[4] + Reserved fields for future use + + + +
+ +
+
+ + diff --git a/Documentation/DocBook/v4l/vidioc-g-dv-timings.xml b/Documentation/DocBook/v4l/vidioc-g-dv-timings.xml new file mode 100644 index 0000000..ecc1957 --- /dev/null +++ b/Documentation/DocBook/v4l/vidioc-g-dv-timings.xml @@ -0,0 +1,224 @@ + + + ioctl VIDIOC_G_DV_TIMINGS, VIDIOC_S_DV_TIMINGS + &manvol; + + + + VIDIOC_G_DV_TIMINGS + VIDIOC_S_DV_TIMINGS + Get or set custom DV timings for input or output + + + + + + int ioctl + int fd + int request + &v4l2-dv-timings; +*argp + + + + + + Arguments + + + + fd + + &fd; + + + + request + + VIDIOC_G_DV_TIMINGS, VIDIOC_S_DV_TIMINGS + + + + argp + + + + + + + + + Description + To set custom DV timings for the input or output, applications use the +VIDIOC_S_DV_TIMINGS ioctl and to get the current custom timings, +applications use the VIDIOC_G_DV_TIMINGS ioctl. The detailed timing +information is filled in using the structure &v4l2-dv-timings;. These ioctls take +a pointer to the &v4l2-dv-timings; structure as argument. If the ioctl is not supported +or the timing values are not correct, the driver returns &EINVAL;. + + + + &return-value; + + + + EINVAL + + This ioctl is not supported, or the +VIDIOC_S_DV_TIMINGS parameter was unsuitable. + + + + EBUSY + + The device is busy and therefore can not change the timings. + + + + + + struct <structname>v4l2_bt_timings</structname> + + &cs-str; + + + __u32 + width + Width of the active video in pixels + + + __u32 + height + Height of the active video in lines + + + __u32 + interlaced + Progressive (0) or interlaced (1) + + + __u32 + polarities + This is a bit mask that defines polarities of sync signals. +bit 0 (V4L2_DV_VSYNC_POS_POL) is for vertical sync polarity and bit 1 (V4L2_DV_HSYNC_POS_POL) is for horizontal sync polarity. If the bit is set +(1) it is positive polarity and if is cleared (0), it is negative polarity. + + + __u64 + pixelclock + Pixel clock in Hz. Ex. 74.25MHz->74250000 + + + __u32 + hfrontporch + Horizontal front porch in pixels + + + __u32 + hsync + Horizontal sync length in pixels + + + __u32 + hbackporch + Horizontal back porch in pixels + + + __u32 + vfrontporch + Vertical front porch in lines + + + __u32 + vsync + Vertical sync length in lines + + + __u32 + vbackporch + Vertical back porch in lines + + + __u32 + il_vfrontporch + Vertical front porch in lines for bottom field of interlaced field formats + + + __u32 + il_vsync + Vertical sync length in lines for bottom field of interlaced field formats + + + __u32 + il_vbackporch + Vertical back porch in lines for bottom field of interlaced field formats + + + +
+ + + struct <structname>v4l2_dv_timings</structname> + + &cs-str; + + + __u32 + type + + Type of DV timings as listed in . + + + union + + + + + + &v4l2-bt-timings; + bt + Timings defined by BT.656/1120 specifications + + + + __u32 + reserved[32] + + + + +
+ + + DV Timing types + + &cs-str; + + + Timing type + value + Description + + + + + + + + V4L2_DV_BT_656_1120 + 0 + BT.656/1120 timings + + + +
+
+
+ + diff --git a/Documentation/DocBook/v4l/vidioc-query-dv-preset.xml b/Documentation/DocBook/v4l/vidioc-query-dv-preset.xml new file mode 100644 index 0000000..87e4f0f --- /dev/null +++ b/Documentation/DocBook/v4l/vidioc-query-dv-preset.xml @@ -0,0 +1,85 @@ + + + ioctl VIDIOC_QUERY_DV_PRESET + &manvol; + + + + VIDIOC_QUERY_DV_PRESET + Sense the DV preset received by the current +input + + + + + + int ioctl + int fd + int request + &v4l2-dv-preset; *argp + + + + + + Arguments + + + + fd + + &fd; + + + + request + + VIDIOC_QUERY_DV_PRESET + + + + argp + + + + + + + + + Description + + The hardware may be able to detect the current DV preset +automatically, similar to sensing the video standard. To do so, applications +call VIDIOC_QUERY_DV_PRESET with a pointer to a +&v4l2-dv-preset; type. Once the hardware detects a preset, that preset is +returned in the preset field of &v4l2-dv-preset;. When detection is not +possible or fails, the value V4L2_DV_INVALID is returned. + + + + &return-value; + + + EINVAL + + This ioctl is not supported. + + + + EBUSY + + The device is busy and therefore can not sense the preset + + + + + + + -- cgit v1.1 From b33f5f8af3fabe6a080c9d06f8f3184a4de3f3c2 Mon Sep 17 00:00:00 2001 From: Hans Verkuil Date: Thu, 3 Dec 2009 01:32:12 -0300 Subject: V4L/DVB (13573): v4l2-spec: updated revision history, updated version to 2.6.33. Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab --- Documentation/DocBook/v4l/compat.xml | 16 ++++++++++++---- Documentation/DocBook/v4l/v4l2.xml | 22 ++++++++++++++++++++-- 2 files changed, 32 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/v4l/compat.xml b/Documentation/DocBook/v4l/compat.xml index 4d1902a..b9dbdf9 100644 --- a/Documentation/DocBook/v4l/compat.xml +++ b/Documentation/DocBook/v4l/compat.xml @@ -2291,8 +2291,8 @@ was renamed to v4l2_chip_ident_old New control V4L2_CID_COLORFX was added. - - + +
V4L2 in Linux 2.6.32 @@ -2322,8 +2322,16 @@ more information. Added Remote Controller chapter, describing the default Remote Controller mapping for media devices. - -
+ + +
+ V4L2 in Linux 2.6.33 + + + Added support for Digital Video timings in order to support HDTV receivers and transmitters. + + +
diff --git a/Documentation/DocBook/v4l/v4l2.xml b/Documentation/DocBook/v4l/v4l2.xml index 77b7dd8..060105a 100644 --- a/Documentation/DocBook/v4l/v4l2.xml +++ b/Documentation/DocBook/v4l/v4l2.xml @@ -74,6 +74,17 @@ Remote Controller chapter. + + + Muralidharan + Karicheri + Documented the Digital Video timings API. + +
+ m-karicheri2@ti.com +
+
+
@@ -89,7 +100,7 @@ Remote Controller chapter. 2008 2009 Bill Dirks, Michael H. Schimek, Hans Verkuil, Martin -Rubli, Andy Walls, Mauro Carvalho Chehab +Rubli, Andy Walls, Muralidharan Karicheri, Mauro Carvalho Chehab Except when explicitly stated as GPL, programming examples within @@ -103,6 +114,13 @@ structs, ioctls) must be noted in more detail in the history chapter applications. --> + 2.6.33 + 2009-12-03 + mk + Added documentation for the Digital Video timings API. + + + 2.6.32 2009-08-31 mcc @@ -355,7 +373,7 @@ and discussions on the V4L mailing list. Video for Linux Two API Specification - Revision 2.6.32 + Revision 2.6.33 &sub-common; -- cgit v1.1 From 422eaac5d5367bab71fa2cc33393b2ea894498fe Mon Sep 17 00:00:00 2001 From: Muralidharan Karicheri Date: Thu, 10 Dec 2009 10:31:44 -0300 Subject: V4L/DVB (13619): v4l2-spec: Adds EBUSY error code for S_STD and QUERYSTD ioctls During review of Video Timing API documentation, Hans Verkuil had a comment on adding EBUSY error code for VIDIOC_S_STD and VIDIOC_QUERYSTD ioctls. This patch updates the document for this. Signed-off-by: Muralidharan Karicheri Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab --- Documentation/DocBook/v4l/vidioc-g-std.xml | 6 ++++++ Documentation/DocBook/v4l/vidioc-querystd.xml | 6 ++++++ 2 files changed, 12 insertions(+) (limited to 'Documentation') diff --git a/Documentation/DocBook/v4l/vidioc-g-std.xml b/Documentation/DocBook/v4l/vidioc-g-std.xml index b6f5d26..912f851 100644 --- a/Documentation/DocBook/v4l/vidioc-g-std.xml +++ b/Documentation/DocBook/v4l/vidioc-g-std.xml @@ -86,6 +86,12 @@ standards. VIDIOC_S_STD parameter was unsuitable. + + EBUSY + + The device is busy and therefore can not change the standard + + diff --git a/Documentation/DocBook/v4l/vidioc-querystd.xml b/Documentation/DocBook/v4l/vidioc-querystd.xml index b5a7ff9..1a9e603 100644 --- a/Documentation/DocBook/v4l/vidioc-querystd.xml +++ b/Documentation/DocBook/v4l/vidioc-querystd.xml @@ -70,6 +70,12 @@ current video input or output. This ioctl is not supported. + + EBUSY + + The device is busy and therefore can not detect the standard + + -- cgit v1.1 From 329e4e18dfdc552f36b0642a3de5ebfa96063666 Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Tue, 15 Dec 2009 21:51:08 -0200 Subject: thinkpad-acpi: volume subdriver rewrite I don't trust the coupled EC writes and SMI calls the current volume control code does very much, although it is exactly what the IBM DSDTs seem to do (they never do more than a single step though). Change the driver to stop issuing SMIs, and just drive the EC directly to the desired level (DSDTs seem to confirm this will work even on very old models like the 570 and 600e/x). We checkpoint directly to NVRAM (this can be turned off) at suspend/shutdown/driver unload, which from what I can see in tbp, should also work on every ThinkPad. Signed-off-by: Henrique de Moraes Holschuh Cc: Lorne Applebaum Cc: Matthew Garrett Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index f5056c7..96687d0 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -1092,22 +1092,33 @@ WARNING: its level up and down at every change. -Volume control -- /proc/acpi/ibm/volume ---------------------------------------- +Volume control +-------------- + +procfs: /proc/acpi/ibm/volume -This feature allows volume control on ThinkPad models which don't have -a hardware volume knob. The available commands are: +This feature allows volume control on ThinkPad models with a digital +volume knob, as well as mute/unmute control. The available commands are: echo up >/proc/acpi/ibm/volume echo down >/proc/acpi/ibm/volume echo mute >/proc/acpi/ibm/volume echo 'level ' >/proc/acpi/ibm/volume -The number range is 0 to 15 although not all of them may be +The number range is 0 to 14 although not all of them may be distinct. The unmute the volume after the mute command, use either the up or down command (the level command will not unmute the volume). The current volume level and mute state is shown in the file. +There are two strategies for volume control. To select which one +should be used, use the volume_mode module parameter: volume_mode=1 +selects EC mode, and volume_mode=3 selects EC mode with NVRAM backing +(so that volume/mute changes are remembered across shutdown/reboot). + +The driver will operate in volume_mode=3 by default. If that does not +work well on your ThinkPad model, please report this to +ibm-acpi-devel@lists.sourceforge.net. + The ALSA mixer interface to this feature is still missing, but patches to add it exist. That problem should be addressed in the not so distant future. @@ -1376,6 +1387,7 @@ to enable more than one output class, just add their values. 0x0008 HKEY event interface, hotkeys 0x0010 Fan control 0x0020 Backlight brightness + 0x0040 Audio mixer/volume control There is also a kernel build option to enable more debugging information, which may be necessary to debug driver problems. -- cgit v1.1 From a112ceee673629afc204bf6b4a4828a6143a083f Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Tue, 15 Dec 2009 21:51:09 -0200 Subject: thinkpad-acpi: support MUTE-only ThinkPads Lenovo removed the extra mixer since the T61 and thereabouts. Newer Lenovo models only have the mute gate function, and leave the volume control to the HDA mixer. Until a way to automatically query the firmware about its audio control capabilities is discovered (there might not be any), use a white/black list. We will likely need to ask T60 (old and new model) and Z60/Z61 users whether they have volume control to populate the black/white list. Meanwhile, provide a volume_capabilities parameter that can be used to override the defaults. Signed-off-by: Henrique de Moraes Holschuh Cc: Lorne Applebaum Cc: Matthew Garrett Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index 96687d0..bd87682 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -1098,18 +1098,31 @@ Volume control procfs: /proc/acpi/ibm/volume This feature allows volume control on ThinkPad models with a digital -volume knob, as well as mute/unmute control. The available commands are: +volume knob (when available, not all models have it), as well as +mute/unmute control. The available commands are: echo up >/proc/acpi/ibm/volume echo down >/proc/acpi/ibm/volume echo mute >/proc/acpi/ibm/volume + echo unmute >/proc/acpi/ibm/volume echo 'level ' >/proc/acpi/ibm/volume The number range is 0 to 14 although not all of them may be distinct. The unmute the volume after the mute command, use either the -up or down command (the level command will not unmute the volume). +up or down command (the level command will not unmute the volume), or +the unmute command. + The current volume level and mute state is shown in the file. +You can use the volume_capabilities parameter to tell the driver +whether your thinkpad has volume control or mute-only control: +volume_capabilities=1 for mixers with mute and volume control, +volume_capabilities=2 for mixers with only mute control. + +If the driver misdetects the capabilities for your ThinkPad model, +please report this to ibm-acpi-devel@lists.sourceforge.net, so that we +can update the driver. + There are two strategies for volume control. To select which one should be used, use the volume_mode module parameter: volume_mode=1 selects EC mode, and volume_mode=3 selects EC mode with NVRAM backing @@ -1450,3 +1463,5 @@ Sysfs interface changelog: is deprecated and marked for removal. 0x020600: Marker for backlight change event support. + +0x020700: Support for mute-only mixers. -- cgit v1.1 From c7ac6291ea7ebc568a1fce16fed87d102898f264 Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Tue, 15 Dec 2009 21:51:10 -0200 Subject: thinkpad-acpi: disable volume control Disable volume control by default. It can be enabled at module load time by a module parameter (volume_control=1). The audio control mixer that thinkpad-acpi interacts with is fully functional without any drivers, and operated by hotkeys. The idea behind the console audio control is that the human operator is the only one that can interact with it. The ThinkVantage suite in Windows does not allow any software-based overrides, and only does OSD (on-screen-display) functions. The Linux driver will, with the addition of the ALSA interface, try to follow and enforce the ThinkVantage UI design: The user is supposed to use the keyboard hotkeys to interact with the console audio control. The kernel and the desktop environment is supposed to cooperate to provide proper user feedback through on-screen-display functions. Distros are urged to not to enable volume control by default. Enabling this must be a local admin's decision. This is the reason why there is no Kconfig option. Keep in mind that all ThinkPads have a normal, main mixer (AC97 or HDA) for regular software-based audio control. We are not talking about that mixer here. Advanced users are, of course, free to enable volume control and do as they please. Signed-off-by: Henrique de Moraes Holschuh Cc: Lorne Applebaum Cc: Matthew Garrett Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index bd87682..6a58143 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -1097,6 +1097,18 @@ Volume control procfs: /proc/acpi/ibm/volume +NOTE: by default, the volume control interface operates in read-only +mode, as it is supposed to be used for on-screen-display purposes. +The read/write mode can be enabled through the use of the +"volume_control=1" module parameter. + +NOTE: distros are urged to not enable volume_control by default, this +should be done by the local admin only. The ThinkPad UI is for the +console audio control to be done through the volume keys only, and for +the desktop environment to just provide on-screen-display feedback. +Software volume control should be done only in the main AC97/HDA +mixer. + This feature allows volume control on ThinkPad models with a digital volume knob (when available, not all models have it), as well as mute/unmute control. The available commands are: @@ -1465,3 +1477,4 @@ Sysfs interface changelog: 0x020600: Marker for backlight change event support. 0x020700: Support for mute-only mixers. + Volume control in read-only mode by default. -- cgit v1.1 From 0d204c34e85d1d63e5fdd3e3192747daf0ee7ec1 Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Tue, 15 Dec 2009 21:51:11 -0200 Subject: thinkpad-acpi: basic ALSA mixer support (v2) Add the basic ALSA mixer functionality. The mixer is event-driven, and will work fine on IBM ThinkPads. I expect Lenovo ThinkPads will cause some trouble with the event interface. Heavily based on work by Lorne Applebaum and ideas from Matthew Garrett . Signed-off-by: Henrique de Moraes Holschuh Cc: Lorne Applebaum Cc: Matthew Garrett Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index 6a58143..b4ed30c 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -1096,6 +1096,7 @@ Volume control -------------- procfs: /proc/acpi/ibm/volume +ALSA: "ThinkPad Console Audio Control", default ID: "ThinkPadEC" NOTE: by default, the volume control interface operates in read-only mode, as it is supposed to be used for on-screen-display purposes. @@ -1144,9 +1145,8 @@ The driver will operate in volume_mode=3 by default. If that does not work well on your ThinkPad model, please report this to ibm-acpi-devel@lists.sourceforge.net. -The ALSA mixer interface to this feature is still missing, but patches -to add it exist. That problem should be addressed in the not so -distant future. +The driver supports the standard ALSA module parameters. If the ALSA +mixer is disabled, the driver will disable all volume functionality. Fan control and monitoring: fan speed, fan enable/disable @@ -1478,3 +1478,4 @@ Sysfs interface changelog: 0x020700: Support for mute-only mixers. Volume control in read-only mode by default. + Marker for ALSA mixer support. -- cgit v1.1 From 5d2eb14d36723eba0b31ae208bc346835751e944 Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Tue, 15 Dec 2009 21:51:13 -0200 Subject: thinkpad-acpi: bump version to 0.24 Signed-off-by: Henrique de Moraes Holschuh Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index b4ed30c..169091f 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -1,7 +1,7 @@ ThinkPad ACPI Extras Driver - Version 0.23 - April 10th, 2009 + Version 0.24 + December 11th, 2009 Borislav Deianov Henrique de Moraes Holschuh -- cgit v1.1 From 0e9052eb98a9986ec0669d030604f7a68f6df638 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Wed, 16 Dec 2009 12:19:57 +0100 Subject: page-types: add standard GPL license header Signed-off-by: Wu Fengguang Signed-off-by: Andi Kleen --- Documentation/vm/page-types.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c index 7a7d9ba..66e9358 100644 --- a/Documentation/vm/page-types.c +++ b/Documentation/vm/page-types.c @@ -1,11 +1,22 @@ /* * page-types: Tool for querying page flags * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; version 2. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should find a copy of v2 of the GNU General Public License somewhere on + * your Linux system; if not, write to the Free Software Foundation, Inc., 59 + * Temple Place, Suite 330, Boston, MA 02111-1307 USA. + * * Copyright (C) 2009 Intel corporation * * Authors: Wu Fengguang - * - * Released under the General Public License (GPL). */ #define _LARGEFILE64_SOURCE -- cgit v1.1 From 847ce401df392b0704369fd3f75df614ac1414b4 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Wed, 16 Dec 2009 12:19:58 +0100 Subject: HWPOISON: Add unpoisoning support The unpoisoning interface is useful for stress testing tools to reclaim poisoned pages (to prevent OOM) There is no hardware level unpoisioning, so this cannot be used for real memory errors, only for software injected errors. Note that it may leak pages silently - those who have been removed from LRU cache, but not isolated from page cache/swap cache at hwpoison time. Especially the stress test of dirty swap cache pages shall reboot system before exhausting memory. AK: Fix comments, add documentation, add printks, rename symbol Signed-off-by: Wu Fengguang Signed-off-by: Andi Kleen --- Documentation/vm/hwpoison.txt | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt index 3ffadf8..f047e75 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.txt @@ -98,10 +98,22 @@ madvise(MADV_POISON, ....) hwpoison-inject module through debugfs - /sys/debug/hwpoison/corrupt-pfn -Inject hwpoison fault at PFN echoed into this file +/sys/debug/hwpoison/ +corrupt-pfn + +Inject hwpoison fault at PFN echoed into this file. + +unpoison-pfn + +Software-unpoison page at PFN echoed into this file. This +way a page can be reused again. +This only works for Linux injected failures, not for real +memory failures. + +Note these injection interfaces are not stable and might change between +kernel versions Architecture specific MCE injector -- cgit v1.1 From 7c116f2b0dbac4a1dd051c7a5e8cef37701cafd4 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Wed, 16 Dec 2009 12:19:59 +0100 Subject: HWPOISON: add fs/device filters Filesystem data/metadata present the most tricky-to-isolate pages. It requires careful code review and stress testing to get them right. The fs/device filter helps to target the stress tests to some specific filesystem pages. The filter condition is block device's major/minor numbers: - corrupt-filter-dev-major - corrupt-filter-dev-minor When specified (non -1), only page cache pages that belong to that device will be poisoned. The filters are checked reliably on the locked and refcounted page. Haicheng: clear PG_hwpoison and drop bad page count if filter not OK AK: Add documentation CC: Haicheng Li CC: Nick Piggin Signed-off-by: Wu Fengguang Signed-off-by: Andi Kleen --- Documentation/vm/hwpoison.txt | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'Documentation') diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt index f047e75..fdf5804 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.txt @@ -115,6 +115,13 @@ memory failures. Note these injection interfaces are not stable and might change between kernel versions +corrupt-filter-dev-major +corrupt-filter-dev-minor + +Only handle memory failures to pages associated with the file system defined +by block device major/minor. -1U is the wildcard value. +This should be only used for testing with artificial injection. + Architecture specific MCE injector x86 has mce-inject, mce-test -- cgit v1.1 From 31d3d3484f9bd263925ecaa341500ac2df3a5d9b Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Wed, 16 Dec 2009 12:19:59 +0100 Subject: HWPOISON: limit hwpoison injector to known page types __memory_failure()'s workflow is set PG_hwpoison //... unset PG_hwpoison if didn't pass hwpoison filter That could kill unrelated process if it happens to page fault on the page with the (temporary) PG_hwpoison. The race should be big enough to appear in stress tests. Fix it by grabbing the page and checking filter at inject time. This also avoids the very noisy "Injecting memory failure..." messages. - we don't touch madvise() based injection, because the filters are generally not necessary for it. - if we want to apply the filters to h/w aided injection, we'd better to rearrange the logic in __memory_failure() instead of this patch. AK: fix documentation, use drain all, cleanups CC: Haicheng Li Signed-off-by: Wu Fengguang Signed-off-by: Andi Kleen --- Documentation/vm/hwpoison.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt index fdf5804..4ef7bb3 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.txt @@ -103,7 +103,8 @@ hwpoison-inject module through debugfs corrupt-pfn -Inject hwpoison fault at PFN echoed into this file. +Inject hwpoison fault at PFN echoed into this file. This does +some early filtering to avoid corrupted unintended pages in test suites. unpoison-pfn -- cgit v1.1 From 478c5ffc0b50527bd2390f2daa46cc16276b8413 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Wed, 16 Dec 2009 12:19:59 +0100 Subject: HWPOISON: add page flags filter When specified, only poison pages if ((page_flags & mask) == value). - corrupt-filter-flags-mask - corrupt-filter-flags-value This allows stress testing of many kinds of pages. Strictly speaking, the buddy pages requires taking zone lock, to avoid setting PG_hwpoison on a "was buddy but now allocated to someone" page. However we can just do nothing because we set PG_locked in the beginning, this prevents the page allocator from allocating it to someone. (It will BUG() on the unexpected PG_locked, which is fine for hwpoison testing.) [AK: Add select PROC_PAGE_MONITOR to satisfy dependency] CC: Nick Piggin Signed-off-by: Wu Fengguang Signed-off-by: Andi Kleen --- Documentation/vm/hwpoison.txt | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'Documentation') diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt index 4ef7bb3..f454d3c 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.txt @@ -123,6 +123,16 @@ Only handle memory failures to pages associated with the file system defined by block device major/minor. -1U is the wildcard value. This should be only used for testing with artificial injection. + +corrupt-filter-flags-mask +corrupt-filter-flags-value + +When specified, only poison pages if ((page_flags & mask) == value). +This allows stress testing of many kinds of pages. The page_flags +are the same as in /proc/kpageflags. The flag bits are defined in +include/linux/kernel-page-flags.h and documented in +Documentation/vm/pagemap.txt + Architecture specific MCE injector x86 has mce-inject, mce-test -- cgit v1.1 From 4fd466eb46a6a917c317a87fb94bfc7252a0f7ed Mon Sep 17 00:00:00 2001 From: Andi Kleen Date: Wed, 16 Dec 2009 12:19:59 +0100 Subject: HWPOISON: add memory cgroup filter The hwpoison test suite need to inject hwpoison to a collection of selected task pages, and must not touch pages not owned by them and thus kill important system processes such as init. (But it's OK to mis-hwpoison free/unowned pages as well as shared clean pages. Mis-hwpoison of shared dirty pages will kill all tasks, so the test suite will target all or non of such tasks in the first place.) The memory cgroup serves this purpose well. We can put the target processes under the control of a memory cgroup, and tell the hwpoison injection code to only kill pages associated with some active memory cgroup. The prerequisite for doing hwpoison stress tests with mem_cgroup is, the mem_cgroup code tracks task pages _accurately_ (unless page is locked). Which we believe is/should be true. The benefits are simplification of hwpoison injector code. Also the mem_cgroup code will automatically be tested by hwpoison test cases. The alternative interfaces pin-pfn/unpin-pfn can also delegate the (process and page flags) filtering functions reliably to user space. However prototype implementation shows that this scheme adds more complexity than we wanted. Example test case: mkdir /cgroup/hwpoison usemem -m 100 -s 1000 & echo `jobs -p` > /cgroup/hwpoison/tasks memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ') echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg page-types -p `pidof init` --hwpoison # shall do nothing page-types -p `pidof usemem` --hwpoison # poison its pages [AK: Fix documentation] [Add fix for problem noticed by Li Zefan ; dentry in the css could be NULL] CC: KOSAKI Motohiro CC: Hugh Dickins CC: Daisuke Nishimura CC: Balbir Singh CC: KAMEZAWA Hiroyuki CC: Li Zefan CC: Paul Menage CC: Nick Piggin CC: Andi Kleen Signed-off-by: Wu Fengguang Signed-off-by: Andi Kleen --- Documentation/vm/hwpoison.txt | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'Documentation') diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt index f454d3c..989e5af 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.txt @@ -123,6 +123,22 @@ Only handle memory failures to pages associated with the file system defined by block device major/minor. -1U is the wildcard value. This should be only used for testing with artificial injection. +corrupt-filter-memcg + +Limit injection to pages owned by memgroup. Specified by inode number +of the memcg. + +Example: + mkdir /cgroup/hwpoison + + usemem -m 100 -s 1000 & + echo `jobs -p` > /cgroup/hwpoison/tasks + + memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ') + echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg + + page-types -p `pidof init` --hwpoison # shall do nothing + page-types -p `pidof usemem` --hwpoison # poison its pages corrupt-filter-flags-mask corrupt-filter-flags-value -- cgit v1.1 From fe194d3e100dea323d7b2de96d3b44d0c067ba7a Mon Sep 17 00:00:00 2001 From: Andi Kleen Date: Wed, 16 Dec 2009 12:20:00 +0100 Subject: HWPOISON: Use correct name for MADV_HWPOISON in documentation Signed-off-by: Andi Kleen --- Documentation/vm/hwpoison.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt index 989e5af..12f9ba2 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.txt @@ -92,7 +92,7 @@ PR_MCE_KILL_GET Testing: -madvise(MADV_POISON, ....) +madvise(MADV_HWPOISON, ....) (as root) Poison a page in the process for testing -- cgit v1.1 From facb6011f3993947283fa15d039dacb4ad140230 Mon Sep 17 00:00:00 2001 From: Andi Kleen Date: Wed, 16 Dec 2009 12:20:00 +0100 Subject: HWPOISON: Add soft page offline support This is a simpler, gentler variant of memory_failure() for soft page offlining controlled from user space. It doesn't kill anything, just tries to invalidate and if that doesn't work migrate the page away. This is useful for predictive failure analysis, where a page has a high rate of corrected errors, but hasn't gone bad yet. Instead it can be offlined early and avoided. The offlining is controlled from sysfs, including a new generic entry point for hard page offlining for symmetry too. We use the page isolate facility to prevent re-allocation race. Normally this is only used by memory hotplug. To avoid races with memory allocation I am using lock_system_sleep(). This avoids the situation where memory hotplug is about to isolate a page range and then hwpoison undoes that work. This is a big hammer currently, but the simplest solution currently. When the page is not free or LRU we try to free pages from slab and other caches. The slab freeing is currently quite dumb and does not try to focus on the specific slab cache which might own the page. This could be potentially improved later. Thanks to Fengguang Wu and Haicheng Li for some fixes. [Added fix from Andrew Morton to adapt to new migrate_pages prototype] Signed-off-by: Andi Kleen --- .../ABI/testing/sysfs-memory-page-offline | 44 ++++++++++++++++++++++ 1 file changed, 44 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-memory-page-offline (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-memory-page-offline b/Documentation/ABI/testing/sysfs-memory-page-offline new file mode 100644 index 0000000..e14703f --- /dev/null +++ b/Documentation/ABI/testing/sysfs-memory-page-offline @@ -0,0 +1,44 @@ +What: /sys/devices/system/memory/soft_offline_page +Date: Sep 2009 +KernelVersion: 2.6.33 +Contact: andi@firstfloor.org +Description: + Soft-offline the memory page containing the physical address + written into this file. Input is a hex number specifying the + physical address of the page. The kernel will then attempt + to soft-offline it, by moving the contents elsewhere or + dropping it if possible. The kernel will then be placed + on the bad page list and never be reused. + + The offlining is done in kernel specific granuality. + Normally it's the base page size of the kernel, but + this might change. + + The page must be still accessible, not poisoned. The + kernel will never kill anything for this, but rather + fail the offline. Return value is the size of the + number, or a error when the offlining failed. Reading + the file is not allowed. + +What: /sys/devices/system/memory/hard_offline_page +Date: Sep 2009 +KernelVersion: 2.6.33 +Contact: andi@firstfloor.org +Description: + Hard-offline the memory page containing the physical + address written into this file. Input is a hex number + specifying the physical address of the page. The + kernel will then attempt to hard-offline the page, by + trying to drop the page or killing any owner or + triggering IO errors if needed. Note this may kill + any processes owning the page. The kernel will avoid + to access this page assuming it's poisoned by the + hardware. + + The offlining is done in kernel specific granuality. + Normally it's the base page size of the kernel, but + this might change. + + Return value is the size of the number, or a error when + the offlining failed. + Reading the file is not allowed. -- cgit v1.1 From 35b23b522696118b5ec5b84ba6846ee96aa2743e Mon Sep 17 00:00:00 2001 From: Guennadi Liakhovetski Date: Fri, 11 Dec 2009 11:34:20 -0300 Subject: V4L/DVB (13651): sh_mobile_ceu_camera: document the scaling and cropping algorithm The sh_mobile_ceu_camera driver implements an advanced algorithm, combining scaling and cropping on the client and on the host. Due to its complexity the algorithm deserves separate documentation. create mode 100644 Documentation/video4linux/sh_mobile_ceu_camera.txt Signed-off-by: Guennadi Liakhovetski Signed-off-by: Mauro Carvalho Chehab --- Documentation/video4linux/sh_mobile_ceu_camera.txt | 157 +++++++++++++++++++++ 1 file changed, 157 insertions(+) create mode 100644 Documentation/video4linux/sh_mobile_ceu_camera.txt (limited to 'Documentation') diff --git a/Documentation/video4linux/sh_mobile_ceu_camera.txt b/Documentation/video4linux/sh_mobile_ceu_camera.txt new file mode 100644 index 0000000..2ae1634 --- /dev/null +++ b/Documentation/video4linux/sh_mobile_ceu_camera.txt @@ -0,0 +1,157 @@ + Cropping and Scaling algorithm, used in the sh_mobile_ceu_camera driver + ======================================================================= + +Terminology +----------- + +sensor scales: horizontal and vertical scales, configured by the sensor driver +host scales: -"- host driver +combined scales: sensor_scale * host_scale + + +Generic scaling / cropping scheme +--------------------------------- + +-1-- +| +-2-- -\ +| --\ +| --\ ++-5-- -\ -- -3-- +| ---\ +| --- -4-- -\ +| -\ +| - -6-- +| +| - -6'- +| -/ +| --- -4'- -/ +| ---/ ++-5'- -/ +| -- -3'- +| --/ +| --/ +-2'- -/ +| +| +-1'- + +Produced by user requests: + +S_CROP(left / top = (5) - (1), width / height = (5') - (5)) +S_FMT(width / height = (6') - (6)) + +Here: + +(1) to (1') - whole max width or height +(1) to (2) - sensor cropped left or top +(2) to (2') - sensor cropped width or height +(3) to (3') - sensor scale +(3) to (4) - CEU cropped left or top +(4) to (4') - CEU cropped width or height +(5) to (5') - reverse sensor scale applied to CEU cropped width or height +(2) to (5) - reverse sensor scale applied to CEU cropped left or top +(6) to (6') - CEU scale - user window + + +S_FMT +----- + +Do not touch input rectangle - it is already optimal. + +1. Calculate current sensor scales: + + scale_s = ((3') - (3)) / ((2') - (2)) + +2. Calculate "effective" input crop (sensor subwindow) - CEU crop scaled back at +current sensor scales onto input window - this is user S_CROP: + + width_u = (5') - (5) = ((4') - (4)) * scale_s + +3. Calculate new combined scales from "effective" input window to requested user +window: + + scale_comb = width_u / ((6') - (6)) + +4. Calculate sensor output window by applying combined scales to real input +window: + + width_s_out = ((2') - (2)) / scale_comb + +5. Apply iterative sensor S_FMT for sensor output window. + + subdev->video_ops->s_fmt(.width = width_s_out) + +6. Retrieve sensor output window (g_fmt) + +7. Calculate new sensor scales: + + scale_s_new = ((3')_new - (3)_new) / ((2') - (2)) + +8. Calculate new CEU crop - apply sensor scales to previously calculated +"effective" crop: + + width_ceu = (4')_new - (4)_new = width_u / scale_s_new + left_ceu = (4)_new - (3)_new = ((5) - (2)) / scale_s_new + +9. Use CEU cropping to crop to the new window: + + ceu_crop(.width = width_ceu, .left = left_ceu) + +10. Use CEU scaling to scale to the requested user window: + + scale_ceu = width_ceu / width + + +S_CROP +------ + +If old scale applied to new crop is invalid produce nearest new scale possible + +1. Calculate current combined scales. + + scale_comb = (((4') - (4)) / ((6') - (6))) * (((2') - (2)) / ((3') - (3))) + +2. Apply iterative sensor S_CROP for new input window. + +3. If old combined scales applied to new crop produce an impossible user window, +adjust scales to produce nearest possible window. + + width_u_out = ((5') - (5)) / scale_comb + + if (width_u_out > max) + scale_comb = ((5') - (5)) / max; + else if (width_u_out < min) + scale_comb = ((5') - (5)) / min; + +4. Issue G_CROP to retrieve actual input window. + +5. Using actual input window and calculated combined scales calculate sensor +target output window. + + width_s_out = ((3') - (3)) = ((2') - (2)) / scale_comb + +6. Apply iterative S_FMT for new sensor target output window. + +7. Issue G_FMT to retrieve the actual sensor output window. + +8. Calculate sensor scales. + + scale_s = ((3') - (3)) / ((2') - (2)) + +9. Calculate sensor output subwindow to be cropped on CEU by applying sensor +scales to the requested window. + + width_ceu = ((5') - (5)) / scale_s + +10. Use CEU cropping for above calculated window. + +11. Calculate CEU scales from sensor scales from results of (10) and user window +from (3) + + scale_ceu = calc_scale(((5') - (5)), &width_u_out) + +12. Apply CEU scales. + +-- +Author: Guennadi Liakhovetski -- cgit v1.1 From 49b14650ba5bf80234cb1984fd8396aff03430ce Mon Sep 17 00:00:00 2001 From: Ben Hutchings Date: Thu, 3 Dec 2009 19:50:35 -0300 Subject: V4L/DVB (13680a): DocBook/media: copy images after building HTML The rule for %.html removes the output directory, so there is no point in copying images before building HTML. Documentation/DocBook/Makefile | 10 +++++----- Signed-off-by: Ben Hutchings Signed-off-by: Mauro Carvalho Chehab --- Documentation/DocBook/Makefile | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile index ab8300f..22bbf7e 100644 --- a/Documentation/DocBook/Makefile +++ b/Documentation/DocBook/Makefile @@ -32,7 +32,7 @@ PS_METHOD = $(prefer-db2x) ### # The targets that may be used. -PHONY += xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs cleandocs media +PHONY += xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs cleandocs BOOKS := $(addprefix $(obj)/,$(DOCBOOKS)) xmldocs: $(BOOKS) @@ -45,15 +45,15 @@ PDF := $(patsubst %.xml, %.pdf, $(BOOKS)) pdfdocs: $(PDF) HTML := $(sort $(patsubst %.xml, %.html, $(BOOKS))) -htmldocs: media $(HTML) +htmldocs: $(HTML) $(call build_main_index) + $(call build_images) MAN := $(patsubst %.xml, %.9, $(BOOKS)) mandocs: $(MAN) -media: - mkdir -p $(srctree)/Documentation/DocBook/media/ - cp $(srctree)/Documentation/DocBook/dvb/*.png $(srctree)/Documentation/DocBook/v4l/*.gif $(srctree)/Documentation/DocBook/media/ +build_images = mkdir -p $(objtree)/Documentation/DocBook/media/ && \ + cp $(srctree)/Documentation/DocBook/dvb/*.png $(srctree)/Documentation/DocBook/v4l/*.gif $(objtree)/Documentation/DocBook/media/ installmandocs: mandocs mkdir -p /usr/local/man/man9/ -- cgit v1.1 From 5bf583473813530c1bf82051a35fac8d5045f4f7 Mon Sep 17 00:00:00 2001 From: Ben Hutchings Date: Thu, 3 Dec 2009 19:51:09 -0300 Subject: V4L/DVB (13680b): DocBook/media: create links for included sources If docs are being built in a separate directory, xmlto and xsltproc can't find included sources. Make links back to the source directory. I would much prefer to have xmlto and xsltproc look in the source directory for included entities but couldn't see how to do that. This needs to be solved in some way for 2.6.32, even if this patch isn't the right way to do it. Signed-off-by: Ben Hutchings Signed-off-by: Mauro Carvalho Chehab --- Documentation/DocBook/Makefile | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile index 22bbf7e..50075df 100644 --- a/Documentation/DocBook/Makefile +++ b/Documentation/DocBook/Makefile @@ -32,10 +32,10 @@ PS_METHOD = $(prefer-db2x) ### # The targets that may be used. -PHONY += xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs cleandocs +PHONY += xmldocs sgmldocs psdocs pdfdocs htmldocs mandocs installmandocs cleandocs xmldoclinks BOOKS := $(addprefix $(obj)/,$(DOCBOOKS)) -xmldocs: $(BOOKS) +xmldocs: $(BOOKS) xmldoclinks sgmldocs: xmldocs PS := $(patsubst %.xml, %.ps, $(BOOKS)) @@ -55,6 +55,15 @@ mandocs: $(MAN) build_images = mkdir -p $(objtree)/Documentation/DocBook/media/ && \ cp $(srctree)/Documentation/DocBook/dvb/*.png $(srctree)/Documentation/DocBook/v4l/*.gif $(objtree)/Documentation/DocBook/media/ +xmldoclinks: +ifneq ($(objtree),$(srctree)) + for dep in dvb media-entities.tmpl media-indices.tmpl v4l; do \ + rm -f $(objtree)/Documentation/DocBook/$$dep \ + && ln -s $(srctree)/Documentation/DocBook/$$dep $(objtree)/Documentation/DocBook/ \ + || exit; \ + done +endif + installmandocs: mandocs mkdir -p /usr/local/man/man9/ install Documentation/DocBook/man/*.9.gz /usr/local/man/man9/ -- cgit v1.1 From 9ea9a886b0e8630e12cff515955e7f0f5be32cb1 Mon Sep 17 00:00:00 2001 From: Clemens Ladisch Date: Tue, 15 Dec 2009 16:45:39 -0800 Subject: vt: make the default cursor shape configurable For embedded systems, the blinking cursor at startup time can be annoying and unintended. Add a new kernel parameter to change the default cursor shape. Signed-off-by: Clemens Ladisch Cc: Daniel Mack Acked-by: Pavel Machek Cc: David Newall Cc: Alan Cox Cc: Greg Kroah-Hartman Cc: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index ab95d3a..c309515 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2729,6 +2729,11 @@ and is between 256 and 4096 characters. It is defined in the file vmpoff= [KNL,S390] Perform z/VM CP command after power off. Format: + vt.cur_default= [VT] Default cursor shape. + Format: 0xCCBBAA, where AA, BB, and CC are the same as + the parameters of the [?A;B;Cc escape sequence; + see VGA-softcursor.txt. Default: 2 = underline. + vt.default_blu= [VT] Format: ,,,..., Change the default blue palette of the console. -- cgit v1.1 From 0769746183caff9d4334be48c7b0e7d2ec8716c4 Mon Sep 17 00:00:00 2001 From: Jani Nikula Date: Tue, 15 Dec 2009 16:46:20 -0800 Subject: gpiolib: add support for changing value polarity in sysfs Drivers may use gpiolib sysfs as part of their public user space interface. The GPIO number and polarity might change from board to board. The gpio_export_link() call can be used to hide the GPIO number from user space. Add support for also hiding the GPIO line polarity changes from user space. Signed-off-by: Jani Nikula Cc: David Brownell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/gpio.txt | 15 +++++++++++++++ 1 file changed, 15 insertions(+) (limited to 'Documentation') diff --git a/Documentation/gpio.txt b/Documentation/gpio.txt index e4e7dae..1866c27 100644 --- a/Documentation/gpio.txt +++ b/Documentation/gpio.txt @@ -531,6 +531,13 @@ and have the following read/write attributes: This file exists only if the pin can be configured as an interrupt generating input pin. + "active_low" ... reads as either 0 (false) or 1 (true). Write + any nonzero value to invert the value attribute both + for reading and writing. Existing and subsequent + poll(2) support configuration via the edge attribute + for "rising" and "falling" edges will follow this + setting. + GPIO controllers have paths like /sys/class/gpio/gpiochip42/ (for the controller implementing GPIOs starting at #42) and have the following read-only attributes: @@ -566,6 +573,8 @@ requested using gpio_request(): int gpio_export_link(struct device *dev, const char *name, unsigned gpio) + /* change the polarity of a GPIO node in sysfs */ + int gpio_sysfs_set_active_low(unsigned gpio, int value); After a kernel driver requests a GPIO, it may only be made available in the sysfs interface by gpio_export(). The driver can control whether the @@ -580,3 +589,9 @@ After the GPIO has been exported, gpio_export_link() allows creating symlinks from elsewhere in sysfs to the GPIO sysfs node. Drivers can use this to provide the interface under their own device in sysfs with a descriptive name. + +Drivers can use gpio_sysfs_set_active_low() to hide GPIO line polarity +differences between boards from user space. This only affects the +sysfs interface. Polarity change can be done both before and after +gpio_export(), and previously enabled poll(2) support for either +rising or falling edge will be reconfigured to follow this setting. -- cgit v1.1 From 4562aea791e97aa0f2e342849daa18b588c46df1 Mon Sep 17 00:00:00 2001 From: Harald Welte Date: Tue, 15 Dec 2009 16:46:42 -0800 Subject: viafb: documentation update We now support the VX855, and the VX800 is no longer unaccellerated. viafb_video_dev was removed as it was useless. Signed-off-by: Harald Welte Signed-off-by: Florian Tobias Schandinat Cc: Joseph Chan Cc: Scott Fang Cc: Krzysztof Helt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/fb/viafb.txt | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) (limited to 'Documentation') diff --git a/Documentation/fb/viafb.txt b/Documentation/fb/viafb.txt index 67dbf44..f3e046a 100644 --- a/Documentation/fb/viafb.txt +++ b/Documentation/fb/viafb.txt @@ -7,7 +7,7 @@ VIA UniChrome Family(CLE266, PM800 / CN400 / CN300, P4M800CE / P4M800Pro / CN700 / VN800, CX700 / VX700, K8M890, P4M890, - CN896 / P4M900, VX800) + CN896 / P4M900, VX800, VX855) [Driver features] ------------------------ @@ -154,13 +154,6 @@ 0 : No Dual Edge Panel (default) 1 : Dual Edge Panel - viafb_video_dev: - This option is used to specify video output devices(CRT, DVI, LCD) for - duoview case. - For example: - To output video on DVI, we should use: - modprobe viafb viafb_video_dev=DVI... - viafb_lcd_port: This option is used to specify LCD output port, available values are "DVP0" "DVP1" "DFP_HIGHLOW" "DFP_HIGH" "DFP_LOW". @@ -181,9 +174,6 @@ Notes: and bpp, need to call VIAFB specified ioctl interface VIAFB_SET_DEVICE instead of calling common ioctl function FBIOPUT_VSCREENINFO since viafb doesn't support multi-head well, or it will cause screen crush. - 4. VX800 2D accelerator hasn't been supported in this driver yet. When - using driver on VX800, the driver will disable the acceleration - function as default. [Configure viafb with "fbset" tool] -- cgit v1.1 From 7de3369c14b67fe77d8b5f65171bb3a3b4f371ba Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Tue, 15 Dec 2009 16:46:59 -0800 Subject: doc: SubmitChecklist, add ioctls, remove OSDL reference If a patch adds ioctls, then Documentation/ioctl/ioctl-number.txt should also be updated. Remove reference to the OSDL PLM build farm. Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/SubmitChecklist | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/SubmitChecklist b/Documentation/SubmitChecklist index 78a9168..1053a56 100644 --- a/Documentation/SubmitChecklist +++ b/Documentation/SubmitChecklist @@ -15,7 +15,7 @@ kernel patches. 2: Passes allnoconfig, allmodconfig 3: Builds on multiple CPU architectures by using local cross-compile tools - or something like PLM at OSDL. + or some other build farm. 4: ppc64 is a good architecture for cross-compilation checking because it tends to use `unsigned long' for 64-bit quantities. @@ -88,3 +88,6 @@ kernel patches. 24: All memory barriers {e.g., barrier(), rmb(), wmb()} need a comment in the source code that explains the logic of what they are doing and why. + +25: If any ioctl's are added by the patch, then also update + Documentation/ioctl/ioctl-number.txt. -- cgit v1.1 From 82c1e49ccb28534b4e8b77d5f0ff553f19912d4d Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Tue, 15 Dec 2009 16:46:59 -0800 Subject: proc: remove docbook and example Example is outdated, it still uses old ->read_proc interfaces and "fb" example is plain racy. There are better examples all over the tree. Docbook itself says almost nothing about /proc and contain quite a number of simply wrong facts, e.g. device nodes support. What it does is describing at great length interface which are going to be removed. There are Documentation/filesystems/seq_file.txt in exchange. Signed-off-by: Alexey Dobriyan Acked-by: Erik Mouw Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/DocBook/Makefile | 17 +- Documentation/DocBook/procfs-guide.tmpl | 626 -------------------------------- Documentation/DocBook/procfs_example.c | 201 ---------- 3 files changed, 3 insertions(+), 841 deletions(-) delete mode 100644 Documentation/DocBook/procfs-guide.tmpl delete mode 100644 Documentation/DocBook/procfs_example.c (limited to 'Documentation') diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile index ab8300f..ee34ceb 100644 --- a/Documentation/DocBook/Makefile +++ b/Documentation/DocBook/Makefile @@ -8,7 +8,7 @@ DOCBOOKS := z8530book.xml mcabook.xml device-drivers.xml \ kernel-hacking.xml kernel-locking.xml deviceiobook.xml \ - procfs-guide.xml writing_usb_driver.xml networking.xml \ + writing_usb_driver.xml networking.xml \ kernel-api.xml filesystems.xml lsm.xml usb.xml kgdb.xml \ gadget.xml libata.xml mtdnand.xml librs.xml rapidio.xml \ genericirq.xml s390-drivers.xml uio-howto.xml scsi.xml \ @@ -65,7 +65,7 @@ KERNELDOC = $(srctree)/scripts/kernel-doc DOCPROC = $(objtree)/scripts/basic/docproc XMLTOFLAGS = -m $(srctree)/Documentation/DocBook/stylesheet.xsl -#XMLTOFLAGS += --skip-validation +XMLTOFLAGS += --skip-validation ### # DOCPROC is used for two purposes: @@ -101,17 +101,6 @@ endif # Changes in kernel-doc force a rebuild of all documentation $(BOOKS): $(KERNELDOC) -### -# procfs guide uses a .c file as example code. -# This requires an explicit dependency -C-procfs-example = procfs_example.xml -C-procfs-example2 = $(addprefix $(obj)/,$(C-procfs-example)) -$(obj)/procfs-guide.xml: $(C-procfs-example2) - -# List of programs to build -##oops, this is a kernel module::hostprogs-y := procfs_example -obj-m += procfs_example.o - # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -238,7 +227,7 @@ clean-files := $(DOCBOOKS) \ $(patsubst %.xml, %.pdf, $(DOCBOOKS)) \ $(patsubst %.xml, %.html, $(DOCBOOKS)) \ $(patsubst %.xml, %.9, $(DOCBOOKS)) \ - $(C-procfs-example) $(index) + $(index) clean-dirs := $(patsubst %.xml,%,$(DOCBOOKS)) man diff --git a/Documentation/DocBook/procfs-guide.tmpl b/Documentation/DocBook/procfs-guide.tmpl deleted file mode 100644 index 9eba4b7..0000000 --- a/Documentation/DocBook/procfs-guide.tmpl +++ /dev/null @@ -1,626 +0,0 @@ - - -]> - - - - Linux Kernel Procfs Guide - - - - Erik - (J.A.K.) - Mouw - -
- mouw@nl.linux.org -
-
-
- - - This software and documentation were written while working on the - LART computing board - (http://www.lartmaker.nl/), - which was sponsored by the Delt University of Technology projects - Mobile Multi-media Communications and Ubiquitous Communications. - - -
- - - - 1.0 - May 30, 2001 - Initial revision posted to linux-kernel - - - 1.1 - June 3, 2001 - Revised after comments from linux-kernel - - - - - 2001 - Erik Mouw - - - - - - This documentation is free software; you can redistribute it - and/or modify it under the terms of the GNU General Public - License as published by the Free Software Foundation; either - version 2 of the License, or (at your option) any later - version. - - - - This documentation is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR - PURPOSE. See the GNU General Public License for more details. - - - - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - - - - For more details see the file COPYING in the source - distribution of Linux. - - -
- - - - - - - - - - - - Preface - - - This guide describes the use of the procfs file system from - within the Linux kernel. The idea to write this guide came up on - the #kernelnewbies IRC channel (see http://www.kernelnewbies.org/), - when Jeff Garzik explained the use of procfs and forwarded me a - message Alexander Viro wrote to the linux-kernel mailing list. I - agreed to write it up nicely, so here it is. - - - - I'd like to thank Jeff Garzik - jgarzik@pobox.com and Alexander Viro - viro@parcelfarce.linux.theplanet.co.uk for their input, - Tim Waugh twaugh@redhat.com for his Selfdocbook, - and Marc Joosen marcj@historia.et.tudelft.nl for - proofreading. - - - - Erik - - - - - - - - Introduction - - - The /proc file system - (procfs) is a special file system in the linux kernel. It's a - virtual file system: it is not associated with a block device - but exists only in memory. The files in the procfs are there to - allow userland programs access to certain information from the - kernel (like process information in /proc/[0-9]+/), but also for debug - purposes (like /proc/ksyms). - - - - This guide describes the use of the procfs file system from - within the Linux kernel. It starts by introducing all relevant - functions to manage the files within the file system. After that - it shows how to communicate with userland, and some tips and - tricks will be pointed out. Finally a complete example will be - shown. - - - - Note that the files in /proc/sys are sysctl files: they - don't belong to procfs and are governed by a completely - different API described in the Kernel API book. - - - - - - - - Managing procfs entries - - - This chapter describes the functions that various kernel - components use to populate the procfs with files, symlinks, - device nodes, and directories. - - - - A minor note before we start: if you want to use any of the - procfs functions, be sure to include the correct header file! - This should be one of the first lines in your code: - - - -#include <linux/proc_fs.h> - - - - - - - Creating a regular file - - - - struct proc_dir_entry* create_proc_entry - const char* name - mode_t mode - struct proc_dir_entry* parent - - - - - This function creates a regular file with the name - name, file mode - mode in the directory - parent. To create a file in the root of - the procfs, use NULL as - parent parameter. When successful, the - function will return a pointer to the freshly created - struct proc_dir_entry; otherwise it - will return NULL. describes how to do something useful with - regular files. - - - - Note that it is specifically supported that you can pass a - path that spans multiple directories. For example - create_proc_entry("drivers/via0/info") - will create the via0 - directory if necessary, with standard - 0755 permissions. - - - - If you only want to be able to read the file, the function - create_proc_read_entry described in may be used to create and initialise - the procfs entry in one single call. - - - - - - - - Creating a symlink - - - - struct proc_dir_entry* - proc_symlink const - char* name - struct proc_dir_entry* - parent const - char* dest - - - - - This creates a symlink in the procfs directory - parent that points from - name to - dest. This translates in userland to - ln -s dest - name. - - - - - Creating a directory - - - - struct proc_dir_entry* proc_mkdir - const char* name - struct proc_dir_entry* parent - - - - - Create a directory name in the procfs - directory parent. - - - - - - - - Removing an entry - - - - void remove_proc_entry - const char* name - struct proc_dir_entry* parent - - - - - Removes the entry name in the directory - parent from the procfs. Entries are - removed by their name, not by the - struct proc_dir_entry returned by the - various create functions. Note that this function doesn't - recursively remove entries. - - - - Be sure to free the data entry from - the struct proc_dir_entry before - remove_proc_entry is called (that is: if - there was some data allocated, of - course). See for more information - on using the data entry. - - - - - - - - - Communicating with userland - - - Instead of reading (or writing) information directly from - kernel memory, procfs works with call back - functions for files: functions that are called when - a specific file is being read or written. Such functions have - to be initialised after the procfs file is created by setting - the read_proc and/or - write_proc fields in the - struct proc_dir_entry* that the - function create_proc_entry returned: - - - -struct proc_dir_entry* entry; - -entry->read_proc = read_proc_foo; -entry->write_proc = write_proc_foo; - - - - If you only want to use a the - read_proc, the function - create_proc_read_entry described in may be used to create and initialise the - procfs entry in one single call. - - - - - - Reading data - - - The read function is a call back function that allows userland - processes to read data from the kernel. The read function - should have the following format: - - - - - int read_func - char* buffer - char** start - off_t off - int count - int* peof - void* data - - - - - The read function should write its information into the - buffer, which will be exactly - PAGE_SIZE bytes long. - - - - The parameter - peof should be used to signal that the - end of the file has been reached by writing - 1 to the memory location - peof points to. - - - - The data - parameter can be used to create a single call back function for - several files, see . - - - - The rest of the parameters and the return value are described - by a comment in fs/proc/generic.c as follows: - - -
- - You have three ways to return data: - - - - - Leave *start = NULL. (This is the default.) - Put the data of the requested offset at that - offset within the buffer. Return the number (n) - of bytes there are from the beginning of the - buffer up to the last byte of data. If the - number of supplied bytes (= n - offset) is - greater than zero and you didn't signal eof - and the reader is prepared to take more data - you will be called again with the requested - offset advanced by the number of bytes - absorbed. This interface is useful for files - no larger than the buffer. - - - - - Set *start to an unsigned long value less than - the buffer address but greater than zero. - Put the data of the requested offset at the - beginning of the buffer. Return the number of - bytes of data placed there. If this number is - greater than zero and you didn't signal eof - and the reader is prepared to take more data - you will be called again with the requested - offset advanced by *start. This interface is - useful when you have a large file consisting - of a series of blocks which you want to count - and return as wholes. - (Hack by Paul.Russell@rustcorp.com.au) - - - - - Set *start to an address within the buffer. - Put the data of the requested offset at *start. - Return the number of bytes of data placed there. - If this number is greater than zero and you - didn't signal eof and the reader is prepared to - take more data you will be called again with the - requested offset advanced by the number of bytes - absorbed. - - - -
- - - shows how to use a read call back - function. - -
- - - - - - Writing data - - - The write call back function allows a userland process to write - data to the kernel, so it has some kind of control over the - kernel. The write function should have the following format: - - - - - int write_func - struct file* file - const char* buffer - unsigned long count - void* data - - - - - The write function should read count - bytes at maximum from the buffer. Note - that the buffer doesn't live in the - kernel's memory space, so it should first be copied to kernel - space with copy_from_user. The - file parameter is usually - ignored. shows how to use the - data parameter. - - - - Again, shows how to use this call back - function. - - - - - - - - A single call back for many files - - - When a large number of almost identical files is used, it's - quite inconvenient to use a separate call back function for - each file. A better approach is to have a single call back - function that distinguishes between the files by using the - data field in struct - proc_dir_entry. First of all, the - data field has to be initialised: - - - -struct proc_dir_entry* entry; -struct my_file_data *file_data; - -file_data = kmalloc(sizeof(struct my_file_data), GFP_KERNEL); -entry->data = file_data; - - - - The data field is a void - *, so it can be initialised with anything. - - - - Now that the data field is set, the - read_proc and - write_proc can use it to distinguish - between files because they get it passed into their - data parameter: - - - -int foo_read_func(char *page, char **start, off_t off, - int count, int *eof, void *data) -{ - int len; - - if(data == file_data) { - /* special case for this file */ - } else { - /* normal processing */ - } - - return len; -} - - - - Be sure to free the data data field - when removing the procfs entry. - - -
- - - - - - Tips and tricks - - - - - - Convenience functions - - - - struct proc_dir_entry* create_proc_read_entry - const char* name - mode_t mode - struct proc_dir_entry* parent - read_proc_t* read_proc - void* data - - - - - This function creates a regular file in exactly the same way - as create_proc_entry from does, but also allows to set the read - function read_proc in one call. This - function can set the data as well, like - explained in . - - - - - - - Modules - - - If procfs is being used from within a module, be sure to set - the owner field in the - struct proc_dir_entry to - THIS_MODULE. - - - -struct proc_dir_entry* entry; - -entry->owner = THIS_MODULE; - - - - - - - - Mode and ownership - - - Sometimes it is useful to change the mode and/or ownership of - a procfs entry. Here is an example that shows how to achieve - that: - - - -struct proc_dir_entry* entry; - -entry->mode = S_IWUSR |S_IRUSR | S_IRGRP | S_IROTH; -entry->uid = 0; -entry->gid = 100; - - - - - - - - - - Example - - - -&procfsexample; - - -
diff --git a/Documentation/DocBook/procfs_example.c b/Documentation/DocBook/procfs_example.c deleted file mode 100644 index a5b1179..0000000 --- a/Documentation/DocBook/procfs_example.c +++ /dev/null @@ -1,201 +0,0 @@ -/* - * procfs_example.c: an example proc interface - * - * Copyright (C) 2001, Erik Mouw (mouw@nl.linux.org) - * - * This file accompanies the procfs-guide in the Linux kernel - * source. Its main use is to demonstrate the concepts and - * functions described in the guide. - * - * This software has been developed while working on the LART - * computing board (http://www.lartmaker.nl), which was sponsored - * by the Delt University of Technology projects Mobile Multi-media - * Communications and Ubiquitous Communications. - * - * This program is free software; you can redistribute - * it and/or modify it under the terms of the GNU General - * Public License as published by the Free Software - * Foundation; either version 2 of the License, or (at your - * option) any later version. - * - * This program is distributed in the hope that it will be - * useful, but WITHOUT ANY WARRANTY; without even the implied - * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR - * PURPOSE. See the GNU General Public License for more - * details. - * - * You should have received a copy of the GNU General Public - * License along with this program; if not, write to the - * Free Software Foundation, Inc., 59 Temple Place, - * Suite 330, Boston, MA 02111-1307 USA - * - */ - -#include -#include -#include -#include -#include -#include - - -#define MODULE_VERS "1.0" -#define MODULE_NAME "procfs_example" - -#define FOOBAR_LEN 8 - -struct fb_data_t { - char name[FOOBAR_LEN + 1]; - char value[FOOBAR_LEN + 1]; -}; - - -static struct proc_dir_entry *example_dir, *foo_file, - *bar_file, *jiffies_file, *symlink; - - -struct fb_data_t foo_data, bar_data; - - -static int proc_read_jiffies(char *page, char **start, - off_t off, int count, - int *eof, void *data) -{ - int len; - - len = sprintf(page, "jiffies = %ld\n", - jiffies); - - return len; -} - - -static int proc_read_foobar(char *page, char **start, - off_t off, int count, - int *eof, void *data) -{ - int len; - struct fb_data_t *fb_data = (struct fb_data_t *)data; - - /* DON'T DO THAT - buffer overruns are bad */ - len = sprintf(page, "%s = '%s'\n", - fb_data->name, fb_data->value); - - return len; -} - - -static int proc_write_foobar(struct file *file, - const char *buffer, - unsigned long count, - void *data) -{ - int len; - struct fb_data_t *fb_data = (struct fb_data_t *)data; - - if(count > FOOBAR_LEN) - len = FOOBAR_LEN; - else - len = count; - - if(copy_from_user(fb_data->value, buffer, len)) - return -EFAULT; - - fb_data->value[len] = '\0'; - - return len; -} - - -static int __init init_procfs_example(void) -{ - int rv = 0; - - /* create directory */ - example_dir = proc_mkdir(MODULE_NAME, NULL); - if(example_dir == NULL) { - rv = -ENOMEM; - goto out; - } - /* create jiffies using convenience function */ - jiffies_file = create_proc_read_entry("jiffies", - 0444, example_dir, - proc_read_jiffies, - NULL); - if(jiffies_file == NULL) { - rv = -ENOMEM; - goto no_jiffies; - } - - /* create foo and bar files using same callback - * functions - */ - foo_file = create_proc_entry("foo", 0644, example_dir); - if(foo_file == NULL) { - rv = -ENOMEM; - goto no_foo; - } - - strcpy(foo_data.name, "foo"); - strcpy(foo_data.value, "foo"); - foo_file->data = &foo_data; - foo_file->read_proc = proc_read_foobar; - foo_file->write_proc = proc_write_foobar; - - bar_file = create_proc_entry("bar", 0644, example_dir); - if(bar_file == NULL) { - rv = -ENOMEM; - goto no_bar; - } - - strcpy(bar_data.name, "bar"); - strcpy(bar_data.value, "bar"); - bar_file->data = &bar_data; - bar_file->read_proc = proc_read_foobar; - bar_file->write_proc = proc_write_foobar; - - /* create symlink */ - symlink = proc_symlink("jiffies_too", example_dir, - "jiffies"); - if(symlink == NULL) { - rv = -ENOMEM; - goto no_symlink; - } - - /* everything OK */ - printk(KERN_INFO "%s %s initialised\n", - MODULE_NAME, MODULE_VERS); - return 0; - -no_symlink: - remove_proc_entry("bar", example_dir); -no_bar: - remove_proc_entry("foo", example_dir); -no_foo: - remove_proc_entry("jiffies", example_dir); -no_jiffies: - remove_proc_entry(MODULE_NAME, NULL); -out: - return rv; -} - - -static void __exit cleanup_procfs_example(void) -{ - remove_proc_entry("jiffies_too", example_dir); - remove_proc_entry("bar", example_dir); - remove_proc_entry("foo", example_dir); - remove_proc_entry("jiffies", example_dir); - remove_proc_entry(MODULE_NAME, NULL); - - printk(KERN_INFO "%s %s removed\n", - MODULE_NAME, MODULE_VERS); -} - - -module_init(init_procfs_example); -module_exit(cleanup_procfs_example); - -MODULE_AUTHOR("Erik Mouw"); -MODULE_DESCRIPTION("procfs examples"); -MODULE_LICENSE("GPL"); -- cgit v1.1 From 6be4b78993498c253e99b12c4d0f7684a36955e2 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Tue, 15 Dec 2009 16:47:00 -0800 Subject: seq_file: use proc_create() in documentation Using create_proc_entry() + ->proc_fops assignment is racy because ->proc_fops will be NULL for some time, use proc_create() to avoid race. Signed-off-by: Alexey Dobriyan Cc: Randy Dunlap Cc: Jonathan Corbet Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/seq_file.txt | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt index 0d15ebc..a1e2e0d 100644 --- a/Documentation/filesystems/seq_file.txt +++ b/Documentation/filesystems/seq_file.txt @@ -248,9 +248,7 @@ code, that is done in the initialization code in the usual way: { struct proc_dir_entry *entry; - entry = create_proc_entry("sequence", 0, NULL); - if (entry) - entry->proc_fops = &ct_file_ops; + proc_create("sequence", 0, NULL, &ct_file_ops); return 0; } -- cgit v1.1 From 15293df82bd1c15196e7cb336130c243e9a41806 Mon Sep 17 00:00:00 2001 From: Detlef Riekenberg Date: Thu, 10 Dec 2009 13:55:48 +0100 Subject: vgaarbiter: fix a typo in the vgaarbiter Documentation I detected a typo, while reading "Documentation/vgaarbiter.txt". Fix the 'fieldd' mispelling. Signed-off-by: Detlef Riekenberg Signed-off-by: Jesse Barnes --- Documentation/vgaarbiter.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/vgaarbiter.txt b/Documentation/vgaarbiter.txt index 987f9b0..43a9b06 100644 --- a/Documentation/vgaarbiter.txt +++ b/Documentation/vgaarbiter.txt @@ -103,7 +103,7 @@ I.2 libpciaccess ---------------- To use the vga arbiter char device it was implemented an API inside the -libpciaccess library. One fieldd was added to struct pci_device (each device +libpciaccess library. One field was added to struct pci_device (each device on the system): /* the type of resource decoded by the device */ -- cgit v1.1 From 3c57e89b467d1db6fda74d5c97112c8b9466dd97 Mon Sep 17 00:00:00 2001 From: Clemens Ladisch Date: Wed, 16 Dec 2009 21:38:25 +0100 Subject: hwmon: New driver for AMD Family 10h/11h CPUs This adds a driver for the internal temperature sensor of AMD Family 10h and 11h CPUs. Signed-off-by: Clemens Ladisch Signed-off-by: Jean Delvare --- Documentation/hwmon/k10temp | 60 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 60 insertions(+) create mode 100644 Documentation/hwmon/k10temp (limited to 'Documentation') diff --git a/Documentation/hwmon/k10temp b/Documentation/hwmon/k10temp new file mode 100644 index 0000000..a7a18d4 --- /dev/null +++ b/Documentation/hwmon/k10temp @@ -0,0 +1,60 @@ +Kernel driver k10temp +===================== + +Supported chips: +* AMD Family 10h processors: + Socket F: Quad-Core/Six-Core/Embedded Opteron + Socket AM2+: Opteron, Phenom (II) X3/X4 + Socket AM3: Quad-Core Opteron, Athlon/Phenom II X2/X3/X4, Sempron II + Socket S1G3: Athlon II, Sempron, Turion II +* AMD Family 11h processors: + Socket S1G2: Athlon (X2), Sempron (X2), Turion X2 (Ultra) + + Prefix: 'k10temp' + Addresses scanned: PCI space + Datasheets: + BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h Processors: + http://support.amd.com/us/Processor_TechDocs/31116.pdf + BIOS and Kernel Developer's Guide (BKDG) for AMD Family 11h Processors: + http://support.amd.com/us/Processor_TechDocs/41256.pdf + Revision Guide for AMD Family 10h Processors: + http://support.amd.com/us/Processor_TechDocs/41322.pdf + Revision Guide for AMD Family 11h Processors: + http://support.amd.com/us/Processor_TechDocs/41788.pdf + AMD Family 11h Processor Power and Thermal Data Sheet for Notebooks: + http://support.amd.com/us/Processor_TechDocs/43373.pdf + AMD Family 10h Server and Workstation Processor Power and Thermal Data Sheet: + http://support.amd.com/us/Processor_TechDocs/43374.pdf + AMD Family 10h Desktop Processor Power and Thermal Data Sheet: + http://support.amd.com/us/Processor_TechDocs/43375.pdf + +Author: Clemens Ladisch + +Description +----------- + +This driver permits reading of the internal temperature sensor of AMD +Family 10h and 11h processors. + +All these processors have a sensor, but on older revisions of Family 10h +processors, the sensor may return inconsistent values (erratum 319). The +driver will refuse to load on these revisions unless you specify the +"force=1" module parameter. + +There is one temperature measurement value, available as temp1_input in +sysfs. It is measured in degrees Celsius with a resolution of 1/8th degree. +Please note that it is defined as a relative value; to quote the AMD manual: + + Tctl is the processor temperature control value, used by the platform to + control cooling systems. Tctl is a non-physical temperature on an + arbitrary scale measured in degrees. It does _not_ represent an actual + physical temperature like die or case temperature. Instead, it specifies + the processor temperature relative to the point at which the system must + supply the maximum cooling for the processor's specified maximum case + temperature and maximum thermal power dissipation. + +The maximum value for Tctl is available in the file temp1_max. + +If the BIOS has enabled hardware temperature control, the threshold at +which the processor will throttle itself to avoid damage is available in +temp1_crit and temp1_crit_hyst. -- cgit v1.1 From 4251417484a1775ba5cbfe38c67e6d5af9615de4 Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Thu, 17 Dec 2009 11:43:29 -0600 Subject: cpumask: don't recommend set_cpus_allowed hack in Documentation/cpu-hotplug.txt Signed-off-by: Rusty Russell Cc: Gautham R Shenoy Cc: Ashok Raj --- Documentation/cpu-hotplug.txt | 49 +++++++++++++++---------------------------- 1 file changed, 17 insertions(+), 32 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt index 4d4a644b..a99d703 100644 --- a/Documentation/cpu-hotplug.txt +++ b/Documentation/cpu-hotplug.txt @@ -315,41 +315,26 @@ A: The following are what is required for CPU hotplug infrastructure to work Q: I need to ensure that a particular cpu is not removed when there is some work specific to this cpu is in progress. -A: First switch the current thread context to preferred cpu +A: There are two ways. If your code can be run in interrupt context, use + smp_call_function_single(), otherwise use work_on_cpu(). Note that + work_on_cpu() is slow, and can fail due to out of memory: int my_func_on_cpu(int cpu) { - cpumask_t saved_mask, new_mask = CPU_MASK_NONE; - int curr_cpu, err = 0; - - saved_mask = current->cpus_allowed; - cpu_set(cpu, new_mask); - err = set_cpus_allowed(current, new_mask); - - if (err) - return err; - - /* - * If we got scheduled out just after the return from - * set_cpus_allowed() before running the work, this ensures - * we stay locked. - */ - curr_cpu = get_cpu(); - - if (curr_cpu != cpu) { - err = -EAGAIN; - goto ret; - } else { - /* - * Do work : But cant sleep, since get_cpu() disables preempt - */ - } - ret: - put_cpu(); - set_cpus_allowed(current, saved_mask); - return err; - } - + int err; + get_online_cpus(); + if (!cpu_online(cpu)) + err = -EINVAL; + else +#if NEEDS_BLOCKING + err = work_on_cpu(cpu, __my_func_on_cpu, NULL); +#else + smp_call_function_single(cpu, __my_func_on_cpu, &err, + true); +#endif + put_online_cpus(); + return err; + } Q: How do we determine how many CPUs are available for hotplug. A: There is no clear spec defined way from ACPI that can give us that -- cgit v1.1 From 1de6129f381b4907013ccea08a3bdea8c966d50a Mon Sep 17 00:00:00 2001 From: FUJITA Tomonori Date: Fri, 18 Dec 2009 12:37:48 +0100 Subject: block: remove Documentation/block/as-iosched.txt Commit 492af6350a5ccf087e4964104a276ed358811458 removed the AS IO scheduler, so remove its documentation too. Signed-off-by: FUJITA Tomonori Signed-off-by: Jens Axboe --- Documentation/block/00-INDEX | 2 - Documentation/block/as-iosched.txt | 172 ------------------------------------- 2 files changed, 174 deletions(-) delete mode 100644 Documentation/block/as-iosched.txt (limited to 'Documentation') diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX index 961a051..a406286 100644 --- a/Documentation/block/00-INDEX +++ b/Documentation/block/00-INDEX @@ -1,7 +1,5 @@ 00-INDEX - This file -as-iosched.txt - - Anticipatory IO scheduler barrier.txt - I/O Barriers biodoc.txt diff --git a/Documentation/block/as-iosched.txt b/Documentation/block/as-iosched.txt deleted file mode 100644 index 738b72b..0000000 --- a/Documentation/block/as-iosched.txt +++ /dev/null @@ -1,172 +0,0 @@ -Anticipatory IO scheduler -------------------------- -Nick Piggin 13 Sep 2003 - -Attention! Database servers, especially those using "TCQ" disks should -investigate performance with the 'deadline' IO scheduler. Any system with high -disk performance requirements should do so, in fact. - -If you see unusual performance characteristics of your disk systems, or you -see big performance regressions versus the deadline scheduler, please email -me. Database users don't bother unless you're willing to test a lot of patches -from me ;) its a known issue. - -Also, users with hardware RAID controllers, doing striping, may find -highly variable performance results with using the as-iosched. The -as-iosched anticipatory implementation is based on the notion that a disk -device has only one physical seeking head. A striped RAID controller -actually has a head for each physical device in the logical RAID device. - -However, setting the antic_expire (see tunable parameters below) produces -very similar behavior to the deadline IO scheduler. - -Selecting IO schedulers ------------------------ -Refer to Documentation/block/switching-sched.txt for information on -selecting an io scheduler on a per-device basis. - -Anticipatory IO scheduler Policies ----------------------------------- -The as-iosched implementation implements several layers of policies -to determine when an IO request is dispatched to the disk controller. -Here are the policies outlined, in order of application. - -1. one-way Elevator algorithm. - -The elevator algorithm is similar to that used in deadline scheduler, with -the addition that it allows limited backward movement of the elevator -(i.e. seeks backwards). A seek backwards can occur when choosing between -two IO requests where one is behind the elevator's current position, and -the other is in front of the elevator's position. If the seek distance to -the request in back of the elevator is less than half the seek distance to -the request in front of the elevator, then the request in back can be chosen. -Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors. -This favors forward movement of the elevator, while allowing opportunistic -"short" backward seeks. - -2. FIFO expiration times for reads and for writes. - -This is again very similar to the deadline IO scheduler. The expiration -times for requests on these lists is tunable using the parameters read_expire -and write_expire discussed below. When a read or a write expires in this way, -the IO scheduler will interrupt its current elevator sweep or read anticipation -to service the expired request. - -3. Read and write request batching - -A batch is a collection of read requests or a collection of write -requests. The as scheduler alternates dispatching read and write batches -to the driver. In the case a read batch, the scheduler submits read -requests to the driver as long as there are read requests to submit, and -the read batch time limit has not been exceeded (read_batch_expire). -The read batch time limit begins counting down only when there are -competing write requests pending. - -In the case of a write batch, the scheduler submits write requests to -the driver as long as there are write requests available, and the -write batch time limit has not been exceeded (write_batch_expire). -However, the length of write batches will be gradually shortened -when read batches frequently exceed their time limit. - -When changing between batch types, the scheduler waits for all requests -from the previous batch to complete before scheduling requests for the -next batch. - -The read and write fifo expiration times described in policy 2 above -are checked only when in scheduling IO of a batch for the corresponding -(read/write) type. So for example, the read FIFO timeout values are -tested only during read batches. Likewise, the write FIFO timeout -values are tested only during write batches. For this reason, -it is generally not recommended for the read batch time -to be longer than the write expiration time, nor for the write batch -time to exceed the read expiration time (see tunable parameters below). - -When the IO scheduler changes from a read to a write batch, -it begins the elevator from the request that is on the head of the -write expiration FIFO. Likewise, when changing from a write batch to -a read batch, scheduler begins the elevator from the first entry -on the read expiration FIFO. - -4. Read anticipation. - -Read anticipation occurs only when scheduling a read batch. -This implementation of read anticipation allows only one read request -to be dispatched to the disk controller at a time. In -contrast, many write requests may be dispatched to the disk controller -at a time during a write batch. It is this characteristic that can make -the anticipatory scheduler perform anomalously with controllers supporting -TCQ, or with hardware striped RAID devices. Setting the antic_expire -queue parameter (see below) to zero disables this behavior, and the -anticipatory scheduler behaves essentially like the deadline scheduler. - -When read anticipation is enabled (antic_expire is not zero), reads -are dispatched to the disk controller one at a time. -At the end of each read request, the IO scheduler examines its next -candidate read request from its sorted read list. If that next request -is from the same process as the request that just completed, -or if the next request in the queue is "very close" to the -just completed request, it is dispatched immediately. Otherwise, -statistics (average think time, average seek distance) on the process -that submitted the just completed request are examined. If it seems -likely that that process will submit another request soon, and that -request is likely to be near the just completed request, then the IO -scheduler will stop dispatching more read requests for up to (antic_expire) -milliseconds, hoping that process will submit a new request near the one -that just completed. If such a request is made, then it is dispatched -immediately. If the antic_expire wait time expires, then the IO scheduler -will dispatch the next read request from the sorted read queue. - -To decide whether an anticipatory wait is worthwhile, the scheduler -maintains statistics for each process that can be used to compute -mean "think time" (the time between read requests), and mean seek -distance for that process. One observation is that these statistics -are associated with each process, but those statistics are not associated -with a specific IO device. So for example, if a process is doing IO -on several file systems on separate devices, the statistics will be -a combination of IO behavior from all those devices. - - -Tuning the anticipatory IO scheduler ------------------------------------- -When using 'as', the anticipatory IO scheduler there are 5 parameters under -/sys/block/*/queue/iosched/. All are units of milliseconds. - -The parameters are: -* read_expire - Controls how long until a read request becomes "expired". It also controls the - interval between which expired requests are served, so set to 50, a request - might take anywhere < 100ms to be serviced _if_ it is the next on the - expired list. Obviously request expiration strategies won't make the disk - go faster. The result basically equates to the timeslice a single reader - gets in the presence of other IO. 100*((seek time / read_expire) + 1) is - very roughly the % streaming read efficiency your disk should get with - multiple readers. - -* read_batch_expire - Controls how much time a batch of reads is given before pending writes are - served. A higher value is more efficient. This might be set below read_expire - if writes are to be given higher priority than reads, but reads are to be - as efficient as possible when there are no writes. Generally though, it - should be some multiple of read_expire. - -* write_expire, and -* write_batch_expire are equivalent to the above, for writes. - -* antic_expire - Controls the maximum amount of time we can anticipate a good read (one - with a short seek distance from the most recently completed request) before - giving up. Many other factors may cause anticipation to be stopped early, - or some processes will not be "anticipated" at all. Should be a bit higher - for big seek time devices though not a linear correspondence - most - processes have only a few ms thinktime. - -In addition to the tunables above there is a read-only file named est_time -which, when read, will show: - - - The probability of a task exiting without a cooperating task - submitting an anticipated IO. - - - The current mean think time. - - - The seek distance used to determine if an incoming IO is better. - -- cgit v1.1 From 360b6e5cab1cea1d838b0100956ce0d3dbccbb6f Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 18 Dec 2009 15:16:56 -0800 Subject: Documentation: Update mmiotrace.txt Fix typos, spellos, hyphenation, line lengths. BTW: are there some userspace tools? There is a reference to some at the wiki page, but there are no tools listed there. Signed-off-by: Randy Dunlap Acked-by: Pekka Paalanen LKML-Reference: <4B2C0D68.6080401@oracle.com> Signed-off-by: Ingo Molnar --- Documentation/trace/mmiotrace.txt | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/trace/mmiotrace.txt b/Documentation/trace/mmiotrace.txt index 162effb..664e738 100644 --- a/Documentation/trace/mmiotrace.txt +++ b/Documentation/trace/mmiotrace.txt @@ -44,7 +44,8 @@ Check for lost events. Usage ----- -Make sure debugfs is mounted to /sys/kernel/debug. If not, (requires root privileges) +Make sure debugfs is mounted to /sys/kernel/debug. +If not (requires root privileges): $ mount -t debugfs debugfs /sys/kernel/debug Check that the driver you are about to trace is not loaded. @@ -91,7 +92,7 @@ $ dmesg > dmesg.txt $ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt and then send the .tar.gz file. The trace compresses considerably. Replace "pciid" and "nick" with the PCI ID or model name of your piece of hardware -under investigation and your nick name. +under investigation and your nickname. How Mmiotrace Works @@ -100,7 +101,7 @@ How Mmiotrace Works Access to hardware IO-memory is gained by mapping addresses from PCI bus by calling one of the ioremap_*() functions. Mmiotrace is hooked into the __ioremap() function and gets called whenever a mapping is created. Mapping is -an event that is recorded into the trace log. Note, that ISA range mappings +an event that is recorded into the trace log. Note that ISA range mappings are not caught, since the mapping always exists and is returned directly. MMIO accesses are recorded via page faults. Just before __ioremap() returns, @@ -122,11 +123,11 @@ Trace Log Format ---------------- The raw log is text and easily filtered with e.g. grep and awk. One record is -one line in the log. A record starts with a keyword, followed by keyword -dependant arguments. Arguments are separated by a space, or continue until the +one line in the log. A record starts with a keyword, followed by keyword- +dependent arguments. Arguments are separated by a space, or continue until the end of line. The format for version 20070824 is as follows: -Explanation Keyword Space separated arguments +Explanation Keyword Space-separated arguments --------------------------------------------------------------------------- read event R width, timestamp, map id, physical, value, PC, PID @@ -136,7 +137,7 @@ iounmap event UNMAP timestamp, map id, PC, PID marker MARK timestamp, text version VERSION the string "20070824" info for reader LSPCI one line from lspci -v -PCI address map PCIDEV space separated /proc/bus/pci/devices data +PCI address map PCIDEV space-separated /proc/bus/pci/devices data unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual -- cgit v1.1 From b41df645c829d961068aecd30909c2675acbaaea Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 18 Dec 2009 15:17:04 -0800 Subject: Documentation: Update tracepoint-analysis.txt Fix grammar, spelling, punctuation, hyphenation, section numbering. Tell what PCL means. Signed-off-by: Randy Dunlap Cc: Mel Gorman Cc: Steven Rostedt LKML-Reference: <4B2C0D70.4030707@oracle.com> Signed-off-by: Ingo Molnar --- Documentation/trace/tracepoint-analysis.txt | 60 ++++++++++++++--------------- 1 file changed, 30 insertions(+), 30 deletions(-) (limited to 'Documentation') diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt index 5eb4e48..87bee3c 100644 --- a/Documentation/trace/tracepoint-analysis.txt +++ b/Documentation/trace/tracepoint-analysis.txt @@ -10,8 +10,8 @@ Tracepoints (see Documentation/trace/tracepoints.txt) can be used without creating custom kernel modules to register probe functions using the event tracing infrastructure. -Simplistically, tracepoints will represent an important event that when can -be taken in conjunction with other tracepoints to build a "Big Picture" of +Simplistically, tracepoints represent important events that can be +taken in conjunction with other tracepoints to build a "Big Picture" of what is going on within the system. There are a large number of methods for gathering and interpreting these events. Lacking any current Best Practises, this document describes some of the methods that can be used. @@ -33,12 +33,12 @@ calling will give a fair indication of the number of events available. -2.2 PCL +2.2 PCL (Performance Counters for Linux) ------- -Discovery and enumeration of all counters and events, including tracepoints +Discovery and enumeration of all counters and events, including tracepoints, are available with the perf tool. Getting a list of available events is a -simple case of +simple case of: $ perf list 2>&1 | grep Tracepoint ext4:ext4_free_inode [Tracepoint event] @@ -49,19 +49,19 @@ simple case of [ .... remaining output snipped .... ] -2. Enabling Events +3. Enabling Events ================== -2.1 System-Wide Event Enabling +3.1 System-Wide Event Enabling ------------------------------ See Documentation/trace/events.txt for a proper description on how events can be enabled system-wide. A short example of enabling all events related -to page allocation would look something like +to page allocation would look something like: $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done -2.2 System-Wide Event Enabling with SystemTap +3.2 System-Wide Event Enabling with SystemTap --------------------------------------------- In SystemTap, tracepoints are accessible using the kernel.trace() function @@ -86,7 +86,7 @@ were allocating the pages. print_count() } -2.3 System-Wide Event Enabling with PCL +3.3 System-Wide Event Enabling with PCL --------------------------------------- By specifying the -a switch and analysing sleep, the system-wide events @@ -107,16 +107,16 @@ for a duration of time can be examined. Similarly, one could execute a shell and exit it as desired to get a report at that point. -2.4 Local Event Enabling +3.4 Local Event Enabling ------------------------ Documentation/trace/ftrace.txt describes how to enable events on a per-thread basis using set_ftrace_pid. -2.5 Local Event Enablement with PCL +3.5 Local Event Enablement with PCL ----------------------------------- -Events can be activate and tracked for the duration of a process on a local +Events can be activated and tracked for the duration of a process on a local basis using PCL such as follows. $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ @@ -131,18 +131,18 @@ basis using PCL such as follows. 0.973913387 seconds time elapsed -3. Event Filtering +4. Event Filtering ================== Documentation/trace/ftrace.txt covers in-depth how to filter events in ftrace. Obviously using grep and awk of trace_pipe is an option as well as any script reading trace_pipe. -4. Analysing Event Variances with PCL +5. Analysing Event Variances with PCL ===================================== Any workload can exhibit variances between runs and it can be important -to know what the standard deviation in. By and large, this is left to the +to know what the standard deviation is. By and large, this is left to the performance analyst to do it by hand. In the event that the discrete event occurrences are useful to the performance analyst, then perf can be used. @@ -166,7 +166,7 @@ In the event that some higher-level event is required that depends on some aggregation of discrete events, then a script would need to be developed. Using --repeat, it is also possible to view how events are fluctuating over -time on a system wide basis using -a and sleep. +time on a system-wide basis using -a and sleep. $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ -e kmem:mm_pagevec_free \ @@ -180,7 +180,7 @@ time on a system wide basis using -a and sleep. 1.002251757 seconds time elapsed ( +- 0.005% ) -5. Higher-Level Analysis with Helper Scripts +6. Higher-Level Analysis with Helper Scripts ============================================ When events are enabled the events that are triggering can be read from @@ -190,11 +190,11 @@ be gathered on-line as appropriate. Examples of post-processing might include o Reading information from /proc for the PID that triggered the event o Deriving a higher-level event from a series of lower-level events. - o Calculate latencies between two events + o Calculating latencies between two events Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example script that can read trace_pipe from STDIN or a copy of a trace. When used -on-line, it can be interrupted once to generate a report without existing +on-line, it can be interrupted once to generate a report without exiting and twice to exit. Simplistically, the script just reads STDIN and counts up events but it @@ -212,12 +212,12 @@ also can do more such as processes, the parent process responsible for creating all the helpers can be identified -6. Lower-Level Analysis with PCL +7. Lower-Level Analysis with PCL ================================ -There may also be a requirement to identify what functions with a program +There may also be a requirement to identify what functions within a program were generating events within the kernel. To begin this sort of analysis, the -data must be recorded. At the time of writing, this required root +data must be recorded. At the time of writing, this required root: $ perf record -c 1 \ -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ @@ -253,11 +253,11 @@ perf report. # (For more details, try: perf report --sort comm,dso,symbol) # -According to this, the vast majority of events occured triggered on events -within the VDSO. With simple binaries, this will often be the case so lets +According to this, the vast majority of events triggered on events +within the VDSO. With simple binaries, this will often be the case so let's take a slightly different example. In the course of writing this, it was -noticed that X was generating an insane amount of page allocations so lets look -at it +noticed that X was generating an insane amount of page allocations so let's look +at it: $ perf record -c 1 -f \ -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ @@ -280,8 +280,8 @@ This was interrupted after a few seconds and # (For more details, try: perf report --sort comm,dso,symbol) # -So, almost half of the events are occuring in a library. To get an idea which -symbol. +So, almost half of the events are occurring in a library. To get an idea which +symbol: $ perf report --sort comm,dso,symbol # Samples: 27666 @@ -297,7 +297,7 @@ symbol. 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path 0.00% Xorg [kernel] [k] ftrace_trace_userstack -To see where within the function pixmanFillsse2 things are going wrong +To see where within the function pixmanFillsse2 things are going wrong: $ perf annotate pixmanFillsse2 [ ... ] -- cgit v1.1 From 7e25f44cbf8d95a9748fdfd19c06145f19fd10e3 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 18 Dec 2009 15:17:12 -0800 Subject: Documentation: Update ftrace-design.txt Correct grammos. Spell out words. Add missing words. Consistent use of "mcount()" function name. Signed-off-by: Randy Dunlap Acked-by: Steven Rostedt LKML-Reference: <4B2C0D78.6060707@oracle.com> Signed-off-by: Ingo Molnar --- Documentation/trace/ftrace-design.txt | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt index 641a1ef..239f14b 100644 --- a/Documentation/trace/ftrace-design.txt +++ b/Documentation/trace/ftrace-design.txt @@ -53,14 +53,14 @@ size of the mcount call that is embedded in the function). For example, if the function foo() calls bar(), when the bar() function calls mcount(), the arguments mcount() will pass to the tracer are: "frompc" - the address bar() will use to return to foo() - "selfpc" - the address bar() (with _mcount() size adjustment) + "selfpc" - the address bar() (with mcount() size adjustment) Also keep in mind that this mcount function will be called *a lot*, so optimizing for the default case of no tracer will help the smooth running of your system when tracing is disabled. So the start of the mcount function is -typically the bare min with checking things before returning. That also means -the code flow should usually kept linear (i.e. no branching in the nop case). -This is of course an optimization and not a hard requirement. +typically the bare minimum with checking things before returning. That also +means the code flow should usually be kept linear (i.e. no branching in the nop +case). This is of course an optimization and not a hard requirement. Here is some pseudo code that should help (these functions should actually be implemented in assembly): @@ -131,10 +131,10 @@ some functions to save (hijack) and restore the return address. The mcount function should check the function pointers ftrace_graph_return (compare to ftrace_stub) and ftrace_graph_entry (compare to -ftrace_graph_entry_stub). If either of those are not set to the relevant stub +ftrace_graph_entry_stub). If either of those is not set to the relevant stub function, call the arch-specific function ftrace_graph_caller which in turn calls the arch-specific function prepare_ftrace_return. Neither of these -function names are strictly required, but you should use them anyways to stay +function names is strictly required, but you should use them anyway to stay consistent across the architecture ports -- easier to compare & contrast things. @@ -144,7 +144,7 @@ but the first argument should be a pointer to the "frompc". Typically this is located on the stack. This allows the function to hijack the return address temporarily to have it point to the arch-specific function return_to_handler. That function will simply call the common ftrace_return_to_handler function and -that will return the original return address with which, you can return to the +that will return the original return address with which you can return to the original call site. Here is the updated mcount pseudo code: -- cgit v1.1 From 1a5ba2e9fc7999b8de2a71c7e7b9f58d752c05e4 Mon Sep 17 00:00:00 2001 From: Rafael Avila de Espindola Date: Tue, 22 Dec 2009 07:59:37 +0100 Subject: ALSA: hda - Add support for the new 27 inch IMacs With the attached patch I am able to use the sound on a new IMac 27. What works: *) Internal speakers *) Internal microphone *) Headphone I don't have an external mic or a SPDIF device to test the rest. Signed-off-by: Rafael Avila de Espindola Signed-off-by: Takashi Iwai --- Documentation/sound/alsa/HD-Audio-Models.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/sound/alsa/HD-Audio-Models.txt b/Documentation/sound/alsa/HD-Audio-Models.txt index e93afff..e72cee9 100644 --- a/Documentation/sound/alsa/HD-Audio-Models.txt +++ b/Documentation/sound/alsa/HD-Audio-Models.txt @@ -403,4 +403,5 @@ STAC9872 Cirrus Logic CS4206/4207 ======================== mbp55 MacBook Pro 5,5 + imac27 IMac 27 Inch auto BIOS setup (default) -- cgit v1.1 From a6ab7aa9f432f722808c6fea5a8b7f5f229de031 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 22 Dec 2009 20:43:17 +0100 Subject: PM / Runtime: Use device type and device class callbacks The power management of some devices is handled through device types and device classes rather than through bus types. Since these devices may also benefit from using the run-time power management core, extend it so that the device type and device class run-time PM callbacks can be taken into consideration by it if the bus type callback is not defined. Update the run-time PM core documentation to reflect this change. Signed-off-by: Rafael J. Wysocki --- Documentation/power/runtime_pm.txt | 173 +++++++++++++++++++------------------ 1 file changed, 87 insertions(+), 86 deletions(-) (limited to 'Documentation') diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt index 4a3109b..7b5ab27 100644 --- a/Documentation/power/runtime_pm.txt +++ b/Documentation/power/runtime_pm.txt @@ -42,80 +42,81 @@ struct dev_pm_ops { ... }; -The ->runtime_suspend() callback is executed by the PM core for the bus type of -the device being suspended. The bus type's callback is then _entirely_ -_responsible_ for handling the device as appropriate, which may, but need not -include executing the device driver's own ->runtime_suspend() callback (from the +The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks are +executed by the PM core for either the bus type, or device type (if the bus +type's callback is not defined), or device class (if the bus type's and device +type's callbacks are not defined) of given device. The bus type, device type +and device class callbacks are referred to as subsystem-level callbacks in what +follows. + +The subsystem-level suspend callback is _entirely_ _responsible_ for handling +the suspend of the device as appropriate, which may, but need not include +executing the device driver's own ->runtime_suspend() callback (from the PM core's point of view it is not necessary to implement a ->runtime_suspend() -callback in a device driver as long as the bus type's ->runtime_suspend() knows -what to do to handle the device). +callback in a device driver as long as the subsystem-level suspend callback +knows what to do to handle the device). - * Once the bus type's ->runtime_suspend() callback has completed successfully + * Once the subsystem-level suspend callback has completed successfully for given device, the PM core regards the device as suspended, which need not mean that the device has been put into a low power state. It is supposed to mean, however, that the device will not process data and will - not communicate with the CPU(s) and RAM until its bus type's - ->runtime_resume() callback is executed for it. The run-time PM status of - a device after successful execution of its bus type's ->runtime_suspend() - callback is 'suspended'. - - * If the bus type's ->runtime_suspend() callback returns -EBUSY or -EAGAIN, - the device's run-time PM status is supposed to be 'active', which means that - the device _must_ be fully operational afterwards. - - * If the bus type's ->runtime_suspend() callback returns an error code - different from -EBUSY or -EAGAIN, the PM core regards this as a fatal - error and will refuse to run the helper functions described in Section 4 - for the device, until the status of it is directly set either to 'active' - or to 'suspended' (the PM core provides special helper functions for this - purpose). - -In particular, if the driver requires remote wakeup capability for proper -functioning and device_run_wake() returns 'false' for the device, then -->runtime_suspend() should return -EBUSY. On the other hand, if -device_run_wake() returns 'true' for the device and the device is put -into a low power state during the execution of its bus type's -->runtime_suspend(), it is expected that remote wake-up (i.e. hardware mechanism -allowing the device to request a change of its power state, such as PCI PME) -will be enabled for the device. Generally, remote wake-up should be enabled -for all input devices put into a low power state at run time. - -The ->runtime_resume() callback is executed by the PM core for the bus type of -the device being woken up. The bus type's callback is then _entirely_ -_responsible_ for handling the device as appropriate, which may, but need not -include executing the device driver's own ->runtime_resume() callback (from the -PM core's point of view it is not necessary to implement a ->runtime_resume() -callback in a device driver as long as the bus type's ->runtime_resume() knows -what to do to handle the device). - - * Once the bus type's ->runtime_resume() callback has completed successfully, - the PM core regards the device as fully operational, which means that the - device _must_ be able to complete I/O operations as needed. The run-time - PM status of the device is then 'active'. - - * If the bus type's ->runtime_resume() callback returns an error code, the PM - core regards this as a fatal error and will refuse to run the helper - functions described in Section 4 for the device, until its status is - directly set either to 'active' or to 'suspended' (the PM core provides - special helper functions for this purpose). - -The ->runtime_idle() callback is executed by the PM core for the bus type of -given device whenever the device appears to be idle, which is indicated to the -PM core by two counters, the device's usage counter and the counter of 'active' -children of the device. + not communicate with the CPU(s) and RAM until the subsystem-level resume + callback is executed for it. The run-time PM status of a device after + successful execution of the subsystem-level suspend callback is 'suspended'. + + * If the subsystem-level suspend callback returns -EBUSY or -EAGAIN, + the device's run-time PM status is 'active', which means that the device + _must_ be fully operational afterwards. + + * If the subsystem-level suspend callback returns an error code different + from -EBUSY or -EAGAIN, the PM core regards this as a fatal error and will + refuse to run the helper functions described in Section 4 for the device, + until the status of it is directly set either to 'active', or to 'suspended' + (the PM core provides special helper functions for this purpose). + +In particular, if the driver requires remote wake-up capability (i.e. hardware +mechanism allowing the device to request a change of its power state, such as +PCI PME) for proper functioning and device_run_wake() returns 'false' for the +device, then ->runtime_suspend() should return -EBUSY. On the other hand, if +device_run_wake() returns 'true' for the device and the device is put into a low +power state during the execution of the subsystem-level suspend callback, it is +expected that remote wake-up will be enabled for the device. Generally, remote +wake-up should be enabled for all input devices put into a low power state at +run time. + +The subsystem-level resume callback is _entirely_ _responsible_ for handling the +resume of the device as appropriate, which may, but need not include executing +the device driver's own ->runtime_resume() callback (from the PM core's point of +view it is not necessary to implement a ->runtime_resume() callback in a device +driver as long as the subsystem-level resume callback knows what to do to handle +the device). + + * Once the subsystem-level resume callback has completed successfully, the PM + core regards the device as fully operational, which means that the device + _must_ be able to complete I/O operations as needed. The run-time PM status + of the device is then 'active'. + + * If the subsystem-level resume callback returns an error code, the PM core + regards this as a fatal error and will refuse to run the helper functions + described in Section 4 for the device, until its status is directly set + either to 'active' or to 'suspended' (the PM core provides special helper + functions for this purpose). + +The subsystem-level idle callback is executed by the PM core whenever the device +appears to be idle, which is indicated to the PM core by two counters, the +device's usage counter and the counter of 'active' children of the device. * If any of these counters is decreased using a helper function provided by the PM core and it turns out to be equal to zero, the other counter is checked. If that counter also is equal to zero, the PM core executes the - device bus type's ->runtime_idle() callback (with the device as an - argument). + subsystem-level idle callback with the device as an argument. -The action performed by a bus type's ->runtime_idle() callback is totally -dependent on the bus type in question, but the expected and recommended action -is to check if the device can be suspended (i.e. if all of the conditions -necessary for suspending the device are satisfied) and to queue up a suspend -request for the device in that case. The value returned by this callback is -ignored by the PM core. +The action performed by a subsystem-level idle callback is totally dependent on +the subsystem in question, but the expected and recommended action is to check +if the device can be suspended (i.e. if all of the conditions necessary for +suspending the device are satisfied) and to queue up a suspend request for the +device in that case. The value returned by this callback is ignored by the PM +core. The helper functions provided by the PM core, described in Section 4, guarantee that the following constraints are met with respect to the bus type's run-time @@ -238,41 +239,41 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h: removing the device from device hierarchy int pm_runtime_idle(struct device *dev); - - execute ->runtime_idle() for the device's bus type; returns 0 on success - or error code on failure, where -EINPROGRESS means that ->runtime_idle() - is already being executed + - execute the subsystem-level idle callback for the device; returns 0 on + success or error code on failure, where -EINPROGRESS means that + ->runtime_idle() is already being executed int pm_runtime_suspend(struct device *dev); - - execute ->runtime_suspend() for the device's bus type; returns 0 on + - execute the subsystem-level suspend callback for the device; returns 0 on success, 1 if the device's run-time PM status was already 'suspended', or error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt to suspend the device again in future int pm_runtime_resume(struct device *dev); - - execute ->runtime_resume() for the device's bus type; returns 0 on + - execute the subsystem-leve resume callback for the device; returns 0 on success, 1 if the device's run-time PM status was already 'active' or error code on failure, where -EAGAIN means it may be safe to attempt to resume the device again in future, but 'power.runtime_error' should be checked additionally int pm_request_idle(struct device *dev); - - submit a request to execute ->runtime_idle() for the device's bus type - (the request is represented by a work item in pm_wq); returns 0 on success - or error code if the request has not been queued up + - submit a request to execute the subsystem-level idle callback for the + device (the request is represented by a work item in pm_wq); returns 0 on + success or error code if the request has not been queued up int pm_schedule_suspend(struct device *dev, unsigned int delay); - - schedule the execution of ->runtime_suspend() for the device's bus type - in future, where 'delay' is the time to wait before queuing up a suspend - work item in pm_wq, in milliseconds (if 'delay' is zero, the work item is - queued up immediately); returns 0 on success, 1 if the device's PM + - schedule the execution of the subsystem-level suspend callback for the + device in future, where 'delay' is the time to wait before queuing up a + suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work + item is queued up immediately); returns 0 on success, 1 if the device's PM run-time status was already 'suspended', or error code if the request hasn't been scheduled (or queued up if 'delay' is 0); if the execution of ->runtime_suspend() is already scheduled and not yet expired, the new value of 'delay' will be used as the time to wait int pm_request_resume(struct device *dev); - - submit a request to execute ->runtime_resume() for the device's bus type - (the request is represented by a work item in pm_wq); returns 0 on + - submit a request to execute the subsystem-level resume callback for the + device (the request is represented by a work item in pm_wq); returns 0 on success, 1 if the device's run-time PM status was already 'active', or error code if the request hasn't been queued up @@ -303,12 +304,12 @@ drivers/base/power/runtime.c and include/linux/pm_runtime.h: run-time PM callbacks described in Section 2 int pm_runtime_disable(struct device *dev); - - prevent the run-time PM helper functions from running the device bus - type's run-time PM callbacks, make sure that all of the pending run-time - PM operations on the device are either completed or canceled; returns - 1 if there was a resume request pending and it was necessary to execute - ->runtime_resume() for the device's bus type to satisfy that request, - otherwise 0 is returned + - prevent the run-time PM helper functions from running subsystem-level + run-time PM callbacks for the device, make sure that all of the pending + run-time PM operations on the device are either completed or canceled; + returns 1 if there was a resume request pending and it was necessary to + execute the subsystem-level resume callback for the device to satisfy that + request, otherwise 0 is returned void pm_suspend_ignore_children(struct device *dev, bool enable); - set/unset the power.ignore_children flag of the device @@ -378,5 +379,5 @@ pm_runtime_suspend() or pm_runtime_idle() or their asynchronous counterparts, they will fail returning -EAGAIN, because the device's usage counter is incremented by the core before executing ->probe() and ->remove(). Still, it may be desirable to suspend the device as soon as ->probe() or ->remove() has -finished, so the PM core uses pm_runtime_idle_sync() to invoke the device bus -type's ->runtime_idle() callback at that time. +finished, so the PM core uses pm_runtime_idle_sync() to invoke the +subsystem-level idle callback for the device at that time. -- cgit v1.1 From f1212ae1332b95fe3bafa7b5c063dd8e473247cf Mon Sep 17 00:00:00 2001 From: Alan Stern Date: Tue, 22 Dec 2009 20:43:40 +0100 Subject: PM: Runtime PM documentation update This patch (as1318) updates the runtime PM documentation, adding a section discussing the interaction between runtime PM and system sleep. [rjw: Rebased and made it agree with the other updates better.] Signed-off-by: Alan Stern Signed-off-by: Rafael J. Wysocki --- Documentation/power/runtime_pm.txt | 50 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 50 insertions(+) (limited to 'Documentation') diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt index 7b5ab27..356fd86 100644 --- a/Documentation/power/runtime_pm.txt +++ b/Documentation/power/runtime_pm.txt @@ -381,3 +381,53 @@ incremented by the core before executing ->probe() and ->remove(). Still, it may be desirable to suspend the device as soon as ->probe() or ->remove() has finished, so the PM core uses pm_runtime_idle_sync() to invoke the subsystem-level idle callback for the device at that time. + +6. Run-time PM and System Sleep + +Run-time PM and system sleep (i.e., system suspend and hibernation, also known +as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of +ways. If a device is active when a system sleep starts, everything is +straightforward. But what should happen if the device is already suspended? + +The device may have different wake-up settings for run-time PM and system sleep. +For example, remote wake-up may be enabled for run-time suspend but disallowed +for system sleep (device_may_wakeup(dev) returns 'false'). When this happens, +the subsystem-level system suspend callback is responsible for changing the +device's wake-up setting (it may leave that to the device driver's system +suspend routine). It may be necessary to resume the device and suspend it again +in order to do so. The same is true if the driver uses different power levels +or other settings for run-time suspend and system sleep. + +During system resume, devices generally should be brought back to full power, +even if they were suspended before the system sleep began. There are several +reasons for this, including: + + * The device might need to switch power levels, wake-up settings, etc. + + * Remote wake-up events might have been lost by the firmware. + + * The device's children may need the device to be at full power in order + to resume themselves. + + * The driver's idea of the device state may not agree with the device's + physical state. This can happen during resume from hibernation. + + * The device might need to be reset. + + * Even though the device was suspended, if its usage counter was > 0 then most + likely it would need a run-time resume in the near future anyway. + + * Always going back to full power is simplest. + +If the device was suspended before the sleep began, then its run-time PM status +will have to be updated to reflect the actual post-system sleep status. The way +to do this is: + + pm_runtime_disable(dev); + pm_runtime_set_active(dev); + pm_runtime_enable(dev); + +The PM core always increments the run-time usage counter before calling the +->prepare() callback and decrements it after calling the ->complete() callback. +Hence disabling run-time PM temporarily like this will not cause any run-time +suspend callbacks to be lost. -- cgit v1.1 From 2ec91eec47f713e3d158ba5b28a24a85a2cf3650 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Mon, 21 Dec 2009 14:37:23 -0800 Subject: mm tracing: cleanup Documentation/trace/events-kmem.txt Clean up typos/grammos/spellos in events-kmem.txt. Signed-off-by: Randy Dunlap Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/trace/events-kmem.txt | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt index 6ef2a86..aa82ee4 100644 --- a/Documentation/trace/events-kmem.txt +++ b/Documentation/trace/events-kmem.txt @@ -1,7 +1,7 @@ Subsystem Trace Points: kmem -The tracing system kmem captures events related to object and page allocation -within the kernel. Broadly speaking there are four major subheadings. +The kmem tracing system captures events related to object and page allocation +within the kernel. Broadly speaking there are five major subheadings. o Slab allocation of small objects of unknown type (kmalloc) o Slab allocation of small objects of known type @@ -9,7 +9,7 @@ within the kernel. Broadly speaking there are four major subheadings. o Per-CPU Allocator Activity o External Fragmentation -This document will describe what each of the tracepoints are and why they +This document describes what each of the tracepoints is and why they might be useful. 1. Slab allocation of small objects of unknown type @@ -34,7 +34,7 @@ kmem_cache_free call_site=%lx ptr=%p These events are similar in usage to the kmalloc-related events except that it is likely easier to pin the event down to a specific cache. At the time of writing, no information is available on what slab is being allocated from, -but the call_site can usually be used to extrapolate that information +but the call_site can usually be used to extrapolate that information. 3. Page allocation ================== @@ -80,9 +80,9 @@ event indicating whether it is for a percpu_refill or not. When the per-CPU list is too full, a number of pages are freed, each one which triggers a mm_page_pcpu_drain event. -The individual nature of the events are so that pages can be tracked +The individual nature of the events is so that pages can be tracked between allocation and freeing. A number of drain or refill pages that occur -consecutively imply the zone->lock being taken once. Large amounts of PCP +consecutively imply the zone->lock being taken once. Large amounts of per-CPU refills and drains could imply an imbalance between CPUs where too much work is being concentrated in one place. It could also indicate that the per-CPU lists should be a larger size. Finally, large amounts of refills on one CPU @@ -102,6 +102,6 @@ is important. Large numbers of this event implies that memory is fragmenting and high-order allocations will start failing at some time in the future. One -means of reducing the occurange of this event is to increase the size of +means of reducing the occurrence of this event is to increase the size of min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where pageblock_size is usually the size of the default hugepage size. -- cgit v1.1 From 8e9b9362266dd16255473c080d846b13e27247bf Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Sun, 6 Dec 2009 12:24:31 +0100 Subject: Doc/stable rules: add new cherry-pick logic - it is possible to submit patches for the stable queue without sending them directly stable@kernel.org. If the tag (Cc: stable@kernel.org) is available in the sign-off area than hpa's script will filter them into the stable mailbox once it hits Linus' tree. - Patches which require others to be applied first can be also specified. This was discussued in http://lkml.org/lkml/2009/11/9/474 Signed-off-by: Sebastian Andrzej Siewior Cc: Brandon Philips Signed-off-by: Greg Kroah-Hartman --- Documentation/stable_kernel_rules.txt | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/stable_kernel_rules.txt b/Documentation/stable_kernel_rules.txt index a452227..5effa5b 100644 --- a/Documentation/stable_kernel_rules.txt +++ b/Documentation/stable_kernel_rules.txt @@ -26,13 +26,33 @@ Procedure for submitting patches to the -stable tree: - Send the patch, after verifying that it follows the above rules, to stable@kernel.org. + - To have the patch automatically included in the stable tree, add the + the tag + Cc: stable@kernel.org + in the sign-off area. Once the patch is merged it will be applied to + the stable tree without anything else needing to be done by the author + or subsystem maintainer. + - If the patch requires other patches as prerequisites which can be + cherry-picked than this can be specified in the following format in + the sign-off area: + + Cc: # .32.x: a1f84a3: sched: Check for idle + Cc: # .32.x: 1b9508f: sched: Rate-limit newidle + Cc: # .32.x: fd21073: sched: Fix affinity logic + Cc: # .32.x + Signed-off-by: Ingo Molnar + + The tag sequence has the meaning of: + git cherry-pick a1f84a3 + git cherry-pick 1b9508f + git cherry-pick fd21073 + git cherry-pick + - The sender will receive an ACK when the patch has been accepted into the queue, or a NAK if the patch is rejected. This response might take a few days, according to the developer's schedules. - If accepted, the patch will be added to the -stable queue, for review by other developers and by the relevant subsystem maintainer. - - If the stable@kernel.org address is added to a patch, when it goes into - Linus's tree it will automatically be emailed to the stable team. - Security patches should not be sent to this alias, but instead to the documented security@kernel.org address. -- cgit v1.1 From 26579ab70aa0e0ea434e6e100279d2f67c094431 Mon Sep 17 00:00:00 2001 From: Phil Carmody Date: Fri, 18 Dec 2009 15:34:19 +0200 Subject: Driver core: device_attribute parameters can often be const* Most device_attributes are const, and are begging to be put in a ro section. However, the create and remove file interfaces were failing to propagate the const promise which the only functions they call offer. Signed-off-by: Phil Carmody Signed-off-by: Greg Kroah-Hartman --- Documentation/filesystems/sysfs.txt | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt index b245d52..a4ac2b9 100644 --- a/Documentation/filesystems/sysfs.txt +++ b/Documentation/filesystems/sysfs.txt @@ -91,8 +91,8 @@ struct device_attribute { const char *buf, size_t count); }; -int device_create_file(struct device *, struct device_attribute *); -void device_remove_file(struct device *, struct device_attribute *); +int device_create_file(struct device *, const struct device_attribute *); +void device_remove_file(struct device *, const struct device_attribute *); It also defines this helper for defining device attributes: @@ -316,8 +316,8 @@ DEVICE_ATTR(_name, _mode, _show, _store); Creation/Removal: -int device_create_file(struct device *device, struct device_attribute * attr); -void device_remove_file(struct device * dev, struct device_attribute * attr); +int device_create_file(struct device *dev, const struct device_attribute * attr); +void device_remove_file(struct device *dev, const struct device_attribute * attr); - bus drivers (include/linux/device.h) -- cgit v1.1 From 099c2f21d8cf0724b85abb2c589d6276953781b7 Mon Sep 17 00:00:00 2001 From: Phil Carmody Date: Fri, 18 Dec 2009 15:34:21 +0200 Subject: Driver core: driver_attribute parameters can often be const* Many struct driver_attribute descriptors are purely read-only structures, and there's no need to change them. Therefore make the promise not to, which will let those descriptors be put in a ro section. Signed-off-by: Phil Carmody Signed-off-by: Greg Kroah-Hartman --- Documentation/driver-model/driver.txt | 4 ++-- Documentation/filesystems/sysfs.txt | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/driver-model/driver.txt b/Documentation/driver-model/driver.txt index 60120fb..d2cd6fb 100644 --- a/Documentation/driver-model/driver.txt +++ b/Documentation/driver-model/driver.txt @@ -226,5 +226,5 @@ struct driver_attribute driver_attr_debug; This can then be used to add and remove the attribute from the driver's directory using: -int driver_create_file(struct device_driver *, struct driver_attribute *); -void driver_remove_file(struct device_driver *, struct driver_attribute *); +int driver_create_file(struct device_driver *, const struct driver_attribute *); +void driver_remove_file(struct device_driver *, const struct driver_attribute *); diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt index a4ac2b9..931c806 100644 --- a/Documentation/filesystems/sysfs.txt +++ b/Documentation/filesystems/sysfs.txt @@ -358,7 +358,7 @@ DRIVER_ATTR(_name, _mode, _show, _store) Creation/Removal: -int driver_create_file(struct device_driver *, struct driver_attribute *); -void driver_remove_file(struct device_driver *, struct driver_attribute *); +int driver_create_file(struct device_driver *, const struct driver_attribute *); +void driver_remove_file(struct device_driver *, const struct driver_attribute *); -- cgit v1.1 From baf67741bf52b985e318bed3c4acadcda5351e08 Mon Sep 17 00:00:00 2001 From: Alan Stern Date: Tue, 8 Dec 2009 15:49:48 -0500 Subject: USB: power management documentation update This patch (as1313) updates the documentation concerning USB power management. Signed-off-by: Alan Stern Signed-off-by: Greg Kroah-Hartman --- Documentation/ABI/testing/sysfs-bus-usb | 18 ++++++++------- Documentation/usb/power-management.txt | 41 ++++++++++++--------------------- 2 files changed, 25 insertions(+), 34 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-bus-usb b/Documentation/ABI/testing/sysfs-bus-usb index deb6b48..a07c0f3 100644 --- a/Documentation/ABI/testing/sysfs-bus-usb +++ b/Documentation/ABI/testing/sysfs-bus-usb @@ -21,25 +21,27 @@ Contact: Alan Stern Description: Each USB device directory will contain a file named power/level. This file holds a power-level setting for - the device, one of "on", "auto", or "suspend". + the device, either "on" or "auto". "on" means that the device is not allowed to autosuspend, although normal suspends for system sleep will still be honored. "auto" means the device will autosuspend and autoresume in the usual manner, according to the - capabilities of its driver. "suspend" means the device - is forced into a suspended state and it will not autoresume - in response to I/O requests. However remote-wakeup requests - from the device may still be enabled (the remote-wakeup - setting is controlled separately by the power/wakeup - attribute). + capabilities of its driver. During normal use, devices should be left in the "auto" - level. The other levels are meant for administrative uses. + level. The "on" level is meant for administrative uses. If you want to suspend a device immediately but leave it free to wake up in response to I/O requests, you should write "0" to power/autosuspend. + Device not capable of proper suspend and resume should be + left in the "on" level. Although the USB spec requires + devices to support suspend/resume, many of them do not. + In fact so many don't that by default, the USB core + initializes all non-hub devices in the "on" level. Some + drivers may change this setting when they are bound. + What: /sys/bus/usb/devices/.../power/persist Date: May 2007 KernelVersion: 2.6.23 diff --git a/Documentation/usb/power-management.txt b/Documentation/usb/power-management.txt index c7c1dc2..3bf6818 100644 --- a/Documentation/usb/power-management.txt +++ b/Documentation/usb/power-management.txt @@ -71,12 +71,10 @@ being accessed through sysfs, then it definitely is idle. Forms of dynamic PM ------------------- -Dynamic suspends can occur in two ways: manual and automatic. -"Manual" means that the user has told the kernel to suspend a device, -whereas "automatic" means that the kernel has decided all by itself to -suspend a device. Automatic suspend is called "autosuspend" for -short. In general, a device won't be autosuspended unless it has been -idle for some minimum period of time, the so-called idle-delay time. +Dynamic suspends occur when the kernel decides to suspend an idle +device. This is called "autosuspend" for short. In general, a device +won't be autosuspended unless it has been idle for some minimum period +of time, the so-called idle-delay time. Of course, nothing the kernel does on its own initiative should prevent the computer or its devices from working properly. If a @@ -96,10 +94,11 @@ idle. We can categorize power management events in two broad classes: external and internal. External events are those triggered by some agent outside the USB stack: system suspend/resume (triggered by -userspace), manual dynamic suspend/resume (also triggered by -userspace), and remote wakeup (triggered by the device). Internal -events are those triggered within the USB stack: autosuspend and -autoresume. +userspace), manual dynamic resume (also triggered by userspace), and +remote wakeup (triggered by the device). Internal events are those +triggered within the USB stack: autosuspend and autoresume. Note that +all dynamic suspend events are internal; external agents are not +allowed to issue dynamic suspends. The user interface for dynamic PM @@ -145,9 +144,9 @@ relevant attribute files are: wakeup, level, and autosuspend. number of seconds the device should remain idle before the kernel will autosuspend it (the idle-delay time). The default is 2. 0 means to autosuspend as soon as - the device becomes idle, and -1 means never to - autosuspend. You can write a number to the file to - change the autosuspend idle-delay time. + the device becomes idle, and negative values mean + never to autosuspend. You can write a number to the + file to change the autosuspend idle-delay time. Writing "-1" to power/autosuspend and writing "on" to power/level do essentially the same thing -- they both prevent the device from being @@ -377,9 +376,9 @@ the device hasn't been idle for long enough, a delayed workqueue routine is automatically set up to carry out the operation when the autosuspend idle-delay has expired. -Autoresume attempts also can fail. This will happen if power/level is -set to "suspend" or if the device doesn't manage to resume properly. -Unlike autosuspend, there's no delay for an autoresume. +Autoresume attempts also can fail, although failure would mean that +the device is no longer present or operating properly. Unlike +autosuspend, there's no delay for an autoresume. Other parts of the driver interface @@ -527,13 +526,3 @@ succeed, it may still remain active and thus cause the system to resume as soon as the system suspend is complete. Or the remote wakeup may fail and get lost. Which outcome occurs depends on timing and on the hardware and firmware design. - -More interestingly, a device might undergo a manual resume or -autoresume during system suspend. With current kernels this shouldn't -happen, because manual resumes must be initiated by userspace and -autoresumes happen in response to I/O requests, but all user processes -and I/O should be quiescent during a system suspend -- thanks to the -freezer. However there are plans to do away with the freezer, which -would mean these things would become possible. If and when this comes -about, the USB core will carefully arrange matters so that either type -of resume will block until the entire system has resumed. -- cgit v1.1 From 6d3b82f2d31f22085e5711b28dddcb9fb3d97a25 Mon Sep 17 00:00:00 2001 From: Fang Wenqi Date: Thu, 24 Dec 2009 17:51:42 -0500 Subject: ext4: Update documentation to correct the inode_readahead_blks option name Per commit 240799cd, the option name for readahead should be inode_readahead_blks, not inode_readahead. Signed-off-by: Fang Wenqi Signed-off-by: "Theodore Ts'o" --- Documentation/filesystems/ext4.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index af6885c..e1def17 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -196,7 +196,7 @@ nobarrier This also requires an IO stack which can support also be used to enable or disable barriers, for consistency with other ext4 mount options. -inode_readahead=n This tuning parameter controls the maximum +inode_readahead_blks=n This tuning parameter controls the maximum number of inode table blocks that ext4's inode table readahead algorithm will pre-read into the buffer cache. The default value is 32 blocks. -- cgit v1.1 From fdfa68298816192d47fc7443d1e2f00fa1420985 Mon Sep 17 00:00:00 2001 From: Masanari Iida Date: Sun, 27 Dec 2009 00:09:45 +0900 Subject: ALSA: Fix a typo in Procfile.txt Fix a typo in Documentation/sound/alsa/Procfile.txt Signed-off-by Masanari Iida Signed-off-by: Takashi Iwai --- Documentation/sound/alsa/Procfile.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/sound/alsa/Procfile.txt b/Documentation/sound/alsa/Procfile.txt index 719a819..07301de 100644 --- a/Documentation/sound/alsa/Procfile.txt +++ b/Documentation/sound/alsa/Procfile.txt @@ -95,7 +95,7 @@ card*/pcm*/xrun_debug It takes an integer value, can be changed by writing to this file, such as - # cat 5 > /proc/asound/card0/pcm0p/xrun_debug + # echo 5 > /proc/asound/card0/pcm0p/xrun_debug The value consists of the following bit flags: bit 0 = Enable XRUN/jiffies debug messages -- cgit v1.1 From 169220f88f0f26f4450ac0bc8ff0f807b453ec58 Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Sat, 26 Dec 2009 22:52:16 -0200 Subject: thinkpad-acpi: update volume subdriver documentation Signed-off-by: Henrique de Moraes Holschuh Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 58 ++++++++++++++++++++++++++++----- 1 file changed, 50 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index 169091f..75afa12 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -1092,8 +1092,8 @@ WARNING: its level up and down at every change. -Volume control --------------- +Volume control (Console Audio control) +-------------------------------------- procfs: /proc/acpi/ibm/volume ALSA: "ThinkPad Console Audio Control", default ID: "ThinkPadEC" @@ -1110,9 +1110,53 @@ the desktop environment to just provide on-screen-display feedback. Software volume control should be done only in the main AC97/HDA mixer. -This feature allows volume control on ThinkPad models with a digital -volume knob (when available, not all models have it), as well as -mute/unmute control. The available commands are: + +About the ThinkPad Console Audio control: + +ThinkPads have a built-in amplifier and muting circuit that drives the +console headphone and speakers. This circuit is after the main AC97 +or HDA mixer in the audio path, and under exclusive control of the +firmware. + +ThinkPads have three special hotkeys to interact with the console +audio control: volume up, volume down and mute. + +It is worth noting that the normal way the mute function works (on +ThinkPads that do not have a "mute LED") is: + +1. Press mute to mute. It will *always* mute, you can press it as + many times as you want, and the sound will remain mute. + +2. Press either volume key to unmute the ThinkPad (it will _not_ + change the volume, it will just unmute). + +This is a very superior design when compared to the cheap software-only +mute-toggle solution found on normal consumer laptops: you can be +absolutely sure the ThinkPad will not make noise if you press the mute +button, no matter the previous state. + +The IBM ThinkPads, and the earlier Lenovo ThinkPads have variable-gain +amplifiers driving the speakers and headphone output, and the firmware +also handles volume control for the headphone and speakers on these +ThinkPads without any help from the operating system (this volume +control stage exists after the main AC97 or HDA mixer in the audio +path). + +The newer Lenovo models only have firmware mute control, and depend on +the main HDA mixer to do volume control (which is done by the operating +system). In this case, the volume keys are filtered out for unmute +key press (there are some firmware bugs in this area) and delivered as +normal key presses to the operating system (thinkpad-acpi is not +involved). + + +The ThinkPad-ACPI volume control: + +The preferred way to interact with the Console Audio control is the +ALSA interface. + +The legacy procfs interface allows one to read the current state, +and if volume control is enabled, accepts the following commands: echo up >/proc/acpi/ibm/volume echo down >/proc/acpi/ibm/volume @@ -1121,12 +1165,10 @@ mute/unmute control. The available commands are: echo 'level ' >/proc/acpi/ibm/volume The number range is 0 to 14 although not all of them may be -distinct. The unmute the volume after the mute command, use either the +distinct. To unmute the volume after the mute command, use either the up or down command (the level command will not unmute the volume), or the unmute command. -The current volume level and mute state is shown in the file. - You can use the volume_capabilities parameter to tell the driver whether your thinkpad has volume control or mute-only control: volume_capabilities=1 for mixers with mute and volume control, -- cgit v1.1 From dab4b911a5327859bb8f969249c6978c26cd4853 Mon Sep 17 00:00:00 2001 From: Jan Kiszka Date: Sun, 6 Dec 2009 18:24:15 +0100 Subject: KVM: x86: Extend KVM_SET_VCPU_EVENTS with selective updates User space may not want to overwrite asynchronously changing VCPU event states on write-back. So allow to skip nmi.pending and sipi_vector by setting corresponding bits in the flags field of kvm_vcpu_events. [avi: advertise the bits in KVM_GET_VCPU_EVENTS] Signed-off-by: Jan Kiszka Signed-off-by: Avi Kivity --- Documentation/kvm/api.txt | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt index e1a1141..2811e45 100644 --- a/Documentation/kvm/api.txt +++ b/Documentation/kvm/api.txt @@ -685,7 +685,7 @@ struct kvm_vcpu_events { __u8 pad; } nmi; __u32 sipi_vector; - __u32 flags; /* must be zero */ + __u32 flags; }; 4.30 KVM_SET_VCPU_EVENTS @@ -701,6 +701,14 @@ vcpu. See KVM_GET_VCPU_EVENTS for the data structure. +Fields that may be modified asynchronously by running VCPUs can be excluded +from the update. These fields are nmi.pending and sipi_vector. Keep the +corresponding bits in the flags field cleared to suppress overwriting the +current in-kernel state. The bits are: + +KVM_VCPUEVENT_VALID_NMI_PENDING - transfer nmi.pending to the kernel +KVM_VCPUEVENT_VALID_SIPI_VECTOR - transfer sipi_vector + 5. The kvm_run structure -- cgit v1.1 From d7f0eea9e431e1b8b0742a74db1a9490730b2a25 Mon Sep 17 00:00:00 2001 From: Zhang Rui Date: Wed, 30 Dec 2009 15:36:42 +0800 Subject: ACPI: introduce kernel parameter acpi_sleep=sci_force_enable Introduce kernel parameter acpi_sleep=sci_force_enable some laptop requires SCI_EN being set directly on resume, or else they hung somewhere in the resume code path. We already have a blacklist for these laptops but we still need this option, especially when debugging some suspend/resume problems, in case there are systems that need this workaround and are not yet in the blacklist. Signed-off-by: Zhang Rui Acked-by: Rafael J. Wysocki Signed-off-by: Len Brown --- Documentation/kernel-parameters.txt | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 5ba4d9d..736d456 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -240,7 +240,7 @@ and is between 256 and 4096 characters. It is defined in the file acpi_sleep= [HW,ACPI] Sleep options Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, - old_ordering, s4_nonvs } + old_ordering, s4_nonvs, sci_force_enable } See Documentation/power/video.txt for information on s3_bios and s3_mode. s3_beep is for debugging; it makes the PC's speaker beep @@ -253,6 +253,9 @@ and is between 256 and 4096 characters. It is defined in the file of _PTS is used by default). s4_nonvs prevents the kernel from saving/restoring the ACPI NVS memory during hibernation. + sci_force_enable causes the kernel to set SCI_EN directly + on resume from S1/S3 (which is against the ACPI spec, + but some broken systems don't work without it). acpi_use_timer_override [HW,ACPI] Use timer override. For some broken Nvidia NF5 boards -- cgit v1.1 From 6aff43f817ddc54fcd6f0215bfba5d334b0bbbbd Mon Sep 17 00:00:00 2001 From: Ryusuke Konishi Date: Sat, 2 Jan 2010 21:41:53 +0900 Subject: nilfs2: update mailing list address This replaces the list address for nilfs discussion to linux-nilfs at vger.kernel.org from users at nilfs.org. Signed-off-by: Ryusuke Konishi --- Documentation/filesystems/nilfs2.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt index 4949fca..839efd8 100644 --- a/Documentation/filesystems/nilfs2.txt +++ b/Documentation/filesystems/nilfs2.txt @@ -28,7 +28,7 @@ described in the man pages included in the package. Project web page: http://www.nilfs.org/en/ Download page: http://www.nilfs.org/en/download.html Git tree web page: http://www.nilfs.org/git/ -NILFS mailing lists: http://www.nilfs.org/mailman/listinfo/users +List info: http://vger.kernel.org/vger-lists.html#linux-nilfs Caveats ======= -- cgit v1.1 From 143724fd3d3c154009fe95846dcbf7afadca8ab1 Mon Sep 17 00:00:00 2001 From: H Hartley Sweeten Date: Fri, 1 Jan 2010 20:35:41 -0800 Subject: Documentation: fix ioremap return type ioremap() returns a void __iomem * not a char *. Update the documentation file to reflect this. Signed-off-by: H Hartley Sweeten Signed-off-by: Randy Dunlap Signed-off-by: Linus Torvalds --- Documentation/IO-mapping.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/IO-mapping.txt b/Documentation/IO-mapping.txt index 78a4406..1b5aa10 100644 --- a/Documentation/IO-mapping.txt +++ b/Documentation/IO-mapping.txt @@ -157,7 +157,7 @@ For such memory, you can do things like * access only the 640k-1MB area, so anything else * has to be remapped. */ - char * baseptr = ioremap(0xFC000000, 1024*1024); + void __iomem *baseptr = ioremap(0xFC000000, 1024*1024); /* write a 'A' to the offset 10 of the area */ writeb('A',baseptr+10); -- cgit v1.1 From 8d9f99c335ef66e4c44afe8f61816b0edeafba91 Mon Sep 17 00:00:00 2001 From: H Hartley Sweeten Date: Fri, 1 Jan 2010 20:35:54 -0800 Subject: DocBook: fix ioremap return type ioremap() returns a void __iomem * not an unsigned long. Update the Documentation file to reflect this. Signed-off-by: H Hartley Sweeten Signed-off-by: Randy Dunlap Cc: David Woodhouse Signed-off-by: Linus Torvalds --- Documentation/DocBook/mtdnand.tmpl | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/mtdnand.tmpl b/Documentation/DocBook/mtdnand.tmpl index f508a8a..5e7d84b 100644 --- a/Documentation/DocBook/mtdnand.tmpl +++ b/Documentation/DocBook/mtdnand.tmpl @@ -174,7 +174,7 @@ static struct mtd_info *board_mtd; -static unsigned long baseaddr; +static void __iomem *baseaddr; Static example @@ -182,7 +182,7 @@ static unsigned long baseaddr; static struct mtd_info board_mtd; static struct nand_chip board_chip; -static unsigned long baseaddr; +static void __iomem *baseaddr; @@ -283,8 +283,8 @@ int __init board_init (void) } /* map physical address */ - baseaddr = (unsigned long)ioremap(CHIP_PHYSICAL_ADDRESS, 1024); - if(!baseaddr){ + baseaddr = ioremap(CHIP_PHYSICAL_ADDRESS, 1024); + if (!baseaddr) { printk("Ioremap to access NAND chip failed\n"); err = -EIO; goto out_mtd; @@ -316,7 +316,7 @@ int __init board_init (void) goto out; out_ior: - iounmap((void *)baseaddr); + iounmap(baseaddr); out_mtd: kfree (board_mtd); out: @@ -341,7 +341,7 @@ static void __exit board_cleanup (void) nand_release (board_mtd); /* unmap physical address */ - iounmap((void *)baseaddr); + iounmap(baseaddr); /* Free the MTD device structure */ kfree (board_mtd); -- cgit v1.1 From 4207a152bc242effd0b8231143aa5b9f7a1593a9 Mon Sep 17 00:00:00 2001 From: Kusanagi Kouichi Date: Fri, 1 Jan 2010 20:36:09 -0800 Subject: Documentation: Rename Documentation/DMA-mapping.txt It seems that Documentation/DMA-mapping.txt was supposed to be renamed to Documentation/PCI/PCI-DMA-mapping.txt. Signed-off-by: Kusanagi Kouichi Signed-off-by: Randy Dunlap Signed-off-by: Linus Torvalds --- Documentation/DMA-mapping.txt | 766 ---------------------------------- Documentation/PCI/PCI-DMA-mapping.txt | 766 ++++++++++++++++++++++++++++++++++ Documentation/block/biodoc.txt | 2 +- 3 files changed, 767 insertions(+), 767 deletions(-) delete mode 100644 Documentation/DMA-mapping.txt create mode 100644 Documentation/PCI/PCI-DMA-mapping.txt (limited to 'Documentation') diff --git a/Documentation/DMA-mapping.txt b/Documentation/DMA-mapping.txt deleted file mode 100644 index ecad88d..0000000 --- a/Documentation/DMA-mapping.txt +++ /dev/null @@ -1,766 +0,0 @@ - Dynamic DMA mapping - =================== - - David S. Miller - Richard Henderson - Jakub Jelinek - -This document describes the DMA mapping system in terms of the pci_ -API. For a similar API that works for generic devices, see -DMA-API.txt. - -Most of the 64bit platforms have special hardware that translates bus -addresses (DMA addresses) into physical addresses. This is similar to -how page tables and/or a TLB translates virtual addresses to physical -addresses on a CPU. This is needed so that e.g. PCI devices can -access with a Single Address Cycle (32bit DMA address) any page in the -64bit physical address space. Previously in Linux those 64bit -platforms had to set artificial limits on the maximum RAM size in the -system, so that the virt_to_bus() static scheme works (the DMA address -translation tables were simply filled on bootup to map each bus -address to the physical page __pa(bus_to_virt())). - -So that Linux can use the dynamic DMA mapping, it needs some help from the -drivers, namely it has to take into account that DMA addresses should be -mapped only for the time they are actually used and unmapped after the DMA -transfer. - -The following API will work of course even on platforms where no such -hardware exists, see e.g. arch/x86/include/asm/pci.h for how it is implemented on -top of the virt_to_bus interface. - -First of all, you should make sure - -#include - -is in your driver. This file will obtain for you the definition of the -dma_addr_t (which can hold any valid DMA address for the platform) -type which should be used everywhere you hold a DMA (bus) address -returned from the DMA mapping functions. - - What memory is DMA'able? - -The first piece of information you must know is what kernel memory can -be used with the DMA mapping facilities. There has been an unwritten -set of rules regarding this, and this text is an attempt to finally -write them down. - -If you acquired your memory via the page allocator -(i.e. __get_free_page*()) or the generic memory allocators -(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from -that memory using the addresses returned from those routines. - -This means specifically that you may _not_ use the memory/addresses -returned from vmalloc() for DMA. It is possible to DMA to the -_underlying_ memory mapped into a vmalloc() area, but this requires -walking page tables to get the physical addresses, and then -translating each of those pages back to a kernel address using -something like __va(). [ EDIT: Update this when we integrate -Gerd Knorr's generic code which does this. ] - -This rule also means that you may use neither kernel image addresses -(items in data/text/bss segments), nor module image addresses, nor -stack addresses for DMA. These could all be mapped somewhere entirely -different than the rest of physical memory. Even if those classes of -memory could physically work with DMA, you'd need to ensure the I/O -buffers were cacheline-aligned. Without that, you'd see cacheline -sharing problems (data corruption) on CPUs with DMA-incoherent caches. -(The CPU could write to one word, DMA would write to a different one -in the same cache line, and one of them could be overwritten.) - -Also, this means that you cannot take the return of a kmap() -call and DMA to/from that. This is similar to vmalloc(). - -What about block I/O and networking buffers? The block I/O and -networking subsystems make sure that the buffers they use are valid -for you to DMA from/to. - - DMA addressing limitations - -Does your device have any DMA addressing limitations? For example, is -your device only capable of driving the low order 24-bits of address -on the PCI bus for SAC DMA transfers? If so, you need to inform the -PCI layer of this fact. - -By default, the kernel assumes that your device can address the full -32-bits in a SAC cycle. For a 64-bit DAC capable device, this needs -to be increased. And for a device with limitations, as discussed in -the previous paragraph, it needs to be decreased. - -pci_alloc_consistent() by default will return 32-bit DMA addresses. -PCI-X specification requires PCI-X devices to support 64-bit -addressing (DAC) for all transactions. And at least one platform (SGI -SN2) requires 64-bit consistent allocations to operate correctly when -the IO bus is in PCI-X mode. Therefore, like with pci_set_dma_mask(), -it's good practice to call pci_set_consistent_dma_mask() to set the -appropriate mask even if your device only supports 32-bit DMA -(default) and especially if it's a PCI-X device. - -For correct operation, you must interrogate the PCI layer in your -device probe routine to see if the PCI controller on the machine can -properly support the DMA addressing limitation your device has. It is -good style to do this even if your device holds the default setting, -because this shows that you did think about these issues wrt. your -device. - -The query is performed via a call to pci_set_dma_mask(): - - int pci_set_dma_mask(struct pci_dev *pdev, u64 device_mask); - -The query for consistent allocations is performed via a call to -pci_set_consistent_dma_mask(): - - int pci_set_consistent_dma_mask(struct pci_dev *pdev, u64 device_mask); - -Here, pdev is a pointer to the PCI device struct of your device, and -device_mask is a bit mask describing which bits of a PCI address your -device supports. It returns zero if your card can perform DMA -properly on the machine given the address mask you provided. - -If it returns non-zero, your device cannot perform DMA properly on -this platform, and attempting to do so will result in undefined -behavior. You must either use a different mask, or not use DMA. - -This means that in the failure case, you have three options: - -1) Use another DMA mask, if possible (see below). -2) Use some non-DMA mode for data transfer, if possible. -3) Ignore this device and do not initialize it. - -It is recommended that your driver print a kernel KERN_WARNING message -when you end up performing either #2 or #3. In this manner, if a user -of your driver reports that performance is bad or that the device is not -even detected, you can ask them for the kernel messages to find out -exactly why. - -The standard 32-bit addressing PCI device would do something like -this: - - if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32))) { - printk(KERN_WARNING - "mydev: No suitable DMA available.\n"); - goto ignore_this_device; - } - -Another common scenario is a 64-bit capable device. The approach -here is to try for 64-bit DAC addressing, but back down to a -32-bit mask should that fail. The PCI platform code may fail the -64-bit mask not because the platform is not capable of 64-bit -addressing. Rather, it may fail in this case simply because -32-bit SAC addressing is done more efficiently than DAC addressing. -Sparc64 is one platform which behaves in this way. - -Here is how you would handle a 64-bit capable device which can drive -all 64-bits when accessing streaming DMA: - - int using_dac; - - if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) { - using_dac = 1; - } else if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(32))) { - using_dac = 0; - } else { - printk(KERN_WARNING - "mydev: No suitable DMA available.\n"); - goto ignore_this_device; - } - -If a card is capable of using 64-bit consistent allocations as well, -the case would look like this: - - int using_dac, consistent_using_dac; - - if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) { - using_dac = 1; - consistent_using_dac = 1; - pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64)); - } else if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(32))) { - using_dac = 0; - consistent_using_dac = 0; - pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32)); - } else { - printk(KERN_WARNING - "mydev: No suitable DMA available.\n"); - goto ignore_this_device; - } - -pci_set_consistent_dma_mask() will always be able to set the same or a -smaller mask as pci_set_dma_mask(). However for the rare case that a -device driver only uses consistent allocations, one would have to -check the return value from pci_set_consistent_dma_mask(). - -Finally, if your device can only drive the low 24-bits of -address during PCI bus mastering you might do something like: - - if (pci_set_dma_mask(pdev, DMA_BIT_MASK(24))) { - printk(KERN_WARNING - "mydev: 24-bit DMA addressing not available.\n"); - goto ignore_this_device; - } - -When pci_set_dma_mask() is successful, and returns zero, the PCI layer -saves away this mask you have provided. The PCI layer will use this -information later when you make DMA mappings. - -There is a case which we are aware of at this time, which is worth -mentioning in this documentation. If your device supports multiple -functions (for example a sound card provides playback and record -functions) and the various different functions have _different_ -DMA addressing limitations, you may wish to probe each mask and -only provide the functionality which the machine can handle. It -is important that the last call to pci_set_dma_mask() be for the -most specific mask. - -Here is pseudo-code showing how this might be done: - - #define PLAYBACK_ADDRESS_BITS DMA_BIT_MASK(32) - #define RECORD_ADDRESS_BITS DMA_BIT_MASK(24) - - struct my_sound_card *card; - struct pci_dev *pdev; - - ... - if (!pci_set_dma_mask(pdev, PLAYBACK_ADDRESS_BITS)) { - card->playback_enabled = 1; - } else { - card->playback_enabled = 0; - printk(KERN_WARNING "%s: Playback disabled due to DMA limitations.\n", - card->name); - } - if (!pci_set_dma_mask(pdev, RECORD_ADDRESS_BITS)) { - card->record_enabled = 1; - } else { - card->record_enabled = 0; - printk(KERN_WARNING "%s: Record disabled due to DMA limitations.\n", - card->name); - } - -A sound card was used as an example here because this genre of PCI -devices seems to be littered with ISA chips given a PCI front end, -and thus retaining the 16MB DMA addressing limitations of ISA. - - Types of DMA mappings - -There are two types of DMA mappings: - -- Consistent DMA mappings which are usually mapped at driver - initialization, unmapped at the end and for which the hardware should - guarantee that the device and the CPU can access the data - in parallel and will see updates made by each other without any - explicit software flushing. - - Think of "consistent" as "synchronous" or "coherent". - - The current default is to return consistent memory in the low 32 - bits of the PCI bus space. However, for future compatibility you - should set the consistent mask even if this default is fine for your - driver. - - Good examples of what to use consistent mappings for are: - - - Network card DMA ring descriptors. - - SCSI adapter mailbox command data structures. - - Device firmware microcode executed out of - main memory. - - The invariant these examples all require is that any CPU store - to memory is immediately visible to the device, and vice - versa. Consistent mappings guarantee this. - - IMPORTANT: Consistent DMA memory does not preclude the usage of - proper memory barriers. The CPU may reorder stores to - consistent memory just as it may normal memory. Example: - if it is important for the device to see the first word - of a descriptor updated before the second, you must do - something like: - - desc->word0 = address; - wmb(); - desc->word1 = DESC_VALID; - - in order to get correct behavior on all platforms. - - Also, on some platforms your driver may need to flush CPU write - buffers in much the same way as it needs to flush write buffers - found in PCI bridges (such as by reading a register's value - after writing it). - -- Streaming DMA mappings which are usually mapped for one DMA transfer, - unmapped right after it (unless you use pci_dma_sync_* below) and for which - hardware can optimize for sequential accesses. - - This of "streaming" as "asynchronous" or "outside the coherency - domain". - - Good examples of what to use streaming mappings for are: - - - Networking buffers transmitted/received by a device. - - Filesystem buffers written/read by a SCSI device. - - The interfaces for using this type of mapping were designed in - such a way that an implementation can make whatever performance - optimizations the hardware allows. To this end, when using - such mappings you must be explicit about what you want to happen. - -Neither type of DMA mapping has alignment restrictions that come -from PCI, although some devices may have such restrictions. -Also, systems with caches that aren't DMA-coherent will work better -when the underlying buffers don't share cache lines with other data. - - - Using Consistent DMA mappings. - -To allocate and map large (PAGE_SIZE or so) consistent DMA regions, -you should do: - - dma_addr_t dma_handle; - - cpu_addr = pci_alloc_consistent(pdev, size, &dma_handle); - -where pdev is a struct pci_dev *. This may be called in interrupt context. -You should use dma_alloc_coherent (see DMA-API.txt) for buses -where devices don't have struct pci_dev (like ISA, EISA). - -This argument is needed because the DMA translations may be bus -specific (and often is private to the bus which the device is attached -to). - -Size is the length of the region you want to allocate, in bytes. - -This routine will allocate RAM for that region, so it acts similarly to -__get_free_pages (but takes size instead of a page order). If your -driver needs regions sized smaller than a page, you may prefer using -the pci_pool interface, described below. - -The consistent DMA mapping interfaces, for non-NULL pdev, will by -default return a DMA address which is SAC (Single Address Cycle) -addressable. Even if the device indicates (via PCI dma mask) that it -may address the upper 32-bits and thus perform DAC cycles, consistent -allocation will only return > 32-bit PCI addresses for DMA if the -consistent dma mask has been explicitly changed via -pci_set_consistent_dma_mask(). This is true of the pci_pool interface -as well. - -pci_alloc_consistent returns two values: the virtual address which you -can use to access it from the CPU and dma_handle which you pass to the -card. - -The cpu return address and the DMA bus master address are both -guaranteed to be aligned to the smallest PAGE_SIZE order which -is greater than or equal to the requested size. This invariant -exists (for example) to guarantee that if you allocate a chunk -which is smaller than or equal to 64 kilobytes, the extent of the -buffer you receive will not cross a 64K boundary. - -To unmap and free such a DMA region, you call: - - pci_free_consistent(pdev, size, cpu_addr, dma_handle); - -where pdev, size are the same as in the above call and cpu_addr and -dma_handle are the values pci_alloc_consistent returned to you. -This function may not be called in interrupt context. - -If your driver needs lots of smaller memory regions, you can write -custom code to subdivide pages returned by pci_alloc_consistent, -or you can use the pci_pool API to do that. A pci_pool is like -a kmem_cache, but it uses pci_alloc_consistent not __get_free_pages. -Also, it understands common hardware constraints for alignment, -like queue heads needing to be aligned on N byte boundaries. - -Create a pci_pool like this: - - struct pci_pool *pool; - - pool = pci_pool_create(name, pdev, size, align, alloc); - -The "name" is for diagnostics (like a kmem_cache name); pdev and size -are as above. The device's hardware alignment requirement for this -type of data is "align" (which is expressed in bytes, and must be a -power of two). If your device has no boundary crossing restrictions, -pass 0 for alloc; passing 4096 says memory allocated from this pool -must not cross 4KByte boundaries (but at that time it may be better to -go for pci_alloc_consistent directly instead). - -Allocate memory from a pci pool like this: - - cpu_addr = pci_pool_alloc(pool, flags, &dma_handle); - -flags are SLAB_KERNEL if blocking is permitted (not in_interrupt nor -holding SMP locks), SLAB_ATOMIC otherwise. Like pci_alloc_consistent, -this returns two values, cpu_addr and dma_handle. - -Free memory that was allocated from a pci_pool like this: - - pci_pool_free(pool, cpu_addr, dma_handle); - -where pool is what you passed to pci_pool_alloc, and cpu_addr and -dma_handle are the values pci_pool_alloc returned. This function -may be called in interrupt context. - -Destroy a pci_pool by calling: - - pci_pool_destroy(pool); - -Make sure you've called pci_pool_free for all memory allocated -from a pool before you destroy the pool. This function may not -be called in interrupt context. - - DMA Direction - -The interfaces described in subsequent portions of this document -take a DMA direction argument, which is an integer and takes on -one of the following values: - - PCI_DMA_BIDIRECTIONAL - PCI_DMA_TODEVICE - PCI_DMA_FROMDEVICE - PCI_DMA_NONE - -One should provide the exact DMA direction if you know it. - -PCI_DMA_TODEVICE means "from main memory to the PCI device" -PCI_DMA_FROMDEVICE means "from the PCI device to main memory" -It is the direction in which the data moves during the DMA -transfer. - -You are _strongly_ encouraged to specify this as precisely -as you possibly can. - -If you absolutely cannot know the direction of the DMA transfer, -specify PCI_DMA_BIDIRECTIONAL. It means that the DMA can go in -either direction. The platform guarantees that you may legally -specify this, and that it will work, but this may be at the -cost of performance for example. - -The value PCI_DMA_NONE is to be used for debugging. One can -hold this in a data structure before you come to know the -precise direction, and this will help catch cases where your -direction tracking logic has failed to set things up properly. - -Another advantage of specifying this value precisely (outside of -potential platform-specific optimizations of such) is for debugging. -Some platforms actually have a write permission boolean which DMA -mappings can be marked with, much like page protections in the user -program address space. Such platforms can and do report errors in the -kernel logs when the PCI controller hardware detects violation of the -permission setting. - -Only streaming mappings specify a direction, consistent mappings -implicitly have a direction attribute setting of -PCI_DMA_BIDIRECTIONAL. - -The SCSI subsystem tells you the direction to use in the -'sc_data_direction' member of the SCSI command your driver is -working on. - -For Networking drivers, it's a rather simple affair. For transmit -packets, map/unmap them with the PCI_DMA_TODEVICE direction -specifier. For receive packets, just the opposite, map/unmap them -with the PCI_DMA_FROMDEVICE direction specifier. - - Using Streaming DMA mappings - -The streaming DMA mapping routines can be called from interrupt -context. There are two versions of each map/unmap, one which will -map/unmap a single memory region, and one which will map/unmap a -scatterlist. - -To map a single region, you do: - - struct pci_dev *pdev = mydev->pdev; - dma_addr_t dma_handle; - void *addr = buffer->ptr; - size_t size = buffer->len; - - dma_handle = pci_map_single(pdev, addr, size, direction); - -and to unmap it: - - pci_unmap_single(pdev, dma_handle, size, direction); - -You should call pci_unmap_single when the DMA activity is finished, e.g. -from the interrupt which told you that the DMA transfer is done. - -Using cpu pointers like this for single mappings has a disadvantage, -you cannot reference HIGHMEM memory in this way. Thus, there is a -map/unmap interface pair akin to pci_{map,unmap}_single. These -interfaces deal with page/offset pairs instead of cpu pointers. -Specifically: - - struct pci_dev *pdev = mydev->pdev; - dma_addr_t dma_handle; - struct page *page = buffer->page; - unsigned long offset = buffer->offset; - size_t size = buffer->len; - - dma_handle = pci_map_page(pdev, page, offset, size, direction); - - ... - - pci_unmap_page(pdev, dma_handle, size, direction); - -Here, "offset" means byte offset within the given page. - -With scatterlists, you map a region gathered from several regions by: - - int i, count = pci_map_sg(pdev, sglist, nents, direction); - struct scatterlist *sg; - - for_each_sg(sglist, sg, count, i) { - hw_address[i] = sg_dma_address(sg); - hw_len[i] = sg_dma_len(sg); - } - -where nents is the number of entries in the sglist. - -The implementation is free to merge several consecutive sglist entries -into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any -consecutive sglist entries can be merged into one provided the first one -ends and the second one starts on a page boundary - in fact this is a huge -advantage for cards which either cannot do scatter-gather or have very -limited number of scatter-gather entries) and returns the actual number -of sg entries it mapped them to. On failure 0 is returned. - -Then you should loop count times (note: this can be less than nents times) -and use sg_dma_address() and sg_dma_len() macros where you previously -accessed sg->address and sg->length as shown above. - -To unmap a scatterlist, just call: - - pci_unmap_sg(pdev, sglist, nents, direction); - -Again, make sure DMA activity has already finished. - -PLEASE NOTE: The 'nents' argument to the pci_unmap_sg call must be - the _same_ one you passed into the pci_map_sg call, - it should _NOT_ be the 'count' value _returned_ from the - pci_map_sg call. - -Every pci_map_{single,sg} call should have its pci_unmap_{single,sg} -counterpart, because the bus address space is a shared resource (although -in some ports the mapping is per each BUS so less devices contend for the -same bus address space) and you could render the machine unusable by eating -all bus addresses. - -If you need to use the same streaming DMA region multiple times and touch -the data in between the DMA transfers, the buffer needs to be synced -properly in order for the cpu and device to see the most uptodate and -correct copy of the DMA buffer. - -So, firstly, just map it with pci_map_{single,sg}, and after each DMA -transfer call either: - - pci_dma_sync_single_for_cpu(pdev, dma_handle, size, direction); - -or: - - pci_dma_sync_sg_for_cpu(pdev, sglist, nents, direction); - -as appropriate. - -Then, if you wish to let the device get at the DMA area again, -finish accessing the data with the cpu, and then before actually -giving the buffer to the hardware call either: - - pci_dma_sync_single_for_device(pdev, dma_handle, size, direction); - -or: - - pci_dma_sync_sg_for_device(dev, sglist, nents, direction); - -as appropriate. - -After the last DMA transfer call one of the DMA unmap routines -pci_unmap_{single,sg}. If you don't touch the data from the first pci_map_* -call till pci_unmap_*, then you don't have to call the pci_dma_sync_* -routines at all. - -Here is pseudo code which shows a situation in which you would need -to use the pci_dma_sync_*() interfaces. - - my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len) - { - dma_addr_t mapping; - - mapping = pci_map_single(cp->pdev, buffer, len, PCI_DMA_FROMDEVICE); - - cp->rx_buf = buffer; - cp->rx_len = len; - cp->rx_dma = mapping; - - give_rx_buf_to_card(cp); - } - - ... - - my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs) - { - struct my_card *cp = devid; - - ... - if (read_card_status(cp) == RX_BUF_TRANSFERRED) { - struct my_card_header *hp; - - /* Examine the header to see if we wish - * to accept the data. But synchronize - * the DMA transfer with the CPU first - * so that we see updated contents. - */ - pci_dma_sync_single_for_cpu(cp->pdev, cp->rx_dma, - cp->rx_len, - PCI_DMA_FROMDEVICE); - - /* Now it is safe to examine the buffer. */ - hp = (struct my_card_header *) cp->rx_buf; - if (header_is_ok(hp)) { - pci_unmap_single(cp->pdev, cp->rx_dma, cp->rx_len, - PCI_DMA_FROMDEVICE); - pass_to_upper_layers(cp->rx_buf); - make_and_setup_new_rx_buf(cp); - } else { - /* Just sync the buffer and give it back - * to the card. - */ - pci_dma_sync_single_for_device(cp->pdev, - cp->rx_dma, - cp->rx_len, - PCI_DMA_FROMDEVICE); - give_rx_buf_to_card(cp); - } - } - } - -Drivers converted fully to this interface should not use virt_to_bus any -longer, nor should they use bus_to_virt. Some drivers have to be changed a -little bit, because there is no longer an equivalent to bus_to_virt in the -dynamic DMA mapping scheme - you have to always store the DMA addresses -returned by the pci_alloc_consistent, pci_pool_alloc, and pci_map_single -calls (pci_map_sg stores them in the scatterlist itself if the platform -supports dynamic DMA mapping in hardware) in your driver structures and/or -in the card registers. - -All PCI drivers should be using these interfaces with no exceptions. -It is planned to completely remove virt_to_bus() and bus_to_virt() as -they are entirely deprecated. Some ports already do not provide these -as it is impossible to correctly support them. - - Optimizing Unmap State Space Consumption - -On many platforms, pci_unmap_{single,page}() is simply a nop. -Therefore, keeping track of the mapping address and length is a waste -of space. Instead of filling your drivers up with ifdefs and the like -to "work around" this (which would defeat the whole purpose of a -portable API) the following facilities are provided. - -Actually, instead of describing the macros one by one, we'll -transform some example code. - -1) Use DECLARE_PCI_UNMAP_{ADDR,LEN} in state saving structures. - Example, before: - - struct ring_state { - struct sk_buff *skb; - dma_addr_t mapping; - __u32 len; - }; - - after: - - struct ring_state { - struct sk_buff *skb; - DECLARE_PCI_UNMAP_ADDR(mapping) - DECLARE_PCI_UNMAP_LEN(len) - }; - - NOTE: DO NOT put a semicolon at the end of the DECLARE_*() - macro. - -2) Use pci_unmap_{addr,len}_set to set these values. - Example, before: - - ringp->mapping = FOO; - ringp->len = BAR; - - after: - - pci_unmap_addr_set(ringp, mapping, FOO); - pci_unmap_len_set(ringp, len, BAR); - -3) Use pci_unmap_{addr,len} to access these values. - Example, before: - - pci_unmap_single(pdev, ringp->mapping, ringp->len, - PCI_DMA_FROMDEVICE); - - after: - - pci_unmap_single(pdev, - pci_unmap_addr(ringp, mapping), - pci_unmap_len(ringp, len), - PCI_DMA_FROMDEVICE); - -It really should be self-explanatory. We treat the ADDR and LEN -separately, because it is possible for an implementation to only -need the address in order to perform the unmap operation. - - Platform Issues - -If you are just writing drivers for Linux and do not maintain -an architecture port for the kernel, you can safely skip down -to "Closing". - -1) Struct scatterlist requirements. - - Struct scatterlist must contain, at a minimum, the following - members: - - struct page *page; - unsigned int offset; - unsigned int length; - - The base address is specified by a "page+offset" pair. - - Previous versions of struct scatterlist contained a "void *address" - field that was sometimes used instead of page+offset. As of Linux - 2.5., page+offset is always used, and the "address" field has been - deleted. - -2) More to come... - - Handling Errors - -DMA address space is limited on some architectures and an allocation -failure can be determined by: - -- checking if pci_alloc_consistent returns NULL or pci_map_sg returns 0 - -- checking the returned dma_addr_t of pci_map_single and pci_map_page - by using pci_dma_mapping_error(): - - dma_addr_t dma_handle; - - dma_handle = pci_map_single(pdev, addr, size, direction); - if (pci_dma_mapping_error(pdev, dma_handle)) { - /* - * reduce current DMA mapping usage, - * delay and try again later or - * reset driver. - */ - } - - Closing - -This document, and the API itself, would not be in it's current -form without the feedback and suggestions from numerous individuals. -We would like to specifically mention, in no particular order, the -following people: - - Russell King - Leo Dagum - Ralf Baechle - Grant Grundler - Jay Estabrook - Thomas Sailer - Andrea Arcangeli - Jens Axboe - David Mosberger-Tang diff --git a/Documentation/PCI/PCI-DMA-mapping.txt b/Documentation/PCI/PCI-DMA-mapping.txt new file mode 100644 index 0000000..ecad88d --- /dev/null +++ b/Documentation/PCI/PCI-DMA-mapping.txt @@ -0,0 +1,766 @@ + Dynamic DMA mapping + =================== + + David S. Miller + Richard Henderson + Jakub Jelinek + +This document describes the DMA mapping system in terms of the pci_ +API. For a similar API that works for generic devices, see +DMA-API.txt. + +Most of the 64bit platforms have special hardware that translates bus +addresses (DMA addresses) into physical addresses. This is similar to +how page tables and/or a TLB translates virtual addresses to physical +addresses on a CPU. This is needed so that e.g. PCI devices can +access with a Single Address Cycle (32bit DMA address) any page in the +64bit physical address space. Previously in Linux those 64bit +platforms had to set artificial limits on the maximum RAM size in the +system, so that the virt_to_bus() static scheme works (the DMA address +translation tables were simply filled on bootup to map each bus +address to the physical page __pa(bus_to_virt())). + +So that Linux can use the dynamic DMA mapping, it needs some help from the +drivers, namely it has to take into account that DMA addresses should be +mapped only for the time they are actually used and unmapped after the DMA +transfer. + +The following API will work of course even on platforms where no such +hardware exists, see e.g. arch/x86/include/asm/pci.h for how it is implemented on +top of the virt_to_bus interface. + +First of all, you should make sure + +#include + +is in your driver. This file will obtain for you the definition of the +dma_addr_t (which can hold any valid DMA address for the platform) +type which should be used everywhere you hold a DMA (bus) address +returned from the DMA mapping functions. + + What memory is DMA'able? + +The first piece of information you must know is what kernel memory can +be used with the DMA mapping facilities. There has been an unwritten +set of rules regarding this, and this text is an attempt to finally +write them down. + +If you acquired your memory via the page allocator +(i.e. __get_free_page*()) or the generic memory allocators +(i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from +that memory using the addresses returned from those routines. + +This means specifically that you may _not_ use the memory/addresses +returned from vmalloc() for DMA. It is possible to DMA to the +_underlying_ memory mapped into a vmalloc() area, but this requires +walking page tables to get the physical addresses, and then +translating each of those pages back to a kernel address using +something like __va(). [ EDIT: Update this when we integrate +Gerd Knorr's generic code which does this. ] + +This rule also means that you may use neither kernel image addresses +(items in data/text/bss segments), nor module image addresses, nor +stack addresses for DMA. These could all be mapped somewhere entirely +different than the rest of physical memory. Even if those classes of +memory could physically work with DMA, you'd need to ensure the I/O +buffers were cacheline-aligned. Without that, you'd see cacheline +sharing problems (data corruption) on CPUs with DMA-incoherent caches. +(The CPU could write to one word, DMA would write to a different one +in the same cache line, and one of them could be overwritten.) + +Also, this means that you cannot take the return of a kmap() +call and DMA to/from that. This is similar to vmalloc(). + +What about block I/O and networking buffers? The block I/O and +networking subsystems make sure that the buffers they use are valid +for you to DMA from/to. + + DMA addressing limitations + +Does your device have any DMA addressing limitations? For example, is +your device only capable of driving the low order 24-bits of address +on the PCI bus for SAC DMA transfers? If so, you need to inform the +PCI layer of this fact. + +By default, the kernel assumes that your device can address the full +32-bits in a SAC cycle. For a 64-bit DAC capable device, this needs +to be increased. And for a device with limitations, as discussed in +the previous paragraph, it needs to be decreased. + +pci_alloc_consistent() by default will return 32-bit DMA addresses. +PCI-X specification requires PCI-X devices to support 64-bit +addressing (DAC) for all transactions. And at least one platform (SGI +SN2) requires 64-bit consistent allocations to operate correctly when +the IO bus is in PCI-X mode. Therefore, like with pci_set_dma_mask(), +it's good practice to call pci_set_consistent_dma_mask() to set the +appropriate mask even if your device only supports 32-bit DMA +(default) and especially if it's a PCI-X device. + +For correct operation, you must interrogate the PCI layer in your +device probe routine to see if the PCI controller on the machine can +properly support the DMA addressing limitation your device has. It is +good style to do this even if your device holds the default setting, +because this shows that you did think about these issues wrt. your +device. + +The query is performed via a call to pci_set_dma_mask(): + + int pci_set_dma_mask(struct pci_dev *pdev, u64 device_mask); + +The query for consistent allocations is performed via a call to +pci_set_consistent_dma_mask(): + + int pci_set_consistent_dma_mask(struct pci_dev *pdev, u64 device_mask); + +Here, pdev is a pointer to the PCI device struct of your device, and +device_mask is a bit mask describing which bits of a PCI address your +device supports. It returns zero if your card can perform DMA +properly on the machine given the address mask you provided. + +If it returns non-zero, your device cannot perform DMA properly on +this platform, and attempting to do so will result in undefined +behavior. You must either use a different mask, or not use DMA. + +This means that in the failure case, you have three options: + +1) Use another DMA mask, if possible (see below). +2) Use some non-DMA mode for data transfer, if possible. +3) Ignore this device and do not initialize it. + +It is recommended that your driver print a kernel KERN_WARNING message +when you end up performing either #2 or #3. In this manner, if a user +of your driver reports that performance is bad or that the device is not +even detected, you can ask them for the kernel messages to find out +exactly why. + +The standard 32-bit addressing PCI device would do something like +this: + + if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32))) { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +Another common scenario is a 64-bit capable device. The approach +here is to try for 64-bit DAC addressing, but back down to a +32-bit mask should that fail. The PCI platform code may fail the +64-bit mask not because the platform is not capable of 64-bit +addressing. Rather, it may fail in this case simply because +32-bit SAC addressing is done more efficiently than DAC addressing. +Sparc64 is one platform which behaves in this way. + +Here is how you would handle a 64-bit capable device which can drive +all 64-bits when accessing streaming DMA: + + int using_dac; + + if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) { + using_dac = 1; + } else if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(32))) { + using_dac = 0; + } else { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +If a card is capable of using 64-bit consistent allocations as well, +the case would look like this: + + int using_dac, consistent_using_dac; + + if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) { + using_dac = 1; + consistent_using_dac = 1; + pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64)); + } else if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(32))) { + using_dac = 0; + consistent_using_dac = 0; + pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32)); + } else { + printk(KERN_WARNING + "mydev: No suitable DMA available.\n"); + goto ignore_this_device; + } + +pci_set_consistent_dma_mask() will always be able to set the same or a +smaller mask as pci_set_dma_mask(). However for the rare case that a +device driver only uses consistent allocations, one would have to +check the return value from pci_set_consistent_dma_mask(). + +Finally, if your device can only drive the low 24-bits of +address during PCI bus mastering you might do something like: + + if (pci_set_dma_mask(pdev, DMA_BIT_MASK(24))) { + printk(KERN_WARNING + "mydev: 24-bit DMA addressing not available.\n"); + goto ignore_this_device; + } + +When pci_set_dma_mask() is successful, and returns zero, the PCI layer +saves away this mask you have provided. The PCI layer will use this +information later when you make DMA mappings. + +There is a case which we are aware of at this time, which is worth +mentioning in this documentation. If your device supports multiple +functions (for example a sound card provides playback and record +functions) and the various different functions have _different_ +DMA addressing limitations, you may wish to probe each mask and +only provide the functionality which the machine can handle. It +is important that the last call to pci_set_dma_mask() be for the +most specific mask. + +Here is pseudo-code showing how this might be done: + + #define PLAYBACK_ADDRESS_BITS DMA_BIT_MASK(32) + #define RECORD_ADDRESS_BITS DMA_BIT_MASK(24) + + struct my_sound_card *card; + struct pci_dev *pdev; + + ... + if (!pci_set_dma_mask(pdev, PLAYBACK_ADDRESS_BITS)) { + card->playback_enabled = 1; + } else { + card->playback_enabled = 0; + printk(KERN_WARNING "%s: Playback disabled due to DMA limitations.\n", + card->name); + } + if (!pci_set_dma_mask(pdev, RECORD_ADDRESS_BITS)) { + card->record_enabled = 1; + } else { + card->record_enabled = 0; + printk(KERN_WARNING "%s: Record disabled due to DMA limitations.\n", + card->name); + } + +A sound card was used as an example here because this genre of PCI +devices seems to be littered with ISA chips given a PCI front end, +and thus retaining the 16MB DMA addressing limitations of ISA. + + Types of DMA mappings + +There are two types of DMA mappings: + +- Consistent DMA mappings which are usually mapped at driver + initialization, unmapped at the end and for which the hardware should + guarantee that the device and the CPU can access the data + in parallel and will see updates made by each other without any + explicit software flushing. + + Think of "consistent" as "synchronous" or "coherent". + + The current default is to return consistent memory in the low 32 + bits of the PCI bus space. However, for future compatibility you + should set the consistent mask even if this default is fine for your + driver. + + Good examples of what to use consistent mappings for are: + + - Network card DMA ring descriptors. + - SCSI adapter mailbox command data structures. + - Device firmware microcode executed out of + main memory. + + The invariant these examples all require is that any CPU store + to memory is immediately visible to the device, and vice + versa. Consistent mappings guarantee this. + + IMPORTANT: Consistent DMA memory does not preclude the usage of + proper memory barriers. The CPU may reorder stores to + consistent memory just as it may normal memory. Example: + if it is important for the device to see the first word + of a descriptor updated before the second, you must do + something like: + + desc->word0 = address; + wmb(); + desc->word1 = DESC_VALID; + + in order to get correct behavior on all platforms. + + Also, on some platforms your driver may need to flush CPU write + buffers in much the same way as it needs to flush write buffers + found in PCI bridges (such as by reading a register's value + after writing it). + +- Streaming DMA mappings which are usually mapped for one DMA transfer, + unmapped right after it (unless you use pci_dma_sync_* below) and for which + hardware can optimize for sequential accesses. + + This of "streaming" as "asynchronous" or "outside the coherency + domain". + + Good examples of what to use streaming mappings for are: + + - Networking buffers transmitted/received by a device. + - Filesystem buffers written/read by a SCSI device. + + The interfaces for using this type of mapping were designed in + such a way that an implementation can make whatever performance + optimizations the hardware allows. To this end, when using + such mappings you must be explicit about what you want to happen. + +Neither type of DMA mapping has alignment restrictions that come +from PCI, although some devices may have such restrictions. +Also, systems with caches that aren't DMA-coherent will work better +when the underlying buffers don't share cache lines with other data. + + + Using Consistent DMA mappings. + +To allocate and map large (PAGE_SIZE or so) consistent DMA regions, +you should do: + + dma_addr_t dma_handle; + + cpu_addr = pci_alloc_consistent(pdev, size, &dma_handle); + +where pdev is a struct pci_dev *. This may be called in interrupt context. +You should use dma_alloc_coherent (see DMA-API.txt) for buses +where devices don't have struct pci_dev (like ISA, EISA). + +This argument is needed because the DMA translations may be bus +specific (and often is private to the bus which the device is attached +to). + +Size is the length of the region you want to allocate, in bytes. + +This routine will allocate RAM for that region, so it acts similarly to +__get_free_pages (but takes size instead of a page order). If your +driver needs regions sized smaller than a page, you may prefer using +the pci_pool interface, described below. + +The consistent DMA mapping interfaces, for non-NULL pdev, will by +default return a DMA address which is SAC (Single Address Cycle) +addressable. Even if the device indicates (via PCI dma mask) that it +may address the upper 32-bits and thus perform DAC cycles, consistent +allocation will only return > 32-bit PCI addresses for DMA if the +consistent dma mask has been explicitly changed via +pci_set_consistent_dma_mask(). This is true of the pci_pool interface +as well. + +pci_alloc_consistent returns two values: the virtual address which you +can use to access it from the CPU and dma_handle which you pass to the +card. + +The cpu return address and the DMA bus master address are both +guaranteed to be aligned to the smallest PAGE_SIZE order which +is greater than or equal to the requested size. This invariant +exists (for example) to guarantee that if you allocate a chunk +which is smaller than or equal to 64 kilobytes, the extent of the +buffer you receive will not cross a 64K boundary. + +To unmap and free such a DMA region, you call: + + pci_free_consistent(pdev, size, cpu_addr, dma_handle); + +where pdev, size are the same as in the above call and cpu_addr and +dma_handle are the values pci_alloc_consistent returned to you. +This function may not be called in interrupt context. + +If your driver needs lots of smaller memory regions, you can write +custom code to subdivide pages returned by pci_alloc_consistent, +or you can use the pci_pool API to do that. A pci_pool is like +a kmem_cache, but it uses pci_alloc_consistent not __get_free_pages. +Also, it understands common hardware constraints for alignment, +like queue heads needing to be aligned on N byte boundaries. + +Create a pci_pool like this: + + struct pci_pool *pool; + + pool = pci_pool_create(name, pdev, size, align, alloc); + +The "name" is for diagnostics (like a kmem_cache name); pdev and size +are as above. The device's hardware alignment requirement for this +type of data is "align" (which is expressed in bytes, and must be a +power of two). If your device has no boundary crossing restrictions, +pass 0 for alloc; passing 4096 says memory allocated from this pool +must not cross 4KByte boundaries (but at that time it may be better to +go for pci_alloc_consistent directly instead). + +Allocate memory from a pci pool like this: + + cpu_addr = pci_pool_alloc(pool, flags, &dma_handle); + +flags are SLAB_KERNEL if blocking is permitted (not in_interrupt nor +holding SMP locks), SLAB_ATOMIC otherwise. Like pci_alloc_consistent, +this returns two values, cpu_addr and dma_handle. + +Free memory that was allocated from a pci_pool like this: + + pci_pool_free(pool, cpu_addr, dma_handle); + +where pool is what you passed to pci_pool_alloc, and cpu_addr and +dma_handle are the values pci_pool_alloc returned. This function +may be called in interrupt context. + +Destroy a pci_pool by calling: + + pci_pool_destroy(pool); + +Make sure you've called pci_pool_free for all memory allocated +from a pool before you destroy the pool. This function may not +be called in interrupt context. + + DMA Direction + +The interfaces described in subsequent portions of this document +take a DMA direction argument, which is an integer and takes on +one of the following values: + + PCI_DMA_BIDIRECTIONAL + PCI_DMA_TODEVICE + PCI_DMA_FROMDEVICE + PCI_DMA_NONE + +One should provide the exact DMA direction if you know it. + +PCI_DMA_TODEVICE means "from main memory to the PCI device" +PCI_DMA_FROMDEVICE means "from the PCI device to main memory" +It is the direction in which the data moves during the DMA +transfer. + +You are _strongly_ encouraged to specify this as precisely +as you possibly can. + +If you absolutely cannot know the direction of the DMA transfer, +specify PCI_DMA_BIDIRECTIONAL. It means that the DMA can go in +either direction. The platform guarantees that you may legally +specify this, and that it will work, but this may be at the +cost of performance for example. + +The value PCI_DMA_NONE is to be used for debugging. One can +hold this in a data structure before you come to know the +precise direction, and this will help catch cases where your +direction tracking logic has failed to set things up properly. + +Another advantage of specifying this value precisely (outside of +potential platform-specific optimizations of such) is for debugging. +Some platforms actually have a write permission boolean which DMA +mappings can be marked with, much like page protections in the user +program address space. Such platforms can and do report errors in the +kernel logs when the PCI controller hardware detects violation of the +permission setting. + +Only streaming mappings specify a direction, consistent mappings +implicitly have a direction attribute setting of +PCI_DMA_BIDIRECTIONAL. + +The SCSI subsystem tells you the direction to use in the +'sc_data_direction' member of the SCSI command your driver is +working on. + +For Networking drivers, it's a rather simple affair. For transmit +packets, map/unmap them with the PCI_DMA_TODEVICE direction +specifier. For receive packets, just the opposite, map/unmap them +with the PCI_DMA_FROMDEVICE direction specifier. + + Using Streaming DMA mappings + +The streaming DMA mapping routines can be called from interrupt +context. There are two versions of each map/unmap, one which will +map/unmap a single memory region, and one which will map/unmap a +scatterlist. + +To map a single region, you do: + + struct pci_dev *pdev = mydev->pdev; + dma_addr_t dma_handle; + void *addr = buffer->ptr; + size_t size = buffer->len; + + dma_handle = pci_map_single(pdev, addr, size, direction); + +and to unmap it: + + pci_unmap_single(pdev, dma_handle, size, direction); + +You should call pci_unmap_single when the DMA activity is finished, e.g. +from the interrupt which told you that the DMA transfer is done. + +Using cpu pointers like this for single mappings has a disadvantage, +you cannot reference HIGHMEM memory in this way. Thus, there is a +map/unmap interface pair akin to pci_{map,unmap}_single. These +interfaces deal with page/offset pairs instead of cpu pointers. +Specifically: + + struct pci_dev *pdev = mydev->pdev; + dma_addr_t dma_handle; + struct page *page = buffer->page; + unsigned long offset = buffer->offset; + size_t size = buffer->len; + + dma_handle = pci_map_page(pdev, page, offset, size, direction); + + ... + + pci_unmap_page(pdev, dma_handle, size, direction); + +Here, "offset" means byte offset within the given page. + +With scatterlists, you map a region gathered from several regions by: + + int i, count = pci_map_sg(pdev, sglist, nents, direction); + struct scatterlist *sg; + + for_each_sg(sglist, sg, count, i) { + hw_address[i] = sg_dma_address(sg); + hw_len[i] = sg_dma_len(sg); + } + +where nents is the number of entries in the sglist. + +The implementation is free to merge several consecutive sglist entries +into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any +consecutive sglist entries can be merged into one provided the first one +ends and the second one starts on a page boundary - in fact this is a huge +advantage for cards which either cannot do scatter-gather or have very +limited number of scatter-gather entries) and returns the actual number +of sg entries it mapped them to. On failure 0 is returned. + +Then you should loop count times (note: this can be less than nents times) +and use sg_dma_address() and sg_dma_len() macros where you previously +accessed sg->address and sg->length as shown above. + +To unmap a scatterlist, just call: + + pci_unmap_sg(pdev, sglist, nents, direction); + +Again, make sure DMA activity has already finished. + +PLEASE NOTE: The 'nents' argument to the pci_unmap_sg call must be + the _same_ one you passed into the pci_map_sg call, + it should _NOT_ be the 'count' value _returned_ from the + pci_map_sg call. + +Every pci_map_{single,sg} call should have its pci_unmap_{single,sg} +counterpart, because the bus address space is a shared resource (although +in some ports the mapping is per each BUS so less devices contend for the +same bus address space) and you could render the machine unusable by eating +all bus addresses. + +If you need to use the same streaming DMA region multiple times and touch +the data in between the DMA transfers, the buffer needs to be synced +properly in order for the cpu and device to see the most uptodate and +correct copy of the DMA buffer. + +So, firstly, just map it with pci_map_{single,sg}, and after each DMA +transfer call either: + + pci_dma_sync_single_for_cpu(pdev, dma_handle, size, direction); + +or: + + pci_dma_sync_sg_for_cpu(pdev, sglist, nents, direction); + +as appropriate. + +Then, if you wish to let the device get at the DMA area again, +finish accessing the data with the cpu, and then before actually +giving the buffer to the hardware call either: + + pci_dma_sync_single_for_device(pdev, dma_handle, size, direction); + +or: + + pci_dma_sync_sg_for_device(dev, sglist, nents, direction); + +as appropriate. + +After the last DMA transfer call one of the DMA unmap routines +pci_unmap_{single,sg}. If you don't touch the data from the first pci_map_* +call till pci_unmap_*, then you don't have to call the pci_dma_sync_* +routines at all. + +Here is pseudo code which shows a situation in which you would need +to use the pci_dma_sync_*() interfaces. + + my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len) + { + dma_addr_t mapping; + + mapping = pci_map_single(cp->pdev, buffer, len, PCI_DMA_FROMDEVICE); + + cp->rx_buf = buffer; + cp->rx_len = len; + cp->rx_dma = mapping; + + give_rx_buf_to_card(cp); + } + + ... + + my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs) + { + struct my_card *cp = devid; + + ... + if (read_card_status(cp) == RX_BUF_TRANSFERRED) { + struct my_card_header *hp; + + /* Examine the header to see if we wish + * to accept the data. But synchronize + * the DMA transfer with the CPU first + * so that we see updated contents. + */ + pci_dma_sync_single_for_cpu(cp->pdev, cp->rx_dma, + cp->rx_len, + PCI_DMA_FROMDEVICE); + + /* Now it is safe to examine the buffer. */ + hp = (struct my_card_header *) cp->rx_buf; + if (header_is_ok(hp)) { + pci_unmap_single(cp->pdev, cp->rx_dma, cp->rx_len, + PCI_DMA_FROMDEVICE); + pass_to_upper_layers(cp->rx_buf); + make_and_setup_new_rx_buf(cp); + } else { + /* Just sync the buffer and give it back + * to the card. + */ + pci_dma_sync_single_for_device(cp->pdev, + cp->rx_dma, + cp->rx_len, + PCI_DMA_FROMDEVICE); + give_rx_buf_to_card(cp); + } + } + } + +Drivers converted fully to this interface should not use virt_to_bus any +longer, nor should they use bus_to_virt. Some drivers have to be changed a +little bit, because there is no longer an equivalent to bus_to_virt in the +dynamic DMA mapping scheme - you have to always store the DMA addresses +returned by the pci_alloc_consistent, pci_pool_alloc, and pci_map_single +calls (pci_map_sg stores them in the scatterlist itself if the platform +supports dynamic DMA mapping in hardware) in your driver structures and/or +in the card registers. + +All PCI drivers should be using these interfaces with no exceptions. +It is planned to completely remove virt_to_bus() and bus_to_virt() as +they are entirely deprecated. Some ports already do not provide these +as it is impossible to correctly support them. + + Optimizing Unmap State Space Consumption + +On many platforms, pci_unmap_{single,page}() is simply a nop. +Therefore, keeping track of the mapping address and length is a waste +of space. Instead of filling your drivers up with ifdefs and the like +to "work around" this (which would defeat the whole purpose of a +portable API) the following facilities are provided. + +Actually, instead of describing the macros one by one, we'll +transform some example code. + +1) Use DECLARE_PCI_UNMAP_{ADDR,LEN} in state saving structures. + Example, before: + + struct ring_state { + struct sk_buff *skb; + dma_addr_t mapping; + __u32 len; + }; + + after: + + struct ring_state { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) + DECLARE_PCI_UNMAP_LEN(len) + }; + + NOTE: DO NOT put a semicolon at the end of the DECLARE_*() + macro. + +2) Use pci_unmap_{addr,len}_set to set these values. + Example, before: + + ringp->mapping = FOO; + ringp->len = BAR; + + after: + + pci_unmap_addr_set(ringp, mapping, FOO); + pci_unmap_len_set(ringp, len, BAR); + +3) Use pci_unmap_{addr,len} to access these values. + Example, before: + + pci_unmap_single(pdev, ringp->mapping, ringp->len, + PCI_DMA_FROMDEVICE); + + after: + + pci_unmap_single(pdev, + pci_unmap_addr(ringp, mapping), + pci_unmap_len(ringp, len), + PCI_DMA_FROMDEVICE); + +It really should be self-explanatory. We treat the ADDR and LEN +separately, because it is possible for an implementation to only +need the address in order to perform the unmap operation. + + Platform Issues + +If you are just writing drivers for Linux and do not maintain +an architecture port for the kernel, you can safely skip down +to "Closing". + +1) Struct scatterlist requirements. + + Struct scatterlist must contain, at a minimum, the following + members: + + struct page *page; + unsigned int offset; + unsigned int length; + + The base address is specified by a "page+offset" pair. + + Previous versions of struct scatterlist contained a "void *address" + field that was sometimes used instead of page+offset. As of Linux + 2.5., page+offset is always used, and the "address" field has been + deleted. + +2) More to come... + + Handling Errors + +DMA address space is limited on some architectures and an allocation +failure can be determined by: + +- checking if pci_alloc_consistent returns NULL or pci_map_sg returns 0 + +- checking the returned dma_addr_t of pci_map_single and pci_map_page + by using pci_dma_mapping_error(): + + dma_addr_t dma_handle; + + dma_handle = pci_map_single(pdev, addr, size, direction); + if (pci_dma_mapping_error(pdev, dma_handle)) { + /* + * reduce current DMA mapping usage, + * delay and try again later or + * reset driver. + */ + } + + Closing + +This document, and the API itself, would not be in it's current +form without the feedback and suggestions from numerous individuals. +We would like to specifically mention, in no particular order, the +following people: + + Russell King + Leo Dagum + Ralf Baechle + Grant Grundler + Jay Estabrook + Thomas Sailer + Andrea Arcangeli + Jens Axboe + David Mosberger-Tang diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 8d2158a..6fab97e 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -186,7 +186,7 @@ a virtual address mapping (unlike the earlier scheme of virtual address do not have a corresponding kernel virtual address space mapping) and low-memory pages. -Note: Please refer to Documentation/DMA-mapping.txt for a discussion +Note: Please refer to Documentation/PCI/PCI-DMA-mapping.txt for a discussion on PCI high mem DMA aspects and mapping of scatter gather lists, and support for 64 bit PCI. -- cgit v1.1 From c5114a1cd6d84b2b3144c1c3e093c80ca6c30f47 Mon Sep 17 00:00:00 2001 From: Clemens Ladisch Date: Sun, 10 Jan 2010 20:52:34 +0100 Subject: hwmon: (k10temp) Blacklist more family 10h processors The latest version of the Revision Guide for AMD Family 10h Processors lists two more processor revisions which may be affected by erratum 319. Change the blacklisting code to correctly detect those processors, by implementing AMD's recommended algorithm. Signed-off-by: Clemens Ladisch Signed-off-by: Jean Delvare Cc: Andreas Herrmann --- Documentation/hwmon/k10temp | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/hwmon/k10temp b/Documentation/hwmon/k10temp index a7a18d4..6526eee 100644 --- a/Documentation/hwmon/k10temp +++ b/Documentation/hwmon/k10temp @@ -3,8 +3,8 @@ Kernel driver k10temp Supported chips: * AMD Family 10h processors: - Socket F: Quad-Core/Six-Core/Embedded Opteron - Socket AM2+: Opteron, Phenom (II) X3/X4 + Socket F: Quad-Core/Six-Core/Embedded Opteron (but see below) + Socket AM2+: Quad-Core Opteron, Phenom (II) X3/X4, Athlon X2 (but see below) Socket AM3: Quad-Core Opteron, Athlon/Phenom II X2/X3/X4, Sempron II Socket S1G3: Athlon II, Sempron, Turion II * AMD Family 11h processors: @@ -36,10 +36,15 @@ Description This driver permits reading of the internal temperature sensor of AMD Family 10h and 11h processors. -All these processors have a sensor, but on older revisions of Family 10h -processors, the sensor may return inconsistent values (erratum 319). The -driver will refuse to load on these revisions unless you specify the -"force=1" module parameter. +All these processors have a sensor, but on those for Socket F or AM2+, +the sensor may return inconsistent values (erratum 319). The driver +will refuse to load on these revisions unless you specify the "force=1" +module parameter. + +Due to technical reasons, the driver can detect only the mainboard's +socket type, not the processor's actual capabilities. Therefore, if you +are using an AM3 processor on an AM2+ mainboard, you can safely use the +"force=1" parameter. There is one temperature measurement value, available as temp1_input in sysfs. It is measured in degrees Celsius with a resolution of 1/8th degree. -- cgit v1.1 From cb5a8b2c92febbed57126e1b8416dfd7607ff03d Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 8 Jan 2010 14:42:34 -0800 Subject: docs: large update to ioctl-number.txt Add many ioctl definitions to ioctl-number.txt. Fix some whitespace/formatting. Correct some filenames/paths. Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ioctl/ioctl-number.txt | 203 +++++++++++++++++++++++++++-------- 1 file changed, 159 insertions(+), 44 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 9473749..35cf64d 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -56,10 +56,11 @@ Following this convention is good because: (5) When following the convention, the driver code can use generic code to copy the parameters between user and kernel space. -This table lists ioctls visible from user land for Linux/i386. It contains -most drivers up to 2.3.14, but I know I am missing some. +This table lists ioctls visible from user land for Linux/x86. It contains +most drivers up to 2.6.31, but I know I am missing some. There has been +no attempt to list non-X86 architectures or ioctls from drivers/staging/. -Code Seq# Include File Comments +Code Seq#(hex) Include File Comments ======================================================== 0x00 00-1F linux/fs.h conflict! 0x00 00-1F scsi/scsi_ioctl.h conflict! @@ -69,119 +70,228 @@ Code Seq# Include File Comments 0x03 all linux/hdreg.h 0x04 D2-DC linux/umsdos_fs.h Dead since 2.6.11, but don't reuse these. 0x06 all linux/lp.h -0x09 all linux/md.h +0x09 all linux/raid/md_u.h +0x10 00-0F drivers/char/s390/vmcp.h 0x12 all linux/fs.h linux/blkpg.h 0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem +'$' 00-0F linux/perf_counter.h, linux/perf_event.h '1' 00-1F PPS kit from Ulrich Windl +'2' 01-04 linux/i2o.h +'3' 00-0F drivers/s390/char/raw3270.h conflict! +'3' 00-1F linux/suspend_ioctls.h conflict! + and kernel/power/user.c '8' all SNP8023 advanced NIC card -'A' 00-1F linux/apm_bios.h +'@' 00-0F linux/radeonfb.h conflict! +'@' 00-0F drivers/video/aty/aty128fb.c conflict! +'A' 00-1F linux/apm_bios.h conflict! +'A' 00-0F linux/agpgart.h conflict! + and drivers/char/agp/compat_ioctl.h +'A' 00-7F sound/asound.h conflict! +'B' 00-1F linux/cciss_ioctl.h conflict! +'B' 00-0F include/linux/pmu.h conflict! 'B' C0-FF advanced bbus -'C' all linux/soundcard.h +'C' all linux/soundcard.h conflict! +'C' 01-2F linux/capi.h conflict! +'C' F0-FF drivers/net/wan/cosa.h conflict! 'D' all arch/s390/include/asm/dasd.h -'E' all linux/input.h -'F' all linux/fb.h -'H' all linux/hiddev.h -'I' all linux/isdn.h +'D' 40-5F drivers/scsi/dpt/dtpi_ioctl.h +'D' 05 drivers/scsi/pmcraid.h +'E' all linux/input.h conflict! +'E' 00-0F xen/evtchn.h conflict! +'F' all linux/fb.h conflict! +'F' 01-02 drivers/scsi/pmcraid.h conflict! +'F' 20 drivers/video/fsl-diu-fb.h conflict! +'F' 20 drivers/video/intelfb/intelfb.h conflict! +'F' 20 linux/ivtvfb.h conflict! +'F' 20 linux/matroxfb.h conflict! +'F' 20 drivers/video/aty/atyfb_base.c conflict! +'F' 00-0F video/da8xx-fb.h conflict! +'F' 80-8F linux/arcfb.h conflict! +'F' DD video/sstfb.h conflict! +'G' 00-3F drivers/misc/sgi-gru/grulib.h conflict! +'G' 00-0F linux/gigaset_dev.h conflict! +'H' 00-7F linux/hiddev.h conflict! +'H' 00-0F linux/hidraw.h conflict! +'H' 00-0F sound/asound.h conflict! +'H' 20-40 sound/asound_fm.h conflict! +'H' 80-8F sound/sfnt_info.h conflict! +'H' 10-8F sound/emu10k1.h conflict! +'H' 10-1F sound/sb16_csp.h conflict! +'H' 10-1F sound/hda_hwdep.h conflict! +'H' 40-4F sound/hdspm.h conflict! +'H' 40-4F sound/hdsp.h conflict! +'H' 90 sound/usb/usx2y/usb_stream.h +'H' C0-F0 net/bluetooth/hci.h conflict! +'H' C0-DF net/bluetooth/hidp/hidp.h conflict! +'H' C0-DF net/bluetooth/cmtp/cmtp.h conflict! +'H' C0-DF net/bluetooth/bnep/bnep.h conflict! +'I' all linux/isdn.h conflict! +'I' 00-0F drivers/isdn/divert/isdn_divert.h conflict! +'I' 40-4F linux/mISDNif.h conflict! 'J' 00-1F drivers/scsi/gdth_ioctl.h 'K' all linux/kd.h -'L' 00-1F linux/loop.h -'L' 20-2F driver/usb/misc/vstusb.h +'L' 00-1F linux/loop.h conflict! +'L' 10-1F drivers/scsi/mpt2sas/mpt2sas_ctl.h conflict! +'L' 20-2F linux/usb/vstusb.h 'L' E0-FF linux/ppdd.h encrypted disk device driver -'M' all linux/soundcard.h +'M' all linux/soundcard.h conflict! +'M' 01-16 mtd/mtd-abi.h conflict! + and drivers/mtd/mtdchar.c +'M' 01-03 drivers/scsi/megaraid/megaraid_sas.h +'M' 00-0F drivers/video/fsl-diu-fb.h conflict! 'N' 00-1F drivers/usb/scanner.h -'O' 00-02 include/mtd/ubi-user.h UBI -'P' all linux/soundcard.h +'O' 00-06 mtd/ubi-user.h UBI +'P' all linux/soundcard.h conflict! +'P' 60-6F sound/sscape_ioctl.h conflict! +'P' 00-0F drivers/usb/class/usblp.c conflict! 'Q' all linux/soundcard.h -'R' 00-1F linux/random.h +'R' 00-1F linux/random.h conflict! +'R' 01 linux/rfkill.h conflict! +'R' 01-0F media/rds.h conflict! +'R' C0-DF net/bluetooth/rfcomm.h 'S' all linux/cdrom.h conflict! 'S' 80-81 scsi/scsi_ioctl.h conflict! 'S' 82-FF scsi/scsi.h conflict! +'S' 00-7F sound/asequencer.h conflict! 'T' all linux/soundcard.h conflict! +'T' 00-AF sound/asound.h conflict! 'T' all arch/x86/include/asm/ioctls.h conflict! -'U' 00-EF linux/drivers/usb/usb.h -'V' all linux/vt.h +'T' C0-DF linux/if_tun.h conflict! +'U' all sound/asound.h conflict! +'U' 00-0F drivers/media/video/uvc/uvcvideo.h conflict! +'U' 00-CF linux/uinput.h conflict! +'U' 00-EF linux/usbdevice_fs.h +'U' C0-CF drivers/bluetooth/hci_uart.h +'V' all linux/vt.h conflict! +'V' all linux/videodev2.h conflict! +'V' C0 linux/ivtvfb.h conflict! +'V' C0 linux/ivtv.h conflict! +'V' C0 media/davinci/vpfe_capture.h conflict! +'V' C0 media/si4713.h conflict! +'V' C0-CF drivers/media/video/mxb.h conflict! 'W' 00-1F linux/watchdog.h conflict! 'W' 00-1F linux/wanrouter.h conflict! -'X' all linux/xfs_fs.h +'W' 00-3F sound/asound.h conflict! +'X' all fs/xfs/xfs_fs.h conflict! + and fs/xfs/linux-2.6/xfs_ioctl32.h + and include/linux/falloc.h + and linux/fs.h +'X' all fs/ocfs2/ocfs_fs.h conflict! +'X' 01 linux/pktcdvd.h conflict! 'Y' all linux/cyclades.h -'[' 00-07 linux/usb/usbtmc.h USB Test and Measurement Devices +'Z' 14-15 drivers/message/fusion/mptctl.h +'[' 00-07 linux/usb/tmc.h USB Test and Measurement Devices -'a' all ATM on linux +'a' all linux/atm*.h, linux/sonet.h ATM on linux -'b' 00-FF bit3 vme host bridge +'b' 00-FF conflict! bit3 vme host bridge +'b' 00-0F media/bt819.h conflict! +'c' all linux/cm4000_cs.h conflict! 'c' 00-7F linux/comstats.h conflict! 'c' 00-7F linux/coda.h conflict! -'c' 80-9F arch/s390/include/asm/chsc.h -'c' A0-AF arch/x86/include/asm/msr.h +'c' 00-1F linux/chio.h conflict! +'c' 80-9F arch/s390/include/asm/chsc.h conflict! +'c' A0-AF arch/x86/include/asm/msr.h conflict! 'd' 00-FF linux/char/drm/drm/h conflict! +'d' 02-40 pcmcia/ds.h conflict! +'d' 10-3F drivers/media/video/dabusb.h conflict! +'d' C0-CF drivers/media/video/saa7191.h conflict! 'd' F0-FF linux/digi1.h 'e' all linux/digi1.h conflict! -'e' 00-1F net/irda/irtty.h conflict! -'f' 00-1F linux/ext2_fs.h -'h' 00-7F Charon filesystem +'e' 00-1F drivers/net/irda/irtty-sir.h conflict! +'f' 00-1F linux/ext2_fs.h conflict! +'f' 00-1F linux/ext3_fs.h conflict! +'f' 00-0F fs/jfs/jfs_dinode.h conflict! +'f' 00-0F fs/ext4/ext4.h conflict! +'f' 00-0F linux/fs.h conflict! +'f' 00-0F fs/ocfs2/ocfs2_fs.h conflict! +'g' 00-0F linux/usb/gadgetfs.h +'g' 20-2F linux/usb/g_printer.h +'h' 00-7F conflict! Charon filesystem -'i' 00-3F linux/i2o.h +'h' 00-1F linux/hpet.h conflict! +'i' 00-3F linux/i2o-dev.h conflict! +'i' 0B-1F linux/ipmi.h conflict! +'i' 80-8F linux/i8k.h 'j' 00-3F linux/joystick.h +'k' 00-0F linux/spi/spidev.h conflict! +'k' 00-05 video/kyro.h conflict! 'l' 00-3F linux/tcfs_fs.h transparent cryptographic file system 'l' 40-7F linux/udf_fs_i.h in development: -'m' 00-09 linux/mmtimer.h +'m' 00-09 linux/mmtimer.h conflict! 'm' all linux/mtio.h conflict! 'm' all linux/soundcard.h conflict! 'm' all linux/synclink.h conflict! +'m' 00-19 drivers/message/fusion/mptctl.h conflict! +'m' 00 drivers/scsi/megaraid/megaraid_ioctl.h conflict! 'm' 00-1F net/irda/irmod.h conflict! -'n' 00-7F linux/ncp_fs.h +'n' 00-7F linux/ncp_fs.h and fs/ncpfs/ioctl.c 'n' 80-8F linux/nilfs2_fs.h NILFS2 -'n' E0-FF video/matrox.h matroxfb +'n' E0-FF linux/matroxfb.h matroxfb 'o' 00-1F fs/ocfs2/ocfs2_fs.h OCFS2 -'o' 00-03 include/mtd/ubi-user.h conflict! (OCFS2 and UBI overlaps) -'o' 40-41 include/mtd/ubi-user.h UBI -'o' 01-A1 include/linux/dvb/*.h DVB +'o' 00-03 mtd/ubi-user.h conflict! (OCFS2 and UBI overlaps) +'o' 40-41 mtd/ubi-user.h UBI +'o' 01-A1 linux/dvb/*.h DVB 'p' 00-0F linux/phantom.h conflict! (OpenHaptics needs this) +'p' 00-1F linux/rtc.h conflict! 'p' 00-3F linux/mc146818rtc.h conflict! 'p' 40-7F linux/nvram.h -'p' 80-9F user-space parport +'p' 80-9F linux/ppdev.h user-space parport -'p' a1-a4 linux/pps.h LinuxPPS +'p' A1-A4 linux/pps.h LinuxPPS 'q' 00-1F linux/serio.h -'q' 80-FF Internet PhoneJACK, Internet LineJACK - -'r' 00-1F linux/msdos_fs.h +'q' 80-FF linux/telephony.h Internet PhoneJACK, Internet LineJACK + linux/ixjuser.h +'r' 00-1F linux/msdos_fs.h and fs/fat/dir.c 's' all linux/cdk.h 't' 00-7F linux/if_ppp.h 't' 80-8F linux/isdn_ppp.h +'t' 90 linux/toshiba.h 'u' 00-1F linux/smb_fs.h -'v' 00-1F linux/ext2_fs.h conflict! 'v' all linux/videodev.h conflict! +'v' 00-1F linux/ext2_fs.h conflict! +'v' 00-1F linux/fs.h conflict! +'v' 00-0F linux/sonypi.h conflict! +'v' C0-CF drivers/media/video/ov511.h conflict! +'v' C0-DF media/pwc-ioctl.h conflict! +'v' C0-FF linux/meye.h conflict! +'v' C0-CF drivers/media/video/zoran/zoran.h conflict! +'v' D0-DF drivers/media/video/cpia2/cpia2dev.h conflict! 'w' all CERN SCI driver 'y' 00-1F packet based user level communications -'z' 00-3F CAN bus card +'z' 00-3F CAN bus card conflict! -'z' 40-7F CAN bus card +'z' 40-7F CAN bus card conflict! +'z' 10-4F drivers/s390/crypto/zcrypt_api.h conflict! 0x80 00-1F linux/fb.h 0x81 00-1F linux/videotext.h +0x88 00-3F media/ovcamchip.h 0x89 00-06 arch/x86/include/asm/sockios.h 0x89 0B-DF linux/sockios.h 0x89 E0-EF linux/sockios.h SIOCPROTOPRIVATE range +0x89 E0-EF linux/dn.h PROTOPRIVATE range 0x89 F0-FF linux/sockios.h SIOCDEVPRIVATE range 0x8B all linux/wireless.h 0x8C 00-3F WiNRADiO driver 0x90 00 drivers/cdrom/sbpcd.h +0x92 00-0F drivers/usb/mon/mon_bin.c 0x93 60-7F linux/auto_fs.h +0x94 all fs/btrfs/ioctl.h 0x99 00-0F 537-Addinboard driver 0xA0 all linux/sdp/sdp.h Industrial Device Project @@ -192,17 +302,22 @@ Code Seq# Include File Comments 0xAB 00-1F linux/nbd.h 0xAC 00-1F linux/raw.h 0xAD 00 Netfilter device in development: - + 0xAE all linux/kvm.h Kernel-based Virtual Machine 0xB0 all RATIO devices in development: 0xB1 00-1F PPPoX +0xC0 00-0F linux/usb/iowarrior.h 0xCB 00-1F CBM serial IEC bus in development: +0xCD 01 linux/reiserfs_fs.h +0xCF 02 fs/cifs/ioctl.c +0xDB 00-0F drivers/char/mwave/mwavepub.h 0xDD 00-3F ZFCP device driver see drivers/s390/scsi/ -0xF3 00-3F video/sisfb.h sisfb (in development) +0xF3 00-3F drivers/usb/misc/sisusbvga/sisusb.h sisfb (in development) 0xF4 00-1F video/mbxfb.h mbxfb +0xFD all linux/dm-ioctl.h -- cgit v1.1 From 1306d603fcf1f6682f8575d1ff23631a24184b21 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Fri, 8 Jan 2010 14:42:56 -0800 Subject: proc: partially revert "procfs: provide stack information for threads" Commit d899bf7b (procfs: provide stack information for threads) introduced to show stack information in /proc/{pid}/status. But it cause large performance regression. Unfortunately /proc/{pid}/status is used ps command too and ps is one of most important component. Because both to take mmap_sem and page table walk are heavily operation. If many process run, the ps performance is, [before d899bf7b] % perf stat ps >/dev/null Performance counter stats for 'ps': 4090.435806 task-clock-msecs # 0.032 CPUs 229 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 234 page-faults # 0.000 M/sec 8587565207 cycles # 2099.425 M/sec 9866662403 instructions # 1.149 IPC 3789415411 cache-references # 926.409 M/sec 30419509 cache-misses # 7.437 M/sec 128.859521955 seconds time elapsed [after d899bf7b] % perf stat ps > /dev/null Performance counter stats for 'ps': 4305.081146 task-clock-msecs # 0.028 CPUs 480 context-switches # 0.000 M/sec 2 CPU-migrations # 0.000 M/sec 237 page-faults # 0.000 M/sec 9021211334 cycles # 2095.480 M/sec 10605887536 instructions # 1.176 IPC 3612650999 cache-references # 839.160 M/sec 23917502 cache-misses # 5.556 M/sec 152.277819582 seconds time elapsed Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps provide almost same information. we can use it. Commit d899bf7b introduced two features: 1) Add the annotattion of [thread stack: xxxx] mark to /proc/{pid}/task/{tid}/maps. 2) Add StackUsage field to /proc/{pid}/status. I only revert (2), because I haven't seen (1) cause regression. Signed-off-by: KOSAKI Motohiro Cc: Stefani Seibold Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Alexey Dobriyan Cc: "Eric W. Biederman" Cc: Randy Dunlap Cc: Andrew Morton Cc: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 2 -- 1 file changed, 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 220cc63..0d07513 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -177,7 +177,6 @@ read the file /proc/PID/status: CapBnd: ffffffffffffffff voluntary_ctxt_switches: 0 nonvoluntary_ctxt_switches: 1 - Stack usage: 12 kB This shows you nearly the same information you would get if you viewed it with the ps command. In fact, ps uses the proc file system to obtain its @@ -231,7 +230,6 @@ Table 1-2: Contents of the statm files (as of 2.6.30-rc7) Mems_allowed_list Same as previous, but in "list format" voluntary_ctxt_switches number of voluntary context switches nonvoluntary_ctxt_switches number of non voluntary context switches - Stack usage: stack usage high water mark (round up to page size) .............................................................................. Table 1-3: Contents of the statm files (as of 2.6.8-rc3) -- cgit v1.1 From b5430a04e995081a308b4419bd0940f2badc6e6b Mon Sep 17 00:00:00 2001 From: Tomaz Mertelj Date: Fri, 8 Jan 2010 14:43:04 -0800 Subject: hwmon: driver for Texas Instruments amc6821 chip Signed-off-by: Cc: Jean Delvare Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/hwmon/amc6821 | 102 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 102 insertions(+) create mode 100644 Documentation/hwmon/amc6821 (limited to 'Documentation') diff --git a/Documentation/hwmon/amc6821 b/Documentation/hwmon/amc6821 new file mode 100644 index 0000000..ced8359 --- /dev/null +++ b/Documentation/hwmon/amc6821 @@ -0,0 +1,102 @@ +Kernel driver amc6821 +===================== + +Supported chips: + Texas Instruments AMC6821 + Prefix: 'amc6821' + Addresses scanned: 0x18, 0x19, 0x1a, 0x2c, 0x2d, 0x2e, 0x4c, 0x4d, 0x4e + Datasheet: http://focus.ti.com/docs/prod/folders/print/amc6821.html + +Authors: + Tomaz Mertelj + + +Description +----------- + +This driver implements support for the Texas Instruments amc6821 chip. +The chip has one on-chip and one remote temperature sensor and one pwm fan +regulator. +The pwm can be controlled either from software or automatically. + +The driver provides the following sensor accesses in sysfs: + +temp1_input ro on-chip temperature +temp1_min rw " +temp1_max rw " +temp1_crit rw " +temp1_min_alarm ro " +temp1_max_alarm ro " +temp1_crit_alarm ro " + +temp2_input ro remote temperature +temp2_min rw " +temp2_max rw " +temp2_crit rw " +temp2_min_alarm ro " +temp2_max_alarm ro " +temp2_crit_alarm ro " +temp2_fault ro " + +fan1_input ro tachometer speed +fan1_min rw " +fan1_max rw " +fan1_fault ro " +fan1_div rw Fan divisor can be either 2 or 4. + +pwm1 rw pwm1 +pwm1_enable rw regulator mode, 1=open loop, 2=fan controlled + by remote temperature, 3=fan controlled by + combination of the on-chip temperature and + remote-sensor temperature, +pwm1_auto_channels_temp ro 1 if pwm_enable==2, 3 if pwm_enable==3 +pwm1_auto_point1_pwm ro Hardwired to 0, shared for both + temperature channels. +pwm1_auto_point2_pwm rw This value is shared for both temperature + channels. +pwm1_auto_point3_pwm rw Hardwired to 255, shared for both + temperature channels. + +temp1_auto_point1_temp ro Hardwired to temp2_auto_point1_temp + which is rw. Below this temperature fan stops. +temp1_auto_point2_temp rw The low-temperature limit of the proportional + range. Below this temperature + pwm1 = pwm1_auto_point2_pwm. It can go from + 0 degree C to 124 degree C in steps of + 4 degree C. Read it out after writing to get + the actual value. +temp1_auto_point3_temp rw Above this temperature fan runs at maximum + speed. It can go from temp1_auto_point2_temp. + It can only have certain discrete values + which depend on temp1_auto_point2_temp and + pwm1_auto_point2_pwm. Read it out after + writing to get the actual value. + +temp2_auto_point1_temp rw Must be between 0 degree C and 63 degree C and + it defines the passive cooling temperature. + Below this temperature the fan stops in + the closed loop mode. +temp2_auto_point2_temp rw The low-temperature limit of the proportional + range. Below this temperature + pwm1 = pwm1_auto_point2_pwm. It can go from + 0 degree C to 124 degree C in steps + of 4 degree C. + +temp2_auto_point3_temp rw Above this temperature fan runs at maximum + speed. It can only have certain discrete + values which depend on temp2_auto_point2_temp + and pwm1_auto_point2_pwm. Read it out after + writing to get actual value. + + +Module parameters +----------------- + +If your board has a BIOS that initializes the amc6821 correctly, you should +load the module with: init=0. + +If your board BIOS doesn't initialize the chip, or you want +different settings, you can set the following parameters: +init=1, +pwminv: 0 default pwm output, 1 inverts pwm output. + -- cgit v1.1 From 006b4298f26984d514546fe4e53371761f66b643 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 8 Jan 2010 14:43:07 -0800 Subject: Documentation: update ring-buffer-design.txt Fix typos, grammos, spellos, hyphenation. Signed-off-by: Randy Dunlap Acked-by: Steven Rostedt Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/trace/ring-buffer-design.txt | 56 +++++++++++++++--------------- 1 file changed, 28 insertions(+), 28 deletions(-) (limited to 'Documentation') diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt index 5b1d23d..d299ff3 100644 --- a/Documentation/trace/ring-buffer-design.txt +++ b/Documentation/trace/ring-buffer-design.txt @@ -33,9 +33,9 @@ head_page - a pointer to the page that the reader will use next tail_page - a pointer to the page that will be written to next -commit_page - a pointer to the page with the last finished non nested write. +commit_page - a pointer to the page with the last finished non-nested write. -cmpxchg - hardware assisted atomic transaction that performs the following: +cmpxchg - hardware-assisted atomic transaction that performs the following: A = B iff previous A == C @@ -52,15 +52,15 @@ The Generic Ring Buffer The ring buffer can be used in either an overwrite mode or in producer/consumer mode. -Producer/consumer mode is where the producer were to fill up the +Producer/consumer mode is where if the producer were to fill up the buffer before the consumer could free up anything, the producer will stop writing to the buffer. This will lose most recent events. -Overwrite mode is where the produce were to fill up the buffer +Overwrite mode is where if the producer were to fill up the buffer before the consumer could free up anything, the producer will overwrite the older data. This will lose the oldest events. -No two writers can write at the same time (on the same per cpu buffer), +No two writers can write at the same time (on the same per-cpu buffer), but a writer may interrupt another writer, but it must finish writing before the previous writer may continue. This is very important to the algorithm. The writers act like a "stack". The way interrupts works @@ -79,16 +79,16 @@ the interrupt doing a write as well. Readers can happen at any time. But no two readers may run at the same time, nor can a reader preempt/interrupt another reader. A reader -can not preempt/interrupt a writer, but it may read/consume from the +cannot preempt/interrupt a writer, but it may read/consume from the buffer at the same time as a writer is writing, but the reader must be on another processor to do so. A reader may read on its own processor and can be preempted by a writer. -A writer can preempt a reader, but a reader can not preempt a writer. +A writer can preempt a reader, but a reader cannot preempt a writer. But a reader can read the buffer at the same time (on another processor) as a writer. -The ring buffer is made up of a list of pages held together by a link list. +The ring buffer is made up of a list of pages held together by a linked list. At initialization a reader page is allocated for the reader that is not part of the ring buffer. @@ -102,7 +102,7 @@ the head page. The reader has its own page to use. At start up time, this page is allocated but is not attached to the list. When the reader wants -to read from the buffer, if its page is empty (like it is on start up) +to read from the buffer, if its page is empty (like it is on start-up), it will swap its page with the head_page. The old reader page will become part of the ring buffer and the head_page will be removed. The page after the inserted page (old reader_page) will become the @@ -206,7 +206,7 @@ The main pointers: commit page - the page that last finished a write. -The commit page only is updated by the outer most writer in the +The commit page only is updated by the outermost writer in the writer stack. A writer that preempts another writer will not move the commit page. @@ -281,7 +281,7 @@ with the previous write. The commit pointer points to the last write location that was committed without preempting another write. When a write that preempted another write is committed, it only becomes a pending commit -and will not be a full commit till all writes have been committed. +and will not be a full commit until all writes have been committed. The commit page points to the page that has the last full commit. The tail page points to the page with the last write (before @@ -292,7 +292,7 @@ be several pages ahead. If the tail page catches up to the commit page then no more writes may take place (regardless of the mode of the ring buffer: overwrite and produce/consumer). -The order of pages are: +The order of pages is: head page commit page @@ -311,7 +311,7 @@ Possible scenario: There is a special case that the head page is after either the commit page and possibly the tail page. That is when the commit (and tail) page has been swapped with the reader page. This is because the head page is always -part of the ring buffer, but the reader page is not. When ever there +part of the ring buffer, but the reader page is not. Whenever there has been less than a full page that has been committed inside the ring buffer, and a reader swaps out a page, it will be swapping out the commit page. @@ -338,7 +338,7 @@ and a reader swaps out a page, it will be swapping out the commit page. In this case, the head page will not move when the tail and commit move back into the ring buffer. -The reader can not swap a page into the ring buffer if the commit page +The reader cannot swap a page into the ring buffer if the commit page is still on that page. If the read meets the last commit (real commit not pending or reserved), then there is nothing more to read. The buffer is considered empty until another full commit finishes. @@ -395,7 +395,7 @@ The main idea behind the lockless algorithm is to combine the moving of the head_page pointer with the swapping of pages with the reader. State flags are placed inside the pointer to the page. To do this, each page must be aligned in memory by 4 bytes. This will allow the 2 -least significant bits of the address to be used as flags. Since +least significant bits of the address to be used as flags, since they will always be zero for the address. To get the address, simply mask out the flags. @@ -460,7 +460,7 @@ When the reader tries to swap the page with the ring buffer, it will also use cmpxchg. If the flag bit in the pointer to the head page does not have the HEADER flag set, the compare will fail and the reader will need to look for the new head page and try again. -Note, the flag UPDATE and HEADER are never set at the same time. +Note, the flags UPDATE and HEADER are never set at the same time. The reader swaps the reader page as follows: @@ -539,7 +539,7 @@ updated to the reader page. | +-----------------------------+ | +------------------------------------+ -Another important point. The page that the reader page points back to +Another important point: The page that the reader page points back to by its previous pointer (the one that now points to the new head page) never points back to the reader page. That is because the reader page is not part of the ring buffer. Traversing the ring buffer via the next pointers @@ -572,7 +572,7 @@ not be able to swap the head page from the buffer, nor will it be able to move the head page, until the writer is finished with the move. This eliminates any races that the reader can have on the writer. The reader -must spin, and this is why the reader can not preempt the writer. +must spin, and this is why the reader cannot preempt the writer. tail page | @@ -659,9 +659,9 @@ before pushing the head page. If it is, then it can be assumed that the tail page wrapped the buffer, and we must drop new writes. This is not a race condition, because the commit page can only be moved -by the outter most writer (the writer that was preempted). +by the outermost writer (the writer that was preempted). This means that the commit will not move while a writer is moving the -tail page. The reader can not swap the reader page if it is also being +tail page. The reader cannot swap the reader page if it is also being used as the commit page. The reader can simply check that the commit is off the reader page. Once the commit page leaves the reader page it will never go back on it unless a reader does another swap with the @@ -733,7 +733,7 @@ The write converts the head page pointer to UPDATE. --->| |<---| |<---| |<---| |<--- +---+ +---+ +---+ +---+ -But if a nested writer preempts here. It will see that the next +But if a nested writer preempts here, it will see that the next page is a head page, but it is also nested. It will detect that it is nested and will save that information. The detection is the fact that it sees the UPDATE flag instead of a HEADER or NORMAL @@ -761,7 +761,7 @@ to NORMAL. --->| |<---| |<---| |<---| |<--- +---+ +---+ +---+ +---+ -After the nested writer finishes, the outer most writer will convert +After the nested writer finishes, the outermost writer will convert the UPDATE pointer to NORMAL. @@ -812,7 +812,7 @@ head page. +---+ +---+ +---+ +---+ The nested writer moves the tail page forward. But does not set the old -update page to NORMAL because it is not the outer most writer. +update page to NORMAL because it is not the outermost writer. tail page | @@ -892,7 +892,7 @@ It will return to the first writer. --->| |<---| |<---| |<---| |<--- +---+ +---+ +---+ +---+ -The first writer can not know atomically test if the tail page moved +The first writer cannot know atomically if the tail page moved while it updates the HEAD page. It will then update the head page to what it thinks is the new head page. @@ -923,9 +923,9 @@ if the tail page is either where it use to be or on the next page: --->| |<---| |<---| |<---| |<--- +---+ +---+ +---+ +---+ -If tail page != A and tail page does not equal B, then it must reset the -pointer back to NORMAL. The fact that it only needs to worry about -nested writers, it only needs to check this after setting the HEAD page. +If tail page != A and tail page != B, then it must reset the pointer +back to NORMAL. The fact that it only needs to worry about nested +writers means that it only needs to check this after setting the HEAD page. (first writer) @@ -939,7 +939,7 @@ nested writers, it only needs to check this after setting the HEAD page. +---+ +---+ +---+ +---+ Now the writer can update the head page. This is also why the head page must -remain in UPDATE and only reset by the outer most writer. This prevents +remain in UPDATE and only reset by the outermost writer. This prevents the reader from seeing the incorrect head page. -- cgit v1.1 From d2b34e20c1f431604e0dde910c3ff271c84ed706 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 8 Jan 2010 14:43:09 -0800 Subject: documentation: update kernel-doc-nano-HOWTO information Remove comments about function short descriptions not allowed to be on multiple lines (that was fixed/changed recently). Add comments that function "section header:" names need to be unique per function/struct/union/typedef/enum. Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-doc-nano-HOWTO.txt | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-doc-nano-HOWTO.txt b/Documentation/kernel-doc-nano-HOWTO.txt index 348b9e5..27a52b3 100644 --- a/Documentation/kernel-doc-nano-HOWTO.txt +++ b/Documentation/kernel-doc-nano-HOWTO.txt @@ -214,11 +214,13 @@ The format of the block comment is like this: * (section header: (section description)? )* (*)?*/ -The short function description ***cannot be multiline***, but the other -descriptions can be (and they can contain blank lines). If you continue -that initial short description onto a second line, that second line will -appear further down at the beginning of the description section, which is -almost certainly not what you had in mind. +All "description" text can span multiple lines, although the +function_name & its short description are traditionally on a single line. +Description text may also contain blank lines (i.e., lines that contain +only a "*"). + +"section header:" names must be unique per function (or struct, +union, typedef, enum). Avoid putting a spurious blank line after the function name, or else the description will be repeated! -- cgit v1.1 From aa4e2e171385bb77b4da8b760d26dea2aa291587 Mon Sep 17 00:00:00 2001 From: Ben Hutchings Date: Mon, 11 Jan 2010 15:53:45 -0800 Subject: Documentation/3c509: document ethtool support 3c509 was changed to support ethtool in 2002, making the 'xcvr' module parameter obsolete in most cases. More recently 3c509 was converted to the modern driver model and this parameter was removed. Fix the documentation to refer to ethtool rather than the module parameter. Signed-off-by: Ben Hutchings Signed-off-by: David S. Miller --- Documentation/networking/3c509.txt | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/3c509.txt b/Documentation/networking/3c509.txt index 0643e3b..3c45d5d 100644 --- a/Documentation/networking/3c509.txt +++ b/Documentation/networking/3c509.txt @@ -48,11 +48,11 @@ for LILO parameters for doing this: This configures the first found 3c509 card for IRQ 10, base I/O 0x310, and transceiver type 3 (10base2). The flag "0x3c509" must be set to avoid conflicts with other card types when overriding the I/O address. When the driver is -loaded as a module, only the IRQ and transceiver setting may be overridden. -For example, setting two cards to 10base2/IRQ10 and AUI/IRQ11 is done by using -the xcvr and irq module options: +loaded as a module, only the IRQ may be overridden. For example, +setting two cards to IRQ10 and IRQ11 is done by using the irq module +option: - options 3c509 xcvr=3,1 irq=10,11 + options 3c509 irq=10,11 (2) Full-duplex mode @@ -77,6 +77,8 @@ operation. itself full-duplex capable. This is almost certainly one of two things: a full- duplex-capable Ethernet switch (*not* a hub), or a full-duplex-capable NIC on another system that's connected directly to the 3c509B via a crossover cable. + +Full-duplex mode can be enabled using 'ethtool'. /////Extremely important caution concerning full-duplex mode///// Understand that the 3c509B's hardware's full-duplex support is much more @@ -113,6 +115,8 @@ This insured that merely upgrading the driver from an earlier version would never automatically enable full-duplex mode in an existing installation; it must always be explicitly enabled via one of these code in order to be activated. + +The transceiver type can be changed using 'ethtool'. (4a) Interpretation of error messages and common problems -- cgit v1.1 From ceafe1d2fe33e92691bfdbd5a93ed259c3da7b60 Mon Sep 17 00:00:00 2001 From: Hans de Goede Date: Mon, 11 Jan 2010 10:50:53 -0200 Subject: feature-removal-schedule: Add v4l1 drivers obsoleted by gspca sub drivers This patch adds the ov511, quickcam_messenger, w9968cf, stv680 and ovcamchip v4l1 drivers to the feature removal schedule as the devices they support are now all also supported by v4l2 gspca sub drivers. This patch also adds the v4l2 vc0301 driver for removal as it duplicates functionality of the gspca_zc3xx driver, zc0301 only supports 2 USB-ID's (because it only supports a limited set of sensors) wich are also supported by the gspca_zc3xx driver (which supports 53 USB-ID's in total). [mchehab@redhat.com: change "when" to 2.6.35] Signed-off-by: Hans de Goede Signed-off-by: Mauro Carvalho Chehab --- Documentation/feature-removal-schedule.txt | 49 ++++++++++++++++++++++++++++++ 1 file changed, 49 insertions(+) (limited to 'Documentation') diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 870d190..0a46833 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -493,3 +493,52 @@ Why: These two features use non-standard interfaces. There are the Who: Corentin Chary ---------------------------- + +What: usbvideo quickcam_messenger driver +When: 2.6.35 +Files: drivers/media/video/usbvideo/quickcam_messenger.[ch] +Why: obsolete v4l1 driver replaced by gspca_stv06xx +Who: Hans de Goede + +---------------------------- + +What: ov511 v4l1 driver +When: 2.6.35 +Files: drivers/media/video/ov511.[ch] +Why: obsolete v4l1 driver replaced by gspca_ov519 +Who: Hans de Goede + +---------------------------- + +What: w9968cf v4l1 driver +When: 2.6.35 +Files: drivers/media/video/w9968cf*.[ch] +Why: obsolete v4l1 driver replaced by gspca_ov519 +Who: Hans de Goede + +---------------------------- + +What: ovcamchip sensor framework +When: 2.6.35 +Files: drivers/media/video/ovcamchip/* +Why: Only used by obsoleted v4l1 drivers +Who: Hans de Goede + +---------------------------- + +What: stv680 v4l1 driver +When: 2.6.35 +Files: drivers/media/video/stv680.[ch] +Why: obsolete v4l1 driver replaced by gspca_stv0680 +Who: Hans de Goede + +---------------------------- + +What: zc0301 v4l driver +When: 2.6.35 +Files: drivers/media/video/zc0301/* +Why: Duplicate functionality with the gspca_zc3xx driver, zc0301 only + supports 2 USB-ID's (because it only supports a limited set of + sensors) wich are also supported by the gspca_zc3xx driver + (which supports 53 USB-ID's in total) +Who: Hans de Goede -- cgit v1.1