From 3e347261a80b57df792ab9464b5f0ed59add53a8 Mon Sep 17 00:00:00 2001 From: Bob Picco Date: Sat, 3 Sep 2005 15:54:28 -0700 Subject: [PATCH] sparsemem extreme implementation With cleanups from Dave Hansen SPARSEMEM_EXTREME makes mem_section a one dimensional array of pointers to mem_sections. This two level layout scheme is able to achieve smaller memory requirements for SPARSEMEM with the tradeoff of an additional shift and load when fetching the memory section. The current SPARSEMEM implementation is a one dimensional array of mem_sections which is the default SPARSEMEM configuration. The patch attempts isolates the implementation details of the physical layout of the sparsemem section array. SPARSEMEM_EXTREME requires bootmem to be functioning at the time of memory_present() calls. This is not always feasible, so architectures which do not need it may allocate everything statically by using SPARSEMEM_STATIC. Signed-off-by: Andy Whitcroft Signed-off-by: Bob Picco Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/Kconfig | 1 + 1 file changed, 1 insertion(+) (limited to 'arch/i386') diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig index 619d843..dcb0ad0 100644 --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -754,6 +754,7 @@ config NUMA depends on SMP && HIGHMEM64G && (X86_NUMAQ || X86_GENERICARCH || (X86_SUMMIT && ACPI)) default n if X86_PC default y if (X86_NUMAQ || X86_SUMMIT) + select SPARSEMEM_STATIC # Need comments to help the hapless user trying to turn on NUMA support comment "NUMA (NUMA-Q) requires SMP, 64GB highmem support" -- cgit v1.1 From 7bf07f3d4b4358aa6d99a26d7a0165f1e91c3fcc Mon Sep 17 00:00:00 2001 From: Adam Litke Date: Sat, 3 Sep 2005 15:55:00 -0700 Subject: [PATCH] hugetlb: move stale pte check into huge_pte_alloc() Initial Post (Wed, 17 Aug 2005) This patch moves the if (! pte_none(*pte)) hugetlb_clean_stale_pgtable(pte); logic into huge_pte_alloc() so all of its callers can be immune to the bug described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246 > It turns out there is a bug in hugetlb_prefault(): with 3 level page table, > huge_pte_alloc() might return a pmd that points to a PTE page. It happens > if the virtual address for hugetlb mmap is recycled from previously used > normal page mmap. free_pgtables() might not scrub the pmd entry on > munmap and hugetlb_prefault skips on any pmd presence regardless what type > it is. Unless I am missing something, it seems more correct to place the check inside huge_pte_alloc() to prevent a the same bug wherever a huge pte is allocated. It also allows checking for this condition when lazily faulting huge pages later in the series. Signed-off-by: Adam Litke Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/mm/hugetlbpage.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/mm/hugetlbpage.c b/arch/i386/mm/hugetlbpage.c index 3b099f3..57c486f 100644 --- a/arch/i386/mm/hugetlbpage.c +++ b/arch/i386/mm/hugetlbpage.c @@ -22,12 +22,21 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr) { pgd_t *pgd; pud_t *pud; - pmd_t *pmd = NULL; + pmd_t *pmd; + pte_t *pte = NULL; pgd = pgd_offset(mm, addr); pud = pud_alloc(mm, pgd, addr); pmd = pmd_alloc(mm, pud, addr); - return (pte_t *) pmd; + + if (!pmd) + goto out; + + pte = (pte_t *) pmd; + if (!pte_none(*pte) && !pte_huge(*pte)) + hugetlb_clean_stale_pgtable(pte); +out: + return pte; } pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) -- cgit v1.1 From 02b0ccef903e85673ead74ddb7c431f2f7ce183d Mon Sep 17 00:00:00 2001 From: Adam Litke Date: Sat, 3 Sep 2005 15:55:01 -0700 Subject: [PATCH] hugetlb: check p?d_present in huge_pte_offset() For demand faulting, we cannot assume that the page tables will be populated. Do what the rest of the architectures do and test p?d_present() while walking down the page table. Signed-off-by: Adam Litke Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/mm/hugetlbpage.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/mm/hugetlbpage.c b/arch/i386/mm/hugetlbpage.c index 57c486f..24c8a53 100644 --- a/arch/i386/mm/hugetlbpage.c +++ b/arch/i386/mm/hugetlbpage.c @@ -46,8 +46,11 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) pmd_t *pmd = NULL; pgd = pgd_offset(mm, addr); - pud = pud_offset(pgd, addr); - pmd = pmd_offset(pud, addr); + if (pgd_present(*pgd)) { + pud = pud_offset(pgd, addr); + if (pud_present(*pud)) + pmd = pmd_offset(pud, addr); + } return (pte_t *) pmd; } -- cgit v1.1 From 0e5c9f39f64d8a55c5db37a5ea43e37d3422fd92 Mon Sep 17 00:00:00 2001 From: "Chen, Kenneth W" Date: Sat, 3 Sep 2005 15:55:02 -0700 Subject: [PATCH] remove hugetlb_clean_stale_pgtable() and fix huge_pte_alloc() I don't think we need to call hugetlb_clean_stale_pgtable() anymore in 2.6.13 because of the rework with free_pgtables(). It now collect all the pte page at the time of munmap. It used to only collect page table pages when entire one pgd can be freed and left with staled pte pages. Not anymore with 2.6.13. This function will never be called and We should turn it into a BUG_ON. I also spotted two problems here, not Adam's fault :-) (1) in huge_pte_alloc(), it looks like a bug to me that pud is not checked before calling pmd_alloc() (2) in hugetlb_clean_stale_pgtable(), it also missed a call to pmd_free_tlb. I think a tlb flush is required to flush the mapping for the page table itself when we clear out the pmd pointing to a pte page. However, since hugetlb_clean_stale_pgtable() is never called, so it won't trigger the bug. Signed-off-by: Ken Chen Cc: Adam Litke Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/mm/hugetlbpage.c | 23 +++-------------------- 1 file changed, 3 insertions(+), 20 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/mm/hugetlbpage.c b/arch/i386/mm/hugetlbpage.c index 24c8a53..d524127 100644 --- a/arch/i386/mm/hugetlbpage.c +++ b/arch/i386/mm/hugetlbpage.c @@ -22,20 +22,14 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr) { pgd_t *pgd; pud_t *pud; - pmd_t *pmd; pte_t *pte = NULL; pgd = pgd_offset(mm, addr); pud = pud_alloc(mm, pgd, addr); - pmd = pmd_alloc(mm, pud, addr); + if (pud) + pte = (pte_t *) pmd_alloc(mm, pud, addr); + BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte)); - if (!pmd) - goto out; - - pte = (pte_t *) pmd; - if (!pte_none(*pte) && !pte_huge(*pte)) - hugetlb_clean_stale_pgtable(pte); -out: return pte; } @@ -130,17 +124,6 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, } #endif -void hugetlb_clean_stale_pgtable(pte_t *pte) -{ - pmd_t *pmd = (pmd_t *) pte; - struct page *page; - - page = pmd_page(*pmd); - pmd_clear(pmd); - dec_page_state(nr_page_table_pages); - page_cache_release(page); -} - /* x86_64 also uses this file */ #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA -- cgit v1.1 From 869f96a00e8f53c7db8470ca9cf72e2e3fa40119 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Sat, 3 Sep 2005 15:56:26 -0700 Subject: [PATCH] x86: compress the stack layout of do_page_fault() This patch pushes the creation of a rare signal frame (SIGBUS or SIGSEGV) into a separate function, thus saving stackspace in the main do_page_fault() stackframe. The effect is 132 bytes less of stack used by the typical do_page_fault() invocation - resulting in a denser cache-layout. (Another minor effect is that in case of kernel crashes that come from a pagefault, we add less space to the already existing frame, giving the crash functions a slightly higher chance to do their stuff without overflowing the stack.) (The changes also result in slightly cleaner code.) argument bugfix from "Guillaume C." Signed-off-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/mm/fault.c | 31 +++++++++++++++++-------------- 1 file changed, 17 insertions(+), 14 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c index 8e90339..61d9e34 100644 --- a/arch/i386/mm/fault.c +++ b/arch/i386/mm/fault.c @@ -199,6 +199,18 @@ static inline int is_prefetch(struct pt_regs *regs, unsigned long addr, return 0; } +static noinline void force_sig_info_fault(int si_signo, int si_code, + unsigned long address, struct task_struct *tsk) +{ + siginfo_t info; + + info.si_signo = si_signo; + info.si_errno = 0; + info.si_code = si_code; + info.si_addr = (void __user *)address; + force_sig_info(si_signo, &info, tsk); +} + fastcall void do_invalid_op(struct pt_regs *, unsigned long); /* @@ -218,8 +230,7 @@ fastcall void do_page_fault(struct pt_regs *regs, unsigned long error_code) struct vm_area_struct * vma; unsigned long address; unsigned long page; - int write; - siginfo_t info; + int write, si_code; /* get the address */ __asm__("movl %%cr2,%0":"=r" (address)); @@ -233,7 +244,7 @@ fastcall void do_page_fault(struct pt_regs *regs, unsigned long error_code) tsk = current; - info.si_code = SEGV_MAPERR; + si_code = SEGV_MAPERR; /* * We fault-in kernel-space virtual memory on-demand. The @@ -313,7 +324,7 @@ fastcall void do_page_fault(struct pt_regs *regs, unsigned long error_code) * we can handle it.. */ good_area: - info.si_code = SEGV_ACCERR; + si_code = SEGV_ACCERR; write = 0; switch (error_code & 3) { default: /* 3: write, present */ @@ -387,11 +398,7 @@ bad_area_nosemaphore: /* Kernel addresses are always protection faults */ tsk->thread.error_code = error_code | (address >= TASK_SIZE); tsk->thread.trap_no = 14; - info.si_signo = SIGSEGV; - info.si_errno = 0; - /* info.si_code has been set above */ - info.si_addr = (void __user *)address; - force_sig_info(SIGSEGV, &info, tsk); + force_sig_info_fault(SIGSEGV, si_code, address, tsk); return; } @@ -500,11 +507,7 @@ do_sigbus: tsk->thread.cr2 = address; tsk->thread.error_code = error_code; tsk->thread.trap_no = 14; - info.si_signo = SIGBUS; - info.si_errno = 0; - info.si_code = BUS_ADRERR; - info.si_addr = (void __user *)address; - force_sig_info(SIGBUS, &info, tsk); + force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk); return; vmalloc_fault: -- cgit v1.1 From 4116c527ea9517623369a5b3b037aedde280d672 Mon Sep 17 00:00:00 2001 From: Venkatesh Pallipadi Date: Sat, 3 Sep 2005 15:56:27 -0700 Subject: [PATCH] hpet: use read_timer_tsc only when CPU has TSC Only use read_timer_tsc only when CPU has TSC. Thanks to Andrea for pointing this out. Should not be issue on any platforms as all recent systems that has HPET also has CPUs that supports TSC. The patch is still required for correctness. Signed-off-by: Venkatesh Pallipadi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/timers/timer_hpet.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/timers/timer_hpet.c b/arch/i386/kernel/timers/timer_hpet.c index ef8dac5..cbb3f22 100644 --- a/arch/i386/kernel/timers/timer_hpet.c +++ b/arch/i386/kernel/timers/timer_hpet.c @@ -136,6 +136,8 @@ static void delay_hpet(unsigned long loops) } while ((hpet_end - hpet_start) < (loops)); } +static struct timer_opts timer_hpet; + static int __init init_hpet(char* override) { unsigned long result, remain; @@ -163,6 +165,8 @@ static int __init init_hpet(char* override) } set_cyc2ns_scale(cpu_khz/1000); } + /* set this only when cpu_has_tsc */ + timer_hpet.read_timer = read_timer_tsc; } /* @@ -186,7 +190,6 @@ static struct timer_opts timer_hpet __read_mostly = { .get_offset = get_offset_hpet, .monotonic_clock = monotonic_clock_hpet, .delay = delay_hpet, - .read_timer = read_timer_tsc, }; struct init_timer_opts __initdata timer_hpet_init = { -- cgit v1.1 From 7ae65fd334232468a9d6b523a4fc141cd6ec5ea4 Mon Sep 17 00:00:00 2001 From: Matt Tolentino Date: Sat, 3 Sep 2005 15:56:27 -0700 Subject: [PATCH] x86: fix EFI memory map parsing The memory descriptors that comprise the EFI memory map are not fixed in stone such that the size could change in the future. This uses the memory descriptor size obtained from EFI to iterate over the memory map entries during boot. This enables the removal of an x86 specific pad (and ifdef) in the EFI header. I also couldn't stomach the broken up nature of the function to put EFI runtime calls into virtual mode any longer so I fixed that up a bit as well. For reference, this patch only impacts x86. Signed-off-by: Matt Tolentino Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/efi.c | 101 +++++++++++++++++++++++------------------------ arch/i386/kernel/setup.c | 14 ++++--- arch/i386/mm/init.c | 5 ++- 3 files changed, 62 insertions(+), 58 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/efi.c b/arch/i386/kernel/efi.c index 385883e..850648a 100644 --- a/arch/i386/kernel/efi.c +++ b/arch/i386/kernel/efi.c @@ -233,22 +233,23 @@ void __init efi_map_memmap(void) { memmap.map = NULL; - memmap.map = (efi_memory_desc_t *) - bt_ioremap((unsigned long) memmap.phys_map, - (memmap.nr_map * sizeof(efi_memory_desc_t))); - + memmap.map = bt_ioremap((unsigned long) memmap.phys_map, + (memmap.nr_map * memmap.desc_size)); if (memmap.map == NULL) printk(KERN_ERR PFX "Could not remap the EFI memmap!\n"); + + memmap.map_end = memmap.map + (memmap.nr_map * memmap.desc_size); } #if EFI_DEBUG static void __init print_efi_memmap(void) { efi_memory_desc_t *md; + void *p; int i; - for (i = 0; i < memmap.nr_map; i++) { - md = &memmap.map[i]; + for (p = memmap.map, i = 0; p < memmap.map_end; p += memmap.desc_size, i++) { + md = p; printk(KERN_INFO "mem%02u: type=%u, attr=0x%llx, " "range=[0x%016llx-0x%016llx) (%lluMB)\n", i, md->type, md->attribute, md->phys_addr, @@ -271,10 +272,10 @@ void efi_memmap_walk(efi_freemem_callback_t callback, void *arg) } prev, curr; efi_memory_desc_t *md; unsigned long start, end; - int i; + void *p; - for (i = 0; i < memmap.nr_map; i++) { - md = &memmap.map[i]; + for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { + md = p; if ((md->num_pages == 0) || (!is_available_memory(md))) continue; @@ -325,6 +326,7 @@ void __init efi_init(void) memmap.phys_map = EFI_MEMMAP; memmap.nr_map = EFI_MEMMAP_SIZE/EFI_MEMDESC_SIZE; memmap.desc_version = EFI_MEMDESC_VERSION; + memmap.desc_size = EFI_MEMDESC_SIZE; efi.systab = (efi_system_table_t *) boot_ioremap((unsigned long) efi_phys.systab, @@ -428,22 +430,30 @@ void __init efi_init(void) printk(KERN_ERR PFX "Could not map the runtime service table!\n"); /* Map the EFI memory map for use until paging_init() */ - - memmap.map = (efi_memory_desc_t *) - boot_ioremap((unsigned long) EFI_MEMMAP, EFI_MEMMAP_SIZE); - + memmap.map = boot_ioremap((unsigned long) EFI_MEMMAP, EFI_MEMMAP_SIZE); if (memmap.map == NULL) printk(KERN_ERR PFX "Could not map the EFI memory map!\n"); - if (EFI_MEMDESC_SIZE != sizeof(efi_memory_desc_t)) { - printk(KERN_WARNING PFX "Warning! Kernel-defined memdesc doesn't " - "match the one from EFI!\n"); - } + memmap.map_end = memmap.map + (memmap.nr_map * memmap.desc_size); + #if EFI_DEBUG print_efi_memmap(); #endif } +static inline void __init check_range_for_systab(efi_memory_desc_t *md) +{ + if (((unsigned long)md->phys_addr <= (unsigned long)efi_phys.systab) && + ((unsigned long)efi_phys.systab < md->phys_addr + + ((unsigned long)md->num_pages << EFI_PAGE_SHIFT))) { + unsigned long addr; + + addr = md->virt_addr - md->phys_addr + + (unsigned long)efi_phys.systab; + efi.systab = (efi_system_table_t *)addr; + } +} + /* * This function will switch the EFI runtime services to virtual mode. * Essentially, look through the EFI memmap and map every region that @@ -457,43 +467,32 @@ void __init efi_enter_virtual_mode(void) { efi_memory_desc_t *md; efi_status_t status; - int i; + void *p; efi.systab = NULL; - for (i = 0; i < memmap.nr_map; i++) { - md = &memmap.map[i]; + for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { + md = p; - if (md->attribute & EFI_MEMORY_RUNTIME) { - md->virt_addr = - (unsigned long)ioremap(md->phys_addr, - md->num_pages << EFI_PAGE_SHIFT); - if (!(unsigned long)md->virt_addr) { - printk(KERN_ERR PFX "ioremap of 0x%lX failed\n", - (unsigned long)md->phys_addr); - } + if (!(md->attribute & EFI_MEMORY_RUNTIME)) + continue; - if (((unsigned long)md->phys_addr <= - (unsigned long)efi_phys.systab) && - ((unsigned long)efi_phys.systab < - md->phys_addr + - ((unsigned long)md->num_pages << - EFI_PAGE_SHIFT))) { - unsigned long addr; - - addr = md->virt_addr - md->phys_addr + - (unsigned long)efi_phys.systab; - efi.systab = (efi_system_table_t *)addr; - } + md->virt_addr = (unsigned long)ioremap(md->phys_addr, + md->num_pages << EFI_PAGE_SHIFT); + if (!(unsigned long)md->virt_addr) { + printk(KERN_ERR PFX "ioremap of 0x%lX failed\n", + (unsigned long)md->phys_addr); } + /* update the virtual address of the EFI system table */ + check_range_for_systab(md); } if (!efi.systab) BUG(); status = phys_efi_set_virtual_address_map( - sizeof(efi_memory_desc_t) * memmap.nr_map, - sizeof(efi_memory_desc_t), + memmap.desc_size * memmap.nr_map, + memmap.desc_size, memmap.desc_version, memmap.phys_map); @@ -533,10 +532,10 @@ efi_initialize_iomem_resources(struct resource *code_resource, { struct resource *res; efi_memory_desc_t *md; - int i; + void *p; - for (i = 0; i < memmap.nr_map; i++) { - md = &memmap.map[i]; + for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { + md = p; if ((md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT)) > 0x100000000ULL) @@ -613,10 +612,10 @@ efi_initialize_iomem_resources(struct resource *code_resource, u32 efi_mem_type(unsigned long phys_addr) { efi_memory_desc_t *md; - int i; + void *p; - for (i = 0; i < memmap.nr_map; i++) { - md = &memmap.map[i]; + for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { + md = p; if ((md->phys_addr <= phys_addr) && (phys_addr < (md->phys_addr + (md-> num_pages << EFI_PAGE_SHIFT)) )) return md->type; @@ -627,10 +626,10 @@ u32 efi_mem_type(unsigned long phys_addr) u64 efi_mem_attributes(unsigned long phys_addr) { efi_memory_desc_t *md; - int i; + void *p; - for (i = 0; i < memmap.nr_map; i++) { - md = &memmap.map[i]; + for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { + md = p; if ((md->phys_addr <= phys_addr) && (phys_addr < (md->phys_addr + (md-> num_pages << EFI_PAGE_SHIFT)) )) return md->attribute; diff --git a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c index af4de58..9adbf71 100644 --- a/arch/i386/kernel/setup.c +++ b/arch/i386/kernel/setup.c @@ -370,12 +370,16 @@ static void __init limit_regions(unsigned long long size) int i; if (efi_enabled) { - for (i = 0; i < memmap.nr_map; i++) { - current_addr = memmap.map[i].phys_addr + - (memmap.map[i].num_pages << 12); - if (memmap.map[i].type == EFI_CONVENTIONAL_MEMORY) { + efi_memory_desc_t *md; + void *p; + + for (p = memmap.map, i = 0; p < memmap.map_end; + p += memmap.desc_size, i++) { + md = p; + current_addr = md->phys_addr + (md->num_pages << 12); + if (md->type == EFI_CONVENTIONAL_MEMORY) { if (current_addr >= size) { - memmap.map[i].num_pages -= + md->num_pages -= (((current_addr-size) + PAGE_SIZE-1) >> PAGE_SHIFT); memmap.nr_map = i + 1; return; diff --git a/arch/i386/mm/init.c b/arch/i386/mm/init.c index 12216b5..d8b23ab 100644 --- a/arch/i386/mm/init.c +++ b/arch/i386/mm/init.c @@ -198,9 +198,10 @@ int page_is_ram(unsigned long pagenr) if (efi_enabled) { efi_memory_desc_t *md; + void *p; - for (i = 0; i < memmap.nr_map; i++) { - md = &memmap.map[i]; + for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { + md = p; if (!is_available_memory(md)) continue; addr = (md->phys_addr+PAGE_SIZE-1) >> PAGE_SHIFT; -- cgit v1.1 From 5fd75ebb1a58c1a3c9e3d9fdf75ce7286b79bb74 Mon Sep 17 00:00:00 2001 From: Petr Tesarik Date: Sat, 3 Sep 2005 15:56:28 -0700 Subject: [PATCH] vm86: Honor TF bit when emulating an instruction If the virtual 86 machine reaches an instruction which raises a General Protection Fault (such as CLI or STI), the instruction is emulated (in handle_vm86_fault). However, the emulation ignored the TF bit, so the hardware debug interrupt was not invoked after such an emulated instruction (and the DOS debugger missed it). This patch fixes the problem by emulating the hardware debug interrupt as the last action before control is returned to the VM86 program. Signed-off-by: Petr Tesarik Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/vm86.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/vm86.c b/arch/i386/kernel/vm86.c index ec0f68c..2daa06f 100644 --- a/arch/i386/kernel/vm86.c +++ b/arch/i386/kernel/vm86.c @@ -542,7 +542,7 @@ void handle_vm86_fault(struct kernel_vm86_regs * regs, long error_code) unsigned char opcode; unsigned char __user *csp; unsigned char __user *ssp; - unsigned short ip, sp; + unsigned short ip, sp, orig_flags; int data32, pref_done; #define CHECK_IF_IN_TRAP \ @@ -551,8 +551,12 @@ void handle_vm86_fault(struct kernel_vm86_regs * regs, long error_code) #define VM86_FAULT_RETURN do { \ if (VMPI.force_return_for_pic && (VEFLAGS & (IF_MASK | VIF_MASK))) \ return_to_32bit(regs, VM86_PICRETURN); \ + if (orig_flags & TF_MASK) \ + handle_vm86_trap(regs, 0, 1); \ return; } while (0) + orig_flags = *(unsigned short *)®s->eflags; + csp = (unsigned char __user *) (regs->cs << 4); ssp = (unsigned char __user *) (regs->ss << 4); sp = SP(regs); -- cgit v1.1 From 484b90c4b965d54037ff99b198d84cdf144f8a35 Mon Sep 17 00:00:00 2001 From: Vivek Goyal Date: Sat, 3 Sep 2005 15:56:31 -0700 Subject: [PATCH] kdump: Save parameter segment in protected mode (x86) o With introduction of kexec as boot-loader, the assumption that parameter segment will always be loaded at lower address than kernel and will be addressable by early bootup page tables is no longer valid. In kexec on panic case parameter segment might well be loaded beyond kernel image and might not be addressable by early boot page tables. o This case might hit in the scenario where user has reserved a chunk of memory for second kernel, for example 16MB to 64MB, and has also built second kernel for physical memory location 16MB. In this case kexec has no choice but to load the parameter segment at a higher address than new kernel image at safe location where new kernel does not stomp it. o Though problem should automatically go away once relocatable kernel for i386 is in place and kexec can determine the location of new kernel at run time and load parameter segment at lower address than kernel image. But till then this patch can go in (assuming it does not break something else). o This patch moves up the boot parameter saving code. Now boot parameters are copied out in protected mode before page tables are initialized. This will ensure that parameter segment is always addressable irrespective of its physical location. Signed-off-by: Vivek Goyal Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/head.S | 48 ++++++++++++++++++++++++++---------------------- 1 file changed, 26 insertions(+), 22 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/head.S b/arch/i386/kernel/head.S index 4477bb1..0480ca9 100644 --- a/arch/i386/kernel/head.S +++ b/arch/i386/kernel/head.S @@ -77,6 +77,32 @@ ENTRY(startup_32) subl %edi,%ecx shrl $2,%ecx rep ; stosl +/* + * Copy bootup parameters out of the way. + * Note: %esi still has the pointer to the real-mode data. + * With the kexec as boot loader, parameter segment might be loaded beyond + * kernel image and might not even be addressable by early boot page tables. + * (kexec on panic case). Hence copy out the parameters before initializing + * page tables. + */ + movl $(boot_params - __PAGE_OFFSET),%edi + movl $(PARAM_SIZE/4),%ecx + cld + rep + movsl + movl boot_params - __PAGE_OFFSET + NEW_CL_POINTER,%esi + andl %esi,%esi + jnz 2f # New command line protocol + cmpw $(OLD_CL_MAGIC),OLD_CL_MAGIC_ADDR + jne 1f + movzwl OLD_CL_OFFSET,%esi + addl $(OLD_CL_BASE_ADDR),%esi +2: + movl $(saved_command_line - __PAGE_OFFSET),%edi + movl $(COMMAND_LINE_SIZE/4),%ecx + rep + movsl +1: /* * Initialize page tables. This creates a PDE and a set of page @@ -214,28 +240,6 @@ ENTRY(startup_32_smp) */ call setup_idt -/* - * Copy bootup parameters out of the way. - * Note: %esi still has the pointer to the real-mode data. - */ - movl $boot_params,%edi - movl $(PARAM_SIZE/4),%ecx - cld - rep - movsl - movl boot_params+NEW_CL_POINTER,%esi - andl %esi,%esi - jnz 2f # New command line protocol - cmpw $(OLD_CL_MAGIC),OLD_CL_MAGIC_ADDR - jne 1f - movzwl OLD_CL_OFFSET,%esi - addl $(OLD_CL_BASE_ADDR),%esi -2: - movl $saved_command_line,%edi - movl $(COMMAND_LINE_SIZE/4),%ecx - rep - movsl -1: checkCPUtype: movl $-1,X86_CPUID # -1 for no CPUID initially -- cgit v1.1 From 911a62d42365076209e2c327e7688db296e35d62 Mon Sep 17 00:00:00 2001 From: Venkatesh Pallipadi Date: Sat, 3 Sep 2005 15:56:31 -0700 Subject: [PATCH] x86: sutomatically enable bigsmp when we have more than 8 CPUs i386 generic subarchitecture requires explicit dmi strings or command line to enable bigsmp mode. The patch below removes that restriction, and uses bigsmp as soon as it finds more than 8 logical CPUs, Intel processors and xAPIC support. Signed-off-by: Venkatesh Pallipadi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/acpi/boot.c | 3 +++ arch/i386/kernel/mpparse.c | 9 +++++++++ arch/i386/kernel/setup.c | 8 +++++++- arch/i386/mach-generic/bigsmp.c | 5 ++++- arch/i386/mach-generic/probe.c | 20 ++++++++++++++++++++ 5 files changed, 43 insertions(+), 2 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/acpi/boot.c b/arch/i386/kernel/acpi/boot.c index b7808a8..34ee500 100644 --- a/arch/i386/kernel/acpi/boot.c +++ b/arch/i386/kernel/acpi/boot.c @@ -833,6 +833,9 @@ acpi_process_madt(void) if (!error) { acpi_lapic = 1; +#ifdef CONFIG_X86_GENERICARCH + generic_bigsmp_probe(); +#endif /* * Parse MADT IO-APIC entries */ diff --git a/arch/i386/kernel/mpparse.c b/arch/i386/kernel/mpparse.c index ce838ab..788efff 100644 --- a/arch/i386/kernel/mpparse.c +++ b/arch/i386/kernel/mpparse.c @@ -65,6 +65,8 @@ int nr_ioapics; int pic_mode; unsigned long mp_lapic_addr; +unsigned int def_to_bigsmp = 0; + /* Processor that is doing the boot up */ unsigned int boot_cpu_physical_apicid = -1U; /* Internal processor count */ @@ -213,6 +215,13 @@ static void __init MP_processor_info (struct mpc_config_processor *m) ver = 0x10; } apic_version[m->mpc_apicid] = ver; + if ((num_processors > 8) && + APIC_XAPIC(ver) && + (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)) + def_to_bigsmp = 1; + else + def_to_bigsmp = 0; + bios_cpu_apicid[num_processors - 1] = m->mpc_apicid; } diff --git a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c index 9adbf71..294bcca 100644 --- a/arch/i386/kernel/setup.c +++ b/arch/i386/kernel/setup.c @@ -1585,8 +1585,14 @@ void __init setup_arch(char **cmdline_p) */ acpi_boot_table_init(); acpi_boot_init(); -#endif +#if defined(CONFIG_SMP) && defined(CONFIG_X86_PC) + if (def_to_bigsmp) + printk(KERN_WARNING "More than 8 CPUs detected and " + "CONFIG_X86_PC cannot handle it.\nUse " + "CONFIG_X86_GENERICARCH or CONFIG_X86_BIGSMP.\n"); +#endif +#endif #ifdef CONFIG_X86_LOCAL_APIC if (smp_found_config) get_smp_config(); diff --git a/arch/i386/mach-generic/bigsmp.c b/arch/i386/mach-generic/bigsmp.c index 25883b4..037b2af 100644 --- a/arch/i386/mach-generic/bigsmp.c +++ b/arch/i386/mach-generic/bigsmp.c @@ -47,7 +47,10 @@ static struct dmi_system_id __initdata bigsmp_dmi_table[] = { static __init int probe_bigsmp(void) { - dmi_check_system(bigsmp_dmi_table); + if (def_to_bigsmp) + dmi_bigsmp = 1; + else + dmi_check_system(bigsmp_dmi_table); return dmi_bigsmp; } diff --git a/arch/i386/mach-generic/probe.c b/arch/i386/mach-generic/probe.c index 5497c65..cea5b3c 100644 --- a/arch/i386/mach-generic/probe.c +++ b/arch/i386/mach-generic/probe.c @@ -30,6 +30,25 @@ struct genapic *apic_probe[] __initdata = { NULL, }; +static int cmdline_apic; + +void __init generic_bigsmp_probe(void) +{ + /* + * This routine is used to switch to bigsmp mode when + * - There is no apic= option specified by the user + * - generic_apic_probe() has choosen apic_default as the sub_arch + * - we find more than 8 CPUs in acpi LAPIC listing with xAPIC support + */ + + if (!cmdline_apic && genapic == &apic_default) + if (apic_bigsmp.probe()) { + genapic = &apic_bigsmp; + printk(KERN_INFO "Overriding APIC driver with %s\n", + genapic->name); + } +} + void __init generic_apic_probe(char *command_line) { char *s; @@ -52,6 +71,7 @@ void __init generic_apic_probe(char *command_line) if (!changed) printk(KERN_ERR "Unknown genapic `%s' specified.\n", s); *p = old; + cmdline_apic = changed; } for (i = 0; !changed && apic_probe[i]; i++) { if (apic_probe[i]->probe()) { -- cgit v1.1 From 252943efcfce945d8dd3738ca4c4b9cbeb4f3fa9 Mon Sep 17 00:00:00 2001 From: Venkatesh Pallipadi Date: Sat, 3 Sep 2005 15:56:32 -0700 Subject: [PATCH] x86: Add the check for all the cores in a package in cache information Signed-off-by: Venkatesh Pallipadi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/cpu/intel_cacheinfo.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/cpu/intel_cacheinfo.c b/arch/i386/kernel/cpu/intel_cacheinfo.c index 6c55b50..9e0d5f8 100644 --- a/arch/i386/kernel/cpu/intel_cacheinfo.c +++ b/arch/i386/kernel/cpu/intel_cacheinfo.c @@ -305,6 +305,9 @@ static void __devinit cache_shared_cpu_map_setup(unsigned int cpu, int index) { struct _cpuid4_info *this_leaf; unsigned long num_threads_sharing; +#ifdef CONFIG_X86_HT + struct cpuinfo_x86 *c = cpu_data + cpu; +#endif this_leaf = CPUID4_INFO_IDX(cpu, index); num_threads_sharing = 1 + this_leaf->eax.split.num_threads_sharing; @@ -314,10 +317,12 @@ static void __devinit cache_shared_cpu_map_setup(unsigned int cpu, int index) #ifdef CONFIG_X86_HT else if (num_threads_sharing == smp_num_siblings) this_leaf->shared_cpu_map = cpu_sibling_map[cpu]; -#endif + else if (num_threads_sharing == (c->x86_num_cores * smp_num_siblings)) + this_leaf->shared_cpu_map = cpu_core_map[cpu]; else - printk(KERN_INFO "Number of CPUs sharing cache didn't match " + printk(KERN_DEBUG "Number of CPUs sharing cache didn't match " "any known set of CPUs\n"); +#endif } #else static void __init cache_shared_cpu_map_setup(unsigned int cpu, int index) {} -- cgit v1.1 From 56f1d5d52a21b93bc2984c920b17e0d80df5d1b2 Mon Sep 17 00:00:00 2001 From: "Natalie.Protasevich@unisys.com" Date: Sat, 3 Sep 2005 15:56:34 -0700 Subject: [PATCH] ES7000 platform update (i386) This is subarch update for ES7000. I've modified platform check code and removed unnecessary OEM table parsing for newer systems that don't use OEM information during boot. Parsing the table in fact is causing problems, and the platform doesn't get recognized. The patch only affects the ES7000 subach. Signed-off-by: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/mach-es7000/es7000.h | 5 +++-- arch/i386/mach-es7000/es7000plat.c | 45 +++++++++++++++++++------------------- 2 files changed, 25 insertions(+), 25 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/mach-es7000/es7000.h b/arch/i386/mach-es7000/es7000.h index 70691f0..898ed90 100644 --- a/arch/i386/mach-es7000/es7000.h +++ b/arch/i386/mach-es7000/es7000.h @@ -104,7 +104,8 @@ struct mip_reg { #define MIP_SW_APIC 0x1020b #define MIP_FUNC(VALUE) (VALUE & 0xff) -extern int parse_unisys_oem (char *oemptr, int oem_entries); -extern int find_unisys_acpi_oem_table(unsigned long *oem_addr, int *length); +extern int parse_unisys_oem (char *oemptr); +extern int find_unisys_acpi_oem_table(unsigned long *oem_addr); +extern void setup_unisys (); extern int es7000_start_cpu(int cpu, unsigned long eip); extern void es7000_sw_apic(void); diff --git a/arch/i386/mach-es7000/es7000plat.c b/arch/i386/mach-es7000/es7000plat.c index d5936d5..2000bdc 100644 --- a/arch/i386/mach-es7000/es7000plat.c +++ b/arch/i386/mach-es7000/es7000plat.c @@ -75,12 +75,29 @@ es7000_rename_gsi(int ioapic, int gsi) #endif // (CONFIG_X86_IO_APIC) && (CONFIG_ACPI_INTERPRETER || CONFIG_ACPI_BOOT) +void __init +setup_unisys () +{ + /* + * Determine the generation of the ES7000 currently running. + * + * es7000_plat = 1 if the machine is a 5xx ES7000 box + * es7000_plat = 2 if the machine is a x86_64 ES7000 box + * + */ + if (!(boot_cpu_data.x86 <= 15 && boot_cpu_data.x86_model <= 2)) + es7000_plat = 2; + else + es7000_plat = 1; + ioapic_renumber_irq = es7000_rename_gsi; +} + /* * Parse the OEM Table */ int __init -parse_unisys_oem (char *oemptr, int oem_entries) +parse_unisys_oem (char *oemptr) { int i; int success = 0; @@ -95,7 +112,7 @@ parse_unisys_oem (char *oemptr, int oem_entries) tp += 8; - for (i=0; i <= oem_entries; i++) { + for (i=0; i <= 6; i++) { type = *tp++; size = *tp++; tp -= 2; @@ -130,34 +147,18 @@ parse_unisys_oem (char *oemptr, int oem_entries) default: break; } - if (i == 6) break; tp += size; } if (success < 2) { es7000_plat = 0; - } else { - printk("\nEnabling ES7000 specific features...\n"); - /* - * Determine the generation of the ES7000 currently running. - * - * es7000_plat = 0 if the machine is NOT a Unisys ES7000 box - * es7000_plat = 1 if the machine is a 5xx ES7000 box - * es7000_plat = 2 if the machine is a x86_64 ES7000 box - * - */ - if (!(boot_cpu_data.x86 <= 15 && boot_cpu_data.x86_model <= 2)) - es7000_plat = 2; - else - es7000_plat = 1; - - ioapic_renumber_irq = es7000_rename_gsi; - } + } else + setup_unisys(); return es7000_plat; } int __init -find_unisys_acpi_oem_table(unsigned long *oem_addr, int *length) +find_unisys_acpi_oem_table(unsigned long *oem_addr) { struct acpi_table_rsdp *rsdp = NULL; unsigned long rsdp_phys = 0; @@ -201,13 +202,11 @@ find_unisys_acpi_oem_table(unsigned long *oem_addr, int *length) acpi_table_print(header, sdt.entry[i].pa); t = (struct oem_table *) __acpi_map_table(sdt.entry[i].pa, header->length); addr = (void *) __acpi_map_table(t->OEMTableAddr, t->OEMTableSize); - *length = header->length; *oem_addr = (unsigned long) addr; return 0; } } } - Dprintk("ES7000: did not find Unisys ACPI OEM table!\n"); return -1; } -- cgit v1.1 From 2a0694d15d55d0deed928786a6393d5e45e37d76 Mon Sep 17 00:00:00 2001 From: Roland McGrath Date: Sat, 3 Sep 2005 15:56:35 -0700 Subject: [PATCH] i386: clean up vDSO alignment padding This makes the vDSO use nops for all its padding around instructions, rather than sometimes zeros, and nop-pads the end of the area containing instructions to a 32-byte cache line, to keep text and data in separate lines. Signed-off-by: Roland McGrath Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/vsyscall-sigreturn.S | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/vsyscall-sigreturn.S b/arch/i386/kernel/vsyscall-sigreturn.S index c8fcf75..68afa50 100644 --- a/arch/i386/kernel/vsyscall-sigreturn.S +++ b/arch/i386/kernel/vsyscall-sigreturn.S @@ -15,7 +15,7 @@ */ .text - .org __kernel_vsyscall+32 + .org __kernel_vsyscall+32,0x90 .globl __kernel_sigreturn .type __kernel_sigreturn,@function __kernel_sigreturn: @@ -35,6 +35,7 @@ __kernel_rt_sigreturn: int $0x80 .LEND_rt_sigreturn: .size __kernel_rt_sigreturn,.-.LSTART_rt_sigreturn + .balign 32 .previous .section .eh_frame,"a",@progbits -- cgit v1.1 From 4bb0d3ec3e5b1e9e2399cdc641b3b6521ac9cdaa Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:36 -0700 Subject: [PATCH] i386: inline asm cleanup i386 Inline asm cleanup. Use cr/dr accessor functions. Also, a potential bugfix. Also, some CR accessors really should be volatile. Reads from CR0 (numeric state may change in an exception handler), writes to CR4 (flipping CR4.TSD) and reads from CR2 (page fault) prevent instruction re-ordering. I did not add memory clobber to CR3 / CR4 / CR0 updates, as it was not there to begin with, and in no case should kernel memory be clobbered, except when doing a TLB flush, which already has memory clobber. I noticed that page invalidation does not have a memory clobber. I can't find a bug as a result, but there is definitely a potential for a bug here: #define __flush_tlb_single(addr) \ __asm__ __volatile__("invlpg %0": :"m" (*(char *) addr)) Signed-off-by: Zachary Amsden Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/cpu/common.c | 12 ++++++------ arch/i386/kernel/cpu/cpufreq/longhaul.c | 12 +++--------- arch/i386/kernel/cpu/cyrix.c | 6 +----- arch/i386/kernel/efi.c | 4 ++-- arch/i386/kernel/machine_kexec.c | 8 +------- arch/i386/kernel/process.c | 16 ++++++---------- arch/i386/kernel/smp.c | 2 +- arch/i386/mm/fault.c | 6 +++--- arch/i386/mm/pageattr.c | 2 +- arch/i386/power/cpu.c | 16 ++++++++-------- 10 files changed, 32 insertions(+), 52 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/cpu/common.c b/arch/i386/kernel/cpu/common.c index 4553ffd..361f2e7 100644 --- a/arch/i386/kernel/cpu/common.c +++ b/arch/i386/kernel/cpu/common.c @@ -642,12 +642,12 @@ void __devinit cpu_init(void) asm volatile ("xorl %eax, %eax; movl %eax, %fs; movl %eax, %gs"); /* Clear all 6 debug registers: */ - -#define CD(register) set_debugreg(0, register) - - CD(0); CD(1); CD(2); CD(3); /* no db4 and db5 */; CD(6); CD(7); - -#undef CD + set_debugreg(0, 0); + set_debugreg(0, 1); + set_debugreg(0, 2); + set_debugreg(0, 3); + set_debugreg(0, 6); + set_debugreg(0, 7); /* * Force FPU initialization: diff --git a/arch/i386/kernel/cpu/cpufreq/longhaul.c b/arch/i386/kernel/cpu/cpufreq/longhaul.c index 04e3563d..bf02b50 100644 --- a/arch/i386/kernel/cpu/cpufreq/longhaul.c +++ b/arch/i386/kernel/cpu/cpufreq/longhaul.c @@ -64,8 +64,6 @@ static int dont_scale_voltage; #define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, "longhaul", msg) -#define __hlt() __asm__ __volatile__("hlt": : :"memory") - /* Clock ratios multiplied by 10 */ static int clock_ratio[32]; static int eblcr_table[32]; @@ -168,11 +166,9 @@ static void do_powersaver(union msr_longhaul *longhaul, outb(0xFE,0x21); /* TMR0 only */ outb(0xFF,0x80); /* delay */ - local_irq_enable(); - - __hlt(); + safe_halt(); wrmsrl(MSR_VIA_LONGHAUL, longhaul->val); - __hlt(); + halt(); local_irq_disable(); @@ -251,9 +247,7 @@ static void longhaul_setstate(unsigned int clock_ratio_index) bcr2.bits.CLOCKMUL = clock_ratio_index; local_irq_disable(); wrmsrl (MSR_VIA_BCR2, bcr2.val); - local_irq_enable(); - - __hlt(); + safe_halt(); /* Disable software clock multiplier */ rdmsrl (MSR_VIA_BCR2, bcr2.val); diff --git a/arch/i386/kernel/cpu/cyrix.c b/arch/i386/kernel/cpu/cyrix.c index ba4b011..ff87cc2 100644 --- a/arch/i386/kernel/cpu/cyrix.c +++ b/arch/i386/kernel/cpu/cyrix.c @@ -132,11 +132,7 @@ static void __init set_cx86_memwb(void) setCx86(CX86_CCR2, getCx86(CX86_CCR2) & ~0x04); /* set 'Not Write-through' */ cr0 = 0x20000000; - __asm__("movl %%cr0,%%eax\n\t" - "orl %0,%%eax\n\t" - "movl %%eax,%%cr0\n" - : : "r" (cr0) - :"ax"); + write_cr0(read_cr0() | cr0); /* CCR2 bit 2: lock NW bit and set WT1 */ setCx86(CX86_CCR2, getCx86(CX86_CCR2) | 0x14 ); } diff --git a/arch/i386/kernel/efi.c b/arch/i386/kernel/efi.c index 850648a..921fdb1 100644 --- a/arch/i386/kernel/efi.c +++ b/arch/i386/kernel/efi.c @@ -79,7 +79,7 @@ static void efi_call_phys_prelog(void) * directory. If I have PSE, I just need to duplicate one entry in * page directory. */ - __asm__ __volatile__("movl %%cr4, %0":"=r"(cr4)); + cr4 = read_cr4(); if (cr4 & X86_CR4_PSE) { efi_bak_pg_dir_pointer[0].pgd = @@ -115,7 +115,7 @@ static void efi_call_phys_epilog(void) cpu_gdt_descr[0].address = (unsigned long) __va(cpu_gdt_descr[0].address); __asm__ __volatile__("lgdt %0":"=m"(cpu_gdt_descr)); - __asm__ __volatile__("movl %%cr4, %0":"=r"(cr4)); + cr4 = read_cr4(); if (cr4 & X86_CR4_PSE) { swapper_pg_dir[pgd_index(0)].pgd = diff --git a/arch/i386/kernel/machine_kexec.c b/arch/i386/kernel/machine_kexec.c index cb699a2..f19f6d3 100644 --- a/arch/i386/kernel/machine_kexec.c +++ b/arch/i386/kernel/machine_kexec.c @@ -17,13 +17,7 @@ #include #include #include - -static inline unsigned long read_cr3(void) -{ - unsigned long cr3; - asm volatile("movl %%cr3,%0": "=r"(cr3)); - return cr3; -} +#include #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE))) diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c index e3f362e..761d4ed 100644 --- a/arch/i386/kernel/process.c +++ b/arch/i386/kernel/process.c @@ -313,16 +313,12 @@ void show_regs(struct pt_regs * regs) printk(" DS: %04x ES: %04x\n", 0xffff & regs->xds,0xffff & regs->xes); - __asm__("movl %%cr0, %0": "=r" (cr0)); - __asm__("movl %%cr2, %0": "=r" (cr2)); - __asm__("movl %%cr3, %0": "=r" (cr3)); - /* This could fault if %cr4 does not exist */ - __asm__("1: movl %%cr4, %0 \n" - "2: \n" - ".section __ex_table,\"a\" \n" - ".long 1b,2b \n" - ".previous \n" - : "=r" (cr4): "0" (0)); + cr0 = read_cr0(); + cr2 = read_cr2(); + cr3 = read_cr3(); + if (current_cpu_data.x86 > 4) { + cr4 = read_cr4(); + } printk("CR0: %08lx CR2: %08lx CR3: %08lx CR4: %08lx\n", cr0, cr2, cr3, cr4); show_trace(NULL, ®s->esp); } diff --git a/arch/i386/kernel/smp.c b/arch/i386/kernel/smp.c index cec4bde..48b55db 100644 --- a/arch/i386/kernel/smp.c +++ b/arch/i386/kernel/smp.c @@ -576,7 +576,7 @@ static void stop_this_cpu (void * dummy) local_irq_disable(); disable_local_APIC(); if (cpu_data[smp_processor_id()].hlt_works_ok) - for(;;) __asm__("hlt"); + for(;;) halt(); for (;;); } diff --git a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c index 61d9e34..411b850 100644 --- a/arch/i386/mm/fault.c +++ b/arch/i386/mm/fault.c @@ -233,7 +233,7 @@ fastcall void do_page_fault(struct pt_regs *regs, unsigned long error_code) int write, si_code; /* get the address */ - __asm__("movl %%cr2,%0":"=r" (address)); + address = read_cr2(); if (notify_die(DIE_PAGE_FAULT, "page fault", regs, error_code, 14, SIGSEGV) == NOTIFY_STOP) @@ -453,7 +453,7 @@ no_context: printk(" at virtual address %08lx\n",address); printk(KERN_ALERT " printing eip:\n"); printk("%08lx\n", regs->eip); - asm("movl %%cr3,%0":"=r" (page)); + page = read_cr3(); page = ((unsigned long *) __va(page))[address >> 22]; printk(KERN_ALERT "*pde = %08lx\n", page); /* @@ -526,7 +526,7 @@ vmalloc_fault: pmd_t *pmd, *pmd_k; pte_t *pte_k; - asm("movl %%cr3,%0":"=r" (pgd_paddr)); + pgd_paddr = read_cr3(); pgd = index + (pgd_t *)__va(pgd_paddr); pgd_k = init_mm.pgd + index; diff --git a/arch/i386/mm/pageattr.c b/arch/i386/mm/pageattr.c index cb3da6b..bce06a7 100644 --- a/arch/i386/mm/pageattr.c +++ b/arch/i386/mm/pageattr.c @@ -62,7 +62,7 @@ static void flush_kernel_map(void *dummy) { /* Could use CLFLUSH here if the CPU supports it (Hammer,P4) */ if (boot_cpu_data.x86_model >= 4) - asm volatile("wbinvd":::"memory"); + wbinvd(); /* Flush all to work around Errata in early athlons regarding * large page flushing. */ diff --git a/arch/i386/power/cpu.c b/arch/i386/power/cpu.c index c547c1a..4e19c43 100644 --- a/arch/i386/power/cpu.c +++ b/arch/i386/power/cpu.c @@ -57,10 +57,10 @@ void __save_processor_state(struct saved_context *ctxt) /* * control registers */ - asm volatile ("movl %%cr0, %0" : "=r" (ctxt->cr0)); - asm volatile ("movl %%cr2, %0" : "=r" (ctxt->cr2)); - asm volatile ("movl %%cr3, %0" : "=r" (ctxt->cr3)); - asm volatile ("movl %%cr4, %0" : "=r" (ctxt->cr4)); + ctxt->cr0 = read_cr0(); + ctxt->cr2 = read_cr2(); + ctxt->cr3 = read_cr3(); + ctxt->cr4 = read_cr4(); } void save_processor_state(void) @@ -109,10 +109,10 @@ void __restore_processor_state(struct saved_context *ctxt) /* * control registers */ - asm volatile ("movl %0, %%cr4" :: "r" (ctxt->cr4)); - asm volatile ("movl %0, %%cr3" :: "r" (ctxt->cr3)); - asm volatile ("movl %0, %%cr2" :: "r" (ctxt->cr2)); - asm volatile ("movl %0, %%cr0" :: "r" (ctxt->cr0)); + write_cr4(ctxt->cr4); + write_cr3(ctxt->cr3); + write_cr2(ctxt->cr2); + write_cr2(ctxt->cr0); /* * now restore the descriptor tables to their proper values -- cgit v1.1 From 245067d1674d451855692fcd4647daf9fd47f82d Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:37 -0700 Subject: [PATCH] i386: cleanup serialize msr i386 arch cleanup. Introduce the serialize macro to serialize processor state. Why the microcode update needs it I am not quite sure, since wrmsr() is already a serializing instruction, but it is a microcode update, so I will keep the semantic the same, since this could be a timing workaround. As far as I can tell, this has always been there since the original microcode update source. Signed-off-by: Zachary Amsden Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/microcode.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/microcode.c b/arch/i386/kernel/microcode.c index a77c612..165f131 100644 --- a/arch/i386/kernel/microcode.c +++ b/arch/i386/kernel/microcode.c @@ -164,7 +164,8 @@ static void collect_cpu_info (void *unused) } wrmsr(MSR_IA32_UCODE_REV, 0, 0); - __asm__ __volatile__ ("cpuid" : : : "ax", "bx", "cx", "dx"); + /* see notes above for revision 1.07. Apparent chip bug */ + serialize_cpu(); /* get the current revision from MSR 0x8B */ rdmsr(MSR_IA32_UCODE_REV, val[0], uci->rev); pr_debug("microcode: collect_cpu_info : sig=0x%x, pf=0x%x, rev=0x%x\n", @@ -377,7 +378,9 @@ static void do_update_one (void * unused) (unsigned long) uci->mc->bits >> 16 >> 16); wrmsr(MSR_IA32_UCODE_REV, 0, 0); - __asm__ __volatile__ ("cpuid" : : : "ax", "bx", "cx", "dx"); + /* see notes above for revision 1.07. Apparent chip bug */ + serialize_cpu(); + /* get the current revision from MSR 0x8B */ rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]); -- cgit v1.1 From 4d37e7e3fd851428dede4d05d3e69d03795a744a Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:38 -0700 Subject: [PATCH] i386: inline assembler: cleanup and encapsulate descriptor and task register management i386 inline assembler cleanup. This change encapsulates descriptor and task register management. Also, it is possible to improve assembler generation in two cases; savesegment may store the value in a register instead of a memory location, which allows GCC to optimize stack variables into registers, and MOV MEM, SEG is always a 16-bit write to memory, making the casting in math-emu unnecessary. Signed-off-by: Zachary Amsden Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/cpu/common.c | 4 ++-- arch/i386/kernel/doublefault.c | 2 +- arch/i386/kernel/efi.c | 5 ++--- arch/i386/kernel/reboot.c | 9 +++++---- arch/i386/kernel/signal.c | 4 ++-- arch/i386/kernel/traps.c | 2 +- arch/i386/kernel/vm86.c | 4 ++-- arch/i386/math-emu/get_address.c | 13 +++---------- arch/i386/power/cpu.c | 26 +++++++++++++------------- 9 files changed, 31 insertions(+), 38 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/cpu/common.c b/arch/i386/kernel/cpu/common.c index 361f2e7..46ce9b2 100644 --- a/arch/i386/kernel/cpu/common.c +++ b/arch/i386/kernel/cpu/common.c @@ -613,8 +613,8 @@ void __devinit cpu_init(void) memcpy(thread->tls_array, &per_cpu(cpu_gdt_table, cpu), GDT_ENTRY_TLS_ENTRIES * 8); - __asm__ __volatile__("lgdt %0" : : "m" (cpu_gdt_descr[cpu])); - __asm__ __volatile__("lidt %0" : : "m" (idt_descr)); + load_gdt(&cpu_gdt_descr[cpu]); + load_idt(&idt_descr); /* * Delete NT diff --git a/arch/i386/kernel/doublefault.c b/arch/i386/kernel/doublefault.c index 789af3e..5edb1d3 100644 --- a/arch/i386/kernel/doublefault.c +++ b/arch/i386/kernel/doublefault.c @@ -20,7 +20,7 @@ static void doublefault_fn(void) struct Xgt_desc_struct gdt_desc = {0, 0}; unsigned long gdt, tss; - __asm__ __volatile__("sgdt %0": "=m" (gdt_desc): :"memory"); + store_gdt(&gdt_desc); gdt = gdt_desc.address; printk("double fault, gdt at %08lx [%d bytes]\n", gdt, gdt_desc.size); diff --git a/arch/i386/kernel/efi.c b/arch/i386/kernel/efi.c index 921fdb1..ecad519 100644 --- a/arch/i386/kernel/efi.c +++ b/arch/i386/kernel/efi.c @@ -104,8 +104,7 @@ static void efi_call_phys_prelog(void) local_flush_tlb(); cpu_gdt_descr[0].address = __pa(cpu_gdt_descr[0].address); - __asm__ __volatile__("lgdt %0":"=m" - (*(struct Xgt_desc_struct *) __pa(&cpu_gdt_descr[0]))); + load_gdt((struct Xgt_desc_struct *) __pa(&cpu_gdt_descr[0])); } static void efi_call_phys_epilog(void) @@ -114,7 +113,7 @@ static void efi_call_phys_epilog(void) cpu_gdt_descr[0].address = (unsigned long) __va(cpu_gdt_descr[0].address); - __asm__ __volatile__("lgdt %0":"=m"(cpu_gdt_descr)); + load_gdt(&cpu_gdt_descr[0]); cr4 = read_cr4(); if (cr4 & X86_CR4_PSE) { diff --git a/arch/i386/kernel/reboot.c b/arch/i386/kernel/reboot.c index c71fef3..1cbb9c0 100644 --- a/arch/i386/kernel/reboot.c +++ b/arch/i386/kernel/reboot.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "mach_reboot.h" #include @@ -242,13 +243,13 @@ void machine_real_restart(unsigned char *code, int length) /* Set up the IDT for real mode. */ - __asm__ __volatile__ ("lidt %0" : : "m" (real_mode_idt)); + load_idt(&real_mode_idt); /* Set up a GDT from which we can load segment descriptors for real mode. The GDT is not used in real mode; it is just needed here to prepare the descriptors. */ - __asm__ __volatile__ ("lgdt %0" : : "m" (real_mode_gdt)); + load_gdt(&real_mode_gdt); /* Load the data segment registers, and thus the descriptors ready for real mode. The base address of each segment is 0x100, 16 times the @@ -316,7 +317,7 @@ void machine_emergency_restart(void) if (!reboot_thru_bios) { if (efi_enabled) { efi.reset_system(EFI_RESET_COLD, EFI_SUCCESS, 0, NULL); - __asm__ __volatile__("lidt %0": :"m" (no_idt)); + load_idt(&no_idt); __asm__ __volatile__("int3"); } /* rebooting needs to touch the page at absolute addr 0 */ @@ -325,7 +326,7 @@ void machine_emergency_restart(void) mach_reboot_fixups(); /* for board specific fixups */ mach_reboot(); /* That didn't work - force a triple fault.. */ - __asm__ __volatile__("lidt %0": :"m" (no_idt)); + load_idt(&no_idt); __asm__ __volatile__("int3"); } } diff --git a/arch/i386/kernel/signal.c b/arch/i386/kernel/signal.c index 140e340..7bcda6d 100644 --- a/arch/i386/kernel/signal.c +++ b/arch/i386/kernel/signal.c @@ -278,9 +278,9 @@ setup_sigcontext(struct sigcontext __user *sc, struct _fpstate __user *fpstate, int tmp, err = 0; tmp = 0; - __asm__("movl %%gs,%0" : "=r"(tmp): "0"(tmp)); + savesegment(gs, tmp); err |= __put_user(tmp, (unsigned int __user *)&sc->gs); - __asm__("movl %%fs,%0" : "=r"(tmp): "0"(tmp)); + savesegment(fs, tmp); err |= __put_user(tmp, (unsigned int __user *)&sc->fs); err |= __put_user(regs->xes, (unsigned int __user *)&sc->es); diff --git a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c index cd2d5d5..1144df0 100644 --- a/arch/i386/kernel/traps.c +++ b/arch/i386/kernel/traps.c @@ -1008,7 +1008,7 @@ void __init trap_init_f00f_bug(void) * it uses the read-only mapped virtual address. */ idt_descr.address = fix_to_virt(FIX_F00F_IDT); - __asm__ __volatile__("lidt %0" : : "m" (idt_descr)); + load_idt(&idt_descr); } #endif diff --git a/arch/i386/kernel/vm86.c b/arch/i386/kernel/vm86.c index 2daa06f..16b4850 100644 --- a/arch/i386/kernel/vm86.c +++ b/arch/i386/kernel/vm86.c @@ -294,8 +294,8 @@ static void do_sys_vm86(struct kernel_vm86_struct *info, struct task_struct *tsk */ info->regs32->eax = 0; tsk->thread.saved_esp0 = tsk->thread.esp0; - asm volatile("mov %%fs,%0":"=m" (tsk->thread.saved_fs)); - asm volatile("mov %%gs,%0":"=m" (tsk->thread.saved_gs)); + savesegment(fs, tsk->thread.saved_fs); + savesegment(gs, tsk->thread.saved_gs); tss = &per_cpu(init_tss, get_cpu()); tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0; diff --git a/arch/i386/math-emu/get_address.c b/arch/i386/math-emu/get_address.c index 9117573..9819b70 100644 --- a/arch/i386/math-emu/get_address.c +++ b/arch/i386/math-emu/get_address.c @@ -155,7 +155,6 @@ static long pm_address(u_char FPU_modrm, u_char segment, { struct desc_struct descriptor; unsigned long base_address, limit, address, seg_top; - unsigned short selector; segment--; @@ -173,17 +172,11 @@ static long pm_address(u_char FPU_modrm, u_char segment, /* fs and gs aren't used by the kernel, so they still have their user-space values. */ case PREFIX_FS_-1: - /* The cast is needed here to get gcc 2.8.0 to use a 16 bit register - in the assembler statement. */ - - __asm__("mov %%fs,%0":"=r" (selector)); - addr->selector = selector; + /* N.B. - movl %seg, mem is a 2 byte write regardless of prefix */ + savesegment(fs, addr->selector); break; case PREFIX_GS_-1: - /* The cast is needed here to get gcc 2.8.0 to use a 16 bit register - in the assembler statement. */ - __asm__("mov %%gs,%0":"=r" (selector)); - addr->selector = selector; + savesegment(gs, addr->selector); break; default: addr->selector = PM_REG_(segment); diff --git a/arch/i386/power/cpu.c b/arch/i386/power/cpu.c index 4e19c43..c7a6436 100644 --- a/arch/i386/power/cpu.c +++ b/arch/i386/power/cpu.c @@ -42,17 +42,17 @@ void __save_processor_state(struct saved_context *ctxt) /* * descriptor tables */ - asm volatile ("sgdt %0" : "=m" (ctxt->gdt_limit)); - asm volatile ("sidt %0" : "=m" (ctxt->idt_limit)); - asm volatile ("str %0" : "=m" (ctxt->tr)); + store_gdt(&ctxt->gdt_limit); + store_idt(&ctxt->idt_limit); + store_tr(ctxt->tr); /* * segment registers */ - asm volatile ("movw %%es, %0" : "=m" (ctxt->es)); - asm volatile ("movw %%fs, %0" : "=m" (ctxt->fs)); - asm volatile ("movw %%gs, %0" : "=m" (ctxt->gs)); - asm volatile ("movw %%ss, %0" : "=m" (ctxt->ss)); + savesegment(es, ctxt->es); + savesegment(fs, ctxt->fs); + savesegment(gs, ctxt->gs); + savesegment(ss, ctxt->ss); /* * control registers @@ -118,16 +118,16 @@ void __restore_processor_state(struct saved_context *ctxt) * now restore the descriptor tables to their proper values * ltr is done i fix_processor_context(). */ - asm volatile ("lgdt %0" :: "m" (ctxt->gdt_limit)); - asm volatile ("lidt %0" :: "m" (ctxt->idt_limit)); + load_gdt(&ctxt->gdt_limit); + load_idt(&ctxt->idt_limit); /* * segment registers */ - asm volatile ("movw %0, %%es" :: "r" (ctxt->es)); - asm volatile ("movw %0, %%fs" :: "r" (ctxt->fs)); - asm volatile ("movw %0, %%gs" :: "r" (ctxt->gs)); - asm volatile ("movw %0, %%ss" :: "r" (ctxt->ss)); + loadsegment(es, ctxt->es); + loadsegment(fs, ctxt->fs); + loadsegment(gs, ctxt->gs); + loadsegment(ss, ctxt->ss); /* * sysenter MSRs -- cgit v1.1 From e7a2ff593c0e48b130434dee4d2fd3452a850e6f Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:39 -0700 Subject: [PATCH] i386: load_tls() fix Subtle fix: load_TLS has been moved after saving %fs and %gs segments to avoid creating non-reversible segments. This could conceivably cause a bug if the kernel ever needed to save and restore fs/gs from the NMI handler. It currently does not, but this is the safest approach to avoiding fs/gs corruption. SMIs are safe, since SMI saves the descriptor hidden state. Signed-off-by: Zachary Amsden Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/process.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c index 761d4ed..9d94995 100644 --- a/arch/i386/kernel/process.c +++ b/arch/i386/kernel/process.c @@ -678,21 +678,26 @@ struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct tas __unlazy_fpu(prev_p); /* - * Reload esp0, LDT and the page table pointer: + * Reload esp0. */ load_esp0(tss, next); /* - * Load the per-thread Thread-Local Storage descriptor. + * Save away %fs and %gs. No need to save %es and %ds, as + * those are always kernel segments while inside the kernel. + * Doing this before setting the new TLS descriptors avoids + * the situation where we temporarily have non-reloadable + * segments in %fs and %gs. This could be an issue if the + * NMI handler ever used %fs or %gs (it does not today), or + * if the kernel is running inside of a hypervisor layer. */ - load_TLS(next, cpu); + savesegment(fs, prev->fs); + savesegment(gs, prev->gs); /* - * Save away %fs and %gs. No need to save %es and %ds, as - * those are always kernel segments while inside the kernel. + * Load the per-thread Thread-Local Storage descriptor. */ - asm volatile("mov %%fs,%0":"=m" (prev->fs)); - asm volatile("mov %%gs,%0":"=m" (prev->gs)); + load_TLS(next, cpu); /* * Restore %fs and %gs if needed. -- cgit v1.1 From c9b02a24130e3ff14a553d966a79f46cf806b037 Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:40 -0700 Subject: [PATCH] i386: use set_pte macros in a couple places where they were missing Also, setting PDPEs in PAE mode does not require atomic operations, since the PDPEs are cached by the processor, and only reloaded on an explicit or implicit reload of CR3. Since the four PDPEs must always be present in an active root, and the kernel PDPE is never updated, we are safe even from SMIs and interrupts / NMIs using task gates (which reload CR3). Actually, much of this is moot, since the user PDPEs are never updated either, and the only usage of task gates is by the doublefault handler. It appears the only place PGDs get updated in PAE mode is in init_low_mappings() / zap_low_mapping() for initial page table creation and recovery from ACPI sleep state, and these sites are safe by inspection. Getting rid of the cmpxchg8b saves code space and 720 cycles in pgd_alloc on P4. Signed-off-by: Zachary Amsden Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/mm/init.c | 2 +- arch/i386/mm/pageattr.c | 5 +++-- 2 files changed, 4 insertions(+), 3 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/mm/init.c b/arch/i386/mm/init.c index d8b23ab..9edfc05 100644 --- a/arch/i386/mm/init.c +++ b/arch/i386/mm/init.c @@ -349,7 +349,7 @@ static void __init pagetable_init (void) * All user-space mappings are explicitly cleared after * SMP startup. */ - pgd_base[0] = pgd_base[USER_PTRS_PER_PGD]; + set_pgd(&pgd_base[0], pgd_base[USER_PTRS_PER_PGD]); #endif } diff --git a/arch/i386/mm/pageattr.c b/arch/i386/mm/pageattr.c index bce06a7..f600fc2 100644 --- a/arch/i386/mm/pageattr.c +++ b/arch/i386/mm/pageattr.c @@ -12,6 +12,7 @@ #include #include #include +#include static DEFINE_SPINLOCK(cpa_lock); static struct list_head df_list = LIST_HEAD_INIT(df_list); @@ -52,8 +53,8 @@ static struct page *split_large_page(unsigned long address, pgprot_t prot) addr = address & LARGE_PAGE_MASK; pbase = (pte_t *)page_address(base); for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) { - pbase[i] = pfn_pte(addr >> PAGE_SHIFT, - addr == address ? prot : PAGE_KERNEL); + set_pte(&pbase[i], pfn_pte(addr >> PAGE_SHIFT, + addr == address ? prot : PAGE_KERNEL)); } return base; } -- cgit v1.1 From f2ab4461249df85b20930a7a57b54f39c5ae291a Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:42 -0700 Subject: [PATCH] x86: more asm cleanups Some more assembler cleanups I noticed along the way. Signed-off-by: Zachary Amsden Cc: "H. Peter Anvin" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/cpu/intel.c | 9 +++------ arch/i386/kernel/crash.c | 2 +- arch/i386/kernel/machine_kexec.c | 10 ++-------- arch/i386/kernel/msr.c | 31 ++++++------------------------- arch/i386/kernel/process.c | 2 +- arch/i386/mach-voyager/voyager_basic.c | 14 ++++++-------- arch/i386/mach-voyager/voyager_smp.c | 2 +- 7 files changed, 20 insertions(+), 50 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/cpu/intel.c b/arch/i386/kernel/cpu/intel.c index a2c33c1..43601de 100644 --- a/arch/i386/kernel/cpu/intel.c +++ b/arch/i386/kernel/cpu/intel.c @@ -82,16 +82,13 @@ static void __devinit Intel_errata_workarounds(struct cpuinfo_x86 *c) */ static int __devinit num_cpu_cores(struct cpuinfo_x86 *c) { - unsigned int eax; + unsigned int eax, ebx, ecx, edx; if (c->cpuid_level < 4) return 1; - __asm__("cpuid" - : "=a" (eax) - : "0" (4), "c" (0) - : "bx", "dx"); - + /* Intel has a non-standard dependency on %ecx for this CPUID level. */ + cpuid_count(4, 0, &eax, &ebx, &ecx, &edx); if (eax & 0x1f) return ((eax >> 26) + 1); else diff --git a/arch/i386/kernel/crash.c b/arch/i386/kernel/crash.c index e5fab12..913be77 100644 --- a/arch/i386/kernel/crash.c +++ b/arch/i386/kernel/crash.c @@ -153,7 +153,7 @@ static int crash_nmi_callback(struct pt_regs *regs, int cpu) disable_local_APIC(); atomic_dec(&waiting_for_crash_ipi); /* Assume hlt works */ - __asm__("hlt"); + halt(); for(;;); return 1; diff --git a/arch/i386/kernel/machine_kexec.c b/arch/i386/kernel/machine_kexec.c index f19f6d3..a912fed 100644 --- a/arch/i386/kernel/machine_kexec.c +++ b/arch/i386/kernel/machine_kexec.c @@ -93,10 +93,7 @@ static void set_idt(void *newidt, __u16 limit) curidt.size = limit; curidt.address = (unsigned long)newidt; - __asm__ __volatile__ ( - "lidtl %0\n" - : : "m" (curidt) - ); + load_idt(&curidt); }; @@ -108,10 +105,7 @@ static void set_gdt(void *newgdt, __u16 limit) curgdt.size = limit; curgdt.address = (unsigned long)newgdt; - __asm__ __volatile__ ( - "lgdtl %0\n" - : : "m" (curgdt) - ); + load_gdt(&curgdt); }; static void load_segments(void) diff --git a/arch/i386/kernel/msr.c b/arch/i386/kernel/msr.c index b2f03c3..03100d6 100644 --- a/arch/i386/kernel/msr.c +++ b/arch/i386/kernel/msr.c @@ -46,23 +46,13 @@ static struct class *msr_class; -/* Note: "err" is handled in a funny way below. Otherwise one version - of gcc or another breaks. */ - static inline int wrmsr_eio(u32 reg, u32 eax, u32 edx) { int err; - asm volatile ("1: wrmsr\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl %4,%0\n" - " jmp 2b\n" - ".previous\n" - ".section __ex_table,\"a\"\n" - " .align 4\n" " .long 1b,3b\n" ".previous":"=&bDS" (err) - :"a"(eax), "d"(edx), "c"(reg), "i"(-EIO), "0"(0)); - + err = wrmsr_safe(reg, eax, edx); + if (err) + err = -EIO; return err; } @@ -70,18 +60,9 @@ static inline int rdmsr_eio(u32 reg, u32 *eax, u32 *edx) { int err; - asm volatile ("1: rdmsr\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl %4,%0\n" - " jmp 2b\n" - ".previous\n" - ".section __ex_table,\"a\"\n" - " .align 4\n" - " .long 1b,3b\n" - ".previous":"=&bDS" (err), "=a"(*eax), "=d"(*edx) - :"c"(reg), "i"(-EIO), "0"(0)); - + err = rdmsr_safe(reg, eax, edx); + if (err) + err = -EIO; return err; } diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c index 9d94995..6609978 100644 --- a/arch/i386/kernel/process.c +++ b/arch/i386/kernel/process.c @@ -164,7 +164,7 @@ static inline void play_dead(void) */ local_irq_disable(); while (1) - __asm__ __volatile__("hlt":::"memory"); + halt(); } #else static inline void play_dead(void) diff --git a/arch/i386/mach-voyager/voyager_basic.c b/arch/i386/mach-voyager/voyager_basic.c index c638406..cc69875 100644 --- a/arch/i386/mach-voyager/voyager_basic.c +++ b/arch/i386/mach-voyager/voyager_basic.c @@ -234,10 +234,9 @@ voyager_power_off(void) #endif } /* and wait for it to happen */ - for(;;) { - __asm("cli"); - __asm("hlt"); - } + local_irq_disable(); + for(;;) + halt(); } /* copied from process.c */ @@ -278,10 +277,9 @@ machine_restart(char *cmd) outb(basebd | 0x08, VOYAGER_MC_SETUP); outb(0x02, catbase + 0x21); } - for(;;) { - asm("cli"); - asm("hlt"); - } + local_irq_disable(); + for(;;) + halt(); } void diff --git a/arch/i386/mach-voyager/voyager_smp.c b/arch/i386/mach-voyager/voyager_smp.c index 0e1f420..16790b7 100644 --- a/arch/i386/mach-voyager/voyager_smp.c +++ b/arch/i386/mach-voyager/voyager_smp.c @@ -1015,7 +1015,7 @@ smp_stop_cpu_function(void *dummy) cpu_clear(smp_processor_id(), cpu_online_map); local_irq_disable(); for(;;) - __asm__("hlt"); + halt(); } static DEFINE_SPINLOCK(call_lock); -- cgit v1.1 From 0998e4228aca046fbd747c3fed909791d52e88eb Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:43 -0700 Subject: [PATCH] x86: privilege cleanup Privilege checking cleanup. Originally, these diffs were much greater, but recent cleanups in Linux have already done much of the cleanup. I added some explanatory comments in places where the reasoning behind certain tests is rather subtle. Also, in traps.c, we can skip the user_mode check in handle_BUG(). The reason is, there are only two call chains - one via die_if_kernel() and one via do_page_fault(), both entering from die(). Both of these paths already ensure that a kernel mode failure has happened. Also, the original check here, if (user_mode(regs)) was insufficient anyways, since it would not rule out BUG faults from V8086 mode execution. Saving the %ss segment in show_regs() rather than assuming a fixed value also gives better information about the current kernel state in the register dump. Signed-off-by: Zachary Amsden Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/signal.c | 4 +++- arch/i386/kernel/traps.c | 5 +---- 2 files changed, 4 insertions(+), 5 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/signal.c b/arch/i386/kernel/signal.c index 7bcda6d..61eb0c8 100644 --- a/arch/i386/kernel/signal.c +++ b/arch/i386/kernel/signal.c @@ -604,7 +604,9 @@ int fastcall do_signal(struct pt_regs *regs, sigset_t *oldset) * We want the common case to go fast, which * is why we may in certain cases get here from * kernel mode. Just return without doing anything - * if so. + * if so. vm86 regs switched out by assembly code + * before reaching here, so testing against kernel + * CS suffices. */ if (!user_mode(regs)) return 1; diff --git a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c index 1144df0..b2b4bb8 100644 --- a/arch/i386/kernel/traps.c +++ b/arch/i386/kernel/traps.c @@ -210,7 +210,7 @@ void show_registers(struct pt_regs *regs) unsigned short ss; esp = (unsigned long) (®s->esp); - ss = __KERNEL_DS; + savesegment(ss, ss); if (user_mode(regs)) { in_kernel = 0; esp = regs->esp; @@ -267,9 +267,6 @@ static void handle_BUG(struct pt_regs *regs) char c; unsigned long eip; - if (user_mode(regs)) - goto no_bug; /* Not in kernel */ - eip = regs->eip; if (eip < PAGE_OFFSET) -- cgit v1.1 From a5201129307f414890f9a4410e38da205f5d7359 Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:44 -0700 Subject: [PATCH] x86: make IOPL explicit The pushf/popf in switch_to are ONLY used to switch IOPL. Making this explicit in C code is more clear. This pushf/popf pair was added as a bugfix for leaking IOPL to unprivileged processes when using sysenter/sysexit based system calls (sysexit does not restore flags). When requesting an IOPL change in sys_iopl(), it is just as easy to change the current flags and the flags in the stack image (in case an IRET is required), but there is no reason to force an IRET if we came in from the SYSENTER path. This change is the minimal solution for supporting a paravirtualized Linux kernel that allows user processes to run with I/O privilege. Other solutions require radical rewrites of part of the low level fault / system call handling code, or do not fully support sysenter based system calls. Unfortunately, this added one field to the thread_struct. But as a bonus, on P4, the fastest time measured for switch_to() went from 312 to 260 cycles, a win of about 17% in the fast case through this performance critical path. Signed-off-by: Zachary Amsden Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/ioport.c | 7 ++++--- arch/i386/kernel/process.c | 6 ++++++ 2 files changed, 10 insertions(+), 3 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/ioport.c b/arch/i386/kernel/ioport.c index 8b25160..f2b3765 100644 --- a/arch/i386/kernel/ioport.c +++ b/arch/i386/kernel/ioport.c @@ -132,6 +132,7 @@ asmlinkage long sys_iopl(unsigned long unused) volatile struct pt_regs * regs = (struct pt_regs *) &unused; unsigned int level = regs->ebx; unsigned int old = (regs->eflags >> 12) & 3; + struct thread_struct *t = ¤t->thread; if (level > 3) return -EINVAL; @@ -140,8 +141,8 @@ asmlinkage long sys_iopl(unsigned long unused) if (!capable(CAP_SYS_RAWIO)) return -EPERM; } - regs->eflags = (regs->eflags &~ 0x3000UL) | (level << 12); - /* Make sure we return the long way (not sysenter) */ - set_thread_flag(TIF_IRET); + t->iopl = level << 12; + regs->eflags = (regs->eflags & ~X86_EFLAGS_IOPL) | t->iopl; + set_iopl_mask(t->iopl); return 0; } diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c index 6609978..b45cbf9 100644 --- a/arch/i386/kernel/process.c +++ b/arch/i386/kernel/process.c @@ -712,6 +712,12 @@ struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct tas loadsegment(gs, next->gs); /* + * Restore IOPL if needed. + */ + if (unlikely(prev->iopl != next->iopl)) + set_iopl_mask(next->iopl); + + /* * Now maybe reload the debug registers */ if (unlikely(next->debugreg[7])) { -- cgit v1.1 From e9f86e351fda5b3c40192fc3990453613f160779 Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:45 -0700 Subject: [PATCH] x86: remove redundant TSS clearing When reviewing GDT updates, I found the code: set_tss_desc(cpu,t); /* This just modifies memory; ... */ per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS].b &= 0xfffffdff; This second line is unnecessary, since set_tss_desc() has already cleared the busy bit. Commented disassembly, line 1: c028b8bd: 8b 0c 86 mov (%esi,%eax,4),%ecx c028b8c0: 01 cb add %ecx,%ebx c028b8c2: 8d 0c 39 lea (%ecx,%edi,1),%ecx => %ecx = per_cpu(cpu_gdt_table, cpu) c028b8c5: 8d 91 80 00 00 00 lea 0x80(%ecx),%edx => %edx = &per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS] c028b8cb: 66 c7 42 00 73 20 movw $0x2073,0x0(%edx) c028b8d1: 66 89 5a 02 mov %bx,0x2(%edx) c028b8d5: c1 cb 10 ror $0x10,%ebx c028b8d8: 88 5a 04 mov %bl,0x4(%edx) c028b8db: c6 42 05 89 movb $0x89,0x5(%edx) => ((char *)%edx)[5] = 0x89 (equivalent) ((char *)per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS])[5] = 0x89 c028b8df: c6 42 06 00 movb $0x0,0x6(%edx) c028b8e3: 88 7a 07 mov %bh,0x7(%edx) c028b8e6: c1 cb 10 ror $0x10,%ebx => other bits Commented disassembly, line 2: c028b8e9: 8b 14 86 mov (%esi,%eax,4),%edx c028b8ec: 8d 04 3a lea (%edx,%edi,1),%eax => %eax = per_cpu(cpu_gdt_table, cpu) c028b8ef: 81 a0 84 00 00 00 ff andl $0xfffffdff,0x84(%eax) => per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS].b &= 0xfffffdff; (equivalent) ((char *)per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS])[5] &= 0xfd Note that (0x89 & ~0xfd) == 0; i.e, set_tss_desc(cpu,t) has already stored the type field in the GDT with the busy bit clear. Eliminating redundant and obscure code is always a good thing; in fact, I pointed out this same optimization many moons ago in arch/i386/setup.c, back when it used to be called that. Signed-off-by: Zachary Amsden Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/power/cpu.c | 1 - 1 file changed, 1 deletion(-) (limited to 'arch/i386') diff --git a/arch/i386/power/cpu.c b/arch/i386/power/cpu.c index c7a6436..7b0b9ad 100644 --- a/arch/i386/power/cpu.c +++ b/arch/i386/power/cpu.c @@ -84,7 +84,6 @@ static void fix_processor_context(void) struct tss_struct * t = &per_cpu(init_tss, cpu); set_tss_desc(cpu,t); /* This just modifies memory; should not be necessary. But... This is necessary, because 386 hardware has concept of busy TSS or some similar stupidity. */ - per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS].b &= 0xfffffdff; load_TR_desc(); /* This does ltr */ load_LDT(¤t->active_mm->context); /* This does lldt */ -- cgit v1.1 From f2f30ebca6c0c95e987cb9a1fd1495770a75432e Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:47 -0700 Subject: [PATCH] x86: introduce a write acessor for updating the current LDT Introduce a write acessor for updating the current LDT. This is required for hypervisors like Xen that do not allow LDT pages to be directly written. Testing - here's a fun little LDT test that can be trivially modified to test limits as well. /* * Copyright (c) 2005, Zachary Amsden (zach@vmware.com) * This is licensed under the GPL. */ #include #include #include #include #include #include #include #define __KERNEL__ #include void main(void) { struct user_desc desc; char *code; unsigned long long tsc; code = (char *)mmap(0, 8192, PROT_EXEC|PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); desc.entry_number = 0; desc.base_addr = code; desc.limit = 1; desc.seg_32bit = 1; desc.contents = MODIFY_LDT_CONTENTS_CODE; desc.read_exec_only = 0; desc.limit_in_pages = 1; desc.seg_not_present = 0; desc.useable = 1; if (modify_ldt(1, &desc, sizeof(desc)) != 0) { perror("modify_ldt"); } printf("code base is 0x%08x\n", (unsigned)code); code[0x0ffe] = 0x0f; /* rdtsc */ code[0x0fff] = 0x31; code[0x1000] = 0xcb; /* lret */ __asm__ __volatile("lcall $7,$0xffe" : "=A" (tsc)); printf("TSC is 0x%016llx\n", tsc); } Signed-off-by: Zachary Amsden Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/ldt.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/ldt.c b/arch/i386/kernel/ldt.c index bb50afb..fe1ffa5 100644 --- a/arch/i386/kernel/ldt.c +++ b/arch/i386/kernel/ldt.c @@ -177,7 +177,7 @@ static int read_default_ldt(void __user * ptr, unsigned long bytecount) static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode) { struct mm_struct * mm = current->mm; - __u32 entry_1, entry_2, *lp; + __u32 entry_1, entry_2; int error; struct user_desc ldt_info; @@ -205,8 +205,6 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode) goto out_unlock; } - lp = (__u32 *) ((ldt_info.entry_number << 3) + (char *) mm->context.ldt); - /* Allow LDTs to be cleared by the user. */ if (ldt_info.base_addr == 0 && ldt_info.limit == 0) { if (oldmode || LDT_empty(&ldt_info)) { @@ -223,8 +221,7 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode) /* Install the new entry ... */ install: - *lp = entry_1; - *(lp+1) = entry_2; + write_ldt_entry(mm->context.ldt, ldt_info.entry_number, entry_1, entry_2); error = 0; out_unlock: -- cgit v1.1 From 748f2edb52712aa3d926470a888608dc500d17e8 Mon Sep 17 00:00:00 2001 From: George Anzinger Date: Sat, 3 Sep 2005 15:56:48 -0700 Subject: [PATCH] x86 NMI: better support for debuggers This patch adds a notify to the die_nmi notify that the system is about to be taken down. If the notify is handled with a NOTIFY_STOP return, the system is given a new lease on life. We also change the nmi watchdog to carry on if die_nmi returns. This give debug code a chance to a) catch watchdog timeouts and b) possibly allow the system to continue, realizing that the time out may be due to debugger activities such as single stepping which is usually done with "other" cpus held. Signed-off-by: George Anzinger Cc: Keith Owens Signed-off-by: George Anzinger Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/nmi.c | 5 ++++- arch/i386/kernel/traps.c | 4 ++++ 2 files changed, 8 insertions(+), 1 deletion(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/nmi.c b/arch/i386/kernel/nmi.c index 8c242bb..8bbdbda 100644 --- a/arch/i386/kernel/nmi.c +++ b/arch/i386/kernel/nmi.c @@ -501,8 +501,11 @@ void nmi_watchdog_tick (struct pt_regs * regs) */ alert_counter[cpu]++; if (alert_counter[cpu] == 5*nmi_hz) + /* + * die_nmi will return ONLY if NOTIFY_STOP happens.. + */ die_nmi(regs, "NMI Watchdog detected LOCKUP"); - } else { + last_irq_sums[cpu] = sum; alert_counter[cpu] = 0; } diff --git a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c index b2b4bb8..54629bb 100644 --- a/arch/i386/kernel/traps.c +++ b/arch/i386/kernel/traps.c @@ -565,6 +565,10 @@ static DEFINE_SPINLOCK(nmi_print_lock); void die_nmi (struct pt_regs *regs, const char *msg) { + if (notify_die(DIE_NMIWATCHDOG, msg, regs, 0, 0, SIGINT) == + NOTIFY_STOP) + return; + spin_lock(&nmi_print_lock); /* * We are in trouble anyway, lets at least try -- cgit v1.1 From d7271b14b2e9e5905aba0fbf5c4dc4f8980c0cb2 Mon Sep 17 00:00:00 2001 From: Zachary Amsden Date: Sat, 3 Sep 2005 15:56:50 -0700 Subject: [PATCH] i386: encapsulate copying of pgd entries Add a clone operation for pgd updates. This helps complete the encapsulation of updates to page tables (or pages about to become page tables) into accessor functions rather than using memcpy() to duplicate them. This is both generally good for consistency and also necessary for running in a hypervisor which requires explicit updates to page table entries. The new function is: clone_pgd_range(pgd_t *dst, pgd_t *src, int count); dst - pointer to pgd range anwhere on a pgd page src - "" count - the number of pgds to copy. dst and src can be on the same page, but the range must not overlap and must not cross a page boundary. Note that I ommitted using this call to copy pgd entries into the software suspend page root, since this is not technically a live paging structure, rather it is used on resume from suspend. CC'ing Pavel in case he has any feedback on this. Thanks to Chris Wright for noticing that this could be more optimal in PAE compiles by eliminating the memset. Signed-off-by: Zachary Amsden Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/smpboot.c | 4 ++-- arch/i386/mm/pgtable.c | 10 +++++----- 2 files changed, 7 insertions(+), 7 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c index 8ac8e9f..8e950cd 100644 --- a/arch/i386/kernel/smpboot.c +++ b/arch/i386/kernel/smpboot.c @@ -1017,8 +1017,8 @@ int __devinit smp_prepare_cpu(int cpu) tsc_sync_disabled = 1; /* init low mem mapping */ - memcpy(swapper_pg_dir, swapper_pg_dir + USER_PGD_PTRS, - sizeof(swapper_pg_dir[0]) * KERNEL_PGD_PTRS); + clone_pgd_range(swapper_pg_dir, swapper_pg_dir + USER_PGD_PTRS, + KERNEL_PGD_PTRS); flush_tlb_all(); schedule_work(&task); wait_for_completion(&done); diff --git a/arch/i386/mm/pgtable.c b/arch/i386/mm/pgtable.c index bd2f7af..dcdce2c 100644 --- a/arch/i386/mm/pgtable.c +++ b/arch/i386/mm/pgtable.c @@ -207,19 +207,19 @@ void pgd_ctor(void *pgd, kmem_cache_t *cache, unsigned long unused) { unsigned long flags; - if (PTRS_PER_PMD == 1) + if (PTRS_PER_PMD == 1) { + memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t)); spin_lock_irqsave(&pgd_lock, flags); + } - memcpy((pgd_t *)pgd + USER_PTRS_PER_PGD, + clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD, swapper_pg_dir + USER_PTRS_PER_PGD, - (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); - + KERNEL_PGD_PTRS); if (PTRS_PER_PMD > 1) return; pgd_list_add(pgd); spin_unlock_irqrestore(&pgd_lock, flags); - memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t)); } /* never called when PTRS_PER_PMD > 1 */ -- cgit v1.1 From 4ad8d38342430f8b52f7a8458dce90caf8c8ca64 Mon Sep 17 00:00:00 2001 From: Zwane Mwaikambo Date: Sat, 3 Sep 2005 15:56:51 -0700 Subject: [PATCH] i386 boottime for_each_cpu broken for_each_cpu walks through all processors in cpu_possible_map, which is defined as cpu_callout_map on i386 and isn't initialised until all processors have been booted. This breaks things which do for_each_cpu iterations early during boot. So, define cpu_possible_map as a bitmap with NR_CPUS bits populated. This was triggered by a patch i'm working on which does alloc_percpu before bringing up secondary processors. From: Alexander Nyberg i386-boottime-for_each_cpu-broken.patch i386-boottime-for_each_cpu-broken-fix.patch The SMP version of __alloc_percpu checks the cpu_possible_map before allocating memory for a certain cpu. With the above patches the BSP cpuid is never set in cpu_possible_map which breaks CONFIG_SMP on uniprocessor machines (as soon as someone tries to dereference something allocated via __alloc_percpu, which in fact is never allocated since the cpu is not set in cpu_possible_map). Signed-off-by: Zwane Mwaikambo Signed-off-by: Alexander Nyberg Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/mpparse.c | 8 +++++++- arch/i386/kernel/smpboot.c | 3 +++ arch/i386/mach-voyager/voyager_smp.c | 3 +++ 3 files changed, 13 insertions(+), 1 deletion(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/mpparse.c b/arch/i386/kernel/mpparse.c index 788efff..5d0b9a8 100644 --- a/arch/i386/kernel/mpparse.c +++ b/arch/i386/kernel/mpparse.c @@ -122,7 +122,7 @@ static int MP_valid_apicid(int apicid, int version) static void __init MP_processor_info (struct mpc_config_processor *m) { - int ver, apicid; + int ver, apicid, cpu, found_bsp = 0; physid_mask_t tmp; if (!(m->mpc_cpuflag & CPU_ENABLED)) @@ -181,6 +181,7 @@ static void __init MP_processor_info (struct mpc_config_processor *m) if (m->mpc_cpuflag & CPU_BOOTPROCESSOR) { Dprintk(" Bootup CPU\n"); boot_cpu_physical_apicid = m->mpc_apicid; + found_bsp = 1; } if (num_processors >= NR_CPUS) { @@ -204,6 +205,11 @@ static void __init MP_processor_info (struct mpc_config_processor *m) return; } + if (found_bsp) + cpu = 0; + else + cpu = num_processors - 1; + cpu_set(cpu, cpu_possible_map); tmp = apicid_to_cpu_present(apicid); physids_or(phys_cpu_present_map, phys_cpu_present_map, tmp); diff --git a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c index 8e950cd..5e4893d 100644 --- a/arch/i386/kernel/smpboot.c +++ b/arch/i386/kernel/smpboot.c @@ -88,6 +88,8 @@ EXPORT_SYMBOL(cpu_online_map); cpumask_t cpu_callin_map; cpumask_t cpu_callout_map; EXPORT_SYMBOL(cpu_callout_map); +cpumask_t cpu_possible_map; +EXPORT_SYMBOL(cpu_possible_map); static cpumask_t smp_commenced_mask; /* TSC's upper 32 bits can't be written in eariler CPU (before prescott), there @@ -1265,6 +1267,7 @@ void __devinit smp_prepare_boot_cpu(void) cpu_set(smp_processor_id(), cpu_online_map); cpu_set(smp_processor_id(), cpu_callout_map); cpu_set(smp_processor_id(), cpu_present_map); + cpu_set(smp_processor_id(), cpu_possible_map); per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE; } diff --git a/arch/i386/mach-voyager/voyager_smp.c b/arch/i386/mach-voyager/voyager_smp.c index 16790b7..46b0cf4 100644 --- a/arch/i386/mach-voyager/voyager_smp.c +++ b/arch/i386/mach-voyager/voyager_smp.c @@ -242,6 +242,8 @@ static cpumask_t smp_commenced_mask = CPU_MASK_NONE; cpumask_t cpu_callin_map = CPU_MASK_NONE; cpumask_t cpu_callout_map = CPU_MASK_NONE; EXPORT_SYMBOL(cpu_callout_map); +cpumask_t cpu_possible_map = CPU_MASK_ALL; +EXPORT_SYMBOL(cpu_possible_map); /* The per processor IRQ masks (these are usually kept in sync) */ static __u16 vic_irq_mask[NR_CPUS] __cacheline_aligned; @@ -1910,6 +1912,7 @@ void __devinit smp_prepare_boot_cpu(void) { cpu_set(smp_processor_id(), cpu_online_map); cpu_set(smp_processor_id(), cpu_callout_map); + cpu_set(smp_processor_id(), cpu_possible_map); } int __devinit -- cgit v1.1 From 52fdd08903a1d1162e184114837e232640191627 Mon Sep 17 00:00:00 2001 From: Benjamin LaHaise Date: Sat, 3 Sep 2005 15:56:52 -0700 Subject: [PATCH] unify x86/x86-64 semaphore code This patch moves the common code in x86 and x86-64's semaphore.c into a single file in lib/semaphore-sleepers.c. The arch specific asm stubs are left in the arch tree (in semaphore.c for i386 and in the asm for x86-64). There should be no changes in code/functionality with this patch. Signed-off-by: Benjamin LaHaise Cc: Andi Kleen Signed-off-by: Jeff Dike Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/Kconfig | 4 ++ arch/i386/kernel/semaphore.c | 162 ------------------------------------------- 2 files changed, 4 insertions(+), 162 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig index dcb0ad0..3b3b017 100644 --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -14,6 +14,10 @@ config X86 486, 586, Pentiums, and various instruction-set-compatible chips by AMD, Cyrix, and others. +config SEMAPHORE_SLEEPERS + bool + default y + config MMU bool default y diff --git a/arch/i386/kernel/semaphore.c b/arch/i386/kernel/semaphore.c index 469f496..7455ab6 100644 --- a/arch/i386/kernel/semaphore.c +++ b/arch/i386/kernel/semaphore.c @@ -13,171 +13,9 @@ * rw semaphores implemented November 1999 by Benjamin LaHaise */ #include -#include -#include -#include #include /* - * Semaphores are implemented using a two-way counter: - * The "count" variable is decremented for each process - * that tries to acquire the semaphore, while the "sleeping" - * variable is a count of such acquires. - * - * Notably, the inline "up()" and "down()" functions can - * efficiently test if they need to do any extra work (up - * needs to do something only if count was negative before - * the increment operation. - * - * "sleeping" and the contention routine ordering is protected - * by the spinlock in the semaphore's waitqueue head. - * - * Note that these functions are only called when there is - * contention on the lock, and as such all this is the - * "non-critical" part of the whole semaphore business. The - * critical part is the inline stuff in - * where we want to avoid any extra jumps and calls. - */ - -/* - * Logic: - * - only on a boundary condition do we need to care. When we go - * from a negative count to a non-negative, we wake people up. - * - when we go from a non-negative count to a negative do we - * (a) synchronize with the "sleeper" count and (b) make sure - * that we're on the wakeup list before we synchronize so that - * we cannot lose wakeup events. - */ - -static fastcall void __attribute_used__ __up(struct semaphore *sem) -{ - wake_up(&sem->wait); -} - -static fastcall void __attribute_used__ __sched __down(struct semaphore * sem) -{ - struct task_struct *tsk = current; - DECLARE_WAITQUEUE(wait, tsk); - unsigned long flags; - - tsk->state = TASK_UNINTERRUPTIBLE; - spin_lock_irqsave(&sem->wait.lock, flags); - add_wait_queue_exclusive_locked(&sem->wait, &wait); - - sem->sleepers++; - for (;;) { - int sleepers = sem->sleepers; - - /* - * Add "everybody else" into it. They aren't - * playing, because we own the spinlock in - * the wait_queue_head. - */ - if (!atomic_add_negative(sleepers - 1, &sem->count)) { - sem->sleepers = 0; - break; - } - sem->sleepers = 1; /* us - see -1 above */ - spin_unlock_irqrestore(&sem->wait.lock, flags); - - schedule(); - - spin_lock_irqsave(&sem->wait.lock, flags); - tsk->state = TASK_UNINTERRUPTIBLE; - } - remove_wait_queue_locked(&sem->wait, &wait); - wake_up_locked(&sem->wait); - spin_unlock_irqrestore(&sem->wait.lock, flags); - tsk->state = TASK_RUNNING; -} - -static fastcall int __attribute_used__ __sched __down_interruptible(struct semaphore * sem) -{ - int retval = 0; - struct task_struct *tsk = current; - DECLARE_WAITQUEUE(wait, tsk); - unsigned long flags; - - tsk->state = TASK_INTERRUPTIBLE; - spin_lock_irqsave(&sem->wait.lock, flags); - add_wait_queue_exclusive_locked(&sem->wait, &wait); - - sem->sleepers++; - for (;;) { - int sleepers = sem->sleepers; - - /* - * With signals pending, this turns into - * the trylock failure case - we won't be - * sleeping, and we* can't get the lock as - * it has contention. Just correct the count - * and exit. - */ - if (signal_pending(current)) { - retval = -EINTR; - sem->sleepers = 0; - atomic_add(sleepers, &sem->count); - break; - } - - /* - * Add "everybody else" into it. They aren't - * playing, because we own the spinlock in - * wait_queue_head. The "-1" is because we're - * still hoping to get the semaphore. - */ - if (!atomic_add_negative(sleepers - 1, &sem->count)) { - sem->sleepers = 0; - break; - } - sem->sleepers = 1; /* us - see -1 above */ - spin_unlock_irqrestore(&sem->wait.lock, flags); - - schedule(); - - spin_lock_irqsave(&sem->wait.lock, flags); - tsk->state = TASK_INTERRUPTIBLE; - } - remove_wait_queue_locked(&sem->wait, &wait); - wake_up_locked(&sem->wait); - spin_unlock_irqrestore(&sem->wait.lock, flags); - - tsk->state = TASK_RUNNING; - return retval; -} - -/* - * Trylock failed - make sure we correct for - * having decremented the count. - * - * We could have done the trylock with a - * single "cmpxchg" without failure cases, - * but then it wouldn't work on a 386. - */ -static fastcall int __attribute_used__ __down_trylock(struct semaphore * sem) -{ - int sleepers; - unsigned long flags; - - spin_lock_irqsave(&sem->wait.lock, flags); - sleepers = sem->sleepers + 1; - sem->sleepers = 0; - - /* - * Add "everybody else" and us into it. They aren't - * playing, because we own the spinlock in the - * wait_queue_head. - */ - if (!atomic_add_negative(sleepers, &sem->count)) { - wake_up_locked(&sem->wait); - } - - spin_unlock_irqrestore(&sem->wait.lock, flags); - return 1; -} - - -/* * The semaphore operations have a special calling sequence that * allow us to do a simpler in-line version of them. These routines * need to convert that sequence back into the C sequence when -- cgit v1.1 From 795312e763569ce4df67e7a0ca726a9901358fa2 Mon Sep 17 00:00:00 2001 From: Pierre Ossman Date: Sat, 3 Sep 2005 15:56:54 -0700 Subject: [PATCH] ISA DMA suspend for i386 Reset the ISA DMA controller into a known state after a suspend. Primary concern was reenabling the cascading DMA channel (4). Signed-off-by: Pierre Ossman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/Makefile | 2 +- arch/i386/kernel/i8237.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 68 insertions(+), 1 deletion(-) create mode 100644 arch/i386/kernel/i8237.c (limited to 'arch/i386') diff --git a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile index 4cc83b3..64682a0 100644 --- a/arch/i386/kernel/Makefile +++ b/arch/i386/kernel/Makefile @@ -7,7 +7,7 @@ extra-y := head.o init_task.o vmlinux.lds obj-y := process.o semaphore.o signal.o entry.o traps.o irq.o vm86.o \ ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_i386.o \ pci-dma.o i386_ksyms.o i387.o dmi_scan.o bootflag.o \ - doublefault.o quirks.o + doublefault.o quirks.o i8237.o obj-y += cpu/ obj-y += timers/ diff --git a/arch/i386/kernel/i8237.c b/arch/i386/kernel/i8237.c new file mode 100644 index 0000000..c36d1c0 --- /dev/null +++ b/arch/i386/kernel/i8237.c @@ -0,0 +1,67 @@ +/* + * i8237.c: 8237A DMA controller suspend functions. + * + * Written by Pierre Ossman, 2005. + */ + +#include +#include + +#include + +/* + * This module just handles suspend/resume issues with the + * 8237A DMA controller (used for ISA and LPC). + * Allocation is handled in kernel/dma.c and normal usage is + * in asm/dma.h. + */ + +static int i8237A_resume(struct sys_device *dev) +{ + unsigned long flags; + int i; + + flags = claim_dma_lock(); + + dma_outb(DMA1_RESET_REG, 0); + dma_outb(DMA2_RESET_REG, 0); + + for (i = 0;i < 8;i++) { + set_dma_addr(i, 0x000000); + /* DMA count is a bit weird so this is not 0 */ + set_dma_count(i, 1); + } + + /* Enable cascade DMA or channel 0-3 won't work */ + enable_dma(4); + + release_dma_lock(flags); + + return 0; +} + +static int i8237A_suspend(struct sys_device *dev, pm_message_t state) +{ + return 0; +} + +static struct sysdev_class i8237_sysdev_class = { + set_kset_name("i8237"), + .suspend = i8237A_suspend, + .resume = i8237A_resume, +}; + +static struct sys_device device_i8237A = { + .id = 0, + .cls = &i8237_sysdev_class, +}; + +static int __init i8237A_init_sysfs(void) +{ + int error = sysdev_class_register(&i8237_sysdev_class); + if (!error) + error = sysdev_register(&device_i8237A); + return error; +} + +device_initcall(i8237A_init_sysfs); -- cgit v1.1 From 829ca9a30a2ddb727981d80fabdbff2ea86bc9ea Mon Sep 17 00:00:00 2001 From: Pavel Machek Date: Sat, 3 Sep 2005 15:56:56 -0700 Subject: [PATCH] swsusp: fix remaining u32 vs. pm_message_t confusion Fix remaining bits of u32 vs. pm_message confusion. Should not break anything. Signed-off-by: Pavel Machek Signed-off-by: Rafael J. Wysocki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/cpu/mtrr/main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/cpu/mtrr/main.c b/arch/i386/kernel/cpu/mtrr/main.c index 764cac6..dd4ebd6 100644 --- a/arch/i386/kernel/cpu/mtrr/main.c +++ b/arch/i386/kernel/cpu/mtrr/main.c @@ -561,7 +561,7 @@ struct mtrr_value { static struct mtrr_value * mtrr_state; -static int mtrr_save(struct sys_device * sysdev, u32 state) +static int mtrr_save(struct sys_device * sysdev, pm_message_t state) { int i; int size = num_var_ranges * sizeof(struct mtrr_value); -- cgit v1.1 From c3c433e4f33afe255389ba3b1a003dc8deb3de9a Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Sat, 3 Sep 2005 15:57:07 -0700 Subject: [PATCH] add suspend/resume for timer The timers lack .suspend/.resume methods. Because of this, jiffies got a big compensation after a S3 resume. And then softlockup watchdog reports an oops. This occured with HPET enabled, but it's also possible for other timers. Signed-off-by: Shaohua Li Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/time.c | 10 ++++++++++ arch/i386/kernel/timers/timer_hpet.c | 14 ++++++++++++++ arch/i386/kernel/timers/timer_pit.c | 27 --------------------------- arch/i386/kernel/timers/timer_pm.c | 9 +++++++++ arch/i386/kernel/timers/timer_tsc.c | 14 ++++++++++++++ 5 files changed, 47 insertions(+), 27 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c index 0ee9dee..6f794a7 100644 --- a/arch/i386/kernel/time.c +++ b/arch/i386/kernel/time.c @@ -383,6 +383,7 @@ void notify_arch_cmos_timer(void) static long clock_cmos_diff, sleep_start; +static struct timer_opts *last_timer; static int timer_suspend(struct sys_device *dev, pm_message_t state) { /* @@ -391,6 +392,10 @@ static int timer_suspend(struct sys_device *dev, pm_message_t state) clock_cmos_diff = -get_cmos_time(); clock_cmos_diff += get_seconds(); sleep_start = get_cmos_time(); + last_timer = cur_timer; + cur_timer = &timer_none; + if (last_timer->suspend) + last_timer->suspend(state); return 0; } @@ -404,6 +409,7 @@ static int timer_resume(struct sys_device *dev) if (is_hpet_enabled()) hpet_reenable(); #endif + setup_pit_timer(); sec = get_cmos_time() + clock_cmos_diff; sleep_length = (get_cmos_time() - sleep_start) * HZ; write_seqlock_irqsave(&xtime_lock, flags); @@ -412,6 +418,10 @@ static int timer_resume(struct sys_device *dev) write_sequnlock_irqrestore(&xtime_lock, flags); jiffies += sleep_length; wall_jiffies += sleep_length; + if (last_timer->resume) + last_timer->resume(); + cur_timer = last_timer; + last_timer = NULL; return 0; } diff --git a/arch/i386/kernel/timers/timer_hpet.c b/arch/i386/kernel/timers/timer_hpet.c index cbb3f22..001de97 100644 --- a/arch/i386/kernel/timers/timer_hpet.c +++ b/arch/i386/kernel/timers/timer_hpet.c @@ -181,6 +181,19 @@ static int __init init_hpet(char* override) return 0; } +static int hpet_resume(void) +{ + write_seqlock(&monotonic_lock); + /* Assume this is the last mark offset time */ + rdtsc(last_tsc_low, last_tsc_high); + + if (hpet_use_timer) + hpet_last = hpet_readl(HPET_T0_CMP) - hpet_tick; + else + hpet_last = hpet_readl(HPET_COUNTER); + write_sequnlock(&monotonic_lock); + return 0; +} /************************************************************/ /* tsc timer_opts struct */ @@ -190,6 +203,7 @@ static struct timer_opts timer_hpet __read_mostly = { .get_offset = get_offset_hpet, .monotonic_clock = monotonic_clock_hpet, .delay = delay_hpet, + .resume = hpet_resume, }; struct init_timer_opts __initdata timer_hpet_init = { diff --git a/arch/i386/kernel/timers/timer_pit.c b/arch/i386/kernel/timers/timer_pit.c index 06de036..eddb640 100644 --- a/arch/i386/kernel/timers/timer_pit.c +++ b/arch/i386/kernel/timers/timer_pit.c @@ -175,30 +175,3 @@ void setup_pit_timer(void) outb(LATCH >> 8 , PIT_CH0); /* MSB */ spin_unlock_irqrestore(&i8253_lock, flags); } - -static int timer_resume(struct sys_device *dev) -{ - setup_pit_timer(); - return 0; -} - -static struct sysdev_class timer_sysclass = { - set_kset_name("timer_pit"), - .resume = timer_resume, -}; - -static struct sys_device device_timer = { - .id = 0, - .cls = &timer_sysclass, -}; - -static int __init init_timer_sysfs(void) -{ - int error = sysdev_class_register(&timer_sysclass); - if (!error) - error = sysdev_register(&device_timer); - return error; -} - -device_initcall(init_timer_sysfs); - diff --git a/arch/i386/kernel/timers/timer_pm.c b/arch/i386/kernel/timers/timer_pm.c index 4ef20e6..264edaa 100644 --- a/arch/i386/kernel/timers/timer_pm.c +++ b/arch/i386/kernel/timers/timer_pm.c @@ -186,6 +186,14 @@ static void mark_offset_pmtmr(void) } } +static int pmtmr_resume(void) +{ + write_seqlock(&monotonic_lock); + /* Assume this is the last mark offset time */ + offset_tick = read_pmtmr(); + write_sequnlock(&monotonic_lock); + return 0; +} static unsigned long long monotonic_clock_pmtmr(void) { @@ -247,6 +255,7 @@ static struct timer_opts timer_pmtmr = { .monotonic_clock = monotonic_clock_pmtmr, .delay = delay_pmtmr, .read_timer = read_timer_tsc, + .resume = pmtmr_resume, }; struct init_timer_opts __initdata timer_pmtmr_init = { diff --git a/arch/i386/kernel/timers/timer_tsc.c b/arch/i386/kernel/timers/timer_tsc.c index 8f4e4d5..6dd470c 100644 --- a/arch/i386/kernel/timers/timer_tsc.c +++ b/arch/i386/kernel/timers/timer_tsc.c @@ -543,6 +543,19 @@ static int __init init_tsc(char* override) return -ENODEV; } +static int tsc_resume(void) +{ + write_seqlock(&monotonic_lock); + /* Assume this is the last mark offset time */ + rdtsc(last_tsc_low, last_tsc_high); +#ifdef CONFIG_HPET_TIMER + if (is_hpet_enabled() && hpet_use_timer) + hpet_last = hpet_readl(HPET_COUNTER); +#endif + write_sequnlock(&monotonic_lock); + return 0; +} + #ifndef CONFIG_X86_TSC /* disable flag for tsc. Takes effect by clearing the TSC cpu flag * in cpu/common.c */ @@ -573,6 +586,7 @@ static struct timer_opts timer_tsc = { .monotonic_clock = monotonic_clock_tsc, .delay = delay_tsc, .read_timer = read_timer_tsc, + .resume = tsc_resume, }; struct init_timer_opts __initdata timer_tsc_init = { -- cgit v1.1 From 94c80b2598dbd2b8a6fe5f5c2c3af1beb37f66c7 Mon Sep 17 00:00:00 2001 From: Bodo Stroesser Date: Sat, 3 Sep 2005 15:57:13 -0700 Subject: [PATCH] Ptrace/i386: fix "syscall audit" interaction with singlestep Paolo 'Blaisorblade' Giarrusso Avoid giving two traps for singlestep instead of one, when syscall auditing is enabled. In fact no singlestep trap is sent on syscall entry, only on syscall exit, as can be seen in entry.S: # Note that in this mask _TIF_SINGLESTEP is not tested !!! <<<<<<<<<<<<<< testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp) jnz syscall_trace_entry ... syscall_trace_entry: ... call do_syscall_trace But auditing a SINGLESTEP'ed process causes do_syscall_trace to be called, so the tracer will get one more trap on the syscall entry path, which it shouldn't. Signed-off-by: Paolo 'Blaisorblade' Giarrusso CC: Roland McGrath Cc: Jeff Dike Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/ptrace.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/ptrace.c b/arch/i386/kernel/ptrace.c index 0da59b4..5ee9e1d 100644 --- a/arch/i386/kernel/ptrace.c +++ b/arch/i386/kernel/ptrace.c @@ -683,8 +683,19 @@ void do_syscall_trace(struct pt_regs *regs, int entryexit) /* do the secure computing check first */ secure_computing(regs->orig_eax); - if (unlikely(current->audit_context) && entryexit) - audit_syscall_exit(current, AUDITSC_RESULT(regs->eax), regs->eax); + if (unlikely(current->audit_context)) { + if (entryexit) + audit_syscall_exit(current, AUDITSC_RESULT(regs->eax), regs->eax); + + /* Debug traps, when using PTRACE_SINGLESTEP, must be sent only + * on the syscall exit path. Normally, when TIF_SYSCALL_AUDIT is + * not used, entry.S will call us only on syscall exit, not + * entry ; so when TIF_SYSCALL_AUDIT is used we must avoid + * calling send_sigtrap() on syscall entry. + */ + else if (is_singlestep) + goto out; + } if (!(current->ptrace & PT_PTRACED)) goto out; -- cgit v1.1 From ed75e8d58010fdc06e2c3a81bfbebae92314c7e3 Mon Sep 17 00:00:00 2001 From: Laurent Vivier Date: Sat, 3 Sep 2005 15:57:18 -0700 Subject: [PATCH] UML Support - Ptrace: adds the host SYSEMU support, for UML and general usage Jeff Dike , Paolo 'Blaisorblade' Giarrusso , Bodo Stroesser Adds a new ptrace(2) mode, called PTRACE_SYSEMU, resembling PTRACE_SYSCALL except that the kernel does not execute the requested syscall; this is useful to improve performance for virtual environments, like UML, which want to run the syscall on their own. In fact, using PTRACE_SYSCALL means stopping child execution twice, on entry and on exit, and each time you also have two context switches; with SYSEMU you avoid the 2nd stop and so save two context switches per syscall. Also, some architectures don't have support in the host for changing the syscall number via ptrace(), which is currently needed to skip syscall execution (UML turns any syscall into getpid() to avoid it being executed on the host). Fixing that is hard, while SYSEMU is easier to implement. * This version of the patch includes some suggestions of Jeff Dike to avoid adding any instructions to the syscall fast path, plus some other little changes, by myself, to make it work even when the syscall is executed with SYSENTER (but I'm unsure about them). It has been widely tested for quite a lot of time. * Various fixed were included to handle the various switches between various states, i.e. when for instance a syscall entry is traced with one of PT_SYSCALL / _SYSEMU / _SINGLESTEP and another one is used on exit. Basically, this is done by remembering which one of them was used even after the call to ptrace_notify(). * We're combining TIF_SYSCALL_EMU with TIF_SYSCALL_TRACE or TIF_SINGLESTEP to make do_syscall_trace() notice that the current syscall was started with SYSEMU on entry, so that no notification ought to be done in the exit path; this is a bit of a hack, so this problem is solved in another way in next patches. * Also, the effects of the patch: "Ptrace - i386: fix Syscall Audit interaction with singlestep" are cancelled; they are restored back in the last patch of this series. Detailed descriptions of the patches doing this kind of processing follow (but I've already summed everything up). * Fix behaviour when changing interception kind #1. In do_syscall_trace(), we check the status of the TIF_SYSCALL_EMU flag only after doing the debugger notification; but the debugger might have changed the status of this flag because he continued execution with PTRACE_SYSCALL, so this is wrong. This patch fixes it by saving the flag status before calling ptrace_notify(). * Fix behaviour when changing interception kind #2: avoid intercepting syscall on return when using SYSCALL again. A guest process switching from using PTRACE_SYSEMU to PTRACE_SYSCALL crashes. The problem is in arch/i386/kernel/entry.S. The current SYSEMU patch inhibits the syscall-handler to be called, but does not prevent do_syscall_trace() to be called after this for syscall completion interception. The appended patch fixes this. It reuses the flag TIF_SYSCALL_EMU to remember "we come from PTRACE_SYSEMU and now are in PTRACE_SYSCALL", since the flag is unused in the depicted situation. * Fix behaviour when changing interception kind #3: avoid intercepting syscall on return when using SINGLESTEP. When testing 2.6.9 and the skas3.v6 patch, with my latest patch and had problems with singlestepping on UML in SKAS with SYSEMU. It looped receiving SIGTRAPs without moving forward. EIP of the traced process was the same for all SIGTRAPs. What's missing is to handle switching from PTRACE_SYSCALL_EMU to PTRACE_SINGLESTEP in a way very similar to what is done for the change from PTRACE_SYSCALL_EMU to PTRACE_SYSCALL_TRACE. I.e., after calling ptrace(PTRACE_SYSEMU), on the return path, the debugger is notified and then wake ups the process; the syscall is executed (or skipped, when do_syscall_trace() returns 0, i.e. when using PTRACE_SYSEMU), and do_syscall_trace() is called again. Since we are on the return path of a SYSEMU'd syscall, if the wake up is performed through ptrace(PTRACE_SYSCALL), we must still avoid notifying the parent of the syscall exit. Now, this behaviour is extended even to resuming with PTRACE_SINGLESTEP. Signed-off-by: Paolo 'Blaisorblade' Giarrusso Cc: Jeff Dike Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/entry.S | 9 +++++--- arch/i386/kernel/ptrace.c | 57 ++++++++++++++++++++++++++++++++--------------- 2 files changed, 45 insertions(+), 21 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S index a991d4e..b389e5f 100644 --- a/arch/i386/kernel/entry.S +++ b/arch/i386/kernel/entry.S @@ -203,7 +203,7 @@ sysenter_past_esp: GET_THREAD_INFO(%ebp) /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */ - testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp) + testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT),TI_flags(%ebp) jnz syscall_trace_entry cmpl $(nr_syscalls), %eax jae syscall_badsys @@ -226,9 +226,9 @@ ENTRY(system_call) pushl %eax # save orig_eax SAVE_ALL GET_THREAD_INFO(%ebp) - # system call tracing in operation + # system call tracing in operation / emulation /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */ - testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp) + testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT),TI_flags(%ebp) jnz syscall_trace_entry cmpl $(nr_syscalls), %eax jae syscall_badsys @@ -338,6 +338,9 @@ syscall_trace_entry: movl %esp, %eax xorl %edx,%edx call do_syscall_trace + cmpl $0, %eax + jne syscall_exit # ret != 0 -> running under PTRACE_SYSEMU, + # so must skip actual syscall movl ORIG_EAX(%esp), %eax cmpl $(nr_syscalls), %eax jnae syscall_call diff --git a/arch/i386/kernel/ptrace.c b/arch/i386/kernel/ptrace.c index 5ee9e1d..5b569dc 100644 --- a/arch/i386/kernel/ptrace.c +++ b/arch/i386/kernel/ptrace.c @@ -509,15 +509,27 @@ asmlinkage int sys_ptrace(long request, long pid, long addr, long data) } break; + case PTRACE_SYSEMU: /* continue and stop at next syscall, which will not be executed */ case PTRACE_SYSCALL: /* continue and stop at next (return from) syscall */ case PTRACE_CONT: /* restart after signal. */ ret = -EIO; if (!valid_signal(data)) break; + /* If we came here with PTRACE_SYSEMU and now continue with + * PTRACE_SYSCALL, entry.S used to intercept the syscall return. + * But it shouldn't! + * So we don't clear TIF_SYSCALL_EMU, which is always unused in + * this special case, to remember, we came from SYSEMU. That + * flag will be cleared by do_syscall_trace(). + */ + if (request == PTRACE_SYSEMU) { + set_tsk_thread_flag(child, TIF_SYSCALL_EMU); + } else if (request == PTRACE_CONT) { + clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); + } if (request == PTRACE_SYSCALL) { set_tsk_thread_flag(child, TIF_SYSCALL_TRACE); - } - else { + } else { clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); } child->exit_code = data; @@ -546,6 +558,8 @@ asmlinkage int sys_ptrace(long request, long pid, long addr, long data) ret = -EIO; if (!valid_signal(data)) break; + /*See do_syscall_trace to know why we don't clear + * TIF_SYSCALL_EMU.*/ clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); set_singlestep(child); child->exit_code = data; @@ -678,37 +692,43 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs, int error_code) * - triggered by current->work.syscall_trace */ __attribute__((regparm(3))) -void do_syscall_trace(struct pt_regs *regs, int entryexit) +int do_syscall_trace(struct pt_regs *regs, int entryexit) { + int is_sysemu, is_systrace, is_singlestep, ret = 0; /* do the secure computing check first */ secure_computing(regs->orig_eax); - if (unlikely(current->audit_context)) { - if (entryexit) - audit_syscall_exit(current, AUDITSC_RESULT(regs->eax), regs->eax); - - /* Debug traps, when using PTRACE_SINGLESTEP, must be sent only - * on the syscall exit path. Normally, when TIF_SYSCALL_AUDIT is - * not used, entry.S will call us only on syscall exit, not - * entry ; so when TIF_SYSCALL_AUDIT is used we must avoid - * calling send_sigtrap() on syscall entry. - */ - else if (is_singlestep) - goto out; - } + if (unlikely(current->audit_context) && entryexit) + audit_syscall_exit(current, AUDITSC_RESULT(regs->eax), regs->eax); if (!(current->ptrace & PT_PTRACED)) goto out; + is_sysemu = test_thread_flag(TIF_SYSCALL_EMU); + is_systrace = test_thread_flag(TIF_SYSCALL_TRACE); + is_singlestep = test_thread_flag(TIF_SINGLESTEP); + + /* We can detect the case of coming from PTRACE_SYSEMU and now running + * with PTRACE_SYSCALL or PTRACE_SINGLESTEP, by TIF_SYSCALL_EMU being + * set additionally. + * If so let's reset the flag and return without action (no singlestep + * nor syscall tracing, since no actual step has been executed). + */ + if (is_sysemu && (is_systrace || is_singlestep)) { + clear_thread_flag(TIF_SYSCALL_EMU); + goto out; + } + /* Fake a debug trap */ if (test_thread_flag(TIF_SINGLESTEP)) send_sigtrap(current, regs, 0); - if (!test_thread_flag(TIF_SYSCALL_TRACE)) + if (!is_systrace && !is_sysemu) goto out; /* the 0x80 provides a way for the tracing parent to distinguish between a syscall stop and SIGTRAP delivery */ + /* Note that the debugger could change the result of test_thread_flag!*/ ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD) ? 0x80 : 0)); /* @@ -720,9 +740,10 @@ void do_syscall_trace(struct pt_regs *regs, int entryexit) send_sig(current->exit_code, current, 1); current->exit_code = 0; } + ret = is_sysemu; out: if (unlikely(current->audit_context) && !entryexit) audit_syscall_entry(current, AUDIT_ARCH_I386, regs->orig_eax, regs->ebx, regs->ecx, regs->edx, regs->esi); - + return ret; } -- cgit v1.1 From c8c86cecd1d1a2722acb28a01d1babf7b6993697 Mon Sep 17 00:00:00 2001 From: Bodo Stroesser Date: Sat, 3 Sep 2005 15:57:19 -0700 Subject: [PATCH] Uml support: reorganize PTRACE_SYSEMU support With this patch, we change the way we handle switching from PTRACE_SYSEMU to PTRACE_{SINGLESTEP,SYSCALL}, to free TIF_SYSCALL_EMU from double use as a preparation for PTRACE_SYSEMU_SINGLESTEP extension, without changing the behavior of the host kernel. Signed-off-by: Bodo Stroesser Signed-off-by: Paolo 'Blaisorblade' Giarrusso Cc: Jeff Dike Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/entry.S | 8 +++++++- arch/i386/kernel/ptrace.c | 43 ++++++++++++++----------------------------- 2 files changed, 21 insertions(+), 30 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S index b389e5f..9a47723 100644 --- a/arch/i386/kernel/entry.S +++ b/arch/i386/kernel/entry.S @@ -339,12 +339,18 @@ syscall_trace_entry: xorl %edx,%edx call do_syscall_trace cmpl $0, %eax - jne syscall_exit # ret != 0 -> running under PTRACE_SYSEMU, + jne syscall_skip # ret != 0 -> running under PTRACE_SYSEMU, # so must skip actual syscall movl ORIG_EAX(%esp), %eax cmpl $(nr_syscalls), %eax jnae syscall_call jmp syscall_exit +syscall_skip: + cli # make sure we don't miss an interrupt + # setting need_resched or sigpending + # between sampling and the iret + movl TI_flags(%ebp), %ecx + jmp work_pending # perform syscall exit tracing ALIGN diff --git a/arch/i386/kernel/ptrace.c b/arch/i386/kernel/ptrace.c index 5b569dc..18642f0 100644 --- a/arch/i386/kernel/ptrace.c +++ b/arch/i386/kernel/ptrace.c @@ -515,21 +515,14 @@ asmlinkage int sys_ptrace(long request, long pid, long addr, long data) ret = -EIO; if (!valid_signal(data)) break; - /* If we came here with PTRACE_SYSEMU and now continue with - * PTRACE_SYSCALL, entry.S used to intercept the syscall return. - * But it shouldn't! - * So we don't clear TIF_SYSCALL_EMU, which is always unused in - * this special case, to remember, we came from SYSEMU. That - * flag will be cleared by do_syscall_trace(). - */ if (request == PTRACE_SYSEMU) { set_tsk_thread_flag(child, TIF_SYSCALL_EMU); - } else if (request == PTRACE_CONT) { - clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); - } - if (request == PTRACE_SYSCALL) { + clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); + } else if (request == PTRACE_SYSCALL) { set_tsk_thread_flag(child, TIF_SYSCALL_TRACE); + clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); } else { + clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); } child->exit_code = data; @@ -558,8 +551,7 @@ asmlinkage int sys_ptrace(long request, long pid, long addr, long data) ret = -EIO; if (!valid_signal(data)) break; - /*See do_syscall_trace to know why we don't clear - * TIF_SYSCALL_EMU.*/ + clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); set_singlestep(child); child->exit_code = data; @@ -694,7 +686,7 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs, int error_code) __attribute__((regparm(3))) int do_syscall_trace(struct pt_regs *regs, int entryexit) { - int is_sysemu, is_systrace, is_singlestep, ret = 0; + int is_sysemu, is_singlestep, ret = 0; /* do the secure computing check first */ secure_computing(regs->orig_eax); @@ -705,25 +697,13 @@ int do_syscall_trace(struct pt_regs *regs, int entryexit) goto out; is_sysemu = test_thread_flag(TIF_SYSCALL_EMU); - is_systrace = test_thread_flag(TIF_SYSCALL_TRACE); is_singlestep = test_thread_flag(TIF_SINGLESTEP); - /* We can detect the case of coming from PTRACE_SYSEMU and now running - * with PTRACE_SYSCALL or PTRACE_SINGLESTEP, by TIF_SYSCALL_EMU being - * set additionally. - * If so let's reset the flag and return without action (no singlestep - * nor syscall tracing, since no actual step has been executed). - */ - if (is_sysemu && (is_systrace || is_singlestep)) { - clear_thread_flag(TIF_SYSCALL_EMU); - goto out; - } - /* Fake a debug trap */ - if (test_thread_flag(TIF_SINGLESTEP)) + if (is_singlestep) send_sigtrap(current, regs, 0); - if (!is_systrace && !is_sysemu) + if (!test_thread_flag(TIF_SYSCALL_TRACE) && !is_sysemu) goto out; /* the 0x80 provides a way for the tracing parent to distinguish @@ -745,5 +725,10 @@ int do_syscall_trace(struct pt_regs *regs, int entryexit) if (unlikely(current->audit_context) && !entryexit) audit_syscall_entry(current, AUDIT_ARCH_I386, regs->orig_eax, regs->ebx, regs->ecx, regs->edx, regs->esi); - return ret; + if (ret == 0) + return 0; + + if (unlikely(current->audit_context)) + audit_syscall_exit(current, AUDITSC_RESULT(regs->eax), regs->eax); + return 1; } -- cgit v1.1 From 1b38f0064e4e0b9ec626e39f0740b1cf2e295743 Mon Sep 17 00:00:00 2001 From: Bodo Stroesser Date: Sat, 3 Sep 2005 15:57:20 -0700 Subject: [PATCH] Uml support: add PTRACE_SYSEMU_SINGLESTEP option to i386 This patch implements the new ptrace option PTRACE_SYSEMU_SINGLESTEP, which can be used by UML to singlestep a process: it will receive SINGLESTEP interceptions for normal instructions and syscalls, but syscall execution will be skipped just like with PTRACE_SYSEMU. Signed-off-by: Bodo Stroesser Signed-off-by: Paolo 'Blaisorblade' Giarrusso Cc: Jeff Dike Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/ptrace.c | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/ptrace.c b/arch/i386/kernel/ptrace.c index 18642f0..3196ba5 100644 --- a/arch/i386/kernel/ptrace.c +++ b/arch/i386/kernel/ptrace.c @@ -547,11 +547,17 @@ asmlinkage int sys_ptrace(long request, long pid, long addr, long data) wake_up_process(child); break; + case PTRACE_SYSEMU_SINGLESTEP: /* Same as SYSEMU, but singlestep if not syscall */ case PTRACE_SINGLESTEP: /* set the trap flag. */ ret = -EIO; if (!valid_signal(data)) break; - clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); + + if (request == PTRACE_SYSEMU_SINGLESTEP) + set_tsk_thread_flag(child, TIF_SYSCALL_EMU); + else + clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); + clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); set_singlestep(child); child->exit_code = data; @@ -686,7 +692,10 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs, int error_code) __attribute__((regparm(3))) int do_syscall_trace(struct pt_regs *regs, int entryexit) { - int is_sysemu, is_singlestep, ret = 0; + int is_sysemu = test_thread_flag(TIF_SYSCALL_EMU), ret = 0; + /* With TIF_SYSCALL_EMU set we want to ignore TIF_SINGLESTEP */ + int is_singlestep = !is_sysemu && test_thread_flag(TIF_SINGLESTEP); + /* do the secure computing check first */ secure_computing(regs->orig_eax); @@ -696,8 +705,11 @@ int do_syscall_trace(struct pt_regs *regs, int entryexit) if (!(current->ptrace & PT_PTRACED)) goto out; - is_sysemu = test_thread_flag(TIF_SYSCALL_EMU); - is_singlestep = test_thread_flag(TIF_SINGLESTEP); + /* If a process stops on the 1st tracepoint with SYSCALL_TRACE + * and then is resumed with SYSEMU_SINGLESTEP, it will come in + * here. We have to check this and return */ + if (is_sysemu && entryexit) + return 0; /* Fake a debug trap */ if (is_singlestep) @@ -728,6 +740,7 @@ int do_syscall_trace(struct pt_regs *regs, int entryexit) if (ret == 0) return 0; + regs->orig_eax = -1; /* force skip of syscall restarting */ if (unlikely(current->audit_context)) audit_syscall_exit(current, AUDITSC_RESULT(regs->eax), regs->eax); return 1; -- cgit v1.1 From ab1c23c24471c760c573f4fb0dd78e166ddfd844 Mon Sep 17 00:00:00 2001 From: Bodo Stroesser Date: Sat, 3 Sep 2005 15:57:21 -0700 Subject: [PATCH] SYSEMU: fix sysaudit / singlestep interaction Paolo 'Blaisorblade' Giarrusso This is simply an adjustment for "Ptrace - i386: fix Syscall Audit interaction with singlestep" to work on top of SYSEMU patches, too. On this patch, I have some doubts: I wonder why we need to alter that way ptrace_disable(). I left the patch this way because it has been extensively tested, but I don't understand the reason. The current PTRACE_DETACH handling simply clears child->ptrace; actually this is not enough because entry.S just looks at the thread_flags; actually, do_syscall_trace checks current->ptrace but I don't think depending on that is good, at least for performance, so I think the clearing is done elsewhere. For instance, on PTRACE_CONT it's done, but doing PTRACE_DETACH without PTRACE_CONT is possible (and happens when gdb crashes and one kills it manually). Signed-off-by: Paolo 'Blaisorblade' Giarrusso CC: Roland McGrath Cc: Jeff Dike Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/ptrace.c | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/ptrace.c b/arch/i386/kernel/ptrace.c index 3196ba5..3409802 100644 --- a/arch/i386/kernel/ptrace.c +++ b/arch/i386/kernel/ptrace.c @@ -271,6 +271,8 @@ static void clear_singlestep(struct task_struct *child) void ptrace_disable(struct task_struct *child) { clear_singlestep(child); + clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); + clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); } /* @@ -693,14 +695,29 @@ __attribute__((regparm(3))) int do_syscall_trace(struct pt_regs *regs, int entryexit) { int is_sysemu = test_thread_flag(TIF_SYSCALL_EMU), ret = 0; - /* With TIF_SYSCALL_EMU set we want to ignore TIF_SINGLESTEP */ + /* With TIF_SYSCALL_EMU set we want to ignore TIF_SINGLESTEP for syscall + * interception. */ int is_singlestep = !is_sysemu && test_thread_flag(TIF_SINGLESTEP); /* do the secure computing check first */ secure_computing(regs->orig_eax); - if (unlikely(current->audit_context) && entryexit) - audit_syscall_exit(current, AUDITSC_RESULT(regs->eax), regs->eax); + if (unlikely(current->audit_context)) { + if (entryexit) + audit_syscall_exit(current, AUDITSC_RESULT(regs->eax), regs->eax); + /* Debug traps, when using PTRACE_SINGLESTEP, must be sent only + * on the syscall exit path. Normally, when TIF_SYSCALL_AUDIT is + * not used, entry.S will call us only on syscall exit, not + * entry; so when TIF_SYSCALL_AUDIT is used we must avoid + * calling send_sigtrap() on syscall entry. + * + * Note that when PTRACE_SYSEMU_SINGLESTEP is used, + * is_singlestep is false, despite his name, so we will still do + * the correct thing. + */ + else if (is_singlestep) + goto out; + } if (!(current->ptrace & PT_PTRACED)) goto out; -- cgit v1.1 From 640aa46e25922a00b805e6b0d0b5181ad9cf736a Mon Sep 17 00:00:00 2001 From: Paolo 'Blaisorblade' Giarrusso Date: Sat, 3 Sep 2005 15:57:22 -0700 Subject: [PATCH] uml: SYSEMU: slight cleanup and speedup As a follow-up to "UML Support - Ptrace: adds the host SYSEMU support, for UML and general usage" (i.e. uml-support-* in current mm). Avoid unconditionally jumping to work_pending and code copying, just reuse the already existing resume_userspace path. One interesting note, from Charles P. Wright, suggested that the API is improvable with no downsides for UML (except that it will have to support yet another host API, since dropping support for the current API, for UML, is not reasonable from users' point of view). Signed-off-by: Paolo 'Blaisorblade' Giarrusso CC: Charles P. Wright Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/entry.S | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S index 9a47723..abb9097 100644 --- a/arch/i386/kernel/entry.S +++ b/arch/i386/kernel/entry.S @@ -339,18 +339,12 @@ syscall_trace_entry: xorl %edx,%edx call do_syscall_trace cmpl $0, %eax - jne syscall_skip # ret != 0 -> running under PTRACE_SYSEMU, + jne resume_userspace # ret != 0 -> running under PTRACE_SYSEMU, # so must skip actual syscall movl ORIG_EAX(%esp), %eax cmpl $(nr_syscalls), %eax jnae syscall_call jmp syscall_exit -syscall_skip: - cli # make sure we don't miss an interrupt - # setting need_resched or sigpending - # between sampling and the iret - movl TI_flags(%ebp), %ecx - jmp work_pending # perform syscall exit tracing ALIGN -- cgit v1.1 From 54d5d42404e7705cf3804593189e963350d470e5 Mon Sep 17 00:00:00 2001 From: Ashok Raj Date: Tue, 6 Sep 2005 15:16:15 -0700 Subject: [PATCH] x86/x86_64: deferred handling of writes to /proc/irqxx/smp_affinity When handling writes to /proc/irq, current code is re-programming rte entries directly. This is not recommended and could potentially cause chipset's to lockup, or cause missing interrupts. CONFIG_IRQ_BALANCE does this correctly, where it re-programs only when the interrupt is pending. The same needs to be done for /proc/irq handling as well. Otherwise user space irq balancers are really not doing the right thing. - Changed pending_irq_balance_cpumask to pending_irq_migrate_cpumask for lack of a generic name. - added move_irq out of IRQ_BALANCE, and added this same to X86_64 - Added new proc handler for write, so we can do deferred write at irq handling time. - Display of /proc/irq/XX/smp_affinity used to display CPU_MASKALL, instead it now shows only active cpu masks, or exactly what was set. - Provided a common move_irq implementation, instead of duplicating when using generic irq framework. Tested on i386/x86_64 and ia64 with CONFIG_PCI_MSI turned on and off. Tested UP builds as well. MSI testing: tbd: I have cards, need to look for a x-over cable, although I did test an earlier version of this patch. Will test in a couple days. Signed-off-by: Ashok Raj Acked-by: Zwane Mwaikambo Grudgingly-acked-by: Andi Kleen Signed-off-by: Coywolf Qi Hunt Signed-off-by: Ashok Raj Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/Kconfig | 5 +++++ arch/i386/kernel/io_apic.c | 55 ++++++++++++++++++++++++---------------------- 2 files changed, 34 insertions(+), 26 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig index 3b3b017..4b7de3e 100644 --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -1318,6 +1318,11 @@ config GENERIC_IRQ_PROBE bool default y +config GENERIC_PENDING_IRQ + bool + depends on GENERIC_HARDIRQS && SMP + default y + config X86_SMP bool depends on SMP && !X86_VOYAGER diff --git a/arch/i386/kernel/io_apic.c b/arch/i386/kernel/io_apic.c index 6578f40..4a59404 100644 --- a/arch/i386/kernel/io_apic.c +++ b/arch/i386/kernel/io_apic.c @@ -33,6 +33,7 @@ #include #include #include + #include #include #include @@ -222,13 +223,21 @@ static void clear_IO_APIC (void) clear_IO_APIC_pin(apic, pin); } +#ifdef CONFIG_SMP static void set_ioapic_affinity_irq(unsigned int irq, cpumask_t cpumask) { unsigned long flags; int pin; struct irq_pin_list *entry = irq_2_pin + irq; unsigned int apicid_value; + cpumask_t tmp; + cpus_and(tmp, cpumask, cpu_online_map); + if (cpus_empty(tmp)) + tmp = TARGET_CPUS; + + cpus_and(cpumask, tmp, CPU_MASK_ALL); + apicid_value = cpu_mask_to_apicid(cpumask); /* Prepare to do the io_apic_write */ apicid_value = apicid_value << 24; @@ -242,6 +251,7 @@ static void set_ioapic_affinity_irq(unsigned int irq, cpumask_t cpumask) break; entry = irq_2_pin + entry->next; } + set_irq_info(irq, cpumask); spin_unlock_irqrestore(&ioapic_lock, flags); } @@ -259,7 +269,6 @@ static void set_ioapic_affinity_irq(unsigned int irq, cpumask_t cpumask) # define Dprintk(x...) # endif -cpumask_t __cacheline_aligned pending_irq_balance_cpumask[NR_IRQS]; #define IRQBALANCE_CHECK_ARCH -999 static int irqbalance_disabled = IRQBALANCE_CHECK_ARCH; @@ -328,12 +337,7 @@ static inline void balance_irq(int cpu, int irq) cpus_and(allowed_mask, cpu_online_map, irq_affinity[irq]); new_cpu = move(cpu, allowed_mask, now, 1); if (cpu != new_cpu) { - irq_desc_t *desc = irq_desc + irq; - unsigned long flags; - - spin_lock_irqsave(&desc->lock, flags); - pending_irq_balance_cpumask[irq] = cpumask_of_cpu(new_cpu); - spin_unlock_irqrestore(&desc->lock, flags); + set_pending_irq(irq, cpumask_of_cpu(new_cpu)); } } @@ -528,16 +532,12 @@ tryanotherirq: cpus_and(tmp, target_cpu_mask, allowed_mask); if (!cpus_empty(tmp)) { - irq_desc_t *desc = irq_desc + selected_irq; - unsigned long flags; Dprintk("irq = %d moved to cpu = %d\n", selected_irq, min_loaded); /* mark for change destination */ - spin_lock_irqsave(&desc->lock, flags); - pending_irq_balance_cpumask[selected_irq] = - cpumask_of_cpu(min_loaded); - spin_unlock_irqrestore(&desc->lock, flags); + set_pending_irq(selected_irq, cpumask_of_cpu(min_loaded)); + /* Since we made a change, come back sooner to * check for more variation. */ @@ -568,7 +568,8 @@ static int balanced_irq(void *unused) /* push everything to CPU 0 to give us a starting point. */ for (i = 0 ; i < NR_IRQS ; i++) { - pending_irq_balance_cpumask[i] = cpumask_of_cpu(0); + pending_irq_cpumask[i] = cpumask_of_cpu(0); + set_pending_irq(i, cpumask_of_cpu(0)); } for ( ; ; ) { @@ -647,20 +648,9 @@ int __init irqbalance_disable(char *str) __setup("noirqbalance", irqbalance_disable); -static inline void move_irq(int irq) -{ - /* note - we hold the desc->lock */ - if (unlikely(!cpus_empty(pending_irq_balance_cpumask[irq]))) { - set_ioapic_affinity_irq(irq, pending_irq_balance_cpumask[irq]); - cpus_clear(pending_irq_balance_cpumask[irq]); - } -} - late_initcall(balanced_irq_init); - -#else /* !CONFIG_IRQBALANCE */ -static inline void move_irq(int irq) { } #endif /* CONFIG_IRQBALANCE */ +#endif /* CONFIG_SMP */ #ifndef CONFIG_SMP void fastcall send_IPI_self(int vector) @@ -820,6 +810,7 @@ EXPORT_SYMBOL(IO_APIC_get_PCI_irq_vector); * we need to reprogram the ioredtbls to cater for the cpus which have come online * so mask in all cases should simply be TARGET_CPUS */ +#ifdef CONFIG_SMP void __init setup_ioapic_dest(void) { int pin, ioapic, irq, irq_entry; @@ -838,6 +829,7 @@ void __init setup_ioapic_dest(void) } } +#endif /* * EISA Edge/Level control register, ELCR @@ -1249,6 +1241,7 @@ static void __init setup_IO_APIC_irqs(void) spin_lock_irqsave(&ioapic_lock, flags); io_apic_write(apic, 0x11+2*pin, *(((int *)&entry)+1)); io_apic_write(apic, 0x10+2*pin, *(((int *)&entry)+0)); + set_native_irq_info(irq, TARGET_CPUS); spin_unlock_irqrestore(&ioapic_lock, flags); } } @@ -1944,6 +1937,7 @@ static void ack_edge_ioapic_vector(unsigned int vector) { int irq = vector_to_irq(vector); + move_irq(vector); ack_edge_ioapic_irq(irq); } @@ -1958,6 +1952,7 @@ static void end_level_ioapic_vector (unsigned int vector) { int irq = vector_to_irq(vector); + move_irq(vector); end_level_ioapic_irq(irq); } @@ -1975,14 +1970,17 @@ static void unmask_IO_APIC_vector (unsigned int vector) unmask_IO_APIC_irq(irq); } +#ifdef CONFIG_SMP static void set_ioapic_affinity_vector (unsigned int vector, cpumask_t cpu_mask) { int irq = vector_to_irq(vector); + set_native_irq_info(vector, cpu_mask); set_ioapic_affinity_irq(irq, cpu_mask); } #endif +#endif /* * Level and edge triggered IO-APIC interrupts need different handling, @@ -2000,7 +1998,9 @@ static struct hw_interrupt_type ioapic_edge_type = { .disable = disable_edge_ioapic, .ack = ack_edge_ioapic, .end = end_edge_ioapic, +#ifdef CONFIG_SMP .set_affinity = set_ioapic_affinity, +#endif }; static struct hw_interrupt_type ioapic_level_type = { @@ -2011,7 +2011,9 @@ static struct hw_interrupt_type ioapic_level_type = { .disable = disable_level_ioapic, .ack = mask_and_ack_level_ioapic, .end = end_level_ioapic, +#ifdef CONFIG_SMP .set_affinity = set_ioapic_affinity, +#endif }; static inline void init_IO_APIC_traps(void) @@ -2569,6 +2571,7 @@ int io_apic_set_pci_routing (int ioapic, int pin, int irq, int edge_level, int a spin_lock_irqsave(&ioapic_lock, flags); io_apic_write(ioapic, 0x11+2*pin, *(((int *)&entry)+1)); io_apic_write(ioapic, 0x10+2*pin, *(((int *)&entry)+0)); + set_native_irq_info(use_pci_vector() ? entry.vector : irq, TARGET_CPUS); spin_unlock_irqrestore(&ioapic_lock, flags); return 0; -- cgit v1.1 From a888cebe17e39476e5ca18c3a4bd96c6775070db Mon Sep 17 00:00:00 2001 From: Ashok Raj Date: Tue, 6 Sep 2005 15:16:19 -0700 Subject: [PATCH] x86_64: create sysfs entries for cpu only for present cpus Need to create sysfs only for cpus that are present. Without which we see NR_CPUS entries created when we have CONFIG_HOTPLUG and CONFIG_HOTPLUG_CPU enabled. Signed-off-by: Ashok Raj Acked-by: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/mach-default/topology.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/mach-default/topology.c b/arch/i386/mach-default/topology.c index 23395ff..b643140 100644 --- a/arch/i386/mach-default/topology.c +++ b/arch/i386/mach-default/topology.c @@ -76,7 +76,7 @@ static int __init topology_init(void) for_each_online_node(i) arch_register_node(i); - for_each_cpu(i) + for_each_present_cpu(i) arch_register_cpu(i); return 0; } @@ -87,7 +87,7 @@ static int __init topology_init(void) { int i; - for_each_cpu(i) + for_each_present_cpu(i) arch_register_cpu(i); return 0; } -- cgit v1.1 From 8446f1d391f3d27e6bf9c43d4cbcdac0ca720417 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Tue, 6 Sep 2005 15:16:27 -0700 Subject: [PATCH] detect soft lockups This patch adds a new kernel debug feature: CONFIG_DETECT_SOFTLOCKUP. When enabled then per-CPU watchdog threads are started, which try to run once per second. If they get delayed for more than 10 seconds then a callback from the timer interrupt detects this condition and prints out a warning message and a stack dump (once per lockup incident). The feature is otherwise non-intrusive, it doesnt try to unlock the box in any way, it only gets the debug info out, automatically, and on all CPUs affected by the lockup. Signed-off-by: Ingo Molnar Signed-off-by: Nishanth Aravamudan Signed-Off-By: Matthias Urlichs Signed-off-by: Richard Purdie Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/nmi.c | 5 +++++ arch/i386/kernel/time.c | 1 + 2 files changed, 6 insertions(+) (limited to 'arch/i386') diff --git a/arch/i386/kernel/nmi.c b/arch/i386/kernel/nmi.c index 8bbdbda..0178457 100644 --- a/arch/i386/kernel/nmi.c +++ b/arch/i386/kernel/nmi.c @@ -478,6 +478,11 @@ void touch_nmi_watchdog (void) */ for (i = 0; i < NR_CPUS; i++) alert_counter[i] = 0; + + /* + * Tickle the softlockup detector too: + */ + touch_softlockup_watchdog(); } extern void die_nmi(struct pt_regs *, const char *msg); diff --git a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c index 6f794a7..b0c5ee2 100644 --- a/arch/i386/kernel/time.c +++ b/arch/i386/kernel/time.c @@ -422,6 +422,7 @@ static int timer_resume(struct sys_device *dev) last_timer->resume(); cur_timer = last_timer; last_timer = NULL; + touch_softlockup_watchdog(); return 0; } -- cgit v1.1 From c3d8c1414573be8cf7c8fdc1e076935697c7f6af Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Tue, 6 Sep 2005 15:16:33 -0700 Subject: [PATCH] More __read_mostly variables Move some more frequently read variables that showed up during some of our performance tests as sometimes ending up in hot cachelines to the read_mostly section. Fix: Move the __read_mostly from before hpet_usec_quotient to follow the variable like the other uses of __read_mostly. Signed-off-by: Alok N Kataria Signed-off-by: Christoph Lameter Signed-off-by: Shai Fultheim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/setup.c | 2 +- arch/i386/kernel/timers/timer_hpet.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c index 294bcca..e29fd5a 100644 --- a/arch/i386/kernel/setup.c +++ b/arch/i386/kernel/setup.c @@ -82,7 +82,7 @@ EXPORT_SYMBOL(efi_enabled); /* cpu data as detected by the assembly code in head.S */ struct cpuinfo_x86 new_cpu_data __initdata = { 0, 0, 0, 0, -1, 1, 0, 0, -1 }; /* common cpu data for all cpus */ -struct cpuinfo_x86 boot_cpu_data = { 0, 0, 0, 0, -1, 1, 0, 0, -1 }; +struct cpuinfo_x86 boot_cpu_data __read_mostly = { 0, 0, 0, 0, -1, 1, 0, 0, -1 }; EXPORT_SYMBOL(boot_cpu_data); unsigned long mmu_cr4_features; diff --git a/arch/i386/kernel/timers/timer_hpet.c b/arch/i386/kernel/timers/timer_hpet.c index 001de97..6dbb29f 100644 --- a/arch/i386/kernel/timers/timer_hpet.c +++ b/arch/i386/kernel/timers/timer_hpet.c @@ -18,7 +18,7 @@ #include "mach_timer.h" #include -static unsigned long __read_mostly hpet_usec_quotient; /* convert hpet clks to usec */ +static unsigned long hpet_usec_quotient __read_mostly; /* convert hpet clks to usec */ static unsigned long tsc_hpet_quotient; /* convert tsc to hpet clks */ static unsigned long hpet_last; /* hpet counter value at last tick*/ static unsigned long last_tsc_low; /* lsb 32 bits of Time Stamp Counter */ -- cgit v1.1 From 19306059cd7fedaf96b4b0260a9a8a45e513c857 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 6 Sep 2005 15:16:35 -0700 Subject: [PATCH] NMI: Update NMI users of RCU to use new API Uses of RCU for dynamically changeable NMI handlers need to use the new rcu_dereference() and rcu_assign_pointer() facilities. This change makes it clear that these uses are safe from a memory-barrier viewpoint, but the main purpose is to document exactly what operations are being protected by RCU. This has been tested on x86 and x86-64, which are the only architectures affected by this change. Signed-off-by: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/traps.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c index 54629bb..029bf94 100644 --- a/arch/i386/kernel/traps.c +++ b/arch/i386/kernel/traps.c @@ -657,7 +657,7 @@ fastcall void do_nmi(struct pt_regs * regs, long error_code) ++nmi_count(cpu); - if (!nmi_callback(regs, cpu)) + if (!rcu_dereference(nmi_callback)(regs, cpu)) default_do_nmi(regs); nmi_exit(); @@ -665,7 +665,7 @@ fastcall void do_nmi(struct pt_regs * regs, long error_code) void set_nmi_callback(nmi_callback_t callback) { - nmi_callback = callback; + rcu_assign_pointer(nmi_callback, callback); } EXPORT_SYMBOL_GPL(set_nmi_callback); -- cgit v1.1 From f8eeaaf4180334a8e5c3582fe62a5f8176a8c124 Mon Sep 17 00:00:00 2001 From: "H. Peter Anvin" Date: Tue, 6 Sep 2005 15:17:24 -0700 Subject: [PATCH] Make the bzImage format self-terminating Signed-off-by: H. Peter Anvin Cc: Frank Sorenson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/boot/setup.S | 2 +- arch/i386/boot/tools/build.c | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/boot/setup.S b/arch/i386/boot/setup.S index 8cb420f..ca668d9 100644 --- a/arch/i386/boot/setup.S +++ b/arch/i386/boot/setup.S @@ -82,7 +82,7 @@ start: # This is the setup header, and it must start at %cs:2 (old 0x9020:2) .ascii "HdrS" # header signature - .word 0x0203 # header version number (>= 0x0105) + .word 0x0204 # header version number (>= 0x0105) # or else old loadlin-1.5 will fail) realmode_swtch: .word 0, 0 # default_switch, SETUPSEG start_sys_seg: .word SYSSEG diff --git a/arch/i386/boot/tools/build.c b/arch/i386/boot/tools/build.c index 6835f6d..0579841 100644 --- a/arch/i386/boot/tools/build.c +++ b/arch/i386/boot/tools/build.c @@ -177,7 +177,9 @@ int main(int argc, char ** argv) die("Output: seek failed"); buf[0] = (sys_size & 0xff); buf[1] = ((sys_size >> 8) & 0xff); - if (write(1, buf, 2) != 2) + buf[2] = ((sys_size >> 16) & 0xff); + buf[3] = ((sys_size >> 24) & 0xff); + if (write(1, buf, 4) != 4) die("Write of image length failed"); return 0; /* Everything is OK */ -- cgit v1.1 From 96d0821cacd095e25a39dfff5232a45b63ed18dd Mon Sep 17 00:00:00 2001 From: David Gibson Date: Tue, 6 Sep 2005 15:17:26 -0700 Subject: [PATCH] Fix function/macro name collision on i386 oprofile The i386 OProfile code has a function named nmi_exit(), which collides with the nmi_exit() macro in linux/hardirq.h. At the moment, we get away with it, because hardirq.h isn't included in the oprofile code. I hit this as a bug when working with a patch which (indirectly) adds a #include of hardirq.h to oprofile. Regardless, the name collision is probably not a good idea, so this patch fixes it, renaming the oprofile function to op_nmi_exit(). It also renames the nmi_init() and nmi_timer_init() functions similarly, for consistency. Signed-off-by: David Gibson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/oprofile/init.c | 12 ++++++------ arch/i386/oprofile/nmi_int.c | 4 ++-- arch/i386/oprofile/nmi_timer_int.c | 2 +- 3 files changed, 9 insertions(+), 9 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/oprofile/init.c b/arch/i386/oprofile/init.c index c90332d..5341d48 100644 --- a/arch/i386/oprofile/init.c +++ b/arch/i386/oprofile/init.c @@ -15,9 +15,9 @@ * with the NMI mode driver. */ -extern int nmi_init(struct oprofile_operations * ops); -extern int nmi_timer_init(struct oprofile_operations * ops); -extern void nmi_exit(void); +extern int op_nmi_init(struct oprofile_operations * ops); +extern int op_nmi_timer_init(struct oprofile_operations * ops); +extern void op_nmi_exit(void); extern void x86_backtrace(struct pt_regs * const regs, unsigned int depth); @@ -28,11 +28,11 @@ int __init oprofile_arch_init(struct oprofile_operations * ops) ret = -ENODEV; #ifdef CONFIG_X86_LOCAL_APIC - ret = nmi_init(ops); + ret = op_nmi_init(ops); #endif #ifdef CONFIG_X86_IO_APIC if (ret < 0) - ret = nmi_timer_init(ops); + ret = op_nmi_timer_init(ops); #endif ops->backtrace = x86_backtrace; @@ -43,6 +43,6 @@ int __init oprofile_arch_init(struct oprofile_operations * ops) void oprofile_arch_exit(void) { #ifdef CONFIG_X86_LOCAL_APIC - nmi_exit(); + op_nmi_exit(); #endif } diff --git a/arch/i386/oprofile/nmi_int.c b/arch/i386/oprofile/nmi_int.c index 255e470..0493e8b 100644 --- a/arch/i386/oprofile/nmi_int.c +++ b/arch/i386/oprofile/nmi_int.c @@ -355,7 +355,7 @@ static int __init ppro_init(char ** cpu_type) /* in order to get driverfs right */ static int using_nmi; -int __init nmi_init(struct oprofile_operations *ops) +int __init op_nmi_init(struct oprofile_operations *ops) { __u8 vendor = boot_cpu_data.x86_vendor; __u8 family = boot_cpu_data.x86; @@ -420,7 +420,7 @@ int __init nmi_init(struct oprofile_operations *ops) } -void nmi_exit(void) +void op_nmi_exit(void) { if (using_nmi) exit_driverfs(); diff --git a/arch/i386/oprofile/nmi_timer_int.c b/arch/i386/oprofile/nmi_timer_int.c index c58d0c1..ad93cdd 100644 --- a/arch/i386/oprofile/nmi_timer_int.c +++ b/arch/i386/oprofile/nmi_timer_int.c @@ -40,7 +40,7 @@ static void timer_stop(void) } -int __init nmi_timer_init(struct oprofile_operations * ops) +int __init op_nmi_timer_init(struct oprofile_operations * ops) { extern int nmi_active; -- cgit v1.1 From 7f4bde9a3486cd7e70bedd2aff35b38667d50173 Mon Sep 17 00:00:00 2001 From: Adrian Bunk Date: Tue, 6 Sep 2005 15:17:39 -0700 Subject: [PATCH] remove the second arg of do_timer_interrupt() The second arg of do_timer_interrupt() is not used in the functions, and all callers pass NULL. Signed-off-by: Adrian Bunk Cc: Paul Mundt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/time.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c index b0c5ee2..9b94d84 100644 --- a/arch/i386/kernel/time.c +++ b/arch/i386/kernel/time.c @@ -252,8 +252,7 @@ EXPORT_SYMBOL(profile_pc); * timer_interrupt() needs to keep up the real-time clock, * as well as call the "do_timer()" routine every clocktick */ -static inline void do_timer_interrupt(int irq, void *dev_id, - struct pt_regs *regs) +static inline void do_timer_interrupt(int irq, struct pt_regs *regs) { #ifdef CONFIG_X86_IO_APIC if (timer_ack) { @@ -307,7 +306,7 @@ irqreturn_t timer_interrupt(int irq, void *dev_id, struct pt_regs *regs) cur_timer->mark_offset(); - do_timer_interrupt(irq, NULL, regs); + do_timer_interrupt(irq, regs); write_sequnlock(&xtime_lock); return IRQ_HANDLED; -- cgit v1.1 From 6c231b7bab0aa6860cd9da2de8a064eddc34c146 Mon Sep 17 00:00:00 2001 From: Ravikiran G Thirumalai Date: Tue, 6 Sep 2005 15:17:45 -0700 Subject: [PATCH] Additions to .data.read_mostly section Mark variables which are usually accessed for reads with __readmostly. Signed-off-by: Alok N Kataria Signed-off-by: Shai Fultheim Signed-off-by: Ravikiran Thirumalai Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/io_apic.c | 10 +++++----- arch/i386/kernel/timers/timer_hpet.c | 2 +- arch/i386/mm/discontig.c | 8 ++++---- arch/i386/mm/init.c | 2 +- 4 files changed, 11 insertions(+), 11 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/io_apic.c b/arch/i386/kernel/io_apic.c index 4a59404..0e727e6 100644 --- a/arch/i386/kernel/io_apic.c +++ b/arch/i386/kernel/io_apic.c @@ -78,7 +78,7 @@ static struct irq_pin_list { int apic, pin, next; } irq_2_pin[PIN_MAP_SIZE]; -int vector_irq[NR_VECTORS] = { [0 ... NR_VECTORS - 1] = -1}; +int vector_irq[NR_VECTORS] __read_mostly = { [0 ... NR_VECTORS - 1] = -1}; #ifdef CONFIG_PCI_MSI #define vector_to_irq(vector) \ (platform_legacy_irq(vector) ? vector : vector_irq[vector]) @@ -1119,7 +1119,7 @@ static inline int IO_APIC_irq_trigger(int irq) } /* irq_vectors is indexed by the sum of all RTEs in all I/O APICs. */ -u8 irq_vector[NR_IRQ_VECTORS] = { FIRST_DEVICE_VECTOR , 0 }; +u8 irq_vector[NR_IRQ_VECTORS] __read_mostly = { FIRST_DEVICE_VECTOR , 0 }; int assign_irq_vector(int irq) { @@ -1990,7 +1990,7 @@ static void set_ioapic_affinity_vector (unsigned int vector, * edge-triggered handler, without risking IRQ storms and other ugly * races. */ -static struct hw_interrupt_type ioapic_edge_type = { +static struct hw_interrupt_type ioapic_edge_type __read_mostly = { .typename = "IO-APIC-edge", .startup = startup_edge_ioapic, .shutdown = shutdown_edge_ioapic, @@ -2003,7 +2003,7 @@ static struct hw_interrupt_type ioapic_edge_type = { #endif }; -static struct hw_interrupt_type ioapic_level_type = { +static struct hw_interrupt_type ioapic_level_type __read_mostly = { .typename = "IO-APIC-level", .startup = startup_level_ioapic, .shutdown = shutdown_level_ioapic, @@ -2076,7 +2076,7 @@ static void ack_lapic_irq (unsigned int irq) static void end_lapic_irq (unsigned int i) { /* nothing */ } -static struct hw_interrupt_type lapic_irq_type = { +static struct hw_interrupt_type lapic_irq_type __read_mostly = { .typename = "local-APIC-edge", .startup = NULL, /* startup_irq() not used for IRQ0 */ .shutdown = NULL, /* shutdown_irq() not used for IRQ0 */ diff --git a/arch/i386/kernel/timers/timer_hpet.c b/arch/i386/kernel/timers/timer_hpet.c index 6dbb29f..d973a8b 100644 --- a/arch/i386/kernel/timers/timer_hpet.c +++ b/arch/i386/kernel/timers/timer_hpet.c @@ -19,7 +19,7 @@ #include static unsigned long hpet_usec_quotient __read_mostly; /* convert hpet clks to usec */ -static unsigned long tsc_hpet_quotient; /* convert tsc to hpet clks */ +static unsigned long tsc_hpet_quotient __read_mostly; /* convert tsc to hpet clks */ static unsigned long hpet_last; /* hpet counter value at last tick*/ static unsigned long last_tsc_low; /* lsb 32 bits of Time Stamp Counter */ static unsigned long last_tsc_high; /* msb 32 bits of Time Stamp Counter */ diff --git a/arch/i386/mm/discontig.c b/arch/i386/mm/discontig.c index 6711ce3..244d8ec 100644 --- a/arch/i386/mm/discontig.c +++ b/arch/i386/mm/discontig.c @@ -37,7 +37,7 @@ #include #include -struct pglist_data *node_data[MAX_NUMNODES]; +struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; EXPORT_SYMBOL(node_data); bootmem_data_t node0_bdata; @@ -49,8 +49,8 @@ bootmem_data_t node0_bdata; * 2) node_start_pfn - the starting page frame number for a node * 3) node_end_pfn - the ending page fram number for a node */ -unsigned long node_start_pfn[MAX_NUMNODES]; -unsigned long node_end_pfn[MAX_NUMNODES]; +unsigned long node_start_pfn[MAX_NUMNODES] __read_mostly; +unsigned long node_end_pfn[MAX_NUMNODES] __read_mostly; #ifdef CONFIG_DISCONTIGMEM @@ -66,7 +66,7 @@ unsigned long node_end_pfn[MAX_NUMNODES]; * physnode_map[4-7] = 1; * physnode_map[8- ] = -1; */ -s8 physnode_map[MAX_ELEMENTS] = { [0 ... (MAX_ELEMENTS - 1)] = -1}; +s8 physnode_map[MAX_ELEMENTS] __read_mostly = { [0 ... (MAX_ELEMENTS - 1)] = -1}; EXPORT_SYMBOL(physnode_map); void memory_present(int nid, unsigned long start, unsigned long end) diff --git a/arch/i386/mm/init.c b/arch/i386/mm/init.c index 9edfc05..2ebaf75 100644 --- a/arch/i386/mm/init.c +++ b/arch/i386/mm/init.c @@ -393,7 +393,7 @@ void zap_low_mappings (void) } static int disable_nx __initdata = 0; -u64 __supported_pte_mask = ~_PAGE_NX; +u64 __supported_pte_mask __read_mostly = ~_PAGE_NX; /* * noexec = on|off -- cgit v1.1 From b149ee2233edf08fb59b11e879a2c5941929bcb8 Mon Sep 17 00:00:00 2001 From: john stultz Date: Tue, 6 Sep 2005 15:17:46 -0700 Subject: [PATCH] NTP: ntp-helper functions This patch cleans up a commonly repeated set of changes to the NTP state variables by adding two helper inline functions: ntp_clear(): Clears the ntp state variables ntp_synced(): Returns 1 if the system is synced with a time server. This was compile tested for alpha, arm, i386, x86-64, ppc64, s390, sparc, sparc64. Signed-off-by: John Stultz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/time.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c index 9b94d84..eefea7c 100644 --- a/arch/i386/kernel/time.c +++ b/arch/i386/kernel/time.c @@ -194,10 +194,7 @@ int do_settimeofday(struct timespec *tv) set_normalized_timespec(&xtime, sec, nsec); set_normalized_timespec(&wall_to_monotonic, wtm_sec, wtm_nsec); - time_adjust = 0; /* stop active adjtime() */ - time_status |= STA_UNSYNC; - time_maxerror = NTP_PHASE_LIMIT; - time_esterror = NTP_PHASE_LIMIT; + ntp_clear(); write_sequnlock_irq(&xtime_lock); clock_was_set(); return 0; @@ -347,7 +344,7 @@ static void sync_cmos_clock(unsigned long dummy) * This code is run on a timer. If the clock is set, that timer * may not expire at the correct time. Thus, we adjust... */ - if ((time_status & STA_UNSYNC) != 0) + if (!ntp_synced()) /* * Not synced, exit, do not restart a timer (if one is * running, let it run out). -- cgit v1.1 From 61e032fa2f659fada02ede5087b46963a1c7de34 Mon Sep 17 00:00:00 2001 From: Andrey Panin Date: Tue, 6 Sep 2005 15:18:26 -0700 Subject: [PATCH] dmi: remove uneeded function After elimination of central DMI blacklist dmi_scan_machine() function became a wrapper for dmi_iterate(). This patch moves some code around to kill unneeded function. Signed-off-by: Andrey Panin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/dmi_scan.c | 85 +++++++++++++++++++++------------------------ 1 file changed, 40 insertions(+), 45 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/dmi_scan.c b/arch/i386/kernel/dmi_scan.c index a3cdf89..6ae4ef4 100644 --- a/arch/i386/kernel/dmi_scan.c +++ b/arch/i386/kernel/dmi_scan.c @@ -84,49 +84,6 @@ static int __init dmi_checksum(u8 *buf) return sum == 0; } -static int __init dmi_iterate(void (*decode)(struct dmi_header *)) -{ - u8 buf[15]; - char __iomem *p, *q; - - /* - * no iounmap() for that ioremap(); it would be a no-op, but it's - * so early in setup that sucker gets confused into doing what - * it shouldn't if we actually call it. - */ - p = ioremap(0xF0000, 0x10000); - if (p == NULL) - return -1; - - for (q = p; q < p + 0x10000; q += 16) { - memcpy_fromio(buf, q, 15); - if ((memcmp(buf, "_DMI_", 5) == 0) && dmi_checksum(buf)) { - u16 num = (buf[13] << 8) | buf[12]; - u16 len = (buf[7] << 8) | buf[6]; - u32 base = (buf[11] << 24) | (buf[10] << 16) | - (buf[9] << 8) | buf[8]; - - /* - * DMI version 0.0 means that the real version is taken from - * the SMBIOS version, which we don't know at this point. - */ - if (buf[14] != 0) - printk(KERN_INFO "DMI %d.%d present.\n", - buf[14] >> 4, buf[14] & 0xF); - else - printk(KERN_INFO "DMI present.\n"); - - dmi_printk((KERN_INFO "%d structures occupying %d bytes.\n", - num, len)); - dmi_printk((KERN_INFO "DMI table at 0x%08X.\n", base)); - - if (dmi_table(base,len, num, decode) == 0) - return 0; - } - } - return -1; -} - static char *dmi_ident[DMI_STRING_MAX]; /* @@ -190,8 +147,46 @@ static void __init dmi_decode(struct dmi_header *dm) void __init dmi_scan_machine(void) { - if (dmi_iterate(dmi_decode)) - printk(KERN_INFO "DMI not present.\n"); + u8 buf[15]; + char __iomem *p, *q; + + /* + * no iounmap() for that ioremap(); it would be a no-op, but it's + * so early in setup that sucker gets confused into doing what + * it shouldn't if we actually call it. + */ + p = ioremap(0xF0000, 0x10000); + if (p == NULL) + goto out; + + for (q = p; q < p + 0x10000; q += 16) { + memcpy_fromio(buf, q, 15); + if ((memcmp(buf, "_DMI_", 5) == 0) && dmi_checksum(buf)) { + u16 num = (buf[13] << 8) | buf[12]; + u16 len = (buf[7] << 8) | buf[6]; + u32 base = (buf[11] << 24) | (buf[10] << 16) | + (buf[9] << 8) | buf[8]; + + /* + * DMI version 0.0 means that the real version is taken from + * the SMBIOS version, which we don't know at this point. + */ + if (buf[14] != 0) + printk(KERN_INFO "DMI %d.%d present.\n", + buf[14] >> 4, buf[14] & 0xF); + else + printk(KERN_INFO "DMI present.\n"); + + dmi_printk((KERN_INFO "%d structures occupying %d bytes.\n", + num, len)); + dmi_printk((KERN_INFO "DMI table at 0x%08X.\n", base)); + + if (dmi_table(base,len, num, dmi_decode) == 0) + return; + } + } + +out: printk(KERN_INFO "DMI not present.\n"); } -- cgit v1.1 From 4e70b9a3d68909ad7e79bf6e1b0dcec6de922a7c Mon Sep 17 00:00:00 2001 From: Andrey Panin Date: Tue, 6 Sep 2005 15:18:28 -0700 Subject: [PATCH] dmi: remove old debugging code DMI debugging code is unused for ages. This patch removes it. Signed-off-by: Andrey Panin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/dmi_scan.c | 21 --------------------- 1 file changed, 21 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/dmi_scan.c b/arch/i386/kernel/dmi_scan.c index 6ae4ef4..ecdaae2 100644 --- a/arch/i386/kernel/dmi_scan.c +++ b/arch/i386/kernel/dmi_scan.c @@ -12,13 +12,6 @@ struct dmi_header { u16 handle; }; -#undef DMI_DEBUG - -#ifdef DMI_DEBUG -#define dmi_printk(x) printk x -#else -#define dmi_printk(x) -#endif static char * __init dmi_string(struct dmi_header *dm, u8 s) { @@ -117,29 +110,19 @@ static void __init dmi_decode(struct dmi_header *dm) switch(dm->type) { case 0: - dmi_printk(("BIOS Vendor: %s\n", dmi_string(dm, data[4]))); dmi_save_ident(dm, DMI_BIOS_VENDOR, 4); - dmi_printk(("BIOS Version: %s\n", dmi_string(dm, data[5]))); dmi_save_ident(dm, DMI_BIOS_VERSION, 5); - dmi_printk(("BIOS Release: %s\n", dmi_string(dm, data[8]))); dmi_save_ident(dm, DMI_BIOS_DATE, 8); break; case 1: - dmi_printk(("System Vendor: %s\n", dmi_string(dm, data[4]))); dmi_save_ident(dm, DMI_SYS_VENDOR, 4); - dmi_printk(("Product Name: %s\n", dmi_string(dm, data[5]))); dmi_save_ident(dm, DMI_PRODUCT_NAME, 5); - dmi_printk(("Version: %s\n", dmi_string(dm, data[6]))); dmi_save_ident(dm, DMI_PRODUCT_VERSION, 6); - dmi_printk(("Serial Number: %s\n", dmi_string(dm, data[7]))); dmi_save_ident(dm, DMI_PRODUCT_SERIAL, 7); break; case 2: - dmi_printk(("Board Vendor: %s\n", dmi_string(dm, data[4]))); dmi_save_ident(dm, DMI_BOARD_VENDOR, 4); - dmi_printk(("Board Name: %s\n", dmi_string(dm, data[5]))); dmi_save_ident(dm, DMI_BOARD_NAME, 5); - dmi_printk(("Board Version: %s\n", dmi_string(dm, data[6]))); dmi_save_ident(dm, DMI_BOARD_VERSION, 6); break; } @@ -177,10 +160,6 @@ void __init dmi_scan_machine(void) else printk(KERN_INFO "DMI present.\n"); - dmi_printk((KERN_INFO "%d structures occupying %d bytes.\n", - num, len)); - dmi_printk((KERN_INFO "DMI table at 0x%08X.\n", base)); - if (dmi_table(base,len, num, dmi_decode) == 0) return; } -- cgit v1.1 From c3c7120d552989be94c9137989be5abb6da8954f Mon Sep 17 00:00:00 2001 From: Andrey Panin Date: Tue, 6 Sep 2005 15:18:28 -0700 Subject: [PATCH] dmi: make dmi_string() behave like strdup() This patch changes dmi_string() function to allocate string copy by itself, to avoid code duplication in the next patch. Signed-off-by: Andrey Panin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/dmi_scan.c | 39 +++++++++++++++++++++++---------------- 1 file changed, 23 insertions(+), 16 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/dmi_scan.c b/arch/i386/kernel/dmi_scan.c index ecdaae2..ae1a1ae 100644 --- a/arch/i386/kernel/dmi_scan.c +++ b/arch/i386/kernel/dmi_scan.c @@ -16,15 +16,25 @@ struct dmi_header { static char * __init dmi_string(struct dmi_header *dm, u8 s) { u8 *bp = ((u8 *) dm) + dm->length; + char *str = ""; - if (!s) - return ""; - s--; - while (s > 0 && *bp) { - bp += strlen(bp) + 1; + if (s) { s--; - } - return bp; + while (s > 0 && *bp) { + bp += strlen(bp) + 1; + s--; + } + + if (*bp != 0) { + str = alloc_bootmem(strlen(bp) + 1); + if (str != NULL) + strcpy(str, bp); + else + printk(KERN_ERR "dmi_string: out of memory.\n"); + } + } + + return str; } /* @@ -84,19 +94,16 @@ static char *dmi_ident[DMI_STRING_MAX]; */ static void __init dmi_save_ident(struct dmi_header *dm, int slot, int string) { - char *d = (char*)dm; - char *p = dmi_string(dm, d[string]); + char *p, *d = (char*) dm; - if (p == NULL || *p == 0) - return; if (dmi_ident[slot]) return; - dmi_ident[slot] = alloc_bootmem(strlen(p) + 1); - if(dmi_ident[slot]) - strcpy(dmi_ident[slot], p); - else - printk(KERN_ERR "dmi_save_ident: out of memory.\n"); + p = dmi_string(dm, d[string]); + if (p == NULL) + return; + + dmi_ident[slot] = p; } /* -- cgit v1.1 From ebad6a4230bdb5927495e28bc7837f515bf667a7 Mon Sep 17 00:00:00 2001 From: Andrey Panin Date: Tue, 6 Sep 2005 15:18:29 -0700 Subject: [PATCH] dmi: add onboard devices discovery This patch adds onboard devices and IPMI BMC discovery into DMI scan code. Drivers can use dmi_find_device() function to search for devices by type and name. Signed-off-by: Andrey Panin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/dmi_scan.c | 102 ++++++++++++++++++++++++++++++++++++++------ 1 file changed, 90 insertions(+), 12 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/dmi_scan.c b/arch/i386/kernel/dmi_scan.c index ae1a1ae..c4a7385 100644 --- a/arch/i386/kernel/dmi_scan.c +++ b/arch/i386/kernel/dmi_scan.c @@ -6,13 +6,6 @@ #include -struct dmi_header { - u8 type; - u8 length; - u16 handle; -}; - - static char * __init dmi_string(struct dmi_header *dm, u8 s) { u8 *bp = ((u8 *) dm) + dm->length; @@ -88,6 +81,7 @@ static int __init dmi_checksum(u8 *buf) } static char *dmi_ident[DMI_STRING_MAX]; +static LIST_HEAD(dmi_devices); /* * Save a DMI string @@ -106,6 +100,58 @@ static void __init dmi_save_ident(struct dmi_header *dm, int slot, int string) dmi_ident[slot] = p; } +static void __init dmi_save_devices(struct dmi_header *dm) +{ + int i, count = (dm->length - sizeof(struct dmi_header)) / 2; + struct dmi_device *dev; + + for (i = 0; i < count; i++) { + char *d = ((char *) dm) + (i * 2); + + /* Skip disabled device */ + if ((*d & 0x80) == 0) + continue; + + dev = alloc_bootmem(sizeof(*dev)); + if (!dev) { + printk(KERN_ERR "dmi_save_devices: out of memory.\n"); + break; + } + + dev->type = *d++ & 0x7f; + dev->name = dmi_string(dm, *d); + dev->device_data = NULL; + + list_add(&dev->list, &dmi_devices); + } +} + +static void __init dmi_save_ipmi_device(struct dmi_header *dm) +{ + struct dmi_device *dev; + void * data; + + data = alloc_bootmem(dm->length); + if (data == NULL) { + printk(KERN_ERR "dmi_save_ipmi_device: out of memory.\n"); + return; + } + + memcpy(data, dm, dm->length); + + dev = alloc_bootmem(sizeof(*dev)); + if (!dev) { + printk(KERN_ERR "dmi_save_ipmi_device: out of memory.\n"); + return; + } + + dev->type = DMI_DEV_TYPE_IPMI; + dev->name = "IPMI controller"; + dev->device_data = data; + + list_add(&dev->list, &dmi_devices); +} + /* * Process a DMI table entry. Right now all we care about are the BIOS * and machine entries. For 2.5 we should pull the smbus controller info @@ -113,25 +159,28 @@ static void __init dmi_save_ident(struct dmi_header *dm, int slot, int string) */ static void __init dmi_decode(struct dmi_header *dm) { - u8 *data __attribute__((__unused__)) = (u8 *)dm; - switch(dm->type) { - case 0: + case 0: /* BIOS Information */ dmi_save_ident(dm, DMI_BIOS_VENDOR, 4); dmi_save_ident(dm, DMI_BIOS_VERSION, 5); dmi_save_ident(dm, DMI_BIOS_DATE, 8); break; - case 1: + case 1: /* System Information */ dmi_save_ident(dm, DMI_SYS_VENDOR, 4); dmi_save_ident(dm, DMI_PRODUCT_NAME, 5); dmi_save_ident(dm, DMI_PRODUCT_VERSION, 6); dmi_save_ident(dm, DMI_PRODUCT_SERIAL, 7); break; - case 2: + case 2: /* Base Board Information */ dmi_save_ident(dm, DMI_BOARD_VENDOR, 4); dmi_save_ident(dm, DMI_BOARD_NAME, 5); dmi_save_ident(dm, DMI_BOARD_VERSION, 6); break; + case 10: /* Onboard Devices Information */ + dmi_save_devices(dm); + break; + case 38: /* IPMI Device Information */ + dmi_save_ipmi_device(dm); } } @@ -221,3 +270,32 @@ char *dmi_get_system_info(int field) return dmi_ident[field]; } EXPORT_SYMBOL(dmi_get_system_info); + +/** + * dmi_find_device - find onboard device by type/name + * @type: device type or %DMI_DEV_TYPE_ANY to match all device types + * @desc: device name string or %NULL to match all + * @from: previous device found in search, or %NULL for new search. + * + * Iterates through the list of known onboard devices. If a device is + * found with a matching @vendor and @device, a pointer to its device + * structure is returned. Otherwise, %NULL is returned. + * A new search is initiated by passing %NULL to the @from argument. + * If @from is not %NULL, searches continue from next device. + */ +struct dmi_device * dmi_find_device(int type, const char *name, + struct dmi_device *from) +{ + struct list_head *d, *head = from ? &from->list : &dmi_devices; + + for(d = head->next; d != &dmi_devices; d = d->next) { + struct dmi_device *dev = list_entry(d, struct dmi_device, list); + + if (((type == DMI_DEV_TYPE_ANY) || (dev->type == type)) && + ((name == NULL) || (strcmp(dev->name, name) == 0))) + return dev; + } + + return NULL; +} +EXPORT_SYMBOL(dmi_find_device); -- cgit v1.1 From 640e803376b9c4072f69fec42e304c974a631298 Mon Sep 17 00:00:00 2001 From: Robert Love Date: Tue, 6 Sep 2005 15:18:30 -0700 Subject: [PATCH] fix: dmi_check_system Background: 1) dmi_check_system() returns the count of the number of matches. Zero thus means no matches. 2) A match callback can return nonzero to stop the match checking. Bug: The count is incremented after we check for the nonzero return value, so it does not reflect the actual count. We could say this is intended, for some dumb reason, except that it means that a match on the first check returns zero--no matches--if the callback returns nonzero. Attached patch implements the count before calling the callback and thus before potentially short-circuiting. Signed-off-by: Robert Love Cc: Andrey Panin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/dmi_scan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/dmi_scan.c b/arch/i386/kernel/dmi_scan.c index c4a7385..58516e2 100644 --- a/arch/i386/kernel/dmi_scan.c +++ b/arch/i386/kernel/dmi_scan.c @@ -248,9 +248,9 @@ int dmi_check_system(struct dmi_system_id *list) /* No match */ goto fail; } + count++; if (d->callback && d->callback(d)) break; - count++; fail: d++; } -- cgit v1.1 From 3d97ae5b958855ac007b6f56a0f94ab8ade09e9e Mon Sep 17 00:00:00 2001 From: Prasanna S Panchamukhi Date: Tue, 6 Sep 2005 15:19:27 -0700 Subject: [PATCH] kprobes: prevent possible race conditions i386 changes This patch contains the i386 architecture specific changes to prevent the possible race conditions. Signed-off-by: Prasanna S Panchamukhi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/entry.S | 13 ++++++++----- arch/i386/kernel/kprobes.c | 29 +++++++++++++++-------------- arch/i386/kernel/traps.c | 12 +++++++----- arch/i386/kernel/vmlinux.lds.S | 1 + arch/i386/mm/fault.c | 4 +++- 5 files changed, 34 insertions(+), 25 deletions(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S index abb9097..3aad038 100644 --- a/arch/i386/kernel/entry.S +++ b/arch/i386/kernel/entry.S @@ -507,7 +507,7 @@ label: \ pushl $__KERNEL_CS; \ pushl $sysenter_past_esp -ENTRY(debug) +KPROBE_ENTRY(debug) cmpl $sysenter_entry,(%esp) jne debug_stack_correct FIX_STACK(12, debug_stack_correct, debug_esp_fix_insn) @@ -518,7 +518,7 @@ debug_stack_correct: movl %esp,%eax # pt_regs pointer call do_debug jmp ret_from_exception - + .previous .text /* * NMI is doubly nasty. It can happen _while_ we're handling * a debug fault, and the debug fault hasn't yet been able to @@ -591,13 +591,14 @@ nmi_16bit_stack: .long 1b,iret_exc .previous -ENTRY(int3) +KPROBE_ENTRY(int3) pushl $-1 # mark this as an int SAVE_ALL xorl %edx,%edx # zero error code movl %esp,%eax # pt_regs pointer call do_int3 jmp ret_from_exception + .previous .text ENTRY(overflow) pushl $0 @@ -631,17 +632,19 @@ ENTRY(stack_segment) pushl $do_stack_segment jmp error_code -ENTRY(general_protection) +KPROBE_ENTRY(general_protection) pushl $do_general_protection jmp error_code + .previous .text ENTRY(alignment_check) pushl $do_alignment_check jmp error_code -ENTRY(page_fault) +KPROBE_ENTRY(page_fault) pushl $do_page_fault jmp error_code + .previous .text #ifdef CONFIG_X86_MCE ENTRY(machine_check) diff --git a/arch/i386/kernel/kprobes.c b/arch/i386/kernel/kprobes.c index a6d8c45..7fb5a6f 100644 --- a/arch/i386/kernel/kprobes.c +++ b/arch/i386/kernel/kprobes.c @@ -62,32 +62,32 @@ static inline int is_IF_modifier(kprobe_opcode_t opcode) return 0; } -int arch_prepare_kprobe(struct kprobe *p) +int __kprobes arch_prepare_kprobe(struct kprobe *p) { return 0; } -void arch_copy_kprobe(struct kprobe *p) +void __kprobes arch_copy_kprobe(struct kprobe *p) { memcpy(p->ainsn.insn, p->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t)); p->opcode = *p->addr; } -void arch_arm_kprobe(struct kprobe *p) +void __kprobes arch_arm_kprobe(struct kprobe *p) { *p->addr = BREAKPOINT_INSTRUCTION; flush_icache_range((unsigned long) p->addr, (unsigned long) p->addr + sizeof(kprobe_opcode_t)); } -void arch_disarm_kprobe(struct kprobe *p) +void __kprobes arch_disarm_kprobe(struct kprobe *p) { *p->addr = p->opcode; flush_icache_range((unsigned long) p->addr, (unsigned long) p->addr + sizeof(kprobe_opcode_t)); } -void arch_remove_kprobe(struct kprobe *p) +void __kprobes arch_remove_kprobe(struct kprobe *p) { } @@ -127,7 +127,8 @@ static inline void prepare_singlestep(struct kprobe *p, struct pt_regs *regs) regs->eip = (unsigned long)&p->ainsn.insn; } -void arch_prepare_kretprobe(struct kretprobe *rp, struct pt_regs *regs) +void __kprobes arch_prepare_kretprobe(struct kretprobe *rp, + struct pt_regs *regs) { unsigned long *sara = (unsigned long *)®s->esp; struct kretprobe_instance *ri; @@ -150,7 +151,7 @@ void arch_prepare_kretprobe(struct kretprobe *rp, struct pt_regs *regs) * Interrupts are disabled on entry as trap3 is an interrupt gate and they * remain disabled thorough out this function. */ -static int kprobe_handler(struct pt_regs *regs) +static int __kprobes kprobe_handler(struct pt_regs *regs) { struct kprobe *p; int ret = 0; @@ -259,7 +260,7 @@ no_kprobe: /* * Called when we hit the probe point at kretprobe_trampoline */ -int trampoline_probe_handler(struct kprobe *p, struct pt_regs *regs) +int __kprobes trampoline_probe_handler(struct kprobe *p, struct pt_regs *regs) { struct kretprobe_instance *ri = NULL; struct hlist_head *head; @@ -338,7 +339,7 @@ int trampoline_probe_handler(struct kprobe *p, struct pt_regs *regs) * that is atop the stack is the address following the copied instruction. * We need to make it the address following the original instruction. */ -static void resume_execution(struct kprobe *p, struct pt_regs *regs) +static void __kprobes resume_execution(struct kprobe *p, struct pt_regs *regs) { unsigned long *tos = (unsigned long *)®s->esp; unsigned long next_eip = 0; @@ -444,8 +445,8 @@ static inline int kprobe_fault_handler(struct pt_regs *regs, int trapnr) /* * Wrapper routine to for handling exceptions. */ -int kprobe_exceptions_notify(struct notifier_block *self, unsigned long val, - void *data) +int __kprobes kprobe_exceptions_notify(struct notifier_block *self, + unsigned long val, void *data) { struct die_args *args = (struct die_args *)data; switch (val) { @@ -473,7 +474,7 @@ int kprobe_exceptions_notify(struct notifier_block *self, unsigned long val, return NOTIFY_DONE; } -int setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs) +int __kprobes setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs) { struct jprobe *jp = container_of(p, struct jprobe, kp); unsigned long addr; @@ -495,7 +496,7 @@ int setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs) return 1; } -void jprobe_return(void) +void __kprobes jprobe_return(void) { preempt_enable_no_resched(); asm volatile (" xchgl %%ebx,%%esp \n" @@ -506,7 +507,7 @@ void jprobe_return(void) (jprobe_saved_esp):"memory"); } -int longjmp_break_handler(struct kprobe *p, struct pt_regs *regs) +int __kprobes longjmp_break_handler(struct kprobe *p, struct pt_regs *regs) { u8 *addr = (u8 *) (regs->eip - 1); unsigned long stack_addr = (unsigned long)jprobe_saved_esp; diff --git a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c index 029bf94..09a58cb 100644 --- a/arch/i386/kernel/traps.c +++ b/arch/i386/kernel/traps.c @@ -363,8 +363,9 @@ static inline void die_if_kernel(const char * str, struct pt_regs * regs, long e die(str, regs, err); } -static void do_trap(int trapnr, int signr, char *str, int vm86, - struct pt_regs * regs, long error_code, siginfo_t *info) +static void __kprobes do_trap(int trapnr, int signr, char *str, int vm86, + struct pt_regs * regs, long error_code, + siginfo_t *info) { struct task_struct *tsk = current; tsk->thread.error_code = error_code; @@ -460,7 +461,8 @@ DO_ERROR(12, SIGBUS, "stack segment", stack_segment) DO_ERROR_INFO(17, SIGBUS, "alignment check", alignment_check, BUS_ADRALN, 0) DO_ERROR_INFO(32, SIGSEGV, "iret exception", iret_error, ILL_BADSTK, 0) -fastcall void do_general_protection(struct pt_regs * regs, long error_code) +fastcall void __kprobes do_general_protection(struct pt_regs * regs, + long error_code) { int cpu = get_cpu(); struct tss_struct *tss = &per_cpu(init_tss, cpu); @@ -676,7 +678,7 @@ void unset_nmi_callback(void) EXPORT_SYMBOL_GPL(unset_nmi_callback); #ifdef CONFIG_KPROBES -fastcall void do_int3(struct pt_regs *regs, long error_code) +fastcall void __kprobes do_int3(struct pt_regs *regs, long error_code) { if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP) == NOTIFY_STOP) @@ -710,7 +712,7 @@ fastcall void do_int3(struct pt_regs *regs, long error_code) * find every occurrence of the TF bit that could be saved away even * by user code) */ -fastcall void do_debug(struct pt_regs * regs, long error_code) +fastcall void __kprobes do_debug(struct pt_regs * regs, long error_code) { unsigned int condition; struct task_struct *tsk = current; diff --git a/arch/i386/kernel/vmlinux.lds.S b/arch/i386/kernel/vmlinux.lds.S index 761972f..13b9c62 100644 --- a/arch/i386/kernel/vmlinux.lds.S +++ b/arch/i386/kernel/vmlinux.lds.S @@ -22,6 +22,7 @@ SECTIONS *(.text) SCHED_TEXT LOCK_TEXT + KPROBES_TEXT *(.fixup) *(.gnu.warning) } = 0x9090 diff --git a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c index 411b850..9edd448 100644 --- a/arch/i386/mm/fault.c +++ b/arch/i386/mm/fault.c @@ -21,6 +21,7 @@ #include /* For unblank_screen() */ #include #include +#include #include #include @@ -223,7 +224,8 @@ fastcall void do_invalid_op(struct pt_regs *, unsigned long); * bit 1 == 0 means read, 1 means write * bit 2 == 0 means kernel, 1 means user-mode */ -fastcall void do_page_fault(struct pt_regs *regs, unsigned long error_code) +fastcall void __kprobes do_page_fault(struct pt_regs *regs, + unsigned long error_code) { struct task_struct *tsk; struct mm_struct *mm; -- cgit v1.1 From bce0649417d6e71f6df8ab7b11103d247913b142 Mon Sep 17 00:00:00 2001 From: Jim Keniston Date: Tue, 6 Sep 2005 15:19:34 -0700 Subject: [PATCH] kprobes: fix handling of simultaneous probe hit/unregister This patch fixes a bug in kprobes's handling of a corner case on i386 and x86_64. On an SMP system, if one CPU unregisters a kprobe just after another CPU hits that probepoint, kprobe_handler() on the latter CPU sees that the kprobe has been unregistered, and attempts to let the CPU continue as if the probepoint hadn't been hit. The bug is that on i386 and x86_64, we were neglecting to set the IP back to the beginning of the probed instruction. This could cause an oops or crash. This bug doesn't exist on ppc64 and ia64, where a breakpoint instruction leaves the IP pointing to the beginning of the instruction. I don't know about sparc64. (Dave, could you please advise?) This fix has been tested on i386 and x86_64 SMP systems. To reproduce the problem, set one CPU to work registering and unregistering a kprobe repeatedly, and another CPU pounding the probepoint in a tight loop. Acked-by: Prasanna S Panchamukhi Signed-off-by: Jim Keniston Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/kprobes.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'arch/i386') diff --git a/arch/i386/kernel/kprobes.c b/arch/i386/kernel/kprobes.c index 7fb5a6f..e5cec32 100644 --- a/arch/i386/kernel/kprobes.c +++ b/arch/i386/kernel/kprobes.c @@ -221,7 +221,10 @@ static int __kprobes kprobe_handler(struct pt_regs *regs) * either a probepoint or a debugger breakpoint * at this address. In either case, no further * handling of this interrupt is appropriate. + * Back up over the (now missing) int3 and run + * the original instruction. */ + regs->eip -= sizeof(kprobe_opcode_t); ret = 1; } /* Not one of ours: let kernel handle it */ -- cgit v1.1 From deac66ae454cacf942c051b86d9232af546fb187 Mon Sep 17 00:00:00 2001 From: Keshavamurthy Anil S Date: Tue, 6 Sep 2005 15:19:35 -0700 Subject: [PATCH] kprobes: fix bug when probed on task and isr functions This patch fixes a race condition where in system used to hang or sometime crash within minutes when kprobes are inserted on ISR routine and a task routine. The fix has been stress tested on i386, ia64, pp64 and on x86_64. To reproduce the problem insert kprobes on schedule() and do_IRQ() functions and you should see hang or system crash. Signed-off-by: Anil S Keshavamurthy Signed-off-by: Ananth N Mavinakayanahalli Acked-by: Prasanna S Panchamukhi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/i386/kernel/kprobes.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'arch/i386') diff --git a/arch/i386/kernel/kprobes.c b/arch/i386/kernel/kprobes.c index e5cec32..6345b43 100644 --- a/arch/i386/kernel/kprobes.c +++ b/arch/i386/kernel/kprobes.c @@ -177,7 +177,8 @@ static int __kprobes kprobe_handler(struct pt_regs *regs) Disarm the probe we just hit, and ignore it. */ p = get_kprobe(addr); if (p) { - if (kprobe_status == KPROBE_HIT_SS) { + if (kprobe_status == KPROBE_HIT_SS && + *p->ainsn.insn == BREAKPOINT_INSTRUCTION) { regs->eflags &= ~TF_MASK; regs->eflags |= kprobe_saved_eflags; unlock_kprobes(); -- cgit v1.1 From a08b6b7968e7a6afc75e365ac31830867275abdc Mon Sep 17 00:00:00 2001 From: "viro@ZenIV.linux.org.uk" Date: Tue, 6 Sep 2005 01:48:42 +0100 Subject: [PATCH] Kconfig fix (BLK_DEV_FD dependencies) Sanitized and fixed floppy dependencies: split the messy dependencies for BLK_DEV_FD by introducing a new symbol (ARCH_MAY_HAVE_PC_FDC), making BLK_DEV_FD depend on that one and taking declarations of ARCH_MAY_HAVE_PC_FDC to arch/*/Kconfig. While we are at it, fixed several obvious cases when BLK_DEV_FD should have been excluded (architectures lacking asm/floppy.h are *not* going to have floppy.c compile, let alone work). If you can come up with better name for that ("this architecture might have working PC-compatible floppy disk controller"), you are more than welcome - just s/ARCH_MAY_HAVE_PC_FDC/your_prefered_name/g in the patch below... Signed-off-by: Al Viro Signed-off-by: Linus Torvalds --- arch/i386/Kconfig | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'arch/i386') diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig index 4b7de3e..5d51b38 100644 --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -37,6 +37,10 @@ config GENERIC_IOMAP bool default y +config ARCH_MAY_HAVE_PC_FDC + bool + default y + source "init/Kconfig" menu "Processor type and features" -- cgit v1.1