Merge Upstream's stable 3.0.21 branch into msm-3.0
This consists 814 commits and some merge conflicts.
The merge conflicts are because of some local changes to
msm-3.0 as well as some conflicts between google's tree and
the upstream tree.
Conflicts:
arch/arm/kernel/head.S
drivers/bluetooth/ath3k.c
drivers/bluetooth/btusb.c
drivers/mmc/core/core.c
drivers/tty/serial/serial_core.c
drivers/usb/host/ehci-hub.c
drivers/usb/serial/qcserial.c
fs/namespace.c
fs/proc/base.c
Change-Id: I62e2edbe213f84915e27f8cd6e4f6ce23db22a21
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
commit b9980cdcf2524c5fe15d8cbae9c97b3ed6385563 upstream.
Fix CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_SMP=n CONFIG_DEBUG_VM=y
CONFIG_DEBUG_SPINLOCK=n kernel: spin_is_locked() is then always false,
and so triggers some BUGs in Transparent HugePage codepaths.
asm-generic/bug.h mentions this problem, and provides a WARN_ON_SMP(x);
but being too lazy to add VM_BUG_ON_SMP, BUG_ON_SMP, WARN_ON_SMP_ONCE,
VM_WARN_ON_SMP_ONCE, just test NR_CPUS != 1 in the existing VM_BUG_ONs.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dc9086004b3d5db75997a645b3fe08d9138b7ad0 upstream.
When isolating pages for migration, migration starts at the start of a
zone while the free scanner starts at the end of the zone. Migration
avoids entering a new zone by never going beyond the free scanned.
Unfortunately, in very rare cases nodes can overlap. When this happens,
migration isolates pages without the LRU lock held, corrupting lists
which will trigger errors in reclaim or during page free such as in the
following oops
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810f795c>] free_pcppages_bulk+0xcc/0x450
PGD 1dda554067 PUD 1e1cb58067 PMD 0
Oops: 0000 [#1] SMP
CPU 37
Pid: 17088, comm: memcg_process_s Tainted: G X
RIP: free_pcppages_bulk+0xcc/0x450
Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
Call Trace:
free_hot_cold_page+0x17e/0x1f0
__pagevec_free+0x90/0xb0
release_pages+0x22a/0x260
pagevec_lru_move_fn+0xf3/0x110
putback_lru_page+0x66/0xe0
unmap_and_move+0x156/0x180
migrate_pages+0x9e/0x1b0
compact_zone+0x1f3/0x2f0
compact_zone_order+0xa2/0xe0
try_to_compact_pages+0xdf/0x110
__alloc_pages_direct_compact+0xee/0x1c0
__alloc_pages_slowpath+0x370/0x830
__alloc_pages_nodemask+0x1b1/0x1c0
alloc_pages_vma+0x9b/0x160
do_huge_pmd_anonymous_page+0x160/0x270
do_page_fault+0x207/0x4c0
page_fault+0x25/0x30
The "X" in the taint flag means that external modules were loaded but but
is unrelated to the bug triggering. The real problem was because the PFN
layout looks like this
Zone PFN ranges:
DMA 0x00000010 -> 0x00001000
DMA32 0x00001000 -> 0x00100000
Normal 0x00100000 -> 0x01e80000
Movable zone start PFN for each node
early_node_map[14] active PFN ranges
0: 0x00000010 -> 0x0000009b
0: 0x00000100 -> 0x0007a1ec
0: 0x0007a354 -> 0x0007a379
0: 0x0007f7ff -> 0x0007f800
0: 0x00100000 -> 0x00680000
1: 0x00680000 -> 0x00e80000
0: 0x00e80000 -> 0x01080000
1: 0x01080000 -> 0x01280000
0: 0x01280000 -> 0x01480000
1: 0x01480000 -> 0x01680000
0: 0x01680000 -> 0x01880000
1: 0x01880000 -> 0x01a80000
0: 0x01a80000 -> 0x01c80000
1: 0x01c80000 -> 0x01e80000
The fix is straight-forward. isolate_migratepages() has to make a
similar check to isolate_freepage to ensure that it never isolates pages
from a zone it does not hold the LRU lock for.
This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
and current mainline.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0bf380bc70ecba68cb4d74dc656cc2fa8c4d801a upstream.
When isolating for migration, migration starts at the start of a zone
which is not necessarily pageblock aligned. Further, it stops isolating
when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
not aligned. This allows isolate_migratepages() to call pfn_to_page() on
an invalid PFN which can result in a crash. This was originally reported
against a 3.0-based kernel with the following trace in a crash dump.
PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s"
#0 [d72d3ad0] crash_kexec at c028cfdb
#1 [d72d3b24] oops_end at c05c5322
#2 [d72d3b38] __bad_area_nosemaphore at c0227e60
#3 [d72d3bec] bad_area at c0227fb6
#4 [d72d3c00] do_page_fault at c05c72ec
#5 [d72d3c80] error_code (via page_fault) at c05c47a4
EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000
DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50
CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002
#6 [d72d3cb4] isolate_migratepages at c030b15a
#7 [d72d3d14] zone_watermark_ok at c02d26cb
#8 [d72d3d2c] compact_zone at c030b8de
#9 [d72d3d68] compact_zone_order at c030bba1
#10 [d72d3db4] try_to_compact_pages at c030bc84
#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
#14 [d72d3eb8] alloc_pages_vma at c030a845
#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
#16 [d72d3f00] handle_mm_fault at c02f36c6
#17 [d72d3f30] do_page_fault at c05c70ed
#18 [d72d3fb0] error_code (via page_fault) at c05c47a4
EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431
DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788
SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50
CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202
It was also reported by Herbert van den Bergh against 3.1-based kernel
with the following snippet from the console log.
BUG: unable to handle kernel paging request at 01c00008
IP: [<c0522399>] isolate_migratepages+0x119/0x390
*pdpt = 000000002f7ce001 *pde = 0000000000000000
It is expected that it also affects 3.2.x and current mainline.
The problem is that pfn_valid is only called on the first PFN being
checked and that PFN is not necessarily aligned. Lets say we have a case
like this
H = MAX_ORDER_NR_PAGES boundary
| = pageblock boundary
m = cc->migrate_pfn
f = cc->free_pfn
o = memory hole
H------|------H------|----m-Hoooooo|ooooooH-f----|------H
The migrate_pfn is just below a memory hole and the free scanner is beyond
the hole. When isolate_migratepages started, it scans from migrate_pfn to
migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks
pfn_valid() on the first PFN but then scans into the hole where there are
not necessarily valid struct pages.
This patch ensures that isolate_migratepages calls pfn_valid when
necessary.
Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 99f02ef1f18631eb0a4e0ea0a3d56878dbcb4b90 upstream.
Fix a race condition that shows in conjunction with xip_file_fault() when
two threads of the same user process fault on the same memory page.
In this case, the race winner will install the page table entry and the
unlucky loser will cause an oops: xip_file_fault calls vm_insert_pfn (via
vm_insert_mixed) which drops out at this check:
retval = -EBUSY;
if (!pte_none(*pte))
goto out_unlock;
The resulting -EBUSY return value will trigger a BUG_ON() in
xip_file_fault.
This fix simply considers the fault as fixed in this case, because the
race winner has successfully installed the pte.
[akpm@linux-foundation.org: use conventional (and consistent) comment layout]
Reported-by: David Sadler <dsadler@us.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reported-by: Louis Alex Eisner <leisner@cs.ucsd.edu>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3deaa7190a8da38453c4fabd9dec7f66d17fff67 upstream.
Herbert Poetzl reported a performance regression since 2.6.39. The test
is a simple dd read, but with big block size. The reason is:
T1: ra (A, A+128k), (A+128k, A+256k)
T2: lock_page for page A, submit the 256k
T3: hit page A+128K, ra (A+256k, A+384). the range isn't submitted
because of plug and there isn't any lock_page till we hit page A+256k
because all pages from A to A+256k is in memory
T4: hit page A+256k, ra (A+384, A+ 512). Because of plug, the range isn't
submitted again.
T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is
waitting for (A+256k, A+512k) finish.
There is no request to disk in T3 and T4, so readahead pipeline breaks.
We really don't need block plug for generic_file_aio_read() for buffered
I/O. The readahead already has plug and has fine grained control when I/O
should be submitted. Deleting plug for buffered I/O fixes the regression.
One side effect is plug makes the request size 256k, the size is 128k
without it. This is because default ra size is 128k and not a reason we
need plug here.
Vivek said:
: We submit some readahead IO to device request queue but because of nested
: plug, queue never gets unplugged. When read logic reaches a page which is
: not in page cache, it waits for page to be read from the disk
: (lock_page_killable()) and that time we flush the plug list.
:
: So effectively read ahead logic is kind of broken in parts because of
: nested plugging. Removing top level plug (generic_file_aio_read()) for
: buffered reads, will allow unplugging queue earlier for readahead.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reported-by: Herbert Poetzl <herbert@13thfloor.at>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 687875fb7de4a95223af20ee024282fa9099f860 upstream.
Fix the following NULL ptr dereference caused by
cat /sys/devices/system/memory/memory0/removable
Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
RIP: __count_immobile_pages+0x4/0x100
Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
Call Trace:
is_pageblock_removable_nolock+0x34/0x40
is_mem_section_removable+0x74/0xf0
show_mem_removable+0x41/0x70
sysfs_read_file+0xfe/0x1c0
vfs_read+0xc7/0x130
sys_read+0x53/0xa0
system_call_fastpath+0x16/0x1b
We are crashing because we are trying to dereference NULL zone which
came from pfn=0 (struct page ffffea0000000000). According to the boot
log this page is marked reserved:
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
and early_node_map confirms that:
early_node_map[3] active PFN ranges
1: 0x00000010 -> 0x0000009c
1: 0x00000100 -> 0x000bffa3
1: 0x00100000 -> 0x00240000
The problem is that memory_present works in PAGE_SECTION_MASK aligned
blocks so the reserved range sneaks into the the section as well. This
also means that free_area_init_node will not take care of those reserved
pages and they stay uninitialized.
When we try to read the removable status we walk through all available
sections and hope that the zone is valid for all pages in the section.
But this is not true in this case as the zone and nid are not initialized.
We have only one node in this particular case and it is marked as node=1
(rather than 0) and that made the problem visible because page_to_nid will
return 0 and there are no zones on the node.
Let's check that the zone is valid and that the given pfn falls into its
boundaries and mark the section not removable. This might cause some
false positives, probably, but we do not have any sane way to find out
whether the page is reserved by the platform or it is just not used for
whatever other reasons.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit ab936cbcd02072a34b60d268f94440fd5cf1970b upstream.
Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
function replace_page_cache_page(). This function replaces a page in the
radix-tree with a new page. WHen doing this, memory cgroup needs to fix
up the accounting information. memcg need to check PCG_USED bit etc.
In some(many?) cases, 'newpage' is on LRU before calling
replace_page_cache(). So, memcg's LRU accounting information should be
fixed, too.
This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
In that function, old pages will be unaccounted without touching
res_counter and new page will be accounted to the memcg (of old page).
WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
races with LRU handling.
Background:
replace_page_cache_page() is called by FUSE code in its splice() handling.
Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
because rmdir() checks the whole LRU is empty and there is no account leak.
If a page is on the other LRU than it should be, rmdir() will fail.
This bug was added in March 2011, but no bug report yet. I guess there
are not many people who use memcg and FUSE at the same time with upstream
kernels.
The result of this bug is that admin cannot destroy a memcg because of
account leak. So, no panic, no deadlock. And, even if an active cgroup
exist, umount can succseed. So no problem at shutdown.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When onlining, the onlined pages must be added to the kernel's
list of free pages using __free_page(). However, pages are not
immediately added but placed in a queue to be processed
when the queue size reaches a watermark. The last pages in
the queue may not be processed in time, and if you try to
offline that memory before it is processed, offlining will
always fail.
This fix calls drain_all_pages(), which will process every
free page in the queue. This ensures that all pages are
accounted for when onlining and nothing gets stuck in the queue.
Change-Id: I54dbc0749556702407090e51ce9246abc5db7d1c
Signed-off-by: Jack Cheung <jackc@codeaurora.org>
bdi_prune_sb() in bdi_unregister() attempts to removes the bdi links
from all super_blocks and then del_timer_sync() the writeback timer.
However, this can race with __mark_inode_dirty(), leading to
bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
we're unregistering, after we've called del_timer_sync().
This can end up with the bdi being freed with an active timer inside it,
as in the case of the following dump after the removal of an SD card.
Fix this by redoing the del_timer_sync() in bdi_destory().
------------[ cut here ]------------
WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
Modules linked in:
Backtrace:
[<c00109dc>] (dump_backtrace+0x0/0x110) from [<c0236e4c>] (dump_stack+0x18/0x1c)
r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
[<c0236e34>] (dump_stack+0x0/0x1c) from [<c0025e6c>] (warn_slowpath_common+0x54/0x6c)
[<c0025e18>] (warn_slowpath_common+0x0/0x6c) from [<c0025f28>] (warn_slowpath_fmt+0x38/0x40)
r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
r3:00000009
[<c0025ef0>] (warn_slowpath_fmt+0x0/0x40) from [<c015eb4c>] (debug_print_object+0x9c/0xc8)
r3:c02b1b29 r2:c02bc662
[<c015eab0>] (debug_print_object+0x0/0xc8) from [<c015f574>] (debug_check_no_obj_freed+0xac/0x1dc)
r6:c7964000 r5:00000001 r4:c7964000
[<c015f4c8>] (debug_check_no_obj_freed+0x0/0x1dc) from [<c00a9e38>] (kmem_cache_free+0x88/0x1f8)
[<c00a9db0>] (kmem_cache_free+0x0/0x1f8) from [<c014286c>] (blk_release_queue+0x70/0x78)
[<c01427fc>] (blk_release_queue+0x0/0x78) from [<c015290c>] (kobject_release+0x70/0x84)
r5:c79641f0 r4:c796420c
[<c015289c>] (kobject_release+0x0/0x84) from [<c0153ce4>] (kref_put+0x68/0x80)
r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
[<c0153c7c>] (kref_put+0x0/0x80) from [<c01527d0>] (kobject_put+0x48/0x5c)
r5:c79643b4 r4:c79641f0
[<c0152788>] (kobject_put+0x0/0x5c) from [<c013ddd8>] (blk_cleanup_queue+0x68/0x74)
r4:c7964000
[<c013dd70>] (blk_cleanup_queue+0x0/0x74) from [<c01a6370>] (mmc_blk_put+0x78/0xe8)
r5:00000000 r4:c794c400
[<c01a62f8>] (mmc_blk_put+0x0/0xe8) from [<c01a64b4>] (mmc_blk_release+0x24/0x38)
r5:c794c400 r4:c0322824
[<c01a6490>] (mmc_blk_release+0x0/0x38) from [<c00de11c>] (__blkdev_put+0xe8/0x170)
r5:c78d5e00 r4:c74083c0
[<c00de034>] (__blkdev_put+0x0/0x170) from [<c00de2c0>] (blkdev_put+0x11c/0x12c)
r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
r3:00000000
[<c00de1a4>] (blkdev_put+0x0/0x12c) from [<c00b0724>] (kill_block_super+0x60/0x6c)
r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
[<c00b06c4>] (kill_block_super+0x0/0x6c) from [<c00b0a94>] (deactivate_locked_super+0x44/0x70)
r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
[<c00b0a50>] (deactivate_locked_super+0x0/0x70) from [<c00b1358>] (deactivate_super+0x6c/0x70)
r5:c794dc00 r4:c794dc00
[<c00b12ec>] (deactivate_super+0x0/0x70) from [<c00c88b0>] (mntput_no_expire+0x188/0x194)
r5:c794dc00 r4:c7942300
[<c00c8728>] (mntput_no_expire+0x0/0x194) from [<c00c95e0>] (sys_umount+0x2e4/0x310)
r6:c7942300 r5:00000000 r4:00000000 r3:00000000
[<c00c92fc>] (sys_umount+0x0/0x310) from [<c000d940>] (ret_fast_syscall+0x0/0x30)
---[ end trace e5c83c92ada51c76 ]---
Change-Id: Iab5fb46548e97443d09f5556d7f57098b728bf07
Cc: stable@kernel.org
Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pratibhasagar V <pratibha@codeaurora.org>
Ensure that the vma that the ashmem region is tied to
is valid before flushing and hold the mm semaphore so
that the vma does not get destroyed from under while
it's being flushed.
CRs-Fixed: 326685
Change-Id: I86320ffe9e6570e6ffa7dda68ab26be7aed86afd
Signed-off-by: Shubhraprakash Das <sadas@codeaurora.org>
commit e6f67b8c05f5e129e126f4409ddac6f25f58ffcb upstream.
lockdep reports a deadlock in jfs because a special inode's rw semaphore
is taken recursively. The mapping's gfp mask is GFP_NOFS, but is not
used when __read_cache_page() calls add_to_page_cache_lru().
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit ff05b6f7ae762b6eb464183eec994b28ea09f6dd upstream.
An integer overflow will happen on 64bit archs if task's sum of rss,
swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by
commit
f755a04 oom: use pte pages in OOM score
where the oom score computation was divided into several steps and it's no
longer computed as one expression in unsigned long(rss, swapents, nr_pte
are unsigned long), where the result value assigned to points(int) is in
range(1..1000). So there could be an int overflow while computing
176 points *= 1000;
and points may have negative value. Meaning the oom score for a mem hog task
will be one.
196 if (points <= 0)
197 return 1;
For example:
[ 3366] 0 3366 35390480 24303939 5 0 0 oom01
Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child
Here the oom1 process consumes more than 24303939(rss)*4096~=92GB physical
memory, but it's oom score is one.
In this situation the mem hog task is skipped and oom killer kills another and
most probably innocent task with oom score greater than one.
The points variable should be of type long instead of int to prevent the
int overflow.
Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 9f57bd4d6dc69a4e3bf43044fa00fcd24dd363e3 upstream.
per_cpu_ptr_to_phys() incorrectly rounds up its result for non-kmalloc
case to the page boundary, which is bogus for any non-page-aligned
address.
This affects the only in-tree user of this function - sysfs handler
for per-cpu 'crash_notes' physical address. The trouble is that the
crash_notes per-cpu variable is not page-aligned:
crash_notes = 0xc08e8ed4
PER-CPU OFFSET VALUES:
CPU 0: 3711f000
CPU 1: 37129000
CPU 2: 37133000
CPU 3: 3713d000
So, the per-cpu addresses are:
crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4
However, /sys/devices/system/cpu/cpu*/crash_notes says:
/sys/devices/system/cpu/cpu0/crash_notes: 36b57000
/sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
/sys/devices/system/cpu/cpu2/crash_notes: 36b43000
/sys/devices/system/cpu/cpu3/crash_notes: 36b39000
As you can see, all values are rounded down to a page
boundary. Consequently, this is where kexec sets up the NOTE segments,
and thus where the secondary kernel is looking for them. However, when
the first kernel crashes, it saves the notes to the unaligned
addresses, where they are not found.
Fix it by adding offset_in_page() to the translated page address.
-tj: Combined Eugene's and Petr's commit messages.
Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Petr Tesarik <ptesarik@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit a855b84c3d8c73220d4d3cd392a7bee7c83de70e upstream.
Percpu allocator recorded the cpus which map to the first and last
units in pcpu_first/last_unit_cpu respectively and used them to
determine the address range of a chunk - e.g. it assumed that the
first unit has the lowest address in a chunk while the last unit has
the highest address.
This simply isn't true. Groups in a chunk can have arbitrary positive
or negative offsets from the previous one and there is no guarantee
that the first unit occupies the lowest offset while the last one the
highest.
Fix it by actually comparing unit offsets to determine cpus occupying
the lowest and highest offsets. Also, rename pcu_first/last_unit_cpu
to pcpu_low/high_unit_cpu to avoid confusion.
The chunk address range is used to flush cache on vmalloc area
map/unmap and decide whether a given address is in the first chunk by
per_cpu_ptr_to_phys() and the bug was discovered by invalid
per_cpu_ptr_to_phys() translation for crash_note.
Kudos to Dave Young for tracking down the problem.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: WANG Cong <xiyou.wangcong@gmail.com>
Reported-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
LKML-Reference: <4EC21F67.10905@redhat.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 1368edf0647ac112d8cfa6ce47257dc950c50f5c upstream.
Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
/proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
it is fully initialised. Unfortunately, it did not check that
__vmalloc_area_node() successfully populated the area. In the event of
allocation failure, the vmalloc area is freed but the pointer to freed
memory is inserted into the vmlist leading to a a crash later in
get_vmalloc_info().
This patch adds a check for ____vmalloc_area_node() failure within
__vmalloc_node_range. It does not use "goto fail" as in the previous
error path as a warning was already displayed by __vmalloc_area_node()
before it called vfree in its failure path.
Credit goes to Luciano Chavez for doing all the real work of identifying
exactly where the problem was.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
Tested-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit d021563888312018ca65681096f62e36c20e63cc upstream.
setup_zone_migrate_reserve() expects that zone->start_pfn starts at
pageblock_nr_pages aligned pfn otherwise we could access beyond an
existing memblock resulting in the following panic if
CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:
IP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180
*pdpt = 0000000000000000 *pde = f000ff53f000ff53
Oops: 0000 [#1] SMP
Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
EIP: 0060:[<c02d331d>] EFLAGS: 00010006 CPU: 0
EIP is at setup_zone_migrate_reserve+0xcd/0x180
EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
Call Trace:
[<c02d389c>] __setup_per_zone_wmarks+0xec/0x160
[<c02d3a1f>] setup_per_zone_wmarks+0xf/0x20
[<c08a771c>] init_per_zone_wmark_min+0x27/0x86
[<c020111b>] do_one_initcall+0x2b/0x160
[<c086639d>] kernel_init+0xbe/0x157
[<c05cae26>] kernel_thread_helper+0x6/0xd
Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
EIP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
CR2: 00000000000012b4
We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
highstart_pfn = 0x36ffe.
The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
in a pageblock is reserved before marking it MIGRATE_RESERVE").
Make sure that start_pfn is always aligned to pageblock_nr_pages to
ensure that pfn_valid s always called at the start of each pageblock.
Architectures with holes in pageblocks will be correctly handled by
pfn_valid_within in pageblock_is_reserved.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Dang Bo <bdang@vmware.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Arve Hjnnevg <arve@android.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 58a84aa92723d1ac3e1cc4e3b0ff49291663f7e1 upstream.
Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times. But the current kernel does not
set page_tail->_count to zero if a 1GB page is utilized. So when an
IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a
tail page's _count does not equal zero.
kernel BUG at include/linux/mm.h:386!
invalid opcode: 0000 [#1] SMP
Call Trace:
gup_pud_range+0xb8/0x19d
get_user_pages_fast+0xcb/0x192
? trace_hardirqs_off+0xd/0xf
hva_to_pfn+0x119/0x2f2
gfn_to_pfn_memslot+0x2c/0x2e
kvm_iommu_map_pages+0xfd/0x1c1
kvm_iommu_map_memslots+0x7c/0xbd
kvm_iommu_map_guest+0xaa/0xbf
kvm_vm_ioctl_assigned_device+0x2ef/0xa47
kvm_vm_ioctl+0x36c/0x3a2
do_vfs_ioctl+0x49e/0x4e4
sys_ioctl+0x5a/0x7c
system_call_fastpath+0x16/0x1b
RIP gup_huge_pud+0xf2/0x159
Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add ARM to the supported list of architectures for MEMORY_HOTPLUG. For
ARM, the selection of MEMORY_HOTPLUG/REMOVE is specific to the sub-arch
and has to be explicitly enabled by the sub-arch.
Signed-off-by: Bryan Huntsman <bryanh@codeaurora.org>
Vmalloc will exit if the amount it needs to allocate is
greater than totalram_pages. Vmalloc cannot allocate
from the movable zone, so pages in the movable zone should
not be counted.
This change adds a new global variable: total_unmovable_pages.
It is calculated in init.c, based on totalram_pages minus
the pages in the movable zone. Vmalloc now looks at this new
global instead of totalram_pages.
total_unmovable_pages can be modified during memory_hotplug.
If the zone you are offlining/onlining is unmovable, then
you modify it similar to totalram_pages. If the zone is
movable, then no change is needed.
Change-Id: Ie55c41051e9ad4b921eb04ecbb4798a8bd2344d6
Signed-off-by: Jack Cheung <jackc@codeaurora.org>
If offlined_pages is greater than
zone->present_pages, underflow will occur.
This change will set zone->present_pages to 0 if
offlined_pages is greater.
Change-Id: I728e90c60fb7fc391de7b9c4828ab264ca38653b
Signed-off-by: Jack Cheung <jackc@codeaurora.org>
z->lowmem_reserve[classzone_idx] is an unsigned long but
free_pages and min are longs. If free_pages is
negative, the function will incorrectly return true
because it will treat the negative long as a large,
positive unsigned long.
This change casts z->lowmem_reserve to a long and
fixes a typo in the comment.
Change-Id: Icada1fa5ca650fbcdb0656f637adbb98f223eec5
Signed-off-by: Jack Cheung <jackc@codeaurora.org>
When page migration is unable to find free memory it
will go into an infinite loop because the
all_unreclaimable flag is never set.
This change will allow memory offlining to fail
gracefully if there is not enough memory for page
migration.
__GFP_NORETRY tells the page_alloc to not retry.
__GFP_NOWARN suppresses page fault warnings when
page allocation fails.
__GFP_NOMEMALLOC prevents it from aggressively
allocating beyond zone watermarks.
Change-Id: I94dfd9059851c7b24953f44a4018a3bbac840688
Signed-off-by: Jack Cheung <jackc@codeaurora.org>
commit 7a401a972df8e184b3d1a3fc958c0a4ddee8d312 upstream.
bdi_prune_sb() in bdi_unregister() attempts to removes the bdi links
from all super_blocks and then del_timer_sync() the writeback timer.
However, this can race with __mark_inode_dirty(), leading to
bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
we're unregistering, after we've called del_timer_sync().
This can end up with the bdi being freed with an active timer inside it,
as in the case of the following dump after the removal of an SD card.
Fix this by redoing the del_timer_sync() in bdi_destory().
------------[ cut here ]------------
WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
Modules linked in:
Backtrace:
[<c00109dc>] (dump_backtrace+0x0/0x110) from [<c0236e4c>] (dump_stack+0x18/0x1c)
r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
[<c0236e34>] (dump_stack+0x0/0x1c) from [<c0025e6c>] (warn_slowpath_common+0x54/0x6c)
[<c0025e18>] (warn_slowpath_common+0x0/0x6c) from [<c0025f28>] (warn_slowpath_fmt+0x38/0x40)
r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
r3:00000009
[<c0025ef0>] (warn_slowpath_fmt+0x0/0x40) from [<c015eb4c>] (debug_print_object+0x9c/0xc8)
r3:c02b1b29 r2:c02bc662
[<c015eab0>] (debug_print_object+0x0/0xc8) from [<c015f574>] (debug_check_no_obj_freed+0xac/0x1dc)
r6:c7964000 r5:00000001 r4:c7964000
[<c015f4c8>] (debug_check_no_obj_freed+0x0/0x1dc) from [<c00a9e38>] (kmem_cache_free+0x88/0x1f8)
[<c00a9db0>] (kmem_cache_free+0x0/0x1f8) from [<c014286c>] (blk_release_queue+0x70/0x78)
[<c01427fc>] (blk_release_queue+0x0/0x78) from [<c015290c>] (kobject_release+0x70/0x84)
r5:c79641f0 r4:c796420c
[<c015289c>] (kobject_release+0x0/0x84) from [<c0153ce4>] (kref_put+0x68/0x80)
r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
[<c0153c7c>] (kref_put+0x0/0x80) from [<c01527d0>] (kobject_put+0x48/0x5c)
r5:c79643b4 r4:c79641f0
[<c0152788>] (kobject_put+0x0/0x5c) from [<c013ddd8>] (blk_cleanup_queue+0x68/0x74)
r4:c7964000
[<c013dd70>] (blk_cleanup_queue+0x0/0x74) from [<c01a6370>] (mmc_blk_put+0x78/0xe8)
r5:00000000 r4:c794c400
[<c01a62f8>] (mmc_blk_put+0x0/0xe8) from [<c01a64b4>] (mmc_blk_release+0x24/0x38)
r5:c794c400 r4:c0322824
[<c01a6490>] (mmc_blk_release+0x0/0x38) from [<c00de11c>] (__blkdev_put+0xe8/0x170)
r5:c78d5e00 r4:c74083c0
[<c00de034>] (__blkdev_put+0x0/0x170) from [<c00de2c0>] (blkdev_put+0x11c/0x12c)
r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
r3:00000000
[<c00de1a4>] (blkdev_put+0x0/0x12c) from [<c00b0724>] (kill_block_super+0x60/0x6c)
r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
[<c00b06c4>] (kill_block_super+0x0/0x6c) from [<c00b0a94>] (deactivate_locked_super+0x44/0x70)
r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
[<c00b0a50>] (deactivate_locked_super+0x0/0x70) from [<c00b1358>] (deactivate_super+0x6c/0x70)
r5:c794dc00 r4:c794dc00
[<c00b12ec>] (deactivate_super+0x0/0x70) from [<c00c88b0>] (mntput_no_expire+0x188/0x194)
r5:c794dc00 r4:c7942300
[<c00c8728>] (mntput_no_expire+0x0/0x194) from [<c00c95e0>] (sys_umount+0x2e4/0x310)
r6:c7942300 r5:00000000 r4:00000000 r3:00000000
[<c00c92fc>] (sys_umount+0x0/0x310) from [<c000d940>] (ret_fast_syscall+0x0/0x30)
---[ end trace e5c83c92ada51c76 ]---
Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* common/android-3.0: (570 commits)
misc: remove kernel debugger core
ARM: common: fiq_debugger: dump sysrq directly to console if enabled
ARM: common: fiq_debugger: add irq context debug functions
net: wireless: bcmdhd: Call init_ioctl() only if was started properly for WEXT
net: wireless: bcmdhd: Call init_ioctl() only if was started properly
net: wireless: bcmdhd: Fix possible memory leak in escan/iscan
cpufreq: interactive governor: default 20ms timer
cpufreq: interactive governor: go to intermediate hi speed before max
cpufreq: interactive governor: scale to max only if at min speed
cpufreq: interactive governor: apply intermediate load on current speed
ARM: idle: update idle ticks before call idle end notifier
input: gpio_input: don't print debounce message unless flag is set
net: wireless: bcm4329: Skip dhd_bus_stop() if bus is already down
net: wireless: bcmdhd: Skip dhd_bus_stop() if bus is already down
net: wireless: bcmdhd: Improve suspend/resume processing
net: wireless: bcmdhd: Check if FW is Ok for internal FW call
tcp: Don't nuke connections for the wrong protocol
ARM: common: fiq_debugger: make uart irq be no_suspend
net: wireless: Skip connect warning for CONFIG_CFG80211_ALLOW_RECONNECT
mm: avoid livelock on !__GFP_FS allocations
...
Conflicts:
arch/arm/mm/cache-l2x0.c
arch/arm/vfp/vfpmodule.c
drivers/mmc/core/host.c
kernel/power/wakelock.c
net/bluetooth/hci_event.c
Signed-off-by: Bryan Huntsman <bryanh@codeaurora.org>
commit 70b50f94f1644e2aa7cb374819cfd93f3c28d725 upstream.
Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().
He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().
So the best way to fix the problem is to keep page_tail->_count zero at
all times. This will guarantee that get_page_unless_zero() can never
succeed on any tail page. page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).
While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic. As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns. It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()). get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).
The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages. The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it. A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead. So it's worth it. Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages(). The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.
This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit f5252e009d5b87071a919221e4f6624184005368 upstream.
The /proc/vmallocinfo shows information about vmalloc allocations in
vmlist that is a linklist of vm_struct. It, however, may access pages
field of vm_struct where a page was not allocated. This results in a null
pointer access and leads to a kernel panic.
Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node(). In other words, it is added to vmlist before it is
fully initialized. At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info(). Thus, a null pointer access happens.
The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized. So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Allow memblock users to specify range where @base + @size overflows
and automatically cap it at maximum. This makes the interface more
robust and specifying till-the-end-of-memory easier.
Change-Id: Iad3e1ce69a11ac7d3e5490eec56baf8468bd134a
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
[ohaugan@codeaurora.org: Fix merge conflict in memblock_add_region]
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
* changes:
mm: use required fixed size of movable zone if FIX_MOVABLE_ZONE
arm: ensure enough memory is available for movable zone
arm: add CONFIG_FIX_MOVABLE_ZONE
If FIX_MOVABLE_ZONE is enabled, we want a specific size and
location of ZONE_MOVABLE.
Change-Id: I0b858c7310cd328e1118abc9d5fe6f364bb4ffad
Signed-off-by: Larry Bassel <lbassel@codeaurora.org>
The check_bytes() function is used by slub debugging. It returns a pointer
to the first unmatching byte for a character in the given memory area.
If the character for matching byte is greater than 0x80, check_bytes()
doesn't work. Becuase 64-bit pattern is generated as below.
value64 = value | value << 8 | value << 16 | value << 24;
value64 = value64 | value64 << 32;
The integer promotions are performed and sign-extended as the type of value
is u8. The upper 32 bits of value64 is 0xffffffff in the first line, and
the second line has no effect.
This fixes the 64-bit pattern generation.
Change-Id: I99787ed3da5ba0a112cb0d307d428adf032bb950
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Matt Mackall <mpm@selenic.com>
Reviewed-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
slub checks for poison one byte by one, which is highly inefficient
and shows up frequently as a highest cpu-eater in perf top.
Joining reads gives nice speedup:
(Compiling some project with different options)
make -j12 make clean
slub_debug disabled: 1m 27s 1.2 s
slub_debug enabled: 1m 46s 7.6 s
slub_debug enabled + this patch: 1m 33s 3.2 s
check_bytes still shows up high, but not always at the top.
Change-Id: I2ae66a5617d6f240d082da9419de8604b8ebe5bb
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: linux-mm@kvack.org
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Under the following conditions, __alloc_pages_slowpath can loop
forever:
gfp_mask & __GFP_WAIT is true
gfp_mask & __GFP_FS is false
reclaim and compaction make no progress
order <= PAGE_ALLOC_COSTLY_ORDER
The gfp conditions are normally invalid, because !__GFP_FS
disables most of the reclaim methods that __GFP_WAIT would
wait for. However, these conditions happen very often during
suspend and resume, when pm_restrict_gfp_mask() effectively
converts all GFP_KERNEL allocations into __GFP_WAIT.
The oom killer is not run because gfp_mask & __GFP_FS is false,
but should_alloc_retry will always return true when order is less
than PAGE_ALLOC_COSTLY_ORDER. __alloc_pages_slowpath will
loop forever between the rebalance label and should_alloc_retry,
unless another thread happens to release enough pages to satisfy
the allocation.
Add a check to detect when PM has disabled __GFP_FS, and do not
retry if reclaim is not making any progress.
[taken from patch on lkml by Mel Gorman, commit message by ccross]
Change-Id: I864a24e9d9fd98bd0e3d6e9c1e85b6c1b766850e
Signed-off-by: Colin Cross <ccross@android.com>
In recent versions, the platform specific physical
offline returns the number of bytes offlined, so
a value of 0 indicates an error, not success as in
older versions. Make sure that the memory
for the original memory resource nodes is not
freed via kfree, as this memory was obtained
from alloc_bootmem very early in the system's life.
Change-Id: Iffcdd8be4483e043d7605fce596ed438b15f3e02
Signed-off-by: Larry Bassel <lbassel@codeaurora.org>
commit 486cf46f3f9be5f2a966016c1a8fe01e32cde09e upstream.
I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.
3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
kernel BUG at include/linux/swapops.h:105!
RIP: 0010:[<ffffffff81127b76>] [<ffffffff81127b76>]
migration_entry_wait+0x156/0x160
[<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
[<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
[<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
[<ffffffff81102a31>] handle_mm_fault+0x181/0x310
[<ffffffff81106097>] ? vma_adjust+0x537/0x570
[<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
[<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
[<ffffffff81421d5f>] page_fault+0x1f/0x30
mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in, and enough
while migration always held mmap_sem; but not enough nowadays, when
there's memory hotremove and compaction.
The danger is that move_ptes() lets a migration entry dodge around
behind remove_migration_pte()'s back, so it's in the old location when
looking at the new, then in the new location when looking at the old.
Either mremap's move_ptes() must additionally take anon_vma lock(), or
migration's remove_migration_pte() must stop peeking for is_swap_entry()
before it takes pagetable lock.
Consensus chooses the latter: we prefer to add overhead to migration
than to mremapping, which gets used by JVMs and by exec stack setup.
Reported-and-tested-by: Paweł Sikora <pluto@agmk.net>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Ensure that memory hotplug can co-exist with kmemleak
by taking the hotplug lock before scanning the memory
banks.
Change-Id: Ice6e5eaac45b4d71848ff1d3f30d468c1f088232
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
commit 4508378b9523e22a2a0175d8bf64d932fb10a67d upstream.
Commit 246e87a939 ("memcg: fix get_scan_count() for small targets")
fixes the memcg/kswapd behavior against small targets and prevent vmscan
priority too high.
But the implementation is too naive and adds another problem to small
memcg. It always force scan to 32 pages of file/anon and doesn't handle
swappiness and other rotate_info. It makes vmscan to scan anon LRU
regardless of swappiness and make reclaim bad. This patch fixes it by
adjusting scanning count with regard to swappiness at el.
At a test "cat 1G file under 300M limit." (swappiness=20)
before patch
scanned_pages_by_limit 360919
scanned_anon_pages_by_limit 180469
scanned_file_pages_by_limit 180450
rotated_pages_by_limit 31
rotated_anon_pages_by_limit 25
rotated_file_pages_by_limit 6
freed_pages_by_limit 180458
freed_anon_pages_by_limit 19
freed_file_pages_by_limit 180439
elapsed_ns_by_limit 429758872
after patch
scanned_pages_by_limit 180674
scanned_anon_pages_by_limit 24
scanned_file_pages_by_limit 180650
rotated_pages_by_limit 35
rotated_anon_pages_by_limit 24
rotated_file_pages_by_limit 11
freed_pages_by_limit 180634
freed_anon_pages_by_limit 0
freed_file_pages_by_limit 180634
elapsed_ns_by_limit 367119089
scanned_pages_by_system 0
the numbers of scanning anon are decreased(as expected), and elapsed time
reduced. By this patch, small memcgs will work better.
(*) Because the amount of file-cache is much bigger than anon,
recalaim_stat's rotate-scan counter make scanning files more.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 6e6938b6d3130305a5960c86b1a9b21e58cf6144 upstream.
sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.
Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance
using page tagging") is a partial fix in that it only fixed the
WB_SYNC_ALL phase livelock.
Although ext4 is tested to no longer livelock with commit f446daaea9,
it may due to some "redirty_tail() after pages_skipped" effect which
is by no means a guarantee for _all_ the file systems.
Note that writeback_inodes_sb() is called by not only sync(), they are
treated the same because the other callers also need livelock prevention.
Impact: It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until finished with the current inode.
Acked-by: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
commit 461ae488ecb125b140d7ea29ceeedbcce9327003 upstream.
Xen backend drivers (e.g., blkback and netback) would sometimes fail to
map grant pages into the vmalloc address space allocated with
alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could
not find the page (in the L2 table) containing the PTEs it needed to
update.
(XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000
netback and blkback were making the hypercall from a kernel thread where
task->active_mm != &init_mm and alloc_vm_area() was only updating the page
tables for init_mm. The usual method of deferring the update to the page
tables of other processes (i.e., after taking a fault) doesn't work as a
fault cannot occur during the hypercall.
This would work on some systems depending on what else was using vmalloc.
Fix this by reverting ef691947d8 ("vmalloc: remove vmalloc_sync_all()
from alloc_vm_area()") and add a comment to explain why it's needed.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Keir Fraser <keir.xen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>