a very occasional diary.




Update of my kernel patches for 2.6.14-rc5

New version of patches is uploaded here.

This series include:


Don't call ->writepage from VM scanner when page is met for the first time during scan.

New page flag PG_skipped is used for this. This flag is TestSet-ed just before calling ->writepage and is cleaned when page enters inactive list.

One can see this as „second chance“ algorithm for the dirty pages on the inactive list.

BSD does the same: src/sys/vm/vm_pageout.c:vm_pageout_scan(), PG_WINATCFLS flag.

Reason behind this is that ->writepages() will perform more efficient writeout than ->writepage(). Skipping of page can be conditioned on zone->pressure.

On the other hand, avoiding ->writepage() increases amount of scanning performed by kswapd.

(Possible drawback: executable text pages are evicted earlier.)


Force artificial failures in page allocator. I used this to harden some kernel code.


Perform calls to the ->writepage() asynchronously.

VM scanner starts pageout for dirty pages found at tail of the inactive list during scan. It is supposed (or at least desired) that under normal conditions amount of such write back is small.

Even if few pages are paged out by scanner, they still stall "direct reclaim" path (__alloc_pages() -> try_to_free_pages() -> ... -> shrink_list() -> writepage()), and to decrease allocation latency it makes sense to perform pageout asynchronously.

Current design is very simple: asynchronous page-out is done through pdflush operation kpgout(). If shrink_list() decides that page is eligible for the asynchronous pageout, it is placed into shared queue and later processed by one of pdflush threads.

Most interesting part of this patch is async_writepage() that decides when page should be paged out asynchronously. Currently this function allows asynchronous writepage only from direct reclaim, only if zone memory pressure is not too high, and only if expected number of dirty pages in the scanned chunk is larger than some threshold: if there are only few dirty pages on the list, context switch to the pdflush outwieghts advantages of asynchronous writepage.


transfer dirtiness from pte to the struct page in page_referenced(). This makes pages dirtied through mmap „visible“ to the file system, that can write them out through ->writepages() (otherwise pages are written from ->writepage() from tail of the inactive list).


Implement pageout clustering at the VM level.

With this patch VM scanner calls pageout_cluster() instead of ->writepage(). pageout_cluster() tries to find a group of dirty pages around target page, called „pivot“ page of the cluster. If group of suitable size is found, ->writepages() is called for it, otherwise, page_cluster() falls back to ->writepage().

This is supposed to help in work-loads with significant page-out of file-system pages from tail of the inactive list (for example, heavy dirtying through mmap), because file system usually writes multiple pages more efficiently. Should also be advantageous for file-systems doing delayed allocation, as in this case they will allocate whole extents at once.

Few points:

  • swap-cache pages are not clustered (although they can be, but by page->private rather than page->index)
  • only kswapd do clustering, because direct reclaim path should be low latency.
  • Original version of this patch added new fields to struct writeback_control and expected ->writepages() to interpret them. This led to hard-to-fix races against inode reclamation. Current version simply calls ->writepage() in the "correct" order, i.e., in the order of increasing page indices..

Export kernel backtrace in /proc/<pid>/task/<tid>/stack. Useful when debugging deadlocks.

This somewhat duplicates functionality of SysRq-T, but is less intrusive to the system operation and can be used in the scripts.

Exporting kernel stack of a thread is probably unsound security-wise. Use with care.

Instead of adding yet another architecture specific function to output thread stack through seq_file API, it introduces „iterator“

void do_with_stack(struct task_struct *tsk, 
     int (*actor)(int, void *, void *, void *), void *opaque)

that has to be implemented by each architecture, so that generic code can iterate over stack frames in architecture-independent way.

lib/do_with_stack.c is provided for archituctures that don't implement their own. It is based on __builtin_{frame,return}_address().


export per-process blocking statistics in /proc/<pid>/task/<tid>/sleep and global sleeping statistics in /proc/sleep. Statistics collection for given file is activated on the first read of corresponding /proc file. When statistics collection is on on each context switch current back-trace is built (through __builtin_return_address()). For each monitored process there is a LRU list of such back-traces. Useful when trying to understand where elapsed time is spent.


move EXPORT_SYMBOL(filemap_populate) to the proper place: just after function itself: it's easy to miss that function is exported otherwise.


Fix write throttling to calculate its thresholds from amount of memory that can be consumed by file system and swap caches, rather than from the total amount of physical memory. This avoids situations (among other things) when memory consumed by kernel slab allocator prevents write throttling from ever happening.


Fix comment describing BUILD_BUG_ON: BUG_ON is not an assertion (unfortunately). Also implement BUILD_BUG_ON in a way that can be used outside of a function scope, so that compile time checks can be placed in convenient places (like in a header, close to the definition of related constants and data-types).

zoneinfo.patch, deadline-iosched.c-cleanup.patch and ll_merge_requests_fn-cleanup.patch were merged into Linus tree.

dont-rotate-active-list.patch was dropped: it seems to cause inactive list exhaustion for certain wrodloads.


Follow by Email