a very occasional diary.




kernel patches for 2.6.12-rc6

After a long delay I updated my kernel patches to 2.6.12-rc6. This required installing git and cogito, but it turned out that the time wasn't wasted: these tools beat bitkeeper hands down CPU-wise.

A new version of the patches is uploaded here.

This series includes:


Add a /proc/zoneinfo file to display information about memory zones. Useful for analyzing VM behaviour. This was merged into -mm.


Don't call ->writepage from the VM scanner when a page is met for the first time during a scan.

A new page flag, PG_skipped, is used for this. The flag is TestSet-ed just before calling ->writepage and is cleared when the page enters the inactive list.

One can see this as a „second chance“ algorithm for the dirty pages on the inactive list.

BSD does the same: src/sys/vm/vm_pageout.c:vm_pageout_scan(), PG_WINATCFLS flag.

The reason behind this is that ->writepages() will perform more efficient writeout than ->writepage(). Skipping of a page can be conditioned on zone->pressure.

On the other hand, avoiding ->writepage() increases the amount of scanning performed by kswapd.

(Possible drawback: executable text pages are evicted earlier.)
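The second-chance rule above can be sketched in user space roughly as follows. This is only an illustration, not the patch itself: the struct page here is a stand-in, the bit values of the flags are made up, and the helper names (test_set_skipped(), enter_inactive_list(), scanner_should_writepage()) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; the bit values are assumptions. */
enum { PG_DIRTY = 1u << 0, PG_SKIPPED = 1u << 1 };

struct page {
    unsigned flags;
};

/* Set PG_SKIPPED and report whether it was already set (a "TestSet"). */
static bool test_set_skipped(struct page *page)
{
    bool was_set = (page->flags & PG_SKIPPED) != 0;

    page->flags |= PG_SKIPPED;
    return was_set;
}

/* Called when the page enters the inactive list: forget earlier skips. */
static void enter_inactive_list(struct page *page)
{
    assert(page != NULL);
    page->flags &= ~(unsigned)PG_SKIPPED;
}

/*
 * Scanner decision for one page: true means "call ->writepage now".
 * A dirty page is skipped the first time it is met and written the second.
 */
static bool scanner_should_writepage(struct page *page)
{
    assert(page != NULL);
    if (!(page->flags & PG_DIRTY))
        return false;   /* clean page: nothing to write */
    return test_set_skipped(page);
}
```

The first call for a dirty page only marks it; a later call for the same page (if it is still dirty and has not re-entered the inactive list) asks for ->writepage.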


Currently, if a zone is short on free pages, refill_inactive_zone() starts moving pages from active_list to inactive_list, rotating active_list as it goes. That is, pages from the tail of active_list are transferred to its head, thus destroying the LRU ordering exactly when we need it most: when the system is low on free memory and page replacement has to be performed.

This patch modifies refill_inactive_zone() so that it scans active_list without rotating it. To achieve this, a special dummy page, zone->scan_page, is maintained for each zone. This page marks the place in active_list reached during scanning.

As an additional bonus, if memory pressure is not yet high enough to start swapping mapped pages (reclaim_mapped == 0 in refill_inactive_zone()), then unreferenced mapped pages can be left behind zone->scan_page instead of being moved to the head of active_list. When reclaim_mapped mode is activated, zone->scan_page is reset back to the tail of active_list so that these pages can be re-scanned.
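A rough user-space sketch of scanning with a marker node instead of rotating the list. Everything here is a simplification under my own assumptions: struct node stands in for struct page, the list helpers are hand-rolled, handling of referenced pages is left out entirely, and "moving to the inactive list" is modelled as plain removal.

```c
#include <assert.h>
#include <stddef.h>

/* Circular list: head->next is the list head, head->prev is the tail. */
struct node {
    struct node *prev, *next;
    int mapped;                 /* hypothetical stand-in for page_mapped() */
};

static void list_init(struct node *head)
{
    head->prev = head->next = head;
}

static void list_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Insert n right after pos (pos == list head: add at the head). */
static void list_add_after(struct node *n, struct node *pos)
{
    n->next = pos->next;
    n->prev = pos;
    pos->next->prev = n;
    pos->next = n;
}

/* Insert n right before pos (pos == list head: add at the tail). */
static void list_add_before(struct node *n, struct node *pos)
{
    n->prev = pos->prev;
    n->next = pos;
    pos->prev->next = n;
    pos->prev = n;
}

/* Put the marker back at the tail so skipped pages get re-scanned. */
static void reset_scan(struct node *head, struct node *marker)
{
    list_del(marker);
    list_add_before(marker, head);
}

/*
 * Scan up to nr_to_scan pages from the marker towards the head.
 * Returns the number of pages taken off the active list.
 */
static int scan_active(struct node *head, struct node *marker,
                       int nr_to_scan, int reclaim_mapped)
{
    int moved = 0;

    assert(head != NULL && marker != NULL);
    while (nr_to_scan-- > 0) {
        struct node *page = marker->prev;   /* next page towards the head */

        if (page == head)
            break;                          /* whole list has been scanned */
        if (page->mapped && !reclaim_mapped) {
            /* leave the page in place, just step the marker over it */
            list_del(marker);
            list_add_before(marker, page);
        } else {
            list_del(page);                 /* would go to inactive_list */
            moved++;
        }
    }
    return moved;
}
```

Note that pages left behind the marker keep their relative order, which is the whole point: nothing is rotated to the head.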


Force artificial failures in the page allocator. I used this to harden some kernel code.
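The idea can be illustrated in user space with a malloc() wrapper. How the patch actually decides when to fail is not described here, so the "fail every Nth call" policy and the names struct fail_policy and alloc_may_fail() are purely my assumptions for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical policy: fail every `interval`-th allocation (0 = never). */
struct fail_policy {
    unsigned interval;
    unsigned count;      /* allocations attempted so far */
    unsigned injected;   /* failures forced so far */
};

/* Allocate like malloc(), but return NULL when the policy says so. */
static void *alloc_may_fail(struct fail_policy *fp, size_t size)
{
    assert(fp != NULL);
    if (fp->interval != 0 && ++fp->count % fp->interval == 0) {
        fp->injected++;
        return NULL;     /* artificial failure */
    }
    return malloc(size);
}
```

Callers that do not check for NULL show up quickly once the interval is small.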


Transfer dirtiness from the pte to the struct page in page_referenced(). This makes pages dirtied through mmap „visible“ to the file system, which can then write them out through ->writepages() (otherwise such pages are written by ->writepage() from the tail of the inactive list).


Implement pageout clustering at the VM level.

With this patch the VM scanner calls pageout_cluster() instead of ->writepage(). pageout_cluster() tries to find a group of dirty pages around the target page, called the „pivot“ page of the cluster. If a group of suitable size is found, ->writepages() is called for it; otherwise, pageout_cluster() falls back to ->writepage().

This is supposed to help in workloads with significant page-out of file-system pages from the tail of the inactive list (for example, heavy dirtying through mmap), because a file system usually writes multiple pages more efficiently. It should also be advantageous for file systems doing delayed allocation, as in this case they will allocate whole extents at once.

A few points:

  • swap-cache pages are not clustered (although they could be, by page->private rather than page->index)
  • only kswapd does clustering, because the direct reclaim path should be low latency.
  • this patch adds new fields to struct writeback_control and expects ->writepages() to interpret them. This is needed because pageout_cluster() calls ->writepages() with the pivot page already locked, so ->writepages() is only allowed to trylock the other pages in the cluster. Besides, some rather rough plumbing (the wbc->pivot_ret field) is added to check whether ->writepages() failed to write the pivot page for any reason (in which case pageout_cluster() falls back to ->writepage()).

Only mpage_writepages() was updated to honor these new fields, but all in-tree ->writepages() implementations seem to call mpage_writepages(). (Except reiser4, of course, for which I'll send a (trivial) patch, if necessary).
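To make the cluster search more concrete, here is a small user-space model. It assumes that "around the pivot" means a contiguous run of dirty pages by index, and that there is a maximum span and a minimum useful size; both limits, and the names struct cluster and find_cluster(), are my assumptions for this sketch rather than what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cluster {
    size_t first;       /* index of the first page to write */
    size_t count;       /* number of pages to write */
    bool clustered;     /* true: use ->writepages(), false: ->writepage() */
};

/*
 * dirty[i] says whether the page with index i is dirty.
 * Grow a run of dirty pages around the pivot, at most max_span long.
 * If the run is shorter than min_count, fall back to the pivot alone.
 */
static struct cluster find_cluster(const bool *dirty, size_t nr_pages,
                                   size_t pivot, size_t max_span,
                                   size_t min_count)
{
    struct cluster c = { pivot, 1, false };
    size_t lo = pivot, hi = pivot;

    assert(dirty != NULL && pivot < nr_pages && dirty[pivot]);

    /* extend towards lower indices, then towards higher ones */
    while (lo > 0 && dirty[lo - 1] && hi - lo + 1 < max_span)
        lo--;
    while (hi + 1 < nr_pages && dirty[hi + 1] && hi - lo + 1 < max_span)
        hi++;

    if (hi - lo + 1 >= min_count) {
        c.first = lo;
        c.count = hi - lo + 1;
        c.clustered = true;
    }
    return c;
}
```

A lone dirty page gives a cluster of one with clustered == false, which corresponds to the ->writepage() fallback.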


Export kernel backtrace in /proc/<pid>/task/<tid>/stack. Useful when debugging deadlocks.

This somewhat duplicates the functionality of SysRq-T, but is less intrusive to system operation and can be used in scripts.

Exporting kernel stack of a thread is probably unsound security-wise. Use with care.

Instead of adding yet another architecture-specific function to output a thread's stack through the seq_file API, the patch introduces an „iterator“

void do_with_stack(struct task_struct *tsk, 
     int (*actor)(int, void *, void *, void *), void *opaque)

that has to be implemented by each architecture, so that generic code can iterate over stack frames in an architecture-independent way.

lib/do_with_stack.c is provided for architectures that don't implement their own. It is based on __builtin_{frame,return}_address().
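The iterator pattern itself can be shown without touching a real stack. The sketch below walks a hand-built chain of frame records, the way a frame-pointer based unwinder would. The meaning I give to the actor's arguments (frame number, frame pointer, return address, opaque) and the "non-zero return stops the walk" convention are guesses for illustration only; struct frame_record, walk_frames() and count_frames() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated frame record: link to the caller's frame plus return address. */
struct frame_record {
    struct frame_record *caller;
    void *return_address;
};

/* Same shape as the actor in the prototype above. */
typedef int (*stack_actor)(int, void *, void *, void *);

/* Call actor for every frame, innermost first; stop on non-zero return. */
static int walk_frames(struct frame_record *innermost, stack_actor actor,
                       void *opaque)
{
    int depth = 0;

    assert(actor != NULL);
    for (struct frame_record *f = innermost; f != NULL; f = f->caller) {
        int rc = actor(depth++, f, f->return_address, opaque);

        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Example actor: count frames and stop once a limit is reached. */
struct frame_count {
    int seen;
    int limit;
};

static int count_frames(int depth, void *frame, void *ret, void *opaque)
{
    struct frame_count *fc = opaque;

    (void)depth;
    (void)frame;
    (void)ret;
    fc->seen++;
    return fc->seen >= fc->limit;   /* non-zero ends the walk early */
}
```

A /proc reader would pass an actor that prints each return address into the seq_file instead of counting.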


Export per-process blocking statistics in /proc/<pid>/task/<tid>/sleep and global sleeping statistics in /proc/sleep. Statistics collection for a given file is activated on the first read of the corresponding /proc file. When statistics collection is on, the current back-trace is built (through __builtin_return_address()) on each context switch. For each monitored process there is an LRU list of such back-traces. Useful when trying to understand where elapsed time is spent.
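A toy version of such an LRU list of back-traces, to show the bookkeeping. The fixed depth and slot count, the per-trace hit counter, and the names struct trace_lru and trace_lru_record() are assumptions of this sketch; the real patch may keep different data per trace.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRACE_DEPTH 4   /* frames kept per back-trace (assumed) */
#define LRU_SLOTS   3   /* back-traces kept per task (assumed) */

struct trace_entry {
    void *frames[TRACE_DEPTH];
    unsigned long hits;     /* how often this back-trace was seen */
    int used;
};

/* slot[0] is the most recently seen back-trace, the last slot the oldest. */
struct trace_lru {
    struct trace_entry slot[LRU_SLOTS];
};

/* Record one back-trace; returns the (now most recent) entry for it. */
static struct trace_entry *trace_lru_record(struct trace_lru *lru,
                                            void *const frames[TRACE_DEPTH])
{
    struct trace_entry hit;
    size_t i;

    assert(lru != NULL && frames != NULL);
    for (i = 0; i < LRU_SLOTS; i++) {
        if (lru->slot[i].used &&
            memcmp(lru->slot[i].frames, frames, sizeof hit.frames) == 0)
            break;
    }
    if (i < LRU_SLOTS) {
        hit = lru->slot[i];         /* known trace: bump its counter */
        hit.hits++;
    } else {
        i = LRU_SLOTS - 1;          /* new trace: evict the oldest slot */
        memset(&hit, 0, sizeof hit);
        memcpy(hit.frames, frames, sizeof hit.frames);
        hit.hits = 1;
        hit.used = 1;
    }
    /* shift the more recent entries down and put this one in front */
    memmove(&lru->slot[1], &lru->slot[0], i * sizeof lru->slot[0]);
    lru->slot[0] = hit;
    return &lru->slot[0];
}
```

Back-traces that keep recurring stay near the front with growing counters; rare ones fall off the end.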


ll_merge_requests_fn() assigns total_{phys,hw}_segments twice. Fix this and a typo. Merged into -mm.


Small cleanup.

rmap-cleanup.patch and WRITEPAGE_ACTIVATE-doc-fix.patch were merged into Linus tree.

