My first function decompilation

First x86 contact

My hopes of offloading as much work as possible were shattered: time was running out and I wasn’t finding the right tool for the job, one that would be easy, efficient and not require too much investment from me. If decompilation is not a silver bullet, then there’s no getting around learning some assembly to understand the target code and figure out what the original source code may have looked like. Once you have working code, you may be happy and move on to using it, but I’m way too curious not to also try to understand how things work beyond mechanically turning a byte stream into a different byte stream.

They say curiosity killed the cat, so I’m probably a serial pet killer at this point and always have been. Of course it’s not my fault, it never is. I can easily shift the blame to public school, which gave me a taste for learning, and to the lack of responsible adults around me as a kid to tell me how I would endanger an entire feline species. But I digress…

Soon after I discovered programming, I tried to learn more about computers in general and eventually got my virtual hands on a file called nasm.chm that would keep me occupied for a while. I would mainly write small useless programs using a handful of x86 and x87 instructions to compute well-known functions using their Taylor series approximations.

Learning x86

To be honest, I don’t remember whether I was using nasm or fasm, but either way it came with a CHM file: a self-contained, HTML-based help file with an index and hyperlinks to jump between sections. You’d even find trivia like how many CPU cycles each instruction would take.

These days I’m way more comfortable with the man utility, but in a sense CHM can be viewed as a GUI alternative to GNU info. While I understand the rationale and terminal appeal behind info, I’m always confused whenever I need to look up something not mentioned in the standard manual. In that CHM file I would easily find everything and that probably played a part in the learning process, but I digress.

So let’s look at a very small and simple function:

10001110: push   %esi
10001111: mov    0x8(%esp),%esi
10001115: test   %esi,%esi
10001117: jne    0x1000111d
10001119: xor    %eax,%eax
1000111b: pop    %esi
1000111c: ret
1000111d: push   %edi
[...]
1000113f: ret

This looks nothing like what I learned about x86 assembly. Well, it does look familiar, but I don’t understand the notation. It turns out this is the AT&T notation, and even though I’ve been living on the UNIX-like side of the fence for years now, I’ll stick to the Intel notation:

objdump --disassemble-all --disassembler-options=intel xadec.dll

Much better, I can now start my mental decompilation:

10001110: push   esi                      ; esp offset to eip is now 4
10001111: mov    esi,DWORD PTR [esp+0x8]  ; esi = arg0
10001115: test   esi,esi                  ; if (arg0 == NULL) {
10001117: jne    0x1000111d               ;     /* else jump */
10001119: xor    eax,eax                  ;     eax = 0
1000111b: pop    esi                      ;     esp offset to eip is now 0
1000111c: ret                             ;     return (0);
                                          ; }
1000111d: push   edi                      ; esp offset to eip is now 8
[...]
1000113f: ret

Isn’t it funny that atoms got their name because they were thought to be indivisible? It turns out they are made of electrons, protons and neutrons. And we can even break some of those further down into what we now think are elementary particles, although at this point I digress…

I was lucky that xadec.dll contained only one instruction I had never encountered before, so the learning curve was almost flat. Instructions are like atoms: the smallest unit of work we can give an x86-compatible processor, although depending on the operands the same instruction will yield different opcodes. The opcodes and their argument bytes don’t add any value here, so I cut them out of the objdump output.

Prior exposure to x86 code helped a lot. This way I knew immediately that xor eax, eax is a common alternative to mov eax, 0 and wasn’t bewildered that esi would be tested against itself. Why though, I don’t know, but I suspect this may simply be more efficient or result in smaller machine code. I may be curious, but I can also enjoy a bit of mystery.

However, how can I tell that arg0 is a pointer and not simply a scalar? How can I assume that this is a null check?

The first clues

The snippets above are from the xaDecodeClose function, one of the 4 public symbols I can find in the disassembly:

Export Address Table -- Ordinal Base 1
        [   0] +base[   1] 1110 Export RVA
        [   1] +base[   2] 1170 Export RVA
        [   2] +base[   3] 1000 Export RVA
        [   3] +base[   4] 1140 Export RVA

[Ordinal/Name Pointer] Table
        [   0] xaDecodeClose
        [   1] xaDecodeConvert
        [   2] xaDecodeOpen
        [   3] xaDecodeSize

Looking at other properties of the file I can make sense of the address 1110 of the first table, and deduce that xaDecodeClose is located at 10001110:

BaseOfCode              00001000
BaseOfData              00006000
ImageBase               10000000

All public symbols seem to land in the code section of the library. The C code generated by retdec also confirms my amazing deduction of the location of xaDecodeClose with my mighty arithmetic powers:

// Address range: 0x10001110 - 0x1000113f
int32_t xaDecodeClose(char * pMem, int32_t a2) {
    // 0x10001110
    if (pMem == NULL) {
        // 0x10001119
        return 0;
    }
    // 0x1000111d
    GlobalUnlock(GlobalHandle(pMem));
    GlobalFree(GlobalHandle((char *)(int32_t)pMem));
    return 1;
}

I said earlier that retdec produces horrible C code, yet this is quite readable. Don’t be fooled though: retdec was confused and got the signature wrong. It failed to figure out the calling convention of the library, but that didn’t bother me too much because I already knew it.

How did I know that? And how did I know that arg0 is a pointer prior to peeping at the decompiled C?

Open source to the rescue, again!

Even though there’s no source code for xadec.dll, it has been used by more than one project, and those projects need to know how to call it. Thanks to DTXMania being open source, I not only found the bundled DLL in the source tree but also an xadec.h file and even better: a sample.c file.

After wasting a fair amount of time failing to cross-compile sample.c to run it in Wine, hoping to get a sample output to compare to what my code would provide, I gave up. I had everything: xadec.dll, xadec.lib, xadec.h, a tool chain to build Windows binaries and Wine to run sample.exe but it would always fail to link and at some point I ran out of the 2-hour budget I had allocated for that. I realized later that it wouldn’t be a problem.

Not too long ago, we discovered that one of our fellow Varnish developers is a beekeeper in his spare time. That reminded me of an old joke of mine that for some reason never caught on: API culture. That was supposed to be a pun: “I’m doing apiculture” was supposed to mean “I’m designing the API”. In French “apiculture” means beekeeping. It’s a mystery why this joke never caught on.

Anyway, thanks again to DTXMania I could sum up the API to this:

typedef struct _XASTREAMHEADER { ... } XASTREAMHEADER;
typedef struct _XAHEADER { ... } XAHEADER;
typedef HANDLE HXASTREAM;

HXASTREAM __cdecl xaDecodeOpen(XAHEADER *, WAVEFORMATEX *);
BOOL __cdecl xaDecodeClose(HXASTREAM);
BOOL __cdecl xaDecodeSize(HXASTREAM, ULONG, ULONG *);
BOOL __cdecl xaDecodeConvert(HXASTREAM, XASTREAMHEADER *);

This is a gold mine:

2 structure definitions
1 opaque structure
4 function signatures
1 calling convention

After a quick MSDN search and a quick look at other Win32 functions used in xaDecodeOpen, I was able to decompile xaDecodeClose:

void
xadec_close(struct xahandle *hdl)
{

	free(hdl);
}

My actual code doesn’t look like that, but in essence I don’t care about the return value, and free(3) already does a null check for me. I can now move on to the next low-hanging fruit because, thanks to my past self, I didn’t need more research before proceeding. In the next post I will discuss audio file formats and the WAVE file format in particular.