First contact with "the" XA codec

Cracking the codec open

I may have complained in a past blog post that retdec as a decompiler didn’t produce readable or helpful code. I may have even said that the code was horrible, but I also showed that it was rather good at figuring out library calls. So in order to understand that the four Windows function calls could be reduced to a plain free(3) call, I had to look at both the decompiled code for xaDecodeOpen and the documentation of the functions and their flags on MSDN.

These days I’m way more comfortable with the man utility, but in a sense I also think that online docs are very important. In this regard, I always found Microsoft’s MSDN a very good place to browse documentation. It has always been a terrible place to start browsing though. If it weren’t for search engines, I would never find the starting point.

In order to decompile xaDecodeClose I needed to look at xaDecodeOpen to enumerate all memory management functions:

GlobalAlloc
GlobalFree
GlobalHandle
GlobalLock
GlobalUnlock

In portable C code that can be safely reduced to calloc(3) and free(3). No big surprise here, it is the memory management of the opaque structure, the “handle”. In other words, stuff hidden under the compiler rug that I will need to uncover, gathering clues like a detective, trying to fit all the pieces together. And that first clue was waiting there in the open to be picked up: xaDecodeOpen’s second argument.
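To make that reduction concrete, here is a minimal sketch of what the allocation side could look like once the Global* calls are gone. The structure contents and the names xa_handle_new and xa_handle_free are placeholders of mine, not something xadec.dll exposes, and I'm assuming the handle is expected to start zeroed since calloc(3) is the replacement.

```c
#include <stdlib.h>

/* Placeholder for the opaque structure; the real layout is not known yet. */
struct xa_handle {
	unsigned char	opaque[64];	/* arbitrary size, for illustration only */
};

/* Stands in for the allocation done on open: one zeroed block. */
struct xa_handle *
xa_handle_new(void)
{
	return (calloc(1, sizeof(struct xa_handle)));
}

/* Stands in for the Global* calls made on close. */
void
xa_handle_free(struct xa_handle *h)
{
	free(h);	/* free(NULL) is a no-op, so no check is needed */
}
```

The caller only ever sees a pointer, which is all a “handle” really needs to be in portable code.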

The big bad WAV file

One misconception about WAV files is that they are big because uncompressed, another is that they are WAV files. WAV files are really one kind of RIFF file, a generic container format mostly used for audio and video. A RIFF file is made of chunks, and a WAV file is a RIFF file of type WAVE with a “fmt ” chunk describing how its audio data is encoded, followed by a data chunk containing the aforementioned audio data. You could for example use an MP3 codec in a WAV file, and you would get a file approximately the same size as an MP3 file. It’s probably even possible to add things like track information or subtitles to a RIFF file, similar to MP3’s ID3 tags, but at this point, I digress…
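Since a RIFF file is just a sequence of chunks, walking one takes very little code. Below is a rough sketch that looks up a chunk by its four-character identifier in a WAV file already loaded in memory. The function names are mine; the layout it relies on (a 12-byte RIFF/WAVE preamble, then chunks made of a 4-byte identifier, a little endian 32-bit size and a payload padded to an even length) is the standard RIFF one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a little endian 32-bit integer regardless of the host byte order. */
uint32_t
riff_le32(const unsigned char *p)
{
	return ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
	    (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
}

/*
 * Find chunk `id` in a RIFF/WAVE buffer. Returns a pointer to the chunk
 * payload and stores its size in *size, or returns NULL when the chunk
 * is absent or the buffer is malformed.
 */
const unsigned char *
riff_find_chunk(const unsigned char *buf, size_t len, const char *id,
    uint32_t *size)
{
	size_t off = 12;	/* "RIFF" + file size + "WAVE" */
	uint32_t sz;

	if (len < 12 || memcmp(buf, "RIFF", 4) != 0 ||
	    memcmp(buf + 8, "WAVE", 4) != 0)
		return (NULL);

	while (len - off >= 8) {
		sz = riff_le32(buf + off + 4);
		if (sz > len - off - 8)
			return (NULL);	/* truncated chunk */
		if (memcmp(buf + off, id, 4) == 0) {
			*size = sz;
			return (buf + off + 8);
		}
		off += 8 + (size_t)sz + (sz & 1);	/* payloads are 2-byte aligned */
		if (off > len)
			return (NULL);
	}
	return (NULL);
}
```

Looking up “fmt ” (with its trailing space) gives the encoding description, and looking up “data” gives the audio payload itself.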

The reason most WAV files are big is their number one (literally) and probably most widely used codec: PCM or Pulse-Code Modulation. To keep it short (or digression-free), the signal is sampled at a fixed rate and each sample is stored as a fixed-size value. In stereo mode, samples are interleaved, with one left sample followed by one right sample for each point in time. Interleaving is streaming-friendly: you can read the audio sequentially, without jumping back and forth in the stream or resorting to pseudo-random accesses. Arguably, picking the smallest granularity (individual samples) may not be that efficient, but I can’t help it, I digress.
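Both points fit in a few lines of C. This sketch assumes 16-bit samples and uses function names of my own: one helper computes how many bytes a given duration of PCM takes, the other splits an interleaved stereo stream into separate left and right buffers.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of `seconds` of PCM audio; `bits` is the sample width. */
uint64_t
pcm_bytes(uint32_t rate, unsigned channels, unsigned bits, uint32_t seconds)
{
	return ((uint64_t)rate * channels * (bits / 8) * seconds);
}

/*
 * Split interleaved 16-bit stereo PCM (L R L R ...) into two planar
 * buffers. `frames` is the number of left/right pairs.
 */
void
pcm_deinterleave(const int16_t *in, size_t frames, int16_t *left,
    int16_t *right)
{
	size_t i;

	for (i = 0; i < frames; i++) {
		left[i] = in[2 * i];
		right[i] = in[2 * i + 1];
	}
}
```

One minute of 44100 Hz 16-bit stereo, that is pcm_bytes(44100, 2, 16, 60), comes out at 10584000 bytes, which is where the reputation for big files comes from.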

From XA to PCM

Looking at sample.c, xadec.h and some MSDN documentation we roughly see this:

typedef struct _XAHEADER {
	ULONG	id;
	ULONG	nDataLen;
	ULONG	nSamples;
	USHORT	nSamplesPerSec;
	UCHAR	nBits;
	UCHAR	nChannels;
	ULONG	nLoopPtr;
	SHORT	befL[2];
	SHORT	befR[2];
	UCHAR	pad[4];
} XAHEADER;

typedef struct {
	WORD	wFormatTag;
	WORD	nChannels;
	DWORD	nSamplesPerSec;
	DWORD	nAvgBytesPerSec;
	WORD	nBlockAlign;
	WORD	wBitsPerSample;
	WORD	cbSize;
} WAVEFORMATEX;

int
main()
{
	FILE *fp;
	XAHEADER xah;
	WAVEFORMATEX wfx;
	XASTREAMHEADER xash;
	HXASTREAM hxas;	/* opaque handle; the type name is assumed here */

	fp = fopen("sample.xa", "rb");
	fread(&xah, 1, sizeof(XAHEADER), fp);
	hxas = xaDecodeOpen(&xah, &wfx);

	/* ... */
}

At this point we could assume that xadec.dll will turn an XA file into a WAV file, but no. While I claimed that this XA file format doesn’t look like something the Bandjam author came up with, the use of WAVEFORMATEX on the other hand looks like a Windows-oriented choice in API design. The wfx structure is passed as an “output” parameter, only to neatly pack information about the resulting audio in a single data structure. We technically already have all we need in the XAHEADER structure, provided that we know one more secret about xaDecodeOpen (hint: it’s hidden in the assembly).
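To illustrate why the header is enough, here is a sketch of how such an “output” structure could be derived from it. This is not the decompiled xaDecodeOpen: I’m assuming the decoded audio is plain PCM, I’m leaving the output sample width as a parameter because the header alone doesn’t settle it, and I’m using fixed-width types and made-up names (xa_header, wave_format, xa_fill_wfx) so that the snippet builds outside of Windows. The arithmetic for nBlockAlign and nAvgBytesPerSec is the one documented for PCM.

```c
#include <stdint.h>

#define WAVE_FORMAT_PCM	1

/* Same fields as XAHEADER, spelled with fixed-width types. */
struct xa_header {
	uint32_t	id;
	uint32_t	nDataLen;
	uint32_t	nSamples;
	uint16_t	nSamplesPerSec;
	uint8_t		nBits;
	uint8_t		nChannels;
	uint32_t	nLoopPtr;
	int16_t		befL[2];
	int16_t		befR[2];
	uint8_t		pad[4];
};

/* Same fields as WAVEFORMATEX, spelled with fixed-width types. */
struct wave_format {
	uint16_t	wFormatTag;
	uint16_t	nChannels;
	uint32_t	nSamplesPerSec;
	uint32_t	nAvgBytesPerSec;
	uint16_t	nBlockAlign;
	uint16_t	wBitsPerSample;
	uint16_t	cbSize;
};

/* Describe PCM output of `out_bits` per sample for the given XA header. */
void
xa_fill_wfx(const struct xa_header *xah, uint16_t out_bits,
    struct wave_format *wfx)
{
	wfx->wFormatTag = WAVE_FORMAT_PCM;
	wfx->nChannels = xah->nChannels;
	wfx->nSamplesPerSec = xah->nSamplesPerSec;
	wfx->wBitsPerSample = out_bits;
	/* One block holds one sample per channel. */
	wfx->nBlockAlign = (uint16_t)(wfx->nChannels * out_bits / 8);
	wfx->nAvgBytesPerSec = wfx->nSamplesPerSec * wfx->nBlockAlign;
	wfx->cbSize = 0;	/* no extra format bytes for plain PCM */
}
```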

The XA header

The fread(3) call gives us one vital piece of information: the XA header is located at the beginning of the file, and since the structure is read verbatim into memory by Windows code, its integers are serialized in little endian. It means that at this point I’m able to dump the header of an XA file and track how the header fields are used by their offsets, and that should help me dig further.
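A header dump along those lines could look like the sketch below. It reads each field from its offset in the first 32 bytes of the file and decodes it explicitly as little endian, so it behaves the same on any host. The offsets assume the XAHEADER fields are laid out back to back without compiler padding, which their natural alignment allows; the function names are mine.

```c
#include <stdint.h>
#include <stdio.h>

/* Decode little endian integers regardless of the host byte order. */
uint16_t
xa_le16(const unsigned char *p)
{
	return ((uint16_t)(p[0] | p[1] << 8));
}

uint32_t
xa_le32(const unsigned char *p)
{
	return ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
	    (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
}

/* Print each field of a 32-byte XA header next to its offset. */
void
xa_dump_header(const unsigned char *h, FILE *out)
{
	fprintf(out, "0x00 id             0x%08lx\n", (unsigned long)xa_le32(h));
	fprintf(out, "0x04 nDataLen       %lu\n", (unsigned long)xa_le32(h + 4));
	fprintf(out, "0x08 nSamples       %lu\n", (unsigned long)xa_le32(h + 8));
	fprintf(out, "0x0c nSamplesPerSec %u\n", (unsigned)xa_le16(h + 12));
	fprintf(out, "0x0e nBits          %u\n", (unsigned)h[14]);
	fprintf(out, "0x0f nChannels      %u\n", (unsigned)h[15]);
	fprintf(out, "0x10 nLoopPtr       %lu\n", (unsigned long)xa_le32(h + 16));
	fprintf(out, "0x14 befL           %d %d\n",
	    (int16_t)xa_le16(h + 20), (int16_t)xa_le16(h + 22));
	fprintf(out, "0x18 befR           %d %d\n",
	    (int16_t)xa_le16(h + 24), (int16_t)xa_le16(h + 26));
}
```

Feeding it the 32 bytes returned by the fread(3) call and stdout is enough to compare several XA files side by side.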

I’m always baffled that ISO C99 doesn’t offer anything to deal with the byte order of a given architecture. The byte orders big and little endian define in which order a “word” is laid out in memory. Confusingly enough, endian refers to one end or the other of the “word” once broken down into octets, so in this case values from the XA header start from the little end. It’s as confusing as finding the shutdown function in the start menu on Windows.

Apparently the name was borrowed from Jonathan Swift’s Gulliver’s Travels, in which rebels would break their eggs from the big end in political opposition to the Lilliputian ruler whose tyrannical rule imposed breaking them from the little end. So in computing we find architectures relying on either ordering (and in some cases even both, I’m told). It can get even messier with “word” order.

Network protocols tend to favor big endian representation over the wire, so it’s often referred to as network byte order too. And for no apparent reason C was left with nothing to deal with architectural differences besides the usual undefined behavior. Instead we have cryptic function names from another time like ntohs and ntohl to convert numbers back and forth between network and host (CPU architecture) byte order. Those functions are either no-ops on a big endian host, or they swap the ordering. None of them is part of the C standard though. But I digress.
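For the record, the whole contract fits in a handful of lines. This is a sketch of a home-made equivalent of ntohl, not the libc implementation: the names are mine and it only knows about plain big and little endian hosts, ignoring the exotic mixed orderings.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little endian host, 0 on a big endian one. */
int
host_is_little_endian(void)
{
	const uint16_t probe = 1;
	unsigned char first;

	memcpy(&first, &probe, 1);	/* look at the first octet in memory */
	return (first == 1);
}

/* Reverse the four octets of a 32-bit value. */
uint32_t
swap32(uint32_t v)
{
	return ((v >> 24) | ((v >> 8) & 0x0000ff00) |
	    ((v << 8) & 0x00ff0000) | (v << 24));
}

/* Same contract as ntohl: no-op on big endian, swap on little endian. */
uint32_t
my_ntohl(uint32_t net)
{
	return (host_is_little_endian() ? swap32(net) : net);
}
```

Copying the four octets 0x12 0x34 0x56 0x78 from the wire into a uint32_t and passing it through my_ntohl yields 0x12345678 on either kind of host.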

The header looks mostly straightforward: a magic number, audio information, some padding to round the header up to a power-of-2 size (32 bytes) and the number of bytes (nDataLen) after the header. Then there are less obvious fields like nSamples, which can probably be computed from (and sanity-checked against) the other values, and the befL and befR arrays. They will turn out to be very important clues that I only understood in retrospect. I’m not sure I could have figured out their purpose beforehand, but they definitely confirmed my interpretation of the codec.

However, this is a topic for later. In the next post I will describe the decompilation of xaDecodeOpen and how encouraging it was.