Sunday, December 17, 2017

Interpret bytes as a numpy array

How to interpret bytes as a numpy array.
In this post, I'm going to use the MNIST data, which you can download from 'http://yann.lecun.com/exdb/mnist/'. First, the format of the MNIST label file should be clarified. It is described on 'http://yann.lecun.com/exdb/mnist/' as below.
    
[offset] [type]          [value]          [description] 
0000     32 bit integer  0x00000801(2049) magic number (MSB first) 
0004     32 bit integer  60000            number of items 
0008     unsigned byte   ??               label 
0009     unsigned byte   ??               label 
........ 
xxxx     unsigned byte   ??               label
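
The header layout above can be read with Python's standard struct module. The following is a minimal sketch, not part of the original session. It builds a tiny synthetic label buffer (three made-up labels, so the item count is 3 rather than 60000) so that it runs without the MNIST file. The names sample, magic, num_items and labels are illustrative, and treating the 32 bit integers as unsigned is an assumption.

```python
import struct

# Synthetic stand-in for the label file: 8-byte header + 3 label bytes.
# '>II' means two big-endian (MSB first) unsigned 32-bit integers.
sample = struct.pack('>II', 0x00000801, 3) + bytes([5, 0, 4])

# Read the two header fields at offsets 0000 and 0004.
magic, num_items = struct.unpack_from('>II', sample, 0)

# Labels start at offset 0008, one unsigned byte each.
labels = list(sample[8:8 + num_items])

print(magic, num_items, labels)
```

Checking that magic equals 2049 before going further is a cheap way to confirm that the file really is a label file.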
    

Get contents as bytes.
    
In [1]: import gzip

In [2]: train_label_file = 'train-labels-idx1-ubyte.gz'

In [3]: label_gzip = gzip.open(train_label_file)

In [4]: label_contents = label_gzip.read()

In [5]: label_gzip.close()
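
The same read can also be written with a with statement, which closes the file automatically. This is a minimal sketch rather than the session above: so that it runs without the MNIST download, it first writes a small synthetic gzip file into a temporary directory (the file name and contents are made up for illustration) and then reads it back as bytes.

```python
import gzip
import os
import struct
import tempfile

# Made-up stand-in for 'train-labels-idx1-ubyte.gz' so the sketch runs anywhere.
payload = struct.pack('>II', 2049, 3) + bytes([5, 0, 4])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, 'demo-labels-idx1-ubyte.gz')
    with gzip.open(demo_file, 'wb') as f:
        f.write(payload)

    # gzip.open defaults to binary read mode, so read() returns bytes.
    with gzip.open(demo_file) as f:
        demo_contents = f.read()

print(type(demo_contents), len(demo_contents))
```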
    
Now you can interpret the buffer as a 1-dimensional array with the 'frombuffer' function offered by numpy. However, you have to be careful with the dtype argument of 'frombuffer'. By default, 'frombuffer' interprets the buffer as float, which is 64-bit (8 bytes per element). According to the format above, though, each label is 1 byte of data, so uint8 is the appropriate dtype. The offset=8 argument skips the 8-byte header (the magic number and the number of items). As a trial, compare the two arrays.
    
In [6]: import numpy as np

In [7]: # check contents

In [8]: label_contents[8:32]
Out[8]: b'\x05\x00\x04\x01\t\x02\x01\x03\x01\x04\x03\x05\x03\x06\x01\x07\x02\x08\x06\t\x04\x00\t\x01'

In [9]: # invoke frombuffer with the default dtype (float)

In [10]: float_array = np.frombuffer(label_contents, offset=8)

In [11]: # invoke frombuffer with uint8

In [12]: uint8_array = np.frombuffer(label_contents, dtype='uint8', offset=8)

In [13]: # check the values of float_array

In [14]: float_array
Out[14]: 
array([  3.32878858e-294,   6.14613936e-275,   1.13924062e-303, ...,
         6.57718904e-299,   7.47547348e-299,   5.21006186e-270])

In [15]: # check the values of uint8_array

In [16]: uint8_array
Out[16]: array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)

In [17]: # check the number of elements in each array

In [18]: len(float_array)
Out[18]: 7500

In [19]: len(uint8_array)
Out[19]: 60000
    

From the result above, 'frombuffer' with float reads one value per 64 bits of the buffer, whereas with uint8 it reads one value per 8 bits. (We can see this from the number of elements in each array. The float array has one eighth as many elements as the uint8 array: 7500 versus 60000 :))
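
The relation between dtype size and element count can be checked on a small buffer as well. The following is a minimal sketch using a synthetic buffer (an 8-byte header plus 16 made-up label bytes) instead of the MNIST file. The names demo, as_float, as_uint8 and header are illustrative, and reading the header with the '>u4' dtype is an addition that does not appear in the session above.

```python
import struct

import numpy as np

# Synthetic buffer: 8-byte header followed by 16 made-up label bytes.
demo = struct.pack('>II', 2049, 16) + bytes(range(16))

as_float = np.frombuffer(demo, offset=8)                  # default dtype is float64
as_uint8 = np.frombuffer(demo, dtype=np.uint8, offset=8)  # one element per byte

print(as_float.dtype.itemsize, len(as_float))  # 8 bytes per element
print(as_uint8.dtype.itemsize, len(as_uint8))  # 1 byte per element

# The header itself can be read with a big-endian unsigned 32-bit dtype.
header = np.frombuffer(demo, dtype='>u4', count=2)
print(header)
```

Here the 16 label bytes give 16 uint8 elements but only 2 float elements, the same 8-to-1 ratio as 60000 and 7500.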