zlib Decompress issue

2019. 2. 12. 00:39

1. Overview

This post describes a problem I ran into while building an 'HWP Parser' in Python: decompressing the internal streams of the BinData storage failed.

2. Structure

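Hancom Office publishes official documentation of the HWP file format on its website.

According to that documentation, the BinData storage holds the binary data attached to a document, such as images and OLE objects, each stored as its own stream.

While building the parser, these streams had to be decompressed in order to judge whether a given Hangul document file is malicious.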

[Figure 1] BinData Area

Python's zlib module was used for decompression. The usual zlib decompress call produced the following error.

zlib.error: Error -3 while decompressing data: incorrect header check

[Figure 2] Confirming the error
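The failure can be reproduced without an HWP file. The following is a minimal sketch, assuming the BinData stream behaves like a raw deflate stream with no zlib header; the payload and the helper name try_default_decompress are made up for illustration.

```python
import zlib

# Stand-in data: a raw deflate stream (no zlib header or trailer).
payload = b"sample BinData payload " * 8
compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_stream = compressor.compress(payload) + compressor.flush()


def try_default_decompress(data):
    """Return the zlib error message from a default decompress call, or None."""
    try:
        zlib.decompress(data)  # default wbits expects a zlib header
    except zlib.error as exc:
        return str(exc)
    return None


error_message = try_default_decompress(raw_stream)
print(error_message)
```

The default call expects a zlib header at the start of the data, so a headerless stream is rejected before any decompression happens.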

3. Troubleshooting

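While searching for a fix, I came across the following post.

According to the post, the error is presumably caused by a stream (or file input) that is too large to hold in memory. The point is not that it exceeds the actual memory size, but that it exceeds the default buffer size.

The suggested fix is to process the stream in a buffered manner and then decompress it. The solution source code provided with the post is as follows [1].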

 
import zlib

# Read the whole compressed file as bytes.
with open('my_data.zz', 'rb') as f_in:
    comp_data = f_in.read()
zobj = zlib.decompressobj()  # obj for decompressing data streams that won't fit into memory at once.
data = zobj.decompress(comp_data)

Applying buffering as above and testing again produced the same error.
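Why buffering alone does not help can be sketched as follows. This is an illustration under the assumption that the input is a raw deflate stream; decompress_in_chunks, the chunk size, and the payload are hypothetical and not part of the parser.

```python
import zlib


def decompress_in_chunks(data, wbits=zlib.MAX_WBITS, chunk_size=64):
    """Feed data to a decompression object chunk_size bytes at a time."""
    zobj = zlib.decompressobj(wbits)
    parts = []
    for start in range(0, len(data), chunk_size):
        parts.append(zobj.decompress(data[start:start + chunk_size]))
    parts.append(zobj.flush())
    return b"".join(parts)


payload = b"0123456789abcdef" * 64

# A zlib-wrapped stream decompresses fine when fed in small chunks.
wrapped = zlib.compress(payload)
wrapped_ok = decompress_in_chunks(wrapped) == payload

# A raw deflate stream is still rejected with the default wbits,
# because the header check happens on the very first bytes.
compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw = compressor.compress(payload) + compressor.flush()
try:
    decompress_in_chunks(raw)
    raw_failed = False
except zlib.error:
    raw_failed = True

# The same chunked helper succeeds once wbits is negative.
raw_ok = decompress_in_chunks(raw, wbits=-zlib.MAX_WBITS) == payload

print(wrapped_ok, raw_failed, raw_ok)
```

Chunking changes how the bytes are fed in, not which container format the decompressor expects, so the header check fails in the same place.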

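Searching a little further, I found the following post.

[Figure 3] wbits-related solution [2]

That post suggests that setting the wbits option to -15 resolves the error. According to the official documentation of the zlib module, wbits means the following.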

[Table 1] Meaning of MAX_WBITS [3]

The wbits argument controls the size of the history buffer (or the “window size”) used when compressing data, and whether a header and trailer is included in the output. It can take several ranges of values, defaulting to 15 (MAX_WBITS):

+9 to +15: The base-two logarithm of the window size, which therefore ranges between 512 and 32768. Larger values produce better compression at the expense of greater memory usage. The resulting output will include a zlib-specific header and trailer.
−9 to −15: Uses the absolute value of wbits as the window size logarithm, while producing a raw output stream with no header or trailing checksum.
+25 to +31 = 16 + (9 to 15): Uses the low 4 bits of the value as the window size logarithm, while including a basic gzip header and trailing checksum in the output.

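MAX_WBITS has the value 15, so negating it gives -15. A negative wbits tells zlib to treat the input as a raw stream with no header or trailing checksum, which skips the header check that was failing and resolves the error.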
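The three wbits ranges quoted above can be compared side by side. This is a small sketch with a made-up payload; roundtrip is a hypothetical helper, and only standard zlib calls are used.

```python
import zlib

payload = b"HWP BinData example payload"


def roundtrip(wbits):
    """Compress and decompress payload with the same wbits value."""
    compressor = zlib.compressobj(wbits=wbits)
    packed = compressor.compress(payload) + compressor.flush()
    return packed, zlib.decompress(packed, wbits)


zlib_packed, zlib_out = roundtrip(zlib.MAX_WBITS)       # zlib header and trailer
raw_packed, raw_out = roundtrip(-zlib.MAX_WBITS)        # raw deflate, nothing added
gzip_packed, gzip_out = roundtrip(16 + zlib.MAX_WBITS)  # gzip header and trailer

print(zlib_packed[:2].hex())   # first two bytes of the zlib-wrapped stream
print(gzip_packed[:2].hex())   # first two bytes of the gzip-wrapped stream
print(zlib_packed[2:-4] == raw_packed)  # zlib stream = 2-byte header + raw + 4-byte checksum
```

The compressed body is the same in each case; only the wrapper around it differs, which is why the decompressor must be told which wrapper to expect.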

The source code with -zlib.MAX_WBITS applied is as follows.

 
import zlib

def bin_data(ole, bin_list):
    """Decompress each stream under the BinData storage and write it to disk."""
    print()
    print('[+] BinData Information')
    for content in bin_list:
        if content[0] == 'BinData':
            stream_path = content[0] + '/' + content[1]
            print('   - File Name : %s' % stream_path)
            bin_text = ole.openstream(stream_path)
            print('   - File Size : %s' % ole.get_size(stream_path))
            data2 = bin_text.read()
            print('   - Hex data ~20bytes(pre-Decompress) : %s' % data2[:20])
            # Negative wbits: raw deflate stream, no zlib header or checksum.
            zobj = zlib.decompressobj(-zlib.MAX_WBITS)
            data3 = zobj.decompress(data2)
            print('   - Hex data ~20bytes(Decompress) : %s' % data3[:20])
            # Use a context manager so the output file is actually closed.
            with open('./' + content[1] + '_Decom.txt', 'wb') as f:
                f.write(data3)
            print()

The result is as follows.

[Figure 4] Stream decompressed successfully
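If it is not known in advance how a stream was stored, one option is to try several wbits values in turn. The sketch below is an assumption-laden illustration, not part of the parser: inflate_bindata is a hypothetical helper, and treating an undecodable stream as uncompressed is a guess that should be checked against the HWP documentation.

```python
import zlib


def inflate_bindata(data):
    """Try header-aware decoding first, then raw deflate; return (label, bytes)."""
    attempts = (
        ("zlib/gzip", 32 + zlib.MAX_WBITS),  # auto-detect a zlib or gzip header
        ("raw deflate", -zlib.MAX_WBITS),    # no header, as in the fix above
    )
    for label, wbits in attempts:
        try:
            return label, zlib.decompress(data, wbits)
        except zlib.error:
            continue
    # Assumption: nothing decoded cleanly, so hand the bytes back unchanged.
    return "uncompressed", data


payload = b"embedded object bytes " * 4
compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw = compressor.compress(payload) + compressor.flush()

raw_result = inflate_bindata(raw)
wrapped_result = inflate_bindata(zlib.compress(payload))
plain_result = inflate_bindata(b"\xff\xd8\xff\xe0JFIF")

print(raw_result[0], wrapped_result[0], plain_result[0])
```

zlib.decompress is used here instead of a decompression object because it raises an error on a truncated stream, which makes a failed attempt easier to detect.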

4. References

[1] https://stackoverflow.com/questions/32367005/zlib-error-error-5-while-decompressing-data-incomplete-or-truncated-stream-in?rq=1

[2] https://daehee87.tistory.com/m/508?category=404227

[3] https://docs.python.org/3/library/zlib.html
