Converting hwp and hwpx to pdf.

사리생성 2026. 3. 9. 20:53

How each file format is converted to PDF (format by format)

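1. How HWP conversion works (legacy format)
Extension: .hwp
How it works: HWP is a binary format that cannot be read as plain text (HWP 5.x files are stored as a single OLE compound file). The pyhwp library, through its hwp5html tool, knows how to decode that format, so it first unpacks the text and formatting into a very simple web document (.xhtml). The LibreOffice engine then renders that web document to PDF, much like choosing 'Print to PDF' in the Chrome browser.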
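
Because the two Hangul formats use different containers, the first bytes of a file are enough to tell them apart before picking a conversion route. The sketch below is only an illustration and is not part of the package: `sniff_container` is a hypothetical helper, and it assumes the file is either an OLE compound file (HWP 5.x) or a ZIP archive (HWPX, covered in the next section).

```python
# Signature of an OLE compound file, the container used by HWP 5.x.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP local file header, the container used by HWPX.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(path):
    """Return 'ole', 'zip', or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE_MAGIC):
        return "ole"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A check like this could catch a file whose extension does not match its real format, which would otherwise only show up as a parsing failure.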

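2. How HWPX conversion works (new format)
Extension: .hwpx
How it works: HWPX is really a ZIP file holding dozens of XML fragments. **fast_parser.py**, which I wrote myself, opens the archive and goes straight to the section XML files at its core. It parses them recursively, following only direct children inside tables, so that cell merges (joined cells), formatting, and nested tables (a table inside a table) are each emitted exactly once, with nothing dropped or duplicated, and it assembles the result into a standard HTML skeleton. It then stamps explicit borders (border="1") on every table and has LibreOffice render the page to PDF quickly.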
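
To make the "ZIP of XML fragments" idea concrete, here is a minimal sketch that lists the section files of an archive in numeric order. It assumes the sections live under `Contents/section<N>.xml`, which is the pattern fast_parser.py filters on. `list_sections` and `make_fake_hwpx` are hypothetical helpers, and the stand-in archive is not a valid HWPX document.

```python
import io
import re
import zipfile


def list_sections(hwpx_bytes):
    """Return section XML names from an HWPX archive in numeric order."""
    with zipfile.ZipFile(io.BytesIO(hwpx_bytes)) as z:
        names = [n for n in z.namelist()
                 if n.startswith("Contents/section") and n.endswith(".xml")]

    def section_number(name):
        # Pull the trailing number out of "Contents/section12.xml".
        match = re.search(r"section(\d+)\.xml$", name)
        return int(match.group(1)) if match else -1

    return sorted(names, key=section_number)


def make_fake_hwpx(section_count):
    """Build a tiny stand-in archive (placeholder content, not a real HWPX)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for i in range(section_count):
            z.writestr(f"Contents/section{i}.xml", "<sec/>")
        z.writestr("Contents/header.xml", "<head/>")
    return buf.getvalue()


print(list_sections(make_fake_hwpx(12)))
```

The numeric key matters once a document has more than ten sections: a plain string sort would place section10.xml before section2.xml and the pages would come out in the wrong order.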

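3. How single-image conversion works (PNG, JPG, JPEG)
Extension: .png, .jpg, .jpeg
How it works: Images have no complex table structure, but a transparent background (alpha channel) often comes out black when the image is written to PDF. To prevent this, img_to_pdf.py uses **Pillow**, Python's lightweight imaging library, to load the image, flatten it onto a safe RGB canvas (white background), and save it as a single PDF page that the image fills completely, at a resolution of 100.0 dpi.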
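
The black-background problem comes down to per-pixel arithmetic, which the following sketch shows without any imaging library. It assumes 8-bit RGBA pixels and that fully transparent pixels store black as their colour value, which is common but not guaranteed. `drop_alpha` and `over_white` are hypothetical names used only for this illustration.

```python
def drop_alpha(pixel):
    """Discard the alpha value and keep the stored colour as-is."""
    r, g, b, _a = pixel
    return (r, g, b)


def over_white(pixel):
    """Composite one RGBA pixel (0-255 per channel) onto a white background."""
    r, g, b, a = pixel

    def blend(channel):
        # out = channel * alpha + white * (1 - alpha), rounded to nearest integer
        return (channel * a + 255 * (255 - a) + 127) // 255

    return (blend(r), blend(g), blend(b))


transparent = (0, 0, 0, 0)      # fully transparent, stored colour is black
print(drop_alpha(transparent))  # (0, 0, 0): shows up as black
print(over_white(transparent))  # (255, 255, 255): shows up as white
```

Opaque pixels pass through `over_white` unchanged, so flattening onto white only affects the parts of the image that were see-through.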

How do I use it? (combined usage)

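I bundled all three of these methods into an automation launcher called convert_all.sh. If you open the HWP_Conversion_Package.md document shown next to the working file list, you will find the whole process packed as lightly as possible into just **4 source files (Dockerfile, fast_parser.py, img_to_pdf.py, convert_all.sh)**, so you can copy them and run them on any computer that has podman.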

#!/bin/bash

# convert_all.sh: unified HWP / HWPX / image -> PDF conversion launcher
IMAGE_NAME="hwp-to-pdf"

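echo "--- 1. Building the container image (slow only on the first run) ---"
# Stop early if the build fails; otherwise every conversion below would fail silently.
if ! podman build -t "$IMAGE_NAME" . > /dev/null 2>&1; then
    echo "   -> Image build failed; rerun 'podman build' without the redirect to see why" >&2
    exit 1
fi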

echo "--- 2. Converting every hwp, hwpx, and image file in the target folder ---"

# Convert HWP files (uses hwp5html)
for file in *.hwp; do
    [ -e "$file" ] || continue
    echo ">> [HWP] Converting \"$file\"..."
    basename="${file%.*}"

    # Clear the temporary directory in case an earlier run left it behind
    rm -rf "hwp_out_$basename"

    # Generate HTML with pyhwp
    podman run --rm --user 0:0 -v "$(pwd):/data:Z" "$IMAGE_NAME" /opt/venv/bin/hwp5html "/data/$file" --output "/data/hwp_out_$basename" > /dev/null 2>&1

    if [ -d "hwp_out_$basename" ]; then
        # Render the generated HTML to PDF
        podman run --rm --user 0:0 -v "$(pwd):/data:Z" "$IMAGE_NAME" libreoffice --headless --convert-to pdf --outdir /data "/data/hwp_out_$basename/index.xhtml" > /dev/null 2>&1
        if [ -f "index.pdf" ]; then
            mv index.pdf "${basename}.pdf"
            echo "   -> Done: ${basename}.pdf"
        else
            echo "   -> PDF rendering failed: $file"
        fi
        rm -rf "hwp_out_$basename"
    else
        echo "   -> Parsing failed (could not extract HTML): $file"
    fi
done

# Convert HWPX files (uses the custom fast parser, which also adds the table borders)
for file in *.hwpx; do
    [ -e "$file" ] || continue
    echo ">> [HWPX] Converting \"$file\"..."
    basename="${file%.*}"

    # Extract HTML with the Python parser
    podman run --rm -v "$(pwd):/data:Z" "$IMAGE_NAME" python3 /data/fast_parser.py "/data/$file" "/data/hwpx_out_$basename.html"

    if [ -f "hwpx_out_$basename.html" ]; then
        # Render to PDF
        podman run --rm -v "$(pwd):/data:Z" "$IMAGE_NAME" libreoffice --headless --convert-to pdf --outdir /data "/data/hwpx_out_$basename.html" > /dev/null 2>&1
        if [ -f "hwpx_out_${basename}.pdf" ]; then
            mv "hwpx_out_${basename}.pdf" "${basename}.pdf"
            echo "   -> Done: ${basename}.pdf"
        else
            echo "   -> PDF rendering failed: $file"
        fi
        rm -f "hwpx_out_$basename.html" "hwpx_out_${basename}.pdf"
    else
        echo "   -> Parsing failed: $file"
    fi
done

# Convert image files (PNG, JPG, JPEG) with Pillow
for file in *.{png,jpg,jpeg}; do
    [ -e "$file" ] || continue
    echo ">> [IMAGE] Converting \"$file\"..."
    basename="${file%.*}"

    # Image -> PDF. Check the script's exit status rather than the output file,
    # so a PDF left over from an earlier run is not reported as a success.
    if podman run --rm -v "$(pwd):/data:Z" "$IMAGE_NAME" /opt/venv/bin/python /data/img_to_pdf.py "/data/$file" "/data/${basename}.pdf"; then
        echo "   -> Done: ${basename}.pdf"
    else
        echo "   -> Conversion failed: $file"
    fi
done

echo "--- All conversions finished ---"
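
One thing to watch with the launcher: every loop writes its result to `<name>.pdf`, so two inputs that share a name but differ in extension (say report.hwp and report.png) end up targeting the same output file. The sketch below shows one way to spot such clashes before running the conversion. `find_output_collisions` is a hypothetical helper, not part of the package, and it mirrors the script's `${file%.*}` naming rule.

```python
from collections import defaultdict
from pathlib import PurePath

# Extensions handled by the launcher script (matched case-sensitively, like the globs).
SUPPORTED = {".hwp", ".hwpx", ".png", ".jpg", ".jpeg"}


def find_output_collisions(filenames):
    """Group inputs by the PDF name they would produce; return only the clashes."""
    by_output = defaultdict(list)
    for name in filenames:
        path = PurePath(name)
        if path.suffix in SUPPORTED:
            # Same rule as basename="${file%.*}" followed by "${basename}.pdf".
            by_output[path.stem + ".pdf"].append(name)
    return {pdf: sources for pdf, sources in by_output.items() if len(sources) > 1}


print(find_output_collisions(["report.hwp", "report.png", "photo.jpg", "notes.txt"]))
```

If the returned dictionary is not empty, renaming one of the listed inputs before the run keeps the later conversion from overwriting the earlier one.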

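# fast_parser.py: converts the section XML inside an HWPX archive to HTML
import html
import sys
import xml.etree.ElementTree as ET
import zipfile

def parse_element(elem, ns):
    tag = elem.tag.split('}')[-1]

    if tag == 'p':
        content = []
        for child in elem:
            content.append(parse_element(child, ns))
        text_content = "".join(content)
        # Apply the heading style when the paragraph contains one of these
        # Korean form titles ("application form", "application template",
        # "consent to the collection and use of personal information")
        if '참가신청서' in text_content or '신청 양식' in text_content or '개인정보 수집 및 이용 동의서' in text_content:
            return f'<h1>{text_content}</h1>'
        return f'<div class="p">{text_content}</div>'

    elif tag == 'run':
        content = []
        for child in elem:
            content.append(parse_element(child, ns))
        return "".join(content)

    elif tag == 't':
        # Escape the text so characters such as < and & cannot break the HTML.
        # Inline children (for example a tab inside the text) and the text
        # that follows them are kept as well.
        parts = [html.escape(elem.text or "")]
        for child in elem:
            parts.append(parse_element(child, ns))
            parts.append(html.escape(child.tail or ""))
        return "".join(parts)

    elif tag == 'tab':
        return "&nbsp;&nbsp;&nbsp;&nbsp;"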
    
    elif tag == 'tbl':
        rows_html = []
        for child in elem:
            if child.tag.endswith('}tr'):
                cells_html = []
                for tc in child:
                    if tc.tag.endswith('}tc'):
                        colspan, rowspan = "1", "1"
                        for prop in tc:
                            if prop.tag.endswith('}cellSpan'):
                                colspan = prop.get('colSpan', "1")
                                rowspan = prop.get('rowSpan', "1")
                                break
                        
                        cell_content = ""
                        for tc_child in tc:
                            if tc_child.tag.endswith('}subList'):
                                for sub_child in tc_child:
                                    if sub_child.tag.endswith('}p'):
                                        cell_content += parse_element(sub_child, ns)
                            elif tc_child.tag.endswith('}p') or tc_child.tag.endswith('}tbl'):
                                cell_content += parse_element(tc_child, ns)
                        
                        attr = ''
                        if colspan != "1": attr += f' colspan="{colspan}"'
                        if rowspan != "1": attr += f' rowspan="{rowspan}"'
                        cells_html.append(f'<td{attr} style="border: 1px solid black; padding: 6px; min-height: 20px;">{cell_content}</td>')
                rows_html.append(f'<tr>{"".join(cells_html)}</tr>')
        return f'<table border="1" cellspacing="0" cellpadding="5" style="border-collapse: collapse; width: 100%; margin: 15px 0; border: 2px solid black;">{"".join(rows_html)}</table>'    
    else:
        # For any other tag, keep descending into the child nodes.
        content = []
        for child in elem:
            # Recurse into everything except table internals. tr and tc are
            # handled by the 'tbl' branch above; visiting <hp:tc> again here
            # would emit the cell contents twice.
            if not child.tag.endswith('}tc') and not child.tag.endswith('}tr') and not child.tag.endswith('}cellZoneList'):
                content.append(parse_element(child, ns))
        return "".join(content)

def hwpx_to_pro_html(hwpx_path, html_path):
    ns = {
        'hp': 'http://www.hancom.co.kr/hwpml/2011/paragraph',
        'hs': 'http://www.hancom.co.kr/hwpml/2011/section'
    }
    
    html_header = [
        '<!DOCTYPE html>',
        '<html><head><meta charset="utf-8">',
        '<style>',
        '  @media print { @page { size: A4; margin: 15mm; } }',
        '  body { font-family: "NanumGothic", sans-serif; line-height: 1.4; color: #000; font-size: 10pt; background: #fff; }',
        '  .p { min-height: 1.1em; margin-bottom: 3px; }',
        '  table { border-collapse: collapse; width: 100%; margin: 15px 0; border: 2px solid #000 !important; table-layout: fixed; }',
        '  td, th { border: 1px solid #000 !important; padding: 6px; vertical-align: middle; min-height: 20px; background-color: #fff; word-break: break-all; }',
        '  h1 { text-align: center; font-size: 16pt; margin: 25px 0 15px 0; font-weight: bold; border-bottom: 2px solid #000; padding-bottom: 5px; }',
        '</style>',
        '</head><body>'
    ]

    
    try:
        with zipfile.ZipFile(hwpx_path, 'r') as z:
            sections = [n for n in z.namelist() if n.startswith('Contents/section') and n.endswith('.xml')]
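            # Sort by name length first so that section10.xml follows section9.xml
            # instead of landing between section1.xml and section2.xml.
            sections.sort(key=lambda n: (len(n), n))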
            
            all_html = []
            for section_name in sections:
                with z.open(section_name) as f:
                    root = ET.fromstring(f.read())
                    all_html.append(parse_element(root, ns))
            
            with open(html_path, 'w', encoding='utf-8') as f:
                f.write("\n".join(html_header))
                f.write("\n".join(all_html))
                f.write('</body></html>')
        return True
    except Exception as e:
        print(f"Error: {e}")
        return False

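if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python fast_parser.py <input_hwpx> <output_html>")
        sys.exit(1)

    # Report failure through the exit status, as img_to_pdf.py does
    success = hwpx_to_pro_html(sys.argv[1], sys.argv[2])
    sys.exit(0 if success else 1)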
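
The cell-merge handling in the table branch can be tried in isolation. The sketch below runs the same colSpan/rowSpan lookup against a hand-written row. The fragment is an assumption made for illustration and is far smaller than a real section file, and `span_attributes` is a hypothetical helper rather than a function from fast_parser.py.

```python
import xml.etree.ElementTree as ET

HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"

# Hand-written fragment for illustration only; a real section file has more
# elements and attributes around each cell.
FRAGMENT = f"""
<hp:tr xmlns:hp="{HP}">
  <hp:tc><hp:cellSpan colSpan="2" rowSpan="1"/></hp:tc>
  <hp:tc><hp:cellSpan colSpan="1" rowSpan="3"/></hp:tc>
  <hp:tc/>
</hp:tr>
"""


def span_attributes(tc):
    """Return the HTML colspan/rowspan attribute string for one cell element."""
    span = tc.find(f"{{{HP}}}cellSpan")  # direct child only
    colspan = span.get("colSpan", "1") if span is not None else "1"
    rowspan = span.get("rowSpan", "1") if span is not None else "1"
    attrs = ""
    if colspan != "1":
        attrs += f' colspan="{colspan}"'
    if rowspan != "1":
        attrs += f' rowspan="{rowspan}"'
    return attrs


row = ET.fromstring(FRAGMENT)
print([span_attributes(tc) for tc in row])
```

Spans of 1 are left out of the output, so unmerged cells produce a plain `<td>` and only merged cells carry the extra attributes.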

# img_to_pdf.py: saves a single image as a one-page PDF
import sys
from PIL import Image

def image_to_pdf(image_path, pdf_path):
    try:
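        # Open the image. If it carries transparency, paste it onto a white
        # RGB canvas: a bare convert("RGB") drops the alpha channel and can
        # leave transparent areas black in the PDF.
        image = Image.open(image_path)
        if image.mode in ("RGBA", "LA", "P"):
            rgba = image.convert("RGBA")
            canvas = Image.new("RGB", rgba.size, (255, 255, 255))
            canvas.paste(rgba, mask=rgba.split()[3])
            image = canvas

        # Save the image as a single-page PDF
        image.save(pdf_path, "PDF", resolution=100.0)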
        return True
    except Exception as e:
        print(f"Error: {e}")
        return False

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python img_to_pdf.py <input_image> <output_pdf>")
        sys.exit(1)
    
    success = image_to_pdf(sys.argv[1], sys.argv[2])
    sys.exit(0 if success else 1)
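
The `resolution=100.0` argument decides how large the PDF page comes out, not how sharp the pixels are. The sketch below works through that arithmetic, assuming the page is sized as pixels divided by dpi, in inches, which is my understanding of how Pillow's PDF writer treats the argument. `page_size_points` and `dpi_to_fit_width` are hypothetical helpers for illustration.

```python
POINTS_PER_INCH = 72.0  # PDF's default unit is 1/72 inch


def page_size_points(width_px, height_px, dpi=100.0):
    """Convert pixel dimensions to a PDF page size in points at the given dpi."""
    return (width_px * POINTS_PER_INCH / dpi, height_px * POINTS_PER_INCH / dpi)


def dpi_to_fit_width(width_px, target_width_pt):
    """Return the dpi at which an image of width_px spans target_width_pt points."""
    return width_px * POINTS_PER_INCH / target_width_pt


print(page_size_points(1000, 500))    # a 1000 x 500 px image at 100 dpi
print(dpi_to_fit_width(1000, 720.0))  # dpi needed for a 10 inch wide page
```

Under that assumption, a large photo at 100 dpi produces a page much bigger than A4, so a higher value may suit images that are meant to be printed.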


전산사기단
