Extracting the HTML Encoding

Extracting the encoding from the HTTP header

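# Extract the encoding, decode the response body, and save it to dp.html
import sys
from urllib.request import urlopen

f = urlopen('http://hanbit.co.kr/store/books/full_book_list.html')

encoding = f.info().get_content_charset(failobj="utf-8")
# Extract the encoding from the HTTP header (use utf-8 as the default if no value is present)

print('encoding:', encoding, file=sys.stderr)
# Print the encoding to standard error

text = f.read().decode(encoding)
# Decode the body with the extracted encoding

print(text)

with open('dp.html', 'w', encoding=encoding) as file:
    file.write(text)
# Write with the same encoding so the saved file does not depend on the platform default,
# and let the with block close the file even if the write fails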
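
The header lookup above can be tried without a network connection. The following is a minimal sketch, assuming that the object returned by `f.info()` behaves like the standard library's `email.message.Message` (the class on which `get_content_charset` is defined). The helper name `charset_from_content_type` and the sample header values are hypothetical and do not come from the original post.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper for illustration: it builds a bare Message,
    sets only the Content-Type header, and asks for its charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns failobj
    # when the header carries no charset parameter.
    return msg.get_content_charset(failobj=default)


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=EUC-KR"))
    print(charset_from_content_type("text/html"))
```

Because a server may omit the charset parameter entirely, the `failobj` default is what the first script actually decodes with in that case, which is why the `<meta>` tag approach is useful as a second source.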

Extracting the encoding from the <meta> tag

# Extract the encoding from the meta tag
import re
import sys
from urllib.request import urlopen

f = urlopen('http://hanbit.co.kr/store/books/full_book_list.html')
bty_content = f.read()
# Store the response body in a variable as bytes

scan_content = bty_content[:1024].decode('ascii', errors='replace')
# The charset is declared near the top of the HTML, so decode only the first 1024 bytes as ASCII

match = re.search(r'charset=["\']?([\w-]+)', scan_content)
# Extract the charset value from the decoded string with a regular expression

if match:
    encoding = match.group(1)
else:
    encoding = 'utf-8'
# Fall back to utf-8 when no charset declaration is found

print('encoding:', encoding, file=sys.stderr)
# Print the extracted encoding to standard error

text = bty_content.decode(encoding)
# Decode the full body again with the extracted encoding

print(text)
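
The script above passes whatever the regular expression captures straight to `decode`, which raises `LookupError` if the page declares a name Python does not recognize. The following is a minimal sketch of the same scan wrapped in a function that checks the name first. The function name `detect_meta_charset`, its parameters, and the sample byte strings are hypothetical, and searching the raw bytes with a bytes pattern is a substitution for the ASCII-decode step used in the post.

```python
import codecs
import re

# Same pattern as in the post, written as a bytes pattern so the raw
# response body can be searched without decoding it first.
META_CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)')


def detect_meta_charset(body, default="utf-8", scan_bytes=1024):
    """Return the charset declared near the top of an HTML body.

    Falls back to `default` when no declaration is found or when the
    declared name is not a codec Python knows about.
    """
    match = META_CHARSET_RE.search(body[:scan_bytes])
    if not match:
        return default
    # With a bytes pattern, [\w-] matches ASCII characters only,
    # so decoding the captured name as ASCII cannot fail.
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # raises LookupError for unknown names
    except LookupError:
        return default
    return name


if __name__ == "__main__":
    sample = b'<html><head><meta charset="EUC-KR"></head></html>'
    print(detect_meta_charset(sample))
```

With this helper, the last step of the script would read `text = bty_content.decode(detect_meta_charset(bty_content))`, so an unrecognized declaration falls back to the default instead of stopping the program.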

