'urllib' 태그의 글 목록

urllib

URL 다루기 위한 python의 built-in 패키지: urllib 2024.08.25

URL 다루기 위한 python의 built-in 패키지: urllib

온별파파 2024. 8. 25. 04:16

2024. 8. 25. 04:16

파이썬에서 URL을 다루기 위한 패키지로 크게 3가지 종류가 있음; urllib, urllib3, requests

urllib은 built-in package이고 나머지 2개는 third party

사용방법

1. 기본 사용방법

from urllib.request import urlopen                     # (1)

with urlopen("<https://www.example.com>") as response: # (2)
    body = response.read()                             # (3)
    print(type(body))                                  # (4)

(1) urllib.request는 built-in package로 따로 설치하지 않아도 됨. HTTP request를 위해 urlopen을 사용

(2) context manager with 문을 통해 request 후 response를 받을 수 있음

(3) response 는 <http.client.HTTPResponse> 객체

read 함수를 통해 bytes로 변환할 수 있음

(4) 실제 body의 type을 print해서 bytes 타입임을 확인

2. GET request for json format response

API 작업시 response가 json format인 경우가 많음

from urllib.request import urlopen
import json                                            # (1)

url = "<https://jsonplaceholder.typicode.com/todos/1>" # (2)
with urlopen(url) as response:
    body = response.read()

print("body: ", body)                                  # (3)
# body:  b'{\\n  "userId": 1,\\n  "id": 1,\\n  "title": "delectus aut autem",\\n  "completed": false\\n}'

# json bytes to dictionary
todo_item = json.loads(body)                           # (4)
print(todo_item)
# {'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}

(1) urllib 패키지와 함께 json 포맷을 다루기 위해 json package 추가

(2) JSON 형태의 데이터를 얻기 위한 샘플 API 주소

(3) 응답을 print 해보면 json 형태의 bytes format. 이를 dictionary 형태로 변경해주기 위해 json 패키지 필요

(4) json bytes를 파이썬 객체인 dictionary로 변경하기 위해 json.loads 함수 사용

3. Response의 header 정보 얻는 방법

from urllib.request import urlopen
from pprint import pprint

with urlopen("<https://www.example.com>") as response:
    pprint(response.headers.items())                       # (1)
    pprint(response.getheader("Connection")) # 'close'     # (2)

response의 headers.items()를 통해 header 정보를 얻을 수 있음

(1) pretty print(pprint)를 이용해 header 정보를 보기 좋게 출력하면 아래와 같음

[('Accept-Ranges', 'bytes'), ('Age', '78180'), ('Cache-Control', 'max-age=604800'), ('Content-Type', 'text/html; charset=UTF-8'), ('Date', 'Sat, 24 Aug 2024 18:10:20 GMT'), ('Etag', '"3147526947"'), ('Expires', 'Sat, 31 Aug 2024 18:10:20 GMT'), ('Last-Modified', 'Thu, 17 Oct 2019 07:18:26 GMT'), ('Server', 'ECAcc (lac/5598)'), ('Vary', 'Accept-Encoding'), ('X-Cache', 'HIT'), ('Content-Length', '1256'), ('Connection', 'close')]

(2) header의 개별 정보는 getheader 메소드를 이용해 얻을 수 있음

4. bytes를 string으로 변환

from urllib.request import urlopen

with urlopen("<https://www.example.com>") as response:
    body = response.read()                            
    print(type(body)) # <class 'bytes'>                       # (1)

decoded_body = body.decode("utf-8")                           # (2)
print(type(decoded_body)) # <class 'str'>                     # (3)
print(decoded_body[:30])

(1) body의 type을 확인해보면 bytes 이고 아래와 같은 형태이다

b'<!doctype html>\n<html>\n<head>\n

(2) bytes를 string으로 변환하기 위해 decode method를 이용 (”utf-8”을 파라미터로 전달)

(3) decoded_body의 type을 확인해보면 string인걸 확인할 수 있고 decoded_body의 일부를 표시하면 아래와 같은 형태

<!doctype html>
<html>
<head>

5. Bytes를 file로 변환

크게 2가지 방법이 있음

encoding & decoding 없이 바로 file로 작성

from urllib.request import urlopen

with urlopen("<https://www.example.com>") as response:
    body = response.read()

with open("example.html", mode="wb") as html_file:
    html_file.write(body)

write binary(wb) mode로 파일을 열어 bytes를 바로 example.html 파일에 작성
코드를 실행하면 example.html 파일이 생성됨

contents를 file로 encoding해야하는 경우

from urllib.request import urlopen

with urlopen("<https://www.google.com>") as response:
    body = response.read()

character_set = response.headers.get_content_charset()        # (1) 
content = body.decode(character_set)                          # (2)

with open("google.html", encoding="utf-8", mode="w") as file: # (3)
    file.write(content)

(1)&(2): 구글같은 홈페이지는 location에 따라 다른 encoding 방식을 사용하기도 한다. 그래서 get_content_charset () 메소드를 이용해서 encoding 방식을 확인 후 bytes를 string으로 decoding 함

(3) decoded string을 다시 html에 utf-8 모드로 encoding해서 google.html 파일에 저장함

References

https://realpython.com/lessons/python-urllib-request-overview/

저작자표시 비영리 변경금지

'Python' 카테고리의 다른 글

super() (0)	2024.09.15
The Walrus Operator: Python's Assignment Expressions (바다코끼리 연산자) (0)	2024.08.31
Pillow로 Image를 열 때 자동회전되는 현상 (0)	2024.01.15
Python을 이용한 Crawling (Feat. arm64, graviton) (0)	2024.01.06
[Python] annotation 과 forward reference (0)	2022.06.06

PREV 이전 1 NEXT 다음

뒤돌아보기

urllib

URL 다루기 위한 python의 built-in 패키지: urllib

사용방법

References

'Python' 카테고리의 다른 글

+ Recent posts

티스토리툴바