How Apps Developers Escaped Repetitive Work: AppWrapper

온별파파 2025. 4. 19. 17:36

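Superb Platform and Apps

Our company's platform consists of four main products: Label, Curate, Model, and Apps.

From the images or videos a customer uploads, Curate selects the data, Label attaches the labels, and Model trains an AI model, so platform users can easily consume their own Vision AI model as an API.

Among these, Apps provides automation tools that help customers use the platform more easily and flexibly.

Its main purposes are:

Uploading data that was annotated in another labeling tool to Superb Platform
- (e.g. annotation formats such as YOLO, COCO, and Labelme → automatically converted to the Superb format and uploaded to the platform)
Detecting human faces in images and automatically de-identifying (blurring) them
Automatically detecting text regions with OCR and generating bounding boxes

In other words, Apps smooths onboarding by automating the tedious steps customers go through when they "move in" to the platform.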

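How Do Apps Run?

Apps is the Superb Platform component that offers its features as individual apps; each app runs as an independent container operated on Kubernetes. The execution flow is as follows:

When a user uploads data through the web front end, the files are stored in S3.
The selected app starts, downloads the input data from S3, and processes it.
The result is uploaded back to S3, or provided as a download link or as a resource inside the platform.
Throughout the process, status and progress are tracked by communicating with the Apps backend server.

Individual apps are written by many different developers, and we did not want an app to work only if its developer understood the entire shared execution flow and system environment.

The purpose of AppWrapper was to abstract away the common tasks needed to run an app so that developers can focus on business logic alone.

The execution flow looks simple, but if every app implements environment setup, presigned URL requests, error handling, and result uploads by hand, maintenance and quality control become very difficult.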

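The Role of AppWrapper

The execution flow of Apps necessarily includes repetitive work such as downloading files, uploading results, and reporting status. AppWrapper is the utility that handles these common tasks automatically.

AppWrapper works as a class-based decorator and manages the following execution flow:

Before the app runs (Pre-Processing)

It receives the execution details from the system: the list of input files needed by the app, the path where files are stored, the task ID, and so on.
Based on that information, it fetches the files the user uploaded on the web from S3 and saves them to a local directory in the Pod (an emptyDir volume, to be precise).
If an error occurs while preparing the run, the task is immediately marked as failed and the status is reported to the backend.

While the app runs

It calls the process() function written by the developer to perform the actual business logic.
For example, it can be written as simply as this: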

@AppWrapper()
def process():
    # e.g. convert the YOLO label format to the Superb format
    return {
        "type": "download",
        "file_path": "/tmp/converted.zip"
    }
    
process()

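After the run (Post-Processing)

If the result type is download → the file is uploaded through a presigned PUT URL, after which the user can download the result file from the web front end
If the result type is link → the URL is returned as-is
Status update and log storage (including uploading the logs to S3)

The result format is simple but explicit

Thanks to this unified structure, the result of any app can be handled in the same way.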

Case 1: type is link

{
    "type": "link",
    "url": "some-url",
    "label": "Go to Project"
}

The output of the app is a hyperlink
The url key is required (e.g. a specific URL on the platform)
The label key is the text shown on screen (e.g. the rightmost "Go to Project" button in the UI)

Case 2: type is download

The output of the app is a file
The file_path key is required (the file path inside the Pod)

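Effects of Adopting AppWrapper

Item	Before	After
Code duplication	Presigned requests and S3 handling repeated in every app	Abstracted into a decorator
Exception handling	Each app handled errors differently	Status updated uniformly on failure
Log collection	Implemented by each developer	Uploaded to S3 automatically
Development speed	Lots of extra work when adding a new app	Only the business logic needs to be implemented

Thanks to this structure, app developers only write the app logic, without knowing operational details such as how files are received, where they are stored, or how status is reported.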

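How Is the AppWrapper Decorator Implemented Internally?

AppWrapper is a class, but because it implements the __call__ method, its instances behave like a function decorator. Local mode and production mode are clearly separated, so it can be used in both development and production environments, and when an exception occurs it immediately stores the logs and reports the status, which keeps the system stable.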

class AppWrapper:
    ....

    def __call__(self, func):
        def inner(*args, **kwargs):
            if self.LOCAL_MODE:
                try:
                    result = func(*args, **kwargs)
                except Exception as e:
                    logger.error(e)
                    sys.exit(1)
            else:
                try:
                    result = func(*args, **kwargs)  # business-logic function
                except Exception as e:
                    logger.error(traceback.format_exc())

                    .....

                    if task_data["Task"]["status"] != Status.Canceled.value:
                        self.send_pod_log_to_s3()  # send the pod log to S3
                        self.update_status(Status.Failed.value, str(e))

                try:
                    # validate the format of the returned result
                    self.validate_result_format(result)

                    # upload the file if the result contains one
                    if result["type"] == "download":
                        self.upload_file_to_s3(result)

                    # send the result to the backend
                    data = {"task_id": self.task_id, "result": result}
                    self.send_pod_log_to_s3()
                    requests.post(
                        f"{self.TOAD_HOST}/output/",
                        data=json.dumps(data),
                    )

                    # update the status to success
                    self.update_status(Status.Complete.value, "App completed")

                except Exception as e:
                    error_log = f"Post app failed: {e}"
                    self.update_status(Status.Failed.value, error_log)

        return inner
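
The mechanism itself (a class whose instances are callable, used as a decorator) can be tried in isolation. The following is only a minimal sketch, not the real AppWrapper: StatusReporter, its events list, and the two decorated functions are hypothetical names, and the "backend" is just an in-memory list.

```python
import functools


class StatusReporter:
    """Hypothetical stand-in: instances of this class act as decorators."""

    def __init__(self):
        self.events = []  # stands in for status reports sent to a backend

    def __call__(self, func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            self.events.append("started")
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                # Report the failure and stop; nothing is post-processed.
                self.events.append(f"failed: {exc}")
                return None
            self.events.append("complete")
            return result

        return inner


reporter = StatusReporter()


@reporter
def process():
    return {"type": "link", "url": "https://example.com", "label": "Open"}


@reporter
def broken():
    raise ValueError("bad input")


print(process())
print(broken())
print(reporter.events)
```

In this sketch the wrapper returns right after reporting a failure, so no later step ever touches a result that was never produced.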

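How Is AppWrapper Used?

AppWrapper is structured so that internal app developers can share it, and it is packaged and published on PyPI so that app code can pull it in with no extra setup.

$ pip install AppWrapper

Because the package is public on PyPI, anyone can install it, but the source code is kept in a private GitHub repository, so only internal users can modify or review it directly.
With no further configuration, an app developer just adds the @AppWrapper decorator, and the repetitive parts of the execution flow (file download, result upload, status reporting, log storage) are handled automatically.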

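Improvement Points and Future Direction

AppWrapper has been used reliably in dozens of apps so far and plays a large part in the execution flow of each app.
Running it in production, however, revealed the following room for improvement:

1. A single class has too many responsibilities

Environment initialization, file download/upload, status reporting, log handling, exception handling: too many roles are concentrated in one class.
It needs to be refactored into a structure that follows the SRP (Single Responsibility Principle).
(e.g. split into S3Client, StatusManager, ResultValidator, and so on)

2. The result format is validated only in code

result must be a dict containing "type" plus either "url" or "file_path", but this is not declared with something like Pydantic. Introducing a Pydantic-based result schema could improve validation.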
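
As a rough illustration of what an explicit schema buys, here is a dependency-free sketch that uses standard-library dataclasses as a stand-in for Pydantic. LinkResult, DownloadResult, and parse_result are hypothetical names, not part of AppWrapper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkResult:
    url: str
    label: str = ""


@dataclass(frozen=True)
class DownloadResult:
    file_path: str


def parse_result(result):
    """Turn a raw result dict into a typed object, or raise ValueError."""
    if not isinstance(result, dict):
        raise ValueError("result must be a dict")
    kind = result.get("type")
    if kind == "link":
        if not isinstance(result.get("url"), str):
            raise ValueError("a link result needs a string 'url'")
        return LinkResult(url=result["url"], label=result.get("label", ""))
    if kind == "download":
        if not isinstance(result.get("file_path"), str):
            raise ValueError("a download result needs a string 'file_path'")
        return DownloadResult(file_path=result["file_path"])
    raise ValueError(f"unknown result type: {kind!r}")


print(parse_result({"type": "download", "file_path": "/tmp/converted.zip"}))
print(parse_result({"type": "link", "url": "https://example.com", "label": "Go to Project"}))
```

With the schema in one place, a malformed result fails with a message that names the missing key instead of surfacing later as a KeyError.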

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)

[Book Review] Clean Code in Python, Chapter 3. General Traits of Good Code

온별파파 2024. 11. 10. 02:49

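Design by Contract

Do not implicitly bake into the code what each party expects

Both sides first agree on a contract; if the contract is broken, raise an exception that states explicitly why execution cannot continue

The contract the book refers to enforces rules that must be honored whenever software components communicate

Preconditions: things to check before the code runs (e.g. validating the data passed in as parameters)
Postconditions: validation of the function's return value, performed so the caller can confirm it received what it expected from the component
Invariants: things that stay constant while the function runs, used to confirm the logic is sound (worth documenting in the docstring)
Side effects: optionally, the side effects of the code are mentioned in the docstring

Preconditions

Everything a function or method needs guaranteed in order to work correctly
A function must properly validate the data it processes; there are two options for where to do so
- Tolerant approach: the client performs all validation before calling the function
- Demanding approach: the function itself validates before running its logic

⇒ Wherever validation is done, it must be done on one side only

Postconditions

Enforce the state after the function or method has returned

Pythonic contracts

Add control mechanisms to methods, functions, and classes, and raise RuntimeError or ValueError when a check fails
Keep precondition checks, postcondition checks, and the core logic as isolated from one another as possible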
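
To make the separation concrete, here is a small sketch that is not taken from the book: split_evenly and its helper functions are hypothetical names. The precondition check, the core logic, and the postcondition check each live in their own function.

```python
def _check_inputs(total, parts):
    # Precondition: integer inputs, a non-negative total, at least one part.
    if not isinstance(total, int) or not isinstance(parts, int):
        raise ValueError("total and parts must be integers")
    if total < 0 or parts <= 0:
        raise ValueError("total must be >= 0 and parts must be > 0")


def _check_output(total, parts, shares):
    # Postcondition: nothing is lost and the number of shares is right.
    if len(shares) != parts or sum(shares) != total:
        raise RuntimeError(f"broken result: {shares}")


def _split(total, parts):
    # Core logic only: spread the remainder over the first shares.
    base, extra = divmod(total, parts)
    return [base + 1] * extra + [base] * (parts - extra)


def split_evenly(total, parts):
    _check_inputs(total, parts)
    shares = _split(total, parts)
    _check_output(total, parts, shares)
    return shares


print(split_evenly(10, 3))  # [4, 3, 3]
```

A ValueError here points at the caller (the precondition was violated), while a RuntimeError points at the implementation (the postcondition was violated).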

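Design by Contract (DbC) - Conclusion

Its value lies in identifying problem areas effectively
By stating explicitly what a function or method needs in order to work and what it returns, the structure of the program becomes clearer
Following the principle means extra work, but the quality gained this way pays off in the long run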

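Defensive Programming

An approach that differs from design by contract
Instead of describing in a contract every condition that raises an exception and fails, make every part of the code able to protect itself from invalid input
- Handling errors in scenarios that can be anticipated: error-handling procedures
- Handling errors that should never occur: assertion errors

Error Handling

Commonly used when validating input data
The goal is to decide, for an expected error, whether to continue execution or stop the program

Ways to handle errors

Value substitution
Error logging
Exception handling

Value Substitution

When an error could make the software produce a wrong value or terminate entirely, replace the result with another, safe value
This is not always possible and must be chosen carefully (a trade-off between robustness and correctness)
A default value can also be supplied when information is missing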

import os

configuration = {"dbport": 5432}
print(configuration.get("dbhost", "localhost"))  # localhost
print(configuration.get("dbport"))  # 5432

print(os.getenv("DBHOST"))  # None

print(os.getenv("DBPORT", 5432))  # 5432

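If the second argument is omitted, None is returned when the key or variable is missing

In user-defined functions, parameter defaults can be defined directly as well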

def connect_database(host="localhost", port=5432):
    pass

Replacing a missing parameter with a default value is usually harmless, but replacing erroneous data with a similar value is more dangerous and can hide some errors

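Exception Handling

In some cases it is better to stop execution than to keep running with bad data

A function can fail for reasons other than bad input (for example, when it is connected to an external component)
In that case the function itself is not at fault, so a well-designed interface makes debugging easy

⇒ What matters is to signal exceptional situations clearly while keeping the flow aligned with the original business logic

Using exceptions to handle normal scenarios or business logic makes the program flow hard to read

→ It amounts to using exceptions like go-to statements. Abstraction can no longer happen at the right place, and the logic cannot be encapsulated.

Finally, exceptions usually tell the caller that something went wrong, which weakens encapsulation, so use them sparingly. Raising many of them can mean the function has too many responsibilities; if a function has to raise too many exceptions, check whether it can be split into several smaller ones

Handling Exceptions at the Right Level of Abstraction

Exceptions should belong to a function that does one thing only
The following example mixes different levels of abstraction. Focus on the deliver_event method: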

import logging
import time

logger = logging.getLogger(__name__)

class DataTransport:
    """An example of an object that handles exceptions at different levels."""

    _RETRY_BACKOFF: int = 5
    _RETRY_TIMES: int = 3

    def __init__(self, connector):
        self._connector = connector
        self.connection = None

    def deliver_event(self, event):
        try:
            self.connect()
            data = event.decode()
            self.send(data)
        except ConnectionError as e:
            logger.info("connection error detected: %s", e)
            raise
        except ValueError as e:
            logger.error("%r contains incorrect data: %s", event, e)
            raise

    def connect(self):
        for _ in range(self._RETRY_TIMES):
            try:
                self.connection = self._connector.connect()
            except ConnectionError as e:
                logger.info("%s: attempting new connection in %is", e, self._RETRY_BACKOFF)
                time.sleep(self._RETRY_BACKOFF)
            else:
                return self.connection
        raise ConnectionError(f"Couldn't connect after {self._RETRY_TIMES} times")

    def send(self, data):
        return self.connection.send(data)

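ConnectionError and ValueError have little to do with each other
Looking at two such different kinds of error suggests how the responsibilities should be divided
- ConnectionError should be handled inside the connect method, which separates the behavior cleanly. If the method supports retries, the exception handling can live there
- ValueError belongs to the event's decode method; pass the event to send as a parameter and handle the exception inside send
With the implementation changed this way, deliver_event no longer needs to catch any exceptions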

def connect_with_retry(connector, retry_n_times: int, retry_backoff: int = 5):
    """Try to establish a connection using <connector>.
    If the connection fails, retry <retry_n_times> times,
    waiting <retry_backoff> seconds between attempts.

    Return the connection object on success.
    Raise ConnectionError if the retries are exhausted.

    :param connector: an object with a connect() method
    :param retry_n_times: number of connection attempts
    :param retry_backoff: seconds to wait between attempts

    """
    for _ in range(retry_n_times):
        try:
            return connector.connect()
        except ConnectionError as e:
            logger.info("%s: attempting new connection in %is", e, retry_backoff)
            time.sleep(retry_backoff)

    exc = ConnectionError(f"Couldn't connect after {retry_n_times} times")
    logger.exception(exc)
    raise exc

class DataTransport:
    """An object that separates exception handling by level of abstraction."""

    _RETRY_BACKOFF: int = 5
    _RETRY_TIMES: int = 3

    # Connector and Event are not defined in this excerpt, so they are quoted.
    def __init__(self, connector: "Connector") -> None:
        self._connector = connector
        self.connection = None

    def deliver_event(self, event: "Event"):
        self.connection = connect_with_retry(
            self._connector, self._RETRY_TIMES, self._RETRY_BACKOFF
        )
        self.send(event)

    def send(self, event: "Event"):
        try:
            return self.connection.send(event.decode())
        except ValueError as e:
            logger.error("%r contains incorrect data: %s", event, e)
            raise

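deliver_event no longer contains any exception-catching code

Do Not Expose Tracebacks to End Users

As a security consideration, when an exception is allowed to propagate, do not disclose sensitive details; show a generic message such as "An unknown problem occurred" or "Page not found" instead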
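
One way to apply this at an application boundary is sketched below. It is only an illustration: handle_request, lookup, and GENERIC_MESSAGE are hypothetical names. The full traceback goes to the internal log, and the caller sees only the generic message.

```python
import logging
import traceback

logger = logging.getLogger(__name__)

GENERIC_MESSAGE = "An unknown problem occurred"


def handle_request(handler, payload):
    """Run handler(payload); on failure, log details and return a generic message."""
    try:
        return {"ok": True, "data": handler(payload)}
    except Exception:
        # The traceback is written to the internal log only.
        logger.error("request failed:\n%s", traceback.format_exc())
        return {"ok": False, "error": GENERIC_MESSAGE}


def lookup(payload):
    users = {"alice": 1}
    return users[payload["user"]]  # raises KeyError for unknown users


print(handle_request(lookup, {"user": "alice"}))
print(handle_request(lookup, {"user": "bob"}))
```

The response for the failing call carries no exception type, key name, or file path, while the log still has everything needed for debugging.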

Avoid Empty except Blocks

The most diabolical Python anti-pattern (REAL 01): it swallows every exception, so no failure can ever be noticed

try:
    process_data()
except: 
    pass

It helps to set up CI so that except blocks like this are detected automatically (the checks used here flag bare except: clauses)

flake8
pylint

https://pylint.pycqa.org/en/latest/user_guide/messages/warning/bare-except.html

name: Lint Code

on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v2

    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

    - name: Run flake8
      run: |
        flake8 . --select=E722
      
    - name: Run pylint
      run: |
        find . -name "*.py" | xargs pylint --disable=all --enable=W0702

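As an alternative, apply both of the following together

Catch a more specific exception (such as AttributeError or KeyError)
Actually handle the error inside the except block

Using pass is bad code because it does not convey what is intended
To ignore an error explicitly, the right way is to use contextlib.suppress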

import contextlib

with contextlib.suppress(KeyError):
    process_data()

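Including the Original Exception

The raise <e> from <original_exception> syntax chains exceptions together
The traceback of the original error is embedded in the new exception, and the original error is recorded as the cause of the new one in its __cause__ attribute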

class InternalDataError(Exception):
    """Exception for data of the business domain."""

def process(data_dictionary, record_id):
    try:
        return data_dictionary[record_id]
    except KeyError as e:
        raise InternalDataError("record does not exist") from e

test_dict = {"a": 1}

process(test_dict, "b")

Traceback (most recent call last):
  File "/Users/woo-seongchoi/Desktop/CleanCode/ch3/main.py", line 7, in process
    return data_dictionary[record_id]
           ~~~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'b'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/woo-seongchoi/Desktop/CleanCode/ch3/main.py", line 14, in <module>
    process(test_dict, "b")
  File "/Users/woo-seongchoi/Desktop/CleanCode/ch3/main.py", line 9, in process
    raise InternalDataError("record does not exist") from e
InternalDataError: record does not exist

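Using Assertions in Python

Assertions are for situations that must never happen, so the expression in an assert statement describes an impossible condition; if it is ever violated, it is better to stop the program

try: 
    assert condition.holds(), "condition does not hold"
except AssertionError:
    alternative_procedure()  # the program must not keep running after the catch

Besides catching AssertionError, the code above is bad for another reason: the assert statement contains a function call

assert condition.holds(), "condition does not hold"

A function call can have side effects and is not always repeatable. In addition, when execution stops on that line in a debugger, the failing result is not conveniently visible, and calling the function again would not reveal whether the value was wrong at that moment

result = condition.holds()
assert result > 0, f"Error with {result}"

Difference between exception handling and assertions

Exception handling deals with unexpected situations ⇒ more general
Assertions are self-checks that guarantee correctness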

The pathlib Module

온별파파 2024. 10. 20. 22:06

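pathlib is a module in the Python standard library, used for tasks such as reading and writing files, listing files of a given type in a directory, and finding the parent directory of a file

The Problem With Representing Paths as Strings

pathlib was introduced in Python 3.4; before that, file paths were traditionally represented as strings
But a path is more than a plain string, so the important functionality was scattered across the standard library in modules such as os, glob, and shutil
For example, the code below moves .txt files into an archive subfolder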

import glob
import os
import shutil

for file_name in glob.glob("*.txt"):
    new_path = os.path.join("archive", file_name)
    shutil.move(file_name, new_path)

Three import statements are needed: glob, os, and shutil
pathlib provides a Path class that works the same way across operating systems, so the same task can be done with pathlib alone instead of importing those three modules

from pathlib import Path

for file_path in Path.cwd().glob("*.txt"):
    new_path = Path("archive") / file_path.name
    file_path.replace(new_path)

Path Instantiation With Python’s pathlib

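One strong motivation for pathlib is to represent file system paths with dedicated objects instead of strings
The object-oriented approach stands out when contrasted with the old os.path style, especially once you notice that the heart of pathlib is the Path class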

>>> from pathlib import Path
>>> Path
<class 'pathlib.Path'>

Because the work is done with the Path class, from pathlib import Path is more convenient than import pathlib followed by pathlib.Path
There are several ways to instantiate a Path object; this post covers class methods, passing a string, and joining path components

Using Path Methods

After importing Path, its built-in methods can be used to get the working directory or the home directory

>>> from pathlib import Path
>>> Path.cwd()
PosixPath('/Users/woo-seongchoi/Desktop/realpython')

Instantiating pathlib.Path gives a WindowsPath or a PosixPath object, depending on the operating system
In general, using Path instantiates a concrete path for the current platform while keeping the code platform-independent

>>> from pathlib import Path
>>> Path.home()
PosixPath('/Users/woo-seongchoi')

The cwd and home methods of Path make it easy to get a starting point for a Python script

Passing in a String

Instead of the home directory or the current working directory, a directory or file can be pointed to by passing a string to Path

>>> from pathlib import Path
>>> Path("/Users/woo-seongchoi/Desktop/realpython/file.txt")
PosixPath('/Users/woo-seongchoi/Desktop/realpython/file.txt')

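Once the Path object exists, the work is done with the flexibility pathlib offers rather than by manipulating strings
POSIX stands for Portable Operating System Interface, a standard for keeping operating systems compatible, including how paths are represented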

Joining Paths

Path components can be joined with the slash (/) operator

from pathlib import Path

for file_path in Path.cwd().glob("*.txt"):
    new_path = Path("archive") / file_path.name
    file_path.rename(new_path)

The slash operator can combine several paths, or a mix of paths and strings, as long as one operand is a Path object
If the slash operator is not wanted, the joinpath method can be used instead

>>> from pathlib import Path
>>> Path.home().joinpath("python", "scripts", "test.py")
PosixPath('/Users/woo-seongchoi/python/scripts/test.py')
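
The operand combinations can be checked with a short sketch. It uses PurePosixPath so that the printed results are the same on every operating system; the paths themselves are made up for illustration.

```python
from pathlib import PurePosixPath

base = PurePosixPath("/home/user")

# Path / str, str / Path, and Path / Path all work; str / str does not.
scripts = base / "python" / "scripts"
mixed = "/srv" / PurePosixPath("data") / "raw"
joined = base.joinpath("python", "scripts", "test.py")

print(scripts)                   # /home/user/python/scripts
print(mixed)                     # /srv/data/raw
print(joined.name)               # test.py
print(joined.parent == scripts)  # True
```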

References

https://realpython.com/python-pathlib/

[Book Review] Clean Code in Python, Chapter 2. Pythonic Code (2)

온별파파 2024. 9. 29. 23:05

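This post summarizes, with some additions, chapter 2 of the book "Clean Code in Python".

Example: Modeling a Node for an R-Trie Data Structure

It is enough to know that this is a data structure for fast string search
Each node holds value, the current character, and next_, an array for the characters that come next
Similar in shape to a linked list or a tree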

from typing import List
from dataclasses import dataclass, field

R = 26

@dataclass
class RTrieNode:
    size = R
    value: int
    next_: List["RTrieNode"] = field(default_factory=lambda: [None] * R)

    def __post_init__(self):
        if len(self.next_) != self.size:
            raise ValueError("invalid length for the list (next_)")

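size is a class variable, so its value is shared by all instances
value is an int with no default, so it must be supplied when an object is created
next_ is initialized to a list of length R

__post_init__ validates that next_ was created in the expected shape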

from typing import List
from dataclasses import dataclass, field

R = 26  # letters of the English alphabet

@dataclass
class RTrieNode:
    size = R
    value: int
    next_: List["RTrieNode"] = field(default_factory=list)

    def __post_init__(self):
        if len(self.next_) != self.size:
            raise ValueError("invalid length for the list (next_)")

rt_node = RTrieNode(value=0)  # ValueError: invalid length for the list (next_)

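Iterable Objects

An object that implements the __iter__ magic method

Iteration in Python works through its own protocol, the iterable protocol

for e in my_object

To decide whether an object can be iterated in this form, Python checks the following two things in order, at a high level

Whether the object contains one of the methods __next__ or __iter__
Whether the object is a sequence and has both __len__ and __getitem__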

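How a for Loop Works in Detail

my_list = ["apple", "strawberry", "banana"]

for i in my_list:
    print(i)

When the for statement starts, it creates an iterator through my_list's __iter__()
Internally, each step performs i = __next__() on that iterator
The loop ends when StopIteration is raised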
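
Written out by hand, those three steps look roughly like the sketch below. It is an approximation of what the for statement does, using only the built-in iter() and next() functions.

```python
my_list = ["apple", "strawberry", "banana"]

collected = []
iterator = iter(my_list)       # calls my_list.__iter__()
while True:
    try:
        item = next(iterator)  # calls iterator.__next__()
    except StopIteration:
        break                  # this is where a for loop would end
    collected.append(item)     # the loop body

print(collected)
```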

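Difference Between Iterable and Iterator

Iterable: a Python object that can be iterated in a loop; it must implement __iter__()
Iterator: the object produced by calling __iter__() on an iterable; it must have both __iter__() and __next__(), and it keeps track of the current position during iteration

Creating an Iterable Object

Iterating over an object calls iter() on it, and that function checks whether the object has an __iter__ method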

from datetime import timedelta
from datetime import date

class DateRangeIterable:
    """An iterable that contains its own iterator methods."""

    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date
        self._present_day = start_date

    def __iter__(self):
        return self  # the object acts as its own iterator

    def __next__(self):
        if self._present_day >= self.end_date:
            raise StopIteration()
        today = self._present_day
        self._present_day += timedelta(days=1)

        return today

for day in DateRangeIterable(date(2024, 6, 1), date(2024, 6, 4)):
    print(day)

2024-06-01
2024-06-02
2024-06-03

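In a for loop, Python calls iter() on the object, which in turn calls the __iter__ magic method
Returning self indicates that the object is itself the iterator
At each step of the loop, next() is called on that object
next() delegates to the __next__ method, which decides how elements are produced and returned one at a time
- When nothing is left to produce, it must tell Python by raising StopIteration

⇒ A for loop works the same as calling next() until StopIteration is raised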

from datetime import timedelta
from datetime import date

class DateRangeIterable:
    """An iterable that contains its own iterator methods."""

    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date
        self._present_day = start_date

    def __iter__(self):
        return self

    def __next__(self):
        if self._present_day >= self.end_date:
            raise StopIteration()
        today = self._present_day
        self._present_day += timedelta(days=1)

        return today

r = DateRangeIterable(date(2024, 6, 1), date(2024, 6, 4))
print(next(r))  # 2024-06-01
print(next(r))  # 2024-06-02
print(next(r))  # 2024-06-03
print(next(r))  # raise StopIteration()

The example above works, but it has one small problem

About the max function

A built-in function that takes an iterable and returns its largest element
It can compare strings as well as numbers

str1 = 'asdzCda'
print(max(str1)) # z

str2 = ['abc', 'abd']
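print(max(str2))  # abd (larger Unicode code point at the first differing character)

str3 = ['2022-01-01', '2022-01-02']
print(max(str3))  # 2022-01-02
# Strings made of digits are also compared character by character from the front, which gives the expected order for zero-padded dates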

r1 = DateRangeIterable(date(2024, 6, 1), date(2024, 6, 4))

a = ", ".join(map(str, r1))  # "2024-06-01, 2024-06-02, 2024-06-03"
print(max(r1))

ValueError: max() iterable argument is empty

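The problem comes from how the iterable protocol works
- The __iter__ method of an iterable returns an iterator, and iteration is driven by that iterator
- In the example above __iter__ returned self, so the second pass reused an iterator that was already exhausted; __iter__ can instead create a new iterator every time it is called
- Creating a new DateRangeIterable instance each time would fix it, but __iter__ can also use a generator (which is an iterator object)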

from datetime import timedelta
from datetime import date

class DateRangeIterable:
    """An iterable that contains its own iterator methods."""

    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date
        self._present_day = start_date

    def __iter__(self):
        current_day = self.start_date
        while current_day < self.end_date:
            yield current_day
            current_day += timedelta(days=1)

    def __next__(self):
        if self._present_day >= self.end_date:
            raise StopIteration()
        today = self._present_day
        self._present_day += timedelta(days=1)

        return today

r1 = DateRangeIterable(date(2024, 6, 1), date(2024, 6, 4))

a = ", ".join(map(str, r1))  # 2024-06-01, 2024-06-02, 2024-06-03
print(max(r1))  # 2024-06-03

What changed is that each for loop calls __iter__, which creates a new generator

⇒ An object of this form is called a container iterable

Another approach

Separate the iterable and the iterator into two objects

from datetime import timedelta, date

class DateRangeIterator:
    """Iterator for DateRangeIterable."""

    def __init__(self, start_date, end_date):
        self.current_date = start_date
        self.end_date = end_date

    def __iter__(self):
        return self

    def __next__(self):
        if self.current_date >= self.end_date:
            raise StopIteration()
        today = self.current_date
        self.current_date += timedelta(days=1)
        return today

class DateRangeIterable:
    """Iterable for a range of dates."""

    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def __iter__(self):
        return DateRangeIterator(self.start_date, self.end_date)

r1 = DateRangeIterable(date(2024, 6, 1), date(2024, 6, 4))

# Using join with map
print(", ".join(map(str, r1)))  # Output: 2024-06-01, 2024-06-02, 2024-06-03

# Using max
print(max(r1))  # Output: 2024-06-03

This way, too, a new iterator is created each time __iter__ is called on DateRangeIterable

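Creating Sequences

Sometimes an object does not define __iter__ but still needs to be iterated over

If __iter__ is not defined on the object, Python looks for __getitem__, and raises TypeError if that is missing too

A sequence implements __len__ and __getitem__ and must be able to return its elements one at a time, starting from index 0

Iterable objects have the advantage of using little memory

The drawback is that getting the n-th element requires iterating n times until it is reached (time complexity: O(n))

⇒ A trade-off between CPU and memory

When both __iter__ and __getitem__ are missing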

from datetime import timedelta, date

class DateRangeSequence:
    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date
        self._range = self._create_range()

    def _create_range(self):
        days = []
        current_day = self.start_date
        while current_day < self.end_date:
            days.append(current_day)
            current_day += timedelta(days=1)
        return days

    # def __getitem__(self, day_no):
    #     return self._range[day_no]

    def __len__(self):
        return len(self._range)

s1 = DateRangeSequence(date(2022, 1, 1), date(2022, 1, 5))
for day in s1:
    print(day)

TypeError: 'DateRangeSequence' object is not iterable

When __getitem__ is present

from datetime import timedelta, date

class DateRangeSequence:
    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date
        self._range = self._create_range()

    def _create_range(self):
        days = []
        current_day = self.start_date
        while current_day < self.end_date:
            days.append(current_day)
            current_day += timedelta(days=1)
        return days

    def __getitem__(self, day_no):
        return self._range[day_no]

    def __len__(self):
        return len(self._range)

s1 = DateRangeSequence(date(2022, 1, 1), date(2022, 1, 5))
for day in s1:
    print(day)

2022-01-01
2022-01-02
2022-01-03
2022-01-04

The object can be used in a for loop even without __iter__
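
How the fallback works can be observed with a small sketch: Countdown is a hypothetical class that defines only __getitem__ and records every index Python asks for. Python requests 0, 1, 2, ... and stops as soon as IndexError is raised.

```python
class Countdown:
    """Hypothetical class with __getitem__ only (no __iter__, no __len__)."""

    def __init__(self, start):
        self.start = start
        self.requested = []  # indexes Python asked for, kept for illustration

    def __getitem__(self, index):
        self.requested.append(index)
        if index >= self.start:
            raise IndexError(index)  # signals the end of the iteration
        return self.start - index


c = Countdown(3)
print(list(c))      # [3, 2, 1]
print(c.requested)  # [0, 1, 2, 3]
```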

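Container Objects

An object that implements the __contains__ method. It normally returns a boolean, and Python calls it whenever the in keyword is used

element in container

Python interprets the code above as follows (used well, this greatly improves readability)

container.__contains__(element)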

def mark_coordinate(grid, coord):
    if 0<= coord.x < grid.width and 0<= coord.y < grid.height:
        grid[coord] = MARKED

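This code checks whether the coordinate coord lies inside grid

Could the Grid object itself decide whether a coordinate falls within its area? And what if it delegated that to a smaller object (Boundaries)?

Use composition to express the containment relationship, distribute the responsibility to another class, and use the container magic method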

class Boundaries:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def __contains__(self, coord):
        x, y = coord
        return 0 <= x < self.width and 0 <= y < self.height

class Grid:
    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.limits = Boundaries(width, height)

    def __contains__(self, coord):
        return coord in self.limits

Before using composition

def mark_coordinate(grid, coord):
    if 0<= coord.x < grid.width and 0<= coord.y < grid.height:
        grid[coord] = MARKED

After using composition

def mark_coordinate(grid, coord):
    if coord in grid:
        grid[coord] = MARKED

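Dynamic Attributes for Objects

The __getattr__ magic method makes it possible to control how attributes are obtained from an object

When an attribute is accessed as myobject.myattribute, Python calls __getattribute__, which looks for myattribute in the __dict__ of the instance (and then on its class).

If an attribute with that name exists, its value is returned
If it does not, __getattr__ is called with the name of the requested attribute (myattribute) as its argument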

class DynamicAttributes:
    def __init__(self, attribute):
        self.attribute = attribute

    def __getattr__(self, attr):
        if attr.startswith("fallback_"):
            name = attr.replace("fallback_", "")
            return f"[fallback resolved] {name}"
        raise AttributeError(f"{self.__class__.__name__} has no attribute {attr}")

dyn = DynamicAttributes("value")
print(dyn.attribute)  # value

print(dyn.fallback_test)  # [fallback resolved] test

dyn.__dict__["fallback_new"] = "new value"  # added directly to the instance through __dict__
print(dyn.fallback_new)  # new value 

print(getattr(dyn, "something", "default"))  # default

Callable Objects

Objects that behave like functions are convenient, for example as decorators
- Calling the object invokes its __call__ magic method

from collections import defaultdict

class CallCount:
    def __init__(self):
        self._counts = defaultdict(int)

    def __call__(self, argument):
        self._counts[argument] += 1
        return self._counts[argument]

cc = CallCount()
print(cc(1))  # 1
print(cc(2))  # 1
print(cc(1))  # 2
print(cc(1))  # 3
print(cc("something"))  # 1
print(callable(cc))  # True

Magic Method Summary

Usage	Magic method	Notes
obj[key], obj[i:j], obj[i:j:k]	__getitem__(key)	Subscriptable object
with obj: ...	__enter__ / __exit__	Context manager
for i in obj: ...	__iter__ / __next__, __len__ / __getitem__	Iterable object, sequence
obj.<attribute>	__getattr__	Dynamic attribute lookup
obj(*args, **kwargs)	__call__(*args, **kwargs)	Callable object

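The best way to implement these magic methods correctly, and to find out which ones must be implemented together, is to inherit from the abstract base classes defined in the collections.abc module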
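
A short sketch of what inheriting from an abstract base class provides, using collections.abc.Sequence. Playlist and Incomplete are hypothetical classes written for this illustration: the first gets the mixin methods for free, the second cannot be instantiated because a required method is missing.

```python
from collections.abc import Sequence


class Playlist(Sequence):
    """Hypothetical sequence: only __getitem__ and __len__ are written by hand."""

    def __init__(self, *tracks):
        self._tracks = list(tracks)

    def __getitem__(self, index):
        return self._tracks[index]

    def __len__(self):
        return len(self._tracks)


class Incomplete(Sequence):
    """Forgets __len__, so the ABC refuses to instantiate it."""

    def __getitem__(self, index):
        raise IndexError(index)


p = Playlist("intro", "verse", "chorus", "verse")
print("chorus" in p)      # True, via the mixin __contains__
print(p.index("verse"))   # 1
print(p.count("verse"))   # 2
print(list(reversed(p)))  # ['verse', 'chorus', 'verse', 'intro']

try:
    Incomplete()
except TypeError as exc:
    print(exc)  # explains which abstract method is missing
```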

Caveats in Python

Mutable default arguments

def wrong_user_display(user_metadata: dict = {"name": "John", "age": 30}):
    name = user_metadata.pop("name")
    age = user_metadata.pop("age")

    return f"{name} ({age})"

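There are two problems

A mutable default value is used, and the function body modifies that mutable object directly, which causes a side effect
The default argument itself
1. Calling the function without arguments works only the first time
2. The interpreter creates the dictionary exactly once, when the function is defined, so once pop runs, those keys and values are gone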

print(wrong_user_display())  # John (30)
print(wrong_user_display())  # KeyError: 'name'
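
That the default object is created once and then shared can be observed directly. The sketch below uses a list instead of a dict; append_item and append_item_fixed are hypothetical names used only for this illustration.

```python
def append_item(item, bucket=[]):
    # The same list object is reused by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_item_fixed(item, bucket=None):
    # A fresh list is created per call when no bucket is passed.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_item("a")
second = append_item("b")
print(first is second)           # True
print(append_item.__defaults__)  # (['a', 'b'],)
print(append_item_fixed("a"), append_item_fixed("b"))  # ['a'] ['b']
```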

How to fix it?

Use None as the default and assign the real default value inside the function body

def user_display(user_metadata: dict = None):
    user_metadata = user_metadata or {"name": "John", "age": 30}
    name = user_metadata.pop("name")
    age = user_metadata.pop("age")

    return f"{name} ({age})"

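Extending Built-in Types

The correct way to extend a built-in type is not to subclass list, dict, and so on directly, but to subclass the classes provided by the collections module
- collections.UserDict
- collections.UserList
The reason is that in CPython, the C implementation of Python, the methods of the built-in types do not call one another, so overriding one of them is not reflected in the rest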

class BadList(list):
    def __getitem__(self, index):
        value = super().__getitem__(index)
        if index % 2 == 0:
            prefix = "even"
        else:
            prefix = "odd"
        return f"[{prefix}] {value}"

b1 = BadList((0, 1, 2, 3, 4, 5))
print(b1)
print(b1[0])  # [even] 0
print(b1[1])  # [odd] 1
print("".join(b1)) # TypeError: sequence item 0: expected str instance, int found

from collections import UserList

class GoodList(UserList):
    def __getitem__(self, index):
        value = super().__getitem__(index)
        if index % 2 == 0:
            prefix = "even"
        else:
            prefix = "odd"
        return f"[{prefix}] {value}"

g1 = GoodList((0, 1, 2, 3, 4, 5))
print(g1)
print(g1[0])  # [even] 0
print(g1[1])  # [odd] 1
print("".join(g1))  # [even] 0[odd] 1[even] 2[odd] 3[even] 4[odd] 5

[Book Review] Clean Code in Python, Chapter 2. Pythonic Code (1)

온별파파 2024. 9. 29. 22:53

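This post summarizes, with some additions, chapter 2 of the book "Clean Code in Python".

What is Pythonic code?

Roughly, the idioms used in the Python language

Why write Pythonic code

It generally performs better
The code is also smaller and easier to understand

Indexes and Slices

Python supports negative indexes, which access elements from the end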

my_numbers = (4, 5, 3, 9)
print(my_numbers[-1]) # 9
print(my_numbers[-3]) # 5

A slice returns the elements in a given range
- The end index is excluded

my_numbers = (1, 1, 2, 3, 5, 8, 13, 21)
print(my_numbers[2:5])  # (2, 3, 5)
print(my_numbers[::]) # (1, 1, 2, 3, 5, 8, 13, 21)

Adjusting the step

Jump two indexes at a time

my_numbers = (1, 1, 2, 3, 5, 8, 13, 21)
print(my_numbers[1:7:2])  # (1, 3, 8)

The built-in slice can also be called directly

my_numbers = (1, 1, 2, 3, 5, 8, 13, 21)

interval = slice(1, 7, 2)
print(my_numbers[interval]) # (1, 3, 8)

Creating Your Own Sequences

Indexing and slicing work thanks to the __getitem__ magic method
To declare that a class is a sequence, it must implement the Sequence interface from the collections.abc module

class C(Sequence):                      # Direct inheritance
    def __init__(self): ...             # Extra method not required by the ABC
    def __getitem__(self, index):  ...  # Required abstract method
    def __len__(self):  ...             # Required abstract method
    def count(self, value): ...         # Optionally override a mixin method

from collections.abc import Sequence

class Items:
    def __init__(self, *values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, item):
        return self._values.__getitem__(item)

items = Items(1, 2, 3)
print(items[2])  # 3
print(items[0:2]) # [1, 2]

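Keep the following in mind when implementing a sequence
- Indexing by a range should return an instance of the same type as the class. -> Breaking this rule can lead to subtle errors (the Items example above returns a plain list for a slice)
- The range given by a slice must exclude the last element. -> This keeps the class consistent with the Python language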
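The first rule can be sketched as follows. This is a hypothetical variant of the Items class, not the book's code: it subclasses Sequence and returns a new Items for a slice. The __repr__ method is added only to make the output readable.

```python
from collections.abc import Sequence


class Items(Sequence):
    """Hypothetical variant of Items that subclasses Sequence."""

    def __init__(self, *values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, item):
        # A slice returns a new Items instance; an integer index returns one element.
        if isinstance(item, slice):
            return type(self)(*self._values[item])
        return self._values[item]

    def __repr__(self):
        return f"Items{tuple(self._values)}"


items = Items(1, 2, 3)
print(items[2])        # 3
print(items[0:2])      # Items(1, 2)
print(2 in items)      # True, __contains__ comes from the Sequence mixin
print(items.index(3))  # 2, index() is also a mixin method
```

Subclassing Sequence also provides __contains__, __iter__, index and count for free, built on top of __getitem__ and __len__.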

Context managers

Useful when some code has to run with a precondition and a postcondition
- Context managers that manage resources are especially common

def process_file(fd):
    line = fd.readline()
    print(line)

fd = open("test.txt")
try:
    process_file(fd)
finally:
    print("file closed")
    fd.close()

123
file closed

The same behavior, written in a much more elegant and Pythonic way

def process_file(fd):
    line = fd.readline()
    print(line)

with open("test.txt") as fd:
    process_file(fd)

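A context manager consists of two magic methods

__enter__ : called when the with statement is entered
__exit__ : called when the last statement of the with block finishes and the context ends

__exit__ is still called even if an exception or error occurs inside the block, which makes it a convenient way to run cleanup code safely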
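A minimal sketch of that guarantee. The Tracked class and its events list are hypothetical names used only for illustration; the class records when each hook runs, including when the block raises.

```python
class Tracked:
    """Hypothetical context manager that records when its hooks run."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # exc_type is None when the block finished without raising.
        name = exc_type.__name__ if exc_type else "no error"
        self.events.append(f"exit ({name})")
        return False  # do not swallow the exception


tracked = Tracked()
try:
    with tracked:
        raise RuntimeError("boom")
except RuntimeError:
    pass

print(tracked.events)  # ['enter', 'exit (RuntimeError)']
```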

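Example: database backup

The backup has to be taken offline (while the database is not running) → the service must be stopped

Approach 1

Handle the whole sequence (stop the service → back up → handle exceptions and edge cases → restart the service) in a single function. The version below expresses the same sequence with a context manager instead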

def run(command):
    # Placeholder so the example runs; a real script would execute the command
    print(command)

def stop_database():
    run("systemctl stop postgresql.service")

def start_database():
    run("systemctl start postgresql.service")

class DBHandler:
    def __enter__(self):
        stop_database()
        return self

    def __exit__(self, exc_type, ex_value, ex_traceback):
        start_database()

def db_backup():
    run("pg_dump database")

def main():
    with DBHandler():
        db_backup()

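The block that uses DBHandler does not use the value returned by the context manager
- Returning something from __enter__ is still a good habit
main() runs the backup without caring about the maintenance steps. __exit__ is still called even if the backup fails
Think carefully about the return value of __exit__. Returning True means a raised exception is stopped there and not propagated to the caller, and swallowing exceptions like that is a bad habit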
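To see what swallowing looks like, here is a small sketch with a hypothetical Swallow class whose __exit__ returns True. The error raised inside the block never reaches the caller.

```python
class Swallow:
    """Hypothetical context manager whose __exit__ returns True."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        return True  # tells Python the exception was handled


reached = False
with Swallow():
    raise ValueError("this error disappears")
reached = True  # execution continues as if nothing happened

print(reached)  # True
```

Nothing in the calling code reveals that an error occurred, which is why returning True should be a deliberate and rare choice.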

Implementing context managers

1. Using the contextlib.contextmanager decorator

import contextlib

@contextlib.contextmanager
def db_handler():
    try:
        stop_database()  # (1)
        yield            # (2)
    finally:
        start_database() # (4)

with db_handler():
    db_backup()          # (3)

@contextlib.contextmanager

Turns the decorated function into a context manager
The function must be a generator; the decorator splits its statements between the __enter__ and __exit__ magic methods
- Everything before the yield keyword is treated as part of __enter__
- Everything after the yield keyword can be seen as the __exit__ logic
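The split around yield can be observed with a runnable sketch. The step function and the events list are hypothetical names; the list just records the order in which each part runs.

```python
import contextlib

events = []


@contextlib.contextmanager
def step(name):
    events.append(f"enter {name}")     # runs as __enter__
    try:
        yield name                     # value bound by "as"
    finally:
        events.append(f"exit {name}")  # runs as __exit__, even on error


with step("backup") as label:
    events.append(f"body {label}")

print(events)  # ['enter backup', 'body backup', 'exit backup']
```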

2. Using the contextlib.ContextDecorator class

import contextlib

def stop_database():
    print("stop database")

def start_database():
    print("start database")

def run(text):
    print(text)

class dbhandler_decorator(contextlib.ContextDecorator):
    def __enter__(self):
        stop_database()
        return self

    def __exit__(self, exc_type, ex_value, ex_traceback):
        start_database()

@dbhandler_decorator()
def offline_backup():
    run("pg_dump database")

offline_backup()

stop database
pg_dump database
start database

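There is no with statement: calling offline_backup runs it inside the context manager automatically
It is used as a decorator that wraps the original function
- The downside is that the two are completely independent, so the decorator knows nothing about the function (which is actually a good property)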

More features of contextlib

import contextlib

with contextlib.suppress(DataConversionException):
    parse_data(input_json_or_dict)

Ignores the given exception when you are sure it is safe to do so
If parse_data raises DataConversionException, the exception is suppressed and execution continues after the with block
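A runnable sketch of the same idea with a built-in exception, since DataConversionException and parse_data are not defined in these notes. The cache dictionary is a made-up example.

```python
import contextlib

cache = {"a": 1}

with contextlib.suppress(KeyError):
    del cache["missing"]  # raises KeyError, which is suppressed
    cache["b"] = 2        # never runs: the block is abandoned at the raise

print(cache)  # {'a': 1}
```

Note that the rest of the block is skipped once the exception is raised, so only the statement that may fail should go inside it.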

Comprehensions and assignment expressions

They make code more concise and more readable

def run_calculation(i):
    return i

numbers = []

for i in range(10):
    numbers.append(run_calculation(i))

print(numbers) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The code above can be written directly as a list comprehension

numbers = [run_calculation(i) for i in range(10)]

Instead of calling list.append repeatedly, the comprehension runs as a single Python construct, so it generally performs better

Comparing the bytecode with the dis module

import dis

def run_calculation(i):
    return i

def list_comprehension():
    numbers = [run_calculation(i) for i in range(10)]
    return numbers

# Disassemble the list comprehension function
dis.dis(list_comprehension)

def for_loop():
    numbers = []
    for i in range(10):
        numbers.append(run_calculation(i))
    return numbers

# Disassemble the for loop function
dis.dis(for_loop)

Bytecode of each version (the exact output depends on the Python version)

 # list comprehension
  6           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f8e5a78f710, file "example.py", line 6>)
              2 LOAD_CONST               2 ('list_comprehension.<locals>.<listcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_CONST               3 (10)
             10 CALL_FUNCTION            1
             12 GET_ITER
             14 CALL_FUNCTION            1
             16 RETURN_VALUE

 # for loop 
 10           0 BUILD_LIST               0
              2 STORE_FAST               0 (numbers)
 11           4 SETUP_LOOP              28 (to 34)
              6 LOAD_GLOBAL              0 (range)
              8 LOAD_CONST               1 (10)
             10 CALL_FUNCTION            1
             12 GET_ITER
        >>   14 FOR_ITER                16 (to 32)
             16 STORE_FAST               1 (i)
 12          18 LOAD_FAST                0 (numbers)
             20 LOAD_ATTR                1 (append)
             22 LOAD_GLOBAL              2 (run_calculation)
             24 LOAD_FAST                1 (i)
             26 CALL_FUNCTION            1
             28 CALL_METHOD              1
             30 POP_TOP
             32 JUMP_ABSOLUTE           14
        >>   34 POP_BLOCK
 13     >>   36 LOAD_FAST                0 (numbers)
             38 RETURN_VALUE

List comprehension example

import re
from typing import Iterable, Set

# Define the regex pattern for matching the ARN format
ARN_REGEX = r"arn:(?P<partition>[^:]+):(?P<service>[^:]+):(?P<region>[^:]*):(?P<account_id>[^:]+):(?P<resource_id>[^:]+)"

def collect_account_ids_from_arns(arns: Iterable[str]) -> Set[str]:
    """
    Given ARNs of the form arn:partition:service:region:account-id:resource-id, find and return the account IDs.
    """
    collected_account_ids = set()
    for arn in arns:
        matched = re.match(ARN_REGEX, arn)
        if matched is not None:
            account_id = matched.groupdict()["account_id"]
            collected_account_ids.add(account_id)
    return collected_account_ids

# Example usage
arns = [
    "arn:aws:iam::123456789012:user/David",
    "arn:aws:iam::987654321098:role/Admin",
    "arn:aws:iam::123456789012:group/Developers",
]

unique_account_ids = collect_account_ids_from_arns(arns)
print(unique_account_ids)
# {'123456789012', '987654321098'}

Focusing on the collect_account_ids_from_arns function from the code above,

def collect_account_ids_from_arns(arns: Iterable[str]) -> Set[str]:
    """
    Given ARNs of the form arn:partition:service:region:account-id:resource-id, find and return the account IDs.
    """
    collected_account_ids = set()
    for arn in arns:
        matched = re.match(ARN_REGEX, arn)
        if matched is not None:
            account_id = matched.groupdict()["account_id"]
            collected_account_ids.add(account_id)
    return collected_account_ids

This can be written more simply with a comprehension

def collect_account_ids_from_arns(arns: Iterable[str]) -> Set[str]:
    """
    Given ARNs of the form arn:partition:service:region:account-id:resource-id, find and return the account IDs.
    """

    matched_arns = filter(None, (re.match(ARN_REGEX, arn) for arn in arns))
    return {m.groupdict()["account_id"] for m in matched_arns}

Since Python 3.8, an assignment expression lets you rewrite it as a single statement

def collect_account_ids_from_arns(arns: Iterable[str]) -> Set[str]:
    """
    Given ARNs of the form arn:partition:service:region:account-id:resource-id, find and return the account IDs.
    """

    return {
        matched.groupdict()["account_id"]
        for arn in arns
        if (matched := re.match(ARN_REGEX, arn)) is not None
    }

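Each regex match result is bound to matched, and only the results that are not None are used to build the set

Shorter code is not always better code, but here the second and third versions are clearly an improvement over the first

Properties, attributes, and different types of object methods

Underscores in Python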

class Connector:
    def __init__(self, source):
        self.source = source
        self._timeout = 60

conn = Connector("postgresql://localhost")
print(conn.source)  # postgresql://localhost
print(conn._timeout)  # 60

print(conn.__dict__)  # {'source': 'postgresql://localhost', '_timeout': 60}

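The object has two attributes, source and timeout
- source is public and timeout is private
- In practice, though, both attributes are accessible
_timeout is meant to be used only inside Connector and never accessed from outside, so it can be refactored without worrying about the external interface

What about two underscores? (__timeout) → name mangling actually creates a different name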

_<classname>__<attribute-name>

class Connector:
    def __init__(self, source):
        self.source = source
        self.__timeout = 60

conn = Connector("postgresql://localhost")
print(conn.source)  # postgresql://localhost

print(conn.__dict__)  
# {'source': 'postgresql://localhost', '_Connector__timeout': 60}

__timeout → the actual name becomes _Connector__timeout
Name mangling exists so that classes extended several times can override names without collisions; using it to make attributes private is not Pythonic
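A short sketch showing that mangling only renames the attribute. The message variable is a name introduced here to capture the error text.

```python
class Connector:
    def __init__(self, source):
        self.source = source
        self.__timeout = 60  # stored as _Connector__timeout


conn = Connector("postgresql://localhost")

try:
    conn.__timeout  # outside the class the name is not mangled
except AttributeError as error:
    message = str(error)

print(message)                   # mentions that '__timeout' does not exist
print(conn._Connector__timeout)  # 60
```

The value is still reachable under the mangled name, so double underscores give no real privacy.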

Conclusion

⇒ Use a single underscore to mark an attribute as private

Properties

class Coordinate:
    def __init__(self, lat: float, long: float) -> None:
        self._latitude = self._longitude = None
        self.latitude = lat
        self.longitude = long

    @property
    def latitude(self) -> float:
        return self._latitude

    @latitude.setter
    def latitude(self, lat_value: float) -> None:
        # Compare against the bounds so that float values such as 10.5 are accepted
        if not -90 <= lat_value <= 90:
            raise ValueError(f"invalid latitude: {lat_value}")
        self._latitude = lat_value

    @property
    def longitude(self) -> float:
        return self._longitude

    @longitude.setter
    def longitude(self, long_value: float) -> None:
        if not -180 <= long_value <= 180:
            raise ValueError(f"invalid longitude: {long_value}")
        self._longitude = long_value

coord = Coordinate(10, 10)
print(coord.latitude)  # 10

coord.latitude = 190 # ValueError: invalid latitude: 190

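A method decorated with @property is a query that answers something
A setter is a command that does something

Keeping the two separate is a good way to follow the command-query separation principle

Creating classes with a more compact syntax

The usual boilerplate for initializing an object's values

Boilerplate: code that is repeated in every project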

def __init__(self, x, y, ...):
    self.x = x
    self.y = y

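Since Python 3.7, the dataclasses module can simplify the code above considerably (PEP-557)
- It provides the @dataclass decorator
When applied to a class, every annotated class attribute is treated as an instance attribute, as if it had been assigned in __init__
The @dataclass decorator generates the __init__ method automatically
It also provides field objects to mark attributes that need special treatment
- If an attribute has a mutable type such as list, an empty list cannot be used directly as its default. Without field, you would have to default it to None and then assign a proper value per instance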

from dataclasses import dataclass

@dataclass
class Foo:
    bar: list = []

# ValueError: mutable default <class 'list'> for field bar is not allowed: use default_factory

This is rejected because bar above would be a class variable shared by every Foo instance

class C:
  x = [] # class variable

  def add(self, element):
    self.x.append(element)

c1 = C()
c2 = C()
c1.add(1)
c2.add(2)
print(c1.x)  # [1, 2]
print(c2.x)  # [1, 2]

Instead, pass list as the default_factory parameter so that each instance gets its own initial value

from dataclasses import dataclass, field

@dataclass
class Foo:
    bar: list = field(default_factory=list)

There is no hand-written __init__ method, so what if you want to validate right after initialization?

⇒ Do it in __post_init__
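A minimal sketch of validation in __post_init__. The Coordinate dataclass and its tags field are hypothetical and only loosely mirror the property example earlier in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinate:
    lat: float
    long: float
    tags: list = field(default_factory=list)

    def __post_init__(self):
        # Runs right after the generated __init__ has assigned the fields.
        if not -90 <= self.lat <= 90:
            raise ValueError(f"invalid latitude: {self.lat}")
        if not -180 <= self.long <= 180:
            raise ValueError(f"invalid longitude: {self.long}")


print(Coordinate(10.5, 20.0))  # Coordinate(lat=10.5, long=20.0, tags=[])

try:
    Coordinate(190, 0)
except ValueError as error:
    print(error)  # invalid latitude: 190
```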


super()

온별파파 2024. 9. 15. 23:49


According to the official Python documentation, the super class does the following

Return a proxy object that delegates method calls to a parent or sibling class of type. This is useful for accessing inherited methods that have been overridden in a class.

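The official documentation is always hard to read.

Put simply, it returns a proxy object for a parent or sibling class, and that object can be used to call the superclass's methods.

In other words, super() gives access to the methods of the superclass

super() in single inheritance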

class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def area(self):
        return self.length * self.width

    def perimeter(self):
        return 2 * self.length + 2 * self.width

class Square(Rectangle):
    def __init__(self, length):
        super().__init__(length, length)

square = Square(4)
square.area() # 16

Square inherits from Rectangle, so it can use Rectangle's area() method

super() with parameters

super() can take two parameters
- First: the subclass
- Second: an instance of that subclass

class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def area(self):
        return self.length * self.width

    def perimeter(self):
        return 2 * self.length + 2 * self.width

class Square(Rectangle):
    def __init__(self, length):
        super(Square, self).__init__(length, length)

With single inheritance, super(Square, self) and super() mean the same thing inside Square

What about the following case?

class Cube(Square):
    def surface_area(self):
        face_area = super(Square, self).area()
        return face_area * 6

super(Square, self).area()

First argument: the subclass Square

Because it is Square rather than Cube, super(Square, self) returns a proxy that starts the method lookup at Rectangle, the parent of Square
As a result, area() is looked up starting from Rectangle

Q. What if Square implements its own area method?

super(Square, self) still starts the lookup at Rectangle, so Rectangle's area() is the one that gets called
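The answer can be checked with a small sketch. Square.area and Cube.nearest_area are hypothetical additions made only to show where each lookup starts.

```python
class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def area(self):
        return self.length * self.width


class Square(Rectangle):
    def __init__(self, length):
        super().__init__(length, length)

    def area(self):
        return "Square.area was called"


class Cube(Square):
    def surface_area(self):
        # Lookup starts after Square in the MRO, so Square.area is skipped.
        return super(Square, self).area() * 6

    def nearest_area(self):
        # Zero-argument super() here means super(Cube, self): lookup starts at Square.
        return super().area()


cube = Cube(3)
print([cls.__name__ for cls in Cube.__mro__])  # ['Cube', 'Square', 'Rectangle', 'object']
print(cube.surface_area())                     # 54
print(cube.nearest_area())                     # Square.area was called
```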

## Definition of the super class
class super(object):
    def __init__(self, type1=None, type2=None): # known special case of super.__init__
        """
        super() -> same as super(__class__, <first argument>)
        super(type) -> unbound super object
        super(type, obj) -> bound super object; requires isinstance(obj, type)
        super(type, type2) -> bound super object; requires issubclass(type2, type)
        Typical use to call a cooperative superclass method:
        class C(B):
            def meth(self, arg):
                super().meth(arg)
        This works for class methods too:
        class C(B):
            @classmethod
            def cmeth(cls, arg):
                super().cmeth(arg)
        """

        # (copied from class doc)

Second argument: it must be an instance of the first argument's class, or a subclass of it

print(issubclass(Cube, Square)) # True


The Walrus Operator: Python's Assignment Expressions

온별파파 2024. 8. 31. 17:32


Its official name is the assignment expression operator, but it is also called the walrus operator.

The name comes from the fact that := resembles the eyes and tusks of a walrus.
It is sometimes also called the colon equals operator.

It was introduced in Python 3.8.

https://dev.to/davidarmendariz/python-walrus-operator-j13

Statement vs Expression in Python

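The official name of the walrus operator, assignment expression operator, contains the word expression.

Statement and expression are easy to confuse in Python. In short:

statement: a unit of code that performs an action; any complete instruction is a statement
expression: code that evaluates to a value, built from operators and operands

Example

x = 25          # a statement
x = x + 10      # also a statement; its right-hand side, x + 10, is an expression

The first statement creates the variable x.
In the second, the expression x + 10 is evaluated and the result is assigned back to x.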

>>> walrus = False # (1)
>>> walrus
False

>>> (walrus := True) # (2)
True
>>> walrus
True

walrus = False assigns the value False to walrus. (traditional statement)
(walrus := True) is an assignment expression that assigns the value True to walrus.

One subtle difference is that walrus = False does not return a value, whereas (walrus := True) does!

>>> walrus = False
>>> (walrus := True)
True

Why it was introduced

The abstract of PEP 572 proposes a way to assign to variables inside an expression:

creating a way to assign to variables within an expression using the notation NAME := expr.

In C, assigning a value to a variable is itself an expression, which is powerful but can also produce bugs that are hard to find.

#include <stdio.h>

int main(){
    int x = 3, y = 8;
    if (x = y) {
        printf("x and y are equal (x = %d, y = %d)", x, y);
    }
    return 0;
}

The code is meant to print both values only when x and y are equal, so with different values you would expect no output. In practice, the print statement runs. Why?

x and y are equal (x = 8, y = 8)

The problem is that the line if (x = y) uses the assignment operator (=) instead of the equality comparison operator (==). An if condition must be an expression, and in C x = y is an expression: it assigns 8 to x and evaluates to 8, which is nonzero and therefore treated as true, so the message is printed.

What about Python?

x, y = 3, 8
if x = y:
    print(f"x and y are equal ({x = }, {y = })")

SyntaxError: invalid syntax. Maybe you meant '==' or ':=' instead of '='?

Python raises a SyntaxError because x = y is a statement, not an expression. Python keeps the two clearly apart, and the walrus operator follows the same design principle: it cannot be used in place of an ordinary assignment statement.

>>> walrus := True
  File "<stdin>", line 1
    walrus := True
           ^
SyntaxError: invalid syntax

In many cases the syntax error can be avoided by wrapping the assignment expression in parentheses.

>>> (walrus := True)  # Valid, but regular assignments are preferred
True

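Usage examples

The walrus operator is useful for simplifying code that repeats the same expression.

(1) Verifying a formula

For example, it can help when you write a complex formula as code and then need to verify and debug it.

Suppose we have the following formula (the haversine formula, which gives the distance between two points on the Earth's surface)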

$$
2 \cdot \text{r} \cdot \arcsin\left(
    \sqrt{
        \sin^2\left(\frac{\phi_2 - \phi_1}{2}\right)
        + \cos(\phi_1) \cdot \cos(\phi_2) \cdot \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)
    }
\right)
$$

ϕ: latitude, λ: longitude

Using the formula to compute the distance between Oslo (59.9°N 10.8°E) and Vancouver (49.3°N 123.1°W):

from math import asin, cos, radians, sin, sqrt
# Approximate radius of Earth in kilometers
rad = 6371
# Locations of Oslo and Vancouver
ϕ1, λ1 = radians(59.9), radians(10.8)
ϕ2, λ2 = radians(49.3), radians(-123.1)
# Distance between Oslo and Vancouver
print(2 * rad * asin(
    sqrt(
        sin((ϕ2 - ϕ1) / 2) ** 2
        + cos(ϕ1) * cos(ϕ2) * sin((λ2 - λ1) / 2) ** 2
    )
))

# 7181.7841229421165 (km)

To verify the formula you may need to inspect part of it, which you could do by copying and pasting a subexpression.
With the walrus operator, this becomes:

2 * rad * asin(
    sqrt(
        (ϕ_hav := sin((ϕ2 - ϕ1) / 2) ** 2)
        + cos(ϕ1) * cos(ϕ2) * sin((λ2 - λ1) / 2) ** 2
    )
)

# 7181.7841229421165

ϕ_hav
# 0.008532325425222883

The whole expression is evaluated while ϕ_hav stays available for inspection, which reduces the chance of copy & paste mistakes.

(2) The walrus operator with lists

numbers = [2, 8, 0, 1, 1, 9, 7, 7]

Suppose we want to store the length, sum, and mean of this list in a dictionary

description = {
    "length": len(numbers),
    "sum": sum(numbers),
    "mean": sum(numbers) / len(numbers),
}

print(description) # {'length': 8, 'sum': 35, 'mean': 4.375}

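In description, len and sum are each called twice on numbers
That is not a problem for a short list, but it is worth optimizing for longer lists or more expensive computations

Of course, you can define len_numbers and sum_numbers outside the dictionary and use them, as below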

numbers = [2, 8, 0, 1, 1, 9, 7, 7]

len_numbers = len(numbers)
sum_numbers = sum(numbers)

description = {
    "length": len_numbers,
    "sum": sum_numbers,
    "mean": sum_numbers / len_numbers,
}

print(description) # {'length': 8, 'sum': 35, 'mean': 4.375}

With the walrus operator, however, len_numbers and sum_numbers can be assigned inside the dictionary itself, avoiding the repeated calls

numbers = [2, 8, 0, 1, 1, 9, 7, 7]

description = {
    "length": (len_numbers := len(numbers)),
    "sum": (sum_numbers := sum(numbers)),
    "mean": sum_numbers / len_numbers,
}

print(description) # {'length': 8, 'sum': 35, 'mean': 4.375}

This makes it clear to readers that len_numbers and sum_numbers exist only to optimize the computation inside the dictionary and are not meant to be used again

(3) Counting lines, words, and characters in a text file

# wc.py
import pathlib
import sys

for filename in sys.argv[1:]:
    path = pathlib.Path(filename)
    counts = (
        path.read_text().count("\n"),  # Number of lines
        len(path.read_text().split()),  # Number of words
        len(path.read_text()),  # Number of characters
    )
    print(*counts, path) # 11 32 307 wc.py

The wc.py file consists of 11 lines, 32 words, and 307 characters
path.read_text() is called repeatedly in the code above ⇒ improving it with the walrus operator:

import pathlib
import sys

for filename in sys.argv[1:]:
    path = pathlib.Path(filename)
    counts = (
        (text := path.read_text()).count("\n"),  # Number of lines
        len(text.split()),  # Number of words
        len(text),  # Number of characters
    )
    print(*counts, path)

Of course, using a separate text variable as below adds one line but makes the code much more readable.

import pathlib
import sys

for filename in sys.argv[1:]:
    path = pathlib.Path(filename)
    text = path.read_text()
    counts = (
        text.count("\n"),  # Number of lines
        len(text.split()),  # Number of words
        len(text),  # Number of characters
    )
    print(*counts, path)

So even when the walrus operator makes code shorter, readability still has to be considered.

(4) List Comprehensions

The walrus operator can be effective when a list comprehension uses an expensive function.

import time

t_start = time.time()

def slow(num):
    time.sleep(5)
    return num

numbers = [4, 3, 1, 2, 5]

results = [slow(num) for num in numbers if slow(num) > 4]

t_end = time.time()

print("elapsed time: ", t_end - t_start)

elapsed time: 30.01522707939148

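This code applies slow to each element of numbers and stores the result in results only when it is greater than 4
The problem is that slow is called twice for every element that passes the filter
- once to check whether the result is greater than 4
- once more to produce the value stored in results

The most common fix is to use a for loop instead of the list comprehension.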

import time

t_start = time.time()

def slow(num):
    time.sleep(5)
    return num

numbers = [4, 3, 1, 2, 5]

results = []
for num in numbers:
    slow_num = slow(num)
    if slow_num > 4:
        results.append(slow_num)

t_end = time.time()

print("elapsed time: ", t_end - t_start)

elapsed time: 25.021725063323975

slow is now called exactly once per element
But the code is longer and less readable

With the walrus operator you can keep the list comprehension, and its readability, while calling slow only once per element

import time

t_start = time.time()

def slow(num):
    time.sleep(5)
    return num

numbers = [4, 3, 1, 2, 5]

results = [slow_num for num in numbers if (slow_num := slow(num)) > 4]
print(results)

t_end = time.time()

print("elapsed time: ", t_end - t_start)

elapsed time: 25.018176908493042
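The difference in timings (about 30 s versus 25 s) comes down to the number of calls. The sketch below counts calls instead of sleeping; the calls list and the variable names are assumptions made for this illustration.

```python
calls = []


def slow(num):
    calls.append(num)  # record each call instead of sleeping
    return num


numbers = [4, 3, 1, 2, 5]

calls.clear()
twice = [slow(num) for num in numbers if slow(num) > 4]
calls_without_walrus = len(calls)

calls.clear()
once = [value for num in numbers if (value := slow(num)) > 4]
calls_with_walrus = len(calls)

print(twice, calls_without_walrus)  # [5] 6
print(once, calls_with_walrus)      # [5] 5
```

Six calls at 5 seconds each match the 30-second run, and five calls match the 25-second runs.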

(5) While Loop

question = "Do you use the walrus operator?"
valid_answers = {"yes", "Yes", "y", "Y", "no", "No", "n", "N"}

user_answer = input(f"\n{question} ")
while user_answer not in valid_answers:
    print(f"Please answer one of {', '.join(valid_answers)}")
    user_answer = input(f"\n{question} ")

The code above repeats the input call that reads the user's answer
A common way to improve it is to rewrite the loop with while True and break

question = "Do you use the walrus operator?"
valid_answers = {"yes", "Yes", "y", "Y", "no", "No", "n", "N"}

while True:
    user_answer = input(f"\n{question} ")
    if user_answer in valid_answers:
        break
    print(f"Please answer one of {', '.join(valid_answers)}")

The walrus operator makes the while loop more concise

question = "Do you use the walrus operator?"
valid_answers = {"yes", "Yes", "y", "Y", "no", "No", "n", "N"}

while (user_answer := input(f"\n{question} ")) not in valid_answers:
    print(f"Please answer one of {', '.join(valid_answers)}")

The user's input is stored in user_answer and checked against valid_answers in the same expression, which improves readability
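The same pattern works for any loop that reads until a sentinel value. A non-interactive sketch, assuming an in-memory stream as a stand-in for a file or socket:

```python
import io

stream = io.BytesIO(b"abcdefghij")  # stand-in for a file or socket
chunks = []

# read() returns b"" at end of stream, which is falsy and ends the loop.
while chunk := stream.read(4):
    chunks.append(chunk)

print(chunks)  # [b'abcd', b'efgh', b'ij']
```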

Reference

https://realpython.com/python-walrus-operator/


urllib: Python's built-in package for working with URLs

온별파파 2024. 8. 25. 04:16


There are three main packages for working with URLs in Python: urllib, urllib3, and requests

urllib is a built-in package; the other two are third-party

Usage

1. Basic usage

from urllib.request import urlopen                     # (1)

with urlopen("https://www.example.com") as response:   # (2)
    body = response.read()                             # (3)
    print(type(body))                                  # (4)

(1) urllib.request is part of the standard library, so nothing needs to be installed. urlopen is used to send the HTTP request

(2) The with statement (a context manager) sends the request and gives back the response

(3) response is an http.client.HTTPResponse object

Its read method returns the body as bytes

(4) Printing the type of body confirms that it is bytes

2. GET request for json format response

When working with APIs, the response is often in JSON format

from urllib.request import urlopen
import json                                            # (1)

url = "https://jsonplaceholder.typicode.com/todos/1"   # (2)
with urlopen(url) as response:
    body = response.read()

print("body: ", body)                                  # (3)
# body:  b'{\n  "userId": 1,\n  "id": 1,\n  "title": "delectus aut autem",\n  "completed": false\n}'

# json bytes to dictionary
todo_item = json.loads(body)                           # (4)
print(todo_item)
# {'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}

(1) Import the json package alongside urllib to handle the JSON format

(2) A sample API endpoint that returns JSON data

(3) Printing the response shows JSON as bytes. The json package is needed to turn it into a dictionary

(4) json.loads converts the JSON bytes into a Python dictionary

3. Getting the response headers

from urllib.request import urlopen
from pprint import pprint

with urlopen("https://www.example.com") as response:
    pprint(response.headers.items())                       # (1)
    pprint(response.getheader("Connection")) # 'close'     # (2)

The header information is available through the response's headers.items()

(1) Printing the headers with pprint (pretty print) gives the following

[('Accept-Ranges', 'bytes'), ('Age', '78180'), ('Cache-Control', 'max-age=604800'), ('Content-Type', 'text/html; charset=UTF-8'), ('Date', 'Sat, 24 Aug 2024 18:10:20 GMT'), ('Etag', '"3147526947"'), ('Expires', 'Sat, 31 Aug 2024 18:10:20 GMT'), ('Last-Modified', 'Thu, 17 Oct 2019 07:18:26 GMT'), ('Server', 'ECAcc (lac/5598)'), ('Vary', 'Accept-Encoding'), ('X-Cache', 'HIT'), ('Content-Length', '1256'), ('Connection', 'close')]

(2) An individual header can be read with the getheader method

4. Converting bytes to a string

from urllib.request import urlopen

with urlopen("https://www.example.com") as response:
    body = response.read()                            
    print(type(body)) # <class 'bytes'>                       # (1)

decoded_body = body.decode("utf-8")                           # (2)
print(type(decoded_body)) # <class 'str'>                     # (3)
print(decoded_body[:30])

(1) The type of body is bytes, and it looks like this

b'<!doctype html>\n<html>\n<head>\n

(2) Use the decode method to convert bytes to a string (passing "utf-8" as the parameter)

(3) The type of decoded_body is str, and the beginning of decoded_body looks like this

<!doctype html>
<html>
<head>

5. Writing bytes to a file

There are two main approaches

Write the bytes straight to a file, without encoding or decoding

from urllib.request import urlopen

with urlopen("https://www.example.com") as response:
    body = response.read()

with open("example.html", mode="wb") as html_file:
    html_file.write(body)

Open the file in write binary (wb) mode and write the bytes directly to example.html
Running the code creates the example.html file

When the contents need to be encoded for the file

from urllib.request import urlopen

with urlopen("https://www.google.com") as response:
    body = response.read()

character_set = response.headers.get_content_charset()        # (1) 
content = body.decode(character_set)                          # (2)

with open("google.html", encoding="utf-8", mode="w") as file: # (3)
    file.write(content)

(1)&(2): A site like Google may use a different encoding depending on the location. So get_content_charset() is used to find the encoding before decoding the bytes into a string

(3) The decoded string is encoded again as utf-8 and saved to google.html
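One caveat worth sketching: get_content_charset() returns None when the Content-Type header declares no charset, and body.decode(None) would then fail. The decode_body helper and its utf-8 fallback are assumptions made here, not part of the original notes. The example builds email.message.Message objects by hand (the base class of response.headers) so that it runs without network access.

```python
from email.message import Message


def decode_body(body: bytes, headers: Message, fallback: str = "utf-8") -> str:
    """Decode with the declared charset, or with fallback when none is declared."""
    charset = headers.get_content_charset() or fallback
    return body.decode(charset)


declared = Message()
declared["Content-Type"] = "text/html; charset=ISO-8859-1"

undeclared = Message()
undeclared["Content-Type"] = "text/html"

print(declared.get_content_charset())    # iso-8859-1
print(undeclared.get_content_charset())  # None
print(decode_body("café".encode("latin-1"), declared))  # café
print(decode_body("café".encode("utf-8"), undeclared))  # café
```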

References

https://realpython.com/lessons/python-urllib-request-overview/


[Book Review] Clean Code in Python, Chapter 1. Code Formatting and Tools

온별파파 2024. 7. 14. 21:59


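This post summarizes Chapter 1 of the book "Clean Code in Python", with some additions of my own.

What clean code means

It is something that machines or scripts cannot judge; only a professional can
A programming language is a way of communicating ideas to other developers, and that is where the true essence of clean code lies

⇒ Rather than trying to define clean code, it is better to look at the differences between good and bad code, learn to recognize great code and good architecture, and form your own definition

Why clean code matters

It enables agile development and continuous delivery

Code has to stay maintainable and readable; otherwise every new feature request from a planner costs a lot of time in refactoring and paying off technical debt

When technical debt accumulates

It is a latent problem that can turn into an unexpected incident someday

The role of code formatting in clean code

PEP (Python Enhancement Proposal)

PEP 8: Style Guide for Python Code is the best-known standard; it provides guidelines on whitespace, naming conventions, line length limits, and more

Formatting plays a supporting role for clean code. Even code that follows PEP 8 100% may still fall short of being clean

Characteristics of PEP 8

Searchability

PEP 8 distinguishes between assigning a value to a variable and passing a value to a keyword parameter of a function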

# core.py
#
#
#

def get_location(location: str = ""):
    pass

# Spaces around = when assigning to a variable
current_location = get_location() 

# No spaces around = when passing a keyword argument
get_location(location=current_location)

To find where a value is passed to the keyword argument location

$ grep -nr "location=" .
./core.py:13:get_location(location=current_location)

The -nr options
- n : show line numbers
- r : search subdirectories of the given directory recursively

To find where a value is assigned to a variable named location

$ grep -nr "location =" .
./core.py:10:current_location = get_location()

Consistency
Better error handling
Code quality

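Documentation

Code comments

Aim to have as few comments as possible
- There should never be commented-out code

Docstring

Documentation embedded in the source code (a string literal)
When another engineer wants to use a component I wrote, the docstring should tell them how it behaves and what its inputs and outputs are
Python is dynamically typed, so docstrings are a great help

A docstring is not separate from or independent of the code; it is part of it

The downside is that it needs continuous manual upkeep (it has to be updated whenever the code changes)

⇒ To produce documentation that is worth having, every team member has to put effort into it

Annotations

Annotations were introduced in PEP-3107
- The idea is to give users of the code a hint about what values a function expects as arguments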

from dataclasses import dataclass

@dataclass
class Point:
    lat: float
    long: float

def locate(latitude: float, longitude: float) -> Point:
    """Look up the object at the given coordinates on the map"""
    pass

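They give hints to users of the function, but Python does not check or enforce the types

Annotations are not limited to types: anything the interpreter accepts as a valid expression (a string describing the intent of a variable, a callable to use as a callback or validator) is allowed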
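Both points can be checked quickly. In this sketch the string annotation on longitude and the deliberately wrong arguments are made up for illustration, and locate returns a string instead of a Point to keep the example self-contained.

```python
def locate(latitude: float, longitude: "degrees east of Greenwich") -> str:
    return f"{latitude}, {longitude}"


# Wrong argument types are accepted: nothing is checked at run time.
print(locate("north", None))                # north, None
print(locate.__annotations__["longitude"])  # degrees east of Greenwich
```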

Example: a function that runs a task after some number of seconds

def launch_task(delay_in_seconds):
    pass

The delay_in_seconds parameter has a long name that seems to carry a lot of information, but it does not actually tell us enough
- How many seconds of delay are acceptable?
- Can fractional values be passed?

Seconds = float
def launch_task(delay: Seconds):
    pass

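The Seconds annotation adds a small abstraction over how the time value should be interpreted
If the form of the input value has to change later, there is now only one place to change
- Seconds = float

Using annotations creates a special attribute called __annotations__

It is a dictionary that maps the names of the annotations to their values

Printing __annotations__ for the locate function below gives: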

from dataclasses import dataclass

@dataclass
class Point:
    lat: float
    long: float

def locate(latitude: float, longitude: float) -> Point:
    """Look up the object at the given coordinates on the map"""
    pass

print(locate.__annotations__)

{'latitude': <class 'float'>, 'longitude': <class 'float'>, 'return': <class '__main__.Point'>}

Type hints are not only about checking data types; they also encourage meaningful names and appropriate data type abstractions

def process_clients(clients: list): ...

def process_clients(clients: list[tuple[int, str]]): ...

...

Client = tuple[int, str]
def process_clients(clients: list[Client]): ...

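Annotations also make it possible to write classes more compactly and to define small container objects easily

With the @dataclass decorator, annotated attributes are recognized as instance attributes without declaring and assigning them in a separate __init__ method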

# Before
class Point:
    def __init__(self, lat, long):
        self.lat = lat
        self.long = long

# After
from dataclasses import dataclass

@dataclass
class Point:
    lat: float
    long: float

print(Point.__annotations__) 
# {'lat': <class 'float'>, 'long': <class 'float'>}
print(Point(1, 2))
# Point(lat=1, long=2)

Q. Do annotations replace docstrings?

The two are complementary

def data_from_response(response: dict) -> dict:
    if response["status"] != 200:
        raise ValueError
    return {"data": response["payload"]}

This tells us the types of the input and output, but not the details
- e.g. what a valid response object looks like

Those details can be filled in with a docstring

def data_from_response(response: dict) -> dict:
    """Return the payload of the response if its HTTP status is 200

    - Example of a response::
    {
        "status": 200, # <int>
        "timestamp": "....", # ISO-format string of the current time
        "payload": {...} # the dictionary data to return
    }
    """
    if response["status"] != 200:
        raise ValueError
    return {"data": response["payload"]}

This makes the expected shape of the input and output easier to understand, and it is useful information for unit tests as well.
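To make the unit-test point concrete, the response example from the docstring can be turned into test input almost directly. The sketch below assumes the standard unittest module; the test class name and the sample timestamp and payload values are invented for illustration.

```python
import unittest


def data_from_response(response: dict) -> dict:
    if response["status"] != 200:
        raise ValueError
    return {"data": response["payload"]}


class DataFromResponseTest(unittest.TestCase):
    def test_ok_status_returns_payload(self):
        # Shape copied from the docstring example; values are made up.
        response = {
            "status": 200,
            "timestamp": "2024-07-14T00:00:00",
            "payload": {"id": 1},
        }
        self.assertEqual(data_from_response(response), {"data": {"id": 1}})

    def test_other_status_raises(self):
        with self.assertRaises(ValueError):
            data_from_response({"status": 500, "timestamp": "", "payload": {}})
```

Saved in a file, this can be run with `python -m unittest`.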

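Tool configuration

Basic tool setup that runs code checks automatically to cut down on repetitive manual verification

Data type consistency checks

Tools such as mypy and pytype can be included in the CI build.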

$ pip install mypy

from typing import Iterable
import logging

logger = logging.getLogger()

def broadcast_notification(message: str, relevant_user_emails: Iterable[str]):
    for email in relevant_user_emails:
        logger.warning(f"Sending {message} message to {email}")

broadcast_notification("welcome", "user1@domain.com")

# mypy reports no errors
$ mypy core.py
Success: no issues found in 1 source file

Sending welcome message to u
Sending welcome message to s
Sending welcome message to e
Sending welcome message to r
Sending welcome message to 1
Sending welcome message to @
Sending welcome message to d
Sending welcome message to o
Sending welcome message to m
Sending welcome message to a
Sending welcome message to i
Sending welcome message to n
Sending welcome message to .
Sending welcome message to c
Sending welcome message to o
Sending welcome message to m

This is a wrong call. A string is also an iterable object, so the for loop runs without error, but none of the values is a valid email address.
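The reason mypy accepts the call can be checked directly: a str is an iterable, and each element it yields is again a str, so it satisfies Iterable[str]. A minimal check:

```python
from collections.abc import Iterable

email = "user1@domain.com"

print(isinstance(email, Iterable))                 # True
print(list(email)[:4])                             # ['u', 's', 'e', 'r']
print(all(isinstance(ch, str) for ch in email))    # True
```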

Restricting the type more strictly, so that only lists or tuples are accepted, gives:

from typing import List, Tuple, Union
import logging

logger = logging.getLogger()

def broadcast_notification(
    message: str, relevant_user_emails: Union[List[str], Tuple[str]]  # Tuple[str] is a 1-tuple; Tuple[str, ...] accepts any length
):
    for email in relevant_user_emails:
        logger.warning(f"Sending {message} message to {email}")

broadcast_notification("welcome", "user1@domain.com")

$ mypy core.py
core.py:14: error: Argument 2 to "broadcast_notification" has incompatible type "str"; expected "list[str] | tuple[str]"  [arg-type]
Found 1 error in 1 file (checked 1 source file)

General code validation

Beyond data types, general quality checks are also possible:
pycodestyle (pep8), flake8
the stricter pylint

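Automatic formatting

black formatter
Formats code more strictly than PEP-8, so you can focus on the core of the problem
The --check option only verifies that the code follows the standard, without reformatting it
- Useful when integrated into a CI process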

Automatic check setup

On Linux development environments, the most common way to automate builds is a Makefile.
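The notes stop at the Makefile, so the following is not the book's setup. It is a Python-only sketch of the same idea, running the tools named above one after another and collecting the failures. The `src` directory, the `CHECKS` list, and the `run_checks` helper are all assumptions made for this example.

```python
import subprocess

# Assumed commands; adjust the tools and the target path to the project.
CHECKS = [
    ["mypy", "src"],
    ["pylint", "src"],
    ["black", "--check", "src"],
]


def run_checks(checks=CHECKS, runner=subprocess.call):
    """Run each command and return the ones that exited with a non-zero code."""
    failed = []
    for command in checks:
        if runner(command) != 0:
            failed.append(command)
    return failed


# Typical use in a CI step (not executed here):
#   import sys
#   sys.exit(1 if run_checks() else 0)
```

With the default runner, a tool that is not installed raises FileNotFoundError instead of being reported as a failed check. A Makefile target that chains the same commands achieves the same result with less code.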

Attribution, NonCommercial, NoDerivatives


Crawling with Python (Feat. arm64, graviton)

온별파파 2024. 1. 6. 02:59


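Situation

I tried to run a Python script that crawls images from Google as a pod on a Kubernetes cluster.
I tested it along the Script → Docker container path, confirmed that it worked, then deployed it as a Pod, and it failed…
Until then I had assumed that anything that works in a Docker container also works in a Pod.
Conclusion: when thinking about the execution environment, the CPU architecture has to be considered too!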

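Process

1. Python script that crawls images from Google (using Chrome)

Selenium is a framework for web testing, and along with BeautifulSoup it is one of the most commonly used tools for crawling.

The code reproduces what we actually do when searching on Google: type a keyword into the search box, wait, and scroll down. Reading it feels like the macros I used in games as a kid (~~mining ore in 라스트킹덤~~).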

https://www.browserstack.com/guide/selenium-webdriver-tutorial

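As the figure at the link above shows, Python code uses the Selenium package to send commands and requests through a browser driver to a real browser and receive the responses. This requires an environment with the following three pieces:

Chrome web browser
ChromeDriver
Python Selenium package

What matters here is version compatibility between the Chrome browser and ChromeDriver. The browser updates frequently, and if ChromeDriver is not updated along with it, crawling stops working at some point. This post pins both to a specific version.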
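A quick guard against version drift can be sketched as a comparison of the two version strings. This assumes that matching major versions is a good enough signal, which is a simplification; `major_version` and `versions_match` are hypothetical helpers, and "121.0.0.0" is a made-up version used only to show a mismatch.

```python
def major_version(version: str) -> int:
    """Return the first dotted component, e.g. 120 for '120.0.6099.109'."""
    return int(version.split(".")[0])


def versions_match(browser: str, driver: str) -> bool:
    """Assumption: browser and driver are compatible when major versions agree."""
    return major_version(browser) == major_version(driver)


print(versions_match("120.0.6099.109", "120.0.6099.109"))  # True
print(versions_match("121.0.0.0", "120.0.6099.109"))       # False
```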

The environment used was as follows.

x86_64, Ubuntu 20.04.6 LTS, 1 CPU, 1 GB

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal

$ uname -a 
Linux ubuntu-s-1vcpu-512mb-10gb-sfo3-01 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

1) Install the Chrome web browser

As of this writing (2023.12.25), the stable version of the Chrome web browser is 120.0.6099.109

(https://googlechromelabs.github.io/chrome-for-testing/#stable)

# Download and unpack the Chrome web browser (working directory: /root/crawling)
$ wget https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.109/linux64/chrome-linux64.zip
$ unzip chrome-linux64.zip

2) Install ChromeDriver

Install the ChromeDriver that matches the version above

# Download and unpack ChromeDriver (working directory: /root/crawling)
$ wget https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.109/linux64/chromedriver-linux64.zip
$ unzip chromedriver-linux64.zip

3) Install the Python selenium package

$ pip install "selenium==4.15.1"

The crawling code is based on a YouTube video; it saves the images that come up when searching for the keyword apple.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # for pressing Enter
import time
import urllib.request
import os

options = webdriver.ChromeOptions()

options.add_argument("--headless")  # 'no window': needed because a server cannot open a browser window
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")  # do not use /dev/shm shared memory
options.add_argument("--single-process")
options.binary_location = '/root/crawling/chrome-linux64/chrome'

service = webdriver.ChromeService(executable_path='/root/crawling/chromedriver-linux64/chromedriver')

driver = webdriver.Chrome(service=service, options=options)

URL = "https://www.google.co.kr/imghp"
KEYWORD = "apple"
driver.get(url=URL)

# time.sleep waits a fixed time; implicitly_wait is flexible, with time_to_wait as the maximum
driver.implicitly_wait(time_to_wait=10)

keyElement = driver.find_element(By.NAME, "q")
keyElement.send_keys(KEYWORD)
keyElement.send_keys(Keys.RETURN)  # press Enter

bodyElement = driver.find_element(By.TAG_NAME, "body")
time.sleep(5)  # wait for the images to load after pressing Enter

image_candidates = []

print("Crawling Images Start!")

for i in range(1):
    bodyElement.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)

    images = driver.find_elements(
        By.XPATH, '//*[@id="islrg"]/div[1]/div/a[1]'
    )  # the XPATH can change; check it with the browser's developer tools

    image_candidates.append(images)

for images in image_candidates:
    for idx, image in enumerate(images):
        image.send_keys(Keys.ENTER)
        time.sleep(0.5)

        high_images = driver.find_elements(
            By.XPATH,
            '//*[@id="Sva75c"]/div[2]/div[2]/div[2]/div[2]/c-wiz/div/div/div/div/div[3]/div[1]/a/img[1]', # the XPATH can change; check it with the browser's developer tools
        )
        try:
            real_image = high_images[0].get_attribute("src")
        except Exception as e:
            print(f"Exception: {e}")
            continue
        try:
            urllib.request.urlretrieve(
                real_image,
                os.path.join(os.getcwd(), str(idx)) + ".jpg",
            )
        except Exception as e:
            print(e)

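I confirmed that it worked, built it into a Docker image, and tried to run it as a Kubernetes pod, but the following error occurred.

OSError: [Errno 8] Exec format error: '~~/chromedriver-linux64/chromedriver'

The error message is not very clear.

After trying various things, I found that I had to download a Chrome browser and driver that match the CPU architecture. It turned out that the Kubernetes pod was scheduled on an AWS Graviton-based instance, which has an arm64 CPU!

In the end, downloading the arm64 builds of the Chrome browser and driver solved the problem.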
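An "Exec format error" like the one above can be diagnosed before launching the driver by reading the architecture field of the binary's ELF header and comparing it with the host. This is a sketch of one way to do that, not what was done in this post: `elf_machine` and `check_binary` are hypothetical helpers, and only the two architectures discussed here are mapped.

```python
import platform
import struct

# e_machine values from the ELF specification (only the two relevant here)
ELF_MACHINES = {0x3E: "x86_64", 0xB7: "aarch64"}


def elf_machine(path):
    """Return the CPU architecture an ELF binary was built for."""
    with open(path, "rb") as f:
        header = f.read(20)
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError(f"{path} is not an ELF binary")
    # Byte 5 (EI_DATA): 1 = little-endian, 2 = big-endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 2-byte field at offset 18
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown({machine:#x})")


def check_binary(path, host=None):
    """Raise a readable error when the binary does not match the host CPU."""
    host = host or platform.machine()  # e.g. 'x86_64' or 'aarch64' on Linux
    built_for = elf_machine(path)
    if built_for != host:
        raise RuntimeError(f"{path} is built for {built_for}, host is {host}")
    return built_for


# Example call (path taken from the script above; not executed here):
#   check_binary("/root/crawling/chromedriver-linux64/chromedriver")
```

Run inside the pod, a check like this would report the mismatch by name instead of the generic OSError.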

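ARM is a CPU architecture that is not compatible with the x86 processors from Intel or AMD. It was mainly used where low power and high efficiency matter (ex) smartphones), but it has advanced enough that it is now increasingly used in PCs as well. AWS Graviton, mentioned above, and Apple's M1 and M2 chips are examples. The experience showed me that CPU architecture also has to be considered when working in a cloud environment.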

References

[python] selenium 이용한 구글 이미지 크롤링 및 이미지 저장 (https://code-code.tistory.com/165)
https://velog.io/@480/이제는-개발자도-CPU-아키텍처를-구분해야-합니다
https://www.browserstack.com/guide/selenium-webdriver-tutorial

Attribution, NonCommercial, NoDerivatives
