[Python Crawling] Web Scraping Project


Yeni_aa 2022. 7. 5. 19:58

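Instructions

To turn data on the web into structured data, first define a class. It needs the following member variables.
- Country name
- Capital
- Population
- Area
Fetch every element that holds a country's information, turn each one into an instance of the Python class Country, and append it to country_list.
Build a separate list containing only the capitals of all countries. Sort it alphabetically with sort() or sorted(), then find and print the 30th element.
People often speak of a global village of 6 billion, but what does this data say? Add up every country's population with sum() and print the total.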
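
Steps 3 and 4 can be tried without a browser. The sketch below is only an illustration: the three rows are made-up values standing in for scraped text (they are not from the site), and the dataclass is a swapped-in shortcut for the hand-written Country class.

```python
from dataclasses import dataclass


@dataclass
class Country:
    name: str
    capital: str
    population: int
    area: float


# Made-up rows standing in for the text scraped from each element.
rows = [
    ("Freedonia", "Zenda", "1200", "35.5"),
    ("Ruritania", "Strelsau", "3400", "1.2E4"),
    ("Genovia", "Pyrus", "800", "9.0"),
]

# Convert the text fields to numbers while building the instances.
country_list = [
    Country(name, capital, int(population), float(area))
    for name, capital, population, area in rows
]

# sorted() returns a new list; list.sort() would sort in place instead.
capitals = sorted(country.capital for country in country_list)

# sum() accepts a generator, so no temporary list is needed.
total_population = sum(country.population for country in country_list)

print(capitals)
print(total_population)
```

With the real data the same two lines apply to the full country_list.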

Tips!

How to find the n-th element of a list
- Indexes start at 0, so the n-th element has index n - 1
- list[n - 1]
The area may be written in a form other than a plain number, such as scientific notation with an E, so convert it when that happens.
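
Both tips fit in a few lines. This is a minimal sketch, and the helper names nth, parse_area_manual and parse_area are hypothetical. It shows that splitting on E by hand and calling float() directly give the same number, since float() already understands scientific notation.

```python
import math


def nth(items, n):
    """Return the n-th element, counting from 1 instead of 0."""
    return items[n - 1]


def parse_area_manual(text):
    """Convert an area string by hand, splitting mantissa and exponent on 'E'."""
    if "E" in text:
        mantissa, exponent = text.split("E")
        return float(mantissa) * 10 ** int(exponent)
    return float(text)


def parse_area(text):
    """Same conversion, relying on float() to parse scientific notation."""
    return float(text)


print(nth(["a", "b", "c"], 2))
print(parse_area_manual("1.5E3"), parse_area("1.5E3"))
print(parse_area_manual("468.0"), parse_area("468.0"))
```

The manual version needs the fallback branch for values without an E; leaving it out would leave the area unset for those values.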

from selenium import webdriver
from selenium.webdriver.common.by import By


class Country:
    # Step 1: hold one country's name, capital, population and area.
    def __init__(self, name, capital, population, area):
        self.name = name
        self.capital = capital
        self.population = int(population)  # convert the scraped text to an integer
        # float() also parses scientific notation such as "1.0E7", so there is
        # no need to split on 'E' by hand, and area is set for every value.
        self.area = float(area)


with webdriver.Firefox() as driver:
    driver.get("https://www.scrapethissite.com/pages/simple/")

    # Step 2: build a Country instance from every element with class "country".
    country_list = []
    div_list = driver.find_elements(By.CLASS_NAME, 'country')
    for div in div_list:
        name = div.find_element(By.TAG_NAME, 'h3').text  # .text returns the visible text
        capital = div.find_element(By.CLASS_NAME, 'country-capital').text
        population = div.find_element(By.CLASS_NAME, 'country-population').text
        area = div.find_element(By.CLASS_NAME, 'country-area').text

        country = Country(name, capital, population, area)
        country_list.append(country)

    # Step 3: sort the capitals alphabetically and print the 30th one.
    capital_list = [country.capital for country in country_list]
    capital_list.sort()  # or: capital_list = sorted(capital_list)
    print(capital_list[29])

    # Step 4: add up every country's population with sum().
    global_pop = sum(country.population for country in country_list)
    print(global_pop)

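Q. Choose the correct statement about what to consider when scraping a website that consists of multiple pages.

When the browser opens the site, not all of the data loads at once; each page's data loads only when you move to that page.
You do not need different scraping code for each page, but you do need code that moves between pages, that is, code that handles pagination.
The first page contains only the data that belongs to the first page.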

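Instructions

As in the previous exercise, first define a class to turn data on the web into structured data. It needs the following member variables.
- Team name
- Season year
- Wins
- Losses
To use the search feature, find the element where the search term is typed and the Search button element.
Type the search term (send_keys()) and click the Search button (click()). The search term is New.
Searching for New should return three teams. Turn each team's record for each year into a Record instance and store it in record_list. The results will probably span several pages, so use the Tips below to load all of the data.
Use record_list to compute, for each year, the total wins of the three teams, and put the totals in win_dict. Each key is a year and each value is the number of wins.
Print the year with the most wins.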

Tips!

The URL of every paginated page is available from the href attribute of the a elements.
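
To see the tip in isolation, here is a rough sketch that pulls the href values out of a pagination list without a browser. It swaps Selenium for the standard library's html.parser, and the HTML snippet, the example.com base URL and the PaginationLinkParser class are all hypothetical stand-ins, not the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PaginationLinkParser(HTMLParser):
    """Collect href values of <a> elements inside <ul class="pagination">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.depth = 0  # greater than 0 while inside the pagination list
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            classes = (attrs.get("class") or "").split()
            if self.depth or "pagination" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            if url not in self.urls:  # a "next" arrow may repeat a page URL
                self.urls.append(url)

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth:
            self.depth -= 1


# Hypothetical markup; the real page's URLs and structure may differ.
SAMPLE_HTML = """
<ul class="pagination">
  <li><a href="/results?page=1">1</a></li>
  <li><a href="/results?page=2">2</a></li>
  <li><a href="/results?page=2">next</a></li>
</ul>
<a href="/elsewhere/">not a page link</a>
"""

parser = PaginationLinkParser("https://example.com")
parser.feed(SAMPLE_HTML)
print(parser.urls)
```

Dropping duplicate URLs is one way to cope with a "next" link that points at a page already in the list; slicing off the last link is another.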

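from selenium import webdriver
from selenium.webdriver.common.by import By
from typing import NamedTuple


class Record(NamedTuple):
    # Step 1: one team's record for one season.
    name: str
    year: int
    wins: int
    losses: int


with webdriver.Firefox() as driver:
    driver.get("https://www.scrapethissite.com/pages/forms/")

    # Step 2: find the search box and the Search button.
    input_e = driver.find_element(By.ID, 'q')
    search_e = driver.find_element(By.XPATH, '//*[@id="hockey"]/div/div[4]/div/form/input[2]')

    # Step 3: type the search term and click Search.
    input_e.send_keys('New')
    search_e.click()

    # Step 4: collect every result page's URL, then scrape each page.
    record_list = []

    ul = driver.find_element(By.CLASS_NAME, 'pagination')
    a_list = ul.find_elements(By.TAG_NAME, 'a')

    # Read the hrefs before navigating; the elements go stale after driver.get().
    # The last link is skipped, presumably because it is the "next" arrow.
    url_list = []
    for a in a_list[:-1]:
        url_list.append(a.get_attribute('href'))

    for url in url_list:
        driver.get(url)
        tbody = driver.find_element(By.TAG_NAME, 'tbody')
        team_list = tbody.find_elements(By.CLASS_NAME, 'team')

        # This loop has to run inside the page loop; otherwise only the last page is read.
        for team in team_list:
            name = team.find_element(By.CLASS_NAME, 'name').text
            year = team.find_element(By.CLASS_NAME, 'year').text
            # Assumes the win and loss cells use the classes "wins" and "losses".
            wins = team.find_element(By.CLASS_NAME, 'wins').text
            losses = team.find_element(By.CLASS_NAME, 'losses').text

            record = Record(
                name=name,
                year=int(year),
                wins=int(wins),
                losses=int(losses),
            )
            record_list.append(record)

    # Step 5: total wins per year, e.g. {1990: 100, 1991: 110, 1992: 120, ...}
    win_dict = {}
    for record in record_list:
        win_dict[record.year] = win_dict.get(record.year, 0) + record.wins

    # Step 6: print the year with the most wins.
    print(max(win_dict, key=win_dict.get))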

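Q. Choose the correct statement about what to consider when scraping a dynamically rendered web page.

Because the data is not shown on the web page, the HTML document does not contain it.
Scraping is possible because the data can only be extracted after pressing a button.
Rather than the action of pressing a button, you should consider the information that is rendered dynamically from the server.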