[Project] Scraping PLG Player Data and Building a CSV File

Preface

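This is the first post I have written in Markdown; until now I have always written directly in the WordPress editor. If the layout looks off, or if you have any suggestions, feel free to let me know!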

Main Content

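Today I happened to come across the PLG website and noticed a table on it that holds every player's profile and stats. It looked like it was just begging to be scraped (just kidding).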

After a quick look at its HTML structure, I started scraping right away. First, import the required packages.

import requests
from bs4 import BeautifulSoup
import csv

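Each package has its own job:

  • requests : sends the request to the website
  • bs4 : Beautiful Soup, one of the most commonly used packages for parsing HTML when scraping
  • csv : writes the scraped data to a CSV file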

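I chose CSV because the format is very widely supported. If I later want to plot the data or process it further, pandas can read the file directly, which should make future work easier.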
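As a rough illustration of why CSV is convenient, the sketch below writes two made-up rows to an in-memory buffer and reads them back with the standard library's `csv.DictReader`. The sample rows and the three-column header are invented for the example and are not real PLG data.

```python
import csv
import io

# Hypothetical sample data; not real PLG statistics.
header = ["name", "number", "team"]
sample_rows = [["Player A", "7", "Team X"], ["Player B", "23", "Team Y"]]

# Write to an in-memory buffer instead of a file so the example has no side effects.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(sample_rows)

# Read it back: each data row becomes a dict keyed by the header names.
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["name"])
```

With the real file, `pandas.read_csv('player.csv')` would load the same kind of data into a DataFrame in one call.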

Next is the scraping code:

# Fetch the player stats page
url = 'https://pleagueofficial.com/stat-player'
html = requests.get(url)
html.encoding = 'UTF-8'
# Parse the HTML and collect every row of the first <tbody>
bs = BeautifulSoup(html.text, 'html.parser')
table = bs.tbody
row = table.find_all('tr')

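First, requests fetches the page source (HTML). Beautiful Soup then parses the HTML and locates the table's <tbody> tag (bs.tbody returns the first <tbody> on the page), and find_all() collects every row (<tr>) inside that tbody.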
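To see the same parsing steps without hitting the network, here is a minimal sketch that runs Beautiful Soup on a small inline HTML string. The fragment and its contents are made up for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the real page.
sample_html = """
<table>
  <thead><tr><th>name</th><th>pt</th></tr></thead>
  <tbody>
    <tr><th>Player A</th><td>12</td></tr>
    <tr><th>Player B</th><td>8</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
tbody = soup.tbody                # first <tbody> in the document
rows = tbody.find_all("tr")       # every <tr> inside it

# Collect the text of each cell per row, accepting both <th> and <td> cells.
parsed = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
          for tr in rows]
print(parsed)
```

Selecting the cells by tag name, rather than iterating over every child node, skips the whitespace text nodes between tags, so no newline filtering is needed in this variant.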

Data-processing code:

# Flatten every cell of every row into one list, skipping newline-only nodes
data = []
for i in row:
    for j in i:
        if j.text != '\n':
            data.append(j.text)

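The scraped data is messy, and many entries are nothing but a newline character ('\n'), so some filtering is needed: every entry that is not a newline gets appended to a list. I originally also checked for ' ', '', None and similar cases, but after several runs those values did not seem to show up, so I removed the checks.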
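For reference, a stricter filter along the lines of the removed checks could look like the sketch below. The helper name `clean_cells` and the sample list are hypothetical and are not part of the original script.

```python
def clean_cells(raw_cells):
    """Strip whitespace and drop entries that are None or empty afterwards."""
    cleaned = []
    for cell in raw_cells:
        if cell is None:
            continue
        text = cell.strip()
        if text:
            cleaned.append(text)
    return cleaned


# Hypothetical raw values mixing real cells with whitespace-only noise.
raw = ["\n", "Player A", "\n", " 7 ", "", None, "Team X", "\n"]
print(clean_cells(raw))
```

One caveat: if a genuinely empty cell can occur in the table, dropping it would shift every later value in that row by one column, so this kind of filter is only safe when every cell is known to have content.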

Code for writing the CSV:

with open('player.csv', 'w', newline='', encoding="UTF-8") as file:
    writer = csv.writer(file)

    # Column names, typed by hand to match the 23 columns on the site
    field = ["name", "number", "team", "games played", "time(min)",
             "2pt", "2pt shot", "2pt percentage", "3pt", "3pt shot",
             "3pt percentage", "free throw", "free throw shot",
             "free throw percentage", "pt", "offReb", "defReb",
             "tReb", "assist", "steal", "block", "turn over", "foul"]

    writer.writerow(field)
    # Take 23 values at a time off the front of the flat list and write them as one row
    for i in range(len(data) // 23):
        row = []
        for j in range(23):
            row.append(data.pop(0))
        writer.writerow(row)

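This creates player.csv in the current working directory (the same directory as the .py file if you run the script from there) and then defines the CSV header. The table on the site has 23 columns, so I built the header by hand.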

Then, using the list built in the previous step, the data is written to the CSV file row by row, 23 values at a time.
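The "fixed number of values per row" idea can also be expressed with list slicing. The sketch below is an alternative to the pop-based loop, not the original method; the helper name `chunk`, the shortened three-column header, and the sample values are all assumptions made for the example.

```python
import csv
import io


def chunk(flat, size):
    """Split a flat list into consecutive rows of `size` items, ignoring a trailing partial row."""
    return [flat[i:i + size] for i in range(0, len(flat) - size + 1, size)]


# Hypothetical data: two complete rows plus one stray value.
header = ["name", "number", "team"]
flat = ["Player A", "7", "Team X", "Player B", "23", "Team Y", "leftover"]

rows = chunk(flat, len(header))

# Write to an in-memory buffer so the example does not touch the file system.
out = io.StringIO(newline="")
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

Slicing leaves the source list untouched and avoids repeated `pop(0)` calls, each of which has to shift every remaining element of the list.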

The complete code is as follows:

import requests
from bs4 import BeautifulSoup
import csv

# web crawler: fetch the page and collect every row of the first <tbody>
url = 'https://pleagueofficial.com/stat-player'
html = requests.get(url)
html.encoding = 'UTF-8'
bs = BeautifulSoup(html.text, 'html.parser')
table = bs.tbody
row = table.find_all('tr')

# data processing: flatten all cells into one list, skipping newline-only nodes
data = []
for i in row:
    for j in i:
        if j.text != '\n':
            data.append(j.text)

# write into csv
with open('player.csv', 'w', newline='', encoding="UTF-8") as file:
    writer = csv.writer(file)

    # Column names, typed by hand to match the 23 columns on the site
    field = ["name", "number", "team", "games played", "time(min)",
             "2pt", "2pt shot", "2pt percentage", "3pt", "3pt shot",
             "3pt percentage", "free throw", "free throw shot",
             "free throw percentage", "pt", "offReb", "defReb",
             "tReb", "assist", "steal", "block", "turn over", "foul"]

    writer.writerow(field)
    # Take 23 values at a time off the front of the flat list and write them as one row
    for i in range(len(data) // 23):
        row = []
        for j in range(23):
            row.append(data.pop(0))
        writer.writerow(row)

    

The code will also be available on GitHub. If you spot any errors or have suggestions, please let me know!

Thanks for reading, and I look forward to your feedback!
