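Preface
This is the first post I have written in Markdown. Until now I always wrote directly in the WordPress editor, so if the layout looks off or you have any suggestions, feel free to let me know!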
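Main text
Today I happened to come across the PLG website and noticed a table on it that holds the data and stats for every player. It looked like it was just begging to be scraped (just kidding).

After a quick look at its HTML structure, I started scraping right away. First, import the required packages.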
import requests
from bs4 import BeautifulSoup
import csv
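These packages are responsible for:
- requests : sends the request to the website
- bs4 : Beautiful Soup, one of the most commonly used scraping packages, used here to parse the HTML
- csv : writes the scraped data to a CSV file
I write to CSV because the format is so widely supported: if I later want to plot the data or process it further, pandas can read the file directly, which makes later work easier.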
Next is the scraping part of the program:
url = 'https://pleagueofficial.com/stat-player'
html = requests.get(url)                      # fetch the page
html.encoding = 'UTF-8'
bs = BeautifulSoup(html.text, 'html.parser')  # parse the HTML
table = bs.tbody                              # first <tbody> in the page
row = table.find_all('tr')                    # every row inside that tbody
First, requests fetches the page's source (HTML). Beautiful Soup then parses the HTML and finds the table's <tbody> tag, and find_all() collects every row (<tr>) inside that tbody.
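To see what the tbody and find_all() steps return without hitting the network, here is a minimal sketch that runs the same calls on a small inline table. The HTML string, the player names, and the three-column layout are made up for illustration and are not the real PLG markup. It also collects the cells one row at a time with find_all('td'), which is a variation on the approach in this post, not the code used here.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two players, three columns each.
SAMPLE_HTML = """
<table>
  <thead><tr><th>name</th><th>number</th><th>team</th></tr></thead>
  <tbody>
    <tr>
      <td>Player A</td>
      <td>7</td>
      <td>Team X</td>
    </tr>
    <tr>
      <td>Player B</td>
      <td>23</td>
      <td>Team Y</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table_body = soup.tbody             # first <tbody> in the document
rows = table_body.find_all("tr")    # every <tr> inside that <tbody>

# Collect the text of each <td>, one list per row.
players = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in rows
]
print(players)
```

Because the header row lives in <thead>, it is not part of rows, so only the two player rows are collected.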
Data processing:
data = []
for i in row:        # each <tr>
    for j in i:      # each child of the row: cells and the newlines between them
        if j.text != '\n':
            data.append(j.text)
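The scraped data is messy, and many of the entries are just newline characters ('\n'), so some filtering is needed: every scraped value that is not a newline is appended to the list. I originally also checked for ' ', '' and None, but after several runs none of those ever showed up, so I removed those checks.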
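As a rough illustration of what the filter keeps and drops, the sketch below applies it to a hand-written list of strings. The values are invented, not scraped output, and the strip-based variant is only an alternative I am assuming might be useful, not the filter used in this post.

```python
# Hypothetical cell texts, similar to what iterating over a <tr> can yield.
raw_cells = ["\n", "Player A", "\n", "7", "\n", " ", "", "Team X", "\n"]

# Filter used in this post: drop only strings that are exactly a newline.
exact_filter = [text for text in raw_cells if text != "\n"]

# More defensive variant: drop anything that is empty once whitespace is stripped.
strip_filter = [text.strip() for text in raw_cells if text.strip()]

print(exact_filter)
print(strip_filter)
```

One trade-off to keep in mind: the defensive variant would also drop a cell that is legitimately empty, which would shift every later value by one column when the flat list is cut into rows.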
Writing to CSV:
with open('player.csv', 'w', newline='', encoding="UTF-8") as file:
    writer = csv.writer(file)
    field = ["name", "number", "team", "games played", "time(min)",
             "2pt", "2pt shot", "2pt percentage", "3pt", "3pt shot",
             "3pt percentage", "free throw", "free throw shot",
             "free throw percentage", "pt", "offReb", "defReb",
             "tReb", "assist", "steal", "block", "turn over", "foul"]
    writer.writerow(field)
    for i in range(len(data)//23):
        row = []
        for j in range(23):
            row.append(data[0])
            data.pop(0)
        # print(row)
        writer.writerow(row)
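This opens a file named player.csv in the current working directory (normally the directory the .py file is run from) and then defines the CSV header. The website's table has 23 columns, so I built the header by hand here.
Then the list built in the previous step is written to the CSV file row by row, 23 values at a time.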
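The row-building step can also be written with slicing instead of repeatedly popping the first element. The sketch below is one possible way to do it, using 3 columns instead of 23 and an in-memory buffer instead of player.csv so it stays short and leaves no file behind. The sample values and the COLUMNS name are assumptions made for the example.

```python
import csv
import io

COLUMNS = 3  # the table in this post has 23 columns; 3 keeps the example short

# Hypothetical flat list, as produced by the filtering step.
data = ["Player A", "7", "Team X", "Player B", "23", "Team Y"]

# Slice the flat list into consecutive rows of COLUMNS values each.
# A trailing partial row is ignored, like len(data) // 23 in the original loop.
full_rows = len(data) // COLUMNS
rows = [data[i * COLUMNS:(i + 1) * COLUMNS] for i in range(full_rows)]

buffer = io.StringIO(newline="")  # in-memory stand-in for player.csv
writer = csv.writer(buffer)
writer.writerow(["name", "number", "team"])  # header
writer.writerows(rows)                       # one CSV line per player

csv_text = buffer.getvalue()
print(csv_text)
```

Slicing leaves data untouched and avoids data.pop(0), which has to shift the whole list on every call. For a table of this size either approach is fast enough.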
The complete code is as follows:
import requests
from bs4 import BeautifulSoup
import csv

# web crawler
url = 'https://pleagueofficial.com/stat-player'
html = requests.get(url)
html.encoding = 'UTF-8'
bs = BeautifulSoup(html.text, 'html.parser')
table = bs.tbody
row = table.find_all('tr')

# data processing
data = []
for i in row:
    for j in i:
        if j.text != '\n':
            data.append(j.text)

# write into csv
with open('player.csv', 'w', newline='', encoding="UTF-8") as file:
    writer = csv.writer(file)
    field = ["name", "number", "team", "games played", "time(min)",
             "2pt", "2pt shot", "2pt percentage", "3pt", "3pt shot",
             "3pt percentage", "free throw", "free throw shot",
             "free throw percentage", "pt", "offReb", "defReb",
             "tReb", "assist", "steal", "block", "turn over", "foul"]
    writer.writerow(field)
    for i in range(len(data)//23):
        row = []
        for j in range(23):
            row.append(data[0])
            data.pop(0)
        # print(row)
        writer.writerow(row)
The code will also be posted on GitHub. If you find any mistakes or have suggestions, please let me know!
I look forward to your feedback!!
