Topic: Scraping web content from a dynamically loaded (infinite scroll) page.
Problem background:
I am trying to collect all image filenames from this website: https://www.shipspotting.com/
I have already collected a Python dict, cat_dict, of all the category names and their ID numbers. My strategy is to iterate through every category page, call the data-loading API, and save its response for every page.
I have identified https://www.shipspotting.com/ssapi/gallery-search as the request URL that loads the next page of content. However, when I request this URL with the requests library, I get a 404. What do I need to do to get the correct response when loading the next page of content?
    import requests
    from bs4 import BeautifulSoup

    cat_page = 'https://www.shipspotting.com/photos/gallery?category='

    for cat in cat_dict:
        cat_link = cat_page + str(cat_dict[cat])
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0",
            "Referer": cat_link}
        response = requests.get('https://www.shipspotting.com/ssapi/gallery-search', headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
https://www.shipspotting.com/photos/gallery?category=169 is an example page (cat_link).
Solution:
Every time you scroll down the page, a new request is made to the server (a POST request with a specific payload). You can verify this in the Network tab of your browser's developer tools.
The following works:
    import requests
    from bs4 import BeautifulSoup

    ### put the following code in a for loop based on a number of pages
    ### [total number of ship photos]/[12], or async it ... your choice

    data = {"category":"","perPage":12,"page":2}
    r = requests.post('https://www.shipspotting.com/ssapi/gallery-search', data = data)
    print(r.json())
This returns a JSON response:
{'page': 1, 'items': [{'lid': 3444123, 'cid': 172, 'title': 'ELLBING II', 'imo_no': '0000000',....}
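The loop suggested in the code comment above could look roughly like the sketch below. It is not a tested solution and rests on several assumptions: that the "category" field accepts an ID from cat_dict (the example above sends an empty string), that page numbering starts at 1, and that a page past the end returns an empty "items" list. The function name fetch_category_items and the max_pages safety cap are made up for illustration.

```python
API_URL = "https://www.shipspotting.com/ssapi/gallery-search"


def fetch_category_items(category_id, per_page=12, max_pages=1000, session=None):
    """Collect the 'items' entries from successive result pages of one category.

    Assumption: the API signals the end of the results with an empty
    'items' list. max_pages is a hypothetical cap to avoid an endless loop
    if that assumption turns out to be wrong.
    """
    if session is None:
        # Imported lazily so that a stand-in session object can be passed in.
        import requests
        session = requests.Session()

    items = []
    for page in range(1, max_pages + 1):
        # Same payload shape as in the answer, with the category filled in.
        payload = {"category": category_id, "perPage": per_page, "page": page}
        response = session.post(API_URL, data=payload)
        response.raise_for_status()  # fail loudly on a 404 or similar
        page_items = response.json().get("items", [])
        if not page_items:
            break  # assumed end of results
        items.extend(page_items)
    return items


# Possible usage with the asker's cat_dict (name -> category ID):
# all_items = {name: fetch_category_items(cat_id) for name, cat_id in cat_dict.items()}
```

Reusing one session keeps the connection open across pages. If the site rate-limits requests, a short pause between pages (for example with time.sleep) may be needed.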