Crawling 101 pages of second-hand housing data in Shenzhen from the Lianjia website


Preface: Implementation path analysis

1. Data source analysis

        1. Identify the requirements
        Website: https://sz.lianjia.com/ershoufang/
        Target variables: location, street, total price, price per square meter, house type, floor area, orientation, decoration, floor level, floor number, and house structure.

        2. Data capture analysis
        Open the browser developer tools (Command+Option+I on macOS, F12 on Windows).

        Search the Network panel by keyword to locate the data packet that carries the listing data.

2. Code implementation steps

        1. Send the data request
        Simulate a browser and send a request to the URL address.

        2. Acquire the data
        Receive the response data returned by the server.
        Use the developer tools to confirm that the response contains the full page data.

        3. Parse the data
        Extract the required data content.

        4. Save the data in CSV format


Now execute the plan step by step.

1. Send a data request to the website: https://sz.lianjia.com/ershoufang/

        1. Simulate a browser

        A browser's User-Agent string identifies the browser type, version, and so on. If a crawler wants to receive the same page content a real browser would, it needs to send an appropriate User-Agent.

        2. Set up the data request

        The requests library is a tool for sending HTTP requests to a target website to obtain its page content.
        As before, in the developer tools go to Network, select the request name, and read the Request URL from the Headers panel.

        3. Send the data request

        requests.get() is a function in the requests library that sends an HTTP GET request to a specified URL.
        The headers parameter is used to pass the request header information.

        4. The result of executing the program is <Response [200]>, a response object indicating that the request succeeded.

import requests

# Request headers: the User-Agent makes the request look like a real browser
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
}
url = 'https://sz.lianjia.com/ershoufang/'
response = requests.get(url=url, headers=headers)
print(response)  # <Response [200]> means the request succeeded

2. Acquire the text data (raw web data)
        1. Extract the text content from the response object returned by the request and assign it to the html_data variable.

        html_data = response.text
        print(html_data)
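If the printed page shows garbled Chinese text, requests may have guessed the encoding wrong; overriding it before reading .text usually fixes this (response.encoding is a standard requests attribute):

# Optional: force the encoding if Chinese characters print as mojibake
response.encoding = 'utf-8'
html_data = response.text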

3. Parse and extract the data
        1. Common parsing methods: regular expressions, CSS selectors, XPath node extraction, BeautifulSoup, and JSON (JavaScript Object Notation).
        2. Here I will use two of these methods, CSS selectors and BeautifulSoup, to parse the data (an XPath sketch follows below for comparison).
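Since XPath is listed above but not used later, here is a minimal sketch of what the equivalent extraction could look like with parsel's XPath support; the selector object is the one built in Part 1 below, and the class name is an assumption based on the CSS selectors used there:

# Hypothetical XPath equivalent of the '.title a::text' CSS extraction below
titles = selector.xpath('//div[contains(@class, "title")]/a/text()').getall()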

4. Part 1: Using CSS selectors

Introduction to CSS selectors

CSS selectors extract data based on tag attributes, and their syntax is simple and intuitive. By flexibly combining selectors, we can precisely locate the elements to be extracted.
Parsel is a Python library for parsing HTML and XML, and it can locate HTML elements using CSS selector syntax. Import it before use.

A Selector converts the HTML text obtained from the web page into a parseable object so that the required data can be extracted.
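To see what Selector does before touching the real page, here is a tiny self-contained sketch on a made-up HTML fragment (the fragment and variable names are illustrative only, not Lianjia markup):

import parsel

# Illustrative fragment only; the real page markup is more complex
demo_html = '<div class="info"><a class="title">Sunny two-bed flat</a></div>'
demo = parsel.Selector(demo_html)
print(demo.css('.title::text').get())  # -> Sunny two-bed flat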

Determine the area to be extracted:

     3. In developer mode, select the area and click the corresponding tag to view its CSS path.

     4. Extract the tags according to that syntax. 'divs' is a list containing the 30 listing tags on this page.

     5. Extract the specific data: title, community name, street, total price, price per square meter, and other descriptive information.

import parsel

# Convert the raw HTML text into a parseable Selector object
selector = parsel.Selector(html_data)
print(selector)

# Each '.sellListContent li .info' node is one listing; 30 per page
divs = selector.css('.sellListContent li .info')
print(divs)

for div in divs:
    title = div.css('.title a::text').get()
    total_price = div.css('.totalPrice span::text').get()
    area_list = div.css('.positionInfo a::text').getall()
    unit_price = div.css('.unitPrice span::text').get().replace('元/平', '')
    house_info = div.css('.houseInfo::text').get().split('|')
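One caveat: .get() returns None when a node is missing, and the listing area can contain promotional cards without the usual fields, so the .replace() and .split() calls above can raise AttributeError. A minimal defensive sketch, assuming the same loop (the promotional-card detail is my own observation, not from the original post):

for div in divs:
    title = div.css('.title a::text').get()
    if title is None:  # skip cards that are not real listings
        continue
    unit_price_raw = div.css('.unitPrice span::text').get() or ''
    unit_price = unit_price_raw.replace('元/平', '')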

5. Part 2: Using BeautifulSoup

Introduction to BeautifulSoup

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

Similarly, from the given HTML data (html_data), BeautifulSoup is used to extract the house title, area, unit price, total price, and house details, and store them in the corresponding variables. Note that here each variable is a page-level list with one entry per listing, rather than the fields of a single listing.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_data, 'html.parser')

title = [tag.text for tag in soup.select("div[class='title']")[:30]]
area_list = [tag.text for tag in soup.select("div[class='positionInfo']")]
unit_price = [tag.text for tag in soup.select("div[class='unitPrice']")]
# the class attribute is 'totalPrice totalPrice2'; the space was lost in the original
total_price = [tag.text for tag in soup.select("div[class='totalPrice totalPrice2']")]
house_info = [tag.text for tag in soup.select("div[class='houseInfo']")]

print(title)
print(area_list)
print(unit_price)
print(total_price)
print(house_info)

6. Organizing the data
        1. Categorize the crawled data into community, street, house type, floor area, orientation, decoration, floor, and house structure.

        2. Use a regular expression to sort out the floor-height data: the pattern '\d+' matches a run of digits, so the numeric floor count can be pulled out of the floor description.

        3. Use conditional logic to sort out the house age, because some listings do not contain age data.

        4. The result is a dictionary of one listing's fields; run over the loop, this covers the 30 listings on page 1.

import re  # regular expressions

area = area_list[0]              # community name
area_1 = area_list[1]            # street
house_type = house_info[0]
house_square = house_info[1]
house_direction = house_info[2]
house_decorate = house_info[3]
house_floor = house_info[4]
floor_type = house_info[4][1]    # the floor-level character (high/middle/low)
floor_num = re.findall(r'\d+', house_floor)[0]  # raw string avoids an escape warning
house_structure = house_info[-1]

if len(house_info) == 7:
    house_age = house_info[5]
else:
    house_age = 'NA'

# renamed from 'dict' to avoid shadowing the built-in type
house_dict = {
    'title': title, 'name': area, 'street': area_1,
    'totalprice': total_price, 'unitprice': unit_price,
    'type': house_type, 'square': house_square,
    'direction': house_direction, 'decorate': house_decorate,
    'floor': house_floor, 'floor_type': floor_type,
    'floor_num': floor_num, 'structure': house_structure,
    'age': house_age,
}
print(house_dict)
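Because house_info comes from split('|'), each field usually keeps surrounding spaces. A one-line cleanup (a sketch, which belongs right after house_info is created and before the fields are indexed) makes the saved values tidier:

# Strip the stray whitespace left by splitting on '|'
house_info = [field.strip() for field in house_info]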

7. So far, we have crawled the data from page 1. Let us now crawl the subsequent pages as well (pages 1 through 101).

for page in range(1, 102):  # pages 1 through 101
    print(f'===== collecting data from page {page} =====')
    # note the paginated URL format
    url = f'https://sz.lianjia.com/ershoufang/pg{page}/'
    ......
    # the request/parse/save steps above must be indented inside this loop
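The '......' stands for the per-page work and is left as the original has it. As a rough sketch of how the earlier pieces could fit inside the loop (this assumes the headers, parsel import, and CSV writer from steps 1, 4, and 8 are already set up; the one-second delay is my own addition for politeness, not part of the original post):

import time

for page in range(1, 102):
    url = f'https://sz.lianjia.com/ershoufang/pg{page}/'
    response = requests.get(url=url, headers=headers)
    selector = parsel.Selector(response.text)
    for div in selector.css('.sellListContent li .info'):
        title = div.css('.title a::text').get()
        if title is None:
            continue
        # build house_dict from the fields exactly as in step 6, then:
        # csv_writer.writerow(house_dict)
    time.sleep(1)  # courtesy delay between pages (my addition)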

8. Save data in CSV format

import csv

f = open('second_hand_house.csv', mode='w', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title',  # the original was missing this comma
    'name', 'street', 'totalprice', 'unitprice', 'type', 'square',
    'direction', 'decorate', 'floor', 'floor_type', 'floor_num',
    'structure', 'age',
])
csv_writer.writeheader()
# the crawling code above goes here, calling writerow once per listing
csv_writer.writerow(house_dict)
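One detail: the file opened above is never closed. A with-statement variant of the same setup (FIELDS is just a hypothetical name for the field list) closes it automatically even if the crawl fails midway:

FIELDS = ['title', 'name', 'street', 'totalprice', 'unitprice', 'type', 'square',
          'direction', 'decorate', 'floor', 'floor_type', 'floor_num', 'structure', 'age']

with open('second_hand_house.csv', mode='w', encoding='utf-8', newline='') as f:
    csv_writer = csv.DictWriter(f, fieldnames=FIELDS)
    csv_writer.writeheader()
    # run the page loop from step 7 here, calling csv_writer.writerow(house_dict)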
