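I was scraping some hotel reviews from TripAdvisor and found a good way to scrape hundreds of thousands of hotel reviews from the site.
Let's suppose, for example, that we want to scrape hotel reviews for Gran Canaria. If we go to TripAdvisor, we'll see that the URL is: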
https://www.tripadvisor.com/Hotels-g187471-Gran_Canaria_Canary_Islands-Hotels.html
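First, we need to retrieve the full list of hotels for this location. To do that, we'll download the full HTML with requests.get(url) and then try to get the total number of properties from it.
If we look closely at the page HTML, we'll see that this value lives inside a <span> tag.
Since that span has no identifier and its class seems to be auto-generated, we'll select the span inside the div that follows .MOBILE_SORT_FILTER_BUTTONS, with a selector like: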
.MOBILE_SORT_FILTER_BUTTONS + div span
First, we'll use pip to install the requests and bs4 packages. We'll also install pandas, to quickly generate an Excel-friendly file and to work with DataFrames later.
$ pip install requests bs4 pandas
Getting the number of pages
After installing the libraries, we can write the following code to get the number of pages:
import math
import random
import re
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.tripadvisor.com/Hotels-g187471-Gran_Canaria_Canary_Islands-Hotels.html'
PER_PAGE = 30

# Download the listing page and parse it
response = requests.get(BASE_URL).text
soup = BeautifulSoup(response, 'html.parser')

# The property count lives in the span next to the sort/filter buttons
span = soup.select('.MOBILE_SORT_FILTER_BUTTONS + div span')[0]
# Keep only the digits, so thousands separators do not break int()
N_PROPERTIES = int(re.sub(r'[^0-9]', '', span.text))
print(f'There are {N_PROPERTIES} properties')

N_PAGES = math.ceil(N_PROPERTIES / PER_PAGE)
print(f'There are {N_PAGES} different pages')
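The arithmetic behind the page count can be checked offline, without hitting TripAdvisor. The following is a minimal sketch, assuming the span text looks something like "1,234 properties"; the helper names and the sample strings are made up for illustration, and the real text may differ.

```python
import math
import re

PER_PAGE = 30  # listings per results page, as in the article


def parse_property_count(text):
    """Strip everything except digits and return the count as an int."""
    digits = re.sub(r'[^0-9]', '', text)
    if not digits:
        raise ValueError(f'no digits found in {text!r}')
    return int(digits)


def count_pages(n_properties, per_page=PER_PAGE):
    """Number of result pages needed to list n_properties hotels."""
    return math.ceil(n_properties / per_page)


# Hypothetical sample strings, not taken from the real page
for sample in ('1,234 properties', '1.234 alojamientos', '30 properties'):
    n = parse_property_count(sample)
    print(f'{sample!r}: {n} properties, {count_pages(n)} pages')
```

Dropping both commas and dots matters because the thousands separator depends on the locale of the page being served.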
If we navigate to page 2, we'll see that the URL changes to:
https://www.tripadvisor.com/Hotels-g187471-oa30-Gran_Canaria_Canary_Islands-Hotels.html
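Getting the hotel list
As we can see, the only change in the URL is the -oa30 added after the location ID (the number after -g). If we navigate to the third page, -oa60 is used instead of -oa30. This happens on every page: the offset grows by 30 per page. With that, we can create functions to:
- Extract the location ID from the URL
- Generate the URL for each page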
def get_id_from_url(url):
    # Split the URL at '-g' to isolate the part that starts with the ID
    prefix, suffix = url.split('-g', maxsplit=1)
    # The ID ends at the first dash
    location_id, slug = suffix.split('-', maxsplit=1)
    return int(location_id)


def get_listing_url(page, base_url=BASE_URL, per_page=PER_PAGE):
    assert page >= 0
    location_id = get_id_from_url(base_url)
    # The first page has no offset in its URL
    if page == 0:
        return base_url
    # Insert the '-oa<offset>' segment right after the ID
    return base_url.replace(f'-g{location_id}-', f'-g{location_id}-oa{page * per_page}-')
Once this code is written, we can loop from 0 to N_PAGES and generate the URL for each page:
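As a quick offline check of the URL scheme, the sketch below restates the two helpers in compact form so that it runs on its own, then prints the first few listing URLs. N_PAGES is set to 3 here purely for illustration; in the real run it comes from the property count.

```python
BASE_URL = 'https://www.tripadvisor.com/Hotels-g187471-Gran_Canaria_Canary_Islands-Hotels.html'
PER_PAGE = 30


def get_id_from_url(url):
    # Everything between '-g' and the next dash is the numeric ID
    suffix = url.split('-g', maxsplit=1)[1]
    return int(suffix.split('-', maxsplit=1)[0])


def get_listing_url(page, base_url=BASE_URL, per_page=PER_PAGE):
    if page < 0:
        raise ValueError('page must be >= 0')
    if page == 0:
        return base_url
    location_id = get_id_from_url(base_url)
    # Insert '-oa<offset>' right after the ID, replacing only the first match
    return base_url.replace(
        f'-g{location_id}-', f'-g{location_id}-oa{page * per_page}-', 1)


N_PAGES = 3  # illustrative value only
for page in range(N_PAGES):
    print(get_listing_url(page))
```

The second URL printed should match the page 2 URL shown earlier, which ends in -oa30-Gran_Canaria_Canary_Islands-Hotels.html.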
Now, let's download each hotel listing page and build an array with every hotel URL:
listings = []
for i in range(N_PAGES):
    url = get_listing_url(i)
    # Random delay to avoid TripAdvisor blocking us
    time.sleep(random.randint(2, 8))
    # Download current page
    listing_html = requests.get(url)
    listing_soup = BeautifulSoup(listing_html.text, 'html.parser')
    # Add hotels to listings
    raw_listings = listing_soup.select('.listing')
    for raw_listing in raw_listings:
        listings.append('https://www.tripadvisor.com' + raw_listing.a['href'])
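After a few minutes, we should have the listings variable filled with every hotel URL 🤩
Analyzing the data
Now, let's see how to scrape the reviews of each hotel, starting from the reviews section of a hotel page on TripAdvisor.
If we scroll down, we'll see that we only get 5 reviews per URL, which is not great (each hotel may have thousands of reviews!). OK, let's open Chrome DevTools and check what happens when we interact with this section.
If we now change the review language (for example, to German), we'll see a request to /data/graphql/batched that looks interesting.
TripAdvisor is sending a request to its GraphQL endpoint with a specific structure, and it includes a property called locationId.
Once again, this locationId is exactly the same one used in the URL: it is the number after -d (in this case, 296922 in Hotel_Review-g562819-d296922-Reviews-Bohemia_Suites_Spa...). What if we could use this endpoint to get the reviews of every hotel? 🤔
First, let's try to extract the location ID and the geo ID from the hotel URL. Every hotel URL looks similar to this: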
https://www.tripadvisor.com/Hotel_Review-g562819-d296922-Reviews-Bohemia_Suites_Spa-Playa_del_Ingles_Maspalomas_Gran_Canaria_Canary_Islands.html
We'll need the number after -g (the geo ID) and the number after -d (the location ID):
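def get_ids_from_hotel_url(url):
    # Split on dashes: parts[1] is 'g<geo id>' and parts[2] is 'd<location id>'
    parts = url.split('-')
    geo = parts[1]
    loc = parts[2]
    # Drop the leading letter and convert to int
    return (int(geo[1:]), int(loc[1:]))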
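The split-based version relies on the IDs always being the second and third dash-separated parts. As an alternative sketch (not the approach used in the article), a regular expression can look for the -g...-d... pair explicitly and fail loudly when the URL does not have that shape. The pattern is an assumption based on the example URL above.

```python
import re

# Assumed pattern: '-g<digits>-d<digits>-' as seen in the example hotel URL
HOTEL_URL_RE = re.compile(r'-g(\d+)-d(\d+)-')


def get_ids_from_hotel_url(url):
    """Return (geo_id, location_id) or raise ValueError if the URL is unexpected."""
    match = HOTEL_URL_RE.search(url)
    if match is None:
        raise ValueError(f'not a hotel review URL: {url!r}')
    geo_id, location_id = match.groups()
    return int(geo_id), int(location_id)


url = ('https://www.tripadvisor.com/Hotel_Review-g562819-d296922-Reviews-'
       'Bohemia_Suites_Spa-Playa_del_Ingles_Maspalomas_Gran_Canaria_Canary_Islands.html')
print(get_ids_from_hotel_url(url))  # (562819, 296922)
```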
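Getting the data from GraphQL
Now, let's try to mimic the request that TripAdvisor sends to its GraphQL endpoint. If we copy the raw request from the Network tab, we'll see something similar to the following JSON: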
[{"query": "mutation LogBBMLInteraction($interaction: ClientInteractionOpaqueInput!) {\n logProductInteraction(interaction: $interaction)\n}\n","variables": {"interaction": {"productInteraction": {"interaction_type": "CLICK","site": {"site_name": "ta","site_business_unit": "Hotels","site_domain": "www.tripadvisor.com"},"pageview": {"pageview_request_uid": "X@2fPQokGCIABGTeHYoAAAES","pageview_attributes": {"location_id": 296922,"geo_id": 562819,"servlet_name": "Hotel_Review"}},"user": {"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36","site_persistent_user_uid": "web373a.83.56.0.34.17609EB3BAC","unique_user_identifiers": {"session_id": "F5A494D1D5DB4DD491B72FB55E860886"}},"search": {},"item_group": {"item_group_collection_key": "X@2fPQokGCIABGTeHYoAAAES"},"item": {"product_type": "Hotels","item_id_type": "ta-location-id","item_id": 296922,"item_attributes": {"element_type": "de","action_name": "REVIEW_FILTER_LANGUAGE"}}}}}},{"query": "query ReviewListQuery($locationId: Int!, $offset: Int, $limit: Int, $filters: [FilterConditionInput!], $prefs: ReviewListPrefsInput, $initialPrefs: ReviewListPrefsInput, $filterCacheKey: String, $prefsCacheKey: String, $keywordVariant: String!, $needKeywords: Boolean = true) {\n cachedFilters: personalCache(key: $filterCacheKey)\n cachedPrefs: personalCache(key: $prefsCacheKey)\n locations(locationIds: [$locationId]) {\n locationId\n parentGeoId\n name\n placeType\n reviewSummary {\n rating\n count\n }\n keywords(variant: $keywordVariant) @include(if: $needKeywords) {\n keywords {\n keyword\n }\n }\n ... on LocationInformation {\n parentGeoId\n }\n ... on LocationInformation {\n parentGeoId\n }\n ... on LocationInformation {\n name\n currentUserOwnerStatus {\n isValid\n }\n }\n ... on LocationInformation {\n locationId\n currentUserOwnerStatus {\n isValid\n }\n }\n ... 
on LocationInformation {\n locationId\n parentGeoId\n accommodationCategory\n currentUserOwnerStatus {\n isValid\n }\n url\n }\n reviewListPage(page: {offset: $offset, limit: $limit}, filters: $filters, prefs: $prefs, initialPrefs: $initialPrefs, filterCacheKey: $filterCacheKey, prefsCacheKey: $prefsCacheKey) {\n totalCount\n preferredReviewIds\n reviews {\n ... on Review {\n id\n url\n location {\n locationId\n name\n }\n createdDate\n publishedDate\n provider {\n isLocalProvider\n }\n userProfile {\n id\n userId: id\n isMe\n isVerified\n displayName\n username\n avatar {\n id\n photoSizes {\n url\n width\n height\n }\n }\n hometown {\n locationId\n fallbackString\n location {\n locationId\n additionalNames {\n long\n }\n name\n }\n }\n contributionCounts {\n sumAllUgc\n helpfulVote\n }\n route {\n url\n }\n }\n }\n ... on Review {\n title\n language\n url\n }\n ... on Review {\n language\n translationType\n }\n ... on Review {\n roomTip\n }\n ... on Review {\n tripInfo {\n stayDate\n }\n location {\n placeType\n }\n }\n ... on Review {\n additionalRatings {\n rating\n ratingLabel\n }\n }\n ... on Review {\n tripInfo {\n tripType\n }\n }\n ... on Review {\n language\n translationType\n mgmtResponse {\n id\n language\n translationType\n }\n }\n ... on Review {\n text\n publishedDate\n username\n connectionToSubject\n language\n mgmtResponse {\n id\n text\n language\n publishedDate\n username\n connectionToSubject\n }\n }\n ... on Review {\n id\n locationId\n title\n text\n rating\n absoluteUrl\n mcid\n translationType\n mtProviderId\n photos {\n id\n statuses\n photoSizes {\n url\n width\n height\n }\n }\n userProfile {\n id\n displayName\n username\n }\n }\n ... on Review {\n mgmtResponse {\n id\n }\n provider {\n isLocalProvider\n }\n }\n ... on Review {\n translationType\n location {\n locationId\n parentGeoId\n }\n provider {\n isLocalProvider\n isToolsProvider\n }\n original {\n id\n url\n locationId\n userId\n language\n submissionDomain\n }\n }\n ... 
on Review {\n locationId\n mcid\n attribution\n }\n ... on Review {\n __typename\n locationId\n helpfulVotes\n photoIds\n route {\n url\n }\n socialStatistics {\n followCount\n isFollowing\n isLiked\n isReposted\n isSaved\n likeCount\n repostCount\n tripCount\n }\n status\n userId\n userProfile {\n id\n displayName\n isFollowing\n }\n location {\n __typename\n locationId\n additionalNames {\n normal\n long\n longOnlyParent\n longParentAbbreviated\n longOnlyParentAbbreviated\n longParentStateAbbreviated\n longOnlyParentStateAbbreviated\n geo\n abbreviated\n abbreviatedRaw\n abbreviatedStateTerritory\n abbreviatedStateTerritoryRaw\n }\n parent {\n locationId\n additionalNames {\n normal\n long\n longOnlyParent\n longParentAbbreviated\n longOnlyParentAbbreviated\n longParentStateAbbreviated\n longOnlyParentStateAbbreviated\n geo\n abbreviated\n abbreviatedRaw\n abbreviatedStateTerritory\n abbreviatedStateTerritoryRaw\n }\n }\n }\n }\n ... on Review {\n text\n language\n }\n ... on Review {\n locationId\n absoluteUrl\n mcid\n translationType\n mtProviderId\n originalLanguage\n rating\n }\n ... on Review {\n id\n locationId\n title\n labels\n rating\n absoluteUrl\n mcid\n translationType\n mtProviderId\n alertStatus\n }\n }\n }\n reviewAggregations {\n ratingCounts\n languageCounts\n alertStatusCount\n }\n }\n}\n","variables": {"locationId": 296922,"offset": 0,"filters": [{"axis": "LANGUAGE","selections": ["de"]}],"prefs": null,"initialPrefs": {},"limit": 5,"filterCacheKey": null,"prefsCacheKey": "locationReviewPrefs","needKeywords": false,"keywordVariant": "location_keywords_v2_llr_order_30_en"}},{"query": "mutation UpdateReviewSettings($key: String!, $val: String!) {\n writePersonalCache(key: $key, value: $val)\n}\n","variables": {"key": "locationReviewFilters_296922","val": "[{\"axis\":\"LANGUAGE\",\"selections\":[\"de\"]}]"}}]
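Analyzing this JSON, we can see that it is an array with 3 elements. The first element seems to log the interaction. The second element has some interesting properties under variables: locationId, the selections list inside filters (which seems to be an array of ISO language codes), offset (the number of reviews to skip) and limit (the maximum number of reviews to return). The third element seems to write the user's preferences into their database.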
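To make the role of offset and limit concrete, here is a minimal sketch that builds only the pagination-related part of the variables object. The helper name build_review_variables and its defaults are hypothetical, and the real request carries more fields (prefs, cache keys, keywordVariant and so on) that are left out here.

```python
import json


def build_review_variables(location_id, page=0, page_size=20, languages=('en',)):
    """Return the subset of ReviewListQuery variables that controls paging."""
    if page < 0 or page_size <= 0:
        raise ValueError('page must be >= 0 and page_size must be > 0')
    return {
        'locationId': location_id,
        # offset is expressed in reviews, not in pages
        'offset': page * page_size,
        'limit': page_size,
        'filters': [{'axis': 'LANGUAGE', 'selections': list(languages)}],
    }


# Third page (index 2) of German reviews for the example hotel
print(json.dumps(build_review_variables(296922, page=2, languages=('de',)), indent=2))
```

With a page size of 20, page index 2 skips the first 40 reviews, which is the same page * 20 arithmetic used when the full request is assembled.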
Knowing this, we can create a new function that gets the GraphQL data for a given hotel:
GRAPHQL_URL = 'https://www.tripadvisor.com/data/graphql/batched'

def request_graphql(url, page=0):
    # Extract the geo ID and the location ID needed by the request payload
    geo, loc = get_ids_from_hotel_url(url)
    request = [{"query": "mutation LogBBMLInteraction($interaction: ClientInteractionOpaqueInput!) {\n logProductInteraction(interaction: $interaction)\n}\n","variables": {"interaction": {"productInteraction": {"interaction_type": "CLICK","site": {"site_name": "ta","site_business_unit": "Hotels","site_domain": "www.tripadvisor.com"},"pageview": {"pageview_request_uid": "X@2fPQokGCIABGTeHYoAAAES","pageview_attributes": {"location_id": loc,"geo_id": geo,"servlet_name": "Hotel_Review"}},"user": {"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36","site_persistent_user_uid": "web373a.83.56.0.34.17609EB3BAC","unique_user_identifiers": {"session_id": '{YOUR_SESSION_ID}'}},"search": {},"item_group": {"item_group_collection_key": "X@2fPQokGCIABGTeHYoAAAES"},"item": {"product_type": "Hotels","item_id_type": "ta-location-id","item_id": loc,"item_attributes": {"element_type": "es","action_name": "REVIEW_FILTER_LANGUAGE"}}}}}},{"query": "query ReviewListQuery($locationId: Int!, $offset: Int, $limit: Int, $filters: [FilterConditionInput!], $prefs: ReviewListPrefsInput, $initialPrefs: ReviewListPrefsInput, $filterCacheKey: String, $prefsCacheKey: String, $keywordVariant: String!, $needKeywords: Boolean = true) {\n cachedFilters: personalCache(key: $filterCacheKey)\n cachedPrefs: personalCache(key: $prefsCacheKey)\n locations(locationIds: [$locationId]) {\n locationId\n parentGeoId\n name\n placeType\n reviewSummary {\n rating\n count\n }\n keywords(variant: $keywordVariant) @include(if: $needKeywords) {\n keywords {\n keyword\n }\n }\n ... on LocationInformation {\n parentGeoId\n }\n ... on LocationInformation {\n parentGeoId\n }\n ... on LocationInformation {\n name\n currentUserOwnerStatus {\n isValid\n }\n }\n ... 
on LocationInformation {\n locationId\n currentUserOwnerStatus {\n isValid\n }\n }\n ... on LocationInformation {\n locationId\n parentGeoId\n accommodationCategory\n currentUserOwnerStatus {\n isValid\n }\n url\n }\n reviewListPage(page: {offset: $offset, limit: $limit}, filters: $filters, prefs: $prefs, initialPrefs: $initialPrefs, filterCacheKey: $filterCacheKey, prefsCacheKey: $prefsCacheKey) {\n totalCount\n preferredReviewIds\n reviews {\n ... on Review {\n id\n url\n location {\n locationId\n name\n }\n createdDate\n publishedDate\n provider {\n isLocalProvider\n }\n userProfile {\n id\n userId: id\n isMe\n isVerified\n displayName\n username\n avatar {\n id\n photoSizes {\n url\n width\n height\n }\n }\n hometown {\n locationId\n fallbackString\n location {\n locationId\n additionalNames {\n long\n }\n name\n }\n }\n contributionCounts {\n sumAllUgc\n helpfulVote\n }\n route {\n url\n }\n }\n }\n ... on Review {\n title\n language\n url\n }\n ... on Review {\n language\n translationType\n }\n ... on Review {\n roomTip\n }\n ... on Review {\n tripInfo {\n stayDate\n }\n location {\n placeType\n }\n }\n ... on Review {\n additionalRatings {\n rating\n ratingLabel\n }\n }\n ... on Review {\n tripInfo {\n tripType\n }\n }\n ... on Review {\n language\n translationType\n mgmtResponse {\n id\n language\n translationType\n }\n }\n ... on Review {\n text\n publishedDate\n username\n connectionToSubject\n language\n mgmtResponse {\n id\n text\n language\n publishedDate\n username\n connectionToSubject\n }\n }\n ... on Review {\n id\n locationId\n title\n text\n rating\n absoluteUrl\n mcid\n translationType\n mtProviderId\n photos {\n id\n statuses\n photoSizes {\n url\n width\n height\n }\n }\n userProfile {\n id\n displayName\n username\n }\n }\n ... on Review {\n mgmtResponse {\n id\n }\n provider {\n isLocalProvider\n }\n }\n ... 
on Review {\n translationType\n location {\n locationId\n parentGeoId\n }\n provider {\n isLocalProvider\n isToolsProvider\n }\n original {\n id\n url\n locationId\n userId\n language\n submissionDomain\n }\n }\n ... on Review {\n locationId\n mcid\n attribution\n }\n ... on Review {\n __typename\n locationId\n helpfulVotes\n photoIds\n route {\n url\n }\n socialStatistics {\n followCount\n isFollowing\n isLiked\n isReposted\n isSaved\n likeCount\n repostCount\n tripCount\n }\n status\n userId\n userProfile {\n id\n displayName\n isFollowing\n }\n location {\n __typename\n locationId\n additionalNames {\n normal\n long\n longOnlyParent\n longParentAbbreviated\n longOnlyParentAbbreviated\n longParentStateAbbreviated\n longOnlyParentStateAbbreviated\n geo\n abbreviated\n abbreviatedRaw\n abbreviatedStateTerritory\n abbreviatedStateTerritoryRaw\n }\n parent {\n locationId\n additionalNames {\n normal\n long\n longOnlyParent\n longParentAbbreviated\n longOnlyParentAbbreviated\n longParentStateAbbreviated\n longOnlyParentStateAbbreviated\n geo\n abbreviated\n abbreviatedRaw\n abbreviatedStateTerritory\n abbreviatedStateTerritoryRaw\n }\n }\n }\n }\n ... on Review {\n text\n language\n }\n ... on Review {\n locationId\n absoluteUrl\n mcid\n translationType\n mtProviderId\n originalLanguage\n rating\n }\n ... on Review {\n id\n locationId\n title\n labels\n rating\n absoluteUrl\n mcid\n translationType\n mtProviderId\n alertStatus\n }\n }\n }\n reviewAggregations {\n ratingCounts\n languageCounts\n alertStatusCount\n }\n }\n}\n","variables": {"locationId": loc,"offset": page * 20,"filters": [{"axis": "LANGUAGE","selections": ["es","en","de","fr","it"]}],"prefs": None,"initialPrefs": {},"limit": 20,"filterCacheKey": None,"prefsCacheKey": "locationReviewPrefs","needKeywords": False,"keywordVariant": "location_keywords_v2_llr_order_30_en"}},{"query": "mutation UpdateReviewSettings($key: String!, $val: String!) 
{\n writePersonalCache(key: $key, value: $val)\n}\n","variables": {"key": f"locationReviewFilters_{loc}","val": "[{\"axis\":\"LANGUAGE\",\"selections\":[\"es\"]}]"}}]
    # Send the batched request with headers copied from a real browser session
    response = requests.post(GRAPHQL_URL, json=request, headers={
        'origin': 'https://www.tripadvisor.com',
        'pragma': 'no-cache',
        'referer': url,
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
        'x-requested-by': '{YOUR_REQUESTED_BY}',
        'Cookie': '{YOUR_COOKIE_STRING}',
    })
    return response.json()
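Before continuing, note that you need to change three things inside this function:
- {YOUR_SESSION_ID}: search for the "TASID" cookie inside DevTools and put its value here.
- {YOUR_COOKIE_STRING}: go to "Request Headers" and paste the full Cookie string here.
- {YOUR_REQUESTED_BY}: go to "Request Headers" and paste the full X-Requested-By header here.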
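One way to avoid pasting these three values directly into the source file is to read them from environment variables. The sketch below is only a suggestion: the variable names TA_SESSION_ID, TA_COOKIE and TA_REQUESTED_BY and both helper functions are hypothetical and not part of the article's code.

```python
import os

# Hypothetical environment variable names for the three values listed above
ENV_NAMES = {
    'session_id': 'TA_SESSION_ID',
    'cookie': 'TA_COOKIE',
    'requested_by': 'TA_REQUESTED_BY',
}


def load_credentials(env=os.environ):
    """Read the three session values, failing early if any is missing."""
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise RuntimeError(f'missing environment variables: {", ".join(missing)}')
    return {key: env[name] for key, name in ENV_NAMES.items()}


def build_headers(credentials, referer):
    """Build the session-specific request headers from the loaded values."""
    return {
        'origin': 'https://www.tripadvisor.com',
        'referer': referer,
        'x-requested-by': credentials['requested_by'],
        'Cookie': credentials['cookie'],
    }


# Demo with a fake environment so that nothing real is needed to run it
fake_env = {'TA_SESSION_ID': 'abc', 'TA_COOKIE': 'TASID=abc', 'TA_REQUESTED_BY': 'xyz'}
creds = load_credentials(fake_env)
print(build_headers(creds, 'https://www.tripadvisor.com/')['x-requested-by'])
```

Failing early on a missing value is easier to debug than a rejected request halfway through a long scraping run.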
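Now that we have this function ready, we are very close to iterating over every hotel URL and getting all of its reviews by calling it. Only one thing is left: we need to find out how many reviews each hotel has.
Let's try our super function and see what happens :-)
And it works! As we can see in the returned structure, we've just got the first 20 reviews of this hotel in the second element of the response, under data.locations[0].reviewListPage.reviews!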
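The same response also answers the open question of how many reviews a hotel has, through reviewListPage.totalCount. Here is a minimal offline sketch that pulls the hotel name, the total count and the reviews out of a batched response. The sample response is a made-up, heavily trimmed stand-in for the real payload, and extract_review_page is a hypothetical helper.

```python
import math

PAGE_SIZE = 20  # reviews requested per call, matching the limit in the request


def extract_review_page(batched_response):
    """Return (hotel_name, total_count, reviews) from a batched GraphQL response."""
    # The review query is the second element of the batched array
    location = batched_response[1]['data']['locations'][0]
    review_page = location.get('reviewListPage') or {}
    return (
        location['name'],
        review_page.get('totalCount', 0),
        review_page.get('reviews') or [],
    )


# Made-up sample with the same nesting as the real response
sample = [
    {'data': {}},
    {'data': {'locations': [{
        'name': 'Example Hotel',
        'reviewListPage': {
            'totalCount': 45,
            'reviews': [{'id': 1, 'title': 'Great'}, {'id': 2, 'title': 'Fine'}],
        },
    }]}},
    {'data': {}},
]

name, total, reviews = extract_review_page(sample)
pages = math.ceil(total / PAGE_SIZE)
print(f'{name}: {total} reviews, {len(reviews)} in this page, {pages} pages to fetch')
```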
Each review has the following structure:
{"id": 780641546,"url": "/ShowUserReviews-g664857-d559667-r780641546-Hotel_Cordial_Mogan_Playa-Puerto_de_Mogan_Mogan_Gran_Canaria_Canary_Islands.html","location": {"locationId": 559667,"name": "Hotel Cordial Mogan Playa","placeType": "ACCOMMODATION","parentGeoId": 187471,"__typename": "LocationInformation","additionalNames": {"normal": "Hotel Cordial Mogan Playa","long": "Hotel Cordial Mogan Playa, Spain","longOnlyParent": "Spain","longParentAbbreviated": "Hotel Cordial Mogan Playa, Spain","longOnlyParentAbbreviated": "Spain","longParentStateAbbreviated": "Hotel Cordial Mogan Playa, Spain","longOnlyParentStateAbbreviated": "Spain","geo": "Puerto de Mogan","abbreviated": "Hotel Cordial Mogan Playa","abbreviatedRaw": "Hotel Cordial Mogan Playa","abbreviatedStateTerritory": "Hotel Cordial Mogan Playa","abbreviatedStateTerritoryRaw": "Hotel Cordial Mogan Playa"},"parent": {"locationId": 187471,"additionalNames": {"normal": "Gran Canaria","long": "Gran Canaria, Spain","longOnlyParent": "Spain","longParentAbbreviated": "Gran Canaria, Spain","longOnlyParentAbbreviated": "Spain","longParentStateAbbreviated": "Gran Canaria, Spain","longOnlyParentStateAbbreviated": "Spain","geo": "Gran Canaria","abbreviated": "Gran Canaria","abbreviatedRaw": "Gran Canaria","abbreviatedStateTerritory": "Gran Canaria","abbreviatedStateTerritoryRaw": "Gran Canaria"}}},"createdDate": "2021-01-06","publishedDate": "2021-01-06","provider": {"isLocalProvider": true,"isToolsProvider": true},"userProfile": {"id": "63A449F68F3328E582979E7BC8F5D5E3","userId": "63A449F68F3328E582979E7BC8F5D5E3","isMe": false,"isVerified": false,"displayName": "kattullus","username": "kattullus","avatar": {"id": 120146276,"photoSizes": []},"hometown": {"locationId": 189852,"fallbackString": "189852","location": {"locationId": 189852,"additionalNames": {"long": "Stockholm, Sweden"},"name": "Stockholm"}},"contributionCounts": {"sumAllUgc": 890,"helpfulVote": 150},"route": {"url": "/Profile/kattullus"},"isFollowing": 
false},"title": "Did not stay but want to applaud the fantastic New Year's buffet","language": "en","translationType": null,"roomTip": null,"tripInfo": {"stayDate": "2020-12-31","tripType": "NONE"},"additionalRatings": [{"rating": 4,"ratingLabel": "Location"},{"rating": 5,"ratingLabel": "Service"},{"rating": 5,"ratingLabel": "Sleep Quality"}],"text": "The New Year's buffet was arguably the finest buffet we have enjoyed. Hundreds of dishes, all beautifully presented and delicious. Wines/beer included and the cost very reasonable. The venue is fantastic and in itself worth a detour!","username": "kattullus","connectionToSubject": null,"locationId": 559667,"rating": 5,"absoluteUrl": "https://www.tripadvisor.com/ShowUserReviews-g664857-d559667-r780641546-Hotel_Cordial_Mogan_Playa-Puerto_de_Mogan_Mogan_Gran_Canaria_Canary_Islands.html","mcid": 53922,"mtProviderId": 0,"photos": [],"original": null,"attribution": null,"__typename": "Review","helpfulVotes": 0,"photoIds": [],"route": {"url": "/ShowUserReviews-g664857-d559667-r780641546-Hotel_Cordial_Mogan_Playa-Puerto_de_Mogan_Mogan_Gran_Canaria_Canary_Islands.html"},"originalLanguage": "en","labels": [],"alertStatus": false}
Let's parse this document to build our dataset:
# Assumed cap on pages per hotel: the original code uses MAX_PAGES without defining it, so adjust as needed
MAX_PAGES = 10

data = []
for hotel_url in listings:
    response = request_graphql(hotel_url)[1]['data']['locations'][0]
    hotel_name = response['name']
    print(f'Scraping {hotel_name}')
    # Get total review count
    total_reviews = response['reviewListPage']['totalCount']
    # Get number of pages to get all the reviews
    pages = math.ceil(total_reviews / 20)
    pages = min(MAX_PAGES, pages)
    # Iterate through every possible page to get all the reviews
    for i in range(pages):
        # Sleep random seconds to avoid blocking
        time.sleep(random.randint(1, 3))
        # Get the GraphQL response for each page
        response = request_graphql(hotel_url, page=i)[1]['data']['locations'][0]
        # Get the reviews from each response
        reviews = response['reviewListPage']['reviews'] if response['reviewListPage'] is not None else []
        # Add each review to the array
        for review in reviews:
            review_title = review['title']
            review_description = review['text']
            location = review['location']['parent']['additionalNames']['normal']
            # The user profile, hometown and hometown location can each be missing
            user = review['userProfile']
            hometown = user['hometown'] if user else None
            hometown_location = hometown['location'] if hometown else None
            review_data = {
                'Hotel Name': hotel_name,
                'Review Date': review['createdDate'],
                'Stay Date': review['tripInfo']['stayDate'] if review['tripInfo'] is not None else None,
                'Location': location,
                'Lang': review['language'],
                'Room Tip': review.get('roomTip'),
                'Review Title': review_title,
                'Review Stars': review['rating'],
                'Review': review_description,
                'User Name': user['displayName'] if user else None,
                'Hometown': hometown_location['additionalNames']['long'] if hometown_location else None,
            }
            # Iterate through additionalRatings (Cleanliness, Room Service...)
            for rating in review['additionalRatings']:
                review_data[f'{rating["ratingLabel"]} Stars'] = rating['rating']
            data.append(review_data)
    print(f'Reviews: {len(data)}')
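Most of the conditionals in the parsing loop exist because nested objects such as tripInfo, userProfile or hometown can be null. As an alternative sketch (not the article's code), a small hypothetical helper called dig can walk a chain of keys and return None as soon as something is missing, which keeps the row-building code flat. The sample review is trimmed from the structure shown earlier.

```python
def dig(obj, *keys, default=None):
    """Follow keys through nested dicts, returning default if any step is missing or None."""
    for key in keys:
        if not isinstance(obj, dict) or obj.get(key) is None:
            return default
        obj = obj[key]
    return obj


def review_to_row(hotel_name, review):
    """Flatten one review dict into a flat row suitable for a DataFrame."""
    row = {
        'Hotel Name': hotel_name,
        'Review Date': review.get('createdDate'),
        'Stay Date': dig(review, 'tripInfo', 'stayDate'),
        'Lang': review.get('language'),
        'Review Title': review.get('title'),
        'Review Stars': review.get('rating'),
        'Review': review.get('text'),
        'User Name': dig(review, 'userProfile', 'displayName'),
        'Hometown': dig(review, 'userProfile', 'hometown', 'location', 'additionalNames', 'long'),
    }
    # One extra column per additional rating (Location, Service, ...)
    for rating in review.get('additionalRatings') or []:
        row[f"{rating['ratingLabel']} Stars"] = rating['rating']
    return row


# Trimmed version of the review structure shown above
sample_review = {
    'createdDate': '2021-01-06',
    'language': 'en',
    'title': "Did not stay but want to applaud the fantastic New Year's buffet",
    'rating': 5,
    'text': "The New Year's buffet was arguably the finest buffet we have enjoyed.",
    'tripInfo': {'stayDate': '2020-12-31', 'tripType': 'NONE'},
    'userProfile': {
        'displayName': 'kattullus',
        'hometown': {'location': {'additionalNames': {'long': 'Stockholm, Sweden'}}},
    },
    'additionalRatings': [{'rating': 4, 'ratingLabel': 'Location'}],
}

print(review_to_row('Hotel Cordial Mogan Playa', sample_review))
```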
Now that we have filled the array (this may take a long time, depending on how many hotels you need), let's build a pandas DataFrame and store the result as a CSV file:
import pandas as pd

df = pd.DataFrame(data)
# utf-8-sig and ';' as separator make the file easier to open in Excel
df.to_csv('./reviews.csv', index=False, encoding='utf-8-sig', sep=';')
df.head()
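The two unusual to_csv options are worth a quick look: utf-8-sig writes a byte order mark at the start of the file, and sep=';' changes the field delimiter, so whoever reads the file back has to use the same settings. The sketch below illustrates this round trip with the standard library csv module instead of pandas, using made-up rows and a temporary file.

```python
import csv
import os
import tempfile

# Made-up rows standing in for the scraped reviews
rows = [
    {'Hotel Name': 'Example Hotel', 'Review Stars': 5, 'Review': 'Great; would return'},
    {'Hotel Name': 'Otro Hotel', 'Review Stars': 4, 'Review': 'Muy cómodo'},
]

path = os.path.join(tempfile.mkdtemp(), 'reviews.csv')

# Write with a BOM (utf-8-sig) and ';' as delimiter, like the to_csv call above
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]), delimiter=';')
    writer.writeheader()
    writer.writerows(rows)

# Read it back with the same encoding and delimiter
with open(path, newline='', encoding='utf-8-sig') as f:
    loaded = list(csv.DictReader(f, delimiter=';'))

print(loaded[0]['Review'])
```

Fields that contain the delimiter, such as the first review text here, are quoted automatically on write and restored on read.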
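Summary
We have learned how to use the TripAdvisor GraphQL endpoint to request all the reviews for a list of hotels, ending up with a structured CSV file that can be used for further ML analysis.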