How to Scrape Hundreds of Thousands of Hotel Reviews from TripAdvisor with Python

article/2025/9/26 16:25:55

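I was scraping some hotel reviews from TripAdvisor and found a neat way to collect hundreds of thousands of them.

Suppose, for example, that we want to scrape hotel reviews for Gran Canaria. If we go to TripAdvisor, we will see that the URL is: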

https://www.tripadvisor.com/Hotels-g187471-Gran_Canaria_Canary_Islands-Hotels.html


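First, we need to retrieve the complete list of hotels for that location. To do that, we download the full HTML with requests.get(url) and then try to read the total number of properties from it.

If we look closely at the page HTML, we will see that this value sits inside a <span> tag.

Since that span has no identifier and its class appears to be auto-generated, we will select the span inside the div right next to .MOBILE_SORT_FILTER_BUTTONS. Something like: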

.MOBILE_SORT_FILTER_BUTTONS + div span


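First, we need to install the requests and bs4 packages with pip. We will also install pandas, so that we can work with a DataFrame later and export the results quickly.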

$ pip install requests bs4 pandas

Getting the number of pages

Once the libraries are installed, we can write the following code to get the number of pages:

import math
import re
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.tripadvisor.com/Hotels-g187471-Gran_Canaria_Canary_Islands-Hotels.html'
PER_PAGE = 30

# Download the first listing page and parse it
response = requests.get(BASE_URL).text
soup = BeautifulSoup(response, 'html.parser')

# The property count lives in the span next to the sort/filter buttons
span = soup.select('.MOBILE_SORT_FILTER_BUTTONS + div span')[0]
# Keep digits only, so that separators such as "1,181" or "1.181" do not break int()
N_PROPERTIES = int(re.sub(r'[^0-9]', '', span.text))
print(f'There are {N_PROPERTIES} properties')

N_PAGES = math.ceil(N_PROPERTIES / PER_PAGE)
print(f'There are {N_PAGES} different pages')
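The page-count arithmetic can be checked without touching the network. The following is a minimal sketch; the helper names `parse_property_count` and `count_pages`, as well as the sample text, are hypothetical and only serve as an illustration.

```python
import math
import re


def parse_property_count(text):
    """Extract an integer from text such as '1,181 properties' (made-up sample)."""
    digits = re.sub(r'[^0-9]', '', text)
    if not digits:
        raise ValueError(f'no digits found in {text!r}')
    return int(digits)


def count_pages(n_properties, per_page=30):
    """Number of listing pages needed when each page shows per_page hotels."""
    return math.ceil(n_properties / per_page)


n = parse_property_count('1,181 properties')
print(n, count_pages(n))
```

Rounding up with math.ceil matters here: a last page that is only partly filled still has to be downloaded.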

 

If we navigate to page 2, we will see that the URL changes to:

https://www.tripadvisor.com/Hotels-g187471-oa30-Gran_Canaria_Canary_Islands-Hotels.html

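Getting the hotel list

As we can see, the only change in the URL is the -oa30 added right after the location ID. If we navigate to the third page, -oa60 is used instead of -oa30, and the same pattern holds for every page. With this in mind, we can write functions to:

  1. Extract the location ID from the URL
  2. Generate the URL for each page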
def get_id_from_url(url):
    # Split the URL at '-g' to drop everything before the ID
    prefix, suffix = url.split('-g', maxsplit=1)
    # The ID ends at the first dash after it
    location_id, slug = suffix.split('-', maxsplit=1)
    return int(location_id)


def get_listing_url(page, base_url=BASE_URL, per_page=PER_PAGE):
    assert page >= 0
    location_id = get_id_from_url(base_url)
    if page == 0:
        return base_url
    # Page N starts at offset N * per_page, e.g. -oa30 for the second page
    return base_url.replace(f'-g{location_id}-', f'-g{location_id}-oa{page * per_page}-')

Once this code is written, we can loop from 0 to N_PAGES and generate the URL for each page.
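As a rough illustration of that loop, here is a self-contained sketch that builds the first three listing URLs. It uses a regular expression instead of the split-based helper; `listing_url` is a hypothetical name, and the default of 30 hotels per page is the value used earlier in this article.

```python
import re

BASE_URL = 'https://www.tripadvisor.com/Hotels-g187471-Gran_Canaria_Canary_Islands-Hotels.html'


def listing_url(page, base_url=BASE_URL, per_page=30):
    """Return the listing URL for a zero-based page index."""
    if page < 0:
        raise ValueError('page must be >= 0')
    if page == 0:
        return base_url
    offset = page * per_page
    # Insert -oa<offset> right after the -g<id> segment (first match only)
    return re.sub(r'-g(\d+)-', lambda m: f'-g{m.group(1)}-oa{offset}-', base_url, count=1)


urls = [listing_url(page) for page in range(3)]
for url in urls:
    print(url)
```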

Now, let's download each hotel listing page and build an array with the URL of every hotel:

import random

listings = []
for i in range(N_PAGES):
    url = get_listing_url(i)
    # Random delay to avoid TripAdvisor blocking us
    time.sleep(random.randint(2, 8))
    # Download the current page
    listing_html = requests.get(url)
    listing_soup = BeautifulSoup(listing_html.text, 'html.parser')
    # Add the hotels on this page to listings
    raw_listings = listing_soup.select('.listing')
    for raw_listing in raw_listings:
        listings.append('https://www.tripadvisor.com' + raw_listing.a['href'])
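Long crawls tend to hit occasional network errors, so it may be worth wrapping each download in a retry. The sketch below is not part of the original approach; `fetch_with_retries` is a hypothetical helper that retries any callable with a growing, jittered delay.

```python
import random
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), waiting longer after each failure before retrying."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                # Out of attempts: let the caller see the last error
                raise
            # 2s, 4s, 8s... plus up to one second of random jitter
            sleep(base_delay * (2 ** attempt) + random.random())
```

Here `fetch` could be a small function that calls requests.get(url, timeout=30) and then response.raise_for_status(), so that HTTP error codes also trigger a retry.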

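After a few minutes, the listings variable should be filled with the URL of every hotel 🤩

Analyzing the data

Now, let's see how to scrape the reviews of each hotel, starting from what the hotel page on TripAdvisor shows.

If we scroll down, we will see that each URL only gives us 5 reviews, which is not great (a hotel may have thousands of reviews!). So let's open Chrome DevTools and check what happens when we interact with this section.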

 

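If we now change the review language (for example, to German), we will see a request to /data/graphql/batched that looks interesting.

TripAdvisor sends requests with a certain structure to its GraphQL endpoint, and one of the attributes it sends is called locationId.

Once again, this locationId is exactly the same number that appears in the URL (in this case, Hotel_Review-g562819-d296922-Reviews-Bohemia_Suites_Spa...). What if we could use this endpoint to get the reviews of every hotel? 🤔

First, let's try to extract the location ID and the geo ID from a hotel URL. Every hotel URL looks similar to this one: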

https://www.tripadvisor.com/Hotel_Review-g562819-d296922-Reviews-Bohemia_Suites_Spa-Playa_del_Ingles_Maspalomas_Gran_Canaria_Canary_Islands.html

We need the number after -g and the number after -d:

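def get_ids_from_hotel_url(url):
    # The parts look like: ...Hotel_Review, g<geo_id>, d<location_id>, Reviews, ...
    parts = url.split('-')
    geo = parts[1]
    loc = parts[2]
    # Drop the leading letter and convert both IDs to int
    return (int(geo[1:]), int(loc[1:]))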
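The split-based version assumes that the URL always has the same shape. As an alternative sketch, a regular expression can do the same extraction and fail loudly when a URL does not match; `parse_hotel_ids` and `HOTEL_ID_PATTERN` are hypothetical names introduced for this example.

```python
import re

HOTEL_URL = ('https://www.tripadvisor.com/Hotel_Review-g562819-d296922-Reviews-'
             'Bohemia_Suites_Spa-Playa_del_Ingles_Maspalomas_Gran_Canaria_Canary_Islands.html')

HOTEL_ID_PATTERN = re.compile(r'Hotel_Review-g(\d+)-d(\d+)-')


def parse_hotel_ids(url):
    """Return (geo_id, location_id), or raise ValueError if the URL does not match."""
    match = HOTEL_ID_PATTERN.search(url)
    if match is None:
        raise ValueError(f'not a hotel review URL: {url}')
    return int(match.group(1)), int(match.group(2))


print(parse_hotel_ids(HOTEL_URL))  # (562819, 296922)
```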

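Getting data from GraphQL

Now, let's try to imitate the request that TripAdvisor sends to its GraphQL endpoint. If we copy the raw request from the Network tab, we will see something similar to the following JSON: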

[{"query": "mutation LogBBMLInteraction($interaction: ClientInteractionOpaqueInput!) {\n  logProductInteraction(interaction: $interaction)\n}\n","variables": {"interaction": {"productInteraction": {"interaction_type": "CLICK","site": {"site_name": "ta","site_business_unit": "Hotels","site_domain": "www.tripadvisor.com"},"pageview": {"pageview_request_uid": "X@2fPQokGCIABGTeHYoAAAES","pageview_attributes": {"location_id": 296922,"geo_id": 562819,"servlet_name": "Hotel_Review"}},"user": {"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36","site_persistent_user_uid": "web373a.83.56.0.34.17609EB3BAC","unique_user_identifiers": {"session_id": "F5A494D1D5DB4DD491B72FB55E860886"}},"search": {},"item_group": {"item_group_collection_key": "X@2fPQokGCIABGTeHYoAAAES"},"item": {"product_type": "Hotels","item_id_type": "ta-location-id","item_id": 296922,"item_attributes": {"element_type": "de","action_name": "REVIEW_FILTER_LANGUAGE"}}}}}},{"query": "query ReviewListQuery($locationId: Int!, $offset: Int, $limit: Int, $filters: [FilterConditionInput!], $prefs: ReviewListPrefsInput, $initialPrefs: ReviewListPrefsInput, $filterCacheKey: String, $prefsCacheKey: String, $keywordVariant: String!, $needKeywords: Boolean = true) {\n  cachedFilters: personalCache(key: $filterCacheKey)\n  cachedPrefs: personalCache(key: $prefsCacheKey)\n  locations(locationIds: [$locationId]) {\n    locationId\n    parentGeoId\n    name\n    placeType\n    reviewSummary {\n      rating\n      count\n    }\n    keywords(variant: $keywordVariant) @include(if: $needKeywords) {\n      keywords {\n        keyword\n      }\n    }\n    ... on LocationInformation {\n      parentGeoId\n    }\n    ... on LocationInformation {\n      parentGeoId\n    }\n    ... on LocationInformation {\n      name\n      currentUserOwnerStatus {\n        isValid\n      }\n    }\n    ... 
on LocationInformation {\n      locationId\n      currentUserOwnerStatus {\n        isValid\n      }\n    }\n    ... on LocationInformation {\n      locationId\n      parentGeoId\n      accommodationCategory\n      currentUserOwnerStatus {\n        isValid\n      }\n      url\n    }\n    reviewListPage(page: {offset: $offset, limit: $limit}, filters: $filters, prefs: $prefs, initialPrefs: $initialPrefs, filterCacheKey: $filterCacheKey, prefsCacheKey: $prefsCacheKey) {\n      totalCount\n      preferredReviewIds\n      reviews {\n        ... on Review {\n          id\n          url\n          location {\n            locationId\n            name\n          }\n          createdDate\n          publishedDate\n          provider {\n            isLocalProvider\n          }\n          userProfile {\n            id\n            userId: id\n            isMe\n            isVerified\n            displayName\n            username\n            avatar {\n              id\n              photoSizes {\n                url\n                width\n                height\n              }\n            }\n            hometown {\n              locationId\n              fallbackString\n              location {\n                locationId\n                additionalNames {\n                  long\n                }\n                name\n              }\n            }\n            contributionCounts {\n              sumAllUgc\n              helpfulVote\n            }\n            route {\n              url\n            }\n          }\n        }\n        ... on Review {\n          title\n          language\n          url\n        }\n        ... on Review {\n          language\n          translationType\n        }\n        ... on Review {\n          roomTip\n        }\n        ... on Review {\n          tripInfo {\n            stayDate\n          }\n          location {\n            placeType\n          }\n        }\n        ... 
on Review {\n          additionalRatings {\n            rating\n            ratingLabel\n          }\n        }\n        ... on Review {\n          tripInfo {\n            tripType\n          }\n        }\n        ... on Review {\n          language\n          translationType\n          mgmtResponse {\n            id\n            language\n            translationType\n          }\n        }\n        ... on Review {\n          text\n          publishedDate\n          username\n          connectionToSubject\n          language\n          mgmtResponse {\n            id\n            text\n            language\n            publishedDate\n            username\n            connectionToSubject\n          }\n        }\n        ... on Review {\n          id\n          locationId\n          title\n          text\n          rating\n          absoluteUrl\n          mcid\n          translationType\n          mtProviderId\n          photos {\n            id\n            statuses\n            photoSizes {\n              url\n              width\n              height\n            }\n          }\n          userProfile {\n            id\n            displayName\n            username\n          }\n        }\n        ... on Review {\n          mgmtResponse {\n            id\n          }\n          provider {\n            isLocalProvider\n          }\n        }\n        ... on Review {\n          translationType\n          location {\n            locationId\n            parentGeoId\n          }\n          provider {\n            isLocalProvider\n            isToolsProvider\n          }\n          original {\n            id\n            url\n            locationId\n            userId\n            language\n            submissionDomain\n          }\n        }\n        ... on Review {\n          locationId\n          mcid\n          attribution\n        }\n        ... 
on Review {\n          __typename\n          locationId\n          helpfulVotes\n          photoIds\n          route {\n            url\n          }\n          socialStatistics {\n            followCount\n            isFollowing\n            isLiked\n            isReposted\n            isSaved\n            likeCount\n            repostCount\n            tripCount\n          }\n          status\n          userId\n          userProfile {\n            id\n            displayName\n            isFollowing\n          }\n          location {\n            __typename\n            locationId\n            additionalNames {\n              normal\n              long\n              longOnlyParent\n              longParentAbbreviated\n              longOnlyParentAbbreviated\n              longParentStateAbbreviated\n              longOnlyParentStateAbbreviated\n              geo\n              abbreviated\n              abbreviatedRaw\n              abbreviatedStateTerritory\n              abbreviatedStateTerritoryRaw\n            }\n            parent {\n              locationId\n              additionalNames {\n                normal\n                long\n                longOnlyParent\n                longParentAbbreviated\n                longOnlyParentAbbreviated\n                longParentStateAbbreviated\n                longOnlyParentStateAbbreviated\n                geo\n                abbreviated\n                abbreviatedRaw\n                abbreviatedStateTerritory\n                abbreviatedStateTerritoryRaw\n              }\n            }\n          }\n        }\n        ... on Review {\n          text\n          language\n        }\n        ... on Review {\n          locationId\n          absoluteUrl\n          mcid\n          translationType\n          mtProviderId\n          originalLanguage\n          rating\n        }\n        ... 
on Review {\n          id\n          locationId\n          title\n          labels\n          rating\n          absoluteUrl\n          mcid\n          translationType\n          mtProviderId\n          alertStatus\n        }\n      }\n    }\n    reviewAggregations {\n      ratingCounts\n      languageCounts\n      alertStatusCount\n    }\n  }\n}\n","variables": {"locationId": 296922,"offset": 0,"filters": [{"axis": "LANGUAGE","selections": ["de"]}],"prefs": null,"initialPrefs": {},"limit": 5,"filterCacheKey": null,"prefsCacheKey": "locationReviewPrefs","needKeywords": false,"keywordVariant": "location_keywords_v2_llr_order_30_en"}},{"query": "mutation UpdateReviewSettings($key: String!, $val: String!) {\n  writePersonalCache(key: $key, value: $val)\n}\n","variables": {"key": "locationReviewFilters_296922","val": "[{\"axis\":\"LANGUAGE\",\"selections\":[\"de\"]}]"}}]

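Analyzing this JSON, we can see that it is an array with 3 elements. The first element seems to log the interaction. The second element has some interesting properties, such as variables.locationId, variables.filters.selections (which appears to be an array of ISO language codes), variables.offset (the number of reviews to skip) and variables.limit (the maximum number of reviews to return). The third element seems to write the user preferences to their database.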
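To make the role of offset and limit concrete, here is a small sketch that builds the variables object of the second element for a given page. The function name `review_query_variables` is hypothetical; the keys and constant values are copied from the captured request above.

```python
import json


def review_query_variables(location_id, page=0, limit=20, languages=('en',)):
    """Build the variables object for the ReviewListQuery element of the batch."""
    return {
        'locationId': location_id,
        'offset': page * limit,  # number of reviews to skip
        'limit': limit,          # number of reviews per request
        'filters': [{'axis': 'LANGUAGE', 'selections': list(languages)}],
        'prefs': None,
        'initialPrefs': {},
        'filterCacheKey': None,
        'prefsCacheKey': 'locationReviewPrefs',
        'needKeywords': False,
        'keywordVariant': 'location_keywords_v2_llr_order_30_en',
    }


variables = review_query_variables(296922, page=2, languages=('de', 'en'))
print(json.dumps(variables, indent=2))
```

Note that Python's None and False are serialized as JSON null and false, which is why the Python version of the payload can use them directly.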




Knowing this, we can write a new function that fetches the GraphQL data for a given hotel:

GRAPHQL_URL = 'https://www.tripadvisor.com/data/graphql/batched'


def request_graphql(url, page=0):
    geo, loc = get_ids_from_hotel_url(url)
    request = [
        # Element 1: logs the interaction
        {"query": "mutation LogBBMLInteraction($interaction: ClientInteractionOpaqueInput!) {\n  logProductInteraction(interaction: $interaction)\n}\n", "variables": {"interaction": {"productInteraction": {"interaction_type": "CLICK", "site": {"site_name": "ta", "site_business_unit": "Hotels", "site_domain": "www.tripadvisor.com"}, "pageview": {"pageview_request_uid": "X@2fPQokGCIABGTeHYoAAAES", "pageview_attributes": {"location_id": loc, "geo_id": geo, "servlet_name": "Hotel_Review"}}, "user": {"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36", "site_persistent_user_uid": "web373a.83.56.0.34.17609EB3BAC", "unique_user_identifiers": {"session_id": '{YOUR_SESSION_ID}'}}, "search": {}, "item_group": {"item_group_collection_key": "X@2fPQokGCIABGTeHYoAAAES"}, "item": {"product_type": "Hotels", "item_id_type": "ta-location-id", "item_id": loc, "item_attributes": {"element_type": "es", "action_name": "REVIEW_FILTER_LANGUAGE"}}}}}},
        # Element 2: the review list query; offset and limit drive the pagination
        {"query": "query ReviewListQuery($locationId: Int!, $offset: Int, $limit: Int, $filters: [FilterConditionInput!], $prefs: ReviewListPrefsInput, $initialPrefs: ReviewListPrefsInput, $filterCacheKey: String, $prefsCacheKey: String, $keywordVariant: String!, $needKeywords: Boolean = true) {\n  cachedFilters: personalCache(key: $filterCacheKey)\n  cachedPrefs: personalCache(key: $prefsCacheKey)\n  locations(locationIds: [$locationId]) {\n    locationId\n    parentGeoId\n    name\n    placeType\n    reviewSummary {\n      rating\n      count\n    }\n    keywords(variant: $keywordVariant) @include(if: $needKeywords) {\n      keywords {\n        keyword\n      }\n    }\n    ... on LocationInformation {\n      parentGeoId\n    }\n    ... on LocationInformation {\n      parentGeoId\n    }\n    ... on LocationInformation {\n      name\n      currentUserOwnerStatus {\n        isValid\n      }\n    }\n    ... on LocationInformation {\n      locationId\n      currentUserOwnerStatus {\n        isValid\n      }\n    }\n    ... on LocationInformation {\n      locationId\n      parentGeoId\n      accommodationCategory\n      currentUserOwnerStatus {\n        isValid\n      }\n      url\n    }\n    reviewListPage(page: {offset: $offset, limit: $limit}, filters: $filters, prefs: $prefs, initialPrefs: $initialPrefs, filterCacheKey: $filterCacheKey, prefsCacheKey: $prefsCacheKey) {\n      totalCount\n      preferredReviewIds\n      reviews {\n        ... on Review {\n          id\n          url\n          location {\n            locationId\n            name\n          }\n          createdDate\n          publishedDate\n          provider {\n            isLocalProvider\n          }\n          userProfile {\n            id\n            userId: id\n            isMe\n            isVerified\n            displayName\n            username\n            avatar {\n              id\n              photoSizes {\n                url\n                width\n                height\n              }\n            }\n            hometown {\n              locationId\n              fallbackString\n              location {\n                locationId\n                additionalNames {\n                  long\n                }\n                name\n              }\n            }\n            contributionCounts {\n              sumAllUgc\n              helpfulVote\n            }\n            route {\n              url\n            }\n          }\n        }\n        ... on Review {\n          title\n          language\n          url\n        }\n        ... on Review {\n          language\n          translationType\n        }\n        ... on Review {\n          roomTip\n        }\n        ... on Review {\n          tripInfo {\n            stayDate\n          }\n          location {\n            placeType\n          }\n        }\n        ... on Review {\n          additionalRatings {\n            rating\n            ratingLabel\n          }\n        }\n        ... on Review {\n          tripInfo {\n            tripType\n          }\n        }\n        ... on Review {\n          language\n          translationType\n          mgmtResponse {\n            id\n            language\n            translationType\n          }\n        }\n        ... on Review {\n          text\n          publishedDate\n          username\n          connectionToSubject\n          language\n          mgmtResponse {\n            id\n            text\n            language\n            publishedDate\n            username\n            connectionToSubject\n          }\n        }\n        ... on Review {\n          id\n          locationId\n          title\n          text\n          rating\n          absoluteUrl\n          mcid\n          translationType\n          mtProviderId\n          photos {\n            id\n            statuses\n            photoSizes {\n              url\n              width\n              height\n            }\n          }\n          userProfile {\n            id\n            displayName\n            username\n          }\n        }\n        ... on Review {\n          mgmtResponse {\n            id\n          }\n          provider {\n            isLocalProvider\n          }\n        }\n        ... on Review {\n          translationType\n          location {\n            locationId\n            parentGeoId\n          }\n          provider {\n            isLocalProvider\n            isToolsProvider\n          }\n          original {\n            id\n            url\n            locationId\n            userId\n            language\n            submissionDomain\n          }\n        }\n        ... on Review {\n          locationId\n          mcid\n          attribution\n        }\n        ... on Review {\n          __typename\n          locationId\n          helpfulVotes\n          photoIds\n          route {\n            url\n          }\n          socialStatistics {\n            followCount\n            isFollowing\n            isLiked\n            isReposted\n            isSaved\n            likeCount\n            repostCount\n            tripCount\n          }\n          status\n          userId\n          userProfile {\n            id\n            displayName\n            isFollowing\n          }\n          location {\n            __typename\n            locationId\n            additionalNames {\n              normal\n              long\n              longOnlyParent\n              longParentAbbreviated\n              longOnlyParentAbbreviated\n              longParentStateAbbreviated\n              longOnlyParentStateAbbreviated\n              geo\n              abbreviated\n              abbreviatedRaw\n              abbreviatedStateTerritory\n              abbreviatedStateTerritoryRaw\n            }\n            parent {\n              locationId\n              additionalNames {\n                normal\n                long\n                longOnlyParent\n                longParentAbbreviated\n                longOnlyParentAbbreviated\n                longParentStateAbbreviated\n                longOnlyParentStateAbbreviated\n                geo\n                abbreviated\n                abbreviatedRaw\n                abbreviatedStateTerritory\n                abbreviatedStateTerritoryRaw\n              }\n            }\n          }\n        }\n        ... on Review {\n          text\n          language\n        }\n        ... on Review {\n          locationId\n          absoluteUrl\n          mcid\n          translationType\n          mtProviderId\n          originalLanguage\n          rating\n        }\n        ... on Review {\n          id\n          locationId\n          title\n          labels\n          rating\n          absoluteUrl\n          mcid\n          translationType\n          mtProviderId\n          alertStatus\n        }\n      }\n    }\n    reviewAggregations {\n      ratingCounts\n      languageCounts\n      alertStatusCount\n    }\n  }\n}\n", "variables": {"locationId": loc, "offset": page * 20, "filters": [{"axis": "LANGUAGE", "selections": ["es", "en", "de", "fr", "it"]}], "prefs": None, "initialPrefs": {}, "limit": 20, "filterCacheKey": None, "prefsCacheKey": "locationReviewPrefs", "needKeywords": False, "keywordVariant": "location_keywords_v2_llr_order_30_en"}},
        # Element 3: stores the language filter in the user's preferences
        {"query": "mutation UpdateReviewSettings($key: String!, $val: String!) {\n  writePersonalCache(key: $key, value: $val)\n}\n", "variables": {"key": f"locationReviewFilters_{loc}", "val": "[{\"axis\":\"LANGUAGE\",\"selections\":[\"es\"]}]"}},
    ]
    response = requests.post(GRAPHQL_URL, json=request, headers={
        'origin': 'https://www.tripadvisor.com',
        'pragma': 'no-cache',
        'referer': url,
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
        'x-requested-by': '{YOUR_REQUESTED_BY}',
        'Cookie': '{YOUR_COOKIE_STRING}',
    })
    return response.json()

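Before moving on, note that you need to change three things inside this function:

  • {YOUR_SESSION_ID}: search for the "TASID" cookie in DevTools and put its value here.
  • {YOUR_COOKIE_STRING}: go to the request headers and paste the full Cookie string here.
  • {YOUR_REQUESTED_BY}: go to the request headers and paste the full X-Requested-By header here.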
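Since the session ID is itself one of the cookies, it can also be read out of the pasted Cookie string instead of being looked up by hand. This is a minimal sketch; `parse_cookie_header` is a hypothetical helper, and the sample string below is made up (only the TASID name comes from the text above).

```python
def parse_cookie_header(cookie_string):
    """Split a raw Cookie header into a {name: value} dictionary."""
    cookies = {}
    for part in cookie_string.split(';'):
        # partition keeps any '=' that appears inside the value itself
        name, sep, value = part.strip().partition('=')
        if sep:
            cookies[name] = value
    return cookies


# Made-up cookie string for illustration; a real one is much longer
sample = 'first=abc123; TASID=F5A494D1D5DB4DD491B72FB55E860886; other=a=b'
session_id = parse_cookie_header(sample)['TASID']
print(session_id)
```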

 

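Now that this function is ready, we are very close to iterating over every hotel URL and calling it to get all the reviews. Only one thing is left: we need to find out how many reviews each hotel has.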
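Once the total review count of a hotel is known, the number of requests follows directly from the page size. A small sketch, assuming the page size of 20 used in request_graphql; `pages_needed` and its optional `max_pages` cap are hypothetical names.

```python
import math


def pages_needed(total_count, page_size=20, max_pages=None):
    """Number of requests needed to cover total_count reviews."""
    pages = math.ceil(total_count / page_size)
    if max_pages is not None:
        pages = min(pages, max_pages)
    return pages


# Offsets that would be requested for a hotel with 45 reviews
offsets = [page * 20 for page in range(pages_needed(45))]
print(offsets)
```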

Let's try our super function and see what happens :-)

 

And it works! Under data.locations[0].reviewListPage.reviews in the second element of the response, we just got the first 20 reviews of this hotel!
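The nesting is easy to get wrong, so here is a sketch of a helper that pulls the total count and the reviews out of a batched response and tolerates a missing reviewListPage. The name `extract_reviews` is hypothetical, and the response used below is a trimmed, made-up stand-in with the same nesting.

```python
def extract_reviews(batched_response):
    """Return (total_count, reviews) from a batched GraphQL response."""
    # Element 1 of the batch is the answer to the ReviewListQuery
    locations = batched_response[1]['data']['locations']
    if not locations:
        return 0, []
    page = locations[0].get('reviewListPage')
    if page is None:
        return 0, []
    return page.get('totalCount', 0), page.get('reviews') or []


# Trimmed, made-up response for illustration
fake_response = [
    {'data': {}},
    {'data': {'locations': [{
        'name': 'Example Hotel',
        'reviewListPage': {'totalCount': 2, 'reviews': [{'id': 1}, {'id': 2}]},
    }]}},
    {'data': {}},
]
total, reviews = extract_reviews(fake_response)
print(total, [review['id'] for review in reviews])
```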

Each review has the following structure:

{"id": 780641546,"url": "/ShowUserReviews-g664857-d559667-r780641546-Hotel_Cordial_Mogan_Playa-Puerto_de_Mogan_Mogan_Gran_Canaria_Canary_Islands.html","location": {"locationId": 559667,"name": "Hotel Cordial Mogan Playa","placeType": "ACCOMMODATION","parentGeoId": 187471,"__typename": "LocationInformation","additionalNames": {"normal": "Hotel Cordial Mogan Playa","long": "Hotel Cordial Mogan Playa, Spain","longOnlyParent": "Spain","longParentAbbreviated": "Hotel Cordial Mogan Playa, Spain","longOnlyParentAbbreviated": "Spain","longParentStateAbbreviated": "Hotel Cordial Mogan Playa, Spain","longOnlyParentStateAbbreviated": "Spain","geo": "Puerto de Mogan","abbreviated": "Hotel Cordial Mogan Playa","abbreviatedRaw": "Hotel Cordial Mogan Playa","abbreviatedStateTerritory": "Hotel Cordial Mogan Playa","abbreviatedStateTerritoryRaw": "Hotel Cordial Mogan Playa"},"parent": {"locationId": 187471,"additionalNames": {"normal": "Gran Canaria","long": "Gran Canaria, Spain","longOnlyParent": "Spain","longParentAbbreviated": "Gran Canaria, Spain","longOnlyParentAbbreviated": "Spain","longParentStateAbbreviated": "Gran Canaria, Spain","longOnlyParentStateAbbreviated": "Spain","geo": "Gran Canaria","abbreviated": "Gran Canaria","abbreviatedRaw": "Gran Canaria","abbreviatedStateTerritory": "Gran Canaria","abbreviatedStateTerritoryRaw": "Gran Canaria"}}},"createdDate": "2021-01-06","publishedDate": "2021-01-06","provider": {"isLocalProvider": true,"isToolsProvider": true},"userProfile": {"id": "63A449F68F3328E582979E7BC8F5D5E3","userId": "63A449F68F3328E582979E7BC8F5D5E3","isMe": false,"isVerified": false,"displayName": "kattullus","username": "kattullus","avatar": {"id": 120146276,"photoSizes": []},"hometown": {"locationId": 189852,"fallbackString": "189852","location": {"locationId": 189852,"additionalNames": {"long": "Stockholm, Sweden"},"name": "Stockholm"}},"contributionCounts": {"sumAllUgc": 890,"helpfulVote": 150},"route": {"url": "/Profile/kattullus"},"isFollowing": 
false},"title": "Did not stay but want to applaud the fantastic New Year's buffet","language": "en","translationType": null,"roomTip": null,"tripInfo": {"stayDate": "2020-12-31","tripType": "NONE"},"additionalRatings": [{"rating": 4,"ratingLabel": "Location"},{"rating": 5,"ratingLabel": "Service"},{"rating": 5,"ratingLabel": "Sleep Quality"}],"text": "The New Year's buffet was arguably the finest buffet we have enjoyed. Hundreds of dishes, all beautifully presented and delicious. Wines/beer included and the cost very reasonable. The venue is fantastic and in itself worth a detour!","username": "kattullus","connectionToSubject": null,"locationId": 559667,"rating": 5,"absoluteUrl": "https://www.tripadvisor.com/ShowUserReviews-g664857-d559667-r780641546-Hotel_Cordial_Mogan_Playa-Puerto_de_Mogan_Mogan_Gran_Canaria_Canary_Islands.html","mcid": 53922,"mtProviderId": 0,"photos": [],"original": null,"attribution": null,"__typename": "Review","helpfulVotes": 0,"photoIds": [],"route": {"url": "/ShowUserReviews-g664857-d559667-r780641546-Hotel_Cordial_Mogan_Playa-Puerto_de_Mogan_Mogan_Gran_Canaria_Canary_Islands.html"},"originalLanguage": "en","labels": [],"alertStatus": false}

Let's parse only the fields we need from this document to build our dataset:

# Cap on review pages per hotel. The value is an assumption (the original
# does not define MAX_PAGES); adjust it to your needs.
MAX_PAGES = 50

data = []
for hotel_url in listings:
    response = request_graphql(hotel_url)[1]['data']['locations'][0]
    hotel_name = response['name']
    print(f'Scraping {hotel_name}')
    # Get the total review count
    total_reviews = response['reviewListPage']['totalCount']
    # Number of pages needed to get all the reviews (20 per page)
    pages = math.ceil(total_reviews / 20)
    pages = min(MAX_PAGES, pages)
    # Iterate through every page to get all the reviews
    for i in range(pages):
        # Sleep a random number of seconds to avoid being blocked
        time.sleep(random.randint(1, 3))
        # Get the GraphQL response for this page
        response = request_graphql(hotel_url, page=i)[1]['data']['locations'][0]
        # Get the reviews from the response
        reviews = response['reviewListPage']['reviews'] if response['reviewListPage'] is not None else []
        # Add each review to the array
        for review in reviews:
            review_title = review['title']
            review_description = review['text']
            location = review['location']['parent']['additionalNames']['normal']
            review_data = {
                'Hotel Name': hotel_name,
                'Review Date': review['createdDate'],
                'Stay Date': review['tripInfo']['stayDate'] if review['tripInfo'] is not None else None,
                'Location': location,
                'Lang': review['language'],
                'Room Tip': review['roomTip'] if 'roomTip' in review else None,
                'Review Title': review_title,
                'Review Stars': review['rating'],
                'Review': review_description,
                'User Name': review['userProfile']['displayName'] if review['userProfile'] else None,
                'Hometown': review['userProfile']['hometown']['location']['additionalNames']['long'] if review['userProfile'] is not None and review['userProfile']['hometown']['location'] is not None else None
            }
            # Iterate through additionalRatings (Cleanliness, Room Service...)
            for rating in review['additionalRatings'] or []:
                review_data[f'{rating["ratingLabel"]} Stars'] = rating['rating']
            data.append(review_data)
    print(f'Reviews: {len(data)}')
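Several of the nested fields can be null (tripInfo, userProfile, hometown.location), which is why the loop above needs so many conditionals. One possible way to tidy this up, sketched under the assumption that a missing value should simply become None, is a small lookup helper. `dig` and `review_to_row` are hypothetical names; the sample values are taken from the review shown earlier.

```python
def dig(obj, *keys):
    """Follow keys through nested dicts; return None as soon as a step is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def review_to_row(review, hotel_name):
    """Flatten one review object into a row for the final table."""
    row = {
        'Hotel Name': hotel_name,
        'Review Date': review.get('createdDate'),
        'Stay Date': dig(review, 'tripInfo', 'stayDate'),
        'Lang': review.get('language'),
        'Review Title': review.get('title'),
        'Review Stars': review.get('rating'),
        'User Name': dig(review, 'userProfile', 'displayName'),
        'Hometown': dig(review, 'userProfile', 'hometown', 'location',
                        'additionalNames', 'long'),
    }
    for rating in review.get('additionalRatings') or []:
        row[f"{rating['ratingLabel']} Stars"] = rating['rating']
    return row


sample_review = {
    'createdDate': '2021-01-06',
    'language': 'en',
    'rating': 5,
    'tripInfo': {'stayDate': '2020-12-31', 'tripType': 'NONE'},
    'userProfile': {
        'displayName': 'kattullus',
        'hometown': {'location': {'additionalNames': {'long': 'Stockholm, Sweden'}}},
    },
    'additionalRatings': [{'rating': 4, 'ratingLabel': 'Location'},
                          {'rating': 5, 'ratingLabel': 'Service'}],
}
print(review_to_row(sample_review, 'Hotel Cordial Mogan Playa'))
```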

Now that the array is filled (this can take a long time, depending on how many hotels you need), let's build a pandas DataFrame and store the results as CSV:

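import pandas as pd

df = pd.DataFrame(data)
# utf-8-sig adds a BOM so that Excel detects the encoding; ';' is the column separator
df.to_csv('./reviews.csv', index=False, encoding='utf-8-sig', sep=';')
df.head()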
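Rows do not all share the same keys, because the "... Stars" columns depend on which additional ratings a review has. pandas handles this by filling the gaps with empty values. For readers who want to see the same idea without pandas, here is a standard-library sketch; `rows_to_csv` is a hypothetical helper and the two rows are made-up examples.

```python
import csv
import io


def rows_to_csv(rows, delimiter=';'):
    """Serialize dict rows to CSV text, using the union of all keys as columns."""
    # dict.fromkeys keeps the first-seen order of the column names
    columns = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter=delimiter, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {'Hotel Name': 'Example Hotel', 'Review Stars': 5, 'Service Stars': 5},
    {'Hotel Name': 'Example Hotel', 'Review Stars': 3, 'Location Stars': 4},
]
print(rows_to_csv(rows))
```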

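Summary

We have learned how to use the TripAdvisor GraphQL endpoint to request all the reviews for a list of hotels and produce a structured CSV file that can be used for further ML analysis.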
