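This article was originally written in July 2021. Because of the author's procrastination it is only now reaching readers (even if I am most likely talking to myself), so some of the specifics in it may have changed.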

Cover Photo by Maksym Kaharlytskyi on Unsplash

Bulk-downloading data with the UN Comtrade API

Preface
Background
- What is [UN Comtrade DB](https://comtrade.un.org/)
- The UN Comtrade data extraction API
- Python knowledge
Hands-on
- Understanding the Data Request Form
- Dynamic IP
- Coding
  - Part 1 Download specific URL to target location
  - Part 2 The Main Function
  - Part 3 Multithreading
  - Part 4 The Last few lines
Conclusion
Future Works

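Preface

This article records my first encounter with the UN Comtrade Database; four days of first-time web scraping left me thoroughly exhausted. I am sharing the experience here as a starting point, in the hope that it gives later readers some ideas.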

Background

This part introduces what the UN Comtrade Database is, how to use its API, and the Python knowledge you will need.

What is UN Comtrade DB

UN COMTRADE is the pseudonym for United Nations International Trade Statistics Database. Over 170 reporter countries/areas provide the United Nations Statistics Division (UNSD) with their annual and monthly international trade statistics data detailed by commodities/service categories and partner countries.

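According to the description above from the UNSD knowledge base, the Comtrade Database lets you query annual and monthly trade data for each commodity/service category, as reported by 170+ countries. The exact data types are analyzed in detail in the Hands-on part. Until then you can get a rough feel from the online preview on the official website, where a single page lets you specify the period, product type, product classification, reporter, partner and other attributes.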

The UN Comtrade data extraction API

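We can use this API to extract specific data from the database, download it in CSV or JSON format, or use an AJAX call to integrate the data into your own web page.

UNSD encourages users to share their visualizations with them, and outstanding work may be featured in UN Comtrade Labs. I found many interesting and very inspiring projects there (aka I am far too much of a rookie to build any of them myself).

Python knowledge

Basic syntax is all you need! If Python is unfamiliar to you, you can follow the Python tutorial on Liao Xuefeng's (廖雪峰) official website; studying up to the advanced features chapter is enough to understand everything below.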

Hands-on

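My work refines the two generously shared posts by the blogger 王蛋糕cake listed below. I optimized interrupt handling (having just sat my computer organization exam, I shudder as I write this), download speed, and the validity check on downloaded content. I also added a clean-up step after downloading that removes files whose download failed.

UN Comtrade（联合国商品贸易统计数据库）数据爬取Python代码 (Python code for scraping data from UN Comtrade, the United Nations commodity trade statistics database)
UN Comtrade（联合国商品贸易统计数据库）数据爬取Python代码——使用动态IP (the same, using dynamic IPs)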

Understanding the Data Request Form

The UN Comtrade data request format section of the API documentation gives the basic format of a request:

http://comtrade.un.org/api/get?parameters

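The values allowed in parameters are fully described in the documentation, so I will not repeat them here. One point is worth stressing, though: when freq is set to M (monthly data), do not pick any of the SITC classifications (ST, S1, S2, … , S4) for the px (classification) parameter. No such data exists, so every table you get back will be empty.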
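This pitfall can be caught before any request goes out. Here is a minimal sketch; `check_classification` is a hypothetical helper of my own (not part of the API or of the script in this article), and the set of SITC codes is simply the list given above.

```python
# Hypothetical pre-flight check; the SITC codes are the ones listed in the text.
SITC_CODES = {"ST", "S1", "S2", "S3", "S4"}


def check_classification(freq, px):
    """Return False for the combination known to yield empty tables.

    Monthly requests (freq="M") with an SITC classification come back
    empty, so we flag them before sending anything.
    """
    if freq.upper() == "M" and px.upper() in SITC_CODES:
        return False
    return True


print(check_classification("M", "HS"))  # monthly + HS
print(check_classification("M", "S2"))  # monthly + SITC
```

Calling this once before building each URL costs nothing and saves a request from the hourly quota.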

Suppose we want the US import and export commodity data for June 2020 at AG4 precision, in csv format. How should the request url be written?

One possible answer is:

https://comtrade.un.org/api/get?max=100000&type=C&freq=M&px=HS&ps=202006&r=all&p=842&rg=all&cc=AG4&fmt=csv

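It may differ somewhat from yours. Note that the API is not sensitive to parameter order, and that the r and p parameters can be set to 842 (code for USA) and all either way round. If you have read the Knowledgebase documentation carefully, you will know that for the same trade record the trade value differs between the importer's and the exporter's point of view (that is, depending on whether the importer or the exporter is chosen as reporter). The import value is usually higher than the export value because: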

Imports is generally reported on the basis of Cost, Insurance and Freight, (CIF) while exports is reported on a Free on Board (FOB) basis. For this reason, import values tend to be higher than export values.

Also, according to the World Bank, import data is generally more accurate, because it is the basis for calculating customs duties.
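The request URL discussed above does not have to be typed by hand. A minimal sketch, assuming a hypothetical helper `build_url` (my own name, not something the API provides) built on the standard library's `urlencode`; it also shows the two equivalent ways of placing 842 and all:

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = "https://comtrade.un.org/api/get"


def build_url(**params):
    """Assemble a request URL from keyword parameters, keeping their order."""
    return BASE_URL + "?" + urlencode(params)


# USA as partner, every country as reporter
url_a = build_url(max=100000, type="C", freq="M", px="HS", ps=202006,
                  r="all", p=842, rg="all", cc="AG4", fmt="csv")

# USA as reporter, every country as partner, parameters in another order
url_b = build_url(fmt="csv", cc="AG4", rg="all", p="all", r=842,
                  ps=202006, px="HS", freq="M", type="C", max=100000)

print(url_a)
print(parse_qs(urlparse(url_b).query))
```

Keeping the parameters in a dictionary-like form makes it easy to change one of them in a loop later on.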

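Back in reality, though, a request like this will rarely return the target data: US trade volume is large, and the number of import and export records at 4-digit HS code (AG4) level is so high that it may well exceed the record limit for guest users.
If we are unwilling to compromise on precision, the request has to be split into several. Many splitting schemes are possible, for example:

Download imports and exports separately (by altering the rg parameter)
Iterate over country codes, specifying one reporter code per request (by specifying the r parameter)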
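Both splitting schemes can be combined in a few lines. This is only a sketch: `split_requests` is a hypothetical helper, rg=1 and rg=2 are taken to mean imports and exports as in the main function later in this article, and the two reporter codes are just sample values.

```python
from urllib.parse import urlencode

BASE_URL = "https://comtrade.un.org/api/get"


def split_requests(period, reporter_codes, flows=(1, 2)):
    """Yield one URL per (reporter, trade flow) pair for a single period."""
    for reporter in reporter_codes:
        for flow in flows:  # 1 = imports, 2 = exports (as used later in main)
            params = {"max": 100000, "type": "C", "freq": "M", "px": "HS",
                      "ps": period, "r": reporter, "p": "all",
                      "rg": flow, "cc": "AG4", "fmt": "csv"}
            yield BASE_URL + "?" + urlencode(params)


# sample reporter codes: 842 (USA) and 156 (China)
urls = list(split_requests(202006, [842, 156]))
for url in urls:
    print(url)
```

Two reporters times two flows gives four requests, each returning far fewer records than the single combined request.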

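This leads to the parameter dilemma. Because of the rate and volume limits on guest users described below, it becomes a trade-off between records per download and number of downloads: the finer each download is split, the more downloads are needed, and because of the rate limit, the more downloads there are, the longer the whole job takes. We want to keep the two as balanced as possible.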

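In the Usage limits section you will learn:

Guest users may send no more than 1 request per second (rate) and no more than 100 requests per hour.
For the ps, r and p parameters, no more than five codes may be supplied for each, and ALL may be used for only one of the three.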
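To get a feel for what these limits mean in practice, here is a rough back-of-the-envelope sketch. Both helpers are hypothetical, joining the codes with commas is my assumption about how several codes are passed in one parameter, and the figure of 250 partners is an illustrative number, not a count taken from the database.

```python
import math


def chunk_codes(codes, size=5):
    """Group codes into comma-separated strings of at most `size` codes each."""
    codes = [str(code) for code in codes]
    return [",".join(codes[i:i + size]) for i in range(0, len(codes), size)]


def hours_needed(n_requests, per_hour=100):
    """Rough number of hours one guest IP needs at `per_hour` requests per hour."""
    return math.ceil(n_requests / per_hour)


# twelve sample codes fall into three groups of at most five
print(chunk_codes(range(1, 13)))

# 36 months x 250 partners (illustrative) x 2 trade flows
total = 36 * 250 * 2
print(total, "requests, roughly", hours_needed(total), "hours from a single IP")
```

Numbers of this size are the reason the rest of the article reaches for dynamic IPs and several threads.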

The example below shows how combining dynamic IPs with simple multithreading makes data retrieval more efficient.

Dynamic IP

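The cheapest dynamic IP service offered by a Taobao (TB) seller will do (15 RMB per day, at least that was the going rate in 07/2021). The seller I found thoughtfully provided proxy-setup syntax for several packages in several languages, but for our example you only need the IP, port, username and password, filled into the corresponding part of the code below.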

Coding

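Requirement: obtain monthly import and export data at AG2 precision for every country, 2019-2021;
Analysis: because of the record limit, use the following strategy: for each month, download imports and exports separately with each country as partner;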

Part 1 Download specific URL to target location

import requests
import os
import json
from random import randint
import time
import threading
import datetime

# proxy verification process, it might be different depending on your proxy provider
# please change the content quoted below accordingly
proxyHost = "your host link"
proxyPort = "port your host provided"
proxyUser = "username"
proxyPass = "password"

# User agents (UA)
user_agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]


def download_url(url, path, proxy):
    ''' This function is used to download the passed url to the target file (path) using the proxy config'''
    # note that sometimes we only get error information in the responses,
    # and here are some really dumb quick fixes: the exact error bodies seen so far
    error_bodies = (
        "<html><body><h1>502 Bad Gateway</h1>\nThe server returned an invalid or incomplete response.\n</body></html>\n",
        "Too Many Requests.\n",
        "{\"Message\":\"An error has occurred.\"}",
    )
    os.makedirs("./data", exist_ok=True)  # the log files below live in ./data
    while True:  # retry until real data arrives; a loop, so retries cannot hit the recursion limit
        time.sleep(0.5)  # you don't really need this line for the multithreads we are about to use
        random_agent = user_agents[randint(0, len(user_agents) - 1)]  # choose a user agent from the user agent list above
        tunnel = randint(1, 10000)  # generate a tunnel
        header = {"Proxy-Tunnel": str(tunnel), "User-Agent": random_agent}
        try:
            content = requests.get(url, timeout=100, headers=header, proxies=proxy)
        except requests.RequestException as e:
            # I have absolutely no knowledge about Request Exception Handling
            # so I chose to write the error information to a log file
            print(type(e).__name__ + " has occurred")
            with open("./data/exp.csv", 'a', encoding="utf-8") as log:
                log.write(str(datetime.datetime.now()) + "," + str(type(e).__name__) + "," + str(url) + "," + str(path) + "\n")
            continue  # try the same url again
        if content.text in error_bodies:
            # log the server error, then try the same url again
            with open("./data/serverError.csv", 'a', encoding="utf-8") as log:
                log.write(str(datetime.datetime.now()) + "," + str(url) + "," + str(path) + "\n")
            print("\n" + content.content.decode())
            continue
        # write csv file;
        with open(path, 'wb') as outfile:
            outfile.write(content.content)
        print("download finished")
        time.sleep(0.5)
        return

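To build a good commenting habit and exercise my barely-alive English at the same time, the code ended up with a fairly high density of English comments. Masters of English and of coding conventions are welcome to criticize my comments and code!

The only thing in download_url that needs further explanation is the simple screening of the returned content. Countless rounds of manual inspection showed that the API sometimes returns an error message instead of data. For reasons I still have not figured out, I hit these error messages very often while downloading, so the function checks the response content directly. If the content is an error message, the URL is downloaded again and the event is recorded in a log file (./data/serverError.csv). If you do not need the log, you can delete that code without harming the functionality.

Note, however, that this does not guarantee flawless data once the download completes. Responses consisting of the line Too Many Requests.\n were never recognized correctly, whatever I tried. A possibly better idea is to check whether the resulting csv file has a header, but I did not implement that here.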
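For what it is worth, the header check mentioned above could look roughly like this. It is a sketch I did not run against the real API: `has_csv_header` is a hypothetical helper, and the assumption that a valid file starts with a column named Classification should be checked against one of your own good downloads. The sample rows in the demo are made up.

```python
import csv
import os
import tempfile


def has_csv_header(path, expected_first_column="Classification"):
    """Return True if the first row of the file looks like a CSV header.

    expected_first_column is an assumption about the export format;
    pass the first column name of a file you know to be good.
    """
    try:
        with open(path, "r", encoding="utf-8-sig", newline="") as f:
            first_row = next(csv.reader(f), None)
    except (OSError, UnicodeDecodeError):
        return False
    # an error message is a single field; a real header has many columns
    if not first_row or len(first_row) < 2:
        return False
    return first_row[0].strip() == expected_first_column


# quick demo on two throwaway files with made-up content
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    bad = os.path.join(tmp, "bad.csv")
    with open(good, "w", encoding="utf-8") as f:
        f.write("Classification,Year,Period\nHS,2020,202006\n")
    with open(bad, "w", encoding="utf-8") as f:
        f.write("Too Many Requests.\n")
    print(has_csv_header(good), has_csv_header(bad))
```

Because the check looks at the shape of the first row rather than at exact error strings, it would also catch error messages that have not been seen before.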

Part 2 The Main Function

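Before we begin, let me say that you will not learn any practical software engineering from this article; you will realize this the moment you see the Main function, or rather the instant you first come into contact with my code. The structure of the code here is simply so, very, un-object-oriented. My aim was a quick implementation on top of what the blogger 王蛋糕cake generously shared, so I kept the original structure.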

def main(start, end):  # start and end should be 4 digits integers; note: [start, end]
    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }
    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }
    # create data dir under cwd (current working directory);
    # exist_ok keeps several threads from tripping over each other here
    os.makedirs("./data", exist_ok=True)
    # get countries' name and code;
    if not os.path.exists("./reporterAreas.json"):
        download_url("https://comtrade.un.org/Data/cache/reporterAreas.json", "./reporterAreas.json", proxies)
    with open("./reporterAreas.json", "r", encoding="utf_8_sig") as file:
        code_sheet = json.load(file)
    results = code_sheet.get("results")
    _id = []
    name = []
    for country in results:
        _id.append(country.get("id"))
        name.append(country.get("text"))
    # strip the first element for it's "all"
    _id = _id[1:]
    name = name[1:]
    # NOTE: the URLs below request annual (freq=A) TOTAL data; for the monthly AG2
    # requirement stated above, use freq=M, cc=AG2 and a YYYYMM value for ps
    for YYYY in range(start, end + 1):
        # make dir for each year
        os.makedirs("./data/" + str(YYYY), exist_ok=True)
        for idx in range(0, len(_id)):  # traverse through countries
            # imports (rg=1) with this country as partner
            url = "http://comtrade.un.org/api/get?max=100000&type=C&freq=A&px=HS" + "&ps=" + str(YYYY) + "&r=all&p=" + str(_id[idx]) + "&rg=1&cc=TOTAL&fmt=csv"
            path = "./data/" + str(YYYY) + "/imports_from_" + str(name[idx]) + ".csv"
            if not (os.path.exists(path) and os.path.getsize(path) != 0):
                print("downloading from " + url + " to " + path)
                download_url(url, path, proxies)
            # exports (rg=2) with this country as partner
            url = "http://comtrade.un.org/api/get?max=100000&type=C&freq=A&px=HS" + "&ps=" + str(YYYY) + "&r=all&p=" + str(_id[idx]) + "&rg=2&cc=TOTAL&fmt=csv"
            path = "./data/" + str(YYYY) + "/exports_to_" + str(name[idx]) + ".csv"
            if not (os.path.exists(path) and os.path.getsize(path) != 0):
                print("downloading from " + url + " to " + path)
                download_url(url, path, proxies)

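main takes the start year and the end year as parameters; note that the interval is closed on both ends.

Only some time later did I look into reading arguments passed in from the system; you can of course implement that feature to automate the job more elegantly.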
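One way to take the years from the command line is the standard library's argparse. A minimal sketch, where `parse_years` is a hypothetical helper and not part of the original script:

```python
import argparse


def parse_years(argv=None):
    """Parse the start and end year (both inclusive) from the command line."""
    parser = argparse.ArgumentParser(
        description="Download UN Comtrade data for a range of years.")
    parser.add_argument("start", type=int, help="first year, inclusive")
    parser.add_argument("end", type=int, help="last year, inclusive")
    args = parser.parse_args(argv)
    if args.start > args.end:
        parser.error("start year must not be later than end year")
    return args.start, args.end


# passing a list stands in for what would normally come from sys.argv
start, end = parse_years(["2019", "2021"])
print(start, end)
```

In a real script you would call parse_years() without arguments and run it as, for example, python download.py 2019 2021 (the file name is just an example).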

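Next, the UN's Nation Code JSON file is downloaded into the current working directory, if it has not been downloaded already.

Then comes assembling the URLs, each of which is passed to the download_url function we have just finished.

Note that data which has already been downloaded (a non-empty file with the same name already exists) is not downloaded again. Knowing this makes data cleaning easier: after deleting bad files, you can simply rerun the script to re-download the deleted files, without wasting time re-downloading the files that already exist.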
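The clean-up pass before such a rerun can be sketched as follows. `remove_failed_files` is a hypothetical helper, the markers are the error messages that download_url already looks for, and the two log file names are the ones download_url writes to.

```python
import os

# fragments of the error messages that download_url checks for
ERROR_MARKERS = (b"502 Bad Gateway", b"Too Many Requests.",
                 b'{"Message":"An error has occurred."}')
# log files written by download_url; they are not data and must be kept
LOG_FILES = {"serverError.csv", "exp.csv"}


def remove_failed_files(data_dir):
    """Delete csv files that are empty or contain a known error message.

    Returns the list of removed paths so the caller can report them.
    """
    removed = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if not name.endswith(".csv") or name in LOG_FILES:
                continue
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                head = f.read(512)  # error messages are short
            if not head.strip() or any(m in head for m in ERROR_MARKERS):
                os.remove(path)
                removed.append(path)
    return removed


if os.path.isdir("./data"):
    print(remove_failed_files("./data"))
```

After this pass, rerunning the download script fetches only the files that were just removed.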

Part 3 Multithreading

class DownloadData(threading.Thread):
    def __init__(self, begin, terminate):
        super().__init__()
        self.begin = begin
        self.end = terminate

    def run(self):
        # each thread simply runs main over its own range of years
        main(self.begin, self.end)

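The new DownloadData class subclasses threading.Thread so that data can be downloaded on several threads at once. To be honest, when I completed this project I knew nothing about multithreaded programming (apart, of course, from that odd yet rather zen Java course lab about locking and unlocking a warehouse), so multithreading this way may well make some problems show up more often (unverified). For me, though, the end result was a real improvement in data retrieval speed.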
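An alternative I did not use at the time is the standard library's thread pool, which saves writing a Thread subclass. A minimal sketch: `download_years` and `fake_main` are hypothetical names, and `fake_main` merely stands in for the article's main(start, end) so that the example runs without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor


def download_years(years, worker, max_workers=4):
    """Run worker(year, year) for every year on a small thread pool.

    Returns a dict mapping each year to whatever the worker returned;
    an exception raised inside a worker is re-raised here.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {year: pool.submit(worker, year, year) for year in years}
        return {year: future.result() for year, future in futures.items()}


def fake_main(start, end):
    """Stand-in for main(start, end); it only reports what it was asked to do."""
    return str(start) + "-" + str(end) + " done"


results = download_years(range(2019, 2022), fake_main)
print(results)
```

With the real main passed as worker, max_workers caps how many years are downloaded at the same time, which is a convenient knob when the proxy or the API starts refusing requests.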

Part 4 The Last few lines

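print("note: [start time, end time]")
start = int(input("please specify the start year: "))
end = int(input("...and the end year: "))

threads = []
for year in range(start, end + 1):
    # one thread per year, so the threads do not repeat each other's work
    thread = DownloadData(year, year)
    thread.start()
    threads.append(thread)

# wait for every download thread to finish
for thread in threads:
    thread.join()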