Scraping the motions GIF stickers with Python

Background

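A while ago I saw someone in a WeChat group post stickers like this one:

I liked them, so I asked whether they had any more (I am a serial image thief).
They gave me the sticker pack's official site, and after seeing a whole screen of goofy stickers I wanted to scrape them all for my own use.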

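Writing the crawler

After clicking YES!!!!! on the site you can see every sticker. Right-click any image and copy its URL to get the image link:
http://motions.cat/gif/nhn/0106.gif
Then right-click a neighbouring image and copy its link:
http://motions.cat/gif/nhn/0105.gif
Comparing the two links shows that the GIFs are stored under sequentially numbered paths, and that each number is left-padded with zeros to four digits.

So we can increment a counter and use string formatting to build each path. A quick test in python3: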

$ python3
>>> number = 1
>>> '%04d' % number  # %04d: format an integer at width 4, padding on the left with 0
'0001'
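
As a small sketch of the same idea, the helper below turns a counter into a full image URL. The name build_url is made up for illustration, and the f-string format spec 04d is just another way of writing '%04d' % number.

```python
BASE_URL = 'http://motions.cat/gif/nhn/'

def build_url(number):
    """Return the GIF URL for an image number, zero-padded to 4 digits."""
    return f'{BASE_URL}{number:04d}.gif'

# Print the first three URLs to check the pattern by eye.
for n in range(1, 4):
    print(build_url(n))
```

Calling build_url(105) gives back the second link copied from the site above.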

Now for the Python code.
First define the URL prefix, which never changes, then the starting number and the last number:

url = 'http://motions.cat/gif/nhn/'
start_number = 1
max_number = 139

Next comes an infinite loop: start_number is incremented after each successful download, and the loop exits once start_number exceeds max_number. The crawling code goes inside the loop:

while True:
    if start_number > max_number:
        break

    try:
        filename = ('%04d' % start_number) + '.gif'
        requestURL = url + filename
        print('Requesting:', requestURL)
        r = requests.get(requestURL)
        if r.status_code != 200:
            # start_number is unchanged here, so the same file is requested again
            print(requestURL + ', Request fail, status code: ', r.status_code)
            continue
        # Output directory; create it if it does not exist
        path = './motions/'
        if not os.path.exists(path):
            os.mkdir(path)

        # Write the binary response body to a GIF file
        with open(path + filename, 'wb') as f:
            f.write(r.content)
        print('Saved:', filename)

        start_number += 1
    except Exception as e:
        # An error also leaves start_number unchanged, so the file is retried
        print('Error: ', e)

Complete code:

import requests
import os

def main():
    url = 'http://motions.cat/gif/nhn/'
    start_number = 1
    max_number = 139

    while True:
        if start_number > max_number:
            break

        try:
            filename = ('%04d' % start_number) + '.gif'
            requestURL = url + filename
            print('Requesting:', requestURL)

            r = requests.get(requestURL)
            if r.status_code != 200:
                # start_number is unchanged here, so the same file is requested again
                print(requestURL + ', Request fail, status code: ', r.status_code)
                continue

            # Output directory; create it if it does not exist
            path = './motions/'
            if not os.path.exists(path):
                os.mkdir(path)

            # Write the binary response body to a GIF file
            with open(path + filename, 'wb') as f:
                f.write(r.content)
            print('Saved:', filename)

            start_number += 1
        except Exception as e:
            # An error also leaves start_number unchanged, so the file is retried
            print('Error: ', e)

if '__main__' == __name__:
    main()

Output:
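
One thing the loop above does not guard against is a file that never comes back with status 200: it would be requested forever. The sketch below is one possible variation, not part of the original script. It gives up after a fixed number of attempts and sets a timeout. The function name download_gif and the parameters max_attempts and timeout are my own. The session argument is assumed to be a requests.Session() (anything with a compatible get method works), and the code relies on requests.RequestException being a subclass of OSError.

```python
import os

BASE_URL = 'http://motions.cat/gif/nhn/'

def download_gif(session, number, folder='./motions/', max_attempts=3, timeout=10):
    """Download one GIF; return True on success, False once max_attempts is used up."""
    filename = '%04d.gif' % number
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    for attempt in range(1, max_attempts + 1):
        try:
            r = session.get(BASE_URL + filename, timeout=timeout)
            if r.status_code == 200:
                with open(os.path.join(folder, filename), 'wb') as f:
                    f.write(r.content)
                return True
            print(filename, 'attempt', attempt, 'status code:', r.status_code)
        except OSError as e:
            # Connection errors and timeouts raised by requests end up here
            print(filename, 'attempt', attempt, 'error:', e)
    return False
```

With a real session this could be driven as failed = [n for n in range(1, 140) if not download_gif(session, n)], which leaves a list of the numbers worth checking by hand.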

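Because the site is hosted overseas, crawling is slow, so I decided to write a multithreaded version in which several threads take turns downloading to speed things up.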

Writing a multithreaded crawler

Python's standard library provides the threading module for multithreading. To start a thread, create an instance of threading.Thread and call its start() method:

import threading

def methodName(methodArgs1, methodArgs2):
    # threading.current_thread().name gives the name of the current thread
    print("current thread name:", threading.current_thread().name)

# Create a Thread object
# target is the function to run, name labels the thread, args holds the function's arguments
thread = threading.Thread(
    target=methodName,
    name='thread-name',
    args=('value1', 'value2')
)
thread.start()  # start the thread
thread.join()  # block until the thread finishes, then continue in the main thread
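
To see several threads at once, here is a minimal sketch that starts three of them and collects what each one produced. The names work, results and results_lock are made up for illustration. The lock is there so that the shared list is only touched by one thread at a time.

```python
import threading

results = []
results_lock = threading.Lock()

def work(item):
    # Append under the lock so the shared list is updated by one thread at a time
    with results_lock:
        results.append((threading.current_thread().name, item * 2))

threads = [
    threading.Thread(target=work, name='worker-%d' % i, args=(i,))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every worker before reading the results

print(sorted(results))
```

The order in which the threads append is not guaranteed, which is why the list is sorted before printing.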

Complete multithreaded crawler code:

import requests
import queue
import threading
import os

def crawl_motions(taskQueue):
    while True:
        try:
            # get_nowait avoids blocking forever if another thread takes the last task first
            number = taskQueue.get_nowait()
        except queue.Empty:
            break
        print('Downloading sticker ' + str(number), threading.current_thread().name)

        filename = ('%04d' % number) + '.gif'
        url = 'http://motions.cat/gif/nhn/' + filename
        try:
            r = requests.get(url)

            if r.status_code != 200:
                print('Request for ' + url + ' failed', 'Thread name: ', threading.current_thread().name)
                taskQueue.put(number)  # put the task back so it is retried
                continue  # do not save the error response as a GIF

            store_path = './motions-test/'
            # exist_ok avoids an error when two threads create the directory at the same time
            os.makedirs(store_path, exist_ok=True)

            with open(store_path + filename, 'wb') as f:
                f.write(r.content)
                print("Saved " + filename, 'Thread name:', threading.current_thread().name)
        except Exception as e:
            taskQueue.put(number)  # put the task back so it is retried
            print('Error: ', e, 'Thread name: ', threading.current_thread().name)


def main():
    taskQueue = queue.Queue()
    start_number = 1
    max_number = 139

    # max_number + 1 so that the last GIF is included
    for i in range(start_number, max_number + 1):
        taskQueue.put(i)  # add each GIF number to the task queue

    crawl_thread = []
    for i in range(1, 10):
        # Create the thread
        threading_crawl = threading.Thread(
            target=crawl_motions,
            name='crawl-' + str(i),
            args=(taskQueue,)
        )
        crawl_thread.append(threading_crawl)
        # Start the thread
        threading_crawl.start()

    # Wait for every crawler thread to finish
    for thread in crawl_thread:
        thread.join()


if '__main__' == __name__:
    main()
    print("OK!")

Output:
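
The queue plus hand-made threads can also be swapped for the standard library's concurrent.futures.ThreadPoolExecutor, which manages the worker threads itself. The sketch below is only an alternative outline, not what the script above does. crawl_all is a made-up name, and download stands for any function that takes an image number and returns True when the file was saved.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_all(download, numbers, workers=9):
    """Run download(number) for each number on a thread pool; return the numbers that failed."""
    numbers = list(numbers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs
        outcomes = list(pool.map(download, numbers))
    return [n for n, ok in zip(numbers, outcomes) if not ok]
```

Note that if download raises instead of returning False, the exception surfaces when the results are collected, so the function passed in should catch its own network errors.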

2020-05-14

Charles

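Capturing computer and phone traffic with Charles

About Charles

Charles is an HTTP proxy, HTTP monitor and reverse proxy. It lets developers view the HTTP and HTTPS traffic between their machine and the Internet, including requests, responses and HTTP headers (which carry the cookies and caching information).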

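Installing and using Charles

Download the version for your system from https://www.charlesproxy.com/download/ and install it.

Taking the macOS version as an example:

Download the dmg file (version 4.5.6 in this walkthrough).
Double-click the dmg file, click accept, then drag the application into the Applications folder.
Charles can then be opened from Launchpad.
On launch, Charles shows a dialog asking for permissions. Click Grant Privileges and enter your password to authorize it.
Then tick Proxy - macOS Proxy to enable the system proxy. Visit Baidu in a browser and the HTTP request you just made appears in the left-hand panel.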
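
Traffic from a Python script can be pointed at Charles explicitly as well, which is handy for watching what a crawler sends. The sketch below assumes Charles is listening on 127.0.0.1:8888, its default proxy port; check the proxy settings in Charles if yours differs. The helper name charles_proxies is made up for illustration.

```python
CHARLES_PROXY = 'http://127.0.0.1:8888'  # assumption: Charles on this machine, default port

def charles_proxies(proxy_url=CHARLES_PROXY):
    """Build the proxies mapping that requests accepts for both schemes."""
    return {'http': proxy_url, 'https': proxy_url}

# Usage with requests (needs Charles running, so it is not executed here):
#   import requests
#   r = requests.get('http://motions.cat/gif/nhn/0001.gif', proxies=charles_proxies())
print(charles_proxies())
```

A request sent this way should show up in the Charles session list just like the browser request.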

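Installing the SSL certificate

After the steps above, the entries under the Baidu request show up as unknown. This is because Baidu uses HTTPS, so the traffic is encrypted with SSL. Installing Charles's certificate is enough to make the request details readable.

Choose Help - SSL Proxying - Install Charles Root Certificate
Install certificate
A dialog for adding the certificate pops up; choose Add.
Add certificate
Once it is added, open the Keychain Access system app and search for Charles in the top-right search box to find the certificate you just installed. A newly installed certificate is not trusted.

Double-click the certificate, expand the Trust section, select Always Trust, then close the window and enter your password, as shown:
Trust the certificate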
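
Trusting the certificate in Keychain Access covers browsers, but a Python script using requests checks HTTPS certificates against its own CA bundle by default, so it would still reject the certificate Charles presents. A possible workaround, sketched below, is to export the Charles root certificate as a PEM file and pass its path through the verify parameter. The function name charles_request_kwargs and the file path in the usage comment are hypothetical, and the proxy address assumes the default port 8888.

```python
import os

def charles_request_kwargs(cert_path, proxy_url='http://127.0.0.1:8888'):
    """Return keyword arguments for requests.get that go through Charles and trust its root certificate."""
    if not os.path.isfile(cert_path):
        raise FileNotFoundError('Charles certificate not found: ' + cert_path)
    return {
        'proxies': {'http': proxy_url, 'https': proxy_url},
        'verify': cert_path,  # path to the exported Charles root certificate (PEM)
    }

# Usage with requests (hypothetical path, not executed here):
#   import requests
#   kwargs = charles_request_kwargs('/path/to/charles-root.pem')
#   r = requests.get('https://www.baidu.com', **kwargs)
```

This only helps for hosts whose HTTPS traffic Charles is actually decrypting; other hosts still present their real certificates, which the Charles root alone will not validate.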