AIOHTTP - asynchronous HTTP client/server for python

aiohttp 是基於 asyncio 模組所開發的非同步 HTTP client 與 server package，目前大多用於大量併發的爬蟲程式。

IO 與執行運算相比是非常緩慢的，當需要爬取大量內容時，如果單純使用 for loop 完成一筆 request 再發下一筆會非常浪費時間，因此可以利用非同步的方式一次發出大量 request，當 Server 回覆後再進行資料的處理。不過利用這種方式也要考慮接收的 Server 是否能夠一次負擔這麼大量的 request，因此仍須謹慎設定併發的數量。

一般使用 request 的爬蟲，執行時間約 20 秒

def main():
    url = "http://httpbin.org/headers"
    times = 50
    for _ in range(times):
        requests.get(url)

使用 aiohttp 的爬蟲，執行時間約 11 秒

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    url = "http://httpbin.org/headers"
    times = 50
    async with aiohttp.ClientSession() as session:
        for _ in range(times):
            await fetch(session, url)

搭配 asyncio.Queue 的 aiohttp 爬蟲，執行時間約 1 秒

results = []

async def get(session, queue):
    while True:
        try:
            url = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        resp = await session.get(url)
        results.append(await resp.text(encoding="utf-8"))

async def main():
    url = "http://httpbin.org/headers"
    times = 50
    async with aiohttp.ClientSession() as session:
        queue = asyncio.Queue()
        for _ in range(times):
            queue.put_nowait(url)
        tasks = []
        for _ in range(int(times / 5)): # 切分成 10 (50/5) 個 coroutine
            task = get(session, queue)
            tasks.append(task)
        await asyncio.wait(tasks)

搭配 asyncio.Queue 的版本仍就是只有一個 process 與一個 thread，利用非同步的特性從最簡單的 requests 版本加速了 20 倍。透過建立 10 個 coroutine 發 request 並交給 asyncio.wait 調度，當 coroutine 遇到 IO 的時候就會切換到另外一個 coroutine 發 request，而 queue 中的 url 會不停被取用到空為止。

Reference:

AIOHTTP - asynchronous HTTP client/server for python

Comments