1. Background
Recently, I found an interesting website (https://sc.chinaz.com/yinxiao/index.html) that hosts free resume templates and sound-effect resources, which are very useful in my work. It would be much more convenient to collect these resources automatically instead of downloading each one manually.
To solve this problem, a few Python libraries immediately came to mind, such as requests and lxml. The logic behind a web crawler is to parse the HTML from each request's response and ultimately save the target data into local files. The website I just mentioned has no anti-crawler layer such as URL encryption or bans on frequent requests, which makes it a very handy target for practicing web crawler techniques.
This post will demonstrate what web crawler techniques can bring us. More importantly, I will show a second piece of code to explain how an asynchronous task architecture can speed up the crawling job.
2. Source Code
2.1 Synchronous-style web crawler
import os

import requests
from lxml import etree

# source
# https://sc.chinaz.com/jianli/free.html


def start():
    print('Engine Starts...')
    print('get list:')
    urls = get_list()
    total_count = len(urls)
    print(f'Total resumes: {total_count}')
    failure_count = 0
    for url_dict in urls:
        title = list(url_dict.keys())[0]
        url = url_dict[title]
        try:
            download(url, title)
        except Exception as e:
            failure_count += 1
            print(f'There is an error on {title} and the url is {url}')
            print(e)
    print(f'Successfully downloaded {total_count - failure_count}, failed to download {failure_count}')


def get_list():
    total_pages = 866
    url_list = []
    for i in range(1, total_pages):
        if i == 1:
            url = 'https://sc.chinaz.com/jianli/free.html'
        else:
            url = f'https://sc.chinaz.com/jianli/free_{i}.html'
        res = requests.get(url)
        res.encoding = 'utf-8'
        html = res.text
        selector = etree.HTML(html)
        # XPath is a major element in the XSLT standard.
        # With XPath knowledge you will be able to take great advantage of XSL.
        main = selector.xpath('//*[@id="main"]')
        title = main[0].xpath('.//a/img/@alt')   # template names
        href = main[0].xpath('.//a/@href')       # detail-page links (two per template)
        for ind, v in enumerate(title):
            if href[ind * 2].startswith('//'):
                url = 'https:' + href[ind * 2]
            else:
                url = href[ind * 2]
            tmp_dict = {v: url}
            print(tmp_dict)
            print('---')
            url_list.append(tmp_dict)
    return url_list


def download(url, title):
    res = requests.get(url)
    res.encoding = 'utf-8'
    html = res.text
    selector = etree.HTML(html)
    # the first link in the "clearfix" list points to the actual archive
    href = selector.xpath('//*[@class="clearfix"]/li/a/@href')[0]
    r = requests.get(href)
    os.makedirs('resources', exist_ok=True)  # make sure the output directory exists
    download_path = f'resources/{str(title)}.rar'
    with open(download_path, "wb") as f:
        f.write(r.content)
    print(f'{str(title)} downloaded successfully')


if __name__ == '__main__':
    start()
2.2 Web crawler with asynchronous tasks
import asyncio
import os
import time

import aiohttp
import requests
from lxml import etree


async def parser(html):
    selector = etree.HTML(html)
    audio_list = selector.xpath('//*[@id="AudioList"]')
    audio = audio_list[0].xpath('.//*[@class="audio-item"]')
    download_tasks = []
    meta = []
    for a in audio:
        props = a.xpath('.//p/node()')
        name = props[0].replace('\r', '').replace('\n', '').replace(' ', '')
        duration = props[1]
        save_name = f'{name}({duration})'
        sound_url = a.xpath('.//audio/@src')[0]
        sound_url = f'https:{sound_url}' if sound_url.startswith('//') else sound_url
        download_tasks.append(asyncio.create_task(download(sound_url, save_name)))
        meta.append((save_name, sound_url))
    # wait for every download spawned from this page and report any failures
    results = await asyncio.gather(*download_tasks, return_exceptions=True)
    for (save_name, sound_url), result in zip(meta, results):
        if isinstance(result, Exception):
            print(f'Download failed {save_name},{sound_url}--「{result}」')


async def get(session, queue):
    while True:
        try:
            page = queue.get_nowait()
        except asyncio.QueueEmpty:
            # no more pages left for this worker
            return
        if page == 1:
            url = 'https://sc.chinaz.com/yinxiao/index.html'
        else:
            url = f'https://sc.chinaz.com/yinxiao/index_{page}.html'
        resp = await session.get(url)
        await parser(await resp.text(encoding='utf-8'))


async def download(url, save_name):
    # note: requests is a blocking client, so this call holds up the event loop;
    # a fully non-blocking version would fetch the file through aiohttp as well
    r = requests.get(url)
    if 'sound_effects' not in os.listdir('.'):
        os.mkdir('sound_effects')
    download_path = f'sound_effects/{str(save_name)}.mp3'
    with open(download_path, "wb") as f:
        f.write(r.content)
    print(f'{save_name},{url} downloaded successfully')


async def main():
    async with aiohttp.ClientSession() as session:
        queue = asyncio.Queue()
        # 1000
        for page in range(1, 500):
            queue.put_nowait(page)
        # ten workers consume pages from the queue concurrently
        tasks = [asyncio.create_task(get(session, queue)) for _ in range(10)]
        await asyncio.gather(*tasks)


start = time.time()
asyncio.run(main())
end = time.time()
diff = end - start
print(f'start from {start} to {end}, it takes {diff / 60} mins in total')
With the above two scripts, you can basically collect all the free resume templates and sound-effect resources from Chinaz. The synchronous job takes quite a bit longer because the program sits idle waiting for each network response before it starts the next request. That is why I use an asynchronous strategy to download the sound effects: you will see an incredible performance improvement once asynchronous tasks are added in between.
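If you want to measure the difference yourself, a rough way to do it (a minimal sketch that simply reuses the start() entry point from section 2.1 and the same kind of timer the async script already prints) looks like this:

import time

start_ts = time.time()
start()  # synchronous crawler entry point from section 2.1
end_ts = time.time()
print(f'synchronous run took {(end_ts - start_ts) / 60:.1f} mins')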
3. AsyncIO
asyncio is a library for writing concurrent code using the async/await syntax. The module was added to the Python standard library in version 3.4, and the async def and await statements were officially added to the language in version 3.5. If you cannot run my example source code, please check that your Python version is recent enough and that the libraries are installed properly.
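As a quick sanity check before running the examples, a minimal sketch like the one below verifies the interpreter version (asyncio.create_task and asyncio.run, used in the async script above, require Python 3.7 or later) and reminds you which third-party packages need to be installed:

import sys

# async def/await need Python 3.5+; asyncio.create_task and asyncio.run need 3.7+
assert sys.version_info >= (3, 7), 'Python 3.7 or newer is required for the async example'

# third-party dependencies used in this post:
#   pip install requests lxml aiohttp
print('Python version OK:', sys.version)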
Asyncio is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc.
The most important concept behind asynchronous tasks is the use of coroutines in the core execution. Coroutines are a more generalized form of subroutines: subroutines are entered at one point and exited at another, while coroutines can be entered, exited, and resumed at many different points. In Python they are defined with the async def statement.
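To make that concrete, here is a minimal, self-contained sketch (the fetch_page name and the one-second sleep are just illustrative placeholders, not part of the crawler above) showing coroutines suspending at each await point and running concurrently under the event loop:

import asyncio
import time

async def fetch_page(page):
    # each await is a point where the coroutine suspends and hands control
    # back to the event loop so other coroutines can make progress
    await asyncio.sleep(1)  # stand-in for waiting on network I/O
    return page

async def main():
    started = time.time()
    # three coroutines run concurrently, so this finishes in about 1s, not 3s
    results = await asyncio.gather(*(fetch_page(p) for p in range(3)))
    print(results, f'took {time.time() - started:.1f}s')

asyncio.run(main())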
4. Conclusion
To build a faster web crawler application, there are three main strategies: multiprocessing, multithreading, and asyncio.
In short, we can summarize the best practice for each of them below (a short multithreading sketch follows the list):
- CPU Bound => Multi-Processing
- I/O Bound, Fast I/O, Limited Number of Connections => Multi-Threading
- I/O Bound, Slow I/O, Many connections => AsyncIO
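For instance, the resume downloader in section 2.1 is I/O bound with a limited number of connections, so it also maps naturally onto multithreading. Below is a minimal sketch, assuming the get_list() and download() functions from section 2.1 are available in the same module; it illustrates the thread-pool approach rather than the method used earlier in this post:

from concurrent.futures import ThreadPoolExecutor, as_completed

def threaded_crawl(max_workers=10):
    urls = get_list()  # reuse the page-listing helper from section 2.1 (assumed importable)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit one download job per resume template
        futures = {
            pool.submit(download, url, title): title
            for url_dict in urls
            for title, url in url_dict.items()
        }
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as e:
                print(f'failed to download {futures[future]}: {e}')

Since every worker thread spends most of its time blocked on network I/O, the thread pool keeps several downloads in flight at once without any change to download() itself.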
Let me know how this post helps you understand web crawler techniques. Please feel free to leave a comment with your ideas below and share the link with any friends who are stuck learning web crawler techniques.