Happy National Day, everyone! I wrote a crawler as a gift for you all.
First, thanks to these experts 🍭
Scraping Arknights character art before New Year
An Arknights operator art crawler
Building a random image API with a CloudFlare Worker
I stumbled across an expert's blog and found this Arknights character-art crawler interesting. I wanted to copy the homework directly, but the code no longer worked, so I fixed it up myself and am posting it here for the record.
From what I can tell, the part that scrapes the image links is what broke. The original post described it like this: "it gives me /images/thumb/6/65/%E7%AB%8B%E7%BB%98_%E5%87%AF%E5%B0%94%E5%B8%8C_2.png, but what I want is http://prts.wiki/images/6/65/立绘_凯尔希_2.png!"
The site has since been updated: the link it now gives is https://prts.wiki/w/文件:立绘_12F_1.png, while the one actually needed is https://media.prts.wiki/6/61/立绘_12F_1.png.
The two don't match up at all now 🌚
After studying it for a bit, it looked doable.
Here's the code:
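The key observation: the search results still embed thumbnail URLs under media.prts.wiki/thumb/..., and stripping the /thumb segment (plus the trailing size suffix) yields the full-size original. A minimal sketch of that mapping, using an example thumbnail URL:

```python
# Example thumbnail URL as it appears in the search results; stripping
# "/thumb" and everything after the first ".png" gives the original.
thumb = "https://media.prts.wiki/thumb/6/61/立绘_12F_1.png/180px-立绘_12F_1.png"

prefix = "https://media.prts.wiki/thumb/"
end = thumb.find(".png", len(prefix)) + len(".png")
original = "https://media.prts.wiki/" + thumb[len(prefix):end]

print(original)  # https://media.prts.wiki/6/61/立绘_12F_1.png
```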
```python
import os
import time

import requests
from bs4 import BeautifulSoup

# Search page listing all "立绘" (character art) files, 500 results per page
url = "http://prts.wiki/index.php?title=%E7%89%B9%E6%AE%8A:%E6%90%9C%E7%B4%A2&limit=500&offset=0&profile=images&search=%E7%AB%8B%E7%BB%98"
headers = {
    "Cookie": "arccount62298=c; arccount62019=c",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66",
}

html = requests.get(url, headers=headers)
html.encoding = html.apparent_encoding
soup = BeautifulSoup(html.text, "html.parser")
results = soup.find_all(class_="searchResultImage")

os.makedirs("./Arknights", exist_ok=True)
os.chdir("./Arknights")

# A few files that shouldn't be saved (duplicates / special art)
EXCLUDED = ['精二立绘A.png', 'Pith_2.png', 'Sharp_2.png',
            'Stormeye_2.png', 'Touch_2.png', '阿米娅(近卫)_2.png']

num = 0
for s in results:
    string = str(s)
    # Pull the file name out of the title="文件:立绘_xxx.png" attribute
    namebegin = string.find('title="文件')
    nameend = string[namebegin:].find('png')
    name1 = string[namebegin + 13:namebegin + nameend - 1]
    # Skip V1/V2 variant art and thumbnail entries ending in 'b'
    if string[namebegin + nameend - 3:namebegin + nameend - 1] in ['V1', 'V2']:
        continue
    if string[namebegin + nameend - 2:namebegin + nameend - 1] == 'b':
        continue
    name1 = (name1 + '.png').replace(" ", "_")
    if name1 in EXCLUDED:
        continue
    # The result embeds a thumbnail URL; dropping its /thumb segment
    # gives the full-size original on media.prts.wiki
    urlbegin = string.find('https://media.prts.wiki/thumb/')
    if urlbegin == -1:
        continue
    urlend = string.find('.png', urlbegin)
    imgurl = 'https://media.prts.wiki/' + string[urlbegin + 30:urlend + 4]
    img = requests.get(imgurl, headers=headers).content
    with open(name1, 'wb') as f:
        f.write(img)
    num += 1
    print("已爬取{}张,图片名称为:{},链接为:{}".format(num, name1, imgurl))
    time.sleep(1)
```
The code here is adapted from https://www.heart-of-engine.top/posts/fccf.html with a few small changes, and it runs successfully 🌝
Later, since I wanted to turn it into a random image API, I built one with a CloudFlare Worker.
Straight to the tutorial:
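As an aside, the string slicing above is fragile if the page markup shifts again. A more robust sketch would read the attributes BeautifulSoup has already parsed; this assumes each searchResultImage entry contains an `<a title="文件:...">` link and an `<img>` thumbnail (the sample HTML below is a hand-written stand-in for a real search result):

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one search result entry
html = '''
<li class="searchResultImage">
  <a href="/w/文件:立绘_12F_1.png" title="文件:立绘_12F_1.png"></a>
  <img src="https://media.prts.wiki/thumb/6/61/立绘_12F_1.png/180px-立绘_12F_1.png">
</li>
'''

soup = BeautifulSoup(html, "html.parser")
for entry in soup.find_all(class_="searchResultImage"):
    # File name from the parsed title attribute, not raw string offsets
    name = entry.a["title"].removeprefix("文件:").replace(" ", "_")
    thumb = entry.img["src"]
    # Strip the /thumb segment and the trailing size suffix
    full = thumb.replace("/thumb/", "/", 1)
    full = full[:full.find(".png") + 4]
    print(name, full)
```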
Converting images to WebP
This step is optional. After converting to WebP, file size shrinks while resolution stays the same, so images load faster.
To make referencing easier, first number the files 1, 2, 3… in order, then use the PIL library to convert the jpg and png images to WebP.
```python
import os

from PIL import Image


def rename_and_convert_images(directory):
    files = os.listdir(directory)
    png_files = [f for f in files if f.lower().endswith('.png')]
    webp_files = [f for f in files if f.lower().endswith('.webp')]

    # Continue numbering after the largest existing webp index
    max_index = 0
    for f in webp_files:
        try:
            index = int(os.path.splitext(f)[0])
            if index > max_index:
                max_index = index
        except ValueError:
            continue

    # Rename each png to the next sequential number and convert it to webp
    for index, filename in enumerate(png_files, start=max_index + 1):
        new_filename = f"{index}.webp"
        old_filepath = os.path.join(directory, filename)
        new_filepath = os.path.join(directory, new_filename)
        with Image.open(old_filepath) as img:
            img.save(new_filepath, 'webp')
        os.remove(old_filepath)
        print(f"Converted {filename} to {new_filename}")


directory = './Arknights'
rename_and_convert_images(directory)
```
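If you want to see how much the conversion saves, you can encode the same image in both formats in memory and compare byte counts. A quick sketch (the synthetic test image and quality value are arbitrary; savings vary per image):

```python
from io import BytesIO

from PIL import Image

# Synthetic test image; real character art will show different numbers
img = Image.new("RGB", (512, 512), (200, 120, 80))

png_buf, webp_buf = BytesIO(), BytesIO()
img.save(png_buf, "png")
img.save(webp_buf, "webp", quality=80)  # quality is lossy WebP's main knob

# Buffer position after save() equals the encoded size in bytes
print(png_buf.tell(), webp_buf.tell())
```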
Uploading the files to GitHub
I won't go into detail on this step; the images are then referenced through jsDelivr.
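jsDelivr serves any public GitHub repo at a predictable path, https://gcore.jsdelivr.net/gh/&lt;user&gt;/&lt;repo&gt;/&lt;path-in-repo&gt;, which is what makes the numbered filenames convenient. A minimal sketch of building such a URL (user/repo values taken from the worker code in this post):

```python
user, repo = "SukiEva", "assets"  # repo names as used in the worker code
path = "webp/pc/1.webp"           # numbered file produced by the conversion step

url = f"https://gcore.jsdelivr.net/gh/{user}/{repo}/{path}"
print(url)
```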
CloudFlare Worker
Create a new worker; reference code below:
- Since the files were renamed to sequential numbers, you only need to change the image count total below to generate a random index.
- I use the type parameter to distinguish wide-screen and portrait images; adjust to your needs.
```javascript
addEventListener('fetch', event => {
  event.respondWith(
    handleRequest(event.request).catch((err) =>
      new Response('cfworker error:\n' + err.stack, {
        status: 502,
      })
    )
  );
});

async function handleRequest(request) {
  const url = new URL(request.url);
  switch (url.pathname) {
    case "/img":
      var type = url.searchParams.has("type") ? url.searchParams.get("type") : "pc";
      const total = getTotal(type);
      // Fall back to wide-screen images if the requested type has none
      if (total == 0) return handleImage("pc", getTotal("pc"));
      return handleImage(type, total);
    case "/favicon.ico":
      return fetch("https://gcore.jsdelivr.net/gh/SukiEva/assets/blog/favicon.ico");
    default:
      return handleNotFound();
  }
}

function getTotal(type) {
  switch (type) {
    case "pc": return 175; // number of wide-screen images
    case "mb": return 0;   // number of portrait images
    default: return 0;
  }
}

async function handleImage(type, total) {
  // Pick a random index in [1, total] and proxy the matching webp
  var index = Math.floor(Math.random() * total) + 1;
  var img = "https://gcore.jsdelivr.net/gh/SukiEva/assets/webp/" + type + "/" + index + ".webp";
  const res = await fetch(img);
  return new Response(res.body, {
    headers: {
      'content-type': 'image/webp',
    },
  });
}

function handleNotFound() {
  return new Response('Not Found', {
    status: 404,
  });
}
```
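Once deployed, a GET to /img returns a random wide-screen image and /img?type=mb a random portrait one (using your worker's domain). The index logic is just a uniform pick over 1..total; sketched in Python for clarity:

```python
import random

def pick_index(total: int) -> int:
    # Mirrors the worker's Math.floor(Math.random() * total) + 1,
    # i.e. a uniform random index in [1, total]
    return random.randrange(total) + 1

print(pick_index(175))
```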
Custom domain
By default the worker gets a CF-provided domain, but that is currently blocked in mainland China, so you'll need to add your own domain route.
Finally, once more, to everyone reading this post:
Happy National Day 2024!
Crawling and crawling away in a tiny little burrow (flees