Crawling Sina Blog Articles with Python, Part ①

Posted on 2020-09-01 | Category: Programmer's Life (程序人生) | Length: 7.5k characters | Reading time ≈ 7 minutes

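Knowledge is meant to be used. Having read a bit about Python crawlers, I decided to get my hands dirty today and build one. Along the way I came to fully appreciate how powerful and useful regular expressions are. Still, a clumsy person has clumsy methods: besides regex, there are always plain string operations, aren't there?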

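Implementation path

Work out the HTML structure of the content to crawl, and design a matching storage structure
Work out how the HTML, JS and CSS of the target pages fit together, and settle on a feasible technical approach
Implement the crawling and data extraction first, then implement the integration with typecho later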
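Key points and difficulties
I am not fluent enough with regular expressions, and I do not want to spend too much time on them;
Not everything can be found in the HTML source. Firefox's Web Developer Tools -> Network -> XHR and so on are needed to find out where such data comes from. Then inspect the request headers to see which data is sent and to which endpoint
How to download the images from my own posts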
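To make the "regex or plain strings" point concrete, here is a minimal sketch that pulls the link and the title out of one anchor tag in both ways. The sample tag is the one quoted in the comments of ConfigData.py further down; the function names are made up for this illustration and are not part of the original program.

```python
import re

# Sample anchor taken from the list-page HTML quoted in ConfigData.py.
SAMPLE = ('<a href="http://blog.sina.com.cn/s/blog_5922f3300102z36z.html" '
          'target="_blank" title="">完美数和亲和数Python和C</a>')


def href_by_regex(html):
    """Return the first href value using a regular expression."""
    match = re.search(r'href="([^"]+)"', html)
    return match.group(1) if match else None


def href_by_string(html):
    """Return the first href value using only str.find and slicing."""
    marker = 'href="'
    start = html.find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = html.find('"', start)
    return html[start:end] if end != -1 else None


def text_by_string(html):
    """Return the text between the first '>' and the next '<'."""
    start = html.find('>')
    if start == -1:
        return None
    end = html.find('<', start + 1)
    if end == -1:
        return None
    return html[start + 1:end]
```

The string version is longer but needs no pattern syntax at all, which is the trade-off described above. It only works while the markup keeps exactly this shape.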
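Program implementation
Based on the considerations above, the whole program (the crawling part) is split into 3 modules:
ConfigData.py, which holds commonly used settings
Blog.py, which defines the structure of the objects being accessed
getBlogofSina.py, the main program that does the actual work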
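As an illustration of what a "structure of the accessed object" might hold, here is a hypothetical record type. It is not the author's Blog.py; the class name, field names and helper are assumptions. The fields simply mirror what the crawler collects: title, URL, picture flag, publish time, and the comment and read counts.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PostRecord:
    """Hypothetical container for one crawled post (not the real Blog.py)."""
    title: str
    url: str
    published: datetime
    has_pic: bool = False
    comments: int = 0   # filled in later from the separate counts request
    reads: int = 0      # filled in later from the separate counts request

    @classmethod
    def from_list_page(cls, title, url, published_text, has_pic=False):
        # List pages show timestamps such as "2020-08-17 22:16".
        published = datetime.strptime(published_text, '%Y-%m-%d %H:%M')
        return cls(title=title, url=url, published=published, has_pic=has_pic)
```

Keeping the counts as defaulted fields reflects that they arrive from a different request than the rest of the data.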

ConfigData.py

#ConfigData.py
# Fixed settings shared by the crawler
import logging


class ConfigData(object):
    # Base URL of the article list pages.
    # A Sina blog list URL has the form articlelist_<user id>_<d1>_<d2>.html
    #   d1: 0 means all articles; values from 1 upward select a category
    #   d2: page number, starting from 1
    strUrl = 'http://blog.sina.com.cn/s/articlelist_1495462704_0_'
    # User-Agent sent by the crawler.
    # https://useragent.buyaocha.com shows the User-Agent of the current browser
    headers = {
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; '
                       'rv:80.0) Gecko/20100101 Firefox/80.0')
    }
    # Regex for the list-page links, derived from the blog's article list, e.g.
    # http://blog.sina.com.cn/s/articlelist_1495462704_0_2.html
    patternUrlList = r'articlelist[\w]*_0_[\w]*\.html'
    # Patterns for extracting each post's link from a list page.
    # One entry on a list page looks like this:
    #     <div class="articleCell SG_j_linedot1">
    #     <p class="atc_main SG_dot">
    #     <span class="atc_ic_f"></span>
    #     <span class="atc_title">
    #     <a href="http://blog.sina.com.cn/s/blog_5922f3300102z36z.html" target="_blank" title="">完美数和亲和数Python和C</a></span>
    #     <span class="atc_ic_b"><img align="absmiddle" class="SG_icon SG_icon18" height="15" src="http://simg.sinajs.cn/blog7style/images/common/sg_trans.gif" title="此博文包含图片" width="15"/></span>
    #
    #     <p class="atc_info">
    #     <span class="atc_data" id="count_5922f3300102z36z"></span>
    #     <span class="atc_tm SG_txtc">2020-08-17 22:16</span>
    #     <span class="atc_set">
    #     </span>
    #
    #
    #patternALink = r'5922f3300[\w]+.html'
    patternDiv = {'class': 'articleCell SG_j_linedot1'}
    patternUrl = r'http://[\w./]*\.html'
    patternTitle = r'>[\u4e00-\u9fa5\w]+'
    patternPic = r'img'
    patternPublish = r'\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}'
    outPic = 'has image'
    # Patterns for the detail page of a single post
    patternTagDiv = {'class': 'articalTag'}
    patternTag = r"\$tag=.+"   # extracts the tags
    patternCategory = r'\>[\u4e00-\u9fa5\w\s]+\<\/a'
    patternTitleDetail = {'class': 'time SG_txtc'}
    patternPublishDetail = r'\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}'
    patternContent = {'class': 'articalContent'}
    # The comment and read counts come from JS; the API endpoint and its
    # parameters were found in the Network panel of Firefox's Web Developer Tools.
    # The request as first observed:
    # url='http://comet.blog.sina.com.cn/api?maintype=num&uid=5922f330&aids=02z36z,02z2zj,02x881,02vdeu,02v0dv,02uwl0,01fdfa,01e20o,01dzuf,01dqjv,01dmef,01dkrs,01db4f,01d9bi,01d5om,01d5bj,01d2wt,01d2mg,01cx8d,01cx7e,01brlp,01b1ik,01av40,01anza,01a5u8,019zlo,019uv9,019tsy,019rli,015lxq,0156di,0150o8,014z1t,014wcu,014u1r,014rjf,014ovo,014ov5,014ovm,014ov1,014ov6,014o1b,0149fb,0149f4,0137za,0137q6,0137kc,0137ea,0137d9,0136nk&requestId=aritlces_number_6610&fetch=c,r'
    # It turns out that this is enough:
    # url='http://comet.blog.sina.com.cn/api?maintype=num&fetch=c,r&uid=5922f330&aids=02z36z,02z2zj,02x881,02vdeu,02v0dv,02uwl0,01fdfa,01e20o,01dzuf,01dqjv,01dmef,01dkrs,01db4f,01d9bi,01d5om,01d5bj,01d2wt,01d2mg,01cx8d,01cx7e,01brlp,01b1ik,01av40,01anza,01a5u8,019zlo,019uv9,019tsy,019rli,015lxq,0156di,0150o8,014z1t,014wcu,014u1r,014rjf,014ovo,014ov5,014ovm,014ov1,014ov6,014o1b,0149fb,0149f4,0137za,0137q6,0137kc,0137ea,0137d9,0136nk'
    # In other words, the required parameters are maintype, fetch, uid and aids.
    # Here uid = 5922f330; aids must be built dynamically from the last 6 characters of each post ID.
    # Post: http://blog.sina.com.cn/s/blog_5922f3300101e20o.html
    # where 5922f330 is the uid, the purpose of 01 is unknown, and the last six characters 01e20o are the aid.
    # To keep the program flexible, uid could later be built dynamically as well.
    crUrl = 'http://comet.blog.sina.com.cn/api?maintype=num&fetch=c,r&uid=5922f330'
    patternCR = r'{.*}\)'
    # logging format and level
    loggingLevel = logging.DEBUG
    loggingFormat = '%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s'
    # Initial URL (first list page)
    startURL = 'http://blog.sina.com.cn/s/articlelist_1495462704_0_1.html'
    # HTML templates for the output
    htmlStart = '''
<HTML><head><meta charset="utf-8" /><title>Sina Blog Crawl Results</title>
<style type="text/css">
@charset "utf-8";
/* CSS Document */
.tabtop13 {
margin-top: 13px;
}
.tabtop13 td{
background-color:#ffffff;
height:25px;
line-height:150%;
}
.font-center{ text-align:center}
.btbg{background:#e9faff !important;}
.btbg1{background:#f2fbfe !important;}
.btbg2{background:#f3f3f3 !important;}
.biaoti{
font-family: 微软雅黑;
font-size: 36px;
font-weight: bold;
border-bottom:1px dashed #CCCCCC;
color: #255e95;
}
.titfont {
font-family: 微软雅黑;
font-size: 16px;
font-weight: bold;
color: #255e95;
background: url(../images/ico3.gif) no-repeat 15px center;
background-color:#e9faff;
}
.tabtxt2 {
font-family: 微软雅黑;
font-size: 14px;
font-weight: bold;
text-align: right;
padding-right: 10px;
color:#327cd1;
}
.tabtxt3 {
font-family: 微软雅黑;
font-size: 14px;
padding-left: 15px;
color: #000;
margin-top: 10px;
margin-bottom: 10px;
line-height: 20px;
}
</style></head>
<body><table width="100%" border="0" cellspacing="0" cellpadding="0" align="center">
<tr><td align="center" class="biaoti" height="60">
My Sina Blog&nbsp;&nbsp;:&nbsp;&nbsp;
<span style="font-size:32px;color:blue">烟波满目凭栏久</span>
<span style="font-family:Times New Roman;font-size:18px">(http://blog.sina.com.cn/liusanrong)</span>
</td></tr><tr></tr></table><table width="100%" border="0" cellspacing="1" cellpadding="4" bgcolor="#cccccc" class="tabtop13" align="center">
<tr><td width="5%" class="btbg font-center titfont">No.</td>
<td width="20%" class="btbg font-center titfont">Title</td>
<td width="30%" class="btbg font-center titfont">URL</td>
<td width="5%" class="btbg font-center titfont">Has image</td>
<td width="15%" class="btbg font-center titfont">Comments/Reads</td>
<td width="25%"  class="btbg font-center titfont">Published</td>
</tr>
'''
    htmlCont = '''
<tr><td class="btbg2 font-center"> {} </td>
<td class="font-center"> {} </td>
<td class="font-center"><a href="{}">{}</a></td>
<td class="font-center">{}</td>
<td class="font-center">{}/{}</td>
<td class="font-center">{}</td></tr>
'''
    htmlEnd = '</table><br /></body></HTML>'


if __name__ == '__main__':
    pass
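
The URL rules described in the comments of ConfigData.py can be sketched roughly as follows. The two constants repeat the values of strUrl and crUrl so that the snippet runs on its own. The helper names and the regular expression are assumptions made for this illustration, based only on the layout noted above (8 characters of uid, 2 characters of unknown purpose, 6 characters of aid).

```python
import re

# Same values as ConfigData.strUrl and ConfigData.crUrl.
LIST_URL_PREFIX = 'http://blog.sina.com.cn/s/articlelist_1495462704_0_'
CR_URL = 'http://comet.blog.sina.com.cn/api?maintype=num&fetch=c,r&uid=5922f330'

# blog_<uid: 8 chars><2 chars, purpose unknown><aid: 6 chars>.html
BLOG_ID = re.compile(r'blog_(\w{8})(\w{2})(\w{6})\.html')


def list_page_url(page):
    """Build the URL of list page number `page` (pages start at 1)."""
    return '{}{}.html'.format(LIST_URL_PREFIX, page)


def split_blog_url(url):
    """Return (uid, aid) extracted from a post URL."""
    match = BLOG_ID.search(url)
    if match is None:
        raise ValueError('not a Sina blog post URL: ' + url)
    return match.group(1), match.group(3)


def counts_url(post_urls):
    """Build the comment/read count request for the given post URLs."""
    aids = [split_blog_url(url)[1] for url in post_urls]
    return CR_URL + '&aids=' + ','.join(aids)
```

For example, `list_page_url(2)` gives the second list page, and `split_blog_url` applied to the post URL from the comments yields the uid 5922f330 and the aid 01e20o. This sketch does not cover how the response should be parsed.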