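Effective Communication Q&A - [Official] 百战程序员, IT online education and training

Members can post questions here, and the 百战程序员 teachers will answer every one.

Q&A threads that are helpful to everyone are marked "Recommended".
After finishing a lesson, browse the questions other students have asked; it will help you learn more thoroughly.

So far, students have asked a total of 134111 questions.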

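Teacher, when I scrape with multiple threads, the output file ends up incomplete. I then added a lock around the file write, but the output is still incomplete. Could you take a look?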

from threading import Thread,Lock
import requests
from lxml import etree
from fake_useragent import UserAgent
from queue import Queue
class Spider(Thread):
    def __init__(self,url_queue,lock):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.lock = lock

    def run(self):
        while not self.url_queue.empty():
            url = self.url_queue.get()
            print(url)
            headers = {'User-Agent':UserAgent().chrome}
            resp = requests.get(url,headers=headers)
            e = etree.HTML(resp.text)
            contents = [div.xpath('string(.)').strip() for div in e.xpath('//div[@class="content"]')]
            # acquire the lock before writing
            self.lock.acquire()
            with open('qiushi.text', 'a', encoding='utf-8')as f:
                for content in contents:
                    f.write(content+'\n')
            self.lock.release()
if __name__ == '__main__':
    base_url = 'https://www.qiushibaike.com/text/page/{}/'
    lock = Lock()
    url_queue = Queue()
    for num in range(1,14):
        url_queue.put(base_url.format(num))
    for i in range(6):
        spider = Spider(url_queue,lock)
        spider.start()
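
Two things in the code above may be worth checking, though the post alone does not show which one causes the missing lines. First, `url_queue.empty()` followed by `url_queue.get()` is not atomic: another thread can take the last URL in between, and `get()` then blocks forever. Second, the main thread never joins the workers, so anything that inspects the file right after `start()` can see a partial result. The sketch below is one way to arrange this, not the course's solution. `fetch_lines`, `worker`, and `crawl` are hypothetical names, and `fetch_lines` is an offline stand-in for the `requests` and XPath step.

```python
import os
import tempfile
from queue import Queue, Empty
from threading import Thread, Lock


def fetch_lines(url):
    # Hypothetical stand-in for requests.get + XPath parsing:
    # returns three "content" lines per URL so the sketch runs offline.
    return ['{} item {}'.format(url, i) for i in range(3)]


def worker(url_queue, lock, path):
    while True:
        try:
            # get_nowait() never blocks; Empty means no work is left.
            url = url_queue.get_nowait()
        except Empty:
            break
        lines = fetch_lines(url)
        # Hold the lock only while writing; "with" releases it even on errors.
        with lock:
            with open(path, 'a', encoding='utf-8') as f:
                for line in lines:
                    f.write(line + '\n')


def crawl(urls, path, n_threads=6):
    url_queue = Queue()
    for url in urls:
        url_queue.put(url)
    lock = Lock()
    threads = [Thread(target=worker, args=(url_queue, lock, path))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    # Wait for every worker to finish before the file is read.
    for t in threads:
        t.join()


out_path = os.path.join(tempfile.mkdtemp(), 'qiushi.txt')
crawl(['page/{}'.format(n) for n in range(1, 14)], out_path)
with open(out_path, encoding='utf-8') as f:
    line_count = len(f.read().splitlines())
print(line_count)
```

If the file is still short with this structure, the pages themselves may be returning fewer matching `div` elements than expected, which is worth checking by printing `len(contents)` per URL.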

Python Full Series / Stage 16: Python Crawler Development / Anti-Anti-Crawling - Floor 721

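Teacher, I followed the video, but when I run start.py the output only gets as far as "Spider closed (finished)" and the Baidu page content is never printed. What went wrong?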

Python Full Series / Stage 16: Python Crawler Development / Mobile Crawler Development - Floor 722

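Hello teacher, my PyCharm is installed on a Mac, and installing scrapy went fine. I also entered the scrapy project I created, but when I use "scrpay genspider baidu baidu.com" to create the spider, I get "command not found". What is the cause?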

Python Full Series / Stage 16: Python Crawler Development / Mobile Crawler Development - Floor 723

Clicking the terminal produces this result.

But when I click start.sh on its own, the little whale appears. Is Docker Quickstart Terminal just a shortcut to start.sh?

Python Full Series / Stage 16: Python Crawler Development / Dynamic Data Scraping - Floor 724

from threading import Thread
import requests
from lxml import etree
from fake_useragent import UserAgent
from queue import Queue    # queue used to hand URLs to the threads


class Spider(Thread):
    def __init__(self, url_queue):
        Thread.__init__(self)
        self.url_queue = url_queue

    def run(self):
        while not self.url_queue.empty():
            url = self.url_queue.get()
            print(url)
            headers = {'User-Agent': UserAgent().chrome}
            resp = requests.get(url, headers=headers)
            e = etree.HTML(resp.text)
            dates = [div.xpath('string(.)').strip() for div in e.xpath('//div[@class="th200"]/text()')]  # written this way, this line raises an error
            Maxtemps =e.xpath('//div[@class="th140"][1]/text()')   # written this way, the data fetched is not the data from base_url
            Mintemps =e.xpath('//div[@class="th140"][2]/text()')
            Weathers =e.xpath('//div[@class="th140"][3]/text()')
            Windirs = e.xpath('//div[@class="th140"][4]/text()')
            for date, Maxtemp, Mintemp, Weather, Windir in zip(dates, Maxtemps, Mintemps, Weathers, Windirs):
                contents=[date, Maxtemp, Mintemp, Weather, Windir]
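            with open('qixiang.txt', 'a', encoding='utf-8')as f:  # 'a' means append; the file is opened once, before the writes
                for content in contents:
                    f.write(content + '\t''\t''\t''\n')          # Problems: 1. no new content is appended, the current page is added repeatedly; 2. the intended result is all data for one month of a given year: one record per line, fields separated by spaces.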


if __name__ == '__main__':
    base_url = 'http://lishi.tianqi.com/zhengzhou/2011{}.html/'
    url_queue = Queue()
    for num in range(1, 13):
        if num <10:
           url_queue.put(base_url.format("%02d" % num))
        else:
            url_queue.put(base_url.format(num))

    for num in range(3):
        spider = Spider(url_queue)
        spider.start()

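Teacher: following the multithreading lesson, I am trying to scrape weather data from http://lishi.tianqi.com/zhengzhou/, choosing a year and month and fetching the matching data. After running the program I see these problems:

1. The specified pages are not scraped

base_url = 'http://lishi.tianqi.com/zhengzhou/2011{}.html/'

2. The data is not saved correctly

No new content is appended; the current page is written repeatedly. The intended result is all data for one month of a given year: one record per line, fields separated by spaces.

3. Extracting data from the current page raises an error

[div.xpath('string(.)').strip() for div

I have watched the lesson video repeatedly and searched online, but I cannot find the problem. Could you please help me fix the code? Thank you!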
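
Some observations that can be made from the code alone, without access to the site's markup. An XPath ending in `/text()` already returns strings, so calling `.xpath('string(.)')` on each result raises an error. Also, `contents` is reassigned on every pass of the `for` loop, so only the last row survives to the write step, and each of its fields lands on its own line. The sketch below shows one way to build one line per day. `rows_to_lines` is a hypothetical helper, and the sample values are made up to stand in for the XPath results of a single month page.

```python
import os
import tempfile


def rows_to_lines(dates, max_temps, min_temps, weathers, wind_dirs):
    # Build one tab-separated line per day instead of keeping only the last row.
    lines = []
    for row in zip(dates, max_temps, min_temps, weathers, wind_dirs):
        lines.append('\t'.join(field.strip() for field in row))
    return lines


# Made-up sample values standing in for the XPath results of one month page.
dates = ['2011-01-01', '2011-01-02']
max_temps = ['5', '7']
min_temps = ['-3', '-1']
weathers = ['sunny', 'cloudy']
wind_dirs = ['north', 'east']

lines = rows_to_lines(dates, max_temps, min_temps, weathers, wind_dirs)

out_path = os.path.join(tempfile.mkdtemp(), 'qixiang.txt')
with open(out_path, 'a', encoding='utf-8') as f:
    for line in lines:
        f.write(line + '\n')

with open(out_path, encoding='utf-8') as f:
    saved = f.read()
print(saved)
```

Whether the trailing `/` after `.html` in `base_url` affects which page the site returns cannot be confirmed from the post; comparing `resp.url` with the requested URL would show any redirect.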

Python Full Series / Stage 16: Python Crawler Development / Anti-Anti-Crawling - Floor 725

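Teacher, I want to output the numbers 1 to 12.

With format, how do I pad single-digit numbers to two characters, so that 1 becomes 01, while numbers from 10 onward are shown as they are?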
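
A minimal sketch of zero-padding with the format specification `02d` (width 2, padded with zeros). The URL pattern is copied from the earlier weather question, with only the placeholder changed.

```python
# Format 1..12 as two-character strings: 1 -> '01', 10 -> '10'.
months = ['{:02d}'.format(n) for n in range(1, 13)]
print(months)

# The same format spec can sit directly inside a URL template.
base_url = 'http://lishi.tianqi.com/zhengzhou/2011{:02d}.html/'
urls = [base_url.format(n) for n in range(1, 13)]
print(urls[0])
```

Because the spec handles both one- and two-digit numbers, no `if num < 10` branch is needed.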

Python Full Series / Stage 16: Python Crawler Development / Anti-Anti-Crawling - Floor 726

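Teacher, I have a question: how can a regular expression match multiple identical HTML tags?

I also scraped a page that has several identical span tags. Does that mean I should choose a different way of filtering?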
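
As a rough illustration, `re.findall` returns every match, so repeated tags are not a problem in themselves; the harder part is telling identical tags apart, which usually means matching on an attribute or on surrounding context. The HTML fragment below is made up for the sketch.

```python
import re

# Made-up HTML fragment with several <span> tags, two of them identical in form.
html = ('<div><span class="t">one</span>'
        '<span class="t">two</span>'
        '<span class="n">three</span></div>')

# findall returns every non-overlapping match; .*? is non-greedy, so each
# match stops at the nearest closing tag.
all_spans = re.findall(r'<span[^>]*>(.*?)</span>', html, re.S)

# Narrow the match with an attribute when the tag name alone is ambiguous.
t_spans = re.findall(r'<span class="t">(.*?)</span>', html, re.S)
print(all_spans, t_spans)
```

For nested or irregular markup, a parser-based approach such as XPath tends to be more robust than regular expressions.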

Python Full Series / Stage 16: Python Crawler Development / Anti-Anti-Crawling - Floor 727

Teacher, what do these three single quotes mean?

What do the statements inside them do?
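
The screenshot for this question is not preserved, so the sketch below assumes it refers to a docstring like the ones used in the course code. Triple quotes delimit a string that may span several lines; when such a string is the first statement in a function, Python stores it as the function's documentation. The function body here is a placeholder.

```python
def get_html(url):
    '''
    :param url: address to fetch
    :return: the page HTML
    '''
    return '<html></html>'  # placeholder body for the sketch


# The docstring is stored on the function and is what help() displays.
print(get_html.__doc__)

# Anywhere else, triple quotes simply create a multi-line string.
text = '''line one
line two'''
print(text.splitlines())
```

The docstring is not executed as code; it only documents the parameters and return value.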

Python Full Series / Stage 16: Python Crawler Development / Anti-Anti-Crawling - Floor 728

Teacher, this is what I get when I scrape the first three pages of Qiushibaike (糗事百科). What is going on?

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="X-UA-Compatible" content="chrome=1,IE=edge">
<meta name="renderer" content="webkit" />
<meta name="applicable-device" content="pc">



<meta name="domain_verify" content="pmrgi33nmfuw4ir2ejyws5ltnbuweyljnnss4y3pnurcyithovuwiir2ejqwmyrtguzdgobsmezdgnbyheywcmzthbrdmmtemu4tamrqg5rtmirmej2gs3lfknqxmzjchiytkmrzgq4demjugaydcnd5">

Code attached:

from urllib.request import urlopen,Request
from urllib.parse import quote
def get_html(url):
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
    }
    response=Request(url,headers=headers)
    info=urlopen(response)
    content=info.read().decode()
    return content
def save_html(content,filename):
    with open('./爬虫结果/'+filename+'.html','w',encoding='utf-8') as f:  # 'w' opens the file for writing; encoding='utf-8' sets the text encoding independently of the mode
        f.write(content)
def main():
    num=3
    url0=url='https://www.qiushibaike.com/8hr/page/{}/'
    for i in range(num):
        url=url0.format(i+1)
        html=get_html(url)
        filename='糗事百科的第{}页内容'.format(i+1)
        save_html(html,filename)
if __name__=='__main__':
    main()
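
The code above saves the raw response body, so a file full of HTML markup is the expected result; extracting the readable text is a separate parsing step. The sketch below uses only the standard library's `html.parser`. The class name `content` is taken from the XPath in the first question on this page and is assumed to match the site's markup; `ContentExtractor` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect the text inside <div class="content"> blocks."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching div
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth:
            # Nested div inside a matching block: track it so that its
            # closing tag does not end the block early.
            self.depth += 1
        elif 'content' in (dict(attrs).get('class') or '').split():
            self.depth = 1
            self.items.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data


# Made-up page standing in for the downloaded HTML.
sample = ('<html><body>'
          '<div class="content"><span>joke one</span></div>'
          '<div class="other">skip</div>'
          '<div class="content">joke <b>two</b></div>'
          '</body></html>')

parser = ContentExtractor()
parser.feed(sample)
texts = [item.strip() for item in parser.items]
print(texts)
```

In the course's own toolchain the same step would more likely be done with lxml and XPath, as in the other questions here.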

Python Full Series / Stage 16: Python Crawler Development / Crawler Basics (old) - Floor 729

from fake_useragent import UserAgent
import requests
from pyquery import PyQuery as pq


url = 'https://www.qidian.com/finish'
headers = {'User-Agent':UserAgent().random}

resp = requests.get(url,headers=headers)

doc = pq(resp.text)

name = [a.text for a in doc('h4 a')]
book = doc('a[class="name"]').text()

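Teacher, how do I extract the author names with pyquery? The page contains other a tags that look the same as the author-name ones. I have tried for a long time without success, and the pattern is irregular, so I cannot filter with an if.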

Python Full Series / Stage 16: Python Crawler Development / Anti-Anti-Crawling - Floor 730

from fake_useragent import UserAgent
import requests
from lxml import etree
from time import sleep


def get_html(url):
    '''
    :param url: address to scrape
    :return: the page HTML
    '''
    headers = {"User-Agent": UserAgent().chrome}
    resp = requests.get(url, headers=headers)
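    sleep(2)                        # throttle: sleep 2 seconds per request so the target server is not overloaded
    if resp.status_code == 200:     # only return the body when the status code is 200
        resp.encoding = 'utf-8'
        return resp.text            # return the page content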
    else:
        return None


def parse_list(html):
    '''
    :param html: HTML of a page that contains a movie list
    :return: list of movie detail URLs
    '''

    e = etree.HTML(html)      # parse the HTML into an element tree
    list_url = ['http://maoyan.com{}'.format(url) for url in e.xpath('/div[@class="channel-detail movie-item-title"]/a/@href')]
    return list_url


def pares_index(html):
    '''
    :param html: HTML of a page that contains one movie's details
    :return: the extracted movie information
    '''
    e = etree.HTML(html)
    name = e.xpath('//h1[@class="name"]/text()')[0]
    type = e.xpath('//li[@class="ellipsis"]/a[1]/text()')[0]
    actors = e.xpath('//div[@class="celebrity-group"][2]/ul[@class="celebrity-list clearfix"]/li/div/a/text()')
    actors = format_data(actors)
    return {"name": name, "type": type, "actors": actors}


def format_data(actors):                  # de-duplicate the actor names
    actor_set = set()
    for actor in actors:
        actor_set.add(actor.strip())      # strip() removes surrounding whitespace
    return actor_set


def main():
    num = int(input('How many pages to fetch: '))
    for page in range(num):
        url = 'https://maoyan.com/films?offset={}'.format(page * 30)

        list_html = get_html(url)
        list_url = parse_list(list_html)
        for url in list_url:
            info_html = get_html(url)       # send a new request for the movie detail page
            movie = pares_index(info_html)  # parse the movie information
            print(movie)


if __name__ == '__main__':
    main()

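Teacher, after updating the XPath expressions and links to match the site's current pages, the program only gets as far as asking for the number of pages and then ends with "Process finished with exit code 0". What is the reason?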
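
One possible reading, which cannot be confirmed without the live page: the expression in `parse_list` begins with a single `/`, which anchors the search at the document root, where the top element is `html` rather than `div`. If that expression matches nothing, `list_url` is empty, the inner `for` loop never runs, and the program exits normally without printing anything. A guard like the one below makes that case visible. `collect_details` and `fake_fetch_detail` are hypothetical names, and the URL is made up.

```python
def collect_details(list_urls, fetch_detail):
    # Hypothetical helper: fail loudly instead of silently doing nothing
    # when the list page yielded no links.
    if not list_urls:
        raise ValueError('list page produced no detail URLs; '
                         'check the XPath and the response body')
    return [fetch_detail(url) for url in list_urls]


def fake_fetch_detail(url):
    # Offline stand-in for get_html + pares_index.
    return {'url': url}


print(collect_details(['http://maoyan.com/films/1'], fake_fetch_detail))

try:
    collect_details([], fake_fetch_detail)
    failed_loudly = False
except ValueError as exc:
    failed_loudly = True
    print('error:', exc)
```

Printing `len(list_url)` and the first few hundred characters of `list_html` in the original program would give the same information with less code.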

Python Full Series / Stage 16: Python Crawler Development / Anti-Anti-Crawling - Floor 731

Teacher, in this lesson the text extracted by the program still contains markup such as <br/>. How can I remove that markup and keep only the text?
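
A small sketch of one approach using only the standard library: replace `<br>` variants with newlines, drop any remaining tags, and decode HTML entities. The input string is made up. This is a simple regex clean-up suited to short fragments, not a general HTML parser.

```python
import re
from html import unescape

# Made-up extracted fragment that still contains markup.
raw = 'first line<br/>second line<br />third &amp; last'

# Turn <br> variants into newlines, then drop any remaining tags.
text = re.sub(r'<br\s*/?>', '\n', raw, flags=re.I)
text = re.sub(r'<[^>]+>', '', text)
# Decode entities such as &amp; into plain characters.
text = unescape(text)
print(text)
```

When the page is already parsed with lxml, evaluating `string(.)` on the element returns its text content without the tags, which avoids the regex step.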

Python Full Series / Stage 16: Python Crawler Development / Anti-Anti-Crawling - Floor 732

def save_html()

In this part, what does 'w' mean?
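
In `open()`, the second argument is the file mode. The sketch below contrasts `'w'` (write, which truncates an existing file) with `'a'` (append); the file name is arbitrary and a temporary directory is used so nothing is left behind in the project.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.html')

# 'w' = write: creates the file, or empties it first if it already exists.
with open(path, 'w', encoding='utf-8') as f:
    f.write('first\n')
with open(path, 'w', encoding='utf-8') as f:
    f.write('second\n')
with open(path, encoding='utf-8') as f:
    after_w = f.read()

# 'a' = append: keeps the existing content and writes at the end.
with open(path, 'a', encoding='utf-8') as f:
    f.write('third\n')
with open(path, encoding='utf-8') as f:
    after_a = f.read()
print(repr(after_w), repr(after_a))
```

The default mode, used when the argument is omitted, is `'r'` for reading.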

Python Full Series / Stage 16: Python Crawler Development / Crawler Basics (old) - Floor 733

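Hello teacher, I wrote the crawler once through following the source code, but I receive no data and the console shows no error. The log output seems to say there are no items, yet I cannot find any difference from the source code. I have uploaded my code; please take a look.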

lianjia11.rar

Python Full Series / Stage 16: Python Crawler Development / Dynamic Data Scraping - Floor 734

from urllib.request import Request,build_opener
from fake_useragent import UserAgent
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor
from http.cookiejar import MozillaCookieJar
def get_cookie():
    login_url = 'https://www.docin.com/app/login'
    # set the account and password (credentials redacted)
    form_data = {
        'user': 'YOUR_USERNAME',
        'password': 'YOUR_PASSWORD'
    }
    # set the request headers
    headers = {"User-Agent": UserAgent().random}
    # build the login request
    req = Request(login_url, headers=headers, data=urlencode(form_data).encode())
    # cookie jar that can be saved to a file
    cookie_jar =MozillaCookieJar()
    # handler that stores response cookies in the jar
    handler = HTTPCookieProcessor(cookie_jar)
    opener = build_opener(handler)
    resp = opener.open(req)
    # save the cookies even if they are session-only or already expired
    cookie_jar.save('cookie.txt',ignore_discard=True,ignore_expires=True)

def use_cookie():
    info_url = 'https://www.docin.com/'
    headers = {'User-Agent':UserAgent().random}
    req = Request(info_url,headers=headers)

    # load the saved cookies
    cookie_jar = MozillaCookieJar()
    cookie_jar.load('cookie.txt',ignore_discard=True,ignore_expires=True)
    handler = HTTPCookieProcessor(cookie_jar)
    # build an opener that sends the loaded cookies
    opener = build_opener(handler)
    resp = opener.open(req)

    print(resp.read().decode())

if __name__ == '__main__':
    get_cookie()
    use_cookie()

Teacher, I tried to log in to Docin (豆丁网) using cookies. Why does the cookie information I saved look like this? I also cannot log in.
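
The saved file from this question is not preserved, so the cause of the failed login cannot be determined from the post. The sketch below only illustrates what the `ignore_discard` and `ignore_expires` flags used above do, which may help when reading `cookie.txt`: a session cookie (no expiry, marked "discard") is skipped on save and load unless `ignore_discard=True` is passed. `make_session_cookie`, the cookie name and value, and the domain are made up, and no network request is made.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_session_cookie(name, value, domain):
    # A session cookie: no expiry time, marked "discard".
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

jar = MozillaCookieJar()
jar.set_cookie(make_session_cookie('sid', 'abc123', 'www.example.com'))

# Without ignore_discard, the session cookie is skipped on save.
jar.save(path)
loaded = MozillaCookieJar()
loaded.load(path)
without_flag = len(loaded)

# With ignore_discard=True, it is written and can be loaded again.
jar.save(path, ignore_discard=True, ignore_expires=True)
loaded = MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
with_flag = [(c.name, c.value) for c in loaded]
print(without_flag, with_flag)
```

If the file contains cookies but the second request still shows a logged-out page, the login request itself may not have succeeded, for example because the form field names or URL differ from what the site expects; printing the login response body is one way to check.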

Python Full Series / Stage 16: Python Crawler Development / Using the Scrapy Framework (old) - Floor 735

Hello,