Effective Communication Q&A - [Official] 百战程序员 (Baizhan Programmer), Online IT Education and Training

To date, students have asked a total of 134267 questions.

# -*- coding: utf-8 -*-
from fake_useragent import UserAgent
import requests
from lxml import etree
from time import sleep

def get_html(url):
    """
    :param url: the URL to crawl
    :return: the page HTML, or None if the status code is not 200
    """
    headers = {
        "User-Agent": UserAgent().chrome
    }
    resp = requests.get(url, headers=headers)
    sleep(3)
    if resp.status_code == 200:
        resp.encoding = 'utf-8'
        return resp.text
    else:
        return None

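def parse_list(html):
    """
    :param html: HTML of a page that contains a movie list
    :return: a list of detail-page URLs, one per movie
    """
    if not html:  # get_html returns None when the request fails
        return []
    e = etree.HTML(html)
    list_url = ['https://maoyan.com' + url for url in e.xpath('//div[@class="movie-item-hover"]/a/@href')]
    return list_url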

def parse_index(html):
    """
    :param html: HTML of a movie detail page
    :return: the extracted movie information
    """
    e = etree.HTML(html)
    # xpath() returns an empty list when nothing matches, so avoid a bare [0]
    names = e.xpath('//h1/text()')
    types = e.xpath('//li[@class="ellipsis"]/a/text()')
    actor = e.xpath('//ul[@class="celebrity-list clearfix"]/li[@class="celebrity actor"]/div/a/text()')
    actors = format_actor(actor)
    return {'name': names[0] if names else None,
            'type': types[0] if types else None,
            'actor': actors}

def format_actor(actors):
    actor_set = set()  # de-duplicate
    for actor in actors:
        actor_set.add(actor.strip())
    return actor_set

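def main():
    num = int(input('Enter the number of pages to fetch: '))
    for y in range(num):
        url = 'https://maoyan.com/films?showType=3&offset={}'.format(y*30)
        # print(url)
        list_html = get_html(url)
        if list_html is None:  # request failed; move on to the next page
            continue
        list_url = parse_list(list_html)
        for detail_url in list_url:
            # print(detail_url)
            info_html = get_html(detail_url)
            if info_html is None:  # request failed; skip this movie
                continue
            movie = parse_index(info_html)
            print(movie)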


if __name__ == '__main__':
    main()

Teacher, why am I not getting any data?

Python Full Series / Stage 16: Python Crawler Development / Crawler Anti-Anti-Scraping - post 751

File "D:/pycharmfile/爬虫/第一个爬虫.py", line 9, in <module>

print(info.decode('utf-8'))

UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 22462: illegal multibyte sequence

I wrote it following the format of the source code; I only added #coding at the top. Why does this happen, and how do I fix it?
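One possible reading, going only by the traceback above: the decode to UTF-8 succeeds, and the failure happens when print() writes to a console whose encoding is GBK, which cannot represent the character '\xbb'. A minimal sketch of two workarounds follows. The helper names are hypothetical, and the reconfigure call assumes Python 3.7 or later.

```python
import sys

def to_console_safe(text, encoding="gbk"):
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

def use_utf8_stdout():
    """Alternative: switch stdout itself to UTF-8 (Python 3.7+)."""
    sys.stdout.reconfigure(encoding="utf-8")

# Printing the sanitized text no longer raises UnicodeEncodeError on a GBK console.
print(to_console_safe("page title \xbb section"))
```

The first helper loses the unencodable characters, so it suits quick debugging output; writing the text to a file opened with encoding='utf-8' keeps everything intact.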

Python Full Series / Stage 16: Python Crawler Development / Crawler Basics (old) post 752

I can no longer find the JS code in the page.

Python Full Series / Stage 16: Python Crawler Development / Crawler Anti-Anti-Scraping - post 753

Where can I download ChromeDriver?

Python Full Series / Stage 16: Python Crawler Development / Crawler Basics (old) post 754

Teacher, it isn't there. This only contains the renren project and node-v14.17.4-, and the image folder only has pictures.

Python Full Series / Stage 16: Python Crawler Development / Docker Container Extensions (old) post 755

Solved it, teacher. There is no contrib package; the class is now imported with from scrapy.pipelines.images import ImagesPipeline.

So in settings, change the entry shown in the video, 'scrapy.contrib.pipeline.images.ImagesPipeline':300

to

'scrapy.pipelines.images.ImagesPipeline': 300,

and it runs normally.
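For anyone hitting the same error, the relevant part of settings.py might look like the sketch below. The "images" directory is an assumed value, not something from the post; IMAGES_STORE is the Scrapy setting that tells ImagesPipeline where to save files.

```python
# settings.py (fragment)

# Current import path of the built-in image pipeline
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 300,
}

# Directory where downloaded images are stored ("images" is an assumed path)
IMAGES_STORE = "images"
```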

Python Full Series / Stage 16: Python Crawler Development / Advanced Scrapy Framework post 756

Request failed

Python Full Series / Stage 16: Python Crawler Development / Crawler Basics post 757

Why is my output different?

Python Full Series / Stage 16: Python Crawler Development / Crawler Basics post 758

What does "asynchronous request" mean? Could you explain it in more detail?
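As a rough illustration of the idea: with asynchronous requests the program does not sit idle waiting for one response before sending the next, so several waits overlap. The sketch below uses asyncio.sleep as a stand-in for network latency. The function names and delays are made up, and no real HTTP call is made.

```python
import asyncio
import time

async def fake_request(name, delay):
    # asyncio.sleep stands in for waiting on a server response
    await asyncio.sleep(delay)
    return f"{name} done"

async def fetch_all():
    # gather() starts all three coroutines and waits for them together
    return await asyncio.gather(
        fake_request("page-1", 0.2),
        fake_request("page-2", 0.2),
        fake_request("page-3", 0.2),
    )

def run_demo():
    start = time.perf_counter()
    results = asyncio.run(fetch_all())
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = run_demo()
print(results, f"{elapsed:.2f}s")
```

Run one after another, the three waits would add up to about 0.6 seconds; because they overlap here, the total is close to the longest single wait.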

Python Full Series / Stage 16: Python Crawler Development / Crawler Anti-Anti-Scraping post 759

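A small question: while crawling, what if the text on a page is cut off and I have to click one level deeper to read the rest, and that deeper page is itself paginated? How should I crawl that?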
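One common pattern for this situation is to treat the detail page as a chain: fetch the first detail page, extract its text and the link to the next page, and repeat until there is no next link. The sketch below is only an outline under assumptions: fetch and parse are hypothetical callables you would supply (for example a get_html-style function and an XPath-based parser), and max_pages is an arbitrary safety limit.

```python
def crawl_detail(fetch, parse, first_url, max_pages=50):
    """Follow a paginated detail page and join the text of every page.

    fetch(url) -> html or None
    parse(html) -> (text, next_url or None)
    """
    parts = []
    seen = set()
    url = first_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guard against pagination loops
        html = fetch(url)
        if html is None:  # request failed; stop with what we have
            break
        text, url = parse(html)
        parts.append(text)
    return "".join(parts)

# Usage with in-memory stand-ins instead of real requests:
pages = {
    "/detail?p=1": ("Hello ", "/detail?p=2"),
    "/detail?p=2": ("world", None),
}
print(crawl_detail(pages.get, lambda page: page, "/detail?p=1"))
```

The list page then only needs to collect the first detail URL of each item and hand it to this function.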

Python Full Series / Stage 16: Python Crawler Development / Crawler Anti-Anti-Scraping - post 760

Why is my output different?

Python Full Series / Stage 16: Python Crawler Development / Crawler Basics post 761

Teacher, could you send me a copy of the lecture notes?

Python Full Series / Stage 16: Python Crawler Development / Using the Scrapy Framework (old) post 762

SyntaxWarning: "is" with a literal. Did you mean "=="? if e.args is ():

Can this warning be removed? It's pretty annoying, and I couldn't find a suitable fix on Baidu.
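The warning points out that the code compares against a literal with is, which checks object identity rather than equality. If the line is in your own code, rewriting the comparison should make the warning go away; if it comes from an installed package, upgrading that package is probably the cleaner fix. A small sketch, with function names made up for illustration:

```python
import warnings

def has_no_args(exc):
    """Same intent as `e.args is ()`, without relying on identity."""
    return not exc.args  # an empty tuple is falsy

def silence_syntax_warnings():
    """Fallback: hide SyntaxWarning.

    This must run before the offending module is compiled or imported,
    because the warning is emitted at compile time.
    """
    warnings.filterwarnings("ignore", category=SyntaxWarning)

print(has_no_args(Exception()))
print(has_no_args(Exception("boom")))
```

Writing `if e.args == ():` would also work; `if not e.args:` is simply the more common spelling.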

Python Full Series / Stage 16: Python Crawler Development / Using the Scrapy Framework (old) post 763

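Teacher, could you please take a look at my code? Has my IP been blocked? The HTML in the response seems to be asking me to complete a verification.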

import requests
from fake_useragent import UserAgent
from lxml import etree

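def get_html(url):
    '''Take the URL to crawl and return its HTML.'''
    headers = {"User-Agent": UserAgent().chrome}
    # Pass headers by keyword: the second positional argument of requests.get is params,
    # so requests.get(url, headers) would send the default User-Agent instead.
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        print(response.text)
        return response.text
    else:
        return None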

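def parse_list(html):
    '''Take HTML containing the movie list and return the URL of each movie in it.'''
    e = etree.HTML(html)
    # The hrefs are assumed to start with "/", so no extra slash after the domain
    list_url = ['https://maoyan.com{}'.format(url) for url in e.xpath('//div[@class="movie-item film-channel"]/a/@href')]
    # print(list_url)
    return list_url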

def parse_index(html):
    '''Take HTML containing movie information and return the extracted details.'''
    e = etree.HTML(html)
    name = e.xpath('//h1[@class="name"]/text()')
    movie_type = e.xpath('//li[@class="ellipsis"][1]/a/text()')  # avoid shadowing the built-in type
    actor = e.xpath('//div[@class="celebrity-group"][2]/ul[@class="celebrity-list clearfix"]/li/div/a/text()')
    actors = format_data(actor)
    return {'name':name,'type':movie_type,'actor':actors}

def format_data(actors):
    actor_set = set()
    for actor in actors:
        actor_set.add(actor)
    return actor_set

def main():
    num = int(input('Enter the number of pages to fetch: '))
    for page in range(num):
        url = 'https://maoyan.com/films?showType=3&offset={}'.format(page*30)
        list_html = get_html(url)
        list_url = parse_list(list_html)
        print(list_url)
        # for url in list_url:
        #     info_html = get_html(url)
        #     movie = parse_index(info_html)
        #     print(movie)


if __name__ == "__main__":
    main()

Python Full Series / Stage 16: Python Crawler Development / Crawler Anti-Anti-Scraping - post 764

Why is there no video on crawling WeChat official accounts and app data? Could you briefly explain how those are crawled?

Python Full Series / Stage 16: Python Crawler Development / Crawler Basics (old) post 765

Hello,