Effective Communication Q&A - [Official] 百战程序员 (IT Online Education and Training Provider)

from urllib.request import Request, build_opener, ProxyHandler
from fake_useragent import UserAgent


url = 'http://httpbin.org/get'

headers = {"User-Agent": UserAgent().chrome}
req = Request(url, headers=headers)
# Route HTTP requests through the proxy
handler = ProxyHandler({'http': '103.45.147.157:16817'})

opener = build_opener(handler)
resp = opener.open(req)
print(resp.read().decode())

Teacher, no matter what I put in this proxy, the response always shows my own machine's IP. Even an empty string gives the local IP. Why?
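One thing worth ruling out is whether urllib is skipping the proxy for this host, or whether a system-level proxy setting is involved. The sketch below only prints what the standard library sees; the helper name `proxy_report` is made up for illustration, and the proxy address is the one from the post.

```python
from urllib.request import ProxyHandler, getproxies, proxy_bypass


def proxy_report(host, proxies):
    """Collect the proxy-related settings urllib would consider for `host`."""
    handler = ProxyHandler(proxies)
    return {
        # Mapping the handler will use, keyed by URL scheme.
        "handler_proxies": dict(handler.proxies),
        # Proxies picked up from environment variables or OS settings.
        "system_proxies": getproxies(),
        # True when the environment or OS says to skip proxies for this host.
        "bypassed": bool(proxy_bypass(host)),
    }


report = proxy_report("httpbin.org", {"http": "103.45.147.157:16817"})
for key, value in report.items():
    print(key, "=", value)
```

`ProxyHandler` fetches a host directly when it matches the bypass rules, so a true `bypassed` value would be consistent with always seeing the local IP, whatever the proxy string is.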

Python Full Series / Stage 15: Python Crawler Development / Using the Scrapy Framework (Old), post 916

from my_fake_useragent import UserAgent
from bs4 import BeautifulSoup
from time import sleep
import requests

url = 'https://maoyan.com/films/1218188'
# The header name is "User-Agent" (hyphen); "User_Agent" is not treated as the browser header
headers = {"User-Agent": UserAgent().random()}
# requests picks the proxy by URL scheme, so an https URL needs an "https" entry
proxies = {"http": "http://175.155.71.22:1133", "https": "http://175.155.71.22:1133"}
resp = requests.get(url, headers=headers, proxies=proxies)
resp.encoding = 'utf-8'

soup = BeautifulSoup(resp.text, 'lxml')
name = soup.select('h1.name')[0].text  # movie title
ename = soup.select('div.ename')[0].text  # English title
type = soup.select('li.ellipsis>a')[0].text  # genre
director = soup.select('li.celebrity > div >a')[0].text  # director
actors = soup.select('li.celebrity actor>a')  # cast
intor = soup.select('span.dra')
actor_set = set()
for actor in actors:
    print(actor.text.strip())  # strip(), not stirp()
print(name, ename, type, director, intor)

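Teacher, I'm scraping the Maoyan page for the movie 你好李焕英 (Hi, Mom). My IP has been blocked, and using a proxy doesn't help either. I tried several free high-anonymity proxies and it still fails. Could you take a look? Is there something wrong with my code?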
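A point that is easy to miss with `requests`: the proxy is chosen by the scheme of the URL being fetched, so a mapping that only has an `http` key leaves an `https://` URL unproxied. The helper `pick_proxy` below is a simplified illustration of that lookup, not the library's actual code.

```python
from urllib.parse import urlsplit


def pick_proxy(url, proxies):
    """Simplified lookup: return the proxy registered for the URL's scheme, if any."""
    scheme = urlsplit(url).scheme
    return proxies.get(scheme)


proxies = {"http": "http://175.155.71.22:1133"}
print(pick_proxy("http://httpbin.org/get", proxies))            # the proxy is used
print(pick_proxy("https://maoyan.com/films/1218188", proxies))  # None: direct connection
```

If the second call returns `None`, the request goes out from the local IP, which would explain why switching proxies changes nothing.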

Python Full Series / Stage 15: Python Crawler Development / Crawler Basics (Old), post 917

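I have this problem too: when I try to click the element with ID "search_icon", the browser actually delivers the click event to another element, the one with ID "id_qrcode_popup_container". How can I fix this?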
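A common workaround for an intercepted click is to hide the covering element and trigger the click through JavaScript. The sketch below assumes a Selenium-style `driver` that exposes `execute_script(script, *args)`; the function name `click_past_overlay` is hypothetical.

```python
def click_past_overlay(driver, target, overlay_id=None):
    """Hide an optional overlay by id, then click `target` through JavaScript."""
    if overlay_id is not None:
        driver.execute_script(
            "var el = document.getElementById(arguments[0]);"
            "if (el) { el.style.display = 'none'; }",
            overlay_id,
        )
    # A JavaScript click is dispatched on the element itself, so an element
    # drawn on top of it does not receive the event.
    driver.execute_script("arguments[0].click();", target)
```

With a real driver this would be called as `click_past_overlay(driver, driver.find_element(By.ID, "search_icon"), "id_qrcode_popup_container")`. Waiting for the popup to disappear, or closing it, before clicking is the gentler alternative.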

Python Full Series / Stage 15: Python Crawler Development / Crawler Basics, post 918

You have to install and configure Anaconda yourself.

Python Full Series / Stage 15: Python Crawler Development / Python Crawler Basics and Applications, post 919

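Teacher, could you help me figure out what's causing this? I'm printing the content of 斗罗大陆 (Douluo Dalu), but the program stops after printing only one chapter.
I've only just started with this, so I don't know how to read it and can't spot the problem. Please help.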

first_xiaoshuo04.zip

Python Full Series / Stage 15: Python Crawler Development / Mobile Crawler Development, post 920

from fake_useragent import UserAgent
import requests
import parsel

# Set the URL
url ='https://uland.taobao.com/sem/tbsearch?refpid=mm_26632258_3504122_32538762&keyword=%E6%B7%98%E5%AE%9D&clk1=336e7eebf578863c3d669b4cd1020b7d&upsid=336e7eebf578863c3d669b4cd1020b7d'
headers = {
    'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}
response = requests.get(url,headers = headers)
print(response.text)
html_data = parsel.Selector(response.text)
name=html_data.xpath('//div[@class="pc-items-item-title pc-items-item-title-row2"]/span[@class="title-text"]/text()').get()
dollars=html_data.xpath('//div[@class="price-con"]/span[3]/text()').get()
print(name)
print(dollars)

I can't extract the content of the page.

Python Full Series / Stage 15: Python Crawler Development / Crawler Anti-Anti-Scraping, post 921

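Teacher, is my fake-useragent not working because it isn't installed in the virtual environment? My code is now in the virtual environment, but the file path shown automatically below is still plain python rather than the virtual environment's. How do I adjust this?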

Python Full Series / Stage 15: Python Crawler Development / Crawler Basics (Old), post 922

import requests
# import execjs
from Crypto.Cipher import AES


def get_data():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
    }
    params = {
        'pg': '27',
        'pgsz': '15',
        'total': '450',
    }
    # Fetch the data
    response = requests.get(
        'https://jzsc.mohurd.gov.cn/APi/webApi/dataservice/query/project/list',
        params=params,
        headers=headers,
    )
    return response.text


def decrypt(data, key, iv):
    # Create an AES decryption object
    cipher = AES.new(key, AES.MODE_CBC, iv)
    # Convert the data to bytes, the type the cipher object accepts
    tmp_data = bytes.fromhex(data)
    # Decrypt the data
    rs = cipher.decrypt(tmp_data)
    return rs


if __name__ == '__main__':
    # Initialize iv and key
    iv = b'0123456789ABCDEF'
    key = b'Dt8j9wGw%6HbxfFn'
    # Data to decrypt
    t = get_data()
    print(t)
    print(decrypt(t, key, iv).decode())

Teacher, how do I fix this error? decode() doesn't work.
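The post doesn't include the traceback, so the following is only a sketch of one step that CBC decryption usually needs: removing block padding before decoding. It assumes the plaintext is PKCS#7 padded, which the post doesn't confirm; `strip_pkcs7` is a hypothetical helper written with the standard library only.

```python
def strip_pkcs7(data, block_size=16):
    """Remove PKCS#7 padding, raising ValueError if the padding is malformed."""
    if not data or len(data) % block_size != 0:
        raise ValueError("length is not a positive multiple of the block size")
    pad_len = data[-1]
    if pad_len < 1 or pad_len > block_size or data[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid padding; wrong key/IV or the data is not PKCS#7 padded")
    return data[:-pad_len]


# Two bytes of content padded to a 16-byte block with fourteen 0x0e bytes.
padded = "ok".encode() + bytes([14]) * 14
print(strip_pkcs7(padded).decode())
```

If the padding check fails, or `.decode()` raises `UnicodeDecodeError`, the decrypted bytes are probably not the real plaintext; printing `rs.decode(errors="replace")` can help confirm whether the key, IV, or mode is off.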

Python Full Series / Stage 15: Python Crawler Development / Crawler Anti-Anti-Scraping, post 923

PS C:\Users\29768> docker pull mongo:5.0.9

5.0.9: Pulling from library/mongo

d7bfe07ed847: Pulling fs layer

97ef66a8492a: Pulling fs layer

20cec14c8f9e: Pulling fs layer

38c3018eb09a: Waiting

ccc9e1c2556b: Waiting

593c62d03532: Waiting

1a103a446c3f: Waiting

be887b845d3f: Waiting

e5543880b183: Waiting

error pulling image configuration: Get "https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/c8/c8b57c4bf7e3a88daf948d5d17bc7145db05771e928b3b3095ca4590719b5469/data?verify=1719932977-PEAYl6kRupqFXcYgbtQiNMGgGbc%3D": dial tcp 199.59.148.89:443: connect: connection refused

What should I do when this happens?

Python Full Series / Stage 15: Python Crawler Development / Using the Scrapy Framework, post 924

(bzsxt_env) PS D:\python\AUTO_CODE\scrapy05> pip install cryptography==36.0.2

WARNING: Ignoring invalid distribution -ip (d:\python\python_env\bzsxt_env\lib\site-packages)

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple

Collecting cryptography==36.0.2

Using cached https://pypi.tuna.tsinghua.edu.cn/packages/46/cd/abfb77b8a0666f38ec321e49eef3733cbecb3caf79926ec14a7fe3b2217f/cryptography-36.0.2-cp36-abi3-win_amd64.whl (2.2 MB)

Requirement already satisfied: cffi>=1.12 in d:\python\python_env\bzsxt_env\lib\site-packages (from cryptography==36.0.2) (1.15.1)

Requirement already satisfied: pycparser in d:\python\python_env\bzsxt_env\lib\site-packages (from cffi>=1.12->cryptography==36.0.2) (2.21)

WARNING: Ignoring invalid distribution -ip (d:\python\python_env\bzsxt_env\lib\site-packages)

Installing collected packages: cryptography

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

pyopenssl 23.2.0 requires cryptography!=40.0.0,!=40.0.1,<42,>=38.0.0, but you have cryptography 36.0.2 which is incompatible.

Successfully installed cryptography-36.0.2

It doesn't work. Judging from the error message, the version needs to be >=38.0.0.
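The requirement printed by pip is `cryptography!=40.0.0,!=40.0.1,<42,>=38.0.0`. As a rough illustration of what that range admits, the sketch below checks candidate versions with plain tuple comparison; both helper names are hypothetical, and it only handles simple dotted integer versions.

```python
def parse_version(text):
    """Turn '36.0.2' into (36, 0, 2); assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))


def fits_pyopenssl_23_2_0(version):
    """Check a cryptography version against the constraint shown in pip's message."""
    v = parse_version(version)
    excluded = {(40, 0, 0), (40, 0, 1)}
    return (38, 0, 0) <= v < (42,) and v not in excluded


for candidate in ["36.0.2", "38.0.0", "40.0.1", "41.0.7"]:
    print(candidate, fits_pyopenssl_23_2_0(candidate))
```

So the two ways out are to install a cryptography release inside that range, or to pin a pyOpenSSL release whose own requirement admits 36.0.2.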

Python Full Series / Stage 15: Python Crawler Development / Using the Scrapy Framework, post 925

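Teacher, last time I scraped Maoyan movie information following the code in the video, everything worked. But today the same code no longer gets anything. (Your original code from the course materials doesn't work either.)

Then I used the same code to scrape book information from Qidian (起点中文网) and it was fine. Can I take this to mean that Maoyan has updated its site and now recognizes me as a crawler, so no data can be extracted, while Qidian's anti-scraping is less advanced than Maoyan's and so it still works?

Also, in a case like Maoyan where no data comes back, how do I solve it? How can I tell that it's an anti-scraping measure on the site's side? And how should I go about finding the cause?

Below are the code and the output after running it.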

from fake_useragent import UserAgent
import requests
from lxml import etree
from time import sleep


def get_html(url):
    '''
    :param url: the URL to crawl
    :return: the page HTML
    '''
    headers = {"User-Agent": UserAgent().chrome}
    resp = requests.get(url, headers=headers)
    sleep(2)
    if resp.status_code == 200:
        resp.encoding = 'utf-8'
        return resp.text
    else:
        return None


def parse_list(html):
    '''
    :param html: HTML of a page that contains the movie list
    :return: a list of movie detail URLs
    '''

    e = etree.HTML(html)
    list_url = ['http://maoyan.com{}'.format(url) for url in e.xpath('//div[@class="movie-item"]/a/@href')]
    return list_url


def pares_index(html):
    '''
    :param html: HTML of a page that contains one movie's details
    :return: the extracted movie information
    '''
    e = etree.HTML(html)
    name = e.xpath('//h3[@class="name"]/text()')[0]
    type = e.xpath('//li[@class="ellipsis"][1]/text()')[0]
    actors = e.xpath('//div[@class="celebrity-group"][2]/ul[@class="celebrity-list clearfix"]/li/div/a/text()')
    actors = format_data(actors)
    return {"name": name, "type": type, "actors": actors}


def format_data(actors):
    actor_set = set()
    for actor in actors:
        actor_set.add(actor.strip())
    return actor_set


def main():
    num = int(input('请输要获取多少页：'))
    for page in range(num):
        url = 'http://maoyan.com/films?showType=3&offset={}'.format(page * 30)
        list_html = get_html(url)
        list_url = parse_list(list_html)
        for url in list_url:
            info_html = get_html(url)
            movie = pares_index(info_html)
            print(movie)


if __name__ == '__main__':
    main()

D:\pythonDownloads\python.exe D:/pythonwd/爬虫/爬虫/代码/demo/29.猫眼电影1.py

请输要获取多少页：1

Process finished with exit code 0
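An exit code of 0 with nothing printed suggests the list page came back with status 200 but the XPath for `movie-item` matched nothing, so the inner loop never ran. A first triage step is to look at what was actually returned. The function below is a rough sketch; its name and the labels it returns are illustrative only.

```python
def diagnose(status_code, html, expected_marker):
    """Rough triage of a crawl response; the returned labels are illustrative."""
    if status_code != 200:
        return "blocked or error: HTTP {}".format(status_code)
    if not html:
        return "empty body"
    if expected_marker not in html:
        # Typical of a verification page, or of content rendered later by JavaScript.
        return "page served, but expected content is missing"
    return "ok"


print(diagnose(200, "<html><body>verify</body></html>", 'class="movie-item"'))
print(diagnose(403, "", 'class="movie-item"'))
```

In the script this would be called as `diagnose(resp.status_code, resp.text, 'class="movie-item"')`. When the third label comes back, saving `resp.text` to a file and opening it in a browser usually shows whether the site returned a verification page instead of the list.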

Python Full Series / Stage 15: Python Crawler Development / Crawler Anti-Anti-Scraping, post 926

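Teacher, when I run the code, the scrollbar doesn't reach the very bottom, and the count printed by the last line is clearly wrong. The code is below; could you help me analyze it? Thanks!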

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
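from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
from lxml import etree
'''
Example: Baidu Images
This example does not manage to scroll all the way to the bottom
'''
driver = webdriver.Chrome()
driver.get('https://image.baidu.com/')
# find_element(By.X, ...) works in Selenium 3 and 4; the find_element_by_* helpers are gone in recent Selenium 4
driver.find_element(By.ID, 'kw').send_keys('成吉思汗')
driver.find_element(By.CLASS_NAME, 's_search').click()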
js = 'document.documentElement.scrollTop=1000000'
# js = 'window.scrollTo(0,1000000)'
driver.execute_script(js)
sleep(5)
html = driver.page_source
e = etree.HTML(html)
imgurl_list = e.xpath('//li[@class="imgitem"]/div/a/img/@data-imgurl')
linkurl_list = ['https://image.baidu.com{}'.format(url) for url in e.xpath('//li[@class="imgitem"]/div/a/@href')]
for imgurl,linkurl in zip(imgurl_list,linkurl_list):
    print(imgurl,'|',linkurl)
driver.quit()
print(len(linkurl_list))
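Setting `scrollTop` once only reaches the bottom of whatever has loaded at that moment. Assuming the results page loads more items as you scroll, one approach is to keep scrolling until the document height stops growing. The sketch below assumes a Selenium-style `driver` with `execute_script`; `scroll_to_bottom` and its parameters are hypothetical.

```python
from time import sleep


def scroll_to_bottom(driver, pause=2.0, max_rounds=30):
    """Scroll until the document height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.documentElement.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        sleep(pause)  # give lazily loaded items time to arrive
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_height == last_height:
            return round_no
        last_height = new_height
    return max_rounds
```

Calling `scroll_to_bottom(driver)` before reading `driver.page_source` should make the final count reflect everything the page was willing to load within `max_rounds` scrolls.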

Python Full Series / Stage 15: Python Crawler Development / Crawler Anti-Anti-Scraping, post 927

Teacher, what is the purpose of passing the HTTPHandler() object in here?
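For context, `build_opener` already installs a default `HTTPHandler`; passing your own instance replaces that default, which mainly matters when the instance is configured differently. A small sketch with the standard library (the `debuglevel=1` setting is just an example):

```python
from urllib.request import HTTPHandler, build_opener

# debuglevel=1 makes http.client print the request and response headers.
custom = HTTPHandler(debuglevel=1)
opener = build_opener(custom)

http_handlers = [h for h in opener.handlers if isinstance(h, HTTPHandler)]
print(len(http_handlers))          # how many HTTPHandler instances the opener holds
print(http_handlers[0] is custom)  # whether it is the instance passed in
```

An opener built with a plain `HTTPHandler()` therefore behaves like the default one for ordinary HTTP requests; the explicit handler is mostly there to show where custom handlers plug in.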

Python Full Series / Stage 15: Python Crawler Development / Using the Scrapy Framework (Old), post 928

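Teacher, can selenium be used to detect data that updates at irregular intervals, like on Jin10 Data (金十数据)? In other words, can I use selenium to scrape once each time the data updates, and thereby reduce how often I send requests?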
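One way to approach "scrape only when something changed" is to keep the page open, re-read the text of the element of interest, and process it only when its fingerprint differs from the previous one. The sketch below covers just the comparison step, using the standard library; `ChangeDetector` is a hypothetical helper, and how the text is read from the page is left out.

```python
import hashlib


def fingerprint(text):
    """Return a stable digest of the text so old content need not be stored."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ChangeDetector:
    """Remember the last fingerprint and report whether new text differs from it."""

    def __init__(self):
        self._last = None

    def is_new(self, text):
        current = fingerprint(text)
        changed = current != self._last
        self._last = current
        return changed


detector = ChangeDetector()
for snapshot in ["item A", "item A", "item B"]:
    print(snapshot, detector.is_new(snapshot))
```

If the page updates itself in the browser, re-reading an element from the already open page avoids reloading it, though the polling interval still has to be chosen by hand.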

Python Full Series / Stage 15: Python Crawler Development / Crawler Anti-Anti-Scraping, post 929

Teacher, what's going on here?

Python Full Series / Stage 15: Python Crawler Development / Crawler Anti-Anti-Scraping, post 930

Hello,