Effective Communication Q&A - [Official] 百战程序员 IT Online Education and Training

from urllib.request import Request,urlopen
from fake_useragent import UserAgent

ua = UserAgent()

# print(ua.random)
url = 'http://httpbin.org/get'
headers = {
    'User-Agent':ua.random
}
req = Request(url,headers=headers)
resp = urlopen(req)
info = resp.read().decode()
print(info)

Python Full Series/Stage 14: Python Crawler Development/Using the Scrapy Framework (Old) Post #796

import scrapy
import re
from chaojiying_Python.chaojiying import get_code

class Login1Spider(scrapy.Spider):
    name = 'login1'
    allowed_domains = ['ganji.com']
    start_urls = ['https://passport.ganji.com/login.php']
    def parse(self, response):
        img_url = 'https://passport.ganji.com/ajax.php?dir=captcha&module=login_captcha'
        hash_code =re.search(r'"__hash__":"(.+)"',response.text).group(1)
        yield scrapy.Request(img_url,callback=self.do_fromdata,meta={'hash_code':hash_code})
    def do_fromdata(self,response):
        with open('code.jpg','wb') as f:
            f.write(response.body)
        #code = get_code('code.jpg')
        code = input("Enter the captcha: ")
        hash_code = response.request.meta['hash_code']
        data = {
            'username': '17030240219',
            'password': '123456qaz',
            'setcookie': '14',
            'checkCode':code,
            'next': '/ user / register_success.php?username=17030240219&next=%2F',
            'source':'passport',
            '__hash__':hash_code
        }
        login_url = 'https://passport.ganji.com/login.php'
        yield scrapy.FormRequest(login_url,method='POST',formdata=data,callback=self.after_login)
        #print(response.text)
    def after_login(self,response):
        print(response.text)

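Why does logging in report an "invalid array length" error, and how can I fix it?

Python Full Series/Stage 14: Python Crawler Development/Advanced Scrapy Framework Post #797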

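Teacher, why has my MySQL suddenly stopped starting? I have not changed the password; it simply failed to start today. How should I fix this?

Python Full Series/Stage 14: Python Crawler Development/Mobile Crawler Development Post #798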

Teacher, I installed scrapy with pip and created the project, so why am I still told that scrapy cannot be found? The word scrapy is underlined in red.
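A red underline together with "no module" often means the editor is using a different Python interpreter from the one pip installed into. That is only a guess about this setup. The sketch below is a minimal check: run it with the interpreter configured in the IDE and compare the printed path with the one pip used. The function name `describe_package` is made up for illustration.

```python
import importlib.util
import sys


def describe_package(name):
    """Report which interpreter is running and whether it can import `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "package": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    # Run this with the same interpreter the IDE project is configured to use.
    print(describe_package("scrapy"))
```

If `found` is False here but `pip show scrapy` reports an installation, the two are probably looking at different environments.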

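Python Full Series/Stage 14: Python Crawler Development/Mobile Crawler Development Post #799

Output:

Teacher, after configuring the environment variable I am still told that scrapy is not recognized as an internal or external command. I have already added C:\users\陈鸿杰\appdata\roming\python\python39\scripts to PATH, but it still does not work. Could you help me see what is wrong?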
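One thing worth checking: the path quoted in the question is spelled `roming`, while the Windows folder is normally named `Roaming`. If that spelling is what was actually entered in PATH, the entry points at a folder that does not exist. The sketch below is a small, assumption-laden helper (the function names are hypothetical) that reports whether a command is reachable and which PATH entries are missing on disk.

```python
import os
import shutil


def find_command(name, path=None):
    """Return the full path of `name` if it can be found on PATH, else None."""
    return shutil.which(name, path=path)


def missing_path_entries(path=None):
    """List PATH entries that do not exist on disk (for example, due to a typo)."""
    raw = os.environ.get("PATH", "") if path is None else path
    return [p for p in raw.split(os.pathsep) if p and not os.path.isdir(p)]


if __name__ == "__main__":
    print("scrapy found at:", find_command("scrapy"))
    print("PATH entries that do not exist:", missing_path_entries())
```

A terminal opened before PATH was edited keeps the old value, so it is worth re-running the check in a new terminal window.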

Python Full Series/Stage 14: Python Crawler Development/Mobile Crawler Development Post #800

Code:

from selenium import webdriver
import requests
from time import sleep
from lxml import etree
for i in range (1,4):
    url = "https://www.jianshu.com/?seen_snote_ids%5B%5D=84663118&seen_snote_ids%5B%5D=84266647&seen_snote_ids%5B%5D=84293648&seen_snote_ids%5B%5D=80632322&seen_snote_ids%5B%5D=80844033&seen_snote_ids%5B%5D=84170253&seen_snote_ids%5B%5D=81012207&page={}".format(i)
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    chrome = webdriver.Chrome(options=options)
    sleep(2)
    chrome.get(url)
    js = 'document.documentElement.scrollTop=100000'
    chrome.execute_script(js)
    html = chrome.page_source
    e = etree.HTML(html)
    contents = e.xpath('//div[@class="content"]/a/text()')
    for content in contents:
        print(content)
    chrome.quit()

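Output:

Teacher, this is code I wrote myself to scrape Jianshu. In the F12 developer tools it looks like an Ajax request: scrolling down updates part of the page, and in the request URLs everything is the same except the page parameter. So I use a for loop that changes page to fetch the data, but all three pages come back identical. Why is that, and how should I handle it?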
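One possible explanation, offered as an assumption rather than a confirmed diagnosis: the URL in the code hard-codes the same `seen_snote_ids[]` values for every page. If the server uses that list to decide which notes to leave out, sending an unchanged list may produce the same content each time. The sketch below only shows how such a URL could be rebuilt from the ids collected so far. `build_page_url` is a hypothetical helper, and how the ids are read from the page is left open.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.jianshu.com/"


def build_page_url(page, seen_ids):
    """Build a list-page URL whose seen_snote_ids[] reflect the ids seen so far."""
    params = [("seen_snote_ids[]", str(note_id)) for note_id in seen_ids]
    params.append(("page", str(page)))
    # urlencode percent-encodes the brackets, giving seen_snote_ids%5B%5D=...
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    seen = [84663118, 84266647]
    print(build_page_url(2, seen))
```

Each new page would then be requested with the ids gathered from all earlier pages appended to `seen`.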

Python Full Series/Stage 14: Python Crawler Development/Crawler Anti-Anti-Scraping Post #801

Code:

from scrapy.cmdline import execute
execute('scrapy crawl baidu'.split())

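Teacher, why does running this execute code fail immediately with an error? Also, in the baidu.py file I created, the self, response part of def parse(self,response) is flagged as a problem. What is causing this?

Python Full Series/Stage 14: Python Crawler Development/Mobile Crawler Development Post #802

Code: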

import requests
from fake_useragent import UserAgent
from lxml import etree
from time import sleep
url="https://www.lagou.com/zhaopin/2/?filterOption=3&sid=021dfdfa56f24b11810b1f18e04902eb"
headers={"User-Agent":UserAgent().chrome}
sleep(2)
resp=requests.get(url,headers=headers)
resp.encoding="utf-8"
html=resp.text
print(html)
e=etree.HTML(html)
a=e.xpath('//div[@class="p_top"]//h3/text()')
for i in a:
    print(i)

Output:

body {

margin: 0;

width: 100%;

height: 100%;

}

@keyframes loadingAnimation {

0% {

transform: translate3d(0, 0, 0);

}

50% {

transform: translate3d(0, -10px, 0);

}

.loading-info {

text-align: center;

height: 100%;

position: relative;

background: #fff;

top: 50%;

margin-top: -37px;

}

.loading-info .animation-word {

width: 100%;

}

.loading-info .animation-word p {

margin-top: 10px;

color: #9fa3b0;

}

.animation-word .component-l,

.animation-word .component-a,

.animation-word .component-g,

.animation-word .component-o,

.animation-word .component-u {

display: inline-block;

width: 40px;

height: 42px;

line-height: 42px;

font-family: Helvetica Neue, Helvetica, Arial, Hiragino Sans GB, Hiragino Sans GB W3, Microsoft YaHei UI, Microsoft YaHei, WenQuanYi Micro Hei, sans-serif;

font-weight: bolder;

font-size: 40px;

color: #eceef2;

vertical-align: top;

-webkit-animation-fill-mode: both;

animation-fill-mode: both;

-webkit-animation: loadingAnimation 0.6s infinite linear alternate;

-moz-animation: loadingAnimation 0.6s infinite linear alternate;

animation: loadingAnimation 0.6s infinite linear alternate;

}

.animation-word .component-a {

-webkit-animation-delay: 0.1s;

-moz-animation-delay: 0.1s;

animation-delay: 0.1s;

}

.animation-word .component-g {

-webkit-animation-delay: 0.2s;

-moz-animation-delay: 0.2s;

animation-delay: 0.2s;

}

.animation-word .component-o {

-webkit-animation-delay: 0.3s;

-moz-animation-delay: 0.3s;

animation-delay: 0.3s;

}

.animation-word .component-u {

-webkit-animation-delay: 0.4s;

-moz-animation-delay: 0.4s;

animation-delay: 0.4s;

}

</style>

</head>

<body>

<p class="gray">正在加载中...</p>

</div>

var securityPageName = "securityCheck";

!function () {

function e(c) {

var l, m, n, o, p, q, r, e = function () {

var a = location.hostname;

return "localhost" === a || /^(\d+\.){3}\d+$/.test(a) ? a : "." + a.split(".").slice(-2).join(".")

}(),

f = function (a, b) {

var f = document.createElement("script");

f.setAttribute("type", "text/javascript"), f.setAttribute("charset", "UTF-8"), f.onload = f.onreadystatechange = function () {

d && "loaded" != this.readyState && "complete" != this.readyState || b()

}, f.setAttribute("src", a), "IFRAME" != c.tagName ? c.appendChild(f) : c.contentDocument ? c.contentDocument.body ? c.contentDocument.body.appendChild(f) : c.contentDocument.documentElement.appendChild(f) : c.document && (c.document.body ? c.document.body.appendChild(f) : c.document.documentElement.appendChild(f))

g = function (a) {

var b = new RegExp("(^|&)" + a + "=([^&]*)(&|$)"),

c = window.location.search.substr(1).match(b);

return null != c ? unescape(c[2]) : null

h = {

get: function (a) {

var b, c = new RegExp("(^| )" + a + "=([^;]*)(;|$)");

return (b = document.cookie.match(c)) ? unescape(b[2]) : null

set: function (a, b, c, d, e) {

var g, f = a + "=" + encodeURIComponent(b);

c && (g = new Date(c).toGMTString(), f += ";expires=" + g), f = d ? f + ";domain=" + d : f, f = e ? f + ";path=" + e : f, document.cookie = f

}

i = function (a) {

if (a) {

window.location.replace(a)

}

j = function (a, c) {

c || a.indexOf("security-check.html") > -1 ? i(c) : i(a);

};

window.location.href, l = g("seed") || "", m = g("ts"), n = g("name"),

o = g("callbackUrl"),

p = g("srcReferer") || "", "null" !== n && l && n && o, l && m && n && (f("dist/" + n + ".js", function () {

var n, a = (new Date).getTime() + 1728e5,

d = "",

f = {},

g = window.gt || c.contentWindow.gt;

try {

d = (new g()).a();

} catch (k) { console.log(k) }

if (d) {

(h.set("__lg_stoken__", d, a, e, "/"), j(p, o))

}

}))

}

function j(a) {

if (!f && !g && document.addEventListener) return document.addEventListener("DOMContentLoaded", a, !1);

if (!(h.push(a) > 1))

if (f) ! function () {

try {

document.documentElement.doScroll("left"), i()

} catch (a) {

setTimeout(arguments.callee, 0)

}

}();

else if (g) var b = setInterval(function () {

/^(loaded|complete)$/.test(document.readyState) && (clearInterval(b), i())

}, 0)

}

var d, f, g, h, i, a = 0,

b = (new Date).getTime(),

c = window.navigator.userAgent;

c.indexOf("MSIE ") > -1 && (d = !0),

f = !(!window.attachEvent || window.opera),

g = /webkit\/(\d+)/i.test(navigator.userAgent) && RegExp.$1 < 525,

h = [],

i = function () {

for (var a = 0; a < h.length; a++) h[a]()

};

j(function () {

var b, a = window.navigator.userAgent.toLowerCase();

return "micromessenger" !== a.match(/micromessenger/i) || "wkwebview" == a.match(/wkwebview/i) ?

(e(document.getElementsByTagName("head").item(0)), void 0)

(b = document.createElement("iframe"), b.style.height = 0, b.style.width = 0, b.style.margin = 0, b.style.padding = 0, b.style.border = "0 none", b.name = "tokenframe", b.src = "about:blank", b.attachEvent ? b.attachEvent("onload", function () {

e(b)

}) : b.onload = function () {

e(b)

}, (document.body || document.documentElement).appendChild(b), void 0)

})

}();

</script>

</html>

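Teacher, when I scrape Lagou with requests.get, why does the returned HTML differ from the source I see with F12? The problem does not occur when I use the selenium library. Is selenium required for scraping Lagou? If I want to use requests.get, is there any way to fix the mismatch between the returned HTML and the page source?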
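The output above is not the job listing. It is a loading page whose script computes a token, stores it in a cookie named `__lg_stoken__`, and then redirects. requests does not execute JavaScript, so it stops at this page, while a real browser driven by selenium runs the script and continues. The sketch below is a minimal way to recognise that page before trying to parse it. The marker strings are taken from the output above, and the function name is hypothetical.

```python
# Strings that appear in the interstitial page shown in the output above.
SECURITY_MARKERS = ("securityCheck", "__lg_stoken__")


def looks_like_security_page(html):
    """Return True if the HTML looks like the JavaScript check page."""
    return any(marker in html for marker in SECURITY_MARKERS)


if __name__ == "__main__":
    sample = 'var securityPageName = "securityCheck";'
    print(looks_like_security_page(sample))
```

Calling this on `resp.text` before the XPath step makes it clear when the empty result comes from the check page rather than from a wrong selector.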

Python Full Series/Stage 14: Python Crawler Development/Crawler Anti-Anti-Scraping Post #803

Code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
options=webdriver.ChromeOptions()
options.add_argument('--headless')
chrome=webdriver.Chrome(options=options)
url="https://www.lagou.com/zhaopin/Python/?filterOption=3&sid=9146e755a7bf4f6dac59edd3e7d43127"
sleep(2)
chrome.get(url)
html=chrome.page_source
print(html)
while True:
    job = chrome.find_elements(By.XPATH, '//div[@class="p_top"]//h3')
    point = chrome.find_elements(By.XPATH, '//span[@class="add"]//em')
    salary = chrome.find_elements(By.XPATH, '//div[@class="li_b_l"]//span[@class="money"]')
    company = chrome.find_elements(By.XPATH, '//div[@class="company_name"]/a')
    industry = chrome.find_elements(By.XPATH, '//div[@class="industry"]')
    n = len(job)
    with open(r"job5.txt", "a", encoding="utf-8") as f:
        for i in range(n):
            f.write((job[i].text).strip())
            f.write("\t")
            f.write((point[i].text).strip())
            f.write("\t")
            f.write((salary[i].text).strip())
            f.write("\t")
            f.write((company[i].text).strip())
            f.write("\t")
            f.write((industry[i].text).strip())
            f.write("\n")
    if html.find('page_no') != -1:
        chrome.find_element_by_class_name('page_no').click()
        sleep(3)
    else:
        break
chrome.quit()

Output:

高级爬虫工程师（）  杭州滨江区 丁香园    医疗丨健康企业服务 轮及以上 人
开发工程师    上海长宁区 信飞数科信用飞首付游   科技金融 轮 人
开发工程师    上海徐汇区 问卷网众言科技   工具类产品 轮 人
开发   上海徐汇区 问卷网众言科技   工具类产品 轮 人
开发工程师    北京知春路 天眼查    数据服务｜咨询 不需要融资 人
开发工程师    东莞东莞市东莞美信   汽车丨出行 上市公司 人以上
开发工程师    杭州余杭区 和骏出行   移动互联网 天使轮 人
开发工程师    上海徐汇区 上海江煦信息科技   软件服务｜咨询 不需要融资 人
开发工程师    杭州西湖区 涂鸦智能   物联网智能硬件 轮 人以上
高级开发工程师  上海长宁区 信飞数科信用飞首付游   科技金融 轮 人
开发工程师    上海张江  仑动科技   企业服务人工智能 未融资 少于人
开发工程师    广州白云区 东莞美信   汽车丨出行 上市公司 人以上
开发工程师    深圳前海  易博天下   软件服务｜咨询 上市公司 人
开发工程师    广州海珠区 派客朴食   移动互联网电商 不需要融资 人
上海浦东新北纬三十度  电商平台 未融资 人
高级爬虫工程师（） 杭州滨江区 丁香园    医疗丨健康企业服务 轮及以上 人
开发工程师    上海长宁区 信飞数科信用飞首付游   科技金融 轮 人
开发工程师    上海徐汇区 问卷网众言科技   工具类产品 轮 人
开发   上海徐汇区 问卷网众言科技   工具类产品 轮 人
开发工程师    北京知春路 天眼查    数据服务｜咨询 不需要融资 人
开发工程师    东莞东莞市东莞美信   汽车丨出行 上市公司 人以上
开发工程师    杭州余杭区 和骏出行   移动互联网 天使轮 人
开发工程师    上海徐汇区 上海江煦信息科技   软件服务｜咨询 不需要融资 人
开发工程师    杭州西湖区 涂鸦智能   物联网智能硬件 轮 人以上
高级开发工程师  上海长宁区 信飞数科信用飞首付游   科技金融 轮 人
开发工程师    上海张江  仑动科技   企业服务人工智能 未融资 少于人
开发工程师    广州白云区 东莞美信   汽车丨出行 上市公司 人以上
开发工程师    深圳前海  易博天下   软件服务｜咨询 上市公司 人
开发工程师    广州海珠区 派客朴食   移动互联网电商 不需要融资 人
上海浦东新北纬三十度  电商平台 未融资 人

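Question:

Teacher, I am using selenium to scrape Python job listings from Lagou. The first page runs from "Senior Python Crawler Engineer (Insight) 15k-25k" to "python, Shanghai Pudong New Area, 15k-30k". After scraping one page I use the click method in chrome to move to the next page and keep scraping, all inside a while True loop. But after clicking, I still get the same 15 rows from the first page, again and again. Could you help me find what is wrong in my code?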
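Two things in the code are certain: `html` is read once before the loop, so `html.find('page_no')` never changes, and a class-name lookup returns the first matching element only. If several pager buttons share the class `page_no` (an assumption about the page markup), the click may land on the first one every time instead of on "next". As far as I know, newer Selenium releases also drop the `find_element_by_*` helpers in favour of `find_element(By...)`. The sketch below separates the loop logic from the browser calls so it can be tested without a browser. `read_rows` and `go_to_next_page` are hypothetical callables that would wrap the Selenium calls.

```python
def scrape_all_pages(read_rows, go_to_next_page, max_pages=50):
    """Collect rows page by page.

    read_rows():        returns the rows of the page currently displayed.
    go_to_next_page():  moves to the next page; returns False when there is none.
    """
    all_rows = []
    previous = None
    for _ in range(max_pages):
        rows = read_rows()
        if rows == previous:
            # The page did not change after the click: stop instead of
            # writing the same rows again.
            break
        all_rows.extend(rows)
        previous = rows
        if not go_to_next_page():
            break
    return all_rows


if __name__ == "__main__":
    pages = [["job A", "job B"], ["job C"]]
    position = {"index": 0}

    def read_rows():
        return pages[position["index"]]

    def go_to_next_page():
        if position["index"] + 1 < len(pages):
            position["index"] += 1
            return True
        return False

    print(scrape_all_pages(read_rows, go_to_next_page))
```

With Selenium, `read_rows` would call `chrome.find_elements(By.XPATH, ...)` after each navigation, and `go_to_next_page` would locate the actual "next" control and return False once it is absent or disabled.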

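Python Full Series/Stage 14: Python Crawler Development/Crawler Anti-Anti-Scraping Post #804

Teacher, the response shows "百度安全验证" (Baidu security verification). Does that mean Baidu has detected that this is a crawler?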

from urllib.request import Request,urlopen
from urllib.parse import quote
arg = '尚学堂'
# print(quote(arg))
# url = 'https://www.baidu.com/s?ie=UTF-8&wd=%E5%B0%9A%E5%AD%A6%E5%A0%82 '
url = 'https://www.baidu.com/s?wd={}'.format(quote(arg))
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

request = Request(url,headers=headers)

resp = urlopen(request)

print(resp.read().decode())
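A verification page in place of search results usually does mean the request was treated as automated, although the exact trigger is not visible from the client side. The sketch below is a minimal helper under that assumption: it builds the same search URL with `urlencode` and checks the body for the marker text from the question. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Title text of the verification page mentioned in the question.
VERIFY_MARKER = "百度安全验证"


def build_search_url(keyword):
    """Build the search URL; urlencode percent-encodes the keyword as UTF-8."""
    return "https://www.baidu.com/s?" + urlencode({"wd": keyword})


def is_verification_page(html):
    """Return True if the response body is the verification page."""
    return VERIFY_MARKER in html


if __name__ == "__main__":
    print(build_search_url("尚学堂"))
```

A crawler can call `is_verification_page` on each decoded response and pause or stop instead of parsing a page that contains no results.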

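Python Full Series/Stage 14: Python Crawler Development/Crawler Basics (Old) Post #805

Teacher, whether I pass -s FEED_EXPORT_ENCODING='gb18030' when saving the CSV file or add FEED_EXPORT_ENCODING='gb18030' to settings.py, neither helps and the text is still garbled. Switching to utf-8 also does not make the JSON file show Chinese characters.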
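For background: when `FEED_EXPORT_ENCODING` is unset, Scrapy's JSON export writes non-ASCII characters as `\uXXXX` escapes, and setting it to `utf-8` writes them as readable characters. Garbled CSV text is often a viewer issue, since a spreadsheet program may guess the wrong encoding. That is an assumption about this case. The sketch below uses only the standard library to show the two effects. It is not Scrapy code, and `csv_bytes` is a hypothetical helper.

```python
import csv
import io
import json

row = {"title": "尚学堂"}

# Default json.dumps escapes non-ASCII characters as \uXXXX sequences.
escaped = json.dumps(row)
# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(row, ensure_ascii=False)


def csv_bytes(rows, encoding):
    """Serialise rows to CSV and encode the text with the given encoding."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode(encoding)


if __name__ == "__main__":
    print(escaped)
    print(readable)
    # utf-8-sig prepends a byte order mark that helps some programs
    # recognise the file as UTF-8.
    print(csv_bytes([["尚学堂"]], "utf-8-sig")[:3])
```

If the setting appears to have no effect at all, it is worth confirming that the file is produced by the feed export (`-o`) and not by a custom pipeline that opens the file with its own encoding.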

Python Full Series/Stage 14: Python Crawler Development/Mobile Crawler Development Post #806

Code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
from lxml import etree
url="https://search.jd.com/Search?keyword=%E7%AC%94%E8%AE%B0%E6%9C%AC&enc=utf-8&suggest=1.his.0.0&wq=&pvid=40937e98d36f4436bba78c1a81d0a967"
options=webdriver.ChromeOptions()
options.add_argument('headless')
chrome=webdriver.Chrome(options=options)
chrome.get(url)
js='document.documentElement.scrollTop=100000'
chrome.execute_script(js)
sleep(2)
html=chrome.page_source
e=etree.HTML(html)
name=e.xpath('//div[@id="J_goodsList"]/ul[@class="gl-warp clearfix"]/li/div[@class="gl-i-wrap"]/div[@class="p-name p-name-type-2"]/a/em/text()')
price=e.xpath('//div[@id="J_goodsList"]//div[@class="p-price"]/strong/i/text()')
for names,prices in zip(name,price):
    print(names,":",prices)
print(len(name))
chrome.quit()

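Question:

Teacher, the XPath I wrote on the JD site shows 60 results in the browser. I copied it into Python without any changes and printed the length of name, the list of computer names, but the result became 120. Why?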
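One common reason for a doubled count, stated here as an assumption about this page: an XPath ending in `/text()` returns text nodes, not elements. If each `em` contains a child tag (for example, one that highlights the search keyword), the element has more than one text node. The sketch below illustrates the difference with the standard library's ElementTree instead of lxml, on made-up markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the first product name has the keyword wrapped in a
# child tag, so that <em> holds two separate pieces of direct text.
SNIPPET = """
<ul>
  <li><em>ThinkPad <font>笔记本</font> 14</em></li>
  <li><em>MacBook Air</em></li>
</ul>
"""


def direct_text_nodes(element):
    """Return the element's own text pieces, roughly what em/text() selects."""
    pieces = [element.text] + [child.tail for child in element]
    return [p for p in pieces if p and p.strip()]


def full_text(element):
    """Return all text inside the element, including text of child tags."""
    return "".join(element.itertext()).strip()


root = ET.fromstring(SNIPPET.strip())
ems = root.findall(".//em")
text_nodes = [t for em in ems for t in direct_text_nodes(em)]
names = [full_text(em) for em in ems]

if __name__ == "__main__":
    print(len(ems), "elements,", len(text_nodes), "direct text nodes")
    print(names)
```

With lxml the same idea would be to select the `em` elements (drop the trailing `/text()`) and join `em.itertext()` for each one, which gives one name per product.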

Python Full Series/Stage 14: Python Crawler Development/Crawler Anti-Anti-Scraping Post #807

Code:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from time import sleep
def get_url(url):
    proxies={"http":"http://61.135.155.82:443"}
    headers={"User-Agent":UserAgent().random}
    # sleep(2)
    resp=requests.get(url,headers=headers,proxies=proxies,timeout=5)
    resp.encoding="utf-8"
    if resp.status_code==200:
        return resp.text
    else:
        return None
def parse_list(html):
    soup=BeautifulSoup(html,'lxml')
    movie_list=["http://maoyan.com{}".format(a.get('href')) for a in soup.select('dl[class="movie-list"] dd>div[class="movie-item film-channel"]>a')]
    return movie_list
def parse_index(html):
    soup=BeautifulSoup(html,'lxml')
    title=soup.select('div[class="movie-brief-container"]>h1')
    # type=soup.select('div[class="movie-brief-container"]>ul>li>a')
    print(title[0].text)
    # print(type[0].text)
def main():
    url="https://maoyan.com/films?showType=3&offset=0"
    html=get_url(url)
    movie_list=parse_list(html)
    for url in movie_list:
        # print(url)
        html=get_url(url)
        parse_index(html)
if __name__=="__main__":
    main()

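Output:

Teacher, why does my program fail with a timeout error? I read online that I should set a timeout, so I set timeout=5, but the error is still raised. What is the reason? Please take a look.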
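Setting `timeout=5` does not prevent the error. It tells requests to raise a timeout exception if no response arrives within 5 seconds. Also, the `proxies` dict only has an `"http"` key, so it applies to the `http://maoyan.com...` detail links built in `parse_list`. If that proxy is unreachable (an assumption), every one of those requests would time out. The sketch below is a minimal retry wrapper using only the standard library. `fetch_with_retry` and its parameters are hypothetical names.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.0, errors=(Exception,)):
    """Call fetch(url); on one of `errors`, wait and try again.

    Returns the result of fetch, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except errors as exc:
            print(f"attempt {attempt} for {url} failed: {exc!r}")
            if attempt < retries:
                time.sleep(delay)
    return None


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_fetch(url):
        # Stand-in for a real download: fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise TimeoutError("simulated timeout")
        return "<html>ok</html>"

    print(fetch_with_retry(flaky_fetch, "http://example.com"))
```

With requests, `fetch` would wrap `requests.get(..., timeout=5)` and `errors` would be `(requests.exceptions.RequestException,)`. The caller still has to skip a page when the result is None, since passing None to `BeautifulSoup` would fail.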

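Python Full Series/Stage 14: Python Crawler Development/Crawler Anti-Anti-Scraping Post #808

Teacher, because this runs concurrently, how can I limit the crawl to only 20 chapters? I tried a global variable myself, but it did not seem to work.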
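The question does not say which kind of concurrency is in use, so the sketch below assumes plain threads. A bare global counter can be read and updated by several workers at once, so the check and the increment need to happen together under a lock. `Budget` and `crawl_chapters` are hypothetical names, and the download step is left as a comment.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Budget:
    """A thread-safe counter that hands out at most `limit` slots."""

    def __init__(self, limit):
        self._limit = limit
        self._used = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Check and increment under one lock so two threads cannot both
        # take the last slot.
        with self._lock:
            if self._used >= self._limit:
                return False
            self._used += 1
            return True


def crawl_chapters(chapter_urls, limit=20, workers=8):
    """Process at most `limit` chapters, however many workers run."""
    budget = Budget(limit)
    done = []
    done_lock = threading.Lock()

    def work(url):
        if not budget.try_acquire():
            return
        # A real crawler would download and parse the chapter here.
        with done_lock:
            done.append(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, chapter_urls))
    return done


if __name__ == "__main__":
    urls = [f"chapter-{i}" for i in range(100)]
    print(len(crawl_chapters(urls)))
```

If the chapter list is known up front, slicing it to the first 20 entries before submitting is simpler. If the crawler is a Scrapy spider, the `CLOSESPIDER_ITEMCOUNT` setting is another option, though it stops the spider only approximately at the limit because requests already in flight still finish.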

Python Full Series/Stage 14: Python Crawler Development/Mobile Crawler Development Post #809

Code:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
def get_url(url):
    proxies={"http":"http://61.135.155.82:443"}
    headers={"User-Agent":UserAgent().random}
    resp=requests.get(url,headers=headers,proxies=proxies)
    resp.encoding="utf-8"
    if resp.status_code==200:
        return resp.text
    else:
        return None
def parse_list(html):
    soup=BeautifulSoup(html,'lxml')
    movie_list=soup.select('div[class="movies-list"]>dl[class="movie-list"]>dd>div[class="movie-item film-channel"]>a')
    return movie_list
def parse_index(html):
    pass
def main():
    url="https://maoyan.com/films?showType=3&offset=0"
    html=get_url(url)
    movie_list=parse_list(html)
    print(movie_list[0].get('href'))
    # for url in movie_list:
    #     movie_detail=parse_index
    #     print(movie_detail)
if __name__=="__main__":
    main()

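Output:

Hello teacher, this is the result of scraping Maoyan movies with BeautifulSoup. Why did I get the URL of only one movie? Could it be that my crawler request was blocked by an anti-crawler mechanism?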
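Before suspecting anti-crawler measures, note that `main()` prints `movie_list[0].get('href')`, which is only the first element of the list, so a single URL is the expected output even when the selector matched many links. The sketch below shows the difference on made-up HTML. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the hrefs are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in the order they appear."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_hrefs(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.hrefs


if __name__ == "__main__":
    sample = '<dd><a href="/films/1">A</a></dd><dd><a href="/films/2">B</a></dd>'
    links = collect_hrefs(sample)
    print(links[0])      # only the first link
    for link in links:   # every link
        print(link)
```

In the original code the equivalent change is to loop over `movie_list` and print `a.get('href')` for each entry. Printing `len(movie_list)` first shows whether the selector matched one link or many.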

Python Full Series/Stage 14: Python Crawler Development/Crawler Anti-Anti-Scraping Post #810

Hello,