<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Scraping - 她和她的猫</title>
    <link>https://her-cat.com/tags/%E9%87%87%E9%9B%86/</link>
    <description>Posts tagged Scraping - 她和她的猫</description>
    <image>
      <title>她和她的猫</title>
      <url>https://her-cat.com/assets/favorite.jpeg</url>
      <link>https://her-cat.com/assets/favorite.jpeg</link>
    </image>
    <generator>Hugo -- 0.148.1</generator>
    <language>zh</language>
    <lastBuildDate>Sun, 08 Jun 2025 11:58:52 +0800</lastBuildDate>
    <atom:link href="https://her-cat.com/tags/%E9%87%87%E9%9B%86/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Scraping a Zhulang Novel's Chapter List with PHP</title>
      <link>https://her-cat.com/posts/2017/08/02/php-collect-zhuilang-novel-chapters/</link>
      <pubDate>Wed, 02 Aug 2017 21:59:25 +0800</pubDate>
      <guid>https://her-cat.com/posts/2017/08/02/php-collect-zhuilang-novel-chapters/</guid>
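      <description>It has been a year since I last wrote a scraping tutorial, and I am already almost a month into my internship. Time really does fly....</description>
      <content:encoded><![CDATA[<p>It has been a year since I last wrote a scraping tutorial, and I am already almost a month into my internship. Time really does fly&hellip;</p>
<p>For best results, read <a href="https://her-cat.com/posts/2016/02/28/php-collect-page/">Scraping Web Pages with PHP's file_get_contents() Function</a> first.</p>
<h2 id="第一步获取页面的html">Step 1: Fetch the page's HTML</h2>
<p>First, fetch the HTML of the page. Open the chapter index of any novel, for example <a href="http://book.zhulang.com/427458/">http://book.zhulang.com/427458/</a>. Either curl or the file_get_contents() function works; since there is no need to simulate request headers here, I simply use the latter.</p>
<p>After fetching the HTML, strip the line breaks and spaces so the markup is compact and easier to match. Removing every space also joins tag names to their attributes, which is why the patterns in the next steps look for &lt;divclass= rather than &lt;div class=.</p>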
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-php" data-lang="php"><span class="line"><span class="cl"><span class="nv">$content</span> <span class="o">=</span> <span class="nx">file_get_contents</span><span class="p">(</span><span class="s1">&#39;http://book.zhulang.com/427458/&#39;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="c1">// Strip CR, LF and spaces in one call; a separate \r\n pass is redundant once \r and \n are removed</span>
</span></span><span class="line"><span class="cl"><span class="nv">$content</span> <span class="o">=</span> <span class="nx">str_replace</span><span class="p">([</span><span class="s2">&#34;</span><span class="se">\r</span><span class="s2">&#34;</span><span class="p">,</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">,</span> <span class="s2">&#34; &#34;</span><span class="p">],</span> <span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="nv">$content</span><span class="p">);</span>
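</span></span></code></pre></div><h2 id="第二步分析页面">Step 2: Analyze the page</h2>
<p>With the HTML in hand, use preg_match_all() to pull out the HTML block that holds the chapter list. This narrows the range the later pattern has to search and reduces the chance of picking up wrong data.</p>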
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-php" data-lang="php"><span class="line"><span class="cl"><span class="nx">preg_match_all</span><span class="p">(</span><span class="s1">&#39;/&lt;divclass=\&#34;chapter-list\&#34;&gt;(.*)&lt;\/div&gt;/&#39;</span><span class="p">,</span> <span class="nv">$content</span><span class="p">,</span> <span class="nv">$chapterHtml</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nv">$chapterHtml</span> <span class="o">=</span> <span class="nv">$chapterHtml</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">];</span>
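</span></span></code></pre></div><p>preg_match_all() takes the regular expression as its first argument, the string to search as its second, and the array that receives the matches as its third. $chapterHtml[0] holds the text matched by the whole pattern, $chapterHtml[1] holds the text matched by the first parenthesized group (here <strong>(.*)</strong>, the inner HTML of the chapter list), $chapterHtml[2] would hold the second group, and so on. Each of these entries is itself an array with one element per match, which is why the code reads $chapterHtml[1][0].</p>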
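<p>As a rough illustration of this numbering, the sketch below uses Python's re module rather than PHP, because most regex engines expose the whole match and each parenthesized group in the same order. The sample string and the variable names are made up for illustration and are not the real page source; group 0 plays the role of $chapterHtml[0] and group 1 the role of $chapterHtml[1].</p>

```python
import re

# Hypothetical fragment, already stripped of spaces and line breaks
# as in step 1 (this is NOT the real page source).
sample = '<body><divclass="chapter-list"><ul><li>ch1</li><li>ch2</li></ul></div></body>'

match = re.search(r'<divclass="chapter-list">(.*)</div>', sample)

whole_match = match.group(0)   # text matched by the entire pattern
chapter_html = match.group(1)  # text matched by the first (...) group

print(whole_match)
print(chapter_html)
```

<p>Only the first group is needed afterwards, which mirrors how the PHP code keeps $chapterHtml[1][0] and discards the rest.</p>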
<h2 id="第三步提取数据">Step 3: Extract the data</h2>
<p>This step starts by looking at the format of a chapter URL. For example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">http://book.zhulang.com/427458/152725.html
</span></span></code></pre></div><p>Format:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">http://book.zhulang.com/{digits}/{digits}.html
</span></span></code></pre></div><p>Then write the regular expression:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">/(http:\/\/book\.zhulang\.com\/\d+\/\d+\.html)/
</span></span></code></pre></div><p>That captures the chapter URL. Next, write the expression that matches the title:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">/\&#34;title=\&#34;[\W|\d]+\&#34;&gt;(.*?)&lt;\/a&gt;/
</span></span></code></pre></div><p>Finally, combine the two and print the result:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-php" data-lang="php"><span class="line"><span class="cl"><span class="nx">preg_match_all</span><span class="p">(</span><span class="s1">&#39;/(http:\/\/book\.zhulang\.com\/\d+\/\d+\.html)\&#34;title=\&#34;[\W|\d]+\&#34;&gt;(.*?)&lt;\/a&gt;/&#39;</span><span class="p">,</span> <span class="nv">$chapterHtml</span><span class="p">,</span> <span class="nv">$result</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nx">var_dump</span><span class="p">(</span><span class="nv">$result</span><span class="p">);</span>
</span></span></code></pre></div><p>This produces the following result:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-php" data-lang="php"><span class="line"><span class="cl"><span class="k">array</span> <span class="p">(</span><span class="nx">size</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="mi">0</span> <span class="o">=&gt;</span> 
</span></span><span class="line"><span class="cl">    <span class="k">array</span> <span class="p">(</span><span class="nx">size</span><span class="o">=</span><span class="mi">72</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">0</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;http://book.zhulang.com/427458/46802.html&#34;title=&#34;第一章捕鱼2017-06-3014:47&#34;&gt;第一章捕鱼&lt;/a&gt;&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">1</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;http://book.zhulang.com/427458/46803.html&#34;title=&#34;第二章神女梦2017-06-3014:48&#34;&gt;第二章神女梦&lt;/a&gt;&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">106</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">2</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;http://book.zhulang.com/427458/46805.html&#34;title=&#34;第三章绿液2017-06-3014:48&#34;&gt;第三章绿液&lt;/a&gt;&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">3</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;http://book.zhulang.com/427458/46807.html&#34;title=&#34;第四章卖参2017-06-3014:49&#34;&gt;第四章卖参&lt;/a&gt;&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="o">......</span>
</span></span><span class="line"><span class="cl">  <span class="mi">1</span> <span class="o">=&gt;</span> 
</span></span><span class="line"><span class="cl">    <span class="k">array</span> <span class="p">(</span><span class="nx">size</span><span class="o">=</span><span class="mi">72</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">0</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;46802&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">1</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;46803&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">2</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;46805&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">3</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;46807&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">       <span class="o">......</span>
</span></span><span class="line"><span class="cl">  <span class="mi">2</span> <span class="o">=&gt;</span> 
</span></span><span class="line"><span class="cl">    <span class="k">array</span> <span class="p">(</span><span class="nx">size</span><span class="o">=</span><span class="mi">72</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">0</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;第一章捕鱼&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">1</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;第二章神女梦&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">18</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">2</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;第三章绿液&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="mi">3</span> <span class="o">=&gt;</span> <span class="nx">string</span> <span class="s1">&#39;第四章卖参&#39;</span> <span class="p">(</span><span class="nx">length</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="o">......</span>
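</span></span></code></pre></div><p>As before, $result[0] holds the text matched by the whole pattern, $result[1] is the array of chapter URLs matched by the first group (<strong>(http:\/\/book\.zhulang\.com\/\d+\/\d+\.html)</strong>), and $result[2] is the array of chapter titles matched by the second group (<strong>(.*?)</strong>). The complete code:</p>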
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-php" data-lang="php"><span class="line"><span class="cl"><span class="nv">$content</span> <span class="o">=</span> <span class="nx">file_get_contents</span><span class="p">(</span><span class="s1">&#39;http://book.zhulang.com/427458/&#39;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="c1">// Strip CR, LF and spaces in one call; a separate \r\n pass is redundant once \r and \n are removed</span>
</span></span><span class="line"><span class="cl"><span class="nv">$content</span> <span class="o">=</span> <span class="nx">str_replace</span><span class="p">([</span><span class="s2">&#34;</span><span class="se">\r</span><span class="s2">&#34;</span><span class="p">,</span> <span class="s2">&#34;</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">,</span> <span class="s2">&#34; &#34;</span><span class="p">],</span> <span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="nv">$content</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nx">preg_match_all</span><span class="p">(</span><span class="s1">&#39;/&lt;divclass=\&#34;chapter-list\&#34;&gt;(.*)&lt;\/div&gt;/&#39;</span><span class="p">,</span> <span class="nv">$content</span><span class="p">,</span> <span class="nv">$chapterHtml</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nv">$chapterHtml</span> <span class="o">=</span> <span class="nv">$chapterHtml</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="nx">preg_match_all</span><span class="p">(</span><span class="s1">&#39;/(http:\/\/book\.zhulang\.com\/\d+\/\d+\.html)\&#34;title=\&#34;[\W|\d]+\&#34;&gt;(.*?)&lt;\/a&gt;/&#39;</span><span class="p">,</span> <span class="nv">$chapterHtml</span><span class="p">,</span> <span class="nv">$result</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="k">for</span> <span class="p">(</span><span class="nv">$i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nv">$i</span> <span class="o">&lt;</span> <span class="nx">count</span><span class="p">(</span><span class="nv">$result</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span> <span class="nv">$i</span><span class="o">++</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">echo</span> <span class="s1">&#39;Title: &#39;</span> <span class="o">.</span> <span class="nv">$result</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="nv">$i</span><span class="p">]</span> <span class="o">.</span> <span class="s1">&#39; ---- URL: &#39;</span> <span class="o">.</span> <span class="nv">$result</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="nv">$i</span><span class="p">]</span> <span class="o">.</span> <span class="s1">&#39;&lt;br&gt;&#39;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
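</span></span></code></pre></div><h3 id="写在最后">Closing notes</h3>
<p>This post only covers how to get chapter URLs and titles. Scraping the novel list or the chapter text works in much the same way: study the div structure of the page first, then write the regular expressions to match it.</p>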
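<p>For readers who want to try the three steps without hitting the site, here is a minimal Python sketch of the same flow. It runs on a made-up HTML sample that only imitates the structure described in this post, so SAMPLE_HTML and the helper name extract_chapters are assumptions for illustration and the real page markup may differ. The patterns are the ones from the PHP code; re.ASCII is added because Python's \W is Unicode-aware by default, whereas the PHP pattern treats the bytes of Chinese characters as non-word characters.</p>

```python
import re

# Made-up sample that imitates the structure described in the post;
# the real page markup may differ.
SAMPLE_HTML = """
<div class="chapter-list">
  <ul>
    <li><a href="http://book.zhulang.com/427458/46802.html" title="第一章 捕鱼 2017-06-30 14:47">第一章 捕鱼</a></li>
    <li><a href="http://book.zhulang.com/427458/46803.html" title="第二章 神女梦 2017-06-30 14:48">第二章 神女梦</a></li>
  </ul>
</div>
"""


def extract_chapters(html):
    """Return (title, url) pairs using the same three steps as the post."""
    # Step 1: drop line breaks and spaces.
    for ch in ("\r", "\n", " "):
        html = html.replace(ch, "")
    # Step 2: isolate the chapter-list block.
    block = re.search(r'<divclass="chapter-list">(.*)</div>', html)
    if block is None:
        return []
    # Step 3: capture URL and title. re.ASCII makes \W match non-ASCII
    # characters, mirroring the byte-oriented PHP pattern.
    pattern = r'(http://book\.zhulang\.com/\d+/\d+\.html)"title="[\W|\d]+">(.*?)</a>'
    found = re.findall(pattern, block.group(1), re.ASCII)
    return [(title, url) for url, title in found]


chapters = extract_chapters(SAMPLE_HTML)
for title, url in chapters:
    print(title, url)
```

<p>Because step 1 removes every space, the titles come back without spaces as well, just as in the var_dump() output above.</p>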
<blockquote>
<p>This post is quite old; the information in it may have since evolved or changed.</p></blockquote>
]]></content:encoded>
    </item>
    <item>
      <title>Scraping Web Pages with BeautifulSoup in Python</title>
      <link>https://her-cat.com/posts/2017/01/15/python-beautiful-soup-collect-page/</link>
      <pubDate>Sun, 15 Jan 2017 14:58:20 +0800</pubDate>
      <guid>https://her-cat.com/posts/2017/01/15/python-beautiful-soup-collect-page/</guid>
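      <description>Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and hands the user the data to be scraped, and because</description>
      <content:encoded><![CDATA[<p>I just accidentally deleted my previous two Python posts, and worse, I had no backup! I was beyond frustrated&hellip; The lesson here: back up! Back up! Back up! Important things are worth saying three times!</p>
<h2 id="关于-beautifulsoup">About BeautifulSoup</h2>
<p>Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and hands the user the data to be scraped, and because it is simple, a complete application takes very little code.</p>
<h2 id="安装-beautifulsoup">Installing BeautifulSoup</h2>
<p>After downloading the BeautifulSoup package, extract it to a directory (D:\tools in this example). The extracted installer directory is then D:\tools\beautifulsoup4-4.5.3.</p>
<p>In a cmd prompt, run <strong>cd D:\tools\beautifulsoup4-4.5.3</strong> to enter that directory, then run <strong>python setup.py install</strong> to install BeautifulSoup.</p>
<h2 id="使用-beautifulsoup-抓取网页">Scraping a page with BeautifulSoup</h2>
<p>Once the installation finishes we can start coding. First import urllib2 (which exists only in Python 2) and bs4.</p>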
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">import</span> <span class="nn">urllib2</span> <span class="c1"># urllib2 fetches page data from a given URL (Python 2 only)</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
</span></span></code></pre></div><p>Create a Request object. <a href="https://her-cat.com">https://her-cat.com</a> is the address of the site to request; here it is my own blog.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">request</span> <span class="o">=</span> <span class="n">urllib2</span><span class="o">.</span><span class="n">Request</span><span class="p">(</span><span class="s1">&#39;https://her-cat.com&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>Some sites check the User-Agent header to keep out clients that are not regular browsers, so we can add a header to this Request that spoofs the User-Agent.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">request</span><span class="o">.</span><span class="n">add_header</span><span class="p">(</span><span class="s1">&#39;User-Agent&#39;</span><span class="p">,</span> <span class="s1">&#39;Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>Fetch the HTML and create a BeautifulSoup object:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="n">html</span> <span class="o">=</span> <span class="n">urllib2</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s1">&#39;html.parser&#39;</span><span class="p">)</span>
</span></span></code></pre></div><p>We can then call methods on this BeautifulSoup object to get the content we want.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="nb">print</span><span class="p">(</span><span class="n">soup</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span> <span class="c1"># pretty-print the HTML</span>
</span></span><span class="line"><span class="cl"><span class="n">soup</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">&#39;span.comment_text&#39;</span><span class="p">)</span>  <span class="c1"># find every span tag with the class comment_text</span>
</span></span></code></pre></div><p>Find all a tags on the page:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="n">link</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;href&#39;</span><span class="p">))</span>  <span class="c1"># print the href attribute of the a tag</span>
</span></span><span class="line"><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="n">link</span><span class="o">.</span><span class="n">get_text</span><span class="p">())</span>  <span class="c1"># print the text content of the a tag</span>
</span></span></code></pre></div><p>That wraps up installing and using BeautifulSoup~</p>
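<p>The code in this post targets Python 2. On Python 3, urllib2 no longer exists; its Request and urlopen live in urllib.request. The following is a minimal sketch of the request-building step under that assumption. The helper name build_request and the USER_AGENT constant are hypothetical names introduced here for illustration, and nothing is sent over the network until the request is passed to urlopen().</p>

```python
from urllib.request import Request

# Same User-Agent string as in the post, kept in one place.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36"
)


def build_request(url):
    """Create a Request object that carries a browser-like User-Agent."""
    request = Request(url)
    request.add_header("User-Agent", USER_AGENT)
    return request


request = build_request("https://her-cat.com")
# Building the object is purely local; urllib.request.urlopen(request)
# would perform the actual HTTP call.
print(request.full_url)
print(request.get_header("User-agent"))  # header names are stored capitalized
```

<p>The remaining steps should carry over unchanged: pass the object returned by urllib.request.urlopen(request) to BeautifulSoup together with 'html.parser', as in the code above.</p>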
<blockquote>
<p>This post is quite old; the information in it may have since evolved or changed.</p></blockquote>
]]></content:encoded>
    </item>
    <item>
      <title>Scraping Web Pages with PHP's file_get_contents() Function</title>
      <link>https://her-cat.com/posts/2016/02/28/php-collect-page/</link>
      <pubDate>Sun, 28 Feb 2016 22:47:08 +0800</pubDate>
      <guid>https://her-cat.com/posts/2016/02/28/php-collect-page/</guid>
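      <description>Scraping a web page comes down to two things: how to get the HTML source of the target page, and how to match the content you need with a regular expression. We use the post 《终于到了。》 as the target page. A fairly simple way to get the page source is</description>
      <content:encoded><![CDATA[<p>Scraping a web page comes down to two things:</p>
<ul>
<li>How to get the HTML source of the target page.</li>
<li>How to match the content you need with a regular expression.</li>
</ul>
<p>We use the post 《终于到了。》 as the target page. A fairly simple way to get the page source is the file_get_contents() function, used like this:</p>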
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-php" data-lang="php"><span class="line"><span class="cl"><span class="nv">$content</span> <span class="o">=</span> <span class="nx">file_get_contents</span><span class="p">(</span><span class="s2">&#34;https://her-cat.com&#34;</span><span class="p">);</span>
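</span></span></code></pre></div><p>The variable $content stores the page source fetched by file_get_contents().</p>
<p>Next comes matching the body text with a regular expression. Open the page in a browser, right-click to view the source, and locate the markup around the body text.</p>
<p><del>Image no longer available</del></p>
<p>Once you have found the HTML tag that wraps the body text, you can match it with preg_match_all().</p>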
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-php" data-lang="php"><span class="line"><span class="cl"><span class="nx">preg_match_all</span><span class="p">(</span><span class="s1">&#39;/&lt;div  class=&#34;post-content&#34;&gt;(.*)&lt;\/div&gt;/&#39;</span><span class="p">,</span> <span class="nv">$content</span><span class="p">,</span> <span class="nv">$result</span><span class="p">);</span>
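</span></span></code></pre></div><p>Print the $result variable with var_dump().</p>
<p><del>Image no longer available</del></p>
<p>As the image showed, the output is a two-dimensional array. The body text has been matched, but it still contains some HTML tags, so the next step is to remove them with str_replace().</p>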
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-php" data-lang="php"><span class="line"><span class="cl"><span class="nv">$arr</span> <span class="o">=</span> <span class="k">array</span><span class="p">(</span><span class="s1">&#39;&lt;p&gt;&#39;</span><span class="p">,</span> <span class="s1">&#39;&lt;/p&gt;&#39;</span><span class="p">,</span> <span class="s1">&#39;&lt;br/&gt;&#39;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nv">$result</span> <span class="o">=</span> <span class="nx">str_replace</span><span class="p">(</span><span class="nv">$arr</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="nv">$result</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">]);</span>
</span></span></code></pre></div><p>Finally, output the $result variable.</p>
<p>Complete code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-php" data-lang="php"><span class="line"><span class="cl"><span class="nv">$content</span> <span class="o">=</span> <span class="nx">file_get_contents</span><span class="p">(</span><span class="s2">&#34;https://her-cat.com&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nx">preg_match_all</span><span class="p">(</span><span class="s1">&#39;/&lt;div  class=&#34;post-content&#34;&gt;(.*)&lt;\/div&gt;/&#39;</span><span class="p">,</span> <span class="nv">$content</span><span class="p">,</span> <span class="nv">$result</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nv">$arr</span> <span class="o">=</span> <span class="k">array</span><span class="p">(</span><span class="s1">&#39;&lt;p&gt;&#39;</span><span class="p">,</span> <span class="s1">&#39;&lt;/p&gt;&#39;</span><span class="p">,</span> <span class="s1">&#39;&lt;br/&gt;&#39;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nv">$result</span> <span class="o">=</span> <span class="nx">str_replace</span><span class="p">(</span><span class="nv">$arr</span><span class="p">,</span> <span class="s1">&#39;&#39;</span><span class="p">,</span> <span class="nv">$result</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">]);</span>
</span></span><span class="line"><span class="cl"><span class="k">echo</span> <span class="nv">$result</span><span class="p">;</span>
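</span></span></code></pre></div><p>To sum up, there are two things to watch when scraping web pages with PHP. First, file_get_contents() is relatively inefficient at fetching page source, so curl is recommended. Second, the regular expression has to be written against the actual source of the target page; it is not fixed and must change whenever the markup does. Search online to learn more about curl and regular expressions!</p>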
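<p>The two steps from this post (isolate the content block with a regular expression, then strip the leftover tags) can also be tried offline. The sketch below does so in Python on an invented snippet; the page string is an assumption for illustration and the real markup may differ. One detail worth noting: by default "." does not match a line break, so content that spans several lines needs re.DOTALL in Python (the PHP counterpart is the s pattern modifier).</p>

```python
import re

# Invented snippet standing in for the fetched page; the real markup may differ.
page = '<html><body><div class="post-content"><p>Hello</p>\n<p>world<br/></p></div></body></html>'

# re.DOTALL lets "." cross line breaks, so multi-line content still matches.
match = re.search(r'<div class="post-content">(.*)</div>', page, re.DOTALL)
body = match.group(1) if match else ""

# Remove the same three tags the post strips with str_replace().
for tag in ("<p>", "</p>", "<br/>"):
    body = body.replace(tag, "")

print(body)
```

<p>Any tag that is not in the list stays in the output, which is one more reason the cleanup has to be adapted to each target page.</p>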
<blockquote>
<p>This post is quite old; the information in it may have since evolved or changed.</p></blockquote>
]]></content:encoded>
    </item>
  </channel>
</rss>
