Using Selectors in Scrapy

Last updated: the evening of January 30, 2021

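Scrapy has its own mechanism for extracting data, called selectors. A selector picks out part of an HTML document using an XPath or CSS expression.
XPath is a language for selecting nodes in XML documents, and it can also be used on HTML.
CSS is a language for applying styles to HTML documents; it defines selectors to associate those styles with specific HTML elements.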

Official sample page: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

  • Three kinds of selectors

    • css
    • xpath
    • re
  • Entering the shell environment

    scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

  • Querying with selector

    # Evaluate the selector directly
    >>> response.selector
    <Selector xpath=None data='<html>\n <head>\n <base href="http://e...'>

    # Call the selector's xpath method
    >>> response.selector.xpath('//title/text()')
    [<Selector xpath='//title/text()' data='Example website'>]
    >>> response.selector.xpath('//title/text()').extract_first()
    'Example website'

    # Call the selector's css method
    >>> response.selector.css('title::text').extract_first()
    'Example website'

    # Shortcut form (omitting .selector)
    >>> response.xpath('//title/text()').extract_first()
    'Example website'
    >>> response.css('title::text').extract_first()
    'Example website'

    As the output shows, both the xpath and css methods return a list of selectors, so either method can be called again on the result to build up nested queries step by step.

    >>> response.xpath('//div[@id="images"]')
    [<Selector xpath='//div[@id="images"]' data='<div id="images">\n <a href="image1....'>]
    # Nested query
    >>> response.xpath('//div[@id="images"]').css('img')
    [<Selector xpath='descendant-or-self::img' data='<img src="image1_thumb.jpg">'>,
    <Selector xpath='descendant-or-self::img' data='<img src="image2_thumb.jpg">'>,
    <Selector xpath='descendant-or-self::img' data='<img src="image3_thumb.jpg">'>,
    <Selector xpath='descendant-or-self::img' data='<img src="image4_thumb.jpg">'>,
    <Selector xpath='descendant-or-self::img' data='<img src="image5_thumb.jpg">'>]

    # Get the links of all returned images
    >>> response.xpath('//div[@id="images"]').css('img::attr(src)')
    [<Selector xpath='descendant-or-self::img/@src' data='image1_thumb.jpg'>,
    <Selector xpath='descendant-or-self::img/@src' data='image2_thumb.jpg'>,
    <Selector xpath='descendant-or-self::img/@src' data='image3_thumb.jpg'>,
    <Selector xpath='descendant-or-self::img/@src' data='image4_thumb.jpg'>,
    <Selector xpath='descendant-or-self::img/@src' data='image5_thumb.jpg'>]
    # Then use extract() to get all of their contents
    >>> response.xpath('//div[@id="images"]').css('img::attr(src)').extract()
    ['image1_thumb.jpg',
    'image2_thumb.jpg',
    'image3_thumb.jpg',
    'image4_thumb.jpg',
    'image5_thumb.jpg']
    # Then use extract_first() to get only the first one
    >>> response.xpath('//div[@id="images"]').css('img::attr(src)').extract_first()
    'image1_thumb.jpg'

    # extract_first(default='') returns the default value when nothing matches
    >>> response.xpath('//div[@id="images"]').css('img::attr(aaa)').extract_first(default='')
    ''

    # Get the links of all a tags
    >>> response.xpath('//a/@href')
    [<Selector xpath='//a/@href' data='image1.html'>,
    <Selector xpath='//a/@href' data='image2.html'>,
    <Selector xpath='//a/@href' data='image3.html'>,
    <Selector xpath='//a/@href' data='image4.html'>,
    <Selector xpath='//a/@href' data='image5.html'>]
    >>> response.css('a::attr(href)').extract()
    ['image1.html',
    'image2.html',
    'image3.html',
    'image4.html',
    'image5.html']

    The xpath and css selectors return much the same results, so which one to use is a matter of personal preference. Next, how to get text content.

    # Get text content with xpath (use /text())
    >>> response.xpath('//a/text()').extract()
    ['Name: My image 1 ',
    'Name: My image 2 ',
    'Name: My image 3 ',
    'Name: My image 4 ',
    'Name: My image 5 ']
    # Get text content with css (use ::text)
    >>> response.css('a::text').extract()
    ['Name: My image 1 ',
    'Name: My image 2 ',
    'Name: My image 3 ',
    'Name: My image 4 ',
    'Name: My image 5 ']
  • Other selector usage

    # Find all tags whose attribute value contains a given substring (contains)
    # xpath form: //tag[contains(@attribute, "substring")]
    # Example: find all a tags whose href contains image
    >>> response.xpath('//a[contains(@href,"image")]').extract()
    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
    '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
    '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
    '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
    '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

    >>> response.xpath('//a[contains(@href,"image")]/@href').extract()
    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

    # css form: tag[href*=image]
    >>> response.css('a[href*=image]').extract()
    ['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
    '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
    '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
    '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
    '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

    >>> response.css('a[href*=image]::attr(href)').extract()
    ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

    # Other examples
    >>> response.xpath('//a[contains(@href,"image")]/img/@src').extract()
    ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

    >>> response.css('a[href*=image] img::attr(src)').extract()
    ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
  • Regular expression matching with selector

    # .re("")  .re_first("")
    >>> response.css('a::text').extract()
    ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
    >>> response.css('a::text').re(r'Name:(.*)')
    [' My image 1 ', ' My image 2 ', ' My image 3 ', ' My image 4 ', ' My image 5 ']
    >>> response.css('a::text').re_first(r'Name:(.*)')
    ' My image 1 '
    >>> response.css('a::text').re_first(r'Name:(.*)').strip()
    'My image 1'
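
    For a pattern with a single capture group, the behavior above can be approximated with the standard re module alone. The following is a rough sketch of the idea, not Scrapy's actual implementation; re_all and re_first here are hypothetical stand-ins written for illustration, and the real methods handle more cases.

```python
import re

# Strings as returned by response.css('a::text').extract() (first two only).
texts = ['Name: My image 1 ', 'Name: My image 2 ']
pattern = re.compile(r'Name:(.*)')


def re_all(strings, regex):
    """Collect the captured group from every match in every string."""
    matches = []
    for s in strings:
        matches.extend(regex.findall(s))
    return matches


def re_first(strings, regex, default=None):
    """Return the first captured group, or default when nothing matches."""
    return next(iter(re_all(strings, regex)), default)


all_names = re_all(texts, pattern)
first_name = re_first(texts, pattern)
missing = re_first(['no match here'], pattern, default='')

print(all_names, repr(first_name), repr(missing))
```

    The captured text keeps its surrounding whitespace because (.*) matches everything after the colon, which is why the shell example calls strip() on the result.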
  • Syntax worth knowing