Preferred Answer:
Yes, you can. As mentioned above, BeautifulSoup can be used for parsing HTML responses in Scrapy callbacks. You just have to feed the response’s body into a BeautifulSoup object and extract whatever data you need from it.
Here’s an example spider using the BeautifulSoup API, with lxml as the HTML parser:
    from bs4 import BeautifulSoup
    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = (
            'http://www.example.com/',
        )

        def parse(self, response):
            # use lxml to get decent HTML parsing speed
            soup = BeautifulSoup(response.text, 'lxml')
            yield {
                "url": response.url,
                "title": soup.h1.string
            }
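One thing to watch in the spider above: `soup.h1` is `None` when the page has no `<h1>`, so `soup.h1.string` raises `AttributeError`, and `.string` itself returns `None` when the heading contains nested tags. A minimal sketch of a more defensive extraction step follows. The helper name `extract_title` is hypothetical (it is not part of Scrapy or BeautifulSoup), and the sketch defaults to the standard library's `html.parser` only so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup


def extract_title(html, parser="html.parser"):
    """Return the text of the first <h1>, or None if the page has none.

    `extract_title` is a hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, parser)
    h1 = soup.find("h1")  # find() returns None when no tag matches
    if h1 is None:
        return None
    # get_text() also works when the heading contains nested tags
    return h1.get_text(strip=True)


if __name__ == "__main__":
    print(extract_title("<html><body><h1> Example Domain </h1></body></html>"))
    print(extract_title("<html><body><p>No heading here</p></body></html>"))
```

Inside a Scrapy callback this could be called as `extract_title(response.text, "lxml")` to keep lxml as the parser, yielding `None` for the title instead of failing on pages without a heading.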