Here is the list of the data that Crowl grabs from pages by default:
url
response_code
content_type
level
referer
latency
nb_title
title
nb_meta_robots
meta_robots
meta_description
meta_viewport
meta_keywords
canonical
prev
next
h1
nb_h1
nb_h2
wordcount
XRobotsTag
http_date
size
html_lang
hreflangs
microdata
We regularly add more items to this list, as well as other content extraction methods. Feel free to suggest your ideas!
If you need information that is not in this list, Crowl can also extract the whole page content.
You can grab the page content with Crowl by setting CONTENT to True in your project file.
This stores the entire source code of each page, so that you can retrieve information after the crawl.
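As an illustration of post-crawl retrieval, here is a minimal sketch that pulls Open Graph tags out of a stored page source. It assumes you have already loaded the stored source into a Python string; how you read it back from your Crowl output is not shown here. The names `OpenGraphParser` and `extract_open_graph` are hypothetical helpers, not part of Crowl.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags can carry Open Graph properties.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:"):
            self.properties[prop] = attributes.get("content")


def extract_open_graph(html):
    """Return a dict of Open Graph properties found in `html` (a string)."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties
```

The same pattern works for any data you forgot to extract during the crawl: since the full source is stored, you can run a new parser over it without crawling again.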
If you’re interested in fetching only specific portions of pages, you can use custom extractors.
Crowl supports XPath (and will support Regex and CSS selectors soon).
You must provide the list of extractors you wish to use in the configuration file.
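Before adding an XPath expression to the configuration, it can help to try it on a small snippet. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so Crowl's own evaluation may behave differently. The sample markup, the `product` and `price` class names, and the helper functions are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only to try out expressions.
SAMPLE = """
<html>
  <body>
    <div class="product"><span class="price">10</span></div>
    <div class="product"><span class="price">25</span></div>
    <div class="banner">Sale</div>
  </body>
</html>
"""


def count_matches(markup, xpath):
    """Return how many elements in `markup` match `xpath`."""
    root = ET.fromstring(markup.strip())
    return len(root.findall(xpath))


def extract_texts(markup, xpath):
    """Return the text content of every element matching `xpath`."""
    root = ET.fromstring(markup.strip())
    return [element.text for element in root.findall(xpath)]


# Count product blocks, then read their prices.
nb_products = count_matches(SAMPLE, ".//div[@class='product']")
prices = extract_texts(SAMPLE, ".//div[@class='product']/span[@class='price']")
```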
You can retrieve data from the JSON-stored export (extractors column) using the name you gave to your extractor.
Here is an example function to do this:
Usage example with pandas: retrieving data for an extractor named nb_products
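A minimal sketch of such a helper follows. It assumes each cell of the extractors column holds a JSON object that maps extractor names to their values; the exact layout of Crowl's export may differ, so adapt the lookup accordingly. The function names, the `column` parameter, and the file name in the usage comment are hypothetical.

```python
import json


def get_extractor_value(cell, name):
    """Return the value stored under `name` in one JSON-encoded cell.

    Assumption: the cell is a JSON object such as {"nb_products": 12}.
    Returns None for empty, missing, or unparsable cells.
    """
    if not cell:
        return None
    try:
        data = json.loads(cell)
    except (TypeError, ValueError):
        return None
    if isinstance(data, dict):
        return data.get(name)
    return None


def add_extractor_column(df, name, column="extractors"):
    """Add a column called `name` to a pandas DataFrame from its JSON column."""
    df[name] = df[column].apply(lambda cell: get_extractor_value(cell, name))
    return df


# Usage with pandas (the file name is a placeholder):
# import pandas as pd
# df = pd.read_csv("crawl_export.csv")
# df = add_extractor_column(df, "nb_products")
```

Keeping the JSON parsing in a plain function makes it easy to reuse outside pandas, for example when iterating over rows fetched directly from a database.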