Here is the list of the data that Crowl grabs from pages by default:
url
response_code
content_type
level
referer
latency
nb_title
title
nb_meta_robots
meta_robots
meta_description
meta_viewport
meta_keywords
canonical
prev
next
h1
nb_h1
nb_h2
wordcount
XRobotsTag
http_date
size
html_lang
hreflangs
microdata
We regularly add more items to this list, as well as other content extraction methods. Feel free to suggest your ideas!
If you need information that is not in this list, Crowl can also store the whole page content.
You can grab the page content with Crowl by setting CONTENT to True in your project file.
This will store the entire source code of each page, so that you can retrieve information from it post-crawl.
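As an illustration of post-crawl retrieval, here is a minimal sketch that pulls the text of every occurrence of a given tag out of a stored HTML string, using only the Python standard library. It assumes you have already read the stored source code of a page into a string; how you load it from your Crowl output is not shown here. The names `TagTextCollector` and `extract_tag_texts` are hypothetical helpers, not part of Crowl.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the text content of every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self._depth = 0    # > 0 while we are inside the target tag
        self._buffer = []  # text fragments of the current occurrence
        self.texts = []    # one entry per closed occurrence

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_tag_texts(html, tag):
    """Return the text of each `tag` element found in `html` (hypothetical helper)."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.texts


# Assumption: `page_source` holds the stored source code of one crawled page.
page_source = "<html><body><h2>First</h2><p>x</p><h2>Second <b>bold</b></h2></body></html>"
h2_texts = extract_tag_texts(page_source, "h2")
```

For anything beyond simple tag lookups, a dedicated HTML parsing library will likely be more convenient than this standard-library sketch.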
If you’re interested in fetching only specific portions of pages, you can use custom extractors.
Crowl supports XPath (and will support Regex and CSS selectors soon).
You must provide the list of extractors you wish to use in the configuration file.
You can retrieve data from the JSON-stored export (extractors column) using the name you gave to your extractor.
Here is an example function to do this:
import json

def extractors_to_column(name, extractors):
    """
    Retrieves extractor data from raw Crowl custom extractors export.
    - `name`: name of the extractor to retrieve
    - `extractors`: raw Crowl extractors data
    Returns `data` field or None.
    """
    try:
        items = json.loads(str(extractors))
        for item in items:
            if item["name"] == name:
                if item["data"] == "None":
                    return None
                return item["data"]
    except Exception:
        # Empty or malformed cells are treated as "no data"
        pass
    return None
Usage example with pandas: retrieving data for an extractor named nb_products
import pandas as pd

# Load crawl data into DataFrame
crawl = pd.read_csv("mycrawl_urls.csv", sep=",")

# Create a new column with extracted data
crawl["nb_products"] = crawl["extractors"].apply(
    lambda x: extractors_to_column("nb_products", x))

# Optional steps to fill empty values and convert to integer
crawl["nb_products"] = crawl["nb_products"].fillna(0)
crawl["nb_products"] = crawl["nb_products"].astype(int)
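If you use several extractors, it may be handier to decode each cell once instead of calling the lookup function once per extractor. The sketch below assumes the same raw format as the function above (a JSON list of objects with `name` and `data` keys, where missing data is stored as the string "None"); `extractors_to_dict` is a hypothetical helper name, not part of Crowl.

```python
import json


def extractors_to_dict(extractors):
    """Return {extractor name: data} for one raw `extractors` cell."""
    try:
        items = json.loads(str(extractors))
    except ValueError:
        # Empty or malformed cell: no extractor data available
        return {}
    if not isinstance(items, list):
        return {}
    result = {}
    for item in items:
        if isinstance(item, dict) and "name" in item:
            data = item.get("data")
            # Same convention as above: the string "None" means no data
            result[item["name"]] = None if data == "None" else data
    return result


# Sample cell, shaped like the format assumed above
raw = '[{"name": "nb_products", "data": "12"}, {"name": "price", "data": "None"}]'
values = extractors_to_dict(raw)
```

With pandas, applying this function to the extractors column yields one dictionary per row, which you can then expand into one column per extractor name.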