Extracting page content with Crowl

Default data

Here is the data that Crowl collects from each page by default:

url
response_code
content_type
level
referer
latency
nb_title
title
nb_meta_robots
meta_robots
meta_description
meta_viewport
meta_keywords
canonical
prev
next
h1
nb_h1
nb_h2
wordcount
XRobotsTag
http_date
size
html_lang
hreflangs
microdata
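
To illustrate how these fields can be used once a crawl has been exported, here is a minimal sketch that flags pages needing attention. It assumes a CSV export whose header uses the field names listed above; the sample rows and the `find_issues` helper are made up for illustration and are not part of Crowl.

```python
import csv
import io

# Made-up sample rows; a real export would be read from the crawl's CSV file.
SAMPLE_EXPORT = """url,response_code,title,nb_h1
https://example.com/,200,Home,1
https://example.com/old,404,,0
https://example.com/blog,200,Blog,0
"""

def find_issues(csv_file):
    """Return (url, reason) pairs for rows with a non-200 status or no h1."""
    issues = []
    for row in csv.DictReader(csv_file):
        if row["response_code"] != "200":
            issues.append((row["url"], "status " + row["response_code"]))
        elif int(row["nb_h1"]) == 0:
            issues.append((row["url"], "missing h1"))
    return issues

issues = find_issues(io.StringIO(SAMPLE_EXPORT))
for url, reason in issues:
    print(url, reason)
```

The same function accepts any open file object, so a real export can be passed with `open(path, newline="")` instead of the in-memory sample.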

We regularly add more items to this list, as well as other content extraction methods. Feel free to suggest your ideas!

If you need information that is not in this list, Crowl can also store the whole page content.

Extract whole page content

You can grab the page content with Crowl by setting CONTENT to True in your project file.

This stores the entire source code of each page, so that you can retrieve information after the crawl.
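
As an illustration of post-crawl retrieval, the sketch below counts images and missing `alt` attributes in a page's source using Python's standard `html.parser`. How the stored source is read back from your crawl data is not covered here: the sketch simply takes the HTML as a string, and `ImageAltCounter`, `count_images` and the sample markup are hypothetical names and data.

```python
from html.parser import HTMLParser

class ImageAltCounter(HTMLParser):
    """Counts <img> tags and how many of them lack a non-empty alt attribute."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.missing_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
            # attrs is a list of (name, value) pairs; value may be None.
            if not dict(attrs).get("alt"):
                self.missing_alt += 1

def count_images(html):
    """Return (number of images, number of images without alt text)."""
    parser = ImageAltCounter()
    parser.feed(html)
    return parser.images, parser.missing_alt

# Made-up page source standing in for one stored page.
sample_html = '<html><body><img src="a.png" alt="Logo"><img src="b.png"></body></html>'
print(count_images(sample_html))
```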

Custom extractors

If you’re interested in fetching only specific portions of pages, you can use custom extractors.
Crowl supports XPath (and will support Regex and CSS selectors soon).
You must provide the list of extractors you wish to use in the configuration file.
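
To give an idea of what an XPath expression selects, here is a small sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. It is an illustration of XPath itself, not of how Crowl evaluates extractors or how they are declared in the configuration file; the page fragment is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page fragment; ElementTree needs valid XML,
# which real-world HTML often is not.
PAGE = """<html>
  <body>
    <ul id="products">
      <li class="product">Red shoes</li>
      <li class="product">Blue shoes</li>
      <li class="promo">Free shipping</li>
    </ul>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# XPath: every <li> whose class attribute is exactly "product", at any depth.
products = root.findall(".//li[@class='product']")
product_names = [li.text for li in products]
nb_products = len(products)

print(nb_products, product_names)
```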

Example of function to retrieve data

You can retrieve data from the JSON-encoded extractors column of the export using the name you gave to your extractor.

Here is an example function to do this:

import json

def extractors_to_column(name, extractors):
    """
    Retrieves extractor data from raw Crowl custom extractors export.  
    - `name`: name of the extractor to retrieve  
    - `extractors`: raw Crowl extractors data  

    Returns `data` field or None.  
    """
    try:
        items = json.loads(str(extractors))
        for item in items:
            if item["name"] == name:
                if item["data"] == "None":
                    return None
                return item["data"]
    except Exception:
        # Missing or malformed extractor data: fall through and return None
        pass
    return None

Usage example with pandas: retrieving data for an extractor named nb_products

import pandas as pd
# Load crawl data into DataFrame
crawl = pd.read_csv("mycrawl_urls.csv",sep=",")
# Create a new column with extracted data
crawl["nb_products"] = crawl["extractors"].apply(
    lambda x: extractors_to_column("nb_products",x))
# Optional steps to fill empty values and convert to integer
crawl["nb_products"] = crawl["nb_products"].fillna(0)
crawl["nb_products"] = crawl["nb_products"].astype(int)
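
When a crawl defines several extractors, parsing the JSON once per row may be more convenient than calling the function once per name. The following is a minimal sketch under that assumption: it expects the same list of name/data items that `extractors_to_column` reads, and `extractors_to_dict` as well as the sample cell value are hypothetical.

```python
import json

def extractors_to_dict(extractors):
    """Parse one raw extractors cell into a {name: data} dict (empty on failure)."""
    try:
        items = json.loads(str(extractors))
        return {
            # The string "None" is treated as a missing value.
            item["name"]: (None if item["data"] == "None" else item["data"])
            for item in items
        }
    except (ValueError, TypeError, KeyError):
        # Empty, malformed, or unexpectedly shaped cell.
        return {}

# Made-up cell value, shaped like the name/data items described above.
raw = '[{"name": "nb_products", "data": "12"}, {"name": "breadcrumb", "data": "None"}]'
print(extractors_to_dict(raw))
print(extractors_to_dict(float("nan")))
```

With pandas, `crawl["extractors"].apply(extractors_to_dict)` would give one dict per row, from which individual columns can then be built.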
