Extracting page content with Crowl

Default data

Here is the list of data Crowl grabs from each page by default:

  • url
  • response_code
  • content_type
  • level
  • referer
  • latency
  • nb_title
  • title
  • nb_meta_robots
  • meta_robots
  • meta_description
  • meta_viewport
  • meta_keywords
  • canonical
  • prev
  • next
  • h1
  • nb_h1
  • nb_h2
  • wordcount
  • XRobotsTag
  • http_date
  • size
  • html_lang
  • hreflangs
  • microdata

We regularly add more items to this list, as well as other content extraction methods. Feel free to suggest your ideas!
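
All of these fields end up as columns in the crawl export. As a minimal sketch, assuming a CSV export named mycrawl_urls.csv (the naming used in the example further down this page), you can inspect them with pandas:

import pandas as pd

# Load the crawl export and look at the default columns
crawl = pd.read_csv("mycrawl_urls.csv", sep=",")
print(crawl.columns.tolist())
# Preview a few of the default fields
print(crawl[["url", "response_code", "title", "wordcount"]].head())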

If you need information that is not covered by these default fields, Crowl also offers a feature to extract the whole page content.

Extract whole page content

You can grab the page content with Crowl by setting CONTENT to True in your project file.

This stores the entire source code of each page, so that you can retrieve any information you need post-crawl.
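
The exact layout of the project file depends on your setup; as a minimal sketch, assuming an INI-style project file, the relevant line is simply:

# Project configuration file (excerpt; section name and other settings omitted)
CONTENT = True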

Custom extractors

If you’re interested in fetching only specific portions of pages, you can use custom extractors.
Crowl supports XPath (and will support Regex and CSS selectors soon).
You must provide the list of extractors you wish to use in the configuration file.
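
The exact syntax for declaring extractors in the configuration file is covered in the Crowl documentation, so it is not reproduced here. Purely as an illustration of what an XPath extractor returns, here is a sketch using lxml, with a made-up HTML snippet, a hypothetical XPath, and an extractor named nb_products (the same name reused in the example below):

import lxml.html

# Hypothetical HTML snippet and XPath, for illustration only.
# In Crowl, the extractor (a name plus an XPath expression)
# is declared in the configuration file, not in Python.
html = '<div><span id="nb-products">42</span></div>'
tree = lxml.html.fromstring(html)
values = tree.xpath('//span[@id="nb-products"]/text()')
print(values)  # ['42'] -> stored by Crowl under the extractor's name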

Example function to retrieve data

You can retrieve data from the JSON-encoded extractors column of the export, using the name you gave to your extractor.

Here is an example function to do this:

import json

def extractors_to_column(name, extractors):
    """
    Retrieves extractor data from the raw Crowl custom extractors export.
    - `name`: name of the extractor to retrieve
    - `extractors`: raw Crowl extractors data (JSON string)

    Returns the `data` field or None.
    """
    try:
        # The extractors column holds a JSON list of {"name": ..., "data": ...} items
        items = json.loads(str(extractors))
        for item in items:
            if item["name"] == name:
                # Treat the literal string "None" as a missing value
                if item["data"] == "None":
                    return None
                return item["data"]
    except Exception:
        # Empty or malformed cells (e.g. NaN from pandas) simply yield None
        pass
    return None

Usage example with pandas: retrieving data for an extractor named nb_products

import pandas as pd
# Load crawl data into a DataFrame
crawl = pd.read_csv("mycrawl_urls.csv", sep=",")
# Create a new column with the extracted data
crawl["nb_products"] = crawl["extractors"].apply(
    lambda x: extractors_to_column("nb_products", x))
# Optional steps to fill empty values and convert to integer
crawl["nb_products"] = crawl["nb_products"].fillna(0)
crawl["nb_products"] = crawl["nb_products"].astype(int)
