Extracting page content with Crowl

Default data

By default, Crowl collects the following data from each page:

  • url
  • response_code
  • content_type
  • level
  • referer
  • latency
  • title
  • meta_robots
  • meta_description
  • meta_viewport
  • meta_keywords
  • canonical
  • h1
  • wordcount
  • XRobotsTag

We’ll probably add more items to this list in the near future, as well as other content extraction methods. Feel free to suggest your ideas!

If you need information that is not in this list, Crowl can also extract the whole page content.

Extract whole page content

You can grab the page content with Crowl by adding --content to your command.
For example, here is how to crawl this website and scrape all its content:

python crowl.py -u https://www.crowl.tech/ -b crowltech --content  

This stores the entire source code of each page, so that you can retrieve information from it after the crawl.
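
As an illustration of post-crawl extraction, here is a minimal sketch that pulls every h2 heading out of a page's source code. It assumes the stored source has already been loaded into a Python string. How to read it back from Crowl's storage is not covered here, so the sample string below is a stand-in. The names HeadingExtractor and extract_h2 are hypothetical and are not part of Crowl. Only the Python standard library is used.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collects the text of every <h2> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []    # finished heading texts
        self._buffer = None   # text chunks while inside an <h2>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <h2>
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._buffer is not None:
            # Join the chunks and normalise whitespace
            self.headings.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_h2(html):
    """Return the list of h2 heading texts found in an HTML string."""
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings


# Hypothetical page source standing in for content stored by the crawl
sample = (
    "<html><body><h1>Title</h1>"
    "<h2>First <em>section</em></h2><p>Text</p>"
    "<h2>Second</h2></body></html>"
)
print(extract_h2(sample))
```

The same pattern can be adapted to any other element or attribute that is not part of the default data.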

