Extracting links with Crowl

Crowl can extract all the links from your website, giving you data on both your internal and external links.

Simply add --links to your command line.
For instance, here is a simple crawl of this website, with default settings and link extraction:

python crowl.py -u https://www.crowl.tech/ -b crowltech --links  

What kind of data will you get?

Crowl will extract all links and record:

  • the source URL (source)
  • the target URL (target)
  • the anchor text (text)

It will also add flags for nofollow links (nofollow) and internal links to pages that are blocked by the robots.txt file (disallow).

Finally, Crowl will also add a weight to each link, stored in the weight column.
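Putting these columns together, a single extracted link record might look like the following sketch. The URLs, anchor text, and values here are purely illustrative, not actual Crowl output:

```python
# Hypothetical example of one extracted link record (illustrative values)
link = {
    "source": "https://www.crowl.tech/",       # page the link was found on
    "target": "https://www.crowl.tech/docs/",  # page the link points to
    "text": "Documentation",                   # anchor text
    "nofollow": 0,                             # flag: link carries rel="nofollow"
    "disallow": 0,                             # flag: target blocked by robots.txt
    "weight": 0.9,                             # link weight (see below)
}

print(sorted(link))
```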

What is this weight thingy?

The weight associated with each link is calculated from the order of links in the source code: the higher up a link appears, the more weight it gets.

The actual formula is:

weight = 1 - c / n

Where c is the index of the link in the list of links (starting at 0), and n is the total number of links on the page.
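Under this formula, the first link on a page gets the full weight of 1, and weights decrease linearly down to 1/n for the last link. A minimal sketch of the calculation (not Crowl's actual code):

```python
def link_weights(n_links):
    """Weight for each link position: weight = 1 - c / n.

    c is the 0-based position of the link in the page's link list,
    n is the total number of links on the page.
    This is a sketch of the formula, not Crowl's implementation.
    """
    return [1 - c / n_links for c in range(n_links)]

# The first link gets the full weight of 1.0; later links get less.
print(link_weights(4))  # [1.0, 0.75, 0.5, 0.25]
```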

In the future, we will try and add other methods of calculation.
