Extracting links with Crowl

Crowl can extract every link it finds on your website, giving you data on both your internal and your outbound links.

To enable this, set LINKS to True in your project file.
Crowl then saves links in a separate CSV file (in CSV mode) or a separate MySQL table (in MySQL mode).

Historically, Crowl has removed duplicate links from exports.
This means that, by default, only the first link from page A to page B is saved.

If you want to keep all links, you can disable this behavior with the LINKS_UNIQUE option in your configuration file.
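
To make the "first link wins" rule concrete, here is a minimal Python sketch of that kind of deduplication. It is an illustration only, not Crowl's actual code: the dedupe_links helper and the sample data are hypothetical, and it assumes a duplicate is defined by the (source, target) pair.

```python
def dedupe_links(links):
    """Keep only the first link seen for each (source, target) pair."""
    seen = set()
    unique = []
    for link in links:
        key = (link["source"], link["target"])
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique


# Hypothetical sample: page /a links to /b twice, with different anchors.
links = [
    {"source": "/a", "target": "/b", "text": "first"},
    {"source": "/a", "target": "/b", "text": "second"},
    {"source": "/a", "target": "/c", "text": "other"},
]

unique_links = dedupe_links(links)
for link in unique_links:
    print(link["source"], "->", link["target"], "|", link["text"])
```

In this sketch the second link from /a to /b is dropped, so its anchor text is lost. That is the trade-off to keep in mind if you need every anchor.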

What kind of data will you get?

For each link, Crowl records:

  • the source URL (source)
  • the target URL (target)
  • the anchor text (text)

It also adds flags for nofollow links (nofollow) and for internal links to pages that are blocked by the robots.txt file (disallow).

Finally, Crowl will also add a weight to each link, stored in the weight column.
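
As a rough sketch of how you might work with these columns in CSV mode, the Python snippet below reads a links export and keeps only the followed links. The column names come from the list above. The sample rows, the in-memory file, and the way flags are encoded ("True"/"False" strings) are assumptions, so check them against your own export.

```python
import csv
import io

# Hypothetical sample standing in for a links CSV export.
# The flag encoding ("True"/"False") is an assumption.
sample = io.StringIO(
    "source,target,text,nofollow,disallow,weight\n"
    "https://example.com/,https://example.com/about,About,False,False,1.0\n"
    "https://example.com/,https://other.example/,Partner,True,False,0.5\n"
)

rows = list(csv.DictReader(sample))

# Keep only links that are not flagged as nofollow.
followed = [row for row in rows if row["nofollow"] != "True"]

for row in followed:
    print(row["source"], "->", row["target"], "weight:", float(row["weight"]))
```

With a real export, you would replace the io.StringIO object with open() on your links file.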

What is this weight thingy?

The weight associated with each link is calculated from the order of links in the source code: the earlier a link appears, the more weight it gets.

The actual formula is:

weight = 1 - c / n

Where c is the position of the link in the page's list of links (starting at 0), and n is the total number of links on the page. The first link therefore gets a weight of 1, and the last link a weight of 1 / n.
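
Here is a small worked example of the formula in Python. The link_weights function is a hypothetical helper written for this illustration, not part of Crowl.

```python
def link_weights(n):
    """Return the weight of each of n links, in source-code order.

    Applies weight = 1 - c / n, where c is the zero-based position.
    """
    return [1 - c / n for c in range(n)]


# A page with 4 links: weights decrease evenly from the first to the last.
weights = link_weights(4)
print(weights)  # [1.0, 0.75, 0.5, 0.25]
```

Note that the weight never reaches 0: on a page with 4 links, the last link still gets 0.25.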

In the future, we will try to add other calculation methods.
