Using MySQL as output

Activate MySQL Pipeline

You can choose which pipeline to run in your project’s configuration file.
See this page for more details.

Configuration

You’ll need a MySQL user with the ability to create databases.
You can use either a UI such as phpMyAdmin or the mysql console:

CREATE USER 'user'@'localhost' IDENTIFIED BY 'password';  
GRANT SELECT,INSERT,UPDATE,DELETE,CREATE,DROP ON *.* TO 'user'@'localhost';  
FLUSH PRIVILEGES;  
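
You can then double-check the account's privileges with SHOW GRANTS:

SHOW GRANTS FOR 'user'@'localhost';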

Remember to replace user, localhost and password with the correct values.

Using root

Using the root account in local environments is fine.
Avoid using it on production environments, though.

Once your account is created, fill in the [MYSQL] section of Crowl’s config.ini file.
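As a rough sketch, that section could look something like the following. The key names here are assumptions, not necessarily Crowl’s actual ones; check your own config.ini for the exact names:

[MYSQL]
host = localhost
user = user
password = password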
You should be all set.

Usage

Simply launch a crawl and Crowl will:

  1. Create a new database named after your project basename (set in the configuration file), with a timestamp appended. For instance, your new database could be foo_20180921-123000.
  2. Create a urls table in this database, which will contain all data about crawled URLs.
  3. If you activated the --links option, Crowl will also create a links table to store links.
  4. Then it starts crawling and inserting data.
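
Once the crawl has started, you can check all of this from the mysql console. The database name below is simply the example from step 1 (backticks are needed because of the dash):

SHOW DATABASES LIKE 'foo\_%';
USE `foo_20180921-123000`;
SHOW TABLES;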

Exporting data

Using phpMyAdmin

Once connected to phpMyAdmin, simply select the right database and table. At the bottom of the list you’ll find an Export option in the Query result operations box:

Query result operations

This will enable you to export all the data from your query, and customise the export format.

For example, here is how to download a CSV file with the correct format to import to Microsoft Excel:

Custom CSV export for Microsoft Excel

Using the console

There are several ways to export data from MySQL using the console.
The method we find easiest is exporting to a tab-separated file.

Here is how to export data from the urls table:

echo "SELECT * FROM urls;" | mysql -u user -p database > urls.tsv  

And for the links table:

echo "SELECT * FROM links;" | mysql -u user -p database > links.tsv   

Remember to replace user and database with the correct values. You can also use a custom SELECT statement, or change the output filename.
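
For example, here is how to export only the URLs that returned a 404, using the url and response_code columns queried in the next section:

echo "SELECT url,response_code FROM urls WHERE response_code = 404;" | mysql -u user -p database > errors_404.tsv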

Useful queries

Count URLs per response code:

SELECT response_code,COUNT(*) FROM urls GROUP BY response_code;  

Count URLs per depth level:

SELECT level,COUNT(*) FROM urls GROUP BY level;  

Get a list of URLs returning a 5xx response code:

SELECT * FROM urls WHERE response_code >= 500;  

Get a list of links to 404s (this requires the --links option, so that links are stored):

SELECT links.*,urls.response_code FROM links LEFT JOIN urls ON links.target = urls.url WHERE urls.response_code = 404;  
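
Along the same lines, here is one more query that can be useful: listing the most linked-to URLs. It only relies on the target column already used in the join above:

SELECT target,COUNT(*) AS inlinks FROM links GROUP BY target ORDER BY inlinks DESC LIMIT 20;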
