Several configuration options are available, and can be set in your project’s .ini file.
You can (and probably should) keep a separate .ini file for each project you’re working on. The aim is to easily switch between projects just by swapping configuration files, without having to remember every detail of your settings:
python crowl.py --conf project1.ini
python crowl.py --conf project2.ini
...
Here are the currently available settings:
The only required settings are located in the [PROJECT] section, and are pretty simple:
The name of the project. It will be used to generate timestamped names for the outputs (either CSV file names or the database name).
The start URL of your crawl (usually the homepage).
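For reference, a minimal [PROJECT] section could look like the sketch below. The key names NAME and START_URL are illustrative placeholders; check the sample configuration file shipped with Crowl for the exact spelling.
[PROJECT]
NAME = myproject
START_URL = https://www.example.com/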
The [CRAWLER] section gathers all the settings regarding the way Crowl makes its requests (HTTP headers, more or less).
The user agent string sent with each request. Defaults to Crowl (+https://www.crowl.tech/), but you can set whatever you want here.
Do you want to use proxies? Defaults to False. Must be used in conjunction with PROXIES_LIST. If set to True, Crowl will randomly choose a different proxy from the list for each request.
Path to the file containing the list of proxies to be used. The path can be absolute (/path/to/the/file.txt) or relative (for example proxies.txt if your file is in the same directory as crowl.py).
Format: http://username:password@host:port, one proxy per line.
Note: use http://host:port if the proxy doesn’t require authentication.
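For example, a proxies.txt file mixing authenticated and open proxies could look like this (hosts and credentials are placeholders):
http://user1:s3cret@proxy1.example.com:8080
http://user2:s3cret@proxy2.example.com:8080
http://203.0.113.10:3128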
Should we respect the robots.txt file? Defaults to True.
Regex pattern to identify URLs that should not be crawled.
If left empty, any available URL will be crawled (if allowed by ROBOTS_TXT_OBEY). This is the default setting.
Otherwise, any URL matching the regex will be ignored. Links to these URLs will still be stored, though.
Example: use \/tag\/ if you want to ignore any URL containing a /tag/ pattern.
Time to wait between requests (in seconds). Default: 0.5.
Number of crawler threads running at the same time. Default: 5.
Which MIME types should Crowl ask for?
Defaults to text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8.
You probably won’t need to change this setting.
Which language(s) should Crowl ask for in the HTTP headers?
Default is en, but you could use more complex strings, such as fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7.
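Putting the options above together, a [CRAWLER] section could look like the sketch below. Apart from PROXIES_LIST and ROBOTS_TXT_OBEY, the key names are illustrative placeholders, so check the sample configuration file for the exact ones.
[CRAWLER]
USER_AGENT = Crowl (+https://www.crowl.tech/)
USE_PROXIES = False
PROXIES_LIST = proxies.txt
ROBOTS_TXT_OBEY = True
EXCLUSION_PATTERN = \/tag\/
DOWNLOAD_DELAY = 0.5
THREADS = 5
ACCEPT_LANGUAGE = en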
The [AUTH] section contains authentication settings you might need, for example, to crawl a preproduction environment.
Username to be used in case of basic HTTP authentication.
Password to be used in case of basic HTTP authentication.
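If your preproduction environment sits behind basic HTTP authentication, the [AUTH] section simply carries those two credentials. The key names below are illustrative placeholders and the values are made up:
[AUTH]
AUTH_USER = staging
AUTH_PASSWORD = s3cret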
You’ll find the content extraction options in the [EXTRACTION] section.
Should Crowl store links? Defaults to False to save storage space.
See this page of the docs for more details.
Should we remove duplicate links from the exports?
Defaults to True to preserve the historic behavior: only the first link from page A to page B is kept. If set to False, you’ll get all links in the exports, including duplicates.
Should Crowl store HTML source code? Defaults to False to save storage space.
See this page of the docs for more details.
Should Crowl store the HTTP request headers sent to the server? Defaults to False.
Should Crowl store the HTTP response headers received from the server? Defaults to False.
How deep should Crowl crawl your website? Defaults to 5.
Remember that Crowl crawls every page at a given depth level before moving on to the next.
How many URLs should Crowl crawl? Defaults to 0 (no limit).
Note that a few additional URLs can be crawled before the crawler stops completely.
Do you want to check which language is used on your pages?
Crowl uses the Fasttext language identification model. You can check the list of supported languages.
For this feature to work, you’ll need to download the model from Fasttext and place it in the data folder.
Future versions of Crowl might enable you to use your own model.
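As an illustration, an [EXTRACTION] section covering the options above could look like the sketch below. All key names here are illustrative placeholders (the real ones live in the sample configuration file); the values shown are the defaults described on this page.
[EXTRACTION]
LINKS = False
UNIQUE_LINKS = True
HTML = False
REQUEST_HEADERS = False
RESPONSE_HEADERS = False
DEPTH_LIMIT = 5
PAGE_LIMIT = 0
CHECK_LANG = False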
You can now configure a list of custom extractors to scrape the content of a specific part of the page.
Provide your extractors in the form of a list of dictionaries.
Each extractor in the list must provide the following info:
name: the name of your extractor, for you to easily retrieve it from the exported data,
type: we only support xpath extractors for now, but are planning to add regex and css support soon,
pattern: the actual pattern of the extractor.
Important note: for now, Crowl only retrieves the first encountered item matching an extractor.
Here is an example with two xpath extractors:
CUSTOM_EXTRACTORS = [
{"name":"author", "type":"xpath", "pattern":'//div[@class="post-info-wrapper"]/p/span[1]/text()'},
{"name":"publication_date", "type":"xpath", "pattern":'//div[@class="post-info-wrapper"]/p/span[2]/text()'},
]
The extracted data is stored in JSON format in the extractors column of the exports.
If an extractor cannot retrieve data on a given page, the corresponding field will be set to None in the export.
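To illustrate, with the two xpath extractors defined above, the extractors column for a crawled page could contain something like the line below (the values are made up, and a field with no match would be None):
{"author": "Jane Doe", "publication_date": "2023-04-12"}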
Check this page for more details.
As of v0.2, Crowl offers multiple output modes (called pipelines), set in the [OUTPUT] section.
There are currently two pipelines available; CSV is the default one.
Simply activate or deactivate a pipeline by uncommenting or commenting its line (a # at the beginning of the line disables it).
The number associated with each pipeline determines its priority: pipelines with lower numbers run first if you happen to use multiple pipelines at the same time.
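For instance, an [OUTPUT] section keeping only the CSV pipeline active could look like this (the exact syntax may differ slightly in your sample file, and the priority numbers are illustrative):
[OUTPUT]
crowl.CrowlCsvPipeline = 100
# crowl.CrowlMySQLPipeline = 200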
crowl.CrowlCsvPipeline: exports data to CSV files.
See this page of the docs for additional information.
crowl.CrowlMySQLPipeline: exports data to a MySQL database, with two tables.
For this pipeline to work, you’ll have to fill in the [MYSQL] section of the file too:
MYSQL_HOST is your MySQL server’s host (probably localhost),
MYSQL_PORT is your MySQL server’s port (probably 3306),
MYSQL_USER is the username you use to connect to the MySQL server,
MYSQL_PASSWORD is the user’s password.
See this page of the docs for additional information.
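A filled-in [MYSQL] section could then look like this (the username and password are placeholders):
[MYSQL]
MYSQL_HOST = localhost
MYSQL_PORT = 3306
MYSQL_USER = crowl
MYSQL_PASSWORD = s3cret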