Several configuration options are available, and can be set in your project’s .ini file.
You can (and probably should) keep a separate .ini file for each project you’re working on. The aim is to easily switch between projects just by swapping configuration files, without having to remember every detail of your settings:
python crowl.py --conf project1.ini
python crowl.py --conf project2.ini
...
Here are the currently available settings:
The only required settings are located in the [PROJECT] section, and are pretty simple:
The name of the project. It will be used to generate timestamped names for the outputs (either CSV file names or the database name).
The start URL of your crawl (usually the homepage).
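For reference, a minimal [PROJECT] section could look like the sketch below. The key names NAME and START_URL are illustrative placeholders; check the sample configuration file shipped with Crowl for the exact spelling.
[PROJECT]
NAME = myproject
START_URL = https://www.example.com/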
The [CRAWLER] section gathers all the settings regarding the way Crowl makes its requests (HTTP headers, more or less).
The user agent string sent with each request. Defaults to Crowl (+https://www.crowl.tech/), but you can set whatever you want here.
Do you want to use proxies? Defaults to False. Must be used in conjunction with PROXIES_LIST. If set to True, Crowl will randomly choose a different proxy from the list for each request.
Path to the file containing the list of proxies to be used. The path can be absolute (/path/to/the/file.txt) or relative (for example proxies.txt if your file is in the same directory as crowl.py).
Format: http://username:password@host:port, one proxy per line.
Note: use http://host:port if the proxy doesn’t require authentication.
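For example, a proxies.txt file mixing authenticated and open proxies could look like this (hosts and credentials are placeholders):
http://user1:s3cret@proxy1.example.com:8080
http://user2:s3cret@proxy2.example.com:8080
http://203.0.113.10:3128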
Should we respect the robots.txt file? Defaults to True.
Regex pattern to identify URLs that should not be crawled.
If left empty, any available URL will be crawled (if allowed by ROBOTS_TXT_OBEY). This is the default setting.
Otherwise, any URL matching the regex will be ignored. Links to these URLs will still be stored, though.
Example: use \/tag\/ if you want to ignore any URL containing a /tag/ pattern.
Time to wait between requests (in seconds). Default: 0.5.
Number of crawler threads running at the same time. Default: 5.
Which MIME types should Crowl ask for?
Defaults to text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8.
You probably won’t need to change this setting.
Which language(s) should Crowl ask for in the HTTP headers?
Default is en, but you could use more complex strings, such as fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7.
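Putting the options above together, a [CRAWLER] section could look like the sketch below. Apart from PROXIES_LIST and ROBOTS_TXT_OBEY, the key names are illustrative placeholders, so check the sample configuration file for the exact ones.
[CRAWLER]
USER_AGENT = Crowl (+https://www.crowl.tech/)
USE_PROXIES = False
PROXIES_LIST = proxies.txt
ROBOTS_TXT_OBEY = True
EXCLUSION_PATTERN = \/tag\/
DOWNLOAD_DELAY = 0.5
THREADS = 5
ACCEPT_LANGUAGE = en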
The [AUTH] section contains authentication settings you might need, for example, to crawl a preproduction environment.
Username to be used in case of basic HTTP authentication.
Password to be used in case of basic HTTP authentication.
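If your preproduction environment sits behind basic HTTP authentication, the [AUTH] section simply carries those two credentials. The key names below are illustrative placeholders and the values are made up:
[AUTH]
AUTH_USER = staging
AUTH_PASSWORD = s3cret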
You’ll find the content extraction options in the [EXTRACTION] section.
Should Crowl store links? Defaults to False to save storage space.
See this page of the docs for more details.
Should we remove duplicate links from the exports?
Defaults to True to preserve the historic behavior: only the first link from page A to page B is kept. If set to False, you’ll get all links in the exports, including duplicates.
Should Crowl store HTML source code? Defaults to False to save storage space.
See this page of the docs for more details.
Should Crowl store the HTTP request headers sent to the server? Defaults to False.
Should Crowl store the HTTP response headers received from the server? Defaults to False.
How deep should Crowl crawl your website? Defaults to 5.
Remember that Crowl crawls every page at a given depth level before moving on to the next.
How many URLs should Crowl crawl? Defaults to 0 (no limit).
Note that a few additional URLs can be crawled before the crawler stops completely.
Do you want to check which language is used on your pages?
Crowl uses the Fasttext language identification model. You can check the list of supported languages.
For this feature to work, you’ll need to download the model from Fasttext and place it in the data folder.
Future versions of Crowl might enable you to use your own model.
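As an illustration, an [EXTRACTION] section covering the options above could look like the sketch below. All key names here are illustrative placeholders (the real ones live in the sample configuration file); the values shown are the defaults described on this page.
[EXTRACTION]
LINKS = False
UNIQUE_LINKS = True
HTML = False
REQUEST_HEADERS = False
RESPONSE_HEADERS = False
DEPTH_LIMIT = 5
PAGE_LIMIT = 0
CHECK_LANG = False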
You can now configure a list of custom extractors to scrape the content of a specific part of the page.
Provide your extractors in the form of a list of dictionaries.
Each extractor in the list must provide the following info:
name: the name of your extractor, for you to easily retrieve it from the exported data,
type: we only support xpath extractors for now, but are planning to add regex and css support soon,
pattern: the actual pattern of the extractor.
Important note: for now, Crowl only retrieves the first encountered item matching an extractor.
Here is an example with two xpath extractors:
CUSTOM_EXTRACTORS = [
{"name":"author", "type":"xpath", "pattern":'//div[@class="post-info-wrapper"]/p/span[1]/text()'},
{"name":"publication_date", "type":"xpath", "pattern":'//div[@class="post-info-wrapper"]/p/span[2]/text()'},
]
The extracted data is stored in JSON format in the extractors column of the exports.
If an extractor cannot retrieve data on a given page, the corresponding field will be set to None in the export.
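To illustrate, with the two xpath extractors defined above, the extractors column for a crawled page could contain something like the line below (the values are made up, and a field with no match would be None):
{"author": "Jane Doe", "publication_date": "2023-04-12"}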
Check this page for more details.
As of v0.2, Crowl offers multiple output modes (called pipelines), set in the [OUTPUT] section.
There are currently two pipelines available; CSV is the default one.
Simply activate or deactivate a pipeline by uncommenting or commenting its line (a # at the beginning of the line disables it).
The number associated with each pipeline determines its priority: pipelines with lower numbers run first if you happen to use multiple pipelines at the same time.
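For instance, an [OUTPUT] section keeping only the CSV pipeline active could look like this (the exact syntax may differ slightly in your sample file, and the priority numbers are illustrative):
[OUTPUT]
crowl.CrowlCsvPipeline = 100
# crowl.CrowlMySQLPipeline = 200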
crowl.CrowlCsvPipeline: exports data to CSV files.
See this page of the docs for additional information.
crowl.CrowlMySQLPipeline: exports data to a MySQL database, with two tables.
For this pipeline to work, you’ll have to fill in the [MYSQL] section of the file too:
MYSQL_HOST is your MySQL server’s host (probably localhost),
MYSQL_PORT is your MySQL server’s port (probably 3306),
MYSQL_USER is the username you use to connect to the MySQL server,
MYSQL_PASSWORD is the user’s password.
See this page of the docs for additional information.
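A filled-in [MYSQL] section could then look like this (the username and password are placeholders):
[MYSQL]
MYSQL_HOST = localhost
MYSQL_PORT = 3306
MYSQL_USER = crowl
MYSQL_PASSWORD = s3cret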