Welcome to detectem’s documentation!¶
This documentation contains everything you need to know about detectem.
detectem is a passive software detector. Let’s see it in action.
$ det http://domain.tld
[{'name': 'phusion-passenger', 'version': '4.0.10'},
{'name': 'apache-mod_bwlimited', 'version': '1.4'},
{'name': 'apache-mod_fcgid', 'version': '2.3.9'},
{'name': 'jquery', 'version': '1.11.3'},
{'name': 'crayon-syntax-highlighter', 'version': '2.7.2'}]
Using a series of indicators, it can detect software running on a site and, in most cases, accurately extract its version information. It uses the Splash API to render the website and start the detection routine. It does full analysis on requests, responses and even on the DOM!
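If you want to consume the result from a script, the following is a minimal sketch. It assumes the command prints a Python-literal list of dictionaries exactly as in the sample run above; other versions may print a different format. The helper name parse_results is hypothetical and not part of detectem.

```python
import ast

# Sample output copied from the run shown above.
raw_output = """[{'name': 'phusion-passenger', 'version': '4.0.10'},
{'name': 'apache-mod_bwlimited', 'version': '1.4'},
{'name': 'apache-mod_fcgid', 'version': '2.3.9'},
{'name': 'jquery', 'version': '1.11.3'},
{'name': 'crayon-syntax-highlighter', 'version': '2.7.2'}]"""


def parse_results(text):
    """Turn the printed list of dicts into a {name: version} mapping."""
    results = ast.literal_eval(text)  # safe parsing of Python literals only
    return {item['name']: item['version'] for item in results}


versions = parse_results(raw_output)
print(versions['jquery'])
```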
There are two important articles to read:
Features¶
Detect software in modern web technologies.
Browser support provided by Splash.
Analysis on requests made and responses received by the browser.
Get software information from the DOM.
Match by file fingerprints.
Great performance (less than 10 seconds to get a fingerprint).
Plugin system to add new software easily.
Test suite to ensure plugin result integrity.
Continuous development to support new features.
Contributing¶
It’s easy to contribute. If you want to add a new plugin, follow the Plugin development guide and open a pull request at the official repository.
Documentation¶
Installation¶
Install Docker and add your user to the docker group so that you don’t need to use sudo.
Pull the image:
$ docker pull scrapinghub/splash
Create a virtual environment with Python >= 3.6.
Install detectem:
$ pip install detectem
Run it against some URL:
$ det http://domain.tld
Matchers¶
Matchers are in charge of extracting software information. detectem provides a different matcher for each kind of target.
Body Matcher¶
Format | |
Type | string |
Scope | All requests/responses except first one |
It applies a regular expression to the raw text of the response body. Its scope is every response body except the first one, since matching against the first one is highly prone to false positives. To select data from the first response, you should use an XPath matcher.
It’s usually used to extract data from comments.
Example¶
A website X uses a library called yobu and loads it from https://cdn.tld/yobu.js.
As you see, no version could be extracted using a URL matcher.
However, the response body contains some valuable information:
//! yobu v1.2.3
[...]
Then, it’s the perfect fit for a body matcher.
Let’s create a plugin to detect yobu.
from detectem.plugin import Plugin


class YobuPlugin(Plugin):
    name = 'yobu'
    matchers = [
        {'body': r'//! yobu v(?P<version>[0-9\.]+)'},
    ]
Then, when you run detectem on X, it will detect the presence of yobu and its version 1.2.3.
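To see what the regular expression does on its own, here is a minimal sketch that uses only Python’s re module. It is not detectem’s implementation: the match_body helper and the sample bodies are hypothetical, and skipping the first body only imitates the scope described above.

```python
import re

# Pattern taken from the plugin above.
BODY_PATTERN = r'//! yobu v(?P<version>[0-9\.]+)'


def match_body(pattern, bodies):
    """Return the first version found in any body except the first one."""
    for body in bodies[1:]:  # the first response is out of scope
        found = re.search(pattern, body)
        if found:
            return found.group('version')
    return None


bodies = [
    '<html><!-- //! yobu v9.9.9 --></html>',  # first response: ignored
    '//! yobu v1.2.3\n(function(){})();',     # body of https://cdn.tld/yobu.js
]
version = match_body(BODY_PATTERN, bodies)
print(version)
```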
DOM Matcher¶
Format | |
Type | tuple |
Scope | DOM |
It operates on the DOM loaded by the browser. When the data is hard to parse as text (minification, bundles, etc.), it’s better to access the objects already loaded in the browser’s DOM than to parse them with regular expressions.
This matcher is useful to extract information from objects loaded in the DOM that could contain version information under some attribute.
Example¶
A website X uses a software called yobu. It’s loaded as part of a bundle file called assets.js that groups every JavaScript file used by the website. Moreover, this file is obfuscated and its content changes every time there’s a change because of an internal build process. There is no way to use a regular expression here.
However, in the browser’s JavaScript console you can see that there’s a Yobu object.
We will use a DOM matcher to extract that data. The first element of the tuple is a check_statement written in JavaScript. What should this statement give us? It asserts that the target object exists in the DOM before continuing with version extraction.
The second element is an extractor statement written in JavaScript, and it tries to access the attribute where the version data lies.
Finally, we are ready with our new matcher:
from detectem.plugin import Plugin


class YobuPlugin(Plugin):
    name = 'yobu'
    matchers = [
        # (check_statement, extractor statement), both written in JavaScript
        {'dom': ('window.Yobu', 'window.Yobu.version')},
    ]
Then, when you run detectem on X, it will detect the presence of yobu and its version 1.2.3.
Notes¶
Plugins use window as a prefix because the check statement won’t raise an error if the object doesn’t exist, it’s easier to emulate the browser in our test suite, and it avoids side effects in the presence of iframes.
Header Matcher¶
Format | |
Type | tuple |
Scope | First response |
It operates on response headers. As you might expect, it works only on the first response, since that response contains the headers sent by the website’s server.
It’s used to extract data exposed by the web server software and its stack. You could also dive into Set-Cookie headers to extract cookie information.
Example¶
A website X uses Apache HTTPd Server.
The response contains the following headers:
[...]
Server: Apache/2.4.25
[...]
We will use a header matcher to extract Apache’s version. First, we need to decide which header to look for. In this case, it’s the Server header.
from detectem.plugin import Plugin


class ApachePlugin(Plugin):
    name = 'apache'
    matchers = [
        {'header': ('Server', r'Apache/(?P<version>[0-9\.]+)')},
    ]
Then, when you run detectem on X, it will detect the presence of Apache and its version 2.4.25.
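The (header name, regular expression) tuple can be tried outside detectem with a small sketch like the one below. The match_header helper and the sample header dictionary are hypothetical; the sketch only illustrates the idea and is not how detectem evaluates the matcher internally.

```python
import re

# Tuple taken from the plugin above: (header name, version extractor).
HEADER_MATCHER = ('Server', r'Apache/(?P<version>[0-9\.]+)')


def match_header(matcher, headers):
    """Apply the extractor to the named header and return the version, if any."""
    name, pattern = matcher
    for key, value in headers.items():
        if key.lower() == name.lower():  # HTTP header names are case-insensitive
            found = re.search(pattern, value)
            return found.group('version') if found else None
    return None


headers = {'Content-Type': 'text/html', 'Server': 'Apache/2.4.25'}
apache_version = match_header(HEADER_MATCHER, headers)
print(apache_version)
```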
URL Matcher¶
Format | |
Type | string |
Scope | All requests/responses except first one |
It operates on the URLs of the requests and responses made by the browser when loading a website. The scope for this matcher is every request/response URL except the first one, since the first one is usually the URL of the website being analyzed.
Example¶
A website X uses a library called yobu and loads it from https://cdn.tld/yobu-1.2.3.js.
As you see, the version is present in the URL and we can extract it using a URL matcher.
Let’s create a plugin to detect yobu.
from detectem.plugin import Plugin


class YobuPlugin(Plugin):
    name = 'yobu'
    matchers = [
        {'url': r'/yobu-(?P<version>[0-9\.]+)\.js'},
    ]
Then, when you run detectem on X, it will detect the presence of yobu and its version 1.2.3.
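The trailing \.js in the pattern matters. The sketch below, which uses only Python’s re module, compares the pattern from the plugin above with a hypothetical variant that omits the suffix. The match_urls helper and the URL list are assumptions made for illustration, not detectem code.

```python
import re

STRICT = r'/yobu-(?P<version>[0-9\.]+)\.js'  # pattern from the plugin above
LOOSE = r'/yobu-(?P<version>[0-9\.]+)'       # hypothetical variant without the suffix


def match_urls(pattern, urls):
    """Return the version captured from the first matching URL, skipping urls[0]."""
    for url in urls[1:]:  # the first URL is the analyzed site itself
        found = re.search(pattern, url)
        if found:
            return found.group('version')
    return None


urls = [
    'http://domain.tld/',             # first request: out of scope
    'https://cdn.tld/yobu-1.2.3.js',
]
strict_version = match_urls(STRICT, urls)
loose_version = match_urls(LOOSE, urls)
print(strict_version, loose_version)
```

With the suffix, the greedy character class backtracks and captures 1.2.3. Without it, the capture also swallows the dot before js and yields 1.2.3. with a trailing dot, so anchoring the pattern on the text that follows the version is worth the extra characters.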
XPath Matcher¶
Format | |
Type | tuple |
Scope | First response |
It operates on the first response. Since regular expressions are unsuitable for the first response body, it’s better to use XPaths, which are context-aware.
This matcher is useful for extracting version information from meta tags, tag attributes or HTML comments. Embedded JavaScript and inline declarations aren’t available to the XPath matcher because of embedded inline split.
Example¶
A website X uses a software called yobu. It doesn’t load any resource that could help identify the version of yobu, but it adds a meta tag to the page that contains its version.
It looks like:
<meta name="generator" content="yobu 1.2.3">
We will use an XPath matcher to extract that data. The first element of the tuple is an XPath. What should this XPath give us? A string to which we can apply our version extractor string. In this case, our goal is to get yobu 1.2.3. An XPath capable of doing this is: //meta[@name='generator']/@content.
That is enough, but since this case is so common, we’ve added a helper named meta_generator that works very well in this scenario. In this case, it should be called as meta_generator('yobu').
The second element is our well-known version extractor string. Finally, we are ready with our new matcher:
from detectem.plugin import Plugin
from detectem.plugins.helpers import meta_generator


class YobuPlugin(Plugin):
    name = 'yobu'
    matchers = [
        {'xpath': (meta_generator('yobu'), r'(?P<version>[0-9\.]+)')},
    ]
Then, when you run detectem on X, it will detect the presence of yobu and its version 1.2.3.
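The two steps (select a string with an XPath, then apply the version extractor) can be sketched with the standard library alone. This is an illustration under stated assumptions, not detectem’s code: the sample page is hypothetical and written as well-formed XML so that xml.etree.ElementTree can parse it, and because ElementTree supports only a subset of XPath (no /@content step), the sketch selects the element first and then reads its attribute.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical first response, written as well-formed XML for ElementTree.
PAGE = """<html>
  <head>
    <meta name="generator" content="yobu 1.2.3" />
    <title>X</title>
  </head>
  <body></body>
</html>"""


def match_meta_generator(page, extractor):
    """Select the generator meta tag, then apply the version extractor."""
    root = ET.fromstring(page)
    node = root.find(".//meta[@name='generator']")  # step 1: XPath selection
    if node is None:
        return None
    found = re.search(extractor, node.get('content', ''))  # step 2: extractor
    return found.group('version') if found else None


xpath_version = match_meta_generator(PAGE, r'(?P<version>[0-9\.]+)')
print(xpath_version)
```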
Most matchers use an argument called extractor. Depending on its value, it could extract:
Presence¶
If extractor doesn’t have a named group or doesn’t exist, the matcher only checks plugin presence.
Version extraction¶
In this case, the extractor’s regular expression has a named group called version.
Name extraction¶
Some projects like AngularJS have modules that can be included to add functionality. The issue is that both the core library and its modules have the same signature for the version, so the software module has to be determined too.
In this case, the extractor’s regular expression has a named group called name.
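The three cases above can be sketched as a single function that inspects the named groups of a match. This is an illustrative sketch, not detectem’s implementation; the classify helper, the returned dictionaries and the sample URLs (including the AngularJS module URL) are all hypothetical.

```python
import re


def classify(extractor, text):
    """Report what an extractor yields: presence only, a version, or a name."""
    found = re.search(extractor, text)
    if found is None:
        return None  # no match at all: the software is not detected
    groups = found.groupdict()
    if 'version' in groups:
        return {'type': 'version', 'value': groups['version']}
    if 'name' in groups:
        return {'type': 'name', 'value': groups['name']}
    return {'type': 'presence'}  # matched, but no named group


presence = classify(r'/yobu\.js', 'https://cdn.tld/yobu.js')
version = classify(r'/yobu-(?P<version>[0-9\.]+)\.js',
                   'https://cdn.tld/yobu-1.2.3.js')
name = classify(r'/angular-(?P<name>[a-z]+)\.js',
                'https://cdn.tld/angular-route.js')
print(presence, version, name)
```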
Plugin development¶
A plugin is the component in charge of detecting one piece of software and its version. Since a piece of software can have many different signatures, every plugin has associated test files to ensure version integrity and to allow new signatures to be added without breaking the working ones.
Let’s see how to write your own plugin.
Requirements¶
The plugin has to:
Be compliant with the IPlugin interface.
Be a subclass of Plugin.
Have a test file at tests/plugins/fixtures/<plugin_name>.yml.
To make it faster, there’s a script called add_new_plugin.py which creates both the plugin and the test file.
$ python scripts/add_new_plugin.py --matcher=url example
Created plugin file at detectem/detectem/plugins/example.py
Created test file at detectem/tests/plugins/fixtures/example.yml
Plugin file¶
We’re creating an example plugin for a fictitious software called examplelib. We can detect it easily since it’s included as an external library and its URL contains the version. So we will use the URL matcher for this case.
from detectem.plugin import Plugin


class ExamplePlugin(Plugin):
    name = 'example'
    homepage = 'http://example.org'
    matchers = [
        # Raw string so that \. reaches the regex engine unchanged
        {'url': r'/examplelib\.v(?P<version>[0-9\.]+)-min\.js$'},
    ]
See the Matchers page to learn about the matchers available for writing your own plugin.
Test file¶
This is the test file for our example plugin:
- plugin: example
  matches:
    - url: http://domain.tld/examplelib.v1.1.3-min.js
      version: 1.1.3
Then running the test is simple:
$ pytest tests/plugins/test_common.py --plugin example
When you need to support a new signature that the current matchers don’t cover, modify your plugin file and add a new test to the list, so you can check that your changes don’t break previously detected versions.
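Before running the test suite, you can sanity-check a pattern against its expected matches with a few lines of plain Python. The sketch below is an assumption-laden illustration, not how the project’s tests are implemented: the fixture entries are written as a Python list instead of being loaded from the YAML file, the second entry is hypothetical, and failing_cases is a made-up helper.

```python
import re

# Pattern from the example plugin above.
PLUGIN_URL_PATTERN = r'/examplelib\.v(?P<version>[0-9\.]+)-min\.js$'

# Python equivalent of the fixture entries; the second one is hypothetical.
FIXTURE_MATCHES = [
    {'url': 'http://domain.tld/examplelib.v1.1.3-min.js', 'version': '1.1.3'},
    {'url': 'http://domain.tld/examplelib.v1.2.0-min.js', 'version': '1.2.0'},
]


def failing_cases(pattern, matches):
    """Return (url, expected, found) for every case the pattern gets wrong."""
    failures = []
    for case in matches:
        found = re.search(pattern, case['url'])
        version = found.group('version') if found else None
        if version != case['version']:
            failures.append((case['url'], case['version'], version))
    return failures


failures = failing_cases(PLUGIN_URL_PATTERN, FIXTURE_MATCHES)
print(failures)
```

An empty list means every expected version was extracted; any tuple in the list points at a signature that the pattern no longer covers.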
References¶
interface detectem.plugin.IPlugin
homepage: Plugin homepage.
matchers: List of matchers.
name: Name to identify the plugin.
Tags to categorize plugins.