Welcome to detectem’s documentation!

This documentation contains everything you need to know about detectem.

detectem is a passive software detector. Let’s see it in action.

$ det http://domain.tld
[{'name': 'phusion-passenger', 'version': '4.0.10'},
 {'name': 'apache-mod_bwlimited', 'version': '1.4'},
 {'name': 'apache-mod_fcgid', 'version': '2.3.9'},
 {'name': 'jquery', 'version': '1.11.3'},
 {'name': 'crayon-syntax-highlighter', 'version': '2.7.2'}]

Using a series of indicators, it’s able to detect software running on a site and, in most cases, accurately extract its version information. It uses the Splash API to render the website and start the detection routine. It does a full analysis of requests, responses and even the DOM!

There are two important articles to read:

Features

  • Detect software in modern web technologies.

  • Browser support provided by Splash.

  • Analysis on requests made and responses received by the browser.

  • Get software information from the DOM.

  • Match by file fingerprints.

  • Great performance (less than 10 seconds to get a fingerprint).

  • Plugin system to add new software easily.

  • Test suite to ensure plugin result integrity.

  • Continuous development to support new features.

Contributing

It’s easy to contribute. If you want to add a new plugin, follow the Plugin development guide and open a pull request at the official repository.

Documentation

Installation

  1. Install Docker and add your user to the docker group so that you don’t need to use sudo.

  2. Pull the image:

    $ docker pull scrapinghub/splash
    
  3. Create a virtual environment with Python >= 3.6.

  4. Install detectem:

    $ pip install detectem
    
  5. Run it against some URL:

    $ det http://domain.tld
    

Matchers

Matchers are in charge of extracting software information. detectem has different matchers, each aimed at a different target.

Body Matcher

Format

extractor=string

Type

string

Scope

All requests/responses except first one

It applies a regular expression to the raw text of the response body. Its scope is every response body except the first one, since matching against the first one is highly prone to false positives. To select data from the first response, you should use an XPath matcher.

It’s usually used to extract data from comments.

Example

A website X uses a library called yobu and loads it from https://cdn.tld/yobu.js. As you can see, no version can be extracted using a URL matcher. However, the response body contains some valuable information:

//! yobu v1.2.3
[...]

Then, it’s the perfect fit for a body matcher. Let’s create a plugin to detect yobu.

from detectem.plugin import Plugin

class YobuPlugin(Plugin):
    name = 'yobu'
    matchers = [
      {'body': r'//! yobu v(?P<version>[0-9\.]+)'},
    ]

Then, when you run detectem on X, it will detect the presence of yobu and its version 1.2.3.
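
To see what the pattern captures, here is a minimal sketch that applies the same regular expression with Python’s re module. It does not go through detectem itself: the sample body and the use of re.search are assumptions made only for illustration.

```python
import re

# Same pattern as in the body matcher of the plugin.
BODY_PATTERN = r'//! yobu v(?P<version>[0-9\.]+)'

# Hypothetical response body for https://cdn.tld/yobu.js.
body = "//! yobu v1.2.3\n(function(){ /* minified code */ })();"

# Search the raw text and read the named group, if there is a match.
match = re.search(BODY_PATTERN, body)
version = match.group('version') if match else None
print(version)
```

The named group version is what carries the version string; without it, a match would only tell you that yobu is present.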

DOM Matcher

Format

(check_statement=string:required, extractor=string:optional)

Type

tuple

Scope

DOM

It operates on the DOM loaded by the browser. When the data representation gets in the way (minification, bundles, etc.), it’s better to access the objects already loaded in the browser’s DOM than to try to parse them with regular expressions.

This matcher is useful to extract information from objects loaded in the DOM that could contain version information under some attribute.

Example

A website X uses a software called yobu. It’s loaded as part of a bundle file called assets.js that groups every Javascript file used by the website. Moreover, this file is obfuscated and its content changes on every modification because of an internal build process.

No way to use a regular expression here. However, in the browser’s Javascript console you can see that there’s a Yobu object.

_images/browser_js_console.png

We will use a DOM matcher to extract that data. The first element of the tuple is a check_statement written in Javascript. What should this statement do? It asserts that the target object exists in the DOM, so that version extraction can continue.

The second element is an extractor statement written in Javascript that tries to access the attribute where the version data lies. Finally, we are ready with our new matcher:

from detectem.plugin import Plugin


class YobuPlugin(Plugin):
    name = 'yobu'
    matchers = [
        {'dom': ('window.Yobu', 'window.Yobu.version')},
    ]

Then, when you run detectem on X, it will detect the presence of yobu and its version 1.2.3.

Notes

The plugins use window as a prefix for three reasons: the check statement won’t raise an error if the object doesn’t exist, it’s easier to emulate the browser in our test suite, and it avoids side effects in the presence of iframes.

Header Matcher

Format

(header=string:required, extractor=string:optional)

Type

tuple

Scope

First response

It operates on response headers. As you might expect, it works only on the first response, since that is the one containing the headers sent by the website’s server.

It’s used to extract data exposed by the web server software and its stack. You could also dive into Set-Cookie headers to extract cookie information.

Example

A website X uses Apache HTTP Server. The response contains the following headers:

[...]
Server: Apache/2.4.25
[...]

We will use a header matcher to extract Apache’s version. First, we need to decide which header to look for. In this case, it’s the header Server.

from detectem.plugin import Plugin


class ApachePlugin(Plugin):
    name = 'apache'
    matchers = [
        {'header': ('Server', r'Apache/(?P<version>[0-9\.]+)')},
    ]

Then, when you run detectem on X, it will detect the presence of Apache and its version 2.4.25.
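
The two elements of the tuple can be read as "which header to look at" and "what to extract from its value". The sketch below mimics that with plain Python; the headers dictionary and the extract_version helper are hypothetical and are not part of detectem’s API.

```python
import re

HEADER_NAME = 'Server'
HEADER_PATTERN = r'Apache/(?P<version>[0-9\.]+)'

# Hypothetical headers of the first response.
headers = {'Content-Type': 'text/html', 'Server': 'Apache/2.4.25'}


def extract_version(headers, name, pattern):
    """Return the captured version, or None if the header is missing or doesn't match."""
    value = headers.get(name)
    if value is None:
        return None
    match = re.search(pattern, value)
    return match.group('version') if match else None


print(extract_version(headers, HEADER_NAME, HEADER_PATTERN))
```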

URL Matcher

Format

extractor=string

Type

string

Scope

All requests/responses except first one

It operates on the URLs of the requests and responses made by the browser when loading a website. The scope for this matcher is every request/response URL except the first one, since the first one is usually the URL of the website being analyzed.

Example

A website X uses a library called yobu and loads it from https://cdn.tld/yobu-1.2.3.js. As you can see, the version is present in the URL and we can extract it using a URL matcher. Let’s create a plugin to detect yobu.

from detectem.plugin import Plugin

class YobuPlugin(Plugin):
    name = 'yobu'
    matchers = [
      {'url': r'/yobu-(?P<version>[0-9\.]+)\.js'},
    ]

Then, when you run detectem on X, it will detect the presence of yobu and its version 1.2.3.
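
The scope rule (skip the first URL, test the rest) can be illustrated with a short sketch. It only uses Python’s re module; the list of URLs and the find_version helper are made up for this example and do not reflect how detectem collects requests internally.

```python
import re

URL_PATTERN = r'/yobu-(?P<version>[0-9\.]+)\.js'

# Hypothetical URLs requested while loading the page.
# The first entry is the website itself and is out of scope.
urls = [
    'http://domain.tld/',
    'https://cdn.tld/style.css',
    'https://cdn.tld/yobu-1.2.3.js',
]


def find_version(urls, pattern):
    """Return the first version captured from any URL except the first one."""
    for url in urls[1:]:
        match = re.search(pattern, url)
        if match:
            return match.group('version')
    return None


print(find_version(urls, URL_PATTERN))
```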

XPath Matcher

Format

(xpath=string:required, extractor=string:optional)

Type

tuple

Scope

First response

It operates on the first response. Since regular expressions are ill-suited to the first response body, it’s better to use XPaths, which are context-aware.

This matcher is useful to extract version information from meta tags, tag attributes or HTML comments. Embedded Javascript or inline declarations aren’t available to the XPath matcher because of the embedded inline split.

Example

A website X uses a software called yobu. It doesn’t load any resource that could help identify the version of yobu, but it adds a meta tag to the page that contains its version. It looks like:

<meta name="generator" content="yobu 1.2.3" />

We will use an XPath matcher to extract that data. The first element of the tuple is an XPath. What should this XPath give us? A string to which we can apply our version extractor string. In this case, our goal is to get yobu 1.2.3.

An XPath capable of doing this is: //meta[@name='generator']/@content. That is enough, but since this case is so common, we’ve added a helper named meta_generator that works very well in this scenario. In this case, it should be called as meta_generator('yobu').

The second element is our well-known version extractor string. Finally, we are ready with our new matcher:

from detectem.plugin import Plugin
from detectem.plugins.helpers import meta_generator


class YobuPlugin(Plugin):
    name = 'yobu'
    matchers = [
        {'xpath': (meta_generator('yobu'), r'(?P<version>[0-9\.]+)')},
    ]

Then, when you run detectem on X, it will detect the presence of yobu and its version 1.2.3.
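
The two steps (select a string with an XPath, then apply the version extractor to it) can be sketched with the standard library alone. This is only an approximation: it assumes a well-formed page carrying a generator meta tag whose content is yobu 1.2.3, and because xml.etree.ElementTree supports just a subset of XPath, the attribute is read with .get() instead of /@content. It does not show how detectem or meta_generator work internally.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed first response. Real pages usually need an HTML parser.
page = (
    '<html><head>'
    '<meta name="generator" content="yobu 1.2.3" />'
    '</head><body></body></html>'
)

root = ET.fromstring(page)

# Step 1: select the element; ElementTree cannot select attributes via XPath,
# so the content attribute is fetched separately.
meta = root.find(".//meta[@name='generator']")
content = meta.get('content') if meta is not None else None

# Step 2: apply the version extractor to the selected string.
match = re.search(r'(?P<version>[0-9\.]+)', content or '')
version = match.group('version') if match else None
print(content, version)
```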

Most matchers use an argument called extractor. Depending on its value, it could extract:

Presence

If extractor doesn’t have a named parameter or doesn’t exist, the matcher only checks plugin presence.

Version extraction

For these cases the extractor has version as the named parameter for the regular expression.

Name extraction

Some projects like AngularJS have modules that can be included to add functionality. The issue is that both the core library and the module have the same signature for the version, so the software module has to be determined too.

For these cases extractor has name as the named parameter for the regular expression.
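
The three cases differ only in which named group, if any, the regular expression defines. The following sketch makes that distinction explicit; apply_extractor and the sample URLs are hypothetical and only illustrate the idea, not detectem’s actual implementation.

```python
import re


def apply_extractor(extractor, text):
    """Describe what an extractor finds in text, or return None if it doesn't match."""
    match = re.search(extractor, text)
    if match is None:
        return None
    groups = match.groupdict()
    if 'version' in groups:
        return {'type': 'version', 'value': groups['version']}
    if 'name' in groups:
        return {'type': 'name', 'value': groups['name']}
    # No named group: the match only proves presence.
    return {'type': 'presence'}


# Presence: no named group in the pattern.
print(apply_extractor(r'/yobu\.js', 'https://cdn.tld/yobu.js'))
# Version extraction: named group "version".
print(apply_extractor(r'/yobu-(?P<version>[0-9\.]+)\.js', 'https://cdn.tld/yobu-1.2.3.js'))
# Name extraction: named group "name" (the module naming is made up).
print(apply_extractor(r'/yobu-(?P<name>[a-z]+)\.js', 'https://cdn.tld/yobu-router.js'))
```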

Plugin development

A plugin is the component in charge of detecting one piece of software and its version. Since software can have many different signatures, every plugin has associated test files to ensure version integrity and to allow adding new signatures without breaking the working ones.

Let’s see how to write your own plugin.

Requirements

The plugin has to:

  • Be compliant with the IPlugin interface.

  • Be a subclass of Plugin.

  • Have a test file at tests/plugins/fixtures/<plugin_name>.yml.

To make it faster, there’s a script called add_new_plugin.py which creates both plugin and test file.

$ python scripts/add_new_plugin.py --matcher=url example

Created plugin file at detectem/detectem/plugins/example.py
Created test file at detectem/tests/plugins/fixtures/example.yml

Plugin file

We’re creating an example plugin for a fictitious software called examplelib. We can detect it easily since it’s included as an external library and its URL contains the version. So we will use the URL matcher for this case.

from detectem.plugin import Plugin


class ExamplePlugin(Plugin):
    name = 'example'
    homepage = 'http://example.org'
    matchers = [
        {'url': r'/examplelib\.v(?P<version>[0-9\.]+)-min\.js$'},
    ]

See the Matchers page for the matchers available to write your own plugin.

Test file

This is the test file for our example plugin:

- plugin: example
  matches:
    - url: http://domain.tld/examplelib.v1.1.3-min.js
      version: 1.1.3

Then running the test is simple:

$ pytest tests/plugins/test_common.py --plugin example

When you need to support a new signature that the current matchers don’t cover, modify your plugin file and add a new test to the list to check that your changes don’t break previously detected versions.
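
Before running the real test suite, it can help to check a pattern against the expected url/version pairs by hand. The sketch below does that with plain Python. The check helper is hypothetical, the second test case is invented for illustration, and none of this replaces the pytest command shown above.

```python
import re

# Pattern from the example plugin's URL matcher.
URL_MATCHER = r'/examplelib\.v(?P<version>[0-9\.]+)-min\.js$'

# Mirrors the structure of the test file; the second entry is a made-up extra case.
test_matches = [
    {'url': 'http://domain.tld/examplelib.v1.1.3-min.js', 'version': '1.1.3'},
    {'url': 'http://domain.tld/examplelib.v2.0-min.js', 'version': '2.0'},
]


def check(pattern, cases):
    """Return a list of (url, expected, found) tuples for every failing case."""
    failures = []
    for case in cases:
        match = re.search(pattern, case['url'])
        found = match.group('version') if match else None
        if found != case['version']:
            failures.append((case['url'], case['version'], found))
    return failures


print(check(URL_MATCHER, test_matches))
```

An empty list means every expected version was extracted; any entry in the list points at a signature the pattern no longer covers.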

References

interface detectem.plugin.IPlugin

homepage

Plugin homepage.

matchers

List of matchers.

name

Name to identify the plugin.

tags

Tags to categorize plugins.

class detectem.plugin.Plugin

Class used by normal plugins. It implements IPlugin.