Architecture

ClearlyDefined is structured as a set of stateless microservices that can scale horizontally with ease. The services collaborate via REST APIs using shared service-to-service tokens and/or shared access to storage (e.g., blob storage). The website is React-based and talks almost solely to the service.

The system as a whole is made up of the following major subsystems (a rough sketch of how they fit together follows the list):

  • Website -- A relatively simple but quite useful React app that uses the Create React App framework, Redux, React-Bootstrap and a few other bits and pieces.
  • Service -- A Node based service that supports numerous REST APIs for accessing and searching definitions, getting and harvesting data, creating and managing curations, and more.
  • Crawler -- A horizontally scalable Node service that processes requests to harvest data from components using a variety of tools.
  • Definition store -- A store for computed component definitions.
  • Harvest store -- A store for raw harvest tool outputs. Nothing fancy here. Plain blob storage is fine.
  • Harvest queue -- A queue of requests for the crawlers to process.
  • Curation store -- A place to store and collaborate on curations of the harvested data. In practice this is a structured GitHub repo (e.g., https://github.com/clearlydefined/curated-data).
  • Tools -- Any number of openly available code/package analysis tools, such as ScanCode and FOSSology, as well as a few home-grown utilities.
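
One way to picture how these pieces fit together is sketched below. This is a rough sketch based only on the descriptions above; the exact wiring depends on which providers you configure.

```
Website
   |  (REST calls)
   v
Service <----> Definition store / Curation store (GitHub) / Harvest store
   |  queues harvest requests
   v
Harvest queue --> Crawler --> Tools (ScanCode, FOSSology, ...)
                     |
                     +--> Harvest store (raw tool output)
```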

Service

Unsurprisingly, the service is at the heart of the system. It supports a set of REST APIs that serve or manage the following (a sample request is sketched after the list):

  • Auth -- Token-based authentication and authorization using the GitHub OAuth API and GitHub teams for permissions
  • Definitions -- Get, list, search or investigate existing definitions in the system
  • Curations -- Get and create curations
  • Harvesting -- Get or queue traversals of components of various supported types
  • Origin scanning -- Support for searching and selecting components and versions from a wide range of systems such as GitHub, npmjs, and Maven Central
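
For illustration, here is a minimal sketch of fetching a definition from Node. It assumes Node 18+ (for the global fetch) and the public api.clearlydefined.io host; the lodash coordinates are just an example of the type/provider/namespace/name/revision pattern used elsewhere on this page, and the response fields shown are not exhaustive.

```js
// Minimal sketch: fetch the definition for a component by its coordinates.
// Coordinates follow type/provider/namespace/name/revision; unscoped npm
// packages use "-" as the namespace.
async function getDefinition(coordinates) {
  const response = await fetch(`https://api.clearlydefined.io/definitions/${coordinates}`);
  if (!response.ok) throw new Error(`definitions request failed: ${response.status}`);
  return response.json();
}

getDefinition('npm/npmjs/-/lodash/4.17.21')
  .then(definition => {
    // The definition aggregates harvested and curated data; log the license summary.
    console.log(definition.licensed);
  })
  .catch(console.error);
```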

The service process itself is completely stateless and can scale horizontally as needed. It does relatively little compute itself. The heaviest lifting it does is summarizing, aggregating and curating definitions from their constituent parts. That computation is only done once and then cached until invalidated by new data. So, most of the time the service is listing blobs or getting blobs and returning their content.

When you think about the service, think really simple. There is a pluggable provider mechanism so, for example, different storage providers (e.g., Azure blob, local file system, ...) can be configured in. The actual business logic is probably < 500 lines of Node code.
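
As a purely hypothetical illustration of that provider pattern (the class, method, and factory names below are invented for this sketch, not the service's actual interfaces), a definition store provider only needs a couple of operations, and the rest of the service can be written against that small contract:

```js
// Hypothetical sketch of a pluggable definition store provider.
const fs = require('fs/promises');
const path = require('path');

class FileDefinitionStore {
  constructor(root) {
    this.root = root;
  }

  // e.g. npm/npmjs/-/lodash/4.17.21 -> <root>/npm/npmjs/-/lodash/4.17.21.json
  _pathFor(coordinates) {
    return path.join(this.root, `${coordinates}.json`);
  }

  async get(coordinates) {
    try {
      return JSON.parse(await fs.readFile(this._pathFor(coordinates), 'utf8'));
    } catch (error) {
      if (error.code === 'ENOENT') return null; // not computed yet
      throw error;
    }
  }

  async store(coordinates, definition) {
    const file = this._pathFor(coordinates);
    await fs.mkdir(path.dirname(file), { recursive: true });
    await fs.writeFile(file, JSON.stringify(definition));
  }
}

// Configuration picks the provider, e.g. a local file store for development
// or an Azure blob store in production; other providers plug in the same way.
function createDefinitionStore(options) {
  switch (options.provider) {
    case 'file':
      return new FileDefinitionStore(options.location);
    default:
      throw new Error(`unknown definition store provider: ${options.provider}`);
  }
}

module.exports = { createDefinitionStore };
```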

Website

The website is a simple React app that uses the Create React App framework, Redux and React-Bootstrap. All of its content comes via the service, even when it is talking to GitHub or npmjs or Maven Central. This allows us to manage access tokens, do caching, and precompute results. This approach also simplifies client code and enables the easy creation of alternative front-ends with consistent functional behavior.

The app was put together by new React devs, so it is bound to have a number of less-than-optimal designs and approaches.

Badges

To retrieve a link to the badge image for your open source project, use the API endpoint /badges/:type/:provider/:namespace/:name/:revision

So, for example: /badges/git/github/expressjs/express/351396f971280ab79faddcf9782ea50f4e88358d

You can embed this into your open source project by putting the following markdown into your Readme. (Note: replace the variables with your project's information.)

![My ClearlyDefined Score](https://api.clearlydefined.io/badges/:type/:provider/:namespace/:name/:revision)
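
For example, using the express coordinates shown above, the embedded badge would look like this:

![My ClearlyDefined Score](https://api.clearlydefined.io/badges/git/github/expressjs/express/351396f971280ab79faddcf9782ea50f4e88358d)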

Deploying the ClearlyDefined service

Properties

AUTH_GITHUB_ORG

The name of the GitHub org the site uses for authenticating users. Membership in this org's teams is checked for permissions.

AUTH_CURATION_TEAM

The GitHub team whose members have permission to programmatically write to the curation repo for this environment (e.g., merge pull requests). If left unset, anyone can do these operations.

AUTH_HARVEST_TEAM

The GitHub team whose members have permission to programmatically queue requests to harvest data. That is, they can POST to the /harvest endpoint. If left unset, anyone can do these operations.

CURATION_GITHUB_OWNER

The GitHub user or org that owns the curation repo.

CURATION_GITHUB_REPO

The GitHub curation repo to use for curations. This repo is assumed to be owned by CURATION_GITHUB_OWNER.

CURATION_GITHUB_BRANCH

The GitHub curation repo branch to use for curations. For testing and development, feel free to use your own branch, but DO NOT use master unless you are running the production system (and you aren't, so do not use master).

CURATION_GITHUB_TOKEN

A GitHub Personal Access Token with the public_repo scope.

DEFINITION_AZBLOB_CONNECTION_STRING

The Azure blob storage connection string for the definition store.

DEFINITION_AZBLOB_CONTAINER_NAME

The name of the Azure blob container that holds computed definitions.

FILE_STORE_LOCATION

The location where harvested data, scan results, etc. are stored. If left unset, data is stored in c:\temp\cd on Windows and /tmp/cd on all other systems. This location is shared with other parts of the system.
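
Putting the properties above together, a minimal development configuration for the service might look something like the following. Every value is an illustrative placeholder; substitute your own org, repo, token, and storage settings.

```
# Illustrative development values only -- substitute your own.
AUTH_GITHUB_ORG=my-github-org
AUTH_CURATION_TEAM=curation-team
AUTH_HARVEST_TEAM=harvest-team
CURATION_GITHUB_OWNER=my-github-org
CURATION_GITHUB_REPO=curated-data-dev
CURATION_GITHUB_BRANCH=my-dev-branch            # anything but master
CURATION_GITHUB_TOKEN=<PAT with public_repo scope>
FILE_STORE_LOCATION=/tmp/cd                     # local file storage for development
# Or, to store computed definitions in Azure blob storage:
# DEFINITION_AZBLOB_CONNECTION_STRING=<connection string>
# DEFINITION_AZBLOB_CONTAINER_NAME=definitions-dev
```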

Deploying the ClearlyDefined crawler

Properties

SERVICE_ENDPOINT

The full origin of the service, e.g. http://domain.com:port.

WEBSITE_ENDPOINT

The full origin of the website/UI, e.g. http://domain.com:port.

AUTH_GITHUB_CLIENT_ID and AUTH_GITHUB_CLIENT_SECRET

If using an OAuth application for GitHub sign-on, set these to the client ID and client secret, respectively. If not provided, auth will fall back to CURATION_GITHUB_TOKEN.

CRAWLER_DEADLETTER_PROVIDER

The crawler's deadletter provider. If left unset, it defaults to the same provider as CRAWLER_STORE_PROVIDER.

CRAWLER_GITHUB_TOKEN

The crawler tries to figure out details of the packages and source being traversed using various GitHub API calls, and for that it needs an API token. This can be a Personal Access Token (PAT) or the token for an OAuth App; no special permissions are needed since only public data is accessed. Without this token, GitHub will severely rate-limit the crawler (as it should) and you won't get very far.

CRAWLER_STORE_PROVIDER

The crawler's store provider(s). If left unset, it defaults to cd(file). If multiple stores need to be set, concatenate them with "+", for example cdDispatch+cd(azblob)+webhook.

FILE_STORE_LOCATION

The location where harvested data, scan results, etc. are stored. If left unset, data is stored in c:\temp\cd on Windows and /tmp/cd on all other systems. This location is shared with other parts of the system.

CRAWLER_AZBLOB_CONNECTION_STRING

The Azure blob storage connection string for the crawler's store.

CRAWLER_AZBLOB_CONTAINER_NAME

The name of the Azure blob container that holds harvested data.

PORT

The port to listen on. Defaults to 3000, like a lot of other dev setups. Set this if you are running more than one service that uses that port.

SCANCODE_HOME

The directory where ScanCode is installed.
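
As with the service, a minimal development configuration for the crawler might look something like the following. Every value is an illustrative placeholder; adjust the endpoints, token, and paths for your setup.

```
# Illustrative development values only -- substitute your own.
SERVICE_ENDPOINT=http://localhost:4000
WEBSITE_ENDPOINT=http://localhost:3000
CRAWLER_GITHUB_TOKEN=<PAT, no special scopes needed>
CRAWLER_STORE_PROVIDER=cd(file)                 # the default; e.g. cdDispatch+cd(azblob)+webhook for multiple stores
FILE_STORE_LOCATION=/tmp/cd                     # shared with the service
PORT=5000                                       # only needed if 3000 is already taken
SCANCODE_HOME=/usr/local/scancode-toolkit
# Or, to store harvested data in Azure blob storage:
# CRAWLER_STORE_PROVIDER=cd(azblob)
# CRAWLER_AZBLOB_CONNECTION_STRING=<connection string>
# CRAWLER_AZBLOB_CONTAINER_NAME=harvest-dev
```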