Eclipse DataEggs

I developed and maintain Eclipse DataEggs, a project that provides datasets related to the development of Eclipse projects, mainly for software practitioners and researchers.

The datasets include various pieces of data retrieved from the Eclipse forge: mailing lists, project development data, and AERI stacktraces, all in handy CSV and JSON formats. Each dataset comes with R Markdown documents that describe its content and give hints about how to use it. The examples mainly use the R statistical analysis software.
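Because the datasets are plain CSV and JSON, they can also be read outside R. The following is a minimal Python sketch, using only the standard library, of loading a CSV extract and counting rows per value of a column. The column names (`list`, `sent_at`, `subject`) and the sample rows are made up for illustration and are not the actual schema of any DataEggs file; check each dataset's documentation for the real field names.

```python
import csv
import io
import json

# Hypothetical extract: column names and rows are invented for illustration
# and do not reflect the real schema of any published dataset.
SAMPLE_CSV = """list,sent_at,subject
example-dev,2020-01-15,Build failure on master
example-dev,2020-01-16,Re: Build failure on master
example-users,2020-01-16,How to configure the plugin
"""


def load_csv_records(handle):
    """Read a CSV file-like object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def count_by(records, column):
    """Count how many records share each value of the given column."""
    counts = {}
    for record in records:
        key = record[column]
        counts[key] = counts.get(key, 0) + 1
    return counts


# An in-memory string stands in for a downloaded file here.
records = load_csv_records(io.StringIO(SAMPLE_CSV))
per_list = count_by(records, "list")

# Serialise the summary as JSON, the other format used by the datasets.
print(json.dumps(per_list, indent=2))
```

With a downloaded file, the `io.StringIO(...)` argument would be replaced by a handle from `open(path, newline="")`, where `path` points at the local copy of the extract.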

The datasets provided include:

  • Mailing lists (full mboxes and CSV extracts) hosted at the Eclipse forge, with their documentation and examples.
  • AERI exception stacktraces (historical data only, no longer updated), comprising two datasets: problems (see documentation) and incidents (see documentation).
  • Development data from Eclipse projects. Depending on data sources available for each project, the following information is provided:
    • SCM (git).
    • ITS (Bugzilla, GitHub issues, GitLab issues).
    • CI (Jenkins).
    • PMI checks.
    • Stack Overflow statistics.
    • Scancode analysis (executed on our server).

Privacy has been a major concern from the beginning. Once extracted, data is anonymised using data-anonymiser and published in the downloads section of the project. See our documentation for more details.
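To give a rough idea of what anonymising an identifier before publication can look like, here is a generic Python sketch based on keyed hashing (HMAC-SHA256). It is not the algorithm used by data-anonymiser, whose behaviour is described in the project documentation; the function name `pseudonymise`, the secret key, and the sample addresses are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret: in practice it would be kept private and never
# published alongside the data.
SECRET_KEY = b"replace-with-a-private-random-key"


def pseudonymise(identifier, key=SECRET_KEY):
    """Map an identifier to a stable, opaque token using HMAC-SHA256.

    The same input always yields the same token, so records belonging to
    one author can still be grouped without exposing the original value.
    """
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Invented sample addresses, for illustration only.
alice = pseudonymise("alice@example.org")
alice_again = pseudonymise("alice@example.org")
bob = pseudonymise("bob@example.org")

print(alice == alice_again)  # same input, same token
print(alice == bob)          # different inputs, different tokens
```

The point of the keyed hash in this sketch is that tokens stay consistent across files, which keeps per-author analysis possible, while someone without the key cannot simply hash a known address and look it up in the published data.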

All data related to projects is retrieved from the Eclipse Alambic instance at https://eclipse.alambic.io. Alambic is an open-source framework for development data extraction and processing. For more information, see https://alambic.io.