Eclipse DataEggs

The Eclipse DataEggs project has been archived and is no longer maintained.

I developed and maintained Eclipse DataEggs, a project that provides datasets related to the development of Eclipse projects, mainly for software practitioners and researchers.


The datasets contain data retrieved from the Eclipse forge: mailing lists, project development data, and AERI stacktraces, all in handy CSV and JSON formats. Each dataset comes with R Markdown documents that describe its content and give hints on how to use it. The examples provided mainly use the R statistical analysis software.

The datasets provided include:

  • Mailing lists (full mboxes and CSV extracts) hosted at the Eclipse forge, with their documentation and examples.
  • AERI exception stacktraces (historical data only, no longer updated), split into 2 datasets: problems (see documentation) and incidents (see documentation).
  • Development data from Eclipse projects. Depending on data sources available for each project, the following information is provided:
    • SCM (git).
    • ITS (Bugzilla, GitHub issues, GitLab issues).
    • CI (Jenkins).
    • PMI checks.
    • Stack Overflow statistics.
    • Scancode analysis (executed on our server).
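
The mailing list datasets come both as full mboxes and as CSV extracts. As a rough illustration of how the two forms relate, the sketch below reads an mbox with Python's standard `mailbox` module and flattens a few headers into CSV. It is not the project's actual extraction process: the column names (`sender`, `subject`, `date`), the helper functions and the demo messages are all assumptions made up for the example, and the real extracts may use a different layout.

```python
import csv
import io
import mailbox
import os
import tempfile
from email.message import EmailMessage

# Hypothetical column names; the real CSV extracts may differ.
FIELDS = ["sender", "subject", "date"]


def mbox_to_rows(path):
    """Return one dict per message, keeping only a few common headers."""
    box = mailbox.mbox(path)
    try:
        return [
            {
                "sender": msg.get("From", ""),
                "subject": msg.get("Subject", ""),
                "date": msg.get("Date", ""),
            }
            for msg in box
        ]
    finally:
        box.close()


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def build_demo_mbox(path):
    """Write a tiny mbox with made-up messages so the sketch is self-contained."""
    box = mailbox.mbox(path)
    demo = [
        ("a1b2c3@example.org", "Build failure"),
        ("d4e5f6@example.org", "Re: Build failure"),
    ]
    for sender, subject in demo:
        msg = EmailMessage()
        msg["From"] = sender
        msg["Subject"] = subject
        msg["Date"] = "Mon, 01 Jan 2018 10:00:00 +0000"
        msg.set_content("demo body")
        box.add(msg)
    box.flush()
    box.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.mbox")
        build_demo_mbox(demo_path)
        print(rows_to_csv(mbox_to_rows(demo_path)))
```

With a downloaded dataset, `mbox_to_rows` would be pointed at the mbox file instead of the demo file. Since the published data is anonymised, sender values are expected to be opaque identifiers rather than real addresses.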

Privacy has been a major concern from the beginning. Once extracted, data is anonymised using data-anonymiser and published in the downloads section of the project. See our documentation for more details.

All data related to projects is retrieved from the Eclipse Alambic instance. Alambic is an open-source framework for development data extraction and processing.