I developed and maintain Eclipse DataEggs, a project that provides datasets related to the development of Eclipse projects, mainly for software practitioners and researchers.
The datasets include various pieces of data retrieved from the Eclipse forge: mailing lists, project development data, and AERI stacktraces, all in handy CSV and JSON formats. Each dataset comes with R Markdown documents that describe its content and provide hints on how to use it. The examples provided mainly use the R statistical analysis software.
The datasets provided include:
- Mailing lists (full mboxes and CSV extracts) hosted at the Eclipse forge, with their documentation and examples.
- AERI exception stacktraces (no longer updated, historical data only), split into 2 datasets: problems (see documentation) and incidents (see documentation).
- Development data from Eclipse projects. Depending on data sources available for each project, the following information is provided:
  - SCM (git).
  - ITS (Bugzilla, GitHub issues, GitLab issues).
  - CI (Jenkins).
  - PMI checks.
  - Stack Overflow statistics.
  - Scancode analysis (executed on our server).
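Although the bundled examples mainly use R, the CSV extracts can be read with any language. The following is a minimal Python sketch that counts messages per month. The sample data and the column names (`list`, `sent_at`, `sender_id`) are hypothetical stand-ins for illustration, so check the documentation of each dataset for its actual columns.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV extract.
# The column names (list, sent_at, sender_id) are assumptions, not the real schema.
SAMPLE_CSV = """list,sent_at,sender_id
example-dev,2019-01-03T10:15:00,a1f3
example-dev,2019-01-04T08:02:00,9c2e
example-dev,2019-02-11T17:40:00,a1f3
"""


def messages_per_month(csv_file, date_column="sent_at"):
    """Count rows per YYYY-MM month from an open CSV file object."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # ISO 8601 timestamps start with YYYY-MM, so the first 7 characters are the month.
        counts[row[date_column][:7]] += 1
    return dict(counts)


monthly = messages_per_month(io.StringIO(SAMPLE_CSV))
print(monthly)  # {'2019-01': 2, '2019-02': 1}
```

To use it on a real file, replace the `io.StringIO` object with `open(path, newline="")` and set `date_column` to the relevant column of the dataset.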
Privacy has been a major concern from the beginning. Once extracted, data is anonymised using data-anonymiser and published in the downloads section of the project. See our documentation for more details.
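To give an idea of what anonymising identifiers can look like, here is a generic Python sketch of keyed-hash pseudonymisation. It is an illustration under assumptions only and does not describe how data-anonymiser works internally. The function name `pseudonymise` and the key are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(value, secret_key):
    """Return a stable token for value, computed with HMAC-SHA256.

    The same input always maps to the same token, so activity can still be
    grouped per author, while the token is hard to link back to the original
    value without the key.
    """
    normalised = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalised, hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical key: in practice it stays private and is never published with the data.
key = b"replace-with-a-private-key"

a = pseudonymise("Jane.Doe@example.org", key)
b = pseudonymise("jane.doe@example.org ", key)
c = pseudonymise("john.roe@example.org", key)
print(a == b, a == c)  # True False
```

Keeping tokens stable across a dataset preserves the ability to study contributor behaviour without exposing who the contributors are.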
All data related to projects is retrieved from the Eclipse Alambic instance at https://eclipse.alambic.io. Alambic is an open-source framework for development data extraction and processing. For more information, see https://alambic.io.