Abundant Thesis Report

A Distributed Issue Tracker Designed for Individual Use
May 5, 2011
Abundant
Michael Diamond
Computer Science Undergraduate Student, Willamette University
Abstract
An issue tracker is a very powerful tool for a software developer working on a major project. However, setting up and using this software is currently prohibitively difficult for students, hobbyists, and small collaborations. The same used to be true of version control until the advent of distributed version control enabled individuals to track their source code on their own computers. I developed Abundant, an issue tracker designed to complement distributed version control and designed with individuals and small projects in mind.
Keywords
Software development, distributed version control, issue tracking, distributed issue tracking
A Discussion of Version Control and Distribution Models

Regardless of the paradigm or design philosophy used to run a programming project, software development centers on two key tools: version control, used to track changes in source code, and issue tracking, used to track bugs and other issues with the program. These two tools become more critical to the success of a project the larger it becomes. One person can often work without them, and a team of two can create a rigid enough system to function successfully without them as well; with three or more people, however, the complexity of managing changes and bugs grows quickly without specialized tools to do the heavy lifting automatically.
The benefits of these tools are immense, yet smaller projects rarely use them. Because the tools were originally designed for large corporations, the cost of setting up and maintaining the needed resources is often prohibitive in terms of time, expense, or knowledge. Setting up a Version Control System (VCS) and issue tracker properly involves installing and configuring clunky software, often on dedicated hardware, and knowing how to set up and maintain these tools: skills and resources not available to the lay programmer or student. This software model, called server-client, has been around for more than twenty years, and for much of that time it was the only option.
In recent years, there has been a shift away from the restrictive server-client model towards a free-form distributed model that is far more lightweight and easier to set up. 1997 saw the introduction of distributed version control with Code Co-op (Alwis and Sillito 2009), and six years later GNU Arch and Monotone made further headway into the field, but not until 2005 did it enter the mainstream, when Linus Torvalds released Git for the Linux kernel (Torvalds 2007). Ever since, major projects, both proprietary and open source, have been switching to distributed version control rapidly (Alwis and Sillito 2009).
Projects with tens of thousands of files and years of history, including Mozilla, Python, and Google Code, have all adopted a distributed paradigm. With distributed version control, projects can share code more easily than ever before, and enable asynchronous collaboration with individuals around the world (Torvalds 2007).
The parallel field of issue tracking, however, is far behind version control. Today, there is still no viable distributed bug tracker, where the user downloads the entire database of bugs and can do with it as they see fit (Corbet 2008). There is a lot of activity in this area, and this state of affairs is rapidly changing. Many small-scale distributed bug trackers have been built (A List of Distributed Bug Trackers 2010). Most are simply wrappers around text files, but some, like BugsEverywhere, are slowly developing release candidates that replicate locally much of the power of server-client bug trackers like Bugzilla.
Server-Client Model
The server-client model, where there is one tightly controlled central location for all data and metadata, is the original and still by far the most prevalent model for version and issue control. It has two primary advantages: space savings and security. Because the database of metadata is stored only on the server, the client only needs to download the data he cares about (in large projects this is often only a few files out of a multiple-gigabyte repository), and so less disk space is used. When the user first downloads his versions of the data, he downloads neither the metadata nor files he does not need, and so further saves time on network traffic. By contrast, a distributed system must download the entire repository before any work can be done and, since it stores the entire database locally, takes up far more disk space on each user's machine.
Secondly, and more importantly in many environments, the server-client model ensures absolute and highly granular control over who can access and change certain files (Torvalds 2007). This is considered necessary in some large corporations where certain users should have access to some files, while others should not. In a distributed environment, the only degree of control available is at the top level, all-or-nothing access, and dynamically or precisely controlling who has access to what is very challenging if not impossible.
The server-client model has a third advantage in that its age brings with it stability and ubiquity; however, with the growth of distributed systems this advantage is weakening with every passing day.
Distributed Model
Unlike the server-client model, a distributed data model spreads the entirety of the metadata amongst all users. Distributed version control has several very powerful advantages: there is no need for a centralized server or costly hardware and maintenance, and no need for access to such a server to have the tool's capabilities on hand. The repositories are entirely self-contained with the source code and operate cleanly on each user's computer. Because the entire repository is on the local machine, all the power of the VCS is available to each user; the user can review history and make changes and merges without needing push rights to the original repository or, for that matter, internet access at all. In a server-client system the user must make slow network requests to the server for all or nearly all operations, whereas the distributed system only incurs this cost once, when first cloning the repository.
Additionally, users are able to use all the powers of the tool on their local machine without fear of damaging the upstream repository or interrupting other users' workflows, as they can now make changes and commits locally. In a server-client model there is an expectation that all commits to the server are complete changes that do not damage anything (Kanat-Alexander 2008). This leads to users making fewer commits and ultimately struggling more on their local machines, for without being able to commit whenever they like, they lose much of the power a VCS is supposed to afford.
Being able to use a distributed system even without push access to the canonical source enables external users to work on their own versions of the project, making open source projects even more open. In a server-client model, the source may be available, but the history is often hidden or difficult to access. With a distributed VCS, the entirety of the project, not just the current version, is available for anyone to use.
The State of Distributed Issue Tracking

Unlike version control, bug tracking has not felt the draw of distributed access in the same way. Bug tracking is a staple of large corporate code bases, and as a consequence many major bug tracking utilities are targeted towards enterprise-level usage, not small projects. There are several promising projects moving towards distributed bug tracking, but there are also greater barriers. Most notably, the centralized model makes more sense for a bug tracker. Having source code differ amongst different instances of a repository isn't an issue, since that is usually the whole point: one person develops a feature, and not until they are completely satisfied with it do they try to share it with others. There is not necessarily a need for a central, canonical reference, though it often makes sense to have one. On the other hand, there is a great need for a canonical set of issues being tracked. When a developer discovers a bug, they don't want to log that bug only in their personal issue tracker; they want to share it with everybody, so the whole world is aware of it.
This means centralized issue tracking is a desirable model, but it does not mean a distributed model cannot work, or even be better than its centralized counterpart. A distributed bug tracker has several advantages, even beyond the general benefits of a distributed model. Most notably, the bugs are distributed along with the source code, meaning that the recorded issues and their state correlate directly with whichever version of the code one is working with. With distributed source code but a centralized issue system, it is highly unlikely that the version of the code an individual is looking at actually matches the state of the bug tracker. And, like a distributed VCS, a distributed issue tracker would be much more readily available to the many people who cannot expend the time and effort to use a cumbersome centralized bug tracker.
A successful distributed issue tracker needs to merge the best of a centralized system (a canonical version with easy reporting of new issues) with the benefits of a distributed system (users can maintain their own changes and additions to issues until they are ready to share them with others).
At the same time, such an issue tracker needs to attempt to solve some of the problems with current software, such as controlling the complexity of bug submissions (Kanat-Alexander 2008).
Project
I developed a distributed issue tracker called Abundant, specifically targeted at students and small groups, but designed with the capability to grow to larger projects. Though there are several distributed issue trackers already, most notably BugsEverywhere, there has been very limited adoption. Abundant addresses some of the criticisms of major bug trackers today (Kanat-Alexander 2008) and was developed to be both easily adopted by lay users and powerful enough under the hood to support the needs of growing teams. Specifically, the following components have been or will be developed:
A command line interface which gives the user access to all functionality
This extends my previous work building a lightweight bug tracker, called b, as a Mercurial extension. The underlying data structures and functionality were greatly extended, but the design principles remain the same. Among many other features, users are able to report bugs, associate them with files and lines of code in the same repository, assign bugs to different users, and track the status of bugs over time.
One of the key goals of Abundant is to keep the set of available tools very simple at first, and to enable new functionality organically as it is needed by users. For instance, a project with one user has no need to assign bugs or track ownership, and the user should not be concerned by such things. But if a second developer is added, Abundant automatically and immediately recognizes this new user and adjusts the interface accordingly. Furthermore, Abundant allows for very freeform workflows; users can treat it as nothing more than a glorified todo list, knowing that if requirements change, it will transform into a powerful issue tracker with minimal configuration. By hiding unhelpful metadata and functionality from users, Abundant helps address the criticism that bug reporting is too complicated.
This command line interface is VCS agnostic, which is to say it functions perfectly well without residing inside a repository. However, the goal is to keep bugs tracked by a DVCS, and as such it is aware of major version control tools and works with them to make the user experience more fluid. At present Abundant works well with Mercurial, and hopefully it will work with Git and Bazaar in the near future. Abundant is designed so that it can easily be extended to support additional VCSs when desired.
Abundant is also easy to install, requiring only that the user have Python 3 installed and configured. Once Python is installed and the source code for Abundant is downloaded, it can run straight from the source directory, and the ab command can be added to the system path to be run elsewhere.
Future: A web interface for browsing, submitting, and editing bugs
Abundant will provide a powerful web interface which can serve as the canonical bug database. This is essential for non-developers to report bugs and browse the state of bugs associated with a standard version of the codebase. It should also be very easy to launch on an individual repository, providing a web-based graphical interface to the bug database in each user's repository. There will be customizable security controls on the web interface, allowing unprivileged users to see and perhaps report bugs, but go no further.
While Abundant works fully at the command line, using it this way requires the user to have technical knowledge of version control and Abundant, as well as developer access to submit changes to the codebase. As such, Abundant is presently limited to groups that expect only developers to report issues. While its absence is not crippling, the web interface is essential to adoption in environments where non-developers would want to report bugs.
Potentially: An Eclipse plugin to the command line tools
Providing useful information and access to the bug database from within common IDEs would be another powerful tool, as it would enable developers to work smoothly with both their code and their issues at the same time. Users could see bugs by file and even by line, and update bug state and other metadata while browsing and editing the code. Though useful, lacking this functionality would be no more than a minor inconvenience, so this feature will only be developed if there is time.
Implementation
Abundant is implemented in Python 3 and released under the GPL version 3, with a design target of as few dependencies as possible. It involved exploring many areas I did not already understand well, notably building custom, powerful, and efficient data structures for the backend of the tracker and developing a large-scale command line program. The open source Mercurial codebase was used as a structural model, and was integral to solving many problems that came up over time. In particular, Mercurial's elegant technique for mapping individual commands to internal functions was largely replicated.
In keeping with the design paradigms of b, much of Abundant is designed around the idea of prefix lookup: you type in a unique prefix of the ID for a given issue, and that is enough to refer to it. Similarly, for usernames, commands, and special metadata, all the user has to do is supply a unique prefix of that object, and Abundant will fill in the rest. This both minimizes the user's data input burden and helps ensure consistency throughout the database.
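The two ideas above, a table that maps command names to functions and lookup by unique prefix, can be illustrated together. The following is a minimal sketch, not Abundant's or Mercurial's actual code: the table layout, the function bodies, and the names TABLE, find_command, and dispatch are assumptions made for illustration. Only the command names new, details, duplicate, and help are taken from this report.

```python
class UnknownCommand(Exception):
    """No command starts with the given prefix."""


class AmbiguousCommand(Exception):
    """More than one command starts with the given prefix."""


# Hypothetical command bodies; each returns a string so the sketch is easy to test.
def new(args):
    """Report a new issue; the arguments form its title."""
    return "new issue: " + " ".join(args)


def details(args):
    """Show the details of an issue."""
    return "details of " + args[0]


def duplicate(args):
    """Mark an issue as a duplicate of a previous issue."""
    return args[0] + " duplicates " + args[1]


def help_(args):
    """List every command with its one-line description."""
    width = max(len(name) for name in TABLE)
    return "\n".join(
        f"{name.ljust(width)}  {(func.__doc__ or '').strip()}"
        for name, func in sorted(TABLE.items())
    )


# The command table: registering a function here is all it takes to make it
# callable from the command line and visible in the help output.
TABLE = {"new": new, "details": details, "duplicate": duplicate, "help": help_}


def find_command(prefix):
    """Return the one command name that starts with prefix."""
    if prefix in TABLE:  # an exact name always wins
        return prefix
    matches = sorted(name for name in TABLE if name.startswith(prefix))
    if not matches:
        raise UnknownCommand(prefix)
    if len(matches) > 1:
        raise AmbiguousCommand(f"{prefix}: {', '.join(matches)}")
    return matches[0]


def dispatch(argv):
    """Run the command named (or prefixed) by argv[0] on the remaining arguments."""
    return TABLE[find_command(argv[0])](argv[1:])
```

With this table, dispatch(["n", "Crash", "on", "save"]) runs new, while the prefix d is rejected as ambiguous because both details and duplicate start with it. Because the help text is generated from the table and the docstrings, a newly registered command documents itself.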
Development notes:
Issue storage:
January was spent formalizing the feature set and data storage methods. This primarily involved researching bug tracking usability, text-based data storage, and how best to store that data in a VCS. How exactly to store the issues in the system was a serious question with many factors weighing on it. I identified the following three aspects as necessary characteristics of whatever storage pattern was decided upon.
Be Human Readable: In order to best integrate with version control, the data format needs to be text based (as opposed to living in some sort of database) so that users can see the bugs and track changes directly in the version control system. Furthermore, version control works best with text content, as it is easier to compare changes manually.
Be Able to Store Metadata: A lesson learned from developing b is that metadata (assigned-to, bug status, etc.) needs to stick with the actual data, such as the issue description and comments. In b, metadata was stored separately in order to ease caching, but I believe it makes more sense to keep data cached for speed separate from all the actual data. The data structure therefore needs to store structured data that is machine readable while remaining human readable.
Be Fast to Parse: We want this to be scalable, and that means fast access to any bug. This can largely be done with untracked cache files, but assuming each bug is stored in its own file, each file should be fast to display without caching. Listing, browsing, and filtering can be assisted by some sort of caching or indexing mechanism that remains invisible to the user.
In order to meet all of these requirements, I decided to back Abundant with JSON, using one file per issue. JSON is an elegant solution because it uses minimal markup to clearly structure the content, and is a ubiquitous standard for data storage that other programs (like manual diff/merge programs) can likely work with.
Even when treated as plain text, it is easy to visually parse, allowing users to edit the files manually when necessary.
Prefix lookup:
Implementing the prefix lookup was complicated, but the result is fast and powerful. In b, prefixes are looked up in linear time by iterating over all the IDs in a file, which is fine for b but wasteful if we need to look up more than one, as is often the case with Abundant's more complex structure. I learned about the trie (pronounced "try" or "tree") data structure, a search tree organized by the shared prefixes of its entries, which was perfect for my use case. From the root, the children of any node all share a common prefix: the children of the root are grouped by their first character, the grandchildren by their second character, and so on. There are details that differ from implementation to implementation, but that is the conceptual premise of all tries. Tries operate in O(m) time, where m is the length of the prefix being looked up. This is very fast, and often faster than the "O(1)" hash table (which is actually also O(m) due to the hash function, and collisions make it more expensive) or the O(log n) binary search tree. This makes them a drastically more desirable choice than the linear search methodology of b and t.
After exploring several existing Python tries I decided to build my own, since none of the existing implementations were quite what I was looking for. Most notably, the "standard" trie stores the entire key, one character at a time, down the tree, and rebuilds the string upon request. For sparse tries (that is to say, long keys but short unique prefixes) this seemed fairly inefficient in both space and time, and so my trie only goes as deep as it needs to, and restructures as necessary.
In order to facilitate ease of use, my trie converts all keys and searches to lowercase, which is potentially dangerous in some use cases, but desirable for Abundant so users can look up data without regard to case. Imagine if, due to some platform-specific issue, an ID hash were stored with uppercase hex instead of lowercase. Users would never be able to find the issue, even though as far as hex is concerned they are entering the correct number.
Furthermore, my trie implements a notion of an alias, which is stored in the trie like any other value, but when retrieved does a dictionary lookup of the alias value and returns the result. For example, users can specify both a username and an email address in angle brackets.
It would be reasonable for users to be able to reference each other by a prefix of either their name or their email address. As such, Abundant adds the user string (username <email@address>) to the user trie, and then makes an alias from the email address to the user string, enabling prefix lookup on either value, transparently to the user. It's not quite a general-case trie, but I've tried to keep specific features like lowercasing and aliases abstracted from the underlying data structure, so that a more traditional trie could be constructed fairly trivially. All in all, I'm proud of developing this structure, and pleased with the robustness of this new feature.
Dynamic command addition:
Version 0.1 was declared in March, when the help command was able to output support information for all existing commands and, more importantly, could do so dynamically: simply creating a new function and adding it to the table of commands allowed Abundant to include it in the list of acceptable commands and to generate support content for it automatically. While not quite ready for real use (it lacked some simple functionality, like the ability to mark an issue resolved), this was an important milestone, as it meant the underlying structure was largely complete, and development could begin in earnest on adding functionality.
Dogfooding:
Eating your own dog food, or dogfooding, is the practice of using the product you're developing or selling. This is advantageous both as a marketing statement (if you expect other people to buy or use your product, using it yourself is a statement of confidence in the product) and as a development practice (by having to use the product, developers catch bugs that will affect users faster, and can think more like users when planning new features and improvements). In April, the version number was upgraded to Version 0.2, when Abundant was deemed robust enough to start tracking its own issues. Up until this point an external list had been used, but with additional testing and the inclusion of commands like resolve and duplicate (to mark an issue complete, and to mark it a duplicate of a previous issue, respectively) Abundant was ready for prime time, and has been successfully tracking known issues with the project since Version 0.2 was rolled out.
Working demo:
In time for Student Scholarship Recognition Day I had put together a working demo and a presentation detailing my project. I was very pleased to see Abundant come together so well, as I had not actually done a large-scale run-through of the major functionality all in one go before. Amusingly, I discovered in the early morning on the day of the presentation that I had left out a fairly critical feature: the ability to assign issues to other users. This was itself an excellent demonstration of the robustness of the codebase, however, because within 10 minutes of discovering the problem a solution had been written, tested, and confirmed. Much of the delay in the early months of the project was due to ensuring Abundant was well structured internally, and the benefit of going down that route was apparent here.
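To make the prefix lookup note above more concrete, here is a minimal sketch of a trie with the three properties described there: keys are stored only as deep as needed to tell them apart, lookups ignore case, and an alias resolves to another entry. It is an illustration under assumptions, not Abundant's implementation: the class and method names (PrefixTrie, Lookup, Alias, matches, get), the example users, and the rule that an exact match wins over longer keys are all choices made for this sketch.

```python
class AmbiguousPrefix(KeyError):
    """More than one distinct entry starts with the requested prefix."""


class _Node:
    def __init__(self):
        self.children = {}  # next character -> _Node
        self.leaf = None    # (key, value) held at this node, if any


class PrefixTrie:
    """Trie that stores each key only as deep as needed to tell it apart."""

    def __init__(self):
        self._root = _Node()

    def insert(self, key, value):
        node, depth = self._root, 0
        while True:
            if node.leaf is not None and node.leaf[0] == key:
                node.leaf = (key, value)  # same key: overwrite the value
                return
            if node.leaf is None and not node.children:
                node.leaf = (key, value)  # unused node: park the key here
                return
            if node.leaf is not None and len(node.leaf[0]) > depth:
                # A longer key was parked here; restructure by pushing it down.
                old_key, old_value = node.leaf
                node.leaf = None
                child = node.children.setdefault(old_key[depth], _Node())
                child.leaf = (old_key, old_value)
            if len(key) == depth:
                node.leaf = (key, value)  # the key ends exactly at this node
                return
            node = node.children.setdefault(key[depth], _Node())
            depth += 1

    def matches(self, prefix):
        """Return every (key, value) pair whose key starts with prefix."""
        node = self._root
        for depth, char in enumerate(prefix):
            if node.leaf is not None and len(node.leaf[0]) > depth:
                # A parked key is the only entry at or below this node.
                return [node.leaf] if node.leaf[0].startswith(prefix) else []
            node = node.children.get(char)
            if node is None:
                return []
        found, stack = [], [node]
        while stack:  # everything below this node shares the prefix
            current = stack.pop()
            if current.leaf is not None:
                found.append(current.leaf)
            stack.extend(current.children.values())
        return found


class Alias:
    """Marker value that points at the key of another entry."""

    def __init__(self, target):
        self.target = target


class Lookup:
    """Case-insensitive prefix lookup with aliases, layered over PrefixTrie."""

    def __init__(self):
        self._trie = PrefixTrie()
        self._values = {}  # lowercase key -> stored value

    def add(self, key, value):
        self._values[key.lower()] = value
        self._trie.insert(key.lower(), value)

    def alias(self, alias_key, target_key):
        self._trie.insert(alias_key.lower(), Alias(target_key.lower()))

    def get(self, prefix):
        prefix = prefix.lower()
        targets = set()
        for key, value in self._trie.matches(prefix):
            target = value.target if isinstance(value, Alias) else key
            if key == prefix:
                return self._values[target]  # an exact match always wins
            targets.add(target)
        if not targets:
            raise KeyError(prefix)
        if len(targets) > 1:
            raise AmbiguousPrefix(prefix)
        return self._values[targets.pop()]


# Hypothetical users, stored as "username <email@address>" with an email alias.
users = Lookup()
users.add("Alice Example <alice@example.org>", "alice")
users.alias("alice@example.org", "Alice Example <alice@example.org>")
users.add("Alan Sample <sample@example.org>", "alan")
users.alias("sample@example.org", "Alan Sample <sample@example.org>")
print(users.get("ALI"))  # alice: the name and its email alias agree
print(users.get("sam"))  # alan: found through the email alias
```

Keeping the lowercase and alias handling in a wrapper leaves PrefixTrie usable as a plain trie, which mirrors the separation described in the note. One design choice worth flagging: when a prefix matches both a user string and its own alias, the sketch treats them as a single result rather than an ambiguity.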
Future Goals
It was never the goal to be finished with Abundant at the end of the term. This is a project I have enjoyed working on, and I intend to continue developing it after graduation, in the hope of seeing it become a more powerful project. As such, I have identified several goals I specifically intend to accomplish in the near future. Each of these goals corresponds to an increment of the minor version number, and accomplishing all of them would effectively indicate that Abundant is ready for a wide-scale 1.0 release.
Read-only web interface
We should replicate the functionality of Mercurial to serve up information on the Abundant database in a browser, either from a central server or from the local computer. This should be painless to do, and allow the user to quickly browse the current repository state.
Interactive web interface
Abundant is crippled until users can update issues without first pulling the source code down and then pushing it back up. As such, we need to expand the web interface to allow users to submit new issues.
Speed improvements
There is a notable startup delay at present, which is tolerable for now, but absolutely cannot continue. Standard commands, like init, new, and details, must execute nearly instantly. More complex commands like list and tasks can be slower, but still cannot be as susceptible to slowdown as they are now.
Advanced workflows
Currently Abundant is fairly free form in its workflow: issues can be created, changed, and closed by anyone. This is desirable for small projects, and remains a good default behavior; however, larger projects need greater control. For instance, it might be necessary for the person filing an issue to be the only one able to mark it resolved (user A files an issue and assigns it to user B; user B implements a fix and marks it as fixed; it gets reassigned to user A; user A checks that it is resolved and closes it). These more powerful workflows need to be as simple to set up as possible, and these design decisions have not been made yet.
Data Caching
As it is, Abundant does not cache any information, and has to poll each individual file in order to get necessary information, including something as simple as the issue's title. We need to implement caching which is transparent and failsafe (the user doesn't need to know it exists, and when it's outdated or broken, the system still runs, albeit slower) so that searches and lookups can be run faster. Examples include getting titles of related issues without having to load whole issues, and constructing sorted data structures to filter the list command faster.
Unit and Regression Testing
There may be some rudimentary testing done before graduation; however, if this project is to grow, it will be important to develop a standardized and stable testing suite to ensure that the codebase does what is expected, and that future changes do not cause regressions that break functionality that worked previously.
Improved User Interface
Notably, when viewing the details of an issue, the output is currently in a format that is difficult to read, and in an order that isn't necessarily desirable. We want to ensure the look and feel of the command line is professional.
Installation Scripts
It should be trivial, especially on Windows and Mac, to install Abundant. Currently it isn't terribly painful (pull the code, add the command to the path, run), but it isn't the sort of installation flow users of these operating systems expect. It is acceptable for now to ask users to go through those steps, but for a large-scale release to be possible, we must have installation binaries, preferably built by an automatic build script.
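Of the goals above, the data caching goal is the one with the most specific technical requirement: the cache must be transparent and failsafe. Since that design has not been made yet, the following is only one possible approach, sketched under assumptions: issues are JSON files with a title field, the cache is a single untracked JSON file, and the names CACHE_NAME and issue_titles are hypothetical.

```python
import json
from pathlib import Path

CACHE_NAME = ".titles.cache"  # hypothetical name for an untracked cache file


def _read_cache(directory):
    """Load the cache, treating a missing or corrupt file as an empty cache."""
    try:
        with (Path(directory) / CACHE_NAME).open(encoding="utf-8") as handle:
            data = json.load(handle)
        return data if isinstance(data, dict) else {}
    except (OSError, ValueError):
        return {}  # failsafe: fall back to reading every issue file


def issue_titles(directory):
    """Map issue id -> title, reusing cached titles for unchanged files."""
    directory = Path(directory)
    cache = _read_cache(directory)
    fresh = {}
    for path in sorted(directory.glob("*.json")):
        mtime = path.stat().st_mtime_ns
        entry = cache.get(path.stem)
        if isinstance(entry, list) and len(entry) == 2 and entry[0] == mtime:
            fresh[path.stem] = entry  # file unchanged since it was cached
        else:
            with path.open(encoding="utf-8") as handle:
                fresh[path.stem] = [mtime, json.load(handle)["title"]]
    try:
        with (directory / CACHE_NAME).open("w", encoding="utf-8") as handle:
            json.dump(fresh, handle)
    except OSError:
        pass  # failing to write the cache must never break the command
    return {issue_id: title for issue_id, (_, title) in fresh.items()}
```

The caller never sees the cache: a stale entry is detected by comparing modification times, and a broken cache file simply costs one slower run before it is rewritten. Comparing modification times is a simplification; a real implementation might prefer content hashes.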
Demo and Screenshots

Creating a new Abundant database is as simple as calling ab init. Immediately the directory and its contents are paired with an issue tracker, with no additional configuration required. This should be done in a VCS like Mercurial, but does not need to be. If it isn't, however, it will be difficult to share the database robustly with other people.
Once the init command has been executed, the user has the full power of Abundant at his fingertips, with no additional configuration or setup required. The user can immediately start reporting bugs with the new command, which takes the title of the issue as additional arguments. Optional parameters can be passed to add extra details to the issue, if this is beneficial for the user.
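As the implementation notes explain, each issue lives in its own JSON file. A minimal sketch of what recording a new issue could involve is shown below. The field names, the file naming scheme, the SHA-1 based ID, and the functions new_issue and load_issue are all assumptions made for illustration; the report specifies only that the format is JSON, one file per issue, with metadata stored alongside the description and comments.

```python
import hashlib
import json
import time
from pathlib import Path


def new_issue(directory, title, creator=None, now=None):
    """Write a new issue as its own JSON file and return its id."""
    created = time.time() if now is None else now
    # Hypothetical id scheme: a hex hash of the title and creation time.
    issue_id = hashlib.sha1(f"{title}|{created}".encode("utf-8")).hexdigest()
    issue = {
        "id": issue_id,
        "title": title,
        "status": "open",
        "creator": creator,
        "assigned_to": creator,  # a single-user project never needs to change this
        "created": created,
        "comments": [],
    }
    path = Path(directory) / f"{issue_id}.json"
    with path.open("w", encoding="utf-8") as handle:
        # Indentation and sorted keys keep the file readable and diffs stable.
        json.dump(issue, handle, indent=2, sort_keys=True)
        handle.write("\n")
    return issue_id


def load_issue(directory, issue_id):
    """Read one issue back from its JSON file."""
    path = Path(directory) / f"{issue_id}.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Writing with a fixed key order and one field per line is a deliberate choice in this sketch: it keeps version control diffs small when a single field such as the status changes.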
Users can very easily browse all current and resolved issues with the list command, which by default shows all open issues, but can be given other arguments to filter on specific parameters and show closed issues.
If we want additional users to be able to contribute to the same database, it's as simple as having them pull the repository containing the database and start using it. Abundant will automatically adjust its behavior to account for these new users, and they can be specified by any unique prefix of their full name or email address.
We can see just the issues assigned to the current user or a specified user with the tasks command. Issues created before adding a second user remain assigned to the first.
Issues can be assigned to other users, and updated, changed or closed by anyone. Once an issue is resolved it can be marked resolved and will no longer show up in the list of issues by default. Notice the r flag which is used for both tasks and list to show resolved issues instead of open issues.
Conclusion
I am very proud of the results of this project thus far. Abundant works well as a small issue tracker, and scales well to small teams. It is still too limited for any sort of large project, or a project that expects to become larger. However, I look forward to continuing to develop this software after graduation, resolving these limitations and enabling Abundant to become more powerful and useful. It is released as open source software, and I hope to get other developers to contribute to the project in the near future.
Terms
Version Control (Revision Control, Version Control System, VCS): A system used to ensure that changes to data are recorded in an efficient and powerful manner, enabling users to easily review those changes, accept and reject changes, restore previous versions, and generally maintain a detailed history of the data being tracked. Additionally, version control software enables easy collaboration and division of labor, as work can be done separately and the VCS will maintain the work as a cohesive whole, as opposed to teams having to manually compare and merge what each individual has done. Version control is often used in software development, and wiki-style sites like Wikipedia are also built around the notion of version control.
Repository: A set of files being tracked by a version control system.
Issue Tracking (Bug Tracking): A parallel tool used to track issues with software (or other systems). While this can be as simple as a sheet of paper or a text file storing a list of known issues with software, large projects use far more powerful tools to track details and metadata about these issues, including categories, states, responsible and interested individuals, related issues, and more. The terms issue tracking and bug tracking are often used interchangeably. In some sense issue tracking is a superset of bug tracking, as issues can include more than just bugs, such as planned features and requested changes.
Server-Client: In a server-client model, a centralized server maintains a global set of data which clients access as needed. With regard to version control, the full history of the repository is stored on the server, only the files a user needs are downloaded to the client computers, and changes are pushed back to the server, which tracks them.
Distributed: Conversely, in a distributed model, there is no need for a central server, for the entire dataset is stored on each user's machine. Sharing can be done between any two repositories on any connected machines, and users do not need to query a remote server or even be online to access all the necessary tools.
Bibliography
A List of Distributed Bug Trackers. August 11, 2010. http://dist-bugs.kitenet.net/software/ (accessed September 21, 2010).
Alwis, Brian de, and Jonathan Sillito. "Why are software projects moving from centralized to decentralized version control systems?" Proceedings of the 2009 ICSE Workshop on Cooperative and Human Aspects on Software Engineering (IEEE Computer Society), 2009: 36-39.
Corbet, Jonathan. Distributed bug tracking. May 14, 2008. http://lwn.net/Articles/281849/ (accessed September 21, 2010).
Kanat-Alexander, Max. "The Bugzilla Survey, August 2008." August 9, 2008. https://wiki.mozilla.org/images/5/56/Bugzilla-survey-results-2008.pdf (accessed September 21, 2010).
Torvalds, Linus. "Linus Torvalds on Git. Transcript from Google Tech Talk." May 2007. http://git.or.cz/gitwiki/LinusTalk200705Transcript.
