Hetu

Hetu is a high-performance distributed deep learning system targeting large-scale DL model training, developed and open-sourced by the DAIR Lab at Peking University. Thanks to our innovative HSPMD tensor annotations, Hetu flexibly supports the automatic and efficient deployment and training of various DL models (e.g., NLP, vision, multi-modal) over distributed GPU servers. Hetu is further optimized for complex training scenarios involving heterogeneity and dynamicity, providing significant speedups over Megatron and DeepSpeed. Discover more in our white paper.
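As a rough illustration of what SPMD-style tensor annotations express (a conceptual toy in plain PyTorch, not Hetu's actual HSPMD API), a logical tensor can be declared split or replicated per axis over a device mesh, and the system materializes the corresponding per-device shards:

```python
# Conceptual sketch of SPMD-style sharding annotations; NOT Hetu's API.
import torch

def shard(tensor, mesh_size, dim):
    """Split `tensor` along `dim` into one shard per device in the mesh."""
    return list(torch.chunk(tensor, mesh_size, dim=dim))

# A linear layer's logical weight annotated as column-parallel: its output
# dimension is split across a hypothetical 4-device mesh (tensor
# parallelism), while other tensors would stay replicated.
weight = torch.randn(1024, 4096)
shards = shard(weight, mesh_size=4, dim=1)  # four (1024, 1024) shards
assert all(s.shape == (1024, 1024) for s in shards)
```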
We welcome everyone interested in the design, development, and optimization of DL systems to contribute code, create issues or pull requests. Please refer to the Hetu Contribution Guide for more details.
Galvatron

Galvatron is a PyTorch-native, open-source framework for the efficient distributed training of large-scale Transformer models, with specialized optimizations for automatic hybrid parallelism strategies. Given a Transformer model, Galvatron first profiles and analyzes the model execution workload characteristics, creating a precise cost model. Then, Galvatron uses decision trees and dynamic programming to automatically deduce the best combination of parallelism dimensions for each model layer, covering data, tensor, pipeline, sharded data, sequence parallelism, and recomputation. Finally, Galvatron leverages PyTorch features like FSDP and checkpointing to deploy and train the model, seamlessly supporting various accelerators like NVIDIA GPUs and Ascend NPUs. As an open-source project with comprehensive documentation, Galvatron is designed to be user-friendly, enabling easy integration with minimal code changes. Discover more in our white paper.
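As a concrete (hand-written) sketch of the PyTorch building blocks mentioned above, the snippet below wraps each layer of a toy Transformer in FSDP for sharded data parallelism and recomputes MLP activations via checkpointing; Galvatron derives such layer-wise choices automatically from its cost model, and the snippet assumes torch.distributed has already been initialized (e.g., via torchrun):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, d_model=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=16, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        # Recomputation: drop MLP activations in forward, rebuild in backward.
        return x + checkpoint(self.mlp, x, use_reentrant=False)

# Shard each layer's parameters across ranks (requires an initialized
# process group, e.g. launched with torchrun).
model = nn.Sequential(*[FSDP(Block()) for _ in range(4)])
```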
We welcome everyone interested in the efficient and easy-to-use training of large-scale Transformer models to contribute code, create issues or pull requests. Please refer to the Galvatron Contribution Guide for more details.
PowerFL

PowerFL is an industrial-grade federated privacy-enhancing computing system for geo-distributed collaboration. It focuses on federated machine/deep learning and secure collaborative data analysis among enterprises and organizations, with a number of advantageous characteristics:
- Security. PowerFL offers a variety of robust privacy-protection mechanisms, including homomorphic encryption, secret sharing, differential privacy, and oblivious transfer, meeting financial-grade security requirements (a toy sketch of one such mechanism follows this list). Additionally, PowerFL utilizes a decentralized architectural design, with no need for any trusted third party or central node, fulfilling the security needs of real-world business scenarios.
- Efficiency. PowerFL enhances communication and computation efficiency through a series of innovations in system architecture, asynchronous computation, communication optimization, and high-performance algorithm design.
- Functionality. PowerFL offers comprehensive full-stack capabilities in privacy-enhancing computation, encompassing a wide range of functions such as collaborative feature engineering algorithms, federated machine/deep learning algorithms, and secure data analysis.
- Cloud-nativeness. PowerFL adopts a cloud-native design, supporting virtualized deployment and elastic resource scaling on YARN and Kubernetes, with particular advantages in handling highly concurrent workloads.
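For a flavor of one mechanism on the list, here is a minimal, self-contained sketch of additive secret sharing over a prime field (illustrative only, not PowerFL's implementation):

```python
import secrets

P = 2**61 - 1  # a Mersenne prime defining the field

def share(secret, n_parties):
    """Split `secret` into n additive shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Each share alone reveals nothing about the secret, yet shares can be
# added party-wise to compute a sum without revealing the inputs.
a, b = 42, 100
a_shares, b_shares = share(a, 3), share(b, 3)
sum_shares = [(x + y) % P for x, y in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == (a + b) % P
```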
Angel

Angel is a high-performance distributed machine learning and graph computing platform based on the Parameter Server philosophy. It is tuned for performance with big data from Tencent and offers wide applicability and stability, demonstrating an increasing advantage in handling higher-dimensional models. Angel is jointly developed by Tencent and Peking University, taking into account both high availability in industry and innovation in academia.
With a model-centric core design, Angel partitions the parameters of complex models across multiple parameter-server nodes, and implements a variety of machine learning and graph algorithms using efficient model-updating interfaces and functions, as well as flexible consistency models for synchronization. Angel is developed in Java and Scala and supports running on YARN. With its PS Service abstraction, it supports Spark on Angel. Support for graph computing and deep learning frameworks is under development and will be released in the future.
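To make the parameter-server pattern concrete, here is a toy sketch in Python (conceptual only; Angel's actual interfaces are in Java and Scala): parameters are partitioned across server nodes, and workers pull the slices they need and push sparse updates back:

```python
import numpy as np

class PSNode:
    """Holds one partition of the model's parameters."""
    def __init__(self, size):
        self.params = np.zeros(size)

    def pull(self, idx):
        return self.params[idx]

    def push(self, idx, grad, lr=0.1):
        self.params[idx] -= lr * grad  # apply a sparse gradient update

# A 10M-dimensional model partitioned across 4 parameter-server nodes.
DIM, N_SERVERS = 10_000_000, 4
servers = [PSNode(DIM // N_SERVERS) for _ in range(N_SERVERS)]

def route(i):
    """Map a global parameter index to its (server, local index) pair."""
    return servers[i % N_SERVERS], i // N_SERVERS

# A worker pulls what it needs, computes locally, then pushes an update.
server, local = route(12345)
w = server.pull(local)
server.push(local, grad=0.5)
```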
We welcome everyone interested in machine learning or graph computing to contribute code, create issues or pull requests. Please refer to Angel Contribution Guide for more details.