Long Zhang is a Ph.D. student in computer science at KTH Royal Institute of Technology, Sweden, supervised by Professors Martin Monperrus and Benoit Baudry and funded by the Wallenberg AI, Autonomous Systems and Software Program (WASP). He received his BE and ME degrees in software engineering from Harbin Institute of Technology, China. Before graduating, Long interned at Tencent twice, gaining nearly two years of hands-on software development experience in total. After that, as the youngest member of the team and the only graduate hired as an exception, he gradually moved from developer to project manager, responsible for talent cultivation cooperation projects between Tencent and universities, such as the Tpai Innovation and Entrepreneurship Competitions and the Rhino-bird Elite Graduate Program.
My research interests
Chaos engineering, Self-healing software, Antifragile systems
Work and study experience
Manager of University Relations at Tencent (2015.07 ~ 2017.11)
Assistant Engineer (Intern) at Tencent (2014.07 ~ 2015.07, 2012.07 ~ 2013.07)
Master's and bachelor's degrees in software engineering, Harbin Institute of Technology, China
Brief Introduction to My Research Work
I mainly focus on the resilience of software systems. It is impossible to predict every failure or unanticipated situation a system will face, especially once it is deployed in production. It is therefore important to improve system resilience, so that the system can withstand perturbations and heal itself afterwards.
At the conceptual level, I use chaos engineering to address this problem. Chaos engineering is the practice of experimenting on a distributed system in order to build confidence in the system’s capability to withstand unexpected conditions in production. In other words, it means breaking things on purpose. We should change our perspective on errors: instead of only trying to prevent them, we trigger them deliberately under controlled conditions. At the delivery level, I will implement a software system that can be easily combined with customers' applications. While the chaos system is running, it automatically monitors the target application, conducts chaos experiments, and analyzes the weak points in the application's resilience.
For example, we deliberately shut down one of the service instances to check whether the routing node smoothly redirects the traffic to the instances that are still working. Experiments like this give us more opportunities to analyze and learn from the system's error-handling behavior, and ultimately to improve its resilience.
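A minimal Python sketch of this kind of experiment is shown here, under strong simplifying assumptions: everything runs in one process, and the names `ServiceInstance`, `Router`, and `run_shutdown_experiment` are hypothetical and invented for illustration. It is not the system described above, only a toy model of shutting down one instance and observing whether a round-robin router still serves every request.

```python
class ServiceInstance:
    """Hypothetical in-process stand-in for one running service instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Router:
    """Round-robin router that skips instances which fail to respond."""

    def __init__(self, instances):
        self.instances = instances
        self._next = 0

    def route(self, request):
        # Try each instance at most once, starting at the round-robin position.
        for _ in range(len(self.instances)):
            instance = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # redirect the request to the next instance
        raise RuntimeError("no healthy instance available")


def run_shutdown_experiment(router, victim, n_requests=10):
    """Shut down `victim`, send traffic, and report how the router coped."""
    victim.alive = False  # the injected fault
    outcomes = []
    try:
        for i in range(n_requests):
            try:
                outcomes.append(router.route(f"req-{i}"))
            except RuntimeError:
                outcomes.append(None)  # request lost: a resilience weakness
    finally:
        victim.alive = True  # always restore the instance after the experiment
    served = [o for o in outcomes if o is not None]
    return {
        "sent": n_requests,
        "served": len(served),
        "hit_victim": any(o.startswith(victim.name + " ") for o in served),
    }


instances = [ServiceInstance(n) for n in ("svc-a", "svc-b", "svc-c")]
router = Router(instances)
report = run_shutdown_experiment(router, victim=instances[0])
print(report)
```

If `served` equals `sent` and no response came from the stopped instance, the router redirected traffic as hoped; any lost request would point to a weakness worth investigating. A real experiment would target separate processes or hosts and would rely on actual monitoring data rather than return values.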
Keywords: chaos engineering, fault injection, monitoring, self-healing, reliability, availability, antifragile, software engineering