The social network Odnoklassniki runs on more than 8,000 servers located in several data centers. Each of these machines was dedicated to a specific task, both to isolate failures and to enable automated infrastructure management. At some point it became clear that a new data center management system could use the hardware more efficiently, simplify access management, automate resource management, shorten time to market for new services, and speed up recovery from incidents and even full outages. Because the new system has to manage every server we have, it becomes our largest and most critical distributed system, which places strict requirements on its resilience under any conditions, especially during major failures and outages. Meeting those requirements took both thorough fault-tolerance planning and unique architectural solutions. We'll discuss interesting details of one-cloud's internal workings as well as our experience running containerized Java apps under high load.