SuperComputing 2018

Global Petascale to Exascale Science Workflows Accelerated by Next Generation Software Defined Network Architectures and Applications

Tools used in SC '18

FDT - One of the key advances in this demonstration was Fast Data Transport (FDT), an open source Java application developed by the Caltech team in close collaboration with the Politehnica Bucharest team. FDT runs on all major platforms and uses the Java NIO libraries to achieve stable disk reads and writes coordinated with smooth data flow across long-range networks.

The FDT application streams a large set of files across an open TCP socket, so that a large data set composed of thousands of files, as is typical in high-energy physics applications, can be sent or received at full speed, without the network transfer restarting between files. FDT works with Caltech's MonALISA system to dynamically monitor the capability of the storage systems as well as the network path in real time, and sends data out to the network at a moderated rate that achieves smooth data flow across long-range networks.
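The idea of sending many files over one open socket, without reconnecting between files, can be illustrated with a minimal Python sketch. This is not FDT's implementation or wire protocol (FDT itself is written in Java on NIO); the length-prefixed framing, the function names and the chunk size are all assumptions made for illustration.

```python
import os
import socket
import struct
import threading

# Hypothetical framing: 2-byte name length + 8-byte file size, network byte order.
HEADER = struct.Struct("!HQ")
CHUNK = 64 * 1024


def send_files(sock, paths):
    """Send every file in `paths` back-to-back over one already-open socket."""
    for path in paths:
        name = os.path.basename(path).encode("utf-8")
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            sock.sendall(HEADER.pack(len(name), size) + name)
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                sock.sendall(chunk)
    # A zero-length name marks the end of the stream.
    sock.sendall(HEADER.pack(0, 0))


def _recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("socket closed mid-stream")
        buf.extend(part)
    return bytes(buf)


def recv_files(sock, dest_dir):
    """Receive files until the end marker; return the list of written paths."""
    received = []
    while True:
        name_len, size = HEADER.unpack(_recv_exact(sock, HEADER.size))
        if name_len == 0:
            return received
        name = _recv_exact(sock, name_len).decode("utf-8")
        out_path = os.path.join(dest_dir, os.path.basename(name))
        remaining = size
        with open(out_path, "wb") as out:
            while remaining:
                chunk = sock.recv(min(CHUNK, remaining))
                if not chunk:
                    raise ConnectionError("socket closed mid-file")
                out.write(chunk)
                remaining -= len(chunk)
        received.append(out_path)


def loopback_demo(paths, dest_dir):
    """Send `paths` through a local socket pair and return the received paths."""
    left, right = socket.socketpair()
    sender = threading.Thread(target=send_files, args=(left, paths))
    sender.start()
    try:
        return recv_files(right, dest_dir)
    finally:
        right.close()
        sender.join()
        left.close()
```

Because the receiver learns each file's name and size from the header, the connection stays open across file boundaries and the TCP window never has to ramp up again between files. The rate moderation driven by monitoring data, described above, is not modeled here.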

MonALISA - MonALISA, which stands for Monitoring Agents using a Large Integrated Services Architecture, has been developed by Caltech and its partners with the support of the U.S. CMS software and computing program. The framework is based on a Dynamic Distributed Service Architecture and is able to provide complete monitoring, control and global optimization services for complex systems.

The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of information gathering and processing tasks. The system is designed to easily integrate existing monitoring tools and procedures and to provide this information in a dynamic, customized, self-describing way to any other services or clients.

OpenDaylight - Addressing an as yet unsolved issue in LHCONE, namely the efficiency of interconnecting multiple network domains over more than one connection, Caltech is investigating, and has made significant progress in developing, the use of northbound interfaces to the OpenDaylight SDN controller for efficiently managing multipath networks. The software has been released under an open source license and is available on GitHub.

Caltech first started working on this problem in the OpenFlow Link-layer MultiPath Switching (OLiMPS) project, funded by DOE/OASCR, in which the group developed an OpenFlow controller based on Big Switch’s Floodlight open-source controller. Specific code improvements include a more versatile and extensible internal architecture and configuration management features, an improved command line interface, and a range of advanced features.

Later, in 2014, with Cisco Research funding, the code was partially converted to the OpenDaylight SDN platform, as was demonstrated at the SC14 conference. Currently the code is written for the Hydrogen release; however, it will be matured for the Helium release in the coming months of 2016.

PhEDEx - PhEDEx is the data-placement management tool for the CMS experiment at the LHC. It manages the scheduling of all large-scale WAN transfers in CMS, ensuring reliable delivery of the data. It consists of several components:

  • an Oracle database, hosted at CERN
  • a website and data-service, which users (humans or machines) use to interact with and control PhEDEx
  • a set of central agents that deal with routing, request-management, bookkeeping and other activities. These agents are also hosted at CERN, though they could be run anywhere. The key point is that there is only one set of central agents per PhEDEx instance
  • a set of site-agents, one set for every site that receives data

PhEDEx maintains knowledge and history of transfer performance, and the central agents use that information to choose among source replicas when a user makes a request (users specify the destination, PhEDEx chooses the source).
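A minimal sketch of history-based source selection follows. The text does not say which metric PhEDEx actually uses, so ranking replicas by mean past throughput is an assumption, and the function name, the site names and the layout of the `history` dictionary are hypothetical.

```python
from statistics import mean


def choose_source(destination, replica_sites, history, default_rate=0.0):
    """Pick the replica site with the best average past rate to `destination`.

    `history` maps (source, destination) pairs to a list of observed
    transfer rates (for example in MB/s). Sites with no history get
    `default_rate`. Both conventions are assumptions of this sketch.
    """
    if not replica_sites:
        raise ValueError("no replica holds the requested data")

    def expected_rate(site):
        samples = history.get((site, destination), [])
        return mean(samples) if samples else default_rate

    return max(replica_sites, key=expected_rate)


# The user names only the destination; the source is chosen from history.
history = {
    ("T1_A", "T2_B"): [100.0, 120.0],
    ("T1_C", "T2_B"): [300.0],
}
best = choose_source("T2_B", ["T1_A", "T1_C"], history)
```

With the sample history above, `best` is `"T1_C"`, since its single recorded rate exceeds the mean for `"T1_A"`.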

The central agents then queue the transfer to be processed by the site agents. PhEDEx operates in a data-pull mode: the destination site pulls the data to itself when it is ready. This gives the sites more control over the activity at their site, so they can ensure that neither their network nor their storage is overloaded.
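The pull model can be sketched as a central queue that never pushes work, plus a site agent that takes only as many transfers as it has free slots. The class names, the `max_active` limit and the tuple layout are hypothetical and do not reflect PhEDEx's real agents or database schema.

```python
from collections import deque


class TransferQueue:
    """Central side: holds pending transfers per destination, never pushes."""

    def __init__(self):
        self._pending = {}

    def enqueue(self, destination, source, filename):
        self._pending.setdefault(destination, deque()).append((source, filename))

    def pull(self, destination, max_items):
        """Hand out at most `max_items` queued transfers for `destination`."""
        queue = self._pending.get(destination, deque())
        batch = []
        while queue and len(batch) < max_items:
            batch.append(queue.popleft())
        return batch


class SiteAgent:
    """Destination side: pulls work only when it has spare capacity."""

    def __init__(self, site, max_active):
        self.site = site
        self.max_active = max_active  # site-chosen limit on concurrent transfers
        self.active = []

    def poll(self, central):
        free_slots = self.max_active - len(self.active)
        if free_slots > 0:
            self.active.extend(central.pull(self.site, free_slots))
        return list(self.active)

    def complete(self, transfer):
        self.active.remove(transfer)
```

Since the site sets `max_active` itself, the central queue can grow arbitrarily long without the site ever running more transfers than its network and storage can handle.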