dc.contributor.advisor | Coskun, Ayse K. | en_US
dc.contributor.author | Ateş, Emre | en_US
dc.date.accessioned | 2020-10-16T13:38:15Z
dc.date.issued | 2020
dc.identifier.uri | https://hdl.handle.net/2144/41472
dc.description.abstract | Large-scale distributed systems, such as supercomputers, cloud computing platforms, and distributed applications, routinely suffer from slowdowns and crashes due to software and hardware problems, resulting in reduced efficiency and wasted resources. These large-scale systems typically deploy monitoring or tracing systems that gather a variety of statistics about the state of the hardware and the software. State-of-the-art methods either analyze this data manually or design unique automated methods for each specific problem. This thesis builds on the vision that generalized automated analytics methods, applied to the data sets collected from these complex computing systems, provide critical information about the causes of the problems, and that this analysis can then enable proactive management to improve performance, resilience, efficiency, or security significantly beyond current limits. This thesis seeks to design scalable, automated analytics methods and frameworks for large-scale distributed systems that minimize dependency on expert knowledge, automate parts of the solution process, and help make systems more resilient. In addition to analyzing data that is already collected from systems, our frameworks also identify what to collect from where in the system, such that the collected data would be concise and useful for manual analytics. We focus on two data sources for conducting analytics: numeric telemetry data, which is typically collected from operating system or hardware counters, and end-to-end traces collected from distributed applications. This thesis makes the following contributions in large-scale distributed systems: (1) designing a framework for accurately diagnosing previously encountered performance variations, (2) designing a technique for detecting (unwanted) applications running on the systems, (3) developing a suite for reproducing performance variations that can be used to systematically develop analytics methods, (4) designing a method to explain predictions of black-box machine learning frameworks, and (5) constructing an end-to-end tracing framework that can dynamically adjust instrumentation for effective diagnosis of performance problems. | en_US
dc.language.iso | en_US
dc.rights | Attribution 4.0 International | en_US
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/
dc.subject | Computer engineering | en_US
dc.subject | Distributed systems | en_US
dc.subject | Explainability | en_US
dc.subject | High performance computing | en_US
dc.subject | Machine learning | en_US
dc.subject | Monitoring | en_US
dc.subject | Tracing | en_US
dc.title | Automating telemetry- and trace-based analytics on large-scale distributed systems | en_US
dc.type | Thesis/Dissertation | en_US
dc.date.updated | 2020-09-28T07:04:23Z
dc.description.embargo | 2021-09-28T00:00:00Z
etd.degree.name | Doctor of Philosophy | en_US
etd.degree.level | doctoral | en_US
etd.degree.discipline | Electrical & Computer Engineering | en_US
etd.degree.grantor | Boston University | en_US
dc.identifier.orcid | 0000-0002-2292-2626


Except where otherwise noted, this item's license is described as Attribution 4.0 International