E-commerce sites, military planners and streaming media providers all want to analyze multiple sources of live data as quickly as possible, a holy grail of data science which is about to become possible through new software called StreamWare, from researchers at Harvard University, New Jersey Institute of Technology and the University of Southern California.
“What we’re trying to do is reimagine data science,” said David Bader, distinguished professor and director of NJIT’s Institute for Data Science.
Bader, Harvard professor David Brooks and USC Viterbi School of Engineering Prof. Viktor Prasanna are the principal investigators. They are hosting a birds-of-a-feather session at the Supercomputing 2021 conference on Nov. 16 in St. Louis to drum up support for StreamWare, which they and other colleagues are already developing. So far, they have cooperation pledges from AMD, Google, Intel, Microsoft and Xilinx. “We would also reach out to additional companies with whom we’ve collaborated previously, including for example, IBM, HPE, Qualcomm and ARM,” they stated.
The group is supported by a $250,000 National Science Foundation planning grant, which they’ll use to study areas for improvement at the intersection of high-transaction online applications and the underlying server and network hardware. USC’s Prasanna described the group as “a truly interdisciplinary team … to propose a vertically integrated system covering applications, algorithms, software and architecture.”
“The outcomes of the planning phase will include a proposal for the research activities to be carried out in the [next] grant, publications on the results of the survey activities and future research directions for enabling streaming data science, and curriculum for future graduate and undergraduate courses,” they stated in their grant proposal.
“Rather than waiting to collect this information to a central location and process it, our thought is to create streaming analytics able to make decisions on the fly,” Bader explained. Currently, “there’s a lot of work [called] streaming but often that work is very simplistic in nature, for instance with a single data source and running a single analytic such as frequency counting.” Current efforts also have trouble scaling up, he added.
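To make the distinction concrete, here is a minimal Python sketch of the kind of analytic Bader mentions, frequency counting, extended from one source to several by merging them in timestamp order. It is an illustration only and not StreamWare code: the function name `running_counts`, the `(timestamp, key)` event format and the sample data are all assumptions made for this example.

```python
import heapq
from collections import Counter


def running_counts(*sources):
    """Merge timestamp-ordered sources and yield a running frequency count.

    Assumption of this sketch: each source is an iterable of
    (timestamp, key) tuples that is already sorted by timestamp.
    """
    counts = Counter()
    # heapq.merge interleaves the sorted inputs lazily, so no source has to
    # be collected in full before processing begins.
    for timestamp, key in heapq.merge(*sources):
        counts[key] += 1
        # Emit the updated count right away, one event at a time.
        yield timestamp, key, counts[key]


# Hypothetical sample data: two small event sources.
clicks = [(1, "item-a"), (4, "item-b")]
purchases = [(2, "item-a"), (3, "item-a")]

results = list(running_counts(clicks, purchases))
for row in results:
    print(row)
```

Each event updates the count as it arrives rather than after everything has been gathered in one place. A real system would also have to cope with out-of-order events, unbounded memory growth and many concurrent analytics, which this sketch ignores.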
The group will test new combinations of open-source, state-of-the-art algorithms and hardware accelerators on data sets from astrophysics, smart grids and network science. Bader said he thinks prototypes could be ready to demonstrate in three to six months.
“The proposed project will positively influence a wide range of applications in scientific and engineering domains. The developed data science kernel catalogue will enable the big data research community to prioritize research on improving the performance of these kernels,” the researchers stated in their grant proposal. “This project will also be directed towards promoting scientific education through the involvement of high school and university students, especially female and underrepresented minority students.”
In addition, the project results will influence course curricula in data science and related fields, starting with Harvard. Ultimately, “where we’re heading is predictive analytics,” Bader said, an aim that will draw on machine learning and artificial intelligence.