IEEE eScience: Science in the Clouds
Alexander Szalay, Science in the Cloud
movement over time: first toward PCs and small-scale computing, then back toward centralization through Beowulf clusters, the grid, and now clouds
the old paradigm, where data sat in one location and had to be moved and cleaned up afterward, wasn't efficient, so now we're looking at distributed data
20% of the world's servers are going to the big 5 (cloud providers?)
Clouds have some great benefits; the university computing model really doesn't work anymore
clear management advantages
baseline cost that is hard to match
easy to grow dynamically
science problems:
- tightly coupled parallelism (low latency, MPI: Message Passing Interface)
- very high I/O bandwidth
- geographically separated data (can't keep it in one place and in sync…)
clouds are fractals – exist on all scales
little, big, all science can use the clouds
need a power law distribution of computing facilities for scientific computing
Astro trends -
Cosmic Microwave Background (COBE, 1990, 1,000 pixels; WMAP, 2003, 1M pixels; Planck, 2008, 10M pixels)
Galaxy redshift surveys (from ~3,500 redshifts in 1986 to ~750,000 in 2005)
Time domain is new
LSST – petabytes of data; total dataset in the 100-petabyte range
scientific data doubles every year, driven by successive generations of inexpensive sensors (new CMOS) and exponentially faster computing – this changes the nature of scientific computing across all areas of science
- but it’s harder to extract knowledge
data on all scales – from kilobytes in manually maintained Excel spreadsheets to multi-terabyte archive facilities (all medical labs are gathering digital data all the time)
industrial revolution in collecting scientific data
data acquisition at the top of the funnel is doubling, while publication at the bottom of the funnel is only growing at ~6% per year (quick arithmetic on that gap below)
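(My own back-of-the-envelope on the two rates quoted above, not a slide from the talk:)

    # Quick arithmetic: annual doubling of data vs. ~6%/year growth in publications.
    years = 10
    data_growth = 2.0 ** years          # ~1024x in a decade
    publication_growth = 1.06 ** years  # ~1.8x in the same decade
    print(f"data: x{data_growth:.0f}, publications: x{publication_growth:.1f}")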
challenges:
access – move analysis to the data
discovery – typically happens at the edges of parameter space, so more of the same data doesn't give us much more… but opening up orthogonal dimensions does give us more discoveries; federation still requires data movement
analysis – at these scales only algorithms of at most N log N complexity are feasible (rough scaling sketch after this list)
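(A rough sketch of why the N log N limit matters; the dataset size and per-core throughput below are my own assumed numbers for illustration, not figures from the talk:)

    import math

    n = 1e12           # rows in a survey-scale catalog (assumed)
    ops_per_sec = 1e9  # operations per second on one core (assumed)

    for label, ops in [("N", n), ("N log N", n * math.log2(n)), ("N^2", n ** 2)]:
        hours = ops / ops_per_sec / 3600.0
        print(f"{label:8s}: ~{hours:.1e} core-hours")
    # N is minutes, N log N is hours, N^2 is hundreds of billions of core-hours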
data analysis –
on all scales across multiple locations
the assumption has been that there's one optimal solution and that we just need a large enough data set… and unlimited computing power
but this isn't true any more; randomized, incremental algorithms are the way forward (sketch of the idea below)
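(A minimal sketch of the randomized, incremental idea as I understand it: estimate a statistic on growing random subsamples and stop once the answer stabilizes, rather than assuming one pass over the full dataset with unlimited compute. The function name, tolerance, and starting size are my own illustrative choices, not Szalay's algorithm:)

    import random

    def incremental_estimate(data, statistic, tol=1e-3, start=1000):
        """Estimate `statistic` on doubling random subsamples until it stabilizes."""
        size, previous = start, None
        while size <= len(data):
            current = statistic(random.sample(data, size))
            if previous is not None and abs(current - previous) <= tol * abs(current):
                return current, size   # converged long before touching all the data
            previous, size = current, size * 2
        return statistic(data), len(data)

    readings = [random.gauss(5.0, 1.0) for _ in range(1_000_000)]
    mean = lambda xs: sum(xs) / len(xs)
    print(incremental_estimate(readings, mean))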
analysis of scientific data
- working with Jim Gray on coping with the data explosion, starting in astronomy with the SDSS… the first data release from Sloan was 100 GB; now the dataset is about 100 TB
- interactions with every step of the scientific process
Jim Gray:
- scientific computing revolving around data – take analysis to data, need scale-out solution for analysis
- scientists give the database designer their top 20 questions in plain English, and the database designer or computer scientist designs the database around them (illustrated after this list)
- build something small that works today, then scale it into something that works tomorrow – go from working to working; design for what the world looks like today, not for tomorrow
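(One way to read the "20 questions" method is as a literal artifact: each question written in plain English alongside the SQL it should compile to, so the schema and indexes get designed around those questions. The table and column names below echo a SkyServer-style catalog but are used only as an illustration, not the production schema:)

    # One of the "20 questions", first in English, then as SQL.
    twenty_queries = [
        {
            "english": "Find all galaxies brighter than r = 17.5 in a given patch of sky.",
            "sql": """
                SELECT objID, ra, dec, r
                FROM   PhotoObj
                WHERE  type = 3               -- 3 = galaxy
                  AND  r < 17.5
                  AND  ra  BETWEEN 180 AND 181
                  AND  dec BETWEEN 0 AND 1
            """,
        },
        # ...the other nineteen questions drive the rest of the schema and its indexes
    ]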
Projects at JHU
SDSS – finished, final data release has happened
- work has changed
- final archiving in progress (University of Chicago library, JHU library, CAS mirrors at Fermilab and JHU Physics & Astronomy)
archive will contain >100 TB
- all raw data
- all processed calibrated data
- all versions of the database
- full email archive (to capture technical changes not recorded in the official drawings) plus the technical drawings themselves
- full software code repository
- telescope sensor stream, IR fisheye camera, etc.
Public use of the SkyServer
- prototype in data publishing
- 500 million web hits in 6 years; 1M distinct users, but there are only about 15k professional astronomers in the world
- 50,000 lectures to high schools
- delivered >100B rows of data
interactive workbench
- sign up, get your own database, run queries, pipe results into your own database… analysis tools transfer only the plot, not the entire dataset, over the wire (pattern sketched below)
- 2,400 power users
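(A sketch of the workbench pattern described above: the heavy query runs next to the data and writes into the user's personal database, and only the small, plottable result crosses the network. The endpoint URL and parameter names are hypothetical placeholders, not the actual service interface:)

    import requests

    ENDPOINT = "https://example.org/workbench/query"   # placeholder, not the real service

    # Heavy query: runs at the archive, lands in the user's own database ("MyDB").
    requests.post(ENDPOINT, data={"sql": """
        SELECT ROUND(g - r, 2) AS color, COUNT(*) AS n
        INTO   MyDB.color_histogram
        FROM   PhotoObj
        GROUP BY ROUND(g - r, 2)
    """})

    # Only the tiny histogram comes back over the wire for plotting.
    histogram = requests.post(ENDPOINT, data={
        "sql": "SELECT color, n FROM MyDB.color_histogram ORDER BY color"
    }).json()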
GalaxyZoo
built on SkyServer; 27M visual galaxy classifications by the public
a discovery made by a Dutch school teacher
Virtual Observatory
collaboration of 20 groups
15 countries – international virtual observatory alliance
interfaces were different and there was no central registry, but the underlying basic data formats were agreed upon
sociological barriers are much more difficult than technical challenges
Technology
- petabytes
- save and move the data, with some processing done near the instrument (in Chile, for example)
- funding organizations have to understand the computing costs over time
- open ended modular system
- need a Journal for Data (an overlay journal to bridge the gap so that data sets don't get lost) – curation is key: who does the long-term curation of the data?
Pan-STARRS
- detect killer asteroids
- >1Petabyte/year
- 80 TB SQL Server database built at JHU – the largest astronomy database in the world
Life Under Your Feet (http://lifeunderyourfeet.org/en/default.asp)
- role of soil in global change
- a few hundred wireless sensor nodes with 10 sensors each; long-term continuous data in a complex sensor database built from the SkyServer code (illustrative schema below)
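(An illustrative sketch of the kind of long-term sensor-readings table such a deployment needs; the schema and sqlite3 are my own stand-ins to keep the example self-contained, and the real database built from the SkyServer code is far richer:)

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE readings (
            node_id  INTEGER,   -- which wireless box
            sensor   TEXT,      -- e.g. soil_temp, soil_moisture
            taken_at TEXT,      -- ISO timestamp
            value    REAL
        )
    """)

    # Typical question: daily averages of one quantity across all nodes.
    daily_avg = db.execute("""
        SELECT date(taken_at) AS day, AVG(value)
        FROM readings
        WHERE sensor = 'soil_temp'
        GROUP BY day
    """).fetchall()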
once a project is online its data grows linearly; exponential growth comes from new technologies and new cameras… future growth might come from individual amateur astronomers systematically gathering data with 20MB cameras on their telescopes
more growth coming from simulations (software is also an instrument)
(example of a simulation so big that its output sat on an inaccessible tape robot, so the data is never used)
also need interactive, immersive use cases (like turbulence)
- store every time slice in the database (sketch below of what that enables)
- turbulence.pha.jhu.edu (try it today!)
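(A sketch of what storing every time slice enables: fetching the flow at an arbitrary point and time, so analysis becomes interactive rather than a batch re-run of the simulation. The turbulence database exposes web-service access for this kind of query, but the endpoint and parameter names below are hypothetical placeholders, not its actual API:)

    import requests

    def velocity_at(x, y, z, t):
        # Placeholder endpoint and parameters; illustrates the access pattern,
        # not the real turbulence.pha.jhu.edu interface.
        resp = requests.get("https://example.org/turbulence/velocity",
                            params={"x": x, "y": y, "z": z, "t": t})
        return resp.json()   # e.g. {"u": ..., "v": ..., "w": ...}

    # Interactive use: sample a line through the box at one instant.
    profile = [velocity_at(0.1 * i, 0.5, 0.5, t=1.25) for i in range(10)]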
commonalities
- huge amounts of data, need aggregates but also access to raw data
- requests benefit enormously from indexing (small demo after this list)
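(A tiny demonstration of how much an index helps a selective request; the table size and columns are arbitrary choices of mine, picked only to make the effect visible:)

    import random, sqlite3, time

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE obj (id INTEGER, ra REAL, dec REAL)")
    db.executemany("INSERT INTO obj VALUES (?, ?, ?)",
                   [(i, random.uniform(0, 360), random.uniform(-90, 90))
                    for i in range(500_000)])

    def timed(sql):
        t0 = time.perf_counter()
        rows = db.execute(sql).fetchall()
        return len(rows), round(time.perf_counter() - t0, 4)

    query = "SELECT * FROM obj WHERE ra BETWEEN 10.0 AND 10.5"
    print("full scan:", timed(query))
    db.execute("CREATE INDEX idx_obj_ra ON obj (ra)")
    print("indexed:  ", timed(query))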
Amdahl’s Laws for a balanced system – we’ve gone farther and farther from these
comparisons of the data that simulations generate vs. what the computers can actually handle
data analysis is maxing out the hardware at the 10-100 TB scale – no one can really do anything with more than ~50 TB
IO limitations for analysis
we're a factor of 500 away from a balanced 200 TFlop Amdahl machine (back-of-the-envelope below)
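(Amdahl's balanced-system rule of thumb is roughly one bit of I/O per second for every instruction per second; a quick back-of-the-envelope of what that implies for the 200 TFlop figure, using the factor of 500 quoted above:)

    # Amdahl's balanced-system rule of thumb:
    # ~1 bit of sequential I/O per second for every instruction per second.
    flops = 200e12                  # target: 200 TFlop/s machine
    balanced_io = flops / 8         # bits -> bytes: ~25 TB/s of I/O
    actual_io = balanced_io / 500   # "a factor of 500 off" per the talk
    print(f"balanced: {balanced_io / 1e12:.0f} TB/s, actual: {actual_io / 1e9:.0f} GB/s")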
they built a high-I/O system out of cheap components
the large datasets are here but the solutions are not – systems are choking on I/O
scientists are cheap
data collection is separated from data analysis
(big experiments just collect data and store it, scientists come along later and analyze the data- decoupled)
How do users interact with petabytes
- can't wait two weeks for a SQL query over a petabyte
- Python crawlers
- partition queries (see the sketch after this list)
- MapReduce/Hadoop – but the complex joins you need for data analysis are impossible (or very difficult)
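(A sketch of the "partition the query, then crawl the partitions" pattern from the list above: one logical question becomes many chunk-sized queries that run in parallel near the data and get merged afterwards. `run_query`, the table, and the partitioning column are illustrative stand-ins:)

    from concurrent.futures import ThreadPoolExecutor

    def run_query(sql):
        # Stand-in for whatever interface the archive exposes; returns rows.
        return []

    def partitioned(sql_template, bands, workers=8):
        chunks = [sql_template.format(lo=lo, hi=hi) for lo, hi in bands]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return [row for rows in pool.map(run_query, chunks) for row in rows]

    template = "SELECT COUNT(*) FROM PhotoObj WHERE dec BETWEEN {lo} AND {hi}"
    bands = [(d, d + 10) for d in range(-90, 90, 10)]
    results = partitioned(template, bands)   # merge step (e.g. summing) happens here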
William Gibson – “The future is already here. It’s just not very evenly distributed”
data cloud vs. HPC vs. HTC
Journal for data?
- with ApJ?
- example: a postdoc writes a paper and only a table goes into the supplementary data; the journal can't take the terabytes of real data – we need another archive for this, linked to the science article
Labels: IEEEeScience08