Tuesday, June 21, 2016

Parallel Programming Made Easy

MIT News (06/20/16) Larry Hardesty 

Massachusetts Institute of Technology (MIT) researchers have developed Swarm, a chip design that should make parallel programs more efficient and easier to write. Using simulations, the researchers compared Swarm versions of six common algorithms with the best existing parallel versions, each of which had been individually engineered by software developers. The Swarm versions were between three and 18 times as fast, yet they generally required only one-tenth as much code. The researchers focused on a class of applications that has resisted parallelization for many years; many of these applications involve the analysis of graphs, which consist of nodes and edges. Frequently, the edges carry associated numbers called "weights," which often represent the strength of correlations between data points in a dataset. Swarm includes extra circuitry designed specifically to handle this kind of prioritization: it time-stamps tasks according to their priorities and begins working on the highest-priority tasks in parallel. Higher-priority tasks may spawn their own lower-priority tasks, which Swarm automatically slots into its queue. Swarm also has a circuit that records the memory addresses of all the data its cores are currently working on; the circuit implements a Bloom filter, which stores data in a fixed allotment of space and answers yes/no questions about its contents.
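
As a rough, software-only illustration of the two mechanisms described above (a toy sketch; the names, sizes, and conflict policy are invented and do not reflect MIT's actual hardware design), the following Python snippet pairs a time-stamped priority queue of tasks with a Bloom filter that records the addresses tasks have touched:

    import heapq
    import hashlib

    class BloomFilter:
        def __init__(self, size=1024, num_hashes=3):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = [False] * size

        def _positions(self, item):
            # Derive num_hashes bit positions from salted MD5 digests.
            for salt in range(self.num_hashes):
                digest = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = True

        def might_contain(self, item):
            # "No" is definitive; "yes" may be a false positive.
            return all(self.bits[pos] for pos in self._positions(item))

    def run(tasks):
        # Each task is (timestamp, name, addresses, children); lower
        # timestamps are higher priority, children are spawned tasks.
        queue = list(tasks)
        heapq.heapify(queue)
        touched = BloomFilter()
        while queue:
            ts, name, addresses, children = heapq.heappop(queue)
            if any(touched.might_contain(a) for a in addresses):
                print(f"t={ts} {name}: possible memory conflict, would retry")
                continue
            for a in addresses:
                touched.add(a)
            print(f"t={ts} ran {name}")
            for child in children:
                heapq.heappush(queue, child)  # lower-priority spawned work

    run([(0, "visit_source_node", ["0x10"], [(2, "visit_neighbor", ["0x20"], [])])])

Because a Bloom filter can return false positives but never false negatives, a "might contain" answer only means a conflict is possible and must be resolved, while a "no" safely rules it out.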

Monday, June 6, 2016

Finding Relevant Data in a Sea of Languages

MIT News (05/27/16) Ariana Tantillo; Dorothy Ryan 

Researchers in the Massachusetts Institute of Technology Lincoln Laboratory's Human Language Technology (HLT) Group seek to address the challenge of providing multilingual content analysis amid a shortage of analysts with the necessary language skills. Their work could benefit law enforcement as well as the U.S. Department of Defense and intelligence communities. The HLT team is exploiting innovations in language recognition, speaker recognition, speech recognition, machine translation, and information retrieval to automate language-processing tasks so that the available linguists who analyze foreign-language text and speech can be used more efficiently. The team is concentrating on cross-language information retrieval (CLIR) via the Cross-LAnguage Search Engine (CLASE), which enables monolingual English-speaking analysts to help search for and filter foreign-language documents. The researchers use probabilistic CLIR based on machine-translation lattices: each document is machine-translated into English as a lattice that encodes all possible translations along with their respective probabilities. When an analyst queries a document collection, the documents whose translations best match the query are extracted for analysis. CLIR results are assessed by precision, recall, and their harmonic mean, the F-measure. Meanwhile, HLT's Jennifer Williams is developing algorithms to identify the languages present in text data so CLASE can select the appropriate machine-translation models, and others are working on automatic multilingual text-translation systems.
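
To make the evaluation metric concrete, here is a small, self-contained Python sketch (not the HLT Group's code; the document IDs and relevance judgments are invented) that scores a retrieved set against ground truth using precision, recall, and their harmonic mean, the F-measure:

    def precision_recall_f(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        true_positives = len(retrieved & relevant)
        precision = true_positives / len(retrieved) if retrieved else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure

    # Invented example: the system retrieved four documents, three of
    # which analysts judged relevant, out of five relevant overall.
    p, r, f = precision_recall_f(["d1", "d2", "d3", "d7"],
                                 ["d1", "d2", "d3", "d4", "d5"])
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")  # 0.75, 0.60, 0.67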

Wednesday, May 4, 2016

Big Data: A Definition

Big data is an evolving term that describes any voluminous amount of structured, semi-structured, and unstructured data that has the potential to be mined for information.
Big data is also commonly defined by four Vs (volume, velocity, variety, and value); see https://www.oracle.com/big-data/index.html.

Volume.

The amount of data. While volume indicates more data, it is the granular nature of the data that is unique. Big data requires processing high volumes of low-density, unstructured data, that is, data of unknown value, such as Twitter data feeds, clickstreams on a web page or mobile app, network traffic, readings from sensor-enabled equipment, and more. The task of big data processing is to convert such raw data into valuable information. For some organizations this might be tens of terabytes; for others, hundreds of petabytes.

Velocity.

The fast rate at which data is received and perhaps acted upon. The highest-velocity data normally streams directly into memory rather than being written to disk. Some Internet of Things (IoT) applications have health and safety ramifications that require real-time evaluation and action. Other internet-enabled smart products operate in real time or near real time. For example, consumer e-commerce applications seek to combine mobile-device location with personal preferences to make time-sensitive marketing offers. Operationally, mobile applications serve large user populations, generate increased network traffic, and face the expectation of an immediate response.
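
As a toy illustration of handling high-velocity data in memory (an invented example, not tied to any particular IoT platform; the sensor source, window size, and alert threshold are all assumptions), the sketch below keeps only a small rolling window of readings and acts on the stream immediately rather than writing every event to disk:

    import random
    import time
    from collections import deque

    def sensor_stream(n=20):
        # Stand-in for a real event source such as an IoT message queue.
        for _ in range(n):
            yield {"ts": time.time(), "temp_c": random.uniform(18.0, 30.0)}

    window = deque(maxlen=5)  # keep only the last five readings in memory
    for event in sensor_stream():
        window.append(event["temp_c"])
        rolling_avg = sum(window) / len(window)
        if rolling_avg > 28.0:  # invented threshold: act on the data immediately
            print(f"alert: rolling average {rolling_avg:.1f} C exceeds threshold")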

Variety.

New unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional processing both to derive meaning and to generate supporting metadata. Once understood, unstructured data has many of the same requirements as structured data, such as summarization, lineage, auditability, and privacy. Further complexity arises when data from a known source changes without notice; frequent or real-time schema changes are an enormous burden for both transactional and analytical environments.
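
As a minimal sketch of deriving supporting metadata from unstructured input (an invented example; the field names and source label are assumptions, not a standard schema), the snippet below attaches summarization-, lineage-, and audit-friendly metadata to a raw text snippet:

    import datetime
    import hashlib

    def derive_metadata(text, source="unknown"):
        # Assumed fields: a minimal set supporting summarization,
        # lineage, and auditability for a piece of unstructured text.
        return {
            "source": source,
            "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "length_chars": len(text),
            "length_words": len(text.split()),
            "sha256": hashlib.sha256(text.encode()).hexdigest(),
        }

    print(derive_metadata("Sensors report nominal temperatures.", source="field_notes"))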

Value.

Data has intrinsic value, but it must be discovered. A range of quantitative and investigative techniques can derive value from data, from discovering a consumer preference or sentiment, to making a relevant offer by location, to identifying a piece of equipment that is about to fail. The technological breakthrough is that the cost of data storage and compute has decreased dramatically, providing an abundance of data on which statistical analysis can be performed over the entire data set rather than, as before, only a sample; this makes much more accurate and precise decisions possible. However, finding value also requires new discovery processes involving clever and insightful analysts, business users, and executives. The real big data challenge is a human one: learning to ask the right questions, recognizing patterns, making informed assumptions, and predicting behavior.
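
As a small numerical illustration of the full-data-versus-sample point above (the numbers are invented, not from any source), the Python sketch below compares a mean estimated from a 1% sample with the exact mean computed over the entire data set:

    import random

    random.seed(42)  # reproducible invented data
    population = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

    full_mean = sum(population) / len(population)   # exact: uses every record
    sample = random.sample(population, 10_000)      # a 1% sample
    sample_mean = sum(sample) / len(sample)         # estimate with sampling error

    print(f"full-data mean: {full_mean:.3f}")
    print(f"sample mean:    {sample_mean:.3f}")
    print(f"sampling error: {abs(full_mean - sample_mean):.3f}")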