Thesis project proposals from companies

This is a thesis proposal taken from Nationella Exjobb-poolen.

The proposal was received on 2006-09-18

Integration of data sources in bioinformatics: Study of query plan optimization approaches

NOTE: THE APPLICATION PERIOD FOR THIS THESIS PROJECT HAS EXPIRED.
Scientists in bioinformatics often have to retrieve data from multiple data sources (*) to solve their research problems. The large number of data sources, with heterogeneous data, data formats and access methods, makes data retrieval a difficult task. Accomplishing this task successfully requires a lot of effort and knowledge from the user. She has to decide which data sources to access and in which order, how to retrieve the data and how to combine the results. Although some information integration systems are available in the area, a number of problems remain unsolved.

The thesis is part of a larger project that builds a system enabling transparent access to multiple heterogeneous biological data sources [1]. The user of the system does not need to know about the integrated data sources. She formulates a query in a uniform query language using terms of the mediated schema, which uniformly describes the content of the underlying data sources. The system then performs query processing: it reformulates the user query expressed over the mediated schema into queries over the relevant data sources, creates a query plan that specifies how the query should be executed, executes the query plan and returns the retrieved results to the user.
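The processing steps described above (reformulate, plan, execute, return) can be illustrated with a minimal sketch. It is not the BioTrifu implementation: the schema contents, the source names (SourceA, SourceB, SourceC), the wrapper functions and every identifier below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical mediated schema: maps each schema term to the sources providing it.
MEDIATED_SCHEMA = {
    "protein": ["SourceA", "SourceB"],
    "pathway": ["SourceC"],
}


@dataclass(frozen=True)
class SubQuery:
    source: str   # name of the data source to contact
    term: str     # mediated-schema term the user asked about
    keyword: str  # search keyword supplied by the user


def reformulate(term, keyword):
    """Rewrite a query over the mediated schema into one subquery per relevant source."""
    return [SubQuery(source, term, keyword) for source in MEDIATED_SCHEMA.get(term, [])]


def execute(plan, wrappers):
    """Run each subquery through its source wrapper and merge the results."""
    results = []
    for sub in plan:
        results.extend(wrappers[sub.source](sub))
    return results


# Toy wrappers standing in for real source access (assumption: one callable per source).
wrappers = {
    "SourceA": lambda q: [f"A:{q.keyword}"],
    "SourceB": lambda q: [f"B:{q.keyword}"],
    "SourceC": lambda q: [f"C:{q.keyword}"],
}

plan = reformulate("protein", "p53")  # here the plan is simply the list of subqueries
print(execute(plan, wrappers))
```

In this sketch the user names only the schema term and the keyword; which sources are contacted is decided entirely by the schema mapping, which is the transparency the paragraph above describes.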

One of the aims during query processing is to build an optimal query plan, that is, one that minimizes the cost of query execution. Different query plan optimization techniques are used at various steps of query processing. For example, when decomposing a user query into subqueries over data sources, we aim to push as many operations as possible to the remote data sources. When deciding the order in which data sources will be accessed, we prefer to access first those data sources that return a smaller set of results. Usually, a number of equivalent plans can be generated for the same user query; cost-based techniques are then used to select among them.
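As a rough illustration of cost-based selection among equivalent plans, the sketch below enumerates every access order for a small set of sources and picks the cheapest one. The cost model is an assumption made for this example only: each source is probed once per intermediate item at a fixed cost and keeps a fixed fraction of the items. The source names and numbers are invented, and real systems use richer statistics and avoid exhaustive enumeration.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Source:
    name: str
    cost_per_item: float  # assumed cost of probing this source for one item
    selectivity: float    # assumed fraction of incoming items this source keeps


def plan_cost(order, initial_items):
    """Estimated cost of accessing the sources in the given order."""
    items = initial_items
    total = 0.0
    for source in order:
        total += items * source.cost_per_item  # probe once per intermediate item
        items *= source.selectivity            # fewer items reach the next source
    return total


def best_plan(sources, initial_items):
    """Enumerate all access orders (equivalent plans) and return the cheapest."""
    return min(permutations(sources), key=lambda order: plan_cost(order, initial_items))


# Hypothetical sources and statistics.
sources = [
    Source("SourceC", cost_per_item=2.0, selectivity=0.9),
    Source("SourceB", cost_per_item=1.0, selectivity=0.5),
    Source("SourceA", cost_per_item=1.0, selectivity=0.1),
]

plan = best_plan(sources, initial_items=1000)
print([s.name for s in plan], plan_cost(plan, 1000))
```

With these numbers the cheapest order starts with the most selective source, matching the heuristic of accessing first the sources that return the smallest result sets. Exhaustive enumeration is only feasible for a handful of sources, since the number of orders grows factorially.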

The focus of the thesis is to study existing query optimization approaches, analyze some available query processing systems, select one of the open source query processing systems and adapt it to the needs of the BioTrifu system. Two students (!) are expected to work on this thesis. The students should have good programming skills.

(*) "Data sources" refers to different types of sources of data, e.g. databases, text files storing semistructured information, and applications.

References

[1] BioTrifu. http://www.ida.liu.se/~patla/research/ceniit.html





