Thesis project proposal from a company

This thesis proposal is taken from Nationella Exjobb-poolen.

The proposal was submitted on 2009-12-10

Preparing Data in SQL for Statistical Analysis

NOTE: THE APPLICATION PERIOD FOR THIS THESIS PROJECT HAS EXPIRED.
This project aims to investigate efficient ways to use SQL for scalable preparation of data stored in a DBMS for epidemiological research.

The Department of Medical Epidemiology and Biostatistics (ki.se/meb) at Karolinska Institutet performs advanced analyses of large volumes of data for epidemiological research. The amount of data and the complexity of the analyses require state-of-the-art database technologies, and new methods also have to be developed. The department offers you the opportunity to gain practical experience with relational database management systems and to learn or develop state-of-the-art methods in data management. This experience is applicable to large scientific and medical databases as well as to advanced business intelligence.

The large volumes of data analyzed in epidemiological research are usually stored in a database management system (DBMS) such as DB2 from IBM (www.ibm.com/db2), Oracle (www.oracle.com/database), or SQL Server from Microsoft (www.microsoft.com/sql). The analyses themselves are performed by programs written in statistical software such as SAS (www.sas.com) and R (www.r-project.org). Before the statistical programs can be applied, the data have to be prepared. Data preparation includes selecting the data of interest, joining data from different tables, projecting the needed columns, grouping and calculating aggregates, and applying additional transformations to some data values. Currently, data preparation requires significant effort from researchers and considerable storage and computational resources. There are a number of reasons for this. One is the absence of necessary meta-data, such as mappings between clinical terminology and the codes used in databases. Another is a lack of experience and guidelines in using SQL for data preparation. As a result, many researchers specify data preparation in statistical packages, which leads to inefficient use of computation and storage resources.
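The preparation steps listed above (selection, join, projection, transformation, grouping, and aggregation) can be sketched in a single SQL query. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for DB2 or Oracle, and the tables patient and diagnosis, their columns, and all rows are hypothetical and not taken from any real research database.

```python
import sqlite3

# In-memory SQLite database standing in for the DBMS (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical tables and made-up rows, for illustration only.
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    birth_year INTEGER NOT NULL,
    sex        TEXT    NOT NULL
);
CREATE TABLE diagnosis (
    patient_id INTEGER NOT NULL REFERENCES patient (patient_id),
    icd_code   TEXT    NOT NULL,
    diag_year  INTEGER NOT NULL
);
INSERT INTO patient VALUES (1, 1950, 'F'), (2, 1962, 'M'), (3, 1948, 'F');
INSERT INTO diagnosis VALUES
    (1, 'I21', 2005), (1, 'I21', 2007),
    (2, 'I21', 2006), (2, 'J45', 2006),
    (3, 'E11', 2004);
""")

PREPARE_SQL = """
SELECT sex,
       age_group,
       COUNT(DISTINCT patient_id) AS n_patients,  -- aggregate per group
       COUNT(*)                   AS n_records
FROM (
    SELECT p.patient_id,                           -- projection
           p.sex,
           CASE WHEN d.diag_year - p.birth_year >= 50
                THEN '50+' ELSE '<50' END AS age_group  -- transformation
    FROM patient AS p
    JOIN diagnosis AS d ON d.patient_id = p.patient_id  -- join
    WHERE d.icd_code = 'I21'                       -- selection
) AS prepared
GROUP BY sex, age_group                            -- grouping
ORDER BY sex, age_group
"""

# Only the small aggregated result leaves the database.
rows = conn.execute(PREPARE_SQL).fetchall()
for row in rows:
    print(row)
```

The point of the sketch is that the DBMS returns one row per group rather than every diagnosis record, so the statistical package receives a much smaller, analysis-ready table.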

In this project you will investigate how to implement data preparation efficiently in SQL and how to execute data preparation queries scalably in a DBMS. You will design a relational schema and store meta-data in a DBMS to enable efficient query specification and execution. You will develop several alternative strategies for preparing data for one or several epidemiological projects. You will compare the implemented strategies with each other and with the current way of preparing data in the statistical software SAS.

In particular, you will design and store mappings between different versions of ICD codes (www.socialstyrelsen.se/klassificeringochkoder/diagnoskoder), together with their descriptions and names, in DB2 or Oracle. You will implement data preparation queries in DB2 and test them on an epidemiological research project. The deliverables are the implementation and a report written in English.
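One possible shape for such a mapping is a code table plus a cross-version mapping table. The sketch below is an assumption, not the schema the project prescribes: it again uses Python's sqlite3 in place of DB2 or Oracle, the table names icd_code and icd_mapping and the helper equivalent_codes are hypothetical, and the few rows are abbreviated illustrations rather than an authoritative crosswalk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per code and version, plus a many-to-many
# mapping between codes of different versions.
conn.executescript("""
CREATE TABLE icd_code (
    version     TEXT NOT NULL,
    code        TEXT NOT NULL,
    description TEXT NOT NULL,
    PRIMARY KEY (version, code)
);
CREATE TABLE icd_mapping (
    from_version TEXT NOT NULL,
    from_code    TEXT NOT NULL,
    to_version   TEXT NOT NULL,
    to_code      TEXT NOT NULL,
    PRIMARY KEY (from_version, from_code, to_version, to_code),
    FOREIGN KEY (from_version, from_code) REFERENCES icd_code (version, code),
    FOREIGN KEY (to_version, to_code)     REFERENCES icd_code (version, code)
);
-- Illustrative rows only; a real mapping is far larger and more nuanced.
INSERT INTO icd_code VALUES
    ('ICD-9',  '410', 'Acute myocardial infarction'),
    ('ICD-10', 'I21', 'Acute myocardial infarction'),
    ('ICD-9',  '250', 'Diabetes mellitus'),
    ('ICD-10', 'E10', 'Type 1 diabetes mellitus'),
    ('ICD-10', 'E11', 'Type 2 diabetes mellitus');
INSERT INTO icd_mapping VALUES
    ('ICD-9', '410', 'ICD-10', 'I21'),
    ('ICD-9', '250', 'ICD-10', 'E10'),
    ('ICD-9', '250', 'ICD-10', 'E11');
""")


def equivalent_codes(conn, version, code):
    """Return the given code plus all codes mapped to or from it,
    as sorted (version, code) pairs."""
    sql = """
    SELECT version, code FROM icd_code
     WHERE version = ? AND code = ?
    UNION
    SELECT to_version, to_code FROM icd_mapping
     WHERE from_version = ? AND from_code = ?
    UNION
    SELECT from_version, from_code FROM icd_mapping
     WHERE to_version = ? AND to_code = ?
    ORDER BY 1, 2
    """
    return conn.execute(sql, (version, code) * 3).fetchall()


print(equivalent_codes(conn, "ICD-10", "I21"))
print(equivalent_codes(conn, "ICD-9", "250"))
```

With mappings stored this way, a data preparation query can join against icd_mapping to select diagnoses across coding versions instead of hard-coding lists of codes in each statistical program.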

It is essential that you have good programming skills and knowledge of mathematics, and that you have taken a database course. Practical experience with an RDBMS and SQL is a plus.
The project requires you to come to the department (Stockholm/Solna) every day during the implementation phase. All conversations and discussions will be held in English.

  GO TO XJOBB.NU FOR COMPLETE INFORMATION ABOUT THIS THESIS PROJECT



