Lina FAHED


Postdoctoral researcher
Computer Science Dept.

Phone: 02 29 00 16 07
Fax: 02 29 00 12 82
Email: lina.fahed@imt-atlantique.fr
IMT Atlantique

Research Activities

Research Interests

  • Data mining & machine learning approaches: rule and pattern mining, classification, multi-label classification, clustering
  • Modeling: early prediction modeling, emergent event detection, recommender systems, user profile modeling, adaptive learning
  • Data: data sequences, data streams, complex data (voluminous, varied, rapid, volatile)

General Context

  • Complex data have five principal characteristics, the 5 V's:
    • Volume: a huge amount of data is being generated
    • Velocity: data are generated, captured and shared at a very high frequency
    • Variety: data are noisy, raw and unstructured, and come from several sources
    • Volatility: the importance of data changes over time
    • Value: data only acquire value through contextualization
  • Hypothesis: predicting and detecting the occurrence of events as early as possible makes it possible to react earlier, to influence the occurrence of future events and thus, if necessary, to minimize the associated risk.

Postdoctoral Research Activities

During my postdoctoral period, I am working on two independent topics:

  • Anomaly detection and prediction in information systems: we propose a multi-step algorithm for analyzing log data:
    • Log parsing: the objective is to separate the constant and variable parts of each log in order to form a structured sequence. Raw, unstructured logs are thus transformed into structured event sequences.
    • Merging multiple data sources: given the complexity of modern systems, several anomalies can occur at the same time and overlap, which makes it impossible to identify the logs that belong to a specific anomaly. For this reason, we propose to use a second data source, the anomaly reporting system, and some of its useful information: the structured human-language text describing the anomaly, its creation date, and its closing date (anomaly resolved). We propose a projection function that links the two data sources (the logs and the anomaly reporting system), so that we can isolate and label the logs corresponding to each anomaly.
    • Anomaly detection: we propose to extract sequential patterns from the labelled logs in order to detect correlations in the data.
    • A publication on this algorithm is in preparation.
  • Early prediction of future events in a sequence database: we propose an algorithm for mining sequential rules from a sequence database. The mined rules have two characteristics:
    • A minimum temporal distance (a gap) between the antecedent and the consequent of a rule, which makes it possible to control the horizon at which the consequent (the future event) appears.
    • We hypothesize that multiple occurrences of an event in a sequence can be a weak signal and should be considered a trigger for predicting a future event. Therefore, the algorithm allows an event to appear several times in the antecedent of a rule.
    • A publication on this algorithm is planned.
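
The log parsing and labelling steps of the first topic can be illustrated with a minimal Python sketch. It is not the algorithm described above: the masking patterns, the function names `parse_log` and `label_logs`, the ticket fields and the sample data are all hypothetical, and the projection between logs and anomaly reports is assumed here to be a simple time-window test (a log is attached to every anomaly report that was open when the log was written).

```python
import re
from datetime import datetime

# Hypothetical masking rules: whatever matches is treated as a variable part.
VARIABLE_PATTERNS = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),  # IPv4 addresses
    re.compile(r"\b0x[0-9a-fA-F]+\b"),            # hexadecimal identifiers
    re.compile(r"\b\d+\b"),                       # plain integers
]

def parse_log(message):
    """Return the constant template of a log message, variables masked as <*>."""
    for pattern in VARIABLE_PATTERNS:
        message = pattern.sub("<*>", message)
    return message

def label_logs(logs, tickets):
    """Attach to each (timestamp, message) log the ids of the tickets open at that time.

    Assumption: the projection is a time-window test between the log timestamp
    and the creation/closing dates of each anomaly report.
    """
    labelled = []
    for timestamp, message in logs:
        ids = [t["id"] for t in tickets if t["created"] <= timestamp <= t["closed"]]
        labelled.append((timestamp, parse_log(message), ids))
    return labelled

# Made-up sample data, for illustration only.
logs = [
    (datetime(2018, 3, 1, 10, 0), "Connection from 10.0.0.7 refused on port 8080"),
    (datetime(2018, 3, 1, 12, 30), "Disk 3 usage at 97 percent"),
]
tickets = [
    {"id": "T1", "created": datetime(2018, 3, 1, 9, 0), "closed": datetime(2018, 3, 1, 11, 0)},
]
labelled = label_logs(logs, tickets)
for timestamp, template, ids in labelled:
    print(timestamp, template, ids)
```

With overlapping anomalies, a log written while several reports are open receives several labels under this assumption. The text of the anomaly report, which the description above also mentions, is not used in this sketch.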

Doctoral Research Activities (as a PhD student)

Contributions:

  • DEER (Distant and Essential Episode Rules) [pdf]: an algorithm for mining episode rules in a sequence of events. DEER models the complete sequence to extract episode rules with two characteristics:
    • The rules have a temporal distance (a gap) between the antecedent (the trigger) and the consequent (the future event), which makes it possible to specify the horizon at which the consequent occurs.
    • The rules are essential: their antecedent is minimal, that is, as small as possible in terms of number of events and temporal duration. Therefore, when the extracted rules are used in a prediction task, a minimum amount of information (the minimal antecedent) is enough to predict a distant event (the consequent of the rule) as early as possible.
  • EER (Emergent Episode Rules): an algorithm for the early detection of emergent associations (episode rules) in an event stream.
    • Hypothesis: new rules do not emerge from nothing and are not entirely new; they are linked to, or influenced by, rules that are already known. We therefore consider that new rules which begin to appear and which are similar to known rules will emerge in the future.
    • The originality of EER is that it processes the event stream in real time in order to capture the first appearances of new episode rules and to detect as early as possible the emergence of new rules similar to known rules (rules that appeared frequently in the past).
    • EER differs from existing emergence detection algorithms in that it detects the emergence of rules (not single events) that are both unknown and not yet frequent.
    • EER runs on Apache Storm (a distributed stream processing platform).
    • A publication on this algorithm is in preparation.
  • IE (Influencer Events) [pdf]: an algorithm for detecting what we call "influencer events" in an event sequence.
    • Influencer events are events that, when injected into a specific context, influence some characteristics of future events: the probability of their occurrence and their appearance horizon (the temporal distance at which they will appear).
    • Several influence measures are proposed: a frequency influence measure, a confidence influence measure, and a temporal distance influence measure.
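
The role of the temporal gap in an episode rule can be illustrated with a minimal Python sketch. It only evaluates one given rule on a timestamped event sequence; it does not mine rules and does not implement DEER or its minimal-antecedent search. The function names, the parameters `min_gap`, `max_gap` and `max_span`, and the sample sequence are hypothetical. Confidence is taken here as the fraction of antecedent occurrences that are followed by the consequent at a distance between `min_gap` and `max_gap`.

```python
def antecedent_end_times(sequence, antecedent, max_span):
    """End times of occurrences of `antecedent` (events in order) lasting at most max_span.

    From each position holding the first antecedent event, the earliest
    completion is searched greedily; each start counts as one occurrence.
    """
    ends = []
    for i, (start, event) in enumerate(sequence):
        if event != antecedent[0]:
            continue
        pos, end = 1, start
        for time, ev in sequence[i + 1:]:
            if pos == len(antecedent) or time - start > max_span:
                break
            if ev == antecedent[pos]:
                pos, end = pos + 1, time
        if pos == len(antecedent):
            ends.append(end)
    return ends

def rule_confidence(sequence, antecedent, consequent, min_gap, max_gap, max_span):
    """Fraction of antecedent occurrences followed by `consequent` within [min_gap, max_gap]."""
    ends = antecedent_end_times(sequence, antecedent, max_span)
    if not ends:
        return 0.0
    hits = sum(
        any(ev == consequent and min_gap <= t - end <= max_gap for t, ev in sequence)
        for end in ends
    )
    return hits / len(ends)

# Made-up sequence of (timestamp, event) pairs, sorted by timestamp.
sequence = [
    (1, "a"), (2, "b"), (7, "c"),
    (10, "a"), (11, "b"), (13, "c"),
    (20, "a"), (21, "b"), (27, "c"),
]
# Rule (a, b) -> c with a gap of 5 to 8 time units after the antecedent ends.
confidence = rule_confidence(sequence, ("a", "b"), "c", min_gap=5, max_gap=8, max_span=2)
print(confidence)  # 2 of the 3 antecedent occurrences qualify
```

In this example the second occurrence of the antecedent is followed by "c" after only 2 time units, so it does not support a rule whose purpose is to predict a distant event. Lowering `min_gap` to 0 would count it.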

 


Master Research Activities (as a research intern)

Contribution: an algorithm for multi-label classification applied to multiple alignments of protein sequences [pdf]:

  • In bioinformatics, a multiple alignment of protein sequences is a way of representing several sequences one beneath the other so as to highlight homologous or similar regions. These alignments are built by software tools, called "aligners", whose objective is to maximize the number of matches between components of the different sequences.
  • It is important to predict which aligner will align the sequences correctly. The challenge, however, is to predict not only the aligners that are correct, but also the aligner(s) most suitable for the provided sequences.
  • In order to adapt multi-label classification techniques to our problem, we assign each aligner the label "best" if its alignment score is within a tolerance margin epsilon of the maximum score among all aligners, and the label "worst" otherwise. We allow this tolerance margin because several alignments can be pertinent at the same time. The objective of this work is to identify a reasonable value for epsilon.
  • Determining the lowest appropriate value of epsilon required an analysis of the various performance measures specific to multi-label classification (AUC, good prediction rate, accuracy, recall, micro and macro averages, etc.).
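
The labelling rule with a tolerance margin can be written as a short Python sketch. The function name `label_aligners`, the aligner names and the scores are hypothetical and serve only to illustrate the rule; higher scores are assumed to be better.

```python
def label_aligners(scores, epsilon):
    """Label an aligner 'best' if its score is within epsilon of the maximum, else 'worst'."""
    top = max(scores.values())
    return {
        name: ("best" if top - score <= epsilon else "worst")
        for name, score in scores.items()
    }

# Made-up alignment scores for one set of sequences.
scores = {"aligner_A": 0.91, "aligner_B": 0.89, "aligner_C": 0.70}
labels = label_aligners(scores, epsilon=0.05)
print(labels)
```

With epsilon set to 0, only the top-scoring aligner is labelled "best". A larger epsilon lets several aligners share the label, which is what makes the problem multi-label.
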
Technopôle Brest-Iroise - CS 83818 - 29238 Brest Cedex 3 - France
Tel: 33 (0)2 29 00 11 11 - Fax: 33 (0)2 29 00 10 00