Machine learning applied to language
'In my current PhD research, I have found that I successfully transferred all the key skills from a theoretical context into a practical environment. An average day is spent working on the Statistical Multi-Lingual Analysis for Retrieval and Translation (SMART) project, which involves segmenting sentences into phrases and building a statistical model for phrase translation. What I enjoy most about this project is the successful application of my supervised learning knowledge and of the statistical model I created. The work has also enabled me to develop extensive new competencies in Perl and Python as well as stochastic calculus. What is more, the broad applicability of machine learning has made it possible for me to join cross-disciplinary projects. Last year we successfully applied kernel methods to fMRI data analysis, which led to our winning entry in the Pittsburgh Brain Activity Interpretation Competition (PBAIC) 2007. Looking to the future, the firm foundation acquired from my MSc programme allows me to go further in my research and study.' Yizhao Ni, PhD student at the ISIS Group, Electronics and Computer Science, University of Southampton, MSc in Machine Learning 2005/2006
Language and its evolution, development and acquisition are main subjects of linguistics, cognitive science and artificial intelligence.
• Grammar inference

Language is modelled in order to devise algorithms that correctly infer the structure of a grammar, for synthetic as well as natural languages, and that perform the linguistic judgements commonly made by native speakers. Previous research at UCL and RHUL explored the strength of string kernels for grammatical induction in (mildly) context-sensitive languages within the PASCAL-supported project Grammatical Inference with String Kernels (GISK). Mildly context-sensitive languages were defined geometrically as hyperplanes in feature space via the notion of a planar language: given a set of strings as a basis, the language consists of all strings whose images in feature space lie in the smallest hyperplane containing the images of the basis.
In this high-dimensional feature space, languages are learnt via the unsupervised learning technique kernel PCA. For this purpose, suitable definitions of learnability were developed, and theorems were proven showing that certain languages from the Chomsky hierarchy can be learnt under different models of learnability. The project yielded the insight that mildly context-sensitive languages can be learned efficiently from positive training examples alone, much as native speakers encounter only examples from their own language. The main conference in this domain is the International Colloquium on Grammatical Inference (ICGI), which covers different models of grammar inference, algorithms for the induction of various classes of languages and automata, theoretical and experimental machine learning approaches to grammar induction, and a wide range of general topics from language semantics to developmental and evolutionary constraints on language acquisition. These ideas can be embodied in a grammar checker that can be run on any text document to infer whether it contains a grammatical mistake, based on a learnt understanding of the role that words play within a text.
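The planar language idea can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the GISK project: the toy alphabet, the feature map `phi` (counts of all subsequences of length 1 and 2, gaps allowed), and the helper names. In place of kernel PCA it uses an exact rank test on explicit feature vectors, which answers the same geometric question of whether the image of a string lies in the smallest hyperplane containing the images of the basis.

```python
from fractions import Fraction
from itertools import product

ALPHABET = "ab"  # assumed toy alphabet; strings must use only these symbols
# Feature names: every subsequence of length 1 or 2 over the alphabet
FEATURES = ["".join(p) for k in (1, 2) for p in product(ALPHABET, repeat=k)]


def phi(s):
    """Map a string to its vector of subsequence counts (gaps allowed)."""
    counts = dict.fromkeys(FEATURES, 0)
    for i, c in enumerate(s):
        counts[c] += 1
        for d in s[i + 1:]:
            counts[c + d] += 1
    return [Fraction(counts[f]) for f in FEATURES]


def rank(rows):
    """Rank of a list of vectors, by exact Gaussian elimination."""
    rows = [list(r) for r in rows]
    r = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


def in_planar_language(s, basis):
    """True if phi(s) lies in the affine hull of the images of the basis."""
    origin = phi(basis[0])
    diffs = [[x - o for x, o in zip(phi(b), origin)] for b in basis[1:]]
    target = [x - o for x, o in zip(phi(s), origin)]
    # Adding the target must not raise the rank of the direction vectors.
    return rank(diffs + [target]) == rank(diffs)


basis = ["ab", "aabb", "aaabbb"]  # positive examples only
print(in_planar_language("aaaabbbb", basis))
print(in_planar_language("abab", basis))
```

With this feature map, the three positive examples appear to be enough to accept a longer string of the same shape such as `aaaabbbb` while rejecting `abab`, whose `ba` count is non-zero. A different feature map would define a different family of planar languages.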
• Statistical Multilingual Analysis for Retrieval and Translation (SMART)

UCL works within the eight-site project Statistical Multilingual Analysis for Retrieval and Translation (SMART). The project has several motivations. The majority of people in the EU do not speak a second language, which limits European integration in terms of economic and cultural exchange. This creates a demand for tools capable of automatic translation and cross-language retrieval. Statistical approaches have proved promising compared to rule-based systems in terms of accurate lexical choice and development investment, but they also have disadvantages, such as lower fluency and little use of contextual information when translating ambiguous words. To tackle these issues, correlation analysis approaches and discriminative kernel techniques for structured output learning are being developed. The resulting translation models adapt online to user feedback. The project addresses three scenarios to demonstrate clear progress over previous approaches:
• Aimed at professional translators, to address the issue of productivity
• Aimed at technicians working by phone, to provide assistance in a language in which no technical documentation is available
• Aimed at giving users access to portions of the multilingual Wikipedia in languages in which they have little competence

