About me
I am a Principal Researcher and Group Leader at the Samsung AI Centre in Cambridge, U.K., working on efficient Vision and Language. The mission at SAIC-Cambridge is to conduct research that enables the commercialization of new features across Samsung's ever-growing portfolio of AI products. As part of this mission, we routinely conduct novel research and publish it at top venues, while also working with other teams within Samsung to bring the technology into products.
In recent years, my group has worked on contrastively trained V&L models, LMMs, and image generation. We have significant expertise in both large-scale training and on-device porting, which allows us to exploit synergies between the two, from optimization-based compression and quantization techniques to architectural changes that improve the efficiency of on-device inference.
Before joining Samsung, I worked for about three years at Amazon in Seattle, where I enjoyed being part of the Amazon Go and AWS Rekognition teams.
I am interested in a wide variety of topics in machine learning and computer vision; in fact, it is most often the process, the team, and the prospect of impact that I find appealing, rather than the topic itself. While most of my work during my years in academia was on face analysis, I have worked on topics as diverse as human action recognition, binary neural networks, knowledge distillation, and lipreading.
This is my Google Scholar profile.
News
- 1 paper at NeurIPS
  A Bayesian Approach to Data Point Selection
  https://arxiv.org/abs/2411.03768
- 2 papers at EMNLP
  EMNLP Findings:
  MobileQuant: Mobile-friendly Quantization for On-device Language Models
  https://arxiv.org/abs/2408.13933
  https://github.com/saic-fi/MobileQuant
  EMNLP main track:
  Efficient Vision-Language pre-training via domain-specific learning for human activities
- 2 papers at ECCV'24 and 1 at IJCV
  Knowledge Distillation Meets Open-Set Semi-Supervised Learning @ IJCV
  https://arxiv.org/abs/2205.06701
  You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation @ ECCV'24
  https://arxiv.org/abs/2401.17258
  CLIP-DPO: Vision-Language Models as a Source of Preference for Improved Vision-LLMs @ ECCV'24
  https://arxiv.org/abs/2408.10433
- New arXiv preprint
  You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
  Super-resolution with Stable Diffusion: we achieve state-of-the-art quality while needing only one step at inference time!
  https://arxiv.org/pdf/2401.17258.pdf
- 1 paper at EACL'24 (Oral)
  Graph Guided Question Answer Generation for Procedural Question-Answering
  Paper: https://arxiv.org/abs/2401.13594
  We train compact AI assistants for procedural tasks (e.g. cooking a meal) that can match or even beat ChatGPT, yet run easily on your phone. The key is to use graph representations of the procedures to automatically create exhaustive, high-quality QA pairs in a controllable manner, so that a specialized in-domain model can be trained; a toy sketch of the idea follows below.
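  To make the graph-guided idea concrete, here is a minimal sketch in Python of how templated QA pairs could be generated by walking a procedure graph. The Step structure and the question templates are illustrative assumptions for this sketch, not the paper's actual pipeline.

      # Hypothetical sketch: templated QA generation over a procedure graph.
      # Step fields and question templates are made up for illustration.
      from dataclasses import dataclass, field

      @dataclass
      class Step:
          name: str                                          # e.g. "whisk the eggs"
          requires: list[str] = field(default_factory=list)  # tools/ingredients
          next: list["Step"] = field(default_factory=list)   # successor steps

      def generate_qa(step: Step) -> list[tuple[str, str]]:
          """Emit (question, answer) pairs for one node of the graph."""
          qa = [(f"What do I need to {step.name}?", item) for item in step.requires]
          qa += [(f"What should I do after I {step.name}?", nxt.name) for nxt in step.next]
          return qa

      # Toy recipe fragment: whisk the eggs -> fry the omelette.
      fry = Step("fry the omelette", requires=["a pan", "butter"])
      whisk = Step("whisk the eggs", requires=["eggs", "a bowl"], next=[fry])
      for question, answer in generate_qa(whisk):
          print(question, "->", answer)

  Because every question is instantiated from a template over the graph's structure, the generated set can be made exhaustive and its coverage controlled, which is what allows a small in-domain model to be trained on it.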
- 4 papers at ICCV'23
  ReGen: A good Generative zero-shot video classifier should be Rewarded
  Paper: OpenAccess link
  Black Box Few-Shot Adaptation for Vision-Language Models
  Paper: https://arxiv.org/abs/2304.01752 Code: https://github.com/saic-fi/LFA
  FSD-Prompt: Few-Shot Detection Prompting without retraining
  Paper: https://arxiv.org/abs/2210.04845
  Bayesian Prompt Learning for Image-Language Model Generalization
  Paper: https://arxiv.org/abs/2210.02390 Code: https://github.com/saic-fi/Bayesian-Prompt-Learning
- 1 paper at ICLR'23
  Efficient Self-supervised Pre-training on Low-compute Networks without Distillation
  Paper: https://arxiv.org/abs/2210.02808 Code: https://github.com/saic-fi/SSLight
3 papers at ECCV'22
EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers
Efficient transformers for Mobile devices
Paper: https://arxiv.org/abs/2205.03436 Code: https://github.com/saic-fi/edgevitLearning hand-held object appearance for compositional action recognition:
SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action RecognitionI was also lucky to be (a small) part of the work led by SAIC-Toronto on instructional videos, accepted as an oral:
Flow graph to Video Grounding for Multi-Step Localization -
- 2 papers at BMVC'21
  Preprints of the two BMVC'21 papers are available on arXiv:
  Few-shot Action Recognition with Prototype-centered Attentive Learning
  Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention
- 2 papers at NeurIPS'21
  Check out the preprint versions of our NeurIPS'21 papers:
  Space-time Mixing Attention for Video Transformer
  Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization
- 1 paper at ICCV'21
  One ICCV'21 paper on temporal action localization: Boundary-sensitive Pre-training for Temporal Localization in Videos
- 1 paper at ICASSP'21
  Our work on lipreading has been accepted for publication at ICASSP'21. See the arXiv version.
- Two new ICLR'21 papers
  One ICLR'21 paper on knowledge distillation: Knowledge Distillation via Softmax Regression Representation Learning
  Code is publicly available here.
  One ICLR'21 paper on binary neural networks: High-Capacity Expert Binary Networks
- AC @ ICCV'21
  I've been selected to act as an Area Chair for the upcoming ICCV'21.
- Organizing CVPR'21 Workshop
  I'm co-organizing the 1st Workshop on Binary Networks, to be held in conjunction with CVPR'21.
- ECCV'20 paper accepted
  BATS: Binary ArchitecTure Search has been accepted at ECCV'20.
- ICLR'20 paper accepted
  Our paper on binary CNNs has been accepted at ICLR'20. It sets a new state of the art for binary networks: 65.4% top-1 accuracy on ImageNet with a binary ResNet-18, an improvement of over 5%!
- ICASSP'20 paper accepted
  Our paper on lipreading has been accepted at ICASSP'20 for an oral presentation. It raises the state of the art on LRW and LRW1000 by 1.2% and 3.2% top-1 accuracy, respectively. You can check out the arXiv version here.
- ICCV'19 paper accepted
  Our paper on action recognition has been accepted at ICCV'19. We achieve 78.8% on Kinetics-400 and 53.4% on Something-Something V1, without even using two-stream or non-local networks. The paper can be accessed here.
- Moving to Samsung
  From April 2019 I'll be part of the Samsung AI Research Center in Cambridge, UK, in a new role as Senior Researcher.
- TPAMI paper accepted!
  You can check it on the IEEE Xplore page. Alternatively, there is an arXiv version.
- Paper at ECCV'16
  You can check it on arXiv here: https://arxiv.org/abs/1608.01137
- Moving to Amazon
  From June 2016 I'll be part of Amazon in a new role as Research Scientist. I'll thus be leaving my position at the University of Nottingham.
- Co-organizing ChaLearn
  I'm a co-organizer of the ChaLearn LAP and FotW challenge and workshop @ CVPR 2016.
  Challenge page: http://gesture.chalearn.org/
- Organizing BMVA Technical Meeting
  The Computational Face: Automatic Face Analysis and Synthesis
  One-day BMVA symposium in London, UK, on 14th October 2015.
  Chairs: Brais Martinez, Yorgos Tzimiropoulos and Michel Valstar
  Keynote speakers: Tim Cootes (University of Manchester), Darren Cosker (University of Bath), Maja Pantic (Imperial College London), Richard Bowden (University of Surrey)
  Webpage and registration: http://www.bmva.org/meetings