Introduction

Proteins begin to interact with nascent RNAs as soon as transcription is initiated. The protein complement decorating an RNA molecule changes dynamically in space and time, orchestrating RNA processing and function in the nucleus and cytoplasm1. Ribonucleoprotein (RNP) complexes are key to every step of RNA processing and function, and understanding the roles that RNA-binding proteins (RBPs) play requires methods that identify the set of RNAs that they bind in cells during specific developmental stages, activities or disease states.

Numerous methods can characterize the RNA interactions that coordinate RNP assembly. These approaches can be protein-centric, describing the compendium of RNA sites bound by a specific RBP, or RNA-centric, identifying the RNA-bound proteome. The most common protein-centric strategies are based on the immunopurification of an RBP and its associated RNAs, and can be broadly categorized as RNA immunoprecipitation (RIP) or cross-linking and immunoprecipitation (CLIP) approaches. RIP approaches purify the RNA–protein complexes under native conditions2,3 or using formaldehyde cross-linking4. CLIP techniques are more widely used and rely on the irradiation of cells by UV light, which causes proteins in the immediate vicinity of the irradiated bases to irreversibly cross-link to the RNA by a covalent bond5 (Fig. 1). The covalent cross-links allow stringent purification of the RNA–protein complexes, which is followed by a series of steps to determine the interactions of a specific protein across the transcriptome. CLIP uses a limited RNase treatment of cross-linked RNPs to isolate RNA fragments occupied by the RBP and sequencing of these fragments can identify RBP binding sites, which allows inference of RBP function through determining the location of binding sites relative to, for example, other RBP binding sites or cis-acting elements (Box 1).

Fig. 1: Overview of the general CLIP workflow.

Schematic overview of the core steps common to most variants of the cross-linking and immunoprecipitation (CLIP) protocol. RBP, RNA-binding protein. Adapted with permission from ref.14, Elsevier.

The development of high-throughput sequencing of RNA isolated by CLIP (HITS-CLIP) has enabled a transcriptome-wide view of RNA binding sites6. CLIP techniques have been further developed to identify cross-link sites with nucleotide resolution, either through analysis of mutations in reads (photoactivatable ribonucleoside-enhanced CLIP (PAR-CLIP))7 or by capturing cDNAs that terminate at the cross-linked peptide during reverse transcription (individual-nucleotide resolution CLIP (iCLIP))8. The development of dedicated bioinformatics workflows has allowed the determination of binding sites and consensus motifs to better understand post-transcriptional regulation9.

This Primer focuses on experimental and computational aspects of CLIP methods that have been broadly adopted and have generated widely used data sets. We also cover the identification of RBP binding sites by tagging RBPs with enzymes that naturally act on RNA, where the resulting RNA modifications can be identified by high-throughput sequencing10, as well as the use of subcellular compartment-specific proximity labelling to study localized transcriptomes11. Finally, we discuss the applications of these techniques to obtain a systems-level view of RNP assembly and dynamics in multiple model organisms and review strategies for method optimization and quality assessment of the data. For discussion of additional protein-centric methods, we refer the readers to recent reviews12,13,14. Note that we do not extensively cover studies that identify the global RNA-bound proteome, as these have been reviewed elsewhere1; instead, we focus on methods that identify proteins bound to specific RNAs to discuss how their insights complement protein-centric methods, and outline how these integrative approaches can take us closer towards a comprehensive view of RNP assembly and remodelling.

Experimentation

Protein-centric methods

All CLIP-based methods for determining the binding landscape of RBPs on a transcriptome-wide scale share the following core workflow (Fig. 1). First, RNAs and interacting proteins are irreversibly cross-linked by UV light in intact cells (UVC at λ = 254 nm or UVA/B at λ = 312–365 nm for PAR-CLIP). The amount of UV cross-linking energy used needs to be adapted depending on whether cell monolayers, a suspension of dissociated tissue15, whole tissue or whole organisms such as worms16 and plants17,18 are used. For tissues that cannot easily be dissociated, such as most adult mammalian tissues, plants or post-mortem human tissues, frozen tissue can be ground in liquid nitrogen to a fine powder and cross-linked on dry ice18,19. After cross-linking, RNAs are trimmed to short fragments by RNase digestion and the cross-linked RNP of interest is stringently purified using immunoprecipitation or other methods14 (Box 2). RNPs are then further purified using denaturing polyacrylamide gel electrophoresis (SDS-PAGE) and cross-linked RNA fragments released by digestion of the RBP, usually by proteinase K. The yield of RNA fragments is typically in the low-nanogram range, and thus protocols optimized to work with a limited amount of short RNAs are used to convert the RNA into cDNA for high-throughput sequencing20,21. Sequenced reads are mapped to the genome and clusters of overlapping reads representing possible binding sites are computationally separated from the usually high levels of background7,22,23. In order to reveal sites that are likely to be functional, for example those conferring post-transcriptional gene regulatory effects, the list of binding sites can be sorted according to various criteria including the relative RBP occupancy, which describes the fraction of all instances of a binding site occupied by the RBP at the time of cross-linking24.

Each variant of CLIP uses a unique approach to one or more of the above-mentioned steps. We describe the differences among primary variants below, with further comparisons and additional variants being covered elsewhere14. We do not intend to advocate one variant over another, but the provided information can help researchers to make an informed choice of their preferred CLIP variant. Note that RBPs differ greatly in their cross-linking efficiencies depending on their mode of RNA binding and whether UVC, 4-thiouridine (4SU)-induced UVA/B or formaldehyde cross-linking is used25,26,27. However, further studies are needed to determine what factors influence these relative efficiencies.

Original CLIP and its adaptation to high-throughput sequencing

Cross-linking in original CLIP workflows is accomplished using UVC, which preferentially cross-links RBPs to uridines and, to a lesser extent, guanosines28,29,30. Following mild RNase digestion and purification of the selected RBP, RNA fragments are ligated to a 3′ adapter and radiolabelled to visualize and aid purification of the cross-linked RNP after SDS-PAGE and membrane transfer15. Cross-linked RNA fragments are recovered, ligated to a 5′ adapter, converted into cDNA by reverse transcription and amplified by PCR, similar to the standard protocols developed for microRNA (miRNA) characterization31. However, here the reverse transcriptase needs to read across the oligopeptide attached to the cross-linked nucleotide to reach the 5′ adapter. Premature termination results in a bias towards contaminating non-cross-linked sequences in resulting cDNA libraries; some computational tools for HITS-CLIP therefore take advantage of the low but consistent mutation signature at such events22,32,33. CLIP was adapted for next-generation sequencing in HITS-CLIP6 (Fig. 2a) by adding sequences required for Illumina sequencing to the PCR primers6. The related approach cross-linking and analysis of cDNAs (CRAC)32, originally developed for yeast RBPs, uses affinity-based purification under denaturing conditions as an alternative to immunoprecipitation.

Fig. 2: Overview of primary CLIP variants and TRIBE.

a | Comparative schematic of methods for the primary cross-linking and immunoprecipitation (CLIP) variants, including high-throughput sequencing of RNA isolated by CLIP (HITS-CLIP), individual-nucleotide resolution CLIP (iCLIP), infrared CLIP (irCLIP), enhanced CLIP (eCLIP), photoactivatable ribonucleoside-enhanced CLIP (PAR-CLIP) and Proximity-CLIP. Top box: steps prior to immunoprecipitation, including treatment of cultured cells with 4-thiouridine (4SU) and cross-linking. Middle box: RNA manipulation steps. Bottom box: cDNA preparation, sequencing and peak calling steps. Note that PAR-CLIP is predominantly performed using 4SU as a photoreactive nucleoside, but 6-thioguanosine (6SG) can also be used and results in a G to A transition. Cross-linking and analysis of cDNAs (CRAC), which closely resembles HITS-CLIP, uses protein tags that allow denaturing purification. b | Targets of RNA-binding proteins identified by editing (TRIBE) workflow. The RNA-binding protein (RBP) of interest is expressed as a fusion protein with the catalytic domain of the RNA editing enzyme RNA-specific adenosine deaminase (ADARcd). Binding sites are detected by ADAR-mediated adenosine to inosine modifications revealed by scoring for A to G mutations in cDNA libraries prepared by standard RNA sequencing (RNA-seq). 32P, 3′ ligation products detected using a phosphorus-32-labelled adapter; IR800, 3′ ligation products detected using an IR800-biotin dye-labelled adapter; x, step not applicable. a, RBPs biotinylated in a compartment-specific manner.

Individual-nucleotide resolution CLIP, infrared CLIP and enhanced CLIP

iCLIP8, infrared CLIP (irCLIP)34 and enhanced CLIP (eCLIP)35 differ from original CLIP in their purification and cDNA library preparation strategies (Fig. 2a; Box 2). They take advantage of the tendency of reverse transcriptase to terminate at the cross-linked nucleotide, which yields cDNAs with a 5′ end mapping to the first nucleotide downstream of the cross-linking site and allows the identification of cross-link sites at nucleotide-level resolution. To introduce primer binding sites for cDNA library amplification, iCLIP uses a cDNA circularization approach similar to the ribosome footprinting protocol36; reverse transcription is primed with a long DNA oligonucleotide containing both PCR primer sites, and the cDNA products are circularized using thermostable RNA ligases that also act on DNA37. At least 18 variants of CLIP have adopted the approach to amplify truncated cDNAs14; some, such as irCLIP, use cDNA circularization approaches similarly to iCLIP, whereas others, such as eCLIP and iCLIP2 (ref.38), use highly concentrated T4 RNA ligase 1 to ligate a DNA adapter to the 3′ end of the cDNA.
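The relationship between a truncated cDNA and the inferred cross-link position can be illustrated with a short Python sketch. It assumes 0-based, half-open alignment coordinates for each cDNA insert, already oriented to the strand of the RNA; the function names and the input tuple layout are hypothetical and do not correspond to any published CLIP tool.

```python
from collections import Counter


def crosslink_position(read_start, read_end, strand):
    """Infer the cross-link coordinate from a truncated cDNA alignment.

    Coordinates are assumed to be 0-based and half-open [read_start, read_end).
    The cDNA 5' end maps one nucleotide downstream of the cross-link, so the
    cross-link lies one nucleotide upstream of the read start on the RNA.
    """
    if strand == "+":
        return read_start - 1   # upstream means a lower genomic coordinate
    if strand == "-":
        return read_end         # upstream means a higher genomic coordinate
    raise ValueError(f"unknown strand: {strand!r}")


def crosslink_counts(alignments):
    """Tally inferred cross-link events per (chromosome, position, strand)."""
    counts = Counter()
    for chrom, start, end, strand in alignments:
        counts[(chrom, crosslink_position(start, end, strand), strand)] += 1
    return counts


if __name__ == "__main__":
    example = [
        ("chr1", 100, 130, "+"),
        ("chr1", 100, 125, "+"),  # different length, same truncation point
        ("chr1", 50, 80, "-"),
    ]
    for site, n in sorted(crosslink_counts(example).items()):
        print(site, n)
```

Reads of different lengths that share a start coordinate support the same cross-link site, which is why the tally is keyed on the inferred position rather than on the full alignment.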

Photoactivatable ribonucleoside-enhanced CLIP

In PAR-CLIP7,15,5, cultured cells are incubated with nucleosides modified with an exocyclic thione group, specifically 4SU or 6-thioguanosine (6SG), which are then incorporated into nascent RNAs (Fig. 2a). The exocyclic thione group increases the photoreactivity of the base, allowing cross-linking with a lower energy of UV light (UVA/B, 312 ≤ λ ≤ 365 nm) than that used in other CLIP methods. When using 4SU, cross-linked amino acids are attached to position 4 of the base — changing its base-pairing properties — whereas unmodified uridines cross-link at position 5, which leaves their Watson–Crick face intact39. Cross-linked 4SU preferentially pairs with guanosine during reverse transcription, resulting in a characteristic T to C transition in the sequenced cDNA (a G to A transition occurs when using 6SG)7. This may simplify data analysis as enrichment of such transitions at specific genomic regions indicates bona fide interaction sites and helps to determine the precise location and strength of the RNA–RBP interaction.
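As a minimal sketch of how T to C transitions could be tallied, the following Python example counts, for every reference T, how many reads cover it and how many show a C. It assumes ungapped alignments given as (offset, sequence) pairs on the sense strand of the transcript; the function name and input format are hypothetical, and real pipelines work from alignment files and must also handle sequencing errors and single-nucleotide polymorphisms.

```python
def t_to_c_profile(reference, reads):
    """Count T-to-C conversions per reference position.

    reference: transcript-sense reference sequence (DNA alphabet).
    reads: iterable of (offset, sequence) ungapped alignments.
    Returns a list of (position, conversions, coverage) for positions
    with at least one T-to-C conversion.
    """
    coverage = [0] * len(reference)
    conversions = [0] * len(reference)
    for offset, seq in reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if pos >= len(reference):
                break
            if reference[pos] == "T":
                coverage[pos] += 1
                if base == "C":
                    conversions[pos] += 1
    return [
        (pos, conversions[pos], coverage[pos])
        for pos in range(len(reference))
        if conversions[pos] > 0
    ]


if __name__ == "__main__":
    ref = "GATTACATTG"
    toy_reads = [(0, "GATCACA"), (2, "TCACATTG"), (2, "TTACACTG")]
    for pos, conv, cov in t_to_c_profile(ref, toy_reads):
        print(f"position {pos}: {conv}/{cov} reads show T-to-C")
```

Positions where a large fraction of covering reads carry the transition are candidates for cross-link sites; positions with isolated conversions are more likely to reflect background.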

CLIP of RNA hybrids

Some RBPs, including Staufen proteins, or the Argonaute proteins at the heart of RNA silencing pathways, bind RNA at double-stranded sequence elements. Standard CLIP assays will only reveal one of the bound strands, thus losing information on the nature of the RNA–RNA interaction. All major CLIP variants have been adapted to include an additional step of intermolecular ligation after the limited RNase digestion, which maintains the proximity of the two RNA fragments bound to the RBP and allows the reconstruction of RNA–RNA hybrids interacting with the RBP of interest. Argonaute HITS-CLIP40, cross-linking and sequencing of hybrids (CLASH)41 and modified PAR-CLIP42 have been used to sequence miRNA–target chimeras, and RNA hybrid and iCLIP (hiCLIP)43 revealed a prevalence of long-range intramolecular RNA duplexes bound by human STAU1 protein. These are complementary to the many additional methods that profile RNA structures on a transcriptomic scale by chemical-based approaches or by mapping RNA–RNA contacts12. CLIP has recently been integrated with one such chemical-based approach, selective 2′-hydroxyl acylation analysed by primer extension (SHAPE), to reveal the hydrogen bonds at RNA–protein interfaces44.

Proximity-labelling based isolation of compartment-specific RNAs

Proximity-CLIP11 and the related technique APEX-seq45,46,47 allow the determination of RNA distribution to specific subcellular locations. Both techniques rely on the biotinylation of RNAs (exploited in APEX-seq) and proteins (exploited in Proximity-CLIP) by the engineered ascorbate peroxidase protein APEX2 (ref.48), a tool widely used to quantify the localized proteome49 (Supplementary Table 1). To allow subcellular compartment-specific biotinylation of RNA and proteins, APEX2 is typically fused to specific localization elements50. In the case of Proximity-CLIP, prior to protein biotinylation, nascent transcripts are labelled with either 4SU or 6SG and cross-linked to interacting RBPs with UV light of 312–365 nm (Fig. 2a). The compartment-specific proteome, including cross-linked RNPs, is then isolated on streptavidin beads and cross-linked RNA fragments are isolated and sequenced following mild RNase digestion. The characteristic mutations in the cDNA resulting from the use of photoreactive nucleosides reveal cross-linked sequences. A distinctive feature of Proximity-CLIP is that the sequencing of RBP-protected footprints allows for both the profiling of localized RNAs and the identification of protein-occupied, possibly regulatory, cis-acting elements on RNA. In contrast to APEX-seq, this approach provides a snapshot of regulatory elements on RNA that are occupied in the examined compartments.

Numerous other recently developed techniques are capable of performing compartment-specific labelling and analysis of RNA and/or proteins. Some approaches use genetically encoded photosensitizers localized to specific compartments, which mediate the oxidation of proximal guanosines by generating reactive oxygen species after irradiation with visible light51,52,53. Photosensitized guanosines can then be coupled with reactive amino group-containing probes to isolate and quantify localized RNA.

Targets of RNA-binding proteins identified by editing

Enzymatic tagging approaches can allow for transcriptome-wide identification of endogenous RBP interaction sites without requiring cross-linking, biochemical immunoprecipitation or specialized cDNA library preparation steps. An example is targets of RBPs identified by editing (TRIBE)10, which is conceptually related to DNA adenine methyltransferase identification (DamID), a method that identifies the regions bound by chromatin proteins by fusing these proteins to the Dam methyltransferase and identifying the resulting methylation sites54. TRIBE relies on transgenic expression of the RBP of interest fused to the catalytic domain of double-stranded RNA-specific adenosine deaminase (ADARcd), which catalyses adenosine to inosine conversions near the RBP interaction sites, or to its hyperactive mutant (HyperTRIBE)55. These sites are revealed by excess A to G mutations in libraries that are prepared as standard RNA sequencing (RNA-seq) libraries (Fig. 2b). Among the distinct advantages of TRIBE over CLIP approaches are its minimal number of manipulation steps, which allows for the use of small numbers of cells, and the possibility of expressing the RBP–ADARcd fusion protein in a cell type-specific manner to reveal RBP interactomes in precisely defined subpopulations of cells in model organisms. A disadvantage is that very deep sequencing is necessary to capture sufficient editing signal (A to G mutations) to call interaction sites. Further, carboxy-terminal or amino-terminal fusions of ADARcd may compromise the localization and activity of some RBPs, and their ectopic expression in vivo requires optimization to ensure proper cell type-specific expression patterns and to avoid excessive levels of RBP–ADARcd fusion protein, which can obscure target sites and lead to toxicity caused by hypermodification of RNA. Recently, an approach termed surveying targets by APOBEC-mediated profiling (STAMP) has been developed in which RBPs are tagged with APOBEC enzymes56. These enzymes access cytosine bases in single-stranded RNA and produce clusters of edits, giving increased coverage of mutations compared with TRIBE, which relies on ADAR-mediated editing of the relatively infrequent RNA duplexes containing a bulged mismatch10. This higher likelihood of encountering APOBEC1 cytosine substrates increases the sensitivity of STAMP and enables it to be coupled with single-cell capture.
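The dependence of editing-based site calling on sequencing depth can be illustrated with a simple binomial test, sketched below in Python. This is not the published TRIBE or STAMP analysis; the per-base error rate, the significance threshold, the control filter and the input tuple layout are all assumptions chosen for illustration.

```python
from math import comb


def binomial_sf(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def call_edit_sites(sites, error_rate=0.01, alpha=1e-3, max_control_fraction=0.01):
    """Call adenosines with more A-to-G reads than expected from errors alone.

    sites: iterable of (name, g_fusion, n_fusion, g_control, n_control), where
    g_* is the number of reads showing G and n_* the total coverage at a
    reference A in the fusion-expressing and control samples.
    All thresholds are illustrative assumptions.
    """
    called = []
    for name, g_fusion, n_fusion, g_control, n_control in sites:
        if n_fusion == 0 or n_control == 0:
            continue  # no coverage, nothing can be concluded
        if g_control / n_control > max_control_fraction:
            continue  # likely a polymorphism or endogenous editing
        if binomial_sf(g_fusion, n_fusion, error_rate) < alpha:
            called.append(name)
    return called


if __name__ == "__main__":
    toy_sites = [
        ("siteA", 8, 100, 0, 120),   # 8% edited at high coverage
        ("siteB", 1, 10, 0, 15),     # 10% edited at low coverage
        ("siteC", 30, 100, 40, 100), # also G-rich in the control
    ]
    print(call_edit_sites(toy_sites))
```

In this toy input, siteA and siteB have similar editing fractions, but only siteA has enough coverage for the excess of A to G reads to be distinguishable from the assumed error rate.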

RNA-centric methods

To unravel the composition of full RNPs assembling on a specific RNA, RNA-centric methods are needed to complement protein-centric approaches57. Such methods generally use either RNA affinity capture purification or proximity-based protein labelling.

RNA affinity proteome capture

RNA affinity proteome capture methods are mainly in vitro approaches based on either tagging the endogenous RNA or modifying in vitro-transcribed or synthesized RNA at the 3′, 5′ or both ends with biotin or similar small molecules58 and immobilizing them on solid surfaces such as streptavidin beads (Table 1). Cellular extracts are then added to the immobilized RNA, the beads are washed and proteins bound to the labelled probes are eluted by boiling the beads in SDS elution buffer.

Table 1 RNA affinity capture-based RNA-centric methods

An alternative affinity capture approach is to tag an RNA of interest with aptamers based on virus-derived heterogeneous RNA stem loops, such as MS2 (ref.59), PP7 (ref.60), S1 (ref.61), Csy4 (ref.62) and D8 (ref.63), or aptamers that bind tobramycin64 or streptomycin65 (Table 1). When choosing the aptamer, one has to consider the binding affinity of the tag for its cognate ligand, keeping in mind that for highly enriched RNPs, a low-affinity aptamer–ligand interaction can be sufficient to pull down the interactors and will give less background with more specific elution. Lysates from cells expressing the tagged RNA of interest are passed over beads carrying the respective ligands. The beads are stringently washed, which can include applying a competitive binder, and the proteins are eluted for mass spectrometry analysis.

Post-lysis reorganization of RNPs66 may result in the detection of false-positive associations of RBPs with specific RNA baits. To avoid this, several approaches cross-link RNPs in cultured cells by UV with or without photoreactive nucleosides or chemically with formaldehyde prior to cell lysis (Table 1). For example, capture hybridization analysis of RNA targets (CHART) allows the mapping of interaction sites and proteins bound to the Drosophila RNA roX2 (ref.67) and RNA antisense purification (RAP) has been used to identify the interactome of the non-coding RNAs Xist68 and NORAD69. Comprehensive identification of RBPs by mass spectrometry70 (ChIRP-MS) also systematically identified Xist-interacting proteins in mice and in vivo interactions by pull-down of RNA (vIPR) studied proteins interacting with Caenorhabditis elegans gld-1 RNA71. During the recent COVID-19 public health emergency, RAP and ChIRP-MS were rapidly applied to identify host and viral RBPs interacting with the SARS-CoV-2 RNA genome72,73.

RNA-directed proximity-based proteome labelling

RNA-directed proximity-based methods investigate the protein binding partners of a specific RNA in its native cellular context without the need for cross-linking, which is particularly useful for uncovering transient interactions and for studying RNPs from poorly soluble cellular compartments that are prone to precipitate during affinity capture methods, such as chromatin, peroxisomes or the Golgi body. In these methods, a labelling enzyme is recruited to a specific RNA to covalently modify the proteins located in the vicinity of the RNA (Table 2). The enzyme can be recruited to specific RNAs by expressing an aptamer on the RNA and a corresponding loop-binding protein tag on the labelling enzyme. RNA–protein interaction detection (RaPID) approaches use a plasmid expressing the RNA of interest flanked by BoxB stem loops and BASU — a mutant version of BirA*, engineered from Bacillus subtilis — fused to a BoxB stem loop-binding λN peptide13. The RNA of interest can also be tagged endogenously in approaches such as RNA-BioID74. Alternatively, a modified CRISPR–Cas system can be used to recruit an enzyme to an endogenous RNA by tagging the enzyme with an RNA-guided Cas variant and using guide RNAs that are antisense to the RNA of interest75. The excess pool of enzymes not docked to the tagged RNA can produce noise, but this can be reduced by using split proximity-based, RNA-assisted tools such as split APEX2, where two inactive APEX2 subunits are reconstituted to restore peroxidase activity upon physical co-localization76.

Table 2 Proximity-based RNA-centric methods in live cells

Results

Sources of background in CLIP

CLIP reads originate from a large number of RNAs, even when the RBP of interest is predicted to have few functional RNA partners. This could be because most reads reflect short-lived RBP–RNA interactions, whereas the RBP tends to have a high total residence time on its functional RNA partners. Thus, binding regions that accumulate a high number of CLIP reads, either narrow or broad, are thought to be functionally relevant77, whereas the regions with few reads are viewed as ‘intrinsic’ background, reflecting transient interactions. There is no absolute distinction between stable and transient interactions, and the functionality of these modes of interaction differs between RBPs (Fig. 3a). For example, CLIP of the P granule protein MEG-3 in C. elegans showed that its function depends on interactions across the full transcripts that are not sequence-specific78. Thus, thought needs to be given to what may constitute an intrinsic background for different RBPs.

Fig. 3: Sources of variability in CLIP sample preparation.

Sources of variability in cross-linking and immunoprecipitation (CLIP) experiments. a | Sources of intrinsic background. RNA-binding protein (RBP)–RNA interactions are dynamic, and therefore the probability of an RBP cross-linking to a cognate RNA site at the time of the experiment is affected by multiple factors: synergistic or antagonistic interactions between RBPs on the same RNA region, the residence time of the RBP on the RNA (low on low-affinity sites and high on high-affinity sites) and the availability of the RBP and the cognate site, which is influenced by time-dependent stochastic fluctuations in expression and localization. b | After cross-linking, cells are lysed and the RNAs fragmented. Fragmentation is mediated by RNases, most of which have some sequence bias. This, along with the duration of treatment, leads to fragments of variable length. An RBP-specific antibody is used to immunoprecipitate the protein along with cross-linked RNA fragments. Cross-reactivity or lack of antibody binding can lead to false or undetected sites (grey box). c | The cross-link constitutes a roadblock for reverse transcription, leading stochastically to different types of fragment: those that are accurately transcribed across the cross-link sites, those where reverse transcription stops at the cross-link site and those where mutations, deletions or insertions are introduced at the site of cross-link. Individual-nucleotide resolution CLIP (iCLIP) variants aim to capture the fragments that truncate at the cross-link position, whereas photoactivatable ribonucleoside-enhanced CLIP (PAR-CLIP) aims to capture fragments where read-through occurs. d | During PCR, individual fragments are amplified with variable efficiency. To recognize reads that resulted from the amplification of the same initial fragment, unique molecular identifiers (UMIs) are attached before the amplification step and identical reads (those with identical UMIs) are collapsed to a single read, representing the unique initial fragment. Adapters and UMIs are removed before read counts associated with individual genomic positions are tabulated.

Limited selectivity of the antibodies used to immunoprecipitate RBPs can lead to contamination of the sample with additional RBPs and their bound RNAs, and abundant RNAs may also be carried through sample preparation (Fig. 3b). The quality control and purification of the RBP–RNA complexes of interest on the SDS-PAGE gel are important in analysing and mitigating these two sources of ‘extrinsic’ background, and the way this step is implemented can vary between CLIP protocols (Box 2). It is advisable that control samples are prepared in parallel using IgG-bound or antibody-bound beads and RBP-knockout material, barcoded, pooled and sequenced, to compare with the experimental samples and assess their data specificity.

Quantification of CLIP reads can be complicated by the presence of PCR duplicates resulting from non-uniform amplification of different sequences. Aside from careful optimization of PCR cycle numbers79, the use of unique molecular identifiers (UMIs) for cDNAs produced by most current CLIP variants can mitigate introduction of these artefacts14 (Fig. 3d). UMIs are highly diverse barcodes composed of randomly incorporated nucleotides that are added to the RNA or cDNA fragments using adapters or reverse transcription primers before PCR amplification. As it is highly unlikely that the experiment produces two identical fragments that also ligate to two identical UMIs, the presence of multiple copies of a read with the same UMI will indicate PCR duplicates, which can be computationally collapsed to a single read. Computational tools, such as iCount8, expectation–maximization-based algorithms80 or UMI-tools81, take advantage of the presence of UMIs to quantify the number of unique cDNAs in the library even in the presence of sequencing errors.
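The collapsing step can be sketched in a few lines of Python. This minimal example treats reads as duplicates only when their mapping position, strand and UMI match exactly; unlike the dedicated tools cited above, it makes no attempt to tolerate sequencing errors within the UMI. The function name and the input tuple layout are hypothetical.

```python
from collections import defaultdict


def collapse_pcr_duplicates(reads):
    """Count unique cDNAs per mapping position using exact UMI matching.

    reads: iterable of (chrom, start, strand, umi) tuples, one per read.
    Returns a dict mapping (chrom, start, strand) to the number of
    distinct UMIs observed there, i.e. the estimated number of unique cDNAs.
    """
    umis_per_site = defaultdict(set)
    for chrom, start, strand, umi in reads:
        umis_per_site[(chrom, start, strand)].add(umi)
    return {site: len(umis) for site, umis in umis_per_site.items()}


if __name__ == "__main__":
    toy_reads = [
        ("chr1", 100, "+", "ACGTA"),
        ("chr1", 100, "+", "ACGTA"),  # PCR duplicate of the first read
        ("chr1", 100, "+", "TTGCA"),  # same position, independent cDNA
        ("chr2", 500, "-", "ACGTA"),  # same UMI, different position
    ]
    for site, n in sorted(collapse_pcr_duplicates(toy_reads).items()):
        print(site, n)
```

Keying on the mapping position as well as the UMI matters because the UMI space is finite: the same random barcode can legitimately occur on unrelated fragments elsewhere in the transcriptome.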

CLIP analysis workflow

Peak identification

All CLIP variants aim to capture individual binding sites of RBPs with nucleotide-level resolution; however, the exact experimental approach determines the relationship of the reads to the cross-linked nucleotides on the RNAs and, consequently, the computational analysis that is necessary for revealing the binding sites. Workflows for CLIP data analysis generally cover the following main steps: preprocessing of CLIP reads; alignment of reads to the corresponding genome; peak identification; combined analysis of replicates to identify reproducible peaks; and meta-analysis to identify binding motifs, relationships between binding sites, their positioning relative to transcript landmarks and the functional consequences of binding. We provide a summary of recently introduced or updated tools for binding site identification and peak detection in Table 3. Software for finding motifs and predicting RBP binding sites and peak finding tools only applicable to specific sets of targets can be found in recent reviews9,82.

Table 3 Available peak detection software

Peak identification is an important step that serves to identify regions of the RNA to which the RBP directly binds with high occupancy, thereby representing likely functionally relevant interactions (Fig. 4a,b). The primary goal of peak-calling is to identify RNA regions where the number of cross-link diagnostic features is significantly higher than expected based on background models. These features can be the number of reads mapping to these regions, as well as cross-linking-induced substitutions, insertions/deletions or truncations, depending on the experiment. cDNA mutation and/or truncation occur when the reverse transcriptase reads past the cross-linked nucleotides or truncates at them and are identified once the reads are aligned to the genome. Sites of high RBP occupancy on the RNA are revealed by their high density of reads or cross-linking-induced features relative to neighbouring regions of the same type (introns, coding sequence, 3′ untranslated region) that have similar expression within each gene (Fig. 4a,b). It is important to be aware that a gain in specificity through increased stringency of peak calling can lead to a drop in sensitivity, as discussed later.

Fig. 4: Peak calling.

Extraction of peaks from cross-linking and immunoprecipitation (CLIP) data. a | Following adapter and duplicate removal, inserts are mapped to the genome or transcriptome. As an example, tracks show density profiles in the region of the tubulin (TUBB) gene corresponding to samples obtained from K562 cells in the ENCODE project using enhanced CLIP (eCLIP) for Pumilio homologue 2 (PUM2 CLIP) and a size-matched input control (PUM2 SMI) and RNA sequencing (RNA-seq). Numbers 0–77, 0–20 and 0–33 correspond to the maximum read coverage in this region. Colours in the bands at the bottom indicate different nucleotides. The TGTA.ATA motif has been shown to be recognized by PUM2 and, thus, indicates the location of the true binding site. Various approaches are used to distinguish peaks of high RNA-binding protein (RBP) occupancy from background. Background models are constructed from regions neighbouring the putative peaks in the CLIP sample itself or from the same region as the peak in the SMI or RNA-seq samples (indicated by coloured brackets). b | Peaks defined as contiguous regions where the number of reads is significantly higher than expected based on the background models. Coloured dashed lines show the average coverage in different types of background regions/samples (indicated by brackets of matching colours in panel a). Some tools consider the number and type of cross-link diagnostic mutations. Grey shading on the peak indicates the region where most variation in peak shape is expected, depending on the CLIP variant. In individual-nucleotide resolution CLIP (iCLIP), mostly truncated cDNAs are sequenced, leading to an abrupt increase in read coverage starting right after the position of the cross-link.

Assessing background

Peak calling serves to computationally remove the intrinsic background generated by transient interactions. However, when the protein binds broadly along RNAs, without clear peaks of diagnostic features, estimates of the abundance of RNAs encountered by the RBP can improve the detection of these targets. The extrinsic background needs to be assessed experimentally during the quality control step of the size-separated protein–RNA complexes and possibly by obtaining additional data that identify the likely contaminating RNA fragments. In chromatin immunoprecipitation followed by sequencing (ChIP–seq), immunoprecipitation with beads lacking antibody is used to generate a background sample for peak calling. In CLIP experiments, however, it is more challenging to generate experimental background samples. When performing CLIP with beads lacking antibody, the signal on SDS-PAGE is negligible, yielding 100-fold fewer reads if sequenced, which is insufficient for extrinsic background modelling8. Instead, one can use RNA-seq to identify regions where a large number of CLIP reads are a result of high RNA abundance rather than high occupancy by the RBP (Fig. 4a). Outliers are identified with respect to a negative binomial distribution whose parameters are determined from the background sample. This distribution captures the fact that the variance in coverage is generally larger than the mean, contrary to what would be expected from sampling reads with constant probability along a genomic region9. A related approach to assess background experimentally has been taken in eCLIP, where a size-matched input (SMI) is generated by performing all steps of the protocol apart from immunoprecipitation35 (Fig. 4a). The importance of background samples was illustrated in eCLIP by the example of the stem loop-binding protein, where only 1.2% of the peaks identified from the foreground sample were enriched over the background SMI35.
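To make the role of the negative binomial distribution concrete, the Python sketch below fits its two parameters to background counts by the method of moments and computes the probability of observing at least a given number of reads in a candidate region. The method-of-moments fit is an assumption made for brevity; published peak callers use their own estimation procedures, and the function names here are hypothetical.

```python
from math import exp, lgamma, log
from statistics import mean, variance


def fit_negative_binomial(counts):
    """Method-of-moments fit; returns (r, p) with mean r * (1 - p) / p.

    Requires the sample variance to exceed the mean (overdispersion),
    which is the situation the negative binomial is meant to capture.
    """
    m = mean(counts)
    v = variance(counts)
    if v <= m:
        raise ValueError("variance does not exceed mean; no overdispersion to model")
    r = m * m / (v - m)
    p = m / v
    return r, p


def nb_pmf(k, r, p):
    """P(X = k) for a negative binomial with size r and success probability p."""
    return exp(lgamma(k + r) - lgamma(k + 1) - lgamma(r) + r * log(p) + k * log(1 - p))


def nb_sf(k, r, p):
    """P(X >= k), computed as one minus the cumulative mass below k."""
    return max(0.0, 1.0 - sum(nb_pmf(i, r, p) for i in range(k)))


if __name__ == "__main__":
    # Hypothetical read counts in background windows (for example from RNA-seq or SMI).
    background = [0, 1, 0, 2, 5, 0, 1, 3, 0, 8]
    r, p = fit_negative_binomial(background)
    for observed in (5, 10, 30):
        print(observed, nb_sf(observed, r, p))
```

Because the fitted variance is larger than the mean, the tail probabilities are more conservative than those of a Poisson model with the same mean, so moderately elevated counts are less likely to be called as peaks.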

Although approaches to remove background are expected to increase the proportion of functionally relevant binding sites among the called peaks, they can introduce new biases. The SMI sample in eCLIP is often dominated by RNAs cross-linked to abundant RBPs that may not be the same RBPs that contaminate experimental samples, owing to their interactions with the RBP of interest. Conversely, the SMI could be dominated by the RBP of interest itself, resulting in the foreground signal becoming erroneously assigned to the background, precluding the identification of relevant binding sites. RNA-seq may introduce bias depending on whether poly(A) selection or ribosomal RNA depletion was used, each of which yields somewhat different estimates of gene and transcript expression. Poly(A) selection enriches for fully processed RNAs, thereby depleting introns. Ribosomal RNA depletion requires enough sequencing depth to assess individual introns, as even within a gene the abundance of different introns can vary depending on the time taken for transcription, splicing and degradation of each intron. Moreover, the delay between transcription and co-transcriptional splicing leads to increased coverage towards the 5′ end of long introns83, which is common in genes expressed in the brain83,84,85. Such issues suggest that it will be important to obtain data that can accurately estimate the abundance of intronic regions in order to optimally detect enriched intronic CLIP peaks. Finally, most RBPs are localized to specific cellular compartments, where the abundance of RNAs may be quite different from the average abundance of the whole cell. Thus, it will be valuable to develop models based on the local abundance of RNAs that each RBP encounters, estimated based on RNA-seq from cellular subfractions, APEX-seq and/or Proximity-CLIP.

Characterizing RBP binding motifs

Once binding peaks have been identified, the immediate aim is to uncover the sequence and/or structure specificity of the protein. Traditionally, position-specific weight matrices (PWMs) have been used to represent the sequence specificity of nucleic acid-binding proteins, whether transcription factors or RBPs (Fig. 5). PWMs indicate the relative frequency with which individual nucleotides are observed among the binding sites of an RBP, which, in turn, can be related to the contribution of individual nucleotides in the binding site to the energy of interaction with the RBP and thus the affinity of this interaction. PWMs can be inferred from sequences obtained in CLIP experiments with readily available computational tools86,87,88. A key assumption of PWMs is that nucleotides in the binding site contribute independently to the energy of RBP–RNA interactions. This assumption started to be questioned as high-throughput binding data — for example, from protein microarrays — became available. It has been argued that parameter-rich models derived, for example, through machine learning approaches are necessary to quantify the affinity of protein–nucleic acid interactions89,90,91. However, other studies explicitly modelling confounding experimental factors concluded that PWMs are sufficient to quantitatively explain the binding data for the majority of transcription factors92.
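As a minimal sketch of how a PWM encodes the independence assumption, the following example builds per-position nucleotide frequencies from aligned sites, scores a sequence as a sum of per-position log-odds terms and computes the information content per position. The sites are toy sequences loosely modelled on the PUM2 recognition element written in RNA letters, and the pseudocount and uniform background are assumptions rather than settings taken from any specific tool.

```python
import math
from collections import Counter

ALPHABET = "ACGU"

def position_frequencies(sites, pseudocount=0.5):
    """Per-position nucleotide frequencies from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix

def log_odds_score(seq, matrix, background=0.25):
    """Sum of independent per-position log2 odds; this sum is the PWM independence assumption."""
    return sum(math.log2(matrix[i][b] / background) for i, b in enumerate(seq))

def information_content(matrix):
    """Bits per position relative to a uniform background (maximum 2 bits)."""
    return [2.0 + sum(f * math.log2(f) for f in col.values()) for col in matrix]

# Toy aligned sites (hypothetical, for illustration only).
toy_sites = ["UGUAAAUA", "UGUACAUA", "UGUAUAUA", "UGUAGAUA"]
pwm = position_frequencies(toy_sites)
ic = information_content(pwm)
consensus_score = log_odds_score("UGUAAAUA", pwm)
unrelated_score = log_odds_score("ACGUACGU", pwm)
```

Because the score is a plain sum over positions, any dependence between neighbouring nucleotides is invisible to this model, which is the limitation that parameter-rich alternatives aim to address.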

Fig. 5: Downstream analysis of CLIP peaks.

Typically, peaks that are reproducibly identified in replicate experiments are extracted for further analyses. Here, the agreement between the peaks obtained in two replicates of enhanced CLIP (eCLIP) for Pumilio homologue 2 (PUM2 CLIP) is shown as a function of the number of top peaks selected from each replicate. Peaks are sorted by score, the top x peaks (x indicated by the x axis) are extracted, and the proportion of overlapping peaks is shown on the y axis. Two peaks are considered overlapping if they share at least one nucleotide. Reproducible peaks can be annotated with their location in different genomic regions, the types of RNA in which they occur or the region of protein-coding RNAs (5′ untranslated region (UTR), coding sequence (CDS), 3′ UTR) in which they reside. The sequences of the most enriched peaks (here, the top 1,000 from each of the two samples) are used to search for enriched motifs that point to the sequence preference of the RNA-binding protein (RBP). In this case, the motif identified from the top peaks is the recognition element of PUM2. Information content on the y axis summarizes the strength of the preference for a specific nucleotide at a given position in the binding site. CLIP, cross-linking and immunoprecipitation; lncRNA, long non-coding RNA; miRNA, microRNA; NMD, nonsense-mediated decay.
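The replicate agreement curve described in this legend can be approximated with a short sketch. It assumes that peaks are given as scored half-open intervals and that the proportion is computed relative to the first replicate; the exact convention used to produce the figure is not stated here, so both choices are assumptions, as are the function names and toy peaks.

```python
def overlaps(a, b):
    """Peaks are (chrom, strand, start, end) with half-open coordinates."""
    return a[0] == b[0] and a[1] == b[1] and a[2] < b[3] and b[2] < a[3]

def top_overlap_fraction(rep1, rep2, x):
    """Fraction of the top-x peaks of rep1 that share at least one nucleotide
    with any of the top-x peaks of rep2. Inputs are lists of (score, peak)."""
    top1 = [p for _, p in sorted(rep1, key=lambda t: t[0], reverse=True)[:x]]
    top2 = [p for _, p in sorted(rep2, key=lambda t: t[0], reverse=True)[:x]]
    if not top1:
        return 0.0
    hits = sum(any(overlaps(p, q) for q in top2) for p in top1)
    return hits / len(top1)

# Toy replicates (hypothetical coordinates and scores).
rep1 = [(10, ("chr1", "+", 100, 120)), (8, ("chr1", "+", 300, 320)), (2, ("chr2", "-", 50, 60))]
rep2 = [(9, ("chr1", "+", 110, 130)), (7, ("chr2", "-", 59, 70)), (1, ("chr1", "+", 310, 315))]
curve = [top_overlap_fraction(rep1, rep2, x) for x in (1, 2, 3)]
```

The pairwise comparison is quadratic in x and is meant only to make the definition explicit; for tens of thousands of peaks an interval index or a sorted sweep would be more appropriate.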

In the case of RBPs, PWMs are also used to explain both CLIP data and in vitro measured affinities of interaction with RNAs93,94. However, RNA–RBP interactions are likely more complex than the interactions of transcription factors with DNA. The accessibility of binding sites, which is modulated through RNA secondary structure that in turn depends on RNA modifications95, plays an important role in RBP–RNA interactions. A detailed analysis of Gld-1 binding in C. elegans found that a biophysical model including the PWM-defined specificity of the Gld-1 RBP and the predicted structural accessibility of binding sites in RNAs was able to explain the relative enrichment of binding sites in CLIP, alleviating the need for a more parameter-rich model96. Examination of the secondary structure around CLIP binding sites demonstrated that the recognition of RBP binding motifs by RBPs often requires a specific structural context97,98 and led to models that simultaneously infer the sequence–structure preference of RBPs99,100,101 and allow the identification of sites that were missed in CLIP experiments owing to, for example, low RNA expression levels99. Similarly, machine learning approaches have increased the depth of miRNA binding site identification from Argonaute-CLIP data102. Biophysical approaches for the ab initio prediction of molecular interactions can pinpoint potential false negatives in CLIP experiments and provide insights into the interaction propensities that, ultimately, determine the location of binding sites in RNAs103. Conversely, CLIP data typically provide large data sets that can be used to infer biophysical models of RNA–RNA interactions in the context of RNP complexes, such as the ternary miRNA–mRNA–Argonaute protein complex104. These inferred models can predict interaction affinities measured in vitro with surprising accuracy105.

Many tools take into account cross-linking-induced mutations to call RBP binding sites and determine the sequence and structure specificity of the RBP28,100,106,107. Annotation of the putative location of binding sites with respect to various landmarks such as splice sites, the functional category of the gene as well as binding data for RBPs other than the RBP of interest can be further incorporated to improve the accuracy of binding site identification108,109. A drawback is that enforcing specific constraints without a mechanistic basis may lead to overlooking unusual binding sites. Furthermore, it is not always clear that the increase in accuracy justifies the potential for overfitting and reduced interpretability that comes with an increased number of parameters.

Regulatory grammar

The final step in deciphering CLIP data is uncovering the regulatory grammar of the RBP binding sites, including the spatial relationship of RBP binding sites to important transcript categories — such as coding/non-coding transcripts, repeats, small nucleolar RNAs and rRNAs — and landmarks such as exons, introns, exon/intron boundaries and translation start/stop sites110. Binding site data can be combined with data from knock-down and overexpression experiments to generate RNA maps reflecting the functional impact of binding sites located in different transcript regions111. Computational modelling of changes in the expression of transcript isoforms upon perturbation of individual RBPs provides complementary information regarding the RBP binding motifs that are involved, their location within transcripts and their functions in individual steps of RNA processing112. As the number of RBPs studied by CLIP continues to increase, direct comparisons of the binding site profiles in the genome are starting to reveal regulatory complexes and competition between RBPs. Both of these are reflected in multiple proteins binding to closely spaced sites in the RNA, whereas the data from perturbation experiments help resolve the nature of the interactions between RBPs110,113,114.
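The idea of an RNA map can be sketched as a positional profile of cross-link sites around a landmark, computed separately for groups of regulated and control events. The example below is a simplified illustration under stated assumptions: cross-links are single-nucleotide positions, landmarks are already labelled by group (for example, from a knock-down experiment), and the profile is the fraction of landmarks with a cross-link at each offset. The function name, group labels and coordinates are hypothetical.

```python
from collections import defaultdict

def rna_map(crosslinks, landmarks, window=50):
    """Positional cross-link profile around landmarks, per group.

    crosslinks: iterable of (chrom, strand, pos)
    landmarks:  iterable of (chrom, strand, pos, group)
    Returns {group: [fraction of landmarks with a cross-link at each offset]},
    with offsets running from -window (upstream) to +window (downstream)
    in transcript orientation.
    """
    sites = set(crosslinks)
    hits = defaultdict(lambda: [0] * (2 * window + 1))
    totals = defaultdict(int)
    for chrom, strand, anchor, group in landmarks:
        totals[group] += 1
        for offset in range(-window, window + 1):
            # On the minus strand, upstream corresponds to higher genomic coordinates.
            pos = anchor + offset if strand == "+" else anchor - offset
            if (chrom, strand, pos) in sites:
                hits[group][offset + window] += 1
    return {g: [h / totals[g] for h in hits[g]] for g in totals}

# Toy data: two regulated landmarks with a cross-link 5 nt upstream, one control without.
toy_crosslinks = [("chr1", "+", 95), ("chr1", "-", 205)]
toy_landmarks = [("chr1", "+", 100, "silenced"),
                 ("chr1", "-", 200, "silenced"),
                 ("chr1", "+", 500, "control")]
profiles = rna_map(toy_crosslinks, toy_landmarks, window=10)
```

Comparing the profile of regulated events with that of control events at each offset is what gives the map its functional interpretation; the statistical test used for that comparison is left out of this sketch.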

Assessing the specificity of CLIP

In contrast to RIP or ChIP-seq, CLIP has an in-built step for experimental control of specificity. Visualizing the size-separated protein–RNA complexes can allow estimation of the extrinsic background, which yields signals in negative control lanes or at unexpected sizes. From its initial publication, high standards were established for the specificity of CLIP, evident from the absence of a signal in the negative control and a >20-fold enrichment of binding motifs within Nova CLIP reads compared with the control5. Fusion of affinity tags to the studied RBP can further increase specificity by allowing even more stringent, denaturing purification conditions that maximize the removal of extrinsic background14. However, data specificity for the immunoprecipitation-based variants of CLIP can vary depending on the quality of the antibody and the degree of optimization; when studying a new RBP using CLIP, RNase fragmentation and immunoprecipitation conditions must be optimized for variations in RNase stocks, cross-linking efficiencies of RBPs, the stability of their interactions with other RBPs and the type of cells or tissue used15,115.

As optimizations are carried out to variable extents across laboratories employing CLIP, there is a need for computational assessment of CLIP data to facilitate integration of collected data sets. The first approach is to study the cross-link distribution across RNA types. Nuclear and cytoplasmic RBPs tend to have the most cross-links in introns and exons, respectively. In cases where the dominant RNA binding partners are known, these are expected to rank highly in the data. However, the most likely source of extrinsic background is RBPs that interact with the studied RBP, which often have similar localization patterns and RNA partners; therefore, analysis of RNA types offers only partial reassurance. The second approach is to compare the enrichment of sequence motifs in CLIP data with their affinities for the purified RBP as determined by biophysical methods. Systematic motif enrichment data are available from in vitro binding assays such as SELEX116,117, RNA Bind-n-seq118 and RNAcompete97. Often, in vivo-identified binding sites resemble the highest-affinity motifs derived from these methods. When they do not, the reason can be either the low specificity of the in vivo data or biases of the in vitro assays. For example, these assays often examine the binding of individual domains rather than full-length proteins, and the purified proteins lack post-translational modifications and the context of other proteins. They also tend to study binding to short RNA sequences, whereas in vivo RBPs can assemble on long RNAs with complicated secondary structures. To distinguish whether the RNA features that are unique to the in vivo data reflect the specificity of the RBP or represent technical artefacts, it will be informative to examine the reproducibility of these features across multiple data sets produced by various laboratories or by various protein-centric methods for the same RBPs.

For many RBPs there is no in vitro binding information available to provide expected binding motifs. However, binding motifs can be identified de novo from the CLIP data and the extent of their enrichment provides some measure of data quality. For example, a comparison of publicly available data for polypyrimidine tract binding protein 1 (PTBP1) revealed that whereas all CLIP variants show enrichment of similar motifs, the extent of enrichment varies dramatically between variants, indicating major differences in data specificity115. There are several caveats to de novo motif discovery using CLIP, as factors unrelated to the studied RBP may result in enrichment of specific sequence motifs. Such factors include the nucleotide preferences of UV cross-linking or the sequence biases of the RNases and RNA ligases used to join adapters to the ends of RNA fragments22,29,79,115. One way to minimize the impact of these biases is by producing parallel data sets for diverse RBPs from the same type of biological material and then deriving motifs unique for each RBP after correcting for the features that are in common for different RBPs7,28,85,119.
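A simple way to quantify motif enrichment of the kind discussed above is to compare k-mer frequencies in peak sequences with those in a background set. In the sketch below, the background could be, for example, peaks pooled from other RBPs profiled in the same biological material, so that shared cross-linking and ligation biases partly cancel; that choice, the pseudocount and the function names are assumptions, and the toy sequences are illustrative only.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count all overlapping k-mers and return (counts, total)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts, sum(counts.values())

def kmer_enrichment(foreground, background, k=4, pseudocount=1.0):
    """Rank k-mers seen in the foreground by their frequency ratio over the background."""
    fg, fg_total = kmer_counts(foreground, k)
    bg, bg_total = kmer_counts(background, k)
    n_kmers = 4 ** k  # size of the k-mer space, used to normalize the pseudocount
    ratios = {}
    for kmer in fg:
        f = (fg[kmer] + pseudocount) / (fg_total + pseudocount * n_kmers)
        b = (bg[kmer] + pseudocount) / (bg_total + pseudocount * n_kmers)
        ratios[kmer] = f / b
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Toy sequences (hypothetical): the foreground shares the 4-mer UGUA.
toy_foreground = ["AAUGUAAA", "CCUGUACC", "GGUGUAGG"]
toy_background = ["AAACCCGG", "GGGCCCAA", "ACGACGAC"]
ranked = kmer_enrichment(toy_foreground, toy_background, k=4)
```

The magnitude of the top ratios, rather than only the identity of the top motif, is the quantity that relates to data specificity in the comparison of CLIP variants described above.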

A recent approach to assess the validity of de novo motifs involves the analysis of sites overlapping heterozygous single-nucleotide polymorphisms. A difference in the number of CLIP cDNAs mapping to the two alleles indicates that the single-nucleotide polymorphism affects cross-linking efficiency28, and therefore likely influences the affinity of the RBP of interest to the site. However, allelic imbalance is equally expected at motifs bound by co-purified RBPs that represent extrinsic background, and can also result from the nucleotide preferences of cross-linking, and therefore should be interpreted with caution.
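Allelic imbalance at a heterozygous site can be tested with an exact binomial test on the cDNA counts assigned to each allele. The following sketch assumes an expected ratio of 0.5, which ignores mapping bias and allele-specific expression; those would need to be modelled in a real analysis, for instance by using allele counts from RNA-seq as the expectation. The function name is hypothetical.

```python
import math

def binomial_two_sided_p(ref_count, alt_count, expected=0.5):
    """Exact two-sided binomial test: sums the probabilities of all outcomes
    that are no more likely than the observed allele split."""
    n = ref_count + alt_count
    if n == 0:
        return 1.0

    def pmf(k):
        return math.comb(n, k) * expected ** k * (1.0 - expected) ** (n - k)

    observed = pmf(ref_count)
    # Small relative tolerance so that ties are not lost to rounding.
    total = sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1.0 + 1e-9))
    return min(1.0, total)

balanced = binomial_two_sided_p(10, 10)
skewed = binomial_two_sided_p(15, 5)
extreme = binomial_two_sided_p(20, 0)
```

A significant result only indicates that the two alleles differ in the number of recovered cDNAs; as noted above, it does not by itself distinguish an effect on the affinity of the studied RBP from effects on co-purified RBPs or on cross-linking.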

Enrichment of CLIP peaks around regulated elements, such as alternative exons, can also be assessed using RNA maps to understand the ‘functional specificity’ of data, which can yield comparative assessment for multiple data sets of a specific RBP111. Such analysis requires that orthogonal data that examine functionality are available, such as RNA-seq of knockout or knock-down cells or tissues93. Finally, experiments to support the functionality of specific binding sites can be designed by perturbing such sites, such as through mutations of cis-acting elements in minigene reporters or CRISPR-mediated mutations of the endogenous gene, or by blocking them with antisense oligonucleotides.

Assessing the sensitivity of CLIP

The sensitivity of CLIP refers to its capacity to comprehensively identify the relevant RNA sites bound by the studied RBP. Such sensitivity depends on the complexity of the resultant cDNA library, that is, the number of unique cDNAs produced. This has increased by orders of magnitude with the adaptation of high-throughput sequencing and the increased efficiency of cDNA library preparation steps14. However, the capacity to prepare high-complexity libraries depends on RBP characteristics, particularly abundance and UV cross-linking efficiency. In addition to the cDNA complexity, the sensitivity of CLIP also depends on specificity because increased extrinsic background will decrease the proportion of signal for the RBP of interest. For example, CLIP libraries for PTBP1 of similar complexities showed different numbers of identified binding peaks115 and different capacities to identify binding sites around regulated exons as evident with RNA maps. The choice of peak-calling method strongly affected the functional sensitivity of the same PTBP1 CLIP data9. These points highlight the need for combined analysis of data specificity and sensitivity when assessing the pros and cons of the experimental variants of CLIP and of the various computational approaches to data analysis.
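Library complexity, in the sense of the number of unique cDNAs, can be estimated by collapsing reads that share a mapping position and a UMI. The sketch below assumes that each mapped read has already been reduced to a tuple of chromosome, strand, start position and UMI, and it treats UMIs as error-free; tools used in practice often also merge UMIs that differ by a sequencing error. The function names and toy reads are hypothetical.

```python
def unique_cdna_count(reads):
    """Number of distinct (chrom, strand, start, umi) combinations."""
    return len(set(reads))

def duplication_rate(reads):
    """Fraction of reads that are PCR duplicates of another read."""
    reads = list(reads)
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)

# Toy reads: the first two are PCR duplicates; the third has the same start but a different UMI.
toy_reads = [
    ("chr1", "+", 1000, "ACGTA"),
    ("chr1", "+", 1000, "ACGTA"),
    ("chr1", "+", 1000, "TTGCA"),
    ("chr2", "-", 5000, "ACGTA"),
]
complexity = unique_cdna_count(toy_reads)
dup_rate = duplication_rate(toy_reads)
```

A duplication rate that approaches one indicates that further sequencing of the same library will recover few additional unique cDNAs.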

Applications

CLIP experiments have been carried out using various model organisms, including mammalian cell culture35, yeast32, mice6, flies120, worms16,121 and plants17,18 (Table 4). Below, we discuss applications of CLIP techniques in selected systems with distinctive considerations, advantages and disadvantages for various applications.

Table 4 CLIP applications in model organisms

Cell culture models

Cultured cells (transformed cell lines, primary cells and stem cells) are the most widely used experimental model for CLIP, with more than 2,500 different CLIP data sets deposited on the Gene Expression Omnibus at the time of writing. Only ~7% of RBPs are either expressed in a tissue-specific manner or show strong tissue-specific expression bias, mainly in the germline and, to a lesser extent, neuronal tissues122,123, whereas the rest tend to be expressed across most cell types124, making cultured cells appropriate for the majority of cases with the caveat that some RBP targets may be absent. Cultured cells are easily genetically tractable, allowing for epitope tagging of RBPs for stringent purification, introduction of transgenically expressed cell type-specific RBPs or introduction of a clinically or functionally important mutation that could be lethal in an animal model. Cell culture also allows for multiple RBPs to be studied in a comparative manner in the context of the same transcriptome. The same principles apply to single-cell organisms such as yeast, although its lower cross-linking efficiency makes it difficult to use in CLIP experiments32.

Although the use of cultured cells provides valuable insights into mechanisms of post-transcriptional regulation — even for ectopically expressed RBPs125 — certain key bound transcripts and interacting proteins may be expressed in a cell type-specific manner themselves. Further, the binding repertoire of RBPs regulating biological processes such as developmental transitions or circadian timekeeping may be best studied in an organismal context.

Model organisms

CLIP/HITS-CLIP5,6, iCLIP85, PAR-CLIP16,126 and eCLIP127 have all been successfully used in mouse, fly and worm models. These studies provided useful insights into the roles of RBPs in various aspects of mRNA biogenesis and regulation during neuronal development and function122, as well as specialized functions such as transposon silencing in human and mouse brain128 and the Piwi-interacting RNA (piRNA) pathway in mouse testes and fly embryos129,130,131. Animal models present unique challenges for the application of CLIP techniques. First, most tissues require mechanical dissociation of fresh or frozen tissue prior to UV cross-linking5,80. In the case of PAR-CLIP, modified nucleotides must be delivered to the cells of interest prior to cross-linking; this can be accomplished by injection or use of transgenic animals expressing uracil phosphoribosyltransferase in a cell type-specific manner to allow the conversion of thiouracil into thiouridine — a process known as TU tagging132. Second, lethal mutations can only be studied if introduced in a conditional manner. Last, if a specific antibody for immunoprecipitation of the RBP is not available, expression of an epitope-tagged version of the RBP in a transgenic animal is required. Nevertheless, by epitope tagging the RBP of interest in specialized cell types133, CLIP can be performed from a subset of cells, analogous to TRIBE10. This approach, employed by conditionally tagged CLIP (cTag-CLIP), revealed the interactome of Nova2, Pabpc1 and Fmrp in various cell types, including neuronal subsets of mouse brain134,135,136.

Plants

Investigating the RNP composition in higher plants is made difficult by several technical challenges. In contrast to mammalian cell cultures, plant cell cultures cannot be cultivated in monolayers and are of limited use for CLIP techniques; as a result, experiments have mostly been performed in transgenic Arabidopsis plants expressing epitope-tagged RBPs17,18. Although the presence of UV-absorbing pigments and secondary metabolites such as chlorophyll and flavonoids can reduce cross-linking efficiency, UVC-based cross-linking has been successfully applied to whole plants17,18. Another obstacle in plants is the rigid cell wall that requires mechanical force and harsh denaturing conditions for efficient cell lysis137. Moreover, the large amounts of endogenous RNases present in the plant vacuole require the use of RNase inhibitors to prevent extensive RNA degradation during extract preparation (also reported for pancreatic tissue). To ensure controlled RNA fragmentation, RNase treatment is performed after immunoprecipitation of the RNA–protein complexes rather than on the lysate18.

Genome-wide binding data from HITS-CLIP have been obtained in Arabidopsis for HLP1, a protein with similarity to mammalian HNRNPA/B17. In the hlp1-knockout mutant, a shift from proximal to distal polyadenylation sites was observed for more than 2,000 transcripts. As HLP1 binds to approximately 20% of these aberrantly polyadenylated transcripts close to the polyadenylation site in vivo, it has been implicated in regulating their alternative polyadenylation; aberrant polyadenylation of transcripts involved in flowering time control may explain the delayed transition to flowering in the hlp1 mutant17.

The first plant iCLIP study was performed for the heterogeneous nuclear RNP (hnRNP)-like Arabidopsis thaliana glycine-rich RNA-binding protein 7 (AtGRP7)18, which revealed that AtGRP7 binds to U/C-rich motifs mainly in the 3′ untranslated regions of its targets. Among AtGRP7 binding partners were transcripts that are only expressed in inner cell layers of the leaf, demonstrating that UV light penetrates deep into the tissue. Cross-referencing RNA-seq data of mutants and overexpression lines revealed that AtGRP7 predominantly downregulates its binding partners, dampening the peak expression of circadian clock-regulated transcripts in line with its role as a slave oscillator transducing timing information from the circadian clock to rhythmic transcripts within the cell138.

Many protein candidates for CLIP have emerged from proteomic studies identifying proteins that UV cross-link to polyadenylated RNAs in Arabidopsis tissues. To increase the efficiency of UV cross-linking, these studies were performed in etiolated (dark-grown) seedlings to avoid the presence of chlorophyll139, as well as in leaf protoplasts, cells without a cell wall140, cell suspension cultures and leaves of adult plants141,142. These studies identified more than 1,100 candidate RBPs; only a few RBPs were identified by all studies142,143, potentially owing to the different developmental stages and tissues investigated and the different protocols and levels of stringency used. As in non-plant species144, a recurrent theme of these studies was that many proteins without known RNA-binding domains or without a link to RNA biology were identified139,140,141,142. Among these were photosynthesis-related proteins and photoreceptors with no known role in RNA-based regulation; it is imperative to validate their RNA-binding activity by methods such as CLIP143.

Development and disease

RBPs play many important roles in development and disease1,124. The first applications of CLIP concerned brain-specific RBPs that regulate alternative splicing and are implicated in neurological diseases, such as Nova proteins122. The capacity of CLIP to define binding sites in low-abundance RNAs led to an unexpected finding that splicing regulators can have many thousands of high-affinity binding sites in introns5,6. Binding sites close to alternative exons coordinate splicing in a highly position-dependent manner that can be described by an RNA map6,111. Moreover, most binding sites are located far from annotated exons and these often repress splicing of cryptic exons such as those emerging from transposable elements145. CLIP of core spliceosomal components, such as PRPF8, can also be used to interrogate splicing mechanisms, such as the regulation of recursive splicing by the exon junction complex, which is particularly important for appropriate splicing in the brain146. Moreover, CLIP has been used to study a broad range of RBPs with roles in the regulation of RNA transport, stability and translation. For example, a HITS-CLIP study of Fragile X mental retardation protein (FMRP) revealed its binding to a subset of transcripts across their entire coding length, which was suggested to result from its dual interactions with the ribosome and the mRNA that could be important for its regulation of local translation at the synapse80.

CLIP can be performed on post-mortem human tissues to interrogate pathology-related changes in protein–RNA interactions. For example, a study of brain tissue from patients with pathological aggregates of TDP43, an RBP implicated in multiple neurodegenerative diseases, demonstrated increased binding to the non-coding RNA NEAT1 (ref.147). NEAT1 assembles multiple RBPs, including TDP43, into biomolecular condensates called paraspeckles148. TDP43 in turn regulates the 3′ end processing of Neat1 RNA, which leads to cross-regulation between NEAT1 and TDP43 that contributes to exit from pluripotency in mouse embryonic stem cells149. Such cross-regulation between RNAs and RBPs is likely a common phenomenon; it is becoming clear that RNAs can act as regulators of their bound RBPs, as was shown for the case of vault RNA-dependent regulation of proteins involved in autophagy150.

CLIP is increasingly used in pathogen research, including in studies concerning the RNA interaction profiles of bacterial RBPs151 and viral remodelling of the host and viral RNA–RNP interactome. For example, miRNAs encoded by Kaposi’s sarcoma-associated herpesvirus (KSHV) may function by competing with host miRNAs for AGO2 (ref.152), and a later study using CLASH additionally identified more than 1,400 cellular mRNAs that are targeted and might be regulated by KSHV miRNAs153. Moreover, a study of the HIV-1 Gag protein uncovered dramatic changes in its RNA-binding properties that occur during virion genesis and contribute to viral packaging154, a study of APOBEC3 proteins showed how their RNA binding ensures their effective encapsidation into HIV-1 as part of the host’s defence155 and a study of poly(C)-binding protein 2 (PCBP2) provided support for its roles in hepatitis C virus-infected cells156. These studies also provided computational solutions for parallel analysis of human and user-definable non-human transcriptomes. Most recently, CLIP has been used to identify human RNAs that are bound by the proteins encoded by the SARS-CoV-2 genome, such as non-structural proteins157 and nucleocapsid protein158, which helped to show how these RBPs alter gene expression pathways to suppress host defences. Conversely, CLIP of host RBPs was used to identify their binding to SARS-CoV-2 RNAs, which contributes to host defence strategies73. Much more work remains to be done with CLIP and complementary approaches to understand how cross-regulation between the RBPs and RNAs of pathogens and their hosts modulates pathogenicity.

Complementary insights

Several studies combined protein-centric and RNA-centric approaches to gain complementary insights into RNP assembly and function. One example is the study of NORAD long non-coding RNA (lncRNA), where RNA-antisense purification coupled with mass spectrometry (RAP-MS) was used to identify its interaction with hnRNP G and several other proteins, the RNA binding sites of which were then mapped with CLIP. This showed how NORAD assembles an RNP that links proteins involved in DNA replication or repair69. Another example is the study of Xist lncRNA, where its bound RBPs were first identified through RNA-centric methods68,70 and later studied by CLIP to show how Xist seeds a heteromeric RNP condensate that is required for heritable gene silencing159. Most recently, host RBPs bound to SARS-CoV-2 RNAs were first identified by RAP-MS, and then studied further with CLIP to map their direct interactions with the SARS-CoV-2 RNA in infected human cells73. These studies show that complementary data from these approaches present an opportunity to build computational models that position each RBP at its bound cis-acting RNA elements along an RNA and thus understand how protein–RNA and protein–protein interactions act combinatorially to drive the assembly and remodelling of RNPs on full RNAs.

A question that is particularly pertinent to the field of RNA localization is how RNPs form dynamic condensates, often referred to as ‘RNP granules’, which regulate RNA transport and local translation in response to signalling160. Understanding RNP assembly and dynamics in RNP granules is particularly challenging as they are mediated by direct protein–RNA and protein–protein interactions and involve both structural domains and intrinsically disordered regions (IDRs). IDRs often form weak multivalent contacts that coordinate condensation of proteins into the granule161. Important questions are how the cis-regulatory sequence and structural elements on the RNA mediate the assembly of the full RNP in order to coordinate its selective transport, and how post-translational modifications of the IDRs mediate RNP remodelling in response to specific signals1. Performing both CLIP and RNA-centric methods under dynamic states will be essential for resolving how specific RBPs are released, rebound or repositioned on RNAs in response to stimuli. Comparisons between localized mRNAs may reveal whether they share a subset of core RBPs, and how these RBPs mediate mRNA recruitment to transport machineries and the translational apparatus. Finally, studies of RNA–RNA interactions in addition to protein–RNA and protein–protein contacts will be needed to fully disentangle the principles of RNP assembly160.

Such understanding of RNP remodelling is of paramount importance as it underlies many aspects of cellular remodelling, including cellular polarity and movement, axon guidance, synaptic plasticity and memory formation. Moreover, deregulated RNP dynamics can lead to formation of aberrant condensates and aggregates in many neurological diseases, such as amyotrophic lateral sclerosis and fragile X syndrome162. Combining RNA-centric and protein-centric methods in models of these diseases will be essential to understand how changes in RNP assembly contribute to the disease processes by affecting the biogenesis, transport, translation and degradation of specific RNAs.

Finally, to fully understand RNP assembly, it is also important to define sites on RBPs that bind to RNAs, which can be done through a combination of UV cross-linking, high-resolution mass spectrometry and a dedicated computational workflow to identify both cross-linked peptides and RNA oligonucleotides — an approach that can be RNA-centric or applied to the whole RBPome30. Recently, several additional approaches have been developed for high-throughput mapping of cross-linked peptides or amino acids within RBPs1. With the ever-increasing capacity of these complementary methods to monitor specific functions of RBPs, integrative approaches are bound to become increasingly informative.

Reproducibility and data deposition

Reproducibility of CLIP data

It is necessary to understand the reproducibility of CLIP data before one can proceed to studies of biological variation through comparisons of data sets produced across conditions, cell types, species and RBPs. Data have been obtained by multiple CLIP variants for many RBPs, and in some cases also by complementary methods such as RIP and TRIBE, yet such data remain to be comprehensively compared and integrated163,164. These comparisons are challenging partly because the metadata available from existing raw sequence archives are rarely sufficient. The minimal reporting standards appropriate for full annotation of CLIP and related methods are still to be consolidated, but our recommendation would be that the following should be reported with standardized nomenclature in a table format: name of the purified protein following official nomenclature, information on tags or mutations in the protein if present, the species, information on the biological material (name of cells or tissue), the essential description of its conditions (for example, treatment, genetic modification), the name of the protocol variant, the essential description of experimental conditions that complement the protocol (such as cross-linking, RNase conditions, the molecular weight range used for excision of the protein–RNA complex) and annotation of the experimental barcode and UMI (their sequence and position).
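The recommended reporting items can be captured as a structured record and exported as a table. The sketch below is one possible layout, not an established standard: the field names are hypothetical, and the example row uses the PUM2 eCLIP sample in K562 cells mentioned earlier, with every detail not given in this text marked as not recorded.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class ClipSampleMetadata:
    protein: str                # official name of the purified protein
    protein_modifications: str  # tags or mutations, if present
    species: str
    biological_material: str    # name of cells or tissue
    conditions: str             # treatment, genetic modification
    protocol_variant: str
    crosslinking: str
    rnase_conditions: str
    excised_size_range: str     # molecular weight range of the excised complex
    barcode_sequence: str
    barcode_position: str
    umi_pattern: str
    umi_position: str

def to_tsv(records):
    """Serialize metadata records as a tab-separated table with a header row."""
    buffer = io.StringIO()
    names = [f.name for f in fields(ClipSampleMetadata)]
    writer = csv.DictWriter(buffer, fieldnames=names, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()

UNKNOWN = "not recorded"  # placeholder for details not stated in this text
example = ClipSampleMetadata(
    protein="PUM2", protein_modifications=UNKNOWN, species="Homo sapiens",
    biological_material="K562", conditions=UNKNOWN, protocol_variant="eCLIP",
    crosslinking=UNKNOWN, rnase_conditions=UNKNOWN, excised_size_range=UNKNOWN,
    barcode_sequence=UNKNOWN, barcode_position=UNKNOWN,
    umi_pattern=UNKNOWN, umi_position=UNKNOWN,
)
table = to_tsv([example])
```

Keeping each item in its own column, rather than in free text, is what makes later comparison and integration of data sets across laboratories feasible.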

For comparisons between data sets documenting the same RBPs to be informative, technical and biological sources of variation need to be distinguished. Technical variation can be caused by differences between protocol variants in specific steps, such as cross-linking conditions, the stringency of lysis and washing steps, the antibodies or affinity purification strategies used to purify the RBP, and cDNA library preparation. Moreover, even when the same CLIP variant is used, variation can arise from unintentional differences in implementation, such as in the density of cultured cells or RNase fragmentation conditions. Finally, even with optimal implementation, binding sites in lowly expressed RNAs are hard to reproduce owing to stochastic variation in low cDNA counts.
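The effect of low cDNA counts can be illustrated with a simple calculation. The sketch below assumes, purely for illustration, that the cDNA count at a site follows a Poisson distribution; real CLIP counts are often more variable than this, so the numbers are best read as an optimistic bound. Under this assumption, a site with a mean of 2 cDNAs has a coefficient of variation of about 71% and yields no cDNAs at all in roughly 14% of replicates.

```python
import math


def poisson_cv(mean_count):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(mean_count)


def prob_zero_counts(mean_count):
    """Probability that a Poisson-distributed site yields no cDNAs in a replicate."""
    return math.exp(-mean_count)


# Compare a lowly covered, a moderately covered and a highly covered site.
for mean_count in (2, 10, 100):
    print(
        mean_count,
        round(poisson_cv(mean_count), 2),
        round(prob_zero_counts(mean_count), 3),
    )
```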

As discussed earlier, the most valuable indicator of CLIP data specificity is its cross-validation using orthogonal information, such as the motif enrichment in CLIP peaks, or enrichment of peaks around regulated events, as shown by RNA maps. Although a necessary indicator of data quality, reproducibility across replicate CLIP experiments is less informative than cross-validation. This is because cross-contamination from a co-immunoprecipitated RBP can be reproducible, as can technical biases of cross-linking, nuclease digestion and ligation. These reproducible biases can distort the data, potentially boosting the significance of otherwise low-occupancy sites. Therefore, performing comparative benchmarking of multiple data sets of the same RBPs and reconstructing comprehensive and accurate sets of binding sites are essential. For instance, although the peak identification methods mentioned above can yield tens of thousands of peaks for some well-characterized RBPs, it is informative to assess peak reproducibility for replicate samples within a laboratory, across laboratories and across CLIP variants35. For samples that assess biological variation, comparisons can be made between samples obtained from different animals6. A concern remains that reproducible peaks are more likely to be located in relatively abundant RNAs. Peaks in low-abundance RNAs may be less reproducible, although this can be partly compensated by predictive computational models99.
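One simple way to quantify peak reproducibility between two replicates is the fraction of peaks in one replicate that overlap a peak on the same chromosome and strand in the other. The sketch below is an illustrative assumption, not the procedure of any particular peak caller: peaks are represented as (chromosome, strand, start, end) tuples with half-open coordinates, and the function name is hypothetical.

```python
from collections import defaultdict


def intervals_overlap(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def reproducible_fraction(peaks_a, peaks_b):
    """Fraction of peaks in peaks_a overlapping any same-strand peak in peaks_b."""
    if not peaks_a:
        return 0.0
    # Index the second replicate by (chromosome, strand) to limit comparisons.
    index = defaultdict(list)
    for chrom, strand, start, end in peaks_b:
        index[(chrom, strand)].append((start, end))
    hits = 0
    for chrom, strand, start, end in peaks_a:
        candidates = index.get((chrom, strand), [])
        if any(intervals_overlap((start, end), other) for other in candidates):
            hits += 1
    return hits / len(peaks_a)


# Toy peak sets; coordinates are invented for illustration.
replicate_1 = [("chr1", "+", 100, 120), ("chr1", "+", 500, 520), ("chr2", "-", 40, 60)]
replicate_2 = [("chr1", "+", 110, 130), ("chr2", "+", 40, 60)]

print(reproducible_fraction(replicate_1, replicate_2))
print(reproducible_fraction(replicate_2, replicate_1))
```

The measure is asymmetric by design, so it is worth reporting in both directions. Stratifying it by RNA abundance would show how far reproducibility is driven by highly expressed transcripts.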

Data resources

Resources that provide CLIP data across studies are essential for compiling RBP interaction data and enabling comparisons across data sets. Raw sequencing data are made available upon publication through general public repositories such as the Sequence Read Archive165 or the European Nucleotide Archive, which enforce the tracking of metadata. However, full annotation of CLIP variants ideally requires annotation of additional metadata, as described in the previous section. Alignments of reads are provided as binary alignment map (BAM) files that can be visualized with tools such as the Integrative Genomics Viewer166. Specialized databases such as doRiNA167, ENCORI (previously known as starBase)168 and POSTAR2 (ref.169) enable the exploration of processed CLIP peaks, along with additional information such as annotation of corresponding genes and gene expression. doRiNA also allows users to upload their binding site data for visualization. A tool called SEQing has been developed to visualize Arabidopsis iCLIP binding sites170, again in the context of gene expression data. Databases of RBP binding motifs have started to emerge; CISBP-RNA171 summarizes data on in vitro RBP–RNA interactions and ATtRACT contains curated data from various sources172, albeit without resolving discrepancies in motifs that are inferred for the same protein from different types of experiment.

Limitations and optimizations

RBP-specific data analysis challenges

RBPs differ in many aspects that can influence data analysis and interpretation. Perhaps the clearest are the characteristics of the RNA binding motifs. Some RBPs, such as the Pumilio family of proteins, primarily bind long, well-defined motifs that overlap with sharp cross-linking peaks7, whereas others recognize short (often only two to four nucleotides long) degenerate motifs, which often occur in multivalent clusters to drive in vivo binding173. Binding peaks for such RBPs can be dispersed over long clusters of motifs, as exemplified by RBPs binding to long interspersed nuclear element (LINE)-derived RNA elements that contain enriched motifs dispersed over hundreds of nucleotides174. RBPs with limited sequence preferences, such as FUS or SUZ12, show even broader cross-linking distributions across nascent transcripts85,175. In such cases, technical biases such as uridine cross-linking preferences can have a stronger impact on the positioning of identified peaks, which should therefore be considered with caution. Thus, strategies to assign binding sites from CLIP data ideally need to be adjusted to the binding characteristics of each RBP, although approaches for doing so are yet to be developed.

Many RBPs interact with large RNPs, and their RNA interactions are often dominated by one or a few abundant non-coding RNAs, such as small nuclear RNA (snRNA) for the spliceosome and rRNA for the ribosome. Nevertheless, even such RBPs can have additional moonlighting functions, as has been seen for ribosomal proteins176. Thus, one needs to be cautious not to automatically assign secondary binding to background. Moreover, even though the standard immunoprecipitation conditions of CLIP are quite stringent, stable RNPs may not fully disassemble and, in such cases, the RBP partners generate considerable extrinsic background in the resulting data. Such RBPs tend to bind to similar RNAs and perform shared functions, and in some cases CLIP experiments were designed to intentionally profile the RNA interactome of many RBPs that are associated with specific stable RNPs; for example, Sm proteins are immunoprecipitated in ‘spliceosome iCLIP’ to yield the RNA interactome of multiple RBPs associated with various snRNAs, thus revealing their interaction sites on snRNAs and pre-mRNAs, as well as the positions of intronic branch points177.

Challenges of RNA-centric methods

RNA affinity capture methods

The development of RNA-centric methods that are based on RNA affinity capture has greatly expanded our knowledge of RBPs bound to specific RNAs. However, an inherent limitation of these methods is the potential loss of transient and compartment-specific interactions and the possibility of co-purifying post-lysis, false-positive interactions66. The choice of lysis buffer and lysis method, and the addition of aptamers, can change the secondary structure, the half-life of the RNA and, thereby, the protein binding pattern on the RNA178,179. These issues can be partly addressed by maintaining the post-lysis integrity of the RNP with formaldehyde or UV cross-linking, followed by either biotin-labelled antisense oligo RAP180, peptide nucleic acid (PNA)-assisted affinity purification181,182 or 2′-O-methylated antisense RNA-mediated tandem RNA isolation (TRIP)183.

Proximity-based methods

Proximity-based methods can overcome some limitations of affinity-based methods but have limitations of their own, such as the need for sufficient available lysine or other electron-rich amino acids on the protein surface for efficient biotinylation. Moreover, free proximity biotinylation enzyme can biotinylate proteins in a non-specific manner. Background biotinylation can be partially corrected when analysing the data in a cell-specific or tissue-specific way, and general contaminants can be filtered from the data set by referring to the CRAPome database184. Various experimental approaches aimed at improving the signal-to-noise ratio are discussed in a recent review57.

Another consideration when using proximity biotinylation enzymes is their labelling range (10–20 nm). The enzymes differ in their labelling range and substrates, and can be broadly grouped into peroxidases and biotin ligases185 (Supplementary Table 1). Biotin ligases convert biotin and ATP into biotinoyl-5′-adenylate (bioAMP), which diffuses around the activation site and covalently bonds with nearby lysine residues186. In vitro, the BirA–bioAMP complex has a half-life of ~30 min; therefore, biotinylation of substrates also depends on the activity and diffusion speed of this complex in the cell. The efficiency of different proximity ligases also depends on the specific redox environment and proximal nucleophile concentrations, which might explain why BioID and TurboID are effective when tagged with a nuclear localization sequence, a mitochondrial targeting sequence or endoplasmic reticulum-targeting sequences, whereas miniTurboID is more effective in an open cytosolic environment than in membrane-enclosed organelles187.

miniTurboID can be used at a lower temperature (20–37 °C) than BioID (37 °C) and BioID2, which has an optimal temperature of 50 °C (refs187,188). However, it is concerning that constitutive expression of TurboID in the absence of exogenous biotin leads to decreased size and viability in Drosophila melanogaster187 and that incubation times greater than 6 h or use of excess biotin (50 µM) may result in non-specific biotinylation in the cell187. Deletion of the N-terminal region was found to decrease the stability of miniTurboID in C. elegans187. Recently, with the help of enzyme reconstruction algorithms and residue replacements on optimized biotin ligases, a new BirA enzyme, AirID (ancestral BirA for proximity-dependent biotin identification), has been developed and found to be less toxic than TurboID in HEK293 cells189.

Analysing RNA binding sites

Extracting RNA interaction parameters from CLIP data and interpreting the potential functions of these interactions can be challenging, and is an area of intense research. Defining cross-linking peaks of high occupancy is important; however, such peaks should not be directly equated to functionally relevant binding sites. Even though CLIP tends to detect binding events with high specificity, the functionality of these events depends on additional factors, such as the binding position relative to other functional elements and the total residence time of the protein173. Recently, femtosecond UV laser cross-linking followed by CLIP (KIN-CLIP) was shown to be capable of characterizing in vivo binding kinetics at individual sites and thus revealing the increased functionality of sites that are composed of clusters of motifs77, in agreement with insights from the studies of RNA maps111,190.

The assignment of RNA binding sites can be improved by combining CLIP data with analysis of RNA sequences and structural motifs99. Further indication of the functional relevance of binding sites can be obtained by assessing their evolutionary conservation. However, many RNA sequences are not strongly conserved; for example, although the length and arrangement of lncRNAs and introns are under considerable evolutionary constraint, their sequences show weak conservation across species and rapid accumulation of repetitive elements, indicating weak functional constraint191. Nevertheless, even intronic repetitive elements can contain high-affinity binding sites that are under some selection, as demonstrated by the observation that many RBPs repress the inclusion of cryptic exons that are often present in these elements192.

To discern functionally relevant sites, it is valuable to integrate CLIP data with orthogonal transcriptomics data from RBP perturbation experiments5,7,190,193. On the one hand, such integration identifies CLIP peaks that likely mediate the regulation of specific elements, and, on the other, it distinguishes the RNAs detected by RNA-seq that are directly regulated by the RBP from those that likely change owing to off-target effects of RBP perturbation, feedback loops via other RBPs or other types of cellular compensation. When analysis leads to sensitive and specific positional patterns observed by an RNA map, it also provides a valuable measure of the quality of CLIP and RNA-seq data that are being integrated9. In addition to integration with RNA-seq for studies of RNA processing, CLIP-derived binding sites have also been integrated with additional types of orthogonal data sets to study 3′ end RNA processing6,194, RNA methylation14, stability7,164, translation80,136 and localization195,196.
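The positional integration that underlies an RNA map can be sketched as follows: cross-link positions from CLIP are counted at each offset around a landmark, such as the 3′ splice site of exons that change upon RBP perturbation. This is a simplified illustration under stated assumptions; the function name and the toy coordinates are hypothetical, and a full RNA map would also normalize by the number of exons and compare regulated exons with unregulated controls.

```python
def positional_profile(crosslinks, landmarks, flank=50):
    """Count cross-links at each offset from -flank to +flank around landmarks.

    crosslinks: iterable of (chromosome, strand, position)
    landmarks:  iterable of (chromosome, strand, position), e.g. 3' splice sites
    Offsets are in transcript orientation, so negative values are upstream.
    """
    profile = [0] * (2 * flank + 1)
    # Group cross-link positions by chromosome and strand.
    sites = {}
    for chrom, strand, pos in crosslinks:
        sites.setdefault((chrom, strand), []).append(pos)
    for chrom, strand, anchor in landmarks:
        for pos in sites.get((chrom, strand), []):
            # On the minus strand, larger genomic coordinates lie upstream.
            offset = pos - anchor if strand == "+" else anchor - pos
            if -flank <= offset <= flank:
                profile[offset + flank] += 1
    return profile


# Toy data: invented cross-link sites and splice sites of regulated exons.
crosslinks = [("chr1", "+", 95), ("chr1", "+", 98), ("chr1", "-", 210), ("chr1", "+", 400)]
regulated_exons = [("chr1", "+", 100), ("chr1", "-", 205)]

profile = positional_profile(crosslinks, regulated_exons, flank=10)
for offset, count in zip(range(-10, 11), profile):
    if count:
        print(offset, count)
```

Running the same function separately on enhanced, repressed and control exons gives the per-class profiles whose comparison reveals position-dependent regulation.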

Outlook

There is no one-size-fits-all guideline for the design and analysis of CLIP experiments. It is important to be aware of the steps that can be taken for quality control and optimization in order to tailor the experimental and computational steps according to the RBP that is studied, the input material and the type of questions that are asked.

We expect many new applications of CLIP to be developed in coming years, with increasing integration of CLIP with data from methods based on enzymatic tagging and RNA-centric approaches. These complementary methods have not yet been used in combination, but we hope that this Primer will encourage their integrative use. Cross-method comparisons will be valuable to better understand the advantages of each method and correct for technical biases. Integration of CLIP data that detect direct protein–RNA interactions with approaches that also detect RNA-proximal proteins will help to understand which proteins are recruited to RNAs primarily through direct recognition of specific RNA elements versus protein–protein interactions with other RBPs. Another valuable application will be to study specific RBPs in subcellular compartments with complementary methods to provide insights into the assembly properties of RBPs at organelles or biomolecular condensates161. For example, such methods could be applied to chloroplasts, which rely heavily on post-transcriptional mechanisms for controlling the expression of their genome197.

Important questions in RNP remodelling and combinatorial assembly can be answered when CLIP and complementary methods are used under comparative scenarios. For example, CLIP of one RBP from cells lacking another RBP can reveal how individual RBPs compete for binding to overlapping sites113 or how larger RNPs compete, such as how the exon junction complex blocks access of the splicing machinery to regions around exon–exon junctions in spliced RNAs146. The competitive and combinatorial assembly principles can be further unravelled using ‘in vitro CLIP’ experiments, in which recombinant RBPs with varying concentrations are incubated with long transcripts, followed by modelling and machine learning198. Moreover, CLIP can be performed with purified RNPs in specific states, for example to define helicase–RNA contacts in specific spliceosomal states by purified spliceosome iCLIP (psiCLIP)199. A long-term challenge will be to understand how RNA regulatory networks are remodelled on various timescales, for example during cellular signal response, development, ageing, mutation-driven changes in cancer and other diseases, and over the course of organismal evolution. These questions are starting to be addressed by studies across species or in response to disease mutations27,200. It will be important to understand how variations in IDRs, which tend to evolve faster than structured domains and are hotspots of disease-causing mutations and post-translational modifications1, might affect the RNA binding and regulatory functions of RBPs.

Two emerging applications of transcriptomic techniques not covered in this Primer are mapping of RNA structure and RNA modifications genome-wide, as these topics have been comprehensively covered elsewhere12,201,202,203. Integration of protein–RNA interactions with information on RNA structure and RNA–RNA spatial interactions will help understand the roles of RNA molecules in organizing RNP assembly12,43,203,204,205. Recently, an RNA pull-down method was used to identify proteins bound to 186 RNA structures conserved across yeast species206. This approach enables the study of dozens of short RNA fragments to uncover RBPs that tend to bind similar RNA structures or other types of similar RNA motifs from a group of RNAs, offering a valuable complement to the RNA-centric or global RNA interactome approaches.

More than 100 RNA modifications have been described; most affect the assembly of protein–RNA complexes and therefore should be integrated into studies of protein–RNA interactions. Interestingly, mutations of certain methyltransferases can stabilize covalently linked protein–RNA catalytic intermediates, thus enabling CLIP to be performed without the need for UV cross-linking, as has been done for m5C-miCLIP207. Most methods to date have been developed for transcriptomic studies of m6A, the type of modification that is most common in mRNAs, and these include variants of CLIP, such as m6A-miCLIP, which employ antibodies that recognize m6A-containing RNA208. The success of such approaches critically depends on the quality of the antibodies recognizing the modification209. Therefore, similar to studies of protein–RNA interactions, integration of data from complementary methods will be valuable to gain a full picture of RNA modifications and their roles in RNP assembly202,210.

We expect computational methods for site and motif identification to soon reach maturity, leading to high-quality databases of in vivo RBP binding motifs. As most of the computational methods work with uniquely mapping reads, improvements are foreseen in the quantification of sites located in repeat elements as well as at exon–exon boundaries or in splicing and polyadenylation isoforms. Ultimately, we can start to consider what to do next with information on all of the protein–RNA interaction sites; for example, we could construct whole-cell models to predict RNA fates and their roles in cellular changes during development and disease. The path taken towards this ultimate aim will require integration of complementary data sets to gain understanding of the full RNP assembled on each transcript, its spatial dynamics as the transcript moves through the cell and temporal dynamics in response to post-translational protein modifications, RNA methylation and RNA structural switches. As such, RNPs will surely continue to teach us about the highly interconnected and ever-changing world of living cells.